0:00:01.560,0:00:06.758
Today we're gonna talk about spelling[br]correction. Lots of applications make use

0:00:06.758,0:00:11.628
of spelling correction. For example, word[br]processing, almost any modern word

0:00:11.628,0:00:16.630
processor will take a misspelled word like[br]component with an A and give you

0:00:16.630,0:00:22.077
suggestions like component with an E and[br]automatically replace it for you. Modern

0:00:22.077,0:00:28.259
search engines will not only flag an[br]error. So, language spelled without a u,

0:00:28.259,0:00:34.936
here. But, give you, the results, as if[br]you had spelled the word correctly. And,

0:00:34.936,0:00:40.953
modern phones additionally will[br]automatically find misspelled words. Here,

0:00:40.953,0:00:47.135
I typed l-a-y-r, and it replaced it[br]automatically, or suggests a replacement,

0:00:47.135,0:00:52.260
with late. We can distinguish a number of[br]separate tasks and spelling correction.

0:00:52.260,0:00:56.921
One is the detection of the error itself.[br]And then the correction of the error once

0:00:56.921,0:01:01.301
you've found it. And we can think about[br]different kinds of correction. We might

0:01:01.301,0:01:06.018
automatically correct an error if we're[br]positive that the error that we know the

0:01:06.018,0:01:10.510
right answer for the error. So H-T-E is a[br]very common misspelling for the, and so

0:01:10.679,0:01:14.891
many word processors automatically correct[br]H-T-E. We might suggest a single

0:01:14.891,0:01:19.495
correction if we're, there's only one very[br]likely correction, or we might suggest a

0:01:19.495,0:01:24.274
whole list of corrections and let the user[br]pick from among them. We distinguish two

0:01:24.274,0:01:30.657
different classes of spelling errors. Non[br]word errors are errors in which the, what

0:01:30.657,0:01:37.213
the user types is not a word of English.[br]So g-r-a-f-f-e a misspelling let's say for

0:01:37.213,0:01:43.510
giraffe is not a word of English. By[br]contrast, real word errors. Are errors in

0:01:43.510,0:01:49.587
which then the resulting. [sound][br]Misspelling is actually a word of English

0:01:49.587,0:01:54.867
and that makes them somewhat harder to[br]detect. And we can break up real word

0:01:54.867,0:02:00.217
errors into ones produced by really[br]typographical processes. These were meant

0:02:00.217,0:02:06.132
to type three. And typed [inaudible] let's[br]say. Or cognitive errors, where the user,

0:02:06.343,0:02:11.992
meant to type a word like [inaudible] and[br]instead typed a homophone of a, of the

0:02:11.992,0:02:16.369
word, or \u201ct-o-o\u201d instead of[br][inaudible] And in both cases what, what's

0:02:16.369,0:02:22.088
produced is a real word of English, but by[br]modeling the differences between these

0:02:22.088,0:02:27.453
kind of errors, we might come up with[br]better ways of fixing them both. How

0:02:27.453,0:02:33.523
common are spelling errors? Depends a lot[br]on the task. So in web queries, spelling

0:02:33.523,0:02:38.951
errors are extremely common. So[br]practically one in four words in a web

0:02:38.951,0:02:44.218
query are likely to be misspelled. But in[br]web processing tasks on phones it's much

0:02:44.218,0:02:48.669
harder to get an accurate number. So[br]there's been a number of studies and most

0:02:48.669,0:02:53.404
of these studies are done by retyping. You[br]give the user a passage to type and then

0:02:53.404,0:02:57.912
you measure how well they, they type it.[br]And, of course, that's not quite the same

0:02:57.912,0:03:02.591
user's naturally writing messages or[br]typing. Nonetheless If you ask users to

0:03:02.591,0:03:06.984
retype and you don't let them use the[br]backspace key, they make about thirteen

0:03:06.984,0:03:10.806
percent of the words, thirteen percent of[br]the words are in error. So indicating that

0:03:10.806,0:03:16.028
if, that a lot of words. They correct[br]themselves with the backspace. If you let

0:03:16.028,0:03:21.220
them correct, now we're trying to[br]experiment on a, on a p d a style phone

0:03:21.220,0:03:25.858
site, organizer, they'll correct about[br]seven percent of the words themselves.

0:03:25.858,0:03:30.842
They'll still leave about two percent of[br]the words uncorrected, on the organizer.

0:03:30.842,0:03:35.964
And, similar numbers on people doing[br]retyping on a regular keyboard. So, the

0:03:35.964,0:03:41.261
numbers are about two percent where people[br]typing. And probably a much higher number

0:03:41.261,0:03:46.280
for web queries and probably a much higher[br]number for people texting. Are the kind of

0:03:46.280,0:03:51.210
spelling, spelling error [inaudible] that[br]we see. How do we detect non word spelling

0:03:51.210,0:03:56.022
errors. The traditional way is just to use[br]a large dictionary. Any word not in the

0:03:56.022,0:04:00.596
dictionary is an error. And, the larger[br]the dictionary, it turns out the better

0:04:00.596,0:04:04.705
this works. For correcting these non-word[br]spelling errors, we generate a set of

0:04:04.705,0:04:08.624
candidates that's real words that are[br]similar to the error. And then we pick

0:04:08.624,0:04:12.852
whichever one is best. And we'll talk[br]about the noisy-channel probability model

0:04:12.852,0:04:17.029
of how to do that. And it's also related[br]to another method called the shortest

0:04:17.029,0:04:20.948
weighted [inaudible] distance myth. So we[br]find the words that are not in the

0:04:20.948,0:04:25.125
dictionary. For each one, we generate a[br]set of candidates. Those are going to be

0:04:25.125,0:04:29.148
real words that are similar, we'll talk[br]about what similar means, to that error

0:04:29.148,0:04:33.448
and then we'll pick the best one. For real[br]word spelling errors, the algorithm is

0:04:33.448,0:04:37.650
quite similar. Again, for each word we[br]generate a candidate set. But now we do

0:04:37.650,0:04:41.742
this for every word in a sentence, not[br]just the words that are not in some

0:04:41.742,0:04:45.944
dictionary. So real word spelling error[br]correction, we don't use a dictionary

0:04:45.944,0:04:50.245
because of course the errors are in a[br]dictionary. So that wouldn't help. So, for

0:04:50.245,0:04:54.381
every word, we generate a candidate set.[br]So we might find candidate words with

0:04:54.381,0:04:58.463
similar pronunciations, we might find[br]candidate words with similar spellings,

0:04:58.463,0:05:02.760
and depending on the algorithm, exactly.[br]And it's very important that we're gonna

0:05:02.760,0:05:07.164
include the word itself, in the candidate[br]set, because the every word might be a

0:05:07.164,0:05:11.515
misspelling of some other real word, or it[br]might be the correct word. In fact, most

0:05:11.515,0:05:15.597
words are probably correct. So, for each[br]candidate set of each possible error,

0:05:15.597,0:05:19.732
we're gonna include the word itself. And[br]most of the time, in fact, we're gonna

0:05:19.732,0:05:25.644
pick that. And again, how we pick the[br]words we might use the noisy channel

0:05:25.644,0:05:32.240
model. We might use a classifier, we'll[br]talk about that so we'll discuss the

0:05:32.240,0:05:38.428
different methods of detecting these[br]errors and correcting them in the next