0:00:01.560,0:00:06.758 Today we're gonna talk about spelling[br]correction. Lots of applications make use 0:00:06.758,0:00:11.628 of spelling correction. For example, word[br]processing, almost any modern word 0:00:11.628,0:00:16.630 processor will take a misspelled word like[br]component with an A and give you 0:00:16.630,0:00:22.077 suggestions like component with an E and[br]automatically replace it for you. Modern 0:00:22.077,0:00:28.259 search engines will not only flag an[br]error. So, language spelled without a u, 0:00:28.259,0:00:34.936 here. But, give you, the results, as if[br]you had spelled the word correctly. And, 0:00:34.936,0:00:40.953 modern phones additionally will[br]automatically find misspelled words. Here, 0:00:40.953,0:00:47.135 I typed l-a-y-r, and it replaced it[br]automatically, or suggests a replacement, 0:00:47.135,0:00:52.260 with late. We can distinguish a number of[br]separate tasks and spelling correction. 0:00:52.260,0:00:56.921 One is the detection of the error itself.[br]And then the correction of the error once 0:00:56.921,0:01:01.301 you've found it. And we can think about[br]different kinds of correction. We might 0:01:01.301,0:01:06.018 automatically correct an error if we're[br]positive that the error that we know the 0:01:06.018,0:01:10.510 right answer for the error. So H-T-E is a[br]very common misspelling for the, and so 0:01:10.679,0:01:14.891 many word processors automatically correct[br]H-T-E. We might suggest a single 0:01:14.891,0:01:19.495 correction if we're, there's only one very[br]likely correction, or we might suggest a 0:01:19.495,0:01:24.274 whole list of corrections and let the user[br]pick from among them. We distinguish two 0:01:24.274,0:01:30.657 different classes of spelling errors. Non[br]word errors are errors in which the, what 0:01:30.657,0:01:37.213 the user types is not a word of English.[br]So g-r-a-f-f-e a misspelling let's say for 0:01:37.213,0:01:43.510 giraffe is not a word of English. By[br]contrast, real word errors. Are errors in 0:01:43.510,0:01:49.587 which then the resulting. [sound][br]Misspelling is actually a word of English 0:01:49.587,0:01:54.867 and that makes them somewhat harder to[br]detect. And we can break up real word 0:01:54.867,0:02:00.217 errors into ones produced by really[br]typographical processes. These were meant 0:02:00.217,0:02:06.132 to type three. And typed [inaudible] let's[br]say. Or cognitive errors, where the user, 0:02:06.343,0:02:11.992 meant to type a word like [inaudible] and[br]instead typed a homophone of a, of the 0:02:11.992,0:02:16.369 word, or \u201ct-o-o\u201d instead of[br][inaudible] And in both cases what, what's 0:02:16.369,0:02:22.088 produced is a real word of English, but by[br]modeling the differences between these 0:02:22.088,0:02:27.453 kind of errors, we might come up with[br]better ways of fixing them both. How 0:02:27.453,0:02:33.523 common are spelling errors? Depends a lot[br]on the task. So in web queries, spelling 0:02:33.523,0:02:38.951 errors are extremely common. So[br]practically one in four words in a web 0:02:38.951,0:02:44.218 query are likely to be misspelled. But in[br]web processing tasks on phones it's much 0:02:44.218,0:02:48.669 harder to get an accurate number. So[br]there's been a number of studies and most 0:02:48.669,0:02:53.404 of these studies are done by retyping. You[br]give the user a passage to type and then 0:02:53.404,0:02:57.912 you measure how well they, they type it.[br]And, of course, that's not quite the same 0:02:57.912,0:03:02.591 user's naturally writing messages or[br]typing. Nonetheless If you ask users to 0:03:02.591,0:03:06.984 retype and you don't let them use the[br]backspace key, they make about thirteen 0:03:06.984,0:03:10.806 percent of the words, thirteen percent of[br]the words are in error. So indicating that 0:03:10.806,0:03:16.028 if, that a lot of words. They correct[br]themselves with the backspace. If you let 0:03:16.028,0:03:21.220 them correct, now we're trying to[br]experiment on a, on a p d a style phone 0:03:21.220,0:03:25.858 site, organizer, they'll correct about[br]seven percent of the words themselves. 0:03:25.858,0:03:30.842 They'll still leave about two percent of[br]the words uncorrected, on the organizer. 0:03:30.842,0:03:35.964 And, similar numbers on people doing[br]retyping on a regular keyboard. So, the 0:03:35.964,0:03:41.261 numbers are about two percent where people[br]typing. And probably a much higher number 0:03:41.261,0:03:46.280 for web queries and probably a much higher[br]number for people texting. Are the kind of 0:03:46.280,0:03:51.210 spelling, spelling error [inaudible] that[br]we see. How do we detect non word spelling 0:03:51.210,0:03:56.022 errors. The traditional way is just to use[br]a large dictionary. Any word not in the 0:03:56.022,0:04:00.596 dictionary is an error. And, the larger[br]the dictionary, it turns out the better 0:04:00.596,0:04:04.705 this works. For correcting these non-word[br]spelling errors, we generate a set of 0:04:04.705,0:04:08.624 candidates that's real words that are[br]similar to the error. And then we pick 0:04:08.624,0:04:12.852 whichever one is best. And we'll talk[br]about the noisy-channel probability model 0:04:12.852,0:04:17.029 of how to do that. And it's also related[br]to another method called the shortest 0:04:17.029,0:04:20.948 weighted [inaudible] distance myth. So we[br]find the words that are not in the 0:04:20.948,0:04:25.125 dictionary. For each one, we generate a[br]set of candidates. Those are going to be 0:04:25.125,0:04:29.148 real words that are similar, we'll talk[br]about what similar means, to that error 0:04:29.148,0:04:33.448 and then we'll pick the best one. For real[br]word spelling errors, the algorithm is 0:04:33.448,0:04:37.650 quite similar. Again, for each word we[br]generate a candidate set. But now we do 0:04:37.650,0:04:41.742 this for every word in a sentence, not[br]just the words that are not in some 0:04:41.742,0:04:45.944 dictionary. So real word spelling error[br]correction, we don't use a dictionary 0:04:45.944,0:04:50.245 because of course the errors are in a[br]dictionary. So that wouldn't help. So, for 0:04:50.245,0:04:54.381 every word, we generate a candidate set.[br]So we might find candidate words with 0:04:54.381,0:04:58.463 similar pronunciations, we might find[br]candidate words with similar spellings, 0:04:58.463,0:05:02.760 and depending on the algorithm, exactly.[br]And it's very important that we're gonna 0:05:02.760,0:05:07.164 include the word itself, in the candidate[br]set, because the every word might be a 0:05:07.164,0:05:11.515 misspelling of some other real word, or it[br]might be the correct word. In fact, most 0:05:11.515,0:05:15.597 words are probably correct. So, for each[br]candidate set of each possible error, 0:05:15.597,0:05:19.732 we're gonna include the word itself. And[br]most of the time, in fact, we're gonna 0:05:19.732,0:05:25.644 pick that. And again, how we pick the[br]words we might use the noisy channel 0:05:25.644,0:05:32.240 model. We might use a classifier, we'll[br]talk about that so we'll discuss the 0:05:32.240,0:05:38.428 different methods of detecting these[br]errors and correcting them in the next