1
00:00:01,560 --> 00:00:06,758
Today we're gonna talk about spelling
correction. Lots of applications make use

2
00:00:06,758 --> 00:00:11,628
of spelling correction. For example, word
processing, almost any modern word

3
00:00:11,628 --> 00:00:16,630
processor will take a misspelled word like
component with an A and give you

4
00:00:16,630 --> 00:00:22,077
suggestions like component with an E and
automatically replace it for you. Modern

5
00:00:22,077 --> 00:00:28,259
search engines will not only flag an
error. So, language spelled without a u,

6
00:00:28,259 --> 00:00:34,936
here. But, give you, the results, as if
you had spelled the word correctly. And,

7
00:00:34,936 --> 00:00:40,953
modern phones additionally will
automatically find misspelled words. Here,

8
00:00:40,953 --> 00:00:47,135
I typed l-a-y-r, and it replaced it
automatically, or suggests a replacement,

9
00:00:47,135 --> 00:00:52,260
with late. We can distinguish a number of
separate tasks and spelling correction.

10
00:00:52,260 --> 00:00:56,921
One is the detection of the error itself.
And then the correction of the error once

11
00:00:56,921 --> 00:01:01,301
you've found it. And we can think about
different kinds of correction. We might

12
00:01:01,301 --> 00:01:06,018
automatically correct an error if we're
positive that the error that we know the

13
00:01:06,018 --> 00:01:10,510
right answer for the error. So H-T-E is a
very common misspelling for the, and so

14
00:01:10,679 --> 00:01:14,891
many word processors automatically correct
H-T-E. We might suggest a single

15
00:01:14,891 --> 00:01:19,495
correction if we're, there's only one very
likely correction, or we might suggest a

16
00:01:19,495 --> 00:01:24,274
whole list of corrections and let the user
pick from among them. We distinguish two

17
00:01:24,274 --> 00:01:30,657
different classes of spelling errors. Non
word errors are errors in which the, what

18
00:01:30,657 --> 00:01:37,213
the user types is not a word of English.
So g-r-a-f-f-e a misspelling let's say for

19
00:01:37,213 --> 00:01:43,510
giraffe is not a word of English. By
contrast, real word errors. Are errors in

20
00:01:43,510 --> 00:01:49,587
which then the resulting. [sound]
Misspelling is actually a word of English

21
00:01:49,587 --> 00:01:54,867
and that makes them somewhat harder to
detect. And we can break up real word

22
00:01:54,867 --> 00:02:00,217
errors into ones produced by really
typographical processes. These were meant

23
00:02:00,217 --> 00:02:06,132
to type three. And typed [inaudible] let's
say. Or cognitive errors, where the user,

24
00:02:06,343 --> 00:02:11,992
meant to type a word like [inaudible] and
instead typed a homophone of a, of the

25
00:02:11,992 --> 00:02:16,369
word, or \u201ct-o-o\u201d instead of
[inaudible] And in both cases what, what's

26
00:02:16,369 --> 00:02:22,088
produced is a real word of English, but by
modeling the differences between these

27
00:02:22,088 --> 00:02:27,453
kind of errors, we might come up with
better ways of fixing them both. How

28
00:02:27,453 --> 00:02:33,523
common are spelling errors? Depends a lot
on the task. So in web queries, spelling

29
00:02:33,523 --> 00:02:38,951
errors are extremely common. So
practically one in four words in a web

30
00:02:38,951 --> 00:02:44,218
query are likely to be misspelled. But in
web processing tasks on phones it's much

31
00:02:44,218 --> 00:02:48,669
harder to get an accurate number. So
there's been a number of studies and most

32
00:02:48,669 --> 00:02:53,404
of these studies are done by retyping. You
give the user a passage to type and then

33
00:02:53,404 --> 00:02:57,912
you measure how well they, they type it.
And, of course, that's not quite the same

34
00:02:57,912 --> 00:03:02,591
user's naturally writing messages or
typing. Nonetheless If you ask users to

35
00:03:02,591 --> 00:03:06,984
retype and you don't let them use the
backspace key, they make about thirteen

36
00:03:06,984 --> 00:03:10,806
percent of the words, thirteen percent of
the words are in error. So indicating that

37
00:03:10,806 --> 00:03:16,028
if, that a lot of words. They correct
themselves with the backspace. If you let

38
00:03:16,028 --> 00:03:21,220
them correct, now we're trying to
experiment on a, on a p d a style phone

39
00:03:21,220 --> 00:03:25,858
site, organizer, they'll correct about
seven percent of the words themselves.

40
00:03:25,858 --> 00:03:30,842
They'll still leave about two percent of
the words uncorrected, on the organizer.

41
00:03:30,842 --> 00:03:35,964
And, similar numbers on people doing
retyping on a regular keyboard. So, the

42
00:03:35,964 --> 00:03:41,261
numbers are about two percent where people
typing. And probably a much higher number

43
00:03:41,261 --> 00:03:46,280
for web queries and probably a much higher
number for people texting. Are the kind of

44
00:03:46,280 --> 00:03:51,210
spelling, spelling error [inaudible] that
we see. How do we detect non word spelling

45
00:03:51,210 --> 00:03:56,022
errors. The traditional way is just to use
a large dictionary. Any word not in the

46
00:03:56,022 --> 00:04:00,596
dictionary is an error. And, the larger
the dictionary, it turns out the better

47
00:04:00,596 --> 00:04:04,705
this works. For correcting these non-word
spelling errors, we generate a set of

48
00:04:04,705 --> 00:04:08,624
candidates that's real words that are
similar to the error. And then we pick

49
00:04:08,624 --> 00:04:12,852
whichever one is best. And we'll talk
about the noisy-channel probability model

50
00:04:12,852 --> 00:04:17,029
of how to do that. And it's also related
to another method called the shortest

51
00:04:17,029 --> 00:04:20,948
weighted [inaudible] distance myth. So we
find the words that are not in the

52
00:04:20,948 --> 00:04:25,125
dictionary. For each one, we generate a
set of candidates. Those are going to be

53
00:04:25,125 --> 00:04:29,148
real words that are similar, we'll talk
about what similar means, to that error

54
00:04:29,148 --> 00:04:33,448
and then we'll pick the best one. For real
word spelling errors, the algorithm is

55
00:04:33,448 --> 00:04:37,650
quite similar. Again, for each word we
generate a candidate set. But now we do

56
00:04:37,650 --> 00:04:41,742
this for every word in a sentence, not
just the words that are not in some

57
00:04:41,742 --> 00:04:45,944
dictionary. So real word spelling error
correction, we don't use a dictionary

58
00:04:45,944 --> 00:04:50,245
because of course the errors are in a
dictionary. So that wouldn't help. So, for

59
00:04:50,245 --> 00:04:54,381
every word, we generate a candidate set.
So we might find candidate words with

60
00:04:54,381 --> 00:04:58,463
similar pronunciations, we might find
candidate words with similar spellings,

61
00:04:58,463 --> 00:05:02,760
and depending on the algorithm, exactly.
And it's very important that we're gonna

62
00:05:02,760 --> 00:05:07,164
include the word itself, in the candidate
set, because the every word might be a

63
00:05:07,164 --> 00:05:11,515
misspelling of some other real word, or it
might be the correct word. In fact, most

64
00:05:11,515 --> 00:05:15,597
words are probably correct. So, for each
candidate set of each possible error,

65
00:05:15,597 --> 00:05:19,732
we're gonna include the word itself. And
most of the time, in fact, we're gonna

66
00:05:19,732 --> 00:05:25,644
pick that. And again, how we pick the
words we might use the noisy channel

67
00:05:25,644 --> 00:05:32,240
model. We might use a classifier, we'll
talk about that so we'll discuss the

68
00:05:32,240 --> 00:05:38,428
different methods of detecting these
errors and correcting them in the next