The Spelling Correction Task

0:02 - 0:07

Today we're gonna talk about spelling
correction. Lots of applications make use
0:07 - 0:12

of spelling correction. For example, word
processing, almost any modern word
0:12 - 0:17

processor will take a misspelled word like
component with an A and give you
0:17 - 0:22

suggestions like component with an E and
automatically replace it for you. Modern
0:22 - 0:28

search engines will not only flag an
error (here, "language" spelled without a u)
0:28 - 0:35

but also give you the results as if
you had spelled the word correctly. And
0:35 - 0:41

modern phones additionally will
automatically find misspelled words. Here,
0:41 - 0:47

I typed l-a-y-r, and it replaced it
automatically, or suggested a replacement,
0:47 - 0:52

with late. We can distinguish a number of
separate tasks in spelling correction.
0:52 - 0:57

One is the detection of the error itself.
And then the correction of the error once
0:57 - 1:01

you've found it. And we can think about
different kinds of correction. We might
1:01 - 1:06

automatically correct an error if we're
positive that we know the
1:06 - 1:11

right answer for the error. So H-T-E is a
very common misspelling of "the", and so
1:11 - 1:15

many word processors automatically correct
H-T-E. We might suggest a single
1:15 - 1:19

correction if there's only one very
likely correction, or we might suggest a
1:19 - 1:24

whole list of corrections and let the user
pick from among them. We distinguish two
1:24 - 1:31

different classes of spelling errors. Non-word
errors are errors in which what
1:31 - 1:37

the user types is not a word of English.
So g-r-a-f-f-e, a misspelling, let's say, of
1:37 - 1:44

"giraffe", is not a word of English. By
contrast, real-word errors are errors in
1:44 - 1:50

which the resulting
misspelling is actually a word of English
1:50 - 1:55

and that makes them somewhat harder to
detect. And we can break up real-word
1:55 - 2:00

errors into ones produced by really
typographical processes, where the user meant
2:00 - 2:06

to type "three" and typed [inaudible], let's
say, and cognitive errors, where the user
2:06 - 2:12

meant to type a word like [inaudible] and
instead typed a homophone of the
2:12 - 2:16

word, or "t-o-o" instead of
[inaudible]. And in both cases, what's
2:16 - 2:22

produced is a real word of English, but by
modeling the differences between these
2:22 - 2:27

kinds of errors, we might come up with
better ways of fixing them both. How
2:27 - 2:34

common are spelling errors? Depends a lot
on the task. So in web queries, spelling
2:34 - 2:39

errors are extremely common. So
practically one in four words in a web
2:39 - 2:44

query is likely to be misspelled. But in
word processing tasks on phones it's much
2:44 - 2:49

harder to get an accurate number. So
there have been a number of studies and most
2:49 - 2:53

of these studies are done by retyping. You
give the user a passage to type and then
2:53 - 2:58

you measure how well they type it.
And, of course, that's not quite the same
2:58 - 3:03

as users naturally writing messages or
typing. Nonetheless, if you ask users to
3:03 - 3:07

retype and you don't let them use the
backspace key, about thirteen
3:07 - 3:11

percent of the words are in error,
indicating that they correct
3:11 - 3:16

a lot of words
themselves with the backspace. If you let
3:16 - 3:21

them correct, now in an
experiment on a PDA-style phone-sized
3:21 - 3:26

organizer, they'll correct about
seven percent of the words themselves.
3:26 - 3:31

They'll still leave about two percent of
the words uncorrected, on the organizer.
3:31 - 3:36

And there are similar numbers for people doing
retyping on a regular keyboard. So the
3:36 - 3:41

numbers are about two percent for people
typing, and probably a much higher number
3:41 - 3:46

for web queries and probably a much higher
number for people texting. Those are the kinds of
3:46 - 3:51

spelling error [inaudible] that
we see. How do we detect non-word spelling
3:51 - 3:56

errors? The traditional way is just to use
a large dictionary. Any word not in the
3:56 - 4:01

dictionary is an error. And the larger
the dictionary, it turns out, the better
4:01 - 4:05

this works. For correcting these non-word
spelling errors, we generate a set of
4:05 - 4:09

candidates: real words that are
similar to the error. And then we pick
4:09 - 4:13

whichever one is best. And we'll talk
about the noisy-channel probability model
4:13 - 4:17

of how to do that. And it's also related
to another method called the shortest
4:17 - 4:21

weighted edit distance method. So we
find the words that are not in the
4:21 - 4:25

dictionary. For each one, we generate a
set of candidates. Those are going to be
4:25 - 4:29

real words that are similar, we'll talk
about what similar means, to that error
4:29 - 4:33

and then we'll pick the best one. For real
word spelling errors, the algorithm is
4:33 - 4:38

quite similar. Again, for each word we
generate a candidate set. But now we do
4:38 - 4:42

this for every word in a sentence, not
just the words that are not in some
4:42 - 4:46

dictionary. So for real-word spelling error
correction, we don't use a dictionary
4:46 - 4:50

because of course the errors are in a
dictionary. So that wouldn't help. So, for
4:50 - 4:54

every word, we generate a candidate set.
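
A minimal Python sketch of this per-word candidate generation for real-word errors. The vocabulary, the homophone table, and the function names are hypothetical toy values chosen for illustration; a real system would use a large lexicon and a pronunciation dictionary. Each candidate set contains the word itself, since most words are probably correct.

```python
# Toy vocabulary and homophone table: assumptions for illustration only.
VOCAB = {"from", "form", "too", "two", "to", "piece", "peace"}
HOMOPHONES = {
    "too": {"two", "to"}, "two": {"too", "to"}, "to": {"too", "two"},
    "piece": {"peace"}, "peace": {"piece"},
}

def one_edit_apart(a, b):
    """True if b is one insertion, deletion, substitution,
    or adjacent transposition away from a."""
    if a == b:
        return False
    if len(a) == len(b):
        diff = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diff) == 1:          # single substitution
            return True
        return (len(diff) == 2 and diff[1] == diff[0] + 1
                and a[diff[0]] == b[diff[1]]
                and a[diff[1]] == b[diff[0]])  # adjacent transposition
    if abs(len(a) - len(b)) == 1:   # single insertion or deletion
        short, long_ = (a, b) if len(a) < len(b) else (b, a)
        return any(long_[:i] + long_[i + 1:] == short
                   for i in range(len(long_)))
    return False

def candidate_set(word):
    """Candidates for one word: the word itself, similarly spelled
    vocabulary words, and words with a similar pronunciation."""
    cands = {word}  # the typed word may well be the intended word
    cands |= {v for v in VOCAB if one_edit_apart(word, v)}
    cands |= HOMOPHONES.get(word, set())
    return cands

def candidate_sets(sentence):
    """Generate a candidate set for every word in the sentence."""
    return [(w, candidate_set(w)) for w in sentence.lower().split()]
```

Choosing among the candidates (for example with a noisy channel model or a classifier) is a separate step that this sketch does not cover.
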
So we might find candidate words with
4:54 - 4:58

similar pronunciations, we might find
candidate words with similar spellings,
4:58 - 5:03

and so on, depending on the exact algorithm.
And it's very important that we
5:03 - 5:07

include the word itself in the candidate
set, because every word might be a
5:07 - 5:12

misspelling of some other real word, or it
might be the correct word. In fact, most
5:12 - 5:16

words are probably correct. So, for each
candidate set of each possible error,
5:16 - 5:20

we're gonna include the word itself. And
most of the time, in fact, we're gonna
5:20 - 5:26

pick that. And again, for how we pick the
words, we might use the noisy channel
5:26 - 5:32

model, or we might use a classifier. We'll
talk about that: we'll discuss the
5:32 - 5:38

different methods of detecting these
errors and correcting them in the next

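The non-word pipeline described in the lecture (detect words that are not in the dictionary, generate similar real words as candidates, pick the best one) can be sketched in Python as follows. This is only an illustration under stated assumptions: the tiny dictionary and its counts are made up, "similar" is taken to mean one edit away, and "best" is approximated by the highest word count, which is a simple stand-in and not the noisy channel model itself.

```python
import string

# Toy dictionary with made-up counts: an assumption for illustration.
# A real system would use a large word list.
WORD_COUNTS = {"the": 500, "giraffe": 3, "language": 25, "component": 8}

def is_non_word_error(word):
    """Detection: any word not in the dictionary is treated as an error."""
    return word.lower() not in WORD_COUNTS

def edits1(word):
    """All strings one deletion, transposition, substitution,
    or insertion away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)

def candidates(word):
    """Candidate generation: real words that are similar to the error."""
    return {w for w in edits1(word.lower()) if w in WORD_COUNTS}

def correct(word):
    """Pick the best candidate; here 'best' is simply the most frequent."""
    if not is_non_word_error(word):
        return word  # dictionary words are left alone in the non-word setting
    cands = candidates(word)
    if not cands:
        return word  # no similar real word found; leave unchanged
    return max(cands, key=lambda w: WORD_COUNTS[w])
```

For example, `correct("hte")` finds "the" through a transposition, and `correct("graffe")` finds "giraffe" through an insertion.
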
Title:: The Spelling Correction Task
Video Language:: English
