Today we're gonna talk about spelling
correction. Lots of applications make use

of spelling correction. For example, word
processing, almost any modern word

processor will take a misspelled word like
component with an A and give you

suggestions like component with an E and
automatically replace it for you. Modern

search engines will not only flag an
error. So, language spelled without a u,

here. But, give you, the results, as if
you had spelled the word correctly. And,

modern phones additionally will
automatically find misspelled words. Here,

I typed l-a-y-r, and it replaced it
automatically, or suggests a replacement,

with late. We can distinguish a number of
separate tasks and spelling correction.

One is the detection of the error itself.
And then the correction of the error once

you've found it. And we can think about
different kinds of correction. We might

automatically correct an error if we're
positive that the error that we know the

right answer for the error. So H-T-E is a
very common misspelling for the, and so

many word processors automatically correct
H-T-E. We might suggest a single

correction if we're, there's only one very
likely correction, or we might suggest a

whole list of corrections and let the user
pick from among them. We distinguish two

different classes of spelling errors. Non
word errors are errors in which the, what

the user types is not a word of English.
So g-r-a-f-f-e a misspelling let's say for

giraffe is not a word of English. By
contrast, real word errors. Are errors in

which then the resulting. [sound]
Misspelling is actually a word of English

and that makes them somewhat harder to
detect. And we can break up real word

errors into ones produced by really
typographical processes. These were meant

to type three. And typed [inaudible] let's
say. Or cognitive errors, where the user,

meant to type a word like [inaudible] and
instead typed a homophone of a, of the

word, or \u201ct-o-o\u201d instead of
[inaudible] And in both cases what, what's

produced is a real word of English, but by
modeling the differences between these

kind of errors, we might come up with
better ways of fixing them both. How

common are spelling errors? Depends a lot
on the task. So in web queries, spelling

errors are extremely common. So
practically one in four words in a web

query are likely to be misspelled. But in
web processing tasks on phones it's much

harder to get an accurate number. So
there's been a number of studies and most

of these studies are done by retyping. You
give the user a passage to type and then

you measure how well they, they type it.
And, of course, that's not quite the same

user's naturally writing messages or
typing. Nonetheless If you ask users to

retype and you don't let them use the
backspace key, they make about thirteen

percent of the words, thirteen percent of
the words are in error. So indicating that

if, that a lot of words. They correct
themselves with the backspace. If you let

them correct, now we're trying to
experiment on a, on a p d a style phone

site, organizer, they'll correct about
seven percent of the words themselves.

They'll still leave about two percent of
the words uncorrected, on the organizer.

And, similar numbers on people doing
retyping on a regular keyboard. So, the

numbers are about two percent where people
typing. And probably a much higher number

for web queries and probably a much higher
number for people texting. Are the kind of

spelling, spelling error [inaudible] that
we see. How do we detect non word spelling

errors. The traditional way is just to use
a large dictionary. Any word not in the

dictionary is an error. And, the larger
the dictionary, it turns out the better

this works. For correcting these non-word
spelling errors, we generate a set of

candidates that's real words that are
similar to the error. And then we pick

whichever one is best. And we'll talk
about the noisy-channel probability model

of how to do that. And it's also related
to another method called the shortest

weighted [inaudible] distance myth. So we
find the words that are not in the

dictionary. For each one, we generate a
set of candidates. Those are going to be

real words that are similar, we'll talk
about what similar means, to that error

and then we'll pick the best one. For real
word spelling errors, the algorithm is

quite similar. Again, for each word we
generate a candidate set. But now we do

this for every word in a sentence, not
just the words that are not in some

dictionary. So real word spelling error
correction, we don't use a dictionary

because of course the errors are in a
dictionary. So that wouldn't help. So, for

every word, we generate a candidate set.
So we might find candidate words with

similar pronunciations, we might find
candidate words with similar spellings,

and depending on the algorithm, exactly.
And it's very important that we're gonna

include the word itself, in the candidate
set, because the every word might be a

misspelling of some other real word, or it
might be the correct word. In fact, most

words are probably correct. So, for each
candidate set of each possible error,

we're gonna include the word itself. And
most of the time, in fact, we're gonna

pick that. And again, how we pick the
words we might use the noisy channel

model. We might use a classifier, we'll
talk about that so we'll discuss the

different methods of detecting these
errors and correcting them in the next