Today we're gonna talk about spelling
correction. Lots of applications make use
of spelling correction. For example, word
processing, almost any modern word
processor will take a misspelled word like
component with an A and give you
suggestions like component with an E and
automatically replace it for you. Modern
search engines will not only flag an
error. So, language spelled without a u,
here. But, give you, the results, as if
you had spelled the word correctly. And,
modern phones additionally will
automatically find misspelled words. Here,
I typed l-a-y-r, and it replaced it
automatically, or suggests a replacement,
with late. We can distinguish a number of
separate tasks and spelling correction.
One is the detection of the error itself.
And then the correction of the error once
you've found it. And we can think about
different kinds of correction. We might
automatically correct an error if we're
positive that the error that we know the
right answer for the error. So H-T-E is a
very common misspelling for the, and so
many word processors automatically correct
H-T-E. We might suggest a single
correction if we're, there's only one very
likely correction, or we might suggest a
whole list of corrections and let the user
pick from among them. We distinguish two
different classes of spelling errors. Non
word errors are errors in which the, what
the user types is not a word of English.
So g-r-a-f-f-e a misspelling let's say for
giraffe is not a word of English. By
contrast, real word errors. Are errors in
which then the resulting. [sound]
Misspelling is actually a word of English
and that makes them somewhat harder to
detect. And we can break up real word
errors into ones produced by really
typographical processes. These were meant
to type three. And typed [inaudible] let's
say. Or cognitive errors, where the user,
meant to type a word like [inaudible] and
instead typed a homophone of a, of the
word, or \u201ct-o-o\u201d instead of
[inaudible] And in both cases what, what's
produced is a real word of English, but by
modeling the differences between these
kind of errors, we might come up with
better ways of fixing them both. How
common are spelling errors? Depends a lot
on the task. So in web queries, spelling
errors are extremely common. So
practically one in four words in a web
query are likely to be misspelled. But in
web processing tasks on phones it's much
harder to get an accurate number. So
there's been a number of studies and most
of these studies are done by retyping. You
give the user a passage to type and then
you measure how well they, they type it.
And, of course, that's not quite the same
user's naturally writing messages or
typing. Nonetheless If you ask users to
retype and you don't let them use the
backspace key, they make about thirteen
percent of the words, thirteen percent of
the words are in error. So indicating that
if, that a lot of words. They correct
themselves with the backspace. If you let
them correct, now we're trying to
experiment on a, on a p d a style phone
site, organizer, they'll correct about
seven percent of the words themselves.
They'll still leave about two percent of
the words uncorrected, on the organizer.
And, similar numbers on people doing
retyping on a regular keyboard. So, the
numbers are about two percent where people
typing. And probably a much higher number
for web queries and probably a much higher
number for people texting. Are the kind of
spelling, spelling error [inaudible] that
we see. How do we detect non word spelling
errors. The traditional way is just to use
a large dictionary. Any word not in the
dictionary is an error. And, the larger
the dictionary, it turns out the better
this works. For correcting these non-word
spelling errors, we generate a set of
candidates that's real words that are
similar to the error. And then we pick
whichever one is best. And we'll talk
about the noisy-channel probability model
of how to do that. And it's also related
to another method called the shortest
weighted [inaudible] distance myth. So we
find the words that are not in the
dictionary. For each one, we generate a
set of candidates. Those are going to be
real words that are similar, we'll talk
about what similar means, to that error
and then we'll pick the best one. For real
word spelling errors, the algorithm is
quite similar. Again, for each word we
generate a candidate set. But now we do
this for every word in a sentence, not
just the words that are not in some
dictionary. So real word spelling error
correction, we don't use a dictionary
because of course the errors are in a
dictionary. So that wouldn't help. So, for
every word, we generate a candidate set.
So we might find candidate words with
similar pronunciations, we might find
candidate words with similar spellings,
and depending on the algorithm, exactly.
And it's very important that we're gonna
include the word itself, in the candidate
set, because the every word might be a
misspelling of some other real word, or it
might be the correct word. In fact, most
words are probably correct. So, for each
candidate set of each possible error,
we're gonna include the word itself. And
most of the time, in fact, we're gonna
pick that. And again, how we pick the
words we might use the noisy channel
model. We might use a classifier, we'll
talk about that so we'll discuss the
different methods of detecting these
errors and correcting them in the next