-
Today we're gonna talk about spelling
correction. Lots of applications make use
-
of spelling correction. For example, word
processing, almost any modern word
-
processor will take a misspelled word like
component with an A and give you
-
suggestions like component with an E and
automatically replace it for you. Modern
-
search engines will not only flag an
error. So, language spelled without a u,
-
here. But, give you, the results, as if
you had spelled the word correctly. And,
-
modern phones additionally will
automatically find misspelled words. Here,
-
I typed l-a-y-r, and it replaced it
automatically, or suggests a replacement,
-
with late. We can distinguish a number of
separate tasks and spelling correction.
-
One is the detection of the error itself.
And then the correction of the error once
-
you've found it. And we can think about
different kinds of correction. We might
-
automatically correct an error if we're
positive that the error that we know the
-
right answer for the error. So H-T-E is a
very common misspelling for the, and so
-
many word processors automatically correct
H-T-E. We might suggest a single
-
correction if we're, there's only one very
likely correction, or we might suggest a
-
whole list of corrections and let the user
pick from among them. We distinguish two
-
different classes of spelling errors. Non
word errors are errors in which the, what
-
the user types is not a word of English.
So g-r-a-f-f-e a misspelling let's say for
-
giraffe is not a word of English. By
contrast, real word errors. Are errors in
-
which then the resulting. [sound]
Misspelling is actually a word of English
-
and that makes them somewhat harder to
detect. And we can break up real word
-
errors into ones produced by really
typographical processes. These were meant
-
to type three. And typed [inaudible] let's
say. Or cognitive errors, where the user,
-
meant to type a word like [inaudible] and
instead typed a homophone of a, of the
-
word, or \u201ct-o-o\u201d instead of
[inaudible] And in both cases what, what's
-
produced is a real word of English, but by
modeling the differences between these
-
kind of errors, we might come up with
better ways of fixing them both. How
-
common are spelling errors? Depends a lot
on the task. So in web queries, spelling
-
errors are extremely common. So
practically one in four words in a web
-
query are likely to be misspelled. But in
web processing tasks on phones it's much
-
harder to get an accurate number. So
there's been a number of studies and most
-
of these studies are done by retyping. You
give the user a passage to type and then
-
you measure how well they, they type it.
And, of course, that's not quite the same
-
user's naturally writing messages or
typing. Nonetheless If you ask users to
-
retype and you don't let them use the
backspace key, they make about thirteen
-
percent of the words, thirteen percent of
the words are in error. So indicating that
-
if, that a lot of words. They correct
themselves with the backspace. If you let
-
them correct, now we're trying to
experiment on a, on a p d a style phone
-
site, organizer, they'll correct about
seven percent of the words themselves.
-
They'll still leave about two percent of
the words uncorrected, on the organizer.
-
And, similar numbers on people doing
retyping on a regular keyboard. So, the
-
numbers are about two percent where people
typing. And probably a much higher number
-
for web queries and probably a much higher
number for people texting. Are the kind of
-
spelling, spelling error [inaudible] that
we see. How do we detect non word spelling
-
errors. The traditional way is just to use
a large dictionary. Any word not in the
-
dictionary is an error. And, the larger
the dictionary, it turns out the better
-
this works. For correcting these non-word
spelling errors, we generate a set of
-
candidates that's real words that are
similar to the error. And then we pick
-
whichever one is best. And we'll talk
about the noisy-channel probability model
-
of how to do that. And it's also related
to another method called the shortest
-
weighted [inaudible] distance myth. So we
find the words that are not in the
-
dictionary. For each one, we generate a
set of candidates. Those are going to be
-
real words that are similar, we'll talk
about what similar means, to that error
-
and then we'll pick the best one. For real
word spelling errors, the algorithm is
-
quite similar. Again, for each word we
generate a candidate set. But now we do
-
this for every word in a sentence, not
just the words that are not in some
-
dictionary. So real word spelling error
correction, we don't use a dictionary
-
because of course the errors are in a
dictionary. So that wouldn't help. So, for
-
every word, we generate a candidate set.
So we might find candidate words with
-
similar pronunciations, we might find
candidate words with similar spellings,
-
and depending on the algorithm, exactly.
And it's very important that we're gonna
-
include the word itself, in the candidate
set, because the every word might be a
-
misspelling of some other real word, or it
might be the correct word. In fact, most
-
words are probably correct. So, for each
candidate set of each possible error,
-
we're gonna include the word itself. And
most of the time, in fact, we're gonna
-
pick that. And again, how we pick the
words we might use the noisy channel
-
model. We might use a classifier, we'll
talk about that so we'll discuss the
-
different methods of detecting these
errors and correcting them in the next