The Spelling Correction Task

  • 0:02 - 0:07
    Today we're gonna talk about spelling
    correction. Lots of applications make use
  • 0:07 - 0:12
    of spelling correction. For example, word
    processing, almost any modern word
  • 0:12 - 0:17
    processor will take a misspelled word like
    component with an A and give you
  • 0:17 - 0:22
    suggestions like component with an E and
    automatically replace it for you. Modern
  • 0:22 - 0:28
    search engines will not only flag an
    error, such as "language" spelled without a "u"
  • 0:28 - 0:35
    here, but also give you the results as if
    you had spelled the word correctly. And,
  • 0:35 - 0:41
    modern phones additionally will
    automatically find misspelled words. Here,
  • 0:41 - 0:47
    I typed l-a-y-r, and it replaced it
    automatically, or suggested a replacement,
  • 0:47 - 0:52
    with "late". We can distinguish a number of
    separate tasks in spelling correction.
  • 0:52 - 0:57
    One is the detection of the error itself.
    And then the correction of the error once
  • 0:57 - 1:01
    you've found it. And we can think about
    different kinds of correction. We might
  • 1:01 - 1:06
    automatically correct an error if we're
    positive that we know the
  • 1:06 - 1:11
    right answer for the error. So H-T-E is a
    very common misspelling of "the", and so
  • 1:11 - 1:15
    many word processors automatically correct
    H-T-E. We might suggest a single
  • 1:15 - 1:19
    correction if there's only one very
    likely correction, or we might suggest a
  • 1:19 - 1:24
    whole list of corrections and let the user
    pick from among them. We distinguish two
  • 1:24 - 1:31
    different classes of spelling errors. Non-word
    errors are errors in which what
  • 1:31 - 1:37
    the user types is not a word of English.
    So g-r-a-f-f-e, a misspelling, let's say, of
  • 1:37 - 1:44
    giraffe, is not a word of English. By
    contrast, real-word errors are errors in
  • 1:44 - 1:50
    which the resulting
    misspelling is actually a word of English,
  • 1:50 - 1:55
    and that makes them somewhat harder to
    detect. And we can break up real-word
  • 1:55 - 2:00
    errors into ones produced by really
    typographical processes, where the user meant
  • 2:00 - 2:06
    to type "three" and typed [inaudible], let's
    say, or cognitive errors, where the user
  • 2:06 - 2:12
    meant to type a word like [inaudible] and
    instead typed a homophone of the
  • 2:12 - 2:16
    word, or "t-o-o" instead of
    [inaudible]. And in both cases what's
  • 2:16 - 2:22
    produced is a real word of English, but by
    modeling the differences between these
  • 2:22 - 2:27
    kinds of errors, we might come up with
    better ways of fixing them both. How
  • 2:27 - 2:34
    common are spelling errors? Depends a lot
    on the task. So in web queries, spelling
  • 2:34 - 2:39
    errors are extremely common. So
    practically one in four words in a web
  • 2:39 - 2:44
    query is likely to be misspelled. But in
    word processing tasks on phones it's much
  • 2:44 - 2:49
    harder to get an accurate number. So
    there's been a number of studies and most
  • 2:49 - 2:53
    of these studies are done by retyping. You
    give the user a passage to type and then
  • 2:53 - 2:58
    you measure how well they, they type it.
    And, of course, that's not quite the same as
  • 2:58 - 3:03
    users naturally writing messages or
    typing. Nonetheless, if you ask users to
  • 3:03 - 3:07
    retype and you don't let them use the
    backspace key, about thirteen
  • 3:07 - 3:11
    percent of the words are in error,
    indicating that
  • 3:11 - 3:16
    users correct a lot of words
    themselves with the backspace. If you let
  • 3:16 - 3:21
    them correct, and here we're looking at an
    experiment on a PDA-style,
  • 3:21 - 3:26
    phone-sized organizer, they'll correct about
    seven percent of the words themselves.
  • 3:26 - 3:31
    They'll still leave about two percent of
    the words uncorrected, on the organizer.
  • 3:31 - 3:36
    And there are similar numbers for people doing
    retyping on a regular keyboard. So the
  • 3:36 - 3:41
    numbers are about two percent for people
    typing, and probably a much higher number
  • 3:41 - 3:46
    for web queries and probably a much higher
    number for people texting. That's the kind of
  • 3:46 - 3:51
    spelling error [inaudible] that
    we see. How do we detect non-word spelling
  • 3:51 - 3:56
    errors? The traditional way is just to use
    a large dictionary. Any word not in the
  • 3:56 - 4:01
    dictionary is an error. And, the larger
    the dictionary, it turns out the better
  • 4:01 - 4:05
    this works. For correcting these non-word
    spelling errors, we generate a set of
  • 4:05 - 4:09
    candidates: real words that are
    similar to the error. And then we pick
  • 4:09 - 4:13
    whichever one is best. And we'll talk
    about the noisy-channel probability model
  • 4:13 - 4:17
    of how to do that. And it's also related
    to another method called the shortest
  • 4:17 - 4:21
    weighted edit distance method. So we
    find the words that are not in the
  • 4:21 - 4:25
    dictionary. For each one, we generate a
    set of candidates. Those are going to be
  • 4:25 - 4:29
    real words that are similar, we'll talk
    about what similar means, to that error
  • 4:29 - 4:33
    and then we'll pick the best one. For real
    word spelling errors, the algorithm is
  • 4:33 - 4:38
    quite similar. Again, for each word we
    generate a candidate set. But now we do
  • 4:38 - 4:42
    this for every word in a sentence, not
    just the words that are not in some
  • 4:42 - 4:46
    dictionary. So for real-word spelling error
    correction, we don't use a dictionary,
  • 4:46 - 4:50
    because of course the errors are in the
    dictionary. So that wouldn't help. So, for
  • 4:50 - 4:54
    every word, we generate a candidate set.
    So we might find candidate words with
  • 4:54 - 4:58
    similar pronunciations, we might find
    candidate words with similar spellings,
  • 4:58 - 5:03
    depending on the exact algorithm.
    And it's very important that we
  • 5:03 - 5:07
    include the word itself in the candidate
    set, because every word might be a
  • 5:07 - 5:12
    misspelling of some other real word, or it
    might be the correct word. In fact, most
  • 5:12 - 5:16
    words are probably correct. So, for each
    candidate set of each possible error,
  • 5:16 - 5:20
    we're gonna include the word itself. And
    most of the time, in fact, we're gonna
  • 5:20 - 5:26
    pick that. And again, to pick the
    words we might use the noisy channel
  • 5:26 - 5:32
    model, or we might use a classifier; we'll
    talk about that. So we'll discuss the
  • 5:32 - 5:38
    different methods of detecting these
    errors and correcting them in the next
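
The steps described in the lecture (detect a non-word by dictionary lookup, generate real-word candidates that are similar to the error, pick the best one) can be sketched roughly as follows. This is only an illustration under several assumptions: the tiny vocabulary and its counts are made up, "similar" is taken to mean one insertion, deletion, substitution, or transposition away, and "best" is simply the most frequent candidate, which is a stand-in for the noisy channel model and not that model itself. The function names are hypothetical.

```python
import string
from collections import Counter

# Toy dictionary with made-up counts (assumption: a real system
# would use a much larger word list and real frequency data).
WORD_COUNTS = Counter({
    "the": 500, "there": 120, "three": 30,
    "language": 25, "component": 8, "giraffe": 3,
})


def edits1(word):
    """All strings one insertion, deletion, substitution, or transposition away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)


def is_nonword(word):
    """Non-word detection: any word not in the dictionary is an error."""
    return word not in WORD_COUNTS


def candidates(word, include_self=False):
    """Real words similar to `word`.

    For real-word correction, pass include_self=True so that the typed
    word stays in its own candidate set (most words are correct).
    """
    cands = {w for w in edits1(word) if w in WORD_COUNTS and w != word}
    if include_self and word in WORD_COUNTS:
        cands.add(word)
    return cands


def correct_nonword(word):
    """Leave dictionary words alone; otherwise pick the most frequent candidate."""
    if not is_nonword(word):
        return word
    cands = candidates(word)
    if not cands:
        return word  # no similar real word found, so leave it unchanged
    return max(cands, key=lambda w: WORD_COUNTS[w])


if __name__ == "__main__":
    for typed in ["hte", "graffe", "componant"]:
        print(typed, "->", correct_nonword(typed))
    # Candidate set for a real word, as used in real-word correction.
    print(candidates("three", include_self=True))
```

For real-word errors, the same candidate generator would be run on every word in the sentence with include_self=True, and a model that looks at the surrounding words would then choose among the candidates. That choosing step is not shown here.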
Video Language:
English