Introduction to N-grams (8:41)

  • 0:00 - 0:04
    Today we're gonna introduce the topic of
    language modeling, one of the most
  • 0:04 - 0:08
    important topics in natural language
    processing. The goal of language modeling
  • 0:08 - 0:12
    is to assign a probability to a sentence.
    Why would we want to assign a probability
  • 0:12 - 0:17
    to a sentence? This comes up in all sorts
    of applications. In machine translation,
  • 0:17 - 0:21
    for example, we'd like to be able to
    distinguish between good and bad
  • 0:21 - 0:26
    translations by their probabilities. So,
    "high winds tonight" might be a better
  • 0:26 - 0:30
    translation than "large winds tonight"
    because high and winds go together well.
  • 0:30 - 0:35
    In spelling correction, we see a phrase
    like "fifteen minuets from my house". That's
  • 0:35 - 0:40
    more likely to be a mistake for "minutes".
    And one piece of information that lets us
  • 0:40 - 0:45
    decide that is that "fifteen minutes from"
    is a much more likely phrase than "fifteen
  • 0:45 - 0:51
    minuets from". And in speech recognition, a
    phrase like "I saw a van" is much more
  • 0:51 - 0:56
    likely than a phrase that sounds
    phonetically similar, "eyes awe of an". But
  • 0:56 - 1:00
    that second phrase is a much less likely
    sequence of words. And it turns out
  • 1:00 - 1:04
    language modeling plays a role in
    summarization and question answering,
  • 1:04 - 1:08
    really everywhere. So the goal of a
    language model is to compute the
  • 1:08 - 1:14
    probability of a sentence or a sequence of
    words. So given some sequence of words w1
  • 1:14 - 1:19
    through wn, we're gonna compute their
    probability P of W, and we'll use capital W
  • 1:19 - 1:26
    to mean a sequence from w1 to wn. Now, this is
    related to the task of computing the
  • 1:26 - 1:32
    probability of an upcoming word, so P of w5
    given w1 through w4 is very related to the
  • 1:32 - 1:37
    task of computing P(w1, w2, w3, w4, w5). A
    model that computes either of these
  • 1:37 - 1:43
    things, either P(W), capital W, meaning a
    string, the joint probability of the whole
  • 1:43 - 1:50
    string, or the conditional probability of
    the last word given the previous words,
  • 1:50 - 1:55
    either of those, we call that a language
    model. Now it might have been better to call
  • 1:55 - 1:59
    this the grammar. I mean technically what
    this is, is telling us something about how
  • 1:59 - 2:03
    well these words fit together. And we
    normally use the word grammar for that,
  • 2:03 - 2:07
    but it turns out that the term "language
    model", and often we'll see the acronym LM,
  • 2:07 - 2:12
    is standard, so we're gonna go with that.
    So, how are we going to compute this joint
  • 2:12 - 2:17
    probability? We want to compute, let's say, the
    probability of the phrase "its water is so
  • 2:17 - 2:22
    transparent that", this little part of a
    sentence. And the intuition for how
  • 2:22 - 2:26
    language modeling works is that you're
    going to rely on the chain rule of
  • 2:26 - 2:32
    probability. And just to remind you about
    the chain rule of probability. Let's think
  • 2:32 - 2:37
    about the definition of conditional
    probability. So P of A given B
  • 2:37 - 2:48
    equals P of A comma B over P of B.
    And we can rewrite that, so P of A
  • 2:48 - 2:57
    given B times P of B
    equals P of A comma B, or turning it
  • 2:57 - 3:05
    around, P of A comma B equals P of A
    given B (I'll make sure it's a "given") times P
  • 3:05 - 3:13
    of B. And then we could generalize this to
    more variables so the joint probability of
  • 3:13 - 3:20
    a whole sequence A B C D is the
    probability of A, times B given A, times C
  • 3:20 - 3:23
    conditioned on A and B, times D
    conditioned on A B C. So this is the chain
  • 3:23 - 3:28
    rule. In a more general form of the chain
    rule, we have that the joint
  • 3:28 - 3:32
    probability of any sequence of
    variables is the probability of the first,
  • 3:32 - 3:36
    times the second conditioned on the first,
    times the third conditioned on the
  • 3:36 - 3:40
    first two, up until the last conditioned
    on the first n minus one [written out after
    the transcript]. Alright, the
  • 3:40 - 3:45
    chain rule. So now, the chain rule can be
    applied to compute the joint probability
  • 3:45 - 3:49
    of words in a sentence. So let's suppose
    we have our phrase, "its water is so
  • 3:49 - 3:53
    transparent". By the chain rule, the
    probability of that sequence is the
  • 3:53 - 3:59
    probability of "its" times the probability
    of "water" given "its", times the probability
  • 3:59 - 4:03
    of "is" given "its water", times the
    probability of "so" given "its water is", and
  • 4:03 - 4:08
    finally times the probability of
    "transparent" given "its water is so". Or, more
  • 4:08 - 4:13
    formally, the probability, joint
    probability of a sequence of words is the
  • 4:13 - 4:18
    product over all i of the probability of
    each word given the prefix up until that
  • 4:18 - 4:24
    word. How are we gonna estimate these
    probabilities? Could we just count and
  • 4:24 - 4:29
    divide? We often compute probabilities by
    counting and dividing. So, the probability
  • 4:29 - 4:34
    of "the" given "its water is so transparent
    that", we could just count how many times
  • 4:34 - 4:39
    "its water is so transparent that the"
    occurs and divide by the number of times
  • 4:39 - 4:44
    "its water is so transparent that" occurs. So we
    could divide this count by that count and get a
  • 4:44 - 4:49
    probability [the two counts are written out after
    the transcript]. But we can't do that. And the
    reason we can't do that is there are just
  • 4:49 - 4:54
    far too many possible sentences for us to
    ever estimate these. There's no way we could
  • 4:54 - 5:00
    get enough data to see all the counts of
    all possible sentences of English. So what
  • 5:00 - 5:05
    we do instead is, we apply a simplifying
    assumption called the Markov assumption,
  • 5:05 - 5:10
    named for Andrei Markov. And the Markov
    Assumption suggests that we estimate the
  • 5:10 - 5:15
    probability of "the" given "its water is so
    transparent that" just by computing instead
  • 5:15 - 5:20
    the probability of "the" given just the word "that",
    that is, the very last word in the
  • 5:20 - 5:25
    sequence. Or maybe we approximate the
    probability of "the" given "its
  • 5:25 - 5:29
    water is so transparent that" by just
    the last two words, so "the" given
  • 5:29 - 5:33
    "transparent that". That's the Markov
    Assumption [stated as formulas after the
    transcript]. Let's just look at the
  • 5:33 - 5:38
    previous word or maybe the couple of previous
    words rather than the entire context. More
  • 5:38 - 5:44
    formally, the Markov Assumption says: The
    probability of a sequence of words is the
  • 5:44 - 5:49
    product for each word of the
    conditional probability of that word,
  • 5:49 - 5:54
    given some prefix of the last few words.
    So, in other words, in the chain rule
  • 5:54 - 6:00
    product of all the probabilities we're
    multiplying together, we estimate the
  • 6:03 - 6:05
    probability of wᵢ, given the entire prefix
    from one to i-1 by a simpler-to-compute
  • 6:05 - 6:12
    probability: wᵢ given just the
    last few words. The simplest case of a
  • 6:12 - 6:17
    Markov model is called the unigram model.
    In the unigram model, we simply estimate
  • 6:17 - 6:21
    the probability of a whole sequence of
    words by the product of probabilities of
  • 6:21 - 6:25
    individual words, "unigrams". And if we
    generated sentences by randomly picking
  • 6:25 - 6:30
    words, you can see that it would look like
    a word salad. So here are some automatically
  • 6:30 - 6:34
    generated sentences from Dan Klein,
    and you can see that the word "fifth", the
  • 6:34 - 6:39
    word "an", the word "of" -- this doesn't look
    like a sentence at all. It's just a random
  • 6:39 - 6:43
    sequence of words: "thrift, did, eighty,
    said". Those are the properties of the unigram
  • 6:43 - 6:47
    model. Words are independent in this
    model. Slightly more intelligent is a
  • 6:47 - 6:52
    bigram model, where we condition on the
    single previous word. So again, we
  • 6:52 - 6:57
    estimate the probability of a word given
    the entire prefix from the beginning to
  • 6:57 - 7:02
    the previous word, just by the previous
    word. So now if we use that and generate
  • 7:02 - 7:07
    random sentences from a bigram model, the
    sentences look a little bit more like
  • 7:07 - 7:11
    English. Still, something's wrong with
    them clearly. "outside, new, car", well, "new
  • 7:11 - 7:16
    car" looks pretty good. "car parking" is
    pretty good. "parking lot". But together,
  • 7:16 - 7:21
    "outside new car parking lot of the
    agreement reached": that's not English. So
  • 7:21 - 7:26
    even with the bigram model, by giving up the
    richer conditioning that English really has, we're
  • 7:26 - 7:31
    limiting our ability to model
    what's going on in the language. Now we
  • 7:31 - 7:36
    can extend the n-gram model further to
    trigrams, that's 3-grams, or 4-grams
  • 7:36 - 7:41
    or 5-grams [a code sketch follows the transcript]. But in general, it's
    clear that n-gram modeling is an
  • 7:41 - 7:46
    insufficient model of language. And the
    reason is that language has long-distance
  • 7:46 - 7:52
    dependencies. So if I want to, say, predict
    "The computer which I had just put into the
  • 7:52 - 7:57
    machine room on the fifth floor", and I
    hadn't seen this next word, and I want to
  • 7:57 - 8:01
    say, what's my likelihood of the next
    word? If I condition just on the
  • 8:01 - 8:06
    previous word, "floor", I'd be very unlikely
    to guess "crashed". But really, "crashed" is
  • 8:06 - 8:11
    the main verb of the sentence, and "computer"
    is the subject, the head of the subject
  • 8:11 - 8:15
    noun phrase. So, if we know "computer" was
    the subject, we're much more likely to
  • 8:15 - 8:20
    guess "crashed". So, these kinds of
    long-distance dependencies mean that, in the
  • 8:20 - 8:24
    limit, a really good model of predicting
    English words will have to take into
  • 8:24 - 8:28
    account lots of long-distance information.
    But it turns out that in practice, we can
  • 8:28 - 8:33
    often get away with these n-gram models,
    because the local information, especially
  • 8:33 - 8:37
    as we get up to trigrams and 4-grams,
    will turn out to be just constraining
  • 8:37 - 8:40
    enough that in most cases it'll solve
    our problems for us.
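
The chain rule that the lecture builds up between 2:32 and 3:40, collected here in LaTeX notation (these are the same equations spoken in the video; only the written form is added):

  P(A \mid B) = \frac{P(A, B)}{P(B)}

  P(A \mid B)\, P(B) = P(A, B)

  P(A, B, C, D) = P(A)\, P(B \mid A)\, P(C \mid A, B)\, P(D \mid A, B, C)

  P(w_1 w_2 \dots w_n) = \prod_{i} P(w_i \mid w_1 w_2 \dots w_{i-1})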
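
The chain rule applied to the lecture's example phrase (3:45 to 4:24), together with the count-and-divide estimate discussed at 4:24 to 4:49. The count notation C(...), meaning how many times a string occurs in some corpus, is a convention added here, not something written in the video:

  P(\text{its water is so transparent})
    = P(\text{its}) \times P(\text{water} \mid \text{its})
      \times P(\text{is} \mid \text{its water})
      \times P(\text{so} \mid \text{its water is})
      \times P(\text{transparent} \mid \text{its water is so})

  P(\text{the} \mid \text{its water is so transparent that})
    = \frac{C(\text{its water is so transparent that the})}{C(\text{its water is so transparent that})}

As the lecture points out, this second estimate is hopeless in practice: no corpus is big enough to supply counts for every possible prefix.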
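
The Markov Assumption and the two simplest n-gram models, stated as formulas (5:38 to 7:02). The symbol k below stands for the number of context words ("the last few words" in the lecture) and is introduced here only for convenience:

  \text{Markov assumption:} \quad P(w_i \mid w_1 \dots w_{i-1}) \approx P(w_i \mid w_{i-k} \dots w_{i-1})

  \text{Unigram model:} \quad P(w_1 w_2 \dots w_n) \approx \prod_{i} P(w_i)

  \text{Bigram model:} \quad P(w_i \mid w_1 \dots w_{i-1}) \approx P(w_i \mid w_{i-1})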
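
A minimal Python sketch of the ideas in this clip, under stated assumptions: the tiny toy corpus, the <s> and </s> boundary markers, and all function names are illustrative choices, not from the video. It counts unigrams and bigrams, estimates bigram probabilities by counting and dividing, and generates random text from each model, which roughly reproduces the "word salad" versus "almost English" contrast shown on the slides.

  import random
  from collections import Counter, defaultdict

  # Toy corpus (an assumption for illustration; the lecture's generated
  # sentences came from much larger corpora).
  corpus = [
      "its water is so transparent that the water is clear",
      "the water in the lake is so clear",
      "I saw a van outside the new car parking lot",
  ]

  # Tokenize and add sentence-boundary markers so the bigram model can
  # learn how sentences start and end.
  sentences = [["<s>"] + line.split() + ["</s>"] for line in corpus]

  # Count unigrams and bigrams.
  unigram_counts = Counter(w for sent in sentences for w in sent)
  bigram_counts = defaultdict(Counter)
  for sent in sentences:
      for prev, cur in zip(sent, sent[1:]):
          bigram_counts[prev][cur] += 1

  total_tokens = sum(unigram_counts.values())

  def unigram_prob(w):
      """P(w): count of w divided by the total number of tokens."""
      return unigram_counts[w] / total_tokens

  def bigram_prob(prev, cur):
      """P(cur | prev): count of the bigram divided by the count of prev."""
      return bigram_counts[prev][cur] / unigram_counts[prev]

  def generate(model="bigram", max_len=15):
      """Sample words one at a time. The unigram model ignores context;
      the bigram model conditions on the single previous word."""
      words, prev = [], "<s>"
      for _ in range(max_len):
          counts = unigram_counts if model == "unigram" else bigram_counts[prev]
          choices = list(counts.keys())
          weights = list(counts.values())
          w = random.choices(choices, weights=weights)[0]
          if w == "</s>":
              break
          if w != "<s>":
              words.append(w)
          prev = w
      return " ".join(words)

  print(bigram_prob("is", "so"))  # 2/3 with this toy corpus
  print(generate("unigram"))      # tends to look like a word salad
  print(generate("bigram"))       # locally a bit more English-like

Swapping in counts of the two previous words instead of one would give a trigram model, and so on for 4-grams and 5-grams, as mentioned at 7:31; as the end of the clip notes, none of these capture long-distance dependencies, but they are often constraining enough in practice.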