
Kneser-Ney Smoothing (8:59)

  • 0:00 - 0:03
    Let’s talk about Kneser-Ney Smoothing,
  • 0:03 - 0:05
    one of the most sophisticated forms of smoothing,
  • 0:05 - 0:08
    but also one with a beautiful and elegant intuition.
  • 0:10 - 0:14
    Remember that from Good-Turing, we talked about the c*’s,
  • 0:14 - 0:17
    the discounted counts you end up with from Good-Turing,
  • 0:17 - 0:22
    and we discounted each of the counts —
  • 0:22 - 0:24
    the count of one was discounted to point four,
  • 0:24 - 0:28
    and the count of two discounted to 1.26 and so on —
  • 0:28 - 0:33
    in order to save mass to replace the zero counts with some low number.
  • 0:33 - 0:39
    And if you look at the actual values of these counts — 8.25 for nine and 7.24 for eight —
  • 0:39 - 0:42
    you’ll notice that in a very large number of cases,
  • 0:42 - 0:48
    the discounted count has a very close relationship with the original count.
  • 0:48 - 0:53
    It’s really the original count minus .75, or somewhat close to that.
  • 0:53 - 1:01
    So, in practice what Good-Turing often does is produce a fixed small discount from the counts.
  • 1:01 - 1:08
    And that intuition, that of a fixed small discount, can be applied directly.
  • 1:08 - 1:13
    When we do this, we call this absolute discounting, and absolute discounting is a popular kind of smoothing.
  • 1:13 - 1:18
    And here we’re showing you absolute discounting interpolation.
  • 1:18 - 1:22
    And again, the intuition is just: we’ll save some time and not have to compute all those complicated Good-Turing numbers,
  • 1:22 - 1:28
    and we’ll just subtract .75, or maybe it will be a different discount value for different corpora.
  • 1:28 - 1:30
    Now here’s the equation for absolute discounting; it’s also written out after the transcript.
  • 1:30 - 1:32
    So we’re doing bigrams again.
  • 1:32 - 1:37
    So the probability, absolute discounted, of a word, given the previous word,
  • 1:37 - 1:44
    will be some discounted bigram, interpolated with some interpolation weight, with the unigram probability.
  • 1:44 - 1:48
    So we have a unigram probability P(w), and then the bigram probability,
  • 1:48 - 1:53
    and we just subtract a fixed amount, let’s say it’s .75, from the count.
  • 1:53 - 1:57
    And otherwise compute the bigram probability in the normal way.
  • 1:57 - 2:00
    So, we have a discounted bigram probability, mixed with some weight,
  • 2:00 - 2:03
    with a unigram probability; I’ll talk later about how to set this weight.
  • 2:03 - 2:08
    And we might keep a couple of extra values of d for counts one and two.
  • 2:08 - 2:13
    Counts one and two, we saw on a previous slide, weren’t quite subtracting .75,
  • 2:13 - 2:16
    so we can model this more carefully by having separate discounts for those.
  • 2:17 - 2:23
    But the problem with absolute discounting is the unigram probability itself.
  • 2:23 - 2:26
    And I want to talk about changing the unigram probability.
  • 2:26 - 2:28
    And that’s the fundamental intuition of Kneser-Ney.
  • 2:28 - 2:35
    So in Kneser-Ney smoothing, the idea is keep that same interpolation that we saw in absolute discounting,
  • 2:35 - 2:39
    but use a better estimate of the probabilities of the lower-order unigrams.
  • 2:39 - 2:43
    And for the intuition behind that, we can go back and look at the classic Shannon game.
  • 2:43 - 2:46
    Remember, in the Shannon game we’re predicting a word from previous words,
  • 2:46 - 2:50
    so we see a sentence, “I can’t see without my reading ___.”
  • 2:50 - 2:51
    What’s the most likely next word?
  • 2:51 - 2:53
    Well, “glasses” seems pretty likely.
  • 2:53 - 2:56
    Well, how about instead the word “Francisco”?
  • 2:56 - 2:59
    Well, that seems very unlikely in this situation.
  • 2:59 - 3:05
    And yet “Francisco”, as just a unigram, is more common than “glasses”.
  • 3:05 - 3:11
    But the reason why “Francisco” seems like a bad thing after “reading”,
  • 3:11 - 3:16
    one intuition we might be able to get is that “Francisco” always follows “San” or very often follows “San”.
  • 3:16 - 3:22
    So while “Francisco” is very frequent, it’s frequent in the context of the word “San”.
  • 3:22 - 3:26
    Now, unigrams in an interpolation model, where we’re mixing a unigram and a bigram,
  • 3:26 - 3:31
    are specifically useful, they’re very helpful, precisely in the case where we haven’t seen a bigram.
  • 3:31 - 3:37
    So it’s unfortunate that just in the case where we haven’t seen the bigram “reading Francisco”,
  • 3:37 - 3:41
    we’re trusting “Francisco”’s unigram weight, which is just where we shouldn’t trust it.
  • 3:41 - 3:46
    So instead of using the probability of w, how likely is a word,
  • 3:46 - 3:49
    our intuition is going to be that, when we’re backing off to something,
  • 3:49 - 3:52
    we should instead use the continuation probability.
  • 3:52 - 3:54
    We’re going to call it P continuation of a word,
  • 3:54 - 3:57
    how likely is the word to appear as a novel continuation.
  • 3:57 - 3:59
    Well, how do we measure novel continuation?
  • 3:59 - 4:04
    Well, for each word we’ll just count the number of bigram types it completes.
  • 4:04 - 4:09
    How many different bigrams does it create by appearing after another word?
  • 4:09 - 4:14
    In other words, each bigram type is a novel continuation the first time we see this new bigram.
  • 4:14 - 4:21
    In other words, the continuation probability is going to be proportional to the cardinality of this set,
  • 4:21 - 4:28
    the number of preceding word types, w sub i minus one, that occur with our word.
  • 4:28 - 4:33
    So, how many words occur before this word in a bigram?
  • 4:33 - 4:35
    How many preceding words are there?
  • 4:35 - 4:42
    That will be the cardinality of that set, and that’s the number we would like our continuation probability to be proportional to.
  • 4:42 - 4:46
    So how many times does w appear as a novel continuation?
  • 4:46 - 4:47
    We need to turn that into a probability.
  • 4:47 - 4:51
    So we just divide by the total number of word bigram types.
  • 4:51 - 4:59
    So, of all word bigrams that occur more than zero times, what’s the cardinality of that set?
  • 4:59 - 5:01
    How many different word bigram types are there?
  • 5:01 - 5:06
    And we’re just going to divide the two to get the probability of continuation:
  • 5:06 - 5:11
    of all the word bigram types, how many of those have w as a novel continuation? (The equation is also written out after the transcript.)
  • 5:11 - 5:17
    Now it turns out that there’s an alternative metaphor for Kneser-Ney with the same equations.
  • 5:17 - 5:24
    So again we can see the numerator as the total number of word types that precede w:
  • 5:24 - 5:30
    how many word types can w follow? And we’re going to normalize it by the number of word types that could precede all words.
  • 5:30 - 5:37
    So, that’s a sum, over all words, of the number of word types that can precede each word.
  • 5:37 - 5:40
    And these two are the same:
  • 5:40 - 5:45
    This denominator and the denominator we saw on the previous slide are the same,
  • 5:45 - 5:54
    because the number of possible bigram types is the same as the number of word types that can precede each word, summed over all words.
  • 5:54 - 5:56
    If you think about that for a second, you’ll realize that’s true.
  • 5:56 - 6:00
    So in other words, with this kind of Kneser-Ney model,
  • 6:00 - 6:08
    a frequent word like “Francisco” that occurs only in one context like “San” will have a low continuation probability.
  • 6:08 - 6:16
    So if we put together the intuition of absolute discounting with the Kneser-Ney probability for the lower order n-gram,
  • 6:16 - 6:19
    we have the Kneser-Ney smoothing algorithm.
  • 6:19 - 6:25
    So, for the bigram itself we just have absolute discounting —
  • 6:25 - 6:28
    We take the bigram count, we subtract some discount d,
  • 6:28 - 6:31
    and I’ve just shown here that we take the max of that and zero
  • 6:31 - 6:36
    because, obviously, if the discount happens to be higher than the count, we don’t want a negative probability.
  • 6:36 - 6:40
    And we’re just gonna interpolate that with this same continuation probability that we just saw,
  • 6:40 - 6:44
    p continuation of w sub i.
  • 6:44 - 6:47
    And the lambda: let’s now talk about how to compute that lambda.
  • 6:47 - 6:54
    The lambda is gonna take all that probability mass from all those normalized discounts
  • 6:54 - 6:56
    that we took out of these higher-order probabilities,
  • 6:56 - 7:03
    and use those to weight how much probability we should assign to the unigram.
  • 7:03 - 7:06
    We’re gonna combine those.
  • 7:06 - 7:14
    So that lambda is the discount d divided by the denominator there.
  • 7:14 - 7:16
    So, it’s the normalized discount.
  • 7:16 - 7:23
    And then, we’re gonna multiply that by the total number of word types that can follow this context, w_i minus one.
  • 7:23 - 7:27
    In other words, how many different word types did we discount
  • 7:27 - 7:30
    or how many times did we apply this normalized discount.
  • 7:30 - 7:32
    And we multiply those together
  • 7:32 - 7:37
    and we know how much total probability mass we can now assign to the continuation of the word.
  • 7:37 - 7:41
    Now, this is the bigram formulation for Kneser-Ney; the full equation and a small code sketch appear after the transcript.
  • 7:41 - 7:46
    Now in this slide, we’re showing you the general recursive formulation for n-grams in general,
  • 7:46 - 7:51
    and here we have to make a slight change to deal with all the higher-order n-grams.
  • 7:51 - 7:56
    So here we’re just showing the Kneser-Ney probability of a word given the prefix of the word.
  • 7:56 - 7:58
    And, just like Kneser-Ney we saw before,
  • 7:58 - 8:07
    we’re just interpolating a higher-order n-gram which is discounted with a lambda weight and a lower-order probability.
  • 8:07 - 8:15
    But now we need to distinguish between the very first, top-level time that we use a count and these lower-order counts.
  • 8:15 - 8:20
    So we’re going to use the actual count for the very highest-order n-gram,
  • 8:20 - 8:26
    and we’re going to use the continuation value that we just defined earlier for all the lower order probabilities.
  • 8:26 - 8:33
    So we’ll define this new thing, count-Kneser-Ney of dot, which will mean the actual count.
  • 8:33 - 8:36
    This will be the actual count, let’s say we’re doing trigrams,
  • 8:36 - 8:42
    for the trigram; and then, when we recurse, we use the Kneser-Ney probability for the lower-order things (the recursive equation is also written out after the transcript).
  • 8:42 - 8:45
    When we get down to the bigrams and unigrams, we’ll be using the continuation count,
  • 8:45 - 8:49
    that’s, again, the number of single-word contexts that we defined earlier.
  • 8:49 - 8:52
    So, Kneser-Ney smoothing: an excellent algorithm.
  • 8:52 - 8:55
    It’s very commonly used in speech recognition and machine translation
  • 8:55 - 8:59
    and yet it has a very beautiful and elegant intuition, and I hope you appreciate it.
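
Absolute discounting interpolation (1:28 - 2:08), written out. This is a sketch of the equation described above in standard notation; d is the fixed discount (for example 0.75) and \lambda(w_{i-1}) is the interpolation weight discussed later in the video:

\[
P_{\mathrm{AbsoluteDiscounting}}(w_i \mid w_{i-1}) \;=\; \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})} \;+\; \lambda(w_{i-1})\, P(w_i)
\]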
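
Continuation probability (3:52 - 5:56), written out. The numerator counts the word types that can precede w (the bigram types that w completes), and the denominator is the total number of bigram types:

\[
P_{\mathrm{CONTINUATION}}(w) \;=\; \frac{\bigl|\{\, w_{i-1} : c(w_{i-1}, w) > 0 \,\}\bigr|}{\bigl|\{\, (w_{j-1}, w_j) : c(w_{j-1}, w_j) > 0 \,\}\bigr|}
\]

Under the alternative metaphor, the denominator is equivalently \(\sum_{w'} \bigl|\{\, w'' : c(w'', w') > 0 \,\}\bigr|\): the number of word types that can precede each word, summed over all words.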
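
Interpolated Kneser-Ney for bigrams (6:19 - 7:37), written out, with the \lambda weight built from the normalized discount:

\[
P_{\mathrm{KN}}(w_i \mid w_{i-1}) \;=\; \frac{\max\bigl(c(w_{i-1}, w_i) - d,\; 0\bigr)}{c(w_{i-1})} \;+\; \lambda(w_{i-1})\, P_{\mathrm{CONTINUATION}}(w_i),
\qquad
\lambda(w_{i-1}) \;=\; \frac{d}{c(w_{i-1})}\, \bigl|\{\, w : c(w_{i-1}, w) > 0 \,\}\bigr|
\]

And a minimal Python sketch of this bigram version, assuming tokenized sentences as input. The names here (train_kn_bigram, preceders, followers, and so on) are illustrative rather than from the lecture, and unseen histories are not handled:

from collections import Counter, defaultdict

def train_kn_bigram(sentences, d=0.75):
    # Collect bigram counts, history counts, and the type sets needed for
    # the continuation probability and the lambda weight.
    bigrams = Counter()           # c(w1, w2)
    history_counts = Counter()    # c(w1), counted as a bigram history
    preceders = defaultdict(set)  # w2 -> word types seen before w2
    followers = defaultdict(set)  # w1 -> word types seen after w1
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            bigrams[(w1, w2)] += 1
            history_counts[w1] += 1
            preceders[w2].add(w1)
            followers[w1].add(w2)
    total_bigram_types = len(bigrams)

    def prob(w2, w1):
        # Discounted bigram term: max(c(w1, w2) - d, 0) / c(w1)
        discounted = max(bigrams[(w1, w2)] - d, 0) / history_counts[w1]
        # Lambda: the normalized discount, times the number of word types
        # that follow w1 (i.e. how many times the discount was applied)
        lam = (d / history_counts[w1]) * len(followers[w1])
        # Continuation probability: of all bigram types, the fraction ending in w2
        p_cont = len(preceders[w2]) / total_bigram_types
        return discounted + lam * p_cont

    return prob

# Example on a toy corpus (hypothetical):
# p = train_kn_bigram([["i", "can't", "see", "without", "my", "reading", "glasses"]])
# p("glasses", "reading")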
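
General recursive formulation (7:41 - 8:49), written out. Here c_KN is the actual count for the highest-order n-gram and the continuation count for the lower orders:

\[
P_{\mathrm{KN}}(w_i \mid w_{i-n+1}^{\,i-1}) \;=\; \frac{\max\bigl(c_{\mathrm{KN}}(w_{i-n+1}^{\,i}) - d,\; 0\bigr)}{c_{\mathrm{KN}}(w_{i-n+1}^{\,i-1})} \;+\; \lambda(w_{i-n+1}^{\,i-1})\, P_{\mathrm{KN}}(w_i \mid w_{i-n+2}^{\,i-1})
\]

where

\[
c_{\mathrm{KN}}(\cdot) \;=\;
\begin{cases}
\text{count}(\cdot) & \text{for the highest-order n-gram}\\
\text{continuation count}(\cdot) & \text{for the lower orders}
\end{cases}
\]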