Let’s talk about Kneser-Ney smoothing, one of the most sophisticated forms of smoothing, but also one with a beautiful and elegant intuition.

Remember that from Good-Turing we talked about the c*’s, the discounted counts you end up with from Good-Turing. We discounted each of the counts (the count of one was discounted to 0.4, the count of two to 1.26, and so on) in order to save probability mass to give the zero counts some low number. And if you look at the actual values of these counts, 8.25 for nine and 7.24 for eight, you’ll notice that in a very large number of cases the discounted count has a very close relationship with the original count: it’s really the original count minus 0.75, or something close to that. So in practice what Good-Turing often does is produce a fixed small discount from the counts.

That intuition, a fixed small discount, can be applied directly. When we do this we call it absolute discounting, and absolute discounting is a popular kind of smoothing. Here we’re showing absolute discounting interpolation. Again, the intuition is simple: we’ll save ourselves the trouble of computing all those complicated Good-Turing numbers and just subtract 0.75 (or maybe a different discount value for different corpora).

Now here’s the equation for absolute discounting; we’re doing bigrams again. The absolute-discounted probability of a word given the previous word is a discounted bigram probability interpolated, with some interpolation weight, with the unigram probability. So we have a unigram probability P(w), and then the bigram probability, where we just subtract a fixed amount, let’s say 0.75, from the count and otherwise compute the bigram probability in the normal way. So we have a discounted bigram probability mixed, with some weight, with a unigram (I’ll talk later about how to set this weight). And we might keep a couple of extra values of d for counts one and two: we saw on a previous slide that counts one and two weren’t discounted by quite 0.75, so we can model them more carefully by giving them separate discounts.

But the problem with absolute discounting is the unigram probability itself, and I want to talk about changing that unigram probability. That is the fundamental intuition of Kneser-Ney.
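As a concrete illustration of the interpolated absolute discounting just described, here is a minimal sketch for bigrams. The tiny corpus, the function names, and the fixed discount d = 0.75 are illustrative assumptions, not material from the lecture; the lecture leaves the interpolation weight unspecified at this point, so the sketch uses the normalized-discount choice that comes up later, purely for concreteness.

```python
from collections import Counter

def train_counts(tokens):
    """Collect unigram and bigram counts from a token list."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def p_absolute_discount(w_prev, w, unigrams, bigrams, d=0.75):
    """Interpolated absolute discounting for a bigram (illustrative sketch)."""
    # Discounted bigram term: subtract a fixed d from the observed count,
    # clipped at zero so an unseen bigram contributes nothing negative.
    discounted = max(bigrams[(w_prev, w)] - d, 0) / unigrams[w_prev]
    # Interpolation weight: here, the normalized discount (d / c(w_prev))
    # times the number of distinct word types seen after w_prev.
    follower_types = sum(1 for (u, _) in bigrams if u == w_prev)
    lam = (d / unigrams[w_prev]) * follower_types
    # Interpolate with the plain unigram probability P(w).
    return discounted + lam * unigrams[w] / sum(unigrams.values())

unigrams, bigrams = train_counts("i can not see without my reading glasses".split())
print(p_absolute_discount("my", "reading", unigrams, bigrams))
```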
So in Kneser-Ney smoothing the idea is to keep that same interpolation we saw in absolute discounting, but to use a better estimate of the probabilities of the lower-order unigrams. For the intuition, we can go back to the classic Shannon game. Remember, in the Shannon game we’re predicting a word from the previous words, so we see a sentence like “I can’t see without my reading ___.” What’s the most likely next word? Well, “glasses” seems pretty likely. How about, instead, the word “Francisco”? That seems very unlikely in this situation. And yet “Francisco”, as just a unigram, is more common than “glasses”. The reason “Francisco” seems like a bad choice after “reading”, one intuition we can draw, is that “Francisco” always follows “San”, or very often follows “San”. So while “Francisco” is very frequent, it’s frequent in the context of the word “San”.

Now, unigrams in an interpolation model, where we’re mixing a unigram and a bigram, are useful precisely in the case where we haven’t seen the bigram. So it’s unfortunate that exactly in the case where we haven’t seen the bigram “reading Francisco”, we’re trusting “Francisco”’s unigram weight, which is just where we shouldn’t trust it.

So instead of using P(w), how likely a word is overall, our intuition is that when we’re backing off we should instead use the continuation probability. We’ll call it P_continuation of a word: how likely is the word to appear as a novel continuation? How do we measure novel continuation? For each word we just count the number of bigram types it completes: how many different bigrams does it create by appearing after another word? Each bigram type is a novel continuation the first time we see that new bigram. In other words, the continuation probability is going to be proportional to the cardinality of the set of preceding word types w_{i-1} that occur with our word: how many different words occur before this word in a bigram, how many preceding word types there are. That cardinality is the number we want the continuation probability to be proportional to: how many times does w appear as a novel continuation? To turn that into a probability, we just divide by the total number of word bigram types. So, of all word bigrams that occur more than zero times, what’s the cardinality of that set?
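To make the continuation probability concrete, here is a small sketch; the toy corpus and the function name are invented for illustration. It counts the distinct word types that precede w and divides by the total number of distinct bigram types.

```python
from collections import Counter

def continuation_probability(w, bigrams):
    """P_continuation(w): bigram types that w completes / all bigram types."""
    # Numerator: how many distinct word types precede w in a bigram.
    preceding_types = {u for (u, v) in bigrams if v == w}
    # Denominator: total number of distinct bigram types in the corpus.
    return len(preceding_types) / len(bigrams)

tokens = "san francisco is near san jose and san rafael".split()
bigrams = Counter(zip(tokens, tokens[1:]))
# "francisco" only ever follows "san", so however often it occurs, its
# continuation probability reflects that single context type.
print(continuation_probability("francisco", bigrams))
```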
So how many different word bigram types are there? We just divide the two to get a probability of continuation: of all the word bigram types, how many have w as a novel continuation?

Now it turns out that there’s an alternative metaphor for Kneser-Ney with the same equations. We can see the numerator as the total number of word types that precede w (how many word types can w follow?), and we normalize it by the number of word types that could precede any word, that is, the sum over all words of the number of word types that can precede each word. And these two are the same: this denominator and the denominator we saw on the previous slide are the same, because the number of possible bigram types is the same as the number of word types that can precede each word, summed over all words. If you think about that for a second, you’ll realize it’s true. So in other words, with this kind of Kneser-Ney model, a frequent word like “Francisco” that occurs in only one context, like “San”, will have a low continuation probability.

If we put together the intuition of absolute discounting with the Kneser-Ney continuation probability for the lower-order n-gram, we have the Kneser-Ney smoothing algorithm. For the bigram itself we just have absolute discounting: we take the bigram count, we subtract some discount d, and I’ve shown here that we take the max of that and zero because, obviously, if the discount happens to be larger than the count we don’t want a negative value. And we’re just going to interpolate that with the same continuation probability we just saw, P_continuation(w_i).

Now let’s talk about how to compute the lambda. The lambda is going to take all that probability mass from the normalized discounts that we took out of the higher-order probabilities, and use it to weight how much probability we should assign to the unigram; we’re going to combine those. So the lambda is the discount d divided by that same denominator, the count of the context; in other words, it’s the normalized discount. And then we multiply that by the total number of word types that can follow this context w_{i-1}: in other words, how many different word types did we discount, or how many times did we apply this normalized discount. We multiply those together, and we know how much probability mass in total we can now assign to the continuation of the word. Now, this is the bigram formulation of Kneser-Ney.
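Putting the discounted bigram and the continuation probability together gives the bigram Kneser-Ney estimate the lecture just walked through. The following sketch is illustrative (toy corpus, assumed discount d = 0.75), with the lambda computed as the normalized discount times the number of word types that follow the context.

```python
from collections import Counter

def kneser_ney_bigram(w_prev, w, unigrams, bigrams, d=0.75):
    """Bigram Kneser-Ney: discounted bigram plus lambda-weighted continuation."""
    # Discounted bigram term, clipped at zero so it never goes negative.
    discounted = max(bigrams[(w_prev, w)] - d, 0) / unigrams[w_prev]
    # Lambda: the normalized discount (d / c(w_prev)) times the number of
    # distinct word types that follow w_prev, i.e. how many times we discounted.
    followers = sum(1 for (u, _) in bigrams if u == w_prev)
    lam = (d / unigrams[w_prev]) * followers
    # Continuation probability: bigram types w completes / all bigram types.
    preceding_types = sum(1 for (_, v) in bigrams if v == w)
    p_cont = preceding_types / len(bigrams)
    return discounted + lam * p_cont

tokens = "i can not see without my reading glasses in san francisco".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(kneser_ney_bigram("reading", "glasses", unigrams, bigrams))
print(kneser_ney_bigram("reading", "francisco", unigrams, bigrams))
```

On this toy corpus, the seen bigram “reading glasses” scores well above the unseen “reading francisco”, whose continuation probability is low because “francisco” only ever follows “san”.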
Now, in this slide we’re showing the general recursive formulation for n-grams, and here we have to make a slight change to deal with the higher-order n-grams. So here we’re showing the Kneser-Ney probability of a word given the prefix of the word, and just like the Kneser-Ney we saw before, we’re interpolating a discounted higher-order n-gram, with a lambda weight, with a lower-order probability. But now we need to distinguish between the very first, top-level time that we use a count and the lower-order counts. So we’re going to use the actual count for the very highest-order n-gram, and we’re going to use the continuation count that we defined earlier for all the lower-order probabilities. We’ll define this new thing, count_KN of dot, which will mean the actual count at the top level. Say we’re doing trigrams: it’s the actual count for the trigram, and then we recurse and take the Kneser-Ney probability for the lower-order things; when we get down to the bigrams and unigrams, we’ll be using the continuation count, that is, the count of single-word contexts that we defined earlier.

So that’s Kneser-Ney smoothing, an excellent algorithm. It’s very commonly used in speech recognition and machine translation, and yet it has a very beautiful and elegant intuition, and I hope you appreciate it.
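For readers who want to see the recursion spelled out, here is a hedged sketch of the general formulation under the assumptions above: count_kn returns the actual count only at the highest order and the continuation count below it, the denominator sums count_kn over every word that follows the context, and the unigram base case is the continuation probability. The corpus, helper names, and discount are illustrative choices, not the lecture’s code.

```python
from collections import Counter

def count_kn(ngram, counts, highest_order):
    """Actual count at the highest order, continuation count below it."""
    if len(ngram) == highest_order:
        return counts[len(ngram)][ngram]
    # Continuation count: distinct word types that precede this n-gram.
    return sum(1 for g in counts[len(ngram) + 1] if g[1:] == ngram)

def p_kn(word, context, counts, highest_order, d=0.75):
    """Recursive Kneser-Ney probability of `word` given a context tuple."""
    if not context:
        # Base case: unigram continuation probability over all bigram types.
        return count_kn((word,), counts, highest_order) / len(counts[2])
    prefix = tuple(context)
    order = len(prefix) + 1
    followers = [g for g in counts[order] if g[:-1] == prefix]
    denom = sum(count_kn(g, counts, highest_order) for g in followers)
    if denom == 0:
        # Unseen context: fall back to the lower-order estimate directly.
        return p_kn(word, context[1:], counts, highest_order, d)
    num = max(count_kn(prefix + (word,), counts, highest_order) - d, 0)
    lam = (d / denom) * len(followers)
    return num / denom + lam * p_kn(word, context[1:], counts, highest_order, d)

tokens = "i can not see without my reading glasses".split()
counts = {n: Counter(zip(*(tokens[i:] for i in range(n)))) for n in (1, 2, 3)}
print(p_kn("glasses", ("my", "reading"), counts, highest_order=3))
```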