Today we're gonna introduce the topic of language modeling, one of the most important topics in natural language processing. The goal of language modeling is to assign a probability to a sentence. Why would we want to assign a probability to a sentence? This comes up in all sorts of applications. In machine translation, for example, we'd like to be able to distinguish between good and bad translations by their probabilities. So "high winds tonight" might be a better translation than "large winds tonight", because "high" and "winds" go together well. In spelling correction, if we see a phrase like "fifteen minuets from my house", that's more likely to be a mistake for "minutes", and one piece of information that lets us decide that is that "fifteen minutes from" is a much more likely phrase than "fifteen minuets from". And in speech recognition, a phrase like "I saw a van" is much more likely than a phrase that sounds phonetically similar, like "eyes awe of an"; it's much less likely to have that sequence of words. And it turns out language modeling plays a role in summarization, in question answering, really everywhere.

So the goal of a language model is to compute the probability of a sentence or a sequence of words. Given some sequence of words w1 through wn, we're gonna compute its probability P(W), where we use capital W to mean the sequence from w1 to wn. Now, this is related to the task of computing the probability of an upcoming word: P(w5 | w1, ..., w4) is very related to the task of computing P(w1, w2, w3, w4, w5). A model that computes either of these things, either P(W), the joint probability of the whole string, or the conditional probability of the last word given the previous words, we call a language model. Now, it might have been better to call this a grammar; technically what this is doing is telling us something about how well these words fit together, and we normally use the word "grammar" for that. But the term "language model", often abbreviated LM, is standard, so we're gonna go with that.
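To make that concrete, here is a minimal sketch in Python of the two quantities a language model provides: the joint probability of a whole word sequence, and the conditional probability of an upcoming word. The probability values below are invented purely to illustrate the "high winds tonight" and "fifteen minutes from" examples above; a real model would estimate them from data.

```python
# A toy language model "interface". The probability values are invented
# for illustration only; a real language model would estimate them from data.
toy_sequence_probs = {
    ("high", "winds", "tonight"): 1e-5,   # plausible English
    ("large", "winds", "tonight"): 1e-8,  # "large winds" is an odd collocation
}
toy_next_word_probs = {
    (("fifteen", "minutes"), "from"): 0.09,
    (("fifteen", "minuets"), "from"): 0.0001,
}

def sequence_probability(words):
    """P(w1, ..., wn): the joint probability of the whole string."""
    return toy_sequence_probs.get(tuple(words), 0.0)

def next_word_probability(prefix, word):
    """P(word | prefix): the probability of an upcoming word."""
    return toy_next_word_probs.get((tuple(prefix), word), 0.0)

# A machine translation system could prefer the candidate with higher probability.
for candidate in (("high", "winds", "tonight"), ("large", "winds", "tonight")):
    print(candidate, sequence_probability(candidate))
print(next_word_probability(("fifteen", "minutes"), "from"))
```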
So, how are we gonna compute this joint probability? Say we want to compute the probability of the phrase "its water is so transparent that", this little piece of a sentence. The intuition for how language modeling works is that you rely on the chain rule of probability. Just to remind you about the chain rule, let's think about the definition of conditional probability: P(A | B) equals P(A, B) over P(B). We can rewrite that as P(A | B) times P(B) equals P(A, B), or, turning it around, P(A, B) equals P(A | B) times P(B). We can then generalize this to more variables, so the joint probability of a whole sequence A, B, C, D is the probability of A, times B given A, times C conditioned on A and B, times D conditioned on A, B, C. That's the chain rule. In its more general form, the chain rule says the joint probability of any sequence of variables is the probability of the first, times the second conditioned on the first, times the third conditioned on the first two, and so on, up to the last conditioned on the first n minus one.

So the chain rule can be applied to compute the joint probability of the words in a sentence. Take our phrase "its water is so transparent". By the chain rule, the probability of that sequence is the probability of "its", times the probability of "water" given "its", times the probability of "is" given "its water", times the probability of "so" given "its water is", and finally times the probability of "transparent" given "its water is so". More formally, the joint probability of a sequence of words is the product over all i of the probability of each word given the prefix up to that word.

How are we gonna estimate these probabilities? Could we just count and divide? We often compute probabilities by counting and dividing. So for the probability of "the" given "its water is so transparent that", we could just count how many times "its water is so transparent that the" occurs, divide by the number of times "its water is so transparent that" occurs, and get a probability. But we can't do that, and the reason is that there are far too many possible sentences for us to ever estimate these counts; there's no way we could get enough data to see the counts of all possible sentences of English.
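Written out in notation, the definition of conditional probability, the chain rule over a word sequence, and the count-and-divide estimate just described are:

```latex
\[
P(A \mid B) = \frac{P(A, B)}{P(B)},
\qquad
P(w_1 w_2 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1})
\]
% The count-and-divide estimate that we cannot compute at the sentence level:
\[
P(\text{the} \mid \text{its water is so transparent that})
  = \frac{\text{Count}(\text{its water is so transparent that the})}
         {\text{Count}(\text{its water is so transparent that})}
\]
```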
So what we do instead is apply a simplifying assumption called the Markov assumption, named for Andrei Markov. The Markov assumption says that we estimate the probability of "the" given "its water is so transparent that" just by computing instead the probability of "the" given the word "that", the very last word of the context. Or maybe we compute the probability of "the" given just the last two words, so "the" given "transparent that". That's the Markov assumption: look just at the previous word, or maybe the couple of previous words, rather than at the entire context. More formally, the Markov assumption says that the probability of a sequence of words is the product, for each word, of the conditional probability of that word given some prefix of just the last few words. In other words, in the chain rule product of all the probabilities we're multiplying together, we approximate the probability of wᵢ given the entire prefix from 1 to i-1 by a simpler-to-compute probability: wᵢ given just the last few words.

The simplest case of a Markov model is called the unigram model. In the unigram model, we simply estimate the probability of a whole sequence of words by the product of the probabilities of the individual words, the unigrams. And if we generated sentences by randomly picking words with these probabilities, you can see that it would look like a word salad. Here are some automatically generated sentences, generated by Dan Klein, and you can see the word "fifth", the word "an", the word "of": this doesn't look like a sentence at all. It's just a random sequence of words: "thrift, did, eighty, said". That's the property of the unigram model: words are independent in this model.

Slightly more intelligent is a bigram model, where we condition on the single previous word. So again, we estimate the probability of a word given the entire prefix from the beginning to the previous word just by the previous word. Now if we use that and generate random sentences from a bigram model, the sentences look a little bit more like English. Still, something is clearly wrong with them. "outside, new, car": well, "new car" looks pretty good, "car parking" is pretty good, "parking lot" is pretty good. But together, "outside new car parking lot of the agreement reached" is not English. So even the bigram model, by giving up the richer conditioning that English really requires, limits our ability to model what's going on in a language. Now, we can extend the n-gram model further to trigrams, that's 3-grams, or 4-grams or 5-grams. A small count-based sketch of the bigram case appears below.
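Here is that sketch: a minimal count-based bigram model in Python. The tiny corpus, the <s>/</s> sentence-boundary markers, and the unsmoothed count-and-divide estimate of P(wᵢ | wᵢ₋₁) are all simplifying assumptions for illustration, not a full treatment of estimation.

```python
import random
from collections import defaultdict

# A minimal count-based bigram sketch. The toy corpus and the <s>/</s>
# sentence-boundary markers are illustrative assumptions only.
corpus = [
    "its water is so transparent that the water is clear".split(),
    "i saw a van".split(),
    "high winds tonight".split(),
]

# Count how often each word follows each previous word.
bigram_counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = ["<s>"] + sentence + ["</s>"]
    for prev, word in zip(words, words[1:]):
        bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    """Estimate P(word | prev) by counting and dividing (no smoothing)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

def generate():
    """Generate a random 'sentence' by repeatedly sampling the next word."""
    word, out = "<s>", []
    while True:
        nexts = bigram_counts[word]
        word = random.choices(list(nexts), weights=list(nexts.values()))[0]
        if word == "</s>":
            return " ".join(out)
        out.append(word)

print(bigram_prob("is", "so"))  # Count("is so") / Count("is") = 1/2
print(generate())
```

Sentences sampled from such a tiny model wander in exactly the way described above: each adjacent pair of words is locally plausible, but the whole string need not be English.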
But in general, it's clear that an n-gram model is an insufficient model of language, and the reason is that language has long-distance dependencies. Suppose I want to predict the next word after "The computer which I had just put into the machine room on the fifth floor", and I haven't seen that next word. If I condition just on the previous word, "floor", I'd be very unlucky to guess "crashed". But really, "crashed" is the main verb of the sentence, and "computer" is the head of the subject noun phrase, so if we knew "computer" was the subject, we'd be much more likely to guess "crashed". These kinds of long-distance dependencies mean that, in the limit, a really good model for predicting English words would have to take into account lots of long-distance information. But it turns out that in practice we can often get away with these n-gram models, because the local information, especially as we get up to trigrams and 4-grams, turns out to be constraining enough that in most cases it will solve our problems for us.