Today we're going to introduce the topic of language modeling, one of the most important topics in natural language processing. The goal of language modeling is to assign a probability to a sentence. Why would we want to assign a probability to a sentence? This comes up in all sorts of applications. In machine translation, for example, we'd like to be able to distinguish between good and bad translations by their probabilities. So "high winds tonight" might be a better translation than "large winds tonight", because "high" and "winds" go together well. In spelling correction, if we see a phrase like "fifteen minuets from my house", that's more likely to be a mistake for "minutes", and one piece of information that lets us decide that is that "fifteen minutes from" is a much more likely phrase than "fifteen minuets from". And in speech recognition, a phrase like "I saw a van" is much more likely than a phrase that sounds phonetically similar, like "eyes awe of an"; it's much less likely to have that sequence of words. And it turns out language modeling plays a role in summarization, question answering, really everywhere.

So the goal of a language model is to compute the probability of a sentence or a sequence of words. Given some sequence of words w1 through wn, we're going to compute its probability P(W), where we use capital W to mean the sequence from w1 to wn. Now, this is related to the task of computing the probability of an upcoming word, so P(w5 | w1, w2, w3, w4) is very closely related to the task of computing P(w1, w2, w3, w4, w5). A model that computes either of these things, either P(W), the joint probability of the whole string, or the conditional probability of the last word given the previous words, is what we call a language model. Now, it might have been better to call this a grammar; technically, what it's telling us is something about how well these words fit together, and we normally use the word grammar for that. But the term "language model", often abbreviated LM, is standard, so we're going to go with that.

So how are we going to compute this joint probability? Say we want to compute the probability of the phrase "its water is so transparent that", this little piece of a sentence. The intuition for how language modeling works is that you're going to rely on the chain rule of probability.
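To make the earlier point concrete, that computing the probability of a whole sequence and predicting an upcoming word are two sides of the same task, here is a minimal sketch (my illustration, not the lecture's code). The function `joint_prob` is a hypothetical placeholder for any model that scores a whole word sequence; given such a model, the probability of an upcoming word is just a ratio of two joint probabilities, by the definition of conditional probability.

```python
def next_word_prob(joint_prob, prefix, word):
    """P(word | prefix) = P(prefix followed by word) / P(prefix)."""
    return joint_prob(prefix + [word]) / joint_prob(prefix)

# e.g. P("van" | "I saw a") = P("I saw a van") / P("I saw a")
```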
Just to remind you about the chain rule of probability, let's think about the definition of conditional probability:

P(A | B) = P(A, B) / P(B).

We can rewrite that as P(A | B) · P(B) = P(A, B), or, turning it around,

P(A, B) = P(A | B) · P(B).

And we can generalize this to more variables, so the joint probability of a whole sequence A, B, C, D is the probability of A, times B given A, times C conditioned on A and B, times D conditioned on A, B, C:

P(A, B, C, D) = P(A) · P(B | A) · P(C | A, B) · P(D | A, B, C).

So that's the chain rule. In its more general form, the chain rule says the joint probability of any sequence of variables is the probability of the first, times the second conditioned on the first, times the third conditioned on the first two, and so on, up to the last conditioned on the first n − 1:

P(x1, x2, …, xn) = P(x1) · P(x2 | x1) · P(x3 | x1, x2) · … · P(xn | x1, …, xn−1).

All right, the chain rule. Now, the chain rule can be applied to compute the joint probability of the words in a sentence. So let's take our phrase, "its water is so transparent". By the chain rule, the probability of that sequence is the probability of "its", times the probability of "water" given "its", times the probability of "is" given "its water", times the probability of "so" given "its water is", and finally times the probability of "transparent" given "its water is so". Or, more formally, the joint probability of a sequence of words is the product over all i of the probability of each word given the prefix of words before it:

P(w1, w2, …, wn) = ∏i P(wi | w1, …, wi−1).

How are we going to estimate these probabilities? Could we just count and divide? We often compute probabilities by counting and dividing. So for the probability of "the" given "its water is so transparent that", we could just count how many times "its water is so transparent that the" occurs, divide by the number of times "its water is so transparent that" occurs, and get a probability. But we can't do that, and the reason is that there are far too many possible sentences for us to ever estimate these counts. There's no way we could get enough data to see the counts of all possible sentences of English.

So what we do instead is apply a simplifying assumption called the Markov assumption, named for Andrei Markov. The Markov assumption suggests that we estimate the probability of "the" given "its water is so transparent that" just by computing instead the probability of "the" given the word "that", the very last word in the context. Or maybe we condition on the last two words, computing the probability of "the" given "transparent that". That's the Markov assumption.
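Here is a minimal sketch of that count-and-divide idea once we make the Markov assumption and condition each word on only the single previous word. It is my own illustration, not the lecture's code; the tiny inline corpus and the "<s>" start-of-sentence symbol are assumptions, and a real model would also need smoothing so that unseen word pairs don't get probability zero.

```python
from collections import Counter

# A toy corpus, purely for illustration.
corpus = [
    "its water is so transparent that the water is clear".split(),
    "the water is so clear".split(),
]

pair_counts = Counter()     # count(previous word, word)
context_counts = Counter()  # count(previous word)
for sentence in corpus:
    words = ["<s>"] + sentence
    for prev, word in zip(words, words[1:]):
        pair_counts[(prev, word)] += 1
        context_counts[prev] += 1

def prob_given_prev(word, prev):
    """P(word | prev) = count(prev word) / count(prev): counting and dividing."""
    return pair_counts[(prev, word)] / context_counts[prev]

def sentence_prob(sentence):
    """Chain rule with the Markov assumption: condition each word on just
    the one previous word instead of the entire prefix."""
    words = ["<s>"] + sentence
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= prob_given_prev(word, prev)
    return p

print(prob_given_prev("is", "water"))                  # 3/3 = 1.0
print(sentence_prob("its water is so clear".split()))  # ≈ 0.167
```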
The idea is to just look at the previous word, or maybe the previous couple of words, rather than the entire context. More formally, the Markov assumption says that the probability of a sequence of words is the product, for each word, of the conditional probability of that word given just the last few preceding words. In other words, in the chain-rule product of all the probabilities we're multiplying together, we estimate the probability of wi given the entire prefix from w1 to wi−1 by a simpler-to-compute probability: wi given just the last few words.

The simplest case of a Markov model is called the unigram model. In the unigram model, we simply estimate the probability of a whole sequence of words by the product of the probabilities of the individual words, the "unigrams". And if we generated sentences by randomly picking words, you can see that the result would look like a word salad. Here are some automatically generated sentences, generated by Dan Klein, and you can see that the word "fifth", the word "an", the word "of" -- this doesn't look like a sentence at all. It's just a random sequence of words: "thrift, did, eighty, said". Those are the properties of the unigram model: words are independent in this model.

Slightly more intelligent is a bigram model, where we condition on the single previous word. So again, we estimate the probability of a word given the entire prefix from the beginning up to the previous word just by the previous word. Now if we use that to generate random sentences from a bigram model, the sentences look a little bit more like English. Still, something is clearly wrong with them. "outside, new, car" -- well, "new car" looks pretty good, "car parking" is pretty good, "parking lot" is pretty good. But together, "outside new car parking lot of the agreement reached" is not English. So even in the bigram model, by giving up the richer conditioning that English really has, we're limiting our ability to model what's going on in the language.
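As a quick sketch of the random-generation idea just described (my own illustration, not Dan Klein's actual generator), here is one way to sample "sentences" from unigram versus bigram counts. The corpus format, the "<s>" start symbol, and the fixed output length are assumptions for illustration; trained on real text, the unigram sampler produces word salad while the bigram sampler is locally more English-like.

```python
import random
from collections import Counter, defaultdict

def train(corpus):
    """Collect unigram counts and, for each word, counts of the words that follow it."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for sentence in corpus:
        unigrams.update(sentence)
        words = ["<s>"] + sentence
        for prev, word in zip(words, words[1:]):
            bigrams[prev][word] += 1
    return unigrams, bigrams

def generate_unigram(unigrams, length=10):
    """Pick every word independently, in proportion to its overall count."""
    words, weights = zip(*unigrams.items())
    return " ".join(random.choices(words, weights=weights, k=length))

def generate_bigram(bigrams, length=10):
    """Pick each word in proportion to how often it follows the previous word."""
    out, prev = [], "<s>"
    for _ in range(length):
        if not bigrams[prev]:        # no observed continuation; stop early
            break
        words, weights = zip(*bigrams[prev].items())
        prev = random.choices(words, weights=weights)[0]
        out.append(prev)
    return " ".join(out)

unigrams, bigrams = train(["its water is so transparent that the water is clear".split()])
print(generate_unigram(unigrams))
print(generate_bigram(bigrams))
```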
So, if we know "computer" was[br]the subject, we're much more likely to 0:08:15.078,0:08:20.010 guess crashed. So, these kind of [br]long-distance dependencies mean that in the 0:08:20.010,0:08:24.050 limit of really good model of predicting[br]English words, we'll have to take into 0:08:24.050,0:08:28.089 account lots of long-distance information.[br]But it turns out that in practice, we can 0:08:28.089,0:08:33.016 often get away with these n-gram models,[br]because the local information, especially 0:08:33.016,0:08:37.023 as we get up to trigrams and 4-grams,[br]will turn out to be just constraining 0:08:37.023,0:08:40.046 enough that it in most cases it'll solve[br]our problems for us.