Today we're going to introduce the topic of language modeling, one of the most important topics in natural language processing. The goal of language modeling is to assign a probability to a sentence. Why would we want to assign a probability to a sentence? This comes up in all sorts of applications. In machine translation, for example, we'd like to be able to distinguish between good and bad translations by their probabilities. So "high winds tonight" might be a better translation than "large winds tonight", because "high" and "winds" go together well. In spelling correction, if we see a phrase like "fifteen minuets from my house", that's more likely to be a misspelling of "minutes", and one piece of information that lets us decide that is that "fifteen minutes from" is a much more likely phrase than "fifteen minuets from". And in speech recognition, a phrase like "I saw a van" is much more likely than the phonetically similar "eyes awe of an"; it's much less likely that we'd ever see that second sequence of words. And it turns out language models play a role in summarization, in question answering, really everywhere.

So the goal of a language model is to compute the probability of a sentence or a sequence of words. Given some sequence of words w1 through wn, we're going to compute their probability P(W), and we'll use capital W to mean the sequence from w1 to wn. Now, this is related to the task of computing the probability of an upcoming word, so P of w5 given w1 through w4 is very related to the task of computing P(w1, w2, w3, w4, w5). A model that computes either of these things, either P(W), with capital W meaning a string, the joint probability of the whole string, or the conditional probability of the last word given the previous words, we call a language model. Now, it might have been better to call this a grammar; technically, what this is telling us is something about how well these words fit together, and we normally use the word grammar for that. But the term "language model", often abbreviated LM, is standard, so we're going to go with that.

So, how are we going to compute this joint probability? Say we want to compute the probability of the phrase "its water is so transparent that", this little part of a sentence. The intuition for how language modeling works is that you're going to rely on the chain rule of probability. Just to remind you about the chain rule, let's think about the definition of conditional probability: P(A | B) = P(A, B) / P(B). We can rewrite that, so P(A | B) times P(B) equals P(A, B), or, turning it around, P(A, B) = P(A | B) times P(B). We can then generalize this to more variables, so the joint probability of a whole sequence A, B, C, D is the probability of A, times B given A, times C conditioned on A and B, times D conditioned on A, B, C. That's the chain rule. In its more general form, the chain rule says that the joint probability of any sequence of variables is the probability of the first, times the probability of the second conditioned on the first, times the third conditioned on the first two, and so on, up through the last conditioned on the first n minus one. Alright, the chain rule.
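Written out compactly, this is just a restatement of the equations described aloud above, using the same P(w1, ..., wn) notation:

```latex
% Definition of conditional probability, and the chain rule it gives us.
\[
P(A \mid B) = \frac{P(A, B)}{P(B)}
\qquad\Longrightarrow\qquad
P(A, B) = P(A \mid B)\, P(B)
\]
\[
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
\]
```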
So now, the chain rule can be applied to compute the joint probability of words in a sentence. Let's suppose we have our phrase, "its water is so transparent". By the chain rule, the probability of that sequence is the probability of "its", times the probability of "water" given "its", times the probability of "is" given "its water", times the probability of "so" given "its water is", and finally times the probability of "transparent" given "its water is so". More formally, the joint probability of a sequence of words is the product over all i of the probability of each word given the prefix up until that word.

How are we going to estimate these probabilities? Could we just count and divide? We often compute probabilities by counting and dividing. So, for the probability of "the" given "its water is so transparent that", we could just count how many times "its water is so transparent that the" occurs and divide by the number of times "its water is so transparent that" occurs. We could divide the one count by the other and get a probability. But we can't do that, and the reason is that there are just far too many possible sentences for us to ever estimate these. There's no way we could get enough data to see the counts of all possible sentences of English.

So what we do instead is apply a simplifying assumption called the Markov assumption, named for Andrei Markov. The Markov assumption suggests that we estimate the probability of "the" given "its water is so transparent that" just by computing instead the probability of "the" given the word "that" -- that is, given just the last word in the context. Or maybe we approximate the probability of "the" given "its water is so transparent that" by conditioning on just the last two words, so "the" given "transparent that". That's the Markov assumption: let's just look at the previous word, or maybe the couple of previous words, rather than the entire context. More formally, the Markov assumption says the probability of a sequence of words is the product, for each word, of the conditional probability of that word given some prefix of the last few words. In other words, in the chain-rule product of all the probabilities we're multiplying together, we estimate the probability of wᵢ given the entire prefix from 1 to i-1 by a simpler-to-compute probability: wᵢ given just the last few words.

The simplest case of a Markov model is called the unigram model. In the unigram model, we simply estimate the probability of a whole sequence of words by the product of the probabilities of the individual words, the "unigrams". And if we generated sentences by randomly picking words, you can see that it would look like a word salad. Here are some automatically generated sentences produced by Dan Klein, and you can see that the word "fifth", the word "an", the word "of" -- this doesn't look like a sentence at all. It's just a random sequence of words: "thrift, did, eighty, said". Those are the properties of the unigram model: words are independent in this model.

Slightly more intelligent is a bigram model, where we condition on the single previous word. So again, we estimate the probability of a word given the entire prefix from the beginning to the previous word by just the previous word. Now if we use that and generate random sentences from a bigram model, the sentences look a little bit more like English. Still, something's clearly wrong with them. "outside, new, car" -- well, "new car" looks pretty good, "car parking" is pretty good, "parking lot". But together, "outside new car parking lot of the agreement reached": that's not English.
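To make the "counting and dividing" idea and the bigram Markov assumption concrete, here is a minimal Python sketch. The toy corpus and the names bigram_prob and generate are my own illustration, not anything from the lecture:

```python
# A minimal sketch of bigram estimation by counting and dividing
# (maximum likelihood estimation), plus random generation from the model.
from collections import Counter, defaultdict
import random

# Tiny invented corpus for illustration only.
corpus = [
    ["its", "water", "is", "so", "transparent", "that", "the", "water", "is", "clear"],
    ["the", "water", "is", "so", "clear"],
]

# Count bigrams, using <s> and </s> to mark sentence boundaries.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    """P(word | prev) = count(prev, word) / count(prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(bigram_prob("water", "is"))  # estimated P(is | water)

def generate():
    """Generate a 'random sentence': each word is chosen conditioned only
    on the single previous word -- exactly the bigram Markov assumption."""
    word, out = "<s>", []
    while True:
        candidates = bigram_counts[word]
        word = random.choices(list(candidates), weights=candidates.values())[0]
        if word == "</s>":
            return " ".join(out)
        out.append(word)

print(generate())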
So even with the bigram model, by giving up the richer conditioning that English really has, we're simplifying our ability to model what's going on in a language. Now, we can extend the n-gram model further to trigrams, that is 3-grams, or to 4-grams or 5-grams. But in general, it's clear that n-gram modeling is an insufficient model of language, and the reason is that language has long-distance dependencies. Suppose I want to predict the word after "The computer which I had just put into the machine room on the fifth floor", and I haven't seen this next word, and I want to say: what's the likelihood of the next word? If I condition just on the previous word, "floor", I'd be very unlikely to guess "crashed". But really, "crashed" is the main verb of the sentence, and "computer" is the subject, the head of the subject noun phrase. So if we know "computer" was the subject, we're much more likely to guess "crashed". These kinds of long-distance dependencies mean that, in the limit, a really good model of predicting English words will have to take into account lots of long-distance information. But it turns out that in practice we can often get away with these n-gram models, because the local information, especially as we get up to trigrams and 4-grams, will turn out to be just constraining enough that in most cases it'll solve our problems for us.
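Following up on the point about extending to trigrams and 4-grams: here is a small sketch of how the same counting generalizes to any n, by making the conditioning context the previous n-1 words instead of one. As before, train_ngrams and the toy corpus are invented for illustration:

```python
# A sketch of n-gram estimation for arbitrary n: the context is a tuple
# of the previous n-1 words, and probabilities are still count ratios.
from collections import Counter, defaultdict

def train_ngrams(corpus, n):
    """Return counts of each word given its (n-1)-word context."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] * (n - 1) + sentence + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1 : i])
            counts[context][tokens[i]] += 1
    return counts

def ngram_prob(counts, context, word):
    """P(word | context) = count(context + word) / count(context)."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0

corpus = [["its", "water", "is", "so", "transparent", "that", "the", "water", "is", "clear"]]
trigrams = train_ngrams(corpus, n=3)
print(ngram_prob(trigrams, ("water", "is"), "so"))  # estimated P(so | water is)
```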