Today we're going to introduce the topic of language modeling, one of the most important topics in natural language processing. The goal of language modeling is to assign a probability to a sentence. Why would we want to assign a probability to a sentence? This comes up in all sorts of applications. In machine translation, for example, we'd like to be able to distinguish between good and bad translations by their probabilities. So "high winds tonight" might be a better translation than "large winds tonight", because "high" and "winds" go together well. In spelling correction, if we see a phrase like "fifteen minuets from my house", that's more likely to be a mistake for "minutes", and one piece of information that lets us decide that is that "fifteen minutes from" is a much more likely phrase than "fifteen minuets from". And in speech recognition, a phrase like "I saw a van" is much more likely than a phrase that sounds phonetically similar, "eyes awe of an", because it's much less likely to have that sequence of words. And it turns out language models play a role in summarization, question answering, really everywhere.

So the goal of a language model is to compute the probability of a sentence or a sequence of words. Given some sequence of words w1 through wn, we're going to compute their probability P(W), where we use capital W to mean the sequence from w1 to wn. Now, this is related to the task of computing the probability of an upcoming word, so P(w5 | w1, w2, w3, w4) is very related to the task of computing P(w1, w2, w3, w4, w5). A model that computes either of these things, either P(W), the joint probability of the whole string, or the conditional probability of the last word given the previous words, we call a language model.

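In our notation, those two quantities are:

    P(W) = P(w_1, w_2, \ldots, w_n)    (the joint probability of the whole string)

    P(w_5 \mid w_1, w_2, w_3, w_4)    (the conditional probability of an upcoming word)
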
-
this the grammar. I mean technically what
this is, is telling us something about how
-
good these words fit together. And we
normally use the word grammar for that,
-
but it turns out that the word "language
model", and often we'll see the acronym LM,
-
is standard, so we're gonna go with that.
So, how are we going to compute this joint probability? Let's say we want to compute the probability of the phrase "its water is so transparent that", this little piece of a sentence. The intuition for how language modeling works is that we're going to rely on the chain rule of probability. Just to remind you about the chain rule, let's think about the definition of conditional probability: P(A | B) equals P(A, B) over P(B). We can rewrite that as P(A | B) times P(B) equals P(A, B), or, turning it around, P(A, B) equals P(A | B) times P(B). We can then generalize this to more variables, so the joint probability of a whole sequence A, B, C, D is the probability of A, times B given A, times C conditioned on A and B, times D conditioned on A, B, and C. In its more general form, the chain rule says that the joint probability of any sequence of variables is the probability of the first, times the second conditioned on the first, times the third conditioned on the first two, and so on, up to the last conditioned on the first n−1. Alright, that's the chain rule.

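Written out as a formula, the chain rule is:

    P(x_1, x_2, \ldots, x_n) = P(x_1) \, P(x_2 \mid x_1) \, P(x_3 \mid x_1, x_2) \cdots P(x_n \mid x_1, \ldots, x_{n-1})
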
So now the chain rule can be applied to compute the joint probability of the words in a sentence. Let's take our phrase "its water is so transparent". By the chain rule, the probability of that sequence is the probability of "its", times the probability of "water" given "its", times the probability of "is" given "its water", times the probability of "so" given "its water is", and finally times the probability of "transparent" given "its water is so". Or, more formally, the joint probability of a sequence of words is the product over all i of the probability of each word conditioned on the prefix up to that word.

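In equations, that worked example and its general form are:

    P(\text{its water is so transparent}) = P(\text{its}) \times P(\text{water} \mid \text{its}) \times P(\text{is} \mid \text{its water}) \times P(\text{so} \mid \text{its water is}) \times P(\text{transparent} \mid \text{its water is so})

    P(w_1 w_2 \ldots w_n) = \prod_i P(w_i \mid w_1 w_2 \ldots w_{i-1})
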
How are we going to estimate these probabilities? Could we just count and divide? We often compute probabilities by counting and dividing. So, for the probability of "the" given "its water is so transparent that", we could just count how many times "its water is so transparent that the" occurs, divide by the number of times "its water is so transparent that" occurs, and get a probability.

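As an equation, that count-and-divide estimate would be:

    P(\text{the} \mid \text{its water is so transparent that}) = \frac{\text{Count}(\text{its water is so transparent that the})}{\text{Count}(\text{its water is so transparent that})}
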
But we can't do that, and the reason is that there are far too many possible sentences for us to ever estimate these probabilities. There's no way we could get enough data to see the counts of all possible sentences of English.

So what we do instead is apply a simplifying assumption called the Markov assumption, named for Andrei Markov. The Markov assumption suggests that we estimate the probability of "the" given "its water is so transparent that" by computing instead the probability of "the" given just the word "that", the very last word in the context. Or maybe we compute the probability of "the" given just the last two words, so "the" given "transparent that". That's the Markov assumption: look at just the previous word, or maybe the previous couple of words, rather than the entire context. More formally, the Markov assumption says the probability of a sequence of words is the product, for each word, of the conditional probability of that word given some prefix of the last few words. In other words, in the chain rule product of all the probabilities we're multiplying together, we estimate the probability of wᵢ given the entire prefix from 1 to i−1 by a simpler-to-compute probability: wᵢ given just the last few words.

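Written as an approximation:

    P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})

for some small k, so each word depends only on the k words immediately before it (k = 1 gives a bigram model, k = 2 a trigram model).
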
The simplest case of a Markov model is called the unigram model. In the unigram model, we simply estimate the probability of a whole sequence of words by the product of the probabilities of the individual words, the "unigrams". And if we generate sentences by randomly picking words, you can see that the result looks like a word salad. Here are some automatically generated sentences, generated by Dan Klein, and you can see the word "fifth", the word "an", the word "of": this doesn't look like a sentence at all. It's just a random sequence of words: "thrift, did, eighty, said". Those are the properties of the unigram model: words are independent in this model.

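To make that concrete, here's a minimal sketch of unigram estimation and generation in Python. The toy corpus is made up for illustration; a real model would be trained on a large corpus.

    import random
    from collections import Counter

    # Toy training text, invented for this example.
    corpus = ("its water is so transparent that the water is clear "
              "the computer crashed and the water is so clear").split()

    # Unigram model: each word's probability is just its relative
    # frequency, with no conditioning on context at all.
    counts = Counter(corpus)
    total = sum(counts.values())
    words = list(counts)
    weights = [counts[w] / total for w in words]

    # Generation is repeated independent draws, which is why the
    # output reads as a word salad.
    random.seed(0)
    print(" ".join(random.choices(words, weights=weights, k=10)))

Because every draw ignores context, the output has exactly the word-salad quality of the examples above.
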
Slightly more intelligent is a bigram model, where we condition on the single previous word. So again, we estimate the probability of a word given the entire prefix, from the beginning up to the previous word, by just the previous word. Now if we use that to generate random sentences from a bigram model, the sentences look a little bit more like English. Still, something is clearly wrong with them. "outside, new, car": well, "new car" looks pretty good, "car parking" is pretty good, "parking lot" is pretty good. But together, "outside new car parking lot of the agreement reached", that's not English. So even in the bigram model, by giving up the longer-range conditioning that English has, we're limiting our ability to model what's going on in a language.

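Here's the same kind of sketch for a bigram model, again with an invented toy corpus; the <s> and </s> sentence-boundary markers are an assumption added for the example, not something from the lecture.

    import random
    from collections import Counter, defaultdict

    # Toy corpus with made-up sentence-boundary markers <s> and </s>.
    corpus = ("<s> its water is so transparent </s> "
              "<s> the computer crashed </s> "
              "<s> the water is so clear </s>").split()

    # Bigram counts: for each word, how often each other word follows it.
    following = defaultdict(Counter)
    for prev, curr in zip(corpus, corpus[1:]):
        following[prev][curr] += 1

    # Generate by sampling each word conditioned only on the previous word.
    random.seed(0)
    word, sentence = "<s>", []
    while len(sentence) < 15:
        nxt = following[word]
        word = random.choices(list(nxt), weights=list(nxt.values()))[0]
        if word in ("<s>", "</s>"):
            break
        sentence.append(word)
    print(" ".join(sentence))

Each adjacent pair of words in the output tends to look plausible, but the sentence as a whole can still wander, just like the bigram examples above.
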
Now, we can extend the n-gram model further to trigrams, that's 3-grams, or to 4-grams or 5-grams. But in general, it's clear that n-gram modeling is an insufficient model of language, and the reason is that language has long-distance dependencies. Suppose I want to predict the next word after "The computer which I had just put into the machine room on the fifth floor", and I haven't seen that next word. If I condition just on the previous word, "floor", I'd be very unlucky to guess "crashed". But really, "crashed" is the main verb of the sentence, and "computer" is the subject, the head of the subject noun phrase. So if we knew "computer" was the subject, we'd be much more likely to guess "crashed". These kinds of long-distance dependencies mean that, in the limit, a really good model for predicting English words would have to take lots of long-distance information into account. But it turns out that in practice we can often get away with these n-gram models, because the local information, especially as we get up to trigrams and 4-grams, turns out to be constraining enough that in most cases it will solve our problems for us.