Today we're gonna introduce the topic of language modeling, one of the most important topics in natural language processing. The goal of language modeling is to assign a probability to a sentence. Why would we want to assign a probability to a sentence? This comes up in all sorts of applications. In machine translation, for example, we'd like to be able to distinguish between good and bad translations by their probabilities. So "high winds tonight" might be a better translation than "large winds tonight", because "high" and "winds" go together well. In spelling correction, if we see a phrase like "fifteen minuets from my house", that's more likely to be a mistake for "minutes", and one piece of information that lets us decide that is that "fifteen minutes from" is a much more likely phrase than "fifteen minuets from". And in speech recognition, a phrase like "I saw a van" is much more likely than a phrase that sounds phonetically similar, like "eyes awe of an"; it's much less likely to have that sequence of words. And it turns out language modeling plays a role in summarization, in question answering, really everywhere.

So the goal of a language model is to compute the probability of a sentence or a sequence of words. Given some sequence of words w1 through wn, we're gonna compute its probability P(W), where we use capital W to mean the sequence from w1 to wn. Now, this is related to the task of computing the probability of an upcoming word: P(w5 | w1, ..., w4) is very related to the task of computing P(w1, w2, w3, w4, w5). A model that computes either of these things, either P(W), the joint probability of the whole string, or the conditional probability of the last word given the previous words, we call a language model. Now, it might have been better to call this a grammar; technically what this is doing is telling us something about how well these words fit together, and we normally use the word "grammar" for that. But the term "language model", often abbreviated LM, is standard, so we're gonna go with that.
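To make that concrete, here is a minimal sketch in Python of the two quantities a language model provides: the joint probability of a whole word sequence, and the conditional probability of an upcoming word. The probability values below are invented purely to illustrate the "high winds tonight" and "fifteen minutes from" examples above; a real model would estimate them from data.

```python
# A toy language model "interface". The probability values are invented
# for illustration only; a real language model would estimate them from data.
toy_sequence_probs = {
    ("high", "winds", "tonight"): 1e-5,   # plausible English
    ("large", "winds", "tonight"): 1e-8,  # "large winds" is an odd collocation
}
toy_next_word_probs = {
    (("fifteen", "minutes"), "from"): 0.09,
    (("fifteen", "minuets"), "from"): 0.0001,
}

def sequence_probability(words):
    """P(w1, ..., wn): the joint probability of the whole string."""
    return toy_sequence_probs.get(tuple(words), 0.0)

def next_word_probability(prefix, word):
    """P(word | prefix): the probability of an upcoming word."""
    return toy_next_word_probs.get((tuple(prefix), word), 0.0)

# A machine translation system could prefer the candidate with higher probability.
for candidate in (("high", "winds", "tonight"), ("large", "winds", "tonight")):
    print(candidate, sequence_probability(candidate))
print(next_word_probability(("fifteen", "minutes"), "from"))
```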
So, how are we gonna compute this joint probability? Say we want to compute the probability of the phrase "its water is so transparent that", this little piece of a sentence. The intuition for how language modeling works is that you rely on the chain rule of probability. Just to remind you about the chain rule, let's think about the definition of conditional probability: P(A | B) equals P(A, B) over P(B). We can rewrite that as P(A | B) times P(B) equals P(A, B), or, turning it around, P(A, B) equals P(A | B) times P(B). We can then generalize this to more variables, so the joint probability of a whole sequence A, B, C, D is the probability of A, times B given A, times C conditioned on A and B, times D conditioned on A, B, C. That's the chain rule. In its more general form, the chain rule says the joint probability of any sequence of variables is the probability of the first, times the second conditioned on the first, times the third conditioned on the first two, and so on, up to the last conditioned on the first n minus one.

So the chain rule can be applied to compute the joint probability of the words in a sentence. Take our phrase "its water is so transparent". By the chain rule, the probability of that sequence is the probability of "its", times the probability of "water" given "its", times the probability of "is" given "its water", times the probability of "so" given "its water is", and finally times the probability of "transparent" given "its water is so". More formally, the joint probability of a sequence of words is the product over all i of the probability of each word given the prefix up to that word.

How are we gonna estimate these probabilities? Could we just count and divide? We often compute probabilities by counting and dividing. So for the probability of "the" given "its water is so transparent that", we could just count how many times "its water is so transparent that the" occurs, divide by the number of times "its water is so transparent that" occurs, and get a probability. But we can't do that, and the reason is that there are far too many possible sentences for us to ever estimate these counts; there's no way we could get enough data to see the counts of all possible sentences of English.
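Written out in notation, the definition of conditional probability, the chain rule over a word sequence, and the count-and-divide estimate just described are:

```latex
\[
P(A \mid B) = \frac{P(A, B)}{P(B)},
\qquad
P(w_1 w_2 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1})
\]
% The count-and-divide estimate that we cannot compute at the sentence level:
\[
P(\text{the} \mid \text{its water is so transparent that})
  = \frac{\text{Count}(\text{its water is so transparent that the})}
         {\text{Count}(\text{its water is so transparent that})}
\]
```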
So what we do instead is apply a simplifying assumption called the Markov assumption, named for Andrei Markov. The Markov assumption says that we estimate the probability of "the" given "its water is so transparent that" just by computing instead the probability of "the" given the word "that", the very last word of the context. Or maybe we compute the probability of "the" given just the last two words, so "the" given "transparent that". That's the Markov assumption: look just at the previous word, or maybe the couple of previous words, rather than at the entire context. More formally, the Markov assumption says that the probability of a sequence of words is the product, for each word, of the conditional probability of that word given some prefix of just the last few words. In other words, in the chain rule product of all the probabilities we're multiplying together, we approximate the probability of wᵢ given the entire prefix from 1 to i-1 by a simpler-to-compute probability: wᵢ given just the last few words.

The simplest case of a Markov model is called the unigram model. In the unigram model, we simply estimate the probability of a whole sequence of words by the product of the probabilities of the individual words, the unigrams. And if we generated sentences by randomly picking words with these probabilities, you can see that it would look like a word salad. Here are some automatically generated sentences, generated by Dan Klein, and you can see the word "fifth", the word "an", the word "of": this doesn't look like a sentence at all. It's just a random sequence of words: "thrift, did, eighty, said". That's the property of the unigram model: words are independent in this model.

Slightly more intelligent is a bigram model, where we condition on the single previous word. So again, we estimate the probability of a word given the entire prefix from the beginning to the previous word just by the previous word. Now if we use that and generate random sentences from a bigram model, the sentences look a little bit more like English. Still, something is clearly wrong with them. "outside, new, car": well, "new car" looks pretty good, "car parking" is pretty good, "parking lot" is pretty good. But together, "outside new car parking lot of the agreement reached" is not English. So even the bigram model, by giving up the richer conditioning that English really requires, limits our ability to model what's going on in a language. Now, we can extend the n-gram model further to trigrams, that's 3-grams, or 4-grams or 5-grams. A small count-based sketch of the bigram case appears below.
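Here is that sketch: a minimal count-based bigram model in Python. The tiny corpus, the <s>/</s> sentence-boundary markers, and the unsmoothed count-and-divide estimate of P(wᵢ | wᵢ₋₁) are all simplifying assumptions for illustration, not a full treatment of estimation.

```python
import random
from collections import defaultdict

# A minimal count-based bigram sketch. The toy corpus and the <s>/</s>
# sentence-boundary markers are illustrative assumptions only.
corpus = [
    "its water is so transparent that the water is clear".split(),
    "i saw a van".split(),
    "high winds tonight".split(),
]

# Count how often each word follows each previous word.
bigram_counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = ["<s>"] + sentence + ["</s>"]
    for prev, word in zip(words, words[1:]):
        bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    """Estimate P(word | prev) by counting and dividing (no smoothing)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

def generate():
    """Generate a random 'sentence' by repeatedly sampling the next word."""
    word, out = "<s>", []
    while True:
        nexts = bigram_counts[word]
        word = random.choices(list(nexts), weights=list(nexts.values()))[0]
        if word == "</s>":
            return " ".join(out)
        out.append(word)

print(bigram_prob("is", "so"))  # Count("is so") / Count("is") = 1/2
print(generate())
```

Sentences sampled from such a tiny model wander in exactly the way described above: each adjacent pair of words is locally plausible, but the whole string need not be English.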
But in general, it's clear that an n-gram model is an insufficient model of language, and the reason is that language has long-distance dependencies. Suppose I want to predict the next word after "The computer which I had just put into the machine room on the fifth floor", and I haven't seen that next word. If I condition just on the previous word, "floor", I'd be very unlucky to guess "crashed". But really, "crashed" is the main verb of the sentence, and "computer" is the head of the subject noun phrase, so if we knew "computer" was the subject, we'd be much more likely to guess "crashed". These kinds of long-distance dependencies mean that, in the limit, a really good model for predicting English words would have to take into account lots of long-distance information. But it turns out that in practice we can often get away with these n-gram models, because the local information, especially as we get up to trigrams and 4-grams, turns out to be constraining enough that in most cases it will solve our problems for us.