Today we're going to introduce the topic of language modeling, one of the most important topics in natural language processing. The goal of language modeling is to assign a probability to a sentence. Why would we want to assign a probability to a sentence? This comes up in all sorts of applications. In machine translation, for example, we'd like to be able to distinguish between good and bad translations by their probabilities. So "high winds tonight" might be a better translation than "large winds tonight", because "high" and "winds" go together well. In spelling correction, if we see a phrase like "fifteen minuets from my house", that's more likely to be a mistake for "minutes", and one piece of information that lets us decide that is that "fifteen minutes from" is a much more likely phrase than "fifteen minuets from". And in speech recognition, a phrase like "I saw a van" is much more likely than a phrase that sounds phonetically similar, "eyes awe of an"; it's much less likely to have that sequence of words. And it turns out language models play a role in summarization, question answering, really everywhere.

So the goal of a language model is to compute the probability of a sentence or a sequence of words. Given some sequence of words w1 through wn, we're going to compute its probability P(W), where we use capital W to mean the sequence from w1 to wn. Now, this is related to the task of computing the probability of an upcoming word: P(w5 | w1, w2, w3, w4) is very related to the task of computing P(w1, w2, w3, w4, w5). A model that computes either of these things, either P(W), the joint probability of the whole string, or the conditional probability of the last word given the previous words, we call a language model.
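To make the relationship between the joint and the conditional probability concrete, here is a minimal sketch in Python. The probability values are invented toy numbers, not estimates from any real corpus; the point is only that a next-word probability can be read off from two joint probabilities.

    # Toy joint probabilities for two word sequences (made-up values,
    # purely illustrative; a real language model would estimate these).
    joint = {
        ("high", "winds"): 5e-6,
        ("high", "winds", "tonight"): 2e-7,
    }

    def next_word_prob(history, word):
        # P(word | history) = P(history, word) / P(history)
        return joint[tuple(history) + (word,)] / joint[tuple(history)]

    print(next_word_prob(["high", "winds"], "tonight"))  # 0.04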
Now, it might have been better to call this a grammar; technically, what this is telling us is something about how well these words fit together, and we normally use the word grammar for that. But it turns out that the term "language model", often abbreviated LM, is standard, so we're going to go with that.

So how are we going to compute this joint probability? Say we want to compute the probability of the phrase "its water is so transparent that", this little piece of a sentence. The intuition for how language modeling works is that you're going to rely on the chain rule of probability. Just to remind you of the chain rule, let's start from the definition of conditional probability: P(A | B) = P(A, B) / P(B). We can rewrite that as P(A | B) × P(B) = P(A, B), or, turning it around, P(A, B) = P(A | B) × P(B). We can then generalize this to more variables, so the joint probability of a whole sequence A, B, C, D is the probability of A, times B given A, times C conditioned on A and B, times D conditioned on A, B, C. That's the chain rule. In its more general form, the chain rule says the joint probability of any sequence of variables is the probability of the first, times the second conditioned on the first, times the third conditioned on the first two, and so on, up through the last conditioned on the first n minus one.

All right, the chain rule. The chain rule can now be applied to compute the joint probability of the words in a sentence. Suppose we have our phrase, "its water is so transparent". By the chain rule, the probability of that sequence is the probability of "its", times the probability of "water" given "its", times the probability of "is" given "its water", times the probability of "so" given "its water is", and finally times the probability of "transparent" given "its water is so".
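Here is a small sketch of that decomposition in Python. The conditional probabilities are placeholder values made up for illustration; the only point is how the chain rule multiplies the factors together.

    import math

    # Chain-rule factors for "its water is so transparent": each entry is
    # (word, history, P(word | history)), with invented placeholder values.
    factors = [
        ("its",         "",                0.020),
        ("water",       "its",             0.010),
        ("is",          "its water",       0.150),
        ("so",          "its water is",    0.050),
        ("transparent", "its water is so", 0.001),
    ]

    # P("its water is so transparent") is the product of the factors.
    joint = math.prod(p for _, _, p in factors)
    print(joint)  # 1.5e-09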
Or, more formally, the joint probability of a sequence of words is the product over all i of the probability of each word given the prefix up to that word: P(w1, w2, ..., wn) = the product over i of P(wi | w1, ..., wi-1).

How are we going to estimate these probabilities? Could we just count and divide? We often compute probabilities by counting and dividing. So for the probability of "the" given "its water is so transparent that", we could just count how many times "its water is so transparent that the" occurs, divide by the number of times "its water is so transparent that" occurs, and get a probability. But we can't do that, and the reason is that there are far too many possible sentences for us to ever estimate these. There's no way we could get enough data to see the counts of all possible sentences of English.

So what we do instead is apply a simplifying assumption called the Markov assumption, named for Andrei Markov. The Markov assumption suggests that we estimate the probability of "the" given "its water is so transparent that" by instead computing the probability of "the" given just the word "that", the very last word in the sequence, or maybe the probability of "the" given just the last two words, "transparent that". That's the Markov assumption: look at just the previous word, or maybe the couple of previous words, rather than the entire context. More formally, the Markov assumption says the probability of a sequence of words is the product, for each word, of the conditional probability of that word given some prefix of just the last few words. In other words, in the chain-rule product of all the probabilities we're multiplying together, we estimate the probability of wi given the entire prefix from 1 to i-1 by a simpler-to-compute probability: wi given just the last few words.
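As a rough sketch of what that looks like in code, here is a count-and-divide estimator over a tiny toy corpus; both the corpus and the Markov order k = 1 are arbitrary choices for illustration. Instead of conditioning on the full prefix, we condition on only its last k words.

    # A tiny toy corpus, just for illustration.
    corpus = "its water is so transparent that the water is so clear".split()

    def markov_prob(word, prefix, k=1):
        """Estimate P(word | prefix) under a k-th order Markov assumption:
        condition only on the last k words of the prefix, and estimate the
        probability by counting and dividing."""
        context = tuple(prefix[-k:])
        context_count = continuation_count = 0
        for i in range(len(corpus) - k):
            if tuple(corpus[i:i + k]) == context:
                context_count += 1
                if corpus[i + k] == word:
                    continuation_count += 1
        return continuation_count / context_count if context_count else 0.0

    # P(the | its water is so transparent that) approximated by P(the | that).
    print(markov_prob("the", "its water is so transparent that".split()))  # 1.0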
The simplest case of a Markov model is called the unigram model. In the unigram model, we simply estimate the probability of a whole sequence of words by the product of the probabilities of the individual words, the "unigrams". And if we generated sentences by randomly picking words this way, you can see that they would look like word salad. Here are some automatically generated sentences, generated by Dan Klein, and you can see the word "fifth", the word "an", the word "of": this doesn't look like a sentence at all. It's just a random sequence of words: "thrift, did, eighty, said". Those are the properties of the unigram model: words are independent in this model.

Slightly more intelligent is a bigram model, where we condition on the single previous word. So again, we estimate the probability of a word given the entire prefix from the beginning up through the previous word by just the previous word. If we use that and generate random sentences from a bigram model, the sentences look a little more like English. Still, something's clearly wrong with them. "outside, new, car": well, "new car" looks pretty good, "car parking" is pretty good, "parking lot" is pretty good. But together, "outside new car parking lot of the agreement reached", that's not English. So even with a bigram model, by giving up the longer-range conditioning that English has, we're limiting our ability to model what's going on in a language.
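Here is a small sketch of that kind of random generation, using a toy corpus and maximum-likelihood counts (both choices are only for illustration): a unigram sampler that ignores context entirely, and a bigram sampler that conditions on the previous word.

    import random
    from collections import Counter, defaultdict

    # A toy corpus; real experiments would use far more text.
    corpus = "the water is so clear that the water looks transparent".split()

    # Unigram model: sample each word independently, proportional to its count.
    unigram_counts = Counter(corpus)

    def generate_unigram(length=8):
        words = list(unigram_counts)
        weights = [unigram_counts[w] for w in words]
        return " ".join(random.choices(words, weights=weights, k=length))

    # Bigram model: sample each word conditioned on the previous word.
    bigram_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigram_counts[prev][nxt] += 1

    def generate_bigram(start="the", length=8):
        sentence = [start]
        for _ in range(length - 1):
            followers = bigram_counts[sentence[-1]]
            if not followers:
                break
            nxt, = random.choices(list(followers), weights=list(followers.values()))
            sentence.append(nxt)
        return " ".join(sentence)

    print(generate_unigram())  # word salad
    print(generate_bigram())   # locally more English-like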
Now we can extend the n-gram model further to trigrams, that is, 3-grams, or to 4-grams or 5-grams. But in general, it's clear that n-gram modeling is an insufficient model of language, and the reason is that language has long-distance dependencies. Say I want to predict the next word after "The computer which I had just put into the machine room on the fifth floor", and I haven't seen that next word. If I condition just on the previous word, "floor", I'd be very unlikely to guess "crashed". But really, "crashed" is the main verb of the sentence, and "computer" is the subject, the head of the subject noun phrase. So if we know "computer" was the subject, we're much more likely to guess "crashed". These kinds of long-distance dependencies mean that, in the limit, a really good model for predicting English words will have to take lots of long-distance information into account. But it turns out that in practice we can often get away with these n-gram models, because the local information, especially as we get up to trigrams and 4-grams, turns out to be constraining enough that in most cases it will solve our problems for us.