Today we're going to introduce the topic of language modeling, one of the most important topics in natural language processing. The goal of language modeling is to assign a probability to a sentence. Why would we want to assign a probability to a sentence? This comes up in all sorts of applications. In machine translation, for example, we'd like to be able to distinguish between good and bad translations by their probabilities. So "high winds tonight" might be a better translation than "large winds tonight", because "high" and "winds" go together well. In spelling correction, if we see a phrase like "fifteen minuets from my house", that's more likely to be a misspelling of "minutes", and one piece of information that lets us decide that is that "fifteen minutes from" is a much more likely phrase than "fifteen minuets from". And in speech recognition, a phrase like "I saw a van" is much more likely than the phonetically similar "eyes awe of an"; it's much less likely that we'd ever see that second sequence of words. And it turns out language models play a role in summarization, in question answering, really everywhere.

So the goal of a language model is to compute the probability of a sentence or a sequence of words. Given some sequence of words w1 through wn, we're going to compute their probability P(W), and we'll use capital W to mean the sequence from w1 to wn. Now, this is related to the task of computing the probability of an upcoming word, so P of w5 given w1 through w4 is very related to the task of computing P(w1, w2, w3, w4, w5). A model that computes either of these things, either P(W), with capital W meaning a string, the joint probability of the whole string, or the conditional probability of the last word given the previous words, we call a language model. Now, it might have been better to call this a grammar; technically, what this is telling us is something about how well these words fit together, and we normally use the word grammar for that. But the term "language model", often abbreviated LM, is standard, so we're going to go with that.

So, how are we going to compute this joint probability? Say we want to compute the probability of the phrase "its water is so transparent that", this little part of a sentence. The intuition for how language modeling works is that you're going to rely on the chain rule of probability. Just to remind you about the chain rule, let's think about the definition of conditional probability: P(A | B) = P(A, B) / P(B). We can rewrite that, so P(A | B) times P(B) equals P(A, B), or, turning it around, P(A, B) = P(A | B) times P(B). We can then generalize this to more variables, so the joint probability of a whole sequence A, B, C, D is the probability of A, times B given A, times C conditioned on A and B, times D conditioned on A, B, C. That's the chain rule. In its more general form, the chain rule says that the joint probability of any sequence of variables is the probability of the first, times the probability of the second conditioned on the first, times the third conditioned on the first two, and so on, up through the last conditioned on the first n minus one. Alright, the chain rule.
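Written out compactly, this is just a restatement of the equations described aloud above, using the same P(w1, ..., wn) notation:

```latex
% Definition of conditional probability, and the chain rule it gives us.
\[
P(A \mid B) = \frac{P(A, B)}{P(B)}
\qquad\Longrightarrow\qquad
P(A, B) = P(A \mid B)\, P(B)
\]
\[
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
\]
```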
So now, the chain rule can be applied to compute the joint probability of words in a sentence. Let's suppose we have our phrase, "its water is so transparent". By the chain rule, the probability of that sequence is the probability of "its", times the probability of "water" given "its", times the probability of "is" given "its water", times the probability of "so" given "its water is", and finally times the probability of "transparent" given "its water is so". More formally, the joint probability of a sequence of words is the product over all i of the probability of each word given the prefix up until that word.

How are we going to estimate these probabilities? Could we just count and divide? We often compute probabilities by counting and dividing. So, for the probability of "the" given "its water is so transparent that", we could just count how many times "its water is so transparent that the" occurs and divide by the number of times "its water is so transparent that" occurs. We could divide the one count by the other and get a probability. But we can't do that, and the reason is that there are just far too many possible sentences for us to ever estimate these. There's no way we could get enough data to see the counts of all possible sentences of English.

So what we do instead is apply a simplifying assumption called the Markov assumption, named for Andrei Markov. The Markov assumption suggests that we estimate the probability of "the" given "its water is so transparent that" just by computing instead the probability of "the" given the word "that" -- that is, given just the last word in the context. Or maybe we approximate the probability of "the" given "its water is so transparent that" by conditioning on just the last two words, so "the" given "transparent that". That's the Markov assumption: let's just look at the previous word, or maybe the couple of previous words, rather than the entire context. More formally, the Markov assumption says the probability of a sequence of words is the product, for each word, of the conditional probability of that word given some prefix of the last few words. In other words, in the chain-rule product of all the probabilities we're multiplying together, we estimate the probability of wᵢ given the entire prefix from 1 to i-1 by a simpler-to-compute probability: wᵢ given just the last few words.

The simplest case of a Markov model is called the unigram model. In the unigram model, we simply estimate the probability of a whole sequence of words by the product of the probabilities of the individual words, the "unigrams". And if we generated sentences by randomly picking words, you can see that it would look like a word salad. Here are some automatically generated sentences produced by Dan Klein, and you can see that the word "fifth", the word "an", the word "of" -- this doesn't look like a sentence at all. It's just a random sequence of words: "thrift, did, eighty, said". Those are the properties of the unigram model: words are independent in this model.

Slightly more intelligent is a bigram model, where we condition on the single previous word. So again, we estimate the probability of a word given the entire prefix from the beginning to the previous word by just the previous word. Now if we use that and generate random sentences from a bigram model, the sentences look a little bit more like English. Still, something's clearly wrong with them. "outside, new, car" -- well, "new car" looks pretty good, "car parking" is pretty good, "parking lot". But together, "outside new car parking lot of the agreement reached": that's not English.
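To make the "counting and dividing" idea and the bigram Markov assumption concrete, here is a minimal Python sketch. The toy corpus and the names bigram_prob and generate are my own illustration, not anything from the lecture:

```python
# A minimal sketch of bigram estimation by counting and dividing
# (maximum likelihood estimation), plus random generation from the model.
from collections import Counter, defaultdict
import random

# Tiny invented corpus for illustration only.
corpus = [
    ["its", "water", "is", "so", "transparent", "that", "the", "water", "is", "clear"],
    ["the", "water", "is", "so", "clear"],
]

# Count bigrams, using <s> and </s> to mark sentence boundaries.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    """P(word | prev) = count(prev, word) / count(prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(bigram_prob("water", "is"))  # estimated P(is | water)

def generate():
    """Generate a 'random sentence': each word is chosen conditioned only
    on the single previous word -- exactly the bigram Markov assumption."""
    word, out = "<s>", []
    while True:
        candidates = bigram_counts[word]
        word = random.choices(list(candidates), weights=candidates.values())[0]
        if word == "</s>":
            return " ".join(out)
        out.append(word)

print(generate())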
So even with the bigram model, by giving up the richer conditioning that English really has, we're simplifying our ability to model what's going on in a language. Now, we can extend the n-gram model further to trigrams, that is 3-grams, or to 4-grams or 5-grams. But in general, it's clear that n-gram modeling is an insufficient model of language, and the reason is that language has long-distance dependencies. Suppose I want to predict the word after "The computer which I had just put into the machine room on the fifth floor", and I haven't seen this next word, and I want to say: what's the likelihood of the next word? If I condition just on the previous word, "floor", I'd be very unlikely to guess "crashed". But really, "crashed" is the main verb of the sentence, and "computer" is the subject, the head of the subject noun phrase. So if we know "computer" was the subject, we're much more likely to guess "crashed". These kinds of long-distance dependencies mean that, in the limit, a really good model of predicting English words will have to take into account lots of long-distance information. But it turns out that in practice we can often get away with these n-gram models, because the local information, especially as we get up to trigrams and 4-grams, will turn out to be just constraining enough that in most cases it'll solve our problems for us.
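Following up on the point about extending to trigrams and 4-grams: here is a small sketch of how the same counting generalizes to any n, by making the conditioning context the previous n-1 words instead of one. As before, train_ngrams and the toy corpus are invented for illustration:

```python
# A sketch of n-gram estimation for arbitrary n: the context is a tuple
# of the previous n-1 words, and probabilities are still count ratios.
from collections import Counter, defaultdict

def train_ngrams(corpus, n):
    """Return counts of each word given its (n-1)-word context."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] * (n - 1) + sentence + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1 : i])
            counts[context][tokens[i]] += 1
    return counts

def ngram_prob(counts, context, word):
    """P(word | context) = count(context + word) / count(context)."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0

corpus = [["its", "water", "is", "so", "transparent", "that", "the", "water", "is", "clear"]]
trigrams = train_ngrams(corpus, n=3)
print(ngram_prob(trigrams, ("water", "is"), "so"))  # estimated P(so | water is)
```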