Local structure that doesn't require full table representations is important in both directed and undirected models. How do we incorporate local structure into undirected models? The framework for that is called "log-linear models," for reasons that will become clear in just a moment.

Whereas in the original representation of the unnormalized density we defined P tilde (P̃) as a product of factors φ_i(D_i), each of which is potentially a full table, we are now going to shift to a representation that uses a linear form which is subsequently exponentiated; that's why it's called log-linear, because the logarithm is a linear function. So what is this form? It's a linear function, a weighted sum Σ_j w_j f_j(D_j), built from things called coefficients, the w_j, and things called features, the f_j, and the unnormalized density is P̃ = exp(−Σ_j w_j f_j(D_j)). Features, like factors, each have a scope, which is the set of variables on which the feature depends; but different features can have the same scope: you can have multiple features, all of which are over the same set of variables. Notice that each feature has just a single parameter w_j that multiplies it.

So what does this give rise to? If we have a log-linear model, we can push the exponent in through the summation, and that gives us a product of exponential functions, P̃ = Π_j exp(−w_j f_j(D_j)). You can think of each of these as effectively a little factor, but it's a factor that has only a single parameter w_j.
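To make that concrete, here is a minimal Python sketch. The features and weights below are made-up illustrations, and it follows the exp(−Σ_j w_j f_j) convention used in this lecture; it simply checks that exponentiating the weighted sum of features gives the same value as multiplying the little per-feature factors exp(−w_j f_j).

```python
import math

# Hypothetical features over an assignment x = (x1, x2), just for illustration.
features = [
    lambda x: 1.0 if x[0] == x[1] else 0.0,   # agreement feature over both variables
    lambda x: float(x[0]),                    # feature whose scope is only x1
]
weights = [0.7, -1.3]                          # one coefficient w_j per feature

def unnormalized_density(x):
    """P~(x) = exp(-sum_j w_j * f_j(x)), the log-linear form."""
    return math.exp(-sum(w * f(x) for w, f in zip(weights, features)))

def product_of_factors(x):
    """The same quantity written as a product of little factors exp(-w_j * f_j(x))."""
    prod = 1.0
    for w, f in zip(weights, features):
        prod *= math.exp(-w * f(x))
    return prod

x = (1, 0)
assert abs(unnormalized_density(x) - product_of_factors(x)) < 1e-12
print(unnormalized_density(x))
```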
Since this is a little bit abstract, let's look at an example: specifically, how we might represent a simple table factor as a log-linear model. Here is a factor φ over two binary random variables X1 and X2, so a full table factor would have four parameters: a00, a01, a10, and a11. We can capture this factor in a log-linear model using a set of features that are indicator functions. An indicator function takes the value one if, say, X1 is zero and X2 is zero, and takes the value zero otherwise. That's the general notion of an indicator function: it looks at the event, or constraint, inside the curly braces and returns zero or one depending on whether that event is true or not.

So if we want to represent this factor as a log-linear model, we simply sum over all four combinations of k and ℓ, each of which is either 0 or 1, so we are summing over all four entries, and each term has a parameter, or coefficient, w_kℓ that multiplies the corresponding indicator feature. In that summation we pick up exp(−w00) when X1 = 0 and X2 = 0, exp(−w01) when X1 = 0 and X2 = 1, and so on. And it's not difficult to convince ourselves that if we define w_kℓ to be the negative log of the corresponding entry in the table, then this gives us back exactly the factor we defined to begin with. So this shows that the log-linear representation is fully general, in the sense that we can take any factor and represent it as a log-linear model simply by including all of the appropriate indicator features.
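Here is a small sketch of that construction for an assumed 2×2 table (the entries are made up): setting w_kℓ = −log a_kℓ and exponentiating the negated sum of weighted indicator features reproduces every table entry.

```python
import math
import itertools

# A hypothetical full table factor phi(X1, X2) over two binary variables.
table = {(0, 0): 10.0, (0, 1): 0.1, (1, 0): 2.0, (1, 1): 5.0}

# One indicator feature per entry, with weight w_kl = -log a_kl.
weights = {kl: -math.log(a) for kl, a in table.items()}

def phi_loglinear(x1, x2):
    """exp(-sum_{k,l} w_kl * 1{X1=k, X2=l}); only one indicator fires per assignment."""
    total = sum(w * (1.0 if (x1, x2) == kl else 0.0) for kl, w in weights.items())
    return math.exp(-total)

for x1, x2 in itertools.product([0, 1], repeat=2):
    assert abs(phi_loglinear(x1, x2) - table[(x1, x2)]) < 1e-9
print("log-linear representation reproduces the table factor")
```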
But we don't generally want to do that; generally we want a much finer-grained set of features. So let's look at some of the features people use in practice. Here are the features used in a language model, a model that we discussed previously. First, let's remind ourselves that we have two sets of variables. We have the variables Y, which are the annotations for each word in the sequence, indicating which category that word corresponds to: the beginning of a person name, the continuation of a person name, the beginning of a location, the continuation of a location, and so on, as well as the words that are none of person, location, or organization, which are all labeled "other." So the value of Y tells us, for each word, what category it belongs to, since we are trying to identify people, locations, and organizations in the sentence. We have another set of variables, X, which are the actual words in the sentence.

Now, we could use a full table representation in which each Y is related to X by a factor that looks at every possible word in the English language, but such factors would be very expensive and would have a very large number of parameters. So instead we define features. For example, a feature f(Y_i, X_i), where Y_i is the label for the i'th word in the sentence and X_i is that word, might be the indicator function for "Y_i = person and X_i is capitalized." That feature doesn't look at the identity of the individual word; it just looks at whether the word is capitalized, and there is a single parameter that captures how important capitalization is for recognizing that something is a person. We could also have another feature, which could be part of the same model, that says Y_i = location (or, to be a little more precise, these might be "beginning of person" and "beginning of location") and X_i appears in some atlas. Now, things other than locations appear in an atlas, but if a word appears in the atlas there is presumably a much higher probability that it is actually a location, and so the weight for this feature might increase the probability of Y_i being labeled in this way. So you can imagine constructing a very rich set of features, all of which look at particular properties of the word, rather than enumerating all the possible words and giving a separate parameter to each one of them.
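As an illustration, a couple of such features might look like the sketch below. The label names, the toy atlas, and the weights are hypothetical choices, not values from the lecture, and the same exp(−Σ_j w_j f_j) convention is used, so a negative weight raises the score of assignments where the feature fires.

```python
import math

# Hypothetical token-level features over a (label y_i, word x_i) pair.
def f_cap_person(y, x):
    """1 if the label is B-PER and the word is capitalized, else 0."""
    return 1.0 if y == "B-PER" and x[:1].isupper() else 0.0

def f_atlas_location(y, x, atlas=frozenset({"Paris", "Nairobi", "Osaka"})):
    """1 if the label is B-LOC and the word appears in a toy atlas, else 0."""
    return 1.0 if y == "B-LOC" and x in atlas else 0.0

weights = {f_cap_person: -1.5, f_atlas_location: -2.0}  # negative weights boost the score

def token_factor(y, x):
    """exp(-sum_j w_j f_j(y, x)): this token's contribution to the unnormalized density."""
    return math.exp(-sum(w * f(y, x) for f, w in weights.items()))

print(token_factor("B-PER", "Daphne"))   # boosted by the capitalization feature
print(token_factor("B-LOC", "Paris"))    # boosted by the atlas feature
print(token_factor("O", "paris"))        # no feature fires, factor is 1.0
```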
Let's look at some other examples of feature-based models. This one comes from statistical physics; it's called the Ising model. The Ising model is a pairwise Markov network: it looks at pairs of adjacent variables and gives us a coefficient for their product. This is a case where the variables are binary, but they take values in {−1, +1} rather than {0, 1}, and the model is parameterized by features that are just the products x_i x_j of adjacent variables.

Where might this come up? It comes up, for example, in modeling the spins of electrons in a grid. Here each electron can spin in one of two directions: in the figure, the atoms marked with a blue arrow spin along one axis, and those marked with a red arrow spin in the opposite direction. The model defines a probability distribution over the joint set of spins, and the contribution of each edge depends on whether the adjacent atoms have the same spin or opposite spins. Notice that 1 × 1 is the same as (−1) × (−1), so this feature really just looks at whether the two spins are the same or different, and there is a parameter w_ij that weighs same versus different. Depending on the value of that parameter, we either favor configurations in which adjacent atoms spin in the same direction or configurations in which they spin in different directions; these are called ferromagnetic and antiferromagnetic models, respectively.

Furthermore, you can define in these systems the notion of a temperature. The temperature says how strong these connections are: as the temperature T grows, the w_ij's get divided by T and all drift toward zero, which means the coupling between adjacent atoms becomes almost moot and they become nearly decoupled from each other. On the other hand, as the temperature decreases, the interaction between the atoms becomes much more significant and they impose much stronger constraints on each other. This is a model of a real physical system, with real temperatures and real atoms, and sure enough, if you look at what happens to these models as a function of temperature, at high temperature you see a lot of mixing between the two types of spin, whereas at low temperature the configuration places much stronger constraints on the spins of adjacent atoms.
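Here is a minimal sketch of a tiny Ising chain. Sign conventions vary; this sketch assumes P̃(x) ∝ exp((1/T) Σ_ij w_ij x_i x_j), so a positive w_ij plays the ferromagnetic role, and the edge weights and temperatures are arbitrary illustration values. It brute-forces the normalization to show the coupling weakening as T grows.

```python
import math
import itertools

# A tiny Ising chain of three spins with edges (0,1) and (1,2).
edges = {(0, 1): 1.0, (1, 2): 1.0}   # positive w_ij favors agreement under this convention

def unnormalized(spins, T):
    """P~(x) = exp((1/T) * sum_{ij} w_ij * x_i * x_j), spins in {-1, +1}."""
    energy = sum(w * spins[i] * spins[j] for (i, j), w in edges.items())
    return math.exp(energy / T)

def marginal_agreement(T):
    """Probability that spins 0 and 1 agree, computed by brute-force normalization."""
    states = list(itertools.product([-1, 1], repeat=3))
    Z = sum(unnormalized(s, T) for s in states)
    return sum(unnormalized(s, T) for s in states if s[0] == s[1]) / Z

print(marginal_agreement(T=0.5))   # low temperature: strong coupling, close to 1
print(marginal_agreement(T=50.0))  # high temperature: nearly decoupled, close to 0.5
```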
Another kind of feature that is used a great deal in practical applications is a metric feature, in what is called a metric MRF. So what is a metric feature? This comes up mostly in cases where you have a collection of random variables X_i that all take values in some joint label space V. For example, they might all be binary, or they might all take the values one, two, three, four. What we'd like is that when X_i and X_j are connected by an edge, they take "similar" values. In order to enforce that, we need a notion of similarity, and we encode it using a distance function µ that takes two values, one for X_i and one for X_j, and says how close they are to each other.

So what does the distance function need to be? It needs to satisfy the standard conditions on a distance function, or metric. First, it needs to be non-negative. Reflexivity means that if the two variables take on the same value, the distance is zero. Symmetry means that the distances are symmetric: the distance between two values v1 and v2 is the same as the distance between v2 and v1. And finally there is the triangle inequality, which says that the distance between v1 and v2 is at most the distance from v1 to v3 plus the distance from v3 to v2, the standard triangle inequality. If a distance satisfies only non-negativity, reflexivity, and symmetry, it is called a semi-metric; if it also satisfies the triangle inequality, it is called a metric. Both are used in practical applications.

But how do we take this distance and put it into the context of an MRF? We have a feature that looks at two variables X_i and X_j, and that feature is the distance between X_i and X_j. We put it together by multiplying it with a coefficient w_ij, where w_ij has to be greater than zero, so that the metric MRF has the desired effect: because of the negation in the exponent, the lower the distance, the higher the resulting factor, and hence the higher the probability. So the more pairs that are close to each other, and the closer they are, the higher the probability of the overall configuration, which is exactly what we wanted. Conversely, if values are far from each other under the distance metric, the probability of the configuration is lower.
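Below is a small sketch of how one might check those conditions and turn the distance into a pairwise factor φ_ij = exp(−w_ij µ(X_i, X_j)). The label space, the simple 0/1 distance, and the weight are illustrative assumptions.

```python
import math

# A hypothetical label space with the simple 0/1 distance ("same label or not").
labels = ["grass", "sky", "water"]
mu = lambda v1, v2: 0.0 if v1 == v2 else 1.0
w_ij = 2.0   # the coefficient must be positive in a metric MRF

def is_metric(mu, labels):
    """Check non-negativity, reflexivity, symmetry, and the triangle inequality."""
    for v1 in labels:
        for v2 in labels:
            d = mu(v1, v2)
            if d < 0 or d != mu(v2, v1):
                return False                      # non-negativity or symmetry fails
            if v1 == v2 and d != 0:
                return False                      # reflexivity fails
            if any(d > mu(v1, v3) + mu(v3, v2) + 1e-12 for v3 in labels):
                return False                      # triangle inequality fails
    return True

def potential(v1, v2):
    """phi_ij(v1, v2) = exp(-w_ij * mu(v1, v2)): closer labels get a higher potential."""
    return math.exp(-w_ij * mu(v1, v2))

print(is_metric(mu, labels))       # True for the 0/1 distance
print(potential("sky", "sky"))     # 1.0: no penalty on the diagonal
print(potential("sky", "water"))   # exp(-2): mismatched labels are down-weighted
```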
So here are some examples of metric MRFs. The simplest possible metric is the one that gives a distance of zero when the two labels are equal to each other and a distance of one everywhere else, just a step function. This gives rise to a potential with zeros on the diagonal, so we get a bump in the probability when two adjacent variables take on the same label and a reduction in the probability otherwise, and it doesn't matter which particular values they take. That's one example of a simple metric.

A somewhat more expressive example comes up when the values in V are actually numerical, in which case we can look at the difference between the numerical values: when v_k is equal to v_l the distance is zero, and the distance grows linearly as v_k and v_l move apart, so the distance is the absolute value of v_k minus v_l.

A more interesting notion that comes up a lot in practice is that we don't want to penalize arbitrarily heavily values that are far away from each other in label space. This gives what is called a truncated linear penalty: beyond a certain threshold the penalty becomes constant, so it plateaus. There is still a penalty, but it doesn't keep increasing as the labels get further from each other.
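Here is a quick comparison of the three distances just described; the truncation threshold is an arbitrary illustrative value.

```python
# Three distances over a numerical label space, as sketched above.
def zero_one(vk, vl):
    """0 if the labels agree, 1 otherwise (the simple step distance)."""
    return 0.0 if vk == vl else 1.0

def linear(vk, vl):
    """|vk - vl|: the penalty keeps growing with the label difference."""
    return abs(vk - vl)

def truncated_linear(vk, vl, threshold=3.0):
    """min(|vk - vl|, threshold): the penalty plateaus beyond the threshold."""
    return min(abs(vk - vl), threshold)

for d in range(6):
    print(d, zero_one(0, d), linear(0, d), truncated_linear(0, d))
# The last column stops growing at 3.0, which is what lets labels "break away".
```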
One example where metric MRFs are used is image segmentation. Here we tend to favor segmentations in which adjacent superpixels take the same class: there is no penalty when two adjacent superpixels take the same class and some penalty when they take different classes. This is a very common, albeit simple, model for image segmentation.

Let's look at a different MRF, also in the context of computer vision; this one is used for image denoising. Here we have a noisy version of a real image, with white noise overlaid on top of it, and what we'd like to do is recover a cleaned-up version. So we have a set of variables X that correspond to the noisy pixels, a set of variables Y that correspond to the cleaned-up pixels, and we'd like a probabilistic model that relates X and Y. Intuitively, we'd like two effects on the pixels Y. First, we'd like Y_i to be close to X_i; but if we do only that, we simply stay with the original image. The main additional constraint we can exploit to clean up the image is the fact that adjacent pixels tend to have the same value, so we also constrain each Y_i to be close to its neighbors, with a bigger penalty the further away it is. And that is a metric MRF.

Now, we could use just a linear penalty, but that would be a very fragile model, because the right answer is obviously not the one where all pixels are equal to each other in their actual intensity value; that would just be a single grayish-looking image. What we'd like is to let one pixel depart from its adjacent pixels if it is being pulled in a different direction, either by its own observation or by its other neighbors. So the right model to use here is actually the truncated linear model, and that is the one that is commonly used and is very successful for image denoising.

Interestingly, almost exactly the same idea is used in the context of stereo reconstruction. There, the values we'd like to infer, the Y_i's, are the depth disparities for each pixel in the image, that is, how deep it is. Here too we have spatial continuity: we'd like the depth of one pixel to be close to the depth of adjacent pixels. But once again we don't want to enforce this too strongly, because there are genuine depth discontinuities in the image, and eventually we'd like things to be able to break away from each other. So once again one typically uses some kind of truncated linear model for stereo reconstruction, often augmented by other little tricks. For example, we might also look at the actual pixel appearance, such as color and texture: if the color and texture of adjacent pixels are very similar, we might impose a stronger similarity constraint, whereas if they are very different, the pixels are more likely to belong to different objects, and we don't want to enforce quite as strong a similarity constraint.
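To tie the last two examples together, here is a minimal sketch of the kind of score such a pairwise model assigns: a data term that keeps each Y_i close to its observation X_i, plus a truncated-linear smoothness term between neighboring Y's. The toy image, the weights, and the truncation level are all made-up values, and the inference over Y that a real system would perform is not shown.

```python
# Toy 1-D "image": X are observed noisy pixel values, Y a candidate restoration.
X = [0.9, 1.1, 1.0, 4.0, 4.1]            # hypothetical noisy observations
Y = [1.0, 1.0, 1.0, 4.0, 4.0]            # hypothetical cleaned-up values

W_DATA, W_SMOOTH, TRUNC = 1.0, 2.0, 1.5  # illustrative weights and truncation level

def truncated_linear(a, b, t=TRUNC):
    return min(abs(a - b), t)

def log_score(Y, X):
    """log P~(Y | X) up to a constant: data terms plus truncated-linear smoothness terms."""
    data = sum(W_DATA * abs(y - x) for y, x in zip(Y, X))          # keep Y_i near X_i
    smooth = sum(W_SMOOTH * truncated_linear(Y[i], Y[i + 1])       # keep neighbors close,
                 for i in range(len(Y) - 1))                       # but let edges break away
    return -(data + smooth)

flat = [2.4] * 5                          # an over-smoothed alternative restoration
print(log_score(Y, X), ">", log_score(flat, X))
assert log_score(Y, X) > log_score(flat, X)
```

Because the smoothness penalty is truncated, the jump between the third and fourth pixels costs only a bounded amount, so the restoration that preserves the edge scores higher than the flat, over-smoothed one.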