Log-Linear Models (22:08)

0:00 - 0:05

Local structure that doesn’t require full table representations
0:05 - 0:09

is important in both directed and undirected models.
0:09 - 0:15

How do we incorporate local structure into undirected models?
0:15 - 0:22

The framework for that is called “log-linear models” for reasons that will be clear in just a moment.
0:23 - 0:24

So
0:27 - 0:33

Whereas, in the original representation of the unnormalized density
0:33 - 0:39

we defined P tilde as the product of factors φi(Di),
0:39 - 0:42

each [of] which is potentially a full table.
0:43 - 0:46

Now we're going to shift that representation
0:46 - 0:50

to something that uses a linear form
0:50 - 0:52

(So here's a linear form)
0:52 - 0:55

that is subsequently exponentiated,
0:55 - 0:57

and that's why it's called log-linear—
0:57 - 1:00

because the logarithm is a linear function.
1:00 - 1:03

So what is this form over here?
1:03 - 1:07

It's a linear function that has these things that are called “coefficients”
1:10 - 1:13

and these things that are called “features”.
1:14 - 1:22

Features, like factors, each have a scope which is a set of variables on which the feature depends.
1:24 - 1:27

But different features can have the same scope.
1:27 - 1:31

You can have multiple features all of which are over the same set of variables.
1:31 - 1:36

Notice that each feature has just a single parameter wj that multiplies it.
1:38 - 1:40

So, what does this give rise to?
1:40 - 1:43

I mean if we have a log-linear model,
1:43 - 1:49

we can push in the exponent through the summation,
1:49 - 1:59

and that gives us something that is a product of exponential functions.
1:59 - 2:04

You can think of each of these as effectively a little factor,
2:04 - 2:08

but it’s a factor that only has a single parameter wj.
2:08 - 2:11

Since this is a little bit abstract, so let’s look at an example.
2:11 - 2:18

Specifically lets look at how we might represent a simple table factor as a log linear model.
2:18 - 2:24

So here’s a param, here’s a factor φ, over two binary random variables X1 and X2.
2:24 - 2:31

And so a full table factor would have four parameters: a00, a01, a10, and a11.
2:31 - 2:35

So we can capture this model using a log linear model,
2:35 - 2:38

using a set of such of features,
2:38 - 2:41

using a set of these guys, which are indicator functions.
2:41 - 2:43

So this is an indicator function.
2:43 - 2:49

It takes one if X1 is zero and X2 is zero,
2:49 - 2:51

and it takes zero otherwise.
2:51 - 2:54

So this the general notion of an indicator function.
2:54 - 3:00

It looks at the event—or constraint—inside the curly braces,
3:00 - 3:05

and it returns a value of 0 or 1, depending on
whether that event is true or not.
3:06 - 3:14

And so, if we wanted to represent this factor as a log-linear model,
3:14 - 3:20

We can see that we can simply sum up over all the four values of k and ℓ,
3:20 - 3:23

which are either 0 or 1, each of them.
3:23 - 3:25

So were summing up over all four entries here.
3:25 - 3:32

And we have a parameter—or coefficient—w_kℓ which multiplies this feature.
3:33 - 3:44

And so, we would have a summation of w_kℓ:
3:44 - 3:49

of w00 only in the case where X1 is zero and X2 is zero.
3:49 - 4:00

So we would have exp of negative w00 when X1=0 and X2=0,
4:01 - 4:12

and we would have exp of negative w01 when
X1=0 and X2=1, and so on and so forth.
4:12 - 4:16

And it’s not difficult to convince ourselves that
4:16 - 4:25

if we define w_kℓ to be the negative log of the corresponding entries in this table,
4:25 - 4:29

then that gives us right back the factor that
we defined to begin with.
4:29 - 4:33

So this shows that this is a general representation,
4:34 - 4:37

in the sense that we can take any factor
4:38 - 4:41

and represent it as a log-linear model
4:41 - 4:47

simply by including all of the appropriate features.
4:49 - 4:52

But we don’t generally want to do that.
4:52 - 4:56

Generally we want a much finer grain set of features.
4:56 - 5:01

So let’s look at some of the examples of features that people use in practice.
5:01 - 5:04

So here are the features used in a language model.
5:04 - 5:08

This is a language model that we that we discussed previously.
5:08 - 5:12

And here we have features that relate:
5:12 - 5:16

First of all, let’s just remind ourselves [that] we have two sets of variables.
5:16 - 5:24

We have the variables Y which represent the annotations for each word
5:24 - 5:30

in the sequence corresponding to what category that corresponds to.
5:30 - 5:32

So this is a person.
5:32 - 5:35

This is the beginning of a person name.
5:35 - 5:38

This is the continuation of a person name.
5:38 - 5:39

The beginning of a location.
5:39 - 5:42

The continuation of a location, and so on.
5:42 - 5:46

As well as a bunch of words that are not:
5:46 - 5:48

[i.e.,] none of person, location, organization.
5:48 - 5:51

And they’re all labeled “other”.
5:51 - 5:55

And so the value Y tells us for each word what
category it belongs to,
5:55 - 5:59

so that we’re trying to identify people, locations, and
organizations in the sentence.
5:59 - 6:02

We have another set of variables X,
6:03 - 6:09

which are the actual words in the sentence.
6:09 - 6:13

Now we can go ahead and define…
6:13 - 6:16

We can use a full table representation that
6:16 - 6:23

basically tries to relate each and every Y that has a feature,
6:23 - 6:28

that has a full factor that looks at every possible word in the English language;
6:28 - 6:32

but those are going to be very, very, expensive,
6:32 - 6:34

and a very large number of parameters.
6:34 - 6:40

And so we're going to define a feature that looks, for example, at f of
6:40 - 6:45

say a particular Y_i, which is the label for the i’th word in the sentence,
6:45 - 6:48

and X_i, being that i’th word.
6:48 - 6:55

And that feature says, for example: Y_i equals person.
6:56 - 7:04

It’s the indicator function for “Y_i = person and X_i is capitalized”.
7:06 - 7:10

And so that feature doesn’t look at the individual words.
7:10 - 7:13

It just looks at whether that word is capitalized.
7:13 - 7:17

Now we have just the single parameter that looks just at capitalization,
7:17 - 7:23

and parameterizes how important is capitalization for recognizing that something's a person.
7:23 - 7:27

We could also have another feature.
7:27 - 7:29

This is an alternative:
7:29 - 7:33

This a different feature that can and could be part of the same model
7:33 - 7:38

that says: Y_i is equal to location,
7:38 - 7:41

Or, actually, I was little bit imprecise here—
7:41 - 7:45

This might be beginning of person. This might be beginning of location.
7:45 - 7:51

And X_i appears in some atlas.
7:52 - 7:55

Now there is other things that appear in the atlas than locations,
7:55 - 7:59

but if a word appears in the atlas,
7:59 - 8:02

there is a much higher probability presumably that it’s actually a location
8:02 - 8:06

and so we might have, again, [a] weight for this feature
8:06 - 8:14

that indicates that maybe increases the probability in Y_i being labeled in this way.
8:14 - 8:19

And so you can imagine that constructing a very rich set of features,
8:19 - 8:24

all of which look at certain aspects of the word,
8:24 - 8:32

and rather than enumerating all the possible words
and giving a parameter to each and one of them.
8:34 - 8:38

Let’s look at some other examples of feature-based models.
8:38 - 8:41

So this is an example from statistical physics.
8:41 - 8:43

It’s called the Ising model.
8:43 - 8:49

And the Ising model is something that looks at pairs
of variables.
8:49 - 8:52

It’s a pairwise Markov network.
8:53 - 8:56

And [it] looks the pairs of adjacent variables,
8:56 - 8:59

and basically gives us a coefficient for their products.
8:59 - 9:04

So now, this is a case where variables are in the end are binary,
9:04 - 9:08

but not in the space {0, 1} but rather
negative one and positive one.
9:08 - 9:10

And so now, we have a model that's parametrized
9:10 - 9:14

as features that are just the product of the values of the adjacent variables.
9:14 - 9:16

Where might this come up?
9:16 - 9:23

It comes up in the context, for example, of modeling the spins of electrons in a grid.
9:24 - 9:29

So here you have a case where the electrons can rotate
9:29 - 9:32

either along one direction or in the other direction
9:32 - 9:41

so here is a bunch of the atoms that are marked with a blue arrow.
9:41 - 9:43

You have one rotational axis,
9:43 - 9:46

and the red arrow[s] are rotating in the opposite direction.
9:46 - 9:53

And this basically says we have a term that
9:53 - 9:57

[whose] probability distribution over the joint set of spins.
9:57 - 10:02

(So this is the joint spins.)
10:02 - 10:08

And the model, depends on whether adjacent
atoms have the same spin or opposite spin.
10:08 - 10:12

So notice that one times one is the same as negative one times negative one.
10:12 - 10:17

So this really just looks at whether they have the same spin
or different spins.
10:17 - 10:23

And there is a parameter that looks at, you know, same or
different.
10:24 - 10:27

That's what this feature represents.
10:27 - 10:33

And, depending on the value of this parameter over here,
10:33 - 10:35

if the parameter goes one way,
10:35 - 10:40

we're going to favor systems
10:40 - 10:43

where the atoms spin in the same direction.
10:43 - 10:48

And if it’s going in the opposite direction, you’re going to favor atoms that spin in the different direction.
10:48 - 10:51

And those are called ferromagnetic and anti-ferromagnetic.
10:53 - 10:59

Furthermore, you can define in these systems the notion of a temperature.
10:59 - 11:05

So the temperature here says how strong is this connection.
11:05 - 11:13

So notice that as T grows—as the temperature grows—the w_ij’s get divided by T.
11:16 - 11:19

And they all kind of go towards zero,
11:21 - 11:25

which means that the strength of the connection between
11:25 - 11:28

adjacent atoms, effectively becomes almost moot,
11:28 - 11:31

and they become almost decoupled from each other.
11:31 - 11:36

On the other hand, as the temperature decreases,
11:38 - 11:43

Then the effect of the interaction between the atoms becomes much more significant
11:43 - 11:47

and they’re going to impose much stronger constraints on each other.
11:47 - 11:49

And this is actually a model of a real physical system.
11:49 - 11:52

I mean, this is real temperature, and real atoms, and so on.
11:52 - 11:57

And sure enough, if you look at what happens to these models as a function of temperature,
11:57 - 12:01

what we see over here is high temperature.
12:02 - 12:04

This is high temperature
12:04 - 12:10

and you can see that there is a lot of mixing between the two types of spin
12:10 - 12:12

and this is low temperature
12:13 - 12:20

and you can see that there is much stronger
constraints in this configuration
12:20 - 12:23

about the spins of adjacent atoms.
12:27 - 12:34

Another kind of feature that's used very much in lots of practical applications
12:34 - 12:38

is the notion of a metric, of a metric feature, an M.R.F.
12:38 - 12:40

So what's a metric feature?
12:40 - 12:42

This is something that comes up, mostly in cases
12:42 - 12:48

where you have a bunch of random variables X_i that all take values in some joint label space of V.
12:48 - 12:50

So, for example, they might all be binary.
12:50 - 12:53

They all might take values one, two, three, four.
12:53 - 12:57

And what we'd like to do is
12:57 - 13:03

we have X_i and X_j that are connected to each other by an edge.
13:03 - 13:08

We want X_ and X_j to take “similar” values.
13:08 - 13:12

So in order to enforce the fact that X_i and X_j should take similar values
13:12 - 13:14

we need a notion of similarity.
13:14 - 13:24

And we're going to encode that using the distance function µ that takes two values, one for X_i and one for X_j’s,
13:24 - 13:26

[that] says how close are they to each other.
13:26 - 13:29

So what does the distance function need to be?
13:29 - 13:36

Well, the distance function needs to satisfy the standard condition on a distance function or a metric.
13:36 - 13:39

So first is reflexivity,
13:39 - 13:43

which means that if the two variables take on the same value,
13:43 - 13:46

then that distance better be zero.
13:49 - 13:54

Oh I forgot to say that this. Sorry, this needs to be a non-negative function.
13:54 - 14:01

Symmetry means that the distances are symetrical.
14:01 - 14:06

So the distance between two values v1 and v2 are the same as the distance between v2 and v1.
14:06 - 14:12

And finally is the triangle inequality, which says that the distance between v1 and v2
14:12 - 14:14

(So here is v1)
14:14 - 14:15

(Here is v2)
14:15 - 14:21

and the distance between v1 and v2 is less than the distance between v1 and v3
14:22 - 14:25

and then going to v2. So the standard triangle inequality.
14:27 - 14:33

if a distance just satisfies these two conditions, it's called a semi metric.
14:34 - 14:37

Otherwise, if it satisfies all three, it's called a metric.
14:38 - 14:41

And both are actually used in practical applications.
14:44 - 14:49

But how do we take this distance feature and put it in the context of an MRF?
14:49 - 14:56

We have a feature that looks at two variables, X_i and X_j.
14:56 - 14:59

And that feature is the distance between X_i and X_j.
14:59 - 15:09

And now, we put it together by multiplying that with a coefficient, w_ij,
15:09 - 15:13

such that w_ij has to be greater than zero.
15:13 - 15:16

So that we want the metric MRF
15:17 - 15:21

[to have] the effect that
15:21 - 15:29

the lower the distance, the higher this is,
15:30 - 15:36

because of the negative coefficient, which means that higher the probability. Okay?
15:37 - 15:43

So, the more pairs you have that are close to each other
15:43 - 15:47

and the closer they are to each other the higher
the probability of the overall configuration.
15:47 - 15:50

Which is exactly what we wanted to have happen.
15:53 - 15:58

So, conversely, if you have values that are far from
each other in the distance metric
15:58 - 16:01

the lower the probability in the model.
16:02 - 16:05

So, here are some examples of metric MRF’s.
16:05 - 16:08

So one: The simplest possible metric MRF
16:08 - 16:12

is one that gives [a] distance of zero when the two classes are equal to each other
16:12 - 16:14

and [a] distance of one everywhere else.
16:14 - 16:16

So now, you know, this is just like a step function.
16:16 - 16:23

And, this gives rise to a potential that looks like this.
16:23 - 16:27

So we have 0’s on the diagonal.
16:27 - 16:34

So we get a bump in the probability when the two adjacent variables take on the same label
16:34 - 16:37

and otherwise we get a reduction in the probability.
16:37 - 16:41

But it doesn’t matter what particular value they take.
16:41 - 16:44

That’s one example of a simple metric.
16:44 - 16:52

A somewhat more expressive example might come up when the values V are actually numerical values.
16:52 - 16:58

In which case you can look at maybe the difference between the miracle values.
16:58 - 17:01

So, v_k minus v_l.
17:01 - 17:05

And you want, and when v_k is equal to v_l, the distance is zero,
17:05 - 17:14

and then you have a linear function that increases the
distance as the distance between v_k and v_l grows.
17:14 - 17:19

So, this is the absolute value of v_k minus v_l.
17:20 - 17:26

A more interesting notion that comes up a lot in
practice is:
17:26 - 17:32

we don’t want to penalize arbitrarily things that are far way from each other in label space.
17:32 - 17:37

So this is what is called a truncated linear penalty.
17:38 - 17:42

And you can see that beyond a certain threshold,
17:44 - 17:49

the penalty just becomes constant, so it plateaus.
17:49 - 17:54

So that there is a penalty, but it doesn’t keep increasing over as the labels get further from each other
17:54 - 18:00

One example where metric MRF’s are used is when we’re doing image segmentation.
18:00 - 18:04

And here we tend to favor segmentations where adjacent superpixels…
18:07 - 18:09

(These are adjacent superpixels.)
18:10 - 18:13

And we want them to take the same class.
18:18 - 18:22

And so here we have no penalty when the superpixels take the same class
18:22 - 18:25

and we have some penalty when they take different classes.
18:25 - 18:31

And this is actually a very common, albeit simple, model for
image segmentation.
18:32 - 18:37

Let’s look at a different MRF, also in the context of
computer vision.
18:37 - 18:40

This is an MRF that’s used for image denoising.
18:40 - 18:45

So here we have a noisy version of a real image that looks like this.
18:45 - 18:51

So this is, you can see this kind of, white noise overlayed on top of the image.
18:51 - 18:54

And what we’d like to do, is we’d like to get a cleaned-up version of the image.
18:54 - 19:01

So here we have, a set of variables, X, that correspond to the noisy pixels.
19:02 - 19:08

And we have a set of variables, Y, that corresponds to the cleaned pixels.
19:08 - 19:15

And we'd like to have a probabilistic model that relates X and Y.
19:15 - 19:21

And what we’re going to do is we’d like, so, intuiti—, I mean,
19:21 - 19:25

so you’d like to have two effects on the pixels Y:
19:25 - 19:33

First, you'd like Y_i to be close to X_i.
19:33 - 19:36

But if you just do that, then you're just going to stick with
the original image.
19:36 - 19:41

So what is the main constraint that we can employ on the image in order to clean it up
19:41 - 19:47

is the fact that adjacent pixels tend to have the same value.
19:47 - 19:53

So in this case what we’re going to do is we’re going to model, we’re going to constrain the image
19:53 - 20:00

so that we’re going to constrain the Y_i’s to try and make Y_i close to its neighbors.
20:03 - 20:06

And the further away it is, the bigger the penalty.
20:06 - 20:08

And that's a metric MRF.
20:12 - 20:17

Now we could use just a linear penalty,
20:17 - 20:22

but that’s going to be a very fragile model,
20:22 - 20:28

because, now obviously the right answer isn't the model
where all pixels are equal to each other
20:28 - 20:34

in their actual intensity value because that would be just a single, you know, grayish-looking image.
20:34 - 20:40

So what you like is that you would like to let one pixel depart from its adjacent pixel
20:40 - 20:44

if it’s getting close in a different direction either by its own observation or by other adjacent pixels.
20:44 - 20:49

And so the right model to use here is actually the truncated linear model
20:49 - 20:52

and that one is [the] one that’s commonly used
20:52 - 20:55

and is very successful for doing image denoising.
20:55 - 21:02

Interesting, almost exactly the same idea is used in the context of stereo reconstruction.
21:02 - 21:06

There, the values that you’d like to infer, the Y_i’s,
21:06 - 21:12

are the depth disparity for a given pixel in the image—how deep it is.
21:12 - 21:14

And here also we have spacial continuity.
21:14 - 21:19

We like the depth of one pixel to be close to the depth of an adjacent pixel.
21:19 - 21:22

But once again we don’t want to enforce this too strongly
21:22 - 21:25

because you do have depth disparity in the image
21:25 - 21:28

and so eventually you'd like things to be allowed to break away from each other.
21:28 - 21:33

And so once again, one typically uses some kind of truncated linear model
21:33 - 21:36

for doing this stereo construction,
21:36 - 21:38

often augmented by other little tricks.
21:38 - 21:45

So, for example, here we have the actual pixel appearance,
21:45 - 21:48

for example, the color and texture.
21:48 - 21:50

And if the color and texture are very similar to each other,
21:50 - 21:55

you might want to have the stronger constraint on similarity.
21:55 - 22:00

Versus: if the color and texture of the adjacent pixels are
very different from each other,
22:00 - 22:03

they may be more likely to belong to different objects
22:03 - 22:08

and you don’t want to enforce quite as strong of a similarity constraint.

Title:: Log-Linear Models (22:08)
Video Language:: English

	Ambrose LI edited English subtitles for Log-Linear Models (22:08)
	Ambrose LI edited English subtitles for Log-Linear Models (22:08)
	Ambrose LI edited English subtitles for Log-Linear Models (22:08)
	Ambrose LI edited English subtitles for Log-Linear Models (22:08)
	Ambrose LI edited English subtitles for Log-Linear Models (22:08)
	Ambrose LI edited English subtitles for Log-Linear Models (22:08)
	Ambrose LI edited English subtitles for Log-Linear Models (22:08)
	Ambrose LI edited English subtitles for Log-Linear Models (22:08)

Show all

English subtitles

Revisions

Revision 9

Ambrose LI

Log-Linear Models (22:08)

Revisions

Our website uses cookies

Operating cookies (Required)