-
Local structure that doesn’t require full table representations
-
is important in both directed and undirected models.
-
How do we incorporate local structure into undirected models?
-
The framework for that is called “log-linear models” for reasons that will be clear in just a moment.
-
So
-
Whereas, in the original representation of the unnormalized density
-
we defined P tilde as the product of factors φi(Di),
-
each [of] which is potentially a full table.
-
Now we're going to shift that representation
-
to something that uses a linear form
-
(So here's a linear form)
-
that is subsequently exponentiated,
-
and that's why it's called log-linear—
-
because the logarithm is a linear function.
-
So what is this form over here?
-
It's a linear function that has these things that are called “coefficients”
-
and these things that are called “features”.
-
Features, like factors, each have a scope which is a set of variables on which the feature depends.
-
But different features can have the same scope.
-
You can have multiple features all of which are over the same set of variables.
-
Notice that each feature has just a single parameter wj that multiplies it.
-
So, what does this give rise to?
-
I mean if we have a log-linear model,
-
we can push in the exponent through the summation,
-
and that gives us something that is a product of exponential functions.
-
You can think of each of these as effectively a little factor,
-
but it’s a factor that only has a single parameter wj.
-
Since this is a little bit abstract, so let’s look at an example.
-
Specifically lets look at how we might represent a simple table factor as a log linear model.
-
So here’s a param, here’s a factor φ, over two binary random variables X1 and X2.
-
And so a full table factor would have four parameters: a00, a01, a10, and a11.
-
So we can capture this model using a log linear model,
-
using a set of such of features,
-
using a set of these guys, which are indicator functions.
-
So this is an indicator function.
-
It takes one if X1 is zero and X2 is zero,
-
and it takes zero otherwise.
-
So this the general notion of an indicator function.
-
It looks at the event—or constraint—inside the curly braces,
-
and it returns a value of 0 or 1, depending on
whether that event is true or not.
-
And so, if we wanted to represent this factor as a log-linear model,
-
We can see that we can simply sum up over all the four values of k and ℓ,
-
which are either 0 or 1, each of them.
-
So were summing up over all four entries here.
-
And we have a parameter—or coefficient—w_kℓ which multiplies this feature.
-
And so, we would have a summation of w_kℓ:
-
of w00 only in the case where X1 is zero and X2 is zero.
-
So we would have exp of negative w00 when X1=0 and X2=0,
-
and we would have exp of negative w01 when
X1=0 and X2=1, and so on and so forth.
-
And it’s not difficult to convince ourselves that
-
if we define w_kℓ to be the negative log of the corresponding entries in this table,
-
then that gives us right back the factor that
we defined to begin with.
-
So this shows that this is a general representation,
-
in the sense that we can take any factor
-
and represent it as a log-linear model
-
simply by including all of the appropriate features.
-
But we don’t generally want to do that.
-
Generally we want a much finer grain set of features.
-
So let’s look at some of the examples of features that people use in practice.
-
So here are the features used in a language model.
-
This is a language model that we that we discussed previously.
-
And here we have features that relate:
-
First of all, let’s just remind ourselves [that] we have two sets of variables.
-
We have the variables Y which represent the annotations for each word
-
in the sequence corresponding to what category that corresponds to.
-
So this is a person.
-
This is the beginning of a person name.
-
This is the continuation of a person name.
-
The beginning of a location.
-
The continuation of a location, and so on.
-
As well as a bunch of words that are not:
-
[i.e.,] none of person, location, organization.
-
And they’re all labeled “other”.
-
And so the value Y tells us for each word what
category it belongs to,
-
so that we’re trying to identify people, locations, and
organizations in the sentence.
-
We have another set of variables X,
-
which are the actual words in the sentence.
-
Now we can go ahead and define…
-
We can use a full table representation that
-
basically tries to relate each and every Y that has a feature,
-
that has a full factor that looks at every possible word in the English language;
-
but those are going to be very, very, expensive,
-
and a very large number of parameters.
-
And so we're going to define a feature that looks, for example, at f of
-
say a particular Y_i, which is the label for the i’th word in the sentence,
-
and X_i, being that i’th word.
-
And that feature says, for example: Y_i equals person.
-
It’s the indicator function for “Y_i = person and X_i is capitalized”.
-
And so that feature doesn’t look at the individual words.
-
It just looks at whether that word is capitalized.
-
Now we have just the single parameter that looks just at capitalization,
-
and parameterizes how important is capitalization for recognizing that something's a person.
-
We could also have another feature.
-
This is an alternative:
-
This a different feature that can and could be part of the same model
-
that says: Y_i is equal to location,
-
Or, actually, I was little bit imprecise here—
-
This might be beginning of person. This might be beginning of location.
-
And X_i appears in some atlas.
-
Now there is other things that appear in the atlas than locations,
-
but if a word appears in the atlas,
-
there is a much higher probability presumably that it’s actually a location
-
and so we might have, again, [a] weight for this feature
-
that indicates that maybe increases the probability in Y_i being labeled in this way.
-
And so you can imagine that constructing a very rich set of features,
-
all of which look at certain aspects of the word,
-
and rather than enumerating all the possible words
and giving a parameter to each and one of them.
-
Let’s look at some other examples of feature-based models.
-
So this is an example from statistical physics.
-
It’s called the Ising model.
-
And the Ising model is something that looks at pairs
of variables.
-
It’s a pairwise Markov network.
-
And [it] looks the pairs of adjacent variables,
-
and basically gives us a coefficient for their products.
-
So now, this is a case where variables are in the end are binary,
-
but not in the space {0, 1} but rather
negative one and positive one.
-
And so now, we have a model that's parametrized
-
as features that are just the product of the values of the adjacent variables.
-
Where might this come up?
-
It comes up in the context, for example, of modeling the spins of electrons in a grid.
-
So here you have a case where the electrons can rotate
-
either along one direction or in the other direction
-
so here is a bunch of the atoms that are marked with a blue arrow.
-
You have one rotational axis,
-
and the red arrow[s] are rotating in the opposite direction.
-
And this basically says we have a term that
-
[whose] probability distribution over the joint set of spins.
-
(So this is the joint spins.)
-
And the model, depends on whether adjacent
atoms have the same spin or opposite spin.
-
So notice that one times one is the same as negative one times negative one.
-
So this really just looks at whether they have the same spin
or different spins.
-
And there is a parameter that looks at, you know, same or
different.
-
That's what this feature represents.
-
And, depending on the value of this parameter over here,
-
if the parameter goes one way,
-
we're going to favor systems
-
where the atoms spin in the same direction.
-
And if it’s going in the opposite direction, you’re going to favor atoms that spin in the different direction.
-
And those are called ferromagnetic and anti-ferromagnetic.
-
Furthermore, you can define in these systems the notion of a temperature.
-
So the temperature here says how strong is this connection.
-
So notice that as T grows—as the temperature grows—the w_ij’s get divided by T.
-
And they all kind of go towards zero,
-
which means that the strength of the connection between
-
adjacent atoms, effectively becomes almost moot,
-
and they become almost decoupled from each other.
-
On the other hand, as the temperature decreases,
-
Then the effect of the interaction between the atoms becomes much more significant
-
and they’re going to impose much stronger constraints on each other.
-
And this is actually a model of a real physical system.
-
I mean, this is real temperature, and real atoms, and so on.
-
And sure enough, if you look at what happens to these models as a function of temperature,
-
what we see over here is high temperature.
-
This is high temperature
-
and you can see that there is a lot of mixing between the two types of spin
-
and this is low temperature
-
and you can see that there is much stronger
constraints in this configuration
-
about the spins of adjacent atoms.
-
Another kind of feature that's used very much in lots of practical applications
-
is the notion of a metric, of a metric feature, an M.R.F.
-
So what's a metric feature?
-
This is something that comes up, mostly in cases
-
where you have a bunch of random variables X_i that all take values in some joint label space of V.
-
So, for example, they might all be binary.
-
They all might take values one, two, three, four.
-
And what we'd like to do is
-
we have X_i and X_j that are connected to each other by an edge.
-
We want X_ and X_j to take “similar” values.
-
So in order to enforce the fact that X_i and X_j should take similar values
-
we need a notion of similarity.
-
And we're going to encode that using the distance function µ that takes two values, one for X_i and one for X_j’s,
-
[that] says how close are they to each other.
-
So what does the distance function need to be?
-
Well, the distance function needs to satisfy the standard condition on a distance function or a metric.
-
So first is reflexivity,
-
which means that if the two variables take on the same value,
-
then that distance better be zero.
-
Oh I forgot to say that this. Sorry, this needs to be a non-negative function.
-
Symmetry means that the distances are symetrical.
-
So the distance between two values v1 and v2 are the same as the distance between v2 and v1.
-
And finally is the triangle inequality, which says that the distance between v1 and v2
-
(So here is v1)
-
(Here is v2)
-
and the distance between v1 and v2 is less than the distance between v1 and v3
-
and then going to v2. So the standard triangle inequality.
-
if a distance just satisfies these two conditions, it's called a semi metric.
-
Otherwise, if it satisfies all three, it's called a metric.
-
And both are actually used in practical applications.
-
But how do we take this distance feature and put it in the context of an MRF?
-
We have a feature that looks at two variables, X_i and X_j.
-
And that feature is the distance between X_i and X_j.
-
And now, we put it together by multiplying that with a coefficient, w_ij,
-
such that w_ij has to be greater than zero.
-
So that we want the metric MRF
-
[to have] the effect that
-
the lower the distance, the higher this is,
-
because of the negative coefficient, which means that higher the probability. Okay?
-
So, the more pairs you have that are close to each other
-
and the closer they are to each other the higher
the probability of the overall configuration.
-
Which is exactly what we wanted to have happen.
-
So, conversely, if you have values that are far from
each other in the distance metric
-
the lower the probability in the model.
-
So, here are some examples of metric MRF’s.
-
So one: The simplest possible metric MRF
-
is one that gives [a] distance of zero when the two classes are equal to each other
-
and [a] distance of one everywhere else.
-
So now, you know, this is just like a step function.
-
And, this gives rise to a potential that looks like this.
-
So we have 0’s on the diagonal.
-
So we get a bump in the probability when the two adjacent variables take on the same label
-
and otherwise we get a reduction in the probability.
-
But it doesn’t matter what particular value they take.
-
That’s one example of a simple metric.
-
A somewhat more expressive example might come up when the values V are actually numerical values.
-
In which case you can look at maybe the difference between the miracle values.
-
So, v_k minus v_l.
-
And you want, and when v_k is equal to v_l, the distance is zero,
-
and then you have a linear function that increases the
distance as the distance between v_k and v_l grows.
-
So, this is the absolute value of v_k minus v_l.
-
A more interesting notion that comes up a lot in
practice is:
-
we don’t want to penalize arbitrarily things that are far way from each other in label space.
-
So this is what is called a truncated linear penalty.
-
And you can see that beyond a certain threshold,
-
the penalty just becomes constant, so it plateaus.
-
So that there is a penalty, but it doesn’t keep increasing over as the labels get further from each other
-
One example where metric MRF’s are used is when we’re doing image segmentation.
-
And here we tend to favor segmentations where adjacent superpixels…
-
(These are adjacent superpixels.)
-
And we want them to take the same class.
-
And so here we have no penalty when the superpixels take the same class
-
and we have some penalty when they take different classes.
-
And this is actually a very common, albeit simple, model for
image segmentation.
-
Let’s look at a different MRF, also in the context of
computer vision.
-
This is an MRF that’s used for image denoising.
-
So here we have a noisy version of a real image that looks like this.
-
So this is, you can see this kind of, white noise overlayed on top of the image.
-
And what we’d like to do, is we’d like to get a cleaned-up version of the image.
-
So here we have, a set of variables, X, that correspond to the noisy pixels.
-
And we have a set of variables, Y, that corresponds to the cleaned pixels.
-
And we'd like to have a probabilistic model that relates X and Y.
-
And what we’re going to do is we’d like, so, intuiti—, I mean,
-
so you’d like to have two effects on the pixels Y:
-
First, you'd like Y_i to be close to X_i.
-
But if you just do that, then you're just going to stick with
the original image.
-
So what is the main constraint that we can employ on the image in order to clean it up
-
is the fact that adjacent pixels tend to have the same value.
-
So in this case what we’re going to do is we’re going to model, we’re going to constrain the image
-
so that we’re going to constrain the Y_i’s to try and make Y_i close to its neighbors.
-
And the further away it is, the bigger the penalty.
-
And that's a metric MRF.
-
Now we could use just a linear penalty,
-
but that’s going to be a very fragile model,
-
because, now obviously the right answer isn't the model
where all pixels are equal to each other
-
in their actual intensity value because that would be just a single, you know, grayish-looking image.
-
So what you like is that you would like to let one pixel depart from its adjacent pixel
-
if it’s getting close in a different direction either by its own observation or by other adjacent pixels.
-
And so the right model to use here is actually the truncated linear model
-
and that one is [the] one that’s commonly used
-
and is very successful for doing image denoising.
-
Interesting, almost exactly the same idea is used in the context of stereo reconstruction.
-
There, the values that you’d like to infer, the Y_i’s,
-
are the depth disparity for a given pixel in the image—how deep it is.
-
And here also we have spacial continuity.
-
We like the depth of one pixel to be close to the depth of an adjacent pixel.
-
But once again we don’t want to enforce this too strongly
-
because you do have depth disparity in the image
-
and so eventually you'd like things to be allowed to break away from each other.
-
And so once again, one typically uses some kind of truncated linear model
-
for doing this stereo construction,
-
often augmented by other little tricks.
-
So, for example, here we have the actual pixel appearance,
-
for example, the color and texture.
-
And if the color and texture are very similar to each other,
-
you might want to have the stronger constraint on similarity.
-
Versus: if the color and texture of the adjacent pixels are
very different from each other,
-
they may be more likely to belong to different objects
-
and you don’t want to enforce quite as strong of a similarity constraint.