Undirected models are better at sampling

For a simple reason, the best directed models should always be worse at generating samples than the best undirected models, even when their log likelihoods are similar.

If we have an undirected model, then it defines a probability distribution by the equation

\displaystyle p(x;\theta)=\frac{\exp(G(x;\theta))}{\sum_y \exp(G(y;\theta))}
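As a concrete sketch (the quadratic goodness here is hypothetical, chosen only for illustration), here is an undirected model over binary vectors, small enough that the partition function can be computed by brute-force enumeration:

```python
import numpy as np

# A minimal sketch: an undirected model over binary vectors x in {0,1}^n,
# with a hypothetical quadratic goodness G(x; theta) = x^T W x.
rng = np.random.default_rng(0)
n = 4
W = rng.normal(size=(n, n))  # theta: the parameters of the goodness

def G(x, W):
    return x @ W @ x

# Enumerate all 2^n states to compute the partition function exactly.
states = np.array([[(i >> j) & 1 for j in range(n)] for i in range(2 ** n)],
                  dtype=float)
goodness = np.array([G(x, W) for x in states])
Z = np.exp(goodness).sum()   # sum_y exp(G(y; theta))
p = np.exp(goodness) / Z     # p(x; theta), a proper distribution

assert np.isclose(p.sum(), 1.0)
```

For realistic n the sum over all 2^n states is intractable, which is exactly why sampling from such models requires Markov chains.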

As always, the standard objective of unsupervised learning is to find a distribution p(x;\theta) so that the average log probability of the data distribution E_{x\sim D(x)} [ \log p(x;\theta) ] is as large as possible.

In theory, if we learn successfully, we should reach a local maximum of the average log probability. Taking the derivative and setting it to zero yields

E_{x\sim D(x)}[\nabla_\theta G(x;\theta^*)] = E_{x\sim p(x;\theta^*)}[\nabla_\theta G(x;\theta^*)]

(here \theta^* are the maximum likelihood parameters). Notice that this equation is a statement about the samples produced by the distribution p(x;\theta^*): the gradient of the goodness \nabla_\theta G(x;\theta^*) averaged over the data distribution D(x) is equal to the same gradient averaged over the model’s distribution p(x;\theta^*).  Therefore, the samples from p(x;\theta^*) must somehow be related to the samples from the data distribution D(x). This is a “promise” made to us by the learning objective of unsupervised learning.
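This moment-matching condition can be checked numerically in the simplest possible case. Assume (my toy setup, not from the post) a one-parameter family with G(x;\theta)=\theta x over x\in\{0,1\}, so \nabla_\theta G(x;\theta)=x and the condition reduces to E_D[x]=E_p[x]:

```python
import numpy as np

# Toy check of moment matching at the maximum-likelihood solution.
# Model: p(1; theta) = exp(theta) / (1 + exp(theta)), i.e. G(x; theta) = theta*x.
data_mean = 0.8  # E_D[x] for a hypothetical data distribution D

theta = 0.0
for _ in range(2000):
    # E_p[grad G] = E_p[x] = sigmoid(theta)
    model_mean = np.exp(theta) / (1 + np.exp(theta))
    # Gradient ascent on the log likelihood: E_D[grad G] - E_p[grad G]
    theta += 0.5 * (data_mean - model_mean)

model_mean = np.exp(theta) / (1 + np.exp(theta))
# At the optimum, the expected gradient under D equals that under p:
assert abs(data_mean - model_mean) < 1e-6
```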

However, directed models offer no such guarantee; instead, they promise that the conditional distributions of the data distribution will be similar to the conditional distributions of the model’s distribution, when the conditioning prefix is sampled from the data distribution. This is the critical point.

More formally, a directed model defines a distribution p(x;\theta)=\prod_j p(x_j|x_{<j};\theta). Plugging it into the objective of maximizing the average log likelihood of the data distribution D(x), we get the following:

\sum_j E_{D(x)}[\log p(x_j|x_{<j};\theta)],

which is a sum of independent problems.
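When each conditional is tabular and there is no parameter sharing, each of these independent problems has a closed-form maximum-likelihood solution: the empirical conditional frequencies. A toy sketch (the data here is hypothetical):

```python
from collections import Counter

# With no parameter sharing, each conditional p(x_j | x_{<j}) is fit
# separately; for tabular models the ML solution is just counting.
data = [(0, 0), (0, 1), (1, 1), (1, 1)]  # samples of (x_1, x_2) from D

# Problem j=1: p(x_1) = empirical marginal of x_1.
c1 = Counter(x[0] for x in data)
p_x1 = {v: c / len(data) for v, c in c1.items()}

# Problem j=2, solved independently: p(x_2 | x_1) = empirical conditional.
# (Counter returns 0 for unseen pairs; every context appears in this toy data.)
c2 = Counter(x for x in data)
p_x2_given_x1 = {ctx: {v: c2[(ctx, v)] / c1[ctx] for v in (0, 1)}
                 for ctx in (0, 1)}
```

Note that these conditionals are only estimated at prefixes the data actually visits, which is precisely where the trouble below begins.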

If the p(x_j|x_{<j};\theta)’s don’t share parameters for different j’s, then the problems are truly independent and can be solved completely separately. So let’s say we found a \theta^* that makes all these objectives happy. Then E_{D(x_{<j})}[E_{D(x_j|x_{<j})}[\log p(x_j|x_{<j};\theta^*)]] will be happy, which means that p(x_j|x_{<j};\theta^*) is similar, more or less, to D(x_j|x_{<j}) when x_{<j} is sampled from D(x_{<j}). This is the critical implied assumption made by the maximum likelihood objective applied to directed models. Why is this a problem when generating samples? It’s bad because this objective makes no “promises” about the behaviour of p(x_j|x_{<j};\theta^*) when x_{<j} \sim p(x_{<j};\theta^*). It is easy to imagine that p(x_1;\theta^*) will be somewhat different from D(x_1); suppose x_1 is sampled from p(x_1;\theta^*). Then p(x_2|x_1;\theta^*) will freak out, having never seen anything like x_1, which will make the sample (x_1,x_2) look even less like a sample from D(x_1,x_2). And so on. This “chain reaction” will likely cause the directed model to produce worse-looking samples than an undirected model with a similar log probability.
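The chain reaction can be simulated in a toy setup (my own, purely illustrative): suppose D puts mass only on constant binary sequences, and the learned conditional copies the previous symbol but errs with a small probability eps. Per-step accuracy is high, yet whole-sample quality collapses as errors compound through ancestral sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
T, eps, n_samples = 50, 0.05, 10_000

ok = 0
for _ in range(n_samples):
    x = rng.integers(2)          # sample x_1
    consistent = True
    for _ in range(T - 1):
        # Imperfect conditional: copy the previous symbol, flip w.p. eps.
        x_next = x if rng.random() > eps else 1 - x
        consistent &= (x_next == x)
        x = x_next
    ok += consistent

# Each step is 95% accurate, but the fraction of samples that remain
# constant sequences decays like (1 - eps)^(T - 1), roughly 8% here.
print(ok / n_samples, (1 - eps) ** (T - 1))
```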

But something should seem odd: after all, any undirected model (or any distribution, for that matter) can be decomposed with the chain rule, p(x_1,\ldots,x_n)=\prod_j p(x_j|x_{<j}). Why doesn’t the above argument apply to an undirected model, which I claim is superior at sampling? An answer can be given, but it involves lots of handwaving.

If an undirected model is expressed as a directed model using the chain rule, then the conditional probabilities will involve massive marginalizations. What’s more, all the conditional distributions p(x_j|x_{<j}) will share parameters in a very complicated way across different values of j. In all likelihood (and this is the weak part of the argument), the parameterization is so complex that it is not possible to make all the objectives E_{D(x_{<j})}[E_{D(x_j|x_{<j})}[\log p(x_j|x_{<j})]] happy for all j simultaneously; that is, the undirected model will not necessarily make p(x_j|x_{<j}) similar to D(x_j|x_{<j}) when x_{<j}\sim D(x_{<j}). This is why I assumed above that the little conditionals don’t share parameters.
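To make the “massive marginalizations” concrete, here is a sketch for a tiny undirected model over three binary variables (again with a hypothetical quadratic goodness): every chain-rule conditional requires summing out all remaining variables.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 3))  # hypothetical parameters of the goodness

def unnorm(x):
    x = np.array(x, dtype=float)
    return np.exp(x @ W @ x)  # exp(G(x; theta)), unnormalized

Z = sum(unnorm(x) for x in product((0, 1), repeat=3))

# p(x_1 = 1): marginalize over x_2, x_3 -- a sum over 2^(n-1) states.
p_x1_1 = sum(unnorm((1, a, b)) for a, b in product((0, 1), repeat=2)) / Z

# p(x_2 = 1 | x_1): another marginalization over x_3, for each value of x_1,
# and implicitly a function of all of W -- the "complicated sharing".
def p_x2_given_x1(x1):
    num = sum(unnorm((x1, 1, b)) for b in (0, 1))
    den = sum(unnorm((x1, a, b)) for a, b in product((0, 1), repeat=2))
    return num / den
```

With three variables the sums are trivial, but each conditional deeper in the chain still touches every entry of W, so the conditionals cannot be tuned independently the way the unshared directed model’s can.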

So to summarize, directed models are worse at sampling because of the sequential nature of their sampling procedure. By sampling in sequence, the directed model is “fed” data which is unlike the training distribution, causing it to freak out. In contrast, sampling from undirected models requires an expensive Markov chain, which ensures the “self-consistency” of the sample. And intuitively, since we invest more work into obtaining the sample, it must be better.



  1. Posted July 17, 2011 at 7:23 pm

    Doesn’t Gibbs sampling also converge to the true distribution for infinite steps?

    I guess it’s just a tradeoff– if you know a lot about the nature of your problem (distributions and dependencies), directed models let you model it rather precisely. Of course, you are more prone to errors in your assumptions.

  2. ilyasu1
    Posted July 18, 2011 at 12:06 am

    It does. And Gibbs would suffer from similar problems if the model were trained with Pseudo-likelihood, which
    learns conditionals that are accurate only on the data distribution.

    And it’s always good to wisely use prior knowledge about the problem, though it can be difficult
    to “explain” the nature of our prior knowledge to the model.
