Monthly Archives: July 2011

Undirected models are better at sampling

The best directed models should always be worse at generating samples than the best undirected models, even if their log likelihoods are similar, for a simple reason.

If we have an undirected model, then it defines a probability distribution by the equation

\displaystyle p(x;\theta)=\frac{\exp(G(x;\theta))}{\sum_y \exp(G(y;\theta))}

As always, the standard objective of unsupervised learning is to find a distribution p(x;\theta) so that the average log probability of the data distribution E_{x\sim D(x)} [ \log p(x;\theta) ] is as large as possible.

In theory, if we learn successfully, we should reach a local maximum of the average log probability. Taking the derivative and setting it to zero yields

E_{x\sim D(x)}[\nabla_\theta G(x;\theta^*)] = E_{x\sim p(x;\theta^*)}[\nabla_\theta G(x;\theta^*)]

(here \theta^* are the maximum likelihood parameters). Notice that this equation is a statement about the samples produced by the distribution p(x;\theta^*): the gradient of the goodness \nabla_\theta G(x;\theta^*) averaged over the data distribution D(x) is equal to the same gradient averaged over the model’s distribution p(x;\theta^*).  Therefore, the samples from p(x;\theta^*) must somehow be related to the samples from the data distribution D(x). This is a “promise” made to us by the learning objective of unsupervised learning.
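This condition is just standard calculus (a routine derivation, spelled out here for completeness): when we differentiate \log p(x;\theta), the log-partition term turns into an expectation under the model,

\displaystyle \nabla_\theta \log p(x;\theta) = \nabla_\theta G(x;\theta) - \frac{\sum_y \exp(G(y;\theta))\nabla_\theta G(y;\theta)}{\sum_{y'} \exp(G(y';\theta))} = \nabla_\theta G(x;\theta) - E_{y\sim p(y;\theta)}[\nabla_\theta G(y;\theta)]

Averaging over x\sim D(x) and setting the result to zero at \theta^* recovers the moment-matching equation above.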

However, directed models do not offer such a guarantee; instead, they promise that the conditional distributions of the data distribution will be similar to the conditional distributions of the model’s distribution, when the conditioning data is sampled from the data distribution. This is the critical point.

More formally, a directed model defines a distribution p(x;\theta)=\prod_j p(x_j|x_{<j};\theta). Plugging this into the objective of maximizing the average log likelihood of the data distribution D(x), we get the following:

\sum_j E_{D(x)}[\log p(x_j|x_{<j};\theta)],

which is a sum of independent problems.

If the p(x_j|x_{<j};\theta)‘s don’t share parameters for different j‘s, then the problems are truly independent and can be solved completely separately. So let’s say we found a \theta^* that makes all these objectives happy. Then each E_{D(x_{<j})}[E_{D(x_j|x_{<j})} [\log p(x_j|x_{<j};\theta^*)]] will be happy, which means that p(x_j|x_{<j};\theta^*) is more or less similar to D(x_j|x_{<j}) when x_{<j} is sampled from D(x_{<j}) — which is the critical implicit assumption made by the maximum likelihood objective applied to directed models. Why is it a problem when generating samples? It’s a problem because this objective makes no “promises” about the behaviour of p(x_j|x_{<j};\theta^*) when x_{<j} \sim p(x_{<j};\theta^*). It is easy to imagine that p(x_1;\theta^*) will be somewhat different from D(x_1); now suppose x_1 was sampled from p(x_1;\theta^*). Then p(x_2|x_1;\theta^*) will freak out, having never seen anything like this x_1, which will make the sample (x_1,x_2) look even less like a sample from D(x_1,x_2). And so on. This “chain reaction” will likely cause the directed model to produce worse-looking samples than an undirected model with a similar log probability.
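As a contrived toy illustration of this chain reaction (entirely my own; all the numbers are invented): suppose the data only ever produces x_1 \in \{0,1\} with x_2 = x_1, and the learned marginal leaks a sliver of mass onto an x_1 the conditional model has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: x1 is 0 or 1 with equal probability, and x2 always equals x1.
# The learned marginal p(x1) leaks a little mass onto the unseen value 2;
# p(x2 | x1 = 2) was never constrained by training, so it is arbitrary.
p_x1 = np.array([0.49, 0.49, 0.02])
p_x2_given_x1 = np.array([
    [1.0, 0.0, 0.0],      # exactly D(x2 | x1 = 0)
    [0.0, 1.0, 0.0],      # exactly D(x2 | x1 = 1)
    [1/3, 1/3, 1/3],      # never trained: the model "freaks out" here
])

samples = [
    (x1, rng.choice(3, p=p_x2_given_x1[x1]))
    for x1 in rng.choice(3, p=p_x1, size=10000)
]

# Every conditional evaluated on data-distributed prefixes is exact, yet
# a small fraction of the model's samples start with an impossible x1 and
# look nothing like samples from D.
bad = sum(1 for x1, x2 in samples if x1 == 2 or x2 != x1)
print(bad / len(samples))  # roughly 0.02
```

Each conditional that was actually trained is perfect here; the bad samples come entirely from feeding the second conditional a prefix the data distribution never produces.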

But something should seem odd: after all, any undirected model (or any distribution, for that matter) can be decomposed with the chain rule, p(x_1,\ldots,x_n)=\prod_j p(x_j|x_{<j}). Why doesn’t the above argument apply to an undirected model, which I claim to be superior at sampling? An answer can be given, but it involves lots of handwaving.

If an undirected model is expressed as a directed model using the chain rule, then the conditional probabilities will involve massive marginalizations. What’s more, all the conditional distributions p(x_j|x_{<j}) will share parameters in a very complicated way across the different values of j. In all likelihood (and this is the weak part of the argument), the parameterization is so complex that it is not possible to make all the objectives E_{D(x_{<j})}[E_{D(x_j|x_{<j})} [\log p(x_j|x_{<j})]] happy for all j simultaneously; that is, the undirected model will not necessarily make p(x_j|x_{<j}) similar to D(x_j|x_{<j}) when x_{<j}\sim D(x_{<j}). This is why I assumed earlier that the little conditionals don’t share parameters.

So to summarize, directed models are worse at sampling because of the sequential nature of their sampling procedure. By sampling in sequence, the directed model is “fed” data which is unlike the training distribution, causing it to freak out. In contrast, sampling from undirected models requires an expensive Markov chain, which ensures the “self-consistency” of the sample. And intuitively, since we invest more work into obtaining the sample, it must be better.


The Miracle of the Boltzmann Machine

The Boltzmann Machine, invented by my adviser, is fairly well-known in machine learning. But the Boltzmann Machine (BM) is also the only model that actually uses dreams for a specific computational purpose, as opposed to all other models of sleep, which use it in a more ad-hoc way (e.g., to “remember the day’s events better”, or to “forget unwanted thoughts”, and other claims of this kind). In addition, the BM also forgets its dreams. Just like humans! The BM forgets its dreams as a consequence of differentiating a simple equation that apparently has nothing to do with sleep. It is so remarkable that I believe the Boltzmann Machine to be “right” in a very essential way.

Unfortunately, the Boltzmann Machine can only be understood using math. So you’ll like it if you know math too.

The Boltzmann Machine defines a probability distribution over the set of possible visible binary vectors V and hidden binary vectors H. The intended analogy is that V is an observation, say the pixels on the retina, and H is the joint activity of all the neurons inside the brain. We’ll also denote the concatenation of V and H by X, so X=(V,H). The Boltzmann Machine defines a probability distribution over the configurations X=(V, H) by the equation

P(X)=\displaystyle \frac{\exp(X^\top W X/2)}{\sum_{X'} \exp(X'^\top W X'/2)}

So different choices of the matrix W yield different distributions P(X).
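For concreteness, here is a brute-force sketch of this equation (my own illustration; the function name and the tiny random W are invented, and exact enumeration is only feasible for a handful of units):

```python
import itertools

import numpy as np

def boltzmann_distribution(W):
    """Exact P(X) = exp(X^T W X / 2) / Z over all binary configurations X,
    computed by enumerating the 2**n configurations (tiny n only)."""
    n = W.shape[0]
    configs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    goodness = 0.5 * np.einsum('ki,ij,kj->k', configs, W, configs)
    unnormalized = np.exp(goodness)
    return configs, unnormalized / unnormalized.sum()  # the denominator is Z

# A symmetric 3-unit weight matrix with no self-connections.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
configs, probs = boltzmann_distribution(W)  # 8 configurations, probs sum to 1
```

Drawing a different W and rerunning gives a different distribution P(X), exactly as stated; sampling from `probs` directly is possible here only because n is tiny.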

The BM makes observations about the world, which are summarized by the world distribution over the visible vectors D(V). For example, D(V) could be the distribution of all the images we’ve seen during the day (so V is a binary image).

The following point is a little hard to justify. We define the goal of learning by the objective function

L(W)=\displaystyle E_{D(V)} [\log P(V)],

where P(V)=\sum_{H} P(V,H) and E_{D(V)} [f(V)] = \sum_V D(V)f(V). In other words, the goal of learning is to find a BM that assigns high log probability to the kind of data we typically observe in the world. This objective makes some intuitive sense, because negative log probability can be interpreted as a measure of surprise. So if our BM isn’t surprised by the real world, then it must be doing something sensible. It is hard to fully justify this learning objective, because a BM that’s not surprised by the world isn’t obviously useful for other tasks. But we’ll just accept this assumption and see where it leads us.

So we want to find the parameter setting of W that maximizes the objective L(W), which is something we can approach with gradient ascent: we’d iteratively compute the gradient \partial L/\partial W, and change W slightly in the direction of the gradient. In other words, if we change our weights by

\Delta W_{ij} = \varepsilon \partial L/\partial W_{ij},

then we’re guaranteed to increase the value of the objective slightly. Do it enough times and our objective L will be in a good place. And finally, here is the promised math:

\partial L/\partial W_{ij} = E_{D(V)P(H|V)} [X_i X_j] - E_{P(V,H)}[X_i X_j].

If you’re into differentiation you could verify the above yourself (remember that E_{P(V,H)}[X_i X_j] = \sum_{(V,H)} P(V,H) X_i X_j, and similarly for D(V)P(H|V)).
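Taking up that suggestion numerically rather than by hand: the sketch below (my own check; the helper name and the 2-visible, 2-hidden toy setup are invented) computes both expectations exactly by enumeration and compares the resulting gradient against a central finite difference on a pair of tied symmetric weights W_{ij}=W_{ji}.

```python
import itertools

import numpy as np

def bm_loglik_and_grad(W, n_v, D):
    """Exact L(W) = E_{D(V)}[log P(V)] and the gradient
    E_{D(V)P(H|V)}[X_i X_j] - E_{P(V,H)}[X_i X_j], by enumerating all
    binary configurations X = (V, H) (tiny networks only)."""
    n = W.shape[0]
    X = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    p = np.exp(0.5 * np.einsum('ki,ij,kj->k', X, W, X))
    p /= p.sum()                                  # joint model P(V, H)
    powers = 2 ** np.arange(n_v - 1, -1, -1)
    v_idx = (X[:, :n_v] @ powers).astype(int)     # visible config of each X
    pV = np.bincount(v_idx, weights=p)            # marginal P(V)
    w_data = D[v_idx] * p / pV[v_idx]             # weights giving D(V)P(H|V)
    E_data = np.einsum('k,ki,kj->ij', w_data, X, X)
    E_model = np.einsum('k,ki,kj->ij', p, X, X)
    return (D * np.log(pV)).sum(), E_data - E_model

rng = np.random.default_rng(1)
n_v, n_h = 2, 2
A = rng.normal(scale=0.5, size=(n_v + n_h, n_v + n_h))
W = (A + A.T) / 2                                 # symmetric weights
np.fill_diagonal(W, 0.0)                          # no self-connections
D = rng.dirichlet(np.ones(2 ** n_v))              # an arbitrary data distribution

_, grad = bm_loglik_and_grad(W, n_v, D)
i, j, eps = 0, 3, 1e-5
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += eps; Wp[j, i] += eps                  # tied symmetric perturbation
Wm[i, j] -= eps; Wm[j, i] -= eps
Lp, _ = bm_loglik_and_grad(Wp, n_v, D)
Lm, _ = bm_loglik_and_grad(Wm, n_v, D)
numerical = (Lp - Lm) / (2 * eps)
assert abs(numerical - grad[i, j]) < 1e-6
```

The tied perturbation matters: W_{ij} and W_{ji} are one shared parameter of the symmetric matrix, which is why the 1/2 in the goodness X^\top W X/2 disappears from the gradient.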

We’re finally ready for the magical interpretation of the above equation. The equation states that the weight W_{ij} should change according to the difference of two averages: the average of the products X_i X_j according to D(V)P(H|V) and according to P(V,H).

But first, notice that X_i X_j is the product of the two neurons at the ends of the connection (i,j). So the connection will have little trouble detecting when the product is equal to 1 from local information (remember, our units take values in \{0,1\}).

More significantly, we can compute the expectation E_{D(V)P(H|V)} [X_i X_j] by taking the observed data from D(V), “clamping” it onto the visible units V, and “running” the BM’s neurons until their states converge to equilibrium. All these terms can be made precise in a technical sense. But the important analogy here is that during the day, the world sets the visible vectors V of the BM, and the BM does the rest, running its hidden units H until they essentially converge. Then the BM can compute the expectation by simply averaging the products X_i X_j that it observes. This part of the learning rule attempts to make the day’s patterns more likely.

Now E_{P(V,H)} [X_i X_j] is computed by disconnecting the visible vector V from the world and running the BM freely, until the states converge. This is very much like dreaming; we ask the BM to produce the patterns it truly believes in. Then the connection (i,j) computes its expectation by observing the products X_i X_j, and subtracts the resulting average from its weight. In other words, the learning rule says, “make whatever patterns we observe during sleep less likely”. As a result, the BM will not be able to easily reproduce the patterns it observed during sleep, because it unlearned them. To paraphrase: the BM forgets its dreams.

Consequently, the BM keeps on changing its weights as long as the day’s patterns are different from the sleep’s patterns, and will stop learning once the two become equal. This means that the goal of learning is to make the dreams as similar as possible to the actual patterns that are observed in reality.

It should be surprising that both dreams and the fact that they are hard to remember are a simple consequence of a simple equation.

The futility of gigantic training sets with simple models

It is believed that a simple, usually linear, model with an extra-huge training set and a gigantic feature representation is superior to a more powerful model with less data, a claim often made by Google. And indeed, large companies are able to get better results by using ever larger training sets with simple models: each order of magnitude increase in the training set results in a reasonable increase in performance.

This approach is sensible in the sense that increasing the size of the training data is essentially guaranteed to improve performance. And if I have a fast learning algorithm, thousands of cores, and lots of data, then it is conceptually trivial to use more training data, provided the learning algorithm can be parallelized without too much engineering effort. And if I needed better performance very soon, I’d do precisely that.

However, the problem with this approach is that it runs out of steam in the sense that it will not reach human level performance. The following figure illustrates the point:

In this figure, the simpler model eventually outperforms the more sophisticated one, mainly because it is easy and relatively cheap to make the simple model larger. However, simple models will necessarily fail to reach human-level performance, and the more powerful models will eventually, and certainly, outperform them. It must be so, hence QED. More seriously, no model could successfully solve tasks that involve any kind of text comprehension, for example, without first extracting a really good representation of the text’s meaning with miraculous-looking properties. And that’s something simple models don’t even try to do. By not using a good representation, the simple model falls back on its more primitive feature representation, which does not explicitly describe the higher-level concepts that are ultimately needed to solve our task.

Nonetheless, simple (i.e., linear) models have serious advantages over complex models. They are faster to train, easier to extend and understand, and their behaviour and performance are more predictable. However, it is finally becoming recognized that neural networks have the potential to be vastly more expressive than linear models without using too many parameters. And now that we are becoming better at training deep neural networks, we will see the proliferation of the more powerful multilayered perceptrons. Of course, naive multilayered perceptrons will also probably run out of steam, in which case we’ll have to design more exotic and ambitious architectures. But for now they are the simplest and the most powerful model class.

Validation Error Shape

Most learning algorithms make many small changes to the model, ensuring that each little change improves the model’s fit of the training data.  But when the model starts getting too good at the training data, its test and validation error get worse. That’s the point where we stop learning, because additional learning will improve our training error at the expense of the validation and test error.

Many machine learning courses depict a cartoon of the validation error:

Note that both the training and the validation errors are convex functions of time (if we ignore the bit in the beginning). However, if we train the model longer, we discover the following picture:

This is the real shape of the validation error. The error is almost always bounded from above, so the validation error must eventually inflect and converge. So the training error curve is convex, but the validation error curve isn’t!