The Miracle of the Boltzmann Machine

The Boltzmann Machine, invented by my adviser, is fairly well-known in machine learning. But the Boltzmann Machine (BM) is also the only model that actually uses dreams for a specific computational purpose, as opposed to all other models of sleep, that use it in a more ad-hoc way (e.g., to “remember the day’s events better”, or to “forget unwanted thoughts”, and other claims of this kind). In addition, the BM also forgets its dreams. Just like humans! The BM forgets its dreams due to a differentiation of a simple equation that has apparently nothing to do with sleep. It is so remarkable that I believe the Boltzmann Machine to be “right” in a very essential way.

Unfortunately the Boltzmann Machine can only be understood by using Math. So you’ll like it if you know math too.

The Boltzmann Machine defines a probability distribution over the set of possible visible binary vectors $V$ and hidden binary vectors $H$. The intended analogy is that $V$ is an observation, say the pixels on the retina, and $H$ is the joint activity of all the neurons inside the brain. We’ll also denote the concatenation of $V$ and $H$ by $X$, so $X=(V,H)$. The Boltzmann Machine defines a probability distribution over the configurations $X=(V, H)$ by the equation

$P(X)=\displaystyle \frac{\exp(X^\top W X/2)}{\sum_{X'} \exp(X'^\top W X'/2)}$

So different choices of the matrix $W$ yield different distributions $P(X)$.

The BM makes observations about the world, which are summarized by the world distribution over the visible vectors $D(V)$. For example, $D(V)$ could be the distribution of all the images we’ve seen during the day (so $V$ is a binary image).

The following point is a little hard to justify. We define the goal of learning by the objective function

$L(W)=\displaystyle E_{D(V)} [\log P(V)]$,

where $P(V)=\sum_{H} P(V,H)$ and $E_{D(V)} [f(V)] = \sum_V D(V)f(V)$. In other words, the goal of learning is to find a BM that assigns high log probability to the kind of data we typically observe in the world. This learning rule makes some intuitive sense, because negative log probability can be interpreted as a measure of surprise. So if our BM isn’t surprised by the real world, then it must be doing something sensible. It is hard to fully justify this learnnig objective, because a BM that’s not surprised by the world isn’t obviously useful for other tasks. But we’ll just accept this assumption and see where it leads us.

So we want to find the parameter setting of $W$ that maximizes the objective $L(W)$, which is something we can approach with gradient ascent: we’d iteratively compute the gradient $\partial L/\partial W$, and change $W$ slightly in the direction of the gradient. In other words, if we change our weights by

$\Delta W_{ij} = \varepsilon \partial L/\partial W_{ij}$,

then we’re guaranteed to increase the value of the objective slightly. Do it enough times and our objective $L$ will be in a good place. And finally, here is the promised math:

$\partial L/\partial W_{ij} = E_{D(V)P(H|V)} [X_i X_j] - E_{P(V,H)}[X_i X_j]$.

If you’re into differentiation you could verify the above yourself (remember that $E_{P(V,H)}[X_i,X_j] = \sum_{(V,H)} P(V,H) X_i X_j$, and similarly for $D(V)P(H|V)$) .

We’re finally ready for the magical interpretation of the above equation. The equation states that the weight $W_{ij}$ should change according to the difference of two averages: the average of the products $X_i X_j$ according to $D(V)P(H|V)$ and according to $P(V,H)$.

But first, notice that $X_i X_j$ is the product of the two neurons at the ends of the connection $(i,j)$. So the connection will have little trouble detecting when the product is equal to 1 from local information (remember, our vectors live in $0,1$).

More significantly, we can compute the expectation $E_{D(V)P(H|V)} [X_i X_j]$ by taking the observed data from $D(V)$, “clamping” it onto the visible units $V$, and “running” the BM’s neurons until their states converge to equilibrium. All these terms can be made precise in a technical sense. But the important analogy here is that during the day, the world sets the visible vectors $V$ of the BM, and it does the rest, running its hidden units $H$ until they essentially converge. Then the BM can compute the expectation by simply averaging the products $X_i X_j$ that it observes. This part of the learning rule attempts to make the day’s patterns more likely.

Now $E_{P(V,H)} [X_i X_j]$ is computed by disconnecting the visible vector $V$ from the world and running the BM freely, until the states converge. This is very much like dreaming; we ask the BM to produce the patterns it truly believes in. Then the connection $(i,j)$ computes its expectation by observing the products $X_i X_j$, and subtracts the resulting average from the connections. In other words, the learning rule says, “make whatever patterns we observe during sleep less likely”. As a result, the BM will not be able to easily reproduce the patterns it observed during the sleep, because it unlearned them. To paraphrase: the BM forgets its dreams.

Consequently, the BM keep on changing its weights as long as the day’s patterns are different from the sleep’s patterns, and will stop learning once they two become equal. This means that the goal of learning is to make the dreams as similar as possible to the actual patterns that are observed in reality.

It should be surprising that both dreams and the fact that they are hard to remember are a simple consequence of a simple equation.

1. Brian Vandenberg

I ran across a video on youtube awhile back (http://www.youtube.com/watch?v=UyPrL0cmJRs&feature=relmfu) where the author talks quite a bit about [theoretical?] understanding of how the brain works/learns. There was a lot of crossover in the concepts he discussed and how unsupervised methods such as boltzmann machines work. In particular, I found the part about baby’s brains quite fascinating.

2. ilyasu1