## The useless beauty of Reinforce

Reinforce is one of my favorite algorithms in machine learning. It’s useful for reinforcement learning.

The formal goal of reinforce is to maximize an expectation, $\sum_x p_\theta(x)r(x)$, where $r(x)$ is the reward function and $p_\theta(x)$ is a distribution. To apply reinforce, all we need is to be able to sample from $p(x)$ and to evaluate the reward $r(x)$, which is really nothing.

This is the case because of the following simple bit of math:

$\nabla_\theta \sum_x p_\theta(x) r(x) = \sum_x p_\theta(x)\nabla_\theta \log p_\theta(x) r(x)$

which clearly shows that to estimate the gradient wrt the expected reward, we merely need to sample from $p(x)$ and weigh the gradient $\nabla_\theta \log p_\theta(x)$ by the reward $r(x)$.

Reinforce is so tantalizing because sampling from a distribution is very easy. For example, the distribution $p(x)$ could be the combination of a parametrized control policy and the actual responses of the real world to our actions: $x$ could be the sequence of states and actions chosen by our policy and the environment, so

$p(x)=\prod_t p_{\textrm{world}}(x_t|x_{t-1},a_t)p_\theta(a_t|x_t)$,

and only part of $p(x)$ is parameterized by $\theta$.

Similarly, $r(x)$ is obviously easily computable from the environment.

So reinforce is dead easy to apply: for example, it could be applied to a robot’s policy. To sample from our distribution, we’d run the policy, get the robot to interact with our world, and collect our reward. And we’ll get an unbiased estimate of the gradient, and presto: we’d be doing stochastic gradient descent on the policy’s parameters.

Unfortunately, this simple approach is not so easy to apply. The problem lies in the huge variance of our estimator $\nabla_\theta p(x) r(x)$. It is easy to see, intuitively, where this variance comes from. Reinforce obtains its learning signal from the noise in its policy distribution $p_\theta$. In effect, reinforce makes a large number of random choices through the randomness in its policy distribution, and if they do better than average, then we’ll change our parameters so that these choices are more likely in the future. Similarly, if the random choices end up doing worse than random, the model will try to avoid choosing this specific configuration of actions.

($\sum_x p_\theta(x)\nabla_\theta \log p_\theta(x) r(x)$ = $\sum_x p_\theta(x)\nabla_\theta \log$ $p_\theta(x)(r(x)-r_\textrm{avg})$ because $\sum_x p(x)\nabla \log p(x)=0$. There is a simple formula for choosing the optimal $r_\textrm{avg}$).

To paraphrase, reinforce adds a small perturbation to the choices and sees if the total reward has improved. If we’re trying to do something akin to supervised learning with reinforce with a small label set, then reinforce won’t do so terribly: each action would be a classification, and we’d have a decent chance to guess the correct answer. So we’ll be guessing the correct answer quite often, which will supply our neural net with training signal, and learning will succeed.

However, it’s completely hopeless to train a system that makes a large number of decisions with a tiny reward in the end.

On a more optimistic note, large companies that deploy millions of robots could refine their robot’s policies with large scale reinforce. During the day, the robots will collect the data for the policy, and during the night, the policy will be updated.