Universal Approximation and Depth

Many years ago Hornik et al. proved that a neural network with a single hidden layer can approximate any continuous function from a compact domain to the reals to arbitrary precision. The paper is highly cited (search for “Multilayer feedforward networks are universal approximators” on google scholar; this paper has almost almost 1/3 of the citations of the original backpropagation paper!), and convinced many people that neural networks will work for their applications, as they can learn any function.

Sadly this result was very misleading. The result claimed that a single hidden layer neural network can approximate any function, so a one hidden layer neural network should be good for any application. However it is not an efficient approximator for the functions we care about (this claim is true but hard to defend, since it’s not so easy to describe the functions that we care about). Indeed, the universal approximation construction works by allocating a neuron to every to every small volume of the input space, and learning the correct answer for each such volume.  The problem is that the number of such small volumes grows exponentially in the dimensionality of the input space, so Hornik’s construction is exponentially inefficient and is thus not useful.  (it is worth noting that deep neural networks are not universal approximators unless they are also exponentially large, because there are many more different functions than there are small neural networks).

This caused researchers to miss out on the best feature of the neural networks: depth. By being deep, the neural network can represent functions that are computed with several steps of computation.  Deep neural networks are best thought of as constant-depth threshold circuits, and these are known to be able to compute a lot of interesting functions.  For example, a small 3-hidden layer threshold network can sort N N-bit numbers, add N such numbers, compute their product, their max,  compute any analytic function to high precision.  And it is this ability of deep neural networks to perform such interesting computations makes them useful for speech recognition and machine translation.

There is another simple reason why large but not infeasibly huge deep networks must be capable of doing well on vision and speech.  The argument is simple:  human beings can recognize an object in 100 milliseconds, which gives their neurons the opportunity to fire only 10 times during the recognition.  So there exists a parallel procedure that can recognize an object in 10 parallel steps, which means that a big 10-layer net should be good at vision and speech — and it turns out to be the case.    But if this argument is actually valid, it means that we should be able to train neural networks for any task that humans can solve quickly.  Reading emotions, recognizing faces, reading body language and vocal intonation, and some aspects of motor control come to mind.  On all these tasks, high performance is truly achievable if we have the right dataset and fast implementation of a big supervised network.

In addition, there is a well-known intuition for why deep convolutional neural networks work well for vision, and explain why shallow neural networks do not. Many believe that to recognize an object, many steps of computation should be performed. In the first step, the edges should be extracted from the image. In the second step, small parts (or edges of edges) should be computed from the edges, such as corners. In the third step, combinations of small parts should be computed from the small parts. They could be a small circle, a t-junction, or some other visual entity. The idea is to extract progressively more abstract and specific units at each step. If you found this description difficult to follow, here are some images of various object recognition systems, all of which work roughly on the principle of extracting larger parts from smaller ones.

 (from http://www.kip.uni-heidelberg.de/cms/vision/projects/recent_projects/hardware_perceptron_systems/image_recognition_with_hardware_neural_networks/)

(from http://www.sciencedirect.com/science/article/pii/S0031320308004603)

(from http://journal.mercubuana.ac.id/data/Hierarchical-models-of-object-recognition-in-cortex.pdf)

By being deep, the convolutional neural network can implement this multistep process. The depth of the convolutional neural network allows each of its layers to compute larger and more elaborate object parts, so that the deepest layers compute specific objects. And its large number of parameters and units allows it to do so robustly, provided that we manage to find the appropriate network parameters.

Something similar must be going on with speech recognition, where deep networks make a very big difference compared to shallow ones, so it is likely that speech recognition consists of breaking speech up into small “parts”, and increasing their complexity at each layer, which cannot be done with a shallow network.


  1. Posted September 19, 2013 at 6:25 pm | Permalink | Reply

    Interesting post! After reading Bengio’s paper “The Curse of Highly Variable Functions for Local Kernel Machine”, I thought the benefit of doing deep learning is because the higher layer units play the role as a shorter description and the hierarchical layerwise description is more efficient when the function is complicated. For example, if the true discriminative function is the coloring function of Mandelbrot set, then function interpolation methods based on local smoothness may not work well. If there is any reason that deep learning would work well with less observed input output pairs, it must be the explanation that deep learning finds the short description very like how the Mandelbrot set is actually generated. In this context, it looks to me that deep learning is better at it’s generalization because of the specific architecture used. However, over the years from what I observed, it seems that deep learning works well only when there are lots of data. Does it mean currently we are only utilizing the “better expressiveness” property of deep networks rather than it’s “better generalization with less points” property?

  2. ilyasu1
    Posted September 23, 2013 at 2:38 pm | Permalink | Reply

    I apologize for the delay in my reply. It’s not true that deep neural networks are better at *generic* complicated functions, as there are plenty of complicated functions (the vast majority of them, in fact) that deep neural networks cannot represent at all. What really happens is that, for perception problems, we have intuitive arguments that suggest that a computation with a small number of large parallel steps is *capable* of solving the problem. In this post I tried to argue, weakly, that it is the case for vision, where you can recognize objects by first computing edges, then combinations of edges, etc, until you reach units that represent the presence of specific objects robustly and reliably. So if we believe (for whatever reason) that our problem can be solved with a modest number of large parallel steps, then we know that a big neural network with ~10 layers is likely to achieve good performance, and all that remains is to train it. And while these big neural networks need a fairly large amount of labelled data, they don’t need an exponential amount. That’s the key difference between neural networks and kernel methods, at least those that do not use excellent input features (and getting good input features is the hard part).

    I would also disagree that the specific architecture is important. The architecture could be any circuit with enough layers. So, for example, the nonlinearity is not important from the expressiveness point of view (although it has an effect on the speed of SGD).

  3. Filip
    Posted August 25, 2017 at 5:28 pm | Permalink | Reply

    ilyasu1, you say in the second comment that there are functions that deep networks cannot represent at all. Could you give some example o explanation of why? Tanks

3 Trackbacks

  1. By Deep Learning oral traditions - Strata on October 20, 2013 at 4:01 pm

    […] on most machine-learning tasks. However this universal approximation property came at a steep cost: the requisite (single hidden layer) neural networks were exponentially inefficient to construct (you needed a neuron for every possible input). For a while neural networks took a backseat to more […]

  2. By Deep Learning oral traditions | The Gradient Flow on January 20, 2014 at 2:38 am

    […] on most machine-learning tasks. However this universal approximation property came at a steep cost: the requisite (single hidden layer) neural networks were exponentially inefficient to construct (you needed a neuron for every possible input). For a while neural networks took a backseat to more […]

  3. By Deep Learning oral traditions - O'Reilly Radar on July 28, 2014 at 7:24 pm

    […] on most machine-learning tasks. However this universal approximation property came at a steep cost: the requisite (single hidden layer) neural networks were exponentially inefficient to construct (you needed a neuron for every possible input). For a while neural networks took a backseat to more […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: