Many years ago Hornik et al. proved that a neural network with a single hidden layer can approximate any continuous function from a compact domain to the reals to arbitrary precision. The paper is highly cited (search for “Multilayer feedforward networks are universal approximators” on google scholar; this paper has almost almost 1/3 of the citations of the original backpropagation paper!), and convinced many people that neural networks will work for their applications, as they can learn any function.
Sadly this result was very misleading. The result claimed that a single hidden layer neural network can approximate any function, so a one hidden layer neural network should be good for any application. However it is not an efficient approximator for the functions we care about (this claim is true but hard to defend, since it’s not so easy to describe the functions that we care about). Indeed, the universal approximation construction works by allocating a neuron to every to every small volume of the input space, and learning the correct answer for each such volume. The problem is that the number of such small volumes grows exponentially in the dimensionality of the input space, so Hornik’s construction is exponentially inefficient and is thus not useful. (it is worth noting that deep neural networks are not universal approximators unless they are also exponentially large, because there are many more different functions than there are small neural networks).
This caused researchers to miss out on the best feature of the neural networks: depth. By being deep, the neural network can represent functions that are computed with several steps of computation. Deep neural networks are best thought of as constant-depth threshold circuits, and these are known to be able to compute a lot of interesting functions. For example, a small 3-hidden layer threshold network can sort N N-bit numbers, add N such numbers, compute their product, their max, compute any analytic function to high precision. And it is this ability of deep neural networks to perform such interesting computations makes them useful for speech recognition and machine translation.
There is another simple reason why large but not infeasibly huge deep networks must be capable of doing well on vision and speech. The argument is simple: human beings can recognize an object in 100 milliseconds, which gives their neurons the opportunity to fire only 10 times during the recognition. So there exists a parallel procedure that can recognize an object in 10 parallel steps, which means that a big 10-layer net should be good at vision and speech — and it turns out to be the case. But if this argument is actually valid, it means that we should be able to train neural networks for any task that humans can solve quickly. Reading emotions, recognizing faces, reading body language and vocal intonation, and some aspects of motor control come to mind. On all these tasks, high performance is truly achievable if we have the right dataset and fast implementation of a big supervised network.
In addition, there is a well-known intuition for why deep convolutional neural networks work well for vision, and explain why shallow neural networks do not. Many believe that to recognize an object, many steps of computation should be performed. In the first step, the edges should be extracted from the image. In the second step, small parts (or edges of edges) should be computed from the edges, such as corners. In the third step, combinations of small parts should be computed from the small parts. They could be a small circle, a t-junction, or some other visual entity. The idea is to extract progressively more abstract and specific units at each step. If you found this description difficult to follow, here are some images of various object recognition systems, all of which work roughly on the principle of extracting larger parts from smaller ones.
By being deep, the convolutional neural network can implement this multistep process. The depth of the convolutional neural network allows each of its layers to compute larger and more elaborate object parts, so that the deepest layers compute specific objects. And its large number of parameters and units allows it to do so robustly, provided that we manage to find the appropriate network parameters.
Something similar must be going on with speech recognition, where deep networks make a very big difference compared to shallow ones, so it is likely that speech recognition consists of breaking speech up into small “parts”, and increasing their complexity at each layer, which cannot be done with a shallow network.