In 1989, George Cybenko gave a rigorous proof that certain
feedforward neural networks can approximate any continuous
function [1]. This theorem is commonly
called the universal approximation theorem. In this post, we'll
be presenting his proof.
For this theorem, we will be considering neural networks
with a single hidden layer and one output layer (see Figure
1). A neuron in the hidden layer computes a function of the
form $\sigma (\mathbf{w} \cdot \mathbf{x} + \theta)$.
Here, we take the dot product between $\mathbf{w}$, the
weights, and $\mathbf{x}$, the input. The variable $\theta$
is the bias and $\sigma$ is some activation function that
we'll discuss later. The single output neuron has a linear
activation function. Altogether, a neural network with
$n$ inputs and $N$ neurons in the hidden layer is of the
form
$$
G(\mathbf{x}) = \sum_{j=1}^{N} \alpha_j \sigma(\mathbf{w}_j \cdot \mathbf{x} + \theta_j),
$$
where the $\alpha_j$ are the weights of the output neuron.
Figure 1 - A neural network with five inputs and one
hidden layer with three neurons.
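To make the formula concrete, here is a minimal NumPy sketch of such a network. The logistic function for $\sigma$ and the random parameters are illustrative choices, not anything prescribed by the theorem:

```python
import numpy as np

def sigmoid(t):
    """Logistic activation, one example of a sigmoidal function."""
    return 1.0 / (1.0 + np.exp(-t))

def network(x, W, thetas, alphas):
    """Single-hidden-layer network G(x) = sum_j alpha_j * sigma(w_j . x + theta_j).

    x:      input vector of length n
    W:      (N, n) array whose rows are the hidden weights w_j
    thetas: length-N array of hidden biases theta_j
    alphas: length-N array of output weights alpha_j
    """
    hidden = sigmoid(W @ x + thetas)  # hidden-layer activations
    return alphas @ hidden            # linear output neuron

# A toy instance with n = 5 inputs and N = 3 hidden neurons, as in Figure 1.
rng = np.random.default_rng(0)
x = rng.random(5)
W = rng.normal(size=(3, 5))
thetas = rng.normal(size=3)
alphas = rng.normal(size=3)
print(network(x, W, thetas, alphas))
```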
Measure Theory Background
For this article, we will strictly be focusing on the space
$I_n = [0,1]^n$. A $\sigma$-algebra $\Sigma$ on $I_n$
is a collection of subsets of $I_n$ such that
$I_n$ is in $\Sigma$,
If $A$ is in $\Sigma$, then the complement $I_n \setminus A$ is
in $\Sigma$, and
If $A_1, A_2, A_3, \dots$ are in $\Sigma$, then
$\bigcup_{i = 1}^{\infty}A_i$ is in $\Sigma$.
The sets contained in $\Sigma$ are called measurable.
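These axioms are easy to test mechanically on a finite toy universe, where countable unions reduce to finite ones. The sketch below is purely illustrative (the universe and the collections of subsets are my own toy choices, not anything from $I_n$ itself):

```python
def is_sigma_algebra(universe, collection):
    """Check the sigma-algebra axioms for a collection of subsets
    of a finite universe. With finitely many sets, closure under
    pairwise unions (plus complements) implies closure under all
    countable unions, so this check suffices."""
    if universe not in collection:
        return False                      # axiom 1: the whole space belongs
    for A in collection:
        if universe - A not in collection:
            return False                  # axiom 2: closed under complements
    for A in collection:
        for B in collection:
            if A | B not in collection:
                return False              # axiom 3: closed under unions
    return True

U = frozenset({1, 2, 3, 4})
print(is_sigma_algebra(U, {frozenset(), U}))                                        # True
print(is_sigma_algebra(U, {frozenset(), frozenset({1}), frozenset({2, 3, 4}), U}))  # True
print(is_sigma_algebra(U, {frozenset(), frozenset({1}), U}))                        # False
```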
We will be particularly interested in the Borel sets. A
subset of $I_n$ is called a Borel set if it is an
element of the smallest $\sigma$-algebra containing the open
sets of $I_n$.
A finite signed Borel measure $\mu$ on $I_n$ is a
real valued, countably additive function defined on the Borel sets of
$I_n$. Such a measure is called regular if, for every Borel set $A$, it
satisfies the two properties:
$$
\begin{array}{ll} \mu(A) = \sup\{\mu(F) \mid F \subseteq A,\ F
\mbox{ is compact}\} \\ \mu(A) = \inf\{\mu(G) \mid A \subseteq
G,\ G \mbox{ is open}\} \end{array}.
$$
Let $M(I_n)$ be the set of all finite signed regular
Borel measures on $I_n = [0,1]^n$. Now we're all set to get
started.
Proof of the Universal Approximation Theorem
We'll denote the space of all continuous functions on $I_n$ with norm
$||f|| = \sup_{\mathbf{x} \in I_n}|f(\mathbf{x})|$ as $C^0(I_n)$.
Definition.
We say that $\sigma \colon \mathbb{R} \to \mathbb{R}$ is
sigmoidal if
$$
\sigma(t) \to \left\{ \begin{array}{ll} 1 & \mbox{as } t \to +\infty \\
0 & \mbox{as } t \to -\infty \\ \end{array}\right. .
$$
Definition.
We say that $\sigma$ is discriminatory if the only measure
$\mu \in M(I_n)$ satisfying
$$
\int_{I_n} \sigma(\mathbf{w} \cdot \mathbf{x} + \theta)\, d\mu(\mathbf{x}) = 0
\mbox{ for all } \mathbf{w} \in \mathbb{R}^n \mbox{ and } \theta \in \mathbb{R}
$$
is the zero measure $\mu = 0$.
Theorem.
Let $\sigma$ be any continuous discriminatory function, and let
$\mathcal{N}$ be the set of all functions $G$ of the form above. Then
the functions in $\mathcal{N}$
are dense in $C^0(I_n)$. What is meant by dense is the
following:
For any $f \in C^0(I_n)$, and $\epsilon > 0$, there exists
a $G \in \mathcal{N}$ such that
$$
|G(x) - f(x)| < \epsilon \mbox{ for all } x \in I_n.
$$
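The theorem is an existence statement, not an algorithm, but a quick numerical experiment makes the density plausible. In the sketch below the hidden weights are drawn at random and only the output weights $\alpha_j$ are fit by least squares; the target function $\sin(2\pi x)$ and all parameter ranges are my own illustrative choices, not Cybenko's construction:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Target continuous function on I_1 = [0, 1].
f = lambda x: np.sin(2 * np.pi * x)
x = np.linspace(0.0, 1.0, 200)

N = 50                                # number of hidden neurons
w = rng.normal(scale=10.0, size=N)    # hidden weights w_j
theta = rng.uniform(-10.0, 10.0, N)   # hidden biases theta_j

Phi = sigmoid(np.outer(x, w) + theta)               # Phi[i, j] = sigma(w_j * x_i + theta_j)
alpha, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)  # fit output weights alpha_j

G = Phi @ alpha
print("max |G - f| on grid:", np.max(np.abs(G - f(x))))
```

Increasing $N$ drives the error down, which is exactly the behavior the theorem guarantees is always possible.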
Proof.
The set $\mathcal{N}$ is a linear subspace of $C^0(I_n)$, so it
suffices to show that the closure of $\mathcal{N}$,
$\bar{\mathcal{N}}$, is all of $C^0(I_n)$.
Assume that $\bar{\mathcal{N}} \neq C^0(I_n)$. By
the
Hahn-Banach theorem there exists a bounded
linear functional $L$ on $C^0(I_n)$ with the property that
$L \neq 0$, but $L(\bar{\mathcal{N}}) = L(\mathcal{N}) = 0$. By
the
Riesz Representation Theorem, $L$ is of the form
$$
L(f) = \int_{I_n} f(\mathbf{x})\, d\mu(\mathbf{x})
$$
for some $\mu \in M(I_n)$. In particular, since every function
$\sigma(\mathbf{w} \cdot \mathbf{x} + \theta)$ belongs to
$\mathcal{N}$, we must have
$$
\int_{I_n} \sigma(\mathbf{w} \cdot \mathbf{x} + \theta)\, d\mu(\mathbf{x}) = 0
\mbox{ for all } \mathbf{w} \mbox{ and } \theta.
$$
Because $\sigma$ is discriminatory, $\mu = 0$, which is a
contradiction to the fact that $L \neq 0$. Therefore,
$\bar{\mathcal{N}} = C^0(I_n)$ and the proof is complete.
This theorem assumes that the activation function $\sigma$
is a discriminatory function. The next lemma will show that
nice enough sigmoidal functions (such as the logistic
function) are discriminatory.
Lemma.
Every
bounded, measurable, sigmoidal function $\sigma \colon
\mathbb{R} \to \mathbb{R}$ is discriminatory.
Proof.
Consider the function
$$
\sigma_\lambda(\mathbf{x}) = \sigma(\lambda(\mathbf{w} \cdot \mathbf{x} + \theta) + \varphi)
$$
for $\lambda, \varphi \in \mathbb{R}$. Because $\sigma$ is sigmoidal,
as $\lambda \to \infty$ we have, pointwise,
$$
\sigma_\lambda(\mathbf{x}) \to \left\{ \begin{array}{ll} 1 &
\mathbf{w} \cdot \mathbf{x} + \theta > 0 \\ 0 & \mathbf{w} \cdot
\mathbf{x} + \theta < 0 \\ \sigma(\varphi) & \mathbf{w} \cdot
\mathbf{x} + \theta = 0 \\ \end{array}\right. .
$$
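You can watch this limit happen numerically. The sketch below uses the logistic $\sigma$ and arbitrary values of $\mathbf{w}$, $\theta$, and $\varphi$ of my own choosing, with one test point on each side of the hyperplane and one on it:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
w, theta, phi = np.array([1.0, -2.0]), 0.5, 0.3

# Three points: w.x + theta positive, negative, and exactly zero.
pts = [np.array([1.0, 0.2]),   # w.x + theta =  1.1 > 0
       np.array([0.0, 1.0]),   # w.x + theta = -1.5 < 0
       np.array([1.5, 1.0])]   # w.x + theta =  0

for lam in [1.0, 10.0, 1000.0]:
    vals = [sigmoid(lam * (w @ x + theta) + phi) for x in pts]
    print(lam, np.round(vals, 4))
# As lambda grows, the three values approach 1, 0, and sigmoid(phi) ~ 0.5744.
```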
Now let $\Pi_{\mathbf{w},\theta}$ be the hyperplane
$\{\mathbf{x} \mid \mathbf{w} \cdot \mathbf{x} + \theta =
0 \}$ and let $H_{\mathbf{w},\theta}$ be the half-space
$\{\mathbf{x} \mid \mathbf{w} \cdot \mathbf{x} + \theta
> 0 \}$. To prove the lemma, we assume that
$$
\int_{I_n} \sigma(\mathbf{w} \cdot \mathbf{x} + \theta)\, d\mu(\mathbf{x}) = 0
\mbox{ for all } \mathbf{w} \mbox{ and } \theta
$$
and show that $\mu$ must be the zero measure. This assumption implies
that $\int_{I_n} \sigma_\lambda(\mathbf{x})\, d\mu(\mathbf{x}) = 0$ for
all $\lambda$ and $\varphi$. The functions $\sigma_\lambda$ are
uniformly bounded, so by the Lebesgue bounded convergence theorem,
$$
0 = \lim_{\lambda \to \infty} \int_{I_n} \sigma_\lambda(\mathbf{x})\,
d\mu(\mathbf{x}) = \sigma(\varphi)\mu(\Pi_{\mathbf{w},\theta}) +
\mu(H_{\mathbf{w},\theta}) \mbox{ for all } \varphi.
$$
Letting $\varphi \to +\infty$ and then $\varphi \to -\infty$ gives
$\mu(\Pi_{\mathbf{w},\theta}) = \mu(H_{\mathbf{w},\theta}) = 0$ for
every $\mathbf{w}$ and $\theta$, so $\mu$ vanishes on every hyperplane
and every half-space. We now show this forces $\mu = 0$. Fix
$\mathbf{w}$ and define the linear functional $F$ on bounded measurable
functions $h \colon \mathbb{R} \to \mathbb{R}$ by
$$
F(h) = \int_{I_n} h(\mathbf{w} \cdot \mathbf{x})\, d\mu(\mathbf{x}).
$$
The functional $F$ is bounded on the space
$L^{\infty}(\mathbb{R})$ as $\mu$ is a finite signed
measure. Let $h(t)$ be the indicator function on
$[\theta, \infty)$. That is
$$
h(t) = \left\{ \begin{array}{ll} 1 & t \ge \theta \\ 0
& t < \theta \\ \end{array}\right. .
$$
Then
$$
F(h) = \int_{I_n} h(\mathbf{w} \cdot \mathbf{x})\, d\mu(\mathbf{x}) =
\mu(\Pi_{\mathbf{w},-\theta}) + \mu(H_{\mathbf{w},-\theta}) = 0.
$$
A similar result holds if $h$ is the indicator function for
$(\theta, \infty)$. By linearity, $F(h) = 0$ for the
indicator of any interval, and thus for any simple function.
Because simple functions are dense in
$L^{\infty}(\mathbb{R})$, $F = 0$ on $L^{\infty}(\mathbb{R})$.
Now notice that for the bounded measurable functions $\sin$ and $\cos$,
$$
F(\cos(\cdot)) + iF(\sin(\cdot)) = \int_{I_n} \cos(\mathbf{w} \cdot
\mathbf{x}) + i\sin(\mathbf{w} \cdot \mathbf{x})\, d\mu(\mathbf{x}) =
\int_{I_n} e^{i\mathbf{w} \cdot \mathbf{x}}\, d\mu(\mathbf{x}) = 0
\mbox{ for all } \mathbf{w}.
$$
In other words, the Fourier transform of $\mu$ is zero, and therefore
$\mu$ must be the zero measure. This shows that $\sigma$ is
discriminatory and completes the proof.