The Universal Approximation Theorem


filed under: #machineLearning #neuralNetworks #AI

In 1989, George Cybenko gave a rigorous proof that feedforward neural networks with a single hidden layer can approximate any continuous function on $[0,1]^n$ to arbitrary accuracy [1]. This result is commonly called the universal approximation theorem. In this post, we present his proof.

For this theorem, we will be considering neural networks with a single hidden layer and a single output neuron (see Figure 1). A neuron in the hidden layer computes a function of the form $\sigma (\mathbf{w} \cdot \mathbf{x} + \theta)$, where $\mathbf{w}$ is a vector of weights, $\mathbf{x}$ is the input, $\theta$ is a bias, and $\sigma$ is an activation function that we'll discuss later. The output neuron has a linear activation function. Altogether, a neural network with $n$ inputs and $N$ neurons in the hidden layer computes a function of the form

$$ G(\mathbf{x}) = \sum_{j=1}^N\alpha_j \sigma(\mathbf{w}_j \cdot \mathbf{x} + \theta_j). $$

Figure 1 - A neural network with five inputs and one hidden layer with three neurons.
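
To fix the notation concretely, here is a minimal NumPy sketch of $G$. The identifiers (`sigma`, `G`, `W`, `theta`, `alpha`) are our own choices for illustration, and we use the logistic function for $\sigma$, which is introduced below.

```python
import numpy as np

def sigma(t):
    # Activation function; here the logistic sigmoid (discussed below).
    return 1.0 / (1.0 + np.exp(-t))

def G(x, W, theta, alpha):
    # x:     input vector, shape (n,)
    # W:     weight matrix, shape (N, n); row j is w_j
    # theta: bias vector, shape (N,)
    # alpha: output weights, shape (N,)
    return alpha @ sigma(W @ x + theta)

# Example matching Figure 1: n = 5 inputs, N = 3 hidden neurons.
rng = np.random.default_rng(0)
W, theta, alpha = rng.normal(size=(3, 5)), rng.normal(size=3), rng.normal(size=3)
x = rng.random(5)  # a point in I_5 = [0, 1]^5
print(G(x, W, theta, alpha))
```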

Measure Theory Background

For this article, we will strictly be focusing on the space $I_n = [0,1]^n$. A $\sigma$-algebra $\Sigma$ on $I_n$ is a collection of subsets of $I_n$ such that

  1. $I_n$ is in $\Sigma$,
  2. If $A$ is in $\Sigma$, then the complement $I_n \setminus A$ is in $\Sigma$, and
  3. If $A_1, A_2, A_3, \dots$ are in $\Sigma$, then $\bigcup_{i = 1}^{\infty}A_i$ is in $\Sigma$.

The sets contained in $\Sigma$ are called measurable. We will be particularly interested in the Borel sets. A subset of $I_n$ is called a Borel set if it is an element of the smallest $\sigma$-algebra containing the open sets of $I_n$.

A finite signed Borel measure $\mu$ on $I_n$ is a real-valued function defined on the Borel sets of $I_n$ that satisfies the two properties:

  1. $\mu(\emptyset) = 0$, and
  2. $\mu\left(\bigcup_{i = 1}^{\infty}A_i\right) = \sum_{i = 1}^\infty\mu(A_i)$ whenever $A_1, A_2, A_3, \dots$ are pairwise disjoint Borel sets.

A Borel measure $\mu$ is regular if the following hold for every Borel set $A$:
$$ \begin{array}{l} \mu(A) = \sup\{\mu(F) \mid F \subseteq A,\ F \mbox{ is compact}\}, \\ \mu(A) = \inf\{\mu(G) \mid A \subseteq G,\ G \mbox{ is open}\}. \end{array} $$
Let $M(I_n)$ be the set of all finite signed regular Borel measures on $I_n = [0,1]^n$. With these definitions in hand, we're all set to get started.

Proof of the Universal Approximation Theorem

We'll write $C^0(I_n)$ for the space of all continuous functions on $I_n$, equipped with the supremum norm $\|f\| = \sup_{\mathbf{x} \in I_n}|f(\mathbf{x})|$.

Definition. We say that $\sigma \colon \mathbb{R} \to \mathbb{R}$ is sigmoidal if
$$ \lim_{t \to \infty}\sigma(t) = 1 \mbox{ and } \lim_{t \to -\infty}\sigma(t) = 0. $$

An example of a sigmoidal function is the logistic function

$$ \sigma(t) = \frac{1}{1+e^{-t}}. $$
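
As a quick sanity check that the logistic function satisfies both limits (a trivial numerical illustration, not part of the argument):

```python
import numpy as np

def sigma(t):
    # The logistic function.
    return 1.0 / (1.0 + np.exp(-t))

print(sigma(np.array([-30.0, 0.0, 30.0])))
# -> roughly [0.0, 0.5, 1.0]: sigma(t) -> 0 as t -> -inf and -> 1 as t -> +inf.
```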

Definition. We say that $\sigma \colon \mathbb{R} \to \mathbb{R}$ is discriminatory if the only measure $\mu \in M(I_n)$ satisfying
$$ \int_{I_n} \sigma(\mathbf{w} \cdot \mathbf{x} + \theta)\mathrm{d}\mu(\mathbf{x}) = 0 $$
for all $\mathbf{w} \in \mathbb{R}^n$ and all $\theta \in \mathbb{R}$ is the zero measure $\mu = 0$.

Theorem. Let $\sigma$ be any continuous discriminatory function. Then the set $\mathcal{N}$ of functions of the form
$$ G(\mathbf{x}) = \sum_{j=1}^N\alpha_j \sigma(\mathbf{w}_j \cdot \mathbf{x} + \theta_j) $$
is dense in $C^0(I_n)$. Here, dense means the following:

For any $f \in C^0(I_n)$ and any $\epsilon > 0$, there exists a $G \in \mathcal{N}$ such that

$$ |G(\mathbf{x}) - f(\mathbf{x})| < \epsilon \mbox{ for all } \mathbf{x} \in I_n. $$
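
The proof below is non-constructive, but we can illustrate the density claim numerically: fix random hidden weights and biases, and fit only the output weights $\alpha_j$ by least squares. This fitting recipe, the target function, and all the parameter values here are our own arbitrary choices for illustration, not part of Cybenko's argument.

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
N = 50                                # number of hidden neurons
x = np.linspace(0.0, 1.0, 200)        # grid of sample points in I_1 = [0, 1]
f = np.sin(2 * np.pi * x) + x**2      # an arbitrary continuous target

# Random hidden-layer parameters; only the output weights alpha are fitted.
w = rng.normal(scale=10.0, size=N)
theta = rng.uniform(-10.0, 10.0, size=N)
H = sigma(np.outer(x, w) + theta)     # H[i, j] = sigma(w_j * x_i + theta_j)
alpha, *_ = np.linalg.lstsq(H, f, rcond=None)

print(np.max(np.abs(H @ alpha - f)))  # sup error on the grid; shrinks as N grows
```

Note that least squares minimizes the $L^2$ error rather than the supremum norm in the theorem, so this is only a heuristic demonstration that sums of sigmoids are flexible enough in practice.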

Proof. The set $\mathcal{N}$ is a linear subspace of $C^0(I_n)$, so it suffices to show that the closure $\bar{\mathcal{N}}$ of $\mathcal{N}$ is all of $C^0(I_n)$.

Assume that $\bar{\mathcal{N}} \neq C^0(I_n)$. Then $\bar{\mathcal{N}}$ is a closed proper subspace of $C^0(I_n)$, so by the Hahn-Banach theorem there exists a bounded linear functional $L$ on $C^0(I_n)$ with the property that $L \neq 0$ but $L(\bar{\mathcal{N}}) = L(\mathcal{N}) = 0$. By the Riesz representation theorem, $L$ is of the form

$$ L(h) = \int_{I_n}h(\mathbf{x})\mathrm{d}\mu(\mathbf{x}), $$
for some $\mu \in M(I_n)$ and for all $h \in C^0(I_n)$. Since each function $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x} + \theta)$ belongs to $\mathcal{N}$, we must have, for all $\mathbf{w} \in \mathbb{R}^n$ and $\theta \in \mathbb{R}$,
$$ \int_{I_n} \sigma(\mathbf{w} \cdot \mathbf{x} + \theta) \mathrm{d}\mu(\mathbf{x}) = 0. $$
Because $\sigma$ is discriminatory, $\mu = 0$, which contradicts the fact that $L \neq 0$. Therefore, $\bar{\mathcal{N}} = C^0(I_n)$ and the proof is complete.

This theorem assumes that the activation function $\sigma$ is discriminatory. The next lemma shows that every bounded, measurable sigmoidal function (such as the logistic function) is discriminatory.

Lemma. Every bounded, measurable, sigmoidal function $\sigma \colon \mathbb{R} \to \mathbb{R}$ is discriminatory.

Proof. Fix $\mathbf{w} \in \mathbb{R}^n$ and $\theta \in \mathbb{R}$, and for $\lambda, \varphi \in \mathbb{R}$ define $\sigma_\lambda$ to be the function
$$ \sigma_\lambda(\mathbf{x}) = \sigma(\lambda(\mathbf{w} \cdot \mathbf{x} + \theta) + \varphi). $$
Now define $\gamma(\mathbf{x})$ to be
$$ \gamma(\mathbf{x}) = \left\{ \begin{array}{ll} 1 & \mathbf{w} \cdot \mathbf{x} + \theta > 0, \\ 0 & \mathbf{w} \cdot \mathbf{x} + \theta < 0, \\ \sigma(\varphi) & \mathbf{w} \cdot \mathbf{x} + \theta = 0. \end{array} \right. $$
By the sigmoidal limits of $\sigma$, we have $\lim_{\lambda \to \infty} \sigma_{\lambda}(\mathbf{x}) = \gamma(\mathbf{x})$ pointwise.
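
Here is a quick numerical check of this pointwise limit; the specific values of $\mathbf{w}$, $\theta$, $\varphi$, and the test points are arbitrary.

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

w, theta, phi = np.array([1.0, -2.0]), 0.5, 0.3

def sigma_lam(x, lam):
    return sigma(lam * (w @ x + theta) + phi)

x_pos  = np.array([1.0, 0.1])   # w.x + theta =  1.3 > 0
x_neg  = np.array([0.0, 0.5])   # w.x + theta = -0.5 < 0
x_zero = np.array([0.5, 0.5])   # w.x + theta =  0.0

for lam in [1.0, 10.0, 100.0]:
    print(lam, sigma_lam(x_pos, lam), sigma_lam(x_neg, lam), sigma_lam(x_zero, lam))
# As lam grows, the three values approach 1, 0, and sigma(phi) ~ 0.574 = gamma(x).
```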

Now let $\Pi_{\mathbf{w},\theta}$ be the hyperplane $\{\mathbf{x} \mid \mathbf{w} \cdot \mathbf{x} + \theta = 0 \}$, and let $H_{\mathbf{w},\theta}$ be the open half-space $\{\mathbf{x} \mid \mathbf{w} \cdot \mathbf{x} + \theta > 0 \}$. To prove the lemma, suppose that $\int_{I_n} \sigma(\mathbf{w} \cdot \mathbf{x} + \theta)\mathrm{d}\mu(\mathbf{x}) = 0$ for all $\mathbf{w}$ and $\theta$; we must show that $\mu = 0$. Since $\lambda(\mathbf{w} \cdot \mathbf{x} + \theta) + \varphi = (\lambda\mathbf{w}) \cdot \mathbf{x} + (\lambda\theta + \varphi)$ is again an expression of the form $\mathbf{w}' \cdot \mathbf{x} + \theta'$, the assumption gives, for all $\lambda$,

$$ \int_{I_n} \sigma_{\lambda}(\mathbf{x}) \mathrm{d}\mu(\mathbf{x}) = 0. $$
Since $\sigma$ is bounded and $\mu$ is finite, the dominated convergence theorem gives
$$ \begin{align*} 0 & = \lim_{\lambda \to \infty} \int_{I_n} \sigma_{\lambda}(\mathbf{x}) \mathrm{d}\mu(\mathbf{x}) \\ & = \int_{I_n} \gamma(\mathbf{x}) \mathrm{d}\mu(\mathbf{x}) = \sigma(\varphi)\mu(\Pi_{\mathbf{w},\theta}) + \mu(H_{\mathbf{w},\theta}).\end{align*} $$
This holds for every $\varphi$. Letting $\varphi \to \infty$ gives $\mu(\Pi_{\mathbf{w},\theta}) + \mu(H_{\mathbf{w},\theta}) = 0$, and letting $\varphi \to -\infty$ gives $\mu(H_{\mathbf{w},\theta}) = 0$. Hence $\mu(\Pi_{\mathbf{w},\theta}) = \mu(H_{\mathbf{w},\theta}) = 0$ for all $\mathbf{w}$ and $\theta$.

We now show that if $\mu$ assigns measure zero to every hyperplane and half-space, then $\mu$ must be the zero measure.

Fix $\mathbf{w}$. For a bounded measurable function $h \colon \mathbb{R} \to \mathbb{R}$, define the linear functional $F$ by

$$ F(h) = \int_{I_n}h(\mathbf{w} \cdot \mathbf{x})\mathrm{d}\mu(\mathbf{x}). $$
The functional $F$ is bounded on the space $L^{\infty}(\mathbb{R})$ because $\mu$ is a finite signed measure. Let $h$ be the indicator function of the interval $[\theta, \infty)$; that is,
$$ h(t) = \left\{ \begin{array}{ll} 1 & t \ge \theta, \\ 0 & t < \theta. \end{array}\right. $$
Therefore, since $h(\mathbf{w} \cdot \mathbf{x}) = 1$ precisely when $\mathbf{w} \cdot \mathbf{x} - \theta \ge 0$,
$$ \begin{align*} F(h) & = \int_{I_n}h(\mathbf{w} \cdot \mathbf{x})\mathrm{d}\mu(\mathbf{x}) \\ & = \mu(\Pi_{\mathbf{w}, -\theta}) + \mu(H_{\mathbf{w}, -\theta}) = 0. \end{align*} $$
A similar result holds if $h$ is the indicator function of the open interval $(\theta, \infty)$. By linearity, $F(h) = 0$ for the indicator function of any interval, and hence for any simple function. Because simple functions are dense in $L^{\infty}(\mathbb{R})$, we conclude that $F = 0$ on $L^{\infty}(\mathbb{R})$. Now notice that
$$ \begin{align*} 0 & = F(\cos) + iF(\sin) \\ & = \int_{I_n}\left(\cos(\mathbf{w} \cdot \mathbf{x}) + i\sin(\mathbf{w} \cdot \mathbf{x})\right)\mathrm{d}\mu(\mathbf{x}) \\ & = \int_{I_n}e^{i \mathbf{w} \cdot \mathbf{x}} \mathrm{d}\mu(\mathbf{x}). \end{align*} $$
Thus the Fourier transform of $\mu$ vanishes for every $\mathbf{w} \in \mathbb{R}^n$, so $\mu$ must be the zero measure. Therefore, $\sigma$ is discriminatory, and the proof is complete.

And that concludes the statement and proof of the universal approximation theorem for feedforward neural networks with sigmoidal activation functions.

References

  1. George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303–314, 1989.