Why Radial Basis Function Neural Networks are equivalent to the Nyström method

24 Oct 2017
Gonzalo Mateo-García

Introduction

Radial Basis Function Neural Networks (RBFN) and the Nyström method are two approaches that can be used for regression. They come from different statistical learning frameworks: RBFN were proposed in the context of shallow neural networks, as a three-layer network whose hidden layer computes the RBF similarities to a set of centroids and whose fully connected output layer gives the final prediction. The Nyström method was proposed in the kernel machine literature as a way to reduce the computational burden of the full kernel similarity matrix $K$. Nyström was proposed later; moreover, it can be applied to different kernel approaches, not only regression.

The equivalence of these methods is not something unknown that we came up with; however, we feel the relationship is not as crystal clear as it perhaps seems to experts in the field. In [Quiñonero-Candela 2005] and in the book by Rasmussen and Williams, Gaussian Processes for Machine Learning, these relations are taken for granted. Recent papers in the context of RBFNN state it explicitly. Another clear sign that people are aware of the relationship is the scikit-learn implementation of the Nyström method, which explicitly uses it. The purpose of this post is therefore to shed light on the proof of this relationship, which helps to better understand these powerful methods. First we show the relation in an empirical risk minimization setting; later we develop the Bayesian approach, which leads to one of the so-called sparse $\mathcal{GP}$ methods: we will see that in this context the RBFN network is equivalent to the Subset of Regressors (SoR) method (following the naming convention of [Quiñonero-Candela 2005]). Sparse $\mathcal{GP}$s are an active field of research with many contributions over the last years.

Let \(\mathcal{D}=\{x_i,y_i\}_{i=1}^N\) be the data of our regression problem: \(x_i\in\mathbb{R}^D\), \(y_i\in \mathbb{R}\). In matrix notation we write \(\mathcal{D}=(X,y)\), where \(X\) is an $N\times D$ matrix and \(y\) is an \(N \times 1\) vector. Both methods employ a subset of \(\mathcal{D}\) of size \(M\): \(\mathcal{D}_u = (X_u,y_u)\), where \(X_u\) is an \(M\times D\) matrix and \(y_u\) an $M \times 1$ vector. The inputs in this subset are sometimes called centroids, pseudo-inputs or inducing inputs. Also, we will sometimes use the subscript $f$ to refer to the subset formed by all the data: \(\mathcal{D}=\mathcal{D_f}=(X_f,y_f)=(X,y)\).

As we are working with kernels, we suppose we have a kernel function \(k(x_i,x_j)\in \mathbb{R}\) that measures similarities between points \(x_i\in\mathbb{R}^D\). We define the similarity between a single point \(x_i\) and a set of points \(X_u\) as the row vector: \(k(x_i,X_u) = K_{x_iu} =[k(x_i,x_{u_1}),k(x_i,x_{u_2}),…,k(x_i,x_{u_M})] \in \mathbb{R}^{1\times M}\). Similarly, we define the matrix of similarities between the points in \(X_f\) and those in \(X_u\) as:

\[k(X_f,X_u)=K_{fu} = \begin{pmatrix} k(x_1,x_{u_1}) & k(x_1,x_{u_2}) &...& k(x_1,x_{u_M}) \\ k(x_2,x_{u_1}) & k(x_2,x_{u_2}) &...& k(x_2,x_{u_M}) \\ \vdots & \vdots & \ddots &\vdots \\ k(x_N,x_{u_1}) & k(x_N,x_{u_2}) &...& k(x_N,x_{u_M}) \\ \end{pmatrix} \in \mathbb{R}^{N\times M} \quad \text{(1)}\]
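To make the notation concrete, here is a minimal NumPy sketch of these kernel matrices. The RBF kernel, the lengthscale, the random data and all the variable names are placeholders chosen for illustration, not anything prescribed by the methods themselves; the later snippets in this post reuse these variables.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Pairwise RBF similarities k(a, b) = exp(-||a - b||^2 / (2 l^2)) between rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale ** 2)

rng = np.random.default_rng(0)
N, M, D = 100, 10, 3
X = rng.normal(size=(N, D))                 # training inputs X_f
y = rng.normal(size=(N, 1))                 # training targets
X_u = X[rng.choice(N, M, replace=False)]    # centroids / inducing inputs X_u

K_fu = rbf_kernel(X, X_u)                   # N x M matrix of equation (1)
K_uf = K_fu.T                               # M x N
K_uu = rbf_kernel(X_u, X_u)                 # M x M
```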

Empirical risk minimization approach

Nyström method

The Nyström method [Williams and Seeger 2000] proposes substituting the kernel function $k$ with $k_{Nyström}$, defined as $k_{Nyström}(x_i,x_j) = k(x_i,X_u)K_{uu}^{-1}k(X_u,x_j)$, where $K_{uu}=k(X_u,X_u)$. The matrix $K$ is then replaced with $Q_{ff}=K_{fu}K_{uu}^{-1}K_{uf}$.

The Kernel Ridge Regression (KRR) solution using the kernel function $k_{Nyström}$ for a given $x^*$ is:

\[k_{Nyström}(x^\star,X)\Big(Q_{ff}+ \sigma^2 I \Big)^{-1} y = K_{\star u} \overbrace{K_{uu}^{-1}K_{uf}\Big(Q_{ff}+ \sigma^2 I \Big)^{-1}}^{A} y \quad \text{(2)}\]
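As a quick illustration, this is what (2) looks like in NumPy, reusing the variables from the sketch above; the noise level `sigma2` and the test point `x_star` are again made-up values for illustration.

```python
sigma2 = 0.1                                     # noise variance sigma^2 (placeholder value)
x_star = rng.normal(size=(1, D))                 # a test input

Q_ff = K_fu @ np.linalg.solve(K_uu, K_uf)        # Nyström approximation of K (N x N)
A_mat = np.linalg.solve(K_uu, K_uf) @ np.linalg.inv(Q_ff + sigma2 * np.eye(N))  # matrix A of (2)

K_star_u = rbf_kernel(x_star, X_u)               # 1 x M similarities k(x*, X_u)
pred_nystrom = K_star_u @ A_mat @ y              # Nyström KRR prediction, equation (2)
```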

RBFN method

The RBFN method is an empirical risk minimization approach originally proposed by [Poggio and Girosi 1990] (eq. 25). In this approach we first map the inputs $x$ to the space generated by the kernel similarities to $X_u$: $k(x,X_u)$. For the training inputs $X$ this yields the matrix $K_{fu}$ of (1).

Then we solve a linear ridge regression problem in the transformed domain $K_{fu}$. That is, we seek the $\alpha$ that minimizes the regularized empirical risk $J$:

\[\begin{aligned} J(\alpha) &= \overbrace{\| K_{fu}\alpha - y\|^2}^{\text{empirical risk}} + \overbrace{\sigma^2 \alpha^t K_{uu} \alpha}^{\text{regularization}} \\ &= \alpha^t K_{uf}K_{fu} \alpha - 2 \alpha^t K_{uf} y + y^ty + \sigma^2 \alpha^t K_{uu}\alpha \end{aligned}\]

Solving $\nabla_{\alpha} J = 0$ yields:

\[\alpha_{opt} = \Big( K_{uf}K_{fu}\ + \sigma^2 K_{uu}\Big)^{-1} K_{uf}y\]

The prediction for a given $x^\star$ is then:

\[K_{\star u}\cdot \alpha_{opt} = K_{\star u} \overbrace{\Big( K_{uf}K_{fu}\ + \sigma^2 K_{uu}\Big)^{-1} K_{uf}}^{B} y \quad \text{(3)}\]

We see here the three-layer network: the first layer is the input, the second layer is formed by the kernel similarities to $X_u$, and the third one is the output.

We may wonder why we should use $K_{uu}$ in the regularization term of $J(\alpha)$. One practical reason is that if $X_u$ is $X$ (i.e. all the data), solution (3) becomes the standard KRR solution, since $K=K_{uf}=K_{fu}=K_{uu}$. Another reason is probably that if two rows of \(X_u\), \(x_{ui}\) and \(x_{uj}\), are highly correlated, their corresponding coefficients \(\alpha_i\) and \(\alpha_j\) should be regularized jointly.
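Before the formal proof of the equivalence, here is a small numerical sanity check of (3), again reusing the variables from the previous sketches; the two predictions should coincide up to floating point error.

```python
# alpha_opt and the RBFN prediction of equation (3)
alpha_opt = np.linalg.solve(K_uf @ K_fu + sigma2 * K_uu, K_uf @ y)   # M x 1
pred_rbfn = K_star_u @ alpha_opt

print(np.allclose(pred_rbfn, pred_nystrom))      # True: (2) and (3) agree
```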

Proof

Saying that both methods yield the same predictions boils down to showing that the matrices $A$ and $B$ are equal. To see this equivalence we can rely on the matrix inversion lemma, or we can prove it ad hoc.

Matrix inversion lemma proof


The easiest way to prove the equivalence between both methods (the one used in the [Rasmussen and Williams] book, Chapter 8 section 6.1) is to rely on the matrix inversion lemma, which states:

\[\Big(Z + UWV^t\Big)^{-1} = Z^{-1} -Z^{-1} U\Big(W^{-1} + V^tZ^{-1}U\Big)^{-1}V^tZ^{-1}\]

Hence we can apply this formula to the inverse that appears in eq. (2) to get eq. (3).
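Spelling out this step: one possible identification is $Z = \sigma^2 I$, $U = K_{fu}$, $W = K_{uu}^{-1}$ and $V = K_{fu}$, so that $Z + UWV^t = Q_{ff} + \sigma^2 I$ and the lemma gives

\[\Big(Q_{ff} + \sigma^2 I\Big)^{-1} = \sigma^{-2} I - \sigma^{-4} K_{fu}\Big(K_{uu} + \sigma^{-2} K_{uf}K_{fu}\Big)^{-1} K_{uf} = \sigma^{-2}\Big[ I - K_{fu}\big(K_{uf}K_{fu} + \sigma^2 K_{uu}\big)^{-1} K_{uf}\Big]\]

Plugging this into $A = K_{uu}^{-1}K_{uf}\big(Q_{ff}+\sigma^2 I\big)^{-1}$:

\[\begin{aligned} A &= \sigma^{-2} K_{uu}^{-1}\Big[K_{uf} - K_{uf}K_{fu}\big(K_{uf}K_{fu} + \sigma^2 K_{uu}\big)^{-1}K_{uf}\Big] \\ &= \sigma^{-2} K_{uu}^{-1}\Big[\big(K_{uf}K_{fu} + \sigma^2 K_{uu}\big) - K_{uf}K_{fu}\Big]\big(K_{uf}K_{fu} + \sigma^2 K_{uu}\big)^{-1}K_{uf} \\ &= \sigma^{-2} K_{uu}^{-1}\,\sigma^2 K_{uu}\,\big(K_{uf}K_{fu} + \sigma^2 K_{uu}\big)^{-1}K_{uf} = \Big(K_{uf}K_{fu} + \sigma^2 K_{uu}\Big)^{-1}K_{uf} = B\end{aligned}\]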


Ad hoc proof


The ad hoc way to prove the result is to show directly that the matrices $A$ from (2) and $B$ from (3) are equal. We can rewrite both expressions:

\[A = K_{uu}^{-1}K_{uf}\Big(Q_{ff}+ \sigma^2 I \Big)^{-1} = K_{uu}^{-1}K_{uf}\Big(K_{fu}K_{uu}^{-1}K_{uf}+ \sigma^2 I \Big)^{-1}\] \[B = \Big( K_{uf}K_{fu} + \sigma^2 K_{uu}\Big)^{-1} K_{uf} = \Big(\big( K_{uf}K_{fu} K_{uu}^{-1} + \sigma^2 I \big)K_{uu}\Big)^{-1} K_{uf} = K_{uu}^{-1}\Big( K_{uf}K_{fu} K_{uu}^{-1} + \sigma^2 I \Big)^{-1} K_{uf}\]

Thus $A = B$ leads us to:

\[\require{cancel} \cancel{K_{uu}^{-1}} K_{uf}\Big(K_{fu}K_{uu}^{-1}K_{uf}+ \sigma^2 I \Big)^{-1} = \cancel{K_{uu}^{-1}} \Big( K_{uf}K_{fu} K_{uu}^{-1} + \sigma^2 I \Big)^{-1} K_{uf} \quad \text{(4)}\]

Multiplying both sides of equation (4) on the left by $\Big( K_{uf}K_{fu} K_{uu}^{-1} + \sigma^2 I \Big)$ and on the right by $\Big(K_{fu}K_{uu}^{-1}K_{uf}+ \sigma^2 I \Big)$ we get:

\[\Big( K_{uf}K_{fu} K_{uu}^{-1} + \sigma^2 I \Big) K_{uf} = K_{uf} \Big(K_{fu}K_{uu}^{-1}K_{uf}+ \sigma^2 I \Big)\]

Both sides expand to $K_{uf}K_{fu}K_{uu}^{-1}K_{uf} + \sigma^2 K_{uf}$, and since the matrices we multiplied by are invertible the steps are reversible, so the equality $A=B$ holds.

(This trick is taken from http://stat.wikia.com/wiki/Kernel_Ridge_Regression.)


Bayesian approach

Nyström $\mathcal{GP}$

As we explained before, the Nyström method just replaces the $k$ function with the $k_{Nyström}$ function. In the context of $\mathcal{GP}$s this changes the covariance of the joint prior over the train and test data \((y,y^\star \mid X, X_\star )\). The joint prior distribution will be:

\[p(y,y^\star \mid X_\star, X)=\mathcal{N}\left(\begin{pmatrix}y \\ y^\star \end{pmatrix} \mid \; 0,\:\begin{pmatrix} Q_{f,f}+\sigma^2 I & Q_{f,*}\\ Q_{\star,f} & Q_{\star,*} + \sigma^2 I \end{pmatrix} \right) \quad \text{(12)}\]

Here we want to retrieve the posterior predictive distribution \( p(y^\star \mid X_\star, X, y) \) from this joint prior. If we dust off Christopher Bishop's Pattern Recognition and Machine Learning book and go to chapter 2, section 2.3.1, Conditional Gaussian distributions, we find the derivation that gives the conditional distribution from a joint Gaussian distribution.

Applying this result (which can also be found on Wikipedia) leads to the expression of the posterior predictive distribution:

\[\begin{aligned} p(y^\star | X_\star, X, y) = \mathcal{N}\big(y^\star \mid & Q_{\star,f}(Q_{f,f}+\sigma^2 I)^{-1}y, \\ & Q_{\star,*} + \sigma^2 I - Q_{\star,f}(Q_{f,f} + \sigma^2 I)^{-1}Q_{f,*} \big) \quad \text{(5)} \end{aligned}\]

So the mean of the $\mathcal{GP}$ posterior using the Nyström kernel is the same as the prediction of the aforementioned approaches.
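Continuing the NumPy sketch from the previous sections, the predictive mean and variance of (5) can be computed as follows; the mean should match the predictions we already obtained.

```python
Q_sf = K_star_u @ np.linalg.solve(K_uu, K_uf)        # Q_{*,f}, 1 x N
Q_ss = K_star_u @ np.linalg.solve(K_uu, K_star_u.T)  # Q_{*,*}, 1 x 1

S = np.linalg.inv(Q_ff + sigma2 * np.eye(N))
gp_mean = Q_sf @ S @ y                               # mean of (5)
gp_var = Q_ss + sigma2 - Q_sf @ S @ Q_sf.T           # variance of (5)

print(np.allclose(gp_mean, pred_nystrom))            # True: same mean as (2) and (3)
```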

Bayesian RBFN

The RBFN method can also be viewed from a standard Bayesian linear regression point of view. The only difference with Bayesian linear regression is that we first transform the inputs $x_i$ to the similarity space $K_{x_iu}$; the rest remains the same. I am going to re-derive this approach for our current case because there are interesting similarities with $\mathcal{GP}$ regression.

So let’s start: we assume the linear in $\alpha$ model: \(y_i = K_{x_iu}\alpha + \epsilon_i\) where \(\epsilon_i\) is independent white noise: $\epsilon_i\sim \mathcal{N}(0,\sigma^2)$. This can also be written as $p(y_i\mid x_i,\alpha) = p(y_i \mid K_{x_iu},\alpha)=\mathcal{N}(y_i\mid K_{x_iu}\alpha, \sigma^2)$. The likelihood of our model is then:

\[\begin{aligned} p(\overbrace{y_1,...,y_N}^y \mid \overbrace{x_1,...,x_N}^X,\alpha) &= \prod_{i=1}^N p(y_i\mid x_i,\alpha) \\ &= \prod_{i=1}^N \mathcal{N}(y_i \mid K_{x_iu}\alpha , \sigma^2) = \mathcal{N}(y \mid K_{fu}\alpha, \sigma^2 I ) \quad \text{(6)} \\ &= \frac{1}{(2\pi)^{N/2}\sqrt{|\sigma^2I|}}\exp\big(-\frac{1}{2}(y- K_{fu}\alpha)^t \sigma^{-2}I(y- K_{fu}\alpha)\big) \\ &= \frac{1}{(2\pi \sigma^2)^{N/2}}\exp\big(-\frac{1}{2\sigma^2}\|y- K_{fu}\alpha\|^2\big) \end{aligned}\]

If we place a prior $p(\alpha \mid X) = \mathcal{N}(\alpha \mid 0,A)$ over the weights, the joint distribution of $y$ and $\alpha$ given $X$ is:

\[p(y,\alpha \mid X) = p(y\mid X,\alpha)p(\alpha \mid X)\]

At this point it is worth stopping to consider what we actually want. The natural answer is to give predictions for new \( x_\star \) values. In Bayesian language this means we want a posterior predictive distribution, i.e. a probability distribution over the \( y^\star \) values given all the available information: \( p(y^\star \mid X_\star, X, y) \).

One way to obtain the posterior predictive distribution, which is similar to the approach followed for $\mathcal{GP}$s is:

  1. Augment the above equation with the test points: \(p(y,y^\star,\alpha \mid X,X_\star)\) (Changing the likelihood term to accommodate the unseen \(X_\star\) and \(y^\star\)).
  2. Integrate out \( \alpha \): \(p(y,y^\star \mid X,X_\star) = \int p(y,y^\star,\alpha \mid X,X_\star) d\alpha\)
  3. Compute the posterior predictive using the trick we used above in the Nyström case (5).

The other, more standard approach would be:

  1. Compute the posterior for $\alpha$: $p(\alpha \mid y,X) = p(y,\alpha\mid X)/p(y\mid X) $ (This can be done since all the formulas involve Gaussian distributions).
  2. Compute the posterior predictive distribution integrating out $\alpha$:
\[\begin{aligned} p(y^\star \mid X_\star, X, y) &= \int p(y^\star \mid X_\star, \alpha, \cancel{X, y}) p(\alpha \mid \cancel{X_\star}, X, y) d \alpha \\&= \int \mathcal{N}(y^\star\mid K_{\star u}\alpha, \sigma^2) p(\alpha \mid y,X)d \alpha \end{aligned}\]
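For completeness, step 1 above is the standard Bayesian linear regression posterior, which has the closed form

\[p(\alpha \mid y, X) = \mathcal{N}\Big(\alpha \;\Big|\; \sigma^{-2}\Sigma K_{uf}\, y,\; \Sigma\Big), \qquad \Sigma = \Big(\sigma^{-2} K_{uf}K_{fu} + A^{-1}\Big)^{-1}\]

Note that with $A = K_{uu}^{-1}$ the posterior mean is $\big(K_{uf}K_{fu} + \sigma^2 K_{uu}\big)^{-1}K_{uf}\, y = \alpha_{opt}$, which already anticipates the MAP result below.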

We will further develop the similar-to-$\mathcal{GP}$ approach in the fully Bayesian approach section. First, however, we consider the maximum a posteriori (MAP) approach which, as we'll see, is equivalent to the empirical risk minimization approach.

MAP approach

The “less Bayesian” approach would be to compute the maximum a posteriori estimate of $p(\alpha \mid y,X)$ and then use that value ($\alpha_{MAP}$) to compute predictions: $y^\star = K_{\star u}\alpha_{MAP}$. This is the approach that is equivalent to the empirical risk minimization approach exposed above (if we use $A=K_{uu}^{-1}$):

\[\begin{aligned} \alpha_{MAP} &= \arg \max_{\alpha} p(\alpha \mid y,X) = \arg \max_{\alpha}\left[ \frac{p(y,\alpha \mid X)}{p(y\mid X)} \right] = \arg \max_{\alpha} p(y,\alpha \mid X) \\ &= \arg \max_{\alpha} \big[\log(p(y,\alpha \mid X))\big] \\ &= \arg \min_{\alpha}\Big[ -\log(\overbrace{p(y \mid X, \alpha)}^{\text{likelihood}}) - \log(p(\alpha)) \Big] \\ &= \arg \min_{\alpha} \Big[ \frac{1}{\cancel{2}\sigma^2} \|y- K_{fu}\alpha\|^2 + \cancel{\frac{N}{2}\log(2\pi \sigma^2)} + \cancel{\frac{1}{2}}\alpha^t A^{-1} \alpha + \cancel{\frac{M}{2}\log(2\pi)}+ \cancel{\frac{1}{2}\log(|A|)} \Big] \\ &= \arg \min_{\alpha} J(\alpha) \end{aligned}\]

So we see that the MAP of the Bayesian RBFN approach is the same as that of the empirical risk minimization RBFN approach. Given that, it is also the same as the KRR solution using the Nyström kernel and, in turn, the mean of the $\mathcal{GP}$ solution using the Nyström kernel. So we have different methods, derived from different points of view, which in the end lead to the same point estimate. The only thing that remains to consider is the fully Bayesian RBFN approach, and we wonder: will the mean of the fully Bayesian RBFN approach give the same predictions as all the other methods? The answer, as you might expect, is also yes.
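As a quick numerical check of this chain of equalities, reusing the earlier sketch with prior covariance $A = K_{uu}^{-1}$ (so $A^{-1} = K_{uu}$):

```python
Sigma = np.linalg.inv(K_uf @ K_fu / sigma2 + K_uu)   # posterior covariance of alpha
alpha_map = Sigma @ K_uf @ y / sigma2                # posterior mean = MAP (Gaussian posterior)

print(np.allclose(alpha_map, alpha_opt))             # True: MAP = risk minimization solution
```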

Fully Bayesian approach

We will develop here the “similar to $\mathcal{GP}$” approach to derive the posterior predictive distribution \(p(y^\star\mid X_\star,X,y)\). First we consider the augmented likelihood:

\[p(y,y^\star \mid X, X_\star, \alpha) = \mathcal{N}\left(\begin{pmatrix} y \\ y^\star \end{pmatrix} \Big | \; \begin{pmatrix} K_{fu} \\ K_{\star u} \end{pmatrix} \alpha, \sigma^2 I \right)\quad \text{(7)}\]

This is the same expression as (6) but including the test data \((X_\star,y^\star)\). Now we integrate out $\alpha$:

\[\begin{aligned} p(y,y^\star \mid X,X_\star) &= \int p(y,y^\star, \alpha \mid X,X_\star) d\alpha = \int p(y,y^\star \mid X,X_\star,\alpha)p(\alpha \mid X,X_\star) d\alpha \\ &= \int \mathcal{N}\left(\begin{pmatrix} y \\ y^\star \end{pmatrix} \Big | \; \begin{pmatrix} K_{fu} \\ K_{\star u} \end{pmatrix} \alpha, \sigma^2 I \right) \mathcal{N}(\alpha \mid 0, A) d\alpha \quad \text{(8)} \end{aligned}\]

To solve this (nasty) integral we can rely again on Bishop's book; in this case we have to move to section 2.3.3, Bayes' theorem for Gaussian variables. After a while, if you do it properly, you will end up with:

\[p(y,y^\star \mid X,X_\star) = \mathcal{N}\left(\begin{pmatrix}y \\ y^\star \end{pmatrix} \Big| \; 0,\:\begin{pmatrix} K_{fu}AK_{uf} + \sigma^2 I & K_{fu}AK_{u*} \\ K_{\star u}AK_{uf} & K_{\star u}AK_{u*} + \sigma^2 I \end{pmatrix} \right) \quad \text{(11)}\]

Now we recognize here again that if $A=K_{uu}^{-1}$ we obtain the same joint distribution of the training and test data as in the Nyström $\mathcal{GP}$ method, equation (12).
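Again as a small numerical check with the sketch variables and $A = K_{uu}^{-1}$: the covariance blocks of (11) coincide with the corresponding Q blocks of the Nyström $\mathcal{GP}$ prior (12).

```python
A_prior = np.linalg.inv(K_uu)                        # prior covariance A = K_uu^{-1}
cov_ff = K_fu @ A_prior @ K_uf                       # K_fu A K_uf block of (11)
cov_fs = K_fu @ A_prior @ K_star_u.T                 # K_fu A K_u* block of (11)

print(np.allclose(cov_ff, Q_ff), np.allclose(cov_fs, Q_sf.T))   # True True
```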

Alternative Derivation

An alternative (maybe easier or more intuitive) derivation of (11) follows.

If we believe that the convolution of two Gaussians is Gaussian, we only need to find the mean and covariance of \(p(y,y^\star \mid X,X_\star)\). To do this we just have to integrate (8).

First we compute the mean:

\[\begin{aligned} \mathbb{E}_{p(y,y^\star\mid X,X_\star)}\left[\begin{pmatrix} y \\ y^\star \end{pmatrix} \right] &= \int \begin{pmatrix} y \\ y^\star \end{pmatrix} p( y , y^\star \mid X,X_\star) d(y,y^\star)\\ &= \int \begin{pmatrix} y \\ y^\star \end{pmatrix} \int \mathcal{N}\left(\begin{pmatrix} y \\ y^\star \end{pmatrix} \Big | \; \begin{pmatrix} K_{fu} \\ K_{\star u} \end{pmatrix} \alpha, \sigma^2 I \right) \mathcal{N}(\alpha \mid 0, A) d\alpha d(y,y^\star) \\ &\underbrace{=}_{\text{switch integrals}} \mathbb{E}_{p(\alpha \mid X,X_\star)}\left[\mathbb{E}_{p(y,y^\star\mid X,X_\star,\alpha)}\left[\begin{pmatrix} y \\ y^\star \end{pmatrix} \right]\right] \quad \text{(9)}\\ &\underbrace{=}_{eq. 7} \mathbb{E}_{p(\alpha \mid X,X_\star)}\left[\begin{pmatrix} K_{fu} \\ K_{\star u} \end{pmatrix} \alpha \right] \\ &= \begin{pmatrix} K_{fu} \\ K_{\star u} \end{pmatrix} \mathbb{E}_{p(\alpha \mid X,X_\star)}[\alpha] \underbrace{=}_{\alpha \text{ has mean zero}} 0 \end{aligned}\]

Equation (9) shows the cool trick that can be used to compute these integrals, which can be generalized into:

\[\mathbb{E}_{p(a)}\left[g(a)\right] = \mathbb{E}_{p(b)}\left[\mathbb{E}_{p(a\mid b)}\left[g(a)\right]\right] \quad \text{(10)}\] \[\text{(In this case }a\text{ is }(y,y^\star \mid X, X_\star)\text{ and }b\text{ is }(\alpha \mid X,X_\star)\text{).}\]

Given that the mean is zero, the second-order moment is the covariance of the distribution. In order to compute it we just have to reuse the trick (10):

\[\begin{aligned} \mathbb{E}_{p(y,y^\star\mid X,X_\star)}\left[\begin{pmatrix} y \\ y^\star \end{pmatrix} \begin{pmatrix}y^t & y^{\star t} \end{pmatrix} \right] &= \mathbb{E}_{p(\alpha \mid X,X_\star)}\left[\mathbb{E}_{p(y,y^\star\mid X,X_\star,\alpha)}\left[ \begin{pmatrix} y \\ y^\star \end{pmatrix} \begin{pmatrix}y^t & y^{\star t} \end{pmatrix} \right]\right] \\ &= \mathbb{E}_{p(\alpha \mid X,X_\star)}\left[Cov[y,y^\star\mid X,X_\star,\alpha] + \mathbb{E}_{p(y,y^\star\mid X,X_\star,\alpha)}\left[ \begin{pmatrix} y \\ y^\star \end{pmatrix}\right]\mathbb{E}_{p(y,y^\star\mid X,X_\star,\alpha)}\left[ \begin{pmatrix} y^t & y^{\star t} \end{pmatrix}\right]\right] \\ &\underbrace{=}_{eq. 7} \mathbb{E}_{p(\alpha \mid X,X_\star)}\left[ \sigma^2 I+ \begin{pmatrix} K_{fu}\alpha \\ K_{\star u}\alpha \end{pmatrix} \begin{pmatrix} \alpha^tK_{uf} & \alpha^tK_{u*} \end{pmatrix} \right] \\ &= \mathbb{E}_{p(\alpha \mid X,X_\star)}\left[\sigma^2 I + \begin{pmatrix} K_{fu} \\ K_{\star u} \end{pmatrix} \alpha \alpha^t \begin{pmatrix}K_{uf} & K_{u*} \end{pmatrix} \right] \\ &= \sigma^2 I + \begin{pmatrix} K_{fu} \\ K_{\star u} \end{pmatrix} \mathbb{E}_{p(\alpha \mid X,X_\star)}\left[\alpha \alpha^t\right] \begin{pmatrix}K_{uf} & K_{u*} \end{pmatrix} \\ &= \begin{pmatrix} K_{fu} \\ K_{\star u} \end{pmatrix} A \begin{pmatrix}K_{uf} & K_{u*} \end{pmatrix} + \sigma^2 I \\ &= \begin{pmatrix} K_{fu}AK_{uf} & K_{fu}AK_{u*} \\ K_{\star u}AK_{uf} & K_{\star u}AK_{u*} \end{pmatrix} + \sigma^2 I \end{aligned}\]


Using on equation (11) the same conditioning procedure that gave equation (5), and assuming $A=K_{uu}^{-1}$, we retrieve the same predictive posterior as in the Nyström $\mathcal{GP}$ method. This proves:

Theorem The linear Bayesian approach in the space of similarities to $X_u$ (Bayesian RBFN) yields the same result as the Nyström $\mathcal{GP}$ regression. We only have to provide the following prior over the linear regression weights $\alpha$: $\alpha \sim \mathcal{N}(0,K_{uu}^{-1})$.

One of the biggest concerns about this predictive posterior distribution (equation 5) is that the predictive variance may collapse when the test input \(x^\star\) is far from the subset $X_u$. To see why, consider that if \(x^\star\) is far from all the data $X$ the natural behavior of a $\mathcal{GP}$ is to revert to the prior; here, however, the prior variance is \(Q_{\star,*} = K_{\star u}K_{uu}^{-1}K_{u*}\), which is itself close to zero if we consider, for example, the RBF kernel, so only the noise variance $\sigma^2$ remains. We refer the reader to [Quiñonero-Candela 2005] for further discussion and alternatives.
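A small numerical illustration of this issue with the sketch from above (the "far away" test point is of course arbitrary): with an RBF kernel its similarities to $X_u$ essentially vanish, so the predictive variance of (5) collapses to the noise level instead of reverting to the full prior variance.

```python
x_far = 100.0 * np.ones((1, D))                      # a test point far from all training inputs
K_far_u = rbf_kernel(x_far, X_u)                     # ~ 0 for the RBF kernel
Q_far_f = K_far_u @ np.linalg.solve(K_uu, K_uf)
Q_far_far = K_far_u @ np.linalg.solve(K_uu, K_far_u.T)

var_far = Q_far_far + sigma2 - Q_far_f @ S @ Q_far_f.T
print(float(var_far), sigma2)                        # both ~ 0.1: only the noise variance remains
```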

Thanks to Daniel Svendsen for corrections and to all the ISP group for the fruitful discussions.