Introduction
Radial Basis Function Neural Networks (RBFN) and the Nyström method are two approaches that can be used for regression. They come from different statistical learning frameworks. RBFN were proposed in the context of shallow neural networks: a three layer neural network where the hidden layer computes the RBF similarities to a set of centroids, and a fully connected layer is then added to give the final prediction. The Nyström method was proposed in the kernel machine literature as a way to reduce the computational burden of the full kernel similarity matrix K. Nyström came later, but it can be applied to different kernel approaches, not only regression.
The equivalence between the two methods is not something unknown that we came up with; however, we feel that the relationship is not as crystal clear as it perhaps seems to experts in the field. In [Quiñonero-Candela 2005] and in the book by Rasmussen and Williams, Gaussian Processes for Machine Learning, these relations are taken for granted. Recent papers in the context of RBFNN state this explicitly. In addition, another clear sign that people are aware of this relationship is the scikit-learn implementation of the Nyström method, which explicitly uses it. The purpose of this post is therefore to shed light on the proof of this relationship, which helps to better understand these powerful methods. First we show the relation in an empirical risk minimization setting; later we develop the Bayesian approach, which leads to one of the so-called sparse GP methods. We will see that the RBFN network is equivalent in this context to the Subset of Regressors (SoR) method (following the naming convention of [Quiñonero-Candela 2005]). Sparse GPs are an active field of research with many contributions over the last years.
Let D=\{x_i,y_i\}_{i=1}^N be the data of our regression problem: x_i\in\mathbb{R}^D, y_i\in\mathbb{R}. In matrix notation we write D=(X,y), where X is an N\times D matrix and y an N\times 1 vector. Both methods employ a subset of D of size M: D_u=(X_u,y_u) (X_u is an M\times D matrix and y_u an M\times 1 vector). The inputs in this subset are sometimes called centroids, pseudo-inputs or inducing inputs. Also, we will sometimes use the subscript f to refer to the subset formed by all the data: D=D_f=(X_f,y_f)=(X,y).
As we are working with kernels, we assume we have a kernel function k(x_i,x_j)\in\mathbb{R} that measures the similarity between points x_i\in\mathbb{R}^D. We define the similarities between one point x_i and a set of points X_u as the row vector: k(x_i,X_u)=K_{x_iu}=[k(x_i,x_{u_1}),k(x_i,x_{u_2}),\ldots,k(x_i,x_{u_M})]\in\mathbb{R}^{1\times M}. Similarly, we can define the matrix generated by the similarities between the points X_f and X_u as:
k(X_f,X_u)=K_{fu}=\begin{pmatrix} k(x_1,x_{u_1}) & k(x_1,x_{u_2}) & \dots & k(x_1,x_{u_M}) \\ k(x_2,x_{u_1}) & k(x_2,x_{u_2}) & \dots & k(x_2,x_{u_M}) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_N,x_{u_1}) & k(x_N,x_{u_2}) & \dots & k(x_N,x_{u_M}) \end{pmatrix}\in\mathbb{R}^{N\times M} \quad \text{(1)}
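As a concrete illustration of these matrices, here is a minimal numerical sketch (the toy data, the choice of scikit-learn's rbf_kernel and its default bandwidth are our own assumptions, not part of the original formulation):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
N, M, D = 100, 10, 3                              # data points, inducing points, input dimension
X = rng.standard_normal((N, D))                   # training inputs X_f
X_u = X[rng.choice(N, size=M, replace=False)]     # inducing inputs X_u (here a random subset of X)

K_fu = rbf_kernel(X, X_u)                         # N x M matrix of eq. (1)
K_uu = rbf_kernel(X_u, X_u)                       # M x M similarities between inducing inputs
print(K_fu.shape, K_uu.shape)                     # (100, 10) (10, 10)
```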
Empirical risk minimization approach
Nyström method
The Nyström method [Williams and Seeger 2000] proposes to substitute the kernel function k with k_{Nyström}, defined as k_{Nyström}(x_i,x_j) = k(x_i,X_u)K_{uu}^{-1}k(X_u,x_j). The kernel matrix K is then replaced with Q_{ff}=K_{fu}K_{uu}^{-1}K_{uf}.
The Kernel Ridge Regression (KRR) solution using the kernel function k_{Nyström} for a given x^\star is:
k_{Nyström}(x^\star,X)\Big(Q_{ff}+ \sigma^2 I \Big)^{-1} y = K_{\star u} \overbrace{K_{uu}^{-1}K_{uf}\Big(Q_{ff}+ \sigma^2 I \Big)^{-1}}^{A} y \quad \text{(2)}
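A hedged numerical sketch of prediction (2), reusing the toy setup above (sigma2, the test inputs and the kernel choice are arbitrary assumptions of ours):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
N, M, D = 100, 10, 3
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)
X_u = X[rng.choice(N, size=M, replace=False)]
X_star = rng.standard_normal((5, D))              # test inputs
sigma2 = 0.1                                      # noise variance sigma^2

K_fu = rbf_kernel(X, X_u)
K_uu = rbf_kernel(X_u, X_u)
K_su = rbf_kernel(X_star, X_u)                    # K_{*u}

Q_ff = K_fu @ np.linalg.solve(K_uu, K_fu.T)       # Nystrom approximation of K
A = np.linalg.solve(K_uu, K_fu.T) @ np.linalg.inv(Q_ff + sigma2 * np.eye(N))  # matrix A of eq. (2)
y_star_nystrom = K_su @ (A @ y)                   # KRR prediction with the Nystrom kernel
```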
RBFN method
The RBFN method is an empirical risk minimization approach originally proposed by [Poggio and Girosi 1990] (eq. 25). In this approach we first map the inputs x to the space generated by the kernel similarities to X_u: k(x,X_u). For the training inputs X this yields the matrix K_{fu} of (1).
Then we solve the linear ridge regression problem in the transformed domain K_{fu}. That is, we seek the \alpha that minimizes the regularized empirical risk J:
\begin{aligned} J(\alpha) &= \overbrace{\| K_{fu}\alpha - y\|^2}^{\text{empirical risk}} + \overbrace{\sigma^2 \alpha^t K_{uu} \alpha}^{\text{regularization}} \\ &= \alpha^t K_{uf}K_{fu} \alpha - 2 \alpha^t K_{uf} y + y^ty + \sigma^2 \alpha^t K_{uu}\alpha \end{aligned}
Solving \nabla_{\alpha} J = 0 yields:
\alpha_{opt} = \Big( K_{uf}K_{fu} + \sigma^2 K_{uu}\Big)^{-1} K_{uf}y
The prediction for a given x^\star is then:
K_{\star u}\cdot \alpha_{opt} = K_{\star u} \overbrace{\Big( K_{uf}K_{fu} + \sigma^2 K_{uu}\Big)^{-1} K_{uf}}^{B} y \quad \text{(3)}
We see here the three layer network: the first layer is the input, the second layer is formed by the kernel similarities to X_u, and the third one is the output y.
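A minimal sketch of this RBFN solution, under the same assumed toy setup as before:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
N, M, D = 100, 10, 3
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)
X_u = X[rng.choice(N, size=M, replace=False)]
X_star = rng.standard_normal((5, D))
sigma2 = 0.1

K_fu = rbf_kernel(X, X_u)
K_uu = rbf_kernel(X_u, X_u)
K_su = rbf_kernel(X_star, X_u)

# alpha_opt = (K_uf K_fu + sigma^2 K_uu)^{-1} K_uf y
alpha_opt = np.linalg.solve(K_fu.T @ K_fu + sigma2 * K_uu, K_fu.T @ y)
y_star_rbfn = K_su @ alpha_opt                    # prediction of eq. (3)
```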
We may wonder why we have to use K_{uu} in the regularization term of J(\alpha). One practical reason is that if X_u is X (i.e. all the data), the solution (3) becomes the standard KRR solution, since K=K_{uf}=K_{fu}=K_{uu}. Another reason is probably that if two rows x_{u_i}, x_{u_j} of X_u are highly correlated, their corresponding coefficients \alpha_i, \alpha_j should be regularized accordingly.
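Regarding the scikit-learn implementation mentioned in the introduction, here is a hedged sketch (our own toy data and parameter choices; we assume Nystroem's default RBF bandwidth matches that of rbf_kernel) of how a Nystroem + Ridge pipeline reproduces prediction (3):

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
N, D, M = 100, 3, 10
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)
X_star = rng.standard_normal((5, D))
sigma2 = 0.1

# scikit-learn route: explicit Nystrom feature map followed by ridge regression
nys = Nystroem(kernel="rbf", n_components=M, random_state=0).fit(X)
ridge = Ridge(alpha=sigma2, fit_intercept=False).fit(nys.transform(X), y)
y_star_sklearn = ridge.predict(nys.transform(X_star))

# closed-form prediction (3), using the inducing points chosen by Nystroem
X_u = nys.components_
K_fu, K_uu, K_su = rbf_kernel(X, X_u), rbf_kernel(X_u, X_u), rbf_kernel(X_star, X_u)
y_star_rbfn = K_su @ np.linalg.solve(K_fu.T @ K_fu + sigma2 * K_uu, K_fu.T @ y)

print(np.allclose(y_star_sklearn, y_star_rbfn))   # expected to print True (up to numerical error)
```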
Proof
Saying that both methods yield the same predictions boils down to showing that the matrices A and B are equivalent. To see this equivalence we can rely on the matrix inversion lemma or we can prove it ad hoc.
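Before going through the algebra, a quick numerical sanity check of the claim (again under our assumed toy setup) is to build both matrices and compare them:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
N, M, D = 100, 10, 3
X = rng.standard_normal((N, D))
X_u = X[rng.choice(N, size=M, replace=False)]
sigma2 = 0.1

K_fu = rbf_kernel(X, X_u)
K_uu = rbf_kernel(X_u, X_u)
Q_ff = K_fu @ np.linalg.solve(K_uu, K_fu.T)

# A = K_uu^{-1} K_uf (Q_ff + sigma^2 I)^{-1},  B = (K_uf K_fu + sigma^2 K_uu)^{-1} K_uf
A = np.linalg.solve(K_uu, K_fu.T) @ np.linalg.inv(Q_ff + sigma2 * np.eye(N))
B = np.linalg.solve(K_fu.T @ K_fu + sigma2 * K_uu, K_fu.T)

print(np.allclose(A, B))                          # expected True (up to numerical precision)
```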
Matrix inversion lemma proof
Ad hoc proof
Bayesian approach
Nyström \mathcal{GP}
As we explained before, the Nyström method just replaces the k function with the k_{Nyström} function. In the context of \mathcal{GP}s this changes the covariance of the joint prior over the train and test data (y,y^\star \mid X, X_\star ). The joint prior distribution will be:
p(y,y^\star \mid X_\star, X)=\mathcal{N}\left(\begin{pmatrix}y \\ y^\star \end{pmatrix} \mid \; 0,\:\begin{pmatrix} Q_{f,f}+\sigma^2 I & Q_{f,\star}\\ Q_{\star,f} & Q_{\star,\star} + \sigma^2 I \end{pmatrix} \right) \quad \text{(4)}
Here we want to retrieve the posterior predictive distribution p(y^\star \mid X_\star, X, y) from this joint prior. If we dust off Christopher Bishop's Pattern Recognition and Machine Learning book and go to chapter 2, section 2.3.1, Conditional Gaussian distributions, we find the derivation that gives the conditional distribution from a joint Gaussian distribution.
Applying this result (which is also available on Wikipedia) leads to the expression of the posterior predictive distribution:
\begin{aligned} p(y^\star \mid X_\star, X, y) = \mathcal{N}\big(y^\star \mid & Q_{\star,f}(Q_{f,f}+\sigma^2 I)^{-1}y, \\ & Q_{\star,\star} + \sigma^2 I - Q_{\star,f}(Q_{f,f} + \sigma^2 I)^{-1}Q_{f,\star} \big) \quad \text{(5)} \end{aligned}
So we have that the mean of the \mathcal{GP} solution using the Nyström kernel is the same as in the aforementioned approaches.
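A sketch of how this posterior predictive could be computed numerically (again under our assumed toy data and RBF kernel):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
N, M, D = 100, 10, 3
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)
X_u = X[rng.choice(N, size=M, replace=False)]
X_star = rng.standard_normal((5, D))
sigma2 = 0.1

K_fu = rbf_kernel(X, X_u)
K_uu_inv = np.linalg.inv(rbf_kernel(X_u, X_u))
K_su = rbf_kernel(X_star, X_u)

Q_ff = K_fu @ K_uu_inv @ K_fu.T                   # Q_{f,f}
Q_sf = K_su @ K_uu_inv @ K_fu.T                   # Q_{*,f}
Q_ss = K_su @ K_uu_inv @ K_su.T                   # Q_{*,*}

C = np.linalg.inv(Q_ff + sigma2 * np.eye(N))
mean_star = Q_sf @ C @ y                                              # mean of eq. (5)
cov_star = Q_ss + sigma2 * np.eye(len(X_star)) - Q_sf @ C @ Q_sf.T    # covariance of eq. (5)
```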
Bayesian RBFN
The RBFN method can also be viewed from a standard Bayesian linear regression point of view. The only difference with Bayesian linear regression is that we first transform the inputs x_i to the similarity space K_{x_iu}; the rest remains the same. I am going to re-derive this approach for our current case because there are interesting similarities with \mathcal{GP} regression.
So let’s start: we assume the model linear in \alpha: y_i = K_{x_iu}\alpha + \epsilon_i, where \epsilon_i is independent white noise: \epsilon_i\sim \mathcal{N}(0,\sigma^2). This can also be written as p(y_i\mid x_i,\alpha) = p(y_i \mid K_{x_iu},\alpha)=\mathcal{N}(y_i\mid K_{x_iu}\alpha, \sigma^2). The likelihood of our model is then:
\begin{aligned} p(\overbrace{y_1,...,y_N}^y \mid \overbrace{x_1,...,x_N}^X,\alpha) &= \prod_{i=1}^N p(y_i\mid x_i,\alpha) \\ &= \prod_{i=1}^N \mathcal{N}(y_i \mid K_{x_iu}\alpha , \sigma^2) = \mathcal{N}(y \mid K_{fu}\alpha, \sigma^2 I ) \quad \text{(6)} \\ &= \frac{1}{(2\pi)^{N/2}\sqrt{|\sigma^2I|}}\exp\big(-\frac{1}{2}(y- K_{fu}\alpha)^t \sigma^{-2}I(y- K_{fu}\alpha)\big) \\ &= \frac{1}{(2\pi \sigma^2)^{N/2}}\exp\big(-\frac{1}{2\sigma^2}\|y- K_{fu}\alpha\|^2\big) \end{aligned}
If we place a prior p(\alpha \mid X) = \mathcal{N}(\alpha \mid 0,A), the joint distribution of y,\alpha given X is:
p(y,\alpha \mid X) = p(y\mid X,\alpha)p(\alpha \mid X)
At this point it is worth stopping to consider what we want. The natural answer is to give predictions for new x_\star values. In Bayesian language this means that we want a posterior predictive distribution, which is a probability distribution over the y^\star values given all the available information: p(y^\star \mid X_\star, X, y).
One way to obtain the posterior predictive distribution, which is similar to the approach followed for \mathcal{GP}s, is:
- Augment the above equation with the test points: p(y,y^\star,\alpha \mid X,X_\star) (Changing the likelihood term to accommodate the unseen X_\star and y^\star).
- Integrate out \alpha : p(y,y^\star \mid X,X_\star) = \int p(y,y^\star,\alpha \mid X,X_\star) d\alpha
- Compute the posterior predictive using the trick we used above in the Nyström case (5).
The other, more standard approach would be:
- Compute the posterior for \alpha: p(\alpha \mid y,X) = p(y,\alpha\mid X)/p(y\mid X) (This can be done since all the formulas involve Gaussian distributions).
- Compute the posterior predictive distribution by integrating out \alpha: p(y^\star \mid X_\star, X, y) = \int p(y^\star \mid X_\star, \alpha)\, p(\alpha \mid X, y)\, d\alpha.
We will further develop the “similar to \mathcal{GP}” approach in the fully Bayesian approach section. However, first we will consider the maximum a posteriori (MAP) approach which, we’ll see, is equivalent to the empirical risk minimization approach.
MAP approach
The “less Bayesian” approach would be to compute the maximum a posteriori estimate of p(\alpha \mid y,X) and then use that value (\alpha_{MAP}) to compute predictions: y^\star = K_{\star u}\alpha_{MAP}. This is the approach that is equivalent to the empirical risk minimization approach presented above (if we use A=K_{uu}^{-1}):
\begin{aligned} \alpha_{MAP} &= \arg \max_{\alpha} p(\alpha \mid y,X) = \arg \max_{\alpha}\left[ \frac{p(y,\alpha \mid X)}{p(y\mid X)} \right] = \arg \max_{\alpha} p(y,\alpha \mid X) \\ &= \arg \max_{\alpha} \big[\log(p(y,\alpha \mid X))\big] \\ &= \arg \min_{\alpha}\Big[ -\log(\overbrace{p(y \mid X, \alpha)}^{\text{likelihood}}) - \log(p(\alpha)) \Big] \\ &= \arg \min_{\alpha} \Big[ \frac{1}{\cancel{2}\sigma^2} \|y- K_{fu}\alpha\|^2 + \cancel{\frac{N}{2}\log(2\pi \sigma^2)} + \cancel{\frac{1}{2}}\alpha^t A^{-1} \alpha + \cancel{\frac{M}{2}\log(2\pi)}+ \cancel{\frac{1}{2}\log(|A|)} \Big] \\ &= \arg \min_{\alpha} J(\alpha) \end{aligned}
So we see that the MAP estimate of the Bayesian RBFN approach is the same as the solution of the empirical risk minimization RBFN approach. Given that, we can also say that it is the same as the KRR solution using the Nyström method and, by the way, it is also the mean of the \mathcal{GP} solution using the Nyström method. So we have different methods derived from different points of view which in the end lead to the same point estimates. The only thing that remains to consider is the fully Bayesian RBFN approach; we wonder: will the mean of the fully Bayesian RBFN approach give the same predictions as all the other methods? The answer, as you might expect, is also yes.
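As a sanity check of this equivalence, one can minimize the negative log posterior numerically and compare the result with the closed-form \alpha_{opt} of the empirical risk minimization approach; a sketch under our assumed toy setup (the use of scipy's L-BFGS-B is just an illustrative choice):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
N, M, D = 100, 10, 3
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)
X_u = X[rng.choice(N, size=M, replace=False)]
sigma2 = 0.1

K_fu = rbf_kernel(X, X_u)
K_uu = rbf_kernel(X_u, X_u)

def neg_log_posterior(alpha):
    # -log p(y | X, alpha) - log p(alpha), up to constants, with A = K_uu^{-1}
    return (np.sum((y - K_fu @ alpha) ** 2) / (2 * sigma2)
            + 0.5 * alpha @ K_uu @ alpha)

def grad(alpha):
    return K_fu.T @ (K_fu @ alpha - y) / sigma2 + K_uu @ alpha

alpha_map = minimize(neg_log_posterior, np.zeros(M), jac=grad, method="L-BFGS-B").x
alpha_opt = np.linalg.solve(K_fu.T @ K_fu + sigma2 * K_uu, K_fu.T @ y)
print(np.allclose(alpha_map, alpha_opt, atol=1e-3))   # expected True (within optimizer tolerance)
```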
Fully Bayesian approach
We will develop here the “similar to \mathcal{GP}” approach to derive the posterior predictive distribution p(y^\star\mid X_\star,X,y). First we consider the augmented likelihood:
p(y,y^\star \mid X, X_\star, \alpha) = \mathcal{N}\left(\begin{pmatrix} y \\ y^\star \end{pmatrix} \Big | \; \begin{pmatrix} K_{fu} \\ K_{\star u} \end{pmatrix} \alpha, \sigma^2 I \right)\quad \text{(7)}
This is the same expression as (6) but including the test data (X_\star,y^\star). Now we integrate out \alpha:
\begin{aligned} p(y,y^\star \mid X,X_\star) &= \int p(y,y^\star, \alpha \mid X,X_\star) d\alpha = \int p(y,y^\star \mid X,X_\star,\alpha)p(\alpha \mid X,X_\star) d\alpha \\ &= \int \mathcal{N}\left(\begin{pmatrix} y \\ y^\star \end{pmatrix} \Big | \; \begin{pmatrix} K_{fu} \\ K_{\star u} \end{pmatrix} \alpha, \sigma^2 I \right) \mathcal{N}(\alpha \mid 0, A) d\alpha \quad \text{(8)} \end{aligned}
To solve this (nasty) integral we can rely again on Bishop's book; in this case we have to move to section 2.3.3, Bayes' theorem for Gaussian variables. After a while, if you did it properly, you will end up with:
p(y,y^\star \mid X,X_\star) = \mathcal{N}\left(\begin{pmatrix}y \\ y^\star \end{pmatrix} \Big| \; 0,\:\begin{pmatrix} K_{fu}AK_{uf} + \sigma^2 I & K_{fu}AK_{u\star} \\ K_{\star u}AK_{uf} & K_{\star u}AK_{u\star} + \sigma^2 I \end{pmatrix} \right) \quad \text{(11)}
We recognize here that, if A=K_{uu}^{-1}, we have the same joint distribution over the train and test data as in the Nyström \mathcal{GP} method.
Alternative Derivation
An alternative derivation (maybe easier or more intuitive?) of (11) is given if you click unfold explanation.
Using on equation (11) the same conditioning procedure that led to equation (5), and assuming A=K_{uu}^{-1}, we recover the same posterior predictive distribution as in the Nyström \mathcal{GP} method. This proves:
Theorem The linear Bayesian approach in the space of similarities to X_u (Bayesian RBFN) yields the same result as the Nyström \mathcal{GP} regression. We only have to provide the following prior over the linear regression weights \alpha: \alpha \sim \mathcal{N}(0,K_{uu}^{-1}).
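To close the circle, a hedged numerical sketch (same assumed toy setup) comparing the weight-space posterior predictive of the Bayesian RBFN with prior \alpha \sim \mathcal{N}(0,K_{uu}^{-1}) against the function-space Nyström \mathcal{GP} predictive of equation (5):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
N, M, D = 100, 10, 3
X = rng.standard_normal((N, D))
y = rng.standard_normal(N)
X_u = X[rng.choice(N, size=M, replace=False)]
X_star = rng.standard_normal((5, D))
sigma2 = 0.1

K_fu, K_uu, K_su = rbf_kernel(X, X_u), rbf_kernel(X_u, X_u), rbf_kernel(X_star, X_u)
K_uu_inv = np.linalg.inv(K_uu)

# Function-space view: Nystrom GP posterior predictive, eq. (5)
Q_ff, Q_sf, Q_ss = K_fu @ K_uu_inv @ K_fu.T, K_su @ K_uu_inv @ K_fu.T, K_su @ K_uu_inv @ K_su.T
C = np.linalg.inv(Q_ff + sigma2 * np.eye(N))
mean_gp = Q_sf @ C @ y
cov_gp = Q_ss + sigma2 * np.eye(len(X_star)) - Q_sf @ C @ Q_sf.T

# Weight-space view: Bayesian linear regression on K_{xu} features with prior alpha ~ N(0, K_uu^{-1})
S = np.linalg.inv(K_fu.T @ K_fu / sigma2 + K_uu)  # posterior covariance of alpha
alpha_mean = S @ K_fu.T @ y / sigma2              # posterior mean of alpha
mean_rbfn = K_su @ alpha_mean
cov_rbfn = K_su @ S @ K_su.T + sigma2 * np.eye(len(X_star))

print(np.allclose(mean_gp, mean_rbfn), np.allclose(cov_gp, cov_rbfn))  # expected: True True
```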
One of the biggest concerns about this posterior predictive distribution (equation 5) is that the predictive variance may go to zero when the test input x^\star is far from the subset X_u. To see why, consider that if x^\star is far from all the data X, the natural behavior of the \mathcal{GP} is to stick to the prior. Here the prior variance is Q_{\star,\star} = K_{\star u}K_{uu}^{-1}K_{u\star}, which in turn will be close to zero if we consider, for example, the RBF kernel. We refer the reader to [Quiñonero-Candela 2005] for further discussion and alternatives.
Thanks to Daniel Svendsen for corrections and to the whole isp group for the fruitful discussions.