# Gradient Descent in Metric Learning for Kernel Regression (MLKR)

by R. Ho, last updated October 11, 2018, 09:19 AM

I am currently studying the Metric Learning for Kernel Regression (MLKR) algorithm (http://proceedings.mlr.press/v2/weinberger07a/weinberger07a.pdf).

Let $$\{(x_{1}, y_{1}), ..., (x_{N}, y_{N})\}$$ denote the training set. The authors of this paper propose to optimize a matrix $$\mathbf{A}$$ by minimizing the leave-one-out error: $$\mathcal{L}(\mathbf{A}) = \sum\limits_{i=1}^{N} \big( y_{i} - \hat{y_{i}}(\mathbf{A}) \big)^{2},$$ $$\hat{y_{i}}(\mathbf{A}) = \frac{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A})y_{j} }{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A})}$$ where $$k_{ij}(\mathbf{A})$$ is the Gaussian kernel: $$k_{ij}(\mathbf{A}) = e^{-x_{ij}^{\mathbf{T}}\mathbf{A}^{\mathbf{T}}\mathbf{A}x_{ij}}$$ and $$x_{ij} = x_{i} - x_{j}$$.
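For concreteness, the leave-one-out estimate $$\hat{y_{i}}$$ and the loss $$\mathcal{L}$$ can be sketched in NumPy as follows (a minimal sketch; the function name `mlkr_loss` and the array layout are my own, not from the paper):

```python
import numpy as np

def mlkr_loss(A, X, y):
    """Leave-one-out MLKR loss L(A) = sum_i (y_i - yhat_i)^2.

    X has one sample x_i per row, so ||A x_i - A x_j||^2 equals
    x_ij^T A^T A x_ij with x_ij = x_i - x_j.
    """
    AX = X @ A.T                                           # rows are A x_i
    d2 = ((AX[:, None, :] - AX[None, :, :]) ** 2).sum(-1)  # x_ij^T A^T A x_ij
    K = np.exp(-d2)                                        # Gaussian kernel k_ij
    np.fill_diagonal(K, 0.0)                               # enforce j != i
    y_hat = (K @ y) / K.sum(axis=1)                        # leave-one-out estimates
    return ((y - y_hat) ** 2).sum()
```

As a quick sanity check, with $$\mathbf{A} = 0$$ every $$k_{ij} = 1$$, so each $$\hat{y_{i}}$$ reduces to the mean of the other labels.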

They state that the gradient of the loss function with respect to $$\mathbf{A}$$ is (Eq. 8): $$\frac{\partial \mathcal{L}}{\partial \mathbf{A}}(\mathbf{A}) = 4\mathbf{A}\sum\limits_{i=1}^{N} \big( \hat{y_{i}}(\mathbf{A}) - y_{i}\big) \sum\limits_{\substack{j=1 \\ j \neq i}}^{N}\big( \hat{y_{j}}(\mathbf{A}) - y_{j}\big)k_{ij}(\mathbf{A})x_{ij}x_{ij}^{\mathbf{T}}.$$ I can't figure out how they get this expression.
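For reference, Eq. 8 exactly as quoted can be transcribed into NumPy (the name `grad_eq8` is mine, and the double loop mirrors the double sum rather than being vectorized):

```python
import numpy as np

def grad_eq8(A, X, y):
    """Direct transcription of Eq. 8 as quoted above."""
    N, d = X.shape
    AX = X @ A.T
    d2 = ((AX[:, None, :] - AX[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2)                      # k_ij = exp(-x_ij^T A^T A x_ij)
    np.fill_diagonal(K, 0.0)             # j != i
    y_hat = (K @ y) / K.sum(axis=1)      # leave-one-out estimates
    r = y_hat - y                        # residuals yhat_i - y_i
    G = np.zeros((d, d))
    for i in range(N):
        for j in range(N):
            if j != i:
                x_ij = (X[i] - X[j])[:, None]             # column vector
                G += r[i] * r[j] * K[i, j] * (x_ij @ x_ij.T)
    return 4 * A @ G
```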

My attempt: $$\frac{\partial \mathcal{L}}{\partial \mathbf{A}}(\mathbf{A}) = 2 \sum\limits_{i=1}^{N} \big( \hat{y_{i}}(\mathbf{A}) - y_{i}\big)\frac{\partial \hat{y_{i}}}{\partial \mathbf{A}}(\mathbf{A})$$ \begin{align*} \frac{\partial \hat{y_{i}}}{\partial \mathbf{A}}(\mathbf{A}) &= \frac{\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} \frac{\partial k_{ij}}{\partial \mathbf{A}}(\mathbf{A})y_{j} \Big)\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A}) \Big)-\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A})y_{j} \Big)\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} \frac{\partial k_{ij}}{\partial \mathbf{A}}(\mathbf{A}) \Big)}{\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A}) \Big)^{2}} \\ &= \frac{\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} \frac{\partial k_{ij}}{\partial \mathbf{A}}(\mathbf{A})y_{j} \Big)- \hat{y_{i}}(\mathbf{A})\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} \frac{\partial k_{ij}}{\partial \mathbf{A}}(\mathbf{A}) \Big)}{\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A}) \Big)} \\ &= \frac{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} \big(y_{j}-\hat{y_{i}}(\mathbf{A}) \big)\frac{\partial k_{ij}}{\partial \mathbf{A}}(\mathbf{A})}{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A})} =2\mathbf{A} \frac{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} \big(\hat{y_{i}}(\mathbf{A})-y_{j}\big)k_{ij}(\mathbf{A})x_{ij}x_{ij}^{\mathbf{T}} }{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A})} \end{align*} where the last equality uses $$\frac{\partial k_{ij}}{\partial \mathbf{A}}(\mathbf{A}) = -2k_{ij}(\mathbf{A})\,\mathbf{A}x_{ij}x_{ij}^{\mathbf{T}}.$$ Hence, I have $$\frac{\partial \mathcal{L}}{\partial \mathbf{A}}(\mathbf{A}) = 4\mathbf{A} \sum\limits_{i=1}^{N} \big( \hat{y_{i}}(\mathbf{A}) - y_{i}\big)\frac{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} \big(\hat{y_{i}}(\mathbf{A})-y_{j}\big)k_{ij}(\mathbf{A})x_{ij}x_{ij}^{\mathbf{T}} }{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A})}$$
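A finite-difference check is one way to validate a candidate gradient. The sketch below implements the quotient-rule result, carrying through the minus sign from $$\frac{\partial k_{ij}}{\partial \mathbf{A}} = -2k_{ij}\mathbf{A}x_{ij}x_{ij}^{\mathbf{T}},$$ and compares it against a numerical gradient of $$\mathcal{L}$$ (all function names here are mine):

```python
import numpy as np

def loss(A, X, y):
    """Leave-one-out MLKR loss L(A) = sum_i (y_i - yhat_i)^2."""
    AX = X @ A.T
    d2 = ((AX[:, None, :] - AX[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2)
    np.fill_diagonal(K, 0.0)
    y_hat = (K @ y) / K.sum(axis=1)
    return ((y - y_hat) ** 2).sum()

def grad_derived(A, X, y):
    """Quotient-rule gradient:
    4 A sum_i (yhat_i - y_i) [sum_{j!=i} (yhat_i - y_j) k_ij x_ij x_ij^T] / sum_{j!=i} k_ij
    """
    N, d = X.shape
    AX = X @ A.T
    d2 = ((AX[:, None, :] - AX[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2)
    np.fill_diagonal(K, 0.0)
    S = K.sum(axis=1)                    # normalizers sum_{j!=i} k_ij
    y_hat = (K @ y) / S
    G = np.zeros((d, d))
    for i in range(N):
        Gi = np.zeros((d, d))
        for j in range(N):
            if j != i:
                x_ij = (X[i] - X[j])[:, None]
                Gi += (y_hat[i] - y[j]) * K[i, j] * (x_ij @ x_ij.T)
        G += (y_hat[i] - y[i]) * Gi / S[i]
    return 4 * A @ G

def num_grad(A, X, y, eps=1e-6):
    """Central finite differences on each entry of A."""
    G = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        Ap, Am = A.copy(), A.copy()
        Ap[idx] += eps
        Am[idx] -= eps
        G[idx] = (loss(Ap, X, y) - loss(Am, X, y)) / (2 * eps)
    return G
```

Running both on small random data and comparing the outputs entrywise gives a concrete way to decide between two candidate closed-form expressions.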

What am I doing wrong? Any suggestions would be greatly appreciated.
