Gradient Descent in Metric Learning for Kernel Regression (MLKR)

by R. Ho   Last Updated October 11, 2018 09:19 AM

I am currently studying the Metric Learning for Kernel Regression (MLKR) algorithm (http://proceedings.mlr.press/v2/weinberger07a/weinberger07a.pdf).

Let $\{(x_{1}, y_{1}), ..., (x_{N}, y_{N})\}$ denote the training set. The authors of this paper propose to optimize a matrix $\mathbf{A}$ by minimizing the leave-one-out error: $$ \mathcal{L}(\mathbf{A}) = \sum\limits_{i=1}^{N} \big( y_{i} - \hat{y_{i}}(\mathbf{A}) \big)^{2}, $$ $$ \hat{y_{i}}(\mathbf{A}) = \frac{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A})y_{j} }{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A})}, $$ where $k_{ij}(\mathbf{A})$ is the Gaussian kernel $ k_{ij}(\mathbf{A}) = e^{-x_{ij}^{\mathbf{T}}\mathbf{A}^{\mathbf{T}}\mathbf{A}x_{ij}} $ and $x_{ij} = x_{i} - x_{j}$.
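In case it helps, here is a small NumPy sketch of how I am computing $\hat{y_{i}}(\mathbf{A})$ and $\mathcal{L}(\mathbf{A})$ (the function names `_kernel`, `loo_predictions` and `mlkr_loss` are just my own helpers, not from the paper):

```python
import numpy as np

def _kernel(A, X):
    """Gaussian kernel matrix k_ij = exp(-||A (x_i - x_j)||^2),
    with the diagonal zeroed so that j = i is excluded from every sum."""
    Z = X @ A.T                                           # row i is (A x_i)^T
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # x_ij^T A^T A x_ij
    K = np.exp(-d2)
    np.fill_diagonal(K, 0.0)
    return K

def loo_predictions(A, X, y):
    """Leave-one-out predictions y_hat_i(A)."""
    K = _kernel(A, X)
    return (K @ y) / K.sum(axis=1)

def mlkr_loss(A, X, y):
    """Leave-one-out squared error L(A) = sum_i (y_i - y_hat_i(A))^2."""
    return ((y - loo_predictions(A, X, y)) ** 2).sum()
```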

They state that the gradient of the loss function with respect to $\mathbf{A}$ is (their Eq. 8): $$ \frac{\partial \mathcal{L}}{\partial \mathbf{A}}(\mathbf{A}) = 4\mathbf{A}\sum\limits_{i=1}^{N} \big( \hat{y_{i}}(\mathbf{A}) - y_{i}\big) \sum\limits_{\substack{j=1 \\ j \neq i}}^{N}\big( \hat{y_{j}}(\mathbf{A}) - y_{j}\big)k_{ij}(\mathbf{A})x_{ij}x_{ij}^{\mathbf{T}}. $$ I can't figure out how they arrive at this expression.
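Since I cannot reconcile the two expressions on paper, I have also been trying to compare them numerically against a finite-difference approximation of $\mathcal{L}$. Below is a sketch of Eq. 8 transcribed literally, together with the finite-difference check; it reuses my helpers from the snippet above, and all function names are my own:

```python
def grad_paper_eq8(A, X, y):
    """Eq. 8 of the paper, transcribed literally:
    4 A sum_i (y_hat_i - y_i) sum_{j != i} (y_hat_j - y_j) k_ij x_ij x_ij^T."""
    N, d = X.shape
    r = loo_predictions(A, X, y) - y          # residuals y_hat_i - y_i
    K = _kernel(A, X)
    M = np.zeros((d, d))
    for i in range(N):
        for j in range(N):
            if j != i:
                x_ij = X[i] - X[j]
                M += r[i] * r[j] * K[i, j] * np.outer(x_ij, x_ij)
    return 4.0 * A @ M

def grad_numeric(A, X, y, eps=1e-6):
    """Central finite differences of mlkr_loss, entry by entry."""
    G = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        E = np.zeros_like(A)
        E[idx] = eps
        G[idx] = (mlkr_loss(A + E, X, y) - mlkr_loss(A - E, X, y)) / (2 * eps)
    return G
```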

My attempt: since $ k_{ij}(\mathbf{A}) = e^{-x_{ij}^{\mathbf{T}}\mathbf{A}^{\mathbf{T}}\mathbf{A}x_{ij}} $, the kernel derivative is $ \frac{\partial k_{ij}}{\partial \mathbf{A}}(\mathbf{A}) = -2k_{ij}(\mathbf{A})\,\mathbf{A}x_{ij}x_{ij}^{\mathbf{T}} $. By the chain rule, $$ \frac{\partial \mathcal{L}}{\partial \mathbf{A}}(\mathbf{A}) = 2 \sum\limits_{i=1}^{N} \big( \hat{y_{i}}(\mathbf{A}) - y_{i}\big)\frac{\partial \hat{y_{i}}}{\partial \mathbf{A}}(\mathbf{A}), $$ and by the quotient rule \begin{align*} \frac{\partial \hat{y_{i}}}{\partial \mathbf{A}}(\mathbf{A}) &= \frac{\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} \frac{\partial k_{ij}}{\partial \mathbf{A}}(\mathbf{A})y_{j} \Big)\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A}) \Big)-\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A})y_{j} \Big)\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} \frac{\partial k_{ij}}{\partial \mathbf{A}}(\mathbf{A}) \Big)}{\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A}) \Big)^{2}} \\ &= \frac{\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} \frac{\partial k_{ij}}{\partial \mathbf{A}}(\mathbf{A})y_{j} \Big)- \hat{y_{i}}(\mathbf{A})\Big(\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} \frac{\partial k_{ij}}{\partial \mathbf{A}}(\mathbf{A}) \Big)}{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A})} \\ &= \frac{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} \big(y_{j}-\hat{y_{i}}(\mathbf{A}) \big)\frac{\partial k_{ij}}{\partial \mathbf{A}}(\mathbf{A})}{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A})} = 2\mathbf{A} \frac{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} \big(\hat{y_{i}}(\mathbf{A})-y_{j}\big)k_{ij}(\mathbf{A})x_{ij}x_{ij}^{\mathbf{T}} }{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A})}. \end{align*} Hence, I get $$ \frac{\partial \mathcal{L}}{\partial \mathbf{A}}(\mathbf{A}) = 4\mathbf{A} \sum\limits_{i=1}^{N} \big( \hat{y_{i}}(\mathbf{A}) - y_{i}\big)\frac{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} \big(\hat{y_{i}}(\mathbf{A})-y_{j}\big)k_{ij}(\mathbf{A})x_{ij}x_{ij}^{\mathbf{T}} }{\sum\limits_{\substack{j=1 \\ j \neq i}}^{N} k_{ij}(\mathbf{A})}, $$ which differs from Eq. 8: the inner sum involves $\big(\hat{y_{i}}(\mathbf{A})-y_{j}\big)$ rather than $\big(\hat{y_{j}}(\mathbf{A})-y_{j}\big)$, and it is normalized by $\sum_{j \neq i} k_{ij}(\mathbf{A})$.
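Here is the same kind of sketch for my own expression, so that both candidates can be compared with `grad_numeric` on small random inputs (again, this reuses my helpers above and all names are mine):

```python
def grad_attempt(A, X, y):
    """My expression:
    4 A sum_i (y_hat_i - y_i)
        * sum_{j != i} (y_hat_i - y_j) k_ij x_ij x_ij^T / sum_{j != i} k_ij."""
    N, d = X.shape
    y_hat = loo_predictions(A, X, y)
    K = _kernel(A, X)
    denom = K.sum(axis=1)
    M = np.zeros((d, d))
    for i in range(N):
        inner = np.zeros((d, d))
        for j in range(N):
            if j != i:
                x_ij = X[i] - X[j]
                inner += (y_hat[i] - y[j]) * K[i, j] * np.outer(x_ij, x_ij)
        M += (y_hat[i] - y[i]) * inner / denom[i]
    return 4.0 * A @ M

# Quick comparison on random data (shapes and seed are arbitrary choices of mine):
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
A = rng.normal(size=(3, 3))
print(np.abs(grad_attempt(A, X, y)   - grad_numeric(A, X, y)).max())
print(np.abs(grad_paper_eq8(A, X, y) - grad_numeric(A, X, y)).max())
```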

What am I doing wrong? Any suggestions would be greatly appreciated.


