Technical Details of Stochastic Gradient Descent
This post works through the technical details of using SGD to solve the following four problems:\begin{align*} \mbox{SVM}: & \ \ \min_{\boldsymbol{w}} \ \frac{\lambda}{2} \|\boldsymbol{w}\|_2^2 + \frac{1}{M} \sum_{m=1}^M \max (0, 1 - y_m \boldsymbol{w}^\top \boldsymbol{x}_m ) \\ \mbox{Logistic Regression}: & \ \ \min_{\boldsymbol{w}} \ \frac{\lambda}{2} \|\boldsymbol{w}\|_2^2 + \frac{1}{M} \sum_{m=1}^M \ln (1 + e^{-y_m \boldsymbol{w}^\top \boldsymbol{x}_m}) \\ \mbox{Multi-Class Logistic Regression}: & \ \ \min_{\boldsymbol{w}_k} \ \sum_{k=1}^K \frac{\lambda}{2}\|\boldsymbol{w}_k\|_2^2 - \frac{1}{M} \sum_{m=1}^M \ln \frac{e^{\boldsymbol{x}_m^\top \boldsymbol{w}_{y_m}}}{\sum_{j=1}^K e^{\boldsymbol{x}_m^\top \boldsymbol{w}_j}} \\ \mbox{LASSO}: & \ \ \min_{\boldsymbol{w}} \ \lambda \|\boldsymbol{w}\|_1 + \frac{1}{M} \sum_{m=1}^M \frac{1}{2} ( \boldsymbol{w}^\top \boldsymbol{x}_m - y_m )^2 \end{align*}
1. SVM
The gradient of the SVM objective is\begin{align*} \lambda \boldsymbol{w} + \frac{1}{M} \sum_{m=1}^M \begin{cases} - y_m \boldsymbol{x}_m & y_m \boldsymbol{w}^\top \boldsymbol{x}_m < 1 \\ \boldsymbol{0} & otherwise \end{cases} \end{align*}(strictly speaking a subgradient, since the hinge loss has a kink at $y_m \boldsymbol{w}^\top \boldsymbol{x}_m = 1$). In round $t$, the unbiased estimate\begin{align*} \lambda \boldsymbol{w} + \begin{cases} - y_t \boldsymbol{x}_t & y_t \boldsymbol{w}^\top \boldsymbol{x}_t < 1 \\ \boldsymbol{0} & otherwise \end{cases} \end{align*}can serve as the descent direction, so the update rule can be written as\begin{align*} \boldsymbol{w}_{t+1} = (1 - \lambda \eta_t)\boldsymbol{w}_t + \begin{cases} \eta_t y_t \boldsymbol{x}_t & y_t \boldsymbol{w}_t^\top \boldsymbol{x}_t < 1 \\ \boldsymbol{0} & otherwise \end{cases} \end{align*}
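For concreteness, here is a minimal dense sketch of this update in Python (the function name and the parameters `w, x, y, eta, lam` are our own; the post specifies only the math):

```python
import numpy as np

def svm_sgd_step(w, x, y, eta, lam):
    """One dense SGD step for the L2-regularized hinge loss.

    w, x: 1-D arrays; y: +1/-1 label; eta: step size eta_t; lam: lambda.
    """
    margin = y * np.dot(w, x)      # y_t w_t^T x_t, evaluated at the old iterate
    w = (1.0 - lam * eta) * w      # shrinkage from the (lambda/2)||w||^2 term
    if margin < 1.0:               # hinge (sub)gradient is active
        w = w + eta * y * x
    return w
```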
To handle sparse features efficiently, substitute $\boldsymbol{w}_t = \boldsymbol{u}_t / a_t$, giving\begin{align*} \boldsymbol{u}_{t+1} = (1 - \lambda \eta_t) \frac{a_{t+1}}{a_t} \boldsymbol{u}_t + \begin{cases} a_{t+1} \eta_t y_t \boldsymbol{x}_t & y_t \boldsymbol{w}_t^\top \boldsymbol{x}_t < 1 \\ \boldsymbol{0} & otherwise \end{cases} \end{align*}so the update reduces to four steps\begin{align*} \begin{cases} 1: & z_t = y_t \boldsymbol{u}_t^\top \boldsymbol{x}_t / a_t \\ 2: & a_{t+1} = a_t / (1 - \lambda \eta_t) \\ 3: & \boldsymbol{d} = \begin{cases} a_{t+1} \eta_t y_t \boldsymbol{x}_t & z_t < 1 \\ \boldsymbol{0} & otherwise \end{cases} \\ 4: & \boldsymbol{u}_{t+1} = \boldsymbol{u}_t + \boldsymbol{d} \end{cases} \end{align*}The point of the substitution is that the shrinkage $(1 - \lambda \eta_t)$ would otherwise touch every coordinate of $\boldsymbol{w}_t$; here it is absorbed into the scalar $a_t$, and $\boldsymbol{u}_t$ changes only on the nonzero coordinates of $\boldsymbol{x}_t$.
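A sketch of the four steps, assuming $\boldsymbol{x}_t$ arrives as parallel index/value lists (any sparse format would do); only the scalar `a` and the nonzero coordinates of `u` are touched:

```python
def svm_sgd_step_sparse(u, a, x_idx, x_val, y, eta, lam):
    """Steps 1-4 above, with w = u / a kept implicitly; u is updated in place."""
    z = y * sum(u[i] * s for i, s in zip(x_idx, x_val)) / a   # step 1
    a = a / (1.0 - lam * eta)                                 # step 2
    if z < 1.0:                                               # step 3
        for i, s in zip(x_idx, x_val):
            u[i] += a * eta * y * s                           # step 4
    return a
```

One practical caveat not discussed in the post: $a_t$ grows geometrically, so a real implementation would periodically fold $a_t$ back into $\boldsymbol{u}_t$ and reset $a_t = 1$ to avoid overflow.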
If ASGD is used, with averaging starting after round $t_0$, then for $t > t_0$ we have $\boldsymbol{\bar{w}}_t = \frac{1}{t - t_0} \sum_{i=t_0+1}^t \boldsymbol{w}_i$, and hence\begin{align*} \boldsymbol{\bar{w}}_{t+1} = \frac{1}{t + 1 - t_0} \sum_{i=t_0+1}^{t + 1} \boldsymbol{w}_i = \frac{t - t_0}{t + 1 - t_0} \frac{1}{t - t_0} \sum_{i=t_0+1}^t \boldsymbol{w}_i + \frac{1}{t + 1 - t_0} \boldsymbol{w}_{t + 1} = (1 - \mu_t) \boldsymbol{\bar{w}}_t + \mu_t \boldsymbol{w}_{t+1} \end{align*}where $\mu_t = 1 / (t + 1 - t_0)$. For $t \leq t_0$ averaging has not started, so $\boldsymbol{\bar{w}}_t = \boldsymbol{w}_t$. Combining the two cases gives the recursion\begin{align*} \boldsymbol{\bar{w}}_{t+1} = (1 - \mu_t) \boldsymbol{\bar{w}}_t + \mu_t \boldsymbol{w}_{t+1} \end{align*}with $\mu_t = 1 / \max \{ 1, t + 1 - t_0 \}$.
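In code this recursion is a one-liner; a dense sketch (with `t0` the round after which averaging starts):

```python
def asgd_average(w_bar, w_next, t, t0):
    """Update the running ASGD average; mu_t = 1 / max(1, t + 1 - t0)."""
    mu = 1.0 / max(1, t + 1 - t0)
    return (1.0 - mu) * w_bar + mu * w_next
```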
Again to handle sparse features, make the further substitution $\boldsymbol{\bar{w}}_t = (\boldsymbol{v}_t + b_t \boldsymbol{u}_t) / c_t$. Plugging it into the recursion above gives\begin{align*} \frac{\boldsymbol{v}_{t + 1} + b_{t + 1} \boldsymbol{u}_{t + 1}}{c_{t + 1}} = (1 - \mu_t) \frac{\boldsymbol{v}_t + b_t \boldsymbol{u}_t}{c_t} + \mu_t \frac{\boldsymbol{u}_{t+1}}{a_{t+1}} \end{align*}which rearranges to\begin{align*} \boldsymbol{v}_{t + 1} = (1 - \mu_t) \frac{c_{t + 1}}{c_t} \boldsymbol{v}_t + (1 - \mu_t) \frac{c_{t + 1}}{c_t} b_t \boldsymbol{u}_t - \left( b_{t + 1} - \mu_t \frac{c_{t + 1}}{a_{t+1}} \right) \boldsymbol{u}_{t+1} \end{align*}Choosing\begin{align*} c_{t + 1} = \frac{c_t}{1 - \mu_t}, \ \ \ b_{t + 1} = b_t + \mu_t \frac{c_{t + 1}}{a_{t+1}} \end{align*}then yields\begin{align*} \boldsymbol{v}_{t + 1} = \boldsymbol{v}_t + b_t (\boldsymbol{u}_t - \boldsymbol{u}_{t+1}) = \boldsymbol{v}_t - \begin{cases} a_{t+1} b_t \eta_t y_t \boldsymbol{x}_t & y_t \boldsymbol{w}_t^\top \boldsymbol{x}_t < 1 \\ \boldsymbol{0} & otherwise \end{cases}\end{align*}so the update reduces to seven steps\begin{align*} \begin{cases} 1: & z_t = y_t \boldsymbol{u}_t^\top \boldsymbol{x}_t / a_t \\ 2: & a_{t+1} = a_t / (1 - \lambda \eta_t) \\ 3: & \boldsymbol{d} = \begin{cases} a_{t+1} \eta_t y_t \boldsymbol{x}_t & z_t < 1 \\ \boldsymbol{0} & otherwise \end{cases} \\ 4: & \boldsymbol{u}_{t+1} = \boldsymbol{u}_t + \boldsymbol{d} \\ 5: & c_{t + 1} = c_t / (1 - \mu_t) \\ 6: & \boldsymbol{v}_{t + 1} = \boldsymbol{v}_t - b_t \boldsymbol{d} \\ 7: & b_{t + 1} = b_t + \mu_t c_{t + 1} / a_{t+1} \end{cases} \end{align*}Note that the last three steps are valid only when $\mu_t < 1$, i.e. $t \geq t_0 + 1$. When $\mu_t = 1$, averaging has not started yet and we need $\boldsymbol{\bar{w}}_{t + 1} = \boldsymbol{w}_{t + 1} = \boldsymbol{u}_{t + 1} / a_{t + 1}$, which is achieved by simply setting\begin{align*} c_{t + 1} = a_{t+1}, \ \ \ \boldsymbol{v}_{t + 1} = \boldsymbol{0}, \ \ \ b_{t + 1} = 1 \end{align*}
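Putting the seven steps together, a sparse sketch under the same index/value convention as before ($\boldsymbol{w}_t = \boldsymbol{u}_t / a_t$, $\boldsymbol{\bar{w}}_t = (\boldsymbol{v}_t + b_t \boldsymbol{u}_t) / c_t$; the signature is our own):

```python
def svm_asgd_step_sparse(u, v, a, b, c, x_idx, x_val, y, eta, lam, mu):
    """One sparse ASGD step; u, v are updated in place, (a, b, c) returned."""
    z = y * sum(u[i] * s for i, s in zip(x_idx, x_val)) / a   # step 1
    a = a / (1.0 - lam * eta)                                 # step 2
    if mu < 1.0:                       # averaging has started (t >= t0 + 1)
        if z < 1.0:                    # steps 3-4 plus the sparse part of 6
            for i, s in zip(x_idx, x_val):
                d = a * eta * y * s
                u[i] += d              # step 4
                v[i] -= b * d          # step 6 (uses the old b)
        c = c / (1.0 - mu)             # step 5
        b = b + mu * c / a             # step 7 (uses the new c and a)
    else:                              # mu == 1: w_bar must equal w
        if z < 1.0:
            for i, s in zip(x_idx, x_val):
                u[i] += a * eta * y * s
        c = a                          # c_{t+1} = a_{t+1}
        b = 1.0                        # b_{t+1} = 1; v stays the zero vector
    return a, b, c
```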
2. Logistic Regression
The gradient of the Logistic Regression objective is\begin{align*} \lambda \boldsymbol{w} + \frac{1}{M} \sum_{m=1}^M \frac{e^{-y_m \boldsymbol{w}^\top \boldsymbol{x}_m} (-y_m \boldsymbol{x}_m)}{1 + e^{-y_m \boldsymbol{w}^\top \boldsymbol{x}_m}} = \lambda \boldsymbol{w} + \frac{1}{M} \sum_{m=1}^M (\sigma(y_m \boldsymbol{w}^\top \boldsymbol{x}_m) - 1) y_m \boldsymbol{x}_m \end{align*}where $\sigma(z) = 1 / (1 + e^{-z})$ is the sigmoid function. In round $t$, the unbiased estimate\begin{align*}\lambda \boldsymbol{w} + y_t (\sigma(y_t \boldsymbol{w}^\top \boldsymbol{x}_t) - 1) \boldsymbol{x}_t \end{align*}can serve as the descent direction, so the update rule can be written as\begin{align*} \boldsymbol{w}_{t+1} = (1 - \lambda \eta_t)\boldsymbol{w}_t + \eta_t y_t (1 - \sigma(y_t \boldsymbol{w}_t^\top \boldsymbol{x}_t)) \boldsymbol{x}_t \end{align*}
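A minimal dense sketch, mirroring the SVM example (names again our own):

```python
import numpy as np

def logreg_sgd_step(w, x, y, eta, lam):
    """One dense SGD step for L2-regularized logistic regression."""
    s = 1.0 / (1.0 + np.exp(-y * np.dot(w, x)))   # sigma(y_t w_t^T x_t)
    return (1.0 - lam * eta) * w + eta * y * (1.0 - s) * x
```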
To handle sparse features efficiently, substitute $\boldsymbol{w}_t = \boldsymbol{u}_t / a_t$, giving\begin{align*} \boldsymbol{u}_{t+1} = (1 - \lambda \eta_t) \frac{a_{t+1}}{a_t} \boldsymbol{u}_t + a_{t+1} \eta_t y_t (1 - \sigma(y_t \boldsymbol{w}_t^\top \boldsymbol{x}_t)) \boldsymbol{x}_t \end{align*}so the update reduces to four steps\begin{align*} \begin{cases} 1: & z_t = \sigma(y_t \boldsymbol{u}_t^\top \boldsymbol{x}_t / a_t) \\ 2: & a_{t+1} = a_t / (1 - \lambda \eta_t) \\ 3: & \boldsymbol{d} = a_{t+1} \eta_t y_t (1 - z_t) \boldsymbol{x}_t \\ 4: & \boldsymbol{u}_{t+1} = \boldsymbol{u}_t + \boldsymbol{d} \end{cases} \end{align*}
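The sparse sketch differs from the SVM version only in steps 1 and 3, where the hinge test is replaced by a sigmoid factor:

```python
import math

def logreg_sgd_step_sparse(u, a, x_idx, x_val, y, eta, lam):
    """Steps 1-4 above; u is updated in place, the new a is returned."""
    p = sum(u[i] * s for i, s in zip(x_idx, x_val)) / a
    z = 1.0 / (1.0 + math.exp(-y * p))       # step 1
    a = a / (1.0 - lam * eta)                # step 2
    g = a * eta * y * (1.0 - z)              # step 3: d = g * x_t
    for i, s in zip(x_idx, x_val):
        u[i] += g * s                        # step 4
    return a
```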
If ASGD is used, the derivation is exactly as in the SVM case: with averaging starting after round $t_0$, the running average satisfies\begin{align*} \boldsymbol{\bar{w}}_{t+1} = (1 - \mu_t) \boldsymbol{\bar{w}}_t + \mu_t \boldsymbol{w}_{t+1} \end{align*}with $\mu_t = 1 / \max \{ 1, t + 1 - t_0 \}$.
Again to handle sparse features, make the further substitution $\boldsymbol{\bar{w}}_t = (\boldsymbol{v}_t + b_t \boldsymbol{u}_t) / c_t$. With the same choices\begin{align*} c_{t + 1} = \frac{c_t}{1 - \mu_t}, \ \ \ b_{t + 1} = b_t + \mu_t \frac{c_{t + 1}}{a_{t+1}} \end{align*}as in the SVM case, we get\begin{align*} \boldsymbol{v}_{t + 1} = \boldsymbol{v}_t + b_t (\boldsymbol{u}_t - \boldsymbol{u}_{t+1}) = \boldsymbol{v}_t - b_t a_{t+1} \eta_t y_t (1 - z_t) \boldsymbol{x}_t \end{align*}so the update reduces to seven steps\begin{align*} \begin{cases} 1: & z_t = \sigma(y_t \boldsymbol{u}_t^\top \boldsymbol{x}_t / a_t) \\ 2: & a_{t+1} = a_t / (1 - \lambda \eta_t) \\ 3: & \boldsymbol{d} = a_{t+1} \eta_t y_t (1 - z_t) \boldsymbol{x}_t \\ 4: & \boldsymbol{u}_{t+1} = \boldsymbol{u}_t + \boldsymbol{d} \\ 5: & c_{t + 1} = c_t / (1 - \mu_t) \\ 6: & \boldsymbol{v}_{t + 1} = \boldsymbol{v}_t - b_t \boldsymbol{d} \\ 7: & b_{t + 1} = b_t + \mu_t c_{t + 1} / a_{t+1} \end{cases} \end{align*}As before, the last three steps are valid only when $\mu_t < 1$, i.e. $t \geq t_0 + 1$; when $\mu_t = 1$, averaging has not started and we need $\boldsymbol{\bar{w}}_{t + 1} = \boldsymbol{w}_{t + 1} = \boldsymbol{u}_{t + 1} / a_{t + 1}$, so it suffices to set\begin{align*} c_{t + 1} = a_{t+1}, \ \ \ \boldsymbol{v}_{t + 1} = \boldsymbol{0}, \ \ \ b_{t + 1} = 1 \end{align*}
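A sparse ASGD sketch with the same skeleton as the SVM version, changing only the gradient-specific lines:

```python
import math

def logreg_asgd_step_sparse(u, v, a, b, c, x_idx, x_val, y, eta, lam, mu):
    """u, v updated in place; returns the new scalars (a, b, c)."""
    p = sum(u[i] * s for i, s in zip(x_idx, x_val)) / a
    z = 1.0 / (1.0 + math.exp(-y * p))       # step 1
    a = a / (1.0 - lam * eta)                # step 2
    g = a * eta * y * (1.0 - z)              # step 3: d = g * x_t
    for i, s in zip(x_idx, x_val):
        u[i] += g * s                        # step 4
        if mu < 1.0:
            v[i] -= b * g * s                # step 6 (sparse part, old b)
    if mu < 1.0:
        c = c / (1.0 - mu)                   # step 5
        b = b + mu * c / a                   # step 7
    else:
        c, b = a, 1.0                        # w_bar = w; v stays zero
    return a, b, c
```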
3. Multi-Class Logistic Regression
Suppose there are $K$ classes and samples $\{\boldsymbol{x}_m, y_m\}$ with $y_m \in \{1, \cdots, K\}$. The objective is\begin{align*} E(\boldsymbol{w}_1, \cdots, \boldsymbol{w}_K) = \sum_{k=1}^K \frac{\lambda}{2}\|\boldsymbol{w}_k\|^2 - \frac{1}{M} \sum_{m=1}^M \ln \frac{e^{\boldsymbol{x}_m^\top \boldsymbol{w}_{y_m}}}{\sum_{j=1}^K e^{\boldsymbol{x}_m^\top \boldsymbol{w}_j}} \end{align*}If the sample drawn in round $t$ is $\{\boldsymbol{x}, y\}$, the instantaneous objective is\begin{align*} E_t(\boldsymbol{w}_1, \cdots, \boldsymbol{w}_K) = \sum_{k=1}^K \frac{\lambda}{2}\|\boldsymbol{w}_k\|^2 - \ln \frac{e^{\boldsymbol{x}^\top \boldsymbol{w}_y}}{\sum_{j=1}^K e^{\boldsymbol{x}^\top \boldsymbol{w}_j}} \end{align*}Write $p_j = \boldsymbol{x}^\top \boldsymbol{w}_j$ and $q_i = e^{p_i} / \sum_{j=1}^K e^{p_j}$, so the loss term becomes $- \ln q_y$ and its derivative with respect to $\boldsymbol{w}_j$ is\begin{align} \label{eq: gradient} - \frac{\partial \ln q_y}{\partial \boldsymbol{w}_j} = - \frac{\partial \ln q_y}{\partial q_y} \frac{\partial q_y}{\partial p_j} \frac{\partial p_j}{\partial \boldsymbol{w}_j} \end{align}If $y \neq j$, then\begin{align*} \frac{\partial q_y}{\partial p_j} = \frac{- e^{p_y} e^{p_j}}{(\sum_{i=1}^K e^{p_i})^2} = - q_y q_j \end{align*}otherwise\begin{align*} \frac{\partial q_y}{\partial p_j} = \frac{e^{p_y} \sum_{i=1}^K e^{p_i} - e^{p_y} e^{p_j}}{(\sum_{i=1}^K e^{p_i})^2} = q_y (1 - q_j) \end{align*}Combining the two cases,\begin{align*} \frac{\partial q_y}{\partial p_j} = q_y (I_{yj} - q_j) \end{align*}where $I_{yj} = 1$ if $y = j$ and $0$ otherwise. Substituting into (\ref{eq: gradient}) gives\begin{align*} - \frac{\partial \ln q_y}{\partial \boldsymbol{w}_j} = - \frac{1}{q_y} q_y (I_{yj} - q_j) \boldsymbol{x} = (q_j - I_{yj}) \boldsymbol{x} \end{align*}hence\begin{align*} \frac{\partial E_t}{\partial \boldsymbol{w}_j} = \lambda \boldsymbol{w}_j + (q_j - I_{yj}) \boldsymbol{x} \end{align*}and the round-$(t+1)$ SGD update for $\boldsymbol{w}_j$ is\begin{align*} \boldsymbol{w}_j^{(t+1)} = (1 - \lambda \eta_t) \boldsymbol{w}_j^{(t)} + \eta_t (I_{yj} - q_j) \boldsymbol{x} \end{align*}
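This per-class gradient is straightforward to code and to check numerically; a dense sketch (rows of `W` are the $\boldsymbol{w}_j$, classes are 0-indexed, and the max-subtraction is a standard numerical-stability trick, not part of the derivation above):

```python
import numpy as np

def softmax_grads(W, x, y, lam):
    """Gradients lambda*w_j + (q_j - I[y==j]) * x for one sample (x, y)."""
    p = W @ x                        # p_j = x^T w_j
    q = np.exp(p - p.max())          # subtract max for numerical stability
    q /= q.sum()                     # q_j = softmax(p)_j
    coef = q.copy()
    coef[y] -= 1.0                   # q_j - I_{yj}
    return lam * W + np.outer(coef, x)
```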
To handle sparse features efficiently, substitute $\boldsymbol{w}_j^{(t)} = \boldsymbol{u}_j^{(t)} / a_t$, giving\begin{align*} \boldsymbol{u}_j^{(t+1)} = (1 - \lambda \eta_t) \frac{a_{t+1}}{a_t} \boldsymbol{u}_j^{(t)} + a_{t+1} \eta_t (I_{yj} - q_j) \boldsymbol{x} \end{align*}so the update reduces to four steps\begin{align*} \begin{cases} 1: & p_j = \boldsymbol{x}^\top \boldsymbol{u}_j^{(t)} / a_t, \ \ q_j = e^{p_j} / \sum_{i=1}^K e^{p_i} \\ 2: & a_{t+1} = a_t / (1 - \lambda \eta_t) \\ 3: & \boldsymbol{d} = a_{t+1} \eta_t (I_{yj} - q_j) \boldsymbol{x} \\ 4: & \boldsymbol{u}_j^{(t+1)} = \boldsymbol{u}_j^{(t)} + \boldsymbol{d} \end{cases} \end{align*}where steps 1, 3 and 4 are carried out for every class $j$.
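A sparse sketch of the four steps, with `U` a $K \times D$ array whose rows are the $\boldsymbol{u}_j^{(t)}$ and $\boldsymbol{x}$ again given as index/value lists (assumed free of duplicate indices, so the fancy-indexed update below is safe):

```python
import numpy as np

def mclr_sgd_step_sparse(U, a, x_idx, x_val, y, eta, lam):
    """Steps 1-4 for all K classes; U is updated in place, new a returned."""
    xv = np.asarray(x_val)
    p = U[:, x_idx] @ xv / a          # p_j, using only the nonzeros of x
    q = np.exp(p - p.max())
    q /= q.sum()                      # step 1
    a = a / (1.0 - lam * eta)         # step 2
    coef = -q
    coef[y] += 1.0                    # I_{yj} - q_j
    U[:, x_idx] += a * eta * np.outer(coef, xv)   # steps 3-4, all classes
    return a
```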
If ASGD is used, the same derivation as before gives, for each class $j$, the recursion\begin{align*}\boldsymbol{\bar{w}}_j^{(t+1)} = (1 - \mu_t) \boldsymbol{\bar{w}}_j^{(t)} + \mu_t \boldsymbol{w}_j^{(t+1)} \end{align*}with $\mu_t = 1 / \max \{ 1, t + 1 - t_0 \}$.
Again to handle sparse features, make the further substitution $\boldsymbol{\bar{w}}_j^{(t)} = (\boldsymbol{v}_j^{(t)} + b_t \boldsymbol{u}_j^{(t)}) / c_t$. Plugging it into the recursion gives\begin{align*} \frac{\boldsymbol{v}_j^{(t+1)} + b_{t + 1} \boldsymbol{u}_j^{(t+1)}}{c_{t + 1}} = (1 - \mu_t) \frac{\boldsymbol{v}_j^{(t)} + b_t \boldsymbol{u}_j^{(t)}}{c_t} + \mu_t \frac{\boldsymbol{u}_j^{(t+1)}}{a_{t+1}} \end{align*}which rearranges to\begin{align*} \boldsymbol{v}_j^{(t+1)} = (1 - \mu_t) \frac{c_{t + 1}}{c_t} \boldsymbol{v}_j^{(t)} + (1 - \mu_t) \frac{c_{t + 1}}{c_t} b_t \boldsymbol{u}_j^{(t)} - \left( b_{t + 1} - \mu_t \frac{c_{t + 1}}{a_{t+1}} \right) \boldsymbol{u}_j^{(t+1)} \end{align*}Choosing\begin{align*} c_{t + 1} = \frac{c_t}{1 - \mu_t}, \ \ \ b_{t + 1} = b_t + \mu_t \frac{c_{t + 1}}{a_{t+1}} \end{align*}then yields\begin{align*} \boldsymbol{v}_j^{(t+1)} = \boldsymbol{v}_j^{(t)} + b_t (\boldsymbol{u}_j^{(t)} - \boldsymbol{u}_j^{(t+1)}) = \boldsymbol{v}_j^{(t)} - b_t a_{t+1} \eta_t (I_{yj} - q_j) \boldsymbol{x} \end{align*}so the update reduces to seven steps\begin{align*}\begin{cases} 1: & p_j = \boldsymbol{x}^\top \boldsymbol{u}_j^{(t)} / a_t, \ \ q_j = e^{p_j} / \sum_{i=1}^K e^{p_i} \\ 2: & a_{t+1} = a_t / (1 - \lambda \eta_t) \\ 3: & \boldsymbol{d} = a_{t+1} \eta_t (I_{yj} - q_j) \boldsymbol{x} \\ 4: & \boldsymbol{u}_j^{(t+1)} = \boldsymbol{u}_j^{(t)} + \boldsymbol{d} \\ 5: & c_{t + 1} = c_t / (1 - \mu_t) \\ 6: & \boldsymbol{v}_j^{(t+1)} = \boldsymbol{v}_j^{(t)} - b_t \boldsymbol{d} \\ 7: & b_{t + 1} = b_t + \mu_t c_{t + 1} / a_{t+1} \end{cases}\end{align*}As before, the last three steps are valid only when $\mu_t < 1$, i.e. $t \geq t_0 + 1$; when $\mu_t = 1$, averaging has not started and we need $\boldsymbol{\bar{w}}_j^{(t+1)} = \boldsymbol{w}_j^{(t+1)} = \boldsymbol{u}_j^{(t+1)} / a_{t + 1}$, so it suffices to set\begin{align*} c_{t + 1} = a_{t+1}, \ \ \ \boldsymbol{v}_j^{(t+1)} = \boldsymbol{0}, \ \ \ b_{t + 1} = 1 \end{align*}
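The corresponding seven-step sketch, updating all $K$ classes at once (`V` holds the $\boldsymbol{v}_j^{(t)}$ as rows; again a sketch under the same assumptions, not a tuned implementation):

```python
import numpy as np

def mclr_asgd_step_sparse(U, V, a, b, c, x_idx, x_val, y, eta, lam, mu):
    """U, V updated in place; returns the new scalars (a, b, c)."""
    xv = np.asarray(x_val)
    p = U[:, x_idx] @ xv / a
    q = np.exp(p - p.max())
    q /= q.sum()                      # step 1
    a = a / (1.0 - lam * eta)         # step 2
    coef = -q
    coef[y] += 1.0                    # I_{yj} - q_j
    D = a * eta * np.outer(coef, xv)  # step 3, one row per class
    U[:, x_idx] += D                  # step 4
    if mu < 1.0:
        c = c / (1.0 - mu)            # step 5
        V[:, x_idx] -= b * D          # step 6 (uses the old b)
        b = b + mu * c / a            # step 7
    else:
        c, b = a, 1.0                 # averaging not started; V stays zero
    return a, b, c
```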
4. LASSO
Write $\boldsymbol{w} = \boldsymbol{u} - \boldsymbol{v}$ with $\boldsymbol{u} \geq \boldsymbol{0}$, $\boldsymbol{v} \geq \boldsymbol{0}$, and let $\boldsymbol{e}$ be the all-ones vector. The round-$t$ optimization problem is then\begin{align*} \min_{\boldsymbol{u}_t,\boldsymbol{v}_t} \ \ \ \lambda \boldsymbol{u}_t^\top \boldsymbol{e} + \lambda \boldsymbol{v}_t^\top \boldsymbol{e} + \frac{1}{2} (\boldsymbol{u}_t^\top \boldsymbol{x}_t - \boldsymbol{v}_t^\top \boldsymbol{x}_t - y_t)^2 \end{align*}Differentiating with respect to $\boldsymbol{u}_t$ and $\boldsymbol{v}_t$ gives (writing $\boldsymbol{w}_t = \boldsymbol{u}_t - \boldsymbol{v}_t$)\begin{align*} \nabla_{\boldsymbol{u}_t} & = \lambda \boldsymbol{e} + (\boldsymbol{w}_t^\top \boldsymbol{x}_t - y_t)\boldsymbol{x}_t \\ \nabla_{\boldsymbol{v}_t} & = \lambda \boldsymbol{e} - (\boldsymbol{w}_t^\top \boldsymbol{x}_t - y_t)\boldsymbol{x}_t \end{align*}so the corresponding updates, with a componentwise $\max$ projecting back onto the nonnegative orthant, are\begin{align*} \boldsymbol{u}_{t+1} & = \max \{ \boldsymbol{0}, \boldsymbol{u}_t - \eta_t (\lambda \boldsymbol{e} + (\boldsymbol{w}_t^\top \boldsymbol{x}_t - y_t)\boldsymbol{x}_t) \} = \max \{ \boldsymbol{0}, \boldsymbol{u}_t - \eta_t \lambda \boldsymbol{e} - \eta_t (\boldsymbol{w}_t^\top \boldsymbol{x}_t - y_t)\boldsymbol{x}_t \} \\ \boldsymbol{v}_{t+1} & = \max \{ \boldsymbol{0}, \boldsymbol{v}_t - \eta_t (\lambda \boldsymbol{e} - (\boldsymbol{w}_t^\top \boldsymbol{x}_t - y_t)\boldsymbol{x}_t) \} = \max \{ \boldsymbol{0}, \boldsymbol{v}_t - \eta_t \lambda \boldsymbol{e} + \eta_t (\boldsymbol{w}_t^\top \boldsymbol{x}_t - y_t)\boldsymbol{x}_t \} \end{align*}
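A dense sketch of this projected update; `np.maximum` implements the componentwise $\max$, and a scalar `lam` broadcasts as $\lambda \boldsymbol{e}$:

```python
import numpy as np

def lasso_sgd_step(u, v, x, y, eta, lam):
    """One projected SGD step on the split w = u - v, u >= 0, v >= 0."""
    r = np.dot(u - v, x) - y                            # w_t^T x_t - y_t
    u_new = np.maximum(0.0, u - eta * (lam + r * x))    # project onto u >= 0
    v_new = np.maximum(0.0, v - eta * (lam - r * x))    # project onto v >= 0
    return u_new, v_new
```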