Implementing Large Margin Softmax Loss in Caffe (Part 2)


2024-08-16 00:26:34, 211 views

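Miao's ramblings: in the previous post we finished the preparations for L-Softmax. In this part we start working through the forward pass.

Miao's blog: http://miaoerduo.com

Original post: http://www.miaoerduo.com/deep-learning/基于caffe的large-margin-softmax-loss的实现（中）.html

4. Forward Pass

Do you remember the three formulas from the previous post? It doesn't matter if you don't.

This time we will implement those formulas in code, one step at a time. I mainly implement the forward and backward passes on the GPU: this layer is only used for training, and the GPU should be somewhat faster.

We first run the forward pass of an ordinary FC layer, because the forward pass of LM_FC only modifies a few values of the ordinary FC output; most of the values are left untouched.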

    const Dtype* bottom_data = bottom[0]->gpu_data();
    const Dtype* label_data = bottom[1]->gpu_data();
    Dtype* top_data = top[0]->mutable_gpu_data();
    const Dtype* weight = this->blobs_[0]->gpu_data();
    // Ordinary FC layer computation
    if (M_ == 1) {
      caffe_gpu_gemv<Dtype>(CblasNoTrans, N_, K_, (Dtype)1.,
                           weight, bottom_data, (Dtype)0., top_data);
    } else {
      caffe_gpu_gemm<Dtype>(CblasNoTrans,
                            transpose_ ? CblasNoTrans : CblasTrans,
                            M_, N_, K_, (Dtype)1.,
                            bottom_data, weight, (Dtype)0., top_data);
    }

That completes the forward pass of an ordinary FC layer.
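To make the shapes concrete, here is a minimal CPU sketch of what the gemm call above computes when transpose is false: every output is the dot product of one feature row with one weight row. The function name fc_forward_ref is hypothetical and not part of Caffe or of the layer in this post; it is only meant as a reference to compare against.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not part of Caffe.
// bottom: M x K (one feature per row), weight: N x K (one class per row),
// result: M x N, all stored row-major.
std::vector<float> fc_forward_ref(const std::vector<float>& bottom,
                                  const std::vector<float>& weight,
                                  int M, int N, int K) {
  assert(bottom.size() == static_cast<std::size_t>(M) * K);
  assert(weight.size() == static_cast<std::size_t>(N) * K);
  std::vector<float> top(static_cast<std::size_t>(M) * N, 0.0f);
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        acc += bottom[m * K + k] * weight[n * K + k];  // W_n^T x_m
      }
      top[m * N + n] = acc;
    }
  }
  return top;
}
```

The entry top[m * N + label] of this result is the $W_j^Tx_i$ term that the later steps read back.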

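Next come the specifics.

1. $\cos(\theta_j)=\frac{W_j^Tx_i}{\|W_j\|\|x_i\|}$

This is the cosine between the weight vector of label $j$ and the feature. You probably learned the formula in high school. It requires three parts: $W_j^Tx_i$, $\|W_j\|$ and $\|x_i\|$. Here $i$ is the index of the feature, since a mini-batch contains many images, and $j$ is the correct label.

$W_j^Tx_i$ is very easy to obtain, because it is exactly what the FC forward pass computes. We can therefore copy the value at the corresponding position straight out of the FC output.

$\|W_j\|$ and $\|x_i\|$ are plain vector norms, which are easy to get with caffe_cpu_dot (why not caffe_gpu_dot? Because when I used caffe_gpu_dot, Caffe reported a strange error; I am not sure whether that is because GPU memory cannot be accessed arbitrarily).

Plug these into the formula above and the cosine is done~

A few variables are used here:

M_: batch size

N_: number of classes

K_: feature length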

    // Host-side view of the labels (defined the same way in the full listing below)
    const Dtype* label_cpu_data = bottom[1]->cpu_data();

    // w * x
    // Copied directly from the FC forward result
    Dtype *wx_data = this->wx_.mutable_gpu_data();
    copy_label_score<Dtype><<<CAFFE_GET_BLOCKS(M_), CAFFE_CUDA_NUM_THREADS>>>(M_, N_, label_data, top_data, wx_data);

    // w * w
    Dtype *abs_w_data = this->abs_w_.mutable_cpu_data();
    for (int m = 0; m < M_; ++m) {
      abs_w_data[m] = caffe_cpu_dot<Dtype>(
        K_,
        this->blobs_[0]->cpu_data() + static_cast<int>(label_cpu_data[m]) * K_,
        this->blobs_[0]->cpu_data() + static_cast<int>(label_cpu_data[m]) * K_
        );
    }

    // x * x
    Dtype *abs_x_data = this->abs_x_.mutable_cpu_data();
    for (int m = 0; m < M_; ++m) {
      abs_x_data[m] = caffe_cpu_dot<Dtype>(
        K_,
        bottom[0]->cpu_data() + m * K_,
        bottom[0]->cpu_data() + m * K_
        );
    }
    // abs_w, abs_x: square roots of the squared norms
    caffe_gpu_powx<Dtype>(M_, this->abs_w_.mutable_gpu_data(), 0.5, this->abs_w_.mutable_gpu_data());
    caffe_gpu_powx<Dtype>(M_, this->abs_x_.mutable_gpu_data(), 0.5, this->abs_x_.mutable_gpu_data());

    // cos_t = wx / (|x| * |w|)
    Dtype *cos_t_data = this->cos_t_.mutable_gpu_data();
    caffe_gpu_div<Dtype>(M_, wx_data, this->abs_x_.gpu_data(), cos_t_data);
    caffe_gpu_div<Dtype>(M_, cos_t_data, this->abs_w_.gpu_data(), cos_t_data);

Here copy_label_score is a kernel we write ourselves to copy the results (how to write CUDA programs is a subject of its own):

    template <typename Dtype>
    __global__ void copy_label_score(const int M, const int N, const Dtype *label_data, const Dtype *top_data, Dtype *wx_data) {
      CUDA_KERNEL_LOOP(index, M) {
        // Pick the score of the ground-truth class for sample `index`
        wx_data[index] = top_data[index * N + static_cast<int>(label_data[index])];
      }
    }

I trust that fans as clever as you will have no trouble understanding these few lines.
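For checking the GPU result on a small batch, a CPU-only sketch of step 1 might look like the following. It is not the layer's code: label_cosine_ref is a hypothetical name, the labels are passed as plain ints rather than Dtype, and it assumes every feature and every weight row has a non-zero norm.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for step 1 (not part of the layer).
// x: M x K features, w: N x K weights (transpose = false), label[m] in [0, N).
// Returns cos(theta) between x_m and the weight row of its own label.
std::vector<float> label_cosine_ref(const std::vector<float>& x,
                                    const std::vector<float>& w,
                                    const std::vector<int>& label,
                                    int M, int N, int K) {
  assert(x.size() == static_cast<std::size_t>(M) * K);
  assert(w.size() == static_cast<std::size_t>(N) * K);
  assert(label.size() == static_cast<std::size_t>(M));
  std::vector<float> cos_t(M);
  for (int m = 0; m < M; ++m) {
    assert(label[m] >= 0 && label[m] < N);
    const float* xm = &x[m * K];
    const float* wj = &w[label[m] * K];
    float wx = 0.0f, ww = 0.0f, xx = 0.0f;
    for (int k = 0; k < K; ++k) {
      wx += wj[k] * xm[k];  // W_j^T x_i
      ww += wj[k] * wj[k];  // |W_j|^2
      xx += xm[k] * xm[k];  // |x_i|^2
    }
    // Assumes both norms are non-zero; otherwise this divides by zero.
    cos_t[m] = wx / (std::sqrt(ww) * std::sqrt(xx));
  }
  return cos_t;
}
```

For example, with x = (3, 4) and the label's weight row (0, 1), the dot product is 4 and the norms are 5 and 1, so the cosine is 0.8.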

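Here I would like to explain a little more.
We know that all data in Caffe is stored in Blob structures. The bottom_data here, for example, is the data of a blob whose default shape is (n, c, h, w): n is the batch size, c is the number of channels, and h and w are the height and width respectively. The values inside a blob are laid out in memory in the same order as an ordinary C array. That is why, when computing the feature norms, we simply take one dot product over every K_ consecutive values.
Likewise, the weights are stored in this->blobs_[0], so what shape do they have? Conveniently, if transpose is set to false in the prototxt, the weight has shape N_ x K_. In other words, we can view the weight as a matrix whose rows are each dotted with the feature to produce the output, so every row of the weight is exactly a $W_j$ whose norm we need. The norm of the weight is therefore computed in much the same way as the norm of the feature. We force transpose to false here because it keeps the computation simple. If you set it to true, you will have to write your own norm function.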
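The layout described above can be sketched in a couple of lines. The index formula is the usual row-major one for an (n, c, h, w) array, which is also what Caffe's Blob::offset computes; the helper names blob_offset and feature_start are hypothetical and only used for illustration.

```cpp
#include <cassert>

// Row-major offset of element (n, c, h, w) in a blob with C channels,
// height H and width W. Hypothetical helper mirroring the usual formula.
int blob_offset(int n, int c, int h, int w, int C, int H, int W) {
  assert(c >= 0 && c < C && h >= 0 && h < H && w >= 0 && w < W && n >= 0);
  return ((n * C + c) * H + h) * W + w;
}

// When the blob is viewed as an M x K matrix with K = C * H * W,
// sample n's feature occupies the K values starting at n * K.
int feature_start(int n, int K) {
  return n * K;
}
```

Because blob_offset(n, 0, 0, 0, C, H, W) equals n * C * H * W, stepping through the data K_ values at a time visits exactly one sample (or, for the N_ x K_ weight, one class row) per step.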

2. $\cos(m\theta_i)=\sum_n(-1)^n{C_m^{2n}\cos^{m-2n}(\theta_i)\cdot(1-\cos(\theta_i)^2)^n}, (2n\leq m)$

In (1) we obtained $\cos(\theta)$; for a given margin $m$, we only need to substitute it into this formula to get $\cos(m\theta)$.

    template <typename Dtype>
    __global__ void cal_cos_mt(const int count, const unsigned int margin, const int *C_M_N, const Dtype *cos_t_data, Dtype *cos_mt_data) {
      CUDA_KERNEL_LOOP(index, count) {
        Dtype cos_t = cos_t_data[index];
        Dtype sin_t_2 = 1 - cos_t * cos_t;  // sin^2(theta)
        Dtype cos_mt = 0.;
        int flag = -1;
        for (int n = 0; n <= (margin / 2); ++n) {
          flag *= -1;  // alternating sign (-1)^n
          cos_mt += flag * C_M_N[2 * n] * powf(cos_t, (margin - 2 * n)) * powf(sin_t_2, n);
        }
        cos_mt_data[index] = cos_mt;
      }
    }

Above is the CUDA kernel that computes $\cos(m\theta)$. Calling it is just as simple:

    // cos(mt)
    cal_cos_mt<Dtype><<<CAFFE_GET_BLOCKS(M_), CAFFE_CUDA_NUM_THREADS>>>(
      M_, this->margin, this->C_M_N_.gpu_data(), this->cos_t_.gpu_data(), this->cos_mt_.mutable_gpu_data());
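The expansion in formula 2 is easy to verify on the CPU against the direct definition cos(m * acos(c)). The sketch below is a stand-alone check rather than the layer's code: binomial_row and cos_m_theta are hypothetical names, and the coefficient table is rebuilt locally with Pascal's rule instead of being read from C_M_N_.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Binomial coefficients C(m, 0), ..., C(m, m), built with Pascal's rule.
std::vector<int> binomial_row(int m) {
  std::vector<int> c(m + 1, 0);
  c[0] = 1;
  for (int i = 1; i <= m; ++i) {
    for (int j = i; j >= 1; --j) {
      c[j] += c[j - 1];
    }
  }
  return c;
}

// cos(m * theta) computed from cos(theta) with the expansion in formula 2.
double cos_m_theta(double cos_t, int m) {
  assert(m >= 1);
  const std::vector<int> c = binomial_row(m);
  const double sin_t_2 = 1.0 - cos_t * cos_t;  // sin^2(theta)
  double result = 0.0;
  int sign = 1;  // (-1)^n
  for (int n = 0; 2 * n <= m; ++n) {
    result += sign * c[2 * n] * std::pow(cos_t, m - 2 * n) * std::pow(sin_t_2, n);
    sign = -sign;
  }
  return result;
}
```

For m = 2 the sum collapses to the familiar double-angle identity $\cos^2\theta-\sin^2\theta$, and for other margins the result can be compared with std::cos(m * std::acos(c)) up to rounding error.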

3. $f_{y_{i}}=(-1)^k\cdot\|W_{y_{i}}\|\|x_{i}\|\cos(m\theta_i)-2k\cdot\|W_{y_i}\|\|x_i\|$

Strictly speaking, what we need to compute is not this expression, but:

\[f_{y_i}=\frac{\lambda\|W_{y_i}\|\|x_i\|\cos(\theta_{y_i})+\|W_{y_i}\|\|x_i\|\varphi(\theta_{y_i})}{1+\lambda}\]

\[\varphi(\theta)=(-1)^k\cos(m\theta)-2k, \theta\in[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}]\]

As you can see, when $\lambda$ is 0, these two expressions reduce to the previous one.

Finding k is simple: just compare $\cos(\theta)$ against the bounds of each interval.

    // k
    int *k_cpu_data = this->k_.mutable_cpu_data();
    const Dtype *cos_t_cpu_data = this->cos_t_.cpu_data();
    for (int m = 0; m < M_; ++m) {
      for (int _k = 0; _k < this->cos_theta_bound_.count(); ++_k) {
        // The first bound strictly below cos(theta) closes the interval of k
        if (this->cos_theta_bound_.cpu_data()[_k] < cos_t_cpu_data[m]) {
          k_cpu_data[m] = _k - 1;
          break;
        }
      }
    }
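The interval search can be tried out in isolation with the sketch below. It assumes the bounds table holds cos(k * pi / m) for k = 0..m; that layout comes from the previous post and is an assumption here. The names make_cos_bounds and find_k are hypothetical. Note that this is a slight variant of the loop above, not a copy: it starts comparing at the second bound and returns m - 1 when nothing matches, so that the endpoints cos(theta) = 1 and cos(theta) = -1 also get a defined k.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Assumed layout of the bounds table: bound[k] = cos(k * pi / m), k = 0..m.
std::vector<double> make_cos_bounds(int m) {
  assert(m >= 1);
  const double pi = std::acos(-1.0);
  std::vector<double> bound(m + 1);
  for (int k = 0; k <= m; ++k) {
    bound[k] = std::cos(k * pi / m);
  }
  return bound;
}

// Returns k such that theta lies in [k*pi/m, (k+1)*pi/m], given cos(theta).
// The bounds decrease with k, so the first bound below cos_t ends the interval.
int find_k(double cos_t, const std::vector<double>& bound) {
  const int m = static_cast<int>(bound.size()) - 1;
  assert(m >= 1);
  for (int k = 1; k <= m; ++k) {
    if (bound[k] < cos_t) {
      return k - 1;
    }
  }
  return m - 1;  // cos_t <= cos(pi): theta sits at the end of the last interval
}
```

With m = 4 the bounds are roughly 1, 0.707, 0, -0.707 and -1, so cos(theta) = 0.5 falls in the interval with k = 1 and cos(theta) = -0.9 in the one with k = 3.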

The last step is to compute the actual forward value! Following the formula, the kernel is easy to write:

    template <typename Dtype>
    __global__ void LMForward(
      const int M, const int N, const float lambda,
      const Dtype *label_data, const Dtype *cos_mt_data, const int *k_data,
      const Dtype *abs_w_data, const Dtype *abs_x_data, Dtype *top_data) {

      CUDA_KERNEL_LOOP(index, M) {
        Dtype cos_mt = cos_mt_data[index];
        int k = k_data[index];
        int label = static_cast<int>(label_data[index]);
        Dtype abs_w = abs_w_data[index];
        Dtype abs_x = abs_x_data[index];
        // Only the score of the ground-truth class is overwritten
        top_data[N * index + label] =  (lambda * top_data[N * index + label] + abs_w * abs_x * ( powf(-1, k) * cos_mt - 2 * k )) / (1 + lambda);
      }
    }

The call is also simple:

    // y
    LMForward<Dtype><<<CAFFE_GET_BLOCKS(M_), CAFFE_CUDA_NUM_THREADS>>>(
      M_, N_, this->lambda,
      label_data, this->cos_mt_.gpu_data(), this->k_.gpu_data(),
      this->abs_w_.gpu_data(), this->abs_x_.gpu_data(), top[0]->mutable_gpu_data());
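The per-sample arithmetic of this last step can be sketched on the CPU as two small functions. The names phi and lm_logit are hypothetical; wx stands for the plain FC score $W_{y_i}^Tx_i$ already sitting in top_data before the kernel overwrites it.

```cpp
#include <cassert>

// phi(theta) = (-1)^k * cos(m*theta) - 2k
double phi(double cos_mt, int k) {
  assert(k >= 0);
  const double sign = (k % 2 == 0) ? 1.0 : -1.0;
  return sign * cos_mt - 2.0 * k;
}

// f = (lambda * wx + |w| * |x| * phi(theta)) / (1 + lambda)
double lm_logit(double wx, double abs_w, double abs_x,
                double cos_mt, int k, double lambda) {
  assert(lambda >= 0.0);
  return (lambda * wx + abs_w * abs_x * phi(cos_mt, k)) / (1.0 + lambda);
}
```

With lambda = 0 the FC score drops out and only the margin term remains, while a large lambda pulls the result back towards the ordinary FC output.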

Finally, here is the complete forward code (headers and the caffe namespace omitted):

    template <typename Dtype>
    __global__ void copy_label_score(const int M, const int N, const Dtype *label_data, const Dtype *top_data, Dtype *wx_data) {
      CUDA_KERNEL_LOOP(index, M) {
        wx_data[index] = top_data[index * N + static_cast<int>(label_data[index])];
      }
    }

    template <typename Dtype>
    __global__ void cal_cos_mt(const int count, const unsigned int margin, const int *C_M_N, const Dtype *cos_t_data, Dtype *cos_mt_data) {
      CUDA_KERNEL_LOOP(index, count) {
        Dtype cos_t = cos_t_data[index];
        Dtype sin_t_2 = 1 - cos_t * cos_t;
        Dtype cos_mt = 0.;
        int flag = -1;
        for (int n = 0; n <= (margin / 2); ++n) {
          flag *= -1;
          cos_mt += flag * C_M_N[2 * n] * powf(cos_t, (margin - 2 * n)) * powf(sin_t_2, n);
        }
        cos_mt_data[index] = cos_mt;
      }
    }

    template <typename Dtype>
    __global__ void LMForward(
      const int M, const int N, const float lambda,
      const Dtype *label_data, const Dtype *cos_mt_data, const int *k_data,
      const Dtype *abs_w_data, const Dtype *abs_x_data, Dtype *top_data) {

      CUDA_KERNEL_LOOP(index, M) {
        Dtype cos_mt = cos_mt_data[index];
        int k = k_data[index];
        int label = static_cast<int>(label_data[index]);
        Dtype abs_w = abs_w_data[index];
        Dtype abs_x = abs_x_data[index];
        top_data[N * index + label] =  (lambda * top_data[N * index + label] + abs_w * abs_x * ( powf(-1, k) * cos_mt - 2 * k )) / (1 + lambda);
      }
    }

    template <typename Dtype>
    void LargeMarginInnerProductLayer<Dtype>::Forward_gpu(const vector<Blob<Dtype>*>& bottom,
        const vector<Blob<Dtype>*>& top) {
      const Dtype* bottom_data = bottom[0]->gpu_data();
      const Dtype* label_data = bottom[1]->gpu_data();
      Dtype* top_data = top[0]->mutable_gpu_data();
      const Dtype* weight = this->blobs_[0]->gpu_data();

      // Ordinary FC layer computation
      if (M_ == 1) {
        caffe_gpu_gemv<Dtype>(CblasNoTrans, N_, K_, (Dtype)1.,
                             weight, bottom_data, (Dtype)0., top_data);
      } else {
        caffe_gpu_gemm<Dtype>(CblasNoTrans,
                              transpose_ ? CblasNoTrans : CblasTrans,
                              M_, N_, K_, (Dtype)1.,
                              bottom_data, weight, (Dtype)0., top_data);
      }

      const Dtype* label_cpu_data = bottom[1]->cpu_data();

      // w * x
      // Copied directly from the FC forward result
      Dtype *wx_data = this->wx_.mutable_gpu_data();
      copy_label_score<Dtype><<<CAFFE_GET_BLOCKS(M_), CAFFE_CUDA_NUM_THREADS>>>(M_, N_, label_data, top_data, wx_data);

      // w * w
      Dtype *abs_w_data = this->abs_w_.mutable_cpu_data();
      for (int m = 0; m < M_; ++m) {
        abs_w_data[m] = caffe_cpu_dot<Dtype>(
          K_,
          this->blobs_[0]->cpu_data() + static_cast<int>(label_cpu_data[m]) * K_,
          this->blobs_[0]->cpu_data() + static_cast<int>(label_cpu_data[m]) * K_
          );
      }

      // x * x
      Dtype *abs_x_data = this->abs_x_.mutable_cpu_data();
      for (int m = 0; m < M_; ++m) {
        abs_x_data[m] = caffe_cpu_dot<Dtype>(
          K_,
          bottom[0]->cpu_data() + m * K_,
          bottom[0]->cpu_data() + m * K_
          );
      }

      // abs_w, abs_x
      caffe_gpu_powx<Dtype>(M_, this->abs_w_.mutable_gpu_data(), 0.5, this->abs_w_.mutable_gpu_data());
      caffe_gpu_powx<Dtype>(M_, this->abs_x_.mutable_gpu_data(), 0.5, this->abs_x_.mutable_gpu_data());

      // cos_t = wx / (|x| * |w|)
      Dtype *cos_t_data = this->cos_t_.mutable_gpu_data();
      caffe_gpu_div<Dtype>(M_, wx_data, this->abs_x_.gpu_data(), cos_t_data);
      caffe_gpu_div<Dtype>(M_, cos_t_data, this->abs_w_.gpu_data(), cos_t_data);

      // cos(mt)
      cal_cos_mt<Dtype><<<CAFFE_GET_BLOCKS(M_), CAFFE_CUDA_NUM_THREADS>>>(
        M_, this->margin,
        this->C_M_N_.gpu_data(),
        this->cos_t_.gpu_data(),
        this->cos_mt_.mutable_gpu_data()
        );

      // k
      int *k_cpu_data = this->k_.mutable_cpu_data();
      const Dtype *cos_t_cpu_data = this->cos_t_.cpu_data();
      for (int m = 0; m < M_; ++m) {
        for (int _k = 0; _k < this->cos_theta_bound_.count(); ++_k) {
          if (this->cos_theta_bound_.cpu_data()[_k] < cos_t_cpu_data[m]) {
            k_cpu_data[m] = _k - 1;
            break;
          }
        }
      }

      // y
      LMForward<Dtype><<<CAFFE_GET_BLOCKS(M_), CAFFE_CUDA_NUM_THREADS>>>(
        M_, N_, this->lambda,
        label_data, this->cos_mt_.gpu_data(), this->k_.gpu_data(),
        this->abs_w_.gpu_data(), this->abs_x_.gpu_data(), top[0]->mutable_gpu_data());
    }

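And with that, the forward pass of large margin softmax loss is implemented without much trouble. In the next post we will cover the most complex part, the backward pass.

If you found this post helpful, please buy Miao a cup of tea~~O(∩_∩)O~~ Once again, $\LaTeX$ is wonderful.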


Please credit the source when reposting~
