(Repost) Attention

2024-08-11 09:57:07, 221 views

Reposted from: http://www.cosmosshadow.com/ml/%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C/2016/03/08/Attention.html

Attention

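Index

References
Attention
Recurrent Models of Visual Attention
- Model
- Training
- Results
- Torch Code Structure
(TODO) Attention-Based Image Generation
Attention-Based Image Caption Generation
- Model
- Encoding
- Decoding
- Stochastic “Hard” Attention
- Deterministic “Soft” Attention
Attention-Based Text Recognition
- Model
- Recursive / Recurrent CNN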

References

Survey on Advanced Attention-based Models
Recurrent Models of Visual Attention (2014.06.24)
Recurrent Model of Visual Attention (blog)
https://github.com/Element-Research/rnn/blob/master/scripts/evaluate-rva.lua
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015.02.10)
Soft Attention Mechanism for Neural Machine Translation
DRAW: A Recurrent Neural Network For Image Generation (2015.05.20)
Teaching Machines to Read and Comprehend (2015.06.04)
Learning Wake-Sleep Recurrent Attention Models (2015.09.22)
Action Recognition using Visual Attention (2015.10.12)
Recurrent Convolutional Neural Network for Object Recognition (2015)
Understanding Deep Architectures using a Recursive Convolutional Network (2014.02.19)
Multiple Object Recognition with Visual Attention (2015.04.23)
Recursive Recurrent Nets with Attention Modeling for OCR in the Wild (2016.03.09)
https://github.com/Element-Research/rnn/blob/master/examples/recurrent-visual-attention.lua (code)

Attention

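Before attention was introduced, image recognition and machine translation models took the entire image or sentence as a single input and produced the output from it.
Images were also commonly rescaled to a fixed size, which loses information.
A person looking at a scene, by contrast, moves their gaze across the regions of interest, sometimes studying a detail closely, and only then reaches a conclusion.
Attention adds to the network a mechanism that moves, scales, and rotates a region of focus, so that the input becomes a sequence of successive partial views.
The movement, scaling, and rotation of the region of focus are learned with reinforcement learning.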

Recurrent Models of Visual Attention

See Recurrent Models of Visual Attention (2014.06.24).

Model

The model is called the Recurrent Attention Model, or RAM for short.

[Figure: architecture of the Recurrent Attention Model]

A. Glimpse Sensor: at …

At each iteration the model can also output scale information and a stop flag.

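Training

With the network's parameters denoted θ, the objective is the expected total reward

J (θ) = E_{p (s_{1 : T}; θ)} [\sum_{t = 1}^{T} r_{t}] = E_{p (s_{1 : T}; θ)} [R]

The trajectory distribution p (s_{1 : T}; θ) depends on θ through the policy factors π (u_{t} ∣ s_{1 : t}; θ).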

The goal of reinforcement learning is to increase J (θ). Its gradient is

\nabla_{θ} J = E_{p (s_{1 : T}; θ)} [\sum_{t = 1}^{T} \nabla_{θ} \log π (u_{t} ∣ s_{1 : t}; θ) R] \approx \frac{1}{M} \sum_{i = 1}^{M} \sum_{t = 1}^{T} \nabla_{θ} \log π (u_{t}^{i} ∣ s_{1 : t}^{i}; θ) R^{i}

where the superscript i indexes the M sampled episodes.
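The sampled gradient estimate can be checked on a toy problem. The sketch below is not the RAM training code: it assumes one-step episodes (T = 1), a softmax policy whose logits are the parameters θ, and a hypothetical fixed reward per action, so that the exact gradient is available for comparison.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(probs, action):
    # d/d theta_j of log softmax(theta)[action] = 1[j == action] - probs[j]
    return [(1.0 if j == action else 0.0) - p for j, p in enumerate(probs)]

def reinforce_gradient(theta, rewards, num_episodes, rng):
    """Average of grad log pi(u) * R over sampled episodes (score-function estimator)."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(num_episodes):
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = rewards[action]
        g = grad_log_pi(probs, action)
        for j in range(len(theta)):
            grad[j] += g[j] * reward / num_episodes
    return grad

def exact_gradient(theta, rewards):
    """Closed form for this toy case: pi_j * (R_j - E[R])."""
    probs = softmax(theta)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]

rng = random.Random(0)
theta = [0.0, 0.5, -0.5]      # hypothetical policy logits
rewards = [1.0, 0.0, 2.0]     # hypothetical reward for each action
estimate = reinforce_gradient(theta, rewards, 20000, rng)
exact = exact_gradient(theta, rewards)
```

With enough episodes, `estimate` approaches `exact`, which is what the approximation sign in the formula expresses.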

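During training, the estimator above is an unbiased estimate of the gradient, but it can have high variance, so the following estimator is used instead: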

\frac{1}{M} \sum_{i = 1}^{M} \sum_{t = 1}^{T} \nabla_{θ} \log π (u_{t}^{i} ∣ s_{1 : t}^{i}; θ) (R_{t}^{i} - b_{t})

where b_{t} is a baseline that is subtracted from the return R_{t}^{i} to reduce the variance.
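Why subtracting a baseline helps can be seen by computing the estimator's mean and variance exactly on a small example. This is a sketch under assumptions: a one-step softmax policy, hypothetical rewards, and a constant baseline set to the expected reward (the formula above allows a time-dependent b_{t}).

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_stats(theta, rewards, baseline):
    """Exact mean and total variance of grad log pi(u) * (R(u) - baseline) under the policy."""
    probs = softmax(theta)
    n = len(theta)
    samples = []
    for u in range(n):
        scale = rewards[u] - baseline
        samples.append([((1.0 if j == u else 0.0) - probs[j]) * scale for j in range(n)])
    mean = [sum(probs[u] * samples[u][j] for u in range(n)) for j in range(n)]
    variance = sum(
        probs[u] * sum((samples[u][j] - mean[j]) ** 2 for j in range(n))
        for u in range(n)
    )
    return mean, variance

theta = [0.0, 0.0, 0.0]          # uniform policy
rewards = [10.0, 11.0, 12.0]     # hypothetical rewards with a large common offset
expected_reward = sum(p * r for p, r in zip(softmax(theta), rewards))

mean_plain, var_plain = estimator_stats(theta, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(theta, rewards, baseline=expected_reward)
```

The two means agree, so the baseline leaves the estimate unbiased, while `var_base` is far smaller than `var_plain` because the common offset in the rewards no longer multiplies the score.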

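Results

[Figure: glimpse trajectories on cluttered MNIST digits]

The figure above, from the paper, shows how the glimpse moves while recognizing digits on the enlarged, cluttered MNIST dataset.
The solid green dot marks the start and the hollow green dot marks the end.
The RAM model moves the glimpse toward the regions of interest.
In the paper's experiments, its recognition accuracy is better than that of both fully connected networks and CNN-based networks.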

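Torch Code Structure

The training code from the blog post Recurrent Model of Visual Attention is structured as follows:

[Figure: structure of the Torch training code]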

(TODO) Attention-Based Image Generation

Auto-Encoding Variational Bayes (2014.05.01)
DRAW: A Recurrent Neural Network For Image Generation (2015.05.20)

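Attention-Based Image Caption Generation

See Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015.02.10).

[Figure: example images with generated captions]

As shown above, the model generates a caption that describes a given image.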

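Model

[Figure: model overview]

As the figure above shows, the model passes the image through a CNN to obtain feature maps.
An LSTM-based RNN then runs the attention model over these features and finally outputs the caption.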

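Encoding

The feature map is divided evenly into regions, written as

a = \{a_{1}, \dots, a_{L}\}, a_{i} \in \mathbb{R}^{D}

L is the number of regions.
For example, if the region size is …

The output caption is

y = \{y_{1}, \dots, y_{C}\}, y_{i} \in \mathbb{R}^{K}

K is the number of words in the vocabulary and C is the length of the caption.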
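The shapes involved in the encoding can be made concrete with a small sketch. The feature-map size, vocabulary, and caption below are hypothetical values chosen for illustration, and the one-hot word encoding is an assumption about how each y_{i} is represented.

```python
def flatten_feature_map(feature_map):
    """Turn an H x W x D nested list into L = H * W region vectors of length D."""
    return [list(cell) for row in feature_map for cell in row]

def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

# Hypothetical feature-map size: 2 x 3 spatial positions, 4 channels.
H, W, D = 2, 3, 4
feature_map = [
    [[float(r * W + c + d) for d in range(D)] for c in range(W)]
    for r in range(H)
]
a = flatten_feature_map(feature_map)      # L = 6 vectors, each in R^D

# Hypothetical vocabulary (K = 5) and caption (C = 3).
vocab = ["<start>", "a", "bird", "flying", "<end>"]
caption = ["a", "bird", "flying"]
y = [one_hot(vocab.index(word), len(vocab)) for word in caption]
```

Here `len(a)` plays the role of L, `len(a[0])` of D, `len(y)` of C, and `len(y[0])` of K.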

Decoding

The LSTM used by the model is shown in the figure below.

[Figure: LSTM cell]

Its computation is

\begin{pmatrix} i_{t} \\ f_{t} \\ o_{t} \\ g_{t} \end{pmatrix} = \begin{pmatrix} σ \\ σ \\ σ \\ \tanh \end{pmatrix} T_{D + m + n, n} \begin{pmatrix} E y_{t - 1} \\ h_{t - 1} \\ {\hat{z}}_{t} \end{pmatrix}

c_{t} = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ g_{t}

h_{t} = o_{t} ⊙ \tanh (c_{t})

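Here T_{D + m + n, n} is a learned affine transformation, E is the word embedding matrix, and the attention weights and the context vector {\hat{z}}_{t} are given by

e_{t i} = f_{att} (a_{i}, h_{t - 1})

α_{t i} = \frac{\exp (e_{t i})}{\sum_{k = 1}^{L} \exp (e_{t k})}

{\hat{z}}_{t} = ϕ (\{a_{i}\}, \{α_{t i}\})

where f_{att} is the attention model and ϕ is the function that combines the region vectors with their weights.

[Figure]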
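The score-then-softmax step can be sketched in a few lines. The form of f_att used here (a single tanh layer followed by a dot product) and all weight values are assumptions for illustration; the text above only says that f_att takes a region vector and the previous hidden state.

```python
import math

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def f_att(a_i, h_prev, W_a, W_h, v):
    # Assumed scoring function: v . tanh(W_a a_i + W_h h_prev)
    hidden = [math.tanh(p + q) for p, q in zip(matvec(W_a, a_i), matvec(W_h, h_prev))]
    return sum(w * x for w, x in zip(v, hidden))

def attention_weights(regions, h_prev, W_a, W_h, v):
    """Softmax over the scores e_ti, giving one weight alpha_ti per region."""
    scores = [f_att(a_i, h_prev, W_a, W_h, v) for a_i in regions]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sizes: L = 3 regions, D = 2, hidden size n = 2.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_prev = [0.5, -0.5]
W_a = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[0.5, 0.0], [0.0, 0.5]]
v = [1.0, 2.0]
alphas = attention_weights(regions, h_prev, W_a, W_h, v)
```

The weights are positive and sum to 1, so they can be read either as sampling probabilities or as mixing coefficients over the regions.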

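The initial memory state and hidden state of the LSTM are predicted by two separate multilayer perceptrons from the mean of all region feature vectors:

c_{0} = f_{\text{init}, c} (\frac{1}{L} \sum_{i}^{L} a_{i})

h_{0} = f_{\text{init}, h} (\frac{1}{L} \sum_{i}^{L} a_{i})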
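A minimal sketch of this initialization follows. It stands in a single tanh layer for each of the two perceptrons, which is an assumption made to keep the example short; the weights are hypothetical.

```python
import math

def mean_region(regions):
    """Average the L region vectors component by component."""
    count = len(regions)
    return [sum(a_i[d] for a_i in regions) / count for d in range(len(regions[0]))]

def tanh_layer(W, b, x):
    # One tanh layer standing in for f_init (an assumption, not the paper's exact MLP).
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi) for row, bi in zip(W, b)]

regions = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # L = 3, D = 2
a_bar = mean_region(regions)

# Two separate, hypothetical parameter sets: one for c_0, one for h_0 (n = 3).
W_c, b_c = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]], [0.0, 0.0, 0.0]
W_h, b_h = [[-0.1, 0.0], [0.0, -0.1], [0.05, 0.05]], [0.1, 0.1, 0.1]

c0 = tanh_layer(W_c, b_c, a_bar)
h0 = tanh_layer(W_h, b_h, a_bar)
```

Both states are computed from the same averaged feature vector but with different parameters, so they generally differ.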

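The final word probabilities are computed by a deep output layer:

p (y_{t} ∣ a, y_{t - 1}) \propto \exp (L_{o} (E y_{t - 1} + L_{h} h_{t} + L_{z} {\hat{z}}_{t}))

where L_{o}, L_{h}, L_{z}, and the embedding matrix E are learned parameters.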

Stochastic “Hard” Attention

p (s_{t, i} = 1 ∣ a) = α_{t, i}

{\hat{z}}_{t} = \sum_{i = 1}^{L} s_{t, i} a_{i}
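As a sketch of the two equations above, the following code samples a one-hot location s_{t} from given weights α and uses it to select a single region as the context vector. The weights and region vectors are hypothetical.

```python
import random

def sample_hard_attention(alphas, regions, rng):
    """Draw a one-hot s with P(s_i = 1) = alpha_i and return (s, z_hat)."""
    chosen = rng.choices(range(len(alphas)), weights=alphas)[0]
    s = [1 if k == chosen else 0 for k in range(len(alphas))]
    dim = len(regions[0])
    z_hat = [sum(s_k * a_k[d] for s_k, a_k in zip(s, regions)) for d in range(dim)]
    return s, z_hat

rng = random.Random(0)
alphas = [0.2, 0.5, 0.3]                          # hypothetical attention weights
regions = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # hypothetical region vectors

s, z_hat = sample_hard_attention(alphas, regions, rng)

# Empirical check: each region is picked with frequency close to its weight.
counts = [0, 0, 0]
for _ in range(10000):
    sample, _ = sample_hard_attention(alphas, regions, rng)
    counts[sample.index(1)] += 1
frequencies = [c / 10000 for c in counts]
```

Because s is one-hot, the sum reduces to picking exactly one a_{i}, which is why this variant is called "hard".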

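We define the objective L_{s}, a variational lower bound on the marginal log-likelihood \log p (y ∣ a):

\begin{aligned} L_{s} & = \sum_{s} p (s ∣ a) \log p (y ∣ s, a) \\ & \leq \log \sum_{s} p (s ∣ a) p (y ∣ s, a) \\ & = \log p (y ∣ a) \end{aligned}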

Differentiating with respect to the parameters W gives

\frac{\partial L_{s}}{\partial W} = \sum_{s} p (s ∣ a) [\frac{\partial \log p (y ∣ s, a)}{\partial W} + \log p (y ∣ s, a) \frac{\partial \log p (s ∣ a)}{\partial W}]

This gradient can be approximated by Monte Carlo sampling:

{\tilde{s}}_{t} \sim \text{Multinoulli}_{L} (\{α_{i}\})

\frac{\partial L_{s}}{\partial W} \approx \frac{1}{N} \sum_{n = 1}^{N} [\frac{\partial \log p (y ∣ {\tilde{s}}^{n}, a)}{\partial W} + \log p (y ∣ {\tilde{s}}^{n}, a) \frac{\partial \log p ({\tilde{s}}^{n} ∣ a)}{\partial W}]

To reduce the variance of the estimator, a moving-average baseline can be used. At the k-th mini-batch,

b_{k} = 0.9 \times b_{k - 1} + 0.1 \times \log p (y ∣ {\tilde{s}}_{k}, a)
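The baseline update is an exponential moving average of the log-likelihoods seen so far. A minimal sketch, with a hypothetical constant log-likelihood so that the result can be checked against the closed form:

```python
def update_baseline(b_prev, log_likelihood, decay=0.9):
    """b_k = decay * b_{k-1} + (1 - decay) * log p(y | s_k, a)."""
    return decay * b_prev + (1.0 - decay) * log_likelihood

baseline = 0.0
history = []
for k in range(50):
    log_likelihood = -2.0          # hypothetical value for every mini-batch
    baseline = update_baseline(baseline, log_likelihood)
    history.append(baseline)

# For a constant input x, the average after k steps is x * (1 - decay ** k).
closed_form = -2.0 * (1.0 - 0.9 ** 50)
```

The baseline starts at 0 and moves toward the recent log-likelihood level, so it tracks the typical reward without needing a separate model.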

To reduce the variance further, the entropy H [s] of the multinoulli distribution is added:

\frac{\partial L_{s}}{\partial W} \approx \frac{1}{N} \sum_{n = 1}^{N} [\frac{\partial \log p (y ∣ {\tilde{s}}^{n}, a)}{\partial W} + λ_{r} (\log p (y ∣ {\tilde{s}}^{n}, a) - b) \frac{\partial \log p ({\tilde{s}}^{n} ∣ a)}{\partial W} + λ_{e} \frac{\partial H [{\tilde{s}}^{n}]}{\partial W}]

Deterministic “Soft” Attention

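The stochastic model above has to sample the attention location s_{t} at every step. Alternatively, we can take the expectation of the context vector {\hat{z}}_{t} directly: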

E_{p (s_{t} ∣ a)} [{\hat{z}}_{t}] = \sum_{i = 1}^{L} α_{t, i} a_{i}

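This is the deterministic “soft” attention model. Because the whole model is smooth and differentiable, it can be trained end to end with standard backpropagation.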
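The relation between the soft context vector and hard sampling can be illustrated numerically. In this sketch, with hypothetical weights and region vectors, the soft context is the α-weighted sum of the regions, and the average of many hard samples converges to the same vector.

```python
import random

def soft_context(alphas, regions):
    """Expected context vector: sum_i alpha_i * a_i."""
    dim = len(regions[0])
    return [sum(alpha * a_i[d] for alpha, a_i in zip(alphas, regions)) for d in range(dim)]

def hard_context(alphas, regions, rng):
    """One stochastic sample: pick a single region with probability alpha_i."""
    return regions[rng.choices(range(len(alphas)), weights=alphas)[0]]

alphas = [0.2, 0.5, 0.3]                          # hypothetical attention weights
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # hypothetical region vectors

z_soft = soft_context(alphas, regions)

rng = random.Random(0)
num_samples = 20000
z_mean = [0.0, 0.0]
for _ in range(num_samples):
    z = hard_context(alphas, regions, rng)
    z_mean = [m + value / num_samples for m, value in zip(z_mean, z)]
```

`z_soft` needs no sampling and is a differentiable function of the weights, which is the property the soft variant relies on.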

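When computing the weights α, the softmax already guarantees \sum_{i} α_{t, i} = 1 at every step. In addition, the model is encouraged to satisfy

\sum_{t} α_{t, i} \approx 1

Adding this regularizer makes the generated captions richer, and the results improve.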

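In addition, soft attention uses a gating scalar β_{t}, predicted from the previous hidden state, to scale the context vector:

E_{p (s_{t} ∣ a)} [{\hat{z}}_{t}] = β_{t} \sum_{i = 1}^{L} α_{t, i} a_{i}

β_{t} = σ (f_{β} (h_{t - 1}))

Finally, the end-to-end objective can be written as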

L_{d} = - \log (P (y ∣ x)) + λ \sum_{i}^{L} (1 - \sum_{t}^{C} α_{t, i})^{2}
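The penalty term in this objective can be computed directly from the matrix of attention weights. The sketch below assumes the weights are stored as `alpha[t][i]` (C rows, L columns) and uses hypothetical values for the weights, λ, and the negative log-likelihood.

```python
def doubly_stochastic_penalty(alpha, lam):
    """lam * sum_i (1 - sum_t alpha[t][i]) ** 2 for weights alpha[t][i]."""
    num_regions = len(alpha[0])
    return lam * sum(
        (1.0 - sum(row[i] for row in alpha)) ** 2 for i in range(num_regions)
    )

def total_loss(neg_log_likelihood, alpha, lam):
    return neg_log_likelihood + doubly_stochastic_penalty(alpha, lam)

# C = 2 time steps, L = 2 regions (hypothetical values).
balanced = [[0.5, 0.5], [0.5, 0.5]]   # every region receives total weight 1
collapsed = [[1.0, 0.0], [1.0, 0.0]]  # all attention stays on the first region

penalty_balanced = doubly_stochastic_penalty(balanced, lam=1.0)
penalty_collapsed = doubly_stochastic_penalty(collapsed, lam=1.0)
loss = total_loss(neg_log_likelihood=3.0, alpha=collapsed, lam=0.5)
```

Attention that never leaves one region is penalized, while attention spread over all regions across the caption is not, which pushes the model to look at every part of the image at some point.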

Attention-Based Text Recognition

See Recursive Recurrent Nets with Attention Modeling for OCR in the Wild (2016.03.09).

Model

[Figure: model overview]

Recursive / Recurrent CNN

[Figure: recursive and recurrent CNN structures]

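A CNN shares weights within each convolutional layer.
A recursive CNN stacks several layers inside a convolutional layer, and all of them share the same kernel:

h_{i, j, k} (t) = \begin{cases} σ ((w_{k}^{hh})^{T} x_{i, j} + b_{k}) & \text{at } t = 0 \\ σ ((w_{k}^{hh})^{T} h_{i, j} (t - 1) + b_{k}) & \text{at } t > 0 \end{cases}

A recurrent CNN also stacks several layers inside a convolutional layer, but every layer additionally receives the original input; the kernels may or may not be shared: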

h_{i, j, k} (t) = σ ((w_{k}^{r})^{T} h_{i, j} (t - 1) + (w_{k}^{f})^{T} x_{i, j} + b_{k})

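Both recursive and recurrent CNNs enlarge the receptive field while reducing the number of parameters.
The referenced paper reports that the recursive CNN performs better than the recurrent CNN.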
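The two update rules can be compared on a one-dimensional, single-channel toy signal. This sketch makes several assumptions that are not in the text: σ is taken to be ReLU, convolution uses zero padding with an odd kernel length, and the recurrent state before t = 0 is zero.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def conv1d_same(kernel, signal):
    """1-D cross-correlation with zero padding; assumes an odd kernel length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]

def recursive_conv(x, w_hh, b, steps):
    """The same kernel w_hh is applied to x at t = 0, then to its own output."""
    h = relu([v + b for v in conv1d_same(w_hh, x)])
    for _ in range(steps):
        h = relu([v + b for v in conv1d_same(w_hh, h)])
    return h

def recurrent_conv(x, w_r, w_f, b, steps):
    """Each step combines the previous state (w_r) with the original input (w_f)."""
    feed_forward = conv1d_same(w_f, x)
    h = [0.0] * len(x)                      # assumed zero state before t = 0
    for _ in range(steps + 1):
        recurrent = conv1d_same(w_r, h)
        h = relu([r + f + b for r, f in zip(recurrent, feed_forward)])
    return h

x = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]     # a single impulse
kernel = [1.0, 1.0, 1.0]                    # three shared weights

h_rec0 = recursive_conv(x, kernel, 0.0, steps=0)
h_rec1 = recursive_conv(x, kernel, 0.0, steps=1)
h_rcn1 = recurrent_conv(x, kernel, kernel, 0.0, steps=1)
```

With the same three weights, the impulse spreads from 3 positions in `h_rec0` to 5 positions in `h_rec1`, which illustrates how reusing a kernel enlarges the receptive field without adding parameters.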
