Prioritized Experience Replay
Schaul, Quan, Antonoglou, Silver, 2016
This summary is reposted from: http://pemami4911.github.io/paper-summaries/2016/01/26/prioritizing-experience-replay.html
Summary
Uniform sampling from replay memory is not an efficient way to learn. Rather, by using a clever prioritization scheme to label the experiences in replay memory, learning can be carried out much faster and more effectively. However, this non-uniform sampling introduces bias, so weighted importance sampling must be employed to correct for it. Experiments with the Arcade Learning Environment show that prioritized sampling combined with Double DQN significantly outperforms the previous state-of-the-art Atari results.
Evidence
- Implemented Double DQN, with the main changes being the addition of prioritized experience replay sampling and importance-sampling weights
- Tested on the Arcade Learning Environment
Strengths
- Lots of insight into the implications of this research and plenty of discussion of possible extensions
Notes
- The magnitude of the TD-error indicates how unexpected a certain transition was
- The TD-error can be a poor estimate of how much an agent can learn from a transition when rewards are noisy
- Problems with greedily selecting experiences:
  - High-error transitions are replayed too frequently
  - Low-error transitions are almost entirely ignored
  - It is expensive to update the entire replay memory, so errors are only updated for the transitions that are replayed
  - The resulting lack of diversity leads to over-fitting
- A stochastic sampling method is introduced that interpolates between greedy prioritization and uniform random sampling (the current standard)
- Two variants of $P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$ were studied, where $P(i)$ is the probability of sampling transition $i$, $p_i > 0$ is the priority of transition $i$, and the exponent $\alpha$ determines how much prioritization is used, with $\alpha = 0$ recovering the uniform case (a small sketch of both variants follows after these notes)
  - Variant 1: proportional prioritization, where $p_i = |\delta_i| + \epsilon$ and $\epsilon$ is a small positive constant that prevents the edge case of transitions never being revisited once their error is zero; $\delta_i$ is the TD-error
  - Variant 2: rank-based prioritization, with $p_i = \frac{1}{\text{rank}(i)}$, where $\text{rank}(i)$ is the rank of transition $i$ when the replay memory is sorted according to $|\delta_i|$
- Key insight: estimating the expected value of the total discounted reward with stochastic updates requires that the updates correspond to the same distribution as the expectation. Prioritized replay introduces a bias that changes this distribution in an uncontrolled way. This can be corrected with importance-sampling (IS) weights $w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta$, which fully compensate for the non-uniform probabilities $P(i)$ when $\beta = 1$. These weights are folded into the Q-learning update by using $w_i \delta_i$ in place of $\delta_i$, normalized by $\frac{1}{\max_i w_i}$ (see the importance-sampling sketch after these notes)
- The IS correction is annealed from $\beta_0$ to 1, which means its effect is felt more strongly toward the end of training; this is because the unbiased nature of the updates in RL is most important near convergence
- IS also reduces the gradient magnitudes, which is good for optimization; it allows the algorithm to follow the curvature of highly non-linear optimization landscapes because the Taylor expansion (gradient descent) is constantly re-approximated
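
The two priority variants above are given only as formulas, so here is a minimal NumPy sketch of how the sampling probabilities $P(i)$ could be computed from stored TD-errors. The function and variable names (`proportional_priorities`, `rank_based_priorities`, `sampling_probabilities`, `td_errors`) and the value of `alpha` are illustrative choices, not taken from the paper's reference implementation.

```python
import numpy as np

def proportional_priorities(td_errors, eps=1e-6):
    # Variant 1: p_i = |delta_i| + eps; eps keeps zero-error transitions revisitable.
    return np.abs(td_errors) + eps

def rank_based_priorities(td_errors):
    # Variant 2: p_i = 1 / rank(i), where rank 1 is the transition with the largest |delta_i|.
    order = np.argsort(-np.abs(td_errors))           # indices sorted by descending |delta|
    ranks = np.empty(len(td_errors), dtype=np.int64)
    ranks[order] = np.arange(1, len(td_errors) + 1)  # rank 1 = largest error
    return 1.0 / ranks

def sampling_probabilities(priorities, alpha=0.6):
    # P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform sampling.
    scaled = priorities ** alpha
    return scaled / scaled.sum()

# Usage: sample a minibatch of transition indices from the current priorities.
td_errors = np.array([0.5, 2.0, 0.1, 0.0, 1.3])
probs = sampling_probabilities(proportional_priorities(td_errors), alpha=0.6)
batch_idx = np.random.choice(len(td_errors), size=3, p=probs)
```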
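
Similarly, a minimal sketch of the importance-sampling correction described in the notes, assuming `probs` holds the probabilities $P(i)$ over a buffer of size $N$ and `batch_idx` the sampled indices (both from the sketch above). Normalizing by the largest weight in the minibatch and the default $\beta_0 = 0.4$ are illustrative implementation choices here, not details stated in this summary.

```python
import numpy as np

def is_weights(probs, batch_idx, beta):
    # w_i = (1/N * 1/P(i))^beta, divided by the largest sampled weight so updates are only scaled down.
    N = len(probs)
    weights = (N * probs[batch_idx]) ** (-beta)
    return weights / weights.max()

def anneal_beta(step, total_steps, beta0=0.4):
    # Linearly anneal beta from beta0 up to 1 so the full correction applies near convergence.
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)

# The weighted TD-errors w_i * delta_i then replace delta_i in the Q-learning update,
# e.g. loss = (weights * td_errors ** 2).mean() for a squared-error objective.
```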