Prioritized Experience Replay
Schaul, Quan, Antonoglou, Silver, 2016
This summary is reposted from: http://pemami4911.github.io/paper-summaries/2016/01/26/prioritizing-experience-replay.html
Summary
Uniform sampling from replay memory is not an efficient way to learn. Rather, by using a clever prioritization scheme to label the experiences in replay memory, learning can be carried out much faster and more effectively. However, this non-uniform sampling introduces bias, so weighted importance sampling must be employed to correct for it. Experiments with the Arcade Learning Environment show that prioritized sampling combined with Double DQN significantly outperforms the previous state-of-the-art Atari results.
Evidence
- Implemented Double DQN, with the main changes being the addition of prioritized experience replay sampling and importance-sampling weights
- Tested on the Arcade Learning Environment
Strengths
- Lots of insight into the implications of this research and plenty of discussion of possible extensions
Notes
- The magnitude of the TD-error indicates how unexpected a certain transition was
- The TD-error can be a poor estimate of how much an agent can learn from a transition when rewards are noisy
- Problems with greedily selecting experiences:
  - High-error transitions are replayed too frequently
  - Low-error transitions are almost entirely ignored
  - It is expensive to update the entire replay memory, so errors are only updated for the transitions that are replayed
  - The resulting lack of diversity leads to over-fitting
- A stochastic sampling method is introduced that interpolates between greedy prioritization and uniform random sampling (the current standard)
- Two variants of $P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$ were studied, where $P(i)$ is the probability of sampling transition $i$, $p_i > 0$ is the priority of transition $i$, and the exponent $\alpha$ determines how much prioritization is used, with $\alpha = 0$ recovering the uniform case (a small sketch of both variants follows after these notes)
  - Variant 1: proportional prioritization, where $p_i = |\delta_i| + \epsilon$ and $\epsilon$ is a small positive constant that prevents the edge case of transitions never being revisited once their error is zero; $\delta_i$ is the TD-error
  - Variant 2: rank-based prioritization, with $p_i = \frac{1}{\text{rank}(i)}$, where $\text{rank}(i)$ is the rank of transition $i$ when the replay memory is sorted according to $|\delta_i|$
- Key insight: estimating the expected value of the total discounted reward with stochastic updates requires that the updates correspond to the same distribution as the expectation. Prioritized replay introduces a bias that changes this distribution in an uncontrolled way. This can be corrected with importance-sampling (IS) weights $w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta$, which fully compensate for the non-uniform probabilities $P(i)$ when $\beta = 1$. These weights are folded into the Q-learning update by using $w_i \delta_i$ in place of $\delta_i$, normalized by $\frac{1}{\max_i w_i}$ (see the importance-sampling sketch after these notes)
- The IS correction is annealed from $\beta_0$ to 1, which means its effect is felt more strongly toward the end of training; this is because the unbiased nature of the updates in RL is most important near convergence
- IS also reduces the gradient magnitudes, which is good for optimization; it allows the algorithm to follow the curvature of highly non-linear optimization landscapes because the Taylor expansion (gradient descent) is constantly re-approximated
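
The two priority variants above are given only as formulas, so here is a minimal NumPy sketch of how the sampling probabilities $P(i)$ could be computed from stored TD-errors. The function and variable names (`proportional_priorities`, `rank_based_priorities`, `sampling_probabilities`, `td_errors`) and the value of `alpha` are illustrative choices, not taken from the paper's reference implementation.

```python
import numpy as np

def proportional_priorities(td_errors, eps=1e-6):
    # Variant 1: p_i = |delta_i| + eps; eps keeps zero-error transitions revisitable.
    return np.abs(td_errors) + eps

def rank_based_priorities(td_errors):
    # Variant 2: p_i = 1 / rank(i), where rank 1 is the transition with the largest |delta_i|.
    order = np.argsort(-np.abs(td_errors))           # indices sorted by descending |delta|
    ranks = np.empty(len(td_errors), dtype=np.int64)
    ranks[order] = np.arange(1, len(td_errors) + 1)  # rank 1 = largest error
    return 1.0 / ranks

def sampling_probabilities(priorities, alpha=0.6):
    # P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform sampling.
    scaled = priorities ** alpha
    return scaled / scaled.sum()

# Usage: sample a minibatch of transition indices from the current priorities.
td_errors = np.array([0.5, 2.0, 0.1, 0.0, 1.3])
probs = sampling_probabilities(proportional_priorities(td_errors), alpha=0.6)
batch_idx = np.random.choice(len(td_errors), size=3, p=probs)
```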
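
Similarly, a minimal sketch of the importance-sampling correction described in the notes, assuming `probs` holds the probabilities $P(i)$ over a buffer of size $N$ and `batch_idx` the sampled indices (both from the sketch above). Normalizing by the largest weight in the minibatch and the default $\beta_0 = 0.4$ are illustrative implementation choices here, not details stated in this summary.

```python
import numpy as np

def is_weights(probs, batch_idx, beta):
    # w_i = (1/N * 1/P(i))^beta, divided by the largest sampled weight so updates are only scaled down.
    N = len(probs)
    weights = (N * probs[batch_idx]) ** (-beta)
    return weights / weights.max()

def anneal_beta(step, total_steps, beta0=0.4):
    # Linearly anneal beta from beta0 up to 1 so the full correction applies near convergence.
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)

# The weighted TD-errors w_i * delta_i then replace delta_i in the Q-learning update,
# e.g. loss = (weights * td_errors ** 2).mean() for a squared-error objective.
```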