











The green bars are the reward function; the blue curve is the probability of the different trajectories.

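If every green bar is raised by the same constant (giving the yellow bars), the sampled policy gradient changes, even though the relative ordering of the trajectories stays the same!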
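A minimal sketch of why shifting the green bars to the yellow bars changes the result. It assumes a one-parameter Gaussian policy N(theta, 1), so that grad log p(x) = x - theta, and treats each sample x as a whole "trajectory". The sample values, rewards, and the shift of 10 are made-up numbers for illustration, not taken from the lecture.

```python
def policy_gradient_estimate(samples, rewards, theta):
    """Sample-mean policy gradient estimate for a Gaussian policy N(theta, 1).

    For this policy, grad_theta log p(x) = (x - theta), so the estimate is
    the average of reward * (x - theta) over the sampled trajectories.
    """
    n = len(samples)
    return sum(r * (x - theta) for x, r in zip(samples, rewards)) / n


theta = 0.0
samples = [-1.0, 0.5, 2.0]   # hypothetical sampled trajectories
green = [-2.0, 1.0, 1.0]     # hypothetical rewards (the "green bars")
shift = 10.0                 # constant added to every reward
yellow = [r + shift for r in green]  # the "yellow bars"

g_green = policy_gradient_estimate(samples, green, theta)
g_yellow = policy_gradient_estimate(samples, yellow, theta)

# The two estimates differ by shift * mean(x - theta), which is nonzero
# for a finite batch of samples even though its expectation is zero.
print(g_green, g_yellow)
```

With infinitely many samples the extra term averages to zero, so the expected gradient is unchanged; with a finite batch it does not, which is one way to see why the estimate is sensitive to a constant offset in the rewards.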

























CS294-112 Deep Reinforcement Learning, Fall semester (Berkeley), No. 4: Policy gradients introduction
Original post: https://www.cnblogs.com/ecoflex/p/9085805.html