BPTT (Backpropagation Through Time): exploding and vanishing gradients
\[\frac{\partial L}{\partial W_o} = \sum_{t=1}^T\frac{\partial L_t}{\partial \hat y_t}\frac{\partial \hat y_t}{\partial W_o}
\]
\[\frac{\partial L}{\partial W_h}= \sum_{t=1}^T\frac{\partial L_t}{\partial \hat y_t} \frac{\partial \hat y_t}{\partial h_t} \frac{\partial h_t}{\partial W_h}
\]
Since \(h_t\) depends on \(h_{t-1}\), and \(h_{t-1}\) in turn depends on \(W_h\), we differentiate term by term, tracing back through the time steps (here \(\frac{\partial h_i}{\partial W_h}\) denotes the immediate partial derivative at step \(i\), holding \(h_{i-1}\) fixed):
\[\frac{\partial h_t}{\partial W_h}=\sum_{i=1}^{t}\frac{\partial h_t}{\partial h_i}\frac{\partial h_i}{\partial W_h}
\]
\[\frac{\partial h_t}{\partial h_i} = \prod_{j=i}^{t-1} \frac{\partial h_{j+1}}{\partial h_j}
\]
Therefore,
\[\frac{\partial L}{\partial W_h} = \sum_{t=1}^T \frac{\partial L_t}{\partial \hat y_t}\frac{\partial \hat y_t}{\partial h_t}\sum_{i=1}^t\left(\prod_{j=i+1}^{t} \frac{\partial h_{j}}{\partial h_{j-1}}\right)\frac{\partial h_i}{\partial W_h}
\]
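The summed-product formula above can be checked numerically. Below is a minimal sketch for a scalar RNN \(h_t = \tanh(W_h h_{t-1} + W_x x_t)\), \(\hat y_t = W_o h_t\), with squared-error loss; the weight values and random data are arbitrary illustrations. The backward recursion accumulates exactly the double sum, and the result is compared against a central finite difference.

```python
import numpy as np

# Scalar RNN: h_t = tanh(w_h*h_{t-1} + w_x*x_t), yhat_t = w_o*h_t,
# L = sum_t 0.5*(yhat_t - y_t)^2.  In the backward loop, dh_next carries
# the running product prod_j dh_{j+1}/dh_j = f' * w_h, and da * h[t] is
# the immediate term at step t.  (Weights/data are arbitrary choices.)

def bptt_grad_wh(w_h, w_x, w_o, x, y):
    T = len(x)
    h = np.zeros(T + 1)               # h[0] = 0 is the initial state
    for t in range(T):
        h[t + 1] = np.tanh(w_h * h[t] + w_x * x[t])
    grad, dh_next = 0.0, 0.0
    for t in range(T - 1, -1, -1):
        dh = (w_o * h[t + 1] - y[t]) * w_o + dh_next  # dL/dh_t: direct + from future
        da = dh * (1.0 - h[t + 1] ** 2)               # through tanh: f'(a_t)
        grad += da * h[t]                             # immediate part: f' * h_{t-1}
        dh_next = da * w_h                            # factor dh_{t+1}/dh_t = f' * w_h
    return grad

def loss(w_h, w_x, w_o, x, y):
    h, L = 0.0, 0.0
    for t in range(len(x)):
        h = np.tanh(w_h * h + w_x * x[t])
        L += 0.5 * (w_o * h - y[t]) ** 2
    return L

rng = np.random.default_rng(1)
x, y = rng.normal(size=8), rng.normal(size=8)
g = bptt_grad_wh(0.7, 0.5, 1.2, x, y)
eps = 1e-6
fd = (loss(0.7 + eps, 0.5, 1.2, x, y) - loss(0.7 - eps, 0.5, 1.2, x, y)) / (2 * eps)
print(g, fd)   # the two estimates agree closely
```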
Since \(\frac{\partial h_t}{\partial h_{t-1}} = f' \cdot W_h\), and when \(f\) is tanh its derivative lies in \((0, 1]\), the product of many such factors dominates when the gap between \(j\) and \(t\) is large: if \(|W_h| > 1\) (for a weight matrix, if its largest singular value exceeds 1), the gradient can explode; if \(|W_h| < 1\), the gradient vanishes. (Note: the formula shows that the gradient of the total loss with respect to the parameters still exists; it is simply dominated by the terms from nearby time steps, so the network cannot learn long-range dependency information. "Vanishing gradients" in RNNs thus really refers to this inability to learn long-term dependencies.)
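The vanishing behavior is easy to see numerically. A minimal sketch with a scalar \(W_h\) (values of \(W_h\) and \(T\) are arbitrary): with tanh, every factor \(f' \cdot W_h\) has magnitude at most \(|W_h|\), so with \(|W_h| < 1\) the product over a long gap collapses toward zero; with a linear activation (\(f' = 1\)) and \(|W_h| > 1\), the product is \(W_h^T\) and blows up.

```python
import numpy as np

# Per-step factor dh_t/dh_{t-1} = f'(a_t) * w_h for a scalar RNN
# h_t = tanh(w_h*h_{t-1} + x_t).  (w_h, T, and the inputs are arbitrary.)
def tanh_factor_product(w_h, T=50, seed=0):
    rng = np.random.default_rng(seed)
    h, prod = 0.0, 1.0
    for x in rng.normal(size=T):
        h = np.tanh(w_h * h + x)
        prod *= (1.0 - h * h) * w_h   # |factor| <= |w_h|, since 0 < f' <= 1
    return prod

vanish = tanh_factor_product(0.5)     # |w_h| < 1: product shrinks toward 0
explode = 1.5 ** 50                   # linear activation, |w_h| > 1: w_h**T
print(f"tanh,   w_h=0.5: {abs(vanish):.2e}")   # astronomically small
print(f"linear, w_h=1.5: {explode:.2e}")       # astronomically large
```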