Motivation: since the Decoder's only input is the Encoder's final state \(h_m\), information from earlier parts of the sequence may be lost, and this problem gets worse as the sequence grows longer.
How to compute the attention weights:
Used in original paper:
\(\tilde{\alpha}_i = v^T \cdot \tanh(W \cdot \mathrm{concat}(h_i, s_0))\), for i = 1 to m; then normalize the weights with softmax. \(v\) and \(W\) are trainable parameters.
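A minimal numpy sketch of this additive score, assuming hidden size d and m encoder states; the names `v`, `W`, `h`, `s0` correspond to the symbols in the formula above:

```python
import numpy as np

def additive_attention_weights(h, s0, W, v):
    """h: (m, d) encoder states, s0: (d,) decoder state,
    W: (d, 2d) and v: (d,) trainable parameters."""
    m = h.shape[0]
    # score_i = v^T tanh(W @ concat(h_i, s0)), for i = 1..m
    scores = np.array([v @ np.tanh(W @ np.concatenate([h[i], s0]))
                       for i in range(m)])
    # normalize the scores with softmax (shifted for numerical stability)
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(0)
m, d = 5, 4
alpha = additive_attention_weights(rng.normal(size=(m, d)),
                                   rng.normal(size=d),
                                   rng.normal(size=(d, 2 * d)),
                                   rng.normal(size=d))
# alpha is a length-m vector of non-negative weights summing to 1
```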
A more popular one, used in Transformer:
Linear maps: \(k_i = W_K \cdot h_i\), for i = 1 to m; \(q_0 = W_Q \cdot s_0\). \(W_K, W_Q\) are trainable parameters.
Inner Product:
\(\tilde{\alpha_i} = k_i^T q_0\), for i = 1 to m.
Normalization with softmax.
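The three steps above can be sketched in numpy as follows (a sketch, with `W_K` and `W_Q` standing for the two trainable projection matrices):

```python
import numpy as np

def dot_product_attention_weights(h, s0, W_K, W_Q):
    """h: (m, d) encoder states, s0: (d,) decoder state,
    W_K, W_Q: (d_k, d) trainable projection matrices."""
    k = h @ W_K.T        # linear map: k_i = W_K @ h_i, shape (m, d_k)
    q0 = W_Q @ s0        # linear map: q_0 = W_Q @ s_0, shape (d_k,)
    scores = k @ q0      # inner product: score_i = k_i^T q_0, for i = 1..m
    e = np.exp(scores - scores.max())
    return e / e.sum()   # normalization with softmax

rng = np.random.default_rng(1)
m, d, d_k = 6, 4, 3
alpha = dot_product_attention_weights(rng.normal(size=(m, d)),
                                      rng.normal(size=d),
                                      rng.normal(size=(d_k, d)),
                                      rng.normal(size=(d_k, d)))
# alpha is a length-m vector of non-negative weights summing to 1
```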
Time complexity: (number of Encoder states) × (number of Decoder states), since one weight is computed for every Encoder-Decoder state pair.
Reference: Shusen Wang's lecture (highly recommended): https://www.youtube.com/watch?v=aButdUV0dxI&list=PLvOO0btloRntpSWSxFbwPIjIum3Ub4GSC
Original post: https://www.cnblogs.com/ZhengPeng7/p/14409271.html