CBOW Formula Deduction

Posted: 2016-05-09 20:21:50

In our setting, the vocabulary size is $V$, and the hidden layer size is $N$.

The input is a one-hot vector: for a given input context word, exactly one of the $V$ units $\{x_1,\cdots,x_V\}$ is 1, and all other units are 0.

The weights between the input layer and the hidden layer can be represented by a $V \times N$ matrix $W$. Each row of $W$ is the $N$-dimensional vector representation $v_w$ of the associated input-layer word.

Given a context consisting of a single word, assume $x_k=1$ and $x_{k'}=0$ for $k'\neq k$. Then

\[h = x^T W = W_{(k,\cdot)} := v_{w_I}\]

which simply copies the $k$-th row of $W$ to $h$. Here $v_{w_I}$ is the vector representation of the input word $w_I$. This implies that the link (activation) function of the hidden layer units is simply linear (i.e., it passes the weighted sum of its inputs directly to the next layer).
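The row-copy behaviour can be checked numerically. The following is a minimal sketch in plain Python, assuming a toy vocabulary of $V=4$ words and $N=3$ hidden units; the matrix values and the helper names `one_hot` and `hidden_from_one_hot` are made up for illustration, and indices are 0-based rather than 1-based as in the text.

```python
# Toy sizes (assumed for illustration): V = 4 words, N = 3 hidden units.
V, N = 4, 3

# W is V x N; row k is the input vector v_w of word k. Values are arbitrary.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(k, size):
    """Return a one-hot list with a 1.0 at index k and 0.0 elsewhere."""
    return [1.0 if i == k else 0.0 for i in range(size)]


def hidden_from_one_hot(x, W):
    """Compute h = x^T W explicitly as a vector-matrix product."""
    n_cols = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(n_cols)]


k = 2                      # index of the input word w_I (0-based here)
x = one_hot(k, V)
h = hidden_from_one_hot(x, W)

# The full product and a plain row lookup give the same hidden vector.
print(h)
print(W[k])
```

In practice the multiplication is usually skipped and the row is read directly, since all other terms of the sum are zero.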

From the hidden layer to the output layer, there is a different weight matrix $W'=\{w'_{ij}\}$, which is an $N \times V$ matrix. Using these weights, we can compute a score $u_j$ for each word in the vocabulary,

\[ u_j = {v'_{w_j}}^T h \]

where $v'_{w_j}$ is the $j$-th column of the matrix $W'$. We can then use the softmax classification model to obtain the posterior distribution over the words, which is a multinomial distribution:

\[p(w_j|w_I)=y_j=\frac{\exp(u_j)}{\sum_{j'=1}^V{\exp(u_{j'})}}\]

where $y_j$ is the output of the $j$-th unit in the output layer.
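As a rough illustration of the score and softmax steps, here is a small plain-Python sketch. The hidden vector `h`, the matrix `W_prime`, and the helper names `scores` and `softmax` are assumptions chosen for the example. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not part of the derivation above; it leaves the resulting probabilities unchanged.

```python
import math


def scores(h, W_prime):
    """u_j = v'_{w_j}^T h, where v'_{w_j} is column j of the N x V matrix W'."""
    n_rows, n_cols = len(W_prime), len(W_prime[0])
    return [sum(W_prime[i][j] * h[i] for i in range(n_rows)) for j in range(n_cols)]


def softmax(u):
    """y_j = exp(u_j) / sum_j' exp(u_j'), shifted by max(u) for stability."""
    m = max(u)
    exps = [math.exp(v - m) for v in u]
    total = sum(exps)
    return [e / total for e in exps]


# Arbitrary example values: N = 3 hidden units, V = 4 vocabulary words.
h = [0.7, 0.8, 0.9]
W_prime = [
    [0.2, -0.1, 0.0, 0.5],
    [0.1, 0.3, -0.2, 0.0],
    [-0.3, 0.2, 0.4, 0.1],
]

u = scores(h, W_prime)   # one score per vocabulary word
y = softmax(u)           # posterior p(w_j | w_I) for j = 0..V-1

print(u)
print(y)
print(sum(y))            # should be 1 up to floating-point rounding
```

The word with the largest score $u_j$ also receives the largest probability $y_j$, because the softmax is monotone in each score.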

 

Substituting $h = v_{w_I}$ and $u_j = {v'_{w_j}}^T h$ into the softmax, we finally obtain:

\[p(w_j | w_I) = y_j = \frac{\exp( {v'_{w_j}}^T v_{w_I})}{\sum_{j'=1}^V{\exp( {v'_{w_{j'}}}^T v_{w_I})}}\]
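Because the closed form depends only on the input vectors (rows of $W$) and the output vectors (columns of $W'$), the whole table of conditional probabilities can be computed for every input word at once. The sketch below assumes small made-up matrices and a hypothetical helper `conditional_table`; it is an illustration of the formula, not an efficient implementation.

```python
import math


def conditional_table(W, W_prime):
    """Return P with P[i][j] = p(w_j | w_I = word i), using the closed form.

    W is V x N (rows are input vectors v_w); W_prime is N x V
    (columns are output vectors v'_w).
    """
    V, N = len(W), len(W[0])
    table = []
    for i in range(V):
        # Score of every output word j given input word i: v'_{w_j}^T v_{w_i}.
        u = [sum(W_prime[n][j] * W[i][n] for n in range(N)) for j in range(V)]
        m = max(u)  # shift for numerical stability; probabilities are unchanged
        exps = [math.exp(s - m) for s in u]
        z = sum(exps)
        table.append([e / z for e in exps])
    return table


# Arbitrary example values: V = 3 words, N = 2 hidden units.
W = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.4]]
W_prime = [[0.3, -0.2, 0.1], [0.0, 0.6, -0.4]]

P = conditional_table(W, W_prime)
for row in P:
    print(row)
```

Each row of `P` is a probability distribution over the vocabulary for one input word, so every row sums to 1.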

Original post: http://www.cnblogs.com/ZJUT-jiangnan/p/5475216.html
