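You will build a Neural Machine Translation (NMT) model that translates human-readable dates ("25th of June, 2009") into machine-readable dates ("2009-06-25"). You will do this using an attention model, one of the most sophisticated sequence-to-sequence models.
Let's load all the packages you will need for this assignment.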
from keras.layers import Bidirectional, Concatenate, Permute, Dot, Input, LSTM, Multiply
from keras.layers import RepeatVector, Dense, Activation, Lambda
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.models import load_model, Model
import keras.backend as K
import numpy as np
from faker import Faker
import random
from tqdm import tqdm
from babel.dates import format_date
from nmt_utils import *
import matplotlib.pyplot as plt
%matplotlib inline
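The model you build here could be used to translate from one language to another, for example from English to Hindi. However, language translation requires massive datasets and usually takes days of training on GPUs. To let you experiment with these models without such datasets, we will use a simpler "date translation" task instead.
The network takes dates written in a variety of possible formats (e.g. "the 29th of August 1958", "03/30/1968", "24 JUNE 1987") and translates them into standardized, machine-readable dates (e.g. "1958-08-29", "1968-03-30", "1987-06-24"). We will have the network learn to output dates in the common machine-readable format YYYY-MM-DD.
Take a look at nmt_utils.py to see all the formats. Count them and figure out how they work; you will need this knowledge later.
We will train the model on a dataset of 10000 human-readable dates and their equivalent, standardized machine-readable dates. Run the following cells to load the dataset and print some examples.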
m = 10000
dataset, human_vocab, machine_vocab, inv_machine_vocab = load_dataset(m)
100%|██████████████████████████████████████████| 10000/10000 [00:00<00:00, 30957.97it/s]
dataset[:10]
[('15 october 1989', '1989-10-15'),
('sunday july 15 1984', '1984-07-15'),
('wednesday april 5 1978', '1978-04-05'),
('3/26/16', '2016-03-26'),
('sunday april 5 1992', '1992-04-05'),
('sunday october 16 2005', '2005-10-16'),
('11 dec 1992', '1992-12-11'),
('21 03 75', '1975-03-21'),
('tuesday july 23 2013', '2013-07-23'),
('sunday november 3 1996', '1996-11-03')]
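You've loaded:
dataset: a list of tuples (human-readable date, machine-readable date).
human_vocab: a Python dictionary mapping every character used in the human-readable dates to an integer index.
machine_vocab: a Python dictionary mapping every character used in the machine-readable dates to an integer index. These indices are not necessarily consistent with those of human_vocab.
inv_machine_vocab: the inverse dictionary of machine_vocab, mapping indices back to characters.
Let's preprocess the data and map the raw text to index values. We will use Tx = 30 (which we assume is the maximum length of a human-readable date; longer inputs are truncated) and Ty = 10 (since "YYYY-MM-DD" is 10 characters long).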
Tx = 30
Ty = 10
X, Y, Xoh, Yoh = preprocess_data(dataset, human_vocab, machine_vocab, Tx, Ty)
print("X.shape:", X.shape)
print("Y.shape:", Y.shape)
print("Xoh.shape:", Xoh.shape)
print("Yoh.shape:", Yoh.shape)
X.shape: (10000, 30)
Y.shape: (10000, 10)
Xoh.shape: (10000, 30, 37)
Yoh.shape: (10000, 10, 11)
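You now have:
X: a processed version of the human-readable dates in the training set, where each character is replaced by the index it maps to in human_vocab. Each date is padded to Tx values with a special padding character. X.shape = (m, Tx).
Y: a processed version of the machine-readable dates in the training set, where each character is replaced by the index it maps to in machine_vocab. You should have Y.shape = (m, Ty).
Xoh: the one-hot version of X. Each character becomes a vector of length len(human_vocab) with a 1 at the character's index in human_vocab and 0 elsewhere, so Xoh.shape = (m, Tx, len(human_vocab)): m examples, Tx characters, and one one-hot vector of length len(human_vocab) per character.
Yoh: the one-hot version of Y, with Yoh.shape = (m, Ty, len(machine_vocab)). Here len(machine_vocab) = 11, since there are 11 characters ("-" and the digits 0-9).
Let's also look at some examples of preprocessed training data. Feel free to change index in the cell below to navigate the dataset and see how source and target dates are preprocessed.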
index = 0
print("Source date:", dataset[index][0])
print("Target date:", dataset[index][1])
print()
print("Source after preprocessing (indices):", X[index])
print("Target after preprocessing (indices):", Y[index])
print()
print("Source after preprocessing (one-hot):", Xoh[index])
print("Target after preprocessing (one-hot):", Yoh[index])
Source date: 15 october 1989
Target date: 1989-10-15
Source after preprocessing (indices): [ 4 8 0 26 15 30 26 14 17 28 0 4 12 11 12 36 36 36 36 36 36 36 36 36
36 36 36 36 36 36]
Target after preprocessing (indices): [ 2 10 9 10 0 2 1 0 2 6]
Source after preprocessing (one-hot): [[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[1. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 1.]
[0. 0. 0. ... 0. 0. 1.]
[0. 0. 0. ... 0. 0. 1.]]
Target after preprocessing (one-hot): [[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]]
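To make the preprocessing concrete, here is a minimal sketch of the index and one-hot encoding described above. It is not the code in nmt_utils.py: the helper names `encode` and `one_hot` and the toy vocabulary (including its `<unk>` and `<pad>` entries) are assumptions made for illustration only.

```python
import numpy as np

def encode(text, vocab, length):
    """Map characters to indices, truncating or padding to exactly `length` entries."""
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in text.lower()[:length]]
    ids += [vocab["<pad>"]] * (length - len(ids))
    return np.array(ids)

def one_hot(ids, depth):
    """Turn an index vector of shape (T,) into a one-hot matrix of shape (T, depth)."""
    return np.eye(depth)[ids]

# Hypothetical toy vocabulary, much smaller than human_vocab.
toy_vocab = {ch: i for i, ch in enumerate(" 0123456789/")}
toy_vocab["<unk>"] = len(toy_vocab)
toy_vocab["<pad>"] = len(toy_vocab)

ids = encode("3/26/16", toy_vocab, length=10)
oh = one_hot(ids, depth=len(toy_vocab))
print(ids.tolist())   # [4, 11, 3, 7, 11, 2, 7, 13, 13, 13]
print(oh.shape)       # (10, 14)
```

The trailing run of the padding index plays the same role as the repeated 36 in the preprocessed source printed above.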
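If you had to translate a paragraph of a book from French to English, you would not read the whole paragraph, close the book, and then translate. Even while translating, you would read and re-read, focusing on the parts of the French paragraph that correspond to the part of the English you are writing down.
The attention mechanism tells a Neural Machine Translation model where it should pay attention at each step.
In this part, you will implement the attention mechanism presented in the lecture videos. Here is a figure to remind you how the model works. The diagram on the left shows the attention model. The diagram on the right shows what one "attention" step does to calculate the attention variables \(\alpha^{\langle t, t' \rangle}\), which are used to compute the context variable \(context^{\langle t \rangle}\) for each time step of the output (\(t=1, \ldots, T_y\)).
Note: \(t'\) indexes the time steps of the bidirectional LSTM and \(t\) indexes the time steps of the LSTM above it. For example, \(\alpha^{\langle 1, 2 \rangle}\) is the weight between time step 1 of the upper LSTM and time step 2 of the bidirectional LSTM below.
For each time step of the upper LSTM, for example the first one, the attention weights sum to 1:
\[\sum_{t'} \alpha^{\langle 1, t' \rangle}=1\]
The attention weight \(\alpha^{\langle t, t' \rangle}\) is the amount of attention that time step \(t\) of the upper LSTM pays to \(a^{\langle t' \rangle}\) from the bidirectional LSTM below; in other words, how much attention to pay to the \(t'\)-th input element when generating the \(t\)-th output element.
The context at the first step is the sum of the Bi-LSTM activations \(a^{\langle t' \rangle}\), each multiplied by its attention weight. Every step therefore takes all of these states into account, but with different weights.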
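The two properties above (the weights sum to 1, and the context is a weighted sum of the Bi-LSTM activations) can be checked with a few lines of NumPy. This is only a sketch with toy sizes and random stand-in values; the `softmax` helper here is defined locally and is not the custom softmax from nmt_utils.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

Tx, n_a = 5, 3                        # toy sizes, not the notebook's
rng = np.random.default_rng(0)
a = rng.normal(size=(Tx, 2 * n_a))    # stand-in for the Bi-LSTM activations a^{<t'>}
energies = rng.normal(size=Tx)        # stand-in for e^{<t, t'>} at one output step t

alphas = softmax(energies)            # attention weights alpha^{<t, t'>}
context = alphas @ a                  # sum over t' of alpha^{<t, t'>} * a^{<t'>}

print(np.isclose(alphas.sum(), 1.0))  # True
print(context.shape)                  # (6,)
```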
[Figure 1 (images not available). Left: the attention model. Right: one "attention" step.]
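Here are some properties of the model that you may notice:
There are two separate LSTMs in this model (see the diagram on the left). The one at the bottom of the picture is a bidirectional LSTM and comes before the attention mechanism, so we call it the pre-attention Bi-LSTM. The LSTM at the top of the diagram comes after the attention mechanism, so we call it the post-attention LSTM. The pre-attention Bi-LSTM goes through \(T_x\) time steps; the post-attention LSTM goes through \(T_y\) time steps.
The post-attention LSTM passes \(s^{\langle t \rangle}, c^{\langle t \rangle}\) from one time step to the next. In the lecture videos, we used only a basic RNN for the post-attention sequence model, so the state captured by the RNN was just the output activation \(s^{\langle t\rangle}\). Since we are using an LSTM here, there is both the output activation \(s^{\langle t\rangle}\) and the hidden cell state \(c^{\langle t\rangle}\). However, unlike previous text generation examples (such as Dinosaurus in week 1), in this model the post-attention LSTM at time \(t\) does not take the previously generated \(y^{\langle t-1 \rangle}\) as input; its only inputs are the context vector \(context^{\langle t \rangle}\) and the previous states \(s^{\langle t-1\rangle}\) and \(c^{\langle t-1\rangle}\). We designed the model this way because, unlike language generation where adjacent characters are highly correlated, there is no strong dependency between the previous character and the next character in a YYYY-MM-DD date.
We use \(a^{\langle t \rangle} = [\overrightarrow{a}^{\langle t \rangle}; \overleftarrow{a}^{\langle t \rangle}]\) to denote the concatenation of the forward and backward activations of the pre-attention Bi-LSTM.
The diagram on the right uses a RepeatVector node to copy the value of \(s^{\langle t-1 \rangle}\) \(T_x\) times, and then Concatenation to concatenate \(s^{\langle t-1 \rangle}\) with each \(a^{\langle t' \rangle}\) to compute \(e^{\langle t, t' \rangle}\), which is then passed through a softmax to compute \(\alpha^{\langle t, t' \rangle}\). We explain how to use RepeatVector and Concatenation in Keras below.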
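As a shape check for the RepeatVector and Concatenation steps, the sketch below reproduces them with plain NumPy rather than Keras layers. The batch size of 2 is arbitrary, and `np.repeat` and `np.concatenate` are stand-ins for what the Keras layers do to the tensor shapes.

```python
import numpy as np

m, Tx, n_a, n_s = 2, 30, 32, 64        # batch of 2; other sizes as in this notebook

a = np.zeros((m, Tx, 2 * n_a))         # Bi-LSTM outputs, one vector per input step
s_prev = np.ones((m, n_s))             # previous post-attention hidden state

# RepeatVector(Tx): copy s_prev once per input time step -> (m, Tx, n_s)
s_rep = np.repeat(s_prev[:, np.newaxis, :], Tx, axis=1)

# Concatenate(axis=-1): join along the feature axis -> (m, Tx, 2*n_a + n_s)
concat = np.concatenate([a, s_rep], axis=-1)

print(s_rep.shape)    # (2, 30, 64)
print(concat.shape)   # (2, 30, 128)
```

Every one of the Tx rows of `concat` now carries both the local activation and the same copy of the decoder state, which is what lets a small dense network score each input position.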
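Let's implement this model. You will start by implementing two functions: one_step_attention() and model().
1) one_step_attention(): at step \(t\), given all the hidden states of the Bi-LSTM (\([a^{<1>},a^{<2>}, ..., a^{<T_x>}]\)) and the previous hidden state of the second LSTM (\(s^{<t-1>}\)), one_step_attention() computes the attention weights (\([\alpha^{<t,1>},\alpha^{<t,2>}, ..., \alpha^{<t,T_x>}]\)) and outputs the context vector (see Figure 1 (right) for details):
\[context^{<t>} = \sum_{t' = 1}^{T_x} \alpha^{<t,t'>}a^{<t'>}\tag{1}\]
Note that we denote the attention context here as \(context^{\langle t \rangle}\). In the lecture videos, the context was denoted \(c^{\langle t \rangle}\), but here we call it \(context^{\langle t \rangle}\) to avoid confusion with the internal memory cell of the post-attention LSTM, which is also denoted \(c^{\langle t \rangle}\).
2) model(): implements the entire model. It first runs the input through a Bi-LSTM to get \([a^{<1>},a^{<2>}, ..., a^{<T_x>}]\). Then it calls one_step_attention() \(T_y\) times (in a for loop). At each iteration of this loop, it feeds the computed context vector \(context^{\langle t \rangle}\) to the second LSTM and runs the LSTM output through a dense layer with softmax activation to generate the prediction \(\hat{y}^{<t>}\).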
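Exercise: Implement one_step_attention(). The function model() will call the layers in one_step_attention() \(T_y\) times using a for loop, and it is important that all \(T_y\) copies have the same weights. That is, it should not re-initialize the weights every time; all \(T_y\) steps should share the same weights. Here is how you can implement layers with shareable weights in Keras:
1. Define the layer objects (as global variables, for example).
2. Call these objects when propagating the input.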
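The idea behind these two steps can be illustrated without Keras. In the sketch below, `TinyDense` is a hypothetical stand-in for a layer object: its weights are created once in the constructor, so calling the same object repeatedly reuses them, while constructing a new object at each step does not.

```python
import numpy as np

class TinyDense:
    """Hypothetical stand-in for a layer: weights are created once, in __init__."""
    def __init__(self, n_in, n_out, seed):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def __call__(self, x):
        return np.tanh(x @ self.W + self.b)

shared = TinyDense(4, 2, seed=0)        # step 1: define the layer object once
x = np.ones((1, 4))

outs = [shared(x) for _ in range(3)]    # step 2: call the same object at every step
# Same input and same weights give identical outputs at every step.
print(all(np.array_equal(outs[0], o) for o in outs))   # True

# Creating a new object inside the loop gives each step its own weights.
fresh = [TinyDense(4, 2, seed=t)(x) for t in range(3)]
print(np.array_equal(fresh[0], fresh[1]))              # False
```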
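We have defined the layers you need as global variables. Please run the following cell to create them. Check the Keras documentation to make sure you understand what these layers are:
RepeatVector(), Concatenate(), Dense(), Activation(), Dot().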
# Define shared layers as global variables
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation = "tanh")
densor2 = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights') # We are using a custom softmax(axis = 1) loaded in this notebook
dotor = Dot(axes = 1)
Now you can use these layers to implement one_step_attention(). To propagate a Keras tensor X through one of these layers, use layer(X) (or layer([X,Y]) if it requires multiple inputs). For example, densor2(X) propagates X through the Dense(1) layer defined above.
# GRADED FUNCTION: one_step_attention
def one_step_attention(a, s_prev):
    """
    Performs one step of attention (the computation shown in the right-hand diagram above): outputs a context
    vector computed as a dot product of the attention weights "alphas" and the hidden states "a" of the Bi-LSTM.

    Arguments:
    a -- hidden state output of the Bi-LSTM, tensor of shape (m, Tx, 2*n_a)
    s_prev -- previous hidden state of the (post-attention) LSTM, tensor of shape (m, n_s)

    Returns:
    context -- context vector, input of the next (post-attention) LSTM cell
    """
    ### START CODE HERE ###
    # Use repeator to repeat s_prev to be of shape (m, Tx, n_s) so that you can concatenate it with all hidden states "a" (≈ 1 line)
    s_prev = repeator(s_prev)
    # Use concatenator to concatenate a and s_prev on the last axis (≈ 1 line)
    concat = concatenator([a, s_prev])  # pairs every a[t'] with s_prev
    # Use densor1 to propagate concat through a small fully-connected neural network to compute the "intermediate energies" variable e. (≈ 1 line)
    e = densor1(concat)  # first fully connected layer
    # Use densor2 to propagate e through a small fully-connected neural network to compute the "energies" variable energies. (≈ 1 line)
    energies = densor2(e)  # second fully connected layer
    # Use "activator" on "energies" to compute the attention weights "alphas" (≈ 1 line)
    alphas = activator(energies)  # softmax over the Tx axis
    # Use dotor together with "alphas" and "a" to compute the context vector to be given to the next (post-attention) LSTM cell (≈ 1 line)
    context = dotor([alphas, a])
    ### END CODE HERE ###
    return context
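You will be able to check the expected output of one_step_attention() after you have coded the model() function.
Exercise: Implement model() as explained in Figure 2 and the text above. Again, we have defined global layers that will share weights when used in model().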
n_a = 32
n_s = 64
post_activation_LSTM_cell = LSTM(n_s, return_state = True)
output_layer = Dense(len(machine_vocab), activation=softmax)
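Now you can use these layers \(T_y\) times in a for loop to generate the outputs, and their parameters will not be re-initialized. You will have to carry out the following steps:
1. Propagate the input through the bidirectional LSTM.
2. Iterate for \(t = 0, \dots, T_y-1\):
    1. Call one_step_attention() on \([a^{<1>},a^{<2>}, ..., a^{<T_x>}]\) and \(s^{<t-1>}\) to get the context vector \(context^{<t>}\).
    2. Give \(context^{<t>}\) to the post-attention LSTM cell. Remember to pass in the previous hidden state \(s^{\langle t-1\rangle}\) and cell state \(c^{\langle t-1\rangle}\) of this LSTM using initial_state=[previous hidden state, previous cell state]. Get back the new hidden state \(s^{<t>}\) and the new cell state \(c^{<t>}\).
    3. Apply a softmax layer to \(s^{<t>}\) to get the output.
    4. Save the output by adding it to the list of outputs.
3. Create your Keras model instance. It should have three inputs ("inputs", \(s^{<0>}\) and \(c^{<0>}\)) and output the list of "outputs".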
# GRADED FUNCTION: model
def model(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    """
    Arguments:
    Tx -- length of the input sequence
    Ty -- length of the output sequence
    n_a -- hidden state size of the Bi-LSTM
    n_s -- hidden state size of the post-attention LSTM
    human_vocab_size -- size of the python dictionary "human_vocab"
    machine_vocab_size -- size of the python dictionary "machine_vocab"

    Returns:
    model -- Keras model instance
    """
    # Define the inputs of your model with a shape (Tx, human_vocab_size)
    # Define s0 and c0, initial hidden state for the decoder LSTM of shape (n_s,)
    X = Input(shape=(Tx, human_vocab_size))
    s0 = Input(shape=(n_s,), name='s0')
    c0 = Input(shape=(n_s,), name='c0')
    s = s0
    c = c0
    # Initialize empty list of outputs
    outputs = []
    ### START CODE HERE ###
    # Step 1: Define your pre-attention Bi-LSTM. Remember to use return_sequences=True. (≈ 1 line)
    a = Bidirectional(LSTM(n_a, return_sequences=True), name='bidirectional_1')(X)
    # Step 2: Iterate for Ty steps
    for t in range(Ty):
        # Step 2.A: Perform one step of the attention mechanism to get back the context vector at step t (≈ 1 line)
        context = one_step_attention(a, s)
        # Step 2.B: Apply the post-attention LSTM cell to the "context" vector.
        # Don't forget to pass: initial_state = [hidden state, cell state] (≈ 1 line)
        s, _, c = post_activation_LSTM_cell(context, initial_state=[s, c])
        # Step 2.C: Apply Dense layer to the hidden state output of the post-attention LSTM (≈ 1 line)
        out = output_layer(s)
        # Step 2.D: Append "out" to the "outputs" list (≈ 1 line)
        outputs.append(out)
    # Step 3: Create model instance taking three inputs and returning the list of outputs. (≈ 1 line)
    model = Model(inputs=[X, s0, c0], outputs=outputs)
    ### END CODE HERE ###
    return model
Run the following cell to create your model.
model = model(Tx, Ty, n_a, n_s, len(human_vocab), len(machine_vocab))
Let's get a summary of the model to check if it matches the expected output.
model.summary()
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) (None, 30, 37) 0
__________________________________________________________________________________________________
s0 (InputLayer) (None, 64) 0
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 30, 64) 17920 input_1[0][0]
__________________________________________________________________________________________________
repeat_vector_1 (RepeatVector) (None, 30, 64) 0 s0[0][0]
lstm_2[0][0]
lstm_2[1][0]
lstm_2[2][0]
lstm_2[3][0]
lstm_2[4][0]
lstm_2[5][0]
lstm_2[6][0]
lstm_2[7][0]
lstm_2[8][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 30, 128) 0 bidirectional_1[0][0]
repeat_vector_1[0][0]
bidirectional_1[0][0]
repeat_vector_1[1][0]
bidirectional_1[0][0]
repeat_vector_1[2][0]
bidirectional_1[0][0]
repeat_vector_1[3][0]
bidirectional_1[0][0]
repeat_vector_1[4][0]
bidirectional_1[0][0]
repeat_vector_1[5][0]
bidirectional_1[0][0]
repeat_vector_1[6][0]
bidirectional_1[0][0]
repeat_vector_1[7][0]
bidirectional_1[0][0]
repeat_vector_1[8][0]
bidirectional_1[0][0]
repeat_vector_1[9][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 30, 10) 1290 concatenate_1[0][0]
concatenate_1[1][0]
concatenate_1[2][0]
concatenate_1[3][0]
concatenate_1[4][0]
concatenate_1[5][0]
concatenate_1[6][0]
concatenate_1[7][0]
concatenate_1[8][0]
concatenate_1[9][0]
__________________________________________________________________________________________________
dense_2 (Dense) (None, 30, 1) 11 dense_1[0][0]
dense_1[1][0]
dense_1[2][0]
dense_1[3][0]
dense_1[4][0]
dense_1[5][0]
dense_1[6][0]
dense_1[7][0]
dense_1[8][0]
dense_1[9][0]
__________________________________________________________________________________________________
attention_weights (Activation) (None, 30, 1) 0 dense_2[0][0]
dense_2[1][0]
dense_2[2][0]
dense_2[3][0]
dense_2[4][0]
dense_2[5][0]
dense_2[6][0]
dense_2[7][0]
dense_2[8][0]
dense_2[9][0]
__________________________________________________________________________________________________
dot_1 (Dot) (None, 1, 64) 0 attention_weights[0][0]
bidirectional_1[0][0]
attention_weights[1][0]
bidirectional_1[0][0]
attention_weights[2][0]
bidirectional_1[0][0]
attention_weights[3][0]
bidirectional_1[0][0]
attention_weights[4][0]
bidirectional_1[0][0]
attention_weights[5][0]
bidirectional_1[0][0]
attention_weights[6][0]
bidirectional_1[0][0]
attention_weights[7][0]
bidirectional_1[0][0]
attention_weights[8][0]
bidirectional_1[0][0]
attention_weights[9][0]
bidirectional_1[0][0]
__________________________________________________________________________________________________
c0 (InputLayer) (None, 64) 0
__________________________________________________________________________________________________
lstm_2 (LSTM) [(None, 64), (None, 33024 dot_1[0][0]
s0[0][0]
c0[0][0]
dot_1[1][0]
lstm_2[0][0]
lstm_2[0][2]
dot_1[2][0]
lstm_2[1][0]
lstm_2[1][2]
dot_1[3][0]
lstm_2[2][0]
lstm_2[2][2]
dot_1[4][0]
lstm_2[3][0]
lstm_2[3][2]
dot_1[5][0]
lstm_2[4][0]
lstm_2[4][2]
dot_1[6][0]
lstm_2[5][0]
lstm_2[5][2]
dot_1[7][0]
lstm_2[6][0]
lstm_2[6][2]
dot_1[8][0]
lstm_2[7][0]
lstm_2[7][2]
dot_1[9][0]
lstm_2[8][0]
lstm_2[8][2]
__________________________________________________________________________________________________
dense_4 (Dense) (None, 11) 715 lstm_2[0][0]
lstm_2[1][0]
lstm_2[2][0]
lstm_2[3][0]
lstm_2[4][0]
lstm_2[5][0]
lstm_2[6][0]
lstm_2[7][0]
lstm_2[8][0]
lstm_2[9][0]
==================================================================================================
Total params: 52,960
Trainable params: 52,960
Non-trainable params: 0
__________________________________________________________________________________________________
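Expected Output:
Here is the summary you should see. Note that these reference values differ from the summary printed above, which was produced with n_a = 32 and n_s = 64 and reports 52,960 parameters; the shapes below correspond to a larger configuration (a Bi-LSTM output of width 128 implies n_a = 64).
**Total params:** | 185,484 |
**Trainable params:** | 185,484 |
**Non-trainable params:** | 0 |
**bidirectional_1's output shape** | (None, 30, 128) |
**repeat_vector_1's output shape** | (None, 30, 128) |
**concatenate_1's output shape** | (None, 30, 256) |
**attention_weights's output shape** | (None, 30, 1) |
**dot_1's output shape** | (None, 1, 128) |
**dense_2's output shape** | (None, 11) |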
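The per-layer parameter counts in the model.summary() output printed earlier (17,920; 1,290; 11; 33,024; 715) can be reproduced by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias) and the sizes used in this notebook; the helper names `lstm_params` and `dense_params` are made up for illustration.

```python
def lstm_params(n_in, n_units):
    """4 gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (n_units * (n_in + n_units) + n_units)

def dense_params(n_in, n_out):
    """One weight per input/output pair plus one bias per output."""
    return n_in * n_out + n_out

n_a, n_s = 32, 64
human_vocab_size, machine_vocab_size = 37, 11

counts = {
    "bidirectional_1": 2 * lstm_params(human_vocab_size, n_a),  # forward + backward LSTM
    "dense_1": dense_params(2 * n_a + n_s, 10),                 # densor1 on [a; s_prev]
    "dense_2": dense_params(10, 1),                             # densor2
    "lstm_2": lstm_params(2 * n_a, n_s),                        # post-attention LSTM
    "dense_4": dense_params(n_s, machine_vocab_size),           # output layer
}
for name, n in counts.items():
    print(name, n)
print("total", sum(counts.values()))   # total 52960
```

Because the attention layers and the post-attention LSTM are shared across all Ty steps, each is counted once, not Ty times.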
As usual, after creating your model in Keras, you need to compile it and define the loss, optimizer and metrics you want to use. Compile your model using categorical_crossentropy loss, a custom Adam optimizer (learning rate = 0.005, \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), decay = 0.01) and ['accuracy'] metrics:
### START CODE HERE ### (≈2 lines)
opt = Adam(lr=0.005, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
### END CODE HERE ###
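To get a feel for what the decay argument does, the sketch below assumes the time-based rule used by the Keras 2.x optimizers, where the step size after t updates is lr / (1 + decay * t). The function `effective_lr` is hypothetical and only mirrors that rule; it uses the values stated in the text above (learning rate 0.005, decay 0.01).

```python
def effective_lr(lr, decay, iteration):
    """Time-based decay: the step size shrinks as 1 / (1 + decay * iteration)."""
    return lr / (1.0 + decay * iteration)

# With 10000 examples and batch_size=100 there are 100 updates per epoch.
for it in (0, 100, 1000):
    print(it, effective_lr(0.005, 0.01, it))
```

Under this assumption the learning rate has already halved after the first epoch of 100 updates.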
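The last step is to define all your inputs and outputs to fit the model:
You need to create s0 and c0 to initialize your post_activation_LSTM_cell with zeros.
Given the model() you coded, "outputs" must be a list of Ty = 10 elements, each of shape (m, len(machine_vocab)) = (m, 11), so that outputs[0][i], ..., outputs[Ty-1][i] are the true labels (one-hot characters) of the \(i^{th}\) training example (X[i]). More generally, outputs[j][i] is the true label of the \(j^{th}\) character of the \(i^{th}\) training example.
s0 = np.zeros((m, n_s))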
c0 = np.zeros((m, n_s))
outputs = list(Yoh.swapaxes(0,1))
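The effect of `list(Yoh.swapaxes(0,1))` is easiest to see on a small array. The sketch below uses a toy batch of 4 examples filled with distinct numbers instead of real one-hot labels, so the names `Yoh_toy` and `outputs_toy` are illustrative only.

```python
import numpy as np

m, Ty, n_classes = 4, 10, 11          # toy batch size; Ty and class count as above
Yoh_toy = np.arange(m * Ty * n_classes).reshape(m, Ty, n_classes)

# Swap the example axis and the time axis, then split along the new first axis.
outputs_toy = list(Yoh_toy.swapaxes(0, 1))

print(len(outputs_toy))         # 10
print(outputs_toy[0].shape)     # (4, 11)
# Entry [j][i] of the list is the label vector for character j of example i.
print(np.array_equal(outputs_toy[3][2], Yoh_toy[2, 3]))   # True
```

This matches the model, which returns one softmax output per output position, each covering the whole batch.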
Let's now fit the model and run it for one epoch.
model.fit([Xoh, s0, c0], outputs, epochs=1, batch_size=100)
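While training, you can see the loss as well as the accuracy on each of the 10 positions of the output. For example, for a batch with two examples, the reported accuracies might include an entry such as:
dense_2_acc_8: 0.89
This means that you are predicting the 7th character of the output correctly 89% of the time in the current batch of data. We have run this model for longer and saved the weights. Run the next cell to load our weights. (By training a model for several minutes, you should be able to obtain a model of similar accuracy, but loading our model will save you time.)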
model.load_weights('models/model.h5')
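The per-position accuracies reported during training can be understood as one accuracy per output character, averaged over the batch. The sketch below computes that quantity for a made-up batch of two examples with four output positions; `per_position_accuracy` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def per_position_accuracy(y_true, y_pred):
    """y_true, y_pred: integer arrays of shape (batch, Ty). Returns Ty accuracies."""
    return (y_true == y_pred).mean(axis=0)

# Made-up labels and predictions for a batch of two examples.
y_true = np.array([[2, 10, 9, 10], [3, 1, 2, 7]])
y_pred = np.array([[2, 10, 9, 1], [3, 1, 1, 7]])

print(per_position_accuracy(y_true, y_pred).tolist())   # [1.0, 1.0, 0.5, 0.5]
```

With only two examples in a batch, each position's accuracy can only be 0, 0.5 or 1.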
You can now see the results on new examples.
EXAMPLES = ['3 May 1979', '5 April 09', '21th of August 2016', 'Tue 10 Jul 2007', 'Saturday May 9 2018', 'March 3 2001', 'March 3rd 2001', '1 March 2001']
for example in EXAMPLES:
    source = string_to_int(example, Tx, human_vocab)
    source = np.array(list(map(lambda x: to_categorical(x, num_classes=len(human_vocab)), source))).swapaxes(0,1)
    prediction = model.predict([[source.T], s0, c0])
    #prediction = model.predict([source, s0, c0])  # the original call had mismatched dimensions
    prediction = np.argmax(prediction, axis = -1)
    output = [inv_machine_vocab[int(i)] for i in prediction]
    print("source:", example)
    print("output:", ''.join(output))
source: 3 May 1979
output: 1979-05-03
source: 5 April 09
output: 2009-05-05
source: 21th of August 2016
output: 2016-08-21
source: Tue 10 Jul 2007
output: 2007-07-10
source: Saturday May 9 2018
output: 2018-05-09
source: March 3 2001
output: 2001-03-03
source: March 3rd 2001
output: 2001-03-03
source: 1 March 2001
output: 2001-03-01
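The decoding step in the loop above (argmax over each softmax output, then a lookup in inv_machine_vocab) can be tried in isolation. In the sketch below the prediction is fabricated rather than produced by the model, and `inv_vocab` is an assumed table reconstructed from the indices printed earlier for "1989-10-15" ("-" at index 0, digit d at index d + 1).

```python
import numpy as np

# Assumed index-to-character table in the spirit of inv_machine_vocab.
inv_vocab = {0: "-"}
inv_vocab.update({d + 1: str(d) for d in range(10)})
char_to_idx = {ch: i for i, ch in inv_vocab.items()}

# Fake model output: a list of Ty arrays of shape (1, 11), one per output position,
# each putting most of the probability mass on the desired character.
target = "1979-05-03"
prediction = []
for ch in target:
    p = np.full((1, 11), 0.01)
    p[0, char_to_idx[ch]] = 0.9
    prediction.append(p)

idx = np.argmax(prediction, axis=-1)                     # shape (10, 1)
decoded = "".join(inv_vocab[int(i[0])] for i in idx)
print(decoded)   # 1979-05-03
```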
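You can also change these examples to test with your own examples. The next part will give you a better sense of what the attention mechanism is doing, i.e. what part of the input the network is paying attention to when generating a particular output character.
Since the problem has a fixed output length of 10, it is also possible to carry out this task using 10 different softmax units to generate the 10 characters of the output. But one advantage of the attention model is that each part of the output (say the month) knows it needs to depend only on a small part of the input (the characters in the input giving the month). We can visualize what part of the output is looking at what part of the input.
Consider the task of translating "Saturday 9 May 2018" to "2018-05-09". If we visualize the computed \(\alpha^{\langle t, t' \rangle}\) we get this:
Notice how the output ignores the "Saturday" portion of the input. None of the output time steps pay much attention to that portion of the input. We also see that 9 has been translated as 09 and May has been correctly translated into 05, with the output paying attention to the parts of the input it needs to make the translation. The year mostly requires it to pay attention to the input's "18" in order to generate "2018".
Now let's visualize the attention values in your network. We will propagate an example through the network, then visualize the values of \(\alpha^{\langle t, t' \rangle}\).
To figure out where the attention values are located, let's start by printing a summary of the model.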
model.summary()
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) (None, 30, 37) 0
__________________________________________________________________________________________________
s0 (InputLayer) (None, 64) 0
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 30, 64) 17920 input_1[0][0]
__________________________________________________________________________________________________
repeat_vector_1 (RepeatVector) (None, 30, 64) 0 s0[0][0]
lstm_2[0][0]
lstm_2[1][0]
lstm_2[2][0]
lstm_2[3][0]
lstm_2[4][0]
lstm_2[5][0]
lstm_2[6][0]
lstm_2[7][0]
lstm_2[8][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 30, 128) 0 bidirectional_1[0][0]
repeat_vector_1[0][0]
bidirectional_1[0][0]
repeat_vector_1[1][0]
bidirectional_1[0][0]
repeat_vector_1[2][0]
bidirectional_1[0][0]
repeat_vector_1[3][0]
bidirectional_1[0][0]
repeat_vector_1[4][0]
bidirectional_1[0][0]
repeat_vector_1[5][0]
bidirectional_1[0][0]
repeat_vector_1[6][0]
bidirectional_1[0][0]
repeat_vector_1[7][0]
bidirectional_1[0][0]
repeat_vector_1[8][0]
bidirectional_1[0][0]
repeat_vector_1[9][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 30, 10) 1290 concatenate_1[0][0]
concatenate_1[1][0]
concatenate_1[2][0]
concatenate_1[3][0]
concatenate_1[4][0]
concatenate_1[5][0]
concatenate_1[6][0]
concatenate_1[7][0]
concatenate_1[8][0]
concatenate_1[9][0]
__________________________________________________________________________________________________
dense_2 (Dense) (None, 30, 1) 11 dense_1[0][0]
dense_1[1][0]
dense_1[2][0]
dense_1[3][0]
dense_1[4][0]
dense_1[5][0]
dense_1[6][0]
dense_1[7][0]
dense_1[8][0]
dense_1[9][0]
__________________________________________________________________________________________________
attention_weights (Activation) (None, 30, 1) 0 dense_2[0][0]
dense_2[1][0]
dense_2[2][0]
dense_2[3][0]
dense_2[4][0]
dense_2[5][0]
dense_2[6][0]
dense_2[7][0]
dense_2[8][0]
dense_2[9][0]
__________________________________________________________________________________________________
dot_1 (Dot) (None, 1, 64) 0 attention_weights[0][0]
bidirectional_1[0][0]
attention_weights[1][0]
bidirectional_1[0][0]
attention_weights[2][0]
bidirectional_1[0][0]
attention_weights[3][0]
bidirectional_1[0][0]
attention_weights[4][0]
bidirectional_1[0][0]
attention_weights[5][0]
bidirectional_1[0][0]
attention_weights[6][0]
bidirectional_1[0][0]
attention_weights[7][0]
bidirectional_1[0][0]
attention_weights[8][0]
bidirectional_1[0][0]
attention_weights[9][0]
bidirectional_1[0][0]
__________________________________________________________________________________________________
c0 (InputLayer) (None, 64) 0
__________________________________________________________________________________________________
lstm_2 (LSTM) [(None, 64), (None, 33024 dot_1[0][0]
s0[0][0]
c0[0][0]
dot_1[1][0]
lstm_2[0][0]
lstm_2[0][2]
dot_1[2][0]
lstm_2[1][0]
lstm_2[1][2]
dot_1[3][0]
lstm_2[2][0]
lstm_2[2][2]
dot_1[4][0]
lstm_2[3][0]
lstm_2[3][2]
dot_1[5][0]
lstm_2[4][0]
lstm_2[4][2]
dot_1[6][0]
lstm_2[5][0]
lstm_2[5][2]
dot_1[7][0]
lstm_2[6][0]
lstm_2[6][2]
dot_1[8][0]
lstm_2[7][0]
lstm_2[7][2]
dot_1[9][0]
lstm_2[8][0]
lstm_2[8][2]
__________________________________________________________________________________________________
dense_4 (Dense) (None, 11) 715 lstm_2[0][0]
lstm_2[1][0]
lstm_2[2][0]
lstm_2[3][0]
lstm_2[4][0]
lstm_2[5][0]
lstm_2[6][0]
lstm_2[7][0]
lstm_2[8][0]
lstm_2[9][0]
==================================================================================================
Total params: 52,960
Trainable params: 52,960
Non-trainable params: 0
__________________________________________________________________________________________________
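Navigate through the output of model.summary() above. You can see that the layer named attention_weights outputs the alphas of shape (m, 30, 1) before dot_1 computes the context vector for every time step \(t = 0, \ldots, T_y-1\). Let's get the activations from this layer.
The function plot_attention_map() pulls out the attention values from your model and plots them.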
attention_map = plot_attention_map(model, human_vocab, inv_machine_vocab, "Tuesday 09 Oct 1993", num = 7, n_s = 64)
<Figure size 432x288 with 0 Axes>
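On the generated plot you can observe the values of the attention weights for each character of the predicted output. Examine this plot and check that where the network is paying attention makes sense to you.
In the date translation application, you will observe that most of the time attention helps predict the year, and does not have much impact on predicting the day or month.
You have come to the end of this assignment.
Here's what you should remember:
Source: https://www.cnblogs.com/Moonshade/p/10953450.html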