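Finally starting Chapter 6, and I fell into a pit right at the start: what on earth is NLTK? I have read several TensorFlow books, after all. The funny thing is that every one of them covered RNNs, yet not one mentioned NLTK. They really are shoddy little books. Of all the newcomers who went "from getting started to giving up", these books deserve at least half the credit!
I had wanted to split this chapter into several parts, but the code is all one continuous piece, so the whole of Section 6.4 has to be stir-fried in one big pot. It really is hard to cook through.
Step 1:
First, pull the text preprocessing out into a file of its own so it does not clutter things up. I simply named the file stringclean64.py; import it when needed.
I copied and pasted the Trump speech text from GitHub because I did not know how to download it: https://github.com/tensorlayer/tensorlayer/tree/master/example/data/trump I saved the text as UTF-8, so the line with open(input_fpath, mode='r', encoding='utf-8') as f: needs encoding='utf-8', otherwise it raises an error.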
import re
def basic_clean_str(string):
    string = re.sub(r"\n", " ", string)  # '\n' --> ' '
    string = re.sub(r"\'s", " \'s", string)  # it's --> it 's
    string = re.sub(r"\'ve", " have", string)  # they've --> they have
    string = re.sub(r"\’ve", " have", string)
    string = re.sub(r"\'t", " not", string)  # can't --> can not
    string = re.sub(r"\’t", " not", string)
    string = re.sub(r"\'re", " are", string)  # they're --> they are
    string = re.sub(r"\’re", " are", string)
    string = re.sub(r"\'d", "", string)  # I'd (I had, I would) --> I
    string = re.sub(r"\’d", "", string)
    string = re.sub(r"\'ll", " will", string)  # I'll --> I will
    string = re.sub(r"\’ll", " will", string)
    string = re.sub(r"\“", " ", string)  # drop curly double quotes
    string = re.sub(r"\”", " ", string)
    string = re.sub(r"\"", " ", string)  # drop straight double quotes
    string = re.sub(r"\'", " ", string)  # drop remaining straight apostrophes
    string = re.sub(r"\’", " ", string)  # drop remaining curly apostrophes
    string = re.sub(r"\.", " . ", string)  # they. --> they .
    string = re.sub(r"\,", " , ", string)  # they, --> they ,
    string = re.sub(r"\!", " ! ", string)  # they! --> they !
    string = re.sub(r"\-", " ", string)  # low-cost --> low cost
    # Unlike customized_clean_str, the symbols below are replaced by a space, not kept as tokens.
    string = re.sub(r"\(", " ", string)
    string = re.sub(r"\)", " ", string)
    string = re.sub(r"\]", " ", string)
    string = re.sub(r"\[", " ", string)
    string = re.sub(r"\?", " ", string)
    string = re.sub(r"\>", " ", string)
    string = re.sub(r"\<", " ", string)
    string = re.sub(r"\=", " ", string)
    string = re.sub(r"\;", " ", string)
    string = re.sub(r"\;", " ", string)
    string = re.sub(r"\:", " ", string)
    string = re.sub(r"\"", " ", string)
    string = re.sub(r"\$", " ", string)
    string = re.sub(r"\_", " ", string)
    string = re.sub(r"\s{2,}", " ", string)  # collapse runs of whitespace into one space
    return string.strip().lower()  # lowercase
def customized_clean_str(string):
    string = re.sub(r"\n", " ", string)  # '\n' --> ' '
    string = re.sub(r"\'s", " \'s", string)  # it's --> it 's
    string = re.sub(r"\’s", " \'s", string)
    string = re.sub(r"\'ve", " have", string)  # they've --> they have
    string = re.sub(r"\’ve", " have", string)
    string = re.sub(r"\'t", " not", string)  # can't --> can not
    string = re.sub(r"\’t", " not", string)
    string = re.sub(r"\'re", " are", string)  # they're --> they are
    string = re.sub(r"\’re", " are", string)
    string = re.sub(r"\'d", "", string)  # I'd (I had, I would) --> I
    string = re.sub(r"\’d", "", string)
    string = re.sub(r"\'ll", " will", string)  # I'll --> I will
    string = re.sub(r"\’ll", " will", string)
    string = re.sub(r"\“", " “ ", string)  # “a” --> “ a ”
    string = re.sub(r"\”", " ” ", string)
    string = re.sub(r"\"", " “ ", string)  # "a" --> “ a “
    string = re.sub(r"\'", " ' ", string)  # they' --> they '
    string = re.sub(r"\’", " ' ", string)  # they’ --> they '
    string = re.sub(r"\.", " . ", string)  # they. --> they .
    string = re.sub(r"\,", " , ", string)  # they, --> they ,
    string = re.sub(r"\-", " ", string)  # low-cost --> low cost
    string = re.sub(r"\(", " ( ", string)  # (they) --> ( they)
    string = re.sub(r"\)", " ) ", string)  # ( they) --> ( they )
    string = re.sub(r"\!", " ! ", string)  # they! --> they !
    string = re.sub(r"\]", " ] ", string)  # they] --> they ]
    string = re.sub(r"\[", " [ ", string)  # they[ --> they [
    string = re.sub(r"\?", " ? ", string)  # they? --> they ?
    string = re.sub(r"\>", " > ", string)  # they> --> they >
    string = re.sub(r"\<", " < ", string)  # they< --> they <
    string = re.sub(r"\=", " = ", string)  # easier= --> easier =
    string = re.sub(r"\;", " ; ", string)  # easier; --> easier ;
    string = re.sub(r"\;", " ; ", string)
    string = re.sub(r"\:", " : ", string)  # easier: --> easier :
    string = re.sub(r"\"", " \" ", string)  # easier" --> easier "
    string = re.sub(r"\$", " $ ", string)  # $380 --> $ 380
    string = re.sub(r"\_", " _ ", string)  # _100 --> _ 100
    string = re.sub(r"\s{2,}", " ", string)  # collapse runs of whitespace into one space
    return string.strip().lower()  # lowercase
def customized_read_words(input_fpath):  # , dictionary):
    with open(input_fpath, mode='r', encoding='utf-8') as f:
        words = f.read()
    words = customized_clean_str(words)
    return words.split()
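re.sub() is the regular-expression substitution function in Python's re module. Put simply, it replaces whatever matches a pattern; regular expressions themselves are a deep topic.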
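To make the substitutions concrete, here is a minimal sketch that applies a handful of the same patterns to a made-up sentence. The helper name `demo_clean` and the sample strings are my own illustration, not part of stringclean64.py.

```python
import re

def demo_clean(text):
    """Apply an illustrative subset of the substitutions from stringclean64.py."""
    text = re.sub(r"\'ve", " have", text)   # they've --> they have
    text = re.sub(r"\'t", " not", text)     # can't --> can not
    text = re.sub(r"\'ll", " will", text)   # I'll --> I will
    text = re.sub(r"\.", " . ", text)       # split periods off as separate tokens
    text = re.sub(r"\,", " , ", text)       # split commas off as separate tokens
    text = re.sub(r"\s{2,}", " ", text)     # collapse runs of whitespace
    return text.strip().lower()

print(demo_clean("They've won, and I'll win."))   # they have won , and i will win .
print(demo_clean("We don't quit."))               # we don not quit .
```

The second call shows a side effect of the `'t` rule: it works for "can't" but turns "don't" into "don not", which is why "don not" shows up in the generated text further down.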
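Step 2:
After tidying up the code, a trial run failed with an NLTK resource error (the full message is reproduced below).
For the fix, see https://blog.csdn.net/quiet_girl/article/details/72604691
In short, download the resource first, right after importing nltk: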
import nltk
nltk.download()
exit()
import time
import numpy as np
import tensorflow as tf
import tensorlayer as tl
from tensorlayer.layers import *
from stringclean64 import *
**********************************************************************
Resource 'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- 'C:\\Users\\Administrator/nltk_data'
- 'C:\\nltk_data'
- 'D:\\nltk_data'
- 'E:\\nltk_data'
- 'C:\\Program Files\\Anaconda3\\nltk_data'
- 'C:\\Program Files\\Anaconda3\\lib\\nltk_data'
- 'C:\\Users\\Administrator\\AppData\\Roaming\\nltk_data'
- ''
**********************************************************************
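Change the Download Directory to the one under Anaconda's site-packages. The only thing that really needs installing is punkt, listed under MODELS. I did not know that at first and naively downloaded everything! The full download should be over 500 MB, and if your connection to overseas servers is slow you will be waiting forever.
Finally, do not forget to add the directory to the environment variables, and it is best to restart the computer as well, otherwise you get the same error! This home computer of mine is shockingly decrepit!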
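If you only need the tokenizer, a sketch like the following avoids both the GUI and the full download. The helper name `ensure_punkt` is my own. The resource name `punkt` is taken from the error message above, so other NLTK versions may ask for a differently named resource.

```python
def ensure_punkt():
    """Download only the 'punkt' tokenizer models if NLTK cannot find them."""
    import nltk  # imported here so that defining the helper does not itself require NLTK

    try:
        # Same resource family that the error message above complains about.
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        # Fetches just this one package instead of the whole collection.
        nltk.download('punkt')
```

Calling `ensure_punkt()` once before `nltk.tokenize.word_tokenize` should be enough, assuming the machine can reach the NLTK download server.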
Step 3:
The complete code is as follows:
import nltk
# nltk.download()
# exit()
import time
import numpy as np
import tensorflow as tf
import tensorlayer as tl
from tensorlayer.layers import *
from stringclean64 import *
init_scale = 0.1
learning_rate = 1.0
max_grad_norm = 5
sequence_length = 20
hidden_size = 200
max_epoch = 4
max_max_epoch = 100
lr_decay = 0.9
batch_size = 20
top_k_list = [1, 3, 5, 10]
print_length = 30
model_file_name = "641model_generate_text.npz"
print('~~~~~~~~~第一層~~~~~~~~~~')
words = customized_read_words(input_fpath="trump_text.txt")
vocab = tl.nlp.create_vocab([words], word_counts_output_file='vocab.txt', min_word_count=1)
vocab = tl.nlp.Vocabulary('vocab.txt', unk_word="<UNK>")
vocab_size = vocab.unk_id + 1
train_data = [vocab.word_to_id(word) for word in words]
seed = "it is a"
# seed = basic_clean_str(seed).split()
seed = nltk.tokenize.word_tokenize(seed)
print('seed : %s' % seed)
sess = tf.InteractiveSession()
print('~~~~~~~~~~~~第二層~~~~~~~~~~~~~')
input_data = tf.placeholder(tf.int32, [batch_size, sequence_length])
targets = tf.placeholder(tf.int32, [batch_size, sequence_length])
input_data_test = tf.placeholder(tf.int32, [1, 1])
def inference(x, is_train, sequence_length, reuse=None):
    print("\nsequence_length: %d, is_train: %s, reuse: %s" % (sequence_length, is_train, reuse))
    rnn_init = tf.random_uniform_initializer(-init_scale, init_scale)
    with tf.variable_scope("model", reuse=reuse):
        tl.layers.set_name_reuse(reuse)
        network = EmbeddingInputlayer(inputs=x, vocabulary_size=vocab_size, embedding_size=hidden_size, E_init=rnn_init, name='embedding')
        network = RNNLayer(network, cell_fn=tf.contrib.rnn.BasicLSTMCell, cell_init_args={
            'forget_bias': 0.0, 'state_is_tuple': True}, n_hidden=hidden_size, initializer=rnn_init,
            n_steps=sequence_length, return_last=False, return_seq_2d=True, name='lstm1')
        lstm1 = network
        network = DenseLayer(network, n_units=vocab_size, W_init=rnn_init, b_init=rnn_init, act=tf.identity, name='output')
    return network, lstm1
network, lstm1 = inference(input_data, is_train=True, sequence_length=sequence_length, reuse=None)
network_test, lstm1_test = inference(input_data_test, is_train=False, sequence_length=1, reuse=True)
print('~~~~~~~~~~~~第三層~~~~~~~~~~~~~')
y_linear = network_test.outputs
y_soft = tf.nn.softmax(y_linear)
# y_id = tf.argmax(tf.nn.softmax(y), 1)
print('~~~~~~~~~~~6.4.3損失函數~~~~~~~~~~~~~~')
def loss_fn(outputs, targets, batch_size, sequence_length):
    loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example([outputs], [tf.reshape(targets, [-1])], [tf.ones([batch_size * sequence_length])])
    cost = tf.reduce_sum(loss) / batch_size
    return cost
cost = loss_fn(network.outputs, targets, batch_size, sequence_length)
print('~~~~~~~~~~~定義優化器~~~~~~~~~~~~~~~~')
with tf.variable_scope('learning_rate'):
    lr = tf.Variable(0.0, trainable=False)
# You can get all trainable parameters as follows.
# tvars = tf.trainable_variables()
# tvars = network.all_params  # all parameters
# tvars = network.all_params[1:]  # parameters except embedding matrix
tvars = network.all_params
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(lr)
train_op = optimizer.apply_gradients(zip(grads, tvars))
print('~~~~~~~~~Start learning a model to generate text~~~~~~~~~~~~')
tl.layers.initialize_global_variables(sess)
for i in range(max_max_epoch):
    new_lr_decay = lr_decay**max(i - max_epoch, 0.0)
    sess.run(tf.assign(lr, learning_rate * new_lr_decay))
    print("---Epoch: %d/%d Learning rate: %.8f" % (i + 1, max_max_epoch, sess.run(lr)))
    epoch_size = ((len(train_data) // batch_size) - 1) // sequence_length
    start_time = time.time()
    costs = 0.0
    iters = 0
    state1 = tl.layers.initialize_rnn_state(lstm1.initial_state)
    for step, (x, y) in enumerate(tl.iterate.ptb_iterator(train_data, batch_size, sequence_length)):
        _cost, state1, _ = sess.run([cost, lstm1.final_state, train_op],
                                    feed_dict={input_data: x, targets: y, lstm1.initial_state: state1,})
        costs += _cost
        iters += sequence_length
        if step % (epoch_size // 10) == 1:
            print("%.3f perplexity: %.3f speed: %.0f wps" % (step * 1.0 / epoch_size, np.exp(costs / iters),
                                                             iters * batch_size / (time.time() - start_time)))
    train_perplexity = np.exp(costs / iters)
    print("***Epoch: %d/%d Train Perplexity: %.3f" % (i + 1, max_max_epoch, train_perplexity))
    for top_k in top_k_list:
        state1 = tl.layers.initialize_rnn_state(lstm1_test.initial_state)
        # state2 = tl.layers.initialize_rnn_state(lstm2_test.initial_state)
        outs_id = [vocab.word_to_id(w) for w in seed]
        for ids in outs_id[:-1]:
            a_id = np.asarray(ids).reshape(1, 1)
            state1 = sess.run([lstm1_test.final_state,],
                              feed_dict={input_data_test: a_id, lstm1_test.initial_state: state1,})
        a_id = outs_id[-1]
        for _ in range(print_length):
            a_id = np.asarray(a_id).reshape(1, 1)
            out, state1 = sess.run(
                [y_soft, lstm1_test.final_state], feed_dict={
                    input_data_test: a_id,
                    lstm1_test.initial_state: state1,})
            # a_id = np.argmax(out[0])
            ## Sample from all words, if vocab_size is large, this may have numeric error.
            # a_id = tl.nlp.sample(out[0], diversity)
            a_id = tl.nlp.sample_top(out[0], top_k=top_k)
            outs_id.append(a_id)
        sentence = [vocab.id_to_word(w) for w in outs_id]
        sentence = " ".join(sentence)
        # print(diversity, ':', sentence)
        print(top_k, ':', sentence)
print("Save model")
tl.files.save_npz(network_test.all_params, name=model_file_name)
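The generation loop picks each next word with tl.nlp.sample_top(out[0], top_k=top_k). The sketch below is my own plain-Python illustration of the top-k idea, not TensorLayer's implementation: keep only the k most probable words and draw one of them in proportion to its probability. The name `sample_top_k` and the toy distribution are assumptions for the example.

```python
import random

def sample_top_k(probs, top_k, rng=random):
    """Pick an index among the top_k most probable entries, weighted by probability."""
    # Indices sorted from most to least probable; keep the first top_k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in ranked]
    # random.choices treats the weights as relative, so no explicit renormalisation is needed.
    return rng.choices(ranked, weights=weights, k=1)[0]

probs = [0.05, 0.50, 0.30, 0.10, 0.05]   # toy softmax output over a 5-word vocabulary
print(sample_top_k(probs, top_k=1))      # always 1, the most probable index
print(sample_top_k(probs, top_k=3))      # one of 1, 2 or 3
```

With top_k=1 the choice is purely greedy, which probably explains why the top_k=1 lines in the logs below tend to fall into loops such as "to get the people to get the people".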
This run was on my clunky home computer with max_max_epoch = 2, ha. I am recording it here for now; tomorrow I will run max_max_epoch = 100 on my office computer for comparison:
~~~~~~~~~第一層~~~~~~~~~~
[TL] Creating vocabulary.
[TL] Total words: 9798
[TL] Words in vocabulary: 9799
[TL] Wrote vocabulary file: vocab.txt
INFO:tensorflow:Initializing vocabulary from file: vocab.txt
[TL] Initializing vocabulary from file: vocab.txt
[TL] Vocabulary from vocab.txt : <S> </S> <UNK>
[TL] vocabulary with 9802 words (includes start_word, end_word, unk_word)
[TL] start_id: 9799
[TL] end_id : 9800
[TL] unk_id : 9801
[TL] pad_id : 0
seed : ['it', 'is', 'a']
~~~~~~~~~~~~第二層~~~~~~~~~~~~~
sequence_length: 20, is_train: True, reuse: None
[TL] EmbeddingInputlayer model/embedding: (9802, 200)
[TL] RNNLayer model/lstm1: n_hidden:200 n_steps:20 in_dim:3 in_shape:(20, 20, 200) cell_fn:BasicLSTMCell
[TL] RNN batch_size (concurrent processes): 20
[TL] n_params : 2
[TL] DenseLayer model/output: 9802 identity
sequence_length: 1, is_train: False, reuse: True
[TL] EmbeddingInputlayer model/embedding: (9802, 200)
[TL] RNNLayer model/lstm1: n_hidden:200 n_steps:1 in_dim:3 in_shape:(1, 1, 200) cell_fn:BasicLSTMCell
[TL] RNN batch_size (concurrent processes): 1
[TL] n_params : 2
[TL] DenseLayer model/output: 9802 identity
~~~~~~~~~~~~第三層~~~~~~~~~~~~~
~~~~~~~~~~~6.4.3損失函數~~~~~~~~~~~~~~
~~~~~~~~~~~定義優化器~~~~~~~~~~~~~~~~
~~~~~~~~~Start learning a model to generate text~~~~~~~~~~~~
---Epoch: 1/2 Learning rate: 1.00000000
0.002 perplexity: 8709.861 speed: 431 wps
0.100 perplexity: 1625.887 speed: 622 wps
0.199 perplexity: 1037.599 speed: 631 wps
0.297 perplexity: 765.932 speed: 633 wps
0.395 perplexity: 625.504 speed: 644 wps
0.493 perplexity: 534.073 speed: 652 wps
0.592 perplexity: 480.487 speed: 656 wps
0.690 perplexity: 438.889 speed: 659 wps
0.788 perplexity: 404.562 speed: 661 wps
0.886 perplexity: 375.370 speed: 663 wps
0.985 perplexity: 351.889 speed: 664 wps
***Epoch: 1/2 Train Perplexity: 349.752
1 : it is a long time to be a long time to the people , and i would want to do that , we can not do that , we can make america great
3 : it is a long time . i want a long , i would want to the people , i would like the deal , and i would want to get the people that
5 : it is a much thing to the people , i want to do that ' s , the people are going to make america ' s time we want to the way ,
10 : it is a very long time to the people . i would know with the people , the way we are going to be a deal to receive the country , and he
---Epoch: 2/2 Learning rate: 1.00000000
0.002 perplexity: 261.161 speed: 657 wps
0.100 perplexity: 194.437 speed: 681 wps
0.199 perplexity: 187.006 speed: 673 wps
0.297 perplexity: 178.360 speed: 674 wps
0.395 perplexity: 171.464 speed: 673 wps
0.493 perplexity: 165.598 speed: 674 wps
0.592 perplexity: 164.097 speed: 675 wps
0.690 perplexity: 160.013 speed: 675 wps
0.788 perplexity: 155.866 speed: 675 wps
0.886 perplexity: 151.875 speed: 674 wps
0.985 perplexity: 148.404 speed: 673 wps
***Epoch: 2/2 Train Perplexity: 148.207
1 : it is a great honor to be a great deal , but we will make america great again . we need to get the people to get the people to get the people
3 : it is a long time . and i will do that . i will do that . and i will say that . we will make a deal , and i would like
5 : it is a great time for the people . the people can not do that . we need people to get rid of the oil . the people can make you back the
10 : it is a good deal for the people who can do , the people , but i will do something that . i will tell you that the time . we have people
Save model
[TL] [*] 641model_generate_text.npz saved
[Finished in 633.2s]
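The perplexity column in the log above is np.exp(costs / iters), which is the exponential of the average per-word cross-entropy. As a rough sketch of what that number means (the function name `perplexity` and the toy probabilities are mine), a model that guesses uniformly over the 9802-word vocabulary has a perplexity equal to the vocabulary size, which is about where the first logged value of 8709.861 starts out.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the correct words."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

vocab_size = 9802                        # from the vocabulary log above
uniform = [1.0 / vocab_size] * 20        # a model that guesses uniformly for 20 words
print(perplexity(uniform))               # roughly 9802, i.e. the vocabulary size
print(perplexity([0.5, 0.25, 0.125]))    # roughly 4, the geometric mean of 2, 4 and 8
```

Read this way, a training perplexity of 148 after two epochs means the model is, on average, about as uncertain as if it were choosing among 148 equally likely words.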
The office computer is somewhat better, but the full run still took nearly four hours...
1 : it is a very successful possibility . it ' s going to be built , a very real number , or a very good all . well , i will tell you what
3 : it is a choice between israel and , by the way , to make the incredible potential of the best available to deal , and it ' s a very sad line of
5 : it is a mess . it ' s a total piece . they love america , but the way — they are trying to come to other countries like that . we want
10 : it is a choice between a politician about israel . we have lost everything . we are going to bring jobs back from japan , i will say we want to keep them
---Epoch: 26/100 Learning rate: 0.10941899
0.002 perplexity: 6.456 speed: 1347 wps
0.334 perplexity: 5.350 speed: 1349 wps
0.666 perplexity: 5.138 speed: 1349 wps
0.998 perplexity: 4.887 speed: 1348 wps
***Epoch: 26/100 Train Perplexity: 4.887
1 : it is a mess . if we don not get tough and repeal obamacare , we have no choice . we will have a country of borders , we can be leading in
3 : it is a disaster for two or good , but they know exactly , but in my opinion , that ' s the kind of thinking this country needs , is to get
5 : it is a threat to israel . that ' s what you are saying . these are the people who don not need it . so i have got a call from him
10 : it is a choice . we have a president who understands what would do is because our leaders no longer use the wealth to them all for countries that is good for our
---Epoch: 51/100 Learning rate: 0.00785517
0.002 perplexity: 4.847 speed: 1350 wps
0.334 perplexity: 4.172 speed: 1348 wps
0.666 perplexity: 3.986 speed: 1347 wps
0.998 perplexity: 3.789 speed: 1346 wps
***Epoch: 51/100 Train Perplexity: 3.789
1 : it is a mess . if i ' m elected president , we will win the united states as a nation . we will have more than any other candidate ; we can
3 : it is a disgrace and an embarrassment . but we need people to think i will do that . i will do everything to bring all the border . we are not our
5 : it is a mess . and we need smart people . we have no idea who the people who can not understand how bad things . as for the very beginning we need
10 : it is a threat to our economy . that ' s what americans are talking about . and if i ' m president and all of these sudden why that are president ,
---Epoch: 76/100 Learning rate: 0.00056392
0.002 perplexity: 4.665 speed: 1347 wps
0.334 perplexity: 4.004 speed: 1360 wps
0.666 perplexity: 3.826 speed: 1360 wps
0.998 perplexity: 3.645 speed: 1360 wps
***Epoch: 76/100 Train Perplexity: 3.645
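The learning rates printed in these logs follow directly from the hyperparameters in the script (learning_rate = 1.0, lr_decay = 0.9, max_epoch = 4): the rate stays at 1.0 for the first few epochs and is then multiplied by 0.9 for every further epoch. A small sketch that recomputes it without TensorFlow (the helper name `lr_for_epoch` is mine):

```python
learning_rate = 1.0
lr_decay = 0.9
max_epoch = 4

def lr_for_epoch(i):
    """Learning rate for zero-based epoch index i, mirroring the loop in the script."""
    return learning_rate * lr_decay ** max(i - max_epoch, 0)

for epoch in (1, 2, 26, 51, 76):   # epoch numbers as printed in the logs (one-based)
    print(epoch, "%.8f" % lr_for_epoch(epoch - 1))
# 1 1.00000000
# 2 1.00000000
# 26 0.10941899
# 51 0.00785517
# 76 0.00056392
```

By epoch 76 the step size is already below 0.001, which fits the way the training perplexity barely moves between epoch 51 (3.789) and epoch 76 (3.645).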
Final output after 100 epochs:
1 : it is a mess . if i ' m elected president , we will win the united states as a nation . we will have more than any other candidate ; we can
3 : it is a disgrace and it is a honor to receive endorsement . i am the only person on this dais i ' m very proud of the question . well , i
5 : it is a disgrace and real estate . i ' m the only one that ' s going to be able to keep the polls . i am , and , frankly ,
10 : it is a good or good , great wall . i can not imagine what we have to take on that no oil and we can go at zero . and as far
Save model
[TL] [*] 641model_generate_text.npz saved
[Finished in 13881.3s]
13881/3600 ≈ 3.86 hours! Overall, the occasional sentence comes out quite accurate, but in others the meaning is rather muddled.