TensorLayer Study Log 16: Chapter 6, Section 6.4

 Finally on to Chapter 6, and I fell into a pit right at the start: what on earth is NLTK?! I've read a fair few TensorFlow books, and the funny thing is that every one of them covered RNNs, yet not a single one ever mentioned NLTK. Shoddy little books indeed; for every newcomer who goes from "getting started" straight to "giving up", those books deserve at least half the credit!!!

I had planned to split this chapter across several posts, but the code runs in one straight line from start to finish, so the whole of Section 6.4 goes into one big pot here, and honestly it is a lot to cook through.

Step 1:

First, pull the text-preprocessing code out into its own file so it stays out of the way. I just named it stringclean64.py and import it whenever it is needed.

The Trump text I simply copy-pasted from GitHub, because I did not know how to download it directly: https://github.com/tensorlayer/tensorlayer/tree/master/example/data/trump . Since I saved it as a UTF-8 text file, the line with open(input_fpath, mode='r', encoding='utf-8') as f: has to include encoding='utf-8', otherwise reading the file throws an error.
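Just to illustrate the failure mode (this is an assumption on my part: on a Chinese-locale Windows install, open() defaults to the gbk/cp936 codec, so reading a UTF-8 file without the argument usually dies with a UnicodeDecodeError). trump_text.txt is the file name used in the script further down:

# text = open('trump_text.txt').read()   # on Windows this can raise UnicodeDecodeError ('gbk' codec ...)
with open('trump_text.txt', mode='r', encoding='utf-8') as f:
    text = f.read()
print(text[:80])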

import re

def basic_clean_str(string):
    string = re.sub(r"\n", " ", string)  # '\n'      --> ' '
    string = re.sub(r"\'s", " \'s", string)  # it's      --> it 's
    string = re.sub(r"\'ve", " have", string)  # they've   --> they have
    string = re.sub(r"\’ve", " have", string)
    string = re.sub(r"\'t", " not", string)  # can't     --> can not
    string = re.sub(r"\’t", " not", string)
    string = re.sub(r"\'re", " are", string)  # they're   --> they are
    string = re.sub(r"\’re", " are", string)
    string = re.sub(r"\'d", "", string)  # I'd (I had, I would) --> I
    string = re.sub(r"\’d", "", string)
    string = re.sub(r"\'ll", " will", string)  # I'll      --> I will
    string = re.sub(r"\’ll", " will", string)
    string = re.sub(r"\“", "  ", string)  # “a”       --> “ a ”
    string = re.sub(r"\”", "  ", string)
    string = re.sub(r"\"", "  ", string)  # "a"       --> " a "
    string = re.sub(r"\'", "  ", string)  # they'     --> they '
    string = re.sub(r"\’", "  ", string)  # they’     --> they ’
    string = re.sub(r"\.", " . ", string)  # they.     --> they .
    string = re.sub(r"\,", " , ", string)  # they,     --> they ,
    string = re.sub(r"\!", " ! ", string)
    string = re.sub(r"\-", "  ", string)  # "low-cost"--> lost cost
    string = re.sub(r"\(", "  ", string)  # (they)    --> ( they)
    string = re.sub(r"\)", "  ", string)  # ( they)   --> ( they )
    string = re.sub(r"\]", "  ", string)  # they]     --> they ]
    string = re.sub(r"\[", "  ", string)  # they[     --> they [
    string = re.sub(r"\?", "  ", string)  # they?     --> they ?
    string = re.sub(r"\>", "  ", string)  # they>     --> they >
    string = re.sub(r"\<", "  ", string)  # they<     --> they <
    string = re.sub(r"\=", "  ", string)  # easier=   --> easier =
    string = re.sub(r"\;", "  ", string)  # easier;   --> easier ;
    string = re.sub(r"\;", "  ", string)
    string = re.sub(r"\:", "  ", string)  # easier:   --> easier :
    string = re.sub(r"\"", "  ", string)  # easier"   --> easier "
    string = re.sub(r"\$", "  ", string)  # $380      --> $ 380
    string = re.sub(r"\_", "  ", string)  # _100     --> _ 100
    string = re.sub(r"\s{2,}", " ", string)  # Akara is    handsome --> Akara is handsome
    return string.strip().lower()  # lowercase

def customized_clean_str(string):
    string = re.sub(r"\n", " ", string)  # '\n'      --> ' '
    string = re.sub(r"\'s", " \'s", string)  # it's      --> it 's
    string = re.sub(r"\’s", " \'s", string)
    string = re.sub(r"\'ve", " have", string)  # they've   --> they have
    string = re.sub(r"\’ve", " have", string)
    string = re.sub(r"\'t", " not", string)  # can't     --> can not
    string = re.sub(r"\’t", " not", string)
    string = re.sub(r"\'re", " are", string)  # they're   --> they are
    string = re.sub(r"\’re", " are", string)
    string = re.sub(r"\'d", "", string)  # I'd (I had, I would) --> I
    string = re.sub(r"\’d", "", string)
    string = re.sub(r"\'ll", " will", string)  # I'll      --> I will
    string = re.sub(r"\’ll", " will", string)
    string = re.sub(r"\“", " “ ", string)  # “a”       --> “ a ”
    string = re.sub(r"\”", " ” ", string)
    string = re.sub(r"\"", " “ ", string)  # "a"       --> " a "
    string = re.sub(r"\'", " ' ", string)  # they'     --> they '
    string = re.sub(r"\’", " ' ", string)  # they’     --> they '
    string = re.sub(r"\.", " . ", string)  # they.     --> they .
    string = re.sub(r"\,", " , ", string)  # they,     --> they ,
    string = re.sub(r"\-", " ", string)  # "low-cost"--> lost cost
    string = re.sub(r"\(", " ( ", string)  # (they)    --> ( they)
    string = re.sub(r"\)", " ) ", string)  # ( they)   --> ( they )
    string = re.sub(r"\!", " ! ", string)  # they!     --> they !
    string = re.sub(r"\]", " ] ", string)  # they]     --> they ]
    string = re.sub(r"\[", " [ ", string)  # they[     --> they [
    string = re.sub(r"\?", " ? ", string)  # they?     --> they ?
    string = re.sub(r"\>", " > ", string)  # they>     --> they >
    string = re.sub(r"\<", " < ", string)  # they<     --> they <
    string = re.sub(r"\=", " = ", string)  # easier=   --> easier =
    string = re.sub(r"\;", " ; ", string)  # easier;   --> easier ;
    string = re.sub(r"\;", " ; ", string)
    string = re.sub(r"\:", " : ", string)  # easier:   --> easier :
    string = re.sub(r"\"", " \" ", string)  # easier"   --> easier "
    string = re.sub(r"\$", " $ ", string)  # $380      --> $ 380
    string = re.sub(r"\_", " _ ", string)  # _100     --> _ 100
    string = re.sub(r"\s{2,}", " ", string)  # Akara is    handsome --> Akara is handsome
    return string.strip().lower()  # lowercase

def customized_read_words(input_fpath):  #, dictionary):
    with open(input_fpath, mode='r', encoding='utf-8') as f:
        words = f.read()
    words = customized_clean_str(words)
    return words.split()

re.sub() is Python's regular-expression substitution function: simply put, it replaces whatever matches the pattern, although regular expressions go a lot deeper than that.
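For example, here is a minimal sketch of what a chain of these substitutions does to one sentence (the same idea as customized_clean_str above, just shorter):

import re

s = "They're low-cost!"
s = re.sub(r"\'re", " are", s)   # "They're"  --> "They are"
s = re.sub(r"\-", " ", s)        # "low-cost" --> "low cost"
s = re.sub(r"\!", " ! ", s)      # pad punctuation with spaces
s = re.sub(r"\s{2,}", " ", s)    # collapse runs of whitespace
print(s.strip().lower())         # -> "they are low cost !"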

Step 2:

After assembling the code, my first test run with these imports died with the error below:

import nltk
import time
import numpy as np
import tensorflow as tf
import tensorlayer as tl
from tensorlayer.layers import *
from stringclean64 import *
**********************************************************************
  Resource 'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - 'C:\\Users\\Administrator/nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'C:\\Program Files\\Anaconda3\\nltk_data'
    - 'C:\\Program Files\\Anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\Administrator\\AppData\\Roaming\\nltk_data'
    - ''
**********************************************************************

The fix (see https://blog.csdn.net/quiet_girl/article/details/72604691) is just to download the missing resource once, before running anything else:

import nltk
nltk.download()
exit()

In the NLTK downloader window, change the Download Directory to Anaconda's site-packages directory. The only thing this script actually needs is punkt under MODELS; I did not know that at first and naively downloaded everything, which is over 500 MB, and on a slow connection to the outside world you would be waiting forever.

Finally, do not forget to add the directory to the environment variables, and ideally reboot as well, otherwise the same error comes back. My home computer really is shockingly decrepit!!
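Instead of clicking through the GUI and grabbing everything, you can (as far as I know) fetch just punkt from code and register the directory at runtime; a minimal sketch, and the path below is only an example taken from one of the search paths in the error message:

import nltk
# download only the punkt tokenizer models into a directory NLTK already searches
nltk.download('punkt', download_dir=r'C:\Program Files\Anaconda3\nltk_data')
# or register any custom directory without touching environment variables
nltk.data.path.append(r'C:\Program Files\Anaconda3\nltk_data')
print(nltk.tokenize.word_tokenize("it is a test"))   # ['it', 'is', 'a', 'test']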

Step 3:

The full code:

import nltk
# nltk.download()
# exit()
import time
import numpy as np
import tensorflow as tf
import tensorlayer as tl
from tensorlayer.layers import *
from stringclean64 import *

init_scale = 0.1         # range of the uniform weight initializer
learning_rate = 1.0      # initial learning rate
max_grad_norm = 5        # clip gradients to this global norm
sequence_length = 20     # number of unrolled LSTM steps per training example
hidden_size = 200        # LSTM hidden units (also used as the embedding size)
max_epoch = 4            # epochs trained at the full learning rate before decay starts
max_max_epoch = 100      # total number of training epochs
lr_decay = 0.9           # learning-rate decay applied per epoch after max_epoch
batch_size = 20

top_k_list = [1, 3, 5, 10]   # top-k values tried when sampling generated text
print_length = 30            # number of words to generate for each sample

model_file_name = "641model_generate_text.npz"

print('~~~~~~~~~第一層~~~~~~~~~~')
words = customized_read_words(input_fpath="trump_text.txt")

vocab = tl.nlp.create_vocab([words], word_counts_output_file='vocab.txt', min_word_count=1)
vocab = tl.nlp.Vocabulary('vocab.txt', unk_word="<UNK>")
vocab_size = vocab.unk_id + 1
train_data = [vocab.word_to_id(word) for word in words]

seed = "it is a"
# seed = basic_clean_str(seed).split()
seed = nltk.tokenize.word_tokenize(seed)
print('seed : %s' % seed)

sess = tf.InteractiveSession()

print('~~~~~~~~~~~~第二層~~~~~~~~~~~~~')

input_data = tf.placeholder(tf.int32, [batch_size, sequence_length])
targets = tf.placeholder(tf.int32, [batch_size, sequence_length])

input_data_test = tf.placeholder(tf.int32, [1, 1])

def inference(x, is_train, sequence_length, reuse=None):
    print("\nsequence_length: %d, is_train: %s, reuse: %s" % (sequence_length, is_train, reuse))
    rnn_init = tf.random_uniform_initializer(-init_scale, init_scale)
    with tf.variable_scope("model", reuse=reuse):
        tl.layers.set_name_reuse(reuse)
        network = EmbeddingInputlayer(inputs=x, vocabulary_size=vocab_size, embedding_size=hidden_size,
                                      E_init=rnn_init, name='embedding')
        network = RNNLayer(network, cell_fn=tf.contrib.rnn.BasicLSTMCell,
                           cell_init_args={'forget_bias': 0.0, 'state_is_tuple': True},
                           n_hidden=hidden_size, initializer=rnn_init,
                           n_steps=sequence_length, return_last=False, return_seq_2d=True, name='lstm1')
        lstm1 = network
        network = DenseLayer(network, n_units=vocab_size, W_init=rnn_init, b_init=rnn_init,
                             act=tf.identity, name='output')
    return network, lstm1

network, lstm1 = inference(input_data, is_train=True, sequence_length=sequence_length, reuse=None)

network_test, lstm1_test = inference(input_data_test, is_train=False, sequence_length=1, reuse=True)

print('~~~~~~~~~~~~第三層~~~~~~~~~~~~~')

y_linear = network_test.outputs
y_soft = tf.nn.softmax(y_linear)
# y_id = tf.argmax(tf.nn.softmax(y), 1)

print('~~~~~~~~~~~6.4.3損失函數~~~~~~~~~~~~~~')

def loss_fn(outputs, targets, batch_size, sequence_length):
    loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example([outputs], [tf.reshape(targets, [-1])], [tf.ones([batch_size * sequence_length])])
    cost = tf.reduce_sum(loss) / batch_size
    return cost

cost = loss_fn(network.outputs, targets, batch_size, sequence_length)

print('~~~~~~~~~~~定義優化器~~~~~~~~~~~~~~~~')

with tf.variable_scope('learning_rate'):
    lr = tf.Variable(0.0, trainable=False)
# You can get all trainable parameters as follows:
# tvars = tf.trainable_variables()
# tvars = network.all_params      # all parameters
# tvars = network.all_params[1:]  # all parameters except the embedding matrix
tvars = network.all_params
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(lr)
train_op = optimizer.apply_gradients(zip(grads, tvars))

print('~~~~~~~~~Start learning a model to generate text~~~~~~~~~~~~')

tl.layers.initialize_global_variables(sess)

for i in range(max_max_epoch):
    new_lr_decay = lr_decay**max(i - max_epoch, 0.0)
    sess.run(tf.assign(lr, learning_rate * new_lr_decay))

    print("---Epoch: %d/%d Learning rate: %.8f" % (i + 1, max_max_epoch, sess.run(lr)))
    epoch_size = ((len(train_data) // batch_size) - 1) // sequence_length

    start_time = time.time()
    costs = 0.0
    iters = 0
    state1 = tl.layers.initialize_rnn_state(lstm1.initial_state)
    for step, (x, y) in enumerate(tl.iterate.ptb_iterator(train_data, batch_size, sequence_length)):
        _cost, state1, _ = sess.run([cost, lstm1.final_state, train_op], 
        	feed_dict={input_data: x, targets: y,  lstm1.initial_state: state1,})
        costs += _cost
        iters += sequence_length

        if step % (epoch_size // 10) == 1:
            print("%.3f perplexity: %.3f speed: %.0f wps" % (step * 1.0 / epoch_size, np.exp(costs / iters),
                                                             iters * batch_size / (time.time() - start_time)))

    train_perplexity = np.exp(costs / iters)
    print("***Epoch: %d/%d Train Perplexity: %.3f" % (i + 1, max_max_epoch, train_perplexity))


    for top_k in top_k_list:
        state1 = tl.layers.initialize_rnn_state(lstm1_test.initial_state)
        # state2 = tl.layers.initialize_rnn_state(lstm2_test.initial_state)
        outs_id = [vocab.word_to_id(w) for w in seed]
        for ids in outs_id[:-1]:
            a_id = np.asarray(ids).reshape(1, 1)
            state1 = sess.run([lstm1_test.final_state,], 
            	feed_dict={input_data_test: a_id,lstm1_test.initial_state: state1,})

        a_id = outs_id[-1]
        for _ in range(print_length):
            a_id = np.asarray(a_id).reshape(1, 1)
            out, state1 = sess.run(
                [y_soft, lstm1_test.final_state], feed_dict={
                    input_data_test: a_id,
                    lstm1_test.initial_state: state1,})
            # a_id = np.argmax(out[0])
            ## Sample from all words, if vocab_size is large, this may have numeric error.
            # a_id = tl.nlp.sample(out[0], diversity)
            a_id = tl.nlp.sample_top(out[0], top_k=top_k)
            outs_id.append(a_id)
        sentence = [vocab.id_to_word(w) for w in outs_id]
        sentence = " ".join(sentence)
        # print(diversity, ':', sentence)
        print(top_k, ':', sentence)

print("Save model")
tl.files.save_npz(network_test.all_params, name=model_file_name)  

This run was on my beat-up home computer with max_max_epoch = 2; I am just logging it here first, and tomorrow I will rerun it with max_max_epoch = 100 on the work computer for comparison:

~~~~~~~~~第一層~~~~~~~~~~
[TL] Creating vocabulary.
[TL]     Total words: 9798
[TL]     Words in vocabulary: 9799
[TL]     Wrote vocabulary file: vocab.txt
INFO:tensorflow:Initializing vocabulary from file: vocab.txt
[TL] Initializing vocabulary from file: vocab.txt
[TL] Vocabulary from vocab.txt : <S> </S> <UNK>
[TL]     vocabulary with 9802 words (includes start_word, end_word, unk_word)
[TL]       start_id: 9799
[TL]       end_id  : 9800
[TL]       unk_id  : 9801
[TL]       pad_id  : 0
seed : ['it', 'is', 'a']
~~~~~~~~~~~~第二層~~~~~~~~~~~~~

sequence_length: 20, is_train: True, reuse: None
[TL] EmbeddingInputlayer model/embedding: (9802, 200)
[TL] RNNLayer model/lstm1: n_hidden:200 n_steps:20 in_dim:3 in_shape:(20, 20, 200) cell_fn:BasicLSTMCell 
[TL]        RNN batch_size (concurrent processes): 20
[TL]      n_params : 2
[TL] DenseLayer  model/output: 9802 identity

sequence_length: 1, is_train: False, reuse: True
[TL] EmbeddingInputlayer model/embedding: (9802, 200)
[TL] RNNLayer model/lstm1: n_hidden:200 n_steps:1 in_dim:3 in_shape:(1, 1, 200) cell_fn:BasicLSTMCell 
[TL]        RNN batch_size (concurrent processes): 1
[TL]      n_params : 2
[TL] DenseLayer  model/output: 9802 identity
~~~~~~~~~~~~第三層~~~~~~~~~~~~~
~~~~~~~~~~~6.4.3損失函數~~~~~~~~~~~~~~
~~~~~~~~~~~定義優化器~~~~~~~~~~~~~~~~
~~~~~~~~~Start learning a model to generate text~~~~~~~~~~~~
---Epoch: 1/2 Learning rate: 1.00000000
0.002 perplexity: 8709.861 speed: 431 wps
0.100 perplexity: 1625.887 speed: 622 wps
0.199 perplexity: 1037.599 speed: 631 wps
0.297 perplexity: 765.932 speed: 633 wps
0.395 perplexity: 625.504 speed: 644 wps
0.493 perplexity: 534.073 speed: 652 wps
0.592 perplexity: 480.487 speed: 656 wps
0.690 perplexity: 438.889 speed: 659 wps
0.788 perplexity: 404.562 speed: 661 wps
0.886 perplexity: 375.370 speed: 663 wps
0.985 perplexity: 351.889 speed: 664 wps
***Epoch: 1/2 Train Perplexity: 349.752
1 : it is a long time to be a long time to the people , and i would want to do that , we can not do that , we can make america great
3 : it is a long time . i want a long , i would want to the people , i would like the deal , and i would want to get the people that
5 : it is a much thing to the people , i want to do that ' s , the people are going to make america ' s time we want to the way ,
10 : it is a very long time to the people . i would know with the people , the way we are going to be a deal to receive the country , and he
---Epoch: 2/2 Learning rate: 1.00000000
0.002 perplexity: 261.161 speed: 657 wps
0.100 perplexity: 194.437 speed: 681 wps
0.199 perplexity: 187.006 speed: 673 wps
0.297 perplexity: 178.360 speed: 674 wps
0.395 perplexity: 171.464 speed: 673 wps
0.493 perplexity: 165.598 speed: 674 wps
0.592 perplexity: 164.097 speed: 675 wps
0.690 perplexity: 160.013 speed: 675 wps
0.788 perplexity: 155.866 speed: 675 wps
0.886 perplexity: 151.875 speed: 674 wps
0.985 perplexity: 148.404 speed: 673 wps
***Epoch: 2/2 Train Perplexity: 148.207
1 : it is a great honor to be a great deal , but we will make america great again . we need to get the people to get the people to get the people
3 : it is a long time . and i will do that . i will do that . and i will say that . we will make a deal , and i would like
5 : it is a great time for the people . the people can not do that . we need people to get rid of the oil . the people can make you back the
10 : it is a good deal for the people who can do , the people , but i will do something that . i will tell you that the time . we have people
Save model
[TL] [*] 641model_generate_text.npz saved
[Finished in 633.2s]

The work computer is a bit better, but the full run still took nearly four hours...

1 : it is a very successful possibility . it ' s going to be built , a very real number , or a very good all . well , i will tell you what
3 : it is a choice between israel and , by the way , to make the incredible potential of the best available to deal , and it ' s a very sad line of
5 : it is a mess . it ' s a total piece . they love america , but the way — they are trying to come to other countries like that . we want
10 : it is a choice between a politician about israel . we have lost everything . we are going to bring jobs back from japan , i will say we want to keep them
---Epoch: 26/100 Learning rate: 0.10941899
0.002 perplexity: 6.456 speed: 1347 wps
0.334 perplexity: 5.350 speed: 1349 wps
0.666 perplexity: 5.138 speed: 1349 wps
0.998 perplexity: 4.887 speed: 1348 wps
***Epoch: 26/100 Train Perplexity: 4.887


1 : it is a mess . if we don not get tough and repeal obamacare , we have no choice . we will have a country of borders , we can be leading in
3 : it is a disaster for two or good , but they know exactly , but in my opinion , that ' s the kind of thinking this country needs , is to get
5 : it is a threat to israel . that ' s what you are saying . these are the people who don not need it . so i have got a call from him
10 : it is a choice . we have a president who understands what would do is because our leaders no longer use the wealth to them all for countries that is good for our
---Epoch: 51/100 Learning rate: 0.00785517
0.002 perplexity: 4.847 speed: 1350 wps
0.334 perplexity: 4.172 speed: 1348 wps
0.666 perplexity: 3.986 speed: 1347 wps
0.998 perplexity: 3.789 speed: 1346 wps
***Epoch: 51/100 Train Perplexity: 3.789

1 : it is a mess . if i ' m elected president , we will win the united states as a nation . we will have more than any other candidate ; we can
3 : it is a disgrace and an embarrassment . but we need people to think i will do that . i will do everything to bring all the border . we are not our
5 : it is a mess . and we need smart people . we have no idea who the people who can not understand how bad things . as for the very beginning we need
10 : it is a threat to our economy . that ' s what americans are talking about . and if i ' m president and all of these sudden why that are president ,
---Epoch: 76/100 Learning rate: 0.00056392
0.002 perplexity: 4.665 speed: 1347 wps
0.334 perplexity: 4.004 speed: 1360 wps
0.666 perplexity: 3.826 speed: 1360 wps
0.998 perplexity: 3.645 speed: 1360 wps
***Epoch: 76/100 Train Perplexity: 3.645


Final output at epoch 100:

1 : it is a mess . if i ' m elected president , we will win the united states as a nation . we will have more than any other candidate ; we can
3 : it is a disgrace and it is a honor to receive endorsement . i am the only person on this dais i ' m very proud of the question . well , i
5 : it is a disgrace and real estate . i ' m the only one that ' s going to be able to keep the polls . i am , and , frankly ,
10 : it is a good or good , great wall . i can not imagine what we have to take on that no oil and we can go at zero . and as far
Save model
[TL] [*] 641model_generate_text.npz saved
[Finished in 13881.3s]

13881 s / 3600 ≈ 3.9 hours! Overall, the occasional sentence comes out surprisingly coherent, but in plenty of places the meaning is still a muddle.
