用RNN做文本生成

舉個小小的例子，來看看LSTM是怎麼玩的

我們這裏用溫斯頓丘吉爾的人物傳記作爲我們的學習語料。

(各種中文語料可以自行網上查找，英文的小說語料可以從古登堡計劃網站下載txt平文本：https://www.gutenberg.org/wiki/Category:Bookshelf)

第一步，一樣，先導入各種庫

import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

接下來，我們把文本讀入

raw_text = open('../input/Winston_Churchil.txt').read()
raw_text = raw_text.lower()

既然我們是以每個字母爲層級，字母總共才26個，所以我們可以很方便的用One-Hot來編碼出所有的字母（當然，可能還有些標點符號和其他noise）

chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

我們看到，全部的chars：

chars

['\n',
 ' ',
 '!',
 '#',
 '$',
 '%',
 '(',
 ')',
 '*',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 ';',
 '?',
 '@',
 '[',
 ']',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '‘',
 '’',
 '“',
 '”',
 '\ufeff']

一共有：

len(chars)

同時，我們的原文本一共有:

len(raw_text)

我們這裏簡單的文本預測就是，給了前置的字母以後，下一個字母是誰？

比如，Winsto, 給出 nBritai 給出 n

構造訓練測試集

我們需要把我們的raw text變成可以用來訓練的x,y:

x 是前置字母們y 是後一個字母

seq_length = 100
x = []
y = []
for i in range(0, len(raw_text) - seq_length):
    given = raw_text[i:i + seq_length]
    predict = raw_text[i + seq_length]
    x.append([char_to_int[char] for char in given])
    y.append(char_to_int[predict])

我們可以看看我們做好的數據集的長相：

print(x[:3])
print(y[:3])

[[60, 45, 47, 44, 39, 34, 32, 49, 1, 36, 50, 49, 34, 43, 31, 34, 47, 36, 57, 48, 1, 47, 34, 30, 41, 1, 48, 44, 41, 33, 38, 34, 47, 48, 1, 44, 35, 1, 35, 44, 47, 49, 50, 43, 34, 9, 1, 31, 54, 1, 47, 38, 32, 37, 30, 47, 33, 1, 37, 30, 47, 33, 38, 43, 36, 1, 33, 30, 51, 38, 48, 0, 0, 49, 37, 38, 48, 1, 34, 31, 44, 44, 40, 1, 38, 48, 1, 35, 44, 47, 1, 49, 37, 34, 1, 50, 48, 34, 1, 44], [45, 47, 44, 39, 34, 32, 49, 1, 36, 50, 49, 34, 43, 31, 34, 47, 36, 57, 48, 1, 47, 34, 30, 41, 1, 48, 44, 41, 33, 38, 34, 47, 48, 1, 44, 35, 1, 35, 44, 47, 49, 50, 43, 34, 9, 1, 31, 54, 1, 47, 38, 32, 37, 30, 47, 33, 1, 37, 30, 47, 33, 38, 43, 36, 1, 33, 30, 51, 38, 48, 0, 0, 49, 37, 38, 48, 1, 34, 31, 44, 44, 40, 1, 38, 48, 1, 35, 44, 47, 1, 49, 37, 34, 1, 50, 48, 34, 1, 44, 35], [47, 44, 39, 34, 32, 49, 1, 36, 50, 49, 34, 43, 31, 34, 47, 36, 57, 48, 1, 47, 34, 30, 41, 1, 48, 44, 41, 33, 38, 34, 47, 48, 1, 44, 35, 1, 35, 44, 47, 49, 50, 43, 34, 9, 1, 31, 54, 1, 47, 38, 32, 37, 30, 47, 33, 1, 37, 30, 47, 33, 38, 43, 36, 1, 33, 30, 51, 38, 48, 0, 0, 49, 37, 38, 48, 1, 34, 31, 44, 44, 40, 1, 38, 48, 1, 35, 44, 47, 1, 49, 37, 34, 1, 50, 48, 34, 1, 44, 35, 1]]
[35, 1, 30]

此刻，樓上這些表達方式，類似就是一個詞袋，或者說 index。

接下來我們做兩件事：

我們已經有了一個input的數字表達（index），我們要把它變成LSTM需要的數組格式： [樣本數，時間步伐，特徵]
第二，對於output，我們在Word2Vec裏學過，用one-hot做output的預測可以給我們更好的效果，相對於直接預測一個準確的y數值的話。

n_patterns = len(x)
n_vocab = len(chars)

# 把x變成LSTM需要的樣子
x = numpy.reshape(x, (n_patterns, seq_length, 1))
# 簡單normal到0-1之間
x = x / float(n_vocab)
# output變成one-hot
y = np_utils.to_categorical(y)

print(x[11])
print(y[11])

[[ 0.80327869]
 [ 0.55737705]
 [ 0.70491803]
 [ 0.50819672]
 [ 0.55737705]
 [ 0.7704918 ]
 [ 0.59016393]
 [ 0.93442623]
 [ 0.78688525]
 [ 0.01639344]
 [ 0.7704918 ]
 [ 0.55737705]
 [ 0.49180328]
 [ 0.67213115]
 [ 0.01639344]
 [ 0.78688525]
 [ 0.72131148]
 [ 0.67213115]
 [ 0.54098361]
 [ 0.62295082]
 [ 0.55737705]
 [ 0.7704918 ]
 [ 0.78688525]
 [ 0.01639344]
 [ 0.72131148]
 [ 0.57377049]
 [ 0.01639344]
 [ 0.57377049]
 [ 0.72131148]
 [ 0.7704918 ]
 [ 0.80327869]
 [ 0.81967213]
 [ 0.70491803]
 [ 0.55737705]
 [ 0.14754098]
 [ 0.01639344]
 [ 0.50819672]
 [ 0.8852459 ]
 [ 0.01639344]
 [ 0.7704918 ]
 [ 0.62295082]
 [ 0.52459016]
 [ 0.60655738]
 [ 0.49180328]
 [ 0.7704918 ]
 [ 0.54098361]
 [ 0.01639344]
 [ 0.60655738]
 [ 0.49180328]
 [ 0.7704918 ]
 [ 0.54098361]
 [ 0.62295082]
 [ 0.70491803]
 [ 0.59016393]
 [ 0.01639344]
 [ 0.54098361]
 [ 0.49180328]
 [ 0.83606557]
 [ 0.62295082]
 [ 0.78688525]
 [ 0.        ]
 [ 0.        ]
 [ 0.80327869]
 [ 0.60655738]
 [ 0.62295082]
 [ 0.78688525]
 [ 0.01639344]
 [ 0.55737705]
 [ 0.50819672]
 [ 0.72131148]
 [ 0.72131148]
 [ 0.6557377 ]
 [ 0.01639344]
 [ 0.62295082]
 [ 0.78688525]
 [ 0.01639344]
 [ 0.57377049]
 [ 0.72131148]
 [ 0.7704918 ]
 [ 0.01639344]
 [ 0.80327869]
 [ 0.60655738]
 [ 0.55737705]
 [ 0.01639344]
 [ 0.81967213]
 [ 0.78688525]
 [ 0.55737705]
 [ 0.01639344]
 [ 0.72131148]
 [ 0.57377049]
 [ 0.01639344]
 [ 0.49180328]
 [ 0.70491803]
 [ 0.8852459 ]
 [ 0.72131148]
 [ 0.70491803]
 [ 0.55737705]
 [ 0.01639344]
 [ 0.49180328]
 [ 0.70491803]]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  1.  0.  0.  0.  0.  0.]

模型建造

LSTM模型構建

model = Sequential()
model.add(LSTM(128, input_shape=(x.shape[1], x.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

跑模型

model.fit(x, y, nb_epoch=10, batch_size=32)

Epoch 1/10
 21120/276730 [=>............................] - ETA: 1078s - loss: 3.0638

我們來寫個程序，看看我們訓練出來的LSTM的效果：

def predict_next(input_array):
    x = numpy.reshape(input_array, (1, seq_length, 1))
    x = x / float(n_vocab)
    y = model.predict(x)
    return y

def string_to_index(raw_input):
    res = []
    for c in raw_input[(len(raw_input)-seq_length):]:
        res.append(char_to_int[c])
    return res

def y_to_char(y):
    largest_index = y.argmax()
    c = int_to_char[largest_index]
    return c

好，寫成一個大程序：

def generate_article(init, rounds=500):
    in_string = init.lower()
    for i in range(rounds):
        n = y_to_char(predict_next(string_to_index(in_string)))
        in_string += n
    return in_string

init = 'Professor Michael S. Hart is the originator of the Project'
article = generate_article(init)
print(article)

用rnn做文本生成

用RNN做文本生成

構造訓練測試集

模型建造

跑模型

用rnn做文本生成

遺傳算法入門到掌握（一）

遺傳算法入門到掌握（二）

在虛擬機下爲Fedora擴展分區

opencv學習筆記(6)----圖像角點檢測

Mac下配置sublime實現LaTeX

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結