CNN Sentiment Classification

For the CNN text classification data and code, click to open the link.

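The experiment uses book reviews from Dangdang (噹噹網). Two examples:

Negative: 不喜歡其人，更不喜歡其文，古韻有些過頭，矯情！ (I don't like the author, and I like his writing even less; the classical flavor is overdone and affected!)
Positive: 這本書真的很好啊，應該打五星，只可惜評分好像無法修改，很有作者的個人風格和簡介，能學到不少健康知識，作者您還能再出書嗎？要是還出這樣的好書，我一定買。謝謝噹噹熱心的顧客推薦，我才能買到這些好書，也謝謝噹噹，祝好心人閤家歡樂，身體健康！~ (This book is really good and deserves five stars; it's just a pity that the rating apparently can't be changed. It has plenty of the author's personal style and perspective, and you can learn a lot about health. Will the author publish more books? If more good books like this come out, I will definitely buy them. Thanks to the helpful Dangdang customers whose recommendations led me to these good books, and thanks to Dangdang too. I wish all kind people a happy family and good health!)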

（1）Load data and labels

import codecs

import numpy as np


def load_data_and_labels():
  """
  Loads the Dangdang review data from files, splits each review into characters
  and generates labels. Returns split sentences and labels.
  """
  # Load data from files (one review per line); the files are closed automatically
  with codecs.open("./data/chinese/pos.txt", "r", "utf-8") as f:
    positive_examples = [s.strip() for s in f.readlines()]
  with codecs.open("./data/chinese/neg.txt", "r", "utf-8") as f:
    negative_examples = [s.strip() for s in f.readlines()]
  # Split every review into single characters
  x_text = positive_examples + negative_examples
  # x_text = [clean_str(sent) for sent in x_text]
  x_text = [list(s) for s in x_text]

  # Generate one-hot labels: positive -> [0, 1], negative -> [1, 0]
  positive_labels = [[0, 1] for _ in positive_examples]
  negative_labels = [[1, 0] for _ in negative_examples]
  y = np.concatenate([positive_labels, negative_labels], 0)
  return [x_text, y]

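This function loads the positive and negative reviews from their files, concatenates them, and splits every sentence into tokens. Because the reviews are Chinese, list(s) splits each one into single characters, so x_text is a 2-D list that holds every token of every review. The labels are concatenated in the same order. Since they correspond to the two neurons of the binary-classification output layer, they are one-hot encoded as [0, 1] (positive) and [1, 0] (negative) and returned as y.
f.readlines() already returns a list in which every element is one line of text (a str ending in "\n"), so an extra list() around it is unnecessary.
s.strip() removes the newline and any other leading or trailing whitespace from each sentence.
For English text, each sentence would additionally go through clean_str() (not shown here), which puts spaces around punctuation, contractions and similar split points. The simplest way to tokenize an English str is to split on spaces, so once every split point is surrounded by spaces, calling split(" ") on the whole string completes the tokenization. In this version the clean_str() call is commented out and list(s) does the splitting instead.
The labels are generated in a similar way.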
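
The splitting and labelling step can be tried without the data files. The following is a minimal sketch that assumes two made-up review strings in place of the lines read from pos.txt and neg.txt; everything else mirrors the function above.

```python
import numpy as np

# Two made-up reviews standing in for the lines of pos.txt and neg.txt (assumption).
positive_examples = ["很好 \n"]
negative_examples = ["不喜歡\n"]

# strip() removes the trailing newline and any surrounding whitespace.
positive_examples = [s.strip() for s in positive_examples]
negative_examples = [s.strip() for s in negative_examples]

# list(s) splits a string into single characters.
x_text = [list(s) for s in positive_examples + negative_examples]

# One-hot labels: positive -> [0, 1], negative -> [1, 0].
positive_labels = [[0, 1] for _ in positive_examples]
negative_labels = [[1, 0] for _ in negative_examples]
y = np.concatenate([positive_labels, negative_labels], 0)

print(x_text)
print(y.tolist())
```

Each row of y lines up with the review at the same position in x_text, because both are built in positive-then-negative order.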

（2）Pad sentence

def pad_sentences(sentences, padding_word="<PAD/>"):
  """
  Pads all sentences to the same length. The length is defined by the longest sentence.
  Returns padded sentences.
  """
  sequence_length = max(len(x) for x in sentences)
  padded_sentences = []
  for sentence in sentences:
    num_padding = sequence_length - len(sentence)
    padded_sentences.append(sentence + [padding_word] * num_padding)
  return padded_sentences

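Why do the sentences need padding?
In the TextCNN model, input_x is a tf.placeholder, i.e. a tensor whose shape is fixed in advance, for example [batch, sequence_len]. The rows of a tensor cannot have different lengths, so we find the length of the longest sentence in the whole dataset and append padding words to every shorter sentence until all input sentences have the same length.
load_data_and_labels returns the sentences as a 2-D list, and pad_sentences returns them in the same form. The only difference is that some of the inner lists now end with padding words.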
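
To make the effect concrete, here is a small sketch with a condensed copy of pad_sentences and two toy sentences (the toy data is an assumption, not taken from the dataset):

```python
def pad_sentences(sentences, padding_word="<PAD/>"):
    """Pad every token list to the length of the longest one."""
    sequence_length = max(len(x) for x in sentences)
    return [s + [padding_word] * (sequence_length - len(s)) for s in sentences]


sentences = [["很", "好"], ["不", "喜", "歡"]]
padded = pad_sentences(sentences)
print(padded)
```

After padding, every inner list has the same length, so the data can later be turned into a rectangular array.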

（3）Build vocabulary

import itertools
from collections import Counter


def build_vocab(sentences):
  """
  Builds a vocabulary mapping from word to index based on the sentences.
  Returns vocabulary mapping and inverse vocabulary mapping.
  """
  # Count how often every word occurs across all sentences
  word_counts = Counter(itertools.chain(*sentences))
  # Mapping from index to word (most frequent word first)
  vocabulary_inv = [x[0] for x in word_counts.most_common()]
  # Mapping from word to index
  vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
  return [vocabulary, vocabulary_inv]

Here, Counter from the collections module is used to count word frequencies, for example:

from collections import Counter

sentence = ["i", "love", "mom", "mom", "mom", "me", "loves", "me"]
word_counts = Counter(sentence)
print(word_counts.most_common())
vocabulary_inv = [x[0] for x in word_counts.most_common()]
print(vocabulary_inv)
vocabulary_inv = sorted(vocabulary_inv)
print(vocabulary_inv)
vocabulary = {x: i for i, x in enumerate(vocabulary_inv)}
print(vocabulary)

Output (Python 3.7+):
[('mom', 3), ('me', 2), ('i', 1), ('love', 1), ('loves', 1)]
['mom', 'me', 'i', 'love', 'loves']
['i', 'love', 'loves', 'me', 'mom']
{'i': 0, 'love': 1, 'loves': 2, 'me': 3, 'mom': 4}

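Counter takes an iterable, but we have many sentence lists. How can all the words of all the sentence word lists be produced by a single efficient iterator?
This is what itertools.chain(*iterables) is for:

It takes several iterables as arguments but returns a single iterator that yields the contents of each argument in turn, as if they all came from one sequence.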
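
A short sketch of this behaviour, using two toy sentences (the data is an assumption for illustration):

```python
import itertools
from collections import Counter

sentences = [["i", "love", "mom"], ["mom", "loves", "me"]]

# chain(*sentences) yields the words of every sentence one after another
# without first building one big concatenated list.
all_words = list(itertools.chain(*sentences))
print(all_words)

# Counter can consume the chained iterator directly.
word_counts = Counter(itertools.chain(*sentences))
print(word_counts["mom"])
```

itertools.chain.from_iterable(sentences) does the same thing without unpacking the outer list into arguments.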

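This gives word_counts, the word frequencies over the whole dataset.
To build the vocabulary, we take the first element of each (word, count) pair, i.e. the word itself (Counter has effectively de-duplicated the words). build_vocab keeps the order returned by most_common(), so indices are assigned by descending frequency. The standalone example additionally calls sorted(), which discards the frequency order and produces a word list in lexicographic order, with alphabetically smaller words first.
Finally, a dict is built that stores the index of each word; this is the vocabulary variable.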

（4）Build input data

def build_input_data(sentences, labels, vocabulary):
  """
  Maps sentences and labels to vectors based on a vocabulary.
  """
  # x is the index matrix: vocabulary[word] gives the index of each word
  x = np.array([[vocabulary[word] for word in sentence] for sentence in sentences])
  y = np.array(labels)
  return [x, y]

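The functions above give us the tokenized sentences as a 2-D list, the labels for those sentences, and the vocabulary dict that maps each word to its index.
However, sentences still stores every word as a string, which takes a lot of memory when the dataset is large. It is better to store each word's index instead: an index is an int and takes much less space.
So we look up every word of the 2-D sentences list in the vocabulary just built, producing a 2-D list of word indices, and then convert that list into a 2-D NumPy array.
The labels are already a 2-D structure of 0s and 1s, so they can be converted to an array directly.
Once converted, the arrays can be used directly as the input and labels of the CNN.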
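
As an illustration, the lookup can be run on a tiny hand-written vocabulary. The vocabulary, sentences and labels below are made up for the sketch; in the real pipeline they come from build_vocab, pad_sentences and load_data_and_labels.

```python
import numpy as np

# Hypothetical vocabulary and already-padded sentences (assumptions for illustration).
vocabulary = {"<PAD/>": 0, "好": 1, "很": 2, "不": 3}
sentences = [["很", "好", "<PAD/>"], ["不", "好", "<PAD/>"]]
labels = [[0, 1], [1, 0]]

# Replace every word by its integer index, then build rectangular arrays.
x = np.array([[vocabulary[word] for word in sentence] for sentence in sentences])
y = np.array(labels)

print(x.tolist())
print(x.shape, y.shape)
```

x has shape (number of sentences, sequence length) and y has shape (number of sentences, 2), which is the layout a [batch, sequence_len] input expects.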

（5）Load data

def load_data():
  """
  Loads and preprocesses the Dangdang review data.
  Returns input vectors, labels, vocabulary, and inverse vocabulary.
  """
  # Load and preprocess data
  sentences, labels = load_data_and_labels()
  sentences_padded = pad_sentences(sentences)
  vocabulary, vocabulary_inv = build_vocab(sentences_padded)
  x, y = build_input_data(sentences_padded, labels, vocabulary)
  return [x, y, vocabulary, vocabulary_inv]

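Finally, load_data ties the preprocessing functions above together:

1. Load the raw data from the text files, initially as a list of sentences, then split every sentence into tokens (the clean_str call is commented out here). This gives the 2-D list sentences with the token as its basic unit; the labels are [0, 1] and [1, 0].
2. Find the maximum sentence length and pad every shorter sentence.
3. Build the vocabulary from the data, ordered by descending frequency, and obtain the index of every word.
4. Convert the 2-D list sentences from str to int and return 2-D NumPy arrays as the model's input and labels for later use.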

（6）Generate batch

def batch_iter(data, batch_size, num_epochs):
  """
  Generates a batch iterator for a dataset.
  """
  data = np.array(data)
  data_size = len(data)
  # Ceiling division, so no empty batch is yielded when data_size is a multiple of batch_size
  num_batches_per_epoch = (data_size - 1) // batch_size + 1
  for epoch in range(num_epochs):
    # Shuffle the data at each epoch
    shuffle_indices = np.random.permutation(np.arange(data_size))
    shuffled_data = data[shuffle_indices]
    for batch_num in range(num_batches_per_epoch):
      start_index = batch_num * batch_size
      end_index = min((batch_num + 1) * batch_size, data_size)
      yield shuffled_data[start_index:end_index]

During training you define batches = batch_iter(…) once; the training loop then only has to iterate over batches with a for loop to process one batch at a time.

batches = batch_iter(...)
for batch in batches:
  # process the batch
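
The following self-contained sketch shows what the loop receives. It assumes a toy dataset of ten integers, and its copy of batch_iter computes the number of batches with ceiling division so that no empty batch is produced.

```python
import numpy as np


def batch_iter(data, batch_size, num_epochs):
    """Yield shuffled mini-batches of `data` for `num_epochs` epochs."""
    data = np.array(data)
    data_size = len(data)
    # Ceiling division: number of batches needed to cover the data once.
    num_batches_per_epoch = (data_size - 1) // batch_size + 1
    for epoch in range(num_epochs):
        # Reshuffle at the start of every epoch.
        shuffled_data = data[np.random.permutation(data_size)]
        for batch_num in range(num_batches_per_epoch):
            start_index = batch_num * batch_size
            end_index = min(start_index + batch_size, data_size)
            yield shuffled_data[start_index:end_index]


# Toy data: ten integers instead of (x, y) training pairs (assumption).
batches = batch_iter(list(range(10)), batch_size=4, num_epochs=2)
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

With 10 items and a batch size of 4, each epoch yields two full batches and one final batch of 2, and the loop ends on its own after the last epoch.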

Yield
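yield is used somewhat like return, except that the function returns a generator.
To master yield, you must understand its key point: when you call the function, the code you wrote in its body does not actually run. The call only returns a generator object, which is a bit tricky :-)

Your code then runs each time the for loop pulls a value from the generator.

Now the hardest part to explain:
The first time the for loop asks the generator created by your function for a value, the function body runs from the beginning until it hits yield and hands back the first value. Each subsequent request resumes the function where it left off, runs until the next yield, and returns the next value, until there are no more values. The generator is considered empty once the function runs to the end without reaching another yield, either because the loop has finished or because an "if/else" condition is no longer satisfied.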
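
This lazy behaviour can be observed directly. The sketch below uses a hypothetical generator function, count_up_to, that records in a list when its body actually runs:

```python
log = []


def count_up_to(n):
    log.append("started")    # runs only when the first value is requested
    for i in range(n):
        yield i
    log.append("finished")   # runs once the loop ends and no further yield is reached


gen = count_up_to(3)         # calling the function runs none of its body
log_after_call = list(log)

values = [v for v in gen]    # iterating drives the generator until it is exhausted

print(log_after_call)
print(values)
print(log)
```

The log is still empty right after the call, and it only receives "started" and "finished" once the for loop has consumed the generator.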
