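NLP Series: Text Classification

1 Preface

This post records how the common baseline models for text classification in natural language processing are used, and analyzes them. On GitHub, brightmart has already put together a very complete set of baseline text classification models with matching implementations, and several blog posts have translated and organized his implementation steps. Out of respect for the original work, all of these are listed in the reference links at the end, with thanks to their authors. This post consolidates those earlier posts and papers on baseline text classification models and adds some model analysis of my own, since having something of your own is what turns reading into both learning and thinking.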

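2 The text classification task

2.1 Text classification is one of the most basic tasks in natural language processing and a good hands-on entry point into the field; it is where I started practicing myself. The task is to take a given text (long texts such as news, articles, and documents, as well as short texts such as comments and microblog posts) and use NLP techniques to sort it into categories. Put simply, it assigns category labels to text.
2.2 Common text classification models fall into machine-learning-based and deep-learning-based methods. For machine-learning-based methods, feature extraction and feature selection largely determine classification quality. At the word level, features include TF-IDF, n-grams, topic words, and keywords. At the sentence level, they include sentence pattern and sentence length. At the semantic level, word vectors can be pre-trained on a corpus with word2vec and then combined into a vector for the whole text, which serves as its semantic feature. For deep-learning-based methods, the model structure and the model parameters largely determine classification quality. Commonly used building blocks include CNN, RNN, GRU, attention mechanisms, and Dropout. When tuning, the learning rate has to be set well on the one hand, and on the other hand the remaining parameters have to be adjusted to the characteristics of the model and of the text being analyzed.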
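To make the word-level features in 2.2 concrete, here is a minimal TF-IDF sketch in plain Python. It assumes the simplest weighting (term frequency divided by document length, times log(N / document frequency)); the toy corpus and the function name `tf_idf` are made up for illustration and are not taken from brightmart's project.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses tf = count / doc_length and idf = log(N / df); smoothed or
    sublinear variants are common in practice.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized.
docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "food"],
]
features = tf_idf(docs)
```

In the second document, "bad" occurs in only one document while "movie" occurs in two, so "bad" gets the larger weight. The resulting dictionaries can be fed to a classical classifier such as LR or SVM.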
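Note: this post introduces and analyzes the models through brightmart's experiments with basic neural networks for text classification. The experiments use a problem from the 2017 Zhihu Kanshan Cup competition, which asks participants to classify questions posted on Zhihu. The task is multi-label classification, which falls within the scope of text classification.
2.3 Experimental results of the base models
brightmart ran a large number of experiments on the dataset above with the base models covered in section 3; his project 【1】 lists the results. Many of these models are fairly basic but also classic, which makes them well suited as baselines. To improve on them further, you would likely need to adapt the model to the characteristics of the data or ensemble several models.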

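3 Introduction to and analysis of the basic text classification models

This part introduces the basic text classification models. For each model it covers the source paper, the model structure, the implementation steps, the key implementation code (also from brightmart's project), and finally an analysis of the model.

3.1 FastText

3.1.1 Source paper
"Bag of Tricks for Efficient Text Classification"
3.1.2 Model structure
3.1.3 Implementation steps

From the model structure, the implementation steps of FastText are:
1. embedding -> 2. average -> 3. linear classifier (no activation function) -> softmax classification

3.1.4 Key implementation code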

# None stands for the batch size
# 1. get the embeddings of the words in the sentence
sentence_embeddings = tf.nn.embedding_lookup(self.Embedding, self.sentence)  # [None, self.sentence_len, self.embed_size]
# 2. average the word vectors to get the sentence representation
self.sentence_embeddings = tf.reduce_mean(sentence_embeddings, axis=1)  # [None, self.embed_size]
# 3. linear classifier layer
logits = tf.matmul(self.sentence_embeddings, self.W) + self.b  # [None, self.label_size] == tf.matmul([None, self.embed_size], [self.embed_size, self.label_size])

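3.1.5 Model analysis

FastText has a relatively simple structure and is trained in a supervised way. Training it yields not only a classifier: given enough data, it also yields word vectors.
1. FastText is structurally simple: the text representation goes straight through a linear layer for classification, with no activation function, so the model is essentially linear. It therefore works for classification tasks where sentence structure is simple. For harder tasks such as sentiment classification, the network has to learn semantics and word order, which a single linear layer cannot capture, so more complex non-linear layers are needed. The simple structure is also why the model trains comparatively fast.
2. FastText adds n-gram features. The second layer of the model averages the word vectors, which discards word order, and that loss matters for classification. FastText therefore adds n-gram vectors alongside the word vectors: each n-gram is treated as a word with its own vector, and these vectors are included when the second layer computes the average. Training the classifier then yields both the word vectors and the n-gram vectors. One problem is that there are far more n-grams than words. FastText handles this with hash buckets: every n-gram is hashed into one of `buckets` buckets, and all n-grams that land in the same bucket share one embedding vector. The same idea could be applied when assigning vectors to UNK words.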
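The bucket trick from item 2 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not fastText's actual code: `zlib.crc32` stands in for the real hash function, and the vocabulary, sizes, and random vectors are made up.

```python
import random
import zlib


def bigrams(tokens):
    """Adjacent word pairs, joined so that each bigram is one string."""
    return [a + " " + b for a, b in zip(tokens, tokens[1:])]


def bucket_id(ngram, num_buckets):
    # crc32 is a stand-in hash, chosen because it is deterministic across runs.
    return zlib.crc32(ngram.encode("utf-8")) % num_buckets


def sentence_vector(tokens, word_emb, bucket_emb):
    """Average the word vectors together with the hashed bigram vectors."""
    vectors = [word_emb[t] for t in tokens]
    vectors += [bucket_emb[bucket_id(g, len(bucket_emb))] for g in bigrams(tokens)]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


rng = random.Random(0)
EMBED_SIZE, NUM_BUCKETS = 4, 1000  # toy sizes
vocab = ["not", "good", "at", "all"]
word_emb = {w: [rng.uniform(-1, 1) for _ in range(EMBED_SIZE)] for w in vocab}
bucket_emb = [[rng.uniform(-1, 1) for _ in range(EMBED_SIZE)]
              for _ in range(NUM_BUCKETS)]

v1 = sentence_vector(["not", "good", "at", "all"], word_emb, bucket_emb)
v2 = sentence_vector(["good", "not", "at", "all"], word_emb, bucket_emb)
```

Averaging only the word vectors would give the two sentences the same representation, because they contain the same words. With the bigram vectors included, `v1` and `v2` will generally differ unless the bigrams happen to collide in the same buckets. The memory cost is bounded by `NUM_BUCKETS` no matter how many distinct n-grams the corpus contains.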

3.2 TextCNN

3.2.1 Source paper
"Convolutional Neural Networks for Sentence Classification"
3.2.2 Model structure
3.2.3 Implementation steps

From the model structure, the implementation steps of TextCNN are:
1. embedding -> 2. conv -> 3. max pooling -> 4. fully connected layer -> 5. softmax

3.2.4 Key implementation code

# 1. get the embeddings of the words in the sentence
self.embedded_words = tf.nn.embedding_lookup(self.Embedding, self.input_x)  # [None, sentence_length, embed_size]
self.sentence_embeddings_expanded = tf.expand_dims(self.embedded_words, -1)  # [None, sentence_length, embed_size, 1]; extra dimension to meet the input requirement of 2-D conv
# 2. loop over the filter sizes; for each one, run a convolution-pooling layer
#    (a. create filters, b. conv, c. apply non-linearity, d. max-pooling)
#    using tf.nn.conv2d, tf.nn.relu and tf.nn.max_pool; the feature is a new 4-D variable
# if self.use_mulitple_layer_cnn:  # this may take 50G memory
#     print("use multiple layer CNN")
#     h = self.cnn_multiple_layers()
# else:  # this takes little memory, less than 2G
print("use single layer CNN")
h = self.cnn_single_layer()
# 5. logits (linear layer) and predictions (argmax)
with tf.name_scope("output"):
    logits = tf.matmul(h, self.W_projection) + self.b_projection  # [None, self.num_classes]

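3.2.5 Model analysis
I previously wrote a detailed post on implementing TextCNN; see "Implementing a convolutional neural network (TextCNN) for sentence classification".

The most important difference between deep learning and traditional machine learning is that deep learning uses a neural network in place of manual feature extraction. How well the final model performs therefore depends on how strong the network is as a feature extractor. For text, that strength has four aspects: extracting syntactic features, extracting semantic features, capturing long-distance features, and extracting task-specific composite features. These four judge an NLP feature extractor on capability; parallelism and runtime efficiency are added to judge how practical it is at scale.
1. TextCNN uses a CNN to extract features from text; in image processing, CNNs are very strong feature extractors. Taking the word-vector dimension as one axis and the text length as the other gives a matrix, so convolution kernels can be applied to the text to extract features. The kernel size then plays the role of the n in n-gram features.
2. TextCNN's steps include max pooling. Several convolution kernels slide over the text to pick up semantic features. A kernel preserves the relative position of features: because it slides from left to right, the features it produces are arranged in that order, so the structure already records relative position. When a pooling layer follows the convolution directly, however, max pooling keeps only the single strongest value from each kernel's feature vector, so position information is lost at the pooling layer. When word-order features need to be captured, position information has to be added back at the pooling layer.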
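The two points above can be checked with a small plain-Python sketch: one filter of height 2 slides over a sentence, and max-over-time pooling then keeps a single value. The embeddings and filter weights are made-up toy values, and the function names are hypothetical.

```python
def conv1d_over_words(embeddings, kernel):
    """Slide one filter of height k = len(kernel) over the sentence.

    embeddings: word vectors, shape [sentence_length][embed_size]
    kernel:     filter weights, shape [k][embed_size]
    Returns one ReLU activation per window of k consecutive words.
    """
    k = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        score = sum(w * x
                    for w_row, x_row in zip(kernel, window)
                    for w, x in zip(w_row, x_row))
        feature_map.append(max(0.0, score))  # ReLU
    return feature_map


def max_over_time(feature_map):
    """Keep only the strongest activation; its position is discarded."""
    return max(feature_map)


# Toy 2-dimensional embeddings (made-up values).
PAD = [0.0, 0.0]
HIT_A = [1.0, 0.0]
HIT_B = [0.0, 1.0]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # fires on HIT_A followed by HIT_B

early = [HIT_A, HIT_B, PAD, PAD, PAD]
late = [PAD, PAD, PAD, HIT_A, HIT_B]

fm_early = conv1d_over_words(early, kernel)  # [2.0, 0.0, 0.0, 0.0]
fm_late = conv1d_over_words(late, kernel)    # [0.0, 0.0, 0.0, 2.0]
```

The two feature maps differ, so the convolution output still records where the bigram occurred. After `max_over_time` both sentences yield 2.0, which is exactly the loss of position information described in item 2.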

3.3 TextRNN/LSTM

3.3.1 Model structure
3.3.2 Implementation steps

From the model structure, the implementation steps of TextRNN/LSTM are:
1. embedding -> 2. bi-directional LSTM -> 3. concat output -> 4. average/last output -> 5. softmax layer

3.3.3 Key implementation code

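# TensorFlow 1.x; `rnn` below refers to tf.contrib.rnn
import tensorflow as tf
from tensorflow.contrib import rnn

#1.get embedding of words in the sentence
self.embedded_words = tf.nn.embedding_lookup(self.Embedding,self.input_x) #shape:[None,sentence_length,embed_size]
#2. Bi-lstm layer
# define lstm cells; the lstm outputs are computed below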
lstm_fw_cell=rnn.BasicLSTMCell(self.hidden_size) #forward direction cell
lstm_bw_cell=rnn.BasicLSTMCell(self.hidden_size) #backward direction cell
if self.dropout_keep_prob is not None:
    lstm_fw_cell=rnn.DropoutWrapper(lstm_fw_cell,output_keep_prob=self.dropout_keep_prob)
    lstm_bw_cell=rnn.DropoutWrapper(lstm_bw_cell,output_keep_prob=self.dropout_keep_prob)
# bidirectional_dynamic_rnn: input: [batch_size, max_time, input_size]
#                            output: A tuple (outputs, output_states)
#                                    where:outputs: A tuple (output_fw, output_bw) containing the forward and the backward rnn output `Tensor`.
outputs,_=tf.nn.bidirectional_dynamic_rnn(lstm_fw_cell,lstm_bw_cell,self.embedded_words,dtype=tf.float32) #[batch_size,sequence_length,hidden_size] #creates a dynamic bidirectional recurrent neural network
print("outputs:===>",outputs) #outputs:(<tf.Tensor 'bidirectional_rnn/fw/fw/transpose:0' shape=(?, 5, 100) dtype=float32>, <tf.Tensor 'ReverseV2:0' shape=(?, 5, 100) dtype=float32>))
#3. concat output
output_rnn=tf.concat(outputs,axis=2) #[batch_size,sequence_length,hidden_size*2]
#4.1 average
#self.output_rnn_last=tf.reduce_mean(output_rnn,axis=1) #[batch_size,hidden_size*2]
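#4.2 last output
# note: at the last time step the backward LSTM has seen only one token; the averaged variant in 4.1 uses every step
self.output_rnn_last=output_rnn[:,-1,:] #[batch_size,hidden_size*2] #TODO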
print("output_rnn_last:", self.output_rnn_last) # <tf.Tensor 'strided_slice:0' shape=(?, 200) dtype=float32>
#5. logits(use linear layer)
with tf.name_scope("output"): #inputs: A `Tensor` of shape `[batch_size, dim]`.  The forward activations of the input network.
    logits = tf.matmul(self.output_rnn_last, self.W_projection) + self.b_projection  # [batch_size,num_classes]

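3.3.4 Model analysis

1. RNN is the typical sequence model: a linear chain that collects input step by step from front to back. This chain is hard to optimize during backpropagation, because the backpropagation path is long and easily leads to severe vanishing or exploding gradients. LSTM and GRU were introduced to address this. They add an intermediate state that is passed directly along the sequence, turning the repeated multiplication of the plain RNN into an additive update, which mitigates vanishing gradients. That is why the code above uses LSTM (GRU works as well) instead of a plain RNN.
2. The linear chain lets an RNN accept variable-length text naturally: the text is treated as a state that evolves over time and is consumed from front to back. The gating mechanism in LSTM lets the model store features from earlier steps, which is very effective for capturing long-distance features. RNNs therefore suit the linear-sequence setting of NLP particularly well, which is the fundamental reason they became so popular in NLP.
3. Because of the sequential structure, the state at time t depends on the state at time t-1, which is unfriendly to large-scale parallel computation; efficient parallelism is a weak point of RNNs. The structure can be modified to gain some degree of parallelism.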
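Items 1 and 3 can be illustrated with a scalar plain RNN in pure Python. This is a toy sketch with made-up weights, not the model from the code above: the loop shows that step t has to wait for step t-1, and the product of per-step factors shows how the gradient from the last state back to the first shrinks as the sequence grows.

```python
import math


def rnn_forward(inputs, w_h, w_x):
    """Scalar plain RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = 0.0
    states = []
    for x in inputs:  # step t cannot start before step t-1 has finished
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states


def grad_last_wrt_first(states, w_h):
    """d h_T / d h_1 as a product of per-step factors w_h * (1 - h_t^2)."""
    grad = 1.0
    for h in states[1:]:
        grad *= w_h * (1.0 - h * h)
    return grad


W_H, W_X = 0.9, 0.5  # made-up weights
short_states = rnn_forward([1.0] * 5, W_H, W_X)
long_states = rnn_forward([1.0] * 50, W_H, W_X)
g_short = grad_last_wrt_first(short_states, W_H)
g_long = grad_last_wrt_first(long_states, W_H)
```

Every factor here is below 0.9, so `g_long` is many orders of magnitude smaller than `g_short`. With factors above 1 the same product would explode instead. LSTM and GRU replace this purely multiplicative path with an additive state update, which is what eases the problem.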

3.4 RCNN

3.4.1 Source paper
"Recurrent Convolutional Neural Networks for Text Classification"
3.4.2 Model structure
3.4.3 Implementation steps

From the model structure, the implementation steps of RCNN are:
1. embedding -> 2. recurrent structure (convolutional layer) -> 3. max pooling -> 4. fully connected layer + softmax

3.4.4 Key implementation code

# 1. get the embeddings of the words in the sentence
self.embedded_words = tf.nn.embedding_lookup(self.Embedding, self.input_x)  # [None, sentence_length, embed_size]
# 2. recurrent structure (convolutional layer)
output_conv = self.conv_layer_with_recurrent_structure()  # [None, sentence_length, embed_size*3]
# 3. max pooling over the time axis
output_pooling = tf.reduce_max(output_conv, axis=1)  # [None, embed_size*3]
#4. logits(use linear layer)
with tf.name_scope("dropout"):
    h_drop=tf.nn.dropout(output_pooling,keep_prob=self.dropout_keep_prob) #[None,num_filters_total]

with tf.name_scope("output"): #inputs: A `Tensor` of shape `[batch_size, dim]`.  The forward activations of the input network.
    logits = tf.matmul(h_drop, self.W_projection) + self.b_projection  # [batch_size,num_classes]
def conv_layer_with_recurrent_structure(self):
    """
    input: self.embedded_words: [None, sentence_length, embed_size]
    :return: shape: [None, sentence_length, embed_size*3]
    """
    # 1. split the embeddings into one tensor per word
    embedded_words_split = tf.split(self.embedded_words, self.sequence_length, axis=1)  # sentence_length tensors of [None, 1, embed_size]
    embedded_words_squeezed = [tf.squeeze(x, axis=1) for x in embedded_words_split]  # sentence_length tensors of [None, embed_size]
    embedding_previous = self.left_side_first_word
    context_left_previous = tf.zeros((self.batch_size, self.embed_size))
    # 2. get the list of left contexts, scanning left to right
    context_left_list = []
    for current_embedding_word in embedded_words_squeezed:
        context_left = self.get_context_left(context_left_previous, embedding_previous)  # [None, embed_size]
        context_left_list.append(context_left)
        embedding_previous = current_embedding_word
        context_left_previous = context_left
    # 3. get the list of right contexts, scanning right to left
    embedded_words_reversed = embedded_words_squeezed[::-1]
    embedding_afterward = self.right_side_last_word
    context_right_afterward = tf.zeros((self.batch_size, self.embed_size))
    context_right_list = []
    for current_embedding_word in embedded_words_reversed:
        context_right = self.get_context_right(context_right_afterward, embedding_afterward)
        context_right_list.append(context_right)
        embedding_afterward = current_embedding_word
        context_right_afterward = context_right
    context_right_list.reverse()  # back to left-to-right order so that index i refers to word i
    # 4. concatenate left context, embedding and right context for each word
    output_list = []
    for index, current_embedding_word in enumerate(embedded_words_squeezed):
        representation = tf.concat([context_left_list[index], current_embedding_word, context_right_list[index]], axis=1)
        output_list.append(representation)  # sentence_length tensors of [None, embed_size*3]
    # 5. stack the list into a tensor
    output = tf.stack(output_list, axis=1)  # [None, sentence_length, embed_size*3]
    return output

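3.4.5 Model analysis

1. Structurally, the major change in RCNN is the word representation. Previously a word was represented simply by its [word embedding]; in RCNN it is [left context; word embedding; right context], which brings the surrounding context into the representation. Specifically, left context = activation(previous left context * Wl + previous word embedding * Ww). The right context is computed the same way in the opposite direction.
2. In the experiments RCNN does better than TextRNN, and the improved word representation plays a large part in that. Representing a word with a single vector has limits: for a polysemous word, one vector amounts to an average over all of its senses. This one-vector-per-word representation can be thought of as a static word vector. RCNN adds left and right context to the original vector, which brings in the word's context and can be seen as adjusting the single vector to some degree, making it possible to represent multiple senses.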
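The recurrence in item 1 can be traced with a small plain-Python sketch. It makes several simplifying assumptions that are not in the original code: tanh as the activation, zero vectors for the boundary context and boundary embedding (brightmart's code uses learned boundary embeddings), and made-up identity weight matrices.

```python
import math


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def next_context(prev_context, prev_embedding, w_c, w_e):
    """tanh(W_c * previous context + W_e * previous word embedding)."""
    a = matvec(w_c, prev_context)
    b = matvec(w_e, prev_embedding)
    return [math.tanh(x + y) for x, y in zip(a, b)]


def rcnn_representations(embeddings, w_l, w_sl, w_r, w_sr):
    """Return [left context; embedding; right context] for every word."""
    zero = [0.0] * len(embeddings[0])

    # Left contexts: scan left to right.
    left, ctx, prev = [], zero, zero
    for e in embeddings:
        ctx = next_context(ctx, prev, w_l, w_sl)
        left.append(ctx)
        prev = e

    # Right contexts: scan right to left, then restore word order.
    right, ctx, nxt = [], zero, zero
    for e in reversed(embeddings):
        ctx = next_context(ctx, nxt, w_r, w_sr)
        right.append(ctx)
        nxt = e
    right.reverse()

    return [l + e + r for l, e, r in zip(left, embeddings, right)]


IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # made-up weights
words = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
reps = rcnn_representations(words, IDENTITY, IDENTITY, IDENTITY, IDENTITY)
```

Each entry of `reps` has three times the embedding size. For the middle word, the left part depends only on the first word and the right part only on the last word, so the same word embedding ends up with a different representation in a different sentence. Taking the element-wise maximum over `reps` would give the pooled text vector of step 3.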

3.5 Hierarchical Attention Network

3.5.1 Source paper
"Hierarchical Attention Networks for Document Classification"
3.5.2 Model structure
3.5.3 Implementation steps

From the model structure, the implementation steps of HAN are:
1. embedding -> 2. word encoder (bi-directional GRU) -> 3. word attention -> 4. sentence encoder (bi-directional GRU) -> 5. sentence attention -> 6. FC + softmax

3.5.4 Key implementation code

# 1.1 embedding of words
input_x = tf.split(self.input_x, self.num_sentences,axis=1)  # a list. length:num_sentences.each element is:[None,self.sequence_length/num_sentences]
input_x = tf.stack(input_x, axis=1)  # shape:[None,self.num_sentences,self.sequence_length/num_sentences]
self.embedded_words = tf.nn.embedding_lookup(self.Embedding,input_x)  # [None,num_sentences,sentence_length,embed_size]
embedded_words_reshaped = tf.reshape(self.embedded_words, shape=[-1, self.sequence_length,self.embed_size])  # [batch_size*num_sentences,sentence_length,embed_size]
# 1.2 forward gru
hidden_state_forward_list = self.gru_forward_word_level(embedded_words_reshaped)  # a list,length is sentence_length, each element is [batch_size*num_sentences,hidden_size]
# 1.3 backward gru
hidden_state_backward_list = self.gru_backward_word_level(embedded_words_reshaped)  # a list,length is sentence_length, each element is [batch_size*num_sentences,hidden_size]
# 1.4 concat forward hidden state and backward hidden state. hidden_state: a list.len:sentence_length,element:[batch_size*num_sentences,hidden_size*2]
self.hidden_state = [tf.concat([h_forward, h_backward], axis=1) for h_forward, h_backward in
                     zip(hidden_state_forward_list, hidden_state_backward_list)]  # hidden_state:list,len:sentence_length,element:[batch_size*num_sentences,hidden_size*2]

# 2.Word Attention
# for each sentence.
sentence_representation = self.attention_word_level(self.hidden_state)  # output:[batch_size*num_sentences,hidden_size*2]
sentence_representation = tf.reshape(sentence_representation, shape=[-1, self.num_sentences, self.hidden_size * 2])  # shape:[batch_size,num_sentences,hidden_size*2]
#with tf.name_scope("dropout"):#TODO
#    sentence_representation = tf.nn.dropout(sentence_representation,keep_prob=self.dropout_keep_prob)  # shape:[None,hidden_size*4]

# 3.Sentence Encoder
# 3.1) forward gru for sentence
hidden_state_forward_sentences = self.gru_forward_sentence_level(sentence_representation)  # a list, length is num_sentences, each element is [None,hidden_size]
# 3.2) backward gru for sentence
hidden_state_backward_sentences = self.gru_backward_sentence_level(sentence_representation)  # a list, length is num_sentences, each element is [None,hidden_size]
# 3.3) concat forward hidden state and backward hidden state
# below hidden_state_sentence is a list,len:num_sentences,element:[None,hidden_size*2]
self.hidden_state_sentence = [tf.concat([h_forward, h_backward], axis=1) for h_forward, h_backward in zip(hidden_state_forward_sentences, hidden_state_backward_sentences)]

# 4.Sentence Attention
document_representation = self.attention_sentence_level(self.hidden_state_sentence)  # shape:[None,hidden_size*4]
with tf.name_scope("dropout"):
    self.h_drop = tf.nn.dropout(document_representation,keep_prob=self.dropout_keep_prob)  # shape:[None,hidden_size*4]
# 5. logits(use linear layer)and predictions(argmax)
with tf.name_scope("output"):
    logits = tf.matmul(self.h_drop, self.W_projection) + self.b_projection  # shape:[None,self.num_classes]==tf.matmul([None,hidden_size*4],[hidden_size*4,self.num_classes])
return logits

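3.5.5 Model analysis

1. The Hierarchical Attention Network (HAN) builds the text encoder hierarchically and adds an attention layer at each of the two levels, modeling the importance of the words and of the sentences in the text. HAN suits long texts, which contain many sentences that in turn contain many words, and so lend themselves to hierarchical modeling. HAN first takes the hierarchy of text into account: words form sentences and sentences form documents, so the model is built around these two levels. Words within a sentence do not contribute equally to the classification result, and neither do sentences within a document. Attention is introduced so that each word and each sentence can be weighted differently. The exact formulas are given in the paper listed in 3.5.1.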
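As a rough sketch of the attention step, following the form used in the HAN paper: each hidden state h_t is projected to u_t = tanh(W h_t + b), scored against a learned context vector, the scores are normalized with softmax, and the hidden states are summed under those weights. The plain-Python version below uses made-up numbers, and the names `attention_pool`, `W`, `B`, and `CONTEXT` are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, w, b, context):
    """Weighted sum of hidden states, weighted by a context vector.

    u_t     = tanh(W h_t + b)
    alpha_t = softmax over t of (u_t . context)
    output  = sum over t of alpha_t * h_t
    """
    scores = []
    for h in hidden_states:
        u = [math.tanh(sum(wi * hi for wi, hi in zip(row, h)) + bi)
             for row, bi in zip(w, b)]
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, hidden_states))
              for d in range(dim)]
    return pooled, alphas


# Made-up numbers: three word states of size 2.
states = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
W = [[1.0, 0.0], [0.0, 1.0]]
B = [0.0, 0.0]
CONTEXT = [1.0, 1.0]
sentence_vec, word_weights = attention_pool(states, W, B, CONTEXT)
```

Here the second word gets the largest weight because its projected state aligns best with the context vector. The same function can be applied a second time to the sentence vectors to produce the document vector, which is the hierarchical part of HAN.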

3.6 Transformer

3.6.1 Source paper
"Attention Is All You Need"
3.6.2 Model structure
3.6.3 Implementation steps

From the model structure, the implementation steps of Transformer are:
1. word embedding & position embedding -> 2. encoder (2.1 multi-head self-attention -> 2.2 LayerNorm -> 2.3 position-wise fully connected feed-forward network -> 2.4 LayerNorm) -> 3. linear classifier

3.6.4 Key implementation code

# 1. word embedding + position embedding
input_x_embeded = tf.nn.embedding_lookup(self.Embedding,self.input_x)  #[None,sequence_length, embed_size]
input_x_embeded=tf.multiply(input_x_embeded,tf.sqrt(tf.cast(self.d_model,dtype=tf.float32)))
# despite its name, input_mask is a learned variable that serves as the position embedding
input_mask=tf.get_variable("input_mask",[self.sequence_length,1],initializer=self.initializer)
input_x_embeded=tf.add(input_x_embeded,input_mask) #[None,sequence_length,embed_size].position embedding.

# 2. encoder
encoder_class=Encoder(self.d_model,self.d_k,self.d_v,self.sequence_length,self.h,self.batch_size,self.num_layer,input_x_embeded,input_x_embeded,dropout_keep_prob=self.dropout_keep_prob,use_residual_conn=self.use_residual_conn)
Q_encoded,K_encoded = encoder_class.encoder_fn() #K_v_encoder

Q_encoded=tf.reshape(Q_encoded,shape=(self.batch_size,-1)) #[batch_size,sequence_length*d_model]
with tf.variable_scope("output"):
    logits = tf.matmul(Q_encoded, self.W_projection) + self.b_projection #logits shape:[batch_size,self.num_classes]
print("logits:",logits)
return logits
def encoder_single_layer(self,Q,K_s,layer_index):
    """
    single layer for encoder. each layer has two sub-layers:
    the first is multi-head self-attention mechanism; the second is position-wise fully connected feed-forward network.
    for each sublayer. use LayerNorm(x+Sublayer(x)). input and output of last dimension: d_model
    :param Q: shape should be:       [batch_size*sequence_length,d_model]
    :param K_s: shape should be:     [batch_size*sequence_length,d_model]
    :return:output: shape should be:[batch_size*sequence_length,d_model]
    """
    #1.1 the first is multi-head self-attention mechanism
    multi_head_attention_output=self.sub_layer_multi_head_attention(layer_index,Q,K_s,self.type,mask=self.mask,dropout_keep_prob=self.dropout_keep_prob) #[batch_size,sequence_length,d_model]
    #1.2 use LayerNorm(x+Sublayer(x)). all dimension=512.
    multi_head_attention_output=self.sub_layer_layer_norm_residual_connection(K_s ,multi_head_attention_output,layer_index,'encoder_multi_head_attention',dropout_keep_prob=self.dropout_keep_prob,use_residual_conn=self.use_residual_conn)

    #2.1 the second is position-wise fully connected feed-forward network.
    postion_wise_feed_forward_output=self.sub_layer_postion_wise_feed_forward(multi_head_attention_output,layer_index,self.type)
    #2.2 use LayerNorm(x+Sublayer(x)). all dimension=512.
    postion_wise_feed_forward_output= self.sub_layer_layer_norm_residual_connection(multi_head_attention_output,postion_wise_feed_forward_output,layer_index,'encoder_postion_wise_ff',dropout_keep_prob=self.dropout_keep_prob)
    return  postion_wise_feed_forward_output,postion_wise_feed_forward_output

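3.6.5 Model analysis

1. In the paper "Attention Is All You Need", Transformer refers to the complete encoder-decoder framework. For this text classification task, only the encoder is used. One encoder block contains several sub-modules: multi-head self-attention, skip connection, LayerNorm, and feed-forward.
2. A Transformer needs position information to be added explicitly through a position embedding. Self-attention compares the current word with every word in the sentence, normalizes the similarities into per-word weights, and then sums the transformed V values of the words under those weights to get the attended embedding. Position information is lost in this process. To bring it back, the Transformer gives each position a position embedding and adds it to the word embedding to form the word's input embedding.
3. Self-attention is very good at extracting long-distance dependencies, because it compares the current word with every other word directly. Unlike an RNN it does not have to pass information along a chain of hidden states, and unlike a CNN it does not have to add depth to reach distant words. Transformer also has an inherent advantage in parallel training, since it does not depend on features from the previous time step the way an RNN does.
4. As Zhang Junlin points out in 【6】, it is not only multi-head attention that matters inside a Transformer block: almost every component contributes, making the block a small piece of systems engineering. Skip connections and LayerNorm, for example, play their part too. He also reports that the number of heads in multi-head attention strongly affects the ability to capture long-range features in NLP tasks: the more heads, the better for long-range features.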
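Item 2 can be verified with a minimal single-head self-attention in plain Python. The sketch assumes Q = K = V = the input, with no learned projections, and uses made-up embeddings and position vectors, so it is far simpler than the encoder in the code above.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product attention with Q = K = V = x."""
    d = len(x[0])
    out = []
    for q in x:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                           for k in x])
        out.append([sum(w * v[j] for w, v in zip(weights, x))
                    for j in range(d)])
    return out


def add_positions(x, pos):
    return [[a + b for a, b in zip(tok, p)] for tok, p in zip(x, pos)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # made-up embeddings
swapped = [tokens[1], tokens[0], tokens[2]]       # first two words swapped
positions = [[0.0, 0.1], [0.2, 0.0], [0.3, 0.3]]  # made-up position vectors

plain = self_attention(tokens)
plain_swapped = self_attention(swapped)
with_pos = self_attention(add_positions(tokens, positions))
with_pos_swapped = self_attention(add_positions(swapped, positions))
```

Without position vectors, the output for a given word is the same (up to floating-point rounding) wherever the word sits: `plain[0]` matches `plain_swapped[1]`. Once position vectors are added, `with_pos[0]` and `with_pos_swapped[1]` differ, so word order becomes visible to the model. Note also that every output row is computed independently of the others, which is the source of the parallelism mentioned in item 3.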

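4 Summary

1. Whether an AI problem is tackled with deep neural networks or with hand-crafted features in traditional machine learning, the quality of the result depends mainly on the strength of the feature extractor. Traditional machine learning uses people as the feature extractor, so the reasons behind a solution are easy to explain. Deep learning uses a network of linear and non-linear units as a more abstract feature extractor.
For text, feature extraction ability has four aspects: extracting syntactic features, extracting semantic features, capturing long-distance features, and extracting task-specific composite features. These four judge an NLP feature extractor on capability; parallelism and runtime efficiency are added to judge how practical it is at scale.
On solving problems across domains, Dr. Zhang Junlin, senior algorithm expert at Sina Weibo AI Lab, has said that whether a feature extractor fits the characteristics of the problem domain sometimes decides whether it succeeds, and that many model improvements are really about reshaping the extractor to match the domain better.
2. Large-scale text classification with a huge output space. Text generation (generating one word at a time) can in fact be viewed as text classification over a huge output space. word2vec addresses this with hierarchical softmax: a Huffman tree is built over the class labels, each label sits at a leaf and corresponds to a Huffman code, and each branch carries the probability of taking that path. word2vec also proposes negative sampling, which can be used for this kind of problem as well by rebuilding the labeled samples into a binary classification task: the correct pair (x, y) is a positive sample, and pairs (x, y') with y' sampled from the label distribution are negative samples. In Seq2Seq generation, Sampled Softmax 【8】, Adaptive Softmax, and similar methods are commonly used to ease classification over a huge output space.
3. Text classification when labels are related. Classifiers are usually trained with a one-vs-one or one-vs-rest scheme, for example with SVM or NB. Labels may nevertheless be related, for instance through a label category tree or through correlations between individual labels. For correlated labels, some work treats multi-label classification as a sequence generation problem and reports large improvements; see "SGM: Sequence Generation Model for Multi-Label Classification". Label hierarchies are studied in "A Study of multilabel text classification and the effect of label hierarchy".
4. Advantages of deep-learning-based text classification over traditional models (LR, SVM, and so on). First, it removes manual feature extraction: the network picks up basic features automatically, combines them into higher-level ones, and learns the relation between text features and target classes, so the feature engineering built on TF-IDF keyword extraction is no longer needed and the pipeline becomes end to end. Second, compared with traditional n-gram models, deep models make better use of word order: the filter size in a CNN text classifier acts much like an n-gram, an RNN (LSTM) can exploit longer-range order, and with attention the weight matrix highlights the key words of a sentence. Finally, classification accuracy tends to improve as the number of samples and the depth of the network grow.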
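The negative-sampling reformulation from item 2 can be sketched as follows. This is a simplified illustration with a hypothetical label distribution and function name; word2vec itself samples from a smoothed unigram distribution, which is not reproduced here.

```python
import random


def negative_sampling_examples(x, true_label, label_probs, num_negatives, rng):
    """Turn one (x, y) pair into binary examples ((x, label), target).

    The true pair gets target 1; labels drawn from label_probs
    (redrawn if they hit the true label) get target 0.
    """
    labels = list(label_probs)
    weights = [label_probs[label] for label in labels]
    examples = [((x, true_label), 1)]
    while len(examples) < 1 + num_negatives:
        candidate = rng.choices(labels, weights=weights, k=1)[0]
        if candidate != true_label:
            examples.append(((x, candidate), 0))
    return examples


# Hypothetical label distribution, e.g. estimated from label frequencies.
LABEL_PROBS = {"sports": 0.5, "finance": 0.3, "health": 0.15, "travel": 0.05}
rng = random.Random(42)
examples = negative_sampling_examples(
    "how to rebalance a portfolio", "finance", LABEL_PROBS, 3, rng)
```

Each training step then scores only 1 + `num_negatives` pairs with a binary classifier instead of normalizing over every label, which is what makes the approach attractive when the label set is very large.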

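5 Reference links

【1】brightmart/text_classification
【2】The Annotated Transformer
【3】fastText, TextCNN, TextRNN...: a library of deep learning methods for NLP text classification to choose from
【4】A comparison of word2vec, GloVe, and fastText
【5】From Word Embedding to BERT: a history of pre-training techniques in natural language processing
【6】Abandon illusions and fully embrace the Transformer: a comparison of the three major feature extractors in NLP (CNN/RNN/TF)
【7】"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
【8】Sampled Softmax paper notes: On Using Very Large Target Vocabulary for Neural Machine Translation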
