XLNet Chinese Text Classification

1. XLNet Overview

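XLNet is another major advance in natural language processing (NLP) since BERT established the pretrain-then-fine-tune paradigm. It combines characteristics of autoregressive (AR, unidirectional) and autoencoding (AE, bidirectional) language models, uses Transformer-XL as its feature extractor (segment-level recurrence plus relative positional encoding, which lets it handle very long text), and introduces permutation language modeling (PLM).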
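
To make the segment-level recurrence idea concrete, here is a minimal plain-Python sketch. It only computes which token positions fall into each segment and which earlier positions would be available as cached memory; the function name and the parameters `segment_len` and `memory_len` are my own illustrative choices, not part of any XLNet or Transformer-XL API, and a real model caches hidden states rather than positions.

```python
def segment_windows(n_tokens, segment_len, memory_len):
    """Split a long text into fixed-size segments and, for each segment,
    report the range of earlier positions kept as memory.

    Ranges are half-open (start, end) pairs of token positions.
    """
    windows = []
    for start in range(0, n_tokens, segment_len):
        end = min(start + segment_len, n_tokens)
        # Memory reaches back at most memory_len positions before the segment.
        mem_start = max(0, start - memory_len)
        windows.append({"segment": (start, end), "memory": (mem_start, start)})
    return windows


if __name__ == "__main__":
    for window in segment_windows(n_tokens=10, segment_len=4, memory_len=6):
        print(window)
```

The first segment has an empty memory range; later segments can look back across segment boundaries, which is what allows dependencies longer than a single segment.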

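PLM keeps the strengths of an AR language model (it estimates the probability distribution of a text corpus through autoregressive factorization, which suits text generation, NLG) while bringing in the bidirectional context of an AE language model: the factorization order of the sentence is permuted while the prediction targets stay fixed, so each target can be conditioned on context from both sides, which helps text understanding (NLU).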
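
The permutation idea can be illustrated with a small plain-Python sketch. This is not XLNet's actual implementation (which uses two-stream attention inside the Transformer); it only shows how a sampled factorization order turns into an attention mask in which each position may see exactly the positions that precede it in that order, wherever they sit in the sentence. The function name and the `seed` parameter are hypothetical.

```python
import random


def permutation_attention_mask(seq_len, seed=0):
    """Sample a factorization order and build the matching attention mask.

    mask[i][j] is True when position i may attend to position j, i.e. when
    j comes strictly earlier than i in the sampled factorization order.
    The tokens themselves stay in their original positions.
    """
    rng = random.Random(seed)
    order = list(range(seq_len))
    rng.shuffle(order)
    # rank[pos] = step at which position pos is predicted
    rank = {pos: step for step, pos in enumerate(order)}
    mask = [[rank[j] < rank[i] for j in range(seq_len)]
            for i in range(seq_len)]
    return order, mask


if __name__ == "__main__":
    order, mask = permutation_attention_mask(4, seed=42)
    print("factorization order:", order)
    for row in mask:
        print(["x" if visible else "." for visible in row])
```

Because the order is random, a position can end up conditioned on tokens to its right as well as to its left, while the model still predicts one token at a time as an AR model does.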

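Moreover, the prediction of hidden characters/words, analogous to BERT's Masked LM, happens inside XLNet's multi-head attention (through attention masks) rather than by replacing input tokens. This removes BERT's mismatch between pretraining, where the input contains [MASK] tokens, and fine-tuning, where it does not.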

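As of 2019-08-29 Google has not released a Chinese XLNet, but HIT (Harbin Institute of Technology) has already published a Chinese XLNet pretrained model (kudos to HIT), although the download sits on iFLYTEK Cloud, which is a bit baffling...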

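With a pretrained model in hand, the natural next step is fine-tuning experiments of all kinds: sentence embeddings, classification, similarity, reading comprehension, text generation...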

xlnet-embedding: https://github.com/yongzhuo/nlp_xiaojiang/tree/master/FeatureProject/xlnet

xlnet-chinese-text-classification: https://github.com/yongzhuo/Keras-TextClassification

2. XLNet Classification Example

It is much the same as fine-tuning BERT, with a few subtle differences.

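Taking keras-xlnet as an example, when loading the pretrained model you can set target_len (the target length, i.e. the maximum length of the current input text), the attention type ('uni' or 'bi'), and memory_len (the longest dependency carried across segments of a long text, as in Transformer-XL).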
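
A hedged sketch of how these settings might be passed to keras-xlnet. The helper names `xlnet_load_kwargs` and `load_xlnet` are hypothetical, the default values are placeholders, and the file names `xlnet_config.json` / `xlnet_model.ckpt` are assumptions about how the checkpoint directory is laid out. The keyword names follow my reading of the keras-xlnet README for `load_trained_model_from_checkpoint`; please verify them against the version you have installed.

```python
import os


def xlnet_load_kwargs(model_dir, batch_size=2, target_len=32, memory_len=0,
                      attention_type='bi', in_train_phase=False):
    """Collect the loading options discussed above into one dict."""
    if attention_type not in ('uni', 'bi'):
        raise ValueError("attention_type must be 'uni' or 'bi'")
    return dict(
        # Assumed file names inside the pretrained-model directory.
        config_path=os.path.join(model_dir, 'xlnet_config.json'),
        checkpoint_path=os.path.join(model_dir, 'xlnet_model.ckpt'),
        batch_size=batch_size,
        memory_len=memory_len,    # how far back cached segments reach
        target_len=target_len,    # maximum length of the current input
        in_train_phase=in_train_phase,
        attention_type=attention_type,
    )


def load_xlnet(model_dir, **overrides):
    """Load the pretrained model; requires keras-xlnet and a checkpoint."""
    # Imported lazily so the helper above works without keras-xlnet installed.
    from keras_xlnet import load_trained_model_from_checkpoint
    return load_trained_model_from_checkpoint(
        **xlnet_load_kwargs(model_dir, **overrides))
```

For short-text classification a bidirectional attention type with no memory is a reasonable starting point under these assumptions, for example `load_xlnet('/path/to/chinese_xlnet', target_len=64)`, where the path is a placeholder.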

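You can also tap different layers and combine them in various ways. The first HIT release of Chinese XLNet has 24 transformer blocks, which the Keras model exposes as 246 layers: 6 input and embedding layers, followed by 10 Keras layers per block, each group of 10 forming one block (6 + 24 × 10 = 246). Note that some layers output two tensors, so take care when picking layer outputs.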
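
The layer arithmetic can be checked with a few lines of Python. This sketch assumes the layout described above (6 input/embedding layers first, then exactly 10 consecutive Keras layers per block) and that the last layer of each group is the block's output; both are assumptions about ordering, so inspect `model.layers` on the real model before relying on the indices. The function name is hypothetical.

```python
def block_end_indices(n_blocks=24, n_input_layers=6, layers_per_block=10):
    """Return the 0-based Keras layer index at which each block ends."""
    return [n_input_layers + layers_per_block * (k + 1) - 1
            for k in range(n_blocks)]


indices = block_end_indices()
total_layers = indices[-1] + 1

if __name__ == "__main__":
    print("total layers:", total_layers)
    print("end of first block:", indices[0])
    print("end of last block:", indices[-1])
```

With such a list, selecting the output of the last block or of the second-to-last block becomes a simple lookup instead of manual counting.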

A simple XLNet fine-tuning example follows; for the embedding details see GitHub: https://github.com/yongzhuo/Keras-TextClassification/blob/master/keras_textclassification/base/embedding.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-
# @time     :2019/8/28 23:06
# @author   :Mo
# @function :graph for XLNet fine-tuning; no extra network on top, only one Dense layer with the classification activation
# @paper    :XLNet: Generalized Autoregressive Pretraining for Language Understanding

from __future__ import print_function, division

from keras.layers import SpatialDropout1D, Conv1D, GlobalMaxPooling1D, Dense
from keras.layers import Dropout, Reshape, Concatenate, Lambda
from keras.layers import LSTM, GRU
from keras.layers import Flatten
from keras.models import Model
from keras import backend as K
from keras import regularizers

from keras_textclassification.base.graph import graph

import numpy as np


class XlnetGraph(graph):
    def __init__(self, hyper_parameters):
        """
            Initialization
        :param hyper_parameters: json, hyperparameters
        """
        super().__init__(hyper_parameters)

    def create_model(self, hyper_parameters):
        """
            Build the neural network
        :param hyper_parameters: json, hyper parameters of network
        :return: tensor, model
        """
        super().create_model(hyper_parameters)
        embedding_output = self.word_embedding.output
        x = embedding_output
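        # x = Lambda(lambda x : x[:, 0:1, :])(embedding_output) # take the token at position 0 (BERT-style CLS)
        # note: the original XLNet input format places <cls> at the end of the sequence, not at position 0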
        # # text cnn
        # bert_output_emmbed = SpatialDropout1D(rate=self.dropout)(embedding_output)
        # concat_out = []
        # for index, filter_size in enumerate(self.filters):
        #     x = Conv1D(name='TextCNN_Conv1D_{}'.format(index),
        #                filters= self.filters_num, # int(K.int_shape(embedding_output)[-1]/self.len_max),
        #                strides=1,
        #                kernel_size=self.filters[index],
        #                padding='valid',
        #                kernel_initializer='normal',
        #                activation='relu')(bert_output_emmbed)
        #     x = GlobalMaxPooling1D(name='TextCNN_MaxPool1D_{}'.format(index))(x)
        #     concat_out.append(x)
        # x = Concatenate(axis=1)(concat_out)
        # x = Dropout(self.dropout)(x)
        x = Flatten()(x)
        # final classification layer (activation set by self.activate_classify, e.g. softmax)
        dense_layer = Dense(self.label, activation=self.activate_classify)(x)
        output_layers = [dense_layer]
        self.model = Model(self.word_embedding.input, output_layers)
        self.model.summary(120)
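
As a rough feel for the size of this classification head, the sketch below counts the trainable parameters of the final Dense layer when it sits on top of Flatten (every token position feeds the classifier) versus on top of a single token vector. The values for `len_max`, `hidden` and `n_labels` are hypothetical placeholders, not settings taken from the repository.

```python
def dense_params(n_features, n_labels):
    """Weights plus biases of a Dense layer with n_labels output units."""
    return n_features * n_labels + n_labels


# Hypothetical sizes: 32 token positions, 768-dim hidden states, 17 classes.
len_max, hidden, n_labels = 32, 768, 17

flatten_head = dense_params(len_max * hidden, n_labels)  # Flatten, then Dense
single_token_head = dense_params(hidden, n_labels)       # one token vector, then Dense

if __name__ == "__main__":
    print("Flatten head parameters:", flatten_head)
    print("single-token head parameters:", single_token_head)
```

The Flatten head grows linearly with the maximum text length, which is worth keeping in mind when raising it for longer inputs.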

Hope this helps!

If you spot any shortcomings, please point them out. Thanks!
