Introduction to BERT

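BERT is a pre-trained language representation model that Google released in 2018. It brought large gains across many NLP tasks and is widely seen as a turning point for the field. Plenty of blog posts already explain how BERT works, so look those up if you are interested in the theory. This post covers how to use BERT to obtain Chinese word vectors for downstream work such as text classification, named entity recognition, and sentiment classification.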

Downloading the code and models

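Google's work shows at every turn how much money matters; after all, **All you need is money**. BERT stacks Transformer encoder layers only (12 in BERT-Base, 24 in BERT-Large), unlike the original Transformer with its 6 encoder and 6 decoder layers. Pre-training is extremely expensive: it needs high-end hardware and a great deal of training time. Helpfully, Google has published several pre-trained models that we can use directly and fine-tune. This is also where NLP is heading: transfer learning.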

BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Cased: 12-layer, 768-hidden, 12-heads , 110M parameters
BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Multilingual Uncased (Orig, not recommended; use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

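These are some of the models Google has pre-trained; they can be downloaded from the project page: https://github.com/google-research/bert
We use BERT-Base, Chinese: Chinese Simplified and Traditional, a character-level Chinese model trained on Simplified and Traditional Chinese.
Download link: https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
After downloading the model, clone Google's BERT project: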

git clone https://github.com/google-research/bert
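Before going further, it can help to confirm that the model archive was unpacked completely. The following is a minimal sketch, assuming the unzipped directory contains the three items the extraction command refers to (vocab.txt, bert_config.json, and checkpoint files whose names start with bert_model.ckpt). The function name `missing_model_files` is hypothetical and not part of the BERT project.

```python
import os


def missing_model_files(model_dir):
    """Return the expected model files that are absent from model_dir.

    Assumption: the unzipped model directory holds vocab.txt,
    bert_config.json, and one or more files prefixed bert_model.ckpt.
    """
    missing = []
    for name in ("vocab.txt", "bert_config.json"):
        if not os.path.isfile(os.path.join(model_dir, name)):
            missing.append(name)
    # A TensorFlow checkpoint is saved as several files sharing one prefix,
    # so only the prefix is checked here.
    entries = os.listdir(model_dir) if os.path.isdir(model_dir) else []
    if not any(entry.startswith("bert_model.ckpt") for entry in entries):
        missing.append("bert_model.ckpt*")
    return missing
```

Calling `missing_model_files` on the directory you unzipped into should return an empty list; anything else names what still needs to be downloaded or extracted.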

To extract word vectors from text, use the extract_features.py script in the project. The official example is:

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8

The parameters are:
**input_file:** the file to extract features from, in the following format:

# Sentence A and Sentence B are separated by the ||| delimiter for sentence  pair tasks like question answering and entailment.
# For single sentence inputs, put one sentence per line and DON'T use the  delimiter.
echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer' > /tmp/input.txt

For a sentence pair, write:

吃了嗎？ ||| 吃過了

Here ||| is the delimiter between sentence A and sentence B.
For a single sentence, no ||| delimiter is needed:

訓練單個句子的詞向量
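The input format above can be produced programmatically. This is a minimal sketch, assuming UTF-8 text and one example per line; the helper names `format_input_line` and `write_input_file` are hypothetical and not part of the BERT project.

```python
DELIMITER = " ||| "


def format_input_line(sentence_a, sentence_b=None):
    """Build one input line: a single sentence, or a pair joined by |||."""
    sentence_a = sentence_a.strip()
    if sentence_b is None:
        # A single sentence must not contain the delimiter, otherwise
        # it could be read as a sentence pair.
        if "|||" in sentence_a:
            raise ValueError("single sentences must not contain '|||'")
        return sentence_a
    return sentence_a + DELIMITER + sentence_b.strip()


def write_input_file(path, examples):
    """Write (sentence_a, sentence_b_or_None) tuples, one per line."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence_a, sentence_b in examples:
            f.write(format_input_line(sentence_a, sentence_b) + "\n")
```

For example, `write_input_file("/tmp/input.txt", [("吃了嗎？", "吃過了"), ("訓練單個句子的詞向量", None)])` writes one pair line followed by one single-sentence line.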

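**vocab_file:** path to the vocabulary file. BERT_BASE_DIR is the directory where the pre-trained BERT-Base, Chinese: Chinese Simplified and Traditional model was unzipped (same below).
**bert_config_file:** path to the network configuration file.
**init_checkpoint:** path to the model checkpoint.
**layers:** which layers to output; -1 is the last layer, -2 the second to last, and so on.
**max_seq_length:** maximum sequence length; set it according to your task, keeping in mind that longer inputs are truncated. If your GPU has little memory, reduce this value to save memory.
**batch_size:** number of input sequences processed per batch.
**output_file:** path of the output. BERT writes one JSON object per input line; a single object is shown expanded below, with // comments added for explanation: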

{
  "linex_index": 0,
  "features": [
    { "token": "[CLS]", // start-of-sentence marker
      "layers": [
        { "index": -1, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -2, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -3, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -4, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] }
      ]
    },
    { "token": "\u769f", // first character of the sentence
      "layers": [
        { "index": -1, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] }, // output of the last layer (-1) for the first character
        { "index": -2, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] }, // output of the second-to-last layer (-2) for the first character
        { "index": -3, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] }, // output of the third-to-last layer (-3) for the first character
        { "index": -4, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] }
      ]
    },
    { "token": "\u45ef", // second character of the sentence
      "layers": [
        { "index": -1, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -2, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -3, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -4, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] }
      ]
    },
    ......
    { "token": "[SEP]", // end-of-sentence marker
      "layers": [
        { "index": -1, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -2, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -3, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] },
        { "index": -4, "values": [0.402158, -7.281092, -0.351869, -0.432365, -0.453649 ...(dim=768)] }
      ]
    }
  ]
}

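Set the layers parameter as needed to get the outputs of the layers you want (the word vectors).
PS: For convenience, I copied one line of vector values and pasted it several times above; real vectors differ from token to token and from layer to layer.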
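Once the output file exists, the vectors still have to be read back for downstream use. The following is a minimal sketch, assuming the file holds one valid JSON object per line with the keys shown in the sample above (`linex_index`, `features`, `token`, `layers`, `index`, `values`) and without the explanatory // comments. The function name `read_token_vectors` is hypothetical.

```python
import json


def read_token_vectors(jsonl_path, layer_index=-1):
    """Yield (line index, [(token, vector), ...]) for each input line.

    layer_index selects which extracted layer to keep; it must be one of
    the values passed to --layers when the file was produced.
    """
    with open(jsonl_path, encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if not raw:
                continue  # skip blank lines
            record = json.loads(raw)
            pairs = []
            for feature in record["features"]:
                for layer in feature["layers"]:
                    if layer["index"] == layer_index:
                        pairs.append((feature["token"], layer["values"]))
                        break
            yield record["linex_index"], pairs
```

Iterating over `read_token_vectors("/tmp/output.jsonl", layer_index=-1)` gives, for each input line, the tokens (including [CLS] and [SEP]) paired with their last-layer vectors. If the requested layer was not extracted, the token list for each line is simply empty.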
