Background
Vectorizing sentences with BERT.
Implementation
For the TensorFlow version I use Dr. Han Xiao's bert-as-service directly. It is genuinely beginner-friendly; usage boils down to two steps, installing the server and the client:
pip install bert-serving-server # server
pip install bert-serving-client # client, independent of `bert-serving-server`
After installing the server, start the service, for example:

bert-serving-start -model_dir /home/pretained_models/chinese_wwm_ext_L-12_H-768_A-12 -num_worker=4

The model_dir argument lets you point at whichever BERT model you want to serve; here I use the WWM-EXT version released by HIT (Harbin Institute of Technology). Test code on the client side:
def test_bert_tf(string):
    from bert_serving.client import BertClient
    bc = BertClient()
    s_encode = bc.encode([string])
    print(s_encode[0])
Although the approach above is simple and easy to pick up, I still find doing it myself more satisfying, for example with Hugging Face's transformers. How do we validate the result? Treat the sentence vector encoded by bert-as-service as the reference value, feed the same text to transformers, and try to reproduce the same sentence vector.
bert-as-service's default sentence-vector construction takes the hidden states of the second-to-last layer and averages them over the tokens, i.e. the chosen layer is the second-to-last and the pooling strategy is REDUCE_MEAN, so that is exactly the recipe we reproduce with transformers below.
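As a standalone illustration of that pooling recipe (independent of any model download; the tensor shapes here are placeholders matching BERT-base, which returns 12 layers plus the embedding layer), REDUCE_MEAN over the second-to-last layer is simply a mean over the token axis:

```python
import torch

# Placeholder for the per-layer hidden states a BERT-base forward pass returns:
# a tuple of (12 + 1) tensors, each shaped [batch, seq_len, hidden] = [1, 8, 768]
hidden_states = tuple(torch.randn(1, 8, 768) for _ in range(13))

second_to_last = hidden_states[-2]               # layer choice: second-to-last
sentence_embedding = second_to_last.mean(dim=1)  # REDUCE_MEAN over tokens
print(sentence_embedding.shape)                  # torch.Size([1, 768])
```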
import torch
from transformers import BertTokenizer, BertModel

UNCASE = "/home/pretained_models/chinese_wwm_ext_pytorch"
VOCAB = "vocab.txt"
tokenizer = BertTokenizer.from_pretrained(UNCASE + "/" + VOCAB)
# output_hidden_states=True is required to get the hidden states of every layer
model = BertModel.from_pretrained(UNCASE, output_hidden_states=True)
model.eval()

string = '寫代碼不香嗎'
string1 = "[CLS]" + string + "[SEP]"
# Convert tokens to vocabulary indices
tokenized_string = tokenizer.tokenize(string1)
tokens_ids = tokenizer.convert_tokens_to_ids(tokenized_string)
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([tokens_ids])
with torch.no_grad():
    outputs = model(tokens_tensor)  # (last_hidden_state, pooler_output, hidden_states)
if model.config.output_hidden_states:
    hidden_states = outputs[2]
    second_to_last_layer = hidden_states[-2]
    # Only one sentence, so the shape is [1, seq_len, 768]
    token_vecs = second_to_last_layer[0]
    print(token_vecs.shape)
    # Average all token vectors to get the sentence embedding
    sentence_embedding = torch.mean(token_vecs, dim=0)
    print(sentence_embedding.shape)
    print(sentence_embedding[0:10])

print("tf version-----")
from bert_serving.client import BertClient
bc = BertClient()
s_encode = bc.encode([string])
print(s_encode[0].shape)
print(s_encode[0][0:10])
The results are as follows:

Comparing the first 10 dimensions shows the two vectors are identical. Going one step further, compute their cosine similarity:
tf_tensor = torch.tensor(s_encode[0])
similarity = torch.cosine_similarity(sentence_embedding, tf_tensor, dim=0)
print(similarity)
The cosine similarity is 1, so the two sentence vectors agree.
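One caveat worth remembering: a cosine similarity of 1 by itself only shows the two vectors are parallel, not element-wise equal; it is the matching first-10 dimensions together with the similarity that makes the case. A stricter check, sketched here on toy vectors, compares the values directly:

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = 2 * a  # parallel to a, but not equal

# Cosine similarity ignores magnitude, so this is (numerically) 1
print(torch.cosine_similarity(a, b, dim=0))

# Element-wise comparison distinguishes the two cases
print(torch.allclose(a, b))   # False
print((a - b).abs().max())    # tensor(3.)
```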