BERT Introductory Tutorial Study Notes: Word Embedding

Original

sdywtzymy

2020-07-06 17:14

Source: a BERT tutorial video on YouTube

https://www.youtube.com/channel/UCoRX98PLOsaN8PtekB9kWrw

Word Embedding

A word embedding turns a word into a feature vector.

The encoding of a word effectively captures how closely it is related to other words.
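As a rough illustration of "relatedness", the sketch below compares made-up vectors with cosine similarity. The three-dimensional vectors and the words chosen are assumptions for demonstration only; they are not taken from BERT or from the video.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

With these toy values, "cat" ends up closer to "dog" than to "car", which is the kind of relationship an embedding is meant to encode.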

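BERT is pretrained => the embedding of each entry in BERT's vocabulary is fixed
BERT has its own lookup table (LUT) for finding the corresponding embedding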
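A lookup table of this kind can be pictured as a dictionary from token to row index plus a matrix with one row per token. The sketch below is a toy version under that assumption; the vocabulary, the vector values, and the `lookup` helper are all hypothetical.

```python
# Hypothetical toy vocabulary: token string -> row index in the table.
vocab = {"[PAD]": 0, "[UNK]": 1, "the": 2, "cat": 3, "sat": 4}

# Hypothetical embedding table: one fixed vector per vocabulary entry.
# (BERT-base uses 768 dimensions; 2 are used here to keep the sketch readable.)
embedding_table = [
    [0.0, 0.0],
    [0.1, 0.1],
    [0.3, 0.7],
    [0.9, 0.2],
    [0.4, 0.5],
]

def lookup(tokens):
    """Map tokens to ids, then ids to rows of the table."""
    ids = [vocab[token] for token in tokens]  # raises KeyError for unknown tokens
    return ids, [embedding_table[i] for i in ids]

ids, vectors = lookup(["the", "cat", "sat"])
print(ids)
print(vectors)
```

The `KeyError` for a missing token is deliberate in this sketch: a word has to be mapped onto entries that exist in the table before any lookup can happen.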

For words that are not in this table: BERT splits the unknown word into several subwords and processes those.

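FastText takes a similar approach, with one difference: FastText combines the subword vectors, using their mean as the vector of the new word, whereas BERT treats each subword as a separate token in its own right. So if a sentence has ten words and one of them is "embedding", it will contain 12 tokens after processing.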
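The difference can be sketched in a few lines. The split of "embedding" into three pieces, the two-dimensional vectors, and the sample sentence are assumptions for illustration. FastText actually works with character n-grams rather than "##" pieces; the same pieces are reused for both styles only to keep the comparison short.

```python
def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# Assumed split of "embedding" into three pieces, with made-up 2-d vectors.
pieces = ["em", "##bed", "##ding"]
piece_vectors = {"em": [1.0, 0.0], "##bed": [0.0, 1.0], "##ding": [1.0, 1.0]}

# FastText-style: merge the piece vectors into a single vector for the word.
fasttext_style = average([piece_vectors[p] for p in pieces])

# BERT-style: each piece stays a separate token in the sequence.
sentence = "I am learning how the embedding layer of BERT works".split()
bert_tokens = []
for word in sentence:
    bert_tokens.extend(pieces if word == "embedding" else [word])

print(len(sentence), "words ->", len(bert_tokens), "tokens")
```

The sentence has ten words, and replacing one word by three pieces gives twelve tokens, matching the count in the text.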

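What about a completely new word that is not made up of other words? The vocabulary also contains individual characters, so such a word can still be broken down into single characters.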

In BERT, every subword that does not start a word is prefixed with ##

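The ## is attached directly to the subword and is part of the token itself, so, for example, ding and ##ding both exist as separate vocabulary entries. This is a redundancy.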
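The splitting described above can be sketched as a greedy longest-match-first search over the vocabulary, which is how WordPiece tokenization is usually described. The `wordpiece` function and the tiny `toy_vocab` below are hypothetical; the real BERT vocabulary has roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, not even a single character
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; note that "ding" and "##ding" are distinct entries.
toy_vocab = {"em", "##bed", "##ding", "ding"}

print(wordpiece("embedding", toy_vocab))
print(wordpiece("ding", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

In this sketch "ding" as a whole word matches the entry without the prefix, while the same letters at the end of "embedding" match "##ding", which is why both entries are needed.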

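BERT's vocabulary list:

(Note: the author of the video uses 1-indexed positions here, whereas BERT's token ids are actually 0-indexed, so [UNK] is 100, [CLS] is 101, and so on.)

The individual-character section of the list covers characters from many scripts, including Chinese, English, and Arabic.

The attention mask and the token type ids are both built from the input ids, so the input ids must already include [CLS] and [SEP].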

The examples below come from an official example in the Transformers library:

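sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

# Sequence B encodes to 19 ids, starting with [CLS] (101) and ending with [SEP] (102)
encoded_sequence_b == [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]

# Sequence A has only 8 tokens, so it is padded to the length of B;
# its attention mask marks the 8 real tokens with 1 and the 11 padding positions with 0
attention_mask_a == [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]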
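The relationship between padding and the attention mask can be reproduced without the library. In the sketch below, `pad_batch` is a hypothetical helper, the pad id 0 is BERT's [PAD] id, and the id 1603 for "short" is an assumption; the other ids of the short sequence are taken from the long sequence above.

```python
def pad_batch(batch, pad_id=0):
    """Pad every id list to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 = real token, 0 = padding that attention should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

# [CLS] This is a short sequence . [SEP]  (1603 for "short" is an assumed id)
ids_a = [101, 1188, 1110, 170, 1603, 4954, 119, 102]
ids_b = [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110,
         1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]

input_ids, attention_mask = pad_batch([ids_a, ids_b])
print(attention_mask[0])
print(attention_mask[1])
```

The shorter sequence gets eight ones followed by eleven zeros, while the longer sequence needs no padding and its mask is all ones.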

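from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint; its vocabulary matches the ids below

sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

# Encoding a pair adds [CLS] at the start and [SEP] after each sequence
encoded_sequence = tokenizer.encode(sequence_a, sequence_b)
assert tokenizer.decode(encoded_sequence) == "[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]"

# Calling the tokenizer directly returns a dict that also contains token_type_ids
encoded_dict = tokenizer(sequence_a, sequence_b)
assert encoded_dict['input_ids'] == [101, 20164, 10932, 2271, 7954, 1110, 1359, 1107, 17520, 102, 2777, 1110, 20164, 10932, 2271, 7954, 1359, 136, 102]
assert encoded_dict['token_type_ids'] == [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]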
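To see why the token type ids depend on where [CLS] and [SEP] sit, the pair layout can be rebuilt by hand from the ids listed above. `build_pair` is a hypothetical helper, not part of the Transformers library; it is a minimal sketch assuming the usual layout of [CLS] A [SEP] B [SEP].

```python
CLS_ID, SEP_ID = 101, 102  # BERT's 0-indexed ids for [CLS] and [SEP]

def build_pair(ids_a, ids_b):
    """Join two id lists as [CLS] A [SEP] B [SEP] and mark which segment each id belongs to."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sequence A, and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

# Ids of the two sequences without special tokens, copied from the example above.
ids_a = [20164, 10932, 2271, 7954, 1110, 1359, 1107, 17520]
ids_b = [2777, 1110, 20164, 10932, 2271, 7954, 1359, 136]

input_ids, token_type_ids = build_pair(ids_a, ids_b)
print(input_ids)
print(token_type_ids)
```

The result has ten zeros followed by nine ones, the same pattern as the token type ids in the library example.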
