Getting Started with transformers

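Hugging Face is an open-source community and platform that provides state-of-the-art NLP models, datasets, and related tools.

Models: https://huggingface.co/models

Datasets: https://huggingface.co/datasets

Main model families

Autoregressive: GPT-2, Transformer-XL, XLNet
Autoencoding: BERT, ALBERT, RoBERTa, ELECTRA
Seq2Seq: BART, Pegasus, T5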

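1 Introduction

The transformers library is a machine learning library for natural language processing (NLP). It provides the pretrained models that have been most successful in NLP in recent years, such as BERT, GPT, RoBERTa, and T5.

The library is developed by Hugging Face and is one of the most popular collections of pretrained NLP models. In practice, starting from an already trained model can noticeably improve results and shorten development time compared with training from scratch.

The library also provides tools and APIs that make these pretrained models easier to use for tasks such as text classification, entity recognition, summarization, and translation.

transformers supports several mainstream deep learning frameworks, such as PyTorch and TensorFlow, so users can pick the framework they prefer.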

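BERT (Bidirectional Encoder Representations from Transformers) is a pretrained model for natural language processing (NLP). It was released by Google in 2018 and is regarded as one of the most influential NLP models. BERT is built on the Transformer architecture and uses pretraining to learn context-dependent word representations.

The goal of BERT is to let a model acquire language understanding from data instead of having knowledge and rules hard-coded into an algorithm. This is achieved by pretraining on very large text corpora. Because BERT reads context on both sides of each token, and across a pair of sentences, it handles natural language better than models that see context in one direction only.

Working with BERT involves two stages: pretraining and fine-tuning.

Pretraining: in this stage BERT learns from large amounts of unlabeled text in a self-supervised way. Two objectives are trained together: the masked language model (MLM) and next sentence prediction (NSP). MLM teaches BERT to infer tokens that have been masked out of the input, while NSP asks BERT to predict whether the second sentence of a pair actually follows the first, which is a test of textual coherence. The original BERT was pretrained on BooksCorpus and English Wikipedia; the same procedure can be applied to other text such as emails, web pages, microblog posts, and reviews.
Fine-tuning: in this stage the pretrained model is applied to a concrete task and its weights are adjusted toward that task's objective. BERT can be fine-tuned for many NLP tasks, including sentiment analysis, question answering, and text classification.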
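The MLM objective described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes the masking recipe from the BERT paper (select about 15% of tokens; replace 80% of those with [MASK], 10% with a random token, and leave 10% unchanged). The function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]  # hypothetical toy vocabulary


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); a label is None where no prediction is required."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)  # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(TOY_VOCAB))  # 10%: replace with a random token
            else:
                masked.append(tok)  # 10%: keep the token unchanged
        else:
            masked.append(tok)
            labels.append(None)  # position not selected, so no loss is computed
    return masked, labels


masked, labels = mask_tokens(["the", "cat", "sat", "on", "the", "mat"], mask_prob=0.5)
print(masked)
print(labels)
```

Only the selected positions contribute to the loss, which is why the labels list is None everywhere else.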

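BERT's strength is that it copes with a wide range of irregular text and gives more accurate and useful results than earlier NLP models, because it builds a better representation of context. BERT is also open source: the code and pretrained weights can be freely used and modified, which makes research and development easier.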

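2 Tokenizer: transformers.BertTokenizer

The BERT tokenizer is a preprocessing tool that converts raw text into the integer inputs the BERT model expects.

Specifically, it does the following:

Tokenization: splits text into tokens. BertTokenizer first applies basic cleanup and splitting (whitespace, punctuation, and, for Chinese, one token per character), then applies WordPiece, a greedy longest-match-first lookup against a fixed vocabulary.
ID mapping: replaces each token with its integer id from the vocabulary.
Special tokens: adds [CLS] and [SEP], and optionally pads or truncates the sequence to a fixed length.
Auxiliary outputs: optionally returns token_type_ids, attention_mask, and special_tokens_mask alongside input_ids.

Part-of-speech tagging, named-entity recognition, and sentence splitting are not performed by the tokenizer. They are separate tasks handled by models or other tools.

What encoding a sentence means:

Encoding with the tokenizer turns a sentence into a list of token ids. It does not produce a semantic vector.
The contextual representation comes from the BERT model itself. The model takes the ids as input and passes them through several layers of self-attention, so that each token's vector reflects its surrounding context; the vector at the [CLS] position is commonly used as a representation of the whole sentence.
Because the tokenizer and the pretrained model share one vocabulary, the same encoding step serves every downstream task that uses that model.

In short, the BERT tokenizer splits natural language text into units the model can process and supplies the input data for all later BERT-based tasks.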
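The WordPiece step can be illustrated with a short sketch. This assumes the commonly described greedy longest-match-first procedure, with "##" marking a piece that continues a word; the function name `wordpiece` and the toy vocabulary are hypothetical, and the real BertTokenizer adds further normalization that is omitted here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"hu", "##gging", "face", "##s"}  # hypothetical toy vocabulary
print(wordpiece("hugging", toy_vocab))  # ['hu', '##gging']
print(wordpiece("faces", toy_vocab))    # ['face', '##s']
print(wordpiece("xyz", toy_vocab))      # ['[UNK]']
```

With bert-base-chinese every Chinese character is already its own token, so the sub-word split mostly matters for Latin-script words and numbers.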

The code below demonstrates the tokenizer:

from transformers import BertTokenizer

# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained(
    pretrained_model_name_or_path='bert-base-chinese',
    cache_dir=None,
    force_download=False,
)

# Sentences to encode
sents = [
    '選擇珠江花園的原因就是方便。',
    '筆記本的鍵盤確實爽。',
    '房間太小。其他的都一般。',
    '今天才知道這書還有第6卷,真有點鬱悶.',
    '機器背面似乎被撕了張什麼標籤，殘膠還在。',
]

tokenizer, sents

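2.1 Simple encoding

Simple encoding handles one sentence or one pair of sentences at a time.

The tokenizer's encode method tokenizes the text and converts the tokens to ids.

# Encode a pair of sentences
input_ids = tokenizer.encode(
    text=sents[0],  # sentence 1
    text_pair=sents[1],  # sentence 2
    truncation=True,  # truncate when the input is longer than max_length
    padding='max_length',  # always pad with [PAD] up to max_length
    add_special_tokens=True,
    max_length=30,  # maximum length
    return_tensors=None,   # return a Python list
)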

print(input_ids)

tokenizer.decode(input_ids)

"""
Encoded result:
[101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511, 102, 5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511, 102, 0, 0, 0]

Decoded:
'[CLS] 選 擇 珠 江 花 園 的 原 因 就 是 方 便 。 [SEP] 筆 記 本 的 鍵 盤 確 實 爽 。 [SEP] [PAD] [PAD] [PAD]'
"""

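2.2 Extended encoding

# Extended encoding
out = tokenizer.encode_plus(
    text=sents[0],  # sentence 1
    text_pair=sents[1],  # sentence 2
    truncation=True,  # truncate when the input is longer than max_length
    padding='max_length',  # always pad up to max_length
    max_length=30,  # maximum length
    add_special_tokens=True,
    return_tensors=None,  # 'tf', 'pt', or 'np'; the default (None) returns Python lists
    return_token_type_ids=True,  # return token_type_ids
    return_attention_mask=True,  # return attention_mask
    return_special_tokens_mask=True,  # return special_tokens_mask, which flags special tokens
    # return offset_mapping, the start and end position of each token; only BertTokenizerFast supports this
    #return_offsets_mapping=True,
    # return length, the length of the encoded sequence
    return_length=True,
)

# input_ids           the encoded token ids
# token_type_ids      0 for the first sentence and its special tokens, 1 for the second sentence
# special_tokens_mask 1 at special-token positions, 0 elsewhere
# attention_mask      0 at padding positions, 1 elsewhere
# length              the length of the encoded sequence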
for k, v in out.items():
    print(k, ':', v)
    
"""
input_ids : [101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511, 102, 5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511, 102, 0, 0, 0]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
special_tokens_mask : [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
length : 30
"""

# Decode
tokenizer.decode(out['input_ids'])
"""
'[CLS] 選 擇 珠 江 花 園 的 原 因 就 是 方 便 。 [SEP] 筆 記 本 的 鍵 盤 確 實 爽 。 [SEP] [PAD] [PAD] [PAD]'
"""
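The fields returned by encode_plus follow a simple layout, which the sketch below rebuilds by hand for a sentence pair. It is an illustration only: the function name `encode_pair` is hypothetical, the ids 101, 102, and 0 for [CLS], [SEP], and [PAD] are taken from the output above, and truncation is left out for brevity.

```python
def encode_pair(ids_a, ids_b, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Build BERT-style fields for a sentence pair: [CLS] A [SEP] B [SEP] [PAD]..."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    special_tokens_mask = [1] + [0] * len(ids_a) + [1] + [0] * len(ids_b) + [1]
    attention_mask = [1] * len(input_ids)

    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("input longer than max_length; truncation is not sketched here")

    # Padding positions: id 0, segment 0, flagged as special, and ignored by attention.
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "special_tokens_mask": special_tokens_mask + [1] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


for key, value in encode_pair([7, 8], [9], max_length=7).items():
    print(key, ":", value)
```

Feeding it the 14 and 10 token ids of the two sentences with max_length=30 would give the same four lists as shown above.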

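2.3 Batch extended encoding

First form: a list of single sentences.

# Encode a batch of sentences
out = tokenizer.batch_encode_plus(
    batch_text_or_text_pairs=[sents[0], sents[1]],  # two separate sentences
    add_special_tokens=True,  # add [CLS] and [SEP]
    truncation=True,  # truncate when the input is longer than max_length
    padding='max_length',  # pad shorter inputs up to max_length
    max_length=15,  # maximum length
    return_tensors=None,  # 'tf', 'pt', or 'np'; the default (None) returns Python lists
    return_token_type_ids=True,  # return token_type_ids: 0 for the first sentence and its special tokens, 1 for the second sentence
    return_attention_mask=True,  # return attention_mask: 0 at padding positions, 1 elsewhere
    return_special_tokens_mask=True,  # return special_tokens_mask, which flags special tokens
    # return offset_mapping, the start and end position of each token; only BertTokenizerFast supports this
    # return_offsets_mapping=True,
    # return length, the length of each encoded sequence
    return_length=True,
)

# input_ids           the encoded token ids
# token_type_ids      0 for the first sentence and its special tokens, 1 for the second sentence
# special_tokens_mask 1 at special-token positions, 0 elsewhere
# attention_mask      0 at padding positions, 1 elsewhere
# length              the length of each sequence before padding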
for k, v in out.items():
    print(k, ':', v)
    
"""
input_ids : [[101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 102], [101, 5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511, 102, 0, 0, 0]]
token_type_ids : [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
special_tokens_mask : [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]]
length : [15, 12]
attention_mask : [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
"""

# Decode both sentences
tokenizer.decode(out['input_ids'][0])
"""
'[CLS] 選 擇 珠 江 花 園 的 原 因 就 是 方 便 [SEP]'
"""

tokenizer.decode(out['input_ids'][1])
"""
'[CLS] 筆 記 本 的 鍵 盤 確 實 爽 。 [SEP] [PAD] [PAD] [PAD]'
"""

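Second form: a list of sentence pairs.

# Encode a batch of sentence pairs
out = tokenizer.batch_encode_plus(
    batch_text_or_text_pairs=[(sents[0], sents[1]), (sents[2], sents[3])],  # any number of (sentence, sentence) pairs
    add_special_tokens=True,
    truncation=True,  # truncate when the input is longer than max_length
    padding='max_length',  # pad shorter inputs up to max_length
    max_length=30,  # maximum length
    return_tensors=None,  # 'tf', 'pt', or 'np'; the default (None) returns Python lists
    return_token_type_ids=True,  # return token_type_ids
    return_attention_mask=True,  # return attention_mask
    # return special_tokens_mask, which flags special tokens
    return_special_tokens_mask=True,
    # return offset_mapping, the start and end position of each token; only BertTokenizerFast supports this
    # return_offsets_mapping=True,
    # return length, the length of each encoded sequence
    return_length=True,
)

# input_ids           the encoded token ids
# token_type_ids      0 for the first sentence and its special tokens, 1 for the second sentence
# special_tokens_mask 1 at special-token positions, 0 elsewhere
# attention_mask      0 at padding positions, 1 elsewhere
# length              the length of each sequence before padding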
for k, v in out.items():
    print(k, ':', v)

"""
input_ids : [[101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511, 102, 5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511, 102, 0, 0, 0], [101, 2791, 7313, 1922, 2207, 511, 1071, 800, 4638, 6963, 671, 5663, 511, 102, 791, 1921, 2798, 4761, 6887, 6821, 741, 6820, 3300, 5018, 127, 1318, 117, 4696, 3300, 102]]
token_type_ids : [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
special_tokens_mask : [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]]
length : [27, 30]
attention_mask : [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
"""

# Decode
tokenizer.decode(out['input_ids'][0])
"""
'[CLS] 選 擇 珠 江 花 園 的 原 因 就 是 方 便 。 [SEP] 筆 記 本 的 鍵 盤 確 實 爽 。 [SEP] [PAD] [PAD] [PAD]'
"""

tokenizer.decode(out['input_ids'][1])
"""
'[CLS] 房 間 太 小 。 其 他 的 都 一 般 。 [SEP] 今 天 才 知 道 這 書 還 有 第 6 卷, 真 有 [SEP]'
"""

2.4 Vocabulary operations

# Get the vocabulary
vocab = tokenizer.get_vocab()  # {token: id, ...}
type(vocab), len(vocab), '月光' in vocab  # (dict, 21128, False)
'月' in vocab, '光' in vocab  # (True, True)
# '月' and '光' are both in the vocabulary, but '月光' is not
# Add new tokens
tokenizer.add_tokens(new_tokens=['月光', '希望'])

# Add a new special token
tokenizer.add_special_tokens({'eos_token': '[EOS]'})

vocab = tokenizer.get_vocab()
type(vocab), len(vocab), vocab['月光'], vocab['[EOS]']  # (dict, 21131, 21128, 21130)

# Encode text that contains the new tokens
out = tokenizer.encode(
    text='月光的新希望[EOS]',
    text_pair=None,
    truncation=True,  # truncate when the input is longer than max_length
    padding='max_length',  # pad shorter inputs with [PAD] up to max_length
    add_special_tokens=True,
    max_length=8,  # maximum length
    return_tensors=None,
)

print(out)  # [101, 21128, 4638, 3173, 21129, 21130, 102, 0]

tokenizer.decode(out)  # '[CLS] 月光 的 新 希望 [EOS] [SEP] [PAD]'
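The ids in the output above (21128, 21129, 21130 on top of a 21128-entry vocabulary) suggest that new tokens are appended at the end of the vocabulary. A minimal sketch of that bookkeeping, using a plain dict in place of the real tokenizer; the helper name `add_tokens` and the three-entry vocabulary are hypothetical:

```python
def add_tokens(vocab, new_tokens):
    """Append unseen tokens to a {token: id} vocabulary; return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # next free id is the current vocabulary size
            added += 1
    return added


toy_vocab = {"[PAD]": 0, "月": 1, "光": 2}  # hypothetical toy vocabulary
print(add_tokens(toy_vocab, ["月光", "希望"]))  # 2
print(toy_vocab["月光"], toy_vocab["希望"])      # 3 4
print(add_tokens(toy_vocab, ["月光"]))          # 0, already present
```

When the tokenizer is paired with a model, the model's embedding matrix has no rows for the new ids until it is resized, typically with model.resize_token_embeddings(len(tokenizer)).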

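3 Using transformers models

Using the transformers library involves the following steps:

Install the transformers library

Install it with pip:

pip install transformers

Load a pretrained model

A pretrained model must be loaded before use. Models can be downloaded manually from the Hugging Face website, or fetched directly through the library's API.

Example code for loading a pretrained model: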

from transformers import BertModel, BertTokenizer

model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

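The code above loads the BERT model together with its matching tokenizer. Loading does not tokenize anything yet; text is tokenized only when it is passed to the tokenizer.

Use the pretrained model for an NLP task

Once loaded, the model can be used for NLP tasks such as text classification, named-entity recognition, sequence labeling, and text generation.

Example code for text classification with BERT: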

from transformers import BertForSequenceClassification, BertTokenizer

model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("This is a sentence", return_tensors="pt")
outputs = model(**inputs)

logits = outputs.logits

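The code above runs one sentence through the loaded BERT classification model and returns its logits. Logits are unnormalized scores, not probabilities; apply softmax to turn them into class probabilities. Note that bert-base-uncased ships without a trained classification head, so that layer is randomly initialized and the scores only become meaningful after fine-tuning.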
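Turning logits into probabilities is a softmax. The sketch below uses only the standard library, and the two logit values are made up for illustration; with PyTorch tensors the usual equivalent is torch.softmax(logits, dim=-1).

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [0.3, -0.1]  # hypothetical scores for a two-class model
probs = softmax(example_logits)
print(probs)
print(probs.index(max(probs)))  # index of the predicted class
```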

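Fine-tune the pretrained model

Transfer learning with a pretrained model means fine-tuning it on data from the target domain. Fine-tuning continues training the pretrained model on a specific dataset so that it adapts to that data and task.

Example code for fine-tuning BERT for sentiment classification: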

from transformers import BertForSequenceClassification, BertTokenizer
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
import torch

# 1. Load the datasets; assumed to be prepared already as lists of {'text': str, 'label': int} dicts
train_data = ...
valid_data = ...

# 2. Load the pretrained model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# 3. Train on the GPU when one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

# 4. Training settings
epochs = 5
batch_size = 16
optimizer = AdamW(model.parameters(), lr=2e-5)

# 5. Start training
for epoch in range(epochs):
    # 6. Loop over training batches
    model.train()  # enable dropout
    for i in range(0, len(train_data), batch_size):
        # 6.1 Slice one batch of training data
        batch = train_data[i:i+batch_size]
        # 6.2 Tokenize the texts (the 'text' key is an assumption about the data layout)
        inputs = tokenizer([d['text'] for d in batch], return_tensors="pt", padding=True, truncation=True).to(device)
        # 6.3 Convert the labels to a tensor
        labels = torch.tensor([d['label'] for d in batch]).to(device)
        # 6.4 Forward pass; passing labels makes the model compute the loss
        outputs = model(**inputs, labels=labels)
        # 6.5 Loss
        train_loss = outputs.loss
        # 6.6 Backpropagate to compute gradients
        train_loss.backward()
        # 6.7 Update the parameters
        optimizer.step()
        # 6.8 Reset the gradients to zero
        optimizer.zero_grad()

    # 7. Loop over validation batches
    model.eval()  # disable dropout
    valid_losses = []
    with torch.no_grad():  # evaluation needs no gradients
        for i in range(0, len(valid_data), batch_size):
            # 7.1 Slice one batch of validation data
            batch = valid_data[i:i+batch_size]
            # 7.2 Tokenize the validation texts
            inputs = tokenizer([d['text'] for d in batch], return_tensors="pt", padding=True, truncation=True).to(device)
            # 7.3 Convert the validation labels to a tensor
            labels = torch.tensor([d['label'] for d in batch]).to(device)
            # 7.4 Predict
            outputs = model(**inputs, labels=labels)
            # 7.5 Record the loss
            valid_losses.append(outputs.loss.item())

    # 8. Print training metrics: last training batch loss and mean validation loss
    print(f"Epoch {epoch} training loss: {train_loss.item():.4f}, validation loss: {sum(valid_losses) / len(valid_losses):.4f}")

The code above loads the BERT model and tokenizer, fine-tunes the model on a sentiment classification dataset, and prints the loss after each epoch.

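4 Dataset operations: datasets

4.1 load_dataset

from datasets import load_dataset

# Load the data
# Note: if your network cannot reach the Hub, skip this and go straight to "Load data from disk"
# The dataset is mirrored from seamew/ChnSentiCorp; a local dataset path can be passed instead
dataset = load_dataset(path='lansinuote/ChnSentiCorp')  # download the dataset from the network

dataset
"""Dataset structure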
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 9600
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1200
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1200
    })
})
"""

4.2 save_to_disk

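# save_to_disk writes the dataset to disk
# Note: this requires that the load step above succeeded; otherwise go straight to "Load data from disk"
dataset.save_to_disk(dataset_dict_path='./data/ChnSentiCorp')

4.3 load_from_disk

Load data from disk

# Load data from disk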
from datasets import load_from_disk

dataset = load_from_disk('./data/ChnSentiCorp')

dataset

4.4 Getting the training, test, and validation sets

# Get the training, test, and validation sets
# Training set
dataset_train = dataset['train']
# Test set
dataset_test = dataset['test']
# Validation set
dataset_val = dataset['validation']

# Look at one example
dataset_train[0]
"""
{'text': '選擇珠江花園的原因就是方便，有電動扶梯直接到達海邊，周圍餐館、食廊、商場、超市、攤位一應俱全。酒店裝修一般，但還算整潔。 泳池在大堂的屋頂，因此很小，不過女兒倒是喜歡。 包的早餐是西式的，還算豐富。 服務嗎，一般', 'label': 1}
"""

4.5 sort

Sorting

# Before sorting, the labels are in no particular order
print(dataset_train['label'][:10])  # [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]

# After sorting, the labels are ordered
sorted_dataset = dataset_train.sort('label')
print(sorted_dataset['label'][:10])  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(sorted_dataset['label'][-10:])  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

4.6 shuffle

Shuffling

# Shuffle the rows
shuffled_dataset = sorted_dataset.shuffle(seed=42)
shuffled_dataset['label'][:10]  # [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

4.7 select

Selecting rows by index

dataset_train.select([0, 10, 20, 30, 40, 50])

"""
Dataset({
    features: ['text', 'label'],
    num_rows: 6
})
"""

4.8 filter

Filtering rows

def f(data):
    return data['text'].startswith('選擇')


start_with_ar = dataset_train.filter(f)

4.9 train_test_split

Splitting into a training set and a test set

dataset_train.train_test_split(test_size=0.1)  # 0.1 means the test set takes 10% of the rows
"""
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8640
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 960
    })
})
"""
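The row counts above (8640 and 960 out of 9600) follow directly from test_size=0.1. A plain-Python sketch of a shuffled split shows the arithmetic; this local `train_test_split` function is a hypothetical stand-in that works on a list, not the datasets implementation.

```python
import random


def train_test_split(rows, test_size=0.1, seed=42):
    """Shuffle row indices, then carve off a test_size fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(len(rows) * test_size)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


train, test = train_test_split(list(range(9600)), test_size=0.1)
print(len(train), len(test))  # 8640 960
```

Shuffling before the cut matters when the data is ordered, for example after the sort in section 4.5.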

4.10 shard

Splitting the dataset into equal shards

# Before sharding
"""dataset_train
Dataset({
     features: ['text', 'label'],
     num_rows: 9600
 })
"""
# Split the data evenly into 4 shards and take shard 0
dataset_train.shard(num_shards=4, index=0)
"""
Dataset({
     features: ['text', 'label'],
     num_rows: 2400
 })
"""

4.11 rename_column

Renaming a column

dataset_train.rename_column('text', 'textA')
"""
Dataset({
    features: ['textA', 'label'],
    num_rows: 9600
})
"""

4.12 remove_columns

Removing columns

dataset_train.remove_columns(['text'])
"""
Dataset({
    features: ['label'],
    num_rows: 9600
})
"""

4.13 map

Mapping: apply a function to every example in turn

def f(data):
    data['text'] = 'My sentence: ' + data['text']
    return data


dataset_train_map = dataset_train.map(f)

dataset_train_map['text'][:5]

4.14 set_format

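Setting the output format

# set_format changes the dataset in place; index a single split, not the DatasetDict
dataset_train.set_format(type='torch', columns=['label'])

dataset_train[0]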
"""
{'label': tensor(1)}
"""

4.15 Loading and exporting CSV files

# Load the data from the network
dataset = load_dataset(path='lansinuote/ChnSentiCorp', split='train')
# Save as a CSV file
dataset.to_csv(path_or_buf='./data/ChnSentiCorp.csv')

# Load data in CSV format
csv_dataset = load_dataset(path='csv',
                           data_files='./data/ChnSentiCorp.csv',
                           split='train')
csv_dataset[20]
"""
{'text': '非常不錯，服務很好，位於市中心區，交通方便，不過價格也高！', 'label': 1}
"""

4.16 Loading and exporting JSON files

# Load the data from the network
dataset = load_dataset(path='lansinuote/ChnSentiCorp', split='train')
# Save as a JSON file
dataset.to_json(path_or_buf='./data/ChnSentiCorp.json')

# Load data in JSON format
json_dataset = load_dataset(path='json',
                            data_files='./data/ChnSentiCorp.json',
                            split='train')
json_dataset[20]
"""
{'text': '非常不錯，服務很好，位於市中心區，交通方便，不過價格也高！', 'label': 1}
"""

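4.17 Evaluation metrics: the evaluate library

Docs: https://huggingface.co/docs/evaluate

Which metrics are available?

import evaluate

# List the available metrics (the original run used datasets.list_metrics, which is deprecated)
metrics_list = evaluate.list_evaluation_modules(module_type='metric')

print(len(metrics_list))  # the original run, using datasets.list_metrics, printed 121; the count varies by version
print(metrics_list)

To use a metric, pick a suitable one from the list above. Because load_metric is deprecated and scheduled for removal in the next major version of datasets, install the separate evaluate library: pip install evaluate

# from datasets import load_metric
# FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate

import evaluate
# Load a metric
glue_metric  = evaluate.load('glue', 'mrpc')

# Show usage details
print(glue_metric.inputs_description)

Usage details for the selected metric: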

Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = evaluate.load('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}

    >>> glue_metric = evaluate.load('glue', 'mrpc')  # 'mrpc' or 'qqp'
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0, 'f1': 1.0}

    >>> glue_metric = evaluate.load('glue', 'stsb')
    >>> references = [0., 1., 2., 3., 4., 5.]
    >>> predictions = [0., 1., 2., 3., 4., 5.]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print({"pearson": round(results["pearson"], 2), "spearmanr": round(results["spearmanr"], 2)})
    {'pearson': 1.0, 'spearmanr': 1.0}

    >>> glue_metric = evaluate.load('glue', 'cola')
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'matthews_correlation': 1.0}

Computing a metric:

import evaluate

# Load a metric
glue_metric  = evaluate.load('glue', 'mrpc')
# Data to evaluate
predictions = [0, 1, 0]
references = [0, 1, 1]
# Score
final_score = glue_metric.compute(predictions=predictions, references=references)

final_score
"""
{'accuracy': 0.6666666666666666, 'f1': 0.6666666666666666}
"""
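The two numbers above can be checked by hand. The sketch below computes accuracy and binary F1 (positive class 1) in plain Python for the same predictions and references; the function name `accuracy_and_f1` is hypothetical and is not part of the evaluate library.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Compute accuracy and binary F1 for the given positive class."""
    pairs = list(zip(predictions, references))
    accuracy = sum(p == r for p, r in pairs) / len(pairs)

    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}


print(accuracy_and_f1([0, 1, 0], [0, 1, 1]))
```

Here two of three predictions match (accuracy 2/3), precision is 1 and recall is 1/2, so F1 is also 2/3; the two scores coincide only by chance on this input.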

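5 Pipelines: transformers.pipeline

transformers.pipeline is a convenience API for quickly running the library's pretrained models on text or images.

With transformers.pipeline, tasks such as text classification, named-entity recognition, and question answering take only a few lines of code.

5.1 Text classification

An example of text classification with pipeline:

from transformers import pipeline
# If no model is given, the default is distilbert-base-uncased-finetuned-sst-2-english
# Relying on an unspecified model name and revision is not recommended in production
# classifier = pipeline('text-classification')
# Name the sentiment checkpoint explicitly; plain distilbert-base-uncased has no trained classification head
classifier = pipeline('text-classification', model='distilbert-base-uncased-finetuned-sst-2-english')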
result = classifier('This is a positive sentence.')
print(result)
"""
[{'label': 'POSITIVE', 'score': 0.9994040727615356}]
"""

5.2 Image classification

An example of image classification with pipeline:

from PIL import Image
from transformers import pipeline

image_classifier = pipeline('image-classification', model='google/vit-base-patch16-224')
image = Image.open('image.jpg')
result = image_classifier(image)
print(result)

"""
[{'score': 0.13602155447006226, 'label': 'corn'}, {'score': 0.13296112418174744, 'label': 'plate'}, {'score': 0.10631590336561203, 'label': 'potpie'}, {'score': 0.0996851921081543, 'label': 'frying pan, frypan, skillet'}, {'score': 0.07845009863376617, 'label': 'spatula'}]
"""

5.3 Reading comprehension (question answering)

An example of extractive question answering with pipeline:

from transformers import pipeline

# Question answering
question_answerer = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a 
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune 
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

result = question_answerer(question="What is extractive question answering?",
                           context=context)
print(result)

result = question_answerer(
    question="What is a good example of a question answering dataset?",
    context=context)

print(result)

"""
{'score': 0.6177274584770203, 'start': 34, 'end': 95, 'answer': 'the task of extracting an answer from a text given a question'}
{'score': 0.5152304172515869, 'start': 148, 'end': 161, 'answer': 'SQuAD dataset'}
"""
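The start and end fields in the output are character offsets into the context, so the answer can be recovered by slicing. The check below copies the first result from the output above and keeps only the first sentence of the context (with the leading newline that the raw string begins with); it needs no model.

```python
# First sentence of the context used above, including the leading newline.
context = "\nExtractive Question Answering is the task of extracting an answer from a text given a question."

# First result copied from the pipeline output shown above.
result = {
    "score": 0.6177274584770203,
    "start": 34,
    "end": 95,
    "answer": "the task of extracting an answer from a text given a question",
}

# Slicing the context with the offsets reproduces the answer string.
span = context[result["start"]:result["end"]]
print(span)
print(span == result["answer"])
```

Because the model is extractive, every answer is a substring of the context; it cannot produce wording that is not already there.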

5.4 Fill-mask (cloze)

An example of fill-mask with pipeline:

from pprint import pprint

from transformers import pipeline

# Fill-mask
unmasker = pipeline("fill-mask")

sentence = 'HuggingFace is creating a <mask> that the community uses to solve NLP tasks.'

pprint(unmasker(sentence))

"""
[{'score': 0.1792750507593155,
  'token': 3944,
  'token_str': ' tool',
  'sequence': 'HuggingFace is creating a tool that the community uses to solve NLP tasks.'},
 {'score': 0.11349354684352875,
  'token': 7208,
  'token_str': ' framework',
  'sequence': 'HuggingFace is creating a framework that the community uses to solve NLP tasks.'},
 {'score': 0.05243568494915962,
  'token': 5560,
  'token_str': ' library',
  'sequence': 'HuggingFace is creating a library that the community uses to solve NLP tasks.'},
 {'score': 0.03493542596697807,
  'token': 8503,
  'token_str': ' database',
  'sequence': 'HuggingFace is creating a database that the community uses to solve NLP tasks.'},
 {'score': 0.02860235795378685,
  'token': 17715,
  'token_str': ' prototype',
  'sequence': 'HuggingFace is creating a prototype that the community uses to solve NLP tasks.'}]
"""

5.5 Text generation

An example of text generation with pipeline:

from transformers import pipeline

# Text generation
text_generator = pipeline("text-generation")

text_generator("As far as I am concerned, I will",
               max_length=50,
               do_sample=False)

"""
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]
"""

5.6 Summarization

An example of summarization with pipeline:

from transformers import pipeline

# Summarization
summarizer = pipeline("summarization")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False)

"""
[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002 . At one time, she was married to eight men at once, prosecutors say .'}]
"""

5.7 Named-entity recognition

An example of named-entity recognition with pipeline:

from transformers import pipeline

# Named-entity recognition
ner_pipe = pipeline("ner")

sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO,
therefore very close to the Manhattan Bridge which is visible from the window."""

for entity in ner_pipe(sequence):
    print(entity)
    
"""
{'entity': 'I-ORG', 'score': 0.99957865, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}
{'entity': 'I-ORG', 'score': 0.9909764, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}
{'entity': 'I-ORG', 'score': 0.9982224, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}
{'entity': 'I-ORG', 'score': 0.9994879, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}
{'entity': 'I-LOC', 'score': 0.9994344, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}
{'entity': 'I-LOC', 'score': 0.99931955, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}
{'entity': 'I-LOC', 'score': 0.9993794, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}
{'entity': 'I-LOC', 'score': 0.98625815, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}
{'entity': 'I-LOC', 'score': 0.95142674, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}
{'entity': 'I-LOC', 'score': 0.933659, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}
{'entity': 'I-LOC', 'score': 0.9761654, 'index': 28, 'word': 'Manhattan', 'start': 114, 'end': 123}
{'entity': 'I-LOC', 'score': 0.9914629, 'index': 29, 'word': 'Bridge', 'start': 124, 'end': 130}
"""
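The output above is per word piece ('Hu', '##gging', ...), not per entity. A small post-processing sketch can merge adjacent pieces into whole entities. The function name `group_entities` is hypothetical, and the token list is copied from the output above with only the fields the sketch needs.

```python
def group_entities(tokens):
    """Merge consecutive tokens with the same entity type into (type, text) pairs."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-")[-1]  # 'I-ORG' -> 'ORG'
        last = groups[-1] if groups else None
        if last and last["label"] == label and tok["index"] == last["last_index"] + 1:
            if tok["word"].startswith("##"):
                last["text"] += tok["word"][2:]  # continuation piece: glue on directly
            else:
                last["text"] += " " + tok["word"]  # new word within the same entity
            last["last_index"] = tok["index"]
        else:
            groups.append({"label": label, "text": tok["word"], "last_index": tok["index"]})
    return [(g["label"], g["text"]) for g in groups]


tokens = [
    {"entity": "I-ORG", "index": 1, "word": "Hu"},
    {"entity": "I-ORG", "index": 2, "word": "##gging"},
    {"entity": "I-ORG", "index": 3, "word": "Face"},
    {"entity": "I-ORG", "index": 4, "word": "Inc"},
    {"entity": "I-LOC", "index": 11, "word": "New"},
    {"entity": "I-LOC", "index": 12, "word": "York"},
    {"entity": "I-LOC", "index": 13, "word": "City"},
    {"entity": "I-LOC", "index": 19, "word": "D"},
    {"entity": "I-LOC", "index": 20, "word": "##UM"},
    {"entity": "I-LOC", "index": 21, "word": "##BO"},
    {"entity": "I-LOC", "index": 28, "word": "Manhattan"},
    {"entity": "I-LOC", "index": 29, "word": "Bridge"},
]

for label, text in group_entities(tokens):
    print(label, text)
```

The pipeline can also do this grouping itself when created with the aggregation_strategy argument, for example pipeline("ner", aggregation_strategy="simple").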

5.8 Translation

An example of translation with pipeline:

from transformers import pipeline

# Translation (English to German)
translator = pipeline("translation_en_to_de")

sentence = "Hugging Face is a technology company based in New York and Paris"

translator(sentence, max_length=40)

"""
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
"""

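pipeline covers many more tasks than the ones shown above.

When using pipeline, you can specify both the task type and the pretrained model. The library supports a large number of models and task types, so choose according to your needs.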

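6 Pretraining a BERT model

The main steps are:

Pretrain the BERT model on large-scale unlabeled text.
Fine-tune the pretrained BERT model for a specific task.

In Python, the transformers library supports both pretraining and fine-tuning. The main code for pretraining a BERT model is shown below; because it starts from the bert-base-uncased checkpoint, it continues pretraining on new text and does not train from scratch: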

from transformers import (
    BertTokenizer,
    BertForMaskedLM,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load the tokenizer and model
# BertForMaskedLM is used because the collator below only produces MLM labels;
# BertForPreTraining also needs next-sentence labels before it returns a loss
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Load the dataset and train
# LineByLineTextDataset is deprecated in recent transformers releases
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path='path/to/text/file',
    block_size=128
)

# Randomly mask 15% of the tokens for the MLM objective
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=32,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

# prediction_loss_only belongs in TrainingArguments, not in Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()

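Here, LineByLineTextDataset reads the unlabeled text file, treats each line as one example, and truncates it to block_size tokens. DataCollatorForLanguageModeling assembles the training batches and applies the random masking for the MLM objective. Trainer is the core class that runs the training loop. After training, the model can be saved with the save_pretrained method.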

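7 Fine-tuning a BERT model

The main steps are:

Download a pretrained BERT model.
Load a labeled dataset and preprocess it into the input format BERT expects.
Fine-tune the BERT model to obtain a model for the specific task.

In Python, fine-tuning is straightforward with the transformers library. Example code: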

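from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments
import datasets
import numpy as np

# Load the tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Load the datasets; the CSV files are assumed to have 'text' and 'label' columns
dataset = datasets.load_dataset('csv', data_files={'train': 'path/to/train.csv', 'test': 'path/to/test.csv'})
train_dataset, test_dataset = dataset['train'], dataset['test']

# Tokenization and padding
def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True)

train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))

# Accuracy on the evaluation set; required because metric_for_best_model='accuracy'
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {'accuracy': float((predictions == labels).mean())}

# Fine-tuning
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    warmup_steps=500,  # number of warmup steps over which the learning rate ramps up
    evaluation_strategy='epoch',  # named eval_strategy in newer transformers releases
    save_strategy='epoch',  # must match the evaluation strategy when load_best_model_at_end=True
    logging_dir='./logs',
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()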

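Here, load_dataset loads the training and test datasets, the tokenize function handles tokenization, padding, and other preprocessing, and TrainingArguments sets the training parameters. The Trainer class runs the training loop and also provides evaluation, model saving, and related functions.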
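The warmup_steps setting is easier to picture as a function of the step number. The sketch below assumes a linear schedule (ramp up linearly during warmup, then decay linearly to zero), which is what Trainer uses by default as far as I know; the function name `linear_warmup_lr`, the base rate of 5e-5, and the total of 1500 steps are illustrative values, not taken from the code above.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)  # ramp from 0 up to base_lr
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)  # decay to 0


for step in (0, 250, 500, 1000, 1500):
    print(step, linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=1500))
```

Starting from a small learning rate keeps the first updates from disturbing the pretrained weights too much while the newly added classification layer is still random.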

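In summary, the transformers library makes both pretraining and fine-tuning BERT straightforward. For pretraining, use Trainer to run training and save_pretrained to save the model. For fine-tuning, use Trainer to run the fine-tuning loop and its predict method to make predictions on test data.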
