transformers Library Study Notes (1): Installation and Testing

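I used to think of transformers as a behemoth, but once I actually worked with it, it turned out to be extremely friendly. Thanks to the Hugging Face team. Original post: tmylla.github.io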

Installation

My versions: Python 3.6.9; PyTorch 1.2.0; CUDA 10.0.

pip install transformers

Before running pip, make sure PyTorch 1.1.0 or later is installed.
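The PyTorch requirement can be checked from Python before installing. The following is a minimal sketch, not part of transformers: `parse_version` and `meets_minimum` are helper names I made up for illustration, and the 1.1.0 minimum is simply the figure quoted above.

```python
import re


def parse_version(version):
    """Turn a string such as '1.2.0+cu100' into (1, 2, 0), ignoring any build suffix."""
    numbers = [int(n) for n in re.findall(r"\d+", version.split("+")[0])[:3]]
    numbers += [0] * (3 - len(numbers))  # pad '1.1' to (1, 1, 0)
    return tuple(numbers)


def meets_minimum(installed, minimum="1.1.0"):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; install it before running pip install transformers.")
    else:
        verdict = "OK" if meets_minimum(torch.__version__) else "too old, need 1.1.0+"
        print(f"PyTorch {torch.__version__}: {verdict}")
```

Comparing tuples of integers rather than raw strings avoids the usual trap where "1.10.0" sorts before "1.2.0".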

Testing

Verification code and result

python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('I hate you'))"

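After you enter the command above on the command line, transformers automatically downloads the model it needs. If it prints the following result, the installation succeeded.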

[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]

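Files downloaded by the transformers pipeline

transformers saves automatically downloaded models under C:\Users\username\.cache\torch\. Once a model has been downloaded, its files can be moved to another location. The individual files are: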

  1. Each json file records the ‘url’ and ‘etag’ of the file it accompanies.

  2. ‘a41…’ is a configuration file: distilbert-base-uncased-config.

  3. ‘26b…’ is the vocabulary file: bert-base-uncased-vocab.

  4. ‘437…’ is the configuration file of the SST-2 fine-tuned model: distilbert-base-uncased-finetuned-sst-2-english-config. Note that it is a different file from ‘a41…’.

  5. ‘57d…’ is the model card file: distilbert-base-uncased-finetuned-sst-2-english-modelcard.

  6. ‘dd7…’ is the model weights file: distilbert-base-uncased-finetuned-sst-2-english-pytorch_model.bin.
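Because the cached file names are hashes, the json files are the easiest way to tell which file is which. Here is a small sketch that reads every json file in the cache and maps each data file to its recorded URL. It assumes that each json file sits next to its data file under the same name plus a `.json` extension, which matches what I saw in my cache but may differ between versions. `list_cached_files` and `DEFAULT_CACHE` are names I chose for illustration.

```python
import json
from pathlib import Path

# Assumed default location, following the path quoted above; adjust for your system.
DEFAULT_CACHE = Path.home() / ".cache" / "torch"


def list_cached_files(cache_dir=DEFAULT_CACHE):
    """Map each cached data file name to the URL recorded in its json file."""
    cache_dir = Path(cache_dir)
    mapping = {}
    if not cache_dir.is_dir():
        return mapping
    for meta_path in sorted(cache_dir.rglob("*.json")):
        try:
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or not valid JSON, so not a metadata file
        if isinstance(meta, dict) and "url" in meta:
            data_file = meta_path.with_suffix("")  # strip the trailing ".json"
            mapping[data_file.name] = meta["url"]
    return mapping


if __name__ == "__main__":
    for name, url in list_cached_files().items():
        print(f"{name[:8]}  {url}")
```

The URL normally ends with the original file name, so the output should make it clear which hash holds the config, the vocabulary, and the weights.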

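Introduction to pipeline()

As shown above, running pipeline('sentiment-analysis')('I hate you') makes transformers automatically download distilbert-base-uncased-finetuned-sst-2-english, a model fine-tuned on the SST-2 dataset from GLUE, and run sentiment analysis on 'I hate you'.

Pipeline is a concise interface for NLP tasks. It runs the sequence Input -> Tokenization -> Model Inference -> Post-Processing (task dependent) -> Output. At the time of writing it supports tasks including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction, and Question Answering.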
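To make the stages concrete, here is a toy sketch of the same flow in plain Python. It is not how transformers implements pipeline(): a whitespace tokenizer and a four-word lexicon (both hypothetical stand-ins) replace the real tokenizer and neural network, so only the shape of the data flow carries over.

```python
import math

# Hypothetical stand-ins for the real tokenizer and model, for illustration only.
LEXICON = {"hate": -3.0, "awful": -2.0, "love": 3.0, "great": 2.0}
LABELS = ["NEGATIVE", "POSITIVE"]


def tokenize(text):
    """Tokenization: split raw text into lowercase tokens."""
    return text.lower().split()


def infer(tokens):
    """Model inference: produce one raw score (logit) per label."""
    polarity = sum(LEXICON.get(token, 0.0) for token in tokens)
    return [-polarity, polarity]  # [NEGATIVE logit, POSITIVE logit]


def postprocess(logits):
    """Post-processing: softmax the logits and report the most likely label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": LABELS[best], "score": probs[best]}]


def toy_pipeline(text):
    """Input -> Tokenization -> Model Inference -> Post-Processing -> Output."""
    return postprocess(infer(tokenize(text)))


print(toy_pipeline("I hate you"))
```

The result has the same shape as the real sentiment-analysis output: a list with one dictionary holding a label and a score.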

Taking Question Answering as an example:

from transformers import pipeline

nlp = pipeline("question-answering")

context = "Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a model on a SQuAD task, you may leverage the `run_squad.py`."

print(nlp(question="What is extractive question answering?", context=context))
print(nlp(question="What is a good example of a question answering dataset?", context=context))

For the QA task, transformers uses distilbert-base-cased-distilled-squad, a model fine-tuned on the SQuAD dataset. The downloaded files follow the same layout as described above.
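As far as I can tell, each call to the question-answering pipeline returns a dictionary with `score`, `start`, `end`, and `answer` keys, where `start` and `end` are character offsets into the context. Treat those key names as an assumption and check them against your installed version. Under that assumption, the sketch below shows an answer inside its surrounding text. `highlight_answer` is a helper I made up, and `sample` is a hand-written stand-in, not real model output.

```python
def highlight_answer(result, context, margin=30):
    """Show the answer in its surrounding context using the character offsets."""
    start, end = result["start"], result["end"]
    left = context[max(0, start - margin):start]
    right = context[end:end + margin]
    return f"...{left}[{context[start:end]}]{right}... (score={result['score']:.3f})"


# Hand-written stand-in for a pipeline result; the score is made up.
context = "An example of a question answering dataset is the SQuAD dataset."
answer = "SQuAD dataset"
start = context.index(answer)
sample = {"score": 0.5, "start": start, "end": start + len(answer), "answer": answer}

print(highlight_answer(sample, context))
```

Slicing the context with the offsets is also a quick sanity check: the slice should equal the `answer` field.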

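Moving the model to a custom folder

Taking QA as an example:

  1. First create a folder named distilbert-base-cased-distilled-squad. Put the three files (model configuration, vocabulary, and model weights) into it, and rename them to config.json, vocab.txt, and pytorch_model.bin respectively.

  2. Define the model directory in code, DISTILLED = './distilbert-base-cased-distilled-squad'. The complete code is as follows.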

    from transformers import AutoTokenizer, AutoModelForQuestionAnswering
    import torch
    
    DISTILLED = './distilbert-base-cased-distilled-squad'
    tokenizer = AutoTokenizer.from_pretrained(DISTILLED)
    model = AutoModelForQuestionAnswering.from_pretrained(DISTILLED)
    
    text = """
    Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
    architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
    Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
    TensorFlow 2.0 and PyTorch.
    """
    
    questions = [
        "How many pretrained models are available in Transformers?",
        "What does Transformers provide?",
        "Transformers provides interoperability between which frameworks?",
    ]
    
    for question in questions:
        inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
        input_ids = inputs["input_ids"].tolist()[0]
    
        text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
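        outputs = model(**inputs)
        # Index the outputs rather than unpacking them: this works whether the model returns a tuple or a ModelOutput object
        answer_start_scores, answer_end_scores = outputs[0], outputs[1]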
    
        answer_start = torch.argmax(answer_start_scores)  # Get the most likely beginning of answer with the argmax of the score
        answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score
    
        answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    
        print(f"Question: {question}")
        print(f"Answer: {answer}\n")
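
The copy-and-rename step from item 1 can also be scripted. The sketch below assumes the cache layout described earlier, where each data file has a json file next to it that records the source URL. `export_model` is a helper name I made up, and the URL suffixes in `WANTED` are guesses for illustration: check the json files in your own cache for the real URLs before relying on them.

```python
import json
import shutil
from pathlib import Path


def export_model(cache_dir, target_dir, wanted):
    """Copy cached files into target_dir under the names from_pretrained expects.

    `wanted` maps a URL suffix (as recorded in each json file) to the
    file name to use inside target_dir.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = {}
    for meta_path in sorted(Path(cache_dir).rglob("*.json")):
        try:
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or not valid JSON
        url = meta.get("url", "") if isinstance(meta, dict) else ""
        data_file = meta_path.with_suffix("")  # cached file without ".json"
        for suffix, new_name in wanted.items():
            if url.endswith(suffix) and data_file.is_file():
                shutil.copyfile(data_file, target / new_name)
                copied[new_name] = url
    return copied


# Hypothetical URL suffixes; verify them against your own cache.
WANTED = {
    "distilbert-base-cased-distilled-squad-config.json": "config.json",
    "bert-base-cased-vocab.txt": "vocab.txt",
    "distilbert-base-cased-distilled-squad-pytorch_model.bin": "pytorch_model.bin",
}

# Example call (commented out so nothing is copied by accident):
# export_model(Path.home() / ".cache" / "torch", "./distilbert-base-cased-distilled-squad", WANTED)
```

Copying instead of moving leaves the cache intact, so pipeline() keeps working without downloading the files again.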
    

References

https://huggingface.co/transformers/installation.html
