Field and RawField in torchtext.data

Original post

2020-06-25 19:25

Today, while trying to modify the OpenNMT code, I came across this call in the preprocess stage:

    fields = inputters.get_fields( 
        opt.data_type,
        src_nfeats,
        tgt_nfeats,
        dynamic_dict=opt.dynamic_dict,
        src_truncate=opt.src_seq_length_trunc,
        tgt_truncate=opt.tgt_seq_length_trunc)

The individual "components" of fields, however, have different types:

    fields["tgt"] = fields_getters["text"](**tgt_field_kwargs)          # TextMultiField 

    indices = Field(use_vocab=False, dtype=torch.long, sequential=False)  # Field
    fields["indices"] = indices

    if dynamic_dict:
        src_map = Field(
            use_vocab=False, dtype=torch.float,
            postprocessing=make_src, sequential=False)
        fields["src_map"] = src_map

        src_ex_vocab = RawField()
        fields["src_ex_vocab"] = src_ex_vocab

        align = Field(
            use_vocab=False, dtype=torch.long,
            postprocessing=make_tgt, sequential=False)
        fields["alignment"] = align

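As you can see, some entries are a TextMultiField, while others are a Field or a RawField.

Why is that? What is the difference between them?

The official torchtext documentation is at https://torchtext.readthedocs.io/en/latest/data.html#field

It has this to say: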

Field

class torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)
Defines a datatype together with instructions for converting to Tensor.

Field class models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method and the kind of Tensor that should be produced.

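In short: a Field defines a data type together with the instructions for converting it to a tensor.

It covers the common text-processing data types that a tensor can represent. It holds a Vocab object, which maps each possible value of the field to a numerical id, plus the other settings that control numericalization, such as the tokenization method and the kind of tensor to produce.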
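
The description above boils down to three steps: tokenize, build a vocabulary, then pad and map tokens to ids. Below is a minimal pure-Python sketch of that idea. It is not torchtext's implementation; the helper names (`tokenize`, `build_vocab`, `numericalize`) are made up for illustration, and it returns nested lists instead of tensors so that it runs without any dependency.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # same defaults as pad_token / unk_token in the signature above


def tokenize(text):
    # Whitespace split; stands in for whatever the `tokenize` argument would do.
    return text.lower().split()


def build_vocab(tokenized_examples):
    # Special tokens first, then the other tokens by descending frequency
    # (ties broken alphabetically so the result is deterministic).
    counts = Counter(tok for ex in tokenized_examples for tok in ex)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: i for i, tok in enumerate([UNK, PAD] + ordered)}


def numericalize(batch, stoi):
    # Pad every example to the longest one in the batch, then look up ids.
    max_len = max(len(ex) for ex in batch)
    padded = [ex + [PAD] * (max_len - len(ex)) for ex in batch]
    return [[stoi.get(tok, stoi[UNK]) for tok in ex] for ex in padded]


corpus = [tokenize(s) for s in ["The cat sat", "the dog"]]
stoi = build_vocab(corpus)
ids = numericalize(corpus, stoi)
print(ids)
```

A real Field would wrap the final nested list in a tensor of the configured dtype; that step is left out here on purpose.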

If a Field is shared between two columns in a dataset (e.g., question and answer in a QA dataset), then they will have a shared vocabulary.
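
To see why a shared vocabulary matters, here is a small sketch in plain Python (again a toy, not torchtext code; `build_vocab` is a hypothetical helper). With one vocabulary built over both columns, a token gets the same id wherever it appears. With one vocabulary per column, the same id can mean different tokens.

```python
from collections import Counter


def build_vocab(*columns):
    # Count tokens over every column passed in, then assign ids after
    # the two special tokens, most frequent token first.
    counts = Counter(tok for col in columns for ex in col for tok in ex)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: i for i, tok in enumerate(["<unk>", "<pad>"] + ordered)}


questions = [["who", "wrote", "hamlet"]]
answers = [["shakespeare", "wrote", "hamlet"]]

shared = build_vocab(questions, answers)  # one field used for both columns
q_only = build_vocab(questions)           # a separate field per column
a_only = build_vocab(answers)

print(shared)
print(q_only["who"], a_only["shakespeare"])  # same id, different tokens
```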

Variables: see the URL above for details.

RawField

class torchtext.data.RawField(preprocessing=None, postprocessing=None, is_target=False)
Defines a general datatype.

Every dataset consists of one or more types of data. For instance, a text classification dataset contains sentences and their classes, while a machine translation dataset contains paired examples of text in two languages. Each of these types of data is represented by a RawField object. A RawField object does not assume any property of the data type and it holds parameters relating to how a datatype should be processed.

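In short: a RawField defines a general data type.

Every dataset contains one or more kinds of data, and each kind is represented by a RawField object. A RawField assumes nothing about the data it carries; it only holds the parameters that say how the data should be processed.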
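
In torchtext, Field is a subclass of RawField, which suggests one way to read the OpenNMT snippet: a RawField only carries the two processing hooks, while a Field adds tokenization and an optional vocabulary on top. The sketch below imitates that relationship with two toy classes. `ToyRawField` and `ToyField` are hypothetical names, `stoi=None` plays the role of `use_vocab=False`, and padding and tensor conversion are omitted.

```python
class ToyRawField:
    """Assumes nothing about the data; only applies optional hooks."""

    def __init__(self, preprocessing=None, postprocessing=None):
        self.preprocessing = preprocessing
        self.postprocessing = postprocessing

    def preprocess(self, x):
        return self.preprocessing(x) if self.preprocessing else x

    def process(self, batch):
        return self.postprocessing(batch) if self.postprocessing else batch


class ToyField(ToyRawField):
    """Adds tokenization (sequential) and id lookup (stoi) to the raw field."""

    def __init__(self, stoi=None, sequential=True, **hooks):
        super().__init__(**hooks)
        self.stoi = stoi              # None: values are assumed to be numeric already
        self.sequential = sequential  # False: no tokenization is applied

    def preprocess(self, x):
        if self.sequential and isinstance(x, str):
            x = x.split()
        return super().preprocess(x)

    def process(self, batch):
        if self.stoi is not None:
            batch = [[self.stoi[tok] for tok in ex] for ex in batch]
        return super().process(batch)


raw = ToyRawField()                      # compare fields["src_ex_vocab"]
indices = ToyField(sequential=False)     # compare fields["indices"]
text = ToyField(stoi={"a": 0, "b": 1})   # an ordinary text field

raw_batch = raw.process([raw.preprocess({"any": "object"})])
index_batch = indices.process([indices.preprocess(i) for i in (0, 1, 2)])
text_batch = text.process([text.preprocess("a b a")])
print(raw_batch, index_batch, text_batch)
```

Read this way, `src_ex_vocab` is presumably a RawField because it holds arbitrary per-example objects that cannot be turned into a tensor, whereas `indices` is a Field with `use_vocab=False` and `sequential=False` because its values are already plain integers.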

Variables: see the URL above for details.
