Field and RawField in torchtext.data

Today, while trying to modify the OpenNMT code, I came across this snippet in the preprocess stage:

    fields = inputters.get_fields( 
        opt.data_type,
        src_nfeats,
        tgt_nfeats,
        dynamic_dict=opt.dynamic_dict,
        src_truncate=opt.src_seq_length_trunc,
        tgt_truncate=opt.tgt_seq_length_trunc)

The individual "components" of fields, however, are of different types:

    fields["tgt"] = fields_getters["text"](**tgt_field_kwargs)          # TextMultiField 

    indices = Field(use_vocab=False, dtype=torch.long, sequential=False)  # Field
    fields["indices"] = indices

    if dynamic_dict:
        src_map = Field(
            use_vocab=False, dtype=torch.float,
            postprocessing=make_src, sequential=False)
        fields["src_map"] = src_map

        src_ex_vocab = RawField()
        fields["src_ex_vocab"] = src_ex_vocab

        align = Field(
            use_vocab=False, dtype=torch.long,
            postprocessing=make_tgt, sequential=False)
        fields["alignment"] = align

As you can see, some of them are TextMultiField, while others are Field or RawField.

Why is that? What is the difference between them?

I opened the official torchtext documentation: https://torchtext.readthedocs.io/en/latest/data.html#field

and found the following:

Field

class torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)
Defines a datatype together with instructions for converting to Tensor.

Field class models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method and the kind of Tensor that should be produced.

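In other words, a Field defines a datatype together with the instructions for converting it to a tensor.

The Field class models the common text-processing datatypes that can be represented as tensors.
It holds a Vocab object, which defines the set of possible values for the field's elements and the number each value maps to.
It also holds the other parameters that govern numericalization, such as the tokenization method and the kind of tensor to produce.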

If a Field is shared between two columns in a dataset (e.g., question and answer in a QA dataset), then they will have a shared vocabulary.

Variables: see the URL above for details.
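To make the Field description concrete, here is a minimal pure-Python sketch of the pipeline the documentation describes: tokenize, pad, then numericalize through a vocabulary. `ToyField` and its methods are hypothetical stand-ins written for illustration, not torchtext's implementation, and they return plain lists instead of tensors. The last lines mimic the `use_vocab=False, sequential=False` configuration used for `indices` in the OpenNMT snippet, where values skip both tokenization and vocabulary lookup.

```python
# Toy stand-in for the behaviour described in the torchtext docs.
# NOT torchtext's code: ToyField and its methods are hypothetical,
# and plain Python lists are used in place of tensors.

class ToyField:
    def __init__(self, sequential=True, use_vocab=True,
                 tokenize=str.split, pad_token="<pad>", unk_token="<unk>"):
        self.sequential = sequential
        self.use_vocab = use_vocab
        self.tokenize = tokenize
        self.pad_token = pad_token
        self.unk_token = unk_token
        self.stoi = {}  # token -> index, filled by build_vocab

    def preprocess(self, x):
        # Sequential data is tokenized; anything else is kept as-is.
        return self.tokenize(x) if self.sequential else x

    def build_vocab(self, *columns):
        # Special tokens first, then tokens in order of first appearance.
        self.stoi = {self.unk_token: 0, self.pad_token: 1}
        for column in columns:
            for example in column:
                tokens = self.preprocess(example)
                if not self.sequential:
                    tokens = [tokens]  # a single value, not a sequence
                for tok in tokens:
                    self.stoi.setdefault(tok, len(self.stoi))

    def process(self, batch):
        batch = [self.preprocess(x) for x in batch]
        if self.sequential:
            # Pad every example to the longest one in the batch.
            max_len = max(len(x) for x in batch)
            batch = [x + [self.pad_token] * (max_len - len(x)) for x in batch]
        if not self.use_vocab:
            return batch  # values are assumed to be numeric already
        unk = self.stoi[self.unk_token]
        if self.sequential:
            return [[self.stoi.get(t, unk) for t in x] for x in batch]
        return [self.stoi.get(x, unk) for x in batch]


# One field shared by two columns gives one shared vocabulary.
text = ToyField()
questions = ["who wrote it", "where is it now"]
answers = ["she wrote it"]
text.build_vocab(questions, answers)
question_ids = text.process(questions)
answer_ids = text.process(answers)

# Like `indices` in the OpenNMT snippet: no tokenization, no vocab lookup.
indices = ToyField(sequential=False, use_vocab=False)
index_batch = indices.process([0, 1, 2])
```

Because `questions` and `answers` go through the same `ToyField`, the tokens "wrote" and "it" get the same index in both columns, which is the shared-vocabulary behaviour the documentation mentions.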

 

RawField

class torchtext.data.RawField(preprocessing=None, postprocessing=None, is_target=False)
Defines a general datatype.

Every dataset consists of one or more types of data. For instance, a text classification dataset contains sentences and their classes, while a machine translation dataset contains paired examples of text in two languages. Each of these types of data is represented by a RawField object. A RawField object does not assume any property of the data type and it holds parameters relating to how a datatype should be processed.

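In other words, a RawField defines a general datatype.

Every dataset contains one or more types of data.
For example, a text classification dataset contains sentences and their classes, while a machine translation dataset contains paired examples of text in two languages.
Each of these types of data is represented by a RawField object.
A RawField object assumes nothing about the data type; it only holds the parameters that describe how the data should be processed.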
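A minimal pure-Python sketch of that idea, assuming only what the signature above shows: a RawField-like object stores optional `preprocessing` and `postprocessing` callables and otherwise passes data through untouched. `ToyRawField` is a hypothetical stand-in for illustration, not torchtext's code.

```python
# Toy stand-in, NOT torchtext's implementation.
class ToyRawField:
    def __init__(self, preprocessing=None, postprocessing=None, is_target=False):
        self.preprocessing = preprocessing
        self.postprocessing = postprocessing
        self.is_target = is_target

    def preprocess(self, x):
        # Applied to a single example; a no-op when no hook is given.
        return x if self.preprocessing is None else self.preprocessing(x)

    def process(self, batch):
        # Applied to a list of examples; a no-op when no hook is given.
        return batch if self.postprocessing is None else self.postprocessing(batch)


# No hooks: arbitrary Python objects go through unchanged.
raw = ToyRawField()
batch = [{"a": 0}, {"b": 0, "c": 1}]
same_batch = raw.process([raw.preprocess(x) for x in batch])

# With hooks: any callable can be plugged in.
lengths = ToyRawField(preprocessing=len, postprocessing=sorted)
sorted_lengths = lengths.process(
    [lengths.preprocess(x) for x in ["abc", "a", "ab"]])
```

Since nothing is tokenized, padded, or looked up in a vocabulary, this kind of field suits data that should not become a tensor at all, which is presumably why the OpenNMT snippet uses RawField for `src_ex_vocab`.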

Variables: see the URL above for details.

 

 

 

 
