While trying to modify the OpenNMT code today, I came across this snippet in the preprocess stage:
fields = inputters.get_fields(
    opt.data_type,
    src_nfeats,
    tgt_nfeats,
    dynamic_dict=opt.dynamic_dict,
    src_truncate=opt.src_seq_length_trunc,
    tgt_truncate=opt.tgt_seq_length_trunc)
The individual "components" of fields, however, are of different types:
fields["tgt"] = fields_getters["text"](**tgt_field_kwargs)  # TextMultiField
indices = Field(use_vocab=False, dtype=torch.long, sequential=False)
fields["indices"] = indices
if dynamic_dict:
    src_map = Field(
        use_vocab=False, dtype=torch.float,
        postprocessing=make_src, sequential=False)
    fields["src_map"] = src_map
    src_ex_vocab = RawField()
    fields["src_ex_vocab"] = src_ex_vocab
    align = Field(
        use_vocab=False, dtype=torch.long,
        postprocessing=make_tgt, sequential=False)
    fields["alignment"] = align
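As you can see, some of them are TextMultiField, while others are Field or RawField.
Why is that? What is the difference?
I opened the official torchtext documentation: https://torchtext.readthedocs.io/en/latest/data.html#field
and found the following: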
Field
class torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)
Defines a datatype together with instructions for converting to Tensor.
Field class models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method and the kind of Tensor that should be produced.
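In short: a Field describes one kind of data together with how to convert it to a tensor.
It owns the Vocab that maps each possible value of the field to a number, along with settings such as the tokenization method and the kind of tensor to produce.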
If a Field is shared between two columns in a dataset (e.g., question and answer in a QA dataset), then they will have a shared vocabulary.
Variables: see the URL above for details.
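To make the description above more concrete, here is a minimal sketch of what a Field roughly does: tokenize, build a vocabulary (shared when one field serves two columns), pad, and numericalize. It is not torchtext's implementation. `MiniField` and all of its methods are hypothetical stand-ins written in plain Python, and it returns lists of integers instead of tensors so that it runs without torch or torchtext installed.

```python
from collections import Counter


class MiniField:
    """Hypothetical, simplified stand-in for a Field; not torchtext's class."""

    def __init__(self, sequential=True, use_vocab=True,
                 pad_token="<pad>", unk_token="<unk>"):
        self.sequential = sequential
        self.use_vocab = use_vocab
        self.pad_token = pad_token
        self.unk_token = unk_token
        self.stoi = {}  # token -> index, filled in by build_vocab

    def preprocess(self, x):
        # Sequential data is tokenized (plain whitespace split here);
        # non-sequential data is passed through unchanged.
        return x.split() if self.sequential else x

    def build_vocab(self, *columns):
        # Count tokens over every column that shares this field,
        # so all of them end up with one vocabulary.
        counts = Counter(tok for col in columns for ex in col for tok in ex)
        ordered = sorted(counts, key=lambda t: (-counts[t], t))
        specials = [self.unk_token, self.pad_token]
        self.stoi = {tok: i for i, tok in enumerate(specials + ordered)}

    def numericalize(self, value):
        # Without a vocab the value is assumed to be numeric already.
        if not self.use_vocab:
            return value
        return self.stoi.get(value, self.stoi[self.unk_token])

    def process(self, batch):
        if not self.sequential:
            return [self.numericalize(x) for x in batch]
        # Pad every example to the longest one, then map tokens to indices.
        width = max(len(ex) for ex in batch)
        padded = [ex + [self.pad_token] * (width - len(ex)) for ex in batch]
        return [[self.numericalize(tok) for tok in ex] for ex in padded]


# One field shared by two columns, as in the QA example from the docs.
text = MiniField()
questions = [text.preprocess(s) for s in ["who wrote it", "who read it"]]
answers = [text.preprocess(s) for s in ["she wrote it"]]
text.build_vocab(questions, answers)
batch_ids = text.process([text.preprocess("who wrote it"), text.preprocess("it")])

# Counterpart of Field(use_vocab=False, sequential=False): values pass through.
indices = MiniField(sequential=False, use_vocab=False)
index_ids = indices.process([0, 1, 2])
```

The second instance mirrors the way OpenNMT declares `indices`: with `use_vocab=False` and `sequential=False` nothing is tokenized or looked up, because the data is already numeric.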
RawField
class torchtext.data.RawField(preprocessing=None, postprocessing=None, is_target=False)
Defines a general datatype.
Every dataset consists of one or more types of data. For instance, a text classification dataset contains sentences and their classes, while a machine translation dataset contains paired examples of text in two languages. Each of these types of data is represented by a RawField object. A RawField object does not assume any property of the data type and it holds parameters relating to how a datatype should be processed.
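In short: RawField defines a general data type.
Every kind of data in a dataset (sentences and their classes in text classification, the paired texts in machine translation) is represented by a RawField object.
It assumes nothing about the data and only holds the parameters describing how that data should be processed.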
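The contrast with Field is easiest to see in code. Below is a small sketch of the pass-through behaviour described above, as I understand it: the value is only touched by the optional `preprocessing` and `postprocessing` hooks, with no tokenization, vocabulary, or padding. `MiniRawField` is a hypothetical stand-in written for illustration, not the torchtext class itself.

```python
class MiniRawField:
    """Hypothetical stand-in for a RawField; not torchtext's class."""

    def __init__(self, preprocessing=None, postprocessing=None, is_target=False):
        self.preprocessing = preprocessing
        self.postprocessing = postprocessing
        self.is_target = is_target

    def preprocess(self, x):
        # Applied to a single example; identity when no hook is given.
        return self.preprocessing(x) if self.preprocessing is not None else x

    def process(self, batch):
        # Applied to a whole batch; identity when no hook is given.
        return self.postprocessing(batch) if self.postprocessing is not None else batch


# With no hooks, arbitrary Python objects travel through untouched.
raw = MiniRawField()
example_vocabs = [{"the": 0, "cat": 1}, {"a": 0, "dog": 1}]
passed_through = raw.process([raw.preprocess(v) for v in example_vocabs])

# With hooks, only the functions you supply are applied.
hooked = MiniRawField(preprocessing=str.upper, postprocessing=sorted)
hooked_out = hooked.process([hooked.preprocess(s) for s in ["b", "a"]])
```

This would explain why `src_ex_vocab` in the OpenNMT snippet is a plain `RawField()`: presumably it carries a per-example object that should reach the batch as-is rather than being converted to a tensor.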
Variables: