torchtext -- the dataset maze

Processing text

The core question

How do you turn training samples into batches, in other words, how do you build your own iterator so that training becomes more convenient?

Fields – how should each column be treated?

In language modeling we usually predict the next word; that kind of unsupervised setup comes with labels for free. In sentiment analysis or text classification, the label lives in its own column, so the processing is different. The different Fields tell the framework how each column should be handled.

Field api
  • sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
  • use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
  • init_token – A token that will be prepended to every example using this field, or None for no initial token. Default: None.
  • eos_token – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
  • dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
  • lower – Whether to lowercase the text in this field. Default: False.
  • tokenize – The function used to tokenize strings using this field into sequential examples. If “spacy”, the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.
  • tokenizer_language – The language of the tokenizer to be constructed. Various languages currently supported only in SpaCy.
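For instance, a text column and an already-numeric label column could be declared like this (a minimal sketch using the legacy torchtext.data API; the tokenizer and the special tokens are just one possible choice):

import torch
from torchtext.data import Field

# tokenize, lowercase, and mark sentence boundaries for the text column
TEXT = Field(sequential=True, lower=True, tokenize=lambda s: s.split(),
             init_token="<sos>", eos_token="<eos>")
# the label column is already numeric, so skip tokenization and the Vocab lookup
LABEL = Field(sequential=False, use_vocab=False, dtype=torch.float)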
If you want to build your own vocabulary, an API is provided for that as well:

build_vocab(*args, **kwargs)

  • Parameters:
    • Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
    • Remaining keyword arguments – Passed to the constructor of Vocab.
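A typical call, where max_size and min_freq are forwarded to the Vocab constructor (the concrete numbers are only illustrative, and train is assumed to be a Dataset built with this field):

TEXT.build_vocab(train,
                 max_size=10000,  # keep at most the 10,000 most frequent tokens
                 min_freq=2)      # ignore tokens seen fewer than 2 times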

Data

Dataset
  • Defines a dataset composed of Examples along with its Fields. I haven't needed this one directly yet.
TabularDataset
  • Defines a Dataset of columns stored in CSV, TSV, or JSON format.

  • init

    • path (str) – Path to the data file.

    • format (str) – The format of the data file. One of “CSV”, “TSV”, or “JSON” (case-insensitive).

    • fields (list(tuple(str, Field)) or dict[str: tuple(str, Field)]) –
      If using a list, the format must be CSV or TSV, and the values of the list should be tuples of (name, field). The fields should be in the same order as the columns in the CSV or TSV file, and tuples of (name, None) represent columns that will be ignored. If using a dict, the keys should be a subset of the JSON keys or CSV/TSV column names, and the values should be tuples of (name, field). Keys not present in the input dictionary are ignored. This lets you rename columns from their JSON/CSV/TSV key names and also load only a subset of the columns (see the sketch after this list).

    • skip_header (bool) – Whether to skip the first line of the input file.

    • csv_reader_params (dict) – Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.

  • The remaining methods are the same as for Dataset above, mainly examples and split(); see the documentation for details.
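A sketch of the dict form for a JSON-lines file (the key and field names here are made up for illustration):

fields = {"review_text": ("text", TEXT),  # JSON key -> (attribute name, Field)
          "stars": ("label", LABEL)}      # any key not listed here is ignored
dataset = TabularDataset(path="data/reviews.json", format="json", fields=fields)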

Iterators

There are three kinds:

Iterator

Defines an iterator that loads batches of data from a Dataset.
The plainest, no-frills way to iterate.

BucketIterator

Defines an iterator that batches examples of similar lengths together.
Puts sequences of similar length into the same batch for computation, which keeps the amount of padding to a minimum.

BPTTIterator

For the RNN family of language models, this is the one to use.

Defines an iterator for language modeling tasks that use BPTT.
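A minimal sketch, assuming lm_data is a language-modeling Dataset built with a single TEXT field:

lm_iter = BPTTIterator(lm_data, batch_size=32, bptt_len=30,
                       device=-1, repeat=False)
batch = next(iter(lm_iter))
# batch.text   has shape [bptt_len, batch_size]
# batch.target is the same text shifted one step ahead (the next-word labels)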

Putting the pieces together, a multi-label toxic-comment classification example:

import torch
from torchtext.data import Field, TabularDataset, BucketIterator, Iterator

def tokenize(x): return x.split()

# define the type of text and label
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)
LABEL = Field(sequential=False, use_vocab=False)
train_datafield = [("id", None),  # we won't be needing the id, so we pass in None as the field
                   ("comment_text", TEXT), ("toxic", LABEL),
                   ("severe_toxic", LABEL), ("threat", LABEL),
                   ("obscene", LABEL), ("insult", LABEL),
                   ("identity_hate", LABEL)]
train = TabularDataset.splits(path="data",
                              train='train.csv', format='csv',
                              skip_header=True,
                              fields=train_datafield)[0]
test_datafield = [("id", None), ("comment_text", TEXT)]
tst = TabularDataset(path="data/test.csv", format='csv',
                     skip_header=True, fields=test_datafield)
TEXT.build_vocab(train)
# create the iterator
train_iter = BucketIterator(train,
                            batch_size=64,
                            device=-1,  # device=-1 means CPU in this (older) torchtext API
                            # the BucketIterator needs to be told what function it should use to group the data
                            sort_key=lambda x: len(x.comment_text),
                            sort_within_batch=False,
                            repeat=False)  # we pass repeat=False because we want to wrap this Iterator later

test_iter = Iterator(tst, batch_size=64, device=-1, sort=False, sort_within_batch=False, repeat=False)

batch = next(iter(train_iter))  # peek at one batch; each field is exposed as an attribute on it

torchtext batches expose each field as an attribute (batch.comment_text, batch.toxic, ...), while most training loops expect plain (x, y) pairs, so a thin wrapper converts between the two:

class BatchWrapper:
    def __init__(self, dl, x_var, y_vars):
        # we pass in the list of attributes for x and y
        self.dl, self.x_var, self.y_vars = dl, x_var, y_vars
    def __iter__(self):
        for batch in self.dl:
            # we assume only one input in this wrapper
            x = getattr(batch, self.x_var)
            if self.y_vars is not None:  # we will concatenate y into a single tensor
                y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_vars], dim=1).float()
            else:
                y = torch.zeros((1))
            yield (x, y)
    def __len__(self):
        return len(self.dl)
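Wrapping the iterators built above then looks like this:

train_dl = BatchWrapper(train_iter, "comment_text",
                        ["toxic", "severe_toxic", "threat",
                         "obscene", "insult", "identity_hate"])
test_dl = BatchWrapper(test_iter, "comment_text", None)
x, y = next(iter(train_dl))  # x: [seq_len, batch_size] token ids, y: [batch_size, 6] float labels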