torchtext -- the Dataset maze

Processing text

The core question

How do we turn training samples into batches? In other words, how do we build our own iterator so that training becomes more convenient?

Fields – what do you want from me?

In a language model we usually predict the next word; that kind of self-supervised setup naturally comes with labels. In sentiment analysis or text classification, the labels live in their own column, so the processing has to be different. Each Field tells the framework how a given column should be handled (see the sketch after the parameter list below).

Field API
  • sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
  • use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
  • init_token – A token that will be prepended to every example using this field, or None for no initial token. Default: None.
  • eos_token – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
  • dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
  • lower – Whether to lowercase the text in this field. Default: False.
  • tokenize – The function used to tokenize strings using this field into sequential examples. If “spacy”, the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.
  • tokenizer_language – The language of the tokenizer to be constructed. Various languages currently supported only in SpaCy.
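
As a concrete illustration of these parameters, here is a minimal sketch of two very different Fields; the field names and the special-token strings are my own illustrative choices, not anything prescribed by torchtext:

import torch
from torchtext.data import Field

# language modeling: sequential text with <sos>/<eos> markers, SpaCy tokenization
LM_TEXT = Field(sequential=True, use_vocab=True, lower=True,
                init_token="<sos>", eos_token="<eos>",
                tokenize="spacy", tokenizer_language="en")

# classification label: one number per example, so no tokenization and no vocab
LABEL = Field(sequential=False, use_vocab=False, dtype=torch.float)
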
If you want to build your own vocabulary, an API is provided for that too:

build_vocab(*args, **kwargs)

  • Parameters:
    • Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
    • Remaining keyword arguments – Passed to the constructor of Vocab.
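
A minimal sketch of build_vocab, assuming a TEXT Field and a train TabularDataset like the ones defined later in this post; the keyword values below are arbitrary examples of what gets forwarded to the Vocab constructor:

# build the vocabulary from the training split only
TEXT.build_vocab(train,
                 max_size=25000,            # keep at most the 25k most frequent tokens
                 min_freq=2,                # drop tokens seen fewer than 2 times
                 vectors="glove.6B.100d")   # optionally attach pretrained word vectors

print(len(TEXT.vocab))        # vocabulary size (includes <unk> and <pad>)
print(TEXT.vocab.itos[:10])   # the 10 most frequent tokens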

Data

Dataset
  • Defines a dataset composed of Examples along with its Fields. (Not used directly in this post.)
TabularDataset
  • Defines a Dataset of columns stored in CSV, TSV, or JSON format.

  • init

    • path (str) – Path to the data file.

    • format (str) – The format of the data file. One of “CSV”, “TSV”, or “JSON” (case-insensitive).

    • fields (list(tuple(str, Field)) or dict[str: tuple(str, Field)]) –
      If using a list, the format must be CSV or TSV, and the values of the list should be tuples of (name, field). The fields should be in the same order as the columns in the CSV or TSV file, while tuples of (name, None) represent columns that will be ignored.
      If using a dict, the keys should be a subset of the JSON keys or CSV/TSV columns, and the values should be tuples of (name, field). Keys not present in the input dictionary are ignored. This allows the user to rename columns from their JSON/CSV/TSV key names and also enables selecting a subset of columns to load.

    • skip_header (bool) – Whether to skip the first line of the input file.

    • csv_reader_params (dict) – Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.

  • The remaining methods are the same as for Dataset above, mainly examples and splits; see the documentation for details.
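
To make the list-vs-dict distinction concrete, here is a sketch of the dict form for a JSON-lines file; the file path and the keys review_body / stars are hypothetical:

from torchtext.data import Field, TabularDataset

REVIEW = Field(sequential=True, lower=True)
STARS = Field(sequential=False, use_vocab=False)

# dict form: input key -> (new column name, Field); keys not listed are simply ignored
json_fields = {"review_body": ("text", REVIEW), "stars": ("label", STARS)}
reviews = TabularDataset(path="data/reviews.json", format="json", fields=json_fields)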

Iterators

There are three kinds:

Iterator

Defines an iterator that loads batches of data from a Dataset.
The plainest, no-frills way to iterate.

BucketIterator

Defines an iterator that batches examples of similar lengths together.
Puts examples of similar lengths into the same batch, which reduces the amount of padding needed during computation.

BPTTIterator

Defines an iterator for language modeling tasks that use BPTT (backpropagation through time).
This is the one to use for RNN-family language models.
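
A minimal sketch of BPTTIterator on a LanguageModelingDataset; the corpus path data/corpus.txt and the bptt_len of 30 are assumptions for illustration:

from torchtext.data import Field, BPTTIterator
from torchtext.datasets import LanguageModelingDataset

LM_TEXT = Field(sequential=True, lower=True)
corpus = LanguageModelingDataset(path="data/corpus.txt", text_field=LM_TEXT)
LM_TEXT.build_vocab(corpus)

lm_iter = BPTTIterator(corpus, batch_size=32, bptt_len=30, device=-1, repeat=False)
for batch in lm_iter:
    # batch.text and batch.target are both (bptt_len, batch_size);
    # target is the text shifted by one token, ready for next-word prediction
    print(batch.text.shape, batch.target.shape)
    break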

import torch
from torchtext.data import Field, TabularDataset, Iterator, BucketIterator
# note: in torchtext 0.9+ these classes live under torchtext.legacy.data


def tokenize(x):
    return x.split()

# define the type of text and label
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)
LABEL = Field(sequential=False, use_vocab=False)
train_datafield = [("id", None),  # we won't be needing the id, so we pass in None as the field
                   ("comment_text", TEXT), ("toxic", LABEL),
                   ("severe_toxic", LABEL), ("threat", LABEL),
                   ("obscene", LABEL), ("insult", LABEL),
                   ("identity_hate", LABEL)]
train = TabularDataset.splits(path="data",
                              train='train.csv', format='csv',
                              skip_header=True,
                              fields=train_datafield)[0]
test_datafiled = [("id", None), ("comment_text", TEXT)]
tst = TabularDataset(path="data/test.csv", format='csv',
                     skip_header=True, fields=test_datafiled)
TEXT.build_vocab(train)
# create the iterator
train_iter = BucketIterator(train,
                batch_size=64, device=-1,  # -1 keeps batches on the CPU
                # the BucketIterator needs to be told what function it should use to group the data.
                sort_key=lambda x: len(x.comment_text),
                sort_within_batch=False,
                repeat=False)  # we pass repeat=False because we want to wrap this Iterator ourselves

test_iter = Iterator(tst, batch_size=64, device=-1, sort=False, sort_within_batch=False, repeat=False)

batch = next(iter(train_iter))  # peek at a single batch

class BatchWrapper():
    def __init__(self, dl, x_var, y_vars):
        # we pass in the list of attributes for x and y
        self.dl, self.x_var, self.y_vars = dl, x_var, y_vars
    def __iter__(self):
        for batch in self.dl:
            # we assume only one input in this wrapper
            x = getattr(batch, self.x_var)
            if self.y_vars is not None:  # we will concatenate y into a single tensor
                y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_vars], dim=1).float()
            else:
                y = torch.zeros((1))
            yield (x, y)
    def __len__(self):
        return len(self.dl)
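
A usage sketch of the wrapper around the iterators built above, so a training loop only ever sees (x, y) tensors:

train_dl = BatchWrapper(train_iter, "comment_text",
                        ["toxic", "severe_toxic", "obscene", "threat",
                         "insult", "identity_hate"])
test_dl = BatchWrapper(test_iter, "comment_text", None)

x, y = next(iter(train_dl))
# x: LongTensor of token ids, shape (seq_len, batch_size)
# y: FloatTensor of labels, shape (batch_size, 6)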