ELMo Code Walkthrough (Part 1): Data Preparation

Notes on reading the ELMo source code


1. Data Preparation

  Data preparation involves five pieces: 1. a vocabulary class for words; 2. a vocabulary class for characters; 3. a class that generates training batches with word ids as input; 4. a class that generates training batches with char ids as input; 5. the dataset classes that produce the language model's input.

1.1 The word vocabulary class (Vocabulary)

  Given a vocabulary file, this class builds the two-way mapping between words and indices, _id_to_word and _word_to_id; the former is a list and the latter a dict. We also need the special tokens <S>, </S> and <UNK> (sentence start, sentence end, and unknown word, respectively). The main code:

def __init__(self, filename, validate_file=False):
    '''
    filename = the vocabulary file.  It is a flat text file with one
        (normalized) token per line.  In addition, the file should also
        contain the special tokens <S>, </S>, <UNK> (case sensitive).
    '''
    self._id_to_word = []
    self._word_to_id = {}
    self._unk = -1
    self._bos = -1
    self._eos = -1

    with open(filename) as f:
        idx = 0
        for line in f:                                                        # one token per line in the vocab file
            word_name = line.strip()
            if word_name == '<S>':
                self._bos = idx
            elif word_name == '</S>':
                self._eos = idx
            elif word_name == '<UNK>':
                self._unk = idx
            if word_name == '!!!MAXTERMID':                                   # sentinel line found in some vocab files; skip it
                continue

            self._id_to_word.append(word_name)
            self._word_to_id[word_name] = idx
            idx += 1

    # check to ensure file has special tokens
    if validate_file:
        if self._bos == -1 or self._eos == -1 or self._unk == -1:
            raise ValueError("Ensure the vocabulary file has "
                             "<S>, </S>, <UNK> tokens")

The class also contains two very handy methods: the encoder encode and the decoder decode. encode converts a sentence into a list of word ids, taking care to add the begin- and end-of-sentence tokens; it also offers a reverse option, used for the backward direction of the bidirectional LSTM. decode converts a list of word ids back into the corresponding words.

def encode(self, sentence, reverse=False, split=True):
    """Convert a sentence to a list of ids, with special tokens added.
    Sentence is a single string with tokens separated by whitespace.

    If reverse, then the sentence is assumed to be reversed, and
        this method will swap the BOS/EOS tokens appropriately.
    """

    if split:
        word_ids = [
            self.word_to_id(cur_word) for cur_word in sentence.split()
        ]
    else:
        word_ids = [self.word_to_id(cur_word) for cur_word in sentence]

    if reverse:
        return np.array([self.eos] + word_ids + [self.bos], dtype=np.int32) # a reversed sentence starts with </S> and ends with <S>
    else:
        return np.array([self.bos] + word_ids + [self.eos], dtype=np.int32)

def decode(self, cur_ids):
    """Convert a list of ids to a sentence, with space inserted.
       將一個ids序列轉化爲word序列
    """
    return ' '.join([self.id_to_word(cur_id) for cur_id in cur_ids])
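
As a quick sanity check, here is a minimal usage sketch. The file name vocab.txt and its contents are hypothetical; it would need to contain <S>, </S>, <UNK> plus the tokens used below, one per line (unknown tokens fall back to <UNK>):

vocab = Vocabulary('vocab.txt', validate_file=True)
ids = vocab.encode('the cat')        # adds <S> and </S> around the word ids
print(vocab.decode(ids))             # -> '<S> the cat </S>'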

1.2 The character vocabulary class (UnicodeCharsVocabulary)

  Note that this class is a subclass of the word vocabulary class Vocabulary above, which means it inherits all of Vocabulary's attributes and methods!
  Each character's id is its UTF-8 byte encoding, which directly gives the conversion between ids and chars and caps the number of possible char ids at 256. We also add 5 extra special characters: begin-of-sentence, end-of-sentence, begin-of-word, end-of-word and padding. The code that builds the character vocabulary's _word_char_ids from the vocabulary file:

# convert a word into its char-id array
def _convert_word_to_char_ids(self, word):
    code = np.zeros([self.max_word_length], dtype=np.int32)
    code[:] = self.pad_char

    # encode each character of the word as its UTF-8 byte value(s), truncating
    # so there is room for the begin/end-of-word markers.  For 'english':
    # e:101, n:110, g:103, l:108, i:105, s:115, h:104
    word_encoded = word.encode('utf-8', 'ignore')[:(self.max_word_length-2)]
    code[0] = self.bow_char                                      # begin-of-word marker; end-of-word is added after the loop
    for k, chr_id in enumerate(word_encoded, start=1):
        code[k] = chr_id
    code[k + 1] = self.eow_char

    return code
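
As a worked example, here is a sketch of what this conversion yields, assuming max_word_length=10 and the hypothetical vocab.txt from above:

vocab = UnicodeCharsVocabulary('vocab.txt', max_word_length=10)
print(vocab._convert_word_to_char_ids('english'))
# -> [258 101 110 103 108 105 115 104 259 260]
#    bow   e   n   g   l   i   s   h  eow pad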


def __init__(self, filename, max_word_length, **kwargs):
    # call the parent Vocabulary constructor to build the word/id mappings
    super(UnicodeCharsVocabulary, self).__init__(filename, **kwargs)
    self._max_word_length = max_word_length                             # maximum number of chars per word

    # char ids 0-255 come from utf-8 encoding bytes
    # assign 256-300 to special chars
    self.bos_char = 256  # <begin sentence>
    self.eos_char = 257  # <end sentence>
    self.bow_char = 258  # <begin word>
    self.eow_char = 259  # <end word>
    self.pad_char = 260 # <padding>

    num_words = len(self._id_to_word)                                   # number of words (an attribute built by the parent class)

    # each word maps to one row of char ids
    self._word_char_ids = np.zeros([num_words, max_word_length],
        dtype=np.int32)

    # the character representation of the begin/end of sentence tokens
    def _make_bos_eos(c):
        r = np.zeros([self.max_word_length], dtype=np.int32)
        r[:] = self.pad_char
        r[0] = self.bow_char                                            # begin-of-word
        r[1] = c
        r[2] = self.eow_char                                            # end-of-word
        return r
    self.bos_chars = _make_bos_eos(self.bos_char)                       # char ids for the begin-of-sentence token
    self.eos_chars = _make_bos_eos(self.eos_char)                       # char ids for the end-of-sentence token

    for i, word in enumerate(self._id_to_word):                         # walk _id_to_word to get every word's char ids
        self._word_char_ids[i] = self._convert_word_to_char_ids(word)

    self._word_char_ids[self.bos] = self.bos_chars                      # the sentence boundary tokens are treated as ordinary words
    self._word_char_ids[self.eos] = self.eos_chars

With the two functions above we obtain the char-id sequence for every word, including the char-id representations of the sentence-start and sentence-end tokens.
  The class can also convert a whole sentence into the corresponding 2-D array of char ids: for each word it first looks up _word_char_ids (falling back to on-the-fly conversion for out-of-vocabulary words), then stacks the rows into the sentence representation. Implementation:

# return the char-id array for a word
def word_to_char_ids(self, word):
    if word in self._word_to_id:
        return self._word_char_ids[self._word_to_id[word]]
    else:
        return self._convert_word_to_char_ids(word)

def encode_chars(self, sentence, reverse=False, split=True):
    '''
    Encode the sentence as a white space delimited string of tokens.
    '''
    if split:                                                             # the sentence is a whitespace-joined string of tokens
        chars_ids = [self.word_to_char_ids(cur_word)       
                 for cur_word in sentence.split()]
    else:
        chars_ids = [self.word_to_char_ids(cur_word)
                 for cur_word in sentence]
    if reverse:
        return np.vstack([self.eos_chars] + chars_ids + [self.bos_chars]) # a reversed sentence gets </S> chars first and <S> chars last
    else:
        return np.vstack([self.bos_chars] + chars_ids + [self.eos_chars])
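
A small sketch of the resulting shape, reusing the hypothetical vocab.txt from above (out-of-vocabulary words are converted on the fly, so 'the' and 'cat' need not be in the file):

vocab = UnicodeCharsVocabulary('vocab.txt', max_word_length=10)
chars = vocab.encode_chars('the cat')
print(chars.shape)    # (4, 10): rows for <S>, 'the', 'cat', </S>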

1.3 The batch generator for word-id input (TokenBatcher)

  Converts a batch of sentences into their word-id representation. The main code:

def batch_sentences(self, sentences: List[List[str]]):
    '''
    Batch the sentences as word ids.  (The upstream docstring says
    "character ids", apparently copy-pasted from Batcher; this method
    actually produces word ids.)
    Each sentence is a list of tokens without <s> or </s>, e.g.
    [['The', 'first', 'sentence', '.'], ['Second', '.']]
    '''
    n_sentences = len(sentences)
    max_length = max(len(sentence) for sentence in sentences) + 2

    X_ids = np.zeros((n_sentences, max_length), dtype=np.int64)          # word ids form a 2-D array, [batch_size, max_length]

    for k, sent in enumerate(sentences):
        length = len(sent) + 2
        ids_without_mask = self._lm_vocab.encode(sent, split=False)
        # add one so that 0 is the mask value
        X_ids[k, :length] = ids_without_mask + 1                         # shift by one so 0 can serve as the mask value

    return X_ids
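
A minimal usage sketch, assuming TokenBatcher is constructed from a vocabulary file path (vocab.txt is hypothetical):

batcher = TokenBatcher('vocab.txt')
X_ids = batcher.batch_sentences([['The', 'first', 'sentence', '.'], ['Second', '.']])
print(X_ids.shape)    # (2, 6): 2 sentences, max length 4 plus 2 for <S>/</S>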

1.4 The batch generator for char-id input (Batcher)

  Similar to the class above, except that it produces the char-id representation of a batch of sentences, forming a 3-D array. The main code:

def batch_sentences(self, sentences: List[List[str]]):
    '''
    Batch the sentences as character ids
    Each sentence is a list of tokens without <s> or </s>, e.g.
    [['The', 'first', 'sentence', '.'], ['Second', '.']]
    '''
    n_sentences = len(sentences)                                      # number of sentences
    max_length = max(len(sentence) for sentence in sentences) + 2     # longest sentence, plus 2 for <S> and </S>

    X_char_ids = np.zeros(                                            # 3-D array: char ids of every token of every sentence
        (n_sentences, max_length, self._max_token_length),
        dtype=np.int64
    )

    # fill in one sentence at a time
    for k, sent in enumerate(sentences):
        length = len(sent) + 2
        char_ids_without_mask = self._lm_vocab.encode_chars(          # char-id array for this sentence
            sent, split=False)
        # add one so that 0 is the mask value
        X_char_ids[k, :length, :] = char_ids_without_mask + 1         # shift by one so 0 is the mask value; padded positions stay 0

    return X_char_ids
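
The analogous sketch for Batcher, assuming its constructor takes the vocabulary file path and the maximum token length (file name hypothetical):

batcher = Batcher('vocab.txt', 50)
X_char_ids = batcher.batch_sentences([['The', 'first', 'sentence', '.'], ['Second', '.']])
print(X_char_ids.shape)    # (2, 6, 50)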

  Next, a function is defined that generates the actual training batches. Each call reads one batch from the input; every entry in the batch holds sentence data, namely the sentence's word-id representation, its char-id representation, and the targets (for each position, the next word to predict). The function is driven by a generator that yields one sentence at a time (its word ids and char ids), so assembling a batch just means pulling from this generator until every row's num_steps window is full. The code:

def _get_batch(generator, batch_size, num_steps, max_word_length):
    """Read batches of input."""
    # one sentence stream per batch row; each slot is filled lazily from the generator
    cur_stream = [None] * batch_size

    no_more_data = False
    while True:
        inputs = np.zeros([batch_size, num_steps], np.int32)                 # word ids for this batch
        if max_word_length is not None:                                      # char ids for every token in the batch
            char_inputs = np.zeros([batch_size, num_steps, max_word_length],
                                np.int32)
        else:
            char_inputs = None
        # the LM objective is next-word prediction, so the targets are simply
        # the inputs shifted one word to the right
        targets = np.zeros([batch_size, num_steps], np.int32)

        for i in range(batch_size):                                          # fill each row of the batch
            cur_pos = 0                                                      # write position inside the num_steps window

            # the loop is not redundant: one window may span several
            # sentences, so we keep pulling until the window is full
            while cur_pos < num_steps:
                if cur_stream[i] is None or len(cur_stream[i][0]) <= 1:
                    try:
                        cur_stream[i] = list(next(generator))                # the generator yields one sentence at a time
                    except StopIteration:
                        # No more data, exhaust current streams and quit
                        no_more_data = True
                        break
                # cur_stream[i][0] holds the sentence's word ids and
                # cur_stream[i][1] its char ids; num_steps is the window size,
                # so copy at most one window's worth of data.  The -1 keeps
                # one token in reserve, since the last token only ever serves
                # as a target.
                how_many = min(len(cur_stream[i][0]) - 1, num_steps - cur_pos)
                next_pos = cur_pos + how_many

                inputs[i, cur_pos:next_pos] = cur_stream[i][0][:how_many]
                if max_word_length is not None:
                    char_inputs[i, cur_pos:next_pos] = cur_stream[i][1][
                                                                    :how_many]
                targets[i, cur_pos:next_pos] = cur_stream[i][0][1:how_many+1]     # the following word is the prediction target

                cur_pos = next_pos

                cur_stream[i][0] = cur_stream[i][0][how_many:]                    # drop what was consumed; leftovers go into the next window
                if max_word_length is not None:
                    cur_stream[i][1] = cur_stream[i][1][how_many:]

        if no_more_data:
            # There is no more data.  Note: this will not return data
            # for the incomplete batch
            break

        X = {'token_ids': inputs, 'tokens_characters': char_inputs,
             'next_token_id': targets}

        yield X
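
To make the mechanics concrete, here is a toy sketch that drives _get_batch directly. The generator below stands in for LMDataset.get_sentence() and yields hypothetical (word_ids, char_ids) pairs, one sentence at a time:

import numpy as np

def toy_sentences():
    for ids in ([0, 5, 6, 1], [0, 7, 1]):                   # <S>=0 ... </S>=1
        word_ids = np.array(ids, dtype=np.int32)
        char_ids = np.zeros((len(ids), 5), dtype=np.int32)  # dummy char ids, max_word_length=5
        yield (word_ids, char_ids)

batch = next(_get_batch(toy_sentences(), batch_size=1, num_steps=3, max_word_length=5))
print(batch['token_ids'])      # [[0 5 6]]
print(batch['next_token_id'])  # [[5 6 1]] -- the inputs shifted one step right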

1.5 The language-model dataset class (LMDataset)

  The dataset class supplies the input for language-model training. It picks a file at random from the list of data files (the data is sharded across many files rather than stored in just one), reads the whole shard into memory at once, and exposes a sentence generator; the _get_batch() function defined above then turns that generator into one batch at a time. The implementation:

def get_sentence(self):
    """
    A generator that yields one sentence at a time.
    """
    while True:
        if self._i == self._nids:
            self._ids = self._load_random_shard()                             # current shard exhausted, load another random one
        ret = self._ids[self._i]                                              # one sentence per call: its (word ids, char ids) pair
        self._i += 1
        yield ret


def iter_batches(self, batch_size, num_steps):
    """一個生成數據的迭代器"""
    for X in _get_batch(self.get_sentence(), batch_size, num_steps,
                       self.max_word_length):

        # token_ids = (batch_size, num_steps)
        # char_inputs = (batch_size, num_steps, 50) of character ids
        # targets = word ID of next word (batch_size, num_steps)
        yield X
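
A minimal sketch of wiring it together (file names hypothetical; the pattern may match many shard files):

vocab = UnicodeCharsVocabulary('vocab.txt', max_word_length=50)
data = LMDataset('corpus/*.txt', vocab)
for X in data.iter_batches(batch_size=128, num_steps=20):
    print(X['token_ids'].shape)           # (128, 20)
    print(X['tokens_characters'].shape)   # (128, 20, 50)
    break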

  The class above only provides the input for an ordinary forward language model. To build the bidirectional LSTM we also need the reversed data as input to the backward LSTM. This is handled by the BidirectionalLMDataset class, whose core code is:

def __init__(self, filepattern, vocab, test=False, shuffle_on_load=False):
    '''
    bidirectional version of LMDataset:
    the forward LSTM reads the data as-is, while the backward LSTM
    simply reads the same data reversed
    '''
    self._data_forward = LMDataset(                                            # forward dataset
        filepattern, vocab, reverse=False, test=test,
        shuffle_on_load=shuffle_on_load)
    self._data_reverse = LMDataset(                                            # reverse dataset
        filepattern, vocab, reverse=True, test=test,
        shuffle_on_load=shuffle_on_load)


def iter_batches(self, batch_size, num_steps):
    """
    Merge the forward and reverse batches into a single dict.
    """
    max_word_length = self._data_forward.max_word_length

    for X, Xr in zip(
        _get_batch(self._data_forward.get_sentence(), batch_size,
                  num_steps, max_word_length),
        _get_batch(self._data_reverse.get_sentence(), batch_size,
                  num_steps, max_word_length)
        ):

        for k, v in Xr.items():                                               # fold the reverse batch into X,
            # producing token_ids_reverse, tokens_characters_reverse, etc.
            X[k + '_reverse'] = v

        yield X
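
Reusing the hypothetical vocab and corpus from the previous sketch, one bidirectional batch carries both directions:

data = BidirectionalLMDataset('corpus/*.txt', vocab)
for X in data.iter_batches(batch_size=128, num_steps=20):
    print(sorted(X.keys()))
    # ['next_token_id', 'next_token_id_reverse', 'token_ids',
    #  'token_ids_reverse', 'tokens_characters', 'tokens_characters_reverse']
    break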