PyTorch Primer (2): An Introduction to Word Vectors and a Negative Sampling Implementation

Original post

张楚岚

2019-09-02 16:13

These are my study notes and summary. If you spot any mistakes, corrections are welcome.

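Discrete representations: one-hot encoding, bag of words (TF-IDF), N-gram.

Problems: they cannot capture relationships between word vectors, the vocabulary dimension balloons as the corpus grows, the data is sparse, and the usual measures (distance, AND/OR/NOT) are not meaningful on them.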
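To make the first problem concrete, here is a minimal sketch with a made-up three-word vocabulary (the words and the `cosine` helper are illustrative assumptions, not part of the original notes). Every pair of distinct one-hot vectors is orthogonal, so cosine similarity cannot tell a related pair from an unrelated one.

```python
import numpy as np

# Hypothetical toy vocabulary; each word gets one row of the identity matrix.
vocab = ["cat", "dog", "car"]
one_hot = np.eye(len(vocab))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "cat" vs "dog" and "cat" vs "car" get exactly the same score.
sim_cat_dog = cosine(one_hot[0], one_hot[1])
sim_cat_car = cosine(one_hot[0], one_hot[2])
print(sim_cat_dog, sim_cat_car)
```

The vector length also equals the vocabulary size, which is the dimension growth mentioned above.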

Distributed representations (a word is represented by the words that appear near it): word2vec, word embeddings.

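skip-gram:

Model characteristics: no hidden layer; the projection layer can also be omitted; each word vector is the input to a log-linear model.

Objective function:

The probability is given by a softmax:

Loss function:

Negative sampling: for P(w|context(w)) there is one positive sample and V-1 negative samples; instead of using all of the negatives, draw a sample of them.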
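The objective, the softmax probability, and the loss named above are not written out in these notes. As a reference sketch, the standard forms from the paper linked at the end of this post are given below. The notation ($v$ for input vectors, $v'$ for output vectors, window size $c$, $k$ negative samples, noise distribution $P_n$, unigram distribution $U$) is the paper's and is an assumption here, since the notes do not fix any symbols.

```latex
% Skip-gram objective: maximize the average log-probability of context words
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)

% Full softmax over a vocabulary of V words
p(w_O \mid w_I) =
  \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}
       {\sum_{w=1}^{V} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}

% Negative sampling: each \log p(w_O \mid w_I) term is replaced by
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
    \left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right],
\qquad P_n(w) \propto U(w)^{3/4}
```

In the code that follows, $v$ corresponds to `in_embed`, $v'$ to `out_embed`, $c$ to `C`, and $k$ to `K`. The model returns the negative of the last expression, summed over the context positions, so that it can be minimized.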

Core PyTorch implementation:

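from collections import Counter

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as tud

K = 100 # number of negative samples per positive word
C = 3 # context window radius (words on each side of the center word)
NUM_EPOCHS = 2 # number of training epochs
MAX_VOCAB_SIZE = 30000 # vocabulary size
BATCH_SIZE = 128 # batch size
LEARNING_RATE = 0.2 # initial learning rate
EMBEDDING_SIZE = 100 # embedding dimension

LOG_FILE = "word-embedding.log"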

# Tokenizer: split a text into a list of words
def word_tokenize(text):
    return text.split()

with open("./text8/text8.train.txt", "r", encoding="utf-8") as fin:
    text = fin.read()

text = word_tokenize(text.lower())
# Keep the MAX_VOCAB_SIZE-1 most frequent words; the last slot is reserved for unknown words
vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE-1))
vocab["<unk>"] = len(text) - sum(vocab.values())  # count of all out-of-vocabulary tokens
idx_to_word = list(vocab.keys())
word_to_idx = {word: i for i, word in enumerate(idx_to_word)}

word_counts = np.array(list(vocab.values()), dtype=np.float32)
word_freqs = word_counts / np.sum(word_counts)
# The paper raises the frequencies to the 3/4 power and then renormalizes, which it reports improves accuracy
word_freqs = word_freqs ** (3./4.)
word_freqs = word_freqs / np.sum(word_freqs)  # distribution used for negative sampling
VOCAB_SIZE = len(idx_to_word)

class WordEmbeddingDataset(tud.Dataset):
    def __init__(self, text, word_to_idx, idx_to_word, word_freqs, word_counts):
        ''' text: a list of words, all text from the training dataset
            word_to_idx: the dictionary from word to idx
            idx_to_word: idx to word mapping
            word_freqs: the (smoothed) frequency of each word
            word_counts: the word counts
        '''
        super(WordEmbeddingDataset, self).__init__()
        # Words outside the vocabulary map to <unk>, which holds the last index
        unk_idx = word_to_idx["<unk>"]
        self.text_encoded = [word_to_idx.get(t, unk_idx) for t in text]
        self.text_encoded = torch.tensor(self.text_encoded, dtype=torch.long)
        self.word_to_idx = word_to_idx
        self.idx_to_word = idx_to_word
        self.word_freqs = torch.Tensor(word_freqs)
        self.word_counts = torch.Tensor(word_counts)

    def __len__(self):
        ''' Return the length of the whole dataset (the total number of tokens)
        '''
        return len(self.text_encoded)

    def __getitem__(self, idx):
        ''' Return the following data for training:
            - the center word
            - the (positive) words around it
            - K randomly sampled words per positive word, used as negative samples
        '''
        center_word = self.text_encoded[idx]
        # Indices of the words inside the window, excluding the center word itself
        pos_indices = list(range(idx-C, idx)) + list(range(idx+1, idx+C+1))
        # Wrap around with a modulo so that indices never run past the text
        pos_indices = [i % len(self.text_encoded) for i in pos_indices]
        pos_words = self.text_encoded[pos_indices]  # surrounding words
        # Negative sampling with replacement; a sample may occasionally coincide with a true context word
        neg_words = torch.multinomial(self.word_freqs, K * pos_words.shape[0], replacement=True)

        return center_word, pos_words, neg_words

class EmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_size):
        ''' Initialize the input and output embeddings
        '''
        super(EmbeddingModel, self).__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size

        initrange = 0.5 / self.embed_size
        self.out_embed = nn.Embedding(self.vocab_size, self.embed_size, sparse=False)
        self.out_embed.weight.data.uniform_(-initrange, initrange)

        self.in_embed = nn.Embedding(self.vocab_size, self.embed_size, sparse=False)
        self.in_embed.weight.data.uniform_(-initrange, initrange)

    def forward(self, input_labels, pos_labels, neg_labels):
        '''
        input_labels: center words, [batch_size]
        pos_labels: words that appear in the context window of the center word, [batch_size, (window_size * 2)]
        neg_labels: words drawn by negative sampling, treated as not in the context, [batch_size, (window_size * 2 * K)]

        return: loss, [batch_size]
        '''
        input_embedding = self.in_embed(input_labels) # B * embed_size
        pos_embedding = self.out_embed(pos_labels) # B * (2*C) * embed_size
        neg_embedding = self.out_embed(neg_labels) # B * (2*C * K) * embed_size

        # squeeze(2) drops only the trailing dimension, so a batch of size 1 keeps its batch axis
        log_pos = torch.bmm(pos_embedding, input_embedding.unsqueeze(2)).squeeze(2) # B * (2*C)
        log_neg = torch.bmm(neg_embedding, -input_embedding.unsqueeze(2)).squeeze(2) # B * (2*C*K)

        log_pos = F.logsigmoid(log_pos).sum(1)
        log_neg = F.logsigmoid(log_neg).sum(1) # batch_size

        loss = log_pos + log_neg

        return -loss

    def input_embeddings(self):
        return self.in_embed.weight.data.cpu().numpy()
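The notes define `NUM_EPOCHS`, `BATCH_SIZE`, and `LEARNING_RATE` but stop before the training loop. The following is a minimal, self-contained sketch of how such a loop might look. It runs on random token ids with toy sizes, and `ToyDataset`, `ToySkipGram`, `train`, and the choice of plain SGD are all assumptions made for illustration, not the author's code. The toy model also keeps PyTorch's default embedding initialization rather than the small uniform range used above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as tud

torch.manual_seed(0)

# Hypothetical toy settings, far smaller than the constants in the post.
K, C, EMBED, VOCAB = 5, 2, 16, 50

class ToyDataset(tud.Dataset):
    """Stand-in for WordEmbeddingDataset, built on random token ids."""
    def __init__(self, n_tokens=1000):
        self.text = torch.randint(0, VOCAB, (n_tokens,))
        counts = torch.bincount(self.text, minlength=VOCAB).float()
        freqs = counts ** 0.75            # 3/4-power smoothing
        self.freqs = freqs / freqs.sum()  # negative sampling distribution

    def __len__(self):
        return len(self.text)

    def __getitem__(self, idx):
        # Window positions around idx, wrapped with a modulo
        ctx = [(idx + o) % len(self.text) for o in range(-C, C + 1) if o != 0]
        pos = self.text[ctx]
        neg = torch.multinomial(self.freqs, K * len(ctx), replacement=True)
        return self.text[idx], pos, neg

class ToySkipGram(nn.Module):
    """Same loss as EmbeddingModel, reduced to the essentials."""
    def __init__(self):
        super().__init__()
        self.in_embed = nn.Embedding(VOCAB, EMBED)
        self.out_embed = nn.Embedding(VOCAB, EMBED)

    def forward(self, center, pos, neg):
        v = self.in_embed(center).unsqueeze(2)                      # [B, E, 1]
        pos_score = torch.bmm(self.out_embed(pos), v).squeeze(2)    # [B, 2C]
        neg_score = torch.bmm(self.out_embed(neg), -v).squeeze(2)   # [B, 2C*K]
        return -(F.logsigmoid(pos_score).sum(1) + F.logsigmoid(neg_score).sum(1))

def train(num_epochs=2, batch_size=32, lr=0.2):
    loader = tud.DataLoader(ToyDataset(), batch_size=batch_size, shuffle=True)
    model = ToySkipGram()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # optimizer choice is an assumption
    epoch_losses = []
    for _ in range(num_epochs):
        total = 0.0
        for center, pos, neg in loader:
            optimizer.zero_grad()
            loss = model(center, pos, neg).mean()  # average the per-example losses
            loss.backward()
            optimizer.step()
            total += loss.item()
        epoch_losses.append(total / len(loader))
    return model, epoch_losses

model, losses = train()
print(losses)
```

With the real classes, the same loop would presumably take `WordEmbeddingDataset` and `EmbeddingModel(VOCAB_SIZE, EMBEDDING_SIZE)` in place of the toy versions. Since the token ids here are random, the printed losses only confirm that the pipeline runs.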

Paper: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
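The 3/4 power applied to `word_freqs` in the code above comes from this paper. A small sketch with made-up counts (the numbers are purely illustrative) shows its effect: probability mass moves from the most frequent words toward rare ones, so rare words are drawn as negatives more often than their raw frequency would suggest.

```python
import numpy as np

# Hypothetical word counts, from very frequent to very rare.
counts = np.array([1000, 100, 10, 1], dtype=np.float64)

unigram = counts / counts.sum()   # raw unigram distribution
smoothed = unigram ** 0.75        # raise to the 3/4 power
smoothed /= smoothed.sum()        # renormalize so it sums to 1

for raw, new in zip(unigram, smoothed):
    print(f"raw={raw:.4f}  smoothed={new:.4f}")
```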

Evaluating word embeddings:

Word analogy task
Word similarity task
As features for CRF-based named entity recognition
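The first two tasks usually come down to cosine similarity over the embedding matrix. Below is a minimal sketch using hand-made 2-d vectors. The words, the vectors, and the helpers `cosine_similarities` and `nearest` are hypothetical and chosen so that the arithmetic is easy to follow. With a trained model, the matrix would be the output of `input_embeddings()`.

```python
import numpy as np

def cosine_similarities(matrix, vector):
    """Cosine similarity between every row of `matrix` and `vector`."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    return matrix @ vector / np.maximum(norms, 1e-12)

def nearest(matrix, idx_to_word, vector, exclude=(), topn=1):
    """Words whose embeddings are closest to `vector`, skipping `exclude`."""
    order = np.argsort(-cosine_similarities(matrix, vector))
    return [idx_to_word[i] for i in order if idx_to_word[i] not in exclude][:topn]

# Hypothetical embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
idx_to_word = ["king", "queen", "man", "woman"]
word_to_idx = {w: i for i, w in enumerate(idx_to_word)}
emb = np.array([[1.0, 1.0], [1.0, -1.0], [0.0, 1.0], [0.0, -1.0]])

# Word similarity: which word is closest to "king"?
similar = nearest(emb, idx_to_word, emb[word_to_idx["king"]], exclude={"king"})

# Word analogy: king - man + woman should land near "queen".
query = emb[word_to_idx["king"]] - emb[word_to_idx["man"]] + emb[word_to_idx["woman"]]
analogy = nearest(emb, idx_to_word, query, exclude={"king", "man", "woman"})

print(similar, analogy)
```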
