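These are my study notes and a summary; if you spot any mistakes, corrections are welcome.
Discrete representations: one-hot encoding, bag of words (TF-IDF), N-gram.
Problems: they cannot capture relationships between words, the vocabulary dimension balloons as the corpus grows, the data are sparse, and the usual measures (distances, logical AND/OR/NOT) are not meaningful on them.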
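A minimal sketch of the first problem, using a hypothetical three-word vocabulary: with one-hot vectors, every pair of distinct words has the same cosine similarity and the same distance, so "king" looks no closer to "queen" than to "apple".

```python
import numpy as np

# Toy vocabulary (hypothetical, for illustration only)
vocab = ["king", "queen", "apple"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector of vocab[i]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_king_queen = cosine(one_hot[0], one_hot[1])
sim_king_apple = cosine(one_hot[0], one_hot[2])
dist_king_queen = float(np.linalg.norm(one_hot[0] - one_hot[1]))
dist_king_apple = float(np.linalg.norm(one_hot[0] - one_hot[2]))

print(sim_king_queen, sim_king_apple)    # identical similarities
print(dist_king_queen, dist_king_apple)  # identical distances
```

The vector length also equals the vocabulary size, which is why the dimension grows with the corpus.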
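Distributed representations (each word is represented by the words that occur near it): word2vec, word embeddings.
skip-gram:
Model characteristics: no hidden layer; the projection layer can also be omitted; each word vector is fed directly into a log-linear model.
Objective function:
The probability is given by the softmax:
Loss function: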
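For reference, the standard skip-gram formulation of these three quantities can be written as follows. The notation is an assumption made here for illustration: T is the number of tokens in the corpus, C the window radius, V the vocabulary size, v_c the input (center-word) vector and u_o the output (context-word) vector.

```latex
% Objective: average log-probability of the context words within a window of radius C
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-C \le j \le C \\ j \ne 0}} \log P(w_{t+j} \mid w_t)

% Softmax probability of context word o given center word c
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}

% Loss: the negative of the objective, to be minimized
L(\theta) = -J(\theta)
```

The denominator of the softmax sums over all V words, which is what makes the exact loss expensive for a large vocabulary.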
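Negative sampling: in the softmax for P(w|context(w)), each training example has one positive sample and V-1 negative samples; instead of normalizing over all of them, only a small number of negatives are sampled.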
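Written out for a single (center, context) pair, the negative-sampling loss is commonly given in the form below. The symbols are assumptions for illustration: v_c is the center word's input vector, u_o the context word's output vector, n_1, ..., n_K are the K sampled negative words, U(w) is the unigram frequency, and P_n is the noise distribution used in the word2vec paper.

```latex
% Negative-sampling loss for one (center c, context o) pair with K negatives
L_{\mathrm{NS}}(c, o) = -\log \sigma\!\left(u_o^{\top} v_c\right)
                        - \sum_{k=1}^{K} \log \sigma\!\left(-u_{n_k}^{\top} v_c\right),
\qquad n_k \sim P_n(w) \propto U(w)^{3/4}

% Logistic sigmoid
\sigma(x) = \frac{1}{1 + e^{-x}}
```

Each term needs only one dot product, so the cost per pair scales with K rather than with V.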
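Core PyTorch implementation:
from collections import Counter

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as tud

K = 100 # number of negative samples per positive word
C = 3 # context window radius (words on each side of the center word)
NUM_EPOCHS = 2 # number of training epochs
MAX_VOCAB_SIZE = 30000 # vocabulary size
BATCH_SIZE = 128 # batch size
LEARNING_RATE = 0.2 # initial learning rate
EMBEDDING_SIZE = 100 # dimension of each word vector
LOG_FILE = "word-embedding.log"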
# Tokenizer: split a text into individual words
def word_tokenize(text):
    return text.split()

with open("./text8/text8.train.txt", "r") as fin:
    text = fin.read()
text = word_tokenize(text.lower())
# Keep the MAX_VOCAB_SIZE - 1 most frequent words; the last slot is reserved for unknown words
vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE - 1))
vocab["<unk>"] = len(text) - np.sum(list(vocab.values()))  # count of all out-of-vocabulary tokens
idx_to_word = list(vocab.keys())
word_to_idx = {word: i for i, word in enumerate(idx_to_word)}
word_counts = np.array(list(vocab.values()), dtype=np.float32)
word_freqs = word_counts / np.sum(word_counts)
word_freqs = word_freqs ** (3. / 4.)  # the paper raises the frequencies to the 3/4 power and renormalizes, which improves accuracy
word_freqs = word_freqs / np.sum(word_freqs)  # used for negative sampling
VOCAB_SIZE = len(idx_to_word)
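To see what the 3/4 power does to the sampling distribution, here is a small sketch with hypothetical counts for three words. It only illustrates the direction of the effect: frequent words are sampled somewhat less often and rare words somewhat more often than under the raw unigram distribution.

```python
import numpy as np

# Hypothetical counts: one very frequent word, one mid-frequency word, one rare word
toy_counts = np.array([900.0, 90.0, 10.0])

unigram = toy_counts / toy_counts.sum()  # raw unigram distribution
smoothed = unigram ** 0.75               # raise to the 3/4 power
smoothed = smoothed / smoothed.sum()     # renormalize so it sums to 1

print("unigram :", unigram)
print("smoothed:", smoothed)
```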
class WordEmbeddingDataset(tud.Dataset):
    def __init__(self, text, word_to_idx, idx_to_word, word_freqs, word_counts):
        ''' text: a list of words, all text from the training dataset
            word_to_idx: the dictionary from word to idx
            idx_to_word: idx to word mapping
            word_freqs: the frequency of each word
            word_counts: the word counts
        '''
        super(WordEmbeddingDataset, self).__init__()
        # Out-of-vocabulary words map to VOCAB_SIZE - 1, the index of "<unk>"
        self.text_encoded = [word_to_idx.get(t, VOCAB_SIZE-1) for t in text]
        self.text_encoded = torch.tensor(self.text_encoded, dtype=torch.long)
        self.word_to_idx = word_to_idx
        self.idx_to_word = idx_to_word
        self.word_freqs = torch.Tensor(word_freqs)
        self.word_counts = torch.Tensor(word_counts)

    def __len__(self):
        ''' Return the length of the whole dataset (the total number of words)
        '''
        return len(self.text_encoded)

    def __getitem__(self, idx):
        ''' Return the following data for training:
            - the center word
            - the (positive) words near the center word
            - K randomly sampled words per positive word, used as negative samples
        '''
        center_word = self.text_encoded[idx]
        pos_indices = list(range(idx-C, idx)) + list(range(idx+1, idx+C+1))  # indices of the words inside the window
        pos_indices = [i % len(self.text_encoded) for i in pos_indices]  # wrap around so indices stay within the text
        pos_words = self.text_encoded[pos_indices]  # surrounding words
        neg_words = torch.multinomial(self.word_freqs, K * pos_words.shape[0], True)  # negative sampling, with replacement
        return center_word, pos_words, neg_words
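The window and sampling logic of `__getitem__` can be checked in isolation on a tiny encoded text. The sketch below repeats the same three steps as a standalone function; the toy text, the frequencies, and the helper name `sample` are all hypothetical.

```python
import torch

C = 2  # toy window radius (hypothetical, smaller than in the notes)
K = 3  # toy number of negatives per positive word
text_encoded = torch.tensor([0, 1, 2, 3, 4, 1, 2, 0])     # toy corpus of word indices
word_freqs = torch.tensor([0.4, 0.3, 0.15, 0.1, 0.05])    # toy sampling distribution

def sample(idx):
    """Return (center word, 2*C context words, 2*C*K negative samples)."""
    pos_indices = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))
    pos_indices = [i % len(text_encoded) for i in pos_indices]  # wrap around the text
    pos_words = text_encoded[pos_indices]
    neg_words = torch.multinomial(word_freqs, K * pos_words.shape[0], True)
    return text_encoded[idx], pos_words, neg_words

center, pos, neg = sample(0)
print(center.item(), pos.tolist(), tuple(neg.shape))
```

Two properties are worth noting. At position 0 the window wraps to the end of the text, so the first words are paired with the last ones. The negatives are also drawn independently of the window, so a sampled "negative" can occasionally be one of the true context words.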
class EmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_size):
        ''' Initialize the input and output embeddings
        '''
        super(EmbeddingModel, self).__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        initrange = 0.5 / self.embed_size
        self.out_embed = nn.Embedding(self.vocab_size, self.embed_size, sparse=False)
        self.out_embed.weight.data.uniform_(-initrange, initrange)
        self.in_embed = nn.Embedding(self.vocab_size, self.embed_size, sparse=False)
        self.in_embed.weight.data.uniform_(-initrange, initrange)

    def forward(self, input_labels, pos_labels, neg_labels):
        '''
        input_labels: center words, [batch_size]
        pos_labels: words that appear in the context window around the center word, [batch_size, window_size * 2]
        neg_labels: words that do not appear around the center word, obtained by negative sampling, [batch_size, window_size * 2 * K]
        return: loss, [batch_size]
        '''
        batch_size = input_labels.size(0)
        input_embedding = self.in_embed(input_labels)  # B * embed_size
        pos_embedding = self.out_embed(pos_labels)  # B * (2*C) * embed_size
        neg_embedding = self.out_embed(neg_labels)  # B * (2*C * K) * embed_size
        # squeeze(2) drops only the trailing dimension, so a batch of size 1 keeps its batch dimension
        log_pos = torch.bmm(pos_embedding, input_embedding.unsqueeze(2)).squeeze(2)  # B * (2*C)
        log_neg = torch.bmm(neg_embedding, -input_embedding.unsqueeze(2)).squeeze(2)  # B * (2*C*K)
        log_pos = F.logsigmoid(log_pos).sum(1)
        log_neg = F.logsigmoid(log_neg).sum(1)  # batch_size
        loss = log_pos + log_neg
        return -loss

    def input_embeddings(self):
        return self.in_embed.weight.data.cpu().numpy()
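The notes stop at the model definition. As a hedged sketch of how such a loss could be driven by an optimizer, the block below runs a few SGD steps on one fixed random batch. Everything here is an assumption for illustration: the toy sizes, the random batch, the choice of `torch.optim.SGD`, and the standalone function `neg_sampling_loss`, which repeats the same computation as `forward`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, EMBED, B, C, K = 50, 8, 4, 2, 3  # toy sizes (hypothetical)

in_embed = nn.Embedding(VOCAB, EMBED)   # center-word vectors
out_embed = nn.Embedding(VOCAB, EMBED)  # context-word vectors

def neg_sampling_loss(center, pos, neg):
    """Negative-sampling loss per example, shape [B]."""
    v = in_embed(center).unsqueeze(2)                       # [B, EMBED, 1]
    pos_score = torch.bmm(out_embed(pos), v).squeeze(2)     # [B, 2C]
    neg_score = torch.bmm(out_embed(neg), -v).squeeze(2)    # [B, 2C*K]
    return -(F.logsigmoid(pos_score).sum(1) + F.logsigmoid(neg_score).sum(1))

params = list(in_embed.parameters()) + list(out_embed.parameters())
optimizer = torch.optim.SGD(params, lr=0.2)

# One fixed random batch standing in for a DataLoader batch
center = torch.randint(0, VOCAB, (B,))
pos = torch.randint(0, VOCAB, (B, 2 * C))
neg = torch.randint(0, VOCAB, (B, 2 * C * K))

losses = []
for step in range(20):
    optimizer.zero_grad()
    loss = neg_sampling_loss(center, pos, neg).mean()  # average over the batch
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```

In a real run the batch would come from a `DataLoader` over the dataset, with a fresh batch at every step; the fixed batch is used here only to keep the example short.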
Evaluating the word embeddings:
- Word analogy task
- Word similarity task
- Use as features for CRF-based named entity recognition
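The first two tasks both reduce to cosine similarity between vectors. The sketch below uses hand-picked 2-D vectors (hypothetical, chosen so that the analogy works exactly) in place of trained embeddings; the helper `most_similar` is likewise an illustrative name.

```python
import numpy as np

# Hypothetical hand-made embeddings; real ones would come from the trained model
words = ["man", "woman", "king", "queen", "apple"]
E = np.array([
    [1.0, 0.0],   # man
    [1.0, 1.0],   # woman
    [3.0, 0.0],   # king
    [3.0, 1.0],   # queen
    [-2.0, 0.5],  # apple
])

def vec(word):
    return E[words.index(word)]

def most_similar(query, exclude=(), topn=1):
    """Rank words by cosine similarity to the query vector."""
    sims = E @ query / (np.linalg.norm(E, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)
    return [words[i] for i in order if words[i] not in exclude][:topn]

# Word similarity: cosine between two word vectors
sim_king_queen = float(vec("king") @ vec("queen")
                       / (np.linalg.norm(vec("king")) * np.linalg.norm(vec("queen"))))

# Word analogy: man is to king as woman is to ?
analogy = most_similar(vec("king") - vec("man") + vec("woman"),
                       exclude={"king", "man", "woman"})
print(sim_king_queen, analogy)
```

The query words are excluded from the analogy ranking because the nearest neighbor of the query vector is often one of the input words themselves.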