PyTorch Exercise: Computing Word Embeddings: Continuous Bag-of-Words

PyTorch Tutorial

In PyTorch, the exercise on training word embeddings is described as follows:

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It is a model that tries to predict words given the context of a few words before and a few words after the target word. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic. Typically, CBOW is used to quickly train word embeddings, and these embeddings are used to initialize the embeddings of some more complicated model. Usually, this is referred to as pretraining embeddings. It almost always helps performance by a couple of percent.
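
As a rough illustration of that last point, pretrained CBOW embeddings can simply be copied into the embedding layer of a downstream model before training it. The DownstreamModel class and its arguments below are hypothetical, just a sketch of the idea:

import torch.nn as nn

class DownstreamModel(nn.Module):  # hypothetical "more complicated model"
    def __init__(self, vocab_size, embedding_dim, pretrained_weights):
        super(DownstreamModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # initialize from the CBOW-trained embedding matrix instead of random weights
        self.embeddings.weight.data.copy_(pretrained_weights)
        # ... the rest of the downstream architecture would go here ...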

The CBOW model is as follows. Given a target word $w_i$ and an $N$ context window on each side, $w_{i-1}, \dots, w_{i-N}$ and $w_{i+1}, \dots, w_{i+N}$, referring to all context words collectively as $C$, CBOW tries to minimize

$$-\log p(w_i \mid C) = -\log \operatorname{Softmax}\left(A\left(\sum_{w \in C} q_w\right) + b\right)$$

where $q_w$ is the embedding for word $w$.
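
The loss above maps directly onto a handful of tensor operations. The following is only a sketch with made-up sizes (a vocabulary of 10 words, 5-dimensional embeddings, a 4-word context, target index 2); it is not part of the exercise:

import torch
import torch.nn.functional as F

q = torch.randn(10, 5)               # embedding table, one row q_w per word
A = torch.randn(10, 5)               # weight of the affine map A
b = torch.randn(10)                  # bias b
C = torch.LongTensor([1, 3, 5, 7])   # indices of the context words

summed = q[C].sum(dim=0)             # sum of q_w over w in C
scores = A.mv(summed) + b            # A(sum of q_w) + b
loss = -F.log_softmax(scores, dim=0)[2]  # -log p(w_i | C) for target word index 2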

Implement this model in PyTorch by filling in the class below. Some tips:

  • Think about which parameters you need to define.
  • Make sure you know what shape each operation expects. Use .view() if you need to reshape.
The code is as follows:
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # trainable parameters: the embedding table (q_w for every word)
        self.linear1 = nn.Linear(embedding_dim, vocab_size)  # trainable parameters: A and b


    def forward(self, inputs):
        embeds = self.embeddings(inputs)
        add_embeds = torch.sum(embeds, dim=0).view(1, -1)  # sum the context embeddings, then reshape to (1, embedding_dim)
        out = self.linear1(add_embeds)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

# create your model and train.  here are some functions to help you make
# the data ready for use by your module


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    tensor = torch.LongTensor(idxs)
    return Variable(tensor)


make_context_vector(data[0][0], word_to_ix)  # example
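
# Shape note (assuming embedding_dim=20 as below): the context vector above is a
# LongTensor of 4 word indices; inside CBOW.forward, embeds has shape (4, 20),
# add_embeds becomes (1, 20) after the sum and .view(), and log_probs is (1, vocab_size).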

# declare the loss function, model, and optimizer
losses = []
loss_function = nn.NLLLoss()
model = CBOW(vocab_size, embedding_dim=20)
optimizer = optim.SGD(model.parameters(), lr=0.001)

# train for 10 epochs
for epoch in range(10):
    total_loss = torch.FloatTensor([0])
    for context, target in data:
        context_idxs = [word_to_ix[w] for w in context]
        target_idx = word_to_ix[target]
        context_var = Variable(torch.LongTensor(context_idxs))
        target_var = Variable(torch.LongTensor([target_idx]))
        model.zero_grad()
        log_probs = model(context_var)

        loss = loss_function(log_probs, target_var)
        loss.backward()
        optimizer.step()

        total_loss += loss.data
    losses.append(total_loss)
print(losses)

Output:

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]
[
 260.2805
[torch.FloatTensor of size 1]
, 
 255.0300
[torch.FloatTensor of size 1]
, 
 249.8967
[torch.FloatTensor of size 1]
, 
 244.8781
[torch.FloatTensor of size 1]
, 
 239.9720
[torch.FloatTensor of size 1]
, 
 235.1766
[torch.FloatTensor of size 1]
, 
 230.4900
[torch.FloatTensor of size 1]
, 
 225.9105
[torch.FloatTensor of size 1]
, 
 221.4367
[torch.FloatTensor of size 1]
, 
 217.0672
[torch.FloatTensor of size 1]
]
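
After training, a quick spot-check is possible. The snippet below is only a sketch using the variables defined above; it is not part of the original exercise or its printed output:

# predict the target word for the first (context, target) pair in `data`
context, target = data[0]
log_probs = model(make_context_vector(context, word_to_ix))
scores = log_probs.data.view(-1).tolist()
ix_to_word = {i: w for w, i in word_to_ix.items()}
print(context, '->', ix_to_word[scores.index(max(scores))], '(true target:', target, ')')

# the learned embedding for the target word is one row of the embedding table
print(model.embeddings.weight.data[word_to_ix[target]])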

