A Brief Introduction to Word2vec and Using word2vec in gensim

This article is reposted from the following blog posts: "Word2vec原理淺析" (A brief introduction to Word2vec): https://blog.csdn.net/u010700066/article/details/83070102

and "gensim中word2vec使用" (Using word2vec in gensim): https://www.jianshu.com/p/b779f8219f74

Apologies to the original authors for any offense caused by reposting.

A Brief Introduction to Word2vec and a TensorFlow Implementation

Word2Vec is a word-vector model proposed by Mikolov et al. at Google.

  • Input: a large amount of pre-tokenized text
  • Output: a dense vector representing each word

The significance of word vectors is that they turn natural language into vectors a computer can work with. Compared with bag-of-words or TF-IDF models, word vectors capture a word's context and semantics and can measure similarity between words, which makes them important in many NLP tasks such as text classification and sentiment analysis.

The classic word-vector example: ![][01]
[01]:http://latex.codecogs.com/png.latex?\vec{man}-\vec{woman}\approx\vec{king}-\vec{queen}
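
A quick way to see this analogy in practice is gensim's most_similar with positive and negative word lists. The sketch below is illustrative only; it assumes you already have a trained model saved under the hypothetical file name word2vec.model (for example, one trained on a Wikipedia corpus):

from gensim.models import Word2Vec

# Load a previously trained model (hypothetical file name).
model = Word2Vec.load("word2vec.model")

# vec(king) - vec(man) + vec(woman) should land near vec(queen).
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5))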

gensim already provides a Python implementation of word2vec, so with a corpus in hand you can train directly; see, for example, the Word2Vec experiments on Chinese and English Wikipedia corpora.

Being able to train word vectors with gensim does not mean you have really mastered word2vec; it only means you can read the documentation and call an API.

Word2vec in Detail

In a nutshell, word2vec is implemented as a three-layer neural network. To understand the implementation, the prerequisites are neural networks and logistic regression.

Network structure

(Figure: word2vec overview diagram)

The figure above is a simplified flow diagram of Word2vec. Assume the vocabulary contains 10,000 words and the word vectors are 300-dimensional (according to Stanford CS224d, word vectors usually range from 25 to 1000 dimensions, and 300 is a good choice). Using a single training example, each part is described below.

  1. Input layer: the input is the one-hot representation of a word, a vector of length 10,000. Suppose the word is "ants" and its ID in the vocabulary is i; then the i-th component of the input vector is 1 and all the others are 0. [0, 0, ..., 0, 0, 1, 0, 0, ..., 0, 0]
  2. Hidden layer: the number of hidden units equals the word-vector length, and the hidden-layer parameters form a [10000, 300] matrix. This parameter matrix is, in fact, the word-vector table. Recall matrix multiplication: a one-hot row vector times a matrix simply selects the i-th row of the matrix. So the hidden layer maps the 10,000-dimensional one-hot vector to the 300-dimensional word vector we are after.

    (Figure: matrix multiplication)

  3. Output layer: the output layer has as many neurons as there are vocabulary words (10,000), with a parameter matrix of shape [300, 10000]. The word vector is multiplied by this matrix and then normalized with softmax, giving a 10,000-dimensional vector again; each component is the probability that the corresponding vocabulary word appears in the same context as the input word (here, "ants").

     

    (Figure: output layer)

     

    The figure above computes the probability that "car" co-occurs with "ants"; the 300-dimensional column vector for "car" is one column of the output-layer parameter matrix. Since that matrix is [300, 10000], the co-occurrence probabilities of all vocabulary words with "ants" are computed at once. The output-layer parameter matrix serves no purpose once training is finished.

  4. Training: a training example (x, y) has both an input and an output; we know which word actually co-occurs with "ants", so y is also a 10,000-dimensional (one-hot) vector. As in logistic regression, the loss function is the cross-entropy between the network's final output vector and y, and it is minimized with stochastic gradient descent. (A NumPy sketch of this forward pass and loss follows this list.)

     

    (Figure: cross-entropy loss)
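
To make the three layers concrete, here is a minimal NumPy sketch of the forward pass and cross-entropy loss for one (input, label) pair, using a toy vocabulary of 10 words and 4-dimensional vectors instead of 10,000 and 300 (all numbers here are illustrative only):

import numpy as np

V, D = 10, 4                             # toy vocabulary size and vector length
W_in = np.random.randn(V, D) * 0.01      # hidden-layer parameters = the word-vector table
W_out = np.random.randn(D, V) * 0.01     # output-layer parameters (discarded after training)

x = np.zeros(V); x[3] = 1.0              # one-hot input, e.g. the word "ants" with ID 3
y = np.zeros(V); y[7] = 1.0              # one-hot label: the word observed in the context

h = x @ W_in                             # identical to W_in[3]: the input word's vector
logits = h @ W_out                       # one score per vocabulary word
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the vocabulary
loss = -np.sum(y * np.log(probs))        # cross-entropy against the one-hot label
print(loss)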

The steps above describe the case of one word as input and one context word as output, but reality is more complicated. What exactly is the context? Do we use one word to predict the words around it, or many surrounding words to predict one word? This is where the two training models, skip-gram and CBOW, come in.

skip-gram和CBOW

  • skip-gram: the core idea is to predict the surrounding words from the center word. Suppose the center word is "cat" and the window size is 2; then "cat" is used to predict the two words to its left and the two words to its right. Here "cat" is the network input and each predicted word is a label. The figure below shows an example:

     

    (Figure: skip-gram training pairs)

    Here the window size is 2 and the center word slides through the entire text one position at a time. Each move of the center word produces at most 4 training pairs (input, label).

  • CBOW (continuous bag of words): once you understand skip-gram, CBOW is simply the reverse: all the surrounding words are used to predict the center word. Now each move of the center word produces only one training example. With the same text as above, CBOW produces the following 4 training examples:

    1. ([quick, brown], the)
    2. ([the, brown, fox], quick)
    3. ([the, quick, fox, jumps], brown)
    4. ([quick, brown, jumps, over], fox)

    Now the input may be 4 words while the label is a single word. What to do? It is simple: just average. After the hidden layer, the 4 input words have been mapped to four 300-dimensional vectors; averaging these 4 vectors gives the input to the next layer.

Comparing the two models: skip-gram produces more training pairs and captures finer-grained semantic detail between words, so with a large, high-quality corpus skip-gram tends to outperform CBOW. With a small corpus there is not enough data to capture such detail, and the averaging in CBOW can actually work better.
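
To illustrate how the two models slice the same text into training pairs (this is just a sketch; the batching code used in the TensorFlow section below is different), consider:

text = "the quick brown fox jumps over the lazy dog".split()
window = 2

skipgram_pairs, cbow_pairs = [], []
for i, center in enumerate(text):
    # context words within `window` positions of the center word
    context = [text[j] for j in range(max(0, i - window), min(len(text), i + window + 1)) if j != i]
    # skip-gram: one (input, label) pair per context word
    skipgram_pairs += [(center, c) for c in context]
    # CBOW: all context words together predict the center word
    cbow_pairs.append((context, center))

print(skipgram_pairs[:4])
print(cbow_pairs[1])   # (['the', 'brown', 'fox'], 'quick')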

Negative Sampling

In actual training, still assuming a 10,000-word vocabulary and 300-dimensional vectors, each layer of the network has 3 million parameters, and the output layer amounts to a multi-class classification problem with 10,000 classes. The computational cost is, unsurprisingly, enormous.
Mikolov et al. proposed several optimizations; here we focus on negative sampling.
The idea of negative sampling is almost embarrassingly simple: the softmax output is a vector in which only the component with the highest probability corresponds to the correct word, and the rest are negative samples. If we keep only 5 negative samples, the output vector effectively becomes 6-dimensional, and the number of output parameters to consider drops from 3 million to 1,800! It looks lazy, but it works very well in practice and greatly speeds up training.
Recall that each training step normally makes a small adjustment to all of the network's parameters. In word2vec, a single training example does not touch all parameters: if the input word is "cat", then even though the hidden layer has 3 million parameters, this step only updates the 300 parameters corresponding to "cat", because the hidden-layer output depends only on those 300 parameters.
Negative sampling works, and we do not need many negative samples. Mikolov et al. state in the paper: 5-20 negative samples for small datasets, 2-5 for large datasets.

How exactly are the negative-sample words chosen? The paper gives the following formula:

![][02]
[02]: http://latex.codecogs.com/png.latex?P(w_i)=\frac{f(w_i)^{3/4}}{\sum_{j}f(w_j)^{3/4}}

where f(w) is the word frequency. As the formula shows, the selection of negative samples depends only on word frequency: the more frequent a word, the more likely it is to be chosen.
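
A small NumPy sketch of this sampling distribution (the word counts below are made up purely for illustration):

import numpy as np

counts = {"the": 5000, "cat": 40, "ants": 12, "car": 30, "queen": 8}   # toy word frequencies
words = list(counts)
f = np.array([counts[w] for w in words], dtype=float)

# P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)
p = f ** 0.75
p /= p.sum()

# draw 5 negative samples according to this distribution
negatives = np.random.choice(words, size=5, p=p)
print(dict(zip(words, p.round(3))))
print(negatives)

Raising the counts to the 3/4 power flattens the distribution slightly, so very frequent words are sampled somewhat less often than their raw frequency alone would suggest.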

TensorFlow Implementation

Finally, let's try it out in TensorFlow, following an assignment from Udacity's Deep Learning course.

Here we only train 128-dimensional word vectors and visualize them with t-SNE. It is a good exercise for getting hands-on and understanding word2vec more deeply; for practical work, gensim is still the recommended tool.

# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
%matplotlib inline
from __future__ import print_function
import collections
import math
import numpy as np
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE

Download the data from the source website if necessary.

url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)
Found and verified text8.zip

Read the data into a string.

def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words"""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data
  
words = read_data(filename)
print('Data size %d' % len(words))
Data size 17005207

Build the dictionary and replace rare words with UNK token.

vocabulary_size = 50000

def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del words  # Hint to reduce memory.
Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]

Function to generate a training batch for the skip-gram model.

data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1 # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [ skip_window ]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])

for num_skips, skip_window in [(2, 1), (4, 2)]:
    data_index = 0
    batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first']

with num_skips = 2 and skip_window = 1:
    batch: ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term']
    labels: ['anarchism', 'as', 'originated', 'a', 'as', 'term', 'a', 'of']

with num_skips = 4 and skip_window = 2:
    batch: ['as', 'as', 'as', 'as', 'a', 'a', 'a', 'a']
    labels: ['originated', 'term', 'anarchism', 'a', 'of', 'as', 'originated', 'term']

Skip-Gram

Train a skip-gram model.

batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent. 
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))

#######important#########
num_sampled = 64 # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):

  # Input data.
  train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  
  # Variables.
  embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  softmax_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                         stddev=1.0 / math.sqrt(embedding_size)))
  softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Model.
  # Look up embeddings for inputs.
  embed = tf.nn.embedding_lookup(embeddings, train_dataset)
  # Compute the softmax loss, using a sample of the negative labels each time.
  loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
                               labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))

  # Optimizer.
  # Note: The optimizer will optimize the softmax_weights AND the embeddings.
  # This is because the embeddings are defined as a variable quantity and the
  # optimizer's `minimize` method will by default modify all variable quantities 
  # that contribute to the tensor it is passed.
  # See docs on `tf.train.Optimizer.minimize()` for more details.
  optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
  
  # Compute the similarity between minibatch examples and all embeddings.
  # We use the cosine distance:
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
    normalized_embeddings, valid_dataset)
  similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
num_steps = 100001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  average_loss = 0
  for step in range(num_steps):
    batch_data, batch_labels = generate_batch(
      batch_size, num_skips, skip_window)
    feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
    # note that this is expensive (~20% slowdown if computed every 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8 # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k+1]
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
        print(log)
  final_embeddings = normalized_embeddings.eval()
num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])
def plot(embeddings, labels):
  assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
  pylab.figure(figsize=(15,15))  # in inches
  for i, label in enumerate(labels):
    x, y = embeddings[i,:]
    pylab.scatter(x, y)
    pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                   ha='right', va='bottom')
  pylab.show()

words = [reverse_dictionary[i] for i in range(1, num_points+1)]
plot(two_d_embeddings, words)

(Figure: t-SNE visualization of the skip-gram embeddings)


CBOW


data_index_cbow = 0

def get_cbow_batch(batch_size, num_skips, skip_window):
    global data_index_cbow
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1 # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index_cbow])
        data_index_cbow = (data_index_cbow + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [ skip_window ]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index_cbow])
        data_index_cbow = (data_index_cbow + 1) % len(data)
    cbow_batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    cbow_labels = np.ndarray(shape=(batch_size // (skip_window * 2), 1), dtype=np.int32)
    for i in range(batch_size):
        cbow_batch[i] = labels[i]
    cbow_batch = np.reshape(cbow_batch, [batch_size // (skip_window * 2), skip_window * 2])
    for i in range(batch_size // (skip_window * 2)):
        # center word
        cbow_labels[i] = batch[2 * skip_window * i]
    return cbow_batch, cbow_labels
# actual batch_size = batch_size // (2 * skip_window)
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
# We pick a random validation set to sample nearest neighbors. here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent. 
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))

#######important#########
num_sampled = 64 # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):

  # Input data.
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size // (skip_window * 2), skip_window * 2])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size // (skip_window * 2), 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  
  # Variables.
    embeddings = tf.Variable(
      tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
      tf.truncated_normal([vocabulary_size, embedding_size],
                         stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Model.
  # Look up embeddings for inputs.
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    
    # reshape embed
    embed = tf.reshape(embed, (skip_window * 2, batch_size // (skip_window * 2), embedding_size))
    # average embed
    embed = tf.reduce_mean(embed, 0)
    
  # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(
      tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=embed,
                               labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))

  # Optimizer.
  # Note: The optimizer will optimize the softmax_weights AND the embeddings.
  # This is because the embeddings are defined as a variable quantity and the
  # optimizer's `minimize` method will by default modify all variable quantities 
  # that contribute to the tensor it is passed.
  # See docs on `tf.train.Optimizer.minimize()` for more details.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
  
  # Compute the similarity between minibatch examples and all embeddings.
  # We use the cosine distance:
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(
      normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
num_steps = 100001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  average_loss = 0
  for step in range(num_steps):
    batch_data, batch_labels = get_cbow_batch(
      batch_size, num_skips, skip_window)
    feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
    # note that this is expensive (~20% slowdown if computed every 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8 # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k+1]
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
        print(log)
  final_embeddings = normalized_embeddings.eval()
num_points = 400

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])
words = [reverse_dictionary[i] for i in range(1, num_points+1)]  # labels must match the embeddings plotted above
plot(two_d_embeddings, words)

(Figure: t-SNE visualization of the CBOW embeddings)

References

  1. Le Q V, Mikolov T. Distributed Representations of Sentences and Documents[J]. 2014, 4:II-1188.
  2. Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and their Compositionality[J]. Advances in Neural Information Processing Systems, 2013, 26:3111-3119.
  3. Word2Vec Tutorial - The Skip-Gram Model
  4. Udacity Deep Learning
  5. Stanford CS224d Lecture2,3

Using word2vec in gensim

The word2vec implementation lives in the Word2Vec class in the file gensim\models\word2vec.py inside the gensim package.
It takes 24 parameters:

| Parameter | Default | Purpose |
| --- | --- | --- |
| sentences | None | The training corpus, any iterable. For large corpora loaded from disk, it is best to build sentences with gensim.models.word2vec.BrownCorpus, gensim.models.word2vec.Text8Corpus or gensim.models.word2vec.LineSentence. |
| size | 100 | Dimensionality of the word vectors. |
| alpha | 0.025 | Initial learning rate. |
| window | 5 | Maximum distance between the current and predicted word within a sentence (context window size). |
| min_count | 5 | Words with a total frequency below this value are ignored. |
| max_vocab_size | None | Maximum vocabulary size while building the vocabulary; if exceeded, the lowest-frequency words are pruned. |
| sample | 1e-3 | Threshold for randomly downsampling high-frequency words; useful range is (0, 1e-5). |
| seed | 1 | Random seed for vector initialization. |
| workers | 3 | Number of worker threads used for training. |
| min_alpha | 0.0001 | The learning rate decays linearly to this minimum as training progresses. |
| sg | 0 | Training algorithm: 0 for CBOW, 1 for skip-gram. |
| hs | 0 | If 0 and negative is non-zero, negative sampling is used; if 1, hierarchical softmax is used. |
| negative | 5 | If greater than 0, negative sampling is used and this value specifies how many "noise words" are drawn. |
| ns_exponent | 0.75 | Exponent shaping the negative-sampling distribution: 1.0 samples exactly in proportion to frequency, 0.0 samples all words equally, and negative values sample low-frequency words more. The popular value is 0.75. |
| cbow_mean | 1 | 0: use the sum of the context word vectors; 1: use their mean. Only applies to CBOW. |
| hashfxn | hash | Hash function used to randomly initialize the weights, for better training reproducibility. |
| iter | 5 | Number of iterations (epochs) over the corpus. |
| null_word | 0 | Padding word. |
| trim_rule | None | Vocabulary trimming rule specifying whether certain words are kept; by default, words with frequency below min_count are discarded. A custom rule can be supplied. |
| sorted_vocab | 1 | 1: sort the vocabulary in descending frequency order; 0: do not sort. Implemented by gensim.models.word2vec.Word2VecVocab.sort_vocab(). |
| batch_words | 10000 | Target batch size in words passed to the worker threads; batches larger than 10000 are truncated by the Cython code. |
| compute_loss | False | If True, the training loss is computed and stored. |
| callbacks | () | Sequence of callbacks (~gensim.models.callbacks.CallbackAny2Vec) executed at specific stages of training. |
| max_final_vocab | None | Limits the vocabulary to a target size by automatically picking a matching min_count; if min_count is also given, that value is used. |
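
Most of these parameters can be left at their defaults. A typical call, using the pre-4.0 parameter names listed in the table above (size and iter were renamed to vector_size and epochs in gensim 4.0), might look like this sketch:

from gensim.models import Word2Vec

sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]

model_w2v = Word2Vec(
    sentences,
    size=100,        # dimensionality of the word vectors
    window=5,        # context window size
    min_count=1,     # keep even rare words in this tiny toy corpus
    sg=1,            # 1: skip-gram, 0: CBOW
    negative=5,      # number of negative samples
    workers=3,       # worker threads
    iter=5,          # training epochs
)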


Saving and using the model: after training, you can store and use just the ~gensim.models.keyedvectors.KeyedVectors.
The model can be stored/loaded via:
~gensim.models.word2vec.Word2Vec.save saves the model
~gensim.models.word2vec.Word2Vec.load loads the model

Trained word vectors can also be stored/loaded in the format used by the original word2vec implementation:
gensim.models.keyedvectors.KeyedVectors.save_word2vec_format saves the word vectors
gensim.models.keyedvectors.KeyedVectors.load_word2vec_format loads the word vectors
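
A minimal sketch of saving and loading (the file names here are arbitrary):

from gensim.models import Word2Vec, KeyedVectors

# a tiny model just for demonstration
model = Word2Vec([["cat", "say", "meow"], ["dog", "say", "woof"]], min_count=1)

model.save("w2v.model")                       # save the full model (training can be resumed later)
model = Word2Vec.load("w2v.model")            # load it back

model.wv.save_word2vec_format("w2v.txt", binary=False)            # export vectors in the original word2vec text format
wv = KeyedVectors.load_word2vec_format("w2v.txt", binary=False)   # load only the KeyedVectors (no further training)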
Model attributes

wv: an object of class ~gensim.models.keyedvectors.Word2VecKeyedVectors, exposed as an attribute of the model.
To share the word-vector lookup code between different training algorithms (Word2Vec, FastText, WordRank, VarEmbed), gensim separates the storage and querying of word vectors into a dedicated class, KeyedVectors.

It holds the mapping from words to their vectors, and it is what you query for word vectors:

model_w2v.wv.most_similar("民生銀行")  # find the words most similar to a given word
model_w2v.wv.get_vector("民生銀行")  # look up the vector of a word
model_w2v.wv.syn0  # same as model_w2v.wv.vectors: the matrix of all word vectors
model_w2v.wv.vocab  # the vocabulary: a mapping from each word to its Vocab entry (count, index, ...)
model_w2v.wv.index2word  # the word corresponding to each index

Tips:
    Note that word2vec stores its vocabulary in a standard hash table; when hash codes collide, entries are placed in the next free slot, and a lookup uses the stored index to find the actual word in the vocabulary. For comparison:
    syn0: the big matrix of word vectors; row i is the vector of the word with vocabulary index i
    syn1: the auxiliary matrix used by the hierarchical-softmax (hs) algorithm, i.e. the Wx in the paper
    syn1neg: the auxiliary matrix used by the negative-sampling algorithm
    next_random: a random number the author generates himself, initialized inside each thread

vocabulary: an object of class ~gensim.models.word2vec.Word2VecVocab
      The model's vocabulary. Besides storing the words, it provides extra functionality such as building a Huffman tree (frequent words are closer to the root) or discarding extremely rare words.
trainables: an object of class ~gensim.models.word2vec.Word2VecTrainables

  The internal shallow neural network used to train the word vectors; it differs slightly between CBOW and skip-gram (SG). Its weights are the word vectors we use afterwards, and the hidden-layer size equals the word-vector dimensionality.
About sentences

Training starts with loading the corpus. First, the corpus has to be put into the sentence format Word2Vec expects:
1. For simple sentences:

from gensim.models import Word2Vec
# sentences only needs to be an iterable
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, min_count=1)  # the model is trained when this line executes

2. For large corpora:
gensim only requires that the input yields sentences one after another; it does not need the whole corpus in memory. Simply put, it can load one sentence, process it, discard it, and then load the next one. The following helpers can be used:
gensim.models.word2vec.BrownCorpus: iterates over the Brown corpus, so it can be processed directly
gensim.models.word2vec.Text8Corpus
gensim.models.word2vec.LineSentence

from gensim.models.word2vec import LineSentence, Text8Corpus

# Using LineSentence()
sentences = LineSentence('a.txt')   # text format: words separated by spaces, one document per line
# Using Text8Corpus()
sentences = Text8Corpus('a.txt')    # text format: words separated by spaces, one document per line
model = Word2Vec(sentences, min_count=1)  # the model is trained when this line executes
