gensim (Part 1): core concepts

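The core concepts in gensim are:

document: a piece of text, split into words on whitespace
corpus: a collection of documents
vector: the vector representation of a single document
model: an algorithm that transforms documents into vectors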

document = "Human machine interface for lab abc computer applications"

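In gensim, a corpus serves two main purposes:

Training a model. During training, the model uses the training corpus to look for common themes and topics and to initialize its internal parameters. Gensim focuses on unsupervised models.
Organizing new documents. Once training is done, a topic model can be used to extract topics from documents it has not seen before.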

Data preprocessing

from collections import defaultdict
from gensim import corpora, models, similarities
text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
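# The list above is a tiny corpus. Real corpora are usually large, and loading one fully into memory is impractical, so gensim streams the corpus and processes one document at a time
stoplist = set('for a of the and to in'.split(' '))  # stop words
# Preprocessing: lowercase, split on whitespace, drop stop words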
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in text_corpus]
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
# Remove words that appear only once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
print(processed_corpus)
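
The remark about streaming can be made concrete with a generator. This is only a minimal sketch, not gensim's own code: `stream_documents` is a hypothetical helper, and the in-memory `io.StringIO` object stands in for a large file with one document per line.

```python
import io

STOPLIST = set("for a of the and to in".split())

def stream_documents(lines):
    """Yield one tokenized document at a time instead of building a full list."""
    for line in lines:
        tokens = [w for w in line.lower().split() if w not in STOPLIST]
        if tokens:  # skip blank lines
            yield tokens

# Stand-in for an open file handle (hypothetical data source).
fake_file = io.StringIO(
    "Human machine interface for lab abc computer applications\n"
    "A survey of user opinion of computer system response time\n"
)

docs = stream_documents(fake_file)
first = next(docs)  # only the first line has been read and tokenized so far
print(first)
```

Because the generator holds a single line at a time, memory use stays flat no matter how long the file is.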

Assign every word in the corpus a unique integer id:

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

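To infer the latent structure of a document, we need a representation that can be handled mathematically. One way is to represent each document as a feature vector.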

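For example:
In a document, the word splonge appears 0 times, there are 2 paragraphs, and 5 fonts are used.
Each feature can be seen as a question with an integer id, so (1, 0), (2, 2), (3, 5) is a feature vector for this document.
If all the questions are known in advance, they can be left implicit and the vector shortened to (0, 2, 5).
In practice, vectors usually contain many zeros. To save space, gensim omits them, so the vector above can also be written as (2, 2), (3, 5). This is a sparse vector, also called a bag-of-words vector.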
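
The dense and sparse forms above carry the same information, which a short sketch can show. The helper names `to_sparse` and `to_dense` are made up for illustration, and ids start at 1 only to match the example in the text (gensim's own ids start at 0).

```python
def to_sparse(dense, first_id=1):
    """Convert a dense feature vector to (id, value) pairs, omitting zeros."""
    return [(i, v) for i, v in enumerate(dense, start=first_id) if v != 0]

def to_dense(sparse, length, first_id=1):
    """Rebuild the dense vector; ids missing from `sparse` become 0."""
    dense = [0] * length
    for i, v in sparse:
        dense[i - first_id] = v
    return dense

sparse = to_sparse([0, 2, 5])
print(sparse)               # [(2, 2), (3, 5)]
print(to_dense(sparse, 3))  # [0, 2, 5]
```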

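As long as the questions are the same, two documents can be compared by comparing their feature vectors.
For example, if two documents have the feature vectors (0.0, 2.0, 5.0) and (0.1, 1.9, 4.9), we can conclude that the two documents are similar.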
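
The text does not say which measure to use for the comparison. Cosine similarity is one common choice, and the sketch below assumes it; `cosine_similarity` is an illustrative helper, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)

score = cosine_similarity((0.0, 2.0, 5.0), (0.1, 1.9, 4.9))
print(score)  # very close to 1, i.e. the vectors point in nearly the same direction
```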

Of course, the judgment is only as good as the questions: the features have to be chosen well in the first place.

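In the bag-of-words model, each document is represented by a vector holding the frequency of every dictionary word in that document.
For example, if the dictionary is ['coffee', 'milk', 'sugar', 'spoon']
and the document is "coffee milk coffee", then its vector is
[2, 1, 0, 0]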
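
The coffee example can be reproduced in a few lines of plain Python. `bow_vector` is a hypothetical helper written for this post, not part of gensim.

```python
from collections import Counter

def bow_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

vocabulary = ["coffee", "milk", "sugar", "spoon"]
vector = bow_vector("coffee milk coffee", vocabulary)
print(vector)  # [2, 1, 0, 0]

# Only counts are kept, so reordering the words gives the same vector.
print(bow_vector("milk coffee coffee", vocabulary) == vector)  # True
```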

A drawback of the bag-of-words model is that it ignores the order of the words.

Inspect the dictionary:

print(dictionary.token2id)

{'computer': 0, 'trees': 9, 'interface': 2, 'eps': 8, 'user': 7, 'time': 6, 'system': 5, 'survey': 4, 'human': 1, 'minors': 11, 'response': 3, 'graph': 10}

Use the dictionary to vectorize a new sentence:

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

The result is:

[(0, 1), (1, 1)]

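Id 0 is computer and id 1 is human; each of them appears once.
interaction does not appear in the dictionary, so its count is implicitly 0 and it is left out of the vector.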
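
To see why unknown words vanish, here is a rough pure-Python imitation of what `doc2bow` does. It is a sketch under the assumption that the mapping is the one printed above; `simple_doc2bow` is a made-up name, and the real method has more options.

```python
from collections import Counter

# Token-to-id mapping copied from the dictionary printed above.
token2id = {
    'computer': 0, 'trees': 9, 'interface': 2, 'eps': 8, 'user': 7, 'time': 6,
    'system': 5, 'survey': 4, 'human': 1, 'minors': 11, 'response': 3, 'graph': 10,
}

def simple_doc2bow(tokens, token2id):
    """Count known tokens by id; tokens missing from the mapping are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

new_vec = simple_doc2bow("Human computer interaction".lower().split(), token2id)
print(new_vec)  # [(0, 1), (1, 1)]
```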

Convert the whole corpus into a list of vectors:

bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
print(bow_corpus)

Output:

[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]

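In gensim, a model can be understood as a transformation between two vector spaces.
During training, the model learns the parameters of this transformation from the corpus.
A typical example is tf-idf. The tf-idf model transforms bag-of-words vectors into another kind of vector, in which the frequency counts are replaced by weights that reflect how rare each word is in the corpus.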

# Train the model
tfidf = models.TfidfModel(bow_corpus)
# transform the "system minors" string
words = "system minors".lower().split()
print(dictionary.doc2bow(words))
print(tfidf[dictionary.doc2bow(words)])

Output:

[(5, 1), (11, 1)]
[(5, 0.5898341626740045), (11, 0.8075244024440723)]

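In each pair, the first entry is the word id and the second is its tf-idf weight.
In the original corpus, system appears 4 times and minors appears 2 times. The more often a word appears in the corpus, the lower its weight.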
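
The two weights can be reproduced by hand. The sketch below assumes the weight of a word is tf × log2(N / df), followed by L2 normalization of the vector, which to my understanding is what `TfidfModel` does with its default settings; N is the number of documents and df is the number of documents that contain the word (3 for system, 2 for minors). The function names are made up for illustration.

```python
import math

# Bag-of-words corpus copied from the output shown earlier.
bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]

def document_frequencies(corpus):
    """Map each word id to the number of documents that contain it."""
    df = {}
    for doc in corpus:
        for term_id, _count in doc:
            df[term_id] = df.get(term_id, 0) + 1
    return df

def tfidf_vector(bow, df, num_docs):
    """tf * log2(N / df) per word, then scale the vector to unit length."""
    weights = [(t, tf * math.log2(num_docs / df[t])) for t, tf in bow if t in df]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    return [(t, w / norm) for t, w in weights]

df = document_frequencies(bow_corpus)
weights = tfidf_vector([(5, 1), (11, 1)], df, len(bow_corpus))
print(weights)  # approximately [(5, 0.5898), (11, 0.8075)]
```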

Compare a given document against every document in the corpus:

index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)
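# Query "system engineering": "engineering" is not in the dictionary, so only "system" contributes
query_bow = dictionary.doc2bow("system engineering".lower().split())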
sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))

The output is the similarity between the query and each document:

[(0, 0.0), (1, 0.32448703), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]

Print the 5 most similar documents:

ranked = sorted(enumerate(sims), key=lambda x: x[1], reverse=True)
for document_number, score in ranked[:5]:
    print(document_number, score)
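
When the corpus is large, sorting every score just to keep five of them is wasteful. A possible alternative, sketched here with the standard library's `heapq.nlargest`, picks the top entries directly; the `sims` list is copied from the similarity output above so that the snippet runs on its own.

```python
import heapq

# Similarity scores copied from the output above (index = document number).
sims = [0.0, 0.32448703, 0.41707572, 0.7184812, 0.0, 0.0, 0.0, 0.0, 0.0]

# Keep only the 5 best (document_number, score) pairs, best first.
top5 = heapq.nlargest(5, enumerate(sims), key=lambda pair: pair[1])
for document_number, score in top5:
    print(document_number, score)
```

The first three lines printed are documents 3, 2 and 1; the remaining two slots are filled by documents with a score of 0.0.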

Train a model. The raw text is a corpus that has already been segmented with jieba:

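import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence reads a single file that is already segmented, with words separated by spaces
# PathLineSentences scans every file in a directory
# Note: gensim 4.0 and later rename `size` to `vector_size`
model = Word2Vec(LineSentence('jieba_zhu1'), size=400, window=5, min_count=5, workers=multiprocessing.cpu_count())
model.save('model/zhu.model')
model.wv.save_word2vec_format('zhu.vector', binary=False)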
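
The file layout that `LineSentence` expects (one sentence per line, words separated by spaces) is easy to produce yourself. The sketch below is plain Python and does not call gensim or jieba: the `segmented` list is made-up sample data standing in for a segmenter's output, and `read_line_sentences` only imitates the way such a file is read back line by line.

```python
import os
import tempfile

# Hypothetical, already-segmented sentences (e.g. the output of a word segmenter).
segmented = [
    ["我", "喜欢", "自然", "语言", "处理"],
    ["gensim", "训练", "词", "向量"],
]

def write_line_sentences(sentences, path):
    """Write one sentence per line, tokens joined by single spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in sentences:
            f.write(" ".join(tokens) + "\n")

def read_line_sentences(path):
    """Lazily yield each non-empty line as a list of tokens."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:
                yield tokens

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    write_line_sentences(segmented, path)
    restored = list(read_line_sentences(path))

print(restored == segmented)  # True
```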

Load the model:

en_wiki_word2vec_model = Word2Vec.load('model/zhu.model')

Compute similarity and get the 50 most similar words:

res = en_wiki_word2vec_model.wv.most_similar(testwords[i], topn=50)

The word-vector file can also be loaded directly:

from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('zhu.vector', binary=False)
print(word_vectors['xx'])

Get the similarity of two given words; the result here is 0.85998183:

print(en_wiki_word2vec_model.wv.similarity(testwords[0],testwords[1]))

Word2Vec parameters

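•size (int) - dimensionality of the feature vectors (renamed vector_size in gensim 4.0 and later).
•sg (int {1, 0}) - the training algorithm. If 1, skip-gram is used; otherwise, CBOW.
•window (int) - maximum distance between the current word and the predicted word within a sentence.
•min_alpha (float) - the learning rate drops linearly to min_alpha as training progresses.
•min_count (int) - ignore all words whose total frequency is lower than this value.
•max_vocab_size (int) - limits RAM while building the vocabulary; if there are more unique words than this, the infrequent ones are pruned. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
•workers (int) - number of worker threads.
•sample (float) - threshold that controls which higher-frequency words are randomly downsampled; the useful range is (0, 1e-5).
•cbow_mean (int {1, 0}) - if 0, use the sum of the context word vectors; if 1, use the mean. Only applies when CBOW is used.
•negative (int) - if > 0, negative sampling is used, and the value specifies how many "noise words" are drawn (usually between 5 and 20). If set to 0, negative sampling is not used.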
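
To get a feel for what window means when sg=1, the sketch below lists the (current word, context word) pairs that a fixed window would produce for one sentence. It is a simplification written for illustration, not gensim's code: real word2vec implementations typically also shrink the window at random and apply downsampling, which is ignored here. `skipgram_pairs` is a hypothetical helper.

```python
def skipgram_pairs(tokens, window):
    """List (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["graph", "minors", "survey"], window=1)
print(pairs)
# [('graph', 'minors'), ('minors', 'graph'), ('minors', 'survey'), ('survey', 'minors')]
```

A larger window produces more pairs per word, so each word is trained against a broader context.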