Gensim
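Gensim is an open-source library for unsupervised statistical modeling and natural language processing, implemented in Python and Cython.
Implementing Word2Vec with the Gensim library
Word2Vec is regarded as one of the biggest and most recent breakthroughs in natural language processing (NLP). The concept is simple, elegant, and (relatively) easy to grasp. A quick Google search turns up plenty of results on how to call Word2Vec through libraries such as Gensim and TensorFlow.
The goal of Word2Vec is to produce vector representations of words that carry semantic meaning, for use in downstream NLP tasks. Each word vector typically has a few hundred dimensions, and every unique word in the corpus is assigned one vector in that space. For example, the word "happy" might be represented as the 4-dimensional vector [0.24, 0.45, 0.11, 0.49], and "sad" as the vector [0.88, 0.78, 0.45, 0.91]. This mapping from words to vectors is also called word embedding. The reason for the conversion is that machine learning algorithms can perform linear algebra on numbers (the components of the vectors) but not on words.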
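To make the "linear algebra on vectors" point concrete, here is a minimal sketch that compares the two toy vectors from the paragraph above using cosine similarity, which is the measure Gensim's `model.wv.similarity` reports. The helper name `cosine_similarity` is only an illustration, and the vectors are the made-up 4-dimensional examples, not the output of a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors copied from the text; a real model would use ~100+ dimensions.
happy = [0.24, 0.45, 0.11, 0.49]
sad = [0.88, 0.78, 0.45, 0.91]

print(cosine_similarity(happy, sad))    # somewhere between 0 and 1 for these vectors
print(cosine_similarity(happy, happy))  # a vector compared with itself gives ~1.0
```

Because the toy numbers were chosen arbitrarily, the score for "happy" versus "sad" says nothing about the real words; only vectors learned from a corpus carry meaning.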
First, decompress the data and read it into a list
import gzip
import logging

import gensim

# Configure logging
logging.basicConfig(format="", level=logging.INFO)

# Open the compressed data and peek at the first line
data_file = "reviews_data.txt.gz"
with gzip.open(data_file, 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break

# -------------- Next, turn the raw data into input for gensim --------------

# Read the contents of the gzip file, one review per line
def read_input(input_file):
    logging.info("reading file {0}...this may take a while".format(input_file))
    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):
            if i % 10000 == 0:
                logging.info("read {0} reviews".format(i))
            # Preprocess: each review is returned as a list of words
            yield gensim.utils.simple_preprocess(line)

# Materialize the generator into a list of token lists
documents = list(read_input(data_file))
logging.info("Done reading data file")
print(documents)
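To show what the reading step produces without needing the real dataset, the sketch below writes a two-line gzip file to a temporary directory and reads it back as lists of tokens. `rough_preprocess` is a hypothetical stand-in, not Gensim's implementation: it only approximates `gensim.utils.simple_preprocess` (lowercasing, keeping alphabetic tokens, and dropping tokens outside an assumed length range of 2 to 15 characters). The file name and sample sentences are invented for the example.

```python
import gzip
import os
import re
import tempfile


def rough_preprocess(line, min_len=2, max_len=15):
    """Hypothetical approximation of gensim.utils.simple_preprocess."""
    if isinstance(line, bytes):
        line = line.decode("utf-8", errors="ignore")
    # Keep runs of letters only; digits, punctuation and underscores split tokens.
    tokens = re.findall(r"[^\W\d_]+", line.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


def read_reviews(path):
    """Yield one token list per line of a gzip-compressed text file."""
    with gzip.open(path, "rb") as f:
        for line in f:
            yield rough_preprocess(line)


with tempfile.TemporaryDirectory() as tmp:
    # Invented sample data standing in for reviews_data.txt.gz
    sample_path = os.path.join(tmp, "sample_reviews.txt.gz")
    with gzip.open(sample_path, "wb") as f:
        f.write(b"The room was dirty, and the bed was NOT comfortable!\n")
        f.write(b"Great staff; very polite. Stayed 3 nights.\n")
    documents = list(read_reviews(sample_path))

for doc in documents:
    print(doc)
```

The result has the same shape as `documents` in the tutorial: a list with one inner list of lowercase words per review, which is the input format `Word2Vec` expects.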
Train the model
# Continues from the reading step above: `documents` is the list built by read_input

# -------------- Train our model --------------
# gensim 4.x calls the dimensionality parameter vector_size;
# releases before 4.0 called it size (as in the original version of this code)
model = gensim.models.Word2Vec(documents, vector_size=150, window=10, min_count=2, workers=10)
# The constructor call above already trains the model on its own.
# The call below sets training parameters explicitly, such as epochs:
# 10 here, whereas the default is 5.
model.train(documents, total_examples=len(documents), epochs=10)
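The `min_count` and `window` arguments are easier to picture on a tiny example. The sketch below is a simplified illustration, not Gensim's training code: it assumes that words seen fewer than `min_count` times are dropped from the vocabulary, and that each remaining word is paired with its neighbours up to `window` positions away. The helper names and the toy documents are invented, and Gensim's real training loop differs in detail.

```python
from collections import Counter


def build_vocab(documents, min_count):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(word for doc in documents for word in doc)
    return {word for word, count in counts.items() if count >= min_count}


def context_pairs(doc, vocab, window):
    """List (center, context) pairs within `window` positions of each other."""
    kept = [word for word in doc if word in vocab]
    pairs = []
    for i, center in enumerate(kept):
        lo = max(0, i - window)
        hi = min(len(kept), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, kept[j]))
    return pairs


# Invented toy corpus
docs = [
    ["clean", "room", "clean", "bed"],
    ["dirty", "room", "dirty", "bed"],
    ["friendly", "staff"],
]

vocab = build_vocab(docs, min_count=2)
print(sorted(vocab))  # "friendly" and "staff" occur once, so they are dropped
print(context_pairs(docs[0], vocab, window=1))
```

A larger `window` pairs each word with more distant neighbours, and a higher `min_count` removes more rare words before any pairs are formed.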
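What can we do with the trained model?
Given a word from the training vocabulary, find the words in the corpus that are closest to it (a word the model has never seen has no vector, so it cannot be looked up)
Compute the similarity between two words
Pick out, from a group of words, the one whose meaning differs most from the others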
Find the 6 words most similar to "polite"
Find the 6 words most similar to "france"
Find the 6 words most similar to "shocked"
Find words related to bedding
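The bedding query combines several positive words with a negative one. As a rough sketch of how such a query can be answered, the code below adds the unit-length vectors of the positive words, subtracts those of the negative words, and ranks every other word by cosine similarity to the result. This mirrors my understanding of what `most_similar` does and should be read as an assumption, not as Gensim's source. The 2-dimensional vectors are invented toy values.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the query words themselves
    scores = [
        (word, sum(a * b for a, b in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Invented toy vectors: first axis ~ "bedding", second axis ~ "furniture"
toy_vectors = {
    "bed": [0.9, 0.5],
    "sheet": [1.0, 0.1],
    "pillow": [0.9, 0.2],
    "couch": [0.2, 1.0],
    "duvet": [1.0, 0.0],
    "blanket": [0.8, 0.3],
    "table": [0.0, 1.0],
    "chair": [0.1, 0.9],
}

result = most_similar(toy_vectors, ["bed", "sheet", "pillow"], ["couch"], topn=3)
for word, score in result:
    print(word, round(score, 3))
```

Subtracting "couch" pushes the query away from the furniture direction, so bedding words such as "duvet" rank ahead of "chair" and "table" in this toy setup.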
Compute the similarity between two words
Among several words, find the one whose meaning differs most from the others, i.e. the odd one out
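One way to find the odd one out, sketched here under the assumption that it resembles what `doesnt_match` does internally, is to average the unit-length vectors of all candidate words and return the word whose vector is least similar to that average. The helper names and the 2-dimensional toy vectors are invented for illustration.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors, words):
    """Return the word whose vector is farthest from the group's mean direction."""
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[k] for u in units) / len(units) for k in range(dim)])
    scores = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]


# Invented toy vectors: first axis ~ "animal", second axis ~ "place"
toy_vectors = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "france": [0.1, 0.9],
}

print(odd_one_out(toy_vectors, ["cat", "dog", "france"]))  # france
```

With only three words the outlier also pulls the average toward itself, so the method works best when the related words clearly outnumber the odd one.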
Full program
import gzip
import logging

import gensim

# Configure logging
logging.basicConfig(format="", level=logging.INFO)

# Open the compressed data and peek at the first line
data_file = "reviews_data.txt.gz"
with gzip.open(data_file, 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break

# -------------- Next, turn the raw data into input for gensim --------------

# Read the contents of the gzip file, one review per line
def read_input(input_file):
    logging.info("reading file {0}...this may take a while".format(input_file))
    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):
            if i % 10000 == 0:
                logging.info("read {0} reviews".format(i))
            # Preprocess: each review is returned as a list of words
            yield gensim.utils.simple_preprocess(line)

documents = list(read_input(data_file))
logging.info("Done reading data file")
# print(documents)

# -------------- Train our model --------------
# gensim 4.x calls the dimensionality parameter vector_size;
# releases before 4.0 called it size (as in the original version of this code)
model = gensim.models.Word2Vec(documents, vector_size=150, window=10, min_count=2, workers=10)
# The constructor call above already trains the model on its own.
# The call below sets training parameters explicitly, such as epochs:
# 10 here, whereas the default is 5.
model.train(documents, total_examples=len(documents), epochs=10)

# -------------- Check our results --------------
w1 = "dirty"
print(model.wv.most_similar(positive=w1))

# look up top 6 words similar to 'polite'
w1 = ["polite"]
print(model.wv.most_similar(positive=w1, topn=6))

# look up top 6 words similar to 'france'
w1 = ["france"]
print(model.wv.most_similar(positive=w1, topn=6))

# look up top 6 words similar to 'shocked'
w1 = ["shocked"]
print(model.wv.most_similar(positive=w1, topn=6))

# get everything related to stuff on the bed
w1 = ["bed", 'sheet', 'pillow']
w2 = ['couch']
print(model.wv.most_similar(positive=w1, negative=w2, topn=10))

# similarity between two different words
print(model.wv.similarity(w1="dirty", w2="smelly"))

# similarity between two identical words
print(model.wv.similarity(w1="dirty", w2="dirty"))

# similarity between two unrelated words
print(model.wv.similarity(w1="dirty", w2="clean"))

# Find the odd one out
# Which one is the odd one out in this list?
print(model.wv.doesnt_match(["cat", "dog", "france"]))

# Which one is the odd one out in this list?
print(model.wv.doesnt_match(["bed", "pillow", "duvet", "shower"]))