Word vectors: loading word2vec and GloVe

1 Google used word2vec to pretrain 300-dimensional word vectors on a news corpus. The file, GoogleNews-vectors-negative300.bin, is 3.39 GB after decompression.


It can be loaded with gensim, but the machine needs enough memory to hold the whole model.

# Load the word vectors trained by Google
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)
print(model['love'])
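
If memory is tight, one option is to load only part of the vocabulary. The sketch below first estimates how much memory the raw float matrix needs, then defines a hypothetical helper, load_limited, that passes gensim's `limit` argument to read only the first N vectors. The figure of 3,000,000 entries is the vocabulary size published for the Google News model; the 500,000 cutoff and both function names are illustrative assumptions, and the estimate ignores the memory used by the vocabulary itself.

```python
def estimate_vector_bytes(vocab_size, dim, bytes_per_float=4):
    """Rough size of the float32 matrix alone, ignoring the vocabulary."""
    return vocab_size * dim * bytes_per_float


def load_limited(path, limit=500000):
    """Hypothetical helper: load only the first `limit` vectors from a binary file."""
    # Imported lazily so the estimate above can run without gensim installed.
    import gensim
    return gensim.models.KeyedVectors.load_word2vec_format(
        path, binary=True, limit=limit)


full = estimate_vector_bytes(3000000, 300)    # whole Google News vocabulary
partial = estimate_vector_bytes(500000, 300)  # assumed cutoff of 500k entries
print("full: {:.2f} GiB, first 500k: {:.2f} GiB".format(
    full / 2**30, partial / 2**30))
```

Calling load_limited('GoogleNews-vectors-negative300.bin') would then return a smaller KeyedVectors object. Words in this format are usually stored in descending frequency order, so the entries that are dropped tend to be the rare ones, but that ordering is worth verifying for any given file.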


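2 Word vectors pretrained with GloVe can also be loaded with gensim, but one extra step is needed before loading. Reference code follows.

The 300-dimensional GloVe vectors are 5.25 GB.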

# To open GloVe vectors with gensim, a header line must be added at the top of the file:
# <number of words> <vector dimension>
import gensim
import shutil
from sys import platform
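# Count the lines, which equals the number of words
def getFileLineNums(filename):
	# GloVe files contain non-ASCII tokens, so read them explicitly as UTF-8
	with open(filename, 'r', encoding='utf-8') as f:
		count = 0
		for line in f:
			count += 1
	return count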

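# On Linux or Windows, open the word-vector file and add a line at the start
def prepend_line(infile, outfile, line):
	with open(infile, 'r', encoding='utf-8') as old:
		with open(outfile, 'w', encoding='utf-8') as new:
			new.write(str(line) + "\n")
			shutil.copyfileobj(old, new)

# Same result as prepend_line, but copies the file one line at a time
def prepend_slow(infile, outfile, line):
	with open(infile, 'r', encoding='utf-8') as fin:
		with open(outfile, 'w', encoding='utf-8') as fout:
			fout.write(line + "\n")
			for row in fin:
				fout.write(row)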

def load(filename):
	num_lines = getFileLineNums(filename)
	gensim_file = 'glove_model.txt'
	# Header line: "<number of words> <vector dimension>"
	gensim_first_line = "{} {}".format(num_lines, 300)
	# Prepends the line.
	if platform.startswith("linux"):
		prepend_line(filename, gensim_file, gensim_first_line)
	else:
		prepend_slow(filename, gensim_file, gensim_first_line)

	# Return the model so the caller can actually use it
	return gensim.models.KeyedVectors.load_word2vec_format(gensim_file)

model = load('glove.840B.300d.txt')
The generated glove_model.txt is a model file that gensim can open directly.
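
The load function above hard-codes the dimension as 300, so it only fits the 300-dimensional files. As a minimal sketch, the header values can instead be derived from the file in a single pass. The helper name glove_header and the tiny sample file are hypothetical, and the code assumes the token on the first line contains no spaces.

```python
import os
import tempfile


def glove_header(path, encoding="utf-8"):
    """Return (vocab_size, dim) by scanning a GloVe text file once."""
    vocab_size = 0
    dim = None
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            if dim is None:
                # First line: one token followed by `dim` floats.
                dim = len(line.rstrip("\n").split(" ")) - 1
            vocab_size += 1
    return vocab_size, dim


# Tiny made-up file standing in for glove.840B.300d.txt
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "tiny_glove.txt")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("love 0.1 0.2 0.3\n")
        f.write("hate -0.1 -0.2 -0.3\n")
    header = "{} {}".format(*glove_header(sample))
    print(header)  # prints "2 3"
```

The resulting string can be passed to the prepend functions as the first line. As far as I know, gensim 4.0 and later also accept no_header=True in load_word2vec_format, which would skip the rewrite step entirely, but check this against the installed version before relying on it.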


