Reference: https://wenku.baidu.com/view/0029a79a376baf1ffd4fad8d.html
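I. Unregistered words:
Unregistered words are words that are not included in the segmentation dictionary but still have to be cut out as units. They include proper nouns of all kinds (names of people, places, companies, and so on), abbreviations, and newly coined words (see the Baidu Baike definition).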
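- Chinese input methods pick up new words quickly. The well-known input methods in China, such as Sogou Input Method, Baidu Input Method, and QQ Input Method, all publish lexicons on their official sites. Sogou Input Method, for example, offers dedicated "cell lexicons" (細胞詞庫) on its site (https://pinyin.sogou.com/dict/).
- Each case has to be handled on its own terms: the lexicon still needs to be supplemented manually and continuously, otherwise these would not be called unregistered words. The way stop-word lists are built is a reasonable model for how to supplement it.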
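To make the effect of an unregistered word concrete, here is a minimal sketch using greedy forward maximum matching over a toy dictionary. This is not how LTP segments text (LTP uses a trained model); the function name `forward_max_match` and the toy vocabulary are hypothetical and chosen only for illustration. It shows that a word missing from the dictionary falls apart into single characters, and that adding it to the dictionary fixes the cut.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching.

    At each position take the longest piece found in vocab;
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy vocabulary (hypothetical); "性價比" is deliberately missing at first.
vocab = {"筆記本", "的", "不", "高"}
before = forward_max_match("筆記本的性價比不高", vocab)

# After the new word is added to the dictionary it is cut out as one token.
vocab.add("性價比")
after = forward_max_match("筆記本的性價比不高", vocab)

print(before)
print(after)
```

For LTP itself, pyltp 0.2.x documents `Segmentor.load_with_lexicon(model_path, lexicon_path)` for loading an external dictionary alongside the model; whether that call is available depends on the installed pyltp version.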
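# Simulated scenario: user comments on laptop xx
import os
from math import log

from pyltp import Segmentor


def word_cut(sentences):
    """Segment a piece of text into words with the LTP segmentation model."""
    abs_path = "/Users/hqh/nlp/3.4.0/ltp_data_v3.4.0"  # directory holding the LTP models
    cws_path = os.path.join(abs_path, 'cws.model')
    seg = Segmentor()  # create the segmenter instance
    seg.load(cws_path)
    words = list(seg.segment(sentences))  # copy into a plain list before releasing the model
    seg.release()
    return words


# Each comment is a key; the value is the sentiment toward the laptop:
# 1 for positive, 0 for negative.
comments = {
    "這檯筆記本真漂亮": 1,
    "我覺得這檯筆記本在性能上很突出": 1,
    "這檯筆記本各方面都還可以": 1,
    "筆記本的散熱一般,筆記本的性價比不高": 0,
}

# TF-IDF calculation
word_all = {}       # total occurrences of each word across all comments
word_document = {}  # number of comments that contain each word
for key in comments:
    words = word_cut(key)
    for tmp in words:
        # start the counter at 0 for a new word, then add 1
        word_all[tmp] = word_all.get(tmp, 0) + 1
    for tmp in set(words):
        word_document[tmp] = word_document.get(tmp, 0) + 1

total = sum(word_all.values())  # total number of tokens in the corpus
length = len(comments)          # number of comments (documents)
for k in word_all:
    # Note: tf here is the word's share of all tokens in the corpus,
    # not a per-document frequency.
    tf = word_all[k] / total
    idf = log(length / word_document[k])
    tf_idf = tf * idf
    print(k + '---->' + str(tf) + '----->' + str(idf) + '---->' + str(tf_idf))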
Output:
這---->0.09090909090909091----->0.28768207245178085---->0.026152915677434625
臺---->0.09090909090909091----->0.28768207245178085---->0.026152915677434625
筆記本---->0.15151515151515152----->0.0---->0.0
真---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
漂亮---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
我---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
覺得---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
在---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
性能---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
上---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
很---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
突出---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
各---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
方面---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
都---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
還---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
可以---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
的---->0.06060606060606061----->1.3862943611198906---->0.08401784006787216
散熱---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
一般---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
,---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
性價比---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
不---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
高---->0.030303030303030304----->1.3862943611198906---->0.04200892003393608
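The numbers above can be checked without the LTP model. The following is a minimal sketch that assumes the segmenter returned the token lists implied by the printed output (they are hand-copied here, not produced by pyltp); the helper name `corpus_tf_idf` is hypothetical. It uses the same definitions as the script: tf is a word's share of all tokens in the corpus, and idf is log(N / df), where N is the number of comments and df is the number of comments containing the word.

```python
from collections import Counter
from math import log


def corpus_tf_idf(docs):
    """Return {word: (tf, idf, tf_idf)} for a list of token lists.

    tf  = occurrences of the word / total tokens in the whole corpus
    idf = log(number of documents / documents containing the word)
    """
    term_counts = Counter(tok for doc in docs for tok in doc)
    doc_counts = Counter(tok for doc in docs for tok in set(doc))
    total = sum(term_counts.values())
    n_docs = len(docs)
    scores = {}
    for word, count in term_counts.items():
        tf = count / total
        idf = log(n_docs / doc_counts[word])
        scores[word] = (tf, idf, tf * idf)
    return scores


# Token lists assumed from the printed output (not produced by pyltp here).
docs = [
    ["這", "臺", "筆記本", "真", "漂亮"],
    ["我", "覺得", "這", "臺", "筆記本", "在", "性能", "上", "很", "突出"],
    ["這", "臺", "筆記本", "各", "方面", "都", "還", "可以"],
    ["筆記本", "的", "散熱", "一般", ",", "筆記本", "的", "性價比", "不", "高"],
]

scores = corpus_tf_idf(docs)
# Words ranked by tf-idf score, highest first.
ranked = sorted(scores, key=lambda w: scores[w][2], reverse=True)
print(ranked[0], scores[ranked[0]])
print("筆記本", scores["筆記本"])
```

With these definitions 筆記本 scores 0 because it occurs in every comment, while the function word 的 ranks first. That suggests filtering stop words and punctuation before scoring, which ties back to the stop-word list mentioned earlier.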