Natural Language Processing and Modeling

Original

Marina-ju

2019-07-06 10:58

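Text preprocessing pipeline:

Introduction to and use of Python's NLTK library
NLTK:

Official site: http://www.nltk.org/
A well-known natural language processing library for Python, with the following advantages:
Ships with corpora and part-of-speech tagging resources
Built-in classification, tokenization, and similar functions
Strong community support
Many simple wrappers built on top of it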

Stemming with NLTK

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer
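The three stemmers imported above can be compared side by side. The following is a minimal sketch, assuming NLTK is installed; the sample words and the names `stemmers`, `words`, and `stems` are chosen here for illustration and are not from the original post.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

# One instance of each stemmer; Snowball needs a language name.
stemmers = {
    'porter': PorterStemmer(),
    'lancaster': LancasterStemmer(),
    'snowball': SnowballStemmer('english'),
}

# Illustrative sample words (an assumption, not from the post).
words = ['running', 'maximum', 'presumably']

# Stem every word with every stemmer so the results can be compared.
stems = {name: [s.stem(w) for w in words] for name, s in stemmers.items()}

for name, result in stems.items():
    print(name, result)
```

The stemmers work by rule-based suffix stripping, so they need no corpus download, and their outputs can differ for the same word; Lancaster is generally the most aggressive of the three.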

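Lemmatization with NLTK: reducing a word to its base (dictionary) form
# requires the WordNet corpus: nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
wordnet_lemmatizer.lemmatize('dogs')  # returns 'dog'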

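Better lemmatization
Without a POS tag, the lemmatizer defaults to NN (noun)
wordnet_lemmatizer.lemmatize('are')  # treated as a noun, returns 'are'
wordnet_lemmatizer.lemmatize('is', pos='v')  # treated as a verb, returns 'be'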
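Because the lemmatizer falls back to noun when no POS is given, tags from a tagger usually have to be converted into the single-letter codes that `lemmatize` accepts for `pos` ('n', 'v', 'a', 'r'). The sketch below assumes the tags follow the Penn Treebank convention (as produced by `nltk.pos_tag`); `penn_to_wordnet` is a hypothetical helper name, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code.

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    if tag.startswith('J'):
        return 'a'  # adjective (JJ, JJR, JJS)
    if tag.startswith('V'):
        return 'v'  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith('R'):
        return 'r'  # adverb (RB, RBR, RBS)
    # Everything else is treated as a noun, matching the lemmatizer's default.
    return 'n'


print(penn_to_wordnet('VBZ'))
print(penn_to_wordnet('NNS'))
```

With this mapping, each (word, tag) pair from a tagger could be passed on as `wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.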

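TF-IDF
TF: Term Frequency, measures how frequently a term occurs in a document.
TF(t) = (number of times t appears in the document) / (total number of terms in the document)
IDF: Inverse Document Frequency, measures how important a term is.
Some words occur very often but carry little information, such as "is", "and", and "the". To compensate, we raise the weight of rare words and lower the weight of common ones.
IDF(t) = ln(total number of documents / number of documents containing t)
TF-IDF = TF * IDF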
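The formulas above can be computed directly in plain Python. This is a minimal sketch, assuming documents are already tokenized into lists of words; the function names, the toy corpus, and the choice to return 0.0 for a term that appears in no document are assumptions made here for illustration.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: occurrences of term / total terms in the document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """Inverse document frequency: ln(total documents / documents containing term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        # Assumption: a term absent from the corpus gets zero weight.
        return 0.0
    return math.log(len(corpus) / n_containing)


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF = TF * IDF."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Toy corpus (illustrative only): three tokenized documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

print(tf_idf("the", corpus[0], corpus))  # "the" is in every document, so IDF is 0
print(tf_idf("mat", corpus[0], corpus))  # "mat" is rare, so it gets a positive weight
```

A word that appears in every document has IDF = ln(1) = 0, so its TF-IDF is zero no matter how often it occurs, which is the down-weighting of common words described above.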
