NLP Basics 1

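1. Differences between Chinese and English tokenization:

English: words are already separated by spaces, so splitting on whitespace and punctuation does most of the work.

Chinese: there are no spaces between words, so the text has to be segmented, either heuristically (Heuristic, for example dictionary matching) or with machine learning / statistical learning models (HMM, CRF).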

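2. Tokenizing social network text

1) Use regular expressions to group special symbols (such as emoticons, @mentions, hashtags and URLs) into patterns

2) Compile the combined pattern with re.compile

3) Define a custom tokenize function that returns tokens_re.findall(s)

4) Define a custom preprocess function that calls tokenize and also handles upper/lower case conversion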
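The four steps above can be sketched as follows. This is a minimal illustration, not the exact patterns the notes had in mind: the regular expressions, the `PATTERNS` list and the `lowercase` parameter are assumptions chosen for the example.

```python
import re

# Hypothetical patterns. Order matters: the regex engine tries alternatives left to right.
EMOTICON = r"(?:[:;=][-^]?[)(DPp])"      # simple emoticons such as :) ;-) =D
PATTERNS = [
    EMOTICON,
    r"(?:@\w+)",                         # @mentions
    r"(?:#\w+)",                         # hashtags
    r"(?:https?://\S+)",                 # URLs
    r"(?:\w+(?:['-]\w+)*)",              # words, allowing an inner ' or -
    r"(?:\S)",                           # any other single non-space character
]

# Steps 1 and 2: merge the patterns and compile them once.
tokens_re = re.compile("|".join(PATTERNS))
emoticon_re = re.compile("^" + EMOTICON + "$")

# Step 3: custom tokenizer.
def tokenize(s):
    return tokens_re.findall(s)

# Step 4: preprocessing with optional lowercasing (emoticons keep their case).
def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [t if emoticon_re.match(t) else t.lower() for t in tokens]
    return tokens

tweet = "RT @angelababy: love you baby! :D http://ah.love #168cm"
print(preprocess(tweet))
# ['RT', '@angelababy', ':', 'love', 'you', 'baby', '!', ':D', 'http://ah.love', '#168cm']
```

Emoticons are exempted from lowercasing because ":D" and ":d" are not the same symbol.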

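3. Word-form changes in English

Inflection: does not change the part of speech (e.g. walk, walking, walked)

Derivation: changes the part of speech (e.g. nation, national, nationalize)

Word-form normalization:

Stemming: extract the stem by chopping off the small suffixes that do not affect the part of speech

Lemmatization: map all the variants of a word to a single form (e.g. using WordNet)

Lemma: for example, plural forms are normalized to the singular

POS Tag: part-of-speech tagging (a lemmatizer can use the tag to pick the right base form)

Stopwords: words such as he and the appear everywhere and are too ambiguous to carry useful meaning, so they are removed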
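To make the difference between stemming and lemmatization concrete, here is a toy sketch using only the standard library. `toy_stem`, `toy_lemmatize`, `SUFFIXES` and `LEMMA_TABLE` are hypothetical names invented for illustration. Real implementations such as NLTK's `PorterStemmer` and `WordNetLemmatizer` use far more elaborate rules and a full dictionary.

```python
# Toy stemmer: strips a few common suffixes by rule, with no dictionary involved.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a lookup table keyed by (word, POS tag).
LEMMA_TABLE = {
    ("went", "v"): "go",
    ("are", "v"): "be",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    # Falls back to the word itself when the (word, pos) pair is unknown.
    return LEMMA_TABLE.get((word, pos), word)

print(toy_stem("walking"), toy_stem("walked"), toy_stem("dogs"))  # walk walk dog
print(toy_stem("went"))            # went  (no rule applies to irregular forms)
print(toy_lemmatize("went", "v"))  # go
print(toy_lemmatize("went"))       # went  (treated as a noun by default)
```

The last two lines show why POS tagging sits next to lemmatization: the same surface form maps to different lemmas depending on its tag.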

4. A typical text preprocessing pipeline

Raw_text → Tokenize (→ POS Tag) → Lemma/Stemming → Stopwords → Word_List
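The pipeline can be sketched as a chain of small functions. This is a simplified stand-in under stated assumptions: the `STOPWORDS` set is a tiny hypothetical list, `normalize` is only a placeholder for a real lemmatizer or stemmer, and the POS tagging step is left out.

```python
import re

# Hypothetical, very small stopword list for the example.
STOPWORDS = {"a", "an", "the", "he", "she", "it", "is", "are", "of", "to", "and"}

def tokenize(text):
    # Tokenize: keep runs of letters and apostrophes.
    return re.findall(r"[A-Za-z']+", text)

def normalize(token):
    # Placeholder for Lemma/Stemming: lowercases and strips a plural "s".
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token

def preprocess(raw_text):
    tokens = tokenize(raw_text)                           # Raw_text -> tokens
    normalized = [normalize(t) for t in tokens]           # Lemma/Stemming
    return [t for t in normalized if t not in STOPWORDS]  # stopword removal -> Word_List

word_list = preprocess("The cats are chasing the mice of the house")
print(word_list)  # ['cat', 'chasing', 'mice', 'house']
```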

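5. Text preprocessing yields a clean word list. The next step is feature engineering, which turns the words into a numeric representation.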

Sentiment analysis

Sentiment dictionary (a scoring mechanism)

like 1, good 2, bad -2, as in the AFINN-111 word list
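A dictionary-based scorer can be as small as the sketch below. The three entries are the scores quoted above. `sentiment_dictionary` and `sentiment_score` are names made up for this example, and the real AFINN-111 list is much larger.

```python
# Mini-lexicon with the example scores; words not in it count as 0.
sentiment_dictionary = {"like": 1, "good": 2, "bad": -2}

def sentiment_score(words):
    # Sum the score of every word in the sentence.
    return sum(sentiment_dictionary.get(w.lower(), 0) for w in words)

print(sentiment_score("I like this good movie".split()))  # 3
print(sentiment_score("not bad at all".split()))          # -2
```

The second sentence is mildly positive but still scores -2, because a word-by-word sum cannot see the negation.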

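(The scoring mechanism has shortcomings, however, so sentiment analysis is usually done with machine learning. Here naive Bayes classification is used:

each sentence is converted into binary features (word present or not) and given a label, then fed into the model for training.)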
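As a rough sketch of that idea, the following implements a small naive Bayes classifier over word-presence features using only the standard library. It is a simplification: only words present in the sentence contribute to the score, probabilities are Laplace-smoothed, and the function names and training sentences are invented for illustration. A library classifier such as NLTK's `NaiveBayesClassifier` would normally be used instead.

```python
import math
from collections import Counter, defaultdict

def features(sentence):
    # Binary features: the set of words present in the sentence.
    return set(sentence.lower().split())

def train(labeled_sentences):
    label_counts = Counter()
    word_counts = defaultdict(Counter)  # label -> word -> number of sentences containing it
    vocabulary = set()
    for sentence, label in labeled_sentences:
        label_counts[label] += 1
        for word in features(sentence):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(sentence, model):
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    present = features(sentence) & vocabulary  # words never seen in training are ignored
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for word in present:
            # Laplace-smoothed estimate of P(word present | label).
            score += math.log((word_counts[label][word] + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set.
training_data = [
    ("i love you", "+"),
    ("he is my best friend", "+"),
    ("he is an enemy of mine", "-"),
    ("i hate you", "-"),
]
model = train(training_data)
print(classify("i love my friend", model))  # +
print(classify("i hate him", model))        # -
```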

Text similarity

Represent the features of a text by the frequency (Frequency) of its elements

import nltk
from nltk import FreqDist

# Note: word_tokenize needs NLTK's Punkt tokenizer data to be downloaded first.

corpus = ('you are a beautiful girl '
          'I am a handsome boy '
          'you love me')

tokens = nltk.word_tokenize(corpus)

# Count how often each token occurs
fdist = FreqDist(tokens)

# Take the 50 most common words as the standard vocabulary
stand_vec = fdist.most_common(50)
size = len(stand_vec)
# print(stand_vec, size)
# >>> [('you', 2), ('a', 2), ('are', 1), ('beautiful', 1), ('girl', 1), ('I', 1), ('am', 1),
# ('handsome', 1), ('boy', 1), ('love', 1), ('me', 1)] 11


# Record the position of each word, in order of frequency
def position_lookup(V):
    res = {}
    counter = 0
    for word in V:
        res[word[0]] = counter
        counter += 1
    return res

stand_position_dict = position_lookup(stand_vec)

sentence = 'this girl is my love girl'
# Zero vector of the same size as the standard vector; it counts how often each vocabulary word occurs
freq_vector = [0] * size

tokens = nltk.word_tokenize(sentence)
for word in tokens:
    try:
        freq_vector[stand_position_dict[word]] += 1
    except KeyError:
        # Words outside the standard vocabulary are skipped
        continue

print(freq_vector)
# >>> [0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0]
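Once two texts are expressed as frequency vectors over the same vocabulary, a similarity score can be computed from them. The notes do not say which measure is intended. Cosine similarity is assumed here as one common choice, and `cosine_similarity`, `vec_a` and `vec_b` are names introduced for this sketch.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector shares nothing with any other vector
    return dot / (norm_u * norm_v)

# Frequency vectors over the 11-word vocabulary built above.
vec_a = [0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0]  # 'this girl is my love girl'
vec_b = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # 'you love me'

print(round(cosine_similarity(vec_a, vec_b), 3))  # 0.258
```

The two sentences only share the word "love", which is why the score is low.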

Text classification (a big topic that takes time to work through)

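TF-IDF

TF (Term Frequency): measures how often a term appears in a document

TF(t) = (number of times t appears in the document) / (total number of terms in the document)

IDF (Inverse Document Frequency): measures how important a term is

IDF(t) = log_e(total number of documents / number of documents containing t)

TF-IDF = TF * IDF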
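The three formulas translate almost directly into code. The sketch below assumes each document is already a list of tokens. The function names and the three-document corpus are made up for the example, and returning 0 for a term that appears in no document is a choice made here, not part of the formula.

```python
import math

def tf(term, document):
    # (number of times the term appears) / (total number of terms in the document)
    return document.count(term) / len(document)

def idf(term, corpus):
    # log_e(total number of documents / number of documents containing the term)
    containing = sum(1 for document in corpus if term in document)
    if containing == 0:
        return 0.0  # assumption: a term seen nowhere gets weight 0
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

corpus = [
    "this is a sentence".split(),
    "this is another sentence".split(),
    "something completely different".split(),
]

print(round(tf_idf("this", corpus[0], corpus), 4))     # 0.1014
print(round(tf_idf("another", corpus[1], corpus), 4))  # 0.2747
```

"another" occurs in only one document, so it gets a higher weight than "this", which occurs in two.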
