Natural Language Processing (NLP)
Siri's pipeline: 1. listen 2. understand 3. think 4. compose a response 5. answer
- Speech recognition
- Natural language processing: semantic analysis
- Logical analysis: combine the context of the business scenario
- Natural language processing: generate natural-language text from the analysis result
- Speech synthesis
A common NLP processing pipeline:
First tokenize the training text (with stemming or lemmatization), then use the term frequency-inverse document frequency (TF-IDF) algorithm to measure each word's contribution to a given meaning. Based on each word's contribution, build a supervised learning model, then hand test samples to the model to obtain their semantic category.
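The sections below walk through each of these steps with NLTK and scikit-learn. As a preview, here is a minimal sketch of the whole pipeline on toy data (the sentences and labels are invented for illustration):
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb
# toy labeled samples (invented for illustration)
train_texts = ['the food was great', 'awful service and bad food',
               'a great movie and a great cast', 'a dull and terrible movie']
train_labels = [0, 1, 0, 1]        # 0: positive, 1: negative
cv = ft.CountVectorizer()          # tokenize + count words
tt = ft.TfidfTransformer()         # weight the counts by TF-IDF
model = nb.MultinomialNB()         # supervised learning model
model.fit(tt.fit_transform(cv.fit_transform(train_texts)), train_labels)
print(model.predict(tt.transform(cv.transform(['what a great movie']))))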
The Natural Language Toolkit: NLTK
Tokenization
Tokenization APIs:
import nltk.tokenize as tk
# split the sample text into sentences; sent_list: list of sentences
sent_list = tk.sent_tokenize(text)
# split the sample string into words; word_list: list of words
word_list = tk.word_tokenize(text)
# WordPunctTokenizer: a tokenizer object
tokenizer = tk.WordPunctTokenizer()
word_list = tokenizer.tokenize(text)
Example:
import nltk.tokenize as tk
doc = 'For support please see the REST framework discussion group, try the #restframework channel on irc.' \
      'freenode.net, search the IRC archives, or raise a question on Stack Overflow, ' \
      'making sure to include the django-rest-framework tag.' + """
Let's see how it works! We need to analyze a couple of sentences with punctuations to see it in action. Let's go.
"""
# split into sentences
sent_list = tk.sent_tokenize(doc)
for i, sent in enumerate(sent_list):
    print("%2d" % (i + 1), sent)
print('-' * 45)
# split into words
word_list = tk.word_tokenize(doc)
for i, word in enumerate(word_list):
    print("%2d" % (i + 1), word)
# split into words with WordPunctTokenizer
tokenizer = tk.WordPunctTokenizer()
word_list = tokenizer.tokenize(doc)
for i, word in enumerate(word_list):
    print("%2d" % (i + 1), word)
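The two word tokenizers differ mainly on contractions: word_tokenize keeps clitics attached (e.g. "Let's" becomes Let + 's), while WordPunctTokenizer splits at every letter/punctuation boundary (Let + ' + s), so the two word lists printed above will not be identical.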
Stemming
After tokenization, a word's part of speech and tense do not affect semantic analysis, so words can be reduced to their stems.
Stemming APIs:
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
stemmer = pt.PorterStemmer()             # Porter stemmer (lenient)
stemmer = lc.LancasterStemmer()          # Lancaster stemmer (strict)
stemmer = sb.SnowballStemmer('english')  # Snowball stemmer (in between)
r = stemmer.stem('playing')              # extract and return the stem
Example:
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
import nltk.stem.porter as pt
words = ['table', 'probably', 'wolves', 'playing', 'is',
'dog', 'the', 'beaches', 'grounded', 'dreamt',
'envision']
pt_stemmer = pt.PorterStemmer()
lc_stemmer = lc.LancasterStemmer()
sb_stemmer = sb.SnowballStemmer(language='english')
for word in words:
    pt_stem = pt_stemmer.stem(word)
    lc_stem = lc_stemmer.stem(word)
    sb_stem = sb_stemmer.stem(word)
    print('%8s, %8s, %8s, %8s' % (word, pt_stem, lc_stem, sb_stem))
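In the output, the strict Lancaster stemmer cuts the most aggressively and often yields non-words, Porter is the most conservative, and Snowball falls in between; for example, 'probably' typically becomes 'probabl' under Porter and Snowball but 'prob' under Lancaster.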
Lemmatization
Similar in purpose to stemming, but lemmatization is friendlier to further manual processing, because some stems are not real words and are harder to read. Lemmatization turns plural nouns into the singular and restores inflected verb forms to the base form.
import nltk.stem as ns
# lemmatizer
lemmatizer = ns.WordNetLemmatizer()
# lemmatize as a noun
n_lemma = lemmatizer.lemmatize(word, pos='n')
# lemmatize as a verb
v_lemma = lemmatizer.lemmatize(word, pos='v')
Example:
import nltk.stem as ns
words = ['table', 'probably', 'wolves', 'playing', 'is',
         'dog', 'the', 'beaches', 'grounded', 'dreamt',
         'envision']
lemmatizer = ns.WordNetLemmatizer()
for word in words:
    n = lemmatizer.lemmatize(word, pos='n')   # noun lemma
    v = lemmatizer.lemmatize(word, pos='v')   # verb lemma
    print('%8s, %8s, %8s' % (word, n, v))
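Note that lemmatize defaults to pos='n' when no part of speech is given. In the output above, 'wolves' lemmatizes to 'wolf' as a noun, while 'is' only becomes 'be' when lemmatized as a verb.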
Bag-of-Words Model
The meaning of a sentence depends to a large extent on how many times each word appears. So we can take every word that may appear in the sentences as a feature name, treat each sentence as one sample, and use the number of times each word occurs in the sentence as the feature value. The resulting mathematical model is called the bag-of-words model.
The brown dog is running. The black dog is in the black room. Running in the room is forbidden.
- The brown dog is running.
- The black dog is in the black room.
- Running in the room is forbidden.
the | brown | dog | is | running | black | in | room | forbidden |
---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 1 | 0 | 2 | 1 | 1 | 0 |
1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
APIs for building a bag-of-words model:
import sklearn.feature_extraction.text as ft
# bag-of-words model builder
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences)
# get all the feature names (the header of the bag of words, i.e. all words)
words = cv.get_feature_names()
Example:
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
# split the document into sentences
sentences = tk.sent_tokenize(doc)
print(sentences)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences)
print(bow.toarray())
words = cv.get_feature_names()
print(words)
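Two things to note about CountVectorizer: it lowercases the text by default, which is why the feature names come out all lowercase; and in recent scikit-learn releases get_feature_names() has been removed in favor of get_feature_names_out().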
Term Frequency (TF)
The number of times a word appears in a sentence divided by the total number of words in the sentence is called the term frequency: how frequently the word occurs within one sentence. Compared with the raw count, term frequency evaluates a word's contribution to a sentence more objectively. Normalizing the bag-of-words matrix row by row yields the term frequencies.
Example:
import sklearn.preprocessing as sp
tf = sp.normalize(bow, norm='l1')
print(tf, '\n', tf.toarray())
Document Frequency (DF)
DF = number of document samples containing the word / total number of document samples.
The lower a word's document frequency, the more that word contributes to distinguishing meaning.
Inverse Document Frequency (IDF)
IDF = total number of document samples / number of document samples containing the word.
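In practice this ratio is log-scaled rather than used raw. For reference, scikit-learn's TfidfTransformer (used below) computes a smoothed variant by default:
idf(t) = ln((1 + n) / (1 + df(t))) + 1
where n is the total number of documents and df(t) is the number of documents containing term t.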
Term Frequency-Inverse Document Frequency (TF-IDF)
Multiply every element of the term-frequency matrix by the corresponding word's inverse document frequency. The larger the result, the more that word contributes to the sample's meaning; a learning model is then built from each word's contribution.
API for obtaining the TF-IDF matrix:
# first build the bag-of-words matrix
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences)
# get a TF-IDF transformer
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)
Example:
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)
print(tfidf.toarray())
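The two steps can also be fused: ft.TfidfVectorizer() combines CountVectorizer and TfidfTransformer, so
tfidf = ft.TfidfVectorizer().fit_transform(sentences)
produces the same TF-IDF matrix in a single call.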
Text Classification (Topic Identification)
import sklearn.datasets as sd
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb
train = sd.load_files('20news', encoding='latin1', shuffle=True, random_state=7)
# load_files reads every file under each subdirectory of '20news';
# the subdirectory names become the category labels
train_data = train.data
train_y = train.target
print(len(train_data))
categories = train.target_names
print(categories)
# build the bag-of-words model
cv = ft.CountVectorizer()
train_bow = cv.fit_transform(train_data)
# build the TF-IDF matrix
tt = ft.TfidfTransformer()
train_x = tt.fit_transform(train_bow)
# train a naive Bayes model based on the multinomial distribution
model = nb.MultinomialNB()
model.fit(train_x, train_y)
# test the model:
test_data = [
    'The curveballs of right handed pitchers tend to curve to the left.',
    'Caesar cipher is an ancient form of encryption.',
    'This two-wheeler is really good on slippery roads',
]
test_bow = cv.transform(test_data)
test_x = tt.transform(test_bow)
pre_test_y = model.predict(test_x)
for sent, i in zip(test_data, pre_test_y):
    print(sent, '->', categories[i])
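Note that the test samples go through cv.transform and tt.transform rather than fit_transform: the vocabulary and IDF weights learned from the training set must be reused, otherwise the test feature columns would not line up with the ones the model was trained on.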
NLTK classifiers
NLTK provides a naive Bayes classifier that makes natural-language classification problems convenient to handle. It works directly on feature dictionaries, so there is no need to assemble bag-of-words or TF-IDF matrices by hand; it trains the model and then predicts categories.
The relevant APIs:
import nltk.classify as cf
import nltk.classify.util as cu
"""format of train_data:
[({'age': 1, 'score': 2, 'student': 1}, 'good'),
 ({'age': 1, 'score': 2, 'student': 1}, 'bad')]
"""
# train a model with NLTK's naive Bayes classifier
model = cf.NaiveBayesClassifier.train(train_data)
# predict one test sample; test_data: {'age': 1, 'score': 2, 'student': 1}
model.classify(test_data)
# evaluate the classifier on labeled (featureset, label) pairs; returns accuracy
ac = cu.accuracy(model, test_data)
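A minimal self-contained sketch of these calls on toy data (the feature dicts and labels are invented for illustration):
import nltk.classify as cf
import nltk.classify.util as cu
# toy (featureset, label) pairs, invented for illustration
train_data = [({'great': True, 'movie': True}, 'good'),
              ({'dull': True, 'movie': True}, 'bad'),
              ({'great': True, 'cast': True}, 'good'),
              ({'terrible': True, 'story': True}, 'bad')]
test_data = [({'great': True, 'story': True}, 'good')]
model = cf.NaiveBayesClassifier.train(train_data)
print(model.classify({'dull': True, 'story': True}))  # expected: 'bad'
print(cu.accuracy(model, test_data))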
Sentiment Analysis
Analyze the movie_reviews documents in the NLTK corpus: train on the positive and negative reviews to perform sentiment analysis.
Example:
import nltk.corpus as nc
import nltk.classify as cf
import nltk.classify.util as cu
pdata = []
fileids = nc.movie_reviews.fileids('pos')
# collect the words of every positive review into the pdata list
for fileid in fileids:
    sample = {}
    words = nc.movie_reviews.words(fileid)
    for word in words:
        sample[word] = True
    pdata.append((sample, 'POSITIVE'))
ndata = []
fileids = nc.movie_reviews.fileids('neg')
# collect the words of every negative review into the ndata list
for fileid in fileids:
    sample = {}
    words = nc.movie_reviews.words(fileid)
    for word in words:
        sample[word] = True
    ndata.append((sample, 'NEGATIVE'))
print(len(pdata), len(ndata))
# split into training and test sets (80% for training)
pnumb = int(0.8 * len(pdata))
nnumb = int(0.8 * len(ndata))
train_data = pdata[:pnumb] + ndata[:nnumb]
test_data = pdata[pnumb:] + ndata[nnumb:]
# train a naive Bayes classifier
model = cf.NaiveBayesClassifier.train(train_data)
ac = cu.accuracy(model, test_data)
print(ac)
# simulate a business scenario
reviews = [
    'It is an amazing movie',
    'This is a dull movie, I would never recommend it to anyone.',
    'The cinematography is pretty great in this movie.',
    'The direction was terrible and the story was all over the place.'
]
for review in reviews:
    sample = {}
    words = review.split()
    for word in words:
        sample[word] = True
    pcls = model.classify(sample)
    print(review, '->', pcls)
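NLTK's naive Bayes model can also report which words weigh most heavily in its decisions: model.show_most_informative_features(10) prints the ten features with the most skewed POSITIVE/NEGATIVE likelihood ratios.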
Speech Recognition
Using the Fourier transform, the time-domain sound signal is decomposed into a superposition of sine waves of different frequencies. The characteristic distribution of the spectral lines is used to build a correspondence between audio content and text, which serves as the basis for model training.
Example: freq.wav
import numpy as np
import numpy.fft as nf
import scipy.io.wavfile as wf
import matplotlib.pyplot as mp
sample_rate, sigs = wf.read('freq.wav')
print(sample_rate)
print(sigs.shape, sigs.dtype)
# x axis: the time of each sample point
times = np.arange(len(sigs)) / sample_rate
# Fourier transform: get the frequency and energy of each component sine wave
freqs = nf.fftfreq(sigs.size, 1 / sample_rate)
ffts = nf.fft(sigs)
pows = np.abs(ffts)
# draw the two plots
mp.figure('Audio', facecolor='lightgray')
mp.subplot(121)
mp.title('Time Domain')
mp.xlabel('Time', fontsize=12)
mp.ylabel('Signal', fontsize=12)
mp.grid(linestyle=":")
mp.plot(times, sigs, c='dodgerblue')
# frequency-domain plot
mp.subplot(122)
mp.title('Frequency Domain')
mp.xlabel('Frequency', fontsize=12)
mp.ylabel('Pow', fontsize=12)
mp.grid(linestyle=":")
mp.plot(freqs[freqs>0], pows[freqs>0], c='orangered')
mp.tight_layout()
mp.show()
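Only the positive half of the spectrum is plotted: the signal is real-valued, so its FFT is conjugate-symmetric and the negative-frequency half carries no extra information.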
### The speech recognition process
Mel-frequency cepstral coefficients (MFCC): after a Fourier transform of the sound, the energy distribution over 13 special frequencies closely tied to the content of the speech can be extracted, so the MFCC matrix can serve as the feature for speech recognition. Pattern recognition based on hidden Markov models then finds the trained sound model that best matches a test sample, thereby recognizing the speech content.
MFCC APIs:
import scipy.io.wavfile as wf
import python_speech_features as sf
# read the audio file: get the sample rate and the value of each sample point
sample_rate, sigs = wf.read('freq.wav')
# hand the signal to the feature extractor: get the audio's MFCC matrix
mfcc = sf.mfcc(sigs, sample_rate)
Example: compare the MFCC matrices of different audio files
import scipy.io.wavfile as wf
import python_speech_features as sf
import matplotlib.pyplot as mp
sample_rate, sigs = wf.read('apple01.wav')
mfcc = sf.mfcc(sigs, sample_rate)
print(mfcc.shape)
mp.matshow(mfcc.T, cmap='gist_rainbow')
mp.title('MFCC', fontsize=16)
mp.ylabel('Feature', fontsize=12)
mp.xlabel('Sample', fontsize=12)
mp.tick_params(labelsize=10)
mp.show()
sample_rate, sigs = wf.read('freq.wav')
mfcc = sf.mfcc(sigs, sample_rate)
print(mfcc.shape)
mp.matshow(mfcc.T, cmap='gist_rainbow')
mp.title('MFCC', fontsize=16)
mp.ylabel('Feature', fontsize=12)
mp.xlabel('Sample', fontsize=12)
mp.tick_params(labelsize=10)
mp.show()
Hidden Markov model APIs:
import hmmlearn.hmm as hl
# build the hidden Markov model
model = hl.GaussianHMM(
    n_components=4,           # number of Gaussian components used to fit the samples
    covariance_type='diag',   # each state uses only the diagonal of its covariance matrix
    n_iter=1000               # upper limit on training iterations
)
# train on the stacked MFCC matrices of one category of samples
model.fit(train_mfccs)
# score how well a test MFCC matrix matches this model (higher = better match)
score = model.score(test_mfccs)
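A minimal sketch of the recognition step described above: one model is trained per word, and a test sample is assigned to the best-scoring model (the file lists, models dict, and helper names are hypothetical):
import numpy as np
import scipy.io.wavfile as wf
import python_speech_features as sf
import hmmlearn.hmm as hl

def train_word_model(wav_files):
    # stack the MFCC matrices of all samples of one word and fit one HMM
    mfccs = []
    for f in wav_files:
        sample_rate, sigs = wf.read(f)
        mfccs.append(sf.mfcc(sigs, sample_rate))
    model = hl.GaussianHMM(n_components=4, covariance_type='diag', n_iter=1000)
    model.fit(np.vstack(mfccs))
    return model

def recognize(models, wav_file):
    # models: {'apple': trained GaussianHMM, ...} built with train_word_model
    sample_rate, sigs = wf.read(wav_file)
    mfcc = sf.mfcc(sigs, sample_rate)
    # pick the word whose model scores the test MFCCs highest
    return max(models, key=lambda word: models[word].score(mfcc))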