Natural Language Processing (NLP)
Siri's pipeline: 1. listen 2. understand 3. think 4. compose a response 5. answer
- Speech recognition
- Natural language processing: semantic analysis
- Logical analysis: combine the context of the business scenario
- Natural language processing: generate natural-language text from the analysis result
- Speech synthesis
A common NLP processing pipeline:
First tokenize the training text (with stemming or lemmatization), then use the term frequency-inverse document frequency (TF-IDF) algorithm to measure each word's contribution to a given meaning. Based on each word's contribution, build a supervised learning model, then hand test samples to the model to obtain their semantic category.
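The sections below walk through each of these steps with NLTK and scikit-learn. As a preview, here is a minimal sketch of the whole pipeline on toy data (the sentences and labels are invented for illustration):
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb
# toy labeled samples (invented for illustration)
train_texts = ['the food was great', 'awful service and bad food',
               'a great movie and a great cast', 'a dull and terrible movie']
train_labels = [0, 1, 0, 1]        # 0: positive, 1: negative
cv = ft.CountVectorizer()          # tokenize + count words
tt = ft.TfidfTransformer()         # weight the counts by TF-IDF
model = nb.MultinomialNB()         # supervised learning model
model.fit(tt.fit_transform(cv.fit_transform(train_texts)), train_labels)
print(model.predict(tt.transform(cv.transform(['what a great movie']))))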
The Natural Language Toolkit: NLTK
Tokenization
Tokenization APIs:
import nltk.tokenize as tk
# split the sample text into sentences; sent_list: list of sentences
sent_list = tk.sent_tokenize(text)
# split the sample string into words; word_list: list of words
word_list = tk.word_tokenize(text)
# WordPunctTokenizer: a tokenizer object
tokenizer = tk.WordPunctTokenizer()
word_list = tokenizer.tokenize(text)
Example:
import nltk.tokenize as tk
doc = 'For support please see the REST framework discussion group, try the #restframework channel on irc.' \
      'freenode.net, search the IRC archives, or raise a question on Stack Overflow, ' \
      'making sure to include the django-rest-framework tag.' + """
Let's see how it works! We need to analyze a couple of sentences with punctuations to see it in action. Let's go.
"""
# split into sentences
sent_list = tk.sent_tokenize(doc)
for i, sent in enumerate(sent_list):
    print("%2d" % (i + 1), sent)
print('-' * 45)
# split into words
word_list = tk.word_tokenize(doc)
for i, word in enumerate(word_list):
    print("%2d" % (i + 1), word)
# split into words with WordPunctTokenizer
tokenizer = tk.WordPunctTokenizer()
word_list = tokenizer.tokenize(doc)
for i, word in enumerate(word_list):
    print("%2d" % (i + 1), word)
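The two word tokenizers differ mainly on contractions: word_tokenize keeps clitics attached (e.g. "Let's" becomes Let + 's), while WordPunctTokenizer splits at every letter/punctuation boundary (Let + ' + s), so the two word lists printed above will not be identical.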
Stemming
After tokenization, a word's part of speech and tense do not affect semantic analysis, so words can be reduced to their stems.
Stemming APIs:
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
stemmer = pt.PorterStemmer()             # Porter stemmer (lenient)
stemmer = lc.LancasterStemmer()          # Lancaster stemmer (strict)
stemmer = sb.SnowballStemmer('english')  # Snowball stemmer (in between)
r = stemmer.stem('playing')              # extract and return the stem
Example:
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
import nltk.stem.porter as pt
words = ['table', 'probably', 'wolves', 'playing', 'is',
'dog', 'the', 'beaches', 'grounded', 'dreamt',
'envision']
pt_stemmer = pt.PorterStemmer()
lc_stemmer = lc.LancasterStemmer()
sb_stemmer = sb.SnowballStemmer(language='english')
for word in words:
    pt_stem = pt_stemmer.stem(word)
    lc_stem = lc_stemmer.stem(word)
    sb_stem = sb_stemmer.stem(word)
    print('%8s, %8s, %8s, %8s' % (word, pt_stem, lc_stem, sb_stem))
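In the output, the strict Lancaster stemmer cuts the most aggressively and often yields non-words, Porter is the most conservative, and Snowball falls in between; for example, 'probably' typically becomes 'probabl' under Porter and Snowball but 'prob' under Lancaster.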
Lemmatization
Similar in purpose to stemming, but lemmatization is friendlier to further manual processing, because some stems are not real words and are harder to read. Lemmatization turns plural nouns into the singular and restores inflected verb forms to the base form.
import nltk.stem as ns
# lemmatizer
lemmatizer = ns.WordNetLemmatizer()
# lemmatize as a noun
n_lemma = lemmatizer.lemmatize(word, pos='n')
# lemmatize as a verb
v_lemma = lemmatizer.lemmatize(word, pos='v')
Example:
import nltk.stem as ns
words = ['table', 'probably', 'wolves', 'playing', 'is',
         'dog', 'the', 'beaches', 'grounded', 'dreamt',
         'envision']
lemmatizer = ns.WordNetLemmatizer()
for word in words:
    n = lemmatizer.lemmatize(word, pos='n')   # noun lemma
    v = lemmatizer.lemmatize(word, pos='v')   # verb lemma
    print('%8s, %8s, %8s' % (word, n, v))
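Note that lemmatize defaults to pos='n' when no part of speech is given. In the output above, 'wolves' lemmatizes to 'wolf' as a noun, while 'is' only becomes 'be' when lemmatized as a verb.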
Bag-of-Words Model
The meaning of a sentence depends to a large extent on how many times each word appears. So we can take every word that may appear in the sentences as a feature name, treat each sentence as one sample, and use the number of times each word occurs in the sentence as the feature value. The resulting mathematical model is called the bag-of-words model.
The brown dog is running. The black dog is in the black room. Running in the room is forbidden.
- The brown dog is running.
- The black dog is in the black room.
- Running in the room is forbidden.
the | brown | dog | is | running | black | in | room | forbidden |
---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 1 | 0 | 2 | 1 | 1 | 0 |
1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
APIs for building a bag-of-words model:
import sklearn.feature_extraction.text as ft
# bag-of-words model builder
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences)
# get all the feature names (the header of the bag of words, i.e. all words)
words = cv.get_feature_names()
Example:
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
# split the document into sentences
sentences = tk.sent_tokenize(doc)
print(sentences)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences)
print(bow.toarray())
words = cv.get_feature_names()
print(words)
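Two things to note about CountVectorizer: it lowercases the text by default, which is why the feature names come out all lowercase; and in recent scikit-learn releases get_feature_names() has been removed in favor of get_feature_names_out().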
Term Frequency (TF)
The number of times a word appears in a sentence divided by the total number of words in the sentence is called the term frequency: how frequently the word occurs within one sentence. Compared with the raw count, term frequency evaluates a word's contribution to a sentence more objectively. Normalizing the bag-of-words matrix row by row yields the term frequencies.
Example:
import sklearn.preprocessing as sp
tf = sp.normalize(bow, norm='l1')
print(tf, '\n', tf.toarray())
Document Frequency (DF)
DF = number of document samples containing the word / total number of document samples.
The lower a word's document frequency, the more that word contributes to distinguishing meaning.
Inverse Document Frequency (IDF)
IDF = total number of document samples / number of document samples containing the word.
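In practice this ratio is log-scaled rather than used raw. For reference, scikit-learn's TfidfTransformer (used below) computes a smoothed variant by default:
idf(t) = ln((1 + n) / (1 + df(t))) + 1
where n is the total number of documents and df(t) is the number of documents containing term t.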
Term Frequency-Inverse Document Frequency (TF-IDF)
Multiply every element of the term-frequency matrix by the corresponding word's inverse document frequency. The larger the result, the more that word contributes to the sample's meaning; a learning model is then built from each word's contribution.
API for obtaining the TF-IDF matrix:
# first build the bag-of-words matrix
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences)
# get a TF-IDF transformer
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)
Example:
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)
print(tfidf.toarray())
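The two steps can also be fused: ft.TfidfVectorizer() combines CountVectorizer and TfidfTransformer, so
tfidf = ft.TfidfVectorizer().fit_transform(sentences)
produces the same TF-IDF matrix in a single call.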
Text Classification (Topic Identification)
import sklearn.datasets as sd
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb
train = sd.load_files('20news', encoding='latin1', shuffle=True, random_state=7)
# load_files reads every file under each subdirectory of '20news';
# the subdirectory names become the category labels
train_data = train.data
train_y = train.target
print(len(train_data))
categories = train.target_names
print(categories)
# build the bag-of-words model
cv = ft.CountVectorizer()
train_bow = cv.fit_transform(train_data)
# build the TF-IDF matrix
tt = ft.TfidfTransformer()
train_x = tt.fit_transform(train_bow)
# train a naive Bayes model based on the multinomial distribution
model = nb.MultinomialNB()
model.fit(train_x, train_y)
# test the model:
test_data = [
    'The curveballs of right handed pitchers tend to curve to the left.',
    'Caesar cipher is an ancient form of encryption.',
    'This two-wheeler is really good on slippery roads',
]
test_bow = cv.transform(test_data)
test_x = tt.transform(test_bow)
pre_test_y = model.predict(test_x)
for sent, i in zip(test_data, pre_test_y):
    print(sent, '->', categories[i])
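Note that the test samples go through cv.transform and tt.transform rather than fit_transform: the vocabulary and IDF weights learned from the training set must be reused, otherwise the test feature columns would not line up with the ones the model was trained on.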
NLTK classifiers
NLTK provides a naive Bayes classifier that makes natural-language classification problems convenient to handle. It works directly on feature dictionaries, so there is no need to assemble bag-of-words or TF-IDF matrices by hand; it trains the model and then predicts categories.
The relevant APIs:
import nltk.classify as cf
import nltk.classify.util as cu
"""format of train_data:
[({'age': 1, 'score': 2, 'student': 1}, 'good'),
 ({'age': 1, 'score': 2, 'student': 1}, 'bad')]
"""
# train a model with NLTK's naive Bayes classifier
model = cf.NaiveBayesClassifier.train(train_data)
# predict one test sample; test_data: {'age': 1, 'score': 2, 'student': 1}
model.classify(test_data)
# evaluate the classifier on labeled (featureset, label) pairs; returns accuracy
ac = cu.accuracy(model, test_data)
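A minimal self-contained sketch of these calls on toy data (the feature dicts and labels are invented for illustration):
import nltk.classify as cf
import nltk.classify.util as cu
# toy (featureset, label) pairs, invented for illustration
train_data = [({'great': True, 'movie': True}, 'good'),
              ({'dull': True, 'movie': True}, 'bad'),
              ({'great': True, 'cast': True}, 'good'),
              ({'terrible': True, 'story': True}, 'bad')]
test_data = [({'great': True, 'story': True}, 'good')]
model = cf.NaiveBayesClassifier.train(train_data)
print(model.classify({'dull': True, 'story': True}))  # expected: 'bad'
print(cu.accuracy(model, test_data))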
Sentiment Analysis
Analyze the movie_reviews documents in the NLTK corpus: train on the positive and negative reviews to perform sentiment analysis.
Example:
import nltk.corpus as nc
import nltk.classify as cf
import nltk.classify.util as cu
pdata = []
fileids = nc.movie_reviews.fileids('pos')
# collect the words of every positive review into the pdata list
for fileid in fileids:
    sample = {}
    words = nc.movie_reviews.words(fileid)
    for word in words:
        sample[word] = True
    pdata.append((sample, 'POSITIVE'))
ndata = []
fileids = nc.movie_reviews.fileids('neg')
# collect the words of every negative review into the ndata list
for fileid in fileids:
    sample = {}
    words = nc.movie_reviews.words(fileid)
    for word in words:
        sample[word] = True
    ndata.append((sample, 'NEGATIVE'))
print(len(pdata), len(ndata))
# split into training and test sets (80% for training)
pnumb = int(0.8 * len(pdata))
nnumb = int(0.8 * len(ndata))
train_data = pdata[:pnumb] + ndata[:nnumb]
test_data = pdata[pnumb:] + ndata[nnumb:]
# train a naive Bayes classifier
model = cf.NaiveBayesClassifier.train(train_data)
ac = cu.accuracy(model, test_data)
print(ac)
# simulate a business scenario
reviews = [
    'It is an amazing movie',
    'This is a dull movie, I would never recommend it to anyone.',
    'The cinematography is pretty great in this movie.',
    'The direction was terrible and the story was all over the place.'
]
for review in reviews:
    sample = {}
    words = review.split()
    for word in words:
        sample[word] = True
    pcls = model.classify(sample)
    print(review, '->', pcls)
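NLTK's naive Bayes model can also report which words weigh most heavily in its decisions: model.show_most_informative_features(10) prints the ten features with the most skewed POSITIVE/NEGATIVE likelihood ratios.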
Speech Recognition
Using the Fourier transform, the time-domain sound signal is decomposed into a superposition of sine waves of different frequencies. The characteristic distribution of the spectral lines is used to build a correspondence between audio content and text, which serves as the basis for model training.
Example: freq.wav
import numpy as np
import numpy.fft as nf
import scipy.io.wavfile as wf
import matplotlib.pyplot as mp
sample_rate, sigs = wf.read('freq.wav')
print(sample_rate)
print(sigs.shape, sigs.dtype)
# x axis: the time of each sample point
times = np.arange(len(sigs)) / sample_rate
# Fourier transform: get the frequency and energy of each component sine wave
freqs = nf.fftfreq(sigs.size, 1 / sample_rate)
ffts = nf.fft(sigs)
pows = np.abs(ffts)
# draw the two plots
mp.figure('Audio', facecolor='lightgray')
mp.subplot(121)
mp.title('Time Domain')
mp.xlabel('Time', fontsize=12)
mp.ylabel('Signal', fontsize=12)
mp.grid(linestyle=":")
mp.plot(times, sigs, c='dodgerblue')
# frequency-domain plot
mp.subplot(122)
mp.title('Frequency Domain')
mp.xlabel('Frequency', fontsize=12)
mp.ylabel('Pow', fontsize=12)
mp.grid(linestyle=":")
mp.plot(freqs[freqs>0], pows[freqs>0], c='orangered')
mp.tight_layout()
mp.show()
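Only the positive half of the spectrum is plotted: the signal is real-valued, so its FFT is conjugate-symmetric and the negative-frequency half carries no extra information.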
### The speech recognition process
Mel-frequency cepstral coefficients (MFCC): after a Fourier transform of the sound, the energy distribution over 13 special frequencies closely tied to the content of the speech can be extracted, so the MFCC matrix can serve as the feature for speech recognition. Pattern recognition based on hidden Markov models then finds the trained sound model that best matches a test sample, thereby recognizing the speech content.
MFCC APIs:
import scipy.io.wavfile as wf
import python_speech_features as sf
# read the audio file: get the sample rate and the value of each sample point
sample_rate, sigs = wf.read('freq.wav')
# hand the signal to the feature extractor: get the audio's MFCC matrix
mfcc = sf.mfcc(sigs, sample_rate)
Example: compare the MFCC matrices of different audio files
import scipy.io.wavfile as wf
import python_speech_features as sf
import matplotlib.pyplot as mp
sample_rate, sigs = wf.read('apple01.wav')
mfcc = sf.mfcc(sigs, sample_rate)
print(mfcc.shape)
mp.matshow(mfcc.T, cmap='gist_rainbow')
mp.title('MFCC', fontsize=16)
mp.ylabel('Feature', fontsize=12)
mp.xlabel('Sample', fontsize=12)
mp.tick_params(labelsize=10)
mp.show()
sample_rate, sigs = wf.read('freq.wav')
mfcc = sf.mfcc(sigs, sample_rate)
print(mfcc.shape)
mp.matshow(mfcc.T, cmap='gist_rainbow')
mp.title('MFCC', fontsize=16)
mp.ylabel('Feature', fontsize=12)
mp.xlabel('Sample', fontsize=12)
mp.tick_params(labelsize=10)
mp.show()
Hidden Markov model APIs:
import hmmlearn.hmm as hl
# build the hidden Markov model
model = hl.GaussianHMM(
    n_components=4,           # number of Gaussian components used to fit the samples
    covariance_type='diag',   # each state uses only the diagonal of its covariance matrix
    n_iter=1000               # upper limit on training iterations
)
# train on the stacked MFCC matrices of one category of samples
model.fit(train_mfccs)
# score how well a test MFCC matrix matches this model (higher = better match)
score = model.score(test_mfccs)
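A minimal sketch of the recognition step described above: one model is trained per word, and a test sample is assigned to the best-scoring model (the file lists, models dict, and helper names are hypothetical):
import numpy as np
import scipy.io.wavfile as wf
import python_speech_features as sf
import hmmlearn.hmm as hl

def train_word_model(wav_files):
    # stack the MFCC matrices of all samples of one word and fit one HMM
    mfccs = []
    for f in wav_files:
        sample_rate, sigs = wf.read(f)
        mfccs.append(sf.mfcc(sigs, sample_rate))
    model = hl.GaussianHMM(n_components=4, covariance_type='diag', n_iter=1000)
    model.fit(np.vstack(mfccs))
    return model

def recognize(models, wav_file):
    # models: {'apple': trained GaussianHMM, ...} built with train_word_model
    sample_rate, sigs = wf.read(wav_file)
    mfcc = sf.mfcc(sigs, sample_rate)
    # pick the word whose model scores the test MFCCs highest
    return max(models, key=lambda word: models[word].score(mfcc))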