Machine Learning Day 06

Natural Language Processing (NLP)

Siri: 1. listen 2. understand 3. think 4. compose a response 5. answer

  1. Speech recognition
  2. Natural language processing: semantic analysis
  3. Logical analysis: combine the business scenario and context
  4. Natural language processing: generate natural-language text from the analysis results
  5. Speech synthesis

A typical NLP processing pipeline:

First tokenize the training text (with stemming or lemmatization), then use the term frequency - inverse document frequency (TF-IDF) algorithm to measure each word's contribution to a given meaning. Based on each word's contribution, build a supervised learning model; finally, feed test samples to the model to obtain their semantic category.
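
As a preview, the whole pipeline can be chained together with scikit-learn. The sketch below uses a tiny made-up corpus purely for illustration; each step (bag of words, TF-IDF, naive Bayes) is covered in detail later in these notes.

import sklearn.pipeline as pl
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb

# tiny made-up corpus, just to show the shape of the pipeline
train_texts = ['the dog runs in the room', 'a black dog is barking',
               'caesar cipher is a simple encryption', 'encryption protects messages']
train_labels = ['animals', 'animals', 'crypto', 'crypto']

model = pl.Pipeline([
    ('bow', ft.CountVectorizer()),      # bag-of-words counts
    ('tfidf', ft.TfidfTransformer()),   # TF-IDF weighting
    ('clf', nb.MultinomialNB()),        # multinomial naive Bayes classifier
])
model.fit(train_texts, train_labels)
print(model.predict(['my dog sleeps in the room']))   # expected: ['animals']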

Natural Language Toolkit - NLTK

Text Tokenization

Tokenization-related APIs:

import nltk.tokenize as tk
# split the sample text into sentences; sent_list: list of sentences
sent_list = tk.sent_tokenize(text)
# split the sample string into words; word_list: list of words
word_list = tk.word_tokenize(text)
# WordPunctTokenizer: tokenizer object that also splits off punctuation marks
tokenizer = tk.WordPunctTokenizer()
word_list = tokenizer.tokenize(text)

Example:

import nltk.tokenize as tk

doc = 'For support please see the REST framework discussion group, try the #restframework channel on irc.' \
      'freenode.net, search the IRC archives, or raise a question on Stack Overflow, ' \
      'making sure to include the django-rest-framework tag.' + """
Let's see how it works! We need to analyze a couple of sentences with punctuation to see it in action. Let's go.
"""
# split into sentences
sent_list = tk.sent_tokenize(doc)
for i, sent in enumerate(sent_list):
    print("%2d" % (i+1), sent)
print('-' * 45)

# split into words
word_list = tk.word_tokenize(doc)
for i, word in enumerate(word_list):
    print("%2d" % (i+1), word)

# split into words using WordPunctTokenizer
tokenizer = tk.WordPunctTokenizer()
word_list = tokenizer.tokenize(doc)
for i, word in enumerate(word_list):
    print("%2d" % (i+1), word)

Stemming

After tokenization, a word's inflected form (tense, plural, etc.) usually adds little to semantic analysis, so each word is reduced to its stem.

Stemming-related APIs:

import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

stemmer = pt.PorterStemmer()       # Porter stemmer (lenient)
stemmer = lc.LancasterStemmer()    # Lancaster stemmer (strict)
stemmer = sb.SnowballStemmer('english')    # Snowball stemmer (in between)

r = stemmer.stem('playing')   # extract the stem and return it

Example:

import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
import nltk.stem.porter as pt

words = ['table', 'probably', 'wolves', 'playing', 'is',
         'dog', 'the', 'beaches', 'grounded', 'dreamt',
         'envision']

pt_stemmer = pt.PorterStemmer()
lc_stemmer = lc.LancasterStemmer()
sb_stemmer = sb.SnowballStemmer(language='english')

for word in words:
    pt_stem = pt_stemmer.stem(word)
    lc_stem = lc_stemmer.stem(word)
    sb_stem = sb_stemmer.stem(word)
    print('%8s, %8s, %8s, %8s' %(word, pt_stem, lc_stem, sb_stem))

Lemmatization

Lemmatization serves a similar purpose to stemming, but its output is better suited to further manual processing, because some stems are not real words and are harder to read. Lemmatization converts plural nouns to their singular form and restores inflected verbs to their base form.

import nltk.stem as ns
# lemmatizer
lemmatizer = ns.WordNetLemmatizer()
# lemmatize as a noun
n_lemma = lemmatizer.lemmatize(word, pos='n')
# lemmatize as a verb
v_lemma = lemmatizer.lemmatize(word, pos='v')

Example:

import nltk.stem as ns

words = ['table', 'probably', 'wolves', 'playing', 'is',
         'dog', 'the', 'beaches', 'grounded', 'dreamt',
         'envision']

lemmatizer = ns.WordNetLemmatizer()
for word in words:
    n = lemmatizer.lemmatize(word, pos='n')
    v = lemmatizer.lemmatize(word, pos='v')
    print('%8s, %8s, %8s' % (word, n, v))

Bag-of-Words Model

The meaning of a sentence depends to a large extent on which words appear in it and how often. So we can take every word that may appear as a feature name, treat each sentence as one sample, and use the number of times a word appears in that sentence as the feature value. The resulting model is called the bag-of-words model. For example, take the following text:

The brown dog is running. The black dog is in the black room. Running in the room is forbidden.

  1. The brown dog is running.
  2. The black dog is in the black room.
  3. Running in the room is forbidden.
        the  brown  dog  is  running  black  in  room  forbidden
   1.     1      1    1   1        1      0   0     0          0
   2.     2      0    1   1        0      2   1     1          0
   3.     1      0    0   1        1      0   1     1          1

APIs for building a bag-of-words model:

import sklearn.feature_extraction.text as ft
# bag-of-words vectorizer
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences)
# get all feature names (all words, i.e. the header of the bag-of-words table)
words = cv.get_feature_names()   # newer scikit-learn versions: cv.get_feature_names_out()

Example:

import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft

doc = 'The brown dog is running.\
 The black dog is in the black room. \
 Running in the room is forbidden.'
# split the document into sentences
sentences = tk.sent_tokenize(doc)
print(sentences)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences)
print(bow.toarray())
words = cv.get_feature_names()
print(words)

Term Frequency (TF)

The number of times a word appears in a sentence divided by the total number of words in that sentence is called the term frequency (TF): how frequently the word occurs within the sentence. Compared with the raw count, TF measures a word's contribution to the sentence more objectively. Normalizing each row of the bag-of-words matrix yields the term frequencies.

Example:

import sklearn.preprocessing as sp
# l1 normalization: divide each count by the total word count of its sentence
tf = sp.normalize(bow, norm='l1')
print(tf, '\n', tf.toarray())

Document Frequency (DF)

Number of documents that contain a given word / total number of documents.
The lower a word's document frequency, the more that word contributes to distinguishing meaning.
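
For example, with the three bag-of-words sentences above, 'dog' appears in 2 of the 3 documents, so DF(dog) = 2/3, while 'forbidden' appears in only 1, so DF(forbidden) = 1/3; 'forbidden' is therefore the more informative word.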

Inverse Document Frequency (IDF)

Total number of documents / number of documents that contain the word.
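
In practice this ratio is log-scaled and smoothed rather than used directly; for instance, scikit-learn's TfidfTransformer (with its default smooth_idf=True) uses idf(w) = ln((1 + N) / (1 + N_w)) + 1, where N is the total number of documents and N_w is the number of documents containing w.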

Term Frequency - Inverse Document Frequency (TF-IDF)

Multiply each element of the term-frequency matrix by the corresponding word's inverse document frequency. The larger the resulting value, the more that word contributes to the sample's meaning; a learning model is then built from these contributions.
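
In other words, tfidf(w, d) = tf(w, d) × idf(w). Note that scikit-learn's TfidfTransformer also L2-normalizes each row by default (norm='l2'), so the TF-IDF values of one document form a unit vector.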

API for obtaining the TF-IDF matrix:

# first build the bag-of-words matrix
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences)
# TF-IDF transformer
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)

Example:

tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)
print(tfidf.toarray())

Text Classification (Topic Identification)

Example: train a topic classifier on text samples stored in the '20news' directory (one subdirectory per category), using a bag-of-words model, TF-IDF weighting, and multinomial naive Bayes.

import sklearn.datasets as sd
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb

train = sd.load_files('20news', encoding='latin1', shuffle=True, random_state=7)
# load_files reads every file under each subdirectory of '20news' and uses the subdirectory name as the category label
train_data = train.data
train_y = train.target
print(len(train_data))
categories = train.target_names
print(categories)
# build the bag-of-words model
cv = ft.CountVectorizer()
train_bow = cv.fit_transform(train_data)
# build the TF-IDF matrix
tt = ft.TfidfTransformer()
train_x = tt.fit_transform(train_bow)
# train a naive Bayes model based on the multinomial distribution
model = nb.MultinomialNB()
model.fit(train_x, train_y)

# test the model:
test_data = [
    'The curveballs of right handed pitchers tend to curve to the left.',
    'Caesar cipher is an ancient form of encryption.',
    'This two-wheeler is really good on slippery roads',
]

test_bow = cv.transform(test_data)
test_x = tt.transform(test_bow)
pre_test_y = model.predict(test_x)

for sent, i in zip(test_data, pre_test_y):
    print(sent, '->', categories[i])

NLTK Classifier

NLTK provides a naive Bayes classifier that is convenient for natural-language classification problems. It works directly on feature dictionaries (for example, word-presence features), so there is no need to build bag-of-words or TF-IDF matrices by hand; the classifier handles model training and, finally, category prediction.

Related APIs:

import nltk.classify as cf
import nltk.classify.util as cu
"""train_data的数据格式:
[({'age': 1, 'score': 2, 'student': 1}, 'good'),
({'age': 1, 'score': 2, 'student': 1}, 'bad')]
"""
# 使用nltk得到的朴素贝叶斯分类器训练模型
model = cf.NaiveBayesClassifier.train(train_data)
# 对测试数据进行预测: test_data: {'age': 1, 'score': 2, 'student': 1},
model.classify(test_data)
# 评估分类器, 返回分类器的得分
ac = cu.accuracy(model, test_data)
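
For illustration, here is a tiny self-contained run of this API; the feature names, values, and labels below are made up:

import nltk.classify as cf
import nltk.classify.util as cu

# hypothetical toy data: (feature dict, label) pairs
train_data = [({'price': 1, 'quality': 3}, 'good'),
              ({'price': 3, 'quality': 1}, 'bad'),
              ({'price': 1, 'quality': 2}, 'good'),
              ({'price': 3, 'quality': 2}, 'bad')]
test_data = [({'price': 1, 'quality': 3}, 'good'),
             ({'price': 3, 'quality': 1}, 'bad')]

model = cf.NaiveBayesClassifier.train(train_data)
print(model.classify({'price': 1, 'quality': 3}))   # classify a single feature dict
print(cu.accuracy(model, test_data))                # accuracy on labeled test data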

Sentiment Analysis

Analyze the movie_reviews documents in the NLTK corpus: train on the positive and negative reviews to implement sentiment analysis.

Example:

import nltk.corpus as nc
import nltk.classify as cf
import nltk.classify.util as cu

pdata = []
fileids = nc.movie_reviews.fileids('pos')
# collect the words of every positive review and store them in the pdata list
for fileid in fileids:
    sample = {}
    words = nc.movie_reviews.words(fileid)
    for word in words:
        sample[word] = True
    pdata.append((sample, 'POSITIVE'))

ndata = []
fileids = nc.movie_reviews.fileids('neg')
# collect the words of every negative review and store them in the ndata list
for fileid in fileids:
    sample = {}
    words = nc.movie_reviews.words(fileid)
    for word in words:
        sample[word] = True
    ndata.append((sample, 'NEGATIVE'))

print(len(pdata), len(ndata))

# split into training and test sets (80% used for training)
pnumb = int(0.8 * len(pdata))
nnumb = int(0.8 * len(ndata))
train_data = pdata[:pnumb] + ndata[:nnumb]
test_data = pdata[pnumb:] + ndata[nnumb:]
# train the naive Bayes classifier
model = cf.NaiveBayesClassifier.train(train_data)
ac = cu.accuracy(model, test_data)
print(ac)

# simulate a real business scenario
reviews = [
    'It is an amazing movie',
    'This is a dull movie, I would never recommend it to anyone.',
    'The cinematography is pretty great in this movie.',
    'The direction was terrible and the story was all over the place.'
]
for review in reviews:
    sample = {}
    words = review.split()
    for word in words:
        sample[word] = True
    pcls = model.classify(sample)
    print(review, '->', pcls)

Speech Recognition

Using the Fourier transform, the time-domain sound signal is decomposed into a superposition of sine waves of different frequencies. The characteristic distribution of the frequency spectrum is used to establish the correspondence between audio content and text, which serves as the basis for model training.

Example: freq.wav

import numpy as np
import numpy.fft as nf
import scipy.io.wavfile as wf
import matplotlib.pyplot as mp

sample_rate, sigs = wf.read('freq.wav')
print(sample_rate)
print(sigs.shape, sigs.dtype)
# x-axis: the time of each sample point
times = np.arange(len(sigs)) / sample_rate
# Fourier transform: get the frequencies and energies of the component sine waves
freqs = nf.fftfreq(sigs.size, 1 / sample_rate)
ffts = nf.fft(sigs)
pows = np.abs(ffts)
# draw the two plots
mp.figure('Audio', facecolor='lightgray')
mp.subplot(121)
mp.title('Time Domain')
mp.xlabel('Time', fontsize=12)
mp.ylabel('Signal', fontsize=12)
mp.grid(linestyle=":")
mp.plot(times, sigs, c='dodgerblue')
# frequency-domain plot
mp.subplot(122)
mp.title('Frequency Domain')
mp.xlabel('Frequency', fontsize=12)
mp.ylabel('Pow', fontsize=12)
mp.grid(linestyle=":")
mp.plot(freqs[freqs>0], pows[freqs>0], c='orangered')

mp.tight_layout()
mp.show()

Speech Recognition Process

Mel-frequency cepstral coefficients (MFCC): after applying the Fourier transform to the audio, the energy distribution over 13 special frequencies that are closely related to the speech content is extracted, so the MFCC matrix can be used as the feature for speech recognition. Pattern recognition is then performed with a hidden Markov model: the trained sound model that best matches a test sample determines the recognized content.

MFCC-related APIs:

import scipy.io.wavfile as wf
import python_speech_features as sf
# read the audio file to get the sample rate and the value of every sample point
sample_rate, sigs = wf.read('freq.wav')
# feed them to the speech feature extractor to get this audio's mel-frequency cepstral matrix
mfcc = sf.mfcc(sigs, sample_rate)

Example: compare the MFCC matrices of different audio files

import scipy.io.wavfile as wf
import python_speech_features as sf
import matplotlib.pyplot as mp

sample_rate, sigs = wf.read('apple01.wav')

mfcc = sf.mfcc(sigs, sample_rate)
print(mfcc.shape)

mp.matshow(mfcc.T, cmap='gist_rainbow')
mp.title('MFCC', fontsize=16)
mp.ylabel('Feature', fontsize=12)
mp.xlabel('Sample', fontsize=12)
mp.tick_params(labelsize=10)
mp.show()

sample_rate, sigs = wf.read('freq.wav')

mfcc = sf.mfcc(sigs, sample_rate)
print(mfcc.shape)

mp.matshow(mfcc.T, cmap='gist_rainbow')
mp.title('MFCC', fontsize=16)
mp.ylabel('Feature', fontsize=12)
mp.xlabel('Sample', fontsize=12)
mp.tick_params(labelsize=10)
mp.show()

Hidden Markov model (HMM) related APIs:

import hmmlearn.hmm as hl
# build a Gaussian hidden Markov model
model = hl.GaussianHMM(
    n_components=4,            # number of hidden states (each modeled by a Gaussian)
    covariance_type='diag',    # use diagonal covariance matrices
    n_iter=1000                # maximum number of training iterations
)
# fit on the MFCC matrix of the training audio, then score a test MFCC matrix;
# a higher score (log likelihood) means a better match
model.fit(train_mfccs)
score = model.score(test_mfccs)
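
A minimal sketch of how these pieces could be combined for isolated-word recognition, assuming one list of training .wav files per word (all file names and labels below are hypothetical):

import numpy as np
import scipy.io.wavfile as wf
import python_speech_features as sf
import hmmlearn.hmm as hl

def wav_mfcc(path):
    # read a wav file and return its MFCC matrix (n_frames x 13)
    sample_rate, sigs = wf.read(path)
    return sf.mfcc(sigs, sample_rate)

# hypothetical training data: a few recordings per word
train_files = {'apple': ['apple01.wav', 'apple02.wav'],
               'banana': ['banana01.wav', 'banana02.wav']}

# train one HMM per word on the stacked MFCC frames of its recordings
models = {}
for label, paths in train_files.items():
    mfccs = np.vstack([wav_mfcc(p) for p in paths])
    model = hl.GaussianHMM(n_components=4, covariance_type='diag', n_iter=1000)
    models[label] = model.fit(mfccs)

# recognition: the model with the highest log-likelihood score wins
test_mfcc = wav_mfcc('apple_test.wav')
best_label = max(models, key=lambda label: models[label].score(test_mfcc))
print(best_label)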