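Python Machine Learning and Practice, Advanced Part: Popular Libraries and Models in Practice

1. Natural Language Processing Package (NLTK)

Vectorizing the sample text with the bag-of-words method

# Vectorize the sample text with the bag-of-words method
sent1 = 'The cat is walking in the bedroom.'
sent2 = 'A dog was running across the kitchen.'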

from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer()

sentences = [sent1, sent2]

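# Print the feature vectors as a dense array
print(count_vec.fit_transform(sentences).toarray())
# Print the word that each vector dimension stands for
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(count_vec.get_feature_names_out())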
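
To make the count matrix concrete, here is a minimal sketch in plain Python of what a bag-of-words vectorizer computes. The helper names tokenize and bag_of_words are illustrative and not part of scikit-learn. The tokenizer is assumed to mirror CountVectorizer's default rule (lowercase, tokens of two or more word characters), so the single-letter word 'A' is dropped.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    # (assumed to mirror CountVectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(sentences):
    counts = [Counter(tokenize(s)) for s in sentences]
    # Vocabulary: every distinct token, sorted alphabetically.
    vocab = sorted(set().union(*counts))
    # One row per sentence, one column per vocabulary word.
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix

vocab, matrix = bag_of_words([
    'The cat is walking in the bedroom.',
    'A dog was running across the kitchen.',
])
print(vocab)
for row in matrix:
    print(row)
```

Each row counts how often every vocabulary word occurs in one sentence; word order is discarded, which is why the representation is called a "bag".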

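Linguistic analysis of the sample text with NLTK

# Linguistic analysis of the sample text with NLTK
import nltk
nltk.download()
# Tokenize and normalize each sentence; some words must be split, e.g. aren't into are and n't, or I'm into I and 'm
tokens_1 = nltk.word_tokenize(sent1)
print(tokens_1)
tokens_2 = nltk.word_tokenize(sent2)
print(tokens_2)
# Build the vocabulary of each sentence and print it sorted in ASCII order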
vocab_1 = sorted(set(tokens_1))
print(vocab_1)
vocab_2 = sorted(set(tokens_2))
print(vocab_2)

# Initialize a stemmer to reduce each word to its stem
stemmer = nltk.stem.PorterStemmer()
stem_1 = [stemmer.stem(t) for t in tokens_1]
print(stem_1)
stem_2 = [stemmer.stem(t) for t in tokens_2]
print(stem_2)

# Tag each token with its part of speech
pos_tag_1 = nltk.tag.pos_tag(tokens_1)
print(pos_tag_1)
pos_tag_2 = nltk.tag.pos_tag(tokens_2)
print(pos_tag_2)

Note that the required corpora and models must be downloaded first, following the prompts: nltk.download()

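2. Word Vector (Word2Vec) Technique

The bag-of-words method can be viewed as a technique for representing text as vectors, and it lets us measure the similarity in content between texts to some degree. For two texts such as the sample sentences above, which share almost no words, bag-of-words is of little help in computing their similarity.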
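
A small sketch of this limitation: the similarity of the two sample sentences measured as the cosine of their word-count vectors. The helpers word_counts and cosine_similarity and the simple letter-run tokenizer are assumptions made for this illustration.

```python
import math
import re
from collections import Counter

def word_counts(text):
    # Lowercase the text and count runs of letters as words.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    # Cosine of the angle between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

c1 = word_counts('The cat is walking in the bedroom.')
c2 = word_counts('A dog was running across the kitchen.')
shared = sorted(set(c1) & set(c2))
similarity = cosine_similarity(c1, c2)
print(shared)
print(round(similarity, 3))
```

The two sentences describe closely related scenes, yet their only overlap is the function word 'the'. The count vectors carry no signal that 'cat' and 'dog', or 'bedroom' and 'kitchen', are related.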

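To capture similarity between words, we also represent each word as a vector, and judge whether two words are close in meaning by computing the similarity between their vectors.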

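A run of consecutive words in a sentence is called a context. The relationships between words are established through countless such contexts. From the perspective of a language model, the last word of each run is constrained by the n words that precede it. This yields a supervised learning system that predicts the last word from the preceding n words.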
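
The supervised setup can be sketched by turning a tokenized sentence into (context, target) training pairs, where each context holds the n preceding words. The function name context_target_pairs and the choice n = 2 are illustrative assumptions. This shows only how training examples are formed, not how Word2Vec itself is trained.

```python
def context_target_pairs(tokens, n):
    # Slide a window of n + 1 words over the sentence: the first n words
    # are the input (context), the last word is the label to predict.
    pairs = []
    for i in range(n, len(tokens)):
        pairs.append((tuple(tokens[i - n:i]), tokens[i]))
    return pairs

tokens = ['the', 'cat', 'is', 'walking', 'in', 'the', 'bedroom']
pairs = context_target_pairs(tokens, n=2)
for context, target in pairs:
    print(context, '->', target)
```

Every position in the text yields one labeled example for free, which is why a large unlabeled corpus is enough to train such a model.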

Training word vectors on the 20 Newsgroups text

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@File  : Word2Vec.py
@Author: Xinzhe.Pang
@Date  : 2019/7/25 22:15
@Desc  : 
"""
# Train word vectors on the 20 Newsgroups text
from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset='all')
X, y = news.data, news.target

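from bs4 import BeautifulSoup
import nltk, re
nltk.download()

# Define news_to_sentences, which splits one news post into sentences
# and returns a list of sentences, each a list of lowercase words
def news_to_sentences(news):
    # Name the parser explicitly to avoid the bs4 "No parser was explicitly specified" warning
    news_text = BeautifulSoup(news, 'lxml').get_text()
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sentences = tokenizer.tokenize(news_text)
    sentences = []
    for sent in raw_sentences:
        # Replace non-letters with a space; replacing them with '' would also
        # delete the spaces and collapse each sentence into a single token
        sentences.append(re.sub('[^a-zA-Z]', ' ', sent.lower().strip()).split())
    return sentences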


sentences = []

# Split the long news posts into sentences to use for training
for x in X:
    sentences += news_to_sentences(x)

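# Import word2vec from gensim.models
from gensim.models import word2vec

# Dimensionality of the word vectors
num_features = 300
# Minimum frequency a word needs in order to be kept in the vocabulary
min_word_count = 20
# Number of CPU worker threads used for parallel training
num_workers = 2
# Size of the context window used during training
context = 5
# Threshold for downsampling very frequent words
downsampling = 1e-3

# Train the word-vector model
# (written for gensim 3.x; gensim 4 renamed the `size` argument to `vector_size`)
model = word2vec.Word2Vec(sentences, workers=num_workers, size=num_features, min_count=min_word_count, window=context,
                          sample=downsampling)

# Mark the trained vectors as final: only the normalized vectors are kept,
# which saves memory, but the model can no longer be trained further
model.init_sims(replace=True)

# Use the trained model to find the 10 words most related to 'morning' in the training text
print(model.wv.most_similar('morning'))

# Use the trained model to find the 10 words most related to 'email' in the training text
print(model.wv.most_similar('email'))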

Warnings and error:

D:\Anaconda3\python.exe E:/python_learning/MyKagglePath/Advanced/Libraries/NLTK/Word2Vec.py
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
D:\Anaconda3\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 34 of the file E:/python_learning/MyKagglePath/Advanced/Libraries/NLTK/Word2Vec.py. To get rid of this warning, change code that looks like this:

BeautifulSoup(YOUR_MARKUP})

to this:

BeautifulSoup(YOUR_MARKUP, "lxml")

markup_type=markup_type))
D:\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py:743: UserWarning: C extension not loaded, training will be slow. Install a C compiler and reinstall gensim for fast training.
"C extension not loaded, training will be slow. "
Traceback (most recent call last):
File "E:/python_learning/MyKagglePath/Advanced/Libraries/NLTK/Word2Vec.py", line 57, in <module>
model.most_similar('morning')
File "D:\Anaconda3\lib\site-packages\gensim\utils.py", line 1447, in new_func1
return func(*args, **kwargs)
File "D:\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py", line 1397, in most_similar
return self.wv.most_similar(positive, negative, topn, restrict_vocab, indexer)
File "D:\Anaconda3\lib\site-packages\gensim\models\keyedvectors.py", line 553, in most_similar
mean.append(weight * self.word_vec(word, use_norm=True))
File "D:\Anaconda3\lib\site-packages\gensim\models\keyedvectors.py", line 468, in word_vec
raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'morning' not in vocabulary"

Process finished with exit code 1
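Solutions:

1. news_text = BeautifulSoup(news,'lxml').get_text()

2. Run the following commands in Anaconda (untested):

conda install mingw libpython
pip uninstall gensim
conda install gensim
pip install scipy

Neither change removes the KeyError. It comes from the cleaning step in news_to_sentences: re.sub('[^a-zA-Z]', '', ...) also deletes the spaces, so each sentence collapses into a single token and 'morning' never enters the vocabulary. Replacing non-letters with a space, re.sub('[^a-zA-Z]', ' ', ...), keeps the word boundaries.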
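
The effect of the replacement string in the cleaning step of news_to_sentences can be checked in isolation. The sketch below compares replacing non-letter characters with an empty string against replacing them with a space. The sample sentence is made up for illustration.

```python
import re

sent = 'Good morning, see you at 9 a.m.!'

# Replacing with '' also deletes the spaces, so split() returns one long token.
collapsed = re.sub('[^a-zA-Z]', '', sent.lower().strip()).split()
# Replacing with ' ' keeps word boundaries, so split() returns real words.
separated = re.sub('[^a-zA-Z]', ' ', sent.lower().strip()).split()

print(collapsed)
print(separated)
```

With the empty-string version each sentence becomes one long token that almost never occurs 20 times, so with min_count = 20 ordinary words such as 'morning' do not reach the vocabulary.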

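3. XGBoost Model

Boosting classifiers belong to the family of ensemble models. The basic idea is to combine hundreds or thousands of tree models, each with low classification accuracy, into a single highly accurate model. The model is built iteratively, and each iteration produces one new tree. Many methods have been proposed for generating a suitable tree at each step, such as Gradient Tree Boosting, mentioned under ensemble (classification) models. It applies the idea of gradient descent when building each tree: starting from all the decision trees built so far, it takes one more step in the direction that minimizes the given objective function.
With sensible parameter settings, a certain number of trees is usually needed to reach satisfactory accuracy. When the dataset is large and complex, the model may need several thousand iterations. XGBoost handles this situation better. XGBoost stands for eXtreme Gradient Boosting. As the name suggests, it is a C++ implementation of the Gradient Boosting Machine. Its main feature is that it automatically uses multiple CPU threads for parallel computation, together with algorithmic improvements that raise accuracy.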
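
The iterative idea can be sketched for the simplest case, regression with squared loss, where the negative gradient of the loss with respect to the current prediction is just the residual. This is a toy illustration in plain Python with one-split trees (stumps). The function names, the learning rate, and the data are assumptions for the example, and this is not how XGBoost is implemented internally.

```python
def fit_stump(xs, residuals):
    # Find the threshold on a single feature that minimizes squared error
    # when each side predicts the mean residual of its points.
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # start from the mean prediction
    trees = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        # For squared loss, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Take a small step in the direction that reduces the loss.
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
few = gradient_boost(xs, ys, n_trees=5)
many = gradient_boost(xs, ys, n_trees=200)
print(mse(few, xs, ys), mse(many, xs, ys))
```

Each stump alone is a weak model, but every round lowers the training error a little, so the model with more rounds fits the training data more closely. This is also why many iterations may be needed on large, complex data.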

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@File  : XGBoost.py
@Author: Xinzhe.Pang
@Date  : 2019/7/25 23:31
@Desc  : 
"""
# Compare how well a random forest and XGBoost predict whether Titanic passengers survived
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Download the Titanic data from the URL
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
# Select pclass, age and sex as the training features
# (.copy() so that filling missing values below does not act on a slice of titanic)
X = titanic[['pclass', 'age', 'sex']].copy()
y = titanic['survived']

# Fill missing age values with the mean of the known ages
X['age'] = X['age'].fillna(X['age'].mean())
# Split the data, sampling 25% at random as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
# Vectorize the features (string-valued columns are one-hot encoded)
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))

# Predict on the test set with a random forest classifier in its default configuration
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
print('The accuracy of Random Forest Classifier on testing set:', rfc.score(X_test, y_test))

# Predict on the same test set with an XGBoost model in its default configuration
xgbc = XGBClassifier()
xgbc.fit(X_train, y_train)
print('The accuracy of eXtreme Gradient Boosting Classifier on testing set:', xgbc.score(X_test, y_test))

The accuracy of Random Forest Classifier on testing set: 0.77811550152
The accuracy of eXtreme Gradient Boosting Classifier on testing set: 0.787234042553

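Judging by these results, the XGBoost classifier does deliver better predictive performance here: about 0.787 accuracy versus 0.778 for the random forest on this test split.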
