Wine Quality Analysis with a Random Forest Model

Original

2020-05-01 12:42

Video course: Nanjing University, "用Python玩轉數據" (Playing with Data in Python) https://www.icourse163.org/course/NJU-1001571005

Dataset: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

# -*- coding: utf-8 -*-
# url: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore') 

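try:
    wine=pd.read_csv('F:/data analysis/wine/winequality-red.csv',sep=';')
except FileNotFoundError:
    # Stop here: every later step needs the DataFrame
    raise SystemExit("Cannot find the file!")

wine.info()  # info() prints its summary itself and returns None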

from sklearn.preprocessing import LabelEncoder
bins=(2,4,6,8)
# bins defines left-open, right-closed intervals: (2,4], (4,6], (6,8]. quality ranges from 3 to 8, so 3 and 4 form one group, 5 and 6 another, and 7 and 8 the third
group_names=['low','medium','high']
# Names of the binned groups
wine['quality_lb']=pd.cut(wine['quality'],bins=bins, labels=group_names)

lb_quality=LabelEncoder()
# Encode the quality groups as the integer labels 0, 1, 2

wine['label'] = lb_quality.fit_transform(wine['quality_lb'])
# Fit the encoder and convert the column in one step
print(wine.label.value_counts())

wine_copy=wine.copy()
wine.drop(['quality','quality_lb'],axis=1,inplace=True)
x=wine.iloc[:,:-1]
y=wine.label
# Separate the feature columns from the target column

from sklearn.model_selection import train_test_split
# Randomly split the samples into training and test data by proportion; test_size sets the fraction used for the test set

x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2)
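from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
x_train=scaler.fit_transform(x_train)
x_test=scaler.transform(x_test)
# Standardize both sets with the mean and standard deviation learned from the training set only, so the test set is scaled consistently with it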

from sklearn.metrics import confusion_matrix
rfc=RandomForestClassifier(n_estimators=200)
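# Random forest: a representative parallel ensemble method of the bagging type. The original dataset is resampled at random several times to produce different sample sets, a decision-tree base learner is trained on each one, and the base learners are combined by voting or averaging to obtain higher accuracy and better generalization.
# Build the classifier with RandomForestClassifier; n_estimators is the number of base learners built before their predictions are combined by majority vote or averaging
rfc.fit(x_train,y_train) # train
y_pred=rfc.predict(x_test) # predict
print(confusion_matrix(y_test,y_pred))
# Confusion matrix: columns are predicted values, rows are actual classes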

# Hyperparameter tuning
param_rfc={"n_estimators":[10,20,30,40,50,60,70,80,90,100,150,200],
           "criterion":["gini","entropy"]
           }
grid_rfc=GridSearchCV(rfc, param_rfc, cv=5)
# GridSearchCV does an exhaustive search over the grid and reports the best parameters and score; it suits small datasets
grid_rfc.fit(x_train, y_train)
best_param_rfc=grid_rfc.best_params_
print(best_param_rfc)
rfc=RandomForestClassifier(n_estimators=best_param_rfc['n_estimators'],
    criterion=best_param_rfc['criterion'],random_state=0)
rfc.fit(x_train, y_train)
y_pred=rfc.predict(x_test)
print(confusion_matrix(y_test,y_pred))
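
To get a feel for how much work the exhaustive search does, the grid can be enumerated by hand. This is only a sketch: it assumes the parameter lists and cv=5 from the tuning step, and the variable names are illustrative.

```python
from itertools import product

# Same grid as param_rfc in the tuning step
param_grid = {
    "n_estimators": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200],
    "criterion": ["gini", "entropy"],
}
cv_folds = 5  # matches cv=5

# One dict per combination of parameter values
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_candidates = len(combinations)
n_cv_fits = n_candidates * cv_folds

print(n_candidates)     # 24
print(n_cv_fits)        # 120
print(combinations[0])  # {'n_estimators': 10, 'criterion': 'gini'}
```

With the default refit=True, GridSearchCV also trains the best candidate once more on the whole training set after these cross-validation fits.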

Data description

Final results

Confusion matrix

{'criterion': 'gini', 'n_estimators': 60}
[[ 19 0 23]
[ 0 0 9]
[ 9 2 258]]

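Look at the diagonal: each diagonal entry is the number of records that were classified correctly. In the first row, 19 high-quality wines were classified correctly, and 23 is the number of class "0" records that were misclassified as class "2".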
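
The same reading can be turned into numbers. The sketch below takes the matrix printed above as a plain list of lists (rows are actual classes, columns are predictions) and derives overall accuracy and per-class recall; the variable names are illustrative.

```python
# Confusion matrix reported above: rows are actual classes, columns are predictions
cm = [
    [19, 0, 23],
    [0, 0, 9],
    [9, 2, 258],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal entries
accuracy = correct / total

# Recall per class: correct predictions divided by the row total
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

print(total, correct)      # 320 277
print(round(accuracy, 3))  # 0.866
for label, r in enumerate(recall):
    print(label, round(r, 3))
```

Class 1 ends up with a recall of 0: none of its 9 test records were predicted correctly, which the overall accuracy alone would hide.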

Questions

I wrote the code in Spyder, which I have to say is really nice to use. Nothing failed to run, but a few things are still unclear to me.

2 1319
0 217
1 63

Why is 2 medium, 1 low, and 0 high?
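
A likely answer: LabelEncoder numbers the classes in the sorted order of their values, and for strings that is alphabetical order, so 'high' < 'low' < 'medium' become 0, 1, 2. The sketch below reproduces that mapping in plain Python (it does not call scikit-learn) and then shows a hypothetical explicit mapping, `ordinal_codes`, for cases where the codes should follow the quality order instead.

```python
group_names = ['low', 'medium', 'high']

# LabelEncoder keeps its classes sorted, so the codes follow alphabetical order
classes = sorted(set(group_names))
alphabetical_codes = {name: code for code, name in enumerate(classes)}
print(alphabetical_codes)  # {'high': 0, 'low': 1, 'medium': 2}

# Hypothetical alternative: codes that follow the order of group_names
ordinal_codes = {name: code for code, name in enumerate(group_names)}
print(ordinal_codes)  # {'low': 0, 'medium': 1, 'high': 2}

sample = ['medium', 'high', 'low', 'medium']
print([alphabetical_codes[s] for s in sample])  # [2, 0, 1, 2]
print([ordinal_codes[s] for s in sample])       # [1, 2, 0, 1]
```

Since pd.cut returns an ordered categorical column here, `wine['quality_lb'].cat.codes` should give the quality-ordered codes directly in pandas.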

Additional notes

fit(): Method calculates the parameters μ and σ and saves them as internal objects.

Explanation: put simply, fit computes the intrinsic properties of the training set X, such as its mean, variance, maximum, and minimum.

transform(): Method that applies the transformation to a particular dataset using these calculated parameters.

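Explanation: building on fit, this carries out the actual operation, such as standardization, dimensionality reduction, or normalization (depending on the tool in use, e.g. PCA or StandardScaler).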

fit_transform(): joins the fit() and transform() methods to transform a dataset in one call.

Explanation: fit_transform is the combination of fit and transform, so it covers both the fitting and the transformation.

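Both transform() and fit_transform() apply some uniform processing to the data (for example standardization to zero mean and unit variance, scaling (mapping) the data to a fixed interval, min-max normalization, or unit-norm scaling).

fit_transform(trainData) first fits on that portion of the data to find its overall statistics, such as the mean, variance, maximum, and minimum (depending on the purpose of the transformation), and then transforms trainData with them, which yields the standardized or normalized data.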
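
To make the division of labour concrete, here is a minimal sketch of the three methods. `TinyStandardScaler` is a hypothetical one-feature class written for this illustration, not scikit-learn's implementation; it uses the population standard deviation, which is also what StandardScaler uses.

```python
from statistics import fmean, pstdev


class TinyStandardScaler:
    """Hypothetical one-feature scaler that mimics the fit/transform interface."""

    def fit(self, values):
        # Learn the parameters from the data and store them on the object
        self.mean_ = fmean(values)
        self.std_ = pstdev(values)  # population standard deviation
        return self

    def transform(self, values):
        # Apply the stored parameters; nothing is re-estimated here
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        # Shorthand for fit followed by transform on the same data
        return self.fit(values).transform(values)


train = [1.0, 2.0, 3.0]
test = [2.0, 4.0]

scaler = TinyStandardScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)  # reuses the training mean and std

print([round(v, 3) for v in train_scaled])  # [-1.225, 0.0, 1.225]
print([round(v, 3) for v in test_scaled])   # [0.0, 2.449]
```

The test values are scaled with the training mean and standard deviation, which is why fit_transform is normally called on the training data and plain transform on the test data.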
