Wine Quality Analysis with a Random Forest Model

Original post

2020-05-01 12:42

Video: Nanjing University, 用Python玩转数据 (Fun with Data in Python) https://www.icourse163.org/course/NJU-1001571005

Dataset URL: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

# -*- coding: utf-8 -*-
# url: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore') 

try:
    wine=pd.read_csv('F:/data analysis/wine/winequality-red.csv',sep=';')
except FileNotFoundError:
    print("Cannot find the file!")
    raise  # stop here; the rest of the script needs the wine DataFrame
    
print(wine.info())

from sklearn.preprocessing import LabelEncoder
bins=(2,4,6,8)
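#bins defines left-open, right-closed intervals: (2,4], (4,6], (6,8]. quality ranges from 3 to 8, so 3 and 4 form one group, 5 and 6 another, and 7 and 8 the third
group_names=['low','medium','high']
#names of the groups produced by the binning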
wine['quality_lb']=pd.cut(wine['quality'],bins=bins, labels=group_names)

lb_quality=LabelEncoder()
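#encode the quality groups as integer labels 0, 1, 2

wine['label'] = lb_quality.fit_transform(wine['quality_lb'])
#fit the encoder and transform the group names in one step; LabelEncoder numbers the classes in sorted order, so high=0, low=1, medium=2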
print(wine.label.value_counts())

wine_copy=wine.copy()
wine.drop(['quality','quality_lb'],axis=1,inplace=True)
x=wine.iloc[:,:-1]
y=wine.label
#split the feature columns (x) from the target column (y)

from sklearn.model_selection import train_test_split
#randomly split the samples into training and test sets; test_size sets the fraction used for testing

x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2)
from sklearn.preprocessing import scale
x_train=scale(x_train)
x_test=scale(x_test)
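#standardize the training and test sets with scale; note that each call uses the mean and standard deviation of the array it is given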

from sklearn.metrics import confusion_matrix
rfc=RandomForestClassifier(n_estimators=200)
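#Random forest: a representative parallel ensemble method of the bagging type. The original data set is randomly resampled many times, a decision-tree base learner is trained on each sample, and the base learners are combined by voting or averaging to obtain higher accuracy and better generalization.
#Build the classifier with RandomForestClassifier; n_estimators is the number of base learners (trees) built before taking the majority vote or the average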
rfc.fit(x_train,y_train) #train
y_pred=rfc.predict(x_test) #predict
print(confusion_matrix(y_test,y_pred))
#confusion matrix: rows are the actual classes, columns are the predicted classes

#hyperparameter tuning
param_rfc={"n_estimators":[10,20,30,40,50,60,70,80,90,100,150,200],
           "criterion":["gini","entropy"]
           }
grid_rfc=GridSearchCV(rfc, param_rfc, cv=5)
#GridSearchCV runs an exhaustive (brute-force) search over the parameter grid and reports the best parameters and score; it suits small data sets
grid_rfc.fit(x_train, y_train)
best_param_rfc=grid_rfc.best_params_
print(best_param_rfc)
rfc=RandomForestClassifier(n_estimators=best_param_rfc['n_estimators'],
    criterion=best_param_rfc['criterion'],random_state=0)
rfc.fit(x_train, y_train)
y_pred=rfc.predict(x_test)
print(confusion_matrix(y_test,y_pred))
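
The grid search above runs on data that was scaled before cross-validation. One possible variation, shown here only as a sketch, is to put the scaler and the forest into a Pipeline so that GridSearchCV refits the scaler on each training fold. The synthetic data from make_classification, the step names "scaler" and "rf", and the reduced parameter grid are assumptions made for illustration; they are not part of the original script.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (assumption): 11 features, 3 classes.
X, y = make_classification(n_samples=300, n_features=11, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The scaler is a pipeline step, so it is fitted on the training part of
# every cross-validation fold and only applied to the held-out part.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    "rf__n_estimators": [10, 50],
    "rf__criterion": ["gini", "entropy"],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
cm = confusion_matrix(y_test, search.predict(X_test))
print(cm)
```

With the default refit=True, search.predict uses the best pipeline refitted on the whole training set, so the test set is scaled with training statistics.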

Data description

Final results

Confusion matrix

{'criterion': 'gini', 'n_estimators': 60}
[[ 19   0  23]
 [  0   0   9]
 [  9   2 258]]

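Read the diagonal: each diagonal entry is the number of records that were classified correctly. In the first row, 19 high-quality wines (class 0) were classified correctly, and 23 is the number of class 0 records that were misclassified as class 2.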
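
To make the reading of the matrix concrete, here is a small sketch that computes the overall accuracy and the per-class recall directly from the numbers printed above. The class names follow the label order mentioned in this post (0 is high, 1 is low, 2 is medium); the variable names are illustrative.

```python
# Confusion matrix from the final run: rows are actual classes,
# columns are predicted classes.
cm = [
    [19, 0, 23],
    [0, 0, 9],
    [9, 2, 258],
]
class_names = ["high", "low", "medium"]  # label order 0, 1, 2

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))  # sum of the diagonal
accuracy = correct / total

# Recall of a class = correctly predicted records / all records of that class.
recall = {name: cm[i][i] / sum(cm[i]) for i, name in enumerate(class_names)}

print(f"accuracy: {accuracy:.3f}")
for name, value in recall.items():
    print(f"recall[{name}]: {value:.3f}")
```

The overall accuracy looks reasonable, but the per-class recall shows that none of the 9 low-quality test wines was recognized, which the accuracy alone hides.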

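Questions

I wrote the code in Spyder, and I have to say it is really pleasant to use. Nothing failed to run, but a few things are still unclear to me.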

2 1319
0 217
1 63

Why is 2 medium, 1 low, and 0 high?
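
A likely explanation is that LabelEncoder numbers the classes in sorted order, and for strings that means alphabetical order: high, low, medium. The sketch below illustrates this on a toy list; the explicit mapping named order is a hypothetical alternative for cases where the codes should follow the quality ranking instead.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["low", "medium", "high", "medium"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's code is its index there.
print(encoder.classes_.tolist())
print(codes.tolist())

# Hypothetical alternative: an explicit mapping that keeps the natural ranking.
order = {"low": 0, "medium": 1, "high": 2}
ordered_codes = [order[label] for label in labels]
print(ordered_codes)
```

For a random forest classifier the numbering does not change the predictions; it only affects how the rows and columns of the confusion matrix are read.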

Additional notes

fit(): Method calculates the parameters μ and σ and saves them as internal objects.

Explanation: put simply, fit computes the intrinsic statistics of the training set X, such as its mean, variance, maximum, and minimum.

transform(): Method that uses these calculated parameters to apply the transformation to a particular dataset.

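Explanation: building on fit, transform carries out the actual operation, such as standardization, dimensionality reduction, or normalization (depending on the tool in use, e.g. PCA or StandardScaler).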

fit_transform(): joins the fit() and transform() method for transformation of dataset.

Explanation: fit_transform combines fit and transform, so it both fits (trains) and transforms.

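Both transform() and fit_transform() apply a uniform transformation to the data, for example standardization (zero mean, unit variance), scaling (mapping) the data to a fixed range, normalization, and so on.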

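fit_transform(trainData) first fits on that part of the data, computing its overall statistics such as the mean, variance, maximum, and minimum (depending on the transformation), and then transforms the same trainData, which yields the standardized or normalized data.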
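
The difference between the three methods can be sketched with StandardScaler on a toy array. The arrays train and test are made-up values for illustration; the point is that the statistics learned by fit on the training data are reused when transforming other data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
scaler.fit(train)                       # learns the mean and std of train only
train_scaled = scaler.transform(train)  # applies the learned statistics
test_scaled = scaler.transform(test)    # reuses the training statistics

# fit_transform does fit and transform on the same data in one call.
train_scaled_one_call = StandardScaler().fit_transform(train)

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

Calling scale(x_test) or fit_transform on the test set would instead standardize it with its own mean and standard deviation, so the same raw value could map to different scaled values in the two sets.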
