Sales Forecasting with Gradient Boosting and Automated Hyperparameter Optimization: Python Data Analysis and Data Operations

Original

2020-05-14 11:46

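This post is a set of study notes on Chapter 6 of "Python Data Analysis and Data-Driven Operations"; the data and most of the code come from that book.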

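Case background: forecasting the order volume of a single product.
Data: 731 rows with 10 feature columns plus the target, containing missing values and outliers. The fields are: whether a purchase limit applies, promotion type, promotion importance, product importance level, number of promotional resource slots, share of emails that include the product, unit price, discount rate, hours the promotion is displayed, promotion cost for the product, and sales quantity (the target).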

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor  # ensemble method: gradient boosting regression
from sklearn.model_selection import GridSearchCV  # grid search over hyperparameters, combined with cross-validation
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error as mse  # measures how well the regression fits

raw_data = pd.read_table(r"D:\data_analysis_and_data_operation_with_python\python_book_v2\chapter6\products_sales.txt",delimiter=",")
# Check missing values per column. df.info() shows this when there are few columns, but it stops listing them when there are too many.
def get_null_count(raw_data):
    """Return [column name, missing count] for every column that has missing values."""
    null_counts = raw_data.isnull().sum()
    return [[name, int(count)] for name, count in null_counts.items() if count > 0]
print(raw_data.head())
print(get_null_count(raw_data))

Output: [['price', 2]]. Only the "price" field has missing values (2 of them), so they are filled with the column mean.

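# Fill only the price column, so the price mean can never end up in another column
sales_data = raw_data.fillna({"price": raw_data["price"].mean()})
# Use describe() to inspect each variable's distribution and look for outliers; with many variables, write a custom function to flag and filter them
raw_data.describe()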

# Split the dataset into X and y; the first 70% of rows are used for training and the rest for testing
num = int(sales_data.shape[0]*0.7)
X,y=sales_data.iloc[:,:-1],sales_data.iloc[:,-1]
X_train,X_test = X.iloc[0:num,:],X.iloc[num:,:]
y_train,y_test = y.iloc[0:num],y.iloc[num:]

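# Model training
model_gbr = GradientBoostingRegressor()
# Loss names below are for scikit-learn >= 1.0; older releases called the first two 'ls' and 'lad'
parameters = {"loss":['squared_error','absolute_error','huber','quantile'],'n_estimators':[10,50,100],'learning_rate':[0.05,0.1,0.15],
              'max_depth':[2,3,4,5],'min_samples_split':[2,3,5],'min_samples_leaf':[1,2,4]}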
model_gs = GridSearchCV(estimator=model_gbr,param_grid=parameters,cv=3,n_jobs=-1)
model_gs.fit(X_train,y_train)
print("Best score is",model_gs.best_score_)
print("Best parameter is",model_gs.best_params_)

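GridSearchCV tunes parameters automatically: give it the candidate values and it returns the best score and the parameters that produced it. The grid above already has 4 x 3 x 3 x 4 x 3 x 3 = 1,296 combinations, each fitted 3 times under cv=3, so this approach only suits small datasets; once the data volume grows, the search becomes too slow to finish in practice. For larger datasets a faster option is coordinate descent, which is really a greedy algorithm: tune the parameter that currently affects the model most until it is optimal, then tune the next most influential parameter, and so on until every parameter has been tuned. Its drawback is that it may settle on a local optimum rather than the global one.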
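The coordinate-descent idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `coordinate_descent_search` is a hypothetical helper (not a scikit-learn API), the data is synthetic from `make_regression` rather than the book's dataset, and the order of keys in `param_grid` stands in for "most influential parameter first", which in practice you would have to judge yourself.

```python
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score


def coordinate_descent_search(estimator, param_grid, X, y, cv=3, max_rounds=2):
    """Greedy search that tunes one parameter at a time (hypothetical helper)."""

    def cv_score(params):
        # Mean cross-validated score; for regressors the default score is R^2.
        model = clone(estimator).set_params(**params)
        return cross_val_score(model, X, y, cv=cv).mean()

    # Start from the first candidate value of every parameter.
    best_params = {name: values[0] for name, values in param_grid.items()}
    best_score = cv_score(best_params)

    for _ in range(max_rounds):
        improved = False
        # Dict order is assumed to be the order of influence.
        for name, values in param_grid.items():
            for value in values:
                if value == best_params[name]:
                    continue
                # Change only this one parameter, keep the others fixed.
                trial = {**best_params, name: value}
                score = cv_score(trial)
                if score > best_score:
                    best_params, best_score, improved = trial, score, True
        if not improved:
            break  # a full pass changed nothing, so stop early
    return best_params, best_score


# Synthetic data, used only to make the sketch runnable.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
param_grid = {
    "n_estimators": [10, 50, 100],
    "learning_rate": [0.05, 0.1, 0.15],
    "max_depth": [2, 3, 4],
}
best_params, best_score = coordinate_descent_search(
    GradientBoostingRegressor(random_state=0), param_grid, X, y
)
print(best_params, round(best_score, 3))
```

With three values per parameter, each pass costs at most 6 cross-validated evaluations instead of the 27 a full grid would need, at the price of possibly missing the best combination.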

model_best = model_gs.best_estimator_  # best model found by the cross-validated search
# Evaluate the model on the test set
pre_test = model_best.predict(X_test)
mse_score = mse(y_test, pre_test)  # argument order is (y_true, y_pred)
print("Test MSE:", mse_score)
# Plot predicted values against actual values
plt.style.use("ggplot")
plt.figure(figsize=(10,7))
plt.plot(np.arange(X_test.shape[0]),y_test,linestyle="-",color="k",label="true y")
plt.plot(np.arange(X_test.shape[0]),pre_test,linestyle=":",color="m",label="pre y")
plt.title("best model with mse of {}".format(int(mse_score)))
plt.legend(loc=0)
plt.show()

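Other thoughts:

1. The independent variables were not standardized because the algorithm is an ensemble of CART trees: each split depends only on the ordering of a feature's values, so the scale of the features does not affect the result.

2. No feature selection was done to address collinearity; tree-based models of this kind already sidestep that problem to some extent.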
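Point 1 can be checked with a small experiment. The sketch below is an assumption-laden illustration, not part of the book's code: it uses made-up integer-valued features, multiplies each column by a different positive constant, and compares the predictions of two otherwise identical models.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic integer-valued features (made-up data, not the book's dataset).
X = rng.integers(0, 50, size=(300, 3)).astype(float)
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(0.0, 1.0, size=300)

# Put the three columns on very different scales.
scale = np.array([1.0, 1000.0, 0.5])
X_rescaled = X * scale

# Same sequential train/test split for both versions of the features.
X_train, X_test = X[:200], X[200:]
Xr_train, Xr_test = X_rescaled[:200], X_rescaled[200:]
y_train = y[:200]

# Identical settings and seed; only the feature scale differs.
model_raw = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
model_rescaled = GradientBoostingRegressor(random_state=0).fit(Xr_train, y_train)

pred_raw = model_raw.predict(X_test)
pred_rescaled = model_rescaled.predict(Xr_test)
print("max absolute difference:", np.max(np.abs(pred_raw - pred_rescaled)))
```

Because rescaling by a positive constant keeps the order of the values, both models choose the same partitions of the training rows and should give matching predictions. A distance-based or regularized linear model would not behave this way.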
