Kaggle Case Studies in R: Learning Notes (5)

Drug Store Sales Prediction

Outline of this case:

1. Introduction to xgboost theory

2. Parameters of the xgboost functions in R

3. Case background

4. Data preprocessing

5. R code implementing the xgb model

1. Introduction to xgboost theory

For this part I'll simply point to some excellent existing write-ups on xgb theory. The blog links below cover both the underlying principles and the function parameters:

http://blog.csdn.net/a819825294/article/details/51206410

http://blog.csdn.net/sb19931201/article/details/52557382

http://blog.csdn.net/sb19931201/article/details/52577592

2. Parameters of the xgboost functions in R

The parameters of the R XGBOOST package fall into three groups: general parameters, model (booster) parameters, and task parameters. The general parameters select which class of model to use, tree-based or linear; the model parameters depend on the model type chosen via the general parameters; the task parameters depend on the learning scenario.

General parameters:

booster [default=gbtree]
Selects the base learner.
silent [default=0]
Set to 1 to suppress running messages; best left at 0.
nthread [default to maximum number of threads available if not set]
Number of threads.
num_pbuffer [set automatically by xgboost, no need to be set by user]
Size of the prediction buffer.
num_feature [set automatically by xgboost, no need to be set by user]
Feature dimension.
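
To see where these general parameters are passed, here is a minimal sketch (not from the original post) using the agaricus demo data that ships with the xgboost package; the values are illustrative only:

library(xgboost)
data(agaricus.train, package = "xgboost")  # demo data bundled with the package
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               params = list(booster = "gbtree",  # general parameter: base learner
                             nthread = 2),        # general parameter: thread count
               objective = "binary:logistic",     # a task parameter (covered below)
               nrounds = 5, verbose = 0)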

Model parameters:

(1) Tree booster parameters

eta [default=0.3]
Learning rate; usually set small.
range: [0,1]
gamma [default=0]
Controls post-pruning: the larger the value, the more conservative the algorithm.
range: [0,∞]
max_depth [default=6]
Maximum tree depth.
range: [1,∞]
min_child_weight [default=1]
Defaults to 1; the minimum sum of the hessian h over the instances in a leaf. For a 0-1 classification task with unbalanced classes, if h is around 0.01, a min_child_weight of 1 means a leaf must contain at least 100 samples (1 / 0.01 = 100). This parameter strongly affects the result: it bounds the minimum sum of second derivatives in a leaf, and the smaller it is, the easier it is to overfit.
range: [0,∞]
max_delta_step [default=0]
Acts on the update step: 0 means no constraint, while a positive value makes the update more conservative, preventing overly large steps and smoothing the updates.
range: [0,∞]
subsample [default=1]
Row subsampling: lower values make the algorithm more conservative and guard against overfitting, but values that are too small cause underfitting.
range: (0,1]
colsample_bytree [default=1]
Column subsampling of the features used to grow each tree; generally set to 0.5-1.
range: (0,1]
lambda [default=1]
L2 regularization on weights.
alpha [default=0]
L1 regularization on weights.
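
Putting these together, a hypothetical (untuned) tree-booster parameter list of the kind later passed to xgb.train might look like this; every value here is a placeholder, not a recommendation:

param_tree <- list(booster = "gbtree",
                   eta = 0.1,              # learning rate, kept small
                   gamma = 0,              # minimum loss reduction to allow a split
                   max_depth = 6,          # maximum tree depth
                   min_child_weight = 1,   # minimum hessian sum in a leaf
                   max_delta_step = 0,     # 0 = no constraint on the update step
                   subsample = 0.8,        # row sampling ratio
                   colsample_bytree = 0.8, # column sampling ratio per tree
                   lambda = 1,             # L2 regularization on weights
                   alpha = 0)              # L1 regularization on weights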

(2) Linear booster parameters

lambda [default=0]
L2 regularization on weights.
alpha [default=0]
L1 regularization on weights.
lambda_bias [default=0]
L2 regularization on the bias (there is no L1 term on the bias because it is not important).
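
For comparison, a minimal sketch of the linear-booster case (booster = "gblinear"), where only the regularization terms above apply:

param_linear <- list(booster = "gblinear",
                     lambda = 0,       # L2 regularization on weights
                     alpha = 0,        # L1 regularization on weights
                     lambda_bias = 0)  # L2 regularization on the bias term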

Task parameters:

objective [default=reg:linear]
Defines the loss function to minimize; common values are:
"reg:linear" – linear regression
"reg:logistic" – logistic regression
"binary:logistic" – logistic regression for binary classification, outputs a probability
"binary:logitraw" – logistic regression for binary classification, outputs the score before the logistic transformation
"count:poisson" – Poisson regression for count data, outputs the mean of the Poisson distribution (max_delta_step is set to 0.7 by default in Poisson regression, to safeguard optimization)
"multi:softmax" – set XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (the number of classes)
"multi:softprob" – same as softmax, but outputs a vector of ndata * nclass, which can be reshaped into an ndata x nclass matrix containing the predicted probability of each data point belonging to each class
"rank:pairwise" – set XGBoost to do a ranking task by minimizing the pairwise loss
base_score [default=0.5]
The initial prediction score of all instances (global bias).
eval_metric [default according to objective]
Evaluation metric for the validation data; a default is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking). Multiple evaluation metrics can be added; Python users should pass the metrics in as a list of parameter pairs instead of a map, so that a later 'eval_metric' does not override an earlier one.
The available choices are listed below:
"rmse": root mean square error
"logloss": negative log-likelihood
"error": binary classification error rate, calculated as #(wrong cases)/#(all cases); instances with a prediction value larger than 0.5 are regarded as positive, the rest as negative
"merror": multiclass classification error rate, calculated as #(wrong cases)/#(all cases)
"mlogloss": multiclass logloss
"auc": area under the curve, for ranking evaluation
"ndcg": normalized discounted cumulative gain
"map": mean average precision
"ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation
"ndcg-", "map-", "ndcg@n-", "map@n-": NDCG and MAP evaluate the score of a list without any positive samples as 1; adding "-" to the metric name makes XGBoost evaluate such a score as 0 instead, to be consistent under some conditions
seed [default=0]
Random number seed.
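
As an illustration of the task-parameter side, here is a sketch for a hypothetical 3-class problem; note that num_class must accompany the multi:* objectives, and that in the R package randomness is usually fixed with set.seed() rather than the seed parameter:

param_task <- list(objective = "multi:softmax",  # multiclass classification via softmax
                   num_class = 3,                # number of classes, required here
                   eval_metric = "mlogloss",     # evaluation metric for validation data
                   base_score = 0.5)             # global bias / initial prediction
set.seed(0)  # fix the random seed on the R side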

These parameters are almost identical to the Python xgboost parameters given by the blogger behind the second link in the theory section; in other words, the xgboost parameters in R are the same as in Python.

3. Case background

Rossmann operates over 3,000 drug stores in 7 European countries. Rossmann store managers are currently tasked with predicting their daily sales up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of the results can vary widely.

The fields in the data provided by Kaggle are as follows:

Id: an Id that represents a (Store, Date) duple within the test set

Store: a unique Id for each store

Sales: the turnover for any given day (this is what you are predicting)

Customers: the number of customers on a given day

Open: an indicator for whether the store was open: 0 = closed, 1 = open

StateHoliday: indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None

SchoolHoliday: indicates if the (Store, Date) was affected by the closure of public schools

StoreType: differentiates between 4 different store models: a, b, c, d

Assortment: describes an assortment level: a = basic, b = extra, c = extended

CompetitionDistance: distance in meters to the nearest competitor store

CompetitionOpenSince[Month/Year]: gives the approximate year and month of the time the nearest competitor was opened

Promo: indicates whether a store is running a promo on that day

Promo2: Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

Promo2Since[Year/Week]: describes the year and calendar week when the store started participating in Promo2

PromoInterval: describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

4. Data preprocessing

Since this case focuses on the xgboost model itself, little data preprocessing and feature engineering is done. Only two things are handled. First, the Kaggle site ships the stores' attribute data in a separate file from the training and test sets, so the store data set has to be joined to the train and test sets, adding the store attributes as columns. Second, the data has to be converted into the format xgb requires; in the R xgboost package, xgb.DMatrix performs this conversion.

5. Code implementation

Data download address: https://www.kaggle.com/c/rossmann-store-sales/data

library(readr)
library(xgboost)
library(lubridate)
train<-read.csv('D:/R語言kaggle案例實戰/Kaggle第五節課/train.csv',stringsAsFactors=FALSE)#read strings as character so the encoding loop below can detect them
test<-read.csv('D:/R語言kaggle案例實戰/Kaggle第五節課/test.csv',stringsAsFactors=FALSE)
store<-read.csv('D:/R語言kaggle案例實戰/Kaggle第五節課/store.csv',stringsAsFactors=FALSE)#store supplements the stores' attributes
train<-merge(train,store)#join the store attributes onto the training set (merges on the common Store column)
test<-merge(test,store)#same join for the test set
train$Date<-as.POSIXct(train$Date)#convert the date strings to a date-time format
test$Date<-as.POSIXct(test$Date)#convert the date strings to a date-time format
train[is.na(train)]<-0#replace missing values with zero
test[is.na(test)]<-0
train<-train[which(train$Open=='1'),]#keep only samples where the store was open...
train<-train[which(train$Sales!='0'),]#...and the sales are nonzero
train$month<-month(train$Date)#extract the month
train$year<-year(train$Date)#extract the year
train$day<-day(train$Date)#extract the day
train<-train[,-c(3,8)]#drop the date column and a column with many missing values
test<-test[,-c(4,7)]#drop the date column and a column with many missing values
feature.names<-names(train)[c(1,2,5:19)]#this step mainly keeps the test and training set structures consistent
for(f in feature.names){
  if(class(train[[f]])=="character"){
    levels<-unique(c(train[[f]],test[[f]]))
    train[[f]]<-as.integer(factor(train[[f]],levels = levels))
    test[[f]]<-as.integer(factor(test[[f]],levels = levels))
  }
}
tra<-train[,feature.names]
RMPSE<-function(preds,dtrain){ #custom evaluation function: the competition's official metric (root mean square percentage error), plugged into xgboost via feval
  labels<-getinfo(dtrain,"label")
  elab<-exp(as.numeric(labels))-1#the labels are log(Sales+1), so exp()-1 recovers the sales
  epreds<-exp(as.numeric(preds))-1
  err<-sqrt(mean((epreds/elab-1)^2))
  return(list(metric="RMPSE",value=err))
}
h<-sample(nrow(train),10000)#randomly sample 10,000 row indices for a hold-out validation set
dval<-xgb.DMatrix(data=data.matrix(tra[h,]),label=log(train$Sales+1)[h])#validation set, used to build the watchlist below
dtrain<-xgb.DMatrix(data=data.matrix(tra[-h,]),label=log(train$Sales+1)[-h])#build xgb's dedicated matrix format
watchlist<-list(val=dval,train=dtrain)#the watchlist monitors model performance on both sets at every round of training
param<-list(objective="reg:linear",
            booster="gbtree",
            eta=0.02,
            max_depth=12,
            subsample=0.9,
            colsample_bytree=0.7,
            num_parallel_tree=2,
            alpha=0.0001,
            lambda=1)
clf<-xgb.train(  params=param,
                 data=dtrain,
                 nrounds = 3000,
                 verbose = 0,
                 early.stop.round=100,
                 watchlist = watchlist,
                 maximize = FALSE,
                 feval = RMPSE
  
)
ptest<-exp(predict(clf,data.matrix(test[,feature.names])))-1#predict needs a matrix of the model's features; exp()-1 undoes the log(Sales+1) transform
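
As a follow-up sketch (not part of the original post), the predictions can then be written out in the competition's submission format; the Id and Sales column names follow Kaggle's sample_submission.csv for this competition:

submission <- data.frame(Id = test$Id, Sales = ptest)
write_csv(submission, "submission.csv")  # write_csv comes from the readr package loaded above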
