Kaggle Case Studies in R: Learning Notes (5)

Drug Store Sales Prediction

Outline of this case:

1. Introduction to xgboost theory

2. Parameters of the xgboost functions in R

3. Case background

4. Data preprocessing

5. R code implementing the xgb model

1. Introduction to xgboost theory

For this part I'll simply point to some excellent existing write-ups on xgb theory. The blog links below cover both the underlying principles and the function parameters:

http://blog.csdn.net/a819825294/article/details/51206410

http://blog.csdn.net/sb19931201/article/details/52557382

http://blog.csdn.net/sb19931201/article/details/52577592

2. Parameters of the xgboost functions in R

The parameters of the R XGBOOST package fall into three groups: general parameters, model (booster) parameters, and task parameters. The general parameters select which class of model to use, tree-based or linear; the model parameters depend on the model type chosen via the general parameters; the task parameters depend on the learning scenario.

General parameters:

booster [default=gbtree]
Selects the base learner.
silent [default=0]
Set to 1 to suppress running messages; best left at 0.
nthread [default to maximum number of threads available if not set]
Number of threads.
num_pbuffer [set automatically by xgboost, no need to be set by user]
Size of the prediction buffer.
num_feature [set automatically by xgboost, no need to be set by user]
Feature dimension.
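
To see where these general parameters are passed, here is a minimal sketch (not from the original post) using the agaricus demo data that ships with the xgboost package; the values are illustrative only:

library(xgboost)
data(agaricus.train, package = "xgboost")  # demo data bundled with the package
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               params = list(booster = "gbtree",  # general parameter: base learner
                             nthread = 2),        # general parameter: thread count
               objective = "binary:logistic",     # a task parameter (covered below)
               nrounds = 5, verbose = 0)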

Model parameters:

(1) Tree booster parameters

eta [default=0.3]
Learning rate; usually set small.
range: [0,1]
gamma [default=0]
Controls post-pruning: the larger the value, the more conservative the algorithm.
range: [0,∞]
max_depth [default=6]
Maximum tree depth.
range: [1,∞]
min_child_weight [default=1]
Defaults to 1; the minimum sum of the hessian h over the instances in a leaf. For a 0-1 classification task with unbalanced classes, if h is around 0.01, a min_child_weight of 1 means a leaf must contain at least 100 samples (1 / 0.01 = 100). This parameter strongly affects the result: it bounds the minimum sum of second derivatives in a leaf, and the smaller it is, the easier it is to overfit.
range: [0,∞]
max_delta_step [default=0]
Acts on the update step: 0 means no constraint, while a positive value makes the update more conservative, preventing overly large steps and smoothing the updates.
range: [0,∞]
subsample [default=1]
Row subsampling: lower values make the algorithm more conservative and guard against overfitting, but values that are too small cause underfitting.
range: (0,1]
colsample_bytree [default=1]
Column subsampling of the features used to grow each tree; generally set to 0.5-1.
range: (0,1]
lambda [default=1]
L2 regularization on weights.
alpha [default=0]
L1 regularization on weights.
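
Putting these together, a hypothetical (untuned) tree-booster parameter list of the kind later passed to xgb.train might look like this; every value here is a placeholder, not a recommendation:

param_tree <- list(booster = "gbtree",
                   eta = 0.1,              # learning rate, kept small
                   gamma = 0,              # minimum loss reduction to allow a split
                   max_depth = 6,          # maximum tree depth
                   min_child_weight = 1,   # minimum hessian sum in a leaf
                   max_delta_step = 0,     # 0 = no constraint on the update step
                   subsample = 0.8,        # row sampling ratio
                   colsample_bytree = 0.8, # column sampling ratio per tree
                   lambda = 1,             # L2 regularization on weights
                   alpha = 0)              # L1 regularization on weights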

(2) Linear booster parameters

lambda [default=0]
L2 regularization on weights.
alpha [default=0]
L1 regularization on weights.
lambda_bias [default=0]
L2 regularization on the bias (there is no L1 term on the bias because it is not important).
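
For comparison, a minimal sketch of the linear-booster case (booster = "gblinear"), where only the regularization terms above apply:

param_linear <- list(booster = "gblinear",
                     lambda = 0,       # L2 regularization on weights
                     alpha = 0,        # L1 regularization on weights
                     lambda_bias = 0)  # L2 regularization on the bias term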

Task parameters:

objective [default=reg:linear]
Defines the loss function to minimize; common values are:
"reg:linear" – linear regression
"reg:logistic" – logistic regression
"binary:logistic" – logistic regression for binary classification, outputs a probability
"binary:logitraw" – logistic regression for binary classification, outputs the score before the logistic transformation
"count:poisson" – Poisson regression for count data, outputs the mean of the Poisson distribution (max_delta_step is set to 0.7 by default in Poisson regression, to safeguard optimization)
"multi:softmax" – set XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (the number of classes)
"multi:softprob" – same as softmax, but outputs a vector of ndata * nclass, which can be reshaped into an ndata x nclass matrix containing the predicted probability of each data point belonging to each class
"rank:pairwise" – set XGBoost to do a ranking task by minimizing the pairwise loss
base_score [default=0.5]
The initial prediction score of all instances (global bias).
eval_metric [default according to objective]
Evaluation metric for the validation data; a default is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking). Multiple evaluation metrics can be added; Python users should pass the metrics in as a list of parameter pairs instead of a map, so that a later 'eval_metric' does not override an earlier one.
The available choices are listed below:
"rmse": root mean square error
"logloss": negative log-likelihood
"error": binary classification error rate, calculated as #(wrong cases)/#(all cases); instances with a prediction value larger than 0.5 are regarded as positive, the rest as negative
"merror": multiclass classification error rate, calculated as #(wrong cases)/#(all cases)
"mlogloss": multiclass logloss
"auc": area under the curve, for ranking evaluation
"ndcg": normalized discounted cumulative gain
"map": mean average precision
"ndcg@n", "map@n": n can be assigned as an integer to cut off the top positions in the lists for evaluation
"ndcg-", "map-", "ndcg@n-", "map@n-": NDCG and MAP evaluate the score of a list without any positive samples as 1; adding "-" to the metric name makes XGBoost evaluate such a score as 0 instead, to be consistent under some conditions
seed [default=0]
Random number seed.
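
As an illustration of the task-parameter side, here is a sketch for a hypothetical 3-class problem; note that num_class must accompany the multi:* objectives, and that in the R package randomness is usually fixed with set.seed() rather than the seed parameter:

param_task <- list(objective = "multi:softmax",  # multiclass classification via softmax
                   num_class = 3,                # number of classes, required here
                   eval_metric = "mlogloss",     # evaluation metric for validation data
                   base_score = 0.5)             # global bias / initial prediction
set.seed(0)  # fix the random seed on the R side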

These parameters are almost identical to the Python xgboost parameters given by the blogger behind the second link in the theory section; in other words, the xgboost parameters in R are the same as in Python.

3. Case background

Rossmann operates over 3,000 drug stores in 7 European countries. Rossmann store managers are currently tasked with predicting their daily sales up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of the results can vary widely.

The fields in the data provided by Kaggle are as follows:

Id: an Id that represents a (Store, Date) duple within the test set

Store: a unique Id for each store

Sales: the turnover for any given day (this is what you are predicting)

Customers: the number of customers on a given day

Open: an indicator for whether the store was open: 0 = closed, 1 = open

StateHoliday: indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None

SchoolHoliday: indicates if the (Store, Date) was affected by the closure of public schools

StoreType: differentiates between 4 different store models: a, b, c, d

Assortment: describes an assortment level: a = basic, b = extra, c = extended

CompetitionDistance: distance in meters to the nearest competitor store

CompetitionOpenSince[Month/Year]: gives the approximate year and month of the time the nearest competitor was opened

Promo: indicates whether a store is running a promo on that day

Promo2: Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

Promo2Since[Year/Week]: describes the year and calendar week when the store started participating in Promo2

PromoInterval: describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

4. Data preprocessing

Since this case focuses on the xgboost model itself, little data preprocessing and feature engineering is done. Only two things are handled. First, the Kaggle site ships the stores' attribute data in a separate file from the training and test sets, so the store data set has to be joined to the train and test sets, adding the store attributes as columns. Second, the data has to be converted into the format xgb requires; in the R xgboost package, xgb.DMatrix performs this conversion.

5. Code implementation

Data download address: https://www.kaggle.com/c/rossmann-store-sales/data

library(readr)
library(xgboost)
library(lubridate)
train<-read.csv('D:/R語言kaggle案例實戰/Kaggle第五節課/train.csv',stringsAsFactors=FALSE)#read strings as character so the encoding loop below can detect them
test<-read.csv('D:/R語言kaggle案例實戰/Kaggle第五節課/test.csv',stringsAsFactors=FALSE)
store<-read.csv('D:/R語言kaggle案例實戰/Kaggle第五節課/store.csv',stringsAsFactors=FALSE)#store supplements the stores' attributes
train<-merge(train,store)#join the store attributes onto the training set (merges on the common Store column)
test<-merge(test,store)#same join for the test set
train$Date<-as.POSIXct(train$Date)#convert the date strings to a date-time format
test$Date<-as.POSIXct(test$Date)#convert the date strings to a date-time format
train[is.na(train)]<-0#replace missing values with zero
test[is.na(test)]<-0
train<-train[which(train$Open=='1'),]#keep only samples where the store was open...
train<-train[which(train$Sales!='0'),]#...and the sales are nonzero
train$month<-month(train$Date)#extract the month
train$year<-year(train$Date)#extract the year
train$day<-day(train$Date)#extract the day
train<-train[,-c(3,8)]#drop the date column and a column with many missing values
test<-test[,-c(4,7)]#drop the date column and a column with many missing values
feature.names<-names(train)[c(1,2,5:19)]#this step mainly keeps the test and training set structures consistent
for(f in feature.names){
  if(class(train[[f]])=="character"){
    levels<-unique(c(train[[f]],test[[f]]))
    train[[f]]<-as.integer(factor(train[[f]],levels = levels))
    test[[f]]<-as.integer(factor(test[[f]],levels = levels))
  }
}
tra<-train[,feature.names]
RMPSE<-function(preds,dtrain){ #custom evaluation function: the competition's official metric (root mean square percentage error), plugged into xgboost via feval
  labels<-getinfo(dtrain,"label")
  elab<-exp(as.numeric(labels))-1#the labels are log(Sales+1), so exp()-1 recovers the sales
  epreds<-exp(as.numeric(preds))-1
  err<-sqrt(mean((epreds/elab-1)^2))
  return(list(metric="RMPSE",value=err))
}
h<-sample(nrow(train),10000)#randomly sample 10,000 row indices for a hold-out validation set
dval<-xgb.DMatrix(data=data.matrix(tra[h,]),label=log(train$Sales+1)[h])#validation set, used to build the watchlist below
dtrain<-xgb.DMatrix(data=data.matrix(tra[-h,]),label=log(train$Sales+1)[-h])#build xgb's dedicated matrix format
watchlist<-list(val=dval,train=dtrain)#the watchlist monitors model performance on both sets at every round of training
param<-list(objective="reg:linear",
            booster="gbtree",
            eta=0.02,
            max_depth=12,
            subsample=0.9,
            colsample_bytree=0.7,
            num_parallel_tree=2,
            alpha=0.0001,
            lambda=1)
clf<-xgb.train(  params=param,
                 data=dtrain,
                 nrounds = 3000,
                 verbose = 0,
                 early.stop.round=100,
                 watchlist = watchlist,
                 maximize = FALSE,
                 feval = RMPSE
  
)
ptest<-exp(predict(clf,data.matrix(test[,feature.names])))-1#predict needs a matrix of the model's features; exp()-1 undoes the log(Sales+1) transform
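
As a follow-up sketch (not part of the original post), the predictions can then be written out in the competition's submission format; the Id and Sales column names follow Kaggle's sample_submission.csv for this competition:

submission <- data.frame(Id = test$Id, Sales = ptest)
write_csv(submission, "submission.csv")  # write_csv comes from the readr package loaded above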
