R語言筆記之線性迴歸及其衍生

普通線性迴歸

1.最小二乘線性模型

> dat=read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
> dat=subset(dat,store_exp >0&online_exp >0)
> modeldat=dat[,grep("Q",names(dat))]
> modeldat$total_exp=dat$store_exp+dat$online_exp

下面展示輸出結果，看哈數據是否有缺失值或離羣點

> par(mfrow=c(1,2))
> hist(modeldat$total_exp,main="",xlab="total_exp")
> boxplot(modeldat$total_exp)
>

如上，數據集modeldat中沒有缺失值，但是明顯有離羣點，而且因變量total_exp分佈明顯偏離正太。
我們需要刪除離羣點，然後對因變量進行對數變換
我們用Z分值的方法查找並刪除離羣點。

> y=modeldat$total_exp
 #求z分值
> zs=(y-mean(y))/mad(y)
 #找到z分值大於3.5的離羣點，刪除這些觀測
> modeldat=modeldat[-which(zs>3.5),]
>

接下來檢查變量的共線性

> library(corrplot)
corrplot 0.84 loaded
Warning message:
程輯包‘corrplot’是用R版本3.4.3 來建造的 
> correlation=cor(modeldat[,grep("Q",names(modeldat))])
> corrplot.mixed(correlation,order="hclust",tl.pos="lt",upper="ellipse")

由上圖可以看出，變量之間有很強的相關性。
我們需要刪除高度相關變量的算法，設置閾值爲0.75

> library(caret)
載入需要的程輯包：lattice
載入需要的程輯包：ggplot2
Warning messages:
1: 程輯包‘caret’是用R版本3.4.3 來建造的 
2: 程輯包‘lattice’是用R版本3.4.3 來建造的 
3: 程輯包‘ggplot2’是用R版本3.4.3 來建造的 
> highcor=findCorrelation(correlation,cutoff=.75)
> modeldat=modeldat[,-highcor]

現在我們可以擬合線性模型。“.“表示數據集modeldat中除了因變量外所有的變量都被當做自變量，這裏我們沒有考慮交互效應。
且我們對原始變量進行了對數變換

> limfit=lm(log(total_exp)~.,data=modeldat)
> summary(lmfit)
Error in summary(lmfit) : object 'lmfit' not found
> summary(limfit)

Call:
lm(formula = log(total_exp) ~ ., data = modeldat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.17494 -0.13719  0.01284  0.14163  0.56227 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.098314   0.054286 149.177  < 2e-16 ***
Q1          -0.145340   0.008823 -16.474  < 2e-16 ***
Q2           0.102275   0.019492   5.247 1.98e-07 ***
Q3           0.254450   0.018348  13.868  < 2e-16 ***
Q6          -0.227684   0.011520 -19.764  < 2e-16 ***
Q8          -0.090706   0.016497  -5.498 5.15e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2262 on 805 degrees of freedom
Multiple R-squared:  0.8542,    Adjusted R-squared:  0.8533 
F-statistic: 943.4 on 5 and 805 DF,  p-value: < 2.2e-16

>

我們也可以得到置信區間

> confint(limfit,level=0.9)
                    5 %        95 %
(Intercept)  8.00891811  8.18771037
Q1          -0.15986889 -0.13081186
Q2           0.07017625  0.13437445
Q3           0.22423576  0.28466333
Q6          -0.24665434 -0.20871330
Q8          -0.11787330 -0.06353905
>

2.迴歸診斷
根據高斯-馬爾科夫定理，OLS（最小二乘估計）在下面條件滿足時是BLUE（最優線性無偏估計）：
（1）自變量和隨機誤差不相關
（2）隨機誤差均值爲0
（3）隨機誤差方差一致且相互獨立

4種圖形診斷
a.殘差圖（Residuals vs Fittled)
我們要檢查殘差圖如下幾個方面：
（1）殘差是否分佈在0附件
（2）殘差分佈是否隨機，如果呈現出某種特定分佈模式的話（如，隨着橫座標的增大而增大或減小），說明當前模型關係的假設不充分。
（3）殘差是否存在異方差性，比如隨着擬合值增大殘差分佈方差增加，這就說明殘差分佈有異方差性。
當存在異方差時，參數估計值雖然是無偏的，但不是最小方差線性無偏估計。
由於參數的顯著性檢驗是基於殘差分佈假設的，所以在該假設不成立的情況下該檢驗也將失效。如果你用該回歸方程來預測新樣本，效果可能不理想。

b.q-q圖(Norm Q-Q)
它是一種正太分佈檢驗。
對於標準正太分佈，qq圖上的點分佈在y=x直線上，點偏離直線越遠說明樣本偏離正太分佈越遠。
c.標準化殘差方根散點圖(Scale-Location)
和殘差圖類似，橫座標依舊是樣本擬合值，縱座標爲標準化殘差的絕對值開方

d.Cook距離圖（Cook’s distance)
該圖用於判斷觀測值是否有異常點。
一般情況下，當D<0.5時認爲不是異常值點；當D>0.5時認爲是異常值點。

> par(mfrow=c(2,2))
> plot(lmfit,which=1)
Error in plot(lmfit, which = 1) : object 'lmfit' not found
> plot(limfit,which=1)
> plot(limfit,which=2)
> plot(limfit,which=3)
> plot(limfit,which=4)

如上圖
殘差圖：數據點都基本均勻分佈在直線y=0兩側，無明顯趨勢，滿足線性假設。
標準Q-Q圖：圖上的點基本度在y=x直線附件，可認爲殘差近似服從正太分佈
標準化殘差方根散點圖：若滿足不變方差假設，則在該圖中水平線周圍的點應該隨機分佈，最高點爲殘差最大值點。該圖顯示基本符合方差齊性的要求。
Cook距離圖：最大的Cook距離爲0.05左右，可以認爲沒有異常點。
3.離羣點，高槓杆點和強影響點

離羣點：
Cook距離圖，Z分值都可以用來檢驗線性模型中的離羣點。
但是，Z分值僅僅是針對因變量觀測而言，和使用模型無關，其並未考慮模型的擬合情況。
car包中的outlierTest()函數可以用於對擬合模型對象檢驗是否存在離羣點。
注意：這裏的離羣點指的是那些模型預測效果不佳的觀測點。通常有很大的或正或負的殘差，正殘差說明模型低估了響應值，負殘差說明高估了響應值。這裏使用的是Bonferroni離羣點檢驗，該檢驗也可作用於廣義線性模型。
對於一般線性模型使用的是t檢驗，對於廣義線性模型使用的是正太檢驗。

> library(car)
Warning message:
程輯包‘car’是用R版本3.4.3 來建造的 
> outlierTest(limfit)
     rstudent unadjusted p-value Bonferonni p
960 -5.295504          1.533e-07   0.00012432
>

outlierTest()函數根據單個最大（正或負）殘差值的顯著性來判斷是否有離羣點，若不顯著，則說明數據集中沒有離羣點，若顯著，則檢驗刪除該離羣點，然後再檢驗是否還有其他離羣點存在。
如上，這裏需要刪除第960個被認爲是離羣點的觀測。

> outlierTest(limfit)
     rstudent unadjusted p-value Bonferonni p
960 -5.295504          1.533e-07   0.00012432
#找到相應的觀測
> idex=which(row.names(modeldat)=="960")
#刪除離羣觀測
> modeldat=modeldat[-idex,]
>

接下來我們再擬合一次模型，然後檢驗看看是否還有離羣點。

> limfit=lm(log(total_exp)~.,data=modeldat)
> outlierTest(limfit)

No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
     rstudent unadjusted p-value Bonferonni p
155 -3.818112         0.00014483      0.11731
>

如上，可以看出，現在沒有檢驗出顯著離羣點。
高槓杆值點：
高槓杆值點是與其他預測變量又關的離羣點，即它們是由許多異常的預測變量組合起來的，與響應變量值沒有關係。
高槓杆值的觀測點可通過帽子矩陣的值（hat statistic)來判斷。
對於一個給定的數據集，帽子均值爲p/n,p是模型估計的參數數目（包含截距項），n是樣本量。
一般來說，若觀測點的帽子值大於帽子均值的2或3倍，則可認定爲高槓杆值點。

收縮方法

收縮方法屬於內嵌法。
我們可以對模型參數進行限制或者規範化來達到變量選擇的效果，這些方法能將一些參數估計朝着0收縮。
1.嶺迴歸
R中可以進行嶺迴歸的函數
MASS包中的Im.ridge()函數
elasticnet包中的enet()函數
若需要對參數進行優化，最方便的是caret包中的train函數

> dat=read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
> dat=subset(dat,store_exp>0&online_exp>0)
> trainx=dat[,grep("Q",names(dat))]
> trainy=dat$store_exp+dat$online_exp

先用train函數對參數進行調優
首先設置交互校驗和參數調優範圍，這裏我們使用10層交互校驗。
爲了保險起見，在用此類方法前應進行標準化

> ctr1=trainControl(method="cv",number=10)
> ridgeGrid=data.frame(.lambda=seq(0,.1,length=20))
> set.seed(100)
> ridgeRegTune=train(trainx,trainy,method="ridge",

#用不同罰函數值來擬合模型
tuneGrid=ridgeGrid,trControl=ctr1,
#中心化和標度化變量
preProc=c("center","scale"))
> ridgeRegTune
Ridge Regression 

999 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 899, 899, 899, 899, 899, 899, ... 
Resampling results across tuning parameters:

  lambda       RMSE      Rsquared   MAE     
  0.000000000  1748.466  0.7942143  752.9785
  0.005263158  1748.281  0.7943720  753.6987
  0.010526316  1748.549  0.7944728  754.6125
  0.015789474  1749.154  0.7945374  755.7452
  0.021052632  1750.026  0.7945778  757.0960
  0.026315789  1751.121  0.7946015  758.6035
  0.031578947  1752.411  0.7946132  760.2972
  0.036842105  1753.875  0.7946162  762.1418
  0.042105263  1755.500  0.7946125  764.1251
  0.047368421  1757.276  0.7946037  766.1779
  0.052631579  1759.193  0.7945909  768.2752
  0.057894737  1761.246  0.7945750  770.5444
  0.063157895  1763.430  0.7945564  772.9384
  0.068421053  1765.739  0.7945357  775.3503
  0.073684211  1768.171  0.7945131  777.9338
  0.078947368  1770.721  0.7944890  780.5918
  0.084210526  1773.387  0.7944636  783.3492
  0.089473684  1776.166  0.7944370  786.1977
  0.094736842  1779.055  0.7944094  789.2161
  0.100000000  1782.051  0.7943808  792.3606

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was lambda = 0.005263158.
>

如上，訓練出的模型調優參數爲0.01，對應RMSE和R^2分別爲1748.281和80%
下面展示如何使用elasticnet包中的enet()函數

> library(elasticnet)
載入需要的程輯包：lars
Loaded lars 1.2

> ridgefit=enet(x=as.matrix(trainx),y=trainy,lambda=0.01,
 #設置將自變量標準化
 normalize=TRUE)
> ridgePred=predict(ridgefit,newx=as.matrix(trainx),s=1,mode="fraction",type="fit")
> names(ridgePred)
[1] "s"        "fraction" "mode"     "fit"     
> head(ridgePred$fit)
        1         2         3         4         5         6 
1290.4697  224.1595  591.4406 1220.6384  853.3572  908.2040

要得到參數擬合結果，需要在predict()函數中設定type=”coefficients”

> ridgeCoef=predict(ridgefit,newx=as.matrix(trainx),s=1,mode="fraction",type="coefficients")
> RidgeCoef=ridgeCoef$coefficients

嶺迴歸於原始最小二乘迴歸相比優勢在於偏差和方差之間的權衡。
一般線性迴歸中的最小二乘估計在無偏估計中是最優的，但通常估計方差會很大，這意味着訓練集數據的微小變化可能導致參數估計較大的變化。
嶺迴歸估計就是通過犧牲一點點“無偏性”，換取估計方差的減少。
它更適合在普通最小二乘迴歸參數估計方差很大的情況下使用。

2.Lasso(具體過程等會在新篇裏講解）
雖然嶺迴歸可以將參數估計值向0進行收縮，但對於任何調優參數值，它都不能將係數取值變爲嚴格的0.
儘管某些參數估計值變得非常小以至於可以忽略，但事實上嶺迴歸並沒有進行變量選擇。這可能對預測精確度來說不是問題，但對模型解釋提出了挑戰。
Lasso（最小絕對收縮與選擇算子模型）出現用來代替嶺迴歸模型。
Lasso和嶺迴歸很相似，唯一不同的在於罰函數。
下面展示在R中如何進行調優和擬合
先用train()函數對參數進行調優
首先設置交互校驗和參數調優範圍，這裏使用10層交互校檢，在用該方法前注意要進行標準化

> ctr1=trainControl(method="cv",number=10)
> lassoGrid=data.frame(fraction=seq(.8,1,length=20))
> set.seed(100)
> lassoTune=train(trainx,trainy,method="lars",tuneGrid=lassoGrid,trControl=ctr1,preProc=c("center","scale"))
> lassoTune
Least Angle Regression 

999 samples
 10 predictor

Pre-processing: centered (10), scaled (10) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 899, 899, 899, 899, 899, 900, ... 
Resampling results across tuning parameters:

  fraction   RMSE      Rsquared   MAE     
  0.8000000  1762.679  0.7921007  787.5439
  0.8105263  1760.384  0.7924426  784.0889
  0.8210526  1758.364  0.7927419  780.8319
  0.8315789  1756.401  0.7930346  777.7173
  0.8421053  1754.409  0.7933317  774.6333
  0.8526316  1752.524  0.7936236  771.8127
  0.8631579  1750.750  0.7939136  769.0883
  0.8736842  1749.101  0.7941909  766.5841
  0.8842105  1747.581  0.7944455  764.3044
  0.8947368  1746.207  0.7946761  762.1525
  0.9052632  1745.021  0.7948805  760.1038
  0.9157895  1744.029  0.7950563  758.3066
  0.9263158  1743.229  0.7952038  756.7197
  0.9368421  1742.674  0.7953067  755.4797
  0.9473684  1742.382  0.7953591  754.5233
  0.9578947  1742.284  0.7953832  753.9641
  0.9684211  1742.380  0.7953767  753.5812
  0.9789474  1742.667  0.7953442  753.4000
  0.9894737  1743.178  0.7952871  753.5050
  1.0000000  1743.953  0.7951834  754.0392

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was fraction = 0.9578947.
>

訓練出的模型調優參數爲0.95，對應的RMSE和R^2分別爲1742.284何80%，和之前的嶺迴歸幾乎相同。

> plot(lassoTune)

從上交互校驗的RMSE結果圖中可以看出，隨着調優參數的增加，RMSE有一個減少然後增加的過程

Lasso模型可以用許多不同的函數進行擬合
lars包中的lars()函數
elasticnet包的enet()函數
glmnet包中的glmnet()函數
下面我們使用enet()函數，其要求自變量爲一個矩陣對象，要講數據框trainx轉換成矩陣。
預測變量在建模之前需要中心化和標準化，函數中的normalize參數可以自動完成這一過程。
lambda參數控制了嶺迴歸的罰參數

> lassoModel=enet(x=as.matrix(trainx),y=trainy,lambda=0,normalize=TRUE)
> lassoFit=predict(lassoModel,newx=as.matrix(trainx),s=.95,mode="fraction",type="fit")
> head(lassoFit$fit)
        1         2         3         4         5         6 
1371.6160  308.6984  702.2026 1225.5508  832.0466 1028.9785

在predict()函數中設定type=”coefficients”
可以得到參數擬合結果

> lassoCoef=predict(lassoModel,newx=as.matrix(trainx),s=0.95,mode="fraction",type="coefficients")
> LassoCoef=lassoCoef$coefficients
> LassoCoef
       Q1        Q2        Q3        Q4        Q5        Q6        Q7        Q8        Q9       Q10 
-326.7759  720.2801  722.7518  180.7107  542.1603  603.0865    0.0000  -75.9610  342.6375 -281.7818

3.彈性網絡
它是Lasso的一般化版本，該模型結合了兩種罰函數
Lasso對應的估計方差較大，而嶺迴歸又沒有特徵選擇的功能，彈性網絡的優點在於利用了嶺迴歸的罰函數，同時又由Lasso的特徵選擇功能。
該模型能夠更有效地處理成組的高度相關變量。

> enetGrid=expand.grid(.lambda=seq(0,0.2,length=20),.fraction=seq(.8,1,length=20))
> set.seed(100)
> enetTune=train(trainx,trainy,method="enet",tuneGrid=enetGrid,trControl=ctrl,preProc=c("center","scale"))
Error in train.default(trainx, trainy, method = "enet", tuneGrid = enetGrid,  : 
  object 'ctrl' not found
> enetTune=train(trainx,trainy,method="enet",tuneGrid=enetGrid,trControl=ctr1,preProc=c("center","scale"))
> enetfit=enet(x=as.matrix(trainx),y=trainy,lambda=0.01,normalize=TRUE)
> enetPred=predict(enetfit,newx=as.matrix(trainx),s=0.958,mode="fraction",type="fit")
> enetCoef=predict(ridgefit,newx=as.matrix(trainx),s=0.958,mode="fraction",type="coefficients")
> enetCoef
$s
[1] 0.958

$fraction
    0 
0.958 

$mode
[1] "fraction"

$coefficients
        Q1         Q2         Q3         Q4         Q5         Q6         Q7         Q8         Q9        Q10 
-354.24218  744.20631  703.81744  189.47130  499.01455  586.62172    0.00000  -96.71702  352.72551 -284.57245 

>

主成分和偏最小二乘迴歸

在實際應用中，自變量之間通常是彼此相關的，包含相似的信息。
如果對預測變量進行主成分分析（PCA），然後用主成分進行迴歸，這種方法得到的主成分是彼此不相關的。
但存在弊端是得到的新自變量是原變量的線性組合，因此使得模型難以理解。
偏最小二乘迴歸（PLS）是PCR的有監督版本
在建模前也需要進行標準化。（具體原理會在後面章節中專門講解）
下面展示如何用R中的caret包訓練PCR和PLS模型

> library(lattice)
> library(caret)
> library(dplyr)

載入程輯包：‘dplyr’

The following object is masked from ‘package:car’:

    recode

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Warning message:
程輯包‘dplyr’是用R版本3.4.3 來建造的 
> library(elasticnet)
> library(lars)
> sim.dat=read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
> ymad=mad(na.omit(sim.dat$income))
> zs=(sim.dat$income-mean(na.omit(sim.dat$income)))/ymad
> idex=c(which(na.omit(zs>3.5)),which(is.na(zs)))
> sim.dat=sim.dat[-idex,]
> xtrain=dplyr::select(sim.dat,Q1:Q10)
> ytrain=sim.dat$income
> set.seed(100)
> ctr1=trainControl(method="cv",number=10)
> plsTune=train(xtrain,ytrain,method="pls",tuneGrid=expand.grid(.ncomp=1:10))
> plsTune=train(xtrain,ytrain,method="pls",tuneGrid=expand.grid(.ncomp=1:10),trControl=ctr1)
> plsTune
Partial Least Squares 

772 samples
 10 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 694, 696, 696, 696, 694, 695, ... 
Resampling results across tuning parameters:

  ncomp  RMSE      Rsquared   MAE     
   1     28106.40  0.6553646  19957.91
   2     24852.89  0.7385908  16142.37
   3     23594.19  0.7679501  14507.19
   4     23442.31  0.7713064  13940.47
   5     23407.49  0.7721321  13848.48
   6     23409.49  0.7720994  13838.38
   7     23408.15  0.7721470  13835.54
   8     23408.56  0.7721433  13835.52
   9     23408.46  0.7721447  13835.46
  10     23408.46  0.7721448  13835.48

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 5.
>