R語言與迴歸分析學習筆記（bootstrap method）

Bootstrap方法在之前的博文《R語言與點估計學習筆記（EM算法與Bootstrap法）》裏有提到過，簡而言之，bootstrap方法就是重抽樣。爲什麼需要bootstrap方法呢？因爲bootstrap方法使得我們無需分佈理論的知識也可以進行假設檢驗，獲得置信區間。當數據來自未知分佈，或者存在嚴重異常點，又或者樣本量過小，沒有參數方法解決問題時，bootstrap方法將是一個很棒的方法。

對於迴歸分析而言，bootstrap無疑對迴歸的正態性假設做了極大地放鬆，使得迴歸推斷越來越好用，也更具有說服力。

從博文《R語言與點估計學習筆記（EM算法與Bootstrap法）》裏可以看到，對於參數統計，特別是在已知分佈的參數估計，bootstrap並沒有多大的意義，它的結果和矩估計或者極大似然估計的結果並沒有多大的差別（如果有差別會令人覺得很奇怪，不是嗎？）

Boot包中提供了做bootstrap的兩個十分好用的函數：boot（），boot.ci（）。兩者的調用格式與參數說明如下：

Boot（）函數：

boot(data, statistic, R, sim ="ordinary", stype = c("i", "f", "w"),

strata= rep(1,n), L = NULL, m = 0, weights = NULL,

ran.gen = function(d, p) d, mle = NULL, simple = FALSE, ...,

parallel = c("no", "multicore", "snow"),

ncpus = getOption("boot.ncpus", 1L), cl = NULL)

參數說明：

Data：數據，可以是向量，矩陣，數據框

Statistic：統計量，如均值，中位數，迴歸參數，迴歸裏的R^2等

R：調用統計量函數次數

Boot（）的返回值：

T0:從原始數據中得到的k個統計量的觀測值

T:一個R*K的矩陣

Boot.ci（）函數：

boot.ci(boot.out, conf = 0.95, type = "all",

        index = 1:min(2,length(boot.out$t0)), var.t0 = NULL,

        var.t = NULL, t0 = NULL, t = NULL, L = NULL,

        h = function(t) t, hdot = function(t) rep(1,length(t)),

        hinv = function(t) t, ...)

參數說明：

Boot.out（）：boot函數的返回值

Type：返回置信區間的類型，R中提供的有"norm" ,"basic", "stud","perc", "bca"

一、對單個統計量使用bootstrap方法
我們以R中的數據集women爲例說明這個問題。數據集women列出了美國婦女的平均身高和體重。以體重爲響應變量，身高爲協變量進行迴歸，獲取斜率項的95%置信區間。

R可以通過以下代碼告訴我們答案：

library(boot)
beta<-function(formula,data,indices){
d<-data[indices,]
fit<-lm(formula,data=d)
return(fit$coef[2])
}
result<-boot(data=women,statistic=beta,R=500,formula=weight~height)
boot.ci(result)

輸出結果：

BOOTSTRAPCONFIDENCE INTERVAL CALCULATIONS

Basedon 500 bootstrap replicates

CALL:

boot.ci(boot.out= result)

Intervals:

Level Normal Basic

95% ( 3.218, 3.686 ) ( 3.231, 3.704 )

Level Percentile BCa

95% ( 3.196, 3.669 ) ( 3.199, 3.675 )

Calculationsand Intervals on Original Scale

他們與傳統的估計差別大嗎？我們來看看傳統的區間估計：

confint(lm(weight~height,data=women))

輸出結果：

2.5 % 97.5 %

(Intercept) -100.342655 -74.690679

height 3.253112 3.646888

可以看出，差別並不是很大，究其原因，無外乎正態性得到了很好的滿足。我們看其qq圖：

很清楚也很顯然。Shapiro檢驗也說明了這樣一個事實：

Shapiro-Wilk normality test

data: women$weight

W = 0.9604,p-value = 0.6986

我們在來看一個差別較大的例子：

. 考慮R中的數據集faithful。以waiting爲響應變量，eruptions爲協變量，建立簡單迴歸模型y=α+βx+e。考慮β的95%置信區間。

重複上面的步驟，R代碼如下：

result<-boot(data=faithful,statistic=beta,R=500,formula=waiting~eruptions)
boot.ci(result)
confint(lm(waiting~eruptions,data=faithful))
qqPlot(lm(waiting~eruptions,data=faithful))

輸出結果：

BOOTSTRAPCONFIDENCE INTERVAL CALCULATIONS

Based on 500bootstrap replicates

CALL :

boot.ci(boot.out= result)

Intervals :

Level Normal Basic

95% (10.08, 11.30 ) (10.06, 11.26 )

Level Percentile BCa

95% (10.20, 11.40 ) (10.13, 11.35 )

Calculationsand Intervals on Original Scale

Some BCaintervals may be unstable

傳統估計：

2.5 % 97.5 %

(Intercept)31.20069 35.74810

eruptions 10.10996 11.34932

差別有些大，我們來看看qq圖：

正態性不是很好，shapiro檢驗告訴我們幾乎不可能認爲是正態的。

Shapiro-Wilk normality test

data: faithful$waiting

W = 0.9221,p-value = 1.016e-10

觀察下圖，我們也可以看到waiting不服從正態分佈，那麼他的95%的置信區間可以通過以下代碼獲得：

mean1<-function(data,indices){
d<-data[indices,]
fit<-mean(d$waiting)
return(fit)
}
results<-boot(data=faithful,statistic=mean1,R=1000)
boot.ci(results)

輸出結果：

BOOTSTRAPCONFIDENCE INTERVAL CALCULATIONS

Based on 1000bootstrap replicates

CALL :

boot.ci(boot.out= results)

Intervals :

Level Normal Basic

95% (69.28, 72.50 ) (69.31, 72.53 )

Level Percentile BCa

95% (69.26, 72.49 ) (69.10, 72.44 )

Calculationsand Intervals on Original Scale

與傳統方法比較

t.test(faithful$waiting)$conf.int

[1] 69.2741872.51994

可見估計的穩健性。

二、對多個統計量使用bootstrap方法

我們考慮博文《R語言與迴歸分析學習筆記（應用迴歸小結）（1）》二中的例子，獲取一個統計向量（四個迴歸係數）的95%置信區間。

首先，創建一個返回迴歸係數向量的函數：

betas<-function(formula,data,indices){
d<-data[indices,]
fit<-lm(formula,data=d)
return(fit$coef)
}

然後自助抽樣500次：

results<-boot(data=states,statistic=betas,R=500,formula=Murder~.)
print(results)

輸出結果：

ORDINARYNONPARAMETRIC BOOTSTRAP

Call:

boot(data= states, statistic = betas, R = 500, formula = Murder ~ .)

BootstrapStatistics :

original bias std. error

t1* 1.2345634112 8.969206e-01 5.305460e+00

t2* 0.0002236754 2.345369e-06 8.884204e-05

t3* 4.1428365903 -1.621533e-01 8.333313e-01

t4* 0.0000644247 -1.131896e-04 9.750587e-04

t5* 0.0005813055 -1.928938e-03 1.055004e-02

當對多個統計量自助抽樣時，需要添加一個索引參數，指明plot（），boot.ci()函數所分析對象。如下列代碼用於繪製人口結果：

plot(results,index=2)

輸出結果：

95%置信區間：

boot.ci(results,type="bca",index=2)
boot.ci(results,type="bca",index=3)
boot.ci(results,type="bca",index=4)

輸出結果：

>boot.ci(results,type="bca",index=2)

BOOTSTRAPCONFIDENCE INTERVAL CALCULATIONS

Basedon 500 bootstrap replicates

CALL:

boot.ci(boot.out= results, type = "bca", index = 2)

Intervals:

Level BCa

95% ( 0.0001, 0.0004 )

Calculationsand Intervals on Original Scale

>boot.ci(results,type="bca",index=3)

BOOTSTRAPCONFIDENCE INTERVAL CALCULATIONS

Basedon 500 bootstrap replicates

CALL:

boot.ci(boot.out= results, type = "bca", index = 3)

Intervals:

Level BCa

95% ( 2.284, 5.587 )

Calculationsand Intervals on Original Scale

SomeBCa intervals may be unstable

>boot.ci(results,type="bca",index=4)

BOOTSTRAPCONFIDENCE INTERVAL CALCULATIONS

Basedon 500 bootstrap replicates

CALL:

boot.ci(boot.out= results, type = "bca", index = 4)

Intervals:

Level BCa

95% (-0.0022, 0.0016 )

Calculationsand Intervals on Original Scale

三、殘差法。

Boot包中給出的辦法都是使用成對的bootstrap。結果較爲穩健。但我們也可以使用殘差法對有固定水平的預測變量做估計。

算法如下：

1、先由觀測數據擬合迴歸模型

2、然後獲得響應與殘差

3、從擬合殘差集合中有放回隨機抽取得到bootstrap殘差集合

4、生成一個僞數據集，對x迴歸僞數據集，獲得bootstrap參數估計beta

5、重複多次，得到beta的經驗分佈，做推斷或估計

我以women數據集爲例，利用殘差法說明迴歸係數顯著不爲0.R代碼如下：

lm.reg<-lm(weight~height,data=women)
y.fit<-predict(lm.reg)
y.res<-residuals(lm.reg)
y.bootstrap<-rep(1,15)
datap<-rep(0,100)
dataq<-rep(0,100)
for(pin 1:100){
for(iin 1:15){
res<-sample(y.res,1,replace=TRUE)
y.bootstrap[i]<-y.fit[i]+res
}
datap[p]<-lm(y.bootstrap~women$height)$coef[1]
dataq[p]<-lm(y.bootstrap~women$height)$coef[2]
}
ecdf(datap)
ecdf(dataq)

輸出的結果

ecdf(datap)#x[1:100]= -104.56, -99.192, -97.884, ...,-75.389, -72.604

ecdf(dataq)#x[1:100]= 3.2271, 3.2588, 3.2845, ..., 3.625, 3.7125

顯然結論是對的，其95%的置信區間使用分位數法有：

A<-ecdf(datap)
B<-ecdf(dataq)
quantile(A,probs=c(0.025,0.975))
quantile(B,probs=c(0.025,0.975))

輸出結果：

> quantile(A,probs=c(0.025,0.975))

2.5% 97.5%

-96.95054 -75.22924

> quantile(B,probs=c(0.025,0.975))

2.5% 97.5%

3.264777 3.595554

這次介紹遺留了兩個有價值的問題：1、初始樣本多大？2、應該重複多少次？第一個問題，沒有簡單的答案，但有一點是肯定的，bootstrap不會提供比初始樣本更多的信息，初始信息的收集仍舊十分重要。第二個問題在計算機資源普及的今天，多做一些沒有太大的壞處，但是monte carlo方法及方差縮減技術將是你需要學習的。

yujunbeta

發佈了53 篇原創文章 · 獲贊 97 · 訪問量 70萬+

私信關注

R語言與迴歸分析學習筆記（bootstrap method）

PDManer [元數建模]-v4.9.0 發佈：一款簡單好用的數據庫建模平臺

使用neovim打造go ide(支持代碼跳轉, 代碼補全, 實時語法檢查)

sql求連續值問題

cs01 CSS Syntax

挑戰程序設計競賽 2.3章習題 poj 3046 Ant Counting

[MASM拾遺]Offset僞指令

h30 HTML Layout Elements

瞭解顯卡

一款基於C#開發的通訊調試工具（支持Modbus RTU、MQTT調試）

Linux/Golang/glibC系統調用

迴歸分析作業3

R語言與機器學習學習筆記（分類算法）（1）K-近鄰算法

優化與求解非線性方程組（單變量問題）

兩道概率題求解（半原創）

R語言與迴歸分析學習筆記（應用迴歸小結）（2）

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結