Hypothesis Testing with R, and Understanding the P-value and Confidence Interval
Hypothesis Testing with R
Dataset description
Using the Galton dataset, we test how the heights of sons and daughters relate to the heights of their mothers.
library("AzureML")
ws <- workspace()
galton <- download.datasets(ws, "GaltonFamilies.csv")
head(galton)
The first 6 rows of the data and the columns:
dim(galton)
939 rows; the second value returned by dim() is the number of columns (attributes)
Data visualization
Plot histograms that show the height relationship between mothers and sons, and between mothers and daughters.
hist.plot = function(df, col, bw, max, min){
ggplot(df, aes_string(col)) + geom_histogram( binwidth = bw ) + xlim(min,max)
}
hist.family = function(df, col1, col2, num.bin = 30){
require(ggplot2)
require(gridExtra)
## compute bin width
max = max(c(df[, col1], df[, col2]))
min = min(c(df[, col1], df[, col2]))
bin.width = (max - min)/num.bin
## create a first histogram
p1 = hist.plot(df, col1, bin.width, max, min)
p1 = p1 + geom_vline(xintercept = mean(df[, col1]), color = 'red', size = 1)
## create a second histogram
p2 = hist.plot(df, col2, bin.width, max, min)
p2 = p2 + geom_vline(xintercept = mean(df[, col2]), color = 'red', size = 1)
## stack the plot
grid.arrange(p1,p2, nrow = 2, ncol = 1)
}
sons = galton[galton$gender=='male',]
hist.family(sons,'childHeight','mother')
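In the plots, geom_vline() marks the mean of each distribution for comparison. The results are as follows:
Sons and mothers
Daughters and mothers
The sons' height distribution overlaps only slightly with the mothers', whereas the daughters' height distribution closely resembles the mothers'. We therefore state the null hypothesis: the mean height of mothers equals the mean height of sons (or daughters), i.e. mu1 - mu2 = 0. The alternative hypothesis is that mu1 is not equal to mu2.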
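Written out, the two-sided hypotheses and the corresponding test statistic look as follows. This is a sketch in notation introduced here for illustration: $\bar{x}_i$, $s_i^2$ and $n_i$ denote the sample mean, sample variance and sample size of group $i$, and the statistic shown is the Welch form that R's t.test() uses by default for unpaired samples.

```latex
H_0:\ \mu_1 - \mu_2 = 0
\qquad \text{vs.} \qquad
H_1:\ \mu_1 - \mu_2 \neq 0

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```

For a paired test on $k$ mother-child pairs, the statistic is instead $t = \bar{d} / (s_d / \sqrt{k})$ with $k - 1$ degrees of freedom, where $\bar{d}$ and $s_d$ are the mean and standard deviation of the pairwise differences.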
Two-sided hypothesis test using the t-test (small samples)
##H0: there is no significant difference between the means
families.test <- function(df, col1, col2, paired = TRUE){
t.test(df[,col1],df[,col2],paired=paired)
}
hist.family.conf <- function(df, col1, col2, num.bin = 30, paired=FALSE){
require(ggplot2)
require(gridExtra)
max = max(c(df[,col1], df[,col2]))
min = min(c(df[,col1], df[,col2]))
bin.width = (max-min)/num.bin
mean1 <- mean(df[,col1])
mean2 <- mean(df[,col2])
t <- t.test(df[,col1],df[,col2],paired=paired)
pv1 <- mean2 + t$conf.int[1]
pv2 <- mean2 + t$conf.int[2]
## plot a histogram
p1 <- hist.plot(df,col1,bin.width,max,min)
p1 <- p1 + geom_vline(xintercept = mean1,
color = 'red', size = 1) +
geom_vline(xintercept = pv1,
color = 'red', size = 1, linetype = 2) +
geom_vline(xintercept = pv2,
color = 'red', size = 1, linetype =2)
## plot the second histogram
p2 <- hist.plot(df, col2, bin.width, max, min)
p2 <- p2 + geom_vline(xintercept = mean2,
color = 'red', size = 1.5)
## Now stack the plots
grid.arrange(p1, p2, nrow = 2)
print(t)
}
hist.family.conf(sons,'mother','childHeight')
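Test result for a zero difference between the mean heights of sons and mothers:
My own understanding of the confidence interval and the p-value:
Assume the standardized difference in mean heights follows a t distribution (with k-1 degrees of freedom for a paired test on k pairs). In the son-mother case, the 95% confidence interval for mu1 - mu2 is [-5.514, -4.887]: an interval built by this procedure would contain the true difference in about 95% of repeated samples, and this one does not contain 0. The p-value is the probability, under the null hypothesis mu1 - mu2 = 0 and this t distribution, of observing a difference at least as extreme as the one seen. The reported value is < 2.2e-16, far below alpha = 0.05, so we have sufficient evidence to reject the null hypothesis and conclude that the mean heights of sons and mothers differ.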
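To make the link between the t statistic, the confidence interval and the p-value concrete, here is a minimal sketch on simulated data. The heights below are hypothetical values drawn with rnorm(), not the Galton data, and the variable names are chosen only for illustration. It pulls the interval and p-value out of the t.test() result and recomputes the two-sided p-value by hand from the t distribution.

```r
set.seed(42)  # reproducible simulated data; these are NOT the Galton heights

# Hypothetical heights in inches, chosen only for illustration
mother <- rnorm(200, mean = 64, sd = 2.3)
son    <- rnorm(200, mean = 69, sd = 2.6)

res <- t.test(mother, son)   # Welch two-sample, two-sided test (the default)

ci   <- res$conf.int         # 95% confidence interval for mean(mother) - mean(son)
pval <- res$p.value          # two-sided p-value under H0: difference = 0

# Recompute the p-value by hand from the t statistic and its degrees of freedom:
# the probability mass in both tails beyond |t| under the t distribution
t.stat      <- unname(res$statistic)
df          <- unname(res$parameter)
pval.manual <- 2 * pt(-abs(t.stat), df)

print(ci)
print(c(reported = pval, manual = pval.manual))
```

The hand-computed value matches the one reported by t.test(), which shows that the p-value is a tail probability of the test statistic under the null hypothesis, not the probability that the null hypothesis itself is true.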
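The same method gives the test result for the daughter-mother mean heights:
The 95% confidence interval for mu1 - mu2 is [-0.25, 0.34], which contains 0. The p-value computed from the t distribution is 0.7701, far above 0.05, so we fail to reject the null hypothesis: the data give no evidence that the mean heights of daughters and mothers differ.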
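The "95%" in a confidence interval describes how often the procedure captures the true difference over repeated samples. The following sketch illustrates this with a small simulation. All numbers in it (group size, means, standard deviation, number of repetitions) are assumptions made up for the example and have nothing to do with the Galton data.

```r
set.seed(1)

true.diff <- 0      # hypothetical true difference in means (H0 holds here)
n.rep     <- 2000   # number of simulated experiments
covered   <- logical(n.rep)

for (i in seq_len(n.rep)) {
  # Draw two fresh samples whose true means differ by exactly true.diff
  x  <- rnorm(50, mean = 64, sd = 2.3)
  y  <- rnorm(50, mean = 64 - true.diff, sd = 2.3)
  ci <- t.test(x, y)$conf.int
  # Record whether this experiment's interval contains the true difference
  covered[i] <- ci[1] <= true.diff && true.diff <= ci[2]
}

coverage <- mean(covered)  # fraction of intervals that contain the truth
print(coverage)
```

The printed fraction should land close to 0.95, which is the sense in which each individual interval is called a 95% confidence interval.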
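From this we can see that the p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one obtained. It is only evidence for rejecting or failing to reject the null hypothesis; its size says nothing about how important the conclusion is.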
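The point that a small p-value does not imply an important effect can be illustrated with a simulation. This is a sketch under made-up assumptions: two very large hypothetical groups whose true means differ by only 0.1 inches.

```r
set.seed(7)

n <- 100000  # very large hypothetical sample size per group

# Hypothetical groups whose true means differ by only 0.1 inches
a <- rnorm(n, mean = 64.0, sd = 2.3)
b <- rnorm(n, mean = 64.1, sd = 2.3)

res           <- t.test(a, b)
observed.diff <- mean(a) - mean(b)

print(res$p.value)     # tiny p-value: strong evidence against H0
print(observed.diff)   # yet the estimated difference is practically negligible
```

With enough data even a negligible difference is detected, so the p-value should be read together with the estimated difference and its confidence interval.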