Hypothesis Testing with R

Dataset description

Using the Galton dataset, we examine how the heights of sons and daughters relate to their mothers' heights.

library("AzureML")
ws <- workspace()
galton <- download.datasets(ws, "GaltonFamilies.csv")
head(galton)

The first 6 rows of the data and the columns:

dim(galton)

dim(galton) returns the row count followed by the column (attribute) count; here it reports 939 rows.
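If an Azure ML workspace is not available, the Galton family heights also ship with the HistData package as GaltonFamilies. The following is a minimal sketch, assuming that package is installed and that its columns (mother, gender, childHeight) match the CSV used in this post, which the post itself does not confirm:

```r
# Assumption: HistData::GaltonFamilies stands in for the Azure ML copy
# of GaltonFamilies.csv; row counts may differ from the CSV used above.
if (requireNamespace("HistData", quietly = TRUE)) {
  galton <- HistData::GaltonFamilies
  print(head(galton))   # first 6 rows
  print(dim(galton))    # rows, columns
} else {
  message("HistData is not installed; run install.packages('HistData') first")
}
```

The rest of the code only needs the gender, mother, and childHeight columns, so any data frame with those names should work.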

Data visualization

We plot histograms to show the height relationship between mothers and sons, and between mothers and daughters.

hist.plot = function(df, col, bw, max, min){
   ggplot(df, aes_string(col)) + geom_histogram( binwidth = bw ) + xlim(min,max)
}

hist.family = function(df, col1, col2, num.bin = 30){
   require(ggplot2)
   require(gridExtra)
   ## compute bin width
   max = max(c(df[, col1], df[, col2]))
   min = min(c(df[, col1], df[, col2]))
   bin.width = (max - min)/num.bin
   ## create a first histogram
   p1 = hist.plot(df, col1, bin.width, max, min)
   p1 = p1 + geom_vline(xintercept = mean(df[, col1]), color = 'red', size = 1)
   ## create the second histogram
   p2 = hist.plot(df, col2, bin.width, max, min)
   p2 = p2 + geom_vline(xintercept = mean(df[, col2]), color = 'red', size = 1)
   ## stack the plot
   grid.arrange(p1,p2, nrow = 2, ncol = 1)
}

sons = galton[galton$gender=='male',]
hist.family(sons,'childHeight','mother')

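In the plots, geom_vline() marks the mean of each distribution so that the two can be compared. The results are as follows:
Sons and mothers

Daughters and mothers

The height distributions of sons and mothers overlap very little, whereas the daughters' height distribution is very similar to the mothers'. This suggests the null hypothesis: the mean height of mothers equals the mean height of sons (or daughters), i.e. μ1 − μ2 = 0. The alternative hypothesis is μ1 ≠ μ2.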
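Written out, the two hypotheses and the statistic used to test them look like this. The sketch assumes the Welch form of the two-sample t statistic, which is what R's t.test computes when paired = FALSE and var.equal is left at its default; the symbols for the sample mean, sample variance, and sample size of each group are introduced here and do not appear in the original post.

```latex
\[
H_0:\ \mu_1 - \mu_2 = 0
\qquad\text{versus}\qquad
H_1:\ \mu_1 - \mu_2 \neq 0
\]
\[
t = \frac{\bar{x}_1 - \bar{x}_2}
         {\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
\]
```

A value of t far from 0 in either direction counts as evidence against H0, which is why the test is two-sided.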

Two-sided hypothesis test using the t-test (small samples)

##H0:  there is no significant difference between the means
families.test <- function(df, col1, col2, paired = TRUE){
    t.test(df[,col1],df[,col2],paired=paired)
}

hist.family.conf <- function(df, col1, col2, num.bin = 30, paired=FALSE){
    require(ggplot2)
    require(gridExtra)
    
    max = max(c(df[,col1], df[,col2]))
    min = min(c(df[,col1], df[,col2]))
    bin.width = (max-min)/num.bin
    
    mean1 <- mean(df[,col1])
    mean2 <- mean(df[,col2])
    t <- t.test(df[,col1],df[,col2],paired=paired)
    pv1 <- mean2 + t$conf.int[1]
    pv2 <- mean2 + t$conf.int[2]
    ## plot a histogram
    p1 <- hist.plot(df,col1,bin.width,max,min)
    p1 <- p1 + geom_vline(xintercept = mean1,
                        color = 'red', size = 1) + 
             geom_vline(xintercept = pv1,
                        color = 'red', size = 1, linetype = 2)  + 
             geom_vline(xintercept = pv2,
                        color = 'red', size = 1, linetype =2) 
  
    ## plot the second histogram
    p2 <-  hist.plot(df, col2, bin.width, max, min)
    p2 <- p2 + geom_vline(xintercept = mean2,
                        color = 'red', size = 1.5)

    ## Now stack the plots
    grid.arrange(p1, p2, nrow = 2)

    print(t)
}
hist.family.conf(sons,'mother','childHeight')

Result of testing whether the son-mother difference in mean height is 0:

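My own understanding of the confidence interval and the p-value:
Assume the standardized difference in mean heights follows a t distribution with k-1 degrees of freedom. In the son-mother test, the 95% confidence interval is [-5.514, -4.887]: intervals built this way cover the true μ1 − μ2 in 95% of repeated samples, so these are the plausible values of the difference, and 0 is not among them. The p-value, which is the probability under this t distribution of seeing a difference at least this extreme if μ1 − μ2 = 0 were true, is < 2.2e-16. That is far below 0.05 (alpha), so we have sufficient evidence to reject the null hypothesis and conclude that the mean heights of sons and mothers differ.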
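To see where the confidence interval and the p-value come from, the sketch below recomputes both by hand and compares them with t.test. It uses synthetic heights rather than the Galton data, and it assumes the Welch (unequal variance) test that t.test runs by default; the variable names are illustrative only.

```r
# Synthetic heights in inches, NOT the Galton data (illustrative values only)
set.seed(42)
x <- rnorm(200, mean = 64, sd = 2.3)   # stand-in for mothers
y <- rnorm(200, mean = 69, sd = 2.6)   # stand-in for sons

# Welch two-sample t statistic computed by hand
vx <- var(x) / length(x)
vy <- var(y) / length(y)
se <- sqrt(vx + vy)                    # standard error of mean(x) - mean(y)
t.stat <- (mean(x) - mean(y)) / se

# Welch-Satterthwaite degrees of freedom
dof <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value: P(|T| >= |t.stat|) when H0 is true
p.value <- 2 * pt(-abs(t.stat), dof)

# 95% confidence interval for mu1 - mu2
ci <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, dof) * se

# Compare with the built-in test (paired = FALSE, var.equal = FALSE)
res <- t.test(x, y)
print(all.equal(unname(res$statistic), t.stat))
print(all.equal(res$p.value, p.value))
print(all.equal(as.numeric(res$conf.int), ci))
```

The interval and the p-value are two views of the same calculation: the 95% interval excludes 0 exactly when the two-sided p-value is below 0.05.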

The same method gives the result for the daughter-mother test of mean heights:

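Here the 95% confidence interval for μ1 − μ2 is [-0.25, 0.34], which contains 0. Under the same t distribution the p-value is 0.7701, far above 0.05, so we cannot reject the null hypothesis: the data are consistent with daughters and mothers having the same mean height.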
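The post does not show the daughter-mother call itself. A minimal sketch of what it might look like follows, assuming the HistData copy of GaltonFamilies as a stand-in for the Azure ML CSV and assuming daughters are the rows with gender equal to 'female'. The numbers it prints may not match the interval quoted above exactly.

```r
# Assumption: HistData::GaltonFamilies replaces the Azure ML download,
# and gender == "female" selects the daughters.
if (requireNamespace("HistData", quietly = TRUE)) {
  galton.hd <- HistData::GaltonFamilies
  daughters <- galton.hd[galton.hd$gender == "female", ]

  # Same comparison as for the sons: mother first, child second
  res <- t.test(daughters$mother, daughters$childHeight)
  print(res$conf.int)   # 95% interval for mean(mother) - mean(daughter)
  print(res$p.value)
}
```

With the post's own helper, the equivalent call would be hist.family.conf(daughters, 'mother', 'childHeight'), which also draws the two histograms.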

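From this we can see that the p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one in the data. It is only evidence for rejecting or not rejecting the hypothesis, and its size does not indicate how important the conclusion is.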
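A small simulation makes the last point concrete. The sketch below uses synthetic data, not the Galton heights, with a true difference in means of only 0.1 inch; with a very large sample the p-value is still tiny even though the difference is of little practical importance. The sample size and distribution parameters are arbitrary choices for illustration.

```r
# Synthetic data: two groups whose true means differ by just 0.1 inch
set.seed(1)
n <- 100000
a <- rnorm(n, mean = 64.0, sd = 2.5)
b <- rnorm(n, mean = 64.1, sd = 2.5)

res <- t.test(a, b)

# Estimated difference in means: small in practical terms
diff.means <- unname(res$estimate[1] - res$estimate[2])
print(diff.means)

# p-value: very small, because the huge sample makes the test very sensitive
print(res$p.value)
```

Reporting the confidence interval alongside the p-value helps here, because the interval shows how large the difference actually is.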
