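Introduction

In machine learning applications, we cannot get away from data. Without data, a machine learning algorithm has no soul. The better we understand our data, the better we can apply it to machine learning. In this article, I introduce some important statistical concepts for understanding data, so that you can work with data more accurately and with more confidence.

Note: this article involves many terms and definitions. I give them in their standard English form, because translating them tends to cause more confusion than it resolves.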

Populations and Parameters

A population is any large collection of objects or individuals, such as Americans, students, or trees about which information is desired.

A parameter is any summary number, like an average or percentage, that describes the entire population.

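The following examples illustrate populations and parameters.

We want to know the average weight (μ) of all men in China. Here the population is all Chinese men, and the parameter is the mean weight.
We want to know the proportion (p) of all Chinese college students who smoke. Here the population is all Chinese college students, and the parameter is the proportion who smoke.

Unfortunately, we can almost never know a population parameter exactly. In the first example, we cannot weigh every man in China and then take the average. Instead, we have to estimate the population parameter.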

Samples and statistics

A sample is a representative group drawn from the population.

A statistic is any summary number, like an average or percentage, that describes the sample.

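Let's continue with the examples above.

This time we select only 100 representative Chinese men and compute their average weight x̄, which we use to estimate μ.
This time we select only 100 representative college students and compute the proportion of them who smoke, p̂, which we use to estimate p.

The 100 college students above are a sample, and the computed p̂ is a statistic of that sample.

Because the size of a sample is under our control, we can compute any statistic of it. We then use this sample statistic to estimate the unknown population parameter.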
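To make the relationship between a parameter and a statistic concrete, here is a minimal R sketch. It assumes a simulated population (the weights, the 20% smoking rate, and all variable names are made up for illustration and are not from any real survey), draws a sample of 100, and compares the sample statistics with the population parameters.

```r
# Hypothetical population of 1,000,000 simulated weights in kg (made-up numbers)
set.seed(42)
population <- rnorm(1e6, mean = 70, sd = 10)
mu <- mean(population)             # population parameter (normally unknown)

# Draw a simple random sample of 100 individuals and compute the statistic
smp  <- sample(population, size = 100)
xbar <- mean(smp)                  # sample statistic used to estimate mu

# Same idea for a proportion: 1 = smoker, 0 = non-smoker (simulated)
smokers <- rbinom(1e6, size = 1, prob = 0.2)
p       <- mean(smokers)           # population proportion (normally unknown)
p_hat   <- mean(sample(smokers, size = 100))  # sample proportion

c(mu = mu, xbar = xbar, p = p, p_hat = p_hat)
```

In practice only `xbar` and `p_hat` would be available; the simulation exposes `mu` and `p` only so the two can be compared.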

There are two ways to draw conclusions about a population parameter: confidence intervals and hypothesis tests. I introduce each of them below.

t-based Confidence Interval for the Mean

We can use a t-interval to estimate the population mean μ. Its definition is as follows:

When the population standard deviation σ is not known, an interval estimate for the population mean μ with confidence level 1−α is given by:

$\bar{x} \pm t_{\alpha/2,\,n-1} \left( \frac{s}{\sqrt{n}} \right)$

t_{α/2, n−1}: it depends on the sample size n through the degrees of freedom n−1, and on the confidence level (1−α)×100% through α/2.
s/√n: this whole term is called the "standard error". It is the estimated standard deviation of all the possible sample means.

Clearly, the sample mean x̄, the sample standard deviation s, and the sample size n are all easy to obtain from the sample data. All that remains is to find t_{α/2, n−1}.

To find the t value, we can look it up in a t-table or use statistical software. Either way, we first need the degrees of freedom and α/2.

[Figure: T-Table]

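Suppose we set the confidence level to 90%, so α/2 is 0.05. Suppose our sample size is 15, so the degrees of freedom are 15 − 1 = 14. Looking this up in the t-table gives t_{0.05,14} = 1.761. Given the sample data, we could now compute the confidence interval. I do not provide a dataset here. Suppose the resulting interval is (3.43, 3.68); this means we are 90% confident that the population mean lies in this interval.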
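Since no dataset is given for this interval, the following R sketch uses a made-up sample of 15 values (an assumption, purely for illustration) to show how the pieces of the formula fit together. It also cross-checks the result against the interval returned by R's `t.test()`.

```r
# Hypothetical sample of 15 measurements (made-up values, for illustration only)
x <- c(3.2, 3.7, 3.5, 3.9, 3.4, 3.6, 3.3, 3.8, 3.5, 3.6, 3.4, 3.7, 3.5, 3.2, 3.9)

conf_level <- 0.90
alpha <- 1 - conf_level
n     <- length(x)
xbar  <- mean(x)
s     <- sd(x)

t_crit <- qt(1 - alpha / 2, df = n - 1)   # t_{0.05, 14}, about 1.761
se     <- s / sqrt(n)                     # standard error of the mean
ci     <- c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
ci

# Cross-check against the interval reported by t.test()
t.test(x, conf.level = conf_level)$conf.int
```

The `qt()` call plays the role of the t-table lookup: it returns the quantile that leaves α/2 in the upper tail of the t-distribution with n−1 degrees of freedom.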

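Factors Affecting the Width of the t-Interval

Rearranging the formula above, the width of the interval is:

$\text{Width} = 2 \times t_{\alpha/2,\,n-1} \left( \frac{s}{\sqrt{n}} \right)$

From this formula we can read off the factors that affect the width.

As the sample mean increases, the width does not change. In other words, the sample mean has no effect on the width of the interval.
As the sample standard deviation s decreases, the width of the interval decreases.
As we lower the confidence level, the t value decreases, so the width of the interval decreases.
As we increase the sample size, the width of the interval decreases. This is the factor we can control most easily; the only cost is our time and money.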
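The effect of each factor can be checked numerically. The sketch below assumes a small helper, `ci_width()` (a hypothetical name introduced here), that evaluates the width formula, and compares a made-up baseline against three variations.

```r
# Width of the t-interval: 2 * t_{alpha/2, n-1} * s / sqrt(n)
# ci_width() is a hypothetical helper written for this illustration.
ci_width <- function(s, n, conf_level) {
  alpha <- 1 - conf_level
  2 * qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
}

# Made-up baseline: s = 10, n = 25, 95% confidence
base <- ci_width(s = 10, n = 25, conf_level = 0.95)

smaller_s  <- ci_width(s = 5,  n = 25,  conf_level = 0.95)  # less spread
lower_conf <- ci_width(s = 10, n = 25,  conf_level = 0.90)  # lower confidence
larger_n   <- ci_width(s = 10, n = 100, conf_level = 0.95)  # more data

c(base = base, smaller_s = smaller_s, lower_conf = lower_conf, larger_n = larger_n)
```

The sample mean does not appear among the arguments of `ci_width()` at all, which matches the first observation in the list: it shifts the interval but does not change its width.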

Hypothesis Testing

Hypothesis testing generally involves the following three steps:

1. Making an initial assumption.
2. Collecting evidence (data).
3. Based on the available evidence (data), deciding whether to reject or not reject the initial assumption.

There are two types of error in hypothesis testing:

Type I error: The null hypothesis is rejected when it is true.

Type II error: The null hypothesis is not rejected when it is false.

There are two ways to carry out hypothesis testing: the critical value approach and the P-value approach. I introduce each of them below.

Hypothesis Testing (Critical value approach)

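The critical value approach compares the observed test statistic with a critical value. If the test statistic is more extreme than the critical value, the null hypothesis is rejected. If the test statistic is not as extreme as the critical value, the null hypothesis is not rejected.

In hypothesis testing, the probability of making a Type I error is called the significance level, denoted α.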
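One way to see that α really is the Type I error probability is a small simulation. The sketch below is an illustration under assumed settings (normal data, a true mean of 170, a standard deviation of 10, and 5000 replicates are all made-up choices): the null hypothesis is true in every replicate, so every rejection is a Type I error.

```r
set.seed(1)
alpha <- 0.05
reps  <- 5000
n     <- 25
mu0   <- 170                               # hypothesized mean, true by construction

t_crit <- qt(1 - alpha / 2, df = n - 1)    # two-tailed critical value

# Each replicate draws a fresh sample from a population whose mean really is mu0,
# then records whether the test (wrongly) rejects the null hypothesis.
rejected <- replicate(reps, {
  x <- rnorm(n, mean = mu0, sd = 10)
  t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
  abs(t_stat) > t_crit
})

type1_rate <- mean(rejected)               # should land close to alpha
type1_rate
```

The observed rejection rate fluctuates from run to run, but with this many replicates it should stay near 0.05.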

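Any hypothesis test using the critical value approach consists of the following four steps:

1. State the null and alternative hypotheses.
2. Assuming the null hypothesis is true, compute the test statistic from the sample data. For a hypothesis test about the population mean μ, the test statistic is $t^* = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, where μ is the value specified by the null hypothesis.
3. Find the critical value.
4. Compare the test statistic with the critical value.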
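The four steps above can be sketched as a single R function. This is only an illustration: `critical_value_test()` is a hypothetical helper, its `alternative` argument borrows the option names used by R's `t.test()`, and the data are made up.

```r
# One-sample t-test decision via the critical value approach.
# critical_value_test() is a hypothetical helper written for this illustration.
critical_value_test <- function(x, mu, alpha = 0.05,
                                alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)        # step 1: choose the alternative
  n      <- length(x)
  t_stat <- (mean(x) - mu) / (sd(x) / sqrt(n)) # step 2: test statistic
  # steps 3 and 4: find the critical value and compare
  reject <- switch(alternative,
    greater   = t_stat >  qt(1 - alpha, df = n - 1),
    less      = t_stat < -qt(1 - alpha, df = n - 1),
    two.sided = abs(t_stat) > qt(1 - alpha / 2, df = n - 1)
  )
  list(t_stat = t_stat, reject = reject)
}

# Made-up data, for illustration only
x   <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.5)
res <- critical_value_test(x, mu = 5, alpha = 0.05, alternative = "greater")
res
```

For a two-tailed test the significance level is split across both tails, which is why that branch uses `alpha / 2`.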

Hypothesis Testing (P-value approach)

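The P-value is a probability: assuming the null hypothesis is true, it is the probability of observing a test statistic at least as extreme as the one computed from our sample data, in the direction of the alternative hypothesis. If the P-value is less than (or equal to) α, the null hypothesis is rejected. If the P-value is greater than α, the null hypothesis is not rejected.

Any hypothesis test using the P-value approach consists of the following four steps:

1. State the null and alternative hypotheses.
2. Assuming the null hypothesis is true, compute the test statistic from the sample data. For a hypothesis test about the population mean μ, the test statistic is $t^* = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, where μ is the value specified by the null hypothesis.
3. Find the p-value.
4. Set the significance level α, the probability of making a Type I error, typically 0.01, 0.05, or 0.10. Then compare the p-value with α.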
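As with the critical value approach, these steps can be sketched as one R function. `p_value_test()` is a hypothetical helper and the data are made up; the last line compares its p-value with the one reported by R's `t.test()`.

```r
# One-sample t-test decision via the P-value approach.
# p_value_test() is a hypothetical helper written for this illustration.
p_value_test <- function(x, mu, alpha = 0.05,
                         alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  n      <- length(x)
  t_stat <- (mean(x) - mu) / (sd(x) / sqrt(n))
  # Tail area(s) beyond the test statistic, in the direction of the alternative
  p_val <- switch(alternative,
    greater   = pt(t_stat, df = n - 1, lower.tail = FALSE),
    less      = pt(t_stat, df = n - 1),
    two.sided = 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)
  )
  list(t_stat = t_stat, p_value = p_val, reject = p_val <= alpha)
}

# Made-up data, for illustration only
x   <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.5)
res <- p_value_test(x, mu = 5, alternative = "greater")
res$p_value

# Cross-check against t.test()
all.equal(res$p_value, t.test(x, mu = 5, alternative = "greater")$p.value)
```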

Right-tailed test

I'll use a concrete example with R code to demonstrate the two approaches above. Suppose our sample consists of 25 athletes with the following heights:

170 167 174 179 179
156 163 156 187 156
183 179 174 179 170
156 187 179 183 174
187 167 159 170 179

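I want to know whether the average height of all Chinese athletes is greater than 170, so I define the following hypotheses.

1. State the null and alternative hypotheses

H0: μ = 170
H1: μ > 170

2. Compute the test statistic: $t^* = \frac{\bar{x} - \mu}{s / \sqrt{n}}$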

height <- c(170,167,174,179,179,156,163,156,187,156,183,179,174,179,170,156,187,179,183,174,187,167,159,170,179)
# sample mean 'xbar'
xbar <- mean(height) # 172.52
# hypothesized value 'mu'
mu <- 170
# sample standard deviation 's'
s <- sd(height) # 10.31
# sample size 'n'
n <- 25
# test statistic 't'
t <- (xbar - mu) / (s / sqrt(n)) # 1.22

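3. Find the critical value

There are two ways to find the critical value: a t-table or statistical software. Either way, we need the degrees of freedom, n−1, and the significance level, α.

# significance level (in R the leading zero is optional: .05 is the same as 0.05)
alpha <- .05
# see the R documentation for qt()
t.alpha <- qt(1 - alpha, df = n - 1) # 1.711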

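4. Compare the test statistic with the critical value

Conclusion: we found a critical value of 1.711 and a test statistic of 1.22. Since 1.22 < 1.711, I cannot reject the null hypothesis. In other words, the test statistic is not in the "critical region", and I do not have enough evidence that the average height of Chinese athletes is greater than 170. One point to be clear about: a different significance level may lead to a different conclusion.

Next, I use the P-value approach for the same hypothesis test. The first two steps are identical for both approaches, so I start directly from step 3.

3. Find the p-value

Finding the p-value means finding the area under the t-distribution curve from the test statistic to positive infinity.

# t is the test statistic and n is the sample size, both computed above
pval <- pt(t, df = n - 1, lower.tail = FALSE) # 0.117

4. Compare the p-value with α

Conclusion: since the p-value of 0.117 is greater than α = 0.05, I cannot reject the null hypothesis. In other words, I do not have enough evidence that the average height of Chinese athletes is greater than 170.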
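As a cross-check of the manual calculation, R's built-in `t.test()` can run the same right-tailed test in one call. The sketch below repeats the height data so that it runs on its own; `res` is just an illustrative variable name.

```r
height <- c(170, 167, 174, 179, 179, 156, 163, 156, 187, 156,
            183, 179, 174, 179, 170, 156, 187, 179, 183, 174,
            187, 167, 159, 170, 179)

# alternative = "greater" corresponds to H1: mu > 170
res <- t.test(height, mu = 170, alternative = "greater")

res$statistic   # test statistic t, about 1.22
res$parameter   # degrees of freedom, n - 1 = 24
res$p.value     # about 0.117
```

The statistic and p-value agree with the step-by-step values above, so both routes lead to the same decision at α = 0.05.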

Left-tailed test

The sample data are as follows:

11.5 11.8 15.7 16.1 14.1 10.5
15.2 19.0 12.8 12.4 19.2 13.5
16.5 13.5 14.4 16.7 10.9 13.0
15.1 17.1 13.3 12.4 8.5 14.3
12.9 11.1 15.0 13.3 15.8 13.5
9.3 12.2 10.3

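I want to know whether the average lifespan of crops sprayed with pesticide is less than the normal average lifespan of 15.7, so I define the following hypotheses.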

H0:μ=15.7
H1:μ<15.7

life <- c(11.5,11.8,15.7,16.1,14.1,10.5,15.2,19.0,12.8,12.4,19.2,13.5,16.5,13.5,14.4,16.7,10.9,13.0,15.1,17.1,13.3,12.4,8.5,14.3,12.9,11.1,15.0,13.3,15.8,13.5,9.3,12.2,10.3)
# sample mean 'xbar'
xbar <- mean(life) # 13.66
# hypothesized value 'mu'
mu <- 15.7
# sample standard deviation 's'
s <- sd(life) # 2.54
# sample size 'n'
n <- 33
# test statistic 't'
t <- (xbar - mu) / (s / sqrt(n)) # -4.60

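# significance level (in R the leading zero is optional: .05 is the same as 0.05)
alpha <- .05
# see the R documentation for qt()
t.alpha <- -qt(1 - alpha, df = n - 1) # -1.6939

Conclusion: we found a critical value of -1.6939 and a test statistic of -4.60. Since -4.60 < -1.6939, I can reject the null hypothesis. In other words, the test statistic is in the "critical region", and I have enough evidence that the average lifespan of the sprayed crops is less than the normal average lifespan of 15.7.

Finding the p-value means finding the area under the t-distribution curve from negative infinity up to the test statistic.

# t is the test statistic and n is the sample size, both computed above
pval <- pt(t, df = n - 1) # 3.174244e-05

Conclusion: since the p-value of 3.174244e-05 is less than α = 0.05, I can reject the null hypothesis. In other words, I have enough evidence that the average lifespan of the sprayed crops is less than the normal average lifespan of 15.7.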
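The left-tailed test can likewise be cross-checked with `t.test()`. The data are repeated here so that the sketch is self-contained; `res` is just an illustrative variable name.

```r
life <- c(11.5, 11.8, 15.7, 16.1, 14.1, 10.5, 15.2, 19.0, 12.8, 12.4, 19.2,
          13.5, 16.5, 13.5, 14.4, 16.7, 10.9, 13.0, 15.1, 17.1, 13.3, 12.4,
          8.5, 14.3, 12.9, 11.1, 15.0, 13.3, 15.8, 13.5, 9.3, 12.2, 10.3)

# alternative = "less" corresponds to H1: mu < 15.7
res <- t.test(life, mu = 15.7, alternative = "less")

res$statistic   # test statistic t, about -4.60
res$parameter   # degrees of freedom, n - 1 = 32
res$p.value     # far below alpha = 0.05
```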

Two-tailed test

The sample data are as follows:

7.65 7.60 7.65 7.70 7.55
7.55 7.40 7.40 7.50 7.50

I want to know whether the average size of a certain aircraft part is 7.5, so I define the following hypotheses.

H0:μ=7.5
H1:μ≠7.5

size <- c(7.65,7.60,7.65,7.70,7.55,7.55,7.40,7.40,7.50,7.50)
# sample mean 'xbar'
xbar <- mean(size) # 7.55
# hypothesized value 'mu'
mu <- 7.5
# sample standard deviation 's'
s <- sd(size) # 0.1027
# sample size 'n'
n <- 10
# test statistic 't'
t <- (xbar - mu) / (s / sqrt(n)) # 1.54

# significance level (in R the leading zero is optional: .05 is the same as 0.05)
alpha <- .05
# see the R documentation for qt()
t.half.alpha <- qt(1 - alpha / 2, df = n - 1) # 2.2622
c(-t.half.alpha, t.half.alpha) # [1] -2.2622  2.2622

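Conclusion: we found critical values of -2.2622 and 2.2622, and a test statistic of 1.54. Since 1.54 is neither greater than 2.2622 nor less than -2.2622, I cannot reject the null hypothesis. In other words, the test statistic is not in the "critical region", and I do not have enough evidence that the average part size differs from 7.5.

Finding the p-value means adding the area under the t-distribution curve from negative infinity up to the negative of the test statistic's absolute value to the area from its absolute value to positive infinity.

# t is the test statistic and n is the sample size, both computed above
# abs(t) makes this work whether the test statistic is positive or negative
pval <- 2 * pt(abs(t), df = n - 1, lower.tail = FALSE) # 0.158

Conclusion: since the p-value of 0.158 is greater than α = 0.05, I cannot reject the null hypothesis. In other words, I do not have enough evidence that the average part size differs from 7.5.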

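Whichever approach I use, the hypothesis test reaches the same conclusion. This is no coincidence: the p-value is less than or equal to α exactly when the test statistic falls in the critical region.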
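The agreement between the two approaches can be checked directly on the two-tailed example. The sketch below repeats the part-size data, runs `t.test()` (two-sided by default), and compares the decision from the critical value with the decision from the p-value; the variable names are illustrative.

```r
size  <- c(7.65, 7.60, 7.65, 7.70, 7.55, 7.55, 7.40, 7.40, 7.50, 7.50)
alpha <- 0.05
n     <- length(size)

res    <- t.test(size, mu = 7.5)        # H1: mu != 7.5
t_stat <- unname(res$statistic)         # about 1.54

# Decision by each approach
reject_by_critical <- abs(t_stat) > qt(1 - alpha / 2, df = n - 1)
reject_by_pvalue   <- res$p.value <= alpha

c(t = t_stat, p_value = res$p.value)
reject_by_critical == reject_by_pvalue
```

Neither approach rejects the null hypothesis here, which matches the conclusions reached step by step above.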

Chi-Square Tests

Next, I use the chi-square test to test whether two variables are independent.

Null Hypothesis: The two categorical variables are independent.
Alternative Hypothesis: The two categorical variables are dependent.

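The chi-square test statistic is computed with the following formula:

$\chi^2 = \sum \frac{(O - E)^2}{E}$

O: observed frequency
E: expected frequency under the null hypothesis, computed as follows:

$E = \frac{\text{row total} \times \text{column total}}{\text{sample size}}$

Next, we compare the chi-square test statistic χ² with the critical value χ²_α for (r − 1)(c − 1) degrees of freedom, where r and c are the numbers of rows and columns in the contingency table. If χ² > χ²_α, we reject the null hypothesis.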
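To show how the formulas fit together, here is a sketch that computes the statistic by hand. It enters the counts of the contingency table from the survey example in this article as a matrix, so it does not need any package; the variable names are illustrative.

```r
# Observed counts: rows = Smoke, columns = Exer
observed <- matrix(c( 7,  1,  3,
                     87, 18, 84,
                     12,  3,  4,
                      9,  1,  7),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(Smoke = c("Heavy", "Never", "Occas", "Regul"),
                                   Exer  = c("Freq", "None", "Some")))

# E = (row total * column total) / sample size, for every cell at once
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

chi2 <- sum((observed - expected)^2 / expected)        # test statistic
dof  <- (nrow(observed) - 1) * (ncol(observed) - 1)    # (r - 1)(c - 1)

alpha    <- 0.05
chi_crit <- qchisq(1 - alpha, df = dof)                # critical value

c(chi2 = chi2, dof = dof, critical = chi_crit)
chi2 > chi_crit   # TRUE would mean: reject the null hypothesis
```

`outer()` multiplies every row total by every column total, which gives the full matrix of expected counts in one step.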

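Example: Testing the Independence of Variables with the Chi-Square Test

The survey dataset, which ships with R's MASS package, contains two categorical variables, Exer and Smoke. Below, I use the chi-square test to check whether these two variables are independent.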

library(MASS)       # load the MASS package
tbl = table(survey$Smoke, survey$Exer)
tbl                 # the contingency table

        Freq None Some 
  Heavy    7    1    3 
  Never   87   18   84 
  Occas   12    3    4 
  Regul    9    1    7


chisq.test(tbl)

    Pearson's Chi-squared test

data:  tbl
X-squared = 5.4885, df = 6, p-value = 0.4828

Warning message:
In chisq.test(tbl) : Chi-squared approximation may be incorrect

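Since the p-value of 0.4828 is greater than the .05 significance level, we cannot reject the null hypothesis. In other words, we have no evidence that smoking habit and exercise level are dependent. The warning appears because some cells have small expected counts, so the chi-square approximation may be unreliable for this table.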
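The warning can be investigated by looking at the expected counts that `chisq.test()` returns. The sketch below enters the table counts as a matrix so it runs without the MASS package, and then shows one possible alternative, a Monte Carlo p-value via the `simulate.p.value` argument; treating that as the remedy is an assumption of this sketch, not something the original analysis does.

```r
observed <- matrix(c( 7,  1,  3,
                     87, 18, 84,
                     12,  3,  4,
                      9,  1,  7),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(Smoke = c("Heavy", "Never", "Occas", "Regul"),
                                   Exer  = c("Freq", "None", "Some")))

res <- suppressWarnings(chisq.test(observed))
round(res$expected, 2)        # expected counts under independence
sum(res$expected < 5)         # how many cells have an expected count below 5

# One possible alternative: simulate the p-value instead of relying on
# the chi-square approximation
set.seed(1)
p_sim <- chisq.test(observed, simulate.p.value = TRUE, B = 10000)$p.value
p_sim
```

If the simulated p-value is also well above 0.05, the conclusion of not rejecting the null hypothesis does not hinge on the approximation.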

References

This article is summarized from: https://onlinecourses.science.psu.edu/statprogram/review_of_basic_statistics

Xurtle

