Note: for setting up Jupyter Notebook, see
https://blog.csdn.net/red_stone1/article/details/72858962
If the downloaded CSV file causes problems, see:
https://jingyan.baidu.com/album/c843ea0b9a641477931e4a89.html?picindex=2
%matplotlib inline
import random
import numpy as np
import scipy as sp
import scipy.stats  # make sure sp.stats is loaded before sp.stats.pearsonr is called
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
sns.set_context("talk")
Anscombe’s quartet
Anscombe’s quartet comprises four datasets, and is rather famous. Why? You’ll find out in this exercise.
anascombe = pd.read_csv('data/anscombe.csv')
anascombe.head()
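Explanation:
pandas.read_csv
Reads a CSV (comma-separated) file into a DataFrame; it also supports importing only part of a file and iterating over it in chunks.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
A detailed summary of the parameters:
https://www.cnblogs.com/datablog/p/6127000.html
anascombe.head()
head(m) returns the first m rows; if m is omitted, it returns the first five rows.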
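The partial import, chunked iteration, and head(m) behaviour described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the in-memory CSV text and its values are made up, and only the read_csv parameters usecols, nrows, and chunksize are used.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk (values are made up).
csv_text = (
    "dataset,x,y\n"
    "A,1,2.0\nA,2,4.0\nA,3,6.0\n"
    "B,1,1.5\nB,2,2.5\nB,3,3.5\n"
)

# Partial import: keep two columns and only the first four data rows.
partial = pd.read_csv(io.StringIO(csv_text), usecols=["dataset", "x"], nrows=4)

# Iteration: with chunksize, read_csv yields DataFrames of at most that many rows.
chunk_sizes = [len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4)]

# head(m) returns the first m rows; head() with no argument returns the first five.
full = pd.read_csv(io.StringIO(csv_text))
first_two = full.head(2)
first_five = full.head()
print(first_two)
```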
Output:
Part 1
For each of the four datasets…
Compute the mean and variance of both x and y
Compute the correlation coefficient between x and y
Compute the linear regression line: $y = \beta_0 + \beta_1 x + \epsilon$ (hint: use statsmodels and look at the Statsmodels notebook)
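Compute the mean and variance:
print(anascombe.groupby('dataset')['x'].mean())
print(anascombe.groupby('dataset')['y'].mean())
print(anascombe.groupby('dataset')['x'].var())
print(anascombe.groupby('dataset')['y'].var())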
Explanation:
groupby splits the data into groups according to the key passed in
https://blog.csdn.net/leonis_v/article/details/51832916
pandas.DataFrame.mean()
Returns the mean of the data
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html
pandas.DataFrame.var()
Returns the variance of the data
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.var.html
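As a compact alternative, the per-group mean and variance of both columns can be collected in a single table with agg. The sketch below uses a small made-up DataFrame rather than anscombe.csv; note that pandas var() computes the sample variance (ddof=1) by default.

```python
import pandas as pd

# Hypothetical toy data with the same column layout as the exercise (values are made up).
df = pd.DataFrame({
    "dataset": ["A", "A", "A", "B", "B", "B"],
    "x": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
    "y": [2.0, 4.0, 6.0, 1.0, 1.0, 4.0],
})

# One row per group, with a (column, statistic) pair for each output column.
summary = df.groupby("dataset")[["x", "y"]].agg(["mean", "var"])
print(summary)
```

Individual entries can then be read with a (column, statistic) key, for example summary.loc["A", ("x", "mean")].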
Output:
Compute the correlation coefficient between x and y:
# Four groups of 11 points each (assumes the rows are ordered by dataset)
X1 = anascombe.x[0:11].values
X2 = anascombe.x[11:22].values
X3 = anascombe.x[22:33].values
X4 = anascombe.x[33:44].values
Y1 = anascombe.y[0:11].values
Y2 = anascombe.y[11:22].values
Y3 = anascombe.y[22:33].values
Y4 = anascombe.y[33:44].values
coefficients = [0,0,0,0]
coefficients[0] = sp.stats.pearsonr(X1, Y1)[0] # first return value: the correlation coefficient
coefficients[1] = sp.stats.pearsonr(X2, Y2)[0]
coefficients[2] = sp.stats.pearsonr(X3, Y3)[0]
coefficients[3] = sp.stats.pearsonr(X4, Y4)[0]
for coefficient in coefficients:
    print(coefficient)
Explanation:
- First extract the x and y values of each group by hand
pandas.DataFrame.values
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.values.html
scipy.stats.pearsonr(x, y)
Computes the Pearson correlation coefficient
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html#scipy.stats.pearsonr
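Extracting each group by row position depends on how the file happens to be ordered. A possible alternative is to let groupby select the rows and call pearsonr once per group. The sketch below assumes only the column names dataset, x, and y; the data values and group labels are made up for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data (values are made up); group A lies exactly on a line.
df = pd.DataFrame({
    "dataset": ["A"] * 4 + ["B"] * 4,
    "x": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0, 1.0, 3.0, 2.0, 4.0],
})

correlations = {}
for name, group in df.groupby("dataset"):
    # pearsonr returns the correlation coefficient first and the p-value second.
    r, p_value = stats.pearsonr(group["x"].to_numpy(), group["y"].to_numpy())
    correlations[name] = r
    print(name, r)
```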
Output:
Compute the linear regression line
#1
X_I = sm.add_constant(X1)
model_I = sm.OLS(Y1, X_I)
result_I = model_I.fit()
params_I = result_I.params
print("I: y =", params_I[0], "+", params_I[1], "x")
#2
X_II = sm.add_constant(X2)
model_II = sm.OLS(Y2, X_II)
result_II = model_II.fit()
params_II = result_II.params
print("II: y =", params_II[0], "+", params_II[1], "x")
#3
X_III = sm.add_constant(X3)
model_III = sm.OLS(Y3, X_III)
result_III = model_III.fit()
params_III = result_III.params
print("III: y =", params_III[0], "+", params_III[1], "x")
#4
X_IV = sm.add_constant(X4)
model_IV = sm.OLS(Y4, X_IV)
result_IV = model_IV.fit()
params_IV = result_IV.params
print("IV: y =", params_IV[0], "+", params_IV[1], "x")
Explanation:
Reference: https://blog.csdn.net/cymy001/article/details/78364652
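For a simple regression with one predictor, the intercept and slope from an ordinary least-squares fit can be cross-checked against the closed-form estimates $\beta_1 = \mathrm{cov}(x, y) / \mathrm{var}(x)$ and $\beta_0 = \bar{y} - \beta_1 \bar{x}$. The sketch below applies these formulas per group with NumPy; the data values and group labels are made up, and it is meant as a sanity check, not as a replacement for the statsmodels code above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up); group A lies exactly on y = 1 + 2x.
df = pd.DataFrame({
    "dataset": ["A"] * 4 + ["B"] * 4,
    "x": [0.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 4.0],
    "y": [1.0, 3.0, 5.0, 7.0, 1.0, 3.0, 2.0, 4.0],
})

lines = {}
for name, group in df.groupby("dataset"):
    x = group["x"].to_numpy()
    y = group["y"].to_numpy()
    # Slope: sample covariance of x and y divided by sample variance of x.
    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    # Intercept: the fitted line passes through the point of means.
    intercept = y.mean() - slope * x.mean()
    lines[name] = (intercept, slope)
    print(f"{name}: y = {intercept:.3f} + {slope:.3f} x")
```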
Output:
Part 2
Using Seaborn, visualize all four datasets.
hint: use sns.FacetGrid combined with plt.scatter
sns.set(style='whitegrid')
gr = sns.FacetGrid(anascombe, col="dataset", hue="dataset", height=3)  # "size" was renamed to "height" in seaborn 0.9
gr.map(plt.scatter, 'x', 'y')
plt.show()
Reference: the official documentation and https://blog.csdn.net/yutao03081/article/details/79064669
Output: