Original content takes effort; if you repost it, please credit the source.
Common Data Analysis Steps
1. Import the basic libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import types
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
2. Load the training and test data:
train_data = pd.read_csv("D:/ML/Data/train.csv")
test_data = pd.read_csv("D:/ML/Data/test.csv")
3. Take a quick look at the data's format and types:
train_data.head(10)
train_data.info()
Country Happiness.Rank Happiness.Score Whisker.high Whisker.low Economy..GDP.per.Capita. Family Health..Life.Expectancy. Freedom Generosity Trust..Government.Corruption. Dystopia.Residual
0 Norway 1 7.537 7.594445 7.479556 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027
1 Denmark 2 7.522 7.581728 7.462272 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707
2 Iceland 3 7.504 7.622030 7.385970 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715
3 Switzerland 4 7.494 7.561772 7.426227 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716
4 Finland 5 7.469 7.527542 7.410458 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182
5 Netherlands 6 7.377 7.427426 7.326574 1.503945 1.428939 0.810696 0.585384 0.470490 0.282662 2.294804
6 Canada 7 7.316 7.384403 7.247597 1.479204 1.481349 0.834558 0.611101 0.435540 0.287372 2.187264
7 New Zealand 8 7.314 7.379510 7.248490 1.405706 1.548195 0.816760 0.614062 0.500005 0.382817 2.046456
8 Sweden 9 7.284 7.344095 7.223905 1.494387 1.478162 0.830875 0.612924 0.385399 0.384399 2.097538
9 Australia 10 7.284 7.356651 7.211349 1.484415 1.510042 0.843887 0.601607 0.477699 0.301184 2.065211
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 12 columns):
Country 155 non-null object
Happiness.Rank 155 non-null int64
Happiness.Score 155 non-null float64
Whisker.high 155 non-null float64
Whisker.low 155 non-null float64
Economy..GDP.per.Capita. 155 non-null float64
Family 155 non-null float64
Health..Life.Expectancy. 155 non-null float64
Freedom 155 non-null float64
Generosity 155 non-null float64
Trust..Government.Corruption. 155 non-null float64
Dystopia.Residual 155 non-null float64
dtypes: float64(10), int64(1), object(1)
memory usage: 14.6+ KB
The output above shows how many features the training set contains and the data type of each feature (text or numeric); for the text features, it is easy to spot which ones can be binarized (for example, gender).
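Where a text feature takes exactly two values (the gender example above), it can be binarized directly; a minimal sketch, where the column name 'Sex' is a hypothetical placeholder rather than a column of this dataset:

```python
import pandas as pd

# Hypothetical two-valued text feature (not a column of the happiness data)
df = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# Binarize: map the two categories onto 0/1
df['Sex_binary'] = df['Sex'].map({'male': 0, 'female': 1})

# For a handful of categories, one-hot encoding is the usual alternative
dummies = pd.get_dummies(df['Sex'], prefix='Sex')
```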
4. For text features that cannot be binarized but take only a few distinct values (such as country or province), visualize their effect on the target feature as follows:
# Check the feature's cardinality first
feature_length = len(train_data['Feature'].unique())
print('There are %s distinct values of Feature in this table' % feature_length)
plt.figure(figsize=(18,18))
plt.title('Feature Correlation with Result ', y=1.05, size=15)
g=sns.stripplot(x='Feature',y='Result',data=train_data,jitter=True)
plt.xticks(rotation=45)
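The strip plot can be backed up with numbers: the mean of the target per category shows the same effect in tabular form. A sketch on toy data ('Feature' and 'Result' are the same placeholder names as above):

```python
import pandas as pd

# Toy data with the placeholder column names used above
df = pd.DataFrame({
    'Feature': ['A', 'A', 'B', 'B', 'C'],
    'Result':  [1.0, 3.0, 2.0, 4.0, 5.0],
})

# Mean target per category, strongest first
per_category = df.groupby('Feature')['Result'].mean().sort_values(ascending=False)
print(per_category)
```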
5. Examine the correlations between the features (this is important):
colormap=plt.cm.RdBu
plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train_data.corr(numeric_only=True), cmap=colormap, linecolor='white', linewidths=0.1, vmax=1.0, square=True, annot=True)
The heat map makes the pairwise correlations easy to read: the darker the cell where two features intersect, the stronger their relationship, and the number printed in each cell tells the same story.
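The same information can also be pulled out programmatically, for example by ranking every feature's absolute correlation with the target. A sketch on a toy frame (the column names are placeholders):

```python
import pandas as pd

# Toy frame standing in for train_data
df = pd.DataFrame({
    'Result': [1.0, 2.0, 3.0, 4.0],
    'FeatA':  [1.1, 2.1, 2.9, 4.2],  # tracks Result closely
    'FeatB':  [5.0, 1.0, 4.0, 2.0],  # only weakly related
})

# Absolute Pearson correlation of each feature with the target, strongest first
corr_with_target = df.corr()['Result'].drop('Result').abs().sort_values(ascending=False)
print(corr_with_target)
```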
6. Take a quick look at the distribution of the important features:
sns.distplot(train_data['Important Features'])
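A few summary statistics complement the distribution plot numerically; a sketch on synthetic data standing in for the important feature:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for an important numeric feature
s = pd.Series(np.random.default_rng(0).normal(5.4, 1.1, size=155))

# Numeric companions to the distribution plot
summary = {'mean': s.mean(), 'std': s.std(), 'skew': s.skew()}
print(summary)
```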
7. If the data contains countries or regions, they can be visualized on a map:
data = dict(type='choropleth', locations=train_data['Country'], locationmode='country names', z=train_data['Result'], text=train_data['Country'], colorbar={'title': 'Result'})
layout=dict(title='Global Result',geo=dict(showframe=False,projection={'type':'Mercator'}))
choromap3=go.Figure(data=[data],layout=layout)
iplot(choromap3)
8. For a supervised-learning problem, a model can then be trained as follows:
y = train_data['Result']
X = train_data.drop(['Result', 'Useless Features'], axis=1)
Here 'Useless Features' stands for the features that the steps above showed to have little effect on the result; dropping them keeps the prediction simple (although everything that exists arguably exists for a reason, so dropping features is debatable; more on that later). Note that the target column 'Result' itself must also be dropped from X, otherwise the model simply reads the answer out of its input.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, train_target, test_target = train_test_split(X, y, test_size=0.3, random_state=101)
lr = LinearRegression()
lr.fit(X_train, train_target)
predict = lr.predict(X_test)
print('predict data',predict)
print('-'*60)
print(test_target)
predict data [ 4.875 6.168 5.856 4.907 6.379 4.643 5.314 3.856 6.952 5.458
4.36 5.129 6.269 5.658 5.919 5.185 5.291 5.615 7.413 7.509
6.474 5.177 6.218 5.976 3.303 6.478 3.695 4.876 3.36 2.905
3.832 5.057 5.045 5.987 5.538 4.324 4.201 5.546 6.65 4.217
3.763 5.033 5.163 4.219 4.508 5.121 5.145 6.084]
------------------------------------------------------------
102 4.875
42 6.168
55 5.856
100 4.907
33 6.379
109 4.643
78 5.314
141 3.856
16 6.952
74 5.458
120 4.360
92 5.129
39 6.269
64 5.658
53 5.919
84 5.185
80 5.291
66 5.615
4 7.413
1 7.509
32 6.474
85 5.177
41 6.218
50 5.976
154 3.303
31 6.478
147 3.695
101 4.876
153 3.360
156 2.905
142 3.832
96 5.057
97 5.045
48 5.987
69 5.538
122 4.324
129 4.201
68 5.546
25 6.650
128 4.217
143 3.763
98 5.033
86 5.163
127 4.219
114 4.508
94 5.121
90 5.145
43 6.084
Name: Happiness Score, dtype: float64
The predictions are almost identical to the actual values. (That is suspiciously good: in this dataset the happiness score is essentially a linear combination of the factor columns, so a linear model can reproduce it almost exactly; on a typical problem, expect a visible gap.)
9. Visualize the test results:
plt.scatter(predict,test_target)
plt.xlabel('Predict')
plt.ylabel('Actual')
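Beyond the scatter plot, simple error metrics quantify the fit; a minimal numpy sketch, with the first few values copied from the printout in step 8 standing in for the full `predict` and `test_target` arrays:

```python
import numpy as np

# First few values from the printout above, standing in for the full arrays
predict     = np.array([4.875, 6.168, 5.856, 4.907])
test_target = np.array([4.875, 6.168, 5.856, 4.907])

mae  = np.mean(np.abs(predict - test_target))           # mean absolute error
rmse = np.sqrt(np.mean((predict - test_target) ** 2))   # root mean squared error
print('MAE:', mae, 'RMSE:', rmse)
```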
If you found this article useful, feel free to follow me so we can learn and improve together; and if you are job-hunting and interested in Alibaba, you can also send me your résumé for an internal referral: