Chapter1_Hands-On ML with sklearn & TF

First, a quick test of basic data processing in Python, using the pandas module.

import pandas as pd
import os
path=os.path.join("datasets","lifesat","")
path_oecd=path+"oecd_bli_2015.csv"
path_gdp=path+'gdp_per_capita.csv'
oecd_bli=pd.read_csv(path_oecd, thousands=',')
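#keep only the rows whose INEQUALITY value is "TOT" (the original line referenced an undefined name, life_sat)
oecd_bli=oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
#the pivot below sets Country as the index of the OECD data
oecd_bli=oecd_bli.pivot(index="Country",columns="Indicator",values="Value")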
gdp_per_capita=pd.read_csv(path_gdp,thousands=',',delimiter='\t',encoding='latin1',na_values="n/a")
gdp_per_capita.rename(columns={"2015":"GDP per capita"},inplace=True)
#set Country as the index of the GDP data as well
gdp_per_capita.set_index("Country",inplace=True)
#merge the two tables on their shared Country index
full_country_stats=pd.merge(left=oecd_bli,right=gdp_per_capita,left_index=True,right_index=True)
full_country_stats.sort_values(by="GDP per capita",inplace=True)
#print(full_country_stats)
#print(full_country_stats['Life satisfaction'])
#print(full_country_stats[["GDP per capita","Life satisfaction"]])
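
The pivot, merge, and column-selection steps above can be tried on a tiny table without downloading the CSV files. The following is a minimal sketch: the country names "A", "B", "C" and all values are made up for illustration and do not come from the OECD or GDP data.

```python
import pandas as pd

# Hypothetical long-format table shaped like the OECD file (values are made up).
long_df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Indicator": ["Life satisfaction", "Employment rate"] * 2,
    "Value": [7.0, 70.0, 6.0, 60.0],
})

# pivot: one row per Country, one column per Indicator.
wide = long_df.pivot(index="Country", columns="Indicator", values="Value")

# Hypothetical GDP table indexed by Country (values are made up).
gdp = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "GDP per capita": [30000.0, 20000.0, 10000.0],
}).set_index("Country")

# Merge on the shared index; the default inner join drops country "C",
# which has no row in the pivoted table.
merged = pd.merge(left=wide, right=gdp, left_index=True, right_index=True)
merged = merged.sort_values(by="GDP per capita")

# Single brackets return a Series; double brackets (a list of column
# names) return a DataFrame.
one_col = merged["Life satisfaction"]
two_cols = merged[["GDP per capita", "Life satisfaction"]]

print(merged)
print(type(one_col).__name__, type(two_cols).__name__)
```

Here `sort_values` is used without `inplace=True`, which is only a style choice for this sketch; the result is the same as in the code above.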

TEST1: A first simple machine learning example: modeling the relationship between GDP per capita and life satisfaction

import os
path=os.path.join("datasets","lifesat","")
def prepare_country_stats(oecd_bli,gdp_per_capita):
    oecd_bli=oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
    oecd_bli=oecd_bli.pivot(index="Country",columns="Indicator",values="Value")
    gdp_per_capita.rename(columns={"2015":"GDP per capita"},inplace=True)
    gdp_per_capita.set_index("Country",inplace=True)
    full_country_stats=pd.merge(left=oecd_bli,right=gdp_per_capita,left_index=True,right_index=True)
    full_country_stats.sort_values(by="GDP per capita",inplace=True)
    remove_indices=[0,1,6,8,33,34,35]
    keep_indices=list(set(range(36))-set(remove_indices))
    #Why double brackets? My current understanding: selecting several columns of a DataFrame takes a list of column names, i.e. item=["GDP per capita","Life satisfaction"] followed by full_country_stats[item]
    return full_country_stats[["GDP per capita","Life satisfaction"]].iloc[keep_indices]
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.linear_model
#load the data
path_oecd=path+"oecd_bli_2015.csv"
path_gdp=path+'gdp_per_capita.csv'
oecd_bli=pd.read_csv(path_oecd, thousands=',')
gdp_per_capita=pd.read_csv(path_gdp,thousands=',',delimiter='\t',encoding='latin1',na_values="n/a")
#prepare the data
country_stats=prepare_country_stats(oecd_bli,gdp_per_capita)
X=np.c_[country_stats["GDP per capita"]]
Y=np.c_[country_stats["Life satisfaction"]]

#Visualize the data
country_stats.plot(kind='scatter',x="GDP per capita",y="Life satisfaction")
#plt.show()

#select a linear model
model = sklearn.linear_model.LinearRegression()
#Train the model
model.fit(X,Y)
#Plot the model after training
#Y is 2D, so intercept_ has shape (1,) and coef_ has shape (1,1); index down to scalars
b0,k0=model.intercept_[0],model.coef_[0][0]
x0=np.linspace(0,60000,500)
plt.plot(x0,b0+k0*x0,'k')
plt.show()
#Make a prediction for Cyprus
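#With a single pair of brackets, predict raises: ValueError: Expected 2D array, got 1D array instead
#One pair of brackets is a 1D array, two pairs a 2D array. scikit-learn expects X with shape (n_samples, n_features), so one sample with one feature is written [[22587]]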
X_new=[[22587]]
print(model.predict(X_new))

[[ 5.96242338]]
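
The 2D-input requirement can be checked in isolation. The sketch below uses made-up points on the line y = 2x + 1 (not the country data) and assumes scikit-learn is installed; it shows that a 1D input is rejected while a 2D input with one row and one column is accepted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points on the line y = 2x + 1, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# np.c_ turns the 1D array into a column: 4 samples, 1 feature.
X = np.c_[x]
model = LinearRegression().fit(X, y)

# A 1D input is ambiguous (one sample or one feature?), so it is rejected.
try:
    model.predict([5.0])
    raised = False
except ValueError:
    raised = True

# A 2D input: one sample (row) with one feature (column).
pred = model.predict([[5.0]])

print(X.shape, raised, pred)
```

Because `y` is 1D in this sketch, `pred` is a 1D array with one element; in the code above `Y` is built with `np.c_`, which is why the printed prediction there has two pairs of brackets.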

Summary:

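A typical machine learning workflow:

  • Preprocess the data into a consistent format;
  • Explore the characteristics of the data;
  • Select a suitable machine learning model based on those characteristics;
  • Train the model on the formatted data;
  • Use the trained model to make predictions.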
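
The five steps can be sketched end to end on synthetic data. Everything numeric below (the GDP range, the slope 4e-5, the noise level, the seed) is an assumption chosen for illustration, not a result from the OECD data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# 1. Prepare: synthetic, made-up data standing in for the cleaned country table.
rng = np.random.default_rng(0)
gdp = np.linspace(10000, 50000, 20)
satisfaction = 5.0 + 4e-5 * gdp + rng.normal(0, 0.1, size=gdp.size)
data = pd.DataFrame({"GDP per capita": gdp, "Life satisfaction": satisfaction})

# 2. Explore: a strong linear correlation suggests a linear model is reasonable.
corr = data["GDP per capita"].corr(data["Life satisfaction"])

# 3. Select a model.
model = LinearRegression()

# 4. Train: X must be 2D (n_samples, n_features), hence the double brackets.
X = data[["GDP per capita"]].to_numpy()
y = data["Life satisfaction"].to_numpy()
model.fit(X, y)

# 5. Predict for a new, hypothetical GDP value.
pred = model.predict([[30000.0]])

print(round(corr, 3), model.coef_[0], pred)
```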