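The Three Musketeers of Machine Learning
I have recently been preparing to optimize a recommender system. While studying the code of Spark's MLlib, I noticed that many of its parameters use NumPy. That led to this post, which walks through the "three musketeers" of machine learning in Python: NumPy, pandas, and matplotlib. Later on I also want to present the data as charts, which makes it much easier to grasp.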
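numpy
NumPy is an extension library for Python. It supports large multi-dimensional arrays and matrix operations, and it ships a large collection of mathematical functions that work on arrays. Many of NumPy's compiled routines release Python's GIL while they run and the heavy lifting happens in C, so array operations are fast. NumPy is the foundation of a large number of machine learning frameworks.
Creating a simple array with NumPy
import numpy as np
a=[1,2,3,4]
b=np.array(a)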
Inspecting array attributes
Number of elements
b.size
Shape of the array
b.shape
Number of dimensions
b.ndim
Element type
b.dtype
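The attributes are easier to tell apart on an array with more than one dimension. The sketch below uses a hypothetical 2x3 array `m` that is not part of the original example.

```python
import numpy as np

# Hypothetical 2x3 array, used only for illustration.
m = np.array([[1, 2, 3], [4, 5, 6]])

print(m.size)   # total number of elements
print(m.shape)  # length of each dimension
print(m.ndim)   # number of dimensions
print(m.dtype)  # element type; the integer width depends on the platform
```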
API functions for quickly creating N-dimensional arrays
Create a 10x10 matrix filled with the float 1
array_one=np.ones([10,10])
Create a 10x10 matrix filled with the float 0
array_zero=np.zeros([10,10])
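Both functions default to float64 and accept a `dtype` argument. As a small sketch (the variable names are made up for this example), `np.full` covers the case where the fill value is neither 0 nor 1:

```python
import numpy as np

ones_f = np.ones((2, 3))                # default dtype is float64
zeros_i = np.zeros((2, 3), dtype=int)   # integer zeros instead of floats
sevens = np.full((2, 3), 7)             # any constant fill value

print(ones_f.dtype, zeros_i.dtype)
print(sevens)
```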
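Creating random arrays with np.random
- Uniform distribution
  - np.random.rand(10,10) creates an array of the given shape with values in [0, 1)
  - np.random.uniform(0,100) returns one float within the given range
  - np.random.randint(0,100) returns one integer within the given range
- Normal distribution
  - Given the mean, the standard deviation, and the shape
np.random.normal(1.75,0.1,(2,3))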
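Random output changes on every run, which makes experiments hard to repeat. One option, sketched here on the assumption that a recent NumPy is installed, is the `np.random.default_rng` generator with a fixed seed. The seed 42 and the variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)     # arbitrary seed for repeatable draws

u = rng.random((10, 10))            # uniform floats in [0, 1)
x = rng.uniform(0, 100)             # one uniform float in [0, 100)
k = rng.integers(0, 100)            # one integer in [0, 100)
h = rng.normal(1.75, 0.1, (2, 3))   # normal: mean 1.75, std 0.1, shape (2, 3)

print(u.shape, x, k)
print(h)
```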
Array indexing and slicing
arr=np.random.normal(1.75,0.1,(4,5))
print(arr)
# Rows 1-2 and columns 2-3 (the end of each slice is exclusive)
after_arr=arr[1:3,2:4]
print(after_arr)
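One detail worth knowing: a basic slice is a view onto the original array, not a copy. The sketch below uses `np.arange` instead of random values so the numbers are predictable; `sub` and `independent` are names made up for the example.

```python
import numpy as np

arr = np.arange(20).reshape(4, 5)
sub = arr[1:3, 2:4]        # rows 1-2, columns 2-3
print(sub)                 # the values 7, 8 and 12, 13

sub[0, 0] = -1             # writing through the view changes arr as well
print(arr[1, 2])           # -1

independent = arr[1:3, 2:4].copy()   # copy() detaches the data from arr
independent[0, 0] = 99
print(arr[1, 2])           # still -1
```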
Changing an array's shape (the number of elements must stay the same)
one_20=np.ones([20])
print(one_20)
one_4_5=one_20.reshape([4,5])
print(one_4_5)
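To illustrate the element-count rule, the sketch below lets NumPy infer one dimension with `-1` and then shows what happens when the counts do not match. The names `flat`, `grid`, and `error` are only for this example.

```python
import numpy as np

flat = np.arange(20)
grid = flat.reshape(4, -1)    # -1 asks NumPy to infer the missing dimension (5)
print(grid.shape)

error = None
try:
    flat.reshape(3, 7)        # 3 * 7 = 21 elements, but flat has 20
except ValueError as exc:
    error = str(exc)
print(error)
```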
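NumPy computation (important)
stus_score=np.array([[80,88],[82,81],[84,75],[86,83],[75,81]])
# An element-wise comparison returns a boolean array
print(stus_score>80)
# Write 0 where the score is below 80 and 90 everywhere else
print(np.where(stus_score<80,0,90))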
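The boolean array can also be used as a mask, either to pull out the matching values or to count them. A minimal sketch on the same scores (`mask`, `high`, `n_high`, and `flags` are illustrative names):

```python
import numpy as np

stus_score = np.array([[80, 88], [82, 81], [84, 75], [86, 83], [75, 81]])

mask = stus_score > 80                     # boolean array, same shape as the scores
high = stus_score[mask]                    # 1-D array of the values above 80
n_high = int(mask.sum())                   # True counts as 1
flags = np.where(stus_score < 80, 0, 90)   # 0 below 80, otherwise 90

print(high)
print(n_high)
print(flags)
```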
Statistical operations
- Maximum along an axis: amax (argument 1: the array; argument 2: axis=0/1; 0 reduces down each column, 1 reduces across each row)
result=np.amax(stus_score,axis=0)
print(result)
result=np.amax(stus_score,axis=1)
print(result)
- Minimum along an axis: amin
result=np.amin(stus_score,axis=0)
print(result)
result=np.amin(stus_score,axis=1)
print(result)
- Mean along an axis: mean
result=np.mean(stus_score,axis=0)
print(result)
result=np.mean(stus_score,axis=1)
print(result)
- Standard deviation: std (for the variance use np.var)
result=np.std(stus_score,axis=0)
print(result)
result=np.std(stus_score,axis=1)
print(result)
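The `axis` argument is the easiest part to get backwards, so here is a compact sketch on the original score table. It also checks that `np.var` is the square of `np.std`. The variable names are illustrative.

```python
import numpy as np

stus_score = np.array([[80, 88], [82, 81], [84, 75], [86, 83], [75, 81]])

col_max = np.amax(stus_score, axis=0)    # one value per column (per subject)
row_max = np.amax(stus_score, axis=1)    # one value per row (per student)
col_mean = np.mean(stus_score, axis=0)
col_std = np.std(stus_score, axis=0)
col_var = np.var(stus_score, axis=0)     # variance equals std squared

print(col_max, row_max)
print(col_mean)
print(np.allclose(col_std ** 2, col_var))
```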
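Array operations
Operations between an array and a number
# Add 5 to every value in the first column
stus_score[:,0]=stus_score[:,0]+5
print(stus_score)
# Multiply every value in the first column by 5
stus_score[:,0]=stus_score[:,0]*5
print(stus_score)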
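An operation with a single number is applied to every element. The sketch below, on a hypothetical two-row array, also shows that multiplying an integer array by a float produces a new float array and leaves the integer array as it was.

```python
import numpy as np

scores = np.array([[80, 88], [82, 81]])   # hypothetical integer scores

scores[:, 0] = scores[:, 0] + 5           # in-place update of the first column
scaled = scores * 0.5                     # new float array; scores is unchanged

print(scores)
print(scaled)
```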
Matrix multiplication with np.dot()
Shape rule
(M rows, N columns) * (N rows, Z columns) = (M rows, Z columns)
# Weights for the two score columns
q = np.array([[0.4], [0.6]])
result = np.dot(stus_score, q)
print(result)
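Applied to the original, unmodified score table, the shape rule gives (5, 2) times (2, 1) = (5, 1): one weighted score per student. This sketch also uses the `@` operator, which for 2-D arrays computes the same product as `np.dot`.

```python
import numpy as np

stus_score = np.array([[80, 88], [82, 81], [84, 75], [86, 83], [75, 81]])
q = np.array([[0.4], [0.6]])      # weights: 40% first column, 60% second

weighted = np.dot(stus_score, q)  # shape (5, 2) times (2, 1) gives (5, 1)
same = stus_score @ q             # equivalent for 2-D arrays

print(weighted.shape)
print(weighted.ravel())
```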
Stacking matrices
Vertical stacking
v1 = [[0, 1, 2, 3, 4, 5],
[6, 7, 8, 9, 10, 11]]
v2 = [[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]]
result=np.vstack((v1,v2))
print(result)
Horizontal stacking
v1 = [[0, 1, 2, 3, 4, 5],
[6, 7, 8, 9, 10, 11]]
v2 = [[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]]
result=np.hstack((v1,v2))
print(result)
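The two functions differ only in which dimension grows. A short sketch that checks the resulting shapes (`tall` and `wide` are illustrative names):

```python
import numpy as np

v1 = np.arange(0, 12).reshape(2, 6)
v2 = np.arange(12, 24).reshape(2, 6)

tall = np.vstack((v1, v2))   # rows are appended: (2, 6) + (2, 6) gives (4, 6)
wide = np.hstack((v1, v2))   # columns are appended: (2, 6) + (2, 6) gives (2, 12)

print(tall.shape, wide.shape)
```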
Reading data with np.genfromtxt
result=np.genfromtxt("./students_score.csv",delimiter=",")
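Since the CSV file itself is not shown, the sketch below feeds `np.genfromtxt` an in-memory string instead. The column names and values are invented for illustration, and `skip_header=1` is used on the assumption that the file starts with a header row.

```python
import io

import numpy as np

# Made-up CSV content standing in for a real file.
csv_text = "math,english\n80,88\n82,81\n"

data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(data.shape)
print(data)
```

Without `skip_header`, a text header row would be parsed as numbers and come back as `nan`.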
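pandas
pandas is an open-source Python library built on NumPy and designed specifically for data analysis.
The two core pandas data structures
Series (one-dimensional data)
import numpy as np
import pandas as pd
# Default integer index
print(pd.Series(np.arange(4,10)))
# Explicit index labels
pd.Series([11,12,14],index=["北京","上海","深圳"])
# From a dict: the keys become the index
pd.Series({"北京":11,"上海":12,"深圳":14})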
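Once a Series has labels, values can be looked up by label, by position, or by a boolean condition. A minimal sketch using the same city labels as above:

```python
import pandas as pd

s = pd.Series([11, 12, 14], index=["北京", "上海", "深圳"])

by_label = s["上海"]     # lookup by index label
by_position = s.iloc[0]  # lookup by integer position
filtered = s[s > 11]     # boolean filtering keeps the labels

print(by_label, by_position)
print(filtered)
```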
DataFrame (multi-feature data with both a row index and a column index)
# Create a DataFrame with 3 rows and 4 columns
data_3_4 = pd.DataFrame(np.arange(10, 22).reshape(3, 4))
# Print the data
print(data_3_4)
# Print the first row
print(data_3_4[:1])
# Print the first column
print(data_3_4[:][0])
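Plain square brackets mean different things depending on what goes inside them: a slice selects rows, a label selects a column. As an alternative sketch, `iloc` (by position) and `loc` (by label) make the intent explicit:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(10, 22).reshape(3, 4))

first_row = data.iloc[0]        # first row, by position
first_col = data.iloc[:, 0]     # first column, by position
cell = data.loc[1, 2]           # row label 1, column label 2

print(first_row.tolist())
print(first_col.tolist())
print(cell)
```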
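# Read the data
result = pd.read_csv("./students_score.csv")
# Shape of the data
result.shape
# Data type of each column
result.dtypes
# Number of dimensions
result.ndim
# Row index (start / stop / step)
result.index
# Names of the columns
result.columns
# The data as a NumPy array
result.values
# Print the first 5 rows
print("--> first 5 rows:")
print(result.head(5))
# Print the last 5 rows
print("--> last 5 rows:")
print(result.tail(5))
# Print summary statistics (handy when experimenting)
print("--> summary:")
print(result.describe())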
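Reading data with pandas (CSV as an example)
pandas.read_csv(filepath_or_buffer,sep=",",names=None,usecols=None)
Return type: DataFrame
# First six values of the 姓名 (name) column
result['姓名'][0:6]
# Rows where age is greater than 23
result[result['age']>23]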
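To try `read_csv` without a file on disk, the sketch below reads from an in-memory string. The `name` and `age` columns and their values are invented for the example and are not the contents of the original CSV.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file.
csv_text = "name,age\nAnn,22\nBob,25\nCid,31\n"

df = pd.read_csv(io.StringIO(csv_text))
older = df[df["age"] > 23]                                        # boolean row filter
only_age = pd.read_csv(io.StringIO(csv_text), usecols=["age"])    # load a single column

print(older)
print(only_age.columns.tolist())
```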
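IMDB_1000 = pd.read_csv("./IMDB-Movie-Data.csv")
# Column names and their data types
print(IMDB_1000.dtypes)
# Sort the 1000 movies by rating in descending order; ascending defaults to True, here it is False. sort_values returns a new DataFrame
IMDB_1000.sort_values(by="Rating", ascending=False)
# The longest movie
IMDB_1000[IMDB_1000["Runtime (Minutes)"]==IMDB_1000["Runtime (Minutes)"].max()]
# The shortest movie
IMDB_1000[IMDB_1000["Runtime (Minutes)"]==IMDB_1000["Runtime (Minutes)"].min()]
# Mean runtime
IMDB_1000["Runtime (Minutes)"].mean()
# Drop rows with missing values (returns a new DataFrame; IMDB_1000 itself is unchanged)
IMDB_1000.dropna()
# Fill missing revenue with the column mean; assigning the result back avoids inplace=True on a column selection
IMDB_1000["Revenue (Millions)"] = IMDB_1000["Revenue (Millions)"].fillna(IMDB_1000["Revenue (Millions)"].mean())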
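The two ways of handling missing values behave differently, which is easiest to see on a tiny table. The three-row `movies` frame below is invented for illustration and only borrows the column name from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-movie table with one missing revenue value.
movies = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Revenue (Millions)": [100.0, np.nan, 50.0],
})

dropped = movies.dropna()   # new frame without the incomplete row

mean_revenue = movies["Revenue (Millions)"].mean()   # NaN is skipped: (100 + 50) / 2
movies["Revenue (Millions)"] = movies["Revenue (Millions)"].fillna(mean_revenue)

print(len(dropped))
print(movies)
```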
# Read the data online and name the columns according to the dataset documentation
bcw = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", names=["Sample code number","Clump Thickness","Uniformity of Cell Size","Uniformity of Cell Shape", "Marginal Adhesion","Single Epithelial Cell Size","Bare Nuclei","Bland Chromatin","Normal Nucleoli","Mitoses","Class:"])
# Preprocessing: replace "?" in the data with np.nan
bcw=bcw.replace(to_replace="?",value=np.nan)
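A column that contains "?" is read in as text, so after the replacement it usually still needs a numeric conversion. The sketch below shows one way to do that with `pd.to_numeric` on a made-up three-row column. It is an assumed follow-up step, not something the original code does.

```python
import numpy as np
import pandas as pd

# Hypothetical column read as strings because of the "?" placeholder.
raw = pd.DataFrame({"Bare Nuclei": ["1", "?", "10"]})

clean = raw.replace(to_replace="?", value=np.nan)
clean["Bare Nuclei"] = pd.to_numeric(clean["Bare Nuclei"])   # strings become numbers

n_missing = int(clean["Bare Nuclei"].isna().sum())
print(clean)
print(n_missing)
```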
# Read the first 10 rows
train = pd.read_csv("./train.csv", nrows = 10)
# Interpret the time column as Unix timestamps in seconds and convert it to datetime
train["time"] = pd.to_datetime(train["time"], unit="s")
# Add the columns year, month, weekday
train["year"] = pd.DatetimeIndex(train["time"]).year
train["month"] = pd.DatetimeIndex(train["time"]).month
train["weekday"] = pd.DatetimeIndex(train["time"]).weekday
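Here is the same conversion on three hand-picked timestamps, so the derived columns can be checked by eye. This sketch uses the `.dt` accessor, which is an alternative to wrapping the column in `pd.DatetimeIndex`. The sample values are not from the original `train.csv`.

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds.
train = pd.DataFrame({"time": [0, 86400, 1_000_000_000]})

train["time"] = pd.to_datetime(train["time"], unit="s")
train["year"] = train["time"].dt.year
train["month"] = train["time"].dt.month
train["weekday"] = train["time"].dt.weekday   # Monday is 0, Sunday is 6

print(train)
```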
Merging tables
# Read the three tables
user_info = pd.read_csv("./user_info.csv")
order_info = pd.read_csv("./order_info.csv")
goods_info = pd.read_csv("./goods_info.csv")
# Merge the three tables on their shared key columns
u_o = pd.merge(user_info, order_info, how="left", on="user_id")
u_o_g = pd.merge(u_o, goods_info, how="left", on="goods_name")
# Cross-tabulation showing how user names (the 姓名 column) relate to goods names
user_goods = pd.crosstab(u_o_g["姓名"],u_o_g["goods_name"])
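Because the three CSV files are not included, the sketch below rebuilds the same pipeline on tiny in-memory tables. All rows, and the `name` and `price` columns, are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the three CSV files.
user_info = pd.DataFrame({"user_id": [1, 2], "name": ["Ann", "Bob"]})
order_info = pd.DataFrame({"user_id": [1, 1, 2],
                           "goods_name": ["tea", "milk", "tea"]})
goods_info = pd.DataFrame({"goods_name": ["tea", "milk"], "price": [3, 2]})

u_o = pd.merge(user_info, order_info, how="left", on="user_id")
u_o_g = pd.merge(u_o, goods_info, how="left", on="goods_name")

# Rows are users, columns are goods, cells count the purchases.
user_goods = pd.crosstab(u_o_g["name"], u_o_g["goods_name"])
print(user_goods)
```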
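Grouping tables
starbucks = pd.read_csv("./directory.csv")
# Count the Starbucks stores in each country (count() reports the non-null values of every column)
starbucks.groupby(["Country"]).count()
# Count the Starbucks stores in each country and state/province
starbucks.groupby(["Country", "State/Province"]).count()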
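`count()` skips missing values, so its numbers can differ from column to column. When only the number of rows per group matters, `size()` is a possible alternative. The three-store table below is made up to show the difference.

```python
import pandas as pd

# Hypothetical store list; one store has no city recorded.
stores = pd.DataFrame({
    "Country": ["CN", "CN", "US"],
    "State/Province": ["11", "31", "CA"],
    "City": ["Beijing", None, "LA"],
})

per_country = stores.groupby("Country").size()   # rows per group
counts = stores.groupby("Country").count()       # non-null values per column

print(per_country)
print(counts)
```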
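matplotlib
matplotlib is the foundational 2D plotting package for Python. It lets users turn data into graphics and supports a wide range of output formats. Here we explore its common usage through four small examples.
Drawing a line chart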
import matplotlib.pyplot as plt
import random
# plt.plot([1, 2, 3, 4])
# plt.ylabel("some numbers")
# plt.show()
#
# plt.plot([1, 2, 3, 4], [1, 4, 9, 16], 'ro')
# plt.show()
beijing_x = [_ for _ in range(0, 24)]
beijing_y = [random.randint(10, 30) for _ in range(0, 24)]
plt.plot(beijing_x, beijing_y, label="beijing")
shanghai_x = [_ for _ in range(0, 24)]
shanghai_y = [random.randint(10, 20) for _ in range(0, 24)]
plt.plot(shanghai_x, shanghai_y, label="shanghai")
hefei_x = [_ for _ in range(0, 24)]
hefei_y = [random.randint(30, 40) for _ in range(0, 24)]
plt.plot(hefei_x, hefei_y, label="hefei", color="#823384", linestyle=":", linewidth=3, alpha=0.3)
# Axis ticks and their labels
x_ = [x_ for x_ in range(24)]
x_desc = ["{}h".format(_) for _ in x_]
plt.xticks(x_, x_desc)
y_ = [_ for _ in range(50)][::2]
y_desc = ["{}c".format(_) for _ in y_]
plt.yticks(y_, y_desc)
plt.xlabel("time")
plt.ylabel("temperature")
plt.title("the temperature change in one day")
plt.legend(loc="best")
plt.show()
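The same kind of chart can also be drawn with matplotlib's object-oriented interface, which keeps the figure and axes in explicit variables. The sketch below is an alternative under a few assumptions. It selects the non-interactive Agg backend so it runs without a display, fixes an arbitrary random seed, and renders to an in-memory PNG instead of calling `plt.show()`.

```python
import io
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt

random.seed(0)  # arbitrary seed so the fake temperatures are repeatable
hours = list(range(24))
beijing = [random.randint(10, 30) for _ in hours]
shanghai = [random.randint(10, 20) for _ in hours]

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(hours, beijing, label="beijing")
ax.plot(hours, shanghai, label="shanghai", linestyle=":")
ax.set_xticks(hours[::3])                              # a tick every 3 hours
ax.set_xticklabels([f"{h}h" for h in hours[::3]])
ax.set_xlabel("time")
ax.set_ylabel("temperature")
ax.set_title("the temperature change in one day")
ax.legend(loc="best")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # render to an in-memory PNG
plt.close(fig)
```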
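Drawing a bar chart
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['font.sans-serif'] = ['FangSong'] # set the default font (FangSong must be installed; it is needed only for Chinese labels)
mpl.rcParams['axes.unicode_minus'] = False # keep the minus sign '-' from rendering as a box in saved figures
# Bar chart of the ages of the main characters in Detective Conan
role_list = ["michael", "sdsds", "sdasd", "ffff", "gggg", "bbb", "nnn", "lll"]
role_age = [7, 17, 7, 34, 32, 30, 27, 46]
# Real ages
role_true_age = [18, 17, 18, 34, 45, 30, 27, 46]
x = [i + 1 for i, role in enumerate(role_list)]
y = role_age
y2 = role_true_age
plt.figure(figsize=(15, 8), dpi=100)
# With align="edge", a negative width draws the bar to the left of the tick and a positive width to the right
plt.bar(x, y, width=-0.4, align="edge", label="role age", color="#509839")
plt.bar(x, y2, width=0.3, align="edge", label="role real age", color="#c03035")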
x_desc = [_ for _ in role_list]
plt.xticks(x, x_desc)
y = range(50)[::5]
plt.yticks(y)
plt.xlabel("role")
plt.ylabel("age")
plt.title("the role in cartoon Detective conan")
plt.legend(loc="best")
plt.show()
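Another common way to place two bars side by side is to shift the x positions by half a bar width in each direction. The sketch below is one such variant with made-up role names and ages. It uses the Agg backend so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

roles = ["a", "b", "c"]       # hypothetical role names
shown_age = [7, 17, 34]
real_age = [18, 17, 34]

x = np.arange(len(roles))
width = 0.4

fig, ax = plt.subplots()
left = ax.bar(x - width / 2, shown_age, width=width, label="role age")
right = ax.bar(x + width / 2, real_age, width=width, label="role real age")
ax.set_xticks(x)              # ticks sit between each pair of bars
ax.set_xticklabels(roles)
ax.set_xlabel("role")
ax.set_ylabel("age")
ax.legend(loc="best")
plt.close(fig)
```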