I. numpy

(I) Matrices

1. Creating a matrix

(1) mat()

Format 1: string

import numpy as np

A = np.mat('1 0 0 0;0 1 0 0;-1 2 1 0;1 1 0 1')
print(A)

Format 2: nested list

B = np.mat([[1,0,0,0],[0,1,0,0],[-1,2,1,0],[1,1,0,1]])
print(B)

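(2) matrix()

Accepts the same inputs as mat(). The difference is that matrix() copies its input by default, while mat() is an alias of asmatrix() and does not copy an existing matrix or ndarray. In NumPy 2.0 and later the mat alias is no longer available; use asmatrix() there.

# Format 1: string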
C = np.matrix('1 0 0 0;0 1 0 0;-1 2 1 0;1 1 0 1')
print(C)
# Format 2: nested list
D = np.matrix([[1,0,0,0],[0,1,0,0],[-1,2,1,0],[1,1,0,1]])
print(D)

(3) bmat()

Builds a larger matrix out of blocks, either from a string naming existing variables or from a nested list.

Big_Mat = np.bmat('A B;C D')
print(Big_Mat)
Big_Mat = np.bmat([[A,B],[C,D]])
print(Big_Mat)
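
For plain ndarrays, `np.block` assembles blocks in the same layout. The following is a minimal sketch rather than part of the example above; the block names and values are made up for illustration.

```python
import numpy as np

# Three illustrative 2x2 blocks (values are assumptions, not from the text)
eye = np.eye(2, dtype=int)
zeros = np.zeros((2, 2), dtype=int)
ones = np.ones((2, 2), dtype=int)

# Arrange the blocks as [[eye, zeros], [ones, eye]]
big = np.block([[eye, zeros], [ones, eye]])
print(big.shape)  # (4, 4)
print(big)
```

The nested list mirrors the position of each block in the result, just like the list form of bmat().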

2. Matrix operations

(1) Scalar multiplication: multiplying a matrix by a number

A = np.mat([[1,1],[1,1]])
print(A*3)

(2) Matrix addition and subtraction

B = np.mat([[1,2],[3,4]])
print(A+B)

(3) Matrix multiplication

print(A*B)

(4) Element-wise multiplication

C = np.multiply(A,B)
print(C)

(5) Common matrix attributes

Transpose: T
```
print(C.T)
```
Conjugate transpose: H
```
print(C.H)
```
Inverse: I
```
print(C.I)
```

The matrix's own data as a 2-D ndarray: A

print(type(C))    # <class 'numpy.matrixlib.defmatrix.matrix'>
print(type(C.A))    # <class 'numpy.ndarray'>
print(C.A)
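
The NumPy documentation no longer recommends the matrix class for new code, so it may help to see the same operations on ordinary ndarrays. This is a rough sketch of the equivalents, reusing the 2x2 values from above; the variable names are illustrative.

```python
import numpy as np

A = np.array([[1, 1], [1, 1]])
B = np.array([[1, 2], [3, 4]])

matmul = A @ B                 # matrix product (what * does for np.matrix)
elementwise = A * B            # element-wise product (same as np.multiply)
transposed = B.T               # transpose
conj_transposed = B.conj().T   # conjugate transpose (the .H attribute)
inverse = np.linalg.inv(B)     # inverse (the .I attribute)

print(matmul)
print(inverse)
```

With ndarrays, `*` is always element-wise and `@` is the matrix product, which removes the ambiguity between sections (3) and (4).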

(II) Data analysis with numpy: reading and writing files

import numpy as np

x = np.array([[1,1,1],[2,2,2],[3,3,3],[4,4,4]])
y = np.array([1,2,3,4]).reshape((4,1))

1. Saving

(1) save()

Writes a binary file with the .npy extension.

np.save('x',x)

(2) savetxt()

Writes a plain-text file.

np.savetxt('x_txt',x)

Binary files are generally faster to read and write than text files, and more compact.

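(3) savez()

Saves several arrays into a single .npz archive. The archive written by savez() is not compressed; savez_compressed() writes a compressed one.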

a = np.array([[1,2,3],[4,5,6]])
b = np.arange(0,1,0.1)
np.savez('a_b',a,b)

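2. Loading

Note: save() and savez() append the extension automatically when the file name lacks it, so it can be omitted when saving, but the full file name is required when loading.

(1) load()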

result = np.load('x.npy')
print(type(result),'\n',result)
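
load() also opens the .npz archives written by savez(). The sketch below assumes a temporary directory so that it leaves no files behind; arrays passed positionally are stored under the keys arr_0, arr_1, and so on.

```python
import os
import tempfile
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.arange(0, 1, 0.1)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'a_b')        # savez appends ".npz"
    np.savez(path, a, b)                   # stored as arr_0 and arr_1
    with np.load(path + '.npz') as data:   # the extension is required here
        keys = sorted(data.files)
        a_loaded = data['arr_0']
        b_loaded = data['arr_1']

print(keys)       # ['arr_0', 'arr_1']
print(a_loaded)
```

Passing keyword arguments instead, for example `np.savez(path, a=a, b=b)`, stores the arrays under those names.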

(2) loadtxt()

Reads a text file (txt/csv).

result = np.loadtxt('x_txt',dtype=np.int32)
print(result)
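
By default savetxt() writes every value in floating-point notation (format `%.18e`) separated by spaces. For integer data or csv output, a possible round trip looks like the sketch below; the temporary file name is an assumption made for the example.

```python
import os
import tempfile
import numpy as np

x = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'x.csv')
    # fmt='%d' writes plain integers; delimiter=',' produces csv
    np.savetxt(path, x, fmt='%d', delimiter=',')
    with open(path) as f:
        first_line = f.readline().strip()
    # the same delimiter has to be given when reading back
    loaded = np.loadtxt(path, dtype=np.int32, delimiter=',')

print(first_line)  # 1,1,1
print(loaded)
```

Writing integers explicitly also avoids parsing float-formatted text into an integer dtype, which newer NumPy releases may warn about.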

(3) genfromtxt()

Reads structured data.

# 1. Define the record structure
df = np.dtype([('name',np.str_,128),('nums',np.int32),('local',np.str_,16)])
# 2. Read the structured data
jobs = np.genfromtxt('tecent_jobs.txt',dtype=df,delimiter=',')
print(jobs['name'])
print(jobs['nums'],type(jobs['nums']))
print(jobs['local'])
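
Since the job file itself is not shown, here is a self-contained sketch of the same call that reads from an in-memory string. The three rows are made up for illustration, and `encoding='utf-8'` is passed explicitly as an assumption so that the text fields come back as str.

```python
import io
import numpy as np

# Made-up rows standing in for the job file: name, openings, location
text = "engineer,3,Shenzhen\nanalyst,1,Beijing\ndesigner,2,Shanghai\n"

record = np.dtype([('name', np.str_, 128), ('nums', np.int32), ('local', np.str_, 16)])
jobs = np.genfromtxt(io.StringIO(text), dtype=record, delimiter=',', encoding='utf-8')

print(jobs['name'])                        # one field across all rows
print(jobs['nums'].sum())                  # 6
print(jobs[jobs['nums'] >= 2]['local'])    # filter rows by a field
```

Each field of the structured array behaves like an ordinary ndarray, so boolean masks and the statistics functions work on it directly.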

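(III) Sorting

1. Direct sorting

sort()

Sorts in place, modifying the original data.

sort() is the most commonly used sorting method, called as arr.sort().

sort() also accepts an axis parameter, which sorts the data along the given axis.

axis=1 sorts along the horizontal axis, that is, within each row; axis=0 sorts along the vertical axis, that is, within each column.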

arr = np.array([[4,3,2],[2,1,4]])
arr.sort(axis=1)
print(arr)
arr.sort(axis=0)
print(arr)
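
When the original order must be kept, the function form `np.sort` returns a sorted copy instead. A small sketch on the same array, showing what each axis does:

```python
import numpy as np

arr = np.array([[4, 3, 2], [2, 1, 4]])

by_row = np.sort(arr, axis=1)   # sorted copy; each row sorted left to right
by_col = np.sort(arr, axis=0)   # sorted copy; each column sorted top to bottom

print(by_row)   # [[2 3 4]
                #  [1 2 4]]
print(by_col)   # [[2 1 2]
                #  [4 3 4]]
print(arr)      # unchanged: [[4 3 2]
                #             [2 1 4]]
```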

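2. Indirect sorting

Indirect sorting leaves the original data untouched.

(1) argsort()

Returns an array holding the indices that would put the elements in sorted order.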

arr = np.array([2,1,0,5,3])
new_arr = arr.argsort()
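
The returned indices can be used to reorder the array itself or any parallel array. In this sketch the `labels` array is a hypothetical companion to `arr`, added only to show the idea.

```python
import numpy as np

arr = np.array([2, 1, 0, 5, 3])
order = arr.argsort()

print(order)        # [2 1 0 4 3]
print(arr[order])   # [0 1 2 3 5]
print(arr)          # [2 1 0 5 3], the original is unchanged

# Hypothetical parallel array reordered with the same indices
labels = np.array(['c', 'b', 'a', 'f', 'd'])
print(labels[order])  # ['a' 'b' 'c' 'd' 'f']
```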

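(2) lexsort()

Sorts by the last array in the tuple first; where its values tie, the preceding arrays are compared at the same positions. The result is again an array of indices.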

a = np.array([3,2,6,4,5])
b = np.array([50,40,30,20,10])
c = np.array([400,300,600,100,200])
result = np.lexsort((a,b,c))
print(result)
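
In the example above `c` has no repeated values, so the result is simply the order of `c`: [3 4 1 0 2]. The tie-breaking only becomes visible when the last key contains duplicates, as in this sketch with made-up `scores` and `ages`:

```python
import numpy as np

scores = np.array([90, 80, 90, 80])   # primary key (last in the tuple)
ages = np.array([30, 25, 20, 35])     # secondary key, used to break ties

order = np.lexsort((ages, scores))
print(order)           # [1 3 2 0]
print(scores[order])   # [80 80 90 90]
print(ages[order])     # [25 35 20 30]
```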

(IV) Deduplication and repetition

1. Deduplication

arr = np.array([1,2,1,2,3,3])
print(np.unique(arr))
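
unique() returns the distinct values in sorted order, and it can also report where each value first appears and how often it occurs. A short sketch using the same array:

```python
import numpy as np

arr = np.array([1, 2, 1, 2, 3, 3])

values, first_index, counts = np.unique(arr, return_index=True, return_counts=True)
print(values)       # [1 2 3]
print(first_index)  # [0 1 4]
print(counts)       # [2 2 2]
```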

2. Repetition

(1) tile()

Treats the whole array as the unit to repeat.

arr = np.array([1,2,3,4])
result = np.tile(arr,3)
print(result)    # [1 2 3 4 1 2 3 4 1 2 3 4]

The number of repetitions along rows and columns can also be specified.

result = np.tile(arr,(3,1))
print(result)
# [[1 2 3 4]
#  [1 2 3 4]
#  [1 2 3 4]]
result = np.tile(arr,(3,2))
print(result)
# [[1 2 3 4 1 2 3 4]
#  [1 2 3 4 1 2 3 4]
#  [1 2 3 4 1 2 3 4]]

(2) repeat()

Treats each element as the unit to repeat.

result = np.repeat(arr,3)
print(result)    # [1 1 1 2 2 2 3 3 3 4 4 4]

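The axis parameter selects the axis along which to repeat.

axis=0 repeats each row; axis=1 repeats each column. Without axis, the array is flattened first.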

arr = np.array([[1,2],[3,4]])
result = np.repeat(arr,2)
print(result)    # [1 1 2 2 3 3 4 4]
result = np.repeat(arr,2,axis=0)
print(result)
# [[1 2]
#  [1 2]
#  [3 4]
#  [3 4]]
result = np.repeat(arr,2,axis=1)
print(result)
# [[1 1 2 2]
#  [3 3 4 4]]

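Difference: tile() repeats the array as a whole, whereas repeat() repeats each element of the array.

Exercise

Generate a 3*3 array of random non-negative integers below 10,
then repeat it row by row, with a repeat count of 2.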

arr = np.random.randint(0,10,size=(3,3))
result = np.repeat(arr,2,axis=0)
print('arr:\n',arr)
print('result:\n',result)

(V) Statistical functions

import numpy as np

arr = np.arange(20).reshape((4,5))

# Form 1
# np.function_name(array)
# Form 2
# array.function_name()

# Sum
print(np.sum(arr))
print(arr.sum())

# Mean
print(arr.mean())

# Standard deviation / variance
print(arr.std(),arr.var())

# Cumulative sum
print(arr.cumsum())

# Cumulative product
print(arr.cumprod())
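
Note that the first element of this array is 0, so every entry of cumprod() is 0 here. All of these functions also take an axis argument; without it they work on the flattened array. A sketch on the same 4x5 array:

```python
import numpy as np

arr = np.arange(20).reshape((4, 5))

col_sums = arr.sum(axis=0)        # one value per column
row_sums = arr.sum(axis=1)        # one value per row
row_means = arr.mean(axis=1)      # mean of each row
row_cumsum = arr.cumsum(axis=1)   # running total within each row

print(col_sums)    # [30 34 38 42 46]
print(row_sums)    # [10 35 60 85]
print(row_means)   # [ 2.  7. 12. 17.]
print(row_cumsum[0])  # [ 0  1  3  6 10]
```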

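Exercise

Read the sepal-length data of the iris dataset (already saved in csv format),
sort and deduplicate it,
then compute the sum, cumulative sum, mean, standard deviation, variance, minimum, and maximum.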

import numpy as np
# Read the file
iris = np.loadtxt('iris_sepal_length.csv')
print('Sepal lengths:\n',iris)

# Sort in place
iris.sort()
print('Sorted:\n',iris)

# Deduplicate
unique_iris = np.unique(iris)
print('Unique values:\n',unique_iris)

# Sum
print('Sum:\n',np.sum(unique_iris))
# Cumulative sum
print('Cumulative sum:\n',np.cumsum(unique_iris))
# Mean
print('Mean:\n',np.mean(unique_iris))
# Standard deviation
print('Standard deviation:\n',np.std(unique_iris))
# Variance
print('Variance:\n',np.var(unique_iris))
# Minimum
print('Minimum:\n',np.min(unique_iris))
# Maximum
print('Maximum:\n',np.max(unique_iris))

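II. KNN: the k-nearest neighbors algorithm

Idea: find the k samples in the data set that are most similar to a given sample; if the majority of those k samples belong to one class, the sample is assigned to that class as well.

Similarity is measured by Euclidean distance.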
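
The Euclidean distance between two points is the square root of the sum of squared coordinate differences. As a quick sketch, here it is computed two ways for a pair of three-component feature vectors; `np.linalg.norm` is used as one convenient option, not as the only way to do it.

```python
import math
import numpy as np

x = [23, 3, 17]   # sample to classify
v = [6, 4, 21]    # one sample from the data set

# Written out by hand
d_manual = math.sqrt(sum((vi - xi) ** 2 for vi, xi in zip(v, x)))

# Same quantity as the norm of the difference vector
d_numpy = float(np.linalg.norm(np.array(v) - np.array(x)))

print(round(d_manual, 2))  # 17.49
```

A smaller distance means the two samples are more similar.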

Example: given the sample set, determine the genre of movie x.

import math

# 1. Build the data set as a Python dict
movie_data = {
    "寶貝當家": [45, 2, 9, "comedy"],
    "美人魚": [21, 17, 5, "comedy"],
    "澳門風雲3": [54, 9, 11, "comedy"],
    "功夫熊貓3": [39, 0, 31, "comedy"],
    "諜影重重": [5, 2, 57, "action"],
    "葉問3": [3, 2, 65, "action"],
    "倫敦陷落": [2, 3, 55, "action"],
    "我的特工爺爺": [6, 4, 21, "action"],
    "奔愛": [7, 46, 4, "romance"],
    "夜孔雀": [9, 39, 8, "romance"],
    "代理情人": [9, 38, 2, "romance"],
    "新步步驚心": [8, 34, 17, "romance"]
}

# 2. Compute the distance from x to every sample
x = [23,3,17]
KNN = []
for k,v in movie_data.items():
    d = math.sqrt((v[0]-x[0])**2+(v[1]-x[1])**2+(v[2]-x[2])**2)
    KNN.append([k,round(d,2)])    # round(d,2) rounds d to two decimal places
# print(KNN)

# 3. Keep the k samples with the shortest distance (here k = 5)
KNN.sort(key=lambda data:data[1])

KNN = KNN[:5]

# 4. Majority vote among the k nearest samples
choice = {}
for one in KNN:
    movie_name = one[0]
    movie_type = movie_data[movie_name][3]    # genre label of this neighbor
    choice[movie_type] = choice.get(movie_type,0) + 1
choice = sorted(choice.items(),key=lambda c:c[1],reverse=True)    # items() yields tuples
# print(choice)    # a list of tuples
print('Genre of x:',choice[0][0])
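
The same steps can be written more compactly with numpy. The sketch below is one possible vectorized version, not the method used above: `knn_predict` is a hypothetical helper, and the genre labels are written in English as an assumption. The feature values are the ones from the data set.

```python
from collections import Counter
import numpy as np

# Feature vectors in the same order as the data set above
features = np.array([
    [45, 2, 9], [21, 17, 5], [54, 9, 11], [39, 0, 31],
    [5, 2, 57], [3, 2, 65], [2, 3, 55], [6, 4, 21],
    [7, 46, 4], [9, 39, 8], [9, 38, 2], [8, 34, 17],
])
labels = ['comedy'] * 4 + ['action'] * 4 + ['romance'] * 4

def knn_predict(features, labels, x, k=5):
    """Return the majority label among the k samples closest to x."""
    # Euclidean distance from x to every row at once
    distances = np.linalg.norm(features - np.asarray(x), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(features, labels, [23, 3, 17]))  # comedy
```

With k = 5, four of the five nearest samples are comedies and one is an action movie, so the vote returns comedy.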

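Iteration and loops

Iteration is not the same thing as a loop.

Iteration means fetching items one at a time from an iterator. Unlike an index-based loop, an iterator only moves forward: it cannot step back or be rewound.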
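
A short sketch of this forward-only behavior using the built-ins iter() and next(); the list `numbers` is just an illustrative value.

```python
numbers = [10, 20, 30]

it = iter(numbers)            # create an iterator over the list
first = next(it)              # 10
second = next(it)             # 20
rest = list(it)               # [30], only what has not been consumed yet
exhausted = next(it, None)    # None: the iterator is used up and cannot rewind

# To go backwards, build a new iterator over the sequence instead
backwards = list(reversed(numbers))   # [30, 20, 10]

print(first, second, rest, exhausted, backwards)
```

A for loop over a list creates such an iterator behind the scenes and calls next() on it until it is exhausted.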

Data Analysis (3): numpy and the KNN algorithm