Data Transformation

Removing Duplicates

duplicated, drop_duplicates
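
The two methods named above can be sketched on a small frame. The sample data and variable names below are made up for illustration; only duplicated and drop_duplicates come from the notes.

```python
import pandas as pd

# Hypothetical sample data: row 2 repeats row 0 exactly.
df = pd.DataFrame({"k1": ["one", "two", "one", "two"],
                   "k2": [1, 2, 1, 3]})

# duplicated() marks every row that repeats an earlier row.
dup_mask = df.duplicated()

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = df.drop_duplicates()

# Compare on a subset of columns and keep the last occurrence instead.
last_by_k1 = df.drop_duplicates(subset=["k1"], keep="last")

print(dup_mask.tolist())
print(deduped)
print(last_by_k1)
```

Both methods return new objects and leave df unchanged.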

Transforming Data with Functions and Mappings

map
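
A minimal sketch of Series.map, assuming a made-up food-to-animal lookup table (the data is illustrative, not from the notes). map accepts either a dict or a function.

```python
import pandas as pd

# Hypothetical data with inconsistent capitalization.
foods = pd.Series(["bacon", "Pastrami", "honey ham"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow", "honey ham": "pig"}

# Normalize case first, then look each value up in the dict.
animals = foods.str.lower().map(meat_to_animal)

# map also accepts a function applied element by element.
lengths = foods.map(len)

print(animals.tolist())
print(lengths.tolist())
```

Values that are missing from the dict come back as NaN, which is why the case is normalized before the lookup.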

Replacing Values

replace
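
A short sketch of replace, assuming -999 and -1000 are sentinel codes for missing data (an assumption made for this example only). It shows the single-value, list, and dict forms.

```python
import numpy as np
import pandas as pd

# Hypothetical series where -999 and -1000 stand for missing readings.
s = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])

# Replace one value.
cleaned = s.replace(-999, np.nan)

# Replace several values with the same substitute.
cleaned_many = s.replace([-999, -1000], np.nan)

# Replace each value with its own substitute via a dict.
per_value = s.replace({-999: np.nan, -1000: 0})

print(cleaned.tolist())
print(cleaned_many.tolist())
print(per_value.tolist())
```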

Renaming Axis Indexes

.index.map

rename:

data.rename(index={'OHIO':'FHDJ'}, columns={'fdjh':'fhdhgj'})
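
To make the difference between .index.map and rename concrete, here is a small sketch. The frame and its labels are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; labels are illustrative only.
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=["Ohio", "Colorado", "New York"],
                    columns=["one", "two", "three", "four"])

# index.map returns a new Index; assign it to data.index to modify data in place.
upper_index = data.index.map(str.upper)

# rename returns a new object and accepts functions...
renamed = data.rename(index=str.title, columns=str.upper)

# ...or dicts that touch only the labels listed in them.
partial = data.rename(index={"Ohio": "INDIANA"}, columns={"three": "peekaboo"})

print(list(upper_index))
print(renamed)
print(partial)
```

Labels that do not appear in the dict are left as they are, so a misspelled key is silently ignored rather than raising an error.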

Discretization and Binning

pd.cut, pd.qcut

>>> factors = np.random.randn(9)
>>> print(factors)
[ 2.12046097  0.24486218  1.64494175 -0.27307614 -2.11238291 2.15422205 -0.46832859  0.16444572  1.52536248]

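pd.qcut()
qcut picks the bin edges from the sample quantiles of the values, so each bin ends up holding the same number of values (or as close to that as the data allows).

Passing the q argument

>>> pd.qcut(factors, 3)  # returns the bin each value falls into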
[(1.525, 2.154], (-0.158, 1.525], (1.525, 2.154], (-2.113, -0.158], (-2.113, -0.158], (1.525, 2.154], (-2.113, -0.158], (-0.158, 1.525], (-0.158, 1.525]]
Categories (3, interval[float64]): [(-2.113, -0.158] < (-0.158, 1.525] < (1.525, 2.154]]

>>> pd.qcut(factors, 3).value_counts()  # counts how many values fall into each bin
(-2.113, -0.158]    3
(-0.158, 1.525]     3
(1.525, 2.154]      3

Passing the labels argument

>>> pd.qcut(factors, 3, labels=["a","b","c"])  # returns the bin of each value, with the bins named by labels
[c, b, c, a, a, c, a, b, b]
Categories (3, object): [a < b < c]

>>> pd.qcut(factors, 3, labels=False)  # returns only the integer bin index of each value
[2 1 2 0 0 2 0 1 1]

Passing the retbins argument

>>> pd.qcut(factors, 3, retbins=True)  # returns the bin of each value plus the bins array, i.e. the bin edges
([(1.525, 2.154], (-0.158, 1.525], (1.525, 2.154], (-2.113, -0.158], (-2.113, -0.158], (1.525, 2.154], (-2.113, -0.158], (-0.158, 1.525], (-0.158, 1.525]]
Categories (3, interval[float64]): [(-2.113, -0.158] < (-0.158, 1.525] < (1.525, 2.154]], array([-2.113, -0.158,  1.525,  2.154]))

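Parameter   Description
x           1-D ndarray or Series
q           int or sequence of quantiles; an int gives the number of groups
labels      array or False, default None. An array supplies the bin names; False returns only the integer bin indices
retbins     bool, default False. If True, the bins (the bin edges) are returned as well
precision   int, default 3. Precision of the bin labels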

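pd.cut()
cut picks the bin edges from the values themselves: given an integer number of bins, it splits the range of x into bins of equal width.

Passing the bins argument

>>> pd.cut(factors, 3)  # returns the bin each value falls into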
[(0.732, 2.154], (-0.69, 0.732], (0.732, 2.154], (-0.69, 0.732], (-2.117, -0.69], (0.732, 2.154], (-0.69, 0.732], (-0.69, 0.732], (0.732, 2.154]]
Categories (3, interval[float64]): [(-2.117, -0.69] < (-0.69, 0.732] < (0.732, 2.154]]

>>> pd.cut(factors, bins=[-3,-2,-1,0,1,2,3])
[(2, 3], (0, 1], (1, 2], (-1, 0], (-3, -2], (2, 3], (-1, 0], (0, 1], (1, 2]]
Categories (6, interval[int64]): [(-3, -2] < (-2, -1] < (-1, 0] < (0, 1] < (1, 2] < (2, 3]]

>>> pd.cut(factors, 3).value_counts()  # counts how many values fall into each bin
(-2.117, -0.69]    1
(-0.69, 0.732]     4
(0.732, 2.154]     4

Passing the labels argument

>>> pd.cut(factors, 3, labels=["a","b","c"])  # returns the bin of each value, with the bins named by labels
[c, b, c, b, a, c, b, b, c]
Categories (3, object): [a < b < c]

>>> pd.cut(factors, 3, labels=False)  # returns only the integer bin index of each value
[2 1 2 1 0 2 1 1 2]

Passing the retbins argument

>>> pd.cut(factors, 3, retbins=True)  # returns the bin of each value plus the bins array, i.e. the bin edges
([(0.732, 2.154], (-0.69, 0.732], (0.732, 2.154], (-0.69, 0.732], (-2.117, -0.69], (0.732, 2.154], (-0.69, 0.732], (-0.69, 0.732], (0.732, 2.154]]
Categories (3, interval[float64]): [(-2.117, -0.69] < (-0.69, 0.732] < (0.732, 2.154]], array([-2.11664951, -0.69018126,  0.7320204 ,  2.15422205]))

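Parameter   Description
x           array-like; must be 1-dimensional
bins        int or sequence of scalars: the number of equal-width bins, or the explicit bin edges
labels      array or False, default None. An array supplies the bin names; False returns only the integer bin indices
retbins     bool, default False. If True, the bins (the bin edges) are returned as well
precision   int, default 3. Precision of the bin labels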
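
The difference between equal-width bins (cut) and equal-count bins (qcut) is easiest to see on skewed data. The sketch below uses a small fixed array chosen for illustration, not the random factors above, so the counts are reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data: eight small values and one large one.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

# cut: three bins of equal width across the range [1, 100].
equal_width = pd.cut(values, 3, labels=False)

# qcut: three bins holding an equal number of values.
equal_count = pd.qcut(values, 3, labels=False)

# Count how many values landed in each bin index.
width_counts = np.bincount(equal_width, minlength=3)
count_counts = np.bincount(equal_count, minlength=3)

print(width_counts)   # [8 0 1]
print(count_counts)   # [3 3 3]
```

With cut, the single large value stretches the range so that almost everything falls into the first bin; qcut keeps the groups balanced at the cost of very uneven bin widths.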

Detecting and Filtering Outliers
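
The notes leave this heading empty, so the following is only a sketch of one common approach, assuming "outlier" means an absolute value above 3 in roughly standard-normal data. The threshold, the planted values, and the variable names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 rows of standard-normal noise in 4 columns.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((1000, 4)))

# Plant two known outliers so the example always has something to find.
data.iloc[10, 1] = 5.0
data.iloc[20, 2] = -4.5

# Detect: rows in which any column exceeds 3 in absolute value.
outlier_rows = data[(data.abs() > 3).any(axis=1)]

# Filter: cap every value to the interval [-3, 3].
capped = data.clip(lower=-3, upper=3)

print(outlier_rows)
print(capped.abs().max())
```

clip is used here for the capping step; assigning np.sign(data) * 3 to the masked positions is an equivalent alternative.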

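Permutation and Random Sampling

numpy.random.permutation

numpy.random.permutation + take

numpy.random.randint + take

numpy.random.permutation:

Given an integer n, it returns a randomly permuted np.arange(n); given an array, it returns a shuffled copy. A multi-dimensional array is shuffled only along its first axis: whole rows change places, while the contents of each row stay intact.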

>>> np.random.permutation([1, 4, 9, 12, 15])
array([15,  1,  9,  4, 12])

>>> arr = np.arange(9).reshape((3, 3))
>>> arr
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> np.random.permutation(arr)
array([[6, 7, 8],
       [0, 1, 2],
       [3, 4, 5]])

>>> permutation = list(np.random.permutation(10))
>>> permutation
[5, 1, 7, 6, 8, 9, 4, 0, 2, 3]
>>> Y = np.array([[1,1,1,1,0,0,0,0,0,0]])
>>> Y_new = Y[:, permutation]
>>> Y_new
array([[0, 1, 0, 0, 0, 0, 0, 1, 1, 1]])

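numpy.random.permutation + take:

# pick 3 distinct rows of df at random (sampling without replacement)
df.take(numpy.random.permutation(len(df))[:3])

# the take function
arr = np.arange(6)*100  # arr is array([0, 100, 200, 300, 400, 500])
inds = [4,3,2]
arr.take(inds)
# Picks the elements at positions 4, 3, 2 of arr, in that order, so the result is array([400, 300, 200])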
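
Putting the two pieces together on a DataFrame might look like the sketch below. The frame is invented for the example, and the seed is fixed only to make runs repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-row frame.
df = pd.DataFrame(np.arange(20).reshape((5, 4)))

np.random.seed(0)  # fixed seed so the example is repeatable
order = np.random.permutation(len(df))  # a random ordering of 0..4

# Reorder all rows according to the permutation.
shuffled = df.take(order)

# Keep only the first three positions: a random sample without replacement.
subset = df.take(order[:3])

print(order)
print(shuffled)
print(subset)
```

Because a permutation never repeats a position, no row can appear twice in subset.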

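numpy.random.randint + take:

numpy.random.randint(low, high=None, size=None, dtype='l')

Returns random integers from low (inclusive) to high (exclusive), i.e. from the interval [low, high).
If high is omitted, the values are drawn from [0, low).

Parameters:

low: int
The smallest value that can be drawn (inclusive).
(When high=None, values are drawn from [0, low) instead.)
high: int (optional)
If given, values are drawn from [low, high).
size: int or tuple of ints (optional)
Output shape. For example, size=(m, n, k) draws m * n * k values arranged in that shape. The default, None, returns a single value.
dtype: dtype (optional)
Desired dtype of the result, e.g. int64 or int.

Returns:

out: int or ndarray of ints
A single random integer, or an array of random integers of shape size.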
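
The three ways of calling randint described above can be sketched as follows. The specific bounds and shapes are arbitrary choices for the example.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the example is repeatable

# Only low given: one integer from [0, 5).
single = np.random.randint(5)

# low and high given, with an int size: ten integers from [2, 7).
ranged = np.random.randint(2, 7, size=10)

# A tuple size gives an array of that shape: 2 x 4 integers from [0, 3).
grid = np.random.randint(0, 3, size=(2, 4))

print(single)
print(ranged)
print(grid)
```

Since high is exclusive, randint(0, len(bag)) always yields valid positions for an array named bag, which is what makes it convenient to combine with take.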


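bag = np.array([5,7,-1,6,4])
# 10 random positions in [0, len(bag)); positions may repeat, so this samples with replacement
sampler = np.random.randint(0,len(bag),size=10)
# pick the elements of bag at those positions
draws = bag.take(sampler)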

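Computing Indicator/Dummy Variables

get_dummies

get_dummies:

As I understand it, get_dummies converts a variable that takes several distinct values into 0/1 indicator columns. Say Xiaoming owns hats in three colors (yellow, red and blue), and we code the yellow hat as 1, the red hat as 2 and the blue hat as 3. The magnitudes of 1, 2 and 3 mean nothing in themselves; they only tell the hat colors apart. So for actual analysis the codes 1, 2, 3 need to be converted into 0/1 columns, as in the following code: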

import pandas as pd
xiaoming=pd.DataFrame([1,2,3],index=['yellow','red','blue'],columns=['hat'])
print(xiaoming)
hat_ranks=pd.get_dummies(xiaoming['hat'],prefix='hat')
print(hat_ranks.head())

        hat
yellow    1
red       2
blue      3
        hat_1  hat_2  hat_3
yellow      1      0      0
red         0      1      0
blue        0      0      1
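
The same idea works directly on string categories, and the indicator columns can be joined back onto the rest of the data. This is a sketch with made-up owners and colors; dtype=int is passed so the columns hold 0/1, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical data: four people and the color of hat each one wears.
hats = pd.DataFrame({"owner": ["a", "b", "c", "d"],
                     "color": ["yellow", "red", "blue", "red"]})

# One 0/1 column per distinct color, named color_<value>.
dummies = pd.get_dummies(hats["color"], prefix="color", dtype=int)

# Replace the original column with its indicator columns.
with_dummies = hats[["owner"]].join(dummies)

print(with_dummies)
```

Each row has exactly one 1 across the indicator columns, because every hat has exactly one color.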

Note: I did not fully follow pp. 215-216 of the book here; I will come back to them when I get the chance =.=

Python for Data Analysis, Chapter 7: Data Wrangling (Cleaning, Transforming, Merging, Reshaping) [3]: Data Transformation
