This post records the 100 Pandas exercises from Shiyanlou (實驗樓), with the outputs and some supplementary notes added.

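Pandas 100 Exercises

The Pandas 100 exercises are split into a basic part and an advanced part, with 50 exercises each. The basic exercises build familiarity with commonly used Pandas methods, while the advanced exercises focus on combining those methods.

Basic part

Basics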

1. Import Pandas:

import pandas as pd

2. Check the Pandas version:

print(pd.__version__)

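Pandas data structures: the two main Pandas data structures are Series (one-dimensional) and DataFrame (two-dimensional), and they are by far the most widely used. Older releases also provided Panel (three-dimensional), Panel4D (four-dimensional), and PanelND (higher-dimensional), but these have been removed from current versions of Pandas.

A Series is a one-dimensional labeled array that can hold any data type, including integers, strings, floats, and Python objects. Elements of a Series can be located by label.
A DataFrame is a two-dimensional labeled data structure. Data can be located by row and column labels, which plain NumPy arrays do not offer.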

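Creating a Series

In Pandas, a Series can be thought of as a dataset made up of a single column of data.

Syntax for creating a Series: s = pd.Series(data, index=index). A Series can be created in several ways; three common ones are shown below.

3. Create a Series from a list: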

arr = [0, 1, 2, 3, 4]
s1 = pd.Series(arr)  # if no index is given, it defaults to 0, 1, 2, ...
s1

out:

0    0
1    1
2    2
3    3
4    4
dtype: int64

Note: the 0, 1, 2, 3, 4 on the left are the index of the Series, and the 0, 1, 2, 3, 4 on the right are its values.

4. Create a Series from an ndarray:

import numpy as np

n = np.random.randn(5)  # create a random ndarray

index = ['a', 'b', 'c', 'd', 'e']
s2 = pd.Series(n, index=index)  # use a, b, c, d, e as the index
s2

out:

a   -0.419618
b   -0.709257
c    0.288306
d   -0.203162
e    1.754528
dtype: float64

5. Create a Series from a dictionary:

d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}  # define a sample dictionary
s3 = pd.Series(d)
s3

out:

a    1
b    2
c    3
d    4
e    5
dtype: int64

Basic Series operations

6. Change the index of a Series

print(s1)  # use s1 as the example
s1.index = ['A', 'B', 'C', 'D', 'E']  # the new index
s1

out:

0    0
1    1
2    2
3    3
4    4
dtype: int64
A    0
B    1
C    2
D    3
E    4
dtype: int64

7. Concatenate Series vertically

s4 = pd.concat([s3, s1])  # append s1 after s3 (Series.append was removed in pandas 2.0)
s4

out:

a    1
b    2
c    3
d    4
e    5
A    0
B    1
C    2
D    3
E    4
dtype: int64

8. Delete an element of a Series by index label

print(s4)
s4 = s4.drop('e')  # drop the value whose index label is e
s4

out:

a    1
b    2
c    3
d    4
e    5
A    0
B    1
C    2
D    3
E    4
dtype: int64
a    1
b    2
c    3
d    4
A    0
B    1
C    2
D    3
E    4
dtype: int64

9. Modify the element of a Series at a given index label

s4['A'] = 6  # set the value at index label A to 6
s4

out:

a    1
b    2
c    3
d    4
A    6
B    1
C    2
D    3
E    4
dtype: int64

10. Look up an element of a Series by index label

s4['B']

out:

1

11. Series slicing

For example, access the first 3 elements of s4:

s4[:3]

out:

a    1
b    2
c    3
dtype: int64

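Series arithmetic

12. Series addition

Series addition is computed by matching index labels; where a label is present on only one side, the result is NaN (missing).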

s4.add(s3)

out:

A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    2.0
b    4.0
c    6.0
d    8.0
e    NaN
dtype: float64
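
The NaN entries above come from labels that exist in only one of the two Series. As a small sketch (the Series `left` and `right` are made up for illustration and are not part of the original exercises), the `fill_value` parameter of `add` treats a missing side as a given number instead:

```python
import pandas as pd

# Two illustrative Series whose labels only partly overlap.
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20], index=['b', 'd'])

# Plain addition: any label missing on either side gives NaN.
plain = left.add(right)

# With fill_value=0 a label missing on one side is treated as 0.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```

The same `fill_value` parameter is accepted by `sub`, `mul`, and `div`.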

13. Series subtraction

Series subtraction is computed by matching index labels; where a label is present on only one side, the result is NaN (missing).

s4.sub(s3)

out:

A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    0.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64

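14. Series multiplication

Series multiplication is computed by matching index labels; where a label is present on only one side, the result is NaN (missing).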

s4.mul(s3)

out:

A     NaN
B     NaN
C     NaN
D     NaN
E     NaN
a     1.0
b     4.0
c     9.0
d    16.0
e     NaN
dtype: float64

15. Series division

Series division is computed by matching index labels; where a label is present on only one side, the result is NaN (missing).

s4.div(s3)

out:

A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    1.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64

16. Series median

s4.median()

out:

3.0

17. Series sum

s4.sum()

out:

26

18. Series maximum

s4.max()

out:

6

19. Series minimum

s4.min()

out:

1

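Creating a DataFrame

Unlike a Series, a DataFrame can hold multiple columns of data. In general, the DataFrame is also the more commonly used of the two.

20. Create a DataFrame from a NumPy array

dates = pd.date_range('today', periods=6)  # define a time series to use as the index
num_arr = np.random.randn(6, 4)  # a random NumPy array as the data
columns = ['A', 'B', 'C', 'D']  # a list to use as the column names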
df1 = pd.DataFrame(num_arr, index=dates, columns=columns)
df1

out:

2020-06-27 08:37:36.648563	0.136455	1.372062	-1.896225	1.005024
2020-06-28 08:37:36.648563	-1.208118	-0.806961	1.382154	1.417238
2020-06-29 08:37:36.648563	-1.800235	-0.272469	-0.966839	-0.188984
2020-06-30 08:37:36.648563	0.081191	-0.468042	0.551959	-0.441269
2020-07-01 08:37:36.648563	0.301758	-0.147157	0.632281	-0.622362
2020-07-02 08:37:36.648563	-0.762515	-1.773605	-0.990699	0.493300

21. Create a DataFrame from a dictionary of lists

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df2 = pd.DataFrame(data, index=labels)
df2

out:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	2.0	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

If you want the values of each key to form a row instead, simply transpose the DataFrame:

df2.T

out:

	     a	 b	c	d	e	f	g	h	i	j
animal	cat	cat	snake	dog	dog	cat	snake	cat	dog	dog
age	2.5	3	0.5	NaN	5	2	4.5	NaN	7	3
visits	1	3	2	3	2	3	1	1	2	1
priority	yes	yes	no	yes	no	no	no	yes	no	no

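Basic DataFrame operations

23. Preview the first five rows of a DataFrame

This method is very useful for getting a quick sense of the structure of an unfamiliar dataset.

df2.head()  # shows 5 rows by default; pass the number of rows you want to preview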

out:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no

24. View the last 3 rows of a DataFrame

df2.tail(3)   # change 3 to the number of rows you need

out:

    animal	age	visits	priority
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

25. View the index of a DataFrame

df2.index

out:

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')

26. View the column names of a DataFrame

df2.columns

out:

Index(['animal', 'age', 'visits', 'priority'], dtype='object')

27. View the values of a DataFrame

df2.values   # returns a NumPy array

out:

array([['cat', 2.5, 1, 'yes'],
       ['cat', 3.0, 3, 'yes'],
       ['snake', 0.5, 2, 'no'],
       ['dog', nan, 3, 'yes'],
       ['dog', 5.0, 2, 'no'],
       ['cat', 2.0, 3, 'no'],
       ['snake', 4.5, 1, 'no'],
       ['cat', nan, 1, 'yes'],
       ['dog', 7.0, 2, 'no'],
       ['dog', 3.0, 1, 'no']], dtype=object)

28. View summary statistics of a DataFrame

df2.describe()

out:

	    age	     visits
count 8.000000  10.000000
mean  3.437500  1.900000
std	2.007797	0.875595
min	0.500000	1.000000
25%	2.375000	1.000000
50%	3.000000	2.000000
75%	4.625000	2.750000
max	7.000000	3.000000

29. Transpose a DataFrame

df2.T

out:

	    a	b	c	d	e	f	g	h	i	j
animal	cat	cat	snake	dog	dog	cat	snake	cat	dog	dog
age 	2.5	3	0.5	NaN	5	2	4.5	NaN	7	3
visits	1	3	2	3	2	3	1	1	2	1
priority	yes	yes	no	yes	no	no	no	yes	no	no

30. Sort a DataFrame by a column

df2.sort_values(by='age', ascending=False)  # sort by age; ascending=False means descending, True (the default) means ascending

out:

	animal	age	visits	priority
i	dog	7.0	2	no
e	dog	5.0	2	no
g	snake	4.5	1	no
b	cat	3.0	3	yes
j	dog	3.0	1	no
a	cat	2.5	1	yes
f	cat	2.0	3	no
c	snake	0.5	2	no
d	dog	NaN	3	yes
h	cat	NaN	1	yes

31. Slice a DataFrame

df2[1:3]  # slice rows by position: positions 1 and 2 (the 2nd and 3rd rows); position 3 is excluded

out:

	animal	age	visits	priority
b	cat	    3.0	  3  	yes
c	snake	0.5	  2	    no

32. Query a DataFrame by label (single column)

df2['age']

out:

a    2.5
b    3.0
c    0.5
d    NaN
e    5.0
f    2.0
g    4.5
h    NaN
i    7.0
j    3.0
Name: age, dtype: float64

This is equivalent to df2.age:

df2.age  # equivalent to df2['age']

out:

a    2.5
b    3.0
c    0.5
d    NaN
e    5.0
f    2.0
g    4.5
h    NaN
i    7.0
j    3.0
Name: age, dtype: float64

33. Query a DataFrame by label (multiple columns)

df2[['age', 'animal']]  # pass a list of column names

out:

    age	animal
a	2.5	cat
b	3.0	cat
c	0.5	snake
d	NaN	dog
e	5.0	dog
f	2.0	cat
g	4.5	snake
h	NaN	cat
i	7.0	dog
j	3.0	dog

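34. Look up DataFrame data by position

.iloc[] takes the row selection as its first argument and the column selection as its second.

Method 1:

df2.iloc[1:3]   # select the 2nd and 3rd rows (positions 1 and 2)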

out:

	animal	age	visits	priority
b	cat 	3.0	 3	yes
c	snake	0.5	 2	no

Method 2:

df2.iloc[1:3, 1]  # the 2nd and 3rd rows of the column at position 1 (age)

out:

b    3.0
c    0.5
Name: age, dtype: float64

Method 3:

df2.iloc[[0,2,3], [0,2]]  # rows at positions 0, 2, 3 and columns at positions 0, 2

out:

	animal	visits
a	cat  	1
c	snake	2
d	dog 	3
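
One detail worth seeing side by side is that position-based slices with `.iloc` exclude the end position, while label-based slices with `.loc` include the end label. A minimal sketch, using a small made-up DataFrame `demo` rather than `df2`:

```python
import pandas as pd

# Illustrative data; the labels a..d play the role of df2's index.
demo = pd.DataFrame({'age': [2.5, 3.0, 0.5, 4.0]}, index=list('abcd'))

by_position = demo.iloc[1:3]     # positions 1 and 2; position 3 is excluded
by_label = demo.loc['b':'c']     # labels b through c; the end label is included
wider_label = demo.loc['b':'d']  # labels b, c, and d

print(by_position)
print(by_label)
```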

35. Copy a DataFrame

# make a copy of the DataFrame so the dataset can be used by several independent workflows
df3 = df2.copy()
df3

out:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	2.0	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

36. Check whether DataFrame elements are missing

df3.isnull()  # returns True where the value is missing

out:

    animal	age	visits	priority
a	False	False	False	False
b	False	False	False	False
c	False	False	False	False
d	False	True	False	False
e	False	False	False	False
f	False	False	False	False
g	False	False	False	False
h	False	True	False	False
i	False	False	False	False
j	False	False	False	False

Check a specific column for missing values:

df3['age'].isnull()  # returns True where age is missing

out:

a    False
b    False
c    False
d     True
e    False
f    False
g    False
h     True
i    False
j    False
Name: age, dtype: bool
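
The boolean result of `isnull()` is often reduced further. As a sketch under the assumption of a small made-up DataFrame `demo` (not the `df3` above), summing the mask counts missing values per column, and `any(axis=1)` selects the rows that contain at least one:

```python
import numpy as np
import pandas as pd

# Illustrative data with two missing ages.
demo = pd.DataFrame({'animal': ['cat', 'dog', 'snake'],
                     'age': [2.5, np.nan, np.nan],
                     'visits': [1, 3, 2]})

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = demo.isnull().sum()

# Keep only the rows that have at least one missing value.
rows_with_missing = demo[demo.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```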

37. Add a column

1) Add a column from a Series:

num = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], index=df3.index)
df3['No.'] = num  # add a new column named 'No.'
df3

out:

   animal age	visits	priority	No.
a	cat	2.5	1	yes	0
b	cat	3.0	3	yes	1
c	snake	0.5	2	no	2
d	dog	NaN	3	yes	3
e	dog	5.0	2	no	4
f	cat	2.0	3	no	5
g	snake	4.5	1	no	6
h	cat	NaN	1	yes	7
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9

2) Add a column directly from a list:

num = list(range(1, 11))  # [1, 2, 3, ..., 9, 10]
df3['No.'] = num  # assign the list to the column named 'No.' (overwriting the one added above)
df3

out:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	cat	3.0	3	yes	2
c	snake	0.5	2	no	3
d	dog	NaN	3	yes	4
e	dog	5.0	2	no	5
f	cat	2.0	3	no	6
g	snake	4.5	1	no	7
h	cat	NaN	1	yes	8
i	dog	7.0	2	no	9
j	dog	3.0	1	no	10

38. Modify a DataFrame value by position

1) Using iat:

# change the value in the 2nd row and 2nd column from 3.0 to 2.0
df3.iat[1, 1] = 2  # positions start at 0, so this is 1, 1
df3

out:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	cat	2.0	3	yes	2
c	snake	0.5	2	no	3
d	dog	NaN	3	yes	4
e	dog	5.0	2	no	5
f	cat	2.0	3	no	6
g	snake	4.5	1	no	7
h	cat	NaN	1	yes	8
i	dog	7.0	2	no	9
j	dog	3.0	1	no	10

2) Using iloc:

# change the value in the 2nd row and 2nd column from 2.0 back to 3.0
df3.iloc[1, 1] = 3.0  # positions start at 0, so this is 1, 1
df3

out:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	cat	3.0	3	yes	2
c	snake	0.5	2	no	3
d	dog	NaN	3	yes	4
e	dog	5.0	2	no	5
f	cat	2.0	3	no	6
g	snake	4.5	1	no	7
h	cat	NaN	1	yes	8
i	dog	7.0	2	no	9
j	dog	3.0	1	no	10

39. Modify a DataFrame value by label

df3.loc['f', 'age'] = 1.5
df3

out:

animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	cat	3.0	3	yes	2
c	snake	0.5	2	no	3
d	dog	NaN	3	yes	4
e	dog	5.0	2	no	5
f	cat	1.5	3	no	6
g	snake	4.5	1	no	7
h	cat	NaN	1	yes	8
i	dog	7.0	2	no	9
j	dog	3.0	1	no	10

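40. DataFrame mean

df3.mean(numeric_only=True)  # numeric_only=True skips the string columns, which pandas 2.x no longer drops silently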

out:

age       3.375
visits    1.900
No.       5.500
dtype: float64

41. Sum over any column of a DataFrame

1) Sum a specific column:

df3['visits'].sum()

out:

19

2) Sum all columns by default:

df3.sum()

out:

animal      catcatsnakedogdogcatsnakecatdogdog
age                                         27
visits                                      19
priority              yesyesnoyesnononoyesnono
No.                                         55
dtype: object

String operations

42. Convert strings to lowercase

string = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca',
                    np.nan, 'CABA', 'dog', 'cat'])
print(string)
string.str.lower()

out:

0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

43. Convert strings to uppercase

string.str.upper()

out:

0       A
1       B
2       C
3    AABA
4    BACA
5     NaN
6    CABA
7     DOG
8     CAT
dtype: object

Handling missing values in a DataFrame

44. Fill missing values

df4 = df3.copy()
print(df4)
df4.fillna(value=3)  # fillna replaces NaN ("not a number") entries with the given value and returns a new DataFrame

out:

  animal  age  visits priority  No.
a    cat  2.5       1      yes    1
b    cat  3.0       3      yes    2
c  snake  0.5       2       no    3
d    dog  NaN       3      yes    4
e    dog  5.0       2       no    5
f    cat  1.5       3       no    6
g  snake  4.5       1       no    7
h    cat  NaN       1      yes    8
i    dog  7.0       2       no    9
j    dog  3.0       1       no   10
animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	cat	3.0	3	yes	2
c	snake	0.5	2	no	3
d	dog	3.0	3	yes	4
e	dog	5.0	2	no	5
f	cat	1.5	3	no	6
g	snake	4.5	1	no	7
h	cat	3.0	1	yes	8
i	dog	7.0	2	no	9
j	dog	3.0	1	no	10

45. Drop rows that contain missing values

df5 = df3.copy()
print(df5)
df5.dropna(how='any')  # any row containing a NaN is dropped

out:

    animal  age  visits priority  No.
a    cat  2.5       1      yes    1
b    cat  3.0       3      yes    2
c  snake  0.5       2       no    3
d    dog  NaN       3      yes    4
e    dog  5.0       2       no    5
f    cat  1.5       3       no    6
g  snake  4.5       1       no    7
h    cat  NaN       1      yes    8
i    dog  7.0       2       no    9
j    dog  3.0       1       no   10
   animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	cat	3.0	3	yes	2
c	snake	0.5	2	no	3
e	dog	5.0	2	no	5
f	cat	1.5	3	no	6
g	snake	4.5	1	no	7
i	dog	7.0	2	no	9
j	dog	3.0	1	no	10

46. Join DataFrames on a given column

left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]})
right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]})

print(left)
print(right)

# join on the key column; only foo2 appears on both sides, so the result has a single row, much like a SQL inner join
pd.merge(left, right, on='key')

out:

   key  one
0  foo1    1
1  foo2    2
   key  two
0  foo2    4
1  foo3    5
key	one	two
0	foo2	2	4
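
`pd.merge` defaults to an inner join, which is why only foo2 survives. As a sketch of the other join types (the variable names `inner`, `outer`, and `left_join` are chosen here for illustration), the `how` parameter controls which keys are kept:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]})
right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]})

inner = pd.merge(left, right, on='key')                   # keys present on both sides
outer = pd.merge(left, right, on='key', how='outer')      # keys from either side
left_join = pd.merge(left, right, on='key', how='left')   # every key from `left`

print(outer)      # cells with no match are filled with NaN
print(left_join)
```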

DataFrame file operations

47. Write a CSV file

# df3.to_csv('animal.csv', index=False, header=False)  # do not write the index or the header
df3.to_csv('animal.csv', index=True, header=True)  # both default to True, so they are written
print("Write succeeded.")

out:

Write succeeded.

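48. Read a CSV file

# df_animal = pd.read_csv('animal.csv', header=None)  # no header row: columns get the default labels 0, 1, ...
df_animal = pd.read_csv('animal.csv', header=1)  # header defaults to 0; it is the line number used as column labels, and the data starts below it. With header=1 the second line (row b) becomes the header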
df_animal

out:

	cat	3.0	3	yes	2
0	snake	0.5	2	no	3
1	dog	NaN	3	yes	4
2	dog	5.0	2	no	5
3	cat	1.5	3	no	6
4	snake	4.5	1	no	7
5	cat	NaN	1	yes	8
6	dog	7.0	2	no	9
7	dog	3.0	1	no	10
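
To get the original labels back when reading, the first CSV column can be used as the index. A minimal sketch, assuming a small made-up DataFrame `demo` and an in-memory `io.StringIO` buffer in place of the `animal.csv` file:

```python
import io
import pandas as pd

# Illustrative data with a labeled index, like df3.
demo = pd.DataFrame({'animal': ['cat', 'dog'], 'visits': [1, 3]},
                    index=['a', 'b'])

buffer = io.StringIO()   # stands in for a file on disk
demo.to_csv(buffer)      # index and header are written by default
buffer.seek(0)

# header=0 (the default) reads the first line as column names;
# index_col=0 turns the first column back into the index.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```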

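49. Write to Excel

# See the pandas source or documentation for the full function
# Signature from the older pandas release used when this was written; newer releases have dropped some of these parameters (for example encoding and verbose):
to_excel(self, excel_writer, sheet_name='Sheet1', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, startrow=0, startcol=0, engine=None, merge_cells=True, encoding=None,
inf_rep='inf', verbose=True, freeze_panes=None)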

Common parameters explained

- excel_writer : string or ExcelWriter object. File path or existing ExcelWriter (the target path).
- sheet_name : string, default 'Sheet1'. Name of the sheet that will contain the DataFrame.
- na_rep : string, default ''. Missing data representation.
- float_format : string, default None. Format string for floating point numbers.
- columns : sequence, optional. Columns to write.
- header : boolean or list of string, default True. Write out column names. If a list of strings is given it is assumed to be aliases for the column names.
- index : boolean, default True. Write row names (index).
- index_label : string or sequence, default None. Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.
- startrow : upper left cell row to dump data frame.
- startcol : upper left cell column to dump data frame.
- engine : string, default None. Write engine to use; you can also set this via the options io.excel.xlsx.writer, io.excel.xls.writer, and io.excel.xlsm.writer.
- merge_cells : boolean, default True. Write MultiIndex and hierarchical rows as merged cells.
- encoding : string, default None. Encoding of the resulting Excel file. Only necessary for xlwt; other writers support unicode natively.
- inf_rep : string, default 'inf'. Representation for infinity (there is no native representation for infinity in Excel).
- freeze_panes : tuple of integer (length 2), default None. Specifies the one-based bottommost row and rightmost column that is to be frozen.

df3.to_excel('animal.xlsx', sheet_name='Sheet1')
print("Write succeeded.")

out:

Write succeeded.

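50. Read from Excel

# See the pandas source or documentation for the full function
# Signature from an older pandas release; current releases spell sheetname as sheet_name, skip_footer as skipfooter, and parse_cols as usecols: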
read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0, index_col=None,names=None, parse_cols=None, parse_dates=False,date_parser=None,na_values=None,thousands=None, convert_float=True, has_index_names=None, converters=None,dtype=None, true_values=None, false_values=None, engine=None, squeeze=False, **kwds)

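Common parameters explained:

io : string or path object; path to the Excel file.
sheetname : string, int, mixed list of strings/ints, or None, default 0. Use sheetname=[0, 1] to read several sheets, or sheetname=None to read all sheets. Note: an int or string returns a DataFrame, while None or a list returns a dict of DataFrames.
header : int, list of ints, default 0. The row to use as column names; the default 0 takes the first row, and the data is everything below it. If the data has no column names, set header=None.
skiprows : list-like. Rows to skip at the beginning.
skip_footer : int, default 0. Number of rows to skip at the end.
index_col : int, list of ints, default None. Column(s) to use as the index; column names given as strings are also accepted.
names : array-like, default None. List of column names to use.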

pd.read_excel('animal.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

out:

  Unnamed: 0	animal	age	visits	priority	No.
0	a	cat	2.5	1	yes	1
1	b	cat	3.0	3	yes	2
2	c	snake	0.5	2	no	3
3	d	dog	NaN	3	yes	4
4	e	dog	5.0	2	no	5
5	f	cat	1.5	3	no	6
6	g	snake	4.5	1	no	7
7	h	cat	NaN	1	yes	8
8	i	dog	7.0	2	no	9
9	j	dog	3.0	1	no	10

Advanced part

Time series indexing

51. Build a Series indexed by every day of 2018, with random values

dti = pd.date_range(start='2018-01-01', end='2018-12-31', freq='D')
s = pd.Series(np.random.rand(len(dti)), index=dti)
s

out:

2018-01-01    0.747769
2018-01-02    0.936370
2018-01-03    0.698550
2018-01-04    0.169079
2018-01-05    0.408279
                ...   
2018-12-27    0.854535
2018-12-28    0.343373
2018-12-29    0.367045
2018-12-30    0.129674
2018-12-31    0.020034
Freq: D, Length: 365, dtype: float64

52. Sum the values of `s` that fall on a Wednesday

# weekday numbering starts at 0 for Monday
s[s.index.weekday == 2].sum()

out:

27.895681550217724

Extract the entries whose date is a Wednesday:

s[s.index.weekday == 2]  # Monday is 0, so Wednesday is 2

out:

2018-01-03    0.698550
2018-01-10    0.385514
2018-01-17    0.379610
2018-01-24    0.206736
2018-01-31    0.877809
2018-02-07    0.379703
2018-02-14    0.947930
2018-02-21    0.223449
2018-02-28    0.008998
2018-03-07    0.326007
2018-03-14    0.018386
2018-03-21    0.830026
2018-03-28    0.078421
2018-04-04    0.754530
2018-04-11    0.856349
2018-04-18    0.697045
2018-04-25    0.849338
2018-05-02    0.767433
2018-05-09    0.774195
2018-05-16    0.566739
2018-05-23    0.691705
2018-05-30    0.463924
2018-06-06    0.962409
2018-06-13    0.392350
2018-06-20    0.569404
2018-06-27    0.333270
2018-07-04    0.655454
2018-07-11    0.333805
2018-07-18    0.588172
2018-07-25    0.496672
2018-08-01    0.438350
2018-08-08    0.065597
2018-08-15    0.640373
2018-08-22    0.639175
2018-08-29    0.233980
2018-09-05    0.747509
2018-09-12    0.765390
2018-09-19    0.317519
2018-09-26    0.467703
2018-10-03    0.329899
2018-10-10    0.963069
2018-10-17    0.566592
2018-10-24    0.148371
2018-10-31    0.406007
2018-11-07    0.876346
2018-11-14    0.869522
2018-11-21    0.553947
2018-11-28    0.126612
2018-12-05    0.878116
2018-12-12    0.856469
2018-12-19    0.138991
2018-12-26    0.752213
dtype: float64

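53. Compute the mean of `s` for each month

s.resample('M').mean()  # resample by calendar month ('M' = month end; newer pandas releases spell this alias 'ME')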

out:

2018-01-31    0.483828
2018-02-28    0.552954
2018-03-31    0.419301
2018-04-30    0.523942
2018-05-31    0.547059
2018-06-30    0.460998
2018-07-31    0.482454
2018-08-31    0.462668
2018-09-30    0.557952
2018-10-31    0.461560
2018-11-30    0.499936
2018-12-31    0.492086
Freq: M, dtype: float64
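
For data that covers a single year, the same per-month grouping can be sketched without a frequency alias by grouping on the month number of the index. This is an alternative shown for illustration, not the method used above; the Series `demo` holds made-up values (0, 1, 2, ...) so the result is easy to check by hand:

```python
import numpy as np
import pandas as pd

days = pd.date_range('2018-01-01', '2018-12-31', freq='D')
demo = pd.Series(np.arange(len(days), dtype=float), index=days)

# Group by the month number (1..12) of each timestamp and average.
monthly_mean = demo.groupby(demo.index.month).mean()
print(monthly_mean)
```

Unlike `resample`, this result is indexed by month number rather than by month-end dates, and it would merge the same month from different years if the data spanned more than one year.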

54. Convert the time resolution of a Series (seconds to minutes)

s = pd.date_range('today', periods=100, freq='S')

ts = pd.Series(np.random.randint(0, 500, len(s)), index=s)

ts.resample('Min').sum()

out:

2020-06-30 07:52:00     6130
2020-06-30 07:53:00    16115
2020-06-30 07:54:00     3366
Freq: T, dtype: int64

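55. UTC (Coordinated Universal Time)

s = pd.date_range('today', periods=1, freq='D')  # get the current time
ts = pd.Series(np.random.randn(len(s)), s)  # random value
ts_utc = ts.tz_localize('UTC')  # label the naive timestamps as UTC (the clock time itself is unchanged)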
ts_utc

out:

2020-06-30 08:01:12.838843+00:00   -0.361407
Freq: D, dtype: float64

56. Convert to the Shanghai time zone

ts_utc.tz_convert('Asia/Shanghai')

out:

2020-06-30 16:01:12.838843+08:00   -0.361407
Freq: D, dtype: float64

Check this against your own local clock: does it match? Shanghai time is 8 hours ahead of UTC.
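
The difference between the two calls can be sketched with a fixed timestamp instead of 'today' (the value 2020-06-30 08:00 is chosen here only so the result is reproducible): `tz_localize` attaches a time zone to naive timestamps, while `tz_convert` re-expresses the same instant in another zone.

```python
import pandas as pd

# A naive (time-zone-free) timestamp used as the index.
naive = pd.Series([1.0], index=pd.DatetimeIndex(['2020-06-30 08:00:00']))

as_utc = naive.tz_localize('UTC')                 # still 08:00, now marked as UTC
in_shanghai = as_utc.tz_convert('Asia/Shanghai')  # same instant, shown as 16:00+08:00

print(as_utc.index[0])
print(in_shanghai.index[0])
```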

57. Convert between different time representations

rng = pd.date_range('1/1/2018', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts)
ps = ts.to_period()
print(ps)
ps.to_timestamp()

out:

2018-01-31   -0.293837
2018-02-28   -0.692926
2018-03-31    1.204096
2018-04-30    2.485244
2018-05-31    0.019893
Freq: M, dtype: float64
2018-01   -0.293837
2018-02   -0.692926
2018-03    1.204096
2018-04    2.485244
2018-05    0.019893
Freq: M, dtype: float64
2018-01-01   -0.293837
2018-02-01   -0.692926
2018-03-01    1.204096
2018-04-01    2.485244
2018-05-01    0.019893
Freq: MS, dtype: float64

Series with a multi-level index

58. Create a multi-index Series

Build a multi-index Series with random values, indexed by letters = ['A', 'B', 'C'] and numbers = list(range(10)).

letters = ['A', 'B', 'C']
numbers = list(range(10))

mi = pd.MultiIndex.from_product([letters, numbers])  # build the multi-level index
s = pd.Series(np.random.rand(30), index=mi)  # random values
s

out:

A  0    0.744729
   1    0.351805
   2    0.529587
   3    0.043310
   4    0.292182
   5    0.740887
   6    0.428499
   7    0.653610
   8    0.107801
   9    0.590899
B  0    0.542375
   1    0.231597
   2    0.410738
   3    0.634838
   4    0.072990
   5    0.188618
   6    0.821767
   7    0.624321
   8    0.514436
   9    0.695192
C  0    0.160558
   1    0.631878
   2    0.663879
   3    0.667969
   4    0.139756
   5    0.878765
   6    0.129184
   7    0.449790
   8    0.835275
   9    0.965602
dtype: float64

59. Query a multi-index Series

Get a single value:

s['A'][0]  # the value at second-level label 0 under first-level label A

out:

0.7447294320037277

Select the values whose second-level index is 1, 3, or 6:

s.loc[:, [1, 3, 6]]  # all entries whose second-level label is 1, 3, or 6

out:

A  1    0.351805
   3    0.043310
   6    0.428499
B  1    0.231597
   3    0.634838
   6    0.821767
C  1    0.631878
   3    0.667969
   6    0.129184
dtype: float64

60. Slice a multi-index Series

s.loc[pd.IndexSlice[:'B', 5:]]

out:

A  5    0.740887
   6    0.428499
   7    0.653610
   8    0.107801
   9    0.590899
B  5    0.188618
   6    0.821767
   7    0.624321
   8    0.514436
   9    0.695192
dtype: float64

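DataFrame with a multi-level index

61. Create a DataFrame with a multi-level index

Create a multi-index DataFrame whose first index level is the letters A and B and whose second level is 1, 2, 3 within each letter, filled with the integers 0 to 11.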

frame = pd.DataFrame(np.arange(12).reshape(6, 2),
                     index=[list('AAABBB'), list('123123')],
                     columns=['hello', 'shiyanlou'])
frame

out:

	  hello shiyanlou
A	1	0	  1
    2	2	  3
    3	4	  5
B	1	6	  7
    2	8	  9
    3	10	  11

62. Name the levels of a multi-level index

frame.index.names = ['first', 'second']
frame

out:


     		 hello shiyanlou
first second		
  A	    1		0	1
        2		2	3
        3		4	5
  B	    1		6	7
        2		8	9
        3		10	11

63. Group a multi-index DataFrame and sum

frame.groupby('first').sum()

out:

	  hello	shiyanlou
first		
A		6	   9
B		24	   27

64. Move DataFrame column labels into the row index (stack)

print(frame)
frame.stack()

out:

              hello  shiyanlou
first second                  
A     1           0          1
      2           2          3
      3           4          5
B     1           6          7
      2           8          9
      3          10         11
first  second           
A      1       hello         0
               shiyanlou     1
       2       hello         2
               shiyanlou     3
       3       hello         4
               shiyanlou     5
B      1       hello         6
               shiyanlou     7
       2       hello         8
               shiyanlou     9
       3       hello        10
               shiyanlou    11
dtype: int64

65. Move a DataFrame index level into the columns (unstack)

print(frame)
frame.unstack()

out:

              hello  shiyanlou
first second                  
A     1           0          1
      2           2          3
      3           4          5
B     1           6          7
      2           8          9
      3          10         11
  
		hello	shiyanlou
second	1	2	3	1	2	3
first						
    A	0	2	4	1	3	5
    B	6	8	10	7	9	11

As shown, unstack turns the second-level row index into a level of the column index.
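
A small sketch of how `stack` and `unstack` relate, rebuilding the same `frame` so the block runs on its own (the names `long_form`, `round_trip`, and `wide` are introduced here for illustration): stacking and then unstacking returns the original layout, and `unstack` accepts a `level` argument to choose which index level moves to the columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(6, 2),
                     index=[list('AAABBB'), list('123123')],
                     columns=['hello', 'shiyanlou'])
frame.index.names = ['first', 'second']

long_form = frame.stack()               # column labels become the innermost index level
round_trip = long_form.unstack()        # the innermost level moves back to the columns
wide = frame.unstack(level='first')     # move the 'first' level to the columns instead

print(round_trip)
print(wide)
```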

66. Conditional lookup in a DataFrame

# sample data
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df

out:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	2.0	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

Find all rows where age is greater than 3:

df[df['age'] > 3]

out:

	animal	age	visits	priority
e	dog	    5.0	2	no
g	snake	4.5	1	no
i	dog		7.0	2	no

67. Slice by row and column position

1) Row range and column range:

df.iloc[2:4, 1:3]

out:

    age	visits
c	0.5	2
d	NaN	3

2) Row and column coordinates:

df.iloc[1, 2]

out:

3

68. Query a DataFrame with multiple conditions

Find all rows where age < 3 and the animal is cat.

df = pd.DataFrame(data, index=labels)

df[(df['animal'] == 'cat') & (df['age'] < 3)]

out:

	animal	age		visits	priority
a	cat		2.5		1		yes
f	cat		2.0		3		no

69. Query a DataFrame by keyword

Find the rows whose animal is cat or dog:

df[df['animal'].isin(['cat', 'dog'])]

out:

	animal	age	visits	priority
a	cat		2.5	1	yes
b	cat		3.0	3	yes
d	dog		NaN	3	yes
e	dog		5.0	2	no
f	cat		2.0	3	no
h	cat		NaN	1	yes
i	dog		7.0	2	no
j	dog		3.0	1	no

70. Query a DataFrame by row label and column name

Extract the animal and age columns for the rows at positions 3, 4, and 8 (counting from 0):

df.loc[df.index[[3, 4, 8]], ['animal', 'age']]

out:

	animal	age
d	dog		NaN
e	dog		5.0
i	dog		7.0

71. Sort a DataFrame by multiple keys

Sort by age in descending order, then by visits in ascending order:

df.sort_values(by=['age', 'visits'], ascending=[False, True])

out:


	animal	age	visits	priority
i	dog		7.0	2	no
e	dog		5.0	2	no
g	snake	4.5	1	no
j	dog		3.0	1	no
b	cat		3.0	3	yes
a	cat		2.5	1	yes
f	cat		2.0	3	no
c	snake	0.5	2	no
h	cat		NaN	1	yes
d	dog		NaN	3	yes

72. Replace multiple values in a DataFrame

1) Replace values

Replace yes with True and no with False in the priority column:

df['priority'].map({'yes': True, 'no': False})

out:

a     True
b     True
c    False
d     True
e    False
f    False
g    False
h     True
i    False
j    False
Name: priority, dtype: bool

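2) Keep a fixed number of decimals

Format age with two decimal places:

# Method 1 (returns strings):
df['age'].map(lambda x: '%.2f' % x)
# Method 2 (returns strings):
df['age'].map(lambda x: format(x, '.2f'))
# Method 3: returns floats rounded to 2 places, so trailing zeros may not be displayed
df['age'].round(decimals=2)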

out:

a    2.50
b    3.00
c    0.50
d     nan
e    5.00
f    2.00
g    4.50
h     nan
i    7.00
j    3.00
Name: age, dtype: object

3) Convert to percentages

# Method 1:
df['age'].map(lambda x: '%.2f%%' % (x*100))
# Method 2:
df['age'].map(lambda x: format(x, '.2%'))  # no need to multiply by 100
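
The practical difference between formatting and rounding can be sketched on a small made-up Series `age` (not the `df['age']` column above): formatting produces text that always shows two decimals, while `round` keeps the values numeric so they can still be used in arithmetic.

```python
import numpy as np
import pandas as pd

age = pd.Series([2.456, np.nan, 7.0])

as_text = age.map(lambda x: format(x, '.2f'))  # strings; NaN becomes the text 'nan'
rounded = age.round(2)                         # floats; NaN stays a real missing value

print(as_text.tolist())
print(rounded.tolist())
print(rounded.sum())   # arithmetic still works on the rounded floats
```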

73. Group a DataFrame and sum

1) Group with groupby

df.groupby('animal').size()

out:

animal
cat      4
dog      4
snake    2
dtype: int64

Note that the result above is a Series, so its contents can be accessed with the usual Series attributes and methods, for example:

print(df.groupby('animal').size().values)
print(df.groupby('animal').size().index)
print(type(df.groupby('animal').size()))

out:

[4 4 2]
Index(['cat', 'dog', 'snake'], dtype='object', name='animal')
<class 'pandas.core.series.Series'>

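2) Group with groupby and sum

df.groupby('animal')[['age', 'visits']].sum()  # select the numeric columns explicitly; pandas 2.x would otherwise also concatenate the strings in priority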

out:


		age	visits
animal		
cat		7.5		8
dog		15.0	8
snake	5.0		3

74. Concatenate several DataFrames from a list

1) Vertical concatenation

temp_df1 = pd.DataFrame(np.random.randn(3, 2))  # DataFrame 1, filled with random numbers
temp_df2 = pd.DataFrame(np.random.randn(3, 2))  # DataFrame 2, filled with random numbers
temp_df3 = pd.DataFrame(np.random.randn(3, 2))  # DataFrame 3, filled with random numbers

print(temp_df1)
print(temp_df2)
print(temp_df3)

pieces = [temp_df1, temp_df2, temp_df3]
pd.concat(pieces, axis=0)  # axis defaults to 0, i.e. vertical concatenation

out:

          0         1
0  0.437607 -0.648355
1 -0.416831 -0.405202
2  1.681175 -0.031025
          0         1
0 -0.730415 -0.806742
1 -0.914077  0.809963
2 -0.488658 -0.620225
          0         1
0 -1.210932  0.606868
1 -1.539275  1.830870
2 -0.906066  0.440358
		   0			1
0	0.437607	-0.648355
1	-0.416831	-0.405202
2	1.681175	-0.031025
0	-0.730415	-0.806742
1	-0.914077	0.809963
2	-0.488658	-0.620225
0	-1.210932	0.606868
1	-1.539275	1.830870
2	-0.906066	0.440358

2) Horizontal concatenation

temp_df1 = pd.DataFrame(np.random.randn(3, 2))  # DataFrame 1, filled with random numbers
temp_df2 = pd.DataFrame(np.random.randn(3, 2))  # DataFrame 2, filled with random numbers
temp_df3 = pd.DataFrame(np.random.randn(3, 2))  # DataFrame 3, filled with random numbers

print(temp_df1)
print(temp_df2)
print(temp_df3)

pieces = [temp_df1, temp_df2, temp_df3]
pd.concat(pieces, axis=1)

out:

          0         1
0  1.116153  0.250272
1 -0.941279 -0.159497
2 -0.537866  1.675018
          0         1
0 -0.103160  1.228339
1 -0.149218 -0.551139
2 -0.229225 -0.156848
          0         1
0  0.971171 -0.715241
1  0.077248  0.941577
2  1.535163  0.333749
			0	1	0	1	0	1
0	1.116153	0.250272	-0.103160	1.228339	0.971171	-0.715241
1	-0.941279	-0.159497	-0.149218	-0.551139	0.077248	0.941577
2	-0.537866	1.675018	-0.229225	-0.156848	1.535163	0.333749

75. Find the column of a DataFrame with the smallest sum

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
print(df)
df.sum().idxmin()  # idxmax() and idxmin() are Series methods that return the index label of the maximum / minimum value

out:

          a         b         c         d         e         f         g  \
0  0.265336  0.281261  0.626660  0.455936  0.469568  0.160094  0.254143   
1  0.293103  0.429918  0.861056  0.704762  0.534546  0.997590  0.651032   
2  0.653752  0.239481  0.956774  0.983244  0.835387  0.739893  0.446470   
3  0.335220  0.832347  0.925990  0.083933  0.092788  0.144650  0.284757   
4  0.923494  0.926540  0.227792  0.872578  0.471281  0.786390  0.731639   

          h         i         j  
0  0.168989  0.034899  0.797001  
1  0.942421  0.926441  0.218743  
2  0.776017  0.662287  0.806842  
3  0.247964  0.102461  0.051523  
4  0.665478  0.116302  0.256650

76. Subtract each row's mean from every element of a DataFrame

df = pd.DataFrame(np.random.random(size=(5, 3)))
print(df)
print('means:\n', df.mean(axis=1))  # row means
df.sub(df.mean(axis=1), axis=0)

out:

          0         1         2
0  0.865347  0.060866  0.750320
1  0.905023  0.779393  0.969498
2  0.800366  0.334823  0.346131
3  0.930328  0.295275  0.761584
4  0.922344  0.904810  0.062543
means:
0    0.558844
1    0.884638
2    0.493774
3    0.662396
4    0.629899
dtype: float64
			0			1		   2
0	0.306502	-0.497978	0.191476
1	0.020385	-0.105246	0.084860
2	0.306593	-0.158950	-0.147642
3	0.267932	-0.367120	0.099188
4	0.292445	0.274911	-0.567356
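
The role of `axis=0` in `sub` is easier to check with round numbers. In this sketch (the DataFrame `demo` is made up for illustration), the Series of row means is matched against the row index, so every row has its own mean subtracted and the centered rows average to zero:

```python
import pandas as pd

demo = pd.DataFrame([[1.0, 2.0, 3.0],
                     [10.0, 20.0, 60.0]])

row_means = demo.mean(axis=1)            # one mean per row: 2.0 and 30.0
centered = demo.sub(row_means, axis=0)   # align row_means with the row index

print(centered)
print(centered.mean(axis=1))             # each row now has mean 0
```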

77. Group a DataFrame and get the sum of the three largest values in each group

df = pd.DataFrame({'A': list('aaabbcaabcccbbc'),
                   'B': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})
print(df)
df.groupby('A')['B'].nlargest(3).groupby(level=0).sum()  # sum(level=0) was removed in pandas 2.0

out:

    A    B
0   a   12
1   a  345
2   a    3
3   b    1
4   b   45
5   c   14
6   a    4
7   a   52
8   b   54
9   c   23
10  c  235
11  c   21
12  b   57
13  b    3
14  c   87
A
a    409
b    156
c    345
Name: B, dtype: int64
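
An equivalent way to express the same computation, shown here as an alternative sketch rather than the exercise's own solution, is to apply a small function to each group that takes its three largest values and sums them (the name `top3` is introduced for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': list('aaabbcaabcccbbc'),
                   'B': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})

# For each group of A, keep the 3 largest values of B and add them up.
top3 = df.groupby('A')['B'].apply(lambda g: g.nlargest(3).sum())
print(top3)
```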

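Pivot tables

When analyzing a large dataset, a pivot table (pivot_table) helps explore the relationships between features without modifying the original data.

78. Create a pivot table

Build a new table that aggregates the data using columns A and B as the index: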

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})

print(df)

pd.pivot_table(df, values=['D', 'E'], index=['A', 'B'])  # name the numeric columns explicitly; the string column C cannot be averaged

out:

        A  B    C         D         E
0     one  A  foo -1.039258 -0.443270
1     one  B  foo  0.388518  0.524361
2     two  C  foo -0.330776 -0.878025
3   three  A  bar  0.000832  1.133901
4     one  B  bar  0.418298  1.626217
5     one  C  bar  0.459358 -1.203031
6     two  A  foo -0.658593 -1.116155
7   three  B  foo -0.331466  1.130495
8     one  C  foo -0.197646  0.726132
9     one  A  bar -0.895106 -0.461336
10    two  B  bar  0.584046  0.674531
11  three  C  bar -0.401441 -0.017452
				D			E
A	B		
one	A	-0.967182	-0.452303
    B	0.403408	1.075289
    C	0.130856	-0.238450
three	A	0.000832	1.133901
    B	-0.331466	1.130495
    C	-0.401441	-0.017452
two	A	-0.658593	-1.116155
    B	0.584046	0.674531
    C	-0.330776	-0.878025

79. Aggregate specific columns in a pivot table

Aggregate column D of the DataFrame using columns A and B as the index; the default aggregation is the mean.

pd.pivot_table(df, values=['D'], index=['A', 'B'])

out:

		D
A	B	
one	A	-0.967182
    B	0.403408
    C	0.130856
three	A	0.000832
    B	-0.331466
    C	-0.401441
two	A	-0.658593
    B	0.584046
    C	-0.330776

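80. Choose the aggregation functions of a pivot table

The previous exercise aggregated column D with the default mean. Other aggregations can be requested through aggfunc:

pd.pivot_table(df, values=['D'], index=['A', 'B'], aggfunc=[np.sum, len])

out: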

            sum			len
            D	 		D
A		B		
one		A	-1.934364	2.0
    	B	0.806816	2.0
    	C	0.261711	2.0
three	A	0.000832	1.0
        B	-0.331466	1.0
        C	-0.401441	1.0
two		A	-0.658593	1.0
        B	0.584046	1.0
        C	-0.330776	1.0
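
`aggfunc` also accepts function names as strings. The sketch below uses a tiny made-up DataFrame `demo` so the numbers can be verified by hand; passing a list of aggregations yields one column group per aggregation:

```python
import pandas as pd

demo = pd.DataFrame({'A': ['one', 'one', 'two'],
                     'B': ['x', 'x', 'y'],
                     'D': [1.0, 2.0, 5.0]})

# One row per (A, B) pair, with the sum and the count of D side by side.
table = pd.pivot_table(demo, values=['D'], index=['A', 'B'],
                       aggfunc=['sum', 'count'])
print(table)
```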

81. Split a pivot table by an additional column

When aggregating column D by columns A and B, the columns parameter can be used to see how column C affects column D.

pd.pivot_table(df, values=['D'], index=['A', 'B'],
               columns=['C'], aggfunc=np.sum)

out:

			D
		C	bar			foo
A		B		
one		A	-0.895106	-1.039258
        B	0.418298	0.388518
        C	0.459358	-0.197646
three	A	0.000832	NaN
        B	NaN	-0.331466
        C	-0.401441	NaN
two	A	NaN	-0.658593
        B	0.584046	NaN
        C	NaN	-0.330776

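82. Handle missing values in a pivot table

Combinations that do not occur in the data appear as missing values in a pivot table; fill_value can be used to replace them: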

pd.pivot_table(df, values=['D'], index=['A', 'B'],
               columns=['C'], aggfunc=np.sum, fill_value=0)

out:

                D
        C		bar		foo
 A		B		
one		A	-0.895106	-1.039258
        B	0.418298	0.388518
        C	0.459358	-0.197646
three	A	0.000832	0.000000
        B	0.000000	-0.331466
        C	-0.401441	0.000000
two		A	0.000000	-0.658593
        B	0.584046	0.000000
        C	0.000000	-0.330776

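Categorical data

Data come in two main forms: quantitative and qualitative. Quantitative data are countable values whose range can vary, whereas qualitative data take their values from a fixed, predetermined set. Categorical data are one kind of qualitative data.

83. Defining categorical data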

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": [
                  'a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")
df

out:


	id	raw_grade	grade
0	1	a	a
1	2	b	b
2	3	b	b
3	4	a	a
4	5	a	a
5	6	e	e

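84. Renaming categories

# Assigning to .cat.categories was removed in pandas 2.0; rename_categories() returns the renamed column
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])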
df

out:

	id	raw_grade	grade
0	1	a	very good
1	2	b	good
2	3	b	good
3	4	a	very good
4	5	a	very good
5	6	e	very bad

85. Reordering categories and adding the missing ones

df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])
df

out:


	id	raw_grade	grade
0	1	a	very good
1	2	b	good
2	3	b	good
3	4	a	very good
4	5	a	very good
5	6	e	very bad
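
In the exercise above every existing label also appears in the new category list, so no value is lost. The sketch below, on a made-up Series (grades is an assumed name), shows the other case: a value whose category is left out of set_categories becomes NaN.

```python
import pandas as pd

# Hypothetical grades stored as a categorical Series.
grades = pd.Series(["a", "b", "e"], dtype="category")

print(grades.cat.categories)  # the distinct labels
print(grades.cat.codes)       # one integer code per row

# Keep "a" and "b", drop "e", and add an unused category "c".
changed = grades.cat.set_categories(["c", "b", "a"])
print(changed)
```

Here the third row turns into NaN because "e" is no longer a valid category, while "c" exists as a category without any rows.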

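86. Sorting by a categorical column

Sorting follows the category order set in the previous exercise, not alphabetical order.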

df.sort_values(by="grade")

out:

	id	raw_grade	grade
5	6	e	very bad
1	2	b	good
2	3	b	good
0	1	a	very good
3	4	a	very good
4	5	a	very good

87. Grouping by a categorical column

df.groupby("grade", observed=False).size()  # observed=False keeps categories that have no rows

out:

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
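
The zero counts for bad and medium appear only when unobserved categories are kept. A minimal sketch of the observed parameter on made-up data (frame is an assumed name):

```python
import pandas as pd

# Hypothetical frame: the category "bad" is declared but never used.
frame = pd.DataFrame({
    "grade": pd.Categorical(["good", "very good", "good"],
                            categories=["bad", "good", "very good"]),
})

# observed=False reports every declared category, including empty ones.
all_levels = frame.groupby("grade", observed=False).size()

# observed=True reports only the categories that occur in the data.
seen_only = frame.groupby("grade", observed=True).size()

print(all_levels)
print(seen_only)
```

Passing observed explicitly avoids depending on the default, which differs between pandas versions.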

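Data cleaning

The data we receive often fall short of what the final analysis requires: they contain many missing values and bad records, so they need to be cleaned first.

88. Filling in missing values

Some values in FlightNumber are missing. The numbers increase in steps of 10; fill in the missing values so that the data are complete, and convert the column to int.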

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
                               'Budapest_PaRis', 'Brussels_londOn'],
                   'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
                   'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
                               '12. Air France', '"Swiss Air"']})
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
df

out:

	From_To	FlightNumber	RecentDelays	Airline
0	LoNDon_paris	10045	[23, 47]	KLM(!)
1	MAdrid_miLAN	10055	[]	<Air France> (12)
2	londON_StockhOlm	10065	[24, 43, 87]	(British Airways. )
3	Budapest_PaRis	10075	[13]	12. Air France
4	Brussels_londOn	10085	[67, 32]	"Swiss Air"
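
interpolate() fills each gap linearly from its neighbours by default, which is why it recovers the step of 10 here. The sketch below (variable names are assumptions) also shows a limitation: a gap at the very start has no earlier neighbour and stays NaN, in which case astype(int) would fail and a nullable integer dtype is one possible workaround.

```python
import numpy as np
import pandas as pd

# Same flight numbers as in the exercise.
flights = pd.Series([10045, np.nan, 10065, np.nan, 10085])
filled = flights.interpolate()   # linear interpolation between neighbours
as_int = filled.astype(int)      # safe: no NaN is left
print(as_int.tolist())

# Hypothetical series with a leading gap: the first value cannot be interpolated.
gappy = pd.Series([np.nan, 1.0, np.nan, 3.0]).interpolate()
print(gappy)

# The nullable "Int64" dtype can hold integers alongside missing values.
nullable = gappy.astype("Int64")
print(nullable)
```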

89. Splitting a column

From_To should really be two separate columns, From and To. Split From_To on _ and store the two parts in a new table.

temp = df.From_To.str.split('_', expand=True)
temp.columns = ['From', 'To']
temp

out:

	From	To
0	LoNDon	paris
1	MAdrid	miLAN
2	londON	StockhOlm
3	Budapest	PaRis
4	Brussels	londOn

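90. Standardizing strings

The place names are inconsistently capitalized (for example, londON should be London), so they need to be standardized.

temp['From'] = temp['From'].str.capitalize()  # capitalize() upper-cases the first letter and lower-cases the rest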
temp['To'] = temp['To'].str.capitalize()
print(temp['From'])
print(temp['To'])

out:

0      London
1      Madrid
2      London
3    Budapest
4    Brussels
Name: From, dtype: object
0        Paris
1        Milan
2    Stockholm
3        Paris
4       London
Name: To, dtype: object
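
capitalize() is enough here because every place name is a single word. For multi-word names it only upper-cases the first word; str.title() is one alternative. A small sketch on made-up city names:

```python
import pandas as pd

# Hypothetical names, including one with two words.
cities = pd.Series(["londON", "new YORK", "StockhOlm"])

capitalized = cities.str.capitalize()  # first letter upper, everything else lower
titled = cities.str.title()            # first letter of every word upper

print(capitalized.tolist())  # ['London', 'New york', 'Stockholm']
print(titled.tolist())       # ['London', 'New York', 'Stockholm']
```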

91. Dropping the bad column and adding the cleaned ones

Drop the original From_To column and add the cleaned From and To columns.

df = df.drop('From_To', axis=1)
df = df.join(temp)
print(df)

out:

   FlightNumber  RecentDelays              Airline      From         To
0         10045      [23, 47]               KLM(!)    London      Paris
1         10055            []    <Air France> (12)    Madrid      Milan
2         10065  [24, 43, 87]  (British Airways. )    London  Stockholm
3         10075          [13]       12. Air France  Budapest      Paris
4         10085      [67, 32]          "Swiss Air"  Brussels     London

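92. Removing extraneous characters

Many values in the Airline column contain stray characters that would interfere with later analysis, so they need to be cleaned up.

print(df['Airline'])
# extract() pulls out the first match of a regular expression
df['Airline'] = df['Airline'].str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()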
df

out:

	FlightNumber	RecentDelays	Airline	From	To
0	10045	[23, 47]	KLM	London	Paris
1	10055	[]	Air France	Madrid	Milan
2	10065	[24, 43, 87]	British Airways	London	Stockholm
3	10075	[13]	Air France	Budapest	Paris
4	10085	[67, 32]	Swiss Air	Brussels	London
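
To see what the pattern captures, the same regular expression can be tried with the standard re module. This is a sketch that mirrors, rather than reproduces, what str.extract does: it takes the first run of letters and whitespace in each string and strips the surrounding spaces.

```python
import re

# Same pattern as above: one or more letters or whitespace characters.
pattern = re.compile(r"([a-zA-Z\s]+)")

raw = ['KLM(!)', '<Air France> (12)', '(British Airways. )',
       '12. Air France', '"Swiss Air"']

# search() returns the first match, like str.extract with one capture group.
cleaned = [pattern.search(name).group(1).strip() for name in raw]
print(cleaned)
# ['KLM', 'Air France', 'British Airways', 'Air France', 'Swiss Air']
```

For '12. Air France' the first match starts at the space after the dot, which is why the strip() step is needed.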

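93. Normalizing the format

RecentDelays stores each record as a list, and the lists differ in length, which complicates later analysis. Here the lists are expanded so that elements at the same position form one column, with NaN wherever a list has no element at that position.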

delays = df['RecentDelays'].apply(pd.Series)

delays.columns = ['delay_{}'.format(n)
                  for n in range(1, len(delays.columns)+1)]

df = df.drop('RecentDelays', axis=1).join(delays)
df

out:

	FlightNumber	Airline	From	To	delay_1	delay_2	delay_3
0	10045	KLM	London	Paris	23.0	47.0	NaN
1	10055	Air France	Madrid	Milan	NaN	NaN	NaN
2	10065	British Airways	London	Stockholm	24.0	43.0	87.0
3	10075	Air France	Budapest	Paris	13.0	NaN	NaN
4	10085	Swiss Air	Brussels	London	67.0	32.0	NaN
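
An alternative to apply(pd.Series), sketched here on a standalone Series (delays_raw is an assumed name), is to build the frame directly from the nested lists. This is not the method used above, only a variant that avoids constructing one Series per row:

```python
import pandas as pd

# Same delay lists as in the exercise.
delays_raw = pd.Series([[23, 47], [], [24, 43, 87], [13], [67, 32]])

# Shorter lists are padded with NaN up to the longest list.
delays = pd.DataFrame(delays_raw.tolist(), index=delays_raw.index)
delays.columns = [f"delay_{n}" for n in range(1, delays.shape[1] + 1)]

print(delays)
```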

Data preprocessing

94. Binning values into intervals

The math grades of some students in a class are shown in the table below:

df=pd.DataFrame({'name':['Alice','Bob','Candy','Dany','Ella','Frank','Grace','Jenny'],'grades':[58,83,79,65,93,45,61,88]})

out:

	name	grades
0	Alice	58
1	Bob	83
2	Candy	79
3	Dany	65
4	Ella	93
5	Frank	45
6	Grace	61
7	Jenny	88

What we really care about is whether each student passed, so split the grades according to whether they are greater than 60.

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Candy', 'Dany', 'Ella',
                            'Frank', 'Grace', 'Jenny'],
                   'grades': [58, 83, 79, 65, 93, 45, 61, 88]})


def choice(x):
    if x > 60:
        return 1
    else:
        return 0


df['grades'] = df['grades'].map(choice)  # apply choice() to every grade
df

out:

	name	grades
0	Alice	0
1	Bob	1
2	Candy	1
3	Dany	1
4	Ella	1
5	Frank	0
6	Grace	1
7	Jenny	1
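
The same split can be written without a helper function. The sketch below (scores is an assumed name; the extra value 60 is added to show the boundary) uses a vectorized comparison and, as a labelled variant, pd.cut:

```python
import pandas as pd

# Grades from the exercise plus a boundary value of 60.
scores = pd.Series([58, 83, 79, 65, 93, 45, 61, 88, 60])

# Vectorized comparison: 1 if the grade is above 60, otherwise 0.
passed = (scores > 60).astype(int)

# pd.cut with right-closed bins (0, 60] and (60, 100].
binned = pd.cut(scores, bins=[0, 60, 100], labels=["fail", "pass"])

print(passed.tolist())
print(binned.tolist())
```

Because the bins are closed on the right, a grade of exactly 60 falls into (0, 60] and counts as a fail, matching the > 60 rule.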

95. Removing duplicates

Given a DataFrame with a single column A, defined as follows:

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

Remove the consecutive duplicates in column A.

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df.loc[df['A'].shift() != df['A']]

out:

	A
0	1
1	2
3	3
4	4
5	5
8	6
9	7
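
The shift() comparison keeps a row whenever its value differs from the row directly above it, so it removes only consecutive repeats. A sketch on made-up data (frame is an assumed name) contrasts this with drop_duplicates(), which removes every later repeat:

```python
import pandas as pd

# Hypothetical data where the value 1 returns after a different value.
frame = pd.DataFrame({"A": [1, 2, 2, 1, 1, 3]})

# Keep rows that differ from the previous row.
consecutive = frame.loc[frame["A"].shift() != frame["A"]]

# Keep only the first occurrence of each value.
unique = frame.drop_duplicates()

print(consecutive["A"].tolist())  # [1, 2, 1, 3]
print(unique["A"].tolist())       # [1, 2, 3]
```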

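96. Data normalization

Sometimes the columns of a DataFrame differ greatly in scale and need to be normalized.

Min-max normalization is a simple and common approach. The formula is: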
$Y=\frac{X-X_{min}}{X_{max}-X_{min}}$

def normalization(df):
    numerator = df.sub(df.min())
    denominator = (df.max()).sub(df.min())
    Y = numerator.div(denominator)
    return Y


df = pd.DataFrame(np.random.random(size=(5, 3)))
print(df)
normalization(df)

out:

         0         1         2
0  0.055276  0.666860  0.206399
1  0.873721  0.924465  0.105095
2  0.161571  0.979359  0.678480
3  0.698888  0.091796  0.692453
4  0.694759  0.888997  0.528819
	0	1	2
0	0.000000	0.647914	0.172474
1	1.000000	0.938152	0.000000
2	0.129874	1.000000	0.976209
3	0.786385	0.000000	1.000000
4	0.781340	0.898191	0.721407
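
With fixed numbers the formula is easy to check by hand. The sketch below uses made-up values (data and flat are assumed names) and also shows an edge case: a constant column has a zero denominator and comes out as NaN.

```python
import pandas as pd

# Hypothetical data on two different scales.
data = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 30.0, 20.0]})

# Column-wise min-max scaling: (X - X_min) / (X_max - X_min).
scaled = (data - data.min()) / (data.max() - data.min())
print(scaled)

# A constant column gives 0 / 0, which pandas reports as NaN.
flat = pd.DataFrame({"z": [5.0, 5.0]})
flat_scaled = (flat - flat.min()) / (flat.max() - flat.min())
print(flat_scaled)
```

After scaling, each non-constant column has minimum 0 and maximum 1.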

Plotting with Pandas

Plotting the data is the most direct way to understand what they contain.

97. Visualizing a Series

%matplotlib inline
ts = pd.Series(np.random.randn(100), index=pd.date_range('today', periods=100))
ts = ts.cumsum()
ts.plot()

out:

98. DataFrame line plot

df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()

out:

99. DataFrame scatter plot

df = pd.DataFrame({"xs": [1, 5, 2, 8, 1], "ys": [4, 2, 1, 9, 6]})
df = df.cumsum()
df.plot.scatter("xs", "ys", color='red', marker="*")

out:

100. DataFrame bar chart

df = pd.DataFrame({"revenue": [57, 68, 63, 71, 72, 90, 80, 62, 59, 51, 47, 52],
                   "advertising": [2.1, 1.9, 2.7, 3.0, 3.6, 3.2, 2.7, 2.4, 1.8, 1.6, 1.3, 1.9],
                   "month": range(12)
                   })

ax = df.plot.bar("month", "revenue", color="yellow")
df.plot("month", "advertising", secondary_y=True, ax=ax)

out:
