此文主要记录的是实验楼的百题闯关100题，特此记录，并加入输出和部分内容的补充。

Pandas 百题大冲关

Pandas 百题大冲关分为基础篇和进阶篇，每部分各有 50 道练习题。基础部分的练习题在于熟悉 Pandas 常用方法的使用，而进阶部分则侧重于 Pandas 方法的组合应用。

基础部分

基础

1. 导入Pandas:

import pandas as pd

2. 查看Pandas版本信息：

print(pd.__version__)

Pandas 的数据结构：Pandas 主要有Series（一维数组），DataFrame（二维数组），Panel（三维数组），Panel4D（四维数组），PanelND（更多维数组）等数据结构。其中 Series 和 DataFrame 应用的最为广泛。

Series 是一维带标签的数组，它可以包含任何数据类型。包括整数，字符串，浮点数，Python 对象等。Series 可以通过标签来定位。
DataFrame 是二维的带标签的数据结构。我们可以通过标签来定位数据。这是 NumPy 所没有的。

创建 Series 数据类型

Pandas 中，Series 可以被看作由 1 列数据组成的数据集。

创建 Series 语法：s = pd.Series(data, index=index)，可以通过多种方式进行创建，以下介绍了 3 个常用方法。

3. 从列表创建Series:

arr = [0, 1, 2, 3, 4]
s1 = pd.Series(arr)  # 如果不指定索引，则默认从 0 开始
s1

out:

0    0
1    1
2    2
3    3
4    4
dtype: int64

提示：前面的 0, 1, 2, 3, 4 为当前 Series 的索引，后面的 0, 1, 2, 3, 4 为 Series 的值。

4. 从Ndarray 创建Series:

import numpy as np

n = np.random.randn(5)  # 创建一个随机 Ndarray 数组

index = ['a', 'b', 'c', 'd', 'e']
s2 = pd.Series(n, index=index) # 指定索引为a,b,c,d,e
s2

out:

a   -0.419618
b   -0.709257
c    0.288306
d   -0.203162
e    1.754528
dtype: float64

5. 从字典创建Series:

d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}  # 定义示例字典
s3 = pd.Series(d)
s3

out:

a    1
b    2
c    3
d    4
e    5
dtype: int64

Series基本操作

6. 修改Series索引

print(s1)  # 以 s1 为例
s1.index = ['A', 'B', 'C', 'D', 'E']  # 修改后的索引
s1

0    0
1    1
2    2
3    3
4    4
dtype: int64
A    0
B    1
C    2
D    3
E    4
dtype: int64

7. Series纵向拼接

s4 = s3.append(s1)  # 将 s1 拼接到 s3
s4

out:

a    1
b    2
c    3
d    4
e    5
A    0
B    1
C    2
D    3
E    4
dtype: int64

8. Series按指定索引删除元素

print(s4)
s4 = s4.drop('e')  # 删除索引为 e 的值
s4

out:

a    1
b    2
c    3
d    4
e    5
A    0
B    1
C    2
D    3
E    4
dtype: int64
a    1
b    2
c    3
d    4
A    0
B    1
C    2
D    3
E    4
dtype: int64

9. Series修改指定索引的元素

s4['A'] = 6  # 修改索引为 A 的值 = 6
s4

out:

a    1
b    2
c    3
d    4
A    6
B    1
C    2
D    3
E    4
dtype: int64

10. Series 按指定索引查找元素

s4['B']

out:

11. Series 切片操作

例如对s4的前3个数据访问

s4[:3]

out:

a    1
b    2
c    3
dtype: int64

Series

12. Series 加法运算

Series 的加法运算是按照索引计算，如果索引不同则填充为 NaN（空值）。

s4.add(s3)

out:

A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    2.0
b    4.0
c    6.0
d    8.0
e    NaN
dtype: float64

13. Series 减法运算

Series的减法运算是按照索引对应计算，如果不同则填充为 NaN（空值）。

s4.sub(s3)

out:

A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    0.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64

14. Series 乘法运算

Series 的乘法运算是按照索引对应计算，如果索引不同则填充为 NaN（空值）。

s4.mul(s3)

out:

A     NaN
B     NaN
C     NaN
D     NaN
E     NaN
a     1.0
b     4.0
c     9.0
d    16.0
e     NaN
dtype: float64

15. Series 除法运算

Series 的除法运算是按照索引对应计算，如果索引不同则填充为 NaN（空值）。

s4.div(s3)

out:

A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    1.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64

16. Series 求中位数

s4.median()

out:

3.0

17. Series 求和

s4.sum()

out:

18. Series 求最大值

s4.max()

out:

19. Series 求最小值

s4.min()

out:

创建 DataFrame 数据类型

与 Sereis 不同，DataFrame 可以存在多列数据。一般情况下，DataFrame 也更加常用。

20. 通过NumPy数组创建DataFrame

dates = pd.date_range('today', periods=6)  # 定义时间序列作为 index
num_arr = np.random.randn(6, 4)  # 传入 numpy 随机数组
columns = ['A', 'B', 'C', 'D']  # 将列表作为列名
df1 = pd.DataFrame(num_arr, index=dates, columns=columns)
df1

out:

2020-06-27 08:37:36.648563	0.136455	1.372062	-1.896225	1.005024
2020-06-28 08:37:36.648563	-1.208118	-0.806961	1.382154	1.417238
2020-06-29 08:37:36.648563	-1.800235	-0.272469	-0.966839	-0.188984
2020-06-30 08:37:36.648563	0.081191	-0.468042	0.551959	-0.441269
2020-07-01 08:37:36.648563	0.301758	-0.147157	0.632281	-0.622362
2020-07-02 08:37:36.648563	-0.762515	-1.773605	-0.990699	0.493300

21. 通过字典数组创建DataFrame

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df2 = pd.DataFrame(data, index=labels)
df2

out:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	2.0	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

当然如果希望每个键的值为一行，将DataFrame转置即可：

df2.T

out:

	     a	 b	c	d	e	f	g	h	i	j
animal	cat	cat	snake	dog	dog	cat	snake	cat	dog	dog
age	2.5	3	0.5	NaN	5	2	4.5	NaN	7	3
visits	1	3	2	3	2	3	1	1	2	1
priority	yes	yes	no	yes	no	no	no	yes	no	no

DataFrame 基本操作

23. 预览DataFrame的前五行数据

此方法对快速了解陌生数据集结构十分有用。

df2.head()  # 默认为显示 5 行，可根据需要在括号中填入希望预览的行数

out:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no

24. 查看 DataFrame 的后3行数据

df2.tail(3)   # 参数3可以修改为自己需要行数

out:

    animal	age	visits	priority
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

25. 查看DataFrame 的索引

df2.index

out:

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')

26. 查看DataFrame 的列名

df2.columns

out:

Index(['animal', 'age', 'visits', 'priority'], dtype='object')

27. 查看 DataFrame 的数值

df2.values   # 得到数组

out:

array([['cat', 2.5, 1, 'yes'],
       ['cat', 3.0, 3, 'yes'],
       ['snake', 0.5, 2, 'no'],
       ['dog', nan, 3, 'yes'],
       ['dog', 5.0, 2, 'no'],
       ['cat', 2.0, 3, 'no'],
       ['snake', 4.5, 1, 'no'],
       ['cat', nan, 1, 'yes'],
       ['dog', 7.0, 2, 'no'],
       ['dog', 3.0, 1, 'no']], dtype=object)

28. 查看 DataFrame 的统计数据

df2.describe()

out:

	    age	     visits
count 8.000000  10.000000
mean  3.437500  1.900000
std	2.007797	0.875595
min	0.500000	1.000000
25%	2.375000	1.000000
50%	3.000000	2.000000
75%	4.625000	2.750000
max	7.000000	3.000000

29. DataFrame 转置

df2.T

out:

	    a	b	c	d	e	f	g	h	i	j
animal	cat	cat	snake	dog	dog	cat	snake	cat	dog	dog
age 	2.5	3	0.5	NaN	5	2	4.5	NaN	7	3
visits	1	3	2	3	2	3	1	1	2	1
priority	yes	yes	no	yes	no	no	no	yes	no	no

30. 对 DataFrame 进行按列排序

df2.sort_values(by='age', ascending=False)  # 按age排列, ascending=False表示降序，为True为升序，默认为True

out:

	animal	age	visits	priority
i	dog	7.0	2	no
e	dog	5.0	2	no
g	snake	4.5	1	no
b	cat	3.0	3	yes
j	dog	3.0	1	no
a	cat	2.5	1	yes
f	cat	2.0	3	no
c	snake	0.5	2	no
d	dog	NaN	3	yes
h	cat	NaN	1	yes

31. 对DataFrame 数据切片

df2[1:3]  # 按行取，第1行到第2行，不包括第3行

out:

	animal	age	visits	priority
b	cat	    3.0	  3  	yes
c	snake	0.5	  2	    no

32. 对 DataFrame 通过标签查询（单列）

df2['age']

out:

a    2.5
b    3.0
c    0.5
d    NaN
e    5.0
f    2.0
g    4.5
h    NaN
i    7.0
j    3.0
Name: age, dtype: float64

等价于 df2.age

df2.age  # 等价于 df2['age']

out:

a    2.5
b    3.0
c    0.5
d    NaN
e    5.0
f    2.0
g    4.5
h    NaN
i    7.0
j    3.0
Name: age, dtype: float64

33. DataFrame 通过标签查询（多列）

df2[['age', 'animal']]  # 传入一个列名组成的[列表]

out:

    age	animal
a	2.5	cat
b	3.0	cat
c	0.5	snake
d	NaN	dog
e	5.0	dog
f	2.0	cat
g	4.5	snake
h	NaN	cat
i	7.0	dog
j	3.0	dog

34. 对 DataFrame 通过位置查找

.iloc[] 里面的第一参数为行，第二个参数为列

方法一：

df2.iloc[1:3]   # 查询 2，3 行

out:

	animal	age	visits	priority
b	cat 	3.0	 3	yes
c	snake	0.5	 2	no

方法二：

df2.iloc[1:3, 1]  # 查询 2,3 行的第1列

out:

b    3.0
c    0.5
Name: age, dtype: float64

方法三:

df2.iloc[[0,2,3], [0,2]]  # 查询 0,2,3 行的第0,1列

out:

	animal	visits
a	cat  	1
c	snake	2
d	dog 	3

35. DataFrame 副本拷贝

# 生成 DataFrame 副本，方便数据集被多个不同流程使用
df3 = df2.copy()
df3

out:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	2.0	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

36. 判断 DataFrame 元素是否为空

df3.isnull()  # 如果为空则返回为 True

out:

    animal	age	visits	priority
a	False	False	False	False
b	False	False	False	False
c	False	False	False	False
d	False	True	False	False
e	False	False	False	False
f	False	False	False	False
g	False	False	False	False
h	False	True	False	False
i	False	False	False	False
j	False	False	False	False

判断指定列是否为空：

df3['age'].isnull()  # 如果age为空则返回为 True

out:

a    False
b    False
c    False
d     True
e    False
f    False
g    False
h     True
i    False
j    False
Name: age, dtype: bool

37. 添加列数据

1）添加一个列 series数据：

num = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], index=df3.index)
df3['No.'] = num  # 添加以 'No.' 为列名的新数据列
df3

out:

   animal age	visits	priority	No.
a	cat	2.5	1	yes	0
b	cat	3.0	3	yes	1
c	snake	0.5	2	no	2
d	dog	NaN	3	yes	3
e	dog	5.0	2	no	4
f	cat	2.0	3	no	5
g	snake	4.5	1	no	6
h	cat	NaN	1	yes	7
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9

2）直接添加一个list数据：

num = list(range(1, 11)) # [1, 2, 3,....9, 10]
df3['No.'] = num  # 添加以 'No.' 为列名的新数据列
df3

out:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	cat	3.0	3	yes	2
c	snake	0.5	2	no	3
d	dog	NaN	3	yes	4
e	dog	5.0	2	no	5
f	cat	2.0	3	no	6
g	snake	4.5	1	no	7
h	cat	NaN	1	yes	8
i	dog	7.0	2	no	9
j	dog	3.0	1	no	10

38. 根据 DataFrame 的下标值进行修改

1)通过 iat修改：

# 修改第 2 行与第 2 列对应的值 3.0 → 2.0
df3.iat[1, 1] = 2  # 索引序号从 0 开始，这里为 1, 1
df3

out:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	cat	2.0	3	yes	2
c	snake	0.5	2	no	3
d	dog	NaN	3	yes	4
e	dog	5.0	2	no	5
f	cat	2.0	3	no	6
g	snake	4.5	1	no	7
h	cat	NaN	1	yes	8
i	dog	7.0	2	no	9
j	dog	3.0	1	no	10

2）通过iloc修改：

# 修改第 2 行与第 2 列对应的值 2.0 → 3.0
df3.iloc[1, 1] = 3.0 # 索引序号从 0 开始，这里为 1, 1
df3

out:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	cat	3.0	3	yes	2
c	snake	0.5	2	no	3
d	dog	NaN	3	yes	4
e	dog	5.0	2	no	5
f	cat	2.0	3	no	6
g	snake	4.5	1	no	7
h	cat	NaN	1	yes	8
i	dog	7.0	2	no	9
j	dog	3.0	1	no	10

39. 根据 DataFrame 的标签对数据进行修改

df3.loc['f', 'age'] = 1.5
df3

out:

animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	cat	3.0	3	yes	2
c	snake	0.5	2	no	3
d	dog	NaN	3	yes	4
e	dog	5.0	2	no	5
f	cat	1.5	3	no	6
g	snake	4.5	1	no	7
h	cat	NaN	1	yes	8
i	dog	7.0	2	no	9
j	dog	3.0	1	no	10

40. DataFrame 求平均值操作

df3.mean()

out:

age       3.375
visits    1.900
No.       5.500
dtype: float64

41. 对 DataFrame 中任意列做求和操作

1)对指定列求和：

df3['visits'].sum()

out:

2）默认对所有列求和：

df3.sum()

out:

animal      catcatsnakedogdogcatsnakecatdogdog
age                                         27
visits                                      19
priority              yesyesnoyesnononoyesnono
No.                                         55
dtype: object

字符串操作

42. 将字符串转化为小写字母

string = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca',
                    np.nan, 'CABA', 'dog', 'cat'])
print(string)
string.str.lower()

out:

0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

43. 将字符串转化为大写字母

string.str.upper()

out:

0       A
1       B
2       C
3    AABA
4    BACA
5     NaN
6    CABA
7     DOG
8     CAT
dtype: object

DataFrame 缺失值操作

44. 对缺失值进行填充

df4 = df3.copy()
print(df4)
df4.fillna(value=3)  # fillna 表示填充为NaN的数据，NaN：not a number

out:

  animal  age  visits priority  No.
a    cat  2.5       1      yes    1
b    cat  3.0       3      yes    2
c  snake  0.5       2       no    3
d    dog  NaN       3      yes    4
e    dog  5.0       2       no    5
f    cat  1.5       3       no    6
g  snake  4.5       1       no    7
h    cat  NaN       1      yes    8
i    dog  7.0       2       no    9
j    dog  3.0       1       no   10
animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	cat	3.0	3	yes	2
c	snake	0.5	2	no	3
d	dog	3.0	3	yes	4
e	dog	5.0	2	no	5
f	cat	1.5	3	no	6
g	snake	4.5	1	no	7
h	cat	3.0	1	yes	8
i	dog	7.0	2	no	9
j	dog	3.0	1	no	10

45. 删除存在缺失值的行

df5 = df3.copy()
print(df5)
df5.dropna(how='any')  # 任何存在 NaN 的行都将被删除

out:

    animal  age  visits priority  No.
a    cat  2.5       1      yes    1
b    cat  3.0       3      yes    2
c  snake  0.5       2       no    3
d    dog  NaN       3      yes    4
e    dog  5.0       2       no    5
f    cat  1.5       3       no    6
g  snake  4.5       1       no    7
h    cat  NaN       1      yes    8
i    dog  7.0       2       no    9
j    dog  3.0       1       no   10
   animal	age	visits	priority	No.
a	cat	2.5	1	yes	1
b	cat	3.0	3	yes	2
c	snake	0.5	2	no	3
e	dog	5.0	2	no	5
f	cat	1.5	3	no	6
g	snake	4.5	1	no	7
i	dog	7.0	2	no	9
j	dog	3.0	1	no	10

46. DataFrame 按指定列对齐

left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]})
right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]})

print(left)
print(right)

# 按照 key 列对齐连接，只存在 foo2 相同，所以最后变成一行, 类似sql语句的连接操作
pd.merge(left, right, on='key')

out:

   key  one
0  foo1    1
1  foo2    2
   key  two
0  foo2    4
1  foo3    5
key	one	two
0	foo2	2	4

DataFrame 文件操作

47. CSV 文件写入

# df3.to_csv('animal.csv', index=False, header=False) # 表示不将index和header写入
df3.to_csv('animal.csv', index=True, header=True)  # 默认为True写入
print("写入成功.")

out:

写入成功.

48. CSV 文件读取

# df_animal = pd.read_csv('animal.csv', header=None) # 表示不指明列标签，使用0,1,..这些默认标签
df_animal = pd.read_csv('animal.csv', header=1) # Header默认为0,表示数据的第几行为列标签,数据为列名行以下的数据
df_animal

out:

	cat	3.0	3	yes	2
0	snake	0.5	2	no	3
1	dog	NaN	3	yes	4
2	dog	5.0	2	no	5
3	cat	1.5	3	no	6
4	snake	4.5	1	no	7
5	cat	NaN	1	yes	8
6	dog	7.0	2	no	9
7	dog	3.0	1	no	10

49. Excel 写入操作

# 具体函数可以查看源码
# 常用参数：
to_excel(self, excel_writer, sheet_name='Sheet1', na_rep='', float_format=None,columns=None, header=True, index=True, index_label=None,startrow=0, startcol=0, engine=None, merge_cells=True, encoding=None,
inf_rep='inf', verbose=True, freeze_panes=None)

常用参数解析

-excel_writer : string or ExcelWriter object File path or existing ExcelWriter目标路径
- sheet_name : string, default ‘Sheet1’ Name of sheet which will contain DataFrame,填充excel的第几页
- na_rep : string, default ”,Missing data representation 缺失值填充
- float_format : string, default None Format string for floating point numbers
- columns : sequence, optional，Columns to write 选择输出的的列。
- header : boolean or list of string, default True Write out column names. If a list of string is given it is assumed to be aliases for the column names
- index : boolean, default True，Write row names (index)
- index_label : string or sequence, default None， Column label for index column(s) if desired. If None is given, andheader and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.
- startrow :upper left cell row to dump data frame
- startcol :upper left cell column to dump data frame
- engine : string, default None ，write engine to use - you can also set this via the options，io.excel.xlsx.writer, io.excel.xls.writer, andio.excel.xlsm.writer.
- merge_cells : boolean, default True Write MultiIndex and Hierarchical Rows as merged cells.
- encoding: string, default None encoding of the resulting excel file. Only necessary for xlwt,other writers support unicode natively.
- inf_rep : string, default ‘inf’ Representation for infinity (there is no native representation for infinity in Excel)
- freeze_panes : tuple of integer (length 2), default None Specifies the one-based bottommost row and rightmost column that is to be frozen

df3.to_excel('animal.xlsx', sheet_name='Sheet1' )
print("写入成功.")

out:

写入成功.

50. Excel 读取操作

# 具体函数可以查看源码
# 常用参数：
read_excel(io, sheetname=0, header=0, skiprows=None, skip_footer=0, index_col=None,names=None, parse_cols=None, parse_dates=False,date_parser=None,na_values=None,thousands=None, convert_float=True, has_index_names=None, converters=None,dtype=None, true_values=None, false_values=None, engine=None, squeeze=False, **kwds)

常用参数解析：

io : string, path object ; excel 路径。
sheetname : string, int, mixed list of strings/ints, or None, default 0 返回多表使用sheetname=[0,1],若sheetname=None是返回全表注意：int/string 返回的是dataframe，而none和list返回的是dict of dataframe
header : int, list of ints, default 0 指定列名行，默认0，即取第一行，数据为列名行以下的数据若数据不含列名，则设定 header = None
skiprows : list-like,Rows to skip at the beginning，省略指定行数的数据
skip_footer : int,default 0, 省略从尾部数的int行数据
index_col: int, list of ints, default None指定列为索引列，也可以使用u”strings”
names : array-like, default None, 指定列的名字。

pd.read_excel('animal.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

out:

  Unnamed: 0	animal	age	visits	priority	No.
0	a	cat	2.5	1	yes	1
1	b	cat	3.0	3	yes	2
2	c	snake	0.5	2	no	3
3	d	dog	NaN	3	yes	4
4	e	dog	5.0	2	no	5
5	f	cat	1.5	3	no	6
6	g	snake	4.5	1	no	7
7	h	cat	NaN	1	yes	8
8	i	dog	7.0	2	no	9
9	j	dog	3.0	1	no	10

进阶部分

时间序列索引

51. 建立一个以2018年每一天为索引，值为随机数的Series

dti = pd.date_range(start='2018-01-01', end='2018-12-31', freq='D')
s = pd.Series(np.random.rand(len(dti)), index=dti)
s

out:

2018-01-01    0.747769
2018-01-02    0.936370
2018-01-03    0.698550
2018-01-04    0.169079
2018-01-05    0.408279
                ...   
2018-12-27    0.854535
2018-12-28    0.343373
2018-12-29    0.367045
2018-12-30    0.129674
2018-12-31    0.020034
Freq: D, Length: 365, dtype: float64

52. 统计`s` 中每一个周三对应值的和

# 周一从 0 开始
s[s.index.weekday == 2].sum()

out:

27.895681550217724

提取其中日期为周三的数据：

s[s.index.weekday == 2] # 周一为0，周三为2

out:

2018-01-03    0.698550
2018-01-10    0.385514
2018-01-17    0.379610
2018-01-24    0.206736
2018-01-31    0.877809
2018-02-07    0.379703
2018-02-14    0.947930
2018-02-21    0.223449
2018-02-28    0.008998
2018-03-07    0.326007
2018-03-14    0.018386
2018-03-21    0.830026
2018-03-28    0.078421
2018-04-04    0.754530
2018-04-11    0.856349
2018-04-18    0.697045
2018-04-25    0.849338
2018-05-02    0.767433
2018-05-09    0.774195
2018-05-16    0.566739
2018-05-23    0.691705
2018-05-30    0.463924
2018-06-06    0.962409
2018-06-13    0.392350
2018-06-20    0.569404
2018-06-27    0.333270
2018-07-04    0.655454
2018-07-11    0.333805
2018-07-18    0.588172
2018-07-25    0.496672
2018-08-01    0.438350
2018-08-08    0.065597
2018-08-15    0.640373
2018-08-22    0.639175
2018-08-29    0.233980
2018-09-05    0.747509
2018-09-12    0.765390
2018-09-19    0.317519
2018-09-26    0.467703
2018-10-03    0.329899
2018-10-10    0.963069
2018-10-17    0.566592
2018-10-24    0.148371
2018-10-31    0.406007
2018-11-07    0.876346
2018-11-14    0.869522
2018-11-21    0.553947
2018-11-28    0.126612
2018-12-05    0.878116
2018-12-12    0.856469
2018-12-19    0.138991
2018-12-26    0.752213
dtype: float64

53. 统计`s`中每个月值的平均值

s.resample('M').mean() # resample，重新采样

out:

2018-01-31    0.483828
2018-02-28    0.552954
2018-03-31    0.419301
2018-04-30    0.523942
2018-05-31    0.547059
2018-06-30    0.460998
2018-07-31    0.482454
2018-08-31    0.462668
2018-09-30    0.557952
2018-10-31    0.461560
2018-11-30    0.499936
2018-12-31    0.492086
Freq: M, dtype: float64

54. 将 Series 中的时间进行转换（秒转分钟）

s = pd.date_range('today', periods=100, freq='S')

ts = pd.Series(np.random.randint(0, 500, len(s)), index=s)

ts.resample('Min').sum()

out:

2020-06-30 07:52:00     6130
2020-06-30 07:53:00    16115
2020-06-30 07:54:00     3366
Freq: T, dtype: int64

55. UTC 世界时间标准

s = pd.date_range('today', periods=1, freq='D')  # 获取当前时间
ts = pd.Series(np.random.randn(len(s)), s)  # 随机数值
ts_utc = ts.tz_localize('UTC')  # 转换为 UTC 时间
ts_utc

out:

2020-06-30 08:01:12.838843+00:00   -0.361407
Freq: D, dtype: float64

56. 转换为上海所在时区

ts_utc.tz_convert('Asia/Shanghai')

out:

2020-06-30 16:01:12.838843+08:00   -0.361407
Freq: D, dtype: float64

看一看你当前的时间，是不是一致？与UTC差8个小时。

57. 不同时间表示方式的转换

rng = pd.date_range('1/1/2018', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts)
ps = ts.to_period()
print(ps)
ps.to_timestamp()

out:

2018-01-31   -0.293837
2018-02-28   -0.692926
2018-03-31    1.204096
2018-04-30    2.485244
2018-05-31    0.019893
Freq: M, dtype: float64
2018-01   -0.293837
2018-02   -0.692926
2018-03    1.204096
2018-04    2.485244
2018-05    0.019893
Freq: M, dtype: float64
2018-01-01   -0.293837
2018-02-01   -0.692926
2018-03-01    1.204096
2018-04-01    2.485244
2018-05-01    0.019893
Freq: MS, dtype: float64

Series 多重索引

58. 创建多重索引 Series

构建一个 letters = ['A', 'B', 'C'] 和 numbers = list(range(10))为索引，值为随机数的多重索引 Series。

letters = ['A', 'B', 'C']
numbers = list(range(10))

mi = pd.MultiIndex.from_product([letters, numbers])  # 设置多重索引
s = pd.Series(np.random.rand(30), index=mi)  # 随机数
s

out:

A  0    0.744729
   1    0.351805
   2    0.529587
   3    0.043310
   4    0.292182
   5    0.740887
   6    0.428499
   7    0.653610
   8    0.107801
   9    0.590899
B  0    0.542375
   1    0.231597
   2    0.410738
   3    0.634838
   4    0.072990
   5    0.188618
   6    0.821767
   7    0.624321
   8    0.514436
   9    0.695192
C  0    0.160558
   1    0.631878
   2    0.663879
   3    0.667969
   4    0.139756
   5    0.878765
   6    0.129184
   7    0.449790
   8    0.835275
   9    0.965602
dtype: float64

59. 多重索引 Series 查询

取值：

s['A'][0]  # 取一级索引A里面的二级索引对应的值

out:

0.7447294320037277

查询索引为 1，3，6 的值:

# 查询索引为 1，3，6 的值
s.loc[:, [1, 3, 6]] # 可以理解为查询所有二级索引为 1，3，6的数据

out:

A  1    0.351805
   3    0.043310
   6    0.428499
B  1    0.231597
   3    0.634838
   6    0.821767
C  1    0.631878
   3    0.667969
   6    0.129184
dtype: float64

60. 多重索引 Series 切片

s.loc[pd.IndexSlice[:'B', 5:]]

out:

A  5    0.740887
   6    0.428499
   7    0.653610
   8    0.107801
   9    0.590899
B  5    0.188618
   6    0.821767
   7    0.624321
   8    0.514436
   9    0.695192
dtype: float64

DataFrame 多重索引

61. 根据多重索引创建 DataFrame

创建一个以 letters = ['A', 'B'] 和 numbers = list(range(6))为索引，值为随机数据的多重索引 DataFrame。

frame = pd.DataFrame(np.arange(12).reshape(6, 2),
                     index=[list('AAABBB'), list('123123')],
                     columns=['hello', 'shiyanlou'])
frame

out:

	  hello shiyanlou
A	1	0	  1
    2	2	  3
    3	4	  5
B	1	6	  7
    2	8	  9
    3	10	  11

62. 多重索引设置列名称

frame.index.names = ['first', 'second']
frame

out:


     		 hello shiyanlou
first second		
  A	    1		0	1
        2		2	3
        3		4	5
  B	    1		6	7
        2		8	9
        3		10	11

63. DataFrame 多重索引分组求和

frame.groupby('first').sum()

out:

	  hello	shiyanlou
first		
A		6	   9
B		24	   27

64. DataFrame 行列名称转换

print(frame)
frame.stack()

out:

              hello  shiyanlou
first second                  
A     1           0          1
      2           2          3
      3           4          5
B     1           6          7
      2           8          9
      3          10         11
first  second           
A      1       hello         0
               shiyanlou     1
       2       hello         2
               shiyanlou     3
       3       hello         4
               shiyanlou     5
B      1       hello         6
               shiyanlou     7
       2       hello         8
               shiyanlou     9
       3       hello        10
               shiyanlou    11
dtype: int64

65. DataFrame 索引转换

print(frame)
frame.unstack()

out:

              hello  shiyanlou
first second                  
A     1           0          1
      2           2          3
      3           4          5
B     1           6          7
      2           8          9
      3          10         11
  
		hello	shiyanlou
second	1	2	3	1	2	3
first						
    A	0	2	4	1	3	5
    B	6	8	10	7	9	11

可见，它将二级索引变成了列索引。

66. DataFrame 条件查找

# 示例数据
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df

out:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	2.0	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

查找age大于3的全部信息

df[df['age'] > 3]

out:

	animal	age	visits	priority
e	dog	    5.0	2	no
g	snake	4.5	1	no
i	dog		7.0	2	no

67. 根据行列索引切片

行区间和列区间：

df.iloc[2:4, 1:3]

out:

    age	visits
c	0.5	2
d	NaN	3

2）行和列座标：

df.iloc[1, 2]

out:

68. DataFrame 多重条件查询

查找 age<3 且为 cat 的全部数据。

df = pd.DataFrame(data, index=labels)

df[(df['animal'] == 'cat') & (df['age'] < 3)]

out:

	animal	age		visits	priority
a	cat		2.5		1		yes
f	cat		2.0		3		no

69. DataFrame 按关键字查询

查找animal包含cat和dog的数据：

df3[df3['animal'].isin(['cat', 'dog'])]

out:

	animal	age	visits	priority
a	cat		2.5	1	yes
b	cat		3.0	3	yes
d	dog		NaN	3	yes
e	dog		5.0	2	no
f	cat		2.0	3	no
h	cat		NaN	1	yes
i	dog		7.0	2	no
j	dog		3.0	1	no

70. DataFrame 按标签及列名查询

提取第3、4、8行的animal、age这两列数据：

df.loc[df.index[[3, 4, 8]], ['animal', 'age']]

out:

	animal	age
d	dog		NaN
e	dog		5.0
i	dog		7.0

71. DataFrame 多条件排序

按照 age 降序，visits 升序排列:

df.sort_values(by=['age', 'visits'], ascending=[False, True])

out:


	animal	age	visits	priority
i	dog		7.0	2	no
e	dog		5.0	2	no
g	snake	4.5	1	no
j	dog		3.0	1	no
b	cat		3.0	3	yes
a	cat		2.5	1	yes
f	cat		2.0	3	no
c	snake	0.5	2	no
h	cat		NaN	1	yes
d	dog		NaN	3	yes

72. DataFrame 多值替换

1) 替换值

将 priority 列的 yes 值替换为 True，no 值替换为 False:

df['priority'].map({'yes': True, 'no': False})

out:

a     True
b     True
c    False
d     True
e    False
f    False
g    False
h     True
i    False
j    False
Name: priority, dtype: bool

2）保留小数

将年龄保留两位小数：

#方法一：
df['age'].map(lambda x: '%.2f' % x)
#方法二：
df['age'].map(lambda x: format(x, '.2f'))
#方法三：（不一定有效）
df['age'].round(decimals=2)

out:

a    2.50
b    3.00
c    0.50
d     nan
e    5.00
f    2.00
g    4.50
h     nan
i    7.00
j    3.00
Name: age, dtype: object

3) 转为百分数

#方法一：
df['age'].map(lambda x: '%.2f%%' % (x*100))
#方法二：
df['age'].map(lambda x: format(x, '.2%')) # 不需要乘100

73. DataFrame 分组求和

1）groupby 分组

df.groupby('animal').size()

out:

animal
cat      4
dog      4
snake    2
dtype: int64

注意上面得到的结果就是一个Series类型，所以需要进一步取值按照Series的方法取即可，例如:

print(df.groupby('animal').size().values)
print(df.groupby('animal').size().index)
print(type(df.groupby('animal').size()))

out:

[4 4 2]
Index(['cat', 'dog', 'snake'], dtype='object', name='animal')
<class 'pandas.core.series.Series'>

2) groupby 分组求和

df.groupby('animal').sum()

out:


		age	visits
animal		
cat		7.5		8
dog		15.0	8
snake	5.0		3

74. 使用列表拼接多个 DataFrame

1) 纵向拼接

temp_df1 = pd.DataFrame(np.random.randn(3, 2))  # 生成由随机数组成的 DataFrame 1
temp_df2 = pd.DataFrame(np.random.randn(3, 2))  # 生成由随机数组成的 DataFrame 2
temp_df3 = pd.DataFrame(np.random.randn(3, 2))  # 生成由随机数组成的 DataFrame 3

print(temp_df1)
print(temp_df2)
print(temp_df3)

pieces = [temp_df1, temp_df2, temp_df3]
pd.concat(pieces, axis=0) # axis默认为0， 即默认纵向拼接

out:

          0         1
0  0.437607 -0.648355
1 -0.416831 -0.405202
2  1.681175 -0.031025
          0         1
0 -0.730415 -0.806742
1 -0.914077  0.809963
2 -0.488658 -0.620225
          0         1
0 -1.210932  0.606868
1 -1.539275  1.830870
2 -0.906066  0.440358
		   0			1
0	0.437607	-0.648355
1	-0.416831	-0.405202
2	1.681175	-0.031025
0	-0.730415	-0.806742
1	-0.914077	0.809963
2	-0.488658	-0.620225
0	-1.210932	0.606868
1	-1.539275	1.830870
2	-0.906066	0.440358

2）横向拼接

temp_df1 = pd.DataFrame(np.random.randn(3, 2))  # 生成由随机数组成的 DataFrame 1
temp_df2 = pd.DataFrame(np.random.randn(3, 2))  # 生成由随机数组成的 DataFrame 2
temp_df3 = pd.DataFrame(np.random.randn(3, 2))  # 生成由随机数组成的 DataFrame 3

print(temp_df1)
print(temp_df2)
print(temp_df3)

pieces = [temp_df1, temp_df2, temp_df3]
pd.concat(pieces, axis=1)

out:

          0         1
0  1.116153  0.250272
1 -0.941279 -0.159497
2 -0.537866  1.675018
          0         1
0 -0.103160  1.228339
1 -0.149218 -0.551139
2 -0.229225 -0.156848
          0         1
0  0.971171 -0.715241
1  0.077248  0.941577
2  1.535163  0.333749
			0	1	0	1	0	1
0	1.116153	0.250272	-0.103160	1.228339	0.971171	-0.715241
1	-0.941279	-0.159497	-0.149218	-0.551139	0.077248	0.941577
2	-0.537866	1.675018	-0.229225	-0.156848	1.535163	0.333749

75. 找出 DataFrame 表中和最小的列

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
print(df)
df.sum().idxmin()  # idxmax(), idxmin() 为 Series 函数返回最大最小值的索引值

out:

          a         b         c         d         e         f         g  \
0  0.265336  0.281261  0.626660  0.455936  0.469568  0.160094  0.254143   
1  0.293103  0.429918  0.861056  0.704762  0.534546  0.997590  0.651032   
2  0.653752  0.239481  0.956774  0.983244  0.835387  0.739893  0.446470   
3  0.335220  0.832347  0.925990  0.083933  0.092788  0.144650  0.284757   
4  0.923494  0.926540  0.227792  0.872578  0.471281  0.786390  0.731639   

          h         i         j  
0  0.168989  0.034899  0.797001  
1  0.942421  0.926441  0.218743  
2  0.776017  0.662287  0.806842  
3  0.247964  0.102461  0.051523  
4  0.665478  0.116302  0.256650

76. DataFrame 中每个元素减去每一行的平均值

df = pd.DataFrame(np.random.random(size=(5, 3)))
print(df)
print('means:\n', df.mean(axis=1)) # 行均值
df.sub(df.mean(axis=1), axis=0)

out:

          0         1         2
0  0.865347  0.060866  0.750320
1  0.905023  0.779393  0.969498
2  0.800366  0.334823  0.346131
3  0.930328  0.295275  0.761584
4  0.922344  0.904810  0.062543
means:
0    0.558844
1    0.884638
2    0.493774
3    0.662396
4    0.629899
dtype: float64
			0			1		   2
0	0.306502	-0.497978	0.191476
1	0.020385	-0.105246	0.084860
2	0.306593	-0.158950	-0.147642
3	0.267932	-0.367120	0.099188
4	0.292445	0.274911	-0.567356

77. DataFrame 分组，并得到每一组中最大三个数之和

df = pd.DataFrame({'A': list('aaabbcaabcccbbc'),
                   'B': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})
print(df)
df.groupby('A')['B'].nlargest(3).sum(level=0)

out:

    A    B
0   a   12
1   a  345
2   a    3
3   b    1
4   b   45
5   c   14
6   a    4
7   a   52
8   b   54
9   c   23
10  c  235
11  c   21
12  b   57
13  b    3
14  c   87
A
a    409
b    156
c    345
Name: B, dtype: int64

透视表

当分析庞大的数据时，为了更好的发掘数据特征之间的关系，且不破坏原数据，就可以利用透视表 pivot_table 进行操作。

78. 透视表的创建

新建表将 A, B, C 列作为索引进行聚合:

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})

print(df)

pd.pivot_table(df, index=['A', 'B'])

out:

        A  B    C         D         E
0     one  A  foo -1.039258 -0.443270
1     one  B  foo  0.388518  0.524361
2     two  C  foo -0.330776 -0.878025
3   three  A  bar  0.000832  1.133901
4     one  B  bar  0.418298  1.626217
5     one  C  bar  0.459358 -1.203031
6     two  A  foo -0.658593 -1.116155
7   three  B  foo -0.331466  1.130495
8     one  C  foo -0.197646  0.726132
9     one  A  bar -0.895106 -0.461336
10    two  B  bar  0.584046  0.674531
11  three  C  bar -0.401441 -0.017452
				D			E
A	B		
one	A	-0.967182	-0.452303
    B	0.403408	1.075289
    C	0.130856	-0.238450
three	A	0.000832	1.133901
    B	-0.331466	1.130495
    C	-0.401441	-0.017452
two	A	-0.658593	-1.116155
    B	0.584046	0.674531
    C	-0.330776	-0.878025

79. 透视表按指定行进行聚合

将该 DataFrame 的 D 列聚合，按照 A,B 列为索引进行聚合，聚合的方式为默认求均值。

pd.pivot_table(df, values=['D'], index=['A', 'B'])

out:

		D
A	B	
one	A	-0.967182
    B	0.403408
    C	0.130856
three	A	0.000832
    B	-0.331466
    C	-0.401441
two	A	-0.658593
    B	0.584046
    C	-0.330776

80. 透视表聚合方式定义

上一题中 D 列聚合时，采用默认求均值的方法，若想使用更多的方式可以在 aggfunc 中实现。

            sum			len
            D	 		D
A		B		
one		A	-1.934364	2.0
    	B	0.806816	2.0
    	C	0.261711	2.0
three	A	0.000832	1.0
        B	-0.331466	1.0
        C	-0.401441	1.0
two		A	-0.658593	1.0
        B	0.584046	1.0
        C	-0.330776	1.0

81. 透视表利用额外列进行辅助分割

D 列按照 A,B 列进行聚合时，若关心 C 列对 D 列的影响，可以加入 columns 值进行分析。

pd.pivot_table(df, values=['D'], index=['A', 'B'],
               columns=['C'], aggfunc=np.sum)

out:

			D
		C	bar			foo
A		B		
one		A	-0.895106	-1.039258
        B	0.418298	0.388518
        C	0.459358	-0.197646
three	A	0.000832	NaN
        B	NaN	-0.331466
        C	-0.401441	NaN
two	A	NaN	-0.658593
        B	0.584046	NaN
        C	NaN	-0.33077

82. 透视表的缺省值处理

在透视表中由于不同的聚合方式，相应缺少的组合将为缺省值，可以加入 fill_value 对缺省值处理:

pd.pivot_table(df, values=['D'], index=['A', 'B'],
               columns=['C'], aggfunc=np.sum, fill_value=0)

out:

                D
        C		bar		foo
 A		B		
one		A	-0.895106	-1.039258
        B	0.418298	0.388518
        C	0.459358	-0.197646
three	A	0.000832	0.000000
        B	0.000000	-0.331466
        C	-0.401441	0.000000
two		A	0.000000	-0.658593
        B	0.584046	0.000000
        C	0.000000	-0.33077

绝对类型

在数据的形式上主要包括数量型和性质型，数量型表示着数据可数范围可变，而性质型表示范围已经确定不可改变，绝对型数据就是性质型数据的一种。

83. 绝对型数据定义

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": [
                  'a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")
df

out:


	id	raw_grade	grade
0	1	a	a
1	2	b	b
2	3	b	b
3	4	a	a
4	5	a	a
5	6	e	e

84. 对绝对型数据重命名

df["grade"].cat.categories = ["very good", "good", "very bad"]
df

out:

	id	raw_grade	grade
0	1	a	very good
1	2	b	good
2	3	b	good
3	4	a	very good
4	5	a	very good
5	6	e	very bad

85. 重新排列绝对型数据并补充相应的缺省值

df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])
df

out:


	id	raw_grade	grade
0	1	a	very good
1	2	b	good
2	3	b	good
3	4	a	very good
4	5	a	very good
5	6	e	very bad

86. 对绝对型数据进行排序

df.sort_values(by="grade")

out:

	id	raw_grade	grade
5	6	e	very bad
1	2	b	good
2	3	b	good
0	1	a	very good
3	4	a	very good
4	5	a	very goo

87. 对绝对型数据进行分组

df.groupby("grade").size()

out:

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

数据清洗

常常我们得到的数据是不符合我们最终处理的数据要求，包括许多缺省值以及坏的数据，需要我们对数据进行清洗。

88. 缺失值拟合

在FilghtNumber中有数值缺失，其中数值为按 10 增长，补充相应的缺省值使得数据完整，并让数据为 int 类型。

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
                               'Budapest_PaRis', 'Brussels_londOn'],
                   'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
                   'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
                               '12. Air France', '"Swiss Air"']})
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
df

out:

	From_To	FlightNumber	RecentDelays	Airline
0	LoNDon_paris	10045	[23, 47]	KLM(!)
1	MAdrid_miLAN	10055	[]	<Air France> (12)
2	londON_StockhOlm	10065	[24, 43, 87]	(British Airways. )
3	Budapest_PaRis	10075	[13]	12. Air France
4	Brussels_londOn	10085	[67, 32]	"Swiss Air"

89. 数据列拆分

其中From_to应该为两独立的两列From和To，将From_to依照_拆分为独立两列建立为一个新表。

temp = df.From_To.str.split('_', expand=True)
temp.columns = ['From', 'To']
temp

out:

	From	To
0	LoNDon	paris
1	MAdrid	miLAN
2	londON	StockhOlm
3	Budapest	PaRis
4	Brussels	londOn

90. 字符标准化

其中注意到地点的名字都不规范（如：londON应该为London）需要对数据进行标准化处理。

temp['From'] = temp['From'].str.capitalize() #capitalize() 返回一个首字母大写的字符串
temp['To'] = temp['To'].str.capitalize()
print(temp['From'])
print(temp['To'])

out:

0      London
1      Madrid
2      London
3    Budapest
4    Brussels
Name: From, dtype: object
0        Paris
1        Milan
2    Stockholm
3        Paris
4       London
Name: To, dtype: object

91. 删除坏数据加入整理好的数据

将最开始的 From_to 列删除，加入整理好的 From 和 to 列。

df = df.drop('From_To', axis=1)
df = df.join(temp)
print(df)

out:

   FlightNumber  RecentDelays              Airline      From         To
0         10045      [23, 47]               KLM(!)    London      Paris
1         10055            []    <Air France> (12)    Madrid      Milan
2         10065  [24, 43, 87]  (British Airways. )    London  Stockholm
3         10075          [13]       12. Air France  Budapest      Paris
4         10085      [67, 32]          "Swiss Air"  Brussels     London

92. 去除多余字符

如同 airline 列中许多数据有许多其他字符，会对后期的数据分析有较大影响，需要对这类数据进行修正。

print(df['Airline'])
# extract()按正则表达式进行数据提取
df['Airline'] = df['Airline'].str.extract('([a-zA-Z\s]+)', expand=False).str.strip()
df

out:

	FlightNumber	RecentDelays	Airline	From	To
0	10045	[23, 47]	KLM	London	Paris
1	10055	[]	Air France	Madrid	Milan
2	10065	[24, 43, 87]	British Airways	London	Stockholm
3	10075	[13]	Air France	Budapest	Paris
4	10085	[67, 32]	Swiss Air	Brussels	London

93. 格式规范

在 RecentDelays 中记录的方式为列表类型，由于其长度不一，这会为后期数据分析造成很大麻烦。这里将 RecentDelays 的列表拆开，取出列表中的相同位置元素作为一列，若为空值即用 NaN 代替。

delays = df['RecentDelays'].apply(pd.Series)

delays.columns = ['delay_{}'.format(n)
                  for n in range(1, len(delays.columns)+1)]

df = df.drop('RecentDelays', axis=1).join(delays)
df

out:

	FlightNumber	Airline	From	To	delay_1	delay_2	delay_3
0	10045	KLM	London	Paris	23.0	47.0	NaN
1	10055	Air France	Madrid	Milan	NaN	NaN	NaN
2	10065	British Airways	London	Stockholm	24.0	43.0	87.0
3	10075	Air France	Budapest	Paris	13.0	NaN	NaN
4	10085	Swiss Air	Brussels	London	67.0	32.0	NaN

数据预处理

94. 信息区间划分

班级一部分同学的数学成绩表，如下图所示:

df=pd.DataFrame({'name':['Alice','Bob','Candy','Dany','Ella','Frank','Grace','Jenny'],'grades':[58,83,79,65,93,45,61,88]})

out:

	name	grades
0	Alice	58
1	Bob	83
2	Candy	79
3	Dany	65
4	Ella	93
5	Frank	45
6	Grace	61
7	Jenny	88

但我们更加关心的是该同学是否及格，将该数学成绩按照是否>60来进行划分。

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Candy', 'Dany', 'Ella',
                            'Frank', 'Grace', 'Jenny'],
                   'grades': [58, 83, 79, 65, 93, 45, 61, 88]})


def choice(x):
    if x > 60:
        return 1
    else:
        return 0


df.grades = pd.Series(map(lambda x: choice(x), df.grades))
df

out:

	name	grades
0	Alice	0
1	Bob	1
2	Candy	1
3	Dany	1
4	Ella	1
5	Frank	0
6	Grace	1
7	Jenny	1

95. 数据去重

一个列为A的 DataFrame 数据，如下图所示:

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

尝试将 A 列中连续重复的数据清除.

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df.loc[df['A'].shift() != df['A']]

out:

96.数据归一化

有时候，DataFrame 中不同列之间的数据差距太大，需要对其进行归一化处理。

其中，Max-Min归一化是简单而常见的一种方式，公式如下:
$Y=\frac{X-X_{min}}{X_{max}-X_{min}}$

def normalization(df):
    numerator = df.sub(df.min())
    denominator = (df.max()).sub(df.min())
    Y = numerator.div(denominator)
    return Y


df = pd.DataFrame(np.random.random(size=(5, 3)))
print(df)
normalization(df)

out:

         0         1         2
0  0.055276  0.666860  0.206399
1  0.873721  0.924465  0.105095
2  0.161571  0.979359  0.678480
3  0.698888  0.091796  0.692453
4  0.694759  0.888997  0.528819
0	1	2
0	0.000000	0.647914	0.172474
1	1.000000	0.938152	0.000000
2	0.129874	1.000000	0.976209
3	0.786385	0.000000	1.000000
4	0.781340	0.898191	0.721407

Pandas 绘图操作

为了更好的了解数据包含的信息，最直观的方法就是将其绘制成图。

97. Series 可视化

%matplotlib inline
ts = pd.Series(np.random.randn(100), index=pd.date_range('today', periods=100))
ts = ts.cumsum()
ts.plot()

out:

98. DataFrame 折现图

df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()

out:

99. DataFrame 散点图

df = pd.DataFrame({"xs": [1, 5, 2, 8, 1], "ys": [4, 2, 1, 9, 6]})
df = df.cumsum()
df.plot.scatter("xs", "ys", color='red', marker="*")

out:

100. DataFrame 柱形图

df = pd.DataFrame({"revenue": [57, 68, 63, 71, 72, 90, 80, 62, 59, 51, 47, 52],
                   "advertising": [2.1, 1.9, 2.7, 3.0, 3.6, 3.2, 2.7, 2.4, 1.8, 1.6, 1.3, 1.9],
                   "month": range(12)
                   })

ax = df.plot.bar("month", "revenue", color="yellow")
df.plot("month", "advertising", secondary_y=True, ax=ax)

out:

pandas 百题大冲关

目录

Pandas 百题大冲关

基础部分

基础

1. 导入Pandas:

2. 查看Pandas版本信息：

创建 Series 数据类型

3. 从列表创建Series:

4. 从Ndarray 创建Series:

5. 从字典创建Series:

Series基本操作

6. 修改Series索引

7. Series纵向拼接

8. Series按指定索引删除元素

9. Series修改指定索引的元素

10. Series 按指定索引查找元素

11. Series 切片操作

Series

12. Series 加法运算

13. Series 减法运算

14. Series 乘法运算

15. Series 除法运算

16. Series 求中位数

17. Series 求和

18. Series 求最大值

19. Series 求最小值

创建 DataFrame 数据类型

20. 通过NumPy数组创建DataFrame

21. 通过字典数组创建DataFrame

DataFrame 基本操作

23. 预览DataFrame的前五行数据

24. 查看 DataFrame 的后3行数据

25. 查看DataFrame 的索引

26. 查看DataFrame 的列名

27. 查看 DataFrame 的数值

28. 查看 DataFrame 的统计数据

29. DataFrame 转置

30. 对 DataFrame 进行按列排序

31. 对DataFrame 数据切片

32. 对 DataFrame 通过标签查询（单列）

33. DataFrame 通过标签查询（多列）

34. 对 DataFrame 通过位置查找

35. DataFrame 副本拷贝

36. 判断 DataFrame 元素是否为空

37. 添加列数据

38. 根据 DataFrame 的下标值进行修改

39. 根据 DataFrame 的标签对数据进行修改

40. DataFrame 求平均值操作

41. 对 DataFrame 中任意列做求和操作

字符串操作

42. 将字符串转化为小写字母

43. 将字符串转化为大写字母

DataFrame 缺失值操作

44. 对缺失值进行填充

45. 删除存在缺失值的行

46. DataFrame 按指定列对齐

DataFrame 文件操作

47. CSV 文件写入

48. CSV 文件读取

49. Excel 写入操作

50. Excel 读取操作

进阶部分

时间序列索引

51. 建立一个以2018年每一天为索引，值为随机数的Series

52. 统计s 中每一个周三对应值的和

53. 统计s中每个月值的平均值

54. 将 Series 中的时间进行转换（秒转分钟）

55. UTC 世界时间标准

56. 转换为上海所在时区

57. 不同时间表示方式的转换

Series 多重索引

58. 创建多重索引 Series

59. 多重索引 Series 查询

60. 多重索引 Series 切片

DataFrame 多重索引

61. 根据多重索引创建 DataFrame

62. 多重索引设置列名称

63. DataFrame 多重索引分组求和

64. DataFrame 行列名称转换

52. 统计`s` 中每一个周三对应值的和

53. 统计`s`中每个月值的平均值