25 Frequently Used Pandas Tricks

Adapted and translated from: https://github.com/justmarkham/pandas-videos

Load the example datasets

import pandas as pd
import numpy as np

drinks = pd.read_csv('http://bit.ly/drinksbycountry')
movies = pd.read_csv('http://bit.ly/imdbratings')
orders = pd.read_csv('http://bit.ly/chiporders', sep='\t')
orders['item_price'] = orders.item_price.str.replace('$', '', regex=False).astype('float')
stocks = pd.read_csv('http://bit.ly/smallstocks', parse_dates=['Date'])
titanic = pd.read_csv('http://bit.ly/kaggletrain')
ufo = pd.read_csv('http://bit.ly/uforeports', parse_dates=['Time'])

1. Show installed versions

Sometimes you need to know which pandas version you are using, especially when reading the pandas documentation. You can display it with:

pd.__version__

'1.2.4'

If you also want to know the versions of the modules pandas depends on, use the show_versions() function:

pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 2cb96529396d93b46abab7bbc73a208e708c642e
python           : 3.8.8.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.22000
machine          : AMD64
processor        : AMD64 Family 25 Model 80 Stepping 0, AuthenticAMD
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : Chinese (Simplified)_China.936

pandas           : 1.2.4
numpy            : 1.18.0
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 22.1.2
setuptools       : 52.0.0.post20210125
Cython           : 0.29.23
pytest           : 6.2.3
hypothesis       : None
sphinx           : 4.0.1
blosc            : None
feather          : None
xlsxwriter       : 1.3.8
lxml.etree       : 4.6.3
html5lib         : 1.1
pymysql          : 1.0.2
psycopg2         : 2.9.1 (dt dec pq3 ext lo64)
jinja2           : 2.11.3
IPython          : 7.22.0
pandas_datareader: 0.10.0
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : 0.9.0
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.4
numexpr          : 2.7.3
odfpy            : None
openpyxl         : 3.0.7
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.7.0
sqlalchemy       : 1.4.7
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : 1.3.0
numba            : 0.53.1

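2. Create an example DataFrame

Suppose you need an example DataFrame. There are many ways to create one, but my favorite is to pass a dictionary to the DataFrame constructor, where the dictionary keys are the column names and the values are lists of column values.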

df = pd.DataFrame({'col one':[100, 200], 'col two':[300, 400]})
df

	col one	col two
0	100	300
1	200	400

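If you need a larger DataFrame, the method above requires too much typing. In that case, you can use NumPy's random.rand() function, tell it the number of rows and columns, and pass the result to the DataFrame constructor: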

pd.DataFrame(np.random.rand(4, 8))

	0	1	2	3	4	5	6	7
0	0.434350	0.664889	0.003442	0.500086	0.053749	0.831000	0.199008	0.194081
1	0.708474	0.363857	0.949917	0.664410	0.285345	0.957187	0.851665	0.347094
2	0.107086	0.497177	0.488709	0.283645	0.155678	0.815601	0.558401	0.695038
3	0.039673	0.836976	0.878320	0.462584	0.742012	0.483997	0.578045	0.568551

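That works well, but if you also want non-numeric column names, you can convert a string of letters into a list and pass that list to the columns parameter: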

pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))

	a	b	c	d	e	f	g	h
0	0.106455	0.072711	0.492421	0.810857	0.986341	0.251466	0.557781	0.299379
1	0.589126	0.851388	0.362811	0.729866	0.524497	0.464101	0.873737	0.098877
2	0.623276	0.835985	0.750665	0.599064	0.230829	0.688544	0.313951	0.878711
3	0.379598	0.665771	0.949013	0.460847	0.004878	0.617837	0.773584	0.560171

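As you might guess, the string must contain exactly as many characters as there are columns.

3. Rename columns

Let's look at the example DataFrame we created a moment ago: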

df

	col one	col two
0	100	300
1	200	400

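I prefer dot notation for selecting pandas columns, but it does not work for column names that contain spaces. Let's fix that.

The most flexible way to rename columns is the rename() method. You pass it a dictionary whose keys are the old names and whose values are the new names, and you also specify the axis: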

df = df.rename({'col one':'col_one','col two':'col_two'},axis='columns')

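The advantage of this method is that it can rename any number of columns, whether one or all of them.

If you are renaming all of the columns at once, a simpler way is to overwrite the DataFrame's columns attribute: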

df.columns = ['col_one', 'col_two']

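If all you need is to replace spaces with underscores, an even better way is the str.replace() method, because you don't have to type out all of the column names: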

df.columns = df.columns.str.replace(' ', '_')

All three approaches give the same result, column names without spaces:

df

	col_one	col_two
0	100	300
1	200	400

Finally, if you need to add a prefix or suffix to every column name, you can use the add_prefix() method:

df.add_prefix('X_')

	X_col_one	X_col_two
0	100	300
1	200	400

Or the add_suffix() method:

df.add_suffix('_Y')

	col_one_Y	col_two_Y
0	100	300
1	200	400

4. Reverse row order

Let's look at the drinks DataFrame:

drinks.head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

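This dataset describes average alcohol consumption by country. What if you wanted to reverse the order of the rows?

The most straightforward way is to use the loc accessor and pass it ::-1, the same slicing notation used to reverse a Python list: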

drinks.loc[::-1].head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
192	Zimbabwe	64	18	4	4.7	Africa
191	Zambia	32	19	4	2.5	Africa
190	Yemen	6	0	0	0.1	Asia
189	Vietnam	111	2	1	2.0	Asia
188	Venezuela	333	100	3	7.7	South America

What if you also wanted to reset the index so that it starts at zero?

Use the reset_index() method and pass drop=True to discard the old index entirely:

drinks.loc[::-1].reset_index(drop=True).head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Zimbabwe	64	18	4	4.7	Africa
1	Zambia	32	19	4	2.5	Africa
2	Yemen	6	0	0	0.1	Asia
3	Vietnam	111	2	1	2.0	Asia
4	Venezuela	333	100	3	7.7	South America

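As you can see, the rows are in reverse order and the index has been reset to the default integer index.

5. Reverse column order

As in the previous trick, you can use loc to reverse the columns from left to right: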
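
As a small self-contained check of the same idea, the sketch below uses a hypothetical three-row frame (not the drinks data) and the positional iloc accessor, which reverses rows the same way regardless of what the index labels are:

```python
import pandas as pd

# Hypothetical toy data standing in for the drinks dataset
toy = pd.DataFrame({'country': ['A', 'B', 'C'], 'beer': [10, 20, 30]})

# iloc[::-1] reverses by position; reset_index(drop=True) discards the old labels
reversed_toy = toy.iloc[::-1].reset_index(drop=True)
print(reversed_toy)
```

The original frame is left unchanged; both calls return new objects.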

drinks.loc[:, ::-1].head()

	continent	total_litres_of_pure_alcohol	wine_servings	spirit_servings	beer_servings	country
0	Asia	0.0	0	0	0	Afghanistan
1	Europe	4.9	54	132	89	Albania
2	Africa	0.7	14	0	25	Algeria
3	Europe	12.4	312	138	245	Andorra
4	Africa	5.9	45	57	217	Angola

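The colon before the comma means "select all rows", and the ::-1 after the comma means "reverse the columns", which is why country is now the rightmost column.

6. Select columns by data type

Here are the data types of the drinks DataFrame: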

drinks.dtypes

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

Suppose you only need the numeric columns. You can use the select_dtypes() method:

drinks.select_dtypes(include='number').head()

	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol
0	0	0	0	0.0
1	89	132	54	4.9
2	25	0	14	0.7
3	245	138	312	12.4
4	217	57	45	5.9

This includes both the int and float columns.

You can also use this method to select only the object columns:

drinks.select_dtypes(include='object').head()

	country	continent
0	Afghanistan	Asia
1	Albania	Europe
2	Algeria	Africa
3	Andorra	Europe
4	Angola	Africa

You can select several data types at once by passing a list:

drinks.select_dtypes(include=['number', 'object', 'category', 'datetime']).head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa

You can also use it to exclude specific data types:

drinks.select_dtypes(exclude='number').head()
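
Since no output is shown for the exclude call, here is a minimal sketch on a hypothetical two-row frame showing that include='number' and exclude='number' split the columns into two complementary groups:

```python
import pandas as pd

# Hypothetical miniature of the drinks data
toy = pd.DataFrame({
    'country': ['Albania', 'Andorra'],
    'beer_servings': [89, 245],
    'total_litres': [4.9, 12.4],
})

numeric_cols = toy.select_dtypes(include='number').columns.tolist()
other_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(numeric_cols, other_cols)
```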

7. Convert strings to numbers

Let's create another example DataFrame:

df = pd.DataFrame({'col_one':['1.1', '2.2', '3.3'],
                   'col_two':['4.4', '5.5', '6.6'],
                   'col_three':['7.7', '8.8', '-']})
df

	col_one	col_two	col_three
0	1.1	4.4	7.7
1	2.2	5.5	8.8
2	3.3	6.6	-

These numbers are actually stored as strings, which results in object columns:

df.dtypes

col_one      object
col_two      object
col_three    object
dtype: object

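To do mathematical operations on these columns, we need to convert them to numeric types. For the first two columns, you can use the astype() method: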

df.astype({'col_one':'float', 'col_two':'float'}).dtypes

col_one      float64
col_two      float64
col_three     object
dtype: object

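However, this would raise an error if you tried it on the third column, because that column contains a dash (used here to represent zero) and pandas does not know how to handle it.

Instead, use the to_numeric() function on the third column and tell it to convert any invalid input to NaN: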

pd.to_numeric(df.col_three, errors='coerce')

0    7.7
1    8.8
2    NaN
Name: col_three, dtype: float64

If you know that the NaN values actually represent zeros, you can use fillna() to replace them with 0:

pd.to_numeric(df.col_three, errors='coerce').fillna(0)

0    7.7
1    8.8
2    0.0
Name: col_three, dtype: float64

Finally, you can apply this function to the entire DataFrame in one step with apply():

df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
df

	col_one	col_two	col_three
0	1.1	4.4	7.7
1	2.2	5.5	8.8
2	3.3	6.6	0.0

That single line of code achieves our goal, because every column has now been converted to float:

df.dtypes

col_one      float64
col_two      float64
col_three    float64
dtype: object
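
Because errors='coerce' followed by fillna(0) silently turns every unparseable value into 0, it can be worth counting how many values were coerced before filling. The following is only a sketch of that check; the variable names are illustrative:

```python
import pandas as pd

raw = pd.Series(['7.7', '8.8', '-'], name='col_three')
converted = pd.to_numeric(raw, errors='coerce')

# Values that were present in the input but became NaN during coercion
n_coerced = int((converted.isna() & raw.notna()).sum())

filled = converted.fillna(0)
print(n_coerced, filled.tolist())
```

If n_coerced is larger than expected, the column probably contains invalid entries other than the placeholder dash.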

8. Reduce DataFrame size

pandas DataFrames are designed to fit in memory, so sometimes you need to reduce a DataFrame's size for it to work well on your system.

Here is the size of the drinks DataFrame:

drinks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     193 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 30.5 KB

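You can see that it uses 30.5 KB.

If you are having performance problems with your DataFrame, or you cannot even read it into memory, there are two steps you can take while reading the file to reduce its size.

The first step is to read in only the columns you actually need, which you specify with the usecols parameter: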

cols = ['beer_servings', 'continent']
small_drinks = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols)
small_drinks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   beer_servings  193 non-null    int64 
 1   continent      193 non-null    object
dtypes: int64(1), object(1)
memory usage: 13.7 KB

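The second step is to convert any object columns that actually hold categorical data to the category data type, which you specify with the dtype parameter: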

dtypes = {'continent':'category'}
smaller_drinks = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols, dtype=dtypes)
smaller_drinks.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   beer_servings  193 non-null    int64   
 1   continent      193 non-null    category
dtypes: category(1), int64(1)
memory usage: 2.4 KB

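By reading the continent column as the category data type, we have further reduced the DataFrame size to 2.4 KB.

Keep in mind that the category data type only reduces memory usage when the number of distinct categories is small relative to the number of rows.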
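
To make the saving concrete, this sketch compares the deep memory usage of the same values stored as object and as category. The data is hypothetical: six distinct strings repeated over 6000 rows, so the number of categories is tiny relative to the number of rows.

```python
import pandas as pd

# Hypothetical column: 6 distinct values repeated across 6000 rows
continents = ['Asia', 'Europe', 'Africa', 'Oceania', 'North America', 'South America']
as_object = pd.Series(continents * 1000, dtype='object')
as_category = as_object.astype('category')

# deep=True counts the Python string objects, not only the pointers to them
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The category version stores each distinct string once plus a small integer code per row, which is where the reduction comes from.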

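9. Build a DataFrame from multiple files (row-wise)

Suppose your dataset is spread across multiple files, but you want to read it into a single DataFrame.

For example, I have a small dataset of stock data in which each CSV file contains a single day (the output shown is from the last file):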

pd.read_csv('data/stocks1.csv')
pd.read_csv('data/stocks2.csv')
pd.read_csv('data/stocks3.csv')

	Date	Close	Volume	Symbol
0	2016-10-05	57.64	16726400	MSFT
1	2016-10-05	31.59	11808600	CSCO
2	2016-10-05	113.05	21453100	AAPL

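You could read each CSV file into its own DataFrame, combine them, and then delete the original DataFrames, but that would use extra memory and require a lot of code.

A better solution is to use the built-in glob module. You pass the glob() function a pattern, including wildcard characters, and it returns a list of all files that match that pattern. In this case, glob looks in the data directory for all CSV files whose names start with "stocks":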

from glob import glob
stock_files = sorted(glob('data/stocks*.csv'))
stock_files

['data\\stocks.csv',
 'data\\stocks1.csv',
 'data\\stocks2.csv',
 'data\\stocks3.csv']

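glob returns file names in arbitrary order, which is why we sort the list with Python's built-in sorted() function.

We then use a generator expression to read each file with read_csv() and pass the results to concat(), which combines the individual DataFrames row-wise: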

pd.concat((pd.read_csv(file) for file in stock_files))

	Date	Close	Volume	Symbol
0	2016-10-03	31.50	14070500	CSCO
1	2016-10-03	112.52	21701800	AAPL
2	2016-10-03	57.42	19189500	MSFT
3	2016-10-04	113.00	29736800	AAPL
4	2016-10-04	57.24	20085900	MSFT
5	2016-10-04	31.35	18460400	CSCO
6	2016-10-05	57.64	16726400	MSFT
7	2016-10-05	31.59	11808600	CSCO
8	2016-10-05	113.05	21453100	AAPL
0	2016-10-03	31.50	14070500	CSCO
1	2016-10-03	112.52	21701800	AAPL
2	2016-10-03	57.42	19189500	MSFT
0	2016-10-04	113.00	29736800	AAPL
1	2016-10-04	57.24	20085900	MSFT
2	2016-10-04	31.35	18460400	CSCO
0	2016-10-05	57.64	16726400	MSFT
1	2016-10-05	31.59	11808600	CSCO
2	2016-10-05	113.05	21453100	AAPL

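Unfortunately, the index now contains duplicate values. (In this run the pattern also matched the combined file stocks.csv, which is why every row appears twice.) To avoid duplicate index values, we tell concat() to ignore the index and use the default integer index instead: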

pd.concat((pd.read_csv(file) for file in stock_files), ignore_index=True)

	Date	Close	Volume	Symbol
0	2016-10-03	31.50	14070500	CSCO
1	2016-10-03	112.52	21701800	AAPL
2	2016-10-03	57.42	19189500	MSFT
3	2016-10-04	113.00	29736800	AAPL
4	2016-10-04	57.24	20085900	MSFT
5	2016-10-04	31.35	18460400	CSCO
6	2016-10-05	57.64	16726400	MSFT
7	2016-10-05	31.59	11808600	CSCO
8	2016-10-05	113.05	21453100	AAPL
9	2016-10-03	31.50	14070500	CSCO
10	2016-10-03	112.52	21701800	AAPL
11	2016-10-03	57.42	19189500	MSFT
12	2016-10-04	113.00	29736800	AAPL
13	2016-10-04	57.24	20085900	MSFT
14	2016-10-04	31.35	18460400	CSCO
15	2016-10-05	57.64	16726400	MSFT
16	2016-10-05	31.59	11808600	CSCO
17	2016-10-05	113.05	21453100	AAPL
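
The file listing above shows that 'data/stocks*.csv' also matches a combined stocks.csv. One possible way to avoid that, sketched here with hypothetical files written to a temporary directory, is a stricter pattern such as 'stocks[0-9].csv', which requires exactly one digit after "stocks":

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    # Hypothetical files: three one-row daily files plus a combined stocks.csv
    for day in (1, 2, 3):
        daily = pd.DataFrame({'Date': [f'2016-10-0{day + 2}'], 'Close': [float(day)]})
        daily.to_csv(data_dir / f'stocks{day}.csv', index=False)
    pd.DataFrame({'Date': ['all'], 'Close': [0.0]}).to_csv(data_dir / 'stocks.csv', index=False)

    # '[0-9]' requires one digit after 'stocks', so stocks.csv is not matched
    stock_files = sorted(glob(str(data_dir / 'stocks[0-9].csv')))
    combined = pd.concat((pd.read_csv(f) for f in stock_files), ignore_index=True)

print(len(stock_files), combined.index.tolist())
```

The file names and values here are made up for illustration; only the pattern and the concat() call carry over to real data.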

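10. Build a DataFrame from multiple files (column-wise)

The previous trick is useful when each file contains rows of your dataset. But what if each file contains columns instead?

Here is an example in which the drinks dataset has been split into two CSV files, each containing three columns: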

pd.read_csv('data/drinks1.csv').head()

	country	beer_servings	spirit_servings
0	Afghanistan	0	0
1	Albania	89	132
2	Algeria	25	0
3	Andorra	245	138
4	Angola	217	57

pd.read_csv('data/drinks2.csv').head()

	wine_servings	total_litres_of_pure_alcohol	continent
0	0	0.0	Asia
1	54	4.9	Europe
2	14	0.7	Africa
3	312	12.4	Europe
4	45	5.9	Africa

As in the previous trick, we start by using glob(). This time, we tell concat() to combine along the columns axis:

drink_files = sorted(glob('data/drinks*.csv'))
pd.concat((pd.read_csv(file) for file in drink_files), axis='columns').head()

	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent	country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol	continent
0	Afghanistan	0	0	0	0.0	Asia	Afghanistan	0	0	0	0.0	Asia
1	Albania	89	132	54	4.9	Europe	Albania	89	132	54	4.9	Europe
2	Algeria	25	0	14	0.7	Africa	Algeria	25	0	14	0.7	Africa
3	Andorra	245	138	312	12.4	Europe	Andorra	245	138	312	12.4	Europe
4	Angola	217	57	45	5.9	Africa	Angola	217	57	45	5.9	Africa

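The columns from the separate files now sit side by side in a single DataFrame. In the output above each of the six columns appears twice, which suggests that the pattern also matched a complete drinks.csv in the same directory.

11. Create a DataFrame from the clipboard

Suppose you have some data stored in an Excel spreadsheet or a Google Sheet, and you want to get it into a DataFrame as quickly as possible.

Select the data and copy it to the clipboard. Then use the read_clipboard() function to read it into a DataFrame: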

df = pd.read_clipboard()
df

	year	month	day	land_use
0	2018	3	6	62
1	2018	3	6	62
2	2018	3	6	130
3	2018	3	6	121
4	2018	3	6	72
5	2018	3	6	72
6	2018	3	6	130
7	2018	3	6	72
8	2018	3	6	72

Just like read_csv(), read_clipboard() automatically detects the correct data type for each column:

df.dtypes

year        int64
month       int64
day         int64
land_use    int64
dtype: object

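Keep in mind that read_clipboard() is not recommended if you want your work to be reproducible in the future.

12. Split a DataFrame into two random subsets

Suppose you want to split a DataFrame into two parts, randomly assigning 75% of the rows to one DataFrame and the remaining 25% to another.

For example, our movie ratings DataFrame has 979 rows: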

len(movies)

We can use the sample() method to randomly select 75% of the rows and assign them to the movies_1 DataFrame:

movies_1 = movies.sample(frac=0.75, random_state=1234)

Then we use the drop() method to drop all of the rows that are in movies_1 and assign the remaining rows to the movies_2 DataFrame:

movies_2 = movies.drop(movies_1.index)

You can see that the total number of rows is correct:

len(movies_1) + len(movies_2)

You can also see from the index that every movie is in either movies_1:

movies_1.index.sort_values()

Int64Index([  0,   2,   5,   6,   7,   8,   9,  11,  13,  16,
            ...
            966, 967, 969, 971, 972, 974, 975, 976, 977, 978],
           dtype='int64', length=734)

or movies_2:

movies_2.index.sort_values()

Int64Index([  1,   3,   4,  10,  12,  14,  15,  18,  26,  30,
            ...
            931, 934, 937, 941, 950, 954, 960, 968, 970, 973],
           dtype='int64', length=245)

Keep in mind that this approach does not work if the index values are not unique.

13. Filter a DataFrame by multiple categories

Let's take a look at the movies DataFrame:
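
The reason is that drop() removes every row carrying a given label, so a duplicated label would remove more rows than were sampled. One possible workaround, sketched here on a hypothetical frame with a repeated label, is to give every row a unique label with reset_index(drop=True) before splitting:

```python
import pandas as pd

# Hypothetical frame with a duplicated index label (0 appears twice)
toy = pd.DataFrame({'x': range(8)}, index=[0, 0, 1, 2, 3, 4, 5, 6])

# Give every row a unique label first, then split as before
unique = toy.reset_index(drop=True)
part_1 = unique.sample(frac=0.75, random_state=1234)
part_2 = unique.drop(part_1.index)
print(len(part_1), len(part_2))
```

This discards the original labels; if they matter, they would need to be kept as a regular column instead (reset_index() without drop=True).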

movies.head()

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
4	8.9	Pulp Fiction	R	Crime	154	[u'John Travolta', u'Uma Thurman', u'Samuel L....

One of the columns is genre:

movies.genre.unique()

array(['Crime', 'Action', 'Drama', 'Western', 'Adventure', 'Biography',
       'Comedy', 'Animation', 'Mystery', 'Horror', 'Film-Noir', 'Sci-Fi',
       'History', 'Thriller', 'Family', 'Fantasy'], dtype=object)

Suppose we want to filter the DataFrame to show only movies whose genre is Action, Drama, or Western. We could use multiple conditions separated by the "or" operator (|):

movies[(movies.genre == 'Action') |
       (movies.genre == 'Drama') |
       (movies.genre == 'Western')].head()

	star_rating	title	content_rating	genre	duration	actors_list
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
5	8.9	12 Angry Men	NOT RATED	Drama	96	[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
6	8.9	The Good, the Bad and the Ugly	NOT RATED	Western	161	[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ...
9	8.9	Fight Club	R	Drama	139	[u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
11	8.8	Inception	PG-13	Action	148	[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...

However, you can write this more clearly with the isin() method, passing it a list of genres:

movies[movies.genre.isin(['Action', 'Drama', 'Western'])].head()

	star_rating	title	content_rating	genre	duration	actors_list
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
5	8.9	12 Angry Men	NOT RATED	Drama	96	[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
6	8.9	The Good, the Bad and the Ugly	NOT RATED	Western	161	[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ...
9	8.9	Fight Club	R	Drama	139	[u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
11	8.8	Inception	PG-13	Action	148	[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...

If you want to reverse the filter, so that those three genres are excluded instead, put a tilde in front of the condition:

movies[~movies.genre.isin(['Action', 'Drama', 'Western'])].head()

	star_rating	title	content_rating	genre	duration	actors_list
0	9.3	The Shawshank Redemption	R	Crime	142	[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...
1	9.2	The Godfather	R	Crime	175	[u'Marlon Brando', u'Al Pacino', u'James Caan']
2	9.1	The Godfather: Part II	R	Crime	200	[u'Al Pacino', u'Robert De Niro', u'Robert Duv...
4	8.9	Pulp Fiction	R	Crime	154	[u'John Travolta', u'Uma Thurman', u'Samuel L....
7	8.9	The Lord of the Rings: The Return of the King	PG-13	Adventure	201	[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK...

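This works because pandas applies the tilde (~), Python's bitwise inversion operator, as an element-wise logical "not" on a boolean Series.

14. Filter a DataFrame by the largest categories

Suppose you want to filter the movies DataFrame by genre, but keep only the 3 most frequent genres.

We call value_counts() on genre and save the resulting Series as counts: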
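
A minimal sketch of the two filters on a hypothetical four-element Series shows that the mask and its inverse partition the data:

```python
import pandas as pd

# Hypothetical genre values
genres = pd.Series(['Crime', 'Action', 'Drama', 'Comedy'])

mask = genres.isin(['Action', 'Drama'])   # boolean Series, True where the genre matches
kept = genres[mask].tolist()
dropped = genres[~mask].tolist()          # ~ flips every True/False in the mask
print(kept, dropped)
```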

counts = movies.genre.value_counts()
counts

Drama        278
Comedy       156
Action       136
Crime        124
Biography     77
Adventure     75
Animation     62
Horror        29
Mystery       16
Western        9
Thriller       5
Sci-Fi         5
Film-Noir      3
Family         2
Fantasy        1
History        1
Name: genre, dtype: int64

The Series method nlargest() makes it easy to select the 3 largest values in this Series:

counts.nlargest(3)

Drama     278
Comedy    156
Action    136
Name: genre, dtype: int64

Finally, we pass the index of that result to isin(), which treats it as a list of genres:

movies[movies.genre.isin(counts.nlargest(3).index)].head()

	star_rating	title	content_rating	genre	duration	actors_list
3	9.0	The Dark Knight	PG-13	Action	152	[u'Christian Bale', u'Heath Ledger', u'Aaron E...
5	8.9	12 Angry Men	NOT RATED	Drama	96	[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals...
9	8.9	Fight Club	R	Drama	139	[u'Brad Pitt', u'Edward Norton', u'Helena Bonh...
11	8.8	Inception	PG-13	Action	148	[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'...
12	8.8	Star Wars: Episode V - The Empire Strikes Back	PG	Action	124	[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi...

As a result, only Drama, Comedy, and Action movies remain in the DataFrame.

15. Handle missing values

Let's look at a dataset of UFO sightings:

ufo.head()

	City	Colors Reported	Shape Reported	State	Time
0	Ithaca	NaN	TRIANGLE	NY	1930-06-01 22:00:00
1	Willingboro	NaN	OTHER	NJ	1930-06-30 20:00:00
2	Holyoke	NaN	OVAL	CO	1931-02-15 14:00:00
3	Abilene	NaN	DISK	KS	1931-06-01 13:00:00
4	New York Worlds Fair	NaN	LIGHT	NY	1933-04-18 19:00:00

You'll notice that some of the values are missing.
To find out how many values are missing in each column, use the isna() method followed by sum():

ufo.isna().sum()

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64

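isna() produces a DataFrame of True and False values, and sum() treats every True as 1 and every False as 0 and adds them up.

Similarly, you can find the fraction of values that are missing in each column by taking the mean() of isna():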

ufo.isna().mean()

City               0.001371
Colors Reported    0.842004
Shape Reported     0.144948
State              0.000000
Time               0.000000
dtype: float64

If you want to drop the columns that contain any missing values, use the dropna() method:

ufo.dropna(axis='columns').head()

	State	Time
0	NY	1930-06-01 22:00:00
1	NJ	1930-06-30 20:00:00
2	CO	1931-02-15 14:00:00
3	KS	1931-06-01 13:00:00
4	NY	1933-04-18 19:00:00

Or, if you want to drop the columns in which more than 10% of the values are missing, set a threshold for dropna():

ufo.dropna(thresh=len(ufo)*0.9, axis='columns').head()

	City	State	Time
0	Ithaca	NY	1930-06-01 22:00:00
1	Willingboro	NJ	1930-06-30 20:00:00
2	Holyoke	CO	1931-02-15 14:00:00
3	Abilene	KS	1931-06-01 13:00:00
4	New York Worlds Fair	NY	1933-04-18 19:00:00

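len(ufo) returns the total number of rows; we multiply it by 0.9 to tell pandas to keep only the columns in which at least 90% of the values are not missing.

16. Split a string into multiple columns

Let's create another example DataFrame: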
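
The same rule can be written by hand as a boolean mask over the per-column missing fraction. The sketch below uses a hypothetical ten-row frame and checks that, for this data, both formulations keep the same columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one column 10% missing, one 50% missing, one complete
toy = pd.DataFrame({
    'mostly_full': [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan],
    'half_empty': [1, np.nan] * 5,
    'full': range(10),
})

# Keep columns whose missing fraction is at most 10%
keep = toy.columns[toy.isna().mean() <= 0.1]
by_mask = toy[keep]

# thresh is the minimum number of non-missing values a column needs to survive
by_thresh = toy.dropna(thresh=int(len(toy) * 0.9), axis='columns')
print(by_mask.columns.tolist(), by_thresh.columns.tolist())
```

Casting the threshold to int is a choice made for this sketch; the original passes the float product directly.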

df = pd.DataFrame({'name':['John Arthur Doe', 'Jane Ann Smith'],
                   'location':['Los Angeles, CA', 'Washington, DC']})
df

	name	location
0	John Arthur Doe	Los Angeles, CA
1	Jane Ann Smith	Washington, DC

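What if we wanted to split the "name" column into three separate columns for first, middle, and last name? We use the str.split() method, tell it to split on a space, and expand the results into a DataFrame: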

df.name.str.split(' ', expand=True)

	0	1	2
0	John	Arthur	Doe
1	Jane	Ann	Smith

These three columns can be saved back to the original DataFrame in a single assignment:

df[['first', 'middle', 'last']] = df.name.str.split(' ', expand=True)
df

	name	location	first	middle	last
0	John Arthur Doe	Los Angeles, CA	John	Arthur	Doe
1	Jane Ann Smith	Washington, DC	Jane	Ann	Smith

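What if we wanted to split a string but keep only one of the resulting columns? For example, let's split the location column on ", " (comma followed by a space):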

df.location.str.split(', ', expand=True)

	0	1
0	Los Angeles	CA
1	Washington	DC

If we only want to keep column 0 as the city name, we select that column and save it to the DataFrame:

df['city'] = df.location.str.split(', ', expand=True)[0]
df

	name	location	first	middle	last	city
0	John Arthur Doe	Los Angeles, CA	John	Arthur	Doe	Los Angeles
1	Jane Ann Smith	Washington, DC	Jane	Ann	Smith	Washington

17. Expand a Series of lists into a DataFrame

Let's create another example DataFrame:

df = pd.DataFrame({'col_one':['a', 'b', 'c'], 'col_two':[[10, 40], [20, 50], [30, 60]]})
df

	col_one	col_two
0	a	[10, 40]
1	b	[20, 50]
2	c	[30, 60]

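There are two columns, and the second column contains regular Python lists of integers.

To expand the second column into its own DataFrame, we can call apply() on that column and pass it the Series constructor: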

df_new = df.col_two.apply(pd.Series)
df_new

	0	1
0	10	40
1	20	50
2	30	60

Using concat(), we can combine the original DataFrame with the new DataFrame:

pd.concat([df, df_new], axis='columns')

	col_one	col_two	0	1
0	a	[10, 40]	10	40
1	b	[20, 50]	20	50
2	c	[30, 60]	30	60
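
An alternative to apply(pd.Series), offered here only as a sketch and not as the source's method, is to build the new frame directly from the column's list of lists. This also makes it easy to name the new columns; the names 'first' and 'second' are made up for illustration:

```python
import pandas as pd

df_lists = pd.DataFrame({'col_one': ['a', 'b', 'c'],
                         'col_two': [[10, 40], [20, 50], [30, 60]]})

# Build the expanded frame from the list of lists, reusing the original index
expanded = pd.DataFrame(df_lists['col_two'].tolist(),
                        index=df_lists.index,
                        columns=['first', 'second'])

result = pd.concat([df_lists, expanded], axis='columns')
print(result)
```

Reusing df_lists.index keeps the rows aligned when the original index is not the default 0, 1, 2.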

18. Aggregate by multiple functions

Let's look at a DataFrame of orders from the Chipotle restaurant chain:

orders.head(10)

	order_id	quantity	item_name	choice_description	item_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	2.39
1	1	1	Izze	[Clementine]	3.39
2	1	1	Nantucket Nectar	[Apple]	3.39
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	2.39
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98
5	3	1	Chicken Bowl	[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...	10.98
6	3	1	Side of Chips	NaN	1.69
7	4	1	Steak Burrito	[Tomatillo Red Chili Salsa, [Fajita Vegetables...	11.75
8	4	1	Steak Soft Tacos	[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...	9.25
9	5	1	Steak Burrito	[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...	9.25

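Each order has an order_id and consists of one or more rows. To find the total price of an order, you sum the item_price for that order_id. For example, here is the total price of order number 1: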

orders[orders.order_id == 1].item_price.sum()

11.56

To calculate the total price of every order, you would groupby() order_id and then take the sum of item_price for each group:

orders.groupby('order_id').item_price.sum().head()

order_id
1    11.56
2    16.98
3    12.67
4    21.00
5    13.70
Name: item_price, dtype: float64

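However, you are not limited to aggregating by a single function such as sum(). To aggregate by multiple functions, use the agg() method and pass it a list of functions such as sum and count: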

orders.groupby('order_id').item_price.agg(['sum', 'count']).head()

	sum	count
order_id
1	11.56	4
2	16.98	1
3	12.67	2
4	21.00	2
5	13.70	2

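This gives us the total price of each order as well as the number of rows (line items) in each order.

19. Combine the output of an aggregation with a DataFrame

Let's take another look at the orders DataFrame: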

orders.head(10)

	order_id	quantity	item_name	choice_description	item_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	2.39
1	1	1	Izze	[Clementine]	3.39
2	1	1	Nantucket Nectar	[Apple]	3.39
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	2.39
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98
5	3	1	Chicken Bowl	[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...	10.98
6	3	1	Side of Chips	NaN	1.69
7	4	1	Steak Burrito	[Tomatillo Red Chili Salsa, [Fajita Vegetables...	11.75
8	4	1	Steak Soft Tacos	[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...	9.25
9	5	1	Steak Burrito	[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...	9.25

What if we wanted to add a new column showing the total price of each order? Recall that we calculated the total price with sum():

orders.groupby('order_id').item_price.sum().head()

order_id
1    11.56
2    16.98
3    12.67
4    21.00
5    13.70
Name: item_price, dtype: float64

sum() is an aggregation function, which means that it returns a reduced version of the input data.

In other words, the output of sum():

len(orders.groupby('order_id').item_price.sum())

is smaller than the input to the function:

len(orders.item_price)

The solution is the transform() method, which performs the same calculation but returns output with the same shape as the input:

total_price = orders.groupby('order_id').item_price.transform('sum')
len(total_price)

We store the result in a new column of the DataFrame:

orders['total_price'] = total_price
orders.head(10)

	order_id	quantity	item_name	choice_description	item_price	total_price
0	1	1	Chips and Fresh Tomato Salsa	NaN	2.39	11.56
1	1	1	Izze	[Clementine]	3.39	11.56
2	1	1	Nantucket Nectar	[Apple]	3.39	11.56
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	2.39	11.56
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98	16.98
5	3	1	Chicken Bowl	[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...	10.98	12.67
6	3	1	Side of Chips	NaN	1.69	12.67
7	4	1	Steak Burrito	[Tomatillo Red Chili Salsa, [Fajita Vegetables...	11.75	21.00
8	4	1	Steak Soft Tacos	[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...	9.25	21.00
9	5	1	Steak Burrito	[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...	9.25	13.70

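As you can see, the total price of each order now appears on every row.

This makes it easy to calculate what fraction of the order's total price each line accounts for: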

orders['percent_of_total'] = orders.item_price / orders.total_price
orders.head(10)

	order_id	quantity	item_name	choice_description	item_price	total_price	percent_of_total
0	1	1	Chips and Fresh Tomato Salsa	NaN	2.39	11.56	0.206747
1	1	1	Izze	[Clementine]	3.39	11.56	0.293253
2	1	1	Nantucket Nectar	[Apple]	3.39	11.56	0.293253
3	1	1	Chips and Tomatillo-Green Chili Salsa	NaN	2.39	11.56	0.206747
4	2	2	Chicken Bowl	[Tomatillo-Red Chili Salsa (Hot), [Black Beans...	16.98	16.98	1.000000
5	3	1	Chicken Bowl	[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...	10.98	12.67	0.866614
6	3	1	Side of Chips	NaN	1.69	12.67	0.133386
7	4	1	Steak Burrito	[Tomatillo Red Chili Salsa, [Fajita Vegetables...	11.75	21.00	0.559524
8	4	1	Steak Soft Tacos	[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...	9.25	21.00	0.440476
9	5	1	Steak Burrito	[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...	9.25	13.70	0.675182
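
Since the lengths discussed above are not shown in the output, here is a minimal sketch on a hypothetical three-row orders frame that contrasts the shape of sum() with the shape of transform('sum'):

```python
import pandas as pd

# Hypothetical orders: order 1 has two lines, order 2 has one
toy_orders = pd.DataFrame({'order_id': [1, 1, 2], 'item_price': [2.0, 3.0, 5.0]})

per_order = toy_orders.groupby('order_id')['item_price'].sum()            # one value per order
per_row = toy_orders.groupby('order_id')['item_price'].transform('sum')   # one value per row
print(len(per_order), len(per_row))
```

Because per_row has the same index as toy_orders, it can be assigned directly as a new column.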

20. Select a slice of rows and columns

Let's look at another dataset:

titanic.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

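This is the famous Titanic dataset, which records information about the passengers on the Titanic and whether or not they survived.

If you want a numerical summary of the dataset, use the describe() method: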

titanic.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

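However, the resulting DataFrame might show more information than you need.

If you want to filter it to show only the "five-number summary", you can use the loc accessor and pass it a slice from the "min" row label to the "max" row label: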

titanic.describe().loc['min':'max']

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
min	1.0	0.0	1.0	0.420	0.0	0.0	0.0000
25%	223.5	0.0	2.0	20.125	0.0	0.0	7.9104
50%	446.0	0.0	3.0	28.000	0.0	0.0	14.4542
75%	668.5	1.0	3.0	38.000	1.0	0.0	31.0000
max	891.0	1.0	3.0	80.000	8.0	6.0	512.3292

And if you are not interested in all of the columns, you can also pass a slice of column labels:

titanic.describe().loc['min':'max', 'Pclass':'Parch']

	Pclass	Age	SibSp	Parch
min	1.0	0.420	0.0	0.0
25%	2.0	20.125	0.0	0.0
50%	3.0	28.000	0.0	0.0
75%	3.0	38.000	1.0	0.0
max	3.0	80.000	8.0	6.0

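21. Reshape a MultiIndexed Series

The Survived column of the Titanic dataset consists of ones and zeros, so you can calculate the overall survival rate by taking the mean of that column: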

titanic.Survived.mean()

0.3838383838383838

To calculate the survival rate by a single category such as "Sex", use groupby():

titanic.groupby('Sex').Survived.mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

To calculate the survival rate across two categorical variables at once, pass both of them to groupby():

titanic.groupby(['Sex', 'Pclass']).Survived.mean()

Sex     Pclass
female  1         0.968085
        2         0.921053
        3         0.500000
male    1         0.368852
        2         0.157407
        3         0.135447
Name: Survived, dtype: float64

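This shows the survival rate for every combination of Sex and Passenger Class. It is stored as a MultiIndexed Series, meaning that it has multiple index levels to the left of the actual data.

That can make the data hard to read and interact with, so it is often more convenient to reshape a MultiIndexed Series into a DataFrame with the unstack() method: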

titanic.groupby(['Sex', 'Pclass']).Survived.mean().unstack()

Pclass	1	2	3
Sex
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

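This DataFrame contains exactly the same data as the MultiIndexed Series, except that you can now work with it using familiar DataFrame methods.

22. Create a pivot table

If you often create DataFrames like the one above, you may find the pivot_table() method more convenient: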
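
A small sketch on hypothetical data (four made-up passengers, not the Titanic file) shows how a value is addressed before and after unstack(): a tuple of labels on the Series versus a row label and a column label on the DataFrame.

```python
import pandas as pd

# Hypothetical miniature of the Titanic columns used above
toy = pd.DataFrame({'Sex': ['female', 'female', 'male', 'male'],
                    'Pclass': [1, 2, 1, 2],
                    'Survived': [1, 0, 0, 1]})

rates = toy.groupby(['Sex', 'Pclass'])['Survived'].mean()   # MultiIndexed Series
wide = rates.unstack()                                       # rows: Sex, columns: Pclass

print(rates.loc[('male', 2)], wide.loc['male', 2])
```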

titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='mean')

Pclass	1	2	3
Sex
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447

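With a pivot table, you directly specify the index, the columns, the values, and the aggregation function.

Another benefit of a pivot table is that setting margins=True adds an "All" row and an "All" column that aggregate across every row and column: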

titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='mean',
                    margins=True)

Pclass	1	2	3	All
Sex
female	0.968085	0.921053	0.500000	0.742038
male	0.368852	0.157407	0.135447	0.188908
All	0.629630	0.472826	0.242363	0.383838

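This shows the overall survival rate as well as the survival rate by Sex and by Passenger Class.

Finally, you can create a cross-tabulation simply by changing the aggregation function from "mean" to "count":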

titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='count',
                    margins=True)

Pclass	1	2	3	All
Sex
female	94	76	144	314
male	122	108	347	577
All	216	184	491	891

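This shows the number of records that fall in each combination of categories.

23. Convert continuous data into categorical data

Let's look at the Age column of the Titanic dataset: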
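
pandas also provides a dedicated crosstab() function for this kind of count table. As a sketch on hypothetical data (five made-up passengers), the following produces the same style of table, including the "All" margins:

```python
import pandas as pd

# Hypothetical passengers
toy = pd.DataFrame({'Sex': ['female', 'female', 'male', 'male', 'male'],
                    'Pclass': [1, 3, 1, 3, 3]})

# Counts of each Sex/Pclass combination, with row and column totals
table = pd.crosstab(toy['Sex'], toy['Pclass'], margins=True)
print(table)
```

Unlike the pivot_table() call, crosstab() does not need a values column when it is only counting rows.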

titanic.Age.head(10)

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

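It is currently continuous data, but what if you wanted to convert it into categorical data?

One solution is to label the age ranges, for example "child", "young adult", and "adult". The best way to do this is the cut() function: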

pd.cut(titanic.Age, bins=[0, 18, 25, 99], labels=['child', 'young adult', 'adult']).head(10)

0    young adult
1          adult
2          adult
3          adult
4          adult
5            NaN
6          adult
7          child
8          adult
9          child
Name: Age, dtype: category
Categories (3, object): ['child' < 'young adult' < 'adult']

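This assigns each value to a bin with a label. Ages above 0 and up to 18 are labeled "child", ages above 18 and up to 25 are labeled "young adult", and ages above 25 and up to 99 are labeled "adult".

Notice that the data type is now category, and that the categories are automatically ordered.

24. Change display options

Let's take another look at the Titanic dataset: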
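
By default cut() treats each bin as open on the left and closed on the right, so a value exactly on an edge goes to the lower bin. The sketch below uses hypothetical ages chosen to sit on and around the edges:

```python
import pandas as pd

# Hypothetical ages on and around the bin edges, plus one missing value
ages = pd.Series([0.42, 18, 18.5, 25, 26, None])

labels = pd.cut(ages, bins=[0, 18, 25, 99],
                labels=['child', 'young adult', 'adult'])
print(labels.tolist())
```

Here 18 is labeled "child" and 25 is labeled "young adult"; the missing age stays missing.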

titanic.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

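Notice that the Age column is shown with 1 decimal place and the Fare column with 4 decimal places. What if you wanted to standardize the display to 2 decimal places?

You can use the set_option() function: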

pd.set_option('display.float_format', '{:.2f}'.format)

titanic.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.00	1	A/5 21171	7.25	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.00	1	PC 17599	71.28	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.00	0	STON/O2. 3101282	7.92	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.00	1	113803	53.10	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.00	0	373450	8.05	NaN	S

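The first argument to set_option() is the name of the option, and the second is the formatter, here the format method of a Python format string. You can see that Age and Fare now use 2 decimal places. Note that this does not change the underlying data, only how it is displayed.

You can also reset any option to its default value: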

pd.reset_option('display.float_format')

Other options are used in the same way.
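
If the change should apply only to one block of code, pandas also offers option_context(), which restores the previous value automatically when the block ends. This is a sketch on a hypothetical two-row frame, using to_string() so the effect can be inspected as text:

```python
import pandas as pd

# Hypothetical fares
df_demo = pd.DataFrame({'Fare': [7.25, 71.2833]})

# The option is changed only inside the with block
with pd.option_context('display.float_format', '{:.2f}'.format):
    inside = df_demo.to_string()

outside = df_demo.to_string()
print(inside)
print(outside)
```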

25. Style a DataFrame

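The previous trick is useful if you want to change the display throughout an entire Jupyter notebook. A more flexible and powerful approach, however, is to define the style of a particular DataFrame.

Let's return to the stocks DataFrame: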

stocks

	Date	Close	Volume	Symbol
0	2016-10-03	31.50	14070500	CSCO
1	2016-10-03	112.52	21701800	AAPL
2	2016-10-03	57.42	19189500	MSFT
3	2016-10-04	113.00	29736800	AAPL
4	2016-10-04	57.24	20085900	MSFT
5	2016-10-04	31.35	18460400	CSCO
6	2016-10-05	57.64	16726400	MSFT
7	2016-10-05	31.59	11808600	CSCO
8	2016-10-05	113.05	21453100	AAPL

We can create a dictionary of format strings that specifies how each column should be formatted, and then pass it to the DataFrame's style.format() method:

format_dict = {'Date':'{:%m/%d/%y}', 'Close':'${:.2f}', 'Volume':'{:,}'}
stocks.style.format(format_dict)

	Date	Close	Volume	Symbol
0	10/03/16	$31.50	14,070,500	CSCO
1	10/03/16	$112.52	21,701,800	AAPL
2	10/03/16	$57.42	19,189,500	MSFT
3	10/04/16	$113.00	29,736,800	AAPL
4	10/04/16	$57.24	20,085,900	MSFT
5	10/04/16	$31.35	18,460,400	CSCO
6	10/05/16	$57.64	16,726,400	MSFT
7	10/05/16	$31.59	11,808,600	CSCO
8	10/05/16	$113.05	21,453,100	AAPL

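Notice that the Date column is now in month/day/year format, the Close column has a dollar sign, and the Volume column has commas.

We can apply more styling by chaining additional methods: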

(stocks.style.format(format_dict)
 .hide_index()
 .highlight_min('Close', color='red')
 .highlight_max('Close', color='lightgreen')
)

Date	Close	Volume	Symbol
10/03/16	$31.50	14,070,500	CSCO
10/03/16	$112.52	21,701,800	AAPL
10/03/16	$57.42	19,189,500	MSFT
10/04/16	$113.00	29,736,800	AAPL
10/04/16	$57.24	20,085,900	MSFT
10/04/16	$31.35	18,460,400	CSCO
10/05/16	$57.64	16,726,400	MSFT
10/05/16	$31.59	11,808,600	CSCO
10/05/16	$113.05	21,453,100	AAPL

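We have now hidden the index, highlighted the minimum Close value in red, and highlighted the maximum Close value in light green.

Here is another example of DataFrame styling: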

(stocks.style.format(format_dict)
 .hide_index()
 .background_gradient(subset='Volume', cmap='Blues')
)

Date	Close	Volume	Symbol
10/03/16	$31.50	14,070,500	CSCO
10/03/16	$112.52	21,701,800	AAPL
10/03/16	$57.42	19,189,500	MSFT
10/04/16	$113.00	29,736,800	AAPL
10/04/16	$57.24	20,085,900	MSFT
10/04/16	$31.35	18,460,400	CSCO
10/05/16	$57.64	16,726,400	MSFT
10/05/16	$31.59	11,808,600	CSCO
10/05/16	$113.05	21,453,100	AAPL

The Volume column now has a background gradient that helps you easily spot large and small values.

And one final example:

(stocks.style.format(format_dict)
 .hide_index()
 .bar('Volume', color='lightblue', align='zero')
 .set_caption('Stock Prices from October 2016')
)

Stock Prices from October 2016
Date	Close	Volume	Symbol
10/03/16	$31.50	14,070,500	CSCO
10/03/16	$112.52	21,701,800	AAPL
10/03/16	$57.42	19,189,500	MSFT
10/04/16	$113.00	29,736,800	AAPL
10/04/16	$57.24	20,085,900	MSFT
10/04/16	$31.35	18,460,400	CSCO
10/05/16	$57.64	16,726,400	MSFT
10/05/16	$31.59	11,808,600	CSCO
10/05/16	$113.05	21,453,100	AAPL

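There is now a bar chart within the Volume column and a caption above the DataFrame. Note that there are many more options you can use to style a DataFrame.

Bonus: Profile a DataFrame

Suppose you have a new dataset and want to explore it quickly without too much work. You can use the pandas-profiling module. Install it on your system, then call the ProfileReport() function and pass it any DataFrame. It returns an interactive HTML report: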

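The first section is an overview of the dataset and a list of possible issues with the data;
The second section gives a summary of each column. You can click "toggle details" for more information;
The third section shows a heatmap of the correlation between columns;
The fourth section reports on missing values;
The fifth section shows the first few rows of the dataset.

Example usage: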

import pandas_profiling
pandas_profiling.ProfileReport(titanic)