Preface:
Today is a special day, because the smog is massive! It's been ages since we had smog this heavy.
☁️☁️☁️☁️☁️
More importantly, I need to catch up on my pandas notes.
I've written about NumPy before; if you need that, see NumPy basics (1. Get ready!). Lately I've been reading up on pandas, so here are some notes following the official 10 minutes to pandas guide (yes, the official site leaves "pandas" lowercase even when the word stands on its own). Enough small talk, let's get to it.
Note: this post is based on Release 0.25.3.
Package overview
First, as is my habit, let's figure out what this pandas thing actually is: a "powerful Python data analysis toolkit".
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
pandas is a Python package that provides fast, flexible, and expressive data structures for working with "relational" data (think relational-database tables) and "labeled" data (each column carries a label, its column name, over the data beneath it). (The rest of the paragraph is the pandas team's ambitions and goals, skipped.)
After reading that you should roughly get it: pandas looks like it's just about playing with tables. How naive. (Well... that's exactly what it is.)
pandas is well suited for many different kinds of data:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure
pandas handles many different kinds of data:
- Tables: yep, the relational table structure you've met before.
- Ordered or unordered time series: what on earth, so abstract.
- Matrix data with row and column labels: a notch above a plain table, it has row names as well as column names; a few examples below will make this clear.
- Any other data: no labels required, it can go straight into a pandas data structure. Sounds like bragging; let's find out.
All this talk about how unbeatable pandas' data structures are. Let's look:
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
pandas mainly provides two data structures, Series (1-dimensional) and DataFrame (2-dimensional), to handle the vast majority of typical use cases across many fields. (A 3-dimensional Panel used to exist too, but it has been removed as of 0.25.) For R users (R being a language commonly used for data analysis), DataFrame provides everything R's data.frame does and more. pandas is built on top of NumPy (my NumPy notes are linked at the top of this post), which lets it integrate well with other third-party libraries.
Dimensions | Name | Description
---|---|---
1 | Series | 1D labeled homogeneously-typed array
2 | DataFrame | General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns
10 minutes to pandas
Note: most of the examples show data in table form; IPython or Jupyter will render the output more nicely.
This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in the Cookbook.
Object creation
See the Data Structure Intro section.
- Create a Series instance:
Creating a Series by passing a list of values, letting pandas create a default integer index:
NaN: short for "not a number"; it's how pandas marks missing data (similar in spirit to Python's None).
In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [4]: s
Out[4]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
In the printed result you can see a structure much like a Python dict: the keys 0-5 are the default integer row index that pandas generated for our data, and the values are the data we passed in.
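As a quick aside (my own sketch, not part of the guide): the default 0-5 keys only appear when you don't supply an index yourself; you can pass an explicit one instead. The variable names below are mine:

```python
import numpy as np
import pandas as pd

# Default index: the integers 0..n-1, generated automatically by pandas
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# An explicit index can be passed instead of the default one
s2 = pd.Series([1, 3, 5], index=['a', 'b', 'c'])
```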
- Create a DataFrame instance from a NumPy array:
Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
To build the instance df: first use date_range to generate a NumPy array of datetime elements named dates; then create the DataFrame with random data, using dates from the previous step as the row index and specifying the columns as A, B, C, D.
In [5]: dates = pd.date_range('20130101', periods=6)
In [6]: dates
Out[6]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [7]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
In [8]: df
Out[8]:
A B C D
2013-01-01 -0.254631 1.423220 0.038671 0.762526
2013-01-02 -0.978045 1.159385 0.374813 -2.082786
2013-01-03 -0.233653 -1.853380 -0.040360 0.196083
2013-01-04 0.684391 0.261879 -0.777546 -0.862060
2013-01-05 -0.776971 0.125240 -0.340414 -0.782474
2013-01-06 0.347661 1.357053 1.011172 0.725046
You can see the tabular output: A, B, C, D are the column headers, and the entries of dates serve as the row index.
- Build the instance df2 by passing a dict:
The dict's keys are A-F:
A maps to the scalar 1.0,
B maps to a pandas Timestamp,
C maps to a Series (value 1 at index positions 0-3, dtype float32),
D maps to a NumPy array (four elements of value 3, dtype int32),
E maps to a Categorical (pandas' counterpart of R's factor type),
F maps to the string 'foo'.
In [9]: df2 = pd.DataFrame({'A': 1.,
...: 'B': pd.Timestamp('20130102'),
...: 'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...: 'D': np.array([3] * 4, dtype='int32'),
...: 'E': pd.Categorical(["test", "train", "test", "train"]),
...: 'F': 'foo'})
...:
In [10]: df2
Out[10]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
- Check the dtypes of df2's columns:
In [11]: df2.dtypes
Out[11]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:
In [12]: df2.<TAB> # noqa: E225, E999
df2.A df2.bool
df2.abs df2.boxplot
df2.add df2.C
df2.add_prefix df2.clip
df2.add_suffix df2.clip_lower
df2.align df2.clip_upper
df2.all df2.columns
df2.any df2.combine
df2.append df2.combine_first
df2.apply df2.compound
df2.applymap df2.consolidate
df2.D
Every column has become an attribute of df2; A through F are all there (output truncated for space).
Viewing data
See the Basics section.
- View the first and last rows:
In [13]: df2.head()
Out[13]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
In [14]: df2.tail(3)
Out[14]:
A B C D E F
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
- View the row and column indexes:
In [15]: df2.index
Out[15]:
Int64Index([0, 1, 2, 3], dtype='int64')
In [16]: df2.columns
Out[16]: Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
Calling DataFrame.to_numpy() returns the underlying data as a NumPy array, one row per element. This can be expensive when the columns have different dtypes: a NumPy array has one dtype for the whole array while a DataFrame has one dtype per column, so pandas has to find a common NumPy dtype that can hold them all, which may end up being object.
Note: DataFrame.to_numpy() does not include the row or column index in its output.
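A tiny sketch of that dtype behaviour (frame and variable names are mine):

```python
import numpy as np
import pandas as pd

# Homogeneous float columns: to_numpy() is cheap and stays float64
df_num = pd.DataFrame(np.random.randn(3, 2), columns=['A', 'B'])
arr = df_num.to_numpy()

# Mixed dtypes: pandas must upcast everything to a common dtype (object)
df_mixed = pd.DataFrame({'A': [1.0, 2.0], 'B': ['x', 'y']})
arr_mixed = df_mixed.to_numpy()
```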
- describe() shows a quick statistical summary of the data:
df2.describe()
A C D
count 4.0 4.0 4.0
mean 1.0 1.0 3.0
std 0.0 0.0 0.0
min 1.0 1.0 3.0
25% 1.0 1.0 3.0
50% 1.0 1.0 3.0
75% 1.0 1.0 3.0
max 1.0 1.0 3.0
- Transpose the data:
df2
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
df2.T
0 ... 3
A 1 ... 1
B 2013-01-02 00:00:00 ... 2013-01-02 00:00:00
C 1 ... 1
D 3 ... 3
E test ... train
F foo ... foo
[6 rows x 4 columns]
- Sort along an axis
Note that the column headers are now ordered D, C, B, A:
df.sort_index(axis=1, ascending=False)
D C B A
2013-01-01 0.762526 0.038671 1.423220 -0.254631
2013-01-02 -2.082786 0.374813 1.159385 -0.978045
2013-01-03 0.196083 -0.040360 -1.853380 -0.233653
2013-01-04 -0.862060 -0.777546 0.261879 0.684391
2013-01-05 -0.782474 -0.340414 0.125240 -0.776971
2013-01-06 0.725046 1.011172 1.357053 0.347661
- Sort by values
Note that column B is now in ascending order:
df.sort_values(by='B')
A B C D
2013-01-03 -0.233653 -1.853380 -0.040360 0.196083
2013-01-05 -0.776971 0.125240 -0.340414 -0.782474
2013-01-04 0.684391 0.261879 -0.777546 -0.862060
2013-01-02 -0.978045 1.159385 0.374813 -2.082786
2013-01-06 0.347661 1.357053 1.011172 0.725046
2013-01-01 -0.254631 1.423220 0.038671 0.762526
Selection
Note: the optimized pandas data access methods .at, .iat, .loc and .iloc are the recommended way to select data.
While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc.
For details on indexing, see:
See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.
Getting
Selecting a single column, which yields a Series, equivalent to df.A:
Select a single column:
df.A
2013-01-01 -0.254631
2013-01-02 -0.978045
2013-01-03 -0.233653
2013-01-04 0.684391
2013-01-05 -0.776971
2013-01-06 0.347661
Freq: D, Name: A, dtype: float64
df['A']
2013-01-01 -0.254631
2013-01-02 -0.978045
2013-01-03 -0.233653
2013-01-04 0.684391
2013-01-05 -0.776971
2013-01-06 0.347661
Freq: D, Name: A, dtype: float64
- Slice rows with []:
df[0:3]
A B C D
2013-01-01 -0.254631 1.423220 0.038671 0.762526
2013-01-02 -0.978045 1.159385 0.374813 -2.082786
2013-01-03 -0.233653 -1.853380 -0.040360 0.196083
df['20130102':'20130104']
A B C D
2013-01-02 -0.978045 1.159385 0.374813 -2.082786
2013-01-03 -0.233653 -1.853380 -0.040360 0.196083
2013-01-04 0.684391 0.261879 -0.777546 -0.862060
Selecting by label
See more in Selection by Label.
dates[0]
Timestamp('2013-01-01 00:00:00', freq='D')
# get a row of data by its row label
df.loc[dates[0]]
A -0.254631
B 1.423220
C 0.038671
D 0.762526
Name: 2013-01-01 00:00:00, dtype: float64
- Get multiple columns
# all rows, columns A and B only
df.loc[:, ['A', 'B']]
A B
2013-01-01 -0.254631 1.423220
2013-01-02 -0.978045 1.159385
2013-01-03 -0.233653 -1.853380
2013-01-04 0.684391 0.261879
2013-01-05 -0.776971 0.125240
2013-01-06 0.347661 1.357053
# specify row and column labels at the same time
df.loc['20130102':'20130104', ['A', 'B']]
A B
2013-01-02 -0.978045 1.159385
2013-01-03 -0.233653 -1.853380
2013-01-04 0.684391 0.261879
- If only a single row comes back, the result is automatically reduced to a Series:
df.loc['20130102', ['A', 'B']]
A -0.978045
B 1.159385
type(df.loc['20130102', ['A', 'B']])
<class 'pandas.core.series.Series'>
- If a single cell comes back, the result is reduced to a scalar:
df.loc[dates[0], 'A']
-0.2546306719860299
type(df.loc[dates[0], 'A'])
<class 'numpy.float64'>
df.get('A').get(dates[0])
-0.2546306719860299
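The .at/.iat accessors recommended at the top of this section never get a demo above, so here is a minimal sketch of my own (rebuilding df the same way as earlier; the random values will differ):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# .at: fast scalar access by label; .iat: fast scalar access by position
by_label = df.at[dates[0], 'A']
by_position = df.iat[0, 0]
```

Both hit the same cell as df.loc[dates[0], 'A'], just faster for single scalars.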
Selecting by position
See more in Selection by Position.
df
A B C D
2013-01-01 -0.254631 1.423220 0.038671 0.762526
2013-01-02 -0.978045 1.159385 0.374813 -2.082786
2013-01-03 -0.233653 -1.853380 -0.040360 0.196083
2013-01-04 0.684391 0.261879 -0.777546 -0.862060
2013-01-05 -0.776971 0.125240 -0.340414 -0.782474
2013-01-06 0.347661 1.357053 1.011172 0.725046
Select a row by position (iloc = integer location):
df.iloc[3]
A 0.684391
B 0.261879
C -0.777546
D -0.862060
Name: 2013-01-04 00:00:00, dtype: float64
- Slice with integers, just like in Python/NumPy:
df.iloc[3:5, 0:2]
A B
2013-01-04 0.684391 0.261879
2013-01-05 -0.776971 0.125240
- Pick out specific positions with lists of integers:
df.iloc[[1, 2, 4], [0, 2]]
A C
2013-01-02 -0.978045 0.374813
2013-01-03 -0.233653 -0.040360
2013-01-05 -0.776971 -0.340414
- Slice rows:
df.iloc[1:3, :]
A B C D
2013-01-02 -0.978045 1.159385 0.374813 -2.082786
2013-01-03 -0.233653 -1.853380 -0.040360 0.196083
- Slice columns:
df.iloc[:, 1:3]
B C
2013-01-01 1.423220 0.038671
2013-01-02 1.159385 0.374813
2013-01-03 -1.853380 -0.040360
2013-01-04 0.261879 -0.777546
2013-01-05 0.125240 -0.340414
2013-01-06 1.357053 1.011172
- For getting a value explicitly:
In [37]: df.iloc[1, 1]
Out[37]: -0.17321464905330858
- For getting fast access to a scalar (equivalent to the prior method):
In [38]: df.iat[1, 1]
Out[38]: -0.17321464905330858
Boolean indexing
- Use a single column's values to select data:
df
A B C D
2013-01-01 -0.254631 1.423220 0.038671 0.762526
2013-01-02 -0.978045 1.159385 0.374813 -2.082786
2013-01-03 -0.233653 -1.853380 -0.040360 0.196083
2013-01-04 0.684391 0.261879 -0.777546 -0.862060
2013-01-05 -0.776971 0.125240 -0.340414 -0.782474
2013-01-06 0.347661 1.357053 1.011172 0.725046
- Select the rows where column A is greater than 0:
df[df.A > 0]
A B C D
2013-01-04 0.684391 0.261879 -0.777546 -0.862060
2013-01-06 0.347661 1.357053 1.011172 0.725046
- Select the values in the DataFrame that are greater than 0 (everything else becomes NaN):
df[df > 0]
A B C D
2013-01-01 NaN 1.423220 0.038671 0.762526
2013-01-02 NaN 1.159385 0.374813 NaN
2013-01-03 NaN NaN NaN 0.196083
2013-01-04 0.684391 0.261879 NaN NaN
2013-01-05 NaN 0.125240 NaN NaN
2013-01-06 0.347661 1.357053 1.011172 0.725046
- Filter with isin()
Add a column to a copy of df to filter on (copy() makes a deep copy, as covered in my earlier notes):
df3 = df.copy()
df3['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df3
A B C D E
2013-01-01 -0.254631 1.423220 0.038671 0.762526 one
2013-01-02 -0.978045 1.159385 0.374813 -2.082786 one
2013-01-03 -0.233653 -1.853380 -0.040360 0.196083 two
2013-01-04 0.684391 0.261879 -0.777546 -0.862060 three
2013-01-05 -0.776971 0.125240 -0.340414 -0.782474 four
2013-01-06 0.347661 1.357053 1.011172 0.725046 three
Filter for the rows where column E is 'two' or 'four':
df3[df3['E'].isin(['two','four'])]
A B C D E
2013-01-03 -0.233653 -1.85338 -0.040360 0.196083 two
2013-01-05 -0.776971 0.12524 -0.340414 -0.782474 four
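A related idiom worth knowing (my addition, not from the guide): the boolean mask from isin() can be inverted with ~ to keep everything not in the list. A sketch with just the E column:

```python
import pandas as pd

df3 = pd.DataFrame({'E': ['one', 'one', 'two', 'three', 'four', 'three']})

kept = df3[df3['E'].isin(['two', 'four'])]      # rows whose E is in the list
dropped = df3[~df3['E'].isin(['two', 'four'])]  # the complement, via ~
```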
Setting
- Create a new column and assign data to it:
df
A B C D
2013-01-01 -0.254631 1.423220 0.038671 0.762526
2013-01-02 -0.978045 1.159385 0.374813 -2.082786
2013-01-03 -0.233653 -1.853380 -0.040360 0.196083
2013-01-04 0.684391 0.261879 -0.777546 -0.862060
2013-01-05 -0.776971 0.125240 -0.340414 -0.782474
2013-01-06 0.347661 1.357053 1.011172 0.725046
Build a Series s1 and assign it to column F of df. Because the Series' index starts at 20130102, df's 2013-01-01 row has no matching value and therefore shows NaN:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
df['F'] = s1
df
A B C D F
2013-01-01 -0.254631 1.423220 0.038671 0.762526 NaN
2013-01-02 -0.978045 1.159385 0.374813 -2.082786 1.0
2013-01-03 -0.233653 -1.853380 -0.040360 0.196083 2.0
2013-01-04 0.684391 0.261879 -0.777546 -0.862060 3.0
2013-01-05 -0.776971 0.125240 -0.340414 -0.782474 4.0
2013-01-06 0.347661 1.357053 1.011172 0.725046 5.0
- Flip the sign of every value in df that is greater than 0:
df4 = df.copy()
df4[df4 > 0] = -df4
df4
A B C D F
2013-01-01 -0.254631 -1.423220 -0.038671 -0.762526 NaN
2013-01-02 -0.978045 -1.159385 -0.374813 -2.082786 -1.0
2013-01-03 -0.233653 -1.853380 -0.040360 -0.196083 -2.0
2013-01-04 -0.684391 -0.261879 -0.777546 -0.862060 -3.0
2013-01-05 -0.776971 -0.125240 -0.340414 -0.782474 -4.0
2013-01-06 -0.347661 -1.357053 -1.011172 -0.725046 -5.0
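Setting also works through the same accessors (.at by label, .iat by position, .loc for whole slices); the guide covers this too but I skipped it above, so a quick sketch of my own (df rebuilt as before):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

df.at[dates[0], 'A'] = 0                  # set a cell by label
df.iat[0, 1] = 0                          # set a cell by position
df.loc[:, 'D'] = np.array([5] * len(df))  # replace a whole column
```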
I've been writing forever and I'm still not halfway through. Exhausted 😢. If you're getting tired too, go grab a glass of water 🍻 (if anyone is reading this at all).
Missing data
pandas uses np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.
reindex()
Build a new df1 from df, adding a new column E. (Heads-up: from here on the numbers are pasted from the official docs' session, so they won't match the df built above.)
In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
In [56]: df1.loc[dates[0]:dates[1], 'E'] = 1
In [57]: df1
Out[57]:
A B C D F E
2013-01-01 0.000000 0.000000 -1.509059 5 NaN 1.0
2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0
2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0 NaN
2013-01-04 0.721555 -0.706771 -1.039575 5 3.0 NaN
Drop any rows that have missing data:
In [58]: df1.dropna(how='any')
Out[58]:
A B C D F E
2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0
Fill in missing data:
In [59]: df1.fillna(value=5)
Out[59]:
A B C D F E
2013-01-01 0.000000 0.000000 -1.509059 5 5.0 1.0
2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0
2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0 5.0
2013-01-04 0.721555 -0.706771 -1.039575 5 3.0 5.0
Get the boolean mask of where values are NaN:
In [60]: pd.isna(df1)
Out[60]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
Operations
See the Basic section on Binary Ops.
Stats
These operations generally exclude missing data.
In [51]: df
Out[51]:
A B C D F
2013-01-01 0.000000 0.000000 -1.509059 5 NaN
2013-01-02 1.212112 -0.173215 0.119209 5 1.0
2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0
2013-01-04 0.721555 -0.706771 -1.039575 5 3.0
2013-01-05 -0.424972 0.567020 0.276232 5 4.0
2013-01-06 -0.673690 0.113648 -1.478427 5 5.0
- Mean of each column:
In [61]: df.mean()
Out[61]:
A -0.004474
B -0.383981
C -0.687758
D 5.000000
F 3.000000
dtype: float64
The same operation on the other axis (mean of each row):
In [62]: df.mean(1)
Out[62]:
2013-01-01 0.872735
2013-01-02 1.431621
2013-01-03 0.707731
2013-01-04 1.395042
2013-01-05 1.883656
2013-01-06 1.592306
Freq: D, dtype: float64
- Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension:
In [63]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [64]: s
Out[64]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
In [65]: df.sub(s, axis='index')
Out[65]:
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -1.861849 -3.104569 -1.494929 4.0 1.0
2013-01-04 -2.278445 -3.706771 -4.039575 2.0 0.0
2013-01-05 -5.424972 -4.432980 -4.723768 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN
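The alignment seen in the sub() example above is the general rule for all arithmetic between pandas objects: operands are aligned on their labels first, and anything without a partner becomes NaN. A tiny sketch with toy data of my own:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'z'])

# Aligned on index before adding; 'x' has no partner in b, so NaN
result = a + b
```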
apply
Apply a function to the data:
In [66]: df.apply(np.cumsum)
Out[66]:
A B C D F
2013-01-01 0.000000 0.000000 -1.509059 5 NaN
2013-01-02 1.212112 -0.173215 -1.389850 10 1.0
2013-01-03 0.350263 -2.277784 -1.884779 15 3.0
2013-01-04 1.071818 -2.984555 -2.924354 20 6.0
2013-01-05 0.646846 -2.417535 -2.648122 25 10.0
2013-01-06 -0.026844 -2.303886 -4.126549 30 15.0
In [67]: df.apply(lambda x: x.max() - x.min())
Out[67]:
A 2.073961
B 2.671590
C 1.785291
D 0.000000
F 4.000000
dtype: float64
Value counts (histogramming)
See more at Histogramming and Discretization.
In [68]: s = pd.Series(np.random.randint(0, 7, size=10))
In [69]: s
Out[69]:
0 4
1 2
2 1
3 2
4 6
5 4
6 4
7 6
8 4
9 4
dtype: int64
In [70]: s.value_counts()
Out[70]:
4 5
6 2
2 2
1 1
dtype: int64
String methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.
In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [72]: s.str.lower()
Out[72]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
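Besides lower(), the str accessor has many vectorised methods; contains() is a handy one (my own example; note that, as the quote above says, it treats its pattern as a regex by default, and NaN propagates through):

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'Aaba', np.nan, 'cat'])

lower = s.str.lower()
has_a = s.str.contains('a')  # case-sensitive regex match; NaN stays NaN
```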
Merge
See the Merging section.
- Concat
Concatenating pandas objects together with concat():
In [73]: df = pd.DataFrame(np.random.randn(10, 4))
In [74]: df
Out[74]:
0 1 2 3
0 -0.548702 1.467327 -1.015962 -0.483075
1 1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952 0.991460 -0.919069 0.266046
3 -0.709661 1.669052 1.037882 -1.705775
4 -0.919854 -0.042379 1.247642 -0.009920
5 0.290213 0.495767 0.362949 1.548106
6 -1.131345 -0.089329 0.337863 -0.945867
7 -0.932132 1.956030 0.017587 -0.016692
8 -0.575247 0.254161 -1.143704 0.215897
9 1.193555 -0.077118 -0.408530 -0.862495
# break it into pieces
In [75]: pieces = [df[:3], df[3:7], df[7:]]
In [76]: pd.concat(pieces)
Out[76]:
0 1 2 3
0 -0.548702 1.467327 -1.015962 -0.483075
1 1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952 0.991460 -0.919069 0.266046
3 -0.709661 1.669052 1.037882 -1.705775
4 -0.919854 -0.042379 1.247642 -0.009920
5 0.290213 0.495767 0.362949 1.548106
6 -1.131345 -0.089329 0.337863 -0.945867
7 -0.932132 1.956030 0.017587 -0.016692
8 -0.575247 0.254161 -1.143704 0.215897
9 1.193555 -0.077118 -0.408530 -0.862495
- Join
SQL style merges. See the Database style joining section.
In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [79]: left
Out[79]:
key lval
0 foo 1
1 foo 2
In [80]: right
Out[80]:
key rval
0 foo 4
1 foo 5
In [81]: pd.merge(left, right, on='key')
Out[81]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [83]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [84]: left
Out[84]:
key lval
0 foo 1
1 bar 2
In [85]: right
Out[85]:
key rval
0 foo 4
1 bar 5
In [86]: pd.merge(left, right, on='key')
Out[86]:
key lval rval
0 foo 1 4
1 bar 2 5
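merge() defaults to an SQL inner join; the how parameter switches to 'left', 'right' or 'outer' joins. A sketch with partially overlapping keys (toy frames of my own):

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')               # only matching keys
outer = pd.merge(left, right, on='key', how='outer')  # keep keys from both sides
```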
- Append
See the Appending section.
In [87]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
In [88]: df
Out[88]:
A B C D
0 1.346061 1.511763 1.627081 -0.990582
1 -0.441652 1.211526 0.268520 0.024580
2 -1.577585 0.396823 -0.105381 -0.532532
3 1.453749 1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346 0.339969 -0.693205
5 -0.339355 0.593616 0.884345 1.591431
6 0.141809 0.220390 0.435589 0.192451
7 -0.096701 0.803351 1.715071 -0.708758
In [89]: s = df.iloc[3]
In [90]: df.append(s, ignore_index=True)
Out[90]:
A B C D
0 1.346061 1.511763 1.627081 -0.990582
1 -0.441652 1.211526 0.268520 0.024580
2 -1.577585 0.396823 -0.105381 -0.532532
3 1.453749 1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346 0.339969 -0.693205
5 -0.339355 0.593616 0.884345 1.591431
6 0.141809 0.220390 0.435589 0.192451
7 -0.096701 0.803351 1.715071 -0.708758
8 1.453749 1.208843 -0.080952 -0.264610
Grouping
By “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
See the Grouping section.
In [91]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
....: 'foo', 'bar', 'foo', 'foo'],
....: 'B': ['one', 'one', 'two', 'three',
....: 'two', 'two', 'one', 'three'],
....: 'C': np.random.randn(8),
....: 'D': np.random.randn(8)})
....:
In [92]: df
Out[92]:
A B C D
0 foo one -1.202872 -0.055224
1 bar one -1.814470 2.395985
2 foo two 1.018601 1.552825
3 bar three -0.595447 0.166599
4 foo two 1.395433 0.047609
5 bar two -0.392670 -0.136473
6 foo one 0.007207 -0.561757
7 foo three 1.928123 -1.623033
- Group, then sum:
In [93]: df.groupby('A').sum()
Out[93]:
C D
A
bar -2.802588 2.42611
foo 3.146492 -0.63958
In [94]: df.groupby(['A', 'B']).sum()
Out[94]:
C D
A B
bar one -1.814470 2.395985
three -0.595447 0.166599
two -0.392670 -0.136473
foo one -1.195665 -0.616981
three 1.928123 -1.623033
two 2.414034 1.600434
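groupby isn't limited to sum(); agg() applies several aggregates in one pass. A quick sketch with toy data of my own:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

sums = df.groupby('A').sum()                       # one aggregate
stats = df.groupby('A')['C'].agg(['sum', 'mean'])  # several at once
```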
Reshaping
Stack
In [95]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
....: 'foo', 'foo', 'qux', 'qux'],
....: ['one', 'two', 'one', 'two',
....: 'one', 'two', 'one', 'two']]))
....:
In [96]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [97]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [98]: df2 = df[:4]
In [99]: df2
Out[99]:
A B
first second
bar one 0.029399 -0.542108
two 0.282696 -0.087302
baz one -1.575170 1.771208
two 0.816482 1.100230
The stack() method "compresses" a level in the DataFrame's columns:
In [100]: stacked = df2.stack()
In [101]: stacked
Out[101]:
first second
bar one A 0.029399
B -0.542108
two A 0.282696
B -0.087302
baz one A -1.575170
B 1.771208
two A 0.816482
B 1.100230
dtype: float64
unstack() reverses it; by default the last level is unstacked:
In [102]: stacked.unstack()
Out[102]:
A B
first second
bar one 0.029399 -0.542108
two 0.282696 -0.087302
baz one -1.575170 1.771208
two 0.816482 1.100230
In [103]: stacked.unstack(1)
Out[103]:
second one two
first
bar A 0.029399 0.282696
B -0.542108 -0.087302
baz A -1.575170 0.816482
B 1.771208 1.100230
In [104]: stacked.unstack(0)
Out[104]:
first bar baz
second
one A 0.029399 -1.575170
B -0.542108 1.771208
two A 0.282696 0.816482
B -0.087302 1.100230
Pivot tables
See the section on Pivot Tables.
In [105]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
.....: 'B': ['A', 'B', 'C'] * 4,
.....: 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
.....: 'D': np.random.randn(12),
.....: 'E': np.random.randn(12)})
.....:
In [106]: df
Out[106]:
A B C D E
0 one A foo 1.418757 -0.179666
1 one B foo -1.879024 1.291836
2 two C foo 0.536826 -0.009614
3 three A bar 1.006160 0.392149
4 one B bar -0.029716 0.264599
5 one C bar -1.146178 -0.057409
6 two A foo 0.100900 -1.425638
7 three B foo -1.035018 1.024098
8 one C foo 0.314665 -0.106062
9 one A bar -0.773723 1.824375
10 two B bar -1.170653 0.595974
11 three C bar 0.648740 1.167115
In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[107]:
C bar foo
A B
one A -0.773723 1.418757
B -0.029716 -1.879024
C -1.146178 0.314665
three A 1.006160 NaN
B NaN -1.035018
C 0.648740 NaN
two A NaN 0.100900
B -1.170653 NaN
C NaN 0.536826
Time series
A few of pandas' tools for frequency conversion and resampling of time series:
In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [109]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [110]: ts.resample('5Min').sum()
Out[110]:
2012-01-01 25083
Freq: 5T, dtype: int64
- Time zone representation:
In [111]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
In [112]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [113]: ts
Out[113]:
2012-03-06 0.464000
2012-03-07 0.227371
2012-03-08 -0.496922
2012-03-09 0.306389
2012-03-10 -2.290613
Freq: D, dtype: float64
In [114]: ts_utc = ts.tz_localize('UTC')
In [115]: ts_utc
Out[115]:
2012-03-06 00:00:00+00:00 0.464000
2012-03-07 00:00:00+00:00 0.227371
2012-03-08 00:00:00+00:00 -0.496922
2012-03-09 00:00:00+00:00 0.306389
2012-03-10 00:00:00+00:00 -2.290613
Freq: D, dtype: float64
- Convert to another time zone:
In [116]: ts_utc.tz_convert('US/Eastern')
Out[116]:
2012-03-05 19:00:00-05:00 0.464000
2012-03-06 19:00:00-05:00 0.227371
2012-03-07 19:00:00-05:00 -0.496922
2012-03-08 19:00:00-05:00 0.306389
2012-03-09 19:00:00-05:00 -2.290613
Freq: D, dtype: float64
Converting between time span representations:
In [117]: rng = pd.date_range('1/1/2012', periods=5, freq='M')
In [118]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [119]: ts
Out[119]:
2012-01-31 -1.134623
2012-02-29 -1.561819
2012-03-31 -0.260838
2012-04-30 0.281957
2012-05-31 1.523962
Freq: M, dtype: float64
In [120]: ps = ts.to_period()
In [121]: ps
Out[121]:
2012-01 -1.134623
2012-02 -1.561819
2012-03 -0.260838
2012-04 0.281957
2012-05 1.523962
Freq: M, dtype: float64
In [122]: ps.to_timestamp()
Out[122]:
2012-01-01 -1.134623
2012-02-01 -1.561819
2012-03-01 -0.260838
2012-04-01 0.281957
2012-05-01 1.523962
Freq: MS, dtype: float64
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:
In [123]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
In [124]: ts = pd.Series(np.random.randn(len(prng)), prng)
In [125]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
In [126]: ts.head()
Out[126]:
1990-03-01 09:00 -0.902937
1990-06-01 09:00 0.068159
1990-09-01 09:00 -0.057873
1990-12-01 09:00 -0.368204
1991-03-01 09:00 -1.144073
Freq: H, dtype: float64
Categoricals
pandas provides this dtype to make working with categorical data easier.
pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.
In [127]: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
.....: "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
.....:
- Convert the raw grades to a categorical dtype:
In [128]: df["grade"] = df["raw_grade"].astype("category")
In [129]: df["grade"]
Out[129]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
- Rename the categories to more meaningful names:
In [130]: df["grade"].cat.categories = ["very good", "good", "very bad"]
- Reorder the categories and add the missing ones at the same time (the official guide does this as In [131]; the groupby output further down relies on it, which is why bad and medium show up there with zero counts):
In [131]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
- Sorting is per the order of the categories, not lexical order:
In [133]: df.sort_values(by="grade")
Out[133]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
- Grouping by a categorical column also shows empty categories:
In [134]: df.groupby("grade").size()
Out[134]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
Plotting
See the Plotting docs.
In [135]: ts = pd.Series(np.random.randn(1000),
.....: index=pd.date_range('1/1/2000', periods=1000))
.....:
In [136]: ts = ts.cumsum()
In [137]: ts.plot()
Out[137]: <matplotlib.axes._subplots.AxesSubplot at 0x7f45409e1690>
- On a DataFrame, the plot() method conveniently plots all of the columns at once, with labels:
In [138]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
.....: columns=['A', 'B', 'C', 'D'])
.....:
In [139]: df = df.cumsum()
In [140]: plt.figure()
Out[140]: <Figure size 640x480 with 0 Axes>
In [141]: df.plot()
Out[141]: <matplotlib.axes._subplots.AxesSubplot at 0x7f453cb4dc50>
In [142]: plt.legend(loc='best')
Out[142]: <matplotlib.legend.Legend at 0x7f453cacfc90>
Getting data in/out
CSV
In [143]: df.to_csv('foo.csv')
In [144]: pd.read_csv('foo.csv')
Out[144]:
Unnamed: 0 A B C D
0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860
1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2 2000-01-03 -1.734933 0.530468 2.060811 -0.515536
3 2000-01-04 -1.555121 1.452620 0.239859 -1.156896
4 2000-01-05 0.578117 0.511371 0.103552 -2.428202
.. ... ... ... ... ...
995 2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
996 2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
997 2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
998 2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
999 2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 5 columns]
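That "Unnamed: 0" column above is the index that to_csv wrote out; passing index_col=0 and parse_dates=True to read_csv round-trips it back into a DatetimeIndex. A sketch of my own, using an in-memory buffer instead of a file:

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'],
                  index=pd.date_range('2000-01-01', periods=5))

buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)

# index_col=0 + parse_dates restores the DatetimeIndex instead of
# leaving it as an 'Unnamed: 0' text column
back = pd.read_csv(buf, index_col=0, parse_dates=True)
```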
HDF5
Reading and writing to HDFStores.
Writing to a HDF5 Store.
In [145]: df.to_hdf('foo.h5', 'df')
Reading from a HDF5 Store.
In [146]: pd.read_hdf('foo.h5', 'df')
Out[146]:
A B C D
2000-01-01 0.266457 -0.399641 -0.219582 1.186860
2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2000-01-03 -1.734933 0.530468 2.060811 -0.515536
2000-01-04 -1.555121 1.452620 0.239859 -1.156896
2000-01-05 0.578117 0.511371 0.103552 -2.428202
... ... ... ... ...
2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 4 columns]
Excel
Reading and writing to MS Excel.
Writing to an excel file.
In [147]: df.to_excel('foo.xlsx', sheet_name='Sheet1')
Reading from an excel file.
In [148]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[148]:
Unnamed: 0 A B C D
0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860
1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2 2000-01-03 -1.734933 0.530468 2.060811 -0.515536
3 2000-01-04 -1.555121 1.452620 0.239859 -1.156896
4 2000-01-05 0.578117 0.511371 0.103552 -2.428202
.. ... ... ... ... ...
995 2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
996 2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
997 2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
998 2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
999 2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 5 columns]
Gotchas
If you are attempting to perform an operation you might see an exception like:
>>> if pd.Series([False, True, False]):
... print("I was true")
Traceback
...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
See Comparisons for an explanation and what to do.
See Gotchas as well.
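In other words, instead of `if s:` you spell out which reduction you actually mean; a sketch:

```python
import pandas as pd

s = pd.Series([False, True, False])

# Be explicit about what "true" should mean for a whole Series:
any_true = s.any()   # at least one element is True
all_true = s.all()   # every element is True
is_empty = s.empty   # the Series has no elements at all
```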
Afterword:
I have to admit this fizzled out a bit toward the end; I really couldn't keep going. So much for "10 minutes": why did it take me a whole day? Time to rest 😵