pandas Basics (1. Fundamental Operations)

Preface:
Today is a special day, because the smog is terrible! It has been a long time since the smog was this heavy.
☁️☁️☁️☁️☁️
More importantly, I need to catch up on my pandas notes.
I wrote about NumPy earlier; if you need it, see NumPy基礎(1. 準備!). Lately I have been reading pandas, so I am taking notes based on the official "10 minutes to pandas" guide (yes, the official site does not capitalize the first letter when "pandas" stands on its own). Enough chatter, let's get started.


Note: this article is based on pandas Release 0.25.3.

Package overview

First, following my usual habit, let's see what pandas actually is: a "powerful Python data analysis toolkit".

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

In short: pandas is a Python package providing fast, flexible data structures for expressing "relational" data (think of a relational database table) with "labels" (each column name acts as a label over that column's data). (The rest is the pandas team's ambitions and goals, so I'll skip it.)
Having read the above, you should roughly get it: pandas seems to be all about playing with tables. Too naive? (Actually, that is pretty much it.)


pandas is well suited for many different kinds of data:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

pandas can work with many different kinds of data:

  • Tables: yes, the relational table structure you have seen before.
  • Ordered or unordered time series: sounds abstract, I know.
  • Matrix data with row and column labels: a step up from a plain table; not only column names but row names too. A few examples below will make this clear.
  • Any other observational/statistical data: it does not even need labels to go into a pandas data structure. Sounds like bragging; let's put it to the test.

The text keeps insisting how unbeatable pandas' data structures are; let's have a look:

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

pandas mainly provides two data structures, Series (one-dimensional) and DataFrame (two-dimensional), to handle data from a wide range of fields. (There used to be a three-dimensional Panel as well, but it was removed in 0.25.) For R users, DataFrame offers everything that R's data.frame provides and more. pandas is built on top of NumPy (for my NumPy notes, see the link at the top of this page) so that it integrates well with other third-party libraries.

Dimensions | Name      | Description
1          | Series    | 1D labeled homogeneously-typed array
2          | DataFrame | General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns

10 minutes to pandas

Note: most of the examples display data as tables; IPython or Jupyter renders these results more nicely.
This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in the Cookbook.

Object creation

See the Data Structure Intro section.


  1. Create an instance of type Series:

Creating a Series by passing a list of values, letting pandas create a default integer index:

NaN: not a number; roughly analogous to Python's None.

In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [4]: s
Out[4]: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

The printed result looks a bit like a Python dictionary: the keys are the digits 0–5 (this is the default integer row index that pandas generated for our data) and the values are the data we passed in.
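As an aside (my addition, not part of the quoted guide): instead of letting pandas generate the default index, you can pass your own labels, after which elements are addressed by label:

```python
import numpy as np
import pandas as pd

# Same data as above, but with explicit string labels instead of 0..5
s2 = pd.Series([1, 3, 5, np.nan, 6, 8], index=list('abcdef'))
print(s2)

# Elements are now looked up by label rather than by position
print(s2['b'])  # 3.0
```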


  2. Create a DataFrame instance from a NumPy array:

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

To build the instance df: first use date_range to generate a DatetimeIndex of datetime elements named dates.
Then construct the DataFrame with randomly generated data, using dates as the row index and specifying A, B, C, D as the column labels.

In [5]: dates = pd.date_range('20130101', periods=6)

In [6]: dates
Out[6]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [7]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [8]: df
Out[8]: 
                   A         B         C         D
2013-01-01 -0.254631  1.423220  0.038671  0.762526
2013-01-02 -0.978045  1.159385  0.374813 -2.082786
2013-01-03 -0.233653 -1.853380 -0.040360  0.196083
2013-01-04  0.684391  0.261879 -0.777546 -0.862060
2013-01-05 -0.776971  0.125240 -0.340414 -0.782474
2013-01-06  0.347661  1.357053  1.011172  0.725046

You can see the tabular output: A, B, C, D are the column headers, and the values from dates form the row index.

  3. Build the instance df2: create a DataFrame from a dict

The dict passed in has keys A through F:
A maps to the scalar 1.0,
B maps to a pandas Timestamp,
C maps to a Series (value 1, four entries indexed 0–3, dtype float32),
D maps to a NumPy array (four elements of value 3, dtype int32),
E maps to a Categorical (similar to R's factor type),
F maps to the string 'foo'.

In [9]: df2 = pd.DataFrame({'A': 1.,
   ...:                     'B': pd.Timestamp('20130102'),
   ...:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ...:                     'D': np.array([3] * 4, dtype='int32'),
   ...:                     'E': pd.Categorical(["test", "train", "test", "train"]),
   ...:                     'F': 'foo'})
   ...: 

In [10]: df2
Out[10]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
  4. Inspect df2's column dtypes
In [11]: df2.dtypes
Out[11]: 
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:

In [12]: df2.<TAB>  # noqa: E225, E999
df2.A                  df2.bool
df2.abs                df2.boxplot
df2.add                df2.C
df2.add_prefix         df2.clip
df2.add_suffix         df2.clip_lower
df2.align              df2.clip_upper
df2.all                df2.columns
df2.any                df2.combine
df2.append             df2.combine_first
df2.apply              df2.compound
df2.applymap           df2.consolidate
df2.D

Each column has become an attribute of df2; A through F are all there (the listing above is truncated for space).


Viewing data


See the Basics section.

  1. Look at the top and bottom rows of the data
In [13]: df2.head()
Out[13]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
In [14]: df2.tail(3)
Out[14]: 
     A          B    C  D      E    F
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
  2. Look at the row and column indexes
In [15]: df2.index
Out[15]: 
Int64Index([0, 1, 2, 3], dtype='int64')

In [16]: df2.columns
Out[16]: Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

The DataFrame.to_numpy() method returns a NumPy array in which each element is one row of data. If the columns have different dtypes, this can be an expensive operation. Just as every element of a NumPy array shares one dtype, every element of a DataFrame column shares one dtype; to_numpy() therefore has to find a single NumPy dtype that can hold them all.

Note: DataFrame.to_numpy() does not include the row or column indexes in its output.
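A quick check of the cost described above (my addition; df2 is rebuilt here so the snippet is self-contained): with mixed column dtypes, to_numpy() falls back to object, while a homogeneous frame keeps a cheap numeric dtype.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

# Mixed dtypes force every value to be boxed as a Python object
print(df2.to_numpy().dtype)                              # object

# An all-float frame converts without any boxing
print(pd.DataFrame(np.zeros((2, 2))).to_numpy().dtype)   # float64
```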

  3. describe() shows a quick statistical summary of your data:
df2.describe()

         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0
  4. Transpose the data:
df2
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


df2.T
                     0  ...                    3
A                    1  ...                    1
B  2013-01-02 00:00:00  ...  2013-01-02 00:00:00
C                    1  ...                    1
D                    3  ...                    3
E                 test  ...                train
F                  foo  ...                  foo
[6 rows x 4 columns]
  5. Sort along a chosen axis

Note that the column headers become D C B A:

df.sort_index(axis=1, ascending=False)

                   D         C         B         A
2013-01-01  0.762526  0.038671  1.423220 -0.254631
2013-01-02 -2.082786  0.374813  1.159385 -0.978045
2013-01-03  0.196083 -0.040360 -1.853380 -0.233653
2013-01-04 -0.862060 -0.777546  0.261879  0.684391
2013-01-05 -0.782474 -0.340414  0.125240 -0.776971
2013-01-06  0.725046  1.011172  1.357053  0.347661
  6. Sort by values

Note that column B becomes an increasing sequence:

df.sort_values(by='B')

                   A         B         C         D
2013-01-03 -0.233653 -1.853380 -0.040360  0.196083
2013-01-05 -0.776971  0.125240 -0.340414 -0.782474
2013-01-04  0.684391  0.261879 -0.777546 -0.862060
2013-01-02 -0.978045  1.159385  0.374813 -2.082786
2013-01-06  0.347661  1.357053  1.011172  0.725046
2013-01-01 -0.254631  1.423220  0.038671  0.762526

Selection


Note: the optimized pandas data access methods .at, .iat, .loc and .iloc are recommended.
While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc.

For indexing usage, see:
See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.

Getting

Selecting a single column, which yields a Series, equivalent to df.A:
Select a single column:

df.A
2013-01-01   -0.254631
2013-01-02   -0.978045
2013-01-03   -0.233653
2013-01-04    0.684391
2013-01-05   -0.776971
2013-01-06    0.347661
Freq: D, Name: A, dtype: float64
df['A']
2013-01-01   -0.254631
2013-01-02   -0.978045
2013-01-03   -0.233653
2013-01-04    0.684391
2013-01-05   -0.776971
2013-01-06    0.347661
Freq: D, Name: A, dtype: float64
  • Slicing the data with []
df[0:3]
                   A         B         C         D
2013-01-01 -0.254631  1.423220  0.038671  0.762526
2013-01-02 -0.978045  1.159385  0.374813 -2.082786
2013-01-03 -0.233653 -1.853380 -0.040360  0.196083
df['20130102':'20130104']
                   A         B         C         D
2013-01-02 -0.978045  1.159385  0.374813 -2.082786
2013-01-03 -0.233653 -1.853380 -0.040360  0.196083
2013-01-04  0.684391  0.261879 -0.777546 -0.862060

Selection by label

For more methods, see:
See more in Selection by Label.

dates[0]
Timestamp('2013-01-01 00:00:00', freq='D')

# get a row of data by its index label
df.loc[dates[0]]
A   -0.254631
B    1.423220
C    0.038671
D    0.762526
Name: 2013-01-01 00:00:00, dtype: float64
  • Get multiple columns
# ':' selects all rows; then take columns A and B
df.loc[:, ['A', 'B']]

                   A         B
2013-01-01 -0.254631  1.423220
2013-01-02 -0.978045  1.159385
2013-01-03 -0.233653 -1.853380
2013-01-04  0.684391  0.261879
2013-01-05 -0.776971  0.125240
2013-01-06  0.347661  1.357053
# specify row and column labels at the same time
df.loc['20130102':'20130104', ['A', 'B']]

                   A         B
2013-01-02 -0.978045  1.159385
2013-01-03 -0.233653 -1.853380
2013-01-04  0.684391  0.261879
  • If only one row is returned, the result is automatically reduced to a Series
df.loc['20130102', ['A', 'B']]
A   -0.978045
B    1.159385

type(df.loc['20130102', ['A', 'B']])
<class 'pandas.core.series.Series'>
  • If a single cell is returned, the result is automatically reduced to a scalar
df.loc[dates[0], 'A']
-0.2546306719860299

type(df.loc[dates[0], 'A'])
<class 'numpy.float64'>

df.get('A').get(dates[0])
-0.2546306719860299
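For completeness (my addition): .at is the label-based fast scalar accessor, the counterpart of the .iat call shown further below. Here df is rebuilt with fixed numbers so the values are reproducible:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

# Three ways to read the same scalar; .at skips .loc's general machinery
a = df.loc[dates[0], 'A']
b = df.at[dates[0], 'A']
c = df.get('A').get(dates[0])
print(a, b, c)  # 0.0 0.0 0.0
```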

Selection by position

See more in Selection by Position.

df

                   A         B         C         D
2013-01-01 -0.254631  1.423220  0.038671  0.762526
2013-01-02 -0.978045  1.159385  0.374813 -2.082786
2013-01-03 -0.233653 -1.853380 -0.040360  0.196083
2013-01-04  0.684391  0.261879 -0.777546 -0.862060
2013-01-05 -0.776971  0.125240 -0.340414 -0.782474
2013-01-06  0.347661  1.357053  1.011172  0.725046

Select a row (iloc = integer location):

df.iloc[3]

A    0.684391
B    0.261879
C   -0.777546
D   -0.862060
Name: 2013-01-04 00:00:00, dtype: float64
  • Slice with integers, just like in Python/NumPy
df.iloc[3:5, 0:2]

                   A         B
2013-01-04  0.684391  0.261879
2013-01-05 -0.776971  0.125240
  • Use lists to pick specific positions
df.iloc[[1, 2, 4], [0, 2]]

                   A         C
2013-01-02 -0.978045  0.374813
2013-01-03 -0.233653 -0.040360
2013-01-05 -0.776971 -0.340414
  • Slice rows
df.iloc[1:3, :]

                   A         B         C         D
2013-01-02 -0.978045  1.159385  0.374813 -2.082786
2013-01-03 -0.233653 -1.853380 -0.040360  0.196083
  • Slice columns
df.iloc[:, 1:3]

                   B         C
2013-01-01  1.423220  0.038671
2013-01-02  1.159385  0.374813
2013-01-03 -1.853380 -0.040360
2013-01-04  0.261879 -0.777546
2013-01-05  0.125240 -0.340414
2013-01-06  1.357053  1.011172
  • For getting a value explicitly:
In [37]: df.iloc[1, 1]
Out[37]: -0.17321464905330858
  • For getting fast access to a scalar (equivalent to the prior method):
In [38]: df.iat[1, 1]
Out[38]: -0.17321464905330858

Boolean indexing

  • Use a single column's values to select data
df
                   A         B         C         D
2013-01-01 -0.254631  1.423220  0.038671  0.762526
2013-01-02 -0.978045  1.159385  0.374813 -2.082786
2013-01-03 -0.233653 -1.853380 -0.040360  0.196083
2013-01-04  0.684391  0.261879 -0.777546 -0.862060
2013-01-05 -0.776971  0.125240 -0.340414 -0.782474
2013-01-06  0.347661  1.357053  1.011172  0.725046
  • Select the rows where column A is greater than 0
df[df.A > 0]
                   A         B         C         D
2013-01-04  0.684391  0.261879 -0.777546 -0.862060
2013-01-06  0.347661  1.357053  1.011172  0.725046
  • Select the values in the DataFrame that are greater than 0
df[df > 0]

                   A         B         C         D
2013-01-01       NaN  1.423220  0.038671  0.762526
2013-01-02       NaN  1.159385  0.374813       NaN
2013-01-03       NaN       NaN       NaN  0.196083
2013-01-04  0.684391  0.261879       NaN       NaN
2013-01-05       NaN  0.125240       NaN       NaN
2013-01-06  0.347661  1.357053  1.011172  0.725046
  • Filter with the isin() method

Add a new column to df to filter on; .copy() makes a deep copy (covered in my earlier NumPy notes).

df3 = df.copy()
df3['E'] = ['one','one','two','three','four','three']
df3
                   A         B         C         D      E
2013-01-01 -0.254631  1.423220  0.038671  0.762526    one
2013-01-02 -0.978045  1.159385  0.374813 -2.082786    one
2013-01-03 -0.233653 -1.853380 -0.040360  0.196083    two
2013-01-04  0.684391  0.261879 -0.777546 -0.862060  three
2013-01-05 -0.776971  0.125240 -0.340414 -0.782474   four
2013-01-06  0.347661  1.357053  1.011172  0.725046  three

Filter for the rows whose E column is 'two' or 'four':

df3[df3['E'].isin(['two','four'])]

                   A        B         C         D     E
2013-01-03 -0.233653 -1.85338 -0.040360  0.196083   two
2013-01-05 -0.776971  0.12524 -0.340414 -0.782474  four
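The isin() mask can also be inverted with ~ to keep everything except those values (my addition, with fixed data for reproducibility):

```python
import pandas as pd

df3 = pd.DataFrame({'A': range(6)}, index=pd.date_range('20130101', periods=6))
df3['E'] = ['one', 'one', 'two', 'three', 'four', 'three']

# ~ negates the boolean mask: keep the rows whose E is NOT 'two' or 'four'
kept = df3[~df3['E'].isin(['two', 'four'])]
print(kept['E'].tolist())  # ['one', 'one', 'three', 'three']
```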

Setting

  • Set a new column and assign data to it
df
                   A         B         C         D
2013-01-01 -0.254631  1.423220  0.038671  0.762526
2013-01-02 -0.978045  1.159385  0.374813 -2.082786
2013-01-03 -0.233653 -1.853380 -0.040360  0.196083
2013-01-04  0.684391  0.261879 -0.777546 -0.862060
2013-01-05 -0.776971  0.125240 -0.340414 -0.782474
2013-01-06  0.347661  1.357053  1.011172  0.725046

Construct a Series and assign it to df as column F.
Because the Series' index starts at 20130102, df has no matching value at 2013-01-01, so that cell shows NaN.

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
df['F'] = s1
df
                   A         B         C         D    F
2013-01-01 -0.254631  1.423220  0.038671  0.762526  NaN
2013-01-02 -0.978045  1.159385  0.374813 -2.082786  1.0
2013-01-03 -0.233653 -1.853380 -0.040360  0.196083  2.0
2013-01-04  0.684391  0.261879 -0.777546 -0.862060  3.0
2013-01-05 -0.776971  0.125240 -0.340414 -0.782474  4.0
2013-01-06  0.347661  1.357053  1.011172  0.725046  5.0
  • Negate every value in df that is greater than 0
df4 = df.copy()
df4[df4 > 0] = -df4
df4

                   A         B         C         D    F
2013-01-01 -0.254631 -1.423220 -0.038671 -0.762526  NaN
2013-01-02 -0.978045 -1.159385 -0.374813 -2.082786 -1.0
2013-01-03 -0.233653 -1.853380 -0.040360 -0.196083 -2.0
2013-01-04 -0.684391 -0.261879 -0.777546 -0.862060 -3.0
2013-01-05 -0.776971 -0.125240 -0.340414 -0.782474 -4.0
2013-01-06 -0.347661 -1.357053 -1.011172 -0.725046 -5.0

I have been writing for ages and I'm not even halfway. Exhausted 😢. If you're tired of reading, go grab a drink 🍻. Assuming anyone is actually reading, that is.

Missing data

pandas uses np.nan to represent missing data. By default it is excluded from computations. See the Missing Data section.

Use reindex() to build a new df1 with an extra column E:

In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])

In [56]: df1.loc[dates[0]:dates[1], 'E'] = 1

In [57]: df1
Out[57]: 
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  NaN  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  NaN
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  NaN

Drop rows with missing data:

In [58]: df1.dropna(how='any')
Out[58]: 
                   A         B         C  D    F    E
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0

Fill in missing data:

In [59]: df1.fillna(value=5)
Out[59]: 
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  5.0  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  5.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  5.0

Get a boolean mask marking the NaN values:

In [60]: pd.isna(df1)
Out[60]: 
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True

Operations


See the Basic section on Binary Ops.

Stats

Operations in general exclude missing (NaN) data.

In [51]: df
Out[51]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059  5  NaN
2013-01-02  1.212112 -0.173215  0.119209  5  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0
2013-01-05 -0.424972  0.567020  0.276232  5  4.0
2013-01-06 -0.673690  0.113648 -1.478427  5  5.0
  • Mean
In [61]: df.mean()
Out[61]: 
A   -0.004474
B   -0.383981
C   -0.687758
D    5.000000
F    3.000000
dtype: float64

The same operation along the other axis (the mean of each row):

In [62]: df.mean(1)
Out[62]: 
2013-01-01    0.872735
2013-01-02    1.431621
2013-01-03    0.707731
2013-01-04    1.395042
2013-01-05    1.883656
2013-01-06    1.592306
Freq: D, dtype: float64
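The NaN-skipping behaviour mentioned above can be controlled explicitly with the skipna parameter (my addition):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.mean())              # 2.0 -- NaN is excluded by default
print(s.mean(skipna=False))  # nan -- force NaN to propagate instead
```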
  • Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.
In [63]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

In [64]: s
Out[64]: 
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [65]: df.sub(s, axis='index')
Out[65]: 
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -1.861849 -3.104569 -1.494929  4.0  1.0
2013-01-04 -2.278445 -3.706771 -4.039575  2.0  0.0
2013-01-05 -5.424972 -4.432980 -4.723768  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN

apply

Apply a function to the data:

In [66]: df.apply(np.cumsum)
Out[66]: 
                   A         B         C   D     F
2013-01-01  0.000000  0.000000 -1.509059   5   NaN
2013-01-02  1.212112 -0.173215 -1.389850  10   1.0
2013-01-03  0.350263 -2.277784 -1.884779  15   3.0
2013-01-04  1.071818 -2.984555 -2.924354  20   6.0
2013-01-05  0.646846 -2.417535 -2.648122  25  10.0
2013-01-06 -0.026844 -2.303886 -4.126549  30  15.0

In [67]: df.apply(lambda x: x.max() - x.min())
Out[67]: 
A    2.073961
B    2.671590
C    1.785291
D    0.000000
F    4.000000
dtype: float64

Value counts

See more at Histogramming and Discretization.

In [68]: s = pd.Series(np.random.randint(0, 7, size=10))

In [69]: s
Out[69]: 
0    4
1    2
2    1
3    2
4    6
5    4
6    4
7    6
8    4
9    4
dtype: int64

In [70]: s.value_counts()
Out[70]: 
4    5
6    2
2    2
1    1
dtype: int64

String methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.

In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [72]: s.str.lower()
Out[72]: 
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
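Since the str methods are regex-based by default, pattern matching works out of the box (my addition, reusing the same Series):

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# The pattern is a regular expression: entries starting with A or B
starts_ab = s.str.contains('^[AB]')
print(starts_ab.tolist())

# str.replace is regex-based too (regex=True spelled out for clarity)
print(s.str.replace('a+', '-', regex=True).tolist())
```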

Merge


See the Merging section.

  • Concat

Concatenating pandas objects together with concat():

In [73]: df = pd.DataFrame(np.random.randn(10, 4))

In [74]: df
Out[74]: 
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495

# break it into pieces
In [75]: pieces = [df[:3], df[3:7], df[7:]]

In [76]: pd.concat(pieces)
Out[76]: 
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495
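concat() can also glue frames side by side along the column axis (my addition, not part of the quoted example):

```python
import pandas as pd

left = pd.DataFrame({'x': [1, 2, 3]})
right = pd.DataFrame({'y': [4, 5, 6]})

# axis=1 aligns on the row index and stacks the columns next to each other
wide = pd.concat([left, right], axis=1)
print(wide.shape)          # (3, 2)
print(list(wide.columns))  # ['x', 'y']
```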
  • Join
    SQL-style merges

SQL style merges. See the Database style joining section.

In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [79]: left
Out[79]: 
   key  lval
0  foo     1
1  foo     2

In [80]: right
Out[80]: 
   key  rval
0  foo     4
1  foo     5

In [81]: pd.merge(left, right, on='key')
Out[81]: 
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})

In [83]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [84]: left
Out[84]: 
   key  lval
0  foo     1
1  bar     2

In [85]: right
Out[85]: 
   key  rval
0  foo     4
1  bar     5

In [86]: pd.merge(left, right, on='key')
Out[86]: 
   key  lval  rval
0  foo     1     4
1  bar     2     5
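merge() defaults to an inner join; the how parameter switches the join type (my addition, with keys chosen so the difference shows):

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')               # matching keys only
outer = pd.merge(left, right, on='key', how='outer')  # union of keys, NaN-filled

print(inner['key'].tolist())          # ['foo']
print(sorted(outer['key'].tolist()))  # ['bar', 'baz', 'foo']
```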
  • Append
    Append rows to a DataFrame

See the Appending section.

In [87]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])

In [88]: df
Out[88]: 
          A         B         C         D
0  1.346061  1.511763  1.627081 -0.990582
1 -0.441652  1.211526  0.268520  0.024580
2 -1.577585  0.396823 -0.105381 -0.532532
3  1.453749  1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346  0.339969 -0.693205
5 -0.339355  0.593616  0.884345  1.591431
6  0.141809  0.220390  0.435589  0.192451
7 -0.096701  0.803351  1.715071 -0.708758

In [89]: s = df.iloc[3]

In [90]: df.append(s, ignore_index=True)
Out[90]: 
          A         B         C         D
0  1.346061  1.511763  1.627081 -0.990582
1 -0.441652  1.211526  0.268520  0.024580
2 -1.577585  0.396823 -0.105381 -0.532532
3  1.453749  1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346  0.339969 -0.693205
5 -0.339355  0.593616  0.884345  1.591431
6  0.141809  0.220390  0.435589  0.192451
7 -0.096701  0.803351  1.715071 -0.708758
8  1.453749  1.208843 -0.080952 -0.264610
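For reference (my note): append() here is shorthand for a concat(); the same row-appending can be written as follows, with a small fixed frame so the result is reproducible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8.0).reshape(2, 4), columns=list('ABCD'))
s = df.iloc[1]

# Equivalent to df.append(s, ignore_index=True)
out = pd.concat([df, s.to_frame().T], ignore_index=True)
print(out.shape)             # (3, 4)
print(out.iloc[2].tolist())  # [4.0, 5.0, 6.0, 7.0]
```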

Grouping

By “group by” we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

See the Grouping section.

In [91]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ....:                          'foo', 'bar', 'foo', 'foo'],
   ....:                    'B': ['one', 'one', 'two', 'three',
   ....:                          'two', 'two', 'one', 'three'],
   ....:                    'C': np.random.randn(8),
   ....:                    'D': np.random.randn(8)})
   ....: 

In [92]: df
Out[92]: 
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033
  • Group, then sum
In [93]: df.groupby('A').sum()
Out[93]: 
            C        D
A                     
bar -2.802588  2.42611
foo  3.146492 -0.63958
In [94]: df.groupby(['A', 'B']).sum()
Out[94]: 
                  C         D
A   B                        
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434
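groupby is not limited to sum(); agg() applies several aggregations at once (my addition, with fixed numbers instead of the random data above):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# One row per group, one column per aggregation
res = df.groupby('A')['C'].agg(['sum', 'mean'])
print(res.loc['foo', 'sum'])   # 4.0
print(res.loc['bar', 'mean'])  # 3.0
```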

Reshaping

Stack


In [95]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
   ....:                      'foo', 'foo', 'qux', 'qux'],
   ....:                     ['one', 'two', 'one', 'two',
   ....:                      'one', 'two', 'one', 'two']]))
   ....: 

In [96]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [97]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

In [98]: df2 = df[:4]

In [99]: df2
Out[99]: 
                     A         B
first second                    
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230

The stack() method "compresses" a level in the DataFrame's columns:

In [100]: stacked = df2.stack()

In [101]: stacked
Out[101]: 
first  second   
bar    one     A    0.029399
               B   -0.542108
       two     A    0.282696
               B   -0.087302
baz    one     A   -1.575170
               B    1.771208
       two     A    0.816482
               B    1.100230
dtype: float64

unstack() is the inverse operation; by default it unstacks the last level:

In [102]: stacked.unstack()
Out[102]: 
                     A         B
first second                    
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230

In [103]: stacked.unstack(1)
Out[103]: 
second        one       two
first                      
bar   A  0.029399  0.282696
      B -0.542108 -0.087302
baz   A -1.575170  0.816482
      B  1.771208  1.100230

In [104]: stacked.unstack(0)
Out[104]: 
first          bar       baz
second                      
one    A  0.029399 -1.575170
       B -0.542108  1.771208
two    A  0.282696  0.816482
       B -0.087302  1.100230

Pivot tables

See the section on Pivot Tables.

In [105]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
   .....:                    'B': ['A', 'B', 'C'] * 4,
   .....:                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
   .....:                    'D': np.random.randn(12),
   .....:                    'E': np.random.randn(12)})
   .....: 

In [106]: df
Out[106]: 
        A  B    C         D         E
0     one  A  foo  1.418757 -0.179666
1     one  B  foo -1.879024  1.291836
2     two  C  foo  0.536826 -0.009614
3   three  A  bar  1.006160  0.392149
4     one  B  bar -0.029716  0.264599
5     one  C  bar -1.146178 -0.057409
6     two  A  foo  0.100900 -1.425638
7   three  B  foo -1.035018  1.024098
8     one  C  foo  0.314665 -0.106062
9     one  A  bar -0.773723  1.824375
10    two  B  bar -1.170653  0.595974
11  three  C  bar  0.648740  1.167115
In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[107]: 
C             bar       foo
A     B                    
one   A -0.773723  1.418757
      B -0.029716 -1.879024
      C -1.146178  0.314665
three A  1.006160       NaN
      B       NaN -1.035018
      C  0.648740       NaN
two   A       NaN  0.100900
      B -1.170653       NaN
      C       NaN  0.536826
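pivot_table() aggregates with the mean by default; the aggfunc parameter changes that (my addition, using small fixed data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                   'C': ['bar', 'foo', 'bar', 'foo'],
                   'D': [1.0, 2.0, 3.0, 4.0]})

# Sum instead of the default mean for each (A, C) cell
t = pd.pivot_table(df, values='D', index=['A'], columns=['C'], aggfunc=np.sum)
print(t.loc['one', 'bar'])  # 1.0
print(t.loc['two', 'foo'])  # 4.0
```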

Time series

A few of pandas' time-series conversion utilities:

In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')

In [109]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [110]: ts.resample('5Min').sum()
Out[110]: 
2012-01-01    25083
Freq: 5T, dtype: int64
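With fixed instead of random data the binning is easier to see (my addition); any aggregation can follow the resample:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2012', periods=10, freq='S')
ts = pd.Series(np.arange(10), index=rng)

# Downsample ten 1-second ticks into 5-second bins and sum each bin
binned = ts.resample('5S').sum()
print(binned.tolist())  # [10, 35]
```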
  • Time zone representation
In [111]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')

In [112]: ts = pd.Series(np.random.randn(len(rng)), rng)

In [113]: ts
Out[113]: 
2012-03-06    0.464000
2012-03-07    0.227371
2012-03-08   -0.496922
2012-03-09    0.306389
2012-03-10   -2.290613
Freq: D, dtype: float64

In [114]: ts_utc = ts.tz_localize('UTC')

In [115]: ts_utc
Out[115]: 
2012-03-06 00:00:00+00:00    0.464000
2012-03-07 00:00:00+00:00    0.227371
2012-03-08 00:00:00+00:00   -0.496922
2012-03-09 00:00:00+00:00    0.306389
2012-03-10 00:00:00+00:00   -2.290613
Freq: D, dtype: float64
  • Converting to another time zone
In [116]: ts_utc.tz_convert('US/Eastern')
Out[116]: 
2012-03-05 19:00:00-05:00    0.464000
2012-03-06 19:00:00-05:00    0.227371
2012-03-07 19:00:00-05:00   -0.496922
2012-03-08 19:00:00-05:00    0.306389
2012-03-09 19:00:00-05:00   -2.290613
Freq: D, dtype: float64

Converting between time span representations:

In [117]: rng = pd.date_range('1/1/2012', periods=5, freq='M')

In [118]: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [119]: ts
Out[119]: 
2012-01-31   -1.134623
2012-02-29   -1.561819
2012-03-31   -0.260838
2012-04-30    0.281957
2012-05-31    1.523962
Freq: M, dtype: float64

In [120]: ps = ts.to_period()

In [121]: ps
Out[121]: 
2012-01   -1.134623
2012-02   -1.561819
2012-03   -0.260838
2012-04    0.281957
2012-05    1.523962
Freq: M, dtype: float64

In [122]: ps.to_timestamp()
Out[122]: 
2012-01-01   -1.134623
2012-02-01   -1.561819
2012-03-01   -0.260838
2012-04-01    0.281957
2012-05-01    1.523962
Freq: MS, dtype: float64

Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:

In [123]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')

In [124]: ts = pd.Series(np.random.randn(len(prng)), prng)

In [125]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

In [126]: ts.head()
Out[126]: 
1990-03-01 09:00   -0.902937
1990-06-01 09:00    0.068159
1990-09-01 09:00   -0.057873
1990-12-01 09:00   -0.368204
1991-03-01 09:00   -1.144073
Freq: H, dtype: float64

Categoricals

pandas provides this type to make working with categorical (discrete, text-labeled) data easier.

pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.

In [127]: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
   .....:                    "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
   .....: 
  • Convert the raw grades to a categorical dtype
In [128]: df["grade"] = df["raw_grade"].astype("category")

In [129]: df["grade"]
Out[129]: 
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
  • Rename the categories, then reorder and extend the full set of categories (this set_categories step, taken from the original guide, is what makes the empty 'bad' and 'medium' categories appear in the groupby output further down)
In [130]: df["grade"].cat.categories = ["very good", "good", "very bad"]

In [131]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium",
   .....:                                               "good", "very good"])
  • Sorting follows the order of the categories, not lexical order
In [133]: df.sort_values(by="grade")
Out[133]: 
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
  • Grouping by a categorical column also shows empty categories
In [134]: df.groupby("grade").size()
Out[134]: 
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
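Under the hood, each categorical value is stored as a small integer code into the categories array (my addition):

```python
import pandas as pd

grade = pd.Series(pd.Categorical(['a', 'b', 'b', 'a'], categories=['a', 'b', 'e']))

# Codes index into the categories list; -1 would mark a missing value
print(grade.cat.codes.tolist())    # [0, 1, 1, 0]
print(list(grade.cat.categories))  # ['a', 'b', 'e']
```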

Plotting

See the Plotting docs.

In [135]: ts = pd.Series(np.random.randn(1000),
   .....:                index=pd.date_range('1/1/2000', periods=1000))
   .....: 

In [136]: ts = ts.cumsum()

In [137]: ts.plot()
Out[137]: <matplotlib.axes._subplots.AxesSubplot at 0x7f45409e1690>

[Figure: line chart of the cumulative-sum Series produced by ts.plot()]

  • On a DataFrame, the plot() method plots all of the columns with labels:
In [138]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
   .....:                   columns=['A', 'B', 'C', 'D'])
   .....: 

In [139]: df = df.cumsum()

In [140]: plt.figure()
Out[140]: <Figure size 640x480 with 0 Axes>

In [141]: df.plot()
Out[141]: <matplotlib.axes._subplots.AxesSubplot at 0x7f453cb4dc50>

In [142]: plt.legend(loc='best')
Out[142]: <matplotlib.legend.Legend at 0x7f453cacfc90>

[Figure: line chart of all four cumulative-sum columns A–D, with a legend]

Getting data in/out

CSV

Writing to a csv file.

In [143]: df.to_csv('foo.csv')

Reading from a csv file.

In [144]: pd.read_csv('foo.csv')
Out[144]: 
     Unnamed: 0          A          B         C          D
0    2000-01-01   0.266457  -0.399641 -0.219582   1.186860
1    2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2    2000-01-03  -1.734933   0.530468  2.060811  -0.515536
3    2000-01-04  -1.555121   1.452620  0.239859  -1.156896
4    2000-01-05   0.578117   0.511371  0.103552  -2.428202
..          ...        ...        ...       ...        ...
995  2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
996  2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
997  2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
998  2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
999  2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 5 columns]
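The Unnamed: 0 column above is just the index that to_csv() wrote out; passing index_col=0 on the way back restores it (my addition, using a tiny frame and the hypothetical file name foo_small.csv so the round trip is quick):

```python
import pandas as pd

df_small = pd.DataFrame({'A': [1.0, 2.0]},
                        index=pd.date_range('2000-01-01', periods=2))
df_small.to_csv('foo_small.csv')

# index_col=0 turns the first CSV column back into the index
back = pd.read_csv('foo_small.csv', index_col=0, parse_dates=True)
print('Unnamed: 0' in back.columns)  # False
print(back['A'].tolist())            # [1.0, 2.0]
```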

HDF5

Reading and writing to HDFStores.

Writing to a HDF5 Store.

In [145]: df.to_hdf('foo.h5', 'df')
Reading from a HDF5 Store.

In [146]: pd.read_hdf('foo.h5', 'df')
Out[146]: 
                    A          B         C          D
2000-01-01   0.266457  -0.399641 -0.219582   1.186860
2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2000-01-03  -1.734933   0.530468  2.060811  -0.515536
2000-01-04  -1.555121   1.452620  0.239859  -1.156896
2000-01-05   0.578117   0.511371  0.103552  -2.428202
...               ...        ...       ...        ...
2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 4 columns]

Excel

Reading and writing to MS Excel.

Writing to an excel file.

In [147]: df.to_excel('foo.xlsx', sheet_name='Sheet1')
Reading from an excel file.

In [148]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[148]: 
    Unnamed: 0          A          B         C          D
0   2000-01-01   0.266457  -0.399641 -0.219582   1.186860
1   2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2   2000-01-03  -1.734933   0.530468  2.060811  -0.515536
3   2000-01-04  -1.555121   1.452620  0.239859  -1.156896
4   2000-01-05   0.578117   0.511371  0.103552  -2.428202
..         ...        ...        ...       ...        ...
995 2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
996 2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
997 2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
998 2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
999 2002-09-26 -11.856774 -10.671012 -3.216025  29.369368

[1000 rows x 5 columns]

Gotchas

If you are attempting to perform an operation you might see an exception like:

>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
    ...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

See Comparisons for an explanation and what to do.

See Gotchas as well.
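The error message itself suggests the fix: state explicitly what "truthy" should mean for the whole Series (a quick sketch, my addition):

```python
import pandas as pd

s = pd.Series([False, True, False])

# Reduce the Series to a single boolean explicitly
print(s.any())   # True  -- at least one element is True
print(s.all())   # False -- not every element is True
print(s.empty)   # False -- the Series does contain elements
```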


Postscript:
I have to admit this ended a bit anticlimactically; I really can't go on. So much for "10 minutes": why did it take me a whole day? Time to rest 😵
