Python | A Summary of the Twelve Most Commonly Used Pandas DataFrame Methods

Original

2020-06-29 12:09

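In data science, pandas is the most commonly used Python library, and DataFrame is the core data structure of pandas. After using it for a while, I found that only a handful of DataFrame methods come up all the time; master them and you can handle most everyday needs.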

1. Create a pandas DataFrame

If the data already exists as lists, the most common approach is to pass in a dictionary directly, for example:

import pandas as pd
name_lst = ['John','Mike']
age_lst = [12,30]
city_lst = ['New York City','Paris']
df = pd.DataFrame({'name':name_lst,'age':age_lst,'city':city_lst})

If not, you can create an empty DataFrame and add rows to it afterwards (see section 6 below). The columns parameter defines the column names and is optional.

df = pd.DataFrame(columns=['name','age','city'])
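When rows arrive one at a time, a common alternative to growing an empty DataFrame is to collect them in a plain list and build the DataFrame once at the end. A minimal sketch, with made-up records standing in for real data:

```python
import pandas as pd

# Hypothetical records, e.g. gathered one at a time inside a loop.
rows = []
for name, age, city in [('John', 12, 'New York City'), ('Mike', 30, 'Paris')]:
    rows.append({'name': name, 'age': age, 'city': city})

# Build the DataFrame once, after all rows have been collected.
df = pd.DataFrame(rows, columns=['name', 'age', 'city'])
print(df)
```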

2. DataFrame.head()

The most basic method, and also the one used most often. It is especially useful when you have a large DataFrame and just want to glance over the first few rows to make sure everything looks right.

df.head(5)  # returns the first 5 rows

3. DataFrame.columns

Technically this is not a method, but an attribute of pandas DataFrame. This comes in handy when the DataFrame you are dealing with has numerous columns and you would like to find the name of a certain column.

df.columns
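As a rough illustration of looking up a column by name, df.columns can be turned into a plain list or searched with a list comprehension. The substring 'ge' below is just an example search term:

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Mike'],
                   'age': [12, 30],
                   'city': ['New York City', 'Paris']})

# df.columns is an Index object; tolist() gives an ordinary Python list.
all_columns = df.columns.tolist()

# Find every column whose name contains a given substring.
matches = [col for col in df.columns if 'ge' in col]
print(all_columns, matches)
```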

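4. DataFrame.loc[] & DataFrame.iloc[]

These are the two ways to select from a DataFrame. Both are very flexible, and the documentation covers them with detailed examples. Strictly speaking they are indexers rather than methods, so they are used with square brackets. The difference: loc selects by label (row index labels and column names) and by condition (that is, boolean arrays), while iloc selects by integer position.
For example:

df.loc[df['age']<18]   # select the minors
df.iloc[:5]            # select the first five rows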

loc is also the way pandas recommends for setting values, because it helps avoid the chained assignment problem. Take the following example:

df[df['age']<18]['age']='underage'   # chained assignment problem

The code above does not change the values in the original df, because df[df['age']<18] is only a copy of the original DataFrame. Running it therefore produces the following warning:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
The correct approach is to use loc:

df.loc[df['age']<18, 'age']='underage'   # correct
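The label-versus-position distinction is easiest to see on a DataFrame whose index is not the default 0, 1, 2. The sketch below assumes a made-up index of 'a', 'b', 'c'; it also shows that a loc slice includes its end label while an iloc slice excludes its end position.

```python
import pandas as pd

# Hypothetical data with string index labels instead of 0, 1, 2.
df = pd.DataFrame({'name': ['John', 'Mike', 'Kelly'],
                   'age': [12, 30, 42]},
                  index=['a', 'b', 'c'])

by_label = df.loc['b', 'age']        # row label 'b', column label 'age'
by_position = df.iloc[1, 1]          # second row, second column

# A boolean condition picks the rows, a label picks the column.
minors = df.loc[df['age'] < 18, 'name']

label_slice = df.loc['a':'b']        # end label is included
position_slice = df.iloc[0:1]        # end position is excluded
print(by_label, by_position, len(label_slice), len(position_slice))
```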

5. DataFrame.iterrows()

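Use this method to iterate over a DataFrame row by row. It works much like items() on a Python dictionary: each iteration yields the row's index label and the row itself as a Series.

for index, row in df.iterrows():
    print(index, row['name'], row['age'])   # row is a pandas Series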
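A small sketch of what the loop body might do, using the example data from section 1. The itertuples() variant at the end is an addition to the original text; it yields named tuples and is usually faster when per-row Series objects are not needed.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Mike'],
                   'age': [12, 30],
                   'city': ['New York City', 'Paris']})

# iterrows() yields (index label, row as a Series) pairs.
descriptions = []
for index, row in df.iterrows():
    descriptions.append(f"{index}: {row['name']} ({row['age']})")

# itertuples() yields one named tuple per row; fields are read as attributes.
names = [record.name for record in df.itertuples(index=False)]
print(descriptions, names)
```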

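6. DataFrame.append()

Append another DataFrame under the original DataFrame and return a new DataFrame object. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions, pd.concat() does the same job.

# pandas < 2.0 only: df_new = df.append(pd.DataFrame({'name':['Kelly'],'age':[42],'city':['Beijing']}))
df_new = pd.concat([df, pd.DataFrame({'name':['Kelly'],'age':[42],'city':['Beijing']})])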
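When the combined result should be renumbered from 0, pd.concat accepts ignore_index=True. A minimal sketch, assuming pandas 2.0 or later, where DataFrame.append is no longer available:

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Mike'],
                   'age': [12, 30],
                   'city': ['New York City', 'Paris']})
new_rows = pd.DataFrame({'name': ['Kelly'], 'age': [42], 'city': ['Beijing']})

# concat returns a new DataFrame; df itself is left unchanged.
# ignore_index=True renumbers the result 0..n-1 instead of keeping
# each input's own index labels.
df_new = pd.concat([df, new_rows], ignore_index=True)
print(df_new)
```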

7. pd.merge()

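Equivalent to a SQL join, with inner, outer, left, and right options. Suppose we have a customer table fct_customer and an order table fct_order that share the column customer_id, and we want only the customers who have placed an order. An inner join does that (the result has one row per matching order, so a customer with several orders appears several times):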

old_customer = pd.merge(fct_customer, fct_order, how='inner', on='customer_id')

It does not matter if the two tables name the customer ID column differently: instead of on, pass the left_on and right_on parameters separately.
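The left_on / right_on case can be sketched as follows. The table contents and the buyer_id column name are invented for illustration; only fct_customer, fct_order, and customer_id come from the text above.

```python
import pandas as pd

# Hypothetical tables: the order table calls the customer ID "buyer_id".
fct_customer = pd.DataFrame({'customer_id': [1, 2, 3],
                             'name': ['John', 'Mike', 'Kelly']})
fct_order = pd.DataFrame({'order_id': [100, 101, 102],
                          'buyer_id': [1, 1, 3],
                          'amount': [9.5, 20.0, 5.0]})

# Inner join: only customers with at least one order, one row per order.
old_customer = pd.merge(fct_customer, fct_order, how='inner',
                        left_on='customer_id', right_on='buyer_id')

# Left join: every customer is kept; order columns are NaN when there is no order.
all_customer = pd.merge(fct_customer, fct_order, how='left',
                        left_on='customer_id', right_on='buyer_id')
print(old_customer)
print(all_customer)
```

Both key columns (customer_id and buyer_id) appear in the result when left_on and right_on are used, so one of them can be dropped afterwards if it is redundant.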

8. DataFrame.groupby()

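Equivalent to SQL GROUP BY, with a number of aggregate functions available; the most common are mean(), sum(), and count(). For example, to see the average age of the people in each city in df:

df.groupby('city')['age'].mean()   # select the numeric column before aggregating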
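A small sketch of grouping by city on made-up data. Note that groupby takes the grouping column as its first argument (by), and that agg() can compute several aggregates in one pass.

```python
import pandas as pd

# Hypothetical data: two people per city.
df = pd.DataFrame({'name': ['John', 'Mike', 'Kelly', 'Anna'],
                   'age': [12, 30, 42, 20],
                   'city': ['Paris', 'Paris', 'Beijing', 'Beijing']})

# Average age per city; the result is a Series indexed by city.
mean_age = df.groupby('city')['age'].mean()

# Several aggregates at once; the result is a DataFrame with one column each.
summary = df.groupby('city')['age'].agg(['mean', 'count'])
print(mean_age)
print(summary)
```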

9. DataFrame.fillna() & DataFrame.dropna()

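Two very practical methods that work well as a first step after reading in a DataFrame, since NaN values will sooner or later cause errors in some operations. fillna() replaces missing values with a given value, while dropna() removes the rows (or columns) that contain them. Note that by default both return a new DataFrame; pass inplace=True if you want to modify the original.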

df.fillna(0, inplace=True)
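The two methods side by side, as a sketch on invented data with one missing age and one missing city. Passing a dictionary to fillna() sets a different fill value per column, and the subset parameter of dropna() limits which columns are checked.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values.
df = pd.DataFrame({'name': ['John', 'Mike', 'Kelly'],
                   'age': [12, np.nan, 42],
                   'city': ['New York City', 'Paris', None]})

# Fill each column with its own replacement value.
filled = df.fillna({'age': 0, 'city': 'unknown'})

# Drop every row that has a missing value in any column.
dropped = df.dropna()

# Drop only the rows where 'age' is missing.
dropped_age = df.dropna(subset=['age'])
print(filled)
print(dropped)
```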

10. DataFrame.drop_duplicates()

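Somewhat like the DISTINCT keyword in SQL. You can choose a subset of columns and drop the rows that repeat the same values in them; by default the first occurrence is kept.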

df.drop_duplicates(subset=['city','age'], inplace=True)
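To make the effect concrete, here is a sketch on made-up data where two rows share the same city and age. The keep parameter decides which of the duplicated rows survives.

```python
import pandas as pd

# Hypothetical data: John and Mike have the same city and age.
df = pd.DataFrame({'name': ['John', 'Mike', 'Kelly'],
                   'age': [30, 30, 42],
                   'city': ['Paris', 'Paris', 'Beijing']})

# Default keep='first': the first row of each duplicate group is kept.
deduped = df.drop_duplicates(subset=['city', 'age'])

# keep='last' keeps the last row of each duplicate group instead.
deduped_last = df.drop_duplicates(subset=['city', 'age'], keep='last')
print(deduped)
print(deduped_last)
```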

11. DataFrame.sort_values()

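Return the sorted DataFrame. Commonly used parameters:

by: the column name or the list of column names to sort by
ascending: sort from smallest to largest (True, the default) or from largest to smallest (False)
axis: 0 (the default) sorts the rows, 1 sorts the columns
inplace: modify the original DataFrame or return a new DataFrame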

df_sorted = df.sort_values(by=['city','age'], ascending=False)
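When sorting by several columns, ascending can also be a list with one flag per column. A sketch on invented data, sorting city from A to Z and, within each city, age from oldest to youngest:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({'name': ['John', 'Mike', 'Kelly'],
                   'age': [12, 30, 42],
                   'city': ['Paris', 'Beijing', 'Paris']})

# One ascending flag per column listed in `by`.
df_sorted = df.sort_values(by=['city', 'age'], ascending=[True, False])
print(df_sorted)
```

The rows keep their original index labels after sorting, which is why the index of df_sorted is no longer in order.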

12. DataFrame.reset_index()

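An unremarkable method that you end up using all the time. Methods such as drop_duplicates() and sort_values(), which delete rows or change their order, keep the original index labels by default, so after a sort the index ends up out of order. Resetting the index is very useful at that point.
Note that by default the method keeps the old index as a column in the DataFrame after the reset; set drop=True if you do not want it.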

df.reset_index(drop=True, inplace=True)
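The difference between the default behaviour and drop=True can be sketched like this, on made-up data that is sorted first so the index ends up out of order:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({'name': ['John', 'Mike', 'Kelly'],
                   'age': [30, 12, 42]})

# Sorting reorders the rows but keeps their original index labels.
df_sorted = df.sort_values(by='age')

# drop=True discards the old index and renumbers from 0.
renumbered = df_sorted.reset_index(drop=True)

# Default (drop=False): the old index is kept as a column named 'index'.
kept = df_sorted.reset_index()
print(df_sorted.index.tolist(), renumbered.index.tolist(), kept.columns.tolist())
```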
