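Task02: Indexing (3 days)

Single-level indexing

The loc method, the iloc method, and the [] operator

These three are probably the most commonly used ways to index: iloc selects by position, loc selects by label, and [] is a convenient shorthand. Each has its own characteristics. In one sentence: use loc for rows, [] for columns, and iloc for positions.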

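The loc method

When loc applies: loc selects by the labels of the index and the columns. With a single argument it selects rows only.

loc lets you select rows and columns by index label. Note that every slice used inside loc includes its right endpoint. The reason is that a pandas user usually does not care which label comes after the last one they want; with a half-open interval they would first have to look up the name of the next row or column, which is inconvenient. pandas therefore makes loc slices closed on both ends.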
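The closed-interval rule is easiest to see next to a positional slice. The following is a minimal sketch on a made-up table (df_demo, its IDs and its values are assumptions for illustration, not the tutorial's data):

```python
import pandas as pd

# Hypothetical toy table; IDs and values are invented for this sketch.
df_demo = pd.DataFrame(
    {"Height": [173, 192, 186, 167], "Math": [34.0, 32.5, 87.2, 80.4]},
    index=pd.Index([1101, 1102, 1103, 1104], name="ID"),
)

# Label slice: both endpoints, 1102 and 1103, are returned.
by_label = df_demo.loc[1102:1103]

# Position slice: position 3 is excluded, so positions 1 and 2 are returned.
by_position = df_demo.iloc[1:3]

# Column label slices are closed on both ends as well.
col_labels = df_demo.loc[:, "Height":"Math"].columns.tolist()

print(by_label.index.tolist())
print(by_position.index.tolist())
print(col_labels)
```

Both row selections return the same two rows here, but the loc slice names its last row while the iloc slice names the position after it.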

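Each side of the comma in loc can be a single label, a list of labels, a slice, a boolean list, or a function.
(1) Single-row selection

df.loc[1103] # select the row labelled 1103

(2) Multi-row selection

df.loc[[1102,2304]] # select the rows labelled 1102 and 2304
df.loc[1304:2103].head() # closed on both ends, unlike an ordinary Python slice
df.loc[2402::-1].head() # a step of -1 walks the rows in reverse order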

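(3) Single-column selection

df.loc[:,'Height'].head() # select the 'Height' column

(4) Multi-column selection

df.loc[:,['Height','Math']].head() # select the 'Height' and 'Math' columns
df.loc[:,'Height':'Math'].head() # contiguous columns (a closed interval)

(5) Combined row and column selection

df.loc[1102:2401:3,'Height':'Math'] # take every third row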

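(6) Selection with a function
Passing a function is just another way of supplying a list or a scalar. Because you control how that result is produced, this is far more flexible than hard-coding the loc argument.

df.loc[lambda x:x['Gender']=='M'].head() # select the rows where 'Gender' is 'M'
# the function passed to loc receives the df that loc is called on

# This example shows that loc accepts a function whose input is the whole table and whose output is a scalar, a slice, a valid list (every element appears in the index), or a valid index
def f(x):
    return [1101, 1103]
df.loc[f] # select the rows labelled 1101 and 1103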

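(7) Boolean selection

df.loc[df['Address'].isin(['street_7','street_4'])].head()
# select every row whose 'Address' is 'street_7' or 'street_4'

# pass a boolean list
# i[-1] is the street number
# same result as above
df.loc[[
    i[-1] == '4' or i[-1] == '7'
    for i in df['Address'].values
]].head()

Note: in essence, loc accepts only boolean lists and lists made up of a subset of the index. Keep that in mind and the operations above are easy to understand.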

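The iloc method

When iloc applies: iloc selects purely by integer position, so it works whatever the dtype of the index is.

Where loc selects by the values of the index, iloc selects by position. iloc does not care what the index labels are, only where the rows and columns sit, so only integers (or integer slices and lists) may be used inside its brackets.

Note that, unlike loc, the right endpoint of a slice is excluded.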
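To illustrate that iloc looks only at positions, here is a small sketch on a Series whose index holds strings (s_demo is a hypothetical example, not part of the tutorial data):

```python
import pandas as pd

# Hypothetical Series with a string index, to show that iloc ignores labels.
s_demo = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

first_two = s_demo.iloc[0:2]      # positions 0 and 1; position 2 is excluded
through_c = s_demo.loc["a":"c"]   # labels 'a' through 'c'; 'c' is included

print(first_two.tolist())
print(through_c.tolist())
```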

(1) Single-row selection

df.iloc[3] # select the fourth row
# same result as the loc call below
df.loc[1104] # 1104 is the index label

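(2) Multi-row selection

df.iloc[3:5]

(3) Single-column selection

df.iloc[:,3].head()

(4) Multi-column selection

df.iloc[:, 7::-2].head()  # reverse order, taking every second column

(5) Combined row and column selection

df.iloc[3::4,7::-2].head() # every fourth row, every second column (reversed)

(6) Selection with a function

df.iloc[lambda x:[3]].head() # select the fourth row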

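Note: iloc accepts only integers, integer lists, integer slices, or boolean lists. A boolean Series is not accepted; to use one, extract its values first, as below.

#df.iloc[df['School']=='S_1'].head() # raises an error
df.iloc[(df['School']=='S_1').values].head()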
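The restriction on boolean Series can be checked on a small made-up frame. This is a sketch only: df_demo is hypothetical, and the exact exception type raised by iloc depends on the pandas version, so the code catches a generic Exception.

```python
import pandas as pd

# Hypothetical toy frame; names and values are invented for this sketch.
df_demo = pd.DataFrame(
    {"School": ["S_1", "S_2", "S_1"], "Math": [34.0, 32.5, 87.2]},
    index=[1101, 1102, 1103],
)
mask = df_demo["School"] == "S_1"   # boolean Series that still carries the labels

try:
    df_demo.iloc[mask]
    print("boolean Series accepted")
except Exception as exc:            # exception type varies across pandas versions
    print("rejected:", type(exc).__name__)

# Stripping the labels leaves a plain boolean array, which iloc accepts.
picked = df_demo.iloc[mask.to_numpy()]
print(picked.index.tolist())
```

to_numpy() is used here in place of .values; both return the underlying array.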

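The [] operator

[] on a Series

(1) Single-element selection

s = pd.Series(df['Math'],index=df.index)
s[1101]
# uses the index label

(2) Multi-row selection

s[0:4]
# an integer slice uses absolute positions, regardless of the labels; this is easy to confuse with the case above

(3) Selection with a function

s[lambda x: x.index[16::-6]]
# Note: a bare slice inside the lambda, as in s[lambda x: 16::-6], is an error. Here the function returns index labels, so the selection is by label rather than by absolute position, which is an easy mistake to make

(4) Boolean selection

s[s>80]

Note: to stay out of trouble, do not use the [] operator when the row index is floating point. On a Series, a float slice inside [] is compared against the label values, not against positions, which is a special case.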
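The float-index caveat can be made concrete with a short sketch. s_float is a hypothetical Series; writing .loc or .iloc explicitly is one way to avoid the ambiguity.

```python
import pandas as pd

# Hypothetical Series with a floating-point index.
s_float = pd.Series([10, 20, 30, 40], index=[3.0, 4.0, 5.0, 6.0])

by_value = s_float[3.5:5.0]        # float slice in []: compared against labels
explicit = s_float.loc[3.5:5.0]    # same selection with the intent spelled out
by_position = s_float.iloc[3:5]    # positions 3 and 4 (only position 3 exists)

print(by_value.tolist())
print(explicit.tolist())
print(by_position.tolist())
```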

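[] on a DataFrame

(1) Single-row selection

df[1:2]
# it is easy to write df['label'] here by mistake; with a row label that raises an error, because a scalar inside [] is looked up among the columns
# as with a Series, an integer slice uses absolute positions

# to fetch the row with a particular label, use get_loc as follows:
row = df.index.get_loc(1102)
df[row:row+1]

(2) Multi-row selection

# use a slice; to pick specific rows, prefer loc, otherwise an error is likely
df[3:5]

(3) Single-column selection

df['School'].head()

(4) Multi-column selection

df[['School','Math']].head()

(5) Selection with a function

df[lambda x:['Math','Physics']].head()

(6) Boolean selection

df[df['Gender']=='F'].head()

Note: in general, use the [] operator for column selection or boolean selection, and avoid using it to select rows.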

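Boolean indexing

(1) Boolean operators '&', '|', '~': and, or, and not, respectively

df[(df['Gender']=='F')&(df['Address']=='street_2')].head()
# select every row where 'Gender' is 'F' and 'Address' is 'street_2'

# select every row where 'Math' > 85 or 'Address' is 'street_7' (either condition is enough)
df[(df['Math']>85)|(df['Address']=='street_7')].head()

# negation
df[~((df['Math']>75)|(df['Address']=='street_1'))].head()

(2) Both loc and [] accept a boolean list in the corresponding position:

df.loc[df['Math'] > 60, df.columns == 'Physics'].head()

Question: why does df.loc[df['Math']>60,(df[:8]['Address']=='street_6').values].head() give the same result as above? Can .values be dropped?

df.loc[df['Math']>60,(df[:8]['Address']=='street_6').values].head()

Answer: the expression yields a boolean list of length 8 in which only the last entry is True, so, as above, only the last column is selected. .values cannot be dropped: without it the result is a boolean Series labelled by row ID, which loc cannot align with the column labels.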
(3) The isin method

df[df['Address'].isin(['street_1', 'street_4'])
   & df['Physics'].isin(['A', 'A+'])]

# the same filter can be written with a dict:
df[df[['Address', 'Physics']].isin({
    'Address': ['street_1', 'street_4'],
    'Physics': ['A', 'A+']
}).all(axis=1)]
# all plays the same role as &; axis=1 checks across the columns whether every value in a row is True
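That the two isin forms select the same rows can be verified on a small made-up frame (df_demo and its values are assumptions for this sketch):

```python
import pandas as pd

# Hypothetical toy frame; values are invented for this sketch.
df_demo = pd.DataFrame({
    "Address": ["street_1", "street_4", "street_1", "street_2"],
    "Physics": ["A+", "B", "C", "A"],
})

# Form 1: one isin per column, combined with &.
with_and = df_demo[df_demo["Address"].isin(["street_1", "street_4"])
                   & df_demo["Physics"].isin(["A", "A+"])]

# Form 2: a dict passed to DataFrame.isin, reduced row by row with all(axis=1).
with_dict = df_demo[df_demo[["Address", "Physics"]].isin({
    "Address": ["street_1", "street_4"],
    "Physics": ["A", "A+"],
}).all(axis=1)]

print(with_and.index.tolist())
print(with_and.equals(with_dict))
```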

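Fast scalar access

When only a single element is needed, the at and iat methods provide a faster implementation:

display(df.at[1101,'School'])
display(df.loc[1101,'School'])
display(df.iat[0,0])
display(df.iloc[0,0])

Their timings can be compared as follows (%timeit is an IPython/Jupyter magic):

%timeit df.at[1101,'School']
%timeit df.loc[1101,'School']
%timeit df.iat[0,0]
%timeit df.iloc[0,0]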
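Outside a notebook, the same comparison can be sketched with the standard-library timeit module. df_demo is a hypothetical frame, and no timings are quoted because they depend on the machine and the pandas version.

```python
import timeit
import pandas as pd

# Hypothetical toy frame; values are invented for this sketch.
df_demo = pd.DataFrame(
    {"School": ["S_1", "S_2"], "Math": [34.0, 32.5]},
    index=[1101, 1102],
)

# All four accessors return the same scalar.
values = [
    df_demo.at[1101, "School"],
    df_demo.loc[1101, "School"],
    df_demo.iat[0, 0],
    df_demo.iloc[0, 0],
]
print(values)

# Time 1000 calls of each accessor; results vary from run to run.
accessors = {
    "at": lambda: df_demo.at[1101, "School"],
    "loc": lambda: df_demo.loc[1101, "School"],
    "iat": lambda: df_demo.iat[0, 0],
    "iloc": lambda: df_demo.iloc[0, 0],
}
for name, func in accessors.items():
    seconds = timeit.timeit(func, number=1000)
    print(f"{name:>4}: {seconds:.4f} s per 1000 calls")
```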

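Interval indexing

(1) Using the interval_range function

pd.interval_range(start=0,end=5)
# closed can be 'left', 'right', 'both', or 'neither'; the default is open on the left and closed on the right

pd.interval_range(start=0,periods=8,freq=5)
# periods sets the number of intervals and freq sets the step

(2) Using cut to turn a numeric column into a categorical variable whose elements are intervals, for example to bin the Math scores:

math_interval = pd.cut(df['Math'], bins=[0, 40, 60, 80, 100])
# note that without a type conversion this is a category dtype, not an interval dtype
math_interval.head()

(3) Selecting from an interval index

df_i = df.join(math_interval,rsuffix='_interval')[['Math','Math_interval']]\
            .reset_index().set_index('Math_interval')
df_i.head()

df_i.loc[65].head()
# every row whose interval contains the value is selected

df_i.loc[[65,90]]

To select by an interval, first convert the categorical index to an interval index, then use the overlaps method:

#df_i.loc[pd.Interval(70,75)].head() raises an error
df_i[df_i.index.astype('interval').overlaps(pd.Interval(70, 85))].head()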
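The lookup rules for an interval index can be seen in isolation with a small sketch. It builds the index directly with pd.IntervalIndex.from_breaks instead of going through cut, and the grade labels are invented for illustration.

```python
import pandas as pd

# Right-closed bins (0, 40], (40, 60], (60, 80], (80, 100].
bins = pd.IntervalIndex.from_breaks([0, 40, 60, 80, 100])

# Hypothetical labels, one per bin.
grades = pd.Series(["fail", "pass", "good", "excellent"], index=bins)

one = grades.loc[65]          # 65 lies in (60, 80]
two = grades.loc[[65, 90]]    # one lookup per value

# overlaps() marks every bin that shares points with the interval (70, 85].
hits = bins.overlaps(pd.Interval(70, 85))

print(one)
print(two.tolist())
print(hits.tolist())
```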

Multi-level indexing

Creating a multi-level index

Using from_tuples or from_arrays

(1) Writing the tuples directly

tuples = [('A','a'),('A','b'),('B','a'),('B','b')]
mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
mul_index

pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)

(2) Building the tuples with zip

L1 = list('AABB')
L2 = list('abab')
tuples = list(zip(L1,L2))
mul_index = pd.MultiIndex.from_tuples(tuples, names=('Upper', 'Lower'))
pd.DataFrame({'Score':['perfect','good','fair','bad']},index=mul_index)

(3) Building the tuples from an array (a list of lists)

arrays = [['A', 'a'], ['A', 'b'], ['B', 'a'], ['B', 'b']]
# each inner list is treated as one tuple, so from_tuples is used; from_arrays would instead expect one list per level
mul_index = pd.MultiIndex.from_tuples(arrays, names=('Upper', 'Lower'))
pd.DataFrame({'Score': ['perfect', 'good', 'fair', 'bad']}, index=mul_index)

Using from_product

L1 = ['A','B']
L2 = ['a','b']
mul_index = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
# Cartesian product of the two lists
pd.DataFrame({'Score': ['perfect', 'good', 'fair', 'bad']}, index=mul_index)
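As a quick check that the constructors above describe the same index, the sketch below builds it three ways and compares the results. The variable names are chosen for this example only.

```python
import pandas as pd

names = ("Upper", "Lower")

# One tuple per row.
from_tuples = pd.MultiIndex.from_tuples(
    [("A", "a"), ("A", "b"), ("B", "a"), ("B", "b")], names=names)

# One list per level.
from_arrays = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B"], ["a", "b", "a", "b"]], names=names)

# Cartesian product of the level values.
from_product = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=names)

print(from_tuples.equals(from_arrays))
print(from_tuples.equals(from_product))
```

from_product is the shortest form when every combination is wanted; from_tuples and from_arrays also cover indexes where some combinations are missing.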

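Creating from columns of df (the set_index method)

df_using_mul = df.set_index(['Class','Address'])
df_using_mul.head()

Slicing a multi-level index

(1) Ordinary slicing

#df_using_mul.loc['C_2','street_5']
# when the index is not sorted, selecting a single key emits a performance warning
#df_using_mul.index.is_lexsorted()
# this method checks whether the index is sorted (it was deprecated in pandas 1.3 and removed in 2.0; index.is_monotonic_increasing is the replacement)
df_using_mul.sort_index().loc['C_2','street_5']
#df_using_mul.sort_index().index.is_lexsorted()

#df_using_mul.loc[('C_2','street_5'):] raises an error
# without sorting, multi-level slices cannot be used
df_using_mul.sort_index().loc[('C_2','street_6'):('C_3','street_4')]
# because loc is used, the right endpoint is still included

df_using_mul.sort_index().loc[('C_2','street_7'):'C_3'].head()
# a non-tuple endpoint is also valid and stands for every element under that label

(2) First special case: a list of tuples

df_using_mul.sort_index().loc[[('C_2','street_7'),('C_3','street_2')]]
# selects specific elements, pinned down to the innermost level

(3) Second special case: a tuple of lists

df_using_mul.sort_index().loc[(['C_2','C_3'],['street_4','street_7']),:]
# selects the rows whose first level is 'C_2' or 'C_3' and whose second level is 'street_4' or 'street_7'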

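Note the difference between the two: a list of tuples picks out exact (first level, second level) pairs, whereas a tuple of lists selects every combination of the listed first-level and second-level labels that exists in the index.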
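The contrast between the two special cases shows up clearly on a small sorted index. The frame below is a made-up stand-in (df_demo, with invented Class and Address labels), not the tutorial's table.

```python
import pandas as pd

# Hypothetical sorted two-level index: 2 classes x 3 streets.
index = pd.MultiIndex.from_product(
    [["C_1", "C_2"], ["street_1", "street_2", "street_3"]],
    names=("Class", "Address"),
)
df_demo = pd.DataFrame({"Math": range(6)}, index=index)

# List of tuples: exactly these two (Class, Address) pairs.
pairs = df_demo.loc[[("C_1", "street_1"), ("C_2", "street_3")]]

# Tuple of lists: every combination of the listed classes and streets.
product = df_demo.loc[(["C_1", "C_2"], ["street_1", "street_3"]), :]

print(pairs["Math"].tolist())
print(product["Math"].tolist())
```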

slice objects in a multi-level index

L1,L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
L3,L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
df_s = pd.DataFrame(np.random.rand(9,9),index=mul_index1,columns=mul_index2)
df_s

IndexSlice is very flexible to use:

idx = pd.IndexSlice # helper object that builds the slice tuples
df_s.loc[idx['B':,df_s['D']['d']>0.3],idx[df_s.sum()>4]]
# df_s.sum() sums each column by default, so it returns 9 values, one per column
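Because the frame above is filled with random numbers, its output changes on every run. The sketch below uses a smaller, deterministic frame (df_small is a hypothetical name) to show how pd.IndexSlice addresses one level at a time.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=("Upper", "Lower"))
cols = pd.MultiIndex.from_product([["D", "E"], ["d", "e"]], names=("Big", "Small"))

# Deterministic 4x4 frame holding the values 0..15.
df_small = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

idx = pd.IndexSlice

# Rows: 'Upper' from 'B' onward, any 'Lower'. Columns: any 'Big', 'Small' equal to 'e'.
picked = df_small.loc[idx["B":, :], idx[:, "e"]]

print(picked.to_numpy().tolist())
```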

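Swapping index levels

The swaplevel method (swaps two levels)

df_using_mul.swaplevel(i=1,j=0,axis=0).sort_index().head()

# reorder_levels rearranges several levels at once; if the index levels have names, the names can be used directly
# df_muls is not defined earlier in these notes; the line below is an assumed definition inferred from the level names
df_muls = df.set_index(['School','Class','Address'])
df_muls.reorder_levels(['Address','School','Class'],axis=0).sort_index().head()

Passing level names has the same effect as passing integer level positions.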

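Setting the index

The index_col parameter

index_col is a parameter of read_csv, not a method:

pd.read_csv('data/table.csv',index_col=['Address','School']).head()

reindex and reindex_like

(1) reindex re-indexes the data. Its key property is index alignment, and it is often used to reorder rows or columns

df.reindex(index=[1101,1203,1206,2402])

df.reindex(columns=['Height','Gender','Average']).head()

(2) Missing values can be filled with fill_value or method (bfill/ffill/nearest); method requires a monotonic index

df.reindex(index=[1101,1203,1206,2402],method='bfill')
# bfill fills the new label 1206 from the next valid row, ffill from the previous valid row, and nearest from the closest label

df.reindex(index=[1101,1203,1206,2402],method='nearest')
# 1205 is numerically closer to 1206 than 1301 is, so the former is used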
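The fill options are easier to compare on a tiny Series where the result can be worked out by hand. s_demo and the new labels are made up for this sketch.

```python
import pandas as pd

# Hypothetical Series with a monotonic integer index.
s_demo = pd.Series([1.0, 2.0, 3.0], index=[10, 20, 30])
new_labels = [10, 12, 27, 35]

plain = s_demo.reindex(new_labels)                      # unmatched labels become NaN
filled = s_demo.reindex(new_labels, fill_value=0.0)     # unmatched labels become 0.0
ffill = s_demo.reindex(new_labels, method="ffill")      # take the previous existing label
bfill = s_demo.reindex(new_labels, method="bfill")      # take the next existing label
nearest = s_demo.reindex(new_labels, method="nearest")  # take the closest existing label

print(plain.tolist())
print(filled.tolist())
print(ffill.tolist())
print(bfill.tolist())
print(nearest.tolist())
```

With bfill the label 35 has no later label to borrow from, so it stays NaN.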

(3) reindex_like produces a DataFrame whose row and column indexes match those of its argument exactly, with the data taken from the table it is called on

df_temp = pd.DataFrame({
    'Weight': np.zeros(5),
    'Height': np.zeros(5),
    'ID': [1101, 1104, 1103, 1106, 1102]
}).set_index('ID')
df_temp.reindex_like(df[0:5][['Weight', 'Height']])

(4) If df_temp has a monotonic index, the method parameter can also be used:

df_temp = pd.DataFrame({
    'Weight': range(5),
    'Height': range(5),
    'ID': [1101, 1104, 1103, 1106, 1102]
}).set_index('ID').sort_index()
df_temp.reindex_like(df[0:5][['Weight', 'Height']], method='bfill')
# you can check for yourself whether the values at 1105 are filled according to the bfill rule

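set_index and reset_index

(1) set_index first: as the name suggests, it turns one or more columns into the index.
(2) Using a column of the table as the index:

df.set_index('Class').head()

(3) The append parameter keeps the current index in place:

df.set_index('Class',append=True).head()

(4) Using a new column of the same length as the table as the index (it must first be converted to a Series, otherwise an error is raised):

df.set_index(pd.Series(range(df.shape[0]))).head()

(5) A multi-level index can be added directly:

df.set_index([pd.Series(range(df.shape[0])),
              pd.Series(np.ones(df.shape[0]))]).head()

(6) Next, reset_index: its main job is to reset the index
(7) By default it restores a plain 0, 1, 2, ... integer index:

(8) The level parameter chooses which level is reset, and col_level chooses which column level the result is placed in: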
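Steps (7) and (8) carry no code in these notes, so here is a minimal sketch of both behaviours. The frames flat and df_multi are hypothetical and stand in for whatever table the notes originally used.

```python
import numpy as np
import pandas as pd

# (7) Default: the index goes back to being a column and 0..n-1 takes its place.
flat = pd.DataFrame({"Class": ["C_1", "C_2"], "Math": [34.0, 32.5]}).set_index("Class")
restored = flat.reset_index()
print(restored.index.tolist(), restored.columns.tolist())

# (8) Hypothetical frame with two row levels and two column levels.
rows = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=("Upper", "Lower"))
cols = pd.MultiIndex.from_product([["D", "E"], ["d", "e"]], names=("Big", "Small"))
df_multi = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# Reset only the 'Lower' level and place its name in the second column level.
partial = df_multi.reset_index(level="Lower", col_level=1)
print(partial.index.name)
print(partial.columns[0])
```

With col_level=1 the new column's name sits in the second column level and the first level is padded with an empty string, the default col_fill.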

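rename_axis and rename

(1) rename_axis is mainly useful with multi-level indexes: it changes the name of an index level, not the index labels

# df_s is the frame with the (Upper, Lower) and (Big, Small) levels built earlier; the original used df_temp, which has no such levels at this point in the notes
df_s.rename_axis(index={'Lower':'LowerLower'},columns={'Big':'BigBig'})

(2) rename changes the column or row index labels, not the index names:

df_s.rename(index={'A':'T'},columns={'e':'changed_e'}).head()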
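The difference between level names and labels can be checked directly. The sketch below uses a small deterministic frame (df_multi is a hypothetical name) with the same level names as the example above.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([["A", "B"], ["a", "b"]], names=("Upper", "Lower"))
cols = pd.MultiIndex.from_product([["D", "E"], ["d", "e"]], names=("Big", "Small"))
df_multi = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# rename_axis touches the level names only.
new_names = df_multi.rename_axis(index={"Lower": "LowerLower"}, columns={"Big": "BigBig"})
print(list(new_names.index.names), list(new_names.columns.names))

# rename touches the labels only.
new_labels = df_multi.rename(index={"A": "T"}, columns={"e": "changed_e"})
print(new_labels.index.get_level_values("Upper").unique().tolist())
print(new_labels.columns.get_level_values("Small").unique().tolist())
```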

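Common indexing functions

The where function

(1) Fills the cells where the condition is False (with NaN by default)

(2) Filtering this way selects the same rows as the [] operator, provided the kept rows have no missing values of their own:

# drop every row whose 'Gender' is not 'M'
df.where(df['Gender']=='M').dropna().head()

(3) The first argument is the boolean condition and the second is the fill value:

df.where(df['Gender']=='M',np.random.rand(df.shape[0],df.shape[1])).head()

The mask function

mask is the opposite of where and otherwise identical: it fills the cells where the condition is True

# drop every row whose 'Gender' is 'M'
df.mask(df['Gender']=='M').dropna().head()

df.mask(df['Gender']=='M',np.random.rand(df.shape[0],df.shape[1])).head()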
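where and mask mirror each other, which a small Series makes easy to verify. The scores and the pass mark of 60 are invented for this sketch.

```python
import pandas as pd

# Hypothetical scores indexed by student ID.
math = pd.Series([34.0, 87.2, 58.8, 95.5], index=[1101, 1103, 1201, 1302])
passed = math >= 60

kept = math.where(passed)                  # failing scores become NaN
floored = math.where(passed, other=0.0)    # failing scores become 0.0
hidden = math.mask(passed, other=0.0)      # passing scores become 0.0 instead

print(kept.dropna().index.tolist())
print(floored.tolist())
print(hidden.tolist())
```

where(cond).dropna() and math[passed] agree here because the kept values contain no NaN of their own.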

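The query function

Inside the boolean expression passed to query, all of the following are valid: row and column index names, strings, and/not/or/&/|/~/not in/in/==/!=, and the four arithmetic operators

df.query(
    '(Address in ["street_6","street_7"])&(Weight>(70+10))&(ID in [1303,2304,2402])'
)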
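For comparison, the same kind of filter can be written as an ordinary boolean mask. The sketch below runs both forms on a made-up frame (df_demo, with an index named ID as the query string assumes) and checks that they agree.

```python
import pandas as pd

# Hypothetical toy frame; the index is named 'ID' so the query can refer to it.
df_demo = pd.DataFrame(
    {"Address": ["street_6", "street_7", "street_2", "street_6"],
     "Weight": [85, 70, 90, 82]},
    index=pd.Index([1303, 2304, 2402, 1101], name="ID"),
)

by_query = df_demo.query(
    '(Address in ["street_6", "street_7"]) & (Weight > (70 + 10)) & (ID in [1303, 2304, 2402])'
)

# Equivalent boolean mask built from isin and a comparison.
by_mask = df_demo[
    df_demo["Address"].isin(["street_6", "street_7"])
    & (df_demo["Weight"] > 80)
    & df_demo.index.isin([1303, 2304, 2402])
]

print(by_query.index.tolist())
print(by_query.equals(by_mask))
```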

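Handling duplicate elements

The duplicated method

(1) This method returns a boolean list marking which rows are duplicates

(2) The optional keep parameter defaults to 'first', which marks the first occurrence as not duplicated; 'last' marks the last occurrence as not duplicated; False marks every repeated item as True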
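The three keep settings described above can be compared side by side. The frame below is a made-up example with two repeated class labels.

```python
import pandas as pd

# Hypothetical column with repeats: C_1 at rows 0 and 2, C_2 at rows 1 and 4.
df_demo = pd.DataFrame({"Class": ["C_1", "C_2", "C_1", "C_3", "C_2"]})

first = df_demo.duplicated("Class")               # default keep='first'
last = df_demo.duplicated("Class", keep="last")
every = df_demo.duplicated("Class", keep=False)

print(first.tolist())
print(last.tolist())
print(every.tolist())

# Dropping the rows flagged by duplicated() matches drop_duplicates().
same = df_demo[~first].equals(df_demo.drop_duplicates("Class"))
print(same)
```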

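The drop_duplicates method

(1) As the name says, it removes duplicates. This may be useful for the group operations in later chapters, for example when the first value of each group must be kept:

df.drop_duplicates('Class')

(2) Its parameters are similar to those of duplicated:

df.drop_duplicates('Class',keep='last')

(3) When several columns are passed, they are treated together as one multi-level key when comparing duplicates:

df.drop_duplicates(['School','Class'])

Sampling

The sampling function here is sample
(1) n is the sample size

df.sample(n=5)

(2) frac is the sampling fraction

df.sample(frac=0.05)

(3) replace controls whether sampling is done with replacement

df.sample(n=df.shape[0],replace=True).head()

(4) axis is the sampling dimension; the default 0 samples rows

df.sample(n=3,axis=1).head()

(5) weights gives the sample weights, which are normalized automatically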
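The weights option has no code above, so here is a minimal sketch. The frame and the weight values are invented; random_state is used only to make the second draw repeatable.

```python
import pandas as pd

# Hypothetical toy frame.
df_demo = pd.DataFrame({"Math": [34.0, 32.5, 87.2, 80.4]},
                       index=[1101, 1102, 1103, 1104])

# Weights need not sum to 1; here row 1103 carries all of the weight.
certain = df_demo.sample(n=1, weights=[0, 0, 5, 0])
print(certain.index.tolist())

# A fixed random_state makes a weighted draw reproducible.
draw_a = df_demo.sample(n=2, weights=[1, 1, 4, 4], random_state=0)
draw_b = df_demo.sample(n=2, weights=[1, 1, 4, 4], random_state=0)
print(draw_a.equals(draw_b))
```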

