Parameters of the merge function

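The concat function (for concatenation)

For pandas objects such as Series and DataFrame, labeled axes let you generalize array concatenation further. In particular, there are a few additional things to consider:

If the objects are indexed differently on the other axes, should we combine the distinct elements on those axes or use only the intersection?
Do the concatenated chunks of data need to be identifiable in the resulting object?
Does the concatenation axis contain data that needs to be preserved? In many cases, the default integer labels of a DataFrame are best discarded during concatenation.
The concat function in pandas provides a consistent way to address each of these concerns. A few examples illustrate how it works. Suppose we have three Series with no index overlap: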

In [82]: s1 = pd.Series([0, 1], index=['a', 'b'])

In [83]: s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

In [84]: s3 = pd.Series([5, 6], index=['f', 'g'])

Calling concat with these objects glues together the values and indexes:

In [85]: pd.concat([s1, s2, s3])
Out[85]: 
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

By default, concat works along axis=0 and produces another Series. If you pass axis=1, the result is a DataFrame instead (axis=1 is the columns):

In [86]: pd.concat([s1, s2, s3], axis=1)
Out[86]: 
     0    1    2
a  0.0  NaN  NaN
b  1.0  NaN  NaN
c  NaN  2.0  NaN
d  NaN  3.0  NaN
e  NaN  4.0  NaN
f  NaN  NaN  5.0
g  NaN  NaN  6.0

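In this case there is no overlap on the other axis, as you can see from the sorted union (the outer join) of the indexes. Pass join='inner' to get the intersection instead: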

In [87]: s4 = pd.concat([s1, s3])

In [88]: s4
Out[88]: 
a    0
b    1
f    5
g    6
dtype: int64

In [89]: pd.concat([s1, s4], axis=1)
Out[89]: 
     0  1
a  0.0  0
b  1.0  1
f  NaN  5
g  NaN  6

In [90]: pd.concat([s1, s4], axis=1, join='inner')
Out[90]: 
   0  1
a  0  0
b  1  1

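In this example, the f and g labels disappear because of the join='inner' option.

Older pandas versions accepted a join_axes argument to specify the index to use on the other axes. That argument was removed in pandas 1.0; reindexing the concatenated result gives the same output:

In [91]: pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])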
Out[91]: 
     0    1
a  0.0  0.0
c  NaN  NaN
b  1.0  1.0
e  NaN  NaN

If you pass a dict of objects instead of a list, the dict's keys are used as the value of the keys option:

In [101]: pd.concat({'level1': df1, 'level2': df2}, axis=1)

Out[101]: 
  level1     level2     
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
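
The frames df1 and df2 used in In [101] are not defined in this section. The sketch below assumes two small frames chosen to match the output above (these definitions are an assumption, not taken from the text). It suggests that passing a dict is equivalent to passing a list together with an explicit keys argument:

```python
import numpy as np
import pandas as pd

# Assumed definitions, chosen only to match the shape of the output shown above.
df1 = pd.DataFrame(np.arange(6).reshape(3, 2),
                   index=['a', 'b', 'c'], columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2),
                   index=['a', 'c'], columns=['three', 'four'])

# Dict keys become the outer level of the column MultiIndex.
from_dict = pd.concat({'level1': df1, 'level2': df2}, axis=1)

# The same result spelled with a list plus an explicit keys argument.
from_keys = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

# True when both spellings produce identical frames.
same = from_dict.equals(from_keys)
print(from_dict)
print(same)
```

The outer column level makes it possible to tell afterwards which source frame each column came from, for example with from_dict['level2'].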

A last consideration concerns DataFrames whose row index does not contain any relevant data:

In [103]: df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])

In [104]: df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

In [105]: df1
Out[105]: 
          a         b         c         d
0  1.246435  1.007189 -1.296221  0.274992
1  0.228913  1.352917  0.886429 -2.001637
2 -0.371843  1.669025 -0.438570 -0.539741

In [106]: df2
Out[106]: 
          b         d         a
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614

In this case, you can pass ignore_index=True:

In [107]: pd.concat([df1, df2], ignore_index=True)
Out[107]: 
          a         b         c         d
0  1.246435  1.007189 -1.296221  0.274992
1  0.228913  1.352917  0.886429 -2.001637
2 -0.371843  1.669025 -0.438570 -0.539741
3 -1.021228  0.476985       NaN  3.248944
4  0.302614 -0.577087       NaN  0.124121
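
To see what ignore_index changes, here is a short sketch with two small made-up frames (the names left and right and their values are illustrative, not the random data above). It compares the default result with the ignore_index=True result:

```python
import pandas as pd

# Illustrative frames; the column order differs on purpose.
left = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
right = pd.DataFrame({'b': [7, 8], 'a': [9, 10]})

# Default: the original row labels are kept, so labels repeat.
kept = pd.concat([left, right])

# ignore_index=True: the row labels are discarded and renumbered.
renumbered = pd.concat([left, right], ignore_index=True)

kept_labels = list(kept.index)
new_labels = list(renumbered.index)
print(kept_labels)
print(new_labels)
```

In both cases the columns are aligned by name rather than by position, so the values of right['a'] end up under column a even though right lists b first.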

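Combining data with overlap

There is another kind of data combination problem that cannot be expressed as either a merge or a concatenation. For example, you may have two datasets whose indexes overlap in full or in part. As a motivating example, consider NumPy's where function, which performs the array-oriented equivalent of an if-else expression: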

In [108]: a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
   .....:               index=['f', 'e', 'd', 'c', 'b', 'a'])

In [109]: b = pd.Series(np.arange(len(a), dtype=np.float64),
   .....:               index=['f', 'e', 'd', 'c', 'b', 'a'])

In [110]: b.iloc[-1] = np.nan

In [111]: a
Out[111]: 
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [112]: b
Out[112]: 
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

# This returns an ndarray
In [113]: np.where(pd.isnull(a), b, a)
Out[113]: array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])

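# Series.where can be used as well.
# Note the direction: values are kept where the condition is True and replaced where it is
# False, i.e. where a is not less than 5 (this includes NaN, because NaN < 5 is False).
a.where(a < 5, 5.0)
# Called like this, the result is a new pd.Series; with inplace=True, a is modified in place and None is returned.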
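
Because np.where returns a bare ndarray, the index labels are lost. This sketch rebuilds a and b as defined above and compares the np.where result, wrapped back into a Series, with two label-aligned pandas alternatives. The variable names patched, via_fillna and via_combine are illustrative:

```python
import numpy as np
import pandas as pd

labels = ['f', 'e', 'd', 'c', 'b', 'a']
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=labels)
b = pd.Series(np.arange(len(a), dtype=np.float64), index=labels)
b.iloc[-1] = np.nan  # positional assignment to the last element

# np.where works on raw values, so the labels must be reattached by hand.
patched = pd.Series(np.where(pd.isnull(a), b, a), index=a.index)

# pandas alternatives that align on labels and return a Series directly.
via_fillna = a.fillna(b)
via_combine = a.combine_first(b)

print(patched)
print(via_fillna.equals(patched))
```

The last element stays NaN in all three results, because both a and b are missing at label a.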

The combine_first function

For a DataFrame, you can think of combine_first as patching missing data in the calling object with data from the object you pass:

In [115]: df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
   .....:                     'b': [np.nan, 2., np.nan, 6.],
   .....:                     'c': range(2, 18, 4)})

In [116]: df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
   .....:                     'b': [np.nan, 3., 4., 6., 8.]})

In [117]: df1
Out[117]: 
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14

In [118]: df2
Out[118]: 
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0

In [119]: df1.combine_first(df2)
Out[119]: 
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
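
One way to read the output above is as a two-step recipe: align both frames to the union of their row and column labels, then fill the gaps in the calling frame from the other one. The sketch below spells that out by hand and compares it with combine_first. It is only an approximation of the behavior for this example, not a description of how pandas implements the method, and the names rows, cols and manual are illustrative:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})

# Union of the row labels and of the column labels of both frames.
rows = df1.index.union(df2.index)
cols = df1.columns.union(df2.columns)

# Align both frames to the union, then fill df1's missing cells from df2.
manual = df1.reindex(index=rows, columns=cols).fillna(
    df2.reindex(index=rows, columns=cols))

result = df1.combine_first(df2)

# Compare cell by cell, treating NaN as equal to NaN.
matches = np.allclose(manual.to_numpy(dtype=float),
                      result.to_numpy(dtype=float), equal_nan=True)
print(result)
print(matches)
```

Column c becomes floating point in the result because row 4 exists only in df2, which has no c column, so a NaN has to be stored there.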
