Pandas學習

數據結構

  • Series:一維數組,與Numpy中的一維array類似。二者與Python基本的數據結構List也很相近,其區別是:List中的元素可以是不同的數據類型,而Array和Series中則只允許存儲相同的數據類型,這樣可以更有效的使用內存,提高運算效率。

  • Time- Series:以時間爲索引的Series。

  • DataFrame:二維的表格型數據結構。很多功能與R中的data.frame類似。可以將DataFrame理解爲Series的容器。以下的內容主要以DataFrame爲主。

  • Panel :三維的數組,可以理解爲DataFrame的容器。

導入csv文件

>>> df = pd.DataFrame(pd.read_csv('insurance.csv',header=0)) #加載csv文件
>>> df.head() #查看前10行
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
>>> df.tail() #查看後10行
      age     sex    bmi  children smoker     region     charges
1333   50    male  30.97         3     no  northwest  10600.5483
1334   18  female  31.92         0     no  northeast   2205.9808
1335   18  female  36.85         0     no  southeast   1629.8335
1336   21  female  25.80         0     no  southwest   2007.9450
1337   61  female  29.07         0    yes  northwest  29141.3603

>>> df = pd.DataFrame(pd.read_excel('insurance.xlsx')) #加載xlsx文件
>>> df.head() #查看前10行
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
>>> df.tail() #查看後10行
      age     sex    bmi  children smoker     region     charges
1333   50    male  30.97         3     no  northwest  10600.5483
1334   18  female  31.92         0     no  northeast   2205.9808
1335   18  female  36.85         0     no  southeast   1629.8335
1336   21  female  25.80         0     no  southwest   2007.9450
1337   61  female  29.07         0    yes  northwest  29141.3603

DataFrame

>>> df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006], 
"date":pd.date_range('20130102', periods=6), 
"city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
"age":[23,44,54,32,34,32],
"category":['100-A','100-B','110-A','110-C','210-A','130-F'],"price":[1200,np.nan,2133,5433,np.nan,4432]},
columns =['id','date','city','category','age','price'])
>>> df.head()
     id       date         city category  age   price
0  1001 2013-01-02     Beijing     100-A   23  1200.0
1  1002 2013-01-03           SH    100-B   44     NaN
2  1003 2013-01-04   guangzhou     110-A   54  2133.0
3  1004 2013-01-05     Shenzhen    110-C   32  5433.0
4  1005 2013-01-06     shanghai    210-A   34     NaN
  1. 維度查看:
>>> df.shape
(6, 6)
  1. 數據表基本信息(維度、列名稱、數據格式、所佔空間等):
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
id          6 non-null int64
date        6 non-null datetime64[ns]
city        6 non-null object
category    6 non-null object
age         6 non-null int64
price       4 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 368.0+ bytes

3、每一列數據的格式:

>>> df.dtypes
id                   int64
date        datetime64[ns]
city                object
category            object
age                  int64
price              float64
dtype: object

4、某一列格式:

>>> df['id'].dtype
dtype('int64')

5、空值:

>>> df.isnull()
      id   date   city  category    age  price
0  False  False  False     False  False  False
1  False  False  False     False  False   True
2  False  False  False     False  False  False
3  False  False  False     False  False  False
4  False  False  False     False  False   True
5  False  False  False     False  False  False

6、查看某一列的唯一值:

>>> df['age']
0    23
1    44
2    54
3    32
4    34
5    32
Name: age, dtype: int64
>>> df['age'].unique()
array([23, 44, 54, 32, 34])

7、查看數據表的值:

>>> df.values
array([[1001, Timestamp('2013-01-02 00:00:00'), 'Beijing ', '100-A', 23,
        1200.0],
       [1002, Timestamp('2013-01-03 00:00:00'), 'SH', '100-B', 44, nan],
       [1003, Timestamp('2013-01-04 00:00:00'), ' guangzhou ', '110-A',
        54, 2133.0],
       [1004, Timestamp('2013-01-05 00:00:00'), 'Shenzhen', '110-C', 32,
        5433.0],
       [1005, Timestamp('2013-01-06 00:00:00'), 'shanghai', '210-A', 34,
        nan],
       [1006, Timestamp('2013-01-07 00:00:00'), 'BEIJING ', '130-F', 32,
        4432.0]], dtype=object)

8、查看列名稱:

>>> df.columns
Index(['id', 'date', 'city', 'category', 'age', 'price'], dtype='object')

9、查看前N行數據、後N行數據:

>>> df.head(2) #顯示前2行數據
     id       date      city category  age   price
0  1001 2013-01-02  Beijing     100-A   23  1200.0
1  1002 2013-01-03        SH    100-B   44     NaN
>>> df.tail(2) #顯示後2行數據
     id       date      city category  age   price
4  1005 2013-01-06  shanghai    210-A   34     NaN
5  1006 2013-01-07  BEIJING     130-F   32  4432.0

數據表清洗

1、用數字0填充空值:

>>> df
     id       date         city category  age   price
0  1001 2013-01-02     Beijing     100-A   23  1200.0
1  1002 2013-01-03           SH    100-B   44     NaN
2  1003 2013-01-04   guangzhou     110-A   54  2133.0
3  1004 2013-01-05     Shenzhen    110-C   32  5433.0
4  1005 2013-01-06     shanghai    210-A   34     NaN
5  1006 2013-01-07     BEIJING     130-F   32  4432.0
>>> df.fillna(0)
     id       date         city category  age   price
0  1001 2013-01-02     Beijing     100-A   23  1200.0
1  1002 2013-01-03           SH    100-B   44     0.0
2  1003 2013-01-04   guangzhou     110-A   54  2133.0
3  1004 2013-01-05     Shenzhen    110-C   32  5433.0
4  1005 2013-01-06     shanghai    210-A   34     0.0
5  1006 2013-01-07     BEIJING     130-F   32  4432.0

2、使用列price的均值對NAN進行填充:

>>> df['price']=df['price'].fillna(df['price'].mean())
>>> df['price']
0    1200.0
1    3299.5
2    2133.0
3    5433.0
4    3299.5
5    4432.0
Name: price, dtype: float64

3、清除city字段的字符空格:

>>> df['city']
0       Beijing 
1             SH
2     guangzhou 
3       Shenzhen
4       shanghai
5       BEIJING 
Name: city, dtype: object
>>> df['city']=df['city'].map(str.strip)
>>> df['city']
0      Beijing
1           SH
2    guangzhou
3     Shenzhen
4     shanghai
5      BEIJING
Name: city, dtype: object

4、大小寫轉換:

>>> df['city']
0      Beijing
1           SH
2    guangzhou
3     Shenzhen
4     shanghai
5      BEIJING
Name: city, dtype: object
>>> df['city']=df['city'].str.lower()
>>> df['city']
0      beijing
1           sh
2    guangzhou
3     shenzhen
4     shanghai
5      beijing
Name: city, dtype: object

5、更改數據格式 (astype):

>>> df['price'].astype('int')
0    1200
1    3299
2    2133
3    5433
4    3299
5    4432
Name: price, dtype: int64

6、更改列名稱 (rename):

>>> df.columns
Index(['id', 'date', 'city', 'category', 'age', 'price'], dtype='object')
>>> df.rename(columns={'category': 'category-size'})
     id       date       city category-size  age   price
0  1001 2013-01-02    beijing         100-A   23  1200.0
1  1002 2013-01-03         sh         100-B   44  3299.5
2  1003 2013-01-04  guangzhou         110-A   54  2133.0
3  1004 2013-01-05   shenzhen         110-C   32  5433.0
4  1005 2013-01-06   shanghai         210-A   34  3299.5
5  1006 2013-01-07    beijing         130-F   32  4432.0

7、刪除後出現的重複值:

>>> df['city']
0      beijing
1           sh
2    guangzhou
3     shenzhen
4     shanghai
5      beijing
Name: city, dtype: object
>>> df['city'].drop_duplicates()
0      beijing
1           sh
2    guangzhou
3     shenzhen
4     shanghai
Name: city, dtype: object

8、刪除先出現的重複值:

>>> df['city'].drop_duplicates(keep='last')
1           sh
2    guangzhou
3     shenzhen
4     shanghai
5      beijing
Name: city, dtype: object

9、數據替換:

>>> df['city']
0      beijing
1           sh
2    guangzhou
3     shenzhen
4     shanghai
5      beijing
Name: city, dtype: object
>>> df['city'].replace('sh', 'shanghai')
0      beijing
1     shanghai
2    guangzhou
3     shenzhen
4     shanghai
5      beijing
Name: city, dtype: object

數據預處理

1、數據合併-merge函數

merge函數的參數如下表所示:

參數 說明
left 參與合併的左側DataFrame
right 參與合併的右側DataFrame
how “inner”,”outer”,”left”,”right”其中之一,默認爲”inner”
on 用於連接的列名,必須存在於左右兩個DataFrame
left_on 左側DataFrame中用作連接鍵的列
right_on 右側DataFrame中用作連接鍵的列
left_index 將左側的行索引用作其連接鍵
right_index 將右側的行索引用作其連接鍵
sort 根據連接鍵對合並後的數據進行排列,默認爲True
suffixes 字符串值元組,用於追加到重疊列名的末尾,默認爲(‘_x’,‘_y’)。如果左右兩個DataFrame對象都有“data”,則結果就會出現“data_x”和“data_y”
copy 默認爲True。如果設置爲False,可以避免將數據複製到結果數據結構中

1.1、數據表合併-on參數

>>> df1=pd.DataFrame({'key':['b','b','a','a','b','a','c'],'data1':range(7)})
>>> df1
  key  data1
0   b      0
1   b      1
2   a      2
3   a      3
4   b      4
5   a      5
6   c      6
>>> df2=pd.DataFrame({'key':['a','b','d'],'data2':range(3)})
>>> df2
  key  data2
0   a      0
1   b      1
2   d      2
>>> pd.merge(df1,df2,on='key')
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      4      1
3   a      2      0
4   a      3      0
5   a      5      0

1.2、數據合併-left_on,right_on參數

>>> df3=pd.DataFrame({'l_key':['b','b','a','a','b','a','c'],'data1':range(7)})
>>> df3
  l_key  data1
0     b      0
1     b      1
2     a      2
3     a      3
4     b      4
5     a      5
6     c      6
>>> df4=pd.DataFrame({'r_key':['a','b','d'],'data2':range(3)})
>>> df4
  r_key  data2
0     a      0
1     b      1
2     d      2
>>> pd.merge(df3,df4,left_on='l_key',right_on='r_key')
  l_key  data1 r_key  data2
0     b      0     b      1
1     b      1     b      1
2     b      4     b      1
3     a      2     a      0
4     a      3     a      0
5     a      5     a      0

1.3、數據合併-how參數

>>> df2=pd.DataFrame({'key':['a','b','d'],'data2':range(3)})
>>> df2
  key  data2
0   a      0
1   b      1
2   d      2
>>> df1
  key  data1
0   b      0
1   b      1
2   a      2
3   a      3
4   b      4
5   a      5
6   c      6
>>> pd.merge(df1,df2,on='key',how='outer')
  key  data1  data2
0   b    0.0    1.0
1   b    1.0    1.0
2   b    4.0    1.0
3   a    2.0    0.0
4   a    3.0    0.0
5   a    5.0    0.0
6   c    6.0    NaN
7   d    NaN    2.0
>>> df_inner = pd.merge(df1,df2,on='key',how='inner')
>>> df_inner
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      4      1
3   a      2      0
4   a      3      0
5   a      5      0
>>> pd.merge(df1,df2,on='key',how='left') #只使用左邊的DataFrame的鍵
  key  data1  data2
0   b      0    1.0
1   b      1    1.0
2   a      2    0.0
3   a      3    0.0
4   b      4    1.0
5   a      5    0.0
6   c      6    NaN
>>> pd.merge(df1,df2,on='key',how='right') #只使用右邊的DataFrame的鍵
  key  data1  data2
0   b    0.0      1
1   b    1.0      1
2   b    4.0      1
3   a    2.0      0
4   a    3.0      0
5   a    5.0      0
6   d    NaN      2

1.5、數據合併-left_index,right_index參數

>>> df7=pd.DataFrame({'key':['a','b','a','a','b','c'],'value':range(6)})
>>> df7
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5
>>> df8=pd.DataFrame({'group_val':[3.5,7]},index=['a','b'])
>>> df8
   group_val
a        3.5
b        7.0
>>> pd.merge(df7,df8,left_on='key',right_index=True) #進行索引上的合併
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0

1.6、數據合併-多對多的合併操作

>>> df5=pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})
>>> df5
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
>>> df6=pd.DataFrame({'key':['a','b','a','b','d'],'data2':range(5)})
>>> df6
  key  data2
0   a      0
1   b      1
2   a      2
3   b      3
4   d      4
>>> pd.merge(df5,df6,how='outer') #產生的是行的笛卡爾積,由於左邊的DataFrame有3個”b”行,右邊的有兩個,所以最終結果就有6個“b”行
   key  data1  data2
0    b    0.0    1.0
1    b    0.0    3.0
2    b    1.0    1.0
3    b    1.0    3.0
4    b    5.0    1.0
5    b    5.0    3.0
6    a    2.0    0.0
7    a    2.0    2.0
8    a    4.0    0.0
9    a    4.0    2.0
10   c    3.0    NaN
11   d    NaN    4.0

2、設置索引列

>>> df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006], "date":pd.date_range('20130102', periods=6), "city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],"age":[23,44,54,32,34,32],"category":['100-A','100-B','110-A','110-C','210-A','130-F'],"price":[1200,np.nan,2133,5433,np.nan,4432]},columns =['id','date','city','category','age','price'])
>>> df
     id       date         city category  age   price
0  1001 2013-01-02     Beijing     100-A   23  1200.0
1  1002 2013-01-03           SH    100-B   44     NaN
2  1003 2013-01-04   guangzhou     110-A   54  2133.0
3  1004 2013-01-05     Shenzhen    110-C   32  5433.0
4  1005 2013-01-06     shanghai    210-A   34     NaN
5  1006 2013-01-07     BEIJING     130-F   32  4432.0
>>> df.set_index('id')
           date         city category  age   price
id                                                
1001 2013-01-02     Beijing     100-A   23  1200.0
1002 2013-01-03           SH    100-B   44     NaN
1003 2013-01-04   guangzhou     110-A   54  2133.0
1004 2013-01-05     Shenzhen    110-C   32  5433.0
1005 2013-01-06     shanghai    210-A   34     NaN
1006 2013-01-07     BEIJING     130-F   32  4432.0

3、按照特定列的值排序:

>>> df.sort_values(by=['age'])
     id       date         city category  age   price
0  1001 2013-01-02     Beijing     100-A   23  1200.0
3  1004 2013-01-05     Shenzhen    110-C   32  5433.0
5  1006 2013-01-07     BEIJING     130-F   32  4432.0
4  1005 2013-01-06     shanghai    210-A   34     NaN
1  1002 2013-01-03           SH    100-B   44     NaN
2  1003 2013-01-04   guangzhou     110-A   54  2133.0

4、按照索引列排序:

>>> df.set_index('age').sort_index()
       id       date         city category   price
age                                               
23   1001 2013-01-02     Beijing     100-A  1200.0
32   1004 2013-01-05     Shenzhen    110-C  5433.0
32   1006 2013-01-07     BEIJING     130-F  4432.0
34   1005 2013-01-06     shanghai    210-A     NaN
44   1002 2013-01-03           SH    100-B     NaN
54   1003 2013-01-04   guangzhou     110-A  2133.0
5、如果price列的值>3000,group列顯示high,否則顯示low:

>>> df['group'] = np.where(df['price'] > 3000,'high','low')
>>> df['group']
0     low
1     low
2     low
3    high
4     low
5    high
Name: group, dtype: object

6、對複合多個條件的數據進行分組標記

>>> df.loc[(df['city'] == 'beijing') & (df['price'] >= 4000),'sign']=1
>>> df
     id       date         city category  age   price group  sign
0  1001 2013-01-02     Beijing     100-A   23  1200.0   low   NaN
1  1002 2013-01-03           SH    100-B   44     NaN   low   NaN
2  1003 2013-01-04   guangzhou     110-A   54  2133.0   low   NaN
3  1004 2013-01-05     Shenzhen    110-C   32  5433.0  high   NaN
4  1005 2013-01-06     shanghai    210-A   34     NaN   low   NaN
5  1006 2013-01-07     BEIJING     130-F   32  4432.0  high   NaN
接下來兩個小實驗中使用到數據表df_inner,該數據表內容如下:

>>> df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006], "date":pd.date_range('20130102', periods=6),"city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],"age":[23,44,54,32,34,32],"category":['100-A','100-B','110-A','110-C','210-A','130-F'],"price":[1200,np.nan,2133,5433,np.nan,4432]},columns =['id','date','city','category','age','price'])
>>> df1 = df1=pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006,1007,1008],"gender":['male','female','male','female','male','female','male','female'],"pay":['Y','N','Y','Y','N','Y','N','Y',],"m-point":[10,12,20,40,40,40,30,20]})
>>> df_inner=pd.merge(df,df1,how='inner') 
>>> df_inner
     id       date         city category   ...     price  gender pay m-point
0  1001 2013-01-02     Beijing     100-A   ...    1200.0    male   Y      10
1  1002 2013-01-03           SH    100-B   ...       NaN  female   N      12
2  1003 2013-01-04   guangzhou     110-A   ...    2133.0    male   Y      20
3  1004 2013-01-05     Shenzhen    110-C   ...    5433.0  female   Y      40
4  1005 2013-01-06     shanghai    210-A   ...       NaN    male   N      40
5  1006 2013-01-07     BEIJING     130-F   ...    4432.0  female   Y      40

[6 rows x 9 columns]

7、對category字段的值依次進行分列,並創建數據表,索引值爲df_inner的索引列,列名稱爲category和size

>>> split=pd.DataFrame((x.split('-') for x in df_inner['category']),index=df_inner.index,columns=['category','size']))
>>> print(split.values)
[['100' 'A']
 ['100' 'B']
 ['110' 'A']
 ['110' 'C']
 ['210' 'A']
 ['130' 'F']]

8、將完成分裂後的數據表和原df_inner數據表進行匹配

>>> df_inner=pd.merge(df_inner,split,right_index=True, left_index=True)
>>> print(df_inner.values)
[[1001 Timestamp('2013-01-02 00:00:00') 'Beijing ' '100-A' 23 1200.0
  'male' 'Y' 10 '100' 'A']
 [1002 Timestamp('2013-01-03 00:00:00') 'SH' '100-B' 44 nan 'female' 'N'
  12 '100' 'B']
 [1003 Timestamp('2013-01-04 00:00:00') ' guangzhou ' '110-A' 54 2133.0
  'male' 'Y' 20 '110' 'A']
 [1004 Timestamp('2013-01-05 00:00:00') 'Shenzhen' '110-C' 32 5433.0
  'female' 'Y' 40 '110' 'C']
 [1005 Timestamp('2013-01-06 00:00:00') 'shanghai' '210-A' 34 nan 'male'
  'N' 40 '210' 'A']
 [1006 Timestamp('2013-01-07 00:00:00') 'BEIJING ' '130-F' 32 4432.0
  'female' 'Y' 40 '130' 'F']]
發佈了63 篇原創文章 · 獲贊 3 · 訪問量 4632
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章