Data Cleaning and Preparation
Revised notes, with added code and comments
xiaoyao
import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)
import warnings
warnings.filterwarnings('ignore')
Handling Missing Data
All of the descriptive statistics on pandas objects exclude missing data by default. pandas uses the floating-point value NaN (Not a Number) to represent missing data.
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
In pandas, we adopt a convention used in the R language and refer to missing data as NA, which stands for not available. In statistical applications, NA data may be data that does not exist, or data that exists but was not observed (for example, because of problems with data collection).
When cleaning data for analysis, it is often worth analyzing the missing data itself directly, in order to identify data-collection problems or potential biases caused by the missing values.
The Python built-in None value is also treated as NA in object arrays:
string_data[0] = None
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool
NA handling methods:

Method     Description
dropna     Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate
fillna     Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'
isnull     Return boolean values indicating which values are missing (NA); the returned object has the same type as the source
notnull    Negation of isnull
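As a quick illustration of isnull and notnull in practice (a minimal sketch, not part of the original notes; the small frame below is hypothetical), counting missing values per column is often the first step:
frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, np.nan, 4.0]})
frame.isnull().sum()    # number of missing values in each column
frame.notnull().all()   # True for a column only if it has no missing values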
Filtering Out Missing Data
There are several ways to filter out missing data. You can do it by hand with pandas.isnull and boolean indexing, but dropna can be more practical. On a Series, dropna returns the Series with only the non-null data and index values:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64
With DataFrame objects, things are a bit different: dropna by default drops any row containing a missing value.
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned
     0    1    2
0  1.0  6.5  3.0
data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
data[4] = NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
data.dropna(axis=1, how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
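As an aside (not part of the original notes), dropna also accepts a subset argument when you only want to consider missing values in certain columns; a minimal sketch using the data frame above:
data.dropna(subset=[1])   # drop only the rows where column 1 is NA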
A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing at least a certain number of observations; you can indicate this with the thresh argument:
df = pd.DataFrame(np.random.randn(7, 3))
df
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.000940  1.343810
3 -0.713544 -0.831154 -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df[0]
0    0.476985
1   -0.577087
2    0.523772
3   -0.713544
4   -1.860761
5   -1.265934
6    0.332883
Name: 0, dtype: float64
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
          0         1         2
0  0.476985       NaN       NaN
1 -0.577087       NaN       NaN
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df.dropna()
          0         1         2
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df.dropna(thresh=2)
          0         1         2
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
Filling In Missing Data
Rather than filtering out missing data, you may want to fill in the "holes" in some other way. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value:
df
          0         1         2
0  0.476985       NaN       NaN
1 -0.577087       NaN       NaN
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df.fillna(0)
          0         1         2
0  0.476985  0.000000  0.000000
1 -0.577087  0.000000  0.000000
2  0.523772  0.000000  1.343810
3 -0.713544  0.000000 -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df.fillna({1: 0.5, 2: 0})
          0         1         2
0  0.476985  0.500000  0.000000
1 -0.577087  0.500000  0.000000
2  0.523772  0.500000  1.343810
3 -0.713544  0.500000 -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
fillna returns a new object by default, but you can also modify the existing object in place:
_ = df.fillna(0, inplace=True)
df
          0         1         2
0  0.476985  0.000000  0.000000
1 -0.577087  0.000000  0.000000
2  0.523772  0.000000  1.343810
3 -0.713544  0.000000 -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df = pd.DataFrame(np.random.randn(6, 3))
df
          0         1         2
0  0.862580 -0.010032  0.050009
1  0.670216  0.852965 -0.955869
2 -0.023493 -2.304234 -0.652469
3 -1.218302 -1.332610  1.074623
4  0.723642  0.690002  1.001543
5 -0.503087 -0.622274 -0.921169
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
          0         1         2
0  0.862580 -0.010032  0.050009
1  0.670216  0.852965 -0.955869
2 -0.023493       NaN -0.652469
3 -1.218302       NaN  1.074623
4  0.723642       NaN       NaN
5 -0.503087       NaN       NaN
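The same interpolation methods available for reindexing can also be used with fillna. A minimal sketch using the df above (outputs omitted; newer pandas versions also expose the same behavior as df.ffill()):
df.fillna(method='ffill')           # propagate the last valid value forward down each column
df.fillna(method='ffill', limit=2)  # same, but fill at most 2 consecutive missing values per column
With a little creativity you can also use fillna for simple imputation, for example passing the mean of a Series, as in the next cell.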
data = pd.Series([1., NA, 3.5, NA, 7])
data.mean()
3.8333333333333335
data.fillna(data.mean())
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
fillna function arguments:

Argument   Description
value      Scalar value or dict-like object to use to fill missing values
method     Interpolation method; 'ffill' by default if the function is called with no other arguments
axis       Axis to fill on; default axis=0
inplace    Modify the calling object in place without producing a copy
limit      For forward and backward filling, maximum number of consecutive values to fill
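As another common pattern (a sketch not taken from the original notes), passing a Series of values to fillna fills each column with its own value, which gives per-column mean imputation in one line:
df.fillna(df.mean())   # fill the NAs in each column with that column's mean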
Data Transformation
So far in this chapter we have been concerned with rearranging data. Filtering, cleaning, and other transformations are another class of important operations.
Removing Duplicates
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
data['v1'] = range(7)
data
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6
data.drop_duplicates(['k1'])
    k1  k2  v1
0  one   1   0
1  two   1   1
data.drop_duplicates(['k1', 'k2'], keep='last')
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform some transformation based on the values in an array, Series, or DataFrame column. Consider the following data about various kinds of meat:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
"""
Some of the meats are capitalized while others are not,
so we first call the Series str.lower method to convert each value to lowercase:
"""
lowercased = data['food'].str.lower()
lowercased
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
We could also have passed a function that does all the work, here an anonymous (lambda) function:
data['food'].map(lambda x: meat_to_animal[x.lower()])
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
Replacing Values
Filling in missing data with the fillna method can be seen as a special case of more general value replacement. As you have already seen, map can be used to modify a subset of values in an object, but replace provides a simpler, more flexible way to do so.
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
data.replace(-999, np.nan)
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
data.replace([-999, -1000], np.nan)
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64
data
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64
data.replace([-999, -1000], [np.nan, 0])
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
data.replace({-999: np.nan, -1000: 0})
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
Note that the data.replace method is distinct from data.str.replace, which performs string substitution element-wise.
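To make the distinction concrete, here is a minimal sketch (not from the original notes; s is a hypothetical Series):
s = pd.Series(['a-1', 'b-2'])
s.str.replace('-', '_')    # element-wise substring substitution: 'a_1', 'b_2'
s.replace('a-1', 'c-3')    # replaces whole values that equal 'a-1'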
Renaming Axis Indexes
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
Like values in a Series, axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects. The axes can also be modified in place without creating a new data structure.
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
transform = lambda x: x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
data.rename(index=str.title, columns=str.upper)
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data
         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11
Discretization and Binning
Continuous data is often discretized or otherwise separated into "bins" for analysis.
For example, suppose you have data about a group of people in a study and you want to group them into discrete age buckets:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
The object pandas returns is a special Categorical object. The output describes the bins computed by pandas.cut; you can treat it like an array of strings indicating the bin names.
Internally it contains a categories array specifying the distinct category names, along with a labeling of the ages data in the codes attribute:
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')
pd.value_counts(cats)
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
data = np.random.rand(20)
pd.cut(data, 4, precision=2)
[(0.49, 0.72], (0.02, 0.26], (0.02, 0.26], (0.49, 0.72], (0.49, 0.72], ..., (0.49, 0.72], (0.49, 0.72], (0.26, 0.49], (0.72, 0.96], (0.49, 0.72]]
Length: 20
Categories (4, interval[float64]): [(0.02, 0.26] < (0.26, 0.49] < (0.49, 0.72] < (0.72, 0.96]]
The precision=2 option limits the displayed bin precision to two decimal places. The closely related function qcut bins the data based on sample quantiles, so each bin contains roughly the same number of data points:
data = np.random.randn(1000)
cats = pd.qcut(data, 4)
cats
[(-0.0453, 0.604], (-2.9499999999999997, -0.686], (-0.0453, 0.604], (-0.0453, 0.604], (-2.9499999999999997, -0.686], ..., (-0.686, -0.0453], (0.604, 3.928], (0.604, 3.928], (-0.0453, 0.604], (-0.686, -0.0453]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.686] < (-0.686, -0.0453] < (-0.0453, 0.604] < (0.604, 3.928]]
pd.value_counts(cats)
(0.604, 3.928] 250
(-0.0453, 0.604] 250
(-0.686, -0.0453] 250
(-2.9499999999999997, -0.686] 250
dtype: int64
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
[(-0.0453, 1.289], (-1.191, -0.0453], (-0.0453, 1.289], (-0.0453, 1.289], (-2.9499999999999997, -1.191], ..., (-1.191, -0.0453], (1.289, 3.928], (1.289, 3.928], (-0.0453, 1.289], (-1.191, -0.0453]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -1.191] < (-1.191, -0.0453] < (-0.0453, 1.289] < (1.289, 3.928]]
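As an aside (not part of the original notes), both cut and qcut accept labels=False if you only need the integer bin codes rather than the intervals; a minimal sketch:
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.], labels=False)   # integer quantile-bin codes, 0 through 3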
Detecting and Filtering Outliers
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.043288     0.046433     0.026352    -0.010204
std       0.998391     0.999185     1.010005     0.992779
min      -3.428254    -3.645860    -3.184377    -3.745356
25%      -0.740152    -0.599807    -0.612162    -0.699863
50%      -0.085000     0.043663    -0.008168    -0.031732
75%       0.625698     0.746527     0.690847     0.692355
max       3.366626     2.653656     3.525865     2.735527
col = data[2]
col[np.abs(col) > 3]
50     3.260383
225   -3.056990
312   -3.184377
772    3.525865
Name: 2, dtype: float64
data[(np.abs(data) > 3).any(1)]
            0         1         2         3
31  -2.315555  0.457246 -0.025907 -3.399312
50   0.050188  1.951312  3.260383  0.963301
126  0.146326  0.508391 -0.196713 -3.745356
225 -0.293333 -0.242459 -3.056990  1.918403
249 -3.428254 -0.296336 -0.439938 -0.867165
312  0.275144  1.179227 -3.184377  1.369891
534 -0.362528 -3.548824  1.553205 -2.186301
626  3.366626 -2.372214  0.851010  1.332846
772 -0.658090 -0.207434  3.525865  0.283070
793  0.599947 -3.645860  0.255475 -0.549574
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.043227     0.047628     0.025807    -0.009059
std       0.995841     0.995170     1.006769     0.988960
min      -3.000000    -3.000000    -3.000000    -3.000000
25%      -0.740152    -0.599807    -0.612162    -0.699863
50%      -0.085000     0.043663    -0.008168    -0.031732
75%       0.625698     0.746527     0.690847     0.692355
max       3.000000     2.653656     3.000000     2.735527
np.sign(data).head()
     0    1    2    3
0 -1.0 -1.0 -1.0 -1.0
1 -1.0  1.0 -1.0 -1.0
2  1.0 -1.0 -1.0  1.0
3  1.0  1.0  1.0 -1.0
4  1.0  1.0  1.0  1.0
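As an aside (not part of the original notes), the same capping can be written more compactly with the clip method; a minimal sketch:
capped = data.clip(lower=-3, upper=3)   # equivalent to assigning np.sign(data) * 3 where |value| > 3
capped.describe()                       # min and max are now bounded at -3 and 3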
Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows of a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler
array([2, 0, 3, 4, 1])
df
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
df.take(sampler)
    0   1   2   3
2   8   9  10  11
0   0   1   2   3
3  12  13  14  15
4  16  17  18  19
1   4   5   6   7
df.sample(n=3)
   0  1   2   3
2  8  9  10  11
1  4  5   6   7
0  0  1   2   3
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws
4    4
4    4
1    7
3    6
4    4
3    6
4    4
4    4
3    6
2   -1
dtype: int64
Computing Indicator/Dummy Variables
Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a "dummy" or "indicator" matrix.
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
pd.get_dummies(df['key'])
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns containing all 1s and 0s.
"""
Sometimes you may want to add a prefix to the columns of the indicator DataFrame
so that it can be merged with the other data.
get_dummies has a prefix argument for doing this.
"""
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
                       header=None, names=mnames)
movies[:10]
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
gen = movies.genres[0]
gen.split('|')
dummies.columns.get_indexer(gen.split('|'))
array([0, 1, 2], dtype=int64)
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id 1
title Toy Story (1995)
genres Animation|Children's|Comedy
Genre_Animation 1
Genre_Children's 1
...
Genre_War 0
Genre_Musical 0
Genre_Mystery 0
Genre_Film-Noir 0
Genre_Western 0
Name: 0, Length: 21, dtype: object
For much larger data, this method of constructing indicator variables with multiple memberships becomes quite slow. It is better to write a lower-level function that writes directly into a NumPy array, and then wrap the result in a DataFrame.
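A minimal sketch of that lower-level approach (not from the original notes; genre_to_col and indicator are names introduced here for illustration):
# Preallocate a NumPy array, fill it directly, and wrap it in a DataFrame only at the end.
genre_to_col = {genre: i for i, genre in enumerate(genres)}
indicator = np.zeros((len(movies), len(genres)), dtype=np.uint8)
for i, gen in enumerate(movies.genres):
    for g in gen.split('|'):
        indicator[i, genre_to_col[g]] = 1
dummies_fast = pd.DataFrame(indicator, columns=genres)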
A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut:
np.random.seed(12345)
values = np.random.rand(10)
values
array([0.9296, 0.3164, 0.1839, 0.2046, 0.5677, 0.5955, 0.9645, 0.6532,
       0.7489, 0.6536])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           0           0           0           1
1           0           1           0           0           0
2           1           0           0           0           0
3           0           1           0           0           0
4           0           0           1           0           0
5           0           0           1           0           0
6           0           0           0           0           1
7           0           0           0           1           0
8           0           0           0           1           0
9           0           0           0           1           0
String Manipulation
Python can handle strings and text on its own, but for more complex pattern matching and text manipulation, regular expressions are needed. pandas builds on this by making it possible to apply string and regular expression operations concisely to whole arrays of data, while also handling the annoyance of missing values.
String Object Methods
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
first, second, third = pieces
first + '::' + second + '::' + third
'a::b::guido'
'::'.join(pieces)
'a::b::guido'
'guido' in val
True
val.index(',')
1
val.find(':')
-1
The difference between find and index is that index raises an exception if the substring is not found, rather than returning -1:
val.index(':')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-108-2c016e7367ac> in <module>
----> 1 val.index(':')

ValueError: substring not found
val.count(',')
2
val.replace(',', '::')
'a::b:: guido'
val.replace(',', '')
'ab guido'
Some Python built-in string methods:

Method        Description
count         Return the number of non-overlapping occurrences of a substring in the string
endswith      Return True if the string ends with the given suffix
startswith    Return True if the string starts with the given prefix
find, rfind   Return the position of the first character of the first (find) or last (rfind) occurrence of a substring; -1 if not found
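A quick illustration with the val defined earlier (results shown as comments; this cell is not part of the original notes):
val.count(',')          # 2
val.startswith('a')     # True
val.endswith('guido')   # True
val.rfind(',')          # 3, the position of the last comma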
Regular Expressions
The functions in the re module fall into three categories: pattern matching, substitution, and splitting.
import re
text = "foo bar\t baz \tqux"
re.split('\s+', text)
['foo', 'bar', 'baz', 'qux']
When you call re.split('\s+', text), the regular expression is first compiled, and then its split method is called on the passed text.
regex = re.compile('\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
regex.findall(text)
[' ', '\t ', ' \t']
To avoid unwanted escaping with \ in a regular expression, use a raw string literal such as r'C:\x' instead of the equivalent 'C:\\x'.
If you intend to apply the same regular expression to many strings, it is recommended to create a regex object with re.compile; doing so saves a lot of CPU time.
findall returns all of the matches in a string, whereas search returns only the first match. match is stricter still: it only matches at the beginning of the string.
text = """Dave [email protected]
Steve [email protected]
Rob [email protected]
Ryan [email protected]
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re. compile ( pattern, flags= re. IGNORECASE)
regex. findall( text)
['[email protected] ', '[email protected] ', '[email protected] ', '[email protected] ']
m = regex.search(text)
m
<re.Match object; span=(5, 20), match='dave@google.com'>
text[m.start():m.end()]
'dave@google.com'
print(regex.match(text))
None
print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')
regex.findall(text)
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
Vectorized String Functions in pandas
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data
Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object
data.isnull()
Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool
data.str.contains('gmail')
Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
pattern
data.str.findall(pattern, flags=re.IGNORECASE)
Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object
matches.str[0]
data.str[:5]
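As a side note (not part of the original notes), recent pandas versions also provide str.extract, which returns the regex capture groups directly as DataFrame columns:
data.str.extract(pattern, flags=re.IGNORECASE)   # columns 0, 1, 2 hold the username, domain, and suffix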
pd.options.display.max_rows = PREVIOUS_MAX_ROWS
Conclusion