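Data structures
- Series: a one-dimensional array, similar to a one-dimensional NumPy array. Both resemble Python's built-in list; the difference is that a list may hold elements of different types, whereas an array or a Series has a single dtype for all of its elements, which uses memory more efficiently and speeds up computation.
- Time-Series: a Series indexed by time.
- DataFrame: a two-dimensional tabular data structure. Many of its features resemble R's data.frame. A DataFrame can be thought of as a container of Series. The rest of this text focuses mainly on DataFrame.
- Panel: a three-dimensional array that can be thought of as a container of DataFrames. Panel was deprecated in pandas 0.20 and removed in pandas 0.25.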
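The difference between a list, an array, and a Series can be sketched as follows. This is a minimal illustration, assuming pandas and NumPy are installed; all variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# A Python list may mix types freely.
mixed = [1, "two", 3.0]

# A NumPy array and a pandas Series each carry a single dtype.
arr = np.array([1, 2, 3])
ser = pd.Series([1, 2, 3], index=["a", "b", "c"])
print(arr.dtype, ser.dtype)

# Mixing an int with a float upcasts the whole Series to float64.
upcast = pd.Series([1, 2.5])
print(upcast.dtype)

# A DataFrame behaves like a container of Series: selecting a column returns one.
frame = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})
col = frame["x"]
print(type(col).__name__)
```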
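Importing CSV files
>>> import pandas as pd
>>> df = pd.read_csv('insurance.csv', header=0)  # load the CSV file; read_csv already returns a DataFrame
>>> df.head()  # show the first 5 rows (the default)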
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
>>> df.tail()  # show the last 5 rows (the default)
age sex bmi children smoker region charges
1333 50 male 30.97 3 no northwest 10600.5483
1334 18 female 31.92 0 no northeast 2205.9808
1335 18 female 36.85 0 no southeast 1629.8335
1336 21 female 25.80 0 no southwest 2007.9450
1337 61 female 29.07 0 yes northwest 29141.3603
>>> df = pd.read_excel('insurance.xlsx')  # load the xlsx file; read_excel already returns a DataFrame
>>> df.head()  # show the first 5 rows (the default)
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
>>> df.tail()  # show the last 5 rows (the default)
age sex bmi children smoker region charges
1333 50 male 30.97 3 no northwest 10600.5483
1334 18 female 31.92 0 no northeast 2205.9808
1335 18 female 36.85 0 no southeast 1629.8335
1336 21 female 25.80 0 no southwest 2007.9450
1337 61 female 29.07 0 yes northwest 29141.3603
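To try the loading step without the file on disk, the sketch below reads a few rows from an in-memory string. It assumes the CSV uses the column layout shown in the output above; the three data rows are copied from that output and `csv_text` is a stand-in for `insurance.csv`.

```python
import io
import pandas as pd

# Three rows copied from the output above stand in for insurance.csv.
csv_text = """age,sex,bmi,children,smoker,region,charges
19,female,27.900,0,yes,southwest,16884.92400
18,male,33.770,1,no,southeast,1725.55230
28,male,33.000,3,no,southeast,4449.46200
"""

# read_csv returns a DataFrame directly, so no pd.DataFrame(...) wrapper is needed.
sample = pd.read_csv(io.StringIO(csv_text), header=0)

first_two = sample.head(2)  # explicit row count instead of the default of 5
last_one = sample.tail(1)
print(first_two)
print(last_one)
```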
DataFrame
>>> import numpy as np
>>> df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
...                    "date": pd.date_range('20130102', periods=6),
...                    "city": ['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
...                    "age": [23, 44, 54, 32, 34, 32],
...                    "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
...                    "price": [1200, np.nan, 2133, 5433, np.nan, 4432]},
...                   columns=['id', 'date', 'city', 'category', 'age', 'price'])
>>> df.head()
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0
4 1005 2013-01-06 shanghai 210-A 34 NaN
1. Check the dimensions:
>>> df.shape
(6, 6)
2. Basic information about the table (dimensions, column names, data types, memory usage, etc.):
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
id 6 non-null int64
date 6 non-null datetime64[ns]
city 6 non-null object
category 6 non-null object
age 6 non-null int64
price 4 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 368.0+ bytes
3. Data type of each column:
>>> df.dtypes
id int64
date datetime64[ns]
city object
category object
age int64
price float64
dtype: object
4. Data type of a single column:
>>> df['id'].dtype
dtype('int64')
5. Null values:
>>> df.isnull()
id date city category age price
0 False False False False False False
1 False False False False False True
2 False False False False False False
3 False False False False False False
4 False False False False False True
5 False False False False False False
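The boolean table returned by `isnull()` is usually summarized rather than read cell by cell. A minimal sketch, using a made-up three-row table named `prices`:

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({"id": [1001, 1002, 1003],
                       "price": [1200, np.nan, 2133]})

# True counts as 1 when summed, so this gives the number of NaN values per column.
missing_per_column = prices.isnull().sum()
# Boolean indexing keeps only the rows whose price is missing.
rows_with_missing = prices[prices["price"].isnull()]
print(missing_per_column)
print(rows_with_missing)
```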
6. Unique values of a column:
>>> df['age']
0 23
1 44
2 54
3 32
4 34
5 32
Name: age, dtype: int64
>>> df['age'].unique()
array([23, 44, 54, 32, 34])
7. Values of the table:
>>> df.values
array([[1001, Timestamp('2013-01-02 00:00:00'), 'Beijing ', '100-A', 23,
1200.0],
[1002, Timestamp('2013-01-03 00:00:00'), 'SH', '100-B', 44, nan],
[1003, Timestamp('2013-01-04 00:00:00'), ' guangzhou ', '110-A',
54, 2133.0],
[1004, Timestamp('2013-01-05 00:00:00'), 'Shenzhen', '110-C', 32,
5433.0],
[1005, Timestamp('2013-01-06 00:00:00'), 'shanghai', '210-A', 34,
nan],
[1006, Timestamp('2013-01-07 00:00:00'), 'BEIJING ', '130-F', 32,
4432.0]], dtype=object)
8. Column names:
>>> df.columns
Index(['id', 'date', 'city', 'category', 'age', 'price'], dtype='object')
9. Show the first N rows and the last N rows:
>>> df.head(2)  # show the first 2 rows
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 NaN
>>> df.tail(2)  # show the last 2 rows
id date city category age price
4 1005 2013-01-06 shanghai 210-A 34 NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
Data cleaning
1. Fill null values with 0:
>>> df
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0
4 1005 2013-01-06 shanghai 210-A 34 NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
>>> df.fillna(0)
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 0.0
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0
4 1005 2013-01-06 shanghai 210-A 34 0.0
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
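Note that `fillna` returns a new object and leaves the original untouched unless the result is assigned back. The sketch below also shows a per-column fill via a dict; the table and the fill value `"unknown"` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({"city": ["SH", None], "price": [1200, np.nan]})

# A dict gives each column its own fill value.
filled = table.fillna({"city": "unknown", "price": 0})

print(table["price"].isnull().sum())  # the original still contains its NaN
print(filled)
```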
2. Fill NaN values in the price column with the column mean:
>>> df['price']=df['price'].fillna(df['price'].mean())
>>> df['price']
0 1200.0
1 3299.5
2 2133.0
3 5433.0
4 3299.5
5 4432.0
Name: price, dtype: float64
3. Strip whitespace from the city field:
>>> df['city']
0 Beijing
1 SH
2 guangzhou
3 Shenzhen
4 shanghai
5 BEIJING
Name: city, dtype: object
>>> df['city']=df['city'].map(str.strip)
>>> df['city']
0 Beijing
1 SH
2 guangzhou
3 Shenzhen
4 shanghai
5 BEIJING
Name: city, dtype: object
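`map(str.strip)` works here because every city is a string. If the column could contain missing values, the `.str` accessor is a safer alternative, as this sketch suggests (the sample Series is made up and forced to object dtype for the comparison):

```python
import numpy as np
import pandas as pd

cities = pd.Series(["Beijing ", " guangzhou ", np.nan], dtype=object)

# The .str accessor skips missing values instead of raising.
stripped = cities.str.strip()
print(stripped.tolist())

# map(str.strip) calls str.strip on every element, including the float NaN.
try:
    cities.map(str.strip)
    map_failed = False
except TypeError:
    map_failed = True
print(map_failed)
```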
4. Change case:
>>> df['city']
0 Beijing
1 SH
2 guangzhou
3 Shenzhen
4 shanghai
5 BEIJING
Name: city, dtype: object
>>> df['city']=df['city'].str.lower()
>>> df['city']
0 beijing
1 sh
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object
5. Change the data type (astype):
>>> df['price'].astype('int')
0 1200
1 3299
2 2133
3 5433
4 3299
5 4432
Name: price, dtype: int64
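The cast above works only because the NaN values were filled beforehand, and it truncates 3299.5 to 3299. A minimal sketch of both points, assuming a pandas version that provides the nullable `"Int64"` dtype:

```python
import numpy as np
import pandas as pd

prices = pd.Series([1200.0, 3299.5, np.nan])

# A plain integer dtype cannot represent NaN, so the cast raises.
try:
    prices.astype("int")
    cast_failed = False
except ValueError:
    cast_failed = True

# astype('int') drops the fractional part rather than rounding.
truncated = prices.dropna().astype("int")
# The nullable "Int64" dtype keeps missing values once the floats are whole numbers.
nullable = prices.round().astype("Int64")
print(cast_failed, truncated.tolist())
print(nullable)
```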
6. Rename columns (rename):
>>> df.columns
Index(['id', 'date', 'city', 'category', 'age', 'price'], dtype='object')
>>> df.rename(columns={'category': 'category-size'})
id date city category-size age price
0 1001 2013-01-02 beijing 100-A 23 1200.0
1 1002 2013-01-03 sh 100-B 44 3299.5
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 shenzhen 110-C 32 5433.0
4 1005 2013-01-06 shanghai 210-A 34 3299.5
5 1006 2013-01-07 beijing 130-F 32 4432.0
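Like most pandas methods, `rename` returns a new DataFrame, so the result has to be assigned to keep it. A small sketch with a made-up two-row table:

```python
import pandas as pd

items = pd.DataFrame({"id": [1001, 1002], "category": ["100-A", "100-B"]})

renamed = items.rename(columns={"category": "category-size"})
print(list(items.columns))    # the original column names are unchanged
print(list(renamed.columns))
```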
7. Drop duplicates, keeping the first occurrence:
>>> df['city']
0 beijing
1 sh
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object
>>> df['city'].drop_duplicates()
0 beijing
1 sh
2 guangzhou
3 shenzhen
4 shanghai
Name: city, dtype: object
8. Drop duplicates, keeping the last occurrence:
>>> df['city'].drop_duplicates(keep='last')
1 sh
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object
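Two related calls are sketched below: `duplicated()` exposes the mask that `drop_duplicates()` acts on, and `keep=False` drops every copy of a repeated value. The Series reuses the cleaned city values from above.

```python
import pandas as pd

cities = pd.Series(["beijing", "sh", "guangzhou", "shenzhen", "shanghai", "beijing"])

# True for every repeat after the first occurrence.
dup_mask = cities.duplicated()
# keep=False removes all copies of any value that appears more than once.
only_unique = cities.drop_duplicates(keep=False)
print(dup_mask.tolist())
print(only_unique.tolist())
```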
9. Replace values:
>>> df['city']
0 beijing
1 sh
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object
>>> df['city'].replace('sh', 'shanghai')
0 beijing
1 shanghai
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object
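`Series.replace` matches whole values, which is why 'shenzhen' and 'shanghai' are left alone above. The sketch below contrasts it with substring replacement through `.str.replace`; the mapping values are arbitrary examples.

```python
import pandas as pd

cities = pd.Series(["beijing", "sh", "shenzhen"])

# A dict replaces several whole values at once; "shenzhen" is untouched.
whole_value = cities.replace({"sh": "shanghai", "beijing": "bj"})
# str.replace works on substrings instead.
substring = cities.str.replace("sh", "SH", regex=False)
print(whole_value.tolist())
print(substring.tolist())
```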
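Data preprocessing
1. Merging data: the merge function
The parameters of merge are listed in the table below:
Parameter | Description |
---|---|
left | The left DataFrame in the merge |
right | The right DataFrame in the merge |
how | One of "inner", "outer", "left", "right"; defaults to "inner" |
on | Column name(s) to join on; must exist in both DataFrames |
left_on | Column(s) in the left DataFrame to use as join keys |
right_on | Column(s) in the right DataFrame to use as join keys |
left_index | Use the left DataFrame's row index as its join key |
right_index | Use the right DataFrame's row index as its join key |
sort | Sort the merged data by the join keys; defaults to False in current pandas releases |
suffixes | Tuple of strings appended to overlapping column names; defaults to ('_x', '_y'). If both DataFrames have a column named "data", the result contains "data_x" and "data_y" |
copy | Defaults to True. Setting it to False can, in some cases, avoid copying data into the result |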
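The `suffixes` behaviour can be checked with two small tables that share a "data" column. This is an illustrative sketch; `indicator=True` is not listed in the table above but is an additional `merge` parameter that adds a `_merge` column recording where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "data": [1, 2]})
right = pd.DataFrame({"key": ["a", "c"], "data": [10, 30]})

# Overlapping non-key columns receive the default suffixes _x and _y.
default_suffix = pd.merge(left, right, on="key", how="outer")
# Custom suffixes plus an indicator column showing the origin of each row.
custom_suffix = pd.merge(left, right, on="key", how="outer",
                         suffixes=("_left", "_right"), indicator=True)
print(list(default_suffix.columns))
print(custom_suffix)
```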
1.1 Merging tables: the on parameter
>>> df1=pd.DataFrame({'key':['b','b','a','a','b','a','c'],'data1':range(7)})
>>> df1
key data1
0 b 0
1 b 1
2 a 2
3 a 3
4 b 4
5 a 5
6 c 6
>>> df2=pd.DataFrame({'key':['a','b','d'],'data2':range(3)})
>>> df2
key data2
0 a 0
1 b 1
2 d 2
>>> pd.merge(df1,df2,on='key')
key data1 data2
0 b 0 1
1 b 1 1
2 b 4 1
3 a 2 0
4 a 3 0
5 a 5 0
1.2 Merging data: the left_on and right_on parameters
>>> df3=pd.DataFrame({'l_key':['b','b','a','a','b','a','c'],'data1':range(7)})
>>> df3
l_key data1
0 b 0
1 b 1
2 a 2
3 a 3
4 b 4
5 a 5
6 c 6
>>> df4=pd.DataFrame({'r_key':['a','b','d'],'data2':range(3)})
>>> df4
r_key data2
0 a 0
1 b 1
2 d 2
>>> pd.merge(df3,df4,left_on='l_key',right_on='r_key')
l_key data1 r_key data2
0 b 0 b 1
1 b 1 b 1
2 b 4 b 1
3 a 2 a 0
4 a 3 a 0
5 a 5 a 0
1.3 Merging data: the how parameter
>>> df2=pd.DataFrame({'key':['a','b','d'],'data2':range(3)})
>>> df2
key data2
0 a 0
1 b 1
2 d 2
>>> df1
key data1
0 b 0
1 b 1
2 a 2
3 a 3
4 b 4
5 a 5
6 c 6
>>> pd.merge(df1,df2,on='key',how='outer')
key data1 data2
0 b 0.0 1.0
1 b 1.0 1.0
2 b 4.0 1.0
3 a 2.0 0.0
4 a 3.0 0.0
5 a 5.0 0.0
6 c 6.0 NaN
7 d NaN 2.0
>>> df_inner = pd.merge(df1,df2,on='key',how='inner')
>>> df_inner
key data1 data2
0 b 0 1
1 b 1 1
2 b 4 1
3 a 2 0
4 a 3 0
5 a 5 0
>>> pd.merge(df1,df2,on='key',how='left')  # use only the keys from the left DataFrame
key data1 data2
0 b 0 1.0
1 b 1 1.0
2 a 2 0.0
3 a 3 0.0
4 b 4 1.0
5 a 5 0.0
6 c 6 NaN
>>> pd.merge(df1,df2,on='key',how='right')  # use only the keys from the right DataFrame
key data1 data2
0 b 0.0 1
1 b 1.0 1
2 b 4.0 1
3 a 2.0 0
4 a 3.0 0
5 a 5.0 0
6 d NaN 2
1.5 Merging data: the left_index and right_index parameters
>>> df7=pd.DataFrame({'key':['a','b','a','a','b','c'],'value':range(6)})
>>> df7
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
>>> df8=pd.DataFrame({'group_val':[3.5,7]},index=['a','b'])
>>> df8
group_val
a 3.5
b 7.0
>>> pd.merge(df7,df8,left_on='key',right_index=True)  # join the left key column against the right index
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
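The same column-to-index lookup can also be written with `DataFrame.join` and its `on` argument. The sketch below rebuilds `df7` and `df8` from above and compares the two results; it is an alternative spelling, not a different technique.

```python
import pandas as pd

df7 = pd.DataFrame({"key": ["a", "b", "a", "a", "b", "c"], "value": range(6)})
df8 = pd.DataFrame({"group_val": [3.5, 7]}, index=["a", "b"])

merged = pd.merge(df7, df8, left_on="key", right_index=True)
# join matches the "key" column of df7 against the index of df8.
joined = df7.join(df8, on="key", how="inner")

# Compare after sorting by row label, since the row order may differ between versions.
same = merged.sort_index().equals(joined.sort_index())
print(same)
```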
1.6 Merging data: many-to-many merges
>>> df5=pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})
>>> df5
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
>>> df6=pd.DataFrame({'key':['a','b','a','b','d'],'data2':range(5)})
>>> df6
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4
>>> pd.merge(df5,df6,how='outer')  # produces the Cartesian product of matching rows: the left DataFrame has 3 "b" rows and the right has 2, so the result has 6 "b" rows
key data1 data2
0 b 0.0 1.0
1 b 0.0 3.0
2 b 1.0 1.0
3 b 1.0 3.0
4 b 5.0 1.0
5 b 5.0 3.0
6 a 2.0 0.0
7 a 2.0 2.0
8 a 4.0 0.0
9 a 4.0 2.0
10 c 3.0 NaN
11 d NaN 4.0
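The row count of a many-to-many merge can be predicted from the key frequencies on each side. A minimal sketch using the same `df5` and `df6`, restricted to an inner merge so that only shared keys appear:

```python
import pandas as pd

df5 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"], "data1": range(6)})
df6 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"], "data2": range(5)})

left_counts = df5["key"].value_counts()
right_counts = df6["key"].value_counts()
# For keys present on both sides, the merge yields (left count) * (right count) rows.
expected = (left_counts * right_counts).dropna().astype(int)

merged = pd.merge(df5, df6, on="key", how="inner")
actual = merged["key"].value_counts()
print(expected.to_dict())
print(actual.to_dict())
```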
2. Set the index column
>>> df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006], "date":pd.date_range('20130102', periods=6), "city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],"age":[23,44,54,32,34,32],"category":['100-A','100-B','110-A','110-C','210-A','130-F'],"price":[1200,np.nan,2133,5433,np.nan,4432]},columns =['id','date','city','category','age','price'])
>>> df
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0
4 1005 2013-01-06 shanghai 210-A 34 NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
>>> df.set_index('id')
date city category age price
id
1001 2013-01-02 Beijing 100-A 23 1200.0
1002 2013-01-03 SH 100-B 44 NaN
1003 2013-01-04 guangzhou 110-A 54 2133.0
1004 2013-01-05 Shenzhen 110-C 32 5433.0
1005 2013-01-06 shanghai 210-A 34 NaN
1006 2013-01-07 BEIJING 130-F 32 4432.0
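Once a column is the index, rows can be looked up by label with `.loc`, and `reset_index()` undoes the change. A small sketch on a made-up table named `people`:

```python
import pandas as pd

people = pd.DataFrame({"id": [1001, 1002, 1003], "age": [23, 44, 54]})

by_id = people.set_index("id")     # returns a new DataFrame; people is unchanged
age_1002 = by_id.loc[1002, "age"]  # label-based lookup on the new index
restored = by_id.reset_index()     # turn the index back into an ordinary column
print(age_1002)
print(list(restored.columns))
```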
3. Sort by the values of a specific column:
>>> df.sort_values(by=['age'])
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
4 1005 2013-01-06 shanghai 210-A 34 NaN
1 1002 2013-01-03 SH 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
4. Sort by the index:
>>> df.set_index('age').sort_index()
id date city category price
age
23 1001 2013-01-02 Beijing 100-A 1200.0
32 1004 2013-01-05 Shenzhen 110-C 5433.0
32 1006 2013-01-07 BEIJING 130-F 4432.0
34 1005 2013-01-06 shanghai 210-A NaN
44 1002 2013-01-03 SH 100-B NaN
54 1003 2013-01-04 guangzhou 110-A 2133.0
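`sort_values` also accepts several columns, a per-column direction, and a position for missing values. The sketch below uses a made-up four-row table to show both options.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({"age": [23, 32, 32, 34],
                      "price": [1200, 5433, 4432, np.nan]})

# Sort by age descending, then by price ascending within equal ages.
ordered = table.sort_values(by=["age", "price"], ascending=[False, True])
# Put rows with a missing price first instead of last.
nan_first = table.sort_values(by="price", na_position="first")
print(ordered)
print(nan_first)
```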
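5. If the value in the price column is > 3000, the group column shows high, otherwise low (rows whose price is NaN also get low, because NaN > 3000 evaluates to False):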
>>> df['group'] = np.where(df['price'] > 3000,'high','low')
>>> df['group']
0 low
1 low
2 low
3 high
4 low
5 high
Name: group, dtype: object
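If missing prices should get a label of their own, `np.select` can test several conditions in order. This is one possible approach; the label `"unknown"` is an assumption made for the example.

```python
import numpy as np
import pandas as pd

price = pd.Series([1200, np.nan, 2133, 5433, np.nan, 4432])

# NaN > 3000 evaluates to False, so np.where alone labels missing prices "low".
two_way = np.where(price > 3000, "high", "low")

# np.select checks the conditions in order; the first match wins.
three_way = np.select([price.isnull(), price > 3000],
                      ["unknown", "high"],
                      default="low")
print(two_way.tolist())
print(three_way.tolist())
```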
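6. Flag rows that satisfy several conditions at once. Because this df holds the raw city values ('Beijing ', 'BEIJING '), no row equals 'beijing' and sign stays NaN everywhere: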
>>> df.loc[(df['city'] == 'beijing') & (df['price'] >= 4000),'sign']=1
>>> df
id date city category age price group sign
0 1001 2013-01-02 Beijing 100-A 23 1200.0 low NaN
1 1002 2013-01-03 SH 100-B 44 NaN low NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 low NaN
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0 high NaN
4 1005 2013-01-06 shanghai 210-A 34 NaN low NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 high NaN
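To make the city condition match, the comparison can run on a normalized copy of the column. A minimal sketch on three of the rows above; `city_clean` and `mask` are names introduced for the example.

```python
import pandas as pd

table = pd.DataFrame({"city": ["Beijing ", "Shenzhen", "BEIJING "],
                      "price": [1200.0, 5433.0, 4432.0]})

# Normalize a temporary copy of city so that 'Beijing ' and 'BEIJING ' both match.
city_clean = table["city"].str.strip().str.lower()
mask = (city_clean == "beijing") & (table["price"] >= 4000)
# Rows outside the mask are left as NaN in the new column.
table.loc[mask, "sign"] = 1
print(table)
```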
The next two small experiments use the table df_inner, which is built as follows:
>>> df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006], "date":pd.date_range('20130102', periods=6),"city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],"age":[23,44,54,32,34,32],"category":['100-A','100-B','110-A','110-C','210-A','130-F'],"price":[1200,np.nan,2133,5433,np.nan,4432]},columns =['id','date','city','category','age','price'])
>>> df1 = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006,1007,1008],"gender":['male','female','male','female','male','female','male','female'],"pay":['Y','N','Y','Y','N','Y','N','Y'],"m-point":[10,12,20,40,40,40,30,20]})
>>> df_inner=pd.merge(df,df1,how='inner')
>>> df_inner
id date city category ... price gender pay m-point
0 1001 2013-01-02 Beijing 100-A ... 1200.0 male Y 10
1 1002 2013-01-03 SH 100-B ... NaN female N 12
2 1003 2013-01-04 guangzhou 110-A ... 2133.0 male Y 20
3 1004 2013-01-05 Shenzhen 110-C ... 5433.0 female Y 40
4 1005 2013-01-06 shanghai 210-A ... NaN male N 40
5 1006 2013-01-07 BEIJING 130-F ... 4432.0 female Y 40
[6 rows x 9 columns]
7. Split each value of the category field and build a new table whose index is the index of df_inner and whose column names are category and size
>>> split = pd.DataFrame((x.split('-') for x in df_inner['category']), index=df_inner.index, columns=['category', 'size'])
>>> print(split.values)
[['100' 'A']
['100' 'B']
['110' 'A']
['110' 'C']
['210' 'A']
['130' 'F']]
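The same split can be written with the vectorized string accessor. A short sketch, using three of the category values from above:

```python
import pandas as pd

category = pd.Series(["100-A", "100-B", "110-A"], name="category")

# expand=True returns one column per piece and keeps the original index.
split = category.str.split("-", expand=True)
split.columns = ["category", "size"]
print(split)
```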
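8. Match the split table back to the original df_inner table. Because both tables contain a category column, the merged table names them category_x and category_y: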
>>> df_inner=pd.merge(df_inner,split,right_index=True, left_index=True)
>>> print(df_inner.values)
[[1001 Timestamp('2013-01-02 00:00:00') 'Beijing ' '100-A' 23 1200.0
'male' 'Y' 10 '100' 'A']
[1002 Timestamp('2013-01-03 00:00:00') 'SH' '100-B' 44 nan 'female' 'N'
12 '100' 'B']
[1003 Timestamp('2013-01-04 00:00:00') ' guangzhou ' '110-A' 54 2133.0
'male' 'Y' 20 '110' 'A']
[1004 Timestamp('2013-01-05 00:00:00') 'Shenzhen' '110-C' 32 5433.0
'female' 'Y' 40 '110' 'C']
[1005 Timestamp('2013-01-06 00:00:00') 'shanghai' '210-A' 34 nan 'male'
'N' 40 '210' 'A']
[1006 Timestamp('2013-01-07 00:00:00') 'BEIJING ' '130-F' 32 4432.0
'female' 'Y' 40 '130' 'F']]
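The printed values hide the column names, so the sketch below makes the name clash visible on a two-row stand-in for `df_inner`. Renaming the split column before joining is one way to avoid the suffixes; the name `category_code` is an arbitrary choice for the example.

```python
import pandas as pd

df_inner = pd.DataFrame({"id": [1001, 1002], "category": ["100-A", "100-B"]})
split = pd.DataFrame({"category": ["100", "100"], "size": ["A", "B"]},
                     index=df_inner.index)

# Both tables have a "category" column, so merge appends the default suffixes.
merged = pd.merge(df_inner, split, left_index=True, right_index=True)

# Renaming the split column first avoids the suffixes altogether.
renamed = df_inner.join(split.rename(columns={"category": "category_code"}))
print(list(merged.columns))
print(list(renamed.columns))
```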