Using pandas

Reading data from a CSV file and plotting it

import pandas as pd

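Read a CSV file. sep is the delimiter, encoding is the file encoding, parse_dates names the columns to parse as dates, dayfirst=True reads ambiguous dates as day-first, and index_col names the column to use as the index.

df = pd.read_csv('filename', sep=';', encoding='utf-8', parse_dates=['Date'], dayfirst=True, index_col='Date')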
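As a minimal sketch of the call above, the following reads a small in-memory CSV instead of a file on disk. The sample rows and the Rachel1 column are made up for illustration; only the read_csv parameters come from the text.

```python
import io

import pandas as pd

# Hypothetical sample data: semicolon-separated, with day-first dates.
csv_text = (
    "Date;Berri 1;Rachel1\n"
    "01/02/2012;35;16\n"
    "02/02/2012;83;43\n"
    "03/02/2012;135;58\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),   # a file path would go here in real use
    sep=";",                 # fields are separated by semicolons
    parse_dates=["Date"],    # parse this column as dates
    dayfirst=True,           # 01/02/2012 is 1 February, not 2 January
    index_col="Date",        # use the parsed dates as the index
)

print(df.index[0])
print(df["Berri 1"].sum())
```

With dayfirst=True the first row is dated 1 February 2012; without it, the same string would be read month-first.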

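To select a single column, index the DataFrame as you would a dictionary; here time is one of the columns. The object that read_csv returns is called a DataFrame.

df['time']
Call plot() on a column to plot it against the index:
df['Berri 1'].plot()
df.plot(figsize=(15, 10))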

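To select several columns, pass a list of column names:
df[['event', 'time']][:10]  # first ten rows of the event and time columns
The value_counts() function tallies a column, listing each distinct value and how often it occurs.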
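A short sketch of value_counts() on a made-up event column (the data is hypothetical) may make the description concrete:

```python
import pandas as pd

# Hypothetical event log; only the column name follows the text.
events = pd.DataFrame(
    {"event": ["access", "video", "access", "problem", "access", "video"]}
)

# One row per distinct value, sorted from most to least frequent.
counts = events["event"].value_counts()
print(counts)
```

The result is a Series indexed by the distinct values, so counts["access"] looks up a single tally.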
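Conditional row and column selection on large CSV data
There are two ways to filter by condition. The first is to write each condition separately and combine them with &:
count_server_1 = df['source'] == 'server'
count_access_1 = df['event'] == 'access'
access_server = df[count_server_1 & count_access_1][:10][['time', 'enrollment_id']]
The first two lines above are comparisons; each produces a Series of True and False values. The columns to keep can be chosen in the same expression (note the double brackets).
The second is the direct way:
count_access = df[df['event'] == 'access']  # note that df appears twice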
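Both filtering styles can be sketched on a tiny, made-up log. The rows are hypothetical; the column names follow the snippets above.

```python
import pandas as pd

# Hypothetical log data for illustration.
log = pd.DataFrame(
    {
        "enrollment_id": [1, 1, 2, 3, 3],
        "time": ["t1", "t2", "t3", "t4", "t5"],
        "source": ["server", "browser", "server", "server", "browser"],
        "event": ["access", "access", "video", "access", "problem"],
    }
)

# Style 1: name each boolean mask, then combine them with &.
is_server = log["source"] == "server"
is_access = log["event"] == "access"
both = log[is_server & is_access][["time", "enrollment_id"]]

# Style 2: write the conditions inline; each comparison needs parentheses
# because & binds more tightly than ==.
same = log[(log["source"] == "server") & (log["event"] == "access")][
    ["time", "enrollment_id"]
]

print(both)
```

Naming the masks tends to read better when the same condition is reused; the inline form is shorter for one-off filters.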
Counting column values and converting types
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
noise_complaints = complaints[is_noise]
noise_complaints['Borough'].value_counts()
Output:
MANHATTAN 917
BROOKLYN 456
BRONX 292
QUEENS 226
STATEN ISLAND 36
Unspecified 1
dtype: int64

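noise_complaint_counts = noise_complaints['Borough'].value_counts()
complaint_counts = complaints['Borough'].value_counts()

Convert the type and compute the ratio: the number of noise complaints in each borough divided by the total number of complaints in that borough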
noise_complaint_counts / complaint_counts.astype(float)
BRONX 0.014833
BROOKLYN 0.013864
MANHATTAN 0.037755
QUEENS 0.010143
STATEN ISLAND 0.007474
Unspecified 0.000141
dtype: float64
Note the kind='bar' argument, which draws a bar chart:
(noise_complaint_counts / complaint_counts.astype(float)).plot(kind='bar')
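The ratio works because pandas aligns the two Series on their index labels before dividing, so the two value_counts() results do not need to be in the same order. A minimal sketch with made-up complaint rows (plotting omitted):

```python
import pandas as pd

# Hypothetical complaint records; column names follow the text.
complaints = pd.DataFrame(
    {
        "Complaint Type": [
            "Noise - Street/Sidewalk",
            "Heating",
            "Noise - Street/Sidewalk",
            "Heating",
            "Heating",
            "Noise - Street/Sidewalk",
        ],
        "Borough": [
            "MANHATTAN",
            "MANHATTAN",
            "BROOKLYN",
            "BROOKLYN",
            "BROOKLYN",
            "MANHATTAN",
        ],
    }
)

noise = complaints[complaints["Complaint Type"] == "Noise - Street/Sidewalk"]
noise_counts = noise["Borough"].value_counts()
total_counts = complaints["Borough"].value_counts()

# Division matches rows by borough name, not by position.
ratio = noise_counts / total_counts
print(ratio)
```

In Python 3 the / operator on integer Series already yields floats, so the astype(float) cast is only needed for older setups.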

import matplotlib.pyplot as plt  # call plt.show() to display the figure
result['enrollment_id'].plot()
plt.show()

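Adding and assigning columns, and working with time
berri_bikes = bikes[['Berri 1']].copy()  # double brackets select a one-column DataFrame rather than a Series; copy() lets a new column be added safely
When working with time, the day and weekday attributes of the index pick out the corresponding parts of each date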

berri_bikes.index.day
Out[6]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, 31, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 1,
2, 3, 4, 5], dtype=int32)

Add a new column by assigning to it directly:
berri_bikes['weekday'] = berri_bikes.index.weekday
Group the rows with groupby('weekday') and total each group with aggregate(sum):

weekday_counts = berri_bikes.groupby('weekday').aggregate(sum)
weekday_counts
Out[9]:
Berri 1
weekday
0 134298
1 135305
2 152972
3 160131
4 141771
5 101578
6 99310
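The weekday grouping can be reproduced on a small synthetic series. The counts below are made up, and .sum() is used as the shorthand for aggregate(sum).

```python
import pandas as pd

# Two weeks of hypothetical daily counts; 2012-01-02 is a Monday.
idx = pd.date_range("2012-01-02", periods=14, freq="D")
bikes = pd.DataFrame({"Berri 1": range(1, 15)}, index=idx)

berri = bikes[["Berri 1"]].copy()        # one-column DataFrame
berri["weekday"] = berri.index.weekday   # 0 = Monday ... 6 = Sunday

# Total per weekday across both weeks.
weekday_counts = berri.groupby("weekday").sum()
print(weekday_counts)
```

Each weekday appears twice in this sample, so Monday (0) totals 1 + 8 = 9 and Sunday (6) totals 7 + 14 = 21.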

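Dropping columns, grouping by hour, replacing text in column names, and dropping columns with missing values
To replace text in the column names, treat columns as a list:
weather_mar2012.columns = [s.replace(u'\xb0', '') for s in weather_mar2012.columns]
The line above removes the degree sign from every column name

Drop the columns that contain missing values:
weather_mar2012 = weather_mar2012.dropna(axis=1, how='any')  # axis=1 means columns; how='any' drops a column if any of its values is missing

Drop specific columns
weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day'], axis=1)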
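The three clean-up steps can be sketched together on a made-up two-row table. The values and the Wind Chill column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical weather rows; one column has a missing value.
weather = pd.DataFrame(
    {
        "Year": [2012, 2012],
        "Month": [3, 3],
        "Day": [1, 1],
        "Temp (\xb0C)": [-5.5, -5.7],
        "Wind Chill": [np.nan, -13.0],
    }
)

# 1. Strip the degree sign from every column name.
weather.columns = [c.replace("\xb0", "") for c in weather.columns]

# 2. Drop any column that has at least one missing value.
weather = weather.dropna(axis=1, how="any")

# 3. Drop columns that duplicate information held elsewhere.
weather = weather.drop(["Year", "Month", "Day"], axis=1)

print(list(weather.columns))
```

Only the renamed temperature column survives: Wind Chill is removed by dropna and the date parts by drop.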
Group by hour, working on a new DataFrame object
import numpy as np
temp = weather_mar2012[[u'Temp (C)']].copy()
temp['Hour'] = weather_mar2012.index.hour  # hour of day taken from the index
temp.groupby('Hour').aggregate(np.median).plot()
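A minimal sketch of the hourly grouping, assuming a datetime index and made-up temperatures; .median() stands in for aggregate(np.median), and the plot call is left out.

```python
import pandas as pd

# Hypothetical readings at midnight and noon over three days.
idx = pd.to_datetime(
    [
        "2012-03-01 00:00",
        "2012-03-01 12:00",
        "2012-03-02 00:00",
        "2012-03-02 12:00",
        "2012-03-03 00:00",
        "2012-03-03 12:00",
    ]
)
weather = pd.DataFrame(
    {"Temp (C)": [1.0, 10.0, 3.0, 14.0, 2.0, 12.0]}, index=idx
)

temp = weather[["Temp (C)"]].copy()
temp["Hour"] = weather.index.hour   # 0 or 12 in this sample

# Median temperature for each hour of the day.
hourly_median = temp.groupby("Hour").median()
print(hourly_median)
```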

Concatenating datasets and writing a CSV file
weather_2012 = pd.concat(data_by_month)
weather_2012.to_csv('weather_2012.csv')
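As a sketch of the same round trip, the example below concatenates two made-up monthly frames and writes the CSV to a string rather than to weather_2012.csv, so it leaves no file behind. The data is hypothetical.

```python
import io

import pandas as pd

# Hypothetical per-month frames standing in for data_by_month.
jan = pd.DataFrame(
    {"Temp (C)": [-1.0, -2.0]},
    index=pd.to_datetime(["2012-01-01", "2012-01-02"]),
)
feb = pd.DataFrame(
    {"Temp (C)": [0.5]},
    index=pd.to_datetime(["2012-02-01"]),
)
data_by_month = [jan, feb]

# Stack the frames on top of each other, keeping their indexes.
weather_2012 = pd.concat(data_by_month)

# With no path, to_csv() returns the CSV text instead of writing a file.
csv_text = weather_2012.to_csv()

# Reading it back recovers the dates as the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
print(restored)
```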

String operations
weather_description = weather_2012['Weather']  # single brackets return a Series, which provides the .str methods
is_snowing = weather_description.str.contains('Snow')  # a Series of True and False values
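A small sketch of str.contains on made-up weather descriptions shows the boolean result and why its mean is a fraction:

```python
import pandas as pd

# Hypothetical weather descriptions.
weather_description = pd.Series(
    ["Snow", "Cloudy", "Snow Showers", "Rain", "Freezing Drizzle,Snow"]
)

# True wherever the text contains the substring "Snow".
is_snowing = weather_description.str.contains("Snow")
print(is_snowing.tolist())

# True counts as 1.0 and False as 0.0, so the mean is the share of snowy rows.
print(is_snowing.astype(float).mean())
```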

Resample at a fixed time interval; here the data is resampled by month, taking the median of each month
weather_2012['Temp (C)'].resample('M').median().plot(kind='bar')

The percentage of time it was snowing each month: converting the booleans to float turns them into 0 and 1, so the monthly mean is exactly that fraction
is_snowing.astype(float).resample('M').mean().plot(kind='bar')

Combine the two results into a single dataset
temperature = weather_2012['Temp (C)'].resample('M').median()
is_snowing = weather_2012['Weather'].str.contains('Snow')
snowiness = is_snowing.astype(float).resample('M').mean()
Name the columns; each Series must be given a name
temperature.name = "Temperature"
snowiness.name = "Snowiness"
stats = pd.concat([temperature, snowiness], axis=1)  # concat places the Series side by side; axis=1 means join as columns
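The monthly statistics can be sketched end to end on a few made-up rows. This sketch assumes a datetime index and uses the 'MS' (month start) frequency, because the spelling of the month-end alias differs between pandas versions; the resulting rows are therefore labelled with the first day of each month.

```python
import pandas as pd

# Hypothetical observations spread over two months.
idx = pd.to_datetime(
    ["2012-01-10", "2012-01-20", "2012-01-30", "2012-02-10", "2012-02-20"]
)
weather_2012 = pd.DataFrame(
    {
        "Temp (C)": [-5.0, -3.0, -1.0, 2.0, 4.0],
        "Weather": ["Snow", "Cloudy", "Snow", "Rain", "Snow"],
    },
    index=idx,
)

# Median temperature per month.
temperature = weather_2012["Temp (C)"].resample("MS").median()

# Fraction of observations per month that mention snow.
is_snowing = weather_2012["Weather"].str.contains("Snow")
snowiness = is_snowing.astype(float).resample("MS").mean()

# The Series names become the column names after concat.
temperature.name = "Temperature"
snowiness.name = "Snowiness"
stats = pd.concat([temperature, snowiness], axis=1)
print(stats)
```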

The unique() function returns the distinct values in a column, which shows how many different kinds there are
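A brief sketch on a made-up Series; nunique() is shown alongside as the direct way to get the count.

```python
import pandas as pd

# Hypothetical event names.
events = pd.Series(["access", "video", "access", "problem", "video"])

# Distinct values, in order of first appearance.
kinds = events.unique()
print(list(kinds))

# Number of distinct values.
n_kinds = events.nunique()
print(n_kinds)
```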
