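I. Reading and writing files

xlsx: a workbook, comparable to a folder

sheet: a worksheet, i.e. one table inside the workbook

import pandas as pd

detail = pd.read_excel('data/meal_order_detail.xlsx')
print(detail.shape)    # the first sheet is read by default

To read a different sheet, pass the sheet_name parameter with the zero-based position of the sheet you want.

detail_sheet2 = pd.read_excel('data/meal_order_detail.xlsx', sheet_name=1)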
print(detail_sheet2.shape)
print(detail_sheet2.describe())

II. Merging multiple sheets into one table

The sheets of an Excel workbook can be listed with pd.ExcelFile. (The dedicated xlrd library also does this for legacy .xls files, but xlrd 2.0 and later no longer read .xlsx.)

import pandas as pd

file_path = 'data/meal_order_detail.xlsx'
# 1. Open the workbook
wb = pd.ExcelFile(file_path)
# 2. Get the list of sheet names
sheets = wb.sheet_names
# print(sheets)    # ['meal_order_detail1', 'meal_order_detail2', 'meal_order_detail3']
# 3. Merge into a single table
frames = []
for i in range(len(sheets)):
    sheet = pd.read_excel(file_path, sheet_name=i, skiprows=0)
    # print(type(sheet))
    print(sheet.shape)
    frames.append(sheet)
total = pd.concat(frames)    # DataFrame.append was removed in pandas 2.0
print(total.shape)
# 4. Save; the with-block writes and closes the file on exit
with pd.ExcelWriter('detail.xlsx') as writer:
    total.to_excel(writer, sheet_name='Sheet1')
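
As an alternative sketch of the merge step: pd.read_excel(path, sheet_name=None) returns a dict that maps each sheet name to a DataFrame, and pd.concat can combine those in one call. To stay runnable without the workbook, the example below builds such a dict by hand; the sheet names and values are made up for illustration.

```python
import pandas as pd

# Stand-in for: sheets = pd.read_excel('data/meal_order_detail.xlsx', sheet_name=None)
# (hypothetical data; the real call returns {sheet name: DataFrame})
sheets = {
    'detail1': pd.DataFrame({'order_id': [417, 417], 'dishes_name': ['a', 'b']}),
    'detail2': pd.DataFrame({'order_id': [458], 'dishes_name': ['c']}),
}

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each sheet's index
total = pd.concat(list(sheets.values()), ignore_index=True)
print(total.shape)
```

Without ignore_index=True the merged table keeps each sheet's own row labels, so labels repeat across sheets.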

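III. Adding, deleting, modifying, and querying data with pandas

(A) Querying

1. loc[row label or condition, column label]

Range: closed interval (both endpoints of a label slice are included)

A label may be one that does not exist yet (this is how rows and columns are added)

# Select table data with a condition
# Read the dishes_name and order_id columns of the rows where order_id == 458
result = detail.loc[detail['order_id']==458,['dishes_name','order_id']]
print(result)
print(result.describe())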
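
A minimal sketch of the two loc forms described above, on a made-up four-row stand-in for the detail table (the values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the detail table
detail = pd.DataFrame({
    'order_id': [417, 417, 458, 458],
    'dishes_name': ['rice', 'soup', 'tea', 'fish'],
})

# Label slice: rows labelled 1 through 3, both endpoints included
by_label = detail.loc[1:3, ['dishes_name', 'order_id']]

# Boolean condition: only rows where order_id equals 458
by_cond = detail.loc[detail['order_id'] == 458, ['dishes_name', 'order_id']]

print(by_label)
print(by_cond)
```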

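2. iloc[row position, column position]

Range: half-open (start included, end excluded)

A position must already exist

(1) The rows you want are contiguous: use a slice

# Read the first five rows (positions 0 to 4) and the columns at positions 1 and 5
result2 = detail.iloc[0:5,[1,5]]
print(result2)
print(result2.describe())

(2) The rows you want are not contiguous

Method 1: row positions

result2 = detail.iloc[[0,2,3,4],[1,5]]
print(result2)
print(result2.describe())

Method 2: a condition

This works like a mask. iloc does not accept a boolean Series directly, so .values converts the condition to a plain NumPy array.

print((detail['order_id'] == 417).values)
result2 = detail.iloc[(detail['order_id'] == 417).values,[1,5]]
print(result2)
print(result2.describe())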
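
The half-open range and the mask form can be checked on a small made-up table (hypothetical values; to_numpy() is used here as the current spelling of .values):

```python
import pandas as pd

detail = pd.DataFrame({
    'order_id': [417, 458, 417, 458],
    'dishes_name': ['rice', 'tea', 'soup', 'fish'],
})

# Position slice 1:3 selects positions 1 and 2; position 3 is excluded
by_pos = detail.iloc[1:3, [0, 1]]

# iloc needs a plain boolean array, hence to_numpy() on the condition
mask = (detail['order_id'] == 417).to_numpy()
by_mask = detail.iloc[mask, [1]]

print(by_pos)
print(by_mask)
```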

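(B) Modifying

loc/iloc: lookup

Modifying is a lookup followed by an assignment.

1. A single cell

Lookup

# order is the order-info table, read from the same file used in the time-data section
order = pd.read_csv('data/meal_order_info.csv', encoding='gbk')
cell_00 = order.iloc[0,0]
print(cell_00)

Example

Find the cell whose name is 苗宇怡 and change 苗宇怡 to 老苗

cell_0_21 = order.iloc[0,20]    # row 0, column position 20 holds the name
print(cell_0_21)
order.iloc[0,20] = '老苗'
print(order.iloc[0,20])

2. Batch modification

Example

Change every emp_id equal to 982 to 98200

order.loc[order['emp_id']==982,'emp_id'] = 98200
print(order.loc[order['emp_id']==98200,'emp_id'])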
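
A small sketch of the batch update on made-up data: building the mask first makes it easy to check how many rows will change, and assigning through a single loc call avoids chained indexing such as order[mask]['emp_id'] = ..., which may write to a temporary copy.

```python
import pandas as pd

# Hypothetical miniature of the order table
order = pd.DataFrame({'emp_id': [982, 1001, 982], 'name': ['a', 'b', 'c']})

mask = order['emp_id'] == 982      # rows to change
print(mask.sum())                  # how many rows will be modified
order.loc[mask, 'emp_id'] = 98200  # assign through loc in a single step

print(order)
```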

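(C) Deleting

Format: table_object.function()

1. drop()

(1) Deleting rows

def drop(self, labels, axis=0, level=None, inplace=False, errors='raise'):

axis=0 deletes rows, axis=1 deletes columns

inplace: whether the original data (order) is modified; the default False leaves it unchanged and returns a new object

# Specify the index labels directly
print(order.shape)
order.drop(labels=[0,1,2],axis=0,inplace=True)
print(order.shape)
# An Index object (such as the result of series.index[mask]) can also be passed as labels
# See the data cleaning section later for the usage

(2) Deleting columns

# (1) Delete one column
print(order.shape)
print(order.columns)
order.drop(labels='mode',axis=1,inplace=True)
print(order.shape)
print(order.columns)

# (2) Delete several columns
# 'mode' must still be present here; if (1) above already removed it, drop raises KeyError
print(order.shape)
order.drop(labels=['mode','check_closed'],axis=1,inplace=True)
print(order.shape)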
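
The effect of inplace can be seen on a tiny made-up table (hypothetical columns and values):

```python
import pandas as pd

order = pd.DataFrame({'emp_id': [1, 2, 3], 'mode': [None, None, None], 'name': ['a', 'b', 'c']})

# inplace=False (the default): the original is untouched, a new DataFrame is returned
trimmed = order.drop(labels='mode', axis=1)
print(order.shape, trimmed.shape)

# columns= is an equivalent spelling of labels=..., axis=1
same = order.drop(columns=['mode'])

# inplace=True: the original itself changes and the call returns None
returned = order.drop(labels=[0, 1], axis=0, inplace=True)
print(order.shape, returned)
```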

2. dropna()

def dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False):

axis=0 drops rows, axis=1 drops columns

how='any': drop the row or column if at least one of its values is missing

how='all': drop the row or column only if all of its values are missing

print(order.shape)
order.dropna(axis=1,how='all',inplace=True)
print(order.shape)
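
A short sketch contrasting how='any' and how='all' on a made-up frame with one entirely empty column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, np.nan],   # entirely empty column
    'c': [1.0, 2.0, 3.0],
})

all_cols = df.dropna(axis=1, how='all')   # drops only 'b'
any_cols = df.dropna(axis=1, how='any')   # drops 'a' and 'b'
any_rows = df.dropna(axis=0, how='any')   # every row has a NaN in 'b', so none survive

print(list(all_cols.columns), list(any_cols.columns), len(any_rows))
```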

3. drop_duplicates(): removes duplicate rows

def drop_duplicates(self, subset=None, keep='first', inplace=False):

print(order.shape)
order.drop_duplicates(inplace=True)
print(order.shape)
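
The subset and keep parameters in the signature above control which columns are compared and which duplicate is kept. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'emp_id': [1, 1, 2, 1], 'amount': [10, 10, 20, 30]})

full = df.drop_duplicates()                                    # rows identical in every column
by_emp = df.drop_duplicates(subset=['emp_id'])                 # first row per emp_id
last_emp = df.drop_duplicates(subset=['emp_id'], keep='last')  # last row per emp_id

print(len(full), len(by_emp))
print(last_emp)
```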

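(D) Adding

Starting from the existing table, you add either a row or a column.

1. Adding a column

Example

Insert the time difference into the table as a new column

order['lock_time'] = pd.to_datetime(order['lock_time'])
order['use_start_time'] = pd.to_datetime(order['use_start_time'])
delta_time = order['lock_time']-order['use_start_time']
order['delta_time'] = delta_time
print(order)

When every element of the column is the same, a single scalar can be assigned directly.

order['test'] = 1
print(order['test'])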
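
Plain assignment always appends the new column at the end. If the position matters, DataFrame.insert can place it elsewhere. A sketch on made-up timestamps (the delta_minutes column name is hypothetical):

```python
import pandas as pd

order = pd.DataFrame({
    'use_start_time': pd.to_datetime(['2016-08-01 11:05', '2016-08-01 11:15']),
    'lock_time': pd.to_datetime(['2016-08-01 11:11', '2016-08-01 11:31']),
})

delta = order['lock_time'] - order['use_start_time']

# Plain assignment appends the column at the end
order['delta_time'] = delta

# insert() places a new column at a chosen position (0 = first column)
order.insert(0, 'delta_minutes', delta.dt.total_seconds() / 60)

print(order.columns.tolist())
```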

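2. Adding a row

print(order.shape[0])
# One value per column; size the list from the table instead of hard-coding 21
new_row = list(range(order.shape[1]))
# Use a label that does not exist yet, otherwise an existing row is overwritten
order.loc[order.index.max() + 1, :] = new_row
print(order.shape[0])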
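
Another way to add a row, sketched here on made-up data, is to build a one-row DataFrame and combine it with pd.concat. This keeps column names explicit instead of relying on the order of a list.

```python
import pandas as pd

# Hypothetical miniature of the order table
order = pd.DataFrame({'emp_id': [982, 1001], 'name': ['a', 'b']})

# Build the new row as a one-row DataFrame with matching column names
new_row = pd.DataFrame([{'emp_id': 1002, 'name': 'c'}])

# ignore_index=True gives the combined table a fresh 0..n-1 index
order = pd.concat([order, new_row], ignore_index=True)
print(order)
```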

IV. Handling time data

import pandas as pd

order = pd.read_csv('data/meal_order_info.csv', encoding='gbk')    # DataFrame

1. Type conversion

start_time = order['use_start_time']    # Series
# print(start_time[0],type(start_time[0]))    # 2016/8/1 11:05 <class 'str'>
# To make date-time operations convenient, convert the string values to a datetime type
result = pd.to_datetime(start_time)    # Series
# print(result[0],type(result[0]))    # 2016-08-01 11:05:00 <class 'pandas._libs.tslib.Timestamp'>

# At this point the use_start_time column of order still holds str values
# print(type(order['use_start_time'][0]))    # <class 'str'>
# Assign the converted data back to the use_start_time column of order
order['use_start_time'] = result
# print(type(order['use_start_time'][0]))    # <class 'pandas._libs.tslib.Timestamp'>

# Handle the lock_time column the same way
lock_time = order['lock_time']
order['lock_time'] = pd.to_datetime(lock_time)
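
When the column may contain malformed entries, pd.to_datetime can be given an explicit format and errors='coerce'. The sketch below uses made-up strings in the same 2016/8/1 11:05 layout as the sample output above; the bad entry is hypothetical.

```python
import pandas as pd

raw = pd.Series(['2016/8/1 11:05', '2016/8/1 11:15', 'not a date'])

# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, format='%Y/%m/%d %H:%M', errors='coerce')

print(parsed.dtype)
print(parsed.isna().sum())
```

Counting the NaT values afterwards is a quick way to see how many rows failed to parse.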

2. Calculations

(1) Common attributes and methods

year
month
day
week
weekday()
day_name()
…

# Year
# for i in order['lock_time']:
#     print(i.year)
years = [i.year for i in order['lock_time']]
print(years)

# week: the week number within the year
weeks = [i.week for i in order['lock_time']]
print(weeks)

# weekday(): day of the week as a number
# Monday is 0, Tuesday is 1, ...
weekdays = [i.weekday() for i in order['lock_time']]
print(weekdays)

# day_name(): day of the week as a name (replaces the weekday_name attribute of older pandas)
# Monday, Tuesday, ...
weekday_names = [i.day_name() for i in order['lock_time']]
print(weekday_names)
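
The same values can be obtained without a Python loop through the Series .dt accessor. A sketch on two made-up timestamps, assuming pandas 1.1 or later for dt.isocalendar():

```python
import pandas as pd

lock_time = pd.Series(pd.to_datetime(['2016-08-01 11:11', '2016-08-06 20:30']))

years = lock_time.dt.year                  # calendar year
weeks = lock_time.dt.isocalendar().week    # ISO week number of the year
weekdays = lock_time.dt.weekday            # Monday = 0, ..., Sunday = 6
names = lock_time.dt.day_name()            # 'Monday', 'Tuesday', ...

print(years.tolist(), weeks.tolist(), weekdays.tolist(), names.tolist())
```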

(2) Aggregate functions

# min()/max()
print(order['lock_time'].min())
print(order['lock_time'].max())
# Time difference
delta = order['lock_time'].max()-order['lock_time'].min()
print(delta)
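
The difference of two timestamps is a Timedelta. A brief sketch, on made-up timestamps, of turning it into plain numbers:

```python
import pandas as pd

lock_time = pd.Series(pd.to_datetime(['2016-08-01 11:11', '2016-08-01 12:00', '2016-08-02 11:41']))

span = lock_time.max() - lock_time.min()   # a single pandas Timedelta
print(span)

print(span.days)                    # whole days only
print(span.total_seconds() / 3600)  # full span expressed in hours
```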

Exercise

Compute how long each order took to place

delta_time = order['lock_time']-order['use_start_time']
# print(delta_time)
print('Min:',delta_time.min())    # Min: -1 days +00:05:00
print('Max:',delta_time.max())    # Max: 16 days 00:08:00

Clearly, some of the records are invalid.

This calls for data preprocessing, that is, data cleaning.

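V. Data cleaning

Data cleaning covers:

Removing invalid records
Removing records with missing values
Removing duplicate records

# Inspect the data to be processed: a lock_time earlier than use_start_time is clearly invalid
print(type(delta_time[0]))    # <class 'pandas._libs.tslib.Timedelta'>
# index[]: returns the index labels of the rows that match the condition
result1 = delta_time.index[(delta_time<pd.to_timedelta(0))]
print(result1)

# A gap of more than one day between lock_time and use_start_time is also invalid
print(delta_time.dt.days)
result2 = delta_time.index[(delta_time.dt.days>0)]
print(result2)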

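1. Deleting separately

print(order.shape)
order.drop(labels=result1,axis=0,inplace=True)
print(order.shape)
order.drop(labels=result2,axis=0,inplace=True)
print(order.shape)

With more conditions, this approach quickly becomes clumsy.

2. Deleting in one pass

# An alternative to step 1: run it on the uncleaned table, because drop raises KeyError for labels that are already gone
result = delta_time.index[(delta_time<pd.to_timedelta(0))|(delta_time.dt.days>0)]
print(order.shape)
order.drop(labels=result,axis=0,inplace=True)
print(order.shape)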
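
The same cleaning can also be written as a filter that keeps the valid rows, instead of collecting the labels of the invalid ones and dropping them. A sketch on three made-up orders:

```python
import pandas as pd

order = pd.DataFrame({
    'use_start_time': pd.to_datetime(['2016-08-01 11:05', '2016-08-01 12:00', '2016-08-01 13:00']),
    'lock_time': pd.to_datetime(['2016-08-01 11:11', '2016-08-01 11:55', '2016-08-17 13:08']),
})
delta_time = order['lock_time'] - order['use_start_time']

# Keep rows whose duration is non-negative and shorter than one day
valid = (delta_time >= pd.Timedelta(0)) & (delta_time < pd.Timedelta(days=1))
cleaned = order[valid]

print(order.shape, cleaned.shape)
```

Because nothing is modified in place, the filter can be re-run safely while the conditions are being tuned.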

Result

This time the data looks much more reasonable.

print('~~~~~~~~~~~~~~~~~~~~~~~~~~ after cleaning ~~~~~~~~~~~~~~~~~~~~~~~~~~')
delta_time = order['lock_time'] - order['use_start_time']
print('Min:',delta_time.min())    # Min: 0 days 00:00:00
print('Max:',delta_time.max())    # Max: 0 days 00:55:00

# Average ordering time
print('Mean:',delta_time.mean())    # Mean: 0 days 00:11:20.639470
# Use: in catering, this relates to the table turnover rate

VI. Statistical functions

These work the same way as in NumPy.

import numpy as np
import pandas as pd

arr = np.array([1,2,3])
print(np.sum(arr))
print(arr.sum())

order = pd.read_csv('data/meal_order_info.csv', encoding='gbk')
# print('Revenue:',order['accounts_payable'].sum())
print('Revenue:',np.sum(order['accounts_payable']))
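
One difference worth knowing when a column has missing values: the pandas sum() skips NaN by default, while summing the underlying NumPy array does not. A sketch with made-up amounts:

```python
import numpy as np
import pandas as pd

payable = pd.Series([100.0, np.nan, 50.0])    # hypothetical amounts, one missing

print(payable.sum())                   # pandas skips NaN by default
print(payable.to_numpy().sum())        # the raw NumPy array propagates NaN
print(np.nansum(payable.to_numpy()))   # NumPy's NaN-ignoring variant
```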

Exercise

Multi-condition queries: | and &

To leave out the original index when writing a table: index=False

import pandas as pd

# Given a users.xlsx file, do the following:
# 1. Open the file and report the number of rows and columns in the table
file_path = 'users.xlsx'
users = pd.read_excel(file_path)
print(users.shape)

# 2. Change every record with "NAME" = "admin" to "Administrator"
users.loc[users['NAME']=='admin','NAME'] = 'Administrator'
print(users.loc[users['NAME']=='Administrator',:])

# 3. The table has "LAST_VISITS" and "FIRST_VISIT" fields, the last and first login times; compute the interval between them.
# pd.to_datetime guards against the columns having been read as strings
delta_time = pd.to_datetime(users['LAST_VISITS']) - pd.to_datetime(users['FIRST_VISIT'])
print(delta_time)

# 4. Save the records of men younger than 30 to a separate file named "nan_31.xlsx"
# users_age = users.loc[users['age']<30,:]
# new_users = users_age.loc[users_age['sex']=='男',:]

new_users = users.loc[(users['sex']=='男')&(users['age']<30),:]
print(new_users.shape)
with pd.ExcelWriter('nan_31.xlsx') as writer:
    new_users.to_excel(writer, sheet_name='Sheet1', index=False)
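
A self-contained sketch of the two hints above, on a made-up users table. It writes CSV text rather than an Excel file so that it runs without users.xlsx or an Excel engine.

```python
import pandas as pd

# Hypothetical miniature of the users table
users = pd.DataFrame({
    'NAME': ['admin', 'li', 'wang', 'zhao'],
    'sex': ['男', '女', '男', '男'],
    'age': [25, 22, 41, 29],
})

# Each condition needs its own parentheses: & and | bind tighter than == and <
young_men = users.loc[(users['sex'] == '男') & (users['age'] < 30), :]
either = users.loc[(users['sex'] == '女') | (users['age'] > 40), :]

# index=False leaves the row index out of the output
csv_text = young_men.to_csv(index=False)
print(csv_text)
```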

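Data analysis (5): pandas (file reading and writing, merging tables, adding/deleting/modifying/querying, data cleaning, time data handling, statistical functions)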