6.1 Reading and Writing CSV Data
For example, write a file named test.csv and then open it:
>>> import csv
>>> headers = ['class','name','sex','height','year']
>>> rows = [[1,'xiaoming','male',168,23],[1,'xiaohong','female',162,22],[2,'xiaozhang','female',163,21],[2,'xiaoli','male',158,21]]
>>> with open('test.csv', 'w', newline='') as f:  # newline='' is what the csv docs recommend
...     f_csv = csv.writer(f)
...     f_csv.writerow(headers)
...     f_csv.writerows(rows)
...
28
After writing, open the file to check its contents.
# Read
>>> with open('test.csv') as f:
...     f_csv = csv.reader(f)
...     headers = next(f_csv)  # consume the header row so the loop skips it
...     for row in f_csv:
...         print("row =", row)
...
row = ['1', 'xiaoming', 'male', '168', '23']
row = ['1', 'xiaohong', 'female', '162', '22']
row = ['2', 'xiaozhang', 'female', '163', '21']
row = ['2', 'xiaoli', 'male', '158', '21']
# Prefer this approach (csv.reader)
>>> with open('test.csv') as f:
...     f_csv = csv.reader(f)  # the header row is read as well
...     for row in f_csv:
...         print("row =", row)
...
row = ['class', 'name', 'sex', 'height', 'year']
row = ['1', 'xiaoming', 'male', '168', '23']
row = ['1', 'xiaohong', 'female', '162', '22']
row = ['2', 'xiaozhang', 'female', '163', '21']
row = ['2', 'xiaoli', 'male', '158', '21']
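The approach below is also common, but note that it leaves the details to you. If a field is enclosed in quotes, you have to strip the quotes yourself. If a quoted field happens to contain a comma, the resulting row has the wrong number of fields and the code breaks, because the raw line is simply split on commas.
>>> with open('test.csv') as f:
...     for line in f:
...         row = line.split(',')
...         print(row)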
...
['class', 'name', 'sex', 'height', 'year\n']
['1', 'xiaoming', 'male', '168', '23\n']
['1', 'xiaohong', 'female', '162', '22\n']
['2', 'xiaozhang', 'female', '163', '21\n']
['2', 'xiaoli', 'male', '158', '21\n']
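To make the quoting problem concrete, here is a minimal sketch that compares the two approaches on one line containing a quoted comma. The sample data (`id`, `name`, `city`) is invented for this illustration and is not part of test.csv.

```python
import csv
import io

# Hypothetical sample data: the second field is quoted and contains a comma.
sample = 'id,name,city\n1,"Zhang, Wei",Taipei\n'

# Naive splitting cuts the quoted field in two and leaves the quotes in place.
naive_rows = [line.rstrip('\n').split(',') for line in io.StringIO(sample)]

# csv.reader applies the CSV quoting rules and returns three fields per row.
csv_rows = list(csv.reader(io.StringIO(sample, newline='')))

print(naive_rows[1])  # ['1', '"Zhang', ' Wei"', 'Taipei']
print(csv_rows[1])    # ['1', 'Zhang, Wei', 'Taipei']
```

The naive version returns four fields for a three-column row, so any code that indexes by position would silently read the wrong column.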
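With the approaches above, fields have to be accessed by index, which is inconvenient. The following reads the data as a sequence of dictionaries so that fields can be accessed by header name:
>>> with open('test.csv') as f:
...     f_csv = csv.DictReader(f)
...     for row in f_csv:
...         print(row)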
...
OrderedDict([('class', '1'), ('name', 'xiaoming'), ('sex', 'male'), ('height', '168'), ('year', '23')])
OrderedDict([('class', '1'), ('name', 'xiaohong'), ('sex', 'female'), ('height', '162'), ('year', '22')])
OrderedDict([('class', '2'), ('name', 'xiaozhang'), ('sex', 'female'), ('height', '163'), ('year', '21')])
OrderedDict([('class', '2'), ('name', 'xiaoli'), ('sex', 'male'), ('height', '158'), ('year', '21')])
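As a small follow-up, the sketch below looks fields up by header name and converts one of them to a number. It keeps the same rows as test.csv in an in-memory string (an assumption made only so the example runs on its own); the `heights` dictionary is a name introduced for illustration.

```python
import csv
import io

# Same contents as test.csv, held in memory so the sketch is self-contained.
data = (
    "class,name,sex,height,year\n"
    "1,xiaoming,male,168,23\n"
    "1,xiaohong,female,162,22\n"
    "2,xiaozhang,female,163,21\n"
    "2,xiaoli,male,158,21\n"
)

heights = {}
for row in csv.DictReader(io.StringIO(data, newline='')):
    # Every value arrives as a string, so convert explicitly where a number is needed.
    heights[row['name']] = int(row['height'])

print(heights)
# {'xiaoming': 168, 'xiaohong': 162, 'xiaozhang': 163, 'xiaoli': 158}
```

Note that on Python 3.8 and later, DictReader yields plain dict objects, so the printed rows no longer show the OrderedDict wrapper seen above.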
6.2 Printing JSON: Formatted and Sorted Output
>>> from pprint import pprint
>>> import json
>>> a = {"a":"1","c":2,"b":4}
# With the indent argument, the book says the output looks the same as pprint's, but when I checked, pprint's output is different
>>> print(json.dumps(a,indent=4))
{
    "a": "1",
    "c": 2,
    "b": 4
}
>>> pprint(a)
{'a': '1', 'b': 4, 'c': 2}
# Sort the keys with sort_keys
>>> print(json.dumps(a,sort_keys=True))
{"a": "1", "b": 4, "c": 2}
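The two options can also be combined. The sketch below passes indent and sort_keys together, then parses the text back to confirm nothing is lost. It also shows one likely reason pprint looked different above: pprint only breaks a structure across lines when it does not fit in the given width, so a short dict stays on one line. The `width=1` value is just a way to force wrapping for the demonstration.

```python
import json
from pprint import pformat

a = {"a": "1", "c": 2, "b": 4}

# indent and sort_keys can be passed together.
text = json.dumps(a, indent=4, sort_keys=True)
print(text)

# Round trip: parsing the text gives back an equal dict.
restored = json.loads(text)

# pprint wraps only when the repr exceeds the width; a tiny width forces it.
wrapped = pformat(a, width=1)
print(wrapped)
```

Even when wrapped, pprint prints a Python repr (single quotes), not JSON, so the two outputs are not interchangeable.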
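6.3 Parsing XML
Use xml.etree.ElementTree to parse simple XML. For more advanced applications, consider lxml, which exposes the same programming interface as ElementTree. lxml is fully compliant with the XML standards, is very fast, and also provides features such as validation, XSLT, and XPath. To switch, change the import statement from xml.etree.ElementTree import parse to from lxml.etree import parse.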
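Since this section has no example, here is a minimal sketch of parsing a small document with the standard library. The XML content and the `students` list are invented for this illustration. It parses from a string with `fromstring`; for a file on disk, `parse` would be used instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document, invented for this illustration.
xml_text = """
<students>
    <student class="1"><name>xiaoming</name><height>168</height></student>
    <student class="2"><name>xiaoli</name><height>158</height></student>
</students>
""".strip()

root = ET.fromstring(xml_text)

students = []
for node in root.iterfind('student'):
    students.append({
        'class': node.get('class'),              # attribute value, a string
        'name': node.findtext('name'),           # text of the <name> child
        'height': int(node.findtext('height')),  # element text is a string, so convert
    })

print(students)
```

As with CSV, everything comes back as text, so numeric fields need an explicit conversion.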
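6.4 Summarizing Data and Performing Statistics
For any data analysis problem involving statistics, time series, or related techniques, Pandas is the library to reach for.
Pandas is a large library. It is especially well suited to analyzing large datasets, grouping data, performing statistical analysis, and similar tasks.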
>>> import pandas
>>> rats = pandas.read_csv('test.csv')
>>> rats
   class       name     sex  height  year
0      1   xiaoming    male     168    23
1      1   xiaohong  female     162    22
2      2  xiaozhang  female     163    21
3      2     xiaoli    male     158    21
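As a rough sketch of the grouping and statistics mentioned above, the example below computes the mean height per class and counts the rows for each sex. It assumes pandas is installed and loads the same rows from an in-memory string instead of test.csv so that it runs on its own. The names `mean_height` and `sex_counts` are introduced for illustration.

```python
import io

import pandas

# Same contents as test.csv, held in memory so the sketch is self-contained.
data = (
    "class,name,sex,height,year\n"
    "1,xiaoming,male,168,23\n"
    "1,xiaohong,female,162,22\n"
    "2,xiaozhang,female,163,21\n"
    "2,xiaoli,male,158,21\n"
)
rats = pandas.read_csv(io.StringIO(data))

# Mean height for each class (165.0 for class 1, 160.5 for class 2).
mean_height = rats.groupby('class')['height'].mean()

# Number of rows for each value in the sex column.
sex_counts = rats['sex'].value_counts()

print(mean_height)
print(sex_counts)
```

Unlike the csv module, read_csv infers numeric column types, so height and year arrive as integers without any manual conversion.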