[Python Files] Reading and Writing Files

1 pd.read_csv/dataframe.to_csv

pd.read_csv()

pd.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)

Parameters:

1. sep
Specifies the delimiter; defaults to a comma ','. Regular expressions are accepted. If sep is not given, the C engine (engine='c') cannot detect the delimiter automatically, but the Python engine (engine='python') can.

2. delimiter : str, default None
Alternative delimiter argument (if specified, the sep parameter is ignored).

3. header : int or list of ints, default 'infer'
Specifies which row to use as the header. Defaults to 0, i.e. the first row. If the file has no header row, set header=None.

4. names
Column names, given as a list. Most useful when the file has no header row (header=None) and you want to add column names yourself.

5. index_col :
Column number or column name to use as the row index; a sequence gives multiple row indexes (a MultiIndex).

If the file is malformed, with a trailing delimiter at the end of each line, set index_col=False so pandas does not use the first column as the row index.

6. prefix :
Prefix to add to column numbers when there is no header, e.g. 'X' becomes X0, X1, ...

7. engine : 'c' or 'python'

The C engine is faster, but the Python engine is more feature-complete.

8. nrows : int, default None
Number of rows to read, counted from the start of the file.

9. encoding :
Use this when the file comes out garbled; check the official documentation for which encoding to pass.

10. skiprows : list-like or integer, default None
Number of lines to skip at the start of the file, or a list of line numbers to skip (0-based).

11. dtype : type name or dict of column -> type

Data type for each column, e.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}

import pandas as pd

# Read from a URL
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"
df = pd.read_csv(url, sep="|")
# Read a local CSV file
pd.read_csv('data.csv', engine='python')
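
A quick sketch exercising several of the parameters above (header, names, skiprows, nrows, dtype); the headerless data.csv here is hypothetical:

import numpy as np
import pandas as pd

# data.csv is a hypothetical CSV file with no header row
df = pd.read_csv('data.csv',
                 header=None,                   # the file has no header row
                 names=['a', 'b', 'c'],         # supply our own column names
                 skiprows=[0],                  # skip the first physical line
                 nrows=100,                     # read at most 100 rows
                 dtype={'a': np.float64, 'b': np.int32, 'c': str})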

dataframe.to_csv()

DataFrame.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression='infer', quoting=None, quotechar='"', line_terminator=None, chunksize=None, tupleize_cols=None, date_format=None, doublequote=True, escapechar=None, decimal='.')

1. sep: string; the delimiter, with the same meaning as in read_csv()

2. na_rep: string; the value written in place of NaN

3. columns: list; which columns to write

4. header: whether to write out the column names; default True. Set header=False to omit the header row.

5. index: whether to write the row index; default True

import os
import pandas as pd

file_csv = os.path.join(workdir, 'data.csv')   # workdir: your working directory
df = pd.read_csv(file_csv, sep=',', encoding='utf-8')
df.to_csv('out_csv.csv', index=False, encoding='utf-8')

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

2 np.genfromtxt/np.savetxt

numpy.genfromtxt()

genfromtxt can handle missing data; faster and simpler functions such as loadtxt cannot.

numpy.genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None, usecols=None, names=None, excludelist=None, deletechars=None, replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None, encoding='bytes')

1. dtype : dtype, optional

The data type, given as one of:

1. a single type, e.g. dtype=float
2. a sequence of types, e.g. dtype=(int, float, float)
3. a comma-separated string, e.g. dtype="i4,f8,|S3"
4. a dictionary with the two keys 'names' and 'formats'
5. a sequence of tuples, e.g. dtype=[('A', int), ('B', float)]

comments : str, optional

The optional comments argument defines the string that marks the start of a comment; by default genfromtxt assumes comments='#'. The comment marker may appear anywhere on the line, and any characters after it are ignored.

import numpy as np
from io import StringIO
data = """#
# Skip me !
# Skip me too !
1, 2
3, 4
5, 6 #This is the third line of the data
7, 8
# And here comes the last line
9, 0
"""
print(np.genfromtxt(StringIO(data), comments="#", delimiter=","))

delimiter : str, int, or sequence, optional

The delimiter. You may also be dealing with a fixed-width file, where columns are defined by a given number of characters. In that case, set delimiter to a single integer (if all columns have the same width) or to a sequence of integers (if column widths differ).

import numpy as np
from io import StringIO
data = "123456789\n   4  7 9\n   4567 9"
print(np.genfromtxt(StringIO(data), delimiter=(4, 3, 2)))

skip_header : int, optional

Number of lines to skip at the beginning of the file.

skip_footer : int, optional

Number of lines to skip at the end of the file.

missing_values : variable, optional

By default, blank entries mark missing data, but more explicit markers such as 'N/A' or '???' can be used. missing_values accepts three kinds of values:

a string or comma-separated string: used as the missing-data marker for all columns
a sequence of strings: each item is associated with the corresponding column, in order
a dictionary: values are strings or sequences of strings, and keys can be column indices (integers) or column names (strings); the special key None defines a default marker that applies to all columns

filling_values : variable, optional

The value used by default to fill in missing entries.
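
A minimal sketch combining missing_values and filling_values, adapted from the NumPy documentation:

import numpy as np
from io import StringIO

data = "N/A, 2, 3\n4, ,???"
# mark 'N/A', blanks and '???' as missing in columns 0, 'b' and 2,
# then fill them with 0, 0 and -999 respectively
arr = np.genfromtxt(StringIO(data), delimiter=",", dtype=int,
                    names="a,b,c",
                    missing_values={0: "N/A", 'b': " ", 2: "???"},
                    filling_values={0: 0, 'b': 0, 2: -999})
print(arr)   # [(0, 2, 3) (4, 0, -999)]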

usecols : sequence, optional

Which columns to read, with 0 being the first. For example, usecols=(1, 4, 5) extracts the 2nd, 5th and 6th columns.

names : {None, True, str, sequence}, optional

Setting names=True makes the first line (after any skip_header lines) be read as the column names; it is picked up even if that line is commented out.

autostrip : bool, optional

Whether to automatically strip whitespace from values.

encoding : str, optional

Encoding used to decode the inputfile. Does not apply when fname is a file object. The special value ‘bytes’ enables backward compatibility workarounds that ensure that you receive byte arrays when possible and passes latin1 encoded strings to converters. Override this value to receive unicode arrays and pass strings as input to converters. If set to None the system default is used. The default value is ‘bytes’.

import numpy as np
data = np.genfromtxt('test.txt',delimiter=',',skip_header=18)

numpy.savetxt()

numpy.savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ', encoding=None)

import numpy as np

x = y = z = np.arange(0.0, 5.0, 1.0)
np.savetxt('test.out', x, delimiter=',')    # x is an array
np.savetxt('test.out', (x, y, z))           # x, y, z are equal-sized 1D arrays
np.savetxt('test.out', x, fmt='%1.4e')      # use exponential notation

3 pd.read_excel/dataframe.to_excel

pd.read_excel()

pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, parse_cols=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skip_footer=0, skipfooter=0, convert_float=True, mangle_dupe_cols=True, **kwds)

Parameters:

1. sheet_name : string, int, mixed list of strings/ints, or None, default 0. Use sheet_name=[0, 1] to return multiple sheets, or sheet_name=None to return all sheets. Note: an int or string returns a DataFrame, while None or a list returns a dict of DataFrames.
2. header : int, list of ints, default 0. Specifies the column-name row; the data is everything below it. If the data has no column names, set header=None.
3. skiprows : list-like. Rows to skip at the beginning.
4. skip_footer : int, default 0. Rows to skip at the end.
5. index_col : int, list of ints, default None. Column(s) to use as the index; column names (strings) also work.
6. names : array-like, default None. Column names to use.
7. dtype : dict of {'column name': data type}, setting the data type of the specified columns.

pd.read_excel('tmp.xlsx', index_col=0)
pd.read_excel(open('tmp.xlsx', 'rb'),sheet_name='Sheet3') 
pd.read_excel('tmp.xlsx', index_col=0,dtype={'Name': str, 'Value': float})
pd.read_excel('tmp.xlsx', index_col=0,na_values=['string1', 'string2'])
pd.read_excel('tmp.xlsx',names=["a","b","c","e"])

dataframe.to_excel()

dataframe.to_excel(excel_writer, sheet_name='Sheet1', na_rep='', float_format=None,columns=None, header=True, index=True, index_label=None,startrow=0, startcol=0, engine=None, merge_cells=True, encoding=None,inf_rep='inf', verbose=True, freeze_panes=None)
# sheet_name=0 reads the first sheet; you can also pass the sheet's name as a string
# header=0 uses the first row as the header (column names)
# If the data has no column names, set header=None and pass a list of names via the names parameter
df = pd.read_excel(file_excel, sheet_name=0, header=0)   # file_excel: path to your Excel file
# Save with dataframe.to_excel()
df.to_excel('out_data.xlsx', index=False, encoding='utf-8')

4 pd.read_json/dataframe.to_json

pd.read_json()

pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')

orient : string

The expected format of the JSON string. The set of possible orients is (a short sketch follows the list):

  • 'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
  • 'records' : list like [{column -> value}, ... , {column -> value}]
  • 'index' : dict like {index -> {column -> value}}
  • 'columns' : dict like {column -> {index -> value}}
  • 'values' : just the values array
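
For instance, 'records' represents the frame as one JSON object per row; a minimal round-trip sketch:

import pandas as pd
from io import StringIO

# 'records': list like [{column -> value}, ...]
json_str = '[{"col 1": "a", "col 2": "b"}, {"col 1": "c", "col 2": "d"}]'
df = pd.read_json(StringIO(json_str), orient='records')
print(df)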

typ : type of object to recover (series or frame), default ‘frame’

dtype : boolean or dict, default True

If True, infer dtypes, if a dict of column to dtype, then use those, if False, then don’t infer dtypes at all, applies only to the data.

convert_axes : boolean, default True

Try to convert the axes to the proper dtypes.

convert_dates : boolean, default True

List of columns to parse for dates. If True (the default), try to parse date-like columns; a column label is date-like if

  • it ends with '_at',
  • it ends with '_time',
  • it begins with 'timestamp',
  • it is 'modified', or
  • it is 'date'

keep_default_dates : boolean, default True

If parsing dates, then parse the default datelike columns

numpy : boolean, default False

Direct decoding to numpy arrays. Supports numeric data only, but non-numeric column and index labels are supported. Note also that the JSON ordering MUST be the same for each term if numpy=True.

precise_float : boolean, default False

Set to enable usage of higher precision (strtod) function when decoding string to double values. Default (False) is to use fast but less precise builtin functionality

date_unit : string, default None

The timestamp unit to detect if converting dates. The default behaviour is to try and detect the correct precision, but if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force parsing only seconds, milliseconds, microseconds or nanoseconds respectively.

encoding : str, default is ‘utf-8’

The encoding to use to decode py3 bytes.

New in version 0.19.0.

lines : boolean, default False

Read the file as a json object per line.

New in version 0.19.0.

chunksize : integer, default None

Return JsonReader object for iteration. See the line-delimited json docs for more information on chunksize. This can only be passed if lines=True. If this is None, the file will be read into memory all at once.

New in version 0.21.0
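
A sketch of chunked reading, assuming a hypothetical line-delimited file records.jsonl with one JSON object per line:

import pandas as pd

# records.jsonl is a hypothetical file with one JSON object per line
reader = pd.read_json('records.jsonl', lines=True, chunksize=1000)
for chunk in reader:          # each chunk is a DataFrame of up to 1000 rows
    print(chunk.shape)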

dataframe.to_json()

DataFrame.to_json(path_or_buf=None, orient=None, date_format=None, double_precision=10, force_ascii=True, date_unit='ms', default_handler=None, lines=False, compression='infer', index=True)
df = pd.DataFrame([['a', 'b'], ['c', 'd']], index=['row 1', 'row 2'],
                  columns=['col 1', 'col 2'])
json_str = df.to_json(orient='split')
pd.read_json(json_str, orient='split')
pd.read_json(filepath, orient='values', encoding='utf-8')   # filepath: path to a JSON file

5 pd.read_html/dataframe.to_html

pd.read_html()

dataframe.to_html()
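
A minimal round-trip sketch, assuming a page containing <table> elements and an installed HTML parser such as lxml or beautifulsoup4 (the URL is hypothetical):

import pandas as pd

# read_html returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html('https://example.com/tables.html')   # hypothetical URL
df = tables[0]
html_str = df.to_html(index=False)   # render the DataFrame back to an HTML table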

6 pd.read_pickle /dataframe.to_pickle

pd.read_pickle()

dataframe.to_pickle()
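
A minimal round-trip sketch:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_pickle('df.pkl')            # serialize the DataFrame to disk
df2 = pd.read_pickle('df.pkl')    # load it back unchanged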

7 pd.read_sql

pd.read_sql()

pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)

sql: the SQL command as a string
con: an engine connected to the SQL database, usually built with SQLAlchemy or a driver package such as pymysql
index_col: column to use as the index
coerce_float: very useful; reads numeric strings directly as floats
parse_dates: converts a column of date strings to datetime values, similar in function to pd.to_datetime. You can pass just the column names to convert using the default date format, or a dict of {column_name: format string} (format string: "%Y:%m:%H:%M:%S") giving the column names and conversion formats.
columns: the columns to select. Rarely useful, since the SQL statement usually already specifies them.
chunksize: if an integer is given, a generator is returned whose every output has that many rows.
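
A minimal sketch using an SQLAlchemy engine; the connection string, table and column names are hypothetical:

import pandas as pd
from sqlalchemy import create_engine

# hypothetical MySQL connection string: adjust user/password/host/database
engine = create_engine('mysql+pymysql://user:password@localhost/testdb')
df = pd.read_sql('SELECT id, name, created_at FROM some_table', engine,
                 index_col='id',
                 parse_dates={'created_at': '%Y-%m-%d %H:%M:%S'})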

8 import csv

Reading data as a sequence of lists

import csv

with open('filename.csv') as f:
	f_csv = csv.reader(f)
	headers = next(f_csv)  # read the first (header) row
	for row in f_csv:      # iterate over the remaining rows
		pass

Reading data as a sequence of dicts

import csv

with open('filename.csv') as f:
	f_csv = csv.DictReader(f)
	headers = f_csv.fieldnames  # DictReader consumes the header row itself
	for row in f_csv:           # each row is an ordered dict keyed by the headers
		pass

Writing rows

import csv

headers = ['symbol', 'price', 'date', 'time', 'change']
rows = [('AA', 39.48, '6/11/2007', '9:36am', -0.18),
		('AIG', 39.48, '6/11/2007', '9:36am', -0.18),
		('AXp', 39.48, '6/11/2007', '9:36am', -0.18)]

with open('filename.csv', 'w', newline='') as f:
	f_csv = csv.writer(f)
	f_csv.writerow(headers)  # write the header row
	f_csv.writerows(rows)    # write the remaining rows

Writing a sequence of dicts

import csv

headers = ['symbol', 'price', 'date', 'time', 'change']

rows = [{'symbol': 'AA', 'price': 39.48, 'date': '6/11/2007', 'time': '9:36am', 'change': -0.18},
	{'symbol': 'AA', 'price': 39.48, 'date': '6/11/2007', 'time': '9:36am', 'change': -0.18},
	{'symbol': 'AA', 'price': 39.48, 'date': '6/11/2007', 'time': '9:36am', 'change': -0.18}]

with open('filename.csv', 'w', newline='') as f:
	f_csv = csv.DictWriter(f, headers)
	f_csv.writeheader()     # write the header row
	f_csv.writerows(rows)   # write the remaining rows

# Note: the keys in rows must match the names in headers

9 import json

json.dumps and json.loads

# Convert a Python data structure to a JSON string
import json

data = {'name': 'ACME',
		'shares': 100,
		'price': 542}

json_str = json.dumps(data)

# Convert a JSON string back to a Python data structure
data = json.loads(json_str)

# Write to a file
with open('data.json', 'w') as f:
	json.dump(data, f)

# Read from a file
with open('data.json', 'r') as f:
	data = json.load(f)

Decoding JSON data into an ordered dict

import json
from collections import OrderedDict

s = '{"name": "ACME", "shares": 100, "price": 542}'

data = json.loads(s, object_pairs_hook=OrderedDict)

Sorting keys in the output

json.dumps(data,sort_keys=True)

10 import gzip

import gzip
with gzip.open('somefile.gz', 'rt') as f:
	text = f.read()

import gzip
with gzip.open('somefile.gz', 'wt', compresslevel=5) as f:
	f.write(text)

# compresslevel=5 sets the compression level.
# The default level is 9, the highest. Lower levels perform better,
# but the data is compressed less.

11 import bz2

import bz2
with bz2.open('somefile.bz2', 'rt') as f:
	text = f.read()

import bz2
with bz2.open('somefile.bz2', 'wt', compresslevel=5) as f:
	f.write(text)

# compresslevel=5 sets the compression level.
# The default level is 9, the highest. Lower levels perform better,
# but the data is compressed less.

12 import pickle
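
A minimal sketch of pickling a Python object to a binary file and loading it back:

import pickle

data = {'name': 'ACME', 'shares': 100, 'price': 542}
with open('data.pkl', 'wb') as f:
	pickle.dump(data, f)       # serialize to a binary file
with open('data.pkl', 'rb') as f:
	data2 = pickle.load(f)     # deserialize back into a dict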

13 import pymysql
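
A minimal sketch of querying MySQL directly with pymysql; the credentials, database and table are hypothetical:

import pymysql

# hypothetical credentials and database
conn = pymysql.connect(host='localhost', user='user',
                       password='password', database='testdb',
                       charset='utf8mb4')
try:
    with conn.cursor() as cur:
        cur.execute('SELECT * FROM some_table LIMIT 5')
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()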

14 read/ readline/readlines/ write

read()

1. Reads the entire file and stores its contents in a single string variable.
2. If the file is larger than the available memory (several GB), this approach is not feasible.

Reading a file incrementally

def read_in_block(file_path):
    BLOCK_SIZE = 1024
    with open(file_path, "r") as f:
        while True:
            block = f.read(BLOCK_SIZE)  # read a fixed-size block into a memory buffer each time
            if block:
                yield block
            else:
                return  # exit once the end of the file is reached

readline()

1. readline() reads one line at a time and is much slower than readlines().
2. readline() returns a string object holding the contents of the current line.
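
A short sketch of reading a file line by line with readline():

with open('filename.csv') as f:
    line = f.readline()
    while line:
        # process the current line here
        line = f.readline()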

readlines()

1. Reads the entire file in one go.

2. Automatically splits the file contents into a list of lines.

with open('filename.csv') as f:
	for line in f.readlines():
		fields = line.split(',')  # split each line on your delimiter

write()

import os

filepath = 'out.txt'   # path of the file to write (placeholder)
data = 'text to save'  # the string to write (placeholder)
if not os.path.exists(filepath):
	with open(filepath, 'w') as f:
		f.write(data)
else:
	print('File already exists.')

References

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html#pandas.DataFrame.to_json

https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html

Python Cookbook
