Quick tips: reading large files, splitting files, and reading partial data in Python

Original

水...琥珀

2018-12-28 17:35

pandas parameter notes

read_csv and read_table in pandas accept several parameters that help with large files.

Parameters for reading part of a file

nrows : int, default None

Number of rows of file to read. Useful for reading pieces of large files

nrows is the number of rows to read, counted from the start of the file. With nrows set to 50, only the first 50 rows of the file are read.
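As a minimal sketch of nrows, the snippet below reads from an in-memory string rather than a real file. The sample data and the variable names are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample standing in for a large file on disk.
sample = "id,value\n1,a\n2,b\n3,c\n4,d\n"

# Read only the first 2 data rows; the header line is not counted in nrows.
head = pd.read_csv(StringIO(sample), nrows=2)
print(head)
```

The same call works with a file path in place of the StringIO object.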

skiprows : list-like or integer or callable, default None

Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

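skiprows selects the rows to ignore. It can be an integer, meaning the number of lines to skip from the start of the file; a list, meaning the line numbers to skip (counted from 0); or a callable that is given each line number and returns True for lines to skip.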

In [6]: pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)
Out[6]: 
  col1 col2  col3
0    a    b     2
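The three forms of skiprows can be compared side by side. The sketch below assumes a small data string chosen so that the callable form reproduces the output shown above; the original post does not show how data was defined, so its contents here are a guess.

```python
from io import StringIO

import pandas as pd

# Assumed sample data; line 0 is the header, lines 1-3 are data rows.
data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3\n"

# Integer: skip the first line of the file (the header itself),
# so header=None is passed and the columns get numeric names.
by_int = pd.read_csv(StringIO(data), skiprows=1, header=None)

# List: skip the lines with these 0-based line numbers.
by_list = pd.read_csv(StringIO(data), skiprows=[1, 3])

# Callable: skip every line whose line number is odd (same lines as [1, 3]).
by_callable = pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)

print(by_int)
print(by_list)
print(by_callable)
```

Note that the integer form counts lines from the top of the file, header included, which is why the header disappears in the first call.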

skip_footer : int, default 0

DEPRECATED: use the skipfooter parameter instead, as they are identical

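skip_footer is the number of lines to ignore, counted from the end of the file. As the deprecation notice says, use skipfooter in current code.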
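A short sketch of the replacement parameter, skipfooter, follows. The sample data with a trailing summary line is invented for illustration. pandas documents skipfooter as unsupported by the C parser, so engine="python" is passed explicitly.

```python
from io import StringIO

import pandas as pd

# Hypothetical file content whose last line is a summary, not a data row.
sample = "id,value\n1,a\n2,b\n3,c\ntotal,3\n"

# Drop the last line of the file; skipfooter needs the Python parser engine.
body = pd.read_csv(StringIO(sample), skipfooter=1, engine="python")
print(body)
```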

Parameters for reading a file in chunks

iterator : boolean, default False

Return TextFileReader object for iteration or getting chunks with get_chunk().

chunksize : int, default None

Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.

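chunksize is the number of rows read per chunk. The code below reads the file 50 rows at a time.

When writing each chunk, index=None and header=False keep the row index and the header row out of the output file.

Read in chunks and write each chunk to a separate file: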

import pandas as pd

# Read the file in chunks of 50 rows
reader = pd.read_table("E:\\Mypython3\\fact.txt",
                        header=None, encoding='utf-8', sep="\t", chunksize=50)

# Process each chunk: print it, then write it to its own numbered file (1.txt, 2.txt, ...)
for i, chunk in enumerate(reader, start=1):
    print(chunk)
    chunk.to_csv("{}.txt".format(i), index=None, header=False, sep="\t", encoding='utf-8')
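The splitting loop above can also be wrapped in a function and tried on generated data. This is only a sketch: the function name split_file, its parameters, and the 120-row sample file are assumptions made for illustration, and it uses the reader as a context manager, which needs a reasonably recent pandas release.

```python
import tempfile
from pathlib import Path

import pandas as pd


def split_file(src, out_dir, rows_per_file, sep="\t"):
    """Split a headerless delimited text file into numbered part files."""
    out_dir = Path(out_dir)
    written = []
    # The context manager closes the underlying file once the loop is done.
    with pd.read_csv(src, header=None, sep=sep, encoding="utf-8",
                     chunksize=rows_per_file) as reader:
        for i, chunk in enumerate(reader, start=1):
            target = out_dir / f"{i}.txt"
            # index=False and header=False keep the index and header out of the file.
            chunk.to_csv(target, index=False, header=False, sep=sep, encoding="utf-8")
            written.append(target)
    return written


with tempfile.TemporaryDirectory() as tmp:
    # Build a hypothetical 120-row tab-separated source file.
    src = Path(tmp) / "fact.txt"
    src.write_text("".join(f"{n}\tname{n}\n" for n in range(120)), encoding="utf-8")

    parts = split_file(src, tmp, rows_per_file=50)
    part_names = [p.name for p in parts]
    part_sizes = [len(p.read_text(encoding="utf-8").splitlines()) for p in parts]

print(part_names, part_sizes)
```

With 120 rows and 50 rows per file, the last part is shorter than the others; no padding or merging is attempted.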

Example from the official documentation:
In [177]: reader = pd.read_table('tmp.sv', sep='|', chunksize=4)

In [178]: reader
Out[178]: <pandas.io.parsers.TextFileReader at 0x7f212235a518>

In [179]: for chunk in reader:
   .....:     print(chunk)
   .....: 
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
   Unnamed: 0         0         1         2         3
4           4 -0.424972  0.567020  0.276232 -1.087401
5           5 -0.673690  0.113648 -1.478427  0.524988
6           6  0.404705  0.577046 -1.715002 -1.039268
7           7 -0.370647 -1.157892 -1.344312  0.844885
   Unnamed: 0         0        1         2         3
8           8  1.075770 -0.10905  1.643563 -1.469388
9           9  0.357021 -0.67460 -1.776904 -0.968914

Note that writing with chunk.to_table("{}.txt".format(i),index=None,header=False,encoding='utf-8') fails with "'DataFrame' object has no attribute 'to_table'", so use to_csv when saving. Its default separator is a comma; pass the delimiter you want through sep.

Files written this way use "\r\n" as the line ending (the source file was generated with "\n" as the line ending).

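Background:

CRLF and LF are two ways of marking a line break in text. CR (Carriage Return) corresponds to the character "\r"; LF (Line Feed) corresponds to the character "\n". Among mainstream operating systems, Windows uses CRLF, while Unix-like systems (including Linux and recent versions of macOS) use LF.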
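To see which convention a file actually uses, its raw bytes can be inspected. The helper below is a small standard-library sketch; the name count_line_endings is made up for this example.

```python
def count_line_endings(raw: bytes) -> dict:
    """Count CRLF and bare LF line endings in raw file bytes."""
    crlf = raw.count(b"\r\n")
    # Every CRLF also contains an LF, so subtract those to get bare LFs.
    lf = raw.count(b"\n") - crlf
    return {"CRLF": crlf, "LF": lf}


print(count_line_endings(b"a\tb\r\nc\td\r\n"))
print(count_line_endings(b"a\tb\nc\td\n"))
```

For a real file, pass the result of opening it in binary mode, for example Path("1.txt").read_bytes(). If LF output is wanted from to_csv, it accepts a line terminator argument, named lineterminator in recent pandas releases and line_terminator in older ones.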

Specifying iterator=True will also return the TextFileReader object:

`iterator=True` is meant to be used together with get_chunk(n):

In [180]: reader = pd.read_table('tmp.sv', sep='|', iterator=True)

In [181]: reader.get_chunk(5)
Out[181]: 
   Unnamed: 0         0         1         2         3
0           0  0.469112 -0.282863 -1.509059 -1.135632
1           1  1.212112 -0.173215  0.119209 -1.044236
2           2 -0.861849 -2.104569 -0.494929  1.071804
3           3  0.721555 -0.706771 -1.039575  0.271860
4           4 -0.424972  0.567020  0.276232 -1.087401
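The documentation example above depends on a tmp.sv file that is not included here. A self-contained sketch of the same idea, using an invented in-memory table, shows that successive get_chunk calls may ask for different numbers of rows:

```python
from io import StringIO

import pandas as pd

# Hypothetical pipe-separated data: 10 rows of n and n squared.
sample = "n|sq\n" + "".join(f"{n}|{n * n}\n" for n in range(10))

reader = pd.read_csv(StringIO(sample), sep="|", iterator=True)

first = reader.get_chunk(5)  # rows 0-4
rest = reader.get_chunk(3)   # the next 3 rows, continuing where the first call stopped
reader.close()

print(first)
print(rest)
```

Compared with chunksize, this form suits cases where the amount to read next is only known after looking at the previous chunk.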
