Reading text data in Python

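Summary
(1) Prefer the with statement for file I/O.
(2) If the file is large, read it in fixed-size chunks or line by line.
(3) Use the file object's iterator to loop over lines.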

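1. How Python reads text files
Reading and writing files is the most common kind of I/O, and Python has built-in functions for it.
Before reading or writing a file, note that disk access is a service provided by the operating system; modern operating systems do not let ordinary programs operate on the disk directly. Reading or writing a file therefore means asking the operating system to open a file object (identified at the OS level by a file descriptor), and then using the interface the operating system provides to read data from that object (reading) or write data into it (writing).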

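Step    Operation
Step 1  Ask the operating system to open a file object
Step 2  Read (or write) data through the file object's interface
Step 3  Close the file object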
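
The three steps can also be sketched one level below open(), using the os module's descriptor-based calls. This is only an illustration of the mechanism, not how everyday code should read files; the temporary file and its contents are assumptions made so the snippet is self-contained.

```python
import os
import tempfile

# Create a throwaway file so the sketch does not depend on any real path.
fd, path = tempfile.mkstemp(suffix=".txt")
os.write(fd, b"hello\n")
os.close(fd)

# Step 1: ask the OS to open the file; the result is an integer file descriptor.
fd = os.open(path, os.O_RDONLY)
# Step 2: read through the descriptor (at most 1024 bytes in this call).
data = os.read(fd, 1024)
# Step 3: give the descriptor back to the OS.
os.close(fd)

os.remove(path)
print(data)
```

The built-in open() wraps these same steps in a file object that adds buffering and, in text mode, decoding.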

2. Reading text files with Python's built-in functions
(2.1.1) Reading a text file with open(), demo:

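Complete process:
f = open('/Users/michael/test.txt', 'r')  # ask the operating system to open a file object
f.read()   # read data through the file object's interface
f.close()  # close the file object
Where:
1) Step 1: the mode 'r' means read, and a successful call gives us an open file. If the file does not exist, open() raises an error (FileNotFoundError in Python 3, a subclass of OSError; IOError in Python 2) that carries an error code and a message saying the file does not exist.
2) Step 2: once the file is open, calling read() reads its entire content in one go. Python loads the content into memory and represents it as a str object.
3) Step 3: call close() to close the file. A file must be closed after use, because a file object holds operating-system resources and the number of files the operating system lets a process keep open at one time is limited.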
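
A minimal sketch of handling the missing-file case in Python 3 follows. The path is a hypothetical one built inside a fresh temporary directory so that it is guaranteed not to exist; the variable names are illustrative only.

```python
import errno
import os
import tempfile

# Hypothetical path inside a new, empty directory: it cannot exist yet.
tmp_dir = tempfile.mkdtemp()
missing = os.path.join(tmp_dir, "no_such_file.txt")

try:
    f = open(missing, "r")
except FileNotFoundError as exc:
    # The exception carries the OS error code and a readable message.
    caught = exc
    print(exc.errno, exc.strerror)
else:
    f.close()
    caught = None

os.rmdir(tmp_dir)
```

Because FileNotFoundError is a subclass of OSError, an `except OSError` clause would catch it as well, along with other failures such as missing permissions.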
 ///////////////////////////////////////////////////////////////////
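 As shown above, if f.close() is forgotten, or the read or write in step 2 raises an error, the file is never closed and keeps holding operating-system resources. To make sure the file is closed whether or not an error occurs, use try ... finally:
f = open('/path/to/file', 'r')
try:
    print(f.read())
finally:
    f.close()
That form is verbose. The more concise form uses the with statement, which has the same effect as try ... finally but is easier to write:
with open('/path/to/file', 'r') as f:
    print(f.read())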
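
To check that with really closes the file when the body fails, the sketch below raises an error inside the block and then inspects f.closed. The temporary file and the simulated RuntimeError are assumptions for illustration.

```python
import os
import tempfile

# Empty throwaway file, created only so open() has something to open.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    with open(path, "r") as f:
        raise RuntimeError("simulated failure while reading")
except RuntimeError:
    pass

# The name f is still bound after the block, but the file is already closed.
closed_after_error = f.closed
os.remove(path)
print(closed_after_error)
```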
 ////////////////////////////////////////////////////
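 Ways to read a file:
 (1) Read the content with f.read() or f.read(size).
 Calling read() reads the entire file at once; if the file is 10 GB, it will exhaust memory. To be safe, call read(size) repeatedly; each call reads at most size characters in text mode (size bytes in binary mode). In addition, readline() reads one line per call, and readlines() reads everything at once and returns a list of lines. Choose according to need: if the file is small, a single read() is the most convenient; if the file size is unknown, repeated read(size) calls are safer; for a configuration file, readlines() is the most convenient.
 (2) The file object is iterable, so it can be looped over line by line. Iterating over f.readlines() also works, but it builds the whole list in memory first.
for line in f:
    print(line.strip())  # strip surrounding whitespace, including the trailing '\n'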
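
The repeated read(size) pattern can be sketched as follows. The chunk size of 4 and the sample content are arbitrary choices for the demonstration; a real program would use a much larger size.

```python
import os
import tempfile

# Write a small sample file so the loop has something to read.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write("abcdefghij")

chunks = []
with open(path, "r", encoding="utf-8") as f:
    while True:
        chunk = f.read(4)   # at most 4 characters per call in text mode
        if not chunk:       # an empty string signals end of file
            break
        chunks.append(chunk)

os.remove(path)
print(chunks)
```

Only one chunk is held in memory at a time unless, as here, the chunks are collected into a list for inspection.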

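(2.1.2) Writing a text file with open():
Writing a file works the same way as reading one; the only difference is that open() is called with the mode 'w' or 'wb', for writing a text file or a binary file respectively.
Writing a text file with open(), demo: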

Basic form:
>>> f = open('/Users/michael/test.txt', 'w')
>>> f.write('Hello, world!')
>>> f.close()
Better form:
with open('/Users/michael/test.txt', 'w') as f:
    f.write('Hello, world!')
 ///////////////////////////////////////////
Write example 2:
with open('/Users/ethan/data2.txt', 'w') as f:
    f.write('one\n')
    f.write('two')
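 Notes:
(1) If the file already exists, its original content is cleared and overwritten.
(2) If the directory part of the path is valid (for example, /Users/ethan exists) but the file itself (data2.txt) does not, a new file is created and the content is written to it.
(3) If the directory part of the path is wrong (for example, it is written as /Users/eth), an error is raised: FileNotFoundError in Python 3, IOError in Python 2.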
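
The behaviour in these notes can be checked with a small sketch. It assumes Python 3 and uses a temporary directory in place of /Users/ethan; the names base, path and bad_path are illustrative.

```python
import os
import tempfile

base = tempfile.mkdtemp()
path = os.path.join(base, "data2.txt")

# First write creates the file; the second one truncates and overwrites it.
with open(path, "w") as f:
    f.write("old content that is longer\n")
with open(path, "w") as f:
    f.write("one\n")
    f.write("two")
with open(path, "r") as f:
    content = f.read()

# A path whose directory does not exist cannot be opened for writing.
bad_path = os.path.join(base, "no_such_dir", "data2.txt")
try:
    open(bad_path, "w")
    missing_dir_error = None
except FileNotFoundError as exc:
    missing_dir_error = exc

os.remove(path)
os.rmdir(base)
print(repr(content))
```

To keep existing content and add to the end instead, the mode 'a' would be used in place of 'w'.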

with open('/home/ccs/tmp.txt', 'r') as f:
    lines = list(f)  # f is a file object; list() consumes its iterator and returns a list of lines
    print(lines)

with open('/home/ccs/tmp.txt', 'r') as f:
    lines = f.readlines()  # readlines() reads every remaining line and returns them as a list
    print(lines)
'''
Result of both approaches above:
['10  1   9   9\n', '6   3   2   8\n', '20  10  3   23\n', '1   4   1   10\n', '10  8   6   3\n', '10  2   1   6\n']
Conclusion:
A file iterator supports the same operations as an ordinary iterator. For example, list(f) above turns f into a list of strings, which gives the same result as readlines().

'''

with open('/home/ccs/tmp.txt', 'r') as f:
    for line in f:
        print(line)

with open('/home/ccs/tmp.txt', 'r') as f:
    while True:
        line = f.readline()     # 逐行读取
        if not line:
            break
        print(line)             # line still ends with '\n'; use print(line, end='') to avoid the extra blank line

'''
Result of both approaches above:
10  1   9   9

6   3   2   8

20  10  3   23

1   4   1   10

10  8   6   3

10  2   1   6

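Conclusion:
The file object f returned by open() is itself iterable; looping over it with an ordinary for loop yields the same lines as calling f.readline() repeatedly. Each line keeps its trailing '\n' and print() adds another newline, which is why blank lines appear in the output above.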
'''
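
As a small check of that equivalence, the sketch below reads the same file both ways and removes the trailing newline from each line so no blank lines would be printed. The two-line sample file is an assumption standing in for /home/ccs/tmp.txt.

```python
import os
import tempfile

# Sample data modelled on the first two rows shown above.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w") as f:
    f.write("10  1   9   9\n6   3   2   8\n")

# Approach 1: iterate over the file object directly.
with open(path, "r") as f:
    by_iteration = [line.rstrip("\n") for line in f]

# Approach 2: call readline() until it returns an empty string.
by_readline = []
with open(path, "r") as f:
    while True:
        line = f.readline()
        if not line:
            break
        by_readline.append(line.rstrip("\n"))

os.remove(path)
print(by_iteration == by_readline)
```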

(2.1.3) The open() function reference:
def open(file, mode='r', buffering=None, encoding=None, errors=None, newline=None, closefd=True): # known special case of open
"""
Open file and return a stream. Raise IOError upon failure.

file is either a text or byte string giving the name (and the path
if the file isn't in the current working directory) of the file to
be opened or an integer file descriptor of the file to be
wrapped. (If a file descriptor is given, it is closed when the
returned I/O object is closed, unless closefd is set to False.)

mode is an optional string that specifies the mode in which the file
is opened. It defaults to 'r' which means open for reading in text
mode.  Other common values are 'w' for writing (truncating the file if
it already exists), 'x' for creating and writing to a new file, and
'a' for appending (which on some Unix systems, means that all writes
append to the end of the file regardless of the current seek position).
In text mode, if encoding is not specified the encoding used is platform
dependent: locale.getpreferredencoding(False) is called to get the
current locale encoding. (For reading and writing raw bytes use binary
mode and leave encoding unspecified.) The available modes are:

========= ===============================================================
Character Meaning
--------- ---------------------------------------------------------------
'r'       open for reading (default)
'w'       open for writing, truncating the file first
'x'       create a new file and open it for writing
'a'       open for writing, appending to the end of the file if it exists
'b'       binary mode
't'       text mode (default)
'+'       open a disk file for updating (reading and writing)
'U'       universal newline mode (deprecated)
========= ===============================================================

The default mode is 'rt' (open for reading text). For binary random
access, the mode 'w+b' opens and truncates the file to 0 bytes, while
'r+b' opens the file without truncation. The 'x' mode implies 'w' and
raises an `FileExistsError` if the file already exists.

Python distinguishes between files opened in binary and text modes,
even when the underlying operating system doesn't. Files opened in
binary mode (appending 'b' to the mode argument) return contents as
bytes objects without any decoding. In text mode (the default, or when
't' is appended to the mode argument), the contents of the file are
returned as strings, the bytes having been first decoded using a
platform-dependent encoding or using the specified encoding if given.

'U' mode is deprecated and will raise an exception in future versions
of Python.  It has no effect in Python 3.  Use newline to control
universal newlines mode.

buffering is an optional integer used to set the buffering policy.
Pass 0 to switch buffering off (only allowed in binary mode), 1 to select
line buffering (only usable in text mode), and an integer > 1 to indicate
the size of a fixed-size chunk buffer.  When no buffering argument is
given, the default buffering policy works as follows:

* Binary files are buffered in fixed-size chunks; the size of the buffer
  is chosen using a heuristic trying to determine the underlying device's
  "block size" and falling back on `io.DEFAULT_BUFFER_SIZE`.
  On many systems, the buffer will typically be 4096 or 8192 bytes long.

* "Interactive" text files (files for which isatty() returns True)
  use line buffering.  Other text files use the policy described above
  for binary files.

encoding is the name of the encoding used to decode or encode the
file. This should only be used in text mode. The default encoding is
platform dependent, but any encoding supported by Python can be
passed.  See the codecs module for the list of supported encodings.

errors is an optional string that specifies how encoding errors are to
be handled---this argument should not be used in binary mode. Pass
'strict' to raise a ValueError exception if there is an encoding error
(the default of None has the same effect), or pass 'ignore' to ignore
errors. (Note that ignoring encoding errors can lead to data loss.)
See the documentation for codecs.register or run 'help(codecs.Codec)'
for a list of the permitted encoding error strings.

newline controls how universal newlines works (it only applies to text
mode). It can be None, '', '\n', '\r', and '\r\n'.  It works as
follows:

* On input, if newline is None, universal newlines mode is
  enabled. Lines in the input can end in '\n', '\r', or '\r\n', and
  these are translated into '\n' before being returned to the
  caller. If it is '', universal newline mode is enabled, but line
  endings are returned to the caller untranslated. If it has any of
  the other legal values, input lines are only terminated by the given
  string, and the line ending is returned to the caller untranslated.

* On output, if newline is None, any '\n' characters written are
  translated to the system default line separator, os.linesep. If
  newline is '' or '\n', no translation takes place. If newline is any
  of the other legal values, any '\n' characters written are translated
  to the given string.

If closefd is False, the underlying file descriptor will be kept open
when the file is closed. This does not work when a file name is given
and must be True in that case.

A custom opener can be used by passing a callable as *opener*. The
underlying file descriptor for the file object is then obtained by
calling *opener* with (*file*, *flags*). *opener* must return an open
file descriptor (passing os.open as *opener* results in functionality
similar to passing None).

open() returns a file object whose type depends on the mode, and
through which the standard file operations such as reading and writing
are performed. When open() is used to open a file in a text mode ('w',
'r', 'wt', 'rt', etc.), it returns a TextIOWrapper. When used to open
a file in a binary mode, the returned class varies: in read binary
mode, it returns a BufferedReader; in write binary and append binary
modes, it returns a BufferedWriter, and in read/write mode, it returns
a BufferedRandom.

It is also possible to use a string or bytearray as a file for both
reading and writing. For strings StringIO can be used like a file
opened in a text mode, and for bytes a BytesIO can be used like a file
opened in a binary mode.
"""

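A few of the parameters described in the docstring can be tried out in a short sketch. It assumes Python 3; the temporary file and the sample string are illustrative, and only the mode, encoding and newline arguments are exercised.

```python
import io
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# Text mode with an explicit encoding; newline="\n" disables newline translation on output.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    text_type = type(f)
    f.write("caf\u00e9\n")

# Binary mode returns the raw bytes with no decoding.
with open(path, "rb") as f:
    binary_type = type(f)
    raw = f.read()

# Text mode decodes the bytes back into a str.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

os.remove(path)
print(text_type.__name__, binary_type.__name__, raw, repr(text))
```

Passing encoding explicitly avoids depending on the platform default mentioned in the docstring.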