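Author: SeanCheney
Link: https://www.jianshu.com/p/047d8c1c7e14
Based on the Jianshu article, with some of my own understanding added; it covers the parts that are most commonly used and useful.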

Reading and Writing Data in Text Format

In [13]: pd.read_csv('examples/ex2.csv', header=None)
Out[13]: 
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo

In [14]: pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
Out[14]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

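Suppose you want the message column to be the index of the returned DataFrame. You can either indicate that you want the column at index 4, or pass the name 'message' via the index_col argument: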

In [15]: names = ['a', 'b', 'c', 'd', 'message']

In [16]: pd.read_csv('examples/ex2.csv', names=names, index_col='message')
Out[16]: 
         a   b   c   d
message               
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
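
As a minimal sketch of the two options just mentioned, the snippet below selects the index column once by name and once by position. The inline text is an assumed stand-in for examples/ex2.csv, reconstructed from the output above, and the variable names are only illustrative.

```python
import io

import pandas as pd

# Assumed stand-in for examples/ex2.csv (no header row).
csv_text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
names = ['a', 'b', 'c', 'd', 'message']

# Select the index column by name ...
by_name = pd.read_csv(io.StringIO(csv_text), names=names, index_col='message')
# ... or by its zero-based position among the columns.
by_position = pd.read_csv(io.StringIO(csv_text), names=names, index_col=4)

print(by_name.equals(by_position))
print(by_name.loc['world', 'c'])
```

Both calls should give the same frame; the name is usually easier to read, while the position still works when the file has no usable column names.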

If you want to form a hierarchical index from multiple columns, pass a list of column numbers or names:

In [17]: !cat examples/csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16

In [18]: parsed = pd.read_csv('examples/csv_mindex.csv',
   ....:                      index_col=['key1', 'key2'])

In [19]: parsed
Out[19]: 
           value1  value2
key1 key2                
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16
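
The sketch below repeats the hierarchical-index read using column numbers instead of names, then selects from the result. The CSV text is a shortened, assumed copy of csv_mindex.csv held in memory.

```python
import io

import pandas as pd

# Shortened in-memory copy of csv_mindex.csv (an assumption for illustration).
csv_text = """key1,key2,value1,value2
one,a,1,2
one,b,3,4
two,a,9,10
two,b,11,12
"""

# Column numbers work as well as column names for index_col.
parsed = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])

one_rows = parsed.loc['one']                  # every row whose key1 is 'one'
single = parsed.loc[('two', 'b'), 'value1']   # one cell, addressed by both levels

print(one_rows)
print(single)
```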

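Although you could do some munging by hand, the fields here are separated by a variable amount of whitespace. In such cases you can pass a regular expression as the delimiter for read_table. Here the delimiter can be expressed by the regular expression \s+, which gives:

In [21]: result = pd.read_table('examples/ex3.txt', sep=r'\s+')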

In [22]: result
Out[22]: 
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
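
A small self-contained sketch of the same idea, reading whitespace-delimited text from memory. The text is an assumed excerpt in the shape of ex3.txt. The pattern is written as a raw string so that Python does not try to interpret the backslash.

```python
import io

import pandas as pd

# Assumed excerpt in the shape of ex3.txt: the header has one fewer field
# than the data rows, and the gaps are a varying number of spaces.
text = """            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
"""

result = pd.read_csv(io.StringIO(text), sep=r'\s+')
print(result)
```

Because the header row has one fewer field than the data rows, pandas infers that the first column should be the index.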

The na_values option accepts a list or set of strings that should be treated as missing values:

In [29]: result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])

In [30]: result
Out[30]: 
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

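Different NA sentinels can be specified for each column in a dict.
Here, the values foo and NA in the message column and the value two in the something column are all set to NaN, that is, treated as missing:

In [31]: sentinels = {'message': ['foo', 'NA'], 'something': ['two']}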

In [32]: pd.read_csv('examples/ex5.csv', na_values=sentinels)
Out[32]: 
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN
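
To check which cells end up missing, a sketch along these lines may help. The CSV text is an assumed reconstruction of ex5.csv based on the outputs above.

```python
import io

import pandas as pd

# Assumed reconstruction of examples/ex5.csv.
csv_text = """something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
"""

# Per-column sentinels: these are added to pandas' default NA markers.
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv(io.StringIO(csv_text), na_values=sentinels)

# Boolean mask of the cells that were read as missing.
print(result.isna())
```

The empty field in column c is still read as NaN, since the default markers stay active unless they are switched off.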

Reading Text Files in Pieces

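When processing very large files, or when figuring out the right set of arguments to correctly process a large file, you may want to read only a small piece of the file or iterate through it in smaller chunks.
If you want to read only a small number of rows (and avoid reading the entire file), specify that with nrows: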

pd.read_csv('examples/ex6.csv', nrows=5)

To read a file in pieces, specify a chunksize as a number of rows:

In [874]: chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

In [875]: chunker
Out[875]: <pandas.io.parsers.TextParser at 0x8398150>

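The TextParser object returned by read_csv (named TextFileReader in recent pandas versions) lets you iterate over the file in pieces of chunksize rows. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column: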

chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

# An explicit dtype avoids the ambiguous default dtype of an empty Series
tot = pd.Series([], dtype='int64')
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)

We then have:

In [40]: tot[:10]
Out[40]: 
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
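
The chunked aggregation can be tried without the real file. The sketch below uses a tiny synthetic 'key' column (an assumption standing in for ex6.csv) and a deliberately small chunksize so that several chunks are produced.

```python
import io

import pandas as pd

# Synthetic stand-in for ex6.csv: a single 'key' column with 10 rows.
csv_text = "key\n" + "\n".join("ABABCABCAB") + "\n"

tot = pd.Series([], dtype='float64')
n_chunks = 0
for piece in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Add this chunk's counts to the running totals; keys that are missing
    # on either side are treated as 0.
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
    n_chunks += 1

tot = tot.sort_values(ascending=False)
print(n_chunks)
print(tot)
```

Only one chunk is held in memory at a time, which is the point of the pattern; the totals match what a single value_counts over the whole file would give.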

TextParser also has a get_chunk method that lets you read pieces of an arbitrary size.
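
A brief sketch of get_chunk, again on assumed synthetic data. Passing a size reads that many rows; calling it without a size falls back to the chunksize given to read_csv.

```python
import io

import pandas as pd

# Synthetic single-column data, used only for illustration.
csv_text = "key\n" + "\n".join("ABABCABCAB") + "\n"

reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)
first = reader.get_chunk(3)   # an explicit size: the next 3 rows
second = reader.get_chunk()   # no size: the next chunksize (4) rows
reader.close()

print(first)
print(second)
```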

Writing Data to Text Format

Data can also be exported to a delimited format. Let's look again at one of the CSV files read before:

In [41]: data = pd.read_csv('examples/ex5.csv')

In [42]: data
Out[42]: 
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

Using DataFrame's to_csv method, we can write the data out to a comma-separated file:

In [43]: data.to_csv('examples/out.csv')

In [44]: !cat examples/out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo

Other delimiters can be used, of course (this writes to sys.stdout, so the text result is simply printed to the console):

In [45]: import sys

In [46]: data.to_csv(sys.stdout, sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo

Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:

In [47]: data.to_csv(sys.stdout, na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo

With no other options specified, both the row and column labels are written. Both of these can be disabled:

In [48]: data.to_csv(sys.stdout, index=False, header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo

You can also write only a subset of the columns, and in an order of your choosing:

In [49]: data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
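
The options shown above can be combined in one call. The following sketch writes to an in-memory buffer instead of sys.stdout and reads the text back, so the round trip can be checked. The frame is an assumed reconstruction of the ex5.csv data.

```python
import io

import numpy as np
import pandas as pd

# Assumed reconstruction of the frame read from ex5.csv.
data = pd.DataFrame({
    'something': ['one', 'two', 'three'],
    'a': [1, 5, 9],
    'b': [2, 6, 10],
    'c': [3.0, np.nan, 11.0],
    'd': [4, 8, 12],
    'message': [np.nan, 'world', 'foo'],
})

buf = io.StringIO()
# Pipe-delimited, explicit NA marker, no row labels, three chosen columns.
data.to_csv(buf, sep='|', na_rep='NULL', index=False,
            columns=['a', 'c', 'message'])
text = buf.getvalue()
print(text)

# Reading back with the same delimiter and NA marker restores the gaps.
restored = pd.read_csv(io.StringIO(text), sep='|', na_values=['NULL'])
print(restored)
```

Writing to a StringIO buffer is convenient for checking the exact text that would end up in a file.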

Series also has a to_csv method:

In [50]: dates = pd.date_range('1/1/2000', periods=7)

In [51]: ts = pd.Series(np.arange(7), index=dates)

In [52]: ts.to_csv('examples/tseries.csv')

In [53]: !cat examples/tseries.csv
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
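
One version-related caveat: recent pandas releases write a header row for a Series by default, so reproducing the headerless file shown above most likely needs header=False. The sketch below assumes a recent pandas and uses a shorter series written to an in-memory buffer.

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=3)
ts = pd.Series(np.arange(3), index=dates)

buf = io.StringIO()
ts.to_csv(buf, header=False)   # suppress the header row
lines = buf.getvalue().splitlines()
print(lines)

# Read the two-column text back into a Series with a DatetimeIndex.
restored = pd.read_csv(io.StringIO(buf.getvalue()), header=None,
                       index_col=0, parse_dates=True).squeeze('columns')
print(restored)
```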

Interacting with Databases

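In a business setting, most data may not be stored in text or Excel files. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and other kinds of databases are popular as well. The choice of database usually depends on the performance, data integrity, and scalability needs of an application.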

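Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, I'll use a SQLite database via Python's built-in sqlite3 driver: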

In [121]: import sqlite3

In [122]: query = """
   .....: CREATE TABLE test
   .....: (a VARCHAR(20), b VARCHAR(20),
   .....:  c REAL,        d INTEGER
   .....: );"""

In [123]: con = sqlite3.connect('mydata.sqlite')

In [124]: con.execute(query)
Out[124]: <sqlite3.Cursor at 0x7f6b12a50f10>

In [125]: con.commit()

Then insert a few rows of data:

In [126]: data = [('Atlanta', 'Georgia', 1.25, 6),
   .....:         ('Tallahassee', 'Florida', 2.6, 3),
   .....:         ('Sacramento', 'California', 1.7, 5)]

In [127]: stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

In [128]: con.executemany(stmt, data)
Out[128]: <sqlite3.Cursor at 0x7f6b15c66ce0>

In [129]: con.commit()

Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return a list of tuples when selecting data from a table:

In [130]: cursor = con.execute('select * from test')

In [131]: rows = cursor.fetchall()

In [132]: rows
Out[132]: 
[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

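You can pass the list of tuples to the DataFrame constructor, but you also need the column names, which are contained in the cursor's description attribute: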

In [133]: cursor.description
Out[133]: 
(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [134]: pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
Out[134]: 
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5
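
The cursor-to-DataFrame steps can be wrapped in a small function. The helper name query_to_frame is hypothetical (it is not part of pandas or sqlite3), and an in-memory database is used here so that the sketch leaves no file behind.

```python
import sqlite3

import pandas as pd


def query_to_frame(con, sql, params=()):
    """Hypothetical helper: run a query and build a DataFrame from the cursor."""
    cursor = con.execute(sql, params)
    # The first item of each description entry is the column name.
    columns = [col[0] for col in cursor.description]
    return pd.DataFrame(cursor.fetchall(), columns=columns)


con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)",
                [('Atlanta', 'Georgia', 1.25, 6),
                 ('Tallahassee', 'Florida', 2.6, 3),
                 ('Sacramento', 'California', 1.7, 5)])
con.commit()

# The ? placeholder keeps the filter value out of the SQL string.
frame = query_to_frame(con, "SELECT a, d FROM test WHERE d > ? ORDER BY d DESC", (4,))
con.close()
print(frame)
```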

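This is quite a bit of munging that you'd rather not repeat each time you query the database. The SQLAlchemy project is a popular Python SQL toolkit that abstracts away many of the common differences between SQL databases. pandas has a read_sql function that lets you read data easily from a SQLAlchemy connection. Here, we connect to the same SQLite database with SQLAlchemy and read data from the table created before: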

In [135]: import sqlalchemy as sqla

In [136]: db = sqla.create_engine('sqlite:///mydata.sqlite')

In [137]: pd.read_sql('select * from test', db)
Out[137]: 
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5
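
If SQLAlchemy is not installed, read_sql also accepts a plain sqlite3 connection. The sketch below relies on that and on the params argument for a parameterized query; the in-memory database and the filter value are assumptions made for illustration.

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)",
                [('Atlanta', 'Georgia', 1.25, 6),
                 ('Tallahassee', 'Florida', 2.6, 3),
                 ('Sacramento', 'California', 1.7, 5)])
con.commit()

# read_sql runs the query and returns a DataFrame with the column names set.
result = pd.read_sql('SELECT * FROM test WHERE d >= ? ORDER BY d', con, params=(5,))
con.close()
print(result)
```

For databases other than SQLite, a SQLAlchemy engine as in the transcript above remains the usual route.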
