Data Loading, Storage, and File Formats

xiaoyao

import numpy as np
import pandas as pd
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

Reading and Writing Data in Text Format

Reading data from, and writing data to, text files

# read_csv uses ',' as its default delimiter
df = pd.read_csv('examples/ex1.csv')
df

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

# read_table uses '\t' as its default delimiter, so pass sep=',' for a CSV file
pd.read_table('examples/ex1.csv', sep=',')

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

"""Some files have no header row. You can let pandas assign default column
names, or specify the names yourself.
"""
pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

pd.read_csv('examples/ex2.csv', header=None)

	0	1	2	3	4
0	1	2	3	4	hello
1	5	6	7	8	world
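2	9	10	11	12	foo

"""
To make the message column the index of the returned DataFrame, you can either
refer to it by position (the column at index 4) or pass its name, 'message',
to the index_col argument.
"""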
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('examples/ex2.csv', names=names, index_col='message')

	a	b	c	d
message
hello	1	2	3	4
world	5	6	7	8
foo	9	10	11	12

# !cat examples/csv_mindex.csv
parsed = pd.read_csv('examples/csv_mindex.csv',
                     index_col=['key1', 'key2'])
parsed

		value1	value2
key1	key2
one	a	1	2
	b	3	4
	c	5	6
	d	7	8
two	a	9	10
	b	11	12
	c	13	14
	d	15	16

list(open('examples/ex3.txt'))

['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

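Some tables do not use a fixed delimiter to separate fields; they use whitespace or some other pattern instead.

In regular expressions, \s matches any whitespace character, including spaces, tabs, form feeds, and so on. It is equivalent to [ \f\n\r\t\v].

\f -> matches a form feed
\n -> matches a newline
\r -> matches a carriage return
\t -> matches a tab
\v -> matches a vertical tab
"\s+" matches one or more of the characters above.

"""
Although you could tidy this data by hand, the fields here are separated by a
variable amount of whitespace. In such cases, you can pass a regular expression
as the delimiter for read_table, here the whitespace pattern described above.
"""
# A raw string keeps the backslash from being treated as an escape character
result = pd.read_table('examples/ex3.txt', sep=r'\s+')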
result

	A	B	C
aaa	-0.264438	-1.026059	-0.619500
bbb	0.927272	0.302904	-0.032399
ccc	-0.264273	-0.386314	-0.217601
ddd	-0.871858	-0.348382	1.100491
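As a rough illustration of what the \s+ pattern does on its own, here is a small sketch using the standard-library re module rather than pandas. The sample line is copied from the ex3.txt listing above; the variable names are only for illustration.

```python
import re

# One data row of whitespace-delimited text, as in examples/ex3.txt
line = 'aaa -0.264438 -1.026059 -0.619500\n'

# \s+ matches one or more consecutive whitespace characters, so each run of
# spaces counts as a single separator; strip() removes the trailing newline
fields = re.split(r'\s+', line.strip())
print(fields)
```

pandas applies the same kind of splitting to every line when a regular expression is passed as sep.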

# Use skiprows to skip the first, third, and fourth lines of the file
pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

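# Handling missing values is an important part of the file parsing process.

# Missing data is usually either not present (an empty string) or marked by some sentinel value.

# By default, pandas recognizes a set of commonly occurring sentinels, such as NA and NULL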
result = pd.read_csv('examples/ex5.csv')
result

	something	a	b	c	d	message
0	one	1	2	3.0	4	NaN
1	two	5	6	NaN	8	world
2	three	9	10	11.0	12	foo

pd.isnull(result)

	something	a	b	c	d	message
0	False	False	False	False	False	True
1	False	False	False	True	False	False
2	False	False	False	False	False	False

# na_values accepts a list or set of strings to treat as missing values
result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
result

	something	a	b	c	d	message
0	one	1	2	3.0	4	NaN
1	two	5	6	NaN	8	world
2	three	9	10	11.0	12	foo

# With a dict, different NA sentinels can be specified for each column
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv('examples/ex5.csv', na_values=sentinels)

	something	a	b	c	d	message
0	one	1	2	3.0	4	NaN
1	NaN	5	6	NaN	8	world
2	three	9	10	11.0	12	NaN

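Commonly used options for pandas.read_csv and pandas.read_table:

Argument	Description
path	String indicating a filesystem location, URL, or file-like object
sep or delimiter	Character sequence or regular expression used to split the fields in each row
header	Row number to use as column names. Defaults to 0 (the first row); set it to None if there is no header row
index_col	Column numbers or names to use as the row index; can be a single name/number or a list of them (for a hierarchical index)
names	List of column names for the result; combine with header=None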
$\vdots$	$\vdots$

Reading Text Files in Pieces

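When processing very large files, or when working out the right set of arguments for processing a large file later, you may want to read only a small piece of the file or iterate through it in chunks.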

# Before looking at a large file, make the pandas display settings more compact
pd.options.display.max_rows = 10

result = pd.read_csv('examples/ex6.csv')
result

	one	two	three	four	key
0	0.467976	-0.038649	-0.295344	-1.824726	L
1	-0.358893	1.404453	0.704965	-0.200638	B
2	-0.501840	0.659254	-0.421691	-0.057688	G
3	0.204886	1.074134	1.388361	-0.982404	R
4	0.354628	-0.133116	0.283763	-0.837063	Q
...	...	...	...	...	...
9995	2.311896	-0.417070	-1.409599	-0.515821	L
9996	-0.479893	-0.650419	0.745152	-0.646038	E
9997	0.523331	0.787112	0.486066	1.093156	K
9998	-0.362559	0.598894	-1.843201	0.887292	G
9999	-0.096376	-1.012999	-0.657431	-0.573315	0

10000 rows × 5 columns

# To read only a few rows (and avoid reading the whole file), specify nrows
pd.read_csv('examples/ex6.csv', nrows=5)

	one	two	three	four	key
0	0.467976	-0.038649	-0.295344	-1.824726	L
1	-0.358893	1.404453	0.704965	-0.200638	B
2	-0.501840	0.659254	-0.421691	-0.057688	G
3	0.204886	1.074134	1.388361	-0.982404	R
4	0.354628	-0.133116	0.283763	-0.837063	Q

# To read the file in chunks, specify chunksize (a number of rows)
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
chunker

<pandas.io.parsers.TextFileReader at 0x23892e3d848>

import warnings
warnings.filterwarnings('ignore')

Iterate over ex6.csv and aggregate the value counts of the 'key' column as follows:

chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

tot = pd.Series([], dtype='float64')  # explicit dtype avoids the empty-Series dtype warning
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)

tot[:10]

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64

The TextFileReader returned by read_csv also has a get_chunk method, which lets you read pieces of an arbitrary size.
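A minimal sketch of get_chunk, assuming a small in-memory CSV in place of a large file. The data and column names are made up for illustration and are not taken from ex6.csv.

```python
from io import StringIO

import pandas as pd

# Small in-memory CSV standing in for a large file (made-up data)
csv_text = 'key,value\n' + ''.join(f'k{i},{i}\n' for i in range(10))

reader = pd.read_csv(StringIO(csv_text), chunksize=4)

# get_chunk(n) reads the next n rows, regardless of the chunksize given above
first = reader.get_chunk(3)
second = reader.get_chunk(5)

print(first)
print(second)
```

Each call continues where the previous one stopped, so the two pieces do not overlap.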

Writing Data to Text Format

data = pd.read_csv('examples/ex5.csv')
data

	something	a	b	c	d	message
0	one	1	2	3.0	4	NaN
1	two	5	6	NaN	8	world
2	three	9	10	11.0	12	foo

# Data can also be exported to a delimited text format

# Using DataFrame's to_csv method, we can write the data out to a comma-separated file

data.to_csv('examples/out.csv')
# !cat examples/out.csv

import sys
# Other delimiters can be used too; writing to sys.stdout simply prints the text result
data.to_csv(sys.stdout, sep='|')

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo

# Missing values appear as empty strings in the output. Here they are written as another sentinel value instead:
data.to_csv(sys.stdout, na_rep='NULL')

,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo

# With no other options specified, both the row and column labels are written. Both can be disabled
data.to_csv(sys.stdout, index=False, header=False)

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo

# You can also write only a subset of the columns, in an order of your choosing
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])

a,b,c
1,2,3.0
5,6,
9,10,11.0

Series also has a to_csv method.

dates = pd.date_range('1/1/2000', periods=7)
ts = pd.Series(np.arange(7), index=dates)
ts.to_csv('examples/tseries.csv')
# !cat examples/tseries.csv

datesfile = pd.read_csv('examples/tseries.csv')
datesfile

	Unnamed: 0	0
0	2000-01-01	0
1	2000-01-02	1
2	2000-01-03	2
3	2000-01-04	3
4	2000-01-05	4
5	2000-01-06	5
6	2000-01-07	6

Working with Delimited Formats

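Most tabular data stored on disk can be loaded with pandas.read_table, but sometimes manual processing is needed. In practice, it is quite common to receive a file with one or more malformed lines that trip up read_table.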

# Pass any open file or file-like object to csv.reader. Iterating over the reader
# yields a list of fields for each line, with any quote characters removed

import csv
f = open('examples/ex7.csv')

reader = csv.reader(f)

for line in reader:
    print(line)

f.close()

To get the data into the required form, some processing is needed.

First, read the file into a list of lines:

with open('examples/ex7.csv') as f:
    lines = list(csv.reader(f))

# Then split the lines into the header line and the data lines
header, values = lines[0], lines[1:]

# Using the zip function
a = [1,2,3]
b = [4,5,6]
c = [4,5,6,7,8]
zipped = zip(a,b)     # pairs up elements lazily; in Python 3 this is an iterator, not a list
print(zipped)

<zip object at 0x00000238958F5808>

print(zip(a,c) )             # the result is as long as the shortest input

print(zip(*zipped)  )        # zip(*...) unzips the pairs; this is still a lazy zip object

<zip object at 0x0000023895828988>

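# Then use a dict comprehension and zip(*values), which transposes rows into columns, to build a dict of data columns:

"""
zip() takes iterables as arguments, packs their corresponding elements into tuples, and returns
an iterator over those tuples (a list in Python 2).

If the iterables differ in length, the result is as long as the shortest one. Using the * operator,
zip(*rows) reverses the packing, turning a sequence of rows into a sequence of columns.
"""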
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict

{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
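Because zip is lazy in Python 3, printing it only shows a zip object. The sketch below wraps the calls in list() to make the behaviour visible. The header and values lists are made-up stand-ins for the lines read from ex7.csv.

```python
# Rows as they might come back from csv.reader (made-up values)
header = ['a', 'b', 'c']
values = [['1', '2', '3'], ['4', '5', '6']]

# zip stops at the end of the shorter input
pairs = list(zip([1, 2, 3], [4, 5, 6, 7, 8]))

# zip(*values) transposes rows into columns
columns = list(zip(*values))

# Pair each header name with its column
data_dict = {h: v for h, v in zip(header, columns)}

print(pairs)
print(columns)
print(data_dict)
```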

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

reader = csv.reader(f, dialect=my_dialect)

reader = csv.reader(f, delimiter='|')

with open('mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
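The snippets above assume an open file f. As a self-contained sketch, the following writes and re-reads semicolon-delimited text through an in-memory buffer. The class name SemicolonDialect and the sample rows are hypothetical, chosen only for this example.

```python
import csv
from io import StringIO


class SemicolonDialect(csv.Dialect):  # hypothetical name for this sketch
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL


# Write two rows; the field containing the delimiter gets quoted automatically
buf = StringIO()
writer = csv.writer(buf, dialect=SemicolonDialect)
writer.writerow(('one', 'two', 'three'))
writer.writerow(('1', '2;x', '3'))
written = buf.getvalue()

# Read the text back with the same dialect; the quotes are removed again
rows = list(csv.reader(StringIO(written), dialect=SemicolonDialect))

print(written)
print(rows)
```

With QUOTE_MINIMAL, only fields that contain special characters such as the delimiter are quoted.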

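JSON Data

JSON has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more flexible data format than tabular text.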

obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

import json
result = json.loads(obj)
result

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

# json.dumps converts a Python object back to JSON.
asjson = json.dumps(result)

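How do you convert a JSON object (or a list of them) to a DataFrame or another data structure that is convenient for analysis?

The simplest way is to pass a list of dicts (which were previously JSON objects) to the DataFrame constructor and select a subset of the data fields.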

siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings

	name	age
0	Scott	30
1	Katie	38

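# pandas.read_json can automatically convert JSON datasets in specific arrangements into a Series or DataFrame.

# By default, pandas.read_json assumes that each object in the JSON array is a row in the table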
data = pd.read_json('examples/example.json')
data

	a	b	c
0	1	2	3
1	4	5	6
2	7	8	9

# To export data from pandas to JSON, use the to_json method:
print(data.to_json())
print(data.to_json(orient='records'))

{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
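A small round-trip sketch, assuming a DataFrame built by hand with the same values as example.json, so that no file is needed. The records orientation is used in both directions.

```python
from io import StringIO

import pandas as pd

# Same values as the example.json table above, constructed directly
frame = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})

# One JSON object per row
records_json = frame.to_json(orient='records')

# Wrapping the string in StringIO lets read_json treat it like a file
restored = pd.read_json(StringIO(records_json), orient='records')

print(records_json)
print(restored)
```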

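XML and HTML: Web Scraping

Python has many libraries for reading and writing data in the common HTML and XML formats, including lxml, Beautiful Soup, and html5lib.

lxml is comparatively fast, but the other libraries handle malformed HTML or XML files better.

pandas has a built-in function, read_html, which uses libraries such as lxml and Beautiful Soup to automatically parse the tables in an HTML file into DataFrame objects.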

conda install lxml # or: pip install lxml
pip install beautifulsoup4 html5lib

# This HTML file records bank failures
tables = pd.read_html('examples/fdic_failed_bank_list.html')
len(tables)

failures = tables[0]
failures.head()

	Bank Name	City	ST	CERT	Acquiring Institution	Closing Date	Updated Date
0	Allied Bank	Mulberry	AR	91	Today's Bank	September 23, 2016	November 17, 2016
1	The Woodbury Banking Company	Woodbury	GA	11297	United Bank	August 19, 2016	November 17, 2016
2	First CornerStone Bank	King of Prussia	PA	35312	First-Citizens Bank & Trust Company	May 6, 2016	September 6, 2016
3	Trust Company Bank	Memphis	TN	9956	The Bank of Fayette County	April 29, 2016	September 6, 2016
4	North Milwaukee State Bank	Milwaukee	WI	20364	First-Citizens Bank & Trust Company	March 11, 2016	June 16, 2016

Because failures has many columns, pandas inserts a line break character \ when printing it.

# Compute the number of bank failures by year
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()

2010    157
2009    140
2011     92
2012     51
2008     25
       ... 
2004      4
2001      4
2007      3
2003      3
2000      2
Name: Closing Date, Length: 15, dtype: int64

Parsing XML with lxml.objectify

"""
<INDICATOR>
  <INDICATOR_SEQ>373889</INDICATOR_SEQ>
  <PARENT_SEQ></PARENT_SEQ>
  <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
  <INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
  <DESCRIPTION>Percent of the time that escalators are operational
  systemwide. The availability rate is based on physical observations performed
  the morning of regular business days only. This is a new indicator the agency
  began reporting in 2009.</DESCRIPTION>
  <PERIOD_YEAR>2011</PERIOD_YEAR>
  <PERIOD_MONTH>12</PERIOD_MONTH>
  <CATEGORY>Service Indicators</CATEGORY>
  <FREQUENCY>M</FREQUENCY>
  <DESIRED_CHANGE>U</DESIRED_CHANGE>
  <INDICATOR_UNIT>%</INDICATOR_UNIT>
  <DECIMAL_PLACES>1</DECIMAL_PLACES>
  <YTD_TARGET>97.00</YTD_TARGET>
  <YTD_ACTUAL></YTD_ACTUAL>
  <MONTHLY_TARGET>97.00</MONTHLY_TARGET>
  <MONTHLY_ACTUAL></MONTHLY_ACTUAL>
</INDICATOR>
"""

%pwd

'D:\\python code\\8messy\\4利用python进行数据分析\\pydata-book-2nd-edition'

# First parse the file with lxml.objectify, then get a reference to the root node of the XML file with getroot:
from lxml import objectify

path = 'datasets/mta_perf/Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()

root.INDICATOR returns a generator yielding each <INDICATOR> XML element. For each record, populate a dict mapping tag names to data values.

data = []

skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
               'DESIRED_CHANGE', 'DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)

perf = pd.DataFrame(data)
perf.head()

	AGENCY_NAME	INDICATOR_NAME	DESCRIPTION	PERIOD_YEAR	PERIOD_MONTH	CATEGORY	FREQUENCY	INDICATOR_UNIT	YTD_TARGET	YTD_ACTUAL	MONTHLY_TARGET	MONTHLY_ACTUAL
0	Metro-North Railroad	On-Time Performance (West of Hudson)	Percent of commuter trains that arrive at thei...	2008	1	Service Indicators	M	%	95	96.9	95	96.9
1	Metro-North Railroad	On-Time Performance (West of Hudson)	Percent of commuter trains that arrive at thei...	2008	2	Service Indicators	M	%	95	96	95	95
2	Metro-North Railroad	On-Time Performance (West of Hudson)	Percent of commuter trains that arrive at thei...	2008	3	Service Indicators	M	%	95	96.3	95	96.9
3	Metro-North Railroad	On-Time Performance (West of Hudson)	Percent of commuter trains that arrive at thei...	2008	4	Service Indicators	M	%	95	96.8	95	98.3
4	Metro-North Railroad	On-Time Performance (West of Hudson)	Percent of commuter trains that arrive at thei...	2008	5	Service Indicators	M	%	95	96.6	95	95.8
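For readers without lxml installed, the same record-to-dict loop can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in technique, not lxml.objectify: ElementTree has no pyval attribute, so every value stays a string (or None for an empty element). The root tag PERFORMANCE and the two trimmed records are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Two trimmed-down records shaped like the sample above (made-up values;
# the root tag name PERFORMANCE is an assumption)
xml_text = """
<PERFORMANCE>
  <INDICATOR>
    <INDICATOR_SEQ>1</INDICATOR_SEQ>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <PERIOD_YEAR>2011</PERIOD_YEAR>
    <YTD_ACTUAL></YTD_ACTUAL>
  </INDICATOR>
  <INDICATOR>
    <INDICATOR_SEQ>2</INDICATOR_SEQ>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <PERIOD_YEAR>2012</PERIOD_YEAR>
    <YTD_ACTUAL>96.9</YTD_ACTUAL>
  </INDICATOR>
</PERFORMANCE>
"""

skip_fields = {'INDICATOR_SEQ'}
root = ET.fromstring(xml_text)

records = []
for elt in root.findall('INDICATOR'):
    # Map each child tag to its text, leaving out the skipped fields
    record = {child.tag: child.text for child in elt
              if child.tag not in skip_fields}
    records.append(record)

print(records)
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as the data list above.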

from io import StringIO
tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()

root
root.get('href')
root.text

'Google'

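Binary Data Formats

One of the simplest ways to store data efficiently in binary format is to use Python's built-in pickle serialization.

pandas objects all have a to_pickle method that writes the data to disk in pickle format: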

frame = pd.read_csv('examples/ex1.csv')
frame
frame.to_pickle('examples/frame_pickle')

# You can read the pickled data directly with the built-in pickle module, or with the more convenient pandas.read_pickle
pd.read_pickle('examples/frame_pickle')

	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

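Note: pickle is recommended only as a short-term storage format. It is hard to guarantee that the format will stay stable over time; an object pickled today may not unpickle with a later version of a library.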
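A short sketch of both ways of reading a pickled DataFrame, assuming a temporary directory instead of the examples/ folder and a small hand-built frame. The extension-less path leaves the file uncompressed, so the standard-library pickle module can load it directly.

```python
import os
import pickle
import tempfile

import pandas as pd

frame = pd.DataFrame({'a': [1, 5, 9], 'message': ['hello', 'world', 'foo']})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'frame_pickle')
    frame.to_pickle(path)

    # Read it back with the standard-library pickle module...
    with open(path, 'rb') as f:
        via_pickle = pickle.load(f)

    # ...or with the pandas convenience function
    via_pandas = pd.read_pickle(path)

print(via_pickle.equals(via_pandas))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.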

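Using HDF5 Format

pandas has built-in support for two binary data formats: HDF5 and MessagePack. (MessagePack support was deprecated in pandas 0.25 and removed in pandas 1.0.)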

frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store

<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5

# Objects in the HDF5 file can be retrieved with the same dict-like API
store['obj1']

	a
0	-0.204708
1	0.478943
2	-0.519439
3	-0.555730
4	1.965781
...	...
95	0.795253
96	0.118110
97	-0.748532
98	0.584970
99	0.152677

100 rows × 1 columns

# HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is generally slower, but

# it supports query operations using a special syntax

store.put('obj2', frame, format='table')
store.select('obj2', where=['index >= 10 and index <= 15'])
store.close()

frame.to_hdf('mydata.h5', key='obj3', format='table')
pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])

	a
0	-0.204708
1	0.478943
2	-0.519439
3	-0.555730
4	1.965781

import os
os.remove('mydata.h5')

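HDF5 is not a database. It is best suited for write-once, read-many datasets. While data can be added to a file at any time, the file can become corrupted if multiple writers do so simultaneously.

Reading Microsoft Excel Files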

xlsx = pd.ExcelFile('examples/ex1.xlsx')

pd.read_excel(xlsx, 'Sheet1')

	Unnamed: 0	a	b	c	d	message
0	0	1	2	3	4	hello
1	1	5	6	7	8	world
2	2	9	10	11	12	foo

frame = pd.read_excel('examples/ex1.xlsx', 'Sheet1')
frame

	Unnamed: 0	a	b	c	d	message
0	0	1	2	3	4	hello
1	1	5	6	7	8	world
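2	2	9	10	11	12	foo

"""
To write pandas data to Excel format, you must first create an ExcelWriter and then write data to it
using the to_excel method of pandas objects:
"""
writer = pd.ExcelWriter('examples/ex2.xlsx')
frame.to_excel(writer, sheet_name='Sheet1')
writer.close()  # close() writes the file; recent pandas versions no longer provide save()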

# You can also pass a file path to to_excel and avoid the ExcelWriter:
frame.to_excel('examples/ex2.xlsx')

Interacting with Web APIs

import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp

<Response [200]>

data = resp.json()
data[0]['title']

'BUG: resample().nearest() throws an exception if tz is provided for the DatetimeIndex'

issues = pd.DataFrame(data, columns=['number', 'title',
                                     'labels', 'state'])
issues

	number	title	labels	state
0	33895	BUG: resample().nearest() throws an exception ...	[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...	open
1	33894	ENH: Construct pandas dataframe from function	[{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=...	open
2	33892	COMPAT: add back block manager constructors to...	[{'id': 76865106, 'node_id': 'MDU6TGFiZWw3Njg2...	open
3	33891	pandas interpolate inconsistent results with a...	[]	open
4	33890	CI: failing timedelta64 test on Linux py37_loc...	[{'id': 48070600, 'node_id': 'MDU6TGFiZWw0ODA3...	open
...	...	...	...	...
25	33851	DOC: Single Document For Code Guidelines	[{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT...	open
26	33849	BUG: weird behaviour of pivot_table when aggf...	[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...	open
27	33846	BUG: Series construction with EA dtype and ind...	[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...	open
28	33845	TestDatetimeIndex.test_reindex_with_same_tz 32...	[{'id': 563047854, 'node_id': 'MDU6TGFiZWw1NjM...	open
29	33843	Roll quantile support multiple quantiles as pe...	[]	open

30 rows × 4 columns

Interacting with Databases

import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""
con = sqlite3.connect('mydata.sqlite')
con.execute(query)
con.commit()

data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()

cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

cursor.description
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')
pd.read_sql('select * from test', db)

!rm mydata.sqlite
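If SQLAlchemy is not available, pandas.read_sql also accepts a plain sqlite3 connection. The sketch below assumes an in-memory database, so there is no file to remove afterwards; the table and rows repeat part of the example above.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database: nothing is written to disk
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)')
con.executemany('INSERT INTO test VALUES (?, ?, ?, ?)',
                [('Atlanta', 'Georgia', 1.25, 6),
                 ('Tallahassee', 'Florida', 2.6, 3)])
con.commit()

# read_sql takes care of the column names, so no cursor.description handling is needed
frame = pd.read_sql('SELECT * FROM test', con)
con.close()

print(frame)
```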
