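JSON (JavaScript Object Notation) is a common format for transferring data in web applications, for example:
- Using the http://httpbin.org/ API to inspect the HTTP requests you send.
- A crawler using the Splash rendering engine to render pages.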
Requirement: read JSON data in Python.
Solution:
Use the json module from the standard library; its loads() and dumps() functions parse and produce JSON data.
- With the requests module:
>>> import requests
>>> r = requests.get('http://httpbin.org/headers')
>>> r
<Response [200]>
>>> r.content
b'{\n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate", \n "Host": "httpbin.org", \n "User-Agent": "python-requests/2.22.0"\n }\n}\n'
>>> r.text
'{\n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate", \n "Host": "httpbin.org", \n "User-Agent": "python-requests/2.22.0"\n }\n}\n'
- With the json module:
Parsing JSON data (deserialization): json.loads()
>>> import json
>>> d = json.loads(r.text)  # parse the JSON text into a Python dict
>>> d
{'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.22.0'}}
>>> d['headers']
{'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.22.0'}
>>> d['headers']['Host']
'httpbin.org'
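As a minimal sketch of how json.loads() maps each JSON type to a Python type, the snippet below parses a hand-written string instead of a live response. The `sample` payload and its `args`, `ok` and `origin` fields are made up for illustration; the last part shows the json.JSONDecodeError raised on malformed input.

```python
import json

# Hypothetical payload, hand-written for illustration (not a real httpbin response).
sample = '{"headers": {"Host": "httpbin.org"}, "args": [1, 2.5, "x"], "ok": true, "origin": null}'

d = json.loads(sample)

# JSON object -> dict, array -> list, true/false -> True/False, null -> None
print(type(d).__name__)
print(d["headers"]["Host"])
print(d["args"])
print(d["ok"], d["origin"])

# json.loads() also accepts bytes (such as r.content) encoded as UTF-8, UTF-16 or UTF-32.
d_from_bytes = json.loads(sample.encode("utf-8"))

# Malformed input (single quotes are not valid JSON) raises json.JSONDecodeError,
# which is a subclass of ValueError.
try:
    json.loads("{'single': 'quotes'}")
except json.JSONDecodeError as exc:
    error_message = str(exc)
    print("invalid JSON:", error_message)
```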
- Worked example:
Spider (JSON) → Splash → Web
Create the Splash container:
$ sudo docker pull scrapinghub/splash
$ sudo docker run -itd -p 8050:8050 scrapinghub/splash
Serializing data to JSON: json.dumps()
>>> import requests
>>> import json
>>> requests.post
<function post at 0x7fad7195c378>
>>> url = 'http://localhost:8050/render.html'
>>> headers = {'content-type': 'application/json'}
>>> data = {'url': 'http://jd.com', 'timeout': 20, 'images': 0}  # jd.com as an example; timeout limits the render time (seconds), images=0 skips image loading
>>> json_data = json.dumps(data)  # serialize the Python dict to a JSON string
>>> json_data
'{"url": "http://jd.com", "timeout": 20, "images": 0}'
>>> r2 = requests.post(url, headers=headers, data=json_data)
>>> r2
<Response [200]>
>>> r2.text
'<!DOCTYPE html><html class="o2_mini csstransitions cssanimations o2_webkit o2_safari o2_602"><head>\n <meta charset="utf8" version="1">\n <title>京东(JD.COM)-正品低价、品质保障、配送及时、轻松购物!</title>\n <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=yes">\n <meta name="description" content="京东JD.COM-专业的综合网上购物商城,销售家电、数码通讯、电脑、家居百货、服装服饰、母婴、图书、食品等数万个品牌优质商品.便捷、诚信的服务,为您提供愉悦的网上购物体验!">\n # (middle omitted) M11.4,0.9C11.4,0.9,11.4,0.9,11.4,0.9L11.4,0.9z"></path></symbol></defs></svg><ul class="elevator_list"><li class="elevator_item"><a class="elevator_lk" href="javascript:void(0);" clstag="h|keycount|core|elvt_01" tabindex="-1" aria-hidden="true"><span class="elevator_lk_bg"></span><span class="elevator_lk_txt">京东秒杀</span></a></li><li class="elevator_item"><a class="elevator_lk" href="javascript:void(0);" clstag="h|keycount|core|elvt_02" tabindex="-1" aria-hidden="true"><span class="elevator_lk_bg"></span><span class="elevator_lk_txt">特色优选</span></a></li><li class="elevator_item"><a class="elevator_lk" href="javascript:void(0);" clstag="h|keycount|core|elvt_03" tabindex="-1" aria-hidden="true"><span class="elevator_lk_bg"></span><span class="elevator_lk_txt">频道广场</span></a></li><li class="elevator_item"><a class="elevator_lk" href="javascript:void(0);" clstag="h|keycount|core|elvt_04" tabindex="-1" aria-hidden="true"><span class="elevator_lk_bg"></span><span class="elevator_lk_txt">为你推荐</span></a></li><li class="elevator_item"><a class="elevator_lk elevator_lk2" href="//jdcs.jd.com/chat/index.action?venderId=1&entry=jd_web_jimi_jdhome" target="_blank" clstag="h|keycount|core|elvt_05"><span class="elevator_lk_bg"></span><svg><use xlink:href="#icon_timline"></use></svg><span class="elevator_lk_txt">客服</span></a></li><li class="elevator_item"><a class="elevator_lk elevator_lk2" href="//surveys.jd.com/index.php?r=survey/index/sid/889711/newtest/Y/lang/zh-Hans" target="_blank" clstag="h|keycount|core|elvt_06"><span class="elevator_lk_bg"></span><svg><use xlink:href="#icon_feedback"></use></svg><span class="elevator_lk_txt">反馈</span></a></li></ul><a class="elevator_totop" href="javascript: void(0);" clstag="h|keycount|core|elvt_07" tabindex="-1" aria-hidden="true"><span class="elevator_totop_icon">\ue606</span><span class="elevator_totop_txt">顶部</span></a></div></div></div>\n<script type="text/javascript">\n window.point.dom = new Date().getTime();\n</script>\n\n\n\n\n<script type="text/javascript" src="//misc.360buyimg.com/mtd/pc/index_2019/1.0.0/static/js/runtime.js"></script>\n\n<script type="text/javascript" src="//misc.360buyimg.com/mtd/pc/index_2019/1.0.0/static/js/index.chunk.js"></script>\n\n<script type="text/javascript">\n window.point.js = new Date().getTime();\n</script>\n</body></html>'
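json.dumps() takes a few keyword arguments that help when the payload contains non-ASCII text or should be readable by humans. The sketch below reuses the request dict from the session above and adds a hypothetical `note` field for illustration; `ensure_ascii`, `indent` and `sort_keys` are standard parameters of json.dumps().

```python
import json

# Same dict as in the session above, plus a hypothetical non-ASCII field.
data = {"url": "http://jd.com", "timeout": 20, "images": 0, "note": "京东"}

# Default: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(data)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(data, ensure_ascii=False)

# indent and sort_keys give stable, human-readable output.
pretty = json.dumps(data, ensure_ascii=False, indent=2, sort_keys=True)

print(escaped)
print(readable)
print(pretty)

# Every variant decodes back to the same dict.
assert json.loads(escaped) == json.loads(readable) == json.loads(pretty) == data
```

When posting with requests, passing json=data to requests.post() instead of data=json_data serializes the dict and sets the Content-Type header for you.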
Besides json.loads() and json.dumps(), the json module also provides json.load() and json.dump().
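dumps() and dump() are the serialization functions. dumps() only serializes to a str; dump() additionally requires an open file object (not a file descriptor) and writes the serialized str to it.
loads() and load() are the deserialization functions. loads() only parses a string it is given; load() takes an open file object, reads it, and parses the contents.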
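The relationship between the two pairs can be checked without touching the disk. The sketch below uses io.StringIO as a stand-in for an open text file (an assumption for illustration; any object with a write() or read() method works).

```python
import io
import json

data = {"url": "http://jd.com", "timeout": 20, "images": 0}

# dump() writes to a file-like object; dumps() returns the str directly.
buf = io.StringIO()
json.dump(data, buf)
assert buf.getvalue() == json.dumps(data)

# load() reads from a file-like object; loads() parses a str.
buf.seek(0)  # rewind before reading what was just written
from_file = json.load(buf)
from_str = json.loads(buf.getvalue())
assert from_file == from_str == data
```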
>>> data
{'url': 'http://jd.com', 'timeout': 20, 'images': 0}
>>> f = open('demo.json', 'w')
>>> json.dump(data, f)  # serialize the dict to JSON and write it to the file
>>> f.close()
$ cat demo.json
{"url": "http://jd.com", "timeout": 20, "images": 0}
>>> f2 = open('demo.json')
>>> json.load(f2)  # read the file and parse the JSON into a dict
{'url': 'http://jd.com', 'timeout': 20, 'images': 0}
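The session above closes the first file by hand and never closes f2. A with block closes the file even if an exception is raised. The following sketch writes into a temporary directory so that it does not overwrite an existing demo.json; the tempfile usage and the explicit UTF-8 encoding are additions for illustration.

```python
import json
import os
import tempfile

data = {"url": "http://jd.com", "timeout": 20, "images": 0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.json")

    # Serialize: the file is flushed and closed when the block exits.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)

    # Look at the raw text that was written.
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

    # Deserialize straight from the file object.
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

print(raw_text)
print(restored)
```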