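Python Notebook

Basic Principles of Web Crawlers

A crawler is an automated program that requests websites and extracts data from them.

The basic crawler workflow

Send a request: use an HTTP library to send a request to the target site
If the server responds, you get back a response
Parse the content
Save the data, either as text or in a database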

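#!/usr/bin/env python
# encoding: utf-8

import requests

# Send a GET request and inspect the response
response = requests.get('http://www.baidu.com')
print(response.headers)      # response headers
print(response.status_code)  # HTTP status code, e.g. 200
print(response.text)         # response body decoded as text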
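
The four steps above can be sketched end to end as follows. This is a minimal sketch, not a definitive implementation: the function names (fetch, parse_title, save_text, crawl) are made up for illustration, and extracting the page title stands in for whatever data a real crawler would want.

```python
import re
from pathlib import Path


def fetch(url):
    """Steps 1-2: send the request and return the response body as text."""
    import requests  # imported here so parsing and saving work without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    return response.text


def parse_title(html):
    """Step 3: extract the <title> text, or None if the page has none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def save_text(text, path):
    """Step 4: save the extracted data as a UTF-8 text file."""
    Path(path).write_text(text, encoding="utf-8")


def crawl(url, path):
    """Run the whole workflow for one page and return the extracted title."""
    title = parse_title(fetch(url))
    if title is not None:
        save_text(title, path)
    return title
```

Calling crawl('http://www.baidu.com', 'title.txt') would run all four steps for one page; the output file name is an arbitrary choice.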

What kinds of data can be crawled

Web page text
Images
Video
Other

#!/usr/bin/env python
# encoding: utf-8

import requests

# Download an image; response.content holds the raw bytes
response = requests.get('https://ss1.bdstatic.com/kvoZeXSm1A5BphGlnYG/skin_zoom/178.jpg?2')
with open('e:/aaa.jpg', 'wb') as f:  # the with block closes the file automatically
    f.write(response.content)

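What parsing methods are there

Direct handling (the page structure and the returned content are simple)
JSON parsing (the response is a JSON string)
Regular expressions
BeautifulSoup
PyQuery
XPath

How to handle JavaScript rendering, where the page the crawler fetches differs from what the browser shows, as in the example below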
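
A few of the parsing methods listed above can be tried on inline sample data. This is a rough sketch using only the standard library: the sample strings are invented, and html.parser is used here as a stand-in for BeautifulSoup, PyQuery or XPath, which need third-party packages.

```python
import json
import re
from html.parser import HTMLParser

# JSON parsing: the response body is a JSON string (sample data, made up)
body = '{"status": "ok", "items": [{"id": 1, "name": "first"}, {"id": 2, "name": "second"}]}'
data = json.loads(body)
names = [item["name"] for item in data["items"]]

# Regular expressions: fine for small, predictable fragments of HTML
html = '<ul><li><a href="/a">Page A</a></li><li><a href="/b">Page B</a></li></ul>'
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)


# A real HTML parser copes better with irregular markup than a regex does
class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


collector = LinkCollector()
collector.feed(html)

print(names)            # names taken from the JSON body
print(links)            # (href, text) pairs found by the regex
print(collector.hrefs)  # hrefs found by the HTML parser
```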

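#!/usr/bin/env python
# encoding: utf-8

import requests

# The HTML returned here is only the page skeleton;
# the visible content is filled in later by JavaScript
response = requests.get('https://m.weibo.cn/')
print(response.headers)
print(response.status_code)
print(response.text)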

1. Analyze the Ajax requests
2. Selenium/WebDriver

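from selenium import webdriver

# Drive a real browser so that JavaScript runs before the HTML is read
driver = webdriver.Chrome()
driver.get('https://m.weibo.cn/')
#driver.get('https://www.zhihu.com/')
print(driver.page_source)  # HTML after rendering
driver.quit()              # close the browser when done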

3. Splash

4. PyV8, Ghost.py
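
For the first approach in the list, analyzing Ajax requests, the idea is to find the request the page itself makes (for example in the browser's network panel) and call it directly, since it usually returns JSON. The sketch below is only an illustration: API_URL, the page parameter and the "cards"/"text" fields are all hypothetical, not a real site's API.

```python
import json

# Hypothetical endpoint, standing in for one found in the browser's network panel
API_URL = "https://example.com/api/feed"


def fetch_feed(page):
    """Call the Ajax endpoint directly and return the decoded JSON payload."""
    import requests  # imported here so extract_texts can be used without it

    response = requests.get(
        API_URL,
        params={"page": page},  # hypothetical paging parameter
        headers={"X-Requested-With": "XMLHttpRequest"},  # some sites check this header
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


def extract_texts(payload):
    """Pull the text of each card out of the (assumed) payload structure."""
    return [card["text"] for card in payload.get("cards", []) if "text" in card]


# Sample payload in the assumed shape, so the extraction can be tried offline
sample = json.loads('{"cards": [{"text": "hello"}, {"id": 3}, {"text": "world"}]}')
print(extract_texts(sample))
```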

How to save the data

Plain text
Relational databases
Non-relational databases
Binary files
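
Three of the storage options above can be sketched with the standard library alone. The records, table name and function names are invented for illustration, and SQLite stands in for any relational database. A non-relational store such as MongoDB is left out because it needs a running server.

```python
import json
import sqlite3
from pathlib import Path

# Sample records, made up for illustration
records = [{"id": 1, "title": "first page"}, {"id": 2, "title": "second page"}]


def save_as_text(rows, path):
    """Plain text: write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def save_to_sqlite(rows, db_path):
    """Relational database: insert the rows, replacing rows with the same id."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS pages (id INTEGER PRIMARY KEY, title TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO pages (id, title) VALUES (:id, :title)", rows
            )
    finally:
        conn.close()


def save_binary(content, path):
    """Binary file: write raw bytes, e.g. response.content of an image."""
    Path(path).write_bytes(content)
```

Using INSERT OR REPLACE keeps the table free of duplicates when the same page is crawled twice.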
