An alternative use of the requests library (stream)

Original

2018-09-01 22:01

Important
Starting today, my blog is gradually being migrated to
http://vearne.cc
Please follow it there.
Link to this article:
http://vearne.cc/archives/120

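Background: a colleague asked me to crawl a batch of URLs and extract the text inside a tag on each page. Setting aside the problem of target sites blocking the crawler, this is not a particularly difficult task. When I actually ran it, however, I found that the traffic was very high, exceeding 10Mb/s.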

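After some investigation, I found the cause: many of the URLs are actually download links that trigger a file download, and the responses for these URLs do not contain the tag at all. The handling logic is therefore clear: fetch the headers first, take out Content-Type, and check whether it is text/html. If it is not, there is no need to read the body of that Response.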
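The check described above could look roughly like the following. This is only a minimal sketch: the helper names `is_html` and `fetch_html_only`, the `session` parameter, and the 10-second timeout are assumptions made for illustration and are not part of the original script.

```python
def is_html(content_type):
    """Return True if a Content-Type header value denotes text/html."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and compare case-insensitively.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


def fetch_html_only(url, session=None, timeout=10):
    """Return the page text for HTML responses, or None for anything else.

    `session` is a hypothetical parameter: any object with a
    requests-compatible get() method, e.g. a requests.Session.
    """
    if session is None:
        import requests  # third-party; imported lazily so a stub session can be passed in
        session = requests
    # stream=True defers downloading the body until it is explicitly read.
    with session.get(url, stream=True, timeout=timeout) as resp:
        if not is_html(resp.headers.get("Content-Type")):
            # Leaving the with block closes the response; the body is never read.
            return None
        return resp.text
```

Calling `fetch_html_only(url)` on a download link should then return `None` after only the headers have been received. Note that the server's Content-Type is taken at face value here; a mislabelled response would be misclassified.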

I looked up the relevant part of the requests documentation:

By default, when you make a request, the body of the response is
downloaded immediately. You can override this behaviour and defer
downloading the response body until you access the Response.content
attribute with the stream parameter:

tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'
r = requests.get(tarball_url, stream=True)

At this point only the response headers have been downloaded and the
connection remains open, hence allowing us to make content retrieval
conditional:

if int(r.headers['content-length']) < TOO_LONG:
  content = r.content
  ...

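At this point only the headers have been downloaded and the data in the body has not, which avoids unnecessary traffic. The full body is downloaded only when you access r.content.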

You can further control the workflow by use of the
Response.iter_content() and Response.iter_lines() methods.
Alternatively, you can read the undecoded body from the underlying
urllib3 urllib3.HTTPResponse at Response.raw.

In fact, you can also use Response.iter_content(), Response.iter_lines(), or Response.raw (an attribute, not a method) to decide for yourself how much data to read.
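As an illustration of deciding how much to read, the sketch below pulls at most a fixed number of bytes from a streamed response using Response.iter_content(). The helper name `read_at_most` and its default chunk size are assumptions made for this example; `resp` is expected to be a response obtained with stream=True.

```python
def read_at_most(resp, max_bytes, chunk_size=8192):
    """Read up to max_bytes from a streamed response body and return them as bytes."""
    buf = bytearray()
    # iter_content() yields the body in chunks; without decode_unicode they are bytes.
    for chunk in resp.iter_content(chunk_size=chunk_size):
        buf.extend(chunk)
        if len(buf) >= max_bytes:
            break  # stop pulling further chunks once the limit is reached
    # The last chunk may overshoot the limit, so trim to exactly max_bytes.
    return bytes(buf[:max_bytes])
```

A possible use is `read_at_most(requests.get(url, stream=True), 64 * 1024)`, followed by closing the response so the rest of the body is not downloaded. This can be enough when the text of interest is expected near the top of the page.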


