01 Python Crawler --- Quickly Fetching Web Pages with Urllib

Original

2020-02-24 21:26

Environment: Python 3.5

import urllib.request  # import the module

I. Fetch the page content, then write it to a file

1. Fetch the page content
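file = urllib.request.urlopen("http://www.baidu.com")
dataline = file.readline()  # read the first line of the page; returns a bytes object
data = dataline + file.read()  # read the rest of the page (bytes) and prepend the first line, so data holds the whole page
print(dataline)
print(data)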
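
The response object behaves like a file that can only be read forward, so the order of read() and readline() matters: once read() has consumed the body, readline() has nothing left to return. The sketch below illustrates this. To stay runnable offline it assumes a small local file opened through a file:// URL in place of http://www.baidu.com; the file name demo.html and the helper read_in_order are made up for illustration.

```python
import pathlib
import tempfile
import urllib.request


def read_in_order(url):
    """Return (first_line, rest, after) read from url in that order."""
    with urllib.request.urlopen(url) as resp:
        first_line = resp.readline()  # bytes up to and including the first newline
        rest = resp.read()            # everything that remains in the stream
        after = resp.readline()       # stream is exhausted, so this is b""
    return first_line, rest, after


# Hypothetical local page so the sketch does not need network access
page = pathlib.Path(tempfile.mkdtemp()) / "demo.html"
page.write_bytes(b"<html>\n<body>hello</body>\n</html>\n")

first_line, rest, after = read_in_order(page.as_uri())
print(first_line)
print(rest)
print(after)
```

All three values are bytes objects; call .decode("utf-8") on them when text is needed.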


2. Save the fetched page locally
fhandle = open("/home/zyb/crawler/myweb/part4/1.html","wb")  # open 1.html in binary write mode
fhandle.write(data)  # write the fetched data to the file
fhandle.close()  # close the file
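
The same save step can also be written with a with block, which closes the file even if the write raises. This is only a sketch: the helper save_page, the sample bytes, and the temporary directory are assumptions standing in for the fetched data and the /home/zyb/... path used above.

```python
import os
import tempfile


def save_page(data, path):
    """Write the bytes in data to path and return the number of bytes written."""
    # "wb" stores the bytes exactly as received; the with block closes the file
    with open(path, "wb") as fhandle:
        return fhandle.write(data)


# Stand-in for the bytes returned by file.read()
data = "<html><body>你好</body></html>".encode("utf-8")
target = os.path.join(tempfile.mkdtemp(), "1.html")

written = save_page(data, target)
print(written, "bytes saved to", target)
```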

II. Write a web page directly to a local file

1. Use urllib.request.urlretrieve(url, filename=local file path)

filename = urllib.request.urlretrieve("http://edu.51cto.com",filename="/home/zyb/crawler/myweb/part4/2.html")
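
Note that urlretrieve returns a tuple of (local file path, headers) rather than just the path. A minimal sketch of unpacking it follows; it assumes a local source.html opened through a file:// URL as a stand-in for http://edu.51cto.com so that it runs offline, and the names work_dir, source, and target are made up.

```python
import os
import pathlib
import tempfile
import urllib.request

work_dir = pathlib.Path(tempfile.mkdtemp())
source = work_dir / "source.html"  # hypothetical stand-in for a remote page
source.write_bytes(b"<html><body>demo</body></html>")
target = work_dir / "2.html"

# urlretrieve copies the resource to `filename` and returns (path, headers)
local_path, headers = urllib.request.urlretrieve(source.as_uri(), filename=str(target))

print(local_path)
print(os.path.getsize(local_path), "bytes copied")
```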

2. urlretrieve can leave temporary files behind, so call urlcleanup() to clear them

urllib.request.urlcleanup()

III. Some common urllib usages

1. info() on the fetched page

info() returns the response headers of the fetched page, for example:

file.info() 
"""
Date: Mon, 18 Dec 2017 15:14:32 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: Close
Vary: Accept-Encoding
Set-Cookie: BAIDUID=D79E4E0D7CE07FA007BE2425E09D89BC:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=D79E4E0D7CE07FA007BE2425E09D89BC; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1513610072; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=0; path=/
Set-Cookie: H_PS_PSSID=145821079170012517720719; path=/; domain=.baidu.com
P3P: CP=" OTI DSP COR IVA OUR IND COM "
Cache-Control: private
Cxy_all: baidu+8158bfec4d91e0dabf81725c2629dc8e
Expires: Mon, 18 Dec 2017 15:14:03 GMT
X-Powered-By: HPHP
Server: BWS/1.1
X-UA-Compatible: IE=Edge,chrome=1
BDPAGETYPE: 1
BDQID: 0xbd42c81600029cd3
BDUSERID: 0
"""

2. getcode() on the fetched page

Returns the HTTP status code of the fetched page.

file.getcode()  

# 200

3. geturl() on the fetched page

Returns the URL of the fetched page.

file.geturl() 
# http://www.baidu.com
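
getcode() only returns when the request succeeds; urlopen raises urllib.error.HTTPError for 4xx/5xx responses and urllib.error.URLError when no response arrives at all. One possible way to wrap this is sketched below. The helper fetch_status and the missing.html path are hypothetical, and the demonstration uses a nonexistent local file so that it runs without network access.

```python
import pathlib
import tempfile
import urllib.error
import urllib.request


def fetch_status(url):
    """Return (status, final_url); status is None when no HTTP status is available."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            # geturl() reports the URL actually retrieved, which can differ
            # from the requested one after a redirect
            return resp.getcode(), resp.geturl()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the status is in err.code
        return err.code, url
    except urllib.error.URLError:
        # The request never produced a response (bad host, missing file, ...)
        return None, url


missing = pathlib.Path(tempfile.mkdtemp()) / "missing.html"
result = fetch_status(missing.as_uri())
print(result)
```

HTTPError is caught first because it is a subclass of URLError.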

4. urllib.request.quote()

Use this when a URL contains Chinese characters, or characters such as ":" or "&" that must be escaped, and needs to be encoded:

urllib.request.quote()

url = urllib.request.quote("http://www.sina.com.cn")

print(url)

# http%3A//www.sina.com.cn
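
As the output shows, quote() escapes ":" by default, which is rarely wanted for a whole URL. The sketch below shows the safe parameter and the more common pattern of encoding only the variable part. It imports quote from urllib.parse, where the function is defined; the example.com search URL and the keyword are made up for illustration.

```python
from urllib.parse import quote

# By default only "/" is left alone, so the ":" after the scheme is escaped
default_encoded = quote("http://www.sina.com.cn")
print(default_encoded)  # http%3A//www.sina.com.cn

# Passing safe=":/" keeps the URL structure intact
kept = quote("http://www.sina.com.cn", safe=":/")
print(kept)  # http://www.sina.com.cn

# Usually only the variable part is encoded, e.g. a query value
keyword = "a&b:c 爬蟲"  # made-up search term
search_url = "http://www.example.com/search?q=" + quote(keyword, safe="")
print(search_url)
```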

5. urllib.request.unquote()

Use this when an encoded URL needs to be decoded:

urllib.request.unquote()

url = urllib.request.unquote("http%3A//www.sina.com.cn")

print(url)

# http://www.sina.com.cn
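
unquote() reverses quote(), so encoding and then decoding a value should give back the original text. A minimal sketch of that round trip, using urllib.parse (where both functions are defined) and a made-up value:

```python
from urllib.parse import quote, unquote

decoded = unquote("http%3A//www.sina.com.cn")
print(decoded)  # http://www.sina.com.cn

# Round trip with a made-up value containing reserved and non-ASCII characters
original = "a&b:c 爬蟲"
restored = unquote(quote(original, safe=""))
print(restored == original)  # True

# Text without any %XX sequences passes through unchanged
unchanged = unquote("http://www.sina.com.cn")
print(unchanged)  # http://www.sina.com.cn
```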
