Python Web Scraping with urllib (Part 1)

Original

ydw_ydw

2018-08-31 13:30

What this section covers

Modules included

Solving web page encoding problems

The object returned by urlopen (rsp in the examples)

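Modules included

urllib.request: opens and reads URLs
urllib.error: contains the common exceptions raised by urllib.request; catch them with try/except
urllib.parse: contains functions for parsing URLs
urllib.robotparser: parses robots.txt files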
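
Two of these modules can be tried without any network access. The following is a minimal sketch: it splits the job-listing URL used in the first example with urllib.parse, then hands a made-up robots.txt to urllib.robotparser. The two robots.txt rules and the example.com addresses are assumptions for illustration only.

```python
from urllib.parse import urlparse, parse_qs
from urllib.robotparser import RobotFileParser

url = ("http://jobs.zhaopin.com/195435110251173.htm"
       "?ssidkey=y&ss=409&ff=03&sg=2644e782b8b143419956320b22910c91&so=1")

# urllib.parse: split the URL into its components
parts = urlparse(url)
print(parts.scheme)   # http
print(parts.netloc)   # jobs.zhaopin.com
print(parts.path)     # /195435110251173.htm

# parse_qs turns the query string into a dict mapping each key to a list
query = parse_qs(parts.query)
print(query["ss"])    # ['409']

# urllib.robotparser: parse() accepts the lines of a robots.txt directly
# (hypothetical rules, not taken from any real site)
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
print(robots.can_fetch("*", "http://example.com/private/page.html"))  # False
print(robots.can_fetch("*", "http://example.com/index.html"))         # True
```

For a real site you would normally call set_url() and read() on the parser instead of passing the rules by hand.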
Example 1

'''
Request a web page with urllib.request and print its content
'''
from urllib import request


if __name__ == '__main__':

    url = "http://jobs.zhaopin.com/195435110251173.htm?ssidkey=y&ss=409&ff=03&sg=2644e782b8b143419956320b22910c91&so=1"
    # Open the URL; the response comes back as the return value
    rsp = request.urlopen(url)

    # Read the body of the response
    # The content that comes back is of type bytes
    html = rsp.read()
    print(type(html))

    # bytes must be decoded to turn them into a str
    html = html.decode("utf-8")

    print(html)
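
The example above assumes the request succeeds. As a rough sketch of the try/except pattern that urllib.error is meant for, the hypothetical helper fetch() below returns None instead of crashing when something goes wrong. The helper name, the 10-second timeout, and the data: and nosuchscheme: URLs (stand-ins so the demo runs offline) are all assumptions, not part of the original example.

```python
from urllib import request, error


def fetch(url, timeout=10):
    """Return the body of url as bytes, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as rsp:
            return rsp.read()
    except error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it has to be caught first
        print("HTTP error:", e.code, e.reason)
    except error.URLError as e:
        # Covers failures such as an unreachable host or an unknown scheme
        print("URL error:", e.reason)
    return None


if __name__ == "__main__":
    # A data: URL is handled by urllib itself, so no network is needed
    print(fetch("data:text/plain,hello"))
    # An unsupported scheme makes urlopen raise URLError
    print(fetch("nosuchscheme://example"))
```

Swapping in a real http:// address works the same way; a 404 or 500 response would land in the HTTPError branch.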

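Solving web page encoding problems

chardet can automatically detect the encoding of a page, but its guess may be wrong.
It has to be installed first. pip is bundled with Python by default from 2.7.9 and 3.4 onward; on Windows, pip.exe sits in the Scripts folder of the Python installation. Open a command prompt, change into that folder, and run pip.exe install chardet (or run python -m pip install chardet from any directory).
Related reading: Two ways to get a web page's encoding in Python (requests.get and chardet)
Example 2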

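'''
Download a page with urllib.request
and detect its encoding automatically
'''

# "import urllib" alone does not guarantee that the request submodule is loaded
import urllib.request
import chardet

if __name__ == '__main__':
    url = 'http://stock.eastmoney.com/news/1407,20170807763593890.html'

    rsp = urllib.request.urlopen(url)

    html = rsp.read()

    # 1. Let chardet guess the encoding; detect() returns a dict
    cs = chardet.detect(html)
    print(type(cs))
    print(cs)

    # 2. Fall back to utf-8 when the "encoding" entry is missing or None
    html = html.decode(cs.get("encoding") or "utf-8")
    print(html)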
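
Because a detected encoding is only a guess, one possible alternative is to look first at the charset the server declares in its Content-Type header. The sketch below uses only the standard library; the helper name decode_body, the utf-8 default, and the data: URL (a stand-in for a real page so that the demo runs offline) are assumptions for illustration.

```python
from urllib import request


def decode_body(rsp, default="utf-8"):
    """Decode a response body, preferring the charset declared in its headers."""
    body = rsp.read()
    # rsp.headers is an email.message.Message-style object;
    # get_content_charset() returns None when no charset is declared
    charset = rsp.headers.get_content_charset() or default
    # errors="replace" keeps going if some bytes do not match the charset
    return body.decode(charset, errors="replace")


if __name__ == "__main__":
    # The percent-encoded bytes are the GBK encoding of a two-character greeting
    with request.urlopen("data:text/plain;charset=gbk,%C4%E3%BA%C3") as rsp:
        print(decode_body(rsp))
```

Many pages declare their encoding only in an HTML meta tag, or not at all, so this approach complements chardet rather than replacing it.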

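The object returned by urlopen (rsp in the examples)

Methods available on the returned object
- geturl: returns the URL of the resource that was retrieved
- info: returns the meta-information (headers) of the response
- getcode: returns the HTTP status code

Example 3 (alternatively, set a breakpoint on the print(type(rsp)) line and inspect rsp in the IDE's debug console to see the URL and other details)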

# "import urllib" alone does not guarantee that the request submodule is loaded
import urllib.request

if __name__ == '__main__':
    url = 'http://stock.eastmoney.com/news/1407,20170807763593890.html'

    rsp = urllib.request.urlopen(url)

    print(type(rsp))
    print(rsp)

    print("URL: {0}".format(rsp.geturl()))
    print("Info: {0}".format(rsp.info()))
    print("Code: {0}".format(rsp.getcode()))

    html = rsp.read()

    # decode() with no argument assumes utf-8
    html = html.decode()
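
info() returns a message object rather than a plain string, so individual headers can be looked up instead of printing the whole block. Here is a small sketch along those lines; the helper name describe and the data: URL (used so the demo runs without a network) are assumptions. Recent Python documentation also points to the url, headers, and status attributes as replacements for these three methods.

```python
from urllib import request


def describe(rsp):
    """Collect a few pieces of metadata from a urlopen() return object."""
    headers = rsp.info()  # an email.message.Message-style object
    return {
        "url": rsp.geturl(),
        "code": rsp.getcode(),  # None for the data: URL used below
        "content_type": headers.get_content_type(),
        "charset": headers.get_content_charset(),
        # Header lookup is case-insensitive
        "length": headers.get("Content-Length"),
    }


if __name__ == "__main__":
    with request.urlopen("data:text/plain;charset=utf-8,hello") as rsp:
        for key, value in describe(rsp).items():
            print(key, "=", value)
```

Against a real http:// address, code would be the HTTP status such as 200.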
