Scraping Qiushibaike with Python

Original

qq_42603652

2018-08-31 16:05

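The previous article gave a brief introduction to basic regular-expression syntax. This article works through an example: using regular expressions to scrape Qiushibaike.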

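1. Import the modules

urlopen on its own is fairly basic, so to set a proxy IP we also import ProxyHandler and build_opener from urllib.request. Proxy IPs can be looked up on Xici Proxy (西祠代理).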

import random
import re
from urllib.request import Request, urlopen, build_opener, ProxyHandler
base_url = 'https://www.qiushibaike.com/hot/page/'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

ip_list = [
    '220.249.185.178:9999',
    '124.193.85.88:8080',
    '116.62.194.248:3128',
    '112.115.57.20:3128',
    '171.37.143.73:9797'
]
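# ProxyHandler keys are scheme names without a colon.
# base_url uses https, so the same proxy is registered for both schemes.
proxy = random.choice(ip_list)
proxies = {
    'http': proxy,
    'https': proxy
}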

2. Set the proxy IP and fetch the page content

def down_load_qiubai_info(pageIndex):
    full_url = base_url + str(pageIndex) + '/'
    # print(full_url)
    # Pass headers so the request carries a browser-like User-Agent.
    # With only full_url, the server can tell that the page is being fetched
    # by a program rather than a person, because urllib's default is
    # User-Agent: Python-urllib/3.6
    request = Request(full_url, headers=headers)
    # Set the proxy IP
    proxies_Handler = ProxyHandler(proxies)
    opener = build_opener(proxies_Handler)
    response = opener.open(request)
    # Read the full source of the page
    code = response.read().decode()
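
As a rough, offline sketch of the proxy setup above, the helper below builds an opener for one randomly chosen proxy and then inspects the User-Agent values without sending any request. The name make_opener, the PROXY_POOL list, and its placeholder addresses are assumptions made for illustration; they are not part of the original script.

```python
import random
from urllib.request import ProxyHandler, Request, build_opener

# Placeholder addresses for illustration only (not working proxies).
PROXY_POOL = ['192.0.2.10:8080', '192.0.2.11:3128']


def make_opener(pool):
    """Return (opener, proxy) for one randomly chosen proxy from pool."""
    proxy = random.choice(pool)
    # ProxyHandler keys are URL schemes without a colon: 'http', 'https'.
    handler = ProxyHandler({'http': proxy, 'https': proxy})
    return build_opener(handler), proxy


opener, chosen = make_opener(PROXY_POOL)

# An opener announces itself as Python-urllib/<version> unless overridden.
default_agent = dict(opener.addheaders)['User-agent']

# Headers passed to Request take precedence over that default.
request = Request('https://www.qiushibaike.com/hot/page/1/',
                  headers={'User-Agent': 'Mozilla/5.0'})
custom_agent = request.get_header('User-agent')

print(chosen, default_agent, custom_agent)
```

Calling make_opener once per page, rather than choosing a proxy once at import time, would spread requests across the pool; whether that is worthwhile depends on how reliable the proxies are.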

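3. Extract the Qiushibaike content from the page source

Right-click the page and choose Inspect to view its source. The menu differs between browsers; I am using Google Chrome here.

Note: (.*?) marks the content to capture. With re.S, . also matches newlines, and the non-greedy .*? skips ahead only as far as the next literal piece of the pattern, so you only need to write the opening tag that marks where each fragment starts; you do not need to close the enclosing containers. If you want the content inside a specific pair of tags, however, write the pair out in full and put () around the part you want, for example: <h2>(.*?)</h2>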
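
To make the note above concrete, here is a small sketch comparing the non-greedy (.*?) with the greedy (.*), and showing what happens without re.S. The html string is a made-up snippet, not real page source.

```python
import re

# Hypothetical snippet standing in for real page source.
html = '<h2>\nAlice\n</h2><p>x</p><h2>\nBob\n</h2>'

# Non-greedy: each match stops at the nearest </h2>.
lazy = re.findall(r'<h2>(.*?)</h2>', html, re.S)

# Greedy: a single match runs all the way to the last </h2>.
greedy = re.findall(r'<h2>(.*)</h2>', html, re.S)

# Without re.S, '.' cannot cross the newlines, so nothing matches.
no_dotall = re.findall(r'<h2>(.*?)</h2>', html)

# Remove the surrounding newlines, as the article's code does.
names = [name.strip('\n') for name in lazy]
print(names)
```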

    # (continues the body of down_load_qiubai_info)
    # Use a regex to capture the name, age, link, content, vote count and comment count
    # Write the regex according to where each item sits in the page source
    pattern = re.compile(r'<div class="author clearfix">.*?<h2>(.*?)</h2>.*?<div class="articleGender.*?Icon">(.*?)</div>.*?<a.*?href="(.*?)".*?>.*?<div class="content">.*?<span>(.*?)</span>.*?<div class="stats">.*?<i class="number">(.*?)</i>.*?<span class="stats-comments">.*?<i class="number">(.*?)</i>',re.S)
    # Find everything in the source that matches the regex
    result = pattern.findall(code)
    # print(result)
    for name,age,href,content,stats,comment in result:
        # strip removes the surrounding newlines
        name = name.strip('\n')
        age = age.strip('\n')
        content = content.strip('\n')
        href = href.strip('\n')
        stats = stats.strip('\n')
        comment = comment.strip('\n')
        # print(name)
        # print(age)
        # print(content)
        print(href)
        # print(stats)
        # print(comment)
        if int(comment) != 0:
            get_all_comment_with(href)
        else:
            print('This post has no comments yet')
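
The loop above relies on findall returning one tuple per match when the pattern has several groups. The sketch below shows that behaviour offline, using a shortened four-group version of the pattern. The sample markup is hypothetical and heavily simplified, so treat this as an illustration of the technique rather than a pattern guaranteed to match the live site.

```python
import re

# Hypothetical, heavily simplified markup; the real page is more complex.
sample = '''
<div class="author clearfix">
<h2>
user_a
</h2>
</div>
<a href="/article/1">
<div class="content">
<span>
first post
</span>
</div>
</a>
<span class="stats-comments"><i class="number">3</i></span>
<div class="author clearfix">
<h2>
user_b
</h2>
</div>
<a href="/article/2">
<div class="content">
<span>
second post
</span>
</div>
</a>
<span class="stats-comments"><i class="number">0</i></span>
'''

# Adjacent raw strings are concatenated, which keeps a long pattern readable.
pattern = re.compile(
    r'<div class="author clearfix">.*?<h2>(.*?)</h2>'
    r'.*?<a.*?href="(.*?)".*?>'
    r'.*?<div class="content">.*?<span>(.*?)</span>'
    r'.*?<span class="stats-comments">.*?<i class="number">(.*?)</i>',
    re.S,
)

posts = []
# With four groups, findall yields one 4-tuple per matched post.
for name, href, content, comments in pattern.findall(sample):
    posts.append({
        'name': name.strip(),        # strip() drops all surrounding whitespace
        'href': href.strip(),
        'content': content.strip(),
        'comments': int(comments),
    })

# Only posts with at least one comment need a second request.
with_comments = [p['href'] for p in posts if p['comments'] != 0]
print(with_comments)
```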

4. Get the commenters' profile pages

def get_all_comment_with(url):
    # Build the full URL of the post's detail page
    detail_url = 'https://www.qiushibaike.com' + url
    # print(detail_url)
    # Fetch the page source
    request = Request(detail_url, headers=headers)
    response = urlopen(request)
    code = response.read().decode()
    # Write the regex according to where the link sits in the source
    pattern = re.compile(r'<div class="avatars">.<a href="(.*?)".*?>',re.S)
    # Find everything in the source that matches the regex
    result = pattern.findall(code)
    # print(result)
    for x in result:
        x = x.strip('\n')
        # print(x)
        # Build the full URL of the commenter's profile page
        url_list = 'https://www.qiushibaike.com' + x
        print(url_list)
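
The function above builds URLs by plain string concatenation, which works as long as every captured href starts with a slash. As an alternative (not what the original script uses), urllib.parse.urljoin also copes with relative and already-absolute links. The hrefs below are made-up examples.

```python
from urllib.parse import urljoin

BASE = 'https://www.qiushibaike.com'

# Hypothetical hrefs of the kinds a page might contain.
hrefs = [
    '/users/123/',                              # root-relative
    'users/456/',                               # missing leading slash
    'https://www.qiushibaike.com/users/789/',   # already absolute
]

# urljoin resolves each href against the base URL.
profile_urls = [urljoin(BASE + '/', h) for h in hrefs]

# Plain concatenation breaks on the second kind of href.
concatenated = BASE + hrefs[1]

for url in profile_urls:
    print(url)
```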

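Those are the steps for scraping Qiushibaike with regular expressions. When writing a regex, always base it on the actual page source: wrap the content you want to capture in (), and stand in for everything you do not need with .*? or other metacharacters.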
