[Python3 Web Scraping] 4 - Using Regular Expressions to Scrape Qiushibaike

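A hands-on exercise to reinforce regular expressions: the script below scrapes the text posts on Qiushibaike page by page and appends them to duanzi.txt.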

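import re

import requests
from fake_useragent import UserAgent

# Paginated listing of text posts; {} is filled with the page number.
url = 'https://www.qiushibaike.com/text/page/{}/'
headers = {
    # Send a Chrome-like User-Agent so the request looks like a browser.
    'User-Agent': UserAgent().chrome
}


def get_data(page):
    print("Scraping page {}".format(page))
    response = requests.get(url.format(page), headers=headers)
    html = response.text
    # Capture the text inside <div class="content"><span>...</span>.
    # "." does not match newlines, so each match is a single line of text.
    infos = re.findall(r'<div class="content">\s*<span>\s*(.+)\s*</span>', html)
    with open('duanzi.txt', 'a+', encoding='utf-8') as f:
        for info in infos:
            # str.replace() treats "\s" as literal text, so use re.sub()
            # to actually strip whitespace characters.
            info = re.sub(r'\s', '', info)
            f.write(info + "\n\n")


# Scrape pages 1 through 13.
for page in range(1, 14):
    get_data(page)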
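
To see what the pattern captures without hitting the network, the extraction step can be tried on a small inline string. This is only a sketch: SAMPLE_HTML is made-up markup shaped to fit the pattern (the real page may look different), and extract_jokes and strip_whitespace are hypothetical helper names, not part of the script above.

```python
import re

# Hypothetical sample markup shaped to fit the pattern used in get_data();
# the structure of the real page is an assumption here.
SAMPLE_HTML = """
<div class="content">
<span>

First joke, with  extra   spaces.

</span>
</div>
<div class="content">
<span>
Second joke.
</span>
</div>
"""

# Same pattern as in get_data(). Without re.S, "." stops at a newline,
# so the group captures one line of text per post.
PATTERN = re.compile(r'<div class="content">\s*<span>\s*(.+)\s*</span>')


def extract_jokes(html):
    """Return the captured text of every post, trimmed at both ends."""
    return [match.strip() for match in PATTERN.findall(html)]


def strip_whitespace(text):
    """Remove every whitespace character using a regular expression."""
    return re.sub(r'\s', '', text)


if __name__ == '__main__':
    for joke in extract_jokes(SAMPLE_HTML):
        print(joke)

    # str.replace() looks for the two literal characters "\" and "s",
    # so it leaves ordinary whitespace untouched.
    print(repr("a b\tc".replace(r"\s", "")))
    print(repr(strip_whitespace("a b\tc")))
```

Because "." does not cross line breaks, a post whose text spans several lines would not match this pattern; handling that case would need a non-greedy group together with the re.S flag, which is a change to consider rather than something the original script does.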
