Python Web Scraping: Crawling a Novel

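My school offered a data mining course this semester, so I spent a few days getting started with Python. The instructor does not plan to cover web scraping, but I have always been curious about it, so I spent two days learning the basics and put together a small demo.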

Target site:

http://www.paoshu8.com/0_7

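The target site has no robots.txt, so it does not restrict which content may be crawled and we can go ahead. Still, do not crawl too fast, or you will put a heavy load on someone else's server.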
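As a rough illustration of the two points above, the sketch below checks a URL against robots rules with the standard library's `urllib.robotparser` and spaces out requests with a fixed delay. The helper names `allowed` and `throttled` and the contents of `SAMPLE_RULES` are made up for this example; the rules are parsed from a local list so nothing is downloaded.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from memory instead of the network
    return parser.can_fetch(user_agent, url)


def throttled(items, delay_seconds=1.0):
    """Yield items one by one, sleeping between consecutive items."""
    for index, item in enumerate(items):
        if index:  # no need to wait before the very first item
            time.sleep(delay_seconds)
        yield item


# Hypothetical rules, only to show how a restriction would be reported.
SAMPLE_RULES = ["User-agent: *", "Disallow: /private/"]

print(allowed(SAMPLE_RULES, "http://www.paoshu8.com/0_7"))
print(allowed(SAMPLE_RULES, "http://www.paoshu8.com/private/x"))
```

In a real crawl, the chapter loop could iterate over `throttled(chapters, delay_seconds=1.0)` so that requests are at least a second apart; the right delay is a judgment call.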

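Analysis:
1. First, request the target site with the get method of the requests library. The response body contains the page's HTML source. (If the request fails, try adding a User-Agent header.)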

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    }
    response = requests.get(url, headers=headers)

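2. Inspecting the page source shows that the site puts every chapter title and link in an a tag nested inside a dd tag, so a regular expression can match all chapter titles and links and collect them into a list. (There are many other ways to do this, such as beautifulsoup, xpath or re; use whichever you know best.)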

# Regular expression that matches every chapter link and chapter title
<dd>.*?href="(.{14,19}?)">(第.*?)</a></dd>

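Testing the expression: the number of matches equals the number of chapters, and a quick look through the console output showed no obvious problems, so we can move on to the next step.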
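The chapter pattern can also be tried offline on a small hand-written snippet. The HTML in `sample_html` below is invented for illustration and only mimics the dd/a structure described above; the real page may differ in detail.

```python
import re

# The chapter pattern from this article; "第" is the character that starts
# chapter titles such as "第一章", so it stays in the pattern as-is.
CHAPTER_RE = re.compile(r'<dd>.*?href="(.{14,19}?)">(第.*?)</a></dd>')

# Hypothetical sample of the table-of-contents markup.
sample_html = """<dl>
<dd><a href="/0_7/1000001.html">序章</a></dd>
<dd><a href="/0_7/1000002.html">第一章 開端</a></dd>
<dd><a href="/0_7/1000003.html">第二章 啓程</a></dd>
</dl>"""

# findall returns a list of (link, title) tuples, one per matching chapter.
chapters = CHAPTER_RE.findall(sample_html)
for link, title in chapters:
    print(link, title)
```

The first entry is skipped because its title does not start with "第", which is one of the limitations of matching on the title text.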

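3. With all chapter titles and links collected, the next step is to fetch the content of a single chapter. Request the chapter with the get method of the requests library again, then process the returned HTML source; a regular expression can once more extract the content. All of the novel's 1000-plus chapters can be handled the same way, so we can write a loop that uses the chapter links from step 2 as request URLs. Because those links are not complete URLs, http://www.paoshu8.com must be prepended to form the full URL. The loop fetches the chapters one by one; to speed things up, the multiprocessing module can run requests in parallel (note that it uses multiple processes, not threads).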
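The URL joining and the parallel loop might look roughly like the sketch below. It swaps in `concurrent.futures.ThreadPoolExecutor` (threads, which suit network-bound work) instead of multiprocessing, and `fetch_stub` is a placeholder that only formats a string, so the example runs without touching the network. The names `build_chapter_urls`, `fetch_stub` and the sample `chapters` list are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

BASE_URL = "http://www.paoshu8.com"


def build_chapter_urls(chapters, base_url=BASE_URL):
    """Turn (relative_link, title) tuples into (absolute_url, title) tuples."""
    return [(urljoin(base_url, link), title) for link, title in chapters]


def fetch_stub(job):
    """Stand-in for a real download; a real version would request the URL."""
    url, title = job
    return f"{title} <- {url}"


# Hypothetical output of the chapter-matching step.
chapters = [("/0_7/1000002.html", "Chapter 1"), ("/0_7/1000003.html", "Chapter 2")]
jobs = build_chapter_urls(chapters)

# pool.map runs the jobs on worker threads and returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_stub, jobs))

for line in results:
    print(line)
```

Keeping `max_workers` small is one way to stay gentle on the server while still overlapping the waiting time of several requests.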

# Regular expression that matches the chapter content
.*?id="content">(.*?)</div>

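Testing the expression shows that the extracted content still contains p tags, which need to be stripped.

# Regular expression that strips the p tags by capturing the text inside each one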
<p>(.*?)</p>
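The two expressions can be combined and tried on a hand-written sample, as in the sketch below. The `sample` markup is invented, the helper name `extract_paragraphs` is hypothetical, and the `re.S` flag is an addition not used in this article: it lets `.` match newlines in case the content spans several lines.

```python
import re

# Content pattern and paragraph pattern from this article, compiled with
# re.S (an assumption) so that "." also matches newline characters.
CONTENT_RE = re.compile(r'id="content">(.*?)</div>', re.S)
PARAGRAPH_RE = re.compile(r"<p>(.*?)</p>", re.S)


def extract_paragraphs(chapter_html):
    """Return the text of each <p> inside the content div, or [] if absent."""
    match = CONTENT_RE.search(chapter_html)
    if match is None:  # page layout differs or the request returned an error page
        return []
    return PARAGRAPH_RE.findall(match.group(1))


# Hypothetical chapter markup.
sample = """<div id="content">
<p>First paragraph.</p>
<p>Second paragraph.</p>
</div>"""

print("\n".join(extract_paragraphs(sample)))
```

Returning an empty list instead of calling `.group(1)` on a failed search avoids an AttributeError when a page does not have the expected structure.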

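However, after stripping the p tags and printing both the chapter content and the list, \u3000 (the Unicode full-width space) appears in the printed list, even though the content itself displays normally. This happens because printing a list shows the escaped representation of each string rather than the characters themselves. If you want to clean it up, see the article "Removing \ufeff, \xa0 and \u3000 in Python". (I did not handle it here, and the saved files still display correctly.)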
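For readers who do want to normalize these characters, one possible approach is a translation table, sketched below. The helper name `clean_text` and the choice of replacing \xa0 and \u3000 with an ordinary space (and dropping \ufeff) are assumptions; other replacements may suit other texts better.

```python
# Map each unwanted character to its replacement; None deletes the character.
CLEANUP_TABLE = str.maketrans({
    "\ufeff": None,  # byte order mark
    "\xa0": " ",     # no-break space
    "\u3000": " ",   # full-width (ideographic) space
})


def clean_text(text):
    """Replace or remove the special whitespace characters listed above."""
    return text.translate(CLEANUP_TABLE)


raw = "\ufeff\u3000\u3000Hello\xa0world"

# Printing a list shows each string's repr, which is why \u3000 appears escaped.
print([raw])
print(clean_text(raw))
```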

4. The last step is to write the novel to files chapter by chapter. This can be done at the same time as each chapter's content is processed.
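A minimal sketch of the saving step, assuming one text file per chapter, is shown below. The helpers `safe_filename` and `save_chapter` are hypothetical; replacing characters that Windows forbids in file names is an extra precaution not taken in this article, and the example writes into a temporary directory rather than a fixed drive.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(title):
    """Replace characters that are not allowed in Windows file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()


def save_chapter(directory, title, content):
    """Write content to <directory>/<title>.txt and return the file path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    target = directory / f"{safe_filename(title)}.txt"
    target.write_text(content, encoding="utf-8")  # overwrites an existing file
    return target


# Write one made-up chapter into a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_chapter(Path(tmp) / "book", "Chapter 1: What?", "line one\nline two")
    print(saved.name)
    print(saved.read_text(encoding="utf-8"))
```

Unlike opening the file in append mode, `write_text` replaces any earlier copy of the chapter, so running the crawler twice does not duplicate content.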

Code:

import os
import requests
import re


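# Request the table of contents
def request_all_chapter(url):
    """Request the page at url that lists all chapters; returns None on failure."""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
        }
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:  # status code 200 means the request succeeded
            # response.encoding = response.apparent_encoding  # uncomment if the text is garbled
            print('Table of contents fetched successfully...')
            return response.text
        print('Failed to fetch the table of contents, status code:', response.status_code)
    except requests.RequestException:
        print("Failed to fetch the table of contents...")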


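# Parse the requested html and extract all chapters
def get_all_chapter(html):
    """Extract the book title and the full chapter list."""
    re_bookname = r'.*?book_name"\s+content="(.*?)"/>'  # extracts the book title
    re_chapters = r'<dd>.*?href="(.{14,19}?)">(第.*?)</a></dd>'  # extracts every chapter link and title
    global book_name  # module-level variable, read later by the main block
    book_name = re.search(re_bookname, html).group(1)
    # findall returns a list of tuples; each tuple holds two items: the link and the chapter title
    book_chapters = re.findall(re_chapters, html)
    return book_chapters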


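# Request the content of one chapter
def request_one_chaptercontent(url, chapter_url, chapter):
    """
    :param url: first part of the chapter url (scheme and host)
    :param chapter_url: second part of the chapter url (path)
    :param chapter: title of the chapter being requested
    :return: HTML source of the chapter page, or None on failure
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
        }
        response = requests.get(url+chapter_url, headers=headers, timeout=10)
        if response.status_code == 200:
            print(chapter, "fetched successfully...")
            return response.text
        print(chapter, "failed to fetch, status code:", response.status_code)
    except requests.RequestException:
        print(chapter, "failed to fetch...")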


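# Parse one chapter's html, extract and process the content, then write it to a file
def get_one_chaptercontent(chapter_html, chapter, path):
    """
    Parse the returned HTML source, extract and process the novel text, then write it to a file.
    :param chapter_html: HTML source of the chapter page
    :param chapter: title of the chapter
    :param path: path of the file to save to
    """
    reg1 = r'.*?id="content">(.*?)</div>'  # regular expression that extracts the content
    reg2 = r'<p>(.*?)</p>'  # regular expression that strips the p tags
    content = re.search(reg1, chapter_html).group(1)  # extracted content (a string)
    content = re.findall(reg2, content)  # content without p tags (a list of paragraphs)
    # Join the list into one string, with a newline between paragraphs
    content = '\n'.join(content)
    # Write it to the file
    write_to_file(content, chapter, path)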


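# Write the novel text to a file
def write_to_file(content, chapter, path):
    """
    :param content: text of the chapter
    :param chapter: chapter title
    :param path: path of the file to save to
    """
    try:
        with open(path, 'a', encoding='utf-8') as f:  # 'a' appends if the file already exists
            f.write(content)
            print(chapter, 'saved successfully...')
    except OSError:
        print(chapter, "failed to write")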


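if __name__ == '__main__':
    # Fetch the novel's table of contents
    url = 'http://www.paoshu8.com/0_7'
    html = request_all_chapter(url)
    if html is None:  # the request function returns None on failure
        raise SystemExit('Could not fetch the table of contents')
    chapters = get_all_chapter(html)
    # Create a folder named after the book to hold all chapters
    save_path = 'D:/'+book_name  # Windows path; adjust for other systems
    if not os.path.exists(save_path):  # create the folder if it does not exist
        os.mkdir(save_path)
    # Crawl the first 10 chapters
    for chapter_url, chapter in chapters[:10]:
        chapter_html = request_one_chaptercontent('http://www.paoshu8.com', chapter_url, chapter)
        if chapter_html is None:  # skip chapters whose request failed
            continue
        path = save_path+'/'+chapter+'.txt'  # path of the file for this chapter
        get_one_chaptercontent(chapter_html, chapter, path)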

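That is the whole process for this simple crawler demo. Fun, isn't it? Regular expressions have their shortcomings for parsing HTML, but this is only for fun, so go and give it a try.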
