Downloading a Novel with Python

Original

MA木易YA

2019-08-22 17:19

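I happen to be reading this novel at the moment. The sites that host it are cluttered with ads, and many of them have missing or truncated chapters, so the most practical option is to download it and read it on a computer or phone.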

1. Index page

Yes, it is the urban novel 《道德天書》. The chapter links are easy to extract from the index page.
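As a rough illustration of pulling chapter links out of an index page, here is a minimal sketch that uses only the standard library instead of lxml. The sample markup, the chapter file names (`1.html`, `2.html`, `900.html`) and the names `ChapterLinkParser` and `chapter_urls` are assumptions made for the example. The idea of taking the second `ListChapter` block comes from the XPath in the script in this post, and the parser only approximates that XPath by counting blocks in document order.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical index-page fragment; the real markup may differ.
SAMPLE_INDEX = """
<div class="ListChapter"><ul>
  <li><a href="/b/73/73530/900.html">Latest chapter</a></li>
</ul></div>
<div class="ListChapter"><ul>
  <li><a href="/b/73/73530/1.html">Chapter 1</a></li>
  <li><a href="/b/73/73530/2.html">Chapter 2</a></li>
</ul></div>
"""


class ChapterLinkParser(HTMLParser):
    """Collect hrefs found inside the n-th <div class="ListChapter">."""

    def __init__(self, block_index=2):
        super().__init__()
        self.block_index = block_index  # 1-based, like [2] in XPath
        self.blocks_seen = 0
        self.depth = 0  # div nesting depth while inside the target block
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif attrs.get("class") == "ListChapter":
                self.blocks_seen += 1
                if self.blocks_seen == self.block_index:
                    self.depth = 1
        elif tag == "a" and self.depth > 0 and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def chapter_urls(index_html, base_url):
    """Return absolute chapter URLs in the order they appear on the page."""
    parser = ChapterLinkParser()
    parser.feed(index_html)
    return [urljoin(base_url, href) for href in parser.links]


for link in chapter_urls(SAMPLE_INDEX, "https://www.rzlib.net/b/73/73530/"):
    print(link)
```

`urljoin` resolves the relative `href` values against the index URL, which avoids concatenating the host name by hand.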

2. Content page

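The content page hides a small trap. It looks simple, but if you print the HTML that comes back, the text is incomplete: the page itself contains only part of the chapter, and the rest has to be found by capturing the network requests. Print the page content yourself if you want to confirm it.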

3. Capturing the request

The pattern here is fairly simple, and the request that carries the chapter text is easy to find.

Select that request, look at its headers, and open the real URL shown there to see the full content.

4. Fetching the data

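Now that every data source is known, we can start building. First, walk the index page to collect the link of every chapter page, then get the title and content for each chapter. The title comes from the chapter page itself, while the content has to come from the captured request. On closer inspection, the URL of that request is a fixed prefix under the site's root URL (https://www.rzlib.net/b/txtt5552/ in the script) followed by the last two segments of the chapter link, so it is easy to assemble. After that, it is just a matter of fetching and writing.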
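The URL rule described above can be sketched in isolation. The prefix is copied from the script in this post and may change on the live site. The chapter file name `12345.html` and the names `CONTENT_PREFIX` and `build_content_url` are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit

# Prefix taken from the script in this post; it may differ on the live site.
CONTENT_PREFIX = "https://www.rzlib.net/b/txtt5552/"


def build_content_url(chapter_url, prefix=CONTENT_PREFIX):
    """Map a chapter page URL to the URL that serves the full chapter text."""
    # Work on the path only, so a query string or fragment cannot leak in.
    segments = [part for part in urlsplit(chapter_url).path.split("/") if part]
    if len(segments) < 2:
        raise ValueError("unexpected chapter URL: " + chapter_url)
    book_id, page = segments[-2], segments[-1]
    return prefix + book_id + "/" + page


# "12345.html" is a made-up chapter file name.
print(build_content_url("https://www.rzlib.net/b/73/73530/12345.html"))
# prints https://www.rzlib.net/b/txtt5552/73530/12345.html
```

Splitting the parsed path rather than the raw string makes the function tolerant of a trailing query string, which plain `url.split('/')` is not.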

The content still contains ads, so replace is used to strip them out. The rest is XPath parsing and writing to disk.
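The cleanup step can be tried without touching the network. The sketch below assumes the two ad strings used in the script in this post. The names `AD_SNIPPETS`, `strip_ads` and `clean_fragments`, and the sample input, are made up for the example.

```python
# Ad strings copied from the script in this post; extend the tuple as needed.
AD_SNIPPETS = (
    "如果您覺得《周睿道德天書》還不錯的話，請粘貼以下網址分享給你的QQ、"
    "微信或微博好友，謝謝支持！",
    "（ 本書網址：https://www.rzlib.net/b/73/73530/ ）",
)


def strip_ads(text, snippets=AD_SNIPPETS):
    """Remove every known ad string from the raw page text."""
    for snippet in snippets:
        text = text.replace(snippet, "")
    return text


def clean_fragments(fragments):
    """Delete all whitespace inside each text node and skip empty results."""
    cleaned = ("".join(fragment.split()) for fragment in fragments)
    return [line for line in cleaned if line]


# Made-up sample input standing in for a downloaded page.
raw = "<p>First paragraph</p>（ 本書網址：https://www.rzlib.net/b/73/73530/ ）"
print(strip_ads(raw))
print(clean_fragments(["  some text \n", "\r\n", " more\ttext "]))
```

Note that `"".join(fragment.split())` removes every space, which suits Chinese text but would glue English words together.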

#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
@author: maya
@software: Pycharm
@file: tqdm.py
@time: 2019/8/20 14:08
@desc:
'''
import os

import requests
from lxml import etree

BASE_URL = "https://www.rzlib.net"
OUTPUT_DIR = "books"

headers = {
    'cookie': 'Hm_lvt_33b927fed41089db72f5d741701b24f2=1566285504; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1; Hm_lpvt_33b927fed41089db72f5d741701b24f2=1566285551',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'referer': 'https://www.rzlib.net/b/73/73530/',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3'
}

# Ad strings that the site injects into the chapter text.
ADS = (
    '如果您覺得《周睿道德天書》還不錯的話，請粘貼以下網址分享給你的QQ、'
    '微信或微博好友，謝謝支持！',
    '（ 本書網址：https://www.rzlib.net/b/73/73530/ ）',
)


def get_html(url):
    """Fetch a page and strip the known ad strings from its text."""
    text = requests.get(url, headers=headers, timeout=10).text
    for ad in ADS:
        text = text.replace(ad, '')
    return text


def get_data(url):
    """Return the chapter title and its cleaned, non-empty lines of text."""
    html = etree.HTML(get_html(url))
    # Dots are replaced so the title can be used as a file name.
    title = html.xpath('//h1/text()')[0].replace('.', '_')
    # The chapter page holds only part of the text; the rest is served elsewhere.
    content_html = etree.HTML(get_html(get_url(url))).xpath('//body//text()')
    # Remove all whitespace inside each text node, then drop empty nodes.
    content = ["".join(data.split()) for data in content_html]
    return title, [data for data in content if data]


def get_url(url):
    """Build the URL of the full chapter text from the chapter page URL."""
    book_id, page = url.split('/')[-2:]
    return BASE_URL + "/b/txtt5552/" + book_id + "/" + page


def format_chapter(title, content):
    """Render one chapter as a heading followed by indented lines."""
    lines = [title.replace('_', '. ')] + ['  ' + data for data in content]
    return '\n'.join(lines) + '\n'


def write_data(url):
    title, content = get_data(url)
    chapter = format_chapter(title, content)
    # Create the output directory on first use instead of failing.
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    # One file per chapter.
    with open(os.path.join(OUTPUT_DIR, title + '.txt'), 'w', encoding='utf-8') as f:
        f.write(chapter)
    # Also append the chapter to a single file holding the whole book.
    with open(os.path.join(OUTPUT_DIR, 'books.txt'), 'a', encoding='utf-8') as p:
        p.write(chapter + '\n')


def get_total(index_url):
    html = etree.HTML(get_html(index_url))
    urls = html.xpath('//div[@class="ListChapter"][2]/ul/li/a/@href')
    # enumerate gives the position directly and stays correct for duplicate links.
    for number, url in enumerate(urls, start=1):
        write_data(BASE_URL + url)
        print("Chapter {} written".format(number))


if __name__ == '__main__':
    get_total("https://www.rzlib.net/b/73/73530/")

Code reference: GitHub
