Scraping Novels from Quanshuwang with Python

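Here is a script for scraping novels. The code is simple, and you only need to change one place (the book URL passed to urlopen) to scrape any novel on Quanshuwang (全书网). Scraping a novel means collecting two things: the chapter titles and the chapter text. Below I walk through the key parts of the code, and the complete script is attached at the end.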

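Step one: fetch the source of the book's index page, decode it, and use a regular expression to match each chapter's link and title.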

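    # Fetch the source of the book's index page
    html=urllib.request.urlopen("http://www.quanshuwang.com/book/2/2131").read()
    # The bytes are GBK-encoded, so decode them accordingly
    html=html.decode("gbk")
    #print(html)
    # Regex with two capture groups: chapter link and chapter title
    reg=r'<a href="(.*?)" title=".*?">(.*?)</a></li>'
    # Precompile the pattern so it can be reused efficiently
    reg=re.compile(reg)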
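To see what this pattern captures without touching the network, here is a minimal sketch that runs it against a made-up fragment of an index page. The sample HTML, the example.com URLs, and the helper name `extract_chapters` are assumptions for illustration, not taken from the real site.

```python
import re

# Hypothetical fragment of a chapter index page (made up for illustration).
SAMPLE_INDEX_HTML = (
    '<li><a href="http://example.com/book/1.html" title="Chapter 1, 3000 words">Chapter 1</a></li>\n'
    '<li><a href="http://example.com/book/2.html" title="Chapter 2, 2800 words">Chapter 2</a></li>\n'
)

# Same pattern as in the article: group 1 is the link, group 2 is the title.
CHAPTER_LINK_RE = re.compile(r'<a href="(.*?)" title=".*?">(.*?)</a></li>')


def extract_chapters(index_html):
    """Return a list of (url, title) tuples found in the index page."""
    return CHAPTER_LINK_RE.findall(index_html)


chapters = extract_chapters(SAMPLE_INDEX_HTML)
for url, title in chapters:
    print(title, url)
```

Because the pattern has two groups, `findall` returns one tuple per chapter, which is why the later code reads `url[0]` for the link and `url[1]` for the title.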

Step two: take each chapter's title and link, open the link, and use a regular expression to extract the chapter text.

        # Get the chapter link and title, then open the link
        novel_url=url[0]
        novel_title=url[1]
        chapt=urllib.request.urlopen(novel_url).read()
        chapt_html=chapt.decode("gbk")
        #print(chapt_html)
        # Regex that captures the chapter text
        reg=r'</script>&nbsp;&nbsp;&nbsp;&nbsp;(.*?)<script type="text/javascript">'
        # re.S (DOTALL) lets "." match newlines, so the capture can span several lines
        reg=re.compile(reg,re.S)
        chapt_content=re.findall(reg,chapt_html)
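The role of `re.S` is easy to miss, so here is a small sketch comparing the same pattern with and without the flag. The chapter fragment below is invented for the demonstration and only mimics the shape the pattern expects.

```python
import re

# Hypothetical chapter page fragment; the body spans several lines.
SAMPLE_CHAPTER_HTML = (
    '<script>ad()</script>&nbsp;&nbsp;&nbsp;&nbsp;First line.<br />\n'
    '&nbsp;&nbsp;&nbsp;&nbsp;Second line.<br />\n'
    '<script type="text/javascript">ad2()</script>'
)

PATTERN = r'</script>&nbsp;&nbsp;&nbsp;&nbsp;(.*?)<script type="text/javascript">'

# Without re.S, "." stops at "\n", so a multi-line body cannot match.
without_dotall = re.findall(PATTERN, SAMPLE_CHAPTER_HTML)
# With re.S (re.DOTALL), "." also matches "\n" and the whole body is captured.
with_dotall = re.findall(PATTERN, SAMPLE_CHAPTER_HTML, re.S)

print(len(without_dotall), len(with_dotall))
```

On this sample the first call finds nothing and the second finds the whole two-line body, which is the behaviour the scraper relies on.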

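The content scraped at this point is still a raw HTML fragment, so it needs preprocessing: remove the &nbsp; entities and the <br /> tags.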


        chapt_content=chapt_content[0].replace("&nbsp;&nbsp;&nbsp;&nbsp;","")
        chapt_content=chapt_content.replace("<br />","")
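As a variation on the two `replace` calls above, the sketch below keeps paragraph breaks by turning each `<br />` into a newline instead of deleting it. The function name `clean_chapter` and the sample string are assumptions; the article's own code simply removes both markers.

```python
import re


def clean_chapter(raw):
    """Turn the raw captured HTML fragment into plain text."""
    # Treat every <br>, <br/> or <br /> as a line break.
    text = re.sub(r'<br\s*/?>', '\n', raw)
    # Drop the &nbsp; entities used for indentation.
    text = text.replace('&nbsp;', '')
    # Strip whitespace on each line and discard empty lines.
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)


raw = 'First line.<br />\r\n<br />\r\n&nbsp;&nbsp;&nbsp;&nbsp;Second line.<br />'
print(clean_chapter(raw))
```

Whether to keep or drop line breaks is a matter of taste; keeping them tends to make the saved TXT file easier to read.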

The result is plain chapter text with the entities and tags removed.

Next, save the text locally. There are two approaches: save each chapter to its own TXT file, or save everything to a single TXT file. The code is as follows:

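        # Option 1: save each chapter to its own TXT file
        with open('{}.txt'.format(novel_title),'w',encoding='utf-8') as f:
            f.write(chapt_content)
        # Option 2: append every chapter to a single TXT file
        #with open('C:/Users/ASUS/Desktop/12.txt','a',encoding='utf-8') as f:
        #    f.write(chapt_content)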
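The two saving strategies can be sketched as small helpers. Everything named here (`safe_filename`, `save_per_chapter`, `append_to_book`, the sample titles) is hypothetical; the filename cleanup is an extra precaution I am assuming is useful, since chapter titles may contain characters such as `?` or `:` that Windows does not allow in file names.

```python
import os
import re
import tempfile


def safe_filename(title):
    """Replace characters that are not allowed in Windows file names."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()


def save_per_chapter(folder, title, content):
    """Approach 1: write one chapter to its own TXT file and return the path."""
    path = os.path.join(folder, safe_filename(title) + '.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(content)
    return path


def append_to_book(book_path, title, content):
    """Approach 2: append the chapter, preceded by its title, to one TXT file."""
    with open(book_path, 'a', encoding='utf-8') as f:
        f.write(title + '\n' + content + '\n\n')


# Demo in a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as folder:
    chapter_path = save_per_chapter(folder, 'Chapter 1: Who?', 'Some text.')
    book_path = os.path.join(folder, 'book.txt')
    append_to_book(book_path, 'Chapter 1', 'Some text.')
    append_to_book(book_path, 'Chapter 2', 'More text.')
    print(os.path.basename(chapter_path))
```

Passing `encoding='utf-8'` explicitly avoids depending on the platform's default encoding, and the `with` block closes the file even if a write fails.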

The commented-out lines are the variant that saves everything to a single TXT file. The complete code follows:

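# coding=utf-8

import urllib.request
import re

# Camel-case function name; fetches and saves the novel's content
def getNoverContent():
    # Fetch the source of the book's index page
    html=urllib.request.urlopen("http://www.quanshuwang.com/book/2/2131").read()
    # Decode the GBK-encoded bytes
    html=html.decode("gbk")
    # Regex with two capture groups: chapter link and chapter title
    reg=r'<a href="(.*?)" title=".*?">(.*?)</a></li>'
    # Precompile the pattern
    reg=re.compile(reg)
    # Find all matches; each one is a (link, title) tuple
    urls=re.findall(reg,html)
    # Regex for the chapter text; re.S lets "." match newlines too
    content_reg=re.compile(r'</script>&nbsp;&nbsp;&nbsp;&nbsp;(.*?)<script type="text/javascript">',re.S)
    for url in urls:
        # Get the chapter link and title, then open the link
        novel_url=url[0]
        novel_title=url[1]
        chapt=urllib.request.urlopen(novel_url).read()
        chapt_html=chapt.decode("gbk")
        #print(chapt_html)
        chapt_content=re.findall(content_reg,chapt_html)
        # Skip pages where the pattern finds nothing instead of raising IndexError
        if not chapt_content:
            continue
        # Remove the &nbsp; indentation and the <br /> tags
        chapt_content=chapt_content[0].replace("&nbsp;&nbsp;&nbsp;&nbsp;","")
        chapt_content=chapt_content.replace("<br />","")

        print("Scraping %s"%novel_title)
        # Option 1: save each chapter to its own TXT file
        with open('{}.txt'.format(novel_title),'w',encoding='utf-8') as f:
            f.write(chapt_content)
        # Option 2: append every chapter to a single TXT file
        #with open('C:/Users/ASUS/Desktop/12.txt','a',encoding='utf-8') as f:
        #    f.write(chapt_content)

getNoverContent()
print("Scraping finished")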
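Since the only thing that changes between novels is the index URL, one possible refactoring is to make it a function parameter. The sketch below is an assumption-laden variant, not the script above: the names `fetch` and `scrape_book`, the injectable `fetch_page` argument, the timeout, and the one-second delay between requests are all my additions.

```python
import re
import time
import urllib.request

CHAPTER_LINK_RE = re.compile(r'<a href="(.*?)" title=".*?">(.*?)</a></li>')
CONTENT_RE = re.compile(
    r'</script>&nbsp;&nbsp;&nbsp;&nbsp;(.*?)<script type="text/javascript">', re.S)


def fetch(url):
    """Download a page and decode it as GBK, as the article does."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode('gbk', errors='replace')


def scrape_book(index_url, fetch_page=fetch, delay=1.0):
    """Yield (title, text) for each chapter linked from index_url."""
    for chapter_url, title in CHAPTER_LINK_RE.findall(fetch_page(index_url)):
        matches = CONTENT_RE.findall(fetch_page(chapter_url))
        if not matches:
            continue  # page layout differs; skip instead of crashing
        text = matches[0].replace('&nbsp;', '').replace('<br />', '')
        yield title, text
        time.sleep(delay)  # pause between requests to go easy on the server
```

A call such as `for title, text in scrape_book("http://www.quanshuwang.com/book/2/2131"): ...` would then iterate over the chapters, and the `fetch_page` argument makes it possible to try the parsing logic on canned HTML without any network access.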

That's all for today. If you spot anything wrong, corrections are welcome.
