Scraping 《庆余年》 from the Dingdian Novel Site with Python: Notes from an Unsuccessful Attempt

Goal: use a Python crawler to grab the text of the novel 《庆余年》 from the Dingdian novel site (顶点小说) and save it as a TXT file.

Environment: Windows 10, Anaconda3 + PyCharm, Python 3.6.

Approach: (1) from the table-of-contents page, collect every chapter's URL and title; (2) for each chapter URL, download the chapter body and append it, in order, to a TXT file.

Steps:

1. The index page of 《庆余年》 on the Dingdian site:

https://www.booktxt.net/1_1902/

2. On the index page, right-click the entry 【楔子 一块黑布】 (the prologue) and choose 【检查】 (Inspect); the steps are shown in the screenshot above and the result in the screenshot below:

As the screenshot shows, every chapter title and its hyperlink sits in an <a href="***.html">***</a> element: the href holds the relative URL and the element text holds the chapter title.

3. Open the chapter 【楔子 一块黑布】 and notice that its real URL is simply the index URL https://www.booktxt.net/1_1902/ joined with the href from the corresponding <a> tag.
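As a quick check of this observation, the join can be done with plain string concatenation or, more defensively, with urllib.parse.urljoin; a minimal sketch (the href value 610566.html is taken from the traceback at the end of this post):

from urllib.parse import urljoin

base_url = 'https://www.booktxt.net/1_1902/'
href = '610566.html'     # a relative href of the kind found in the chapter list

# base_url ends with '/', so plain concatenation already gives the chapter URL ...
print(base_url + href)           # https://www.booktxt.net/1_1902/610566.html

# ... but urljoin also handles absolute or rooted hrefs correctly.
print(urljoin(base_url, href))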

4. Right-click anywhere inside the chapter text and choose 【检查】 (Inspect); the steps are shown in the screenshot above and the result in the screenshot below:

The chapter body sits inside <div id="content">***</div>; note that it contains blank lines.
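A minimal, self-contained sketch of pulling the body out of that div (the HTML string below is only a toy stand-in for a real chapter page):

from bs4 import BeautifulSoup

# Toy stand-in for one chapter page; the real pages use the same layout.
html = '<div id="content">第一段……<br/><br/>第二段……</div>'

soup = BeautifulSoup(html, 'html.parser')
content_div = soup.find('div', id='content')
# get_text('\n', strip=True) joins the text fragments with newlines and
# drops the blank pieces produced by the <br/> tags.
print(content_div.get_text('\n', strip=True))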

5. First-pass code, which just prints each chapter's URL and title:

import requests
from bs4 import BeautifulSoup
import os
import time
import random
import re


all_url = 'https://www.booktxt.net/1_1902/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

start_html = requests.get(url=all_url, headers=headers)

# print(start_html.status_code)  #200
# print(type(start_html))        #<class 'requests.models.Response'>
# print(start_html.encoding)             #ISO-8859-1
# print(start_html.apparent_encoding)    #GB2312

start_html.encoding = "GB2312"   # the response headers mis-declare the charset; apparent_encoding shows GB2312

# Save location
path = 'F:/QingYuNian'
# Create the directory if it does not already exist
if os.path.exists(path):
    print('Directory already exists')
else:
    os.makedirs(path)
# Switch the working directory to the save location
os.chdir(path)

# Parse the index page with BeautifulSoup
soup = BeautifulSoup(start_html.text, 'html.parser')

# linkList1 = []
URL_all = soup.find_all('a', attrs={'href': re.compile(r'\d+\.html')})
for urls in URL_all:
    print(urls)

Result:

Every real chapter link's href turns out to be a six-digit number followed by .html.

(An alternative idea: the chapter URLs increase by 1 in a regular pattern, so you could work out the largest offset and, starting from the first chapter's URL, generate every later URL directly; a sketch of this appears after the regex note below.)

So the line

URL_all = soup.find_all('a', attrs={'href': re.compile(r'\d+\.html')})

can be replaced with:

URL_all = soup.find_all('a', attrs={'href': re.compile(r'^\d{6}\.html')})
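A rough sketch of the URL-generating alternative mentioned above (the starting page number and the chapter count below are placeholders; read the real values off the table-of-contents page before relying on this):

# Generate chapter URLs directly instead of scraping them, relying on the
# observation that the page numbers grow by 1 from one chapter to the next.
base_url = 'https://www.booktxt.net/1_1902/'
first_page = 610000      # placeholder: the number in the first chapter's href
num_chapters = 700       # placeholder: the number of chapters in the TOC

chapter_urls = ['{}{}.html'.format(base_url, first_page + i) for i in range(num_chapters)]
print(chapter_urls[:3])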

6. Final code:

# Download the novel 《庆余年》


import requests
from bs4 import BeautifulSoup
import os
import time
import random
import re


all_url = 'https://www.booktxt.net/1_1902/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

start_html = requests.get(url=all_url, headers=headers)

# print(start_html.status_code)  #200
# print(type(start_html))        #<class 'requests.models.Response'>
# print(start_html.encoding)             #ISO-8859-1
# print(start_html.apparent_encoding)    #GB2312

start_html.encoding = "GB2312"   # the response headers mis-declare the charset; apparent_encoding shows GB2312

# Save location
path = 'F:/QingYuNian'
# Create the directory if it does not already exist
if os.path.exists(path):
    print('Directory already exists')
else:
    os.makedirs(path)
# Switch the working directory to the save location
os.chdir(path)

# Parse the index page with BeautifulSoup
soup = BeautifulSoup(start_html.text, 'html.parser')

linkList1 = []
URL_all = soup.find_all('a', attrs={'href': re.compile(r'^\d{6}\.html')})  # keep only real chapter links
for urls in URL_all:
    # print(urls)
    url = all_url + urls['href']
    title = urls.string
    # print(url)
    # print(title)
    # with open('QYN_Url_title.txt', 'a+', encoding='utf-8') as f:
    #     f.write(url + '\t')
    #     f.write(title + '\n')
    linkList1.append([url, title])

n = 0
for item in linkList1:
    url_1 = item[0]
    title_1 = item[1]
    one_html = requests.get(url=url_1, headers=headers)
    one_html.encoding = "GB2312"
    soup = BeautifulSoup(one_html.text, 'html.parser')
    con = soup.find('div', id='content')
    cont = con.get_text('\n', strip=True)    # join the text fragments with newlines, stripping the blank pieces
    with open('QingYuNian_Contents.txt', 'a+', encoding='utf-8') as f:
        f.write(title_1 + '\n')
        f.write(url_1 + '\n')
        f.write(cont + '\n')
        f.write("==================================================\n")
    n = n + 1
    print('Chapter {} downloaded'.format(n))
    time.sleep(random.random() * 10)
print("===== All chapters downloaded =====")

Run result:

Only a small fraction of the chapters were downloaded before the run stopped with the following error:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 444, in wrap_socket
    cnx.do_handshake()
  File "C:\ProgramData\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1907, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "C:\ProgramData\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1631, in _raise_ssl_error
    raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (10060, 'WSAETIMEDOUT')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 343, in _make_request
    self._validate_conn(conn)
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 849, in _validate_conn
    conn.connect()
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connection.py", line 356, in connect
    ssl_context=context)
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\util\ssl_.py", line 359, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 450, in wrap_socket
    raise ssl.SSLError('bad handshake: %r' % e)
ssl.SSLError: ("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\adapters.py", line 445, in send
    timeout=timeout
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\util\retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.booktxt.net', port=443): Max retries exceeded with url: /1_1902/610566.html (Caused by SSLError(SSLError("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')")))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "F:/MeiZITu_PaChong_all/QingYuNian20200103New.py", line 56, in <module>
    one_html = requests.get(url=url_1, headers=headers)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\adapters.py", line 511, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.booktxt.net', port=443): Max retries exceeded with url: /1_1902/610566.html (Caused by SSLError(SSLError("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')")))

Process finished with exit code 1

Possible next steps:

First: access each chapter URL through a simulated login (browser session), then download the content.

Second: use proxy IPs. A sketch that also adds a timeout and automatic retries follows below.
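The traceback shows a connection timeout during the TLS handshake, so before reaching for a proxy it is also worth giving each request an explicit timeout and automatic retries. A minimal sketch along those lines (the proxy address is a placeholder, and the chapter URL is the one from the traceback):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# A session with automatic retries; connection errors are retried as well.
session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))

proxies = {'https': 'http://127.0.0.1:8888'}   # placeholder proxy, substitute your own

resp = session.get('https://www.booktxt.net/1_1902/610566.html',
                   headers={'User-Agent': 'Mozilla/5.0'},
                   timeout=10,
                   # proxies=proxies,          # uncomment to route through the proxy
                   )
resp.encoding = 'GB2312'
print(resp.status_code)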

References:

https://blog.csdn.net/lb245557472/article/details/80239603

https://blog.csdn.net/zouyee/article/details/25898751
