Scraping 《庆余年》 from the Dingdian Novel Site with Python: Notes from an Unsuccessful Attempt

Goal: use a Python crawler to grab the text of the novel 《庆余年》 from the Dingdian novel site (顶点小说) and save it as a TXT file.

Environment: Windows 10, Anaconda3 + PyCharm, Python 3.6.

Approach: (1) from the table-of-contents page, collect every chapter's URL and title; (2) for each chapter URL, download the chapter body and append it, in order, to a TXT file.

Steps:

1. The index page of 《庆余年》 on the Dingdian site:

https://www.booktxt.net/1_1902/

2. On the index page, right-click the entry 【楔子 一块黑布】 (the prologue) and choose 【检查】 (Inspect); the steps are shown in the screenshot above and the result in the screenshot below:

As the screenshot shows, every chapter title and its hyperlink sits in an <a href="***.html">***</a> element: the href holds the relative URL and the element text holds the chapter title.

3. Open the chapter 【楔子 一块黑布】 and notice that its real URL is simply the index URL https://www.booktxt.net/1_1902/ joined with the href from the corresponding <a> tag.
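As a quick check of this observation, the join can be done with plain string concatenation or, more defensively, with urllib.parse.urljoin; a minimal sketch (the href value 610566.html is taken from the traceback at the end of this post):

from urllib.parse import urljoin

base_url = 'https://www.booktxt.net/1_1902/'
href = '610566.html'     # a relative href of the kind found in the chapter list

# base_url ends with '/', so plain concatenation already gives the chapter URL ...
print(base_url + href)           # https://www.booktxt.net/1_1902/610566.html

# ... but urljoin also handles absolute or rooted hrefs correctly.
print(urljoin(base_url, href))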

4. Right-click anywhere inside the chapter text and choose 【检查】 (Inspect); the steps are shown in the screenshot above and the result in the screenshot below:

The chapter body sits inside <div id="content">***</div>; note that it contains blank lines.
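A minimal, self-contained sketch of pulling the body out of that div (the HTML string below is only a toy stand-in for a real chapter page):

from bs4 import BeautifulSoup

# Toy stand-in for one chapter page; the real pages use the same layout.
html = '<div id="content">第一段……<br/><br/>第二段……</div>'

soup = BeautifulSoup(html, 'html.parser')
content_div = soup.find('div', id='content')
# get_text('\n', strip=True) joins the text fragments with newlines and
# drops the blank pieces produced by the <br/> tags.
print(content_div.get_text('\n', strip=True))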

5. First-pass code, which just prints each chapter's URL and title:

import requests
from bs4 import BeautifulSoup
import os
import time
import random
import re


all_url = 'https://www.booktxt.net/1_1902/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

start_html = requests.get(url=all_url, headers=headers)

# print(start_html.status_code)  #200
# print(type(start_html))        #<class 'requests.models.Response'>
# print(start_html.encoding)             #ISO-8859-1
# print(start_html.apparent_encoding)    #GB2312

start_html.encoding = "GB2312"   # the response headers mis-declare the charset; apparent_encoding shows GB2312

# Save location
path = 'F:/QingYuNian'
# Create the directory if it does not already exist
if os.path.exists(path):
    print('Directory already exists')
else:
    os.makedirs(path)
# Switch the working directory to the save location
os.chdir(path)

# Parse the index page with BeautifulSoup
soup = BeautifulSoup(start_html.text, 'html.parser')

# linkList1 = []
URL_all = soup.find_all('a', attrs={'href': re.compile(r'\d+\.html')})
for urls in URL_all:
    print(urls)

Result:

Every real chapter link's href turns out to be a six-digit number followed by .html.

(An alternative idea: the chapter URLs increase by 1 in a regular pattern, so you could work out the largest offset and, starting from the first chapter's URL, generate every later URL directly; a sketch of this appears after the regex note below.)

So the line

URL_all = soup.find_all('a', attrs={'href': re.compile(r'\d+\.html')})

can be replaced with:

URL_all = soup.find_all('a', attrs={'href': re.compile(r'^\d{6}\.html')})
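A rough sketch of the URL-generating alternative mentioned above (the starting page number and the chapter count below are placeholders; read the real values off the table-of-contents page before relying on this):

# Generate chapter URLs directly instead of scraping them, relying on the
# observation that the page numbers grow by 1 from one chapter to the next.
base_url = 'https://www.booktxt.net/1_1902/'
first_page = 610000      # placeholder: the number in the first chapter's href
num_chapters = 700       # placeholder: the number of chapters in the TOC

chapter_urls = ['{}{}.html'.format(base_url, first_page + i) for i in range(num_chapters)]
print(chapter_urls[:3])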

6. Final code:

# Download the novel 《庆余年》


import requests
from bs4 import BeautifulSoup
import os
import time
import random
import re


all_url = 'https://www.booktxt.net/1_1902/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

start_html = requests.get(url=all_url, headers=headers)

# print(start_html.status_code)  #200
# print(type(start_html))        #<class 'requests.models.Response'>
# print(start_html.encoding)             #ISO-8859-1
# print(start_html.apparent_encoding)    #GB2312

start_html.encoding = "GB2312"   # the response headers mis-declare the charset; apparent_encoding shows GB2312

# Save location
path = 'F:/QingYuNian'
# Create the directory if it does not already exist
if os.path.exists(path):
    print('Directory already exists')
else:
    os.makedirs(path)
# Switch the working directory to the save location
os.chdir(path)

# Parse the index page with BeautifulSoup
soup = BeautifulSoup(start_html.text, 'html.parser')

linkList1 = []
URL_all = soup.find_all('a', attrs={'href': re.compile(r'^\d{6}\.html')})  # keep only real chapter links
for urls in URL_all:
    # print(urls)
    url = all_url + urls['href']
    title = urls.string
    # print(url)
    # print(title)
    # with open('QYN_Url_title.txt', 'a+', encoding='utf-8') as f:
    #     f.write(url + '\t')
    #     f.write(title + '\n')
    linkList1.append([url, title])

n = 0
for item in linkList1:
    url_1 = item[0]
    title_1 = item[1]
    one_html = requests.get(url=url_1, headers=headers)
    one_html.encoding = "GB2312"
    soup = BeautifulSoup(one_html.text, 'html.parser')
    con = soup.find('div', id='content')
    cont = con.get_text('\n', strip=True)    # join the text fragments with newlines, stripping the blank pieces
    with open('QingYuNian_Contents.txt', 'a+', encoding='utf-8') as f:
        f.write(title_1 + '\n')
        f.write(url_1 + '\n')
        f.write(cont + '\n')
        f.write("==================================================\n")
    n = n + 1
    print('Chapter {} downloaded'.format(n))
    time.sleep(random.random() * 10)
print("===== All chapters downloaded =====")

Run result:

Only a small fraction of the chapters were downloaded before the run stopped with the following error:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 444, in wrap_socket
    cnx.do_handshake()
  File "C:\ProgramData\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1907, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "C:\ProgramData\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1631, in _raise_ssl_error
    raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (10060, 'WSAETIMEDOUT')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 343, in _make_request
    self._validate_conn(conn)
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 849, in _validate_conn
    conn.connect()
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connection.py", line 356, in connect
    ssl_context=context)
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\util\ssl_.py", line 359, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 450, in wrap_socket
    raise ssl.SSLError('bad handshake: %r' % e)
ssl.SSLError: ("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\adapters.py", line 445, in send
    timeout=timeout
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\util\retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.booktxt.net', port=443): Max retries exceeded with url: /1_1902/610566.html (Caused by SSLError(SSLError("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')")))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "F:/MeiZITu_PaChong_all/QingYuNian20200103New.py", line 56, in <module>
    one_html = requests.get(url=url_1, headers=headers)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\adapters.py", line 511, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.booktxt.net', port=443): Max retries exceeded with url: /1_1902/610566.html (Caused by SSLError(SSLError("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')")))

Process finished with exit code 1

Possible next steps:

First: access each chapter URL through a simulated login (browser session), then download the content.

Second: use proxy IPs. A sketch that also adds a timeout and automatic retries follows below.
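The traceback shows a connection timeout during the TLS handshake, so before reaching for a proxy it is also worth giving each request an explicit timeout and automatic retries. A minimal sketch along those lines (the proxy address is a placeholder, and the chapter URL is the one from the traceback):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# A session with automatic retries; connection errors are retried as well.
session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))

proxies = {'https': 'http://127.0.0.1:8888'}   # placeholder proxy, substitute your own

resp = session.get('https://www.booktxt.net/1_1902/610566.html',
                   headers={'User-Agent': 'Mozilla/5.0'},
                   timeout=10,
                   # proxies=proxies,          # uncomment to route through the proxy
                   )
resp.encoding = 'GB2312'
print(resp.status_code)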

References:

https://blog.csdn.net/lb245557472/article/details/80239603

https://blog.csdn.net/zouyee/article/details/25898751
