Goal: use a Python crawler to download the novel 《慶餘年》 (Qing Yu Nian, "Joy of Life") from the Dingdian novel site and save it as a TXT file.
Environment: Windows 10, Anaconda3 + PyCharm, Python 3.6.
Approach: (1) from the table-of-contents page, collect the URL and title of every chapter; (2) for each chapter URL, download the chapter body and write it, in order, into a single TXT file.
Steps:
1. The index page of 《慶餘年》 on the Dingdian site:
https://www.booktxt.net/1_1902/
2. On the index page, right-click the first chapter link, 【楔子 一塊黑布】 ("Prologue: A Piece of Black Cloth"), and choose Inspect; the steps are shown in the screenshot above and the result in the one below:
All chapter titles and their hyperlinks sit in <a href="***.html">***</a> elements: the href attribute holds a relative URL and the element text holds the chapter title.
3. Open the 【楔子 一塊黑布】 chapter and note that its real address is the index URL https://www.booktxt.net/1_1902/ joined with the href value from the corresponding <a> element.
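This joining rule is exactly what the standard library's urljoin implements; a quick check (using the chapter filename 610566.html that appears later in this post):

```python
from urllib.parse import urljoin

base = 'https://www.booktxt.net/1_1902/'
# href value from one of the chapter <a> elements
print(urljoin(base, '610566.html'))  # https://www.booktxt.net/1_1902/610566.html
```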
4. Right-click anywhere in the chapter body and choose Inspect; the steps are shown in the screenshot above and the result in the one below:
The body text lives in <div id="content">***</div>; note that it contains blank lines.
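Extraction from such a div can be sketched in isolation; the HTML string below is a made-up miniature of the real chapter page, not its actual markup:

```python
from bs4 import BeautifulSoup

# Miniature stand-in for a chapter page: two paragraphs separated by <br/> tags
html = '<div id="content">  First line.<br/><br/>  Second line.</div>'
soup = BeautifulSoup(html, 'html.parser')
con = soup.find('div', id='content')
# get_text('\n', strip=True): join the text fragments with newlines,
# trimming whitespace around each fragment
print(con.get_text('\n', strip=True))
# First line.
# Second line.
```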
5. A first version of the code, which prints each chapter's URL and title:
import requests
from bs4 import BeautifulSoup
import os
import time
import random
import re
all_url = 'https://www.booktxt.net/1_1902/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
start_html = requests.get(url=all_url, headers=headers)
# print(start_html.status_code) #200
# print(type(start_html)) #<class 'requests.models.Response'>
# print(start_html.encoding) #ISO-8859-1
# print(start_html.apparent_encoding) #GB2312
start_html.encoding = "GB2312"
# Save directory
path = 'F:/QingYuNian'
# Create the save directory if it does not exist
if os.path.exists(path):
    print('Directory already exists')
else:
    os.makedirs(path)
# Change the working directory to the save directory
os.chdir(path)
# Parse the index page with BeautifulSoup
soup = BeautifulSoup(start_html.text, 'html.parser')
# linkList1 = []
URL_all = soup.find_all('a', attrs={'href': re.compile(r'\d+\.html')})
for urls in URL_all:
    print(urls)
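A side note on the commented-out encoding lines above: when a server sends no charset header, requests falls back to ISO-8859-1, which garbles GB2312 pages; that is why the script sets encoding explicitly. The effect can be reproduced offline (using the simplified form 庆余年, since GB2312 covers simplified characters):

```python
raw = '庆余年'.encode('gb2312')  # the bytes as the server would send them
print(raw.decode('iso-8859-1'))  # mojibake: what requests would show by default
print(raw.decode('gb2312'))      # 庆余年
```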
Result:
As the output shows, the href of every real chapter link is a six-digit number followed by .html.
(An alternative approach: the chapter numbers in the URLs increase by 1 from one chapter to the next, so one could find the last chapter's number and generate every URL in sequence starting from the first.)
Therefore
URL_all = soup.find_all('a', attrs={'href': re.compile(r'\d+\.html')})
can be replaced with:
URL_all = soup.find_all('a', attrs={'href': re.compile(r'^\d{6}\.html')})
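The difference between the two patterns can be sanity-checked offline. BeautifulSoup applies a compiled pattern to each attribute value with re.search, and the sample hrefs below are illustrative stand-ins for what the index page contains:

```python
import re

# Illustrative hrefs: a chapter link, the index path, and a site-menu link
hrefs = ['610566.html', '/1_1902/', '/xuanhuan/1.html']

loose = re.compile(r'\d+\.html')      # any href containing digits + ".html"
strict = re.compile(r'^\d{6}\.html')  # only hrefs starting with six digits

print([h for h in hrefs if loose.search(h)])   # ['610566.html', '/xuanhuan/1.html']
print([h for h in hrefs if strict.search(h)])  # ['610566.html']
```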
6. Final code:
# Download the novel 《慶餘年》
import requests
from bs4 import BeautifulSoup
import os
import time
import random
import re
all_url = 'https://www.booktxt.net/1_1902/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
start_html = requests.get(url=all_url, headers=headers)
# print(start_html.status_code) #200
# print(type(start_html)) #<class 'requests.models.Response'>
# print(start_html.encoding) #ISO-8859-1
# print(start_html.apparent_encoding) #GB2312
start_html.encoding = "GB2312"
# Save directory
path = 'F:/QingYuNian'
# Create the save directory if it does not exist
if os.path.exists(path):
    print('Directory already exists')
else:
    os.makedirs(path)
# Change the working directory to the save directory
os.chdir(path)
# Parse the index page with BeautifulSoup
soup = BeautifulSoup(start_html.text, 'html.parser')
linkList1 = []
URL_all = soup.find_all('a', attrs={'href': re.compile(r'^\d{6}\.html')})
for urls in URL_all:
    # print(urls)
    url = all_url + urls['href']
    title = urls.string
    # print(url)
    # print(title)
    # with open('QYN_Url_title.txt', 'a+', encoding='utf-8') as f:
    #     f.write(url + '\t')
    #     f.write(title + '\n')
    linkList1.append([url, title])
n = 0
for item in linkList1:
    url_1 = item[0]
    title_1 = item[1]
    one_html = requests.get(url=url_1, headers=headers)
    one_html.encoding = "GB2312"
    soup = BeautifulSoup(one_html.text, 'html.parser')
    con = soup.find('div', id='content')
    # join the text fragments with newlines and strip surrounding whitespace
    cont = con.get_text('\n', strip=True)
    with open('QingYuNian_Contents.txt', 'a+', encoding='utf-8') as f:
        f.write(title_1 + '\n')
        f.write(url_1 + '\n')
        f.write(cont + '\n')
        f.write("==================================================\n")
    n = n + 1
    print('Chapter {} downloaded'.format(n))
    # random pause between requests to avoid hammering the server
    time.sleep(random.random() * 10)
print("===== All chapters downloaded =====")
Run result:
Only a small portion of the chapters was downloaded before the run stopped with the following error:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 444, in wrap_socket
cnx.do_handshake()
File "C:\ProgramData\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1907, in do_handshake
self._raise_ssl_error(self._ssl, result)
File "C:\ProgramData\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1631, in _raise_ssl_error
raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (10060, 'WSAETIMEDOUT')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 343, in _make_request
self._validate_conn(conn)
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 849, in _validate_conn
conn.connect()
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connection.py", line 356, in connect
ssl_context=context)
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\util\ssl_.py", line 359, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 450, in wrap_socket
raise ssl.SSLError('bad handshake: %r' % e)
ssl.SSLError: ("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')",)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\adapters.py", line 445, in send
timeout=timeout
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\util\retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.booktxt.net', port=443): Max retries exceeded with url: /1_1902/610566.html (Caused by SSLError(SSLError("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')")))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "F:/MeiZITu_PaChong_all/QingYuNian20200103New.py", line 56, in <module>
one_html = requests.get(url=url_1, headers=headers)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 512, in request
resp = self.send(prep, **send_kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 622, in send
r = adapter.send(request, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\adapters.py", line 511, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.booktxt.net', port=443): Max retries exceeded with url: /1_1902/610566.html (Caused by SSLError(SSLError("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')")))
Process finished with exit code 1
Possible ways forward:
Option 1: simulate a logged-in browser session for each chapter URL, then download the content.
Option 2: use proxy IPs.
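Before reaching for either option, it is worth noting that the WSAETIMEDOUT handshake failure is often transient, so simply retrying each request a few times may be enough. A minimal retry sketch follows; the flaky_get function is a stand-in used only to exercise the logic, and in the real script one would pass something like lambda u: requests.get(u, headers=headers, timeout=10) as get_fn:

```python
import time
import random

def fetch_with_retries(get_fn, url, max_tries=3, base_delay=1.0):
    """Call get_fn(url), retrying on any exception up to max_tries times."""
    for attempt in range(1, max_tries + 1):
        try:
            return get_fn(url)
        except Exception:
            if attempt == max_tries:
                raise
            # back off a little longer after each failed attempt
            time.sleep(base_delay * attempt + random.random() * base_delay)

# Demo with a fake fetcher that fails twice, then succeeds
calls = {'n': 0}
def flaky_get(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise OSError("bad handshake: WSAETIMEDOUT")
    return 'OK ' + url

print(fetch_with_retries(flaky_get, '610566.html', base_delay=0.0))  # OK 610566.html
```

For the proxy route, requests accepts a proxies mapping, e.g. requests.get(url, headers=headers, proxies={'https': 'http://127.0.0.1:8888'}); the address shown is a placeholder.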