Scraping the novel 《慶餘年》 from the 頂點小說 site with Python — notes from an unsuccessful crawl

Goal: use a Python crawler to fetch the full text of 《慶餘年》 from the 頂點小說 site and save it as a .txt file.

Environment: Windows 10, Anaconda3 + PyCharm, Python 3.6

Approach: (1) from the table-of-contents page, collect the URL and title of every chapter; (2) for each chapter URL, download the chapter body and append it, in order, to a TXT file.

Steps:

1. The index page of 《慶餘年》 on 頂點小說:

https://www.booktxt.net/1_1902/

2. On the index page, right-click the entry 【楔子 一塊黑布】 and choose [Inspect]; the steps are shown in the screenshot above and the result in the screenshot below:

You can see that every chapter title and its hyperlink sit inside an <a href="***.html">***</a> tag: the href holds the relative URL and the link text is the chapter title.

3. Open the chapter 【楔子 一塊黑布】 and notice that the chapter's real URL is the index URL https://www.booktxt.net/1_1902/ joined with the href from its <a> tag.
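As a quick sanity check, the two parts can also be joined with urllib.parse.urljoin; this is only a sketch, and the plain string concatenation used in the final code below works just as well for this site:

# Sketch: join the index URL with a chapter's relative href.
# '610566.html' is only an example href value (it also shows up in the error log later).
from urllib.parse import urljoin

base_url = 'https://www.booktxt.net/1_1902/'
href = '610566.html'                      # value taken from an <a> tag's href attribute
chapter_url = urljoin(base_url, href)
print(chapter_url)                        # https://www.booktxt.net/1_1902/610566.html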

4. Right-click anywhere inside the chapter text and choose [Inspect]; the steps are the same as above and the result is shown in the screenshot below:

The chapter body turns out to live inside <div id="content">***</div>; note that it contains blank lines.
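For reference, here is a minimal sketch of pulling the text out of that <div> and dropping the blank lines; the chapter URL is an example value and the User-Agent string is shortened:

# Sketch: fetch one chapter page and extract the text from <div id="content">,
# removing the blank lines the page contains.
import requests
from bs4 import BeautifulSoup

chapter_url = 'https://www.booktxt.net/1_1902/610566.html'   # example chapter URL
one_html = requests.get(chapter_url, headers={'User-Agent': 'Mozilla/5.0'})
one_html.encoding = 'GB2312'

content_div = BeautifulSoup(one_html.text, 'html.parser').find('div', id='content')
lines = (line.strip() for line in content_div.get_text('\n').splitlines())
chapter_text = '\n'.join(line for line in lines if line)     # blank lines removed
print(chapter_text[:200])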

5. A first pass at the code, which just prints the URL and title of each chapter:

import requests
from bs4 import BeautifulSoup
import os
import time
import random
import re


all_url = 'https://www.booktxt.net/1_1902/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

start_html = requests.get(url=all_url, headers=headers)

# print(start_html.status_code)  #200
# print(type(start_html))        #<class 'requests.models.Response'>
# print(start_html.encoding)             #ISO-8859-1
# print(start_html.apparent_encoding)    #GB2312

start_html.encoding = "GB2312"

# Save directory
path = 'F:/QingYuNian'
# Create the directory if it does not already exist
if os.path.exists(path):
    print('Directory already exists')
else:
    os.makedirs(path)
# Switch the working directory to the save directory
os.chdir(path)

# Parse with BeautifulSoup
soup = BeautifulSoup(start_html.text, 'html.parser')

# linkList1 = []
URL_all = soup.find_all('a', attrs={'href': re.compile(r'\d+\.html')})
for urls in URL_all:
    print(urls)

Result:

You can see that for the real chapter links, the ***.html part of the href is always a 6-digit number.

(An alternative idea: the chapter page numbers appear to increase by 1, so you could determine the total number of chapters and generate each later chapter's URL from the first one; a sketch of this is given after the regex note below.)

So the line

URL_all = soup.find_all('a', attrs={'href': re.compile(r'\d+\.html')})

can be replaced with:

URL_all = soup.find_all('a', attrs={'href': re.compile(r'^\d{6}\.html')})
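And here is the sketch of the alternative, URL-counting idea mentioned above, assuming the page numbers really are consecutive; the starting id and the chapter count are placeholders that would have to be confirmed against the site:

# Sketch: generate chapter URLs by counting up from the first chapter's page
# number instead of collecting them from the index page.
base_url = 'https://www.booktxt.net/1_1902/'
start_id = 610566        # page number of the first chapter (example value only)
num_chapters = 100       # placeholder; the real total has to be read off the index

chapter_urls = ['{}{}.html'.format(base_url, start_id + i) for i in range(num_chapters)]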

6. Final code:

# Download the novel 《慶餘年》


import requests
from bs4 import BeautifulSoup
import os
import time
import random
import re


all_url = 'https://www.booktxt.net/1_1902/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

start_html = requests.get(url=all_url, headers=headers)

# print(start_html.status_code)  #200
# print(type(start_html))        #<class 'requests.models.Response'>
# print(start_html.encoding)             #ISO-8859-1
# print(start_html.apparent_encoding)    #GB2312

start_html.encoding = "GB2312"

# Save directory
path = 'F:/QingYuNian'
# Create the directory if it does not already exist
if os.path.exists(path):
    print('Directory already exists')
else:
    os.makedirs(path)
# Switch the working directory to the save directory
os.chdir(path)

# Parse with BeautifulSoup
soup = BeautifulSoup(start_html.text, 'html.parser')

linkList1 = []
URL_all = soup.find_all('a', attrs={'href': re.compile(r'^\d{6}\.html')})
for urls in URL_all:
    # print(urls)
    url = all_url + urls['href']
    title = urls.string
    # print(url)
    # print(title)
    # with open('QYN_Url_title.txt', 'a+', encoding='utf-8') as f:
    #     f.write(url + '\t')
    #     f.write(title + '\n')
    linkList1.append([url, title])

n = 0
for item in linkList1:
    url_1 = item[0]
    title_1 = item[1]
    one_html = requests.get(url=url_1, headers=headers)
    one_html.encoding = "GB2312"
    soup = BeautifulSoup(one_html.text, 'html.parser')
    con = soup.find('div', id='content')
    cont = con.get_text('\n', strip=True)    # join text blocks with newlines, stripping whitespace
    with open('QingYuNian_Contents.txt', 'a+', encoding='utf-8') as f:
        f.write(title_1 + '\n')
        f.write(url_1 + '\n')
        f.write(cont + '\n')
        f.write("==================================================\n")
    n = n + 1
    print('Chapter {} downloaded'.format(n))
    time.sleep(random.random() * 10)
print("=====全部章節下載完成=====")

Run result:

Only a small portion of the chapters was fetched:

Then the run hit this error:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 444, in wrap_socket
    cnx.do_handshake()
  File "C:\ProgramData\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1907, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "C:\ProgramData\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1631, in _raise_ssl_error
    raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (10060, 'WSAETIMEDOUT')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 343, in _make_request
    self._validate_conn(conn)
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 849, in _validate_conn
    conn.connect()
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connection.py", line 356, in connect
    ssl_context=context)
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\util\ssl_.py", line 359, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 450, in wrap_socket
    raise ssl.SSLError('bad handshake: %r' % e)
ssl.SSLError: ("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\adapters.py", line 445, in send
    timeout=timeout
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\util\retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.booktxt.net', port=443): Max retries exceeded with url: /1_1902/610566.html (Caused by SSLError(SSLError("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')")))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "F:/MeiZITu_PaChong_all/QingYuNian20200103New.py", line 56, in <module>
    one_html = requests.get(url=url_1, headers=headers)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\requests\adapters.py", line 511, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.booktxt.net', port=443): Max retries exceeded with url: /1_1902/610566.html (Caused by SSLError(SSLError("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')")))

Process finished with exit code 1

Ideas for getting around the problem:

Option 1: simulate a browser login/visit for each chapter URL, then download the content.

Option 2: use proxy IPs.
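A rough sketch combining the two ideas: a persistent requests.Session (a lightweight stand-in for the "simulated visit" idea) plus a proxy, a timeout, and simple retries. The proxy address is only a placeholder, and the chapter URL is the one from the error log:

# Sketch: fetch a chapter through a proxy, with retries and a timeout.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

headers = {'User-Agent': 'Mozilla/5.0'}                       # shortened UA string
url_1 = 'https://www.booktxt.net/1_1902/610566.html'          # the chapter that timed out

session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=Retry(total=3, backoff_factor=1)))

proxies = {'https': 'http://127.0.0.1:1080'}                  # placeholder proxy address
one_html = session.get(url_1, headers=headers, proxies=proxies, timeout=10)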

References:

https://blog.csdn.net/lb245557472/article/details/80239603

https://blog.csdn.net/zouyee/article/details/25898751
