Goal: use a Python crawler to download the novel 《慶餘年》 (Qing Yu Nian, "Joy of Life") from the Dingdian novel site and save it as a TXT file.
Environment: Windows 10, Anaconda3 + PyCharm, Python 3.6.
Approach: (1) from the table-of-contents page, collect the URL and title of every chapter; (2) for each chapter URL, download the chapter body and write it, in order, into a single TXT file.
Steps:
1. The index page of 《慶餘年》 on the Dingdian site:
https://www.booktxt.net/1_1902/
2. On the index page, right-click the first chapter link, 【楔子 一塊黑布】 ("Prologue: A Piece of Black Cloth"), and choose Inspect; the steps are shown in the screenshot above and the result in the one below:
All chapter titles and their hyperlinks sit in <a href="***.html">***</a> elements: the href attribute holds a relative URL and the element text holds the chapter title.
3. Open the 【楔子 一塊黑布】 chapter and note that its real address is the index URL https://www.booktxt.net/1_1902/ joined with the href value from the corresponding <a> element.
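This joining rule is exactly what the standard library's urljoin implements; a quick check (using the chapter filename 610566.html that appears later in this post):

```python
from urllib.parse import urljoin

base = 'https://www.booktxt.net/1_1902/'
# href value from one of the chapter <a> elements
print(urljoin(base, '610566.html'))  # https://www.booktxt.net/1_1902/610566.html
```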
4. Right-click anywhere in the chapter body and choose Inspect; the steps are shown in the screenshot above and the result in the one below:
The body text lives in <div id="content">***</div>; note that it contains blank lines.
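Extraction from such a div can be sketched in isolation; the HTML string below is a made-up miniature of the real chapter page, not its actual markup:

```python
from bs4 import BeautifulSoup

# Miniature stand-in for a chapter page: two paragraphs separated by <br/> tags
html = '<div id="content">  First line.<br/><br/>  Second line.</div>'
soup = BeautifulSoup(html, 'html.parser')
con = soup.find('div', id='content')
# get_text('\n', strip=True): join the text fragments with newlines,
# trimming whitespace around each fragment
print(con.get_text('\n', strip=True))
# First line.
# Second line.
```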
5. A first version of the code, which prints each chapter's URL and title:
import requests
from bs4 import BeautifulSoup
import os
import time
import random
import re
all_url = 'https://www.booktxt.net/1_1902/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
start_html = requests.get(url=all_url, headers=headers)
# print(start_html.status_code) #200
# print(type(start_html)) #<class 'requests.models.Response'>
# print(start_html.encoding) #ISO-8859-1
# print(start_html.apparent_encoding) #GB2312
start_html.encoding = "GB2312"
# Save directory
path = 'F:/QingYuNian'
# Create the save directory if it does not exist
if os.path.exists(path):
    print('Directory already exists')
else:
    os.makedirs(path)
# Change the working directory to the save directory
os.chdir(path)
# Parse the index page with BeautifulSoup
soup = BeautifulSoup(start_html.text, 'html.parser')
# linkList1 = []
URL_all = soup.find_all('a', attrs={'href': re.compile(r'\d+\.html')})
for urls in URL_all:
    print(urls)
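A side note on the commented-out encoding lines above: when a server sends no charset header, requests falls back to ISO-8859-1, which garbles GB2312 pages; that is why the script sets encoding explicitly. The effect can be reproduced offline (using the simplified form 庆余年, since GB2312 covers simplified characters):

```python
raw = '庆余年'.encode('gb2312')  # the bytes as the server would send them
print(raw.decode('iso-8859-1'))  # mojibake: what requests would show by default
print(raw.decode('gb2312'))      # 庆余年
```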
Result:
As the output shows, the href of every real chapter link is a six-digit number followed by .html.
(An alternative approach: the chapter numbers in the URLs increase by 1 from one chapter to the next, so one could find the last chapter's number and generate every URL in sequence starting from the first.)
Therefore
URL_all = soup.find_all('a', attrs={'href': re.compile(r'\d+\.html')})
can be replaced with:
URL_all = soup.find_all('a', attrs={'href': re.compile(r'^\d{6}\.html')})
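The difference between the two patterns can be sanity-checked offline. BeautifulSoup applies a compiled pattern to each attribute value with re.search, and the sample hrefs below are illustrative stand-ins for what the index page contains:

```python
import re

# Illustrative hrefs: a chapter link, the index path, and a site-menu link
hrefs = ['610566.html', '/1_1902/', '/xuanhuan/1.html']

loose = re.compile(r'\d+\.html')      # any href containing digits + ".html"
strict = re.compile(r'^\d{6}\.html')  # only hrefs starting with six digits

print([h for h in hrefs if loose.search(h)])   # ['610566.html', '/xuanhuan/1.html']
print([h for h in hrefs if strict.search(h)])  # ['610566.html']
```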
6. Final code:
# Download the novel 《慶餘年》
import requests
from bs4 import BeautifulSoup
import os
import time
import random
import re
all_url = 'https://www.booktxt.net/1_1902/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
start_html = requests.get(url=all_url, headers=headers)
# print(start_html.status_code) #200
# print(type(start_html)) #<class 'requests.models.Response'>
# print(start_html.encoding) #ISO-8859-1
# print(start_html.apparent_encoding) #GB2312
start_html.encoding = "GB2312"
# Save directory
path = 'F:/QingYuNian'
# Create the save directory if it does not exist
if os.path.exists(path):
    print('Directory already exists')
else:
    os.makedirs(path)
# Change the working directory to the save directory
os.chdir(path)
# Parse the index page with BeautifulSoup
soup = BeautifulSoup(start_html.text, 'html.parser')
linkList1 = []
URL_all = soup.find_all('a', attrs={'href': re.compile(r'^\d{6}\.html')})
for urls in URL_all:
    # print(urls)
    url = all_url + urls['href']
    title = urls.string
    # print(url)
    # print(title)
    # with open('QYN_Url_title.txt', 'a+', encoding='utf-8') as f:
    #     f.write(url + '\t')
    #     f.write(title + '\n')
    linkList1.append([url, title])
n = 0
for item in linkList1:
    url_1 = item[0]
    title_1 = item[1]
    one_html = requests.get(url=url_1, headers=headers)
    one_html.encoding = "GB2312"
    soup = BeautifulSoup(one_html.text, 'html.parser')
    con = soup.find('div', id='content')
    # join the text fragments with newlines and strip surrounding whitespace
    cont = con.get_text('\n', strip=True)
    with open('QingYuNian_Contents.txt', 'a+', encoding='utf-8') as f:
        f.write(title_1 + '\n')
        f.write(url_1 + '\n')
        f.write(cont + '\n')
        f.write("==================================================\n")
    n = n + 1
    print('Chapter {} downloaded'.format(n))
    # random pause between requests to avoid hammering the server
    time.sleep(random.random() * 10)
print("===== All chapters downloaded =====")
Run result:
Only a small portion of the chapters was downloaded before the run stopped with the following error:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 444, in wrap_socket
cnx.do_handshake()
File "C:\ProgramData\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1907, in do_handshake
self._raise_ssl_error(self._ssl, result)
File "C:\ProgramData\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1631, in _raise_ssl_error
raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (10060, 'WSAETIMEDOUT')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 343, in _make_request
self._validate_conn(conn)
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 849, in _validate_conn
conn.connect()
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connection.py", line 356, in connect
ssl_context=context)
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\util\ssl_.py", line 359, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 450, in wrap_socket
raise ssl.SSLError('bad handshake: %r' % e)
ssl.SSLError: ("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')",)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\adapters.py", line 445, in send
timeout=timeout
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\ProgramData\Anaconda3\lib\site-packages\urllib3\util\retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.booktxt.net', port=443): Max retries exceeded with url: /1_1902/610566.html (Caused by SSLError(SSLError("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')")))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "F:/MeiZITu_PaChong_all/QingYuNian20200103New.py", line 56, in <module>
one_html = requests.get(url=url_1, headers=headers)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 512, in request
resp = self.send(prep, **send_kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\sessions.py", line 622, in send
r = adapter.send(request, **kwargs)
File "C:\ProgramData\Anaconda3\lib\site-packages\requests\adapters.py", line 511, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.booktxt.net', port=443): Max retries exceeded with url: /1_1902/610566.html (Caused by SSLError(SSLError("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')")))
Process finished with exit code 1
Possible ways forward:
Option 1: simulate a logged-in browser session for each chapter URL, then download the content.
Option 2: use proxy IPs.
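Before reaching for either option, it is worth noting that the WSAETIMEDOUT handshake failure is often transient, so simply retrying each request a few times may be enough. A minimal retry sketch follows; the flaky_get function is a stand-in used only to exercise the logic, and in the real script one would pass something like lambda u: requests.get(u, headers=headers, timeout=10) as get_fn:

```python
import time
import random

def fetch_with_retries(get_fn, url, max_tries=3, base_delay=1.0):
    """Call get_fn(url), retrying on any exception up to max_tries times."""
    for attempt in range(1, max_tries + 1):
        try:
            return get_fn(url)
        except Exception:
            if attempt == max_tries:
                raise
            # back off a little longer after each failed attempt
            time.sleep(base_delay * attempt + random.random() * base_delay)

# Demo with a fake fetcher that fails twice, then succeeds
calls = {'n': 0}
def flaky_get(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise OSError("bad handshake: WSAETIMEDOUT")
    return 'OK ' + url

print(fetch_with_retries(flaky_get, '610566.html', base_delay=0.0))  # OK 610566.html
```

For the proxy route, requests accepts a proxies mapping, e.g. requests.get(url, headers=headers, proxies={'https': 'http://127.0.0.1:8888'}); the address shown is a placeholder.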