Scraping Articles and Novels with Python

Original

2020-05-25 17:11

python

I. Install the requests library and bs4

pip install requests

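pip install bs4

The code below parses pages with the 'lxml' parser, so lxml must be installed as well:

pip install lxml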

II. Analysis steps

III. Practice (scraping articles)

1. Code:

import io
import os
import sys
import requests
from bs4 import BeautifulSoup

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')  # force UTF-8 console output

def urlBS(url):  # request a page and return it parsed
    resp = requests.get(url)
    html = resp.content.decode('gbk')  # decode the raw bytes as GBK
    soup = BeautifulSoup(html, 'lxml')  # parse the page
    # print(soup)
    return soup

firsturl = 'http://www.rensheng5.com/zx/onduzhe/'  # target URL

def main(url):
    soup = urlBS(url)  # fetch and parse the index page
    lis = soup.find('ul', class_="i1 ico1").find_all('li')  # list items linking to the articles

    # Output directory: a subfolder of the current working directory (os.getcwd());
    # the folder name means "scraped articles"
    path = os.getcwd() + u'/爬取的文章/'
    if not os.path.isdir(path):   # create the folder if it does not exist yet
        os.mkdir(path)
    for i in lis:
        newurl = i.find('a')['href']
        print(newurl)
        # Request each article
        result = urlBS(newurl)
        title = result.find('div', class_="artview").find('h1').get_text()  # title
        print(title)
        writer = result.find('div', class_="artinfo").get_text()   # author line
        print(writer)
        # Output file name: <title>.txt
        filename = path + title + '.txt'
        print(filename)

        # Write the file; the with block closes it automatically
        text = result.find('div', class_="artbody").find('p').get_text()
        with open(filename, 'w', encoding='utf-8') as new:
            new.write('<<' + title + '>>\n\n')  # title
            new.write(writer + '\n\n')  # author
            new.write(text)  # body

if __name__ == '__main__':
    firsturl = 'http://www.rensheng5.com/zx/onduzhe/'
    main(firsturl)
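
The script builds each file name directly from the scraped title (path + title + '.txt'). A title can contain characters that are not allowed in file names (for example ? or : on Windows), and open() would then fail. Below is a minimal sketch of one way to guard against that, using only the standard library. The helper name safe_filename, the '_' replacement character and the length limit are assumptions, not part of the original script.

```python
import re

# Characters that Windows forbids in file names, plus ASCII control characters.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')

def safe_filename(title, replacement="_", max_length=100):
    """Return a version of `title` that is safe to use as a file name.

    Hypothetical helper: not part of the original script.
    """
    cleaned = _ILLEGAL.sub(replacement, title).strip(" .")
    # Fall back to a fixed name if nothing usable is left.
    return cleaned[:max_length] or "untitled"
```

In the loop it could be used as filename = path + safe_filename(title) + '.txt'.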

2. Result:

3. Notes:

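IV. Merge into a single .txt file

1. In a command prompt window, change to the directory that contains the .txt files to merge.

2. After confirming the directory is correct, run type *.txt >>e:\111.txt. This command appends the contents of every .txt file in the current directory to e:\111.txt.

3. Open the merged e:\111.txt: the contents of all the .txt files now appear in it, in order.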
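
The type command is specific to the Windows command prompt. As a rough cross-platform alternative, the same merge can be sketched in Python. The function name merge_txt_files, the sort-by-name order and the UTF-8 default are assumptions; pass another encoding if the files were written with one.

```python
import glob
import os

def merge_txt_files(src_dir, out_path, encoding="utf-8"):
    """Append every .txt file in src_dir to out_path, sorted by file name.

    Hypothetical helper. Returns the list of merged file paths.
    """
    out_abs = os.path.abspath(out_path)
    sources = [
        p for p in sorted(glob.glob(os.path.join(src_dir, "*.txt")))
        if os.path.abspath(p) != out_abs  # never merge the output into itself
    ]
    # "a" mirrors the >> redirection: existing content in out_path is kept.
    with open(out_path, "a", encoding=encoding) as out:
        for p in sources:
            with open(p, "r", encoding=encoding) as f:
                out.write(f.read())
            out.write("\n")  # keep a line break between consecutive files
    return sources
```

Unlike the shell command, this skips the output file if it sits in the same directory, so rerunning it does not copy the merged file into itself.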

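V. Fixing garbled text in scraped pages

General solution:

response = requests.get("page URL")

data = bytes(response.text, response.encoding).decode("gbk", "ignore")

response.text is decoded with the encoding requests guessed from the HTTP headers. When the server declares no charset, requests typically falls back to ISO-8859-1, so re-encoding with response.encoding recovers the raw bytes, which are then decoded with the page's real encoding. Replace "gbk" with that encoding (for example "utf-8"). Note that "ignore" silently drops any bytes that cannot be decoded.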
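
When the page's encoding is not known in advance, one option is to try a few likely encodings strictly and fall back to a lossy decode only at the end. Below is a minimal sketch that works on raw bytes such as response.content. The function name decode_html and the candidate list are assumptions.

```python
def decode_html(raw, candidates=("utf-8", "gbk")):
    """Decode raw page bytes by trying each candidate encoding in order.

    Hypothetical helper. Returns (text, encoding_used).
    """
    for enc in candidates:
        try:
            # Strict decoding raises on invalid bytes instead of hiding them.
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: drop undecodable bytes, like the "ignore" argument does.
    return raw.decode(candidates[0], "ignore"), candidates[0]
```

UTF-8 is tried first because it rejects most non-UTF-8 input, whereas GBK often accepts UTF-8 bytes and returns wrong characters without raising an error.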

VI. Practice (scraping a novel)

1. Code:

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.biquw.com/book/19877/'

response = requests.get(base_url)
# Re-encode with the encoding requests guessed, then decode as UTF-8
response2 = bytes(response.text, response.encoding).decode("utf-8", "ignore")
# print(response2)

# Build the parser for the chapter index page
soup = BeautifulSoup(response2, 'lxml')

data_list = soup.find('ul')

for book in data_list.find_all('a'):
    book_url = base_url + book['href']
    print('{}:{}'.format(book.text, book_url))
    chapter = requests.get(book_url)
    chapter_soup = BeautifulSoup(chapter.text, 'lxml')  # parse the chapter page
    data = chapter_soup.find('div', {'id': 'htmlContent'})  # element holding the chapter text
    # Same fix as above, using the chapter response's own encoding
    data2 = bytes(data.text, chapter.encoding).decode("utf-8", "ignore")
    print(data2)

    # File output
    # Option 1: append every chapter to the same txt file
    with open('book2.txt', 'a', encoding='utf-8') as file:
        file.write(data2)

    # Option 2: write each chapter to its own txt file
    # with open(book.text + '.txt', 'a', encoding='utf-8') as f:
    #     f.write(data2)
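
The loop builds each chapter URL by concatenating the book URL with the link's href, which only works when every href is a bare relative file name. urllib.parse.urljoin from the standard library also handles root-relative and absolute hrefs. A small sketch follows; the href values are made-up examples, not taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.biquw.com/book/19877/"

# Hypothetical href values of the three kinds a chapter list might contain.
hrefs = [
    "1.html",                                   # relative to the book page
    "/book/19877/2.html",                       # relative to the site root
    "http://www.biquw.com/book/19877/3.html",   # already absolute
]

chapter_urls = [urljoin(BASE_URL, href) for href in hrefs]
for url in chapter_urls:
    print(url)
```

All three forms resolve to a URL under the book directory, so the loop body would not need to change if the site altered its link style.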

2. Result:

3. Notes
