Scraping Articles from the CloudBlog Website with a Python Crawler

Original

2020-06-07 18:49

---------------------------------------------------------------------------------------------
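[Copyright notice: this article is the author's original work; please credit the source when reposting]
Source: http://blog.csdn.net/sdksdk0/article/details/76208980
Author: Zhu Pei (朱培), ID: sdksdk0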
--------------------------------------------------------------------------------------------

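This article uses a Python crawler to download the articles of a website, including basic information such as title, publication time, author, and article content, and stores the data in a database, covering the workflow from start to finish. The crawler first collects every article link on the home page into a URL set, then visits the collected links one by one and parses the full article out of each page.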

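I recently needed to collect financial news articles. This article uses the blog posts on CloudBlog as the example. As shown in the figure below, the first step is to scrape this page, mainly the article list, and extract the URLs from that list. To avoid visiting the same page twice, a set of already visited URLs is kept and used to filter newly added URLs. The DOM tree of each page is then parsed to obtain the article information, which is stored in an Article object.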
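The URL set and the visit history described above can be sketched as follows. This is a minimal sketch rather than the article's own code: the helper name `enqueue`, the variable names, and the two example paths are hypothetical, and it resolves relative links with the standard-library `urljoin` instead of string concatenation.

```python
from urllib.parse import urljoin

BASE = 'https://www.tianfang1314.cn/index.html'  # start page used in the article


def enqueue(hrefs, base, pending, visited):
    """Resolve each href against base and queue it unless it was already visited."""
    for href in hrefs:
        absolute = urljoin(base, href)
        if absolute not in visited:
            pending.add(absolute)


pending = set()  # URLs waiting to be fetched
visited = set()  # history of URLs that were already fetched

# Hypothetical links as they might appear on the home page.
found = ['/blog/articles/a/1.html', '/blog/articles/a/2.html']
enqueue(found, BASE, pending, visited)

# Simulate visiting one page, then finding the same links again on that page.
url = pending.pop()
visited.add(url)
enqueue(found, BASE, pending, visited)
```

Because both collections are sets, a link that appears several times on a page is queued only once, and the history check keeps a visited page from being queued again.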

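Once the URLs have been collected, the crawler visits them in a loop and extracts the title, author, publication time, and content from each detail page, as in the figure below. The data in the Article object is then saved to a MySQL database. Each time a record is stored successfully, a counter is incremented and the article title is printed; otherwise the error message is printed. The program ends when every URL in the set has been read or the number of stored records reaches the configured limit.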
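The store-and-count step can be sketched as follows. This is a minimal sketch under stated assumptions, not the article's code: it uses the standard-library `sqlite3` module with an in-memory database as a stand-in for MySQL so that it runs without a server, the table is reduced to two columns, and the helper name `save_article` and the sample records are hypothetical. Note that `sqlite3` uses `?` placeholders, whereas `pymysql` uses `%s`.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL `news` table (assumption for a runnable demo).
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE news ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " url TEXT NOT NULL UNIQUE,"
    " title TEXT NOT NULL)"
)


def save_article(conn, url, title):
    """Insert one record; return True on success, False if the URL already exists."""
    try:
        # Placeholders let the driver escape quotes in the scraped text.
        conn.execute("INSERT INTO news (url, title) VALUES (?, ?)", (url, title))
        conn.commit()
        return True
    except sqlite3.IntegrityError as e:
        print(e)
        return False


count = 0
records = [
    ('/blog/articles/a/1.html', "A title with 'quotes'"),
    ('/blog/articles/a/1.html', 'Same URL again'),
    ('/blog/articles/a/2.html', 'Second article'),
]
for url, title in records:
    if save_article(conn, url, title):
        print(title)
        count += 1
```

The unique key on `url` rejects the second record, so only two rows are stored and the counter ends at 2.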

The implementation is as follows:

1. Database structure

SET FOREIGN_KEY_CHECKS=0;

-- ----------------------------
-- Table structure for news
-- ----------------------------
DROP TABLE IF EXISTS `news`;
CREATE TABLE `news` (
  `id` int(6) unsigned NOT NULL AUTO_INCREMENT,
  `url` varchar(255) NOT NULL,
  `title` varchar(45) NOT NULL,
  `author` varchar(12) DEFAULT NULL,
  `date` varchar(25) DEFAULT NULL,
  `content` longtext,
  `zq_date` varchar(25) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `url_UNIQUE` (`url`)
) ENGINE=InnoDB AUTO_INCREMENT=122 DEFAULT CHARSET=utf8;

2. Python code

import re  # regular expressions
import bs4  # DOM parsing
import pymysql  # database connection
import urllib.request  # HTTP access
import time  # time utilities

# Configuration
maxcount = 100  # maximum number of records to collect
home = 'https://www.tianfang1314.cn/index.html'  # start page
# Database connection parameters
db_config = {
    'host': 'localhost',
    'port': '3306',
    'username': 'root',
    'password': '123456',
    'database': 'news',
    'charset': 'utf8'
}

url_set = set()  # URLs waiting to be visited
url_old = set()  # URLs that were already visited

# Fetch the home page
request = urllib.request.Request(home)
# Response from the server
response = urllib.request.urlopen(request)
html = response.read()
# Decode the response body
html = html.decode('utf-8')

soup = bs4.BeautifulSoup(html, 'html.parser')
pattern = r'/blog/articles/\w+/\w+\.html'  # raw string; the dot before "html" is escaped
links = soup.find_all('a', href=re.compile(pattern))
for link in links:
    url_set.add(link['href'])

# Article class definition
class Article(object):
    def __init__(self):
        self.url = None  # URL
        self.title = None  # title
        self.author = None  # author
        self.date = None  # publication time
        self.content = None  # article content
        self.zq_date = None  # time the article was collected

# Connect to the database
connect = pymysql.connect(
    host=db_config['host'],
    port=int(db_config['port']),
    user=db_config['username'],
    password=db_config['password'],
    database=db_config['database'],
    charset=db_config['charset']
)
cursor = connect.cursor()

# Process the queued URLs; stop when the set is empty or maxcount records are stored
count = 0
while url_set and count < maxcount:
    try:
        # Take the next link; links on this site are relative, so prepend the site address
        url = 'https://www.tianfang1314.cn' + url_set.pop()
        url_old.add(url)

        # Fetch the detail page (the original listing re-fetched the home page here)
        drequest = urllib.request.Request(url)
        dresponse = urllib.request.urlopen(drequest)
        # Decode the response body
        dhtml = dresponse.read().decode('utf-8')

        # Parse the DOM
        dsoup = bs4.BeautifulSoup(dhtml, 'html.parser')
        pattern = r'^/blog/articles/\w+/\w+\.html'  # link matching rule
        links = dsoup.find_all('a', href=re.compile(pattern))

        # Queue links that have not been visited yet
        for link in links:
            if 'https://www.tianfang1314.cn' + link['href'] not in url_old:
                url_set.add(link['href'])

        # Skip records that are already stored
        sql = "SELECT id FROM news WHERE url = %s"
        cursor.execute(sql, (url,))
        if cursor.rowcount != 0:
            raise Exception('Duplicate record: ' + url)

        # Extract the article information
        article = Article()
        article.url = url
        page = dsoup.find('div', {'class': 'data_list'})
        article.title = page.find('div', {'class': 'blog_title'}).get_text()
        # Article info, e.g. 發佈時間：『 2016-12-14 11:26 』用戶名：sdksdk0 閱讀(938) 評論(3)
        infoStr = page.find('div', {'class': 'blog_info'}).get_text()

        infoStr = infoStr.rsplit('『', 1)
        infoStr = infoStr[1].rsplit('』', 1)
        article.date = infoStr[0].strip()  # publication time
        # The label below must match the text that appears on the page
        article.author = infoStr[1].rsplit('\xa0\xa0', 1)[0].rsplit('用戶名：', 1)[1]  # user name
        article.content = page.find('div', {'class': 'blog_content'}).get_text()  # article body
        article.zq_date = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())  # collection time

        # Store the data; query parameters let the driver escape quotes in the text
        sql = "INSERT INTO news (url, title, author, date, content, zq_date) "
        sql = sql + "VALUES (%s, %s, %s, %s, %s, %s)"
        data = (article.url, article.title, article.author, article.date, article.content, article.zq_date)
        cursor.execute(sql, data)
        connect.commit()

    except Exception as e:
        print(e)
    else:
        print(article.title)
        count += 1

# Close the database connection
cursor.close()
connect.close()

3. Results

The collected data can be viewed in the database: select * from news;

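Summary: I ran into a few pitfalls while writing this crawler. First, CloudBlog's pages are not well-formed, so when BeautifulSoup reads them some nodes yield a lot of duplicated data. Second, the links on this site follow the pattern /blog/articles/\w+/\w+.html rather than starting with https://, which is why the code above prepends the site address. To extract the publication time and the user name I used rsplit, and as you can see above this took quite a few split operations; you can of course use a regular expression to match these fields instead.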
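The regular-expression alternative mentioned above could look like the following. This is a minimal sketch, assuming the info line has the shape of the example in the code comment; the function name `parse_info` is hypothetical, and the separators between the fields are assumed to be non-breaking spaces, as the `'\xa0\xa0'` split in the code suggests.

```python
import re

# Date between the 『 』 brackets, then the user-name label and the name itself.
# In Python 3, \s also matches the non-breaking space \xa0, so \S+ stops at it.
INFO_PATTERN = re.compile(r'『\s*(.*?)\s*』\s*用戶名：\s*(\S+)')


def parse_info(info):
    """Return (date, author) parsed from the info line, or None if it does not match."""
    match = INFO_PATTERN.search(info)
    if match is None:
        return None
    return match.group(1), match.group(2)


sample = '發佈時間：『 2016-12-14 11:26 』用戶名：sdksdk0\xa0\xa0閱讀(938)\xa0\xa0評論(3)'
result = parse_info(sample)
print(result)
```

Returning None for a line that does not match lets the caller skip a malformed page instead of failing on an index error, which is what the chained rsplit calls would do.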
