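Visualizing CSDN Visitor Counts
The CSDN web page does not show a concrete visitor count, and the visitor count in the mobile app does not match the view counts of my public articles (I could not work out what else it includes). So I sum the view counts of all my articles and use that total as the visitor data.
Approach:
1. Every day at 12:00 noon, fetch the total view count with requests + pyquery and save it
2. Visualize the data with matplotlib and analyze the visitor traffic
3. Test the code
More features are still to be developed.
GitHub: https://github.com/99Kies/Visitor_Monitor
The data collection program can be roughly divided into the following three modules.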
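Crawler module
1. Analyze the page (https://blog.csdn.net/qq_19381989)
Each post sits in its own div, a layout that is easy to extract with pyquery. Note that there is an "extra" div both above and below the posts, so take care when locating or saving items. Finally, copy the CSS selector: #mainBox > main > div.article-list > div
2. Analyze how pagination works
Turn a page and look at the URL: the page number is part of the URL, so requests.get can fetch each page directly.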
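The pagination pattern can be sketched as below. This is only a minimal sketch: `build_list_urls` is a hypothetical helper name, and the URL pattern is the one this blog's list pages use.

```python
def build_list_urls(username, pages):
    """Return the article-list URL for each page from 1 to `pages` (inclusive)."""
    base = 'https://blog.csdn.net/{}/article/list/{}'
    return [base.format(username, i) for i in range(1, pages + 1)]


if __name__ == '__main__':
    # Print the three list pages of the blog analyzed in this post.
    for url in build_list_urls('qq_19381989', 3):
        print(url)
```

Each of these URLs can then be passed to requests.get in turn.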
3. Write the code
import time

import requests
from pyquery import PyQuery as pq


def get_read_number(page):
    '''
    Sum the view counts of every article on the blog's list pages.
    :param page: number of list pages on your own CSDN blog
    :return: (total view count, date as YYYY/MM/DD)
    '''
    all_read = 0  # running total of views
    count = 0     # number of articles located
    for i in range(1, page + 1):
        url = 'https://blog.csdn.net/qq_19381989/article/list/{}'.format(i)
        # print(url)
        r = requests.get(url)
        doc = pq(r.text)
        items = doc('#mainBox > main > div.article-list > div').items()
        for item in items:
            project = {
                'title': item.find('h4 > a').text(),
                'read': item.find('div.info-box.d-flex.align-content-center > p:nth-child(3) > span > span').text(),
                'talk': item.find('div.info-box.d-flex.align-content-center > p:nth-child(5) > span > span').text(),
            }
            # The first and last divs are empty placeholders; skip them.
            # To count only original posts, use this condition instead:
            # if project['read'] == '' or '原' not in project['title']:
            if project['read'] == '':
                continue
            all_read += int(project['read'])
            count += 1
    return all_read, time.strftime("%Y/%m/%d", time.localtime(time.time()))
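The skip-and-sum step can be tried offline without hitting the site. The following is a minimal sketch under stated assumptions: `sum_reads` is a hypothetical helper, and the sample records are made-up stand-ins for what the selectors return, including the empty placeholder divs at the top and bottom of the list.

```python
def sum_reads(records):
    """Sum the 'read' field over records, skipping entries with no count.

    Returns (total_reads, number_of_articles_counted).
    """
    total = 0
    counted = 0
    for record in records:
        text = record.get('read', '').strip()
        if not text.isdigit():
            # Placeholder divs yield an empty string; skip them.
            continue
        total += int(text)
        counted += 1
    return total, counted


if __name__ == '__main__':
    sample = [
        {'title': '', 'read': ''},           # leading placeholder div
        {'title': 'Post A', 'read': '120'},
        {'title': 'Post B', 'read': '35'},
        {'title': '', 'read': ''},           # trailing placeholder div
    ]
    print(sum_reads(sample))  # (155, 2)
```

Checking with isdigit also guards against int() raising if the page ever shows a non-numeric count.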
Use matplotlib to watch the count in real time (change the time format to %H:%M:%S to test it).
import time

import matplotlib.pyplot as plt
import numpy as np


def plot_show_msg(msg):
    xtime = []
    yread = []
    for xy in msg:
        xtime.append(xy[1])
        yread.append(xy[0])
    print(msg)
    ax = np.array(xtime)
    ay = np.array(yread)
    plt.close()
    plt.plot(ax, ay)
    plt.xticks(rotation=70)
    plt.margins(0.08)
    plt.subplots_adjust(bottom=0.15)
    plt.xlabel("Date")
    plt.ylabel("Visitors")
    plt.title("Visitor Data Visualization")
    plt.show(block=False)  # do not block, so the loop can keep refreshing
    plt.pause(1)
    plt.close()
    print('----------')


if __name__ == '__main__':
    # get_read_number is the crawler function defined above
    msg = []
    while True:
        time.sleep(1)
        all_read, now_time = get_read_number(3)
        msg.append((all_read, now_time))
        plot_show_msg(msg)
Storage module
Save the crawled data to the corresponding file.
import csv
import os


def write_to_csvfile(all_read, date):
    msg_path = 'folder_name'  # placeholder: output folder
    filename = msg_path + os.path.sep + 'file_name'  # placeholder: output file
    if not os.path.exists(msg_path):
        # First run: create the folder and write the header row
        os.mkdir(msg_path)
        with open(filename, 'a', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile, dialect='unix')
            writer.writerow(['read', 'date'])
    try:
        with open(filename, 'a', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile, dialect='unix')
            writer.writerow((all_read, date))
    except Exception as e:
        print(e)
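Update module
Updating in an endless loop wastes resources, so, much like Bilibili, I update once a day: at 12:00 noon the data is refreshed and stored in a local file, which the web application can then read to render nice charts.
There are two ways to store the data:
1. Run a bat file on a schedule on the local machine; // stored in ./Read_msg/read_msg.csv
2. Run the program on a server using the code above. On each run, check whether today's date already exists in the saved file: if not, update; if it does, skip the update. If the saved file does not exist at all, update as well; // stored in ./Test_msg/test_msg.csv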
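For a long-running process, the "once a day at noon" rule could also be handled in Python itself rather than by an external scheduler. This is not the approach used in this post (which relies on a bat file or a per-run date check); it is only a sketch, and `seconds_until_next_noon` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def seconds_until_next_noon(now):
    """Return the number of seconds from `now` until the next 12:00:00."""
    noon = now.replace(hour=12, minute=0, second=0, microsecond=0)
    if now >= noon:
        # Today's noon has already passed, so aim for tomorrow's.
        noon += timedelta(days=1)
    return (noon - now).total_seconds()


if __name__ == '__main__':
    print(seconds_until_next_noon(datetime.now()))
```

A worker could call time.sleep(seconds_until_next_noon(datetime.now())) before each crawl, so that it wakes up once per day.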
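Method one (I use it for local testing)
After finishing the code I found that someone had already written about visualizing CSDN visitor data a few years ago: https://blog.csdn.net/s740556472/article/details/78239204
Back then CSDN still displayed the view count; now it does not. That author describes the bat approach in great detail. I learned a lot from it, thanks!
Write the day_to_day.bat file (csdn_read_save.py combines the crawler module and the storage module).
With this method the data is stored in ./Read_msg/read_msg.csv
Press Win+R and run compmgmt.msc to open Computer Management
Go to the Task Scheduler Library and create a task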
Combined code for csdn_read_save.py:
import csv
import os
import time

import requests
from pyquery import PyQuery as pq


def get_read_number(page):
    all_read = 0
    count = 0
    for i in range(1, page + 1):
        url = 'https://blog.csdn.net/qq_19381989/article/list/{}'.format(i)
        # print(url)
        r = requests.get(url)
        doc = pq(r.text)
        items = doc('#mainBox > main > div.article-list > div').items()
        for item in items:
            project = {
                'title': item.find('h4 > a').text(),
                'read': item.find('div.info-box.d-flex.align-content-center > p:nth-child(3) > span > span').text(),
                'talk': item.find('div.info-box.d-flex.align-content-center > p:nth-child(5) > span > span').text(),
            }
            # Skip the empty placeholder divs at the top and bottom of the list
            # if project['read'] == '' or '原' not in project['title']:
            if project['read'] == '':
                continue
            all_read += int(project['read'])
            count += 1
    return str(all_read), time.strftime("%Y/%m/%d", time.localtime(time.time()))


def write_to_file(all_read, date):
    msg_path = 'Read_msg'
    filename = msg_path + os.path.sep + 'read_msg.csv'
    if not os.path.exists(msg_path):
        # First run: create the folder and write the header row
        os.mkdir(msg_path)
        with open(filename, 'a', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile, dialect='unix')
            writer.writerow(['read', 'date'])
    try:
        with open(filename, 'a', encoding='utf-8') as csvfile:
            writer = csv.writer(csvfile, dialect='unix')
            writer.writerow((all_read, date))
    except Exception as e:
        print(e)


if __name__ == '__main__':
    all_read, date = get_read_number(3)
    print(all_read, date)
    write_to_file(all_read, date)
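Method two (I use it for the web side)
The code is roughly as follows. Before each run it checks whether the current date has already been recorded. With this method the data is saved in ./Test_msg/test_msg.csv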
import csv
import os
import time


def is_yesterday_yn():
    '''
    Before each save, open the visitor data file and check whether today's
    date has already been recorded; if it has not, a crawl is needed.
    A crawl is also needed when the data file does not exist yet.
    :return: True if a crawl is needed, False otherwise
    '''
    msg_path = 'Test_msg'
    today = time.strftime("%Y/%m/%d", time.localtime(time.time()))
    filename = msg_path + os.path.sep + 'test_msg.csv'
    if not os.path.exists(filename):
        return True
    with open(filename, 'r', encoding='utf-8') as csvfile:
        reader = csvfile.read()
    print(reader)
    if today in reader:
        print('is Today')
        return False
    else:
        print('isn\'t today, you need update!')
        return True


def update_msg():
    # get_read_number and write_to_csvfile come from the modules above;
    # write_to_csvfile should point at Test_msg/test_msg.csv here
    if is_yesterday_yn():
        all_read, date = get_read_number(3)
        write_to_csvfile(all_read, date)
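This prevents repeated runs from storing duplicate rows for the same date, keeping it as close as possible to one update per day. On the web side the idea is roughly this: every page request checks whether the data needs updating, and after that the check is repeated every 15 minutes.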
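One way a web handler might avoid re-running the check on every single request is to remember when it last ran. The sketch below is an assumption about how that could look, not code from this project: `ThrottledCheck` is a hypothetical class, and the 900-second default mirrors the 15-minute interval mentioned above.

```python
import time


class ThrottledCheck:
    """Run a check function at most once per `interval` seconds."""

    def __init__(self, check, interval=900, clock=time.monotonic):
        self.check = check          # function to call, e.g. an update routine
        self.interval = interval    # minimum seconds between two calls
        self.clock = clock          # injectable clock, handy for testing
        self.last_run = None

    def maybe_run(self):
        """Call the check if the interval has elapsed; return True if it ran."""
        now = self.clock()
        if self.last_run is not None and now - self.last_run < self.interval:
            return False
        self.last_run = now
        self.check()
        return True


if __name__ == '__main__':
    checker = ThrottledCheck(lambda: print('checking for update'))
    print(checker.maybe_run())  # runs the check on the first call
    print(checker.maybe_run())  # skipped: less than 15 minutes have passed
```

A request handler would call maybe_run() with the update function as the check, so the date check itself runs at most once per interval.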
Below is a rough matplotlib implementation that plots the saved data.
import csv
import time

import matplotlib.pyplot as plt
import numpy as np


def get_msg(filename):
    xtime = []
    yread = []
    try:
        with open(filename, 'r', encoding='utf-8') as csvfile:
            reader = csv.reader(csvfile)
            for row in list(reader)[1:]:  # skip the header row
                xtime.append(row[1])
                yread.append(int(row[0]))  # counts are stored as text
    except (OSError, ValueError, IndexError):
        print('Read Error')
    # Always return two lists so the caller can unpack the result
    return xtime, yread


def plot_show_msg(xtime, yread):
    ax = np.array(xtime)
    ay = np.array(yread)
    # plt.ion()
    plt.close()
    plt.plot(ax, ay)
    plt.xticks(rotation=70)
    plt.margins(0.08)
    plt.subplots_adjust(bottom=0.15)
    plt.xlabel("Date")
    plt.ylabel("Visitors")
    # Figure title
    plt.title("Visitor Data Visualization")
    plt.show(block=False)  # do not block, so the loop can continue
    plt.pause(1)
    plt.close()


if __name__ == '__main__':
    while True:
        update_msg()
        xtime, yread = get_msg('./Test_msg/test_msg.csv')
        plot_show_msg(xtime, yread)
        time.sleep(900)
        # Check again every 15 minutes
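Since the stored value is a running total, the traffic per day is the difference between consecutive rows. A minimal sketch of that analysis step follows; `daily_increments` is a hypothetical helper and the sample numbers are made up for illustration.

```python
def daily_increments(dates, totals):
    """Turn cumulative totals into (date, new_views_since_previous_row) pairs."""
    increments = []
    for i in range(1, len(totals)):
        increments.append((dates[i], totals[i] - totals[i - 1]))
    return increments


if __name__ == '__main__':
    # Made-up sample data in the same shape that get_msg returns
    dates = ['2019/11/01', '2019/11/02', '2019/11/03']
    totals = [100, 130, 175]
    print(daily_increments(dates, totals))
    # [('2019/11/02', 30), ('2019/11/03', 45)]
```

The first row has no previous value to compare against, so the result has one entry fewer than the input.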
GitHub: https://github.com/99Kies/Visitor_Monitor
Stars are welcome!