[Simple Crawler] Recording Blog Traffic - day day up

Original

PJZero

2020-02-23 22:56

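I built a small tool that records the daily traffic of my CSDN blog. If the program hits an exception while running, it sends an email to my mailbox so that I can deal with the problem. Each day's traffic is recorded in a CSV file, which is easy to load and plot with pandas.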
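As a rough illustration of the pandas step, the sketch below loads the CSV and derives a day-over-day change. It assumes the file holds headerless `date,count` rows, which is what the script in this post writes, and that the recorded number is a cumulative visit count. The names `load_flux`, `plot_flux`, `visits` and `daily_change` are made up for this example.

```python
import pandas as pd


def load_flux(file_name="flux.csv"):
    """Load headerless date,count rows into a DataFrame indexed by date."""
    df = pd.read_csv(file_name, header=None, names=["date", "visits"],
                     parse_dates=["date"])
    df = df.set_index("date").sort_index()
    # Assumption: the stored value is a running total, so the difference
    # between consecutive rows is the number of visits gained that day.
    df["daily_change"] = df["visits"].diff()
    return df


def plot_flux(df):
    """Plot both columns as separate panels (requires matplotlib)."""
    return df.plot(subplots=True)
```

Calling `plot_flux(load_flux())` in an interactive session should draw one panel for the total and one for the daily change. The first row has no previous day, so its `daily_change` is NaN.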
The tools used are:

requests
bs4 (beautifulsoup4)
csv (built-in)
smtplib (built-in)

See the code comments for more detail.

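import requests   # send HTTP requests
import bs4        # parse the page content
import csv        # read and write CSV files
from datetime import datetime
from email.mime.text import MIMEText  # build a UTF-8 mail message
import smtplib    # send email
import time       # sleep

# CSDN blog URL; the one below is mine
__url = "http://blog.csdn.net/pengjian444?skin=skin-yellow"


def send_mail(content, to_address='[email protected]'):
    try:
        # content may be an exception object, so convert it before concatenating
        content = "[Blog traffic recorder failure]: " + str(content)
        # mailbox account (it must belong to the SMTP server used below)
        username = "××××××"
        # mailbox authorization code
        password = "××××××"
        # wrap the text in a MIME message so non-ASCII characters are encoded
        msg = MIMEText(content, 'plain', 'utf-8')
        msg['Subject'] = "Blog traffic recorder failure"
        msg['From'] = username
        msg['To'] = to_address
        # this uses NetEase's SMTP server over SSL; the default SSL port is 465
        with smtplib.SMTP_SSL("smtp.163.com", port=465) as smtp:
            smtp.login(username, password)
            smtp.sendmail(username,
                          to_address,
                          msg.as_string())
    except Exception as base_e:
        log_line = "{}:{}".format(datetime.now(), str(base_e))
        print(log_line)
        with open('log.txt', 'a') as f:
            f.write(log_line + "\n")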

def get_page_content(url):
    """
    Fetch the page content
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                      'AppleWebKit/537.36 ('
                      'KHTML, like Gecko) Chrome/58.0.3029.81 '
                      'Safari/537.36',
        'Host'      : 'blog.csdn.net'
    }
    # the timeout keeps a stalled connection from hanging the daily loop
    r = requests.get(url, headers=headers, timeout=10)
    if r.status_code != 200:
        return ""
    else:
        return r.text

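def get_flux(page_content):
    """
    Extract the visit count from the page content
    """
    if not page_content:
        # get_page_content returns "" on a non-200 response; raise here so
        # the main loop reports the failure by mail with a clear message
        raise ValueError("empty page content")

    soup = bs4.BeautifulSoup(page_content, "html.parser")
    # the first matching span holds the count followed by one unit character
    return int(soup.select("#blog_rank li span")[0].string[:-1])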

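def write_csv(flux_data, file_name='flux.csv'):
    """
    Append one row (date, traffic) to the CSV file
    :param flux_data: traffic value to record
    :param file_name: path of the CSV file
    :return: None
    """
    row = [str(datetime.now().date()), str(flux_data)]
    # newline='' lets the csv module control line endings (avoids blank rows on Windows)
    with open(file_name, 'a', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(row)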

if __name__ == '__main__':
    while True:
        try:
            content = get_page_content(url=__url)
            flux = get_flux(content)
            print(flux)
            write_csv(flux_data=flux)
        except Exception as e:
            # pass the message as text; send_mail prepends a string prefix to it
            send_mail(str(e))

        time.sleep(24 * 60 * 60)  # record once a day
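The selector logic in `get_flux` can be tried offline before pointing the script at the live page. The sketch below is only an illustration. The HTML fragment is invented to match the `#blog_rank li span` selector and is not the real CSDN markup, which may differ or have changed. The name `parse_flux` is made up for this example.

```python
import bs4

# Hypothetical fragment shaped to fit the selector used in get_flux;
# it is not copied from the real page.
SAMPLE_HTML = """
<ul id="blog_rank">
  <li>访问：<span>12345次</span></li>
  <li>积分：<span>678</span></li>
</ul>
"""


def parse_flux(page_content):
    soup = bs4.BeautifulSoup(page_content, "html.parser")
    spans = soup.select("#blog_rank li span")
    if not spans:
        raise ValueError("visit counter not found; page layout may have changed")
    text = spans[0].get_text(strip=True)
    # Drop the single trailing unit character before converting to int.
    return int(text[:-1])


print(parse_flux(SAMPLE_HTML))  # prints 12345
```

Checking for an empty match first turns a layout change into a readable error message instead of a bare `IndexError`.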

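This pulls together a few small techniques:
+ reading and writing CSV files with the csv module
+ basic use of the requests library (sending a GET request, setting headers)
+ parsing page content with bs4
+ sending email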
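On the last point, passing raw text containing non-ASCII characters straight to `smtplib` tends to fail. One possible approach, sketched here with the standard library's `EmailMessage`, is to build a proper message first. The helper name `build_alert` and the example.com addresses are placeholders for illustration, not part of the original script.

```python
from email.message import EmailMessage


def build_alert(sender, to_address, error_text):
    """Build an alert mail; sending it is left to smtplib."""
    msg = EmailMessage()
    msg["Subject"] = "Blog traffic recorder failure"
    msg["From"] = sender
    msg["To"] = to_address
    # set_content encodes the text body as UTF-8 by default,
    # so Chinese error messages survive the trip.
    msg.set_content("[Blog traffic recorder failure]: " + str(error_text))
    return msg


# Placeholder addresses and a made-up error, for demonstration only.
msg = build_alert("sender@example.com", "me@example.com",
                  ValueError("页面解析失败"))
print(msg["Subject"])
```

With a logged-in `smtplib.SMTP_SSL` connection named `smtp`, calling `smtp.send_message(msg)` should send the message.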

Small as it is, the tool has all the essential parts. I hope it is helpful to everyone.
