Python Web Scraping: Automated Login and Download in Practice

Source: WeChat official account 『很酷的程序员』
ID: RealCoolEngineer

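When scraping web pages with Python, some content is only available after logging in. This article walks through an example that logs in automatically and then downloads data.

This article covers the following topics:

Analyzing a page's request types, the parameters they need, and the results they return
Building HTTP requests with requests
Implementing automatic login
Practical use of the path library pathlib and the command-line toolkit Click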

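I. Requirements Analysis

1 Background

I have been running with a Polar heart rate strap lately and have collected some heart rate data. I did not want to download each session by hand, so I needed a script that pulls all the data for a given period automatically.

2 Breaking Down the Problem

Meeting that requirement takes roughly these steps:

Work out how to log in automatically
Find the download link for a single exercise record
Find the download links for the training history within a given date range
Request the download links in bulk and save the files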

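II. Web Page Analysis

1 Login Analysis

First we need to see what happens during login. The browser's built-in inspection tools are enough for this. The following uses Chrome as an example.

Open the login page of the Polar Flow website (log out first if you are already logged in), right-click and choose Inspect (shortcut F12), and select Network from the tabs at the top of the panel that opens, as shown below:

Now log in with an account that does not exist and watch the network activity shown on the right.

A nonexistent account is used so that the site does not redirect automatically after a successful login, which would hide the details of the login request.

The login fails as expected, and the result looks like this:

The points to note are:

The request URL (https://flow.polar.com/login) and request method (POST) under General, which are the core of building the request;
The fields under Request Headers. Every HTTP request carries a set of headers, and this is where the cookie holding the login information lives. There are many fields, so they are not listed here; usually only some of them are needed when building a request, depending on how the site is designed;
The most important item under Form Data is, of course, the user's account information.

After a successful login, the cookie stores the login state (what exactly the cookie contains can be ignored for now).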

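2 Analyzing the Download Link

The previous section covered the login request's URL and the headers and data it needs.
Next, let's look at the link used to download data from an exercise record.

The figure below shows the exercise detail page. Clicking the export option (导出课程) sends a GET request, as the developer tools show:

Below is the response received after clicking export, that is, the content returned by the server:

From the HTML in this response, the relative address of the download link is:

/api/export/training/csv/557xxxxxx

So the full download address should be:

https://flow.polar.com/api/export/training/csv/557xxxxxx

For both the export page and the download link, the key piece is the training record's ID (557xxxxxx in this example). Once the list of training history IDs is known, the download links are easy to build.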
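
As a small illustration of that last point, the relative address can be joined to the site root with the standard library. This is only a sketch: the helper name build_download_url and the ID 12345 are made up for the example, and only the path pattern comes from the response described above.

```python
from urllib.parse import urljoin

BASE_URL = 'https://flow.polar.com'
EXPORT_PATH = '/api/export/training/csv/'


def build_download_url(training_id):
    """Join the site root, the export path, and one training ID (hypothetical helper)."""
    return urljoin(BASE_URL, f'{EXPORT_PATH}{training_id}')


# 12345 is a placeholder ID, not a real training record
print(build_download_url(12345))
```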

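3 Analyzing the Training History List

To build the download links, we need to work out how to get the list of training history IDs.
As shown below, click Training history, choose the start and end dates for the search, and analyze the request with Chrome's developer tools as before:

The figure above shows:

General: the request URL is https://flow.polar.com/api/training/history and the request method is POST;
Request Headers: contains the cookie and some other fields, which can be set as needed in the implementation;
Request Payload: the data carried by the request, which asks for the training history of a specific userId from fromDate to toDate;
Response: not expanded in the figure above. The response is JSON, which is convenient: once parsed into Python objects, the IDs of all training history records are easy to extract.

Request Payload and the Form Data used by the login URL serve much the same purpose and differ only in format, which is reflected in the content-type specified in Request Headers.

The Request Payload shows that fetching the training history also requires the user's ID. So how do we get the current user's ID? Try working it out yourself.

Opening the training history page triggers a request that fetches the user's information; analyze it in the same way.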
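
To make the format difference concrete, the sketch below encodes one dictionary as form data and another as a JSON payload using only the standard library. The field names are the ones seen in the developer tools; every value, including the date format, is a placeholder chosen for the example.

```python
import json
from urllib.parse import urlencode

# Placeholder values; only the field names come from the analysis above
form_fields = {'returnUrl': '/', 'email': 'user@example.com', 'password': 'secret'}
history_query = {'userId': 1, 'fromDate': '2021-01-20', 'toDate': '2021-01-30', 'sportIds': []}

# Form Data: key=value pairs, sent with content-type application/x-www-form-urlencoded
form_body = urlencode(form_fields)

# Request Payload: a JSON document, sent with content-type application/json
json_body = json.dumps(history_query)

print(form_body)
print(json_body)
```

With requests, passing a dictionary as data= produces the first form, while passing a JSON string as data= (or a dictionary as json=) produces the second.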

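III. Implementation

1 Implementing Automatic Login

OK, we now know which URL to request, by what method, and with which headers and data in order to log in, so we can build the request with the requests module.

More operations follow the login, so a one-off request is not enough. Creating a session with requests.session() solves this: after a successful login the session holds the cookie, so there is no need to fill in the header by hand. The program flow is:

Get the account and password from command-line options or terminal input;
Determine the URL to request and fill in the headers and form data;
Log in with the session's post method; a status code of 200 is treated as a successful login.

The code is as follows: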

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Author: Farmer Li, WeChat official account: 很酷的程序员 (RealCoolEngineer)
# @Date: 2021-04-06

import click
import requests

USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'


class PolarSpider:
    def __init__(self) -> None:
        self._session = requests.session()

    def login(self, email, password):
        print("Login Polar Flow")
        url = "https://flow.polar.com/login"

        header = {
            'Referer': 'https://flow.polar.com/',
            'User-Agent': USER_AGENT
        }
        postData = {
            'returnUrl': '/',
            "email": email,
            "password": password,
        }
        res = self._session.post(url, data=postData, headers=header)
        if res.status_code == 200:
            print('Login succeed')
            return True
        else:
            print(f"statusCode = {res.status_code}")
            return False


@click.command()
@click.option('-e', '--email', help='Account email')
@click.option('-p', '--password', help='Account password')
def main(email, password):
    if email is None:
        email = click.prompt('Input Email')
    if password is None:
        password = click.prompt('Input password', hide_input=True)

    spider = PolarSpider()
    spider.login(email, password)


if __name__ == '__main__':
    main()

The core is really a single line; everything else exists to support it:

self._session.post(url, data=postData, headers=header)
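
To see why a session removes the need to fill in the cookie header by hand, the sketch below puts a cookie into a session and then prepares, without sending, a later request through the same session. The cookie name and value are made up and stand in for whatever a real login response would set.

```python
import requests

session = requests.Session()
# Stand-in for the cookie a real login response would set (made-up name and value)
session.cookies.set('demo_session', 'abc123', domain='flow.polar.com', path='/')

# Prepare (but do not send) a later request made through the same session
request = requests.Request('GET', 'https://flow.polar.com/api/training/history')
prepared = session.prepare_request(request)

# The session has attached its stored cookie to the outgoing request
print(prepared.headers.get('Cookie'))
```

Nothing is sent over the network here; prepare_request only builds the request that the session would send.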

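2 Getting User Information

This step gets the user's ID, which is needed to request the training history list and extract the training IDs from it.
The code is as follows (note that it is implemented as a method of PolarSpider, as are the later snippets):

# @Author: Farmer Li, WeChat official account: 很酷的程序员 (RealCoolEngineer)
import json  # add to the imports at the top of the module


def get_user_id(self):
    user_id = -1  # sentinel returned when the ID cannot be fetched
    header = {
        'content-type': 'application/json; charset=UTF-8',
        'referer': 'https://flow.polar.com/diary/training-list',
        'user-agent': USER_AGENT,
    }
    info_url = 'https://flow.polar.com/api/account/users/current/user'
    print('Request current user id...')
    try:
        res = self._session.get(info_url, headers=header)
    except requests.RequestException as e:
        # No response object exists in this case, so return early
        print(f'Error occurred!\n{e}')
        return user_id
    if res.status_code == 200:
        user_info = json.loads(res.text)
        user_id = user_info['user']['id']
    else:
        print(f'Failed with code: {res.status_code}')

    return user_id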

3 Getting the Training History List

As before, just build the request according to the earlier analysis. The implementation is as follows:

# @Author: Farmer Li, WeChat official account: 很酷的程序员 (RealCoolEngineer)
def get_history_ids(self, user_id, from_date=None, to_date=None):
    history_ids = []
    payload = {
        'userId': user_id,
        'fromDate': from_date,
        'toDate': to_date,
        'sportIds': []
    }
    header = {
        'content-type': 'application/json',
        'user-agent': USER_AGENT,
        'x-requested-with': 'XMLHttpRequest'
    }
    history_url = 'https://flow.polar.com/api/training/history'
    print(f'Request training history for {user_id}, from {from_date} to {to_date}')
    try:
        # The payload is sent as a JSON string (requires import json)
        res = self._session.post(history_url,
                                 headers=header,
                                 data=json.dumps(payload))
    except requests.RequestException as e:
        # No response object exists in this case, so return the empty list
        print(f'Error occurred!\n{e}')
        return history_ids
    if res.status_code == 200:
        train_list = json.loads(res.text)
        for train in train_list:
            history_ids.append(train['id'])
    else:
        print(f'Failed with code: {res.status_code}')

    return history_ids
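
The ID extraction can be tried without a network connection. In the sketch below the response text is invented: it only assumes, as the loop above does, that the response is a JSON list whose entries carry an 'id' key. The helper name extract_ids is hypothetical, and unlike the loop above it skips entries that have no 'id'.

```python
import json

# Made-up response text; only the 'id' key is taken from the loop above
sample_text = '[{"id": 101, "duration": 1800}, {"id": 102}, {"note": "no id"}]'


def extract_ids(response_text):
    """Return the 'id' of every entry that has one (hypothetical helper)."""
    records = json.loads(response_text)
    return [record['id'] for record in records if 'id' in record]


print(extract_ids(sample_text))
```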

4 Implementing Automatic Download

First get the user ID, then the list of training IDs, then build the download links and save the files locally.
The implementation is as follows:

# @Author: Farmer Li, WeChat official account: 很酷的程序员 (RealCoolEngineer)

from pathlib import Path

DEFAULT_DOWNLOAD_DIR = Path.home() / 'Downloads/polar/'

def download_history(self,
                     from_date=None,
                     to_date=None,
                     save_dir: Path = DEFAULT_DOWNLOAD_DIR):
    user_id = self.get_user_id()
    hids = self.get_history_ids(user_id, from_date, to_date)
    # One sub-directory per user; create it (and any parents) if missing
    dst_dir = save_dir / str(user_id)
    dst_dir.mkdir(parents=True, exist_ok=True)
    print(f'Downloading, to: {dst_dir}')

    download_url_prefix = 'https://flow.polar.com/api/export/training/csv'
    total_num = len(hids)
    for i, hid in enumerate(hids, 1):
        download_url = f'{download_url_prefix}/{hid}'
        print(f'\nDownloading [{i}/{total_num}]: {download_url}')
        r = self._session.get(download_url)
        if r.status_code == 200:
            dst_file_name = f'{user_id}_{hid}.csv'
            dst_file = dst_dir / dst_file_name
            print(f'Writing to file: {dst_file}')
            with dst_file.open('wb') as f:
                f.write(r.content)
        else:
            print(f'Failed with code: {r.status_code}')
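
The file-saving part can be exercised on its own. The sketch below repeats the directory and file naming used above inside a temporary directory, so nothing is written under the home directory. The function name save_record and the sample bytes are made up for the example.

```python
import tempfile
from pathlib import Path


def save_record(save_dir, user_id, hid, content):
    """Write one record to <save_dir>/<user_id>/<user_id>_<hid>.csv and return its path."""
    dst_dir = Path(save_dir) / str(user_id)
    # exist_ok=True makes a second call for the same user harmless
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst_file = dst_dir / f'{user_id}_{hid}.csv'
    dst_file.write_bytes(content)
    return dst_file


with tempfile.TemporaryDirectory() as tmp:
    saved = save_record(tmp, 1, 12345, b'time,hr\n0,60\n')
    print(saved.name, saved.stat().st_size)
```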

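5 Completing the Command-Line Options

Finally, add command-line options for the date range of the data to download, with default values for convenience.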

@click.command()
@click.option('-e', '--email', help='Account email')
@click.option('-p', '--password', help='Account password')
@click.option('-from', '--from-date', help='From date')
@click.option('-to', '--to-date', help='To date')
def main(email, password, from_date, to_date):
    if email is None:
        email = click.prompt('Input Email')
    if password is None:
        password = click.prompt('Input password', hide_input=True)

    if from_date is None:
        from_date = click.prompt('From date(yyyy-mm-dd)', default='2021-01-20')
    if to_date is None:
        to_date = click.prompt('To date(yyyy-mm-dd)', default='2021-01-30')

    spider = PolarSpider()
    spider.login(email, password)
    spider.download_history(from_date, to_date)
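
The dates above are passed along as plain strings. One possible addition, sketched here with only the standard library, is to check them before any request is made. The function names are hypothetical, and the yyyy-mm-dd format is taken from the prompts above rather than from anything known about the server.

```python
from datetime import datetime


def parse_date(text):
    """Parse a yyyy-mm-dd string, raising ValueError if it is malformed."""
    return datetime.strptime(text, '%Y-%m-%d').date()


def check_range(from_date, to_date):
    """Return both dates as date objects, rejecting a reversed range."""
    start, end = parse_date(from_date), parse_date(to_date)
    if start > end:
        raise ValueError(f'from date {start} is after to date {end}')
    return start, end


print(check_range('2021-01-20', '2021-01-30'))
```

Click also ships a click.DateTime parameter type that can do the parsing at the option level; the plain functions here just keep the sketch free of third-party imports.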

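That concludes the demonstration of automatically scraping Polar training history data. It mainly involves requests, json, pathlib, and Click; the last two are covered in more detail in earlier articles.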
