Python Crawler: Automatic Login and Download in Practice

Source: WeChat official account "很酷的程序員"
ID: RealCoolEngineer

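When crawling web pages with Python, some content is only available after logging in. This article walks through an example of logging in automatically and then scraping data.

This article covers the following topics:

Analyzing a web request: its type, the parameters it needs, and the result it returns
Building HTTP requests with requests
Implementing automatic login
Practical use of the filesystem path library pathlib and the command-line toolkit Click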

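I Requirements Analysis

1 Background

I have recently been running with a Polar heart rate strap and have collected some heart rate data. I don't want to download it by hand, so I need a script that automatically pulls all the data for a given period.

2 Breaking down the problem

Meeting this requirement takes roughly the following steps:

Work out how to log in automatically
Find out how to get the download link for a single exercise record
Find out how to get the download links for the training history within a given date range
Request the download links in batch and save the files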

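II Web Page Analysis

1 Analyzing the login

The first step is to find out what happens during login. The browser's built-in inspection tools are enough for this. Chrome is used as the example below.

Open the login page of the Polar Flow website (log out first if you are already logged in), right-click and choose Inspect (shortcut: F12), then select Network in the row of tabs at the top of the panel that opens, as shown below:

Now log in with an account that does not exist and watch the network activity shown on the right.

A nonexistent account is used so that the site does not redirect after a successful login, which would hide the details of the login request.

The login fails as expected, and the result is shown below:

The points to focus on here are:

The request URL (https://flow.polar.com/login) and request method (POST) under General, which are the core of building the request;
The fields under Request Headers. Every HTTP request carries a header, and this is where the cookie holding the login information lives. There are too many fields to go through here; a request usually needs only some of them, depending on how the site is designed;
Under Form Data, the most important part is of course the user's account information.

After a successful login, the cookie holds the login information (what exactly the cookie contains can be ignored for now).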

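2 Analyzing the download link

The previous section covered the login request URL and the header and data it needs.
Next, look at the link used to download data from an exercise record.

The figure below shows the exercise detail page. Clicking Export session sends a GET request, as the developer tools show:

Below is the response received after clicking export, that is, the content returned by the server:

From the HTML in this response, the relative address of the download link is:

/api/export/training/csv/557xxxxxx

So the full download address should be:

https://flow.polar.com/api/export/training/csv/557xxxxxx

For both the export page and the download link, the key piece is the ID of the training record (557xxxxxx in this example). Once the list of training history IDs is known, building the download links is straightforward.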
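
As a rough sketch of this step, the relative link can be pulled out of the export response and joined with the site root. The HTML fragment below is made up for illustration (the real markup of the export page is not reproduced in this article), and extract_csv_link is a hypothetical helper name. Only the /api/export/training/csv/<id> path pattern comes from the analysis above.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'https://flow.polar.com/'

# Matches the relative CSV export path followed by a numeric training ID.
CSV_LINK_RE = re.compile(r'/api/export/training/csv/(\d+)')


def extract_csv_link(html: str):
    """Return (absolute_url, training_id), or None if no export link is found."""
    match = CSV_LINK_RE.search(html)
    if match is None:
        return None
    return urljoin(BASE_URL, match.group(0)), match.group(1)


# Hypothetical response fragment; the real page markup may differ.
sample_html = '<a href="/api/export/training/csv/123456789">CSV</a>'
print(extract_csv_link(sample_html))
```

Since the ID is the only variable part, the same pattern also works in reverse: given an ID, appending it to the fixed prefix yields the download URL.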

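3 Analyzing the training history list

To build the download links, we need to work out how to get the list of training history IDs.
As shown below, click Training history, choose the start and end dates for the search, and again analyze the request with Chrome's developer tools:

The figure above shows that:

General: the request URL is https://flow.polar.com/api/training/history and the request method is POST;
Request Headers: contains the cookie and some other fields, which can be set as needed in the implementation;
Request Payload: the data carried by the request, specifying that the training history of a particular userId from fromDate to toDate should be queried;
Response: not expanded in the figure above. The response is JSON, which is convenient: convert it to Python objects and the IDs of all training history records are easy to read out.

Request Payload means much the same thing as the Form Data used by the login URL. The difference is only in the format, which is reflected in the content-type set in Request Headers.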
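
To make the format difference concrete, here is a small sketch using only the standard library. The field values are placeholders, not real credentials or IDs, and the date strings merely follow the yyyy-mm-dd form used later in this article.

```python
import json
from urllib.parse import urlencode

# Placeholder values for illustration only.
form_fields = {'returnUrl': '/', 'email': 'user@example.com', 'password': 'secret'}
history_query = {
    'userId': 12345,
    'fromDate': '2021-01-20',
    'toDate': '2021-01-30',
    'sportIds': [],
}

# Form Data: key=value pairs joined with '&', sent with
# content-type: application/x-www-form-urlencoded
form_body = urlencode(form_fields)

# Request Payload: a JSON document, sent with content-type: application/json
json_body = json.dumps(history_query)

print(form_body)
print(json_body)
```

With requests, passing a dict as data= produces the first kind of body, while json= (or data= with a JSON string) produces the second.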

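The Request Payload shows that fetching the training history also requires the user's ID. So how do we get the current user's ID? Readers can try this themselves.

The requests made when the training history page opens include one that fetches the user information; analyze it in the same way.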

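III Implementation

1 Implementing automatic login

OK, now it is clear which URL to request, with which method, and with what header and data in order to log in, so the request can be built with the requests module.

Because more operations follow the login, a one-off request is not enough. Creating a session with requests.Session solves this: after a successful login the session holds the cookie, so there is no need to fill it into the header by hand. The program flow is:

Get the account and password from arguments or from terminal input;
Determine the URL to request and fill in the header and form data;
Log in with the session's post method; a status code of 200 is taken to mean the login succeeded.

The code is as follows: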

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Author: Farmer Li, WeChat official account: 很酷的程序員 (RealCoolEngineer)
# @Date: 2021-04-06

import click
import requests

USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'


class PolarSpider:
    def __init__(self) -> None:
        # One session for all requests, so cookies set at login are reused.
        self._session = requests.Session()

    def login(self, email, password):
        print("Login Polar Flow")
        url = "https://flow.polar.com/login"

        header = {
            'Referer': 'https://flow.polar.com/',
            'User-Agent': USER_AGENT
        }
        # Same fields as the Form Data seen in the browser.
        postData = {
            'returnUrl': '/',
            "email": email,
            "password": password,
        }
        res = self._session.post(url, data=postData, headers=header)
        # A 200 status code is treated as a successful login.
        if res.status_code == 200:
            print('Login succeed')
            return True
        else:
            print(f"statusCode = {res.status_code}")
            return False


@click.command()
@click.option('-e', '--email', help='Account email')
@click.option('-p', '--password', help='Account password')
def main(email, password):
    # Prompt for anything not given on the command line.
    if email is None:
        email = click.prompt('Input Email')
    if password is None:
        password = click.prompt('Input password', hide_input=True)

    spider = PolarSpider()
    spider.login(email, password)


if __name__ == '__main__':
    main()

The core of this code is really a single line; everything else exists to support it:

self._session.post(url, data=postData, headers=header)

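2 Getting the user information

This step gets the user's ID, which is needed to request the training history list and obtain the training IDs in it.
The implementation is as follows (note that it is a method of PolarSpider, as are the ones that follow):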

# @Author: Farmer Li, WeChat official account: 很酷的程序員 (RealCoolEngineer)
def get_user_id(self):
    """Return the ID of the logged-in user, or -1 if it cannot be fetched."""
    user_id = -1
    header = {
        'content-type': 'application/json; charset=UTF-8',
        'referer': 'https://flow.polar.com/diary/training-list',
        'user-agent': USER_AGENT,
    }
    info_url = 'https://flow.polar.com/api/account/users/current/user'
    print('Request current user id...')
    try:
        res = self._session.get(info_url, headers=header)
    except requests.RequestException as e:
        # Without a response there is nothing to inspect, so return early.
        print(f'Error occurred!\n{e}')
        return user_id
    if res.status_code == 200:
        # The body is JSON; the ID is stored under user -> id.
        user_info = res.json()
        user_id = user_info['user']['id']
    else:
        print(f'Failed with code: {res.status_code}')

    return user_id

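3 Getting the training history list

In the same way, just build the request according to the earlier analysis. The implementation is as follows: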

# @Author: Farmer Li, WeChat official account: 很酷的程序員 (RealCoolEngineer)
def get_history_ids(self, user_id, from_date=None, to_date=None):
    """Return the IDs of all training records in the given date range."""
    history_ids = []
    payload = {
        'userId': user_id,
        'fromDate': from_date,
        'toDate': to_date,
        'sportIds': []
    }
    header = {
        'content-type': 'application/json',
        'user-agent': USER_AGENT,
        'x-requested-with': 'XMLHttpRequest'
    }
    history_url = 'https://flow.polar.com/api/training/history'
    print(f'Request training history for {user_id}, from {from_date} to {to_date}')
    try:
        # json= serializes the payload into a JSON request body.
        res = self._session.post(history_url,
                                 headers=header,
                                 json=payload)
    except requests.RequestException as e:
        # Without a response there is nothing to parse, so return early.
        print(f'Error occurred!\n{e}')
        return history_ids
    if res.status_code == 200:
        # The response is a JSON list with one entry per training record.
        train_list = res.json()
        for train in train_list:
            history_ids.append(train['id'])
    else:
        print(f'Failed with code: {res.status_code}')

    return history_ids
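
The ID extraction itself can be tried offline. The sketch below assumes, as the code above does, that the response is a JSON array of objects that each carry an id field. The sample data and the extra field names in it are made up, and extract_history_ids is a hypothetical standalone helper, slightly more defensive than the loop above.

```python
import json


def extract_history_ids(response_text: str) -> list:
    """Collect the 'id' of every record in a training-history response."""
    try:
        records = json.loads(response_text)
    except json.JSONDecodeError:
        # Not JSON at all (for example an HTML error page).
        return []
    if not isinstance(records, list):
        return []
    # Skip entries that are not objects or that lack an 'id' field.
    return [rec['id'] for rec in records if isinstance(rec, dict) and 'id' in rec]


# Made-up sample in the assumed shape.
sample = '[{"id": 111, "duration": 1800}, {"id": 222}, {"note": "no id"}]'
print(extract_history_ids(sample))  # → [111, 222]
```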

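4 Implementing automatic download

First get the user ID, then get the list of training IDs, then build the download links and save the files locally.
The implementation is as follows: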

# @Author: Farmer Li, WeChat official account: 很酷的程序員 (RealCoolEngineer)

from pathlib import Path

DEFAULT_DOWNLOAD_DIR = Path.home() / 'Downloads/polar/'

def download_history(self,
                     from_date=None,
                     to_date=None,
                     save_dir: Path = DEFAULT_DOWNLOAD_DIR):
    user_id = self.get_user_id()
    if user_id == -1:
        print('Could not get the user id, nothing to download')
        return
    hids = self.get_history_ids(user_id, from_date, to_date)
    # Save files under a per-user directory, creating it if needed.
    dst_dir = save_dir / str(user_id)
    dst_dir.mkdir(parents=True, exist_ok=True)
    print(f'Downloading, to: {dst_dir}')

    download_url_prefix = 'https://flow.polar.com/api/export/training/csv'
    total_num = len(hids)
    for i, hid in enumerate(hids, 1):
        download_url = f'{download_url_prefix}/{hid}'
        print(f'\nDownloading [{i}/{total_num}]: {download_url}')
        r = self._session.get(download_url)
        if r.status_code == 200:
            dst_file_name = f'{user_id}_{hid}.csv'
            dst_file = dst_dir / dst_file_name
            print(f'Writing to file: {dst_file}')
            # Write the raw bytes so the CSV is saved exactly as received.
            with dst_file.open('wb') as f:
                f.write(r.content)
        else:
            print(f'Failed with code: {r.status_code}')
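
When the script is run repeatedly over overlapping date ranges, it may be worth skipping records that are already on disk. The following is only a sketch of that idea with pathlib: csv_path and pending_ids are hypothetical helpers that are not part of the original script, and they assume the same <user_id>/<user_id>_<hid>.csv layout used above.

```python
import tempfile
from pathlib import Path


def csv_path(save_dir: Path, user_id, hid) -> Path:
    """Build the target path with the <user_id>_<hid>.csv naming used above."""
    return save_dir / str(user_id) / f'{user_id}_{hid}.csv'


def pending_ids(save_dir: Path, user_id, hids):
    """Return the training IDs whose CSV file is not on disk yet."""
    return [hid for hid in hids if not csv_path(save_dir, user_id, hid).exists()]


# Demo in a throwaway directory: record 111 is already saved, 222 is not.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    done = csv_path(root, 42, 111)
    done.parent.mkdir(parents=True, exist_ok=True)
    done.write_bytes(b'already downloaded')
    print(pending_ids(root, 42, [111, 222]))  # → [222]
```

In download_history, the loop could then iterate over pending_ids(save_dir, user_id, hids) instead of hids.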

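5 Completing the command-line arguments

Finally, add support for specifying the date range of the data to download on the command line, with default values for convenience.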

@click.command()
@click.option('-e', '--email', help='Account email')
@click.option('-p', '--password', help='Account password')
@click.option('-from', '--from-date', help='From date')
@click.option('-to', '--to-date', help='To date')
def main(email, password, from_date, to_date):
    # Prompt for anything not given on the command line.
    if email is None:
        email = click.prompt('Input Email')
    if password is None:
        password = click.prompt('Input password', hide_input=True)

    if from_date is None:
        from_date = click.prompt('From date(yyyy-mm-dd)', default='2021-01-20')
    if to_date is None:
        to_date = click.prompt('To date(yyyy-mm-dd)', default='2021-01-30')

    spider = PolarSpider()
    # Only start downloading if the login request succeeded.
    if not spider.login(email, password):
        return
    spider.download_history(from_date, to_date)
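
The dates above are passed through as plain strings, so a typo only surfaces once the server rejects the request. One possible safeguard, sketched here with the standard library, is to parse both dates before sending anything. parse_date_range is a hypothetical helper, and the yyyy-mm-dd format is taken from the prompts above.

```python
from datetime import datetime

DATE_FORMAT = '%Y-%m-%d'


def parse_date_range(from_date: str, to_date: str):
    """Parse two yyyy-mm-dd strings and check that they form a valid range.

    Raises ValueError if either string cannot be parsed or the range is reversed.
    """
    start = datetime.strptime(from_date, DATE_FORMAT).date()
    end = datetime.strptime(to_date, DATE_FORMAT).date()
    if start > end:
        raise ValueError(f'from date {start} is after to date {end}')
    return start, end


start, end = parse_date_range('2021-01-20', '2021-01-30')
print((end - start).days)  # → 10
```

Click also ships a click.DateTime parameter type that can do the parsing at the option level; the standalone function is used here only to keep the example free of third-party dependencies.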

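That completes the demonstration of automatically crawling Polar training history data. It mainly involves requests, JSON handling, pathlib and Click; earlier articles cover the latter two in more detail.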
