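Today we finally start writing a program to crawl job postings from BOSS Zhipin (BOSS直聘). I will build the program up step by step from the very basics to help you understand how a crawler works. There are still several problems I could not solve, so if anyone more experienced can help, please leave a comment.

First, let's request the pages and see whether the results match what the browser shows

The full page responses contain too much information. What we can see is that different pages return different titles, and the title changes with the query conditions, so for now we only need to watch how the title changes.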

from bs4 import BeautifulSoup as bs
import requests


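def de_title(func):
    """Decorator: call func, parse the response it returns, and print the page <title>."""
    def wrapper(*args, **kwargs):
        req = func(*args, **kwargs)
        content = req.content.decode("utf-8")
        content = bs(content, "html.parser")
        # Print the wrapped function's name next to the title it received
        print(func.__name__, content.find("title").text)
    return wrapper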


@de_title
def test1():
    req = requests.get('https://www.zhipin.com')
    return req


@de_title
def test2():
    req = requests.get('https://www.zhipin.com/job_detail/?query=python&city=101010100&industry=&position=100109')
    return req


if __name__ == "__main__":
    test1()
    test2()

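From the output we can see that the home page is returned normally, but the result for the search page differs from what we expected.
If you refresh the search page in the browser again, you can see a "please wait" loading step, especially on a slow connection.
After some searching on Baidu, I learned that this is where the cookies are being generated.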

Getting the Cookies

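Indeed, the code above is very simple: it just requests the page without sending any cookies, and BOSS has surely put plenty of anti-crawling measures in place.
Let's analyze. Since we can crawl the home page, the home page does not require cookie validation. Open the browser's developer tools with F12 and go to the Application tab to see what the cookies look like.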

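On the left there is a Cookies entry. Right-click the site URL under it and you will see Clear. Let's clear the saved cookies first and watch when they are generated again.
The cookie keys include __c and Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a, whose values look like they relate to the validity period, but in actual testing the cookies did not seem to last even one hour.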
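
The values of `__c` and `Hm_lpvt_…` look like Unix timestamps in seconds. Here is a minimal sketch for checking that, using the two values from the cookie dump printed later in this post. The helper name `cookie_time` is made up for illustration, and reading these cookies as timestamps is an assumption, not something the site documents.

```python
from datetime import datetime, timezone


def cookie_time(value: str) -> datetime:
    """Interpret a cookie value as a Unix timestamp in seconds (an assumption)."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


# Values copied from the cookie dump shown later in this post.
session_start = cookie_time("1591669802")  # __c
last_visit = cookie_time("1591673534")     # Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a

print(session_start.isoformat())
print(last_visit.isoformat())
print(last_visit - session_start)  # gap between the two values
```

Read this way, both values fall on 9 June 2020 (UTC) and are about 62 minutes apart, so they look more like visit times than an expiry time.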

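The cookies also contain an encrypted field, __zp_stoken__. I analyzed BOSS's cookie generation mechanism again, but I could not trigger the JS to generate valid cookies, and the __zp_stoken__ encryption is very hard to follow. The articles I found on it were beyond me too, so I gave up.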

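At this point an idea comes up: since the home page does not require cookies, we could visit the home page first and then use the cookies it generates to visit the later pages. That looks workable, but a crawler fetches pages differently from a browser. After fetching a page, a browser parses the HTML and processes its JS, CSS, and so on. When BOSS generates its cookies, the JS reads the current route, redirects to a specific route, and produces the cookies after several rounds of computation. A crawler cannot render the page directly, so I gave up on this as well.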

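Still, to get at least some data, the only option is to take the latest cookies from the browser and refresh the browser periodically to keep them valid. Really low-tech.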
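
Since the cookies have to be re-read from the browser from time to time, a small cache can avoid opening the cookie database on every request. The following is only a sketch: `CookieCache`, its `loader` argument, and the injectable `clock` are hypothetical names introduced here, and the right refresh interval is something you would have to find by testing.

```python
import time
from typing import Callable, Dict, Optional


class CookieCache:
    """Re-read cookies through `loader` once they are older than `ttl_seconds`."""

    def __init__(self, loader: Callable[[], Dict[str, str]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._cookies: Optional[Dict[str, str]] = None
        self._loaded_at = 0.0

    def get(self) -> Dict[str, str]:
        now = self._clock()
        # Reload on first use, or when the cached copy has reached its TTL
        if self._cookies is None or now - self._loaded_at >= self._ttl:
            self._cookies = self._loader()
            self._loaded_at = now
        return self._cookies
```

With the `get_cookie_from_chrome` function from this post, usage could look like `cache = CookieCache(lambda: get_cookie_from_chrome('.zhipin.com'), ttl_seconds=600)`, where 600 is an arbitrary guess. The cache only controls how often the file is read. The browser page still has to be refreshed for the site to issue a new token.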

GO! Reading the Cookies from the Browser

import os
import json
import base64
import sqlite3
from win32crypt import CryptUnprotectData
from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def get_string(local_state):
    with open(local_state, 'r', encoding='utf-8') as f:
        s = json.load(f)['os_crypt']['encrypted_key']
    return s


def pull_the_key(base64_encrypted_key):
    encrypted_key_with_header = base64.b64decode(base64_encrypted_key)
    encrypted_key = encrypted_key_with_header[5:]
    key = CryptUnprotectData(encrypted_key, None, None, None, 0)[1]
    return key


def decrypt_string(key, data):
    nonce, cipherbytes = data[3:15], data[15:]
    aesgcm = AESGCM(key)
    plainbytes = aesgcm.decrypt(nonce, cipherbytes, None)
    plaintext = plainbytes.decode('utf-8')
    return plaintext


def get_cookie_from_chrome(host):
    local_state = os.environ['LOCALAPPDATA'] + r'\Google\Chrome\User Data\Local State'
    cookie_path = os.environ['LOCALAPPDATA'] + r"\Google\Chrome\User Data\Default\Cookies"

    # Use a parameterized query instead of building the SQL by string formatting
    sql = "select host_key, name, encrypted_value from cookies where host_key = ?"

    with sqlite3.connect(cookie_path) as conn:
        cu = conn.cursor()
        res = cu.execute(sql, (host,)).fetchall()
        cu.close()
        cookies = {}
        key = pull_the_key(get_string(local_state))
        for host_key, name, encrypted_value in res:
            if encrypted_value[0:3] == b'v10':
                cookies[name] = decrypt_string(key, encrypted_value)
            else:
                cookies[name] = CryptUnprotectData(encrypted_value)[1].decode()

        # print(cookies)
        return cookies


if __name__ == "__main__":
    print(get_cookie_from_chrome('.zhipin.com'))
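
The slicing in `decrypt_string` is easier to follow once the layout of a `v10` value is spelled out: a 3-byte `v10` prefix, a 12-byte AES-GCM nonce, then the ciphertext with a 16-byte authentication tag at the end. The sketch below only splits a blob into those parts and does not decrypt anything. `V10Blob` and `split_v10_blob` are illustrative names, and the sketch covers only the `v10` layout the code above handles. Other Chrome versions may store cookies differently.

```python
from typing import NamedTuple


class V10Blob(NamedTuple):
    nonce: bytes       # 12-byte AES-GCM nonce
    ciphertext: bytes  # encrypted cookie value
    tag: bytes         # 16-byte AES-GCM authentication tag


def split_v10_blob(data: bytes) -> V10Blob:
    """Split a Chrome 'v10' encrypted_value into its parts without decrypting."""
    if data[:3] != b"v10":
        raise ValueError("not a v10 blob")
    if len(data) < 3 + 12 + 16:
        raise ValueError("blob too short")
    return V10Blob(nonce=data[3:15], ciphertext=data[15:-16], tag=data[-16:])
```

`AESGCM.decrypt` expects the tag appended to the ciphertext, which is why `decrypt_string` passes `data[15:]` as one piece instead of splitting off the tag.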

# Printed result of get_cookie_from_chrome('.zhipin.com')
> {'Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a': '1591673534', 'Hm_lvt_194df3105ad7148dcf2b98a91b5e727a': '1591090007,1591669802', '__a': '8822883.1591091039.1591091039.1591669802.22.2.14.22', '__c': '1591669802', '__g': '-', '__l': 'l=%2Fwww.zhipin.com%2Fshanghai%2F&r=&friend_source=0&friend_source=0', '__zp_stoken__': 'ddfaaCzFwRCxhSDdVFyZWXhMbWlVzT3c9XEtcFFtqWzJsClIXOkAaLHYPQU4EVQFRQSADE3tSDRQoX3dkHBwcGUxZKzhQID5pY35mGiMvDT8aR2cvOlt0Ukc5YSoYQitNAxlGbCBbZz9gTSU%3D', 'lastCity': '101020100'}

Adding the Cookies to the Request

from tp.boss.get_cookies import get_cookie_from_chrome
from bs4 import BeautifulSoup as bs
import requests


@de_title
def test3():
    cookie_dict = get_cookie_from_chrome('.zhipin.com')
    # Convert the dict to a CookieJar:
    cookies = requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True)
    s = requests.Session()
    s.cookies = cookies
    req = s.get('https://www.zhipin.com/job_detail/?query=python&city=101010100&industry=&position=100109')
    return req

if __name__ == "__main__":
    test1()
    test2()
    test3()
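
Before sending the request, it can help to confirm that the dictionary read from Chrome really contains the token, since a cleared or expired browser session would leave it out. This is a small sketch under the assumption that `__zp_stoken__` is the cookie that matters. The helper name `missing_cookies` is hypothetical.

```python
from typing import Dict, Iterable, List


def missing_cookies(cookie_dict: Dict[str, str],
                    required: Iterable[str] = ("__zp_stoken__",)) -> List[str]:
    """Return the required cookie names that are absent or empty."""
    return [name for name in required if not cookie_dict.get(name)]
```

If `missing_cookies(cookie_dict)` returns a non-empty list, refreshing the page in the browser and reading the cookies again is the first thing to try.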

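Another problem: this is not the title the search page should have. The cookie issue seems to be behind us, though, so let's print the response details and take a look.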
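
One way to print those details is a small helper that summarizes the status code, final URL, title, and the start of the body. This is a sketch, not the code used for this post: `describe_response` is a hypothetical name, and it pulls the title out with a regular expression only so that the snippet has no third-party dependencies.

```python
import re


def describe_response(status_code: int, url: str, html: str, preview: int = 200) -> str:
    """Build a short, printable summary of an HTTP response."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else "(no title)"
    # Collapse whitespace so the preview fits on one line
    body = " ".join(html.split())[:preview]
    return f"status={status_code}\nurl={url}\ntitle={title}\nbody={body}"
```

Inside a test function it could be called as `print(describe_response(req.status_code, req.url, req.text))` just before `return req`.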

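It turns out the IP has been restricted. By default requests identifies itself with a python-requests User-Agent, so let's try adding a browser-like header.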

from tp.boss.get_cookies import get_cookie_from_chrome
from bs4 import BeautifulSoup as bs
import requests
import random


@de_title
def test4():
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/61.0",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
        "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15"
    ]

    headers = {
        "user-agent": random.choice(user_agent_list)
    }


    cookie_dict = get_cookie_from_chrome('.zhipin.com')
    # Convert the dict to a CookieJar:
    cookies = requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True)
    s = requests.Session()
    s.cookies = cookies
    s.headers.update(headers)  # keep the session's default headers, override only User-Agent
    req = s.get('https://www.zhipin.com/job_detail/?query=python&city=101010100&industry=&position=100109')
    return req

if __name__ == "__main__":
    test1()
    test2()
    test3()
    test4()

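At last we can see the job postings.
If it does not work, refresh the page in the browser first. Admittedly, that is pretty low-tech.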

[Python Industry Analysis 3] Fetching BOSS Zhipin (BOSS直聘) Job Postings: Analyzing the Crawler Program

That's all for today.
