Python: Analyzing Renrenche's Font-Based Anti-Scraping (Source Code Included)

Original

2020-04-26 10:07

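Important notice: this article is for learning and discussion only and must not be used for commercial purposes. Please also respect the robots protocol and help keep the web a healthy place.
I have been browsing a number of websites lately. When I open the browser's developer tools (F12) to see how the front-end code is written, I often find that what the page displays differs from what is in the source, and requesting the page myself returns the same mismatched source. Odd. In true code-monkey spirit I looked it up, and it turns out this is a font-based anti-scraping strategy. It is used fairly widely, so here are my notes on how it works.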

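First, look at the Renrenche (人人車) website:
1. Open a detail page and inspect the page tags. Values such as the model year no longer match what is displayed. After checking many pages, it turns out that only the digits change.
2. Inspecting the CSS shows that a custom font is applied, and its name is tied directly to the tag's class name.
3. Among the responses the server sends, there is one whose name matches both the class attribute on the tag above and the CSS selector. Viewing it with Chrome's formatted preview shows that the row of digits at the bottom is out of order, so look for the pattern.
4. The correspondence is this: the slot where 6 should be holds 8, the slot for 7 holds 6, and the slot for 8 holds 7.
5. Now compare the home page source with the title actually displayed. Applying the rule above explains the difference.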
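The substitution in step 4 can be sketched as a small lookup table. This is a minimal illustration, not the site's actual data: the digit order `0123458679` is assumed from the pattern described in step 4, the direction of the mapping follows the article's code (a digit in the source maps to the slot it occupies in the font), and `decode_digits` is a hypothetical helper name.

```python
# Digit order assumed from step 4: 8 sits in slot 6, 6 in slot 7, 7 in slot 8.
observed_order = "0123458679"

# A digit found in the page source is replaced by the slot it occupies.
decode_table = str.maketrans(observed_order, "0123456789")


def decode_digits(text):
    """Replace obfuscated digits; every other character is left untouched."""
    return text.translate(decode_table)


print(decode_digits("2018"))  # 2016
```

Under this assumed order, a source value of `2018` would be displayed as `2016`, and non-digit characters pass through unchanged.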
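So, as a Pythonista, how could tricks like these cloud our sharp eyes? Below is the font-restoration code I wrote in Python (after three painful days of learning scraping and digging through references, I finally got somewhere). The code is pasted below. Please go easy on any bugs, and enjoy: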

#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""
# @Time      :   2020/4/25 13:32
# @Author    :   hupoc
# @File      :   second_hand_car_renrenche.py
# @Desc      :   
"""
import io
import os
import requests
from fontTools.ttLib import TTFont
from scrapy import Selector


def font_transfor(translate_text, html, url):
    """
    Convert text that the page renders with a custom font file.
    :param translate_text: the text to convert
    :param html: the response wrapped in scrapy.Selector()
    :param url: the URL of the requested page
    :return: the converted text, or None if the font cannot be resolved
    """
    # Read the class attribute of the tag that carries the font name
    font_interface_str = html.xpath("//div[@class='title']/h1/@class").extract_first()
    if font_interface_str and 'title-name' in font_interface_str:
        # Extract the font file name
        font_interface_name = font_interface_str.replace('title-name', '').strip()
    else:
        # No 'title-name' class means no custom font is applied
        print('Could not extract the font name from the title tag, url: {0}'.format(url))
        return
    file_name = url.rsplit("/", 1)[-1] + '_' + font_interface_name
    # Build the local path where the font file is saved
    file_ttf = '/tmp/renrenche_font/{0}.woff'.format(file_name)
    os.makedirs(os.path.dirname(file_ttf), exist_ok=True)
    # Download the font file and save it to the local path
    font_url = 'https://misc.rrcimg.com/ttf/{0}.woff'.format(font_interface_name)
    resp = requests.get(font_url)
    if not resp.ok:
        print('Failed to request the font file')
        return
    with open(file_ttf, 'wb') as f:
        f.write(resp.content)
    try:
        font = TTFont(io.BytesIO(resp.content))
    except Exception as e:
        print('Failed to parse the font file: {0}'.format(e))
        return
    # Glyph names in font order; the first entry (normally '.notdef') is skipped below
    uni_list = font.getGlyphOrder()
    # Build the conversion rule
    base_num_list = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
    base_eng_list = {'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5',
                     'six': '6', 'seven': '7', 'eight': '8', 'nine': '9'}
    mapping_list = [base_eng_list[_] for _ in uni_list[1:]]
    font_dict = dict(zip(mapping_list, base_num_list))
    # Convert the text; characters with no entry in the rule pass through unchanged
    transfor_str = [font_dict.get(_, _) for _ in translate_text]
    # Close the font file
    font.close()
    # Return the corrected text
    return ''.join(transfor_str)


def get_renrenche_detail():
    # Request the page and get the response
    detail_url = 'https://www.renrenche.com/bj/car/c2d4fdd2902df36b'
    response = requests.get(detail_url)
    if response.status_code != 200:
        print('Request for the Renrenche detail page failed')
        return
    # Parse the response text; scrapy.Selector() is used for convenience
    html = Selector(text=response.text)
    # Get the title and normalize its text
    titles = html.xpath("//div[@class='title']/h1/text()").extract()
    title = ' '.join(titles).replace('\n', '').strip()
    # Convert the title text to get the correct text
    true_title = font_transfor(title, html, detail_url)
    # Print the text
    print(true_title)


if __name__ == '__main__':
    # Request a Renrenche used-car detail page
    get_renrenche_detail()
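
The rule-building step inside `font_transfor` can also be pulled out into a pure function, which makes it possible to check the mapping without any network access. The sketch below is only an illustration under assumptions: `build_digit_map`, `decode`, and `sample_order` are hypothetical names, and the sample glyph order is invented to match the pattern from step 4 rather than read from a real font file. Unlike the code above, which skips the first glyph by position, this version skips any glyph name that is not a digit name.

```python
# Standard glyph names for the ten digits, as in the script above.
DIGIT_NAMES = {'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
               'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9'}


def build_digit_map(glyph_order):
    """Map each digit in the page source to the slot it occupies in the font.

    glyph_order: glyph names in font order, e.g. the list returned by
    TTFont.getGlyphOrder(). Non-digit names such as '.notdef' are skipped.
    """
    digits_in_font_order = [DIGIT_NAMES[name] for name in glyph_order
                            if name in DIGIT_NAMES]
    return {src: str(slot) for slot, src in enumerate(digits_in_font_order)}


def decode(text, digit_map):
    """Apply the rule; characters without an entry are left unchanged."""
    return ''.join(digit_map.get(ch, ch) for ch in text)


# Hypothetical glyph order matching the pattern described in step 4.
sample_order = ['.notdef', 'zero', 'one', 'two', 'three', 'four', 'five',
                'eight', 'six', 'seven', 'nine']
digit_map = build_digit_map(sample_order)
print(decode('2017-08', digit_map))  # 2018-06
```

Keeping the rule separate from the download also means the same mapping can be reused for every field on a page that shares one font.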
