Scraping Dianping (大衆點評) Popular-Restaurant Listings by City: Decoding Font-Obfuscated Ratings

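I have written two earlier posts with scraping code for Dianping, but Dianping has since replaced its anti-scraping scheme. I recently looked at the new mechanism and worked out a small workaround. First, a recap of how the front-end obfuscation used to work: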

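The text was rendered by loading an SVG image and using CSS rules to set background coordinates into that sprite sheet, so every displayed character was a crop of the SVG. To decode it, you worked backwards from the CSS offsets to the characters inside the SVG. For details, see "大衆點評評論抓取-加密評論信息完整抓取" (Dianping review scraping: complete capture of obfuscated review text).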

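There are also two earlier posts analysing the Dianping rankings. They target the old "popular ranking" version. I can no longer find an entry point to it on the site, but the old ranking pages still work, limited to a small number of cities nationwide. For reference: "大衆點評各城市熱門餐廳數據爬蟲抓取" (scraping popular-restaurant data by city) and "大衆點評熱門餐廳抓取與數據分析" (scraping and analysing popular restaurants).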

This post: scraping and decoding the per-city list pages

First, a look at how the obfuscation shows up:

  1. Take Dianping's Xi'an site as the example and scrape the food (美食) category;
[Figure: entry page (home page)]

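    2. The food list page. Note how the link is built: www.dianping.com/{city pinyin}/ch10/p1, where ch10 is the food category and p1 is the page number;

[Figure: the list page to be scraped]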
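
The link pattern above can be captured in a small helper. This is only a sketch: `list_page_url` is a name introduced here for illustration, and the `http://` scheme and the `xian` slug are taken from the crawler later in this post.

```python
def list_page_url(city_pinyin: str, page: int, channel: str = "ch10") -> str:
    """Build a list-page URL: www.dianping.com/<city pinyin>/<channel>/p<page>."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return f"http://www.dianping.com/{city_pinyin}/{channel}/p{page}"


# First food list page for Xi'an
print(list_page_url("xian", 1))
```

Other cities should only need a different pinyin slug, assuming they follow the same pattern.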

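    3. On this page the review count, the shop address, and the individual scores are all obfuscated. In the element inspector they all display as "口" (an empty box), and the CSS shows that a custom Dianping font is applied. The link to its stylesheet is shown at the far right of the styles pane; its contents are covered in the following steps;

[Figure: obfuscated fields on the list page]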

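    4. In the page source, each "口" is actually an entity such as &#xe89c;, always starting with "&#x". On their own these codes tell us nothing, but combined with the CSS in the previous figure it is clear that the page loads its own font files;

[Figure: obfuscated content in the page source]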
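
An `&#x....;` entity is an ordinary numeric character reference, so it can be inspected with the standard library. The sketch below (the function name and the sample fragment are made up for illustration) lists the entities whose code points fall in the Unicode Private Use Area, U+E000 to U+F8FF. Code points in that range have no standard glyph, which is why only the page's own font can give them a shape.

```python
import html
import re
from typing import List

ENTITY_RE = re.compile(r"&#x([0-9a-fA-F]{4});")


def find_private_use_entities(page_source: str) -> List[str]:
    """Return the distinct &#x....; entities that point into the Private Use Area."""
    found = []
    for match in ENTITY_RE.finditer(page_source):
        code_point = int(match.group(1), 16)
        if 0xE000 <= code_point <= 0xF8FF and match.group(0) not in found:
            found.append(match.group(0))
    return found


# Made-up fragment: two obfuscated codes, one ordinary CJK entity, one repeat
sample = "<b>&#xe89c;&#xf0a1;</b> &#x4e2d; &#xe89c;"
print(find_private_use_entities(sample))
print(hex(ord(html.unescape("&#xe89c;"))))
```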

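    5. The CSS gives a font link for the class of each obfuscated tag. In Chrome, one of them is a custom font with a .woff suffix, and that is the link to take. Every one of these fonts has to be parsed before the text can be decoded;

[Figure: font links in the obfuscation CSS]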
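
Pulling the .woff links out of the stylesheet text is a plain regex job. The sketch below assumes the fonts are served from `s3plus.meituan.net`, as in the crawler later in this post; the sample CSS, its class names, and its file names are invented. The pattern here is a slightly stricter variant of the one in the crawler: it stops at quotes and parentheses, so a match cannot run from an .eot URL into a following .woff URL.

```python
import re
from typing import List

# Protocol-relative font URL ending in .woff, without crossing quotes or parentheses
WOFF_RE = re.compile(r"""//s3plus\.meituan\.net/[^"')\s]+?\.woff""")


def woff_urls(css_text: str) -> List[str]:
    """Return the distinct .woff URLs in the CSS, in order of first appearance."""
    urls = []
    for path in WOFF_RE.findall(css_text):
        url = "http:" + path
        if url not in urls:
            urls.append(url)
    return urls


# Invented stylesheet: two font faces, the first one referenced twice
sample_css = (
    '@font-face{font-family:"f-shopNum";'
    'src:url("//s3plus.meituan.net/v1/example/font/aaaa1111.eot?#iefix"),'
    'url("//s3plus.meituan.net/v1/example/font/aaaa1111.woff");}'
    '@font-face{font-family:"f-address";'
    'src:url("//s3plus.meituan.net/v1/example/font/bbbb2222.woff");}'
    '@font-face{font-family:"f-tagName";'
    'src:url("//s3plus.meituan.net/v1/example/font/aaaa1111.woff");}'
)
print(woff_urls(sample_css))
```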

    6. The font files themselves are listed under Font in the Network panel; opening a link downloads the file;

[Figure: the obfuscated .woff font files]

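      7. Opening the font file reveals the actual characters used on the page. A note on how the font is built: each glyph is stored as a large set of outline coordinates, that is, the character is drawn point by point, so redrawing the glyphs ourselves would be a lot of work. Look instead at the label above each glyph: it matches the code in the page source once the "&#x" prefix is removed (inside the font the glyph names carry a "uni" prefix instead). That is the key: map each label to its character, substitute, and the text is decoded. The remaining difficulty is that the glyphs are drawn outlines rather than text, and Dianping keeps regenerating the font, so although we can see the characters we cannot simply copy and paste them. Hence the following approach: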

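  1) Read the woff file with TTFont;
  2) Use the Image module (PIL) to draw the glyphs onto a canvas, which amounts to taking a screenshot of the font;
  3) Read the characters back from the image with OCR;
  4) Look up each character through the resulting key/value mapping.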
font = TTFont(file_name)  # open the obfuscated font file
codeList = font.getGlyphOrder()[2:]  # glyph names, skipping the first two entries
# draw on a canvas
im = Image.new("RGB", (1800, 1000), (255, 255, 255))
dr = ImageDraw.Draw(im)
font = ImageFont.truetype(file_name, 40)
count = 15
list_img = numpy.array_split(codeList, count)  # split the list into 15 parts so the glyphs are drawn on separate rows
[Figure: the obfuscated font rendered as an image]

[Figure: the characters after OCR]
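
Once OCR has returned a string, the glyph names and the recognised characters are paired up by position. The sketch below shows that pairing with made-up glyph names and a made-up OCR result; the helper names are hypothetical. It also checks that the two sequences have the same length, because a single character dropped by the OCR would otherwise shift every later pair without any error.

```python
from typing import Dict, List


def glyph_name_to_entity(glyph_name: str) -> str:
    """Turn a glyph name such as 'unie89c' into the entity '&#xe89c;' used in the page."""
    if not glyph_name.startswith("uni"):
        raise ValueError(f"unexpected glyph name: {glyph_name!r}")
    return "&#x" + glyph_name[3:] + ";"


def glyph_name_to_char(glyph_name: str) -> str:
    """Turn 'unie89c' into the character U+E89C, ready to be drawn with the custom font."""
    return chr(int(glyph_name[3:], 16))


def build_mapping(glyph_names: List[str], recognised: str) -> Dict[str, str]:
    """Pair each glyph's entity with the OCR character at the same position."""
    recognised = recognised.replace(" ", "").replace("\n", "")
    if len(recognised) != len(glyph_names):
        raise ValueError(
            f"OCR returned {len(recognised)} characters for {len(glyph_names)} glyphs"
        )
    return {glyph_name_to_entity(n): c for n, c in zip(glyph_names, recognised)}


names = ["unie89c", "unif0a1", "unie105"]  # made-up glyph names
print(build_mapping(names, "1 2\n3"))      # made-up OCR output
```

Raising on a length mismatch is a design choice for this sketch; the crawler later in this post zips the two lists without checking.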

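      8. The last step is to pair each recognised character with its obfuscated code and replace those codes in the page source. After that the data can be extracted normally.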
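
The substitution itself can be done in one regex pass over the page source instead of one `str.replace` call per code. This is a sketch with an invented mapping and fragment; `decode_html` is a hypothetical name. Codes missing from the mapping are left as they are, so they remain visible for debugging.

```python
import re
from typing import Dict

ENTITY_RE = re.compile(r"&#x[0-9a-fA-F]{4};")


def decode_html(page_source: str, mapping: Dict[str, str]) -> str:
    """Replace every known &#x....; code with its character; keep unknown codes."""
    return ENTITY_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), page_source)


# Invented mapping and fragment
mapping = {"&#xe89c;": "8", "&#xf0a1;": "7", "&#xe105;": "1"}
fragment = '<a class="review-num"><b>&#xe89c;&#xf0a1;&#xe105;</b></a> &#xeeee;'
print(decode_html(fragment, mapping))
```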

Final scraped data for the shops on the first page:

[
    {
        "ID":"jfP5B6BhSEijjpRY",
        "shopLi":"http://www.dianping.com/shop/jfP5B6BhSEijjpRY",
        "shopName":"藍田印象",
        "shopStar":"4.86",
        "shopRecommend":[
            "藍田餄餎",
            "桂花糯米糕",
            "油餅"
        ],
        "shopTotal":"871 ",
        "shopAvg":"43",
        "shopTag":"其他美食",
        "shopArea":"北寶路",
        "shopAddress":"北環路藍豐家園對面",
        "shopTaste":"4.86",
        "shopEnvironment":"4.85",
        "shopServer":"4.84"
    },
    {
        "ID":"l8VFgvYfgSIaQvz7",
        "shopLi":"http://www.dianping.com/shop/l8VFgvYfgSIaQvz7",
        "shopName":"長安壹號",
        "shopStar":"4.78",
        "shopRecommend":[
            "長安葫蘆雞",
            "麻什",
            "太宗吊燒肉"
        ],
        "shopTotal":"2474",
        "shopAvg":"222",
        "shopTag":"陝菜",
        "shopArea":"省體育場",
        "shopAddress":"長安北路1號",
        "shopTaste":"4.68",
        "shopEnvironment":"4.89",
        "shopServer":"4.81"
    },
    {
        "ID":"G9VBBvcEp6puNgUy",
        "shopLi":"http://www.dianping.com/shop/G9VBBvcEp6puNgUy",
        "shopName":"旺順閣魚頭泡餅(悅薈廣場店)",
        "shopStar":"4.64",
        "shopRecommend":[
            "經典魚頭泡餅",
            "手工現烙餅",
            "芝士焗紅薯"
        ],
        "shopTotal":"803",
        "shopAvg":"116",
        "shopTag":"京菜",
        "shopArea":"民可園",
        "shopAddress":"解放路116號悅薈廣場L606a",
        "shopTaste":"4.61",
        "shopEnvironment":"4.86",
        "shopServer":"4.82"
    },
    {
        "ID":"H3pvCM708cM764Z5",
        "shopLi":"http://www.dianping.com/shop/H3pvCM708cM764Z5",
        "shopName":"榮宴·中餐廳",
        "shopStar":"4.87",
        "shopRecommend":[
            "魯式蔥燒海蔘",
            "國宴開水白菜",
            "佛跳牆"
        ],
        "shopTotal":"151 ",
        "shopAvg":"1319",
        "shopTag":"創校菜",
        "shopArea":"當新路沿樂",
        "shopAddress":"高新二路與科技二路什字東丹軒梓園北門(農業銀行二樓)",
        "shopTaste":"4.85",
        "shopEnvironment":"4.89",
        "shopServer":"4.9"
    },
    {
        "ID":"l1uuIVrie5SV9nAh",
        "shopLi":"http://www.dianping.com/shop/l1uuIVrie5SV9nAh",
        "shopName":"糊塗記(新城廣場店)",
        "shopStar":"4.82",
        "shopRecommend":[
            "葫蘆雞",
            "高陵油餅",
            "關中四寶"
        ],
        "shopTotal":"3820",
        "shopAvg":"64",
        "shopTag":"陝菜",
        "shopArea":"鐘樓/鼓樓",
        "shopAddress":"南新街8號路西",
        "shopTaste":"4.84",
        "shopEnvironment":"4.82",
        "shopServer":"4.69"
    },
    {
        "ID":"H1Lvg3y9EaOTV086",
        "shopLi":"http://www.dianping.com/shop/H1Lvg3y9EaOTV086",
        "shopName":"張老闆的店(民樂園店)",
        "shopStar":"4.87",
        "shopRecommend":[
            "麻麻面",
            "霸氣雙拼披薩",
            "椒麻牛肚"
        ],
        "shopTotal":"2093",
        "shopAvg":"89",
        "shopTag":"特色菜",
        "shopArea":"民可園",
        "shopAddress":"解放路111號民樂園萬達步行街11號樓10101鋪",
        "shopTaste":"4.86",
        "shopEnvironment":"4.89",
        "shopServer":"4.88"
    },
    {
        "ID":"Ga60yrRErPAWOD8D",
        "shopLi":"http://www.dianping.com/shop/Ga60yrRErPAWOD8D",
        "shopName":"長安大牌檔之長安集市(賽格旗艦店)",
        "shopStar":"4.70",
        "shopRecommend":[
            "長安葫蘆雞",
            "豆皮涮牛肚鍋",
            "醪糟冰淇淋"
        ],
        "shopTotal":"35020",
        "shopAvg":"81",
        "shopTag":"陝菜",
        "shopArea":"小寨",
        "shopAddress":"小寨東路賽格國際購物中心6樓西北角",
        "shopTaste":"4.61",
        "shopEnvironment":"4.84",
        "shopServer":"4.62"
    },
    {
        "ID":"lazoXjXc4sGBvSPp",
        "shopLi":"http://www.dianping.com/shop/lazoXjXc4sGBvSPp",
        "shopName":"和悅和牛火鍋(邁科中心店)",
        "shopStar":"4.93",
        "shopRecommend":[
            "5A三角牛腩和牛粒",
            "5A和牛上腦",
            "招牌松茸菌湯底"
        ],
        "shopTotal":"213",
        "shopAvg":"598",
        "shopTag":"打邊爐/港式火鍋",
        "shopArea":"丈八",
        "shopAddress":"錦業路12號邁科中心A座1樓",
        "shopTaste":"4.93",
        "shopEnvironment":"4.93",
        "shopServer":"4.93"
    },
    {
        "ID":"Eg7Os5JRBOLW8Xnu",
        "shopLi":"http://www.dianping.com/shop/Eg7Os5JRBOLW8Xnu",
        "shopName":"胖子甑糕",
        "shopStar":"4.79",
        "shopRecommend":[
            "甑糕",
            "棗泥",
            "蜜棗"
        ],
        "shopTotal":"2574",
        "shopAvg":"8",
        "shopTag":"小喫",
        "shopArea":"蓮湖公園",
        "shopAddress":"灑金橋路與勞武巷交叉口楊天玉臘牛羊肉店旁",
        "shopTaste":"4.69",
        "shopEnvironment":"4.16",
        "shopServer":"4.67"
    },
    {
        "ID":"k4MQcT0ou69m0Ult",
        "shopLi":"http://www.dianping.com/shop/k4MQcT0ou69m0Ult",
        "shopName":"愛驊褲帶麪館(總店)",
        "shopStar":"4.89",
        "shopRecommend":[
            "biangbiang面",
            "油潑面",
            "蘸水面"
        ],
        "shopTotal":"1609",
        "shopAvg":"16",
        "shopTag":"麪館",
        "shopArea":"鐘樓/鼓樓",
        "shopAddress":"東木頭市19號(秦豫肉夾饃東隔壁)",
        "shopTaste":"4.9",
        "shopEnvironment":"4.58",
        "shopServer":"4.85"
    },
    {
        "ID":"l8qbUQaSQNjSLD2i",
        "shopLi":"http://www.dianping.com/shop/l8qbUQaSQNjSLD2i",
        "shopName":"陝拾叄(鼓樓店)",
        "shopStar":"4.87",
        "shopRecommend":[
            "醪糟味冰淇淋",
            "秦酥",
            "豆腐冰淇淋"
        ],
        "shopTotal":"9418",
        "shopAvg":"32",
        "shopTag":"冰淇淋",
        "shopArea":"鐘樓/鼓樓",
        "shopAddress":"北院門270號",
        "shopTaste":"4.86",
        "shopEnvironment":"4.87",
        "shopServer":"4.87"
    },
    {
        "ID":"k4PFL1AksZDcU3a8",
        "shopLi":"http://www.dianping.com/shop/k4PFL1AksZDcU3a8",
        "shopName":"爺們兒泥爐烤肉",
        "shopStar":"4.88",
        "shopRecommend":[
            "品厚切五花肉",
            "祕製梅花肉",
            "調味澳洲肥牛"
        ],
        "shopTotal":"762",
        "shopAvg":"85",
        "shopTag":"融合烤肉",
        "shopArea":"鐘樓/鼓樓",
        "shopAddress":"東縣門與飲馬池十字路東",
        "shopTaste":"4.88",
        "shopEnvironment":"4.81",
        "shopServer":"4.9"
    },
    {
        "ID":"H6oZsmKfP21fMtVy",
        "shopLi":"http://www.dianping.com/shop/H6oZsmKfP21fMtVy",
        "shopName":"陽坊大都涮羊肉",
        "shopStar":"3.48",
        "shopRecommend":[
            "蘇尼特肥羊",
            "軟切羊肉",
            "大都招牌肉"
        ],
        "shopTotal":"21 ",
        "shopAvg":"135",
        "shopTag":"老北京火鍋",
        "shopArea":"丈八",
        "shopAddress":"高新六路CROSS萬象匯8號樓2層",
        "shopTaste":"3.72",
        "shopEnvironment":"3.92",
        "shopServer":"3.77"
    },
    {
        "ID":"l24zZ7Ak8q6L2dhU",
        "shopLi":"http://www.dianping.com/shop/l24zZ7Ak8q6L2dhU",
        "shopName":"醉長安(鐘樓旗艦店)",
        "shopStar":"4.83",
        "shopRecommend":[
            "老陝葫蘆雞",
            "晾衣毛肚",
            "妙筆生花"
        ],
        "shopTotal":"8462",
        "shopAvg":"83",
        "shopTag":"陝菜",
        "shopArea":"鐘樓/鼓樓",
        "shopAddress":"竹笆市鼓樓向南200米美豪麗致酒店1樓2樓",
        "shopTaste":"4.75",
        "shopEnvironment":"4.88",
        "shopServer":"4.87"
    },
    {
        "ID":"G7L8e4Z2Oph1epU7",
        "shopLi":"http://www.dianping.com/shop/G7L8e4Z2Oph1epU7",
        "shopName":"蓮花餐飲(朱雀店)",
        "shopStar":"4.88",
        "shopRecommend":[
            "紫陽蒸盆子",
            "安康吊爐芝麻燒餅",
            "清蒸鴨嘴魚"
        ],
        "shopTotal":"2519",
        "shopAvg":"109",
        "shopTag":"陝菜",
        "shopArea":"省體育場",
        "shopAddress":"朱雀大街中段1號",
        "shopTaste":"4.85",
        "shopEnvironment":"4.89",
        "shopServer":"4.88"
    }
]
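
The records above keep every value as a string, and some carry trailing whitespace (for example "871 "). If the data is going to be analysed, a small clean-up pass helps. The sketch below is one way to do it, not part of the original crawler: the field grouping and the choice of `None` for unparseable values are assumptions.

```python
import json

FLOAT_FIELDS = ("shopStar", "shopTaste", "shopEnvironment", "shopServer")
INT_FIELDS = ("shopTotal", "shopAvg")


def normalise(record: dict) -> dict:
    """Return a copy with score fields as float and count/price fields as int."""
    clean = dict(record)
    for key in FLOAT_FIELDS + INT_FIELDS:
        raw = str(record.get(key, "")).strip()
        try:
            clean[key] = int(raw) if key in INT_FIELDS else float(raw)
        except ValueError:
            clean[key] = None  # missing value or OCR noise
    return clean


raw = json.loads(
    '{"shopName": "藍田印象", "shopStar": "4.86", "shopTotal": "871 ", "shopAvg": "43",'
    ' "shopTaste": "4.86", "shopEnvironment": "4.85", "shopServer": "4.84"}'
)
print(normalise(raw))
```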

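Source code

Notes on the source: Dianping's font files change dynamically, so the scraper has to keep requesting the current ones; alternatively, work out how often they change and refresh on a schedule. The code has one flaw: when decoding, I replace the obfuscated codes globally across the whole page. Strictly, each code should be replaced using the font that belongs to its CSS class (for example tagName). This post focuses on the decoding itself and does not implement that finer-grained replacement.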
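
A per-class replacement could look roughly like the sketch below. It assumes each obfuscated character sits in its own wrapper tag whose class names the font, for example `<svgmtsi class="shopNum">&#xe89c;</svgmtsi>`; the tag name, the class names, and both mappings here are assumptions for illustration and should be checked against the live page. Building `maps_by_class` from the CSS (class to font file to mapping) is not shown.

```python
import re
from typing import Dict

# One wrapper tag with a class attribute around exactly one obfuscated code
WRAPPED_RE = re.compile(r'<(\w+) class="([\w-]+)">(&#x[0-9a-fA-F]{4};)</\1>')


def decode_by_class(page_source: str, maps_by_class: Dict[str, Dict[str, str]]) -> str:
    """Replace each wrapped code using the mapping that belongs to its CSS class."""

    def replace(match):
        mapping = maps_by_class.get(match.group(2), {})
        # Leave the whole wrapper untouched when the class or the code is unknown
        return mapping.get(match.group(3), match.group(0))

    return WRAPPED_RE.sub(replace, page_source)


# The same code means different characters in different fonts (invented values)
maps_by_class = {"shopNum": {"&#xe89c;": "8"}, "address": {"&#xe89c;": "路"}}
sample = (
    '<svgmtsi class="shopNum">&#xe89c;</svgmtsi> '
    '<svgmtsi class="address">&#xe89c;</svgmtsi> '
    '<svgmtsi class="tagName">&#xe89c;</svgmtsi>'
)
print(decode_by_class(sample, maps_by_class))
```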

#!/usr/bin/python3  
# encoding: utf-8  
""" 
@version: v1.0 
@author: W_H_J 
@license: Apache Licence  
@contact: [email protected] 
@software: PyCharm 
@file: dazhongFoodList.py
@time: 2020/6/17 10:12
@describe: food listings from the list pages of each Dianping city
To go past page 10 you need to log in and add the cookie manually
"""
import json
import random
import re
import sys
import os
import numpy
import pytesseract
from PIL import Image, ImageDraw, ImageFont
from pyquery import PyQuery as pq
import requests
from fontTools.ttLib import TTFont
sys.path.append(os.path.abspath(os.path.dirname(__file__) + '/' + '..'))
sys.path.append("..")
doc_path = "./secretDoc"  # folder where the downloaded woff font files are stored
os.makedirs(doc_path, exist_ok=True)


class DaZhongFoodList:
    def __init__(self):
        self.USER_AGENT_LIST = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5"]

    def get(self, url):
        head = {
            'User-Agent': random.choice(self.USER_AGENT_LIST),  # pick a User-Agent at random
            'Host': 'www.dianping.com',
            'Cookie': 'navCtgScroll=0; navCtgScroll=0; _lxsdk_cuid=171f22248aac8-0997e5681a77d9-376b4502-1fa400-171f22248abc8; _lxsdk=171f22248aac8-0997e5681a77d9-376b4502-1fa400-171f22248abc8; _hc.v=e8d6b5d2-6fac-becb-d1e9-8f2d9e0a75e3.1588905266; cye=xian; _dp.ac.v=205fd5cc-9ba6-4b29-932d-cc13b3e6244f; ua=dpuser_5832767585; ctu=a9f247ab89a4ee779f162d4b6923fc08fb12285c5ae4076901c1d499b665bee2; s_ViewType=10; fspop=test; Hm_lvt_602b80cf8079ae6591966cc70a3940e7=1591178871,1591179190,1591595589,1592359696; _lx_utm=utm_source%3DBaidu%26utm_medium%3Dorganic; logan_session_token=4zjh7es2zv6d3liq9tvn; logan_custom_report=; default_ab=shopList%3AA%3A5; cityid=17; pvhistory="6L+U5ZuePjo8L3N1Z2dlc3QvZ2V0SnNvbkRhdGE/Y2FsbGJhY2s9anNvbnBfMTU5MjM2Mjg5MTgyMl80ODIyNz46PDE1OTIzNjI4OTA0MzFdX1s="; m_flash2=1; PHOENIX_ID=0a49a8ba-172c03ac912-85fcc; _tr.u=msMl25dgCdFtGAsS; _tr.s=oLyNfoSoVWA7xm17; cy=17; _lxsdk_s=172c13e1aa7-a9a-f22-225%7C%7C20; Hm_lpvt_602b80cf8079ae6591966cc70a3940e7=1592379974'
        }
        html = requests.get(url, headers=head)
        html.encoding = "UTF-8"
        print("STATUS:==>", html)
        # print(html.headers)
        page_url = html.url
        if 'verify' in page_url:
            print("Captcha page encountered, please verify manually")
            print(page_url)
            return False
        # find the CSS link that declares the obfuscated fonts
        r1 = r'<link rel="stylesheet" type="text/css" href="(.*?)">'
        jia_mi_font_link = [x for x in re.findall(r1, html.text, re.S) if 'svgtextcss' in x]
        dict_secret_key_value = {}
        if jia_mi_font_link:
            jia_mi_font_link_href = "http:" + jia_mi_font_link[0]
            jia_mi_css_text = requests.get(jia_mi_font_link_href).text  # fetch the font CSS
            # collect the woff font file URLs
            woff_url = re.findall(r'(//s3plus\.meituan\.net/.{,100}?woff)', jia_mi_css_text)
            secret_href = ["http:" + x for x in set(woff_url)]
            print("font files ==>", secret_href)
            list_secret = []
            for x in secret_href:
                file_name = x[x.rfind("/") + 1:]  # font file name
                print("font file:", file_name)
                if os.path.exists(doc_path + "/" + file_name):
                    print("already downloaded:", file_name)
                else:
                    content = requests.get(x).content  # download the font file
                    with open(doc_path + "/" + file_name, "wb") as f:
                        f.write(content)
                list_secret.append(self.font_convert(doc_path + "/" + file_name))  # build the code-to-character map
            for x in list_secret:
                print("==>", x)
                dict_secret_key_value.update(x)  # merged decoding dictionary
        print(dict_secret_key_value)
        print()
        str_html_base = html.text
        for k, v in dict_secret_key_value.items():
            str_html_base = str_html_base.replace(k, v)  # replace each obfuscated code with its character
        # print(str_html_base)
        print()
        doc = pq(str_html_base)
        div_li = doc("#shop-all-list > ul > li").items()
        list_shop_msg = []
        for x in div_li:
            shop_li = x("div.txt > div.tit > a").attr("href")                                            # shop link
            shop_id = x("div.txt > div.tit > a").attr("data-shopid")                                     # shop ID: http://www.dianping.com/shop/<shop ID>
            shop_name = x("div.txt > div.tit > a").attr("title")                                         # shop name
            shop_star = x("div.txt > div.comment > div > div.star_score.star_score_sml").text()          # star rating
            shop_recommend_temp = x("div.txt > div.recommend").text()                                    # recommended dishes
            # default to an empty list so a shop without recommendations does not reuse the previous shop's value
            shop_recommend = shop_recommend_temp.replace("推薦菜: ", "").split(" ") if shop_recommend_temp else []
            shop_total = x("div.txt > div.comment > a.review-num").text().replace("\n", "").replace("條點評", "")  # number of reviews
            shop_avg = x("div.txt > div.comment > a.mean-price").text().replace("\n", "").replace("人均 ¥", "")   # average spend per person
            shop_tag = x("div.txt > div.tag-addr > a:nth-child(1) > span.tag").text().replace("\n", "")            # category
            shop_area = x("div.txt > div.tag-addr > a:nth-child(3) > span").text().replace("\n", "")               # business district
            shop_address = x("div.txt > div.tag-addr > span").text().replace("\n", "")                             # shop address
            shop_taste = x("div.txt > span > span:nth-child(1)").text().replace("\n", "").replace("口味", "")      # taste score
            shop_environment = x("div.txt > span > span:nth-child(2)").text().replace("\n", "").replace("環境", "")  # environment score
            shop_server = x("div.txt > span > span:nth-child(3)").text().replace("\n", "").replace("服務", "")       # service score
            dict_shop={"ID":shop_id,"shopLi":shop_li,"shopName":shop_name,"shopStar":shop_star,"shopRecommend":shop_recommend,"shopTotal":shop_total,"shopAvg":shop_avg,"shopTag":shop_tag,"shopArea":shop_area,"shopAddress":shop_address,"shopTaste":shop_taste,"shopEnvironment":shop_environment,"shopServer":shop_server}
            msg = json.dumps(dict_shop, ensure_ascii=False)
            list_shop_msg.append(dict_shop)
            print(msg)
            print("-" * 50)
        print(json.dumps(list_shop_msg, ensure_ascii=False))

    def font_convert(self, file_name):
        """
        Parse a downloaded web font and return the mapping from its codes to characters
        :param file_name: path of the obfuscated woff font file
        :return: {'&#xe105;': '2'}
        """
        font = TTFont(file_name)  # open the obfuscated font file
        codeList = font.getGlyphOrder()[2:]  # glyph names, skipping the first two entries
        # draw on a canvas
        im = Image.new("RGB", (1800, 1000), (255, 255, 255))
        dr = ImageDraw.Draw(im)
        font = ImageFont.truetype(file_name, 40)
        count = 15
        list_img = numpy.array_split(codeList, count)  # split the list into 15 parts so the glyphs are drawn on separate rows
        for t in range(count):
            newList = [i.replace("uni", "\\u") for i in list_img[t]]
            text = "".join(newList)
            text = text.encode('utf-8').decode('unicode_escape')
            dr.text((0, 50 * t), text, font=font, fill="#000000")
        im.save(file_name.replace(".woff", "") + ".jpg")  # save the image locally so it can be opened and checked by hand
        im = Image.open(file_name.replace(".woff", "") + ".jpg")
        testdata_dir_config = '--tessdata-dir "D:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'  # tessdata path for OCR; not needed if the path is already in the system environment variables
        result = pytesseract.image_to_string(im, config=testdata_dir_config, lang="chi_sim")  # lang="chi_sim" selects Simplified Chinese
        # print("===>",result)
        result = result.replace(" ", "").replace("\n", "")  # the OCR output contains spaces and newlines
        codeList = [i.replace("uni", "&#x") + ";" for i in codeList]  # Dianping's rule: the glyph-name prefix "uni" becomes "&#x"
        return dict(zip(codeList, list(result)))  # mapping from code to character, e.g. {'&#xe105;': '2'}

    def run(self, page_num:int):
        for i in range(1, page_num+1):
            # city URL pattern: www.dianping.com/<city pinyin>/ch10 (food)/p1 (page number)
            self.get("http://www.dianping.com/xian/ch10/p"+str(i))


if __name__ == '__main__':
    dzr = DaZhongFoodList()
    dzr.run(1)
    # print(dzr.fontConvert())

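The code is for study and reference only and must not be used commercially; any consequences are the user's own responsibility. Please credit the source when reposting.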

 
