Scraping Font-Encrypted Rating Data for Popular Restaurants in Each City on Dianping (大众点评)

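    I have written two earlier posts with scraping code for Dianping, but the site has long since changed its anti-scraping strategy. I recently looked at the new mechanism and worked out a small workaround. First, a recap of how Dianping's front end used to obfuscate text: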

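     The text was rendered by loading an SVG image and using CSS to set the background coordinates of that sprite image, so each visible character was a slice of the SVG. To decode it, you work backwards from the CSS offsets to the characters in the SVG and recover the mapping. For details, see the earlier post "大众点评评论抓取-加密评论信息完整抓取" (Dianping review scraping: fully extracting encrypted review text).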
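The reverse calculation for the old SVG scheme can be sketched roughly as follows. This is only an illustration under assumed conditions: it supposes the SVG contains rows of text at known baseline y-coordinates, that characters sit on a fixed-width grid, and that the CSS supplies negative background offsets. The function name, the `rows` structure, and the sample values are all hypothetical and not taken from the earlier post.

```python
from bisect import bisect_left


def lookup_svg_char(x_offset, y_offset, rows, font_size):
    """Return the character that a CSS background offset points at.

    rows: list of (baseline_y, text) tuples sorted by baseline_y, one per
          text row in the SVG sprite (assumed layout).
    x_offset, y_offset: background-position values in px (usually negative).
    font_size: assumed fixed width of one character in px.
    """
    baselines = [y for y, _ in rows]
    # Pick the first row whose baseline is at or below the vertical offset.
    row_index = bisect_left(baselines, abs(y_offset))
    _, text = rows[row_index]
    # Characters are assumed to be laid out on a fixed-width grid.
    col_index = int(abs(x_offset) // font_size)
    return text[col_index]


# Made-up sprite with two rows of six characters each.
sample_rows = [(30, "abcdef"), (60, "ghijkl")]
print(lookup_svg_char(-28.0, -41.0, sample_rows, 14))  # prints "i"
```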

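     There are also two posts analysing Dianping's rankings. They cover the old "popular rankings" page; I can no longer find an entry point to it, but the old ranking pages still work, although only for a small number of cities nationwide. For reference: "大众点评各城市热门餐厅数据爬虫抓取" and "大众点评热门餐厅抓取与数据分析".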

Scraping and decoding the Dianping list pages for each city

First, a look at how the data is encrypted:

  1. Using Dianping's Xi'an site as the example, we scrape the food category;
Figure: entry page (home page)

    2. The food list page. Note how the URL is built: www.dianping.com/{city pinyin}/ch10 (food)/p1 (page number);

Figure: the list page to be scraped
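The URL pattern above can be captured in a small helper. This is a minimal sketch: the function name and the `channel` parameter are illustrative, and only the `ch10` (food) channel is confirmed by this post.

```python
def build_list_url(city_pinyin: str, page: int, channel: str = "ch10") -> str:
    """Build a list-page URL of the form www.dianping.com/<city>/<channel>/p<page>.

    channel defaults to "ch10", the food category used in this post.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    return f"http://www.dianping.com/{city_pinyin}/{channel}/p{page}"


print(build_list_url("xian", 1))  # http://www.dianping.com/xian/ch10/p1
```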

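    3. On the page, the review count, the shop address, and the individual rating scores are all encrypted. In the element inspector each of them shows up as "口" (an empty box). The CSS shows that a custom Dianping font is applied, and the link to its stylesheet appears on the far right of the styles panel; its contents are examined in the following steps;

Figure: encrypted fields on the list page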

 

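    4.  In the page source, every "口" is an entity such as &#xe89c, all beginning with "&#x". On their own these tell us nothing, but combined with the CSS from the previous figure it is clear that the page loads its own font;

Figure: encrypted content in the page source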
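To get a feel for how many distinct encrypted codes a page uses, the entities can be pulled out of the raw HTML with a regular expression. This is a small sketch; the sample HTML is made up, and it assumes the entities are four hex digits terminated by a semicolon, as in the source code later in this post.

```python
import re

# Matches numeric entities such as "&#xe89c;" and captures the hex code.
ENTITY_RE = re.compile(r"&#x([0-9a-fA-F]{4});")


def find_entity_codes(html: str) -> list:
    """Return the sorted, de-duplicated hex codes of all entities in html."""
    return sorted({code.lower() for code in ENTITY_RE.findall(html)})


sample_html = "<b>&#xe89c;&#xf0a1;</b><span>&#xe89c;</span>"
print(find_entity_codes(sample_html))  # ['e89c', 'f0a1']
```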

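    5. The CSS lists the encrypted font link for each tag class. In Chrome, each class has a custom font with a .woff suffix; those are the links we take. Every one of them has to be parsed before the text can be decoded;

Figure: font links in the encryption CSS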
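Extracting the .woff links from the stylesheet can be sketched as below. The sample CSS, the `example.com` URLs, and the font-family names are invented for illustration; the real stylesheet may be laid out differently, so treat the regular expressions as a starting point rather than a description of Dianping's actual CSS.

```python
import re

FONT_FACE_RE = re.compile(r"@font-face\s*\{(.*?)\}", re.S)
FAMILY_RE = re.compile(r'font-family:\s*"?([^";]+)')
WOFF_RE = re.compile(r'url\("?([^")]+\.woff)"?\)')


def extract_woff_urls(css_text: str, scheme: str = "http:") -> dict:
    """Map each @font-face family name to its .woff URL."""
    fonts = {}
    for body in FONT_FACE_RE.findall(css_text):
        family = FAMILY_RE.search(body)
        woff = WOFF_RE.search(body)
        if family and woff:
            url = woff.group(1)
            # Protocol-relative links ("//host/...") need a scheme prepended.
            if url.startswith("//"):
                url = scheme + url
            fonts[family.group(1).strip()] = url
    return fonts


sample_css = (
    '@font-face{font-family: "demo-shopNum";'
    'src:url("//example.com/fonts/aaaa.eot");'
    'src:url("//example.com/fonts/aaaa.woff");}'
    '@font-face{font-family: "demo-address";'
    'src:url("//example.com/fonts/bbbb.woff");}'
)
print(extract_woff_urls(sample_css))
```

Keeping the family name next to each URL makes it possible to tell later which font belongs to which CSS class.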

    6. The font files themselves are visible under Font in the Network panel; clicking a link downloads the file;

Figure: the encrypted .woff font file

 

 

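      7. Opening the font file reveals the actual characters used on the page. A note on how the font is built: it stores a large number of coordinate points, that is, each glyph is drawn point by point, so reconstructing the characters from the outlines by hand would be a lot of work. Look at the label above each glyph, though: it corresponds to the entity in the page source with the "&#x" prefix removed. So once we have the label-to-character mapping, we can decode the page by substitution. The only difficulty is that the glyphs are drawn rather than stored as text, and Dianping keeps updating the font dynamically, so although we can see the contents we cannot simply copy and paste them. Hence the following approach: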

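  1) Read the woff file with TTFont;
  2) Draw the glyphs onto an image with the Image module, which is effectively taking a screenshot;
  3) Read the characters back with OCR;
  4) Pair keys and values to look up the character for each code.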
font = TTFont(file_name)  # open the encrypted font file
codeList = font.getGlyphOrder()[2:]
# draw on a canvas
im = Image.new("RGB", (1800, 1000), (255, 255, 255))
dr = ImageDraw.Draw(im)
font = ImageFont.truetype(file_name, 40)
count = 15
list_img = numpy.array_split(codeList, count)  # split the list into 15 chunks so they can be drawn as separate rows on the image
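Before drawing, each glyph name has to be turned into the character it stands for, and the names have to be grouped into rows. The sketch below shows one way to do both without NumPy. It assumes glyph names have the form `uni` followed by a hex code point, which is what the full source code in this post relies on; the helper names are illustrative.

```python
def glyph_name_to_char(name: str) -> str:
    """Convert a glyph name such as 'uniE89C' into the character U+E89C."""
    if not name.startswith("uni"):
        raise ValueError(f"unexpected glyph name: {name!r}")
    return chr(int(name[3:], 16))


def split_into_rows(glyph_names, row_count):
    """Join glyph characters into row_count strings of nearly equal length.

    The first rows receive the extra items, like numpy.array_split.
    """
    base, extra = divmod(len(glyph_names), row_count)
    rows, start = [], 0
    for i in range(row_count):
        size = base + (1 if i < extra else 0)
        chunk = glyph_names[start:start + size]
        rows.append("".join(glyph_name_to_char(n) for n in chunk))
        start += size
    return rows


print(split_into_rows(["uni0041", "uni0042", "uni0043"], 2))  # ['AB', 'C']
```

Each returned string can then be drawn on its own line of the image, for example with `ImageDraw.text`.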
Figure: the encrypted font rendered as an image

 

Figure: the characters after OCR recognition
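Pairing the glyph names with the OCR output is a positional zip, so a single missed or extra character from the OCR step would shift every later entry. The sketch below adds a length check for that case; the check is my own addition and the function name is hypothetical.

```python
def build_mapping(glyph_names, recognized_text):
    """Pair glyph names with OCR output, producing {'&#xe89c;': '1', ...}.

    Whitespace in the OCR output is dropped before pairing. A length
    mismatch raises instead of silently producing a shifted mapping.
    """
    chars = "".join(recognized_text.split())
    if len(chars) != len(glyph_names):
        raise ValueError(
            f"OCR returned {len(chars)} characters for {len(glyph_names)} glyphs"
        )
    return {
        name.replace("uni", "&#x", 1) + ";": ch
        for name, ch in zip(glyph_names, chars)
    }


print(build_mapping(["unie89c", "unif0a1"], "1 2\n"))
# {'&#xe89c;': '1', '&#xf0a1;': '2'}
```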

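      8. The last step is to pair each recognised character with its encrypted label and replace the encrypted entities in the page source, after which the data can be read normally.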
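The substitution itself can be done in a single pass with a regular expression rather than one `str.replace` call per entry. This is a sketch with made-up sample data, and it leaves any entity that is not in the mapping untouched.

```python
import re

ENTITY_PATTERN = re.compile(r"&#x[0-9a-fA-F]{4};")


def decode_html(html: str, mapping: dict) -> str:
    """Replace every known entity in html with its decoded character."""
    return ENTITY_PATTERN.sub(lambda m: mapping.get(m.group(0), m.group(0)), html)


sample_mapping = {"&#xe89c;": "4", "&#xf0a1;": "8"}
print(decode_html("<b>&#xe89c;.&#xf0a1;&#xe000;</b>", sample_mapping))
# <b>4.8&#xe000;</b>
```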

The final scraped data for the shops on the first page:

[
    {
        "ID":"jfP5B6BhSEijjpRY",
        "shopLi":"http://www.dianping.com/shop/jfP5B6BhSEijjpRY",
        "shopName":"蓝田印象",
        "shopStar":"4.86",
        "shopRecommend":[
            "蓝田饸饹",
            "桂花糯米糕",
            "油饼"
        ],
        "shopTotal":"871 ",
        "shopAvg":"43",
        "shopTag":"其他美食",
        "shopArea":"北宝路",
        "shopAddress":"北环路蓝丰家园对面",
        "shopTaste":"4.86",
        "shopEnvironment":"4.85",
        "shopServer":"4.84"
    },
    {
        "ID":"l8VFgvYfgSIaQvz7",
        "shopLi":"http://www.dianping.com/shop/l8VFgvYfgSIaQvz7",
        "shopName":"长安壹号",
        "shopStar":"4.78",
        "shopRecommend":[
            "长安葫芦鸡",
            "麻什",
            "太宗吊烧肉"
        ],
        "shopTotal":"2474",
        "shopAvg":"222",
        "shopTag":"陕菜",
        "shopArea":"省体育场",
        "shopAddress":"长安北路1号",
        "shopTaste":"4.68",
        "shopEnvironment":"4.89",
        "shopServer":"4.81"
    },
    {
        "ID":"G9VBBvcEp6puNgUy",
        "shopLi":"http://www.dianping.com/shop/G9VBBvcEp6puNgUy",
        "shopName":"旺顺阁鱼头泡饼(悦荟广场店)",
        "shopStar":"4.64",
        "shopRecommend":[
            "经典鱼头泡饼",
            "手工现烙饼",
            "芝士焗红薯"
        ],
        "shopTotal":"803",
        "shopAvg":"116",
        "shopTag":"京菜",
        "shopArea":"民可园",
        "shopAddress":"解放路116号悦荟广场L606a",
        "shopTaste":"4.61",
        "shopEnvironment":"4.86",
        "shopServer":"4.82"
    },
    {
        "ID":"H3pvCM708cM764Z5",
        "shopLi":"http://www.dianping.com/shop/H3pvCM708cM764Z5",
        "shopName":"荣宴·中餐厅",
        "shopStar":"4.87",
        "shopRecommend":[
            "鲁式葱烧海参",
            "国宴开水白菜",
            "佛跳墙"
        ],
        "shopTotal":"151 ",
        "shopAvg":"1319",
        "shopTag":"创校菜",
        "shopArea":"当新路沿乐",
        "shopAddress":"高新二路与科技二路什字东丹轩梓园北门(农业银行二楼)",
        "shopTaste":"4.85",
        "shopEnvironment":"4.89",
        "shopServer":"4.9"
    },
    {
        "ID":"l1uuIVrie5SV9nAh",
        "shopLi":"http://www.dianping.com/shop/l1uuIVrie5SV9nAh",
        "shopName":"糊涂记(新城广场店)",
        "shopStar":"4.82",
        "shopRecommend":[
            "葫芦鸡",
            "高陵油饼",
            "关中四宝"
        ],
        "shopTotal":"3820",
        "shopAvg":"64",
        "shopTag":"陕菜",
        "shopArea":"钟楼/鼓楼",
        "shopAddress":"南新街8号路西",
        "shopTaste":"4.84",
        "shopEnvironment":"4.82",
        "shopServer":"4.69"
    },
    {
        "ID":"H1Lvg3y9EaOTV086",
        "shopLi":"http://www.dianping.com/shop/H1Lvg3y9EaOTV086",
        "shopName":"张老板的店(民乐园店)",
        "shopStar":"4.87",
        "shopRecommend":[
            "麻麻面",
            "霸气双拼披萨",
            "椒麻牛肚"
        ],
        "shopTotal":"2093",
        "shopAvg":"89",
        "shopTag":"特色菜",
        "shopArea":"民可园",
        "shopAddress":"解放路111号民乐园万达步行街11号楼10101铺",
        "shopTaste":"4.86",
        "shopEnvironment":"4.89",
        "shopServer":"4.88"
    },
    {
        "ID":"Ga60yrRErPAWOD8D",
        "shopLi":"http://www.dianping.com/shop/Ga60yrRErPAWOD8D",
        "shopName":"长安大牌档之长安集市(赛格旗舰店)",
        "shopStar":"4.70",
        "shopRecommend":[
            "长安葫芦鸡",
            "豆皮涮牛肚锅",
            "醪糟冰淇淋"
        ],
        "shopTotal":"35020",
        "shopAvg":"81",
        "shopTag":"陕菜",
        "shopArea":"小寨",
        "shopAddress":"小寨东路赛格国际购物中心6楼西北角",
        "shopTaste":"4.61",
        "shopEnvironment":"4.84",
        "shopServer":"4.62"
    },
    {
        "ID":"lazoXjXc4sGBvSPp",
        "shopLi":"http://www.dianping.com/shop/lazoXjXc4sGBvSPp",
        "shopName":"和悦和牛火锅(迈科中心店)",
        "shopStar":"4.93",
        "shopRecommend":[
            "5A三角牛腩和牛粒",
            "5A和牛上脑",
            "招牌松茸菌汤底"
        ],
        "shopTotal":"213",
        "shopAvg":"598",
        "shopTag":"打边炉/港式火锅",
        "shopArea":"丈八",
        "shopAddress":"锦业路12号迈科中心A座1楼",
        "shopTaste":"4.93",
        "shopEnvironment":"4.93",
        "shopServer":"4.93"
    },
    {
        "ID":"Eg7Os5JRBOLW8Xnu",
        "shopLi":"http://www.dianping.com/shop/Eg7Os5JRBOLW8Xnu",
        "shopName":"胖子甑糕",
        "shopStar":"4.79",
        "shopRecommend":[
            "甑糕",
            "枣泥",
            "蜜枣"
        ],
        "shopTotal":"2574",
        "shopAvg":"8",
        "shopTag":"小吃",
        "shopArea":"莲湖公园",
        "shopAddress":"洒金桥路与劳武巷交叉口杨天玉腊牛羊肉店旁",
        "shopTaste":"4.69",
        "shopEnvironment":"4.16",
        "shopServer":"4.67"
    },
    {
        "ID":"k4MQcT0ou69m0Ult",
        "shopLi":"http://www.dianping.com/shop/k4MQcT0ou69m0Ult",
        "shopName":"爱骅裤带面馆(总店)",
        "shopStar":"4.89",
        "shopRecommend":[
            "biangbiang面",
            "油泼面",
            "蘸水面"
        ],
        "shopTotal":"1609",
        "shopAvg":"16",
        "shopTag":"面馆",
        "shopArea":"钟楼/鼓楼",
        "shopAddress":"东木头市19号(秦豫肉夹馍东隔壁)",
        "shopTaste":"4.9",
        "shopEnvironment":"4.58",
        "shopServer":"4.85"
    },
    {
        "ID":"l8qbUQaSQNjSLD2i",
        "shopLi":"http://www.dianping.com/shop/l8qbUQaSQNjSLD2i",
        "shopName":"陕拾叁(鼓楼店)",
        "shopStar":"4.87",
        "shopRecommend":[
            "醪糟味冰淇淋",
            "秦酥",
            "豆腐冰淇淋"
        ],
        "shopTotal":"9418",
        "shopAvg":"32",
        "shopTag":"冰淇淋",
        "shopArea":"钟楼/鼓楼",
        "shopAddress":"北院门270号",
        "shopTaste":"4.86",
        "shopEnvironment":"4.87",
        "shopServer":"4.87"
    },
    {
        "ID":"k4PFL1AksZDcU3a8",
        "shopLi":"http://www.dianping.com/shop/k4PFL1AksZDcU3a8",
        "shopName":"爷们儿泥炉烤肉",
        "shopStar":"4.88",
        "shopRecommend":[
            "品厚切五花肉",
            "秘制梅花肉",
            "调味澳洲肥牛"
        ],
        "shopTotal":"762",
        "shopAvg":"85",
        "shopTag":"融合烤肉",
        "shopArea":"钟楼/鼓楼",
        "shopAddress":"东县门与饮马池十字路东",
        "shopTaste":"4.88",
        "shopEnvironment":"4.81",
        "shopServer":"4.9"
    },
    {
        "ID":"H6oZsmKfP21fMtVy",
        "shopLi":"http://www.dianping.com/shop/H6oZsmKfP21fMtVy",
        "shopName":"阳坊大都涮羊肉",
        "shopStar":"3.48",
        "shopRecommend":[
            "苏尼特肥羊",
            "软切羊肉",
            "大都招牌肉"
        ],
        "shopTotal":"21 ",
        "shopAvg":"135",
        "shopTag":"老北京火锅",
        "shopArea":"丈八",
        "shopAddress":"高新六路CROSS万象汇8号楼2层",
        "shopTaste":"3.72",
        "shopEnvironment":"3.92",
        "shopServer":"3.77"
    },
    {
        "ID":"l24zZ7Ak8q6L2dhU",
        "shopLi":"http://www.dianping.com/shop/l24zZ7Ak8q6L2dhU",
        "shopName":"醉长安(钟楼旗舰店)",
        "shopStar":"4.83",
        "shopRecommend":[
            "老陕葫芦鸡",
            "晾衣毛肚",
            "妙笔生花"
        ],
        "shopTotal":"8462",
        "shopAvg":"83",
        "shopTag":"陕菜",
        "shopArea":"钟楼/鼓楼",
        "shopAddress":"竹笆市鼓楼向南200米美豪丽致酒店1楼2楼",
        "shopTaste":"4.75",
        "shopEnvironment":"4.88",
        "shopServer":"4.87"
    },
    {
        "ID":"G7L8e4Z2Oph1epU7",
        "shopLi":"http://www.dianping.com/shop/G7L8e4Z2Oph1epU7",
        "shopName":"莲花餐饮(朱雀店)",
        "shopStar":"4.88",
        "shopRecommend":[
            "紫阳蒸盆子",
            "安康吊炉芝麻烧饼",
            "清蒸鸭嘴鱼"
        ],
        "shopTotal":"2519",
        "shopAvg":"109",
        "shopTag":"陕菜",
        "shopArea":"省体育场",
        "shopAddress":"朱雀大街中段1号",
        "shopTaste":"4.85",
        "shopEnvironment":"4.89",
        "shopServer":"4.88"
    }
]

Source code

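Notes on the source: Dianping's font changes dynamically, so the scraper has to keep requesting the current font; you could also work out the update pattern and refresh on a schedule. The code has one flaw: when decoding, I replace the encrypted entities globally. Strictly, each entity should be replaced according to the CSS class of its element (for example tagName), using that class's own font. This post focuses on the decoding itself and does not go into that level of detail.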
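The per-class replacement mentioned above could look roughly like this. It is only a sketch: it assumes each encrypted character is wrapped in a tag of the form `<svgmtsi class="...">`, and that a separate mapping has already been built for each class from that class's font. The tag name, the class names, and the sample data are assumptions, not something this post verifies.

```python
import re

# Assumed markup: <svgmtsi class="shopNum">&#xe89c;</svgmtsi>
SVG_TAG_RE = re.compile(
    r'<svgmtsi class="([^"]+)">(&#x[0-9a-fA-F]{4};)</svgmtsi>'
)


def decode_by_class(html: str, mappings_by_class: dict) -> str:
    """Replace each wrapped entity using the mapping of its own CSS class.

    Tags with an unknown class or an unknown entity are left unchanged.
    """
    def replace(match):
        css_class, entity = match.group(1), match.group(2)
        mapping = mappings_by_class.get(css_class, {})
        return mapping.get(entity, match.group(0))

    return SVG_TAG_RE.sub(replace, html)


sample = (
    '<svgmtsi class="shopNum">&#xe89c;</svgmtsi>'
    '<svgmtsi class="address">&#xe89c;</svgmtsi>'
)
mappings = {"shopNum": {"&#xe89c;": "4"}, "address": {"&#xe89c;": "路"}}
print(decode_by_class(sample, mappings))  # 4路
```

The same entity code can then decode to different characters depending on which font its class uses, which a global replacement cannot express.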

#!/usr/bin/python3  
# encoding: utf-8  
""" 
@version: v1.0 
@author: W_H_J 
@license: Apache Licence  
@contact: [email protected] 
@software: PyCharm 
@file: dazhongFoodList.py
@time: 2020/6/17 10:12
@describe: food listings from the Dianping list pages of each city
To go beyond page 10 you need to log in and add the cookie manually
"""
import json
import random
import re
import sys
import os
import numpy
import pytesseract
from PIL import Image, ImageDraw, ImageFont
from pyquery import PyQuery as pq
import requests
from fontTools.ttLib import TTFont
sys.path.append(os.path.abspath(os.path.dirname(__file__) + '/' + '..'))
sys.path.append("..")
doc_path = "./secretDoc"  # folder where the downloaded woff fonts are stored
if not os.path.exists(doc_path): os.mkdir(doc_path)


class DaZhongFoodList:
    def __init__(self):
        self.USER_AGENT_LIST = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5"]

    def get(self, url):
        head = {
            'User-Agent': random.choice(self.USER_AGENT_LIST),  # pick one at random
            'Host': 'www.dianping.com',
            'Cookie': 'navCtgScroll=0; navCtgScroll=0; _lxsdk_cuid=171f22248aac8-0997e5681a77d9-376b4502-1fa400-171f22248abc8; _lxsdk=171f22248aac8-0997e5681a77d9-376b4502-1fa400-171f22248abc8; _hc.v=e8d6b5d2-6fac-becb-d1e9-8f2d9e0a75e3.1588905266; cye=xian; _dp.ac.v=205fd5cc-9ba6-4b29-932d-cc13b3e6244f; ua=dpuser_5832767585; ctu=a9f247ab89a4ee779f162d4b6923fc08fb12285c5ae4076901c1d499b665bee2; s_ViewType=10; fspop=test; Hm_lvt_602b80cf8079ae6591966cc70a3940e7=1591178871,1591179190,1591595589,1592359696; _lx_utm=utm_source%3DBaidu%26utm_medium%3Dorganic; logan_session_token=4zjh7es2zv6d3liq9tvn; logan_custom_report=; default_ab=shopList%3AA%3A5; cityid=17; pvhistory="6L+U5ZuePjo8L3N1Z2dlc3QvZ2V0SnNvbkRhdGE/Y2FsbGJhY2s9anNvbnBfMTU5MjM2Mjg5MTgyMl80ODIyNz46PDE1OTIzNjI4OTA0MzFdX1s="; m_flash2=1; PHOENIX_ID=0a49a8ba-172c03ac912-85fcc; _tr.u=msMl25dgCdFtGAsS; _tr.s=oLyNfoSoVWA7xm17; cy=17; _lxsdk_s=172c13e1aa7-a9a-f22-225%7C%7C20; Hm_lpvt_602b80cf8079ae6591966cc70a3940e7=1592379974'
        }
        html = requests.get(url, headers=head)
        html.encoding = "UTF-8"
        print("STATUS:==>", html)
        # print(html.headers)
        page_url = html.url
        if 'verify' in page_url:
            print("A CAPTCHA page was returned; please complete the verification")
            print(page_url)
            return False
        # find the CSS link for the encrypted fonts
        r1 = r'<link rel="stylesheet" type="text/css" href="(.*?)">'
        jia_mi_font_link = [x for x in re.findall(r1, html.text, re.S) if 'svgtextcss' in x]
        dict_secret_key_value = {}
        if jia_mi_font_link:
            jia_mi_font_link_href = "http:" + jia_mi_font_link[0]
            jia_mi_css_text = requests.get(jia_mi_font_link_href).text  # request the font CSS
            # collect the encrypted font files
            woff_url = re.findall(r'(//s3plus\.meituan\.net/.{,100}?woff)', jia_mi_css_text)
            secret_href = ["http:" + x for x in set(woff_url)]
            print("encrypted fonts ==>", secret_href)
            list_secret = []
            for x in secret_href:
                file_name = x[x.rfind("/") + 1:]  # font file name
                font_path = doc_path + "/" + file_name
                if os.path.exists(font_path):
                    print("font already downloaded:", file_name)
                else:
                    print("downloading font:", file_name)
                    content = requests.get(x).content  # download the font file
                    with open(font_path, "wb") as f:
                        f.write(content)
                list_secret.append(self.font_convert(font_path))  # decode this font
            for x in list_secret:
                print("==>", x)
                dict_secret_key_value.update(x)  # merged entity -> character dictionary
        print(dict_secret_key_value)
        print()
        str_html_base = html.text
        for k, v in dict_secret_key_value.items():
            str_html_base = str_html_base.replace(k, v)  # replace each encrypted entity with its decoded character
        # print(str_html_base)
        print()
        doc = pq(str_html_base)
        div_li = doc("#shop-all-list > ul > li").items()
        list_shop_msg = []
        for x in div_li:
            shop_li = x("div.txt > div.tit > a").attr("href")                                            # shop link
            shop_id = x("div.txt > div.tit > a").attr("data-shopid")                                     # shop ID: http://www.dianping.com/shop/<shop ID>
            shop_name = x("div.txt > div.tit > a").attr("title")                                         # shop name
            shop_star = x("div.txt > div.comment > div > div.star_score.star_score_sml").text()          # overall star score
            shop_recommend_temp = x("div.txt > div.recommend").text()                                    # recommended dishes
            # default to an empty list so a shop without recommendations does not reuse the previous shop's value
            shop_recommend = shop_recommend_temp.replace("推荐菜: ", "").split(" ") if shop_recommend_temp else []
            shop_total = x("div.txt > div.comment > a.review-num").text().replace("\n", "").replace("条点评", "")  # number of reviews
            shop_avg = x("div.txt > div.comment > a.mean-price").text().replace("\n", "").replace("人均 ¥", "")   # average spend per person
            shop_tag = x("div.txt > div.tag-addr > a:nth-child(1) > span.tag").text().replace("\n", "")            # category
            shop_area = x("div.txt > div.tag-addr > a:nth-child(3) > span").text().replace("\n", "")               # business district
            shop_address = x("div.txt > div.tag-addr > span").text().replace("\n", "")                             # shop address
            shop_taste = x("div.txt > span > span:nth-child(1)").text().replace("\n", "").replace("口味", "")      # taste score
            shop_environment = x("div.txt > span > span:nth-child(2)").text().replace("\n", "").replace("环境", "")  # environment score
            shop_server = x("div.txt > span > span:nth-child(3)").text().replace("\n", "").replace("服务", "")       # service score
            dict_shop={"ID":shop_id,"shopLi":shop_li,"shopName":shop_name,"shopStar":shop_star,"shopRecommend":shop_recommend,"shopTotal":shop_total,"shopAvg":shop_avg,"shopTag":shop_tag,"shopArea":shop_area,"shopAddress":shop_address,"shopTaste":shop_taste,"shopEnvironment":shop_environment,"shopServer":shop_server}
            msg = json.dumps(dict_shop, ensure_ascii=False)
            list_shop_msg.append(dict_shop)
            print(msg)
            print("-" * 50)
        print(json.dumps(list_shop_msg, ensure_ascii=False))

    def font_convert(self, file_name):
        """
        Parse a font file downloaded from the site and return the mapping from its codes to characters
        :param file_name: path of the encrypted woff font file
        :return: {'&#xe105;': '2'}
        """
        font = TTFont(file_name)  # open the encrypted font file
        codeList = font.getGlyphOrder()[2:]
        # draw on a canvas
        im = Image.new("RGB", (1800, 1000), (255, 255, 255))
        dr = ImageDraw.Draw(im)
        font = ImageFont.truetype(file_name, 40)
        count = 15
        list_img = numpy.array_split(codeList, count)  # split the list into 15 chunks so they can be drawn as separate rows on the image
        for t in range(count):
            newList = [i.replace("uni", "\\u") for i in list_img[t]]
            text = "".join(newList)
            text = text.encode('utf-8').decode('unicode_escape')
            dr.text((0, 50 * t), text, font=font, fill="#000000")
        im.save(file_name.replace(".woff", "") + ".jpg")  # save the image locally so it can be opened and checked by hand
        im = Image.open(file_name.replace(".woff", "") + ".jpg")
        testdata_dir_config = '--tessdata-dir "D:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'  # Tesseract data path; not needed if it is already set in the system environment variables
        result = pytesseract.image_to_string(im, config=testdata_dir_config, lang="chi_sim")  # lang="chi_sim" selects Simplified Chinese
        # print("===>",result)
        result = result.replace(" ", "").replace("\n", "")  # the OCR output contains spaces and line breaks
        codeList = [i.replace("uni", "&#x") + ";" for i in codeList]  # the page entity is the glyph name with "uni" replaced by "&#x"
        # note: zip pairs by position, so one missed or extra OCR character shifts every later entry
        return dict(zip(codeList, list(result)))  # mapping of the form {'&#xe105;': '2'}

    def run(self, page_num:int):
        for i in range(1, page_num+1):
            # list URL structure: www.dianping.com/{city pinyin}/ch10 (food)/p1 (page number)
            self.get("http://www.dianping.com/xian/ch10/p"+str(i))


if __name__ == '__main__':
    dzr = DaZhongFoodList()
    dzr.run(1)
    # print(dzr.fontConvert())

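The code is for learning and reference only and must not be used commercially; otherwise the user bears the consequences. Please credit the source when reposting.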

 
