How to scrape everything from Maoyan (movie info and actor info)

By "everything" I mainly mean the movie information and the actor information behind the entries in Maoyan's movie list.

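There are two hard parts when scraping it. First, font obfuscation (the mechanism seems to have been updated recently, and the methods found online no longer work). Second, Meituan's bot detection. Below I describe how I dealt with each.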

I. Font obfuscation

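Font obfuscation is covered widely online, and the approaches are all similar. Every time a Maoyan page loads, it dynamically loads a font with different glyph codes. The idea is to download one of its font files first (a file ending in .woff; how to get it is well documented online, so I won't repeat it here), and then, each time a new page is scraped, compare the glyph coordinates of the newly served font file with those of the file downloaded earlier.

Many of the methods online are quite old. Back then Maoyan's font obfuscation was crude, and it was enough to check whether the coordinates were exactly equal. Now the glyphs change slightly on every load. Many people say the variation stays within a range, so two glyphs are the same digit as long as their difference does not exceed a threshold. I tried this as well, but many digits were misrecognized: the three groups 3, 5, 9; 1, 4; and 0, 8 are easily confused. So the problem is not that simple.

My method is to take three baseline font files and compare the coordinates of each baseline with the dynamically loaded font. I also set a tolerance, count the points that fall within it, and treat the baseline glyph with the highest count as the matching digit. Errors still occurred after that. Then I noticed that the number of coordinate points differs between digits; 3, 5 and 9, for example, have clearly different point counts. Since the point count of a given digit also varies slightly from load to load, I added a check on the difference in point counts as well.

Another important detail is the choice of tolerance. Many posts online use 8 or 10, but, possibly because Maoyan changed its mechanism, these values give a low recognition rate. In my tests 5 gave the highest rate; you can experiment yourselves. Since there are three baselines and the recognition rate is already high after the steps above, I finish with a best-two-out-of-three vote to raise it further.

The code is long, so I only show the key parts. There is another method that may work but that I have not tried: the referenced blog post uses the kNN algorithm for recognition, and its recognition rate should be quite high. It does, however, require preparing more font files.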

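import re

from fontTools.ttLib import TTFont


class FontDecode(object):

    # response: page HTML that contains the font URL
    # res: text made of &#x....; entities that should be decoded into digits
    def replace_font(self, response, res):

        # Baselines: compare against three fonts; a result that at least two agree on is taken as correct
        # Author: "我要出家當道士" (any other copy is not the original)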
        base_font = TTFont('./fonts/base.woff')
        base_font.saveXML('./fonts/base_font.xml')
        base_dict = {'uniF870': '6', 'uniEE8C': '3', 'uniECDC': '7', 'uniE6A2': '1', 'uniF734': '5',
                     'uniF040': '9', 'uniEAE5': '0', 'uniF12A': '4', 'uniF2D2': '2', 'uniE543': '8'}
        base_list = base_font.getGlyphOrder()[2:]

        base_font2 = TTFont('./fonts/base2.woff')
        base_font2.saveXML('./fonts/base_font2.xml')
        base_dict2 = {'uniF230': '6', 'uniEBA1': '3', 'uniF517': '7', 'uniF1D2': '1', 'uniE550': '5',
                      'uniEBA4': '9', 'uniEB7A': '0', 'uniEC29': '4', 'uniF7E1': '2', 'uniF6B7': '8'}
        base_list2 = base_font2.getGlyphOrder()[2:]

        base_font3 = TTFont('./fonts/base3.woff')
        base_font3.saveXML('./fonts/base_font3.xml')
        base_dict3 = {'uniF8D3': '6', 'uniF0C9': '3', 'uniEF09': '7', 'uniE9FD': '1', 'uniE5B7': '5',
                      'uniF4DE': '9', 'uniF4F9': '0', 'uniE156': '4', 'uniE9B5': '2', 'uniEC6D': '8'}
        base_list3 = base_font3.getGlyphOrder()[2:]

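        # Font dynamically loaded by the site
        font_file = re.findall(r'vfile\.meituan\.net\/colorstone\/(\w+\.woff)', response)[0]
        font_url = 'http://vfile.meituan.net/colorstone/' + font_file
        # get_html is a helper of this class that is not shown in this post;
        # it is expected to return a response object with a .content attribute
        new_file = self.get_html(font_url)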
        with open('./fonts/new.woff', 'wb') as f:
            f.write(new_file.content)
        new_font = TTFont('./fonts/new.woff')
        new_font.saveXML('./fonts/new_font.xml')
        new_list = new_font.getGlyphOrder()[2:]


        # Collect the (x, y) outline coordinates of every glyph in each font
        coordinate_list1 = [list(base_font['glyf'][name].coordinates) for name in base_list]
        coordinate_list2 = [list(base_font2['glyf'][name].coordinates) for name in base_list2]
        coordinate_list3 = [list(base_font3['glyf'][name].coordinates) for name in base_list3]
        coordinate_list4 = [list(new_font['glyf'][name].coordinates) for name in new_list]

        new_dict = {}
        bases = [
            (coordinate_list1, base_list, base_dict),
            (coordinate_list2, base_list2, base_dict2),
            (coordinate_list3, base_list3, base_dict3),
        ]
        for index2, name2 in enumerate(coordinate_list4):  # glyphs of the dynamic font
            results = []
            for coords, names, mapping in bases:  # local baseline fonts
                best = -1
                result = ""
                for index1, name1 in enumerate(coords):
                    same = self.compare(name1, name2)
                    if same > best:
                        best = same
                        result = mapping[names[index1]]
                results.append(result)
            result1, result2, result3 = results

            # Best two out of three; if all three disagree, fall back to the first baseline
            new_dict[new_list[index2]] = result2 if result2 == result3 else result1

        for i in new_list:
            pattern = i.replace('uni', '&#x').lower() + ';'
            res = res.replace(pattern, new_dict[i])
        return res


    """
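    Input: the coordinate lists of two glyphs
    Output: their similarity (the number of matching points), or -1 if the point counts differ too much
    """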
    def compare(self, c1, c2):
        # Glyphs whose point counts differ by more than 7 are not treated as the same digit
        if abs(len(c2) - len(c1)) > 7:
            return -1
        # Count the points whose x and y both differ by less than 5;
        # zip stops at the shorter of the two lists
        count = 0
        for (x1, y1), (x2, y2) in zip(c1, c2):
            if abs(x1 - x2) < 5 and abs(y1 - y2) < 5:
                count += 1
        return count
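
To make the matching logic easier to try without font files, here is a minimal, self-contained sketch of the same idea on made-up coordinates: a tolerance-based point count, a point-count check, and a best-two-out-of-three vote. The function names (`similarity`, `best_digit`, `vote`) and the sample outlines are illustrative assumptions, not part of my scraper; the thresholds 5 and 7 are the ones used in the code above.

```python
from collections import Counter


def similarity(c1, c2, tol=5, max_len_diff=7):
    """Count point pairs whose x and y both differ by less than tol.

    Returns -1 when the two outlines differ by more than max_len_diff points.
    """
    if abs(len(c1) - len(c2)) > max_len_diff:
        return -1
    return sum(
        1
        for (x1, y1), (x2, y2) in zip(c1, c2)
        if abs(x1 - x2) < tol and abs(y1 - y2) < tol
    )


def best_digit(glyph, baseline):
    """Return the digit of one baseline ({digit: coords}) most similar to glyph."""
    return max(baseline, key=lambda digit: similarity(baseline[digit], glyph))


def vote(glyph, baselines):
    """Majority vote over the baselines; without a majority the first one wins."""
    guesses = [best_digit(glyph, b) for b in baselines]
    digit, count = Counter(guesses).most_common(1)[0]
    return digit if count >= 2 else guesses[0]


# Made-up outlines: each "digit" is a short list of (x, y) points.
base_a = {"1": [(0, 0), (10, 100), (20, 0)],
          "7": [(0, 100), (60, 100), (20, 0), (30, 50)]}
base_b = {"1": [(1, 2), (11, 98), (19, 1)],
          "7": [(2, 99), (58, 101), (22, 1), (31, 49)]}
base_c = {"1": [(0, 1), (9, 101), (21, 2)],
          "7": [(1, 100), (61, 99), (19, 2), (29, 51)]}

unknown = [(2, 1), (12, 99), (18, 3)]  # a slightly shifted "1"
print(vote(unknown, [base_a, base_b, base_c]))
```

With real fonts, each baseline dictionary would be built from the glyph coordinates read with fontTools, as in replace_font above.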

II. Meituan anti-scraping

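There are also plenty of blog posts about Meituan's anti-scraping measures but, as you probably know, they mostly copy one another and end up nearly identical. Still, there are some high-quality posts worth reading. My own skill is limited, and so is what I can offer as reference, so I only present the two methods I have used myself, with which I scraped the data I needed for 3,000 movies and 9,000 actors. I interleaved the two: plain requests, and the selenium automation tool (together with mitmproxy). requests is the fastest but is easily detected; selenium is the most stable and is hard to detect.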

1. Plain requests

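requests is still very useful for scraping Maoyan. The catch is that once Meituan detects you, a long cool-down is needed (I don't know exactly how long). Using requests with the cookie of an already logged-in session, I scraped the detail information of all the movies. Reference code follows. The most tedious part is really parsing the page source with XPath.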

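import time

import requests
from lxml import etree


class getFilmsData(object):

    def __init__(self):
        self.headers = {}
        self.headers['User-Agent'] = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.06'
        self.headers['Cookie'] = 'paste your cookie here (leave a comment if you are not sure how to get it and I will reply quickly)'
        # mongoManager and FontDecode are my own helper classes
        # (FontDecode provides the replace_font method shown in section I)
        self.dataManager = mongoManager()
        self.fontDecode = FontDecode()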

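    # Fetch the data for the given list-page url.
    # limitItem is the resume point: items before it on this page have already been scraped
    def getData(self,url,limitItem):

        s = requests.session()
        s.headers = self.headers
        s.keep_alive = False

        content = s.get(url).text

        # "驗證中心" ("verification center") is the text that marks Meituan's verification page
        if "驗證中心" in content:
            print("Meituan verification triggered on the list page")
            return False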
        sel = etree.HTML(content)

        count = 0

        urls = sel.xpath('//div[@class="movie-item"]')
        scores = sel.xpath('//div[@class="channel-detail channel-detail-orange"]')
        for box in urls:
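            # Take each movie's url from the list page
            count += 1
            if count < limitItem:
                continue
            print("begin ",count,"th item")
            scoreCheck = scores[count-1].xpath('.//text()')[0]
            # Movies without a score are not scraped ("暫無評分" means "no score yet");
            # the loop stops at the first such movie
            if scoreCheck == "暫無評分":
                break

            urlBack = box.xpath('.//a/@href')[0]
            # Build the movie detail url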
            url = "https://maoyan.com"+urlBack

            # Fetch the movie's name, runtime, release date, box office, score, cast and poster url
            resp = s.get(url)
            realUrl = resp.url
            res = resp.text
            if "驗證中心" in res:
                print("Meituan verification triggered on the detail page")
                return False
            selTmp = etree.HTML(res)
            # Box office
            money = selTmp.xpath('//div[@class="movie-index-content box"]/span[1]/text()')
            unit = selTmp.xpath('//div[@class="movie-index-content box"]/span[2]/text()')
            filmMoney = ""
            if len(money) == 0:
                # Movies without box office data are not scraped
                continue
            else:
                ascll = str(money[0])
                # money[0] holds the obfuscated characters; rebuild the "&#x....;" entity
                # string (for example "&#xf870;&#xe543;.&#xf040;") that replace_font expects
                utfs = str(ascll.encode('unicode_escape'))[1:].replace("'","").replace("\\\\u",";&#x").split('.')
                unicode = ""
                if len(utfs)>1:
                    unicode = utfs[0][1:]+";."+utfs[1][1:]+";"
                else:
                    unicode = utfs[0][1:]+";"
                filmMoney = self.fontDecode.replace_font(res,unicode)
                if len(unit) > 0:
                    filmMoney += unit[0]
            # Movie name
            filmName = selTmp.xpath('//div[@class="movie-brief-container"]/h1[1]/text()')[0]
            # Movie poster
            filmImage = selTmp.xpath('//div[@class="avatar-shadow"]/img[1]/@src')[0]
            # Runtime
            filmTime = selTmp.xpath('//div[@class="movie-brief-container"]/ul[1]/li[2]/text()')[0].replace('\n', '').replace(' ', '')
            # Release date
            filmBegin = selTmp.xpath('//div[@class="movie-brief-container"]/ul[1]/li[3]/text()')[0].replace('\n', '')
            # Score
            score = selTmp.xpath('//div[@class="movie-index-content score normal-score"]/span[1]/span[1]/text()')
            # Box office and score are rendered with the obfuscated font, so they are
            # first converted to the unicode entity form and then decoded
            filmScore = ""
            if len(score) == 0:
                filmScore = "評分暫無"  # stored value meaning "no score yet"
            else:
                ascll = str(score[0])
                utfs = str(ascll.encode('unicode_escape'))[1:].replace("'","").replace("\\\\u",";&#x").split('.')
                unicode = ""
                if len(utfs)>1:
                    unicode = utfs[0][1:]+";."+utfs[1][1:]+";"
                else:
                    unicode = utfs[0][1:]+";"
                filmScore = self.fontDecode.replace_font(res,unicode)+"分"  # "分" is the unit "points"
            print(filmMoney,filmScore)
            # Cast list: only the first 10 main actors are taken
            actorSol = selTmp.xpath('//div[@class="tab-celebrity tab-content"]/div[@class="celebrity-container"]/div[@class="celebrity-group"][2]/ul/li')
            actors = []
            actorUrls = []
            for li in actorSol[:10]:
                actorUrl = "https://maoyan.com"+li.xpath('.//div[@class="info"]/a/@href')[0]
                actorItem = li.xpath('.//div[@class="info"]/a/text()')[0].replace('\n', '').replace(' ', '')
                extra = li.xpath('.//div[@class="info"]/span[1]/text()')
                if len(extra) > 1:
                    actorItem += (" "+extra[0].replace('\n', '').replace(' ', ''))
                actorUrls.append(actorUrl)
                actors.append(actorItem)
            # Movie synopsis
            introductionT = ""
            introductionF = selTmp.xpath('//span[@class = "dra"]/text()')
            if len(introductionF) > 0:
                introductionT = introductionF[0]
            print(count,filmName,filmImage,filmBegin,filmTime,filmScore,filmMoney,actors,introductionT)
            # Pause between detail pages to lower the request rate
            time.sleep(1)
        s.close()
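
The string juggling that turns the obfuscated box office or score text into `&#x....;` entities is hard to read, so here is a simpler sketch of the same conversion followed by the substitution step from replace_font. The helper names `to_entities` and `decode` are hypothetical, and the sketch assumes the obfuscated digits all sit in the Unicode Private Use Area (U+E000 to U+F8FF), which holds for the glyph names in my baseline dictionaries but is not guaranteed by the site. The three mapping entries are taken from the first baseline dictionary above.

```python
def to_entities(text):
    """Rewrite private-use characters as lowercase &#x....; entities.

    Characters outside U+E000..U+F8FF, such as the decimal point,
    are kept unchanged.
    """
    return "".join(
        "&#x{:x};".format(ord(ch)) if 0xE000 <= ord(ch) <= 0xF8FF else ch
        for ch in text
    )


def decode(text, glyph_to_digit):
    """Decode obfuscated text using a {'uniXXXX': digit} mapping."""
    out = to_entities(text)
    for name, digit in glyph_to_digit.items():
        # 'uniF870' -> '&#xf870;', the form that appears in the entity string
        out = out.replace(name.replace("uni", "&#x").lower() + ";", digit)
    return out


# Mapping for one particular font file; it changes with every page load.
mapping = {"uniF870": "6", "uniE543": "8", "uniF040": "9"}
print(to_entities("\uf870\ue543.\uf040"))
print(decode("\uf870\ue543.\uf040", mapping))
```

In the real flow the mapping is not fixed: it is the new_dict that replace_font builds for the font served with that page.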

2. selenium with mitmproxy

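The second method is selenium together with mitmproxy. selenium is an automation tool. Unlike requests, which only fetches page content, selenium imitates a user opening a browser and browsing the page, so whatever you can see can be scraped. It is also often used to avoid detection when scraping large amounts of data. Used on its own, though, selenium is still easy to detect. My understanding here is not very deep; I simply know how to use it. Roughly, there are a few parameters the browser exposes: under selenium they are assigned a value, while for a normal user they are undefined. There are other anti-selenium measures as well, which I did not look into further.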
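
As a rough illustration of how a proxy can deal with such parameters, here is a sketch of a mitmproxy addon script. It is an assumption-laden example, not my actual configuration: I do not state above which properties Maoyan's scripts inspect, so the sketch only targets `navigator.webdriver`, the standard flag that WebDriver-controlled browsers set, and it may well not be sufficient on its own. The file name, `PATCH` and `inject_patch` are hypothetical; `response(flow)` is mitmproxy's hook for responses.

```python
# Hypothetical addon script, run for example as: mitmdump -s hide_webdriver.py

# Snippet injected at the start of <head>; it makes navigator.webdriver
# read as undefined before the page's own scripts run.
PATCH = (
    "<script>Object.defineProperty(navigator, 'webdriver', "
    "{get: () => undefined});</script>"
)


def inject_patch(html):
    """Insert PATCH right after the first plain <head> tag, if there is one.

    Simplification: a <head> tag that carries attributes is not matched.
    """
    marker = "<head>"
    pos = html.find(marker)
    if pos == -1:
        return html
    cut = pos + len(marker)
    return html[:cut] + PATCH + html[cut:]


def response(flow):
    """mitmproxy hook: called for every response that passes through the proxy."""
    content_type = flow.response.headers.get("content-type", "")
    if "text/html" in content_type:
        flow.response.text = inject_patch(flow.response.text)
```

The browser driven by selenium then has to be configured to send its traffic through the proxy for the hook to see the pages.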

For the detailed configuration, see my earlier blog post: mitmproxy with selenium.

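I found that this method is more stable than the first one: Meituan's verification shows up rarely, and when it does, you can copy the page's URL and open it manually in the corresponding browser. The verification will appear there; try it by hand a few times until you get through, and then scraping can continue. It is of course slower than requests. As an example, suppose the check appears while I am scraping https://maoyan.com/films/celebrity/28936 and I am using the Chrome driver. I copy that URL into Chrome, where the Meituan check usually appears; once I pass it by hand (if it fails, close the browser and try again a few times), I simply resume scraping. In fact, a browser automated by selenium is still treated by the site as a person browsing. Only when the check appears does the site recognize selenium, so all we need to do is pass the check manually.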
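
The pause-and-resume routine described above can be sketched as a small retry loop. This is an illustrative outline rather than my actual code: `fetch_with_manual_check`, its parameters and the fake page sequence are assumptions, while the marker text is the same one used in the requests code earlier.

```python
def fetch_with_manual_check(fetch, url, wait_for_human,
                            marker="驗證中心", max_rounds=3):
    """Fetch url; whenever the verification page shows up, pause for a human.

    fetch(url) returns the page source as a string.
    wait_for_human(url) blocks until the check has been solved by hand.
    Returns the page source, or None if the check never clears.
    """
    for _ in range(max_rounds):
        html = fetch(url)
        if marker not in html:
            return html
        wait_for_human(url)
    return None


# Simulated run: the first response is the verification page, the second is real.
pages = iter(["<title>驗證中心</title>", "<html>actor page</html>"])
solved = []
result = fetch_with_manual_check(
    lambda u: next(pages),
    "https://maoyan.com/films/celebrity/28936",
    solved.append,  # stands in for "open the URL in the browser and solve it"
)
print(result)
```

With selenium, `fetch` could load the page with `driver.get(url)` and return `driver.page_source`, and `wait_for_human` could simply call `input()` so the script waits until you press Enter after solving the check in the browser.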

That is my method. If you want the source code, send me a private message.
