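Preface

The previous post was the first blog post I ever wrote, and it had all sorts of small bugs: inconsistent line breaks, stray HTML tags showing up for no reason, and so on. I will fix these over time.

This post covers two technologies: BigPipe, a front-end technique for speeding up page loads, and XPath, which I use to parse the HTML data.

Why not use the more mature BeautifulSoup for parsing? Facebook's page source is huge and deviates from the standard here and there (which does not stop browsers from rendering it), and as a result BeautifulSoup could not load and analyze it correctly, so I went with XPath instead. If you know a good way to get BS to load Facebook's HTML, please leave a comment and let's discuss!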

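BigPipe

Why introduce BigPipe?

Because at the very beginning I simply could not find where the data I needed was. The first time I looked at the source I was completely lost. To get a feel for it, the figure below shows the source of the Facebook home page after logging in. Look at the minimap on the right side of Sublime: a huge block of yellow code (a large part of it is JSON data).

You can see that the page downloads many JS scripts, and that there is a lot of commented-out HTML (the grey parts):

No matter: we can simply search for what we are looking for. For example, I follow Mark Zuckerberg, so I searched for "Mark Zuckerberg" and found that most occurrences are inside comments. We all know that code inside a comment is not executed. Yet observation shows that the markup inside these comments does end up rendered in the page. So one way to think about it: the commented-out markup is input data, which JS scripts parse and finally display in the browser. From what I observed, all the information we need is inside comments. But there are so many comments; where do we look? In other words, how do we pick out what we want from all this code while avoiding irrelevant content (ads, promoted posts and the like)? This is where BigPipe comes in.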
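
To make the "comments as input data" idea concrete, here is a minimal sketch (Python 3, standard library only) that collects every HTML comment in a page and searches the comment bodies. The miniature page and the CommentCollector class are made up for illustration; real Facebook pages are far larger and may be structured differently.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the body of every HTML comment in a page (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # `data` is the text between "<!--" and "-->"
        self.comments.append(data.strip())


# Made-up miniature page: the post markup is shipped inside a comment.
page = (
    '<div class="hidden_elem"><!-- <div><a>Mark Zuckerberg</a></div> --></div>'
    '<script>/* a script would later inject the markup above */</script>'
)

collector = CommentCollector()
collector.feed(page)
collector.close()

# Keep only the comments that mention the name we searched for.
hits = [c for c in collector.comments if "Mark Zuckerberg" in c]
print(hits)
```

The search term only ever appears inside a comment here, which mirrors what the page source showed: the visible content travels as commented-out markup.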

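Introduction to BigPipe

BigPipe is a front-end acceleration technique that Facebook introduced around 2010. The effect is dramatic: the load time of a Facebook profile page dropped from 5 s to 2.5 s. That is a remarkable achievement, because studies have shown that when a page takes more than 3 s to show any response, users start to complain. 2.5 s is just under 3 s, but in practice the wait that users actually perceive is much shorter than 2.5 s. Why? Read on.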

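With traditional page loading, the whole web page is assembled on the server, sent over the network to the client, and only then parsed and displayed by the browser. BigPipe borrows the idea of pipelining from CPUs: it splits the page into separate modules. In the figure below, each black box is one module; in BigPipe a module is called a PageLet.

On the server side, the page is no longer generated as a whole but PageLet by PageLet. As soon as a PageLet is ready, it is sent to the client. Several PageLets can be in flight at once, which greatly speeds up the overall page load. The figure below contrasts the traditional approach with BigPipe:

Each PageLet contains its data, a complete DOM tree, plus some necessary metadata such as an ID and where it should be placed. The JavaScript on the page reads this metadata, picks the matching container according to it, and puts the data there. The loading process is shown in the figure below:

At this point we can use the PageLet metadata (remember that big pile of JSON?) to tell which item is an ad, which is a promoted post, and which is the data we want to collect. This sidesteps a large amount of confusing data in one go.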
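
The pipeline described above can be mimicked with a toy model. This is only a sketch of the idea in Python 3, not Facebook's implementation: render_pagelets, on_pagelet_arrive, the field names and the container ids are all hypothetical.

```python
def render_pagelets():
    """Server side: yield each PageLet as soon as it is 'ready'."""
    yield {"id": "substream_0", "container_id": "c1", "markup": "<p>post A</p>"}
    yield {"id": "ad_0", "container_id": "c2", "markup": "<p>ad</p>"}
    yield {"id": "substream_1", "container_id": "c3", "markup": "<p>post B</p>"}


def on_pagelet_arrive(page, pagelet):
    """Client side: drop the PageLet markup into the container it names."""
    page[pagelet["container_id"]] = pagelet["markup"]


# The page skeleton arrives first, with empty containers.
page = {"c1": "", "c2": "", "c3": ""}

for pagelet in render_pagelets():
    on_pagelet_arrive(page, pagelet)
    filled = sum(1 for markup in page.values() if markup)
    print("after %s: %d/3 containers filled" % (pagelet["id"], filled))
```

Because each PageLet is displayed as soon as it arrives, the user sees content long before the last module is done, which is why the perceived wait is shorter than the total load time.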

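Writing the code

That was a lot of talk, but the problem is finally laid out.

First, let's look at the HTML and find its patterns.

The first dozen or so lines mainly load CSS stylesheets and JS scripts. The next dozen or so initialize BigPipe. After that the real content begins.

Here is the PageLet metadata:

This long chunk is actually a single statement. It calls bigPipe.onPageletArrive(), and the name tells you what the function is for. The big blob that follows is the PageLet metadata. It contains a few useful things:

"display_dependency":["topnews_main_stream_408239535924329"] presumably indicates which module the PageLet is displayed in. The keyword topnews tells us this is a top story, not something we want to collect.

"content":{"substream_0":{"container_id":"u_0_x"} gives the id of the container.

After that come parameters such as jsmods and requires, which are of no use to us.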
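
If the argument of onPageletArrive() is plain JSON, the fields above can be read with the standard json module. The sketch below (Python 3) assumes exactly that, and the script line is a simplified, made-up stand-in built from the two fragments quoted above; parse_pagelet_info is a hypothetical helper name, and real payloads may not be strict JSON.

```python
import json
import re

# Simplified, made-up script line in the shape described above.
line = ('<script>bigPipe.onPageletArrive({"display_dependency":'
        '["topnews_main_stream_408239535924329"],"content":'
        '{"substream_0":{"container_id":"u_0_x"}},"jsmods":{}});</script>')


def parse_pagelet_info(script_line):
    """Return the JSON argument of onPageletArrive() as a dict, or None."""
    match = re.search(r'onPageletArrive\((\{.*\})\)', script_line)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except ValueError:  # the payload was not plain JSON after all
        return None


info = parse_pagelet_info(line)
print(info["display_dependency"])
print(info["content"]["substream_0"]["container_id"])
```

Working on the parsed dict is less fragile than substring matching, but it only helps when the payload really parses; the substring test used later in this post has no such requirement.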

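Looking at the messages posted by users reveals a pattern: "display_dependency":["substream_X"], where X (uppercase) is a number, or "display_dependency":["substream_X_xxxxxxxxx"], where each x (lowercase) is a digit or a letter. From what I observed, every PageLet matching this pattern is data we want to collect: a post published by a user. This conclusion is not necessarily reliable, since there is no theory behind it and no documentation to consult. But in every case I have come across, this approach avoided all ads and promoted posts.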
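
The pattern just described can be written down as a regular expression. This is a sketch in Python 3 based only on the two forms observed above; SUBSTREAM_RE and is_user_post are illustrative names, and the second sample string is invented to match the described shape.

```python
import re

# "substream_" + digits, optionally followed by "_" + letters/digits.
SUBSTREAM_RE = re.compile(
    r'"display_dependency":\["substream_\d+(?:_[0-9A-Za-z]+)?"\]')


def is_user_post(pagelet_line):
    """True if the line carries a substream display_dependency."""
    return SUBSTREAM_RE.search(pagelet_line) is not None


samples = [
    '"display_dependency":["substream_3"]',
    '"display_dependency":["substream_1_57a2b9c4"]',
    '"display_dependency":["topnews_main_stream_408239535924329"]',
]
print([is_user_post(s) for s in samples])  # [True, True, False]
```

The topnews entry is rejected even though it contains the text "stream_", because the expression requires "substream_" to start right after the opening bracket and quote.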

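The PageLet data sits just above this statement. Remember that the DOM tree is wrapped in a comment, and the comment may span several lines: when someone writes a post with several paragraphs, the markup contains line breaks. We just keep walking upward until we reach the start of the comment. The implementation is fairly simple: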

import re

def get_newdom_from_html(file_path):
    # The HTML was saved to a file to make debugging easier
    with open(file_path) as f:
        html = f.readlines()
    data = []
    for i in range(len(html)):
        # Find a PageLet whose display_dependency marks a user post
        if 'display_dependency":["substream_' in html[i]:
            newdom = ''
            j = max(i - 2, 0)
            # Walk upward until the line that opens the hidden block
            while j > 0 and '<div class="hidden_elem">' not in html[j]:
                newdom = html[j] + newdom
                j = j - 1
            newdom = html[j] + newdom
            # Drop line breaks, then keep only the commented-out markup
            newdom = newdom.replace('\n', '')
            match = re.search('<!--.*-->', newdom)
            if match is None:
                continue
            newdom = match.group()
            # Strip the comment delimiters themselves
            newdom = newdom.replace('<!-- ', '').replace('-->', '')
            data.append(newdom)

    print('Got %d information containers from html.' % len(data))
    return data
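
As an alternative to walking the file line by line, the same extraction can be sketched with a single regular expression over the whole page. This is a different technique from the function above, not a drop-in replacement. It assumes each hidden block is followed by its own onPageletArrive call, which held for the pages described here but is not guaranteed. The page text below is synthetic, and BLOCK_RE and extract_user_posts are illustrative names (Python 3).

```python
import re

# Synthetic stand-in for a saved page; real pages are far larger.
html = '''<div class="hidden_elem"><!-- <div><p>first paragraph</p>
<p>second paragraph</p></div> --></div>
<script>
bigPipe.onPageletArrive({"display_dependency":["substream_0"]});
</script>
<div class="hidden_elem"><!-- <div><p>sponsored</p></div> --></div>
<script>
bigPipe.onPageletArrive({"display_dependency":["topnews_main_stream_1"]});
</script>
'''

# One hidden block, followed (non-greedily) by the next onPageletArrive call.
BLOCK_RE = re.compile(
    r'<div class="hidden_elem"><!--(.*?)-->.*?onPageletArrive\((.*?)\);',
    re.S)


def extract_user_posts(page_source):
    """Return the commented-out markup of every substream PageLet."""
    posts = []
    for markup, info in BLOCK_RE.findall(page_source):
        if '"display_dependency":["substream_' in info:
            posts.append(markup.replace('\n', '').strip())
    return posts


print(extract_user_posts(html))
```

If a hidden block ever appears without its own call, this pairing would attach it to the next call in the file, so the line-based walk above is the safer choice when the layout is uncertain.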

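We now have the DOM trees that hold the target data. Next we use XPath to pinpoint the data within them.

Using XPath

XPath is an expression language for addressing parts of XML-like documents, and it is convenient to use. Its "relative paths" in particular are, as far as I can tell, about the only workable way to handle HTML this complex and changeable.

A small example. Take the following XML document: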

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="chn">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>

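Use XPath to locate the book Harry Potter.

# Absolute path:
/bookstore/book[1]
# Relative path:
//book[1]

Here "/" marks an absolute path and "//" marks a relative path. A relative path matches a node anywhere beneath a parent node, possibly more than one level down. Inside "[]" you can filter nodes: book[1], for example, selects every book node that is the first book child of its parent (XPath positions start at 1). You can also select by attribute. Sticking with Harry Potter: the lang attribute of its title is "eng", so it can also be selected like this:

/bookstore/book[title/@lang="eng"]
# Selecting by price instead (this matches "Learning XML", the only book priced above 30.00):
/bookstore/book[price>30.00]

These few tricks are enough for handling Facebook's data.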
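
The selections above can be tried out in a few lines. To keep the sketch dependency-free it uses the standard library's xml.etree.ElementTree rather than lxml; ElementTree supports only a subset of XPath (no absolute paths on an element and no numeric comparisons), so paths start from the root element and the price filter is done in Python.

```python
import xml.etree.ElementTree as ET

xml_doc = """<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="chn">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(xml_doc)  # root is the <bookstore> element

# Path from the root element: its first <book> child.
first = root.find("book[1]")
# Any depth below the root: <book> elements that are a first <book> child.
first_anywhere = root.findall(".//book[1]")
# Filter on the lang attribute of <title>, then step up to the <book>.
by_lang = root.find("book/title[@lang='eng']/..")
# ElementTree has no numeric comparison, so the price test is done in Python.
pricey = [b.findtext("title") for b in root.findall("book")
          if float(b.findtext("price")) > 30.00]

print(first.findtext("title"))
print(len(first_anywhere))
print(by_lang.findtext("title"))
print(pricey)
```

With lxml the original expressions can be passed to xpath() unchanged; the ElementTree forms here are just the closest equivalents available without installing anything.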

Now a real example. A Facebook post looks like this, with text and an image, while its HTML is rather messy. The HTML below is the data inside the PageLet:

<div class="_4-u2 mbm _5v3q _4-u8" id="u_ps_0_0_1">
    <div class="_3ccb" data-gt="{"type":"click2canvas","fbsource":703,"ref":"nf_generic"}" id="u_ps_0_0_2">
        <div></div>
        <div class="userContentWrapper _5pcr" role="article" aria-label="Story">
            <div class="_1dwg _1w_m">
                <div class="_4r_y">
                    <div class="_6a uiPopover _5pbi _cmw _5v56 _b1e" id="u_ps_0_0_3" data-ft="{"tn":"V"}">
                        <a class="_4xev _p" aria-label="Story options" href="#" aria-haspopup="true" aria-expanded="false" rel="toggle" role="button" id="u_ps_0_0_4"></a>
                    </div>
                </div>
                <div class="_4gns accessible_elem"></div>
                <div class="_5x46">
                    <div class="clearfix _5va3">
                        <a class="_5pb8 _8o _8s lfloat _ohe" href="https://www.facebook.com/NBCBlacklist/?ref=nf" aria-hidden="true" tabindex="-1" target="" data-ft="{"tn":"\u003C"}" data-hovercard="/ajax/hovercard/page.php?id=315791511882046">
                            <div class="_38vo"><img class="_s0 _5xib _5sq7 _44ma _rw img" src="https://fbcdn-profile-a.akamaihd.net/hprofile-ak-xlf1/v/t1.0-1/p50x50/11891029_731263597001500_5132239452791988839_n.png?oh=3b747b33e8a76fd06c73fa4a75a0ee94&oe=58098655&__gda__=1475421439_ae7e8523072625710162c2ee000b4c18" alt=""></div>
                        </a>
                        <div class="clearfix _42ef">
                            <div class="rfloat _ohf"></div>
                            <div class="_5va4">
                                <div>
                                    <div class="_6a _5u5j">
                                        <div class="_6a _6b" style="height:40px"></div>
                                        <div class="_6a _5u5j _6b">
                                            <h5 class="_5pbw" data-ft="{"tn":"C"}"><span class="fwn fcg"><span class="fwb fcg" data-ft="{"tn":"k"}"><a href="https://www.facebook.com/NBCBlacklist/?fref=nf" data-hovercard="/ajax/hovercard/page.php?id=315791511882046&extragetparams=%7B%22fref%22%3A%22nf%22%7D">The Blacklist</a></span></span></h5>
                                            <div class="_5pcp"><span><span class="fsm fwn fcg"><a class="_5pcq" href="/NBCBlacklist/photos/a.330790057048858.1073741828.315791511882046/889778107816714/?type=3" rel="theater" ajaxify="/NBCBlacklist/photos/a.330790057048858.1073741828.315791511882046/889778107816714/?type=3&size=600%2C400&fbid=889778107816714&source=12&player_origin=unknown" target=""><abbr title="Friday, July 1, 2016 at 11:42pm" data-utime="1467387720" data-shorten="1" class="_5ptz timestamp livetimestamp"><span class="timestampContent">11 hrs</span></abbr>
                                                </a>
                                                </span>
                                                </span><span role="presentation" aria-hidden="true"> • </span><a data-hover="tooltip" data-tooltip-content="Public" class="uiStreamPrivacy inlineBlock fbStreamPrivacy fbPrivacyAudienceIndicator _5pcq" aria-label="Public" href="#" role="button"><i class="lock img sp_LNqePrqmloc sx_35b578"></i></a></div>
                                        </div>
                                    </div>
                                </div>
                            </div>
                        </div>
                    </div>
                </div>
                <div class="_5pbx userContent" data-ft="{"tn":"K"}">
                    <p>Tom has mastered the art of two truths and a lie.</p>
                    <div class="_5wpt"></div>
                </div>
                <div class="_3x-2">
                    <div data-ft="{"tn":"H"}">
                        <div class="mtm">
                            <div class="_5cq3" data-ft="{"tn":"E"}">
                                <a class="_4-eo _2t9n" href="/NBCBlacklist/photos/a.330790057048858.1073741828.315791511882046/889778107816714/?type=3" rel="theater" ajaxify="/NBCBlacklist/photos/a.330790057048858.1073741828.315791511882046/889778107816714/?type=3&size=600%2C400&fbid=889778107816714&player_origin=unknown" data-render-location="newsstand" style="width:476px;" data-testid="theater_link">
                                    <div class="uiScaledImageContainer _4-ep" style="width:476px;height:317px;" id="u_ps_0_0_5"><img class="scaledImageFitWidth img" src="https://fbcdn-photos-b-a.akamaihd.net/hphotos-ak-xfp1/v/t1.0-0/p320x320/13413785_889778107816714_5168115831313224460_n.jpg?oh=67aefb5b75d827228cdad41e758a7c09&oe=58044C1E&__gda__=1475004462_88b45f79750b4f918dadf06943b5cc3f" alt="The Blacklist's photo." width="476" height="318"></div>
                                </a>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
            <div>
                <form rel="async" class="commentable_item collapsed_comments" method="post" data-ft="{"tn":"]"}" action="/ajax/ufi/modify.php" onsubmit="return window.Event && Event.__inlineSubmit && Event.__inlineSubmit(this,event)" id="u_ps_0_0_8">
                    <input type="hidden" name="charset_test" value="€,´,€,´,水,Д,Є">
                    <input type="hidden" name="fb_dtsg" value="AQHS2YZ9HT3a:AQEr2NnPYVLY" autocomplete="off">
                    <input type="hidden" autocomplete="off" name="ft_ent_identifier" value="889778107816714">
                    <input type="hidden" autocomplete="off" name="data_only_response" value="1">
                    <div class="_sa_ _5vsi _ca7 _192z">
                        <div class="_37uu">
                            <div data-reactroot="">
                                <div class="_3399 _1f6t _4_dr">
                                    <div class="_524d">
                                        <div class="_ipn">
                                            <div class="_ipo">
                                                <a aria-live="polite" class="_ipm" data-comment-prelude-ref="action_link_bling" data-ft="{"tn":"O"}" data-hover="tooltip" data-tooltip-uri="/ufi/comment/tooltip/?ft_ent_identifier=889778107816714&av=100011766661649" href="/NBCBlacklist/photos/a.330790057048858.1073741828.315791511882046/889778107816714/?type=3&comment_tracking=%7B%22tn%22%3A%22O%22%7D" role="button">
                                                    <!-- react-text: 7 -->361 Comments
                                                    <!-- /react-text -->
                                                </a><a aria-live="polite" class="_ipm" data-hover="tooltip" data-tooltip-uri="/ufi/share/tooltip/?ft_ent_identifier=889778107816714&av=100011766661649" href="https://www.facebook.com/shares/view?id=889778107816714&av=100011766661649" role="button">408 Shares</a></div>
                                            <div class="_ipp">
                                                <div class="_3t53 _4ar- _ipn"><span aria-label="See who reacted to this" class="_3t54" role="toolbar" tabindex="0"><a aria-label="18K Like" class="_27jf _3emk" href="/ufi/reaction/profile/browser/?ft_ent_identifier=889778107816714&av=100011766661649" rel="ignore" role="button" tabindex="-1"><span class="_9zc _2p7a _4-op"><i class="_3j7l _2p78 _9--"></i></span><span class="_3chu">18K</span></a><a aria-label="1.4K Love" class="_27jf _3emk" href="/ufi/reaction/profile/browser/?ft_ent_identifier=889778107816714&av=100011766661649" rel="ignore" role="button" tabindex="-1"><span class="_9zc _2p7a _4-op"><i class="_3j7m _2p78 _9--"></i></span><span class="_3chu">1.4K</span></a><a aria-label="61 Angry" class="_27jf _3emk" href="/ufi/reaction/profile/browser/?ft_ent_identifier=889778107816714&av=100011766661649" rel="ignore" role="button" tabindex="-1"><span class="_9zc _2p7a _4-op"><i class="_3j7q _2p78 _9--"></i></span><span class="_3chu">61</span></a></span><a class="_2x4v" href="/ufi/reaction/profile/browser/?ft_ent_identifier=889778107816714&av=100011766661649" rel="ignore"><span aria-hidden="[object Object]" class="_1g5v"><span data-hover="tooltip" data-tooltip-uri="/ufi/reaction/tooltip/?ft_ent_identifier=889778107816714&av=100011766661649">20K</span></span><span class="_4arz"><span data-hover="tooltip" data-tooltip-uri="/ufi/reaction/tooltip/?ft_ent_identifier=889778107816714&av=100011766661649">20K</span></span></a></div>
                                            </div>
                                        </div>
                                    </div>
                                </div>
                                <div class="_3399 _a7s clearfix">
                                    <div class="_524d">
                                        <div class="_42nr"><span><div class="_khz"><a aria-pressed="false" class="UFILikeLink _4x9- _4x9_ _48-k" data-testid="fb-ufi-likelink" href="#" role="button" tabindex="0"><!-- react-text: 35 -->Like<!-- /react-text --></a><span role="button" class="accessible_elem" tabindex="-1">Show more reactions</span></div>
                                        </span><span><a class="comment_link _5yxe" role="button" href="#" title="Leave a comment" data-ft="{ "tn": "S", "type": 24 }">Comment</a></span><span><a href="#" class="share_action_link _5f9b" data-ft="{ "tn": "J", "type": 25 }" title="Send this to friends or post it on your timeline."><!-- react-text: 41 -->Share<!-- /react-text --><span class="UFIShareLinkSpinner _1wfk img _55ym _55yn _55yo _5tqs" aria-label="Loading..." aria-busy="true"></span></a>
                                        </span>
                                    </div>
                                </div>
                            </div>
                        </div>
                    </div>
            </div>
            <div class="uiUfi UFIContainer _5pc9 _5vsj _5v9k" id="u_ps_0_0_7"></div>
            </form>
        </div>
    </div>
</div>
</div>

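You can press Ctrl+F and search for the author, The Blacklist (a US TV series). As you can see, "The Blacklist" sits in an <a> tag under an <h5> tag, with several layers between the two, but that is fine: we can locate it by relative position.

(div/div/div/div[3]//h5/span//a)[1] # relative location of a post's author (XPath positions start at 1)

The complete code for finding the author: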

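def get_writer(tree):
    # Author name: first <a> under the <h5> in the header block
    r = tree.xpath('div/div/div/div[3]//h5/span//a')
    try:
        return r[0].text
    except IndexError:
        # No matching node: signal failure to the caller
        return 'wrong'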
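
To see how such a path walks the tree, here is a small sketch on a heavily trimmed, well-formed stand-in for the post markup above (only the nesting and a few values are kept). It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages; ElementTree has no text() step, so the content is gathered with itertext().

```python
import xml.etree.ElementTree as ET

# Heavily trimmed, well-formed stand-in for the post markup above.
fragment = """<div id="post">
  <div>
    <div></div>
    <div role="article">
      <div>
        <div></div>
        <div></div>
        <div>
          <h5><span><span><a href="#">The Blacklist</a></span></span></h5>
          <abbr title="Friday, July 1, 2016 at 11:42pm">11 hrs</abbr>
        </div>
        <div><p>Tom has mastered the art of two truths and a lie.</p></div>
      </div>
    </div>
  </div>
</div>"""

root = ET.fromstring(fragment)

# Third <div> at the fourth level holds the header (author and time).
writer = root.findall("div/div/div/div[3]//h5/span//a")
posted = root.findall("div/div/div/div[3]//abbr")
# Fourth <div> at the same level holds the post text.
content = root.findall("div/div/div/div[4]")

print(writer[0].text)
print(posted[0].get("title"))
print("".join(content[0].itertext()).strip())
```

Only the first four steps are fixed; the "//" steps absorb however many wrapper elements Facebook happens to put between the header block and the <h5> or <abbr>.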

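Here tree is an etree built from the HTML with etree.parse(html). Other information, such as images and text, can be extracted in the same way. Before extraction a preprocessing step is needed; otherwise many illegal characters show up and cause parse errors. The code is as follows: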

# -*- coding:gb2312 -*-
__author__ = 'HYDT'
import re
from lxml import etree

def get_newdom_from_html(file_path):
    # The HTML was saved to a file to make debugging easier
    with open(file_path) as f:
        html = f.readlines()
    data = []
    for i in range(len(html)):
        # Find a PageLet whose display_dependency marks a user post
        if 'display_dependency":["substream_' in html[i]:
            newdom = ''
            j = max(i - 2, 0)
            # Walk upward until the line that opens the hidden block
            while j > 0 and '<div class="hidden_elem">' not in html[j]:
                newdom = html[j] + newdom
                j = j - 1
            newdom = html[j] + newdom
            # Drop line breaks, then keep only the commented-out markup
            newdom = newdom.replace('\n', '')
            match = re.search('<!--.*-->', newdom)
            if match is None:
                continue
            newdom = match.group()
            # Strip the comment delimiters themselves
            newdom = newdom.replace('<!-- ', '').replace('-->', '')
            data.append(newdom)

    print('Got %d information containers from html.' % len(data))
    return data


def analysis_html(file_path):
    tree = etree.parse(file_path)
    # Skip "X liked this." stories; they are not original posts
    if judge_liked(tree):
        return 'liked'
    post = {
        'writer': get_writer(tree),
        'time': get_time(tree),
        'content': get_content(tree),
        'img': get_img(tree),
        'video': get_video(tree),
    }
    # Any extractor returning 'wrong' invalidates the whole post
    for key in post:
        if post[key] == 'wrong':
            return 'wrong'
    return post


def get_writer(tree):
    # Author name: first <a> under the <h5> in the header block
    r = tree.xpath('div/div/div/div[3]//h5/span//a')
    try:
        return r[0].text
    except IndexError:
        return 'wrong'


def get_time(tree):
    # The full timestamp is stored in the title attribute of <abbr>
    r = tree.xpath('div/div/div/div[3]//abbr')
    try:
        return r[0].attrib['title']
    except (IndexError, KeyError):
        return 'wrong'


def get_content(tree):
    # Join every text node of the content block, skipping UI filler
    r = tree.xpath('div/div/div/div[4]//text()')
    skip = ('See More', '...', '\n')
    return ''.join(s for s in r if s not in skip)


def get_img(tree):
    # Collect the src of every <img> in the media block
    try:
        r = tree.xpath('div/div/div/div[5]//img')
        return [img.attrib['src'] for img in r]
    except KeyError:
        return 'wrong'


def get_video(tree):
    # Collect the src of every <video> in the media block
    try:
        r = tree.xpath('div/div/div/div[5]//video')
        return [video.attrib['src'] for video in r]
    except KeyError:
        return 'wrong'


def judge_liked(tree):
    r = tree.xpath('div/div/div/div[3]//h5//text()')
    if ' liked this.' in r:
        return True
    return False


def pre_analysis(newdom, file_save_path):
    # Remove the comment form, which is not needed for extraction
    newdom = re.sub('<form .*</form>', '', newdom)
    # Put each tag on its own line
    newdom = re.sub('><', '>\n<', newdom)
    # Drop empty <div></div> pairs
    newdom = re.sub('<div>\n</div>', '', newdom)
    with open(file_save_path, 'w') as outfile:
        outfile.write(newdom)
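
To see what the three substitutions in pre_analysis do, here is a string-only sketch that applies them to a tiny made-up fragment (Python 3). clean_markup is a hypothetical helper that mirrors the function above but returns the result instead of writing a file.

```python
import re


def clean_markup(newdom):
    """String-only variant of the three substitutions in pre_analysis()."""
    newdom = re.sub(r'<form .*</form>', '', newdom)  # drop the comment form
    newdom = re.sub(r'><', '>\n<', newdom)           # one tag per line
    newdom = re.sub(r'<div>\n</div>', '', newdom)    # drop empty <div> pairs
    return newdom


sample = ('<div><div></div><p>hello</p>'
          '<form rel="async"><input name="x"></form></div>')
print(clean_markup(sample))
```

The form pattern relies on "." not matching a newline, so it only removes the whole form when the markup is on a single line, which is the case once the line breaks have been stripped during extraction.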

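Summary

This part sounds simple, but while I was doing it I kept running into dead ends. For example, the class attribute of HTML tags is a very common anchor for locating elements, but Facebook changes some of them at random, so class values are not stable. Another example: at first I kept running into ads and promoted posts, which was annoying, and they could not be told apart from the HTML alone; it took more than a week before I stumbled on BigPipe. Even after finding BigPipe it was not all smooth sailing. Initially I expected the JS to sort text, images and similar information into categories, but after reading the source carefully I found that is not the case: it simply drops the DOM tree from the comment straight into the container. The JS does do some processing of the data, though, for example in the following functions:

It is clear that when a PageLet arrives, onPageletArrive is called, which then calls the functions after it to process the data. Three of them matter most: appendNodes, addedElements and addedImages. These three functions split up the data in the PageLet and tag it; I lost track of what happens after that.

The next post will cover NoSQL and storing the data in a database.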
