Python web scraping: handling emoji when parsing with XPath

Preface

 

This is a short post that records a problem I ran into by chance.

 

Reproducing the problem

 

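Here is what happened. I was parsing a website whose data comes back as plain HTML rather than a JSON string, so I had to parse the DOM. Where a regular expression would do, I used one; where it would not, I used the beautifulsoup library or pyquery. Still, neither is as general-purpose as XPath, and I had already written a first version, so reworking it in the limited time I had would have been painful.

Enough talk; here is the problem.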

 

 

First, part of the site's source looks like this:

 

<article class="_55wo _5rgr _5gh8 _3drq async_like"
         data-ft='{"mf_story_key":"10159935560038463","top_level_post_id":"10159935560038463","tl_objid":"10159935560038463","content_owner_id_new":"8245623462","throwback_story_xxid":"10159935560038463","page_id":"8245623462","story_location":4,"story_attachment_style":"video_inline","tds_flgs":3,"ott":"AX90AyHPzJSMfPjF","tn":"-R"}'
         data-sigil="story-div story-popup-metadata story-popup-metadata feed-ufi-metadata"
         data-store='{"linkdata":"mf_story_key.10159935560038463:top_level_post_id.10159935560038463:tl_objid.10159935560038463:content_owner_id_new.8245623462:throwback_story_xxid.10159935560038463:page_id.8245623462:story_location.4:story_attachment_style.video_inline:tds_flgs.3:ott.AX90AyHPzJSMfPjF","share_id":"10159935560038463","feedback_target":"10159935560038463","feedback_source":0,"action_source":0,"actor_id":100065274592441}'
         data-xt="2.mf_story_key.10159935560038463:top_level_post_id.10159935560038463:tl_objid.10159935560038463:content_owner_id_new.8245623462:throwback_story_xxid.10159935560038463:page_id.8245623462:story_location.4:story_attachment_style.video_inline:tds_flgs.3:ott.AX90AyHPzJSMfPjF"
         data-xt-vimp='{"pixel_in_percentage":0,"duration_in_ms":1,"subsequent_gap_in_ms":60000,"log_initial_nonviewable":false,"should_batch":true,"require_horizontally_onscreen":false}'
         id="u_0_5_iv">
    <div class="story_body_container">
        <header class="_7om2 _1o88 _77kd _5qc1">
            <div class="_5s61 _2pii _5i2i _52wc">
                <div class="_5xu4">
                    <div class="_67lm _77kc" data-gt='{"tn":"~"}' data-sigil="feed_story_ring8245623462"><a
                            data-click='{"event":"click_post_avatar_image","target_id":"10159935560038463"}'
                            data-gt='{"tn":"~"}' href="/nba/?__tn__=%7E%7E-R"><i aria-label="NBA, profile picture"
                                                                                 class="img _1-yc profpic" role="img"
                                                                                 ></i></a>
                    </div>
                </div>
            </div>
            <div class="_4g34 _5i2i _52we">
                <div class="_5xu4">
                    <div class="_7om2 _52wc">
                        <div class="_4g34"><h3 class="_52jd _52jb _52jh _5qc3 _4vc- _3rc4 _4vc-" data-gt='{"tn":"C"}'>
                            <span><strong><a href="/nba/?__tn__=C-R">NBA</a></strong><span aria-label="Verified Page"
                                                                                           class="_56_f _5dzy _5dz- _3twv"
                                                                                           id="u_0_e_x2"
                                                                                           role="img"></span></span>
                        </h3>
                            <div class="_52jc _5qc4 _78cz _24u0 _36xo" data-sigil="m-feed-voice-subtitle"><a
                                    href="/story.php?story_xxid=10159935560038463&id=8245623462&__tn__=-R"><abbr>6
                                hrs</abbr></a><span aria-hidden="true"> · </span><span><div class="_7jwi"><span
                                    data-sigil="audience-icon"><i aria-label="Public"
                                                                  class="feedAudienceIcon img sp_eXcmc5QyINt_2x sx_e966fc"
                                                                  role="img"></i></span><div class="_7jwh"></div></div></span>
                            </div>
                        </div>
                        <div class="_5s61">
                            <div class="_2pir" id="feed_story_fan_8245623462"></div>
                        </div>
                        <div class="_5s61"></div>
                        <div class="_5s61 _2pis">
                            <div class="_yff" data-sigil="story-popup-causal-init"
                                 data-store='{"feedobjectsIdentifiers":"S:_I8245623462:10159935560038463","feedContext":"{\"use_m_feed\":true,\"m_entstream_source\":\"timeline\",\"is_pages_timeline\":true,\"story_node_id\":\"u_0_5_iv\",\"show_attachments\":true,\"is_attached_story\":false}"}'
                                 id="u_0_b_35"><a aria-haspopup="true" class="_4s19 sec" data-sigil="touchable" href="#"
                                                  role="button"></a><i class="img sp_eXcmc5QyINt_2x sx_b9866d"
                                                                       data-sigil="story-popup-context-init"><u>More
                                options</u></i></div>
                        </div>
                    </div>
                </div>
            </div>
        </header>
        <div class="_5rgt _5nk5 _5msi" data-ft='{"tn":"*s"}' data-gt='{"tn":"*s"}' style="">
            <div><span><p>Watch the BEST DEEP 3'S from the <a href="/LAClippers/?__tn__=%2As-R">L.A. Clippers</a> during the <a
                    class="_5ayv _qdx" href="/hashtag/nbaplayoffs?__tn__=%2As-R"><span class="_5aw4 _qdz">#</span><span
                    class="_5ayu">NBAPlayoffs</span></a>! </p><p> <a class="_5ayv _qdx"
                                                                     href="/hashtag/thatsgame?__tn__=%2As-R"><span
                    class="_5aw4 _qdz">#</span><span class="_5ayu">ThatsGame</span></a> <span class="_5mfr"><span
                    class="_6qdm"
                    style='height: 16px; width: 16px; font-size: 16px; background-image: url("https://static.xx.xxcdn.net/images/emoji.php/v9/tdf/2/16/1f4a5.png")'>💥</span></span></p></span>
            </div>
            <a aria-label="Open story" class="_5msj"
               href="/story.php?story_xxid=10159935560038463&id=8245623462&__tn__=%2As%2As-R"></a></div>
        <div class="_5rgu _7dc9 _27x0" data-ft='{"tn":"H"}'>
            <section class="_2rea _24e1 _412_ _bpa _vyy _5t8z">
                <div class="_2zi_ _zgm _2zj0">
                    <div class="_53mw" data-sigil="inlineVideo"
                         data-store='{"videoID":"4456269257751059","playerFormat":"inline","playerOrigin":"page_timeline","external_log_id":null,"external_log_type":null,"rootID":4456269257751059,"playerSuborigin":"misc","useOzLive":false,"playbackIsLiveStreaming":false,"canUseOffline":null,"playOnClick":true,"videoDebuggerEnabled":false,"videoViewabilityLoggingEnabled":false,"videoViewabilityLoggingPollingRate":-1,"videoScrollUseLowThrottleRate":true,"playInFullScreen":false,"type":"video","src":"https:\/\/video-mad1-1.xx.xxcdn.net\/v\/t42.1790-2\/10000000_540531577146622_2129266242166849959_n.mp4?_nc_cat=111&ccb=1-3&_nc_sid=985c63&efg=eyJ2ZW5jb2RlX3RhZyI6InN2ZV9zZCJ9&_nc_ohc=CHxlLBnqdg8AX84rJTC&tn=3o-lXXvU9tVtdq6j&_nc_rml=0&_nc_ht=video-mad1-1.xx&oh=5ab243e6a2407a74ed09407f43ad04e9&oe=6107CF3F","width":320,"height":180,"trackingNodes":"FH-R","downloadResources":null,"subtitlesSrc":null,"spherical":false,"sphericalParams":null,"defaultQuality":null,"availableQualities":null,"playStartSec":null,"playEndSec":null,"playMuted":null,"disableVideoControls":false,"loop":false,"numOfLoops":null,"shouldPlayInline":true,"dashManifest":null,"isAdsPreview":false,"iframeEmbedReferrer":null,"adClientToken":null,"audioOnlyVideoSrc":null,"audioOnlyEnabled":false,"permalinkShareID":null,"feedPosition":null,"chainDepth":null,"videoURL":"https:\/\/www.xxxxxx.com\/nba\/videos\/4456269257751059\/","disableLogging":false}'>
                        <i class="img _lt3 _4s0y" data-sigil="playInlineVideo"
                           style=""></i>
                        <div class="_1o0y" data-sigil="m-video-play-button playInlineVideo"><span
                                style="display:block;height:0;overflow:hidden;position:absolute;width:0;padding:0">Play Video</span>
                        </div>
                    </div>
                </div>
            </section>
            <div></div>
            <div></div>
        </div>
    </div>
    <footer class="_22rc" data-ft='{"tn":"*W"}'>
        <div class="_2ip_ _4b44" data-sigil="mufi-inline" id="feedback_inline_10159935560038463">
            <div class="_34qc _3hxn _3myz _4b45"><a data-sigil="feed-ufi-trigger"
                                                    href="/story.php?story_xxid=10159935560038463&id=8245623462&anchor_composer=false&__tn__=%2AW-R"
                                                    role="button">
                <div class="_rnk _77ke _2eo- _1e6 _4b44" data-sigil="reactions-bling-bar" id="u_0_f_m4">
                    <div class="_1w1k" data-sigil="reactions-sentence-container"><span class="_qfz _77kf"><div
                            class="_1g05 _77lc" style="z-index:3"><i class="img sp_eXcmc5QyINt_2x sx_9540f7"
                                                                     role="presentation"><u>Like</u></i></div><div
                            class="_1g05 _77lc" style="z-index:2"><i class="img sp_eXcmc5QyINt_2x sx_2d1286"
                                                                     role="presentation"><u>Love</u></i></div><div
                            class="_1g05 _77lc" style="z-index:1"><i class="img sp_eXcmc5QyINt_2x sx_176208"
                                                                     role="presentation"><u>Wow</u></i></div></span>
                        <div aria-label="567 left reactions including Like, Love and Wow" class="_1g06">567</div>
                    </div>
                    <div class="_1fnt"><span class="_1j-c" data-sigil="comments-token">10 Comments</span><span
                            class="_1j-c">36 Shares</span></div>
                </div>
            </a></div>
            <div class="_52jh _7om2 _15kk _15ks _15km _4b47 _4b46" data-sigil="ufi-inline-actions">
                <div class="_52jj _15kl _3hwk _4g34"><a aria-pressed="false" class="_15ko _77li touchable"
                                                        data-ft='{"tn":">"}'
                                                        data-sigil="touchable ufi-inline-like like-reaction-flyout"
                                                        data-store='{"reaction":0,"feedbackTarget":"10159935560038463","kaiOSReactions":false}'
                                                        href="/ufi/reaction/?ft_ent_identifier=10159935560038463&reaction_type=1&story_render_location=timeline&feedback_source=0&is_sponsored=0&ext=1628151954&hash=AeQmDqjrKECVo8k9bxk&__tn__=%3E%2AW-R"
                                                        id="u_0_g_4b" role="button" tabindex="0">Like</a>
                    <div class="_1ekf" data-sigil="screenreader-reactions-trigger" role="link" tabindex="-1">Show more
                        reactions
                    </div>
                </div>
                <div class="_52jj _15kl _3hwk _4g34"><a class="_15kq _77li"
                                                        data-click='{"event":"click_comment_ufi","target_id":"10159935560038463"}'
                                                        data-ft='{"tn":"S"}'
                                                        data-sigil="feed-ufi-focus feed-ufi-trigger ufiCommentLink mufi-composer-focus"
                                                        href="/story.php?story_xxid=10159935560038463&id=8245623462&fs=0&focus_composer=0&__tn__=S%2AW-R">Comment</a>
                </div>
                <div class="_52jj _15kl _3hwk _4g34"><a class="_15kr _77li"
                                                        data-click='{"event":"click_share_ufi","target_id":"10159935560038463"}'
                                                        data-ft='{"tn":"J"}' data-sigil="share-popup"
                                                        data-store='{"is_acting_as_page":false,"reshare_post":false,"share_id":"10159935560038463","feedback_source":0,"feedback_referrer":null,"internal_preview_image_id":null,"shareable_uri":"\/story.php?story_xxid=10159935560038463&id=8245623462","user_id":100065274592441,"behavior":"custom"}'
                                                        href="/sharer.php?fs=0&sid=10159935560038463&__tn__=J%2AW-R">Share</a>
                </div>
            </div>
        </div>
    </footer>
</article>

  

 

My XPath expressions simply could not parse it. I tested with the following code:
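A minimal sketch of such a test, assuming lxml is installed. The `html` string is a hypothetical, shortened stand-in for the page source above, not the real markup. Whether the emoji actually breaks parsing may depend on the platform and the lxml version, so no expected output is shown.

```python
from lxml import etree

# Hypothetical, shortened stand-in for the page source (assumption, not the real page).
html = (
    '<div><p>Watch the BEST DEEP 3\'S '
    '<span class="_6qdm">\U0001f4a5</span></p>'
    '<footer><span>10 Comments</span></footer></div>'
)

# Parse the string as HTML and collect every text node in the tree.
tree = etree.HTML(html)
texts = tree.xpath('//text()')
print(texts)
```

If text that comes after the emoji (here `10 Comments`) is missing from the printed list, the emoji is a likely suspect.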

 

 

 

 

 

That was strange. After some testing, I found that the emoji characters were the cause.

 

 

 

Once I deleted the emoji, parsing worked normally:

 

 

 

Quite the surprise.

 

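Do you know I spent an hour tracking this down? I really dug the problem out bit by bit; it felt like reverse-engineering JavaScript, picking it apart one fragment at a time.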

 

 

Solving the problem

 

My first idea was to use beautifulsoup to find the nodes that contain emoji and delete them. That did solve the problem:

 

 

 

 

 

 

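But I found that this is not very general, because emoji are not guaranteed to sit inside the nodes with class _6qdm that I filtered on; they can appear elsewhere too.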

 

So a regular expression is still needed:

 

 

re.compile(u'[\U00010000-\U0010ffff]')
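To see what this pattern picks up, here is a small sketch run against a hypothetical one-line sample rather than the full page. The character class covers every code point above U+FFFF, which is where the 💥 from the page source (U+1F4A5, as the `1f4a5.png` in its image URL suggests) lives.

```python
import re

# Matches any single code point outside the Basic Multilingual Plane.
ASTRAL = re.compile('[\U00010000-\U0010ffff]')

# Hypothetical sample modelled on the emoji span in the page source.
sample = '<span class="_6qdm">\U0001f4a5</span> #ThatsGame'

found = ASTRAL.findall(sample)
print(found)
print([hex(ord(ch)) for ch in found])
```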

 

 

 

 

Since the pattern matches, replacing the matches with sub is enough:

 

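import re

from lxml import etree

# Read the saved page as UTF-8 text; the file is closed automatically.
with open('profile.html', encoding='utf-8') as f:
    cont = f.read()

try:
    # Any code point above U+FFFF, which is where most emoji live.
    pattern = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # Fallback for narrow Unicode builds of older Pythons: match surrogate pairs.
    pattern = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

print(pattern.findall(cont))  # show what is about to be removed
cont = pattern.sub('', cont)  # strip every match

# Earlier beautifulsoup approach, kept for reference (only covers span._6qdm):
# soup = BeautifulSoup(cont, 'html.parser')
# remove_obj = soup.select('span[class="_6qdm"]')
# if remove_obj:
#     [rem.extract() for rem in remove_obj]
# html_xpath = etree.HTML(str(soup))
html_xpath = etree.HTML(cont)
print(html_xpath.xpath('//text()'))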

 

Running it:

 

 

 

 

To verify, I switched to a different HTML structure:

 

 

 

 

 

Sure enough, it matches. OK, problem solved.
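One thing to keep in mind about this pattern: it targets code points above U+FFFF, not emoji as such. The sketch below uses three hand-picked sample characters (my own choice, not taken from the page) to show what that means in practice. Symbols inside the Basic Multilingual Plane, such as ☀ (U+2600), are left alone, while non-emoji characters above U+FFFF, such as CJK Extension B ideographs, are removed along with the emoji.

```python
import re

ASTRAL = re.compile('[\U00010000-\U0010ffff]')

# Hand-picked sample characters (illustrative assumptions, not from the page).
samples = {
    'collision emoji (U+1F4A5)': '\U0001f4a5',
    'sun symbol (U+2600, inside the BMP)': '\u2600',
    'CJK Extension B ideograph (U+20000)': '\U00020000',
}

# Record whether the pattern matches each sample character.
results = {name: bool(ASTRAL.search(ch)) for name, ch in samples.items()}
for name, matched in results.items():
    print(f'{name}: matched={matched}')
```

If the scraped text may contain rare CJK characters that need to be kept, a narrower range would probably be the safer choice.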

 
