愛奇藝用戶分析平臺實踐:TB級數據查詢秒級返回

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"本文根據鄒興標老師在〖2020 DAMS中國數據智能管理峯會〗現場演講內容整理而成。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"導讀:在流量飽和的時代背景下,業務的增長依賴於通過大數據快速精準分析進行業務試驗與升級,從而找到真正的Magic Number。業務各個節點的發展對數據具有強依賴性,隨着業務的發展,這些數據對於時效性以及分析複雜度的要求越來越高。如何利用平臺化的方式,將精細化的業務、分析需求統一解決,如何實現多維度,多行爲的交互式分析是成爲了技術團隊的條件。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"精細化運營平臺-北斗是愛奇藝技術團隊自研的交互式用戶分析平臺。支持用戶核心行爲,疊加用戶畫像標籤的方式以定位目標人羣,從而進行有針對性地業務維度分析。","attrs":{}}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、用戶分析平臺誕生的背景","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"愛奇藝在2015年上線了基於hive的自助查詢平臺。但是隨着業務的快速發展和數據量的急劇增長。基於hive的查詢平臺從分析深度及分析時效性已經無法滿足業務的需求。於是急需一個交互式(秒級結果返回)的用戶分析平臺來滿足業務需求。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在用戶分析平臺上線前,業務通過自主查詢工具進行數據分析時面臨着以下困難:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"查詢耗時長:基於hive的多表關聯及大表的單表查詢往往需要半小時及以上的時間才能出結果,無法快速驗證想法;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分析門檻高:用戶基於數據理解進行數據查詢,沒有現成的路徑,留存等分析模板使用,從而使得數據分析需要專業分析師才能進行,無法賦能與運營等業務人員數據分析能力;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據使用未形成閉環:在查詢平臺獲取查詢滿意的數據結果後,無法將數據結論直接應用到線上形成數據分析的閉環。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對以上,我們搭建了用戶分析平臺-北斗,實現了交互級別的用戶行爲分析,讓數據可快速的 分析→決策→行動,實現數據分析的閉環。不但爲業務提供表的查詢服務,進一步提供了用戶行爲分析的解決方案。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下圖:用戶分析平臺接入愛奇藝各大業務數據及基礎平臺數據,以用戶分羣爲核心,支持畫像,路徑,留存等各類分析行爲。且可將分析人羣可輸出至線上系統,實現數據分析的閉環。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/28/28c1aa816e83e3af9573b1314de5091d.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、用戶分析平臺的平臺架構","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1、引擎選型","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"項目對於查詢引擎的選擇有以下三個原則:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大數據基礎架構對查詢引擎的支持粒度;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"項目成員對於查詢引擎的熟悉程度;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"查詢引擎在項目場景上的性能優劣。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"公司大數據基礎架構團隊已經支持Kylin、Impala、Kudu、Druid,Spark等不同的數據查詢引擎。我們選擇了Impala、Spark、Kylin進行了性能測試。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/29/29dfd52da84769a3cb29ed387f3959ea.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"測試結果如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/53/539e63fa07155546adcce5194e7dd1e5.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用戶分析平臺最主要的特性是實現多行爲的組合分析,例如:圈選在最近30天播放過青春有你2大於120分鐘,且是一線城市的女性用戶。需要將播放行爲關聯用戶屬性表。因此此次測試最重要的指標是大表之間的關聯。經過多個場景的測試後,最終選擇了impala作爲用戶分析平臺的查詢引擎。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2、數據模型","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下圖,項目參考了神策的數據建模思路,結合愛奇藝的業務場景,將數據劃分爲兩大模塊,用戶行爲數據及用戶畫像數據:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8c/8c91995c37f48a43e8f75ffdc58e7396.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用戶行爲和用戶畫像之間通過設備id或者賬號id實現關聯分析;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"行爲包含如下元素:","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用戶畫像:涵蓋了算法團隊對於用戶性別,年齡等預測信息。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於如上的數據模型,可實現各類用戶行爲疊加用戶畫像的分析,滿足個性化的業務場景需求。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3、產品技術架構","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是愛奇藝日均有上億的獨立設備數,超過500TB的數據增量。如何基於impala實現秒級的查詢返回仍然是一個巨大的挑戰。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過對目標使用用戶的調研,在絕大多數的分析場景下可以接受一定的誤差(千分之五以內),於是系統的核心模塊採用了抽樣分析。如下圖:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a5/a5a1f7f582bf8d98b9b83463c8582cac.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用戶行爲數據及畫像數據使用了MurmurHash算法將數據均勻的打散到100個分區中;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用parquet格式進行數據存儲,減少scan hdfs的時間;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"後端服務使用動態採樣進行分析查詢,即初次查詢單分區數據,若發現目標樣本過少,提升抽樣比,在追求效率的同時保證誤差在千分之五以內。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下圖:需要圈選出20200101至20200331觀看中國新說唱2019時長超過1200秒的女性,且在20200401未啓動的用戶:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d3/d3d38107e1277110a21c7b2c6a158dfc.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"示例需要從900億數據中圈選出條件人羣,如果使用全量查詢需要耗時70S,使用下圖抽樣引擎後,查詢效率提升至7S,誤差在千分之一。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/66/664d6e46abd3b861694388b19d09240d.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過抽樣查詢,滿足了用戶對於分析的時效性的需求。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4、查詢優化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"技術團隊在其他查詢優化上也做了很大的努力,下面簡單介紹下畫像分析場景做出的優化。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5、畫像分析優化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用戶分析中有一個用戶對劇集類型偏好的畫像模塊,每個用戶基本都有多個劇集偏好,需要計算出如下圖的分佈:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/40/403f5ca95cb550206c5ec9fad3559772.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最開始設計的數據存儲如下表格,每個用戶有多條記錄,存儲成本高,scan hdfs耗時,參考impala支持結構體類型,技術團隊引入了 ARRAY < struct>類型,每位用戶存儲一條特徵,數據行數縮減爲原來的二十分之一。  ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"參考執行計劃,scan hdfs的縮減爲原來的三分之一,並且後續減少了與人羣關聯的壓力。抽樣的畫像分析的性能從8.66s,提升至2.89s。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/19/195b77cf2c3e8df8a458bf16a3f5d449.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"三、業務應用場景","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面介紹一個用戶分析平臺在業務上的實際應用場景。分析師使用用戶分析平臺對熱劇進行分析時:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圈選出看完慶餘年的人羣;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對人羣進行留存分析分析,發現熱劇人羣的留存率,播放次數相對於其他用戶要低;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將圈選的人羣通過用戶分析平臺推送至線上進行AB測試push;","attrs":{}}]}],"attrs":{}},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Push後使用用戶分析平臺進行人羣分析,次留及播放次數均有一定的提升。","attrs":{}}]}],"attrs":{}}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用用戶分析平臺實現了分析與線上運營的聯動,實現數據可分析,可決策,可行動的閉環。並且極大的縮短了這個週期。從傳統需要數週來分析及上線策略提效至一天內即可完成。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c0/c0482ddafd0dc37ef5f75227f98f99cc.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"四、未來規劃","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前用戶分析平臺還是基於T+1的離線數據分析,用戶對於數據的時效性的訴求越來越強烈。未來的愛奇藝用戶分析平臺提供實時,移動,更加智能來滿足更多場景的用戶需求。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"鄒興標,","attrs":{}},{"type":"text","text":"愛奇藝數據分析平臺高級經理,10年數據領域工作,專注數據建設及數據應用方向。目前在愛奇藝負責用戶分析平臺及內容分析平臺的開發工作。在數據倉庫及OLAP分析方面有豐富的從業經驗。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:","attrs":{}},{"type":"link","attrs":{"href":"https://mp.weixin.qq.com/s/b7pHXd4PUBq8g8QrQYZeqg","title":""},"content":[{"type":"text","text":"愛奇藝用戶分析平臺實踐:TB級數據查詢秒級返回","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章