Building the Beike DMP Platform in Practice

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The DMP (data management platform) has long been a topic of discussion, especially in advertising, where work is built on top of a DMP. Because every company faces different business scenarios and problems, the way a DMP is actually put into practice also differs. Today I will mainly share how Beike implemented its DMP."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Main topics:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Why does Beike build a DMP?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Overall design of the Beike DMP"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Results of the DMP platform"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Summary"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Why build a DMP?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's start from a few cases."}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/34\/348a664aa8722d688030df31c2def17f.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 
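App push notifications"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once DMP data is integrated with the push system of the Beike app, push messages can be personalized for each user, so that users no longer all receive the same identical message. For example, if a user prefers two-bedroom homes around 4 million yuan in Huilongguan, Beijing, we can put data about the homes that user is interested in into the push copy and its landing page, which makes the user much more willing to click."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. DSP advertising"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Part of Beike's promotion is DSP advertising, that is, the Beike ads you come across when scrolling the Baidu feed. The earliest targeting was by city: everyone saw ads for their own city, for example users in Beijing only saw Beijing ad copy. After integration with the DMP, ads are tied to interest. For example, if you browsed a residential community yesterday, the next day the system puts copy about that community into what it delivers, which makes users much more willing to click. As a result the overall ad CTR also rose, by roughly five to ten times, which is a very good result."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/70\/7007a0b2a7f5f586e26abdec329bdb58.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. On-site recommendation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The app home page has three parts. The first is exclusive second-hand listings, an entry point to shopping-guide pages: a guide feature made up of topical content about homes, and this content is tied to user interest. The second is the list page, where the business side applies different layout strategies according to user interest; whether new homes or rentals are ranked first depends on that interest, and the strategy is driven by DMP data. The third is the full listing feed, which is also personalized per user based on the DMP, showing the homes a user is most interested in first in order to encourage browsing and in turn improve retention and conversion."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. 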
Search"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Search is similar to recommendation: both show users the listings they are most interested in, to improve retention."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/36\/365cc826dfb668ce7ac48262d91c5ac6.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"5. Prospect recall"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data analysis shows that once a user gets in contact with an agent, the probability of later converting to an entrustment or a home tour is much higher. So we compute user data in real time: when a user's searches, listing views, views of agent information, and similar behaviors reach a certain volume, we consider the behavior rich enough and trigger prospect recall, that is, a pop-up that guides the user to leave contact information. That information is then distributed to an agent, who can reach the user by phone."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"6. 
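Business-opportunity guidance"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This is an example from the IM scenario. Once a user is in contact with an agent, we push the user's profile data to the agent, who can see the user's preferences at a glance and communicate with the customer more effectively. If the conversation goes well, the customer leaves a phone number, which then leads to an entrustment, and the customer becomes an entrusted customer. After that, the agent can see more detailed information in the B-side entrusted-customer analysis, such as recent activity status, recent behavior data, which communities were browsed, price trends of those communities, what time the customer likes to browse online, and what time the customer prefers to go out for tours. This information helps the agent arrange subsequent tours."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As these cases show, whether for off-site recall of existing customers or for fine-grained on-site operations, the DMP plays a very important role."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"So what exactly is a DMP?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d1\/d1e37b4b54b2ff75f1216c5b2b884db4.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A DMP takes all kinds of user data, structured and unstructured, integrates and computes over it, and turns it into tags; the tags describe and characterize users so that we can understand them. For example, from the tags we can read that a user in Beijing wants to look at homes in Zhengzhou and likes two-bedroom homes around 4 million yuan. Tags give a very direct view of the user. Based on this structured tag data, we can also connect easily with downstream systems for fine-grained operations off-site or on-site."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Overall design of the DMP"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 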
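Overall DMP design"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9d\/9d08f835bca72db2d94497017ca0c247.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Beike DMP architecture is divided into five layers, from bottom to top:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"① The bottom layer, the fifth layer, is the data collection layer. It collects two kinds of data. The first is user behavior data in the app, such as which homes were searched for, viewed, and followed. The second is offline data from business databases, such as offline tours and conversions to entrustment. Both kinds of data are collected into Hive. In-app behavior data is collected through a system called Luopan (羅盤), which Beike uses specifically for tracking-point management and tracking data collection."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"② The fourth layer is the data processing layer. After data lands in the warehouse, it is processed in various ways to produce subject-oriented wide tables. For users, for example, we build a user wide table that links all of a user's online and offline data and consolidates it; on top of this wide table we do data analysis and model computation. The final output is the base tag data for agents, homes, and customers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"③ The third layer is the application data storage layer. The tag data produced by processing sits in Hive. Hive is an analytical data warehouse; its queries are very slow and cannot support the fast lookups that business systems need, so an application storage layer is required. We currently use three stores. The first is HBase, which serves key-value lookups under high concurrency. The second is ClickHouse, a relatively new OLAP engine, used for SQL-based audience selection and audience insight. The third is MongoDB: after an audience package is built, we sync the ID data to MongoDB and connect it to business systems to serve their queries. For example, when pushing messages to users, the push system pulls the device data in the audience package we produced and pushes per device, which involves highly stable, highly concurrent paginated queries."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"④ 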
The second layer is the application layer, built on top of the storage layer. Its main functions are:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tag management: quickly imports Hive data into ClickHouse (CK) or HBase, and quickly publishes data to the tag layer;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tag marketplace: lets everyone quickly see which tags are currently available;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Audience selection: supports freely combining tags through visual drag-and-drop to select users;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Audience insight: after an audience package is selected, shows what the audience is made up of, such as its distribution by region and by gender;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Audience expansion: widely used in advertising; it expands a small set of seed users into a very large user group."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"⑤ The top layer is the API layer. It provides unified data output and includes authentication, rate limiting, disaster recovery, and other controls. Through the API layer we connect to business systems such as recommendation and search, audience analysis, and the push system. In addition, from the data layer up to the API layer we built fairly complete monitoring and alerting, which safeguards the availability of the data and the high availability of the API."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. 
Layer by layer"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, let's look at how each layer is built, the problems we ran into, and how we solved them."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"① Data processing layer:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/07\/07503bef78cf5b6cc367bf7b7239cdec.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/fc\/fcc1cd912b0073c534de3c861882f17f.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The data processing layer produces four kinds of data: two that are more basic (basic data and behavior data) and two that are more central (preference data and prediction data)."}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first kind is basic data, covering geographic location, app-related attributes, and activity. For example, a user's resident city is resolved from IP, while the business districts of the workplace and the home are determined by how GPS locations are distributed over time: the place seen most during the day is taken as the workplace, and the place seen most at night as the home. Other examples are whether the user has installed the app, whether the device is iOS or Android, which version is installed, usage habits, and when the user registered, activated, and was last active."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second kind is behavior data, which counts cumulative behavior over a period of time and can be obtained with simple statistics. These cumulative figures follow Beike's home-finding path: after entering the app, a user searches, views listings, follows listings, and so on, all the way to a final transaction. The corresponding data includes searches in the last X days, listing views in the last X days, tours in the last X days, and so on."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"typ
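e":"text","text":"The third kind is preference data, the most important kind; the system applications described earlier are all driven by preference data. We focus on users' preferences across home attributes (business district, price, area, and so on) and compute them with one fairly general formula: behavior count times behavior weight times a decay factor gives a score, the preference score. The behaviors are those along the home-finding path. Different behaviors carry different weights: for example, one call to the 400 hotline might count as five listing views, and one tour as ten listing views. Initial weights can be agreed with the business side, and later they can be computed by algorithms. The decay factor captures how user interest drifts: interest keeps changing, so more recent behavior gets a higher weight and older behavior a lower one."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The fourth kind is prediction data, obtained mainly through machine learning and computed models. It describes the probability that a user will take some action in the future, such as the probability of generating a business opportunity, an entrustment, a tour, or a transaction in the next few days. The prospect recall mentioned earlier uses the business-opportunity probability."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Besides these four kinds of user data, we also have agent data. With both user data and agent data we can do all kinds of matching, such as recommending agents to users, or connecting to the agent task system and using a scheduling system to dispatch agents and improve efficiency."}]}]}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f8\/f8a763c9320ca62822991de552115684.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"While building this data we ran into many problems. Three were the most central and prominent."}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Problem 1: how to unify user identities. Nobody building a DMP or working with user data can avoid this. Suppose a user has two phones, one iOS and one Android, with the Lianjia 
app and the Beike app installed on the Android phone and the Beike app installed on the iOS phone. Without identity unification we might recognize this user as three users, and the resulting profile would clearly be incomplete and inaccurate. Our approach is to unify across devices by phone number, and across apps on the same device by the device identifier (IMEI). In practice, we load all of the user node data with Spark, build the graph with GraphX, and generate a unique user entity from the connectivity of the graph."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Problem 2: DMP data is computed over 180 days of data, and each run may involve more than ten billion records. At this scale, how do we keep data delivery timely? We mainly used four methods. The first is column pruning: the subject wide table has a great many fields, but tag computation needs only some of them, so we derive many temporary tables from the wide table that keep only the required fields. The second is pre-aggregation: every day we need to compute many behavior statistics, but there is no need to start from the detail records each time; we pre-aggregate once a day and compose the required statistics from the daily aggregates. Column pruning and pre-aggregation together cut the data volume to between 1\/10 and 1\/20 of the original, and the computation shrinks accordingly. The third is incremental computation: every run spans the most recent 180 days, so each time we add the newest data and drop the oldest instead of recomputing all 180 days. The fourth is cluster resource isolation: DMP jobs originally ran in a shared queue, where a newly launched batch of inefficient jobs could clog the whole queue and hurt the stability of DMP data delivery, so we requested a dedicated queue for the DMP, which secured timely delivery."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Problem 3: how to develop and launch 50+ preference tags quickly across iterations. Homes have a great many fields, because the business covers second-hand homes, new homes, rentals, and more, and each business line has its own characteristics, so there is a lot of tag data. Although the data is plentiful, the formula is the same, that is, the metrics are alike and only the dimension differs from one computation to the next: some tags are computed by price band, some by business district. We therefore made this configuration-driven: a database maintains the dimension information of all tags, SQL is generated dynamically from that information, and Spark runs the SQL to produce the tag data. When a tag computation needs to include a new kind of user behavior, changing the generation logic stored in the database through configuration is enough to launch the tag."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"② 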
Storage layer:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/4b\/4bc50ece4e8cce6d815da81a97f9bbb8.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the storage layer we mainly faced four problems."}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First: how do we keep queries stable at hundreds of millions of calls? DMP call volume is now very large; almost every user action in the app involves a call to the DMP, which drives business decisions and the app's strategy responses."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Second: second-level audience estimation. When operations staff run campaigns, they need to estimate the number of users in an audience. The usual practice is to keep building audience packages and checking their size until it meets expectations, which is very inefficient. Could there be an estimation feature that quickly estimates the size of an audience package as tags are dragged and dropped?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Third: minute-level audience computation. We compute roughly 200 to 500 audience packages per day; how do we make sure they are computed as quickly as possible and put to business use?"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fourth: we currently have 1,300+ tags; how do we launch tags quickly?"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With these problems in mind, let's first look at the overall design of the storage layer."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/48\/485d7bea4524b89b8d87eb99bfc4cca1.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"from
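Paste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Part 1: on the far left are the four core datasets in Hive, which are loaded into HBase through BulkLoad. BulkLoad maximizes import efficiency while avoiding performance overhead. Once the data is brought together, we serve it through an API. We made several optimizations here. One was to provision SSD disks for HBase: HBase is mainly built for writing massive data, and its queries typically take tens to hundreds of milliseconds, which does not meet our requirement of about five milliseconds; shortening query time leaves more time for business strategies. The other was a Redis pre-cache for the API: the user data that business systems call frequently is mostly that of recently active users, so every day we use Spark to load users active in the last seven days from Hive into Redis as a pre-cache. We also added rate limiting, disaster-recovery controls, and similar measures to keep the API highly available."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Part 2: Spark loads Hive data into ClickHouse, where audience selection and audience insight are done in SQL. Originally selection was plain SQL: dragging tags maps to AND, OR, and NOT over tags, which was then translated into ClickHouse and/or operations. But ClickHouse does not handle joins well, and joins caused problems: when the tag combination is very complex, say 20 to 50 tags for one audience package, the resulting SQL either brought ClickHouse down or ran too slowly to meet the business side's performance needs. We solved this by introducing bitmaps, a common technique for audience selection. We first build a user ID table, then load users' tag data with Spark, aggregate it along each dimension, and generate bitmap data per dimension; in other words, Spark produces the bitmap data that ClickHouse needs underneath. ClickHouse stores bitmaps as Roaring Bitmaps, so what Spark generates is Roaring 
Bitmap data, which is serialized and stored in a bitmap table in ClickHouse. As an example of use: gender has male and female, so we generate one bitmap for male users and one for female users. If a selection asks for city Beijing and gender male, we fetch the bitmap for male and the bitmap for Beijing and AND them to get the audience; the bitmapCardinality function then gives the audience size very quickly, which enables second-level audience estimation. That is how the bitmap approach works. One issue needs care, though: many behavior values are continuous, such as how many homes a user viewed in recent days. How do we handle continuous data? Underneath, a user who viewed 8 homes is placed in the bitmap for 'greater than or equal to 8' and also in the bitmaps for 'greater than or equal to 7' down to 'greater than or equal to 0', all of which are stored. Then, when combining tags, to select users with fewer than 8 views we XOR the 'greater than or equal to 0' bitmap with the 'greater than or equal to 8' bitmap, which yields the users with fewer than 8. In this way, greater-than, equal-to, and less-than conditions can all be translated into XOR operations. Bitmaps brought a very large improvement: selecting one audience package dropped from 15 to 20 minutes to about 1 minute, and building the whole set of audience packages (200-500) dropped from more than ten hours to 4-5 hours."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Real-time profiles"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c4\/c42bc444b69063457eab0e2e132c9a6c.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, real-time profiles. People often question whether a long-cycle business like real estate needs real-time profiles at all. In our data analysis we found that the user lifecycle has several stages. Early on, a user's needs are uncertain and can change quickly over time; only after a period of settling do preferences stabilize. Without real-time profiles we cannot capture these fast early changes in interest, and differentiated service suffers, so real-time profiles are necessary. In practice, all online behavior data is collected into Kafka, and all offline data is collected into Kafka via binlog. A behavior aggregation module (Spark Streaming) consumes the Kafka data, combines it with home data, and counts the user's behaviors of each kind, for example 3 views in Huilongguan or 5 views of homes priced above 4 million yuan. Based on these counts, together with the user information passed along from Kafka, the preference computation module and the behavior module produce the user's preference data. Preference is computed with the formula described earlier: behavior count times behavior weight times time decay. The resulting preference scores are stored in Redis and served through the API. One additional note: the behavior counts in this process are stored in HBase at hourly granularity, one copy per hour, in a wide-table layout so that the preference computation module can query them quickly. Real-time profiles are mainly used in recommendation and search, with clear business impact: 
CTR and CVR improved by 3% to 10%+."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Finally, quick tag launch, that is, how to import Hive data into ClickHouse (CK) quickly."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0b\/0b2c50a96f61cc9433c71292e613902a.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Through configuration management we know which ClickHouse table and column each Hive table and field maps to, and whether the field is an enumerated value or a continuous value. For an enumerated value, we also maintain its code table. Spark then imports the data into ClickHouse dynamically, after which tags can be configured. Tag configuration management handles publishing ClickHouse data to the tag layer, taking tags online and offline, and maintaining the tag hierarchy. On top of these tags, visual audience selection becomes possible: after dragging any combination of tags you can see how many users fall under it, and you can also run fine-grained, multidimensional analysis of the audience, such as its distribution by region and gender and its activity."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/02\/024e0bd1c27b03956e9637bb81810a3a.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Results of the DMP platform"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 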
Coverage of all scenarios and a massive user base"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/13\/13c8f8270bbf58be534c82e60317e026.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The platform spans both the Beike and Lianjia apps and also integrates the PC and M (mobile web) sites, the apps, and mini programs. With all data consolidated, it covers 400 million users in total. The business lines covered include second-hand homes, new homes, rentals, overseas, and home renovation, and app users across these lines reach 60 million."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. A rich tag system and strong audience computation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/34\/34b480eea8fc8903d6a2686a9975ff6c.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are 1,300+ tags in total, preference coverage among app users reaches 60%, 200 to 500 audience packages are produced per day, audience building takes minutes, and audience estimation takes seconds."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. 
Stable and reliable API service"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d5\/d53ce15a80d57e1b15a3dc91b5fdc585.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The API currently handles 600 million calls, and 800 million at peak, with response times of about five milliseconds and an SLA of four nines."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/14\/142aa7de605f6e3e6d64f8b27135df49.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Beike matches information among agents, homes, and customers, and tags connect the three very well: through tags we can recommend homes to customers, recommend agents to customers, help agents understand customers better, and help agents recommend homes of interest to entrusted customers. As the cases in this talk show, the DMP is used very widely across the business."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Going forward, we will focus on improving data accuracy and increasing data coverage so that the DMP delivers more business value, and we will also work on user lifecycle management. The cases presented earlier each act on a single point, but users need an operations strategy that spans the full lifecycle, for example how to reach them off-site and then how to run fine-grained operations on-site. We want to manage user lifecycle states from a holistic view, move users toward the mature stage as far as possible, and ultimately generate business value."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"That's all for today's talk. Thank you."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":
"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article is reposted from: DataFunTalk (ID: datafuntalk)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/inCoVPXuY-G2RZzBNH4THQ","title":"xxx","type":null},"content":[{"type":"text","text":"Building the Beike DMP Platform in Practice"}]}]}]}
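The threshold-bitmap scheme from the storage-layer discussion (one bitmap per "count >= k", combined with XOR and AND) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Beike's implementation: plain Python integers stand in for Roaring Bitmaps, nothing here talks to ClickHouse, and all function names and sample data are hypothetical.

```python
from collections import defaultdict


def build_threshold_bitmaps(view_counts):
    """Return {k: bitmap of users whose count is >= k}.

    view_counts maps a small non-negative user index to a view count.
    Each bitmap is a Python int; bit i is set when user i is a member.
    (Assumption: ints stand in for Roaring Bitmaps.)
    """
    bitmaps = defaultdict(int)
    for user, count in view_counts.items():
        # A user with `count` views belongs to ">= 0" through ">= count".
        for k in range(count + 1):
            bitmaps[k] |= 1 << user
    return bitmaps


def users_less_than(bitmaps, k):
    """Users with count < k: (>= 0) XOR (>= k)."""
    return bitmaps[0] ^ bitmaps.get(k, 0)


def users_equal(bitmaps, k):
    """Users with count == k: (>= k) XOR (>= k + 1)."""
    return bitmaps.get(k, 0) ^ bitmaps.get(k + 1, 0)


def cardinality(bitmap):
    """Number of users in the bitmap (what bitmapCardinality reports)."""
    return bin(bitmap).count("1")


def members(bitmap):
    """Sorted list of user indices whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]


# Hypothetical sample: user index -> listings viewed in the recent window.
view_counts = {0: 8, 1: 3, 2: 7, 3: 0, 4: 12}
bitmaps = build_threshold_bitmaps(view_counts)

fewer_than_8 = users_less_than(bitmaps, 8)

# Combining with an enumerated tag is a plain AND, as in "city is Beijing".
beijing = (1 << 0) | (1 << 1) | (1 << 4)
beijing_and_fewer_than_8 = beijing & fewer_than_8
```

Because every ">= k" set is contained in the ">= 0" set, XOR here behaves as set difference, which is why a "less than" condition needs only two stored bitmaps and no scan over raw counts.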