基於Erlang語言的視頻相似推薦(三十一)

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"寫在前面:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大家好,我是強哥,一個熱愛分享的技術狂。目前已有 12 年大數據與AI相關項目經驗, 10 年推薦系統研究及實踐經驗。平時喜歡讀書、暴走和寫作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"業餘時間專注於輸出大數據、AI等相關文章,目前已經輸出了40萬字的推薦系統系列精品文章,今年 6 月底會出版「構建企業級推薦系統:算法、工程實現與案例分析」一書。如果這些文章能夠幫助你快速入門,實現職場升職加薪,我將不勝歡喜。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"想要獲得更多免費學習資料或內推信息,一定要看到文章最後喔。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"內推信息","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你正在看相關的招聘信息,請加我微信:liuq4360,我這裏有很多內推資源等着你,歡迎投遞簡歷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"免費學習資料","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你想獲得更多免費的學習資料,請關注同名公衆號【數據與智能】,輸入“資料”即可!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"學習交流羣","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你想找到組織,和大家一起學習成長,交流經驗,也可以加入我們的學習成長羣。羣裏有老司機帶你飛,另有小哥哥、小姐姐等你來勾搭!加小姐姐微信:epsila,她會帶你入羣。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作者在《基於內容的推薦算法》中介紹了基於內容的推薦算法的實現原理。在本章中作者會介紹一個具體的基於內容的相似推薦算法實現案例。該案例是作者在2015年基於Erlang語言開發的相似視頻推薦系統,從開發完成就一直在作者公司多個產品線中使用,該算法目前一直還在使用中,已經5年了,效果還相當不錯。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本章會從視頻相似推薦系統簡介、算法原理及實現細節、問題與難點、爲什麼用Erlang語言開發、系統架構與工程實現、核心亮點、未來優化方向、個人收穫與感悟等8個方面來講解。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過學習本章知識點,讀者可以深入瞭解基於向量空間模型的相似推薦算法原理及實現細節、對Erlang語言特性也會有基本的瞭解、同時對實現一個簡單高效的Master/Slaver架構的分佈式計算框架的原理和工程細節有基本的概念。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本章不要求讀者懂Erlang語言,不懂也可以完全理解本章講解的內容。Erlang語言雖然是一個小衆語言,但是Erlang的很多核心思想是值得讀者瞭解的。本章基於Erlang 開發的相似推薦系統,可以讓讀者更好地瞭解利用分佈式技術實現相似推薦的工程實現細節。熟悉分佈式Master/Slaver架構,對於讀者更好地學習瞭解Spark也是大有裨益的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.1 視頻相似推薦系統簡介","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作者所在公司從事智能電視/智能機頂盒上的視頻業務,主要產品是某貓APP,由於智能電視上主要依賴遙控器操作,所以操控不是很方便,產品對推薦系統的依賴會更大。相似推薦就是爲每個視頻關聯一組相似(或者有一定關聯關係)的視頻。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"某貓的產品包括長視頻(電影、電視劇、動漫、少兒、紀錄片、綜藝6種)和短視頻(資訊、奇趣、遊戲、體育、戲曲、音樂6種)兩大類,我們需要爲每類視頻類型的節目關聯一組相似推薦(在同一類別中做推薦,電影的相關推薦只能是電影、體育的關聯推薦是體育等)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"具體怎麼計算相似呢?我們是基於視頻的metadata信息來計算兩個視頻之間的相似度,利用相似度從高到低來排序,獲取某個視頻最相似的topN作爲關聯或者相似推薦。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏提一下,雖然長視頻和短視頻採用同一套算法體系,但是由於視頻類型不一樣,前端的產品形態是不一樣的。由於短視頻單片時長較短,一般幾分鐘就播完了,所以短視頻的關聯推薦採用的是信息流的方式,播放原視頻,它關聯的視頻會作爲信息流播放,這樣整體用戶體驗會好很多。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相信通過上面的介紹,大家對視頻相似推薦的產品形態比較清楚了。在下面一節,我們來講解具體的算法實現細節。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.2 相似推薦算法原理及實現細節","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"視頻一般是具備結構化信息的,一般視頻公司會有","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"CMS(Content Management System,內容管理系統)","attrs":{}},{"type":"text","text":",該系統一般會包含媒資庫,媒資庫中針對每個節目會有標題、演職員、導演、標籤、評分、地域等維度數據,這類數據一般存放在關係型數據庫(如MySQL)中。這類數據,我們可以將一個字段(也是一個特徵)作爲向量的一個維度,這時用向量表示視頻,每個維度的值不一定是數值,但是形式還是向量化的形式,即所謂的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"向量空間模型","attrs":{}},{"type":"text","text":"(","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Vector Space Model,簡稱VSM","attrs":{}},{"type":"text","text":")。我們可以通過如下的方式計算兩個視頻之間的相似度。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設兩個視頻的向量表示分別爲:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f4/f4adfd5b63be71c8bbceee7817f5abb6.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/51/51761787b19e47dc1c54ca479c2c7b57.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這時這兩個視頻的相似度可以採用如下公式計算:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/37/3795ae1b63c7f505662ffd142275c915.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/2f/2fb1e1d228c60c8431ec37b8cc1c1655.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"代表的是向量的兩個分量","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c4/c499e0650517e9a3f2f948b3da97cfd6.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"之間的相似度。可以採用Jacard相似度等各種方法計算兩個分量之間的相似度。上面公式中還可以針對不同的分量採用不同的權重策略,見下面公式,其中","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f1/f1dd52ae0bbb2e6a619df75d943614a0.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"是第t個分量(特徵)的權重,具體權重的數值可以根據對業務的理解來人工設置,或者利用機器學習算法來訓練學習得到。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/22/22f70737a7ab5be2ad4696e445e1e250.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有了上面的計算公式,我們就知道該怎麼計算兩個視頻的相似度了。但是上面公式中未解決的問題是,對於某一個具體的維度,我們該怎麼計算相似度呢?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上式中的","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/15/15d754b181b21590d15793f6d019e94b.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"、","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/37/377968bbe97ef8be93e0cb04e135c881.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分別代表兩個視頻第i 個維度的值,可以是數值、字符串等。下面我們列舉一下針對不同的分量怎麼計算分量之間的相似度。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.2.1 年代","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設兩個視頻","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b7/b720aaaa56225a73c150f34f7452dfa9.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 的年代分別是","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ab/abc62bb6c79738713b12ecb6eb49cd94.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"和","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8f/8f2d4a3bff974d1b7cb8e0d47f34bae7.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":",計算在年代這個維度的相似度可以採用如下分段函數表示:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/23/2340def8a80368042e0c7fad4f9d2881.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中(1)、(2)是剔除掉無效的","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8f/8f2d4a3bff974d1b7cb8e0d47f34bae7.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"值,(3)是給出的當","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8f/8f2d4a3bff974d1b7cb8e0d47f34bae7.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在0到2020年之間的一個計算公式,","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8f/8f2d4a3bff974d1b7cb8e0d47f34bae7.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"值越大,最終的相似度越大。這裏相似度與","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ab/abc62bb6c79738713b12ecb6eb49cd94.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"無關(","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"所以嚴格來說,不叫做相似度,而是貢獻度,下面類似,不再說明,直接用貢獻度來描述)","attrs":{}},{"type":"text","text":"在其他條件相同的情況下,越是近期拍攝的視頻權重越大。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.2.2 標題","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設兩個視頻","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b7/b720aaaa56225a73c150f34f7452dfa9.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ,","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/62/6216c0254fdae47209310e9bb6c3b9fb.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"和","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/af/af99a9058c05069d0dd2fb8cae256b4e.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分別是這兩個視頻的標題,首先我們可以通過分詞或者關鍵詞提取算法將","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/62/6216c0254fdae47209310e9bb6c3b9fb.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"和","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/af/af99a9058c05069d0dd2fb8cae256b4e.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"提取關鍵詞得到對應的關鍵詞,假設如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/50/50e29156ab9ea5d01a36b03d1d5e0551.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/34/34eedca8b3b30d3e99edeae5e20d615b.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/cc/ccb9c3a950a7cf1586aecd6af99cb65a.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分別是","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/27/27cd6cb95aa31d3067e8df3995338836.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"提取關鍵詞後的集合,那麼我們可以用如下Jacard相似度來計算","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/27/27cd6cb95aa31d3067e8df3995338836.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"之間的相似度:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/83/83eb7c6a7baa4cb75f24203a916096b0.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏涉及到很多NLP方面的技術和領域知識。首先我們需要有視頻行業相關語料,才能夠保證分詞準確,另外標題可能是雜亂無章的,比如很多電影標題中包含粵語版、國語版、新傳、第三季等對相似度價值不大的詞或者詞組,這些都需要藉助規則或者NLP技術做預處理,才能得到較好的關鍵詞提取效果,最終纔會有較好的相似度計算效果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.2.3 標籤","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於標籤,可以採用跟上面標題一樣的計算方法,因爲標題提取關鍵詞後的每個關鍵詞可以看成是一個標籤。這裏不再細說怎麼計算了。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.2.4 地域","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設兩個視頻","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b7/b720aaaa56225a73c150f34f7452dfa9.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ,","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/da/da39f9cd3b902f9e142bbe98883bcb4d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"和","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/df/df5db948d44fc744dd7b7ef296c45b05.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分別是這兩個視頻的出品地,我們可以用如下公式來計算這兩個視頻在地域維度的相似度。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d4/d45089dc2d8a5f670c67cc90a49a0405.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個公式可以考慮得更復雜一點,將地區按照語言、地域等分成幾類,比如北美、東歐、東南亞等,當兩個視頻的地域完全一樣時相似度爲1,當兩個視頻出品地在同一個地區分類內時,相似度爲0.5(或者其他大於0小於1的值,根據業務經驗開發人員自己定義),否則爲0,這樣會更加精確合理。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.2.5 某瓣評分","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設兩個視頻","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b7/b720aaaa56225a73c150f34f7452dfa9.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ,","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e7/e758d236cc4d54e3898fb151f5b629c3.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"和","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b5/b5d880fe0ace4987eeb171980ad2d52c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分別是這兩個視頻的豆瓣評分(豆瓣評分是0到10之間),下面公式給出視頻","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/37/377e4ab65f5c226769ca1ca7fc72ff5a.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在某瓣評分這個維度的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"貢獻度","attrs":{}},{"type":"text","text":",評分越高,貢獻度越大。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/62/62da69881afb719b6b4d0e832932c82e.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.2.6 是否獲獎","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設兩個視頻","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b7/b720aaaa56225a73c150f34f7452dfa9.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ,","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/5e/5e852bce6abee25cdb2fe1f62b35c776.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"和","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/84/8441cc9d349c7c9c87e603876e9e6330.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分別是這兩個視頻所獲的獎項,那麼可以簡單用下面公式來計算視頻","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/37/377e4ab65f5c226769ca1ca7fc72ff5a.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在獲獎這個維度上的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"貢獻","attrs":{}},{"type":"text","text":",當然計算公式可以更加複雜,對不同獎項區別對待,級別更高的獎項可以給出更高的貢獻度。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bf/bf2122cff9add81b1537ca2b14aff7ef.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.2.7 導演與演職員","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通常情況下,視頻一般只有一個導演,導演的相似度可以類似地域一樣的方法計算。而演職員可以採用跟標籤類似的方法計算相似度。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上面給出了幾個內容維度怎麼計算兩個視頻在該維度的相似度(或者貢獻度),作者在實際項目中用到的維度比這個更多,但是計算原理類似,這裏不再一一列舉。另外,具體的計算和處理邏輯也會更復雜,比如對於標籤相似度,標籤之間是有相似關係的,比如驚悚和恐怖,它們是相似的。上面方法沒有考慮到標籤之間的這種相似關係而是將他們看成不同的標籤,計算相似性多少有點簡單。實際上,作者在項目中用到了數學中等價類的思想,將很相似的標籤看成是同一類的,在同一類的標籤之間是有相似度的,這樣如果一個電影的標籤是恐怖,另外一個電影的標籤是驚悚,按照上面的計算方法相似度是0,而按照等價類的思路,相似度是大於0的。同時,對於某些類別的視頻加了很多規則性的東西,更好地適應不同類別內容的相似計算。像這類提升效果的處理方法還有很多,這裏不再細說。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.3 問題與難點","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該項目最早(2012年底)是採用Java來開發的,寫一個單機程序,當時視頻量還比較少,也沒有這麼多視頻類別,基本可以支撐,當後面加入越來越多的視頻類別,每類視頻數量也越來越多時,單機計算的性能就出現瓶頸了。當時採用的應對方案是將視頻按照類別分成幾組,每一組採用一個Java線程計算,雖然某種程度上可以做到並行計算,但是每個視頻類型的數量及增長速度是不一樣的,人工按照類型拆開分佈不夠均勻,問題比較多。回顧老的設計面臨的問題,總結一下,基於該算法的相似視頻推薦主要難點有如下幾個:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.3.1 數據量大,增速快","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前面講到我們長短視頻加起來大概有12個類別,類別多,總量有幾百萬條視頻。對於短視頻來說,特別是資訊、體育等,每天都有大量(萬級)新增的視頻。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一次全量計算所有節目的相似度時,由於需要計算的節目太多,必須採用分佈式計算,否則計算太慢,單機可能要花幾個月時間才能完全算完。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.3.2 需要實時計算","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在某貓APP上,在各類短視頻站點樹中視頻一般按照時間排列,新的短視頻放在最前面,用戶更容易看到。所以對於新增的視頻,我們需要實時計算出相似的視頻,否則只能用默認推薦代替相似推薦,效果肯定不會太好。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.3.3 計算與某個視頻最相似的視頻需要遍歷所有視頻","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們一般只關聯推薦同一大類的視頻,但是各個類別數量是極不均衡的,電影1-2萬部,資訊大概有上百萬。在我們爲某個資訊計算與它最相似的N個資訊時,我們需要遍歷所有其他的資訊才能找到與它最相似的N個,怎麼設計這個遍歷過程對計算時間有很大影響。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.3.4 需要更新已經計算的視頻的相似度","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於新入庫的視頻A,我們需要計算它的相似推薦,同時,對於某個已經計算好的視頻B,如果新加入的視頻A與B的相似度高於B原來計算好的topN最相似中某個視頻的相似度,這時就需要更新B的相似度列表,將A添加進去,同時刪除視頻B原來的相似列表中相似度最低的視頻。每個新視頻的加入都可能影響很多已經計算過相似度的視頻,怎麼在短時間內快捷地查找出這類需要更新相似度列表的視頻也是一個挑戰。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上面兩節我們對相似視頻推薦的算法原理及難點做了比較細緻的講解。爲了實現該算法,克服這些難點,作者基於Erlang語言完美地解決了這些問題,在講怎麼利用Erlang語言從工程上實現上面的算法之前,我們先對Erlang語言做一個初略的介紹,方便讀者更好地理解Erlang語言的特性,這些特性使得Erlang非常適合解決該算法的架構和工程實現問題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.4 爲什麼要用Erlang語言開發","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.4.1 Erlang語言簡介","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Erlang是一種通用的面向併發的編程語言,它由瑞典電信設備製造商愛立信所轄的CS-Lab開發,目的是創造一種可以應對大規模併發事件的編程語言和運行環境。Erlang問世於1987年,經過十年的發展,於1998年發佈開源版本。Erlang是運行於虛擬機的解釋性語言。在編程範型上,Erlang屬於多重範型編程語言,涵蓋函數式、併發式及分佈式。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Erlang是一個結構化、動態類型編程語言,內建並行計算支持。最初是由愛立信專門爲通信應用設計的,比如控制交換機或者變換協議等,因此非常適合於構建分佈式、實時並行計算系統。使用Erlang編寫出的應用程序運行時通常由成千上萬個輕量級進程組成,並通過消息傳遞相互通訊。進程間上下文切換對於Erlang來說非常簡單,比起C程序的線程切換要高效得多。Erlang語言在電信行業使用得非常多,目前在國內的遊戲界、金融行業及雲計算行業也都有使用案例。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.4.2 Erlang語言的特性","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Erlang語言雖然開發於上世紀80年代,但是很多思想是非常超前的,在當前雲計算時代具有非常實用的價值,算是在多如牛毛的編程語言大軍中獨樹一幟。這些特性也是作者選擇利用Erlang語言開發相似視頻推薦系統的主要原因之一。下面我們列舉一些Erlang語言的主要特性:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(1) 函數式編程及部分語法特性","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Erlang是一個函數式編程語言,即可以將函數作爲參數傳入別的的函數,並且可以作爲函數的返回值。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Erlang語法也比較特殊,通過遞歸來實現迭代邏輯,沒有其他語言的while和for循環結構。Erlang的變量跟數學中類似,只能單次賦值,不可重複賦不同值。Erlang的模式匹配能力也非常強大。Erlang內置了很多數據類型及操作函數,輔助用戶更好地進行函數式編程。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(2) 併發模型","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Erlang是一個高併發語言,天生支持高併發,Erlang基於Actor的併發編程模型,進程間通信通過消息傳遞進行,高效、自然、可靠。Erlang語言將併發模式作爲自己的核心特性,非常方便構建分佈式處理邏輯,從一開始設計之初就充分利用多核處理器性能,非常適合在現代服務器上構建分佈式應用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(3) 跨平臺","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Erlang語言與java類似,採用虛擬機來解釋執行代碼,Erlang的beam虛擬機負責對代碼進行解釋執行,因此具備跨平臺特性,一次編譯到處運行。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(3) 錯誤處理","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Erlang是一個高容錯的編程框架,它對錯誤處理有兩個設計哲學:(1) 讓另外一個程序來解決錯誤;(2) 如果出錯就讓程序崩潰並重新啓動。第一個設計哲學將錯誤”外包“給另外一個專門的程序監控和處理,這樣原來的程序將核心放到處理邏輯上,這個監控程序可以放在另外一臺機器上,如果原來程序所在機器掛了,監控程序也可以發現問題。基於第二個設計哲學,既然處理邏輯和處理錯誤的程序分離了,如果處理邏輯的程序掛了(一般也是遇到偶發的情況或者傳入非法參數等原因掛掉),處理錯誤的程序就可以讓它重新啓動,這樣系統又可以正常運行了。這個設計哲學跟我們熟知的重啓可以解決90%以上問題不謀而合。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(4) OTP框架","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OTP 是整合在Erlang中的一組庫程序。OTP構成Erlang的行爲機制(behaviors),用於編寫服務器、有限狀態機、事件管理器。不僅如此,OTP的應用行爲(the application behavior)允許程序員把寫好的Erlang代碼打包成一個單獨的應用程序;監測行爲(the supervisor behavior )允許程序員創建樹狀結構的進程依賴鏈,使得某個進程死後,它的監控進程(父進程)會重新啓動它而復活。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OTP提供大量通用的庫程序,可以輕鬆創建具有高度容錯、熱切換等功能的高質量高效的程序。開發者可以通過OTP獲得如下好處:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"a 通用服務器、有限狀態機、事件管理器;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"b 標準化應用程序結構;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"c 代碼熱切換;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"d 監測樹行爲機制,讓你的進程永不”罷工“。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OTP是在Erlang之上構建系統平臺的標準方式。大型Erlang項目,如ejabberd, CouchDB等,都是基於OTP開發的。我們的視頻推薦系統也大量利用OTP的各種行爲機制,這樣只需要實現核心的接口,進程間的調用、監控這些能力,OTP的行爲機制很容易幫我們做到。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(5) 內嵌的Mnesia數據庫","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Mnesia是內嵌於Erlang的一款容錯的、分佈式可拓展的交易型數據庫,數據按照表來組織,類似於關係型數據庫,數據可以選擇存在內存或者磁盤中,並且有一套自己的非常方便的查詢語言,可以對數據進行方便快捷的讀寫查詢等操作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.4.3 爲什麼選擇Erlang語言來開發相似視頻推薦系統","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有了上面對Erlang語言的簡單介紹,我們在這裏簡單介紹一下該項目採用Erlang語言來開發的主要原因:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(1) Erlang語言有比較牛的互聯網應用","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大家耳熟能詳的互聯網軟件,如","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"CouchBase","attrs":{}},{"type":"text","text":"、","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"CouchDB","attrs":{}},{"type":"text","text":"(apache基金會上的一款文檔型數據庫,類似MongoDB)、","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"RabbitMQ","attrs":{}},{"type":"text","text":"(消息隊列中間件),還有基於XMPP協議的IM開源軟件","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"ejabberd","attrs":{}},{"type":"text","text":"等非常流行的軟件都是基於Erlang語言開發的,它們在工業界有大量應用案例。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外大家可能知道,在2014年左右,whatsApp背後只有50多位工程師支撐近5億的月活用戶規模,在當年被Facebook以190億美元收購。而該公司背後用到的編程語言正是Erlang(見參考文獻1)。當時,作者看到這個消息時是非常震驚的,對Erlang語言越發佩服了。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(2) Erlang語言的幾個特性非常適合該項目","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前面對Erlang語言的特性做了簡單描述,這些特性是構建相似視頻推薦系統的核心基礎,我們在這裏重點講一下對開發相似視頻推薦系統非常重要的一些特性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Erlang屏蔽了跨服務器交互的細節,內嵌了跨服務器訪問的rpc及網絡交換函數,跨服務器交互跟與本地交互基本一樣,所以非常適合開發分佈式程序,能夠快速擴展。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Erlang自帶非常多的數據處理函數,方便對Set、List、Map、字符串等結構的各類操作。Erlang包含ETS、DETS等key-value的分佈式數據結構以及嵌入式數據庫Mnesia,非常方便對數據進行讀寫等操作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Erlang的OTP框架和錯誤處理機制也是非常的強大,適合開發穩定高效的應用程序。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"正是Erlang語言有這些成功的軟件產品、優秀的應用案例及非常有意思的特性,讓作者對Erlang崇拜不已,躍躍欲試,最終決定採用Erlang來開發相似視頻推薦系統。對Erlang語言有興趣的讀者可以看參考文獻2,這是Erlang的作者寫的一本全面介紹Erlang編程的書,非常值得一讀。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.5 系統架構與工程實現","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前面對相似視頻的算法實現細節及Erlang的特性做了完整的介紹,在本節我們就來詳細講解怎麼基於Erlang的一些特性從工程上實現一個高效的、分佈式的Master/Slaver架構的相似視頻推薦系統。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先我們給出相似視頻推薦的架構圖(見下面圖3),再針對每個模塊詳細說明實現的細節。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/eb/ebe12bf2c8d6f1cedccb67cc5268e01d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖3:基於Erlang語言的相似視頻推薦架構圖","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外,整個項目的工程目錄如下圖,這裏簡單解釋一下:conf是配置文件相關目錄,doc是文檔相關目錄,ebin是編譯文件目錄,log是日誌目錄,Mnesia.helios@Platform-recommended-couchbase11是Mnesia數據存儲目錄,out是輸出相關目錄,RelevanceRecommend.* 、similarity_computing.app是工程啓動及配置相關文件,src是源碼。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6d/6d6f32b354e35c9335b85ab681e40af2.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖4:基於Erlang語言的相似視頻推薦工程目錄結構","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相似視頻推薦採用主流的Master/Slaver架構(Spark也是採用的該架構),主要包括Master、Slaver、Riak Cluster(推薦結果存儲,基於Erlang語言開發的開源組件)、Cowboy server(推薦web服務,基於Erlang語言開發的開源組件) 4個核心部分,其中Master、Slaver是整個相似視頻推薦算法的核心。Master主要負責任務的分配、跟Slaver保持聯繫、並且從MySQL中將metadata同步到Mnesia中,而Slaver主要負責相似度計算,計算完後將推薦結果插入Riak集羣中。我們在下面分別介紹這4個模塊的核心功能和實現細節。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.5.1 Master節點模塊與功能","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Master節點主要負責任務分派,數據同步,處理節點加入及退出等異常情況。Master包含4個主要組件,如上圖(圖3),各個組件的功能如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(1) data sync模塊","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該模塊負責將需要計算相似性的視頻從MySQL(媒資庫)同步到Slaver的Mnesia集羣中,Slaver計算時直接從本地Mnesia讀取數據來進行相似計算。由於需要參與計算的字段是較少的(媒資庫字段很多,我們只選擇同步對計算相似度有價值的字段),這裏我們採用Mnesia的內存存儲,將所有數據存在內存中,方便計算程序更快地從Mnesia讀取需要參與計算的視頻metadata,提升計算速度。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該模塊不光可以具備批量讀取MySQL所有數據的能力(項目第一次跑的時候需要全量計算),同時還可以實時監控媒資庫的變化,如有新視頻加入,馬上(在秒級內)將新視頻同步到Mnesia中。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(2) add node模塊","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當加入新計算節點時,通過重啓Master節點,Master節點與新節點建立聯繫,識別出新節點,並把新節點加入計算集羣,同時將Mnesia上的數據均勻分配到所有節點(包括新加入的節點)、給新節點分派計算任務(如果當前還有未計算過視頻相似度的視頻的話)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(3) heartbeat模塊","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Master節點定期(幾秒鐘,可以配置)向所有Slaver節點發送心跳信號,通過該信號探測Slaver是否活着,如果一段時間後Slaver無任何響應,Master會認爲該Slaver掛掉了,這時會將該掛掉的Slaver從計算列表中刪除,後續新的計算任務不再分配給該Slaver。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(4) Task allocation模塊","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Slaver節點上啓動多個(一般可以設置爲該服務器核數的1-2倍,如4核機器,可以啓動4-8個進程進行計算,有效利用多核計算能力)進行計算,等待Master節點來分配任務。Master節點定期跟Slaver節點通訊,輪詢各個Slaver節點,瞭解Slaver節點是否空閒,如有空閒並且現在還有未完成的計算任務,那麼Master將新的計算任務分配給該slaver進行計算。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"(5) similarity update模塊","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果新加入的視頻A比視頻B相似列表中相似度最低的視頻相似度更大,這時我們就需要更新視頻B的相似列表,將A添加進去,同時將B原來相似列表中相似度最低的視頻剔除掉(見下面圖5)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/65/6559382d5e5802bc144fbf1990eca7ba.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖5:如果新視頻相似度大於某個視頻的相似列表,需要更新相似列表","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那麼怎麼更新老視頻的相似推薦列表呢?一般來說,我們可以採用如下的方法,該方法也非常簡單,容易理解。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Master節點維護一張視頻id和它的相似列表最小相似度的表(見下表,表中的視頻都是已經計算完相似推薦的視頻)。新加入的視頻A計算完自身的相似視頻後(在計算topN相似度時,保存A跟其他每個視頻的相似度),將A視頻與每個其他視頻的相似度與下表比對,假設A與id_1的相似度大於s_1,那麼就需要更新id_1的相似推薦列表了。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/80/80616823309227f8cd3b33367a2add63.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fd/fdbc8743365116ee54fdf32b6ed96e8e.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實際上可以做很多簡化,比如我們可以先求出上表中第二列的最小值","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e8/e8afd8ac94a25fbee890858bd059134d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(見下面公式),我們只保留與A頻相似度大於","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e8/e8afd8ac94a25fbee890858bd059134d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"的視頻,其他視頻直接丟棄(其實很多視頻可以丟棄,畢竟很多視頻跟A是沒有任何相似度的),而不會影響更新計算邏輯,這些相似度大於","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e8/e8afd8ac94a25fbee890858bd059134d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"的視頻纔是可能需要更新推薦列表的視頻。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f6/f634bba6d8e8c3c8006e26813794a47f.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Master節點除了上面5個核心模塊外,還維護兩個關鍵的數據結構:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A :living_nodes:記錄集羣目前可用的Slaver節點,如有有節點加入或者掛掉會更新該數據。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"B: need_computing_id: 記錄哪些視頻還沒有計算相關推薦,針對這些未計算相似度的視頻在任務分配模塊中分派給Slaver節點計算,分配出去後將該視頻從待計算列表中刪除,避免重複計算。如果有新視頻加入,新視頻的id會寫入該列表。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.5.2 Slaver:主要負責計算任務","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Slaver節點只有一個核心模塊,即","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"計算模塊","attrs":{}},{"type":"text","text":",負責根據Master節點指派的任務進行相似計算。當Master將某個視頻的計算任務分配到Slaver時,Slaver從Mnesia讀取這個視頻的metadata信息,並計算該視頻與該視頻所在group(如電影組)的所有其他視頻的相似度,將相似度最大的TopN存到Riak集羣中。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏我們對核心的計算過程進行更詳細的講解,讓讀者知道作者是怎麼解決TopN最相似視頻的計算問題的。在下面圖6中每個Slaver節點中有4個worker(工作進程,負責進行相似計算),每個worker維護一個最大堆(最大堆中保留的元素個數就是我們需要計算的TopN的相似視頻數,最大堆結構是基於Erlang實現的一個獨立的模塊),最大堆負責保留最相似的N個視頻及其相似度。當Master將某個視頻A分配給worker1時,worker1先從Mnesia集羣中將所有與A在同一類別(如電影)中的所有視頻取出來,循環計算與A的相似度,計算完一個就丟給最大堆,當所有的視頻與A的相似度都計算完後,這些相似視頻都丟給了worker1維護的最大堆,根據最大堆的性質,最終最大堆中留下的視頻(及相似度)就是與A最相似的N個視頻了。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e5/e50c1c6b0b92849e6e593844d738c353.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖6:Slaver中進行相似計算過程與邏輯","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上面的最大堆是基於Erlang的ETS數據結構及相關操作構建的,它非常高效,也是我們整個計算引擎的核心子模塊,最大堆可以自動對丟入的{(videoid(視頻Id),similarity_score(相似度))}結構進行排序,保留最相似的N個。上面提到worker中包含的計算部分,即是基於我們在29.2節中的相似度公式進行計算的。當某個視頻最相似的TopN計算完成後,worker會將推薦結果插入Riak集羣,供前端接口(Cowboy Server)調用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏面也是有很多可優化點的,比如對於新聞或者體育等時效性很強的視頻,我們可以從Mnesia中選取一段時間內(比如過去3天)的視頻來計算相似度,這樣就不需要在計算相似度時跟Mnesia中同一組中的所有其他視頻都計算相似度,大大節省了計算時間。再比如,如果計算相似度中標籤的權重最大,我們計算視頻A與其他視頻相似度時,如果A與其他視頻的標籤相似度爲0(或者很小),我們就沒必要計算它的其他維度的相似度了,直接將該視頻丟棄計算下一個。這些優化點還很多,這裏不一一描述。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.5.3 Riak集羣:負責最終相似推薦結果的存儲","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Riak是基於Erlang語言開發的一個分佈式的key-value存儲系統,可以非常容易地水平擴展,非常適合大規模的數據存儲,是整個相似視頻推薦系統的推薦結果存儲模塊,所有視頻的相似推薦列表都存在Riak集羣中。Slaver的worker計算完一個視頻的相似度後會直接將推薦結果插入Riak。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.5.4 響應請求模塊:基於用戶請求,給出推薦結果","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當用戶在客戶端訪問某個視頻的詳情頁時,客戶端向服務端發送請求,請求響應模塊(Cowboy Server)根據用戶請求從Riak集羣中將該節目的相似列表取出來,並將需要的其它信息(如標題、演職員、海報圖等在前端展示需要用到的節目metadata信息,這些信息存放在Redis集羣中)填充完整後返回給用戶。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"請求響應模塊是基於Cowboy (一款基於Erlang開發的高性能輕量級接口服務器)來開發的。從前面的介紹中可以知道,Cowboy除了從Riak中獲取推薦列表外,還需要從Redis中獲取節目的metadata信息對節目進行填充。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.6 核心亮點","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"到此爲止,我們基本講完了相似視頻推薦的核心算法原理與基於Erlang的工程實現,該系統是作者在2015年開發的,一直在作者公司的兩個產品線中使用到現在,其中一個產品目前還是用的該算法(另外一個產品基於Spark平臺做了重構,跟其它推薦算法的技術架構保持一致),該算法在服務公司業務的5年中,雖然視頻種類和數量有了非常大的增長,但是系統一直比較穩定,也能夠很好地應對視頻量的增長,並且效果還不錯,這得益於Erlang良好的容錯機制及該系統較好的分佈式擴展能力。在這裏我們回顧一下該系統的亮點,讓讀者可以更好地理解它的價值。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.6.1 分佈式可拓展能力","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該系統採用Master/Slaver架構,可以通過水平地增加服務器來拓展系統的計算能力,同時當全量計算完後,後面就是增量計算了,增量計算相對沒有那麼大的計算量,不需要這麼多計算資源,我們可以縮減部分服務器節省開支。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.6.2 可以實時對新增加的視頻做計算","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Master的Data sync模塊近實時監控媒資庫MySQL,如果有新視頻加入,馬上將該視頻同步到Mnesia中,並分派給Slaver進行計算,在分鐘級內新視頻就可以完成計算,這樣基本可以有效避免新入庫視頻的相似推薦冷啓動問題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.6.3 系統很穩定","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該系統一般很少出問題,這得益於Erlang的OTP框架,該系統的Master和Slaver服務都是基於OTP框架來實現的,每個進程都有一個supervisor進程,當進程掛掉後,supervisor會重啓該進程,避免了因爲偶爾的故障或者異常數據導致的系統崩潰,最終讓我們的系統非常穩定。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.6.4 將計算過程解耦合,抽象爲最大堆來維護topN最相似的視頻列表","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將計算相似性計算抽象爲一個計算模塊,每個worker通過維護一個最大堆,可以非常有效地解決最相似的topN計算問題。同時,針對新聞、體育等時效性強的視頻類型,我們還可以只取最近一段時間內的視頻來計算相似度,進一步減少計算量。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.6.5 充分利用多核能力","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每個Slaver可以啓動多個worker進行計算,充分利用現代服務器多核的能力,大大加速了計算過程。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了上述優點外,還可以通過配置文件來定義各種參數(比如一個Slaver可以啓動多少個worker),可以方便地對參數進行調整,系統啓動時會首先解析這些參數。我們也可以一鍵啓動Master節點和所有的Slaver節點,整個配置和啓動過程跟Spark比較類似。該系統可以看成基於Erlang語言開發的具備特定功能的類Spark的小型分佈式計算平臺。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.7 未來優化方向","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然該系統有很多優點,但由於作者個人能力及精力有限,並且對Erlang的瞭解還沒有達到爐火純青的階段,該系統還有非常多的地方是可以做優化的。下面對可能的優化點進行羅列,給讀者提供一些新的思考的視角。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.7.1 算法本身的優化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該系統基於視頻內容的簡單向量空間模型來計算相似度,雖然算法原理簡單,但是由於視頻的metadata比較雜亂,相似性效果受到數據質量好壞的嚴重影響。同時,向量空間模型是一個比較簡單的模型,無法獲得更復雜的特徵表示。可行的優化點是,我們可以基於metadata數據或者用戶行爲數據做嵌入學習,爲每個視頻構建一個稠密的特徵向量表示,該系統可以通過稠密向量的相似度來計算視頻的相似性。其他各類計算視頻相似度的方法都是可以使用的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前我們的計算相似度算法和整個系統還是耦合比較緊密的,通過優化是可以將計算相似性做成可插拔的組件的,這樣就可以方便更換計算相似度的模塊。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外,我們的向量空間模型各個維度的權重是根據人工經驗自定義的,比較主觀,其實是可以","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"利用用戶點擊反饋機制自動化學習最優參數的,這樣可能效果會更好。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.7.2 調度策略的優化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當前該框架的調度策略還非常粗暴,對於每個需要計算相似推薦的視頻,直接從所有Slaver中先過濾出有空餘資源的worker,將任務分配給第一個空閒的worker。針對每個視頻都要從頭過濾一遍,效率很低。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"更好的方式是每個Slaver節點維護一個待計算相似度的固定長度的視頻隊列,當隊列中待計算的視頻都計算完了相似度,Slaver主動向Master申請待計算的視頻。這樣將主動權放到Slaver上,減少原來分配方案中毫無意義的輪詢,同時也減輕了Master節點的壓力。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.7.3 部署方式的優化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前的系統雖然部署非常容易,只要在每臺服務器上安裝Erlang,將該項目編譯好,將編譯後的工程代碼分發到每臺服務器上統一的目錄下,修改每臺服務器上的配置文件(實際上所有Slaver上配置是一樣的,跟Master上略有不同)就可以啓動了。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"更好的方式可以利用docker將工程構建在容器之上,利用mesos或者kubernetes等來管理工程運行,可以更好地做到集羣監控和資源彈性伸縮。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.7.4 錯誤監控與問題排查的優化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前該項目運行過程中會打少量的日誌記錄,但對於各個模塊中可能存在的錯誤信息並未捕獲並記錄下來,對於問題的發現和排查不是很友好。雖然Erlang很穩定,但是偶爾出一些問題是在所難免的,這一塊的優化也是可行的方向之一。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.7.5 數據同步的優化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前是由Master節點的data sync模塊直接監控MySQL(媒資庫),從中將數據同步到Mnesia集羣的。這一塊是可以直接採用消息隊列(如RabbitMQ)解耦的,data sync只需要監控消息隊列中某個topic是否有新節目進來,有新的話就同步到Mnesia中,這比直接監控MySQL高效得多。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"29.8 個人收穫與感悟","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"到此爲止,關於利用Erlang語言開發分佈式視頻相似推薦系統的介紹就講完了。在最後作者簡單說下自己做這個項目後的收穫和感悟。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"作者在2015年花了近半年時間,算是一邊學習Erlang語言(在這之前看過Erlang,但是看得不太懂)一邊開發了該項目。該項目一共5000行左右代碼,雖然不是很多,但是對於像Erlang這類語法簡潔的語言來說,也不算少(如果用Java實現,估計要幾萬行,還很難實現分佈式計算)。在整個開發過程中,最大的收穫有如下3點:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"l 新學習了一門比較有意思的函數式編程語言,對Erlang的特性有了比較深入的瞭解","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"l 對於分佈式計算有了更深刻的認識,這個項目相當於獨立實現了一個小型的分佈式計算引擎,對於深刻認識Spark、Hadoop的原理是非常有幫助的","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"l 獨立完成了一個較大的工程,並且實現了一個基於內容的相似視頻推薦系統","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"做完該項目對個人來說確實是非常有幫助的。但是從整個團隊來說,這樣做未必是好事,利用一個很小衆的語言來開發一整套系統,爲以後埋下了很大的隱患,如果人員離職,很難招聘到Erlang的開發人員,新人很難獨立維護這套系統,風險極大。作爲團隊管理者,應該避免這種情況發生,最好還是利用主流的技術棧,避免留下無窮後患。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當時作者管理經驗還不足、對技術的理解還不夠深入,所以自告奮勇的親自開發了該系統。經過這幾年的積累和成長,對技術和管理有了更深的感悟。如果作爲一名技術管理者,讓我再一次選擇,我可能不會用Erlang來開發該系統,而會採用Spark流式計算引擎來開發。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"參考文獻","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. ","attrs":{}},{"type":"link","attrs":{"href":"http://highscalability.com/blog/2014/3/31/how-whatsapp-grew-to-nearly-500-million-users-11000-cores-an.html","title":null,"type":null},"content":[{"type":"text","marks":[{"type":"underline","attrs":{}}],"text":"http://highscalability.com/blog/2014/3/31/how-whatsapp-grew-to-nearly-500-million-users-11000-cores-an.html","attrs":{}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Programming Erlang: Software for a Concurrent World (Second Edition) Joe Armstrong","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章