The Knowledge a Recommendation Algorithm Engineer Needs (Part 13)

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Foreword:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hi everyone, I'm Qiang Ge (強哥), a tech enthusiast who loves sharing. I have 12 years of experience on big data and AI projects and 10 years of research and practice in recommender systems. In my spare time I enjoy reading, power walking, and writing.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Outside of work I focus on writing about big data and AI. So far I have published about 400,000 characters of articles in my recommender systems series, and the book 「構建企業級推薦系統:算法、工程實現與案例分析」 (Building Enterprise-Grade Recommender Systems: Algorithms, Engineering Implementation, and Case Studies) is due out at the end of June this year. If these articles help you get started quickly and advance your career, I will be delighted.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}
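,{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Drawing on my own learning path and nearly 10 years of hands-on experience with big data and recommender systems, this chapter explains the knowledge a recommendation algorithm engineer needs. I hope it can serve as a study guide both for students who want to work on recommendation algorithms after graduating and for experienced professionals planning to switch into the field, helping readers quickly grasp the core knowledge as a whole.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Below, we start from a typical recommender system business workflow to introduce the kinds of knowledge a recommender system requires, and then describe each area in detail.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1 The Business Workflow of a Recommender System","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 1 below shows one feasible business flow for a recommender system. A user accesses the recommendation service through a client device such as a phone, and the client calls the recommender system's web service API (possibly accelerated by a CDN, with a web server such as Nginx acting as a reverse proxy). The API fetches the user's recommendations from the recommendation result store, assembles them into a suitable data format, and returns them to the user. On the other side, the user's behavior on the client is collected by a log collection system into the big data platform and loaded into the data warehouse through ETL. We then build recommendation models to generate recommendations for users, and write the results into the recommendation store through a message pipeline component such as Kafka.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We use this figure to explain which technologies recommender systems rely on and which topics need to be learned. Of course, working on recommendation algorithms at a company does not necessarily mean touching every part of the figure below. At a startup you probably will, because a startup lacks the resources to hire specialists for every module, so one person usually does the work of several and covers more ground. At a large company, where the division of labor is finer, you may only work on one part. Either way, a good understanding of all the modules goes a long way toward forming a global view of recommender systems.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/51/5100b09fc6fd25587a63b60a5b105aa4.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The knowledge points in the figure above fall into three groups: ","attrs":{}},
{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"basic skills, core skills, and supplementary skills","attrs":{}},{"type":"text","text":". Recommendation algorithm engineers also generally fall into two types, algorithm-oriented and ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"engineering-oriented","attrs":{}},{"type":"text","text":". ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Algorithm-oriented","attrs":{}},{"type":"text","text":" engineers mainly design an efficient, workable algorithm based on the product's characteristics and the available data, and may also implement it, while engineering-oriented engineers mainly write the code for recommendation modules and develop the supporting modules.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Algorithm-oriented engineers need a good mathematical foundation and solid machine learning theory, ideally with relevant academic experience. Engineering-oriented engineers need strong programming skills and familiarity with software architecture design, object-oriented thinking, and design patterns, ideally with experience building larger engineering projects.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The core skills of a recommendation algorithm engineer are mainly machine learning techniques, recommendation algorithm theory, and the engineering implementation of recommendation algorithms. Mathematics, programming, data structures and algorithms, databases, big data knowledge, and English reading ability are basic skills. Product UI and interaction, network protocols, web services, CDNs, and data exchange protocols are supplementary skills.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"bgcolor","attrs":{"color":"#FC8F99","name":"red"}},{"type":"strong","attrs":{}}],"text":"To get started as a recommendation algorithm engineer, you need to learn the basic and core skills: algorithm fundamentals, machine learning techniques, and the common recommendation algorithms must be mastered. For completeness, however, I have listed every knowledge point that recommender systems involve. Readers can study the non-essential ones selectively and in stages.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Below we give a fairly comprehensive overview of the technologies and knowledge points involved in recommender systems as a study reference. Readers can study them selectively and in stages according to the basic, core, and supplementary skill groups just described.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2 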
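Mathematical Foundations","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Mathematics is the foundation of all natural sciences. No natural science (or even the humanities) has developed without contributions from mathematics, and it has even been said that the maturity of a discipline is positively correlated with how deeply it uses mathematics.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Learning recommendation algorithms well requires some mathematical background, specifically an understanding of the areas below. With a solid mathematical foundation you can study recommendation algorithms in more depth and understand their principles more thoroughly.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In my view, mastering three university courses, advanced mathematics (calculus), linear algebra, and probability and statistics, is enough to cover the mathematics that recommender systems require.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Discrete mathematics, a required course in computer science departments, is very helpful for understanding computer architecture and many machine learning algorithms. You will need it if you want deeper expertise in recommendation, but beginners need not spend much time on it early on.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1 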
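Advanced Mathematics","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Calculus is the core of advanced mathematics. Modern science and technology owe much to its invention, which allowed advanced mathematics to be applied widely across engineering and greatly advanced the natural sciences and engineering disciplines. Machine learning sits at the intersection of computer science and mathematics, so it naturally depends on advanced mathematics too.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In fact, a recommendation model (like the vast majority of machine learning models) ultimately reduces to an optimization problem. Put simply, an optimization problem asks for the extremum of a function, and numerical optimization techniques are used to find the model's optimal parameters. Commonly used tools include maximum likelihood estimation, which defines the objective, and gradient descent, which optimizes it.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We need to understand the properties of activation functions in deep learning and of the objective functions of machine learning models, and we need to compute gradients to iterate step by step toward the optimum. All of this involves calculus.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, the time and space complexity of algorithms (for example, merge sort has O(n log n) time complexity) is described with order-of-growth notation, which builds on the way calculus compares the orders of infinitesimal and infinitely large quantities.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The advanced mathematics we need to master mainly covers the basic properties of elementary functions, limits, integration, differentiation, finding extrema, and infinitesimals.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2 Linear Algebra","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Matrix operations are a concise and efficient form of mathematical computation. A system of linear equations is easy to describe with matrices (Ax=b, where A is the coefficient matrix, b is the vector of constants, and x is the vector of unknowns). Many machine learning algorithms draw on matrix theory, such as singular value decomposition and dimensionality reduction methods. Matrix operations parallelize very well on GPUs, FPGAs (Field Programmable Gate 
Array), and other modern chip architectures.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The best-known matrix-based algorithm in recommender systems is matrix factorization, and in deep learning, passing information from one layer to the next is essentially matrix multiplication. Computing cosine similarity also relies on the inner product of vectors.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We need to master the basics: bases, matrix and vector operations, solving linear equations, orthogonality, eigenvalues, and eigenvectors.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.3 Probability and Statistics","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The samples used to train a model can be viewed as random draws from a random variable that follows some probability distribution. From this viewpoint, a recommendation algorithm can be treated as a probability estimation problem. Many machine learning problems can be interpreted probabilistically, with the parameters finally estimated by maximum likelihood.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Many recommendation algorithms can be modeled probabilistically. The naive Bayes method is a simple example of using probability to make recommendations. We can also treat recommendation as a binary classification problem: whether a user likes an item is modeled as a probability whose value indicates how much the user likes it, so logistic regression can be used for recommendation. Bayesian estimation is another common probabilistic method that is widely used in recommender systems, for example in topic models.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We need to master the following: what probability is, how to compute probabilities, the relationship between frequency and probability, common distributions, Bayes' formula, maximum likelihood estimation, prior estimation, probability density functions, mean, variance, samples, sampling, confidence levels, and confidence intervals.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.4 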
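Discrete Mathematics","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Discrete mathematics is a required undergraduate course for computer science majors. It covers set theory, graph theory, algebraic structures, combinatorics, and mathematical logic.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Computer arithmetic is essentially Boolean algebra, with every computation carried out on binary numbers. A deep learning neural network is really a directed graph, and finding the shortest route to a destination for a driver on a ride-hailing service such as DiDi is a shortest-path problem on a graph. The curse of dimensionality in machine learning is a form of combinatorial explosion.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Understanding this material helps you better understand computer architecture and the principles behind related algorithms.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3 Machine Learning","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Recommender systems are a branch of machine learning that mainly addresses the problem of recommending items to a massive number of users, and recommendation can be framed as a supervised learning problem. All kinds of machine learning algorithms can be applied in recommender systems, such as regression, clustering, singular value decomposition, deep learning, reinforcement learning, and transfer learning.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A deep understanding of traditional machine learning algorithms is very helpful for learning recommender systems well and for understanding recommendation algorithms in depth. You need a good command of the common clustering, classification, regression, and ensemble learning methods. Many basic algorithms, such as logistic regression, can be used directly for recommendation.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"You should also understand the basic concepts of machine learning, such as training, test, and validation sets, model training, model inference, feature engineering, and model evaluation. These concepts come up whenever you build a recommendation model.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4 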
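Recommender Systems","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A recommendation algorithm engineer obviously needs to understand recommendation algorithms. First, you need to know that a recommender system is a technical means of dealing with information overload, and to know which scenarios call for recommendation algorithms and which do not, what challenges recommendation algorithms face, and where they are applied in industry.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The algorithms most commonly used in recommender systems are content-based recommendation and collaborative filtering (both user-based and item-based). You should understand these two families well, be able to explain how they work, and be able to sketch how to implement them. You also need to know how to evaluate a recommendation algorithm: which metrics measure its quality, how those metrics are computed, and how to handle the cold-start problem. All of these topics are covered in later chapters of this book.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ideally, you should be able to implement these algorithms on your own using open datasets and third-party open source machine learning frameworks, which will deepen your understanding.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"5 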
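Programming Skills","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Besides designing algorithms, a recommendation algorithm engineer may need to put them into practice and implement them. Even when you use an existing framework for recommendation, you still have to write code for data processing, model training, and model inference. So a recommendation algorithm engineer needs a certain programming foundation.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Java is among the most widely used programming languages in industry and has a very mature ecosystem. The upstream data processing in a recommender system depends on big data technology, which is largely built on the Java ecosystem (or on Scala, which runs on the JVM). Knowing Java/Scala development therefore helps you get up to speed quickly with the various open source big data technologies.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the third wave of artificial intelligence driven by deep learning, more and more deep learning frameworks have appeared, such as TensorFlow, PyTorch, and MXNet. These frameworks mostly expose a Python interface to users (the underlying implementation is usually written in C++), which has indirectly fueled Python's popularity. Python is a fairly old language with a rich ecosystem, it is easy to learn, and it has very mature data processing and analysis libraries as well as the popular machine learning framework scikit-learn.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For a recommendation algorithm engineer, being familiar with Java/Scala and Python is basically enough.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"6 Data Structures and Algorithms","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The previous section noted that recommendation work requires programming skills. Any kind of programming involves data structures and algorithms to some degree, and implementing recommendation algorithms typically draws on them heavily. We therefore need to know the common data structures, such as sets, lists, hash tables, and linked lists, and we certainly need to master the common sorting algorithms. We should also have some understanding of time and space complexity. More advanced topics such as Bloom filters, compression algorithms, and encryption algorithms are worth knowing as well: understand what problems they solve so that you can learn them quickly from reference material when needed.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"7 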
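Engineering Skills","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Implementing recommendation algorithms also raises many engineering questions: which data processing platform to use, which programming language to use, where to store the recommendation results, and how to deliver them to users. Each of these needs a sound engineering implementation.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As the user base grows, data volumes grow with it, and processing data and training recommendation models take longer and longer. How to process large-scale data efficiently and compute concurrently is a thorny problem everyone faces.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Does the user experience latency when opening the recommendation page? Can the page come back blank (a request to the recommendation page returns no results)? How do we handle blank results, shorten response times, and raise the concurrency the recommendation web service can handle? Improving these requires engineering knowledge and industry experience.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How to design an efficient set of recommendation algorithm components that makes the whole team more productive and makes it easier to ship recommendation algorithms in real products, and how to balance accuracy, efficiency, and computational complexity, is also part of the philosophy of engineering implementation.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In short, you need enough practical engineering experience to design a recommendation algorithm system that is efficient, easy to use, and valuable to the business. Chapter 18 of this book is devoted to the engineering implementation of recommender systems, and readers can refer to it.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"8 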
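Open Source Big Data Technologies","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A recommender system is a systems engineering effort. As Figure 1 above shows, building a stable and effective recommender system is quite complex and involves a lot of knowledge. A toC (consumer-facing) internet product is a business built on a large user base. A good toC product necessarily serves a large number of users, whose behavior generates massive amounts of data, and this is where big data technologies come in.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Fortunately, with the development of the internet and information technology, the popularity of open source, and the growth of open source communities, many excellent open source big data and AI frameworks have emerged, such as Hadoop, Spark, Flink, TensorFlow, and PyTorch. These frameworks are the cornerstones of industrial-grade recommender systems. Below I briefly introduce some of the open source technologies that recommender systems use so that readers can become familiar with them. Building on these technologies makes it much easier to put a recommender system together.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"8.1 Data Collection Systems","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Building recommendation models depends on user behavior data and other kinds of data, which originate from users' actions on the client. We therefore need to “transport” these operation logs to the data center, and this process is data collection.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Components commonly used in the big data ecosystem to collect and transport data include Flume and Kafka. Once all the required data has been collected and stored in the data center, we can process it, train on it, and build recommendation models. Chapter 17 of this book covers data collection in more detail, and readers can refer to it.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"8.2 Data Storage Systems
","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the data has been collected as described above, we need to store it. Internet companies have so much data that a single server usually cannot hold it, so distributed storage is needed, and this is where Hadoop's HDFS distributed file system comes in. HDFS scales horizontally, supports common file operations such as reading data, and keeps multiple replicas of each data block, so losing a single server does not lose data, which makes it highly reliable.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For data analysis we need to store, retrieve, and process data efficiently, so we usually store data in the Parquet format. Parquet is a columnar storage format in the Hadoop ecosystem that works across the ecosystem's analysis components, data models, and programming languages. Parquet also compresses data well, which can greatly reduce storage consumption.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, as a company's data and business grow, we analyze data along more and more dimensions, and it becomes necessary to build a complete data warehouse. The main components the big data community uses for this are Hive and HBase. Hive is a structured data storage component built around the relational query language SQL. It stores structured data as tables and is queried with SQL, which makes it well suited to batch-style analysis. If you need to analyze and process data in real time, you can store it in HBase, a column-oriented data store.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"8.3 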
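Data Analysis Systems","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Starting in 2003, Google published three landmark papers (see references 1, 2, and 3), and big data gradually grew from its beginnings into a thriving field. The most important big data technology to come out of this was the Hadoop project, launched in 2006. At its core, Hadoop includes two components, HDFS and MapReduce. HDFS stores massive amounts of data: it lets you build a distributed cluster from inexpensive servers, makes it easy to store large volumes of data, and provides good fault tolerance. MapReduce is a data analysis component that runs on top of HDFS. After more than a decade of development, a complete big data ecosystem has formed around Hadoop, and it was this ecosystem that set off the big data wave.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data analysis software that appeared later, such as Spark and Flink, built on Hadoop and extended what big data analysis can do, and its growth has in turn strengthened the whole big data ecosystem. Spark and Flink offer a large number of operators as well as machine learning libraries (Spark's MLlib library, for example, includes the ALS recommendation algorithm), which make it easier to build all kinds of recommendation models.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Recommender systems depend on many other big data technologies as well, such as the scheduler Azkaban, the interactive query tool Hue, and the OLAP (online analytical processing) systems Impala and Presto. Readers can consult books on big data technology. This chapter does not go into depth.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"9 Other Supporting Technologies","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Beyond the skills covered above, we also need some understanding of the topics below. Some are indispensable for building a complete recommender system, and others are foundational capabilities that keep the recommendation service running well.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"9.1 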
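Databases","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In a recommender system architecture, the recommendations generated for users need to be stored in a database so that the web service can fetch them and return them to users. The databases used in industry fall into two main categories, relational databases and NoSQL databases.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Relational databases were the first to be used at large scale and hold a very important place in the history of the internet. Companies of all kinds use them as their most critical data stores (for transaction data, user registration information, and so on). Their defining feature is that data is stored in rows and columns, like a two-dimensional spreadsheet, and a great deal of real-world data can be abstracted into this tabular form. Data in these tables is manipulated (created, deleted, updated, and queried) with SQL, which is easy to learn and very efficient. Popular open source relational databases include MySQL and PostgreSQL.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Although recommender systems do not use a relational database directly as the final store for recommendation results, information about the recommended items and about users is usually kept in one. A recommendation algorithm engineer needs to be familiar with at least one relational database and proficient in SQL.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A recommender system has to compute recommendations for every user every day (or even every minute or every second). With a large user base, writing these results to the database is a very frequent read/write workload, for which a relational database is a poor fit, and this is where NoSQL comes in. A key-value NoSQL store is well suited to holding users' recommendations: the key is the user ID and the value is that user's recommendation list. Popular NoSQL databases such as Couchbase and Redis are suitable for storing recommendation results. Their reads and writes are very efficient, and they scale horizontally. The author's company stores its recommendation results in these two NoSQL databases.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"9.2 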
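Operating Systems","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Outside the Microsoft ecosystem, the infrastructure of the internet industry is largely built on Linux. Task scheduling, task monitoring, and the like for recommender systems all run on Linux servers, so a recommendation algorithm engineer needs to be familiar with Linux. You must be comfortable with common operations involving disks, memory, CPU cores, processes, networking, the file and directory structure, and basic commands.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"9.3 Networking","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Recommendation results are stored in a database, and when a user accesses the recommendation service they must be fetched from it. This involves transferring data over the network, so you need some understanding of network latency and data transmission. Data transfer also follows network protocols, so you should know something about protocols such as HTTP, HTTPS, and TCP. To speed up the delivery of recommendations and improve the user experience, internet companies generally use a CDN to accelerate user requests, so some knowledge of CDN technology is needed as well.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"9.4 Common Data Exchange Protocols on the Internet","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Readers should be familiar with common data exchange and serialization formats such as JSON, XML, Protobuf, and Avro. JSON in particular is highly readable, and many internet companies use it as their data exchange format, especially in data APIs.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"9.5 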
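Web Services","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As Figure 1 above shows, users obtain recommendation data through the web service module, whose job is to fetch the user's recommendations from the recommendation result database, assemble them into a suitable format, and return them to the user.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Commonly used web service components include Tomcat for Java, Gin and Beego for Go, and Flask for Python. If your work involves developing APIs for the recommendation business, you need to know this area well. Otherwise a general awareness is enough.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"9.6 A/B Testing and Metrics Systems","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As mentioned earlier, a recommendation algorithm is improved through gradual iteration. We need to build a complete set of metrics based on the company's business scenarios and set up an easy-to-use A/B testing platform to evaluate the quality of recommendation algorithms and their value to the business. Through continuous iteration, we steer the algorithms toward driving the company's business forward.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A recommendation algorithm engineer encounters both of these regularly at work, so it is both a duty and a necessity to understand them. Because they lean toward the business side, beginners only need to be aware of them in advance. Chapters 16 and 19 of this book introduce the recommendation metrics system and the implementation of an A/B testing platform, respectively, and readers can study them in detail later.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"10 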
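Product and Interaction","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The product is the vehicle through which a recommender system delivers its value. Users obtain recommendations by using the recommendation module in the product, so how the recommender system interacts with users and whether the experience is convenient and smooth both affect its final performance. A good UI and interaction design often generates more value than a good algorithm.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A recommendation algorithm engineer needs some understanding of UI presentation and interaction logic. It need not be deep, but knowing the basics helps you understand the recommendation business better and design algorithm logic that fits a particular UI interaction. Chapters 22 and 23 of this book analyze recommendation products and UI interaction in depth, respectively.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"11 English Reading Skills","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At present, the better books and learning materials on recommender systems, machine learning, and other computing topics still come from abroad. When you hit a hard problem you cannot solve on your own, you will also need to search Google for solutions. Most good open source projects originate abroad, and their reference material is in English. The professional papers you consult in everyday study are mostly in English too. So to keep improving your abilities, you need to be able to read original English materials and books.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The hard part of reading English is usually the technical vocabulary. My suggestion is to try reading the English version first and look up the words you do not know. Once you have read and understood three or more English reference books, you will basically be able to read English literature in the computing field.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"That covers the knowledge a recommendation algorithm engineer needs. The table below organizes the relevant knowledge points and some good learning resources for readers' reference.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/81/814ebe72880a15ed803e783919d67417.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,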
"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/56/560351ea385e0770bf2191df74fae8cb.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/06/0613d196e254a6999e7b61cf06814c2b.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0d/0d851d122ce8e6318d5dcd2e6d7e8a42.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}