社羣編碼識別黑灰產攻擊實踐

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"導讀:","attrs":{}},{"type":"text","text":"所謂黑灰產,包含網絡黑產、灰產兩條產業鏈。隨着互聯網的飛速發展,網絡黑灰產也在不斷演變,當前已經趨於平臺化、專業化、精細化運作。基於黑灰產攻擊特點,我們提出了一種基於社羣編碼的黑灰產攻擊識別方法,社羣發現部分基於圖關係,編碼部分引入大規模的圖嵌入表示學習。相比於傳統的圖譜關係挖掘,該方法可以更好地識別和度量未知攻擊。我們還提出了基於異步準實時的工程化實現,對頻繁變化的黑灰產攻擊有更強的應變靈活性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"一、背景","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所謂黑灰產,包含網絡黑產、灰產兩條產業鏈。隨着互聯網的飛速發展,網絡黑灰產也在不斷發展,當前已經形成了一個平臺化、專業化、精細化,相互獨立又緊密協作的產業鏈。從近幾年多起重大網絡安全事故看,黑灰產已經不再侷限在半公開化的純攻擊模式,而是轉化成爲斂財工具和商業競爭的不良手段。據不完全統計,當前網絡黑灰產的市場規模已經超過千億元人民幣,在這千億級的市場規模下,發展出了非常多的細分領域,如木馬病毒、養號刷單、薅羊毛、電信詐騙、知識盜版、流量劫持等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"各互聯網平臺爲了防止被網絡黑灰產攻擊,開發了很多防禦及識別技術,最常被廣大用戶感知到的驗證碼技術就是其中之一。在驗證碼技術背後,還有非常多的識別方法,例如:通過規則引擎依據防攻擊規則進行分析攔截;通過行爲序列建模對單次請求進行黑灰產行爲判定;通過圖譜關係挖掘用戶之間的相關性以識別黑灰產團伙;基於層次聚類、均值聚類、高斯混合等聚類模型對黑灰產攻擊進行無監督識別。這些方法均能在不同程度上對黑灰產起到識別和防禦作用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於以上黑灰產攻擊特點,我們提出了一種基於社羣編碼的黑灰產攻擊識別方法,社羣發現部分基於圖關係,編碼部分引入了大規模的圖嵌入表示學習
方法,相比於已有的圖譜關係挖掘,可以更好地識別和度量未知攻擊。我們也提出了基於異步準實時的工程化算法實現,對頻繁變化的黑灰產攻擊有更強的應變靈活性。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"二、社羣結構","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於社羣編碼的黑灰產攻擊識別方法,在原有圖譜挖掘的基礎上,引入了大規模的圖嵌入表示學習技術,除了能挖掘出黑灰產本身的關聯關係,還能識別出潛在的黑灰產網絡結構,讓識別過程更加準確穩定。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"該方法基於的關聯圖有兩種,分別是同構圖和異構圖,這也是黑灰產挖掘過程中經常會遇到的關聯圖結構。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同構圖表示網絡圖中所有節點的類型都相同,如對於一個賬號關聯網絡而言,網絡圖中的節點都是用戶的賬號ID。異構圖則表示網絡圖中節點的類型可以不同,如在賬號關聯網絡中,網絡圖中的節點除了賬號ID外,可能還有IP地址、設備號、手機號等其他類型的節點。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下圖是同構圖和異構圖的示意,同構圖表示所有賬號ID組成的網絡結構,異構圖表示由賬號ID、設備ID、手機號和IP地址組成的網絡結構。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/41/41a5c1a517e6e57e589b340273f627a6.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於圖結構網絡,除了邊關係,節點自身也會有很多固有屬性,如對於一個賬號ID的UGC場景,會有不同的活躍時間,不同的業務場景(如瀏覽
圖文,瀏覽視頻,發表圖文等),不同的操作類型(如發文、評論,點贊,轉發等)。這類節點自身屬性如下圖中表格所示。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c0/c014b591eeb7122df3851f979ca3ad9f.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了便於說明一般情況,後文說明中默認全是異構圖結構,因爲同構圖作爲異構圖的一種特殊情況,即使是實際推廣中是同構圖,也不影響使用異構圖的方法進行分析。實際場景中圖網絡關係如下圖例所示。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b8/b8cf2a8817f347cd9e9754f40266f911.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了識別出關聯結構圖中的社羣,目前已經有比較多的識別方法,常用的有基於節點的統計特徵,基於節點出入度的分佈變化,基於關聯邊的自定義權重,人工標註等方法,此類方法能識別很多關聯社羣,但是由於圖譜關聯中難以定義邊的權重,會存在較多誤召,所以我們在實踐中基於已有社羣挖掘結果進行編碼以提升黑灰產識別效果,同時嵌入式圖編碼還可以基於節點的鄰居關係進行相似性的無監督學習。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"三、圖嵌入式編碼","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖嵌入式編碼是一種將節點編碼成向量的node2vec方法,我們採用的圖嵌入式編碼方法爲斯坦福大學William L. 
Hamilton、Rex Ying和Jure Leskovec等人在2017年提出的GraphSAGE。與node2vec相比,node2vec屬於直推式(transductive)方法,直接爲訓練時見過的每個節點學習並保存一個向量;GraphSAGE則屬於歸納式(inductive)方法,學習的是由節點特徵和鄰居信息生成向量的聚合函數,兩者都是節點級別的嵌入。GraphSAGE同時利用節點特徵信息和結構信息得到Graph Embedding的映射,相比之前保存映射結果的方法,GraphSAGE保存了生成embedding的映射,因此可以爲訓練時未出現的節點生成表示,可擴展性更強,在節點分類和鏈接預測問題上的表現也更加突出。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"GraphSAGE算法流程包含三個步驟:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1)對圖中每個節點的鄰居節點進行採樣。因爲每個節點的度不一致,所以爲了計算高效,爲每個節點採樣固定數量的鄰居。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2)根據聚合函數聚合鄰居頂點蘊含的信息。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(3)得到圖中各頂點的向量表示供下游任務使用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下圖爲該算法作者在論文中提供的採樣和聚合示意圖。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7d/7d105200b86d11b3c7a9e0bb7c3e624c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用下圖所示的方法進行節點採樣。每一層的node由上一層生成,與本層無關,如此,1層的賬號ID 1已經聚合了0層設備ID 1和手機號2的信息,在二層,手機號2再聚合IP地址1的信息,經過兩層採樣,就可以擴展到賬號ID 1的2階鄰居包含設備ID 1、手機號1、設備ID 2、手機號2、賬號ID 
2和IP地址1的所有信息。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f2/f223f3a39b5d8ca256df41315ee6e38f.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"採樣過程中固定採樣層數(本實踐使用2層)和每層採樣的節點數(如鄰居節點數上限爲200個),可以控制每次採樣過程對內存的消耗和運算耗時。該方法適用於大規模數據集,對大數據集下的黑灰產社羣挖掘非常有效。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們採用的聚合函數爲均值聚合,直接對目標節點和所有鄰居embedding的每個維度取平均,再做非線性變換,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"原論文中相應函數表達式如下:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/61/618dba31f5c4ed984356d50d2da76ea8.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其主要思想是將目標頂點和鄰居頂點的第k−1層向量放在一起,對每個維度求均值,再將得到的結果做一次非線性變換,產生目標頂點的第k層表示向量。不同的聚合函數計算方法不同,除均值聚合函數外,還有池化聚合器、LSTM聚合器等可選。經過測試,對於黑灰產社羣挖掘而言,不同聚合函數的效果差異並不明顯。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如此,經過以上對黑灰產社羣的採樣與鄰居聚合,可以得到每一個節點在網絡圖上的向量表示。如上面採樣圖示中賬號ID 1的向量就包含了設備ID 1、手機號1、設備ID 
2、手機號2、賬號ID 2和IP地址1的網絡結構信息,同時包含了這些ID在不同時間點、不同業務場景、不同操作類型上的行爲特徵。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"四、模型訓練","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於上面的採樣和聚合函數,開始進行參數學習,GraphSAGE不同的損失函數代表了不同的參數學習方法,如下所示損失函數就是一種無監督的損失,傾向於使得相鄰的頂點有相似的表示,相互遠離的頂點的表示差異變大。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/06/0637dbab6179f6f16eeb32728d3a52da.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上式表示節點 u 和隨機遊走到的鄰居節點 v 有相似的embedding表示,而與經過負採樣得到的不相鄰節點 vn 
有不相似的embedding表示。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於無監督損失學習到的節點embedding,可繼續供下游任務使用,本實踐就是採用的該方法。當然,對於特定分類任務,也可以使用特定的損失函數,如使用交叉熵進行分類預測。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們通過節點的統計特徵,人工標註確定了正負樣本,使用節點的編碼向量作爲特徵進行分類模型訓練,下圖所示即爲部分數據的挖掘結果可視化。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/41/41124da61d2fc1d067bb7247a8617d64.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"直觀而言,對於社羣團伙的挖掘還是比較合理的,總體分爲三類,紅色爲一社羣,黃色爲一社羣,其餘(綠色)自動歸爲一類。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"五、工程化實現","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了在實際應用中發揮編碼的價值,我們提出了一種異步準實時的黑灰產識別方案。下圖表示了這種識別方案的流程結構。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/de/de366bddf7c09afc310706a7fce5913a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content"
:[{"type":"text","text":"一個用戶開始請求客戶端,可以採集到用戶的關鍵因子信息,如賬號ID、IP地址、設備號、手機號等,將這部分日誌信息寫入暫存區,暫存區存儲着所有在過去10分鐘(也可以是其他某個時間段,一般而言,日誌量越大,暫存區時間越短,反之越長)內請求過該客戶端的用戶的關鍵因子信息。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"超過10分鐘的關鍵信息則存入離線的分區日誌庫,基於分區日誌庫進行圖譜構建,黑灰產社羣挖掘,社羣編碼,以及使用向量表示訓練分類模型,分類模型可以不用實時訓練,定期使用過去一段時間的分區日誌訓練即可。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於還在暫存區的關鍵因子,會對請求的用戶進行實時圖譜構建,以該用戶爲中心進行節點採樣並做向量表示,使用已經訓練好的分類模型對該用戶的表示向量進行黑灰產預測,如果預測爲正常用戶,則允許用戶在客戶端上的操作,如果預測該用戶爲黑灰產用戶,則拒絕該用戶進行客戶端操作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"下圖是訓練部分較詳細的過程:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/eb/eb2f059db86f6ca578fa1963e9228b12.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"下圖是預測部分的較詳細過程:","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7f/7f77b28b1e3e9edb71f943d03457d208.webp","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"六、創新點","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":nul
l,"origin":null},"content":[{"type":"text","text":"本實踐提出的一種基於社羣編碼的黑灰產攻擊識別方法,主要創新技術點包括:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於社羣編碼對黑灰產進行有監督識別的方法,相比於既有圖譜挖掘算法,該方法不直接依賴於單個節點屬性,而是將整個社羣的關聯結構編碼到一個表示向量中,對黑灰產的表示更加準確,而且對於歷史上未出現的黑灰產賬號,也能通過網絡結構之間的相似性,通過向量表示進行識別。同時可以有效避免因爲噪聲關係(如黑灰產賬號連接了商場wifi)導致的錯誤識別。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"由於圖編碼結果本身就具備鄰居之間的強相似性,非鄰居之間的弱相似性。故使用編碼後的向量進行無監督學習(如密度聚類和層次聚類),也可以識別出部分黑灰產在IP、賬號ID、手機號及設備號之間的內在關係,挖掘出相似黑灰產組成的聚類簇。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用暫存區將大規模的圖嵌入表示學習方法與小數據集的異步預測結合在一起,並使用編碼後的有監督模型進行快速預測,爲實際工程化應用提供了參考方法。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"七、部分實踐效果","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"搭建的因子編碼模型包含有ip(IP地址)、cookie_id(cookie信息)、device_id(設備ID)、userid(用戶ID)、mobile(手機號)等共計5個關鍵因子,編碼的特徵包含 scence(業務場景), page(頁面), 
risk(風險程度)等信息(編碼特徵需依據具體場景而定)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"按10分鐘爲一個暫存區窗口,獲取的因子數量爲百萬量級,編碼的關係數量近千萬,編碼向量的長度在300左右。對embedding結果使用有監督方式例行化產出IP維度的風險,經人工校驗並使用其他策略交叉校驗,該策略準確率能達到95%左右。無監督模型在實踐中效果較差,沒有實際使用。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"八、發展與思考","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基於以上實踐發現,社羣編碼方法在黑灰產識別中具有較好的正向效果,但在實際生產過程中,仍然面臨一些問題,以下作簡要探討。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"計算問題:","attrs":{}},{"type":"text","text":"圖計算非常消耗計算資源,而且完整過程中需要短時間進行大量預測,當前結合mini 
batch訓練預測,在獨立的GPU環境中,完整計算一個日誌分區約需要半小時。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"關鍵因子和編碼特徵的選擇問題:","attrs":{}},{"type":"text","text":"圖算法的結果非常依賴於圖結構和特徵屬性,這爲前期的因子選擇和特徵工程帶來巨大挑戰。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"識別過程的自動化問題:","attrs":{}},{"type":"text","text":"GraphSAGE算法本身具備無監督識別的能力,但黑灰產識別過程中發現這非常容易受到中心節點的影響(這可以通過權重優化得到一定程度解決),故當前仍然需要提前做部分社羣定義,而該過程需要人工或其他建模方法介入,難以實現完全自動化。","attrs":{}}]}]}