字節跳動在聯邦學習領域的探索及實踐

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"數據是人工智能時代的石油,但是由於監管法規和商業機密等因素限制,\"數據孤島\"現象越來越明顯。聯邦學習(Federated Learning)是一種新的機器學習範式,它讓多個參與者可以在不泄露明文數據的前提下,用多方的數據共同訓練模型,實現數據可用不可見。"},{"type":"text","marks":[{"type":"strong"}],"text":"本文,InfoQ 經授權整理了字節跳動聯邦學習系統架構師解浚源近期在火山引擎智能增長技術專場的演講("},{"type":"text","text":"火山引擎是字節跳動旗下的數字服務與智能科技及品牌"},{"type":"text","marks":[{"type":"strong"}],"text":"),其分享了聯邦學習在廣告投放和金融等場景中的應用模式、算法研究、軟件系統及實踐經驗。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大家好,我是解浚源,今天分享的題目是《聯邦學習原理與實踐》,共分爲 6 個部分:聯邦學習簡介、應用場景、基礎算法、隱私保護、Fedlearner 聯邦學習系統以及我們的下一步工作。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"聯邦學習簡介"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,我們簡單介紹聯邦學習的定義。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大數據是機器學習的石油,但數據孤島問題普遍存在。由於用戶隱私、商業機密、法律法規監管等原因,各機構無法將數據整合在一起,用來訓練一個效果更好的大模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/84\/84f54f1293bd3bea8aba917afbdf65f6.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"聯邦學習是一種爲了解決數據孤島問題而提出的機器學習算法,目標是實現私有數據、共享模型。例如現在有三個參與方,每個參與方擁有一個私有集羣和數據,這些參與方想共同訓練一個模型,聯邦學習就可以解決該問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d1\/d1a9e0fef2507a0f7d89db220cc42015.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在聯邦學習的模式下,可以由一箇中央服務器首先將參數發送給每個參與方,然後每個參與方依據自己的私有數據更新模型,模型更新後再將梯度彙總發送至中央服務器,由服務器更新模型,然後開始下一個循環。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2e\/2e4112377d02aed78b495e1a3e776e16.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過這樣的方式,各參與方可以在不互相透露原始數據的情況下訓練一個共享參數的模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/35\/352689430be2772a0fe9c23e0d184d59.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"常見的聯邦學習範式有縱向聯邦學習和橫向聯邦學習兩種。"},{"type":"text","text":"縱向聯邦學習有兩個參與方,各自擁有同一條樣本的不同特徵,比如一個參與方擁有用戶瀏覽歷史,另一個參與方擁有購買歷史。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在這種情況下,我們可以在兩個集羣各跑一部分模型,通過跨集羣的方式交換中間結果,來達到訓練一個模型的效果,這與機器學習中模型並行的訓練方式類似。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/08\/086f1288efaf6d3650f73223e4d1b0bc.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"橫向聯邦學習是兩個參與方擁有不同樣本的相同特徵,比如兩個參與方都擁有用戶的年齡、性別等,但是用戶並不相同。在這種模式下,每個參與方都可以擁有整個模型,但是各自用不同的數據更新模型,最終彙總模型的梯度來訓練模型,這與分佈式機器學習中的模型數據並行訓練方式類似。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果探究聯邦學習的歷史,其經歷了大概 3 到 5 年的發展。起初是 2015 年,Privacy-Preserving Deep Learning 這樣的概念被提出,而後谷歌的 McMahan 提出若干深度學習方面的訓練和應用模式。2018 年,騰訊發佈聯邦學習白皮書。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"究其本質,聯邦學習最重要的就是保護數據的可用而不可見,也就是數據的隱私保護,其研究有如下方面:一是基於差分隱私的數據保護;二是基於祕密共享的加密計算方法;三是基於同態加密的加密計算方法。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"聯邦學習的應用場景"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下圖,第一個場景是聯邦學習在深度轉化廣告投放領域的應用。在廣告投放場景下,媒體側的流程是用戶發起請求,媒體通過模型預測用戶最可能感興趣的廣告,並將它展示給用戶,用戶一旦點擊廣告就會跳到一個落地頁,這個落地頁會導向廣告主側的購物網站。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/99\/99888c313dad75c01678ea3fc53f983c.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對廣告主而言,在這個過程中發生的深度事件爲用戶是否轉化。以電商場景爲例,轉化指的是用戶購買了產品,而未轉化就是指用戶沒有購買行爲,廣告主會將轉化事件記錄到數據庫裏面,媒體側也會把這些信息記錄到數據庫裏面。在該領域的傳統做法是廣告主將標籤返回到媒體這一側,然後媒體組合數據和標籤用以訓練模型,使用該模型知道投放優化效果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/00\/00a31848c6ec63e246d20f89edeabdfe.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在這個場景下,媒體和廣告主分別擁有點擊樣本的不同信息,比如媒體側擁有用戶的特徵、年齡、性別,上下文特徵(用戶點擊該廣告前後看了哪些文章,點擊發生的時間及用戶所處位置);雙方共有的信息是廣告相關的特徵,比如廣告圖片、標題等;廣告主擁有的是用戶歷史特徵,比如用戶以前在該廣告主處購買的商品,以及商品更細節的特徵(商品價格、商品評論),廣告主不會將這些信息同步到媒體側。最後是深度事件,用戶是否確實發生購買行爲還是僅將商品添加至購物車。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/78\/782c2646dea90aabe2f60bc65418da2e.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果應用聯邦學習對該場景進行優化,在線部分保持不變,但是用戶的每個點擊需要附加 request_id,這就唯一標識了用戶的一次點擊,並在媒體側和廣告主側共用一個 ID,唯一標記這一次請求。廣告主和媒體分別將 request_id 存到數據庫中。離線訓練時,媒體側可以找到該條數據輸入模型,最後將數據的 request_id 和輸出的中間結果一起發送給廣告主。廣告主拿到 request_id 後就可以找到其對應的 label,然後用其計算樣本的轉化效果,再用該結果反向傳播計算出梯度,最後將梯度發回媒體側,兩邊分別用該梯度來更新模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二個場景是金融信用場景。在該場景下,不同的金融機構希望可以綜合多方數據提高對用戶信用判斷的準確度。如果各方擁有不同用戶的相同特徵,這樣就可以採用橫向聯邦的方式。例如,不同的銀行分別向不同的用戶發放了信用卡貸款,要想建立一個更好的用戶信用評估模型,多方就可以用各自擁有的不同用戶特徵,採用橫向聯邦的方式建立一個模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7c\/7cda414074aff6204aab16717023401c.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一種情況是雙方擁有相同客戶的不同特徵,這樣就可以採用加密的縱向聯邦方式。例如,一個銀行和一個信貸機構分別擁有相同用戶的不同特徵,比如銀行知道用戶的存款信息,信貸機構知道用戶的貸款信息,這樣就可以綜合訓練出對用戶的信用評估。考慮到金融場景的習慣和數據特點,一般是採用樹模型進行建模,基於樹模型的較著名的聯邦學習算法是 SecureBoost,可以用多方數據在可用不可見的情況下進行加密的樹模型訓練。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"聯邦學習的基礎算法"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在縱向聯邦學習中,如果數據由線上請求產生,雙方在存儲該請求時可能出現丟失和順序不一致的情況,這就需要訓練前雙方對齊數據,比如前面提到的深度轉化廣告投放場景,用戶的點擊數據在媒體側和廣告主側是分別存儲、分別落盤的,雙方的落盤時間可能不一致,順序也有可能由於雙方的處理方式而打亂,這樣就會產生一種對應關係,比如 request_id 0 存放在廣告主的第一個位置,而 request_id 3 在媒體側處於第一個位置。在這種情況下,我們需要把數據進行對齊,排除掉其中一方沒有的數據。流式數據求交算法可以解決該問題,刪掉對方沒有的數據,把共有數據按照統一順序排序。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ed\/ed587e56e3e82ab8d4e7981825d8d573.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了實現該功能,我們實現了分佈式的流式數據求交算法。該算法中,一方作爲 leader,另外一方作爲 follower,leader 將數據按照自己的存儲順序將 request_id 順序發送給 follower,follower 用自己的 request_id 和 leader 的 request_id 進行求交,求交結束按照 leader 的 request_id 順序生成 DataBlocks 數據塊,最後將生成的數據塊發送給 leader,leader 按照數據塊進行排序,並刪除缺失數據,最後在兩邊形成相同對應的數據塊。一個數據塊在兩方各有一半,在這個對應的數據塊裏,數據嚴格按照一致的順序排序。需要提到是在流式數據求交的算法裏面,只能使用類似於 request_id 這種不泄露用戶隱私的隨機數 ID 作爲主鍵求交,如果是類似於用戶的手機號這種敏感數據,就不能使用這種方式來求交。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a5\/a5eb001e38c46f7daa2d89dd441742ac.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了處理大規模數據,我們實現了一個基於分佈式的流式求交算法,雙方各自拉起一個 master 和 N 個 worker,worker 之間一一配對,配對後的兩個 worker,其中一個作爲 leader,另一個作爲 follower,然後在一個分片上計算數據求交,從分佈式文件系統上讀取數據和寫入結果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/04\/043fe60f030e077d2cda769a2cd892f1.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上文提到流式數據求交只能用來處理非敏感的求交主鍵,但有時雙方之間沒有直接的跳轉關係,所以不能用隨機數關聯雙方的數據,比如在金融場景下,可能兩個金融機構需要求交雙方的共有用戶,這種時候就需要用到 Private Set Intersection 求交方式。簡單的哈希加密是無法在這種情況下保證數據安全的,如果把一個手機號的哈希值傳給對方,雖然這不是明文的手機號,但是由於這個手機號的總體空間是有限的,所以另一方可以使用窮舉的方法來破解。作爲替代,我們使用 PSI 的方式,使用 RSA+Hash 雙層加密可以有效避免撞庫破解,保證求交之後的雙方可以互相知道共同擁有的用戶集,如果一方獨有另外一方的用戶,求交之後也不會泄露給對方。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7b\/7b97b5137bbf1c6ec1ea3c143c9f6c44.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"具體來說,PSI 數據求交的方式需要 A 首先把自己的用戶 ID 進行加盲,乘以隨機數做加密,然後發送給 B,B 對加盲過的 ID 進行 RSA 加密的簽名,把簽名過後的數據發送回客戶。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/11\/1113e2fa39bfd81ab0a80b8cca5e1cd2.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在這個過程中,B 無法得知客戶 ID,因爲進行了加盲處理,當然也無法解盲,但是 A 可以在加了密的 ID 上進行去盲,得到有 RSA 簽名過的 ID,再在上面套一層哈希存到數據庫裏面。由於私鑰只有 B 擁有,所以 A 得到了加密的哈希 ID 也無法進行僞造。另一方面,B 將自己的 ID 進行 RSA 簽名加密,然後再哈希,並將哈希和加密過後的 ID 發送給 A,A 用這個哈希加密的 ID 可以和之前自己進行加密哈希的 ID 做匹配求交。這一過程中,A 和 B 都無法得知對方的原始 ID,同時也無法僞造一個 ID 來和對方求交,雙方都可以控制用來求交的 ID 數據總量。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"聯邦學習中的隱私保護"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在縱向聯邦學習中,有一方把 Label 泄露給另一方的風險,因爲擁有 Label 的一方需要向另外一方發送每個樣本的梯度。但是當正負樣本不均衡的時候,負例的相關梯度會遠遠小於正例相關的梯度,因爲正例很稀少,所以當正例出現的時候,模型對它的加權會比較大,它對梯度的影響也會比較大,接收方就可以根據梯度的範數判斷是屬於正樣本還是負樣本,最簡單的方式是把梯度取一個範數然後排序,把超過一個閾值的所有樣本認爲是正例,否則就是負例。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/32\/329bbbd98eaf430f095145e0e2f2491b.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了降低泄露風險,我們選擇在梯度上疊加高斯噪聲,通過最小化加入噪聲的正負例之間的分佈差異減少泄露。如下圖,G+ 和 G- 分別是正負例對應的梯度,我們在其上加上正例和負例用的噪聲,得到加過噪聲的梯度,然後用 KL 散度最小化正例和負例之間的分佈差異,同時爲了避免噪聲過大導致機器學習的效果過多受損,我們增加一個約束,分別計算正例和負例的相關性矩陣,然後約束它們的噪聲大小在 P 的範圍之內,這個 P 是一個可調的參數。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/65\/659287bfbd46eb8567d866acf1926d4a.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過實驗驗證,我們可以在小幅損失 AUC 的情況下大幅降低泄露 Label 的機率。上圖左邊是在不同 P 值的情況下模型的 AUC,右邊是接收梯度那一方通過梯度來判斷正負例的準確度。我們可以看到藍線是完全不進行噪聲添加的訓練場景,接收方可以 95% 的準確率判斷 Label 是正例還是負例。其他的幾條線是分別添加了不同程度的噪聲,我們可以看到只要略微添加一些噪聲,就可以大幅降低 Label 泄露的準確率到 50% 和 55% 中間,這就相當於泄露的機率非常低了,已經在隨機猜測附近了,模型的 AUC 值下降 1% 左右。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時在縱向聯邦學習中,一方還需要傳 Embedding 給另一方,這也存在一些信息泄露的風險,有兩種方法可以保護 Embedding 不泄露信息:一是採用同態加密或者密鑰共享的方式加密傳輸 Embedding,來保證對方不能得到明文的 Embedding 信息;另一種方法是採用差分隱私對抗訓練等方式,想辦法降低 Embedding 中的信息含量,儘量減少隱私泄露。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/95\/9596b10b7a2cc227b113a6c716c4ba37.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Fedlearner 聯邦學習系統"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來介紹我們開發的 Fedlearner 聯邦學習系統,目前該系統也已經在火山引擎有了大規模的To B應用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了在客戶側部署一個我們的聯邦學習系統,我們將整個系統都集成在了 Kubernetes 裏面,其最底層是 NFS 掛載層,提供分佈式的文件存儲。之上使用 ElasticSearch、FileBeat 和 Spark 等來源處理日誌和數據流。在此之上,我們實現了爲聯邦學習定製的任務資源管理調度器,以及用來查詢任務信息的 ApiServer 和聯邦學習鏡像。這些基礎設施完成後,我們就可以拉起聯邦學習的任務了,我們支持多種任務,包括數據分片、數據求交、NN 模型、神經網絡模型和樹模型訓練。在這之上,我們開發了一個可視化的 WebConsole 界面方便算法工程師操作,所有雙方通信都經過 Ingress-Nginx 的接入層來進行加密的跨公網通信。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e5\/e5b13115957e8df7b6e93ccafab167d1.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"聯邦學習的過程需要參與雙方經過公網來傳輸數據,爲了安全的經過公網通信,我們採用了 HTTPS 雙向加密認證的方式來保證通信雙方的身份可靠,而信息是加密並不可以被第三方監聽的。同時,爲了方便內部聯邦學習任務的邏輯開發,我們採用了 Ingress-Nginx 來做透明加減密和域名轉發。這樣在 Ingress-Nginx 接入層後面的聯邦學習任務可以使用普通的 gRPC 調用,透明的將信息傳遞到另外一個集羣。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/75\/759827e5f1bfa984d13194b7cbef810d.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"聯邦學習任務的一大特點是需要雙方同時各自拉起一個任務,配對以後進行通信才能開始學習任務。落地的過程中,我們發現雙方要同時提交一個任務,而同時進行操作的溝通成本非常高,需要線下經過多輪協調,尤其是調優模型參數,或者是排查問題時,雙方需要同時在線的溝通成本是很高的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/76\/7699de4dc66c3dcdd113ffd9325e8961.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了簡化這個過程,我們開發了一套基於 Ticket 的預授權機制,使用這個機制的時候分爲主動方和被動方,主動方可以創建一個 Ticket,被動方同時也創建一個 Ticket,Ticket 裏面包含的信息主要是任務可以使用的數據和模型等,但不規定模型具體使用的超參數,比如學習率這些不敏感的超參數。一旦雙方 Ticket 建立好,主動方就可以發起一個任務,被動方在接收到任務請求以後,系統就會自動響應拉起任務,這樣雙方可以各自拉起一個任務,這個過程可以重複很多次,每次拉起的任務可以使用不同的參數。這樣,被動方在創建了一個 Ticket 以後,就可以不用進行操作了,只要在多輪訓練結束以後查看一下結果就可以了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6b\/6b2f5a014067330da10572ff1abd1278.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如上這些操作都可以通過 WebConsole 進行可視化,任務信息也可以經過可視化的展示,可以在網頁上查看任務的效果、日誌、圖表等信息。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/40\/40839ebe76db0349657b4ff60e96730e.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"未來規劃"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後來聊我們下一步的發展方向。在落地過程中,我們總結出了遇到的三個痛點:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/73\/733dbf22ab4f62c3c6567c7a16597a5b.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一是 Ticket 預授權機制允許自動拉起單個任務,但是聯邦的流程通常需要多個任務的聯動,比如上傳、求交、訓練均需要雙方人工的參與,所以我們下一個版本會開發基於 Workflow 的授權,Workflow 就是一個包含多任務的工作流,這多個任務會自動輪換完成調度;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"二是合作方需要先自行完成特徵工程,並且以特定格式灌入系統。這對於客戶的特徵工程能力要求是比較高的,客戶往往只能使用一些有限的特徵。爲解決該問題,我們會增加基於 Spark 的流式歸因和特徵抽取能,Fedlearner 系統可以集成式讀取用戶原始數據,然後自動化抽取特徵,並輸入到最終的模型系統裏面;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"三是隨着合作方的增加,一對一的聯邦維護成本不斷提高。爲了降低成本,我們會探索多方聯邦,同一個行業的多個合作方可以同時訓練一個模型,這樣可以減少模型的數量,降低人工維護成本。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/43\/4303bf6ca58aa177ccd02f6e687ec0d9.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先具體介紹 Workflow 授權,我們在下一個版本會重新設計 WebConsole 的 2.0 版本,WebConsole2.0 版本支持圖形化配置工作流,可以看到工作流中包含多個任務,任務之間可以有各種依賴關係,每個任務可以有一些參數配置,用戶只要配置好工作流,就可以一鍵運行多個任務。另一個解決方案是集成化的特徵工程會基於 Spark 開發一個特徵抽取系統,該系統包含歸因模塊、特徵抽取、統計模塊和模型訓練模塊。同時還包含在線服務的功能,用戶在抽取完特徵後,可以在線服務中抽取同樣的特徵,輸入到在線服務的模型中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/e5\/e5ae777c222d361f54aec0fc2207a664.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後是多方聯邦,多方聯邦有兩種方式:橫向 + 縱向;縱向 + 縱向。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/62\/62641cc0242aedd1aa2a7a39077b577f.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"橫向 + 縱向指不同參與方擁有部分重合數據的不同特徵,可以共同訓練一個模型;縱向 + 縱向指不同客戶擁有不同的特徵維度,可以採用三方縱向聯邦的方式綜合同一個用戶的更多維度信息進行訓練。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以上是本次分享的全部內容,感謝大家。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"嘉賓介紹:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"解浚源,字節跳動聯邦學習系統架構師。華盛頓大學計算機博士,開源深度學習框架 MXNet 主要開發者和維護者之一。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"火山引擎是字節跳動旗下的數字服務與智能科技品牌,基於公司服務數億用戶的大數據、人工智能和基礎服務等技術能力,爲企業提供系統化的全鏈路解決方案,助力企業務實地創新,給企業帶來持續、快速增長。火山引擎圍繞數據智能、視覺智能、語音智能、智能應用、多媒體技術和雲原生等六大方向,面向企業級市場推出了數十款技術產品與服務,從開發、應用到運營,滿足不同類型企業在生命週期不同階段業務發展的核心需求。"}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章