實現大規模圖計算的算法思路

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2017年我以深度學習研究員的身份加入Hulu,研究領域包括了圖神經網絡及NLP中的知識圖譜推理,其中我們在大規模圖神經網絡計算方向的工作發表在ICLR2020主會上,題目是——Dynamically Pruned Message Passing Networks for Large-Scale Knowledge Graph Reasoning。本次分享的話題會沿着這個方向,重點和大家探討一下並列出一些可以降低大規模圖計算複雜度的思路。"}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"圖神經網絡簡單介紹"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2f\/2f54c3af36f5caad73b40d9d1757b04b.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"圖神經網絡使用的圖"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖神經網絡這幾年特別火爆,無論學術界還是業界,大家都在考慮用圖神經網絡。正因爲圖神經網絡的應用面很廣,所用的圖各種各樣都有,簡單分類如下:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"① 根據圖與樣本的關係"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"全局圖:所有樣本共用一個大圖"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"比如有一個大而全的知識圖譜,所做任務的每一個樣本都共用這個知識圖譜,使用來自這個知識圖譜的一部分信息。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"實例圖:以每個樣本爲中心構建的圖"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每個輸入的樣本自帶一個圖,比如要考慮一張圖片中所有物體之間的關係,這可以構成一個物體間關係圖。換一張圖片後,就是另一張關係圖。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"② 根據邊的連接密度"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"完全圖"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"稀疏圖"}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"圖神經網絡與傳統神經網絡的聯繫"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"神經網絡原本就是圖,我們大多隻是提到“權重”和“層”,再細粒度一點,會講到“單元”(即units)。但是,有圖就有節點和邊的概念,就看你怎麼定義這個節點。在BERT網絡結構中,輸入是一個文本序列, 預處理成一串代表word或sub-word的tokens,我們可以把這些tokens看成是圖中的nodes,這樣BERT變成了一個完全圖上的圖神經網絡,而且BERT網絡結構的每層可以對應到圖神經網絡的一次message passing迭代。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"圖神經網絡與傳統神經網絡的區別"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"傳統神經網絡有多個層的概念,每一層用的都是不同的參數;圖神經網絡只有一個圖,圖中計算通過多步迭代完成節點間的消息傳遞和節點狀態更新。這種迭代式的計算,有點類似神經網絡的多個層,但是迭代中使用的是同一套權重參數,這點又像單層的RNN。當然,如果不嫌複雜,你可以堆疊多個圖,下層圖向上層圖提供輸入,讓圖神經網絡有“層”的概念。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外,圖神經網絡中的nodes與傳統神經網絡中的units不同。圖神經網絡中的nodes是有狀態的(stateful),不像傳統神經網絡中的units,當一層計算完輸出給下一層後,這層units的生命就結束了。Nodes的狀態表示爲一個向量,在下次迭代時會更新。此外,你也可以考慮爲edges和global定義它們的狀態。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"圖神經網絡的計算框架"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"① 初始步"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"初始化每個節點的狀態向量(可以包括各條邊和全局的狀態)"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"② 消息傳遞(message-passing)迭代步:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"計算節點到節點的消息向量"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"計算節點到節點的(多頭)注意力分佈"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對節點收到的消息進行彙總計算"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"更新每個節點的狀態向量(可以包括各條邊和全局的狀態)"}]}]}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"圖神經網絡的計算複雜度"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a9\/a9ed99e9663d2ce3a9ee76cbcad223cb.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖片計算複雜度主要分爲空間複雜度和時間複雜度。我們使用PyTorch或者TensorFlow進行神經網絡訓練或預測時,會遇到各種具體的複雜度,比如會有模型參數規模的複雜度,還有計算中產生中間tensors大小的複雜度,以及一次前向計算中需保存tensors個數的複雜度。我們訓練神經網絡時,它做前向計算的過程中,由於梯度反向傳播的需要,前面層計算出的中間tensors要保留。但在預測階段,不需要梯度反向傳播,可以不保留中間產生的tensors,這會大大降低空間上的開銷。物理層面,我們現在用的GPU,一張卡的顯存頂到天也就24G,這個尺寸還是有限的,但是實際中遇到的很多圖都非常之大。另外,就是時間複雜度了。下面,我們用T表示一次圖計算中的迭代個數,B表示輸入樣本的批大小(batch size),|V|表示節點個數,|E|表示邊個數,D,D1,D2表示表徵向量的維數。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"空間複雜度"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"模型參數規模"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"計算中間產生tensors規模(此時有B>=1, T=1)"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"計算中間保留tensors規模(此時有B>=1, T>=1)"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"時間複雜度"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"計算所需浮點數規模(此時考慮D1, D2)"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總結複雜度的計算公式,不外乎如下的形式:"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"降低圖神經網絡計算複雜度的幾點思路"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"思路一:避開|E|"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通常情況下,圖中邊的個數遠大於節點的數量。極端情況下,當邊的密度很高直至完全圖時,圖的複雜度可以達到|V|(|V|-1)\/2。如果考慮兩個節點間雙向的邊,以及節點到自身的特殊邊,那麼這個複雜度就是|V|2。爲了降低計算的複雜度,一個思路就是儘量避開圍繞邊的計算。具體來說,爲了讓計算複雜度從|E|級別降低爲|V|級別,在計算消息向量(message vectors)時,我們僅計算 destination-independent messages。也就是說,從節點u發出的所有消息使用同一個向量,這樣複雜度從邊數級別降爲了節點數級別。值得注意的是,這裏會存在一個問題,消息向量裏不區分不同的destination節點。那麼,能否把不同的destination節點考慮進來呢?當然可以,不過需要引入multi-head attention機制。下面針對這種情況來介紹一下優化方案。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"適合情形"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當|E|>>|V|時,即邊密度高的圖,尤其是完全圖"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"優化方案"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d5\/d5b86efb651a825b76eff2e479945573.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"思路二:減少D"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"順着思路一,我們在計算attention時,每個attention分數都是一個標量。我們可以減小計算attention所用的向量維數,因爲輸出是一個標量,信息被壓縮到一維空間,所以計算時沒必要使用大向量來提高capacity。如果需要multi-head的話,可以把每個計算channel的向量維數變小,讓它們加起來還等於原來的總維數。這個思路很像BERT,BERT雖然不是GNN,但是這種機制可以運用到GNN中。還有一篇論文,提出了Graph Attention Networks,也用到了類似的思路。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"適合情形"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"引入attention mechanism的multi-head channels設計"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"優化方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每個head channel 的消息計算使用較小的hidden dimensions, 通過增加head的數量來保證模型的capacity,而每個head的attention 分數在一個節點上僅僅是一個標量。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8c\/8ce00f720ba3abf70a0d4a87ee5a6abb.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"思路三:部分迭代更新(選擇性減少T)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"前面的思路是減少邊數量以及計算維度數,我們還可以減少迭代次數T,這樣中間需保留tensors的規模就會變小,適合非常大的網絡,尤其當網絡節點刻畫的時間跨度很大,或者異構網絡的不同節點需要不同頻次或不同階段下的更新。有些節點不需要迭代更新那麼多次,迭代兩、三次就夠了,有些節點要更新好多次纔行。下圖的右側部分,每步迭代節點都更新;左側部分,節點只更新一次,即使這樣,它的計算依賴鏈條還是有四層。至於更新策略,可以人爲設定,比如說,採取隨機抽樣方式,或者通過學習得到哪些節點需更新的更新策略。更新策略的數學實現,可以採取hard gate的方式(注意不是soft),也可以採取sparse attention即選擇top-K節點的方式。有paper基於損失函數設計criteria去選擇更新的節點,如果某個節點的當前輸出對最終損失函數的貢獻已經很好了,就不再更新。需要注意的是,在hard gate和sparse attention的代碼實現中,不能簡單地把要略過的節點的權重置零,雖然數學上等價,但是CPU或GPU還是要計算的,所以代碼中需要實現稀疏性計算,來減少每次更新所載入的tensor規模。更新的粒度可以是逐點的,也可以是逐塊的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/3b\/3bccca53340bcfcba39a7a9b93abe349.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"適合情形"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"具有大時間跨度或異構的網絡,其節點需不同頻次或不同階段下的更新"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"優化方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"更新策略一:預先設定每步更新節點"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"更新策略二:隨機抽樣每步更新節點"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"更新策略三:每步每節點通過hard gate的開關決定是否更新"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"更新策略四:每步通過sparse attention機制選擇top-K節點進行更新"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"更新策略五:根據設定的criteria選擇更新節點(如:非shortcut支路上梯度趨零)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/74\/7448afe00ea4668965488d8d91f877db.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"思路四:Baking(“烘焙”,即使用臨時memory存放某些計算結果)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Baking這個名字,是我引用計算機3D遊戲設計中的一個名詞,來對深度學習中一種常見的技巧起的名字。當某些數據的計算複雜度很高時,我們可以提前算好它,後面需要時就直接拿來。這些數據通常需要一個臨時的記憶模塊來存儲。大時間跨度的早期計算節點,或者異構網絡的一些非重要節點,我們假定它們對當前計算的作用只是參考性的、非決定性的,並設計它們只參與前向計算,不參與梯度的反向傳播,此時我們可以使用記憶模塊保存這些算好的數據。記憶模塊的設計,最簡單的就是一組向量,每個向量爲一個記憶槽(slot),訪問過程可以是嚴格的索引匹配,或者採用soft attention機制。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"適合情形"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大時間跨度的早期計算節點或者異構網絡的一些非重要節點(只參與前向計算,不參與梯度的反向傳播)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"優化方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"維護一個記憶緩存,保存歷史計算的某些節點狀態向量,對緩存的訪問可以是嚴格索引匹配,也可以使用soft attention機制。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/25\/2530d2bc832b6ff12d2d9f56fa9ceb67.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"思路五:Distillation(蒸餾技術)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"蒸餾技術的應用非常普遍。蒸餾的思想就是用層數更小的網絡來代替較重的大型網絡。實際上,所有神經網絡的蒸餾思路都類似,只不過在圖神經網絡裏,要考慮如何把一個重型網絡壓縮成小網絡的具體細節,包括要增加什麼樣的loss來訓練。這裏,要明白蒸餾的目的不是僅僅爲了學習到一個小網絡,而是要讓學習出的小網絡可以很好地反映所給的重型網絡。小網絡相當於重型網絡在低維空間的一個投影。實際上,用一個小的參數空間去錨定重型網絡的中間層features,基於hidden層或者attention層做對齊,儘量讓小網絡在某些中間層上產生與重型網絡相對接近的features。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"適合情形"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對已訓練好的重型網絡進行維度壓縮、層壓縮或稀疏性壓縮,讓中間層的feature space表達更緊湊。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"優化方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Distillation Loss的設計方案:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hidden-based loss"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Attention-based loss"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/bf\/bfb0abb03c65ba2ebf7b62babfd5bb02.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"思路六:Partition (or clustering)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果圖非常非常大,那該怎麼辦?只能採取圖分割(graph partition)的方法了。我們可以借用傳統的圖分割或節點聚類算法,但是這些算法大多很耗時,故不能採取過於複雜的圖分割或節點聚類算法。分割過程要注意執行分割算法所用的節點數據,最好不要直接在節點hidden features上做分割或聚類計算,這是因爲只有hidden features相似的nodes纔會聚到一起,可能存在某些相關但hidden features不接近的節點需要放在一個組裏。我們可以將hidden features做非線性轉換到某個分割語義下的空間,這個非線性轉換是帶參的,需要訓練,即分割或聚類過程是學習得到的。每個分割後的組,組內直接進行節點到節點的消息傳遞,組間消息傳遞時先對一組節點做池化(pooling)計算,得到一個反映整個組的狀態向量,再通過這個向量與其他組的節點做消息傳遞。另外的關鍵一點是如何通過最終的損失函數來訓練分割或聚類計算中的可訓參數。我們可以把節點對組的成員關係(membership)引入到計算流程中,使得反向傳播時可以獲得相應的梯度信息。當然,如果不想這麼複雜,你可以提前對圖做分割, 然後進行消息傳遞。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"適合情形"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對非常大的圖(尤其是完全圖)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"優化方案"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對圖做快速分割處理,劃分節點成組,然後在組內進行節點到節點的消息傳遞,在組間進行組到節點、或組到組的消息傳遞。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"① Transformation step"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Project hidden features onto the partition-oriented space"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"② Partitioning step"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"③ Group-pooling step"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Compute group node states"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"④ Message-passing step"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Compute messages from within-group neighbors"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Compute messages from the current group node"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Compute messages from other group nodes"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/31\/31ba4d2e580adcaa53bb8197816b9140.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"思路七:稀疏圖計算"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如何利用好稀疏圖把複雜度降下來?你不能把稀疏圖當作dense矩陣來處理,並用Tensorflow或PyTorch做普通tensors間的計算,這是沒有效果的。你必須維護一個索引列表,而且這個索引列表支持快速的sort、unique、join等操作。舉個例子,你需要維護一份索引列表如下圖,第一列代表batch中每個sample的index,第二列代表source node的id。當用節點狀態向量計算消息向量時, 需要此索引列表與邊列表edgelist做join,把destination node的id引進來,完成節點狀態向量到邊向量的轉換,然後你可以在邊向量上做一些計算,如經過一兩層的小神經網絡,得到邊上的消息向量。得到消息向量後,對destination node做sort和unique操作。聯想稀疏矩陣的乘法計算,類似上述的過程,可以分成兩步,第一步是在非零元素上進行element-wise乘操作,第二步是在列上做加操作。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c0\/c05f12b1b79d449d806430ea91e21ba6.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"適合情形"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當|E|<
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章