JIT in MegEngine

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Background"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"What Is MegEngine"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Megvii's MegEngine (天元) is a deep learning framework that covers both training and inference. On the training side, networks are usually built in Python; on the inference side, for product performance reasons, MegEngine is usually integrated through C++. In both cases, MegEngine is responsible for running training and inference code on a variety of compute backends. The backends it currently supports include CPU, GPU, ARM, and several domain-specific accelerators, covering cloud, edge, and chip scenarios."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MegEngine has three main features:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Unified training and inference: both training and inference tasks can be handled by the single MegEngine framework."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Dynamic and static graphs combined: MegEngine supports both dynamic and static graphs, and converting between them is straightforward."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. 
High-performance support across multiple platforms."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/b8\/6b\/b8c65254305905d2932a0700a00d3c6b.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 1. MegEngine architecture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As Figure 1 shows, MegEngine provides both Python and C++ interfaces. Its graph representation is divided into dynamic and static graphs. The operator-layer components include the automatic differentiator, graph optimization, and graph compilation. The runtime module covers memory management and compute scheduling; memory management includes static and dynamic memory management as well as sublinear memory optimization. The compute-kernel layer contains all the compute backends MegEngine supports, and we will open-source more backends in the future. In addition, MegEngine includes a high-performance heterogeneous communication library, which is typically used in multi-machine, multi-GPU scenarios."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/9c\/95\/9cc210d1fce9eyy263202dc96c34d095.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 2. 
Computation graph"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dynamic and static graphs are complementary concepts. In dynamic-graph mode there is no notion of a computation graph, whereas in static-graph mode MegEngine maintains one. Figure 2 shows how MegEngine represents a computation graph: circles denote operators and triangles denote inputs. In MegEngine, switching from a dynamic graph to a static graph takes a single statement, as the following code shows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Converting between dynamic and static graphs"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"import megengine.autodiff as ad\nimport megengine.functional as F\nimport megengine.optimizer as optim\nfrom megengine.jit import trace\n\n# model, image and label are assumed to be defined elsewhere\nif __name__ == '__main__':\n    gm = ad.GradManager().attach(model.parameters())\n    opt = optim.SGD(model.parameters(), lr=0.0125, momentum=0.9, weight_decay=1e-4)\n\n    # trace converts the function to a static graph\n    @trace(symbolic=True)\n    def train():\n        with gm:\n            logits = model(image)\n            loss = F.loss.cross_entropy(logits, label)\n            gm.backward(loss)\n            opt.step()\n            opt.clear_grad()\n        return loss\n\n    loss = train()\n    loss.numpy()\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"What Are AOT and JIT"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AOT ("},{"type":"text","marks":[{"type":"strong"}],"text":"A"},{"type":"text","text":"head "},{"type":"text","marks":[{"type":"strong"}],"text":"O"},{"type":"text","text":"f "},{"type":"text","marks":[{"type":"strong"}],"text":"T"},{"type":"text","text":"ime) and JIT ("},{"type":"text","marks":[{"type":"strong"}],"text":"J"},{"type":"text","text":"ust "},{"type":"text","marks":[{"type":"strong"}],"text":"I"},{"type":"text","text":"n "},{"type":"text","marks":[{"type":"strong"}],"text":"T"},{"type":"text","text":"ime) are both compilation concepts. Take traditional C\/C++ as an example: after writing the code, we usually compile it into an executable with a compiler and then run that executable to get the result. If we call the process of compiling source code into an executable the build stage and running the executable the runtime stage, then JIT has no build stage, only a runtime stage. JIT is typically used in interpreted languages such as Python: it detects hot functions while the code runs and recompiles them, so that the next time a hot function is encountered the compiled result is executed directly. This can significantly speed up execution."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"What Is MLIR"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As programming languages have proliferated, modern compilers have become increasingly diverse. In recent years in particular, the rise of deep learning has brought explosive growth in deep learning software frameworks and AI-specific hardware. A widening gap has formed between the growing number of frameworks and AI hardware, and translating a framework's description of a deep learning model accurately and efficiently into a form suited to each kind of hardware has become a difficult problem. MLIR ("},{"type":"text","marks":[{"type":"strong"}],"text":"M"},{"type":"text","text":"ulti-"},{"type":"text","marks":[{"type":"strong"}],"text":"L"},{"type":"text","text":"evel "},{"type":"text","marks":[{"type":"strong"}],"text":"I"},{"type":"text","text":"ntermediate "},{"type":"text","marks":[{"type":"strong"}],"text":"R"},{"type":"text","text":"epresentation) is a hybrid IR that can serve diverse requirements within a unified infrastructure. MLIR can address needs including, but not limited to, the following:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Representing dataflow graphs (such as the MegEngine graph in static-graph mode)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Representing optimizations and transformations applied to that graph"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. Performing operator optimizations such as kernel fusion, loop fusion, tiling, and memory layout conversion"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4. 
Automatic code generation, explicit cache management, and automatic vectorization"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As a shared IR, MLIR offers excellent expressiveness and extensibility. It can express graph-level operations, the IR of a traditional compiler, and hardware-specific operations. Each such collection of operations with its own attributes and types forms a dialect in MLIR. MLIR also provides convenient mechanisms for converting between dialects (lowering), so a single general-purpose optimization in MLIR can pay off in many places. Adopting MLIR also makes it more likely that we benefit from its ecosystem, in both performance and extensibility."}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Motivation"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Why Do This"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Deep learning models contain many element-wise operations: arithmetic such as addition, subtraction, multiplication, and division, as well as the activation functions in neural networks, are generally element-wise. MegEngine classifies element-wise operations as unary, binary, or multi-input. Unary operations include RELU, ABS, SIN, and COS; binary operations include addition, subtraction, multiplication, division, and MAX; multi-input operations include FUSE_MUL_ADD3 and FUSE_MUL_ADD4, which compute “a*b+c” and “a*b+c*d” respectively."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Table 1. Element-wise operations in convolutional neural networks"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"
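model | batch size | element-wise computation \/ total computation (%) | element-wise time \/ total time (%)
resnet50 | 1 | 0.102054232 | 4.6
resnet50 | 8 | 0.102085366 | 10.8
resnet50 | 16 | 0.102080177 | 11.9
mobilenetV2 | 1 | 0.703333333 | 4.1
mobilenetV2 | 8 | 0.704592902 | 8.9
mobilenetV2 | 16 | 0.704627697 | 11.9
vgg16 | 1 | 0.02927408 | 5.8
vgg16 | 8 | 0.029277172 | 7.1
vgg16 | 16 | 0.029342568 | 9.4"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Element-wise operations play a role in convolutional neural networks that cannot be ignored. As Table 1 shows, we took publicly available convolutional network training models and measured the importance of their element-wise operations, using pure device kernel execution time as the baseline."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first thing that stands out is that the share of computation taken by element-wise operations is one to two orders of magnitude lower than their share of running time. They account for very little computation but a large share of running time, which is rather counterintuitive, and the effect becomes more pronounced as the batch size grows. The reason is that element-wise operations do little computation but a lot of memory access; that is, their compute-to-memory-access ratio is low, which makes them typical memory-bound operations. Take “a+b” as an example: we first read a from memory, then read b from memory, perform one addition, and write the result c back to memory. The whole process needs two reads and one write for a single computation, so its compute-to-memory-access ratio is very low. For memory-bound operations, optimizing compute time has little value; effort should go into optimizing memory access instead, and a common technique for that is fusion. If we can fuse consecutive element-wise operations in the network into a single operator, we reduce their memory traffic and raise the compute-to-memory-access ratio, which speeds up the network as a whole."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Why Use JIT"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Convolutional neural networks have two distinctive properties. First, when training in static-graph mode, the model structure generally does not change. Second, training runs through many iterations\/mini-batches, and the input tensor shapes generally do not change between them. Based on these two properties, we decided to apply JIT, for the following reasons:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Compilation is needed only once, on the first run; subsequent iterations\/mini-batches can reuse the result of that first compilation."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. 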
JIT is highly portable, because it obtains platform information at run time and then generates code that runs on that platform."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. JIT solves the combinatorial explosion of element-wise mode combinations."}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Technical Approach"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Element-wise fusion merges multiple element-wise operations into one; fewer operators mean fewer reads and writes between operators. The computation graph in Figure 3 computes “a*b+c” and needs 4 reads and 2 writes. The 4 reads are the multiplication reading its two inputs a and b, the addition reading the multiplication's output (an intermediate result that the multiplication has to write out), and the addition reading c. The 2 writes are the multiplication and the addition each writing its result."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We fuse this into a single operator, FUSE_MUL_ADD3. Because MegEngine already supports FUSE_MUL_ADD3 as an element-wise mode, we can rewrite the graph directly, transforming it from the form on the left of Figure 3 to the form on the right. The fused graph needs only 3 reads and 1 write to perform the equivalent computation, saving 1 read and 1 write compared with the unfused graph."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/43\/bf\/43c2a69511265510f9a3965ff8f593bf.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 3. Fusion reduces the number of memory accesses"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We cannot predict what computation graph a user will build. Consider the graph in Figure 4, where neither the number nor the order of element-wise operations is fixed; clearly we cannot write every combination of element-wise modes into MegEngine ahead of time. In this situation, MegEngine creates a virtual operator to represent the whole fusible subgraph. With the virtual operator in place, two problems remain. One is replacing the fusible subgraphs in the original graph with virtual operators, which is done in the graph optimization stage. The other is generating and executing the code for each virtual operator on the fly. Solving these two problems solves the whole problem."}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/79\/54\/79ed20a6b897f1109679377ab7057d54.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 4. Subgraph fusion"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Graph Optimization"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To fuse the fusible subgraphs of a computation graph into single operators, MegEngine performs two operations, detection and fusion. In the steps below, steps 1-3 belong to detection and step 4 belongs to fusion:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Run detection on the original computation graph to produce internal graph generators; each internal graph generator corresponds to exactly one subgraph."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Each internal graph generator later produces an internal graph."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. Create a JITExecutor operator from the internal graph."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4. 
Write the JITExecutor back into the original computation graph."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Detection"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The detection algorithm finds the subgraphs that can be fused. For ease of description, let G be the computation graph, opr an operator in G, and var an input or output of an opr. The input to the detection algorithm is the original graph G, and its output is a hash table M that maps the output var of each detected fusible subgraph (called its endpoint) to the corresponding internal graph generator. The steps are:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Traverse the operators opr of G in reverse topological order."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. If opr is not an Elemwise\/PowC\/TypeCvt\/Reduce\/Dimshuffle\/JITExecutor operator, go back to step 1."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. If the input\/output data type of opr is not float32\/float16, go back to step 1."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"4. process_opr(opr)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5. 
Go to step 1."}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/94\/f1\/94c3b90a361ea11eef5635yy23eab9f1.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 5. process_opr flowchart"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A topological order requires every parent node to be visited before its children; correspondingly, a reverse topological order visits every child before its parents. Step 1 traverses the graph in reverse topological order to guarantee that when an opr is visited, all of its children have already been visited. The algorithm can then check whether all children of that opr lie in the same subgraph; if they do, the current opr very likely belongs to that subgraph too. Steps 2 and 3 reflect the current limitations of JIT in MegEngine: it supports only the Elemwise\/PowC\/TypeCvt\/Reduce\/Dimshuffle operator types, and only float32\/float16 inputs and outputs. Figure 5 shows step 4 in detail. Note that the algorithm goes through the following three checks:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Are all children of the opr already in the current subgraph?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Does the compute node of the opr's output match that of the subgraph? MegEngine supports computation graphs that span compute nodes; for example, some oprs in a graph can run on the CPU and others on the GPU. However, MegEngine does not currently support fusion across compute nodes."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. 
Does the shape of the opr's output match that of the subgraph? The generated code is essentially one big loop whose extent is the output shape of the opr, so an opr whose shape does not match cannot be fused."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/a8\/99\/a86e7e81feb11294e80f543068a11b99.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 6. Fusible subgraphs found by the detection algorithm"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The dashed boxes in Figure 6 mark the two fusible subgraphs found by the detection algorithm."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Fusion"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The fusion algorithm fuses each detected subgraph into a single operator. Its input is the original computation graph G together with the hash table M produced by the detection algorithm, and its output is the fused computation graph G'. The steps are:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Traverse the operators opr of G in topological order."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. If the input var of opr is not an endpoint, go back to step 1."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. 
Fetch the internal graph generator for var from M and generate the internal graph."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4. Create a JITExecutor from the internal graph."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5. Write it back into the original computation graph G."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"6. Go to step 1."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In step 2, an input var that is not an endpoint is an intermediate node inside a subgraph rather than the subgraph's output. In step 3, going from an internal graph generator to an internal graph requires replacing each input var of the subgraph with a new opr, JITPlaceholder. JITPlaceholder stores extra information such as the order of the subgraph's inputs, because some element-wise operations are sensitive to input order; for example, a modulo b and b modulo a clearly have different meanings."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/fc\/b5\/fc2147e30b3a97da5924862dd32705b5.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 7. Computation graph after fusion"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 7 shows the computation graph after the fusion algorithm has run. At this point, all of the graph optimization work is done."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Graph Compilation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After graph optimization, each fusible subgraph in the computation graph has been fused into a new operator; what remains is to generate code for that operator. The runtime logic of the JITExecutor operator is simple: it checks whether an executable object already exists, compiles one first if not, and otherwise runs the existing one directly. Because this code is executed only at run time, it is called JIT. MegEngine currently supports three JIT compiler backends: NVRTC (for NVIDIA GPUs), Halide, and MLIR. The latter two support many platforms, but MLIR is more expressive and extensible, so we use MLIR as the example below to describe code generation, compilation, and execution."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To use MLIR as a compiler backend, we first define and implement MegEngine's own dialect (the MGE Dialect), and then convert the MGE Dialect to dialects that already exist in MLIR. Most of the remaining work can reuse MLIR's existing components and tools. Figure 8 outlines the execution flow on CPU and GPU."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/26\/17\/26635eaee0169bdcf303cb1491cf2017.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 8. JIT compiler workflow"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MegEngine first translates the internal graph inside the JITExecutor operator into the MGE Dialect. On CPU, the MGE Dialect is first lowered to the Affine Dialect and then, through the LLVM components, to the LLVM Dialect, which can be translated directly into LLVM IR. From that point on, all further optimization can reuse LLVM's infrastructure. Finally, MegEngine uses the MLIR ExecutionEngine to run the code generated from the LLVM IR. On GPU, MegEngine first lowers the MGE Dialect to the GPU Dialect and then to the NVVM Dialect; NVVM is translated into PTX assembly, which is finally run through the CUmodule and CUfunction mechanisms provided by NVIDIA."}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Experiments and Analysis"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, follow "},{"type":"link","attrs":{"href":"http:\/\/#how-to-use-codegen?fileGuid=CJHxRQP9JR6pKVGh","title":"","type":null},"content":[{"type":"text","text":"this document"}]},{"type":"text","text":" to enable JIT support in MegEngine. The experiments use three models widely used in industry, resnet50, mobilenetV2, and vgg16, with batch sizes of 1, 8, and 16. The hardware is an NVIDIA T4 and the software is MegEngine v1.2.0."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/58\/8c\/583880b78280398ef8d2d5665c77728c.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 9. Speedup with JIT enabled relative to JIT disabled"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 9 shows that, compared with JIT disabled, enabling JIT gives resnet50 a speedup of up to 16% and mobilenetV2 a speedup of 6% to 7%, while vgg16 sees no noticeable speedup. This is because vgg16 is a large model with relatively few element-wise operations that can be optimized. The benefit of JIT depends closely on the specific model."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/72\/07\/72182ede05de00f08f385772dec41707.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"Figure 10. JIT compilation time"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With JIT enabled, MegEngine performs a JIT compilation on the first run. The compilation time depends on the compiler backend and on the model; as Figure 10 shows, it is 2.7 ms for resnet50 and 3.9 ms for mobilenetV2."}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Summary and Outlook"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article described how MegEngine uses JIT to fuse any number of adjacent element-wise operators into a single operator. In experiments on a T4 with MegEngine v1.2.0, resnet50 achieved a speedup of up to 16% over the unoptimized baseline."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Building on this, possible future work includes:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. 
Save JIT compilation results offline, and load the precompiled executable objects directly into memory in production. This removes the slow first run in production but may cost some portability, because an executable object compiled on one device generally will not work on every production device."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Support more operators in JIT."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. Support more data types in JIT; for now, MegEngine's JIT optimization supports only float32\/float16."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4. JIT for dynamic graphs, that is, JIT in the traditional sense: detect hot code, recompile it, and then execute it."}]}]}
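To make the read/write accounting behind the fusion optimization concrete, here is a minimal Python sketch. It is not MegEngine code: the toy graph encoding (a list of `(output, inputs)` pairs) and the helper name `count_memory_ops` are assumptions made purely for illustration. It counts the memory reads and writes for “a*b+c” before and after fusing the multiply and the add into one operator.

```python
# Illustrative sketch only: the graph encoding and the helper name are
# assumptions for this example, not part of MegEngine's API.

def count_memory_ops(graph):
    """Count tensor reads and writes for a list of (output, inputs) operators.

    Each operator reads every one of its inputs from memory once and
    writes its single output to memory once.
    """
    reads = sum(len(inputs) for _, inputs in graph)
    writes = len(graph)
    return reads, writes


# Unfused: t = a * b, then out = t + c (two element-wise operators).
unfused = [("t", ["a", "b"]), ("out", ["t", "c"])]

# Fused: out = a * b + c computed by one element-wise operator.
fused = [("out", ["a", "b", "c"])]

unfused_ops = count_memory_ops(unfused)
fused_ops = count_memory_ops(fused)
print("unfused (reads, writes):", unfused_ops)
print("fused (reads, writes):", fused_ops)
```

Under this simplified model the unfused graph performs 4 reads and 2 writes and the fused graph 3 reads and 1 write, matching the counts given for the “a*b+c” example; longer chains of element-wise operators save one read and one write per fused edge in the same way.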