CNN Model Complexity (FLOPs, MAC), Parameter Count, and Running Speed

0. Introduction to Model Complexity

To describe a specific deep learning model, besides its performance metrics (accuracy for classification, mAP for detection, and so on), we also need to consider its complexity: the number of parameters (commonly reported in millions, or equivalently as model size in MB) and the computational cost of a forward pass (measured with FLOPs, FLoating point OPerations, or with MAC, Memory Access Cost). The former answers the question "how many parameters are needed to define this network?", i.e. the storage space required to hold the model; the latter answers "how much computation does one pass of the data through the network require?", i.e. the compute power needed to use the model.

1. Model Complexity, Part 1: Computing the Parameter Count

Parameter count of a convolutional layer

The parameters of a CNN convolutional layer come in two kinds: $W$ and $b$. Note that $W$ is uppercase because it denotes a matrix; compared with $b$ it carries far more information and accounts for the bulk of the parameters.
[Figure: AlexNet network architecture]
As shown in the figure above, take the classic AlexNet architecture as an example. Each small cuboid inside a large cuboid is a $W$: a three-dimensional tensor of size $[K_h, K_w, C_{in}]$, where $K_h$ is the height of the convolution kernel (filter), $K_w$ is its width, and $C_{in}$ is the number of input channels from the previous layer. $K_h$ and $K_w$ are usually equal, and are typically chosen to be 3, 5, or 7.
One kernel sweeps over the previous layer's feature map from left to right and top to bottom, computing many forward-pass values that are arranged, in their original relative positions, into a new feature map of height $H_{out}$ and width $W_{out}$. Of course a single kernel extracts only limited information, so we use $N$ different kernels, each sweeping the data once, which produces $N$ feature maps; that is, the current layer's number of output channels is $C_{out} = N$.
To summarize: a small cuboid of size $[K_h, K_w, C_{in}]$ (the current layer's filter bank) sweeps over the large cuboid of size $[H_{in}, W_{in}, C_{in}]$ (the current layer's input feature map) and finally produces a new large cuboid of size $[H_{out}, W_{out}, C_{out}]$ (the current layer's output feature map), as illustrated in the figure below.
[Figure: a filter bank sweeping over the input feature map to produce the output feature map]
From this we can state the rule: for a convolutional layer, the number of parameters, i.e. the total number of weights in $W$ and $b$, is $(K_h \times K_w \times C_{in}) \times C_{out} + C_{out}$, with the symbols defined as above.
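To make the formula concrete, here is a minimal Python sketch; the function name and the example layer shape are illustrative, not taken from the text.

def conv_params(k_h, k_w, c_in, c_out):
    """Parameters of a standard convolutional layer: one K_h x K_w x C_in
    weight block per output channel, plus one bias per output channel."""
    return (k_h * k_w * c_in) * c_out + c_out

# Example: a 3x3 convolution taking 64 channels to 128 channels
print(conv_params(3, 3, 64, 128))  # 73856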

Parameter count of a fully-connected layer

Everything so far concerns convolutional layers. Fully-connected layers, such as the last three layers of AlexNet, are actually simpler: every input value is connected to every output value by one weight, and each output gets one bias. (When transitioning from a convolutional layer to a fully-connected layer, as from layer 5 to layer 6 in the figure above, the three-dimensional output of layer 5 is first flattened into one dimension; the total number of elements does not change.) The rule is: for a fully-connected layer with $N_{in}$ input nodes and $N_{out}$ output nodes, the number of parameters is $N_{in} \times N_{out} + N_{out}$. If the previous layer is convolutional, $N_{in}$ is the number of elements of its three-dimensional output, i.e. $N_{in} = H_{in} \times W_{in} \times C_{in}$.
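A matching sketch for the fully-connected case. The 6x6x256 -> 4096 shape in the example is the commonly cited size of AlexNet's first fully-connected layer, stated here as an assumption rather than read from the figure.

def fc_params(n_in, n_out):
    """Parameters of a fully-connected layer: one weight per input-output pair
    plus one bias per output node."""
    return n_in * n_out + n_out

# If the previous layer is convolutional, flatten its output first:
# n_in = H_in * W_in * C_in
print(fc_params(6 * 6 * 256, 4096))  # 37752832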

2. Model Complexity, Part 2: Computing the Computational Cost

A model's computational cost directly determines how fast it runs. The most common metric is the number of floating-point operations, FLOPs; a refinement of this metric is the number of multiply-accumulate operations, MACCs (multiply-accumulate operations), also known as MADDs.

2.1 FLOPs

2.1.1 The difference between FLOPs and FLOPS

FLOPs: with a lowercase s, short for FLoating point OPerations (the s marks the plural). It denotes a number of floating-point operations, i.e. an amount of computation, and is used to measure a model's complexity. When evaluating the complexity of a neural network, the correct term is FLOPs, not FLOPS.
FLOPS: all uppercase, short for Floating point Operations Per Second. It denotes the number of floating-point operations performed per second, i.e. a computation speed, and is a measure of hardware performance. For example, the compute figures Nvidia lists for its GPUs use this metric, as in the figure below; there it is given in TeraFLOPS, where the prefix Tera denotes $10^{12}$.
[Figure: Nvidia GPU specifications listing compute performance in TeraFLOPS]

2.1.2 FLOPs of a convolutional layer

Deep learning papers commonly use the unit GFLOPs: 1 GFLOPs = $10^9$ FLOPs, i.e. one billion (1,000,000,000) floating-point operations.
The floating-point operations here are essentially the multiplications associated with $W$ and the additions associated with $b$: each $W$ contributes as many multiplications as it has elements, and each $b$ contributes one addition, so at first glance the FLOPs count seems to equal the parameter count. But this overlooks one thing: every value on a given output feature map is produced by the same filter (weight sharing), an important property of CNNs that greatly reduces the number of parameters. So to compute FLOPs we simply multiply the parameter count by the size of the output feature map. For a convolutional layer, the number of FLOPs is $[(K_h \times K_w) \times C_{in} + 1] \times [(H_{out} \times W_{out}) \times C_{out}] = [(K_h \times K_w \times C_{in}) \times C_{out} + C_{out}] \times [H_{out} \times W_{out}] = num_{parameters} \times size_{output\ feature\ map}$, where $num_{parameters}$ is the number of parameters of the layer and $size_{output\ feature\ map}$ is the two-dimensional size of the output feature map.
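The sketch below applies this counting convention (one operation per weight multiply plus one for the bias add, not the "1 MACC = 2 FLOPs" convention discussed later); the function name and shapes are illustrative.

def conv_flops(k_h, k_w, c_in, c_out, h_out, w_out):
    """FLOPs of a convolutional layer under the convention above:
    parameter count multiplied by the 2-D size of the output feature map."""
    num_params = (k_h * k_w * c_in) * c_out + c_out
    return num_params * (h_out * w_out)

# Example: the 3x3, 64 -> 128 channel layer from before, with a 112x112 output
print(conv_flops(3, 3, 64, 128, 112, 112))  # 926449664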

2.1.3 FLOPs of a fully-connected layer

Note: a fully-connected layer has no weight sharing, so its FLOPs count equals its parameter count: $N_{in} \times N_{out} + N_{out}$.
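And the fully-connected counterpart under the same convention (illustrative names and shapes):

def fc_flops(n_in, n_out):
    """FLOPs of a fully-connected layer: no weight sharing,
    so the count equals the parameter count."""
    return n_in * n_out + n_out

print(fc_flops(9216, 4096))  # 37752832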

2.2 MAC, MACC, MADD

2.2.1 The MACC concept and its relationship to FLOPs

Why use multiply-accumulate operations as a metric? Because they are everywhere in neural network computation.
A single application of a 3×3 filter at one position of a feature map can be written as:

y = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + ... + w[8]*x[8]

In this expression, each term of the form w[0]*x[0] + ... counts as one multiply-accumulate, i.e. 1 MACC, so the expression contains 9 MACCs. (Strictly speaking there are 9 multiplications and 9 - 1 additions, but for convenience the cost is approximated as 9 MACCs, just as algorithmic complexity is usually written as $O(N)$; it is only an approximation, so there is no need to agonize over it.)

MACCs vs FLOPs: the expression above performs 9 multiplications and 9 - 1 additions, i.e. 9 + (9 - 1) FLOPs in total, so approximately 1 MACC ≈ 2 FLOPs. (Note that many hardware platforms implement a multiply-accumulate as a single instruction.)

2.2.2 MACCs for a fully-connected layer

In a fully-connected layer, all the inputs are connected to all the outputs. For a layer with I input values and J output values, its weights W can be stored in an I × J matrix. The computation performed by a fully-connected layer is:

y = matmul(x, W) + b

Here, x is a vector of I input values, W is the I × J matrix containing the layer’s weights, and b is a vector of J bias values that get added as well. The result y contains the output values computed by the layer and is also a vector of size J.

To compute the number of MACCs, we look at where the dot products happen. For a fully-connected layer, that is in the matrix multiplication matmul(x, W).

A matrix multiply is simply a whole bunch of dot products. Each dot product is between the input x and one column in the matrix W. Both have I elements and therefore this counts as I MACCs. We have to compute J of these dot products, and so the total number of MACCs is I × J, the same size as the weight matrix.

The bias b doesn’t really affect the number of MACCs. Recall that a dot product has one less addition than multiplication anyway, so adding this bias value simply gets absorbed in that final multiply-accumulate.

Example: a fully-connected layer with 300 input neurons and 100 output neurons performs 300 × 100 = 30,000 MACCs.

Note
Sometimes the formula for the fully-connected layer is written without an explicit bias value. In that case, the bias vector is added as a row to the weight matrix to make it (I + 1) × J, but that’s really more of a mathematical simplification — I don’t think the operation is ever implemented like that in real software. In any case, it would only add J extra multiplications, so the number of MACCs wouldn’t be greatly affected anyway. Remember it’s an approximation.

In general, multiplying a vector of length I with an I × J matrix to get a vector of length J takes I × J MACCs or (2I - 1) × J FLOPs.

If the fully-connected layer directly follows a convolutional layer, its input size may not be specified as a single vector length I but perhaps as a feature map with a shape such as (512, 7, 7). Some packages like Keras require you to “flatten” this input into a vector first, so that I = 512×7×7. But the math doesn’t change.

Note:
In all these calculations I’m assuming a batch size of 1. If you want to know the number of MACCs for a larger batch size B, then simply multiply the result by B.
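Putting the fully-connected MACC rules above into a short sketch; the function name, the flattening step, and the batch handling are illustrative assumptions.

def fc_maccs(input_shape, n_out, batch_size=1):
    """MACCs of a fully-connected layer: I x J dot-product terms, where I is the
    flattened input length and J the number of outputs; the bias add is absorbed
    into the final multiply-accumulate."""
    n_in = 1
    for d in input_shape:        # flatten e.g. (512, 7, 7) into I = 25088
        n_in *= d
    return batch_size * n_in * n_out

print(fc_maccs((300,), 100))        # 30000, the example above
print(fc_maccs((512, 7, 7), 4096))  # 102760448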

2.2.3 Activation layers: counted in FLOPs, not MACCs

Usually a layer is followed by a non-linear activation function, such as a ReLU or a sigmoid. Naturally, it takes time to compute these activation functions. We don’t measure these in MACCs but in FLOPS, because they’re not dot products.

Some activation functions are more difficult to compute than others. For example, a ReLU is just:

y = max(x, 0)

This is a single operation on the GPU. The activation function is only applied to the output of the layer. On a fully-connected layer with J output neurons, the ReLU uses J of these computations, so let's call this J FLOPs.

A sigmoid activation is more costly, since it involves taking an exponent:

y = 1 / (1 + exp(-x))

When calculating FLOPs we usually count addition, subtraction, multiplication, division, exponentiation, square root, etc. as a single FLOP. Since there are four distinct operations in the sigmoid function, this would count as 4 FLOPs per output, or J × 4 FLOPs for the total layer output.

It’s actually common to not count these operations, as they only take up a small fraction of the overall time. We’re mostly interested in the (big) matrix multiplies and dot products, and we’ll simply assume that the activation function is free.

In conclusion: activation functions, don’t worry about them.

2.2.4 MACCs for a convolutional layer

The input and output to convolutional layers are not vectors but three-dimensional feature maps of size H × W × C where H is the height of the feature map, W the width, and C the number of channels at each location.

Most convolutional layers used today have square kernels. For a conv layer with kernel size K, the number of MACCs is:

K × K × Cin × Hout × Wout × Cout
Here’s where that formula comes from:

  • for each pixel in the output feature map of size Hout × Wout,
  • take a dot product of the weights and a K × K window of input values
  • we do this across all input channels, Cin
  • and because the layer has Cout different convolution kernels, we repeat this Cout times to create all the output channels.

Again, we’re conveniently ignoring the bias and the activation function here.

Something we should not ignore is the stride of the layer, as well as any dilation factors, padding, etc. That’s why we look at the dimensions of the layer’s output feature map, Hout × Wout, since that already has the stride etc accounted for.

Example: for a 3×3 convolution with 128 filters, on a 112×112 input feature map with 64 channels, we perform this many MACCs:

3 × 3 × 64 × 112 × 112 × 128 = 924,844,032
That’s almost 1 billion multiply-accumulate operations! Gotta keep that GPU busy…

Note:
In this example, we used “same” padding and stride = 1, so that the output feature map has the same size as the input feature map. It’s also common to see convolutional layers use stride = 2, which would have chopped the output feature map size in half, and we would’ve used 56 × 56 instead of 112 × 112 in the above calculation.
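A quick sketch to reproduce the number above (the function name is illustrative):

def conv_maccs(k, c_in, h_out, w_out, c_out):
    """MACCs of a square-kernel convolution: one K x K x C_in dot product per
    output position, repeated for each of the C_out output channels."""
    return k * k * c_in * h_out * w_out * c_out

# 3x3 kernel, 64 input channels, 112x112 output ("same" padding, stride 1), 128 filters
print(conv_maccs(3, 64, 112, 112, 128))  # 924844032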

2.2.5 MACCs for depthwise separable convolutions
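Only a brief sketch is given here, following the main reference: a depthwise separable convolution replaces a standard convolution with a depthwise convolution (one K × K filter per input channel) followed by a 1 × 1 pointwise convolution, so its MACCs are roughly K × K × C_in × H_out × W_out for the depthwise step plus C_in × H_out × W_out × C_out for the pointwise step. A minimal sketch, using the same illustrative shapes as above:

def dw_separable_maccs(k, c_in, h_out, w_out, c_out):
    """MACCs of a depthwise separable convolution: a depthwise K x K convolution
    (one filter per input channel) followed by a 1x1 pointwise convolution."""
    depthwise = k * k * c_in * h_out * w_out
    pointwise = c_in * h_out * w_out * c_out
    return depthwise + pointwise

# Same shapes as the standard-convolution example: roughly 8.4x fewer MACCs
print(dw_separable_maccs(3, 64, 112, 112, 128))  # 109985792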

2.2.6 Batch normalization
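Briefly, and following the main reference: at inference time batch normalization applies a per-channel scale and shift whose parameters can usually be folded into the weights and bias of the preceding convolutional or fully-connected layer, so its contribution to the MACC count is typically treated as negligible.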

2.2.7 Other layers

2.3 Summary of Formulas

2.3.1 Formula summary

The formulas derived above are summarized below:
[Figure: summary table of parameter-count and FLOPs formulas]
The appendix of the paper by Pavlo Molchanov et al. (Nvidia) also describes how FLOPs are computed:
[Figure: FLOPs formulas from the appendix of Molchanov et al.]

Depending on whether biases are included and whether one MACC is counted as two operations, the final numbers can differ somewhat. In the end, FLOPs are mainly useful for comparison, to show the advantage of one algorithm or network over another, so the figures are only meaningful and comparable when computed under the same counting convention.

2.3.2 Example: computing a model's computational cost

Using the methods of Sections 1 and 2, the parameter and FLOPs counts of AlexNet can be computed layer by layer, as shown in the figure below.
[Figure: per-layer parameter and FLOPs counts of AlexNet]
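As an illustration of how one row of such a table is obtained, here is the calculation for AlexNet's first convolutional layer, using its commonly cited configuration (96 kernels of size 11×11×3, stride 4, 227×227×3 input, 55×55×96 output); treat these shapes as an assumption rather than values read from the figure.

# AlexNet conv1 (commonly cited configuration, assumed here for illustration)
k_h, k_w, c_in, c_out = 11, 11, 3, 96
h_out, w_out = 55, 55

params = (k_h * k_w * c_in) * c_out + c_out  # 34944 parameters
flops = params * (h_out * w_out)             # 105705600 FLOPs (~0.1 GFLOPs)
print(params, flops)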

An application example of model computational cost

[Figure: application example of model computational cost]

3. Model Complexity, Part 3: Memory Overhead
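Besides parameter storage and arithmetic cost, the memory traffic for reading weights and inputs and for writing the intermediate feature maps (activations) also affects real-world speed; the main reference below discusses how to estimate this memory overhead.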

References

Main reference:
https://machinethink.net/blog/how-fast-is-my-model/

Part 1 (jiqizhixin): https://www.jiqizhixin.com/articles/2019-02-22-22
Part 2 (jiqizhixin): https://www.jiqizhixin.com/articles/2019-02-28-3
Part 1 (leiphone): https://www.leiphone.com/news/201902/D2Mkv61w9IPq9qGh.html
Part 2 (leiphone): https://www.leiphone.com/news/201902/biIqSBpehsaXFwpN.html?uniqueCode=OTEsp9649VqJfUcO
On network lightweighting: https://www.jianshu.com/p/b4e820096ace
Tools:
https://github.com/sovrasov/flops-counter.pytorch
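For completeness, a minimal usage sketch of the flops-counter.pytorch tool listed above (published on PyPI as ptflops); the exact keyword arguments may differ between versions, so check the repository's README.

import torchvision.models as models
from ptflops import get_model_complexity_info

model = models.resnet18()
# The tool reports multiply-accumulate counts (MACs) and parameter counts;
# under the "1 MACC ~ 2 FLOPs" convention, double the MACs to estimate FLOPs.
macs, params = get_model_complexity_info(model, (3, 224, 224),
                                         as_strings=True,
                                         print_per_layer_stat=False)
print(macs, params)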
