MyDLNote - Network: Deep High-Resolution Representation Learning for Human Pose Estimation

Deep High-Resolution Representation Learning for Human Pose Estimation

[paper] https://arxiv.org/pdf/1902.09212.pdf

[github] https://github.com/leoxiaobin/deep-high-resolution-net.pytorch

Table of Contents

Deep High-Resolution Representation Learning for Human Pose Estimation

Abstract

Introduction

Related Work

High-to-low and low-to-high

Multi-scale fusion

Intermediate supervision

Our approach

Approach

Sequential multi-resolution subnetworks

Parallel multi-resolution subnetworks

Repeated multi-scale fusion

Heatmap estimation

Network instantiation

Experiments

Ablation Study


Abstract

In this paper, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process.

To obtain high-resolution representations, most existing methods first go high-to-low and then recover the resolution with a low-to-high pass.

This paper instead maintains the high-resolution representation through the whole process.

We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations.

Overall structure of the network:

1. First stage: a high-resolution subnetwork.

2. Later stages: high-to-low resolution subnetworks added one by one.

3. Connections: parallel, with repeated multi-resolution (multi-scale) fusion, i.e., each parallel branch repeatedly exchanges information with all the other resolutions, including the high-resolution branch kept from the first stage.

 

Introduction

Most existing methods pass the input through a network, typically consisting of high-to-low resolution subnetworks that are connected in series, and then raise the resolution. For instance, Hourglass [40] recovers the high resolution through a symmetric low-to-high process. SimpleBaseline [72] adopts a few transposed convolution layers for generating high-resolution representations. In addition, dilated convolutions are also used to blow up the later layers of a high-to-low resolution network (e.g., VGGNet or ResNet) [27, 77].

Existing methods generally use a high-to-low then low-to-high structure, e.g., Hourglass [Stacked hourglass networks for human pose estimation. In ECCV 2016] and SimpleBaseline [Simple baselines for human pose estimation and tracking. In ECCV 2018].

Dilated convolution approaches are similar: e.g., the dilated residual network uses dilated convolutions in the last few layers of ResNet to keep the resolution from dropping further, so that only a light low-to-high recovery is needed.

We present a novel architecture, namely High-Resolution Net (HRNet), which is able to maintain high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions by exchanging the information across the parallel multi-resolution subnetworks over and over through the whole process. We estimate the keypoints over the high-resolution representations output by our network. The resulting network is illustrated in Figure 1.

This paragraph is almost identical to the abstract =.=|

Figure 1. Illustrating the architecture of the proposed HRNet. It consists of parallel high-to-low resolution subnetworks with repeated information exchange across multi-resolution subnetworks (multi-scale fusion). The horizontal and vertical directions correspond to the depth of the network and the scale of the feature maps, respectively.

Our network has two benefits in comparison to existing widely-used networks [40, 27, 77, 72] for pose estimation. (i) Our approach connects high-to-low resolution subnetworks in parallel rather than in series as done in most existing solutions. Thus, our approach is able to maintain the high resolution instead of recovering the resolution through a low-to-high process, and accordingly the predicted heatmap is potentially spatially more precise. (ii) Most existing fusion schemes aggregate low-level and high-level representations. Instead, we perform repeated multi-scale fusions to boost the high-resolution representations with the help of the low-resolution representations of the same depth and similar level, and vice versa, resulting in rich high-resolution representations. Consequently, our predicted heatmap is potentially more accurate.

Two advantages: (i) the parallel (rather than serial) connection maintains high resolution throughout, so the predicted heatmaps can be spatially more precise. (ii) Repeated fusion across scales (stages) between high- and low-resolution representations of similar depth yields richer high-resolution representations.

 

Related Work

There are two mainstream methods: regressing the position of keypoints [66, 7], and estimating keypoint heatmaps [13, 14, 78] followed by choosing the locations with the highest heat values as the keypoints.

There are two mainstream approaches to pose estimation: directly regressing keypoint positions, and estimating keypoint heatmaps.

Most convolutional neural networks for keypoint heatmap estimation consist of a stem subnetwork similar to the classification network, which decreases the resolution, a main body producing the representations with the same resolution as its input, followed by a regressor estimating the heatmaps where the keypoint positions are estimated and then transformed in the full resolution. The main body mainly adopts the high-to-low and low-to-high framework, possibly augmented with multi-scale fusion and intermediate (deep) supervision.

Conventional CNN-based keypoint heatmap estimation uses networks similar to classification networks: a high-to-low and low-to-high framework.

 

High-to-low and low-to-high

The high-to-low process aims to generate low-resolution and high-level representations, and the low-to-high process aims to produce high-resolution representations [4, 11, 23, 72, 40, 62]. Both the two processes are possibly repeated several times for boosting the performance [77, 40, 14].

high-to-low: produces low-resolution, high-level representations.

low-to-high: produces high-resolution representations.

Both processes may be repeated several times to boost performance.

Representative network design patterns include: (i) Symmetric high-to-low and low-to-high processes. Hourglass and its follow-ups [40, 14, 77, 31] design the low-to-high process as a mirror of the high-to-low process. (ii) Heavy high-to-low and light low-to-high. The high-to-low process is based on the ImageNet classification network, e.g., ResNet adopted in [11, 72], and the low-to-high process is simply a few bilinear-upsampling [11] or transpose convolution [72] layers. (iii) Combination with dilated convolutions. In [27, 51, 35], dilated convolutions are adopted in the last two stages in the ResNet or VGGNet to eliminate the spatial resolution loss, which is followed by a light low-to-high process to further increase the resolution, avoiding expensive computation cost for only using dilated convolutions [11, 27, 51]. Figure 2 depicts four representative pose estimation networks.

Three typical network designs:

symmetric high-to-low and low-to-high;

heavy high-to-low, light low-to-high;

combination with dilated convolutions (also heavy high-to-low, light low-to-high).

Figure 2. Illustration of representative pose estimation networks that rely on the high-to-low and low-to-high framework. (a) Hourglass [40]. (b) Cascaded pyramid networks [11]. (c) Simple Baseline [72]: transposed convolutions for low-to-high processing. (d) Combination with dilated convolutions [27].

Bottom-right legend: reg. = regular convolution, dilated = dilated convolution, trans. = transposed convolution, strided = strided convolution, concat. = concatenation.

In (a), the high-to-low and low-to-high processes are symmetric. In (b), (c) and (d), the high-to-low process, a part of a classification network (ResNet or VGGNet), is heavy, and the low-to-high process is light.

In (a) and (b), the skip-connections (dashed lines) between the same-resolution layers of the high-to-low and low-to-high processes mainly aim to fuse low-level and high-level features. In (b), the right part, refinenet, combines the low-level and high-level features that are processed through convolutions.

Four representative pose estimation network models:

(a) Hourglass [Stacked hourglass networks for human pose estimation].

(b) Cascaded pyramid networks [Cascaded Pyramid Network for Multi-Person Pose Estimation].

(c) Simple Baseline [Simple baselines for human pose estimation and tracking]. 

(d) Combination with dilated convolutions [Deepercut: A deeper, stronger, and faster multiperson pose estimation model].

(a) is symmetric; (b), (c), (d) are asymmetric: high-to-low is heavy, low-to-high is light.

 

Multi-scale fusion

The straightforward way is to feed multi-resolution images separately into multiple networks and aggregate the output response maps [64]. Hourglass [40] and its extensions [77, 31] combine low-level features in the high-to-low process into the same-resolution high-level features in the low-to-high process progressively through skip connections. In cascaded pyramid network [11], a globalnet combines low-to-high level features in the high-to-low process progressively into the low-tohigh process, and then a refinenet combines the low-to-high level features that are processed through convolutions. Our approach repeats multi-scale fusion, which is partially inspired by deep fusion and its extensions [Deeply-fused nets, Interleaved structured sparse convolutional neural networks, IGCV v1, IGCV v3].

Multi-scale fusion appears in networks as skip connections, the globalnet/refinenet combination, and deep fusion. This paper applies multi-scale fusion repeatedly, partially inspired by deep fusion and its extensions.

Intermediate supervision

Intermediate supervision or deep supervision, early developed for image classification [34, 61], is also adopted for helping deep networks training and improving the heatmap estimation quality, e.g., [69, 40, 64, 3, 11]. The hourglass approach [40] and the convolutional pose machine approach [69] process the intermediate heatmaps as the input or a part of the input of the remaining subnetwork.

 

Our approach

Our network connects high-to-low subnetworks in parallel. It maintains high-resolution representations through the whole process for spatially precise heatmap estimation. It generates reliable high-resolution representations through repeatedly fusing the representations produced by the high-to-low subnetworks. Our approach is different from most existing works, which need a separate low-to-high upsampling process and aggregate low-level and high-level representations. Our approach, without using intermediate heatmap supervision, is superior in keypoint detection accuracy and efficient in computation complexity and parameters.

Key points:

1. High-to-low subnetworks are connected in parallel;

2. High-resolution feature maps are maintained throughout;

3. High-to-low representations are fused repeatedly;

4. No intermediate heatmap supervision is used.

 

There are related multi-scale networks for classification and segmentation [5, 8, 74, 81, 30, 76, 55, 56, 24, 83, 55, 52, 18]. Our work is partially inspired by some of them [56, 24, 83, 55], and there are clear differences making them not applicable to our problem. Convolutional neural fabrics [56] and interlinked CNN [83] fail to produce high-quality segmentation results because of a lack of proper design on each subnetwork (depth, batch normalization) and multi-scale fusion. The grid network [18], a combination of many weight-shared U-Nets, consists of two separate fusion processes across multi-resolution representations: on the first stage, information is only sent from high resolution to low resolution; on the second stage, information is only sent from low resolution to high resolution, and thus less competitive. Multi-scale densenets [24] does not target and cannot generate reliable high-resolution representations.

Comparison with four related networks. This paper's network is loosely related to the following:

Convolutional neural fabrics [Convolutional Neural Fabrics]

interlinked CNN [Interlinked Convolutional Neural Networks for Face Parsing]

grid network [Residual Conv-Deconv Grid Network for Semantic Segmentation]

Multi-scale densenets [Multi-Scale Dense Convolutional Networks for Efficient Prediction]

 

Approach

Human pose estimation, a.k.a. (also known as) keypoint detection, aims to detect the locations of \small K keypoints or parts (e.g., elbow, wrist, etc.) from an image \small I of size W × H × 3. The state-of-the-art methods transform this problem to estimating \small K heatmaps of size \small {W}'\times {H}' , \small \{H_1, H_2, ..., H_K\}, where each heatmap \small H_k indicates the location confidence of the \small kth keypoint.

We follow the widely-adopted pipeline [40, 72, 11] to predict human keypoints using a convolutional network, which is composed of a stem consisting of two strided convolutions decreasing the resolution, a main body outputting the feature maps with the same resolution as its input feature maps, and a regressor estimating the heatmaps where the keypoint positions are chosen and transformed to the full resolution. We focus on the design of the main body and introduce our High-Resolution Net (HRNet) that is depicted in Figure 1.

This paragraph describes the pipeline: resolution reduction (stem) → feature transformation (output at the same resolution as the input) → heatmap regression.
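As a quick sanity check on the stem, the spatial output size of a padded, strided convolution follows the standard formula floor((size + 2p − k)/s) + 1. A minimal sketch, assuming 3×3 kernels with stride 2 and padding 1, and a 256×192 input crop (the input size is an illustrative assumption, not stated in this excerpt):

```python
def conv2d_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# The stem applies two strided 3x3 convolutions, so a 256x192 input
# crop is reduced 4x to 64x48 before entering the main body.
h, w = 256, 192
for _ in range(2):
    h, w = conv2d_out(h), conv2d_out(w)
print(h, w)  # -> 64 48
```

This is why the heatmaps predicted at the highest resolution still need to be transformed back to the full input resolution at the end.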

 

Sequential multi-resolution subnetworks

Let \small \mathcal{N}_{sr} be the subnetwork in the \small sth stage and \small r be the resolution index (its resolution is \small 1/2^{r-1} of the resolution of the first subnetwork). The high-to-low network with \small S (e.g., 4) stages can be denoted as:

\small \mathcal{N}_{11}\rightarrow \mathcal{N}_{22}\rightarrow \mathcal{N}_{33}\rightarrow \mathcal{N}_{44}
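The resolution index \small r can be made concrete with a small helper; the 64×48 base below assumes a 256×192 input reduced 4× by the stem (illustrative numbers, not from this excerpt):

```python
def branch_resolution(r, base=(64, 48)):
    """Resolution of subnetwork N_sr: 1 / 2**(r-1) of the first (highest) branch.

    `base` is the resolution of the highest branch after the stem
    (64x48 assuming a 256x192 input)."""
    return tuple(v // 2 ** (r - 1) for v in base)

print([branch_resolution(r) for r in range(1, 5)])
# -> [(64, 48), (32, 24), (16, 12), (8, 6)]
```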

 

Parallel multi-resolution subnetworks

We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one, forming new stages, and connect the multi-resolution subnetworks in parallel. As a result, the resolutions for the parallel subnetworks of a later stage consist of the resolutions from the previous stage, and an extra lower one. An example network structure, containing 4 parallel subnetworks, is given as follows,

\small \begin{matrix} \mathcal{N}_{11} & \rightarrow & \mathcal{N}_{21} & \rightarrow & \mathcal{N}_{31} & \rightarrow & \mathcal{N}_{41}\\ & \searrow & \mathcal{N}_{22} & \rightarrow & \mathcal{N}_{32} & \rightarrow & \mathcal{N}_{42}\\ & & & \searrow & \mathcal{N}_{33} & \rightarrow & \mathcal{N}_{43}\\ & & & & & \searrow & \mathcal{N}_{44} \end{matrix}

These two subsections give two multi-resolution network structures: the former is the usual serial encoder structure; the latter is the parallel structure.

 

Repeated multi-scale fusion

We introduce exchange units across parallel subnetworks such that each subnetwork repeatedly receives the information from other parallel subnetworks. Here is an example showing the scheme of exchanging information. We divide the third stage into several (e.g., 3) exchange blocks, and each block is composed of 3 parallel convolution units with an exchange unit across the parallel units, which is given as follows,

\small \{\mathcal{C}^b_{31},\ \mathcal{C}^b_{32},\ \mathcal{C}^b_{33}\} \rightarrow \varepsilon^b_3, \quad b=1,2,3,

where \small C^b_{sr} represents the convolution unit in the \small rth resolution of the \small bth block in the \small sth stage, and \small \varepsilon ^b_s is the corresponding exchange unit.

The paper does not simply use a parallel structure; it also designs the fusion process between feature maps of different scales.

The fusion process has the structure of Eq. (3).

We illustrate the exchange unit in Figure 3 and present the formulation in the following. We drop the subscript \small s and the superscript \small b for discussion convenience. The inputs are \small s response maps: \small \{X_1, X_2, ..., X_s\}. The outputs are \small s response maps: \small \{Y_1, Y_2, ..., Y_s\}, whose resolutions and widths are the same as the inputs. Each output is an aggregation of the input maps, \small Y_k=\sum^s_{i=1}a(X_i,k). The exchange unit across stages has an extra output map \small Y_{s+1}=a(Y_s, s+1).

Figure 3. Illustrating how the exchange unit aggregates the information for high, medium and low resolutions from the left to the right, respectively. Right legend: strided 3×3 = strided 3×3 convolution, up samp. 1×1 = nearest neighbor up-sampling following a 1 × 1 convolution.

The network realization of Eq. (3) is shown in Figure 3. Note: when an exchange unit crosses a stage boundary, it additionally generates a feature map at one extra, lower resolution.

The function \small a(X_i,k) consists of upsampling or downsampling \small X_i from resolution \small i to resolution \small k. We adopt strided 3 × 3 convolutions for downsampling. For instance, one strided 3×3 convolution with the stride 2 for 2× downsampling, and two consecutive strided 3 × 3 convolutions with the stride 2 for 4× downsampling. For upsampling, we adopt the simple nearest neighbor sampling following a 1 × 1 convolution for aligning the number of channels. If \small i=k, \small a(\cdot,\cdot) is just an identity connection: \small a(X_i,k)=X_i.

This paragraph explains the down- and up-sampling implementations: downsampling uses stride-2 3×3 convolutions; upsampling uses nearest-neighbor up-sampling plus a 1×1 convolution (to align the number of channels).
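The aggregation \small Y_k=\sum^s_{i=1}a(X_i,k) can be sketched schematically in NumPy. This is not the real implementation: plain 2× subsampling stands in for the strided 3×3 convolutions, nearest-neighbor repetition stands in for up-sampling plus the channel-aligning 1×1 convolution, and channels are ignored entirely.

```python
import numpy as np

def resample(x, i, k):
    """Schematic a(X_i, k): map a branch-i feature map to resolution k.

    Stand-ins for the real layers: repeated 2x subsampling replaces the
    strided 3x3 convolutions, and nearest-neighbor repetition replaces
    up-sampling followed by a 1x1 convolution (channels are ignored)."""
    if i == k:
        return x                      # identity connection
    if i < k:                         # downsample by a factor of 2**(k - i)
        f = 2 ** (k - i)
        return x[::f, ::f]
    f = 2 ** (i - k)                  # upsample by a factor of 2**(i - k)
    return np.repeat(np.repeat(x, f, axis=0), f, axis=1)

def exchange_unit(xs):
    """Y_k = sum_i a(X_i, k) over all parallel branches."""
    s = len(xs)
    return [sum(resample(x, i, k) for i, x in enumerate(xs, 1))
            for k in range(1, s + 1)]

# Three parallel branches at resolutions 8x8, 4x4, 2x2.
xs = [np.ones((8, 8)), np.ones((4, 4)), np.ones((2, 2))]
ys = exchange_unit(xs)
print([y.shape for y in ys])  # -> [(8, 8), (4, 4), (2, 2)]
```

Each output keeps its branch's resolution while aggregating information from every other branch, which is exactly the fusion pattern Figure 3 illustrates.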

 

Heatmap estimation

We regress the heatmaps simply from the high-resolution representations output by the last exchange unit, which empirically works well. The loss function, defined as the mean squared error, is applied for comparing the predicted heatmaps and the groundtruth heatmaps. The groundtruth heatmaps are generated by applying a 2D Gaussian with standard deviation of 1 pixel centered on the groundtruth location of each keypoint.
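A minimal NumPy sketch of the ground-truth heatmap and the MSE loss described above (the heatmap size and keypoint location are illustrative; the repository's implementation may differ in details such as Gaussian truncation):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=1.0):
    """Ground-truth heatmap: 2D Gaussian (std 1 pixel) centered on the keypoint."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def mse_loss(pred, target):
    """Mean squared error between predicted and ground-truth heatmaps."""
    return np.mean((pred - target) ** 2)

gt = gaussian_heatmap(64, 48, cx=20, cy=30)
print(gt[30, 20])       # -> 1.0 (peak at the keypoint location)
print(mse_loss(gt, gt)) # -> 0.0
```

At inference, the keypoint location is read off as the argmax of the predicted heatmap and scaled back to the input resolution.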

 

Network instantiation

We instantiate the network for keypoint heatmap estimation by following the design rule of ResNet to distribute the depth to each stage and the number of channels to each resolution.

The main body, i.e., our HRNet, contains four stages with four parallel subnetworks, whose resolutions are gradually decreased to a half and accordingly whose widths (numbers of channels) are doubled. The first stage contains 4 residual units where each unit, the same as in ResNet-50, is formed by a bottleneck with the width 64, and is followed by one 3×3 convolution reducing the width of the feature maps to \small C. The 2nd, 3rd, 4th stages contain 1, 4, 3 exchange blocks, respectively. One exchange block contains 4 residual units, where each unit contains two 3 × 3 convolutions in each resolution, and an exchange unit across resolutions. In summary, there are 8 exchange units in total, i.e., 8 multi-scale fusions are conducted.

In our experiments, we study one small net and one big net: HRNet-W32 and HRNet-W48, where 32 and 48 represent the widths (\small C) of the high-resolution subnetworks in the last three stages, respectively. The widths of the other three parallel subnetworks are 64, 128, 256 for HRNet-W32, and 96, 192, 384 for HRNet-W48.

The HRNet structure in detail:

1. 4 stages, 4 parallel subnetworks.

2. Stage 1: 4 residual units, each the same as in ResNet-50 (a bottleneck of width 64), followed by a 3×3 convolution reducing the width to \small C.

3. Stages 2, 3, 4 contain 1, 4, 3 exchange blocks, respectively.

4. Each exchange block contains 4 residual units per resolution (each with two 3×3 convolutions, resolution unchanged) and one exchange unit across resolutions.

5. Two width configurations:

    HRNet-W32 (the highest-resolution branch has 32 channels in the last three stages; the three lower-resolution branches have 64, 128, 256 channels);

    HRNet-W48 (the highest-resolution branch has 48 channels; the three lower-resolution branches have 96, 192, 384 channels).
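The width pattern is simple: each lower-resolution branch doubles the channel count. A one-line sketch reproducing the two configurations above:

```python
def branch_widths(c, num_branches=4):
    """Channel widths per parallel branch: width doubles as resolution halves."""
    return [c * 2 ** r for r in range(num_branches)]

print(branch_widths(32))  # -> [32, 64, 128, 256]  (HRNet-W32)
print(branch_widths(48))  # -> [48, 96, 192, 384]  (HRNet-W48)
```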

 

Experiments

COCO Keypoint Detection

MPII Human Pose Estimation

Application to Pose Tracking

Since I am not currently working on these applications, I have not read the experiments section in detail.

 

Ablation Study

The ablation study mainly analyzes three parts:

Repeated multi-scale fusion:

(a) W/o intermediate exchange units (1 fusion): There is no exchange between multi-resolution subnetworks except the last exchange unit. 

(b) W/ across-stage exchange units only (3 fusions): There is no exchange between parallel subnetworks within each stage.

(c) W/ both across-stage and within-stage exchange units (totally 8 fusions): This is our proposed method.

Unsurprisingly, 8 fusions work best.

 

Resolution maintenance

All four high-to-low resolution subnetworks are added at the beginning, with the same depth; the fusion schemes are the same as ours.

This experiment presumably tests a network in which all the high-to-low resolution subnetworks exist from the very beginning, i.e., the network as a whole is rectangular rather than an inverted triangle.

The result: although this adds many layers, it does not perform well. The reason:

We believe that the reason is that the low-level features extracted from the early stages over the low-resolution subnetworks are less helpful. In addition, the simple high-resolution network of similar parameter and computation complexities without low-resolution parallel subnetworks shows much lower performance.

1. Low-level features extracted from the early stages of the low-resolution subnetworks are not very helpful, because they are low-resolution and the low-level detail is weak.

2. In addition, a simple high-resolution network of similar parameter count and computational complexity, but without the low-resolution parallel subnetworks, performs much worse. (This point was not entirely clear to me.)

 

Representation resolution

We study how the representation resolution affects the pose estimation performance from two aspects: check the quality of the heatmap estimated from the feature maps of each resolution from high to low, and study how the input size affects the quality.

Resolution is studied from two aspects:

1. comparing the accuracy of heatmaps estimated from feature maps at different resolutions;

2. varying the input resolution.
