Feature Pyramid Networks for Object Detection
Notes for lin16_featur_pyram_networ_objec_detec. The FPN consists of a bottom-up pathway, a top-down pathway, and lateral connections between them.
The bottom-up pathway is a feed-forward convolutional network (a ResNet backbone in the paper). Each stage consists of a series of convolutional layers producing features of the same scale, and each stage reduces the spatial scale by a factor of 2 relative to the previous one. The features output by the last layer of each of the 4 stages are \(\{C_2, C_3, C_4, C_5\}\), with strides of \(\{4, 8, 16, 32\}\) pixels relative to the input image.
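A minimal sketch of the bottom-up pathway, assuming a toy backbone in which each stage is a single strided convolution (a real FPN uses a ResNet; the channel widths below only mimic ResNet's \(C_2\)–\(C_5\) widths and are not the paper's code):

```python
import torch
import torch.nn as nn

class BottomUp(nn.Module):
    """Toy bottom-up pathway: each stage halves the spatial resolution."""

    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3)         # stride 4
        self.stage2 = nn.Conv2d(64, 256, kernel_size=3, stride=1, padding=1)     # C2, stride 4
        self.stage3 = nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1)    # C3, stride 8
        self.stage4 = nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1)   # C4, stride 16
        self.stage5 = nn.Conv2d(1024, 2048, kernel_size=3, stride=2, padding=1)  # C5, stride 32

    def forward(self, x):
        x = self.stem(x)
        c2 = self.stage2(x)
        c3 = self.stage3(c2)
        c4 = self.stage4(c3)
        c5 = self.stage5(c4)
        return c2, c3, c4, c5
```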
Once we reach the top of this pathway, we have a set of spatially coarse, i.e. low-resolution, but semantically strong features. We then descend the top-down pathway. At each stage of the descent, the features are upsampled by a factor of 2 (nearest-neighbour upsampling in the paper). Each stage also has a lateral connection to the bottom-up pathway: the bottom-up feature map of the same scale passes through a \(1 \times 1\) convolution, which reduces its channel dimension, and is element-wise summed with the upsampled top-down features. Finally, the merged map passes through a \(3 \times 3\) convolution, which reduces the aliasing introduced by upsampling. The resulting feature maps are \(\{P_2, P_3, P_4, P_5\}\), one per \(C_i\).
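A sketch of the top-down pathway with lateral connections, assuming the \(C_i\) channel widths of the toy backbone above and the paper's choice of 256 channels per pyramid level (module and argument names are my own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDown(nn.Module):
    """Top-down pathway with lateral connections (sketch, not the reference code)."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convs reduce each C_i to a common channel width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        # 3x3 convs smooth each merged map (anti-aliasing after upsampling).
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels
        )

    def forward(self, c2, c3, c4, c5):
        laterals = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        # Start from the coarsest map and merge downwards.
        merged = [laterals[-1]]
        for lat in reversed(laterals[:-1]):
            top = F.interpolate(merged[0], size=lat.shape[-2:], mode="nearest")
            merged.insert(0, lat + top)  # element-wise sum with upsampled features
        p2, p3, p4, p5 = (s(m) for s, m in zip(self.smooth, merged))
        return p2, p3, p4, p5

# Chaining the two modules yields P2-P5 with strides 4-32 and 256 channels each:
# feats = BottomUp()(torch.randn(1, 3, 256, 256))
# p2, p3, p4, p5 = TopDown()(*feats)
```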
Note that FPN is only a feature extractor. To perform object detection, a region proposal network and classifier must be used.
1. Applications
1.1. RPN
RPN, as introduced in ren15_faster_r_cnn, uses only one feature map; each anchor point is associated with anchors of multiple scales and aspect ratios. In the FPN approach, however, each \(P_i\) already corresponds to a different scale, so anchors of a single scale (but multiple aspect ratios) are assigned to each level.
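A sketch of scale-specific anchor generation, assuming the area-per-level assignment from the paper (which also adds a coarser \(P_6\) level for the largest anchors); the helper below is illustrative, not the reference implementation:

```python
# One anchor area per pyramid level, three aspect ratios at every level.
ANCHOR_AREAS = {"P2": 32**2, "P3": 64**2, "P4": 128**2, "P5": 256**2, "P6": 512**2}
ASPECT_RATIOS = (0.5, 1.0, 2.0)  # height / width

def anchor_shapes(level: str):
    """Return the (height, width) of each anchor used at one pyramid level."""
    area = ANCHOR_AREAS[level]
    shapes = []
    for r in ASPECT_RATIOS:
        w = (area / r) ** 0.5
        h = w * r
        shapes.append((h, w))
    return shapes

print(anchor_shapes("P4"))  # three boxes, all with area 128^2
```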
1.2. Fast R-CNN
We have feature maps at every scale, so which one should we use for a given RoI? Small RoIs are assigned to a high-resolution \(P_i\) and large RoIs to a low-resolution \(P_i\). The paper uses \(k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor\) with \(k_0 = 4\), where \(w \times h\) is the RoI size on the input image, so a \(224 \times 224\) RoI maps to \(P_4\). The RoI's features are then fed through an RoI pooling layer and the classifier head.
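A sketch of the RoI-to-level assignment rule above; the function name and the clipping to \(P_2\)–\(P_5\) are my own choices:

```python
import math

def roi_level(w: float, h: float, k0: int = 4, canonical: float = 224.0) -> int:
    """Map an RoI of width w and height h (in input-image pixels) to a pyramid
    level P_k via k = floor(k0 + log2(sqrt(w*h) / 224)), clipped to P2..P5."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(2, min(5, k))

print(roi_level(224, 224))  # -> 4: a 224x224 RoI lands on P4
print(roi_level(112, 112))  # -> 3: half the canonical size maps one level finer
```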