Carion et al 2020 - End-to-End Object Detection with Transformers
Notes for carion20_end_to_end_objec_detec_with_trans
Uses a transformer to perform direct set prediction. That is, it takes an image as input and outputs the set of bounding boxes (with class labels) for the objects detected in that image, in a single forward pass.
1. Loss
We have a set of ground-truth objects and a fixed-size set of \(N\) predictions, with \(N\) chosen larger than the typical number of objects. How well did our predictions do? The loss is based on the assignment problem: first match each prediction box to a ground-truth box as best we can (DETR finds this one-to-one matching with the Hungarian algorithm, padding the ground truth with a "no object" class up to \(N\)). Then, once each prediction has been assigned to a ground-truth box, we compute a loss per matched pair.
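A minimal sketch of the per-pair loss (function name and example values are made up): DETR combines a classification term with an L1 box term and a generalized-IoU term; the GIoU term is omitted here for brevity.

```python
import math

def pair_loss(pred_box, gt_box, true_class_prob):
    """Loss for one matched (prediction, ground truth) pair:
    negative log-likelihood of the true class plus a weighted L1
    distance between the boxes (each box is (cx, cy, w, h)).
    DETR also adds a generalized-IoU term, omitted here; the
    weight 5.0 matches the paper's L1 coefficient."""
    nll = -math.log(true_class_prob)
    l1 = sum(abs(p - g) for p, g in zip(pred_box, gt_box))
    return nll + 5.0 * l1

# A prediction close to the ground-truth box, fairly confident
# in the right class, gets a small loss:
loss = pair_loss((0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.25, 0.2), 0.8)
```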
Why do we need to do matching? Because when the model gets a box wrong, e.g. it puts a box slightly above the duck and labels it "cat", which mistake is it making? Is it mislabeling the duck, or placing the cat's box too far away? Before we can say, we need to decide which ground-truth object the model was aiming its box at. Our matching might decide, for example, that this prediction was aiming for the cat.
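The matching step can be sketched as minimizing a pairwise cost over all one-to-one assignments. A toy version (brute force over permutations, with a made-up cost matrix; DETR instead uses the polynomial-time Hungarian algorithm, e.g. `scipy.optimize.linear_sum_assignment`):

```python
import itertools

def match(cost):
    """cost[i][j] is the cost of assigning prediction i to ground
    truth j (lower = better fit in both class and box position).
    Returns (perm, total) minimizing total cost, where perm[j] is
    the prediction matched to ground truth j; predictions left
    unmatched count as predicting "no object"."""
    n_pred, n_gt = len(cost), len(cost[0])
    best, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n_pred), n_gt):
        total = sum(cost[perm[j]][j] for j in range(n_gt))
        if total < best_cost:
            best, best_cost = perm, total
    return best, best_cost

# 3 predictions, 2 ground-truth objects (say, a duck and a cat):
cost = [
    [0.9, 0.1],  # prediction 0: poor fit for the duck, good for the cat
    [0.4, 0.8],
    [0.2, 0.7],  # prediction 2: good fit for the duck
]
assignment, total = match(cost)  # assignment == (2, 0): the duck gets
                                 # prediction 2, the cat gets prediction 0
```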
2. Object detection
The final detections are produced by a small feed-forward head (a 3-layer MLP) on each decoder output: it predicts the box center and its height and width, normalized to the image size, alongside the class prediction. (An FPN-like component appears in the panoptic segmentation variant, not in the box head.)
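A small helper (hypothetical name) illustrating that box parameterization, converting a predicted normalized (center, size) box into absolute pixel corners:

```python
def box_cxcywh_to_xyxy(cx, cy, w, h, img_w, img_h):
    """Convert a predicted box from (center-x, center-y, width,
    height), all normalized to [0, 1] relative to the image, into
    absolute (x0, y0, x1, y1) pixel corner coordinates."""
    x0 = (cx - w / 2) * img_w
    y0 = (cy - h / 2) * img_h
    x1 = (cx + w / 2) * img_w
    y1 = (cy + h / 2) * img_h
    return x0, y0, x1, y1

# A box centered in a 640x480 image, 20% as wide and 40% as tall
# as the image:
box = box_cxcywh_to_xyxy(0.5, 0.5, 0.2, 0.4, 640, 480)
```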