Decoding DEtection TRansformer

Study: bipartite matching, transformer, and parallel decoding

Problem with Modern Object Detectors

Modern object detectors can be put into one of the two categories: single-stage and two-stage detectors. Both of these methods use some sort of initial guesses (i.e anchor boxes) to make predictions. Single-stage detectors use anchors or a grid of potential object centers, while two-stage detectors use proposals. As the paper points out, these methods’ performance significantly rely on these initial guesses. In other words, modern detectors make predictions based on these initial guesses (i.e anchors), whereas the DETR makes predictions directly on the input image.

The approach isn’t completely new. Previous papers have explored end-to-end Set Prediction with bipartite-matching loss and encoder-decoder architecture. But they relied on RNNs versus transformers with parallel decoding that this approach uses.

Comparing DETR with Faster RCNN

Before we dive into details of various parts of this approach, here is a quick comparison between Faster RCNN and DETR.

| | | |

Approach

We can decompose almost every deep learning approach into few parts: input/output format, how the given model will generate output from that input, loss function, and actual training.

Architecture

three parts - CNN feature extractor, transformer, fully connected network

Training

$\theta$ .

Gradient Check