DETR vs Faster R-CNN: End-to-End Detection Hits 42 AP

#detr #fasterrcnn #objectdetection #transformer

The Real Surprise: No NMS, No Anchors, Same Accuracy

Faster R-CNN has dominated object detection since 2015. Anchors, region proposals, and non-maximum suppression (NMS) became so standard that few people questioned them. Then Facebook AI released DETR in 2020 and reached 42 AP on COCO with none of that machinery.

The full paper is "End-to-End Object Detection with Transformers" (Carion et al., ECCV 2020).

The key insight isn't just "Transformers work for detection." It's that the entire detection pipeline—from feature extraction to final bounding boxes—can be reformulated as a direct set prediction problem. One forward pass, 100 learned queries, bipartite matching loss. Done.
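To make the bipartite matching step concrete, here is a minimal sketch in plain Python. It is not DETR's actual implementation: it assumes a simplified cost (negative class probability plus a weighted L1 box distance, with no IoU term), and it brute-forces the optimal assignment over permutations where a real system would use the Hungarian algorithm. The names `match_cost`, `bipartite_match`, and `box_weight` are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def l1_box_distance(a, b):
    """Sum of absolute differences between two (cx, cy, w, h) boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_cost(pred, target, box_weight=5.0):
    """Cost of assigning one prediction to one ground-truth object.

    Lower is better: high probability for the correct class and a nearby box.
    box_weight is an illustrative assumption, not a tuned value.
    """
    class_prob = pred["probs"][target["label"]]
    return -class_prob + box_weight * l1_box_distance(pred["box"], target["box"])


def bipartite_match(preds, targets):
    """Find the one-to-one assignment with minimum total cost.

    Returns (assignment, cost), where assignment[i] is the index of the
    prediction matched to targets[i]. Brute force over permutations is fine
    for a toy example but far too slow for 100 queries.
    """
    best_cost, best_assignment = float("inf"), None
    for pred_indices in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], t) for p, t in zip(pred_indices, targets))
        if cost < best_cost:
            best_cost, best_assignment = cost, pred_indices
    return list(best_assignment), best_cost


# Toy example: 3 queries, 2 ground-truth objects, 2 classes (0 = cat, 1 = dog).
preds = [
    {"probs": [0.1, 0.9], "box": (0.70, 0.70, 0.20, 0.20)},  # looks like the dog
    {"probs": [0.5, 0.5], "box": (0.10, 0.90, 0.05, 0.05)},  # matches nothing well
    {"probs": [0.8, 0.2], "box": (0.30, 0.30, 0.40, 0.40)},  # looks like the cat
]
targets = [
    {"label": 0, "box": (0.30, 0.30, 0.40, 0.40)},  # cat
    {"label": 1, "box": (0.72, 0.68, 0.20, 0.22)},  # dog
]

assignment, total_cost = bipartite_match(preds, targets)
matched = set(assignment)
unmatched = [i for i in range(len(preds)) if i not in matched]

print(assignment)  # [2, 0]: cat -> query 2, dog -> query 0
print(unmatched)   # [1]: this query would be trained to predict "no object"
```

Because each ground-truth object is matched to exactly one query, duplicate predictions are penalized during training, which is why no NMS pass is needed afterwards.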


Why Faster R-CNN's Pipeline Got So Complicated

Before diving into DETR's elegance, let's appreciate what it replaced. Faster R-CNN (Ren et al., NeurIPS 2015) needs:

Anchor generation: ~15K anchors per image across multiple scales and aspect ratios
Region Proposal Network (RPN): First-stage filtering to ~2000 proposals
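To get a feel for where an anchor count in that range comes from, here is a rough sketch of dense anchor generation. The feature-map size (40x40) and stride (16) are illustrative assumptions, and `generate_anchors` is a hypothetical helper rather than any library's API. The three scales and three aspect ratios follow the commonly cited Faster R-CNN configuration.

```python
import math


def generate_anchors(feat_h, feat_w, stride, scales, aspect_ratios):
    """Return (x1, y1, x2, y2) anchors centred on every feature-map cell."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Centre of this cell in input-image pixel coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in aspect_ratios:
                    # ratio = height / width; the area stays at scale ** 2.
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# Assumed setup: a 640x640 input with a stride-16 backbone gives a 40x40 map.
anchors = generate_anchors(
    feat_h=40,
    feat_w=40,
    stride=16,
    scales=(128, 256, 512),
    aspect_ratios=(0.5, 1.0, 2.0),
)
print(len(anchors))  # 14400 = 40 * 40 cells * 9 anchors per cell
```

Every one of these boxes has to be scored and regressed by the RPN before the proposals are cut down, which is the overhead DETR's fixed set of learned queries avoids.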
