Sparse R-CNN: Towards More Efficient Object Detection Models

Written by Richard Xu. A discussion of the paper titled “Sparse R-CNN: End-to-End Object Detection with Learnable Proposals”.

Introduction

Long an important topic in computer vision, object detection involves localizing and classifying the objects present in a given image. Detection performance is measured by Average Precision (AP); a higher AP means better detection results.

In the task of object detection, each object needs to be localized using a bounding box and classified with a corresponding class label. (Source: Blog by Matthijs Hollemans)

Deep learning techniques have pushed the boundaries of object detection performance. With efficient models such as YOLO and Faster R-CNN, we can now rely on lightweight deep learning models to detect objects in real time, even with the limited compute power of mobile phones.

However, many of the widely used object detection models share some inherent issues. I will delve into them by taking apart a few well-known object detection models.

A Brief Look at Prior Works

Faster R-CNN is a two-stage object detection method. In this approach, object proposals (potential locations of objects) are extracted in the first stage, and the second stage refines these proposals into the final predictions.

An overview of the Faster R-CNN model. (Source: Blog by Hao Gao)

During the proposal extraction process, Faster R-CNN and many other two-stage systems rely on anchor boxes. Anchor boxes are pre-defined box templates: the possible aspect ratios and sizes of proposals, which we must specify ourselves. To generate proposals, the model essentially fits the different anchor boxes to each location in an image and adjusts their scales and positions.

An example of how anchor boxes operate. (Source: Wovenware)
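To make this more concrete, here is a minimal sketch of how a set of anchor boxes could be generated for every location of a feature map (the sizes, ratios, and stride are made-up values for illustration, not any particular model's configuration):

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride,
                     sizes=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes (x1, y1, x2, y2) for every feature-map location."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of this grid cell, in input-image coordinates.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for size in sizes:
                for ratio in ratios:
                    w = size * np.sqrt(ratio)  # wider boxes when ratio > 1
                    h = size / np.sqrt(ratio)  # taller boxes when ratio < 1
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = generate_anchors(feat_h=50, feat_w=50, stride=16)
print(anchors.shape)  # (22500, 4) -- 9 anchors at each of 2500 locations
```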

To be exact, the proposal extraction part (the Region Proposal Network) of Faster R-CNN looks like this:

Proposal extraction pipeline in Faster R-CNN.

Here, H and W are the input image dimensions, h and w are the dimensions of each extracted feature map, and n is the number of anchor boxes. So essentially, for each location (or grid cell; there are hw locations in total) on the feature maps, the model predicts one bounding box per anchor box, described by six parameters (two objectness confidences, plus the height, width, and center coordinates of the box).
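In code, this head boils down to a few convolutions over the feature map. Below is a minimal PyTorch sketch (channel sizes are illustrative rather than the exact Faster R-CNN configuration):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """For each of n anchors at every feature-map location, predict
    2 objectness confidences and 4 box parameters (6 values per anchor)."""
    def __init__(self, in_channels=512, n_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, n_anchors * 2, kernel_size=1)  # object vs. background
        self.reg = nn.Conv2d(512, n_anchors * 4, kernel_size=1)  # box parameters

    def forward(self, feats):            # feats: (B, C, h, w)
        x = torch.relu(self.conv(feats))
        return self.cls(x), self.reg(x)  # (B, 2n, h, w), (B, 4n, h, w)

scores, boxes = RPNHead()(torch.randn(1, 512, 50, 50))
print(scores.shape, boxes.shape)  # (1, 18, 50, 50) and (1, 36, 50, 50)
```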

By now, you have probably noticed that anchor boxes are pretty important. They are! They provide a good reference guide for object detectors to look for objects. For example, if you want the detector to work on outdoor urban scenes, you should design tall, skinny anchor boxes for pedestrians and wide boxes for vehicles. A poorly designed set of anchor boxes would make it difficult for the proposal extractor to pick up potential objects.

You might also be thinking: let's just use a more diverse set of anchor boxes! More anchor boxes would theoretically make the detector more robust to changes in image domains. They would also enable the detector to find more objects within a grid cell, which is better for crowded scenes. However, this creates another issue: the number of proposals quickly adds up. Without post-processing to reduce the number of proposals, you would end up with a model with near-perfect recall but inferior precision.

Because of this, before the proposals are further refined, the Non-Maximum Suppression (NMS) algorithm is applied to filter out tens of thousands of weaker or redundant proposals.

Without NMS, predictions from an object detector could look like something on the left side. (Source: Blog by Hao Gao)

NMS considers a list of proposals from the model, keeps the proposal with the highest confidence score, and deletes the ones that overlap with this proposal by more than a pre-defined overlap threshold or have prediction confidence below a pre-defined confidence threshold. This process repeats until every box is either kept or deleted. As such, it can be an inefficient post-processing step! In addition, the precision of the model is strongly tied to these additional pre-determined values: the NMS thresholds! For example, in crowded scenes, a strict (low) overlap threshold would cause a lot of overlapping objects to be missed.
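To see why NMS can be slow, here is a minimal NumPy sketch of the greedy procedure described above (production implementations such as torchvision.ops.nms are heavily optimized, but the worst-case cost still grows quadratically with the number of boxes):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS. boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,)."""
    order = scores.argsort()[::-1]  # most confident boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)  # keep the most confident remaining box
        # Intersection of box i with every other remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap too much
    return keep
```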

After NMS is done, an RoI pooling operation stretches all proposals to the same size, allowing the subsequent proposal refinement network to “focus” on each proposal. Finally, a regressor predicts the final offsets, scales, and aspect ratios of the bounding boxes, while a classifier determines the object class.
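The pooling step itself is available off the shelf; for example, with torchvision (the feature-map size, boxes, and scale below are only to illustrate the shapes):

```python
import torch
from torchvision.ops import roi_pool

feats = torch.randn(1, 256, 50, 50)  # a backbone feature map
# Two proposals in image coordinates: (batch_index, x1, y1, x2, y2).
proposals = torch.tensor([[0., 10., 10., 200., 150.],
                          [0., 30., 40., 120., 300.]])
# Every proposal, regardless of its size, becomes a 7x7 feature patch.
pooled = roi_pool(feats, proposals, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```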

Now, I also want to stress that the issues I brought up earlier aren’t exclusive to two-stage systems like Faster R-CNN; they are also apparent in more efficient one-stage systems, such as YOLO.

Overview of the object detection pipeline in YOLO.

YOLO, short for “You Only Look Once,” is a one-stage system, which means it extracts proposals, assigns classification tags, and refines boxes all at once after feature extraction. Even though it is more efficient and requires less computation, its dependence on anchor boxes and NMS remains. For every grid cell (in the above example, there are 49 cells in total), predictions are created by fitting and adjusting each of the n anchor boxes. After that, NMS is applied to regain precision.
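As a rough sketch of this one-stage output layout (the exact tensor format differs between YOLO versions; the numbers here are purely illustrative):

```python
import torch

S, n, num_classes = 7, 3, 20  # 7x7 grid, n boxes per cell (illustrative values)
# A one-stage head predicts everything in a single tensor: each grid cell
# holds n boxes, each with (confidence, x, y, w, h) plus class scores.
output = torch.randn(S, S, n * (5 + num_classes))
preds = output.view(S, S, n, 5 + num_classes)
print(preds.shape)  # torch.Size([7, 7, 3, 25]) -- 49 cells x 3 boxes = 147 raw boxes
```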

The Issues with Those Works

To summarize, the issues with many of the previous works are:

  1. The reliance on prior information. Extensive hyperparameters, including the sizes, quantity, and aspect ratios of anchor boxes and the number of grid cells, as well as the proposal generation algorithm itself, must be carefully designed and fine-tuned for the model and image domain. The performance of these models is highly sensitive to these settings.
  2. The additional post-processing required to refine prediction results. NMS takes time and causes problems with highly overlapping objects.

Time for Sparse R-CNN to Shine

Motivated by the above issues, the authors of Sparse R-CNN came up with an elegant yet remarkably simple idea. Instead of processing k anchor boxes at each of H by W grid cells, Sparse R-CNN uses a small set of N learnable proposal boxes that adapts to the statistics of the image domain, where N << kHW. In such a design, the amount of user-supplied prior information and post-processing needed is greatly reduced. Let’s delve into it.

A comparison of Sparse R-CNN and previous works. (Source: Original Paper)

The journey starts with a backbone feature extractor, represented by the first block on the left of the pipeline shown below. This block is based on ResNet and extracts features from the image that are useful for object detection.

Overall pipeline for Sparse R-CNN. (Source: Original Paper)

In the next step, a small set of N proposal boxes (suitable values of N range from 100 to 500 in the paper) is drawn over the feature maps created by the backbone. These proposal boxes are randomly initialized. Through training, they learn to cover locations where objects are most likely to appear in the image domain, regardless of the input image. Each proposal box has its own learnable proposal feature vector of size d (d=256 in the paper), which encodes more information about the characteristics of the proposal box. Together, these learnable parameters describe a set of anchor boxes that we do not define ourselves but that are determined by the model.
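In code, these learnable proposals can be as simple as two embedding tables, similar in spirit to what the official implementation does (treat the details below as a hedged sketch rather than the exact configuration):

```python
import torch.nn as nn

N, d = 100, 256
# N proposal boxes in normalized (cx, cy, w, h) form -- plain learnable
# parameters, randomly initialized and then updated by backpropagation
# just like any other network weight.
proposal_boxes = nn.Embedding(N, 4)
# One d-dimensional feature vector per proposal, describing *what* the box
# should look for, not just *where*.
proposal_features = nn.Embedding(N, d)

print(proposal_boxes.weight.shape, proposal_features.weight.shape)
# torch.Size([100, 4]) torch.Size([100, 256])
```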

The next block (represented as a cube in the pipeline) applies an RoIAlign operation to each proposal box individually, ensuring that features from proposals of different sizes are converted to a set of fixed-size (N, S*S, C) box features.
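Assuming hypothetical boxes in feature-map coordinates, this step might look like the following with torchvision’s roi_align (S=7 here; the final reshape produces the (N, S*S, C) layout used by the head that follows):

```python
import torch
from torchvision.ops import roi_align

N, C, S = 100, 256, 7
feat_map = torch.randn(1, C, 50, 50)  # backbone feature map
# Hypothetical boxes: (batch_index, x1, y1, x2, y2) in feature-map coordinates.
xy1 = torch.rand(N, 2) * 25
wh = torch.rand(N, 2) * 25
rois = torch.cat([torch.zeros(N, 1), xy1, xy1 + wh], dim=1)

box_feats = roi_align(feat_map, rois, output_size=(S, S))  # (N, C, S, S)
box_feats = box_feats.flatten(2).permute(0, 2, 1)          # (N, S*S, C)
print(box_feats.shape)  # torch.Size([100, 49, 256])
```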

Then, for each proposal box, a dynamic instance interactive head is applied to let its box features “interact” with the other instances. First, a self-attention module over the (N, d) proposal features mentioned earlier lets each proposal reason about its relations with the other objects. Each head then generates dynamic parameters from its proposal feature, loads them into a set of 1x1 convolution kernels, and applies these kernels to the box features, producing a rich set of object features.

Algorithm for the dynamic head. (Source: Original Paper)
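Below is a stripped-down sketch of the dynamic interaction step: each proposal feature generates the weights of two 1x1 convolutions (implemented as batched matrix multiplications) applied to that proposal’s own box features. It follows the paper’s d=256 but omits the self-attention, normalization layers, and the final projection back to d dimensions:

```python
import torch
import torch.nn as nn

class DynamicConv(nn.Module):
    """Each proposal feature generates the parameters of two 1x1 convolutions
    applied to that proposal's own (S*S, d) box features."""
    def __init__(self, d=256, d_hidden=64):
        super().__init__()
        self.d, self.d_hidden = d, d_hidden
        self.param_gen = nn.Linear(d, 2 * d * d_hidden)  # dynamic parameters

    def forward(self, proposal_feats, box_feats):
        # proposal_feats: (N, d); box_feats: (N, S*S, d)
        params = self.param_gen(proposal_feats)
        w1 = params[:, :self.d * self.d_hidden].view(-1, self.d, self.d_hidden)
        w2 = params[:, self.d * self.d_hidden:].view(-1, self.d_hidden, self.d)
        x = torch.relu(torch.bmm(box_feats, w1))  # (N, S*S, d_hidden)
        x = torch.relu(torch.bmm(x, w2))          # (N, S*S, d)
        return x.flatten(1)  # one object feature per proposal

obj_feats = DynamicConv()(torch.randn(100, 256), torch.randn(100, 49, 256))
print(obj_feats.shape)  # torch.Size([100, 12544])
```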

The resulting object features are passed through two separate heads: a fully-connected layer for classification and a three-layer MLP that predicts the box regression targets. The result is a set of N predictions.
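Schematically, the two heads are plain feed-forward layers; here is a minimal sketch (assuming the object features have been projected back to d dimensions, and using 80 classes as in COCO):

```python
import torch.nn as nn

d, num_classes = 256, 80
cls_head = nn.Linear(d, num_classes)  # one score per class
reg_head = nn.Sequential(             # three-layer MLP for the box targets
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, d), nn.ReLU(),
    nn.Linear(d, 4),
)
```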

Iterative Improvement

Like a hard-working student, Sparse R-CNN is not yet satisfied with the detection results from this pipeline. It uses an iterative approach to improve them further. Let’s call the steps from box proposals to box predictions a single stage of the pipeline. The predicted boxes generated from this stage can become the box proposals for the next stage! With more accurate proposals this time, the resulting predictions can be even better as well! After multiple stages (three in this paper), the initial “rough guesses” of possible object locations are iteratively shifted to their “final destinations.” At the end of the pipeline, my guess is that there is still a confidence threshold (0.3 in the paper) to eliminate weaker or duplicate predictions, but the process is far more straightforward than NMS. Now, with the final predictions out, let’s take a look at how impressive Sparse R-CNN is!
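Schematically, the cascade is just a loop in which each stage’s refined boxes become the next stage’s proposals. Here is a deliberately toy, hypothetical sketch of that control flow (real stages re-crop RoI features and run the full dynamic head at every iteration):

```python
import torch
import torch.nn as nn

N, d, num_stages = 100, 256, 3
boxes = torch.rand(N, 4)   # stage-0 proposals (learned parameters in practice)
feats = torch.randn(N, d)  # stage-0 proposal features
stages = nn.ModuleList(nn.Linear(d, 4) for _ in range(num_stages))  # stand-in heads

for stage in stages:
    deltas = stage(feats)   # predict a refinement from the current features
    boxes = boxes + deltas  # refined boxes feed the next stage as its proposals
print(boxes.shape)          # torch.Size([100, 4])
```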

Results and Key Takeaways

Qualitative results of the iterative approach. (Source: Original Paper)

Spending some time on this figure, I noticed a few strengths of Sparse R-CNN:

  1. Learned proposal boxes cover the entire area of the image. This ensures good recall.
  2. The iterative approach works its magic by gradually moving good proposals to more accurate locations.
  3. The duplicate boxes are gone! My guess is that the interaction between instances in the dynamic head helps reduce the confidence score of those duplicate boxes.

And, from the paper’s benchmark results, we can see that Sparse R-CNN achieves superior results (higher AP) compared to other well-established methods, with fewer training epochs required and decent inference speed. In the appendix, the authors show that swapping the feature extractor for a transformer-based backbone provides an even higher AP.

In all, Sparse R-CNN introduces a promising approach to object detection. Unlike previous methods such as Faster R-CNN and YOLO, Sparse R-CNN has fewer hyperparameter settings and generates a sparse set of learned proposals. Because of this, the method is easier to fine-tune, less sensitive to changes in hyperparameters, more adaptable across image domains, and fundamentally better at handling crowded scenes.

Check out the official code and paper of Sparse R-CNN here, and I hope you enjoyed reading about this method as much as I did!

References

  1. Gao, H. (2017, October 5). Faster R-CNN explained. Medium. Retrieved November 27, 2021, from https://medium.com/@smallfishbigsea/faster-r-cnn-explained-864d4fb7e3f8.
  2. Hollemans, M. (2017, May 20). Real-time object detection with Yolo. Retrieved November 27, 2021, from https://machinethink.net/blog/object-detection-with-yolo/.
  3. K, S. (2021, April 30). Non-maximum suppression (NMS). Medium. Retrieved November 27, 2021, from https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c.
  4. Koech, K. E. (2021, November 18). Object Detection Metrics With Worked Example. Medium. Retrieved November 27, 2021, from https://towardsdatascience.com/on-object-detection-metrics-with-worked-example-216f173ed31e.
  5. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2016.91
  6. Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. https://doi.org/10.1109/tpami.2016.2577031
  7. Sun, P., Zhang, R., Jiang, Y., Kong, T., Xu, C., Zhan, W., Tomizuka, M., Li, L., Yuan, Z., Wang, C., & Luo, P. (2021). Sparse R-CNN: End-to-end object detection with learnable proposals. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr46437.2021.01422

University of Toronto Machine Intelligence Team

UTMIST’s Technical Writing Team publishes articles on topics within machine learning to our official publication: https://medium.com/demistify