Aleksey Bochkovskiy
7 min readDec 7, 2020

Scaled YOLO v4 is the best neural network for object detection on MS COCO dataset

Paper: https://arxiv.org/abs/2011.08036

Scaled YOLO v4 is the best neural network for object detection — the most accurate (55.8% AP Microsoft COCO test-dev) among neural network published. In addition, it is the best in terms of the ratio of speed to accuracy in the entire range of accuracy and speed from 15 FPS to 1774 FPS. Now it is the Top1 neural network for object detection.

Scaled YOLO v4 outperforms neural networks in accuracy:

  • Google EfficientDet D7x / DetectoRS or SpineNet-190(self-trained on extra-data)
  • Amazon Cascade-RCNN ResNest200
  • Microsoft RepPoints v2
  • Facebook RetinaNet SpineNet-190
Scaled YOLOv4

We show that YOLO and Cross-Stage-Partial (CSP) Network approaches are the best in terms of both absolute accuracy and accuracy-to-speed ratio.

Chart of Accuracy (vertical axis) and Latency (horizontal axis) on a Tesla V100 GPU (Volta) with batch = 1 without using TensorRT.

Even at lower network resolution, Scaled-YOLOv4-P6 (1280x1280) 30 FPS — 54.3% AP is slightly more accurate and 3.7x faster than EfficientDetD7 (1536x1536) 8.2 FPS — 53.7% AP.

Scaled YOLO v4 lies on the Pareto optimality curve — no matter what other neural network you take, there is always such a YOLOv4 network, which is either more accurate at the same speed, or faster with the same accuracy, i.e. YOLOv4 is the best in terms of speed and accuracy.

Accuracy rating of published neural networks: https://paperswithcode.com/sota/object-detection-on-coco

Scaled YOLOv4 is more accurate and faster than neural networks:

  • Google EfficientDet D0-D7x
  • Google SpineNet S49s — S143
  • Baidu Paddle-Paddle PP YOLO
  • And many others…

Scaled YOLO v4 is a series of neural networks built on top of the improved and scaled YOLOv4 network. Our neural network was trained from scratch without using pre-trained weights (Imagenet or any other).

The YOLOv4-tiny neural network speed reaches 1774 FPS on a gaming graphics card GPU RTX 2080Ti when using TensorRT + tkDNN (batch = 4, FP16): https://github.com/ceccocats/tkDNN

YOLOv4-tiny can run in real time with 39 FPS / 25ms latency on JetsonNano (416x416, fp16, batch = 1) tkDNN / TensorRT

Scaled YOLOv4 utilizes massively parallel devices such as GPUs much more efficiently than EfficientDet. For example, GPU V100 (Volta) has performance: 14 TFLops — 112 TFLops-Tensor-Cores https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet-letter-fnl-web.pdf

If we test both models on GPU V100 with

  • batch = 1
  • -hparams = mixed_precision = true
  • And without -tensorrt = FP32

Then:

  • YOLOv4-CSP (640x640) — 47.5% AP — 70 FPS — 120 BFlops (60 FMA) Based on BFlops, it should be 933 FPS = (112,000 / 120), but in fact we get 70 FPS, i.e. 7.5% GPU used = (70/933)
  • EfficientDetD3 (896x896) — 47.5% AP — 36 FPS — 50 BFlops (25 FMA) Based on BFlops, it should be 2240 FPS = (112,000 / 50), but in fact we get 36 FPS, i.e. 1.6% GPU used = (36/2240)

Those. efficiency of computing operations on devices with massive parallel computing such as GPUs used in YOLOv4-CSP (7.5 / 1.6) = 4.7x better than the efficiency of operations used in EfficientDetD3.

Usually, neural networks are run on the CPU only in research tasks for easier debugging, and the BFlops characteristic is currently only of academic interest. In real-world tasks, real speed and accuracy are important. The real speed of YOLOv4-P6 is 3.7x faster than EfficientDetD7 on GPU V100. Therefore, devices with massive parallelism GPU / NPU / TPU / DSP with much more optimal speed, price and heat dissipation are almost always used:

  • Embedded GPU (Jetson Nano/Nx)
  • Mobile-GPU/NPU/DSP (Bionic-NPU/Snapdragon-DSP/Mediatek-APU/Kirin-NPU/Exynos-GPU/…)
  • TPU-Edge (Google Coral/Intel Myriad/Mobileye EyeQ5/Tesla-motors TPU 144 TOPS-8bit)
  • Cloud GPU (nVidia A100/V100/TitanV)
  • Cloud NPU (Google-TPU, Huawei Ascend, Intel Habana, Qualcomm AI 100, …)

Also when using neural networks On Web — usually a GPU is used through the WebGL, WebAssembly or WebGPU libraries, for this case — the size of the model can matter: https://github.com/tensorflow/tfjs#about-this-repo

The use of devices and algorithms with weak parallelism is a dead-end development path, because it is impossible to reduce the lithograph size smaller than the size of a silicon atom to increase the processor frequency:

  • The current best size for Semiconductor device fabrication (Lithography) is 5 nanometers.
  • The size of the crystal lattice of silicon is 0.5 nanometers.
  • The atomic radius of silicon is 0.1 nanometer.

The solution — processors with massive parallelism and more than 10 000 ALUs: single crystal or several crystals on one interposer. Hence, it is imperative to create neural networks that make efficient use of massively parallel computing machines such as GPU, NPU, TPU, DSP.

Improvements in Scaled YOLOv4 over YOLOv4:

  • Scaled YOLOv4 used optimal network scaling techniques to get YOLOv4-CSP -> P5 -> P6 -> P7 networks
  • Improved network architecture: Backbone is optimized and Neck (PAN) uses Cross-stage-partial (CSP) connections and Mish activation
  • Exponential Moving Average (EMA) is used during training — this is a special case of SWA: https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/
  • For each resolution of the network, a separate neural network is trained (in YOLOv4, only one neural network was trained for all resolutions)
  • Improved objectness normalizers in [yolo] layers
  • Changed activations for Width and Height, which allows faster network training
  • The [net] letter_box = 1 parameter (to keep the aspect ratio of the input image) is used for high resolution networks (for all networks except yolov4-tiny.cfg)

There are different Losses in YOLOv3, YOLOv4 and Scaled-YOLOv4:

Loss for YOLOv3, YOLOv4 and Scaled-YOLOv4
  • for bx and by — this eliminates grid sensitivity in the same way as in YOLOv4, but more aggressively
  • for bw and bh — this limits the size of the bounded-box to 4*Anchor_size

In general, Scaled-YOLOv4 has the same AP50, but a higher AP than the original YOLOv4 with the same resolution and approximately the same speed. Then Scaled-YOLOv4 scales up to achieve a higher AP50 and AP at a lower speed.

The Pytorch YOLOv4 implementation predicts better coordinates (higher AP) but detects fewer objects (lower AP50):

  • YOLOv4(Darknet) — 608x608— 62 FPS — 43.5% AP — 65.7% AP50
  • YOLOv4(Pytorch) — 608x608 — 62 FPS — 45.5% AP — 64.1% AP50

Changes to the network architecture (CSP in the Neck and Mish-activation for all layers) then eliminate flaws of Pytorch implementation, so CSP+Mish improves both AP, AP50 and FPS:

  • YOLOv4-CSP — 608x608–75FPS — 47.5% AP — 66.1% AP50
  • YOLOv4-CSP — 640x640–70FPS — 47.5% AP — 66.2% AP50
Scaled YOLOv4 comparison table

Scaled-YOLOv4 neural network architecture (examples of three networks: P5, P6, P7):

Scaled-YOLOv4 architecture

CSP connection is extremely efficient, simple, and can be applied to any neural network. The idea is:

  • half of the output signal goes along the main path (generates more semantic information with a large receiving field)
  • and the other half of the signal goes bypass (preserves more spatial information with a small perceiving field)
The simplest example of a CSP connection (on the left is a regular network, on the right is a CSP network)
An example of a CSP connection in YOLOv4-CSP / P5 / P6 / P7 (on the left is a regular network, on the right is a CSP network)
YOLOv4-tiny uses 2 CSP connections

YOLOv4 is used in various fields and tasks:

There are YOLOv4 implementations on various frameworks:

pip install yolov4 https://pypi.org/project/yolov4/

- https://github.com/Tianxiaomo/pytorch-YOLOv4

- https://github.com/VCasecnikovs/Yet-Another-YOLOv4-Pytorch

Also, the YOLOv4 approach can be used in other tasks, for example, when detecting 3D objects: