Scaled YOLO v4 is the best neural network for object detection on the MS COCO dataset


Scaled YOLO v4 is the best neural network for object detection: it is the most accurate (55.8% AP on Microsoft COCO test-dev) among published neural networks. It also has the best speed-to-accuracy ratio across the entire range from 15 FPS to 1774 FPS. It is currently the Top1 neural network for object detection.

Scaled YOLO v4 outperforms neural networks in accuracy:

Scaled YOLOv4

We show that YOLO and Cross-Stage-Partial (CSP) Network approaches are the best in terms of both absolute accuracy and accuracy-to-speed ratio.

Chart of Accuracy (vertical axis) and Latency (horizontal axis) on a Tesla V100 GPU (Volta) with batch = 1 without using TensorRT.

Even at a lower network resolution, Scaled-YOLOv4-P6 (1280x1280, 30 FPS, 54.3% AP) is slightly more accurate and 3.7x faster than EfficientDetD7 (1536x1536, 8.2 FPS, 53.7% AP).

Scaled YOLO v4 lies on the Pareto optimality curve: whatever other neural network you take, there is always a YOLOv4 network that is either more accurate at the same speed or faster at the same accuracy. In other words, YOLOv4 is the best in terms of speed and accuracy.
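The Pareto claim can be checked mechanically. The following is a minimal sketch, not the author's evaluation code: it uses only the two V100 data points quoted above, and the names `Model`, `dominates` and `pareto_front` are introduced here for illustration. A model is dropped from the front when another model is at least as fast and at least as accurate, and strictly better in one of the two.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    fps: float  # frames per second, higher is better
    ap: float   # COCO AP in percent, higher is better


def dominates(a: Model, b: Model) -> bool:
    """True if a is no worse than b in both speed and accuracy, and better in one."""
    return a.fps >= b.fps and a.ap >= b.ap and (a.fps > b.fps or a.ap > b.ap)


def pareto_front(models: List[Model]) -> List[Model]:
    """Keep only the models that no other model dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


# The two data points quoted in the text (Tesla V100, batch = 1, no TensorRT).
models = [
    Model("Scaled-YOLOv4-P6 (1280x1280)", fps=30.0, ap=54.3),
    Model("EfficientDetD7 (1536x1536)", fps=8.2, ap=53.7),
]

front = pareto_front(models)
speedup = models[0].fps / models[1].fps

print([m.name for m in front])      # ['Scaled-YOLOv4-P6 (1280x1280)']
print(f"speedup: {speedup:.1f}x")   # speedup: 3.7x
```

With more measured points added to `models`, the same function returns the full speed/accuracy front.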

Accuracy rating of published neural networks:

Scaled YOLOv4 is more accurate and faster than neural networks:

Scaled YOLO v4 is a series of neural networks built on top of the improved and scaled YOLOv4 network. Our neural networks were trained from scratch, without using pre-trained weights (ImageNet or any other).

The YOLOv4-tiny neural network reaches 1774 FPS on a gaming GPU (RTX 2080Ti) when using TensorRT + tkDNN (batch = 4, FP16):

YOLOv4-tiny can run in real time at 39 FPS / 25 ms latency on a Jetson Nano (416x416, FP16, batch = 1) with tkDNN / TensorRT.
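The FPS and latency figures above are two views of the same measurement. As a rough sketch, assuming batches are processed one after another with no overlap (an assumption, not something the text states), the time per batch follows directly from the throughput. The helper name `latency_ms` is introduced here for illustration.

```python
def latency_ms(fps: float, batch: int = 1) -> float:
    """Time to process one batch, in milliseconds, given throughput in frames/s.

    Assumes batches run back to back with no pipelining or overlap.
    """
    return 1000.0 * batch / fps


jetson_nano = latency_ms(39, batch=1)    # YOLOv4-tiny, 416x416, FP16
rtx_2080ti = latency_ms(1774, batch=4)   # YOLOv4-tiny, TensorRT + tkDNN, FP16

print(f"Jetson Nano: {jetson_nano:.1f} ms per frame")
print(f"RTX 2080Ti:  {rtx_2080ti:.2f} ms per batch of 4")
```

For the Jetson Nano this gives a little over 25 ms per frame, in line with the figure quoted above. For the batch = 4 case the high FPS reflects throughput, and each individual frame still waits for its whole batch.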

Scaled YOLOv4 utilizes massively parallel devices such as GPUs much more efficiently than EfficientDet. For example, the V100 GPU (Volta) has a peak performance of 14 TFLops, or 112 TFLops on Tensor Cores.

If we test both models on GPU V100 with


That is, the computing operations used in YOLOv4-CSP make (7.5 / 1.6) = 4.7x more efficient use of massively parallel devices such as GPUs than the operations used in EfficientDetD3.
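One plausible reading of the 7.5 and 1.6 figures is the percentage of peak GPU throughput each model actually reaches: the measured FPS divided by the FPS the GPU would deliver if every operation ran at peak rate. The sketch below works through that reading. The 112 TFLops peak comes from the text; the per-model BFlops and FPS inputs are assumptions chosen for illustration and are not given in the text above, and the function names are hypothetical.

```python
PEAK_TFLOPS_TENSOR = 112.0  # V100 Tensor Core peak, as quoted in the text


def theoretical_fps(model_bflops: float,
                    peak_tflops: float = PEAK_TFLOPS_TENSOR) -> float:
    """FPS the GPU would reach if every operation ran at peak throughput."""
    return peak_tflops * 1000.0 / model_bflops  # 1 TFLops = 1000 BFlops


def utilization(measured_fps: float, model_bflops: float) -> float:
    """Fraction of peak throughput the model actually achieves."""
    return measured_fps / theoretical_fps(model_bflops)


# ASSUMED inputs, for illustration only (not stated in the text above).
yolov4_csp = utilization(measured_fps=70, model_bflops=120)
efficientdet_d3 = utilization(measured_fps=36, model_bflops=50)

print(f"YOLOv4-CSP:      {yolov4_csp * 100:.1f}% of peak")
print(f"EfficientDetD3:  {efficientdet_d3 * 100:.1f}% of peak")
print(f"ratio:           {yolov4_csp / efficientdet_d3:.1f}x")
```

Under these assumed inputs a model with fewer BFlops can still be slower in practice, because its operations keep the GPU far less busy.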

Neural networks are usually run on the CPU only in research tasks, for easier debugging, so the BFlops figure is currently of academic interest only. In real-world tasks, what matters is real speed and accuracy. On a V100 GPU, YOLOv4-P6 is 3.7x faster than EfficientDetD7. This is why massively parallel devices (GPU / NPU / TPU / DSP), which offer much better speed, price and heat dissipation, are almost always used:

Neural networks on the web also usually run on a GPU, through the WebGL, WebAssembly or WebGPU libraries. In this case the size of the model can matter:

Using devices and algorithms with weak parallelism is a dead-end development path, because the lithography feature size cannot be reduced below the size of a silicon atom in order to increase the processor frequency:

The solution is processors with massive parallelism and more than 10 000 ALUs, on a single die or on several dies on one interposer. Hence, it is imperative to create neural networks that make efficient use of massively parallel computing machines such as GPU, NPU, TPU and DSP.

Improvements in Scaled YOLOv4 over YOLOv4:

YOLOv3, YOLOv4 and Scaled-YOLOv4 use different losses:

Loss for YOLOv3, YOLOv4 and Scaled-YOLOv4

In general, Scaled-YOLOv4 has the same AP50, but a higher AP than the original YOLOv4 with the same resolution and approximately the same speed. Then Scaled-YOLOv4 scales up to achieve a higher AP50 and AP at a lower speed.
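AP50 and AP differ in how strictly a predicted box must match the ground truth. On COCO, AP50 counts a detection as correct at IoU >= 0.50, while AP averages over the IoU thresholds 0.50, 0.55, ..., 0.95, so tighter boxes raise AP even when AP50 stays the same. The sketch below shows only this thresholding step, not a full AP computation with precision-recall curves. The boxes and helper names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# COCO AP averages over these ten IoU thresholds; AP50 uses only the first.
THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def thresholds_passed(pred, truth):
    """IoU thresholds at which this prediction counts as a correct detection."""
    score = iou(pred, truth)
    return [t for t in THRESHOLDS if score >= t]


truth = (0, 0, 100, 100)
loose = (20, 0, 120, 100)  # hypothetical prediction, shifted by 20 px
tight = (5, 0, 105, 100)   # hypothetical prediction, shifted by 5 px

print(len(thresholds_passed(loose, truth)))  # 4 of 10 thresholds
print(len(thresholds_passed(tight, truth)))  # 9 of 10 thresholds
```

Both predictions count as hits for AP50, but the tighter one contributes at far more of the thresholds that make up AP.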

The PyTorch YOLOv4 implementation predicts better coordinates (higher AP) but detects fewer objects (lower AP50):

Changes to the network architecture (CSP in the neck and Mish activation for all layers) then eliminate the flaws of the PyTorch implementation, so CSP + Mish improves AP, AP50 and FPS:

Scaled YOLOv4 comparison table
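For reference, the Mish activation mentioned above is defined as mish(x) = x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x). The following is a minimal scalar sketch in plain Python for illustration; it is not the implementation used in any YOLOv4 codebase, which would operate on tensors.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for value in (-2.0, 0.0, 1.0, 5.0):
    print(f"mish({value:+.1f}) = {mish(value):+.4f}")
```

Unlike ReLU, Mish is smooth and lets small negative values through, while behaving almost like the identity for large positive inputs.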

Scaled-YOLOv4 neural network architecture (examples of three networks: P5, P6, P7):

Scaled-YOLOv4 architecture

A CSP connection is extremely efficient and simple, and can be applied to any neural network. The idea is:

The simplest example of a CSP connection (on the left is a regular network, on the right is a CSP network)
An example of a CSP connection in YOLOv4-CSP / P5 / P6 / P7 (on the left is a regular network, on the right is a CSP network)
YOLOv4-tiny uses 2 CSP connections
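The steps of the idea are not spelled out in the text, so the sketch below assumes the usual description of a Cross-Stage-Partial connection: split the channels of a feature map into two parts, send only one part through the processing block, let the other part bypass it, and concatenate the results. Feature maps are modelled as plain lists of channels, and `double` is a hypothetical stand-in for a stack of convolutional layers. Real implementations typically add convolutions around the split and the merge.

```python
from typing import Callable, List

Channel = List[float]        # one flattened feature-map channel
FeatureMap = List[Channel]   # a feature map is a list of channels


def csp_connection(x: FeatureMap,
                   block: Callable[[FeatureMap], FeatureMap]) -> FeatureMap:
    """Process only half of the channels and concatenate with the bypassed half."""
    half = len(x) // 2
    bypass = x[:half]             # skips the block entirely
    processed = block(x[half:])   # only this half pays the compute cost
    return bypass + processed     # concatenate along the channel axis


def double(fm: FeatureMap) -> FeatureMap:
    """Hypothetical stand-in for the block's convolutional layers."""
    return [[2.0 * v for v in channel] for channel in fm]


x = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]  # 4 channels
y = csp_connection(x, double)
print(y)  # [[1.0, 1.0], [2.0, 2.0], [6.0, 6.0], [8.0, 8.0]]
```

Because the block sees only half of the channels, its cost drops accordingly, which is where the efficiency gain in this sketch comes from.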

YOLOv4 is used in various fields and tasks:

There are YOLOv4 implementations on various frameworks:

pip install yolov4



Also, the YOLOv4 approach can be used in other tasks, for example, when detecting 3D objects:




Aleksey Bochkovskiy

Ex Intel
