
Fig 1. YOLOv12 performing real-time vehicle detection on the Cars Detection Dataset.
1. Introduction: The Need for Speed and Precision
- Object Detection task: Object Detection is a foundational challenge in Computer Vision that goes beyond simple image classification. While classification identifies what is in an image (e.g., "this is a car"), Object Detection simultaneously answers the questions of "what" and "where." It involves identifying multiple objects within a single frame, classifying each into a specific category, and pinpointing their exact locations using bounding boxes. For tasks like traffic monitoring or autonomous driving—the focus of our Cars Detection Dataset—the model must not only recognise a vehicle but also distinguish between an ambulance, a bus, or a motorcycle in a cluttered, real-time environment.
- The "YOLO Revolution": Before the arrival of YOLO (You Only Look Once), the industry standard relied on "two-stage" detectors like Faster R-CNN. These architectures first proposed regions of interest using a Region-Proposal Network (RPN) and then classified those regions in a second pass—a process that was accurate but computationally expensive and slow. The "YOLO Revolution" fundamentally changed this by treating object detection as a single regression problem. By passing the entire image through a neural network once to predict both bounding boxes and class probabilities simultaneously, YOLO achieved unprecedented inference speeds. This shift from multi-stage processing to a streamlined, one-stage architecture made real-time AI applications possible, paving the way for the high-speed, high-accuracy versions we see today, from the classic YOLOv5 to the cutting-edge YOLOv26.
2. The YOLO Lineage: From v5 to v26
- YOLOv5: The industry's "Old Reliable." Released in 2020, YOLOv5 remains the benchmark for reliability in the computer vision community, with an architecture centred around the CSP-Darknet53 backbone. While its out-of-the-box performance on the Cars Detection Dataset was modest (0.2574 mAP), it improved substantially with fine-tuning (0.6966 mAP). One of its most striking attributes in this experiment was its speed stability: whether original or fine-tuned, it maintained a highly consistent inference latency of roughly 4.8 to 4.9 ms, proving why it remains a favourite for production environments where predictable performance is key.
- YOLOv11 & YOLOv12: YOLOv11 and YOLOv12 represent the modern frontier, moving beyond simple convolutional stacks toward sophisticated feature aggregation. YOLOv11 proved to be the efficiency champion of this study, achieving the fastest fine-tuned inference speed at just 4.47 ms. Meanwhile, YOLOv12 introduced a more complex architecture utilising Area Attention and R-ELAN (Residual Efficient Layer Aggregation Network). These features allow the model to "focus" better on overlapping vehicles in crowded traffic scenes. This architectural investment paid off in accuracy: YOLOv12 emerged as the overall leader in this experiment with a top score of 0.7402 mAP (after fine-tuning), though its sophisticated attention layers resulted in a slightly higher latency of 7.16 ms.
- YOLOv26: YOLOv26 represents a revolutionary shift in how object detection handles post-processing. Traditionally, models require Non-Maximum Suppression (NMS)—a separate computation step to filter out overlapping duplicate boxes—which can create bottlenecks on edge hardware. YOLOv26 is designed as an end-to-end, NMS-free detector, aiming to produce final results directly from the network. In my results, this architecture showed a unique "Latency Win" during fine-tuning: while the original model clocked in at 8.20 ms, the fine-tuned version dropped significantly to 5.07 ms. Combined with a strong 0.7104 mAP (after fine-tuning), YOLOv26 proves that removing the NMS bottleneck is a viable strategy for high-performance custom detection.
3. Dataset & Technical Setup
- Overview of the Cars Detection Dataset: For this experiment, I utilised the Cars Detection Dataset, a specialised collection of imagery designed for multi-class vehicle detection. The vehicle objects to be detected are categorised into these classes: Ambulance, Bus, Car, Motorcycle, Truck & Background. This dataset includes various perspectives that challenge a model's ability to generalise across different angles and lighting conditions. This variety is particularly useful for testing if a model can distinguish between similar large vehicles (e.g. Bus, Truck), or identify smaller, higher-speed targets (e.g. Motorcycles), making it an ideal playground for comparing the latest YOLO versions.
- Slight Modification to the Original Dataset's data.yaml File to Avoid a Path Error: A common hurdle when working with community-contributed datasets is environment-specific or package-specific configuration.
- In the original version of this dataset, the data.yaml file contained a path variable storing the relative path of the parent directory of the split image folders (train, valid). Because of this, Ultralytics's model.train() function raised a path error while fine-tuning YOLOv5/v11/v12/v26: inside the Kaggle environment, model.train() automatically treats the directory containing data.yaml as the base directory. The path variable therefore had to be commented out to fine-tune models in the Kaggle environment using the Ultralytics package.
- Additionally, the original dataset's data.yaml had no test variable pointing to the test split image folder. The modified version adds this test variable, which enables Ultralytics's model.val() function to run inference and compute metrics automatically on the held-out test split.
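As a sketch, the modified data.yaml looks roughly like this (the folder layout shown is illustrative of the dataset's structure, not copied verbatim from it):

```yaml
# path: ../Cars_Detection   # commented out: inside Kaggle, Ultralytics resolves
#                           # the splits relative to data.yaml's own directory
train: train/images
val: valid/images
test: test/images           # added so model.val(split="test") can find the held-out split

nc: 6
names: [Ambulance, Bus, Car, Motorcycle, Truck, Background]
```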
- Dataset Links: Original Cars Detection Dataset, Modified Cars Detection Dataset.
- Dataset License: The original Cars Detection Dataset is published under the Apache 2.0 License, a highly permissive license that allows for modification and redistribution. In alignment with these terms, my modified version of the dataset is also released under the Apache 2.0 License. I have maintained full attribution to the original author, Abdallah Wagih, and included a clear log of the technical changes made.
4. The Experiment: Comparative Methodology
- Power of the Ultralytics Framework: To ensure a consistent and high-performance benchmarking environment, all experiments were conducted using the Ultralytics Python package. This framework has become the industry standard for YOLO-based tasks because it provides a unified API for managing multiple model versions—from the legacy YOLOv5 to the State-Of-The-Art YOLOv11 and beyond. Using a single engine for training, validation, and inference ensured that the results were not skewed by different pre-processing or post-processing implementations. This streamlined approach allowed fair comparison of the underlying architectures rather than the software wrappers around them.
- Benchmarking Pre-trained (Out-of-the-box) vs. Fine-tuned models: The core of this study focuses on the performance gap between generic knowledge and specialised expertise. I first evaluated the Pre-trained (Out-of-the-box) models, which were originally trained on the COCO dataset. These models were tasked with detecting vehicles in the Cars Detection Dataset. Following this baseline, I performed Fine-tuning on each version. To ensure a scientifically fair comparison, I optimised hyper-parameters for YOLOv5 and then kept these settings fixed across all other YOLO versions. By holding these variables constant, the experiment isolates the impact of the architectural evolutions—such as YOLOv12’s attention mechanisms or YOLOv26’s NMS-free design—revealing how each engine inherently handles domain adaptation under identical training conditions.
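The fixed-hyper-parameter protocol above can be sketched as a small loop. This is a minimal illustration assuming the Ultralytics package; the weight file names and hyper-parameter values below are placeholders, not the exact settings used in the experiment:

```python
# Illustrative weight files, one per YOLO version compared in this study
WEIGHTS = ["yolov5nu.pt", "yolo11n.pt", "yolo12n.pt", "yolo26n.pt"]

# Tuned once on YOLOv5, then held fixed for every other version
FIXED_HYPERPARAMS = dict(epochs=50, imgsz=640, batch=16, patience=10)

def fine_tune_and_eval(weight_file, data_yaml="data.yaml"):
    """Fine-tune one YOLO version and score it on the held-out test split."""
    from ultralytics import YOLO  # lazy import keeps this sketch importable without the package
    model = YOLO(weight_file)                         # load COCO-pretrained weights
    model.train(data=data_yaml, **FIXED_HYPERPARAMS)  # identical settings for every version
    return model.val(data=data_yaml, split="test")    # uses the test: key added to data.yaml
```

Because the hyper-parameters are identical across versions, any metric differences can be attributed to the architectures themselves.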
- Hardware Configuration and Latency Context: In object detection, "accuracy" is only half of the story; "latency" is equally critical for real-world deployment. To provide a standardised context for the speed results, all tests were performed in a Kaggle Notebook environment utilising an NVIDIA Tesla T4 x2 GPU setup. The T4 is a widely used accelerator in cloud production environments, making these latency figures (measured in milliseconds per image) a realistic representation of what a developer can expect in a real-world application. By keeping the hardware constant across all 8 test runs, I was able to isolate the architectural efficiency of each YOLO version.
- Kaggle Notebooks for all Experiments:
5. Results & Visual Analysis
- The Master Table: Benchmarking Results:
- For all performance evaluations, the last.pt weights were utilised rather than best.pt. Despite keeping training hyper-parameters (including epochs and patience) constant across all versions, the final weights provided more consistent and sensible detections, suggesting that the models reached a more stable convergence point at the end of the training cycle for this specific dataset.
- The quantitative results of the experiment highlight a massive performance leap across all architectures upon fine-tuning. Notably, YOLOv12 emerged as the leader in precision, while YOLOv11 delivered the fastest inference.
| Model Version | mAP@50 (Original) | mAP@50 (Fine-Tuned) | mAP@50-95 (Fine-Tuned) |
|---|---|---|---|
| YOLOv5 | 0.2574 | 0.6966 | 0.5240 |
| YOLOv11 | 0.2651 | 0.6945 | 0.5296 |
| YOLOv12 | 0.2847 | 0.7402 | 0.5572 |
| YOLOv26 | 0.2852 | 0.7104 | 0.5193 |
| Model Version | Inference Time (ms/image, Original) | Inference Time (ms/image, Fine-Tuned) | Speed Change (ms/image) |
|---|---|---|---|
| YOLOv5 | 4.91 | 4.82 | -0.09 (Stable) |
| YOLOv11 | 5.67 | 4.47 | -1.20 (Faster) |
| YOLOv12 | 7.18 | 7.16 | -0.02 (Stable) |
| YOLOv26 | 8.20 | 5.07 | -3.13 (Major Gain) |
- Visual Proof: The Impact of Fine-Tuning: To visualise the qualitative improvement, we can compare the Confusion Matrices of the original vs fine-tuned models. In the "Out-of-the-box" matrix, there is a high concentration of misclassification errors (many false positives and some false negatives), where the model either missed vehicles entirely or misidentified them as background noise. After fine-tuning, these errors were significantly reduced: the diagonal of the confusion matrix (for the model fine-tuned on the Cars Detection Dataset)—representing correct predictions—became much more prominent. [Note: The confusion matrix for the original model's inference contained all the COCO classes. It needed to be filtered down to the overlapping classes, though some extra COCO classes could not be removed even after filtering.]
- For all the YOLO Original versions (v5, v11, v12, v26) inference output bounding boxes and inference metrics, go to this location under output tab of the Kaggle Notebook: /runs/detect/Inference_Study/Original_Inference/
- For all the YOLO Fine-tuned versions (v5, v11, v12, v26) inference output bounding boxes and inference metrics, go to this location under output tab of the Kaggle Notebook: /runs/detect/Inference_Study/FineTuned_Inference/

Fig 2. Filtered Confusion Matrix for the Original YOLOv12 Model Inference on Cars Detection Dataset's test split.

Fig 3. Confusion Matrix for the Fine-tuned YOLOv12 Model Inference on Cars Detection Dataset's test split.
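The class-filtering mentioned in the note above can be sketched with NumPy: keep only the rows and columns of the confusion matrix whose class indices overlap with the Cars dataset. The matrix and indices below are toy stand-ins, not the actual experiment artefacts:

```python
import numpy as np

def filter_confusion(cm, keep_idx):
    """Slice a confusion matrix down to the rows/columns listed in keep_idx."""
    keep = np.asarray(keep_idx)
    return cm[np.ix_(keep, keep)]  # select the same indices on both axes

# Toy 4x4 matrix standing in for an 80-class COCO confusion matrix
cm = np.arange(16).reshape(4, 4)
print(filter_confusion(cm, [0, 2]))  # keeps rows/columns for classes 0 and 2 only
```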
- Per-Class Performance-Strengths and Challenges: Across all YOLO versions, the "Car" class emerged as the most recognisable (though still with significant FP and FN errors), benefiting from a high volume of training samples. The models often struggled to distinguish between a car and the background or other large vehicles, a gap that was only bridged through the fine-tuning process. On the other hand, "Motorcycles" proved the toughest class for all the architectures (even the attention-heavy YOLOv12), likely due to their smaller spatial footprint and the higher variance in their appearance compared to cars or buses (which are more box-like). The "Ambulance" class showed comparatively strong performance with the fine-tuned YOLOv12, even though it is not present in the original COCO dataset.
6. Deep Dive Analysis
- The Fine-Tuning Jump: Bridging the Domain Gap: The dramatic surge in performance—with mAP jumping from a baseline of ~0.28 to a peak of ~0.74—highlights the limitations of general-purpose pre-training. While the original models were trained on the COCO dataset, which contains 80 broad categories, they lacked the specialisation required for the nuances of the Cars Detection Dataset.
- In a general context, a "vehicle" is often just a large boxy object; however, for a traffic-specific application, the model must distinguish at a more fine-grained level (e.g., between an Ambulance and a standard Truck). Fine-tuning allows the network to repurpose its learned features to focus on these critical distinctions, such as medical decals or specific chassis shapes, effectively transforming a "generalist" into a "specialist" for vehicle identification.
- The Latency Paradox: Why Fine-Tuned Models Ran Faster: An unexpected but fascinating result of the experiment was the "Latency Paradox": the fine-tuned versions of YOLOv11 and YOLOv26 actually recorded lower latency than their original counterparts.
- Insight: This speed gain stems primarily from the reduction in the number of output classes. The original COCO-trained models predict 80 classes, while the fine-tuned models predict only 6, so the classification branch of the detection head carries a lighter computational load in the final layers.
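A rough back-of-the-envelope illustration of this effect, assuming the standard anchor-free YOLO head where the classification branch emits one logit per class at every feature-map cell across the three detection scales of a 640px input:

```python
# Cells across the 80x80, 40x40 and 20x20 feature maps for a 640px input
CELLS = 80 * 80 + 40 * 40 + 20 * 20   # 8400 cells in total

def cls_logits(num_classes, cells=CELLS):
    """Class logits the classification branch emits per image."""
    return num_classes * cells

coco_logits = cls_logits(80)    # 672,000 logits for the 80 COCO classes
custom_logits = cls_logits(6)   # 50,400 logits for the 6 vehicle classes
print(custom_logits / coco_logits)  # the fine-tuned head emits only 7.5% as many class logits
```

This only accounts for part of the latency gap (the backbone is unchanged), but it shows why the final layers get measurably cheaper after fine-tuning on fewer classes.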
- YOLOv12 vs. YOLOv26: Accuracy vs. Architectural Trade-offs: The comparison between YOLOv12 and YOLOv26 reveals a fundamental trade-off in modern object detection design. YOLOv12 emerged as the accuracy champion, largely due to its Area Attention mechanism which excels at capturing global dependencies and refined spatial details. On the other hand, YOLOv26 represents a shift toward structural optimisation. By utilising an NMS-free design, YOLOv26 aims to eliminate the post-processing bottleneck entirely, while maintaining a decently high accuracy.
7. Conclusion & Future Scope
- Optimal Selection-Choosing the Best Model for Traffic Applications: When selecting a model for a real-world traffic monitoring system, the choice depends on the specific priorities of the deployment environment. If the goal is maximum precision—such as identifying specific vehicle types in dense urban congestion—YOLOv12 is the clear leader, achieving a superior 0.7402 mAP in our tests. However, for edge devices with limited computational power where every millisecond counts, YOLOv11 offers the best balance, delivering the fastest fine-tuned inference speed of 4.47 ms while maintaining high accuracy. While the legacy YOLOv5 remains remarkably stable, the architectural advancements in the newer versions provide clear performance improvements that are well worth the upgrade for modern AI applications.
- Future Scope-Enhancing Temporal Consistency with ByteTrack: While this study focused on single-frame object detection (Image Dataset), real-world traffic monitoring is inherently video-based. A logical next step for this project is the integration of a multi-object tracking (MOT) algorithm like ByteTrack. By adding a tracking layer, the system can maintain the identity of vehicles across consecutive frames, even during brief occlusions. Integrating ByteTrack with the high-precision bounding boxes of YOLOv12 would transform this detector into a comprehensive solution capable of analysing vehicle trajectories, counting traffic flow, and detecting complex road events with much higher temporal consistency.
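As a sketch of the proposed integration: Ultralytics already exposes ByteTrack through its tracking API, so wiring it to the fine-tuned detector could look roughly like this (the weight path and video file below are placeholders):

```python
def track_vehicles(video_path, weights="last.pt"):
    """Run the fine-tuned detector frame-by-frame with persistent ByteTrack IDs."""
    from ultralytics import YOLO  # lazy import keeps this sketch importable without the package
    model = YOLO(weights)
    # Ultralytics ships a ByteTrack config; persist=True keeps track IDs across frames
    return model.track(source=video_path, tracker="bytetrack.yaml", persist=True)
```

Each returned result carries per-object track IDs alongside the usual boxes, which is the building block for trajectory analysis and traffic counting.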
- Future Scope-Testing Robustness on Aerial Datasets: To further validate the attention-based advantages of models like YOLOv12, future research should involve testing on some Aerial Dataset. Aerial imagery presents a unique challenge because objects appear significantly smaller and can be oriented in any direction. Since YOLOv12 utilises Area Attention to capture global context, it is hypothesised that it will hold its accuracy better than traditional CNN-based models when detecting tiny objects from a top-down perspective. Transitioning from road-level views to drone-based surveillance will provide a rigorous test of how these architectures scale across different spatial resolutions and altitudes.
8. References and Acknowledgements:
- Cars Detection Dataset (Modified): based on the Original Cars Detection Dataset by Abdallah Wagih via Kaggle. Modified and redistributed under the Apache 2.0 License.
- Ultralytics YOLO Documentation
- Ultralytics YOLOv12 Documentation
- Ultralytics YOLOv26 Documentation
- An Overview of YOLOv5
- An Overview of YOLOv12
- An Overview of YOLOv26
- Ultralytics Official YouTube Channels
Which one do you prioritise for your projects to be deployed on edge devices: the raw accuracy of YOLOv12 or the NMS-free YOLOv26? Let me know in the comments.