Vladimir Iglovikov
Bounding Box Augmentation for Object Detection with Albumentations

If you're new to image augmentation, two earlier posts provide the broader context.

This post builds on those ideas and focuses on one specific practical question: how to apply augmentations correctly when your labels are bounding boxes.

It is based on the Albumentations documentation, with additional context and examples for object detection workflows. Albumentations is an open-source image augmentation library with 15k+ GitHub stars and 140M+ downloads.

Contents

  • Bounding box formats
  • Building a detection pipeline
  • Passing labels and metadata
  • A.BboxParams explained
  • Cropping strategies
  • Common mistakes
  • Further reading

When you augment images for object detection, bounding box coordinates must transform in sync with the pixels. A horizontal flip mirrors the image — but if the box coordinates stay the same, every box now points at the wrong object. Albumentations handles this automatically: you declare your box format, and every spatial transform updates both pixels and coordinates together.
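To make concrete what "updates both pixels and coordinates together" means, here is the horizontal-flip bookkeeping in plain NumPy (my own sketch of the arithmetic, not the library's internals): each x coordinate maps to width − x, and the two x values swap roles so that x_min stays smaller than x_max.

```python
import numpy as np

def hflip_boxes(bboxes, image_width):
    """Mirror pascal_voc boxes [x_min, y_min, x_max, y_max] horizontally.

    x -> width - x, and the two x coordinates swap so x_min < x_max still holds.
    Illustrative sketch only.
    """
    flipped = bboxes.astype(np.float64).copy()
    flipped[:, 0] = image_width - bboxes[:, 2]  # new x_min from old x_max
    flipped[:, 2] = image_width - bboxes[:, 0]  # new x_max from old x_min
    return flipped

boxes = np.array([[98, 345, 420, 462]], dtype=np.float64)
print(hflip_boxes(boxes, 640))  # [[220. 345. 542. 462.]]
```

If the image is flipped but these coordinate updates are skipped, the box still points at the object's original, pre-flip location.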

Bounding Box Formats

Different frameworks and datasets use different coordinate conventions. Albumentations supports five formats — pick the one your data already uses, and pass it as coord_format in A.BboxParams.

Format           Coordinates                                    Values              Common in
pascal_voc       [x_min, y_min, x_max, y_max]                   Pixels              PASCAL VOC, many custom datasets
albumentations   [x_min, y_min, x_max, y_max]                   Normalized [0, 1]   Internal Albumentations format
coco             [x_min, y_min, box_width, box_height]          Pixels              COCO dataset
yolo             [x_center, y_center, box_width, box_height]    Normalized [0, 1]   Ultralytics YOLO, Darknet
cxcywh           [x_center, y_center, box_width, box_height]    Pixels              Like YOLO but not normalized

If you're coming from Ultralytics (YOLOv5/v8/v11), your labels are already in yolo format — use coord_format='yolo'.

Example: For a 640x480 image with a box from pixel (98, 345) to (420, 462):

Bounding box example

  • pascal_voc: [98, 345, 420, 462] — corners in pixels
  • albumentations: [0.153, 0.719, 0.656, 0.962] — corners normalized by image dimensions
  • coco: [98, 345, 322, 117] — top-left corner + box width (420−98) and height (462−345)
  • yolo: [0.405, 0.841, 0.503, 0.244] — center + size, all normalized
  • cxcywh: [259, 403.5, 322, 117] — center + size in pixels
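The conversions between these formats are simple arithmetic, which makes them easy to verify by hand. A small sketch converting the pascal_voc example above to yolo (illustrative code of my own, not an Albumentations API):

```python
import numpy as np

def pascal_voc_to_yolo(box, image_width, image_height):
    """Convert [x_min, y_min, x_max, y_max] in pixels to
    normalized [x_center, y_center, width, height]."""
    x_min, y_min, x_max, y_max = box
    return np.array([
        (x_min + x_max) / 2 / image_width,   # x_center, normalized
        (y_min + y_max) / 2 / image_height,  # y_center, normalized
        (x_max - x_min) / image_width,       # width, normalized
        (y_max - y_min) / image_height,      # height, normalized
    ])

print(pascal_voc_to_yolo([98, 345, 420, 462], 640, 480).round(3))
# [0.405 0.841 0.503 0.244]
```

Running a few of your own boxes through a conversion like this is a quick way to confirm which format your annotations are actually in.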

Comparison of bounding box formats

Getting the format wrong is the most common bbox bug. The values will still be valid numbers, the pipeline won't raise an error, but every box will point at the wrong region. Always double-check which format your annotation tool or dataset exports.

Setting Up a Detection Pipeline

import albumentations as A
import cv2
import numpy as np

Create an A.Compose pipeline and pass A.BboxParams to tell it how to handle bounding boxes:

train_transform = A.Compose([
    A.RandomCrop(width=450, height=450, p=1.0),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
], bbox_params=A.BboxParams(
    coord_format='coco',
    label_fields=['class_labels'],
), seed=137)

You can freely mix any transforms in the pipeline. Pixel-level transforms like RandomBrightnessContrast modify the image and leave boxes untouched. Spatial transforms like HorizontalFlip update both image and box coordinates. The result is consistent — boxes always match the augmented image. See the Supported Targets by Transform reference for which transforms affect which targets.

Applying the Pipeline

Load your image and prepare bounding boxes as a NumPy array with shape (num_boxes, 4):

image = cv2.imread("image.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

bboxes = np.array([
    [23, 74, 295, 388],
    [377, 294, 252, 161],
    [333, 421, 49, 49],
], dtype=np.float32)

class_labels = np.array(['dog', 'cat', 'sports ball'])

Pass everything to the transform. Labels go as keyword arguments matching the names in label_fields:

result = train_transform(image=image, bboxes=bboxes, class_labels=class_labels)

augmented_image = result['image']
augmented_bboxes = result['bboxes']
augmented_labels = result['class_labels']

Input/Output with separate labels

The output may contain fewer boxes than the input — boxes that fall outside the augmented image area or become too small are automatically dropped.

Attaching Metadata to Bounding Boxes

Labels are optional. You can pass just coordinates with no metadata at all:

result = transform(image=image, bboxes=bboxes)

When you do need to attach per-box data (class names, IDs, flags), there are two approaches.

Separate label fields

Declare field names in label_fields and pass each as a keyword argument. Values can be strings or numbers — anything that can go in a Python list:

Image with multiple bounding boxes

bbox_params = A.BboxParams(
    coord_format='pascal_voc',
    label_fields=['class_labels', 'difficult_flags'],
)

result = transform(
    image=image,
    bboxes=bboxes,
    class_labels=['dog', 'cat', 'ball'],   # strings
    difficult_flags=[0, 0, 1],              # numbers
)

You can define as many fields as you need. When a box is dropped during augmentation, the corresponding entry is dropped from every field — they stay in sync automatically.

This mechanism is general enough to go beyond class labels. Any per-box data that needs to survive filtering can be passed as a label field. Two examples:

Video augmentation. Stack boxes from multiple frames into one bboxes array and add a frame_id field to track which frame each box came from. After the transform, result['frame_ids'] tells you which boxes survived and their origin frame:

bbox_params = A.BboxParams(
    coord_format='pascal_voc',
    label_fields=['class_labels', 'frame_ids'],
)

bboxes = [[10, 20, 100, 200], [50, 60, 150, 250], [30, 40, 80, 180]]
class_labels = ['car', 'car', 'person']
frame_ids = [0, 0, 1]

result = transform(images=images, bboxes=bboxes, class_labels=class_labels, frame_ids=frame_ids)

Instance segmentation. Attach an instance_id field so each box stays linked to its mask after filtering:

bbox_params = A.BboxParams(
    coord_format='pascal_voc',
    label_fields=['class_labels', 'instance_ids'],
)

result = transform(
    image=image,
    bboxes=bboxes,
    class_labels=['person', 'person', 'car'],
    instance_ids=[0, 1, 2],
)
# use result['instance_ids'] to index into your mask array

Packed arrays

If all your metadata is numeric, you can pack it directly into the bounding box array as extra columns. A (num_boxes, 6) array has 4 coordinate columns + 2 metadata columns (e.g., class ID and track ID). The extra columns are preserved through augmentation without needing label_fields:

bboxes = np.array([
    [23, 74, 295, 388, 1, 17],   # coords + class_id + track_id
    [377, 294, 252, 161, 2, 23],
], dtype=np.float32)

bbox_params = A.BboxParams(coord_format='coco')

result = transform(image=image, bboxes=bboxes)
# result['bboxes'] still has shape (n, 6) — extra columns intact

Use label_fields when you have string labels or want named access to fields. Use packed arrays for compact numeric metadata where named access isn't needed.
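Unpacking a packed array afterwards is plain NumPy slicing: the first four columns are coordinates, and everything after them is your metadata.

```python
import numpy as np

# Packed (num_boxes, 6) array: 4 coordinate columns + class_id + track_id
bboxes = np.array([
    [23, 74, 295, 388, 1, 17],
    [377, 294, 252, 161, 2, 23],
], dtype=np.float32)

coords = bboxes[:, :4]                # coordinate columns
class_ids = bboxes[:, 4].astype(int)  # first metadata column
track_ids = bboxes[:, 5].astype(int)  # second metadata column

print(class_ids, track_ids)  # [1 2] [17 23]
```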

BboxParams Reference

A.BboxParams controls how bounding boxes are interpreted and filtered:

  • coord_format (Required): One of 'pascal_voc', 'albumentations', 'coco', 'yolo', or 'cxcywh'.
  • bbox_type: 'hbb' (axis-aligned, 4 coords, default) or 'obb' (oriented, 5 coords with angle). For rotated objects, see Oriented Bounding Boxes (OBB).
  • label_fields: List of keyword argument names holding per-box labels (e.g., ['class_labels']). These stay synchronized when boxes are dropped.
  • min_area: Minimum pixel area after augmentation. Smaller boxes are dropped. Default: 0.0.
  • min_visibility: Minimum fraction (0.0–1.0) of original box area that must remain visible. Default: 0.0.
  • min_width: Minimum box width (pixels or normalized units). Default: 0.0.
  • min_height: Minimum box height (pixels or normalized units). Default: 0.0.
  • clip_bboxes_on_input: Clip coordinates to image boundaries before augmentation. Useful for annotations that extend outside the image. Default: False.
  • filter_invalid_bboxes: Remove invalid boxes (e.g., x_max < x_min) before augmentation. If clip_bboxes_on_input=True, filtering happens after clipping. Default: False.
  • max_accept_ratio: Maximum aspect ratio (max(w/h, h/w)). Boxes exceeding this are dropped. None disables. Default: None.

Handling Imperfect Annotations

Real-world datasets often have boxes that extend outside image boundaries — labeling errors, previous cropping, or annotation tools that allow it. Use clip_bboxes_on_input=True to force coordinates within bounds before augmentation, and filter_invalid_bboxes=True to drop any boxes that become degenerate (zero width/height) after clipping.

bbox_params = A.BboxParams(
    coord_format='yolo',
    label_fields=['class_labels'],
    clip_bboxes_on_input=True,
    filter_invalid_bboxes=True,
)

Filtering with min_area and min_visibility

After a crop, some boxes become tiny slivers or are mostly outside the visible area. Use min_area and min_visibility to control which boxes survive:

Original image with two boxes

After CenterCrop (no min_area/visibility)

After CenterCrop with min_area

After CenterCrop with min_visibility

min_area drops boxes that become too small in absolute terms. min_visibility drops boxes where too much of the original area was cropped away. Which one to use depends on whether you care about absolute box size (use min_area) or how much of the object is still visible (use min_visibility).
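The arithmetic behind the two thresholds is easy to check by hand. Suppose a box covered 200x100 = 20,000 px² before a crop and only 40x30 = 1,200 px² afterwards: its visibility is 1200/20000 = 0.06, so min_area=2000 drops it and min_visibility=0.1 drops it as well. A small sketch of that filtering logic (my own illustration, not the library's code):

```python
def box_survives(orig_area, cropped_area, min_area=0.0, min_visibility=0.0):
    """Decide whether a cropped box passes both thresholds (illustrative sketch)."""
    visibility = cropped_area / orig_area if orig_area > 0 else 0.0
    return cropped_area >= min_area and visibility >= min_visibility

orig = 200 * 100   # 20000 px^2 before the crop
cropped = 40 * 30  # 1200 px^2 remaining after the crop

print(box_survives(orig, cropped, min_area=2000))        # False: 1200 < 2000
print(box_survives(orig, cropped, min_visibility=0.1))   # False: 0.06 < 0.1
print(box_survives(orig, cropped, min_visibility=0.05))  # True:  0.06 >= 0.05
```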

Cropping Strategies for Object Detection

RandomCrop can produce crops that contain zero bounding boxes — a wasted training sample. Albumentations provides bbox-aware alternatives:

  • AtLeastOneBboxRandomCrop: Guarantees at least one box is present in the crop. Some boxes may be lost. Good when images have many objects and you want diverse crops.
  • BBoxSafeRandomCrop: Guarantees all boxes are preserved. The crop region adjusts to keep every box. Use when losing any annotation is unacceptable (rare objects, critical detection requirements).
  • RandomSizedBBoxSafeCrop: Crops a random portion of the image while preserving all boxes, then resizes to your target dimensions. Provides scale and aspect ratio variation while keeping every object — the most common choice for detection training.
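The intuition behind the box-preserving crops can be sketched in a few lines: take the union of all boxes, then sample a crop window that contains that union. This is a simplified illustration of the concept, not the library's implementation:

```python
import random
import numpy as np

def bbox_safe_crop_window(bboxes, image_width, image_height, rng=random):
    """Sample a crop [x1, y1, x2, y2] guaranteed to contain every
    pascal_voc box (conceptual sketch, not library code)."""
    union_x1 = float(np.min(bboxes[:, 0]))
    union_y1 = float(np.min(bboxes[:, 1]))
    union_x2 = float(np.max(bboxes[:, 2]))
    union_y2 = float(np.max(bboxes[:, 3]))
    # The crop may start anywhere left/above the union of all boxes
    # and must end anywhere right/below it.
    x1 = rng.uniform(0, union_x1)
    y1 = rng.uniform(0, union_y1)
    x2 = rng.uniform(union_x2, image_width)
    y2 = rng.uniform(union_y2, image_height)
    return x1, y1, x2, y2

boxes = np.array([[23, 74, 295, 388], [377, 294, 429, 455]], dtype=np.float32)
x1, y1, x2, y2 = bbox_safe_crop_window(boxes, 640, 480)
assert x1 <= 23 and y1 <= 74 and x2 >= 429 and y2 >= 455
```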

Common Mistakes

Wrong coord_format. This is the #1 bbox bug. If your annotations are in YOLO format but you set coord_format='coco', the pipeline will run without errors — but every box will point at the wrong location. The model trains on garbage labels and mAP stays near zero. Always verify by visualizing a few augmented samples before starting a training run.

All boxes filtered out. Aggressive cropping combined with strict min_area or min_visibility can drop every box from an image, returning an empty bboxes array. Your training loop needs to handle this — either skip empty samples in your dataset class, or use bbox-safe cropping transforms.
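One simple pattern for handling the empty case is to retry the (random) augmentation a few times and fall back to the original sample. This wrapper is an illustrative pattern of my own, not an Albumentations API; augment_fn is any callable with the keyword signature shown earlier:

```python
def augment_with_retry(augment_fn, image, bboxes, labels, max_tries=10):
    """Re-run a random augmentation until at least one box survives.

    Hypothetical helper: augment_fn takes image/bboxes/class_labels
    keywords and returns a dict with the same keys.
    """
    for _ in range(max_tries):
        result = augment_fn(image=image, bboxes=bboxes, class_labels=labels)
        if len(result['bboxes']) > 0:
            return result
    # Fall back to the unaugmented sample rather than yield an empty one.
    return {'image': image, 'bboxes': bboxes, 'class_labels': labels}
```

Whether to retry, skip, or keep empty samples depends on your loss function; some detection losses handle background-only images fine.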

Mixing normalized and absolute coordinates. YOLO format expects values in [0, 1]. If you pass pixel coordinates with coord_format='yolo', the pipeline clips them to [0, 1] and you get a single pixel-sized box in the corner. The reverse — passing normalized values with coord_format='pascal_voc' — produces boxes that are fractions of a pixel and get filtered out immediately.
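A cheap guard against this class of bug is to sanity-check value ranges before wiring up the pipeline. A hypothetical helper (not part of Albumentations):

```python
import numpy as np

def looks_normalized(bboxes):
    """Heuristic: normalized formats (yolo, albumentations) keep every
    value in [0, 1]; pixel formats almost always exceed 1."""
    arr = np.asarray(bboxes, dtype=np.float64)
    return bool((arr >= 0).all() and (arr <= 1).all())

print(looks_normalized([[0.405, 0.841, 0.503, 0.244]]))  # True  -> yolo-style
print(looks_normalized([[98, 345, 420, 462]]))           # False -> pixel-style
```

It is only a heuristic (a tiny pixel box near the origin could pass), but it catches the common case where an entire dataset is in the wrong convention.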

Using unsupported transforms. Not all transforms can update bounding box coordinates. If you add a transform that doesn't support bboxes to a pipeline with A.BboxParams, it raises an error at initialization (not at runtime). Check the Supported Targets by Transform reference.

Visualizing after Normalize. A.Normalize converts pixel values to float with mean subtraction and std division. If you try to display the image after normalization, it looks like noise. Always visualize before A.Normalize and A.ToTensorV2 in your pipeline.
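If you do need to inspect an image after normalization, invert the operation first. A sketch assuming the common ImageNet mean/std defaults; the values must match whatever you actually passed to A.Normalize:

```python
import numpy as np

def denormalize(image, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Invert (pixel/255 - mean) / std back to displayable uint8.

    Sketch only: mean and std must match the normalization you applied.
    """
    img = image * np.array(std) + np.array(mean)  # undo standardization
    img = np.clip(img * 255.0, 0, 255)            # back to the 0-255 range
    return img.astype(np.uint8)
```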

Where to Go Next?
