Yobitel

CVAT AMI from Yobitel - How do annotation types work on AWS?

CVAT AMI runs on your own EC2 infrastructure in AWS. Unlike managed annotation platforms, this means:

  • No per-seat pricing

  • No data egress to a third-party cloud

  • No dependency on external availability

The AMI packages CVAT with all required services, which makes deployment a single-step process from the AWS Marketplace.

Once the instance is running, the annotation environment gives teams full control over data, workflows, and export pipelines. The quality of the output depends entirely on choosing the right annotation type for the task. 

Yobitel’s CVAT AMI on AWS supports 9 distinct annotation types. Each type has its own geometry, storage format, and export schema, all of which must align with what downstream models expect.

Selecting the wrong type does not produce suboptimal data. It produces structurally incorrect data. A detection model trained on polygons converted to bounding boxes loses boundary precision at inference. A segmentation model trained on bounding box crops receives background pixels as part of the object region. A tracking model trained without persistent object IDs cannot learn identity continuity across frames. These are not quality issues that post-processing can fix. They are architectural mismatches between the annotation geometry and the model input contract.

The 9 annotation types map across four layers of the annotation decision: 

  • Object geometry (rectangular, contoured, linear, point, volumetric)

  • Model task (detection, segmentation, pose estimation, tracking, classification)

  • Output format (YOLO, COCO, Cityscapes, MOT, KITTI)

  • Infrastructure requirement (CPU, GPU, high-IOPS storage)

This article covers all nine types and the EC2 instance specifications that support each type at production scale.

Bounding box

Bounding box annotation in CVAT AMI

CVAT stores bounding boxes as 4 pixel coordinates:

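  • xtl: x-coordinate of the top-left corner

  • ytl: y-coordinate of the top-left corner

  • xbr: x-coordinate of the bottom-right corner

  • ybr: y-coordinate of the bottom-right corner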

Every box carries a label, an optional attribute set, and a confidence flag when auto-annotation is active. It is the fastest annotation type and the most widely supported across detection frameworks.
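The corner coordinates convert to the two common detection formats with simple arithmetic. The following is a minimal sketch, not CVAT's own exporter: the function names are hypothetical, and it assumes pixel coordinates with the origin at the top-left of the image.

```python
from typing import List, Tuple


def cvat_box_to_coco(xtl: float, ytl: float, xbr: float, ybr: float) -> List[float]:
    """Corner coordinates -> COCO-style bbox [x, y, width, height] in pixels."""
    return [xtl, ytl, xbr - xtl, ybr - ytl]


def cvat_box_to_yolo(
    xtl: float, ytl: float, xbr: float, ybr: float, img_w: int, img_h: int
) -> Tuple[float, float, float, float]:
    """Corner coordinates -> YOLO-style (cx, cy, w, h), normalised by image size."""
    cx = (xtl + xbr) / 2.0 / img_w
    cy = (ytl + ybr) / 2.0 / img_h
    w = (xbr - xtl) / img_w
    h = (ybr - ytl) / img_h
    return cx, cy, w, h
```

A YOLO label line is then the class index followed by these four normalised values, separated by spaces.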

When to use it

Bounding boxes apply when the target model is an object detector such as YOLO, Faster R-CNN, or SSD. They work for objects that are upright, non-overlapping, and fill most of the rectangle. Object fill ratio is the key threshold. When the ratio of object pixels to total box pixels drops below approximately 40 per cent, the rectangle captures too much background context, which degrades localisation accuracy during training. At that threshold, a polygon or mask produces better training data.
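The fill ratio can be estimated on a handful of sample objects before committing to a geometry. The sketch below assumes a rough outline of the object is available as a list of (x, y) vertices and approximates the object's pixel count with the polygon area; the function names and the use of the 40 per cent figure as a default are illustrative, not part of CVAT.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area(vertices: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the shape
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def fill_ratio(vertices: List[Point]) -> float:
    """Object area divided by the area of its axis-aligned bounding box."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    box_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    if box_area == 0:
        return 0.0  # degenerate outline (a line or a single point)
    return polygon_area(vertices) / box_area


def suggest_geometry(vertices: List[Point], threshold: float = 0.40) -> str:
    """Apply the approximate 40 per cent rule of thumb from the text."""
    return "bounding box" if fill_ratio(vertices) >= threshold else "polygon or mask"
```

For example, a thin L-shaped object fills only about a fifth of its rectangle, so the helper points towards a polygon or mask.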

Recommended EC2 instance

Workload | Instance | Notes
Small tasks, up to 5,000 images | t3.medium | CPU-only, no AI-assist
Standard production tasks | c5.xlarge | 4 vCPU, enough for manual bounding boxes
AI-assisted auto-annotation | g4dn.xlarge | Required for Nuclio-based YOLO auto-detect

Real pipeline

Product detection systems label items on warehouse conveyors with bounding boxes. The model confirms whether a SKU is present in a frame region. YOLO v8 trained on these annotations runs inference in the 40 to 50ms range per frame. Fill ratio stays above 60 per cent across most captures because SKUs are upright and belt-separated.

Step by step in CVAT

Step | Action
1. Open task | Navigate to http://:8080. Go to Projects, Create Task, then upload images or enter an S3 path.
2. Configure S3 | The IAM role requires s3:GetObject, s3:ListBucket, and s3:HeadObject. Missing HeadObject fails file enumeration.
3. Draw | Press N. Click-drag across the object to define the rectangle.
4. Label | Release the mouse. CVAT prompts for a label from the predefined label set.
5. Adjust | Hover over a box edge for the resize handle and drag. Hold Alt and drag to reposition without resizing.
6. Export | Actions, Export Dataset. YOLO 1.1 for detection pipelines. COCO 1.0 for JSON-format annotations.

Polygon

Polygon annotation in CVAT AMI

Polygon geometry has no fixed shape. CVAT stores polygons as ordered vertex arrays, where the shape closes automatically when the final point connects to the first. Each vertex is a pixel coordinate. The contour conforms to concave boundaries, irregular silhouettes, and partially occluded objects.
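To make the vertex-array idea concrete, here is a small sketch of how an ordered list of vertices maps onto a COCO-style annotation fragment. It assumes the usual COCO layout of a flat [x1, y1, x2, y2, ...] list per polygon; the helper name is hypothetical and this is not CVAT's exporter code.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def polygon_to_coco(vertices: List[Point]) -> Dict[str, list]:
    """Flatten ordered vertices and derive the enclosing bbox [x, y, w, h]."""
    if len(vertices) < 3:
        raise ValueError("a polygon needs at least three vertices")
    # The closing edge (last vertex back to first) is implicit, so the
    # first vertex is not repeated at the end of the list.
    flat = [coord for point in vertices for coord in point]
    xs, ys = flat[0::2], flat[1::2]
    x_min, y_min = min(xs), min(ys)
    bbox = [x_min, y_min, max(xs) - x_min, max(ys) - y_min]
    return {"segmentation": [flat], "bbox": bbox}
```

The outer list in the segmentation field allows one object to be described by several disjoint polygons, which is how partially occluded objects are usually represented.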

When to use it

Polygon annotation is correct for instance segmentation training. Mask R-CNN, SegFormer, and SAM fine-tuning all require per-pixel boundary masks. Concave silhouettes, overhead vehicles, garments, and agricultural plants are hard to describe with rectangles. Polygons are also necessary when objects of the same class touch or overlap, because each polygon instance carries its own ID regardless of spatial proximity.

Polygon vs bounding box

Detection models (YOLO, Faster R-CNN) only need class and location. Bounding boxes suffice. Segmentation models like Mask R-CNN and SegFormer require precise boundary geometry. Polygons are necessary. When the fill ratio drops below 40 per cent, a polygon also produces better detection data because it eliminates the background context that the rectangle would include.

Recommended EC2 instance

Workload | Instance | Notes
Manual polygon tracing | c5.xlarge | Adequate for up to 20 concurrent annotators
Intelligent scissors (edge-snap) | c5.xlarge | CPU-only mode, no GPU needed
AI-assisted interactive segmentation | g4dn.xlarge | SiamMask and MobileNet models via Nuclio require a GPU
High-density tasks, 20,000+ images | c5.2xlarge | Higher memory prevents canvas lag on large images

Real pipeline

Crop disease detection from drone imagery requires polygon precision. The boundary between diseased and healthy leaf tissue is part of the learning signal. Rectangles capture healthy tissue on all sides, making the boundary annotation misleading. Annotators trace polygon contours at 150 to 200 vertex points per diseased region.

Step by step in CVAT

Step | Action
1. Activate | Select Polygon from the Draw Shape dropdown in the left toolbar, or press N with polygon mode active.
2. Place vertices | Click each vertex around the object boundary. CVAT connects vertices in real time.
3. Close | Double-click the final point, or press N again, to close the shape.
4. Edit | Right-click the polygon to enter edit mode. Drag vertices, right-click a vertex to delete it, right-click a segment to insert a new point.
5. Intelligent scissors | Activate the magnetic lasso icon for high-contrast boundaries. Edge detection snaps vertices, reducing click count by 30 to 60 per cent on clean images.
6. Export | COCO 1.0 for segmentation models. LabelMe for per-instance JSON. Cityscapes 1.0 for pixel-indexed PNG masks.

Polyline 

Polyline annotation in CVAT AMI

Polylines share the same ordered vertex array structure as polygons, but do not close. CVAT renders them as a series of strokes with configurable width. The geometry suits linear structures with no enclosed area: lane markings, road edges, cables, conveyor belt paths, blood vessels, structural cracks, and skeletal joint connections.

Recommended EC2 instance

Workload | Instance | Notes
Standard polyline annotation | t3.medium or c5.xlarge | CPU-only, low compute demand
High-volume video polylines | c5.xlarge | Frame scrubbing at scale benefits from more vCPU

Real pipeline

Lane detection systems for highway footage use polylines per lane boundary per frame. Each line carries task-level attributes: line type (solid, dashed, double) and colour (white, yellow). The detection model SCNN takes polyline coordinates as direct supervision targets rather than rasterised masks. Attribute metadata is necessary because the model distinguishes marking type, not just position.

Step by step in CVAT

Step | Action
1. Activate | Select Polyline from the Draw Shape dropdown or press N with polyline mode active.
2. Place points | Click each point along the linear feature. Unlike polygon mode, you do not close the shape.
3. Terminate | Double-click the final point or press N.
4. Add attributes | Open Label Constructor, add an attribute to the label, set type to select or text, and define the value options. Each polyline instance carries the attribute value assigned during annotation.
5. Export | LabelMe or Datumaro format, both of which preserve custom attribute fields.

Point and skeleton

Point and skeleton keypoint annotation in CVAT AMI

Point annotations carry no area geometry. CVAT stores each point as an (x, y, label) tuple representing a single pixel coordinate with an associated label. A skeleton groups related points under one object instance and encodes a connectivity graph between them. This structure maps directly to the COCO Keypoints format.

When to use it

Points apply to keypoint annotation tasks, pose estimation, facial landmark labelling, and centroid-based density estimation. For pose estimation, the model consumes keypoint coordinates and a skeleton graph. Each joint is labelled discretely, with a visibility flag indicating whether the joint is visible, occluded, or not labelled. In the COCO convention, v=0 means the joint is not labelled, v=1 means it is labelled but occluded, and v=2 means it is labelled and fully visible.

Recommended EC2 instance

Workload | Instance | Notes
Manual keypoint placement | t3.medium or c5.xlarge | Low compute per frame
Video-based pose tracking | c5.xlarge | Multi-frame keyframe interpolation is CPU-bound

Real pipeline

Sports analytics datasets for basketball label 17 body keypoints per player, matching the COCO body schema: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles. Each person instance carries a 17-point skeleton with visibility flags. The export uses COCO 1.0 Keypoints format, encoding each instance as [x1, y1, v1, x2, y2, v2, ...].
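The flat keypoint array is easy to get wrong by one index, so a small encoder helps when validating exports. This sketch assumes the standard COCO person keypoint order and the convention that unlabelled joints are written as 0, 0, 0; the function name and the input dictionary shape are hypothetical.

```python
from typing import Dict, List, Tuple

# Standard COCO person keypoint order (17 joints).
COCO_KEYPOINT_NAMES: List[str] = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def encode_keypoints(joints: Dict[str, Tuple[float, float, int]]) -> Dict[str, object]:
    """Build the flat [x1, y1, v1, x2, y2, v2, ...] array for one instance.

    `joints` maps a joint name to (x, y, v); joints that are missing are
    treated as not labelled (v=0).
    """
    flat: List[float] = []
    labelled = 0
    for name in COCO_KEYPOINT_NAMES:
        x, y, v = joints.get(name, (0, 0, 0))
        if v not in (0, 1, 2):
            raise ValueError(f"invalid visibility flag for {name}: {v}")
        if v == 0:
            x, y = 0, 0  # coordinates carry no meaning for unlabelled joints
        else:
            labelled += 1
        flat.extend([x, y, v])
    return {"keypoints": flat, "num_keypoints": labelled}
```

The resulting array always has 51 entries (17 joints times 3 values), regardless of how many joints were actually labelled.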

Step by step in CVAT

Step | Action
1. Configure label | In Label Constructor, define a label for the object class and configure a skeleton by adding each keypoint as a sublabel with its name.
2. Activate | Press N with Points mode selected. Click to place each point.
3. Skeleton mode | CVAT prompts for each joint in sequence and connects them per the defined graph.
4. Visibility flags | Verify flags in the Objects panel. Set the occluded flag for hidden joints.
5. Export | COCO 1.0 Keypoints format. Output JSON carries a keypoints array per instance and a skeleton connectivity field in the category definition.

Instance segmentation 

Instance segmentation gives each individual object a unique mask. Two cars in the same image get two separate masks with separate instance IDs, even if they overlap. In CVAT, this is produced by drawing polygon or brush mask annotations, where each drawn object gets its own unique ID in the Objects panel.

When to use it

Instance segmentation applies when the model needs to count, separate, or individually process objects of the same class, particularly when those objects overlap. Mask R-CNN and panoptic segmentation architectures require per-instance masks. If objects of the same class are always spatially separated and identification of individual objects is not required, semantic segmentation is sufficient and faster to produce.

Recommended EC2 instance

Workload | Instance | Notes
Manual polygon-based instance masks | c5.xlarge | CPU-only is adequate for polygon tracing
Brush tool instance masks | c5.xlarge | Higher RAM helps with large image canvases
AI-assisted interactive segmentation | g4dn.xlarge | SiamMask requires a GPU via Nuclio
RLE export of large polygon datasets | c5.2xlarge with gp3 EBS at 3000 IOPS | I/O intensive during packaging

Real pipeline

Robotic sorting separates items on a conveyor. Items stack and partially overlap. Without instance identity, overlapping objects of the same class merge into an undifferentiated region, and the robot arm cannot resolve a spatial target. Each bottle, can, and paper unit receives its own mask. The export uses COCO 1.0 with RLE encoding. Each annotation entry in the JSON carries a category_id for the class and a unique ID for the instance.

Step by step in CVAT

Step | Action
1. Draw per object | Draw a polygon or use the Brush tool for each object individually. Each drawn annotation is a separate CVAT object with its own row in the Objects panel.
2. Verify IDs | Confirm in the Objects panel that no two objects of the same class share an ID. The row number serves as the instance ID.
3. Export setup | Enable mask export in the COCO export settings. CVAT rasterises polygons and run-length encodes them.
4. Export | COCO 1.0 with RLE. The segmentation field per annotation contains {counts, size} when RLE is active.
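To show what the {counts, size} pair holds, the sketch below run-length encodes a small binary mask in the uncompressed COCO style: pixels are read column by column, and the first count is always the number of leading zeros. This is an illustration written for this article, not CVAT's exporter; real exports typically carry a compressed string in counts (as produced by pycocotools), which this sketch does not reproduce.

```python
from typing import Dict, List


def mask_to_rle(mask: List[List[int]]) -> Dict[str, List[int]]:
    """Encode a binary mask (list of rows of 0/1) as uncompressed RLE."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    counts: List[int] = []
    previous = 0  # by convention the first run counts zeros
    run = 0
    for col in range(width):  # column-major traversal
        for row in range(height):
            value = 1 if mask[row][col] else 0
            if value != previous:
                counts.append(run)
                run = 0
                previous = value
            run += 1
    counts.append(run)
    return {"counts": counts, "size": [height, width]}


def rle_area(rle: Dict[str, List[int]]) -> int:
    """Number of foreground pixels: the sum of every second run."""
    return sum(rle["counts"][1::2])
```

Because runs alternate between background and foreground, the object's pixel area can be read straight from the counts without decoding the mask.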

Semantic segmentation 

Semantic segmentation annotation in CVAT AMI

Every pixel in an image is labelled with a class. The output is a mask where each pixel value corresponds to a class index. There are no instance IDs. Three cars in the same frame all receive the same car class index, with no per-car distinction.

When to use it

Semantic segmentation fits scene parsing tasks where class distribution across the image is the objective. Autonomous driving scene parsing, satellite land cover classification, and medical tissue classification are the primary applications. DeepLab, PSPNet, and similar networks train on per-pixel class indices without needing to distinguish individual objects.

Recommended EC2 instance

Workload | Instance | Notes
Polygon-based semantic masks | c5.xlarge | Standard CPU workload
Brush-based full-image painting | c5.xlarge | Adequate for 1024x1024 images
Large image semantic annotation, 4K+ | c5.2xlarge | Higher RAM prevents canvas rendering delays
Cityscapes export of 10,000+ images | c5.2xlarge with gp3 EBS at 3000 IOPS | Export I/O intensive

Real pipeline

Satellite imagery land cover classification labels each pixel as forest, agricultural land, water, or urban surface. No per-parcel identity is needed. The export uses Cityscapes 1.0 format, producing a PNG mask where each pixel stores an integer class index. DeepLab v3+ trains directly on these masks.

Step by step in CVAT

Step | Action
1. Configure colours | Assign a distinct RGB colour per class in Label Constructor. This colour maps to the output mask encoding.
2. Paint | Use the Brush tool for large, uniform regions. Use polygon mode for precise class boundaries.
3. Full coverage | All pixels should be covered. Uncovered pixels export as background (class index 0) in Cityscapes format.
4. Export | Cityscapes 1.0. Output includes labelIds_polygon.png (integer class index per pixel) and color.png (mapped RGB per pixel).
5. Verify mapping | Confirm the class-to-index order in your label configuration matches what the training script expects before export.
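The mapping check in the last step can be automated. The sketch below is illustrative only: it assumes masks are held as nested lists of RGB tuples, that unknown colours fall back to background index 0 as described above, and that both label orders are available as plain lists. All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

RGB = Tuple[int, int, int]


def colour_mask_to_class_indices(
    colour_mask: List[List[RGB]],
    palette: Dict[RGB, int],
    background_index: int = 0,
) -> List[List[int]]:
    """Map each RGB pixel to its class index; unknown colours become background."""
    return [
        [palette.get(tuple(pixel), background_index) for pixel in row]
        for row in colour_mask
    ]


def find_index_mismatches(
    annotation_labels: List[str], training_labels: List[str]
) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """List (index, annotation label, training label) wherever the orders differ."""
    mismatches = []
    for index in range(max(len(annotation_labels), len(training_labels))):
        left = annotation_labels[index] if index < len(annotation_labels) else None
        right = training_labels[index] if index < len(training_labels) else None
        if left != right:
            mismatches.append((index, left, right))
    return mismatches
```

An empty mismatch list means the label order used during annotation agrees with the order the training script assumes; any entry identifies the first place where class indices would be silently swapped.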

What is the brush tool and when should it replace polygons?

The Brush tool in CVAT is a freehand painting tool that directly writes pixel-level masks. It produces the same underlying mask format as polygons through a different input mechanism: painting with a brush cursor rather than placing vertices.

When to use it over polygons

The Brush tool works better than polygon tracing for three cases: large amorphous regions without clear linear edges, areas where the boundary is too irregular for efficient vertex placement, and semantic segmentation tasks requiring full pixel coverage. For instance segmentation of objects with hard edges, the polygon tool with Intelligent Scissors produces more precise boundaries. For manual polygon workflows, a c5.xlarge provides sufficient compute, while AI-assisted mask generation requires a g4dn.xlarge to support the GPU inference load.

Step by step in CVAT

Step | Action
1. Activate | Select the Brush tool from the left toolbar (mask icon).
2. Set size | Adjust the brush size from the toolbar slider. Larger brushes cover areas faster. Smaller brushes handle boundary regions.
3. Paint | Click and drag over the target object area.
4. Erase | Switch to erase mode to correct boundary overshoot. Hold Shift while painting to activate erasing temporarily.
5. Convert | Use the polygon approximation feature to convert a freehand mask to a vertex array for post-edit manipulation.
6. Export | Cityscapes 1.0 for semantic masks. COCO 1.0 with mask export enabled for instance masks.

Cuboid 

Cuboid annotation in CVAT AMI

A cuboid is a 3D bounding box that encodes spatial depth alongside 2D position. CVAT renders cuboids as perspective-projected shapes on a 2D image plane, combining a front face rectangle with edge lines projecting to a rear face to represent spatial orientation and depth extent.

When to use it

Cuboids apply when the model needs spatial depth information: autonomous driving 3D detection, robotics scene reconstruction, and augmented reality object placement. Models such as PointPillars, SECOND, and CenterPoint trained on 3D bounding box data require cuboid annotations. LiDAR point cloud datasets paired with camera images are the primary use case.

Recommended EC2 instance

Workload | Instance | Notes
Manual cuboid annotation | c5.xlarge | CPU-only adequate
LiDAR point cloud annotation (CVAT 3D model) | c5.2xlarge | Higher RAM and vCPU handles point cloud rendering
GPU-assisted 3D detection pre-annotation | g4dn.xlarge | Nuclio-based 3D model inference

Real pipeline

An autonomous vehicle dataset labels vehicles, pedestrians, and cyclists with cuboids in dashcam images paired with LiDAR scans. Each cuboid encodes position (x, y, z), dimensions (width, height, depth), and rotation (yaw angle). The model predicts full 3D bounding boxes at inference.
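The position, dimension, and yaw parameters fully determine the eight corners of the box. The sketch below works under a stated assumption that may not match a given dataset: a right-handed frame with z pointing up, yaw as a rotation about the z axis, width along the box's local x axis, and depth along its local y axis. KITTI and other formats use different axis conventions, so treat this as an illustration of the geometry rather than a converter.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def cuboid_corners(
    cx: float, cy: float, cz: float,
    width: float, height: float, depth: float,
    yaw: float,
) -> List[Point3D]:
    """Return the 8 corners of a yaw-rotated box centred at (cx, cy, cz).

    Assumed convention: z is up, yaw (radians) rotates about z,
    width lies along local x, depth along local y, height along z.
    """
    cos_yaw, sin_yaw = math.cos(yaw), math.sin(yaw)
    corners: List[Point3D] = []
    for dx in (-width / 2.0, width / 2.0):
        for dy in (-depth / 2.0, depth / 2.0):
            for dz in (-height / 2.0, height / 2.0):
                # Rotate the local offset in the ground plane, then translate.
                x = cx + dx * cos_yaw - dy * sin_yaw
                y = cy + dx * sin_yaw + dy * cos_yaw
                corners.append((x, y, cz + dz))
    return corners
```

Projecting these corners through the camera model is what produces the front-face and rear-face outline drawn on the 2D image.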

Step by step in CVAT

Step | Action
1. Activate | Select the Cuboid tool from the draw mode dropdown.
2. Draw the front face | Click to define two points: top-left and bottom-right of the front face.
3. Adjust rear face | CVAT renders the projected rear face automatically. Drag rear face handles to match depth, extent, and rotation.
4. Label | Assign a label from the predefined label set.
5. Export | KITTI format for standard 3D detection pipelines. Datumaro for full attribute preservation.

Ellipse

Ellipse annotation in CVAT AMI

A centre point, a semi-major axis, and a semi-minor axis define an ellipse. CVAT stores ellipses as (cx, cy, rx, ry). It is more precise than a bounding box for circular or oval objects and faster to draw than a polygon tracing the same shape.

When to use it

Ellipses apply to objects with roughly circular or oval geometry where a bounding box over-captures background: cell detection in microscopy, wheel detection in vehicle imagery, ball detection in sports analytics, and eye or pupil detection in facial analysis pipelines. Manual ellipse annotation requires minimal compute, making a t3.medium or c5.xlarge sufficient for the workload.

Real pipeline

A cell counting model for microscopy images uses ellipses to annotate individual cells. Each cell is approximately circular but varies slightly in aspect ratio. Bounding boxes over-capture surrounding cell fluid. Full polygon traces of circular boundaries add unnecessary annotation time. Ellipses match the shape with two clicks.

Step by step in CVAT

Step | Action
1. Activate | Select the Ellipse tool from the draw mode dropdown.
2. Draw | Click-drag from the centre of the target object. CVAT draws the ellipse dynamically.
3. Adjust axes | Drag the boundary handles to adjust the semi-major and semi-minor axes.
4. Rotate | Drag the rotation handle to align the ellipse axis with angled objects.
5. Export | COCO 1.0. The ellipse is converted to a polygon approximation with 12 to 20 vertices on export.
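The polygon approximation amounts to sampling points around the ellipse at equal parameter steps. The sketch below is one way to do that from (cx, cy, rx, ry) plus an optional rotation; the function name, the rotation parameter, and the default of 16 vertices are assumptions for illustration, and CVAT's exporter may sample differently.

```python
import math
from typing import List, Tuple


def ellipse_to_polygon(
    cx: float, cy: float, rx: float, ry: float,
    rotation_deg: float = 0.0,
    num_vertices: int = 16,
) -> List[Tuple[float, float]]:
    """Sample `num_vertices` points evenly in angle around a rotated ellipse."""
    if num_vertices < 3:
        raise ValueError("a polygon needs at least three vertices")
    theta = math.radians(rotation_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    points = []
    for i in range(num_vertices):
        angle = 2.0 * math.pi * i / num_vertices
        # Point on the axis-aligned ellipse, relative to the centre.
        x = rx * math.cos(angle)
        y = ry * math.sin(angle)
        # Rotate about the centre, then translate.
        points.append((cx + x * cos_t - y * sin_t, cy + x * sin_t + y * cos_t))
    return points
```

More vertices track the curve more closely at the cost of a larger segmentation array; the 12 to 20 range mentioned above is a reasonable compromise for small, near-circular objects.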

Tags in CVAT

Tag annotation in CVAT AMI

Tags are image-level labels that classify an entire image without assigning any geometry. No bounding box, polygon, or mask is drawn. The tag applies to the full image and exports as a classification label.

When to use tags

Tags apply to image classification tasks (positive/negative, pass/fail), content moderation labelling, multi-label classification, and as supplementary metadata alongside geometry annotations. A task can carry both geometry annotations on individual objects and a tag on the overall image. Tag-only classification tasks are best handled on a t3.medium instance, as these workloads require minimal compute and involve no geometry rendering. 

Real pipeline

A quality control pipeline labels part images as pass or fail before routing to a downstream classifier. Tags provide a single classification label per image. A second pipeline layer then applies bounding boxes to failed parts for defect localisation.

Step by step in CVAT

Step | Action
1. Define labels | In Label Constructor, define classification labels such as defect_present or no_defect.
2. Activate | Press T in the annotation canvas to activate tag mode.
3. Assign | Select the appropriate label from the dropdown. The tag appears in the Objects panel.
4. Note | Tags are not visible as shapes on the canvas. They appear only in the Objects panel and export file.
5. Export | CVAT for images 1.1 XML or Datumaro. COCO export does not carry image-level tags.

Video annotation types

Video annotation in CVAT AMI

Video annotation in CVAT introduces a time axis. Two mechanisms make dense video labelling practical at scale.

  • Tracking assigns a persistent ID to an object across its entire visible duration. An object entering at frame 1 and exiting at frame 300 carries the same track ID throughout. This is required for multi-object tracking models such as ByteTrack and DeepSORT, and for action recognition networks that need temporal object identity.

  • Interpolation reduces annotation cost on tracks with predictable motion. An annotator sets keyframes at positions where the object changes meaningfully. CVAT linearly interpolates geometry between keyframes, auto-generating all intermediate frames.

Recommended EC2 instance

Workload | Instance | Notes
Standard video annotation, up to 30fps | c5.xlarge | CPU-bound interpolation, adequate for 1080p
High-frame-rate video, 60fps or 4K | c5.2xlarge | Higher vCPU and RAM prevent canvas lag
AI-assisted tracking (SiamMask) | g4dn.xlarge | Nuclio-based tracking requires a GPU
Large video export, 1,000+ annotated clips | c5.2xlarge with gp3 EBS at 3000 IOPS | Export I/O intensive for MOT CSV generation

Real pipeline

A quality control pipeline working with 10-second clips at 30 fps uses keyframe interpolation. Setting keyframes at frame 1, frame 150 (rotation change), and frame 300 produces 300 annotated frames from 3 manual actions. CVAT fills the 297 intermediate frames.
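The arithmetic behind keyframe interpolation is straightforward to reproduce. The sketch below linearly interpolates box corners between consecutive keyframes; it assumes boxes are (xtl, ytl, xbr, ybr) tuples keyed by frame number and is a simplified model of the behaviour described above, not CVAT's implementation. The function name and the sample coordinates are made up.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (xtl, ytl, xbr, ybr)


def interpolate_track(keyframes: Dict[int, Box]) -> Dict[int, Box]:
    """Fill every frame between the first and last keyframe by linear interpolation."""
    if not keyframes:
        return {}
    frames = sorted(keyframes)
    track: Dict[int, Box] = {}
    for start, end in zip(frames, frames[1:]):
        box_a, box_b = keyframes[start], keyframes[end]
        for frame in range(start, end):
            t = (frame - start) / (end - start)  # 0.0 at start, approaching 1.0
            track[frame] = tuple(a + (b - a) * t for a, b in zip(box_a, box_b))
    track[frames[-1]] = tuple(keyframes[frames[-1]])
    return track


# Three manual keyframes, as in the 300-frame example above.
keys = {1: (0, 0, 10, 10), 150: (149, 0, 159, 10), 300: (149, 150, 159, 160)}
track = interpolate_track(keys)
generated = len(track) - len(keys)  # frames the annotator did not draw
```

With these inputs the track covers all 300 frames, of which 297 are generated rather than drawn, matching the count in the example.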

Step by step in CVAT

Step | Action
1. Create a task | Add a video or link to a hosted video. CVAT extracts frames at the specified frame rate during task creation.
2. First keyframe | Navigate to the first frame where an object appears. Draw the annotation shape. The Objects panel registers this as the first keyframe of a track.
3. Next keyframe | Jump forward on the frame slider to a position where the object has moved. Adjust the shape. CVAT marks this as the next keyframe automatically.
4. Interpolation | Frames between the two keyframes fill via linear interpolation. No manual action required.
5. Review | Press F to advance to the next keyframe. Press V to step frame by frame to inspect interpolated positions.
6. Correct | Move to a frame where interpolation diverges. Adjust the shape. CVAT registers a correction keyframe.
7. Outside | Right-click the track and set Outside status between keyframes to suppress interpolated frames where the object is not visible.
8. Export | MOT 1.1 for tracking model training. CVAT for video XML for QA re-import.

How do you choose between annotation types?

Three questions resolve most cases:

  1. Does the model need instance identity (must it distinguish two objects of the same class)?

  2. Does the model need precise object boundaries beyond a rectangle?

  3. Is the data temporal (video with object motion)?

If all three answers are no, a bounding box is sufficient and produces the highest annotation throughput. Each yes answer escalates the geometry requirement toward a polygon, instance mask, or video track.

For 3D tasks with depth information, cuboids replace bounding boxes. For circular objects where polygon tracing is inefficient, ellipses are faster. For pure classification without spatial localisation, tags are correct.
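The three questions can be written down as a small helper for project checklists. The mapping from each yes answer to a geometry is this sketch's own reading of the guidance above (boundaries to polygon, identity to instance mask, temporal data to video track), and the function name is hypothetical; cuboids, ellipses, and tags are left out because they depend on the additional conditions in the previous paragraph.

```python
from typing import List


def suggest_annotation_types(
    needs_identity: bool, needs_boundaries: bool, is_temporal: bool
) -> List[str]:
    """Turn the three yes/no answers into a list of candidate geometries."""
    if not (needs_identity or needs_boundaries or is_temporal):
        # All three answers are no: the fastest geometry is enough.
        return ["bounding box"]
    suggestions: List[str] = []
    if needs_boundaries:
        suggestions.append("polygon")
    if needs_identity:
        suggestions.append("instance mask")
    if is_temporal:
        suggestions.append("video track")
    return suggestions
```

For instance, a video task that must keep overlapping objects apart would return both an instance mask and a video track, which signals that the two requirements need to be combined in the same task.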

Export formats per annotation type

Annotation type | Export format | Format detail
Bounding box | YOLO 1.1 | One .txt per image with class cx cy w h normalised values
Bounding box | COCO 1.0 | JSON with bbox: [x, y, w, h] per annotation
Polygon | COCO 1.0 | JSON with segmentation: [[x1,y1,...]] vertex arrays
Polygon | LabelMe | Per-image JSON with vertex arrays per instance
Semantic mask | Cityscapes 1.0 | PNG mask where each pixel stores an integer class index
Instance mask | COCO 1.0 with RLE | JSON with segmentation: {counts, size} run-length encoded
Polyline | LabelMe | Per-image JSON preserving attribute values
Keypoints | COCO 1.0 Keypoints | keypoints: [x,y,v,...] array per instance with skeleton graph
Video track | MOT 1.1 | CSV with frame, id, x, y, w, h, conf, -1, -1, -1 columns
Video track | CVAT for video XML | Full keyframe and interpolation data
Cuboid | KITTI | Per-frame .txt with 3D box parameters
Ellipse | COCO 1.0 | Converted to polygon approximation on export
Tag | CVAT XML / Datumaro | Image-level classification label
Custom attributes | Datumaro | JSON with full attribute preservation per annotation
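As an example of how one of these rows translates into file content, the sketch below formats a single track as MOT-style CSV lines using the column order listed in the table. It assumes boxes are stored as (xtl, ytl, xbr, ybr) per frame and that the three trailing fields are left at -1; the function name is hypothetical, and the exact columns written by a given CVAT version should be checked against an actual export.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (xtl, ytl, xbr, ybr)


def track_to_mot_rows(track_id: int, boxes: Dict[int, Box], confidence: float = 1.0) -> List[str]:
    """Format one track as 'frame,id,x,y,w,h,conf,-1,-1,-1' lines, ordered by frame."""
    rows = []
    for frame in sorted(boxes):
        xtl, ytl, xbr, ybr = boxes[frame]
        fields = [frame, track_id, xtl, ytl, xbr - xtl, ybr - ytl, confidence, -1, -1, -1]
        rows.append(",".join(str(field) for field in fields))
    return rows
```

The track ID in the second column is what carries object identity across frames, so it must stay constant for the whole visible duration of the object.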

COCO 1.0 covers bounding boxes, polygons, RLE masks, and keypoints within one JSON schema. When a pipeline ingests multiple annotation types, a single COCO export avoids format divergence at the data loading layer. Any task that uses custom attribute fields requires Datumaro, the only format that preserves those fields in full.

CVAT on AWS AMI covers the full annotation geometry spectrum from single-pixel point coordinates to 3D spatial cuboids. Each type produces a specific data format that maps to a defined model input contract. The selection process starts with what the training framework expects, works back to the annotation geometry that produces it, and then matches the EC2 instance type to the computational profile of that workload.
