CVAT AMI runs on your own EC2 infrastructure in AWS. Unlike with managed annotation platforms, there is:
No per-seat pricing
No data egress to a third-party cloud
No dependency on external availability
The AMI packages CVAT with all required services, which makes deployment a single-step process from the AWS Marketplace.
Once the instance is running, the annotation environment gives teams full control over data, workflows, and export pipelines. The quality of the output depends entirely on choosing the right annotation type for the task.
Yobitel’s CVAT AMI on AWS supports 9 distinct annotation types. Each type is associated with a specific geometry, storage format, and export schema, which must align with what downstream models expect.
Selecting the wrong type does not produce suboptimal data. It produces structurally incorrect data. A detection model trained on polygons converted to bounding boxes loses boundary precision at inference. A segmentation model trained on bounding box crops receives background pixels as part of the object region. A tracking model trained without persistent object IDs cannot learn identity continuity across frames. These are not quality issues that post-processing can fix. They are architectural mismatches between the annotation geometry and the model input contract.
The 9 annotation types map across four layers of the annotation decision:
Object geometry (rectangular, contoured, linear, point, volumetric)
Model task (detection, segmentation, pose estimation, tracking, classification)
Output format (YOLO, COCO, Cityscapes, MOT, KITTI)
Infrastructure requirement (CPU, GPU, high-IOPS storage)
This guide covers all 9 types and the EC2 instance specifications that support each type at production scale.
Bounding box
CVAT stores bounding boxes as 4 pixel coordinates:
xtl, ytl: the top-left corner
xbr, ybr: the bottom-right corner
Every box carries a label, an optional attribute set, and a confidence flag when auto-annotation is active. It is the fastest annotation type and the most widely supported across detection frameworks.
When to use it
Bounding boxes apply when the target model is an object detector such as YOLO, Faster R-CNN, or SSD. They work for objects that are upright, non-overlapping, and fill most of the rectangle. Object fill ratio is the key threshold. When the ratio of object pixels to total box pixels drops below approximately 40 per cent, the rectangle captures too much background context, which degrades localisation accuracy during training. At that threshold, a polygon or mask produces better training data.
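The fill-ratio check can be approximated offline when a rough outline of the object is available. The sketch below is a minimal illustration, not part of CVAT: it assumes the outline is a simple polygon in pixel coordinates, uses polygon area as a stand-in for the object pixel count, and applies the approximate 40 per cent threshold from the text. The function names are hypothetical.

```python
def polygon_area(vertices):
    """Area of a simple polygon [(x, y), ...] via the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def fill_ratio(vertices):
    """Object area divided by the area of its axis-aligned bounding box."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    box_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    if box_area == 0:
        raise ValueError("degenerate outline: bounding box has zero area")
    return polygon_area(vertices) / box_area


def suggest_geometry(vertices, threshold=0.40):
    """Apply the approximate threshold from the text (an assumption, not a CVAT setting)."""
    return "bounding box" if fill_ratio(vertices) >= threshold else "polygon or mask"


# A right triangle fills half of its bounding box.
triangle = [(0, 0), (100, 0), (0, 100)]
# A thin diagonal sliver fills only a small part of its box.
sliver = [(0, 0), (100, 90), (100, 100), (0, 10)]

print(suggest_geometry(triangle))
print(suggest_geometry(sliver))
```

Running a check like this over a sample of outlines gives a quick signal on whether a dataset sits mostly above or below the threshold before committing annotators to one geometry.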
Recommended EC2 instance
| Workload | Instance | Notes |
|---|---|---|
| Small tasks, up to 5,000 images | t3.medium | CPU-only, no AI-assist |
| Standard production tasks | c5.xlarge | 4 vCPU, enough for manual bounding boxes |
| AI-assisted auto-annotation | g4dn.xlarge | Required for Nuclio-based YOLO auto-detect |
Real pipeline
Product detection systems label items on warehouse conveyors with bounding boxes. The model confirms whether a SKU is present in a frame region. YOLO v8 trained on these annotations runs inference in the 40 to 50ms range per frame. Fill ratio stays above 60 per cent across most captures because SKUs are upright and belt-separated.
Step by step in CVAT
| Step | Action |
|---|---|
| 1. Open task | Navigate to http://:8080. Go to Projects, Create Task, upload images or enter an S3 path. |
| 2. Configure S3 | The IAM role requires s3:GetObject and s3:ListBucket. IAM has no separate s3:HeadObject action: HeadObject requests are authorised by s3:GetObject, so a role without it fails file enumeration. |
| 3. Draw | Press N. Click-drag across the object to define the rectangle. |
| 4. Label | Release the mouse. CVAT prompts for a label from the predefined label set. |
| 5. Adjust | Hover a box edge for the resize handle and drag. Hold Alt and drag to reposition without resizing. |
| 6. Export | Actions, Export Dataset. YOLO 1.1 for detection pipelines. COCO 1.0 for JSON-format annotations. |
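To make the difference between the two export targets concrete, the following sketch converts one box from corner form (xtl, ytl, xbr, ybr) into a YOLO-style normalised line and a COCO-style bbox list. It is an illustration of the arithmetic only, written under the assumption that coordinates are pixel values and the image size is known. The function names are hypothetical and are not CVAT APIs.

```python
def to_yolo(xtl, ytl, xbr, ybr, img_w, img_h):
    """Return (cx, cy, w, h), each normalised by the image size."""
    w = xbr - xtl
    h = ybr - ytl
    return ((xtl + w / 2) / img_w, (ytl + h / 2) / img_h, w / img_w, h / img_h)


def to_coco(xtl, ytl, xbr, ybr):
    """Return [x, y, w, h] in absolute pixels, anchored at the top-left corner."""
    return [xtl, ytl, xbr - xtl, ybr - ytl]


def yolo_line(class_index, box, img_w, img_h):
    """Format one row of a YOLO label file: class cx cy w h."""
    cx, cy, w, h = to_yolo(*box, img_w, img_h)
    return f"{class_index} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Example box in a 1000x800 image (values chosen for illustration).
box = (100, 200, 300, 400)
print(yolo_line(0, box, 1000, 800))
print(to_coco(*box))
```

The main practical difference is the anchor and the scale: YOLO is centre-based and normalised, COCO is corner-based and in pixels, so mixing them up shifts every box by half its size.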
Polygon
Polygon geometry has no fixed shape. CVAT stores polygons as ordered vertex arrays, where the shape closes automatically when the final point connects to the first. Each vertex is a pixel coordinate. The contour conforms to concave boundaries, irregular silhouettes, and partially occluded objects.
When to use it
Polygon annotation is correct for instance segmentation training. Mask R-CNN, SegFormer, and SAM fine-tuning all require per-pixel boundary masks. Concave silhouettes, overhead vehicles, garments, and agricultural plants are hard to describe with rectangles. Polygons are also necessary when objects of the same class touch or overlap, because each polygon instance carries its own ID regardless of spatial proximity.
Polygon vs bounding box
Detection models (YOLO, Faster R-CNN) only need class and location. Bounding boxes suffice. Segmentation models like Mask R-CNN and SegFormer require precise boundary geometry. Polygons are necessary. When the fill ratio drops below 40 per cent, a polygon also produces better detection data because it eliminates the background context that the rectangle would include.
Recommended EC2 instance
| Workload | Instance | Notes |
|---|---|---|
| Manual polygon tracing | c5.xlarge | Adequate for up to 20 concurrent annotators |
| Intelligent scissors (edge-snap) | c5.xlarge | CPU-only mode, no GPU needed |
| AI-assisted interactive segmentation | g4dn.xlarge | SiamMask and MobileNet models via Nuclio require a GPU |
| High-density tasks, 20,000+ images | c5.2xlarge | Higher memory prevents canvas lag on large images |
Real pipeline
Crop disease detection from drone imagery requires polygon precision. The boundary between diseased and healthy leaf tissue is part of the learning signal. Rectangles capture healthy tissue on all sides, making the boundary annotation misleading. Annotators trace polygon contours at 150 to 200 vertex points per diseased region.
Step by step in CVAT
| Step | Action |
|---|---|
| 1. Activate | Select Polygon from the Draw Shape dropdown in the left toolbar, or press N with polygon mode active. |
| 2. Place vertices | Click each vertex around the object boundary. CVAT connects vertices in real time. |
| 3. Close | Double-click the final point, or press N again, to close the shape. |
| 4. Edit | Right-click the polygon to enter edit mode. Drag vertices, right-click a vertex to delete it, right-click a segment to insert a new point. |
| 5. Intelligent scissors | Activate the magnetic lasso icon for high-contrast boundaries. Edge detection snaps vertices, reducing click count by 30 to 60 per cent on clean images. |
| 6. Export | COCO 1.0 for segmentation models. LabelMe for per-instance JSON. Cityscapes 1.0 for pixel-indexed PNG masks. |
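For orientation, here is a rough sketch of how an ordered vertex array maps onto a COCO-style annotation entry: the vertices are flattened into a single coordinate list, and the bbox and area are derived from them. This is an illustration of the schema, not CVAT's exporter. The function name and the ID values are hypothetical.

```python
def polygon_to_coco(vertices, annotation_id, image_id, category_id):
    """Build a COCO-style annotation dict from [(x, y), ...] pixel vertices."""
    flat = [coord for point in vertices for coord in point]
    xs, ys = flat[0::2], flat[1::2]

    # Shoelace formula; the polygon closes from the last vertex to the first.
    doubled_area = 0
    for i in range(len(vertices)):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % len(vertices)]
        doubled_area += x1 * y2 - x2 * y1

    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "segmentation": [flat],  # one list per polygon part
        "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
        "area": abs(doubled_area) / 2,
        "iscrowd": 0,
    }


# A concave L-shaped outline that a rectangle would describe poorly.
l_shape = [(0, 0), (40, 0), (40, 20), (20, 20), (20, 40), (0, 40)]
entry = polygon_to_coco(l_shape, annotation_id=1, image_id=1, category_id=3)
print(entry["bbox"], entry["area"])
```

In this example the polygon covers 1200 of the 1600 pixels in its bounding box, which is the background a rectangle would have included.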
Polyline
Polylines share the same ordered vertex array structure as polygons, but do not close. CVAT renders them as a series of strokes with configurable width. The geometry suits linear structures with no enclosed area: lane markings, road edges, cables, conveyor belt paths, blood vessels, structural cracks, and skeletal joint connections.
Recommended EC2 instance
| Workload | Instance | Notes |
|---|---|---|
| Standard polyline annotation | t3.medium or c5.xlarge | CPU-only, low compute demand |
| High-volume video polylines | c5.xlarge | Frame scrubbing at scale benefits from more vCPU |
Real pipeline
Lane detection systems for highway footage use polylines per lane boundary per frame. Each line carries task-level attributes: line type (solid, dashed, double) and colour (white, yellow). The detection model SCNN takes polyline coordinates as direct supervision targets rather than rasterised masks. Attribute metadata is necessary because the model distinguishes marking type, not just position.
Step by step in CVAT
| Step | Action |
|---|---|
| 1. Activate | Select Polyline from the Draw Shape dropdown or press N with polyline mode active. |
| 2. Place points | Click each point along the linear feature. Unlike polygon mode, you do not close the shape. |
| 3. Terminate | Double-click the final point or press N. |
| 4. Add attributes | Open Label Constructor, add an attribute to the label, set type to select or text, and define the value options. Each polyline instance carries the attribute value assigned during annotation. |
| 5. Export | LabelMe or Datumaro format, both of which preserve custom attribute fields. |
Point and skeleton
Point annotations carry no area geometry. CVAT stores each point as an (x, y, label) tuple representing a single pixel coordinate with an associated label. A skeleton groups related points under one object instance and encodes a connectivity graph between them. This structure maps directly to the COCO Keypoints format.
When to use it
Points apply to keypoint annotation tasks, pose estimation, facial landmark labelling, and centroid-based density estimation. For pose estimation, the model consumes keypoint coordinates and a skeleton graph. Each joint is labelled discretely, with a visibility flag indicating whether the joint is visible, occluded, or out of frame. In the COCO convention, v=0 means not labelled (for example, out of frame), v=1 means labelled but occluded, and v=2 means fully visible.
Recommended EC2 instance
| Workload | Instance | Notes |
|---|---|---|
| Manual keypoint placement | t3.medium or c5.xlarge | Low compute per frame |
| Video-based pose tracking | c5.xlarge | Multi-frame keyframe interpolation is CPU-bound |
Real pipeline
Sports analytics datasets for basketball label 17 body keypoints per player, matching the COCO body schema: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles. Each person instance carries a 17-point skeleton with visibility flags. The export uses COCO 1.0 Keypoints format, encoding each instance as [x1, y1, v1, x2, y2, v2, ...].
Step by step in CVAT
| Step | Action |
|---|---|
| 1. Configure label | In Label Constructor, define a label for the object class and configure a skeleton by adding each keypoint as a sublabel with its name. |
| 2. Activate | Press N with Points mode selected. Click to place each point. |
| 3. Skeleton mode | CVAT prompts for each joint in sequence and connects them per the defined graph. |
| 4. Visibility flags | Verify flags in the Objects panel. Set the occluded flag for hidden joints. |
| 5. Export | COCO 1.0 Keypoints format. Output JSON carries a keypoints array per instance and a skeleton connectivity field in the category definition. |
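The flat keypoints array is easier to reason about with a small example. The sketch below packs a partial set of joints into the [x, y, v, ...] layout using the standard COCO 17-keypoint order. It is a minimal illustration, not CVAT's exporter: the helper name and the sample coordinates are hypothetical, and joints that are not supplied are written as 0, 0, 0.

```python
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def encode_keypoints(joints):
    """joints maps a keypoint name to (x, y, v); returns (flat_list, num_keypoints)."""
    flat = []
    for name in COCO_KEYPOINTS:
        x, y, v = joints.get(name, (0, 0, 0))
        if v not in (0, 1, 2):
            raise ValueError(f"{name}: visibility must be 0, 1 or 2, got {v}")
        flat.extend([x, y, v])
    # COCO counts a keypoint as labelled when its visibility flag is above 0.
    num_keypoints = sum(1 for v in flat[2::3] if v > 0)
    return flat, num_keypoints


# Two labelled joints: a visible nose and an occluded left wrist.
sample = {"nose": (320, 110, 2), "left_wrist": (250, 300, 1)}
keypoints, count = encode_keypoints(sample)
print(len(keypoints), count)
```

Because position in the array is the only thing that identifies a joint, the label order configured in the skeleton has to match the order the training code expects.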
Instance segmentation
Instance segmentation provides each individual object with a unique mask. Two cars in the same image get two separate masks with separate instance IDs, even if they overlap. In CVAT, this is produced by drawing polygon or brush mask annotations where each drawn object gets its own unique ID in the Objects panel.
When to use it
Instance segmentation applies when the model needs to count, separate, or individually process objects of the same class, particularly when those objects overlap. Mask R-CNN and panoptic segmentation architectures require per-instance masks. If objects of the same class are always spatially separated and identification of individual objects is not required, semantic segmentation is sufficient and faster to produce.
Recommended EC2 instance
| Workload | Instance | Notes |
|---|---|---|
| Manual polygon-based instance masks | c5.xlarge | CPU-only is adequate for polygon tracing |
| Brush tool instance masks | c5.xlarge | Higher RAM helps with large image canvases |
| AI-assisted interactive segmentation | g4dn.xlarge | SiamMask requires a GPU via Nuclio |
| RLE export of large polygon datasets | c5.2xlarge with gp3 EBS at 3000 IOPS | I/O intensive during packaging |
Real pipeline
Robotic sorting separates items on a conveyor. Items stack and partially overlap. Without instance identity, overlapping objects of the same class merge into an undifferentiated region, and the robot arm cannot resolve a spatial target. Each bottle, can, and paper unit receives its own mask. The export uses COCO 1.0 with RLE encoding. Each annotation entry in the JSON carries a category_id for class and a unique ID for instance.
Step by step in CVAT
| Step | Action |
|---|---|
| 1. Draw per object | Draw a polygon or use the Brush tool for each object individually. Each drawn annotation is a separate CVAT object with its own row in the Objects panel. |
| 2. Verify IDs | Confirm in the Objects panel that no two objects of the same class share an ID. The row number is the instance ID. |
| 3. Export setup | Enable mask export in the COCO export settings. CVAT rasterises polygons and run-length encodes them. |
| 4. Export | COCO 1.0 with RLE. The segmentation field per annotation contains {counts, size} when RLE is active. |
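The {counts, size} structure is simple to reproduce by hand. The sketch below encodes a small binary mask in the uncompressed COCO RLE form: pixels are read in column-major order, and the counts alternate between runs of 0 and runs of 1, starting with 0. It is an illustration only. Real exports often store counts as a compressed string, and the function names here are hypothetical.

```python
def mask_to_rle(mask):
    """Encode a binary mask (list of rows of 0/1) as uncompressed COCO-style RLE."""
    height, width = len(mask), len(mask[0])
    counts = []
    previous, run = 0, 0  # the first count always describes a run of zeros
    for col in range(width):          # column-major traversal
        for row in range(height):
            value = mask[row][col]
            if value != previous:
                counts.append(run)
                previous, run = value, 0
            run += 1
    counts.append(run)
    return {"counts": counts, "size": [height, width]}


def rle_to_mask(rle):
    """Decode the structure produced by mask_to_rle back into rows of 0/1."""
    height, width = rle["size"]
    flat = []
    value = 0
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    return [[flat[col * height + row] for col in range(width)] for row in range(height)]


tiny_mask = [
    [0, 1, 1],
    [0, 1, 0],
]
rle = mask_to_rle(tiny_mask)
print(rle)
```

A mask whose first pixel is 1 starts with a zero-length run, which is why decoders must not assume the first count is positive.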
Semantic segmentation
Semantic segmentation labels every pixel in an image with a class. The output is a mask where each pixel value corresponds to a class index. There are no instance IDs. Three cars in the same frame all receive the same car class index, with no per-car distinction.
When to use it
Semantic segmentation fits scene parsing tasks where class distribution across the image is the objective. Autonomous driving scene parsing, satellite land cover classification, and medical tissue classification are the primary applications. DeepLab, PSPNet, and similar networks train on per-pixel class indices without needing to distinguish individual objects.
Recommended EC2 instance
| Workload | Instance | Notes |
|---|---|---|
| Polygon-based semantic masks | c5.xlarge | Standard CPU workload |
| Brush-based full-image painting | c5.xlarge | Adequate for 1024x1024 images |
| Large image semantic annotation, 4K+ | c5.2xlarge | Higher RAM prevents canvas rendering delays |
| Cityscapes export of 10,000+ images | c5.2xlarge with gp3 EBS at 3000 IOPS | Export I/O intensive |
Real pipeline
Satellite imagery land cover classification labels each pixel as forest, agricultural land, water, or urban surface. No per-parcel identity is needed. The export uses Cityscapes 1.0 format, producing a PNG mask where each pixel stores an integer class index. DeepLab v3+ trains directly on these masks.
Step by step in CVAT
| Step | Action |
|---|---|
| 1. Configure colours | Assign a distinct RGB colour per class in Label Constructor. This colour maps to the output mask encoding. |
| 2. Paint | Use the Brush tool for large, uniform regions. Use polygon mode for precise class boundaries. |
| 3. Full coverage | All pixels should be covered. Uncovered pixels export as background (class index 0) in Cityscapes format. |
| 4. Export | Cityscapes 1.0. Output includes labelIds_polygon.png (integer class index per pixel) and color.png (mapped RGB per pixel). |
| 5. Verify mapping | Confirm the class-to-index order in your label configuration matches what the training script expects before export. |
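One way to sanity-check the mapping is to convert a colour mask into class indices with the same ordering the training script uses, and fail loudly on any colour that has no class. The sketch below is a minimal illustration with a hypothetical land cover palette and class order. The colours and names are assumptions for the example, not CVAT defaults.

```python
# Hypothetical palette and class order for the land cover example.
LABEL_COLOURS = {
    "background": (0, 0, 0),
    "forest": (0, 128, 0),
    "agricultural_land": (255, 255, 0),
    "water": (0, 0, 255),
    "urban": (128, 128, 128),
}
CLASS_ORDER = ["background", "forest", "agricultural_land", "water", "urban"]


def colour_mask_to_indices(colour_rows, label_colours, class_order):
    """Map each RGB pixel to the index of its class in class_order."""
    index_of = {label_colours[name]: index for index, name in enumerate(class_order)}
    if len(index_of) != len(class_order):
        raise ValueError("two classes share the same colour")
    indexed = []
    for y, row in enumerate(colour_rows):
        indexed_row = []
        for x, pixel in enumerate(row):
            if pixel not in index_of:
                raise ValueError(f"pixel ({x}, {y}) has unmapped colour {pixel}")
            indexed_row.append(index_of[pixel])
        indexed.append(indexed_row)
    return indexed


colour_mask = [
    [(0, 0, 0), (0, 128, 0)],
    [(0, 0, 255), (0, 128, 0)],
]
print(colour_mask_to_indices(colour_mask, LABEL_COLOURS, CLASS_ORDER))
```

If the training script assumes a different class order, the indices still look valid but refer to the wrong classes, which is why the check belongs before export rather than after training.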
What is the brush tool and when should it replace polygons?
The Brush tool in CVAT is a freehand painting tool that directly writes pixel-level masks. It produces the same underlying mask format as polygons through a different input mechanism: painting with a brush cursor rather than placing vertices.
When to use it over polygons
The Brush tool works better than polygon tracing in three cases: large amorphous regions without clear linear edges, areas where the boundary is too irregular for efficient vertex placement, and semantic segmentation tasks requiring full pixel coverage. For instance segmentation of objects with hard edges, the polygon tool with Intelligent Scissors gives more precise boundaries. For manual brush and polygon workflows, a c5.xlarge provides sufficient compute, while AI-assisted mask generation requires a g4dn.xlarge to support the GPU inference load.
Step by step in CVAT
| Step | Action |
|---|---|
| 1. Activate | Select the Brush tool from the left toolbar (mask icon). |
| 2. Set size | Adjust the brush size from the toolbar slider. Larger brushes cover areas faster. Smaller brushes handle boundary regions. |
| 3. Paint | Click and drag over the target object area. |
| 4. Erase | Switch to erase mode to correct boundary overshoot. Hold Shift while painting to activate erasing temporarily. |
| 5. Convert | Use the polygon approximation feature to convert a freehand mask to a vertex array for post-edit manipulation. |
| 6. Export | Cityscapes 1.0 for semantic masks. COCO 1.0 with mask export enabled for instance masks. |
Cuboid
A cuboid is a 3D bounding box that encodes spatial depth alongside 2D position. CVAT renders cuboids as perspective-projected shapes on a 2D image plane, combining a front face rectangle with edge lines projecting to a rear face to represent spatial orientation and depth extent.
When to use it
Cuboids apply when the model needs spatial depth information: autonomous driving 3D detection, robotics scene reconstruction, and augmented reality object placement. Models such as PointPillars, SECOND, and CenterPoint trained on 3D bounding box data require cuboid annotations. LiDAR point cloud datasets paired with camera images are the primary use case.
Recommended EC2 instance
| Workload | Instance | Notes |
|---|---|---|
| Manual cuboid annotation | c5.xlarge | CPU-only adequate |
| LiDAR point cloud annotation (CVAT 3D mode) | c5.2xlarge | Higher RAM and vCPU handle point cloud rendering |
| GPU-assisted 3D detection pre-annotation | g4dn.xlarge | Nuclio-based 3D model inference |
Real pipeline
An autonomous vehicle dataset labels vehicles, pedestrians, and cyclists with cuboids in dashcam images paired with LiDAR scans. Each cuboid encodes position (x, y, z), dimensions (width, height, depth), and rotation (yaw angle). The model predicts full 3D bounding boxes at inference.
Step by step in CVAT
| Step | Action |
|---|---|
| 1. Activate | Select the Cuboid tool from the draw mode dropdown. |
| 2. Draw the front face | Click to define two points: top-left and bottom-right of the front face. |
| 3. Adjust rear face | CVAT renders the projected rear face automatically. Drag rear face handles to match depth, extent, and rotation. |
| 4. Label | Assign a label from the predefined label set. |
| 5. Export | KITTI format for standard 3D detection pipelines. Datumaro for full attribute preservation. |
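For readers who have not worked with KITTI labels, each object is one space-separated line with 15 fields: type, truncation, occlusion, observation angle, the 2D box (left, top, right, bottom), the 3D dimensions (height, width, length), the 3D location (x, y, z), and the yaw rotation. The sketch below formats such a line from example values. It illustrates the field order only. Which fields a given CVAT export actually populates depends on the task, and the helper name and sample numbers are hypothetical.

```python
def kitti_label_line(label, bbox_2d, dimensions_hwl, location_xyz, rotation_y,
                     truncated=0.0, occluded=0, alpha=0.0):
    """Format one object as a KITTI-style label line (15 space-separated fields)."""
    fields = [label, f"{truncated:.2f}", str(occluded), f"{alpha:.2f}"]
    fields += [f"{value:.2f}" for value in bbox_2d]         # left, top, right, bottom (pixels)
    fields += [f"{value:.2f}" for value in dimensions_hwl]  # height, width, length (metres)
    fields += [f"{value:.2f}" for value in location_xyz]    # x, y, z in camera coordinates
    fields.append(f"{rotation_y:.2f}")                      # yaw around the vertical axis
    return " ".join(fields)


line = kitti_label_line(
    "Car",
    bbox_2d=(100, 120, 300, 260),
    dimensions_hwl=(1.5, 1.6, 4.2),
    location_xyz=(2.0, 1.4, 15.0),
    rotation_y=0.1,
    alpha=-0.2,
)
print(line)
```

Note that KITTI orders dimensions as height, width, length, so a cuboid stored as width, height, depth needs reordering before it is written.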
Ellipse
A centre point, a semi-major axis, and a semi-minor axis define an ellipse. CVAT stores ellipses as (cx, cy, rx, ry). It is more precise than a bounding box for circular or oval objects and faster to draw than a polygon tracing the same shape.
When to use it
Ellipses apply to objects with roughly circular or oval geometry where a bounding box over-captures background: cell detection in microscopy, wheel detection in vehicle imagery, ball detection in sports analytics, and eye or pupil detection in facial analysis pipelines. Manual ellipse annotation requires minimal compute, making a t3.medium or c5.xlarge sufficient for the workload.
Real pipeline
A cell counting model for microscopy images uses ellipses to annotate individual cells. Each cell is approximately circular but varies slightly in aspect ratio. Bounding boxes over-capture surrounding cell fluid. Full polygon traces of circular boundaries add unnecessary annotation time. Ellipses match the shape with two clicks.
Step by step in CVAT
| Step | Action |
|---|---|
| 1. Activate | Select the Ellipse tool from the draw mode dropdown. |
| 2. Draw | Click-drag from the centre of the target object. CVAT draws the ellipse dynamically. |
| 3. Adjust axes | Drag the boundary handles to adjust the semi-major and semi-minor axes. |
| 4. Rotate | Drag the rotation handle to align the ellipse axis with angled objects. |
| 5. Export | COCO 1.0. The ellipse is converted to a polygon approximation with 12 to 20 vertices on export. |
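The polygon approximation can be pictured as sampling points at equal angles around the ellipse. The sketch below does this for an ellipse stored as (cx, cy, rx, ry) with an optional rotation. It is a generic geometric illustration, not CVAT's conversion routine: the function name, the default of 16 vertices, and the rotation convention are assumptions.

```python
import math


def ellipse_to_polygon(cx, cy, rx, ry, rotation_deg=0.0, num_vertices=16):
    """Sample num_vertices points at equal parameter angles around the ellipse."""
    theta = math.radians(rotation_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    points = []
    for i in range(num_vertices):
        angle = 2 * math.pi * i / num_vertices
        # Point on the axis-aligned ellipse, relative to its centre.
        x = rx * math.cos(angle)
        y = ry * math.sin(angle)
        # Rotate around the centre, then translate into image coordinates.
        points.append((cx + x * cos_t - y * sin_t, cy + x * sin_t + y * cos_t))
    return points


cell_outline = ellipse_to_polygon(cx=50, cy=40, rx=20, ry=10)
print(len(cell_outline), cell_outline[0])
```

More vertices track the curve more closely at the cost of a larger segmentation array, so the vertex count is a trade-off between boundary fidelity and file size.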
Tags in CVAT
Tags are image-level labels that classify an entire image without assigning any geometry. No bounding box, polygon, or mask is drawn. The tag applies to the full image and exports as a classification label.
When to use tags
Tags apply to image classification tasks (positive/negative, pass/fail), content moderation labelling, multi-label classification, and as supplementary metadata alongside geometry annotations. A task can carry both geometry annotations on individual objects and a tag on the overall image. Tag-only classification tasks are best handled on a t3.medium instance, as these workloads require minimal compute and involve no geometry rendering.
Real pipeline
A quality control pipeline labels part images as pass or fail before routing to a downstream classifier. Tags provide a single classification label per image. A second pipeline layer then applies bounding boxes to failed parts for defect localisation.
Step by step in CVAT
| Step | Action |
|---|---|
| 1. Define labels | In Label Constructor, define classification labels such as defect_present or no_defect. |
| 2. Activate | Press T in the annotation canvas to activate tag mode. |
| 3. Assign | Select the appropriate label from the dropdown. The tag appears in the Objects panel. |
| 4. Note | Tags are not visible as shapes on the canvas. They appear only in the Objects panel and export file. |
| 5. Export | CVAT for images 1.1 XML or Datumaro. COCO export does not carry image-level tags. |
Video annotation types
Video annotation in CVAT introduces a time axis. Two mechanisms make dense video labelling practical at scale.
Tracking assigns a persistent ID to an object across its entire visible duration. An object entering at frame 1 and exiting at frame 300 carries the same track ID throughout. This is required for multi-object tracking models such as ByteTrack and DeepSORT, and for action recognition networks that need temporal object identity.
Interpolation reduces annotation cost on tracks with predictable motion. An annotator sets keyframes at positions where the object changes meaningfully. CVAT linearly interpolates geometry between keyframes, auto-generating all intermediate frames.
Recommended EC2 instance
| Workload | Instance | Notes |
|---|---|---|
| Standard video annotation, up to 30fps | c5.xlarge | CPU-bound interpolation, adequate for 1080p |
| High-frame-rate video, 60fps or 4K | c5.2xlarge | Higher vCPU and RAM prevent canvas lag |
| AI-assisted tracking (SiamMask) | g4dn.xlarge | Nuclio-based tracking requires a GPU |
| Large video export, 1,000+ annotated clips | c5.2xlarge with gp3 EBS at 3000 IOPS | Export I/O intensive for MOT CSV generation |
Real pipeline
A quality control pipeline annotating 10-second clips at 30 fps uses keyframe interpolation. Setting keyframes at frame 1, frame 150 (rotation change), and frame 300 yields 300 annotated frames from 3 manual actions. CVAT fills the 297 intermediate frames.
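The arithmetic behind this example can be sketched as plain linear interpolation of box corners between consecutive keyframes. The code below is a simplified model of that idea for axis-aligned boxes, not CVAT's implementation, which also handles other shapes and track states. The function name and the box coordinates are hypothetical.

```python
def interpolate_track(keyframes):
    """keyframes maps frame number to (xtl, ytl, xbr, ybr); returns a box for every frame."""
    frames = sorted(keyframes)
    track = {}
    for start, end in zip(frames, frames[1:]):
        box_a, box_b = keyframes[start], keyframes[end]
        for frame in range(start, end):
            t = (frame - start) / (end - start)  # 0 at the start keyframe
            track[frame] = tuple(a + t * (b - a) for a, b in zip(box_a, box_b))
    last = frames[-1]
    track[last] = tuple(float(value) for value in keyframes[last])
    return track


# Three manual keyframes, matching the frame numbers in the example.
keyframes = {
    1: (100, 100, 200, 200),
    150: (249, 100, 349, 200),
    300: (249, 250, 349, 350),
}
track = interpolate_track(keyframes)
print(len(track), len(track) - len(keyframes))
```

Linear interpolation only holds while motion between keyframes is close to uniform, which is why an extra keyframe is placed wherever the object changes direction or speed.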
Step by step in CVAT
| Step | Action |
|---|---|
| 1. Create a task | Add a video or link to a hosted video. CVAT extracts frames at the specified frame rate during task creation. |
| 2. First keyframe | Navigate to the first frame where an object appears. Draw the annotation shape. The Objects panel registers this as the first keyframe of a track. |
| 3. Next keyframe | Jump forward on the frame slider to a position where the object has moved. Adjust the shape. CVAT marks this as the next keyframe automatically. |
| 4. Interpolation | Frames between the two keyframes fill via linear interpolation. No manual action required. |
| 5. Review | Press F to advance to the next keyframe. Press V to step frame by frame to inspect interpolated positions. |
| 6. Correct | Move to a frame where interpolation diverges. Adjust the shape. CVAT registers a correction keyframe. |
| 7. Outside | Right-click the track and set Outside status between keyframes to suppress interpolated frames where the object is not visible. |
| 8. Export | MOT 1.1 for tracking model training. CVAT for video XML for QA re-import. |
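As a rough picture of what a tracking export contains, the sketch below writes per-frame boxes as MOT-style CSV rows using the column layout this article lists for MOT 1.1: frame, id, x, y, w, h, conf, followed by three placeholder columns set to -1. It is an illustration under that assumption, not CVAT's exporter. The function names and sample tracks are hypothetical.

```python
import csv
import io


def mot_rows(track_id, boxes, conf=1):
    """Yield one MOT-style row per frame for a track of (xtl, ytl, xbr, ybr) boxes."""
    for frame in sorted(boxes):
        xtl, ytl, xbr, ybr = boxes[frame]
        yield [frame, track_id, xtl, ytl, xbr - xtl, ybr - ytl, conf, -1, -1, -1]


def write_mot(tracks):
    """tracks maps track ID to {frame: box}; returns CSV text sorted by frame, then ID."""
    rows = [row for track_id, boxes in tracks.items() for row in mot_rows(track_id, boxes)]
    rows.sort(key=lambda row: (row[0], row[1]))
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


tracks = {
    1: {1: (10, 20, 50, 80), 2: (12, 20, 52, 80)},
    2: {2: (100, 100, 140, 160)},
}
mot_text = write_mot(tracks)
print(mot_text)
```

The track ID in the second column is what carries identity across frames, so it must stay constant for an object from its first keyframe to its last.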
How do you choose between annotation types?
Three questions resolve most cases:
Does the model need instance identity (must it distinguish two objects of the same class)?
Does the model need precise object boundaries beyond a rectangle?
Is the data temporal (video with object motion)?
If all three answers are no, a bounding box is sufficient and produces the highest annotation throughput. Each yes answer escalates the geometry requirement toward a polygon, instance mask, or video track.
For 3D tasks with depth information, cuboids replace bounding boxes. For circular objects where polygon tracing is inefficient, ellipses are faster. For pure classification without spatial localisation, tags are correct.
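The decision rules above can be written down as a small helper. The sketch below is one possible reading of them, not a definitive procedure: the order in which each yes answer escalates the geometry, the returned labels, and the function name are all assumptions made for illustration.

```python
def choose_annotation_type(instance_identity, precise_boundary, temporal,
                           depth=False, circular=False, classification_only=False):
    """Suggest an annotation type from the three questions plus the special cases."""
    if classification_only:
        return "tag"
    if depth:
        shape = "cuboid"
    else:
        shape = "bounding box"
        if precise_boundary:
            shape = "ellipse" if circular else "polygon"
        if instance_identity:
            shape = "instance mask"
    # Temporal data wraps whichever shape was chosen in a track.
    return f"{shape} track" if temporal else shape


print(choose_annotation_type(False, False, False))
print(choose_annotation_type(True, True, False))
print(choose_annotation_type(False, False, True))
```

Real projects usually need more nuance than three booleans, but encoding the rules this way makes it easy to record why a given task was assigned its geometry.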
Export formats per annotation type
| Annotation type | Export format | Format detail |
|---|---|---|
| Bounding box | YOLO 1.1 | One .txt per image with class cx cy w h normalised values |
| Bounding box | COCO 1.0 | JSON with bbox: [x, y, w, h] per annotation |
| Polygon | COCO 1.0 | JSON with segmentation: [[x1,y1,...]] vertex arrays |
| Polygon | LabelMe | Per-image JSON with vertex arrays per instance |
| Semantic mask | Cityscapes 1.0 | PNG mask where each pixel stores an integer class index |
| Instance mask | COCO 1.0 with RLE | JSON with segmentation: {counts, size} run-length encoded |
| Polyline | LabelMe | Per-image JSON preserving attribute values |
| Keypoints | COCO 1.0 Keypoints | keypoints: [x,y,v,...] array per instance with skeleton graph |
| Video track | MOT 1.1 | CSV with frame, id, x, y, w, h, conf, -1, -1, -1 columns |
| Video track | CVAT for video XML | Full keyframe and interpolation data |
| Cuboid | KITTI | Per-frame .txt with 3D box parameters |
| Ellipse | COCO 1.0 | Converted to polygon approximation on export |
| Tag | CVAT XML / Datumaro | Image-level classification label |
| Custom attributes | Datumaro | JSON with full attribute preservation per annotation |
COCO 1.0 covers bounding boxes, polygons, RLE masks, and keypoints within one JSON schema. When a pipeline ingests multiple annotation types, a single COCO export avoids format divergence at the data loading layer. Any task that uses custom attribute fields requires Datumaro, the only format that preserves those fields in full.
CVAT on AWS AMI covers the full annotation geometry spectrum from single-pixel point coordinates to 3D spatial cuboids. Each type produces a specific data format that maps to a defined model input contract. The selection process starts with what the training framework expects, works back to the annotation geometry that produces it, and then matches the EC2 instance type to the computational profile of that workload.