Beyond Image Labels: Estimating Food Portions and Calories using Grounding DINO + SAM

Ever tried those calorie tracking apps where you have to manually search for "medium-sized chicken breast" and hope your estimate isn't off by 300%? We've all been there. Identifying a "pizza" is a solved problem in Computer Vision, but calculating the volume and macronutrients from a single 2D image is where the real engineering begins.

In this deep dive, we are moving beyond simple classification. We will leverage a Zero-shot Object Detection approach using Grounding DINO and the surgical precision of the Segment Anything Model (SAM) to build an automated pipeline for food portion estimation. Whether you are building a fitness app or an automated kitchen assistant, this pipeline represents the state-of-the-art in AI-driven nutrition tracking and automated image labeling.

For those of you looking for more production-ready patterns and advanced deep learning implementations, I suggest checking out the comprehensive guides at WellAlly Tech Blog, which served as a huge inspiration for this architectural design.

The Architecture: From Pixels to Nutrients

The biggest challenge in portion estimation is the lack of depth data. To solve this, we use a reference object (like a plate or a coin) and a multi-stage vision pipeline.

graph TD
    A[Input Image] --> B[Grounding DINO]
    B -- Text Prompt: 'food, plate' --> C[Bounding Boxes]
    C --> D[SAM - Segment Anything]
    D -- Precise Masks --> E[Geometry Engine]
    E -- Ref. Object Scaling --> F[Volume Estimation]
    F --> G[Nutrient Mapping API]
    G --> H[Final Calorie Report]

Prerequisites

To follow along, you’ll need a GPU (12GB+ VRAM recommended) and the following stack:

  • PyTorch: The backbone of our model execution.
  • Grounding DINO: For open-set object detection.
  • SAM (Segment Anything): For high-fidelity segmentation masks.
  • FastAPI: To wrap this logic into a high-performance API.

Step 1: Zero-Shot Detection with Grounding DINO

Traditional detectors like YOLO require training on specific food datasets. Grounding DINO is an open-set detector: we can detect arbitrary categories at inference time just by naming them in a text prompt.

from groundingdino.util.inference import load_model, load_image, predict

# Load the pre-trained Grounding DINO model
model = load_model("config/GroundingDINO_SwinT_OGC.py", "weights/groundingdino_swint_ogc.pth")

IMAGE_PATH = "meal.jpg"
TEXT_PROMPT = "chicken breast . broccoli . rice . plate"  # Grounding DINO expects categories separated by " . "
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model, 
    image=image, 
    caption=TEXT_PROMPT, 
    box_threshold=BOX_THRESHOLD, 
    text_threshold=TEXT_THRESHOLD
)

print(f"Detected: {phrases} with confidence {logits}")
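
Before moving on to segmentation, it's worth sanity-checking the detections visually. Grounding DINO ships an annotate helper that draws the boxes, confidence scores, and matched phrases onto the original frame (a minimal sketch; adjust the output path to your setup):

import cv2
from groundingdino.util.inference import annotate

# Draw boxes, scores, and matched phrases onto the source frame
annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("meal_annotated.jpg", annotated_frame)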

Step 2: Surgical Masking with SAM

Bounding boxes are too "blocky" for volume calculation. We need to know exactly which pixels belong to the steak vs. the plate. We pass the boxes from Grounding DINO into SAM to get a precise mask.

import torch
from torchvision.ops import box_convert
from segment_anything import SamPredictor, sam_model_registry

# Initialize SAM
sam = sam_model_registry["vit_h"](checkpoint="weights/sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)
predictor.set_image(image_source)

# Grounding DINO outputs normalized cxcywh boxes; SAM wants absolute xyxy pixels
H, W, _ = image_source.shape
boxes_xyxy = box_convert(boxes * torch.tensor([W, H, W, H]), in_fmt="cxcywh", out_fmt="xyxy")
input_boxes = predictor.transform.apply_boxes_torch(boxes_xyxy, image_source.shape[:2]).to("cuda")

masks, _, _ = predictor.predict_torch(
    point_coords=None,
    point_labels=None,
    boxes=input_boxes,
    multimask_output=False,
)
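
Each mask comes back as a boolean tensor of shape (num_boxes, 1, H, W) on the GPU. For the geometry stage, all we need is the pixel count per item (a small bridging step, assuming the mask order matches the phrase order one-to-one, which it does when the boxes come straight from the detector):

# Sum the True pixels in each mask to get per-item areas in pixels
mask_areas = {
    phrase: int(mask[0].sum().item())
    for phrase, mask in zip(phrases, masks)
}
print(mask_areas)  # e.g. {'plate': 412000, 'chicken breast': 95000, ...}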

Step 3: The "Magic" of Volume Estimation

Here is the secret sauce. By identifying a standard-sized object (like a 10-inch plate), we can calculate a Pixels-per-Metric ratio.

  1. Find the Plate's Pixel Diameter: Use the plate's mask to measure its width in pixels (e.g., the width of its bounding box).
  2. Scale Factor: scale = actual_plate_diameter / plate_diameter_px, giving centimeters per pixel (see the sketch below).
  3. Food Volume: Since we only have a 2D view, we combine the mask area with a shape heuristic based on the food type (e.g., a "spherical" assumption for an orange, a "slab" assumption for a steak).
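
Here is a minimal sketch of steps 1–2, assuming a 10-inch plate (roughly 25.4 cm) and that the literal phrase "plate" appears among the detections so we can pull its mask from the results above:

import numpy as np

PLATE_DIAMETER_CM = 25.4  # assumption: a standard 10-inch dinner plate

def compute_scale_factor(plate_mask: np.ndarray) -> float:
    """Return cm per pixel, using the plate's bounding-box width as its pixel diameter."""
    cols = np.any(plate_mask, axis=0)          # columns that contain any plate pixel
    x_min, x_max = np.where(cols)[0][[0, -1]]  # leftmost and rightmost such columns
    plate_diameter_px = x_max - x_min + 1
    return PLATE_DIAMETER_CM / plate_diameter_px

plate_mask = masks[phrases.index("plate")][0].cpu().numpy()
scale_factor = compute_scale_factor(plate_mask)

With centimeters-per-pixel in hand, volume estimation is just the masked area times an assumed height:
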
def estimate_volume(mask_pixels, label, scale_factor):
    # scale_factor is cm per pixel, so area scales with its square
    area_cm2 = mask_pixels * (scale_factor ** 2)

    # Heuristic average heights in cm (simplified; tune per food type)
    height_map = {"chicken breast": 2.5, "broccoli": 3.0, "rice": 1.5}
    avg_height = height_map.get(label, 2.0)  # fall back to 2 cm for unknown foods

    volume_cm3 = area_cm2 * avg_height
    return volume_cm3
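
The final hop from cm³ to calories goes through density and per-100g nutrition data. In production you would query Nutritionix or the USDA FoodData Central here; the density values below are illustrative placeholders, not vetted nutrition data:

# Illustrative lookup tables -- replace with a real nutrient DB in production
FOOD_DENSITY_G_PER_CM3 = {"chicken breast": 1.05, "broccoli": 0.35, "rice": 0.75}
CALORIES_PER_100G = {"chicken breast": 165, "broccoli": 35, "rice": 130}

def estimate_calories(label: str, volume_cm3: float) -> dict:
    density = FOOD_DENSITY_G_PER_CM3.get(label, 1.0)   # default: water-like density
    weight_g = volume_cm3 * density
    calories = weight_g * CALORIES_PER_100G.get(label, 100) / 100
    return {"label": label, "est_weight": f"{weight_g:.0f}g", "calories": round(calories)}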

Deploying with FastAPI

To make this useful, we wrap it in a FastAPI endpoint. This allows a mobile app to send a photo and receive a JSON payload with the nutritional breakdown.

from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/analyze-plate")
async def analyze_plate(file: UploadFile):
    image_bytes = await file.read()  # raw upload bytes; decode and feed the pipeline

    # 1. Run the Grounding DINO + SAM pipeline on the decoded image
    # 2. Calculate volumes with estimate_volume()
    # 3. Query a nutrient DB (like Nutritionix or USDA FoodData Central)

    return {
        "items": [
            {"label": "chicken breast", "est_weight": "200g", "calories": 330},
            {"label": "broccoli", "est_weight": "100g", "calories": 35}
        ],
        "total_calories": 365
    }
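
Serve it with uvicorn and any client can hit the endpoint. Here's a quick smoke test using requests (assuming the server is running on localhost:8000):

import requests

with open("meal.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/analyze-plate",
        files={"file": ("meal.jpg", f, "image/jpeg")},
    )
print(resp.json())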

Advanced Optimization & Best Practices

Estimating 3D volume from a single 2D image is an "ill-posed problem." To get production-grade accuracy, you should consider:

  • Shadow Analysis: Using shadows to estimate the height of the food.
  • Multi-view Fusion: Taking two photos from different angles.
  • Contextual Priors: Knowing that a "slice of pizza" has a standard thickness.

If you are interested in diving deeper into these advanced computer vision techniques, the WellAlly Tech Blog has an incredible series on "Depth Estimation without LiDAR" that perfectly complements this workflow.

Conclusion

Combining Grounding DINO and SAM turns the "What is in this photo?" question into "How much is in this photo?". While we are still in the early days of automated calorie estimation, this zero-shot approach drastically reduces the need for massive, labeled datasets.

Are you working on AI in the health space? Drop a comment below or share your thoughts on how you'd improve the volume estimation logic! 🚀


Happy coding! If you enjoyed this post, don't forget to ❤️ and bookmark it for your next AI project.
