DEV Community

wellallyTech

From Pixels to Calories: Building an AI Nutritionist with GPT-4o and SAM 2 🥗

Ever tried logging your lunch in a fitness app, only to find that "Chicken Salad" has 500 different entries? 🥗 We can do better. With the rise of Multimodal LLMs and advanced Computer Vision, we can now build a pipeline that doesn't just guess, but actually analyzes the geometry and ingredients of your meal in real-time.

In this tutorial, we are going to build a "Pixels to Calories" pipeline. We’ll use Meta’s Segment Anything Model 2 (SAM 2) for precise segmentation and GPT-4o for high-level visual reasoning and nutritional estimation. This combination lets us handle complex, multi-object plates where single-label classifiers usually fail. For those interested in scaling these types of AI workflows into production-grade systems, you can find more advanced patterns and production-ready examples over at the WellAlly Tech Blog. 🥑

The Challenge: Why traditional CV isn't enough

Most nutrition apps use simple classification. They see a "round orange object" and say "Orange: 60kcal." But what if it’s a bowl of pasta with mixed toppings? We need to:

  1. Isolate each food item (Segmentation).
  2. Identify the ingredients within those segments.
  3. Estimate volume and density to calculate calories.
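
Once volume and density are known, the last step is plain arithmetic: mass = volume × density, and calories = mass × energy density. Here is a minimal sketch of that calculation. The function name and the numbers are illustrative assumptions, not reference nutrition data.

```python
def estimate_calories(volume_ml: float, density_g_per_ml: float, kcal_per_gram: float) -> float:
    """Calories from estimated volume, food density, and energy density."""
    mass_g = volume_ml * density_g_per_ml
    return mass_g * kcal_per_gram


# Illustrative values only: 200 ml of a food at 0.5 g/ml and 2.0 kcal/g
print(estimate_calories(200, 0.5, 2.0))  # 200.0
```

In the pipeline below, GPT-4o does this estimation implicitly from the image, so a helper like this is mostly useful as a sanity check on the model's numbers.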

The Architecture 🏗️

Our pipeline follows a "Segment-then-Reason" logic. Instead of sending one giant, messy image to an LLM, we segment the individual components first to provide the model with "focal points."

graph TD
    A[User Image Upload] --> B[FastAPI Endpoint]
    B --> C[SAM 2: Multi-mask Generation]
    C --> D[Object Cropping & Preprocessing]
    D --> E[GPT-4o: Multimodal Reasoning]
    E --> F[Nutritional Data Extraction]
    F --> G[JSON Output: Calories/Macros]
    G --> H[Frontend Visualization]

Prerequisites 🛠️

To follow along, you'll need:

  • PyTorch (for SAM 2 execution)
  • SAM 2 Weights (from Meta's official repo)
  • OpenAI API Key (for GPT-4o)
  • FastAPI (for the backend)

Step 1: Segmenting the Plate with SAM 2

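SAM 2 fits this job because its automatic mask generator returns a set of masks covering the distinct regions of an image, with no class labels required. Meta reports that it is both faster and more accurate than the original SAM on image segmentation, which helps when food items overlap.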

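import io

import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

# Initialize SAM 2
# These names match the base-plus variant of the original SAM 2 release;
# adjust both if you downloaded a different variant or release.
sam2_checkpoint = "sam2_hiera_base_plus.pt"
model_cfg = "sam2_hiera_b+.yaml"

device = "cuda" if torch.cuda.is_available() else "cpu"
sam2 = build_sam2(model_cfg, sam2_checkpoint, device=device)
mask_generator = SAM2AutomaticMaskGenerator(sam2)

def get_food_segments(image_source):
    # Accept either a file path or raw image bytes (e.g. from an upload)
    if isinstance(image_source, bytes):
        image_source = io.BytesIO(image_source)
    # The mask generator expects an RGB uint8 array of shape (H, W, 3)
    image = np.array(Image.open(image_source).convert("RGB"))
    masks = mask_generator.generate(image)

    # Filter masks by pixel area to remove noise (small crumbs or shadows)
    valid_masks = [m for m in masks if m['area'] > 1000]
    return valid_masks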
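
The architecture diagram includes an "Object Cropping & Preprocessing" stage that the tutorial code does not show. Below is one possible sketch of it. It assumes each mask dict carries `area` (pixels) and `bbox` in XYWH order, as the automatic mask generator produces. The helper names `padded_crop_box` and `summarize_segments`, the 10% padding, and the 0.5% area threshold are hypothetical choices for illustration.

```python
def padded_crop_box(bbox, image_w, image_h, pad_ratio=0.1):
    """Convert an XYWH box to a padded (left, top, right, bottom) box clamped to the image."""
    x, y, w, h = bbox
    pad_x, pad_y = w * pad_ratio, h * pad_ratio
    left = max(0, int(x - pad_x))
    top = max(0, int(y - pad_y))
    right = min(image_w, int(x + w + pad_x))
    bottom = min(image_h, int(y + h + pad_y))
    return left, top, right, bottom


def summarize_segments(masks, image_w, image_h, min_area_fraction=0.005):
    """Build compact, JSON-friendly metadata for each sufficiently large mask."""
    image_area = image_w * image_h
    summary = []
    for mask in masks:
        area_fraction = mask["area"] / image_area
        if area_fraction < min_area_fraction:
            continue  # relative threshold, so it scales with image resolution
        summary.append({
            "id": len(summary),
            "area_fraction": round(area_fraction, 4),
            "crop_box": padded_crop_box(mask["bbox"], image_w, image_h),
        })
    return summary


# Hand-written stand-ins for SAM 2 output (only the keys used here)
fake_masks = [
    {"area": 20000, "bbox": [100, 50, 200, 100]},
    {"area": 50, "bbox": [0, 0, 10, 5]},  # tiny speck, filtered out
]
print(summarize_segments(fake_masks, image_w=640, image_h=480))
```

Each `crop_box` tuple is in the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so the crops can be cut straight from the loaded image. A relative area threshold is used here instead of a fixed pixel count so the filter behaves the same on a phone photo and a downscaled thumbnail.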

Step 2: Prompting GPT-4o for Visual Reasoning

Now that we have our segments, we send the original image to GPT-4o along with the number of distinct segments SAM 2 found, which tells the model how many items to look for. A structured prompt, combined with the API's JSON mode, keeps the output a clean JSON object.

import json

import openai

def analyze_nutrition(image_base64, segments_metadata):
    client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

    prompt = f"""
    You are an expert nutritionist. I will provide an image of a meal and 
    segmentation data indicating {len(segments_metadata)} distinct items.

    Task:
    1. Identify each food item.
    2. Estimate the weight in grams based on visual volume.
    3. Calculate Calories, Protein, Carbs, and Fats.

    Return ONLY a JSON object with a single key "items" whose value is an
    array of objects, one per food item.
    """

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
                ],
            }
        ],
        # JSON mode returns a JSON object, not a bare array, hence the "items" wrapper
        response_format={"type": "json_object"}
    )
    # Parse the JSON text so callers get a dict instead of a raw string
    return json.loads(response.choices[0].message.content)
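
Model output is still just text, so it is worth validating before showing numbers to a user. The sketch below is one way to do that. The field names in `REQUIRED_KEYS` and the optional `"items"` wrapper are assumptions about the response shape, since the prompt above does not pin down an exact schema.

```python
import json

# Assumed schema: the prompt does not fix exact key names
REQUIRED_KEYS = {"name", "grams", "calories", "protein", "carbs", "fats"}


def parse_nutrition(raw: str) -> list[dict]:
    """Parse the model's JSON text and keep only well-formed items."""
    data = json.loads(raw)
    # Accept either a bare array or an object wrapping the array under "items"
    items = data.get("items", []) if isinstance(data, dict) else data
    valid = []
    for item in items:
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            continue  # skip items with missing fields
        numeric = [item[key] for key in REQUIRED_KEYS - {"name"}]
        if all(isinstance(v, (int, float)) and v >= 0 for v in numeric):
            valid.append(item)
    return valid


def total_calories(items: list[dict]) -> float:
    """Sum the calories of all validated items."""
    return sum(item["calories"] for item in items)


sample = (
    '{"items": [{"name": "rice", "grams": 150, "calories": 195, '
    '"protein": 4, "carbs": 42, "fats": 0.5}, {"name": "mystery"}]}'
)
items = parse_nutrition(sample)
print(len(items), total_calories(items))  # 1 195
```

Dropping malformed items silently is a design choice for the sketch; a production service would more likely log them or retry the request.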

Step 3: Wrapping it in FastAPI

We need an API to make this usable. FastAPI's async request handling suits this workload, since each request waits on both SAM 2 inference and an external OpenAI call; the blocking work should run off the event loop so other requests are not stalled.

import base64

from fastapi import FastAPI, UploadFile, File
from fastapi.concurrency import run_in_threadpool

app = FastAPI()

@app.post("/analyze-meal")
async def analyze_meal(file: UploadFile = File(...)):
    # 1. Read the uploaded image bytes
    contents = await file.read()

    # 2. Run SAM 2 (GPU intensive and blocking) in a worker thread
    masks = await run_in_threadpool(get_food_segments, contents)

    # 3. Call GPT-4o for nutritional analysis (blocking network call)
    image_base64 = base64.b64encode(contents).decode("utf-8")
    nutrition_json = await run_in_threadpool(analyze_nutrition, image_base64, masks)

    return {
        "status": "success",
        "segments_found": len(masks),
        "data": nutrition_json
    }
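
The endpoint forwards whatever bytes it receives, and `analyze_nutrition` labels every image as `image/jpeg` in its data URL. A small guard can check the file signature first and build the data URL with the matching type. This is a sketch under assumptions: the helper names and the 10 MB limit are hypothetical, and only JPEG and PNG are handled.

```python
import base64

# Leading "magic bytes" for the formats accepted in this sketch
_SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
}
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed 10 MB cap


def sniff_mime(contents: bytes) -> str:
    """Return the MIME type for a JPEG/PNG payload, or raise ValueError."""
    if len(contents) > MAX_UPLOAD_BYTES:
        raise ValueError("image too large")
    for signature, mime in _SIGNATURES.items():
        if contents.startswith(signature):
            return mime
    raise ValueError("unsupported image format")


def to_data_url(contents: bytes) -> str:
    """Encode raw image bytes as a data URL with the matching MIME type."""
    encoded = base64.b64encode(contents).decode("ascii")
    return f"data:{sniff_mime(contents)};base64,{encoded}"


print(to_data_url(b"\x89PNG\r\n\x1a\n")[:22])  # data:image/png;base64,
```

Inside the FastAPI handler, the `ValueError` could be caught and turned into an `HTTPException` with status code 400 so clients get a clear rejection instead of a server error.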

The "Official" Way to Scale 🚀

While the code above is a great starting point for a "Learning in Public" project, building a production-grade health tech app requires more:

  • Distributed Task Queues: Running SAM 2 on a GPU inside a web request is risky, because one slow inference can tie up the worker. Hand the job to a task queue such as Celery or RQ (Redis Queue).
  • Vector Databases: Store food embeddings to avoid calling GPT-4o for common, identical-looking meals.
  • Model Quantization: Running SAM 2 in FP16, or quantizing it to INT8, can reduce latency and memory use.
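
To make the embedding-cache idea concrete, here is a toy in-memory stand-in for a vector database. It is a sketch, not a production design: the `MealCache` class is hypothetical, the 3-dimensional vectors are made up, the 0.95 threshold is arbitrary, and real embeddings would come from an image embedding model that this tutorial does not cover.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MealCache:
    """Tiny in-memory stand-in for a vector database."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (embedding, nutrition_result)

    def add(self, embedding, result):
        self._entries.append((embedding, result))

    def lookup(self, embedding):
        """Return the cached result of the most similar meal, or None."""
        best_score, best_result = 0.0, None
        for cached_embedding, result in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_result = score, result
        return best_result if best_score >= self.threshold else None


cache = MealCache()
cache.add([1.0, 0.0, 0.0], {"name": "oatmeal", "calories": 150})
print(cache.lookup([0.99, 0.05, 0.0]))  # near-duplicate: cache hit
print(cache.lookup([0.0, 1.0, 0.0]))    # different meal: None
```

On a cache hit the pipeline could skip the GPT-4o call entirely. The linear scan is fine for a demo; a real vector database replaces it with an approximate nearest-neighbor index.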

For a deeper dive into these optimization strategies and to see how to integrate this with a full-stack dashboard, check out the specialized guides at wellally.tech/blog. They cover everything from LLM observability to high-performance inference.

Conclusion

Combining SAM 2 and GPT-4o bridges the gap between raw pixels and semantic understanding. We’ve moved from "this is a picture of food" to "this is 150g of grilled salmon with approximately 300 calories." 🚀

What would you build with this? A smart fridge? An automated restaurant billing system? Let me know in the comments below! 👇

Happy coding! 💻🥑
