DEV Community

wellallyTech

Posted on

🥘 From Pixels to Proteins: Mastering Calorie Estimation with GPT-4o Vision and SAM

We’ve all been there: staring at a plate of delicious pasta, trying to figure out if it's 400 calories or 800. Tracking macros is the ultimate test of human patience. Traditionally, AI nutrition tracking relied on simple classification models that often failed to distinguish between a "small snack" and a "family-sized feast."

Today, we are bridging that gap. By combining the precision of Meta’s Segment Anything Model (SAM) with the multimodal reasoning of GPT-4o Vision, we can build an automated pipeline that doesn't just recognize food—it understands volume and portion size. In this guide, we’ll explore how to leverage multimodal LLMs and image segmentation to transform a simple photo into a detailed nutritional breakdown.


🏗️ The Architecture: Logic Flow

The biggest pain point in vision-based calorie estimation is "segmentation." If the AI doesn't know where the steak ends and the mashed potatoes begin, the calorie count will be a hallucination. Our solution uses SAM to isolate food items and GPT-4o to provide the "brain."

graph TD
    A[React Native App] -->|Upload Photo| B[FastAPI Backend]
    B --> C[SAM: Precise Segmentation]
    C -->|Individual Food Masks| D[GPT-4o Vision: Reasoning]
    D -->|Context: Plate Size + Item Volume| E[Nutritional Database/Logic]
    E --> F[Final Macro Report: Protein, Fat, Carbs, kcal]
    F -->|JSON Response| A

🛠️ Prerequisites

Before we dive into the code, ensure you have the following:

  • Python 3.10+
  • OpenAI API Key (with GPT-4o access)
  • PyTorch (for running the SAM model locally)
  • FastAPI (for the backend)

🚀 Step 1: Isolating Food with Segment Anything (SAM)

First, we need to extract specific items from the image. While GPT-4o is great at seeing, it’s not (yet) a pixel-perfect segmenter. We use SAM to generate masks for every object on the plate.

import torch
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
import cv2

# Load the SAM model
sam_checkpoint = "sam_vit_h_4b8939.pth"
model_type = "vit_h"
device = "cuda" if torch.cuda.is_available() else "cpu"

sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)

mask_generator = SamAutomaticMaskGenerator(sam)

def get_food_segments(image_path):
    image = cv2.imread(image_path)
    if image is None:
        raise ValueError(f"Could not read image: {image_path}")
    # SAM expects RGB; OpenCV loads images as BGR
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Generate masks for all objects in the image
    masks = mask_generator.generate(image)
    return masks # Contains segmentation pixels, bounding boxes, etc.
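Before moving on, note that Step 2's `estimate_nutrition` takes a `segments_summary` string, while SAM returns a list of mask dicts. Here's one way to bridge the two, a hypothetical helper (not from the original post) that condenses SAM's mask metadata (`bbox` in XYWH format and `area` are real keys in `SamAutomaticMaskGenerator` output) into a short text summary:

```python
def summarize_segments(masks, top_n=5):
    """Condense SAM mask metadata into a text summary for the LLM prompt."""
    # Keep only the largest regions -- tiny masks are usually glare,
    # cutlery edges, or plate patterns rather than food.
    largest = sorted(masks, key=lambda m: m["area"], reverse=True)[:top_n]
    lines = []
    for i, m in enumerate(largest, start=1):
        x, y, w, h = m["bbox"]
        lines.append(f"Segment {i}: bbox=({x},{y},{w},{h}), area={m['area']}px")
    return "; ".join(lines)
```

Capping at the top N masks keeps the prompt small and filters out the noise segments SAM inevitably produces on busy plates.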

🧠 Step 2: Reasoning with GPT-4o Vision

Now that we have the segments, we send the original image and the segmentation data to GPT-4o. We use a structured prompt to force the model to estimate weight and volume based on visual cues (like the size of the fork or the plate).

import openai

def estimate_nutrition(image_url, segments_summary):
    client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Identify the food items in these segments: {segments_summary}. Provide the estimated weight in grams and total calories. Return as JSON."},
                    {
                        "type": "image_url",
                        "image_url": {"url": image_url},
                    },
                ],
            }
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content
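Note that `estimate_nutrition` returns a JSON *string*, not a dict. A small parsing sketch follows; the key names (`items`, `total_kcal`, `calories`) are assumptions for illustration and must match whatever schema you request in the prompt:

```python
import json

def parse_nutrition(raw_json):
    """Parse the model's JSON reply into (items, total_kcal)."""
    data = json.loads(raw_json)
    items = data.get("items", [])
    # Fall back to summing per-item calories if no total was provided.
    total = data.get("total_kcal") or sum(i.get("calories", 0) for i in items)
    return items, total
```

Even with `response_format={"type": "json_object"}` guaranteeing syntactically valid JSON, defensive parsing like this matters because the model may omit or rename fields.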

🌐 Step 3: The FastAPI Wrapper

To make this accessible to our React Native frontend, we wrap everything in a clean FastAPI endpoint.

from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/analyze-meal")
async def analyze_meal(file: UploadFile = File(...)):
    # 1. Save uploaded file
    # 2. Run SAM to get segments
    # 3. Call GPT-4o Vision for reasoning
    # 4. Return the nutritional breakdown
    # Placeholder response -- wire in get_food_segments() and
    # estimate_nutrition() from the steps above.
    return {"food_items": ["Grilled Chicken", "Quinoa"], "total_kcal": 450, "confidence": 0.89}
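To keep that response shape honest in production, you can describe it with a Pydantic model (FastAPI's own validation layer) and pass it as the endpoint's `response_model`. A minimal sketch, with field names mirroring the stub payload above (this model is not from the original post):

```python
from pydantic import BaseModel

class MealReport(BaseModel):
    food_items: list[str]
    total_kcal: int
    confidence: float

# Usage: @app.post("/analyze-meal", response_model=MealReport)
report = MealReport(
    food_items=["Grilled Chicken", "Quinoa"],
    total_kcal=450,
    confidence=0.89,
)
```

With `response_model` set, FastAPI validates every response against the schema and documents it automatically in the generated OpenAPI docs.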

💡 The "Official" Way to Build AI-Driven Health Apps

While this tutorial covers the core logic of combining vision models, building a production-ready health platform requires more than a few API calls. You need to handle data privacy (HIPAA compliance), deal with edge cases where foods are mixed (like stews), and optimize for low-latency mobile experiences.

For more advanced implementation patterns and production-ready examples of AI in healthcare, I highly recommend checking out the WellAlly Official Blog. It's a goldmine for developers looking to build robust, ethical AI solutions in the wellness space. 🥑


✨ Conclusion

The combination of GPT-4o Vision and SAM effectively solves the biggest hurdle in digital nutrition: the "hidden" calories in portion sizes. By segmenting the world first, we give the LLM the context it needs to be accurate rather than just "guessing."

What's next?

  1. Try adding a "Reference Object" (like a coin) to your photos to give the model a scale for even better volume estimation.
  2. Integrate a real-time feedback loop where users can correct the AI to improve future predictions.
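To sketch idea #1: if a coin of known diameter is visible in the photo, its SAM bounding box gives you a pixel-to-centimeter scale, which turns mask areas into real-world areas. The constant and function names below are illustrative:

```python
# A US quarter is roughly 2.4 cm in diameter (illustrative reference object).
COIN_DIAMETER_CM = 2.4

def pixels_per_cm(coin_bbox):
    """Derive a linear scale from the coin's bounding box (x, y, w, h)."""
    # The wider side of the box approximates the coin's diameter.
    _, _, w, h = coin_bbox
    return max(w, h) / COIN_DIAMETER_CM

def mask_area_cm2(mask_area_px, coin_bbox):
    """Convert a food mask's pixel area to square centimeters."""
    # Area scales with the square of the linear pixels-per-cm factor.
    scale = pixels_per_cm(coin_bbox)
    return mask_area_px / (scale ** 2)
```

Feeding that real-world area (or a volume estimate derived from it) into the GPT-4o prompt gives the model a hard constraint instead of a visual guess.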

Happy coding! If you found this useful, don't forget to heart this post and drop a comment below with your thoughts on AI in the health-tech space! 🚀💻
