Stop Guessing Your Macros: Building a High-Precision Calorie Tracker with SAM & GPT-4o 🥗🚀

We've all been there. You take a photo of your lunch, upload it to a fitness app, and it tells you your "Chicken Caesar Salad" is 300 calories. But wait: did it account for the extra parmesan? The croutons? The hidden lake of dressing at the bottom?

Most current food tracking apps fail because they treat a meal as a single, flat object. To get truly high-precision calorie estimation, we need to move from "image-level" classification to "instance-level" understanding.

In this tutorial, we’re going to build a cutting-edge Multimodal AI pipeline using Meta’s Segment Anything Model (SAM) for precise food segmentation and GPT-4o for granular nutritional analysis. This is the future of Computer Vision in health tech.


🏗️ The Architecture

To achieve granular precision, our pipeline doesn't just "look" at the photo. It segments the plate into individual components, analyzes them separately, and then aggregates the data.

graph TD
    A[React Native App] -->|Upload Photo| B[FastAPI Backend]
    B --> C[SAM: Instance Segmentation]
    C -->|Segmented Masks| D[Image Cropping & Preprocessing]
    D -->|Individual Food Items| E[GPT-4o Vision API]
    E -->|JSON: Macros & Weight Est.| F[Post-processing & Aggregation]
    F -->|Detailed Report| G[User Dashboard]

    style E fill:#f96,stroke:#333,stroke-width:2px
    style C fill:#69f,stroke:#333,stroke-width:2px

Why this stack?

  • SAM (Segment Anything Model): Perfect for identifying boundaries of overlapping food items (e.g., beans over rice).
  • GPT-4o: Currently the gold standard for Multimodal reasoning. It can estimate volume and density better than smaller specialized models.
  • FastAPI: For high-performance, asynchronous processing of heavy vision tasks.
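
To make the flow concrete before we dive in, here's a rough orchestration sketch. None of these helpers exist yet: get_food_segments and analyze_food_with_gpt4o are built in Steps 1 and 2, while crop_segment and aggregate_items are small illustrative utilities sketched later in the post.

# Rough end-to-end flow; every helper here is defined or sketched in the steps below.
def analyze_plate(image_rgb):
    # 1. SAM proposes one mask per distinct item on the plate
    masks = get_food_segments(image_rgb)

    # 2. Crop each item and ask GPT-4o for per-item macro estimates
    items = []
    for mask in masks:
        crop_b64 = crop_segment(image_rgb, mask)
        items.extend(analyze_food_with_gpt4o(crop_b64).items)

    # 3. Merge the per-item estimates into a single meal report
    return aggregate_items(items)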

🛠️ Step 1: Segmenting the Plate with SAM

First, we need to isolate the components. Using segment-anything's automatic mask generator, we can produce a mask for every distinct object on the plate.

from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Load the SAM model (ViT-H checkpoint)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def get_food_segments(image):
    # image: RGB uint8 array of shape (H, W, 3)
    # Automatic mask generation proposes a mask for every distinct region,
    # so overlapping items (e.g. beans over rice) get separate masks.
    masks = mask_generator.generate(image)

    # Largest regions first, so the main components come before crumbs and noise
    return sorted(masks, key=lambda m: m["area"], reverse=True)

🧠 Step 2: Granular Inference with GPT-4o

Once we have the masks, we crop the original image to focus on specific ingredients. We then send these crops (or the whole image with highlighted segments) to GPT-4o using Pydantic for structured output.
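
As a bridge between the two models, here's a minimal cropping sketch. It assumes the masks come from SAM's automatic mask generator (so each one carries a bbox in [x, y, w, h] format and a boolean segmentation array); crop_segment is an illustrative helper name, not a library function.

import base64
import cv2

def crop_segment(image_rgb, mask):
    # SamAutomaticMaskGenerator returns 'bbox' as [x, y, w, h]
    x, y, w, h = [int(v) for v in mask["bbox"]]
    crop = image_rgb[y:y + h, x:x + w]

    # Zero out pixels outside the mask so GPT-4o only sees this one item
    region = mask["segmentation"][y:y + h, x:x + w]
    crop = crop * region[..., None]

    # Encode as a base64 JPEG (OpenCV expects BGR when encoding)
    _, buf = cv2.imencode(".jpg", cv2.cvtColor(crop, cv2.COLOR_RGB2BGR))
    return base64.b64encode(buf.tobytes()).decode("utf-8")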

💡 Pro-Tip: For production-grade AI patterns like this, I highly recommend checking out the deep dives over at wellally.tech/blog. They have some incredible resources on scaling Vision-Language Models (VLM) that helped shape this implementation.

import openai
from pydantic import BaseModel
from typing import List

class FoodItem(BaseModel):
    name: str
    estimated_weight_grams: float
    calories: int
    protein: float
    carbs: float
    fats: float
    confidence_score: float

class MealAnalysis(BaseModel):
    items: List[FoodItem]
    total_calories: int

def analyze_food_with_gpt4o(image_b64):
    client = openai.OpenAI()

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a professional nutritionist. Analyze the segmented food items and estimate their nutritional value based on volume and density."
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Identify the food in these segments and provide macro estimates."},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}}
                ]
            }
        ],
        response_format=MealAnalysis,
    )
    return response.choices[0].message.parsed
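
If each segment is analyzed in its own request, the per-item results still need to be merged back together (the aggregation stage from the architecture diagram). A minimal sketch reusing the Pydantic models above; aggregate_items is a hypothetical helper, not part of any library:

def aggregate_items(items: List[FoodItem]) -> MealAnalysis:
    # Sum the per-segment estimates into one meal-level report
    return MealAnalysis(
        items=items,
        total_calories=sum(item.calories for item in items),
    )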

📱 Step 3: The FastAPI Glue

Now, let's wrap this in a FastAPI endpoint. We'll handle the image upload from our React Native frontend, run the SAM + GPT-4o pipeline, and return the structured data.

import base64

import cv2
import numpy as np
from fastapi import FastAPI, UploadFile, File

app = FastAPI()

@app.post("/analyze-meal")
async def analyze_meal(file: UploadFile = File(...)):
    # 1. Read and decode the uploaded image
    contents = await file.read()
    nparr = np.frombuffer(contents, np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)

    # 2. Get segments with SAM (OpenCV decodes BGR, SAM expects RGB);
    #    in the full pipeline each mask is cropped and analyzed individually
    masks = get_food_segments(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))

    # 3. Request GPT-4o analysis on the base64-encoded upload
    base64_image = base64.b64encode(contents).decode("utf-8")
    analysis = analyze_food_with_gpt4o(base64_image)

    return {
        "status": "success",
        "data": analysis
    }
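
Before wiring up the mobile client, you can sanity-check the endpoint with a quick script (assuming the server runs locally on port 8000 and you have a meal.jpg on disk):

import requests

with open("meal.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/analyze-meal",
        files={"file": ("meal.jpg", f, "image/jpeg")},
    )

print(resp.json()["data"]["total_calories"])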

🎨 Step 4: React Native UI (The User Experience)

On the mobile side, we want to show the user exactly what the AI sees. By overlaying the SAM masks back onto the camera view, we build trust through transparency.

// Quick snippet for handling the result in React Native
const handleUpload = async (imageUri) => {
  const formData = new FormData();
  formData.append('file', { uri: imageUri, name: 'meal.jpg', type: 'image/jpeg' });

  const response = await fetch('https://api.yourbackend.com/analyze-meal', {
    method: 'POST',
    body: formData,
  });

  const result = await response.json();
  setMealData(result.data); // Update UI with macro breakdown
};
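
Since the masks live on the backend, one option (a sketch, not the only approach) is to render the overlay server-side with OpenCV and return it to the app as a regular image; render_mask_overlay is an illustrative helper name:

import numpy as np
import cv2

def render_mask_overlay(image_bgr, masks, alpha=0.4):
    # Tint each SAM segment with its own color so users can see exactly
    # which regions the nutrition estimates refer to.
    overlay = image_bgr.copy()
    for mask in masks:
        color = np.random.randint(0, 256, size=3, dtype=np.uint8)
        overlay[mask["segmentation"]] = color
    return cv2.addWeighted(overlay, alpha, image_bgr, 1 - alpha, 0)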

🚀 Why This Matters

Standard AI vision sees "a plate of food."
This Multimodal pipeline sees:

  1. Segment 1: 150g Grilled Chicken (31g Protein)
  2. Segment 2: 100g Avocado (15g Fat)
  3. Segment 3: 50g Quinoa (10g Carbs)

By combining SAM's spatial precision with GPT-4o's reasoning, we reduce the "hallucination" of calories.

For those looking to dive deeper into advanced Vision-Language orchestration and production deployment strategies, I can't recommend wellally.tech/blog enough. It’s a goldmine for anyone building at the intersection of AI and healthcare.

🏁 Conclusion

Building high-precision health tools requires moving beyond basic APIs. By chaining models like SAM and GPT-4o, we create a system that understands the physical world with much higher fidelity.

What are you building with GPT-4o? Drop a comment below! Let's chat about the future of Multimodal AI! 🥑💻
