wellallyTech

Posted on Jun 23

Calorie-GPT: Precision Food Tracking using Segment Anything Model (SAM) and GPT-4o-vision 🥗🤖

#openai #machinelearning #fastapi #python

We’ve all been there: staring at a delicious plate of pasta, wondering exactly how many calories are hidden in that creamy sauce. Most calorie-counting apps rely on manual entry or basic image classification that fails the moment your chicken is hidden under a leaf of kale.

In this tutorial, we are building Calorie-GPT, a multimodal AI pipeline. We aren't just "guessing" labels; we are using Meta's Segment Anything Model (SAM) to isolate food items and GPT-4o-vision to reason about portion size and estimate macronutrients. By combining computer vision with Large Language Models (LLMs), we can turn a simple photo into a detailed, per-item nutritional estimate.

The Architecture: A Multi-Stage Vision Pipeline

To get accurate results, we can't just throw a raw image at an LLM. We need a "Divide and Conquer" strategy. First, we segment the food to understand boundaries, then we let the LLM analyze the specific segments.

graph TD
    A[User Uploads Photo] --> B[FastAPI Endpoint]
    B --> C{SAM Model}
    C -->|Segment Masks| D[Region Extraction]
    D --> E[GPT-4o-vision Analysis]
    E -->|JSON Response| F[PostgreSQL Storage]
    F --> G[Frontend Display]

    subgraph "The AI Engine"
    C
    D
    E
    end
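
The flow in the diagram can be sketched as three plain functions chained together. This is only a skeleton under assumptions: the `Pipeline` class and its stage names are hypothetical, and the stages are stubbed so the control flow can be run without SAM weights or an API key.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Pipeline:
    """Segment first, then analyze each region (stage names are hypothetical)."""

    segment: Callable[[Any], List[Any]]  # image -> list of masks
    extract: Callable[[Any, Any], Any]   # (image, mask) -> isolated region
    analyze: Callable[[Any], dict]       # region -> nutrition estimate

    def run(self, image: Any) -> List[dict]:
        masks = self.segment(image)
        regions = [self.extract(image, mask) for mask in masks]
        return [self.analyze(region) for region in regions]


# Stub stages standing in for SAM, region extraction, and the GPT-4o call
demo = Pipeline(
    segment=lambda image: ["mask_a", "mask_b"],
    extract=lambda image, mask: f"{image}:{mask}",
    analyze=lambda region: {"region": region, "calories": None},
)
results = demo.run("photo")
```

Keeping the stages as swappable callables makes it easy to test the wiring first and plug in the real models one step at a time.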

Why this stack?

Segment Anything Model (SAM): Precise zero-shot segmentation to identify exactly where the food is.
GPT-4o-vision (the gpt-4o model with image input): Strong general-purpose visual reasoning, used here to identify foods and estimate portion sizes.
FastAPI: High-performance Python framework to glue the pipeline stages together behind one endpoint.
PostgreSQL: To store user history and nutritional logs for long-term tracking.

Prerequisites

Before we dive in, ensure you have:

Python 3.9+
An OpenAI API Key
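The segment-anything library and the ViT-H weights (sam_vit_h_4b8939.pth).
The openai, fastapi, uvicorn, and python-multipart packages (FastAPI needs python-multipart to parse file uploads).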

Step 1: Precise Segmentation with SAM

SAM can generate a mask for any object in the image. Restricting the analysis to the food mask keeps the AI from getting confused by the table, the cutlery, or the background.

import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load the SAM model
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def get_food_segments(image_array):
    # SamPredictor expects an RGB uint8 array of shape (H, W, 3)
    predictor.set_image(image_array)
    # For simplicity, we use the center point as a prompt
    # In production, use automatic mask generation
    height, width = image_array.shape[:2]
    masks, scores, logits = predictor.predict(
        point_coords=np.array([[width // 2, height // 2]]),  # (x, y) order
        point_labels=np.array([1]),  # 1 marks a foreground point
        multimask_output=True,
    )
    # The candidate masks are not ordered by score, so pick the best one explicitly
    return masks[int(np.argmax(scores))]
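
The "Region Extraction" box from the diagram is not shown in the post, so here is a minimal sketch of one way to do it: blank out everything outside the mask and crop to the mask's bounding box. The function name `crop_segment` and the white background are assumptions, not part of the segment-anything library. It works on any (H, W) boolean mask, such as the one returned by `get_food_segments`.

```python
import numpy as np


def crop_segment(image: np.ndarray, mask: np.ndarray, background: int = 255) -> np.ndarray:
    """Blank out pixels outside the mask and crop to the mask's bounding box.

    image: (H, W, 3) uint8 array; mask: (H, W) boolean array.
    """
    mask = mask.astype(bool)
    if not mask.any():
        raise ValueError("mask is empty")
    # Paint everything outside the mask with a flat background colour
    isolated = np.where(mask[..., None], image, np.uint8(background))
    # Bounding box of the True pixels
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return isolated[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


# Tiny synthetic check: a 2x3 mask inside a 6x8 grey image
image = np.full((6, 8, 3), 100, dtype=np.uint8)
mask = np.zeros((6, 8), dtype=bool)
mask[2:4, 1:4] = True
region = crop_segment(image, mask)
```

The cropped array still has to be encoded as a JPEG (for example with Pillow or OpenCV) and base64-encoded before it can be sent to the model in Step 2.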

Step 2: Reasoning with GPT-4o-vision

Once we have our segmented image (or the original image with highlighted masks), we send it to GPT-4o. A structured prompt tells the model which fields to return, and response_format={"type": "json_object"} turns on JSON mode so the reply parses as JSON.

import openai

client = openai.OpenAI()

def analyze_nutrition(base64_image):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Identify the food in this segment. Estimate the weight in grams and provide: calories, protein, carbs, and fats. Respond ONLY in JSON format."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                    },
                ],
            }
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content
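
`analyze_nutrition` returns a JSON string, and the prompt above does not pin down key names, so it is worth validating the reply before storing it. The sketch below assumes hypothetical keys (`food`, `weight_g`, `calories`, `protein_g`, `carbs_g`, `fat_g`); in practice you would spell them out in the prompt. It also cross-checks the calories against the macros using the standard 4/4/9 kcal-per-gram factors. The sample string is hand-written, not real model output.

```python
import json
from dataclasses import dataclass


@dataclass
class NutritionEstimate:
    food: str
    weight_g: float
    calories: float
    protein_g: float
    carbs_g: float
    fat_g: float

    def macro_calories(self) -> float:
        # Atwater general factors: 4 kcal/g protein and carbohydrate, 9 kcal/g fat
        return 4 * self.protein_g + 4 * self.carbs_g + 9 * self.fat_g

    def is_consistent(self, tolerance: float = 0.2) -> bool:
        # Flag replies whose calories disagree with their own macros
        expected = self.macro_calories()
        if expected == 0:
            return self.calories == 0
        return abs(self.calories - expected) / expected <= tolerance


def parse_estimate(raw: str) -> NutritionEstimate:
    """Parse the model's JSON string; the key names are assumed, not guaranteed."""
    data = json.loads(raw)
    return NutritionEstimate(
        food=str(data["food"]),
        weight_g=float(data["weight_g"]),
        calories=float(data["calories"]),
        protein_g=float(data["protein_g"]),
        carbs_g=float(data["carbs_g"]),
        fat_g=float(data["fat_g"]),
    )


# Hand-written sample reply for illustration only
sample = (
    '{"food": "grilled chicken", "weight_g": 150, "calories": 250, '
    '"protein_g": 45, "carbs_g": 0, "fat_g": 6}'
)
estimate = parse_estimate(sample)
```

A missing key raises `KeyError` and malformed JSON raises `json.JSONDecodeError`, so the caller can retry the request or reject the reply instead of saving bad data.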

Step 3: Building the FastAPI Backend

We need an endpoint to receive the image, run the pipeline, and store the result in PostgreSQL.

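import base64
import json

from fastapi import FastAPI, UploadFile, File
import uvicorn

app = FastAPI(title="Calorie-GPT API")

def encode_image(image_bytes: bytes) -> str:
    # Base64-encode the raw image bytes for the data URL used in analyze_nutrition
    return base64.b64encode(image_bytes).decode("utf-8")

@app.post("/analyze")
async def upload_meal(file: UploadFile = File(...)):
    # 1. Read Image
    contents = await file.read()

    # 2. Run SAM Segmentation (Logic omitted for brevity)
    # 3. GPT-4o Vision Analysis (returns a JSON string, so parse it)
    nutrition_data = json.loads(analyze_nutrition(encode_image(contents)))

    # 4. Save to PostgreSQL
    # db.save(nutrition_data)

    return {"status": "success", "data": nutrition_data}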

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
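
Step 4 is left as a comment above. As a rough sketch of what a save function could look like, the block below writes one row per analysis. It uses Python's built-in sqlite3 as a stand-in so it runs without a database server; the `meal_logs` table, its columns, and the `save_meal` name are all hypothetical. With PostgreSQL you would use a driver such as psycopg (which takes `%s` placeholders) and would likely store the payload in a JSONB column.

```python
import json
import sqlite3
from datetime import datetime, timezone


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per analyzed meal
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS meal_logs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            user_id TEXT NOT NULL,
            logged_at TEXT NOT NULL,
            nutrition_json TEXT NOT NULL
        )
        """
    )


def save_meal(conn: sqlite3.Connection, user_id: str, nutrition_data: dict) -> int:
    # Parameterized query: never interpolate model output into SQL
    cursor = conn.execute(
        "INSERT INTO meal_logs (user_id, logged_at, nutrition_json) VALUES (?, ?, ?)",
        (user_id, datetime.now(timezone.utc).isoformat(), json.dumps(nutrition_data)),
    )
    conn.commit()
    return cursor.lastrowid


# In-memory database so the sketch runs anywhere
conn = sqlite3.connect(":memory:")
init_db(conn)
row_id = save_meal(conn, "demo-user", {"food": "pasta", "calories": 520})
stored = conn.execute(
    "SELECT nutrition_json FROM meal_logs WHERE id = ?", (row_id,)
).fetchone()
```

Storing the full JSON payload alongside a timestamp keeps the raw estimate available if you later change which fields you extract.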

The "Official" Way to Scale 🥑

While this pipeline works for a hobby project, deploying multimodal AI at scale requires robust infrastructure and prompt engineering optimization. If you're looking for more production-ready examples, advanced computer vision patterns, or deep dives into LLM orchestration, I highly recommend checking out the technical resources at WellAlly Tech Blog.

Their guides on vision-language models and vector databases were a huge inspiration for the segmentation-first approach used in this project.

Conclusion: The Future of Nutrition

By combining the structural precision of SAM with the cognitive power of GPT-4o, we move away from "magic guessing" and toward a data-driven approach to health.

Next Steps:

Refine SAM: Use automatic mask generation (the SamAutomaticMaskGenerator class in segment-anything) to detect multiple food items on one plate.
Calibration: Add a physical reference object (like a coin) in the photo to help GPT-4o estimate volume more accurately.
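
The calibration idea can be sketched with a little arithmetic: if a coin of known diameter is visible, its size in pixels gives a millimetres-per-pixel scale, and the pixel count of a food mask converts to a real-world area. The function names are hypothetical, and the sketch assumes a roughly top-down photo with the coin on the same plane as the food. It yields area, not volume; the result could be added to the prompt text as a hint. The example uses a US quarter (24.26 mm across) with made-up pixel measurements.

```python
def mm_per_pixel(reference_diameter_mm: float, reference_diameter_px: float) -> float:
    """Scale factor derived from a reference object of known size."""
    if reference_diameter_mm <= 0 or reference_diameter_px <= 0:
        raise ValueError("diameters must be positive")
    return reference_diameter_mm / reference_diameter_px


def mask_area_cm2(mask_pixel_count: int, scale_mm_per_px: float) -> float:
    """Convert a mask's pixel count to square centimetres."""
    # Each pixel covers scale^2 square millimetres; 100 mm^2 = 1 cm^2
    return mask_pixel_count * scale_mm_per_px ** 2 / 100.0


# Illustrative numbers: a US quarter measured at 97.04 px across
scale = mm_per_pixel(24.26, 97.04)    # about 0.25 mm per pixel
area = mask_area_cm2(160_000, scale)  # about 100 cm^2
```

With a boolean SAM mask, the pixel count is simply `int(mask.sum())`.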

What are you building with Vision AI? Let me know in the comments! 👇
