We’ve all been there: staring at a delicious plate of pasta, wondering exactly how many calories are hidden in that creamy sauce. Most calorie-counting apps rely on manual entry or basic image classification that fails the moment your chicken is hidden under a leaf of kale.
In this tutorial, we are building Calorie-GPT, a multimodal AI pipeline. We aren't just "guessing" labels; we use Meta's Segment Anything Model (SAM) to isolate food items and GPT-4o's vision input to perform volumetric reasoning and macronutrient estimation. By combining computer vision with Large Language Models (LLMs), we can turn a simple photo into a detailed nutritional estimate.
The Architecture: A Multi-Stage Vision Pipeline
To get accurate results, we can't just throw a raw image at an LLM. We need a "Divide and Conquer" strategy. First, we segment the food to understand boundaries, then we let the LLM analyze the specific segments.
graph TD
A[User Uploads Photo] --> B[FastAPI Endpoint]
B --> C{SAM Model}
C -->|Segment Masks| D[Region Extraction]
D --> E[GPT-4o Vision Analysis]
E -->|JSON Response| F[PostgreSQL Storage]
F --> G[Frontend Display]
subgraph "The AI Engine"
C
D
E
end
Why this stack?
- Segment Anything Model (SAM): Precise zero-shot segmentation to identify exactly where the food is.
- GPT-4o (vision input): A strong general-purpose model for visual reasoning, used here to identify foods and estimate portion sizes.
- FastAPI: High-performance Python framework to glue our microservices together.
- PostgreSQL: To store user history and nutritional logs for long-term tracking.
Prerequisites
Before we dive in, ensure you have:
- Python 3.9+
- An OpenAI API Key
- The segment-anything library and weights (sam_vit_h).
Step 1: Precise Segmentation with SAM
SAM allows us to generate masks for every object in the image. This prevents the AI from getting confused by the table, the cutlery, or the background.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load the SAM model (ViT-H checkpoint)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

def get_food_segments(image_array):
    # image_array: H x W x 3 uint8 array in RGB order
    predictor.set_image(image_array)
    # For simplicity, we use the center point as a prompt
    # In production, use automatic mask generation
    height, width = image_array.shape[:2]
    masks, scores, logits = predictor.predict(
        point_coords=np.array([[width // 2, height // 2]]),
        point_labels=np.array([1]),  # 1 marks a foreground point
        multimask_output=True,
    )
    # multimask_output=True returns several candidate masks;
    # pick the one with the highest predicted quality score
    return masks[np.argmax(scores)]
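The Region Extraction stage from the architecture diagram can be sketched with plain NumPy. This is a minimal sketch, not the author's implementation: `extract_region` is a hypothetical helper name, and it assumes the image is an H x W x 3 array and the mask is a boolean H x W array like the one `get_food_segments` returns.

```python
import numpy as np


def extract_region(image_array: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Crop the image to the mask's bounding box and black out the background.

    Hypothetical helper: assumes image_array is H x W x 3 and mask is a
    boolean H x W array.
    """
    mask = mask.astype(bool)
    if not mask.any():
        raise ValueError("mask is empty; nothing to extract")

    # Rows and columns that contain at least one masked pixel
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1

    # Zero out everything outside the mask, then crop to the bounding box
    isolated = np.where(mask[..., None], image_array, 0).astype(image_array.dtype)
    return isolated[top:bottom, left:right]


# Usage with synthetic data: a 6x8 image containing a 2x3 "food" region
image = np.full((6, 8, 3), 200, dtype=np.uint8)
mask = np.zeros((6, 8), dtype=bool)
mask[2:4, 1:4] = True
region = extract_region(image, mask)
```

Blacking out the background keeps the table and cutlery out of the crop, at the cost of removing context (such as plate size) that might help with portion estimates.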
Step 2: Reasoning with GPT-4o Vision
Once we have our segmented image (or the original image with highlighted masks), we send it to GPT-4o. We combine a structured prompt with the API's JSON response format so that the reply comes back as valid JSON.
import openai

# Reads the API key from the OPENAI_API_KEY environment variable
client = openai.OpenAI()

def analyze_nutrition(base64_image):
    # Returns the model's reply as a JSON-formatted string
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Identify the food in this segment. Estimate the weight in grams and provide: calories, protein, carbs, and fats. Respond ONLY in JSON format."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                    },
                ],
            }
        ],
        # JSON mode requires the prompt to mention JSON, which it does above
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content
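JSON mode constrains the reply's syntax, not its keys, so it can help to validate the fields before storing anything. Here is a minimal sketch. The key names (`weight_g`, `calories`, `protein`, `carbs`, `fats`) are assumptions for illustration; the prompt above doesn't pin them down, so in practice you'd name them explicitly in the prompt.

```python
import json

# Assumed key names; the prompt in analyze_nutrition does not specify these
REQUIRED_NUMERIC_KEYS = ("weight_g", "calories", "protein", "carbs", "fats")


def parse_nutrition(raw_json: str) -> dict:
    """Parse the model's JSON string and check the numeric fields."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in REQUIRED_NUMERIC_KEYS:
        value = data.get(key)
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"missing or non-numeric field: {key}")
        if value < 0:
            raise ValueError(f"negative value for field: {key}")
    return data


# Usage with a hand-written sample (not real model output)
sample = '{"food": "pasta", "weight_g": 250, "calories": 380, "protein": 12, "carbs": 60, "fats": 9}'
parsed = parse_nutrition(sample)
```

Rejecting bad responses here means the database only ever sees rows with the expected shape.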
Step 3: Building the FastAPI Backend
We need an endpoint to receive the image, run the pipeline, and store the result in PostgreSQL.
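import base64
import json

from fastapi import FastAPI, UploadFile, File
import uvicorn

# File uploads require the python-multipart package
app = FastAPI(title="Calorie-GPT API")

def encode_image(image_bytes: bytes) -> str:
    # Base64-encode the raw bytes for the data URL used in analyze_nutrition
    return base64.b64encode(image_bytes).decode("utf-8")

@app.post("/analyze")
async def upload_meal(file: UploadFile = File(...)):
    # 1. Read Image
    contents = await file.read()
    # 2. Run SAM Segmentation (Logic omitted for brevity)
    # 3. GPT-4o Vision Analysis (analyze_nutrition comes from Step 2)
    #    It returns a JSON string, so parse it before returning
    nutrition_data = json.loads(analyze_nutrition(encode_image(contents)))
    # 4. Save to PostgreSQL
    # db.save(nutrition_data)
    return {"status": "success", "data": nutrition_data}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)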
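One caveat with this endpoint: `analyze_nutrition` makes a blocking network call, and calling it directly inside an `async def` handler holds up the event loop until the response arrives. A sketch of one way around that uses `asyncio.to_thread` (available from Python 3.9). Here `slow_analysis` is a hypothetical stand-in for `analyze_nutrition`, so the example runs without an API key.

```python
import asyncio
import time


def slow_analysis(payload: str) -> str:
    """Hypothetical stand-in for the blocking analyze_nutrition call."""
    time.sleep(0.1)  # simulates network latency
    return f'{{"echo": "{payload}"}}'


async def analyze_without_blocking(payload: str) -> str:
    # Run the blocking function in a worker thread so the event loop
    # stays free to serve other requests in the meantime
    return await asyncio.to_thread(slow_analysis, payload)


async def main() -> list:
    # The two calls overlap instead of running back to back
    return await asyncio.gather(
        analyze_without_blocking("meal-1"),
        analyze_without_blocking("meal-2"),
    )


results = asyncio.run(main())
```

In the handler, the equivalent change would be `await asyncio.to_thread(analyze_nutrition, encode_image(contents))`.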
The "Official" Way to Scale 🥑
While this pipeline works for a hobby project, deploying multimodal AI at scale requires robust infrastructure and prompt engineering optimization. If you're looking for more production-ready examples, advanced computer vision patterns, or deep dives into LLM orchestration, I highly recommend checking out the technical resources at WellAlly Tech Blog.
Their guides on vision-language models and vector databases were a huge inspiration for the segmentation-first approach used in this project.
Conclusion: The Future of Nutrition
By combining the structural precision of SAM with the cognitive power of GPT-4o, we move away from "magic guessing" and toward a data-driven approach to health.
Next Steps:
- Refine SAM: Use "Automatic Mask Generation" to detect multiple food items on one plate.
- Calibration: Add a physical reference object (like a coin) in the photo to help GPT-4o estimate volume more accurately.
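The calibration idea comes down to a unit conversion. Below is a minimal sketch, assuming the coin and the food have both been segmented and lie in roughly the same plane under an overhead camera. The function names are hypothetical, 24.26 mm is the diameter of a US quarter, and the pixel values are made up for illustration.

```python
def mm_per_pixel(coin_diameter_mm: float, coin_diameter_px: float) -> float:
    """Scale factor derived from a reference object of known size."""
    if coin_diameter_px <= 0:
        raise ValueError("coin diameter in pixels must be positive")
    return coin_diameter_mm / coin_diameter_px


def mask_area_cm2(mask_pixel_count: int, scale_mm_per_px: float) -> float:
    """Convert a mask's pixel count into a real-world area in cm^2."""
    area_mm2 = mask_pixel_count * scale_mm_per_px ** 2
    return area_mm2 / 100.0  # 100 mm^2 per cm^2


# Illustrative numbers: the coin spans 97.04 px, the food mask covers 160,000 px
scale = mm_per_pixel(coin_diameter_mm=24.26, coin_diameter_px=97.04)
food_area = mask_area_cm2(mask_pixel_count=160_000, scale_mm_per_px=scale)
```

This yields an area in the image plane, not a volume. Height still has to be estimated, but passing the measured area to GPT-4o in the prompt text gives it a concrete anchor for that estimate.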
What are you building with Vision AI? Let me know in the comments! 👇