Using LLMs for Multimodal Tasks


In this tutorial we will build a multimodal receipt audit agent that reads JPEG receipts and returns structured JSON policy decisions. This is useful for finance teams and developers automating expense workflows. Because Oxlo.ai charges a flat rate per request, adding vision and reasoning calls stays predictable even when receipt images are large or policy documents grow.

What you'll need

Before starting, grab the following:

Python 3.10 or newer
An Oxlo.ai API key from https://portal.oxlo.ai
The OpenAI SDK installed with pip install openai
A sample receipt image saved as receipt.jpg in your working directory

Step 1: Set up the client and load the image

We will point the OpenAI SDK at Oxlo.ai and add a small helper to base64-encode local images. Keeping the image in memory avoids external hosting and keeps the script portable.

import base64
from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("receipt.jpg")
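The helper above returns bare base64, and Step 3 hardcodes image/jpeg when it builds the data URL. If receipts might also arrive as PNG or WebP, one option is to derive the MIME type from the file extension. The sketch below is an assumption for illustration, not part of the original pipeline; to_data_url and its optional image_bytes parameter are hypothetical names.

```python
import base64
import mimetypes


def to_data_url(image_path, image_bytes=None):
    """Build a data URL whose MIME type is guessed from the file extension.

    image_bytes is optional (hypothetical parameter): pass it when the file
    is already in memory, otherwise the file is read from image_path.
    """
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognised image type: {image_path}")
    if image_bytes is None:
        with open(image_path, "rb") as f:
            image_bytes = f.read()
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The returned string could be passed as the "url" value of the image_url content part in place of the hardcoded f-string. The guess relies on the extension only, so a mislabeled file would still get the wrong type.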

Step 2: Define the vision system prompt

The vision model needs strict instructions, or it tends to wrap its answer in commentary. This prompt asks for a single JSON object with a fixed set of keys and tells the model not to add any text around it.

VISION_PROMPT = """You are a receipt parser. Extract every line item, the merchant name, date, total amount, and tax.
Return your findings as a JSON object with these keys:
- merchant: string
- date: string in YYYY-MM-DD format if visible, otherwise null
- total: number
- tax: number
- items: list of objects, each with name, quantity, and unit_price.
Do not add commentary outside the JSON."""

Step 3: Extract structured data with Kimi K2.6

Kimi K2.6 handles the vision task. We embed the base64 JPEG in the message payload and request JSON mode so the output is machine readable.

def extract_receipt(image_b64):
    response = client.chat.completions.create(
        model="kimi-k2.6",
        messages=[
            {"role": "system", "content": VISION_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": "Extract the receipt data from this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}}
            ]},
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content

receipt_json = extract_receipt(image_b64)
print(receipt_json)
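extract_receipt returns a string, and JSON mode only asks for syntactically valid JSON; it does not guarantee the keys named in VISION_PROMPT. Before passing the result on, it may be worth parsing and checking it. The following is a minimal sketch; parse_receipt and REQUIRED_KEYS are hypothetical names, and the checks are deliberately shallow.

```python
import json

# Keys requested in VISION_PROMPT.
REQUIRED_KEYS = {"merchant", "date", "total", "tax", "items"}


def parse_receipt(raw):
    """Parse the model's JSON string and check it has the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    for key in ("total", "tax"):
        value = data[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} must be a number")
    if not isinstance(data["items"], list):
        raise ValueError("items must be a list")
    return data
```

Calling parse_receipt(receipt_json) right after extraction means a malformed response fails early with a readable message, before any audit request is sent.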

Step 4: Audit the receipt against policy

We pass the extracted JSON to Llama 3.3 70B for policy checking. Separating extraction from reasoning keeps each call focused and makes debugging easier.

AUDIT_PROMPT = """You are a finance auditor. Given a receipt JSON, check these rules:
1. Total must not exceed $200.
2. Alcohol items are not reimbursable.
3. Tax must be less than 20% of the pre-tax subtotal (total minus tax).
Return a JSON object with:
- approved: boolean
- violations: list of strings explaining any failures
- summary: one sentence describing the outcome."""

def audit_receipt(receipt_json):
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": AUDIT_PROMPT},
            {"role": "user", "content": receipt_json},
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content

audit_result = audit_receipt(receipt_json)
print(audit_result)
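The three rules in AUDIT_PROMPT are simple enough to evaluate in plain Python as well, which gives a deterministic cross-check on the model's verdict. The sketch below makes several assumptions: the receipt has already been parsed into a dict, the pre-tax subtotal is total minus tax (the extracted JSON has no subtotal field), and alcohol is detected with a short illustrative keyword list. check_policy and ALCOHOL_KEYWORDS are hypothetical names.

```python
# Illustrative keyword list only; substring matching is crude and would
# also flag items such as "root beer".
ALCOHOL_KEYWORDS = ("beer", "wine", "whiskey", "vodka", "cocktail")


def check_policy(receipt, limit=200.0, max_tax_rate=0.20):
    """Apply the three audit rules to a parsed receipt dict."""
    violations = []
    total = receipt["total"]
    tax = receipt["tax"]

    # Rule 1: total must not exceed the limit.
    if total > limit:
        violations.append(f"Total {total:.2f} exceeds limit {limit:.2f}")

    # Rule 2: alcohol items are not reimbursable.
    for item in receipt["items"]:
        name = item["name"].lower()
        if any(word in name for word in ALCOHOL_KEYWORDS):
            violations.append(f"Alcohol item not reimbursable: {item['name']}")

    # Rule 3: tax must be less than 20% of the pre-tax subtotal,
    # assumed here to be total minus tax.
    subtotal = total - tax
    if subtotal <= 0:
        violations.append("Pre-tax subtotal is not positive")
    elif tax >= max_tax_rate * subtotal:
        violations.append(
            f"Tax {tax:.2f} is not below {max_tax_rate:.0%} of subtotal {subtotal:.2f}"
        )

    return {"approved": not violations, "violations": violations}
```

If this local result disagrees with the approved field returned by the model, that receipt is a reasonable candidate for manual review.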

Run it

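The snippet below ties everything together. Place it at the bottom of your script in place of the standalone calls from the earlier steps (otherwise each request runs twice), save the file as audit.py, then run python audit.py.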

if __name__ == "__main__":
    image_b64 = encode_image("receipt.jpg")
    data = extract_receipt(image_b64)
    print("Extracted:", data)

    result = audit_receipt(data)
    print("Audit:", result)

Example output looks like this:

Extracted: {"merchant": "Acme Cafe", "date": "2024-05-21", "total": 37.30, "tax": 3.30, "items": [{"name": "Sandwich", "quantity": 2, "unit_price": 12.50}, {"name": "Coffee", "quantity": 2, "unit_price": 4.50}]}
Audit: {"approved": true, "violations": [], "summary": "Receipt is within policy limits with no alcohol and tax below 20%."}
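Vision models occasionally misread a digit, so a quick arithmetic check on the extracted data can catch some errors before the audit. The sketch below assumes the receipt has no tip, discount, or service charge, so that the line items plus tax should equal the total; totals_consistent is a hypothetical helper.

```python
def totals_consistent(receipt, tolerance=0.01):
    """Return True if line items plus tax match the total within a tolerance.

    Assumes no tip, discount, or service charge appears on the receipt.
    """
    items_sum = sum(
        item["quantity"] * item["unit_price"] for item in receipt["items"]
    )
    return abs(items_sum + receipt["tax"] - receipt["total"]) <= tolerance
```

A False result does not prove the extraction is wrong, since real receipts often include charges outside the line items, but it is a cheap signal for flagging a receipt for a second look.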

Wrap-up and next steps

This agent shows how to chain vision and reasoning on Oxlo.ai using standard OpenAI SDK calls. Because pricing is per request rather than per token, the cost of each extra validation step is known in advance, and larger receipt images or longer policy documents do not push it up.

Two concrete ways to extend this:

Wrap the pipeline in a FastAPI endpoint so users can upload receipts via HTTP and receive the audit JSON.
Add a second vision pass with Gemma 3 27B to detect blurry or cropped receipts before sending them to extraction, reducing error rates.
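For the first extension, it can help to keep the request logic separate from the web framework. The sketch below is a framework-agnostic core that an upload route (for example, a FastAPI handler) could call with the raw file bytes; the route wiring itself is omitted. handle_upload is a hypothetical name, and extract_fn and audit_fn stand in for extract_receipt and audit_receipt from the earlier steps.

```python
import base64
import json


def handle_upload(image_bytes, extract_fn, audit_fn):
    """Run extraction then audit on uploaded bytes.

    Returns a (status_code, body) pair that an HTTP layer can turn into
    a response. extract_fn and audit_fn are expected to return JSON strings.
    """
    if not image_bytes:
        return 400, {"error": "empty upload"}
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    try:
        receipt_json = extract_fn(image_b64)
        audit_json = audit_fn(receipt_json)
        body = {
            "receipt": json.loads(receipt_json),
            "audit": json.loads(audit_json),
        }
    except (json.JSONDecodeError, TypeError):
        # Covers malformed JSON and a missing (None) model response.
        return 502, {"error": "model returned invalid JSON"}
    return 200, body
```

Passing the two functions in as arguments keeps the handler testable with stubs, so the HTTP layer can be exercised without making real API calls.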