LLM for Multimodal Tasks and Applications

#learnai #oxlo #ai

We are going to build a receipt parser that accepts an image and returns structured JSON with vendor, date, line items, and total. It runs entirely through a single multimodal LLM call, so there is no separate OCR service or computer vision pipeline to maintain. This is useful for anyone automating expense reports or invoice processing.

What you'll need

Before starting, grab an Oxlo.ai API key from https://portal.oxlo.ai. You will also need Python 3.10 or newer and the OpenAI SDK installed.

pip install openai

Step 1: Verify the client

I always start with a simple text-only call to confirm the API key and base URL are working before I add image handling. This quick smoke test avoids debugging base64 or vision bugs on a broken connection.

from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

response = client.chat.completions.create(
    model="kimi-k2.6",
    messages=[
        {"role": "user", "content": "Say 'connection ok' and nothing else."},
    ],
)
print(response.choices[0].message.content)
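The snippet above hardcodes the key for brevity. A minimal sketch of reading it from the environment instead might look like the following. The variable name OXLO_API_KEY and the helper load_api_key are my own choices for illustration, not something the Oxlo.ai docs prescribe.

```python
import os


def load_api_key(env_var="OXLO_API_KEY"):
    """Return the API key stored in env_var, or raise a clear error.

    The variable name is a hypothetical convention, not an official one.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

With that in place, the client line becomes OpenAI(base_url="https://api.oxlo.ai/v1", api_key=load_api_key()), and the key stays out of the script.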

Step 2: Encode the image

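The Oxlo.ai chat completions endpoint accepts images as base64 data URLs inside the content array, exactly like the OpenAI format. I write a small helper that reads a local JPEG or PNG and returns its contents as a base64 string. The data URL prefix gets added in Step 4, where it is hardcoded to image/jpeg, so change it to image/png if you pass a PNG.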

import base64

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_path = "receipt.jpg"
base64_image = encode_image(image_path)

print(f"Encoded {len(base64_image)} characters")
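If you expect a mix of JPEG and PNG files, one option is to let the helper build the whole data URL and pick the MIME type from the file extension. This is a sketch using the standard library's mimetypes module. The function name to_data_url is hypothetical, and restricting it to JPEG and PNG is my assumption based on the formats mentioned above.

```python
import base64
import mimetypes


def to_data_url(path):
    """Read an image file and return it as a base64 data URL."""
    # guess_type looks only at the file extension, not the file contents.
    mime, _ = mimetypes.guess_type(path)
    if mime not in ("image/jpeg", "image/png"):
        raise ValueError(f"Unsupported image type for {path!r}: {mime}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The returned string can go straight into the "url" field of the image_url content part, which removes the hardcoded image/jpeg prefix.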

Step 3: Write the system prompt

The system prompt is the contract. It tells the model what to extract, defines the JSON schema, and forbids markdown fences so I can parse the output directly with json.loads.

SYSTEM_PROMPT = """You are a receipt parser. Extract the following fields from the receipt image and return only a raw JSON object. Do not wrap the JSON in markdown fences.

Schema:
- vendor: string, the store or restaurant name
- date: string in YYYY-MM-DD format if visible, otherwise null
- line_items: array of objects, each with name (string), quantity (number), and price (number)
- tax: number or null
- total: number or null

If a field is missing or unreadable, use null. Do not add extra commentary.
"""

Step 4: Send the multimodal request

Now I combine the system prompt, the text instruction, and the base64 image into a single messages payload. I use Oxlo.ai's kimi-k2.6 because it handles vision, reasoning, and long context well. Because Oxlo.ai uses flat per-request pricing, passing a high-resolution image does not inflate cost the way token-based billing does, which keeps document pipelines predictable. See https://oxlo.ai/pricing for details.

from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

user_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Extract the receipt data."},
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{base64_image}"
            },
        },
    ],
}

response = client.chat.completions.create(
    model="kimi-k2.6",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        user_message,
    ],
    temperature=0.1,
)

raw = response.choices[0].message.content.strip()
print(raw)

Step 5: Parse and validate

The model should return raw JSON, but I always wrap the parse in a try block. If the model ever slips in markdown fences or extra text, I want to see the raw output instead of getting a cryptic stack trace.

import json

try:
    data = json.loads(raw)
    print(json.dumps(data, indent=2))
except json.JSONDecodeError as e:
    print("Failed to parse JSON:", e)
    print("Raw output was:", raw)
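The try block catches malformed JSON, but it does not check that the parsed object matches the schema from Step 3. Here is a rough sketch of both safeguards: stripping a stray markdown fence before parsing, and listing any fields with the wrong type. The function names are hypothetical. Treating null as acceptable for every field except line_items is my reading of the prompt, not a rule the model is guaranteed to follow.

```python
FENCE = "`" * 3  # the three-backtick markdown fence marker


def strip_fences(text):
    """Remove a surrounding markdown fence if the model added one."""
    text = text.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines).strip()
    return text


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_receipt(data):
    """Return a list of schema problems; an empty list means the shape is fine."""
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]
    problems = []
    for key in ("vendor", "date"):
        value = data.get(key)
        if value is not None and not isinstance(value, str):
            problems.append(f"{key} should be a string or null")
    for key in ("tax", "total"):
        value = data.get(key)
        if value is not None and not _is_number(value):
            problems.append(f"{key} should be a number or null")
    items = data.get("line_items")
    if not isinstance(items, list):
        problems.append("line_items should be an array")
        return problems
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"line_items[{i}] should be an object")
            continue
        name = item.get("name")
        if name is not None and not isinstance(name, str):
            problems.append(f"line_items[{i}].name should be a string or null")
        for key in ("quantity", "price"):
            value = item.get(key)
            if value is not None and not _is_number(value):
                problems.append(f"line_items[{i}].{key} should be a number or null")
    return problems
```

In practice I would call json.loads(strip_fences(raw)) inside the try block and then log whatever validate_receipt returns before trusting the result.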

Run it

Here is the complete script. Save it as parse_receipt.py, place a receipt image named receipt.jpg in the same directory, and run python parse_receipt.py.

import base64
import json
from openai import OpenAI

API_KEY = "YOUR_OXLO_API_KEY"
IMAGE_PATH = "receipt.jpg"
MODEL = "kimi-k2.6"

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key=API_KEY)

SYSTEM_PROMPT = """You are a receipt parser. Extract the following fields from the receipt image and return only a raw JSON object. Do not wrap the JSON in markdown fences.

Schema:
- vendor: string, the store or restaurant name
- date: string in YYYY-MM-DD format if visible, otherwise null
- line_items: array of objects, each with name (string), quantity (number), and price (number)
- tax: number or null
- total: number or null

If a field is missing or unreadable, use null. Do not add extra commentary.
"""

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def parse_receipt(image_path):
    base64_image = encode_image(image_path)

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Extract the receipt data."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        },
                    },
                ],
            },
        ],
        temperature=0.1,
    )

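    raw = response.choices[0].message.content.strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Show what the model actually returned before re-raising (see Step 5).
        print("Raw output was:", raw)
        raise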

if __name__ == "__main__":
    result = parse_receipt(IMAGE_PATH)
    print(json.dumps(result, indent=2))

When I ran this against a coffee shop receipt, the output looked like this:

{
  "vendor": "Blue Bottle Coffee",
  "date": "2024-05-12",
  "line_items": [
    {"name": "Drip Coffee", "quantity": 1, "price": 4.50},
    {"name": "Avocado Toast", "quantity": 1, "price": 12.00}
  ],
  "tax": 1.32,
  "total": 17.82
}
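A cheap sanity check on output like this is to confirm that the line items plus tax add up to the total. For the receipt above, 4.50 + 12.00 + 1.32 = 17.82, so it passes. The sketch below does the comparison in integer cents to avoid floating-point drift. The function name totals_match is hypothetical, and it assumes each price is the line total as printed on the receipt rather than a unit price.

```python
def totals_match(data, tolerance_cents=1):
    """Check that line item prices plus tax add up to the reported total.

    Assumes each item's price is the line total, not a per-unit price.
    Returns False when there is not enough information to verify.
    """
    total = data.get("total")
    prices = [item.get("price") for item in data.get("line_items") or []]
    if total is None or any(p is None for p in prices):
        return False
    tax = data.get("tax") or 0
    # Convert to integer cents so 4.50 + 12.00 + 1.32 compares exactly.
    expected_cents = sum(round(p * 100) for p in prices) + round(tax * 100)
    return abs(expected_cents - round(total * 100)) <= tolerance_cents
```

Receipts with tips, discounts, or unit prices will fail this check even when the extraction is correct, so I would treat a mismatch as a flag for review rather than a hard error.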

Wrap-up

Two concrete ways to extend this. First, wrap the parser in a loop that walks a directory of images and appends the results to a CSV for bulk reconciliation. Second, add a confidence threshold by asking the model to include a certainty score in the JSON, then route any receipt below 0.9 to a human review queue.
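Both extensions can be sketched together. The code below takes the parser as a parameter, so it works with parse_receipt from the full script or with a stub in tests. The names process_directory and needs_review, the CSV column layout, and the "confidence" key are all hypothetical. The model only returns a confidence value if you add that field to the schema in the system prompt.

```python
import csv
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}
REVIEW_THRESHOLD = 0.9  # the cutoff suggested above
FIELDS = ["file", "vendor", "date", "tax", "total", "needs_review"]


def needs_review(data, threshold=REVIEW_THRESHOLD):
    """True when the self-reported confidence is missing or below threshold."""
    score = data.get("confidence")  # hypothetical field added to the prompt
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return True
    return score < threshold


def process_directory(image_dir, csv_path, parse):
    """Parse each image in image_dir with parse() and append one CSV row each.

    Returns the number of rows written.
    """
    new_file = not Path(csv_path).exists()
    rows = 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()  # write the header only once
        for path in sorted(Path(image_dir).iterdir()):
            if path.suffix.lower() not in IMAGE_SUFFIXES:
                continue  # skip anything that is not a receipt image
            data = parse(str(path))
            writer.writerow({
                "file": path.name,
                "vendor": data.get("vendor"),
                "date": data.get("date"),
                "tax": data.get("tax"),
                "total": data.get("total"),
                "needs_review": needs_review(data),
            })
            rows += 1
    return rows
```

A call such as process_directory("receipts", "receipts.csv", parse_receipt) would produce one summary row per image. Line items are left out of the CSV here to keep one row per receipt. A self-reported score is only a rough signal, so the threshold is worth tuning against receipts you have checked by hand.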