Morgan Willis
The Python Function That Implements Itself

What if you could write a Python function where the docstring is the implementation? You define the inputs, the return type, and you write the validation logic that defines what "correct" means. AI handles the rest.

That's the programming model behind AI Functions, a new experimental library from Strands Labs.

Strands Labs is a new GitHub organization where experimental features of the Strands Agents SDK are being built in the open.

With AI Functions you still write the validation logic, but instead of implementing the function body yourself, you let the model generate it and self-correct against your checks.

A Different Way to Write AI-Powered Code

Most AI-powered code follows the same pattern. You call the model, parse the response, write validation checks, handle errors, and retry when things go wrong. It's tedious boilerplate that everyone writes slightly differently.

AI Functions inverts this pattern.

You write a function signature, a docstring that serves as the prompt, a return type that defines the contract, and post-conditions that define what correct looks like. There is no function body. The function executes on an LLM instead of a CPU.

The key here is that you still write real validation code. Post-conditions are normal Python functions you author. You define the acceptance criteria, and the system enforces them.

The Receipt Parser

Let's see what this looks like with a receipt parser.

Receipts are a good fit for this pattern because the extraction itself is fuzzy (vendors format receipts differently, line items vary, tax rules change), but the validation is deterministic. You can write a post-condition to check whether the math adds up with plain arithmetic.

In practice, most receipts start as images or PDFs. This example assumes you've already extracted the text using OCR or a document processing service, and now you need to turn that raw text into structured, validated data.

We'll build something that handles that second step: extracting structured data from receipt text and validating that the math actually adds up.

from pydantic import BaseModel, Field
from ai_functions import ai_function

class LineItem(BaseModel):
    description: str = Field(description="Item or service description")
    quantity: int = Field(description="Number of units")
    unit_price: float = Field(description="Price per unit")
    amount: float = Field(description="Total for this line item (quantity * unit_price)")

class ReceiptData(BaseModel):
    vendor: str = Field(description="Vendor or company name")
    invoice_number: str = Field(description="Invoice or receipt number")
    date: str = Field(description="Invoice date (YYYY-MM-DD format)")
    items: list[LineItem] = Field(description="List of line items")
    subtotal: float = Field(description="Sum of all line item amounts before tax")
    tax: float = Field(description="Tax amount")
    total: float = Field(description="Final total (subtotal + tax)")

def validate_math(result: ReceiptData) -> None:
    """Validate that all math is internally consistent."""
    errors = []

    # Check line items: amount = quantity × unit_price
    for i, item in enumerate(result.items):
        expected = item.quantity * item.unit_price
        if abs(item.amount - expected) > 0.01:
            errors.append(
                f"Line item {i} ({item.description}): amount {item.amount} != "
                f"quantity {item.quantity} * unit_price {item.unit_price} = {expected}"
            )

    # Verify subtotal = sum of line items
    items_sum = sum(item.amount for item in result.items)
    if abs(result.subtotal - items_sum) > 0.01:
        errors.append(f"Subtotal {result.subtotal} != sum of line items {items_sum}")

    # Confirm total = subtotal + tax
    expected_total = result.subtotal + result.tax
    if abs(result.total - expected_total) > 0.01:
        errors.append(f"Total {result.total} != subtotal {result.subtotal} + tax {result.tax} = {expected_total}")

    if errors:
        raise ValueError("\n".join(errors))

@ai_function(
    description="Parse a receipt or invoice text and extract structured expense data",
    post_conditions=[validate_math],
    max_attempts=3,
)
def parse_receipt(receipt_text: str) -> ReceiptData:
    """
    Extract structured data from this receipt/invoice.
    Receipt text: {receipt_text}

    Instructions:
    - Extract all line items with their quantity, unit price, and total amount
    - Calculate subtotal as the sum of all line item amounts
    - Extract tax amount (if no tax is listed, use 0.0)
    - Calculate total as subtotal + tax
    - Use YYYY-MM-DD format for the date
    - Ensure all math is consistent
    """

The Pydantic models define the shape of the output. The @ai_function decorator marks this as an AI-powered function. The docstring becomes the prompt, with {receipt_text} as a template variable for the input. The return type tells the system what structure to generate.
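To make the templating concrete, here's a minimal sketch of how a docstring with a {receipt_text} placeholder can be filled in with the call's argument. This is plain Python string formatting used as an illustration of the idea, not the library's actual internals:

```python
# Sketch of docstring-as-prompt templating (an assumption about the
# mechanics; the library's real implementation may differ).
template = """
Extract structured data from this receipt/invoice.
Receipt text: {receipt_text}

Instructions:
- Extract all line items with their quantity, unit price, and total amount
"""

receipt_text = "ACME Corp\nInvoice #1042\n2x Widget @ 4.50 = 9.00\nTotal: 9.00"

# The placeholder is replaced with the raw input, so the model sees
# your instructions and the receipt in a single prompt.
prompt = template.format(receipt_text=receipt_text)
```

At this point the prompt contains both your extraction instructions and the verbatim receipt text, which is what gets sent to the model.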

Post-conditions let you define what "correct" means in your specific domain. They're standard Python functions that enforce your business logic: the math has to add up, the vendor name can't be empty, the date has to be in the right format. These aren't things you can guarantee with prompt engineering alone.

Here's what happens when you call parse_receipt with some receipt text.

Under the hood, the library hands off to a Strands agent loop. It takes your docstring (with the receipt text filled in), sends it to the model, and asks it to return a ReceiptData object.

Because it's running through a Strands agent, the function gets access to the same tool-use capabilities that Strands agents have, and as the integration matures, potentially other Strands features as well. But from your perspective, as the caller, it's just a function call that returns a Pydantic model.

Once the model responds, validate_math runs against the result. It checks whether each line item's amount equals quantity times unit price, whether the subtotal equals the sum of all line items, and whether the total equals the subtotal plus tax.

If everything checks out, you get your ReceiptData back. If validate_math raises a ValueError, the library takes that error message (for example, "Subtotal 1492.3 != sum of line items 1492.8") and sends it back to the model along with the original prompt. The model sees exactly what it got wrong and tries again. This loop repeats up to max_attempts times, so with max_attempts=3, the model gets three chances to produce output that passes your checks.
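The generate/validate/retry loop can be sketched in a few lines. This is a simplified stand-in, not the library's actual Strands agent loop: generate is a hypothetical callable standing in for the model call, and the failure message from a post-condition is appended to the next prompt as feedback.

```python
def run_with_post_conditions(generate, post_conditions, prompt, max_attempts=3):
    """Simplified sketch of a generate/validate/retry loop (not the
    library's real internals). `generate` stands in for the model call."""
    feedback = ""
    last_error = None
    for _ in range(max_attempts):
        result = generate(prompt + feedback)      # call the "model"
        try:
            for check in post_conditions:         # run your validators
                check(result)
            return result                         # all checks passed
        except ValueError as e:
            last_error = e
            # Feed the specific failure back so the next attempt can fix it
            feedback = f"\n\nPrevious attempt failed validation:\n{e}"
    raise RuntimeError(f"All {max_attempts} attempts failed: {last_error}")
```

The important detail is that the error message your post-condition raises becomes the model's correction signal, which is why specific messages ("Subtotal X != sum of line items Y") work better than a bare "validation failed".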

Worth noting: validate_math checks internal consistency, not extraction accuracy. If the model misreads "$8,400" as "$840" from messy OCR output, the math could still check out while being completely wrong. But that's what additional post-conditions are for. You could write one that cross-references extracted values against the raw input text, checking whether the total the model returned actually appears in the receipt. If it doesn't, something went wrong during extraction, not just during math. The pattern scales to whatever "correct" means for your use case.
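As a sketch of that cross-referencing idea, here's a check that the extracted total actually appears somewhere in the raw text. The name and signature are hypothetical; in the example above, post-conditions only receive the parsed result, so this is written as a plain helper to show the logic rather than as something wired into the decorator:

```python
import re

def total_appears_in_text(total: float, receipt_text: str) -> None:
    """Hypothetical cross-reference check: fail if the extracted total
    doesn't appear anywhere in the raw receipt text. Tolerates thousands
    separators (e.g. "8,400.00")."""
    # Pull every number-looking token out of the raw text
    for token in re.findall(r"[\d,]+\.?\d*", receipt_text):
        try:
            value = float(token.replace(",", ""))
        except ValueError:
            continue
        if abs(value - total) < 0.01:
            return  # the total is grounded in the source text
    raise ValueError(f"Extracted total {total} not found in receipt text")
```

A check like this would catch the "$8,400 misread as $840" case: the internally consistent but wrong total never appears in the source text, so the error message tells the model its extraction (not its arithmetic) is off.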

You could add more post-conditions too. Maybe validate_completeness to check that required fields aren't empty. Maybe validate_date_format to ensure dates parse correctly. Each one is just a Python function that raises an error when something's wrong.
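Those two extra checks could look like the following. The dataclass here is a self-contained stand-in for the pydantic ReceiptData model above (so this sketch runs on its own); against the real model, the validator bodies would be identical:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Receipt:  # stand-in for the pydantic ReceiptData model above
    vendor: str
    invoice_number: str
    date: str
    items: list = field(default_factory=list)

def validate_completeness(result) -> None:
    """Reject results with empty required fields."""
    errors = []
    if not result.vendor.strip():
        errors.append("vendor is empty")
    if not result.invoice_number.strip():
        errors.append("invoice_number is empty")
    if not result.items:
        errors.append("no line items extracted")
    if errors:
        raise ValueError("; ".join(errors))

def validate_date_format(result) -> None:
    """Reject dates that don't parse as YYYY-MM-DD."""
    try:
        datetime.strptime(result.date, "%Y-%m-%d")
    except ValueError:
        raise ValueError(f"date {result.date!r} is not in YYYY-MM-DD format")
```

Each validator follows the same contract as validate_math: take the result, raise ValueError with a specific message on failure, return None on success.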

The Tradeoffs

This pattern is clean, but there are some tradeoffs.

Latency is the first one. Each retry is another model call. If you set max_attempts=3, you're looking at up to three round trips to the model. That's fine for batch processing and background jobs. It's not great for user-facing APIs where you need sub-second responses.

The second tradeoff is cost. Retries multiply your API spend, and each invocation uses a fresh instance of the agent. If your post-conditions fail frequently, you're paying for multiple attempts per extraction.

This retry loop is a feature, not a bug. Monitor your validation failure rates. If post-conditions are failing on most first attempts, your prompt needs work, not more retries. Post-conditions are there to catch the edge cases, not to fix fundamentally broken prompts.

You're trading latency and cost for correctness guarantees on logic you never had to implement.

You didn't have to anticipate every receipt format, handle every edge case for how vendors list line items, or write a parser that accounts for the dozen ways people format currency. The model handles that ambiguity, and the post-conditions catch the errors.

That's the right trade for document processing pipelines, financial data extraction, and any task where a wrong answer is worse than a slow answer. It's the wrong trade for real-time chat interfaces or high-volume, cost-sensitive operations.

The library is experimental and the repo is brand new, so it's worth exploring, but expect it to change as it matures.

The Pattern Underneath

What makes this really interesting to me is the programming model. You declare intent through the function signature and docstring. You define correctness through post-conditions, and the AI handles the implementation.

This separation keeps your validation logic as real Python code that you control, test, and version. It isn't buried in a prompt, and you aren't left hoping the model "understands" what you mean by correct. When requirements change, you update the post-conditions. When the model improves, you get better first-attempt success rates without changing your code.

Post-conditions give you a way to programmatically define "correct" for your domain, which is something prompt engineering alone can't do. A prompt can tell the model to "make sure the math adds up," but a post-condition actually checks it and provides specific feedback when it doesn't.

I had a ton of fun experimenting with this new project. Try the pattern yourself; the library is available at https://github.com/strands-labs/ai-functions.
