<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Portia AI</title>
    <description>The latest articles on DEV Community by Portia AI (@portia-ai).</description>
    <link>https://dev.to/portia-ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F10819%2F83e85f18-809a-4ab1-9670-38e4128a53a2.png</url>
      <title>DEV Community: Portia AI</title>
      <link>https://dev.to/portia-ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/portia-ai"/>
    <language>en</language>
    <item>
      <title>Building agents with Controlled Autonomy using our new PlanBuilder interface</title>
      <dc:creator>Robbie Heywood</dc:creator>
      <pubDate>Wed, 10 Sep 2025 14:23:42 +0000</pubDate>
      <link>https://dev.to/portia-ai/building-agents-with-controlled-autonomy-using-our-new-planbuilder-interface-1oc1</link>
      <guid>https://dev.to/portia-ai/building-agents-with-controlled-autonomy-using-our-new-planbuilder-interface-1oc1</guid>
<description>&lt;p&gt;Balancing autonomy and reliability is a key challenge for teams building agents (and getting it right is &lt;a href="https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/" rel="noopener noreferrer"&gt;notoriously difficult!&lt;/a&gt;). At Portia, we’ve built many production-ready agents with our design partners, and today we’re excited to share our solution: &lt;strong&gt;Controlled Autonomy&lt;/strong&gt; - the ability to control the level of autonomy an agent has at each step of an agentic plan. We implement this through our newly reshaped PlanBuilder interface, which we’re releasing into our open-source SDK today. We believe it’s a simple, elegant interface (without the boilerplate of many agentic frameworks) and the best way to create powerful, reliable agentic systems - we can’t wait to see what you build with it!&lt;/p&gt;

&lt;p&gt;If you’re building agents, we’d love to hear from you! Check out our &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;open-source SDK&lt;/a&gt; and let us know what you’re building on &lt;a href="https://discord.com/invite/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;. We also love to see people getting involved with contributions in the repo - if you’d like to get started with this, check out our &lt;a href="https://github.com/portiaAI/portia-sdk-python/issues" rel="noopener noreferrer"&gt;open issues&lt;/a&gt; and let us know if you’d like to take one on.&lt;/p&gt;

&lt;h2&gt;Straight into an example&lt;/h2&gt;

&lt;p&gt;Our PlanBuilder interface is designed to feel intuitive and we find agents built with it are easy to follow, so let’s dive straight into an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;portia&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PlanBuilderV2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StepOutput&lt;/span&gt;

&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;PlanBuilderV2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Run this plan to process a refund request.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund_info&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Info of the customer refund request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_tool_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;step_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_refund_policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_reader_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./refund_policy.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;single_tool_agent_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;step_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_refund_request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find the refund request email from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;customer_email_address&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;portia:google:gmail:search_email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;llm_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;step_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_refund_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review the refund request against the refund policy. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
             &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Decide if the refund should be approved or rejected. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
             &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Return the decision in the format: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;APPROVED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;REJECTED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;StepOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_refund_policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;StepOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_refund_request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
        &lt;span class="n"&gt;output_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;RefundDecision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;function_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;record_refund_decision&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund_decision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;StepOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_refund_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;react_agent_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find the payment that the customer would like refunded.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;portia:mcp:mcp.stripe.com:list_customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;portia:mcp:mcp.stripe.com:list_payment_intents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;StepOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_refund_request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Full example includes more steps to actually process the refund etc.
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above is a modified extract from our Stripe refund agent (full example &lt;a href="https://github.com/portiaAI/portia-agent-examples/blob/main/refund-agent-mcp/refund_agent_with_builder.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;), setting up an agent that acts as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Read in our company’s refund policy:&lt;/strong&gt; this uses a simple &lt;code&gt;invoke_tool_step&lt;/code&gt;, which invokes the tool directly with the specified args and no LLM involvement. These steps are great when you need to use a tool (often to retrieve data) with fixed args and don’t need the flexibility of an LLM to call it - this also generally makes them very fast!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read in the refund request from an email:&lt;/strong&gt; for this step, we want to flexibly find the email in the inbox based on the refund info that is passed into the agent. To do this, we use a &lt;code&gt;single_tool_agent_step&lt;/code&gt;, in which an LLM calls a single tool once to achieve its task. In this case, the agent creates the inbox search query from the refund info passed in and uses it to find the refund email.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Judge the refund request against the refund policy:&lt;/strong&gt; the &lt;code&gt;llm_step&lt;/code&gt; is relatively self-explanatory here - it uses your configured LLM to judge whether we should provide the refund based on the request and the policy. We use the &lt;code&gt;StepOutput&lt;/code&gt; object to feed in the results from the previous steps, and the &lt;code&gt;output_schema&lt;/code&gt; field allows us to return the decision as a &lt;a href="https://docs.pydantic.dev/latest/" rel="noopener noreferrer"&gt;pydantic&lt;/a&gt; object rather than as text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Record the refund decision:&lt;/strong&gt; we have a Python function we use to record the decisions made - we can call this easily with a &lt;code&gt;function_step&lt;/code&gt;, which allows Python functions to be called directly as part of the plan run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Find the payment in Stripe:&lt;/strong&gt; finding a payment in Stripe requires using several tools from Stripe’s remote MCP server (which is &lt;a href="https://docs.portialabs.ai/portia-tools/remote-mcp/stripe" rel="noopener noreferrer"&gt;easily enabled in your Portia account&lt;/a&gt;). We therefore set up a &lt;a href="https://www.promptingguide.ai/techniques/react" rel="noopener noreferrer"&gt;ReAct&lt;/a&gt; agent with the required tools, and it can intelligently chain the Stripe tools together to find the payment. As a bonus, Portia uses MCP Auth by default, so these tool calls will be fully authenticated.&lt;/li&gt;
&lt;/ol&gt;
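&lt;p&gt;The &lt;code&gt;output_schema=RefundDecision&lt;/code&gt; argument in step 3 refers to a pydantic model that isn’t shown in the extract above. A minimal sketch of what it could look like - the field names here are assumptions for illustration, and the full example linked above defines the real model:&lt;/p&gt;

```python
# Minimal sketch of a RefundDecision schema for output_schema above.
# Field names are illustrative assumptions, not the exact model from
# the full Stripe refund example.
from pydantic import BaseModel


class RefundDecision(BaseModel):
    decision: str  # "APPROVED" or "REJECTED", matching the llm_step prompt
    reason: str    # short justification for the decision
```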

&lt;h2&gt;Controlled Autonomy&lt;/h2&gt;

&lt;p&gt;As the above example demonstrates, the power of &lt;code&gt;PlanBuilderV2&lt;/code&gt; comes from the fact that you can easily connect and combine different step types, depending on your situation and requirements. This lets you control the amount of autonomy your system has at each point in its execution: some steps (e.g. &lt;code&gt;react_agent_step&lt;/code&gt;) make use of language models with high autonomy, while others (e.g. &lt;code&gt;invoke_tool_step&lt;/code&gt;) are carefully controlled and constrained.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kla772rncam4irxkmdu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kla772rncam4irxkmdu.png" alt="Autonomy of PlanBuilder steps" width="716" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our experience, it is this ‘&lt;strong&gt;controlled autonomy&lt;/strong&gt;’ that is the key to getting agents to execute reliably, and it is what allows us to move from exciting prototypes to real, production agents. Prototypes are often built with ‘full autonomy’: something like a ReAct agent is given access to all tools and let loose on a task. This approach is possible with our plan builder and can work well in some situations, but in others (particularly for more complex tasks) it can lead to unreliable agents. We’ve found that tasks often need to be broken down into manageable sub-tasks, with the autonomy for each sub-task controlled, before they can be done reliably. For example, research and retrieval steps are often run with high-autonomy ReAct agent steps because they generally use read-only tools that don’t affect other systems; when it comes to the agent taking actions, those steps are run with zero or low autonomy so they happen in a more controlled manner.&lt;/p&gt;

&lt;h2&gt;Simple control structures&lt;/h2&gt;

&lt;p&gt;Extending the above example, our &lt;code&gt;PlanBuilderV2&lt;/code&gt; also provides familiar control structures that you can use when breaking down tasks for your agentic system. This gives you full control to ensure that the task is approached in a reliable way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Conditional steps (if, else if, else)
&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;if_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;review&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;review&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;REJECTED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_review_decision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;StepOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_refund_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;function_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;handle_rejected_refund&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proposed_refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;StepOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proposed_refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endif&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Loops - here we use .loop(over=...), but there are also alternatives for
#         .loop(while=...) and .loop(do_while=...)
&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;over&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StepOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;step_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;function_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;StepOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end_loop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fun fact: we went with &lt;code&gt;.if_()&lt;/code&gt; rather than &lt;code&gt;.if()&lt;/code&gt; (note the underscore) because &lt;code&gt;if&lt;/code&gt; is a reserved keyword in Python.&lt;/p&gt;

&lt;h2&gt;Human-agent interface&lt;/h2&gt;

&lt;p&gt;Another aspect that is vital to getting an agent into production is the ability to seamlessly pass control between agents and humans. While we build trust in agentic systems, there are often key steps that require verification or input from humans. Our PlanBuilder interface allows both to be handled easily, using Portia’s &lt;a href="https://docs.portialabs.ai/understand-clarifications" rel="noopener noreferrer"&gt;clarification system&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Ensure a human approves any refunds our agent gives out
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;user_verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Are you happy to proceed with the following proposed refund: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;StepOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;proposed_refund&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Allow your end user to provide input into how the agent runs
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How would you like your refund?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Return to purchase card&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gift card&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Controlling your agent with code&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;function_step&lt;/code&gt; demonstrated earlier is a key addition to &lt;code&gt;PlanBuilderV2&lt;/code&gt;. In many agentic systems, all tool and function calls go through a language model, which can be slow and can reduce reliability. With &lt;code&gt;function_step&lt;/code&gt;, the function is called with the provided &lt;code&gt;args&lt;/code&gt; at that point in the chain, with full reliability. We’ve seen several use-cases for this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails:&lt;/strong&gt; where deterministic, reliable code checks are used to verify agent behaviour (see example below)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data manipulation:&lt;/strong&gt; when you want to do a simple data transformation in order to link tools together, but you don’t want to pay the latency penalty of an extra LLM call to do the transformation, you can instead do the transformation in code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plug in existing functions:&lt;/strong&gt; when you’ve already got the functionality you need in code, you can use a &lt;code&gt;function_step&lt;/code&gt; to easily plug it into your agent.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add a guardrail to prevent our agent giving our large refunds
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;function_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;step_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reject_payments_above_limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reject_payments_above_limit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proposed_refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;StepOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proposed_refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
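&lt;p&gt;The guardrail function itself isn’t shown above. A minimal sketch of what &lt;code&gt;reject_payments_above_limit&lt;/code&gt; could look like - the signature and the dict shape are assumptions for illustration, and the full example in the repo defines the real function:&lt;/p&gt;

```python
# Illustrative sketch of a guardrail for the function_step above.
# The dict-based signature is an assumption; the real example may differ.

def reject_payments_above_limit(proposed_refund: dict, limit: float) -> dict:
    """Deterministically reject any refund above the configured limit."""
    amount = proposed_refund.get("amount", 0.0)
    if amount > limit:
        # Plain code, no LLM involved: this check runs on every plan run
        return {
            **proposed_refund,
            "status": "REJECTED",
            "reason": f"Refund of {amount} exceeds the limit of {limit}",
        }
    return proposed_refund
```

Because this is deterministic code rather than an LLM judgment, the limit is enforced reliably on every run.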



&lt;h2&gt;What’s next?&lt;/h2&gt;

&lt;p&gt;We’ve really enjoyed building agents with &lt;code&gt;PlanBuilderV2&lt;/code&gt; and are excited to share it more widely. We find that it complements our planning agent nicely: our planning agent can be used to dynamically create plans from natural language when that is needed for your use-case, while the plan builder can be used if you want to more carefully control the steps your agentic system takes with code. &lt;/p&gt;

&lt;p&gt;We’ve also got more features coming up over the next few weeks that will continue to make the plan builder interface even more powerful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallelism:&lt;/strong&gt; run steps in parallel with &lt;code&gt;.parallel()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic caching:&lt;/strong&gt; add &lt;code&gt;cache=True&lt;/code&gt; to steps to automatically cache results - this is a game-changer when you want to iterate on later steps in a plan without having to fully re-run the plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step error handler:&lt;/strong&gt; specify &lt;code&gt;.on_error()&lt;/code&gt; after a step to attach an error handler to it, use &lt;code&gt;.retry()&lt;/code&gt; to allow retries of a step, or use &lt;code&gt;exit_step()&lt;/code&gt; to gracefully exit a plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linked plans:&lt;/strong&gt; link plans together by referring to outputs from previous plan runs.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;PlanBuilderV2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Run this plan to process a refund request.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Run subsequent steps in parallel
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parallel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_tool_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_reader_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./refund_policy.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="c1"&gt;# 2. Add automatic caching to a step
&lt;/span&gt;        &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# 3. Add error handling to a step
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on_error&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;react_agent_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="c1"&gt;# 4. Link plans together by referring to outputs from a previous run
&lt;/span&gt;        &lt;span class="c1"&gt;# Here, we could have a previous agent that determines which       customer refunds to process
&lt;/span&gt;        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read the refund request from my inbox from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;PlanRunOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;previous_run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;portia:google:gmail:search_email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Resume series execution
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;series&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shout out to &lt;a href="https://github.com/gaurava05" rel="noopener noreferrer"&gt;gaurava05&lt;/a&gt; for adding &lt;code&gt;ExitStep&lt;/code&gt; as an open-source contribution in &lt;a href="https://github.com/portiaAI/portia-sdk-python/pull/742" rel="noopener noreferrer"&gt;this PR&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So give our new PlanBuilder a try and let us know how you get on - we can’t wait to see what you build! 🚀&lt;/p&gt;

&lt;p&gt;For more details on &lt;code&gt;PlanBuilderV2&lt;/code&gt;, check out our &lt;a href="https://docs.portialabs.ai/build-plan" rel="noopener noreferrer"&gt;docs&lt;/a&gt;, our &lt;a href="https://github.com/portiaAI/portia-sdk-python/blob/main/example_builder.py" rel="noopener noreferrer"&gt;example plan&lt;/a&gt; or the &lt;a href="https://github.com/portiaAI/portia-agent-examples/blob/main/refund-agent-mcp/refund_agent_with_builder.py" rel="noopener noreferrer"&gt;full stripe refund example&lt;/a&gt;. You can also join our &lt;a href="https://discord.com/invite/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; to hear future updates. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>From Hackathon Idea to Life-Saving Workflow: The Story of the DCRCA Agent</title>
      <dc:creator>Vincenzo Bianco</dc:creator>
      <pubDate>Mon, 08 Sep 2025 09:23:38 +0000</pubDate>
      <link>https://dev.to/portia-ai/from-hackathon-idea-to-life-saving-workflow-the-story-of-the-dcrca-agent-54cc</link>
      <guid>https://dev.to/portia-ai/from-hackathon-idea-to-life-saving-workflow-the-story-of-the-dcrca-agent-54cc</guid>
      <description>&lt;h3&gt;
  
  
  The AI AgentHack Hackathon
&lt;/h3&gt;

&lt;p&gt;Last week, we ran AI AgentHack, a hackathon where more than 3,000 developers built creative agentic projects using &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Portia&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Picking winners wasn’t easy, but one project stood out: Team Dark Mode’s DCRCA Agent (Disaster Chaos Response Coordination AI).&lt;/p&gt;

&lt;p&gt;The DCRCA Agent helps emergency teams cut through the noise. It scans live news and social feeds, pulls out the key details, and maps emergencies by priority so responders know exactly where to act first.&lt;/p&gt;

&lt;p&gt;We were extremely impressed to see this cool Portia use case and, more importantly, an application with great potential for societal impact!&lt;/p&gt;

&lt;p&gt;Below is a deep dive into how the team built this using Portia.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhk5f7q6sr2r8ikey58i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhk5f7q6sr2r8ikey58i.png" alt=" " width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How The DCRCA Agent Is Built
&lt;/h3&gt;

&lt;p&gt;The DCRCA Agent is wired together with &lt;strong&gt;PlanBuilderV2&lt;/strong&gt;, where the workflow is laid out step by step: pulling raw data from news feeds, parsing and prioritizing it with LLM steps, routing through a human approval checkpoint, and finally dispatching updates over email and Slack. Each stage has clear inputs and outputs, making the whole flow transparent.&lt;/p&gt;

&lt;p&gt;A key design choice was the &lt;strong&gt;separation between reasoning and tool calls&lt;/strong&gt;. Reasoning tasks (like structuring raw data or scoring emergencies) live inside &lt;code&gt;.llm_step()&lt;/code&gt;, while external services such as Google Search and Gmail are called through &lt;code&gt;.invoke_tool_step()&lt;/code&gt;. This separation keeps debugging and maintenance straightforward.&lt;/p&gt;

&lt;p&gt;They also used &lt;strong&gt;custom Python functions for oversight&lt;/strong&gt; with &lt;code&gt;.function_step()&lt;/code&gt;. These functions handled approval checks and message formatting, showing how Portia makes human-in-the-loop workflows natural instead of forcing full automation.&lt;/p&gt;

&lt;p&gt;Finally, because every step exposes &lt;strong&gt;structured outputs at runtime&lt;/strong&gt;, the agent can surface both intermediate results (like “Slack message sent ✅”) and the overall summary of actions — giving the team visibility into exactly what happened.&lt;/p&gt;
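&lt;p&gt;The shape of that workflow can be sketched in plain Python. This is a conceptual sketch only: the step bodies and data are hypothetical stand-ins rather than Portia code, and the step kinds simply mirror the &lt;code&gt;.llm_step()&lt;/code&gt; / &lt;code&gt;.invoke_tool_step()&lt;/code&gt; / &lt;code&gt;.function_step()&lt;/code&gt; split described above.&lt;/p&gt;

```python
# Conceptual sketch of the DCRCA flow: reasoning steps, tool steps and a
# human-approval checkpoint wired together in a fixed order. The bodies and
# data below are hypothetical stand-ins, not Portia code.

def fetch_news(state):
    # Tool-style step: pull raw data from news feeds.
    state["raw"] = ["Flood reported in sector 7", "Power outage downtown"]
    return state

def score_emergencies(state):
    # LLM-style step: structure the raw data and rank it by priority
    # (here a trivial stand-in ranking by message length).
    state["scored"] = sorted(state["raw"], key=len, reverse=True)
    return state

def human_approval(state):
    # Function step: a human-in-the-loop oversight checkpoint.
    state["approved"] = bool(state["scored"])
    return state

def dispatch(state):
    # Tool-style step: send updates over email/Slack once approved.
    if state["approved"]:
        state["dispatched"] = [f"ALERT: {item}" for item in state["scored"]]
    return state

PIPELINE = [fetch_news, score_emergencies, human_approval, dispatch]

def run(state=None):
    state = state if state is not None else {}
    for step in PIPELINE:
        state = step(state)  # each step has clear inputs and outputs
    return state
```

&lt;p&gt;Keeping each stage a small function with explicit inputs and outputs is what makes a flow like this transparent to debug.&lt;/p&gt;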

&lt;p&gt;We’re thankful to Team Dark Mode and all other hackathon participants for helping us prove that Portia isn’t just for tinkering—it can drive real, high‑stakes workflows. By combining off‑the‑shelf tools, LLM reasoning and human oversight in a single plan, they built something useful and understandable. &lt;/p&gt;

&lt;p&gt;It’s exciting to imagine what other novel agentic ideas the community will bring to life next!&lt;/p&gt;

&lt;p&gt;If you want to try building your own agentic workflow, check out our &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>5 tools we wish were on the Awesome AI Tools list</title>
      <dc:creator>Mounir Mouawad</dc:creator>
      <pubDate>Fri, 15 Aug 2025 12:15:07 +0000</pubDate>
      <link>https://dev.to/portia-ai/5-tools-we-wish-were-on-the-awesome-ai-tools-list-576m</link>
      <guid>https://dev.to/portia-ai/5-tools-we-wish-were-on-the-awesome-ai-tools-list-576m</guid>
      <description>&lt;p&gt;We’re big fans of the Awesome AI tools &lt;a href="https://github.com/mahseema/awesome-ai-tools?tab=readme-ov-file" rel="noopener noreferrer"&gt;list&lt;/a&gt; and we all use it to discover new AI tools over at &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Portia AI&lt;/a&gt;. My latest and favourite find is &lt;a href="http://getmerlin.in" rel="noopener noreferrer"&gt;Merlin&lt;/a&gt;: A Chrome extension that allows me to ask “how to” questions on any app rather than flipping over to ChatGPT or Claude to ask.&lt;/p&gt;

&lt;p&gt;Here are five tools we use a lot and wish were on the Awesome AI Tools list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://textual.textualize.io/" rel="noopener noreferrer"&gt;Textual&lt;/a&gt; – We love spicing up our terminal interface for using the Portia SDK and even non-technical customers love it when I run demos from the terminal now. We all have our favourite terminal flavour of it – I made mine with Atari retro vibes 🕹️holler if you’re using Portia and want the code for it!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://mistral.ai/solutions/document-ai" rel="noopener noreferrer"&gt;Mistral OCR&lt;/a&gt; – We think it’s the best balance of cost, speed and performance for OCR on the market right now. We also admittedly have soft spot for our neighbours across the English Channel over in La France 🥐.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://visily.ai/" rel="noopener noreferrer"&gt;Visily&lt;/a&gt; – Figma for non-designers, it’s my go-to when brainstorming early UX mocks with front-end engineers and UX designers. I especially love the ability to turn any screenshot into a wireframe because I can bring inspirations to life with some tweaks super quickly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/souzatharsis/podcastfy" rel="noopener noreferrer"&gt;Podcastfy&lt;/a&gt; – I can’t say for sure why they skipped the “i” in their name but we love that they are an open source and equally powerful alternative to NotebookLM. One of our engineers built a bite-sized AI news podcast that I listen to during my commute daily. You can recreate it &lt;a href="https://github.com/portiaAI/portia-agent-examples/tree/main/ai-research-agent" rel="noopener noreferrer"&gt;here&lt;/a&gt; using Portia SDK or get the daily podcast on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord server&lt;/a&gt;’s &lt;strong&gt;#ai-news&lt;/strong&gt; channel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://openrouter.ai/" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; - We love OpenRouter because it allows you to easily try out new models and load balance between models. We actually got an open source &lt;a href="https://github.com/portiaAI/portia-sdk-python/pull/640" rel="noopener noreferrer"&gt;contribution&lt;/a&gt; for this one recently, so we should be supporting it ❤️&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Introducing SteelThread: Evals &amp; Observability for Reliable Agents</title>
      <dc:creator>Vincenzo Bianco</dc:creator>
      <pubDate>Thu, 14 Aug 2025 15:44:11 +0000</pubDate>
      <link>https://dev.to/portia-ai/introducing-steelthread-evals-observability-for-reliable-agents-1bc7</link>
      <guid>https://dev.to/portia-ai/introducing-steelthread-evals-observability-for-reliable-agents-1bc7</guid>
      <description>&lt;p&gt;We’ve spent a lot of time internally running evals for our own agents. If you care about reliability in agentic systems, you know why this matters — models drift, prompts change, third party MCP tools get updated. A small change in one place can cause unexpected behavior somewhere else.&lt;/p&gt;

&lt;p&gt;That’s why we’re excited to share something we’ve been using ourselves for months: SteelThread, our evaluation framework built on top of Portia Cloud.&lt;/p&gt;

&lt;p&gt;You can try it for free on &lt;a href="https://app.portialabs.ai/dashboard" rel="noopener noreferrer"&gt;Portia&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;While building our own automations on top of Portia, we realised it was an absolute joy to run evals with, owing to two of its core features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;First, every agent run is captured in a structured state object called a &lt;code&gt;PlanRunState&lt;/code&gt; — steps, tool calls, arguments, outputs. That makes very targeted evaluators trivial to write, whether deterministic or LLM-as-judge: for example, you can count plan steps, validate the behaviour of a specific tool, or review the tone of the final summary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Second, we use Portia Cloud to store our agent runs. Whenever we manage to produce a multi-agent plan outcome that is desirable (or undesirable) e.g. during agent development, we can take the inputs and outputs of that agent run (query, plan, plan run) and instantly turn them into an Eval dataset. Since we built SteelThread, we haven’t actually needed to manually curate and build eval datasets from scratch anymore.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
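&lt;p&gt;To illustrate the first point: once a run is captured as structured data, a deterministic evaluator is only a few lines of plain Python. The dict below is a hypothetical stand-in for a &lt;code&gt;PlanRunState&lt;/code&gt;-style record, not the real schema, and the tool names are made up for illustration.&lt;/p&gt;

```python
# Sketch: deterministic evaluators over a captured agent run. The dict is a
# hypothetical stand-in for a PlanRunState-style record (steps, tool calls,
# outputs) -- not the real schema, and the tool names are made up.

plan_run = {
    "steps": [
        {"tool": "portia:google:gmail:search_email",
         "output": "found 1 refund request"},
        {"tool": "stripe:refund", "output": "refund issued"},
    ],
    "final_summary": "Refund processed and customer notified.",
}

def max_steps(run, limit):
    """Fail runs that ballooned past the expected plan length."""
    return len(run["steps"]) <= limit

def tool_was_called(run, tool_name):
    """Validate that a specific tool actually ran."""
    return any(step["tool"] == tool_name for step in run["steps"])
```

&lt;p&gt;Checks like these sit alongside LLM-as-judge evaluators for the fuzzier criteria, such as the tone of a summary.&lt;/p&gt;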

&lt;p&gt;Before SteelThread, we still felt the pain that many teams do. Creating and maintaining curated datasets was tedious. Balancing deterministic checks with LLM-as-judge evals was tricky. And running evals against real APIs often meant dealing with authentication, rate limits, or unintended side effects — so we’d spend hours stubbing tools just to test safely.&lt;/p&gt;

&lt;p&gt;SteelThread wraps all of this into a single workflow inside Portia Cloud. It gives you two ways to keep your agents in check: Streams, which spot changes in behavior in real time, and Evals, which let you run regression tests against a ground-truth dataset. Both Streams and Evals allow you to combine deterministic and LLM-as-judge evaluators. You can write your own evaluators, but SteelThread comes with a generous helping of off-the-shelf ones for you to use as well.&lt;/p&gt;

&lt;p&gt;Here is an example flow where we add a production agent run to an Eval dataset.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/a7hDo0l_bqw"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Observability and evals are essential for building reliable agentic systems, and SteelThread just makes them easier. Paired with the Portia development SDK, it’s a powerful combo: build structured, debuggable agents, monitor them in production, and turn any incident into a regression test instantly.&lt;/p&gt;

&lt;p&gt;If you want to try it, head over to &lt;a href="https://app.portialabs.ai/dashboard" rel="noopener noreferrer"&gt;Portia Dashboard&lt;/a&gt; or check out our &lt;a href="https://github.com/portiaAI/steel_thread" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>Portia AI: Initial Thoughts on GPT-5</title>
      <dc:creator>Robbie Heywood</dc:creator>
      <pubDate>Mon, 11 Aug 2025 14:57:35 +0000</pubDate>
      <link>https://dev.to/portia-ai/portia-ai-initial-thoughts-on-gpt-5-4ik</link>
      <guid>https://dev.to/portia-ai/portia-ai-initial-thoughts-on-gpt-5-4ik</guid>
      <description>&lt;p&gt;At Portia AI, we’ve been playing around with GPT-5 since it was released a few days ago and we’re excited to announce it will be available to SDK users in tomorrow’s SDK release 🎉&lt;/p&gt;

&lt;p&gt;After playing with it for a bit, it definitely feels like an incremental improvement rather than a step-change (despite my LinkedIn feed being full of people pronouncing it ‘game-changing’!). To pick out some specific aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Equivalent Accuracy&lt;/strong&gt;: on our benchmarks, GPT-5’s performance is equal to the existing top model, so this is an incremental improvement (if any).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handles complex tools&lt;/strong&gt;: GPT-5 is definitely keener to use tools. We’re still playing around with this, but it does seem like it can handle (and prefers) broader, more complex tools. This is exciting - it should make it easier to build more powerful agents, but also means a re-think of the tools you’re using.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow&lt;/strong&gt;: With the default parameters, the model is seriously slow - generally 5-10x slower across each of our benchmarks. This makes tuning the new &lt;code&gt;reasoning_effort&lt;/code&gt; and &lt;code&gt;verbosity&lt;/code&gt; parameters important.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I actually miss the model picker!&lt;/strong&gt; With the model picker gone, you’re left to rely on the fuzzier world of natural language (and the new &lt;code&gt;reasoning_effort&lt;/code&gt; and &lt;code&gt;verbosity&lt;/code&gt; parameters) to control the model. This is tricky enough that OpenAI have released a new &lt;a href="https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide" rel="noopener noreferrer"&gt;prompt guide&lt;/a&gt; and &lt;a href="https://platform.openai.com/chat/edit?models=gpt-5&amp;amp;optimize=true" rel="noopener noreferrer"&gt;prompt optimiser&lt;/a&gt;. I think there will be real changes when there are models that you don’t feel you need to control in this way - but GPT-5 isn’t there yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solid pricing&lt;/strong&gt;: While it is a little more token-hungry on our benchmarks (10-20% more tokens in our benchmarks), at half the price of GPT-4o / 4.1 / o3, it is a good price for the level of intelligence (a &lt;a href="https://www.latent.space/p/gpt5-router" rel="noopener noreferrer"&gt;great article&lt;/a&gt; on this from Latent Space).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasonable context window&lt;/strong&gt;: At 256k tokens, the context window is fine - but we’ve had several use-cases that use GPT-4.1 / Gemini’s 1m token windows, so we’d been hoping for more...&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coding&lt;/strong&gt;: In Cursor, I’ve found GPT-5 a bit difficult to work with - it’s slow and often over-thinks problems. I’ve moved back to claude-4, though I do use GPT-5 when looking to one-shot something rather than working with the model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are also two aspects that we haven’t dug into yet, but I’m really looking forward to putting them through their paces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool Preambles&lt;/strong&gt;: GPT 5 has been trained to give progress updates in ‘tool preamble’ messages. It’s often really important to keep the user informed as an agent progresses, which can be difficult if the model is being used as a black box. I haven’t seen much talk about this as a feature, but I think it has the potential to be incredibly useful for agent builders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replanning&lt;/strong&gt;: In the past, we’ve got ourselves stuck in loops (particularly with OpenAI models) where the model keeps trying the same thing even when it doesn’t work. GPT-5 is supposed to handle these cases that require a replan much better - it’ll be interesting to dive into this more and see if that’s the case.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In summary, this is still an incremental improvement (if any). It’s sad to see it &lt;a href="https://kieranhealy.org/blog/archives/2025/08/07/blueberry-hill/" rel="noopener noreferrer"&gt;still can’t count the letters in various fruit&lt;/a&gt; and I’m still mostly using claude-4 in Cursor.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How I Built an AI Agent That Turns Daily AI News Into a Commute-Sized Podcast</title>
      <dc:creator>Robbie Heywood</dc:creator>
      <pubDate>Fri, 01 Aug 2025 14:54:27 +0000</pubDate>
      <link>https://dev.to/portia-ai/how-i-built-an-ai-agent-that-turns-daily-ai-news-into-a-commute-sized-podcast-44pg</link>
      <guid>https://dev.to/portia-ai/how-i-built-an-ai-agent-that-turns-daily-ai-news-into-a-commute-sized-podcast-44pg</guid>
      <description>&lt;p&gt;The AI landscape moves at breakneck speed. New models, research papers, funding announcements, and product launches happen daily. As someone working in AI, staying current isn't just helpful—it's essential. But when you're heads-down building features and shipping products, it's tough to find the time to stay on top of all the latest developments.&lt;/p&gt;

&lt;p&gt;That's exactly the challenge we faced at Portia AI. The solution? An AI agent that helps us make the most of the 5-minute stroll our team makes each afternoon to Kings Cross on their way home.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52iisrhxbg07vq5bnwk4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52iisrhxbg07vq5bnwk4.jpg" alt="Alt text" width="800" height="451"&gt;&lt;/a&gt;&lt;br&gt;I’m sure Harry would have spent his commute back from Kings cross listening to our AI podcast too...
  &lt;/p&gt;

&lt;h2&gt;
  
  
  Building AI News Into a Routine
&lt;/h2&gt;

&lt;p&gt;Working in AI means being subscribed to information from multiple sources. The traditional approach of manually checking news sites, Reddit, Twitter, and newsletters was tedious and time-consuming, while important developments could take time to circulate through the team.&lt;/p&gt;

&lt;p&gt;During one of our regular work hack sessions, inspired by NotebookLM's podcast feature, I decided to tackle this problem by building an AI agent that creates daily short AI news podcasts. Here's how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Subscribes to multiple AI news sources throughout the day&lt;/li&gt;
&lt;li&gt;Identifies the most significant developments&lt;/li&gt;
&lt;li&gt;Synthesizes the information into a concise narrative&lt;/li&gt;
&lt;li&gt;Generates a 2-3 minute podcast episode using the fantastic &lt;a href="https://github.com/souzatharsis/podcastfy" rel="noopener noreferrer"&gt;Podcastfy&lt;/a&gt; library&lt;/li&gt;
&lt;li&gt;Provides curated links for deeper investigation&lt;/li&gt;
&lt;li&gt;Shares the podcast and links on Slack and Discord&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We run the agent in the afternoon, so the podcast is available before people's evening commute. This timing allows people to easily integrate the updates into their daily routine. We've also found that the curated links are particularly valuable when there's a topic that's especially relevant to someone, allowing them to dig deeper into the details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Involved
&lt;/h2&gt;

&lt;p&gt;We know that staying abreast of the latest developments is a difficulty lots of teams face, so we've made these news snippets available on our public &lt;a href="https://discord.com/invite/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord server&lt;/a&gt;. Come and check it out if it sounds like something that could be useful.&lt;/p&gt;

&lt;p&gt;The code is open sourced in our &lt;a href="https://github.com/portiaAI/portia-agent-examples" rel="noopener noreferrer"&gt;agent examples repo&lt;/a&gt; if you're keen to see exactly how it works or build something similar for your own team. I think it’s a nice example of how Portia’s open-source &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;agent SDK&lt;/a&gt; makes agents incredibly easy to build. With the agent framework handling much of the complex orchestration between services and APIs, the code ends up being not much more than:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plan_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;Task specification&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DefaultToolRegistry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;PodcastTool&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="n"&gt;portia&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Portia&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;portia&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hope this helps others stay on top of the fast-moving AI world! Enjoy!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>hackathon</category>
    </item>
    <item>
      <title>Building AI Agents: Choose your fighter</title>
      <dc:creator>Katerina</dc:creator>
      <pubDate>Wed, 30 Jul 2025 20:32:48 +0000</pubDate>
      <link>https://dev.to/portia-ai/building-ai-agents-choose-your-fighter-47mm</link>
      <guid>https://dev.to/portia-ai/building-ai-agents-choose-your-fighter-47mm</guid>
      <description>&lt;p&gt;&lt;em&gt;If you want to build agents with Portia AI you can &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;try our SDK for free on Github&lt;/a&gt;. Stars welcome!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are a lot of AI agent builders out there, considering that the term ‘AI Agent SDKs’ indexed at 0 on Google Trends only six months ago.&lt;/p&gt;

&lt;p&gt;At first glance, it’s hard to tell them apart and figure out which will be right for your use case. Everyone uses the same words to describe what they’ve created in an attempt to make the products seem ‘right’ for as many people as possible.&lt;/p&gt;

&lt;p&gt;After a few weeks of researching the market at Portia AI, I realised it’d be useful to write something high-level about the different agent builders out there and how you can choose between them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F324xjwa4ert8p3tq5bj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F324xjwa4ert8p3tq5bj7.png" alt="A cartoon in the style of the street fighter game with two characters fighting in a high tech future" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  For when you want to automate something quickly and without writing much code
&lt;/h2&gt;

&lt;p&gt;In this bucket are n8n, MindStudio and now CrewAI. These products are focussed on business users as well as engineers. They’re designed to be easy to use and can be set up using visual workflows instead of code. They trade off simplicity against control, so they’re less flexible and you don’t have the same fine-grained options as you would get with a code-first product. They’re great for prototyping and trying things out, or if you’ve got a very simple production use case. However, if you were building a KYC agent for a bank, you would want something where you could build in more oversight.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.mindstudio.ai/" rel="noopener noreferrer"&gt;MindStudio&lt;/a&gt;&lt;/strong&gt; is designed for getting AI agents up and running using templates and without code. It’s useful when you want to move quickly and is integrated with the major LLM providers. It also has built in tools you can use to equip your agents without much effort. It’s most useful for straightforward AI workflows you can prototype in a few minutes, for example an agent that summarises what it sees on the page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://n8n.io/" rel="noopener noreferrer"&gt;n8n&lt;/a&gt;&lt;/strong&gt; feels more like a Zapier. It’s most powerful when used for connecting things up to automate your existing business process. The visual builder is the entry point though it can be extended with code. It supports a very large library of LLMs, databases, APIs and tools out of the box which means you don’t need to build your own integrations for  your current processes. It’s most useful for the kind of simple automations you would have used Zapier for previously, for example, send me an email when someone submits our lead form and add a record to our CRM. lt’s not well set up for long running or complex autonomous tasks which need multi step reasoning or memory management. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.crewai.com/" rel="noopener noreferrer"&gt;Crew AI&lt;/a&gt;&lt;/strong&gt; used to be known for their developer SDK product but they’ve now built tooling for no-code workflows too. They combine a visual, no and low code interface with a developer experience. You can create multi-agent systems, using a lead AI to figure out what’s needed and in what order, then delegate to its ‘crew’ of specialist agents. This could be unpredictable as you’re relying on the lead agent to keep everything under control. They’ve now introduced the concept of ‘flows’ to help offset that. You can define the exact steps the agents need to take and in what order using conditional logic, loops and state management. You lose some of the autonomy and flexibility but it increases the reliability and makes your agents easier to audit.  I’m interested to see how they prioritise changes to the no code experience and solutions driven approach against building out the developer product. Their website now emphasises the former. &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  For when you’re looking for more control over your agents for critical use cases
&lt;/h2&gt;

&lt;p&gt;These are opinionated libraries or SDKs for programmers to use to build their agents. They’re code-first frameworks without a no-code option. It’s more work to get started than the out-of-the-box and modular solutions, but you’ll gain fine-grained control and transparency. If you’re building multi-agent systems, this is the level of oversight and control you’ll want. If you’re not, you could give one of the solutions above a try first and graduate onto one of these SDKs if they don’t give you the oversight you need. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain / LangGraph&lt;/a&gt;:&lt;/strong&gt; LangChain gives you building blocks, think prompts, memory, and tool wrappers, so you can roll your own agent from scratch. It’s one of the most popular tools at this level of abstraction as it supports basically every model, tool, and database you might want to use. LangChain uses an architecture that lets you chain LLM calls and let the model decide what to do next. It’s very adaptable but as with any AI agent, dialling up autonomy means you lose some of the control. It can be hard to debug where things went off the rails. LangGraph, is built on top of LangChain, introduces a more declarative, stateful graph architecture. You define the flow of agent steps explicitly, gaining better visibility and control. It also supports human checkpoints, so you can pause for manual approval. It’s open source and free to try.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.portialabs.ai/" rel="noopener noreferrer"&gt;Portia AI&lt;/a&gt;:&lt;/strong&gt; Also open source and developer focussed. It’s our product and we built it to be a production‑grade agent orchestration framework. It’s more opinionated than the other tools mentioned in this section as we build in how wt think a production agent should be put together. You can go back and forth with a planning agent to create a structured plan. It then breaks it into  multi‑step workflows and the plan is then immutable. The execution agents then carry out the plan. You can also add in ‘escalation points’ out of the box, where the agent needs to check with a human if certain conditions are hit. There’s also built-in authentication,  integrated tools like Slack, GitHub, Zendesk, Google and 1000+ tools you can access via MCP servers. It’s aimed at industries where predictability, auditability, and security matter and where you want guidance on how a production grade agent should work. It’s higher level than Lang Graph and Pydantic. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://ai.pydantic.dev/" rel="noopener noreferrer"&gt;Pydantic AI&lt;/a&gt;:&lt;/strong&gt; Pydantic give you the tools to put a production-grade agent together how you want but it’s complex. It’s built for engineers who have strong opinions about how they want their agent architecture to look. You define agents as Pydantic models with structured input and output types, system prompts, and function‑calling tools. It's model‑agnostic and integrates tightly with Pydantic Logfire for real‑time debugging and observability. Agents are reusable, type‑safe, and can use dependency injection for context/tools. Streaming responses are validated continuously for correctness. You still design the workflow, prompt templates, tool logic, and orchestration yourself. However the framework abstracts away validation, structure, and runtime safety, making it easier to build production‑grade agents in Python. &lt;/li&gt;
&lt;/ul&gt;
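&lt;p&gt;The common thread in this tier is making the flow of steps explicit rather than leaving it all to the model. A toy version of that idea in plain Python (the node names and refund scenario are hypothetical; this is not LangGraph or Portia code):&lt;/p&gt;

```python
# Toy illustration of the "explicit flow" idea these SDKs share: the steps
# are declared up front as a graph of named nodes, and one node pauses for
# human approval before execution continues. All names are hypothetical.

def draft_reply(state):
    state["reply"] = f"Dear {state['customer']}, your refund is on its way."
    return "approve"  # name of the next node to run

def approve(state):
    # In a real system this would block on a human decision; here it's a flag.
    return "send" if state.get("human_ok") else "end"

def send(state):
    state["sent"] = True
    return "end"

NODES = {"draft": draft_reply, "approve": approve, "send": send}

def run_graph(state, start="draft"):
    node = start
    while node != "end":  # follow the declared edges until the flow finishes
        node = NODES[node](state)
    return state
```

&lt;p&gt;Because the edges are declared rather than decided by the model at runtime, you can audit exactly which path a run took, which is the trade-off these code-first frameworks make in favour of reliability.&lt;/p&gt;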

&lt;h2&gt;
  
  
  For when you want to lean into the platforms you already use and wire everything up
&lt;/h2&gt;

&lt;p&gt;These frameworks provide scaffolding for creating AI agents, but their main benefit comes if you’re already deep in the Google, OpenAI or Amazon ecosystem and don’t want to add a new provider to your stack.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://openai.github.io/openai-agents-python/" rel="noopener noreferrer"&gt;OpenAI Agent SDK&lt;/a&gt;&lt;/strong&gt;: it gives you scaffolding by creating some lightweight abstractions like the agent loop, tool registration, handoffs to other agents, and tracing. It’s then up to you to create the agents, define the tools and guardrails to put this into production. The built in traceability is super useful and the SDK is interoperable with any LLM under the hood. This is a good option if you want something lightweight and code first for building your first agent. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;Google’s ADK&lt;/a&gt;&lt;/strong&gt; (agent developer kit) is another open-source, code-first toolkit. It helps developers build agentic systems that use dynamic reasoning or enforce some structure. It provides different built in agent types (loop, parallel, LLM driven) for you to design your work flows around based on how much control  you need. It has built in tools to Google services if you’re automating things within that ecosystem. The memory handling out of the box is also super helpful. You’ll still need to figure out the agent logic, prompts and connection everything up. But it’s flexible and abstracts out some of the more complex areas of agent design. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/bedrock/" rel="noopener noreferrer"&gt;Amazon Bedrock &lt;/a&gt;&lt;/strong&gt;is another production‑ready agent toolkit. It has a  visual builder and modular services like Memory, Gateway, and Code Interpreter. You can connect agents to your APIs, Lambda functions, and knowledge bases that already run on Amazon out of the box. It also now supports memory retention without you having to implement it yourself. Like with the Google ADK, you’ll still need to design the agent logic, orchestration, prompts, and integration flow but Amazon AgentCore helps manage key infrastructure: session isolation, observability, identity, and tool access. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
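&lt;p&gt;All of these SDKs abstract the same core machinery: an agent loop that asks a model for the next action, a tool registry, and a stopping condition. Here's a toy, stdlib-only sketch of that loop with a scripted stand-in for the model (every name is hypothetical, not any vendor's actual API):&lt;/p&gt;

```python
from typing import Callable

# Minimal tool registry: name -> callable. Real SDKs add schemas, validation
# and tracing around this.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"results for {q!r}",
    "calculate": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def scripted_model(history: list[str]) -> tuple[str, str]:
    """Stand-in for an LLM call: returns (action, argument).
    A real loop would send `history` to a model and parse its reply."""
    if not any(h.startswith("tool:") for h in history):
        return ("calculate", "6 * 7")
    return ("final", history[-1].removeprefix("tool:"))

def run_agent(task: str, max_steps: int = 5) -> str:
    history = [f"task:{task}"]
    for _ in range(max_steps):
        action, arg = scripted_model(history)
        if action == "final":        # the model decides it is done
            return arg
        result = TOOLS[action](arg)  # dispatch to the registered tool
        history.append(f"tool:{result}")
    raise RuntimeError("step budget exceeded")

print(run_agent("what is 6 * 7?"))  # 42
```

&lt;p&gt;Swap the scripted function for a real model call and you have the skeleton these SDKs wrap with guardrails, handoffs and observability.&lt;/p&gt;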

&lt;h2&gt;
  
  
  Which one should I pick?
&lt;/h2&gt;

&lt;p&gt;As you can tell from this post, there’s no ‘best’ product out there. You need to figure out the trade-offs you want to make when building your agentic systems. Here are some questions that can help you make that choice: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who’s building the agents, developers or another business function?&lt;/li&gt;
&lt;li&gt;How much control do you need over your agent? &lt;/li&gt;
&lt;li&gt;Do you need one agent or many? &lt;/li&gt;
&lt;li&gt;Are you creating something quick to test or a crucial piece of infrastructure?&lt;/li&gt;
&lt;li&gt;Do you have strong opinions about the way agents should be structured or are you looking for guidance? &lt;/li&gt;
&lt;li&gt;How much do you want or need to build yourself?&lt;/li&gt;
&lt;li&gt;How much are you locked into a particular vendor?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;New frameworks and tools are coming out all the time, so we expect these products to start looking more similar over time. And as the launch of the MCP standard and the emergence of A2A show, things are likely to get more interoperable, not less, as standards are established. The frameworks aren’t mutually exclusive, and you can choose the right approach based on where you are in your development lifecycle and your answers to the questions above. The no-code tools can deliver quick prototypes but are limited. If you need flexibility and control, the mid-level opinionated frameworks are a strong choice. If you have a big developer team and a very custom setup, starting with the base provider SDKs could give you the fine-grained control you need. Good luck!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>Build an AI Agent And Win 💸</title>
      <dc:creator>Zevi Reinitz</dc:creator>
      <pubDate>Fri, 18 Jul 2025 20:23:30 +0000</pubDate>
      <link>https://dev.to/portia-ai/build-an-ai-agent-and-win-3p3c</link>
      <guid>https://dev.to/portia-ai/build-an-ai-agent-and-win-3p3c</guid>
      <description>&lt;p&gt;“Everyone’s talking about AI agents. But what can you &lt;em&gt;actually&lt;/em&gt; build?”&lt;/p&gt;

&lt;p&gt;We (the team at &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Portia AI&lt;/a&gt;) keep hearing this — so we’re turning the question back to the community... and we're offering $$$ to the people who can come up with the best answer!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Announcing The "Agents Showdown"&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;For the next 4 weeks, we're taking submissions for our first ever online hackathon. Join us for a chance to earn £500&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;💸 Glory awaits!&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvpsffzecg5s3nxtcuyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvpsffzecg5s3nxtcuyu.png" alt="Portia Hackathon" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Overview&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Portia AI is an open-source SDK that stands out by letting AI agents express their planned response to a prompt up front, share their progress during execution, and solicit human input under defined conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👉🏼 We want to build some cool examples that leverage our differentiators and add them to our &lt;a href="https://github.com/portiaAI/portia-agent-examples" rel="noopener noreferrer"&gt;examples repo&lt;/a&gt; on Github.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Bounty&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The best submission will win a £500 bounty and will be featured on our social channels.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Judging Criteria&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Submissions should be current, i.e. leverage the latest emerging technologies in AI (MCP, A2A, etc.).

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Two Portia agents coordinating their plan runs with each other / kicking off other Portia agents.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Demonstrates Portia’s strong suit e.g. dynamic planning with reinforcement (user-led learning) and / or human-agent interaction (hooks and clarifications).

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Use hooks to handle profanity, PII leaks, prompt injections and more.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Touches on regulated spaces where mistakes due to agents going off the rails are very costly e.g. healthcare, finance, legal, insurance&lt;/li&gt;

&lt;li&gt;Quality of submission (demo video, readme, code quality)&lt;/li&gt;

&lt;li&gt;Submission should use solely (or predominantly) the Portia framework&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Eligibility and How To Enter&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Comment on this issue when you start working on the hackathon to let us know you're on it&lt;/li&gt;
&lt;li&gt;Submit your project as a comment on &lt;a href="https://github.com/portiaAI/portia-sdk-python/issues/576" rel="noopener noreferrer"&gt;this Github issue&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;Provide a link to a private repo shared with &lt;a href="https://github.com/orgs/portiaAI/people/emmaportia" rel="noopener noreferrer"&gt;emmaportia&lt;/a&gt; and &lt;a href="https://github.com/orgs/portiaAI/people/mounir-portia" rel="noopener noreferrer"&gt;Momo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Include a high-resolution demo video explaining how your project works&lt;/li&gt;

&lt;li&gt;Write up a Readme including:

&lt;ul&gt;
&lt;li&gt;Key package dependencies&lt;/li&gt;
&lt;li&gt;Setup steps&lt;/li&gt;
&lt;li&gt;Running instructions&lt;/li&gt;
&lt;li&gt;Overview of key components in the code&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Follow our &lt;a href="https://github.com/portiaAI/portia-sdk-python/blob/main/CONTRIBUTING.md#how-to-contribute" rel="noopener noreferrer"&gt;contribution guidelines&lt;/a&gt; in particular around linting&lt;/li&gt;

&lt;li&gt;Star &lt;a href="https://github.com/portiaAI/portia-sdk-python?tab=readme-ov-file" rel="noopener noreferrer"&gt;our GitHub repo&lt;/a&gt; 😇&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Developer Resources&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Get immersed in our SDK and give us a star ⭐️ (&lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Head over to our docs (&lt;a href="https://docs.portialabs.ai"&gt;docs.portialabs.ai&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Join the conversation on Discord (&lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;https://discord.gg/DvAJz9ffaR&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Sign Up to &lt;a href="https://app.portialabs.ai/dashboard" rel="noopener noreferrer"&gt;Portia cloud&lt;/a&gt; and access 1000+ cloud and MCP tools with built-in auth out of the box&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcuh2m4t06za5gq7pc7t.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcuh2m4t06za5gq7pc7t.gif" alt="GIF" width="600" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>hackathon</category>
      <category>mcp</category>
      <category>devchallenge</category>
    </item>
    <item>
      <title>Code vs LLM in a simple planning poker agent example</title>
      <dc:creator>Mounir Mouawad</dc:creator>
      <pubDate>Wed, 09 Jul 2025 17:00:33 +0000</pubDate>
      <link>https://dev.to/portia-ai/code-vs-llm-in-a-simple-planning-poker-agent-example-5dg0</link>
      <guid>https://dev.to/portia-ai/code-vs-llm-in-a-simple-planning-poker-agent-example-5dg0</guid>
      <description>&lt;p&gt;If you're building AI agents, chances are you often had to consider how much logic you want to handle through the LLM versus through traditional code. I wanted to share my experience with it this morning as a conversation starter and get your thoughts! &lt;/p&gt;

&lt;h2&gt;
  
  
  What I wanted the agent to do
&lt;/h2&gt;

&lt;p&gt;I normally spend a ton of time gathering feedback from our users. In a previous life I would put those insights into tickets in Linear and spend a ton of mental cycles trying to size the return on effort to inform our prioritisation. In this bold new world of AI, I figured I would instead write up a &lt;a href="https://en.wikipedia.org/wiki/Planning_poker" rel="noopener noreferrer"&gt;planning poker&lt;/a&gt; agent to help me do t-shirt sizing of some of those tickets in Linear. Built on the Portia SDK, the agent would:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch relevant Linear tickets using the remote MCP server for Linear, which is one of &lt;a href="https://www.portialabs.ai/tools" rel="noopener noreferrer"&gt;1000s of tools&lt;/a&gt; we have with built-in auth.&lt;/li&gt;
&lt;li&gt;Simulate sizing estimates from multiple developer personas and get to a consensus for each ticket's effort sizing. Here I wanted to create a ticket estimator tool using a subclass of our LLM tool that would return estimates as &lt;a href="https://docs.portialabs.ai/inputs-outputs#llm-tool-outputs" rel="noopener noreferrer"&gt;structured outputs&lt;/a&gt;. The tool would take a &lt;code&gt;context.md&lt;/code&gt; file where I keep a summary of the architecture and core abstractions that make up the Portia SDK so it can help the LLM with effort sizing.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As it turns out, I had asked one of our devs (we'll call him Ethan) to do this and forgotten! So we both wrote this thing up at the same time except...I relied quite heavily on the LLM to handle the task while he relied way more heavily on code. Let's unpack how our approaches compared.&lt;/p&gt;

&lt;h2&gt;
  
  
  How each of us built it
&lt;/h2&gt;

&lt;p&gt;Full code in our agent examples repo &lt;a href="https://github.com/portiaAI/portia-agent-examples/tree/main/planning-poker" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🧠 &lt;strong&gt;LLM-heavy&lt;/strong&gt;: I relied on a robust prompt and the &lt;a href="https://docs.portialabs.ai/generate-plan" rel="noopener noreferrer"&gt;Portia planning agent&lt;/a&gt; to figure out the entire set of steps to take: fetch and filter tickets from Linear, then get estimates for ticket sizes from each developer persona and average them out. Essentially I relied on the LLM itself to 1) index and aggregate the sizing estimates by Linear ticket id and persona, and 2) figure out how many tool call iterations (a.k.a. "unrolling") to make to handle all ticket id and persona combinations. Here's the code snippet where the magic happens:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get tickets from Linear and estimate the size of the tickets
&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Async Portia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get the tickets i&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m working on from Linear with a limit of 3 on the tool call. Then filter specifically for those regarding the &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; project.
    For each combination of the tickets above and the following personas, estimate the size of the ticket.
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;personas&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

    Return the estimates in a list of PlanningPokerEstimate objects, with estimate sizes averaged across the personas for each ticket.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;estimates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;portia&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;structured_output_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PlanningPokerEstimateList&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;estimates&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;🧑🏻‍💻 &lt;strong&gt;Code-heavy&lt;/strong&gt;: Ethan, on the other hand, figured we don't really need to rely on the LLM, either for planning or for indexing / aggregating / iterating on estimates. Instead he used Portia's declarative &lt;code&gt;PlanBuilder&lt;/code&gt; &lt;a href="https://docs.portialabs.ai/generate-plan#build-a-plan-manually" rel="noopener noreferrer"&gt;interface&lt;/a&gt; to enumerate the steps and tool calls needed. He first fetched the tickets into &lt;code&gt;LinearTicket&lt;/code&gt; objects in a Portia plan run using &lt;a href="https://docs.portialabs.ai/inputs-outputs#plan-structured-outputs" rel="noopener noreferrer"&gt;structured outputs&lt;/a&gt;. To generate sizing estimates, he then iterated with conventional code over each developer persona and each ticket in the list returned from that plan run, with each iteration calling the ticket estimator tool in a single-step Portia plan run. Here's a code snippet containing both the ticket-fetching plan run and the ticket-sizing iterations:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fetch Linear tickets
&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Async SDK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get the tickets i&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m working on from linear regarding the &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; project&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PlanBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;structured_output_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LinearTicketList&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; and only call the tool with a limit of 3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;portia:mcp:mcp.linear.app:list_my_issues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Filter the tickets to only include specifically the ones related to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plan_run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;portia&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_plan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tickets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plan_run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tickets&lt;/span&gt;

&lt;span class="c1"&gt;# Iterate over tickets and persona to generate estimates
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tickets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;estimates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;estimate_plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PlanBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimate the size of the ticket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;structured_output_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PlanningPokerEstimate&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estimate the size of the ticket: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticket_estimator_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;persona&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;personas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;persona&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
        &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;estimate_tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;
        &lt;span class="n"&gt;portia&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimate_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overwrite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;estimate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;portia&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_plan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimate_plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;PlanRunState&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COMPLETE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;estimates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
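&lt;p&gt;The consensus step in the code-heavy version reduces to ordinary Python once the estimates are collected. A minimal sketch of averaging t-shirt sizes across personas (the numeric scale is my own assumption, not something from the Portia SDK):&lt;/p&gt;

```python
# Map t-shirt sizes onto a numeric scale, average per ticket across personas,
# then snap the mean back to the nearest size. The scale itself is arbitrary.
SIZES = ["XS", "S", "M", "L", "XL"]

def consensus(estimates: list[str]) -> str:
    mean = sum(SIZES.index(e) for e in estimates) / len(estimates)
    return SIZES[round(mean)]

# Three personas sized one ticket as M, L and L -> consensus L.
print(consensus(["M", "L", "L"]))  # L
```

&lt;p&gt;This is exactly the kind of deterministic aggregation you don't want to pay an LLM for: it's free, instant, and never hallucinates an average.&lt;/p&gt;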



&lt;h2&gt;
  
  
  What we learned
&lt;/h2&gt;

&lt;p&gt;Let's compare both approaches side by side and draw some conclusions. I hooked up Langsmith to Portia for &lt;a href="https://docs.portialabs.ai/agent-observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt; so I could obtain the metrics shown below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;LLM-heavy&lt;/th&gt;
&lt;th&gt;Code-heavy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Effort&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;70k&lt;/td&gt;
&lt;td&gt;30k&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$0.12&lt;/td&gt;
&lt;td&gt;$0.06&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency [P99]&lt;/td&gt;
&lt;td&gt;28.95s&lt;/td&gt;
&lt;td&gt;9.70s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So what conclusions can we draw from this exercise?&lt;br&gt;
💡 &lt;strong&gt;Reliability:&lt;/strong&gt; You can trust your Portia agents to figure out the right sequence of steps and to unroll (iterate on) the tool calls correctly so that definitely simplifies development, kinda like a form of vibe coding...but much like vibe coding it does take a bit of 'LLM-whispering' (a.k.a. prompt engineering) and using the right underlying model. For plan runs with heavy iteration expectations in particular, you will need robust eval sets in place to keep tabs on reliability lest you aim for a Mona Lisa and end up with a Picasso.&lt;br&gt;
👣 &lt;strong&gt;Traceability:&lt;/strong&gt; Relying on the LLM to handle planning and execution to the extent I did does make tracing particularly easy. One single &lt;code&gt;PlanRunState&lt;/code&gt; &lt;a href="https://docs.portialabs.ai/store-retrieve-plan-runs" rel="noopener noreferrer"&gt;instance&lt;/a&gt; in the Portia dashboard showed me the entirety of the work done by the underlying subagents. This also makes revisiting the output of the plan run easier of course. Ethan on the other hand ended up with numerous plan runs, which makes auditing and / or debugging harder.&lt;br&gt;
💸 &lt;strong&gt;Cost:&lt;/strong&gt; As you'd expect, the LLM-heavy method is slower and costlier. We're presumably still processing the same amount of context (same number of tickets and estimations), but the overhead of passing a growing context window along to every execution agent during the plan run makes the difference inevitable. You're also opening yourself up to the stochasticity of LLMs where code could do the trick.&lt;/p&gt;

&lt;h2&gt;
  
  
  A parting thought
&lt;/h2&gt;

&lt;p&gt;One aspect I don't consider in the comparison above is autonomy. Because the task is neatly scoped in this example (planning poker agent = fetch and filter tickets + estimate per persona + summarise consensus) you can make the argument that at production scale one should restrict LLM usage only to the tasks that traditional code can't handle as easily (e.g. natural language processing). BUT where inputs from the environment change or the scope of the task is fluid, the LLM-heavy approach truly thrives. I'll try to tease that out more fully in a subsequent post. &lt;br&gt;
👉🏼 &lt;strong&gt;If you're interested please shout in the comments down below!&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  About Portia
&lt;/h2&gt;

&lt;p&gt;Portia AI is an open-source framework for building predictable, stateful, authenticated agentic workflows.&lt;/p&gt;

&lt;p&gt;We allow developers to have as much or as little oversight as they’d like over their multi-agent deployments and we are obsessively focused on production readiness.&lt;/p&gt;

&lt;p&gt;We invite you to play around with our &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;SDK&lt;/a&gt;, break things, and tell us how you're getting on in &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
    </item>
    <item>
      <title>3 Issues That Remote MCP Developers Should Avoid</title>
      <dc:creator>Zevi Reinitz</dc:creator>
      <pubDate>Tue, 08 Jul 2025 16:48:22 +0000</pubDate>
      <link>https://dev.to/portia-ai/3-issues-that-remote-mcp-developers-should-avoid-226o</link>
      <guid>https://dev.to/portia-ai/3-issues-that-remote-mcp-developers-should-avoid-226o</guid>
      <description>&lt;p&gt;Remote MCP servers are only just starting to take off. As more platforms roll out support for the Model Context Protocol (MCP), we're seeing rapid growth in developer experimentation — and equally, we're seeing many of the common pitfalls emerge for teams building MCP servers for the first time.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Portia&lt;/a&gt;, we’ve tested a broad range of remote MCP servers — from major providers like Asana, Atlassian, Intercom and Stripe, to emerging integrations like Fulcra, Globalping and Invideo. In the process, we’ve seen a few recurring issues that make MCP servers harder to use, slower to adopt, and more brittle in production.&lt;/p&gt;

&lt;p&gt;[&lt;em&gt;For some quick context - Portia is the framework that enables developers to build safe, reliable AI agents. We'd be thrilled to have you &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;check out our SDK&lt;/a&gt; and give it a GitHub star!&lt;/em&gt;⭐ 🙏]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here are 3 mistakes we’d recommend every MCP developer avoid.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7ezjzq9jy8a7a5kmjrn.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7ezjzq9jy8a7a5kmjrn.gif" alt="be safe" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. OAuth redirects only work on localhost
&lt;/h2&gt;

&lt;p&gt;Several MCP servers we tested work fine locally, but break when used in real-world staging or production environments. A common cause: OAuth redirect URIs are hardcoded to allow only &lt;code&gt;localhost&lt;/code&gt;. While this might be convenient during initial development, it blocks testing in any remote environment.&lt;/p&gt;

&lt;p&gt;For example, we couldn’t add Atlassian because they rejected valid redirect URIs during authorization flows. This makes it nearly impossible for integrators to properly test your server in their deployment pipelines.&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Best practice:&lt;/strong&gt; Always allow multiple redirect URIs for OAuth clients, including both local and remote environments. Ideally, make this configurable per client.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Missing &lt;code&gt;.well-known&lt;/code&gt; OAuth metadata or misconfigured tool discovery
&lt;/h2&gt;

&lt;p&gt;The MCP spec depends on being able to automatically discover OAuth configuration using the &lt;code&gt;.well-known/oauth-authorization-server&lt;/code&gt; endpoint. Several servers — including PostHog, Semgrep and Invideo — either failed to serve this file or required tokens to access the tools/list endpoint, which prevents tool discovery.&lt;/p&gt;

&lt;p&gt;Without a valid &lt;code&gt;.well-known&lt;/code&gt; file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Client libraries can't automatically configure OAuth.&lt;/li&gt;
&lt;li&gt;MCP agents can't easily discover your tools.&lt;/li&gt;
&lt;li&gt;Developer experience suffers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ &lt;strong&gt;Best practice:&lt;/strong&gt; Make sure your &lt;code&gt;.well-known/oauth-authorization-server&lt;/code&gt; is publicly accessible, standards-compliant, and returns the necessary OAuth metadata.&lt;/p&gt;
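&lt;p&gt;A lightweight way to guard against this is a CI check over the parsed metadata document. The sketch below is our own illustrative helper (not MCP-spec code); the field names come from RFC 8414:&lt;/p&gt;

```python
# Hypothetical helper: sanity-check a parsed
# /.well-known/oauth-authorization-server response (RFC 8414).
REQUIRED_FIELDS = ("issuer", "authorization_endpoint", "token_endpoint")

def missing_metadata_fields(metadata):
    # Return every core RFC 8414 field absent from the metadata dict.
    return [field for field in REQUIRED_FIELDS if field not in metadata]
```

&lt;p&gt;In practice you would fetch the endpoint unauthenticated with your HTTP client of choice, parse the JSON, and fail the build if the returned list is non-empty.&lt;/p&gt;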

&lt;h2&gt;
  
  
  3. Unreliable server availability
&lt;/h2&gt;

&lt;p&gt;MCP servers are expected to handle dynamic discovery, token exchange, and tool invocation — and this requires a fairly high level of reliability. In our testing, we encountered servers that were down for maintenance (Asana), intermittently unavailable (Plaid, Neon), or failed authorization flows with unhelpful errors (Fulcra).&lt;/p&gt;

&lt;p&gt;In an agentic world, unreliability doesn’t just block a single API call — it can break entire task chains and workflows.&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Best practice:&lt;/strong&gt; Invest early in uptime monitoring, clear error messages, and comprehensive OAuth error handling. MCP servers should fail gracefully and predictably.&lt;/p&gt;
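&lt;p&gt;On the client side, transient outages like these are usually absorbed with retries. Here is a minimal sketch (our own illustration, not Portia SDK code), assuming failures surface as &lt;code&gt;ConnectionError&lt;/code&gt;:&lt;/p&gt;

```python
import time

# Hypothetical sketch: retry a flaky remote call with exponential
# backoff so transient server outages degrade gracefully.
def call_with_retries(fn, attempts=3, base_delay=1.0):
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError as error:
            last_error = error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise last_error
```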

&lt;h2&gt;
  
  
  The Good News: Testing is Easy With Portia
&lt;/h2&gt;

&lt;p&gt;If you’re building a new remote MCP server, testing your implementation doesn’t need to be painful. Portia makes it simple to integrate, validate, and experiment with new MCP servers directly inside your agentic app — often in just 3 clicks.&lt;/p&gt;

&lt;p&gt;As the MCP ecosystem matures, we're excited to help developers ship more reliable, agent-friendly integrations that just work — across staging, production, and every environment in between.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Bit More About Portia
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.portialabs.ai/" rel="noopener noreferrer"&gt;Portia AI&lt;/a&gt; is an open-source framework for building predictable, stateful, authenticated agentic workflows. &lt;/p&gt;

&lt;p&gt;We allow developers to have as much or as little oversight as they’d like over their multi-agent deployments and we are obsessively focused on production readiness. &lt;/p&gt;

&lt;p&gt;We invite you to &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;play around with our SDK&lt;/a&gt;, break things, and tell us how you're getting on in &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
    </item>
    <item>
      <title>Steel Thread, Evals and building reliable agents.</title>
      <dc:creator>Tom Stuart</dc:creator>
      <pubDate>Wed, 04 Jun 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/portia-ai/steel-thread-evals-and-building-reliable-agents-2ld6</link>
      <guid>https://dev.to/portia-ai/steel-thread-evals-and-building-reliable-agents-2ld6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;At Portia we spend a lot of time thinking about what it means to make agents reliable and production-worthy. Many people find it easy to build an agent for a proof of concept, but much harder to get it into production. Doing so takes a real focus on production readiness and a suite of supporting features (many of which are available in our SDK, as we’ve covered in previous blog posts):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User Led Learning for reliable planning&lt;/li&gt;
&lt;li&gt;Agent Memory for large data sets&lt;/li&gt;
&lt;li&gt;Human-in-the-loop clarifications that let agents raise questions back to humans&lt;/li&gt;
&lt;li&gt;Separate planning and execution phases for constrained execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But today we want to focus on the meta question of how we know that these features help improve the reliability of agents built on top of them by talking about evals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evals - What Are They Anyway?
&lt;/h2&gt;

&lt;p&gt;“Evals” is shorthand for evaluations. They’re how we turn the vague question of “is this agent doing the right thing?” into something we can actually measure and repeat.&lt;/p&gt;

&lt;p&gt;A good eval works a lot like a unit test or integration test. You give it inputs, run them through the system, and check whether the outputs match what you expected.&lt;/p&gt;

&lt;p&gt;Where evals differ from integration or unit tests is that we are operating in a probabilistic world. LLMs are non-deterministic, and so is any software built on top of them. We therefore need to adjust our approach: instead of always asserting an exact result, we try each case multiple times and measure correctness as a percentage.&lt;/p&gt;
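&lt;p&gt;In code, that shift looks something like the sketch below (an illustration rather than Steel Thread's actual API): run each case several times and track the pass rate instead of making a single hard assertion.&lt;/p&gt;

```python
# Illustrative sketch: score a non-deterministic agent by repetition.
def pass_rate(agent, case_input, expected, reps=10):
    # Count how often the agent produces the expected answer, then
    # report correctness as a fraction rather than a pass/fail bool.
    passes = sum(1 for _ in range(reps) if agent(case_input) == expected)
    return passes / reps
```

&lt;p&gt;You can then assert against a threshold (say, 0.9) rather than demanding a perfect score on every run.&lt;/p&gt;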

&lt;p&gt;Consider an example of some software that gets the weather:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Integration test:

GET /weather?location=LONDON 

response.status_code = 200


Eval Approach:

What's the weather in London today?

The weather in London is 20C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An integration test is generally easy to write because we can make a request to the software and write concrete assertions about the response. With an agent this is much harder: the response is natural language and its format can vary from call to call. Not only that, but with agents we are usually dealing with side effects (like an email being sent correctly) that aren't contained in the response at all.&lt;/p&gt;

&lt;p&gt;Like unit and integration tests, we think of evals as existing on a spectrum:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At one end, you have low-level checks. These might confirm that a specific tool was called (a weather tool in the example above), or that a plan has the right structure.&lt;/li&gt;
&lt;li&gt;At the other end, you have full end-to-end tests. These look at whether the agent made the right business decision, even if it found a creative way to get there.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each type of eval gives you different insight. Low-level evals tell you if the system is behaving as expected. High-level evals help you understand whether it’s making good decisions for the right reasons.&lt;/p&gt;
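&lt;p&gt;A low-level check of the first kind can be as simple as inspecting a recorded trace of tool calls (the trace format here is hypothetical, purely for illustration):&lt;/p&gt;

```python
# Hypothetical low-level eval: assert the agent invoked the expected
# tool, independent of how its natural-language answer is worded.
def called_tool(trace, tool_name):
    # trace is a list of (tool, args) tuples recorded during a run.
    return any(tool == tool_name for tool, _args in trace)
```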

&lt;p&gt;Evals aren’t only for catching bugs, though they certainly help with that. They’re for asking big performance questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens if we change LLM provider from OpenAI to Gemini?&lt;/li&gt;
&lt;li&gt;Does performance improve if we change the structure or even tone of our prompts?&lt;/li&gt;
&lt;li&gt;How does changing the interface of a tool affect the agent's performance?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Steel Thread
&lt;/h2&gt;

&lt;p&gt;To make evals easier, we built Steel Thread, our internal eval framework, now available to partners using the Portia SDK. Steel Thread is a lightweight Python library designed to help you write, run, and analyze evals for your agent workflows.&lt;/p&gt;

&lt;p&gt;It focuses on being:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple&lt;/strong&gt; : Define your test cases in Python or JSON. Each test specifies inputs, expected outputs, and what success looks like.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast&lt;/strong&gt; : Built-in concurrency and retries mean you can scale up eval runs without overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible&lt;/strong&gt; : You can write low-level metrics that look at fine-grained behavior, or high-level E2E runners that test business outcomes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualizable&lt;/strong&gt; : Steel Thread can push results to external layers like LangSmith, making it easy to explore where things are failing and why.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use it ourselves to test the Portia SDK and the agents built with it. It’s designed for both internal development workflows and user-facing evaluation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example: Writing Your First Eval
&lt;/h2&gt;

&lt;p&gt;Let’s walk through a simplified example of using Steel Thread to evaluate an agent. Say you have an agent that makes a decision based on a set of inputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;steel_thread.runner&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;E2ERunner&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;steel_thread.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EvalOutput&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyAgent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  &lt;span class="c1"&gt;# type: ignore
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;EvalOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

          &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
              &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a smart home assistant tasked with identifying what actions to take.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
              &lt;span class="p"&gt;],&lt;/span&gt;
              &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="n"&gt;output_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

          &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;turn on&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;output_text&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;light&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;output_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
              &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lights_on&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;play&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;output_text&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jazz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;output_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
              &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;play_music_jazz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
              &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You want to test whether it produces the expected outcome given specific inputs. To do this you define a set of test cases, much like you would with normal software testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;steel_thread.e2e&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;E2EEval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EvalOutput&lt;/span&gt;

&lt;span class="n"&gt;test_cases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;E2EEval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;u1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;turn on the lights&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;EvalOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lights_on&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;final_error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;E2EEval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;u2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;play jazz music&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;EvalOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;play_music_jazz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;final_error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To integrate this with Steel Thread, we define a runner. The runner is a simple class that knows how to call your agent with the inputs from each test case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExampleAgentRunner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;E2ERunner&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;EvalOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;my_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;EvalOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;final_error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;EvalOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;final_error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything is simple enough up to this point. Here is where the magic comes in: we ask Steel Thread to handle updating the dataset and then to run the evals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My Eval Set&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="nf"&gt;update_dataset_main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eval_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;run_eval_main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;ExampleAgentRunner&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="nc"&gt;EvalOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;reps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;upload_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;extra_metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Steel Thread will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concurrently run your agent against the defined test case(s) using the runner according to your concurrency config.&lt;/li&gt;
&lt;li&gt;Score the outputs using a set of default metrics and/or the extra metrics you think are important for your use case.&lt;/li&gt;
&lt;li&gt;Upload the results to your visualization layer of choice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb537us2wdw5psjon3nos.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb537us2wdw5psjon3nos.png" alt="A visualization of eval data" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding Extra Metrics
&lt;/h2&gt;

&lt;p&gt;In addition to checking whether the agent returns the correct output, we sometimes care about how the agent answers, for example how concise or verbose its responses are.&lt;/p&gt;

&lt;p&gt;Steel Thread makes it easy to define your own custom metrics by subclassing the Metric base class. Here's a simple example of a metric that gives higher scores to shorter outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;steel_thread.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Metric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EvaluationResult&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;steel_thread.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Example&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ShortOutputPreferred&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Metric&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Reward shorter outputs (under 20 tokens) with a score of 1.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Example&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;evaluator_run_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;EvaluationResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="n"&gt;word_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;EvaluationResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;short_output_preferred&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;word_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How We Use Steel Thread
&lt;/h2&gt;

&lt;p&gt;At Portia, Steel Thread is a core part of how we build and validate agentic systems. We use it to continuously measure reliability, correctness, and regressions across different layers of our platform. A few examples:&lt;/p&gt;

&lt;h3&gt;
  
  
  End-to-End Use Case Testing
&lt;/h3&gt;

&lt;p&gt;We write end-to-end evals for common user journeys, whether that's reviewing a customer case, escalating a risk, or generating a compliance summary. These tests help ensure that agents behave as expected across realistic, high-level workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent Comparison and A/B Testing
&lt;/h3&gt;

&lt;p&gt;We frequently use our evals to have data-driven discussions about feature changes. For example, does introducing a new section of context into the planning prompt affect the performance of our existing evals?&lt;/p&gt;

&lt;h3&gt;
  
  
  Low-Level Tool Call Verification
&lt;/h3&gt;

&lt;p&gt;We use low-level execution or planning evals to verify that tools are being called with the correct parameters. This is especially useful when agent behavior is compositional or dynamically generated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stress-Testing Planning as Tool Count Grows
&lt;/h3&gt;

&lt;p&gt;As our library of tools grows, planning gets harder. We use evals to track how well our agents plan under increasing complexity, including how accurate, efficient, and deterministic they remain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why All This Matters
&lt;/h2&gt;

&lt;p&gt;It’s easy to demo something impressive with LLMs. But to deploy an agent in a real-world system with users, data, and risk, you need confidence that it will behave consistently and safely.&lt;/p&gt;

&lt;p&gt;That’s what evals give you: a feedback loop that allows you to iteratively improve the accuracy and reliability of your agents. You wouldn’t ship software to production without tests, and you shouldn’t ship agents to production without evals.&lt;/p&gt;

&lt;p&gt;As agents become more complex, the teams that win will be the ones who can reason about behavior, not just improvise it. We hope Steel Thread helps you do that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Access
&lt;/h2&gt;

&lt;p&gt;Steel Thread is available today to our design partners - get in contact with us at &lt;a href="mailto:hello@portialabs.ai"&gt;hello@portialabs.ai&lt;/a&gt; if you're interested!&lt;/p&gt;

&lt;h2&gt;
  
  
  Join the conversation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Like this article?&lt;/strong&gt; – &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Give us a ⭐ on GitHub&lt;/a&gt;. It really helps!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse our website and try our (modest) playground at &lt;a href="http://www.portialabs.ai/" rel="noopener noreferrer"&gt;www.portialabs.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Head over to our docs at &lt;a href="https://docs.portialabs.ai/" rel="noopener noreferrer"&gt;docs.portialabs.ai&lt;/a&gt; or get immersed in our &lt;a href="https://github.com/portiaAI/portia-sdk-python/tree/main/portia/open_source_tools" rel="noopener noreferrer"&gt;SDK&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join the conversation on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Watch us embarrass ourselves on our &lt;a href="https://www.youtube.com/@PortiaAI" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://www.producthunt.com/posts/portia-ai" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>portia</category>
      <category>evals</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Design Highlight: Handling data at scale with Portia multi-agent systems</title>
      <dc:creator>Portia AI</dc:creator>
      <pubDate>Thu, 22 May 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/portia-ai/design-highlight-handling-data-at-scale-with-portia-multi-agent-systems-7nb</link>
      <guid>https://dev.to/portia-ai/design-highlight-handling-data-at-scale-with-portia-multi-agent-systems-7nb</guid>
      <description>&lt;p&gt;At Portia, we love building in public. Our &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;agent framework is open-source&lt;/a&gt; and we want to involve our community in key design decisions. Recently, we’ve been focussing on improving how agents handle production data at scale in Portia. This has sparked some exciting design discussions that we wanted to share in this blog post. If you find these discussions interesting, we’d love you to be involved in future discussions! Just get in contact (details in block below) - we can’t wait to hear from you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Calling All Devs
&lt;/h3&gt;

&lt;p&gt;We’d love to hear from you on the design decisions we’re making 💪 Check out the &lt;a href="https://github.com/portiaAI/portia-sdk-python/discussions/449" rel="noopener noreferrer"&gt;discussion thread&lt;/a&gt; for this blog post to have your say. If you want to join our wider community too (or just fancy saying hi!), head on over to our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;discord&lt;/a&gt;, our &lt;a href="https://www.reddit.com/r/PortiaAI/" rel="noopener noreferrer"&gt;reddit community&lt;/a&gt;, or our &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;repo on GitHub&lt;/a&gt; (Give us a ⭐ while you’re there!).&lt;/p&gt;

&lt;p&gt;If you’re new to Portia, we’re building a multi-agent framework that’s designed to enable people to run agents reliably in production. Efficiently handling large and complex data sources is one of the key aspects of this, along with agent permissions, observability and reliability. We’ve seen numerous agent prototypes that work well on small datasets in restricted scenarios, but then start to fall over when faced with the scale and complexity of production data. We want to make sure this doesn’t happen when agents are built with Portia. In this blog post, we’ll explore the design decisions we’ve made to enable this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real agents handling data at scale&lt;a href="https://blog.portialabs.ai/multi-agent-data-at-scale#real-agents-handling-data-at-scale" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;As with all good design discussions, we work backwards from real-life use-cases that we’re looking to enable or improve. We’re working with many agent builders, and below is a selection of the exciting use-cases we’ve seen that require efficiently processing large data sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A debugging agent that can process many large server log files along with other debug information to diagnose issues.&lt;/li&gt;
&lt;li&gt;A research agent that can process many documents and search results over time as it conducts research into a particular company or person.&lt;/li&gt;
&lt;li&gt;A finance assistant capable of researching over a company’s financial data in a mixture of sheets and docs to answer questions - for example, “from this week’s sales data, identify the top 3 selling products and how their sales are split by geography”.&lt;/li&gt;
&lt;li&gt;A personal assistant capable of having long-running interactions with the user, including taking actions such as scheduling events and sending emails, adapting to their preferences over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In order to handle each of these use-cases well, our agents need to handle data correctly across complex, multi-step plans without being thrown by large documents or making repetitive mistakes. However, we were finding that these agent builders were hitting a couple of key issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plan run state overload:&lt;/strong&gt; Our execution agent has access to the full state of the plan run. When it runs a tool, it stores the output of the tool run into that state for future use. Over time, though, if tools produce large amounts of data, this state can become very large and congested. Because the full state is passed into the LLM, this reduces the accuracy with which the execution agent can retrieve the correct information for each step from the plan run state:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; A debugging agent might download the logs from 10 different servers and analyse each of them. It might then move on to another task, but the logs from each of those 10 servers would still be in its plan run state, distracting from other useful information when processing future steps.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Tool calling with large inputs:&lt;/strong&gt; Our execution agent calls a language model to produce the arguments for calling each tool. However, when we wanted to call a tool with a large argument (e.g. &amp;gt;1k tokens), we would either hit the output token limits of the model or hit latencies that made the system incredibly slow.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; A finance agent might want to read in a large spreadsheet and then pass its contents into a processing tool to extract the key data it needs. We saw occasions where just generating the args for the processing tool took more than 5 minutes because it needed to print out the full contents of the spreadsheet!&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;We needed to fix these two issues, so our agent builders could stop wrestling with context windows and focus on shipping features.&lt;/p&gt;

&lt;h2&gt;
  
  
  An aside on long-context models&lt;a href="https://blog.portialabs.ai/multi-agent-data-at-scale#an-aside-on-long-context-models" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Before diving into potential solutions, let’s explain why we think this is a problem worth solving even with the vast context windows of the latest models (e.g. &lt;a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/" rel="noopener noreferrer"&gt;Llama 4 has a 10M token context window&lt;/a&gt; while &lt;a href="https://openai.com/index/gpt-4-1/" rel="noopener noreferrer"&gt;GPT-4.1 has a 1M token context window&lt;/a&gt;). These models have certainly changed the equation - before their arrival, we hit context window limits a lot more often than we do now. However, using these models with large data sources is still difficult and problematic for multiple reasons:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;SOTA models boast strong accuracy scores in needle-in-a-haystack tests, but &lt;em&gt;real&lt;/em&gt; scenarios are more complex, requiring reasoning over and connecting different pieces of information in the context, and models get much weaker at this when the context is large.&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filling GPT-4.1’s context window will cost you $2 of processing for every LLM call. Agentic systems typically make many LLM calls, so $2 per call can quickly make your system very expensive!&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;As the token number increases, so does the latency, particularly for output tokens. &lt;a href="https://platform.openai.com/docs/guides/latency-optimization/3-use-fewer-input-tokens#generate-fewer-tokens" rel="noopener noreferrer"&gt;OpenAI states&lt;/a&gt; that while doubling input tokens increases latency 1-5%, doubling output tokens &lt;em&gt;doubles&lt;/em&gt; output latency. When compounded with the fact that agentic systems make many LLM calls, this can make systems very sluggish.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure Modes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Interestingly, language models fail in different ways when the context length gets large. There’s a great study on this from &lt;a href="https://www.databricks.com/blog/long-context-rag-performance-llms" rel="noopener noreferrer"&gt;Databricks&lt;/a&gt;. This adds instability to the system, because the prompts you’ve been iterating on in low-data scenarios suddenly don’t work as you expected in production.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
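&lt;p&gt;To make the cost row concrete, here’s a back-of-the-envelope calculation, assuming GPT-4.1’s published price of $2.00 per 1M input tokens (do check current pricing before relying on this):&lt;/p&gt;

```python
# Back-of-the-envelope check on the cost claim above, assuming $2.00 per
# 1M input tokens for GPT-4.1 (an assumption - verify against current pricing).
PRICE_PER_INPUT_TOKEN = 2.00 / 1_000_000  # dollars per token
context_window = 1_000_000                # GPT-4.1's context window, in tokens

cost_per_call = context_window * PRICE_PER_INPUT_TOKEN
print(f"${cost_per_call:.2f} per fully-loaded call")

# An agentic run that makes 20 such calls:
print(f"${20 * cost_per_call:.2f} per run")
```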

&lt;h2&gt;
  
  
  Preventing our plan run state becoming overloaded&lt;a href="https://blog.portialabs.ai/multi-agent-data-at-scale#preventing-our-plan-run-state-becoming-overloaded" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Given the above, we can’t just rely on long-context models and need to address plan run state overload within our framework. To solve this, we needed to reduce the size of the context used by our execution agent, which we did as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we introduced agent memory. For large outputs (by default, anything above 1k tokens), we store the full value in agent memory with just a reference in the plan run state. This prevents previous large outputs from clogging up the plan run state when they are no longer needed.

&lt;ul&gt;
&lt;li&gt;You can configure where your agent stores memories through our &lt;code&gt;storage_class&lt;/code&gt; configuration option (see our &lt;a href="https://docs.portialabs.ai/manage-config#manage-storage-options" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for more details). If you choose Portia cloud, you’ll be able to view the memories in our dashboard:&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflaesuf76t874rk0u9ks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflaesuf76t874rk0u9ks.png" alt="A screenshot of the Portia dashboard, showing memories from previous plan runs." width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our planner selects inputs for each step of our plan. If one of these inputs is in agent memory, we fetch the value from memory, as we know it is specifically needed for this step. This allows the execution agent to fully utilise the large values in agent memory when needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2obsl1fnu4m1thc93652.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2obsl1fnu4m1thc93652.png" alt="A sequence diagram, showing how agent memory fills up over multiple steps" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can check out the code for this feature &lt;a href="https://github.com/portiaAI/portia-sdk-python/pull/319" rel="noopener noreferrer"&gt;in this PR&lt;/a&gt; and the docs are &lt;a href="https://docs.portialabs.ai/agent-memory" rel="noopener noreferrer"&gt;here&lt;/a&gt;. For our first implementation of agent memory, we decided to only allow pulling the full value from agent memory, rather than indexing the values in a vector database (or another form of database) and allowing queries against that index. A key reason for this (as well as wanting to keep our initial implementation as simple as possible) is that the way memories need to be queried is very task-dependent. Sometimes a semantic similarity search over memories is required (e.g. a debugging agent looking for similar errors among log files); other times you need filtering on exact values (e.g. a debugging agent looking for logs between two timestamps from a particular service), a projection of the values (e.g. a finance assistant taking just a few columns from a spreadsheet), or access to the full value.&lt;/p&gt;
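&lt;p&gt;The reference-based memory described above can be sketched in a few lines. This is a minimal illustration, not the Portia SDK’s actual classes - see the PR and docs linked above for the real implementation:&lt;/p&gt;

```python
# Minimal sketch of threshold-based memory offloading: outputs over a token
# threshold go to a memory backend, and the plan run state keeps a reference.
# All names here are illustrative, not the Portia SDK's API.
THRESHOLD_TOKENS = 1_000  # Portia's default cut-off for offloading

def rough_token_count(text):
    # Crude heuristic: roughly 4 characters per token.
    return len(text) // 4

class AgentMemory:
    def __init__(self):
        self._store = {}

    def save(self, step_name, value):
        self._store[step_name] = value
        return {"memory_ref": step_name}  # lightweight reference

    def load(self, ref):
        return self._store[ref["memory_ref"]]

def record_step_output(state, memory, step_name, output):
    """Keep small outputs inline in the plan run state; offload large ones."""
    if rough_token_count(output) > THRESHOLD_TOKENS:
        state[step_name] = memory.save(step_name, output)
    else:
        state[step_name] = output

state, memory = {}, AgentMemory()
record_step_output(state, memory, "server_logs", "ERROR timeout\n" * 5_000)
record_step_output(state, memory, "summary", "3 servers timed out")
# The state now holds only a reference for the large value:
print(state["server_logs"])  # a memory reference, not the raw logs
```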

&lt;h2&gt;
  
  
  Our future vision - a memory agent&lt;a href="https://blog.portialabs.ai/multi-agent-data-at-scale#our-future-vision---a-memory-agent" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Ultimately, we’ll want to support all these patterns, but doing this efficiently requires indexing and querying the memories intelligently based on the task. We believe that this will be best done by a separate memory agent - an agent within our multi-agent system that indexes and queries agent memories so that the required pieces can be retrieved for the task and passed to the execution agent. This clearly adds complexity to the system though! So we wanted to see how our agent builders use agent memory before jumping to conclusions on the best way to index and query the memories.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tell us what you think!
&lt;/h3&gt;

&lt;p&gt;We’d love to hear what you thought of the decision not to automatically ingest agent memories into a vector database. Is it something you’d like to see? Get involved in the &lt;a href="https://github.com/portiaAI/portia-sdk-python/discussions/449" rel="noopener noreferrer"&gt;GitHub discussion&lt;/a&gt; and let us know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solving Tool calling with large inputs&lt;a href="https://blog.portialabs.ai/multi-agent-data-at-scale#solving-tool-calling-with-large-inputs" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Our introduction of agent memory meant that the execution agent was managing the context it sent to the language model much better. However, we were still facing the issue mentioned above, where the language model struggled to produce large arguments for tools when needed. To solve this, we gave the language model the ability to use templates for input variables. When calling the language model, the execution agent outlines in the prompt that, if the model simply wants to use a value from agent memory verbatim, it doesn’t have to copy the value out - it can just write, for example, &lt;code&gt;{{$large_memory_value}}&lt;/code&gt;. We then extended the execution agent to retrieve &lt;code&gt;$large_memory_value&lt;/code&gt; from agent memory when this happens and template the value in, so that the tool receives the full value.&lt;/p&gt;
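&lt;p&gt;A minimal sketch of this templating step, with an illustrative regex and names rather than the SDK’s actual implementation:&lt;/p&gt;

```python
# Sketch of the templating trick: the model emits {{$name}} placeholders
# instead of copying large values, and the execution agent substitutes the
# real values from memory before calling the tool. Illustrative only.
import re

PLACEHOLDER = re.compile(r"\{\{\$(\w+)\}\}")

def resolve_args(args, memory):
    """Replace {{$key}} placeholders in string args with values from memory."""
    resolved = {}
    for name, value in args.items():
        if isinstance(value, str):
            value = PLACEHOLDER.sub(lambda m: str(memory[m.group(1)]), value)
        resolved[name] = value
    return resolved

memory = {"large_memory_value": "row1,row2,row3"}  # stand-in for agent memory
llm_args = {"data": "{{$large_memory_value}}", "mode": "summarise"}

print(resolve_args(llm_args, memory))
```

The language model only ever generates the short placeholder, so output token counts (and hence latency and cost) stay small regardless of how large the memory value is.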

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjt1yr5rej1zcr2u97bb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjt1yr5rej1zcr2u97bb.png" alt="A diagram showing how templating works when providing large memory values to tools." width="724" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can check out the code for this feature in this PR: &lt;a href="https://github.com/portiaAI/portia-sdk-python/pull/369" rel="noopener noreferrer"&gt;Add ability for execution agent to template large outputs&lt;/a&gt;. Interestingly, after a bit of initial tuning, we found that the language model was able to determine correctly whether it should template a variable or not. This has led to a massive improvement in the latency and cost of our agents when calling tools with large data sources. For example, a personal assistant use-case that involved analysing a large spreadsheet dropped from 3-5 minutes to &amp;lt;10s.&lt;/p&gt;

&lt;h3&gt;
  
  
  What do you think?
&lt;/h3&gt;

&lt;p&gt;What do you think of allowing language models to template variables rather than copy them out fully? Do you have a better approach for this? Get involved in the &lt;a href="https://github.com/portiaAI/portia-sdk-python/discussions/449" rel="noopener noreferrer"&gt;GitHub discussion&lt;/a&gt; and let us know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going forwards&lt;a href="https://blog.portialabs.ai/multi-agent-data-at-scale#going-forwards" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;We believe this work sets a great foundation for building multi-agent systems with Portia that handle data at scale, and we’ve got an exciting roadmap of features to keep making this even better:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-ingest knowledge / memories: we want to allow kicking off our agents with knowledge and memories already loaded, rather than requiring the agent to fetch all the information needed as part of the run.&lt;/li&gt;
&lt;li&gt;Improved pagination handling: we want to allow our execution agent to use paginated APIs more efficiently.&lt;/li&gt;
&lt;li&gt;Memory agent: as mentioned above, we’re excited about the possibilities opened up by a separate memory agent. Once we’ve got a good idea of how agent memory is being used in its current form, we’d love to start discussions on how this new agent might fit into our system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re really looking forward to finding out what these new large data capabilities will unlock for agent builders working on Portia!&lt;/p&gt;

&lt;h3&gt;
  
  
  What do &lt;em&gt;you&lt;/em&gt; think?
&lt;/h3&gt;

&lt;p&gt;Hopefully you enjoyed this blog post. If you did (or even if you didn’t!), we’d love to hear from you. Do you agree with the design decisions we’re taking? Do you think we should take a different approach? If you’ve got thoughts and ideas, we’d love to hear about them in the &lt;a href="https://github.com/portiaAI/portia-sdk-python/discussions/449" rel="noopener noreferrer"&gt;GitHub discussion&lt;/a&gt; associated with this post. And we love chatting about code even more, so if you’ve got an idea, fork our repo and we’d love to review the code 🚀&lt;/p&gt;

&lt;h2&gt;
  
  
  Join the conversation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Like this article?&lt;/strong&gt; – &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Give us a ⭐ on GitHub&lt;/a&gt;. It really helps!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse our website and try our (modest) playground at &lt;a href="http://www.portialabs.ai/" rel="noopener noreferrer"&gt;www.portialabs.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Head over to our docs at &lt;a href="https://docs.portialabs.ai/" rel="noopener noreferrer"&gt;docs.portialabs.ai&lt;/a&gt; or get immersed in our &lt;a href="https://github.com/portiaAI/portia-sdk-python/tree/main/portia/open_source_tools" rel="noopener noreferrer"&gt;SDK&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join the conversation on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Watch us embarrass ourselves on our &lt;a href="https://www.youtube.com/@PortiaAI" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://www.producthunt.com/posts/portia-ai" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>portia</category>
      <category>memory</category>
      <category>design</category>
    </item>
  </channel>
</rss>
