DEV Community: Robbie Heywood

Building agents with Controlled Autonomy using our new PlanBuilder interface

Robbie Heywood — Wed, 10 Sep 2025 14:23:42 +0000

Balancing autonomy and reliability is a key challenge for teams building agents (and getting it right is notoriously difficult!). At Portia, we’ve built many production-ready agents with our design partners, and today we’re excited to share our solution: Controlled Autonomy. Controlled autonomy is the ability to control the level of autonomy of an agent at each step of an agentic plan. We implement this using our newly reshaped PlanBuilder interface for building agentic systems, which we’re releasing into our open-source SDK today. We believe it’s a simple, elegant interface (without the boilerplate of many agentic frameworks) and the best way to create powerful and reliable agentic systems - we can’t wait to see what you build with it!

If you’re building agents, we’d love to hear from you! Check out our open-source SDK and let us know what you’re building on Discord. We also love to see people getting involved with contributions in the repo - if you’d like to get started with this, check out our open issues and let us know if you’d like to take one on.

Straight into an example

Our PlanBuilder interface is designed to feel intuitive and we find agents built with it are easy to follow, so let’s dive straight into an example:

from portia import Input, PlanBuilderV2, StepOutput

plan = (
    PlanBuilderV2("Run this plan to process a refund request.")
    .input(name="refund_info", description="Info of the customer refund request")
    .invoke_tool_step(
        step_name="read_refund_policy",
        tool="file_reader_tool",
        args={"filename": "./refund_policy.txt"},
    )
    .single_tool_agent_step(
        step_name="read_refund_request",
        task=f"Find the refund request email using this refund info: {Input('refund_info')}",
        tool="portia:google:gmail:search_email",
    )
    .llm_step(
        step_name="llm_refund_review",
        task="Review the refund request against the refund policy. "
             "Decide if the refund should be approved or rejected. "
             "Return the decision in the format: 'APPROVED' or 'REJECTED'.",
        inputs=[StepOutput("read_refund_policy"), StepOutput("read_refund_request")],
        output_schema=RefundDecision,
    )
    .function_step(
        function=record_refund_decision,
        args={"refund_decision": StepOutput("llm_refund_review")})
    .react_agent_step(
        task="Find the payment that the customer would like refunded.",
        tools=["portia:mcp:mcp.stripe.com:list_customers", "portia:mcp:mcp.stripe.com:list_payment_intents"],
        inputs=[StepOutput("read_refund_request")],
    )
    # Full example includes more steps to actually process the refund etc.
    .build()
)

The above is a modified extract from our Stripe refund agent (full example here), setting up an agent that acts as follows:

Read in our company’s refund policy: this uses a simple invoke_tool_step, which means that the tool is directly invoked with the args specified with no LLM involvement. These steps are great when you need to use a tool (often to retrieve data) but don’t need the flexibility of an LLM to call the tool because the args you want to use are fixed (this generally makes them very fast too!).
Read in the refund request from an email: for this step, we want to flexibly find the email in the inbox based on the refund info that is passed into the agent. To do this, we use a single_tool_agent_step, which is an LLM that calls a single tool once in order to achieve its task. In this case, the agent creates the inbox search query based on the refund info passed in to find the refund email.
Judge the refund request against the refund policy: the llm_step is relatively self-explanatory here - it uses your configured LLM to judge whether we should provide the refund based on the request and the policy. We use the StepOutput object to feed in the results from the previous steps, and the output_schema field allows us to return the decision as a Pydantic object rather than as text.
Record the refund decision: we have a Python function we use to record the decisions made - we can call this easily with a function_step, which allows directly calling Python functions as part of the plan run.
Find the payment in Stripe: finding a payment in Stripe requires using several tools from Stripe’s remote MCP server (which is easily enabled in your Portia account). Therefore, we set up a ReAct agent with the required tools and it can intelligently chain the required Stripe tools together in order to find the payment. As a bonus, Portia uses MCP Auth by default so these tool calls will be fully authenticated.
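
To make the role of StepOutput in the steps above more concrete, here is a toy, standard-library-only sketch of how named step outputs can be threaded into later steps. It is not Portia's implementation: Ref, ToyPlan and the review function are hypothetical stand-ins, and the "LLM" judgement is replaced by a fixed rule so the example is deterministic.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for StepOutput: names an earlier step's result."""

    step_name: str


@dataclass
class ToyPlan:
    """Toy runner (not Portia's): runs steps in order and resolves Ref arguments."""

    steps: list = field(default_factory=list)

    def step(
        self,
        step_name: str,
        function: Callable[..., Any],
        args: Optional[dict] = None,
    ) -> "ToyPlan":
        self.steps.append((step_name, function, args or {}))
        return self  # returning self is what makes the calls chainable

    def run(self) -> dict:
        outputs: dict = {}
        for name, function, args in self.steps:
            # Swap each Ref for the stored output of the step it names.
            resolved = {
                key: outputs[value.step_name] if isinstance(value, Ref) else value
                for key, value in args.items()
            }
            outputs[name] = function(**resolved)
        return outputs


def review(policy: str, request: str) -> str:
    # Stand-in for the LLM judgement: a fixed rule keeps the sketch deterministic.
    if "refund" in request.lower() and "30 days" in policy:
        return "APPROVED"
    return "REJECTED"


outputs = (
    ToyPlan()
    .step("read_refund_policy", lambda: "Refunds are allowed within 30 days of purchase.")
    .step("read_refund_request", lambda: "I bought this 10 days ago and would like a refund.")
    .step(
        "llm_refund_review",
        review,
        args={"policy": Ref("read_refund_policy"), "request": Ref("read_refund_request")},
    )
    .run()
)
print(outputs["llm_refund_review"])  # APPROVED
```

The point of the sketch is only the wiring: each step stores its result under its step name, and later steps ask for earlier results by name rather than by position.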

Controlled Autonomy

As demonstrated in the above example, the power of PlanBuilderV2 comes from the fact you can easily connect and combine different step types, depending on your situation and requirements. This allows you to control the amount of autonomy your system has at each point in its execution, with some steps (e.g. react_agent_step) making use of language models with high autonomy while others are carefully controlled and constrained (e.g. invoke_tool_step).

From our experience, it is this ‘controlled autonomy’ that is the key to getting agents to execute reliably, which allows us to move from exciting prototypes into real, production agents. Often, prototypes are built with ‘full autonomy’, giving something like a ReAct agent access to all tools and letting it loose on a task. This approach is possible with our plan builder and can work well in some situations, but in other situations (particularly for more complex tasks) it can lead to agents that are unreliable. We’ve found that tasks often need to be broken down and structured into manageable sub-tasks, with the autonomy for each sub-task controlled, for them to be done reliably. For example, we often see research and retrieval steps in a system being done with high autonomy ReAct agent steps because they generally use read-only tools that don’t affect other systems. Then, when it comes to the agent taking actions, these steps are done with zero or low autonomy so they can be done in a more controlled manner.

Simple Control structures

Extending the above example, our PlanBuilderV2 also provides familiar control structures that you can use when breaking down tasks for your agentic system. This gives you full control to ensure that the task is approached in a reliable way:

# Conditional steps (if, else if, else)
.if_(condition=lambda review: review.decision == REJECTED,
    args={"review": StepOutput("llm_refund_review")})
.function_step(
    function=handle_rejected_refund,
    args={"proposed_refund": StepOutput("proposed_refund")})
.endif()

# Loops - here we use .loop(over=...), but there are also alternatives for
#         .loop(while_=...) and .loop(do_while_=...)
.loop(over=StepOutput("Items"), step_name="Loop")
.function_step(
    function=lambda item: print(item),
    args={"item": StepOutput("Loop")})
.end_loop()

Fun fact: We went with .if_() rather than .if() (note the underscore) because if is a reserved keyword in Python, so it can’t be used as a method name.
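
The naming constraint behind the fun fact can be checked with the standard library alone. In this small sketch, parses is a hypothetical helper and builder is just a placeholder name that is parsed but never evaluated:

```python
import keyword


def parses(expression: str) -> bool:
    """Return True if `expression` is syntactically valid Python (names are not resolved)."""
    try:
        compile(expression, "<example>", "eval")
    except SyntaxError:
        return False
    return True


print(keyword.iskeyword("if"))           # True
print(keyword.iskeyword("if_"))          # False
print(parses("builder.if(condition)"))   # False
print(parses("builder.if_(condition)"))  # True
```

Appending a single underscore is the usual Python convention for names that would otherwise collide with a keyword.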

Human - Agent interface

Another aspect that is vital to getting an agent into production is the ability to seamlessly pass control between agents and humans. While we build trust in agentic systems, there are often key steps that require verification or input from humans. Our PlanBuilder interface allows both to be handled easily, using Portia’s clarification system:

# Ensure a human approves any refunds our agent gives out
builder.user_verify(
    message=f"Are you happy to proceed with the following proposed refund: {StepOutput('proposed_refund')}?")

# Allow your end user to provide input into how the agent runs
builder.user_input(
    message="How would you like your refund?",
    options=["Return to purchase card", "gift card"],
)
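
As a rough illustration of the verification pattern itself (not of Portia's clarification system), the sketch below gates an action behind an explicit human decision. verify_then_run and the scripted approve/decline callbacks are hypothetical names, used so the example runs without interactive input.

```python
from typing import Callable


def verify_then_run(
    message: str,
    ask_user: Callable[[str], bool],
    action: Callable[[], str],
) -> str:
    """Run `action` only if the human approves; otherwise stop before any side effect."""
    if not ask_user(message):
        return "cancelled by user"
    return action()


def approve(message: str) -> bool:
    # Scripted "human" that always says yes.
    return True


def decline(message: str) -> bool:
    # Scripted "human" that always says no.
    return False


question = "Are you happy to proceed with a refund of 20.00 for order 123?"
print(verify_then_run(question, approve, lambda: "refund issued"))  # refund issued
print(verify_then_run(question, decline, lambda: "refund issued"))  # cancelled by user
```

The important property is that the side-effecting action sits behind the check, so a declined request never reaches it.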

Controlling your agent with code

The function_step demonstrated earlier is a key addition to PlanBuilderV2. In many agentic systems, all tool and function calls go through a language model, which can be slow and can also reduce reliability. With function_step, the function is called with the provided args at that point in the chain with full reliability. We’ve seen several use cases for this:

Guardrails: where deterministic, reliable code checks are used to verify agent behaviour (see example below)
Data manipulation: when you want to do a simple data transformation in order to link tools together, but you don’t want to pay the latency penalty of an extra LLM call to do the transformation, you can instead do the transformation in code.
Plug in existing functions: when you’ve already got the functionality you need in code, you can use a function_step to easily plug that into your agent.

# Add a guardrail to prevent our agent giving out large refunds
builder.function_step(
    step_name="reject_payments_above_limit",
    function=reject_payments_above_limit,
    args={"proposed_refund": StepOutput("proposed_refund"), "limit": Input("payment_limit")})
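
The post does not show the body of reject_payments_above_limit, so the following is only a guess at what such a guardrail could look like. The ProposedRefund shape, the RefundLimitExceededError exception and the choice to raise (rather than return a flag) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedRefund:
    """Hypothetical shape for the output of the 'proposed_refund' step."""

    payment_id: str
    amount: float  # in the account's currency


class RefundLimitExceededError(Exception):
    """Hypothetical error raised when a refund is above the allowed limit."""


def reject_payments_above_limit(proposed_refund: ProposedRefund, limit: float) -> ProposedRefund:
    """Deterministic guardrail: a plain comparison, with no LLM involved."""
    if proposed_refund.amount > limit:
        raise RefundLimitExceededError(
            f"Refund of {proposed_refund.amount} for {proposed_refund.payment_id} "
            f"exceeds the limit of {limit}"
        )
    # Pass the refund through unchanged so later steps can keep using it.
    return proposed_refund


small = reject_payments_above_limit(ProposedRefund("pi_small", 20.0), limit=100.0)
print(small.amount)  # 20.0

try:
    reject_payments_above_limit(ProposedRefund("pi_large", 500.0), limit=100.0)
except RefundLimitExceededError as error:
    print(f"Blocked: {error}")
```

Because the check is ordinary code, it behaves the same way on every run, which is the property you want from a guardrail.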

What’s next?

We’ve really enjoyed building agents with PlanBuilderV2 and are excited to share it more widely. We find that it complements our planning agent nicely: our planning agent can be used to dynamically create plans from natural language when that is needed for your use-case, while the plan builder can be used if you want to more carefully control the steps your agentic system takes with code.

We’ve also got more features coming up over the next few weeks that will continue to make the plan builder interface even more powerful:

Parallelism: run steps in parallel with .parallel().
Automatic caching: add cache=True to steps to automatically cache results - this is a game-changer when you want to iterate on later steps in a plan without having to fully re-run the plan.
Step error handler: specify .on_error() after a step to attach an error handler to it, use .retry() to allow a step to be retried, or use .exit_step() to gracefully exit a plan.
Linked plans: link plans together by referring to outputs from previous plan runs.

plan = (
    PlanBuilderV2("Run this plan to process a refund request.")
    # 1. Run subsequent steps in parallel
    .parallel()
    .invoke_tool_step(
        tool="file_reader_tool",
        args={"filename": "./refund_policy.txt"},
        # 2. Add automatic caching to a step
        cache=True
    )
    # 3. Add error handling to a step
    .on_error()
    .react_agent_step(
        # 4. Link plans together by referring to outputs from a previous run
        # Here, we could have a previous agent that determines which customer refunds to process
        task=f"Read the refund request from my inbox from {PlanRunOutput(previous_run)}.",
        tools=["portia:google:gmail:search_email"],
    )
    # Resume series execution
    .series()
)

Shout out to gaurava05 for adding ExitStep as an open-source contribution in this PR.

So give our new PlanBuilder a try and let us know how you get on - we can’t wait to see what you build! 🚀

For more details on PlanBuilderV2, check out our docs, our example plan or the full Stripe refund example. You can also join our Discord to hear future updates.

Portia AI: Initial Thoughts on GPT-5

Robbie Heywood — Mon, 11 Aug 2025 14:57:35 +0000

At Portia AI, we’ve been playing around with GPT-5 since it was released a few days ago and we’re excited to announce it will be available to SDK users in tomorrow’s SDK release 🎉

After playing with it for a bit, it definitely feels like an incremental improvement rather than a step-change (despite my LinkedIn feed being full of people pronouncing it ‘game-changing’!). To pick out some specific aspects:

Equivalent Accuracy: on our benchmarks, GPT-5’s performance is equal to the existing top model, so this is an incremental improvement (if any).
Handles complex tools: GPT-5 is definitely keener to use tools. We’re still playing around with this, but it does seem like it can handle (and prefers) broader, more complex tools. This is exciting - it should make it easier to build more powerful agents, but also means a re-think of the tools you’re using.
Slow: With the default parameters, the model is seriously slow - generally 5-10x slower across each of our benchmarks. This makes tuning the new reasoning_effort and verbosity parameters important.
I actually miss the model picker! With the model picker gone, you’re left to rely on the fuzzier world of natural language (and the new reasoning_effort and verbosity parameters) to control the model. This is tricky enough that OpenAI have released a new prompt guide and prompt optimiser. I think there will be real changes when there are models that you don’t feel you need to control in this way - but GPT-5 isn’t there yet.
Solid pricing: While it is a little more token-hungry (10-20% more tokens on our benchmarks), at half the price of GPT-4o / 4.1 / o3, it is a good price for the level of intelligence (a great article on this from Latent Space).
Reasonable context window: At 256k tokens, the context window is fine - but we’ve had several use-cases that use GPT-4.1 / Gemini’s 1m token windows, so we’d been hoping for more...
Coding: In Cursor, I’ve found GPT-5 a bit difficult to work with - it’s slow and often over-thinks problems. I’ve moved back to claude-4, though I do use GPT-5 when looking to one-shot something rather than working with the model.
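
To put the pricing point above into numbers, here is a back-of-the-envelope calculation that takes the post's figures at face value. It assumes the 10-20% token increase and the halved price both apply uniformly to the whole bill, which is a simplification; relative_cost is a hypothetical helper.

```python
def relative_cost(token_multiplier: float, price_multiplier: float) -> float:
    """Cost of a task relative to the baseline model (1.0 means the same cost)."""
    return token_multiplier * price_multiplier


# 10-20% more tokens at half the per-token price.
low = relative_cost(token_multiplier=1.10, price_multiplier=0.5)
high = relative_cost(token_multiplier=1.20, price_multiplier=0.5)
print(f"{low:.2f} to {high:.2f} of the baseline cost")  # 0.55 to 0.60 of the baseline cost
```

Under those assumptions, the extra tokens eat into the saving only slightly: the same task still costs a little over half as much.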

There are also two aspects that we haven’t dug into yet, but I’m really looking forward to putting them through their paces:

Tool Preambles: GPT-5 has been trained to give progress updates in ‘tool preamble’ messages. It’s often really important to keep the user informed as an agent progresses, which can be difficult if the model is being used as a black box. I haven’t seen much talk about this as a feature, but I think it has the potential to be incredibly useful for agent builders.
Replanning: In the past, we’ve got ourselves stuck in loops (particularly with OpenAI models) where the model keeps trying the same thing even when it doesn’t work. GPT-5 is supposed to handle these cases that require a replan much better - it’ll be interesting to dive into this more and see if that’s the case.

In summary, this is still an incremental improvement (if any). It’s sad to see it still can’t count the letters in various fruit and I’m still mostly using claude-4 in Cursor.

How I Built an AI Agent That Turns Daily AI News Into a Commute-Sized Podcast

Robbie Heywood — Fri, 01 Aug 2025 14:54:27 +0000

The AI landscape moves at breakneck speed. New models, research papers, funding announcements, and product launches happen daily. As someone working in AI, staying current isn't just helpful—it's essential. But when you're heads-down building features and shipping products, it's tough to find the time to stay on top of all the latest developments.

That's exactly the challenge we faced at Portia AI. The solution? An AI agent that helps us make the most of the 5-minute stroll our team makes each afternoon to Kings Cross on their way home.

I’m sure Harry would have spent his commute back from Kings Cross listening to our AI podcast too...

Building AI News Into a Routine

Working in AI means being subscribed to information from multiple sources. The traditional approach of manually checking news sites, Reddit, Twitter, and newsletters was tedious and time-consuming, while important developments could take time to circulate through the team.

During one of our regular work hack sessions, inspired by NotebookLM's podcast feature, I decided to tackle this problem by building an AI agent that creates daily short AI news podcasts. Here's how it works:

Subscribes to multiple AI news sources throughout the day
Identifies the most significant developments
Synthesizes the information into a concise narrative
Generates a 2-3 minute podcast episode using the fantastic Podcastfy library
Provides curated links for deeper investigation
Shares the podcast and links on Slack and Discord

We run the agent in the afternoon, so the podcast is available before people's evening commute. This timing allows people to easily integrate the updates into their daily routine. We've also found that the curated links are particularly valuable when there's a topic that's especially relevant to someone, allowing them to dig deeper into the details.

Getting Involved

We know that staying abreast of the latest developments is a difficulty lots of teams face, so we've made these news snippets available on our public Discord server. Come and check it out if it sounds like something that could be useful.

The code is open sourced in our agent examples repo if you're keen to see exactly how it works or build something similar for your own team. I think it’s a nice example of how Portia’s open-source agent SDK makes agents incredibly easy to build. With the agent framework handling much of the complex orchestration between services and APIs, the code ends up being not much more than:

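from portia import Config, DefaultToolRegistry, Portia

config = Config.from_default()
plan_prompt = "<Task specification>"
# PodcastTool is the custom tool defined in the agent examples repo
tools = DefaultToolRegistry(config) + [PodcastTool()]
portia = Portia(config=config, tools=tools)
portia.run(plan_prompt)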

Hope this helps others stay on top of the fast-moving AI world! Enjoy!