<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Portia AI</title>
    <description>The latest articles on DEV Community by Portia AI (@portia_ai_mark).</description>
    <link>https://dev.to/portia_ai_mark</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3141027%2Fc939face-9418-4df6-8785-47a80b6bf01b.png</url>
      <title>DEV Community: Portia AI</title>
      <link>https://dev.to/portia_ai_mark</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/portia_ai_mark"/>
    <language>en</language>
    <item>
      <title>Design Highlight: Handling data at scale with Portia multi-agent systems</title>
      <dc:creator>Portia AI</dc:creator>
      <pubDate>Thu, 22 May 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/portia-ai/design-highlight-handling-data-at-scale-with-portia-multi-agent-systems-7nb</link>
      <guid>https://dev.to/portia-ai/design-highlight-handling-data-at-scale-with-portia-multi-agent-systems-7nb</guid>
<description>&lt;p&gt;At Portia, we love building in public. Our &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;agent framework is open-source&lt;/a&gt; and we want to involve our community in key design decisions. Recently, we’ve been focusing on improving how agents handle production data at scale in Portia. This has sparked some exciting design discussions that we wanted to share in this blog post. If you find them interesting, we’d love you to be involved in future discussions - just get in contact (details in the block below). We can’t wait to hear from you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Calling All Devs
&lt;/h3&gt;

&lt;p&gt;We’d love to hear from you on the design decisions we’re making 💪 Check out the &lt;a href="https://github.com/portiaAI/portia-sdk-python/discussions/449" rel="noopener noreferrer"&gt;discussion thread&lt;/a&gt; for this blog post to have your say. If you want to join our wider community too (or just fancy saying hi!), head on over to our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;, our &lt;a href="https://www.reddit.com/r/PortiaAI/" rel="noopener noreferrer"&gt;Reddit community&lt;/a&gt;, or our &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;repo on GitHub&lt;/a&gt; (give us a ⭐ while you’re there!).&lt;/p&gt;

&lt;p&gt;If you’re new to Portia, we’re building a multi-agent framework that’s designed to enable people to run agents reliably in production. Efficiently handling large and complex data sources is one of the key aspects of this, along with agent permissions, observability and reliability. We’ve seen numerous agent prototypes that work well on small datasets in restricted scenarios, but then start to fall over when faced with the scale and complexity of production data. We want to make sure this doesn’t happen when agents are built with Portia. In this blog post, we’ll explore the design decisions we’ve made to enable this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real agents handling data at scale&lt;a href="https://blog.portialabs.ai/multi-agent-data-at-scale#real-agents-handling-data-at-scale" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;As with all good design discussions, we work backwards from real-life use-cases that we’re looking to enable or improve. We’re working with many agent builders, and below is a selection of the exciting use-cases we’ve seen that require efficiently processing large data sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A debugging agent that can process many large server log files along with other debug information to diagnose issues.&lt;/li&gt;
&lt;li&gt;A research agent that can process many documents and search results over time as it conducts research into a particular company or person.&lt;/li&gt;
&lt;li&gt;A finance assistant capable of researching across a company’s financial data in a mixture of sheets and docs to answer questions - for example, “from this week’s sales data, identify the top 3 selling products and how their sales are split by geography”.&lt;/li&gt;
&lt;li&gt;A personal assistant capable of having long-running interactions with the user, including taking actions such as scheduling events and sending emails, adapting to their preferences over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In order to handle each of these use-cases well, our agents need to handle data correctly across complex, multi-step plans without being thrown by large documents or making repetitive mistakes. However, we were finding that these agent builders were hitting a couple of key issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plan run state overload:&lt;/strong&gt; Our execution agent has access to the full state of the plan run. When it runs a tool, it stores the output of the tool run into that state for future use. Over time, though, if tools produced large amounts of data, this state could become very large and congested. Because the full state was passed into the LLM, the execution agent became less accurate at retrieving the correct information for each step from it:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; A debugging agent might download and analyse the logs from 10 different servers. It might then move on to another task, but the logs from each of those 10 servers would still be in its plan run state, distracting from other useful information when processing future steps.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Tool calling with large inputs:&lt;/strong&gt; Our execution agent calls a language model to produce the arguments for calling each tool. However, when we wanted to call a tool with a large argument (e.g. &amp;gt;1k tokens), we would either hit the output token limits of the model or we would hit latencies that would make the system incredibly slow.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; A finance agent might want to read in a large spreadsheet and then pass its contents into a processing tool to extract the key data it needs. We saw occasions where just generating the args for the processing tool took more than 5 minutes because it needed to print out the full contents of the spreadsheet!&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;We needed to fix these two issues, so our agent builders could stop wrestling with context windows and focus on shipping features.&lt;/p&gt;

&lt;h2&gt;
  
  
  An aside on long-context models&lt;a href="https://blog.portialabs.ai/multi-agent-data-at-scale#an-aside-on-long-context-models" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Before diving into potential solutions, let’s explain why we think this is a problem worth solving even with the vast context windows of the latest models (e.g. &lt;a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/" rel="noopener noreferrer"&gt;Llama 4 has a 10M-token context window&lt;/a&gt; while &lt;a href="https://openai.com/index/gpt-4-1/" rel="noopener noreferrer"&gt;GPT-4.1 has a 1M-token one&lt;/a&gt;). These models have certainly changed the equation - before their arrival, we hit context window limits far more often than we do now. However, using them with large data sources is still difficult for multiple reasons:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;SOTA models boast strong accuracy scores in needle-in-a-haystack tests, but &lt;em&gt;real&lt;/em&gt; scenarios are more complex, requiring reasoning over and connecting different pieces of information in the context, and models get much weaker at this when the context is large.&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Filling GPT-4.1’s context window costs around $2 of input processing for every LLM call. Agentic systems typically make many LLM calls, so this can quickly make your system very expensive!&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;As the token number increases, so does the latency, particularly for output tokens. &lt;a href="https://platform.openai.com/docs/guides/latency-optimization/3-use-fewer-input-tokens#generate-fewer-tokens" rel="noopener noreferrer"&gt;OpenAI states&lt;/a&gt; that while doubling input tokens increases latency 1-5%, doubling output tokens &lt;em&gt;doubles&lt;/em&gt; output latency. When compounded with the fact that agentic systems make many LLM calls, this can make systems very sluggish.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure Modes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Interestingly, language models fail in different ways when the context length gets large. There’s a great study on this from &lt;a href="https://www.databricks.com/blog/long-context-rag-performance-llms" rel="noopener noreferrer"&gt;Databricks&lt;/a&gt;. This adds instability to the system because the prompts you’ve been iterating on in low-data scenarios suddenly don’t work as you expected in production.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Preventing our plan run state becoming overloaded&lt;a href="https://blog.portialabs.ai/multi-agent-data-at-scale#preventing-our-plan-run-state-becoming-overloaded" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Given the above, we can’t just rely on long-context models and need to handle the issue with overloading our plan run state within our framework. To solve this, we needed to reduce the size of the context used by our execution agent, and we did this as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we introduced agent memory. For large outputs (by default, those above 1k tokens), we store the full value in agent memory and keep just a reference in the plan run state. This prevents previous large outputs from clogging up the plan run state when they are no longer needed.

&lt;ul&gt;
&lt;li&gt;You can configure where your agent stores memories through our &lt;code&gt;storage_class&lt;/code&gt; configuration option (see our &lt;a href="https://docs.portialabs.ai/manage-config#manage-storage-options" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for more details). If you choose Portia cloud, you’ll be able to view the memories in our dashboard:&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflaesuf76t874rk0u9ks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflaesuf76t874rk0u9ks.png" alt="A screenshot of the Portia dashboard, showing memories from previous plan runs." width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our planner selects inputs for each step of our plan. If one of these inputs is in agent memory, we fetch the value from memory, as we know it is specifically needed for this step. This allows the execution agent to fully utilise the large values in agent memory when needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2obsl1fnu4m1thc93652.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2obsl1fnu4m1thc93652.png" alt="A sequence diagram, showing how agent memory fills up over multiple steps" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can check out the code for this feature &lt;a href="https://github.com/portiaAI/portia-sdk-python/pull/319" rel="noopener noreferrer"&gt;in this PR&lt;/a&gt; and the docs are &lt;a href="https://docs.portialabs.ai/agent-memory" rel="noopener noreferrer"&gt;here&lt;/a&gt;. For our first implementation of agent memory, we decided to only allow pulling the full value from agent memory, rather than indexing the values in a vector database (or other form of database) and allowing queries based on that. A key reason for this (as well as wanting to keep our initial implementation as simple as possible) is that the way memories need to be queried is very task dependent. There are times when a semantic similarity search of memories is required (e.g. a debugging agent looking for similar errors among log files), while other times require filtering on exact values (e.g. a debugging agent looking for logs between two timestamps from a particular service), a projection of the values (e.g. a finance assistant just taking several columns from a spreadsheet) or access to the full value.&lt;/p&gt;

&lt;h2&gt;
  
  
  Our future vision - a memory agent&lt;a href="https://blog.portialabs.ai/multi-agent-data-at-scale#our-future-vision---a-memory-agent" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Ultimately, we’ll want to support all these patterns, but doing this efficiently requires indexing and querying the memories intelligently based on the task. We believe that this will be best done by a separate memory agent - an agent within our multi-agent system that indexes and queries agent memories so that the required pieces can be retrieved for the task and passed to the execution agent. This clearly adds complexity to the system though! So we wanted to see how our agent builders use agent memory before jumping to conclusions on the best way to index and query the memories.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tell us what you think!
&lt;/h3&gt;

&lt;p&gt;We’d love to hear what you thought of the decision to not automatically ingest agent memories into a vector database. Is it something you’d like to see? Get involved in the &lt;a href="https://github.com/portiaAI/portia-sdk-python/discussions/449" rel="noopener noreferrer"&gt;GitHub discussion&lt;/a&gt; and let us know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solving Tool calling with large inputs&lt;a href="https://blog.portialabs.ai/multi-agent-data-at-scale#solving-tool-calling-with-large-inputs" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Our introduction of agent memory meant that the execution agent was managing the context it sent to the language model much better. However, we were still facing the issue, mentioned above, where the language model struggled to produce large arguments for tools when needed. To solve this, we gave the language model the ability to use templates for its inputs. When calling the language model, the execution agent would outline in the prompt that, if the language model simply wanted to use a value from agent memory verbatim, it didn’t have to copy the value out - it could just write, for example, &lt;code&gt;{{$large_memory_value}}&lt;/code&gt;. We then extended the execution agent to retrieve &lt;code&gt;$large_memory_value&lt;/code&gt; from agent memory when this happened and template the value in, so that the tool received the full value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjt1yr5rej1zcr2u97bb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkjt1yr5rej1zcr2u97bb.png" alt="A diagram showing how templating works when providing large memory values to tools." width="724" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can check out the code for this feature in this PR: &lt;a href="https://github.com/portiaAI/portia-sdk-python/pull/369" rel="noopener noreferrer"&gt;Add ability for execution agent to template large outputs&lt;/a&gt;. Interestingly, after a bit of initial tuning, we found that the language model was able to determine correctly whether it should template a variable or not. This has led to a massive improvement in latency and cost of our agents calling tools with large data sources. For example, a personal assistant use-case that involved analysing a large spreadsheet reduced in time from 3-5 minutes to &amp;lt;10s.&lt;/p&gt;
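&lt;p&gt;The templating step can be sketched in a few lines (a hypothetical helper, not the SDK’s actual implementation):&lt;/p&gt;

```python
# Sketch of template resolution for tool arguments: if the model emits
# {{$name}}, the execution agent swaps in the full value from agent memory
# before calling the tool. Names and the memory dict are illustrative only.
import re

TEMPLATE_RE = re.compile(r"\{\{\$(\w+)\}\}")


def resolve_templates(args: dict[str, str], memory: dict[str, str]) -> dict[str, str]:
    """Replace {{$key}} placeholders in tool args with values from agent memory."""
    def substitute(value: str) -> str:
        return TEMPLATE_RE.sub(lambda m: memory[m.group(1)], value)

    return {name: substitute(value) for name, value in args.items()}


memory = {"large_spreadsheet": "product,units\nwidget,900\ngadget,650"}
# The LLM produced short, cheap args instead of copying the spreadsheet out:
llm_args = {"data": "{{$large_spreadsheet}}", "question": "top seller?"}
tool_args = resolve_templates(llm_args, memory)
print(tool_args["data"])  # the full spreadsheet, never re-generated by the LLM
```

&lt;p&gt;The cost and latency win comes from the model emitting a handful of placeholder tokens instead of regenerating the whole value as output tokens.&lt;/p&gt;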

&lt;h3&gt;
  
  
  What do you think?
&lt;/h3&gt;

&lt;p&gt;What do you think of allowing language models to template variables rather than copy them out fully? Do you have a better approach for this? Get involved in the &lt;a href="https://github.com/portiaAI/portia-sdk-python/discussions/449" rel="noopener noreferrer"&gt;GitHub discussion&lt;/a&gt; and let us know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going forwards&lt;a href="https://blog.portialabs.ai/multi-agent-data-at-scale#going-forwards" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;We believe this work sets a great foundation for building multi-agent systems with Portia that handle data at scale, and we’ve got an exciting roadmap of features to keep making this even better:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-ingest knowledge / memories: we want to allow kicking off our agents with knowledge and memories already loaded, rather than requiring the agent to fetch all the information needed as part of the run.&lt;/li&gt;
&lt;li&gt;Improved pagination handling: we want to allow our execution agent to use paginated APIs more efficiently.&lt;/li&gt;
&lt;li&gt;Memory agent: as mentioned above, we’re excited about the possibilities opened up by a separate memory agent. Once we’ve got a good idea of how agent memory is being used in its current form, we’d love to start discussions on how this new agent might fit into our system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re really looking forward to finding out what these new large data capabilities will unlock for agent builders working on Portia!&lt;/p&gt;

&lt;h3&gt;
  
  
  What do &lt;em&gt;you&lt;/em&gt; think?
&lt;/h3&gt;

&lt;p&gt;Hopefully you enjoyed this blog post. If you did (or even if you didn’t!), we’d love to hear from you. Do you agree with the design decisions we’re taking? Do you think we should take a different approach? If you’ve got thoughts and ideas, we’d love to hear about them in the &lt;a href="https://github.com/portiaAI/portia-sdk-python/discussions/449" rel="noopener noreferrer"&gt;GitHub discussion&lt;/a&gt; associated with this post. And we love chatting about code even more - so if you’ve got an idea, fork our repo and we’d love to review the code 🚀&lt;/p&gt;

&lt;h2&gt;
  
  
  Join the conversation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Like this article?&lt;/strong&gt; – &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Give us a ⭐ on GitHub&lt;/a&gt;. It really helps!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse our website and try our (modest) playground at &lt;a href="http://www.portialabs.ai/" rel="noopener noreferrer"&gt;www.portialabs.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Head over to our docs at &lt;a href="https://docs.portialabs.ai/" rel="noopener noreferrer"&gt;docs.portialabs.ai&lt;/a&gt; or get immersed in our &lt;a href="https://github.com/portiaAI/portia-sdk-python/tree/main/portia/open_source_tools" rel="noopener noreferrer"&gt;SDK&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join the conversation on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Watch us embarrass ourselves on our &lt;a href="https://www.youtube.com/@PortiaAI" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://www.producthunt.com/posts/portia-ai" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>portia</category>
      <category>memory</category>
      <category>design</category>
    </item>
    <item>
      <title>Beyond APIs: Software interfaces in the agent era</title>
      <dc:creator>Portia AI</dc:creator>
      <pubDate>Fri, 09 May 2025 16:40:31 +0000</pubDate>
      <link>https://dev.to/portia-ai/beyond-apis-software-interfaces-in-the-agent-era-1p2a</link>
      <guid>https://dev.to/portia-ai/beyond-apis-software-interfaces-in-the-agent-era-1p2a</guid>
<description>&lt;p&gt;For decades, APIs have been the standard for connecting software systems. Whether REST, gRPC, or GraphQL, APIs follow the same principle: well-structured interfaces that are defined ahead of time to expose data and functionality to third parties. But as AI agents start taking on more autonomous operations, this rigid model is limiting what they can do.&lt;/p&gt;

&lt;p&gt;APIs work well when requirements are known in advance, but agents often lack full context at the start. They explore, iterate and adapt based on their goals and real-time learning. Relying solely on predefined API calls can restrict an agent’s ability to interact dynamically with software.&lt;/p&gt;

&lt;p&gt;Like many in our industry, we have been grappling with the challenges of agent-to-software interfaces. We think the future of these interfaces will move beyond static APIs toward more flexible, expressive, and adaptive mechanisms. More on our thinking below - we’d love to hear your thoughts!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.portialabs.ai%2Fassets%2Fimages%2Flost_agent-f2fd515a47c258184a863f19f2e714d9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.portialabs.ai%2Fassets%2Fimages%2Flost_agent-f2fd515a47c258184a863f19f2e714d9.png" alt="Lost Agent" width="800" height="801"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The limitations of APIs for agents&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;APIs are designed for predictable, developer-driven interactions. The developer writes a request, expects a response, and handles errors explicitly. While this works well for traditional software integrations, it introduces several friction points when applied to autonomous agents operating in dynamic environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Rigid interfaces don’t work well with dynamic reasoning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;APIs define fixed contracts with specific endpoints, request formats, and outputs. But agents operate in a less deterministic way. Whilst an API structure may work well for one use case, another may be impossible given the set of endpoints and data the API exposes. For example, an API might provide a &lt;code&gt;fetch_customer_data(id)&lt;/code&gt; function, but an agent will likely not start with an ID - perhaps just a name or an email. This forces agents to reason about chaining multiple API calls for tasks that could be a single step.&lt;/p&gt;
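&lt;p&gt;To make the chaining overhead concrete, here is a toy illustration (the endpoints and data are invented):&lt;/p&gt;

```python
# Toy illustration of the rigid-contract problem: an agent holding only an
# email must plan a lookup step before it can use fetch_customer_data(id).
# All functions and data here are hypothetical stand-ins for real API calls.

CUSTOMERS = {"c42": {"id": "c42", "email": "ada@example.com", "plan": "pro"}}


def find_customer_id(email: str) -> str:
    """Step 1: the extra call forced by the fixed contract."""
    for customer in CUSTOMERS.values():
        if customer["email"] == email:
            return customer["id"]
    raise LookupError(email)


def fetch_customer_data(customer_id: str) -> dict:
    """Step 2: the call the agent actually wanted to make."""
    return CUSTOMERS[customer_id]


# An intent-level interface could collapse both steps into one request:
record = fetch_customer_data(find_customer_id("ada@example.com"))
print(record["plan"])  # -> "pro"
```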

&lt;h3&gt;
  
  
  &lt;strong&gt;2. API interfaces need to be integrated ahead of time&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To achieve good performance, APIs often need to be wrapped in a tool that is passed to the agent. This means work is needed ahead of time to integrate the API. Again, this limits what the agent can achieve: it is unable to discover and integrate the APIs it needs at runtime based on the task it has been asked to achieve.&lt;/p&gt;

&lt;p&gt;Even for coding agents, the requirements for authentication and documentation mean some work is needed to integrate the API ahead of time. Not to mention that in a production system your agent needs to handle a whole other set of software engineering concerns, for example pagination, caching and rate-limiting.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Handling errors and change management&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In traditional software, API failures are usually logged and errors are returned to the caller. But agents can dynamically react to errors and adjust their path toward the goal. This is a completely new paradigm and one of the advantages of agents, but existing APIs often don’t provide enough information or direction for the agent to plan how it will recover from an encountered error effectively.&lt;/p&gt;

&lt;p&gt;Likewise, having done all the work to integrate an API, your agent is tied to the current implementation. Any versioning changes or worse breaking changes will mean your shiny agent is now functionally broken. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What would good look like?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If APIs aren’t enough, what does an agent-first interface look like? Instead of rigid, predefined endpoints, future agent-to-software interfaces should be &lt;strong&gt;adaptive, declarative, and goal-oriented&lt;/strong&gt;. Rather than requiring agents to conform to static API contracts, the software itself should expose capabilities in a way that agents can &lt;strong&gt;reason about, compose, and execute dynamically&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Self-describing and discoverable interfaces&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Agents should not need hardcoded API specifications or to be wired up to them ahead of time. Instead, the third-party software should describe its available actions, parameters, and expected outputs in a form agents can discover. This could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema-based discovery&lt;/strong&gt; - Agents should be able to discover and call third-party systems dynamically. This means they should be able to connect, list available functionality and integrate with it, all at runtime.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution hints&lt;/strong&gt;  - Interfaces should provide execution hints beyond basic documentation. For example, “This action requires a valid session token” or “This function is expensive, use sparingly”.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained, agent-native authentication&lt;/strong&gt; - Most APIs’ authentication and authorization controls are designed around human users who sign up ahead of time. Future software-to-agent interfaces should allow an agent to securely sign up and have sensible authorization controls for what it can do. We’ve explored how we think authentication might evolve &lt;a href="https://blog.portialabs.ai/agent-auth-part-II" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
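&lt;p&gt;As a sketch of what such a self-describing interface might expose (an invented schema, not a real standard):&lt;/p&gt;

```python
# Invented capability schema combining the three ideas above: discoverable
# actions, execution hints, and machine-readable parameter descriptions.
# Nothing here is a real protocol; it illustrates the shape of the idea.

CAPABILITIES = {
    "service": "blog-platform",
    "actions": [
        {
            "name": "get_posts_since",
            "params": {"timestamp": "ISO-8601 string"},
            "returns": "list of post objects",
            "hints": ["requires a valid session token"],
        },
        {
            "name": "rebuild_search_index",
            "params": {},
            "returns": "job id",
            "hints": ["expensive, use sparingly"],
        },
    ],
}


def discover(capabilities: dict) -> list[str]:
    """What an agent would do first at runtime: list what the service can do."""
    return [action["name"] for action in capabilities["actions"]]


print(discover(CAPABILITIES))  # -> ['get_posts_since', 'rebuild_search_index']
```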

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Flexible, goal-oriented invocation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Today, callers of an API need to work out which set of APIs to call to achieve their goal. In the future, agents should be able to express intent and let the system handle how to achieve it. This would mean software that exposes declarative interfaces where the agent specifies &lt;em&gt;what&lt;/em&gt; it wants to achieve, and the system determines &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Robust error handling and introspection&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Instead of returning opaque error codes, software should provide rich, structured feedback that agents can reason about. This could include execution hints that nudge agents towards other approaches (“This API requires X; you can get it by calling API Y”) or clear guidance if the error can’t be worked around. It would also likely involve providing additional context in the error response so that, if a human needs to be included in the resolution, simple steps can be provided to them.&lt;/p&gt;
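&lt;p&gt;A sketch of what such structured, agent-readable feedback might look like (the format is invented for illustration):&lt;/p&gt;

```python
# Invented structured-error format: instead of an opaque code, the response
# tells the agent whether it can recover, how, and what to tell a human.

error_response = {
    "error": "missing_session_token",
    "recoverable": True,
    "recovery_hint": "This API requires a session token; obtain one via POST /sessions",
    "human_steps": "Ask an admin to enable API access for this agent account.",
}


def plan_recovery(error: dict) -> str:
    """An agent branches on structured fields rather than parsing a message."""
    if error["recoverable"]:
        return f"retry after: {error['recovery_hint']}"
    return f"escalate to human: {error['human_steps']}"


print(plan_recovery(error_response))
```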

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Support for parallel execution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Agents benefit from running multiple actions in parallel, but this isn’t a model many APIs support. Pagination, for example, is an area where the design of the API has a big impact on how easily an agent can fetch data in parallel. Page-based APIs (GET /blogs?page=1) are easier to load in parallel than token-based ones (GET /blogs?token=1s2fsf), where the result of the current page is needed to load the next one.&lt;/p&gt;
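&lt;p&gt;A toy sketch of why page-based pagination parallelises easily (stdlib only; the endpoint shape is hypothetical):&lt;/p&gt;

```python
# Page-based pagination parallelises because every page number is known up
# front. fetch_page is a stand-in for GET /blogs?page=N.
from concurrent.futures import ThreadPoolExecutor


def fetch_page(page: int) -> list[str]:
    """Stand-in for GET /blogs?page=N - any page can be requested independently."""
    return [f"post-{page}-{i}" for i in range(2)]


# All page numbers are known in advance, so the requests can run concurrently:
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch_page, range(1, 5)))
posts = [post for page in pages for post in page]
print(len(posts))  # -> 8

# A token-based API (GET /blogs?token=...) forces a sequential loop instead,
# because each response carries the token needed for the next request.
```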

&lt;h2&gt;
  
  
  &lt;strong&gt;The Path Forward&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To move beyond APIs, we need to rethink &lt;strong&gt;how software communicates its capabilities&lt;/strong&gt; to agents in a way that is &lt;strong&gt;flexible, interpretable, and adaptable&lt;/strong&gt;. The goal isn’t to replace APIs outright but to evolve toward a model where agents don’t just consume endpoints—they &lt;strong&gt;understand and navigate software functionality intelligently&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let’s explore some real-world approaches and frameworks pushing toward this future.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. LLM-assisted APIs, aka tools&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A feature of some agentic frameworks (like Portia AI) is to wrap the APIs that are provided to the agent in a level of LLM smarts. This gives more flexibility in how the tool is called. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the format of a timestamp in an input field has changed, the LLM can try again with the correct format, as long as the API returns a descriptive error field.
&lt;/li&gt;
&lt;li&gt;If the name of a field in the response has changed, the LLM can still use the response without needing a change to the tool definition.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_posts_since&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GetPostsInput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_direct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_posts_since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://my_hosted_blog.com/posts?since=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Raise an error for non-200 responses
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error fetching posts: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ &lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent only has to reason about function signatures rather than full API calls.
&lt;/li&gt;
&lt;li&gt;Allows some flexibility for the agent to try new approaches and recover from some errors.
&lt;/li&gt;
&lt;li&gt;The agent can handle extraction of the relevant data from the response itself. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ &lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Still requires explicit coding of the functions ahead of time.
&lt;/li&gt;
&lt;li&gt;No built-in reasoning about dependencies or execution order, which is usually required when working with REST APIs.&lt;/li&gt;
&lt;/ul&gt;
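&lt;p&gt;To make the ordering problem concrete, here’s a toy sketch (hypothetical names, not Portia code): nothing in these function signatures tells an agent that a user must exist before a subscription can be created, so that knowledge has to live in the prompt or in the agent’s own reasoning.&lt;/p&gt;

```python
# Toy example (hypothetical names): the ordering constraint between these
# two calls is invisible in their signatures.
USERS: dict[str, dict] = {}

def create_user(email: str) -> str:
    user_id = f"user_{len(USERS) + 1}"
    USERS[user_id] = {"email": email, "subscription": None}
    return user_id

def create_subscription(user_id: str, plan: str) -> str:
    # Implicit dependency: the user must already exist.
    if user_id not in USERS:
        raise ValueError("user must be created before subscribing")
    USERS[user_id]["subscription"] = plan
    return f"sub_for_{user_id}"

uid = create_user("a@example.com")
print(create_subscription(uid, "pro"))  # works only in this order
```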

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Generating Code Instead of Calling APIs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Instead of integrating directly with APIs, agents can generate and execute code using SDKs that wrap underlying functionality. This is common today for AI-assisted development, where models write API client code dynamically instead of calling endpoints directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An LLM generates a Python snippet using a cloud provider’s SDK instead of making raw API requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ &lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leverages existing SDK ecosystems (AWS SDK, Google Cloud SDK, etc.).
&lt;/li&gt;
&lt;li&gt;Gives agents more control over execution (e.g., handling exceptions, retries).
&lt;/li&gt;
&lt;li&gt;More resilient to API changes as the SDK abstracts away versioning differences.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ &lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires agents to understand and generate valid code. Depending on the goal, this might mean accurately chaining several functions within the SDK.
&lt;/li&gt;
&lt;li&gt;Performance relies on well-documented SDKs that work exactly as described.
&lt;/li&gt;
&lt;li&gt;Execution environments need to be properly sandboxed to prevent rogue code from being executed. Either you skip sandboxing, which is a security nightmare in production, or you sandbox the generated code, which is an engineering challenge and limits functionality.&lt;/li&gt;
&lt;/ul&gt;
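&lt;p&gt;As a hedged illustration of the sandboxing point: running generated code in a separate process with a timeout is about the minimum you’d do, and it is emphatically &lt;em&gt;not&lt;/em&gt; a real sandbox; production use needs OS-level isolation such as containers, seccomp, or microVMs.&lt;/p&gt;

```python
import subprocess
import sys
import tempfile

def run_generated_code(code: str, timeout: int = 5) -> str:
    """Execute LLM-generated Python in a separate process with a timeout.
    NOTE: a subprocess only bounds runtime and isolates the interpreter;
    it is NOT a security sandbox."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=timeout
    )
    if result.returncode != 0:
        return f"Error: {result.stderr.strip()}"
    return result.stdout.strip()

print(run_generated_code("print(2 + 2)"))  # → 4
```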

&lt;h3&gt;
  
  
&lt;strong&gt;3. Computer and/or browser use&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Some agents bypass APIs entirely, interacting with software through web browsers just like humans do. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instead of integrating with a banking API, an agent navigates the bank’s website, logging in, clicking buttons, and extracting data from web pages.
&lt;/li&gt;
&lt;li&gt;Browser agents go beyond basic automation tools like Selenium, Playwright, and Puppeteer by leveraging multi-modal discovery of web pages and reasoning to navigate them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ &lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works even when APIs aren’t available.
&lt;/li&gt;
&lt;li&gt;Mimics real human interactions, reducing integration friction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ &lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be fragile: UI changes and new pop-ups can break automation, and they are far more common than API changes.
&lt;/li&gt;
&lt;li&gt;The approach is orders of magnitude slower than API access and far more expensive, since additional LLM calls are needed to guide each interaction.
&lt;/li&gt;
&lt;li&gt;Authentication can be a challenge: a great deal of energy has been expended over the last decade trying to prevent automation from interacting with websites (e.g. CAPTCHAs). We tinkered with browser agents a few months back. Check it out &lt;a href="https://www.linkedin.com/pulse/chatgpt-operator-out-so-whats-next-browser-agents-mounir-mouawad-eyfwe/?trackingId=2ZxzyR5S1X3SzrpDo0eTPw%3D%3D" rel="noopener noreferrer"&gt;here if you're interested(↗)&lt;/a&gt; and keep an eye out for a fresh look in the coming weeks!&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Dynamically discoverable tools&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One of the most promising directions for agent-software interaction is Model Context Protocol (or MCP), which emphasizes dynamic tool discovery, self-registration, and agent-to-agent collaboration. Instead of hardcoded integrations, software components expose self-describing capabilities that agents can discover and use on demand. We’ve written about our experience &lt;br&gt;
&lt;a href="https://blog.portialabs.ai/portia-mcp-stripe-example" rel="noopener noreferrer"&gt;using MCP here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;An agent contacts an MCP registry and lists a set of MCP servers relevant to its task. The agent can self-register with these servers using &lt;a href="https://datatracker.ietf.org/doc/html/rfc7591" rel="noopener noreferrer"&gt;OAuth Dynamic Registration&lt;/a&gt;. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once registered the agent can list all the tools a server has and call them based on the metadata supplied by them. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
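&lt;p&gt;A toy illustration of the discovery idea (this is not the real MCP wire protocol; the registry, tool names, and schemas are all made up): servers publish self-describing tool metadata, and the agent selects tools at runtime instead of being hardcoded against them.&lt;/p&gt;

```python
# Hypothetical registry: each "server" exposes self-describing tool metadata.
TOOL_REGISTRY = {
    "weather": {
        "tools": [
            {"name": "get_forecast", "description": "Fetch a forecast",
             "input_schema": {"city": "string", "days": "integer"}},
        ]
    },
    "calendar": {
        "tools": [
            {"name": "create_event", "description": "Create a calendar event",
             "input_schema": {"title": "string", "start": "string"}},
        ]
    },
}

def discover_tools(task_keywords: set[str]) -> list[dict]:
    """Return tool metadata whose description mentions any task keyword."""
    matches = []
    for server in TOOL_REGISTRY.values():
        for tool in server["tools"]:
            if task_keywords & set(tool["description"].lower().split()):
                matches.append(tool)
    return matches

print([t["name"] for t in discover_tools({"forecast"})])  # → ['get_forecast']
```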

&lt;p&gt;&lt;strong&gt;✅ Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No integration required ahead of time: thanks to the registry and dynamic registration, new tools can be integrated on the fly.
&lt;/li&gt;
&lt;li&gt;More composable and modular: tools can be combined in new ways based on the goal at hand.
&lt;/li&gt;
&lt;li&gt;Maintenance of tools is handled by the third party who owns the MCP server. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ &lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whilst we believe MCP will be a big part of the puzzle in the future, current implementations are somewhat lacking. There are few registries and authentication has only &lt;a href="https://spec.modelcontextprotocol.io/specification/2025-03-26/basic/authorization/" rel="noopener noreferrer"&gt;been added to the standard(↗)&lt;/a&gt; in the last couple of days.&lt;/li&gt;
&lt;li&gt;Even with good implementations in the future, the quality of tools will vary widely, and agents will need to identify good servers and tools to perform well.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.portialabs.ai%2Fassets%2Fimages%2Fmcp_market_map-c0439b6b810cc89bc971b5b0ef238d0c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.portialabs.ai%2Fassets%2Fimages%2Fmcp_market_map-c0439b6b810cc89bc971b5b0ef238d0c.png" alt="MCP Market" width="800" height="674"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Agent &amp;lt;&amp;gt; Agent interfaces&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Instead of imperative API calls (&lt;code&gt;POST /create_user&lt;/code&gt;), in the future declarative workflows will let agents specify what they want to achieve, and the external system will determine how to execute it. We see some natural language APIs today but true flexibility will be achieved when we can have inter-system agent handoff. &lt;/p&gt;

&lt;p&gt;Note we see this as different to intra-system agent handoffs; many multi-agent systems today allow you to have agents talking to agents, but we believe in the future the interface between systems will also be agent &amp;lt;&amp;gt; agent. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Instead of calling your billing software APIs manually, an agent would submit a high-level goal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;"Ensure customer X has an active subscription and has been notified of their renewal"&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;The system maps this to the correct sequence of operations, handling retries and dependencies automatically.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
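&lt;p&gt;A minimal sketch of what a declarative, goal-oriented endpoint might do internally (all entity and function names here are hypothetical): the caller states the goal, and the receiving system maps it onto the required operations based on its own view of state.&lt;/p&gt;

```python
# Hypothetical billing system: the caller submits a goal, and the system
# decides which operations are needed to satisfy it.
def ensure_subscription_active(customer_id: str, state: dict) -> list[str]:
    """Map the goal 'customer has an active subscription and has been
    notified' onto the operations actually required right now."""
    operations = []
    customer = state.setdefault(
        customer_id, {"subscription": None, "notified": False}
    )
    if customer["subscription"] != "active":
        operations.append("create_subscription")
        customer["subscription"] = "active"
    if not customer["notified"]:
        operations.append("send_renewal_notification")
        customer["notified"] = True
    return operations

state = {}
print(ensure_subscription_active("customer_x", state))  # both steps needed
print(ensure_subscription_active("customer_x", state))  # → [] (goal already met)
```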

&lt;p&gt;✅ &lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shifts complexity from the agent to the software, moving the need to understand the domain logic of the problem to the party with the most context. In the above example I don’t need to understand how a user entity relates to a subscription and the set of API calls to update both.
&lt;/li&gt;
&lt;li&gt;Removes overhead of API management, versioning etc since execution logic is abstracted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ &lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Will require strong guardrails, authentication controls, and human-in-the-loop mechanisms.
&lt;/li&gt;
&lt;li&gt;This is a new paradigm in software engineering, which may limit adoption.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Bringing it all together&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At Portia, we’re excited about the future of agent-first software interfaces. The limitations of traditional APIs don’t mean the end of structured integrations, but rather the beginning of a new era—one where software exposes its capabilities in ways that agents can reason about, adapt to, and dynamically interact with.&lt;/p&gt;

&lt;p&gt;From LLM-assisted smart tools to dynamically discoverable interfaces like MCP, the path forward is clear: agents need more flexible, self-describing, securely authenticated and goal-oriented mechanisms to interact with software. &lt;/p&gt;

&lt;p&gt;We believe the best agent &amp;lt;&amp;gt; software interfaces are still ahead of us, and we’re excited to push the boundaries. If you’re thinking about these challenges too, we’d love to hear your thoughts!&lt;/p&gt;

&lt;h2&gt;
  
  
  Join the conversation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Like this article?&lt;/strong&gt; – &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Give us a ⭐ on GitHub&lt;/a&gt;. It really helps!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse our website and try our (modest) playground at &lt;a href="http://www.portialabs.ai/" rel="noopener noreferrer"&gt;www.portialabs.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Head over to our docs at &lt;a href="https://docs.portialabs.ai/" rel="noopener noreferrer"&gt;docs.portialabs.ai&lt;/a&gt; or get immersed in our &lt;a href="https://github.com/portiaAI/portia-sdk-python/tree/main/portia/open_source_tools" rel="noopener noreferrer"&gt;SDK&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join the conversation on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Watch us embarrass ourselves on our &lt;a href="https://www.youtube.com/@PortiaAI" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://www.producthunt.com/posts/portia-ai" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>portia</category>
      <category>mcp</category>
      <category>api</category>
      <category>tools</category>
    </item>
    <item>
      <title>Visualise your Obsidian notes with Qwen3</title>
      <dc:creator>Portia AI</dc:creator>
      <pubDate>Thu, 08 May 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/portia-ai/visualise-your-obsidian-notes-with-qwen3-3jj5</link>
      <guid>https://dev.to/portia-ai/visualise-your-obsidian-notes-with-qwen3-3jj5</guid>
      <description>&lt;p&gt;Many users with stringent security, privacy or latency requirements have told us they prefer to run their own LLM instances locally. We recently added support for interfacing with Ollama models running locally.&lt;/p&gt;

&lt;p&gt;To explore how we might use a local LLM practically, we decided to build an app that could turn an Obsidian note into a concept map – a visual diagram that shows how different ideas in the note are related. As an early-stage startup, we've actually been building our internal apps on top of local LLMs to keep our costs low: we use the Obsidian app in this post to visualise notes coming out of our weekly engineering design meetings!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.portialabs.ai%2Fassets%2Fimages%2Fmicroservices-ea0f6666ca41f7afb10f5c2e3c6d52b0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.portialabs.ai%2Fassets%2Fimages%2Fmicroservices-ea0f6666ca41f7afb10f5c2e3c6d52b0.png" alt="An example concept map for the subject of microservices." width="800" height="589"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A Microservice Concept Map&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The app reads a single note, uses Qwen3 4B to analyse its contents, extracts relationships between key concepts, and outputs a PNG file with a graph-style visualisation. The entire process runs locally, using Portia AI to handle orchestration between tools.&lt;/p&gt;
&lt;h2&gt;
  
  
  Obsidian is a great app for storing notes
&lt;/h2&gt;

&lt;p&gt;If you're serious about organising ideas, thoughts, or long-form research, &lt;a href="https://obsidian.md/" rel="noopener noreferrer"&gt;Obsidian&lt;/a&gt; is one of the best apps available. It's fast, extensible, and designed around a powerful but simple principle: your notes are plain text files that are stored locally by default (in Markdown).&lt;/p&gt;
&lt;h2&gt;
  
  
  Why local LLMs are worth using
&lt;/h2&gt;

&lt;p&gt;Most people experience language models through cloud-based APIs like ChatGPT or Claude. These are large, powerful models hosted on someone else's infrastructure. But for many developer workflows – especially apps that run against your own local data – there's a strong case for running smaller models directly on your own machine. The Portia AI SDK supports all models that can be run by Ollama.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tradeoff&lt;/th&gt;
&lt;th&gt;Local LLMs&lt;/th&gt;
&lt;th&gt;Hosted LLMs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Privacy&lt;/td&gt;
&lt;td&gt;Everything stays on your computer&lt;/td&gt;
&lt;td&gt;All prompts and data are sent over the internet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;Near-instant results&lt;/td&gt;
&lt;td&gt;Slower, network-dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Free after setup&lt;/td&gt;
&lt;td&gt;Pay-per-token or subscription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Control&lt;/td&gt;
&lt;td&gt;Full control over the model and environment&lt;/td&gt;
&lt;td&gt;Limited access to fine-tuning or weights&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy and context&lt;/td&gt;
&lt;td&gt;Smaller context windows, less precision&lt;/td&gt;
&lt;td&gt;Larger, more capable, and usually more accurate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup complexity&lt;/td&gt;
&lt;td&gt;Requires installation and some configuration&lt;/td&gt;
&lt;td&gt;Works out of the box with API key&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Why we chose Qwen3
&lt;/h2&gt;

&lt;p&gt;We decided to use the &lt;a href="https://qwenlm.github.io/blog/qwen3/" rel="noopener noreferrer"&gt;Qwen3&lt;/a&gt; family of models from Alibaba’s open-source LLM line. These models are trained with multilingual capabilities and perform well even at smaller sizes.&lt;/p&gt;

&lt;p&gt;The Qwen3 4B model in particular offers a nice balance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It can run comfortably on machines with under 10GB of VRAM.&lt;/li&gt;
&lt;li&gt;It’s available through Ollama, which makes setup simple.&lt;/li&gt;
&lt;li&gt;It’s reasonably competent at task execution and factual recall, especially for structured prompts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That said, Qwen3 4B can and will make mistakes, especially when extracting subtle relationships or summarising long-form content. That’s a tradeoff we accept for speed and local control.&lt;/p&gt;
&lt;h2&gt;
  
  
  What does this app &lt;em&gt;do&lt;/em&gt;?
&lt;/h2&gt;

&lt;p&gt;Let's talk about how to run the app, and what it does, before going into the code.&lt;/p&gt;

&lt;p&gt;You can run the app with &lt;code&gt;uv run main.py NOTE&lt;/code&gt; where &lt;code&gt;NOTE&lt;/code&gt; should be the name of one of your notes in an Obsidian vault. In the provided Obsidian vault, there's a note called &lt;code&gt;DDD&lt;/code&gt;, all about Domain Driven Design. So you could call the app with &lt;code&gt;uv run main.py DDD&lt;/code&gt;.&lt;/p&gt;
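&lt;p&gt;The argument handling might look something like this sketch (hypothetical; the real &lt;code&gt;main.py&lt;/code&gt; in the repo may differ):&lt;/p&gt;

```python
import argparse

# One positional argument naming the Obsidian note to visualise.
parser = argparse.ArgumentParser(
    description="Turn an Obsidian note into a concept map"
)
parser.add_argument("note", help="Name of a note in your Obsidian vault, e.g. DDD")

# Simulating `uv run main.py DDD`; normally you would call parse_args()
# with no arguments so it reads sys.argv.
args = parser.parse_args(["DDD"])
print(args.note)  # → DDD
```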

&lt;p&gt;The app will then configure an explicit plan, consisting of the following steps, using Portia’s plan builder:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;List all available vaults&lt;/td&gt;
&lt;td&gt;MCP Tool call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fetch the note from the obsidian vaults&lt;/td&gt;
&lt;td&gt;MCP Tool call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Create a concept map visualization using the extracted relationships&lt;/td&gt;
&lt;td&gt;Custom visualisation Tool&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We chose to configure an explicit plan rather than rely on Portia's planning agent because tests showed Qwen3 4B to be unreliable at planning. In this case, the plan is always going to be the same, so it makes sense to outline it explicitly in code using the PlanBuilder interface.&lt;/p&gt;

&lt;p&gt;The plan is passed to Portia, which has been configured to use Qwen3 4B via the Ollama interface. (We'll show you that below.)&lt;/p&gt;
&lt;h3&gt;
  
  
  What is Ollama?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; is a free, open-source app that allows you to run large language models (LLMs) locally on your computer or a server. It currently supports &lt;a href="https://www.ollama.com/library" rel="noopener noreferrer"&gt;30 different models&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Diving into the code
&lt;/h2&gt;

&lt;p&gt;Let's take a deeper look at the app, and what the code looks like. All the code is available in our &lt;a href="https://github.com/portiaAI/portia-agent-examples/tree/local-llm/local-llm" rel="noopener noreferrer"&gt;example code GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We're not going to cover every line of code in this project. If you want to see all of the code, check out the full code example. We'll guide you through all the important code below.&lt;/p&gt;
&lt;h3&gt;
  
  
  Configuring Portia to use a local LLM
&lt;/h3&gt;

&lt;p&gt;Portia supports running local LLMs via &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;. At the moment, that’s over 30 models, including Meta’s Llama Series, the Qwen series of models that we’re using here, and many others. The only requirement is that when you specify it in code, the model name begins with "ollama/" and then the specifier for the model you wish to run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="n"&gt;default_log_level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEBUG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;default_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama/qwen3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;execution_agent_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ExecutionAgentType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ONE_SHOT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/portiaAI/portia-agent-examples/blob/local-llm/local-llm/main.py#L117-L121" rel="noopener noreferrer"&gt;This code in GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Portia offers two types of execution agents that take care of executing a step. The &lt;a href="https://github.com/portiaAI/portia-sdk-python/blob/main/portia/execution_agents/default_execution_agent.py" rel="noopener noreferrer"&gt;DEFAULT&lt;/a&gt; agent parses and verifies the tool’s arguments before the tool is called, reducing hallucinated or made-up values. This is recommended for complex tasks and for tools with complex parameters (defaults etc.). The &lt;a href="https://github.com/portiaAI/portia-sdk-python/blob/main/portia/execution_agents/one_shot_agent.py" rel="noopener noreferrer"&gt;ONE_SHOT&lt;/a&gt; agent is faster and more cost-efficient when the tool call is simple. We generally recommend &lt;code&gt;ONE_SHOT&lt;/code&gt; for smaller models (like Qwen3 4B), as our default agent is optimised for larger, more capable models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding the required tools
&lt;/h3&gt;

&lt;p&gt;An &lt;a href="https://github.com/StevenStavrakis/obsidian-mcp" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt; already exists for Obsidian, and luckily Portia makes our lives much easier by supporting MCP out-of-the-box! The code below installs and runs &lt;code&gt;obsidian-mcp&lt;/code&gt; locally via &lt;code&gt;npx&lt;/code&gt;. The visualisation tool is part of this project, and is included in the code-base. (We'll tell you more about the visualisation tool in a moment.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;obsidian_mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;McpToolRegistry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_stdio_connection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="n"&gt;server_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;obsidian&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;obsidian-mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OBSIDIAN_VAULT_PATH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add all tools to the registry
&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;obsidian_mcp&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;ToolRegistry&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nc"&gt;VisualizationTool&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;

&lt;span class="n"&gt;portia&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Portia&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/portiaAI/portia-agent-examples/blob/local-llm/local-llm/main.py#L124-L140" rel="noopener noreferrer"&gt;This code in GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the tools have been configured, they're passed to Portia's constructor, along with the required configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vibe coding a visualisation tool 🏄🏻‍♂️
&lt;/h3&gt;

&lt;p&gt;Because we're on-trend at Portia, we decided to vibe-code the visualisation component, which renders concept maps from the relationships extracted in each note. Omar wrote the visualisation tool quickly, guided more by intuition and immediate usefulness (and a dash of Spidey sense) than formal design specs.&lt;/p&gt;

&lt;p&gt;It was an ideal candidate for this approach because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The requirements were loose: "make a diagram that looks decent"&lt;/li&gt;
&lt;li&gt;It could be tested easily and repeatedly with mock data&lt;/li&gt;
&lt;li&gt;The failure modes (e.g. cluttered layout, hard-to-read arrows) were visual and obvious&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It was fun to build, and it works reliably, but we don't recommend using this code in production, and we're not going to talk about it here! One does not simply vibe code one’s way into production.&lt;/p&gt;
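&lt;p&gt;For a flavour of the idea, though, here’s a minimal, hypothetical sketch (not the actual tool) that turns extracted concept relationships into Graphviz DOT text, which the &lt;code&gt;dot&lt;/code&gt; command can render to a PNG:&lt;/p&gt;

```python
# Hypothetical sketch: relationships come in as (source, label, target)
# triples and go out as Graphviz DOT text.
def to_dot(relationships: list[tuple[str, str, str]]) -> str:
    lines = ["digraph concept_map {", '  rankdir="LR";']
    for source, label, target in relationships:
        lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)

dot = to_dot([
    ("Microservices", "communicate via", "APIs"),
    ("Microservices", "deployed in", "Containers"),
])
print(dot)  # pipe this into `dot -Tpng -o concept_map.png`
```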

&lt;h3&gt;
  
  
  Making a plan
&lt;/h3&gt;

&lt;p&gt;We've designed a plan explicitly for completing this task. The PlanBuilder interface is useful when you want to implement a simple and/or repeatable plan. It's also useful when the underlying LLM is not strong at planning tasks, as is the case in this example.&lt;/p&gt;

&lt;h3&gt;
  
  
  Teaching Portia’s planning agent
&lt;/h3&gt;

&lt;p&gt;Another option to increase reliability is to use Portia's new &lt;a href="https://blog.portialabs.ai/improve-planning-with-user-led-learning" rel="noopener noreferrer"&gt;User Led Learning&lt;/a&gt; feature to guide future planning in the right direction.&lt;/p&gt;

&lt;p&gt;The following code can be found in the &lt;a href="https://github.com/portiaAI/portia-agent-examples/blob/local-llm/local-llm/main.py#L25" rel="noopener noreferrer"&gt;&lt;code&gt;create_plan_local&lt;/code&gt;&lt;/a&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="nc"&gt;PlanBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create a concept map image from the note with title &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;note_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;List all available vaults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp:obsidian:list_available_vaults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetch the note named &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;note_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; from the obsidian vaults&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp:obsidian:read_note&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create a concept map visualization using the extracted relationships. Title the image &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;note_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; and output the image to the directory &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;OBSIDIAN_VAULT_PATH&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/visualizations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;visualization_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/portiaAI/portia-agent-examples/blob/local-llm/local-llm/main.py#L41-L59" rel="noopener noreferrer"&gt;This code in GitHub&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Putting it all together
&lt;/h3&gt;

&lt;p&gt;And, honestly, that's kind of &lt;em&gt;it&lt;/em&gt;. This plan can be passed to Portia's &lt;code&gt;run_plan&lt;/code&gt; method, and Qwen will read the Obsidian note specified with &lt;code&gt;args.note&lt;/code&gt;, generate a concept map, and add it to a visualisations directory in your Obsidian vault!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_plan_local&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;portia&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;note&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;portia&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_plan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/portiaAI/portia-agent-examples/blob/local-llm/local-llm/main.py#L136-L138" rel="noopener noreferrer"&gt;This code in GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The quality of the results can be a little varied, depending on the source material, and how the Qwen3 model is feeling when you run it.&lt;/p&gt;

&lt;h3&gt;
  
  
  What did we learn?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Qwen3 4b is very capable, given its small size and modest resource requirements.&lt;/li&gt;
&lt;li&gt;It can be unreliable at planning, and sometimes even at tool-call generation, in cases that require large inputs.&lt;/li&gt;
&lt;li&gt;Planning issues can be avoided if you are able to explicitly design a plan or use &lt;a href="https://blog.portialabs.ai/improve-planning-with-user-led-learning" rel="noopener noreferrer"&gt;User Led Learning&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Our &lt;code&gt;ONE_SHOT&lt;/code&gt; agent is catching up to the &lt;code&gt;DEFAULT_AGENT&lt;/code&gt; in accuracy over time, thanks to the fast-paced improvement of the underlying models. We’re constantly evaluating the performance of both agents (on OpenAI 4o and the latest Claude 3.5). The &lt;code&gt;DEFAULT_AGENT&lt;/code&gt; is still better at resolving more complex tasks that require tools with lots of parameters.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many reasons you might want to run local models, and they come with upsides and downsides. Ultimately, whether you use something like Qwen3 locally or a larger model remotely comes down to your own requirements and what suits them best.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;First, you should definitely give our &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;SDK Repo on GitHub&lt;/a&gt; a ⭐️!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you enjoyed this post, check out our other post on &lt;a href="https://blog.portialabs.ai/improve-planning-with-user-led-learning" rel="noopener noreferrer"&gt;User Led Learning&lt;/a&gt;, a Portia feature that can dramatically increase the reliability of your agent's planning.&lt;/li&gt;
&lt;li&gt;If you want to build agents that can interact with websites, check out our most recent post on &lt;a href="https://dev.to/portia_ai_mark/a-unified-framework-for-browser-and-api-authentication-4a8h-temp-slug-651855"&gt;local and remote browser integration&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Join the conversation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Like this article?&lt;/strong&gt; – &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Give us a ⭐ on GitHub&lt;/a&gt;. It really helps!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse our website and try our (modest) playground at &lt;a href="http://www.portialabs.ai/" rel="noopener noreferrer"&gt;www.portialabs.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Head over to our docs at &lt;a href="https://docs.portialabs.ai/" rel="noopener noreferrer"&gt;docs.portialabs.ai&lt;/a&gt; or get immersed in our &lt;a href="https://github.com/portiaAI/portia-sdk-python/tree/main/portia/open_source_tools" rel="noopener noreferrer"&gt;SDK&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join the conversation on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Watch us embarrass ourselves on our &lt;a href="https://www.youtube.com/@PortiaAI" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://www.producthunt.com/posts/portia-ai" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>portia</category>
      <category>local</category>
      <category>smallmodels</category>
      <category>qwen</category>
    </item>
    <item>
      <title>A unified framework for browser and API authentication</title>
      <dc:creator>Portia AI</dc:creator>
      <pubDate>Thu, 01 May 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/portia-ai/a-unified-framework-for-browser-and-api-authentication-1jf7</link>
      <guid>https://dev.to/portia-ai/a-unified-framework-for-browser-and-api-authentication-1jf7</guid>
      <description>&lt;p&gt;The core of the Portia authorization framework is the ability for an agent to pause itself to solicit a user's authorization for an action it wants to perform. With delegated OAuth, we do this by creating an OAuth link that the user clicks on to grant Portia a token that can be used for the API requests made by the agent. We generally like API based agents for reliability reasons – they're fast, predictable and the rise of MCP means integration is getting easier.&lt;/p&gt;

&lt;p&gt;However, there are some actions which are not easily accessible by API (my supermarket doesn't have a delegated OAuth flow surprisingly!), and so, there is huge power in being able to switch seamlessly between browser based and API based tasks. The question was, how to do this consistently and securely with our authorization framework.&lt;/p&gt;

&lt;p&gt;With OAuth, authorization is done via a token. The protocol for obtaining it has been solidified over many years, but fundamentally, if you have access to the token, you have access to the API. With browser-based auth, authentication is baked into the browser itself using cookies or local storage. Then you layer on bot protections: 20 years of sophistication have gone into detecting nefarious bots, which is how an agent looks to a website, irrespective of whether the agent is actually doing something useful.&lt;/p&gt;

&lt;p&gt;Luckily, the age of agents means various players are rethinking this, and we found a browser infrastructure provider called BrowserBase. Their product allows the creation of browser sessions with a lot of the components we needed to make this work. Combining &lt;a href="https://www.browserbase.com/" rel="noopener noreferrer"&gt;&lt;code&gt;BrowserBase&lt;/code&gt;&lt;/a&gt; with &lt;a href="https://browser-use.com/" rel="noopener noreferrer"&gt;&lt;code&gt;BrowserUse&lt;/code&gt;&lt;/a&gt; for goal-oriented tasks and the Portia framework for authorization means we can offer our developers a paradigm very similar to that of OAuth-based tools.&lt;/p&gt;

&lt;p&gt;The video below shows an agent we built using a combination of API and browser-based tasks to accomplish LinkedIn outreach.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/hSq8Ww-hagg"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool creation
&lt;/h2&gt;

&lt;p&gt;With Portia, users can add browser-based capabilities to their applications with one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;PortiaToolRegistry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;BrowserToolForUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.linkedin.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This indicates to the planner that it can navigate to the LinkedIn website to achieve the overall user goal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Browser agent
&lt;/h2&gt;

&lt;p&gt;When execution gets to a step requiring the browser, we create a session, either locally or remotely (using BrowserBase). The browser agent then navigates the website to achieve the step task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljtak8va4mj7e1b6sz2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljtak8va4mj7e1b6sz2n.png" alt="a screenshot of the browser, running in debug mode, with various elements of the page highlighted in different colours"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentication
&lt;/h2&gt;

&lt;p&gt;The browser agent is instructed to return from its task if it encounters authentication. We then produce an ActionClarification containing a link the user clicks to perform the authentication action. If the developer is using our end-user concept for scalable auth, each end user has a unique session on BrowserBase and a unique login URL. They can then log in, and their cookies are saved remotely using BrowserBase's secure cookie store.&lt;/p&gt;
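&lt;p&gt;The pause-and-resume flow described above can be sketched generically. The class and function names below are illustrative stand-ins of our own, not the Portia SDK's actual API:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class ActionClarification:
    """Surfaced to the user: a link to click to complete an action (e.g. log in)."""
    action_url: str
    resolved: bool = False

@dataclass
class BrowserStep:
    task: str
    needs_auth: bool = True  # simulate the agent hitting a login wall

def run_step(step: BrowserStep, clarifications: list) -> str:
    # If the browser agent hits an auth wall, it returns from its task
    # and a clarification is raised for the human to resolve.
    if step.needs_auth:
        clarifications.append(
            ActionClarification(action_url="https://example.com/login-session")
        )
        return "PAUSED"
    return "DONE"

def resume(step: BrowserStep, clarifications: list) -> str:
    # After the user clicks the link and authenticates, the session cookies
    # persist, so the step can be re-run without hitting the wall again.
    for c in clarifications:
        c.resolved = True
    step.needs_auth = False
    return run_step(step, clarifications)

clarifs: list = []
step = BrowserStep("Send LinkedIn connection request")
print(run_step(step, clarifs))  # PAUSED
print(resume(step, clarifs))    # DONE
```

The key design point is that the agent never handles credentials itself; it only pauses, hands control to a human, and resumes once the session is authenticated.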

&lt;h2&gt;
  
  
  API or Browser Agents: Who wins?
&lt;/h2&gt;

&lt;p&gt;In the rapidly evolving world of agents, we frequently get asked how to think about browser- vs API-based agents. Our general rule of thumb is 'use API-based tools when available', but here's a quick comparison between the two:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Browser agents&lt;/th&gt;
&lt;th&gt;API agents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Much slower – can require 10-100x the LLM calls vs API based agents&lt;/td&gt;
&lt;td&gt;Faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Much more expensive&lt;/td&gt;
&lt;td&gt;Cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reliability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Less predictable in terms of task completion, but more likely to succeed on retry. Possible to get blocked by bot protections&lt;/td&gt;
&lt;td&gt;Predictable. Requires more investment to create the tools that can be used in agentic systems.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Types of tasks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Exploratory research tasks. Small, well-defined tasks between API-based tasks&lt;/td&gt;
&lt;td&gt;Larger tasks, particularly those involving data processing and linking multiple systems together. Use whenever available&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What about those bot protections?
&lt;/h2&gt;

&lt;p&gt;When we first started working on this feature, I assumed we would nearly always be blocked by bot protections. Thankfully, this frequently turned out not to be the case: on many websites, if you prove that there is genuinely a human in the loop at the point of authentication, your agent can proceed, though the 2FA or multi-factor checks the human needs to complete are often harder than when browsing the web as a human. Some websites still have more fundamental infrastructure blocks, but the approach many seem to be taking, authorizing as long as it's genuinely on behalf of a human, feels balanced and appropriate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Join the conversation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Like this article?&lt;/strong&gt; – &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Give us a ⭐ on GitHub&lt;/a&gt;. It really helps!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse our website and try our (modest) playground at &lt;a href="http://www.portialabs.ai/" rel="noopener noreferrer"&gt;www.portialabs.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Head over to our docs at &lt;a href="https://docs.portialabs.ai/" rel="noopener noreferrer"&gt;docs.portialabs.ai&lt;/a&gt; or get immersed in our &lt;a href="https://github.com/portiaAI/portia-sdk-python/tree/main/portia/open_source_tools" rel="noopener noreferrer"&gt;SDK&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join the conversation on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Watch us embarrass ourselves on our &lt;a href="https://www.youtube.com/@PortiaAI" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://www.producthunt.com/posts/portia-ai" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>portia</category>
      <category>authentication</category>
      <category>agents</category>
      <category>browseragents</category>
    </item>
    <item>
      <title>A deep dive into our “User Led Learning” feature</title>
      <dc:creator>Portia AI</dc:creator>
      <pubDate>Thu, 17 Apr 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/portia-ai/a-deep-dive-into-our-user-led-learning-feature-bj6</link>
      <guid>https://dev.to/portia-ai/a-deep-dive-into-our-user-led-learning-feature-bj6</guid>
      <description>&lt;p&gt;At Portia, we believe building agents for production means &lt;strong&gt;balancing AI autonomy with human control&lt;/strong&gt; – something we call the ‘spectrum of autonomy’. We have previously seen how clarifications can be used during plan runs to handle the human:agent interface. With our new &lt;em&gt;User Led Learning&lt;/em&gt; feature, we’re bringing this level of feedback into the planning process as well.&lt;/p&gt;

&lt;p&gt;Developers now have a powerful way to shape the Planning agent’s behavior—without rewriting prompts or tweaking models. When you generate a plan using the Portia AI SDK, that plan can be stored in the Portia cloud where it can be highlighted as a preferred plan with a simple thumbs-up. Each “like” tells the Portia planning agent, &lt;em&gt;this was a good plan for this type of user intent&lt;/em&gt;—and over time, those signals help planning agents make better decisions on their own. It’s a subtle but powerful shift along the spectrum of autonomy: agents become more capable and self-directed, while still staying grounded in what users actually want.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvsrzjgpvs8j80gplby1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvsrzjgpvs8j80gplby1e.png" alt="A diagram showing an overview of Portia AI's planning and execution stages." width="800" height="528"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Portia AI’s overview&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What does User-Led Learning solve?
&lt;/h2&gt;

&lt;p&gt;We see three areas where the tension between AI autonomy and the predictability users want is at its peak:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Conversational user experiences:&lt;/strong&gt; When you interact with an agent, your instructions are usually written in natural language. That’s great for usability, but tough for precision. Human language is full of &lt;strong&gt;implications&lt;/strong&gt;. You might say “send a message,” but really mean “send a WhatsApp message.” Or you might expect the agent to summarize the message afterward – even if you never said that out loud. These gaps between what’s said and what’s &lt;em&gt;meant&lt;/em&gt; are where agents can go off-course.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context specific workflows:&lt;/strong&gt; A business might want to automate a complex set of steps that are specific to their data collection pipelines for example. This isn’t something an LLM can reason about reliably based on their pre-training data. For example, every business may have their own workflows for completing KYC / KYB processes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;App specific tool chaining complexity:&lt;/strong&gt; Some apps require chaining several tools together but the tool descriptions and arguments are not sufficient to ensure the LLM chains them in the right sequence with high reliability across production-scale volumes of agentic workflows. For example, sending an email may require an id which first must be retrieved by mapping to your email address.&lt;/li&gt;
&lt;/ol&gt;
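&lt;p&gt;The email example in point 3 can be sketched as a minimal two-step chain. The functions below are toy stand-ins of our own, not real tool implementations: the send tool only accepts an id, so a lookup step must run first, and the planner has to get that ordering right every time.&lt;/p&gt;

```python
# Toy directory mapping email addresses to internal user ids
_DIRECTORY = {"ada@example.com": "user_42"}

def resolve_user_id(email: str) -> str:
    """Step 1: the id must be looked up before any send can happen."""
    return _DIRECTORY[email]

def send_email(user_id: str, body: str) -> str:
    """Step 2: the send tool only accepts an id, not an email address."""
    return f"sent to {user_id}: {body}"

# Correct chaining: the planner must sequence resolve -> send.
# Calling send_email with the raw address would fail silently or error.
result = send_email(resolve_user_id("ada@example.com"), "Hello!")
print(result)  # sent to user_42: Hello!
```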

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F996dh0pcjah97yjrbvnn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F996dh0pcjah97yjrbvnn.png" alt="A meme captured from the movie Anchorman, with the caption 'LLMs ... 60% of the time, they work every time.'" width="671" height="372"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;They work every time, sometimes.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  So how does it work?
&lt;/h2&gt;

&lt;p&gt;By surfacing the plans you like, you give the LLM guidance about your preferences so it can bias towards them when it encounters user prompts with similar intent. At a high level this process involves the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You can “Like” plans saved to Portia Cloud from the dashboard to signal that they are your “ground truth”.&lt;/li&gt;
&lt;li&gt;You can then pull “Liked” plans based on semantic similarity to the user intent in a query by using our freshly minted &lt;code&gt;portia.storage.get_similar_plans&lt;/code&gt; method.&lt;/li&gt;
&lt;li&gt;Finally, you can ingest those similar plans as example plans in the Planning agent using the &lt;code&gt;portia.plan&lt;/code&gt; method’s &lt;code&gt;example_plans&lt;/code&gt; property.&lt;/li&gt;
&lt;/ol&gt;
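&lt;p&gt;Step 2 above is a semantic-similarity lookup. As an illustrative sketch only, with toy embedding vectors and helper names of our own (not the SDK's &lt;code&gt;get_similar_plans&lt;/code&gt; implementation), ranking liked plans against a query could work like this:&lt;/p&gt;

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def get_similar(query_vec: list, liked_plans: list, top_k: int = 2) -> list:
    """liked_plans: (plan_name, embedding) pairs, pre-filtered to 'liked' only."""
    ranked = sorted(liked_plans, key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Toy embeddings standing in for real ones from an embedding model
liked = [
    ("refund_plan", [0.9, 0.1, 0.0]),
    ("outreach_plan", [0.1, 0.9, 0.0]),
    ("report_plan", [0.2, 0.2, 0.9]),
]
print(get_similar([0.8, 0.2, 0.1], liked, top_k=1))  # ['refund_plan']
```

The retrieved plans are then passed to the planner as examples, biasing it towards the step orderings users have already endorsed.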

&lt;p&gt;Let’s take a scenario that we’ve written about before – &lt;a href="https://blog.portialabs.ai/portia-mcp-stripe-example" rel="noopener noreferrer"&gt;building a refund agent&lt;/a&gt; to process refund requests. This usually results in a relatively complex plan - usually around nine steps broadly covering the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load the refund policy and customer request from file&lt;/li&gt;
&lt;li&gt;Use LLM smarts to assess the request&lt;/li&gt;
&lt;li&gt;Request human approval&lt;/li&gt;
&lt;li&gt;Process the refund through 3 Stripe interactions – find the customer ID, find the relevant payment for that customer, create the refund.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Without user-led learning: Improving reliability through painstaking prompt engineering
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Getting set up
&lt;/h3&gt;

&lt;p&gt;The code blocks in this post are available for you in our &lt;a href="https://github.com/portiaAI/portia-agent-examples/tree/main/improving-planning-with-ull" rel="noopener noreferrer"&gt;examples repository on GitHub&lt;/a&gt;. Make sure you have followed the steps in the README to get set up correctly, including minting a Stripe test API key and installing project dependencies.&lt;/p&gt;

&lt;p&gt;With the kind of multi-step plan we’re looking at here, the quality of your prompt is very important. For example, take the &lt;code&gt;vague_prompt&lt;/code&gt; in the code snippet below (accessible in our examples repository on GitHub in the file &lt;a href="https://github.com/portiaAI/portia-agent-examples/blob/main/improving-planning-with-ull/01_ull_vague_prompt_no_examples.py" rel="noopener noreferrer"&gt;01_ull_vague_prompt_no_examples.py&lt;/a&gt;). We found that such a relatively vague prompt resulted in the correct order of steps only 82% of the time. The LLM would sometimes get mixed up in the ordering of Stripe interactions, or omit one of them, e.g. it would skip loading payment intents for the customer. At times it would even skip the critical step of requesting human approval!&lt;/p&gt;
&lt;h3&gt;
  
  
  01_ull_vague_prompt_no_examples.py
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;common&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;init_portia&lt;/span&gt;

&lt;span class="c1"&gt;# Define the prompts for testing
&lt;/span&gt;&lt;span class="n"&gt;vague_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Read the refund request email from the customer and decide if it should be approved or rejected.
If you think the refund request should be approved, check with a human for final approval and then process the refund.

To process the refund, you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ll need to find the customer in Stripe and then find their payment intent.

The refund policy can be found in the file: ./refund_policy.txt

The refund request email can be found in &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inbox.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; file
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# This is function initializes Portia and all the tools.
# You can find it in common.py
&lt;/span&gt;&lt;span class="n"&gt;portia_instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;init_portia&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a plan and print it out.
# 18% of the time, steps will be in the wrong order!
&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;portia_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vague_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pretty_print&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We can improve the results of this by spending more time on the prompt, being more prescriptive about what we want to be done and when. Here’s a better prompt we had arrived at after some prompt engineering and eval running, which you can run a few times from &lt;a href="https://github.com/portiaAI/portia-agent-examples/blob/main/improving-planning-with-ull/02_ull_good_prompt_no_examples.py" rel="noopener noreferrer"&gt;02_ull_good_prompt_no_examples.py&lt;/a&gt; in our examples repository on GitHub:&lt;/p&gt;

&lt;p&gt;A more prescriptive prompt&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Read the refund request email from the customer and decide if it should be approved or rejected.
If you think the refund request should be approved, check with a human for final approval and then process the refund.

Stripe instructions -- To create a refund in Stripe, you need to:
* Find the Customer using their email address from the List of Customers in Stripe.
* Find the Payment Intent ID using the Customer from the previous step, from the List of Payment Intents in Stripe.
* Create a refund against the Payment Intent ID.

The refund policy can be found in the file: ./refund_policy.txt

The refund request email can be found in "inbox.txt" file.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above prompt has better results because it breaks down the steps and lists them in order. This &lt;em&gt;significantly&lt;/em&gt; increases reliability – in this case when we tested it, the generated plans were correct 94% of the time – that’s a 12% improvement (percentage points that is 🧐)! But this has two issues. Firstly, there’s still a 6% error rate – not terrible, but not perfect, and secondly, it’s very prescriptive. Instead of giving the agent the autonomy to do the planning for us, we’re pretty much having to program the plan ourselves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enter user-led learning: home in on good plans and let Portia do the rest
&lt;/h3&gt;

&lt;p&gt;With user-led learning, the first thing we did was to optimise how the Planning agent uses example plans. Secondly, we wanted to capture reinforcing signals from end users as they continuously run plans in production, so that the example plans fed back to the Planning agent reflect the latest workflows in a particular context. So we introduced the ability to “like” plans to signal preferred outcomes, and then let the Planning agent pull the most semantically relevant “liked” plans to use as example plans.&lt;/p&gt;

&lt;p&gt;To bring this to life, let’s first simulate the process of a satisfactory &lt;code&gt;Plan&lt;/code&gt; being created and run several times. In a real-world scenario, you would be using Portia Cloud (i.e. have the PORTIA_API_KEY env variable set, such that the default storage class for all your configs is CLOUD). All plans and plan runs generated would be automatically saved to the cloud and accessible in the dashboard. So outside of this exercise, you would just visit the dashboard and “like” your favourite plans as they emerge! For now, we’re going to create a plan using the &lt;a href="https://docs.portialabs.ai/SDK/portia/plan#planbuilder-objects" rel="noopener noreferrer"&gt;&lt;code&gt;PlanBuilder&lt;/code&gt;&lt;/a&gt; and save it to Portia Cloud so we can then like it from the dashboard. Here’s the code (accessible in &lt;a href="https://github.com/portiaAI/portia-agent-examples/blob/main/improving-planning-with-ull/04_ull_create_example_plans.py" rel="noopener noreferrer"&gt;04_ull_create_example_plans.py&lt;/a&gt; in our examples repository on GitHub). Notice the subtle prompt (and therefore plan!) differences we're introducing.&lt;/p&gt;

&lt;h3&gt;
  
  
  04_ull_create_example_plans.py
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;portia.plan&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PlanBuilder&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;common&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;init_portia&lt;/span&gt;

&lt;span class="c1"&gt;# Create example plans for refund processing
&lt;/span&gt;&lt;span class="n"&gt;example_plans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Example 1: Create refund given user email
&lt;/span&gt;&lt;span class="n"&gt;plan1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;PlanBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create a refund for a customer with email john.doe@example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find the customer in Stripe by email john.doe@example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp:stripe:list_customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$customer_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract customer ID from response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract_customer_id_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$customer_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find payment intents for the customer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp:stripe:list_payment_intents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$payment_intents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract payment intent ID from response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract_payment_intent_id_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$payment_intent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$payment_intents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create the refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp:stripe:create_refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$refund_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$payment_intent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;example_plans&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# In the sample code, we add two more example plans here.
&lt;/span&gt;
&lt;span class="n"&gt;portia&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;init_portia&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;example_plans&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;portia&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_plan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Plans saved in Portia cloud storage.

Now you should go to the Portia dashboard and &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;like&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; them.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once these plans have been saved, you can go to the &lt;a href="https://app.portialabs.ai/dashboard/plans" rel="noopener noreferrer"&gt;Plans page&lt;/a&gt; on the Portia dashboard and click on the thumbs up next to the three plans you just created. (They’ll be on the last page of the list of plans.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frsvlh3cl4bnb48rzibs9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frsvlh3cl4bnb48rzibs9.png" alt="A screenshot of the Plans page in the Portia dashboard, showing three plans with green filled thumbs-up icons, showing that they've been approved." width="800" height="100"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Approving plans in the Portia dashboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The next step is to use the &lt;code&gt;portia.storage.get_similar_plans&lt;/code&gt; method to match a user prompt to the preferred plans. Given the semantic similarity between the vague prompt we introduced in this post and the prompts used to create our three preferred plans, we expect &lt;code&gt;get_similar_plans&lt;/code&gt; to retrieve all three of them. Note that this method also lets you tune the similarity threshold and cap the number of similar plans retrieved. To see how this comes together, make sure you have liked the plans created above in the dashboard, then run the code below (accessible in &lt;a href="https://github.com/portiaAI/portia-agent-examples/blob/main/improving-planning-with-ull/05_ull_vague_with_examples.py" rel="noopener noreferrer"&gt;05_ull_vague_with_examples.py&lt;/a&gt; in our examples repository on GitHub).&lt;/p&gt;

&lt;h3&gt;
  
  
  05_ull_vague_with_examples.py
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;common&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;init_portia&lt;/span&gt;

&lt;span class="c1"&gt;# Define the prompts for testing
&lt;/span&gt;&lt;span class="n"&gt;vague_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Read the refund request email from the customer and decide if it should be approved or rejected.
If you think the refund request should be approved, check with a human for final approval and then process the refund.

To process the refund, you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ll need to find the customer in Stripe and then find their payment intent.

The refund policy can be found in the file: ./refund_policy.txt

The refund request email can be found in &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inbox.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; file
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;portia_instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;init_portia&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;example_plans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;portia_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_similar_plans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vague_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;example_plans&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No example plans were found in Portia storage. Did you remember to create and &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;like&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; the plans from the previous step?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;example_plans&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; similar plans were found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;portia_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vague_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;example_plans&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;example_plans&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pretty_print&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
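&lt;p&gt;Under the hood, matching works on semantic similarity between your prompt and the prompts of your “liked” plans. As a rough, self-contained illustration of the retrieval idea only (a toy bag-of-words model, not Portia’s actual implementation; the &lt;code&gt;threshold&lt;/code&gt; and &lt;code&gt;limit&lt;/code&gt; parameter names here are hypothetical):&lt;/p&gt;

```python
# Toy illustration of similarity-based plan retrieval (NOT Portia's actual
# implementation): rank stored plan prompts against a user prompt using
# bag-of-words cosine similarity, keep matches above a threshold, cap the count.
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)  # Counter returns 0 for missing words
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def get_similar_plans_toy(prompt, stored, threshold=0.2, limit=3):
    # Sort stored plan names by similarity to the prompt, most similar first.
    ranked = sorted(stored, key=lambda name: cosine_similarity(prompt, stored[name]), reverse=True)
    return [name for name in ranked if cosine_similarity(prompt, stored[name]) >= threshold][:limit]


stored_plans = {
    "refund_plan": "process a refund request from a customer email using stripe",
    "weather_plan": "fetch tomorrow's weather forecast for london",
}
print(get_similar_plans_toy("read the refund request email and process the refund", stored_plans))
```

&lt;p&gt;A production system would use real semantic embeddings rather than word counts, but the threshold-and-limit shape of the retrieval is the idea being illustrated.&lt;/p&gt;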



&lt;p&gt;To give you a sense of the reliability gains you can achieve with user-led learning, check out the chart below. We were able to increase plan reliability from 82% to 98% with just a single example plan, and two examples were enough to achieve 100% reliability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sbtwjjssma6ja56l53k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sbtwjjssma6ja56l53k.png" alt="A chart showing that zero approved plans results in 82% success rate, whereas two and three approved plans result in 100%, in a test of 50 plans generated." width="738" height="457"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Reliability improves quickly with only a few approved plans.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  In conclusion
&lt;/h2&gt;

&lt;p&gt;User-led learning allows you to bias your Planning agent towards previous plan runs you consider a success. You simply “like” plans you like (doh!) amongst those you have saved in the Portia dashboard. Portia can then fetch the most semantically similar plans to a user prompt and load those as example plans to help steer the Planning agent. Et voilà!&lt;/p&gt;

&lt;p&gt;Give this feature a try and let us know how you find it. And as always, please show us some love if you like our content by giving our SDK a star ⭐️ over on &lt;a href="https://github.com/portiaAI/portia-sdk-python/tree/main/portia/open_source_tools" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Join the conversation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Like this article?&lt;/strong&gt; – &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Give us a ⭐ on GitHub&lt;/a&gt;. It really helps!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse our website and try our (modest) playground at &lt;a href="http://www.portialabs.ai/" rel="noopener noreferrer"&gt;www.portialabs.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Head over to our docs at &lt;a href="https://docs.portialabs.ai/" rel="noopener noreferrer"&gt;docs.portialabs.ai&lt;/a&gt; or get immersed in our &lt;a href="https://github.com/portiaAI/portia-sdk-python/tree/main/portia/open_source_tools" rel="noopener noreferrer"&gt;SDK&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join the conversation on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Watch us embarrass ourselves on our &lt;a href="https://www.youtube.com/@PortiaAI" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://www.producthunt.com/posts/portia-ai" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>planning</category>
      <category>agents</category>
    </item>
    <item>
      <title>More features for your production agent … and a fundraising announcement</title>
      <dc:creator>Portia AI</dc:creator>
      <pubDate>Wed, 16 Apr 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/portia-ai/more-features-for-your-production-agent-and-a-fundraising-announcement-184f</link>
      <guid>https://dev.to/portia-ai/more-features-for-your-production-agent-and-a-fundraising-announcement-184f</guid>
      <description>&lt;p&gt;&lt;a href="https://blog.portialabs.ai/we-are-live" rel="noopener noreferrer"&gt;We came out of stealth a few weeks ago&lt;/a&gt;. Since then we’ve been working with our first few design partners on developing their production agents and have been heads down building out our SDK to solve their problems. &lt;strong&gt;To equip us with enough runway to grow, we’ve also been lucky enough to raise £4.4 million from some of the best investors we could ever hope for: &lt;a href="https://www.generalcatalyst.com/" rel="noopener noreferrer"&gt;General Catalyst&lt;/a&gt; (lead), &lt;a href="https://www.firstminute.capital/" rel="noopener noreferrer"&gt;First Minute Capital&lt;/a&gt;, &lt;a href="https://stemai.vc/" rel="noopener noreferrer"&gt;Stem AI&lt;/a&gt; and some outstanding angel investors 🚀&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this post we want to give you a sense of what’s coming over the next couple of months.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0j5xv3ldy5kz9e7wd5lm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0j5xv3ldy5kz9e7wd5lm.png" alt="A diagram showing an overview of Portia AI's planning and execution stages." width="800" height="528"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Portia AI’s overview&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  If you're new here
&lt;/h2&gt;

&lt;p&gt;Portia AI is an open source SDK with a cloud component, focused on making it easy for developers to build agents in production. Our three pillars are predictability, controllability and authentication. What does this even mean and why does it matter?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Predictability: AI agents are attractive because they leverage LLM reasoning to offer a degree of autonomous decision making. Yet many pilot projects have fallen short of moving to production because they lack preemptive visibility into agents’ behavior. With our Planning agent, developers and/or their end users can pre-express and iterate on the intended course of action of the LLM (&lt;a href="https://docs.portialabs.ai/generate-plan" rel="noopener noreferrer"&gt;Plan&lt;/a&gt;) before execution begins.&lt;/li&gt;
&lt;li&gt;Controllability: Many companies we spoke to are concerned that existing options do not offer the ability to monitor agents’ progress or intervene when needed. These limitations are especially critical in regulated industries such as Financial Services, or in end-user-facing applications where compliance, auditability, and trust are non-negotiable. Portia’s Execution agents update the plan run state (&lt;a href="https://docs.portialabs.ai/run-plan" rel="noopener noreferrer"&gt;&lt;code&gt;PlanRunState&lt;/code&gt;&lt;/a&gt;) as they go and are able to pause execution to solicit human input in a structured interaction called a &lt;a href="https://docs.portialabs.ai/understand-clarifications" rel="noopener noreferrer"&gt;&lt;code&gt;Clarification&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Authentication: Users want to securely authenticate agents into their applications and confine them to a specific scope. We offer a cloud hosted catalogue of &lt;a href="https://docs.portialabs.ai/run-portia-tools" rel="noopener noreferrer"&gt;tools with built-in authentication&lt;/a&gt;. Get the full story from &lt;a href="https://blog.portialabs.ai/we-are-live" rel="noopener noreferrer"&gt;our recent blog post&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Teach our Planning agent new things 🧠&lt;a href="https://blog.portialabs.ai/funding-announcement-april-2025#teach-our-planning-agent-new-things-" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Our early adopters already love that they can supply &lt;strong&gt;example plans&lt;/strong&gt; to the Planning agent. This is a tried and tested approach known as “few-shot prompting”. This week, we released a feature that allows developers to simply “like” plans in the Portia dashboard and then we do the rest. Our Planning agent will retrieve the most relevant “liked” plans from Portia Cloud based on the user prompt and load those in as guidance for the agent. We found that the Planning agent can reliably adapt complex plans from previously completed tasks to a new task, even when the user only provides a high-level request (e.g. “Retrieve additional data to complete this supplier’s KYB missing information”). Using user-led learning, Portia produced an 8-step and even a 16-step data collection plan with 100% reliability. Previously, producing those plans would have been impossible without extensive prompt engineering.&lt;/p&gt;

&lt;p&gt;We will be sharing a step-by-step cookbook shortly if you’re curious to get hands-on with this feature. Make sure you’re signed up to our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; or &lt;a href="https://www.linkedin.com/company/portiaai/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; channel so you don’t miss it.&lt;/p&gt;

&lt;p&gt;Our current design partners have also shared that they may want access to different variants of our Planning agent, with a live A/B testing ability to select the best suited plan for the objective at hand. For example, the Planning agent underpinning some of their more generalised agent use cases should be optimised to handle very large tool sets (~500 tools) while another should be optimised for planning large, multi-step data processing tasks. With a smaller local model deployment, we have been able to reduce tool selection errors by 50% using this approach and are open to trialling this with more partners. Do give us a shout with this &lt;a href="https://tally.so/r/wolZQ5" rel="noopener noreferrer"&gt;contact form&lt;/a&gt; to learn more.&lt;/p&gt;

&lt;h2&gt;
  
  
  We’re even more showtime ready 🕺🏼&lt;a href="https://blog.portialabs.ai/funding-announcement-april-2025#were-even-more-showtime-ready-" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;We’re making it simpler than ever to deploy the Portia SDK for complex tasks in any production environment. We now support all the commonly used models including OpenAI, Anthropic, Mistral, Gemini, Azure OpenAI and Bedrock. You can also wire up your own LLM instance into Portia AI so you can use your preferred local model, such as Llama or DeepSeek, in your own private deployment environment. In Q2 we are introducing the ability to handle very large inputs (e.g. large PDF files, books, etc.) as well as a context-aware approach to handling the mess of API pagination. Stay tuned for more on this (&lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;, &lt;a href="https://www.linkedin.com/company/portiaai/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Elegant auth UX for web agents 🔜&lt;a href="https://blog.portialabs.ai/funding-announcement-april-2025#elegant-auth-ux-for-web-agents-" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Using our clarification construct to handle human:agent interfaces, we are releasing a headless browser agent that can handle seamless handovers to humans during a session whenever a login is needed, before resuming its task. With our solution, end users will be able to enter their login details directly into the website within the browser session: they will never be compelled to share them with an intermediary party. Be the first to get your hands on this one! (&lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;, &lt;a href="https://www.linkedin.com/company/portiaai/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;If you’re looking to build agents in production that behave reliably and can be steered, please give us a whirl and share your thoughts. And if you need a white glove partner to help you deploy them, get in touch with us using &lt;a href="https://tally.so/r/wolZQ5" rel="noopener noreferrer"&gt;this form&lt;/a&gt;. &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Our SDK&lt;/a&gt; is available (give us a star ⭐ ️), you can get hands-on with some examples in our &lt;a href="https://github.com/portiaAI/portia-agent-examples" rel="noopener noreferrer"&gt;examples repo&lt;/a&gt; or check out this short code-along &lt;a href="https://youtu.be/g5qnYCmvXA8?si=LCRwjjOqh_rW9Idx" rel="noopener noreferrer"&gt;intro video&lt;/a&gt; on our YouTube channel.&lt;/p&gt;

&lt;h2&gt;
  
  
  Join the conversation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Like this article?&lt;/strong&gt; – &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Give us a ⭐ on GitHub&lt;/a&gt;. It really helps!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse our website and try our (modest) playground at &lt;a href="http://www.portialabs.ai/" rel="noopener noreferrer"&gt;www.portialabs.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Head over to our docs at &lt;a href="https://docs.portialabs.ai/" rel="noopener noreferrer"&gt;docs.portialabs.ai&lt;/a&gt; or get immersed in our &lt;a href="https://github.com/portiaAI/portia-sdk-python/tree/main/portia/open_source_tools" rel="noopener noreferrer"&gt;SDK&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join the conversation on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Watch us embarrass ourselves on our &lt;a href="https://www.youtube.com/@PortiaAI" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://www.producthunt.com/posts/portia-ai" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>portia</category>
      <category>tools</category>
      <category>funding</category>
      <category>agents</category>
    </item>
    <item>
      <title>Agent-Agent interfaces and Google's new A2A protocol</title>
      <dc:creator>Portia AI</dc:creator>
      <pubDate>Mon, 14 Apr 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/portia-ai/agent-agent-interfaces-and-googles-new-a2a-protocol-1ond</link>
      <guid>https://dev.to/portia-ai/agent-agent-interfaces-and-googles-new-a2a-protocol-1ond</guid>
      <description>&lt;p&gt;This week, Google &lt;a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/#:~:text=The%20A2A%20protocol%20will%20allow,various%20enterprise%20platforms%20or%20applications." rel="noopener noreferrer"&gt;announced&lt;/a&gt; their new Agent-to-Agent protocol, A2A, designed to standardise how AI agents collaborate, even when run by different organisations using different underlying models. Positioned as complementary to MCP – which standardises agent access to external tools – A2A aims to standardise direct agent-agent communication. Google even declared &lt;a href="https://google.github.io/A2A/#/topics/a2a_and_mcp" rel="noopener noreferrer"&gt;A2A ♥️ MCP&lt;/a&gt;, highlighting their vision for synergy between these protocols.&lt;/p&gt;

&lt;p&gt;At Portia, we’ve been thinking about how agents interact with external systems via tools and other agents for some time. You may have even read our post two weeks ago, &lt;a href="https://blog.portialabs.ai/beyond%20apis#5-agent--agent-interfaces" rel="noopener noreferrer"&gt;Software interfaces in the agent era&lt;/a&gt;. We divided the topic of agent integration with external systems into five categories based on increasing complexity, and A2A sits firmly at the top, at the Agent-Agent interface level.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpotaxz220tojwxlelzb1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpotaxz220tojwxlelzb1.png" alt="A diagram showing the increasing complexity going from manual tools to agent-agent communication." width="800" height="647"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Increasing complexity of communication&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Understandably, some of the reaction to A2A has been that it isn’t clear whether it is needed (and if it is, whether it is needed yet) and &lt;a href="https://x.com/jerryjliu0/status/1910014927521341801" rel="noopener noreferrer"&gt;how it fits together with MCP&lt;/a&gt;. In particular, with tools and agents ultimately both being a way to get a task done and facing many of the same challenges (discovery, task definition, input / output definition, auth etc.), some people are questioning whether we need another protocol on top of MCP, or whether it is enough to just wrap agents in tools. We’ve been diving into A2A over the last couple of days and wanted to share our thoughts on these topics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agents vs Tools&lt;a href="https://blog.portialabs.ai/agent-agent-a2a-vs-mcp#agents-vs-tools" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Both agents and tools are mechanisms for achieving tasks, but they generally differ significantly in complexity, autonomy, and interaction patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task Definition:&lt;/strong&gt; Tools handle narrow, clearly defined tasks; agents handle broad, higher-level, open-ended goals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomy:&lt;/strong&gt; Tools just do what they’ve been programmed to do; agents act autonomously, breaking down goals and seeking additional info if needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; Tools take structured input; agents understand natural language.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single Step vs Multi Step:&lt;/strong&gt; Tools are generally single-shot, with a call either returning outputs or an error. Conversely, agents break a task down and work through it in multiple steps. This may involve the agent proactively reaching out to collect more information for the task, or even asking the user some clarifying questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State:&lt;/strong&gt; Tools are stateless; agents can build context over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Length:&lt;/strong&gt; Tools generally run quickly, with most APIs returning in less than a second; agents may work over minutes, hours, or days.&lt;/li&gt;
&lt;/ul&gt;
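&lt;p&gt;To make the contrast concrete, here is a minimal sketch (all names hypothetical, not from any specific framework): the tool is a stateless function over structured inputs, while the agent takes a natural-language goal, breaks it into steps, and accumulates state as it goes:&lt;/p&gt;

```python
# Hypothetical sketch contrasting a tool and an agent (illustrative names only).
from dataclasses import dataclass, field


# A tool: stateless, structured input, single-shot; it returns or it errors.
def currency_convert_tool(amount: float, rate: float) -> float:
    return amount * rate


# An agent: takes a natural-language goal, "plans" it into steps, and builds
# up context (state) as it works through them over multiple steps.
@dataclass
class MiniAgent:
    context: list = field(default_factory=list)

    def run(self, goal: str) -> list:
        steps = [s.strip() for s in goal.split(",")]  # naive stand-in for planning
        for step in steps:
            self.context.append(f"done: {step}")  # state accumulates over time
        return self.context


print(currency_convert_tool(100.0, 1.25))  # single call, single structured result
agent = MiniAgent()
print(agent.run("find restaurants, pick one, book a table"))
```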

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8ezd35w0pziioxbb0lg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8ezd35w0pziioxbb0lg.png" alt="A diagram showing the increasing complexity going from manual tools to agent-agent communication." width="800" height="489"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The grey area between agent-agent and agent-tool communication&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;While this can seem a clear and natural divide, our work at Portia shows there's often a grey area between them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://browser-use.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Browser Use&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;:&lt;/strong&gt; Though agent-like in behavior (e.g., autonomous navigation via natural language), we’ve had success using browser tools in a single-turn, tool-like way to retrieve structured data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep Research:&lt;/strong&gt; Some implementations behave like slower search tools, others like full agents asking clarifying questions. Sometimes the same implementation can display both behaviors, depending on the query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic Tools:&lt;/strong&gt; Tools can show agent-like traits: holding state (e.g., counters), running long processes (e.g., ML training), or even handling tasks that might require an agent in more complex scenarios (e.g. document retrieval).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single vs Multi step:&lt;/strong&gt; Even the clearest distinction – single vs. multi-step interaction – isn’t absolute. Just as agents ask for additional information, tools throw errors detailing the info they need. Often the loop needed to handle both is the same.&lt;/li&gt;
&lt;/ul&gt;
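&lt;p&gt;That last point can be sketched as a single driver loop (hypothetical names): whether the callee is a strict tool raising an error that names the missing input, or an agent asking a clarifying question, the caller handles both the same way:&lt;/p&gt;

```python
# Hypothetical sketch: one loop serves both "tool raises an error naming the
# missing input" and "agent asks a clarifying question" (names are illustrative).
class NeedsInput(Exception):
    def __init__(self, field_name: str):
        super().__init__(f"missing input: {field_name}")
        self.field_name = field_name


def book_table(inputs: dict) -> str:
    # Could be a strict tool or an agent mid-task; both signal missing info
    # the same way here, by raising a request for a named input.
    for required in ("restaurant", "time"):
        if required not in inputs:
            raise NeedsInput(required)
    return f"booked {inputs['restaurant']} at {inputs['time']}"


def drive(callee, inputs: dict, ask_user) -> str:
    # The caller's loop is identical either way: run, catch the request,
    # supply the missing info, retry until the task completes.
    while True:
        try:
            return callee(inputs)
        except NeedsInput as request:
            inputs[request.field_name] = ask_user(request.field_name)


answers = {"restaurant": "Luigi's", "time": "7pm"}
print(drive(book_table, {}, lambda field_name: answers[field_name]))
```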

&lt;p&gt;With the distinction between the categories quite blurry, it certainly adds complexity to the ecosystem if you need different protocols for the different sides of the spectrum.&lt;/p&gt;

&lt;h2&gt;
  
  
  A2A &amp;amp; MCP&lt;a href="https://blog.portialabs.ai/agent-agent-a2a-vs-mcp#a2a--mcp" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;A2A (Agent-to-Agent) is a protocol designed for enabling autonomous agents to communicate, discover each other, and collaborate on tasks. Positioned at the agentic end of the spectrum, A2A focuses on agents that can take higher-level responsibility for executing tasks, compared to MCP which is more tool-oriented.&lt;/p&gt;

&lt;p&gt;To demonstrate the difference, imagine booking a dinner using an agent. With MCP, a restaurant booking platform might expose tools such as ‘find_restaurants’ or ‘book_restaurant’. My agent must then use these tools to achieve the goal of organising dinner.&lt;/p&gt;

&lt;p&gt;Conversely, with A2A, the restaurant booking platform provides an agent with a skill for finding and booking restaurants – a concept deliberately looser than a tool. The remote agent will then take control of the full task, including tracking its state and deciding when to communicate with my local agent and when to mark the task as complete.&lt;/p&gt;
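&lt;p&gt;The split in responsibility can be sketched conceptually. This is illustrative Python showing who holds the control loop under each model, not the actual MCP or A2A wire formats:&lt;/p&gt;

```python
# Conceptual sketch only: who drives the task under each model.
# These are NOT the real MCP or A2A protocols; all names are illustrative.
RESTAURANTS = {"Luigi's": ["7pm", "8pm"]}


# MCP-style: the platform exposes narrow tools; MY agent orchestrates each step.
def find_restaurants(cuisine: str) -> list:
    return list(RESTAURANTS)


def book_restaurant(name: str, time: str) -> str:
    return f"confirmed {name} {time}"


def my_agent_with_mcp(goal: str) -> str:
    options = find_restaurants("italian")  # my agent decides every step
    return book_restaurant(options[0], RESTAURANTS[options[0]][0])


# A2A-style: I hand the whole goal to a remote agent, which owns the task
# state and decides when the task is complete.
class RemoteBookingAgent:
    def handle_task(self, goal: str) -> dict:
        name = next(iter(RESTAURANTS))
        return {"state": "completed", "result": f"confirmed {name} {RESTAURANTS[name][0]}"}


def my_agent_with_a2a(goal: str) -> str:
    task = RemoteBookingAgent().handle_task(goal)  # one high-level delegation
    return task["result"]


print(my_agent_with_mcp("book dinner"))
print(my_agent_with_a2a("book dinner"))
```

&lt;p&gt;With MCP, my agent makes every decision between narrow tool calls; with A2A, the remote agent owns the task state and simply reports completion.&lt;/p&gt;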

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dwyj8f41n4k9gbkmkq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dwyj8f41n4k9gbkmkq3.png" alt="Communication between an agent and tools using the MCP protocol." width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Restaurant reservations with MCP&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs6gkqn2qad7il37odckh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs6gkqn2qad7il37odckh.png" alt="Communication between two agents using A2A." width="800" height="449"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Restaurant reservations with A2A&lt;/em&gt;&lt;/p&gt;
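&lt;p&gt;The contrast can also be sketched in code. All names below are hypothetical illustrations, not part of any real MCP or A2A SDK: with MCP, the local agent orchestrates discrete tool calls itself; with A2A, it hands the whole task to a remote agent that owns the state and the notion of completion.&lt;/p&gt;

```python
# MCP style: the platform exposes fine-grained tools; the local agent
# decides how to chain them to reach the goal.
def find_restaurants(cuisine: str) -> list[str]:
    return ["Trattoria Uno", "Trattoria Due"] if cuisine == "italian" else []

def book_restaurant(name: str, party_size: int) -> str:
    return f"booked {name} for {party_size}"

def local_agent_books_dinner(cuisine: str, party_size: int) -> str:
    # The local agent holds the orchestration logic.
    options = find_restaurants(cuisine)
    if not options:
        return "no restaurants found"
    return book_restaurant(options[0], party_size)

# A2A style: the platform exposes a remote agent with a loose "skill";
# the local agent delegates the whole goal and receives an artifact.
class RemoteBookingAgent:
    def submit_task(self, goal: str) -> dict:
        # The remote agent owns state and decides when the task is done.
        return {"state": "completed", "artifact": f"reservation for: {goal}"}

print(local_agent_books_dinner("italian", 4))
print(RemoteBookingAgent().submit_task("dinner for 4, italian"))
```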

&lt;p&gt;To dive a bit deeper, let’s take a look at the core components of A2A:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent Description:&lt;/strong&gt; A2A agents have JSON "agent cards" outlining skills, auth methods, and input/output formats. These are higher-level and less structured than MCP's task-focused tool descriptions.

&lt;ul&gt;
&lt;li&gt;As the skills description within A2A is deliberately more vague, it will be interesting to see how people handle defining the boundary around what a particular agent can and can’t do.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc134l3jk9gvc7opvphpz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc134l3jk9gvc7opvphpz.png" alt="JSON showing an A2A agent card." width="800" height="993"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A2A Agent card&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent Discovery:&lt;/strong&gt; Agents can be discovered via a well-known URL (&lt;code&gt;/.well-known/agent.json&lt;/code&gt;). Registries are likely to be added, similar to MCP’s tool registries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-step Interactions:&lt;/strong&gt; A2A supports long-running tasks through multi-message exchanges, allowing agents to schedule, negotiate, and send progress updates. MCP does not yet support this richness of multi-message exchange.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline Handling:&lt;/strong&gt; As tasks are long-running and agents may not have a session open for the full duration, A2A supports push notifications that the client can receive later (e.g. for task updates). This is not supported natively in MCP.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth:&lt;/strong&gt; A2A supports all OpenAPI auth schemes (e.g., API keys, OAuth2, JWTs), offering more flexibility than MCP’s OAuth2-only approach. However, this also means that my local agent needs to handle all of these auth schemes too.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outputs:&lt;/strong&gt; Task results are called "artifacts". These are the equivalent of MCP’s tool outputs, with the key difference that artifacts are split into parts by default.&lt;/li&gt;
&lt;/ul&gt;
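&lt;p&gt;To make the agent card concrete, here is a minimal sketch for the restaurant example. The field names loosely follow published A2A examples but should be treated as an illustration rather than the authoritative schema:&lt;/p&gt;

```python
import json

# A hypothetical A2A agent card, reduced to the fields discussed above
# (skills, auth methods, input/output formats).
agent_card = {
    "name": "restaurant-booking-agent",
    "url": "https://bookings.example.com/a2a",
    "authentication": {"schemes": ["oauth2"]},
    "defaultInputModes": ["text"],
    "defaultOutputModes": ["text"],
    "skills": [
        {
            "id": "find-and-book",
            "name": "Find and book restaurants",
            "description": "Finds restaurants matching a request and books one.",
        }
    ],
}

# A client that discovered this card might run a basic check before
# delegating a task to the agent:
def has_skill(card: dict, skill_id: str) -> bool:
    return any(s["id"] == skill_id for s in card.get("skills", []))

print(json.dumps(agent_card, indent=2))
print(has_skill(agent_card, "find-and-book"))
```

Note how little the skill description constrains the agent compared with an MCP tool's input schema: the boundary of what the agent can do is prose, not a contract.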

&lt;h3&gt;
  
  
  MCP vs A2A: Our Predictions&lt;a href="https://blog.portialabs.ai/agent-agent-a2a-vs-mcp#mcp-vs-a2a-our-predictions" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;User POV&lt;/th&gt;
&lt;th&gt;Agent::tools&lt;/th&gt;
&lt;th&gt;Agent::MCP servers&lt;/th&gt;
&lt;th&gt;Agent::Agent (A2A)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Capability discovery and selection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔴 Tools have to be manually added to agent. Selection is limited to LLM’s ability to cope with large tool sets.&lt;/td&gt;
&lt;td&gt;🟡 Tools are automatically discovered through MCP. Selection is still limited to LLM’s ability to cope with large tool sets.&lt;/td&gt;
&lt;td&gt;🟡 Agent capabilities are advertised through their agent card. Once registries have been added to the protocol, agents will be discovered through a registry.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ease of interface to other systems&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔴 Developer has to understand the other system’s API in order to manually write / select tools.&lt;/td&gt;
&lt;td&gt;🟡 Agent calls MCP tools with a single-step interaction. My agent needs to understand the external system to determine how to chain tool calls together to achieve a goal.&lt;/td&gt;
&lt;td&gt;🟢 My agent connects to a remote agent to access other systems’ capabilities. The remote agent determines how to use these capabilities to achieve my goal.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🔴 Developer has to implement their own auth on tools for the agent to use&lt;/td&gt;
&lt;td&gt;🟡 Auth (based on OAuth2) has recently been released, though is yet to be widely adopted&lt;/td&gt;
&lt;td&gt;🟡 Launches with all auth schemes supported by OpenAPI, though this means my agent will need to support whichever auth scheme is supported by the remote agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task control &amp;amp; completion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🟢 My agent has full control over how the task is performed, including deciding when it is complete. All output from the other system is accessible to my agent via tools, and my agent determines what information is retained during and after runtime&lt;/td&gt;
&lt;td&gt;🟢 As with simple tool usage, my agent has full control over the task, its completion, the output of tools and how information is retained.&lt;/td&gt;
&lt;td&gt;🟡 My agent relies on controlling the remote agent through negotiation and relies on the information sharing and retention decisions of the remote agent. It also relies on the remote agent to decide when a task is deemed complete or when further input is needed.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A summary of the various communication schemes between agents and tools. Traffic light symbols give an assessment of how well each technology solves the user problem.&lt;/p&gt;

&lt;p&gt;As discussed in our previous blog post (&lt;a href="https://blog.portialabs.ai/beyond%20apis" rel="noopener noreferrer"&gt;Beyond APIs&lt;/a&gt;), we believe agent-to-agent communication will eventually become widespread. This communication will be interactive, multi-turn and goal-oriented, rather than utilising single-shot, transactional, rigid APIs and tools that are common now. At Portia, we’ve built our &lt;a href="https://docs.portialabs.ai/understand-clarifications" rel="noopener noreferrer"&gt;clarifications architecture&lt;/a&gt; to handle this and it’s exciting to see the ecosystem progressing in this direction.&lt;/p&gt;

&lt;p&gt;However, we do not foresee A2A seeing the same rapid adoption as MCP. MCP addressed a clear, mainstream problem: enabling agents to interact with APIs. Agent builders wanted to move beyond simple chat or RAG systems without building custom tools for every API, while API providers wanted to support agents without adapting to every agent framework. MCP elegantly solved this MxN problem by allowing providers to repackage their existing APIs and documentation into an MCP server easily.&lt;/p&gt;

&lt;p&gt;In contrast, agent-to-agent communication hasn’t yet become mainstream and is significantly more complex. Deploying an agent in front of an API introduces challenges like managing ambiguous multi-turn requests, maintaining state, handling offline clients, and gracefully resolving cascading errors. Additionally, because the distinction between tools and agents is not clear cut, having separate protocols for the two adds complexity to the ecosystem.&lt;/p&gt;

&lt;p&gt;Therefore, in the short term we expect companies to continue to focus on MCP, and we expect to see growing usage of agents within tools. We then expect MCP to evolve to handle these ‘agents in tools’ use cases more natively and elegantly, leading to MCP covering the whole ‘tool-agent’ spectrum.&lt;/p&gt;

&lt;h2&gt;
  
  
  Join the conversation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Like this article?&lt;/strong&gt; – &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Give us a ⭐ on GitHub&lt;/a&gt;. It really helps!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse our website and try our (modest) playground at &lt;a href="http://www.portialabs.ai/" rel="noopener noreferrer"&gt;www.portialabs.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Head over to our docs at &lt;a href="https://docs.portialabs.ai/" rel="noopener noreferrer"&gt;docs.portialabs.ai&lt;/a&gt; or get immersed in our &lt;a href="https://github.com/portiaAI/portia-sdk-python/tree/main/portia/open_source_tools" rel="noopener noreferrer"&gt;SDK&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join the conversation on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Watch us embarrass ourselves on our &lt;a href="https://www.youtube.com/@PortiaAI" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://www.producthunt.com/posts/portia-ai" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>portia</category>
      <category>mcp</category>
      <category>api</category>
      <category>tools</category>
    </item>
    <item>
      <title>Build a refund agent with Portia AI and Stripe's MCP server</title>
      <dc:creator>Portia AI</dc:creator>
      <pubDate>Thu, 20 Mar 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/portia-ai/build-a-refund-agent-with-portia-ai-and-stripes-mcp-server-19f3</link>
      <guid>https://dev.to/portia-ai/build-a-refund-agent-with-portia-ai-and-stripes-mcp-server-19f3</guid>
      <description>&lt;p&gt;Anthropic open sourced its &lt;a href="https://www.anthropic.com/news/model-context-protocol" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt;, or MCP for short, at the end of last year. The protocol is picking up steam as the go-to way to standardise the interface between agent frameworks and apps / data sources, with the &lt;a href="https://github.com/modelcontextprotocol/servers?tab=readme-ov-file#%EF%B8%8F-official-integrations" rel="noopener noreferrer"&gt;list of official MCP server implementations&lt;/a&gt; growing rapidly. Our early users have already asked for an easy way to expose tools from an MCP server to a Portia client so we just released support for MCP servers in our SDK ⭐️.&lt;/p&gt;

&lt;p&gt;In this blog post we show how you can combine the power of Portia AI’s abstractions with any tool set from an MCP server to create unique agent workflows. The example we go over is accessible in our agent examples repository &lt;a href="https://github.com/portiaAI/portia-agent-examples/tree/main/refund-agent-mcp" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connect to MCP servers with the Portia SDK&lt;a href="https://blog.portialabs.ai/portia-mcp-stripe-example#connect-to-any-mcp-server-using-portia-ai%E2%80%99s-sdk" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Connecting to an MCP server allows you to load all tools from that server into a &lt;code&gt;ToolRegistry&lt;/code&gt; subclass called an &lt;code&gt;McpToolRegistry&lt;/code&gt;, which you can then combine with any other tools you offer to Portia’s planning and execution agents.&lt;/p&gt;

&lt;p&gt;We allow developers to load tools from MCP servers into an &lt;code&gt;McpToolRegistry&lt;/code&gt; using the two commonly available methods today (to find out more about these options, see the &lt;a href="https://modelcontextprotocol.io/docs/concepts/transports" rel="noopener noreferrer"&gt;official MCP docs&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;stdio (Standard Input / Output):&lt;/strong&gt; The server kicks off as a subprocess of the Python process where your Portia client is running. Portia’s SDK only requires you to provide a server name and a command with args to spin up the subprocess, giving you the flexibility to integrate any server written in any language using any execution mechanism. This method is useful for local prototyping, e.g. you can load a local MCP server repo, kick off a process and interact with its tools in no time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sse (Server-Sent Events):&lt;/strong&gt; The server is accessible over HTTP. This could be a locally or remotely deployed server. We just need to specify the server name and URL for the Portia SDK to interact with it.&lt;/li&gt;
&lt;/ul&gt;
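&lt;p&gt;The stdio transport can be illustrated with a stripped-down sketch: a child process that exchanges one JSON message per line over stdin/stdout. Real MCP servers speak JSON-RPC and start with an initialize handshake; this only shows the subprocess-over-stdio shape, with a made-up message format:&lt;/p&gt;

```python
import json
import subprocess
import sys

# The "server": reads one JSON request per line, writes one JSON
# response per line. In real MCP this would be a full JSON-RPC server.
SERVER_CODE = r"""
import json, sys
for line in sys.stdin:
    req = json.loads(line)
    if req.get("method") == "tools/list":
        out = {"tools": [{"name": "find_restaurants"}]}
    else:
        out = {"error": "unknown method"}
    print(json.dumps(out), flush=True)
"""

# The "client" spins up the server as a subprocess, exactly as a stdio
# MCP connection does, and asks it what tools it offers.
proc = subprocess.Popen(
    [sys.executable, "-c", SERVER_CODE],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
proc.stdin.write(json.dumps({"method": "tools/list"}) + "\n")
proc.stdin.flush()
response = json.loads(proc.stdout.readline())
proc.stdin.close()
proc.wait()
print(response)
```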

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx40h0xoy34ltokzg2nzn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx40h0xoy34ltokzg2nzn.png" alt="Standard Input / Output" width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F360sxovsxvpp93zye5i8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F360sxovsxvpp93zye5i8.png" alt="Server-Sent Events" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our Stripe example, once you provide the NPX command, Portia’s SDK takes over and manages everything for you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We will spin up the &lt;a href="https://github.com/stripe/agent-toolkit/tree/main/modelcontextprotocol" rel="noopener noreferrer"&gt;Stripe agent toolkit MCP&lt;/a&gt; server locally using the NPX command args&lt;/li&gt;
&lt;li&gt;Our built-in MCP client, which uses the official &lt;a href="https://github.com/modelcontextprotocol/python-sdk" rel="noopener noreferrer"&gt;MCP python SDK&lt;/a&gt; under the hood, will query the Stripe MCP server to understand what tools it provides, and make these available to your Planner and Execution Agents.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can then extract the tools and automatically convert them to Portia &lt;a href="https://docs.portialabs.ai/intro-to-tools" rel="noopener noreferrer"&gt;&lt;code&gt;Tool&lt;/code&gt;&lt;/a&gt; objects using a stdio connection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;stripe_mcp_registry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;McpToolRegistry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_stdio_connection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;server_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stripe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@stripe/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--tools=all&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--api-key=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;STRIPE_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Use a clarification to loop in a human&lt;a href="https://blog.portialabs.ai/portia-mcp-stripe-example#use-a-clarification-to-loop-in-a-human" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Most use cases we’ve seen out there don't leverage the refund tool from Stripe’s MCP server because it is high risk for an agent to act on. With Portia’s clarifications, we can ensure the agent pauses the plan run and solicits human approval before the refund is processed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Clarifications: A brief recap
&lt;/h2&gt;

&lt;p&gt;During agentic workflows, there may be tasks where your organisation's policies require explicit approvals from specific people, e.g. allowing bank transfers over a certain amount. Clarifications allow you to define these conditions so the agent running a particular step knows when to pause the plan run and solicit input in line with your policies. When Portia encounters a clarification and pauses a plan run, it serialises and saves the latest plan run state. Once the clarification is resolved, the human input captured during clarification handling is added to the plan run state and the agent can resume step execution.&lt;br&gt;&lt;br&gt;
For more on clarifications, visit our &lt;a href="https://docs.portialabs.ai/understand-clarifications" rel="noopener noreferrer"&gt;docs&lt;/a&gt;.&lt;/p&gt;
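&lt;p&gt;The pause / serialise / resume cycle described above can be sketched as follows. The class and field names here are illustrative only, not the Portia SDK:&lt;/p&gt;

```python
import json

class PlanRun:
    def __init__(self, steps):
        self.steps = steps
        self.index = 0
        self.state = "RUNNING"
        self.clarification = None

    def run(self):
        while self.index != len(self.steps):
            step = self.steps[self.index]
            if step.get("needs_approval") and step.get("approved") is None:
                # Pause: record what input is needed and serialise state.
                self.state = "NEED_CLARIFICATION"
                self.clarification = f"Approve step: {step['task']}?"
                return json.dumps(self.__dict__)
            self.index += 1
        self.state = "COMPLETE"
        return None

    def resolve(self, approved: bool):
        # Human input is written back into the saved state before resuming.
        self.steps[self.index]["approved"] = approved
        self.state = "RUNNING"
        self.clarification = None

run = PlanRun([
    {"task": "review refund"},
    {"task": "process refund", "needs_approval": True},
])
saved = run.run()            # pauses and returns serialised state
run.resolve(approved=True)   # human approves the clarification
run.run()                    # resumes to completion
print(run.state)
```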

&lt;p&gt;For our refund example, we want the following to happen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Load a refund policy document and check the transaction details against it.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reject the request if not within the policy and fail the plan run&lt;/li&gt;
&lt;li&gt;Else make a recommendation to approve, along with rationale&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;→ The &lt;code&gt;RefundReviewerTool&lt;/code&gt; offers this functionality&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Given the context for the agent’s refund approval recommendation,&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Solicit human approval and&lt;/li&gt;
&lt;li&gt;Reject the request and fail the plan run if the human did not approve it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;→ The &lt;code&gt;RefundHumanApprovalTool&lt;/code&gt; offers this functionality&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the plan run passes the previous step successfully, create a refund using the appropriate tool loaded from Stripe’s MCP server.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
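&lt;p&gt;As a rough illustration of the decision shape the &lt;code&gt;RefundReviewerTool&lt;/code&gt; produces: the real tool has an LLM judge the request against the policy document, whereas this sketch hard-codes two example policy rules.&lt;/p&gt;

```python
# Hypothetical policy check: returns a recommendation plus rationale,
# mirroring the approve-or-reject-with-reason output described above.
def review_refund(request: dict, policy: dict) -> dict:
    if request["days_since_purchase"] > policy["window_days"]:
        return {"approve": False, "rationale": "outside the refund window"}
    if request["amount"] > policy["max_amount"]:
        return {"approve": False, "rationale": "amount above policy limit"}
    return {"approve": True, "rationale": "request within policy"}

policy = {"window_days": 30, "max_amount": 500}
print(review_refund({"days_since_purchase": 10, "amount": 120}, policy))
print(review_refund({"days_since_purchase": 45, "amount": 120}, policy))
```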
&lt;h2&gt;
  
  
  Bringing it all together&lt;a href="https://blog.portialabs.ai/portia-mcp-stripe-example#bringing-it-all-together" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;With Portia AI, you don’t need to create individual agents and point them explicitly at each step in the above process or at the required tools. Our planning agent will do exactly that for you. All you need to do is to prompt it using the &lt;code&gt;plan&lt;/code&gt; method (the more detailed the prompt, the more reliable it will be) and share a superset of tools with it. In the &lt;code&gt;refund_agent.py&lt;/code&gt; code, we demonstrate the power of our planning agent by passing our &lt;code&gt;Portia&lt;/code&gt; client all the tools from the Stripe MCP tool registry, the two local tools described above (&lt;code&gt;RefundReviewerTool&lt;/code&gt; and &lt;code&gt;RefundHumanApprovalTool&lt;/code&gt;) and Portia’s catalogue of cloud tools (&lt;code&gt;DefaultToolRegistry&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;portia&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Portia&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;stripe_mcp_registry&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;InMemoryToolRegistry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_local_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="nc"&gt;RefundReviewerTool&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="nc"&gt;RefundHumanApprovalTool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nc"&gt;DefaultRegistry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;portia&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Read the refund request email from the customer and decide if it should be approved or rejected.
    If you think the refund request should be approved, check with a human for final approval and then process the refund.

    Stripe instructions:
    * Customers can be found in Stripe using their email address.
    * The payment can be found against the Customer.
    * Refunds can be processed by creating a refund against the payment.

    The refund policy can be found in the file: ./refund_policy.txt

    The refund request email is as follows:

    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customer_email&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here you have the option of implementing an end user feedback loop to refine your plan before running it. We demonstrate this with the scheduling agent example &lt;a href="https://github.com/portiaAI/portia-agent-examples/blob/main/get_started_google_tools/README.md" rel="noopener noreferrer"&gt;here&lt;/a&gt;. You also have the option of providing example plans to the Portia planning agent for added reliability e.g. if you want this refund process to always follow the same set of steps. The above code will produce a plan in line with the instructions we outlined in the prompt and include the relevant tools automatically. Below is an abridged version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Read the refund policy from the file to understand the conditions for a valid refund."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"tool_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"file_reader_tool"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Review the refund request email against the refund policy to decide if the refund should be approved or rejected."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"tool_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"refund_reviewer_tool"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Request human approval for processing the refund if the review indicates approval."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"tool_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"human_approval_tool"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"If the refund is approved by the human reviewer, locate the customer in Stripe using their email address."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"tool_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp:stripe:list_customers"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Retrieve the payment intent associated with the found customer to identify the payment to refund."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"tool_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp:stripe:list_payment_intents"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Process the refund by creating a refund for the identified payment intent."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"tool_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mcp:stripe:create_refund"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And finally running this plan will allow you to test how it all comes together. Here’s a snazzy snappify animation of the PlanRunState across all steps of the plan run. Note how a clarification is raised and then approved by a human before the refund creation agent executes the final step.&lt;/p&gt;

&lt;h3&gt;
  
  
  On our roadmap: Supporting conditionals
&lt;/h3&gt;

&lt;p&gt;Offering conditionals means you won’t need to hardcode the if / else logic within the &lt;code&gt;RefundHumanApprovalTool&lt;/code&gt; definition: The Portia planning agent will be able to add conditions against the create refund step and insert a separate step to fail the plan conditioned on the human rejecting the refund. We already built this and are in the process of tuning it in staging before it’s ready for show time. Watch this space 👀!&lt;/p&gt;

&lt;h2&gt;
  
  
  Our reflections on working with MCP servers&lt;a href="https://blog.portialabs.ai/portia-mcp-stripe-example#our-reflections-on-working-with-mcp-servers" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Here's what we learned experimenting with MCP servers so far:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most MCP servers make use of the tool primitives, but not the prompts or resources that are also supported in the MCP spec. It is not yet clear what a best-in-class implementation of those would look like.&lt;/li&gt;
&lt;li&gt;The power of a standardised protocol is real! In our Stripe refund agent example we seamlessly integrate Stripe tools provided by their &lt;em&gt;Javascript&lt;/em&gt; MCP server into a Python Portia Agent.&lt;/li&gt;
&lt;li&gt;Provided the app owner who publishes an MCP server is maintaining it, MCP servers can be a powerful and pain-free way of discovering and loading tools into your AI app.&lt;/li&gt;
&lt;li&gt;The limitation is that you are beholden to the MCP server owner’s tool definition, and those can vary in quality (e.g. tool and / or args description does not offer enough guidance for an LLM to invoke the tool reliably at scale). This can be a particular problem with community provided servers, so make sure you check out the quality of the tool definitions.&lt;/li&gt;
&lt;li&gt;The MCP specification does not include output schema for tools, and many MCP tool descriptions do not describe exactly what the tool returns. This can create challenges for the Agent using the tool.&lt;/li&gt;
&lt;li&gt;We need an MCP discovery service that allows an LLM to discover MCP servers from an app owner and load the details to connect to them (an MCP DNS server if you’re into acronym salads 🥗). The MCP folks have a &lt;a href="https://github.com/modelcontextprotocol/specification/discussions/69" rel="noopener noreferrer"&gt;registry concept&lt;/a&gt; in the works to address this issue.&lt;/li&gt;
&lt;li&gt;Tool auth is a challenge for many people looking to deploy Agents in the real world - we’re &lt;a href="https://blog.portialabs.ai/agent-auth-part-I" rel="noopener noreferrer"&gt;well aware of that&lt;/a&gt;. MCP servers today are generally run locally with credentials such as API keys provided at start-up. Again, the MCP specification is moving quickly and a draft for Auth support has been &lt;a href="https://spec.modelcontextprotocol.io/specification/draft/basic/authorization/" rel="noopener noreferrer"&gt;published&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
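&lt;p&gt;On the missing output schema point: one workaround is for the client to validate tool results against its own expectations. The convention below is our own sketch, not part of the MCP spec, and the example result fields are illustrative rather than Stripe's actual response shape:&lt;/p&gt;

```python
# Check a tool result against a locally-maintained set of required keys,
# failing loudly instead of letting a malformed result flow downstream.
def validate_output(result: dict, expected_keys: set[str]) -> dict:
    missing = expected_keys - result.keys()
    if missing:
        raise ValueError(f"tool output missing keys: {sorted(missing)}")
    return result

# e.g. we might expect a refund-creation tool to return at least these:
refund_result = {"id": "re_123", "status": "succeeded", "amount": 1200}
validate_output(refund_result, {"id", "status"})
print("output ok")
```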

&lt;h2&gt;
  
  
  Join the conversation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Like this article?&lt;/strong&gt; – &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Give us a ⭐ on GitHub&lt;/a&gt;. It really helps!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse our website and try our (modest) playground at &lt;a href="http://www.portialabs.ai/" rel="noopener noreferrer"&gt;www.portialabs.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Head over to our docs at &lt;a href="https://docs.portialabs.ai/" rel="noopener noreferrer"&gt;docs.portialabs.ai&lt;/a&gt; or get immersed in our &lt;a href="https://github.com/portiaAI/portia-sdk-python/tree/main/portia/open_source_tools" rel="noopener noreferrer"&gt;SDK&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join the conversation on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Watch us embarrass ourselves on our &lt;a href="https://www.youtube.com/@PortiaAI" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://www.producthunt.com/posts/portia-ai" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>portia</category>
      <category>mcp</category>
      <category>stripe</category>
    </item>
    <item>
      <title>Seamless human agent interactions with just-in-time authorization</title>
      <dc:creator>Portia AI</dc:creator>
      <pubDate>Thu, 13 Mar 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/portia-ai/seamless-human-agent-interactions-with-just-in-time-authorization-4cbi</link>
      <guid>https://dev.to/portia-ai/seamless-human-agent-interactions-with-just-in-time-authorization-4cbi</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/portia-ai/why-authentication-is-a-challenge-for-ai-agents-88n"&gt;part 1 of this series&lt;/a&gt;, we established why there is a need for a &lt;em&gt;Just-In-Time (JIT) authorization system&lt;/em&gt;, whereby an agent has the ability to authorize itself only at the point where it is very likely that they will 1/ need that authorization and 2/ that they are clear what they will use it for. In this section, we’ll look at how we have done this at Portia AI.&lt;/p&gt;

&lt;p&gt;A tenet of agentic systems is that they are designed to operate autonomously, but JIT auth requires interrupting the agent so that it can solicit human input.&lt;/p&gt;

&lt;p&gt;In reality, we think it’s becoming increasingly obvious that seamless hand-off back and forth between agents and humans collaborating on a task needs to be a well-supported expectation, and yet most agentic frameworks make this hard to do. In Portia, we refer to these agent-to-human requests as ‘clarifications’.&lt;/p&gt;

&lt;p&gt;If you’ve written an agent before, you’ve probably experienced an agent death loop – where the agent gets itself stuck and continually retries until you cancel the operation (or it hits its maximum retries).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk319rhjrgbin11ejyxqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk319rhjrgbin11ejyxqy.png" alt="Traditional agent architecture with reflection" width="662" height="903"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For Just-In-Time auth, we want to accomplish a few things. Firstly, many agentic systems perceive a task as incomplete if they encounter a requirement for the end user to complete authentication. The agent then attempts retries – each retry hits the same authentication requirement, and the agent enters a death spiral. Sigh. We’ll refer to this problem as the ‘human-agent &lt;em&gt;short circuit&lt;/em&gt;’ problem.&lt;/p&gt;

&lt;p&gt;The second issue arises if your agentic system supports authorization within the flow of an agent: the end user needs to perform the actual authentication, most typically by clicking a link. This kicks off a somewhat complicated handshake to retrieve the authorization token, after which the agent needs to be notified and resume its task from where it left off. We’ll refer to this as the ‘human-agent &lt;em&gt;hand-off&lt;/em&gt;’ problem.&lt;/p&gt;

&lt;p&gt;The third problem is &lt;em&gt;almost&lt;/em&gt; trivial in the grand scheme of things. OAuth links are kinda long and ugly, but most agentic frameworks expect to hand things back to users in natural language. This means that a user would be presented with something rather incomprehensible like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Click the link to authenticate: &lt;a href="https://accounts.google.com/o/oauth2/v2/auth?redirect%5C_uri=https%3A%2F%2Fapi.portialabs.ai%2Fapi%2Fv0%2Foauth%2Fgoogle%2F&amp;amp;client%5C_id=1062040369470-6hqq9140gs1451mvb3fon3md1ekhnlns.apps.googleusercontent.com&amp;amp;scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fgmail.modify&amp;amp;state=APP%5C_NAME%3Dgoogle%253A%253Agmail%26WORKFLOW%5C_ID%3Dwkfl-87a960b7-f750-414b-8d5b-72c2c203c5fc%26END%5C_USER%5C_ID%3Dportia%253A%253A2%26ORG%5C_ID%3Dc31d809a-c6f3-48e2-9cf0-2cf079ead258%26CLARIFICATION%5C_ID%3Dclar-894c4a62-a092-4501-8501-174a9d78c7e5%26SCOPES%3D%2Bhttps%253A%252F%252Fwww.googleapis.com%252Fauth%252Fgmail.modify&amp;amp;access%5C_type=offline&amp;amp;response%5C_type=code&amp;amp;prompt=consent" rel="noopener noreferrer"&gt;https://accounts.google.com/o/oauth2/v2/auth?redirect\_uri=https%3A%2F%2Fapi.portialabs.ai%2Fapi%2Fv0%2Foauth%2Fgoogle%2F&amp;amp;client\_id=1062040369470-6hqq9140gs1451mvb3fon3md1ekhnlns.apps.googleusercontent.com&amp;amp;scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fgmail.modify&amp;amp;state=APP\_NAME%3Dgoogle%253A%253Agmail%26WORKFLOW\_ID%3Dwkfl-87a960b7-f750-414b-8d5b-72c2c203c5fc%26END\_USER\_ID%3Dportia%253A%253A2%26ORG\_ID%3Dc31d809a-c6f3-48e2-9cf0-2cf079ead258%26CLARIFICATION\_ID%3Dclar-894c4a62-a092-4501-8501-174a9d78c7e5%26SCOPES%3D%2Bhttps%253A%252F%252Fwww.googleapis.com%252Fauth%252Fgmail.modify&amp;amp;access\_type=offline&amp;amp;response\_type=code&amp;amp;prompt=consent&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s ugly and if the end user makes a mistake in copying that link, it won’t work! We’ll refer to this as the ‘human-agent &lt;em&gt;presentation&lt;/em&gt;’ problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making human-agent interaction a first class citizen for agentic AI&lt;a href="https://blog.portialabs.ai/agent-auth-part-II#making-human-agent-interaction-a-first-class-citizen-for-agentic-ai" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;These were three of the initial problems we wanted to tackle with Portia AI. The first problem is solvable as long as it’s built into the agentic system at a fundamental level: a task’s output is pre-inspected before agent introspection kicks in. Then, if an agent-to-human clarification is raised, it can be returned immediately to the end user rather than the agent trapping itself in an endless death loop of retries. Most agentic systems assume that human-in-the-loop actions should come after the agent has made its best attempt at completing its task (shown above). To handle this in Portia, it’s fundamental that any tool call can return either a clarification or the tool’s output; if a clarification is returned, it is handed back to the developer to present to the end user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswgax10mp5qd3kzla8vm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswgax10mp5qd3kzla8vm.png" alt="Agent architecture with short circuit" width="646" height="884"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use this as a critical part of our auth system, but it’s useful more broadly as it creates an extremely flexible system that developers can use to hand off seamlessly between human control and agent control. For example, if a tool returns too many results, and the user needs to select the right one to proceed, developers can return a &lt;code&gt;multiple choice clarification&lt;/code&gt;, or in the future, even trigger this behaviour automatically.&lt;/p&gt;
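&lt;p&gt;To make the multiple-choice case concrete, here’s a hedged sketch in the same simplified style: when a lookup returns several plausible matches, the tool hands the options back to the human rather than guessing. The class and function names are illustrative, not the exact Portia SDK API.&lt;/p&gt;

```python
# Sketch: a tool that raises a multiple-choice clarification when it cannot
# disambiguate on its own. Names are illustrative stand-ins.

from dataclasses import dataclass, field

@dataclass
class MultipleChoiceClarification:
    guidance: str
    options: list = field(default_factory=list)

def find_contact(name: str, directory: dict):
    matches = [email for contact, email in directory.items() if name in contact]
    if len(matches) == 1:
        return matches[0]
    # Ambiguous: let the human pick rather than have the agent guess.
    return MultipleChoiceClarification(
        guidance=f"Multiple contacts match '{name}' - which did you mean?",
        options=matches,
    )
```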

&lt;h3&gt;
  
  
  Scaling to 1000s of users&lt;a href="https://blog.portialabs.ai/agent-auth-part-II#scaling-to-1000s-of-users" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The second issue, the ‘human-agent &lt;em&gt;hand-off&lt;/em&gt;’ problem, requires a set of events to be synchronized back and forth between human and agent (e.g. “auth needs to be completed”, “auth has completed and the agent can resume”, etc.). It also requires the in-flight agent state to be saved so that the run can be resumed after the end user has completed the authentication. This is relatively easy if you assume you have only one end user, or that they will authenticate immediately. But we wanted to create a production-ready system that could scale up to 1,000s of end users, and we wanted end users to be able to respond to their agents in their own time, so they can get on with their day-to-day lives. So the Portia framework handles this for developers, and we support the concept of end users as a primitive in our framework, so tasks, tool calls and authentication sessions can be attributed to individuals across your organisation or production use case.&lt;/p&gt;
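&lt;p&gt;A minimal sketch of the state-saving piece, assuming an in-memory store and a simple dict-shaped run state (a real system would use durable storage): runs are keyed by end user as well as run ID, so thousands of users can each pause on auth and resume in their own time.&lt;/p&gt;

```python
# Sketch: persisting in-flight agent state per end user so a run can pause on
# an auth clarification and resume later. The in-memory dict and the state
# shape are assumptions for brevity; use durable storage in production.

import json

class RunStore:
    def __init__(self):
        self._runs = {}  # (end_user_id, run_id) -> serialized state

    def pause(self, end_user_id: str, run_id: str, state: dict) -> None:
        # Serialize so the state could survive process restarts in a real store.
        self._runs[(end_user_id, run_id)] = json.dumps(state)

    def resume(self, end_user_id: str, run_id: str) -> dict:
        # Called once the end user completes authentication, whenever that is.
        return json.loads(self._runs.pop((end_user_id, run_id)))
```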

&lt;h3&gt;
  
  
  Making it look good&lt;a href="https://blog.portialabs.ai/agent-auth-part-II#making-it-look-good" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The third issue, the ‘human-agent &lt;em&gt;presentation&lt;/em&gt;’ problem, is fairly easy to layer on to the previous concepts. Clarifications in Portia are structured, which means they can be easily rendered in different elegant UI formats to the end-user. Rather than an ugly link, the developer can easily render a button that hides the complexity from the end user. You can even configure the guidance you want to attach to your clarification:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Click the link to authenticate: &lt;a href="https://accounts.google.com/o/oauth2/v2/auth?redirect%5C_uri=https%3A%2F%2Fapi.portialabs.ai%2Fapi%2Fv0%2Foauth%2Fgoogle%2F&amp;amp;client%5C_id=1062040369470-6hqq9140gs1451mvb3fon3md1ekhnlns.apps.googleusercontent.com&amp;amp;scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fgmail.modify&amp;amp;state=APP%5C_NAME%3Dgoogle%253A%253Agmail%26WORKFLOW%5C_ID%3Dwkfl-87a960b7-f750-414b-8d5b-72c2c203c5fc%26END%5C_USER%5C_ID%3Dportia%253A%253A2%26ORG%5C_ID%3Dc31d809a-c6f3-48e2-9cf0-2cf079ead258%26CLARIFICATION%5C_ID%3Dclar-894c4a62-a092-4501-8501-174a9d78c7e5%26SCOPES%3D%2Bhttps%253A%252F%252Fwww.googleapis.com%252Fauth%252Fgmail.modify&amp;amp;access%5C_type=offline&amp;amp;response%5C_type=code&amp;amp;prompt=consent" rel="noopener noreferrer"&gt;https://accounts.google.com/o/oauth2/v2/auth?redirect\_uri=https%3A%2F%2Fapi.portialabs.ai%2Fapi%2Fv0%2Foauth%2Fgoogle%2F&amp;amp;client\_id=1062040369470-6hqq9140gs1451mvb3fon3md1ekhnlns.apps.googleusercontent.com&amp;amp;scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fgmail.modify&amp;amp;state=APP\_NAME%3Dgoogle%253A%253Agmail%26WORKFLOW\_ID%3Dwkfl-87a960b7-f750-414b-8d5b-72c2c203c5fc%26END\_USER\_ID%3Dportia%253A%253A2%26ORG\_ID%3Dc31d809a-c6f3-48e2-9cf0-2cf079ead258%26CLARIFICATION\_ID%3Dclar-894c4a62-a092-4501-8501-174a9d78c7e5%26SCOPES%3D%2Bhttps%253A%252F%252Fwww.googleapis.com%252Fauth%252Fgmail.modify&amp;amp;access\_type=offline&amp;amp;response\_type=code&amp;amp;prompt=consent&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;becomes (with minimal developer effort!):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/data%3Aimage%2Fpng%3Bbase64%2CiVBORw0KGgoAAAANSUhEUgAAAYUAAABHCAYAAADodFtDAAAABGdBTUEAALGPC%2FxhBQAAACBjSFJNAAB6JgAAgIQAAPoAAACA6AAAdTAAAOpgAAA6mAAAF3CculE8AAAAUGVYSWZNTQAqAAAACAACARIAAwAAAAEAAQAAh2kABAAAAAEAAAAmAAAAAAADoAEAAwAAAAEAAQAAoAIABAAAAAEAAAGFoAMABAAAAAEAAABHAAAAABqaRM8AAAIxaVRYdFhNTDpjb20uYWRvYmUueG1wAAAAAAA8eDp4bXBtZXRhIHhtbG5zOng9ImFkb2JlOm5zOm1ldGEvIiB4OnhtcHRrPSJYTVAgQ29yZSA2LjAuMCI%2BCiAgIDxyZGY6UkRGIHhtbG5zOnJkZj0iaHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zIyI%2BCiAgICAgIDxyZGY6RGVzY3JpcHRpb24gcmRmOmFib3V0PSIiCiAgICAgICAgICAgIHhtbG5zOmV4aWY9Imh0dHA6Ly9ucy5hZG9iZS5jb20vZXhpZi8xLjAvIgogICAgICAgICAgICB4bWxuczp0aWZmPSJodHRwOi8vbnMuYWRvYmUuY29tL3RpZmYvMS4wLyI%2BCiAgICAgICAgIDxleGlmOlBpeGVsWURpbWVuc2lvbj43MTwvZXhpZjpQaXhlbFlEaW1lbnNpb24%2BCiAgICAgICAgIDxleGlmOlBpeGVsWERpbWVuc2lvbj4zODk8L2V4aWY6UGl4ZWxYRGltZW5zaW9uPgogICAgICAgICA8ZXhpZjpDb2xvclNwYWNlPjE8L2V4aWY6Q29sb3JTcGFjZT4KICAgICAgICAgPHRpZmY6T3JpZW50YXRpb24%2BMTwvdGlmZjpPcmllbnRhdGlvbj4KICAgICAgPC9yZGY6RGVzY3JpcHRpb24%2BCiAgIDwvcmRmOlJERj4KPC94OnhtcG1ldGE%2BCr1uU5kAAA%2BJSURBVHgB7Z0JdE3XGsc%2FlRgjIrzSyZgYYqhW%2B3ilI03EFF5Rr0oUQYS2kaIERVRNRc3UPJTEkKREzKpVNZRqkCDmYak2VKNqSrzub%2Bs%2B9%2BQOyXLvzc1N7n%2BvlZx99j57%2Bp27vv%2FZw9mn0P%2BFIzgQAAEQAAEQEAQeAwUQAAEQAAEQUAQgCooEjiAAAiAAAugp4DcAAiAAAiBgIICegoEFfCAAAiDg8gQgCi7%2FEwAAEAABEDAQgCgYWMAHAiAAAi5PAKLg8j8BAAABEAABAwGIgoEFfCAAAiDg8gQgCi7%2FEwAAEAABEDAQgCgYWMAHAiAAAi5PAKLg8j8BAAABEAABAwE3gzd7H7ZIyp4PYkEABEAgrwgUKlTIbkXnKApKDPio%2FuxWOjICARAAARCwmgCLgV4Q9H5rM81WFJQIPHjwgKbPnE%2Fx6zdRyvFUun%2F%2FvrXlIR0IgAAIgIAdCLi7u1Otmr4U1Lo59Q%2FrSY899nA2wFZhKCQMv9mtszmYxeD0mXPUKzSCko4k26EZyAIEQAAEQMDeBOrV9aN5sz%2BnalUrS3GwRRjMioLqIWRmZtKbgR0gCPa%2Bg8gPBEAABOxMgIVha%2BJqKly4sBxSslYYLK4%2B4l7CNDFkhB6Cne8csgMBEACBXCDAtpptNttuW5yJKOh7CV%2BLOQQ4EAABEACB%2FEGAbTaP8Cg7bk2tTUSBM1HzCcdPnLImT6QBARAAARDIAwJss7mnwDbcWmdRFDhTrDKyFivSgQAIgIDjCbDNtqWXwDU2K
wocYYvScHo4EAABEAABxxOw1XZbFAXHNwUlggAIgAAI5DUBi6Jgq9rkdcNQPgiAAAi4IgFbbbdFUXBFmGgzCIAACLg6AYiCq%2F8C0H4QAAEQ0BGAKOhguIrXza0w1ajuQ8WLFzNp8quvNKa3O7bTwj8e9CG92ew17Tw7j3Ha7K511riSJUtQn17vkY9PVYdWsV7d2jT20%2BFUtGhRh5aLwkDAmABEwZhIAT4vX%2F5xWrRgBh078gPt3P41nUw5QKujF1HlSs9orW4jN9cKkefubm7Upk0gNWr0ohafnUefNrvrVFyXdztSw383UKd5cjSug6dnKRoxfCDVqV0r1%2Brj7V2GenR%2FlypVNHCvK7YoaN0ygLy9vexSrnG77JIpMnEJAhAFl7jNRBUqlKfYNUvJp1oV6hHyPtX0a0iBrTrK1m%2FetFYYozImJO5nZNBLTQIoasxEkzh7BEQOiaDA5s3skZXVeeRFHSpUeJyiRg2lGjV8tHqv%2BGo11a3fhK5cuaqF2eLJi3bZUl%2BkdR4CEAXnuRe5WpNOYkjo6aefouDuYbR7915Kv3mTjh5Nob5hH9G1a79TgP8bZsuPX7ecwvr21OKC2rSgNTGL6fixfRQn4po0aaTFGXs6vf1f%2BmbHenr%2BuXpZoliAvvsmgTw8SsqhKvY%2FI%2BrGrnRpT5o0YTTt%2B2Er7d61kcaMjqRixUyHuVSGXMb2rXGy15OwPppef62JiqIvpnxGE8aN1M7Zo9qTXR34Oh5amzt7CiWLXtX6%2BJXET%2FLK8UZjAz7sK8s9sHe7LIPbohzXg4egPhkxiH46uIuWL51Lr7z8koxu1vRV2Vvjk3FjR1Ds2mUyvE3rQMmkRIni8pzLiBgQRts2x9KRw7tpycKZ9NRTT8g4%2Ftfg%2BWcl%2F%2BPJ%2B2jXzg3Ut093GZddu%2BrUqUXz531BR3%2F%2BXvYQmwc01fKDBwQUAYiCIlHAj7Vr16Rz587TGbEVut79lnZN9gZWrlqrD9b8latUonLlvOU5G7QZ08ZTWtp1GhU1gW7d%2BotWLJtHNWv4atcrT9M3XqHxwiDPmbuIDv2UpILl8a%2B%2FbtPM2Qvo7t179HPSUem%2FceMP4uGqhfOnU1NRzoKFy2lVTCy1f6uNLNPcjo%2BvvdqEJk8aQ3v27Kehw6Lo1p%2B3aPGiWVT%2B8X%2FJctiIPvmkwZByoGqPpTrIhOLfwIh%2BdObsOZq%2FYJkc5lkwb5qKogHhfal%2FvxBaF7uBpk6bI4ToZZo1Y5L2sZMqlStS39Ae5FmqFM2Z83B4btaMiXJL49TUMxQTEyfz2pi4jRYvWSn9Xl6eVE304tSe%2BEM%2FDqcP3%2B9D23fsoslTZlElkWfc2uXE80EsnDGrFkn%2BQyOjaO%2FeH2lY5EfUsoU%2FWWoXDx0uWzKHPD096ZNR4%2BjChUtC9CbTiy8%2Br7ULHhBgAtl%2BZAeICg4Bv1o16MRJ2%2Fay6tP7PWHEj1FoWIR8452N4ucTo8jXtyodP5Gqwar%2FbB35lD127GRaFb1OC1eeO3fuyPCRIwbTcfHRJnUNzy%2F8R8xfdAlmY%2FitvDxNiBYb%2FsrCKJ49e15lIY9cZrOAdpScfEKeH%2F75KH0rnpobNKhPGxO3ZrnW%2BMRSHTxKPXzi543Fxk%2F4QibjDcYGD%2FpADsGlpaVRP9FzmjR5Js2cNV%2FGX7h4iaK%2FWkBVxV72p0%2BflWFct4iBw6WfxWWJEKsXRL32HzhEiZu3yV7At9%2FtoS1bd8pr9P9YHHuFBNPcL5fQZ%2BOnyqjv9%2ByTdahapbIcYmr3Vhd5P%2B%2FevUuxcQnk7%2F%2B6ZJewcYtZtjyHUUR8lKVb975SONauWy%2BHErt3e4cOiDrBgYAiAFFQJAr48cYf6VSqlIdNreTJ16XLVmlboLBB6vf%2Bo
Cx5smGMXrWQrl79VRi1xVnicjrhIRp%2B8WbPD%2Fu1S7kXwK5uHT8TUfjll6vkJZ6aQ3p2pYrPPE1VRK%2BGnX4oRwZY8e%2FosRQtVdKRY9JftmwZKlPGS64QertDW6r6T3klSz4UEuajROHIUcNHqZKEkLKrJCb0WRRyctXFXAN%2FVYuH%2BZQ7mXpazgWpc%2BbbNqgF1RRf3vIqXVr2jkp6lFDRJkfuEXDdeS5DOQ67fPmKOsURBCQBDB%2B5yA%2FhmDBy1cUyVHOOjUVOhpSHNdzEEyxPPufk0v%2B4KZ%2FseQjpURzPHfAOjxn3DWXcvXdPZlHczLxCu7ataMumdcRj4%2FfERmCpp04%2FSnHZXpuRkanFZ2Ya9qdXY%2F7nzl%2BkK0KU%2BO%2FU6TNymOnSpcuGNDpOmQ8MeWkXZOMpWqSIjM3IMP%2FZW15OvGNbvByi4qd%2FHgrKyRUr9nCpq6ozH3lobIkQeTgQ0BNAT0FPowD7eWjl3c4dqUP7IFq9Jl5rqTIwg4eMouUrYrRwYw8b65SUE%2FRc%2FbpaFI%2FzB3fpJL7bfZL27T8ow%2FkJtlFjfzmROXXyWGrm346u%2FvqblsbY417EXQs6Ij4Swl%2BN4vkPri%2B7evVqy2OS7slbBoh%2FwV070d59P1LHTu%2FJID%2B%2FGtQ7pJuKlnMMvjohZIPuLQTQ2OnrYBxnfJ6cfFwKV%2BKmbaTmYZgDL2Xl8fxHcUV0bdenS045KffEr1%2B%2FHu3%2Bfp%2BMKlvWmzq8FUQ8PBQkeghFhHAEBLYnHgZjwe7Zo4s%2BC%2BnXt%2BvgwcOyN8XzE%2BojLDyZ7lYYJsAEnIsHoKfgIj%2BAaDFpy2PYn0YNk6uJ%2BGUpXknEK2N49VFs3IYcScSsiSOe3A3%2FIJR4JUvkkAHyhSt9Qp585jH4sP4DpbGaPm2CNnmqv479yUJkXhf58coc7iUcPHRYTIZfEKuPoqjxSw3lSqKoUUPE1%2F%2BOUWqq6XzItWvX5SQ3D4Nwe0aPHJKliGQhVvwORr%2BwEClmE8aNyhJvrg4mFxgF3L59hzYkbKEhg8PlS308OTxm9FBKEiuEnnyigtHV5k%2B5jWzMO7Rvm0Vk1dUctyFhM33QvzfxqiReacQCGx4eSteuXxf36zrxS3atWvrLl%2BzGREVKUVLp%2BWjMdr3Ir1w5b7niiRcG8H3cvSuRRn4yWJ8MfhCwvHU22BQsAmyoe4cOEE%2Bee6Wx2bRxtVgxM1FMWv5CLVp3lCtZcmrxsuUx9PmUmRQc%2FD%2FakriWOr%2FTgSKHf6r1EvTpeb19eEQkNWncUBjlnvoozc8rk27fvk0rV3wpVglVoD%2FF6qHOXXv%2FEzZfLt28ePEydQ0OJf1wjspgnJiETU9Pl8tMo1cuIJ481bvZsxcSPyHzSp742K%2BIh9Cu%2F35Df4lcHaWvQ5ZICyfhEUNlD4VXHPFy2uYBzaiPYHv%2BwkULKbIGc49i2vR5VNuvplxtlTX24Vl4xDDasfM7OZHPS2J9fKpIDpyWVy9xHC%2B53bE1nn4XbUo9dSZLNsZsuRcW1n8QNWr4ghx6Wr50Dv0o2PBKJDgQ0BMoJCb2snyih0%2FZgPDTSsVqWK6mh1VQ%2FLys0dfXhy6KVTNsiK1xPJzBT6z2cLzaxniugod6eCyfJ7Nzcjwnkp5%2BU%2F5uzV3LSzh5%2BSv%2Fpi05c3WwdK0KZ45eXqXlEl0V9qjHnMrlMjw8PIiX7Bo7ngdisXzUdjEv7tHd%2B2e%2BxjhfnOdvAhdOH5I9bx6KNbeUO6fWQRRyIoR4EAABEMhHBGwVBcwp5KObjaqCAAiAQG4TgCjkNmHkDwIgAAL5iABEIR%2FdLFQVBEAABHKbAEQhtwkjfxAAARDIRwQsioI1s9b5qN2oKgiAAAgUSAK22m6LosC03N3xtmO
B%2FNWgUSAAAgWSgD1stkVRYLWp7lutQIJDo0AABECgIBJgm50rPQXOlP9atXizIHJDm0AABECgQBJgm63st7UNNPvyGr%2FVfF%2FsOslvSrYM6iz2UTlpbf5IBwIgAAIg4AACfrWqU0L8Cvk2M2%2B9bq04mAwfqYz4FWnOePrUscSFwYEACIAACDgnAbbRbKvZZqvtLdiWW%2BNMegqcCfcU1B5IvD8K9xhmz1tMiZt20KlTZ032qbGmYKQBARAAARCwngDvm8UbJQY2f4NCe3WTPQTeUj1XRIGryaLA%2B67z5ngZ4oMhPJzERz5X%2B7Fb3xykBAEQAAEQsIUAf0eDBYA%2FfsU9BD7yOYdb20vg%2BpjtKaiKqh6DEgc%2BqjC%2Bhv1wIAACIAACjiOgDL4a6lfioMRAxVtbo2xFgTNVhl%2BJgTq3tkCkAwEQAAEQsA8BJQxKCNTRltxzfDtNX4jeb0uhSAsCIAACIGA%2FAva0zTmKgqq2PQtVeeIIAiAAAiDgXARMlqQ6V%2FVQGxAAARAAAUcSgCg4kjbKAgEQAAEnJwBRcPIbhOqBAAiAgCMJQBQcSRtlgQAIgICTE4AoOPkNQvVAAARAwJEEIAqOpI2yQAAEQMDJCUAUnPwGoXogAAIg4EgCEAVH0kZZIAACIODkBCAKTn6DUD0QAAEQcCQBiIIjaaMsEAABEHByAhAFJ79BqB4IgAAIOJIARMGRtFEWCIAACDg5AYiCk98gVA8EQAAEHEkAouBI2igLBEAABJycAETByW8QqgcCIAACjiQAUXAkbZQFAiAAAk5O4G82ZnhrxWC%2FUAAAAABJRU5ErkJggg%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/data%3Aimage%2Fpng%3Bbase64%2CiVBORw0KGgoAAAANSUhEUgAAAYUAAABHCAYAAADodFtDAAAABGdBTUEAALGPC%2FxhBQAAACBjSFJNAAB6JgAAgIQAAPoAAACA6AAAdTAAAOpgAAA6mAAAF3CculE8AAAAUGVYSWZNTQAqAAAACAACARIAAwAAAAEAAQAAh2kABAAAAAEAAAAmAAAAAAADoAEAAwAAAAEAAQAAoAIABAAAAAEAAAGFoAMABAAAAAEAAABHAAAAABqaRM8AAAIxaVRYdFhNTDpjb20uYWRvYmUueG1wAAAAAAA8eDp4bXBtZXRhIHhtbG5zOng9ImFkb2JlOm5zOm1ldGEvIiB4OnhtcHRrPSJYTVAgQ29yZSA2LjAuMCI%2BCiAgIDxyZGY6UkRGIHhtbG5zOnJkZj0iaHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zIyI%2BCiAgICAgIDxyZGY6RGVzY3JpcHRpb24gcmRmOmFib3V0PSIiCiAgICAgICAgICAgIHhtbG5zOmV4aWY9Imh0dHA6Ly9ucy5hZG9iZS5jb20vZXhpZi8xLjAvIgogICAgICAgICAgICB4bWxuczp0aWZmPSJodHRwOi8vbnMuYWRvYmUuY29tL3RpZmYvMS4wLyI%2BCiAgICAgICAgIDxleGlmOlBpeGVsWURpbWVuc2lvbj43MTwvZXhpZjpQaXhlbFlEaW1lbnNpb24%2BCiAgICAgICAgIDxleGlmOlBpeGVsWERpbWVuc2lvbj4zODk8L2V4aWY6UGl4ZWxYRGltZW5zaW9uPgogICAgICAgICA8ZXhpZjpDb2xvclNwYWNlPjE8L2V4aWY6Q29sb3JTcGFjZT4KICAgICAgICAgPHRpZmY6T3JpZW50YXRpb24%2BMTwvdGlmZjpPcmllbnRhdGlvbj4KICAgICAg
PC9yZGY6RGVzY3JpcHRpb24%2BCiAgIDwvcmRmOlJERj4KPC94OnhtcG1ldGE%2BCr1uU5kAAA%2BJSURBVHgB7Z0JdE3XGsc%2FlRgjIrzSyZgYYqhW%2B3ilI03EFF5Rr0oUQYS2kaIERVRNRc3UPJTEkKREzKpVNZRqkCDmYak2VKNqSrzub%2Bs%2B9%2BQOyXLvzc1N7n%2BvlZx99j57%2Bp27vv%2FZw9mn0P%2BFIzgQAAEQAAEQEAQeAwUQAAEQAAEQUAQgCooEjiAAAiAAAugp4DcAAiAAAiBgIICegoEFfCAAAiDg8gQgCi7%2FEwAAEAABEDAQgCgYWMAHAiAAAi5PAKLg8j8BAAABEAABAwGIgoEFfCAAAiDg8gQgCi7%2FEwAAEAABEDAQgCgYWMAHAiAAAi5PAKLg8j8BAAABEAABAwE3gzd7H7ZIyp4PYkEABEAgrwgUKlTIbkXnKApKDPio%2FuxWOjICARAAARCwmgCLgV4Q9H5rM81WFJQIPHjwgKbPnE%2Fx6zdRyvFUun%2F%2FvrXlIR0IgAAIgIAdCLi7u1Otmr4U1Lo59Q%2FrSY899nA2wFZhKCQMv9mtszmYxeD0mXPUKzSCko4k26EZyAIEQAAEQMDeBOrV9aN5sz%2BnalUrS3GwRRjMioLqIWRmZtKbgR0gCPa%2Bg8gPBEAABOxMgIVha%2BJqKly4sBxSslYYLK4%2B4l7CNDFkhB6Cne8csgMBEACBXCDAtpptNttuW5yJKOh7CV%2BLOQQ4EAABEACB%2FEGAbTaP8Cg7bk2tTUSBM1HzCcdPnLImT6QBARAAARDIAwJss7mnwDbcWmdRFDhTrDKyFivSgQAIgIDjCbDNtqWXwDU2KwocYYvScHo4EAABEAABxxOw1XZbFAXHNwUlggAIgAAI5DUBi6Jgq9rkdcNQPgiAAAi4IgFbbbdFUXBFmGgzCIAACLg6AYiCq%2F8C0H4QAAEQ0BGAKOhguIrXza0w1ajuQ8WLFzNp8quvNKa3O7bTwj8e9CG92ew17Tw7j3Ha7K511riSJUtQn17vkY9PVYdWsV7d2jT20%2BFUtGhRh5aLwkDAmABEwZhIAT4vX%2F5xWrRgBh078gPt3P41nUw5QKujF1HlSs9orW4jN9cKkefubm7Upk0gNWr0ohafnUefNrvrVFyXdztSw383UKd5cjSug6dnKRoxfCDVqV0r1%2Brj7V2GenR%2FlypVNHCvK7YoaN0ygLy9vexSrnG77JIpMnEJAhAFl7jNRBUqlKfYNUvJp1oV6hHyPtX0a0iBrTrK1m%2FetFYYozImJO5nZNBLTQIoasxEkzh7BEQOiaDA5s3skZXVeeRFHSpUeJyiRg2lGjV8tHqv%2BGo11a3fhK5cuaqF2eLJi3bZUl%2BkdR4CEAXnuRe5WpNOYkjo6aefouDuYbR7915Kv3mTjh5Nob5hH9G1a79TgP8bZsuPX7ecwvr21OKC2rSgNTGL6fixfRQn4po0aaTFGXs6vf1f%2BmbHenr%2BuXpZoliAvvsmgTw8SsqhKvY%2FI%2BrGrnRpT5o0YTTt%2B2Er7d61kcaMjqRixUyHuVSGXMb2rXGy15OwPppef62JiqIvpnxGE8aN1M7Zo9qTXR34Oh5amzt7CiWLXtX6%2BJXET%2FLK8UZjAz7sK8s9sHe7LIPbohzXg4egPhkxiH46uIuWL51Lr7z8koxu1vRV2Vvjk3FjR1Ds2mUyvE3rQMmkRIni8pzLiBgQRts2x9KRw7tpycKZ9NRTT8g4%2Ftfg%2BWcl%2F%2BPJ%2B2jXzg3Ut093GZddu%2BrUqUXz531BR3%2F%2BXvYQmwc01fKDBwQUAYiCIlHAj7Vr16Rz587TGbEVut79lnZN9gZWrlqrD9b8latUonLlvOU5G7QZ08ZTWtp1GhU1gW7d%2BotWLJtHNWv4atcrT9M3XqHxwiDPmbuIDv2UpI
Ll8a%2B%2FbtPM2Qvo7t179HPSUem%2FceMP4uGqhfOnU1NRzoKFy2lVTCy1f6uNLNPcjo%2BvvdqEJk8aQ3v27Kehw6Lo1p%2B3aPGiWVT%2B8X%2FJctiIPvmkwZByoGqPpTrIhOLfwIh%2BdObsOZq%2FYJkc5lkwb5qKogHhfal%2FvxBaF7uBpk6bI4ToZZo1Y5L2sZMqlStS39Ae5FmqFM2Z83B4btaMiXJL49TUMxQTEyfz2pi4jRYvWSn9Xl6eVE304tSe%2BEM%2FDqcP3%2B9D23fsoslTZlElkWfc2uXE80EsnDGrFkn%2BQyOjaO%2FeH2lY5EfUsoU%2FWWoXDx0uWzKHPD096ZNR4%2BjChUtC9CbTiy8%2Br7ULHhBgAtl%2BZAeICg4Bv1o16MRJ2%2Fay6tP7PWHEj1FoWIR8452N4ucTo8jXtyodP5Gqwar%2FbB35lD127GRaFb1OC1eeO3fuyPCRIwbTcfHRJnUNzy%2F8R8xfdAlmY%2FitvDxNiBYb%2FsrCKJ49e15lIY9cZrOAdpScfEKeH%2F75KH0rnpobNKhPGxO3ZrnW%2BMRSHTxKPXzi543Fxk%2F4QibjDcYGD%2FpADsGlpaVRP9FzmjR5Js2cNV%2FGX7h4iaK%2FWkBVxV72p0%2BflWFct4iBw6WfxWWJEKsXRL32HzhEiZu3yV7At9%2FtoS1bd8pr9P9YHHuFBNPcL5fQZ%2BOnyqjv9%2ByTdahapbIcYmr3Vhd5P%2B%2FevUuxcQnk7%2F%2B6ZJewcYtZtjyHUUR8lKVb975SONauWy%2BHErt3e4cOiDrBgYAiAFFQJAr48cYf6VSqlIdNreTJ16XLVmlboLBB6vf%2BoCx5smGMXrWQrl79VRi1xVnicjrhIRp%2B8WbPD%2Fu1S7kXwK5uHT8TUfjll6vkJZ6aQ3p2pYrPPE1VRK%2BGnX4oRwZY8e%2FosRQtVdKRY9JftmwZKlPGS64QertDW6r6T3klSz4UEuajROHIUcNHqZKEkLKrJCb0WRRyctXFXAN%2FVYuH%2BZQ7mXpazgWpc%2BbbNqgF1RRf3vIqXVr2jkp6lFDRJkfuEXDdeS5DOQ67fPmKOsURBCQBDB%2B5yA%2FhmDBy1cUyVHOOjUVOhpSHNdzEEyxPPufk0v%2B4KZ%2FseQjpURzPHfAOjxn3DWXcvXdPZlHczLxCu7ataMumdcRj4%2FfERmCpp04%2FSnHZXpuRkanFZ2Ya9qdXY%2F7nzl%2BkK0KU%2BO%2FU6TNymOnSpcuGNDpOmQ8MeWkXZOMpWqSIjM3IMP%2FZW15OvGNbvByi4qd%2FHgrKyRUr9nCpq6ozH3lobIkQeTgQ0BNAT0FPowD7eWjl3c4dqUP7IFq9Jl5rqTIwg4eMouUrYrRwYw8b65SUE%2FRc%2FbpaFI%2FzB3fpJL7bfZL27T8ow%2FkJtlFjfzmROXXyWGrm346u%2FvqblsbY417EXQs6Ij4Swl%2BN4vkPri%2B7evVqy2OS7slbBoh%2FwV070d59P1LHTu%2FJID%2B%2FGtQ7pJuKlnMMvjohZIPuLQTQ2OnrYBxnfJ6cfFwKV%2BKmbaTmYZgDL2Xl8fxHcUV0bdenS045KffEr1%2B%2FHu3%2Bfp%2BMKlvWmzq8FUQ8PBQkeghFhHAEBLYnHgZjwe7Zo4s%2BC%2BnXt%2BvgwcOyN8XzE%2BojLDyZ7lYYJsAEnIsHoKfgIj%2BAaDFpy2PYn0YNk6uJ%2BGUpXknEK2N49VFs3IYcScSsiSOe3A3%2FIJR4JUvkkAHyhSt9Qp585jH4sP4DpbGaPm2CNnmqv479yUJkXhf58coc7iUcPHRYTIZfEKuPoqjxSw3lSqKoUUPE1%2F%2BOUWqq6XzItWvX5SQ3D4Nwe0aPHJKliGQhVvwORr%2BwEClmE8aNyhJvrg4mFxgF3L59hzYk
bKEhg8PlS308OTxm9FBKEiuEnnyigtHV5k%2B5jWzMO7Rvm0Vk1dUctyFhM33QvzfxqiReacQCGx4eSteuXxf36zrxS3atWvrLl%2BzGREVKUVLp%2BWjMdr3Ir1w5b7niiRcG8H3cvSuRRn4yWJ8MfhCwvHU22BQsAmyoe4cOEE%2Bee6Wx2bRxtVgxM1FMWv5CLVp3lCtZcmrxsuUx9PmUmRQc%2FD%2FakriWOr%2FTgSKHf6r1EvTpeb19eEQkNWncUBjlnvoozc8rk27fvk0rV3wpVglVoD%2FF6qHOXXv%2FEzZfLt28ePEydQ0OJf1wjspgnJiETU9Pl8tMo1cuIJ481bvZsxcSPyHzSp742K%2BIh9Cu%2F35Df4lcHaWvQ5ZICyfhEUNlD4VXHPFy2uYBzaiPYHv%2BwkULKbIGc49i2vR5VNuvplxtlTX24Vl4xDDasfM7OZHPS2J9fKpIDpyWVy9xHC%2B53bE1nn4XbUo9dSZLNsZsuRcW1n8QNWr4ghx6Wr50Dv0o2PBKJDgQ0BMoJCb2snyih0%2FZgPDTSsVqWK6mh1VQ%2FLys0dfXhy6KVTNsiK1xPJzBT6z2cLzaxniugod6eCyfJ7Nzcjwnkp5%2BU%2F5uzV3LSzh5%2BSv%2Fpi05c3WwdK0KZ45eXqXlEl0V9qjHnMrlMjw8PIiX7Bo7ngdisXzUdjEv7tHd%2B2e%2BxjhfnOdvAhdOH5I9bx6KNbeUO6fWQRRyIoR4EAABEMhHBGwVBcwp5KObjaqCAAiAQG4TgCjkNmHkDwIgAAL5iABEIR%2FdLFQVBEAABHKbAEQhtwkjfxAAARDIRwQsioI1s9b5qN2oKgiAAAgUSAK22m6LosC03N3xtmOB%2FNWgUSAAAgWSgD1stkVRYLWp7lutQIJDo0AABECgIBJgm50rPQXOlP9atXizIHJDm0AABECgQBJgm63st7UNNPvyGr%2FVfF%2FsOslvSrYM6iz2UTlpbf5IBwIgAAIg4AACfrWqU0L8Cvk2M2%2B9bq04mAwfqYz4FWnOePrUscSFwYEACIAACDgnAbbRbKvZZqvtLdiWW%2BNMegqcCfcU1B5IvD8K9xhmz1tMiZt20KlTZ032qbGmYKQBARAAARCwngDvm8UbJQY2f4NCe3WTPQTeUj1XRIGryaLA%2B67z5ngZ4oMhPJzERz5X%2B7Fb3xykBAEQAAEQsIUAf0eDBYA%2FfsU9BD7yOYdb20vg%2BpjtKaiKqh6DEgc%2BqjC%2Bhv1wIAACIAACjiOgDL4a6lfioMRAxVtbo2xFgTNVhl%2BJgTq3tkCkAwEQAAEQsA8BJQxKCNTRltxzfDtNX4jeb0uhSAsCIAACIGA%2FAva0zTmKgqq2PQtVeeIIAiAAAiDgXARMlqQ6V%2FVQGxAAARAAAUcSgCg4kjbKAgEQAAEnJwBRcPIbhOqBAAiAgCMJQBQcSRtlgQAIgICTE4AoOPkNQvVAAARAwJEEIAqOpI2yQAAEQMDJCUAUnPwGoXogAAIg4EgCEAVH0kZZIAACIODkBCAKTn6DUD0QAAEQcCQBiIIjaaMsEAABEHByAhAFJ79BqB4IgAAIOJIARMGRtFEWCIAACDg5AYiCk98gVA8EQAAEHEkAouBI2igLBEAABJycAETByW8QqgcCIAACjiQAUXAkbZQFAiAAAk5O4G82ZnhrxWC%2FUAAAAABJRU5ErkJggg%3D%3D" alt="Structured button" width="389" height="71"&gt;&lt;/a&gt;&lt;/p&gt;
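&lt;p&gt;That transformation is possible precisely because the clarification is structured data (guidance plus an action URL) rather than free text. A minimal sketch, assuming those two fields – here rendered as a Markdown-style link, though the same data could just as easily drive an HTML button:&lt;/p&gt;

```python
# Sketch: rendering a structured clarification as a clickable control.
# Field names are illustrative; the point is that presentation is a pure
# function of the structured data, so the raw OAuth URL never reaches the
# user as text they must copy by hand.

def render_clarification(guidance: str, action_url: str) -> str:
    return f"[{guidance}]({action_url})"
```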

&lt;p&gt;When we were designing Portia, we started with authentication, but quickly realized that the things that made it hard to do this were more general than just authentication and much more about the fundamentals of human-agent interaction. We look forward to hearing your thoughts and feedback on the product and our &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;open-source SDK&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Like this article?&lt;/strong&gt; – &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Give us a ⭐ on GitHub&lt;/a&gt;. It really helps!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse our website and try our (modest) playground at &lt;a href="http://www.portialabs.ai/" rel="noopener noreferrer"&gt;www.portialabs.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Head over to our docs at &lt;a href="https://docs.portialabs.ai/" rel="noopener noreferrer"&gt;docs.portialabs.ai&lt;/a&gt; or get immersed in our &lt;a href="https://github.com/portiaAI/portia-sdk-python/tree/main/portia/open_source_tools" rel="noopener noreferrer"&gt;SDK&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join the conversation on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Watch us embarrass ourselves on our &lt;a href="https://www.youtube.com/@PortiaAI" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://www.producthunt.com/posts/portia-ai" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>portia</category>
      <category>authentication</category>
      <category>agents</category>
    </item>
    <item>
      <title>Why authentication is a challenge for AI agents</title>
      <dc:creator>Portia AI</dc:creator>
      <pubDate>Mon, 10 Mar 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/portia-ai/why-authentication-is-a-challenge-for-ai-agents-88n</link>
      <guid>https://dev.to/portia-ai/why-authentication-is-a-challenge-for-ai-agents-88n</guid>
      <description>&lt;p&gt;&lt;strong&gt;AI Agents&lt;/strong&gt; are a rapidly evolving technology in the AI space. The introduction of LLMs and the ability for LLMs to interact with other software autonomously has paved the way for a new wave of technological innovation. This is an exciting development but it needs appropriate guardrails to ensure that an agent is really enacting your wishes and not sending rogue emails on your behalf to your entire address book. This is the first of a 2-part series that discusses some of the challenges of appropriately authenticating and authorizing agents so they can safely fulfill requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  The misaligned incentives of agents and authentication&lt;a href="https://blog.portialabs.ai/agent-auth-part-I#the-misaligned-incentives-of-agents-and-authentication" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Authentication is one of the most well-understood guardrails of the internet. These days, we take for granted that you cannot easily send an email on someone else’s behalf. In the earliest days of the internet, people wrote basic bots to brute-force usernames and passwords. But today, through widely deployed methods like OAuth combined with captchas, 2FA, and IP checks, authentication ensures that the appropriate human is present and is hard to impersonate.&lt;/p&gt;

&lt;p&gt;This presents a problem for agents. Their inherent value proposition is to act autonomously, whereas authentication has evolved precisely to ensure that the appropriate human is present for certain tasks.&lt;/p&gt;

&lt;p&gt;Most agentic systems solve this today by &lt;em&gt;pre-authenticating and authorizing&lt;/em&gt; the agent for any actions that it might want to take. The problem is that this pre-emptive access is far too broad and essentially removes one of the best safety guardrails we already have. You can replace it with a human-in-the-loop check instead, which many agentic systems have, but this is a bit like giving a burglar the keys to the castle but putting an electric fence around the perimeter – we’re essentially having to remove the authentication barrier and replace it with less sophisticated systems.&lt;/p&gt;

&lt;p&gt;Conversely, it’s also fundamentally limiting the potential of our agents – ideally, you want a system that grants the agent a minimal set of privileges so that it can achieve its tasks. &lt;em&gt;Pre-authorization&lt;/em&gt; means that, unless your end user is happy to sit and grant a bunch of authorizations to agents that they may not use, you’ll end up inadvertently limiting how many tools and systems your agent can access. Naive pre-authentication makes it hard to strike the balance between overgranting and undergranting and so ends up as both a limiting factor and also a far too permissive guardrail for agents.&lt;/p&gt;

&lt;p&gt;We’ve been trying to think about how just-in-time authorization can work for agents. How can we create an agentic system that enables agents to get the authorization they require only when it is clear that they require it?&lt;/p&gt;

&lt;h2&gt;
  
  
  An architecture for &lt;strong&gt;just-in-time&lt;/strong&gt; agent authentication and authorization&lt;a href="https://blog.portialabs.ai/agent-auth-part-I#an-architecture-for-just-in-time-agent-authentication-and-authorization" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Just-in-time&lt;/strong&gt; authorization means that an agent has the ability to authorize itself only at the point where it is very likely that it will 1/ need that authorization and 2/ be clear what it will use it for. If we think of agent execution as a graph of execution nodes against systems that might require authentication, there are 2 ways we can solve this problem:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1/ Authorization at the point of execution&lt;a href="https://blog.portialabs.ai/agent-auth-part-I#1-authorization-at-the-point-of-execution" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In this case, the agent must pause itself to retrieve the authorization it requires from the user. For this to be useful in an autonomous agent scenario, this means that any state up to that point must be saved so that it can be resumed once the authentication action has been completed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmut409rl3fg5gn7o3wz1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmut409rl3fg5gn7o3wz1.png" alt="alt text" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The advantage of this is that the agent can request exactly the authentication and authorization it needs for the task it’s trying to execute. The disadvantage is that your agent can only proceed so far autonomously, and you risk it getting stuck every time it hits an authentication requirement.&lt;/p&gt;
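
&lt;p&gt;To make this concrete, here’s a minimal sketch of the pause-and-resume flow in Python. None of the names below come from a real framework – the plan shape, the checkpoint format, and the &lt;code&gt;run_steps&lt;/code&gt;/&lt;code&gt;resume&lt;/code&gt; helpers are all illustrative assumptions:&lt;/p&gt;

```python
import json

# Illustrative sketch only: plan shape, checkpoint format and helper names
# are made up to show the pause-and-resume idea, not a real agent framework.

def run_steps(state, granted):
    # Resume from wherever execution previously paused.
    while state["step"] != len(state["plan"]):
        step = state["plan"][state["step"]]
        if step["auth"] is not None and step["auth"] not in granted:
            # Pause: persist the run state, then surface an auth request.
            state["paused_for"] = step["auth"]
            return json.dumps(state)          # serialized checkpoint
        state["outputs"].append(f"ran {step['name']}")
        state["step"] += 1
    return None                               # finished, no pause needed

def resume(checkpoint, granted):
    # Restore the saved state and carry on with the newly granted auth.
    return run_steps(json.loads(checkpoint), granted)

plan = [{"name": "search", "auth": None},
        {"name": "send_email", "auth": "gmail"}]
state = {"plan": plan, "step": 0, "outputs": []}
checkpoint = run_steps(state, granted=set())      # pauses before send_email
done = resume(checkpoint, granted={"gmail"})      # run completes after auth
```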

&lt;h3&gt;
  
  
  2/ Scoped authorization based on an articulated plan&lt;a href="https://blog.portialabs.ai/agent-auth-part-I#2-scoped-authorization-based-on-an-articulated-plan" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Many agentic systems today rely on some amount of chain-of-thought reasoning and pre-planning. The idea with this kind of authentication is to pre-process the articulated plan to identify authentication requirements as early as possible, or to group them together to minimize round trips back to the user. This has become a core part of how we have designed Portia AI – the plan and what the agent is attempting to do should always be clear to the human. It also allows us to scope the authorization provided to the agent accurately.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xre4lo6rtjrlzz2br93.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xre4lo6rtjrlzz2br93.png" alt="alt text" width="800" height="643"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, agentic systems are increasingly evolving towards adaptive planning, so the authentication requirements may change as the agent discovers more information about its goal or backtracks from a particular route. An adaptation of this is to attempt probabilistic pre-authentication based on some combination of the articulated plan and the domain the agent is operating in. Ultimately, however, this is just an optimization on top of the plan-based pre-authentication described above, and so it has the same fundamental limitations and concerns.&lt;/p&gt;

&lt;p&gt;So what is the right solution? Having the ability to authenticate at the point of execution is a fundamental building block for effective authentication and authorization. With this enabled, you can then build probabilistic optimization systems on top of it which are optimizing for the agent progressing as far as it can without human interruption, but which can recover in the case that they diverge from the probabilistic outcome.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/portia-ai/seamless-human-agent-interactions-with-just-in-time-authorization-4cbi"&gt;part 2&lt;/a&gt;, we will look at how you can build this just-in-time authentication into agentic systems and how we’ve solved it at &lt;a href="https://www.portialabs.ai/" rel="noopener noreferrer"&gt;&lt;em&gt;Portia AI&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Like this article?&lt;/strong&gt; – &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Give us a ⭐ on GitHub&lt;/a&gt;. It really helps!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse our website and try our (modest) playground at &lt;a href="http://www.portialabs.ai/" rel="noopener noreferrer"&gt;www.portialabs.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Head over to our docs at &lt;a href="https://docs.portialabs.ai/" rel="noopener noreferrer"&gt;docs.portialabs.ai&lt;/a&gt; or get immersed in our &lt;a href="https://github.com/portiaAI/portia-sdk-python/tree/main/portia/open_source_tools" rel="noopener noreferrer"&gt;SDK&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join the conversation on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Watch us embarrass ourselves on our &lt;a href="https://www.youtube.com/@PortiaAI" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://www.producthunt.com/posts/portia-ai" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>portia</category>
      <category>authentication</category>
      <category>agents</category>
    </item>
    <item>
      <title>Start building authenticated and predictable agents with Portia AI</title>
      <dc:creator>Portia AI</dc:creator>
      <pubDate>Tue, 04 Mar 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/portia-ai/start-building-authenticated-and-predictable-agents-with-portia-ai-3flh</link>
      <guid>https://dev.to/portia-ai/start-building-authenticated-and-predictable-agents-with-portia-ai-3flh</guid>
      <description>&lt;p&gt;Tired of your &lt;a href="https://www.evidentlyai.com/blog/ai-failures-examples" rel="noopener noreferrer"&gt;AI agents going off the rails&lt;/a&gt;? Well look no further 😅! We are releasing an open source developer framework that allows you to build agents that pre-express their actions, share their progress and can be interrupted by a human.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk9bvbe1h2jhac8kr0km.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpk9bvbe1h2jhac8kr0km.jpg" alt="Expectations vs. reality" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Portia AI was born from our tinkering together with Fintech co-pilots (closer to our home turf). We both believed that AI agents represented a paradigm shift in how software interacts with users and their environment. Amongst other major changes in the past year, AI has become a primary user interaction layer, as evidenced by the overwhelming focus of traditional players in the space (e.g. &lt;a href="https://www.salesforce.com/uk/news/press-releases/2024/12/17/agentforce-2-0-announcement" rel="noopener noreferrer"&gt;Salesforce’s Agentforce 2.0 press release&lt;/a&gt;); the shift from “Buy” to “Build” in the SaaS space is accelerating as folks like &lt;a href="https://www.inc.com/sam-blum/klarna-plans-to-shut-down-saas-providers-and-replace-them-with-ai.html" rel="noopener noreferrer"&gt;Klarna leverage AI to automate across functions&lt;/a&gt;; and web agents and multimodal models are on the rise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem space&lt;a href="https://blog.portialabs.ai/we-are-live#the-problem-space" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;We were inspired by the explosion of AI-powered use cases, but the challenges we encountered as we tinkered were also sobering. Through this experience, and conversations with other developers, we homed in on the following pain points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Planning&lt;/strong&gt;: Many use cases require visibility into the LLM’s reasoning, particularly for complex tasks requiring multiple steps and tools. LLMs also struggle to pick the right tools as their tool set grows: a recurring limitation for production deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt;: Tracking an LLM’s progress mid-task is difficult, making it harder to intervene when guidance is needed. This is especially critical for enforcing company policies or correcting hallucinations (hello, missing arguments in tool calls!).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt;: Existing solutions often disrupt the user experience with cumbersome authentication flows or require pre-emptive, full access to every tool – an approach that doesn’t scale for multi-agent assistants.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Our proposed solution&lt;a href="https://blog.portialabs.ai/we-are-live#our-proposed-solution" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;While AI engineers with deep expertise have been hacking their way through these issues, we wanted to democratise the solutions for all developers with Portia AI. As a first step, we are offering an open source &lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;GitHub repo (↗)&lt;/a&gt;, augmented with elective cloud-hosted features to help speed up deployments, accessible from the &lt;a href="https://app.portialabs.ai/" rel="noopener noreferrer"&gt;Portia dashboard (↗)&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pre-expressed plans:&lt;/strong&gt; Our open source planning agent guides your LLM to produce an explicit Plan in response to a prompt, weaving the relevant tools, inputs, and outputs for every step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful, controllable agents:&lt;/strong&gt; Portia will spin up a PlanRun and a series of execution agents to implement the generated plans and track the run state throughout execution. Using our Clarification abstraction, you can define points where you want to take control of plan runs, e.g. to resolve missing information or multiple-choice decisions. Portia serialises the PlanRun state, and you can manage its storage/retrieval yourself or use our cloud offering for simplicity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensible, authenticated tool calling:&lt;/strong&gt; Bring your own tools on our extensible Tool abstraction, or use our growing plug and play authenticated tool library, which will include a number of popular SaaS providers over time (Google, Slack, Zendesk, Github etc.). All Portia tools feature just-in-time authentication with token refresh, offering security without compromising on user experience.&lt;/li&gt;
&lt;/ol&gt;
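
&lt;p&gt;To give a flavour of the Clarification idea, here is a toy Python model of a plan run that pauses on a missing argument. This is deliberately not the real Portia SDK API – every class and field name below is illustrative only (see the &lt;a href="https://docs.portialabs.ai/" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for the real thing):&lt;/p&gt;

```python
# Toy model of the Clarification idea; the real Portia SDK differs, so treat
# every class and field name here as illustrative only.

from dataclasses import dataclass, field

@dataclass
class Clarification:
    step: int
    prompt: str
    response: str = ""

@dataclass
class PlanRun:
    plan: list
    step: int = 0
    clarifications: list = field(default_factory=list)
    state: str = "IN_PROGRESS"

def execute(run, answers):
    while run.step != len(run.plan):
        step = run.plan[run.step]
        missing = [a for a in step["args"] if a not in answers]
        if missing:
            # Raise a structured question instead of hallucinating a value.
            run.clarifications.append(
                Clarification(run.step, f"Please provide: {missing[0]}"))
            run.state = "NEED_CLARIFICATION"
            return run
        run.step += 1
    run.state = "COMPLETE"
    return run

run = execute(PlanRun([{"tool": "refund", "args": ["order_id"]}]), {})
run = execute(run, {"order_id": "A123"})   # human answered; run resumes
print(run.state)   # COMPLETE
```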

&lt;h3&gt;
  
  
  Intrigued?
&lt;/h3&gt;

&lt;p&gt;Give our live playground a try on &lt;a href="https://www.portialabs.ai/" rel="noopener noreferrer"&gt;our website&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;It’s early days – for us and for the ecosystem at large. Everything from LLM reasoning to authentication and APIs in the age of AI agents is evolving rapidly. With Portia AI, we want to help developers ride this wave of innovation by combining intelligence, autonomy, and security.&lt;/p&gt;

&lt;p&gt;If this resonates with you, let’s connect on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel &lt;/a&gt;. We’re building and iterating based on feedback from our community, and we’d love to hear your thoughts. Together, let’s tackle the gnarly challenges standing in the way of the agentic future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Emma &amp;amp; Mounir&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Join the conversation&lt;a href="https://blog.portialabs.ai/we-are-live#join-the-conversation" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/portiaAI/portia-sdk-python" rel="noopener noreferrer"&gt;Give us a ⭐ on GitHub&lt;/a&gt; – it really helps!&lt;/li&gt;
&lt;li&gt;Browse our website and try our (modest) playground at &lt;a href="http://www.portialabs.ai/" rel="noopener noreferrer"&gt;www.portialabs.ai&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Head over to our docs at &lt;a href="https://docs.portialabs.ai/" rel="noopener noreferrer"&gt;docs.portialabs.ai&lt;/a&gt; or get immersed in our &lt;a href="https://github.com/portiaAI/portia-sdk-python/tree/main/portia/open_source_tools" rel="noopener noreferrer"&gt;SDK&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Join the conversation on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Watch us embarrass ourselves on our &lt;a href="https://www.youtube.com/@PortiaAI" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://www.producthunt.com/posts/portia-ai" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>hello</category>
      <category>portia</category>
      <category>launch</category>
    </item>
    <item>
      <title>What's next for Browser Agents? 🤔</title>
      <dc:creator>Portia AI</dc:creator>
      <pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate>
      <link>https://dev.to/portia-ai/whats-next-for-browser-agents-2l59</link>
      <guid>https://dev.to/portia-ai/whats-next-for-browser-agents-2l59</guid>
      <description>&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;I've been tinkering with browser automation recently (e.g., building a bot to search and buy on Amazon), and Operator’s release got me thinking about the future of these tools. Here are 3 key challenges browser agents face today:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Moving from text-only to multi-modal AI models.
&lt;/li&gt;
&lt;li&gt;Solving authentication without blending in with bad bots.
&lt;/li&gt;
&lt;li&gt;Enabling human-in-the-loop collaboration that's seamless and smart.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this post we unpack these challenges, share insights, and explore what’s next for browser agents. Would you trust browser agents with your day-to-day tasks? Let me know your thoughts! 👇&lt;/p&gt;

&lt;h2&gt;
  
  
  ChatGPT Operator is out – what's next for browser agents?&lt;a href="https://blog.portialabs.ai/browser-agents#chatgpt-operator-is-out----whats-next-for-browser-agents" rel="noopener noreferrer"&gt;​&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;AI start-ups building browser agents must be losing sleep over Operator’s release 😱. We recently tinkered with browser agents ourselves to automate searching on Amazon and buying an item. After seeing some of the Operator demos, these are our three broad takeaways:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The future of browser agents is multi-modal models
&lt;/h3&gt;

&lt;p&gt;I tried building my Amazon browser agent using a “unimodal” text-based LLM to really grok that point. [For the tech savvy: I used Browserbase as my headless browser to automate browser tasks with code.] Because the LLM doesn’t understand web navigation, I had to figure out the exact structure of the webpage I wanted to automate and spoon-feed it to the LLM. Not only would a developer need to do this for every website, they’d need to revisit it every time a website changes structure. To make matters worse, my context window (the amount of data the LLM can hold in a given conversation) was constantly saturated by the sheer volume of HTML retrieved. I had to find all sorts of hacks, especially on websites like Amazon (filter for specific tags, convert to other formats like text or markdown, etc.). Operator is multi-modal, meaning it was trained on and processes both text and visual data. It can navigate a webpage dynamically and process elements visually rather than relying exclusively on verbose HTML dumps. There are a few other contenders in this space – I was able to successfully star a GitHub repository using the &lt;code&gt;browser-use&lt;/code&gt; open source framework, for example (star them on &lt;a href="https://github.com/browser-use/browser-use" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;!). You can see a GIF below of how they interpret web page layout visually. The accuracy of such models is improving fast, even though some benchmarks put them below 58% (see WebArena’s &lt;a href="https://webarena.dev/" rel="noopener noreferrer"&gt;leaderboard&lt;/a&gt;).&lt;/p&gt;
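
&lt;p&gt;As an aside, the tag-filtering hack can be sketched with nothing but the Python standard library: strip the page down to its visible text before handing it to the LLM, so markup doesn’t eat the context window. A rough, illustrative version:&lt;/p&gt;

```python
# Rough sketch of the "filter the HTML" hack: keep only visible text so the
# LLM's context window isn't saturated with markup. Standard library only.

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0            # depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = "<html><script>track()</script><h1>Deals</h1><p>Save 20%</p></html>"
print(visible_text(page))   # Deals Save 20%
```

&lt;p&gt;A real page needs more than this (links and form fields matter too), but even this crude filter shrinks the payload dramatically.&lt;/p&gt;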

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multi-modal AI models can navigate the web because they interpret website content visually as well as textually.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdgcc5kbmtffactfa9zj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdgcc5kbmtffactfa9zj.gif" alt="An animated GIF of an agent using a browser" width="800" height="687"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. There’s currently little to nothing distinguishing AI web agents (good bots!) from automated fraud (bad bots!)
&lt;/h3&gt;

&lt;p&gt;In our attempts at building browser agents, we got blocked at various points in the browsing session. Occasionally we would get blocked right at the very start and would have to use proxies and other incognito methods. Incidentally, we’ve seen &lt;a href="https://x.com/rowancheung/status/1882489741829972254" rel="noopener noreferrer"&gt;a demo shared on X&lt;/a&gt; where it seems that Operator occasionally struggles with that as well. More importantly, we could not find a way to load or fill the fields on login pages, even when we got the browser agent to hand the session over to us ☠️. Based on the demos we’ve seen, Operator solves authentication for some providers by handing control over to the user in order to complete a login (e.g. Booking.com, Thumbtack or Google Calendar &lt;a href="https://x.com/rowancheung/status/1882490355628560628" rel="noopener noreferrer"&gt;access&lt;/a&gt;). Based on the &lt;a href="https://openai.com/index/introducing-operator/" rel="noopener noreferrer"&gt;press release&lt;/a&gt;, they are relying on bespoke partnerships with those domains to identify Operator-managed browsing sessions and allow them to proceed "while respecting established norms". For now, none of the other multi-modal frameworks we’ve seen have an easy answer to this problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ecosystem needs a reliable security standard to establish this handshake between browser agents and websites. We envision this looking like an adapted version of delegated OAuth, and it will hopefully level the playing field for startups and make browser agents safer.&lt;/p&gt;
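
&lt;p&gt;To illustrate what such a handshake could involve (purely speculative – the field names and key exchange below are invented for the sketch), the mechanics might resemble a signed assertion that the website can verify:&lt;/p&gt;

```python
# Speculative sketch of an agent/website handshake: the agent platform signs
# an assertion of who the agent acts for and what it may do, and the site
# verifies it. Field names and the shared-secret setup are invented here.

import hashlib, hmac, json

SECRET = b"shared-between-platform-and-site"   # stand-in for real key exchange

def mint_agent_token(user, scopes):
    claims = json.dumps({"acting_for": user, "scopes": scopes, "agent": True})
    sig = hmac.new(SECRET, claims.encode(), hashlib.sha256).hexdigest()
    return claims, sig

def site_accepts(claims, sig, needed_scope):
    expected = hmac.new(SECRET, claims.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sig):
        return False              # forged or tampered: treat like a bad bot
    payload = json.loads(claims)
    return payload["agent"] and needed_scope in payload["scopes"]

claims, sig = mint_agent_token("emma@example.com", ["bookings.read"])
print(site_accepts(claims, sig, "bookings.read"))    # True
print(site_accepts(claims, sig, "payments.write"))   # False
```

&lt;p&gt;A production version of this would need asymmetric keys and standardised claims, which is exactly why it wants to be an open standard rather than bespoke partnerships.&lt;/p&gt;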

&lt;h3&gt;
  
  
  3. We’re missing a structured way to handle the back and forth between a web agent and a user
&lt;/h3&gt;

&lt;p&gt;While frameworks for building AI agents are starting to support human intervention (aka “human in the loop”), none of the well-known browser agent frameworks offer a structured way to handle the back and forth between the agent and the human user. Operator definitely stands out in that regard, as you can see in this flight booking &lt;a href="https://x.com/rowancheung/status/1882490129924624700" rel="noopener noreferrer"&gt;example&lt;/a&gt;. It’s not clear what the user experience would be like if the user was not immediately available to guide the LLM. Would it be able to “save its progress” and resume once the user responds? Can the human pre-emptively define conditions where the web agent should consult them, beyond the obvious ones like making a payment, e.g. “if you find any offers for travel insurance during the booking process, let me know so I can explore them before making a decision”?&lt;/p&gt;
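
&lt;p&gt;One way such pre-emptive conditions could be expressed (a hypothetical sketch, not any existing framework’s API) is as plain predicates over the agent’s observations:&lt;/p&gt;

```python
# Hypothetical sketch: the user registers "consult me if..." conditions as
# plain predicates, and the agent pauses whenever one fires on what it sees.

def make_checkpoints():
    return [
        ("payment", lambda obs: obs.get("action") == "pay"),
        ("insurance offer",
         lambda obs: "travel insurance" in obs.get("page_text", "")),
    ]

def needs_human(obs, checkpoints):
    # Return the labels of every condition the current observation triggers.
    return [label for label, predicate in checkpoints if predicate(obs)]

cps = make_checkpoints()
print(needs_human({"page_text": "Add travel insurance for 12 GBP?"}, cps))
# ['insurance offer']
```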

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Resilient and flexible human-in-the-loop support needs to be a core feature of browser agents to ensure that users have as much control as possible over the actions of their browser agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you feel about browser agents? Would you trust them? Would you use them more broadly to assist you in your day-to-day life? Let us know in the comments below, or join us on Discord and share your thoughts 🙏&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Join the conversation​&lt;a href="https://blog.portialabs.ai/browser-agents#join-the-conversation" rel="noopener noreferrer"&gt;&lt;/a&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Join the conversation on our &lt;a href="https://discord.gg/DvAJz9ffaR" rel="noopener noreferrer"&gt;Discord channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Watch us embarrass ourselves on our &lt;a href="https://www.youtube.com/@PortiaAI" rel="noopener noreferrer"&gt;YouTube channel&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Follow us on &lt;a href="https://www.producthunt.com/posts/portia-ai" rel="noopener noreferrer"&gt;Product Hunt&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>hello</category>
      <category>portia</category>
      <category>browseragents</category>
      <category>authentication</category>
    </item>
  </channel>
</rss>
