Mounir Mouawad for Portia AI


Code vs LLM in a simple planning poker agent example

If you're building AI agents, chances are you've often had to consider how much logic to handle through the LLM versus through traditional code. I wanted to share my experience with it this morning as a conversation starter and get your thoughts!

What I wanted the agent to do

I normally spend a ton of time gathering feedback from our users. In a previous life I would put those insights into tickets in Linear and burn plenty of mental cycles trying to size the return on effort to inform our prioritisation. In this bold new world of AI, I figured I would instead write up a planning poker agent to help me t-shirt size some of those tickets in Linear. Built on the Portia SDK, the agent would:

  1. Fetch relevant Linear tickets using the remote MCP server for Linear, which is one of the thousands of tools we offer with built-in auth.
  2. Simulate sizing estimates from multiple developer personas and reach a consensus on each ticket's effort sizing. For this I wanted to create a ticket estimator tool as a subclass of our LLM tool that returns estimates as structured outputs. The tool takes a context.md file where I keep a summary of the architecture and core abstractions that make up the Portia SDK, so it can ground the LLM's effort sizing. A rough sketch of the structured output models follows this list.
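
For reference, here's a minimal sketch of what the structured output models could look like. The field names are illustrative assumptions rather than the exact definitions in our repo; they're only here to make the snippets below easier to follow:
from pydantic import BaseModel, Field

class LinearTicket(BaseModel):
    """A Linear ticket as fetched by the agent (assumed shape)."""
    title: str
    description: str

class LinearTicketList(BaseModel):
    """Structured output wrapper for a batch of fetched tickets."""
    tickets: list[LinearTicket]

class PlanningPokerEstimate(BaseModel):
    """One t-shirt-size estimate for a single ticket."""
    ticket_id: str = Field(description="Linear ticket identifier")
    size: str = Field(description="T-shirt size, e.g. XS, S, M, L or XL")
    reasoning: str = Field(description="Short justification for the estimate")

class PlanningPokerEstimateList(BaseModel):
    """Structured output wrapper for the full list of estimates."""
    estimates: list[PlanningPokerEstimate]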

As it turns out, I had asked one of our devs (we'll call him Ethan) to do this and forgotten! So we both wrote the thing up at the same time, except... I relied quite heavily on the LLM to handle the task while he leaned much more heavily on code. Let's unpack how our approaches compared.

How each of us built it

Full code in our agent examples repo here.

  • 🧠 LLM-heavy: I relied on a robust prompt and the Portia planning agent to figure out the entire set of steps to take, that is, fetch and filter tickets from Linear, then get size estimates from each developer persona and average them out. Essentially I relied on the LLM itself to 1) index and aggregate the sizing estimates by Linear ticket id and persona, and 2) figure out how many tool call iterations (a.k.a. "unrolling") to make to cover every ticket id and persona combination. Here's the code snippet where the magic happens:
# Get tickets from Linear and estimate the size of the tickets
project = "Async Portia"
query = f"""Get the tickets i'm working on from Linear with a limit of 3 on the tool call. Then filter specifically for those regarding the {project} project.
    For each combination of the tickets above and the following personas, estimate the size of the ticket.
    {personas}

    Return the estimates in a list of PlanningPokerEstimate objects, with estimate sizes averaged across the personas for each ticket.
    """
estimates = portia.run(
    query=query,
    structured_output_schema=PlanningPokerEstimateList,
).outputs.final_output.value.estimates 
  • ๐Ÿง‘๐Ÿปโ€๐Ÿ’ป Code-heavy: Ethan on the other hand figured that we don't really need to rely on the LLM, neither for planning nor for indexing / aggregating / iterating on estimates. Instead he used Portia's declarative PlanBuilder interface to enumerate the steps and tool calls needed. He fetched the tickets using a first Portia plan run into LinearTicket objects using structured outputs. To generate sizing estimates, he then iterated with conventional code over each developer persona and over each ticket element in the list returned from the previous plan run. Each iteration called the ticket estimator tool in a single step Portia plan run. Here's a code snippet containing both the ticket fetching plan run and the ticket sizing iterations:
# Fetch Linear tickets
project = "Async SDK"
query = f"Get the tickets i'm working on from linear regarding the {project} project"
plan = PlanBuilder(
    query, structured_output_schema=LinearTicketList
).step(
    query + " and only call the tool with a limit of 3", tool_id="portia:mcp:mcp.linear.app:list_my_issues"
).step(
    f"Filter the tickets to only include specifically the ones related to {project}", tool_id="llm_tool"
).build()
plan_run = portia.run_plan(plan)
tickets = plan_run.outputs.final_output.value.tickets

# Iterate over tickets and persona to generate estimates
for ticket in tickets:
    estimates = []
    estimate_plan = PlanBuilder(
        "estimate the size of the ticket", structured_output_schema=PlanningPokerEstimate
    ).step(f"Estimate the size of the ticket: {ticket.title}\n\n{ticket.description}", tool_id="ticket_estimator_tool").build()
    for persona in personas:
        context = f"""
        {persona}
        {tool_context}
        """
        estimate_tool.tool_context = context
        portia.tool_registry.with_tool(estimate_tool, overwrite=True)
        estimate = portia.run_plan(estimate_plan)
        if estimate.state == PlanRunState.COMPLETE:
            estimates.append(estimate.outputs.final_output.value)
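
The snippet above stops at collecting the per-persona estimates for each ticket; the consensus step isn't shown. Here's a rough sketch of how that averaging could look; SIZE_ORDER and consensus_size are illustrative names rather than code from the repo:
# Illustrative consensus step: map t-shirt sizes onto a numeric scale, average the
# persona estimates collected for a ticket, and round back to a size.
SIZE_ORDER = ["XS", "S", "M", "L", "XL"]

def consensus_size(ticket_estimates: list[PlanningPokerEstimate]) -> str:
    scores = [SIZE_ORDER.index(e.size) for e in ticket_estimates if e.size in SIZE_ORDER]
    if not scores:
        return "unknown"
    return SIZE_ORDER[round(sum(scores) / len(scores))]

Called at the end of each ticket's loop iteration (e.g. consensus_size(estimates)), this collapses the persona estimates into a single size per ticket.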

What we learned

Let's compare both approaches side by side and draw some conclusions. I hooked up LangSmith to Portia for observability so I could obtain the metrics shown below.

| Metric | LLM-heavy | Code-heavy |
| --- | --- | --- |
| Effort | Lowest | Highest |
| Total tokens | 70k | 30k |
| Cost | $0.12 | $0.06 |
| Latency [P99] | 28.95s | 9.70s |
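
As an aside, here's roughly how LangSmith tracing is commonly switched on via environment variables. Whether Portia picks these up out of the box depends on how its model calls are instrumented, so treat this as an assumption rather than the exact wiring we used:
import os

# Standard LangSmith tracing environment variables (assumption: the underlying LLM
# clients honour LangChain-style tracing configuration).
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "planning-poker-agent"  # illustrative project name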

So what conclusions can we draw from this exercise?
💡 Reliability: You can trust your Portia agents to figure out the right sequence of steps and to unroll (iterate on) the tool calls correctly, which definitely simplifies development, kinda like a form of vibe coding... but much like vibe coding it does take a bit of 'LLM-whispering' (a.k.a. prompt engineering) and using the right underlying model. For plan runs with heavy iteration expectations in particular, you will need robust eval sets in place to keep tabs on reliability, lest you aim for a Mona Lisa and end up with a Picasso (a rough sketch of the kind of check I mean follows below).
👣 Traceability: Relying on the LLM to handle planning and execution to the extent I did makes tracing particularly easy. A single plan run in the Portia dashboard showed me the entirety of the work done by the underlying subagents. This also makes revisiting the output of the plan run easier, of course. Ethan, on the other hand, ended up with numerous plan runs, which makes auditing and/or debugging harder.
💸 Cost: As you'd expect, the LLM-heavy method is slower and costlier. We're presumably still processing the same amount of context (same number of tickets and estimations), but the overhead of passing a growing context window across all execution agents during the plan run adds up, so the LLM-heavy run inevitably costs more and takes longer. You're also opening yourself up to the stochasticity of LLMs where plain code could do the trick.
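
Picking up on the Reliability point above, here's the flavour of eval check I have in mind for the LLM-heavy run. The expected ticket ids, and the check itself, are made up for illustration:
# Illustrative eval check (made-up expectations): the LLM-heavy run should return
# exactly one averaged estimate per expected ticket, each with a valid t-shirt size.
EXPECTED_TICKET_IDS = {"POR-101", "POR-102", "POR-103"}  # hypothetical ids
VALID_SIZES = {"XS", "S", "M", "L", "XL"}

def check_llm_heavy_run(estimates: list[PlanningPokerEstimate]) -> None:
    assert {e.ticket_id for e in estimates} == EXPECTED_TICKET_IDS, "missing or extra tickets"
    assert all(e.size in VALID_SIZES for e in estimates), "estimate outside the allowed sizes"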

A parting thought

One aspect I don't consider in the comparison above is autonomy. Because the task is neatly scoped in this example (planning poker agent = fetch and filter tickets + estimate per persona + summarise consensus), you can make the argument that at production scale one should restrict LLM usage to the tasks that traditional code can't handle as easily (e.g. natural language processing). BUT where inputs from the environment change or the scope of the task is fluid, the LLM-heavy approach truly thrives. I'll try to tease that out more explicitly in a subsequent post.
๐Ÿ‘‰๐Ÿผ If you're interested please shout in the comments down below!

About Portia

Portia AI is an open-source framework for building predictable, stateful, authenticated agentic workflows.

We allow developers to have as much or as little oversight as they'd like over their multi-agent deployments, and we are obsessively focused on production readiness.

We invite you to play around with our SDK, break things, and tell us how you're getting on in Discord.
