<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vinicius Dallacqua</title>
    <description>The latest articles on DEV Community by Vinicius Dallacqua (@viniciusdallacqua).</description>
    <link>https://dev.to/viniciusdallacqua</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F183438%2Fa85a4fc1-354f-480e-b52e-4999469e95ea.jpg</url>
      <title>DEV Community: Vinicius Dallacqua</title>
      <link>https://dev.to/viniciusdallacqua</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/viniciusdallacqua"/>
    <language>en</language>
    <item>
      <title>Agents That Build Agents — Building Autonomous Browsing with Claude Code</title>
      <dc:creator>Vinicius Dallacqua</dc:creator>
      <pubDate>Tue, 10 Feb 2026 09:18:47 +0000</pubDate>
      <link>https://dev.to/viniciusdallacqua/agents-that-build-agents-building-autonomous-browsing-with-claude-code-pn5</link>
      <guid>https://dev.to/viniciusdallacqua/agents-that-build-agents-building-autonomous-browsing-with-claude-code-pn5</guid>
      <description>&lt;h1&gt;
  
  
  Agents That Build and Improve Agents — Building Autonomous Browsing with Claude Code
&lt;/h1&gt;

&lt;p&gt;There's a way of working with AI agents that's becoming a common pattern. But it requires a different posture, one that feels unnatural to most developers: a willingness to be comfortable with not knowing exactly where you're going. Let me explain.&lt;/p&gt;

&lt;p&gt;I've spent the last year researching and experimenting with agents, building my own agent specialized in performance based on a fork of DevTools. And since the holidays last year, after much research, I felt it was time to venture into a larger scope project: the browser. I felt that there was more to agents and browsers and the future of the web than what I saw in the currently available options and decided to dig deep into the problem myself.&lt;/p&gt;

&lt;p&gt;After setting up some of the initial service layers, it was time to build the automation service layer for it. I've grown to love using Claude Code over a regular IDE, and since the early release of sub-agents some emerging patterns began to form in my workflow. But let's begin by better describing what I'm building this time around.&lt;/p&gt;

&lt;p&gt;This new side-project is meant to be the convergence of the best personal assistant and a browser, and an experiment with the future of the web and interfaces.&lt;/p&gt;

&lt;p&gt;It has many layers, like a gen-AI pipeline for reimagining websites while keeping their branding, and a multilayered memory system that abstracts interests over time. But for this writeup I'll describe the early R&amp;amp;D for the autonomous browsing service. The concept for this service is no different from any current AI browser: the user makes a request, the browser navigates and executes a plan, and reports back with what it found.&lt;/p&gt;

&lt;p&gt;After working for five months at a company that handles browser automation for agentic QA (QA.tech), I now have a better sense of what it takes to drive browsers with the help of AI agents. Working with codebases painstakingly built over years teaches one to appreciate the evolution of code written for such a complex task.&lt;/p&gt;

&lt;p&gt;So I was curious how a codebase built for a similar purpose would be conceived nowadays, when models and tools have evolved significantly in capability. So I decided to try a different approach, one where I was genuinely unsure how well it would work.&lt;/p&gt;

&lt;p&gt;What surprised me wasn't that it worked. It was how quickly I got to an MVP and how much I learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sandboxes as a learning ground
&lt;/h2&gt;

&lt;p&gt;Developers are getting used to thinking of, and using, AI assistants as execution engines: you know what you want, you describe it, and the AI writes the code. Then you review, iterate and ship.&lt;/p&gt;

&lt;p&gt;What I ended up leaning towards instead was treating the agent as a research partner, where you have a direction, not necessarily a destination.&lt;/p&gt;

&lt;p&gt;Here's a concrete example. When I built the first draft of the browsing agent, I was trying to make it create pages in Notion without hardcoded instructions. The obvious first approach would be to prompt your way into teaching the agent what to click and what to look for, so it could understand how to drive the session to a given goal.&lt;/p&gt;

&lt;p&gt;It failed. The agent was incapable of even getting to the new page after many tries, and when it finally did manage it could not do much more than enter a title. Notion was picked as a testing ground because of how complex its underlying DOM structure and interactions are.&lt;/p&gt;

&lt;p&gt;The first session tried to type into buttons. It pasted full markdown content, including the title, twice. It didn't verify anything between actions. It assumed structure instead of observing it.&lt;/p&gt;

&lt;p&gt;If you've used &lt;a href="https://agent-browser.dev/" rel="noopener noreferrer"&gt;agent-browser by Vercel&lt;/a&gt; you probably noticed that the 'baseline' intelligence of agents driving web pages is quite powerful now. With nothing but a simple &lt;code&gt;--help&lt;/code&gt; and a few iterations, agents can learn how to drive web pages without much prompting at all. But when you are building a service, you don't have the luxury of letting the agent spend a few iterations learning how to use the browser every time. That's what system prompts are for.&lt;/p&gt;

&lt;p&gt;So I tried something different. Instead of telling the agent what to do, I asked: what if the agent could figure it out? I know I just said I don't have that luxury when a user is trying to use it. But what if I could pre-compute that?&lt;/p&gt;

&lt;p&gt;So "the agent" in this case became actually not just one but many. I used Claude Code with &lt;a href="https://agent-browser.dev/" rel="noopener noreferrer"&gt;agent-browser&lt;/a&gt; CLI and tasked Opus to a session where it was the orchestrator, leading Sonnet models to drive CDP sessions based on tasks with a goal and a few vague step suggestions based on it. Nothing trying to 'tell' the browsing agent what to do, but instead just some tips to steer it towards the goal, but nothing technical or too scripted. The goal was to 'distil' the accumulated knowledge into designing the right CoT prompt and tools to drive a CDP session. Creating evals along the way. And these would in turn be used by Gemini 3 Flash as the actual 'brains' of the service. So the first sessions became fixtures to design for evals that ended up being ran by Gemini to guague and measure the iterations, but the design of it and 'lab' version of it would be entirely via Claude Code via different scritps to allow for some basic tokenomics and each sub-agent reporting their 'experience' along the way.&lt;/p&gt;

&lt;p&gt;After a few iterations, based on how the sub-agents experienced the session, they reported back to Opus, and together they created a chain-of-thought prompt and tools that successfully drove some basic interactions around Notion. Without any input from me on specifically how to drive the session or what to look out for. The curious part was that they arrived at a mental model very similar to how I've approached similar problems at work and elsewhere.&lt;/p&gt;

&lt;p&gt;This was quite a moment for me.&lt;/p&gt;

&lt;p&gt;I had 3-5 sub-agents independently trying to find their way around Notion to execute the same task. Each one exploring, failing, learning. Then reporting back what worked.&lt;/p&gt;

&lt;p&gt;And from that chaos, a pattern emerged. Not one I designed, one they collectively discovered.&lt;/p&gt;

&lt;p&gt;For the more experienced out there, questions about overfitting as an outcome might come up. Though I cannot be 100% certain just yet without running a larger experiment, the design and guardrails were structured to avoid it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://calendar.perfplanet.com/2025/teaching-agents-about-performance-insights/" rel="noopener noreferrer"&gt;I had 'hand written' code for agents before&lt;/a&gt;, where I've forked and extracted parts of DevTools to implement my own [Performance specialized agent(&lt;a href="https://github.com/PerfLab-io/perfagent" rel="noopener noreferrer"&gt;https://github.com/PerfLab-io/perfagent&lt;/a&gt;). But this was different, the mindset and process, and the time of iteration was something I had to think of and rebuild my entire development process around.&lt;/p&gt;

&lt;h2&gt;
  
  
  From coding agent to research assistant
&lt;/h2&gt;

&lt;p&gt;The sub-agents' findings were more structured than what I first started with, following a certain pattern. One that looks familiar to anyone building for browser automation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OBSERVE → REASON → ACT → VERIFY&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;OBSERVE:&lt;/strong&gt; Take a snapshot to understand current state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;REASON:&lt;/strong&gt; Decide what to do based on what's actually there&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACT:&lt;/strong&gt; Perform ONE action only; the action itself might be a combination of keystrokes or commands, such as Ctrl + Z for undo.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VERIFY:&lt;/strong&gt; Take a fresh snapshot to confirm the result&lt;/li&gt;
&lt;/ol&gt;
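
&lt;p&gt;The loop can be sketched roughly like this. The &lt;code&gt;FakeBrowser&lt;/code&gt; below is a made-up stand-in for a real CDP-driven session, not the actual service code:&lt;/p&gt;

```python
# Minimal sketch of the OBSERVE -> REASON -> ACT -> VERIFY loop.
# FakeBrowser and its methods are hypothetical stand-ins for a real
# CDP-driven session, used only to illustrate the shape of the loop.
from dataclasses import dataclass, field

@dataclass
class FakeBrowser:
    """Stand-in for a browser session with a tiny bit of state."""
    state: dict = field(default_factory=lambda: {"title": ""})

    def snapshot(self):
        # OBSERVE: always return a fresh view of the current state
        return dict(self.state)

    def act(self, action, value=""):
        # ACT: perform exactly one action
        if action == "type_title":
            self.state["title"] = value

def run_step(browser, goal_title):
    before = browser.snapshot()                # OBSERVE
    if before["title"] != goal_title:          # REASON about what is there
        browser.act("type_title", goal_title)  # ACT: one action only
    after = browser.snapshot()                 # VERIFY with a fresh snapshot
    return after["title"] == goal_title

browser = FakeBrowser()
assert run_step(browser, "My new page")
```

The key design choice is that VERIFY always re-observes instead of trusting that the action succeeded.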

&lt;p&gt;Opus, as the coordinating agent, arrived at a form of schema mapping actions to tool-call schemas, plus error-avoidance rules like the ones below, extracted from one of the reports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Never &lt;code&gt;fill&lt;/code&gt; a button—buttons are for clicking&lt;/li&gt;
&lt;li&gt;Never use stale refs—always refresh after mutations&lt;/li&gt;
&lt;li&gt;Never paste complex content at once—break into simple paragraphs&lt;/li&gt;
&lt;/ul&gt;
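
&lt;p&gt;As a rough sketch, rules like these can be enforced as a guard in front of every tool call. The role names and call shape below are illustrative, not the real schema:&lt;/p&gt;

```python
# Sketch of enforcing the error-avoidance rules as a tool-call guard.
# The tool names, roles, and ref format are invented for illustration.
FILLABLE_ROLES = {"textbox", "searchbox", "combobox"}

def validate_call(tool, target_role, ref, live_refs):
    """Return a rejection reason, or None if the call is allowed."""
    if tool == "fill" and target_role not in FILLABLE_ROLES:
        return "never fill a non-input element (buttons are for clicking)"
    if ref not in live_refs:
        return "stale ref: refresh the snapshot after mutations"
    return None

# Filling a button is rejected; clicking it with a live ref is allowed.
assert validate_call("fill", "button", "e1", {"e1"}) is not None
assert validate_call("click", "button", "e1", {"e1"}) is None
assert validate_call("click", "button", "e9", {"e1"}) is not None
```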

&lt;p&gt;I was expecting the pattern to arrive at something that uses more vision context to extract coordinates for actions, but the agents leaned more into a combination of signals. They very quickly 'decided' to use the accessibility tree that browsers expose for screen readers as a core signal, alongside screenshots as a validation layer.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The accessibility tree is ground truth.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That phrase came from one of the sub-agents' own analysis and notes. The agents discovered that the accessibility tree contained what they needed to understand the page. No CSS selectors, no coordinates, just the meaning embedded in the interface itself. Granted, the a11y tree is often incomplete and neglected, but in conjunction with screenshots it makes a good dual layer for driving autonomous browsing.&lt;/p&gt;
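
&lt;p&gt;To make that concrete, here is a toy sketch of flattening an accessibility tree into the role/name outline an agent can reason over. The node shape is loosely modeled on what CDP's Accessibility domain returns, but the data here is hand-written for illustration:&lt;/p&gt;

```python
# Flatten a simplified accessibility tree into indented role/name lines.
# The node dicts are invented sample data, not a real CDP payload.
def flatten(node, depth=0, out=None):
    out = out if out is not None else []
    role, name = node.get("role", ""), node.get("name", "")
    if role:
        # e.g. "button: New page"; drop the trailing colon if name is empty
        out.append("  " * depth + f"{role}: {name}".rstrip(": "))
    for child in node.get("children", []):
        flatten(child, depth + 1, out)
    return out

tree = {
    "role": "document", "name": "Notion",
    "children": [
        {"role": "button", "name": "New page"},
        {"role": "textbox", "name": "Untitled"},
    ],
}
for line in flatten(tree):
    print(line)
```

Notice that the outline carries meaning (roles and accessible names) with no selectors or coordinates at all.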

&lt;p&gt;This represents one of the most interesting arcs in web dev to me: we'll end up making web interfaces more accessible, but humans won't be doing most of the clicking around.&lt;/p&gt;

&lt;p&gt;Another important learning brought up in the notes was that &lt;strong&gt;"Observation isn't free, but mistakes are more expensive"&lt;/strong&gt;. The cost of taking an extra snapshot is negligible. The cost of acting on stale information and course-correcting is far higher.&lt;/p&gt;

&lt;p&gt;One of the agents noted that "Element refs are ephemeral" and to "Treat them as valid only until the next mutation": as it interacted with the dynamic Notion interface, the same textbox might have a different ref number after pressing 'Enter'.&lt;/p&gt;
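
&lt;p&gt;A toy illustration of why this matters: if the driver assigns fresh ref ids on every snapshot, any ref captured before a mutation silently stops pointing at the 'same' element. The snapshot format here is invented:&lt;/p&gt;

```python
# Toy demonstration of ephemeral element refs: each snapshot assigns
# fresh ref ids, so a ref is only valid until the next mutation.
import itertools

_counter = itertools.count(1)

def snapshot(elements):
    """Assign fresh refs to the current elements, as a driver might."""
    return {f"e{next(_counter)}": el for el in elements}

page = ["textbox 'Untitled'"]
before = snapshot(page)

# A mutation (e.g. pressing Enter) re-renders the page...
page = ["textbox 'Untitled'", "textbox ''"]
after = snapshot(page)

# ...and the 'same' textbox now lives under a different ref.
assert set(before).isdisjoint(set(after))
```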

&lt;p&gt;Using Claude Code as a research lab, with a lead agent orchestrating smaller agents in a form of 'knowledge distillation' and autonomous learning, was not in my playbook until not so long ago. I had a hunch from experience that generic approaches might beat specific ones, that certain patterns are powerful. But to see those &lt;a href="https://x.com/karpathy/status/1997759548543947249" rel="noopener noreferrer"&gt;'simulation engines'&lt;/a&gt; arriving at their own conclusions was very exciting, and something I used for other more advanced features that will become the topic of future writeups.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moving with the process
&lt;/h2&gt;

&lt;p&gt;There's a passage from &lt;em&gt;Dune&lt;/em&gt;, the movie adaptation's version of the quote:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The mystery of life isn't a problem to be solved, but a reality to experience; a process that cannot be understood by stopping it. We must move with the flow of the process, we must join it, we must flow with it."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Though the irony of the setting the series is based on does not escape me, this feels surprisingly relevant to this new moment.&lt;/p&gt;

&lt;p&gt;The instinct most developers have is to control the process. Trust nothing you can't immediately verify. This makes sense when you're writing deterministic code. But it might slow you down when you're working with systems that are more autonomous by nature.&lt;/p&gt;

&lt;p&gt;The better approach was to join the flow. Set a direction, provide constraints, then watch what patterns emerge.&lt;/p&gt;

&lt;p&gt;I've seen people question more than once whether we'll end up 'dumber' or less capable than before as engineers. And though it's hard to answer for sure what will happen, similar movements have happened before. We are moving to higher forms of abstraction, where we as software engineers are more involved in architecting, researching and validating than in directly writing code, for the most part. Ensuring the correct abstractions and safeguards for the systems, and treating implementation details more as a 'read' operation than a 'write' operation.&lt;/p&gt;

&lt;p&gt;This is quite similar to the move from lower-level languages to higher-level ones, and from focusing on standard libraries to using frameworks, as we consolidated best practices and common abstractions.&lt;/p&gt;

&lt;p&gt;As a result, we as engineers are able to focus less on (re)building basic feature details and more on complex features and capabilities, getting closer and closer to the product side.&lt;/p&gt;

&lt;p&gt;A good example is another research task where I launched a "sub-agent committee" to find better ways to handle handover between agents (the conversation and browsing agents). I used four parallel research agents investigating different aspects of the same problem.&lt;/p&gt;

&lt;p&gt;The premise was simple: the imbalance between input and output tokens is quite significant. So if you can avoid loading the same tokens into context and 'repeating' them on handover between agents, you arrive at a more structured, faster and also cheaper outcome. The problem is not the storage/handover format, but how the receiving agent consumes the context.&lt;/p&gt;
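
&lt;p&gt;A minimal sketch of the idea, assuming a pointer-based handover: store the transcript once outside the context window and hand over only a short summary plus a key to fetch details on demand. All names here are illustrative, not the project's actual design:&lt;/p&gt;

```python
# Sketch of pointer-based handover between agents. Instead of replaying
# the full transcript into the next agent's context, store it once and
# hand over a short summary plus a session key. Names are illustrative.
STORE = {}

def handover(session_id, transcript, summary):
    STORE[session_id] = transcript          # stored once, outside context
    return {"summary": summary, "session": session_id}

def fetch_detail(pointer, index):
    """The receiving agent pulls only the steps it actually needs."""
    return STORE[pointer["session"]][index]

transcript = [f"step {i}: ..." for i in range(100)]
pointer = handover("s1", transcript, "Created the Notion page; 100 steps.")

# The handover payload stays tiny relative to the full transcript.
assert len(str(transcript)) > len(str(pointer)) * 10
assert fetch_detail(pointer, 42) == "step 42: ..."
```

This is the same spirit as progressive disclosure: context is revealed to the receiving agent only when it asks for it.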

&lt;p&gt;The more autonomous process meant I could iterate with an orchestration layer of sub-agents over different possible alternatives to arrive at the best possible implementation. Part of the discovery came from progressive disclosure and how SKILLS.md leverages this principle. But the details will be further explored in another writeup as well.&lt;/p&gt;

&lt;p&gt;These insights emerged because I let them emerge. I defined the problem space, designed experiments to prove or disprove different premises, provided the tools, and reviewed the outcome on the other side.&lt;/p&gt;

&lt;p&gt;Both research efforts involved designing features and patterns more complex than some of the code I've written by hand before, and I arrived at an early draft much sooner than I would have otherwise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thinking machines and the future
&lt;/h2&gt;

&lt;p&gt;Working with agents as research partners changes something subtle but important: your relationship to not knowing the outcome upfront.&lt;/p&gt;

&lt;p&gt;Developers tend to treat uncertainty as a problem to eliminate. You research until you know, then you build. Knowledge precedes action.&lt;/p&gt;

&lt;p&gt;But some problems we'll be trying to solve on these new systems can't be known in advance. The solution space is too large and the interactions are too complex. You have to act your way into understanding.&lt;/p&gt;

&lt;p&gt;This is where agents become a multiplying factor. They let you explore faster than your own 'context window' allows. They're the external actor that makes iteration possible at the speed insight requires.&lt;/p&gt;

&lt;p&gt;There's a common anxiety in tech right now: what happens to developers when AI can code?&lt;/p&gt;

&lt;p&gt;Well, it already can (and for quite some time now). So what then?&lt;/p&gt;

&lt;p&gt;I already see very smart people arriving at the same conclusion: the craft is evolving, not dying.&lt;/p&gt;

&lt;p&gt;Ten years ago, the craft was knowing syntax and language quirks. Then it was knowing frameworks and ecosystems. Then it was knowing distributed systems and infrastructure.&lt;/p&gt;

&lt;p&gt;Now it's knowing how to work with agents in your workflow and understanding how to better shape context for the best outcome. It's knowing when to specify and when to explore, developing intuition for what agents are good at and where they need guidance.&lt;/p&gt;

&lt;p&gt;The result is not about how much better I would have done it myself, but how much more I can experiment with, how reproducible 'good' code can be, and how fast I can iterate over ideas.&lt;/p&gt;

&lt;p&gt;It's still the same craft, but different.&lt;/p&gt;

&lt;p&gt;So experiment away.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>The time for the browsers is here</title>
      <dc:creator>Vinicius Dallacqua</dc:creator>
      <pubDate>Thu, 29 Jan 2026 11:32:54 +0000</pubDate>
      <link>https://dev.to/viniciusdallacqua/the-time-for-the-browsers-is-here-2f2p</link>
      <guid>https://dev.to/viniciusdallacqua/the-time-for-the-browsers-is-here-2f2p</guid>
      <description>&lt;p&gt;The web remembers everything about itself and almost nothing about you.&lt;/p&gt;

&lt;p&gt;Think about that for a second. Every website you visit knows exactly what it wants to show you. The web became more one-sided, more about labeling and grouping you into a certain bucket. But from your perspective, the person staring at a screen, the web is an endless stream of things that were mostly not designed for you. They were designed at you.&lt;/p&gt;

&lt;p&gt;We've gotten used to this. We open a browser, go to a page, consume what's there and move on. Maybe we bookmark something, or too many things...we've all been there. Maybe we forget to. Either way, the browser itself doesn't care. It doesn't get what you're interested in, what you've been researching, or that you went down that rabbit hole about generative design systems and never quite found the article that tied it all together. It also does not care if you prefer things more visual and less walls of text (as someone writing one right now, I know some of you may bail when you finish reading this sentence). But maybe there's a new way to think about content delivery, the nature of the web and web browsers in general.&lt;/p&gt;

&lt;p&gt;The browser is one of the most used pieces of software in the world (web views also count, ok). And it has no memory (just yet). We are starting to see movement towards the future of the web and software as a whole, but let me talk about the vision.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap most are missing
&lt;/h2&gt;

&lt;p&gt;There's a lot of conversation happening right now about AI agents, agentic interfaces, MCP, and the future of software. Most of it focuses on what AI &lt;em&gt;can do&lt;/em&gt;: execute tasks, generate code, automate workflows. The discourse is dominated by capability. What can the model do? How many tools can it call? How autonomous can it be?&lt;/p&gt;

&lt;p&gt;But there's a gap in this conversation. A big one.&lt;/p&gt;

&lt;p&gt;How can we use these new capabilities to provide entirely new and more personal ways to think about and consume software? Software that changes completely based on who you are, what you care about, and how your interests evolve over time. Not something that tries to categorize you to feed the same group the same thing, but something that actually helps you get the most out of it as it gets to know your preferences, taste and personality.&lt;/p&gt;

&lt;p&gt;There is a lot of focus on execution, not understanding. On tasks, not people.&lt;/p&gt;

&lt;p&gt;I've spent years obsessing about web performance and talking about it. Talking about the convergence of product and performance metrics, what it means to serve the users the best and most delightful experiences. Caring that much about user's experience taught me something deeper: the things that matter most to users are often invisible to the systems that serve them. The gap between what users experience and what systems optimize for is where the real problems live, and I'm not just talking about metrics here.&lt;/p&gt;

&lt;p&gt;The same is true for personalization. Not the shallow "recommended for you" kind, based on what you clicked yesterday. I mean something deeper. Something that actually &lt;em&gt;knows&lt;/em&gt; and is there to serve &lt;em&gt;you&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We see parts of this vision now taking shape as we start to build these types of systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the browser
&lt;/h2&gt;

&lt;p&gt;There's a reason I keep coming back to the browser as the right home for this.&lt;/p&gt;

&lt;p&gt;Every other platform is a walled garden, and even on the web there's a rise of different closed systems. Each one optimizes for its own engagement metric and ignores everything else about you.&lt;/p&gt;

&lt;p&gt;The browser sits at the intersection of all of it. It's the one piece of software that touches every part of your digital life. And yet, for now, it's just a viewport. This current model is changing and evolving but I feel like there's a greater unrealized potential at hand.&lt;/p&gt;

&lt;p&gt;I once wrote that the browser should become "the last super-app." I think that's directionally correct, but the framing is incomplete. It's not about cramming more features into a browser. It's about making the browser actually aware of the person using it. More like an assistant that helps curate and surface things the way you like, and for the things that truly speak to you.&lt;/p&gt;

&lt;p&gt;That's a different problem. And it's the one worth solving.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does it mean for a system to remember about you?
&lt;/h2&gt;

&lt;p&gt;When I say memory, I don't mean a database with your browsing history. I mean a system that understands the difference between a passing curiosity and a genuine interest. That can tell when you've been circling a topic for days versus when you stumbled on something once and moved on.&lt;/p&gt;

&lt;p&gt;Consider how your own memory works. You don't remember every article you've ever read. But you do remember the themes that keep pulling you back. The ideas that connect across different contexts. The feeling of recognition when you encounter something that fits into a pattern you've been building, maybe without even realizing it.&lt;/p&gt;

&lt;p&gt;That's the kind of memory layer I'm interested in building for apps now.&lt;/p&gt;

&lt;p&gt;The problem is that this is genuinely hard and, like most things related to AI, something we are still figuring out. The same goes for the security layers around it and how to best balance capability with privacy.&lt;/p&gt;

&lt;p&gt;But one thing is for sure: systems that remember you and facts about you will be more enjoyable to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;I'm building something based on these ideas. An app that remembers, that develops an understanding of your interests over time and uses that understanding to surface things you'd actually care about, without you having to ask.&lt;/p&gt;

&lt;p&gt;This side-project has been my main obsession since early last year and it was being conceptualized and slowly developed over the course of last year as I experimented with PerfAgent.&lt;/p&gt;

&lt;p&gt;There's a lot more to say about how this works in practice: how interests are built, how the system understands your taste, how it presents things, how trust works when your browser starts acting on your behalf, and how we secure it.&lt;/p&gt;

&lt;p&gt;But those are topics for another post.&lt;/p&gt;

&lt;p&gt;For now, I'll leave you with this: the web was built to connect people to information, and to other people. Somewhere along the way, it became very good at connecting information to people. The direction matters.&lt;/p&gt;

&lt;p&gt;And I think it's time to flip it back.&lt;/p&gt;

</description>
      <category>design</category>
      <category>discuss</category>
      <category>ux</category>
      <category>web</category>
    </item>
    <item>
      <title>The time for the browsers</title>
      <dc:creator>Vinicius Dallacqua</dc:creator>
      <pubDate>Thu, 29 Jan 2026 11:29:59 +0000</pubDate>
      <link>https://dev.to/viniciusdallacqua/the-personal-web-experiment-2iao</link>
      <guid>https://dev.to/viniciusdallacqua/the-personal-web-experiment-2iao</guid>
      <description>&lt;p&gt;The web remembers everything about itself and almost nothing about you.&lt;/p&gt;

&lt;p&gt;Think about that for a second. Every website you visit knows exactly what it wants to show you. The one sided nature of the web has always been about labeling and grouping you into a certain bucket. But from your perspective, the person staring at a screen, the web is an endless stream of things that were never designed for you. They were designed at you.&lt;/p&gt;

&lt;p&gt;We've gotten used to this. We open a browser, go to URLs, consume what's there and move on. Maybe we bookmark something, or too many things...we've all been there. Maybe we forget to. Either way, the browser itself doesn't care. It doesn't get what you're interested in, what you've been researching, or that you went down that rabbit hole about generative design systems and never quite found the article that tied it all together. It also does not care if you prefer things more visual and less walls of text (as someone writing one right now, I know some of you may bail when you finish reading this sentence). But maybe there's a new way to think about content delivery, the nature of the web and web browsers in general.&lt;/p&gt;

&lt;p&gt;The browser is one of the most used pieces of software in the world (web views also count, ok). And it has no memory (just yet). We are starting to see movement towards the future of the web and software as a whole, but let me talk about the vision.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gap most are missing
&lt;/h2&gt;

&lt;p&gt;There's a lot of conversation happening right now about AI agents, agentic interfaces, MCP, and the future of software. Most of it focuses on what AI &lt;em&gt;can do&lt;/em&gt;: execute tasks, generate code, automate workflows. The discourse is dominated by capability. What can the model do? How many tools can it call? How autonomous can it be?&lt;/p&gt;

&lt;p&gt;But there's a gap in this conversation. A big one.&lt;/p&gt;

&lt;p&gt;How can we use these new capabilities to provide entirely new and more personal ways to think about and consume software? Software that changes completely based on who you are, what you care about, and how your interests evolve over time. Not something that tries to categorize you to feed the same group the same thing, but something that actually helps you get the most out of it as it gets to know your preferences, taste and personality.&lt;/p&gt;

&lt;p&gt;There is a lot of focus on execution, not understanding. On tasks, not people.&lt;/p&gt;

&lt;p&gt;I've spent years obsessing about web performance and talking about it. Talking about the convergence of product and performance metrics, what it means to serve the users the best and most delightful experiences. Caring that much about user's experience taught me something deeper: the things that matter most to users are often invisible to the systems that serve them. The gap between what users experience and what systems optimize for is where the real problems live, and I'm not just talking about metrics here.&lt;/p&gt;

&lt;p&gt;The same is true for personalization. Not the shallow kind "recommended for you" based on what you clicked yesterday. I mean something deeper. Something that actually &lt;em&gt;knows&lt;/em&gt; and is there to serve &lt;em&gt;you&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We see parts of this vision now taking shape as we start to build these types of systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the browser
&lt;/h2&gt;

&lt;p&gt;There's a reason I keep coming back to the browser as the right home for this.&lt;/p&gt;

&lt;p&gt;Every other platform is a walled garden, and even on the web there's a rise of different closed systems. Each one optimizes for its own engagement metric and ignores everything else about you.&lt;/p&gt;

&lt;p&gt;The browser sits at the intersection of all of it. It's the one piece of software that touches every part of your digital life. And yet, for now, it's just a viewport. This current model is changing and evolving but I feel like there's a greater unrealized potential at hand.&lt;/p&gt;

&lt;p&gt;I once wrote that the browser should become "the last super-app." I think that's directionally correct, but the framing is wrong. It's not about cramming more features into a browser. It's about making the browser actually aware of the person using it. More like an assistant that helps curate and surface things the way you like, and for the things that truly speak to you.&lt;/p&gt;

&lt;p&gt;That's a different problem. And it's the one worth solving.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does it mean for a system to remember about you?
&lt;/h2&gt;

&lt;p&gt;When I say memory, I don't mean a database with your browsing history. I mean a system that understands the difference between a passing curiosity and a genuine interest. That can tell when you've been circling a topic for days versus when you stumbled on something once and moved on.&lt;/p&gt;

&lt;p&gt;Consider how your own memory works. You don't remember every article you've ever read. But you do remember the themes that keep pulling you back. The ideas that connect across different contexts. The feeling of recognition when you encounter something that fits into a pattern you've been building, maybe without even realizing it.&lt;/p&gt;

&lt;p&gt;That's the kind of memory layer I'm interested in building for apps now.&lt;/p&gt;

&lt;p&gt;The problem is that this is genuinely hard and, like most things related to AI, something we are still figuring out. The same goes for the security layers around it and how to best balance capability with privacy.&lt;/p&gt;

&lt;p&gt;But one thing is for sure, systems that remember you and facts about you will be more enjoyable to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;I'm building something based on these ideas. An app that remembers, that develops an understanding of your interests over time and uses that understanding to surface things you'd actually care about, without you having to ask.&lt;/p&gt;

&lt;p&gt;This side-project has been my main obsession since early last year and it was being conceptualized and slowly developed over the course of last year as I experimented with PerfAgent.&lt;/p&gt;

&lt;p&gt;There's a lot more to say about how this works in practice: how interests are built, how the system understands your taste, how it presents things, how trust works when your browser starts acting on your behalf.&lt;/p&gt;

&lt;p&gt;But those are topics for another post.&lt;/p&gt;

&lt;p&gt;For now, I'll leave you with this: the web was built to connect people to information, and to other people. Somewhere along the way, it became very good at connecting information to people. The direction matters.&lt;/p&gt;

&lt;p&gt;And I think it's time to flip it back.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Long Frames and INP: Understanding Post-Load Performance</title>
      <dc:creator>Vinicius Dallacqua</dc:creator>
      <pubDate>Wed, 09 Oct 2024 10:05:23 +0000</pubDate>
      <link>https://dev.to/viniciusdallacqua/long-frames-and-inp-understanding-post-load-performance-2maa</link>
      <guid>https://dev.to/viniciusdallacqua/long-frames-and-inp-understanding-post-load-performance-2maa</guid>
      <description>&lt;p&gt;Whenever talking about application performance as a subject, it is common to think about loadtime and the metrics that are involved in that part of the user journey. While initial page load speed remains crucial, it alone is not sufficient to ensure a great user experience, as most of the user's time is spent after the page is finished loading. This is why measuring and monitoring post-load experiences is essential.&lt;/p&gt;

&lt;p&gt;For a long time even our understanding of ‘responsiveness’ was somewhat bound to load-time metrics. “How long does it take to be interactive?”, “How long does it take for the browser to respond to the user's first input?”, “How long did JS parsing block the main thread?” All of these focus on load time, making sure the browser is ready to start responding to interactions as soon as the loading stage is completed.&lt;/p&gt;

&lt;p&gt;We now have INP as a Core Web Vitals metric. It measures the responsiveness of a web page to user interactions, focusing on the delay between a user's action and the next frame shipped, providing better insight into the user's experience as they navigate and interact with your site.&lt;/p&gt;

&lt;p&gt;Alongside INP as a metric, the new Long Animation Frames API offers developers a great attribution model and a way of thinking about how to portion and divide work on the main thread. Introduced in Chrome 123, the LoAF API allows us to detect long animation frames that may cause visual jank or poor responsiveness in our applications, whether by delaying the response to an input, introducing long processing times, or creating bottlenecks in styling and layout that delay the presentation of the next frame.&lt;/p&gt;

&lt;p&gt;While INP gives us a high-level view of our application's responsiveness, LoAF provides the granular data needed to build better attributions and fix the underlying causes of poor interactivity. This combination allows developers to not only measure overall responsiveness but also to drill down into specific problematic interactions and optimize them effectively.&lt;/p&gt;

&lt;p&gt;But before we dive deeper into those new tools lets get a bit of a history lesson on how our performance metrics got here and why they evolved the way they did!&lt;/p&gt;

&lt;h2&gt;
  
  
  History of Performance Metrics and Tooling
&lt;/h2&gt;

&lt;p&gt;To fully appreciate the significance of Interaction to Next Paint (INP) and the Long Animation Frames (LoAF) API, it's crucial to understand the journey and history of how we think about performance metrics, as it reflects our growing understanding of what constitutes the key indicators and guidelines of a good user experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  The RAIL Model
&lt;/h3&gt;

&lt;p&gt;In 2015, Google introduced the RAIL model, which stands for Response, Animation, Idle, and Load. This model provided a solid, user-centric approach to performance, breaking down the user experience into key parts and giving recommendations for each:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Response&lt;/strong&gt;: Respond to user input within 100ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Animation&lt;/strong&gt;: Produce a frame in 10ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idle&lt;/strong&gt;: Maximize idle time to increase the odds of responding quickly to user input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt;: Deliver content and become interactive in under 5 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The RAIL model was a significant milestone in our journey, as it encouraged developers to think about performance in terms of user perception and interaction, not just loading speed. It was also, arguably, the first time we started thinking about interactivity and set some key thresholds to help the browser respond and ship frames on time.&lt;/p&gt;

&lt;p&gt;It defined the 50ms window per task that allows the browser to process events and ship frames for a smooth 60fps experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh0ztb4ncktrtjq139bi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh0ztb4ncktrtjq139bi.png" alt="50 ms or 100 ms? Timing tasks execution window acording to user input and frame timing window on a 60fps ‘schedule’. From: https://web.dev/articles/rail" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;50 ms or 100 ms? Timing tasks execution window according to user input and frame timing window on a 60fps ‘schedule’. From:&lt;/em&gt; &lt;a href="https://web.dev/articles/rail" rel="noopener noreferrer"&gt;https://web.dev/articles/rail&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Early days of interactivity metrics
&lt;/h3&gt;

&lt;p&gt;Soon after the RAIL model, we got a few key metrics focused on load time. Developers mostly measured how long it took for a page to fully load and become responsive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Page Load Time&lt;/strong&gt;: With metrics such as FCP, FMP and LCP&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;(first) CPU Idle Time&lt;/strong&gt;: The point at which the CPU has the first 'quiet window' after the initial page load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time to Interactive (TTI)&lt;/strong&gt;: &lt;a href="https://docs.google.com/document/d/1GGiI9-7KeY3TPqS3YT271upUVimo-XiL5mwWorDUD4c/edit#heading=h.k1h25blerz3i" rel="noopener noreferrer"&gt;Initially named 'Time to Consistently Interactive'&lt;/a&gt;, it indicates when the page was consistently able to respond to user input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First Input Delay (FID)&lt;/strong&gt;: Measured the time from when a user first interacts with a page to the time when the browser is able to respond to that interaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total Blocking Time (TBT)&lt;/strong&gt;: Lab metric that traditionally measures the total amount of time between First Contentful Paint (FCP) and Time to Interactive (TTI) where the main thread was blocked for long enough to prevent input responsiveness. This is the most common usage, though TBT is a metric that can be used over an entire session duration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TTI and FID were our first metrics more focused on interactivity, but they were still mostly attributed to how long it took the browser to download and parse assets before it could start responding to user input, and they lacked a better understanding of the different causes of poor interactivity.&lt;/p&gt;

&lt;p&gt;These metrics were steps in the right direction, but they still had limitations. FID, for instance, only measured the first interaction, which didn't necessarily reflect the overall responsiveness of the page. And Time to Interactive was complex in nature, hard to reason about, and somewhat unreliable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmt0rf86fptazi3mka5m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmt0rf86fptazi3mka5m.png" alt="How FCP, TTI and FID correlates to each other. Part of the https://web.dev/articles/fid article" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How FCP, TTI and FID correlate to each other. Part of the &lt;a href="https://web.dev/articles/fid" rel="noopener noreferrer"&gt;https://web.dev/articles/fid&lt;/a&gt; article&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;TBT was an interesting addition to the toolkit. As a lab metric, it was used by lab tools to assess the ‘total blocking time’ during the loading of the page leading up to the TTI mark. But it is not necessarily a load-time metric, as its objective is to assess and measure the impact of long tasks blocking the main thread over time. Although this metric is not directly connected to user interactions, it is a good indicator of the potential impact of long tasks blocking user interactions and visual updates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Searching for a better, user-centric, metric
&lt;/h3&gt;

&lt;p&gt;With the knowledge and experience gathered over time as the web vitals were established alongside those metrics, the Chrome team started investigating how a better, user-centric responsiveness metric could be shaped. One that observes not only the load-time but also the post-load part of the user experience, and that encompasses all of the parts that can be attributed to slow interactions. &lt;a href="https://web.dev/blog/better-responsiveness-metric#why_did_we_choose_fid" rel="noopener noreferrer"&gt;As mentioned in the article "Towards a better responsiveness metric"&lt;/a&gt;, those earlier metrics do not focus on the user experience directly, but instead on how much JavaScript runs on the page as it is loading.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fceurdkbddyqiqy1z7wez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fceurdkbddyqiqy1z7wez.png" alt="Sections of a user interaction as part of the https://web.dev/blog/better-responsiveness-metric article" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sections of a user interaction as part of the &lt;a href="https://web.dev/blog/better-responsiveness-metric" rel="noopener noreferrer"&gt;https://web.dev/blog/better-responsiveness-metric&lt;/a&gt; article&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the image above, taken from the same article, we can already see a lot of resemblance to how INP functions as a metric and identify its different parts: the input delay, the processing time, and the next frame being shipped, which together account for the full interaction duration.&lt;/p&gt;

&lt;p&gt;Nowadays you can also observe this segmentation in DevTools when inspecting the different ‘interaction’ entries on the ‘Interactions’ track of the Performance tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyk7ngfrxcgdfqoozixmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyk7ngfrxcgdfqoozixmx.png" alt="An interaction with the input delay and presentation delay beign displayed as ‘whiskers’ and the processing duration as a solid bar" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;An interaction with the input delay and presentation delay being displayed as ‘whiskers’ and the processing duration as a solid bar.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Long Animation Frames (LoAF) and INP
&lt;/h2&gt;

&lt;p&gt;The introduction of the Long Animation Frames (LoAF) API, alongside Interaction to Next Paint (INP), represents a significant improvement in how we measure and optimize interactions.&lt;/p&gt;

&lt;p&gt;That is why INP replaced FID as a Core Web Vitals metric in March 2024, bringing a session-wide metric that is user-centered and better suited to help us understand real users’ experience around interactions.&lt;/p&gt;

&lt;p&gt;The LoAF API also brought a shift, moving the basis of our interaction measurements away from tasks and toward animation frames. This brings several key advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Animation Frames encompass all the work done by the browser in order to ship a new frame, including JavaScript execution, style calculations, layout, paint, and compositing. This provides a more comprehensive picture of performance compared to isolated task measurements.&lt;/li&gt;
&lt;li&gt;Users perceive performance in terms of visual updates and responsiveness, which correlates better with the animation frame as a base metric.&lt;/li&gt;
&lt;li&gt;Multiple small tasks that individually don't qualify as "long tasks" can collectively delay a frame. LoAF captures this cumulative effect, which task-based metrics might miss. Some interactions may incur several tasks and trigger multiple event handlers, where the Long Tasks API would only show any potential outliers, making it difficult to build attributions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Another important advantage of this new segmentation of work is that if you only consider long tasks as a source of interactivity problems, you are missing an entire class of performance problems related to styling and layout, which also occupy the main thread, preventing the browser from responding to interactions and slowing down the production of new frames.&lt;/p&gt;

&lt;p&gt;Because of that, frame-based measurements are better at identifying jank or stuttering in animations and interaction. This is crucial for ensuring smooth experiences and better correlations for your metrics, allowing for more effective optimization strategies.&lt;/p&gt;

&lt;p&gt;Also important to note is that Animation Frames are not directly connected to user interactions, as they are more of a segmentation of work in order to ship a frame, and may originate from many different sources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Utilizing the Long Animation Frames (LoAF) API
&lt;/h2&gt;

&lt;p&gt;The LoAF API is a great candidate for monitoring issues affecting the user’s experience. As stated in the &lt;a href="https://developer.chrome.com/docs/web-platform/long-animation-frames#long-frames-api" rel="noopener noreferrer"&gt;LoAF API article&lt;/a&gt;, a long animation frame is when a rendering update is delayed beyond 50 milliseconds, the same threshold as for long tasks.&lt;/p&gt;

&lt;p&gt;The LoAF API provides detailed insights into frame performance. Rather than just start and duration timings, we get an entire breakdown of the frame cycle (as shown in the &lt;a href="https://developer.chrome.com/docs/web-platform/long-animation-frames#advantages" rel="noopener noreferrer"&gt;LoAF API article&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;startTime&lt;/code&gt;: the start time of the long animation frame relative to the navigation start time.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;duration&lt;/code&gt;: the duration of the long animation frame (not including presentation time).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;renderStart&lt;/code&gt;: the start time of the rendering cycle, which includes &lt;code&gt;requestAnimationFrame&lt;/code&gt; callbacks, style and layout calculation, resize observer and intersection observer callbacks.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;styleAndLayoutStart&lt;/code&gt;: the beginning of the time period spent in style and layout calculations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;firstUIEventTimestamp&lt;/code&gt;: the time of the first UI event (mouse/keyboard and so on) to be handled during the course of this frame.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;blockingDuration&lt;/code&gt;: the duration in milliseconds for which the animation frame was being blocked.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those timestamps can be used to calculate the different parts of the frame cycle:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kh618w7f1d112ktwlit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kh618w7f1d112ktwlit.png" alt="How LoAF entry timings can be used as a breakdown for a frame cycle. From: https://developer.chrome.com/docs/web-platform/long-animation-frames" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How LoAF entry timings can be used as a breakdown for a frame cycle. From: &lt;a href="https://developer.chrome.com/docs/web-platform/long-animation-frames" rel="noopener noreferrer"&gt;https://developer.chrome.com/docs/web-platform/long-animation-frames&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
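&lt;p&gt;As a minimal sketch of how those fields fit together, the helper below splits a LoAF entry into the segments from the diagram above: work run before rendering, pre-layout callbacks, and style/layout. Field names follow the &lt;code&gt;PerformanceLongAnimationFrameTiming&lt;/code&gt; interface; the observer registration is an illustration, not a full monitoring setup.&lt;/p&gt;

```javascript
// Split a LoAF entry into the frame-cycle segments described above.
// Note: `duration` does not include presentation time, per the spec.
function frameBreakdown(entry) {
  const end = entry.startTime + entry.duration;
  return {
    work: entry.renderStart - entry.startTime, // tasks before rendering began
    preLayout: entry.styleAndLayoutStart - entry.renderStart, // rAF, resize/intersection observer callbacks
    styleAndLayout: end - entry.styleAndLayoutStart, // style and layout calculations
  };
}

// Observe long animation frames (Chrome 123+). The feature check keeps the
// snippet inert in environments that don't support the entry type.
const canObserve =
  typeof PerformanceObserver !== "undefined" &&
  PerformanceObserver.supportedEntryTypes.includes("long-animation-frame");

if (canObserve) {
  new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      console.log(entry.duration, frameBreakdown(entry));
    }
  }).observe({ type: "long-animation-frame", buffered: true });
}
```

&lt;p&gt;With a LoAF entry of &lt;code&gt;duration: 120&lt;/code&gt;, &lt;code&gt;renderStart: 80&lt;/code&gt; and &lt;code&gt;styleAndLayoutStart: 100&lt;/code&gt; (relative to a &lt;code&gt;startTime&lt;/code&gt; of 0), the breakdown would be 80ms of work, 20ms of pre-layout callbacks and 20ms of style and layout.&lt;/p&gt;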

&lt;h3&gt;
  
  
  Some gotchas with LoAF entries
&lt;/h3&gt;

&lt;p&gt;It is important to note that, similar to INP, LoAF entries aim to measure the entirety of the frame lifecycle, and entries span the entirety of the session. So LoAF entry attributions may come in different shapes, with different root causes, not only script execution. Also, as pointed out in the &lt;a href="https://developer.chrome.com/docs/web-platform/long-animation-frames#better-attribution" rel="noopener noreferrer"&gt;LoAF article&lt;/a&gt;, script attribution is only provided for scripts running in the main thread of a page, including same-origin iframes. This means that third-party scripts, extensions, cross-origin iframes, and other sources won’t have script attribution, but may still contribute to LoAF entries.&lt;/p&gt;

&lt;p&gt;There’s also, as of this writing, &lt;a href="https://github.com/w3c/long-animation-frames/issues/11" rel="noopener noreferrer"&gt;information missing about scripts without source information&lt;/a&gt;, such as event handler callbacks and inline scripts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualizing INP and LoAF data in the wild
&lt;/h3&gt;

&lt;p&gt;Here we have two examples of how you can visualize INP in the wild. On the left you have the &lt;a href="https://vercel.com/docs/workflow-collaboration/vercel-toolbar" rel="noopener noreferrer"&gt;Vercel toolbar&lt;/a&gt;, showing a collection of INP entries in dev mode, and on the right you have the trace viewer of a tool I am creating, &lt;a href="https://perflab.io" rel="noopener noreferrer"&gt;PerfLab&lt;/a&gt;. You can see the INP entry highlighted on the displayed trace, alongside the report cards for other data present in the trace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl6pewf56vo8ulo17u5yf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl6pewf56vo8ulo17u5yf.png" alt="INP metric on Vercel toolbar" width="800" height="855"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tn6ok5zjej393co660n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0tn6ok5zjej393co660n.png" alt="PerfLab trace viewer" width="800" height="775"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s another trace visualized in &lt;a href="https://perflab.io" rel="noopener noreferrer"&gt;PerfLab&lt;/a&gt;, showcasing that animation frames, and LoAF entries, are not directly linked to INP as a metric, but can still cause input delay for an interaction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frv5b09lpr5qwjhtiwaar.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frv5b09lpr5qwjhtiwaar.png" alt="PerfLab trace viewer" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This particular animation frame was part of a trace session to analyze the loading experience of a page and understand the visual jank around it. Even though there was no direct user input, the page seemed unresponsive, and you could see on the trace the main thread busy with different third-party scripts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using LoAF to Improve INP
&lt;/h2&gt;

&lt;p&gt;Interaction to Next Paint (INP) measures the latency of interactions throughout a page session, so INP attribution may come from interactions at any point in time. Long Animation Frame (LoAF) data helps add better attribution for what could have contributed to poor INP scores, providing information on the entire frame duration and timings to help you understand where the animation frame spent the most time blocking. INP issues can stem from various sources, including input delay, script execution, and layout and style operations, and those may come from first-party or third-party scripts.&lt;/p&gt;

&lt;p&gt;By capturing both INP and LoAF data, &lt;a href="https://github.com/GoogleChrome/web-vitals?tab=readme-ov-file#attribution" rel="noopener noreferrer"&gt;utilizing the web-vitals attribution build&lt;/a&gt;, you can better assess not only the state of your interaction metrics but also the causes of any potential problem.&lt;/p&gt;
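&lt;p&gt;A minimal sketch of that setup, assuming the &lt;code&gt;web-vitals&lt;/code&gt; package and the attribution field names from its recent versions (&lt;code&gt;inputDelay&lt;/code&gt;, &lt;code&gt;processingDuration&lt;/code&gt;, &lt;code&gt;presentationDelay&lt;/code&gt;, &lt;code&gt;longAnimationFrameEntries&lt;/code&gt;); the &lt;code&gt;dominantPhase&lt;/code&gt; helper is hypothetical:&lt;/p&gt;

```javascript
// import { onINP } from "web-vitals/attribution";

// Hypothetical helper: which phase of the INP interaction dominated?
// Takes an attribution object and returns the name of the largest phase.
function dominantPhase(attribution) {
  const phases = [
    ["inputDelay", attribution.inputDelay],
    ["processingDuration", attribution.processingDuration],
    ["presentationDelay", attribution.presentationDelay],
  ];
  phases.sort((a, b) => b[1] - a[1]); // largest phase first
  return phases[0][0];
}

// Browser usage (commented so the sketch is self-contained):
// onINP(({ value, attribution }) => {
//   // Ship the score, the dominant phase, and any LoAF entries attached
//   // to the interaction to your RUM endpoint.
//   console.log("INP", value, dominantPhase(attribution),
//               attribution.longAnimationFrameEntries);
// });
```

&lt;p&gt;Grouping field data by dominant phase like this makes it easier to tell whether your worst interactions suffer from main-thread contention before the event fired (input delay), slow handlers (processing), or rendering work (presentation delay).&lt;/p&gt;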

&lt;p&gt;We’ve come a long way from the early days of performance tooling and metrics, and we now have incredible tools at our disposal to help us better understand and improve our experiences on the web! It has been a truly incredible journey, and with INP and LoAF as the latest entries in the toolkit we can finally gain a better understanding of user interactions and deliver a delightful post-load experience to our users.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>performance</category>
      <category>webvitals</category>
      <category>metrics</category>
    </item>
    <item>
      <title>Why your performance work is not seen</title>
      <dc:creator>Vinicius Dallacqua</dc:creator>
      <pubDate>Wed, 17 Apr 2024 12:11:57 +0000</pubDate>
      <link>https://dev.to/viniciusdallacqua/why-your-performance-work-is-not-seen-11of</link>
      <guid>https://dev.to/viniciusdallacqua/why-your-performance-work-is-not-seen-11of</guid>
      <description>&lt;p&gt;If you are reading this, chances are you care about performance. Also, chances are, you have played around and established some form of Lab &lt;a href="https://www.debugbear.com/"&gt;or&lt;/a&gt; &lt;a href="https://www.rumvision.com/"&gt;RUM&lt;/a&gt; &lt;a href="https://speedcurve.com/"&gt;solutions&lt;/a&gt; to start capturing data about your application. If you haven’t, &lt;a href="https://engineering.atspotify.com/2022/09/from-development-to-real-users-how-to-create-a-web-performance-story/"&gt;I have just the article for you&lt;/a&gt;. You have run Lighthouse reports and time after time you have seen that there are a few, or sometimes lots, of improvements that could be done, but it just seems to not reach a point where you can actually focus on fixing those problems and get long lasting, meaningful, improvements to your users. Or establish a continuous process, or governance, to track and iterate over your application’s progress. That backlog seems to never reach a point where performance tasks are moved to working items, and, unless something is on fire, most of your reports don’t translate into the right value being delivered.&lt;/p&gt;

&lt;p&gt;This article represents some of my findings over many years creating or collaborating with performance efforts whilst establishing a governance around our work. Some of these points I’m still experimenting with or establishing in my current workflow, so tag along and let’s discuss them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making sure to maximize value from your tooling
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Shaping a better Lab tooling
&lt;/h3&gt;

&lt;p&gt;There are many great articles on Lab testing, so this article is not going to cover lab data setup or how it differs from your field data. But if there is one key takeaway, it is this: Lab data is a great way to prevent significant regressions from ever being shipped to real users, and one of the best ways to get those early hints is to set up performance budgets. Having a direct link between lab metrics and code changes allows improvements or regressions to be correlated with the changes that caused them.&lt;/p&gt;
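&lt;p&gt;As an illustration of such a budget (here in the Lighthouse CI &lt;code&gt;budgets.json&lt;/code&gt; format; the thresholds are placeholders to adapt to your own starting point, not recommendations):&lt;/p&gt;

```json
[
  {
    "path": "/*",
    "timings": [
      { "metric": "interactive", "budget": 5000 },
      { "metric": "first-contentful-paint", "budget": 2000 }
    ],
    "resourceSizes": [
      { "resourceType": "script", "budget": 300 },
      { "resourceType": "total", "budget": 1000 }
    ]
  }
]
```

&lt;p&gt;Run against every pull request, a budget like this turns a regression into a failed check instead of a line in a backlog.&lt;/p&gt;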

&lt;p&gt;Having a good setup for your lab data means &lt;a href="https://www.speedcurve.com/blog/continuous-web-performance/"&gt;creating guardrails&lt;/a&gt; that will alert you to any regressions. Creating meaningful budgets is also important, to ensure your application’s progress over time is measured and compared against its own trends. Setting up KPIs/SLOs and budgets when your metrics are way above the ‘good’ threshold can be daunting and create too much friction, or analysis paralysis, whilst trying to ship something meaningful. Setting up a good governance process means assessing achievable, meaningful progress over time. Sometimes you have to set your thresholds based on your own starting point, so you can iterate and deliver results quickly over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extracting more out of RUM tooling
&lt;/h3&gt;

&lt;p&gt;There are great options when it comes to RUM tools for developers to choose from, and sometimes teams can even build tailor-made in-house solutions to monitor their field data. But to extract the most value out of any RUM tool, you need to cover a few points from the engineering and product perspectives. For a tool is only as useful as your teams’ willingness to use it in their processes. And for teams shipping features to a product, value does not only come from developer-focused metrics, but from product-focused ones too.&lt;/p&gt;

&lt;p&gt;Teams may struggle when they have to aggregate data points from different tools to extract correlations between a performance improvement and the actual value shipped to users. Sometimes the only thing preventing teams from building correlations is the tools they use, or how they use the tools available.&lt;/p&gt;

&lt;p&gt;Correlations are a great way to prove value for a performance governance process. Any value that is only focused on developer experience runs the risk of being one-sided, not translating into more value for your users, which in turn may mean less buy-in from the product side.&lt;/p&gt;

&lt;p&gt;You may also want to segment your data based on market and user base. Not all markets behave the same, and not all markets have the same priority or use to your product. When it comes to shipping the best value, you have to ensure long tail users of markets that are not your top priority are not skewing percentiles if you only use global data as context. Whilst improving experience for all your users is always an overarching goal, it is not always a possible one to prioritize from a product perspective.&lt;/p&gt;

&lt;p&gt;Your percentile distribution also matters. Focusing on one percentile may blind you to improvements achieved in other parts of your distribution. Visualizing the data across its entire distribution will help you understand where the most impact is being made and identify improvements made across the user base for a segment. There’s an &lt;a href="https://www.youtube.com/watch?v=YzkdiTPxRM4"&gt;awesome talk by Tim Vereecke&lt;/a&gt; that shows this in great detail.&lt;/p&gt;

&lt;p&gt;All application metrics can be observed and segmented through different product lenses. The key is to trace correlations, for the bigger a metric context is, the harder it is to correlate actual value and usage to your users. This way, you can ensure that your performance governance can also drive and inform product level decisions and prioritization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trends and historical data
&lt;/h3&gt;

&lt;p&gt;On Lab and RUM setups it is important not only to have performance monitoring in place, but to observe your metrics through historical data. Assessing trends and understanding evolution over time is important to avoid your metrics creeping past your budget levels.&lt;/p&gt;

&lt;p&gt;Historical data also comes in the form of documentation. As with any good incident remediation process, an important step to avoid recurring regressions on any team is to document them. Building and sharing knowledge around your product’s regressions and understanding why they happen is a key step in a healthy governance process.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to strengthen your performance governance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Metrics and KPI / SLAs
&lt;/h3&gt;

&lt;p&gt;Core Web Vitals offer a great standardized set of metrics, and those are a great starting point for extracting value out of your website metrics. But at some point in your governance structure you may need to assess which metrics best correlate to your product’s value at that time. Not all pages have the same importance, and not all features the same usage, so building KPIs and SLOs using all of the web vitals metrics may not be optimal. Choosing a subset of those metrics based on the usage trends and patterns of your app and features can be a powerful way to create KPIs between the engineering and product sides of your team. And SLAs between your team and downstream teams and services ensure metrics don’t degrade over time. This process also needs to be iterative, to fine-tune and adjust as trends around your product metrics and usage evolve over time.&lt;/p&gt;

&lt;p&gt;Product teams can also extract great value by establishing their own metrics, to better represent their product and user journeys.&lt;/p&gt;

&lt;h3&gt;
  
  
  Buy-in from management
&lt;/h3&gt;

&lt;p&gt;To establish a governance process, you first need to prove there’s value in integrating another process into any team’s workflow. And one way to prove there’s value is via correlations.&lt;/p&gt;

&lt;p&gt;We can always start from the point that Web Vitals &lt;a href="https://developers.google.com/search/docs/appearance/core-web-vitals"&gt;directly translate to better SEO&lt;/a&gt;. But this alone may not be enough to justify a governance process. SEO ranking is only one factor, and better ranking does not come from Web Vitals metrics alone. A key metric that most product teams need to track somehow is conversion, though conversion metrics can have many different underlying meanings. Correlating product-facing metrics such as click-through and conversion rates with your performance trends is a key factor in aligning the two sides of your team and product areas for continuous buy-in from management.&lt;/p&gt;

&lt;p&gt;Data is hard to argue with, and delivering value that you can trace and correlate to product metrics is a great way to use it.&lt;/p&gt;

&lt;p&gt;Finally, your performance governance is a continuous process, and as such it needs to matter not only to the engineering side of the company but also to your product and users. Setting up a meaningful process means fine-tuning your tools and metrics to better represent your product and users. Another key step is to publicize your progress, internally and externally. There are always great lessons to be shared from regressions and improvements, and documenting and sharing those helps cement your governance progress. So measure, experiment, ship, report and repeat!&lt;/p&gt;

</description>
      <category>culture</category>
      <category>performance</category>
      <category>metrics</category>
      <category>governance</category>
    </item>
  </channel>
</rss>
