DEV Community

CharlyAutomatiza


From One Skill to a Plugin: My Story Behind the Initial Release of grafana-k6-plugin

It started with a feeling I could not ignore

Some ideas do not arrive as a polished product vision. They arrive as discomfort.

I was spending more and more time working with LLMs inside developer workflows, and while the speed was impressive, I kept running into the same problem: the output often looked better than it actually was. The code was fluent, but not always trustworthy. The suggestions were fast, but not always current. The answers sounded confident, but that confidence was not always backed by real engineering judgment.

If you are new to the concept, skills are folders of instructions, scripts, and resources that Claude and other agents can load dynamically to perform better on specialized tasks.

In the k6 world, that gap matters a lot.

For anyone new to it, k6 is Grafana's open-source performance testing tool for load, stress, and reliability testing of APIs and services.
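For readers who have never seen one, here is what a minimal k6 script looks like. This is a generic illustration, not output from the plugin; it runs with the k6 binary (`k6 run script.js`), not with Node:

```javascript
// Minimal k6 load test: 10 virtual users hitting an endpoint for 30 seconds.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 10,         // number of concurrent virtual users
  duration: '30s', // total test duration
};

export default function () {
  // test.k6.io is k6's public demo target; swap in your own API.
  const res = http.get('https://test.k6.io');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1); // pacing between iterations
}
```

Generating a script like this is easy. Deciding whether 10 VUs, a flat load shape, and a single GET actually represent your system's reality is the hard part.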

I was seeing models suggest deprecated code, outdated references, and quality patterns that might pass a superficial review but would never feel solid if you actually cared about long-term maintainability or performance realism. Models could generate code very easily - that did not mean they were generating the right code for the real problem.

As a Grafana k6 Champion with a strong performance engineering background, I knew the tool had enormous potential. I also knew a lot of that potential was being left on the table.

The real value is not just generating a test script. The real value is being able to ask: is this scenario realistic? Are these thresholds meaningful? Is this auth flow being handled safely? Is this a recommendation I would actually trust in front of a team?

I did not want to build something that simply made LLMs produce more code.
I wanted to build something that helped them produce better decisions.


At first, I thought this would be a single skill

The original idea was smaller. I was not thinking about a plugin. I was thinking about a skill.

I wanted to condense my practical experience into something reusable - something that could guide an agentic IDE toward stronger k6 outputs, better defaults, and more reliable performance testing behavior.

That first idea made sense for a while. A single skill felt elegant. Compact. Easier to reason about.

But as I kept building, I ran into a reality that changed the whole direction of the project. As the idea grew, putting too much responsibility into one skill would reduce precision instead of increasing it.

That is where a very old lesson came back to me.

In my first year at university, one of the ideas that stayed with me was the classic divide and conquer mindset: break complexity into clearer, specialized parts; make each part excellent at its own job; reduce ambiguity by reducing responsibility overlap.

That is exactly what this project needed.

So the idea stopped being "one smart skill" and became something much better: a plugin built around three dedicated skills, each one focused, specialized, and expert in its own job.

Instead of asking one giant skill to plan, build, validate, clarify, and reason about every edge case equally well, I could design a system where each part had a clearer mission and a tighter behavior contract.

That was the moment the project became more than an experiment.


I was not trying to outgenerate LLMs, I was trying to outguide them

This is probably the most important thing to understand about why I built grafana-k6-plugin.

I was never trying to compete with LLMs at generation speed. Almost any model can produce code with a decent prompt today - that is no longer the hard part. What interested me was something deeper: how to inject more judgment into the generation process. How to make the output aligned with current platform realities, aware of performance engineering tradeoffs, and resistant to fragile assumptions.

And especially: guided by expert questions when critical information is still missing.

That last part became one of the foundations of the plugin. I did not want skills that would guess when they should ask. I wanted skills that would know when not to guess.

In performance engineering, there are doubts that are negotiable and doubts that are not. If I do not know the target, the workload shape, the auth constraints, or the SLA expectations, I do not want a confident hallucination. I want the right question asked at the right time.
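To make that concrete: in k6 terms, the answers to those questions end up encoded directly in the test configuration. A hypothetical sketch follows; the scenario name, stages, and numbers are illustrative placeholders, not recommendations from the plugin:

```javascript
// Illustrative only: how workload shape and SLA expectations map onto k6 options.
const options = {
  scenarios: {
    checkout: {
      executor: 'ramping-vus',          // workload shape: a ramp, not a flat blast
      startVUs: 0,
      stages: [
        { duration: '2m', target: 50 }, // warm up to expected load
        { duration: '5m', target: 50 }, // hold at steady state
        { duration: '1m', target: 0 },  // ramp down
      ],
    },
  },
  thresholds: {
    // SLA expectations become explicit pass/fail criteria.
    http_req_duration: ['p(95)<500'],   // 95% of requests under 500 ms
    http_req_failed: ['rate<0.01'],     // error rate under 1%
  },
};

console.log(Object.keys(options.scenarios)); // which scenarios are defined
```

Every one of those values is a decision. If the target, the ramp, or the thresholds are unknown, I want the skill to ask, not invent them.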

When uncertainty is critical, the skill should ask like an expert. And the human should remain the one making the decision.

That is not a weakness in the workflow. That is the workflow.


Learning about AskUserQuestion changed how I thought about skill design

One of the moments that pushed this idea forward came from something very practical.

My colleague Alan showed me the AskUserQuestion capability in Claude Code during a skills hackathon we had at work. That may sound like a small detail, but for me it clicked immediately.

Because suddenly the interaction model was not just "generate the answer." It became: detect what is missing, stop at the right moment, and ask the right thing.

That opened a much more powerful path. It meant I could build skills that behaved less like autocomplete with extra steps and more like a real specialist working alongside me.

The hackathon itself was part of the spark too. Seeing what people were building, how quickly a good idea could evolve into something tangible, and the energy of people discovering new ways to package judgment and domain expertise into agentic workflows - that pushed me further. It made me feel that this plugin was worth building for the wider ecosystem, not just for myself.


When I stopped relying on intuition and started building evidence

At some point, every ambitious build hits the same wall.

You can feel that you are improving things. You can believe the structure is getting stronger. But eventually, feeling is not enough.

I needed evidence. And this is where Anthropic's skill-creator workflow became fundamental for me - not just as a creation aid, but as part of an evaluation discipline.


That meant designing evals, running them in structured iterations, comparing with-skill and without-skill behavior, using multiple agents where useful, and gathering empirical evidence instead of relying on taste.

The difference in quality of decisions was significant.

Before that, improvements sounded like: "this feels cleaner," "this looks more robust," "this answer reads better." After I started working through evals and comparisons, the conversation became sharper: "this iteration actually improved ambiguity handling," "this change is attractive but does not improve outcomes," "this needs a correction plan, not another vague rewrite."

I also wanted to pressure-test the skills under real conditions: HTTP cases, gRPC scenarios, browser-oriented flows, script validation issues, safety and quality edge cases, and guidance quality under incomplete inputs. Not just ideal prompts. Multi-agent execution helped simplify comparison and made it easier to see whether a behavior was genuinely robust or just accidentally tuned to one narrow interaction pattern.

When you are deep inside a project, doubt never disappears. But evidence gives doubt a productive direction. It turns anxiety into diagnosis and makes solid correction plans possible - the kind that say: this is the weak spot, this is the reason, this is the adjustment, and this is how we will know if it worked.

That kind of loop creates trust. Not only in the artifact, but in the process itself.


Review was another turning point, and it was not always comfortable

Before each PR, I kept iterating my PR review prompts. I did not want review to be a last-minute checkbox. I wanted it to be a designed stage with real value.

GitHub Copilot review was genuinely important in that stage. It surfaced useful issues, accelerated feedback loops, and gave me another layer of signal before merging.

But it also taught me something very important: good review support is not the same as full understanding.

One example captured the problem perfectly. Inside the skill references, I had DO and DONT examples. The DONT examples were intentional - educational anti-patterns that existed so the skills could recognize poor practices and steer people toward better ones.

At times, Copilot review identified those DONT examples as mistakes that should be corrected. From a pattern-matching point of view, that reaction made sense. From a human intent point of view, it was wrong.

Without professional judgment, it becomes easy to "fix" the wrong thing. You can clean up the signal that was supposed to teach the model. You can make the final system more fragile while convincing yourself you improved it.

That is not a theoretical risk. I saw versions of it happen.

This is one of the strongest lessons from the project: our role as professionals is not optional. I do not see human-in-the-loop as a fallback. I see it as part of the architecture.


Ship before you disappear into endless polishing

By the time I reached the initial release, the feeling was not "this is finished forever." It was something more grounded: this is now solid enough to put in front of the community and learn in public.

That is an important distinction.

I do not believe in waiting for some mythical state of perfection before sharing useful work. If I had waited for the moment when every possible improvement was already solved, this plugin would probably still be sitting in an endless spiral of iteration. There is always one more refactor, one more rule, one more benchmark, one more prompt tweak.

At some point, you have to stop polishing in private and step onto the field.

You have to ship.
You have to ask for feedback.
You have to let the real world respond.

I fully expect there to be opportunities for improvement - I actually want that. I hope people open PRs. I hope the project keeps evolving. But I believe deeply in finishing stages: close one chapter, learn from it, and continue.

That is healthier than living forever inside an unfinished draft.


Try it yourself

My repo:
https://github.com/charlyautomatiza/grafana-k6-plugin

Installation

Add this marketplace to Claude

/plugin marketplace add charlyautomatiza/charlyautomatiza-marketplace

Install a plugin from the marketplace

/plugin install grafana-k6

Or install directly from the plugin repository

npx skills add charlyautomatiza/grafana-k6-plugin

Once installed, you have three specialized skills available, each one with a clear role:

  • k6-plan: turns your objective into a concrete test strategy before code generation.
    • "Plan a load test for my users API."
  • k6-builder: creates runnable k6 artifacts from your requirements or from a plan.
    • "Generate a runnable k6 script for my checkout API."
  • k6-validate: reviews an existing script and prioritizes what to fix.
    • "Review this k6 script and flag the top issues that could make results unreliable."

Across all three skills, if required context is missing, the agent will pause and ask the right questions before continuing.

Each one knows when to ask before it acts.

If you are curious about how I built and evaluated these skills, the Anthropic skill-creator workflow is the same workflow I used.


Build something, test it honestly, and put it out into the world

If there is one thing I hope people take from this story, it is this: do not underestimate what happens when you combine domain expertise, agentic tooling, and a serious feedback loop.

Code generation is no longer the differentiator. What matters is whether the code reflects judgment, context, and good engineering instincts. AI review can accelerate a lot, but it can also misunderstand why something exists - and that is where professional judgment becomes decisive. Specialization matters: turning one skill into a plugin with three dedicated expert skills was one of the best decisions in the entire project. And evidence beats vibes: if you want to improve a skill seriously, you need evals, comparison, and a repeatable process.

If you are working with agentic IDEs today, I really think this is the moment to build.

Create your own skills. Teach them to ask the right questions. Test them with real evals. Compare with and without your guidance. Review them hard. Ship them before you disappear into endless polishing.

That is how we get better tooling. That is how we stop treating AI assistance like magic and start treating it like engineering.

And if this plugin helps you do that, even a little, then this journey was worth it.

If you try it and see ways to improve it, send a PR. I mean that.

Big hug, Charly
