
eternalsix

Originally published at eternalsix.com

I Tested 50 AI Tools in May. Here Are the 7 I Actually Kept.

By day 18 of May I had 34 browser tabs open, six half-finished integrations, and a $600 API bill I could not fully explain. I had set a simple rule at the start of the month: spin up every AI tool that crossed my feed, run it on a real workflow I own, and cut anything that did not survive contact with actual work. Not demos. Not onboarding videos. Real tasks — code review, customer research, content pipelines, data extraction, internal tooling. Forty-three tools got uninstalled. Seven stayed. Here is exactly what I kept and why.


The Filtering Problem Nobody Talks About

The AI tool landscape in 2026 does not have a quality problem. There are genuinely good tools being built everywhere. It has a signal-to-noise problem, and the noise is architectural, not cosmetic.

Most tools fail the same way: they are optimized for the demo, not the workflow. They shine in isolation. You paste in a prompt, get a crisp output, feel briefly impressed, then realize you need to move that output somewhere, combine it with something else, or run it forty times with different inputs — and suddenly the tool offers you a copy button and nothing else.

I call this the "last mile problem." The generation is solved. The operationalization is not. Every tool I cut in May failed at the last mile. Every tool I kept solved it.


The 7 Tools That Survived

1. Claude (API, not the chat UI)
I already used Claude. What changed in May was switching almost entirely to the raw API with structured outputs and prompt caching. The chat UI is for exploration. The API is for building. If you are still copy-pasting from claude.ai into your workflow, you are leaving most of the value on the table. On my repeated document-analysis workflows, cache hits cut costs by ~70%.
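A back-of-the-envelope sketch of why caching can move the bill that much. The price multipliers (cache writes at 1.25x the base input price, cache reads at 0.1x), the base price, and the token counts are illustrative assumptions, not figures from this workflow; check current pricing before relying on them.

```python
def input_cost(prefix_tokens, suffix_tokens, calls, price_per_mtok,
               cached=False, write_mult=1.25, read_mult=0.10):
    """Estimate input-token cost for `calls` requests that share one prefix.

    Assumes the cache never expires between calls and ignores output tokens.
    The multipliers are assumed defaults, not authoritative pricing.
    """
    if calls <= 0:
        return 0.0
    per_token = price_per_mtok / 1_000_000
    suffix_cost = calls * suffix_tokens * per_token
    if not cached:
        return calls * prefix_tokens * per_token + suffix_cost
    # The first call writes the prefix to the cache; the remaining calls read it.
    prefix_cost = prefix_tokens * per_token * (write_mult + (calls - 1) * read_mult)
    return prefix_cost + suffix_cost


# Hypothetical workload: one 20k-token document analysed with 50 different
# 4k-token questions, at an assumed base price of $3 per million input tokens.
uncached = input_cost(20_000, 4_000, 50, 3.0)
cached = input_cost(20_000, 4_000, 50, 3.0, cached=True)
saving = 1 - cached / uncached
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, saving {saving:.0%}")
```

The saving grows with the share of each request that is a repeated prefix, so long shared documents with short per-call questions benefit most.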

2. Cursor
Not new, but I stress-tested it hard, specifically its multi-file context and its ability to hold a mental model of a growing codebase across sessions. It held. The tab completion is now so accurate on my own code that I catch myself waiting for it in other editors, the way I wait for autocorrect on a phone. Nothing else came close for actual coding velocity.

3. Firecrawl
Web scraping has always been the unsexy bottleneck in research pipelines. Firecrawl turns any URL into clean markdown that a model can actually read without burning context on HTML garbage. I built a competitive monitoring pipeline in three hours that would have taken two days with Playwright and manual parsing. It failed on roughly 8% of targets (paywalls, heavy JS apps). For this kind of pipeline, that is an acceptable failure rate.

4. Exa
Semantic search over the live web, with an API that returns clean results you can pipe directly into model context. The difference from standard search APIs is that Exa matches on the meaning of the query, not just the words in it. I used it to source primary evidence during research tasks where keyword search was returning garbage. High signal, and lower hallucination risk because you are feeding the model real content.

5. Replicate
For image and audio model access without standing up infrastructure. I ran comparative tests on a client's product image generation workflow. Being able to swap models with a single line of code — Flux, SDXL, Recraft — without changing anything else in the pipeline was the feature. Costs are predictable. Latency is acceptable for batch jobs.

6. Inngest
This one surprised me. Inngest is technically a workflow orchestration tool, not an "AI tool," but it made the list because it solved the hardest problem I have building AI pipelines: reliable, retryable, observable async execution. When an LLM call fails at step 4 of 7, you do not want to restart from step 1. Inngest handles exactly this. If you are building anything multi-step with AI, you need something in this category.
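The idea of resuming from the failed step can be sketched in plain Python. This is a generic illustration of step-level checkpointing with retries, not Inngest's API; `run_pipeline`, the in-memory `checkpoint` dict, and the toy steps are all hypothetical names. A real orchestrator would persist the checkpoint durably.

```python
import time


def run_pipeline(steps, state, checkpoint, max_attempts=3, backoff_s=0.0):
    """Run named steps in order, skipping any step already in `checkpoint`."""
    for name, fn in steps:
        if name in checkpoint:
            # Step finished in an earlier attempt or run: reuse its result.
            state = checkpoint[name]
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                state = fn(state)
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # Give up; earlier steps stay checkpointed.
                time.sleep(backoff_s * attempt)
        checkpoint[name] = state
    return state


# Toy steps: `summarize` simulates an LLM call that fails twice, then succeeds.
calls = {"fetch": 0, "summarize": 0}


def fetch(state):
    calls["fetch"] += 1
    return state + ["fetched"]


def summarize(state):
    calls["summarize"] += 1
    if calls["summarize"] < 3:
        raise RuntimeError("simulated LLM timeout")
    return state + ["summarized"]


checkpoint = {}
result = run_pipeline([("fetch", fetch), ("summarize", summarize)], [], checkpoint)
print(result, calls)
```

Here `fetch` runs once even though `summarize` needed three attempts, and calling `run_pipeline` again with the same `checkpoint` skips both steps.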

7. Braintrust
Evaluations. Every serious AI builder eventually hits the wall where "it feels like it works" is not enough and you need to measure regression. Braintrust gives you a logging and eval layer that is not painful to set up. I integrated it in half a day. Now I have baselines. Now I know when a prompt change makes things worse, not just different.
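What "baseline" and "regression" mean here can be shown with a minimal sketch. It is a generic exact-match eval loop, not Braintrust's API; `score`, `is_regression`, the test cases, and the two stand-in "prompt versions" are hypothetical.

```python
def score(predict, cases):
    """Fraction of cases where predict(input) equals the expected output."""
    hits = sum(1 for inp, expected in cases if predict(inp) == expected)
    return hits / len(cases)


def is_regression(baseline_score, candidate_score, tolerance=0.0):
    """True when the candidate scores worse than the baseline beyond tolerance."""
    return candidate_score < baseline_score - tolerance


# Fixed eval set: (input, expected output). Hypothetical normalisation task.
cases = [(" Yes ", "yes"), ("NO", "no"), ("Maybe ", "maybe"), ("yes", "yes")]


# Two stand-ins for "prompt versions"; in practice these would call a model.
def prompt_v1(text):
    return text.strip().lower()


def prompt_v2(text):
    return text.lower()  # Drops the whitespace handling.


baseline = score(prompt_v1, cases)
candidate = score(prompt_v2, cases)
print(baseline, candidate, is_regression(baseline, candidate))
```

The point of a fixed case set is that "different" becomes a number: the second version looks plausible on a single example but scores lower across the set.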


Why 43 Tools Got Cut

The patterns were consistent enough that I wrote them down mid-month:

  • Wrappers with no API. Any tool that only exists as a chat interface over a model I already have access to. There are dozens of these. They add no leverage.
  • Single-step tools. Useful once, useless as infrastructure. If a tool solves one isolated problem and cannot connect to what happens before or after it, the cognitive overhead of context-switching is not worth the marginal quality gain.
  • Pricing that punishes scale. Several tools were excellent at low volume and economically broken at real volume. I ran projections. If the cost curve does not stay reasonable at 10x my current usage, the tool is not safe to build on.
  • No observability. If I cannot see what happened when something went wrong, I cannot build on it in production. Black box is fine for toys. It is disqualifying for infrastructure.
  • Hallucination with confidence. A few tools were generating outputs that were confidently wrong in ways that would slip through human review. Not a matter of model quality — a matter of the tool not being designed to surface uncertainty.
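The scale projection from the pricing point above is simple arithmetic. A minimal sketch, assuming a flat monthly fee with included units plus per-unit overage; the plan numbers, the budget, and the function names are hypothetical, and real pricing pages often have more tiers.

```python
def monthly_cost(units, base_fee, included_units, overage_price):
    """Flat fee plus per-unit overage: one common pricing shape (assumed)."""
    return base_fee + max(0, units - included_units) * overage_price


def scale_check(current_units, budget_at_10x, **plan):
    """Compare cost today with cost at 10x volume under the same plan."""
    now = monthly_cost(current_units, **plan)
    at_10x = monthly_cost(current_units * 10, **plan)
    return {
        "now": now,
        "at_10x": at_10x,
        "cost_multiple": at_10x / now if now else float("inf"),
        "within_budget": at_10x <= budget_at_10x,
    }


# Hypothetical plan: $49/month, 5,000 units included, $0.05 per extra unit.
plan = {"base_fee": 49.0, "included_units": 5_000, "overage_price": 0.05}
report = scale_check(current_units=4_000, budget_at_10x=500.0, **plan)
print(report)
```

In this made-up case, usage that fits inside the plan today costs far more than 10x as much at 10x volume, which is the cost curve to look for before building on a tool.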

The Framework I Use to Evaluate Any AI Tool Now

Run every candidate through this five-question filter before spending more than 30 minutes on it:

  1. Does it have an API? If no, it lives in a silo. Silos do not scale.
  2. Can I run it 1,000 times without touching it? Automation is the point. If it requires human intervention at any step, measure that cost explicitly.
  3. What does failure look like, and will I know it happened? Test breakage, not just the happy path.
  4. What is the cost at 10x current volume? Calculate it before you commit.
  5. Does it make the output usable, or just generate it? Generation is not the product. Usable output in the right place at the right time is the product.

If a tool clears all five, it earns a two-week trial on a real workflow. If it fails any of them, I cut it without ceremony.
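The filter is easy to keep as a small checklist in code so every candidate is recorded the same way. A minimal sketch; the `ToolCheck` class, its field names, and the example tool are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolCheck:
    """One candidate tool's answers to the five questions."""
    name: str
    has_api: bool
    runs_unattended: bool          # Can run 1,000 times without touching it.
    failures_visible: bool         # Failure is observable, not silent.
    affordable_at_10x: bool        # Cost projected at 10x current volume.
    output_lands_in_workflow: bool  # Output is usable, not just generated.

    def failed_questions(self):
        """Names of the questions this tool fails, in filter order."""
        answers = {
            "has_api": self.has_api,
            "runs_unattended": self.runs_unattended,
            "failures_visible": self.failures_visible,
            "affordable_at_10x": self.affordable_at_10x,
            "output_lands_in_workflow": self.output_lands_in_workflow,
        }
        return [question for question, ok in answers.items() if not ok]

    def earns_trial(self):
        """A tool earns the two-week trial only if it clears all five."""
        return not self.failed_questions()


# Hypothetical candidate: a chat-only wrapper with no API.
wrapper = ToolCheck("chat-wrapper", has_api=False, runs_unattended=False,
                    failures_visible=True, affordable_at_10x=True,
                    output_lands_in_workflow=False)
print(wrapper.earns_trial(), wrapper.failed_questions())
```

Recording which questions failed, not just the verdict, makes it easier to revisit a tool later if it ships an API or changes pricing.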


How AI Handler Approaches This

The reason I ran this experiment is that I kept rebuilding the same scaffolding — the API wiring, the retry logic, the routing between models, the output formatting, the logging — every single time I wanted to use a new AI capability. Every new tool added another integration surface. Every new model meant another decision point buried in code.

AI Handler is the unified AI workflow tool I am building to solve exactly this. The premise is that the best individual AI tools should be composable without custom glue code for every combination. You should be able to route tasks to the right model and tool, observe what happened, retry what failed, and operationalize the whole thing without becoming a DevOps engineer in the process.

The seven tools I kept in May all do one thing extremely well. AI Handler is the layer that makes them work together as a system — with a single interface for inputs, a consistent observability layer, and cost controls that do not require you to babysit a dashboard.

The problem I am solving is not "which AI is best." It is "how do you run AI workflows in production without the workflow becoming the project."


AI Handler is the unified AI workflow tool I am building. Launching June 2026. Email ceo@eternalsix.com for beta access.
