DEV Community: Dan Gurgui

A Month with DeepSeek: What Happened When I Replaced Claude Opus for Real Work

Dan Gurgui — Sat, 23 May 2026 13:02:35 +0000

The Month I Swapped Claude for DeepSeek

For the last month or so, I've been running a small experiment. I moved my main coding workflow off Claude Opus and onto DeepSeek's v4 Pro, the latest release while it was discounted, and I used it for real work.

The test was simple: build an MVP, index a marketplace, run long agentic sessions, and see what breaks first, the model, my patience, or my wallet.

Spoiler: none of them broke. And the bill came in under $50.

Why I Tried DeepSeek in the First Place

I'd love to tell you this was a deliberate, well-reasoned migration, but it wasn't.

When DeepSeek announced their latest model, the timing was useful: I was already frustrated with Claude. Opus deployments were unstable that week and quotas were burning down faster than usual, sessions were getting cut, and the 5-hour windows on the $200 Max plan kept hitting walls in the middle of long agentic runs. If you've been on the Max plan during a busy week, you know the feeling. You're mid-flow, the agent is exploring a codebase, and suddenly you're locked out for the next 90 minutes.

So I configured my claude CLI to point at DeepSeek's API instead. Bought $10 in credits as a sanity check. Set the discounted endpoint, picked the right model name, and went back to work.

Honestly, I switched because Claude was failing me at that moment, not because I had carefully read the DeepSeek release notes. The discount made the experiment cheap enough that there was nothing to debate.

The Real Test: Building dronicons.com

The workload mattered, so let me describe it.

I've been building dronicons.com, an MVP that aggregates drone components from various retailers into a single searchable marketplace. Think of it as a price-comparison and parts-discovery tool for people building or repairing drones. It's a real product with real users coming in, not a weekend toy.

The work splits into two very different shapes:

Application code: Nest.js frontend, API layer, search, filtering, deduplication logic, schema work, plus the usual bug fixing and refactoring you accumulate in an MVP that's evolving weekly.
Indexing pipelines: scrapers and parsers that pull product listings from multiple sources, normalize them into a unified schema, classify components (motors, ESCs, frames, flight controllers), and enrich them with metadata. This is messy, edge-case-heavy work where every retailer formats its data differently.

That second workload is where LLMs really earn their keep. Every retailer is its own snowflake. Specs are inconsistent. Categories overlap. The model needs to make judgment calls thousands of times per import run.

So when I say I exhausted the DeepSeek API, I mean it. I was hitting it from two angles: long agentic coding sessions during development, and high-volume classification calls during indexing. That's a much harder test than just "vibe coding a CRUD app for an afternoon."

Where DeepSeek Felt Stronger: Long-Running Work

The first few hours, I didn't feel a difference. Output quality looked roughly comparable to Opus. Code style was fine. Reasoning seemed sound.

But after a couple of days, a pattern emerged.

DeepSeek lasted longer on a single thought. With Opus, my typical loop looked like this: send a prompt, watch the agent work for 10 to 15 minutes, get a partial result, then spend the next round reprompting, clarifying, or correcting course. It's iterative by necessity. The model would tap out before the task was actually done, or it would commit to a direction early and need me to nudge it.

With DeepSeek, I'd send a similar prompt and the session would run for 30 to 40 minutes before coming back with something I could actually evaluate. Same kind of task, same codebase, but the model kept pulling on threads instead of declaring victory early.

That changed how I worked. I stopped sitting next to the agent. I'd write a detailed prompt, often using dictation because I could think out loud, then walk away, make coffee. When I came back I just read what it did.

This matters because the cost of context switching for the human is the hidden tax of agentic coding. Every reprompt is 30 seconds of "where was I again?" Every clarification is a small drain on focus. Cutting the reprompt frequency in half had a bigger effect on my output than any specific code quality difference.

I'll caveat this honestly: Opus is no slouch, and I haven't spent enough time with the very latest Opus 4.7 release to compare apples to apples. My comparison is against the version I'd been running for months. But for my workflow, on my codebase, DeepSeek covered more ground per prompt.

Sub-Agents Went Deeper Than Expected

The bigger surprise was in the sub-agents.

Most modern agentic setups spawn sub-agents for parallel exploration: one investigates the database layer, another reads the frontend, another reviews tests. Each sub-agent has its own context budget, and when it terminates, it returns a summary to the orchestrator.

With Opus, my sub-agents typically returned after burning 20-30k tokens. That's a reasonable exploration: a handful of files, some focused reasoning, a summary. Useful, but you can feel the ceiling.

With DeepSeek, the same sub-agents were running through 80-90k tokens before terminating. That's three to four times the depth. They were reading more files, cross-referencing more thoroughly, and surfacing edge cases I wouldn't have thought to ask about.

For planning work, this is the difference between a good plan and a complete one. When you ask an agent to design an indexing pipeline that handles five retailers, a shallow exploration gives you the happy path. A deep exploration gives you the happy path plus the seventeen ways each retailer breaks your assumptions: missing fields, inconsistent units, broken pagination, retailer-specific anti-scraping behavior, mixed currencies, duplicate SKUs across stores.

I didn't have to prompt for edge cases. The agent found them itself because it had the budget to actually look.

That said, deeper exploration also means longer waits and more tokens consumed. It's not free. But on this kind of work, the trade was clearly worth it.

The Cost Reality: Heavy Usage Under $50

Now the part everyone wants to know.

Across more than a month of heavy daily usage, the numbers tell the story. On the Max plan, Opus cost me $200 per month as a subscription. DeepSeek API usage came in at $40 to $50 total in credits over the same period. Typical run durations went from 10-15 minutes with Opus to 30-40 minutes with DeepSeek. Sub-agent token depth jumped from 20-30k to 80-90k. And where Opus quota exhaustion was frequent, I never hit a wall on DeepSeek.

My pattern was simple: top up $10, use it for roughly a week, top up again when it ran low. Over the full month I never crossed $50 in spend. That includes coding sessions, agentic runs, and the indexing workload for the marketplace, which on its own would have been the bulk of any LLM bill on most providers.

DeepSeek's off-peak discount helps. Even so, if the price doubled tomorrow when the discount ended, this would still be cheaper than Opus for my workload. And I'd still get the part that money can't always buy on subscription plans: no session limits, no hourly quotas, no "you've used 80% of your window" warnings during the deep-work hours.

There's a broader market signal here too. As noted in 3 market trends that could shape the rest of 2026, open-weight and Chinese model providers are taking real share against the incumbents, and pricing pressure is part of why. My one-engineer experiment is a data point inside that pattern, not an exception to it.

The Tradeoffs: Still Not One-Shot Magic

Let me be clear about what DeepSeek is not.

It is not a one-shot oracle. Neither is Opus. In a month of heavy use, I never got a single prompt to produce final, shippable code without iteration. Always there were bugs, always there were misreadings of the codebase, always there was a follow-up round.

DeepSeek also feels more conservative in some places. Frontend polish, microcopy, and tasteful UI choices still feel like an Opus strength to me, though that may be a "what I'm used to" effect rather than a real gap.

And the longer runs cut both ways. When the model is right, you save reprompts. When it's wrong, you've now wasted 40 minutes instead of 15 going down the wrong path. Long-running agents amplify both signal and noise, so a clear, well-scoped prompt matters more, not less.

Why I'm Not Rushing Back to Opus

I know Opus has shipped meaningful changes recently and I'll experiment with them eventually.

But I'm not paying $200 again for a subscription that punishes the way I actually work: long sessions, deep exploration, mixed coding and data workloads, weekends where I want to think out loud and let the model run.

DeepSeek isn't beating Opus on every axis. It's beating it on the axis I happen to care about right now, at a fraction of the price, with no quota anxiety. That's a serious competitor. And the next version will only narrow the remaining gaps.

If you've been hitting the same Opus walls I was, spend $10 and run your own experiment. That's all this is.

- 3 market trends that could shape the rest of 2026

Dan Gurgui | A4G
AI Architect

Weekly Architecture Insights: Subscribe

I Deployed Gemma 4 32B on a Rented H100 for $1.50/Hour. The Hard Part Wasn't What I Expected.

Dan Gurgui — Sun, 05 Apr 2026 17:11:38 +0000

I Deployed Gemma 4 32B on a Rented H100 for $1.50/Hour. The Hard Part Wasn't What I Expected.

The surprising part: H100 access felt almost trivial

This week I experimented with vast.ai, a marketplace where you can rent GPU hardware on demand for AI workloads. I walked in expecting friction. Provisioning an NVIDIA H100, deploying a brand-new model, configuring networking — all of it sounded like a weekend project at minimum. Instead, I had a freshly released Gemma 4 32B model running and responding to prompts in about an hour. The cost? Roughly $1.50 per hour for an H100.

Why I tried vast.ai (and what I needed)

I've been wanting to test self-hosted LLMs for coding assistance. The goal was simple: deploy a capable model on remote hardware, connect to it from my local development environment, and use it as a coding agent through Cline. No API rate limits, no per-token billing that spirals, just a flat hourly rate for raw compute.

Vast.ai gives you a catalog of available machines from individual GPU providers. You pick an NVIDIA card (anything from consumer RTX series up to H100s), configure storage, CPU cores, and RAM, then spin it up. Like an Airbnb for GPUs. The platform handles the matchmaking; you handle the workload. With the AI tools ecosystem now tracking over 4,000 tools and growing, self-hosted infrastructure like this is becoming a practical alternative to managed API services, especially when you want full control over your model and data.

Deployment walkthrough: Gemma 4 32B in about one hour

Google had just released Gemma 4, and I wanted to test it while it was still fresh. The deployment process on vast.ai was more straightforward than I expected.

I selected an H100 instance with enough VRAM to fit the 32B parameter model comfortably. The platform lets you filter by GPU type, VRAM, and price, so finding the right machine took a few minutes. Once provisioned, I SSH'd into the instance and set up the serving stack. For a model like Gemma 4 32B, you need a serving framework (vLLM or text-generation-inference work well here) that exposes an OpenAI-compatible API endpoint.

The model download and loading took the bulk of that hour. Once the server was up, I could hit the endpoint from my local machine. The deployment side of this experiment was the easy part.

Cost and speed reality check: what $1.50/hour buys

For context, an H100 on AWS (p5 instances) runs roughly $30 to $40 per hour depending on region and commitment. Even spot pricing on major clouds rarely drops below $10/hour. Lambda Labs and RunPod sit somewhere in the $2 to $4/hour range for comparable hardware. At $1.50/hour, vast.ai is at the aggressive end of that spectrum.

The inference speed I observed was around 20 tokens per second. Not blazing fast, but comparable to what you experience with Claude or other hosted coding agents through tools like Cline. For interactive coding workflows, 20 tokens/sec is workable. You're not waiting 30 seconds for a response. It feels conversational enough.

The tradeoff is clear: you lose the managed experience and reliability of a first-party API. You gain cost control and model flexibility.

The real challenge: using the remote LLM from my local machine

Everything I described so far went smoothly. The friction started the moment I tried to connect Cline (a VS Code extension for AI-assisted coding) to my remotely deployed model.

Cline expects an OpenAI-compatible endpoint, which my serving stack provided. But the integration was rough. I hit bugs I didn't anticipate: connection timeouts that weren't timeout issues, malformed request headers, response parsing failures that gave cryptic error messages. Each problem required a different workaround. Some were Cline configuration issues. Others seemed to be edge cases in how Cline handles non-OpenAI endpoints.

I did manage to get a small feature implemented and a PR submitted. But the ratio of "time debugging the toolchain" to "time actually coding with the model" was painful. For every productive 15 minutes, I spent 15 to 30 minutes troubleshooting the connection layer. Getting Cline to behave was, by far, the hardest part of this entire experiment.

Failure mode postmortem: context overflow killed the machine

The most frustrating failure was a context window overflow. Gemma 4 32B on the H100 has a context window around 32,000 tokens. During a longer coding session, Cline pushed the conversation past that limit, hitting roughly 32,500 tokens. Instead of gracefully truncating or compacting the conversation, the model tried to process the full context.

That extra 500 tokens was enough to overfill the GPU's VRAM. The process didn't crash cleanly. It hung. The machine became unresponsive, SSH sessions froze, and there was no way to recover. I had to terminate the instance entirely and provision a new one, losing the session state.

The model didn't fail loudly. It failed silently, which is worse.

This is a real operational risk when you're self-hosting. Managed APIs handle context truncation for you. When you own the stack, you own every failure mode too.

Lessons learned: guardrails you'll want from the start

If you're planning a similar setup, a few mitigations would save you hours.

Budget your context aggressively. Set a hard limit at 80% of the model's context window (around 25,600 tokens for a 32K model). Don't let your client tool manage this on its own. Monitor token counts on the server side if possible.

Break complex coding tasks into smaller requests rather than letting the conversation accumulate. Shorter, focused prompts keep you well within the context budget and reduce the chance of a catastrophic hang.

Vast.ai supports stopping and restarting instances, so snapshot your instance before long sessions. If you're about to start a heavy run, make sure you can recover without re-provisioning from scratch.

What I'm looking for next

The experiment proved the concept. Self-hosted LLMs on rented hardware are viable for coding workflows, and the cost is genuinely competitive. The weak link wasn't the model or the infrastructure. It was the local client tooling.

I'm actively looking for alternatives to Cline that handle remote OpenAI-compatible endpoints more gracefully, especially around context management and error recovery. If you've had success with other tools (Continue, Aider, or something else entirely), I'd genuinely like to hear about it.

The infrastructure problem is solved. The developer experience problem is not.

Dan Gurgui | A4G
AI Architect

Weekly Architecture Insights: Subscribe

How I Passed the AWS Generative AI Developer Professional Certification (and Earned the Early Adopter Badge)

Dan Gurgui — Sun, 18 Jan 2026 09:17:00 +0000

TL;DR

Time invested: ~4 weeks of focused preparation
Resources used: Frank Kane's Udemy course, Stephane Maarek's AI Practitioner tests, Tutorials Dojo practice exams, AWS documentation, hands-on Bedrock projects
Difficulty level: Hardest AWS exam I've taken—questions require 2-3 layers of mental assumptions
Top 3 tips: (1) Take AI Practitioner first for exam structure familiarity, (2) Focus on Bedrock integrations with other AWS services, (3) Budget the full 4 hours and prepare physically for endurance

Why I Pivoted from Solutions Architect Professional to GenAI (and What Surprised Me)

I recently got my AWS Generative AI Developer Professional certification, along with the early adopter badge. Here's the thing—this wasn't my original plan at all.

I'd been preparing for the AWS Solutions Architect Professional for some time, with the exam scheduled somewhere in December. Then AWS launched the Generative AI Developer certification mid-November, and I found out about it at the beginning of December.

It caught me off guard. AWS putting this much emphasis on Generative AI specifically? That was a signal worth paying attention to. The generative AI in software development market is experiencing explosive growth, and AWS clearly wants certified professionals ready to build on their platform.

I decided to dig into it. With significant AWS experience and training already under my belt, I wanted to understand what this certification actually meant—and whether pivoting made strategic sense.

Turns out, it did. But the path wasn't straightforward.

What I Thought the Exam Would Be vs. What It Really Targets

I understood very quickly that this certification is Bedrock and AI heavy. What I wasn't sure about was how deep it went into machine learning territory.

I tried my best to find information online. No luck. With a brand-new certification, the community hadn't built up the usual knowledge base of "here's what to expect" posts. That's the early adopter tax—you're trading uncertainty for the badge before the market gets flooded.

What I eventually discovered through preparation and the exam itself:

Bedrock is the core. If you don't know Bedrock inside and out, you're not passing.
Integration is everything. More than 50% of questions require knowledge of how Bedrock works with other AWS services—Lambda for Agents, OpenSearch for RAG, IAM for permissions, CloudWatch for monitoring, Comprehend for PII detection.
SageMaker shows up, but it's not the focus. There are ML questions, but they're not as dominant as some practice tests suggest.
This is really a GenAI Architect exam in disguise. Despite the "Developer" title, the focus is on integrating services and designing solutions, not just writing Python code.

The exam aligns with where the industry is heading. According to TechTarget's analysis of 2026 generative AI trends, agentic AI orchestration and plug-and-play LLMs are becoming key focus areas—exactly what Bedrock enables. AWS is making this certification hard because the market value for these skills is projected to skyrocket.

My Preparation Path: What I Used and Why

Starting with Frank Kane's Course (The Skim Phase)

Knowing the exam was Bedrock-heavy, I decided to take Frank Kane's Udemy course to get a grasp of it. The course is substantial—22 hours of learning. Initially, I just skimmed it.

What I realized: this is all about Bedrock and how Bedrock works together with other AWS services, plus the AI solutions AWS provides. It gave me a mental map of what I needed to know, even if I wasn't ready to go deep yet.

The AI Practitioner Detour (Strategic, Not a Distraction)

I remembered that the AI Practitioner certification also touches on many AWS AI solutions and Bedrock. So I made a decision that ended up being crucial: take AI Practitioner first.

Not because I needed the knowledge—the requirements are somewhat light, more general understanding of ML/AI terminology than hands-on skills. But because I needed to understand the exam structure and feeling before jumping into a Professional-level certification with zero practice materials available.

I used Stephane Maarek's practice test. Did it twice before scheduling the exam. The exam itself was very straightforward, and I got results immediately.

This gave me exactly what I needed: training for the AWS Generative AI Professional format without the high stakes.

The Deep Dive (Where Real Learning Happened)

After AI Practitioner, I went back to Frank Kane's course—this time properly. I started playing around with Bedrock and foundation models hands-on.

After making sure I had a solid understanding, I did the course's practice test.

It felt very, very easy. Too easy for a Professional certification.

That worried me.

Practice Tests Reality Check: Too Easy, Off-Target, or ML-Heavy

I did some digging around, and many Redditors shared the same experience: the exam is Professional-level hard, but there aren't many accurate practice tests available. The questions are reportedly brutal.

So I practiced even more with Bedrock directly:

Deployed guardrails for a test project (critical for enterprise adoption—hallucination prevention and PII masking are the biggest blockers right now)
Integrated AWS Comprehend for PII detection
Deployed AgentCore (directly relevant to the agentic orchestration trend predicted for 2025/2026)
Re-read the entire AWS documentation at least twice

I tried Gemini's quiz features, but it kept drifting into ML Specialty territory rather than AWS Bedrock Generative AI specifics. Not helpful.

Then I found people recommending a new practice test on Tutorials Dojo. I took it.

Reality check: I was barely getting 65%. And I understood exactly why—I had very little experience with Machine Learning, and many questions were SageMaker ML-heavy.

I worked to get to a 75% rate on the two question sets there. But it felt like I'd need another month or two of deep ML study to score higher.

The SageMaker Dilemma: How Much ML You Actually Need

I was confronted with a big dilemma.

Option A: Take the Generative AI exam without deep SageMaker knowledge.
Option B: Spend another month or two mastering ML concepts first.

Many people in the community recommended sticking to Bedrock. The logic: this is a Generative AI certification, not ML Specialty. SageMaker matters, but it's not the core.

I decided to take my chances.

The result: It went really well.

While the exam did have a couple of SageMaker questions, the majority were Bedrock-heavy. All the hands-on Bedrock practice paid off.

Here's the lesson: knowing what NOT to study deeply can be as important as knowing what to study. I could have spent two months on SageMaker and still faced the same Bedrock-integration questions. The risk calculation was worth it.

What the Exam Actually Felt Like: Question Style, Time Pressure, and Mental Endurance

The questions were some of the toughest I had ever seen on any AWS exam.

Not because of technical complexity alone—but because of how they were formulated. Each question required building a mental model with two or three layers of assumptions before arriving at an answer.

Here's what I mean. Don't expect questions like "What does Guardrails do?" Expect something closer to: "If a Guardrail filters PII, but the Agent is configured to retry on failure, and the Lambda timeout is set to X seconds, what is the user experience when the content policy triggers?"

You're not just recalling facts. You're simulating system behavior in your head.

Time breakdown:

Some questions took me 10-15 minutes each
I used 3 hours and 30 minutes of the 4-hour exam
That's an average of about 3 minutes per question
I sped up on the last questions because I physically couldn't sit still any longer

The 4-hour allocation for non-native English speakers isn't generous—it's necessary. And even native speakers should expect to use most of it.

Physical reality: By hour three, I desperately needed to move around. I couldn't resist the urge to finish quickly just to stand up. Plan for this. The exam is a mental marathon, but it's also a physical endurance test.

Key Takeaways: How to Prepare Efficiently (Without Overstudying)

Here's what I'd tell anyone preparing for this certification:

The exam is genuinely difficult. The tough questions aren't a rumor. Accept this going in and prepare accordingly.
The 4-hour time limit is real. Non-native speakers get this by default, but everyone needs it. Don't rush through practice tests—simulate the actual pacing.
Prepare physically. No water breaks, no bathroom breaks, no interruptions for the full duration. Eat well before. Hydrate earlier in the day, not right before.
Practice tests are imperfect. They'll either train you for things not on the exam or be too soft. Use them for structure familiarity, not content accuracy.
AWS documentation + hands-on practice is the real preparation. Deploy guardrails. Build agents. Integrate Comprehend. Read the docs twice.
Know Bedrock integrations cold. More than half the questions require understanding how Bedrock works with Lambda, OpenSearch, IAM, CloudWatch, S3, and other services.
Take another certification first. AI Practitioner or AWS Solutions Architect gives you exam format experience and foundational knowledge. Prior certification experience is incredibly valuable here.

Who Should Take This Next (and What's Coming)

If you have solid AWS experience and want to position yourself for the generative AI wave, this certification is worth the effort. The early adopter badge won't be available forever, and the market demand for these skills is only growing.

My recommendation: Don't rush it, but don't over-prepare either. Focus on Bedrock, integrations, and hands-on practice. Accept that some ML questions will show up, but don't let SageMaker anxiety derail your timeline.

In a follow-up post, I'll share what I actually learned through this process—the technical knowledge that stuck, and what prior experience helped me the most. Stay tuned.

LangSearch Inside Claude: The Fastest “Search Tool” Setup I’ve Used Lately

Dan Gurgui — Thu, 01 Jan 2026 15:13:56 +0000

Hook: When Claude’s web search feels nerfed, add a turbocharger

I’ve been playing with a bunch of “AI + web” setups lately, and I keep running into the same vibe: the model is smart, but the search layer feels… constrained.

You ask for sources, you ask for breadth, you ask for “show me five different angles,” and you get a couple of thin results, slow turnaround, or citations that feel like they were picked by a cautious librarian with a strict budget. I’m not even mad about it, I get why default web search has guardrails. But in practice, it can feel nerfed.

Then I tried LangSearch inside Claude.

Man. That shit is amazing.

The difference isn’t subtle. With the default experience, I’m nudging and waiting. With LangSearch wired in, it’s like flying. Blazing fast queries, lots of results, and tight iteration loops. I haven’t felt that kind of “search responsiveness” in other assistants lately.

What LangSearch-in-Claude actually is (in plain terms)

At a high level, you’re doing something simple: you’re giving Claude a better search engine to call.

Claude supports tool use (Anthropic calls it tool use / function calling). You register a tool with:

A name (what Claude will call)
A description (when it should use it)
A schema (what inputs it accepts)
And you provide the actual execution (you run the API call in your app, or via an agent runner)

Then you hand it:

Your LangSearch API key
A bit of LangSearch documentation (or at least the endpoint + parameters you want Claude to use)
And you let Claude decide when to execute search queries

In practice, it looks like: Claude generates a structured tool call, your code calls LangSearch, and then Claude reads the results and synthesizes an answer with citations.

Here’s a minimal sketch of what “tool registration” looks like conceptually (exact wiring depends on your runtime and SDK):

{
  "name": "langsearch_query",
  "description": "Search the web for up-to-date information and return top results with snippets and URLs.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": { "type": "string" },
      "num_results": { "type": "integer", "default": 5 }
    },
    "required": ["query"]
  }
}

And then your executor does something like:

// TypeScript-ish pseudo-code
async function langsearch_query({ query, num_results = 5 }) {
  const res = await fetch("https://api.langsearch.com/v1/search", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.LANGSEARCH_API_KEY}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({ query, num_results })
  });

  return await res.json();
}

That’s it. You’re not “making Claude smarter.” You’re giving it a higher-throughput retrieval layer.

Why it feels so fast: the practical differences you notice

Speed is a mushy word, so here’s what I actually mean when I say it feels faster.

1) Latency: time-to-first-usable result drops

With a good search API, you get results back quickly, consistently. That matters because most of us don’t do one search. We do a research loop:

Ask question
Skim results
Refine query
Pull a second source to confirm
Summarize, compare, decide

If each loop costs 20–40 seconds, you stop iterating. If each loop costs 3–8 seconds, you keep going. Iteration speed is the real productivity unlock.

2) Throughput: you can ask for breadth without punishment

A common failure mode with built-in search tools is that they return a tiny handful of results, or the model “chooses” to search less often than you’d like.

With LangSearch, you can comfortably ask for:

10–20 results
multiple query variants
separate searches per subtopic

…and it doesn’t feel like you’re paying a tax in waiting time.

3) Relevance: fewer “why is this result here?” moments

This is subjective, but I noticed fewer irrelevant links and fewer “SEO sludge” pages in the top set. That means less time spent telling the model, “no, not that, the other thing.”

4) Tool calling reliability is good enough to trust in the loop

Tool use has gotten materially better. Anthropic’s tool use is generally strong, and independent evaluations like Berkeley’s function calling leaderboards show modern models are much more consistent about producing valid tool calls than they were a year ago (BFCL: https://gorilla.cs.berkeley.edu/leaderboard.html). That reliability matters because flaky tool calls destroy the “flying” feeling fast.

Where it shines: 4 research workflows that benefit immediately

This is where it stopped being a neat trick and started being a daily driver for me.

1) Competitive scans without the pain

If you’ve ever tried to map a market quickly, you know the drill: a dozen tabs, half of them garbage, and you still miss two important players.

With LangSearch inside Claude, I’ll do something like:

“Search for the top 15 vendors in X”
“Now search for ‘X vs Y’ comparison posts”
“Now search for pricing pages and extract tiers”
“Now summarize positioning in a table”

What changes is not that Claude can summarize, it always could. What changes is how quickly you can gather enough raw material to make the summary credible.

2) Troubleshooting with real-world context

This is my favorite use case.

When you hit a weird production issue, the docs are often not enough. You want:

GitHub issues
changelogs
forum posts
“someone hit this in Kubernetes 1.29 with Cilium” type threads

LangSearch is great for that “needle in a haystack” search pattern, especially when you chain it:

Search for the exact error string
Search again with the library version
Search for “workaround” / “regression” / “breaking change”
Pull 3–5 sources and ask Claude to reconcile them

The output gets better because the input set is better.

3) Sourcing: pulling multiple perspectives fast

Engineers often need to answer questions that look simple but aren’t:

“Is this API stable?”
“What are the known footguns?”
“Is the community alive?”
“Does anyone regret adopting this?”

Those answers don’t come from one official page. They come from triangulation.

LangSearch makes it cheap to pull:

official docs
blog posts
issue trackers
community threads

Then Claude can do what it’s good at: pattern matching across sources and telling you what’s consistent vs what’s anecdotal.

4) Summarizing multiple pages (without pretending)

A lot of assistants will “summarize the web” while actually summarizing a couple of snippets.

With a fast search tool, you can push a more honest workflow:

pull 10–15 relevant URLs
ask Claude to summarize with citations
ask it to call out disagreements between sources
ask it what’s missing and run another search

This is especially useful for writing technical docs, internal RFCs, or even blog posts where you want breadth without spending half a day collecting links.

How to evaluate it yourself (a simple benchmark you can run)

If you’re considering wiring this into your own setup, don’t trust vibes. Run a repeatable test.

A lightweight benchmark

Pick three research tasks you actually do at work. For example:

Troubleshoot a specific error message from your logs
Compare two competing tools (feature + pricing + tradeoffs)
Find the latest docs / changelog for a dependency and summarize what changed

Then run the same workflow across:

Claude + built-in web search (if you have it enabled)
Claude + LangSearch tool
(Optional) another assistant you use day-to-day

Track three metrics:

Time-to-first-good-answer (not first answer, first useful one)
Citation quality (are links relevant, diverse, and not duplicated?)
Iteration count (how many follow-ups did you need to get to “done”?)

If LangSearch is doing what I’m seeing, you’ll notice the biggest win in iteration count and time-to-first-good-answer, not in raw “model intelligence.”

Caveats and gotchas before you wire it into everything

This is the part people skip, then they get burned.

Cost and rate limits: Fast search encourages more searching. That’s good, until you hit per-minute limits or your bill spikes. Put basic throttling and caching in place.
Key security: Treat the LangSearch API key like any other production credential. Don’t paste it into random clients. Use server-side execution, env vars, secret managers.
Hallucinated citations: Even with real search results, the model can still misattribute a claim to a URL. You want your tool to return structured fields (title, snippet, url), and you want prompts that force quoting or explicit referencing.
Over-trusting “top results”: Search ranking is not truth ranking. For sensitive decisions, you still need to sanity check primary sources.
Built-in search is improving: Anthropic has been investing in web search (they announced a Web Search API in 2025: https://www.anthropic.com/news/web-search-api). The gap may narrow over time. But today, alternatives can still be worth it if research speed matters to you.

Closer + CTA: Add LangSearch, then compare against ChatGPT/Gemini

If you’re using Claude for real engineering work and you keep bouncing off the built-in search experience, try adding LangSearch as a tool. The setup is straightforward, and the payoff is immediate if you do any serious research loops.

I haven’t wired the same setup into ChatGPT yet, but I probably will, mostly because I want a fair comparison under the same benchmark.

If you run this test, I’d love to hear your numbers: time-to-first-good-answer, citation quality, and where it helped (or didn’t). What workflows are you trying to speed up?

Dan Gurgui | A4G
AI Architect

Weekly Architecture Insights: Subscribe

My Experience Using the BMAD Framework on a Personal Project (Patience Required)

Dan Gurgui — Tue, 30 Dec 2025 20:45:20 +0000

Getting Started: “I’ll just use BMAD to move faster”

Over the last couple of weeks I’ve been working with the BMAD framework on a personal project, and I wanted to write this up while it’s still fresh.

Going in, my expectation was pretty simple: I’d plug in my idea, let the workflow guide me, and I’d be writing code quickly, with better direction and fewer dead ends.

That’s… partially true. But there’s a big caveat.

BMAD is not a “start coding in 20 minutes” setup. It’s closer to “do the work up front so the coding part stops being the hardest part.”

And if you’re used to hacking a prototype together first and figuring out the product later, this is going to feel slow. Sometimes painfully slow.

The First Reality Check: it takes a lot of time before you write anything

The first thing you notice with BMAD is that it pushes you into an extensive workflow before you’re allowed to feel productive in the way engineers usually define productivity (shipping code).

It takes you through a bunch of steps like:

Defining the problem (and not just “I want to build X”, but “what pain exists and for who?”)
Defining user personas
Brainstorming approaches
Researching the space
Clarifying constraints (time, money, infra, team, target platform)
Turning that into epics, stories, and execution plans

All of that is useful. But it’s not free.

For me, it took roughly 12 to 16 hours before the first line of code was written.

That number sounds ridiculous if you’re thinking in “weekend project” mode. But the more I sat with it, the more it made sense: BMAD forces you to do the thinking you usually avoid until the project is already messy.

And to be fair, I’ve done the opposite too many times:

Build something fast
Realize I built the wrong thing
Rewrite it
Lose motivation
Abandon it

So yes, this up-front investment is real. It’s also kind of the point.

The Frameworks Are Actually Good (especially for business thinking)

One of the things I genuinely liked is that the frameworks presented in BMAD give you a different perspective, especially around the business side of what you’re building.

If you’re an engineer building a personal project, you usually start with:

“What stack do I want to use?”
“What architecture seems clean?”
“What cloud services are cheapest?”

BMAD drags you back to questions like:

Who is this for, specifically?
What are they trying to accomplish?
What do they do today instead?
Why would they switch?
What’s the smallest thing that proves value?

Even if you think you already know those answers, writing them down forces clarity.

The value here isn’t that it tells you something magical. The value is that it makes you commit to decisions.

But again, you pay for that clarity with time. You’re not coding, you’re thinking and documenting.

“Party Mode” and how I burned through context and credits

Then I hit the fun (and painful) part: party mode.

If you haven’t used it, party mode is basically the “get multiple perspectives and generate a lot of material quickly” mode. It can be super useful when you want breadth:

different solutions
different tradeoffs
different product angles
risk lists
architecture options

I made the mistake of telling it to run party mode with LangSearch and also run party mode with Gemini, and that combo absolutely exhausted my context window and usage credits.

What happened was predictable in hindsight: party mode wants to read, pull in sources, synthesize, then generate. That means:

lots of tokens in
lots of tokens out
and depending on the tools, lots of paid calls

I tried to be clever and tell it something like: “don’t read everything, just put stuff into files and summarize.”

In practice, that didn’t really work the way I expected. Once you’ve instructed the workflow to do deep research, it tends to follow through. It wants to gather the material so it can justify conclusions. That’s good for quality, but bad for cost control if you’re not careful.

Still, I’ll say this: it was very useful. The output was genuinely better when it had multiple angles to compare. It just came at a price.

If you’re going to use party mode, my advice is simple:

use it intentionally
set boundaries (scope, sources, max depth)
and assume it will be expensive if you let it run wild

12–16 hours later: the first line of code… and then I hit an architecture wall

After all the setup and the workflow, I finally got to the point where code started getting written.

And almost immediately I realized I had made an architecture mistake.

This part is important because it’s the kind of mistake that’s easy to make when you’re letting an assistant drive, and you’re “supervising” instead of actively building.

I had told the architect to focus on low cost, so it leaned into a serverless setup, specifically AWS Lambda-style compute. Then I told it to use NestJS.

On paper, that sounds fine. In reality, it’s tricky.

NestJS can run in a serverless environment, but it’s not “drop in NestJS and deploy to Lambda” unless you set it up correctly. You typically need an adapter layer (for example, using @vendia/serverless-express or similar patterns) or you use a framework that’s more directly aligned with serverless request handling.

Without that, you get a mess of mismatched assumptions:

long-lived server patterns vs cold starts
framework bootstrapping time vs latency expectations
request lifecycle differences
deployment packaging and handler wiring

So what happened next is exactly what you’d expect: errors all over the place, and a system that kept trying to fix itself in a loop, without making real progress.

The 6-hour debugging spiral (and why it was so confusing)

I spent a huge amount of time trying to fix it, around six hours.

The frustrating part was that in the moment, I didn’t immediately know what was wrong. It wasn’t one clean error like “you used the wrong import.”

It was more like:

something fails
you fix the symptom
something else fails
the fix introduces another issue
you end up in a loop

If you’ve ever dealt with a misaligned architecture decision early in a project, you know the feeling. The code is “correct” in isolation, but the environment and assumptions are wrong.

This is also where AI-assisted workflows can get weird. If the system is trying to be helpful, it can keep proposing changes that look plausible locally, but don’t address the root mismatch. You can burn a lot of time approving “reasonable” edits that never converge.

And that’s exactly what happened. It kept spinning, and I kept thinking, “why is this stuck?”

The turning point: I didn’t figure it out, the retrospective did

Here’s the interesting part: it wasn’t me that realized the core issue first.

What happened is I noticed it was spending too much time and not converging, and I decided to initiate the BMAD workflow for running a retrospective.

That retrospective step ended up being the breakthrough.

Because instead of continuing forward motion (which was fake progress), it forced a pause and asked:

what are we trying to do?
what’s blocking us?
what assumptions did we make?
what changed?
what decision is causing repeated failure?

That’s when it became clear that the setup was not right. The architecture needed adjustment to match the runtime model.

Once that was identified, the next steps were obvious:

either adjust the NestJS setup to run properly in a serverless handler model
or change the compute model (for example, containerized service on something like ECS/Fargate, or a simple VM), depending on goals
or pick a framework more naturally aligned with serverless

The main point is that the retrospective forced the system to stop patching and start diagnosing.

And honestly, this is one of the strongest arguments for structured workflows like BMAD. Most engineers don’t run retrospectives on a personal project when things go wrong. We just grind harder.

I’ve done that grind plenty of times. It rarely helps.

After the fix: everything went smoothly (and the “stories” became the superpower)

Once everything was set up correctly, the experience changed completely.

The biggest win for me was the fact that I had stories. Real stories. Not vague tasks like “build backend.”

With stories, I could tell it exactly what to implement, in a way that was scoped and testable. That meant I wasn’t doing a bunch of extra work translating ideas into engineering tasks. The translation was already done.

At that point my role became:

supervise
review decisions
sanity check the code
occasionally click yes/no for requests and changes
keep it aligned with the goal

That’s a very different feeling than “I’m the one doing everything.”

And it’s genuinely cool when it works because it shifts the bottleneck. Instead of “how fast can I type,” it becomes “how well can I review and steer.”

If you’ve ever led a team, you’ll recognize that mode. You’re not writing every line. You’re making sure the work being done is the right work.

What BMAD gets right: patience in exchange for momentum

Overall, I think BMAD is really cool.

But I don’t want to oversell it. The trade is clear:

You need patience to set it up
You need to give good answers
You need to review everything
and you need to accept that the early phase feels slow

If you treat it like a magic code generator, you’re going to be annoyed.

If you treat it like a process that front-loads thinking, documentation, and execution structure, it starts to make sense.

And once you’re past that initial slope, it becomes pretty straightforward.

The underrated feature: you can resume anytime because everything is in documents

Another thing I didn’t appreciate until I was in it is how nice it is that you can resume at any time.

Because everything is written down, you’re not relying on your memory or on some fragile chat context. You have artifacts:

personas
problem statements
architecture notes
epics
stories
decisions and tradeoffs

So you can come back after a day or a week and say:

“execute this epic”
“continue this story”
“implement the next task”
“run a retrospective on the last change”

And it doesn’t feel like starting over.

For personal projects, that’s huge. Most of us lose momentum not because we can’t code, but because we return after a break and spend an hour reconstructing context.

BMAD reduces that tax.

What I’d tell another engineer before they try it

I’m not going to pretend this is the answer for every project. If you’re hacking a quick script or testing an API idea, BMAD is probably too heavy.

But if you’re building something that you actually want to ship, even as a solo developer, it’s worth considering.

A few practical lessons from my run:

Budget time for setup. If you expect to write code in the first hour, you’ll fight the workflow.
Be careful with party mode. It’s useful, but it can burn context and credits fast.
Don’t treat architecture prompts casually. “Low cost” pushes you toward serverless patterns, which can be great, but it constrains framework choices and deployment shape.
Use the retrospective when you’re stuck. The instinct is to push forward. The smarter move is to stop and diagnose.
Stories are where the payoff happens. Once you have good stories, execution becomes much more mechanical.

Closing thoughts

BMAD ended up being one of those experiences where the first phase feels like friction, and then later you realize the friction was the whole point.

It forced me to slow down, define what I was doing, and make decisions explicit. I burned time (and credits) in a couple places, especially with party mode. I also lost six hours to an architecture mismatch that I should have caught earlier.

But once the workflow and docs were in place, it got surprisingly smooth. Being able to resume from epics and stories, and to steer implementation without constantly rewriting requirements, is a real productivity shift.

If you try BMAD, bring patience. Bring discipline. And assume you’ll spend more time thinking before you spend time coding.

If this resonates, what’s your experience been with structured AI-assisted workflows? I’m curious.

AWS in the AI era: Bedrock, SageMaker, and the enterprise-first tradeoff

Dan Gurgui — Tue, 30 Dec 2025 20:27:05 +0000

1. The enterprise AI bet: what AWS is actually optimizing for

Here’s the uncomfortable truth about AWS in AI: they’re not trying to “win the model leaderboard.” They’re trying to win regulated, enterprise AI workloads where the boring stuff matters more than the demos.

If you’re building AI in a bank, healthcare company, or a Fortune 500 with a security team that says “no” by default, the biggest risk isn’t that your model is 2% worse on a benchmark. It’s that you can’t answer basic questions like:

Where did the data go?
Who accessed it?
Can we keep traffic private?
Can we prove compliance later?

AWS’s AI story (Bedrock + SageMaker + the prebuilt services like Comprehend/Textract/Transcribe) is basically: control, governance, deployment flexibility, and integration with the rest of AWS—even if that means they move slower on “shiny new capability” than innovation-first competitors.

2. AWS’s AI stack, mapped to real enterprise jobs-to-be-done

When people say “AWS AI,” they often mash everything together. In practice, AWS has multiple layers, and each maps to a different “job” inside an enterprise.

Bedrock: “Give me foundation models, but keep it enterprise-safe”

Amazon Bedrock is the managed “foundation model” layer. You use it when you want access to large models (text/image, etc.) without owning the training pipeline.

The enterprise job-to-be-done here is usually:

Build internal copilots (support, ops, engineering enablement)
Do RAG (retrieval-augmented generation) over company docs
Add summarization/classification into workflows

Bedrock’s pitch is less “best model” and more choice + governance + integration. You can swap models, apply guardrails, and wire it into IAM/VPC patterns you already use.

SageMaker: “We’re building, not just consuming”

SageMaker is for teams that want control: training, fine-tuning, hosting endpoints, MLOps workflows, model registry, monitoring, and pipelines.

The job-to-be-done:

Train or fine-tune models on proprietary data
Run repeatable ML pipelines with approvals and audit trails
Own deployment patterns (multi-account, multi-region, blue/green)

If Bedrock is “buy,” SageMaker is “build.” It’s also where AWS shines for organizations that already have a platform mindset.

Comprehend: “We need NLP features, not a whole LLM app”

Comprehend is classic managed NLP: entity extraction, sentiment, classification, PII detection, etc.

The job-to-be-done:

Extract meaning from support tickets, reviews, claims, emails
Detect PII for compliance workflows
Standardize analytics without building a custom model

It’s not sexy, but it fits enterprises that want predictable outputs and a managed service contract.

Textract: “Turn PDFs and scans into data we can use”

Textract does OCR + structured extraction from forms and tables.

The job-to-be-done:

Invoice processing
Insurance claim ingestion
KYC document parsing
Any “we’re drowning in PDFs” workflow

This is one of those services you don’t brag about, but it pays for itself when it works.

Transcribe: “Convert audio to text at scale”

Transcribe is speech-to-text.

The job-to-be-done:

Call center transcription
Meeting notes
Compliance archiving
Searchable audio libraries

And yes, this is where the quality/cost conversation gets real (we’ll get there).

3. Differentiators that matter in regulated environments

If you’ve only built AI prototypes, AWS can feel “too heavy.” If you’ve built AI in a regulated org, a lot of AWS’s choices make more sense.

Data boundaries (and why enterprises obsess over them)

A big part of AWS’s positioning is reducing the fear that your data becomes someone else’s training set.

For Bedrock specifically, AWS states that customer inputs and outputs are not used to train the underlying foundation models by default. That’s the kind of sentence that procurement teams love, because it maps to a risk they can actually articulate.

In practice, what matters isn’t marketing—it’s whether you can put the right contractual and technical boundaries around data flows.

Private networking: VPC, PrivateLink, and “keep it off the public internet”

A lot of AI competitors assume public endpoints and “trust us” security. AWS’s default enterprise move is: put services behind private connectivity.

Patterns you’ll see in real deployments:

Bedrock access via VPC endpoints / AWS PrivateLink (where supported)
SageMaker endpoints in private subnets
Tight egress controls + centralized logging

This isn’t about paranoia. It’s about making your AI system fit the same threat model as everything else you run.

IAM, auditability, and “who did what, when”

AWS’s identity and governance tooling is a differentiator when you actually need it:

IAM policies for fine-grained access
CloudTrail for audit logs
KMS for encryption and key control
Organizations / SCPs for guardrails at scale

If you’ve ever been asked to produce an audit trail for an AI system, you know why this matters. It’s not just security—it’s operational credibility.

Residency and multi-region controls

Enterprises care about data residency, disaster recovery, and “what happens if a region is down.”

AWS’s global footprint and mature multi-region patterns make it easier to design:

Region-pinned workloads
Cross-region failover
Separate prod/test accounts with clear boundaries

Guardrails and governance as product features

AWS is leaning into guardrails (policy controls, content filters, safety boundaries) because enterprises want enforceable rules, not “please behave” prompts.

This is the enterprise-first vs innovation-first trade: guardrails slow you down a bit, but they also keep you from getting fired.

4. Where AWS falls short: product quality and developer experience

Now the part people don’t say out loud: AWS’s AI portfolio is uneven. Some services are rock-solid. Others feel like they shipped because the roadmap demanded it, not because the UX was done.

Transcribe quality: “good enough” isn’t always good enough

There are plenty of teams who report that Transcribe can struggle depending on:

Accents and multilingual audio
Crosstalk in meetings
Domain-specific vocabulary (medical, legal, internal acronyms)
Noisy environments

Speech-to-text is brutally sensitive to audio quality and domain mismatch. If you’re building anything user-facing, “mostly accurate” can translate into “constant complaints.”

The practical issue isn’t whether Transcribe is bad. It’s that you may need to run bake-offs and measure WER (word error rate) on your audio—not a vendor demo.

Q Developer: useful, but AWS-shaped

Amazon Q Developer is clearly designed to make AWS developers faster. That’s not inherently wrong.

But if your stack is multi-cloud, heavy Kubernetes, or you’re not all-in on AWS services, Q Developer can feel narrow. It’s less “universal coding copilot” and more “AWS acceleration tool.”

That’s fine if you want exactly that. It’s frustrating if your expectation is parity with general-purpose coding assistants.

OpenSearch as a knowledge base: operational pain is real

AWS pushing OpenSearch (their Elasticsearch fork) is a classic example of enterprise tradeoffs: you get control, hosting options, and integration—but you also inherit operational complexity.

Teams using OpenSearch for RAG knowledge bases often run into:

Debugging relevance issues (tokenization, analyzers, mappings)
Cluster sizing and shard management
Upgrades and version quirks
“It works until it doesn’t” operational incidents

Yes, you can use managed OpenSearch. You still need people who understand it. If you don’t have that expertise, “cheap and flexible” becomes “slow and fragile.”

This is where many teams end up hybrid: a managed vector DB elsewhere, or a simpler managed retrieval layer—because DX matters when you’re iterating weekly.

5. Where AWS falls short: cost, pricing complexity, and surprise bills

AWS has a cost story that’s both true and annoying:

Infrastructure can be cost-effective at scale.
Managed AI services can get expensive fast.
Pricing is rarely simple enough to estimate confidently.

Transcribe pricing vs alternatives (a concrete example)

AWS Transcribe’s standard batch transcription is $0.024 per minute in the first pricing tier, according to AWS’s pricing page: https://aws.amazon.com/transcribe/pricing/?p=ft&z=4

Let’s do back-of-the-napkin math:

10,000 minutes/month (~167 hours) → 10,000 × $0.024 = $240/month
100,000 minutes/month (~1,667 hours) → $2,400/month
1,000,000 minutes/month (~16,667 hours) → $24,000/month

At enterprise scale, that’s real money—especially if you’re also paying for:

Storage (S3)
Processing pipelines (Lambda/ECS)
Search/indexing (OpenSearch)
Observability (CloudWatch costs add up)

Research and market comparisons often show that alternatives can be dramatically cheaper—up to ~89% cheaper in some scenarios (depending on model/provider and quality targets). The exact number varies, but the point stands: AWS’s managed convenience is not always the low-cost option.

The real cost killer: “pricing complexity tax”

Even when the per-unit price is reasonable, teams get hit by:

Hard-to-predict request patterns
Multiple services each with their own meters
Network egress surprises in hybrid setups

If you don’t model the full system cost, you’re not budgeting—you’re guessing.

6. Industry direction: open weights + portability, and how AWS fits

The long-term industry gravity is toward more model choice and more portability.

Not just “open source code,” but increasingly open weights and ecosystems where you can run the same model across clouds—or on-prem—depending on security, cost, or latency constraints.

Why open weights are winning mindshare

Open-weight models give you:

Deployment control (where it runs, how it scales)
Vendor optionality (swap infra without rewriting everything)
Better customization paths (fine-tune, distill, quantize)

Enterprises like this because it reduces lock-in risk. Engineers like it because it’s closer to how we build everything else: composable components, measurable performance, replaceable parts.

AWS’s quiet advantage: data gravity and the “boring” platform

Here’s where AWS is better positioned than people think: data management.

If your organization already lives in:

S3 as the data lake
Glue / Lake Formation for catalog and governance
Redshift for warehousing
Kinesis/MSK for streaming
IAM/KMS/CloudTrail for security and audit

…then AWS is a natural place to operationalize open-weight models, because the hardest part of enterprise AI is usually data access + governance, not model APIs.

Infrastructure competitiveness: Trainium/Inferentia vs the world

AWS also has a strong infra story with Trainium (training) and Inferentia (inference). Performance-per-dollar comparisons vary by workload, but independent analyses have compared AWS Trainium against Google TPU v5e and Azure ND H100 instances and found meaningful tradeoffs in cost and throughput depending on model shape and batch sizes (see: https://www.cloudexpat.com/blog/comparison-aws-trainium-google-tpu-v5e-azure-nd-h100-nvidia/).

The point isn’t “AWS is always cheapest.” It’s that AWS is investing in custom silicon plus the surrounding platform. If you’re doing sustained training/inference at scale, that matters.

So the industry trend (open models, portability) doesn’t necessarily threaten AWS. It can actually strengthen AWS’s platform moat—as long as AWS keeps the developer experience and managed service quality competitive.

7. Decision guide: when AWS is the right AI platform (and when it isn’t)

I think about this as build vs buy vs hybrid:

AWS is the right choice when…

You’re in a regulated environment and need IAM, audit logs, encryption, residency controls
Your data is already in AWS and moving it out would be slow/expensive
You want hybrid flexibility: mix Bedrock (buy) with SageMaker (build)
You have platform engineers who can operate the surrounding stack (networking, security, observability)

AWS is not the right choice when…

You need the absolute bleeding edge model capability this quarter and don’t want to wait for AWS integrations
Your team is small and you can’t afford the operational overhead (OpenSearch clusters, multi-service pipelines, cost modeling)
You’re mostly non-AWS and would be fighting the ecosystem instead of benefiting from it

Quick selection checklist

What’s your acceptable error rate (WER, hallucination rate, extraction accuracy)?
What’s your cost target per 1K requests / per hour of audio / per document?
Do you need private networking and audit trails, or is this a public SaaS feature?
What’s your exit plan if pricing or quality disappoints?

AWS wins when you value control and integration. It loses when you value speed and simplicity above all else.

8. Closing: a pragmatic way to evaluate AWS AI in 30 days

If you’re evaluating AWS for AI, don’t start with architecture diagrams. Start with a 30-day pilot that forces reality to show up.

Pick one real workflow (transcription, doc extraction, RAG over internal docs) and measure:

Quality: WER / extraction accuracy / human-rated usefulness
Cost: full system cost, not just API calls
Latency & reliability: p95 response times, error rates, retries
Operational load: how many “platform chores” show up weekly

And write down an exit strategy on day one: what you’d swap first (model, vector store, hosting) if AWS isn’t the fit.

What would your 30-day bake-off reveal about your actual constraints?

My Experience Using the BMAD Framework on a Personal Project (Patience Required)

Dan Gurgui — Sun, 28 Dec 2025 19:46:08 +0000

Getting Started: “I’ll just use BMAD to move faster”

Over the last couple of weeks I’ve been working with the BMAD framework on a personal project, and I wanted to write this up while it’s still fresh.

Going in, my expectation was pretty simple: I’d plug in my idea, let the workflow guide me, and I’d be writing code quickly, with better direction and fewer dead ends.

That’s… partially true. But there’s a big caveat.

BMAD is not a “start coding in 20 minutes” setup. It’s closer to “do the work up front so the coding part stops being the hardest part.”

And if you’re used to hacking a prototype together first and figuring out the product later, this is going to feel slow. Sometimes painfully slow.

The First Reality Check: it takes a lot of time before you write anything

The first thing you notice with BMAD is that it pushes you into an extensive workflow before you’re allowed to feel productive in the way engineers usually define productivity (shipping code).

It takes you through a bunch of steps like:

Defining the problem (and not just “I want to build X”, but “what pain exists and for who?”)
Defining user personas
Brainstorming approaches
Researching the space
Clarifying constraints (time, money, infra, team, target platform)
Turning that into epics, stories, and execution plans

All of that is useful. But it’s not free.

For me, it took roughly 12 to 16 hours before the first line of code was written.

And to be fair, I’ve done the opposite too many times:

Build something fast
Realize I built the wrong thing
Rewrite it
Lose motivation
Abandon it

So yes, this up-front investment is real. It’s also kind of the point.

The Frameworks Are Actually Good (especially for business thinking)

One of the things I genuinely liked is that the frameworks presented in BMAD give you a different perspective, especially around the business side of what you’re building.

If you’re an engineer building a personal project, you usually start with:

“What stack do I want to use?”
“What architecture seems clean?”
“What cloud services are cheapest?”

BMAD drags you back to questions like:

Who is this for, specifically?
What are they trying to accomplish?
What do they do today instead?
Why would they switch?
What’s the smallest thing that proves value?

Even if you think you already know those answers, writing them down forces clarity.

The value here isn’t that it tells you something magical. The value is that it makes you commit to decisions.

But again, you pay for that clarity with time. You’re not coding, you’re thinking and documenting.

“Party Mode” and how I burned through context and credits

Then I hit the fun (and painful) part: party mode.

If you haven’t used it, party mode is basically the “get multiple perspectives and generate a lot of material quickly” mode. It can be super useful when you want breadth:

different solutions
different tradeoffs
different product angles
risk lists
architecture options

I made the mistake of telling it to run party mode with LangSearch and also run party mode with Gemini, and that combo absolutely exhausted my context window and usage credits.

What happened was predictable in hindsight: party mode wants to read, pull in sources, synthesize, then generate. That means:

lots of tokens in
lots of tokens out
and depending on the tools, lots of paid calls

I tried to be clever and tell it something like: “don’t read everything, just put stuff into files and summarize.”

Still, I’ll say this: it was very useful. The output was genuinely better when it had multiple angles to compare. It just came at a price.

If you’re going to use party mode, my advice is simple:

use it intentionally
set boundaries (scope, sources, max depth)
and assume it will be expensive if you let it run wild

12–16 hours later: the first line of code… and then I hit an architecture wall

After all the setup and the workflow, I finally got to the point where code started getting written.

And almost immediately I realized I had made an architecture mistake.

This part is important because it’s the kind of mistake that’s easy to make when you’re letting an assistant drive, and you’re “supervising” instead of actively building.

I had told the architect to focus on low cost, so it leaned into a serverless setup, specifically AWS Lambda-style compute. Then I told it to use NestJS.

On paper, that sounds fine. In reality, it’s tricky.

Without that, you get a mess of mismatched assumptions:

long-lived server patterns vs cold starts
framework bootstrapping time vs latency expectations
request lifecycle differences
deployment packaging and handler wiring

So what happened next is exactly what you’d expect: errors all over the place, and a system that kept trying to fix itself in a loop, without making real progress.

The 6-hour debugging spiral (and why it was so confusing)

I spent a huge amount of time trying to fix it, around six hours.

The frustrating part was that in the moment, I didn’t immediately know what was wrong. It wasn’t one clean error like “you used the wrong import.”

It was more like:

something fails
you fix the symptom
something else fails
the fix introduces another issue
you end up in a loop

If you’ve ever dealt with a misaligned architecture decision early in a project, you know the feeling. The code is “correct” in isolation, but the environment and assumptions are wrong.

And that’s exactly what happened. It kept spinning, and I kept thinking, “why is this stuck?”

The turning point: I didn’t figure it out, the retrospective did

Here’s the interesting part: it wasn’t me that realized the core issue first.

What happened is I noticed it was spending too much time and not converging, and I decided to initiate the BMAD workflow for running a retrospective.

That retrospective step ended up being the breakthrough.

Because instead of continuing forward motion (which was fake progress), it forced a pause and asked:

what are we trying to do?
what’s blocking us?
what assumptions did we make?
what changed?
what decision is causing repeated failure?

That’s when it became clear that the setup was not right. The architecture needed adjustment to match the runtime model.

Once that was identified, the next steps were obvious:

either adjust the NestJS setup to run properly in a serverless handler model
or change the compute model (for example, containerized service on something like ECS/Fargate, or a simple VM), depending on goals
or pick a framework more naturally aligned with serverless

The main point is that the retrospective forced the system to stop patching and start diagnosing.

And honestly, this is one of the strongest arguments for structured workflows like BMAD. Most engineers don’t run retrospectives on a personal project when things go wrong. We just grind harder.

I’ve done that grind plenty of times. It rarely helps.

After the fix: everything went smoothly (and the “stories” became the superpower)

Once everything was set up correctly, the experience changed completely.

The biggest win for me was the fact that I had stories. Real stories. Not vague tasks like “build backend.”

At that point my role became:

supervise
review decisions
sanity check the code
occasionally click yes/no for requests and changes
keep it aligned with the goal

That’s a very different feeling than “I’m the one doing everything.”

And it’s genuinely cool when it works because it shifts the bottleneck. Instead of “how fast can I type,” it becomes “how well can I review and steer.”

If you’ve ever led a team, you’ll recognize that mode. You’re not writing every line. You’re making sure the work being done is the right work.

What BMAD gets right: patience in exchange for momentum

Overall, I think BMAD is really cool.

But I don’t want to oversell it. The trade is clear:

You need patience to set it up
You need to give good answers
You need to review everything
and you need to accept that the early phase feels slow

If you treat it like a magic code generator, you’re going to be annoyed.

If you treat it like a process that front-loads thinking, documentation, and execution structure, it starts to make sense.

And once you’re past that initial slope, it becomes pretty straightforward.

The underrated feature: you can resume anytime because everything is in documents

Another thing I didn’t appreciate until I was in it is how nice it is that you can resume at any time.

Because everything is written down, you’re not relying on your memory or on some fragile chat context. You have artifacts:

personas
problem statements
architecture notes
epics
stories
decisions and tradeoffs

So you can come back after a day or a week and say:

“execute this epic”
“continue this story”
“implement the next task”
“run a retrospective on the last change”

And it doesn’t feel like starting over.

For personal projects, that’s huge. Most of us lose momentum not because we can’t code, but because we return after a break and spend an hour reconstructing context.

BMAD reduces that tax.

What I’d tell another engineer before they try it

I’m not going to pretend this is the answer for every project. If you’re hacking a quick script or testing an API idea, BMAD is probably too heavy.

But if you’re building something that you actually want to ship, even as a solo developer, it’s worth considering.

A few practical lessons from my run:

Budget time for setup. If you expect to write code in the first hour, you’ll fight the workflow.
Be careful with party mode. It’s useful, but it can burn context and credits fast.
Don’t treat architecture prompts casually. “Low cost” pushes you toward serverless patterns, which can be great, but it constrains framework choices and deployment shape.
Use the retrospective when you’re stuck. The instinct is to push forward. The smarter move is to stop and diagnose.
Stories are where the payoff happens. Once you have good stories, execution becomes much more mechanical.

Closing thoughts

BMAD ended up being one of those experiences where the first phase feels like friction, and then later you realize the friction was the whole point.

If you try BMAD, bring patience. Bring discipline. And assume you’ll spend more time thinking before you spend time coding.

If this resonates, what’s your experience been with structured AI-assisted workflows? I’m curious.

Word count: ~1,890

DEV Community: Dan Gurgui

A Month with DeepSeek: What Happened When I Replaced Claude Opus for Real Work

The Month I Swapped Claude for DeepSeek

Why I Tried DeepSeek in the First Place

The Real Test: Building dronicons.com

Where DeepSeek Felt Stronger: Long-Running Work

Sub-Agents Went Deeper Than Expected

The Cost Reality: Heavy Usage Under $50

The Tradeoffs: Still Not One-Shot Magic

Why I'm Not Rushing Back to Opus

Further Reading

- 3 market trends that could shape the rest of 2026

I Deployed Gemma 4 32B on a Rented H100 for $1.50/Hour. The Hard Part Wasn't What I Expected.

I Deployed Gemma 4 32B on a Rented H100 for $1.50/Hour. The Hard Part Wasn't What I Expected.

The surprising part: H100 access felt almost trivial

Why I tried vast.ai (and what I needed)

Deployment walkthrough: Gemma 4 32B in about one hour

Cost and speed reality check: what $1.50/hour buys

The real challenge: using the remote LLM from my local machine

Failure mode postmortem: context overflow killed the machine

Lessons learned: guardrails you'll want from the start

What I'm looking for next

The infrastructure problem is solved. The developer experience problem is not.

How I Passed the AWS Generative AI Developer Professional Certification (and Earned the Early Adopter Badge)

TL;DR

Why I Pivoted from Solutions Architect Professional to GenAI (and What Surprised Me)

What I Thought the Exam Would Be vs. What It Really Targets

My Preparation Path: What I Used and Why

Starting with Frank Kane's Course (The Skim Phase)

The AI Practitioner Detour (Strategic, Not a Distraction)

The Deep Dive (Where Real Learning Happened)

Practice Tests Reality Check: Too Easy, Off-Target, or ML-Heavy

The SageMaker Dilemma: How Much ML You Actually Need

What the Exam Actually Felt Like: Question Style, Time Pressure, and Mental Endurance

Key Takeaways: How to Prepare Efficiently (Without Overstudying)

Who Should Take This Next (and What's Coming)

Further Reading

LangSearch Inside Claude: The Fastest “Search Tool” Setup I’ve Used Lately

Hook: When Claude’s web search feels nerfed, add a turbocharger

What LangSearch-in-Claude actually is (in plain terms)

Why it feels so fast: the practical differences you notice

1) Latency: time-to-first-usable result drops

2) Throughput: you can ask for breadth without punishment

3) Relevance: fewer “why is this result here?” moments

4) Tool calling reliability is good enough to trust in the loop

Where it shines: 4 research workflows that benefit immediately

1) Competitive scans without the pain

2) Troubleshooting with real-world context

3) Sourcing: pulling multiple perspectives fast

4) Summarizing multiple pages (without pretending)

How to evaluate it yourself (a simple benchmark you can run)

A lightweight benchmark

Caveats and gotchas before you wire it into everything

Closer + CTA: Add LangSearch, then compare against ChatGPT/Gemini

My Experience Using the BMAD Framework on a Personal Project (Patience Required)

Getting Started: “I’ll just use BMAD to move faster”

The First Reality Check: it takes a lot of time before you write anything

The Frameworks Are Actually Good (especially for business thinking)

“Party Mode” and how I burned through context and credits

12–16 hours later: the first line of code… and then I hit an architecture wall

The 6-hour debugging spiral (and why it was so confusing)

The turning point: I didn’t figure it out, the retrospective did

After the fix: everything went smoothly (and the “stories” became the superpower)

What BMAD gets right: patience in exchange for momentum

The underrated feature: you can resume anytime because everything is in documents

What I’d tell another engineer before they try it

Closing thoughts

AWS in the AI era: Bedrock, SageMaker, and the enterprise-first tradeoff

1. The enterprise AI bet: what AWS is actually optimizing for

2. AWS’s AI stack, mapped to real enterprise jobs-to-be-done

Bedrock: “Give me foundation models, but keep it enterprise-safe”

SageMaker: “We’re building, not just consuming”

Comprehend: “We need NLP features, not a whole LLM app”

Textract: “Turn PDFs and scans into data we can use”

Transcribe: “Convert audio to text at scale”

3. Differentiators that matter in regulated environments

Data boundaries (and why enterprises obsess over them)

Private networking: VPC, PrivateLink, and “keep it off the public internet”

IAM, auditability, and “who did what, when”

Residency and multi-region controls