DEV Community: L. Cordero

AMD Advancing AI 2026: Software, Hardware & Framework, Unified

L. Cordero — Thu, 23 Jul 2026 22:46:16 +0000

Where Chris Lattner calls AI "mid" and George Hotz wants to knock a trillion dollars of value off of NVIDIA with tinygrad.

Preamble

IMHO: AMD's Advancing AI 2026 summit in San Francisco was free. And right now, while the firehose of free AI events and learning is still on, I'm taking every opportunity I can get my hands on. Eventually the free opportunities will slowly disappear, and this age of companies wanting us to adopt their tech and ecosystem will end. Cynical, pessimistic maybe. But think about the money spent on a two-day event at the Moscone Center. Free if you sign up. Free learning, access to vendors, tech talks, keynote speakers, food and lunch provided. That's an investment, and eventually the accountants, or the shareholders, are going to need to see a return. But I digress. This is my rundown of day one.

AMD changed my mental model

To say I'm a fish out of water at any event is an understatement. AWS, MLH, Google, and now AMD, I've tackled each one as a way to explore this brave new world, but sometimes I just don't have the context. And that's okay.

AMD was more technical. More focused on hardware. There was more than one robot. And it changed my mental model of technical ecosystems and the symbiosis within them, and I'm so here for it.

A panel with no theme, and then Chris Lattner called AI "mid"

Day one started with a three-person panel: Chris Lattner, Ramin Hasani, and Hassan Akbari, hosted by, checks notes, a name I didn't catch. There was no real theme. It was a roundtable discussion, and here's what came out of it.

Chris Lattner created LLVM, the compiler infrastructure a huge amount of modern software quietly runs on. So when he talks, I try to keep up. His point wasn't only about software. It was about how the hardware we run works with the applications we write, and why you can't reason about one without the other.

He said he'd been told his approach wouldn't work, because CUDA has the moat. His answer, the way I understood it: CUDA is kind of like GCC. GCC is the decades-old open-source compiler a lot of software gets built with, so, big and foundational and hard to dislodge. A monolith. Powerful, legacy, wrapped in an incredible moat, and also not that good. His point: you don't beat the moat head-on, you go around it with architecture. Let the hardware express what it can do instead of locking everyone into one vendor's language. That's his whole Mojo play at Modular, for what it's worth, a portable alternative to CUDA.

Then he dropped the line. AI is mid.

Appropriate laughter followed. But I don't think he meant it as a burn. The way I read it, he meant AI is a distribution follower. The AI most of us touch every day, the LLMs, the predictive models, those are the product. They ride on top of the work happening behind the scenes: the training, the hardware, the compute. He lives in that granular layer. Most of us live in the public one. So when he says mid, I think he's talking about the surface, not the machine underneath it.

Maybe I'm getting his intent wrong. I'm allowed. But that's the read I walked away with, and it stuck.

Ramin Hasani talked over my head, and I wrote it down anyway

Ramin Hasani talked in abstractions I didn't fully follow. And I mean abstract for me, not abstract in general. The concepts were above where I am right now. But I wrote them down anyway, because future me is going to want them.

Here's what I caught, in the order my pen caught it. Think about the problem at the algorithmic level before you reach for kernel optimizations. AI designing AI, and not just attention, look at the other transformers. Liquid foundation models, which are his company Liquid AI's whole thing, so that one I could at least place. Matching the best models to the right hardware: CPU, NPU, GPU.

I'm not going to pretend I can explain all of that yet. I can't. But it's in the notebook for when I get to the "kernel" level of my studies. Lol.

Hassan Akbari tied it together

Hassan Akbari was the one who tied it together. He talked about a unified ecosystem between frameworks, hardware, and software. He mentioned kernel again, and no, I still didn't catch the context, so onto the pile it goes with Ramin's. His point was that the interconnectivity of all three has to be leveraged for optimization. You don't tune one piece in isolation. You use how they connect.

Then he asked the question that stuck with me. Large models can now be distilled into smaller ones that are still effective. So are we wasting compute? Scaling models, the way Hassan framed it, should be pragmatic. That one made sense to me.

His other points came from the customer side, where reliability, cost per token, and accuracy are what matter. Optimization is the whole goal, and evaluation pipelines are how you measure it: benchmarks, deployments, the numbers. At least that's what I wrote down. Correct me if I mangled it.

George Hotz wants to knock a trillion dollars off NVIDIA

Picture the stereotypical genius from Silicon Valley, the show. That's George Hotz. It was awesome, it was inspiring, and it was very on the nose. Note to self for the day: I am sharing a room with very, very, very smart people.

Even his intro was wild. Jailbroke the iPhone. Reverse-engineered the PlayStation 3. Made a self-driving car, comma.ai, that got a cease-and-desist letter from the DMV. Got hired to hack a company's own systems to keep them safer. I could not tell if I was watching an inspirational documentary or a cautionary afterschool special. But I already said he was awesome, and I stand by it.

His talk was deep. Graphs, lines of code, evaluations. I have pictures, if anyone wants them. At one point he said the Radeon RX 7900 XTX, three years old now, is still a good value at $999. I wrote down "what is a 7900 XTX?" and filed it for later. Back to it.

George didn't hold back, and the candor was refreshing. tinygrad, which I looked up, is an open-source neural network framework and deep learning library, and on stage he put it at 25,000 lines of pure Python, about 9,000 of that the core, meaning the essential engine with everything else built around it.

His stated goal, and I quote, is to "commoditize the petaflop." Translation, delivered deadpan: if tinygrad succeeds, it knocks a trillion dollars of value off NVIDIA. Insert audience laughter here. Same moat Chris Lattner was talking about. George just comes at it from the scrappy open-source end.

His pitch is GPUs for the middle class. He talked about growing up in New Jersey with a parent who was a teacher. He doesn't care about data centers. What he cares about is the highest development velocity, and not the out-of-the-box-fast kind. His argument, the way I understood it: an out-of-the-box TensorFlow might be faster on day one, but tinygrad's velocity curve is built to climb higher and faster over time. And here's the part that got me. tinygrad has no dependencies. None. No numpy was used in the making of tinygrad. Which I learned is the whole flex. No dependencies means nothing underneath you can break, bloat, or drift out of version. Fewer moving parts, less that can rot. Very much my kind of principle. He even mentioned running tinygrad as a backend, which I am absolutely going to go try.

So the guy who reverse-engineered a PlayStation had my full attention. So glad I made it.

Do you see why I love these events?

Here's what kept pinballing around my head after all of that.

I'm an AI-assisted builder. I live on the top-level software end. I write code, I deploy applications. So why haven't I been more thoughtful about what runs underneath all of it? My dev.to articles focus on the end product, the thing I shipped. Not the framework, not the hardware, not the compute it all rides on.

Maybe that's the next stage of my development. Or maybe not. Here's my honest worry: if I try to hold every piece at once, framework and hardware and software and infra, I might get so overwhelmed I don't build anything at all.

And then the comforting thought. Maybe the three of them were mostly talking about training models, and I can go back to coding with vibes. The event leaned hard into CPU, GPU, NPU, compute, and infrastructure, not so much the SaaS layer I live in. So maybe it isn't my fight yet. Something to ponder.

But that pinball is the whole reason I go. For a few hours, something pulled me out of my product-focused viewpoint and made me look at the layer underneath. That's the value. Not the swag bag, necessarily!

The workshops, and the one that masqueraded

After the opening session I hit the workshops. One was Build Your OpenClaw Agent with Multi-Modal Models, running on AMD GPUs. Another was billed as vibecoding with local models. Good practice, all of it technology I hadn't touched. AMD's learning platforms were new to me. Working out of Jupyter notebooks was new too.

The vibecoding workshop, well, the title disguised it. It was a Lemonade and Qwen workshop, and it rocked. First time I'd heard of Lemonade. It's a community project, sponsored by AMD with optimizations from their engineers, that runs large language models locally on your own GPU or NPU. We ran Qwen3.6-35B-A3B on it. They walked the framework from the ground up: silicon, engines, routing, models, then the app. Another pyramid. And there I was again, looking down from my product endpoint at the top and seeing how much sits underneath. Same shift as the panel, twice in one day.

This one got me excited though. Running Lemonade locally means everything stays with me. A new idea, a new system, a new output to go explore. I'm sure the learning curve will be steep. We were working inside pre-set-up AMD environments, and standing something up from scratch will bring its own challenges. But as always, excited and ready to try something new.

What's next

Full disclosure: I did day one as a day trip, LAX to SFO and back. Day two happens without me, and I'm bummed to miss it. I'll have to watch Dr. Su's keynote on YouTube.

What I'm carrying home instead is the Lemonade itch. A from-scratch local setup on my own machine, which will have its own learning curve. I'll report back on how that goes.

Until the next event! I hope to see my Dev.to fam there, too!

AI Assisted. Human Approved. Powered by NLP.

OpenAI Build Week: Foundation First, Software Second

L. Cordero — Tue, 21 Jul 2026 04:34:11 +0000

IMHO, there's a rush to sell AI to government right now (note: I work in government). I kept asking a different question: has anyone checked whether government is ready to buy?

I built Verity Lex over about two days of OpenAI Build Week to answer it, and one rule ended up shaping the whole build: the AI reads the court's record, but it never assigns the score. That sounds like a small design choice. It is the whole build. Teaching a machine to respect it was the hard part, and it cost me a stressful evening I could have skipped. Point Verity Lex at a California superior court (Santa Barbara is live, more courts to come) and it reads the public record, checks it against published legal standards, and returns an AI-readiness score anyone can recompute.

This is the story of what I built, how I directed Codex and GPT-5.6 to build it, and the one lesson I'd like to pass on to any of my Dev.to fam shipping an AI.

The thesis: you're selling to government backwards

Software gets sold to public institutions the wrong way around. A vendor shows up with a solution and goes looking for a problem, without ever understanding the institution it's selling into. What is this court held to? Is it meeting that standard? Are its people ready for the thing you want to sell them? Skip those questions and you get shelfware and eroded trust, which anyone who has worked in the public sector has watched happen.

AI is about to run that same play at scale. So Verity Lex starts at the bottom of a pyramid, not the top:

Standards. What is this institution held to, and does its own record show it's meeting it? This is Verity Lex.
Structure. Is the organization itself ready? Governance, roles, operational policy.
People. Do staff know how to use AI safely? Where are the culture blockers?
Solutions. Only now do you get to sell something, prescribed against a diagnosed picture instead of a pitch deck's guess.

You earn the right to sell to government by understanding it first. Verity Lex is tier one, shipped.

The architecture: neurosymbolic on purpose

Here's the trap most AI demos fall into: they put the model in charge of the answer. Ask an LLM to score a court's compliance and you get a number shaped by whatever the sampling did that second, with citations it might have dreamed. Run it twice, get two answers. No government buyer can procure "the AI felt good about our compliance."

So I split the job into three, along one boundary: perception versus judgment.

Finding the documents: GPT-5.6 directs, Tavily retrieves. A model-directed ReAct loop decides where to look. GPT-5.6 reasons about the next move; Tavily runs the retrieval, searching the court's public site and pulling documents a plain fetch can't reach. Tavily was new to me going in, and giving the model a real search-and-extract layer instead of a hand-rolled fetch is a chunk of why the agent can find a document that moved.
Reading the documents: GPT-5.6 extracts. It reads what comes back and pulls evidence into a strict JSON schema. Finding and reading are both perception, and perception is what a model is good at.
Scoring the readiness: a deterministic rule engine. Pure TypeScript, no model imported anywhere in it. It takes the extracted signals and applies a published registry of weighted, legally-grounded artifacts. Same signals in, same score out, forever. Scoring is judgment, and judgment about a number has to be reproducible.

Two of those jobs are perception, one is judgment, and the design lives on keeping them apart.
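To make the judgment half concrete, here is a minimal sketch of a deterministic scorer. It is written in Python for brevity, while the real engine is TypeScript, and the artifact IDs, weights, and the floor-to-integer rule are all made up for illustration. The point it shows is only the shape: a fixed registry, a pure function, and no model anywhere in the path.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Artifact:
    artifact_id: str  # hypothetical ID, not from the real registry
    weight: int       # hypothetical weight


# Hypothetical registry; the real one is published and legally grounded.
REGISTRY: Tuple[Artifact, ...] = (
    Artifact("artifact-a", 40),
    Artifact("artifact-b", 35),
    Artifact("artifact-c", 25),
)


def score(located_ids: FrozenSet[str]) -> int:
    """Weighted share of registry artifacts that were located, 0 to 100.

    Integer arithmetic only, so the same signals always give the same score.
    """
    total = sum(a.weight for a in REGISTRY)
    earned = sum(a.weight for a in REGISTRY if a.artifact_id in located_ids)
    return 100 * earned // total


signals = frozenset({"artifact-a", "artifact-c"})
print(score(signals))  # 65
```

Because the function only sees extracted signals, the model can change what gets found but never how a finding is weighed.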

Would you trust an AI to grade its own homework? Neither would a court. The model has no path to the score, by construction. Every finding cites a real document and a quoted line. Anything it can't find is marked not located, never absent, because a public record going quiet isn't proof of anything. And you can download an audit bundle and recompute the score yourself. The constraint is the product.
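A rough sketch of what "not located, never absent" and a recomputable audit bundle could look like, in Python for illustration. The bundle layout, field names, and weights are assumptions of mine, not the shipped format. The status type simply has no "absent" value, and a located finding cannot be built without a citation.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(str, Enum):
    LOCATED = "located"
    NOT_LOCATED = "not_located"  # deliberately no "absent" member


@dataclass(frozen=True)
class Finding:
    artifact_id: str
    status: Status
    source_url: Optional[str] = None
    quote: Optional[str] = None

    def __post_init__(self) -> None:
        # A located finding must cite a real document and a quoted line.
        if self.status is Status.LOCATED and not (self.source_url and self.quote):
            raise ValueError("located finding needs source_url and quote")


def recompute(bundle_json: str) -> int:
    """Recompute the score from a bundle alone (hypothetical bundle layout)."""
    bundle = json.loads(bundle_json)
    weights = bundle["weights"]
    earned = sum(
        weights[f["artifact_id"]]
        for f in bundle["findings"]
        if f["status"] == Status.LOCATED.value
    )
    return 100 * earned // sum(weights.values())


# Hypothetical bundle with placeholder IDs and weights.
BUNDLE = json.dumps({
    "weights": {"artifact-a": 40, "artifact-b": 35, "artifact-c": 25},
    "findings": [
        {"artifact_id": "artifact-a", "status": "located"},
        {"artifact_id": "artifact-b", "status": "not_located"},
        {"artifact_id": "artifact-c", "status": "located"},
    ],
})
print(recompute(BUNDLE))  # 65
```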

How I built it: directing Codex

I built this with Codex in VS Code using what I think of as a creative-director workflow. I own the judgment. Codex owns the implementation. The boundary between us is enforced, not trusted.

In practice that meant gated, block-scoped prompts. Every one started with "propose a file plan first, do not refactor unrelated code, stop after the PR." Codex built each block, the rule engine, the agent tools, the loop, the API, the hardening, CI, the add-ons, as its own pull request. I reviewed and tested every one before it merged, and CI enforced that no red PR reached main. The commit history is the collaboration log: over twenty scoped PRs, each one a discrete piece of the argument.

Codex was good. It caught a next/server import trap and fixed it by using web-standard Response.json. It diagnosed a lockfile mismatch that only showed up on the Linux CI runner. And when it was wrong, I overruled it, which is the entire point of keeping a human at the gate. At one point it proposed swapping deterministic installs for a looser command to "fix" a CI failure. That would have traded away reproducibility for a problem we hadn't even diagnosed. We diagnosed it instead. It was a stale cache.

GPT-5.6 runs live inside the shipped product, directing discovery and extracting evidence, with Tavily as the retrieval layer underneath it, and it is architecturally barred from touching the score.

Lessons learned: my tests were green the whole time. Then I deployed.

Here's the part I'd do differently. If you're building something similar, maybe it saves you the evening it cost me.

At first, my CI was green. Rule engine tested, agent loop tested, bounds tested, API contract tested. I read that green as "it works." Then I set my real API keys in production, clicked the button, and watched it fail three different ways in a row.

First a 400 from the model: OpenAI's JSON mode requires the word "json" to appear in the input itself, and my instruction saying so in a separate field didn't count. Then the agent's planner returned an action with no query attached, so every search got rejected and the scan found nothing. Then the extractor produced JSON in the wrong shape and my validator threw on every document.

Three bugs, one root cause: every one of them lived at the boundary between my code and the real model, and my stub had hidden all of them.

Here's the precise miss. My stub model always returned correctly-shaped, pre-baked responses. So my tests proved my code could consume a good model answer. They never proved my prompts could produce one. Those are different claims. My planner test literally handed the stub an action that already had the required fields, so it passed, while the actual prompt never told a real model those fields existed. The test fixture quietly knew something the prompt didn't teach.

I fixed the three shape bugs. And then a worse one showed up, because I'd finally learned to actually measure: I ran the same scan five times and got different scores. Fifty-nine one run, sixty-seven the next. Same court, same day.
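Measuring that is only a few lines. Here is a sketch in Python, where the scan function is a stand-in I invented that alternates between the two scores I actually saw, and the allowed spread of 3 points is an arbitrary threshold.

```python
from itertools import cycle
from typing import Callable, List


def collect_scores(run_scan: Callable[[], int], runs: int = 5) -> List[int]:
    """Run the same scan several times and keep every score."""
    return [run_scan() for _ in range(runs)]


def spread(scores: List[int]) -> int:
    """Distance between the best and worst run."""
    return max(scores) - min(scores)


def is_stable(scores: List[int], max_spread: int = 3) -> bool:
    """True when run-to-run variation stays inside the allowed spread."""
    return spread(scores) <= max_spread


# Stand-in for a real scan: alternates between the two observed scores.
_fake_results = cycle([59, 67])
scores = collect_scores(lambda: next(_fake_results))

print(scores)             # [59, 67, 59, 67, 59]
print(spread(scores))     # 8
print(is_stable(scores))  # False
```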

My first guess was budget, that the agent ran out of steps before covering everything. Wrong. I handed it a generous budget and it still wandered, using a fraction of it and stopping in different places each run. The problem wasn't resources. I was asking a non-deterministic model to do a deterministic job, march through nine known legal standards and check each one, and a model does not march. It strolls. More budget just bought a longer runway. Oops!

The reason Verity Lex is "trustworthy" is that I'd taken the score away from the model and given it to deterministic code. So my first instinct was: do the same to the search. Write the queries in code, pick the links in code, let the model only read. Cage it into repeating itself. I even confirmed the model wouldn't help me the easy way: GPT-5.6 doesn't expose a temperature setting, so I couldn't just turn its randomness down.

I got most of the way to convincing myself to build the cage before I saw the trap. A court renames a policy PDF, or moves it to a new page, and a hardcoded search walks right past it. The free, uncaged agent, reading the results and reasoning about them, finds it at the new location. Caging the agent would make it repeatable and blind, blind to exactly the thing this product exists to catch: change. The improvisation I was trying to delete was the value.

So the lesson is more careful than "make everything deterministic." Move a guarantee into code when a second answer is simply wrong, scoring, where two numbers for the same evidence is a bug by definition. But where the adaptivity itself is the value, like finding a document that moves, don't cage it. Make it reliable a different way.

And the different way is the thing I'd cut from v1 for simplicity: memory. The scan wanders because it's stateless. It re-improvises the entire hunt every single time, starting from nothing. Ugh, my design bad. A version that remembered what it found last time wouldn't re-hunt a known policy, it would re-verify it, and when the policy moved it would notice the move instead of silently missing it. The variance I'd been treating as a defect turns into signal: convergence when the record holds still, a change alert when it doesn't. The fix for an unreliable observer was never to cage it. It was to give it a memory. That's the top of the roadmap now, and it's the sharpest thing this whole build taught me: a stateless observer cannot repeat itself, and memory, not a cage, is what makes an adaptive agent reliable. EUREKA

The practical habits that would have caught all of this earlier:

Smoke-test the model interface the day it's written, not the day it deploys. One real-key call doing one discover and one extract would have surfaced all three prompt bugs earlier, calmly, instead of at 11pm.
Write prompt-contract tests that never call the API. Assert the extractor prompt actually contains the schema field names and the valid IDs. Cheap, static, catches underspecification.
Record the stub's responses from one real call and freeze them, so the stub can't drift into being more forgiving than reality.
Run anything non-deterministic several times and assert the spread is bounded. Variance is invisible to a single run, and a stub can never show it to you.
Re-derive tuned parameters when their inputs change. My search budget was set when the registry was smaller and never revisited when it grew.
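The prompt-contract habit is the cheapest of these to show. Below is a sketch in Python with an invented prompt builder, invented field names, and placeholder IDs. It never calls an API. It only checks that the prompt text actually teaches the model everything the validator will later demand, including the word "json" that JSON mode insists on.

```python
from typing import List, Sequence

# Hypothetical schema fields and artifact IDs the validator expects back.
REQUIRED_FIELDS = ("artifact_id", "status", "source_url", "quote")
VALID_IDS = ("artifact-a", "artifact-b", "artifact-c")


def build_extractor_prompt() -> str:
    """Invented example prompt; the real one will differ."""
    return (
        "Read the document and reply with a json object.\n"
        f"Fields: {', '.join(REQUIRED_FIELDS)}.\n"
        f"artifact_id must be one of: {', '.join(VALID_IDS)}."
    )


def missing_from_prompt(prompt: str, required_terms: Sequence[str]) -> List[str]:
    """Return every term the prompt fails to mention (case-insensitive)."""
    lowered = prompt.lower()
    return [term for term in required_terms if term.lower() not in lowered]


CONTRACT = ("json",) + REQUIRED_FIELDS + VALID_IDS

print(missing_from_prompt(build_extractor_prompt(), CONTRACT))  # []
print(missing_from_prompt("Extract the evidence.", ("json", "quote")))
# ['json', 'quote']
```

A test like this fails the moment the schema grows a field and the prompt doesn't, which is exactly the gap my stub was hiding.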

An ode to determinism: the deterministic core had zero production bugs. The guardrails held even while the model layer was failing. Every bug degraded gracefully, structured errors and honest empty states, never a crash, because the fail-safe design was real. And once I added one line of error logging, each mystery became a one-paste diagnosis. The architecture was resilient and the bugs were findable, which is a far better place to be than a fragile system hiding quiet ones.

The category I didn't see until I'd built it

Here's the last thing the build taught me, and I was shook. For most of it I thought I was making a SaaS: point it at a court, get a score. Judged as that, the variance really was fatal. Run it twice, get two numbers, look broken.

Then I actually looked at what I'd built. A human gate on verification. A draft-inquiry workflow. Findings meant to be reviewed and confirmed, not consumed. Those aren't consumer-SaaS features. They're the features of an analyst's tool. I hadn't built a one-click verdict machine. I'd built readiness intelligence, the kind of thing a vendor runs (oh, the irony), reviews, and tracks over time, and I just missed the point entirely at first.

That reframe dissolved the thing I'd been panicking about. A business-intelligence tool is supposed to be run repeatedly. You pull the data, you review it, you watch the trend. Variance between pulls isn't a defect, it's the raw material of a baseline. And the credibility was never in the summary number anyway. It's in the citations: every finding links to a real document and a quoted line you can open and check. The number can move. The evidence doesn't lie.

You rarely see the category up front. You build into it, and one day you look up and realize what you actually made. Naming it right, mid-build, under a deadline, turned my biggest liability into the most honest thing about the product.

What's next

Verity Lex today is one court, scored against one registry version, on a public-record foundation, a live observer that reads the record fresh each time. It's stateless: nothing is stored yet, and it re-hunts from scratch every run. Everything past this point is roadmap, not shipped.

Two things sit at the top of it, and both come straight out of the variance. First, ensemble extraction. GPT-5.6 exposes no temperature control, so a single reading of a document varies run to run. Running each extraction several times and accepting a signal only on a majority vote tightens the score without caging the agent. The runs happen concurrently, so the extra cost is time you won't feel.
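Since this is still roadmap, here is only a sketch of the idea in Python. The extractor is a placeholder callable that returns True when it thinks a signal is present, and the thread pool and run count are my assumptions about how the concurrency might look.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def majority_vote(readings: List[bool]) -> Optional[bool]:
    """Return the value more than half the readings agree on, else None."""
    if not readings:
        return None
    value, count = Counter(readings).most_common(1)[0]
    return value if count > len(readings) / 2 else None


def ensemble_extract(
    extract_once: Callable[[str], bool], document: str, runs: int = 5
) -> Optional[bool]:
    """Read the same document `runs` times concurrently, then vote."""
    with ThreadPoolExecutor(max_workers=runs) as pool:
        readings = list(pool.map(extract_once, [document] * runs))
    return majority_vote(readings)


print(majority_vote([True, True, False]))  # True
print(majority_vote([True, False]))        # None
```

A tie returns None rather than a guess, which fits the "not located, never absent" rule: no majority means no claim.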

Second, and the one the lesson pointed to: memory. v2 would write every scan to an append-only log first, a side effect that can't touch the scan itself, then turn on the read, and that's where it stops being a one-shot scanner and becomes something worth paying for. Scan a court a few times and the baseline strengthens and converges. Store it and you can tell when the record moves, a policy posted, a plan pulled down. The agent stays free to hunt, because hunting a moving target is the point, and memory is what turns its wandering into a baseline you can trust and a change feed you can watch. Above that sit the tiers left as roadmap: structure, people, solutions, deliberately not half-built.
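A sketch of that memory, again in Python and again only an assumption about a v2 that doesn't exist yet. The JSON Lines file, the scan shape (artifact ID mapped to the URL where it was found), and the function names are all hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional, Tuple

Located = Dict[str, str]  # hypothetical shape: artifact_id -> URL


def append_scan(log_path: Path, located: Located) -> None:
    """Append one scan as a single JSON line; earlier lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(located, sort_keys=True) + "\n")


def read_scans(log_path: Path) -> List[Located]:
    """Read every stored scan, oldest first."""
    if not log_path.exists():
        return []
    with log_path.open("r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def changes(
    previous: Located, current: Located
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each artifact whose location differs to (old_url, new_url).

    None on either side means it was not located in that scan.
    """
    ids = sorted(set(previous) | set(current))
    return {
        i: (previous.get(i), current.get(i))
        for i in ids
        if previous.get(i) != current.get(i)
    }
```

Comparing the last two entries from `read_scans` with `changes` gives the change feed: an empty result is convergence, anything else is the record moving.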

But the wedge is real, it's live, and it holds the line it was built to hold: the AI reads the record, and it never assigns the score.

Built for OpenAI Build Week 2026 with Codex, GPT-5.6, and Tavily.

AI assisted. Human approved. Powered by NLP.

verity-lex.vercel.app

AWS Weekend Agent Challenge: Daybreak

L. Cordero — Sat, 18 Jul 2026 04:12:39 +0000

Vision and what the agent does

I build fast and I build a lot. As of this weekend that is 49 public repos. What I am bad at is remembering the state of any of them. Which ones are live, which shipped with no README, which went quiet months ago and quietly rotted. The answer used to be that I had no idea until I went digging.

Daybreak is the fix. I wanted the review done before I even sat down: not a dashboard I have to remember to check, but a brief already in my inbox, waiting at first light. That is the name. Every morning at 6 AM Pacific it wakes up on its own, with no button to press. It takes inventory of every repo on my GitHub account, reasons over that inventory with Amazon Nova, and leaves a short brief in my inbox: how many repos are live, how many have gone quiet, how many are missing a README or a license, and three specific "do something here" nudges. By the time I am awake, the review is done and waiting. The best tool is the one you never have to open.

It runs read-only. It reports, I decide. It never touches my repos.

How I built it

PRD first, always. One architecture doc with MUST, STUB, and NEVER labels before a line of code, then a block-by-block build with a pass or fail checkpoint after each block. Uptime checks, showcase scraping, and auto-generated README pull requests are all stubbed with implementation notes, not half-built. Shipping small and lean beat shipping big and broken.

I ran it as a multi-model workflow. Claude for the architecture and the PRD, Kiro for the spec-driven build. I wrote phase-aware guardrails so Kiro could do the file authoring and tests while I was at work, then flip to a go-live mode that let it deploy with my credentials once I got home. I direct and approve, the agents generate.

The challenge worth mentioning was getting Nova to answer at all. My first invoke failed with "on-demand throughput isn't supported." Nova will not run on-demand from the bare model id. You have to call the cross-region inference profile, us.amazon.nova-lite-v1:0, and your IAM policy has to allow both the profile and the underlying model in every region the profile routes to. List only the profile and you get AccessDenied even with model access on. Once I understood that, it answered first try.
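For anyone hitting the same wall, here is roughly what the IAM side looks like, built as a Python dict so it is easy to inspect. The account ID is a placeholder, and the list of routed regions is my assumption for the US profile, so check the regions listed for the profile in your own console before copying it. The profile ID is also what gets passed as the model ID on the Converse call.

```python
import json

PROFILE_ID = "us.amazon.nova-lite-v1:0"
MODEL_ID = "amazon.nova-lite-v1:0"

# Assumption: regions the US profile can route to. Verify for your account.
ROUTED_REGIONS = ("us-east-1", "us-east-2", "us-west-2")


def nova_invoke_policy(account_id: str, home_region: str) -> dict:
    """Allow invoking the profile AND the underlying model in each routed region."""
    resources = [
        f"arn:aws:bedrock:{home_region}:{account_id}:inference-profile/{PROFILE_ID}"
    ]
    resources += [
        f"arn:aws:bedrock:{region}::foundation-model/{MODEL_ID}"
        for region in ROUTED_REGIONS
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["bedrock:InvokeModel"],
                "Resource": resources,
            }
        ],
    }


# "123456789012" is a placeholder account ID.
policy = nova_invoke_policy("123456789012", "us-east-1")
print(json.dumps(policy, indent=2))
```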

A second lesson landed mid-build. The old Bedrock model access page was retired in late 2025 and Amazon-provided models are auto-enabled now, so a step I had planned no longer existed. Checking whether a tool is current before depending on it saved me from chasing a dead console page.

AWS services used and architecture overview

The whole thing is one Lambda and a schedule.

Amazon EventBridge Scheduler fires the function daily at 6 AM Pacific with a timezone-aware cron, so daylight saving handles itself. This is the trigger, not a button.
AWS Lambda runs the agent in Python with zero third-party dependencies, so the deploy package is a single zipped file.
Amazon Bedrock with Nova Lite via the Converse API does the reasoning.
Amazon SES emails the brief.
Amazon CloudWatch logs every run and alarms on any error.

Infrastructure is a single AWS SAM template, so the whole stack deploys and rolls back as one unit.

Flow: EventBridge Scheduler triggers Lambda, which reads the GitHub API, calls Bedrock Nova, and sends through SES to my inbox.
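The timezone-aware part of the trigger comes down to two settings. My stack defines them in the SAM template, so the Python below is just the same idea expressed as arguments you could hand to boto3's EventBridge Scheduler `create_schedule` call. The schedule name and the ARNs are placeholders.

```python
def daily_schedule_kwargs(function_arn: str, role_arn: str) -> dict:
    """Arguments for a 6 AM Pacific daily schedule (name and ARNs are placeholders)."""
    return {
        "Name": "daybreak-daily",  # hypothetical name
        # Fields: minute hour day-of-month month day-of-week year
        "ScheduleExpression": "cron(0 6 * * ? *)",
        # The scheduler evaluates the cron in this zone, so 6 AM stays 6 AM
        # local time across daylight saving changes.
        "ScheduleExpressionTimezone": "America/Los_Angeles",
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {"Arn": function_arn, "RoleArn": role_arn},
    }


kwargs = daily_schedule_kwargs(
    "arn:aws:lambda:us-east-1:123456789012:function:daybreak",  # placeholder
    "arn:aws:iam::123456789012:role/daybreak-scheduler",        # placeholder
)
print(kwargs["ScheduleExpression"], kwargs["ScheduleExpressionTimezone"])
```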

One design choice I am proud of: structure is deterministic, flavor is generated. The counts of how many repos are live, quiet, or unlicensed are computed in Python so they can never be hallucinated. Nova only writes the narrative and the nudges around those fixed numbers.
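A small sketch of that split. The `pushed_at` and `license` fields are what the GitHub REST API returns for a repository. The 90-day "quiet" threshold and the prompt wording are made up here, not what Daybreak actually uses.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Dict, List

QUIET_AFTER = timedelta(days=90)  # hypothetical threshold


def inventory_counts(repos: List[dict], now: datetime) -> Dict[str, int]:
    """Count quiet and unlicensed repos in plain code, no model involved."""
    quiet = 0
    unlicensed = 0
    for repo in repos:
        # GitHub timestamps look like "2026-07-01T00:00:00Z".
        pushed = datetime.fromisoformat(repo["pushed_at"].replace("Z", "+00:00"))
        if now - pushed > QUIET_AFTER:
            quiet += 1
        if repo.get("license") is None:
            unlicensed += 1
    return {"total": len(repos), "quiet": quiet, "unlicensed": unlicensed}


def build_prompt(counts: Dict[str, int]) -> str:
    """Hand the model fixed numbers; it only writes the words around them."""
    return (
        "Write a short morning brief with three nudges. "
        "Use these numbers exactly and do not change them: "
        + json.dumps(counts, sort_keys=True)
    )


sample = [
    {"pushed_at": "2026-07-01T00:00:00Z", "license": {"key": "mit"}},
    {"pushed_at": "2025-01-01T00:00:00Z", "license": None},
]
now = datetime(2026, 7, 18, tzinfo=timezone.utc)
print(inventory_counts(sample, now))
# {'total': 2, 'quiet': 1, 'unlicensed': 1}
```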

What I learned

Bedrock inference profiles, and why on-demand Nova needs the us. prefix and IAM that covers the fan-out regions. Least-privilege IAM for an agent: four actions, no wildcards, with Parameter Store wired for secrets (the token slot exists but v1 runs unauthenticated just fine). SES sandbox is fine when sender and recipient are the same verified address, which skips a production-access request. And the value of a hard line between deterministic logic and model reasoning, so the numbers are trustworthy and the writing is still warm.

The proof it ran without me: I set the EventBridge schedule to fire at 8:07 PM, stepped away, and it went off on its own. It pulled all 49 repos, Nova wrote the brief, and SES delivered it in about 4 seconds, with the CloudWatch timestamp and the email to match. That was a real scheduled trigger, not a button, which is the whole point of the challenge. With autonomy proven, I set the production schedule to 6 AM Pacific daily. The verbatim CloudWatch logs, timestamps, and request IDs are in PROOF.md.

Link to repo

Repo: github.com/earlgreyhot1701D/daybreak

Proof of autonomous run: PROOF.md

AI Assisted. Human Reviewed. Powered by NLP.

MLH x DigitalOcean Hackathon: Jury Duty, Explained

L. Cordero — Sat, 11 Jul 2026 20:55:15 +0000

Justicia Clew: what a hackathon build teaches you that a tutorial never will

This was my first SF hackathon. I was nervous going in, and not really about the code. I was nervous I'd get put on a team I didn't pick, or worse, that I'd have to pitch myself onto one. But I was able to build solo, phew! It turned out to be the right call for a two-day sprint where I needed to move fast and make my own calls without a group chat to sync with first.

Solo only gets you so far though. I need to build with someone eventually, just for the practice of it, doesn't have to be this project or even this kind of tool. So if anyone reading this wants to build something together sometime, my inbox is open.

Anyhow, here's what I built: Justicia Clew is a mobile-first web app that answers jury duty questions in your court's own words, not legal advice, just plain-language answers grounded in your actual county court's public website, with a real phone number when it doesn't know something instead of guessing. Built for Santa Barbara County as the live demo, on a FastAPI backend and DigitalOcean's Gradient AI agents, over about two days.

The part nobody tells you: shipping infra means un-shipping it too

Here's a moment from tonight that didn't make it into any tutorial I've read on "how to deploy your hackathon project."

I finally had the whole pipeline working. Knowledge base grounded in real Santa Barbara jury services content, agent refusing cleanly when it didn't know something instead of guessing, frontend wired to real API calls instead of mock data. Time to deploy. I open DigitalOcean App Platform, pick my repo, and the console shows me a number: $24 a month.

Immediate gut reaction: hell no.

Then the actual thinking kicked in. That $24 is a monthly rate, not a bill for hitting deploy. DigitalOcean bills by the hour, prorated. Run it for a few hours during judging and destroy it after, and you're looking at pennies, not $24. I had $200 in hackathon credits sitting there too, so the dollar cost of tonight was never really the issue.
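The back-of-envelope math, in Python. It assumes a 730-hour average month and simple hourly proration. DigitalOcean's exact billing granularity may differ, so treat the result as an estimate.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def prorated_cost(monthly_rate: float, hours_running: float) -> float:
    """Estimate the cost of running a monthly-priced resource for a few hours."""
    return monthly_rate * hours_running / HOURS_PER_MONTH


# $24/month app left running for a 6-hour judging window.
cost = prorated_cost(24, 6)
print(f"${cost:.2f}")  # $0.20
```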

The real issue was time I hadn't budgeted for. Because it turns out "deploy your AI app" isn't one task, it's two: deploy it, then remember to tear it back down. My $200 in hackathon credits expires July 13th, two days after the deadline. My knowledge bases run on a managed OpenSearch database, which is a second billable resource completely separate from the app itself, easy to forget exists once it's just quietly running in the background. So the actual finish line isn't "it's live." It's "it's live, and I've written myself a reminder to kill the App Platform app and the OpenSearch database before the credit window closes."

Nobody puts that in the excited "I shipped my hackathon project!" post. It's not a dramatic problem. It's just real infrastructure behaving like real infrastructure: it keeps costing you something (money, attention, a reminder on your calendar) until you actively tell it to stop. A mockup doesn't have that problem. A mockup doesn't have a lot of problems, which is exactly why it's not the same thing as shipping.

That's the honest cost of building with real tools instead of toy ones on a hackathon deadline. Not the $24. The fact that "done" has an expiration date you have to manage yourself.

When the AI can just... write to your repo

Small moment, worth flagging because it changed how I thought about the workflow mid-build.

Deploy failed at 10am the next morning. Build log showed a real, specific error: no Python version pinned, DigitalOcean's buildpack defaulted to the newest one available (3.14), and pydantic-core (a dependency of a dependency) needed to compile from source using a Rust tool called PyO3, which doesn't support Python 3.14 yet. Clear root cause, one-line fix: add a .python-version file pinning to 3.12.

Except Claude didn't just tell me to make that file. It wrote it, sent it to me, and placed it directly in my local repo through the device connection, then handed me back three git commands to commit and push. I hadn't fully clocked that this was possible until it happened. My mental model going in was "AI writes code in a scratch space, I copy it in." Turns out the actual model, once you connect your machine, is closer to "AI can touch your files directly, and tells you what it did."

That's not a complaint. It was the right fix, and I checked it was really there before pushing, same as I'd checked every Kiro claim all day. But it's worth naming as its own thing in a "creative director directs, AI builds" workflow: the line between "AI suggests, I execute" and "AI executes, I review" is thinner than it looks once your tools are actually wired together, not just chat-in-a-browser-tab. I stayed the one who ran git push, so the last gate was still mine. But the boundary moved, and I didn't fully expect it to.

The scary-sounding fix that wasn't

After deploying, I asked the live app a question I expected to work, "can my employer fire me for jury duty," and got a flat refusal. Then another one. And another. My first read was "our RAG is thin," like the retrieval itself was somehow bad.

It wasn't a retrieval problem. It was a coverage problem. When I actually went and looked at what was in the knowledge base, I realized my fallback file upload (the workaround from when DigitalOcean's crawler couldn't fetch the live court page) only ever captured one tab of a six-tab page. The Employer Information tab, the one with the actual California Government Code language protecting jurors from being fired, had never made it in. Neither had the postponement/deferral process. The agent wasn't broken, it was accurately refusing to answer questions it genuinely had no source content for.

The part I expected to be scary was fixing it: removing an existing data source from a knowledge base that's already attached to a live agent, and swapping in a corrected one. That sounds like exactly the kind of thing that could break a working system. It didn't. The knowledge base keeps its own ID no matter what data sources go in or out of it, the agent references that ID, not the individual files, and it reindexes and picks up new content automatically. No reattaching the agent, no redeploying the app, nothing. Remove the old source, add the new one, wait a minute for it to reindex, ask the same question again.

The lesson wasn't really about DigitalOcean's API design, though that part holds up well. It was that "thin RAG" and "missing content" look identical from the outside, a flat refusal either way, and only one of them means your retrieval setup needs work. Go look at what's actually in the knowledge base before you assume the model's the problem.
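One way to "go look" before blaming retrieval is a plain keyword coverage check over the source files. This is a minimal sketch with made-up file names, topics, and keywords. It is not part of the app, and keyword matching only proves absence, not quality.

```python
def missing_topics(documents: dict[str, str], required: dict[str, list[str]]) -> list[str]:
    """Return the topics for which no document contains any of their keywords."""
    corpus = " ".join(documents.values()).lower()
    return [
        topic
        for topic, keywords in required.items()
        if not any(keyword.lower() in corpus for keyword in keywords)
    ]


if __name__ == "__main__":
    # Hypothetical knowledge base content: only one tab of the page made it in.
    docs = {"jury-services.md": "Parking is available at the courthouse garage."}
    topics = {
        "parking": ["parking"],
        "employer protections": ["employer", "discharge"],
        "postponement": ["postpone", "deferral"],
    }
    print(missing_topics(docs, topics))
```

If a topic shows up in this list, a flat refusal from the agent is the correct behavior, not a retrieval bug.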

Bringing in a fourth tool for the last mile

By the time the app was deployed and mostly working, my usual three-tool split, Claude for architecture and docs, Kiro for the actual build, Gemini for visual and content, wasn't quite enough. I wanted an adversarial QA pass before submitting, something whose whole job was to try to break what Kiro had built, not build more of it.

So I brought in Claude Code for that specifically: a code review, endpoint and input edge cases, a check on whether the app has any real protection against someone hammering it, and a walkthrough acting like an actual juror using the thing.

First draft of that prompt asked it to actually generate the traffic itself, fire a real batch of requests at the live API to prove it holds up. Caught myself before sending it: the app calls a metered DigitalOcean agent per request, real hackathon credits, no rate limiting yet since that's an intentional stub for tonight. Actually hammering the live endpoint to test whether it can be hammered would have been a strange way to find out the answer is "not yet," at my own expense. Rewrote it as a report-only pass instead: read the code, reason about the exposure, name the gap plainly, don't go generate the problem to prove it exists.

Small moment, but it's the same lesson as the deploy pricing scare earlier in the night. Real infrastructure means real consequences for careless testing, not just careless shipping.

The near-miss that would have quietly broken everything

Somewhere in the middle of setting up knowledge bases for six counties, I almost combined all six into one.

The Gradient AI Platform lets a knowledge base hold multiple data sources, which is exactly the right feature for the wrong moment. I had six county URLs open in six tabs, adding them one after another, and the flow doesn't stop you from adding source five and six to the same knowledge base you just created for source one. Nothing in the console yells "wait, these are supposed to be separate." I caught it staring at the confirmation screen before clicking create, because the whole architecture depends on county isolation coming from separate knowledge bases, one per county, each attached to its own agent. Combine them and a question about Santa Barbara parking could pull retrieved chunks from Fresno's page, with no clean way to tell the agent "only look at this part."
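The isolation rule can be sketched as a lookup that refuses to fall back to another county. The environment variable names and the exception class here are hypothetical, chosen only to illustrate one agent per county.

```python
import os
from collections.abc import Mapping
from typing import Optional


class UnknownCounty(Exception):
    """Raised when no dedicated agent is configured for a county."""


def agent_config(county: str, env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Look up the one agent that belongs to a county; never borrow another county's."""
    env = os.environ if env is None else env
    prefix = county.upper().replace(" ", "_")
    endpoint = env.get(f"{prefix}_AGENT_ENDPOINT")  # hypothetical variable name
    key = env.get(f"{prefix}_AGENT_KEY")  # hypothetical variable name
    if not endpoint or not key:
        # No shared default: a missing county is an error, not a reason to guess.
        raise UnknownCounty(county)
    return {"endpoint": endpoint, "key": key}
```

The point of the design is the missing fallback: a question about one county can only ever reach the agent whose knowledge base holds that county's content.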

Nothing bad happened. But it's the kind of mistake that doesn't announce itself later, it just quietly answers questions with the wrong county's information and looks like a retrieval quality problem instead of what it actually is, a setup mistake from two days earlier.

A URL that wouldn't load, for no reason the console would tell me

The Gradient AI Platform's web crawler is supposed to pull a URL straight into a knowledge base. Mine kept failing on Santa Barbara's real, live, working jury services page. Not a typo, not a dead link, I could open it fine in a browser. The console just said it couldn't be fetched, no status code, no reason.

My best theory, never confirmed, is bot protection on the court's side blocking whatever the crawler identifies itself as. There was no way to check that from the DigitalOcean side. So I worked around it: copied the page content by hand into a markdown file and uploaded that as the data source instead of the URL. It worked. It also cost me an hour figuring out that file upload was even an option, and then later cost me another round when my first copy-paste only grabbed one of six tabs on that page and missed the two sections (employer protections, postponement process) that mattered most.

Trust, but check the file yourself

This is the one I'd tell any solo builder using an AI coding assistant, no exceptions.

More than once, Kiro reported a fix as done, tested, confirmed, here's the proof, and the actual file on disk still had the old code. Not once. At least three separate times, across different files, different fixes. The first time I assumed I'd misread something. By the third time I stopped assuming and started verifying every single claimed fix against the real file before moving on, no matter how confident the report sounded.

Eventually I stopped routing some fixes through Kiro at all and had Claude write and commit them directly, then had Kiro independently check afterward, from a fresh read, no memory of writing them. That's not a knock on Kiro specifically. It's a rule I'd apply to any AI tool making changes I can't watch happen in real time: a reported fix is a claim, not a fact, until you've read the file yourself.
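That rule is easy to mechanize. A minimal sketch, with a hypothetical helper name, that checks the file on disk rather than the report:

```python
from pathlib import Path


def fix_is_on_disk(path: Path, must_contain: str, must_not_contain: str = "") -> bool:
    """Check the real file, not the report: new code present, old code gone."""
    text = path.read_text(encoding="utf-8")
    if must_contain not in text:
        return False
    if must_not_contain and must_not_contain in text:
        return False
    return True
```

Calling it with the new snippet and the old one it was supposed to replace turns "confirmed, here's the proof" into something that either passes or doesn't.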

Cutting six counties down to one, on purpose

The original scope was six counties and a bilingual interface. By the halfway point of day one, the honest read was that six full counties meant six half-working ones, and I'd rather ship one that actually works end to end than six that don't.

So I labeled everything MUST, STUB, or NEVER before writing more code. Santa Barbara: MUST, fully built, real content, real agent, real answers. The other five counties: STUB, meaning the architecture supports adding them (one knowledge base, two environment variables, no code changes) but there's no content behind them yet. Spanish: STUB too, the toggle exists in the UI but it's disabled, because shipping a language switch that only translates button labels and not the actual answers would be worse than not having it. NEVER went on the list too: no legal advice, ever, no matter how the feature might tempt someone to add it later.

Cutting scope on purpose, with the cuts written down instead of just quietly not happening, is the only reason this got submitted instead of half-finished.

Hitting submit

I wrote at the top of this that I was nervous going in. Different kind of nervous now, the kind where you've just clicked submit on Devpost and there's nothing left to do but wait.

Two days, zero prior DigitalOcean experience, one person, one county, one real working answer to a question that matters to nearly everyone eventually: what do I actually do about this jury summons. That's the whole thing. If you made it this far and you're someone who'd want to build something together sometime, the offer from the top of this post still stands.

Fable 5 Hype: Fangirling with Datasets to Build a Lakers Dashboard

L. Cordero — Mon, 06 Jul 2026 04:05:07 +0000

This is the story of a for-fun project, Luka Fit Index, that started with me typing "ai for fun? picture this" at Claude Fable 5, Anthropic's new model, the one all the launch hype has been about. I wanted to see what the hype felt like on a project with zero stakes: no hackathon deadline, no rubric, just my team, my datasets, and an afternoon. It ended with a deployed page, a PRD, and a tool whose best feature is a paper trail of its own mistakes.

The idea that wasn't supposed to be serious

If you follow the NBA you know my team, the Lakers, had a summer. LeBron declined to re-sign and is now a free agent. Ayton got traded for a bench guard and picks. In one 35-minute stretch of free agency, the front office signed four new players. It's Luka's team now, and every one of those bets rides on one question nobody can actually answer in July: do these guys fit the way Luka plays?

My first idea was the obvious one. Build a metric. Score every player. Crown the offseason a success or a failure.

Then I pressure-tested it in the Fable 5 chat, and the whole thing fell apart in the best way. Every fit metric is stuffed with opinions wearing a math costume. The weights are your bias. The samples are too small. The new guys have played zero possessions together, so any "chemistry" claim is fiction. A verdict engine was impossible to build honestly.

So I flipped it. If the bias can't be removed, print it. The tool became a bias-visibility experiment disguised as a sports dashboard. Every assumption on the page, every sample size next to its score, and one hard rule that survived the entire build: no blended composite score, ever. Four axes stay four axes. If you want a single number that tells you what to think, this is the wrong tool on purpose.

Why usage, and the rules before the scores

The whole thesis sits on one stat. Usage rate is the share of team possessions a player ends while he's on the floor, whether with a shot, free throws, or a turnover. It's the cleanest measurable answer to "who has the ball?" Luka just led the NBA at 38.0. He ends more possessions than anyone alive. So every teammate's fit starts with one question: can you be great without it?
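For reference, here is the usage estimate as Basketball-Reference publishes it, written out as a function. Whether StatMuse computes its number exactly this way is an assumption, and the sample inputs in the usage note are made up.

```python
def usage_rate(
    fga: float, fta: float, tov: float, minutes: float,
    team_fga: float, team_fta: float, team_tov: float, team_minutes: float,
) -> float:
    """Usage percentage: possessions a player ends, scaled to his time on the floor."""
    # 0.44 is the standard estimate of how many free throw attempts end a possession.
    player_possessions = fga + 0.44 * fta + tov
    team_possessions = team_fga + 0.44 * team_fta + team_tov
    # team_minutes / 5 converts total team minutes into minutes per lineup slot.
    return 100 * (player_possessions * (team_minutes / 5)) / (minutes * team_possessions)
```

A player who is on the floor the whole game and ends exactly one fifth of his team's possessions comes out at 20, the even-share baseline. A 38 means nearly double that.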

From there, four axes. Spacing, because his kick-outs need shooters. Play finishing, because efficiency without the ball is the job. Defensive cover, because someone has to guard so he doesn't. And ball-need, inverted, where lower usage scores higher next to the highest-usage player in basketball.

The rules got decided before any player got scored, which mattered more than I expected. When the data later disagreed with my takes, the rules didn't bend. I did.

The fifteen-minute spike that earned its keep

Before a line of the page existed, we ran my usual reality check, a fifteen-minute spike with pass/fail criteria. Fable 5 did the pulling: real stats for the four signings, proposed scores on each axis, receipts logged to a spike file I could verify against later. My job was the criteria and the verdict. The spike caught two things the concept phase, mine and the model's both, had confidently wrong.

The 3&D wing the front office signed, Quentin Grimes, shot a career-worst 33.4% from three last season (no disrespect Quentin!). The story in my head was a season out of date. And the new starting center played five games before shoulder surgery ended his year, so his "best fit on paper" scores rest on old data and a repaired labrum.

Two corrections in fifteen minutes, before a single pixel existed. That's when confidence tags got promoted from footnote to first-class UI element. They changed the verdict on half the players they touched.

The question that broke my own metric

Here's the part I want to remember. The tool flagged Reaves, our co-star, as the worst fit on the roster. His 26.6 usage next to Luka's 38.0 looked like pure overlap. The math was clean. The take was loud.

Then I asked one question, and it wasn't a stats question, it was a fan question. What about the minutes Luka spends resting on the bench?

Season usage can't tell the difference between a guard dominating the ball next to his star and a guard running the show while the star rests. Those are opposite signals wearing the same number. Fable 5 owned the miss immediately, then went and found game-level with/without splits hiding in plain sight on StatMuse. In the 41 games with Luka, Reaves ran a 25.2 usage. In the 10 games without him, 34.3, at 67.5% true shooting. That's not a fit problem. That's your co-star proving he can carry the minutes the franchise player sits.
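A small sketch of why the single number hides the split. It blends the two quoted splits by games played, which is only a rough approximation because usage is possession-based, so it will not exactly reproduce the season figure. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSplit:
    label: str
    games: int
    usage: float


def games_weighted_usage(splits: list[UsageSplit]) -> float:
    """Blend splits by games played (an approximation, not the official stat)."""
    total_games = sum(split.games for split in splits)
    return sum(split.games * split.usage for split in splits) / total_games


# Figures quoted in the text above.
WITH_LUKA = UsageSplit("with Luka", 41, 25.2)
WITHOUT_LUKA = UsageSplit("without Luka", 10, 34.3)

if __name__ == "__main__":
    blended = games_weighted_usage([WITH_LUKA, WITHOUT_LUKA])
    gap = WITHOUT_LUKA.usage - WITH_LUKA.usage
    print(f"blended: {blended:.1f}, gap between splits: {gap:.1f}")
```

The blended value lands in the high 20s, close to the season number, while the two splits underneath it sit about nine points apart.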

The tool's loudest take collapsed under one good question, and the collapse became the best content on the page. There's a table on Reaves' card now showing all three numbers. The correction ships with the product.

It kept happening, and I kept the receipts

Correction two: the free agent board. The fit metric loved Gary Payton II, best defender available, and Fable 5 ranked him second. I asked whether forwards, wings, and centers should outrank guards on the priority list, since that felt like the actual hole. The model counted the roster, seven guards, three forwards, one and a half healthy centers, and demoted its own number two pick because he'd be the eighth guard. Fit and need are two lenses on the page now, shown separately, never averaged, because averaging away a real tension is how tools lie politely.

Correction three: Nicolas Batum. He almost went unscored. Fable 5 had flagged him as the name most worth pulling, then neither of us pulled him. He surfaced when I asked for an audit of everything we'd discussed but not done, and the pull came back with the lowest usage of any player on the page, 9.7, with 40.4% shooting, at the exact position the roster lacks. The wing the need lens was begging for, nearly skipped. He's number two on the board with a note admitting it.

And one more, the one I'm proudest of catching: the defensive scores originally leaned on steals, blocks, and reputation, which is a fraction of what defense is. Fable 5 and I worked through the upgrade together, the model documented it beautifully in the methodology, and then implemented it nowhere. I caught the gap by asking one question: wait, didn't we decide to change the defensive metrics? Fable 5 owned it, "you didn't miss anything, I did," pulled defensive rating for all fourteen scored players, and the signing marketed as a 3&D wing (Quentin again, apologies!) graded out below team average. His cover score dropped from a 4 to a 2. Reputation lost to a number we could cite. That's the whole project in one sentence.

What I learned, hopefully

The disclaimers are the product. Printing them turned out to be the more interesting design, and weirdly, the more credible one. Every "check my sources" claim on the page is one click to verify.

Deterministic structure, labeled interpretation. The stats are pulled and printed. The 0 to 5 bars are my judgment mapping, and the page says so in the footer. The v1 PRD replaces my mapping with fixed thresholds, and its QA gate is brutal by design: where the formula disagrees with the scores, the formula wins.
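What "fixed thresholds" could look like for the inverted ball-need axis, as a sketch only. The cut points below are invented for illustration. The PRD's real thresholds are not in this post.

```python
import bisect

# Hypothetical cut points, illustration only.
USAGE_CUTS = [12.0, 16.0, 20.0, 24.0, 28.0]


def ball_need_score(usage: float) -> int:
    """Map usage rate to a 0-5 score, inverted: lower usage scores higher."""
    # bisect_right counts how many cut points the usage rate has reached.
    return 5 - bisect.bisect_right(USAGE_CUTS, usage)
```

With a table like this the same stat always produces the same bar, which is what lets the formula win when it disagrees with a hand-set score.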

A tool that argues back is worth more than one that agrees. Three corrections, all documented, all still visible on the page. The build log reads like a transcript of me losing arguments to my own ideas and assumptions. I recommend it.

And here's my Fable 5 review, since the title promised one: the hype for me wasn't speed or polish. It was that when my question broke its loudest take, it said so, pulled the data, and rewrote its own conclusion instead of defending it. That's the behavior I actually want from a build partner.

The division of labor stayed clean, and that's why it worked. Fable 5 pulled every stat, drafted every score, built every page, and kept the receipts. I set the rules, asked the questions, caught the unimplemented decision, and approved everything that shipped. Creative director model. Neither half gets there alone.

The best questions came from fandom, not from stats. "What about when Luka rests" is something any Lakers fan would ask. No model volunteered it. Same for "shouldn't wings and centers come first." The human in the loop, moi, earned their spot, and so did the model that took both questions seriously instead of defending its work.

The page is designed to be provably wrong later. Every score can be graded against the 2026-27 season, and the footer promises it will be. An app that can audit its own past takes instead of just producing new ones forever. That's the honest version of sports analytics I can build from a laptop.

It's live, it's free, and it will be stale within a week because the roster is still moving, which the page also tells you.

Data and sources

Everything on the page traces to public sources, and fair is fair, they did the real work:

StatMuse — every player stat on the page: usage rates, shooting splits, true shooting, defensive ratings, and the game-level with/without splits that changed the Reaves verdict. The underrated find of this build: StatMuse answers URL-pattern questions like a query API, no scraping required.
Bleacher Report — roster construction and cap position after the Ayton trade and the free agency signings.
Spotrac — contracts and the 13-of-15 roster math.
NBC Sports — the remaining free agent market that fed the vet-minimum board.
ESPN — the "35-minute flurry" reporting on the Lakers' free agency and offseason context.
NBA.com — the Kessler shoulder surgery reporting behind his low-confidence tag.

The page itself carries the same credits with the same links, because a tool about visible assumptions should cite like it means it.

Live: https://luka-fit-index.netlify.app/
More of my work: https://earlgreyhot1701d.github.io/Clew-Labs/

AI Assisted. Human Approved. Powered by NLP.

AI For Fun! Électrique Chats for Hack the Kitty, Built with Kiro.

L. Cordero — Sat, 04 Jul 2026 03:14:14 +0000

A cat astrologer, spec-driven and running on Amazon Bedrock

A companion to A Builder in Paris: Do Devs Dream of Électrique Chats?

Last month I wrote about the idea. Six rainy days in Paris, a closed laptop, and a hackathon I did not mean to enter, and somewhere between the Musée de l'Orangerie and a lot of walking, an idea arrived. Cats are inscrutable. The people who love them are obsessed with understanding them anyway. Astrology is an old framework for making the unknowable feel readable, and maybe, just maybe, it helps us understand them a little. Her name is Madame Minou, a French cat astrologer who reads your cat's stars from a café terrace.

That first article was the idea. This one is the build.

Vibe-coded, but on rails

Was it vibe-coded? You know it! AI wrote the lines, and I said "no, not like that" more times than I can count. But it was vibe-coding on rails, and the rails were Kiro. Before a single line of app code, I wrote the requirements in EARS notation, a design doc, and a build-ordered task list, all living in .kiro/specs. Decide what "done" means before letting anyone, human or model, start building. The specs are what kept the vibes on track.

Then the steering files. .kiro/steering held the enduring rules of the project: product principles, security guardrails, technical direction, and UI law. These were the thing that kept a long, multi-session build from drifting. When a new session opened, the steering files were already the shared context. "The café blue" was one token, not five guesses. Security was not optional.

From there, the loop: Kiro implemented one approved block at a time, ran each task's PASS/FAIL QA gate on itself before moving to the next, and only stopped for my review on the two things that actually mattered. I directed and approved. Kiro proposed and built. Spec first, block by block, human in the loop.

The facts are sacred

Here is the part that looks like a party trick and isn't. Madame never guesses the chart. The sun sign is computed in code, deterministically. The model only writes the voice over the facts it is handed. It cannot invent a sign, because the facts come first. Deterministic structure, AI flavor. The astrology is the vocabulary; the structure underneath is the catnip.
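A deterministic sun sign really is just a date-range lookup. This is a minimal sketch, not the app's code, using the commonly published tropical boundaries. Exact boundaries shift by a day in some years, so treat the dates as an assumption.

```python
from datetime import date

# (month, day) on which each sign starts, in calendar order.
SIGN_STARTS = [
    ((1, 20), "Aquarius"),
    ((2, 19), "Pisces"),
    ((3, 21), "Aries"),
    ((4, 20), "Taurus"),
    ((5, 21), "Gemini"),
    ((6, 21), "Cancer"),
    ((7, 23), "Leo"),
    ((8, 23), "Virgo"),
    ((9, 23), "Libra"),
    ((10, 23), "Scorpio"),
    ((11, 22), "Sagittarius"),
    ((12, 22), "Capricorn"),
]


def sun_sign(birthday: date) -> str:
    """Return the sun sign for a date; same input, same answer, no model involved."""
    key = (birthday.month, birthday.day)
    sign = "Capricorn"  # covers January 1-19, before the first boundary of the year
    for start, name in SIGN_STARTS:
        if key >= start:
            sign = name
    return sign
```

The model only ever receives the returned string as a fact to write around, so it has no opening to pick a different sign.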

All in on AWS

Claude runs through Amazon Bedrock on IAM, which means there is not a single API key anywhere in the stack. Lambda, API Gateway, and DynamoDB run the readings and a real per-IP freemium gate. S3 and CloudFront serve the café.

What broke during the build

I promised myself limitations language would be a feature, so here is the struggle bus story, not the tidy one.

I started on the Claude Platform on AWS path and hit a wall: my hackathon account could not complete the Marketplace subscription. So I pivoted to Amazon Bedrock, where my Claude access actually lived, and the whole thing got simpler.

I wanted real ephemeris math for moon and rising signs, but pyswisseph is a native C extension with no Python 3.13 wheel, and it would not compile in the build environment I had. Rather than fight a compiler at nine at night, I shipped a pure-Python sun sign (it is just a date-range lookup, no ephemeris required) and moved moon and rising to v2. Sun sign is most of the value, and now it is rock solid instead of theoretical.

And the deploy. A reserved AWS_REGION env var that failed the whole stack. A Lambda that returned "internal server error" because the build packaged the handler but not the server module it imported. CORS. A missing paragraph renderer. Every one of them a real bug, every one of them a one-line fix once I stopped guessing and read the actual error. Powered by NLP but humbled by CORS!

What I cut

Moon and rising signs. The pyswisseph build wall. A Lambda layer is the v2 fix.
Daily nudge and history. Shipped as honest "coming soon" stubs, not half-features.
Wider birth-city coverage, a premium tier, a custom domain, the exact Paris Métro font. All real, all v2. Shipped honest, not complete.

For Penelope

Madame Minou is for Penelope. Tuxedo, my wife's BFF and my fourteen-year frenemy, who I was allergic to the whole time and could barely pet. She passed in February. This is the first thing I have built for a cat I never quite got to hold. There is a quiet link in the app, in her memory, to the Lap of Love Angel Fund. Because the stars are just a beautiful vocabulary for love.

Try Madame Minou: https://dghcwayx8gb6b.cloudfront.net
Code: https://github.com/earlgreyhot1701D/madame-minou
Built for Hack the Kitty

AI Assisted. Human Reviewed. Powered by NLP.

Can retrieval agents like ChatGPT and Perplexity read your website? Agentis Lux sees what they see.

L. Cordero — Sun, 28 Jun 2026 21:13:08 +0000

I created Agentis Lux to enter the H0 Hackathon (Vercel + AWS Databases). #H0Hackathon See Agentis Lux's Devpost.com entry.

It started with a comment at a hackathon.

A you.com employee said the thing out loud: the web has a second audience now. When you ask ChatGPT or Perplexity a question, a retrieval agent fetches a page and reads its HTML to answer you. Not the laid-out site with the buttons and the hero image. The markup underneath. These agents arrive by the million, and many of them rely on the raw or minimally rendered HTML rather than running your JavaScript, so they often see far less of your page than a person does.

That comment sent me to build. My first answer to it was Hermes Clew, for the GitLab Duo Agent Platform Challenge. Hermes lived inside GitLab Duo Chat, no frontend, no database: a Python engine that scanned the HTML, JSX, and TSX files in a repo, scored them across six categories, and let an LLM reason over the findings. It proved the core idea. It also told developers how to fix things, lived inside one vendor's chat, and only worked on files in a repo.

Agentis Lux is what happened when I took that idea to the open web and rebuilt it with a different stance. Any live URL, not just repo files. Its own product on a real cloud architecture, not a chat window. And no fix suggestions, on purpose, where Hermes used to hand them out. Same six-category bones, a new body, a sharper philosophy. It scans your site and shows you what that second audience experiences when it tries to read it.

What it does

You paste a URL to Agentis Lux. You get a report. The report is written from the agent's point of view.

Not "this is broken." More like: "an agent landing on this page can't tell which element starts checkout, because it's a styled div and not a button."

It reports findings. It does not suggest fixes, and that is on purpose. I know what the agent sees, not what you should change. That is the whole value: visibility, and you decide what to do with it. Awareness, not judgment.

Six deterministic checks score the frontend out of 100: semantic HTML, form accessibility, ARIA, structured data, content in the HTML, and link and navigation. A parallel set of six API checks runs on the backend.
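To make "deterministic check" concrete, here is a sketch of one pattern from the semantic HTML category, written with Python's standard library. It is not the Agentis Lux engine. It only catches inline onclick attributes, since handlers attached from JavaScript are invisible in raw HTML, and the finding wording is made up.

```python
from html.parser import HTMLParser


class ClickableDivFinder(HTMLParser):
    """Collect <div>/<span> elements that act like buttons without saying so."""

    def __init__(self) -> None:
        super().__init__()
        self.findings: list[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("div", "span") and "onclick" in attributes:
            if attributes.get("role") != "button":
                line, _ = self.getpos()
                self.findings.append(f"line {line}: <{tag}> has onclick but no button role")


def find_clickable_divs(html: str) -> list[str]:
    """Run the check over raw HTML and return its findings in document order."""
    finder = ClickableDivFinder()
    finder.feed(html)
    finder.close()
    return finder.findings
```

Same input, same findings, every time, which is the property the score depends on.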

The one idea the architecture is built on

Structure is deterministic. Flavor is AI.

The checks and the scoring are pattern matching. No model touches the number. Same input, same score, every time. I only spend AI in two places where a regex can't help: a Bedrock call writes the one-line plain-language verdict, and a second Bedrock layer runs an agent simulation, reasoning about what a retrieval agent would experience on the page and reporting what it could and could not accomplish. Not an autonomous agent clicking around. A simulation of the experience.

Vercel runs the entire frontend and the edge layer. The Next.js App Router app deploys to Vercel with the /api/scan route as a serverless proxy in front of the AWS backend, so the browser never talks to Lambda directly. Preview deployments on every push meant I could see each change live before it merged, which is most of how a solo builder keeps quality up without a QA team. The custom domain, HTTPS, and CDN were Vercel defaults I didn't have to think about, which kept my attention on the scan engine.

The AI is constrained, not creative. Low temperature, capped tokens, and a system prompt that encodes the product's own rules: no fixes, no judgment words, no em dashes. The simulation returns structured JSON, and any finding it references is filtered against the deterministic findings, so the model can't invent something the math didn't catch. If it fails validation, it falls back to a template. Math for trust, and the AI is fenced into exactly the two jobs where judgment helps.
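The fencing step can be sketched like this. The post says the real backend is JavaScript, so this Python version is an illustration only, and the JSON shape and the field names ("findings", "finding_id") are assumptions.

```python
import json
from typing import Optional


def grounded_findings(model_output: str, deterministic_ids: set[str]) -> Optional[list[dict]]:
    """Keep only simulated findings that cite a finding the checks actually produced.

    Returns None when the output is unusable, which tells the caller to use a template.
    """
    try:
        payload = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or not isinstance(payload.get("findings"), list):
        return None
    return [
        finding
        for finding in payload["findings"]
        if isinstance(finding, dict)
        and isinstance(finding.get("finding_id"), str)
        and finding["finding_id"] in deterministic_ids
    ]
```

Anything the model mentions that the deterministic pass never produced is dropped before a reader sees it.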

Math stays math, so you can trust the number. Language and judgment are where AI earns its place.

This sounds like a philosophy choice. It ended up being an economics choice that fell out of the architecture. The deterministic core runs at any scale for almost nothing, so the free tier can stay free. I only pay for model tokens on the sentence and the simulation, the two places a human reads. I didn't design that in a spreadsheet. It just dropped out of keeping the math and the AI in separate boxes.

Why DynamoDB, and how I used it

The hackathon stack is Vercel on the frontend and AWS on the back, with DynamoDB as the data layer. I wanted to use DynamoDB as a deliberate data model, not a key-value afterthought, because every access pattern in this product is a single key lookup. That is exactly what it is built for.

Five tables, each with one job:

ScanCache, 15-minute TTL, keyed by a hash of the URL, dedupes repeat fetches and keeps Bedrock cost down.
ScanResults, 24-hour TTL, keyed by an opaque id, anonymous, results that expire on their own.
BenchmarkScans, the 50-site dataset, with a GSI on vertical, rewritten monthly by an EventBridge refresh.
ScanCounters, server-side counts, no PII. Reserved for the team tier.
Users, reserved for signed-in history. A stub.

Two of those are live on every scan, one holds the benchmark, and two are reserved stubs for later. Two TTLs, two lifetimes, two reasons. Per-vertical rollups use the GSI, not a second database. No joins, no migrations, no idle server.
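A sketch of what a ScanCache item could look like. DynamoDB's TTL feature expects a Number attribute holding an epoch timestamp in seconds. The attribute names ("urlHash", "expiresAt") and the choice of SHA-256 are assumptions, not the project's actual schema.

```python
import hashlib
import time
from typing import Optional

CACHE_TTL_SECONDS = 15 * 60  # the 15-minute lifetime described above


def scan_cache_item(url: str, result: dict, now: Optional[float] = None) -> dict:
    """Build a cache item keyed by a hash of the URL, with an expiry DynamoDB can act on."""
    now = time.time() if now is None else now
    return {
        "urlHash": hashlib.sha256(url.encode("utf-8")).hexdigest(),  # hypothetical key name
        "result": result,
        "expiresAt": int(now) + CACHE_TTL_SECONDS,  # epoch seconds, hypothetical name
    }
```

Because the key is derived from the URL, a repeat scan inside the window is one key lookup, and expiry needs no cleanup job.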

The write on a live scan is fail-soft and async. The scan returns to you whether or not the write lands, and a failed write goes to CloudWatch instead of your screen. The scan result is the product. Persistence is a side effect.

(The product is Agentis Lux. The engine is Perseus Clew, part of my Clew suite, which is why the AWS tables carry the PerseusClew prefix.)

The bet I made in public

Before the engine scanned anything, I wrote down what I expected it to find across 50 sites and committed it to the repo with a timestamp. Predictions first, data later, so I couldn't move the goalposts.

Then I scanned ten sites each across e-commerce, SaaS, content and media, US government, and indie builder projects.

Indie builders won. Mean score 77 out of 100, ahead of government, SaaS, and e-commerce. The single highest score in the whole run was a personal developer portfolio at 91. Scores ran from 34 to 91. Four sites blocked the scan at the door, including OpenAI.

I missed three of my six predictions. That is the point of pre-registering them. If I had gone six for six you should distrust me, because it would mean I only predicted what I already knew. The misses are where I learned something: that craft beats compliance, that the API is the real blind spot, and that a hand-built personal site reads cleaner to an agent than most of the web's biggest companies.

The full dataset, including the sites that blocked me, is in the repo.

The gaps

Fetching arbitrary user-supplied URLs on a public endpoint is a security problem before it is a feature. The backend does full DNS resolution, blocks private and reserved IPs, validates every redirect hop, forces HTTPS, and caps size and time. That hardening took as long as some of the checks did.
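The IP-blocking part of that hardening can be sketched with Python's standard ipaddress module. This covers only the address check. A real guard, as described above, also has to resolve the hostname first and re-run the check on every resolved address and every redirect hop.

```python
import ipaddress


def is_blocked_address(address: str) -> bool:
    """Return True for addresses a public scanner should refuse to fetch."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # includes the 169.254.0.0/16 cloud metadata range
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```

Checking the resolved address rather than the hostname string is what stops a friendly-looking domain from pointing at an internal service.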

Bedrock had to be allowed to fail. If the model is slow or errors, the report still renders, because the AI verdict has a deterministic template under it as a floor. The hero line never breaks, because the score under it was never AI in the first place.
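The floor idea can be sketched like this. The score bands and wording are invented for illustration, and `ask_model` is a hypothetical callable standing in for the Bedrock call, which is expected to enforce its own timeout.

```python
from typing import Callable


def template_verdict(score: int) -> str:
    """Deterministic verdict built from the score alone (bands are hypothetical)."""
    if score >= 80:
        band = "reads cleanly to an agent"
    elif score >= 50:
        band = "is partly readable to an agent"
    else:
        band = "is hard for an agent to read"
    return f"This site scored {score}/100 and {band}."


def build_verdict(score: int, ask_model: Callable[[int], str]) -> str:
    """Prefer the model's text, but never let its failure break the report."""
    try:
        text = ask_model(score).strip()
    except Exception:
        text = ""
    return text or template_verdict(score)
```

Because the score is computed before the model is asked anything, a slow or failing model only costs the nicer wording, never the number.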

And also: this is a solo build on a deadline. The backend is JavaScript, not TypeScript. The benchmark page serves a published snapshot instead of querying DynamoDB live. The results view still has heading-hierarchy work. All of it is written down in KNOWN-LIMITATIONS.md, as choices, with reasons. On a product whose whole thesis is readability, hiding the gaps would be the one move I could not make.

Where this sits next to the other tools

Scrunch, recently acquired by Sitecore, works on AI search visibility: whether your brand gets cited when someone asks an AI a question. That is about being found. Agentis Lux is about whether an agent can read and use what it finds. Theirs is visibility. Mine is operability.

Google's experimental Agentic Browsing audit in Lighthouse (May 2026) checks the agent-as-actor surface: WebMCP and whether a browser-driving agent can operate your page. Agentis Lux goes deeper on the agent-as-reader surface, the raw HTML a retrieval agent forms an impression from before it ever acts. Different door.

The agentic web is new enough that Google only added experimental, unscored checks two months ago. That does not make this unoriginal. It is evidence the lane is open.

What the tool says about itself

Agents are not one reader. They are a spectrum, from the retrieval crawler that never runs your JavaScript to the browser-driving agent that does. The interesting output is the gap between them, and that is where this goes next: live benchmark querying, score history, and a render mode that shows the delta between what a non-JS agent sees and what a JS-capable one sees.

The tool scans its own site and publishes the result. It went from 70 to 96 after I fixed what it found, with one finding still open and shown anyway. Because if I scrubbed my own site to a perfect 100, you would have every reason not to trust the number on yours.

Try it on your own site: agentislux.io. The code, the methodology, and the raw benchmark data are in the repo.

For your second audience.

Links

Live: agentislux.io
Demo video (2:57): youtube.com/watch?v=bv56_XB1E_c
Code (Perseus Clew engine): github.com/earlgreyhot1701D/perseus-clew
The earlier proof of concept, Hermes Clew: github.com/earlgreyhot1701D/hermes-clew
H0 Hackathon: h01.devpost.com

More from Clew Labs: earlgreyhot1701d.github.io/Clew-Labs

AI assisted. Human approved. Powered by NLP.

My app didn't go "viral". My AWS bill did.

L. Cordero — Thu, 25 Jun 2026 03:54:05 +0000

And by viral I mean from $0 to $31.

Umami told me Clew Directive got 14 visits last month. AWS told me I owed $31 for it. That works out to $2.21 a visit, which would make it the most expensive free learning-path tool in California.

Spoiler alert: 14 visits, $31, and not a single one of them was the reason.

Something was off. Here is how Amazon Q, Claude, and a few hours of reading my own code untangled it. The app turned out to be innocent.

What Clew Directive is, quickly

A free, stateless tool that builds you a personalized AI learning-path PDF. You take a 60-second Vibe Check, four questions about your goals and how you learn, and it maps you to free, verified resources and hands you a briefing. No accounts, no database, no paywall, nothing stored about you. It runs on Amazon Nova, which is why it costs close to nothing to operate, which is also why a $31 bill made no sense.

The name is the Theseus kind of clew. A ball of thread to find your way out of the maze. Less hype, more direction. Live at clewdirective.com.

The number that didn't add up

Twelve visitors, 14 visits, 93% bounce, average session about a minute. Referrers from Bing, Google, Yahoo, GitHub. Visitors from the US, India, Netherlands, Egypt, Ethiopia, Singapore. Mostly crawlers stopping by to say hello.

A few curious humans and a parade of bots is not a $31 month. So either every visit was doing something enormous, or the bill was never about visits at all.

The dashboard lied, politely. An Amazon Q Story

My cost tracker said Clew Directive was running on Claude Sonnet. Sonnet is the expensive one. Case closed, right?

I opened the repo. Clew Directive does not run Sonnet. The Navigator agent runs Amazon Nova 2 Lite. Scout and Curator run Nova Micro. The IAM policy is scoped to Nova ARNs only, so a Sonnet call from these functions would come back AccessDenied. The app physically cannot bill Sonnet.

The math agreed. A full learning-path generation on Nova costs about two-tenths of a cent. Fourteen visits, even with the agents fanning out to a few calls each, rounds to lunch money. Nothing here gets you to $31.
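The arithmetic, written out. The fan-out factor is an assumption standing in for "a few calls each", and it deliberately overestimates by pricing every call like a full generation.

```python
COST_PER_GENERATION = 0.002  # dollars; "about two-tenths of a cent"
VISITS = 14
FAN_OUT = 3  # assumption: "a few calls each", each priced like a full generation

worst_case = VISITS * FAN_OUT * COST_PER_GENERATION
bill = 31.00

print(f"worst-case Nova spend: ${worst_case:.3f}")
print(f"share of the bill explained: {worst_case / bill:.2%}")
```

Even with the padding, the app accounts for well under one percent of the bill.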

One detail I want to flag, because it set the tone for the whole hunt. The same assistant, Q, that mislabeled the model also quoted me Haiku pricing at a quarter of the real rate. So here is the rule I kept coming back to: trust what a tool retrieves, verify what it remembers. Those are two different things.

Asking a better question

The question stopped being "why is my app so expensive" and became "what is actually spending, and why is it wearing my app's name."

Q pulled the breakdown. The month was 28 million tokens across only 8 active days, and two of those days did 70% of the work. May 24 and 25. Memorial Day weekend.

The shape of the cost was the real tell (Sonnet only):

Cache writes: 4.1M tokens, $15.33 (55%)
Cache reads: 23.8M tokens, $7.14 (26%)
Output: 346K tokens, $5.20
Input: 120K tokens, $0.36
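Dividing each line's dollars by its tokens gives an implied price per million tokens, which makes the shape easier to see. This sketch uses only the four figures above; the variable names are mine.

```python
# (tokens, dollars) for each line of the Sonnet breakdown above
lines = {
    "cache_write": (4_100_000, 15.33),
    "cache_read": (23_800_000, 7.14),
    "output": (346_000, 5.20),
    "input": (120_000, 0.36),
}

total_cost = sum(cost for _, cost in lines.values())
total_tokens = sum(tokens for tokens, _ in lines.values())

# Implied price per one million tokens for each line
rates = {name: cost / tokens * 1_000_000 for name, (tokens, cost) in lines.items()}

for name, rate in rates.items():
    print(f"{name:12s} ${rate:6.2f} per 1M tokens")
print(f"total: ${total_cost:.2f} across {total_tokens / 1_000_000:.1f}M tokens")
```

The implied rates come out to roughly $3.74, $0.30, $15.03, and $3.00 per million tokens. Cached tokens make up almost all of the volume and about four-fifths of the roughly $28 total, while fresh input barely registers.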

The fingerprint

A web app with 14 visits does not look like that. Heavy cache writes up front, heavy cache reads after, and almost no real input or output is the signature of an agent reasoning over a big fixed context. It loads that context once, caches it, then re-reads it on every turn.

Clew Directive does no prompt caching at all. So whatever ran up the bill, it was an agent chewing on a large cached context, not an app answering users. Which pointed me at a very different project.

It was me, over a long weekend sprint

Clew Directive had zero commits on May 24 or 25. Last time I touched it was May 9.

A different repo lit up. vigil-crest. Created May 23, four commits on May 24. And I know exactly what it is, because I wrote a whole article about it and published it on, of course, May 24.

Vigil Crest is a challenge-triage agent I talk to on Telegram. It browses the live DEV challenge feed and tells me which hackathons are worth my time. Its stack, in my own published words: AWS Bedrock running Claude Sonnet 4.6, reached through an EC2 instance role so no credentials sit on the box, hosted on an always-on t3.micro. Read that back against what CloudTrail handed me: an assumed role, vigil-crest-bedrock-role, on an EC2 instance, calling claude-sonnet through the streaming API. (I am not pasting the full ARN. Account IDs stay home.)

Same project. Same box. Same model. The weekend I shipped it.

So the $28 of Sonnet spend was vigil-crest, run up while I spent two days hammering on it before submission. Each triage run caches a fat context (the agent persona, the stack file, the rendered challenge feed), then re-reads it across turns. That is the cache-heavy shape, exactly. Whether it was the agent's own test runs or the tooling I built it with, both ran on that one EC2 box under that one role. Real work, priced correctly, just filed under the wrong project.

Q was Sherlock. Claude was the Watson who argued back.

I want to shine a light on how this got solved, because neither tool did it alone and neither did I.

Amazon Q in the console has one thing Claude does not: keys to the building. It reads my live account. CloudTrail, Cost Explorer, the actual deployed config, the IAM principal behind a single call. That is the force multiplier. I do not have CloudTrail memorized and I am not going to hand-read every IAM policy at 9pm. Q walked the crime scene and came back with the role name, the instance, the timestamp, the model, in minutes. That is the legwork no amount of reasoning replaces.

But access is not the same as the right conclusion. Q pulled clean evidence and attached the wrong story to it three times. It was a runaway process. It was your app. Check your application logs. Every pass, perfect data, wrong suspect. The evidence was never the problem. The narrative on top of it was.

Claude could not see my account at all. What it could do was refuse the easy story and push the evidence back through the actual code and the math. The repo says Nova. The IAM says AccessDenied for Sonnet. The token shape says agent, not app. It also said, out loud, that it could only see my commits and not my deployments, so part of this was inference and I should confirm it. A tool telling me where its own knowledge stops is worth more than one that sounds certain.

So Q was the detective with the magnifying glass on the real scene, and Claude was the Watson who kept asking does that actually follow. The twist is that in this case Watson is the one who pushed back on the detective. Sherlock had the keys. Watson had the doubt. I had to point them at each other and refuse to take the first confident sentence either one offered.

There is a tidy irony in here. Vigil Crest, the agent that ran up the bill, is built on exactly this idea: a verdict that knows how sure it is beats a confident guess. It hedges its calls on purpose. Solving its own bill came down to making my tools do the same thing: separate what they pulled from what they assumed. The agent's whole design philosophy is what cracked the case it caused.

What I learned, hopefully

The bill was never about traffic. Umami is client-side JavaScript. It counts browsers that run my script. It cannot see bots, it cannot see API calls, and it has nothing to do with Bedrock spend. I had tied two unrelated numbers together and scared myself.

The project label was a guess, not a fact. Bedrock charges are account-level. They do not inherit the tags of whatever called them. Unless you set up Application Inference Profiles and call those, every model dollar lands in a bucket marked "no project," and something has to claim it. Mine claimed Clew Directive by assumption.

The cost that bites you is the quiet one. The $31 sprint is over. June reads $0. The token burst was loud and self-limiting: it ended when I closed the laptop. The thing that does not end is the EC2 box under Vigil Crest, billing by the hour because it is meant to stay on. A small always-on t3.micro is cheap, but cheap and forgotten is how standing costs sneak up. Know what you keep running, and why.
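A rough way to put a number on a standing cost. The hourly rate here is an assumption, not a figure from my bill; swap in the on-demand price for your own region and instance type.

```python
HOURLY_RATE = 0.0104   # assumption: dollars per hour for a small always-on instance
HOURS_PER_MONTH = 730  # average month: 8,760 hours / 12

monthly = HOURLY_RATE * HOURS_PER_MONTH
yearly = monthly * 12

print(f"about ${monthly:.2f} a month, ${yearly:.2f} a year, if nobody turns it off")
```

At that assumed rate, a few months of forgetting costs more than the weekend sprint did.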

The AI tools were useful and wrong in the same breath. Q read CloudTrail and Cost Explorer cleanly and narrated the wrong story three times, blaming my app on every pass. Claude caught the bad pricing and read the repos, and still had to admit it could see commits, not deployments. The actual work was pinning each claim to a source instead of to a confident sentence. Trust retrieval. Verify recall.

So no, Clew Directive did not go viral. It served 12 visitors and a crowd of crawlers and cost me almost nothing, which is exactly what it was built to do.

The bill was me, in a trench coat made of EC2, building the next thing.

Tell AWS. I want them to know it was me.

Clew Directive is free and open source. Find your way out of the AI-course maze at clewdirective.com.

AI Assisted. Human Approved. Powered by NLP.

Breaking Build: Kiro and Claude delivered exactly what I asked, and it wasn't what I wanted

L. Cordero — Fri, 19 Jun 2026 16:48:42 +0000

Building in public means showing the part where the robots did great work on the wrong thing.

The deploy on Agentis Lux succeeded. Green check, no errors, site live. I scanned my own site to grab a "before" shot for a before-and-after, and the scanner handed back a score of 62.

It handed back 62 for the next site too. And the next one. Same score, same findings, every time, including a finding about a "checkout button" on a site that has no checkout button.

The build worked. It was running a version of the scanner I'd written weeks ago and abandoned. Everything I'd built since then was sitting in the repo, merged, tested, and not deployed. The deploy pipeline had run exactly once, in May, and never again. AND I NEVER NOTICED!

So the live site was a confident, well-tested, fully-green stub.

Technically, nothing went wrong. That's the part I keep mulling over...and over...and over.

Mind the gap!

I build with AI agents. I direct, they generate. One agent writes the infrastructure, another audits it, I make the calls and merge. It's fast and it's good, and the failure mode is not what I expected.

I expected the agents to make mistakes. They mostly don't. What they do instead is build exactly what I asked for, correctly, when what I asked for wasn't what I wanted. The bug isn't in the code. The bug is in the gap between my instruction and my intention, and the agent fills that gap with whatever's most literally true. This exact thing, context engineering, came up at Anthropic's talk at the AWS Summit.

A human orchestrator, in this case...me, pushes back. "You said deploy, but the pipeline hasn't run since May, did you mean redeploy the current code?" An agent says "deploy succeeded" because the deploy did, in fact, succeed. It answered the question I asked. I had asked the wrong question, and it sat squarely in my blind spot.

I hit this four times on one project in about a week. Same shape every time.

Four times it was right and wrong at once

The stub that shipped. The 62 that came back for every single site, the Groundhog Day score. The infrastructure was real, the tests were green, the deploy worked. It just deployed code I'd left behind. "Is it deployed" was true. "Is the thing I built deployed" was the question I forgot to ask. [Lesson: Don't assume.]

The three doors, one of them real. My scanner takes three kinds of input: a URL, a code repo, an API spec. The interface showed three tabs for them. Clean, obvious, exactly what the design implied. Only the URL one was wired up. The other two were built to the spec I gave, which described three tabs, and I'd later decided to ship only URL scanning first and never updated the interface to match. So a visitor clicks "API spec," types something in, and hits a polite wall. The tabs were correct. My scope had moved and the tabs hadn't heard about it. [Lesson: Kiro and Claude can't read my mind!]

The findings only an engineer could read. My whole audience is people who build with AI and may not know what a <ul> is. The scanner's findings said things like "repeated sibling elements not wrapped in ul or ol." That is a correct finding. It is also useless to the person I built the tool for. I'd asked for accurate, technical, no-fluff findings. I got them. I forgot to ask "can my actual user read this." [Lesson: Don't forget you're building for the end user, a real person, not a theoretical one.]

The card that rendered nothing. A social card route, built, deployed, working. I saved the image and got a zero-byte file. The route fetched three fonts from the web, and when one came back empty instead of failing outright, the image renderer got fed garbage and produced nothing. The catch block that was supposed to handle font failures never fired, because the fetch didn't fail. It "succeeded" with an empty hand. The error handling was correct for the error it was watching for. The actual failure walked in through the one door nobody was watching. [Lesson: Don't skip testing the live workflow.]
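The empty-font case is the easiest of the four to show in code. This is a sketch with hypothetical names, not the actual card route: treat an empty body as a failure, so the fallback path that already exists gets a chance to run.

```python
class EmptyAssetError(Exception):
    """Raised when a fetch 'succeeds' but returns nothing usable."""


def load_font(fetch, url: str, min_bytes: int = 1) -> bytes:
    data = fetch(url)
    # A successful call with an empty body is still a failure for a renderer.
    if data is None or len(data) < min_bytes:
        raise EmptyAssetError(f"font at {url} came back empty")
    return data


def load_fonts_or_fallback(fetch, urls, fallback: bytes) -> list:
    fonts = []
    for url in urls:
        try:
            fonts.append(load_font(fetch, url))
        except Exception:
            # Covers both real fetch errors and the empty-handed success.
            fonts.append(fallback)
    return fonts
```

Here `fetch` is any callable that returns bytes for a URL. Passing it in keeps the check testable without touching the network.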

The pattern

Every one of these passed its own test. The deploy deployed. The tabs matched the spec. The findings were accurate. The card route ran. If I'd trusted "it works," all four would have shipped.

The thing that caught them was not better prompting and not a smarter agent. It was me looking at the actual output and asking a simpler question than the agent would think to ask. Not "did it run." "Is this the thing I wanted." A 62 on every site is suspicious if you bother to scan a second site. Three tabs are a trap if you click the ones you didn't finish. A finding is useless if you read it as your own user instead of as the engineer who wrote it.
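The "scan a second site" check is simple enough to automate. A sketch, assuming each result is a dict with a score and a list of findings; the field names and finding text are hypothetical.

```python
def looks_like_stub(results: dict) -> bool:
    """True when two or more distinct URLs produced identical score and findings."""
    if len(results) < 2:
        return False
    fingerprints = {
        (result["score"], tuple(result["findings"])) for result in results.values()
    }
    return len(fingerprints) == 1


stubbed = {
    "https://site-one.example": {"score": 62, "findings": ["checkout button unlabeled"]},
    "https://site-two.example": {"score": 62, "findings": ["checkout button unlabeled"]},
}
print(looks_like_stub(stubbed))
```

It cannot tell you what is wrong, only that unrelated inputs should not produce the same answer.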

Agents optimize for what you said. The whole job of the human in the loop is to keep checking what you said against what you meant, because the agent can't see the difference and you're the only one who can.

Why I keep doing it anyway

This reads like I haven't learned the lessons I keep writing about. So, yes and no? The agents did weeks of real work in days. The audit agent caught real bugs the tests missed. The infrastructure is solid. I would not give that back.

But there's a reason the model is "I direct, they generate" and not "they build, I watch." Direction is not a one-time instruction you hand off. It's the continuous act of holding the work up against intent and saying "close, but that's not it." The agents are extraordinary at "exactly what you asked." Knowing what to ask, and noticing when the answer is technically perfect and quietly wrong, is the part that's still mine.

The deploy succeeded. Not the deployment I thought it was. And now I know to look twice.

All four of these are from building Agentis Lux, an agent-readiness scanner. Yes, a tool that tells other people what agents can't read shipped a stub, hid a broken tab, and rendered an empty card. It's in the open if you want to watch me keep catching myself: https://github.com/earlgreyhot1701D/perseus-clew

AI assisted. Human approved. Powered by NLP.

Built with Kiro, Claude, and a lot of looking at the actual output.

Predictions first, data later: seven hot takes on AI agent readiness before I scan 50 sites.

L. Cordero — Wed, 17 Jun 2026 04:08:50 +0000

Can a robot read this? Asking for a friend.

A few months ago, for a hackathon, I built a small tool that checked whether an AI agent could read a website. Deterministic checks underneath, an AI reasoning layer on top. It worked. I called it Hermes Clew.

I could have left it there. The hackathon ended and it was only a proof of concept. But the idea didn't leave. The more agents showed up at the products people ship, the more it looked like the question of the next few years. Not "can a person use this." Can an agent.

So Hermes became two things: Perseus Clew, the engine, and Agentis Lux, the product. (Latin, roughly "light of the agent." It shows you what the agent sees.) I laid out the thesis and the build in my last post. This is the part I promised to come back for.

Consider it my flag in the moon sand. I think agents are the question. And I'm willing to write down what I expect to find before I have any data to hide behind.

Here's the bet.

I'm pre-registering my predictions (i.e. I'm playing myself!)

Before the engine has scanned a single site, I've written down what I think it will find and committed it to the repo with a timestamp.

Scientists call this pre-registration. I call it no take-backs.

The reason is simple. When the data comes in, it is easy to look at the numbers, find the pattern that flatters the project, and write the post as if that is what I expected all along. Writing the predictions down first makes that impossible. If I'm right, I predicted it. If I'm wrong, the miss stays on the page next to the result. Either way you can check my work.

That matters more than usual here, because scoring tools have a reputation for marking their own homework. The fix is to make everything inspectable, including the part where I could fool myself.

The thesis it all hangs on

The web was built by humans, for humans. Search crawlers and screen readers got partial accommodation later. Agents are showing up to a playground nobody built for them.

So I expect most sites to have gaps in what an agent can read and do. And I expect those gaps to land in predictable places: wherever no human incentive paid for machine readability. A site tends to be readable to an agent by accident, wherever something else (search ranking, accessibility law) already forced clean structure. The rest is a blind spot.

That is the engine under every prediction below.

The predictions

Once the engine is live I'm scanning 50 sites: ten each across e-commerce, SaaS, content and media, government, and indie/builder projects. Chosen by maximum-variation sampling, documented site by site in the repo, to span size, platform, and build type. Not at random, and not by the score I expected.

Each prediction has a confidence tag and a line saying what would prove it wrong, because a prediction you can't lose isn't a prediction.

1. The rendering cliff is the deciding line. (Pretty sure. The one I'd bet money on.)
Sites built as heavy client-side JavaScript apps will be hard for agents to read, no matter what kind of site they are. Most AI crawlers don't run JavaScript. Vercel's network data, across hundreds of millions of crawler fetches, found no JavaScript execution at all. If your content only appears after the JS runs, the agent sees an empty shell.
Wrong if: client-heavy sites score about the same as server-rendered ones on whether the content is in the HTML.
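A rough way to see the cliff for yourself: pull the text out of the raw HTML without running any script, which is approximately what a non-rendering crawler gets. This is an illustrative sketch using the standard library, not the engine's actual check.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text from raw HTML, ignoring script, style, and template content."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def text_without_js(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


shell = '<html><body><div id="root"></div><script>render("Hello")</script></body></html>'
server = "<html><body><h1>Pricing</h1><p>Plans start here.</p></body></html>"
print(repr(text_without_js(shell)))
print(repr(text_without_js(server)))
```

The client-rendered shell yields an empty string. The server-rendered page yields its heading and paragraph.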

2. Government beats startups. (Pretty sure.)
US government sites will be easier for agents to read than small indie and startup sites. Not because anyone in government set out to court agents, but because federal law (Section 508) forces clean, labeled, semantic markup, and that same structure is what an agent parses. Regulation made them accidentally agent-ready. I'm keeping this run US-only on purpose, so the 508 mechanism is the thing being tested and not a mix of different countries' rules.
Wrong if: US government sites land at or below indie sites on semantic structure.

3. Structured data is a commerce-and-media thing. (Maybe, leaning pretty sure.)
The machine-readable labels that tell an agent what a page is will show up mostly on shopping and news sites, and be close to absent on government and indie sites. Search ranking is the only incentive that paid for them.
Wrong if: it's spread evenly, or shows up where there's no search incentive.

4. E-commerce is the widest spread. (Maybe.)
Online stores will have the biggest internal range of any group. The platform mix runs from templated stores to fully custom JavaScript storefronts, so some will be clean and some an agent can't read.
Wrong if: store scores cluster tightly, or another group spreads wider.

5. Some sites lock the gate by accident. (Coin toss.)
More than a couple of sites will block agents at robots.txt, so the agent doesn't reach the page at all. I think most of this is unintentional, a default or a blanket rule, not a decision to keep agents out.
Wrong if: almost nobody restricts agents, or the ones who do clearly meant to.
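Checking the gate takes only the standard library. The sketch below parses robots.txt text directly; the two sample files are made up, and GPTBot is used only as an example user-agent token. A blanket rule versus a named-agent rule is a rough proxy for accident versus intent, not proof of either.

```python
from urllib.robotparser import RobotFileParser


def agent_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether this agent may fetch this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A blanket rule that shuts out every crawler, agents included.
blanket = "User-agent: *\nDisallow: /\n"

# A targeted rule that names one agent and lets everyone else through.
targeted = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"

page = "https://example.com/pricing"
print(agent_allowed(blanket, "GPTBot", page))
print(agent_allowed(targeted, "GPTBot", page))
print(agent_allowed(targeted, "SomeOtherBot", page))
```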

6. Scores will be all over the map. (Pretty sure.)
Overall, scores will spread wide rather than cluster, because there is no settled standard for agent readiness yet. The rules are months old, and how many of us outside developer-tool companies have adopted them? When the rules are this new, sites can't have converged on them.
Wrong if: scores cluster tightly in one band, which would mean sites are already converging without trying.

7. Good spec, wrong doorstep. (Already seen it, so not a blind call.)
Unlike the other six, this one isn't blind. I'd already seen it while picking the sites. Nine of the ten specs I confirmed live in a GitHub repo, not on the company's own site. One serves it from its own domain. So I'm logging it as an expectation I already have reason to hold, not a bet I placed before seeing anything. The pattern: companies that ship an API tend to publish a spec, but not where an agent would look for it. The agent arrives at the front door, where it would actually check, and finds nothing. The spec exists, just somewhere the agent would never think to look. Great spec, wrong doorstep.
Wrong if: more than one or two of the ten turn out to expose their spec at a discoverable path on their own domain after all.

How I'll know if this is interesting

I decided this before seeing any numbers, on purpose, so I can't talk myself into a story later. Both outcomes are findings. They just land at different volumes.

Loud: scores spread out, the groups look clearly different from each other, and at least one result surprises me (government beating startups would do it). Spread plus a surprise is a story that carries the post on its own.

Quiet: everything clusters in one band and no group stands apart. That's not a dud. It's a finding: sites haven't converged on agent-readiness yet, and here's the baseline that says so. Quieter post, real result, and the next scan has something to measure against.

The one thing I won't do is manufacture variance that isn't there. If the data comes back flat, I report it flat. This is a first reading of something six weeks old. A baseline can't fail by coming out flat. It can only fail by lying about its shape.

What this is not

Fifty sites, ten per group, is illustrative. It is not a representative sample of the whole web, and the writeup will say so. It shows patterns and examples, not statistical proof.

I'm scanning the public surface only. For a SaaS product that means the marketing site, the docs, and the API spec, not the app behind the login. Agents meet your product at the public doors long before any login. I'm scanning the doors.

And I score what exists. A site with no forms doesn't get marked down for forms. So I'll compare group to group one category at a time, like to like, not on one combined number.
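"Score what exists" can be sketched as a scoring function where a check that does not apply is recorded as None and left out of the denominator. The check names are hypothetical and the real scoring may weight things differently.

```python
from typing import Optional


def category_score(checks: dict) -> Optional[float]:
    """Percentage of applicable checks that passed; None if nothing applied.

    Each value is True (passed), False (failed), or None (not applicable).
    """
    applicable = [passed for passed in checks.values() if passed is not None]
    if not applicable:
        return None
    return 100 * sum(applicable) / len(applicable)


# A site with no forms: both checks are not applicable, so there is no forms score.
no_forms = {"inputs_labeled": None, "submit_is_button": None}

# A site with forms where one of two applicable checks passes.
some_forms = {"inputs_labeled": True, "submit_is_button": False, "captcha_alt": None}

print(category_score(no_forms))
print(category_score(some_forms))
```

Returning None instead of zero is what keeps a form-free site from being marked down, and it is also why comparisons have to go category by category.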

I'm scanning my own front door too

Two of the fifty are Dev.to and Devpost. The place you're reading this, and the place I'm submitting it. They're in the content group, scanned like everything else. I'm interested to see how my favorite platforms read to an agent, and that goes in the writeup with everyone else's.

And the tool has to pass its own scan. Agentis Lux gets pointed at its own site with the same checks I'm aiming at everyone else. I'm running it twice: once now, before the benchmark, and again after, so you can watch my own front door change. No sparing the house of its own inspection.

What's next

The engine is built, tested, and merged: the six frontend checks, the six API checks, the scan handler, the agent-simulation layer, and the batch engine that runs all 50 sites. The public scan reads the frontend. The API checks run inside the benchmark. That split is the point: it's the doors idea, built into the architecture. The last step before the scan runs is getting it deployed to production. You can follow the build if you want to watch it happen. When the scan lands, I'll post the results right next to these predictions, the hits and the misses both.

Until then, the bets are in writing. They're timestamped in the repo, and when the data lands they'll be sitting right where I left them, right or wrong. I built the first version of this idea months ago, and I've been wrong plenty. If you've got a hunch about which one breaks first, well, ready to place your bet?

AI assisted. Human approved (all bets are mine). Powered by NLP.

I created Agentis Lux for the purposes of entering H0 Hackathon (Vercel + AWS Databases). #H0Hackathon See Agentis Lux's Devpost.com entry.

AWS Summit Los Angeles 2026: Why Am I Always Learning the Hard Way?

L. Cordero — Mon, 15 Jun 2026 01:36:47 +0000

I walked into the Kiro lab ready to learn.

I'd been building a web app (coming soon!) with Kiro for weeks. Next.js on Vercel, API routes talking to DynamoDB, Bedrock handling the AI layer. An H0 hackathon submission with 15 days left on the clock. By then Kiro and I had a rhythm, so the lab wasn't a rescue but a resource. I signed up because I was curious. One saying I keep close: you don't know what you don't know. I went to sharpen how I build, not to confirm I already had it down.

Clayton Markos, an AWS Senior Technical Instructor, was running it. The session had one goal: spec-driven development. And this was new ground. I'd never had Kiro start a project. I scaffold it myself, then bring it in. Letting Kiro generate the structure first was something I hadn't done.

The task was a weather app. Ninety minutes, build it, deploy it. I was confident I could get it done. Nervous but confident. Building under pressure like this is new to me. 90 minutes? Let's go.

Then I watched how the build was supposed to go.

Spec first. Not a vague prompt and a prayer, an actual spec. The what, the constraints, the boundaries, written down before Kiro touched a line. Then Kiro works inside that.

I'd been doing the upfront work. Just not in the shape Kiro wants it. I build with Claude as my architecture and build assistant, so I had docs. Plenty of them. But they were Claude docs. Reasoning and notes written for me to read, not specs written for Kiro to build from. Kiro's strength is spec-driven. I'd been handing it Claude-shaped prose and asking it to infer the spec. In some places that worked. In others it drifted, because I'd given it room and no edges.

I built the weather app. Deployed it by the end of the lab. It shipped, same as my builds always ship. Thankfully. But I didn't come to ship a weather app. I came to learn. And the lab handed me the question underneath all of it. Am I using these tools to their full capability? Do I understand how they work? Am I building so my projects can succeed?

The spec is a tension, not a setting

The lab reframed how I get the most out of Kiro. It isn't tighter control or looser reins. It's the spec, and how I develop it. That's one of the levers that decides how well the project holds up.

Too rigid, and Kiro has no room to make a good call. You've pre-decided everything, including the parts you shouldn't have, and now it's a very expensive autocomplete. Too loose, and it drifts. It fills the gaps with its own guesses and you spend your time pulling it back.

The sweet spot is narrow. The spec defines the what and the constraints. Kiro decides the how. I already had a version of this in my steering doc, a rule that says propose before you build, ask before you assume. I just hadn't connected it to the spec the way the lab did.

The learning curve tax

I'm self-taught. My first prototype was a jury eligibility chatbot, and I started it before I knew what an API was. The whole time, one question. Can I make this work? Turns out I could. I demoed it to my boss at the time.

Not much has changed. I still pick up a tool by using it, usually with AI in the loop, usually inside something I've already shipped or am racing to ship. The understanding of how to guide the build shows up late, a beat after I needed it. Filed under my lessons learned doc for the next projects, the June Game Jam and Hack the Kitty.

That's the tax. You don't know what you don't know, so you can't plan around it. You build, and you let the gaps introduce themselves, one expensive and time-consuming surprise at a time.

And so far, I keep paying it. I haven't found a way to skip the tax or leap the learning curve, at least not one that's mine. Hence, the lab.

The rest of the day kept pointing at the same thing

The Anthropic talk, "Effective Context Engineering for AI Agents," was standing room only. Every seat gone, people on the floor along the walls because there was nowhere else to put them. Turns out context about context is in high demand! Worth it.

Jacqueline Garrahan, Technical Staff at Anthropic, framed the shift for me. Prompt engineering used to be two pieces: system_prompt and user_message. Write good instructions, get good output. Context engineering is the wider job: system_prompt, tool_definitions, retrieved_documents, tool_results. Everything the model can see before it answers. Your prompt is one input now, not the whole show.
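One way to picture that shift, as a sketch of my own rather than anything shown in the talk: put the named pieces in one structure and measure how much of what the model sees is actually the prompt. The field names follow the slide terms; the sample contents are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    system_prompt: str
    user_message: str
    tool_definitions: list = field(default_factory=list)
    retrieved_documents: list = field(default_factory=list)
    tool_results: list = field(default_factory=list)

    def sizes(self) -> dict:
        """Character count per section, as a crude stand-in for tokens."""
        return {
            "system_prompt": len(self.system_prompt),
            "user_message": len(self.user_message),
            "tool_definitions": sum(len(t) for t in self.tool_definitions),
            "retrieved_documents": sum(len(d) for d in self.retrieved_documents),
            "tool_results": sum(len(r) for r in self.tool_results),
        }

    def prompt_share(self) -> float:
        """Fraction of the context made up of the system prompt and user message."""
        sizes = self.sizes()
        total = sum(sizes.values())
        prompt = sizes["system_prompt"] + sizes["user_message"]
        return prompt / total if total else 0.0


ctx = Context(
    system_prompt="You are a careful build assistant.",
    user_message="Add a forecast endpoint.",
    tool_definitions=["get_weather(city) -> dict" * 10],
    retrieved_documents=["spec section text " * 50],
    tool_results=["{'temp': 71}" * 5],
)
print(f"prompt is {ctx.prompt_share():.0%} of the context")
```

With tools and documents attached, the hand-written prompt is a small slice of the whole.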

Half of what I'd picked up in the Kiro lab was the same idea. What you feed the tool decides what it hands back.

After the "Prompt to Production: AWS Database Integration in Vercel v0" presentation, I did something that is not like me.

Hedieh Zandi, a Vercel Product Lead and an H0 sponsor, had just presented. The stack she walked through was the one I'd built my submission on. Next.js on Vercel, API routes straight to DynamoDB, Bedrock for the AI layer. So I walked up and introduced myself. Told her I'd entered the hackathon she'd just been presenting on. That I was watching her present the stack I built on.

This was the first time I'd explained one of my projects out loud to people who do this for a living. I was nervous. The kind of nervous where you hear your own voice and it sounds like someone else's. I did it anyway, and I'm glad I made myself. One hurdle down.

So why the hard way

And that's the question I keep landing on. I don't have a buttoned-up answer, and that bums me out a little.

Here's what I do have. AI-assisted coding, vibe coding, whatever you want to call it, has moved fast. I started by typing "build a jury chatbot prototype (no mistakes, lol)" into a chat box. Now I'm doing end-to-end spec-driven development. That's not a learning curve. It's closer to a free fall. You learn on the way down, and it leaves scar tissue. It's just how I've done all of it.

Labs like "Structured Approach to AI Coding with Spec-Driven Development on Kiro" are the net. They find the blind spots I can't see from inside my own workflow. They humble me. Then they hand me enough to build the next thing with a little more confidence than the last.

I've still got 15 days until submission. Back to the spec.

AI Assisted. Human Approved. Powered by NLP.

An Instagram ad promised me a free AI course. Was it a scam?

L. Cordero — Sun, 07 Jun 2026 16:50:44 +0000

A free AI course promoted on Instagram? I almost scrolled right past it.

The algorithm dropped it in my feed, and my instinct was the one I've built up over years of dodging nonsense online. Is this real, or is someone about to take my information?

Then I read the pitch. A free, one-week AI literacy course for any "American worker," taught entirely over text. "No laptop or internet needed. Just your phone." Well. I'm an American worker, I want to learn about AI, and free is very much my price point. Real or fake, I clicked anyway.

So what is it?

It's an initiative called AI Ready from the U.S. Department of Labor, and the lessons call themselves your AI 101 course. One week, ten minutes a day, delivered entirely by text message. You text READY to 20202 and it runs like a text thread with a patient teacher. A short lesson arrives, it asks you a question, you reply, and the next piece comes back. That's the whole setup. No app to download, no account to build, no laptop required.

My skeptic still wasn't satisfied. Why is the DOL handing this out for free? Why hadn't I seen a single piece of media about it? I went digging before I trusted it. It checked out, so I signed up.

A quick note for my dev.to fam outside the US

Some of you are reading from outside the US, and I want to flag this early. Enrollment runs through a US text number, so this program may only work inside the US. I haven't tested it on an international phone, so check for yourself before counting on it. Even if it isn't available where you are, you might know someone it would reach, and a free, text-only on-ramp is a model worth seeing wherever you build.

What won me over

The first lesson landed while I was at work. A gif popped up on my screen, which is a fun way to get someone's attention, and I'll admit it worked.

The format takes the pressure off. Instead of a blank box waiting for you to know what to say, you answer with A, B, or C. Go quiet for a few hours and it sends a gentle nudge to check in. And when my first scheduled time didn't work, I used the chat to reschedule. Small thing, but it told me someone designed this with the user in mind.

The lessons themselves are short and easy to follow. Bite-size, not overwhelming, which is the part I think a beginner would appreciate most. And the topics are useful right off the bat. Lesson 4 was "the recipe for a great prompt." Lesson 5 was "put AI to work for you." Useful, not abstract. Somewhere in the week I stopped being suspicious and started being impressed.

The last message doesn't just say goodbye. It hands you links to keep going, with starter courses from AWS, OpenAI, and Microsoft, plus a career explorer. That part may matter most to new learners who want to keep the momentum going.

What it isn't

There's a ceiling here. One week at ten minutes a day will not make anyone an AI practitioner, and it won't replace building something with your own hands. It's a door, not a destination. The goal is to get someone through the door without scaring them off, and for that it works.

It also runs through a third-party platform called Arist, and the course tells you your number is used only for the course and not sold. I always check the privacy language before I sign up, and this one names it plainly. I'd tell a family member the same thing.

Pass it on

So here's why I'm sharing a beginner course with a room full of builders.

Through your wonderful articles, I've learned that most of us in this community are past the beginner stage. We're building, shipping, breaking things, writing about it. We may not be the audience for a ten-minute intro course. But every one of us knows someone who is. A parent, a cousin, a coworker who keeps saying they're "behind on AI" and feels it every time they open their phone.

There's so much noise pointed at people right now. Paid bootcamps, breathless ads, the steady message that they've already missed the train. A free course that lives in their text messages and asks for ten minutes a day is about the least intimidating on-ramp I've seen so far.

I think part of the work we're all trying to do is reaching back and bringing someone with us. Sharing the free thing. Lowering the barrier for the person who hasn't started. This is an easy one to pass along.

If that person is in the US, tell them to text READY to 20202, or send them to arist.link/aiready.

I was impressed. I think they will be too.