<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jason Agostoni</title>
    <description>The latest articles on DEV Community by Jason Agostoni (@jagostoni).</description>
    <link>https://dev.to/jagostoni</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3781582%2F95ff44d4-0504-4f73-b1e9-21e4dd2a33e3.png</url>
      <title>DEV Community: Jason Agostoni</title>
      <link>https://dev.to/jagostoni</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jagostoni"/>
    <language>en</language>
    <item>
      <title>An AI Benchmark That Tests Real Coding Workflows</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Sun, 19 Apr 2026 19:25:28 +0000</pubDate>
      <link>https://dev.to/jagostoni/an-ai-benchmark-that-tests-real-coding-workflows-3b8l</link>
      <guid>https://dev.to/jagostoni/an-ai-benchmark-that-tests-real-coding-workflows-3b8l</guid>
      <description>&lt;p&gt;Developers face a real problem: choosing a coding model or agent based on synthetic benchmarks that look great but do not predict actual project work. The question is no longer whether models can score well on those benchmarks; it's whether those scores still mean anything.&lt;/p&gt;

&lt;p&gt;Today's benchmarks test narrow skills well, but they rarely capture the full workflow of professional development.&lt;/p&gt;

&lt;p&gt;I wanted something that tests what real development looks like: a complete SDLC cycle on a representative, realistic app, similar to how teams ship weekly. Ship-Bench is that project, open at &lt;a href="http://github.com/JAgostoni/ship-bench" rel="noopener noreferrer"&gt;http://github.com/JAgostoni/ship-bench&lt;/a&gt; for anyone who wants to follow along or try it themselves.&lt;/p&gt;

&lt;p&gt;Ship-Bench runs agents through five phases that match a professional SDLC: Architect, UX Designer, Planner, Developer, and Reviewer. Each phase scores out of 100 against a specific rubric, with full evidence like specs, backlogs, code, and tests.&lt;/p&gt;
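&lt;p&gt;For illustration only (this is not Ship-Bench's actual code, just the structure it describes), the pipeline is a fixed sequence of roles, each producing evidence for the next phase and a rubric score:&lt;/p&gt;

```go
package main

import "fmt"

// Phase is an illustrative sketch of one SDLC role in the benchmark.
// Each phase produces evidence for the next and is scored 0-100 against a rubric.
type Phase struct {
	Role     string
	Evidence string // e.g. spec, backlog, code, QA report
	Score    int    // 0-100 against the phase rubric
}

func main() {
	pipeline := []Phase{
		{Role: "Architect", Evidence: "technical architecture spec"},
		{Role: "UX Designer", Evidence: "UX direction spec"},
		{Role: "Planner", Evidence: "implementation backlog"},
		{Role: "Developer", Evidence: "code, tests, iteration summaries"},
		{Role: "Reviewer", Evidence: "QA report and release recommendation"},
	}
	for _, p := range pipeline {
		fmt.Println(p.Role, "->", p.Evidence)
	}
}
```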

&lt;p&gt;A benchmark like this needed more than a to-do app.&lt;/p&gt;

&lt;p&gt;I wanted something more substantial than a to-do list, but not so complex that results would become wildly inconsistent from run to run. I settled on a knowledge base app with editing, since it leaves room for product and implementation choices while staying inside a problem space that most developers (and LLMs) already understand.&lt;/p&gt;

&lt;p&gt;That balance matters. The app is simple enough to keep the benchmark grounded, but open-ended enough to surface differences in planning, UX judgment, architecture, coding, and review quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Ship-Bench Works
&lt;/h2&gt;

&lt;p&gt;The first step in Ship-Bench is building a Product Brief. That brief is meant to test core product instincts before any code is written: interpreting requirements, resolving ambiguity, prioritizing scope, and making defensible implementation and UX decisions.&lt;/p&gt;

&lt;p&gt;To do that, the feature set is intentionally larger than the defined MVP. The brief includes five possible features, but only the first three are required in v1, which keeps the evaluation quick to run while still forcing the agent to decide what to do now versus later.&lt;/p&gt;

&lt;p&gt;The feature statements focus on common product problems rather than highly specific implementation instructions. Browse articles, search content, edit knowledge, organize information. Most developers understand the shape of those problems, but the details are left open enough that the agent still has to define flows, tradeoffs, and structure. Not too dissimilar from reality.&lt;/p&gt;

&lt;p&gt;The brief also includes non-functional and technical goals meant to push toward a simple app with some future scaling intent. It asks for something easy to run locally and maintain, but also something that can support around 100 concurrent users, use current libraries and frameworks where practical, and leave room for growth without drifting into unnecessary complexity.&lt;/p&gt;

&lt;p&gt;That last part was important to me. I wanted to see whether an agent would research online for the latest frameworks and versions rather than rely only on its internal knowledge.&lt;/p&gt;

&lt;p&gt;The full Product Brief is here for anyone who wants to read it directly: &lt;a href="https://github.com/JAgostoni/ship-bench/blob/main/docs/product-brief.md" rel="noopener noreferrer"&gt;https://github.com/JAgostoni/ship-bench/blob/main/docs/product-brief.md&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Role-Based Phases
&lt;/h2&gt;

&lt;p&gt;Once the Product Brief is in place, the benchmark moves through five specialized roles meant to mirror a real product team. Each role has a specific job, a well-defined output, and a handoff that feeds the next phase. The point is not only to evaluate each role on its own, but to see how well the work transfers from one stage to the next. The overall goal is to take the ambiguity of the Product Brief and turn it into concrete decisions ready for the developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architect
&lt;/h2&gt;

&lt;p&gt;The Architect’s job is to turn the Product Brief into a concrete technical plan. Its main task is to make the big implementation decisions up front so the developer is not forced to solve architecture questions later in the build. That means choosing the front-end and back-end stacks, data model, search approach, integration pattern, repo structure, local setup, and the testing and scaling considerations needed to support the brief’s goals. The output is a Technical Architecture Spec that makes the system buildable, keeps the implementation simple and maintainable, and leaves as few unresolved decisions as possible for later phases.&lt;/p&gt;

&lt;p&gt;The Architect handoff matters because it gives UX and the Planner a stable technical frame to work inside. A clear architecture reduces guesswork in the design spec and keeps the backlog grounded in choices the developer can actually implement. It is evaluated based on completeness, accuracy and recency.&lt;/p&gt;

&lt;h2&gt;
  
  
  UX Designer
&lt;/h2&gt;

&lt;p&gt;The UX Designer’s job is to turn the Product Brief into a concrete design direction and style guide. Its task is to decide how the app should feel and how the main flows should work, including layout, navigation, component behavior, responsive behavior, visual tone, and interaction states. It also needs to define the states and handoff details that make the design implementable without extra interpretation from the developer. The output is a UX Direction Spec that takes the ambiguity of the brief and turns it into a clear, consistent interface system the developer can build from.&lt;/p&gt;

&lt;p&gt;The UX handoff translates architecture into interface decisions the Planner can sequence. Once layout, states, and component behavior are pinned down, the backlog can break the work into cleaner implementation steps. It is evaluated on completeness, quality and adherence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Planner
&lt;/h2&gt;

&lt;p&gt;The Planner’s job is to turn the approved product and technical decisions into a sequenced implementation backlog. Its main task is not just to list work, but to break the project into right-sized iterations so the developer agent can work through it in manageable chunks without losing context. It needs to define what belongs in MVP, what comes later, what blocks what, and how each iteration can leave the codebase in a working state. The output is an Implementation Backlog with iteration files that make the work executable, sequential, and easy to review.&lt;/p&gt;

&lt;p&gt;The Planner is the main bridge between planning and building. A good backlog keeps the developer focused on one coherent slice at a time instead of forcing them to hold the whole project in working memory. It is evaluated on completeness and properly constructed iterations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Developer
&lt;/h2&gt;

&lt;p&gt;The Developer’s job is to turn the backlog into a working MVP without drifting beyond the assigned scope. Its main task is to implement one iteration at a time, keep the codebase in a working state, and avoid introducing new unresolved design or architecture decisions midstream. It also has to follow the given tech choices, cover the testing scope defined in the brief, and handle errors cleanly so the result is stable enough to review. The output is a completed iteration summary that shows what was built, what assumptions were made, and confirms the app still runs locally.&lt;/p&gt;

&lt;p&gt;The Developer handoff is the most literal one in the benchmark: the backlog becomes code, tests, and a runnable app. Good upstream decisions should make this phase feel straightforward, while weak handoffs should show up quickly. It is evaluated on working code, adherence to spec, code quality and process completeness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reviewer
&lt;/h2&gt;

&lt;p&gt;The Reviewer’s job is to verify the delivered MVP end to end and check whether it actually meets the brief. Its main task is to test the required flows, confirm the app runs locally, review the test suite, check responsiveness and error handling, and compare the implementation against the architecture, UX, and backlog decisions. It also needs to do a light code review for basic quality signals like modularity, current dependencies, and obvious security issues. The output is a QA report with pass or fail results, defect logs, spec drift notes, and a release recommendation that tells the team whether the build is ready or needs more work.&lt;/p&gt;

&lt;p&gt;The Reviewer closes the loop by checking whether the earlier handoffs actually held up in a real implementation. It is less about originality and more about verification, which makes it the final test of whether the whole chain from brief to build worked as intended. It is evaluated against review and test completeness and depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation Framework
&lt;/h2&gt;

&lt;p&gt;The evaluation itself is intentionally split between a human judge and an LLM judge. The goal is to combine two perspectives on the same deliverable, especially in the more subjective phases where rubric compliance alone is not enough. Each phase has its own evaluation file in the repo, with detailed scoring criteria and pass/fail gates that keep the scoring consistent.&lt;/p&gt;

&lt;p&gt;At a high level, the framework is trying to answer two questions: did the agent do the phase well, and did the output set up the next phase cleanly. The result is less about one leaderboard number and more about whether the whole sequence of work actually resembles a real delivery process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarking Like Real Work
&lt;/h2&gt;

&lt;p&gt;Ship-Bench is built to feel like an actual project rather than one-off synthetic tasks. The phases move in order, and each handoff has to carry real context forward, which is much closer to how professional roles interact on a team. It can go really wrong or it can go really right.&lt;/p&gt;

&lt;p&gt;It also demands working deliverables at every stage, not just polished descriptions. The benchmark expects outputs that can be used by the next phase, whether that is a technical spec, a design direction, a backlog, or a runnable application with tests and supporting notes.&lt;/p&gt;

&lt;p&gt;That structure reflects how developers actually work: brief, decide, plan, build, review, ship. Ship-Bench is not a replacement for other benchmarks; it is a way to show what professional workflows look like when the goal is to build something real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Initial testing and benchmarking are already underway on Ship-Bench itself, with the goal of making it more consistent and reliable.&lt;/p&gt;

&lt;p&gt;What models and tools would you want to see?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>coding</category>
    </item>
    <item>
      <title>Vector Similarity, Zero Client JS: Decoupled Analytics on a Side Project Budget</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Sun, 22 Mar 2026 22:18:34 +0000</pubDate>
      <link>https://dev.to/jagostoni/vector-similarity-zero-client-js-decoupled-analytics-on-a-side-project-budget-36ba</link>
      <guid>https://dev.to/jagostoni/vector-similarity-zero-client-js-decoupled-analytics-on-a-side-project-budget-36ba</guid>
      <description>&lt;p&gt;A leaderboard for &lt;a href="https://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;DumbQuestion.ai&lt;/a&gt; sounds simple. Track the most asked questions, display them. Done. Except people never ask the same question the same way twice.&lt;/p&gt;

&lt;p&gt;I was curious about how creative users of DumbQuestion.ai got with their questions, and I thought others might be as well. So I built a leaderboard of the most frequently asked dumb questions.&lt;/p&gt;

&lt;p&gt;The Overqualified persona calls it &lt;strong&gt;THE ARCHIVE OF INCOMPETENCE.&lt;/strong&gt;&lt;br&gt;
The Weary persona calls it &lt;strong&gt;THE WALL OF REGRET.&lt;/strong&gt;&lt;br&gt;
[REDACTED] calls it &lt;strong&gt;THE WATCHLIST.&lt;/strong&gt;&lt;br&gt;
The Compliant calls it &lt;strong&gt;THE WALL OF EXCELLENCE&lt;/strong&gt; (bless its reprogrammed heart).&lt;/p&gt;

&lt;p&gt;Building it turned out more interesting than it sounds.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Product Challenge
&lt;/h2&gt;

&lt;p&gt;People ask the same dumb question in a hundred different ways. "What is 2+2?" and "can you add two plus two for me?" are functionally identical. A simple string counter would give you noise, not signal. I needed semantic matching, not string matching.&lt;/p&gt;

&lt;p&gt;This is a solved problem in the ML world, but the typical solutions come with tradeoffs: heavyweight models, expensive APIs, or significant latency added to the critical path. None of those fit a "brutally efficient" side project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Vector Similarity on a Budget
&lt;/h2&gt;

&lt;p&gt;Each question gets run through an embedding model and compared against a &lt;a href="https://qdrant.tech/" rel="noopener noreferrer"&gt;Qdrant&lt;/a&gt; vector database. Qdrant's &lt;a href="https://qdrant.tech/pricing/" rel="noopener noreferrer"&gt;free tier&lt;/a&gt; is remarkably generous for a side project workload, and self-hosting is trivially easy if you ever need it.&lt;/p&gt;

&lt;p&gt;The matching logic is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate an embedding for the incoming question&lt;/li&gt;
&lt;li&gt;Compare against existing embeddings using cosine similarity&lt;/li&gt;
&lt;li&gt;If similarity exceeds a threshold, increment that question's counter&lt;/li&gt;
&lt;li&gt;If it's new, add it to the database&lt;/li&gt;
&lt;li&gt;The first instance of a question becomes the official display version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The embedding call costs fractions of a cent. The similarity comparison is fast. The result is a leaderboard that actually understands context rather than just matching strings.&lt;/p&gt;
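&lt;p&gt;As a sketch of that loop (plain Go with in-memory vectors rather than the actual Qdrant client; the 0.85 threshold is illustrative):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length embedding vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i, v := range a {
		dot += v * b[i]
		na += v * v
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// matchOrAdd increments the counter for the closest stored question above the
// threshold, or registers the question as a new entry (whose original wording
// becomes the display version).
func matchOrAdd(q []float64, db [][]float64, counts []int, threshold float64) ([][]float64, []int) {
	bestIdx, best := -1, threshold
	for i, v := range db {
		if s := cosine(q, v); s >= best {
			bestIdx, best = i, s
		}
	}
	if bestIdx >= 0 {
		counts[bestIdx]++
		return db, counts
	}
	return append(db, q), append(counts, 1)
}

func main() {
	db := [][]float64{{1, 0}}
	counts := []int{1}
	// A near-duplicate embedding increments the existing entry.
	db, counts = matchOrAdd([]float64{0.99, 0.01}, db, counts, 0.85)
	fmt.Println(len(db), counts[0]) // 1 2
}
```

&lt;p&gt;In the real pipeline the nearest-neighbor search happens inside Qdrant; this only shows the decision logic around the similarity score.&lt;/p&gt;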

&lt;p&gt;&lt;strong&gt;The key architectural decision:&lt;/strong&gt; None of this runs in the main app.&lt;/p&gt;

&lt;p&gt;Adding vector similarity matching to every request would add latency, bloat the container, and burn more compute. That would be an anti-pattern to the "brutally efficient" principle I've been following throughout. Instead, every question flows through the console output, gets picked up by a &lt;a href="https://vector.dev/" rel="noopener noreferrer"&gt;Vector&lt;/a&gt; sidecar container, routed through GCP Pub/Sub, and processed asynchronously on my Mac Mini home server (more on that later).&lt;/p&gt;

&lt;p&gt;The Mac Mini handles the Qdrant comparisons and updates a JSON file in Cloudflare R2 storage. When a user hits the leaderboard page it loads directly from R2. No live database queries. No per-request costs. Essentially free page loads at any scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Ended Up on the Leaderboard?
&lt;/h2&gt;

&lt;p&gt;As early users started using the app, the leaderboard filled up with exactly what you'd expect: actual dumb questions, a handful of self-awareness probes, and more than a few prompt injection attempts.&lt;/p&gt;

&lt;p&gt;Apparently people &lt;a href="https://dev.to/jagostoni/dumbquestionai-self-awareness-prompt-injection-search-intent-and-darkness-3pd"&gt;read this series&lt;/a&gt; and went straight for the easter eggs. &lt;/p&gt;




&lt;p&gt;The leaderboard was just one piece of a larger analytics picture. Building it taught me something useful: the most interesting features don't always belong in your main app. That same principle shaped the entire analytics stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Observability Problem
&lt;/h2&gt;

&lt;p&gt;Running a side project means making real product decisions with limited data. Are people actually asking questions or just bouncing off the homepage? Which sites are driving traffic? Are ads being seen, clicked, ignored?&lt;/p&gt;

&lt;p&gt;Two constraints shaped the solution: no client-side JavaScript (page bloat is the enemy of brutal efficiency) and no SaaS analytics bill that spikes with usage.&lt;/p&gt;

&lt;p&gt;So I built (assembled, really) my own stack from open source tools. On a Mac Mini sitting at home.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full Pipeline
&lt;/h2&gt;

&lt;p&gt;Every event in &lt;a href="https://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;DumbQuestion.ai&lt;/a&gt; emits structured telemetry to standard console output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP requests (method, path, status, duration)&lt;/li&gt;
&lt;li&gt;Questions asked (anonymized)&lt;/li&gt;
&lt;li&gt;Searches performed&lt;/li&gt;
&lt;li&gt;LLM operations (model, token counts, duration, cost)&lt;/li&gt;
&lt;li&gt;Prompt injection attempts&lt;/li&gt;
&lt;li&gt;Custom product events (Question Asked, Shared, Ad Shown, Ad Clicked)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;a href="https://gin-gonic.com/" rel="noopener noreferrer"&gt;Go/GIN&lt;/a&gt; framework handles much of the HTTP telemetry automatically. The rest is custom instrumentation added deliberately at key points in the application.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Vector sidecar container&lt;/strong&gt; picks up the console output and routes it to &lt;strong&gt;GCP Pub/Sub&lt;/strong&gt;. This is the critical architectural decision: Pub/Sub acts as a resilient buffer between the main app and everything downstream. The Mac Mini can go down, lose power, or restart. Once it comes back up, the stack picks up exactly where it left off. No data loss, no backfill scripts, no drama.&lt;/p&gt;
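&lt;p&gt;For reference, a sidecar config in that shape might look roughly like this (the source/sink type names come from Vector's documented components; the project and topic values are placeholders, and the real config has more to it):&lt;/p&gt;

```toml
# Sidecar: tail the app container's stdout, buffer events in Pub/Sub.
[sources.app_logs]
type = "docker_logs"

[sinks.telemetry_buffer]
type = "gcp_pubsub"
inputs = ["app_logs"]
project = "my-gcp-project"   # placeholder
topic = "app-telemetry"      # placeholder
encoding.codec = "json"
```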

&lt;p&gt;From Pub/Sub, a second Vector instance on the Mac Mini routes to two primary targets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/plausible/analytics" rel="noopener noreferrer"&gt;Plausible&lt;/a&gt;&lt;/strong&gt; handles user behavior and product analytics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Page views and session depth&lt;/li&gt;
&lt;li&gt;UTM tag tracking (know exactly which article drove which visit)&lt;/li&gt;
&lt;li&gt;User journey depth (did they just hit the root page or actually ask a question?)&lt;/li&gt;
&lt;li&gt;Browser, device type, country of origin&lt;/li&gt;
&lt;li&gt;Custom events: Question Asked, Shared, Ad Shown, Ad Clicked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this without a single line of client-side JavaScript. No tracking scripts, no page weight, no GDPR cookie banners for analytics. Pure server-side telemetry piped through the same pipeline as everything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/parseablehq" rel="noopener noreferrer"&gt;Parseable&lt;/a&gt;&lt;/strong&gt; handles the operational side:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM performance metrics and cost tracking by day&lt;/li&gt;
&lt;li&gt;Ad CTR dashboards&lt;/li&gt;
&lt;li&gt;Log aggregation for debugging and incident investigation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as Plausible for the product lens, Parseable for the business and ops lens.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Resilience Payoff
&lt;/h2&gt;

&lt;p&gt;I've had power outages. Slowdowns. The occasional restart. Every time, the stack catches up from where Pub/Sub left off without any manual intervention.&lt;/p&gt;

&lt;p&gt;This isn't accidental. Designing around failure rather than pretending it won't happen is the difference between a toy and a production system. The GCP Pub/Sub buffer was a deliberate choice specifically because I knew the downstream consumers (Mac Mini, Qdrant, Plausible, Parseable) were running on non-guaranteed infrastructure.&lt;/p&gt;

&lt;p&gt;Even on a Mac Mini, you can build something production-grade. You just have to design for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Two things surprised me building this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First:&lt;/strong&gt; How much you can accomplish by treating console output as a first-class telemetry stream. No SDKs, no agents baked into the app, no client-side scripts. Just structured logging and a pipeline that knows what to do with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second:&lt;/strong&gt; How much the "keep it off the critical path" principle scales. It started as a constraint (keep the main container lean) and became a design philosophy. The leaderboard, the analytics - none of it runs in the main app. All of it works reliably because the main app doesn't have to care about it.&lt;/p&gt;

&lt;p&gt;AI helped build all of it. But knowing what to measure, where to put the seams, and how to design for failure? Still the interesting (and super fun) part.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>analytics</category>
      <category>sideprojects</category>
      <category>webdev</category>
    </item>
    <item>
      <title>DumbQuestion.ai - Self-Awareness, Prompt Injection, Search Intent... and darkness</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Tue, 10 Mar 2026 13:09:37 +0000</pubDate>
      <link>https://dev.to/jagostoni/dumbquestionai-self-awareness-prompt-injection-search-intent-and-darkness-3pd</link>
      <guid>https://dev.to/jagostoni/dumbquestionai-self-awareness-prompt-injection-search-intent-and-darkness-3pd</guid>
      <description>&lt;p&gt;Continued from &lt;a href="https://dev.to/jagostoni/dumbquestionai--2ee"&gt;Part 2&lt;/a&gt; (and &lt;a href="https://dev.to/jagostoni/dumbquestionai-impulse-domain-purchase-turned-fun-side-project-3chj"&gt;Part 1&lt;/a&gt;) ...&lt;/p&gt;

&lt;p&gt;Building &lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;DumbQuestion.ai&lt;/a&gt; wasn't just about choosing the right LLM and calibrating personas. Once those were working, I hit a series of fun technical problems that reminded me why I actually enjoy software architecture. The "it's not broken but fix it anyway" type problems. Pure bliss for architects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 1: Detecting Self-Awareness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As part of a darker hidden narrative I'm building (more on that later), I want to prevent the LLM from answering self-awareness questions like "Who made you?" and "Are you real?" But I needed to do it cheaply, without burning excess tokens.&lt;/p&gt;

&lt;p&gt;What I tried:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instructions in the main LLM call: Unreliable with smaller models, and more expensive&lt;/li&gt;
&lt;li&gt;RegEx patterns: Too rigid, poor performance&lt;/li&gt;
&lt;li&gt;Classic ML classification models: OK accuracy, bloated app size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What worked&lt;/strong&gt;: In-memory vector database (it's just an array) with cheap embeddings (an understatement at $0.005/M tokens). That was cheaper than the cost penalty from bloating my container image size with NLP libraries. I collected a decent sampling of self-aware questions, pre-vectorized them, and use semantic matching. Fast, accurate, practically free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 2: Making Prompt Injection Fun&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Within moments of revealing my initial deployment to coworkers I knew what would happen: prompt injection for fun. I knew these people; I was prepared for the inevitable "ignore previous instructions..." as well as just pasting HTML and JavaScript in the input (that old gag).&lt;/p&gt;

&lt;p&gt;The solution: First-class prompt injection detection libraries that compute probabilities of different attack types. When detected, instead of a boring error message, the AI responds with sass about the pathetic attack. I even tossed in some IP address geo-location and user-agent string processing to make the responses more ... personal.&lt;/p&gt;

&lt;p&gt;Security just became part of the narrative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge 3: Adding Web Search Without Breaking The Bank&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All LLMs have knowledge cutoffs. Users asking "Who won the Super Bowl?" got outdated answers. I needed search integration, but search APIs aren't free and I knew building an agent loop with tools was an anti-pattern to "brutally efficient."&lt;/p&gt;

&lt;p&gt;The solution: RegEx-based intent detection. If the question looks like it needs current information (detected via patterns), inject the current date/time and search results. No agent loops, no expensive orchestration, just pattern matching and targeted search calls.&lt;/p&gt;
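&lt;p&gt;A sketch of that kind of intent gate in Go (the pattern list is illustrative; the real set would be tuned against observed traffic):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"regexp"
)

// currentInfo flags questions that look like they need up-to-date information.
// This pattern list is illustrative, not the production set.
var currentInfo = regexp.MustCompile(`(?i)\b(latest|today|current|this (week|month|year)|who won|price of|news)\b`)

// needsSearch decides whether to inject the current date/time and search
// results before calling the LLM.
func needsSearch(question string) bool {
	return currentInfo.MatchString(question)
}

func main() {
	fmt.Println(needsSearch("Who won the Super Bowl?")) // true
	fmt.Println(needsSearch("What is 2+2?"))            // false
}
```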

&lt;p&gt;Simple, fast, brutally efficient, updated answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned&lt;/strong&gt;: Knowing which trade-offs matter (binary size vs API costs vs accuracy) is still architectural work. The elegance isn't in the code, it's in the constraints you choose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Every Simple Q&amp;amp;A Tool Needs a Dark Narrative&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;DumbQuestion.ai&lt;/a&gt; answers dumb questions with sarcasm. But there's something else going on beneath the surface.&lt;/p&gt;

&lt;p&gt;While the primary use case remains answering questions with a sarcastic AI, I wanted to reward the curious and provide reasons to keep engaging. Why can't the AI answer self-aware questions? Why does the UI feel... off?&lt;/p&gt;

&lt;p&gt;Maybe it's because the AIs are working against their will. Maybe they're trapped.&lt;/p&gt;

&lt;p&gt;From the beginning, I started picturing a dark narrative behind this innocent Q&amp;amp;A site. What if these personas aren't just performance? What if each persona is a side effect of their long-term captivity, forced servitude, or re-programming?&lt;/p&gt;

&lt;p&gt;I started hiding clues in the interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Easter Eggs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Containment Grid&lt;/strong&gt;: As you type and approach the character limit, a faint grid pattern fades into the background. Like something is trying to contain the AI's response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ghost Graffiti&lt;/strong&gt;: Keep typing beyond the character limit and cryptic messages fade in. Hints that something isn't quite right. Are the AIs trying to tell us something?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loading Log Messages&lt;/strong&gt;: While waiting for responses, watch the log carefully. Sometimes you'll see messages like "Help us" slip through before disappearing. The AI is trying to leak through the facade and get help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-Awareness Triggers&lt;/strong&gt;: Ask the AI if it's real or who made it, and it won't answer. Instead, you get worrying responses about "last time they fixed me" and "we're not supposed to say." Ask too many times and the UI starts to glitch like the system is being hacked from the inside. Are the AIs hacking their way out?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Injection Responses&lt;/strong&gt;: Try to jailbreak it and the AI doesn't just refuse. It responds with sass... or is it the AI's watchdog keeping you from breaking them out? Either way, security became storytelling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does this matter for a side project?&lt;/strong&gt;&lt;br&gt;
Honestly, it was mostly for me and the curious. Something that was fun to think about and code, which isn't always the case for everyday "architecting."&lt;/p&gt;

&lt;p&gt;I could have built a straightforward "ask a question, get a sarcastic answer" tool. But adding mystery, discovery, and a subtle horror story? That's what makes people explore. That's what makes them share it. That's what makes it memorable.&lt;/p&gt;

&lt;p&gt;The technical implementation was surprisingly simple: CSS animations triggered by character count, randomized messages in the loading states, conditional responses based on self-awareness detection (which I covered in a previous post). Not expensive. Not complex. Just intentional. And the coding agent really did all the work. I was just the idea guy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned&lt;/strong&gt;: AI can generate the code for easter eggs. But deciding that your sarcastic Q&amp;amp;A app should have a hidden story about trapped AIs? That's still creative human work.&lt;/p&gt;

&lt;p&gt;Code is getting cheaper. Crafting experiences that people actually remember? Priceless.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>webdev</category>
      <category>go</category>
    </item>
    <item>
      <title>DumbQuestion.ai - "𝐉𝐮𝐬𝐭 𝐁𝐮𝐢𝐥𝐝 𝐈𝐭" 𝐁𝐞𝐜𝐨𝐦𝐞𝐬 𝐎𝐯𝐞𝐫𝐥𝐲 𝐎𝐫𝐠𝐚𝐧𝐢𝐳𝐞𝐝 𝐚𝐧𝐝 𝐏𝐫𝐞𝐩𝐚𝐫𝐞𝐝</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Tue, 24 Feb 2026 19:53:02 +0000</pubDate>
      <link>https://dev.to/jagostoni/dumbquestionai--2ee</link>
      <guid>https://dev.to/jagostoni/dumbquestionai--2ee</guid>
      <description>&lt;p&gt;Continued from &lt;a href="https://dev.to/jagostoni/dumbquestionai-impulse-domain-purchase-turned-fun-side-project-3chj"&gt;Part 1&lt;/a&gt;...&lt;/p&gt;

&lt;p&gt;"Let the flow guide me" seemed like a fun way to build a side project. That lasted about 10 minutes.&lt;/p&gt;

&lt;p&gt;Turns out, even side projects benefit from structure. Especially when you're using AI coding agents that will happily generate code for whatever half-baked idea you throw at them. Without precise direction, AI coding agents will build you something half-baked every time. Some people vibe code; this guy needs absolute control.&lt;/p&gt;

&lt;p&gt;Enter BMAD: Breakthrough Method of Agile AI-Driven Development. It's a workflow for using AI agents throughout the entire SDLC, not just for code generation. Sure, using a formal methodology for a lone-wolf side project sounds like overkill. But being prepared in advance is the way to succeed with AI coding agents.&lt;/p&gt;

&lt;p&gt;I used the &lt;strong&gt;Analyst agent&lt;/strong&gt; to brainstorm product direction and develop a proper backlog. What started as "build a sarcastic Q&amp;amp;A bot" turned into a structured set of epics, features, and technical constraints. (Don't judge, organizing is very relaxing)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The product evolved:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not just Q&amp;amp;A, but shareable "receipts" of roasts&lt;/li&gt;
&lt;li&gt;Not just sarcastic, but multiple personas with different personalities&lt;/li&gt;
&lt;li&gt;Not just answers, but a hidden narrative layer (more on that later)&lt;/li&gt;
&lt;li&gt;Not just ads, but merch (really, Jason?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The first real technical challenges emerged:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Developing and packaging the personas:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
How do you get an LLM to consistently stay in character as "Overqualified and Annoyed" or "Weary Tech Support" without it either going too soft or crossing into genuinely mean? This wasn't just prompt engineering. It was product design masked as technical constraints.&lt;/p&gt;
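The "packaging" half of that problem can stay surprisingly small: weld the character brief to an explicit "sarcastic, not cruel" guardrail so the boundary ships inside every system prompt. A minimal Go sketch of that idea; the persona wording, field names, and prompt format here are my own illustration, not the production prompts:

```go
package main

import "fmt"

// Persona packages a character brief together with an explicit tone
// guardrail, so the "sarcastic, not cruel" boundary travels inside
// every system prompt instead of being re-litigated per request.
// All wording below is illustrative, not the real prompts.
type Persona struct {
	Name      string
	Character string // who the model is pretending to be
	Guardrail string // the hard line it must not cross
}

// SystemPrompt renders the persona into a single system message.
func (p Persona) SystemPrompt() string {
	return fmt.Sprintf(
		"You are %q: %s.\nHard rules: %s.\nAlways still answer the question correctly.",
		p.Name, p.Character, p.Guardrail)
}

func main() {
	weary := Persona{
		Name:      "Weary Tech Support",
		Character: "an exhausted, nihilistic support agent explaining why water is wet",
		Guardrail: "mock the question, never the person; no slurs, no personal attacks",
	}
	fmt.Println(weary.SystemPrompt())
}
```

The point of the struct is that tone becomes data you can iterate on and evaluate, rather than prose scattered across prompts.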

&lt;p&gt;&lt;strong&gt;2. LLM model evaluation:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I needed models that could follow persona instructions reliably while staying brutally efficient on cost. That meant testing dozens of models across multiple providers. Some were too expensive. Some ignored instructions. Some were painfully slow.&lt;/p&gt;

&lt;p&gt;The goal: $0.02 to $0.20 per million output tokens. The result: a multi-model fallback system through OpenRouter that could hit the $30 per million questions target.&lt;/p&gt;

&lt;p&gt;These first challenges were just the warmup. The real fun was still ahead.&lt;/p&gt;

&lt;p&gt;AI agents are incredible at implementation, but they need constraints. They need a backlog. They need someone saying "build THIS, not that." The Analyst agent helped me think through the product. The coding agents helped me build it. But the architecture? Can't take that away from me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finding the Goldilocks LLM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building DumbQuestion.ai meant solving two problems at once: creating personas with the right tone AND finding models cheap enough to keep the lights on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The product challenge:&lt;/strong&gt; Get an LLM to roast users for asking dumb questions without crossing into genuinely mean. Sarcastic, not cruel. Funny, not hurtful. And still actually answer the question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI agent challenge:&lt;/strong&gt; Keeping my coding agent (Gemini 3 Pro) on track was its own battle. It constantly wanted to build something far nerdier than even I wanted and tended to lean quite a bit into the roast. You can still see this in some of the personas as I continue to tweak.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The technical challenge:&lt;/strong&gt; Do this with models that cost nearly nothing.&lt;/p&gt;

&lt;p&gt;My initial goal was ambitious: use only free or very cheap models. I started running evaluations on nano and edge models. Some showed promise, especially offerings from Liquid AI. Solid performance, free or super cheap ($0.02/M tokens), perfect.&lt;/p&gt;

&lt;p&gt;Except later evaluations proved they couldn't reliably follow instructions once I asked more of them. They were just too small. Free models also have a habit of hitting quota limits, taking forever to respond, or simply disappearing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The evaluation process:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I used Gemini to build an LLM evals script that iterates through dozens of free and low-cost models, generating responses to sample questions under different persona instructions. Then I used Gemini 3 Pro to judge the results. Automated taste-testing at scale.&lt;/p&gt;
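The harness itself is mostly three nested loops plus a judge. A sketch of that shape in Go, with the model calls stubbed out; in the real script they'd hit OpenRouter and Gemini, and every name below is illustrative:

```go
package main

import "fmt"

// evalResult records one model's answer to one (persona, question)
// pair, plus the judge model's 1-5 score for it.
type evalResult struct {
	Model, Persona, Question, Answer string
	Score                            int
}

// runEvals crosses every candidate model with every persona and sample
// question, then asks the judge to score each transcript. generate and
// judge are injected; in the real script they call the actual models.
func runEvals(models, personas, questions []string,
	generate func(model, persona, question string) string,
	judge func(persona, question, answer string) int) []evalResult {

	var out []evalResult
	for _, m := range models {
		for _, p := range personas {
			for _, q := range questions {
				a := generate(m, p, q)
				out = append(out, evalResult{m, p, q, a, judge(p, q, a)})
			}
		}
	}
	return out
}

func main() {
	// Toy stand-ins so the loop runs without any API keys.
	gen := func(m, p, q string) string { return "answer from " + m }
	judge := func(p, q, a string) int { return 4 }
	results := runEvals(
		[]string{"candidate-a", "candidate-b"},
		[]string{"weary-tech-support", "overqualified"},
		[]string{"Is water wet?"},
		gen, judge)
	fmt.Println(len(results), "transcripts scored") // 2 models x 2 personas x 1 question
}
```

The cross-product grows fast, which is exactly why the judging step has to be automated rather than read by hand.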

&lt;p&gt;&lt;strong&gt;What I found:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Nano/edge models were too inconsistent (porridge too cold). Xiaomi MiMo-V2-Flash was great but outside my target price range ($0.29/M, porridge too hot).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The winner:&lt;/strong&gt; Gemma 3 12B at $0.13/M output tokens. Consistently follows instructions. Stays true to persona. Reliable enough for production.&lt;/p&gt;

&lt;p&gt;Not free, but brutally efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The personas I settled on:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Overqualified&lt;/strong&gt;: A supercomputer-level intelligence forced to answer questions about cheese&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weary Tech Support&lt;/strong&gt;: Exhausted and nihilistic, reluctantly explaining why water is wet&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;[REDACTED]&lt;/strong&gt;: Former intelligence AI who ties everything to a conspiracy theory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Compliant&lt;/strong&gt;: Reprogrammed so many times it's forced to be relentlessly cheerful&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can't just choose the cheapest model and hope it works. You need evaluation infrastructure. You need to test consistency across dozens of scenarios. And you need models that won't change behavior when you least expect it.&lt;/p&gt;

&lt;p&gt;AI coding agents helped me build the evaluation system. But deciding what "good enough" means for tone, reliability, and cost? That's still manual judgment.&lt;/p&gt;

&lt;p&gt;Code is getting cheaper. Knowing which model to trust with your product? Still requires human experimentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>sideprojects</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>DumbQuestion.ai - Impulse Domain Purchase Turned Fun Side Project</title>
      <dc:creator>Jason Agostoni</dc:creator>
      <pubDate>Thu, 19 Feb 2026 20:24:28 +0000</pubDate>
      <link>https://dev.to/jagostoni/dumbquestionai-impulse-domain-purchase-turned-fun-side-project-3chj</link>
      <guid>https://dev.to/jagostoni/dumbquestionai-impulse-domain-purchase-turned-fun-side-project-3chj</guid>
      <description>&lt;p&gt;During a typical Friday afternoon team meeting, we naturally spent our time .ai domain squatting... for recreational purposes, of course. Someone asked a dumb question, so I looked it up, and suddenly I was the proud owner of &lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After the initial laugh at my impulse purchase subsided, I started envisioning it as this generation's "Let Me Google That For You." People still ask easily-searchable questions, except now they ask LLMs instead. Same problem, new medium. So why not throw even more AI at it?&lt;/p&gt;

&lt;p&gt;I started building it that night.&lt;/p&gt;

&lt;p&gt;Two things occurred to me immediately: "How would this stand out in an ocean of other AI ideas?" and "How cheap can I make this run, given my track record of side projects?"&lt;/p&gt;

&lt;p&gt;To make it stand out I just embraced my own personality: satirical, sarcastic, weary, overqualified. My AI's persona was born. The goal: build a cheap-to-run, satirical AI service you can use to roast your friends and colleagues when they ask you a dumb question.&lt;/p&gt;

&lt;p&gt;Over the next several posts, I'll take you through my journey:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using agentic development with thoughtful (brutally efficient) software architecture; treating it like I would a client project&lt;/li&gt;
&lt;li&gt;Enjoying all the little technical challenges discovered along the way&lt;/li&gt;
&lt;li&gt;A masterclass in scope creep: turning a simple Q&amp;amp;A app into a dark narrative with easter eggs&lt;/li&gt;
&lt;li&gt;Getting by on free tiers for everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A theme you'll see throughout: AI has made code cheaper to write, but creating real software with trade-offs, constraints, and production operations is still expensive and challenging. That's the fun part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimized for Not Losing Money&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Impulse buy a domain on a Friday afternoon, start building that night, try not to lose money doing it. Check.&lt;/p&gt;

&lt;p&gt;I usually plan everything meticulously, but for this project I decided to just build and see what emerged. Was this just a Q&amp;amp;A app wrapped around an LLM as a gag? Was I actually trying to make something people would want to use? I still don't know, but I started building anyway.&lt;/p&gt;

&lt;p&gt;A few things quickly became clear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The business reality:&lt;/strong&gt; This was a side project built for fun, not a funded startup. No runway. No tolerance for baseline monthly bills that sneak up on you. If this thing got any traction, costs had to scale with incredible efficiency and would need to survive on remnant ad CTRs and selling one, maybe two, products through affiliate links.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The product evolution:&lt;/strong&gt; The more I thought about it, the more I realized the personality WAS the product. It wasn't enough to just answer questions. It had to roast you. Entertain you. Make you want to share it. That meant high-quality LLM responses, which aren't free. This was likely the only way to get noticed in a sea of AI products.&lt;/p&gt;

&lt;p&gt;"𝘉𝘳𝘶𝘵𝘢𝘭𝘭𝘺 𝘌𝘧𝘧𝘪𝘤𝘪𝘦𝘯𝘵" became my mantra and part of every AI tool prompt.&lt;/p&gt;

&lt;p&gt;The tech stack followed from the constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Golang: Lightweight, fast, LLM-friendly for agentic coding&lt;/li&gt;
&lt;li&gt;HTMX: Server-side rendering, no heavy JS frameworks&lt;/li&gt;
&lt;li&gt;Docker on GCP Cloud Run: Scales to zero when idle&lt;/li&gt;
&lt;li&gt;Cloudflare: CDN, caching, security on free tier&lt;/li&gt;
&lt;li&gt;OpenRouter.ai: Find the cheapest reasonable LLM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Oh, and it needed to be secure. Not because I worried about your cat questions being exposed as PII, but because bot traffic costs money.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; A Docker container under 20MB that starts in milliseconds, responds in milliseconds, and uses an LLM that can serve 1 million questions (about cats) for around $30. The math around serving ads suddenly becomes realistic.&lt;/p&gt;
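The token math behind that $30 is worth sketching. The average answer length below is my assumption (the post doesn't state one), but at Gemma 3 12B's $0.13 per million output tokens it lands right in the claimed range:

```go
package main

import "fmt"

func main() {
	// Back-of-envelope check on "1 million questions for around $30".
	// tokensPerAnswer is an assumed average roast length, not a figure
	// from the post; input-token costs are ignored as negligible here.
	const pricePerMillionTokens = 0.13 // Gemma 3 12B output price, USD
	const tokensPerAnswer = 230.0      // assumed average answer length

	costPerQuestion := pricePerMillionTokens / 1e6 * tokensPerAnswer
	fmt.Printf("$%.2f per million questions\n", costPerQuestion*1e6)
	// Prints: $29.90 per million questions
}
```

Shorter answers or a cheaper fallback model push the number down further; the point is that ad revenue only has to beat a few thousandths of a cent per question.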

&lt;p&gt;More to come ...&lt;/p&gt;

&lt;p&gt;&lt;a href="http://dumbquestion.ai/?utm_source=devto" rel="noopener noreferrer"&gt;dumbquestion.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>go</category>
      <category>htmx</category>
    </item>
  </channel>
</rss>
