DEV Community: LyricalString

Solving the LLM Black Box Problem with Structured Reasoning

LyricalString — Mon, 11 May 2026 20:46:58 +0000

The "black box" problem in Large Language Models is often discussed as a philosophical hurdle, but for engineers building high-stakes vertical applications, it is a hard technical bottleneck. In domains like legal tech, medical diagnosis, or financial auditing, a correct answer without a verifiable trace is often as useless as a wrong answer.

Anthropic’s recent research, "Teaching Claude Why," addresses this head-on. It moves the conversation from simple Chain-of-Thought (CoT) prompting—where we simply ask a model to "think step-by-step"—to a more structured approach of training models to provide explicit, interpretable reasoning paths that are decoupled from the final output.

For anyone building AI infrastructure or specialized agents, this shift from mimicking reasoning to structuring it is the difference between a prototype and a production-ready system.

The Limitation of Standard Chain-of-Thought

Most developers are familiar with the CoT pattern. You append a prompt like "Let's think step by step" to your input, and the model generates a sequence of intermediate tokens before arriving at the conclusion. While this significantly improves performance on arithmetic and symbolic logic tasks, it suffers from two major structural flaws:

The Co-dependency Problem: The reasoning process and the final answer are often entangled in a single, continuous token stream. If the model makes a subtle error in step two, it will often "hallucinate" a justification in step three to maintain linguistic coherence with its own mistake. The reasoning becomes a post-hoc rationalization rather than a logical derivation.
Lack of Verifiability: Because the reasoning is just more text, there is no programmatic way to intercept, validate, or audit the logic mid-stream. You are essentially trusting the model to be right about its own process.

How "Teaching Claude Why" Changes the Architecture

Anthropic’s research explores a method to force the model to treat reasoning as a distinct, structured component of its inference process. Instead of treating reasoning as a byproduct of text generation, the goal is to train the model to produce a "reasoning trace" that follows specific logical constraints.

The core of this approach involves training the model on datasets where the reasoning steps are explicitly labeled and checked for logical consistency. This isn't just about more data; it's about a different loss function during training that penalizes logical leaps and rewards the explicit connection between a premise and a conclusion.

1. Explicit Trace Generation

In this framework, the model is trained to generate a structured trace. Think of this as a "logical scratchpad" that is separate from the final response. This allows the system to perform what we might call "Reasoning Interception." If an agent is processing a complex legal document, the system can pause after the reasoning trace is generated, run a symbolic checker or a second "critic" model against that trace, and only proceed to the final answer if the logic holds.

2. Reducing Rationalization

By training on datasets that specifically highlight common logical fallacies, the research aims to minimize the "hallucination of logic." When a model is trained to recognize that a specific step (e.g., "If A implies B, and B is false, then A must be false") is a required structural element, it becomes much harder for the model to skip that step or provide a nonsensical justification just to reach a target token.

Implications for Vertical AI: The Case for Legal Tech

If you are building in the legal or compliance space, this research is a roadmap. In these industries, "explainability" is not a feature; it is a requirement for adoption.

Consider a legal agent tasked with identifying conflicting clauses in a 100-page Master Service Agreement (MSA). A standard LLM might flag a conflict, but its "reasoning" might be a vague summary of the text. Using the principles from Anthropic's research, a specialized agent would:

Step 1: Isolate the Clause. Identify the specific text segments.
Step 2: Deconstruct the Logic. Extract the conditional logic (e.g., "If [Event X] occurs, then [Liability Y] applies").
Step 3: Compare via Formal Logic. Run the extracted logic against the conflicting clause to find the exact point of divergence.
Step 4: Generate the Trace. Produce a step-by-step breakdown that a human lawyer can audit in seconds.

This transforms the LLM from a "black box oracle" into a "transparent reasoning engine."

The Engineering Trade-off: Latency vs. Reliability

We must be realistic about the cost of this approach. Structured reasoning is not free.

Generating an explicit, high-fidelity reasoning trace increases the total token count per request. This leads to:

Higher Latency: More tokens mean more time-to-first-token and longer total generation cycles.
Increased Inference Costs: You are paying for the "thinking" tokens, not just the "answer" tokens.

However, for high-stakes applications, this is a necessary trade-off. In a chatbot for casual conversation, latency is king. In a system that determines whether a contract is legally binding or if a medical diagnosis is correct, reliability and auditability are the only metrics that matter.

Building the Next Generation of Agents

As we move from "Prompt Engineering" to "Context and Reasoning Engineering," the focus for infra builders will shift. We will stop obsessing over how to write the perfect instruction and start focusing on how to build the perfect environment for reasoning to occur.

This means:

Designing specialized "Critic" architectures: Using one model to generate the trace and a second, perhaps smaller/faster model, to verify the logical integrity of that trace.
Integrating Symbolic Logic: Combining the probabilistic strengths of LLMs with the deterministic strengths of traditional code (e.g., using the LLM to translate natural language into a formal logic language like Prolog or a custom DSL, then executing it).
Developing Audit-Ready Logs: Building observability pipelines that don't just log the input and output, but capture and index the intermediate reasoning traces for long-term debugging and compliance.

Anthropic’s research suggests that the path to truly autonomous, trustworthy AI isn't through bigger models alone, but through models that can actually show their work.

Original research via Anthropic: Teaching Claude Why

How I serve 12,237 law pages in 0.3 seconds with Astro and zero client JavaScript

LyricalString — Mon, 06 Apr 2026 10:16:53 +0000

Spanish law is public. Reading it shouldn't cost €200/month.

That's why I built Ley Abierta, an open source platform indexing every Spanish law from 1835 to today. 12,237 laws. 42,000 Git commits tracking every reform. Lighthouse score: 100 Performance.

Here's how it works.

The problem

Spain's official gazette (BOE) publishes legislation as XML. If you want the consolidated text of a law with all reforms applied, you either:

Read the BOE website (good luck navigating it)
Pay for Westlaw, Aranzadi, or similar services

There's no free, searchable, version-controlled source of Spanish law. So I built one.

Architecture overview

BOE API → Pipeline (Bun) → Git repo (Markdown) → Astro (SSG) → Cloudflare Pages
                         → SQLite + FTS5       → Elysia API  → Hetzner Docker

The same data lives in three places, each for a different reason:

Layer	Format	Why
JSON cache	12,231 JSON files	Pipeline source of truth
Git repo	Markdown + YAML frontmatter	Human readable, version control
SQLite	14 tables + FTS5 index	Fast queries, full text search

The website is fully static. The API handles search and the dynamic stuff (email alerts, omnibus detection). They deploy independently.

Astro content collections: 12K pages, one build

Each law is a Markdown file in a public Git repo (leyabierta/leyes). At build time, Astro checks out this repo and treats every file as a content collection entry.

The frontmatter carries all the metadata:

---
title: "Ley 35/2006, de 28 de noviembre, del Impuesto sobre la Renta de las Personas Físicas"
rank: "ley"
status: "vigente"
published_at: "2006-11-29"
jurisdiction: "es"
materias: ["IRPF", "Hacienda Pública", "Impuestos"]
reforms_count: 47
---

Astro generates a static HTML page for each entry. The build takes ~45-60 seconds for all 12,231 pages using Astro 6.1.1's queued rendering with 4-worker concurrency.

What comes out the other side is pure HTML on Cloudflare's CDN. There's nothing to hydrate, nothing to parse on the client. Load time is basically TCP overhead.

Performance:  100
FCP:          0.3s
LCP:          0.7s
TBT:          0ms

The daily pipeline

Every morning, a GitHub Actions workflow:

Discovers new laws from the BOE API
Fetches XML in parallel (6 workers, rate limited)
Parses metadata, articles, reform history
Commits each law as a Markdown file with the real publication date as the commit date
Pushes to the leyes repo
Triggers an Astro rebuild if anything changed

On Sundays, a full re-check catches updates to existing laws. Weekdays are incremental, only new publications.

The pre-1970 problem

Git stores dates as Unix timestamps. The oldest law in the database is from 1835. That's before Unix.

My workaround: commit date is set to 1970-01-02 (earliest safe date), but the real publication date lives in YAML frontmatter and a custom Git trailer (Source-Date: 1835-05-24). The web and API always use the real date. Git history shows the placeholder.

This affects ~334 laws. Not ideal, but it preserves the commit-per-reform model that makes git diff work across the entire corpus.

BOE API quirks (hard won knowledge)

A few things the documentation won't tell you:

Accept: application/json returns 400 on the /texto endpoint. You must parse XML.
limit=-1 silently caps at 10,000 results. Always paginate with explicit offsets.
The /analisis endpoint returns a subset of subject categories. For the full list, you need to scrape ELI meta tags from the HTML version.
Regional laws use IDs from regional bulletins (BOA, BOJA, DOGV), not BOE. Jurisdiction must be extracted from the ELI URL pattern (/eli/es-pv/ → Basque Country).

Full text search with SQLite FTS5

Search needs to be fast and accent insensitive ("politica" should match "política"). SQLite's FTS5 extension handles this natively.

The search index covers law titles, full text, and citizen friendly tags:

CREATE VIRTUAL TABLE norms_fts USING fts5(
  norm_id UNINDEXED,
  title,
  content,
  citizen_tags
);

Queries use a two-pass approach: title matches rank higher than content matches. Results are paginated with chunked ID filtering to avoid SQLite's variable limit on large result sets (splitting into 5K-item chunks).

The API (Elysia on Bun) exposes this as a REST endpoint with filters for rank, status, jurisdiction, and subject category. Swagger docs at api.leyabierta.es/swagger.

Omnibus law detection

An "omnibus" law bundles unrelated topics into a single piece of legislation. Governments use them to slip unpopular measures past public scrutiny, and in Spain it happens all the time. A tax reform hidden inside a natural disaster decree, that kind of thing. Nobody was tracking it, so I built a detector.

How detection works

If a law touches 15+ distinct subject categories (after filtering generic ones like "Public Administration"), flag it as omnibus
Extract the law's structure (titles, chapters, articles) and send to Gemini Flash
The model generates a label, headline, summary, article count, and a sneaked_in boolean for each topic

The sneaked_in flag is the interesting part. It catches topics that have nothing to do with the law's official title. Energy regulation buried in a social security update, that sort of thing.

{
  "topic_label": "Energía (medida encubierta)",
  "headline": "New renewable energy requirements",
  "article_count": 8,
  "sneaked_in": true
}

Cost: ~$0.01/day using Gemini Flash through OpenRouter.

Results are served via API, rendered on the /omnibus page, and available as an RSS feed.

Email notifications

Citizens can subscribe to topics they care about. When a law affecting those topics gets reformed, they get an email with a plain language summary.

The system is event driven:

Daily cron generates AI summaries for new reforms
Match subscriber topics against reform subjects
Send via Resend (transactional email)
Track in notified_reforms to prevent duplicates

Double opt-in uses HMAC-signed confirmation links. No authentication needed, subscriptions are managed by email token.

Things I'd do differently

SQLite from day one. I spent weeks querying the Git repo directly before accepting that Git is not a database. git log --grep is not a substitute for WHERE materia = 'IRPF' AND status = 'vigente'.

I also shouldn't have trusted the BOE API docs. They're incomplete and in some places just wrong. Would have saved time starting by scraping the endpoints and figuring out the actual behavior.

One Astro gotcha: content collections with 12K+ files will eat your memory during builds if you're not careful. Queued rendering in Astro 6 fixed this but I burned a few afternoons on OOM crashes before finding it.

Try it

Web: leyabierta.es
API: api.leyabierta.es/swagger
Code: github.com/leyabierta/leyabierta (MIT)
Laws: github.com/leyabierta/leyes (Public domain)

If you have ideas, spot bugs, or want to adapt this for your country's legislation, issues and PRs are welcome.

I'm Alex, a solo developer from Spain. You can find me on LinkedIn.

Building an AI Coworker That Asks Questions Instead of Guessing

LyricalString — Fri, 20 Mar 2026 11:47:23 +0000

You tell your AI coworker: "create a task for the new feature."

It creates the task. Assigns it to nobody. Sets priority to medium. Picks a random project.

Nothing is technically wrong. But everything is useless.

The AI didn't have context. And instead of asking, it guessed.

This is the default behavior of every LLM tool system I've seen. Missing parameter? Use a default. Ambiguous input? Pick the most likely interpretation. The AI never stops and says "hey, who should I assign this to?"

So I built a system that does exactly that.

The Design: AskUserQuestion as a First-Class Tool

The idea is simple: give the LLM a tool called ask_user_question that it can call like any other tool. Instead of creating a task, sending a message, or querying a database — it asks the human a question.

Here's the tool definition the LLM sees:

{
  name: "ask_user_question",
  description: "Ask the user a clarifying question with a rich interactive UI.
    Use when you need user input before proceeding. Supports free-text,
    single/multi-choice, and yes/no questions.",
  parameters: {
    question: "The question to ask",
    question_type: "free_text | single_choice | multi_choice | yes_no",
    options: [{ label: "Option A", description: "..." }],
    // Or for sequences:
    questions: [
      { question: "Who should own this?", question_type: "single_choice", options: [...] },
      { question: "What priority?", question_type: "single_choice", options: [...] },
    ]
  }
}

The LLM decides when to use it. Not the user. Not the system. The AI recognizes it's missing information and proactively asks before proceeding.

The AI isn't a chatbot waiting for input. It's an agent executing a task that chooses to pause because it needs clarification.

The Hard Part: Blocking Execution

When the LLM calls ask_user_question, the tool needs to:

Show the question to the user
Wait for their answer
Return the answer as the tool result
Let the LLM continue in the same execution context

Steps 1 and 4 are easy. Steps 2 and 3 are the interesting engineering problem.

The LLM is running inside a tool execution pipeline. When it calls a tool, the pipeline expects a synchronous result. But our "result" depends on a human doing something in a browser — which could take seconds or minutes.

The Redis Pub/Sub Parking Pattern

Here's how we solved it:

class AskUserQuestionService {
  async parkAndWaitForAnswer(
    questionId: string,
    questionData: StoredQuestion,
  ): Promise<QuestionAnswer | null> {
    // 1. Store the question in Redis with a 5-minute TTL
    await redisService.setState(
      `ask-question:${questionId}`,
      questionData,
      300, // 5 minutes
    );

    // 2. Create a dedicated Redis subscriber for this question
    const subscriber = createClient({ url: redisUrl });
    await subscriber.connect();

    const channel = `ask-question-answer:${questionId}`;

    try {
      // 3. Block until we receive the answer (or timeout)
      const answer = await new Promise<QuestionAnswer | null>((resolve) => {
        const timeout = setTimeout(() => {
          resolve(null); // Timed out
        }, 300_000); // 5 minutes

        subscriber.subscribe(channel, (message) => {
          clearTimeout(timeout);
          resolve(JSON.parse(message));
        });
      });

      return answer;
    } finally {
      await subscriber.unsubscribe(channel);
      await subscriber.quit();
      await redisService.deleteState(`ask-question:${questionId}`);
    }
  }
}

The tool execution blocks on a Promise that resolves when the user answers. Redis pub/sub acts as the bridge between the user's browser and the waiting tool.

When the user submits their answer, the API endpoint publishes to that specific channel:

async submitAnswer(questionId: string, answer: string | string[]) {
  // Validate caller matches the intended recipient
  const stored = await redisService.getState(`ask-question:${questionId}`);
  if (stored.workspaceId !== callerWorkspaceId || stored.memberId !== callerMemberId) {
    return { success: false, error: "Unauthorized" };
  }

  // Publish -> the waiting subscriber resolves -> tool returns -> LLM continues
  await publisher.publish(
    `ask-question-answer:${questionId}`,
    JSON.stringify({ answer, answeredAt: new Date().toISOString() }),
  );
}

The dedicated subscriber per question is important. You can't use a shared connection because Redis subscriptions are per-connection. Each pending question gets its own subscriber, its own channel, and its own cleanup.

Delivering Questions via WebSocket

The question needs to appear in the user's chat in real time. We use our existing WebSocket infrastructure to push a ask-user-question event that the frontend listens for.

On the frontend, when a question arrives:

A Zustand store maps conversationId to pendingQuestion
The chat input component is replaced with the question card
The user interacts with the card (selects options, types text)
On submit, a POST to /ai-ask-question/answer sends the answer back

The input replacement is key UX. The question doesn't appear as a message in the chat — it takes over the input area. This makes it clear that the AI is waiting for you, and you can't do anything else in that conversation until you answer (or dismiss).

Multi-Question Sequences

Sometimes the AI needs to ask multiple related things. Instead of calling the tool three times (which would show three separate cards), it can send a sequence:

ask_user_question({
  questions: [
    { question: "Which project?", question_type: "single_choice", options: [...] },
    { question: "Who should own it?", question_type: "single_choice", options: [...] },
    { question: "Any additional context?", question_type: "free_text" },
  ]
})

The user sees a paginated card with arrow navigation. Answers are collected locally and submitted all at once. The LLM receives all answers in a single tool result.

This is better than multiple tool calls because:

One round-trip instead of three
The user sees all questions upfront (progress indicator: "2 of 3")
They can skip questions they don't want to answer
The LLM gets all context at once, not incrementally

Mobile: Same Feature, Different Challenges

We built the same feature in React Native. Same WebSocket delivery, same Zustand store pattern, same question types. But mobile has its own quirks:

Keyboard management: the input replacement needs to handle the software keyboard showing/hiding
Haptic feedback: option selection triggers Haptics.impactAsync() for tactile confirmation
Scroll behavior: the question card needs to stay visible above the keyboard
Offline: if the user is offline when the question arrives, WebSocket reconnect needs to re-deliver

What the LLM Actually Receives

When the user answers, the tool returns a plain string. For single questions:

"Project Alpha"

For sequences, it's formatted as:

Q1 (Which project?): Project Alpha
Q2 (Who should own it?): Maria
Q3 (Additional context?): This is for the Q2 release

Simple text. No JSON. The LLM reads it naturally and continues its task with full context.

If the user times out (5 minutes), the tool returns:

{ "userAnswer": null, "timedOut": true, "message": "The user did not respond within the time limit" }

The LLM then decides what to do — usually it falls back to reasonable defaults and mentions it assumed values.

When Should AI Ask vs. Infer?

This is the real design question. You don't want an AI that asks about everything — that's worse than one that guesses.

Our heuristic: ask when the wrong guess has meaningful consequences.

Assigning a task to the wrong person? Ask.
Picking the wrong project? Ask.
Choosing between high and medium priority? Infer (low stakes).
Formatting a message slightly differently? Infer.

The tool description tells the LLM: "Use when you need user input before proceeding." The emphasis on "before proceeding" signals that this is for blockers, not preferences.

In practice, the LLM uses it about once every 10-15 tool calls. Just enough to be helpful without being annoying.

Security Considerations

Every question is scoped:

Stored in Redis with workspaceId + memberId
Answer submission validates the caller matches the stored recipient
Questions auto-expire after 5 minutes (Redis TTL)
All transport is via authenticated WebSocket
No question data persists after the flow completes

What I'd Do Differently

If I were building this again:

Batch questions more aggressively. The LLM sometimes asks one question, gets the answer, then realizes it needs to ask another. I'd add a system prompt nudge to gather all unknowns before asking.
Persistent questions. If the user closes the app and reopens, the pending question is gone. The Redis TTL is 5 minutes. For async workflows, this should be longer and stored in the database, not just Redis.
Question templates. The LLM generates the question text every time. Pre-defined templates for common patterns (assignee selection, project picker) would be faster and more consistent.

Wrapping Up

An AI tool system is incomplete if the AI can't ask questions. Every other tool (create task, send message, query data) assumes the AI has enough context. This one covers the case where it doesn't.

Small addition to the tool set. Big difference in how the AI actually works with you.

I'm building Trilo, a workspace that unifies tasks, chat, and notes for solopreneurs — with an AI coworker that actually understands your work. If you're interested in AI-powered productivity tools, let's connect.

Find me on LinkedIn or GitHub.