Kelechi Oliver A.

I Built Scroll: A Human-in-the-Loop Writing Assistant That Starts With What You Know

I got fed up with how AI-generated content strips the human feel out of my writing. More often than not, I'd end up starting over from scratch.

The problem was everything that came after: ripping out the generic intro, cutting the corporate filler, inserting the real story the tool had no way of knowing, and rewriting until the article sounded like me rather than a content agency intern. By the time I'd finished, the AI had saved me nothing. I would have shipped faster writing from scratch.

That wasn't a model quality problem. Claude, GPT-4, and Gemini are all capable. The issue was architectural: every one of those tools jumps straight to generating without asking who I am, what I actually built, or what I think about it. They fill that void with confident-sounding generalities, and I spend the next hour excavating them and replacing them with real substance. Every single time.

So I built Scroll — a human-in-the-loop content writer that doesn't start writing until it knows what I actually want to say.

The Actual Problem With Generic AI Writing Tools

Generic tools produce what I'd call "corporate neutral" content. It's grammatically clean, structurally sound, and completely devoid of the specific experience that makes engineering content worth reading. If I'm writing about a production incident where Lambda cold starts tanked our API latency, I need that article to include the actual numbers, the wrong hypothesis I had first, and the moment I realized what was happening. A tool that doesn't know any of that can't write that article. It can only approximate it, and readers can tell.

The domains where this matters most are the ones I care about: engineering, infrastructure, and technical post-mortems. These are high-context areas where vague claims get called out immediately, and personal specificity is what makes something worth sharing.

I wasn't looking to replace my writing. I was looking for a research assistant and a first-draft engine that writes in my voice because it starts by asking what I actually know.

How Scroll Works: Questions First, Everything Else Second

The human-in-the-loop part isn't a checkbox or a review step at the end. It's the foundation of the entire pipeline.

When you start a piece, the orchestrator generates a set of pre-writing questions tailored to the article topic — things like what prompted this, what you actually built, what surprised you, and what your honest take is. You answer them in the Chainlit chat interface, in as much or as little detail as you want. Those answers become the context that drives everything downstream: the research queries, the writing angle, the tone, what gets included, and what gets cut.

Workflow

The pipeline runs sequentially after that:

  1. The orchestrator collects your answers and extracts the research context
  2. It dispatches to a Researcher agent, which runs Tavily web searches and saves findings to content/research/slug.md
  3. It dispatches to a Writer agent, which drafts the article grounded in both the research and your actual answers
  4. Google Gemini generates a hero image
  5. Tavily runs a plagiarism check on the output
  6. Everything syncs to S3 and a summary comes back to you in the chat
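The sequence above can be sketched in a few lines. Every function here is a hypothetical stub standing in for a real agent call, not Scroll's actual internals:

```python
# Illustrative sketch of Scroll's sequential pipeline. The stage
# functions are placeholder stubs, not the real agent implementations.

def extract_research_context(answers):
    # Step 1: the orchestrator distills your answers into research context.
    return " ".join(answers.values())

def research(context, path):
    # Step 2: the Researcher (Haiku) would run Tavily queries and save here.
    return path

def write_article(research_path, answers):
    # Step 3: the Writer drafts from both the research file and your answers.
    return f"draft grounded in {research_path} and {len(answers)} answers"

def run_pipeline(answers, slug):
    context = extract_research_context(answers)
    research_path = research(context, f"content/research/{slug}.md")
    draft = write_article(research_path, answers)
    # Steps 4-6 (hero image, plagiarism check, S3 sync) omitted from the stub.
    return draft
```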

The key design decision is that the pipeline cannot produce authentic output without real answers. If you give it nothing, it has nothing to work with. That's intentional. The tool is only as good as the context you feed it.

Three Agents, Two Models, One Pipeline

Under the hood, Scroll runs three agents built with the Claude Agent SDK: an Orchestrator, a Researcher, and a Writer.

The Orchestrator runs on Claude Sonnet. It handles the conversation, generates the pre-writing questions, coordinates the pipeline, and makes the dispatching decisions. It's doing the expensive reasoning work, so Sonnet is the right call there.

The Researcher runs on Claude Haiku. This was a deliberate cost optimization. The Researcher's job is to take a set of search queries, run them through Tavily, and summarize the results into a structured research file. That's a lot of tokens for a task that doesn't need Sonnet-level reasoning. Haiku runs at around $0.80 per million tokens versus Sonnet's $3. For a tool I'm running at low traffic on a personal AWS account, that difference adds up. The mixed pipeline saves roughly 50-70% in model costs compared to running everything on Sonnet.

The Writer runs on Sonnet again. Final draft quality is worth paying for.

Each agent has its own tool set. The Researcher has access to Tavily search. The Writer has access to the file system for saving output. The Orchestrator dispatches to subagents and coordinates the flow. This keeps concerns separated and makes the pipeline easy to modify — if I want to swap in a different search provider or add a new output format, I change one agent's tools, not the whole system.
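That separation of concerns is easy to picture as a per-agent tool registry. This is a schematic sketch, not the Claude Agent SDK's API; the tool and model names are placeholders:

```python
# Hypothetical per-agent tool registry illustrating the separation of
# concerns. Tool and model names are placeholders, not real SDK values.
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    model: str
    tools: list = field(default_factory=list)

orchestrator = Agent("orchestrator", "claude-sonnet", ["dispatch_subagent"])
researcher = Agent("researcher", "claude-haiku", ["tavily_search"])
writer = Agent("writer", "claude-sonnet", ["write_file"])

# Swapping the search provider touches one agent's tool list only:
researcher.tools = ["bing_search"]
```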

The Interface: Chainlit

I used Chainlit as the chat interface. It's a Python framework that handles session management, authentication, async step tracking, and the streamed response UI out of the box. For a human-in-the-loop workflow where you need a polished conversational interface without building one from scratch, it was the right tool. Setup was fast, it plays well with async Python, and the step visualization made it easy to see where the pipeline was at any given point.

As of this writing, the app supports blog posts, LinkedIn posts, and Twitter/X threads. The pre-writing questions adapt to the content type.

Infrastructure: $6/Month and Scale-to-Zero

This is running on AWS, and the infrastructure is the part I'm most satisfied with.

Architectural Diagram

The architecture has three layers:

Routing layer: CloudFront sits in front of everything, handles TLS, and routes to a Lambda Function URL. The Lambda is the router — it's what decides whether the ECS Fargate task is awake and handles waking it up if it isn't.

Compute layer: The app runs in an ECS Fargate container. Fargate scale-to-zero is the core cost optimization here. A CloudWatch alarm watches for inactivity — 20 minutes without a request — and when it fires, it publishes to an SNS topic, which triggers a Sleeper Lambda that sets the ECS service's desired count to zero. The task goes to sleep, and you stop paying for it.
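A minimal sketch of what that Sleeper Lambda might look like. The cluster and service names come from the deploy command later in this post; the threshold mirrors the 20-minute alarm:

```python
# Sketch of a Sleeper Lambda triggered by SNS when the CloudWatch
# inactivity alarm fires. It scales the ECS service's desired count to 0.
IDLE_THRESHOLD_SECONDS = 20 * 60

def target_desired_count(seconds_idle: int) -> int:
    """Scale to zero once the inactivity window has fully elapsed."""
    return 0 if seconds_idle >= IDLE_THRESHOLD_SECONDS else 1

def handler(event, context):
    import boto3  # deferred so the pure logic above stays testable offline
    ecs = boto3.client("ecs")
    ecs.update_service(cluster="scroll", service="scroll", desiredCount=0)
```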

When a new request comes in, the Router Lambda detects that the task is down and spins it back up. Cold start is 30-60 seconds. For a personal content tool I'm using a few times a week, that's a perfectly acceptable tradeoff. At always-on Fargate pricing, this would cost $7-15 per day. At scale-to-zero, it's around $6 per month.
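The wake-up side is the mirror image. A sketch of the Router Lambda's check, again with the cluster/service names taken from the deploy command:

```python
# Sketch of the Router Lambda's wake-up path: if the Fargate task is
# down, bump the desired count back to 1 before proxying the request.
def needs_wake(running_count: int) -> bool:
    return running_count == 0

def wake_if_needed():
    import boto3  # deferred import keeps needs_wake testable without AWS creds
    ecs = boto3.client("ecs")
    svc = ecs.describe_services(cluster="scroll", services=["scroll"])["services"][0]
    if needs_wake(svc["runningCount"]):
        ecs.update_service(cluster="scroll", service="scroll", desiredCount=1)
        # The caller then polls until the task is healthy: the 30-60s cold start.
```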

Storage layer: SQLite with Litestream for continuous replication to S3. This was the key piece that made scale-to-zero actually viable for a stateful app. When the Fargate task dies and restarts, Litestream replays the WAL from S3, and the database is back where it left off in under a second. Without this, every scale-down event would lose session data. With it, the ephemeral compute and durable storage are cleanly decoupled.
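A Litestream config for this kind of setup is short. The database path and bucket name here are assumptions, not Scroll's actual values:

```yaml
# litestream.yml: continuously replicate the SQLite WAL to S3.
dbs:
  - path: /data/scroll.db          # assumed database path
    replicas:
      - type: s3
        bucket: scroll-litestream  # assumed bucket name
        path: scroll.db
        region: eu-west-1
```

On container start, `litestream restore` pulls the latest replica from S3 before the app opens the database, which is what makes the sub-second recovery work.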

All secrets go through AWS Secrets Manager. Docker images live in ECR. The whole infrastructure is Terraform.

Deploying looks like this:

```shell
cd infra
terraform init
terraform apply

ECR_URL=$(terraform output -raw ecr_repository_url)
aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin $ECR_URL
docker build --platform linux/amd64 -t $ECR_URL:latest ..
docker push $ECR_URL:latest
aws ecs update-service --cluster scroll --service scroll --force-new-deployment --region eu-west-1
```

Running it locally is simpler:

```shell
uv sync
cp .env.example .env
# Fill in: ANTHROPIC_API_KEY, GOOGLE_API_KEY, TAVILY_API_KEY, CHAINLIT_AUTH_SECRET
uv run chainlit run app.py
```

What I'd Do Differently

The pre-writing questions are generated fresh each time by the orchestrator, which means they vary in quality depending on how the model interprets the topic. Sometimes they're sharp and immediately useful. Sometimes they're a bit generic, and I have to mentally reframe them before answering. A better approach might be a curated question bank per content type that the orchestrator pulls from and adapts, rather than generating cold every time.
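That curated-bank approach could be as simple as a per-content-type dict the orchestrator adapts from. The questions below are my own guesses at what such a bank would hold:

```python
# Hypothetical curated question bank, keyed by content type. The
# orchestrator would adapt these to the topic instead of generating cold.
QUESTION_BANK = {
    "blog": [
        "What prompted this piece?",
        "What did you actually build or change?",
        "What surprised you along the way?",
        "What's your honest take that others might disagree with?",
    ],
    "linkedin": [
        "What's the one takeaway for a busy reader?",
        "What credibility do you have on this topic?",
    ],
    "twitter": [
        "What's the hook for the first tweet?",
        "What's the single strongest specific detail?",
    ],
}

def questions_for(content_type: str) -> list[str]:
    # Fall back to the blog set for unknown content types.
    return QUESTION_BANK.get(content_type, QUESTION_BANK["blog"])
```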

The plagiarism check via Tavily is also rough around the edges. It flags overlapping phrases rather than true plagiarism, and the signal-to-noise ratio isn't great. I keep it in the pipeline because something is better than nothing, but I wouldn't rely on it for anything high-stakes without a second review.

The cold start is real. Thirty to sixty seconds after inactivity feels fine when you know it's coming, but it's jarring if you forget and expect an instant response. A simple "waking up" message in the UI would help set expectations — that's on the list.

What HITL Actually Means in Practice

The framing I keep coming back to is that Scroll is a third category, not a point on a spectrum. It's not "AI writes" and it's not "human writes." The human validates at the moment that matters most, before the first word of the draft, and the AI handles everything downstream. The quality of the output is a direct function of the quality of your answers. Give it real experience and real opinions, and you get an article that sounds like you wrote it. Give it nothing, and the tool has nothing to work with.

That's actually a feature, not a limitation. The tool creates a forcing function for the most valuable part of content creation: figuring out what you actually think and why it's worth saying. Most writers, myself included, try to skip that step and end up editing endlessly to compensate.

If you're spending more time fixing AI-generated drafts than you would have spent writing them yourself, it might not be the model's fault. It might be that the tool started writing before it knew anything about you.
