In Parts 1-3, we covered Spec-Driven Development, the Station Station project, and the agent-os workflow. We saw a structured process that delivered 8 features, fully deployed and working. But I've been painting a rosy picture. Let me be honest about the challenges.
This part is about the real development experience: Where did the structured SDD approach actually help? Where did I still struggle despite having specs and tasks? What problems can structure solve, and what problems require good old-fashioned debugging?
If you're considering SDD for your next project, this is the part you need to read. Because understanding what structure can and can't solve is critical to setting realistic expectations.
Challenge 1: Cloudflare Authentication Bypass
Let me start with the most frustrating part of the entire project: getting authentication to work with the Myki portal.
The Context:
The whole project depends on accessing Myki transaction data. No authentication = no data = no project. This was the critical blocker. Everything else was blocked until this worked.
The Problem:
The Myki portal is protected by Cloudflare Turnstile, which actively detects and blocks headless browsers. My first attempt using standard Playwright headless mode failed immediately with the "Verifying you are human..." overlay blocking form access.
The Spec:
"Use Playwright to authenticate with Myki portal and extract session tokens for API calls."
Simple requirement, right? But the spec didn't capture the complexity of Cloudflare bot detection.
How SDD Helped
The structured approach provided a framework for tackling this beast:
Clear success criteria: "Extract Bearer token from authentication response" - The spec told the AI (and me) exactly what success looked like, even if neither of us knew how to get there yet.
Task breakdown kept me focused: Instead of having the AI try to solve "make authentication work" all at once, the tasks broke it down into pieces:
- Launch browser with Playwright
 - Navigate to login page
 - Fill in credentials
 - Submit form
 - Extract authentication tokens
 
When the AI's implementation of step 3 (fill in credentials) was blocked by Cloudflare, the task breakdown showed me exactly where the problem was.
Documentation of attempts: Each failed approach got documented in the spec as "out of scope" or "doesn't work because..." This prevented me from asking the AI to retry the same failed approaches days later.
What SDD Couldn't Solve
But here's the brutal truth: specs don't solve hard technical problems for you.
The AI tried implementing authentication in several ways based on the spec. Each attempt failed. Over two days, I kept iterating:
Attempt 1: AI implements standard headless Playwright
browser = playwright.chromium.launch(headless=True)
Result: Blocked by Cloudflare immediately. "Verifying you are human..."
Attempt 2: I ask AI to try headed mode (visible browser)
browser = playwright.chromium.launch(headless=False)
Result: Better, but still detected as automation. Random CAPTCHA challenges.
Attempt 3: I ask AI to try user-agent and header spoofing
context = browser.new_context(user_agent="...")
Result: Cloudflare is smarter than that. Still blocked.
Attempt 4: AI tries stealth mode plugins
Result: Helped a bit, but not consistently. Sometimes it worked, sometimes it didn't.
The Solution (After Two Days of Frustration)
What finally worked: Browser profile trust signals
After researching Cloudflare bypass techniques, I figured out the solution and told the AI to implement it:
# Create empty Chrome profile directory structure
profile_dir = create_empty_chrome_profile()
# Launch with profile - appears as "real" browser to Cloudflare
browser = playwright.chromium.launch_persistent_context(
    user_data_dir=profile_dir,
    headless=False
)
The profile directory contains just enough metadata (Cookies, Preferences, History files - all empty) to make Playwright look like a legitimate Chrome browser instead of automation.
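If you want to replicate the trick, here's a minimal sketch of what such a helper can look like (not the exact code from Station Station; which files Cloudflare actually checks is my assumption):
import tempfile
from pathlib import Path

def create_empty_chrome_profile() -> str:
    """Create a throwaway directory with just enough Chrome profile
    structure (empty Cookies, Preferences, History files) so the
    persistent context looks like a previously used browser."""
    profile_dir = Path(tempfile.mkdtemp(prefix="chrome-profile-"))
    for name in ("Cookies", "Preferences", "History"):
        (profile_dir / name).touch()  # empty placeholder files
    return str(profile_dir)
The file contents don't matter (they're empty); what matters is that a persistent context backed by a profile directory behaves like a real, previously used Chrome install rather than a fresh automation session.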
Why This Was So Frustrating
The spec couldn't help because:
- This required deep knowledge of browser fingerprinting and bot detection
 - Solutions aren't documented well (Cloudflare actively tries to prevent bypass)
 - Trial and error was the only way to find what worked
 - Each attempt took 5-10 minutes to test (AI implements → I run the code → see if blocked)
 
What I actually needed to do:
- Research how Cloudflare Turnstile detects automation
 - Learn that browser profiles affect fingerprinting
 - Try approach after approach until something worked
 - Debug headless browser issues by inspecting what Cloudflare was detecting
 
None of this came from the spec. The spec told the AI what to achieve. But figuring out how required me to research, experiment, find the solution, and then tell the AI to implement it.
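For contrast, the "what" from the spec (capture a Bearer token from the authentication response) is the mechanically easy part once the browser is trusted. A rough Playwright sketch, with the login URL, selectors, and response filter as placeholders rather than the real Myki portal details:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Reuse the profile trick from the working solution above
    browser = p.chromium.launch_persistent_context(
        user_data_dir=create_empty_chrome_profile(),
        headless=False,
    )
    page = browser.new_page()
    page.goto("https://example.com/login")        # placeholder URL
    page.fill("#username", "my-username")         # placeholder selectors
    page.fill("#password", "my-password")
    # Capture the authentication API response triggered by the submit
    with page.expect_response(lambda r: "auth" in r.url) as resp:
        page.click("button[type=submit]")
    token = resp.value.json().get("token")        # exact field is a guess
    browser.close()
All of the difficulty lived outside that snippet, in getting Cloudflare to let the form submit at all.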
The Takeaway
What SDD provided:
- Clear goal to work toward (extract Bearer token)
 - Focus on one step at a time instead of being overwhelmed
 - Documentation of failed approaches to avoid repetition
 - Motivation to keep going (this was Task 1 on a roadmap of 8 features - couldn't give up)
 
What SDD couldn't provide:
- Technical solution to Cloudflare bypass
 - Knowledge of browser fingerprinting
 - Shortcuts to avoid trial-and-error debugging
 - The actual working approach (that required research and experimentation)
 
Would specialized subagents have helped?
Honestly, I don't know—I didn't try them for this problem. Agent-os has advanced features like specialized research agents and orchestrated task execution that might have helped with researching Cloudflare bypass techniques. But I was using the standard workflow, so I can't say whether those advanced features would have shortened the two-day struggle.
Lesson learned:
SDD gives you structure to tackle hard problems systematically. But hard problems are still hard. Structure doesn't replace technical knowledge, research, and persistence. It just gives you a framework to keep trying without getting lost.
When I finally got authentication working after two days, having it documented as "Task 1: COMPLETE ✓" with detailed notes on the working approach was incredibly valuable. Future features could reference "see Task 1 for Cloudflare bypass pattern." Without that documentation, I might have forgotten the solution by the time I needed to debug it again.
Challenge 2: The Multi-Layer Bug
After finally getting authentication working, I ran into a different kind of frustration: the manualAttendanceDates feature bug.
The Context:
This requirement came much later, after the entire end-to-end system was already working. The app was successfully tracking attendance based on Myki transactions, deployed and running. But then I realized I needed a way to record office attendance on days when I drove to work instead of taking the train. No Myki transaction = no automatic detection. The solution was to add a manualAttendanceDates config field where I could explicitly list dates I was in the office.
This was an enhancement to an already-working system, not part of the original implementation.
The Spec:
Clear requirements. Well-defined tasks. Config schema documented. Expected behavior spelled out.
The Implementation:
Feature got implemented according to the tasks. Initial version deployed.
The Problem:
Manual dates weren't showing up correctly. But it wasn't just one bug—it was multiple issues across different parts of the system.
How SDD Helped
Having the spec gave me a debugging roadmap:
- Check the spec requirements - What should happen? "Manual dates should appear on the calendar with the same styling as PTV-detected dates"
 - Follow the task breakdown - Config parsing → Python backend → GitHub Actions workflow → JSON output → Frontend rendering
 - Trace the data flow - The spec documented the exact data structure at each layer
 
I could systematically check each layer by reviewing the code:
- Python code: Reviewing the implementation, I could see it wasn't properly merging manual dates with PTV-detected dates (see the sketch below)
 - Workflow file: Looking at GitHub Actions, I realized it needed updates to handle the new field
 - UI: Checking the frontend code, I spotted where it needed changes to render manual dates correctly
 
This wasn't a single bug, but multiple integration issues across three different components.
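To make the Python layer concrete, the missing behavior boiled down to a merge like the sketch below (apart from manualAttendanceDates, the function and field names are illustrative, not Station Station's actual code):
import json

def merge_attendance_dates(ptv_dates: set[str], config: dict) -> list[str]:
    """Union PTV-detected dates with manually recorded office days."""
    manual_dates = set(config.get("manualAttendanceDates", []))
    return sorted(ptv_dates | manual_dates)  # ISO date strings sort chronologically

merged = merge_attendance_dates(
    ptv_dates={"2025-03-12", "2025-03-13"},
    config={"manualAttendanceDates": ["2025-03-14"]},
)
# Written out for the GitHub Actions workflow and the frontend to pick up
with open("attendance.json", "w") as f:
    json.dump({"attendedDates": merged}, f, indent=2)
The fix itself is a one-liner once you know where it goes. The real work was realizing that the workflow file and the UI also had to learn about the new field.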
What SDD Couldn't Solve
Even with perfect specs, I still had to:
- Actually review the code across all three layers to spot the issues
 - Give the AI specific hints - "Look at how manual dates are merged in the Python code," "Check the workflow file," "The UI might not be checking the right field"
 - Understand the integration points - recognizing that adding one field means touching Python backend, GitHub Actions workflow, and React frontend
 - Connect the dots across components - understanding how the config file flows through Python processing, gets written to JSON, picked up by GitHub Actions, and rendered by the UI
 
The spec told me what should happen. It didn't tell me that integrating a new field into an existing multi-layer system would require touching all these different pieces. The AI could implement each fix once I pointed it to the right location, but finding those locations required me to review the code and understand the full data flow.
The Takeaway
What SDD provided:
- Clear expected behavior to test against
 - Systematic way to isolate which layer was failing
 - Documentation of the intended data structure
 
What still required human debugging:
- Understanding multi-layer integration issues
 - Recognizing when different components had different assumptions
 - Finding the exact line where the mismatch occurred
 
Challenge 3: Timezone Handling
The Context:
Calendar dates were displaying incorrectly—off by 1 day from the actual values in attendance.json.
The Spec:
"Display attended days on calendar matching the dates in the JSON file exactly."
Simple, right? Match the dates. But there was a subtle problem.
How SDD Helped
The spec was clear about what should happen (dates must match), which made it obvious when they didn't. Without a spec, I might have thought "close enough" or missed the off-by-one bug entirely.
The task breakdown also helped isolate the problem:
- Task: "Parse attendance dates from JSON"
 - Task: "Mark calendar tiles for attended dates"
 
The bug was in the date parsing task, not the calendar rendering task. Task isolation made debugging faster.
What SDD Couldn't Solve
The spec said "match the dates exactly" but didn't specify how to handle timezones. The bug was subtle:
// Initial implementation - caused timezone conversion
const dateString = date.toISOString().split('T')[0];
// For dates near midnight, UTC conversion shifts the date!
The problem: JavaScript's toISOString() converts to UTC. For dates near midnight, this can shift the date forward or backward. Nov 1, 2024 01:00 AEDT becomes Oct 31, 2024 14:00 UTC—wrong day!
Why the spec didn't prevent this:
The spec didn't say "use local timezone, not UTC" because I didn't think about timezones when writing it. The requirement seemed obvious: match the dates. But "obvious" hides assumptions.
The Fix
// Corrected implementation - uses local timezone
const dateString = date.toLocaleDateString('en-CA'); // YYYY-MM-DD format
The Takeaway
What SDD provided:
- Clear success criteria (dates must match exactly)
 - Quick detection that something was wrong
 - Task isolation to narrow down where the bug was
 
What SDD didn't prevent:
- Subtle implementation details (timezone handling)
 - Hidden assumptions in "obvious" requirements
 - Need for domain knowledge (how JavaScript handles dates)
 
Lesson learned:
Good specs need to surface non-obvious assumptions. "Match the dates" should have been "Match the dates using local timezone to avoid UTC conversion issues." But you often don't know to specify this until you've been bitten by the bug.
Challenge 4: Third-Party Library Integration
The Context:
Integrating the date-holidays npm package to automatically detect Victoria public holidays.
The Spec:
"Use date-holidays library to fetch Victoria public holidays. Display them on the calendar with red text."
How SDD Helped
The spec documented exactly which library to use and what the expected behavior was. When the integration didn't work as expected, I could reference the spec to confirm what was supposed to happen.
What SDD Couldn't Solve
The library returned dates in an unexpected format: "YYYY-MM-DD HH:MM:SS" strings instead of JavaScript Date objects.
// Initial attempt based on spec
const holidayDate = new Date(holiday.date);
// Assumed library returns Date objects - it doesn't!
Why the spec didn't prevent this:
The spec said "use the library" but didn't document the exact return format because I hadn't investigated the library deeply when writing the spec. I assumed standard Date objects.
The Fix
I had to inspect the actual library output, recognize the format mismatch, and handle it explicitly:
// Working solution after investigation
const dateString = holiday.date.substring(0, 10); // "2025-11-04"
const [year, month, day] = dateString.split('-').map(Number);
const holidayDate = new Date(year, month - 1, day); // Uses local timezone
The Takeaway
What SDD provided:
- Documentation of which library to use (no decision paralysis)
 - Clear requirement for what should be displayed (public holidays with red text)
 - Task to test the integration
 
What SDD didn't prevent:
- Runtime surprises from third-party libraries
 - Need to investigate actual library behavior vs documented behavior
 - Format mismatch that only appears when you run the code
 
Lesson learned:
Specs can't predict every third-party library quirk. You discover these by running code and inspecting actual output. The structured approach helps you document the quirks once you find them, so future tasks can reference the pattern.
Challenge 5: User Preferences vs Developer Assumptions
The Context:
During development, weekends were displaying in red text (standard react-calendar behavior). I assumed this would be confusing, since it wasn't attendance data.
The Plan:
Remove the red weekend styling via CSS override. Seemed like a clean UI improvement.
What Actually Happened:
Before implementing the change, I asked the user (myself, wearing the user hat instead of the developer hat). Response: "I really liked keeping weekends and public holidays in red."
I almost removed a feature I valued because I was thinking like a developer, not a user.
How SDD Helped
The structured workflow created natural checkpoints for user feedback:
- After completing a task group, review with user
 - Before removing functionality, validate with user
 - Spec updates require user approval
 
Without this structure, I would have just removed the feature mid-coding session without stopping to think "Should I ask about this?"
What SDD Couldn't Solve
SDD doesn't tell you what users want. It creates opportunities to ask, but you still have to:
- Actually ask the question
 - Listen to the answer
 - Override your own assumptions
 
The Takeaway
What SDD provided:
- Natural review checkpoints to get user feedback
 - Process that encourages "ask before removing"
 - Documentation of decisions (why we kept the feature)
 
What SDD didn't prevent:
- Making wrong assumptions in the first place
 - Need for actual user communication
 - Temptation to "just fix it" without asking
 
Lesson learned:
Structure creates opportunities for better decisions, but you still have to take advantage of those opportunities. The review checkpoint is useless if you skip it.
Where SDD Provided the Most Value
After going through these challenges, here's where the structured approach actually helped:
1. Clear Success Criteria
Every bug was obvious because the spec defined success. "Dates should match exactly" meant off-by-one was clearly wrong. Without specs, I might have rationalized it: "Close enough, probably a display thing."
2. Systematic Debugging
Task breakdown gave me a debugging roadmap. Instead of randomly checking files, I could trace the data flow through the task list:
- Config parsing (Task 1)
 - Backend processing (Task 2)
 - JSON output (Task 3)
 - Frontend rendering (Task 4)
 
Check each layer systematically until you find the broken one.
3. Documentation of Decisions
When I came back a week later to add a new feature, the specs told me:
- Why certain approaches were chosen
 - What assumptions were made (and documented)
 - How data flows through the system
 
Without this documentation, I would have re-learned the codebase every time.
4. Review Checkpoints
The workflow forced me to pause and review:
- After each task group
 - Before major changes
 - When user feedback was needed
 
These pauses prevented rushing ahead with wrong assumptions.
Where SDD Couldn't Replace Human Judgment
But let's be honest about what structure can't solve:
1. Multi-Layer Integration Issues
Specs describe individual components well. But when components need to work together, you still have to understand the full picture. The manualAttendanceDates bug required understanding backend + frontend + data contract all at once.
2. Hidden Assumptions and Edge Cases
"Match the dates" seemed clear until timezone conversion bit me. "Use the library" seemed clear until format mismatches appeared. Good specs surface assumptions, but you often don't know what to surface until you've been bitten.
3. Third-Party Library Quirks
Specs can't predict runtime behavior of external dependencies. You discover these by running code, inspecting output, and debugging when things don't work as documented.
4. User Preferences and Domain Knowledge
Structure can't tell you what users value or what domain-specific constraints matter. You still need actual user communication and domain expertise.
The Honest ROI of SDD
Let's be real about the time investment:
Time spent on specs: ~4-5 hours across all features
Time spent on implementation: ~2 days
Time spent debugging: ~4-6 hours (timezone, manualAttendanceDates, library integration)
Total: ~3 days for 8 features shipped and deployed
Where SDD Saved Time
Debugging was faster:
- Systematic task-by-task checking vs random file jumping
 - Spec told me what should happen vs guessing
 - Data flow documented vs reverse-engineering it
 
Resumability was huge:
- Came back a week later, knew exactly where I left off
 - Spec reminded me why decisions were made
 - Task list showed what was done and what was next
 
Fewer forgotten requirements:
- Everything documented upfront vs relying on memory
 - Edge cases captured in spec vs discovered in production
 - Complete feature set shipped vs "80% done" abandonment
 
Where SDD Cost Time
Upfront spec creation:
- 4-5 hours thinking and documenting
 - But this is thinking time I'd need anyway, just formalized
 
Learning the workflow:
- First project with agent-os had a learning curve
 - Second project would be faster
 
Maintaining documentation:
- When specs changed, had to update docs
 - But this paid off when resuming work later
 
The Takeaway
Spec-Driven Development isn't magic. It won't prevent bugs, eliminate debugging, or replace human judgment. But it provides:
- Structure when you'd otherwise be lost
 - Documentation when you'd otherwise forget
 - Checkpoints when you'd otherwise rush ahead with wrong assumptions
 - Resumability when you'd otherwise re-learn the codebase
 
The challenges I faced—timezone bugs, library quirks, multi-layer issues—would have happened with or without SDD. The difference is how I dealt with them:
Without SDD: Random debugging, forgotten context, abandoned projects
With SDD: Systematic debugging, documented decisions, completed features
That's the honest ROI.
What's Next
We've now seen the complete picture: the workflow (Part 3) and where it actually helps vs where you still struggle (Part 4). You know the realistic benefits and the honest limitations.
In Part 5, we'll wrap up with a decision framework: when should you use SDD for your project, and when is a simpler approach better? We'll also cover how to get started with Station Station yourself (it's free and open source), and where to go from here.
If you're ready to make the call on whether Spec-Driven Development fits your workflow, Part 5 has the answers.
    