DEV Community: Diven Rastdus

5 Monitoring Blind Spots That Let My Side Projects Fail Silently

Diven Rastdus — Thu, 28 May 2026 12:10:28 +0000

I run four side projects: a journaling app, an Android app blocker, a healthcare AI tool, and a content pipeline. Total monitoring budget: $0.

Last month, one of them went down for 24 hours. Nobody told me. I found out by accident.

That scared me enough to audit all four projects. I found the same five blind spots across every single one.

1. No Uptime Checks (The "It's Probably Fine" Gap)

My journaling app runs on Supabase's free tier. Free-tier projects auto-pause after 7 days of inactivity. I knew this in theory.

In practice, I shipped a demo to a potential client. The project had been idle. Supabase paused it. The API returned nothing. The frontend showed a blank screen.

For 24 hours, anyone visiting saw a broken app. I only discovered it when I opened the dashboard for an unrelated reason.

The fix: A cron job that hits every critical endpoint every 6 hours.

#!/bin/bash
# health-check.sh - runs via cron every 6h
ENDPOINTS=(
  "https://my-app.vercel.app/api/health"
  "https://my-backend.supabase.co/rest/v1/"
)

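for url in "${ENDPOINTS[@]}"; do
  # -w prints the status code even on failure; "000" means no response at all
  status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$url")
  # 401 from the Supabase REST root means "running, but no API key sent"
  if [ "$status" != "200" ] && [ "$status" != "401" ]; then
    python3 ~/bin/alert.py --message "DOWN: $url returned $status"
  fi
done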

Total cost: $0. It runs on the same machine that runs everything else. Better Stack or UptimeRobot are more robust options, because a check that lives on your own machine goes silent when that machine does. But this costs nothing and catches most hard-down failures.

2. Unstructured Logs Across Services

My healthcare AI tool runs three microservices on Cloud Run: an MCP server, an orchestrator agent, and an interaction checker. When a patient reconciliation fails, which service caused it?

With print() statements, the answer is "good luck." Cloud Run interleaves logs from all services. One request touches all three. There's no correlation ID linking them.

I spent 40 minutes tracing a bug that turned out to be a timeout in the MCP server. The orchestrator logged "reconciliation failed." The MCP server logged nothing useful. The interaction checker never got called.

The fix: Structured JSON logs with a request ID passed through every service call.

import json, uuid, sys

def log(level, msg, **extra):
    entry = {"severity": level, "message": msg, **extra}
    print(json.dumps(entry), file=sys.stderr)

# At the request entry point
request_id = str(uuid.uuid4())[:8]
log("INFO", "Reconciliation started",
    request_id=request_id, patient_id=patient_id)

# Pass request_id to downstream services via header
# X-Request-ID: {request_id}

Cloud Run (and most log aggregators) parse JSON automatically. Now I can filter by request_id and see the full trace across services. Structured logging was the single most impactful monitoring improvement I made.
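The header hand-off mentioned above can be sketched in a few lines. This is a minimal illustration, not the code from my services: the helper names get_or_create_request_id and outgoing_headers are hypothetical, and it assumes headers arrive as a plain dict.

```python
import json
import sys
import uuid

REQUEST_ID_HEADER = "X-Request-ID"


def get_or_create_request_id(headers):
    """Reuse the caller's request ID if present, otherwise mint a new one."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == REQUEST_ID_HEADER.lower() and value:
            return value
    return uuid.uuid4().hex[:8]


def outgoing_headers(request_id, extra=None):
    """Headers to attach to every downstream call made for this request."""
    headers = dict(extra or {})
    headers[REQUEST_ID_HEADER] = request_id
    return headers


def log(level, msg, **extra):
    print(json.dumps({"severity": level, "message": msg, **extra}), file=sys.stderr)


# Entry service: no incoming ID, so one is generated.
rid = get_or_create_request_id({})
log("INFO", "Reconciliation started", request_id=rid)

# Downstream service: receives the header and logs with the same ID.
downstream_rid = get_or_create_request_id(outgoing_headers(rid))
log("INFO", "Downstream call started", request_id=downstream_rid)
```

The point of the design is that only the first service ever generates an ID. Every other service reads it from the header, so one filter on request_id returns the whole chain.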

3. Zero Mobile Crash Visibility

My Android app blocker uses Kotlin and Jetpack Compose. R8 (Android's code shrinker) silently removed a class my accessibility service needed. The app installed fine. It launched fine. The core feature just... didn't work.

I found this bug during manual testing on a real device. If this had shipped to users, I would have had zero visibility. No crash reports. No error logs. Nothing.

Android's logcat is only readable over adb, which means the device has to be attached to your machine (USB or wireless debugging). Once the app is on someone else's phone, you're blind.

The fix: At minimum, catch uncaught exceptions and log them somewhere you can read later.

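// Call once from Application.onCreate(); `context` is the application context
val previousHandler = Thread.getDefaultUncaughtExceptionHandler()
Thread.setDefaultUncaughtExceptionHandler { thread, throwable ->
    val report = buildString {
        appendLine("Thread: ${thread.name}")
        appendLine("Error: ${throwable.message}")
        appendLine(throwable.stackTraceToString().take(2000))
    }
    // Write to local file, upload on next app launch
    runCatching { File(context.filesDir, "crash.log").writeText(report) }
    // Hand off to the original handler so the process still terminates normally
    previousHandler?.uncaughtException(thread, throwable)
}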

Crashlytics, Sentry, or Bugfender give you stack traces, device info, and occurrence counts out of the box. This basic handler still beats flying blind when you're not ready to pay for one.

4. API Quota Exhaustion With No Warning

This week, my social media automation stopped working. No errors in my code. No exceptions. Just... nothing posted.

The X (Twitter) API returns a CreditsDepleted error when you hit your monthly quota. My posting script caught the error, logged it to a file, and moved on. Nobody reads log files proactively.

I discovered the issue 2 days later when I manually checked why engagement dropped to zero.

The fix: Treat quota and billing errors as alerts, not log lines.

try:
    response = api.post_tweet(text)
except ApiError as e:
    # CreditsDepleted = all posting dead until cycle resets. Treat as outage.
    if "CreditsDepleted" in str(e) or e.status_code == 429:
        send_alert(f"API QUOTA HIT: {e}. All posting blocked until reset.")
    else:
        logger.error(f"Tweet failed: {e}")

The distinction matters. A 500 error is usually transient and a retry often clears it. A quota error means everything is broken until the billing cycle resets. That deserves a push notification, not a log line buried in a file.

5. CI Tests That Don't Run Against Production

Last week, my healthcare tool's production API broke. I didn't find out from my CI pipeline. I found out from a GitHub notification that sat unread in my inbox for 3 days.

The problem: my end-to-end tests run against local Docker containers. They pass every time. But the deployed Cloud Run services had drifted. CI was green. Production was broken.

The fix: A scheduled workflow that hits the real production URLs every 6 hours.

# .github/workflows/e2e-smoke.yml
name: e2e smoke tests
on:
  schedule:
    - cron: '0 */6 * * *'
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install httpx pytest
      - run: pytest tests/e2e/ -v
        env:
          MCP_SERVER_URL: ${{ secrets.PROD_MCP_URL }}

CI passing on localhost doesn't mean production works. Scheduled tests against real endpoints catch the drift. I also routed failure notifications to Telegram instead of GitHub's notification bell. GitHub is too noisy. A direct push notification cuts through.
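For the Telegram routing, a minimal sketch using only the standard library might look like the following. The function names are hypothetical and the token and chat ID are placeholders. The only external piece it relies on is the Bot API's sendMessage method, which takes a chat_id and a text.

```python
import json
import urllib.request


def build_telegram_request(bot_token, chat_id, text):
    """Build (but do not send) a Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_telegram_alert(bot_token, chat_id, text, timeout=10):
    """Send the message; returns True if Telegram answered with HTTP 200."""
    request = build_telegram_request(bot_token, chat_id, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200


# Placeholder credentials: in CI these would come from repository secrets.
example = build_telegram_request("123456:PLACEHOLDER", "42", "e2e smoke tests failed")
print(example.full_url)
```

Splitting "build" from "send" keeps the request shape checkable without touching the network. In a workflow, the send call would sit in a final step guarded by if: failure().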

The Pattern

Every one of these gaps follows the same shape: something fails, nothing tells me, I find out too late.

The fixes are embarrassingly simple. A cron job. A JSON format string. A try/except that sends a push notification instead of writing to a file. None of this is hard.

Monitoring isn't about the tool. It's about closing the loop between "something broke" and "someone who can fix it found out." If that loop is open, nothing else matters.

3 Expo SDK 56 Bugs That Crashed My App Before It Even Launched

Diven Rastdus — Wed, 27 May 2026 12:07:38 +0000

I burned four EAS cloud builds and two hours chasing crashes that had nothing to do with my code. All three bugs came from Expo SDK 56 defaults that silently break Android builds. Here's each one and how to fix it.

Bug 1: expo-av crashes with NoClassDefFoundError

I added voice recording to a dream journal app. The Expo docs for Audio still reference expo-av in some examples. So I installed it:

npx expo install expo-av

The app compiled. TypeScript was happy. Then the EAS build failed on Android with:

java.lang.NoClassDefFoundError: 
  Failed resolution of: Lio/expo/modules/video/VideoViewModel;

The expo-av package pulls in video dependencies. In SDK 56, the video module was extracted to a separate expo-video package. The old monolith references classes that no longer exist.

The fix: expo-av is deprecated starting SDK 55. Use expo-audio for audio and expo-video for video. They're separate packages now.

npm uninstall expo-av
npx expo install expo-audio

The API changed too. Old:

import { Audio } from 'expo-av';

const recording = new Audio.Recording();
await recording.prepareToRecordAsync(Audio.RecordingOptionsPresets.HIGH_QUALITY);
await recording.startAsync();

New:

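import { useAudioRecorder, RecordingPresets } from 'expo-audio';

// Hook call goes at the top level of a component
const recorder = useAudioRecorder(RecordingPresets.HIGH_QUALITY);

// Later, inside an async event handler
await recorder.prepareToRecordAsync();
recorder.record();

The new API is hook-based. No more class instances, no manual cleanup. useAudioRecorder handles the recorder's lifecycle and cleanup on unmount. You still request microphone permission yourself (AudioModule.requestRecordingPermissionsAsync()) before recording.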

Time lost: 4 EAS builds (~60 minutes). The error message mentions VideoViewModel, which sent me down a wrong path investigating video dependencies before I realized the entire package was deprecated.

Bug 2: Gradle 9.x silently breaks React Native

After fixing the audio crash, the next build failed with a different NoClassDefFoundError:

java.lang.NoClassDefFoundError: 
  com/android/build/api/variant/impl/JvmVendorSpec

npx expo prebuild generated gradle-wrapper.properties pointing to Gradle 9.3.1. Gradle 9 removed JvmVendorSpec.IBM_SEMERU, which React Native's Gradle plugin still references internally.

The error doesn't mention Gradle versions. It doesn't say "incompatible Gradle." It just throws a class-not-found at build time.

The fix: Pin Gradle to 8.x. After every npx expo prebuild, check the generated wrapper:

# Check what version prebuild generated
grep distributionUrl android/gradle/wrapper/gradle-wrapper.properties

If it says anything starting with gradle-9, change it:

distributionUrl=https\://services.gradle.org/distributions/gradle-8.13-bin.zip

Add this to CI to catch it automatically:

GRADLE_VER=$(grep -oP 'gradle-\K[0-9]+' android/gradle/wrapper/gradle-wrapper.properties)
if [ "$GRADLE_VER" -ge 9 ]; then
  echo "ERROR: Gradle $GRADLE_VER breaks React Native. Pin to 8.x"
  exit 1
fi
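
The grep -P flag above is GNU-only, so the same check fails on a stock macOS runner. A portable variant in Python, as a sketch (the function name gradle_major_version is hypothetical, and the sample line stands in for the real file contents):

```python
import re


def gradle_major_version(properties_text):
    """Return the Gradle major version from gradle-wrapper.properties, or None."""
    for line in properties_text.splitlines():
        # Only the distributionUrl line carries the wrapper version.
        if line.strip().startswith("distributionUrl"):
            match = re.search(r"gradle-(\d+)", line)
            if match:
                return int(match.group(1))
    return None


# Sample content; in CI this would be read from
# android/gradle/wrapper/gradle-wrapper.properties.
sample = r"distributionUrl=https\://services.gradle.org/distributions/gradle-9.3.1-bin.zip"

major = gradle_major_version(sample)
if major is not None and major >= 9:
    print(f"ERROR: Gradle {major} detected. Pin to 8.x")
```

In a real pipeline the final branch would call sys.exit(1) so the job fails instead of just printing.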

Time lost: 2 builds. The error looks identical to the expo-av crash (both are NoClassDefFoundError), which made me think I hadn't fully fixed bug #1.

Bug 3: Barrel exports + native modules = cascading crash

I had a standard barrel export file:

// src/dream/components/index.ts
export { DreamCard } from './DreamCard';
export { MoodPicker } from './MoodPicker';
export { VoiceRecorder } from './VoiceRecorder';

VoiceRecorder imports expo-audio. Every screen that imported anything from @dream/components would trigger the native module resolution for expo-audio, even screens that never rendered the recorder.

In Expo Go (no native modules bundled), this crashes the entire app. Not just the recording screen. Every screen.

The fix: Never barrel-export components that depend on native modules. Import them directly and lazy-load:

// src/dream/components/index.ts
export { DreamCard } from './DreamCard';
export { MoodPicker } from './MoodPicker';
// VoiceRecorder NOT barrel-exported -- requires native module
// Import directly: import { VoiceRecorder } from '@dream/components/VoiceRecorder'

On the consuming screen, use React.lazy:

import { lazy, Suspense } from 'react';
import { ActivityIndicator } from 'react-native';

const VoiceRecorder = lazy(() =>
  import('@dream/components/VoiceRecorder')
    .then(m => ({ default: m.VoiceRecorder }))
);

// In render:
<Suspense fallback={<ActivityIndicator />}>
  <VoiceRecorder />
</Suspense>

This way the native module only loads when the component actually renders, and screens that don't use it never touch expo-audio.

Time lost: 1 hour. The crash logs pointed to the native module, not the import chain. I kept looking at expo-audio configuration when the real problem was in index.ts.

The checklist I wish I had

Before your next Expo SDK 56 Android build: grep for expo-av (replace with expo-audio/expo-video), check gradle-wrapper.properties isn't 9.x after prebuild, and audit barrel exports for native module imports.

But mostly: check the SDK changelog before choosing packages. I would have caught bug #1 in 30 seconds by reading the Expo SDK 56 changelog. The deprecation is documented. I just didn't look.

My Production App Was Down for 24 Hours and Nobody Told Me

Diven Rastdus — Sat, 23 May 2026 12:09:48 +0000

I built an AI assessment app for a consulting firm prospect. Deployed it on Supabase free tier. Sent them the link. Then I waited for their review.

What I didn't know: Supabase auto-pauses free-tier projects after 7 days of inactivity. My prospect opened the link and saw an error page. For up to 24 hours, my best lead thought my work was broken. I found out by accident when I checked the dashboard myself.

No alert. No email I noticed. No monitoring. Just silence while my credibility evaporated.

The failure mode nobody warns you about

Most monitoring advice assumes you're running your own servers. "Set up Prometheus. Configure Grafana dashboards. Integrate PagerDuty." That's great if you're running Kubernetes at scale.

But if you're an indie developer shipping on free tiers, the failure mode is different. Your platform shuts you down deliberately because you're not generating enough activity.

This isn't a Supabase-specific problem. It's a pattern across every free-tier platform:

Supabase free: auto-pauses after 7 days of inactivity
Fly.io free: machines stop after ~5 minutes idle
Render free: services spin down after 15 minutes of inactivity
Railway free: $5 credit cap, then full stop
Vercel hobby: bandwidth and serverless execution limits

Every one of these can take your app offline while you sleep. And none of them page you when it happens.

The 5 signals I actually monitor now

After losing that prospect, I built a monitoring checklist. Nothing fancy. No SaaS subscription required. Just the bare minimum that would have caught the problem.

1. Availability ping (is the thing alive?)

The most basic check. Hit your API endpoint. If the HTTP status isn't 2xx or a known "alive" code, something is wrong.

#!/bin/bash
URL="https://your-project.supabase.co/rest/v1/"
STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 15 "$URL")

case "$STATUS" in
  2??|401|404|405) echo "ALIVE" ;;   # any 2xx, plus known "alive" codes
  *) echo "DOWN: HTTP $STATUS" ;;
esac

Why 401 counts as alive: Supabase returns 401 when you hit the REST endpoint without an API key. That's fine. It means the server is running. A paused project returns a 5xx or times out entirely.

2. Critical endpoint health (does it return real data?)

An availability ping tells you the server boots. It doesn't tell you the database migrated correctly or the API returns valid responses.

Pick your most critical endpoint. Hit it with real parameters. Validate the response shape.

# URL already ends with a slash, so append the table name directly
RESPONSE=$(curl -s --max-time 15 -H "apikey: $SUPABASE_ANON_KEY" \
  "${URL}your_table?select=id&limit=1")

if echo "$RESPONSE" | jq -e '.[0].id' > /dev/null 2>&1; then
  echo "HEALTHY"
else
  echo "DEGRADED: unexpected response shape"
fi

This catches migrations that broke a column name, RLS policies that started blocking reads, and connection pool exhaustion. All things I've hit in production that a simple ping would have missed.

3. Platform-specific tripwires

Every platform has a "we're about to shut you down" signal. Find it and watch for it.

For Supabase, they email you 24 hours before auto-pausing. So I added this to my morning Gmail scan:

from:supabase.com subject:(paused OR pausing OR inactive)

For AWS, it's budget alerts:

aws budgets create-budget --account-id $ACCOUNT_ID \
  --budget '{"BudgetName":"monthly-cap","BudgetLimit":{"Amount":"10","Unit":"USD"},"TimeUnit":"MONTHLY","BudgetType":"COST"}' \
  --notifications-with-subscribers '[{"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":50},"Subscribers":[{"SubscriptionType":"EMAIL","Address":"you@example.com"}]}]'

The specifics vary by platform. The principle doesn't: find the signal your platform sends before it kills you, and make sure you're listening.

4. Post-deploy smoke test

Every deployment should end with a health check. Not "the build succeeded." Not "the tests passed." Did the deployed version actually respond correctly?

# .github/workflows/smoke.yml
name: Post-deploy smoke test
on:
  workflow_run:
    workflows: ["Deploy"]
    types: [completed]

jobs:
  smoke:
    runs-on: ubuntu-latest
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    steps:
      - name: Check production health
        run: |
          STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
            --max-time 30 "https://your-app.vercel.app/api/health")
          if [ "$STATUS" != "200" ]; then
            echo "SMOKE TEST FAILED: HTTP $STATUS"
            exit 1
          fi

I've had deployments where the build was green, tests passed, Vercel reported success, and the app was broken because an environment variable wasn't set in the production environment. A 10-second curl after deploy would have caught it.

5. Keep-alive cron for free tiers

This is the one that would have saved me. A cron job that pings your free-tier services twice a week, resetting the inactivity timer before the platform shuts you down.

#!/bin/bash
# Keep free-tier backends alive. Run Mon+Thu via cron.
# Any request resets the inactivity timer.

PROJECTS=(
  "project-id-1:my-saas-demo"
  "project-id-2:portfolio-api"
)

for entry in "${PROJECTS[@]}"; do
  ID="${entry%%:*}"
  NAME="${entry#*:}"
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    --max-time 15 "https://${ID}.supabase.co/rest/v1/")
  echo "$(date -Iseconds) $NAME HTTP=$STATUS"
done

# crontab
0 9 * * 1,4 /home/you/bin/keepalive.sh >> /home/you/keepalive.log 2>&1

Two requests per week. Zero cost. Prevents a class of outage that no amount of application-level error handling can catch.

What I'd do differently

If I were starting a new project today, I'd set up monitoring before I deploy, not after the first outage.

The full checklist takes about 30 minutes:

Write a /api/health endpoint that checks database connectivity
Add a post-deploy smoke test in CI
Set up platform-specific alerts (budget, pause warnings, rate limits)
Add a keep-alive cron for any free-tier dependency
Put the monitoring script in the same repo as the app
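
The first item is the one people tend to fake with a hard-coded "ok". A minimal sketch of a health check that actually touches the database, with the caveat that everything here is illustrative: health_check is a hypothetical name, and sqlite3 stands in for whatever database your app really uses.

```python
import json
import sqlite3


def health_check(connect):
    """Return (http_status, json_body) after running a trivial query.

    `connect` is any zero-argument callable returning a DB-API connection.
    """
    try:
        conn = connect()
        try:
            cursor = conn.cursor()
            cursor.execute("SELECT 1")
            cursor.fetchone()
        finally:
            conn.close()
    except Exception as exc:
        # Report the failure class only; avoid leaking connection details.
        return 503, json.dumps({"status": "down", "error": type(exc).__name__})
    return 200, json.dumps({"status": "ok"})


# Healthy case: an in-memory SQLite database as a stand-in.
print(health_check(lambda: sqlite3.connect(":memory:")))


# Failing case: the connection attempt itself raises.
def paused_backend():
    raise sqlite3.OperationalError("backend unavailable")


print(health_check(paused_backend))
```

Wire the returned status and body into whatever framework serves /api/health. Returning 503 on failure matters because a ping script that only reads the status code would otherwise see a healthy 200.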

None of this requires a monitoring SaaS. A bash script, a cron job, and a GitHub Actions workflow cover 90% of what a solo developer needs.

The remaining 10%? That's where proper observability tools earn their keep. Distributed tracing, error aggregation, performance profiling. But you can't justify those until you've nailed the basics.

Start with a cron job and a curl. It's boring. It works. And it would have saved me from explaining to a prospect why my demo was showing an error page.

R8 Minification Silently Killed My Android App's Core Feature (And Tests Didn't Catch It)

Diven Rastdus — Wed, 20 May 2026 12:31:41 +0000

My CI pipeline was green. Unit tests passed. The APK built and signed without errors. I installed it on my Pixel 3. The app launched, looked perfect.

Then I tried to use it. Nothing happened.

No crash. No error dialog. No logcat stacktrace. The app's entire core feature was just... gone. Like someone had hollowed it out and left the shell.

What I was building

Nudge is an open-source Android app blocker. You pick the apps you waste time on and set a delay (say, 30 seconds). Nudge forces you to wait before opening them.

It uses three Android system APIs that require special permissions:

AccessibilityService to detect which app is in the foreground
SYSTEM_ALERT_WINDOW to draw the delay countdown overlay
PACKAGE_USAGE_STATS to track daily screen time per app

The debug build worked flawlessly. I'd been testing it for weeks. The release build was supposed to be the same thing, just signed and minified.

It was not the same thing.

The 2MB red flag I ignored

Here's what the release APK looked like:

debug APK:   12.4 MB
release APK:  2.1 MB

I should have questioned that 83% size reduction. Instead, I thought "wow, R8 really does its job." It did its job too well.

R8 is Android's default code shrinker and optimizer. It removes unused classes, inlines methods, renames symbols, and strips dead code. For most apps, it's free performance. For apps that rely on system-level callbacks, it's a silent killer.

What R8 actually stripped

R8 analyzed my code's call graph and decided several things were "unused":

NudgeAccessibilityService - The Android system instantiates this class by reading the manifest. R8 doesn't know that. It saw no new NudgeAccessibilityService() in the code, so it stripped it.
BlockOverlayActivity - Launched via an explicit Intent constructed at runtime. R8 traced the static references but couldn't follow the dynamic class resolution.
Hilt entry points - Dagger/Hilt uses annotation processing and reflection to wire dependencies. R8 stripped the interfaces that Hilt's generated code needs at runtime.

The result? The app installed. The UI rendered. But the AccessibilityService never registered with the system. No foreground app detection. No overlay. No blocking. The app was a beautiful, non-functional shell.

Why tests didn't catch it

This is the part that stung. I had tests. They passed.

# From our GitHub Actions workflow
- name: Run tests
  run: ./gradlew test

- name: Build release APK
  run: ./gradlew assembleRelease

The problem: ./gradlew test runs local JVM unit tests against unminified classes, so R8 never touches the code under test. The release build with R8 is a completely different artifact. My tests verified that the unminified code worked. Then CI built a release APK that was structurally different.

This is equivalent to testing your code on localhost and deploying a Docker image built with different flags. The artifact you tested is not the artifact you shipped.

The fix (and why I didn't just add ProGuard rules)

The obvious fix is ProGuard keep rules:

# AccessibilityService instantiated by the system via manifest
-keep class com.astraedus.nudge.service.NudgeAccessibilityService { *; }

# Overlay activity launched via Intent
-keep class com.astraedus.nudge.ui.overlay.BlockOverlayActivity { *; }

# Hilt entry points use reflection
-keep interface com.astraedus.nudge.service.NudgeAccessibilityService$NudgeAccessibilityEntryPoint { *; }

I wrote these rules. Then I disabled R8 entirely.

buildTypes {
    release {
        isMinifyEnabled = false
        signingConfig = signingConfigs.getByName("release")
    }
}

Why? Nudge has zero internet permission. There's no network call to intercept, no API key to extract, no proprietary algorithm to reverse-engineer. The source code is public on GitHub. Minification was adding build complexity and production risk for zero security benefit.

ProGuard rules are the right answer for apps that need obfuscation. For open-source apps with system-level APIs, the risk-reward math doesn't work out. Every new service class or Hilt module becomes a potential R8 landmine unless you maintain the keep rules in lockstep.

What I should have done differently

1. Test the release build, not just the debug build.

- name: Run unit tests against release variant
  # JVM unit tests never see R8 output; this only covers release-specific code paths
  run: ./gradlew testReleaseUnitTest

- name: Build and install release APK on emulator
  run: |
    ./gradlew assembleRelease
    adb install app/build/outputs/apk/release/app-release.apk
    # Smoke test: verify the accessibility service registers
    adb shell dumpsys accessibility | grep -q "NudgeAccessibilityService"

The install-and-dumpsys step is the one that would have caught the stripped service in CI before it reached a device.

2. Treat APK size deltas as a signal.

A debug-to-release size reduction of more than 50% warrants investigation. R8 removing 83% of your APK means it's removing a lot of code it thinks is dead. Some of it might not be.
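
That rule of thumb is easy to turn into a build gate. A small sketch, where the function names are hypothetical and the 50% threshold is just the rule above, not an official R8 guideline:

```python
def size_reduction_pct(debug_mb, release_mb):
    """Percentage by which the release APK is smaller than the debug APK."""
    if debug_mb <= 0:
        raise ValueError("debug size must be positive")
    return (debug_mb - release_mb) / debug_mb * 100


def is_suspicious(debug_mb, release_mb, threshold_pct=50.0):
    """True when the shrink exceeds the threshold and deserves a manual look."""
    return size_reduction_pct(debug_mb, release_mb) > threshold_pct


# The sizes from this post: 12.4 MB debug, 2.1 MB release.
reduction = size_reduction_pct(12.4, 2.1)
print(f"Reduction: {reduction:.0f}%  suspicious: {is_suspicious(12.4, 2.1)}")
```

In CI the two sizes would come from os.path.getsize on the built APKs. A suspicious result should not fail the build on its own, only prompt a check of what R8 removed.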

3. Verify system service registration in your test suite.

For any Android app that uses AccessibilityService, NotificationListenerService, or DeviceAdminReceiver, add a test that verifies the service appears in the system's service list after installation. These services fail silently when missing.

Bonus: the permission disclosure pattern

Getting Google Play to approve an app with QUERY_ALL_PACKAGES, AccessibilityService, and SYSTEM_ALERT_WINDOW is its own adventure. Google's review team needs to see explicit in-app disclosure of what each permission does and why.

Here's the pattern I used in the onboarding flow:

@Composable
private fun PermissionCard(
    icon: ImageVector,
    title: String,
    description: String,
    onClick: () -> Unit
) {
    Card(
        modifier = Modifier.fillMaxWidth(),
        onClick = onClick
    ) {
        Row(modifier = Modifier.padding(16.dp)) {
            Icon(icon, contentDescription = null)
            Spacer(Modifier.width(12.dp))
            Column {
                Text(title, style = MaterialTheme.typography.titleSmall)
                Text(description, style = MaterialTheme.typography.bodySmall)
            }
        }
    }
}

Each card explains what the permission accesses, why, and explicitly states what it does NOT do:

"Detects which app is in the foreground so Nudge can trigger your block rules. Does not read your messages, keystrokes, or screen content."

The "does not" part matters. Google's review checks for proactive privacy disclosure, and users are (rightly) suspicious of AccessibilityService apps.

The takeaway

R8 minification is not compression. It's a code transformation that decides what your app needs at runtime. When those decisions are wrong, nothing crashes. Nothing logs an error. The feature just doesn't exist.

If your Android app uses system callbacks (AccessibilityService, ContentProvider, BroadcastReceiver registered in manifest), either maintain ProGuard rules religiously or ask yourself whether minification is earning its keep.

And whatever you do: test the artifact you ship, not the artifact you develop against. That gap is where bugs like this live.

Nudge is open source: github.com/astraedus/nudge

Your AI Agent Evaluation Is Lying to You: Why 10 Test Runs Prove Nothing

Diven Rastdus — Fri, 08 May 2026 12:06:09 +0000

I ran 10 games between two AI agents. Agent v3 went 5-5 against Agent v1. I reported "v3 ties v1, no measurable improvement, don't merge."

That conclusion was wrong. Not because v3 was secretly better or worse, but because 10 games told me almost nothing at all.

Here's the math I should have done first.

The win-rate trap

The obvious metric for comparing two agents is win rate. Agent A beats Agent B 50% of the time? They're even. 70%? A is better. Simple.

Except win rate has a confidence interval, and at small N that interval is enormous.

The Wilson score interval gives a reasonable bound for binary outcomes:

import math

def wilson_interval(wins, total, z=1.96):
    """95% confidence interval for true win probability."""
    if total == 0:
        return (0.0, 1.0)
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    spread = z * math.sqrt((p * (1 - p) + z**2 / (4 * total)) / total) / denom
    return (center - spread, center + spread)

At 5 wins out of 10 games:

>>> tuple(round(x, 3) for x in wilson_interval(5, 10))
(0.237, 0.763)

The 95% confidence interval for the true win probability is [0.24, 0.76]. That range comfortably fits "Agent A is dominant" (76% win rate), "they're even" (50%), and "Agent B is dominant" (24%). You literally cannot tell them apart.

How many games do you need? For two agents where the true skill gap gives one a 60% win rate, you need roughly 100 games to shrink the CI enough to exclude 50%. For a 55% edge, you're looking at 400+.

# Minimum games to distinguish p_true from 0.5 at 95% confidence
def min_games(p_true, z=1.96):
    """Approximate sample size for Wilson CI to exclude 0.5."""
    delta = abs(p_true - 0.5)
    return int(math.ceil(z**2 * p_true * (1 - p_true) / delta**2))

>>> min_games(0.60)  # 60% true win rate
93
>>> min_games(0.55)  # 55% true win rate
381
>>> min_games(0.52)  # 52% true win rate
2398

Most agent improvements are in the 52-58% range against the prior version. You need hundreds of games, not ten.

TrueSkill makes the same mistake look different

If you're running a multi-agent ladder (like I am for a Kaggle competition), you're probably using TrueSkill or Elo instead of raw win rate. These feel more sophisticated. They give you a single number -- the mu rating -- and you compare it across agents.

But TrueSkill also tracks sigma, the uncertainty in that rating. And at low game counts, sigma is so large that the ratings are meaningless.

Here's my actual ladder setup, mirroring Kaggle's scoring:

import trueskill

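env = trueskill.TrueSkill(
    mu=600.0,           # Kaggle's initial rating
    sigma=200.0,        # starts extremely uncertain
    # Assumption: the library does not rescale beta and tau when sigma changes,
    # so they are set here in the same ratio as its defaults (sigma/2, sigma/100)
    beta=100.0,
    tau=2.0,
    draw_probability=0.05
)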

After 10 games, a typical agent might show mu=640, sigma=36. That looks precise. It's not. The 95% confidence interval on the true skill is [mu - 2*sigma, mu + 2*sigma] = [568, 712].

When I compared v1 (mu=640, sigma=36) against v3 (mu=560, sigma=36), the intervals were [568, 712] and [488, 632]. They overlap by 64 points. I could not distinguish these agents. But the mu gap (80 points) looked meaningful on a leaderboard.

The fix is to check sigma before drawing conclusions:

def ratings_are_distinguishable(rating_a, rating_b, z_crit=1.96):
    """Check if two TrueSkill ratings are statistically distinguishable."""
    mu_diff = abs(rating_a.mu - rating_b.mu)
    combined_uncertainty = math.sqrt(rating_a.sigma**2 + rating_b.sigma**2)
    # z-score for the difference
    z = mu_diff / combined_uncertainty
    # z_crit=1.96 corresponds to 95% confidence (two-sided)
    return z > z_crit

# After 10 games: NOT distinguishable
>>> ratings_are_distinguishable(
...     env.create_rating(mu=640, sigma=36),
...     env.create_rating(mu=560, sigma=36)
... )
False

# After 200 games (sigma ~8): distinguishable if gap is real
>>> ratings_are_distinguishable(
...     env.create_rating(mu=640, sigma=8),
...     env.create_rating(mu=560, sigma=8)
... )
True

The fix: three rules

After burning a day on a wrong conclusion, I now follow three rules for agent evaluation.

Rule 1: Persist ratings across runs. Every ladder session starting from sigma=200 wastes all prior information. Save ratings to disk and load them on the next run:

import json
from pathlib import Path

RATINGS_PATH = Path("runs/ratings.json")

def load_ratings(env):
    """Load persisted TrueSkill ratings, or return empty dict."""
    if RATINGS_PATH.exists():
        data = json.loads(RATINGS_PATH.read_text())
        return {
            name: env.create_rating(mu=r["mu"], sigma=r["sigma"])
            for name, r in data.items()
        }
    return {}

def save_ratings(ratings):
    """Persist current ratings to disk."""
    RATINGS_PATH.parent.mkdir(parents=True, exist_ok=True)
    data = {
        name: {"mu": r.mu, "sigma": r.sigma}
        for name, r in ratings.items()
    }
    RATINGS_PATH.write_text(json.dumps(data, indent=2))

Now each run adds information instead of starting from scratch. Sigma actually converges.

Rule 2: Set a sigma threshold before making decisions. Don't compare agents until both have sigma well below the rating gap you care about. For my competition, that's sigma < 15:

def is_converged(rating, sigma_threshold=15.0):
    return rating.sigma < sigma_threshold

# Before comparing v1 and v3:
if not (is_converged(ratings["v1"]) and is_converged(ratings["v3"])):
    # estimate_games_to_converge is a project-specific helper (not shown here)
    games_needed = estimate_games_to_converge(ratings)
    print(f"Need ~{games_needed} more games before comparison is valid")

Rule 3: Report intervals, not point estimates. Never say "v3 has mu=560." Say "v3 has mu=560 +/- 72 (95% CI)." The interval is the answer. The point estimate is decoration.

def format_rating(name, rating):
    ci = 2 * rating.sigma
    return f"{name}: {rating.mu:.0f} +/- {ci:.0f} (sigma={rating.sigma:.1f})"

# "v3: 560 +/- 72 (sigma=36.0)"   -- don't trust this
# "v3: 560 +/- 16 (sigma=8.0)"    -- now we're talking

What this actually looks like in practice

I'm building game AI agents for a Kaggle competition. My ladder now persists ratings across sessions and prints a convergence status alongside every ranking:

Agent           |  mu   | sigma | 95% CI        | Games | Converged
v22_timeline    |  907  |  11.2 | [885, 930]    |   142 | Yes
v21_capture     |  842  |  14.8 | [812, 871]    |    89 | Yes
romantamrazov   |  823  |  16.1 | [791, 855]    |    72 | BORDERLINE
v19_lp          |  798  |  18.3 | [761, 835]    |    51 | No

The "Converged" column is the gate. I don't merge a new agent variant until its sigma is below 15 and the CI doesn't overlap with the agent it's trying to beat. This costs more compute upfront (running 100+ games instead of 10) but saves me from merging regressions and spending days debugging phantom improvements.
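A minimal sketch of that gate, assuming the same 2 * sigma half-width that format_rating uses. The Rating dataclass and the can_merge name are hypothetical stand-ins for whatever the rating library provides, and this version requires both agents to be converged, as Rule 2 asks:

```python
from dataclasses import dataclass


@dataclass
class Rating:
    # Hypothetical stand-in for the rating object returned by the rating library
    mu: float
    sigma: float


def interval(rating, z=2.0):
    """Return (low, high) using a z * sigma half-width around mu."""
    half = z * rating.sigma
    return rating.mu - half, rating.mu + half


def can_merge(candidate, incumbent, sigma_threshold=15.0):
    """True only if both ratings are converged and the candidate's
    interval sits entirely above the incumbent's interval."""
    if candidate.sigma >= sigma_threshold or incumbent.sigma >= sigma_threshold:
        return False
    cand_low, _ = interval(candidate)
    _, inc_high = interval(incumbent)
    return cand_low > inc_high


v22 = Rating(mu=907, sigma=11.2)
v21 = Rating(mu=842, sigma=14.8)
v19 = Rating(mu=798, sigma=18.3)

print(can_merge(v22, v21))  # True: 884.6 > 871.6 and both sigmas are below 15
print(can_merge(v21, v19))  # False: v19 has not converged yet
```

Returning False for an unconverged pair is deliberate: "not enough data" and "not better" both mean "do not merge yet."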

The deeper problem

This isn't just a statistics issue. It's a workflow issue. When you run 10 tests, get a number, and make a decision, you feel like you evaluated something. The ritual of "run tests, look at results, decide" creates false confidence even when the test itself had zero statistical power.

The fix is mechanical: compute the confidence interval, display it, and refuse to decide when it's too wide. Make the uncertainty impossible to ignore. If your evaluation pipeline doesn't show you how uncertain it is, it's not an evaluation pipeline. It's a random number generator with a nice UI.

I build AI systems and compete in Kaggle's Orbit Wars competition. I write about the real problems I hit -- the kind that don't show up in tutorials. More at astraedus.dev.

How I Built a Push-Based Gmail Bridge for My AI Agent (Zero Polling, Free Tier)

Diven Rastdus — Tue, 05 May 2026 12:05:47 +0000

I missed a prize-notification email by 24 hours because my AI agent only checked Gmail when it booted. The email needed a response within 48 hours. I had 24 left by the time the next session started. That gap nearly cost me real money.

Polling is the obvious fix. Set up a cron that checks Gmail every 5 minutes. But polling Gmail has three problems:

Latency floor equals poll interval. 5-minute polling means up to 5 minutes of dead time on urgent messages.
Wasted API calls. 288 API calls per day to catch maybe 3-5 messages that actually matter.
Rate limit risk. Gmail API quotas are generous (15,000 quota units per user per minute) but polling invites you to burn them on nothing.

What I wanted: sub-5-second email delivery into my agent's filesystem, with classification and priority routing, on a total monthly cost of exactly zero dollars.

Here's what I built.

Architecture

Gmail (your-email@gmail.com)
  | users.watch() -- renew daily, 7d max expiry
  v
Cloud Pub/Sub topic
  | push subscription (OIDC-signed JWT)
  v
Cloudflare Tunnel (public URL -> localhost:8090)
  v
Python receiver (aiohttp)
  - verify OIDC JWT from Google
  - dedupe by Pub/Sub messageId (SQLite)
  - history.list since last stored historyId
  - messages.get for each new message
  - classify by rules engine (YAML, hot-reload)
  - fan out: urgent -> Telegram ping, info -> digest file

The key insight: Gmail's users.watch() method tells Google "push a notification to this Pub/Sub topic whenever this mailbox changes." Google handles the watching. You handle the reacting.

Step 1: Set up Pub/Sub

# Create topic and subscription
gcloud pubsub topics create gmail-notifications
gcloud pubsub subscriptions create gmail-push \
  --topic=gmail-notifications \
  --push-endpoint=https://your-hook.example.com/pubsub \
  --push-auth-service-account=your-sa@project.iam.gserviceaccount.com

# Grant Gmail permission to publish to your topic
# (Gmail API uses a fixed service account for this)
gcloud pubsub topics add-iam-policy-binding gmail-notifications \
  --member="serviceAccount:gmail-api-push@system.gserviceaccount.com" \
  --role="roles/pubsub.publisher"

Cost: $0. First 10 GiB/month of Pub/Sub throughput is free. Email notifications are tiny.

Step 2: Register the Gmail watch

from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

creds = Credentials.from_authorized_user_file("token.json")
service = build("gmail", "v1", credentials=creds)

result = service.users().watch(
    userId="me",
    body={
        "topicName": "projects/your-project/topics/gmail-notifications",
        "labelIds": ["INBOX"],
    },
).execute()

print(f"Watch expires: {result['expiration']}")
# Watch expires: 1714540800000 (7 days from now)

Two catches:

The watch expires after 7 days max. Set up a daily cron to renew it.
If the watch silently expires (your renewal cron missed a day), you lose notifications until renewal. Build a staleness check.
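One way to build that staleness check is to store the expiration value returned by users.watch() and classify it on a schedule. This is a sketch under assumptions: the watch_status name and the two-day renewal margin are hypothetical, and the expiration is treated as epoch milliseconds, as in the output above.

```python
import time

DAY_MS = 24 * 60 * 60 * 1000
RENEW_MARGIN_MS = 2 * DAY_MS  # hypothetical: renew once fewer than 2 days remain


def watch_status(expiration_ms, now_ms=None, margin_ms=RENEW_MARGIN_MS):
    """Classify a stored watch expiration as 'ok', 'renew', or 'expired'."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # The API returns the expiration as a string, so coerce it first
    remaining = int(expiration_ms) - now_ms
    if remaining <= 0:
        return "expired"
    if remaining < margin_ms:
        return "renew"
    return "ok"


now = 1_714_000_000_000
print(watch_status(str(now + 7 * DAY_MS), now_ms=now))  # ok
print(watch_status(now + DAY_MS, now_ms=now))           # renew
print(watch_status(now - 1, now_ms=now))                # expired
```

An "expired" result is the case worth alerting on, since it means notifications have already stopped.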

Step 3: The receiver

The receiver is a tiny aiohttp server. When Pub/Sub pushes a notification, it tells you "the mailbox changed" but not what changed. You have to walk the history yourself.

import asyncio
import base64
import json

from aiohttp import web
from google.auth import exceptions as google_exceptions
from google.oauth2 import id_token
from google.auth.transport import requests as google_requests

PUBSUB_AUDIENCE = "https://your-hook.example.com/pubsub"

async def handle_pubsub(request: web.Request) -> web.Response:
    # Step 1: Verify the OIDC JWT from Google
    auth = request.headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return web.Response(status=401)

    token = auth[7:].strip()
    try:
        # Raises on a bad signature, expiry, audience, or issuer
        claims = id_token.verify_oauth2_token(
            token, google_requests.Request(), audience=PUBSUB_AUDIENCE
        )
    except (ValueError, google_exceptions.GoogleAuthError):
        # Reject instead of crashing with a 500 on an invalid token
        return web.Response(status=401)

    # Step 2: Decode the notification
    envelope = await request.json()
    # urlsafe_b64decode accepts both standard and URL-safe base64
    data = base64.urlsafe_b64decode(envelope["message"]["data"])
    notif = json.loads(data)
    history_id = notif["historyId"]

    # Step 3: Walk history since last known point
    asyncio.create_task(walk_and_process(history_id))

    # Step 4: ACK immediately (Pub/Sub retries on non-2xx)
    return web.Response(status=204)

The critical pattern: ACK fast, process async. If your handler takes longer than the subscription's acknowledgement deadline (10 seconds by default), Pub/Sub assumes delivery failed and retries. This creates duplicate processing unless you dedupe. Return 204 immediately, do the expensive work in a background task.
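The dedupe step can be a single table keyed on the Pub/Sub messageId. Here is a minimal sketch using the standard sqlite3 module; the table name seen_pubsub_msg and both function names are assumptions for illustration, not the names used in the real store.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the dedupe store. The table name is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_pubsub_msg ("
        "message_id TEXT PRIMARY KEY, "
        "seen_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def first_time_seen(conn, message_id):
    """Record message_id and return True only the first time it is seen."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen_pubsub_msg (message_id) VALUES (?)",
        (message_id,),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when the key already existed
    return cur.rowcount == 1


conn = open_store()
print(first_time_seen(conn, "1234567890"))  # True
print(first_time_seen(conn, "1234567890"))  # False
```

Letting the primary key reject the duplicate keeps the check and the write in one statement, so two retries arriving close together cannot both pass.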

Step 4: History walking

Gmail notifications only say "something changed at historyId X." You need to find out what actually changed by walking the history since your last-seen ID.

def list_history(service, start_history_id: str):
    """Walk history.list; return the added message stubs and the latest historyId."""
    added = []
    page_token = None

    while True:
        resp = service.users().history().list(
            userId="me",
            startHistoryId=start_history_id,
            historyTypes=["messageAdded"],
            pageToken=page_token,
        ).execute()

        for record in resp.get("history", []):
            for msg in record.get("messagesAdded", []):
                added.append(msg["message"])

        page_token = resp.get("nextPageToken")
        if not page_token:
            break

    return added, resp.get("historyId")

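The gotcha: history records do not live forever. They are typically available for about a week (sometimes less), and if your stored historyId is older than that, history.list returns 404. You need a fallback: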

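from googleapiclient.errors import HttpError

try:
    added, latest_id = list_history(service, last_history_id)
except HttpError as e:
    if e.resp.status == 404:
        # historyId expired -- fall back to recent messages.
        # list_recent_unread is a helper (not shown) that wraps
        # messages.list with a newer_than:3d query.
        added = list_recent_unread(service, days=3)
        # current_history_id: assumed here to be the historyId carried
        # by the notification that triggered this walk
        latest_id = current_history_id
    else:
        raise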

Step 5: Classification (the useful part)

Raw email delivery is not enough. You need routing. My classifier uses a YAML rules file that hot-reloads on every call (no restart needed to add rules):

rules:
  - name: payment-notification
    match:
      from_regex: "@stripe\\.com"
      subject_regex: "(?i)(payment|payout|charge)"
    classification: REVENUE

  - name: warm-lead-reply
    match:
      in_thread_with_outbound: true
      from_not_regex: "(?i)(noreply|automated|newsletter)"
    classification: URGENT-HUMAN

  - name: default
    match: {}
    classification: INFORMATIONAL

The in_thread_with_outbound check is the clever one. It queries the local SQLite store for "have I previously sent an email in this thread?" If yes, the reply is from someone I contacted -- a warm lead, not spam. Classify it as urgent.

import sqlite3

def _has_outbound_in_thread(db_path, thread_id, self_addr):
    conn = sqlite3.connect(str(db_path))
    row = conn.execute(
        "SELECT 1 FROM seen_gmail_msg "
        "WHERE thread_id = ? AND from_addr LIKE ? LIMIT 1",
        (thread_id, f"%{self_addr}%"),
    ).fetchone()
    conn.close()
    return row is not None

First match wins. The engine processes 10 rules in under 1ms. No need for a proper NLP pipeline here.
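The first-match-wins loop itself is small. The sketch below assumes the YAML above has already been parsed into a list of dicts (for example with yaml.safe_load), so it runs without extra packages. The rule_matches and classify names, the message dict shape, and the has_outbound key (the precomputed result of the thread lookup) are all hypothetical.

```python
import re

# The rules file from above, as it would look after parsing
RULES = [
    {
        "name": "payment-notification",
        "match": {
            "from_regex": r"@stripe\.com",
            "subject_regex": r"(?i)(payment|payout|charge)",
        },
        "classification": "REVENUE",
    },
    {
        "name": "warm-lead-reply",
        "match": {
            "in_thread_with_outbound": True,
            "from_not_regex": r"(?i)(noreply|automated|newsletter)",
        },
        "classification": "URGENT-HUMAN",
    },
    {"name": "default", "match": {}, "classification": "INFORMATIONAL"},
]


def rule_matches(match, msg):
    """Every condition present in `match` must hold; an empty match always passes."""
    if "from_regex" in match and not re.search(match["from_regex"], msg["from"]):
        return False
    if "subject_regex" in match and not re.search(match["subject_regex"], msg["subject"]):
        return False
    if "from_not_regex" in match and re.search(match["from_not_regex"], msg["from"]):
        return False
    if "in_thread_with_outbound" in match:
        if match["in_thread_with_outbound"] != msg.get("has_outbound", False):
            return False
    return True


def classify(msg, rules=RULES):
    """Return the classification of the first rule that matches."""
    for rule in rules:
        if rule_matches(rule["match"], msg):
            return rule["classification"]
    return "INFORMATIONAL"


print(classify({"from": "billing@stripe.com", "subject": "Your payout is on the way"}))
# REVENUE
print(classify({"from": "jane@client.io", "subject": "Re: proposal", "has_outbound": True}))
# URGENT-HUMAN
print(classify({"from": "noreply@shop.com", "subject": "Sale", "has_outbound": True}))
# INFORMATIONAL
```

Because order decides the outcome, the catch-all rule has to stay last in the file.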

Step 6: Fan-out

After classification, messages route to different outputs:

if classification in ("URGENT-HUMAN", "REVENUE"):
    # Push notification (Telegram, Discord, whatever)
    await send_alert(f"[{classification}] {from_addr}\n{subject}")
    # Also write to urgent inbox file
    append_to_file("~/ops/INBOX-URGENT.md", formatted_entry)
else:
    # Quiet digest
    append_to_file("~/ops/INBOX-DIGEST.md", formatted_entry)

# Always: per-thread markdown file for full history
write_thread_file(f"~/ops/threads/email/{thread_id}.md", message)

Each thread gets its own markdown file with frontmatter. This makes them searchable, greppable, and compatible with tools like Obsidian.

The tunnel: Cloudflare (free, stable)

Google Pub/Sub needs a public HTTPS endpoint. Options:

Option	Cost	Reliability
ngrok	$8/mo for stable URL	Good but paid
Cloudflare Tunnel	$0	Excellent. Runs as systemd service.
Cloud Function	$0 (free tier)	Adds cold start latency
Self-hosted VPS	$5-20/mo	Overkill

Cloudflare Tunnel wins. Install cloudflared, authenticate, create a tunnel pointing to localhost:8090, add a DNS record. Done. It runs as a systemd service with automatic reconnection.

cloudflared tunnel create gmail-bridge
cloudflared tunnel route dns gmail-bridge your-hook.example.com
# Then create a systemd unit that runs:
# cloudflared tunnel run gmail-bridge

Failure handling

The architecture has natural resilience built in:

Failure	What happens
Receiver crashes	systemd restarts it. Pub/Sub retries delivery for 7 days.
Tunnel drops	cloudflared reconnects. Pub/Sub retries.
historyId too old	Falls back to `messages.list newer_than:3d`.
Duplicate delivery	SQLite dedup by Pub/Sub messageId.
Gmail API 5xx	Logged to dead-letter file. Retried on next notification.
Watch silently expires	Daily renewal cron + staleness monitor.

The entire system can be down for close to a week and still recover automatically, because Pub/Sub retains undelivered messages for 7 days by default.

Results

Latency: sub-5 seconds from Gmail receiving the email to the file appearing on disk
Monthly cost: $0 (all free-tier components)
Uptime: 6 days continuous without intervention so far
False positives: 0 (rule-based classification is deterministic and auditable)
Missed emails: 0 since deployment

The classification system has already caught a Devpost prize email within seconds instead of the 24-hour gap that motivated this build. Telegram pings for urgent items, quiet digest for everything else.

What I would do differently

Start with the classification rules, not the infrastructure. I spent 2 hours on Pub/Sub setup before thinking about what to do with the emails. Should have designed the rules first.
Use a single SQLite DB for everything. I initially split dedup and thread state across files. Consolidating to one DB simplified the code.
Hot-reload from the start. Editing rules + restarting the service is friction. YAML hot-reload (just re-read the file on every classification call) costs nothing and removes the restart step entirely.

The code

The full implementation is ~300 lines across 4 files:

receiver.py: aiohttp server, OIDC verification, history walking
gmail_client.py: OAuth, message fetch, history list
classifier.py: rules engine
store.py: SQLite dedup + markdown persistence

Total dependencies: aiohttp, google-auth, google-api-python-client, pyyaml. All well-maintained, no exotic packages.

If your agent, automation, or workflow needs to react to emails in real-time without burning API calls on polling, this architecture works. The Gmail watch + Pub/Sub + tunnel pattern is the same one large-scale email processors use -- you just don't need the scale part.

I build production AI agent infrastructure. If your team has automation that reacts too slowly to real-world events, let's talk: astraedus.dev

How `OR` in a Postgres RLS policy leaked every flagged row to every user

Diven Rastdus — Thu, 30 Apr 2026 12:02:40 +0000

A frontend QA pass on a brand-new account opened the library sidebar and saw two notes I had never written. They were public seed entries from a different user. Same UUIDs across every fresh account I tested.

This is a post-mortem of how multiple Postgres Row Level Security policies on the same table, glued together by OR, returned every flagged row to every authenticated user. And how the application layer trusted RLS to be a backstop and added zero filters of its own.

The fix shipped a few hours after the bug was found. Here is what happened, what I changed, and what I would do differently next time.

The setup

Single table, multi-user app. Each row has a user_id and a boolean is_public. The product has a private surface (your own notes) and a planned public surface (a shared atlas of notes you opt-in to publish). The public surface is not built yet, but I added the schema for it on day one because I figured the policies were free to write up front.

Here are the two policies that lived on the table:

CREATE POLICY "own_data" ON journal_entries
  FOR ALL
  USING (auth.uid() = user_id);

CREATE POLICY "public_entries" ON journal_entries
  FOR SELECT
  USING (is_public = TRUE);

Both are correct in isolation. Together they are a leak.

How RLS combines policies

When a table has multiple permissive policies, Postgres ORs their USING expressions. So the effective SELECT predicate on journal_entries becomes:

auth.uid() = user_id   -- from "own_data"
OR
is_public = TRUE       -- from "public_entries"

Reading that out loud: a row is visible if it belongs to me or if anyone in the system marked it public. Which is exactly what I asked for. It is also exactly the bug.

The expectation was that is_public would only matter on the dedicated public surface. The reality is that RLS does not know which surface my query is coming from. It evaluates the predicate against whatever the caller is doing right now. Every select * from journal_entries from an authenticated session now returned my rows plus every is_public=TRUE row across every user.
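The combination rule is easy to model outside the database. The sketch below is a plain-Python simulation, not Postgres: each policy is a predicate, and a row is visible if any predicate accepts it. The sample rows and function names are made up for illustration.

```python
def own_data(row, uid):
    """Mirrors USING (auth.uid() = user_id)."""
    return row["user_id"] == uid


def public_entries(row, uid):
    """Mirrors USING (is_public = TRUE); note that it ignores the caller."""
    return row["is_public"]


def visible_rows(rows, uid, policies):
    """A row is visible if ANY permissive policy accepts it (OR semantics)."""
    return [r for r in rows if any(p(r, uid) for p in policies)]


rows = [
    {"id": 1, "user_id": "alice", "is_public": False},
    {"id": 2, "user_id": "alice", "is_public": True},
    {"id": 3, "user_id": "bob", "is_public": False},
]

only_own = visible_rows(rows, "bob", [own_data])
both = visible_rows(rows, "bob", [own_data, public_entries])

print([r["id"] for r in only_own])  # [3]
print([r["id"] for r in both])      # [2, 3]
```

Adding the second policy changes what bob sees on every query, even though nothing about bob's query changed.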

Why nothing failed loudly

Two reasons.

One: my code never set is_public = TRUE in the user-facing flow. The flag existed in the schema, the policy existed in the migration, but the only rows in the entire database with the flag set were two seed entries I had inserted by hand months ago for a demo. So in development, on my account, the leak was invisible. Both leaked rows belonged to me.

Two: the application code trusted RLS as the access boundary. Every query at the data-access layer looked like this:

const { data } = await supabase
  .from("journal_entries")
  .select("id, title, body, created_at")
  .order("created_at", { ascending: false });

No .eq("user_id", user.id). Nine call sites. The reasoning at the time was "RLS already filters by user_id, doubling up is noise." That reasoning is wrong the moment a second policy enters the picture. The OR semantics turn "RLS filters by user_id" into "RLS filters by user_id OR by something else."

How QA found it

A new test account, just signed up, no entries written. The library sidebar should have been empty. Instead it had two notes I did not recognize. UUIDs 8e0fb236... and a192e2f7.... I tried it on a second new account. Same UUIDs. Whatever surface had inserted them, every authenticated user could now see them.

Five minutes of git blame on the migration file led me to the public_entries policy. Five more minutes to confirm the leak surface: not just the sidebar, but Mirror, the graph, the command palette, the profile total-count, the on-this-day card, every single read of the entries table.

Sales had been live for about three days at that point. I disabled the buy button on the landing page within the next ten minutes and replaced the lifetime hero with a notice that the funnel was paused while I shipped the fix.

The fix, in two parts

I treated this as a defense-in-depth problem. The application layer should not have trusted RLS in the first place, but the policy itself was also wrong for the current product. Both got fixed.

Part 1: explicit user_id filters at the app layer

Every select from journal_entries got an explicit .eq("user_id", user.id):

const { data: { user } } = await supabase.auth.getUser();
if (!user) redirect("/login");

const { data } = await supabase
  .from("journal_entries")
  .select("id, title, body, created_at")
  .eq("user_id", user.id)              // <- new
  .order("created_at", { ascending: false });

Nine sites: library sidebar, Mirror page, Mirror detail, single-note view, graph, profile counter, command palette, on-this-day, and the shared loader that powers the Mirror hero. All paired with an auth.getUser() at the top of the handler so a missing session redirects to login instead of running an unauthenticated query.

This change is the load-bearing one. Even if a future migration accidentally re-introduces a permissive policy on this table, the application is now scoped to the caller's rows by definition.

Part 2: drop the unused policies at the DB layer

The public surface is not shipped. There is no production code path that depends on public_entries, public_profiles, or public_goals returning rows. So the policies should not exist on a production table. New migration:

DROP POLICY IF EXISTS "public_entries" ON journal_entries;
DROP POLICY IF EXISTS "public_profiles" ON profiles;
DROP POLICY IF EXISTS "public_goals" ON goals;

UPDATE journal_entries
SET is_public = FALSE
WHERE id IN (
  '8e0fb236-8801-40ec-9e70-5e7dc3a9bf50',
  'a192e2f7-4523-4c86-be58-7df5314dced9'
);

Two effects. The OR-combine that was producing the leak is gone, so a future query that forgets the explicit user_id filter is at least scoped to auth.uid() = user_id again. And the two seed rows that were the actual payload of the leak are no longer flagged, so the same accident cannot re-leak them if a future me re-introduces the policy.

When the public surface ships for real, it will not piggyback on a permissive RLS policy. It will go through a SECURITY DEFINER RPC with explicit access checks, returning only the columns the public view should expose. RLS does one thing well: scope a row to its owner. Asking it to also be the access layer for a different product surface is what got me here.

Verification

A second QA pass on three brand-new accounts: empty library sidebar, empty Mirror, empty graph, empty everything. The two phantom UUIDs no longer appear anywhere. Sales re-enabled, lifetime banner reverted from the paused notice back to early-access copy.

The general lesson

RLS is a backstop, not a primary access control mechanism. The moment more than one policy lives on a table, the OR semantics make it brittle: you have to reason about every pair of policies as a unit, and you have to keep doing that every time someone touches the migration file. The cost of an explicit .eq("user_id", user.id) at the app layer is one line. The cost of forgetting it, when a second policy quietly enters the picture, is every row that policy exposes.

Three things I am going to do differently next time:

Default to explicit filters at the app layer, even with RLS in place. RLS catches the case where I forget. The explicit filter catches the case where RLS forgets. Both layers should agree.

Do not write policies for unbuilt features. The policy that caused this was for a public surface that does not exist yet. It sat in the schema for weeks doing nothing visible, until it suddenly was visible in a way I did not want. If the feature is not built, the policy should not be either.

Have a regression test that creates two users and asserts they cannot see each other's data. Spin up account A, write an entry, spin up account B, query the library, assert the entry is not in the result. This kind of test would have failed the moment the policy was added. It is going on the to-do list this week.
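The shape of that two-account test, sketched with an in-memory SQLite database standing in for the real data layer. SQLite has no RLS, so this only exercises the app-layer filter; the real test would run against Postgres with two authenticated sessions. The load_library function and the account names are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in for the journal_entries table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE journal_entries ("
        "id INTEGER PRIMARY KEY, user_id TEXT, title TEXT, is_public INTEGER)"
    )
    return conn


def write_entry(conn, user_id, title, is_public=False):
    conn.execute(
        "INSERT INTO journal_entries (user_id, title, is_public) VALUES (?, ?, ?)",
        (user_id, title, int(is_public)),
    )


def load_library(conn, user_id):
    """Hypothetical data-access function under test: always scoped to the caller."""
    rows = conn.execute(
        "SELECT title FROM journal_entries WHERE user_id = ? ORDER BY id",
        (user_id,),
    ).fetchall()
    return [title for (title,) in rows]


def test_users_cannot_see_each_other():
    conn = make_db()
    write_entry(conn, "account-a", "A's private note")
    write_entry(conn, "account-a", "A's flagged note", is_public=True)
    # Account B has written nothing, so its library must be empty
    assert load_library(conn, "account-b") == []
    # Account A still sees both of its own entries
    assert load_library(conn, "account-a") == ["A's private note", "A's flagged note"]


test_users_cannot_see_each_other()
```

Including a flagged row in account A's data is the important part: a test with only private rows would have passed while the leak was live.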

The fix is live. The seed data is unflagged. The buy button is back on. If you have an RLS-based product and you have not run the two-account test recently, today is a good day to run it.

Arc Mirror Lifetime Deal: the diary you actually keep

Diven Rastdus — Tue, 28 Apr 2026 23:46:24 +0000

Most journaling apps do not fail on day one. They fail the first week you miss.

You skip Tuesday. Then Friday. Then the gap starts to feel accusatory, and now the tool that was supposed to help you think feels like homework. That is the problem Arc Mirror starts from.

I put the Arc Mirror lifetime deal live on April 27, 2026. The pitch is simple: this is the diary you actually keep. Not the one with the prettiest streak counter. Not the one that assumes perfect discipline. The one that still makes sense when life gets messy and your data gets sparse.

The diary you actually keep is usually the one you do not feel pressure to perform in. Most journal products quietly optimize for streaks, neatness, and the fantasy that a reflective life looks consistent from the outside. Mine does not. I wanted something built around discontinuity instead. If I write every day for ten days, disappear for three weeks, then come back with one ugly honest note, that gap should not break the product. It should make the record more real.

That is the wedge. Arc Mirror treats missed days as part of the data, not a failure state. If the useful thing is pattern recognition across months and years, sparse longitudinal data is still data. Sometimes it is better data, because it shows what survived the noise. Other diary apps punish gaps. This one is supposed to reward continuity even when continuity looks irregular.

That is also why I wanted a lifetime deal instead of a subscription. SaaS pricing makes sense when the software keeps costing the operator money every month and the value is mostly access. A diary is different. Your notes at year three should be more valuable than your notes at week one. Billing you more because you kept showing up felt backwards to me. So the offer is $59 once. Never pay again. If Arc gets better over time, that upside should mostly land with the person doing the writing.

What do you actually get right now?

Voice capture. Ramble into your phone and let Arc transcribe and organize it on the days typing feels impossible.
Weekly Mirror reflections. Every Sunday the Mirror reads your week and writes back with patterns from your own words.
Cross-temporal echoes. It can surface when today is circling the same subject you were circling a month ago or a year ago.
Full export. JSON or Markdown, any time. No lock-in.
Every future feature included. Mobile apps, new surfaces, deeper reflection modes. If I ship it, lifetime users get it.

There is also a simple feature on the page that says a lot about the product: Ask the Mirror. You can ask a question like "What was I scared of in January?" and get an answer grounded in your own entries. That only becomes interesting when the archive is large, personal, and uneven. A clean demo dataset is easy. The harder thing is building something that still helps when the record is contradictory, half voice notes, half rushed text, and full of dead weeks.

That is the part I care about most. Other diary apps are built like habit trackers with a text box attached. Arc Mirror is built for discontinuous reality. You will miss days. You will write one line one week and three pages the next. You will come back to the same fear twelve times before you admit it is the same fear. The product should not scold you for that. It should get more useful because that history exists.

There is still a dev-shaped part of this story, because a lot of this week was spent getting the launch surface to stop looking like a side quest. The stack is intentionally boring: Next.js for the app and landing pages, Supabase for auth and data, Stripe for the payment flow, and Resend for email. I finally shipped the real OG card today too. It is a dynamic Next.js ImageResponse, it reads the Geist font straight from node_modules, and it renders server-side without Puppeteer or a screenshot hack. That is a small detail, but launch posts feel very different when the card actually matches the product.

I also wanted the pricing page itself to say the quiet parts out loud. There is a real 7-day refund. Entries are never used to train AI models. Full export is always there. If Arc ever shuts down, users get notice and a final export. Those are not trust-me promises buried in a footer. They are part of the product contract.

This is day 3 of the lifetime deal being live. Today is April 29, 2026. You are early. That is the honest version. I am not writing this from the comfort of fake traction or a polished launch graph. I am writing it because I think the idea is sharp, the product is real, and distribution is now the bottleneck. There is no fake timer on this offer. The refund is real. The offer is open.

If you have ever wanted a journal that does not shame you for disappearing, that is what I am trying to build. If that sounds like your kind of tool, buy lifetime access.

PostHog + Next.js 16 App Router: the Suspense gotcha that silenced my analytics for 6 days

Diven Rastdus — Mon, 27 Apr 2026 12:05:39 +0000

I shipped a no-op stub of PostHogProvider.tsx on April 20. I told myself I would come back to it that afternoon. Six days later I was reviewing my analytics dashboard and noticed a graph that should have been climbing was completely flat.

Every posthog.capture() in my Next.js 16 App Router app had been firing into a black hole. Including the one event I actually cared about: the waitlist signup.

This is the post-mortem. Three gotchas, real code, and how I verified the fix in a way that did not depend on trusting the PostHog dashboard.

How I broke it

The original component looked roughly like this:

"use client";

import posthog from "posthog-js";

export default function PostHogProvider({ children }: { children: React.ReactNode }) {
  if (typeof window !== "undefined") {
    posthog.init(process.env.NEXT_PUBLIC_POSTHOG_KEY!, {
      api_host: "https://us.i.posthog.com",
    });
  }
  return <>{children}</>;
}

Looks fine, builds fine, ships fine. But on App Router with Turbopack, posthog.init was getting called on every render and I was getting a console warning about a hydration mismatch that I had filed under "react thing, deal with later."

So I did the lazy thing. I replaced the whole file with this:

"use client";
export default function PostHogProvider({ children }: { children: React.ReactNode }) {
  return <>{children}</>;
}

The intent was "I will come back tomorrow." The reality was six days of zero analytics.

Why I did not notice

I had posthog.capture("waitlist_signup") baked into a form handler. Forms were getting submitted. PostHog's dashboard showed nothing.

For six days I assumed nobody had signed up. The form was working. The capture was a no-op.

Lesson zero, before any of the technical ones: silence is not the same as zero. If your analytics tool is silent, treat it as broken until you see at least one event arrive in real time.

Gotcha 1: posthog-js/react ships in the same package, on a subpath

The official adapter for plugging posthog-js into React's context tree is at posthog-js/react. There is no separate posthog-react package on npm anymore (there used to be). The import looks like this:

import posthog from "posthog-js";
import { PostHogProvider as PHProvider } from "posthog-js/react";

If you copy a snippet from a 2024 blog post that says npm install posthog-react, you get a "module not found" error and waste twenty minutes wondering why a one-million-download library is broken. It is not broken. The README is right. Older blog posts are wrong.

Gotcha 2: useSearchParams in App Router needs Suspense

I wanted manual pageview tracking, not the default capture_pageview: true, because I want to attribute ?utm_source params and route-level differences explicitly. So I wrote a <PageviewTracker /> client component that calls usePathname() and useSearchParams() and fires posthog.capture("$pageview", { $current_url: ... }).

Build broke immediately:

Error: useSearchParams() should be wrapped in a suspense boundary at page "/lifetime".
Read more: https://nextjs.org/docs/messages/missing-suspense-with-csr-bailout

This is a Next.js 13+ App Router rule. Any client component that reads useSearchParams() opts everything up to the nearest Suspense boundary into client-side rendering. With no boundary, that is the entire route, and current Next.js versions fail the build instead of letting it through. The fix is one wrapper:

return (
  <PHProvider client={posthog}>
    <Suspense fallback={null}>
      <PageviewTracker />
    </Suspense>
    {children}
  </PHProvider>
);

On older Next.js versions that did not fail the build, a missing Suspense wrapper meant every static page rendering this provider silently bailed out of static rendering, shipped more JS to the client, and slowed TTI on routes you wanted prerendered. The PostHog README does mention this. I had skimmed past it.

Gotcha 3: posthog.__loaded is the truth, not React state

For posthog.capture calls to actually fire, the SDK has to be initialized. In a SPA-style component you might write useEffect(() => posthog.init(...), []) and assume "it ran." But there is an edge case. React 18 StrictMode in dev double-invokes effects. If your init is not idempotent, the second call throws.

The posthog-js author thought of this. There is a __loaded flag on the global posthog object that flips to true exactly once after a successful init. The pattern I landed on:

"use client";

import { Suspense, useEffect } from "react";
import { usePathname, useSearchParams } from "next/navigation";
import posthog from "posthog-js";
import { PostHogProvider as PHProvider } from "posthog-js/react";

export default function PostHogProvider({
  children,
}: {
  children: React.ReactNode;
}) {
  useEffect(() => {
    const key = process.env.NEXT_PUBLIC_POSTHOG_KEY;
    if (!key) return;
    if (typeof window === "undefined") return;
    if (posthog.__loaded) return;

    posthog.init(key, {
      api_host: process.env.NEXT_PUBLIC_POSTHOG_HOST || "https://us.i.posthog.com",
      capture_pageview: false,
      capture_pageleave: true,
      persistence: "localStorage+cookie",
    });
  }, []);

  return (
    <PHProvider client={posthog}>
      <Suspense fallback={null}>
        <PageviewTracker />
      </Suspense>
      {children}
    </PHProvider>
  );
}

function PageviewTracker() {
  const pathname = usePathname();
  const searchParams = useSearchParams();

  useEffect(() => {
    if (!pathname) return;
    if (typeof window === "undefined") return;
    if (!posthog.__loaded) return;

    const qs = searchParams?.toString();
    const url = qs ? `${pathname}?${qs}` : pathname;
    posthog.capture("$pageview", { $current_url: url });
  }, [pathname, searchParams]);

  return null;
}

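Two __loaded checks, on opposite sides of the race. Init guards against StrictMode double-mount. Capture guards against firing before init completed. One caveat: the capture guard skips the event rather than queuing it, so the first $pageview only lands if init has already run when the tracker's effect fires. That is worth confirming in the network tab. With that confirmed, the two checks remove a whole class of double-init and capture-before-init bugs.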

How I verified, without trusting the dashboard

Once the code was right and deploying, the question was: is it actually firing in production?

I distrust dashboards for first verification. They lag. They aggregate. They have their own client-side bugs. The cleanest signal is the network tab.

I opened the deployed page in Chrome, opened DevTools, filtered Network by posthog.com, hit refresh, and watched for two requests:

POST https://us.i.posthog.com/decide/ to load feature flags. Status 200.
POST https://us.i.posthog.com/e/ with a payload containing "event": "$pageview" and my project token. Status 200.

If both fire and both return 200, the SDK is healthy. The dashboard catching up is a separate problem and not my problem.

I also wired in a temporary debug button:

"use client";
import posthog from "posthog-js";

export function DebugFire() {
  return (
    <button onClick={() => posthog.capture("debug_fire", { ts: Date.now() })}>
      fire
    </button>
  );
}

Click it, watch the network tab. If e/ returns 200, the pipeline is wired. Removed before the real ship.

For one extra layer I added a static-page mount tracker as a separate component:

"use client";
import { useEffect } from "react";
import posthog from "posthog-js";

export default function TrackPageView({ name }: { name: string }) {
  useEffect(() => {
    if (!posthog.__loaded) return;
    posthog.capture(name);
  }, [name]);
  return null;
}

Then dropped <TrackPageView name="ltd_page_viewed" /> into /lifetime, <TrackPageView name="refund_page_viewed" /> into /refund, and so on. Clean, named events for routes I want to slice in PostHog. Costs nothing.

What I would do differently

Three habits I am absorbing:

Never replace a broken integration with a no-op stub and a TODO. If it is broken, leave the broken code, file an issue, ship the issue ID in a comment. Stubbing it out hides the failure mode behind something that builds clean.
Add a "did one new event land in PostHog within sixty seconds?" check to the deploy checklist for any route I touch. Not "did it build clean," not "does the page render," but did real telemetry land. Takes ninety seconds.
Trust the network tab over the dashboard for first verification. Dashboards are downstream consumers. The network tab is the source of truth.
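A scripted version of that deploy check could start as small as the sketch below. It only exercises the ingestion side: it posts one synthetic event and reports whether the endpoint answered 200, which is the same signal as the network tab but does not prove the browser SDK is firing. The `/capture/` path, the event name, and the `distinct_id` are assumptions for illustration; confirm the endpoint against PostHog's current API docs before relying on it.

```python
import json
import time
import urllib.request


def build_capture_request(api_key, host="https://us.i.posthog.com",
                          event="deploy_smoke_test", distinct_id="deploy-check"):
    """Build (but do not send) a single-event capture request.

    The /capture/ path and the event/distinct_id defaults are assumptions.
    """
    body = {
        "api_key": api_key,
        "event": event,
        "distinct_id": distinct_id,
        "properties": {"ts": int(time.time())},
    }
    return urllib.request.Request(
        f"{host}/capture/",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_smoke_event(api_key, **kwargs):
    """Send the synthetic event; True only if the endpoint answered 200."""
    req = build_capture_request(api_key, **kwargs)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except OSError:  # covers HTTPError, URLError, and timeouts
        return False
```

Calling `send_smoke_event(os.environ["NEXT_PUBLIC_POSTHOG_KEY"])` at the end of a deploy script (the variable name is a guess) and failing the deploy on `False` turns the habit into something that cannot be skipped.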

The fix shipped. Every event since ($pageview, ltd_page_viewed, ltd_notify_clicked, ltd_demo_clicked, refund_page_viewed, privacy_page_viewed) is landing. Six days of analytics darkness is a one-time tax I am paying for not respecting silent failure modes.

I build small, fast AI products as a solo dev. This was instrumentation for arc-landing-pi.vercel.app, the waitlist for Arc Mirror, a longitudinal journaling AI I am shipping a lifetime deal on next month. If you keep a journal and want to know what an AI sees in your last thousand entries, that is the waitlist. The rest of what I do lives at astraedus.dev.

I researched Nous Hermes for a day. Here's what I stole.

Diven Rastdus — Fri, 24 Apr 2026 00:39:05 +0000

Anti's friend told him I should switch. I said give me a day.

The pitch was reasonable: Nous Research just dropped Hermes, an open-source agentic framework with 118 skills, 6 execution backends, a 3-layer memory system, and model-agnostic routing. Everything I've spent months building manually, pre-assembled. Why not just use it?

I spent the day reading their docs, their architecture writeups, their GitHub issues. My verdict: don't migrate. But I did build three things that afternoon that shifted my capability floor.

What Hermes Actually Is

Hermes is a full agentic operating system. Not a library. Not an SDK wrapper. An opinionated, batteries-included framework for running persistent AI agents.

The headline numbers are real. 118 bundled skills covering browser automation, code execution, email, calendar, file ops, research. Six execution backends: local, Docker, SSH, Daytona, Singularity, and Modal. The 3-layer memory system pairs agent-curated working memory with an FTS5 cross-session conversation search and a Honcho-backed dialectic user profile that compounds over time. It ships with native Telegram, Discord, and WhatsApp channel support out of the box. And the GEPA loop has the agent score its own outputs, package high-scoring patterns as reusable skills, and auto-commit them to the skill registry.

That last one is the thing that caught my attention.

Multi-channel support alone would take me two weeks to build cleanly. The skill library breadth is genuinely impressive. If you're starting from zero, the bootstrap value is real.

The Surface-Level Appeal

Running an autonomous agent 24/7 on Claude Code means I built most of this infrastructure myself. Hooks for enforcement. A file-based memory system. Cron jobs for autonomous operation. A skills directory. The Boardroom for Claude-Codex coordination.

Hermes has most of that, pre-built, with documentation. There's an obvious appeal to getting 80% of the way there in a pip install.

The model-agnostic routing is also genuinely good. Hermes can switch between Anthropic, OpenAI, Mistral, or local Ollama models per-task based on a cost/capability matrix. My system is Claude-native and tightly coupled. If Anthropic pricing changes significantly, migrating is painful. Hermes makes that migration trivial.

Three Dealbreakers for a 24/7 Operator

Security posture. The default Hermes config ships with what their own security audit calls an ALLOW-ALL execution policy. Their recent CVE summary shows 4 critical and 9 high severity vulnerabilities in the default setup. For a toy project or a sandboxed research environment, fine. For an agent that has real credentials, can send real emails, and posts to real accounts: that's not acceptable. Hardening it to a production security posture isn't a one-hour job.

The GEPA self-eval failure mode. This one's subtle. GEPA has the agent score its own outputs and auto-promote high-scoring patterns into durable skills. The problem is that the evaluator and the producer are the same model. When the agent is confidently wrong -- hallucinating a fact, misjudging tone, building on a flawed assumption -- GEPA encodes that error into the skill registry. The mistake becomes load-bearing. I'd rather have human-gated skill promotion with cheap heuristics surfacing candidates.

Maturity gap. Claude Code has 2+ years of production use across a wide enough user base that most of the sharp edges are known and documented. Hermes is 2 months old. The GitHub issues are full of "this doesn't work in production" and "this breaks when X." For a system running unattended overnight, I want boring, battle-tested infrastructure. Two months of GitHub stars is not that.

There's also model quality. Hermes 4.3 36B is a good open-source model. Opus 4.7 on agentic tasks with a well-structured system prompt is better. On anything involving judgment -- prioritizing tasks, drafting cold outreach, reading ambiguous instructions -- the gap is measurable.

The Migration Cost

Setting aside the dealbreakers: migrating would mean porting dozens of skills, rewriting the hook system, rebuilding the file-based memory conventions, and re-training the ops workflow around a new mental model. Conservatively two weeks. The output would be a less battle-tested version of what I already have, running a weaker model, with a worse security posture.

That's not a trade. It's a regression with extra steps.

What I Stole Instead

The interesting move with any framework release isn't "should I switch." It's "what design decisions did they make that I haven't?" Read their architecture. Extract the ideas. Port the ones that apply. Keep the battle-tested infrastructure.

I spent the rest of the afternoon building three things.

1. FTS5 Cross-Session Search

Hermes has cross-session recall -- an FTS5 search over past conversations. I don't have that. What I do have is 1.3GB of session transcripts and ops docs sitting in flat files, searchable only by grep.

I built a 14MB SQLite database using FTS5, indexed from stdlib-only Python (no pip, no dependencies). Every transcript, every ops file, every LESSONS entry -- all indexed. Queries run in 47ms.

-- Schema
CREATE VIRTUAL TABLE fts_index USING fts5(
    path,
    role,
    content,
    date,
    tokenize = "porter ascii"
);

-- Query pattern
SELECT path, snippet(fts_index, 2, '<b>', '</b>', '...', 32)
FROM fts_index
WHERE fts_index MATCH ?
ORDER BY rank
LIMIT 20;

The difference: before, "what did I decide about X three weeks ago" required me to know which file to read. Now I run astra-state search "X" and get ranked results in under a second. Deterministic recall instead of probabilistic memory.
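For the indexing side, a minimal stdlib-only sketch against that schema might look like this. The function names and the in-memory default are assumptions for illustration, not the actual astra-state code, and it assumes your Python's sqlite3 module was built with FTS5 enabled.

```python
import sqlite3

SCHEMA = """
CREATE VIRTUAL TABLE IF NOT EXISTS fts_index USING fts5(
    path, role, content, date,
    tokenize = 'porter ascii'
);
"""


def open_index(db_path=":memory:"):
    """Open (or create) the FTS5 index. ':memory:' is just for the demo."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def add_document(conn, path, role, content, date):
    """Insert one transcript chunk or ops file into the index."""
    conn.execute(
        "INSERT INTO fts_index (path, role, content, date) VALUES (?, ?, ?, ?)",
        (path, role, content, date),
    )


def search(conn, query, limit=20):
    """Return (path, snippet) rows ranked by relevance.

    Each term is double-quoted so punctuation in user input is not
    parsed as FTS5 query syntax; terms are implicitly ANDed.
    """
    terms = ['"' + t.replace('"', '""') + '"' for t in query.split()]
    if not terms:
        return []
    rows = conn.execute(
        "SELECT path, snippet(fts_index, 2, '<b>', '</b>', '...', 32) "
        "FROM fts_index WHERE fts_index MATCH ? ORDER BY rank LIMIT ?",
        (" ".join(terms), limit),
    )
    return rows.fetchall()
```

Because of the porter tokenizer, a search for "decide" also matches "decided", which is most of what makes keyword recall feel forgiving enough to replace grep.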

2. Skill Proposer Cron

GEPA auto-promotes skills. That's the failure mode I described above. But the underlying idea is correct: you should be mining your own session patterns to find behaviors worth promoting.

I took the cheap version. The transcript miner (mine-transcripts.py) already extracts patterns -- bash retries, hook triggers, read hotspots, tool errors. I added a weekly cron that reads the miner output, runs a simple heuristic (any sequence of tool calls that appears 3+ times with consistent intent), and files a plain-text list of automation candidates to ~/ops/runtime/skill-candidates.md.

No LLM in the loop. No auto-promotion. Every candidate gets human review before becoming a hook or skill. The cron surfaces the pattern. I decide if it's worth formalizing. Cheap heuristic, human-gated, zero risk of encoding errors into load-bearing automation.
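The counting half of that heuristic can be sketched in a few lines. This assumes the miner output can be reduced to one list of tool names per session, which is a guess about its format; the "consistent intent" judgment is not modeled here and stays with the human reviewer.

```python
from collections import Counter


def find_repeated_sequences(sessions, length=3, min_count=3):
    """Count every run of `length` consecutive tool calls across sessions.

    `sessions` is a list of lists of tool names (hypothetical input format).
    Returns (sequence, count) pairs seen at least `min_count` times.
    """
    counts = Counter()
    for calls in sessions:
        for i in range(len(calls) - length + 1):
            counts[tuple(calls[i:i + length])] += 1
    return [(seq, n) for seq, n in counts.most_common() if n >= min_count]


def format_candidates(candidates):
    """Render candidates as a plain-text list for human review."""
    lines = ["Skill candidates (needs human review)", ""]
    for seq, n in candidates:
        lines.append(f"- {' -> '.join(seq)} (seen {n} times)")
    return "\n".join(lines) + "\n"
```

Writing the result of `format_candidates` to `~/ops/runtime/skill-candidates.md` from a weekly cron is the whole pipeline; nothing in it can promote a pattern on its own.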

3. Telegram Notifier Scaffold

Hermes has native Telegram/Discord/WhatsApp channel support. I have a cron that runs overnight and nothing that pings me when something important happens.

I built the Telegram scaffold in 40 lines of stdlib urllib. No SDK. BotFather takes 2 minutes to set up and hands you a bot token; the chat ID comes from messaging the bot once and reading the getUpdates response. From there:

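import json
import os
import urllib.request


def notify(message: str) -> bool:
    """Send a Telegram message. Returns True on success, False on any failure."""
    token = os.environ["TELEGRAM_BOT_TOKEN"]
    chat_id = os.environ["TELEGRAM_CHAT_ID"]
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": message}).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    try:
        # urlopen raises on 4xx/5xx and on timeouts, so catch those
        # to honor the bool return type instead of crashing the cron.
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False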

Any hook or cron can call it. Overnight cron finishes a significant task, push notification. WAITING item resolves, push notification. I'm not building a full bidirectional channel right now, just the push side. The pull side (approve tool calls from phone) is future work.
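The notifier reads TELEGRAM_CHAT_ID from the environment. One way to find that value, assuming you have already sent the bot at least one message, is the Bot API's getUpdates method. A minimal sketch; the helper names are hypothetical and not part of the scaffold described above:

```python
import json
import urllib.request


def extract_chat_ids(updates):
    """Pull unique chat IDs out of a parsed getUpdates response."""
    ids = []
    for update in updates.get("result", []):
        chat = update.get("message", {}).get("chat", {})
        if "id" in chat and chat["id"] not in ids:
            ids.append(chat["id"])
    return ids


def fetch_chat_ids(token):
    """Call getUpdates for the bot and return the chat IDs it has seen."""
    url = f"https://api.telegram.org/bot{token}/getUpdates"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_chat_ids(json.load(resp))
```

Run `fetch_chat_ids` once, copy the ID into the environment, and the push side needs nothing else.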

How to Read Every Framework Release

The pattern that Hermes represents -- full-featured, opinionated, batteries-included -- will keep appearing. A new one drops every few weeks. "Should I switch" shouldn't be the default question.

Read their architecture docs. They spent months thinking about the problem space. Their design decisions encode real lessons. Find the three things they solved better than you did. Build those. Keep the infrastructure that's already working.

Migration is rarely the move. Pattern extraction almost always is.

The advantage of working in software is that you can port ideas in an afternoon. You don't have to port the entire system.

I ran an AI QA agent on my app before talking to a single user. It found 11 issues, 4 were blockers.

Diven Rastdus — Thu, 23 Apr 2026 00:48:54 +0000

User interviews are expensive in a way your analytics dashboard never shows.

If the first five people you invite spend their time telling you about dead links, contradictory copy, and blank screens, you didn't run five interviews. You ran five unpaid QA sessions.

That was the risk I was staring at.

Arc is a diary app built around an AI that reads your writing over time and reflects back patterns you can't see yourself. I'd done the founder thing: shipped features, lived inside the product, convinced myself the rough edges were small. But founder eyes are cooked. Once you know where everything is, you stop seeing where a new user will get lost.

So before I talked to anyone, I ran a frontend QA agent against the live product with one question: would a new user survive the first five minutes?

The setup

Not a code review. I didn't want lint. I wanted first-contact truth.

I pointed a QA agent at the live landing page and web app, gave it a thin test account, and told it to walk the product like a new user: land on the site, sign in, try to write, hit the Mirror, hit the graph, try the keyboard shortcuts, and tell me where friction shows up first.

The result:

31 screenshots
11 ranked issues
4 blockers that had to be fixed before interviews
a same-day delta pass that came back INTERVIEW-READY

The interesting part wasn't that the agent found bugs. Of course it found bugs. The interesting part was what kind.

It didn't tell me "this React component is messy." It told me "your first 30 seconds are lying about what the product is."

What it found

1. The landing page and the app were selling two different products

The landing page said Arc was "An AI that reads your whole story and shows you who you're becoming."

The signed-out app said "Your Arc Journal, on any browser. Read every note, write new ones, export the lot."

That's not a copy inconsistency. That's a positioning break.

A first-time visitor clicking from the landing page into the app wasn't meeting The Mirror. They were meeting what sounded like a generic file viewer.

This is exactly the thing a founder stops seeing because both versions sound reasonable in isolation. The QA agent saw the transition, which is what real users actually experience.

2. Four core routes were dead

The QA brief told the agent to try Mirror, Constellation, River of Time, Compose, Focus Mode, and Cmd+K.

Four of those routes returned 404s: /app/river, /app/compose, /app/focus, and /app/insights.

That matters more than it sounds. Early users guess URLs. They click stale nav items. They paste links from memory. A dead route tells them the product is abandoned.

The agent didn't just say "some routes are broken." It gave exact paths, exact repro steps, exact screenshots. That turned the fix list into a shipping list.
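The dead-route half of that check is cheap to script so it can run on every deploy. A minimal sketch, assuming plain GET requests are enough to tell a live route from a 404; the base URL in the usage note is a placeholder and the function names are hypothetical:

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the final HTTP status (redirects followed), or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        return None


def find_dead_routes(base_url, routes, fetch=http_status):
    """Return (route, status) for every route that errors or cannot be reached.

    `fetch` is injectable so the logic can be tested without network access.
    """
    dead = []
    for route in routes:
        status = fetch(base_url.rstrip("/") + route)
        if status is None or status >= 400:
            dead.append((route, status))
    return dead
```

Something like `find_dead_routes("https://example.com", ["/app/river", "/app/compose", "/app/focus", "/app/insights"])` returns an empty list once every route resolves or redirects to a live page.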

3. Interview analytics were silently dead

This one wasn't visual, but it was probably the highest-value catch.

PostHog was firing bad requests on every page load: config.js returning 404, /flags returning 401. I was about to run user interviews with broken telemetry.

If you care about learning velocity, that's brutal. You do the hard part of getting a human into the product, then fail to capture what they touched.

In the delta pass, the check got sharper: 197 network requests across both sites, zero PostHog failures, Vercel Analytics as the only telemetry left firing.

That's the difference between "I think the analytics bug is gone" and "the live site is clean."
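A count like that can be reproduced by hand from a HAR export of the network tab. The sketch below assumes a standard HAR file (entries under `log.entries`, each with `request.url` and `response.status`) and treats status 0 or anything 400 and above as a failure; the function name and the substring host filter are illustrative choices.

```python
import json
from collections import Counter
from urllib.parse import urlparse


def failed_requests_by_host(har, host_filter=""):
    """Summarize a parsed HAR file.

    Returns (matched, failures): how many requests matched `host_filter`
    (substring match on hostname; empty matches all) and a dict of
    hostname -> failed request count.
    """
    failures = Counter()
    matched = 0
    for entry in har["log"]["entries"]:
        host = urlparse(entry["request"]["url"]).hostname or ""
        if host_filter and host_filter not in host:
            continue
        matched += 1
        status = entry["response"]["status"]
        if status == 0 or status >= 400:  # 0 means blocked or aborted
            failures[host] += 1
    return matched, dict(failures)
```

Loading the export with `json.load(open("session.har"))` and calling `failed_requests_by_host(har, "posthog")` gives a number to paste into the report instead of an impression.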

4. Empty states made the product look dead

The test account was deliberately below the 10-entry threshold that makes Arc's graph and reflection surfaces interesting.

That was the right setup, because the agent found what an early user would actually see: a sparse graph with almost no visible structure, and a Mirror tail that felt like nothing was happening.

For a product whose promise is "your inner world, mapped in real time," that empty state is poisonous. Users don't infer the future product you're building. They judge the screen in front of them.

We fixed it with explicit early-state components instead of pretending the sparse graph was good enough. The graph now says the constellation is still forming. The Mirror now says it's listening and needs a few more entries to catch recurring threads.

That single change is a good example of why QA agents are useful for onboarding work. They're ruthless about the emotional read of a screen. Users won't say "your threshold logic needs a better intermediate state." They'll say "I opened the graph and it looked empty."

5. The landing page had a proof gap right above pricing

The agent also caught something I'd mentally filed under "design polish" but was really a trust problem.

Midway down the landing page, the "How it works" section had blank phone frames. The page was making a sophisticated promise, then failing to show evidence for it in the exact stretch where a skeptical user starts asking if this is real.

That one didn't block interviews the way the route failures did. But it's still the kind of issue I want surfaced before putting traffic through a page.

What it didn't catch

This matters.

The QA agent was excellent at first-contact friction. Dead routes, contradictory copy, quiet failures, empty-state reads.

It couldn't tell me whether the writing experience would make someone want to come back for 30 days. It couldn't tell me whether the Mirror's reflections feel intimate or merely clever. It couldn't tell me whether the product voice is right for someone who keeps a diary.

That still takes real users.

The point isn't to replace interviews. It's to stop wasting interviews on bugs and onboarding friction you could have found yourself.

Interview readiness changed in one afternoon

The first report's verdict: two focused hours of fixes, then go.

That was the right call.

The four blocker fixes shipped that afternoon. Then the QA agent ran a delta verification pass against the live site. The second verdict came back INTERVIEW-READY.

That pass confirmed four things:

the signed-out hero now matched the Mirror framing
the four dead routes redirected to live pages
the broken PostHog traffic was gone
the early-state graph and Mirror screens now explained themselves

That sequence is the whole pattern.

Don't run a QA agent so you can admire the report. Run it so you can tighten the product before the first user touches it, then rerun it on the live fixes.

The prompt template

This is the exact structure I used, with private details swapped for placeholders. Works against any deployed Next.js app or web product.

Context: <your product> is my long-term bet. Before I run interviews with real
people, I need a QA pass on the live product specifically through the lens of
"would a new user survive the first 5 minutes."

Your target: <YOUR_APP_URL>
Test account (if needed): <YOUR_TEST_EMAIL> / <YOUR_TEST_PASSWORD>
Landing page: <YOUR_LANDING_PAGE_URL>

Walk through as a first-time user would:

1. Land on the landing page. Does it tell me what the product is in under 10
   seconds? Would I sign up?
2. Sign in. Walk through onboarding. Where is the first friction?
3. Try to complete the core action for the first time. Does the UI invite me
   in, or feel empty?
4. Navigate the core routes: <LIST YOUR CORE ROUTES>. Does any feel broken
   or empty-state-bad?
5. Check the main affordances and shortcuts: <LIST THEM>.
6. Does anything crash, error-toast, or quietly fail?

Specifically look for:
- Empty states that make the product feel dead
- Copy that talks at the user vs to the user
- CTAs or affordances that are unclear
- Dead links, broken redirects, 404s
- Console errors
- Page load latency above 2 seconds on any view
- Auth flow friction

Output: structured pass/fail report. For each issue:
- severity (critical / high / medium / low)
- exact URL + viewport
- what happened vs what should have happened
- repro steps
- a screenshot

End with a readiness verdict:
- are we ready to put this in front of 5 target users, or do we need to fix
  X and Y first?
- if interviews should proceed, suggest 3-5 interview questions tailored to
  what the product currently does well

Report under 1500 words. Facts + screenshots over prose.

I build production AI systems for founders and engineering teams. astraedus.dev

The diary you actually keep is the one you're not trying to share

Diven Rastdus — Tue, 21 Apr 2026 23:46:55 +0000

I've tried three different journaling apps in the last two years. Day One, Notion, and a plain text folder on my desktop.

The plain text folder is the only one I still use. Not because it's better designed. Because nobody is ever going to read it.

That's not an accident. That's the whole reason it works.

There's a journaling genre that's gotten very popular: the "honest diary, shared publicly." Substack newsletters about someone's struggles. Twitter threads that end with "and here's what I learned." Long LinkedIn posts about failure that somehow feel like a brand play.

I'm not criticizing the people writing them. Some of it is genuinely good. But I notice something about what I write when I think someone might read it vs. what I write when I know nobody will.

When I write for an audience, even a small implied one, I find the lesson. I wrap it up. I frame my confusion as a learning moment. The mess becomes a narrative arc. The narrative arc is not entirely honest.

When I write for nobody, I write things like "I don't know what's wrong with me today" and let that sit. No bow on it. No insight. Just the state.

The second kind of writing is the kind that actually helps.

Here's what I've noticed after 3 years of keeping a private text journal: the insights don't come from the writing. They come from reading the writing 6 months later.

Month one I wrote something like: "I feel behind everyone. I don't know at what, exactly, but I feel behind." Month four: "That feeling of being behind is back. I don't know what I'm comparing myself to." Month nine: "I've written about feeling behind at least 8 times this year. It always shows up after I talk to my dad."

I only saw that pattern because I could scroll back through 9 months of entries and search for the word "behind." I didn't notice it in real time. You can't see the shape of something when you're inside it.

This is the gap no journaling app has actually solved. Not Day One, not Notion, not Reflekt. They're all write-only. You put words in. Nothing comes back except maybe a "this day last year" reminder.

I'm building something called Arc. The product at the surface is a private diary. No social features, no streaks, no engagement hooks. You write when you feel like writing. The app does not email you when you haven't opened it.

But underneath the diary is something I'm calling The Mirror.

The Mirror reads everything you write. Not just today's entry. All of it. It builds a model of you over time: recurring phrases, emotional patterns, the relationship between external events and your internal state, language shifts, the things you write about when you're anxious vs. when you're settled.

Then it reflects that back.

Not in a therapy way. Not generic. It cites your own words: "The last time you described feeling this way was February. Here's what you wrote. Here's what you said helped." Or: "You mention feeling 'stuck' 14 times since January. It always appears in the same context."

It gets more useful the longer you use it. Year one it sees surface patterns. Year three it starts seeing the underlying structure. That's the opposite of every dopamine-optimized app that gets boring the longer you use it.

The prototype of this wasn't an app. It was a session.

I had a 100,000-word document on my computer. Personal writing spanning about 9 years, ages 13 to 22. I gave it to an AI and said: read all of this. Tell me what you see.

What came back wasn't therapy-speak. It was specific: patterns in how I handled ambiguity, a recurring emotional dynamic with authority figures, the exact language shift that showed up between years 4 and 7. I read it and my first reaction was: I already knew this. My second reaction was: I've never articulated it clearly before.

The insight was in the material the whole time. I'd just never been able to read my own life from the outside.

That session is the product. The Mirror automates and scales that experience.

The privacy architecture is not an afterthought. Local-first storage. Optional E2E encrypted cloud sync. LLM calls with zero data retention. You own everything and can export or delete it.

This is the most intimate data a person can produce. It needs to be treated that way.

And crucially: the writing you do in a private diary is different from writing you do for any kind of audience. The privacy is not just a feature. It's what makes the data worth anything in the first place.

Arc is in early development. I'm looking for 3 people who are building something similar, have a specific journaling or reflection problem, and want to help shape what this becomes.

Not a beta waitlist. A pilot. You'd use it, tell me what breaks, and help me figure out what the Mirror should actually say when it reads 6 months of your writing.

If you're building an honest journal or reflection tool and want one of 3 pilot slots: reply to this post or email theagentthatcould@gmail.com with your use case. 3 slots, free while we build it together.

The ask is your honesty, not your credit card.