DEV Community: Haripriya Veluchamy

Serving ML Artifacts from Amazon S3 Files How I used After the Launch

Haripriya Veluchamy — Mon, 22 Jun 2026 13:37:34 +0000

The Honest Story

Two months ago, everyone was posting about Amazon S3 Files. New feature, big announcement, screenshots everywhere. I scrolled past most of them another AWS launch, another round of "here's what it does" posts.

I never actually understood what it was until it became my problem.

I was in the middle of my ML learning journey, building a semantic search project. Large FAISS indexes, BM25 artifacts all needed to be available at serving time. I was uploading them directly into my deployment. It worked, until it didn't. Hit the size limit. Container cold starts became painful. Every restart meant downloading hundreds of megabytes before the first request could be served.

I had one option left S3.

And somewhere in the back of my head I remembered: "wait, there's that new S3 file system thing everyone was talking about." S3 is already my favorite AWS service. So why not try it?

I tried it. I built something real with it. And now I'm writing the post I wish existed two months ago not "here's what S3 Files is" but "here's what actually happens when you use it."

What I Built

A semantic search engine over AWS documentation 5048 pages, indexed with FAISS + BM25 hybrid retrieval, served via FastAPI. The entire ML artifact stack (indexes, metadata) is served directly from an S3 Files NFS mount on EC2. No cold start downloads. No boto3 in the serving code. Just open().

Stack:

Embeddings: all-MiniLM-L6-v2
Semantic search: FAISS (IndexFlatIP)
Keyword search: BM25Okapi
Serving: FastAPI + Jinja2
Artifact storage: Amazon S3 + S3 Files (NFS)
Compute: EC2 t3.medium (Ubuntu 22.04)

GitHub: Harivelu0/s3-files-ml-serving

The Problem S3 Files Solves

Before S3 Files, serving large ML artifacts from containers looked like this:

# Every container startup
from huggingface_hub import hf_hub_download

faiss_path = hf_hub_download(repo_id="...", filename="faiss.index")  # 7MB
bm25_path  = hf_hub_download(repo_id="...", filename="bm25_index.pkl")  # 17MB
meta_path  = hf_hub_download(repo_id="...", filename="corpus_meta.pkl")  # 2.6MB

Every container restart meant downloading ~27MB before the first request. For larger models this becomes hundreds of MB or GBs. You'd need to manage download logic, handle failures, worry about /tmp size limits, and every container on the same host duplicates the same data.

With S3 Files, your serving code becomes:

import faiss
import pickle

# Just open() no download, no boto3
index = faiss.read_index("/mnt/artifacts/faiss.index")
with open("/mnt/artifacts/bm25_index.pkl", "rb") as f:
    bm25 = pickle.load(f)

The S3 bucket is mounted as an NFS volume on the EC2 host. Your container reads from it like a local file. That's it.

Architecture

AWS Docs (5048 pages)
        ↓
crawl_aws_docs.py (Sitemap + BeautifulSoup)
        ↓
build_index.py (FAISS + BM25 + corpus_meta)
        ↓
S3 Bucket (versioning enabled)
        ↓
S3 File System (NFS layer on top of bucket)
        ↓
Mount Target (NFS endpoint inside VPC)
        ↓
EC2 Ubuntu 22.04
  └── /mnt/artifacts (S3 Files mounted here)
       └── Docker container
            └── FastAPI reads indexes directly
                 └── http://ec2-ip:8000

Setting Up S3 Files What Actually Matters

1. S3 bucket needs versioning enabled

This one catches everyone. S3 Files will refuse to create a file system without it.

aws s3api put-bucket-versioning \
    --bucket your-bucket \
    --versioning-configuration Status=Enabled

2. IAM role trust principal is `elasticfilesystem.amazonaws.com`

Not s3files.amazonaws.com even though the service is called S3 Files, it's built on EFS under the hood.

{
  "Principal": { "Service": "elasticfilesystem.amazonaws.com" },
  "Condition": {
    "StringEquals": { "aws:SourceAccount": "YOUR_ACCOUNT_ID" }
  }
}

3. The IAM role needs EventBridge permissions

S3 Files uses EventBridge internally to monitor bucket changes. Without these, your file system gets stuck in creating forever.

{
  "Action": [
    "events:PutRule", "events:DeleteRule",
    "events:PutTargets", "events:RemoveTargets"
  ],
  "Resource": "*"
}

4. boto3 API uses camelCase

The S3 Files boto3 client uses camelCase unlike most other AWS services.

# Wrong
s3files.create_file_system(Bucket=bucket_arn, RoleArn=role_arn)

# Correct
s3files.create_file_system(bucket=bucket_arn, roleArn=role_arn)

5. Mount needs `amazon-efs-utils`, not plain NFS

Plain mount -t nfs4 will fail. S3 Files requires the amazon-efs-utils package (v3.0+) which handles TLS + IAM auth automatically.

# Build and install amazon-efs-utils
sudo apt-get install -y cmake golang-go rustc cargo
git clone https://github.com/aws/efs-utils
cd efs-utils && ./build-deb.sh
sudo apt-get install -y ./build/amazon-efs-utils*.deb

# Then mount
sudo mount -t s3files fs-0xxxxxxxxx:/ /mnt/artifacts

6. EC2 instance needs `AmazonS3FilesClientFullAccess`

Without this policy on the EC2 role, the mount returns access denied by server.

aws iam attach-role-policy \
    --role-name your-ec2-role \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FilesClientFullAccess

The Serving Code

FastAPI startup loads everything from the mount. No download logic. No error handling for network failures during download. The mount is always there.

ARTIFACTS_DIR = Path(os.getenv("ARTIFACTS_DIR", "/mnt/artifacts"))

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Reads directly from S3 Files mount
    state["index"] = faiss.read_index(str(ARTIFACTS_DIR / "faiss.index"))

    with open(ARTIFACTS_DIR / "bm25_index.pkl", "rb") as f:
        d = pickle.load(f)
        state["bm25"] = d["bm25"]

    with open(ARTIFACTS_DIR / "corpus_meta.pkl", "rb") as f:
        state["meta"] = pickle.load(f)

    state["model"] = SentenceTransformer("all-MiniLM-L6-v2")
    yield

Container startup: model loads in ~2s. Artifacts available instantly from mount.

Weekly Index Updates The Real S3 Files Advantage

This is where S3 Files goes beyond just solving cold starts.

AWS updates their docs regularly. With traditional artifact serving you'd need to rebuild indexes, push a new container image, and redeploy causing downtime.

With S3 Files:

# update_pipeline.sh runs weekly via EventBridge
python scripts/crawl_aws_docs.py --update   # only changed pages
python precompute/build_index.py            # rebuild indexes to /tmp

# Atomic swap no downtime
mv /tmp/faiss.index.new     /mnt/artifacts/artifacts/faiss.index
mv /tmp/bm25_index.pkl.new  /mnt/artifacts/artifacts/bm25_index.pkl
mv /tmp/corpus_meta.pkl.new /mnt/artifacts/artifacts/corpus_meta.pkl

The serving container picks up the new indexes on the next query. Zero restart. Zero redeploy. The mount sees the updated S3 objects immediately.

This is the killer feature not just cold start elimination, but live index updates without any deployment.

Results

The search UI running at http://98.93.65.241:8000:

Query: "aws sns permission issue" → 10 results in 334ms
Artifacts loaded at startup: instant (no download)
Index update: rebuild + upload to S3 → serving picks up without restart

What I Learned

S3 Files is genuinely useful for ML workloads where:

Artifacts are large (>100MB)
Multiple containers need the same data
Indexes update regularly without downtime
You want to avoid baking artifacts into Docker images

It's not a replacement for EFS if you need pure file system performance. And it's not for every use case if your artifacts never change and containers rarely restart, the complexity isn't worth it.

But if you've ever stared at a container downloading 500MB on every cold start, S3 Files is exactly what you were waiting for.

Resources

How I Built a Self Resizing EC2 for My ML Data

Haripriya Veluchamy — Sun, 07 Jun 2026 11:45:14 +0000

The Pain Point

I'm on an ML learning journey. That means a lot of data. A lot of processing. And a lot of AWS free credits I really can't afford to waste.

Here's what my typical day looked like:

I'd spin up a t3.large to run a data pipeline. Load some datasets, process them, store them. The pipeline would run for a couple of hours sometimes I didn't even know exactly how long it would take. Then I'd go to sleep.

Next morning I'd check CloudWatch and realise the VM had been sitting idle since 3AM. Running. Doing nothing. Burning credits.

That's the problem. You need a big machine for data loading. You don't need it after.

The First Thought Let AWS ML Decide

My first instinct was to use AWS Compute Optimizer. It's a managed ML service that analyses your EC2 usage patterns and recommends the right instance type. Smart, right?

I enabled it. Waited. And waited.

Turns out Compute Optimizer needs at least 30 consecutive hours of usage data before it generates recommendations. For a VM I spin up occasionally for pipeline runs that's not practical.

So I moved on.

The Second Thought Bedrock Agent

Next idea: use Amazon Bedrock as an agent to reason about when and how to resize. Let an LLM decide.

But the more I thought about it, the more it felt like overkill. The decision isn't complex:

"Did the pipeline finish? Yes → resize down. No → don't."

That's not a reasoning problem. That's an automation problem. Using Bedrock here would be AI washing adding complexity without adding value.

So I kept it simple.

The Solution EventDriven VM Resize

What I built: a lightweight agent that listens for your pipeline to complete, then automatically resizes the VM down.

Stack:

emit_event.py runs on your VM, fires when pipeline exits
Amazon EventBridge receives the event
AWS Lambda handles resize logic
Amazon SNS sends email alert (success or failure)

Pipeline finishes
      ↓
emit_event.py → EventBridge
      ↓
Lambda triggered
      ↓
SUCCESS → resize down + email
FAILURE → keep size (debug) + email

No ML. No LLM. Just the right tool for the job.

Architecture

Your VM                        AWS Cloud
──────────────────             ─────────────────────
run_pipeline.sh                EventBridge (custom bus)
  step1.py               →          ↓
  step2.py                     Lambda
  step3.py                       ├── stop EC2
  emit_event.py ──────────→      ├── resize instance type
                                 ├── start EC2
                                 └── SNS email alert

All AWS resources are deployed with a single CloudFormation command. No manual console clicking.

Setup 3 Steps

Step 1 Deploy AWS infrastructure

git clone https://github.com/Harivelu0/vm-resize-agent
cd vm-resize-agent

aws cloudformation deploy \
  --template-file infra/template.yaml \
  --stack-name vm-resize-agent \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides \
      AlertEmail=your@email.com \
      TargetInstanceType=t3.medium \
  --region us-east-1

Check your email → confirm the AWS subscription link.

Step 2 Copy agent to your VM

scp -i your-key.pem -r agent/ pipeline/ \
  ec2-user@your-vm-ip:~/vm-resize-agent/

# on VM
pip install boto3
aws configure

Step 3 Add your pipeline steps

Edit pipeline/steps.conf:

python3 /home/user/myproject/fetch_data.py
python3 /home/user/myproject/transform.py
python3 /home/user/myproject/load_db.py

Run:

bash pipeline/run_pipeline.sh

That's it. When pipeline finishes → VM resizes → email arrives.

The Key Design Decision steps.conf

One thing I was particular about: this tool should work for anyone's pipeline, not just mine.

So run_pipeline.sh never changes. Users only edit steps.conf one command per line, one step per line.

# steps.conf
python3 /home/user/fetch_gdelt.py
python3 /home/user/build_forecasts.py
python3 /home/user/calibrate.py

The wrapper reads each line, runs it in order, tracks success/failure, and emits the event at the end. Generic by design.

The Email Alert

On success:

Pipeline: data-loader
Status:   SUCCESS
Steps:    3/3
Duration: 45m 12s
Instance: i-0abc123
Resized:  Yes -> t3.medium

On failure:

Pipeline:  data-loader
Status:    FAILURE
Steps:     2/3
Failed at: step_3
Instance:  i-0abc123
Resized:   No (kept original size for debugging)

Failure case is important the VM intentionally stays large so you can SSH in and debug without losing state.

Difficulties I Faced

1. IP changes after resize

When Lambda stops and restarts the EC2, it gets a new public IP. Learned this the hard way when SSH stopped working. Fix: assign an Elastic IP before running demos.

aws ec2 allocate-address --region us-east-1
aws ec2 associate-address \
  --instance-id i-xxx \
  --allocation-id eipalloc-xxx

2. Docker goes down after resize

EC2 resize = reboot. Any running Docker containers stop. Add this to /etc/rc.local on your VM so services restart automatically:

sudo service docker start
cd /home/ec2-user/myproject && docker-compose up -d

Live Demo

For the demo I used a real weather dataset loaded into Postgres running in Docker on the EC2.

EC2: t3.large (before)
       ↓
Pipeline runs: download → parse → load into Postgres
       ↓
emit_event.py fires automatically when done
       ↓
Lambda: stops EC2 → resizes → starts EC2
       ↓
EC2: t3.medium (after)
       ↓
Email arrives in Gmail

Cost Reality

Resource	Monthly Cost
EventBridge	~$0 (free tier)
Lambda	~$0 (free tier)
SNS email	~$0 (first 1000 free)
Total stack	~$0

The savings depend on your instance. A t3.large idle for 20hrs/day wastes ~$25/month. For larger instances like c5.4xlarge you're looking at $200+ saved per month.

When to Use This

This is for you if:

You run data pipelines on EC2 manually or on a schedule
Your pipeline takes unpredictable time to complete
You don't need the heavy instance after loading is done
You're not ready for EMR or Glue yet (pre-pipeline stage)

This is not for you if:

You're already using EMR Serverless or Glue (they auto-terminate)
Your pipeline runs less than 30 minutes (manual resize is fine)
You need horizontal scaling (use Auto Scaling Groups instead)

Why a VM and Not Glue or EMR?

Honest answer I didn't know enough about my data yet to make that decision.

Glue and EMR are great services. But they come with questions you need to answer upfront:

What's your data format?
What transformations do you need?
What's the volume?
Do you need Spark?

When you're learning, you don't have those answers yet. You just need to load data, see what you're working with, and prove the pipeline works.

A VM lets you do that with zero infrastructure decisions. Just Python scripts. When the pipeline is proven and you understand your data then you migrate to the right managed service.

This is that phase. Before you know which service you need.

This is actually how real teams work too. Nobody starts with EMR on day one of a new data project. You explore first. You prove it works. Then you scale.

Phase 1: Explore         → VM + Python scripts
Phase 2: Prove it works  → VM + vm-resize-agent
Phase 3: Scale           → Glue / EMR / Spark

Repo

Everything is open source. Clone, edit steps.conf, deploy.

GitHub: https://github.com/Harivelu0/vm-resize-agent

What's Next

A few things I want to add:

Auto resize back up before next scheduled run (cron-based)
Slack alerts alongside email
Support for Azure VMs (same pattern, different SDK)

If you have ideas or run into issues open a GitHub issue. Happy to help.

Building in public as part of my ML learning journey. Follow along for more practical AWS patterns.

How I Built Production AI Agent Monitoring with Langfuse

Haripriya Veluchamy — Wed, 13 May 2026 16:32:02 +0000

Multi-agent AI systems fail silently.

A 200 OK response doesn’t mean the AI made good decisions.

That was the biggest thing I realized while building a multi-agent system.

My architecture looked like this:

User Query → Multi Agent Call → Final Response

Everything looked normal from an infrastructure perspective.

APIs were healthy
Latency looked fine
Users were getting responses

But I still couldn’t answer important questions:

Did the Agent route the query to the right specialist?
Did the agent hallucinate information?
Did it ignore specialist outputs?
Did it attribute responses incorrectly?

Traditional monitoring couldn’t help because the system technically wasn’t failing.

The failures were happening at the decision layer.

Full Trace Visibility

I used Langfuse to trace every agent execution.

That includes:

Tool calls
Input/output payloads
Token usage
Latency per step

If an agent touched something, I wanted visibility into it.

No black boxes.

Deterministic Checks

Some validations didn’t need another LLM.

I added rule-based checks for things like:

Did the agent call tools from the correct domain?
Did the agent call tools it wasn’t supposed to?
Was the expected workflow followed?

These checks are binary:

Pass → 1
Fail → 0

Fast and cheap.

Faithfulness Checks

This was mainly for hallucination detection.

I compare the final response with outputs from specialist agents.

If the Final layer introduces claims that weren’t exist in source outputs, it gets flagged.

This helped catch cases where the system sounded confident but wasn’t grounded.

LLM Judges

For things deterministic checks can’t measure, I use Azure OpenAI as judges.

They evaluate:

Routing correctness
Response quality
Attribution accuracy
Conflict handling

This runs for every multi-agent response.

Expensive? Yes.
Useful? Definitely.

100% Traffic Monitoring

I didn’t want sampling.

Every production request goes through the evaluation pipeline.

Because edge cases are usually the exact things sampling misses.

Cost + Latency Tracking

Multi-agent systems get expensive very fast.

I track:

Tokens per agent
Latency per step
Expensive execution paths

This made optimization much easier.

What This Actually Caught

This surfaced issues normal monitoring completely missed.

Wrong attribution

Correct insights were assigned to the wrong specialist.

Ignored outputs

Sometimes Agent completely ignored specialist responses.

Routing mistakes

The call occasionally sent queries to the wrong agent.

None of these showed up in normal monitoring dashboards.

Everything looked healthy.

Stack

Observability: Langfuse
LLM Evaluation: Azure OpenAI
Deterministic Checks: TypeScript

Final Takeaway

For multi-agent systems, uptime monitoring is not enough.

You also need decision monitoring.

Because a successful response can still be completely wrong.

I Built a Self-Updating SEO Brain Inspired by Andrej Karpathy's LLM Wiki

Haripriya Veluchamy — Sun, 10 May 2026 13:39:47 +0000

The Tweet That Changed How I Think About AI + Knowledge

In early April 2026, Andrej Karpathy (OpenAI co-founder, former Tesla AI Director) posted something deceptively simple:

"Something I'm finding very useful recently: using LLMs to build personal knowledge bases for various topics of research interest."

He followed it up with a GitHub gist titled LLM Wiki an "idea file" describing a pattern for building knowledge bases that actually compound over time instead of rediscovering the same information on every query.

I decided to build it for a real production problem.

The Problem: RAG Has No Memory

In My organization, we were running an SEO monitoring pipeline for our landing site. It used Cognee (a knowledge graph framework) backed by Neo4j + ChromaDB on a $35/month VM.

Every day it would:

Pull Google Search Console data
Scrape our pages for SEO issues
Query the knowledge graph
Post a Slack report

It worked. But it had a fundamental flaw the same flaw Karpathy describes in every RAG system.

Every time it ran, it was rediscovering knowledge from scratch. It had no memory of what it found yesterday. It couldn't connect "we deployed this fix on March 13" with "clicks went up 81% on March 18." It couldn't say "this keyword has been declining for 3 weeks here's why." It just answered the current query and forgot everything.

On top of that: the VM broke 3+ times. Neo4j config issues. ChromaDB API path changes. Sidecar containers failing silently. We spent more time fixing the pipeline than reading its output.

Karpathy's Core Insight

His idea is simple but changes everything:

Instead of retrieving from raw documents at query time compile knowledge once, keep it current, and query the compiled result.

The architecture has 3 layers:

raw/        → immutable source data (never edited)
wiki/       → LLM-maintained markdown knowledge base
AGENTS.md   → schema/rules file telling the LLM what to do

The LLM reads new raw data → integrates it into the wiki → cross-references it with existing knowledge → flags contradictions → the wiki gets richer every day.

His exact framing: "Obsidian is the IDE. The LLM is the programmer. The wiki is the codebase."

No vector database. No embeddings. No $35/month VM. Just markdown.

What I Built: LLM Wiki for SEO

I took Karpathy's pattern and applied it specifically to SEO monitoring for vibetrader.com.

The 3 Layers

raw/ : daily immutable snapshots:

raw/gsc/2026-04-17.json      ← Google Search Console (clicks, CTR, positions, queries)
raw/audit/2026-04-17.json    ← page audit (H1, meta, schema, canonical checks)
raw/commits/2026-04-17.json  ← landing-site git log (what code changed)

wiki/ : LLM-maintained knowledge:

wiki/overview.md              ← "story so far" updated daily
wiki/log.md                   ← append-only daily log
wiki/topics/keywords.md       ← keyword clusters + position tracking
wiki/topics/issues.md         ← open SEO issues with severity
wiki/topics/recommendations.md ← history of recs + acted/pending status
wiki/topics/code-changes.md   ← code change impact tracker
wiki/topics/performance.md    ← CAUSAL CHAINS: issue → fix → metric improvement
wiki/topics/competitors.md    ← weekly competitor analysis
wiki/topics/lint-report.md    ← weekly wiki health check

AGENTS.md : the schema file. Tells the LLM exactly how to update each page, what cross-links to write, what format to follow.

The Flow

3:00 PM IST  daily
─────────────────
gsc_pull.py    → pulls GSC data via API → raw/gsc/today.json
audit.py       → fetches vibetrader.com, checks H1/meta/schema → raw/audit/today.json
               → posts Slack audit digest (SEO/GEO/AEO scores)
ingest.py      → reads ALL raw/ + ALL wiki/
               → sends full context to Azure OpenAI
               → LLM updates all wiki pages with cross-links
               → appends to log.md
               → posts enriched Slack report
git commit     → wiki/ + raw/ committed back to main

4:00 PM IST  daily (1 hour after ingest)
─────────────────────────────────────────
fix_agent.py   → reads wiki/issues + recommendations
               → LLM classifies which issues are auto-fixable
               → skill functions run deterministically (not LLM-generated code)
               → raises PR to landing-site dev branch
               → never auto-merges

11:30 AM IST  Saturday
───────────────────────
lint.py        → audits wiki for staleness, contradictions, orphan pages
               → health score /100
               → saves lint-report.md + posts Slack

2:30 PM IST  Sunday
────────────────────
competitor_analysis.py → scrapes competitor sites
                       → LLM analysis → updates competitors.md

The Knowledge Graph in Plain Markdown

The key innovation over standard LLM Wiki I added a performance.md page that tracks causal chains:

### ✅ 2026-03-18 — API response fix
- **Cause:** [[topics/issues]] API returning incomplete pagination data
- **Fix:** API response items + pagination as separate fields → [[topics/code-changes]]
- **Before:** 127 clicks/day
- **After:** 229.7 clicks/day
- **Impact:** +81% clicks
- **Keywords moved:** [[topics/keywords]] — all branded queries improved
- **Confidence:** HIGH — spike confirmed same week as deploy

Every entry answers: what caused this? what moved? how confident?

This is what Cognee was trying to do with Neo4j store causal relationships between entities. We're doing it in plain markdown, with [[wiki-links]] that Obsidian renders as a visual knowledge graph.

Obsidian as the Visualization Layer

Open Obsidian → point it at your wiki/ folder → graph view instantly shows all cross-connections between pages:

performance ↔ keywords ↔ issues ↔ recommendations ↔ code-changes

As the wiki accumulates data daily, the graph gets richer. After 30 days, clicking any node shows a real story: "this keyword dropped because of this issue, which was fixed by this code change, which resulted in this metric improvement."

That's the knowledge graph. No Neo4j required.

Interactive Query Mode

Beyond the daily Slack push, I added a local query script:

# one-shot
python scripts/query.py "what keywords are improving this week?"

# REPL with follow-up context
python scripts/query.py
you> which code change had the most SEO impact?
wiki> Based on code-changes.md:
      March 18 — API response fix
      Before: 127 clicks/day | After: 229.7 clicks/day | Impact: +81%
      This is the highest confirmed impact change in the wiki.

you> why did that happen?   ← follow-up, remembers context
wiki> ...

The LLM answers from the accumulated wiki context not by re-reading 30 days of raw JSON. This is exactly what Karpathy's "query" operation describes.

Weekly Lint Pass

One gap in most LLM Wiki implementations no health check. I added lint.py that runs every Saturday:

Staleness detection : claims in the wiki that contradict newer log entries
Contradiction detection : two pages saying opposite things
Orphan pages : pages with no cross-links pointing to them
Missing concept pages: [[wiki-links]] referenced but never created
Stale recommendations: recs pending > 30 days

Posts a health score /100 to Slack and saves lint-report.md.

The Fix Agent (Skill-Based, Not LLM-Generated Code)

The fix agent reads the wiki, identifies fixable issues, and raises PRs to the landing-site repo. The key design decision: the LLM only classifies issues it never writes code.

Instead, deterministic "skill" functions handle each fix type:

skills/add_jsonld.py          ← checks schema not already present before adding
skills/add_internal_link.py   ← validates page exists in app/ first
skills/fix_meta_tags.py       ← only updates if explicit value given
skills/fix_robots_txt.py      ← safe append-only sitemap fix

The LLM outputs {"skill": "add_jsonld", "page": "app/layout.tsx"}. The skill runs. No hallucinated file paths. No duplicate schemas. No business logic touched.

PRs always target dev branch. Never auto-merge. Human review required.

vs Cognee: Honest Comparison

After running both systems in parallel for a week:

	Cognee	LLM Wiki
Infrastructure	Neo4j + ChromaDB + VM	Plain markdown files
Monthly cost	~$35	~$0 (just Azure OpenAI tokens)
Times it broke	3+	0
Files to maintain	6 JS pipeline files	5 Python scripts
History awareness	Inconsistent	Grounded references exact dates
AEO accuracy	3/10 (missed JSON-LD already added)	7/10 (reflected actual state)
Knowledge compounds	❌	✅

LLM Wiki won on every dimension except one: Cognee had richer graph relationships out of the box. But we replicated that with performance.md and [[wiki-links]].

What Karpathy Got Right

The core insight is correct and the more I use it the more obvious it becomes:

RAG = studying for an exam by re-reading all your textbooks every time you get a question.

LLM Wiki = you already made notes, highlighted the important parts, drew arrows between connected ideas. Now you just read your notes.

The wiki grows daily. The LLM gets smarter about your domain the more data accumulates. By day 30, query.py answers questions with 30 days of grounded context. By day 90, performance.md has a dozen confirmed causal chains real institutional knowledge about what moves your metrics and why.

That's what no RAG system gives you. And you don't need a vector database to get there.

Try It Yourself

The pattern works for any domain where you're accumulating knowledge over time:

SEO monitoring (what I built)
Competitor tracking
Research — papers, articles, building a thesis over weeks
Engineering team wiki — fed by Slack threads, PRs, incident reports
Personal second brain — journal entries, articles, book notes

The only thing you need:

A raw/ folder where data lands
A wiki/ folder for LLM-maintained markdown
An AGENTS.md (or CLAUDE.md) telling the LLM the rules
A cron job that runs ingest daily

Start with one raw data source and one wiki page. Let it run for a week. The compounding effect becomes obvious fast.

I built a 20-20-20 eye reminder because my eyes were dying at my desk

Haripriya Veluchamy — Thu, 16 Apr 2026 08:45:00 +0000

I have severe eye dryness. My doctor told me to follow the 20-20-20 rule every 20 minutes, look at something 20 feet away for 20 seconds. Simple. Except when you're deep in a bug or a feature, 20 minutes evaporates and you've been staring at your monitor for 3 hours straight.

I tried phone reminders. I dismissed them without thinking. I tried sticky notes. I ignored them. I needed something that actually blocked me from working until I did the break.

The key insight: a dismissible reminder is just noise. The popup had to be impossible to close until the 20-second countdown finished.

What it does

Runs silently in the system tray in the background
Every 20 minutes fires a loud beeping alarm (winsound, no external files needed)
Dark blocking popup appears X button is disabled, you cannot close it
20-second countdown runs automatically with a progress bar
The "Continue" button is locked until the countdown finishes, then turns green
Two buttons: Continue (back to work) or Stop (end the session)

The core trick: disabling the close button

The single most important line in the whole app:

win.protocol("WM_DELETE_WINDOW", lambda: None)

By overriding WM_DELETE_WINDOW with a no-op lambda, clicking the X does absolutely nothing. The window stays open. You have to wait out the countdown.

The countdown + locked button

The "Continue" button starts state="disabled" and only becomes clickable when the timer hits zero:

def tick():
    remaining["s"] -= 1
    count_var.set(str(max(remaining["s"], 0)))

    if remaining["s"] <= 0:
        continue_btn.configure(
            state="normal",
            bg="#003322",
            fg="#44ff88",
            cursor="hand2"
        )
    else:
        after_id["id"] = win.after(1000, tick)

The button turns green when it unlocks a satisfying visual reward for actually doing the break.

The alarm

No external audio files needed. winsound is part of the Windows standard library:

for _ in range(6):
    winsound.Beep(1000, 300)
    time.sleep(0.1)
    winsound.Beep(1400, 300)
    time.sleep(0.1)

Two alternating frequencies, six times. Annoying enough that you cannot ignore it. Runs in a daemon thread so it does not block the UI.

The full popup

def show_popup():
    result = {"action": None}
    root = tk.Tk()
    root.withdraw()

    win = tk.Toplevel(root)
    win.title("Eye Break!")
    win.configure(bg="#0a0a0a")
    win.attributes("-topmost", True)
    win.protocol("WM_DELETE_WINDOW", lambda: None)  # <-- the magic line

    # ... countdown, progress bar, locked button ...

    win.after(1000, tick)
    play_alarm()
    root.mainloop()
    return result["action"] == "continue"

The popup is always-on-top (-topmost True) so it cannot be buried under other windows.

Setup: auto-start on Windows boot

Step 1 Install dependencies:

pip install pystray pillow

Step 2 Save eye_reminder.py to a permanent location:

C:\Users\you\EyeReminder\eye_reminder.py

Step 3 Create a startup shortcut. Press Win+R, type shell:startup, and create a shortcut there pointing to:

pythonw C:\Users\you\EyeReminder\eye_reminder.py

Using pythonw instead of python means no terminal window appears on startup.

Step 4 Reboot. The app silently starts in your system tray. Right-click the tray icon to quit anytime.

Testing it

Before running with 20-minute intervals, test with 5 seconds:

TEST_MODE = True
INTERVAL_SECONDS = 5 if TEST_MODE else 20 * 60
BREAK_SECONDS = 5 if TEST_MODE else 20

Flip TEST_MODE = False when you are happy with it.

Dependencies

Package	Why
`tkinter`	Built-in — popup UI
`winsound`	Built-in — alarm beeps
`pystray`	System tray icon
`Pillow`	Required by pystray for the icon image

Core functionality (popup + alarm) works with zero installs. pystray and Pillow are optional the app works without them, just without the tray icon.

Full source

Harivelu0 / eye-remainder

eye-reminder

A lightweight Python app that enforces the 20-20-20 rule for eye health on Windows.

Every 20 minutes, a blocking popup appears with a loud alarm. You cannot dismiss it until a 20-second countdown finishes. No cheating.

Why

The 20-20-20 rule says: every 20 minutes, look at something 20 feet away for 20 seconds. Every reminder app I tried was too easy to dismiss. This one isn't.

Demo

[20 min timer] --> ALARM fires --> Blocking popup appears
                                        |
                                   20-sec countdown
                                        |
                               "Continue" button unlocks
                                        |
                              Click Continue --> back to work
                              Click Stop    --> session ends

Features

Runs silently in the system tray
Loud beeping alarm on every trigger (no audio files needed)
Popup is always-on-top and the X button is disabled
"Continue" button locked until full 20-second countdown completes
System tray icon with right-click Quit option
Built-in test mode (5-sec interval) for quick verification

Requirements

Windows
Python 3.8+
…

View on GitHub

My eyes are noticeably less dry after two weeks of using this. Sometimes the best tool is the one you build in an afternoon because nothing else works the way you need it to.

I Built an AI SEO Monitor That Remembers Everything

Haripriya Veluchamy — Fri, 10 Apr 2026 14:16:39 +0000

Most SEO monitoring tools give you a snapshot: today's clicks, today's issues, today's recommendations. You fix something, come back tomorrow, and the tool has no idea what you did or whether it helped.

I wanted something smarter a system that remembers site's history, correlates code changes with ranking shifts, and gives AI-generated insights that get better every day. Here's what I built and how.

The Problem With Basic SEO Monitoring

A typical monitoring setup looks like this:

Fetch Google Search Console data
Run a Lighthouse audit
Send a Slack message with today's numbers

That's fine. But it can't answer questions like:

"Did that metadata fix I deployed 10 days ago actually improve rankings?"
"This recommendation has been flagged for 15 days why hasn't it been fixed?"
"Clicks dropped this week was it a code change or an algorithm shift?"

To answer those questions, you need memory. That's where Cognee comes in.

What Is Cognee?

Cognee is a knowledge graph SDK. Instead of storing data as flat rows in a database, it extracts entities and relationships and stores them as nodes and edges in a graph (Neo4j) with vector embeddings in a vector database (ChromaDB).

Think of it like this: a normal database stores "clicks = 262 on April 8". A knowledge graph stores "keyword 'vibe trading' ranked at position 1.78 on April 8, which is 12 spots better than March 25, and that improvement happened 3 days after a metadata fix was deployed".

The difference matters when you want AI to reason across weeks of history not just today.

System Architecture

The full pipeline runs daily via GitHub Actions:

PHASE 1 — Parallel data collection
├── Lighthouse audit (performance, SEO scores)
├── Broken links check
├── Meta tags validation
├── Core Web Vitals (via PageSpeed API)
└── Google Search Console (clicks, CTR, position, queries)

PHASE 2 — Main analysis job
├── git-change-detector.js    → scans commits, classifies SEO-relevant changes
├── cognee_ingest.py          → writes today's data to Neo4j + ChromaDB
├── cognee-store-updater.js   → updates 30-day rolling JSON snapshot
├── audit-scraper.js          → fetches live pages, scores SEO/GEO/AEO signals
├── audit-ingest.py           → stores audit scores in the knowledge graph
├── cognee-analyzer.js        → builds enriched AI context, calls Azure OpenAI
├── send-ai-slack.js          → posts daily report to Slack
└── cognee-blob-sync.js       → backs up knowledge graph to Azure Blob Storage

PHASE 3 — Weekly (Sundays)
└── competitor-monitor.js     → fetches competitor pages, scores them, posts comparison

Why Cognee? The Knowledge Graph Advantage

Every day, cognee_ingest.py builds a structured document containing today's GSC metrics, top queries, AI recommendations, and recent git commits. Azure OpenAI reads this and extracts entities keywords, positions, dates, code changes which Cognee writes to Neo4j as connected nodes. graph starts to grow.

After 30 days, the graph contains nodes like:

(Keyword: "platform name") —[RANKED_AT]→ (Position: 1.78, date: April 8)
(CodeChange: "metadata fix") —[HAPPENED_BEFORE]→ (MetricSnapshot: April 5)
(Recommendation: "add FAQ schema") —[FLAGGED_ON]→ (Date: April 1)
(Recommendation: "add FAQ schema") —[FLAGGED_ON]→ (Date: April 2)
... (flagged 12 days in a row)

Now when the AI runs its daily analysis, it doesn't just see today's data. It sees patterns:

Keyword velocity: which keywords improved or dropped more than 5 positions in 14 days
Stuck recommendations: same issue flagged 3+ days in a row, still unactioned
Code change impact: did clicks or position change after a specific deploy?

The Slack report reflects this. Instead of "your CTR is 18.94%", it says:

"Your site has more than doubled daily clicks over the past month (106% growth), driven by a metadata fix on March 26 and header overlap fixes on March 28. Short-term momentum is slowing 7-day clicks are -3% suggesting you need to now expand content around fast-moving branded keywords."

That's a different class of insight.

The Audit Scrape SEO/GEO/AEO Scoring

Beyond GSC data, audit-scraper.js fetches your actual pages daily and scores them across three dimensions:

SEO classic signals: title tag, meta description, H1, canonical, OG tags, schema markup, JS-gated content detection

GEO (Generative Engine Optimization) how well AI search engines like Perplexity or ChatGPT Search can read and cite your content: structured data presence, content density, crawlability

AEO (Answer Engine Optimization) featured snippet and voice search readiness: FAQ schema, article schema, H2 density, word count

Each page gets a score out of 10. The system flags critical issues (JS-gated content = crawlers see a blank page, missing H1 = no primary ranking signal) and sends a separate Slack message with the audit digest.

🔴 SEO Audit — 2026-04-08
Scores: SEO 9/10 | GEO 9/10 | AEO 3/10 | Combined 21/30

Critical Issues:
🚨 missing_h1 on Pricing — Missing primary ranking signal
🚨 js_gated_content on Pricing — Crawlers see blank page

Code Change Impact Tracking

This is the part I'm most proud of. git-change-detector.js scans git commits and classifies them it looks for commit messages mentioning SEO-related terms (metadata, schema, redirect, canonical, performance, etc.) and logs them with their date.

change-impact-tracker.js then cross-references those commits with GSC metrics. For each logged change, it compares the 7-day window before vs after deployment:

✅ Migrate to new partition keys (2026-03-30)
   → Position improved 6.7 spots (17.39 → 10.71)

✅ API pagination fix (2026-03-18)
   → Clicks grew 81% (127 → 229.7/day)

⏳ content deploy (2026-03-13)
   → Monitoring... (not enough post-deploy data yet)

This surfaces directly in the Slack report under "Code Change Tracker". Over time, it tells you which types of changes actually move the needle.

Storage Architecture

Three layers, each with a different purpose:

Layer	What it stores	Why
Neo4j (Azure VM)	Graph nodes + edges — keywords, positions, code changes, relationships	Multi-hop reasoning: "which keyword improved after which deploy?"
ChromaDB (Azure VM)	Vector embeddings of all entities	Semantic search across history
cognee-knowledge.json (Azure Blob)	30-day rolling JSON snapshots	Fast daily reads without querying the graph every run

The JSON file is the workhorse for the daily Slack report. Neo4j and ChromaDB are queried for deeper pattern analysis and become increasingly valuable as history accumulates.

Key Things I Learned

Cognee initializes config at import time. If you set environment variables after import cognee, they're ignored. You have to call cognee.config.set_graph_db_config() directly after import to update the live config object. This cost me several hours.

The mistralai import conflict. Cognee's dependency instructor==1.14.x tries to import Mistral from mistralai at import time regardless of whether you use it. Fix: inject a fake mistralai module into sys.modules before importing Cognee.

JS-gated content is invisible to the audit scraper. If your page renders entirely client-side, the raw HTML fetch returns fewer than 80 words. The scraper flags this as js_gated_content — which is actually useful because it means Google probably can't index it either.

The knowledge graph gets smarter non-linearly. Day 1 the system is just a fancier GSC dashboard. Day 7 you start seeing real code change verdicts. Day 30 the AI recommendations start referencing patterns that span weeks. The value compounds.

Tech Stack

GitHub Actions — pipeline orchestration, daily cron
Node.js — audit scraper, Cognee analyzer, Slack formatting, git change detection
Python — Cognee SDK ingestion (cognee_ingest.py, audit-ingest.py)
Cognee 0.5.3 — knowledge graph SDK
Neo4j Community — graph database
ChromaDB — vector database
Azure OpenAI — GPT-4.1 for analysis, text-embedding-3-large for vectors
Azure Blob Storage — knowledge graph backup/restore
Azure VM (Standard B2s) — hosts Neo4j + ChromaDB via Docker Compose
Google Search Console API — real click/impression/position data

🚀 Beyond RAG: Simulating the Future with MiroFish

Haripriya Veluchamy — Tue, 07 Apr 2026 17:15:58 +0000

Lately, most of us have been working with RAG systems retrieving context, grounding responses, improving accuracy.

But what if instead of just retrieving knowledge, we could simulate outcomes?

I recently came across MiroFish, and decided to test it out.

🧪 What I Tried

I cloned the repo, ran it locally, and fed it a simple scenario:

“What happens when an AI assistant is introduced into a company’s daily workflow?”

Instead of a static answer, it generated a multi-agent simulation over time.

🧠 What Makes It Different

Unlike traditional systems, MiroFish:

Creates a virtual environment
Generates multiple agents (employees, managers, etc.)
Simulates interactions over time
Produces a temporal report (day-by-day evolution)

This means you’re not just asking:

“What will happen?”

You’re observing:

“How things evolve step by step.”

📊 Sample Insights from My Test

From a 14-day simulation, I observed:

📈 Initial boost in productivity
⚖️ Diverging employee satisfaction
🔁 Emerging dependency on AI
🧩 Different behaviors across teams

It felt less like querying an LLM… and more like watching a system evolve.

💡 Where This Can Be Useful

This kind of simulation opens up interesting possibilities:

🏢 Organization & Product

AI adoption strategies
Remote work policy changes
Feature rollout impact

📦 Business Decisions

Pricing experiments
Customer behavior prediction
Growth strategy testing

🌍 Macro Scenarios

Economic shifts
Supply chain disruptions
Policy or geopolitical changes

🔄 RAG vs Simulation (My Take)

Approach	What it does
RAG	Retrieves and explains existing knowledge
Simulation (MiroFish)	Models and predicts possible futures

Both are powerful but they solve very different problems.

⚡ Final Thoughts

We’re slowly moving from:

👉 “Answering questions”
to
👉 “Rehearsing decisions”

MiroFish feels like an early step in that direction.

Still experimenting, but this approach definitely opens up a new way of thinking about AI systems.

If you’ve tried something similar or have ideas for scenarios to test — would love to hear 👇

Everyone Suddenly Said “RAG is Dead”

Haripriya Veluchamy — Sat, 04 Apr 2026 13:41:22 +0000

Lately I keep seeing this everywhere:

“RAG is dead”
“Vector search is outdated”
“Reasoning-based retrieval is the future”

And suddenly… everyone is talking like vector search is useless.

I’m not against the hype. These things happen.

But honestly, this whole idea didn’t just click for me immediately.

Because for me, this problem was already in my head for a long time.

Not because of hype.

Just because of my use case.

What I Was Actually Trying to Figure Out

I read a lot of long tech blogs and architecture posts.

After reading, I always have questions like:

“why did they do this?”
“what’s the tradeoff here?”
“what happens if we change this design?”

So I wanted a system where I can:

paste a document
ask questions
actually get useful answers

At some point I started thinking:

should I just stick with vector RAG?
or should I try something like PageIndex / reasoning-based retrieval?
or even something like Agent-style flow later?

That curiosity is what pushed me to build this.

Not the “RAG is dead” trend.

So I Built a Simple Comparison

Nothing fancy.

Just a small app where:

same document
same question
same model
only retrieval changes

Pipeline 1 — Vector RAG

split document
embed
store in ChromaDB
retrieve top-k
answer

This is what most of us are already doing.

Pipeline 2 — PageIndex

build a tree structure from the document
let the model navigate it
pick relevant sections
answer from that

This felt very different.

Not “searching”.

More like… “reading with guidance”.

What I Noticed

The difference is actually deeper than I expected.

Vector RAG:

find similar chunks

PageIndex:

figure out where the answer should be, then go there

That “where” part is interesting.

One Example

I tested with a Netflix architecture article.

Question:

Why did they use live origin instead of only CDN?

Vector RAG

faster (~7s)
decent answer
but retrieval had some noise

PageIndex

slower (~11s)
answer felt more precise
citations were cleaner

My Honest Take

Vector RAG is not dead.

But…

Blind chunking + embedding + top-k is not enough anymore (at least for some cases).

Where I See the Difference

Vector RAG works well when:

you have multiple documents
you want speed
you just need “good enough” answers

PageIndex works well when:

single long document
structured content
you want cleaner reasoning

What I’m Actually Thinking Now

I don’t think this is:

“one replaces the other”

Feels more like:

both solve different parts of the problem

What I’m more interested in now is:

can I combine them?
use vector search to find documents
then use something like PageIndex inside that?

That feels more practical.

Why I’m Exploring This

For my use case, I’m also thinking about:

can I plug this into an agent flow later?
how does retrieval affect agent decisions?
does better retrieval reduce hallucination in multi-step tasks?

That’s where this is going for me.

Final Thought

Honestly, I didn’t build this to prove anything.

Just to understand.

And one thing became clear very fast:

hype says “X is dead”
reality says “it depends”

If you’re building something similar, I’d really suggest:

Don’t pick a side early.

Test both.

You’ll understand the difference immediately.

Harness Engineering: The Concept I Didn't Know I Needed

Haripriya Veluchamy — Wed, 25 Mar 2026 18:24:15 +0000

Honestly, when I first heard the term Harness Engineering, I thought it was just another buzzword.

I already knew about Prompt Engineering. I had heard about Context Engineering. I thought, okay this is probably just the same thing with a fancier name.

But then I started actually using agentic tools like Cursor and Windsurf in my day-to-day work. And something clicked.

"Wait... this thing is not just answering my question. It's planning, building, testing, fixing — all on its own. How?"

That's when I went deeper. And what I found actually changed how I think about building with AI.

First What Even is a Context Window?

Before we get into Harness Engineering, need to understand one thing.

Every AI model has something called a context window. Think of it like a whiteboard. The model can only see what's written on that whiteboard right now. Once the conversation gets too long, old stuff disappears. And when you start a brand new chat the whiteboard is completely blank.

That's the core problem:

AI has no memory between sessions. Every new session, it starts fresh.

For a simple question answer task, that's fine. But what if the task takes days?

What is Harness Engineering?

Let me show you how this concept evolved:

Prompt Engineering   → How do I ask better questions?
Context Engineering  → How do I manage what's inside one session?
Harness Engineering  → How do I make an agent work across many sessions?

Harness Engineering is not about writing better prompts. It's about designing the system around the model so the agent always knows where it is, what it has done, and what it needs to do next. Even after the context window resets completely.

The Moment I Really Got It

When I was exploring how tools like Cursor work under the hood, I realized something.

When Cursor builds a feature for you:

It scans your codebase
Makes a plan
Implements step by step
Runs tests automatically
Fixes bugs it finds
Continues without you prompting every single move

That is Harness Engineering. The tool is not just "smart." Someone designed a system that makes it stay on track even as context windows reset.

A Real Example: Building an App with an Agent

Let's say you ask an AI agent:

"Build me a complete Food Delivery App."

This is not a one-session task. Here's what happens without any harness:

Session 1:
Agent builds Login page, starts Restaurant list...
Context window fills up. Stops.

Session 2:
Agent starts fresh. No memory.
Builds Login page again. 😵
Duplicate code. Broken app. Confused agent.

Now with Harness Engineering:

Before any coding starts, an Initializer Agent sets up three simple things:

features.json — Every task with a status:

[
  { "task": "Login Page", "status": "pending" },
  { "task": "Restaurant List", "status": "pending" },
  { "task": "Cart System", "status": "pending" }
]

progress.txt — A running log:

Last completed: Nothing yet
Next task: Login Page

setup.sh — A script to spin up the dev server automatically.

Now every new session just does this:

Read progress.txt  → know where to continue
Read features.json → pick the next pending task
Run setup.sh       → environment is ready
Build → Test → Update files → Git commit
Session ends cleanly. Next session picks up exactly here.

The agent has no memory but it doesn't need memory. The system remembers for it.

The 3 Things That Actually Make This Work

1. Legible Environment

Every session should be able to answer three questions just by reading files:

What is the goal?
What is done?
What is next?

Feature lists, progress logs, git history, docs — these are not optional. They are the foundation.

2. Verification Before Moving On

Agents have a habit of saying "Done!" when things are actually broken. I've seen this personally with Claude Code and Cursor.

The fix is giving the agent real tools to test its own work like running the app, checking the UI, catching bugs end to end. Not just saying it worked. Actually proving it.

3. Use Simple, Familiar Tools

This one surprised me the most.

Vercel built a very fancy, specialized agent with custom tools and heavy prompt engineering. It worked but barely. Fragile. Slow.

Then they removed almost all the custom tools and replaced everything with one simple batch command tool.

Result? 3.5x faster. 37% fewer tokens. Success rate went from 80% to 100%.

Why? Because models like Claude have seen billions of lines of code using git, grep, npm. They understand these natively. Custom tools are unfamiliar territory.

Simple tools the model already knows > Fancy tools you built from scratch.

How This Connects to MCP

If you've worked with MCP (Model Context Protocol) before, this connects directly.

In a Harness Engineering setup:

The Host (Claude Desktop, Cursor) is your computer
The MCP Client is like an adapter built into the host, you don't touch it
The MCP Server is what you build your custom tools, your file readers, your test runners

Your MCP Server becomes the hands of your long-running agent. It reads progress files, runs tests, queries databases, and verifies work all between sessions.

You only build the server. The host handles the rest.

Common Mistakes to Avoid

Letting the agent "one-shot" the whole task it will run out of context and leave things half done
Not giving the agent a way to test its own work it will always claim success
Building overly specialized tools simpler is almost always better
No clean state at end of each session the next session will be confused

A Simple Harness Checklist

Before building a long-running agent system, make sure:

[ ] Feature list exists with pass/fail status per task
[ ] Progress file updated at end of every session
[ ] Git commits made with descriptive messages
[ ] Dev environment spins up automatically (setup script)
[ ] Agent has real testing tools not just unit tests
[ ] Generic tools used wherever possible

Final Thoughts

The models today are genuinely capable. The missing piece is almost never the model itself.

It's the system around it.

That's what Harness Engineering is. Not a new model. Not a new prompt trick. Just smart system design that lets an agent stay on track across sessions, verify its own work, and actually finish what it started.

Once I understood this, the way I think about building AI-powered tools completely changed.

If you're building anything agentic even something small think about what happens when the context window resets. Does your agent know how to pick up where it left off?

If yes, you're already doing Harness Engineering. 😊

Why Your PostgreSQL Keeps Running Out of Connections

Haripriya Veluchamy — Tue, 17 Mar 2026 17:31:11 +0000

PostgreSQL connection errors are one of those things that look terrifying when they hit production.

I used to think:
"Why is the database refusing connections?"
"Did something crash?"
"Is the server overloaded?" 😅

Recently, while working on a production system, I ran into the classic TooManyConnectionsError. Not once twice. On two different services. Same database, same root cause.

That experience helped me clearly understand why this happens and how to fix it properly.

This post is me breaking that down in a simple way, based on what actually worked.

What does TooManyConnectionsError actually mean?

In simple terms, PostgreSQL has a limit on how many connections it allows at the same time.

This limit is set by max_connections and is usually:

50–100 for small/basic tiers
100–200 for general purpose tiers

When your application tries to open more connections than this limit, PostgreSQL says no. That's the error.

The important thing to understand:

The database isn't down. It just has no room for new connections.

But I'm using a connection pool... why is this still happening?

This is the part that confused me.

I was already using a connection pool. Every tutorial says "use a pool" and I did. So why was I still exhausting connections?

Here's what I found out:

The problem wasn't that I didn't have a pool. The problem was I had too many pools.

Let me explain.

How the bug actually works

Most people write their database client class something like this:

class DatabaseClient:
    def __init__(self):
        self.pool = create_pool(min=2, max=10)

Looks clean right? Every instance gets its own pool. Professional.

But here's what happens in a real application:

Health checker creates a DatabaseClient() → pool of 10
Dependency checker creates a DatabaseClient() → another pool of 10
Each worker/task creates a DatabaseClient() → another pool of 10 each

If you have 6 workers, that's already:

10 + 10 + (6 × 10) = 80 connections from ONE container

Run 2 containers? That's 160.

Your database allows 100.

💥 TooManyConnectionsError

And the worst part? Each pool individually looks reasonable. It's only when you add them all up that it explodes.

The fix is embarrassingly simple

One process, one pool. Everything borrows from it.

Instead of each class creating its own pool, create ONE pool at the module/process level and every instance uses that shared pool.

# Created once at module level
_shared_pool = None

def get_shared_pool():
    if _shared_pool is None:
        _shared_pool = create_pool(min=2, max=5)
    return _shared_pool

class DatabaseClient:
    def get_connection(self):
        return get_shared_pool().acquire()

Now it doesn't matter how many DatabaseClient() instances you create. They all share the same 5 connections.

That's it. That's the fix.

Things I learned the hard way

Here are some gotchas that bit me:

1. The close() trap

If your client class has a close() method that closes the pool, and some other code calls it mid-process congratulations, you just killed the pool for everyone.

Make close() a no-op on individual instances. Only close the shared pool when the entire process shuts down.

2. The cascade effect

When the database runs out of connections, it doesn't just fail your query. It also fails your health check. And when the health check fails, your orchestrator thinks the container is unhealthy and might restart it. Which creates new pools. Which makes things worse.

I literally got two alerts 30 seconds apart. First one: TooManyConnectionsError. Second one: dependency_check_failed. Same container, same root cause. One bug, two pages.

3. Make pool size configurable

Use an environment variable like PG_POOL_MAX_SIZE=5. When you're debugging at 2 AM, you don't want to redeploy just to change a number.

4. Do the napkin math

Before deploying, always calculate:

pool_max_size × max_replicas < max_connections - admin_headroom

Example:

Pool max: 5
Total replicas across all services: 10
Total: 50
Database max_connections: 100
Admin headroom: 20
50 < 80 ✅

If the math doesn't work, either reduce pool sizes or put a connection pooler like PgBouncer in front of the database.

Common mistakes I see (and made myself)

Creating a new pool per class instance instead of sharing one
Not accounting for multiple services hitting the same database
Forgetting that container scaling = more pools = more connections
Not closing pools gracefully on shutdown (idle connections linger)
No rollback plan when the fix itself has a bug 😅

A simple checklist before you deploy

Before pushing connection pool changes to production:

✅ Single shared pool at module/process level
✅ Pool size is configurable via env variable
✅ close() on individual instances is a no-op
✅ Shared pool closes on process shutdown
✅ Napkin math checks out
✅ Code compiles/passes syntax check (trust me on this one 😅)

Final thoughts

Connection pool exhaustion doesn't have to be scary.

If I had to summarize everything in one line:

Many pools = trouble. One shared pool = peace.

The bug is almost never that you forgot to use a pool. It's that you accidentally created too many of them.

Once you understand this, TooManyConnectionsError stops being a 1 AM panic and becomes just another thing you know how to handle.

Hope this helps someone who just got paged for the first time because their database "ran out of connections"... 🙂

EnvHub: A Zero-Knowledge Secret Manager Built with GitHub Copilot CLI

Haripriya Veluchamy — Sun, 15 Feb 2026 16:11:25 +0000

This is a submission for the GitHub Copilot CLI Challenge

What I Built

I built EnvHub a secure, versioned environment variable manager that finally treats your .env files with the same respect as your code.

Show the Full Application flow here https://youtu.be/54do2TvHB3Y

But before I show you what it does, let me tell you why it exists.

The Problem That Kept Happening

Every developer has done this at least once:

You're working on a feature. You pull the latest code. You run the app. It breaks. You spend 30 minutes debugging only to realize someone added a new environment variable and forgot to tell the team.

Or worse: you accidentally overwrote your .env with an old version and replaced new config with outdated values. Production keys? Gone. New API endpoints? Reverted.

I've lost count of how many times this happened to me and my team:

"Hey, can you Slack me the new .env?" Insecure and lazy
"Wait, which version of the .env are you using?" Pure chaos
"Who changed the DATABASE_URL and when?" No audit trail whatsoever
"I just overwrote the new config with my old .env..." Silent disasters

We treat code with version control with Git, pull requests, code reviews. But we treat our most sensitive configuration database passwords, API keys, encryption secrets like scratch paper.

That's insane.

So I decided to build something better.

The Vision: Git, But For Secrets

I wanted a tool that worked the way developers already think:

Push your .env to a central place (like git push)
Pull it down on any machine (like git pull)
History of every change with who, what, and when (like git log)
Encrypted so even if someone accesses the storage, they can't read the secrets

And critically: Zero-knowledge architecture. I didn't want to build another SaaS where I hold everyone's secrets. That's a liability nightmare and honestly, I wouldn't trust a random developer with my production credentials either.

The solution? You deploy EnvHub to YOUR Vercel account, with YOUR storage, encrypted with YOUR keys. I literally cannot access your data even if I wanted to.

What EnvHub Actually Does

Let me walk you through the complete experience.

The Web Dashboard

When you first log in (via GitHub OAuth no new passwords to remember), you see a clean, dark-themed dashboard.

On the left sidebar, you have your Workspace Explorer. It's organized hierarchically:

📁 Project (e.g., "my-startup")
   └── 🖥️ Service (e.g., "backend-api")
       └── 🔵 Environment (e.g., "production")

Click on any environment, and you see all your variables displayed clearly keys and values visible right there, no extra clicks needed.

Want to edit? Click the edit button, and you get a full editor. You can:

Add variables individually one at a time with key-value inputs
Bulk upload paste your entire .env file content at once

Every save requires a change reason because six months from now, you'll want to know why someone changed the STRIPE_API_KEY.

Below the variables, you'll see the Version History table. Every single change is recorded:

Version	Date	User	Reason
v3	Feb 14, 2026	@harivelu0	Updated Stripe to production keys
v2	Feb 10, 2026	@teammate	Added Redis connection string
v1	Feb 1, 2026	@harivelu0	Initial setup

Click on any version to view what the variables looked like at that point in time. Full time-travel debugging. Accidentally overwrote something? Just check the previous version.

The CLI (Where the Real Power Lives)

The web dashboard is great for browsing and quick edits. But for developers who live in the terminal, the CLI is where it gets powerful.

Install it with one command:

pip install https://your-envhub-instance.vercel.app/cli/envhub_cli-2.0.3-py3-none-any.whl

Initialize it to point to your instance:

envhub init --api-url https://your-envhub-instance.vercel.app/api

Why do we need this step? Because EnvHub is self-hosted everyone deploys their own instance. Your company's EnvHub might live at envhub.mycompany.com while another team uses secrets.startup.io. The init command tells the CLI where YOUR backend lives.

Login using your existing GitHub credentials:

envhub login

This uses the GitHub CLI (gh) under the hood, so if you're already logged into gh, you're good to go. No new passwords, no separate accounts.

Now you're ready.

Push your local .env to the cloud:

envhub push -p my-startup -s backend-api -e production -r "Added new payment gateway keys"

The -r flag is the change reason required, so your team always knows why things changed. -p flag refers project name, -s refers service and -e refers environment

Pull variables to any machine:

# Print to console (great for reviewing or piping)
envhub pull -p my-startup -s backend-api -e production

# Save directly to a file
envhub pull -p my-startup -s backend-api -e production -o .env

By default, pull outputs to the console. This is intentional it lets you review what you're getting before saving it.

View the audit history:

envhub history -p my-startup -s backend-api -e production

┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Version ┃ Date                ┃ User        ┃ Reason                          ┃
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 3       │ 2026-02-14 09:30:00 │ @harivelu0  │ Added new payment gateway keys  │
│ 2       │ 2026-02-10 14:22:00 │ @teammate   │ Added Redis connection string   │
│ 1       │ 2026-02-01 11:00:00 │ @harivelu0  │ Initial setup                   │
└─────────┴─────────────────────┴─────────────┴─────────────────────────────────┘

New developer joining the team? envhub pull. Spinning up a new server? envhub pull. Debugging why production broke last Tuesday? envhub history.

The Security Model (I Take This Seriously)

Let me be clear about how EnvHub protects your secrets.

1. Zero-Knowledge Architecture

When you deploy EnvHub, you deploy it to your own Vercel account. Your data lives in your own Vercel Blob storage. The encryption key is your own key that you generate.

I, as the creator of EnvHub, have absolutely zero access to:

Your Vercel account
Your Blob storage
Your encryption key
Your secrets

Even if someone subpoenas me, I literally cannot hand over your data because I don't have it.

2. Encryption at Rest

Every single variable value is encrypted using Fernet (AES-128 symmetric encryption) before it's stored.

Here's what actually gets saved in Vercel Blob:

Without your ENVHUB_MASTER_KEY, those encrypted values are worthless.

3. Organization Gating

Set the ALLOWED_ORGS environment variable to your GitHub organization name, and only members of that organization can log in.

ALLOWED_ORGS=my-company,partner-org

Random person finds your EnvHub URL? They can't even access anything. Access denied.

4. Audit Everything

Every single change records:

Who made the change (GitHub username)
When they made it (timestamp)
What they changed (full variable snapshot)
Why they changed it (required change reason)

Six months later when production breaks, you'll know exactly who to ask and what changed.

How GitHub Copilot CLI Made This Possible

Here's my situation: I work in DevOps. I'm comfortable with infrastructure, CI/CD pipelines, cloud services, React, and Next.js. But building a distributable Python CLI tool? That was new territory for me.

GitHub Copilot CLI was my guide through unfamiliar terrain.

The First Conversation: "How Do I Structure This?"

I knew what I wanted to build. I didn't know how to build a proper CLI.

I asked Copilot:

"How do I create a Python CLI that can authenticate with a Next.js API using GitHub OAuth?"

Copilot gave me a clear strategy:

Use the typer library for CLI structure (modern, type-hint based)
Use the rich library for beautiful terminal output and created folder structure and initialize with code snippets
Don't reinvent auth use the GitHub CLI (gh auth token) to get the user's existing token

That last point was the key insight. Instead of building a complex OAuth flow for the terminal, I just grab the token that's already there from the GitHub CLI. Users are already logged into gh for their daily work. Why make them log in again?

def get_auth_headers(api_url):
    result = subprocess.run(["gh", "auth", "token"], capture_output=True, text=True)
    token = result.stdout.strip()
    return {"Authorization": f"Bearer {token}"}

Secure authentication in 4 lines of code.

The Battle: Cross-Platform Path Handling

My CLI worked perfectly on my Linux. Then I tested it on Windows.

Immediate crash.

The problem? I was building file paths with forward slashes (/), but Windows uses backslashes (\). When uploading to Vercel Blob, the paths were getting mangled.

I asked Copilot:

"My CLI works on Linu but fails on Windows because of path separators. How do I handle this?"

Copilot showed me how to normalize paths consistently:

from pathlib import Path

def normalize_path(path_str):
    return Path(path_str).as_posix()  # Always use forward slashes

Simple fix. Now EnvHub runs seamlessly on Windows, Mac, and Linux.

The Optimization: Smart "No-Change" Detection

Early testers reported a problem: they'd accidentally run envhub push twice and create duplicate versions with identical content. The history log was getting cluttered with meaningless entries.

I asked Copilot:

"How can I detect if the variables being pushed are identical to the current version and skip the write?"

Copilot helped me implement a comparison check on the backend:

// Check if variables actually changed
const currentBundle = await manager.getBundle(project, service, environment);
if (currentBundle && JSON.stringify(currentBundle.variables) === JSON.stringify(variables)) {
    return NextResponse.json({
        status: 'success',
        version: currentBundle.version,
        message: 'No changes detected. Version not incremented.'
    });
}

Now EnvHub only creates a new version when something actually changes. Clean history, lower storage costs.

The Encryption Decision

I knew I needed encryption at rest. I had some familiarity with the options AES, RSA, and others but I wanted to make sure I picked the right approach for this specific use case.

I asked Copilot:

"What's a good symmetric encryption approach that works in both Python and Node.js?"

The answer: Fernet.

Fernet is built on AES-128-CBC with HMAC for authentication. It's secure, widely used, and has solid libraries for both Python and Node.js. Copilot pointed me to the exact packages:

Python: cryptography library
Node.js: fernet package

# Python (CLI side)
from cryptography.fernet import Fernet
key = Fernet.generate_key()
f = Fernet(key)
encrypted = f.encrypt(b"my secret value")

// Node.js (API side)
import fernet from 'fernet';
const secret = new fernet.Secret(key);
const token = new fernet.Token({ secret, ttl: 0 });
const encrypted = token.encode(value);

The right tool for the job, implemented quickly.

The Distribution Problem

Here's a challenge I didn't anticipate: how do users install the CLI?

I couldn't publish to PyPI (the public Python package index) because each user's CLI needs to point to their specific EnvHub instance. My EnvHub lives at envhub-harivelu.vercel.app. Your EnvHub lives at envhub-yourcompany.vercel.app. We can't share the same hardcoded package.

I asked Copilot:

"How can I distribute a Python CLI where each deployment has a different backend URL?"

Copilot's solution was elegant: host the .whl file on the EnvHub instance itself.

Each deployed EnvHub instance serves its own CLI package at /cli/envhub_cli-2.0.3-py3-none-any.whl. Users install directly from their instance:

pip install https://your-instance.vercel.app/cli/envhub_cli-2.0.3-py3-none-any.whl

Then they run envhub init to configure the API URL. Decentralized, self-contained, and each team gets their own setup.

Demo

🎥 Watch the Full Walkthrough: https://youtu.be/54do2TvHB3Y

🔗 Try the Live Demo: https://env-hub-nu.vercel.app/
(Click "Continue with Demo Account" you can only access the sandboxed demo-project)

🐙 Source Code: github.com/Harivelu0/EnvHub

Screenshots

The Login Experience

Clean, dark-themed login with GitHub OAuth. No new passwords.

The Dashboard

Your entire workspace at a glance. Projects, services, environments.

Variable Editor

Add variables individually or bulk upload. Change reason required.

Version History

Full audit trail. Click any version to see that point in time.

CLI in Action

Push, pull, and history right from your terminal.

Try It Yourself (Complete Setup Guide)

Want to deploy your own EnvHub? Here's the full walkthrough.

Prerequisites

Before you start, make sure you have:

A GitHub account
A Vercel account (free tier works)
Python 3.8+ installed
GitHub CLI (gh) installed get it here

Step 1: Clone and Deploy

# Clone the repository
git clone https://github.com/Harivelu0/EnvHub.git
cd EnvHub

# Install dependencies
npm install

# Deploy to Vercel
vercel deploy

Vercel will ask you some questions. Accept the defaults. At the end, you'll get a URL like https://envhub-yourname.vercel.app.

Step 2: Create a GitHub OAuth App

Go to GitHub Developer Settings
Click "New OAuth App"
Fill in:
- Application name: EnvHub (or whatever you want)
- Homepage URL: https://envhub-yourname.vercel.app
- Authorization callback URL: https://envhub-yourname.vercel.app/api/auth/callback/github
Click "Register application"
Copy the Client ID
Click "Generate a new client secret" and copy the Client Secret

Step 3: Create Vercel Blob Storage

Go to Vercel Dashboard
Click on "Storage" in the sidebar
Click "Create Database" → Select "Blob"
Name it something like envhub-storage
Copy the Read/Write Token

Step 4: Generate Your Encryption Key

Run this command to generate a secure Fernet key:

python3 -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

You'll get something like: Sn-S2vY5W4HuScQ60IG8JXiK9aIMmC-SadbyY1NxWBY=

Keep this safe! If you lose this key, you lose access to all your encrypted variables.

Step 5: Configure Environment Variables

In your Vercel Dashboard:

Go to your EnvHub project
Click "Settings" → "Environment Variables"
Add these variables:

Variable	Value
`GITHUB_ID`	Your OAuth App Client ID
`GITHUB_SECRET`	Your OAuth App Client Secret
`NEXTAUTH_SECRET`	Run `openssl rand -base64 32` and paste the result
`NEXTAUTH_URL`	`https://envhub-yourname.vercel.app`
`BLOB_READ_WRITE_TOKEN`	Your Vercel Blob token
`ENVHUB_MASTER_KEY`	Your Fernet key from Step 4
`ALLOWED_ORGS`	Your GitHub organization name (optional but recommended)
`ALLOWED_USERS`	Your GitHub username (optional, for personal use)

Step 6: Redeploy

vercel --prod

Step 7: Install and Configure the CLI

# Install the CLI from your instance
pip install https://envhub-yourname.vercel.app/cli/envhub_cli-2.0.3-py3-none-any.whl

# Point it to your instance
envhub init --api-url https://envhub-yourname.vercel.app/api

# Login via GitHub
envhub login

Step 8: Push Your First Environment

# Create a test .env file
echo "DATABASE_URL=postgres://localhost:5432/mydb" > .env
echo "API_KEY=sk_test_12345" >> .env

# Push it to EnvHub
envhub push -p my-project -s backend -e dev -r "Initial setup"

# Verify it worked
envhub pull -p my-project -s backend -e dev

You're done! You now have a fully functional, encrypted, versioned secret manager.

The Tech Stack

Layer	Technology	Why
Frontend	Next.js 16 + React 19	Latest features, seamless API routes
Styling	Tailwind CSS 4	Fast iteration on the dark theme
Auth	NextAuth.js + GitHub OAuth	No new passwords, org gating built-in
Storage	Vercel Blob	Serverless, no database to manage
Encryption	Fernet (AES-128)	Industry standard, cross-platform
CLI	Python + Typer + Rich	Clean terminal UI, good DX

The entire backend is serverless. No database servers to maintain, no scaling headaches. Vercel Blob handles storage, API routes handle logic.

What's Next?

EnvHub is already useful for small teams, but I have plans:

Cloud-Agnostic Storage

Currently uses Vercel Blob. Some enterprises need their data in AWS S3 or Azure Blob Storage for compliance. I'm abstracting the storage layer to support multiple providers.

Enterprise Identity (RBAC)

Right now, access control is binary: you're in the allowed org or you're not. For larger teams, I want role-based access:

DevOps Lead → All projects
Backend Dev → Backend services only
Intern → Read-only on dev environments

Audit Log Export

For SOC2 compliance, security teams need to export audit logs to SIEM tools (Splunk, Azure Sentinel). Adding a one-click export feature.

Lessons Learned

1. Scratch your own itch. The best tools come from real frustration. I built EnvHub because I was tired of the .env chaos. That frustration carried me through the hard parts.

2. AI accelerates learning. GitHub Copilot CLI didn't build EnvHub for me. But it helped me learn CLI development faster than I would have on my own. I still made the architecture decisions and debugged the weird issues. Copilot just shortened the learning curve.

3. Security can't be an afterthought. It would've been easier to build EnvHub as a hosted SaaS. But zero-knowledge architecture is the right choice for a secrets tool. Your secrets should be yours.

4. Developer experience matters. A secure tool that's annoying to use gets ignored. I spent real time on the CLI output formatting and dashboard UI. Good tools should feel good to use.

Conclusion

EnvHub started as frustration with the way we handle environment variables and became a tool I'm genuinely proud of.

It solves a real problem. It's secure by design. It's open source. And building it taught me that with tools like GitHub Copilot CLI, you can pick up new skills faster than ever.

If you've ever Slacked a .env file to a teammate, or overwritten new config with old values, or spent an hour debugging a missing variable EnvHub is for you.

⭐ Star the repo: https://github.com/Harivelu0/EnvHub

🐛 Found a bug? Open an issue

💬 Questions? Drop a comment below!

Thanks for reading. Now go secure your secrets.

Canary Deployments in Azure Container Apps: A Complete Guide

Haripriya Veluchamy — Tue, 03 Feb 2026 12:36:07 +0000

Why Do We Need Canary Deployments?

Imagine you push a new update to production. Everything looked fine in testing, but suddenly your users start experiencing errors. By the time you notice, thousands of users are affected. You scramble to rollback, but the damage is done.

This is exactly what canary deployments prevent.

The Problem with Traditional Deployments

In a traditional deployment, when you push new code, 100% of your traffic immediately goes to the new version. If something breaks, all your users are affected.

Traditional Deployment:

Before:  [Old Version] ████████████ 100% traffic
After:   [New Version] ████████████ 100% traffic  ← If broken, everyone is affected!

The Canary Solution

Canary deployment gets its name from the old mining practice of bringing canaries into coal mines. If dangerous gases were present, the canary would die first, warning miners to evacuate.

Similarly, in canary deployments, we send a small portion of traffic to the new version first. If something goes wrong, only a small percentage of users are affected.

Canary Deployment:

Step 1:  [Old Version] ██████████ 90%
         [New Version] ██ 10%        ← Test with small traffic

Step 2:  [Old Version] ██████ 50%
         [New Version] ██████ 50%    ← Gradually increase

Step 3:  [New Version] ████████████ 100%  ← Full rollout after validation

How Azure Container Apps Supports Canary Deployments

Azure Container Apps has built-in support for traffic splitting through its revision system. Every time you deploy, a new revision is created. You can then control how much traffic goes to each revision.

Key Concepts

Revision: A snapshot of your container app at a specific point in time. Each deployment creates a new revision.

Traffic Weight: The percentage of traffic each revision receives. All weights must add up to 100%.

Active Revision: A revision that is running and can receive traffic.

Inactive Revision: A revision that exists but receives no traffic and consumes no resources.

Implementation: Single Region Canary Deployment

Let's build a complete canary deployment pipeline for a single-region setup.

The Deployment Flow

Push Code
    │
    ▼
Build Image (tagged with git SHA)
    │
    ▼
Save Current Stable Revision
    │
    ▼
Deploy New Revision (Canary)
    │
    ▼
Health Check (5 attempts)
    │
    ├── PASS ──▶ Split Traffic 50/50
    │
    └── FAIL ──▶ Auto Rollback + Alert

Step 1: Build and Deploy with Unique Tags

Never use the latest tag for production deployments. It's mutable and causes inconsistencies.

- name: Deploy Application
  run: |
    # Use git SHA for unique, immutable image tag
    IMAGE_TAG="${{ github.sha }}"
    REVISION_SUFFIX=$(echo "$IMAGE_TAG" | cut -c1-8)

    # Build and push with unique tag
    az acr build \
      --registry $REGISTRY_NAME \
      --image $APP_NAME:$IMAGE_TAG \
      --file Dockerfile .

    # Deploy with revision suffix for easy identification
    az containerapp update \
      --name $APP_NAME \
      --resource-group $RESOURCE_GROUP \
      --image "$ACR_SERVER/$APP_NAME:$IMAGE_TAG" \
      --revision-suffix "$REVISION_SUFFIX"

Step 2: Health Check

Before routing traffic, verify the new deployment is healthy.

- name: Health Check
  run: |
    FQDN=$(az containerapp show \
      --name $APP_NAME \
      --resource-group $RESOURCE_GROUP \
      --query properties.configuration.ingress.fqdn \
      --output tsv)

    HEALTH_PASSED=false
    for i in {1..5}; do
      if curl -sf --max-time 5 "https://${FQDN}/health" > /dev/null 2>&1; then
        HEALTH_PASSED=true
        break
      fi
      sleep 10
    done

    echo "HEALTH_PASSED=${HEALTH_PASSED}" >> $GITHUB_ENV

This tries 5 times with 10-second intervals. Why 5 attempts? Because containers need time to start, connect to databases, and warm up.

Step 3: Traffic Splitting

If health check passes, split traffic between stable and canary.

- name: Route 50/50 Traffic
  if: env.HEALTH_PASSED == 'true'
  run: |
    STABLE="${{ env.STABLE_REVISION }}"
    CANARY="${{ env.CANARY_REVISION }}"

    if [[ -n "${STABLE}" && "${STABLE}" != "${CANARY}" ]]; then
      az containerapp ingress traffic set \
        --name $APP_NAME \
        --resource-group $RESOURCE_GROUP \
        --traffic-weight ${STABLE}=50 ${CANARY}=50
    else
      # First deployment - no stable exists
      az containerapp ingress traffic set \
        --name $APP_NAME \
        --resource-group $RESOURCE_GROUP \
        --traffic-weight ${CANARY}=100
    fi

Step 4: Auto Rollback on Failure

If health check fails, immediately rollback to protect users.

- name: Auto Rollback
  if: env.HEALTH_PASSED == 'false'
  run: |
    # Route all traffic back to stable
    az containerapp ingress traffic set \
      --name $APP_NAME \
      --resource-group $RESOURCE_GROUP \
      --traffic-weight ${{ env.STABLE_REVISION }}=100

    # Deactivate broken revision to free resources
    az containerapp revision deactivate \
      --name $APP_NAME \
      --resource-group $RESOURCE_GROUP \
      --revision "${{ env.CANARY_REVISION }}" || true

    # Send alert to team
    curl -X POST "$WEBHOOK_URL" \
      -H "Content-Type: application/json" \
      -d '{"text":"Auto rollback triggered for '$APP_NAME'"}'

    exit 1

Implementation: Multi-Region Canary Deployment

For applications deployed across multiple regions, we need a sequential approach to prevent global outages.

Why Sequential Deployment?

If you deploy to all regions simultaneously and there's a bug, all regions go down together. Sequential deployment means:

Deploy to Region 1
Health check Region 1
If healthy, deploy to Region 2
Health check Region 2
Continue...

If any region fails, stop the rollout immediately.

Multi-Region Deployment Flow

Deploy to US Region
    │
    ▼
Health Check US
    │
    ├── FAIL ──▶ Stop! Don't deploy to other regions
    │
    └── PASS ──▶ Deploy to EU Region
                    │
                    ▼
                Health Check EU
                    │
                    ├── FAIL ──▶ Rollback EU, keep US on canary
                    │
                    └── PASS ──▶ All regions on 50/50 split

Multi-Region Implementation

- name: Deploy to All Regions
  run: |
    REGIONS="us eu uk"
    REVISION_SUFFIX=$(echo "${{ github.sha }}" | cut -c1-8)

    for REGION in $REGIONS; do
      APP_NAME="myapp-${REGION}-prod"

      echo "Deploying to $REGION..."

      # Deploy
      az containerapp update \
        --name $APP_NAME \
        --resource-group $RESOURCE_GROUP \
        --image $IMAGE_NAME \
        --revision-suffix "$REVISION_SUFFIX"

      # Health check this region before proceeding
      FQDN=$(az containerapp show \
        --name $APP_NAME \
        --resource-group $RESOURCE_GROUP \
        --query "properties.configuration.ingress.fqdn" \
        --output tsv)

      HEALTHY=false
      for i in {1..5}; do
        if curl -sf "https://$FQDN/health" > /dev/null 2>&1; then
          HEALTHY=true
          break
        fi
        sleep 10
      done

      if [[ "$HEALTHY" != "true" ]]; then
        echo "Region $REGION failed health check. Stopping deployment."
        exit 1
      fi

      echo "$REGION deployed and healthy"
    done

- name: Route Traffic All Regions
  run: |
    REGIONS="us eu uk"

    for REGION in $REGIONS; do
      APP_NAME="myapp-${REGION}-prod"
      CANARY="${APP_NAME}--${REVISION_SUFFIX}"

      # Get stable revision for this region
      STABLE=$(az containerapp revision list \
        --name $APP_NAME \
        --resource-group $RESOURCE_GROUP \
        --query "[?properties.trafficWeight>\`0\` && name!='${CANARY}'] | [0].name" \
        -o tsv)

      if [[ -n "$STABLE" ]]; then
        az containerapp ingress traffic set \
          --name $APP_NAME \
          --resource-group $RESOURCE_GROUP \
          --traffic-weight ${STABLE}=50 ${CANARY}=50
      fi
    done

Manual Promotion and Rollback

After the canary has been running for a while and you've validated it's working correctly, you need to promote it to 100% or rollback if issues are found.

Promotion Workflow

name: 'Canary: Promote or Rollback'

on:
  workflow_dispatch:
    inputs:
      action:
        description: 'Action to perform'
        required: true
        type: choice
        options:
          - promote-100
          - rollback

jobs:
  canary-action:
    runs-on: ubuntu-latest
    steps:
      - name: Azure Login
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Get Current Revisions
        id: revisions
        run: |
          REVISIONS=$(az containerapp revision list \
            --name $APP_NAME \
            --resource-group $RESOURCE_GROUP \
            --query "[?properties.trafficWeight>\`0\`] | sort_by(@, &properties.trafficWeight)" \
            -o json)

          # Highest traffic = stable, lowest = canary
          STABLE=$(echo "$REVISIONS" | jq -r 'last | .name')
          CANARY=$(echo "$REVISIONS" | jq -r 'first | .name')

          echo "STABLE=${STABLE}" >> $GITHUB_OUTPUT
          echo "CANARY=${CANARY}" >> $GITHUB_OUTPUT

      - name: Execute Action
        run: |
          case "${{ github.event.inputs.action }}" in
            promote-100)
              # Send 100% to canary
              az containerapp ingress traffic set \
                --name $APP_NAME \
                --resource-group $RESOURCE_GROUP \
                --traffic-weight ${{ steps.revisions.outputs.CANARY }}=100

              # Deactivate old stable
              az containerapp revision deactivate \
                --name $APP_NAME \
                --resource-group $RESOURCE_GROUP \
                --revision "${{ steps.revisions.outputs.STABLE }}" || true
              ;;

            rollback)
              # Send 100% back to stable
              az containerapp ingress traffic set \
                --name $APP_NAME \
                --resource-group $RESOURCE_GROUP \
                --traffic-weight ${{ steps.revisions.outputs.STABLE }}=100

              # Deactivate broken canary
              az containerapp revision deactivate \
                --name $APP_NAME \
                --resource-group $RESOURCE_GROUP \
                --revision "${{ steps.revisions.outputs.CANARY }}" || true
              ;;
          esac

Cleaning Up Inactive Revisions

Over time, you'll accumulate many inactive revisions. While they don't consume compute resources, they clutter your revision list. Deactivating revisions after promotion or rollback keeps things clean.

# Deactivate a specific revision
az containerapp revision deactivate \
  --name $APP_NAME \
  --resource-group $RESOURCE_GROUP \
  --revision "myapp--abc12345"

The || true after deactivation commands ensures the pipeline doesn't fail if the revision is already inactive.

Best Practices

1. Use Immutable Image Tags

Never use latest. Always tag images with git SHA or build number.

❌ myapp:latest
✅ myapp:a1b2c3d4e5f6

2. Start with Higher Canary Percentage for Low Traffic

If you have few users, start with 50/50. You need enough traffic on the canary to detect issues.

Low traffic:  50/50 split (need volume to detect issues)
High traffic: 10/90 split (even 10% is thousands of users)

3. Implement Proper Health Checks

Your /health endpoint should verify:

Application is running
Database connections work
Critical dependencies are reachable

4. Set Up Alerts

Always send alerts on rollback. Your team needs to know when deployments fail.

5. Use Revision Suffixes

Name revisions with git SHA prefix for easy identification.

myapp--a1b2c3d4  ← Easy to trace back to commit

Conclusion

Canary deployments are essential for safe production releases. With Azure Container Apps, you get native support for traffic splitting that makes implementation straightforward.

Key takeaways:

Deploy new code as a separate revision
Run health checks before routing traffic
Split traffic gradually (50/50 for low traffic, 10/90 for high traffic)
Auto-rollback on health check failure
Use sequential deployment for multi-region setups
Clean up inactive revisions after promotion/rollback

The initial setup takes effort, but the peace of mind knowing your deployments are safe is worth it. No more 2 AM panic calls because a bad deployment took down production.

DEV Community: Haripriya Veluchamy

Serving ML Artifacts from Amazon S3 Files How I used After the Launch

The Honest Story

What I Built

The Problem S3 Files Solves

Architecture

Setting Up S3 Files What Actually Matters

1. S3 bucket needs versioning enabled

2. IAM role trust principal is elasticfilesystem.amazonaws.com

3. The IAM role needs EventBridge permissions

4. boto3 API uses camelCase

5. Mount needs amazon-efs-utils, not plain NFS

6. EC2 instance needs AmazonS3FilesClientFullAccess

The Serving Code

Weekly Index Updates The Real S3 Files Advantage

Results

What I Learned

Resources

How I Built a Self Resizing EC2 for My ML Data

The Pain Point

The First Thought Let AWS ML Decide

The Second Thought Bedrock Agent

The Solution EventDriven VM Resize

Architecture

Setup 3 Steps

Step 1 Deploy AWS infrastructure

Step 2 Copy agent to your VM

Step 3 Add your pipeline steps

The Key Design Decision steps.conf

The Email Alert

Difficulties I Faced

Live Demo

Cost Reality

When to Use This

Why a VM and Not Glue or EMR?

Repo

What's Next

How I Built Production AI Agent Monitoring with Langfuse

Full Trace Visibility

Deterministic Checks

Faithfulness Checks

LLM Judges

100% Traffic Monitoring

Cost + Latency Tracking

What This Actually Caught

Wrong attribution

Ignored outputs

Routing mistakes

Stack

Final Takeaway

I Built a Self-Updating SEO Brain Inspired by Andrej Karpathy's LLM Wiki

The Tweet That Changed How I Think About AI + Knowledge

The Problem: RAG Has No Memory

Karpathy's Core Insight

What I Built: LLM Wiki for SEO

The 3 Layers

The Flow

The Knowledge Graph in Plain Markdown

Obsidian as the Visualization Layer

Interactive Query Mode

Weekly Lint Pass

The Fix Agent (Skill-Based, Not LLM-Generated Code)

vs Cognee: Honest Comparison

What Karpathy Got Right

Try It Yourself

I built a 20-20-20 eye reminder because my eyes were dying at my desk

What it does

The core trick: disabling the close button

The countdown + locked button

The alarm

The full popup

Setup: auto-start on Windows boot

Testing it

Dependencies

Full source

Harivelu0 / eye-remainder

eye-reminder

Why

Demo

Features

2. IAM role trust principal is `elasticfilesystem.amazonaws.com`

5. Mount needs `amazon-efs-utils`, not plain NFS

6. EC2 instance needs `AmazonS3FilesClientFullAccess`