<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pranay Ravi</title>
    <description>The latest articles on DEV Community by Pranay Ravi (@pranay_raavi).</description>
    <link>https://dev.to/pranay_raavi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3372066%2Fdab53ee7-0e9d-4dab-8313-32931f0fa124.jpg</url>
      <title>DEV Community: Pranay Ravi</title>
      <link>https://dev.to/pranay_raavi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pranay_raavi"/>
    <language>en</language>
    <item>
      <title>Stop Running psql Commands by Hand — Build a REST API for PostgreSQL User Management</title>
      <dc:creator>Pranay Ravi</dc:creator>
      <pubDate>Fri, 29 May 2026 21:45:12 +0000</pubDate>
      <link>https://dev.to/pranay_raavi/stop-running-psql-commands-by-hand-build-a-rest-api-for-postgresql-user-management-1bba</link>
      <guid>https://dev.to/pranay_raavi/stop-running-psql-commands-by-hand-build-a-rest-api-for-postgresql-user-management-1bba</guid>
      <description>&lt;p&gt;If you manage PostgreSQL databases across multiple environments, you've probably done this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;SSH to the DB host (or connect via &lt;code&gt;psql&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;CREATE USER jsmith CONNECTION LIMIT 20 PASSWORD '...'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Slack the password to the developer&lt;/li&gt;
&lt;li&gt;Forget to log it anywhere&lt;/li&gt;
&lt;li&gt;Repeat for every environment, every onboarding, every access request&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's tedious, error-prone, and leaves zero audit trail. Here's a better way.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;pg-user-api&lt;/strong&gt; is a lightweight Flask REST API that wraps PostgreSQL user provisioning in clean HTTP endpoints. You register your databases once in a SQLite inventory, then any tooling — CI pipelines, internal portals, Ansible playbooks, or a plain &lt;code&gt;curl&lt;/code&gt; — can create and manage users across environments without ever touching &lt;code&gt;psql&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/pcraavi/PostgreSQL-user-creation-API" rel="noopener noreferrer"&gt;pcraavi/PostgreSQL-user-creation-API&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem It Solves
&lt;/h2&gt;

&lt;p&gt;In teams that span dev, QA, UAT, and prod, you end up with different patterns of users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;App service accounts&lt;/strong&gt; — named after the host/port combo (&lt;code&gt;web01_8080&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes workload accounts&lt;/strong&gt; — named after env prefix + farm (&lt;code&gt;dv_gearservice&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Individual dev/QA accounts&lt;/strong&gt; — low connection limits, scoped to non-prod&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read-only analyst accounts&lt;/strong&gt; — prod only, no DDL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DBA accounts&lt;/strong&gt; — &lt;code&gt;CREATEDB CREATEROLE LOGIN&lt;/code&gt;, rarely provisioned&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each type has different &lt;code&gt;CONNECTION LIMIT&lt;/code&gt; values, privilege levels, and naming conventions. Encoding these patterns in an API means the rules are consistent, repeatable, and auditable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The project is intentionally small — five Python files and a requirements list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_user_api/
├── app.py              # Flask app — all endpoints
├── auth.py             # HTTP Basic Auth (constant-time compare)
├── database.py         # SQLite registry + audit log
├── notifications.py    # Notification stubs (Webex / Slack / Email)
├── seed_db.py          # One-time setup: creates DB + sample records
└── requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3sjt72008n6q9zv3tka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3sjt72008n6q9zv3tka.png" alt=" " width="800" height="1011"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two credential pairs, clearly separated:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PG_API_USER&lt;/code&gt; / &lt;code&gt;PG_API_PASS&lt;/code&gt; — who can call this API (your team/tooling)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PG_ADMIN_USER&lt;/code&gt; / &lt;code&gt;PG_ADMIN_PASS&lt;/code&gt; — the PostgreSQL DBA role that executes DDL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The DBA credentials never appear in API URLs or response bodies. Callers only need the API credentials plus &lt;code&gt;env&lt;/code&gt; and &lt;code&gt;database&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQLite as a config-free registry:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rather than a static YAML or environment file listing every hostname, databases are registered once in a &lt;code&gt;db_registry&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;env&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dev&lt;/code&gt; / &lt;code&gt;qa&lt;/code&gt; / &lt;code&gt;uat&lt;/code&gt; / &lt;code&gt;prod&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;db_name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL database name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hostname&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;FQDN or IP of the host&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;port&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL port (default 5432)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;active&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;1&lt;/code&gt; = active, &lt;code&gt;0&lt;/code&gt; = skip&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every endpoint looks up the hostname dynamically from this registry. No hardcoded connection strings anywhere in application code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Endpoints
&lt;/h2&gt;

&lt;p&gt;All endpoints are &lt;code&gt;GET&lt;/code&gt; with query params (by design — simple to &lt;code&gt;curl&lt;/code&gt;, simple to call from automation scripts).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /                            # Health check, no auth required
GET /api/v1/registry             # List registered databases
GET /api/v1/users/all            # List all PostgreSQL roles on a database
GET /api/v1/users/app            # Create VM/container service account
GET /api/v1/users/app-k8s        # Create Kubernetes workload account
GET /api/v1/users/devqa          # Create individual dev/QA user
GET /api/v1/users/devlead        # Create dev-lead user
GET /api/v1/users/readonly       # Create read-only user
GET /api/v1/users/dba            # Create DBA user (CREATEDB + CREATEROLE)
GET /api/v1/users/reset          # Reset a user's password
GET /api/v1/users/search-path    # Update search_path for a user
GET /api/v1/users/find           # Look up a specific user
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every user-creation endpoint returns the same structured response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"web01_8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"generatedSecurePassword"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user created"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hostname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pg-dev-01.example.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"myapp_dev"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5432"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dev"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Password is returned once at creation time. The API uses &lt;code&gt;secrets.token_urlsafe(16)&lt;/code&gt; for generation — no insecure &lt;code&gt;random&lt;/code&gt; module.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Clone and install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/pcraavi/PostgreSQL-user-creation-API.git
&lt;span class="nb"&gt;cd &lt;/span&gt;PostgreSQL-user-creation-API
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Seed the registry
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python seed_db.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates &lt;code&gt;pg_registry.db&lt;/code&gt; with sample entries. Edit &lt;code&gt;SAMPLE_RECORDS&lt;/code&gt; in &lt;code&gt;seed_db.py&lt;/code&gt; to point at your real PostgreSQL hosts.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Set credentials via environment variables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Windows PowerShell&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;PG_API_USER&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pgadmin"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;PG_API_PASS&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ch@ngeMe2024!"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;PG_ADMIN_USER&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"role_create"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;PG_ADMIN_PASS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your_pg_password"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Linux / macOS&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PG_API_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"pgadmin"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PG_API_PASS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Ch@ngeMe2024!"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PG_ADMIN_USER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"role_create"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PG_ADMIN_PASS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_pg_password"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python app.py
&lt;span class="c"&gt;# Listening on http://localhost:5000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Example Calls
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Create a service account for a VM running on port 8080:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-u&lt;/span&gt; pgadmin:Ch@ngeMe2024! &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"http://localhost:5000/api/v1/users/app?env=dev&amp;amp;database=myapp_dev&amp;amp;servername=web01&amp;amp;port=8080"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creates user &lt;code&gt;web01_8080&lt;/code&gt; with &lt;code&gt;CONNECTION LIMIT 200&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a Kubernetes workload account:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-u&lt;/span&gt; pgadmin:Ch@ngeMe2024! &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"http://localhost:5000/api/v1/users/app-k8s?env=dev&amp;amp;database=myapp_dev&amp;amp;env_prefix=dv&amp;amp;farmname=gearservice"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Creates user &lt;code&gt;dv_gearservice&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reset a forgotten password:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-u&lt;/span&gt; pgadmin:Ch@ngeMe2024! &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"http://localhost:5000/api/v1/users/reset?env=prod&amp;amp;database=myapp_prod&amp;amp;username=analyst01"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generates a new &lt;code&gt;secrets.token_urlsafe(16)&lt;/code&gt; password and applies it immediately. Returns the new password in the response.&lt;/p&gt;




&lt;h2&gt;
  
  
  Idempotency
&lt;/h2&gt;

&lt;p&gt;All create endpoints check &lt;code&gt;pg_catalog.pg_roles&lt;/code&gt; before issuing &lt;code&gt;CREATE USER&lt;/code&gt;. If the role already exists, the API returns &lt;code&gt;"status": "user already exists"&lt;/code&gt; and exits cleanly. Safe to call from automation without worrying about duplicate creation errors.&lt;/p&gt;




&lt;h2&gt;
  
  
  Audit Log
&lt;/h2&gt;

&lt;p&gt;Every operation (create, reset, search_path change) is written to a &lt;code&gt;audit_log&lt;/code&gt; table in the same &lt;code&gt;pg_registry.db&lt;/code&gt; SQLite file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sqlite3 pg_registry.db &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"SELECT * FROM audit_log ORDER BY performed_at DESC LIMIT 10;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get a timestamped record of who called what, on which database, with what outcome. Useful for access reviews and incident investigations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Notification Hooks
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;notifications.py&lt;/code&gt; ships with ready-to-uncomment stubs for Webex Teams, Slack, and email. Wire in your webhook URL or SMTP config, then call &lt;code&gt;send_notification()&lt;/code&gt; from any endpoint to push alerts to your team when accounts are created or passwords reset.&lt;/p&gt;




&lt;h2&gt;
  
  
  Security Notes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Intended for &lt;strong&gt;internal / intranet&lt;/strong&gt; use — put it behind a VPN or API gateway, not on the open internet.&lt;/li&gt;
&lt;li&gt;For internet-facing deployments, swap HTTP Basic Auth for JWT or an API key header.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pg_registry.db&lt;/code&gt; contains your real hostnames — it's in &lt;code&gt;.gitignore&lt;/code&gt; and should stay off version control.&lt;/li&gt;
&lt;li&gt;The DBA password never appears in URLs, query strings, or logs.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What I'd Add Next
&lt;/h2&gt;

&lt;p&gt;A few things on the roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GRANT/REVOKE endpoints&lt;/strong&gt; — privilege management beyond account creation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema-level grants&lt;/strong&gt; — &lt;code&gt;GRANT SELECT ON ALL TABLES IN SCHEMA&lt;/code&gt; patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token-based auth&lt;/strong&gt; — drop-in replacement for Basic Auth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker packaging&lt;/strong&gt; — &lt;code&gt;docker run&lt;/code&gt; for teams that don't want to manage Python deps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured audit export&lt;/strong&gt; — JSON or CSV export of the audit log for compliance workflows&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;If your team provisions PostgreSQL users more than a few times a month, wrapping it in an HTTP interface pays for itself quickly. The audit trail alone is worth it.&lt;/p&gt;

&lt;p&gt;The full source is at &lt;a href="https://github.com/pcraavi/PostgreSQL-user-creation-API" rel="noopener noreferrer"&gt;github.com/pcraavi/PostgreSQL-user-creation-API&lt;/a&gt;. It's MIT-licensed — fork it, adapt the user types to your org's naming conventions, and wire in your notification channels.&lt;/p&gt;

&lt;p&gt;Questions or suggestions? Drop them in the comments or open an issue on GitHub.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>python</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>PostgreSQL VACUUM Tuning: A Technical Deep Dive Into Autovacuum Configuration</title>
      <dc:creator>Pranay Ravi</dc:creator>
      <pubDate>Sun, 24 May 2026 09:37:29 +0000</pubDate>
      <link>https://dev.to/pranay_raavi/postgresql-vacuum-tuning-a-technical-deep-dive-into-autovacuum-configuration-3gf0</link>
      <guid>https://dev.to/pranay_raavi/postgresql-vacuum-tuning-a-technical-deep-dive-into-autovacuum-configuration-3gf0</guid>
      <description>&lt;p&gt;&lt;strong&gt;Author's Note:&lt;/strong&gt; This article documents a production incident investigation and the technical findings that emerged from returning to foundational documentation. The fix was implemented by a colleague; this article captures the learning journey through proper documentation review.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Incident: High CPU Due to Autovacuum Contention
&lt;/h2&gt;

&lt;p&gt;A production Aurora PostgreSQL cluster experienced sustained CPU utilization between 85-90% over a 3-4 hour window. CloudWatch Performance Insights identified the primary wait event as CPU (not I/O or lock contention), and the top consuming operation was autovacuum VACUUM processes running on two large tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimpgkzayofraehnyhe21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimpgkzayofraehnyhe21.png" alt="CloudWatch CPU Spike Dashboard" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observed State:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table A (593 GB, 623M rows): 124 million dead tuples (16.6% dead ratio)&lt;/li&gt;
&lt;li&gt;Table B (465M rows): 74 million dead tuples (13.7% dead ratio)&lt;/li&gt;
&lt;li&gt;Autovacuum workers: 2 running concurrently&lt;/li&gt;
&lt;li&gt;CPU utilization: 85-90%&lt;/li&gt;
&lt;li&gt;Autovacuum frequency: Single massive vacuum operation per 4-6 hours&lt;/li&gt;
&lt;li&gt;Status of tables: Never manually vacuumed since instance creation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Root Cause:&lt;/strong&gt; Autovacuum thresholds were configured at system defaults, which were inappropriate for high-churn tables receiving bulk updates every 2-3 hours via scheduled data loader processes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding MVCC: PostgreSQL vs. Oracle's Undo Architecture
&lt;/h2&gt;

&lt;p&gt;Before tuning can be effective, the fundamental difference in how PostgreSQL and Oracle manage concurrent access must be understood.&lt;/p&gt;

&lt;h3&gt;
  
  
  Oracle's Approach: Automatic Undo Retention Management (AUM)
&lt;/h3&gt;

&lt;p&gt;In Oracle, Multi-Version Concurrency Control (MVCC) is implemented via &lt;strong&gt;undo tablespace segments&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Update operation:&lt;/strong&gt; When a row is updated, the old version is written to undo tablespace (not to the table itself).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Undo retention:&lt;/strong&gt; Oracle's Automatic Undo Retention Management (AUM) manages undo tablespace as a circular buffer. Undo extents are recycled automatically based on the &lt;code&gt;UNDO_RETENTION&lt;/code&gt; parameter and available tablespace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Space reclamation:&lt;/strong&gt; Undo space is automatically freed when either (a) the retention period expires, or (b) tablespace pressure forces rollback of older undo data. The DBA's responsibility is limited to allocating sufficient undo tablespace upfront.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Key characteristic:&lt;/strong&gt; The DBA tunes this mechanism once (setting retention period and tablespace size) and then relies on Oracle's background processes to manage undo lifecycle automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[IMAGE 2: Oracle vs PostgreSQL MVCC Architecture Diagram]&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7efyd3nzn1m277c7hj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7efyd3nzn1m277c7hj4.png" alt=" " width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"&lt;/p&gt;

&lt;h3&gt;
  
  
  PostgreSQL's Approach: Heap-Based MVCC with Explicit VACUUM
&lt;/h3&gt;

&lt;p&gt;In PostgreSQL, MVCC is implemented at the &lt;strong&gt;table (heap) level&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Update operation:&lt;/strong&gt; When a row is updated, a new version of the row is inserted into the same table. The old version is marked as "dead" but remains physically in the table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Space reclamation:&lt;/strong&gt; VACUUM must scan the table, identify dead tuples, and mark their space as reusable. Dead tuples are not automatically removed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autovacuum trigger:&lt;/strong&gt; Autovacuum is a background process that decides &lt;em&gt;when&lt;/em&gt; to vacuum based on tunable thresholds. Unlike Oracle's automatic undo recycling, PostgreSQL requires explicit configuration of when vacuum should trigger.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Key characteristic:&lt;/strong&gt; The DBA must actively tune VACUUM parameters based on table churn patterns. There is no "set it and forget it" mechanism comparable to Oracle's AUM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implication:&lt;/strong&gt; On Oracle, a table with high UPDATE volume simply generates more undo, which Oracle's AUM handles. On PostgreSQL, the same UPDATE volume generates more dead tuples, and if autovacuum thresholds are too conservative, dead tuples accumulate until autovacuum finally triggers—often at high volume, causing CPU spikes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Vacuum Threshold Formula: The Mathematical Foundation
&lt;/h2&gt;

&lt;p&gt;When autovacuum decides to run, it evaluates the following formula for each table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VACUUM_TRIGGER_THRESHOLD = autovacuum_vacuum_threshold + 
                           (autovacuum_vacuum_scale_factor × n_live_tup)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;autovacuum_vacuum_threshold&lt;/code&gt;: Absolute minimum dead tuples (default: 50)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;autovacuum_vacuum_scale_factor&lt;/code&gt;: Percentage of table size (default: 0.1 = 10%)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;n_live_tup&lt;/code&gt;: Current number of live tuples in the table&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pre-Investigation Configuration
&lt;/h3&gt;

&lt;p&gt;System defaults on the cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;autovacuum_vacuum_threshold&lt;/code&gt; = 50&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;autovacuum_vacuum_scale_factor&lt;/code&gt; = 0.1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Calculation for Table A (725M rows):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;THRESHOLD = 50 + (0.1 × 725,000,000)
          = 50 + 72,500,000
          = 72.5 million dead tuples
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpretation:&lt;/strong&gt; Autovacuum would not trigger on Table A until 72.5 million dead tuples accumulated. This is the critical misconfiguration.&lt;/p&gt;

&lt;p&gt;For comparison, Table A actually accumulated 124 million dead tuples before the vacuum completed—well beyond this threshold, indicating autovacuum had already triggered much earlier in the lifecycle but was running continuously against an accumulating workload.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Root Cause: Loader Pattern and Threshold Mismatch
&lt;/h2&gt;

&lt;p&gt;The data loader process (running every 2-3 hours with 4 parallel workers) updated rows using a COALESCE merge pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;table_a&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; 
  &lt;span class="n"&gt;column_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;column_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;column_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;staging_table&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern creates a dead tuple for every row touched, regardless of whether values actually changed. Over a 2-3 hour window with millions of rows, dead tuple generation rate significantly exceeded autovacuum's ability to reclaim space given the high threshold values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The collision:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High dead tuple generation rate from bulk UPDATE operations&lt;/li&gt;
&lt;li&gt;Autovacuum thresholds calibrated for the default use case (moderate churn on tables of typical size)&lt;/li&gt;
&lt;li&gt;No table-level overrides to account for this specific workload pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: Dead tuples accumulated to 16.6% of table size before stabilizing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Investigation: Diagnostic Queries and Findings
&lt;/h2&gt;

&lt;p&gt;Three queries provided diagnostic clarity:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query 1: Current dead tuple status&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dead_ratio_pct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_autovacuum&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'table_a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'table_b'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table A: 623M live, 124M dead (16.6%)&lt;/li&gt;
&lt;li&gt;Table B: 465M live, 74M dead (13.7%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Query 2: Active autovacuum processes&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;query_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EPOCH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;query_start&lt;/span&gt;&lt;span class="p"&gt;))::&lt;/span&gt;&lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;runtime_seconds&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%VACUUM%'&lt;/span&gt; 
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%pg_stat%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: Two autovacuum workers running concurrently, one on each table, both active for 55+ minutes and 3+ minutes respectively at time of investigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query 3: Cumulative churn analysis&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;n_tup_upd&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;n_tup_del&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_modifications&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_tup_upd&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;n_tup_del&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;churn_ratio_pct&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'table_a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'table_b'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table A: 16 billion total modifications on 725M rows (2203% cumulative churn)&lt;/li&gt;
&lt;li&gt;Table B: 11.9 billion total modifications on 465M rows (2566% cumulative churn)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interpretation:&lt;/strong&gt; These represent lifetime cumulative statistics since instance creation. The 2200%+ ratio indicates every row has been touched approximately 22 times on average over the instance lifetime.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Configuration Adjustment: Formula-Based Approach
&lt;/h2&gt;

&lt;p&gt;Rather than implementing ad-hoc changes, a formula-based approach was used to calculate appropriate thresholds.&lt;/p&gt;

&lt;p&gt;The colleague conducting the fix referenced textbook formulas for maintenance memory allocation and threshold calculation, confirming that the system defaults were inappropriate for tables with this churn profile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Applied changes (table-level only):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;table_a&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_vacuum_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_vacuum_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_analyze_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;005&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_analyze_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_vacuum_cost_delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_vacuum_cost_limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;table_b&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_vacuum_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_vacuum_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_analyze_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;005&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_analyze_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_vacuum_cost_delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_vacuum_cost_limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rationale for each parameter:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale factor (0.1 → 0.01):&lt;/strong&gt; Lowers the percentage-based trigger from 10% to 1% of table size, causing autovacuum to start at 7.25 million dead tuples instead of 72.5 million. This increases vacuum frequency but reduces per-operation work volume.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Threshold (50 → 10000):&lt;/strong&gt; Sets an explicit minimum to prevent excessive vacuum triggering on very small tables, but still allows reasonable triggering on large tables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analyze scale factor (0.05 → 0.005):&lt;/strong&gt; ANALYZE (which updates table statistics for the query planner) triggers more frequently, preventing stale statistics during high-churn periods.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost parameters:&lt;/strong&gt; &lt;code&gt;cost_delay = 2ms&lt;/code&gt; and &lt;code&gt;cost_limit = 5000&lt;/code&gt; distribute vacuum work into smaller increments with longer pauses, reducing per-operation CPU spike while still completing the work.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;New threshold calculation for Table A:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;THRESHOLD = 10,000 + (0.01 × 725,000,000)
          = 10,000 + 7,250,000
          = 7.26 million dead tuples
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This represents a &lt;strong&gt;10× reduction&lt;/strong&gt; in the threshold value, causing autovacuum to trigger 10× more frequently but with proportionally smaller work loads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbyuky1v3222vyg5o3i1t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbyuky1v3222vyg5o3i1t.png" alt="Vacuum Threshold Comparison - Before vs After" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Critical Design Decision: Table-Level Configuration, Not Cluster-Level
&lt;/h2&gt;

&lt;p&gt;A deliberate choice was made to apply all tuning parameters &lt;strong&gt;at the table level only&lt;/strong&gt;, not at the system/cluster level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why table-level configuration is necessary:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Heterogeneous workloads:&lt;/strong&gt; Not all tables in a cluster have the same churn pattern. Table A and Table B are high-churn bulk-update targets. Other tables in the same cluster may be mostly static or read-heavy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Preventing cascade effects:&lt;/strong&gt; A cluster-wide reduction in &lt;code&gt;autovacuum_vacuum_scale_factor&lt;/code&gt; would cause autovacuum to trigger more frequently on &lt;em&gt;all&lt;/em&gt; tables, including those with minimal churn. This could result in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unnecessary vacuum operations consuming CPU and I/O&lt;/li&gt;
&lt;li&gt;More frequent ANALYZE operations on stable tables&lt;/li&gt;
&lt;li&gt;Increased lock contention if many vacuums run concurrently&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Isolation of risk:&lt;/strong&gt; By tuning only the affected tables, the change is isolated to the problem source and does not introduce unexpected side effects on other database objects or applications.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is a deliberate engineering discipline: tune at the smallest scope that solves the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results: Before and After Metrics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before Configuration (April 13, 09:51 AM):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Table A&lt;/th&gt;
&lt;th&gt;Table B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dead tuples&lt;/td&gt;
&lt;td&gt;124M&lt;/td&gt;
&lt;td&gt;74M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dead ratio&lt;/td&gt;
&lt;td&gt;16.6%&lt;/td&gt;
&lt;td&gt;13.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Last autovacuum&lt;/td&gt;
&lt;td&gt;N/A (first vacuum)&lt;/td&gt;
&lt;td&gt;N/A (first vacuum)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU utilization&lt;/td&gt;
&lt;td&gt;85-90%&lt;/td&gt;
&lt;td&gt;85-90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vacuum frequency&lt;/td&gt;
&lt;td&gt;1 per 4-6 hours&lt;/td&gt;
&lt;td&gt;1 per 4-6 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;After Configuration (April 13, 12:07 PM - 13:45 PM):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Table A&lt;/th&gt;
&lt;th&gt;Table B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dead tuples&lt;/td&gt;
&lt;td&gt;0 (vacuum completed)&lt;/td&gt;
&lt;td&gt;21M (declining)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dead ratio&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;7.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Last autovacuum&lt;/td&gt;
&lt;td&gt;2026-04-13 12:07:08&lt;/td&gt;
&lt;td&gt;2026-04-13 13:45:44&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU utilization&lt;/td&gt;
&lt;td&gt;30-40%&lt;/td&gt;
&lt;td&gt;30-40%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vacuum frequency&lt;/td&gt;
&lt;td&gt;Multiple per loader cycle&lt;/td&gt;
&lt;td&gt;Multiple per loader cycle&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The vacuum operations completed naturally without manual intervention. Dead tuple levels stabilized well below the new thresholds. CPU alerts cleared.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2l5cg0agi01eaqacbinn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2l5cg0agi01eaqacbinn.png" alt="Incident Timeline and Recovery " width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring: Proactive Visibility
&lt;/h2&gt;

&lt;p&gt;To prevent recurrence, a monitoring table was implemented to capture dead tuple trends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;pg_table_stats_history&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;captured_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="k"&gt;table_name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;dead_ratio_percent&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;n_tup_upd&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;n_tup_del&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_autovacuum&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;cron&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'capture_table_stats'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'*/30 * * * *'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
  &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;pg_table_stats_history&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;n_tup_upd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_tup_del&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_autovacuum&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'table_a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'table_b'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This captures a snapshot every 30 minutes, allowing observation of daily patterns: "During loader window (12:00-14:00), dead tuples climb to 5-8M, then autovacuum brings them back to 0.5M."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9d9n1nw14ci779wl1b0c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9d9n1nw14ci779wl1b0c.png" alt="Dead Tuple Trend Monitoring Over 24 Hours" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Concepts: What the Investigation Revealed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Autovacuum Cost Mechanism
&lt;/h3&gt;

&lt;p&gt;VACUUM has two cost-control parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;autovacuum_vacuum_cost_limit&lt;/code&gt;&lt;/strong&gt;: Units of work (reading a page = 1 unit, writing a page = 20 units) allowed before autovacuum pauses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;autovacuum_vacuum_cost_delay&lt;/code&gt;&lt;/strong&gt;: Milliseconds to pause when cost limit is reached&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lower cost_limit + higher delay = slower vacuum, less CPU spike&lt;br&gt;&lt;br&gt;
Higher cost_limit + lower delay = faster vacuum, more CPU spike&lt;/p&gt;

&lt;p&gt;The pre-incident configuration had &lt;code&gt;cost_delay = 5ms&lt;/code&gt; and &lt;code&gt;cost_limit = 1800&lt;/code&gt; (already aggressive). The post-incident configuration used &lt;code&gt;cost_delay = 2ms&lt;/code&gt; and &lt;code&gt;cost_limit = 5000&lt;/code&gt;, allocating higher work budgets but still enforcing frequent pauses for distribution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitmiimix23toanwn1r5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitmiimix23toanwn1r5a.png" alt="Autovacuum Cost Mechanism - Work Distribution Pattern" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When autovacuum triggers, it scans the entire table. If dead tuples are being created faster than they're being reclaimed (because the loader is still running), the vacuum operation takes longer and consumes sustained CPU.&lt;/p&gt;

&lt;p&gt;In the incident:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Table A vacuum started at 09:51:54 and completed at 12:07:08 (2h 15m)&lt;/li&gt;
&lt;li&gt;Table B vacuum started at 08:59:39 and completed at 13:45:44 (4h 46m)&lt;/li&gt;
&lt;li&gt;During this window, the loader was also running (starting 12:37:27), creating additional dead tuples&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The concurrent activity created contention for CPU and I/O resources, explaining the 85-90% CPU utilization.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Wraparound Protection: A Critical Background Mechanism
&lt;/h3&gt;

&lt;p&gt;While not the active problem in this incident, understanding wraparound protection is essential context:&lt;/p&gt;

&lt;p&gt;PostgreSQL uses 4-byte transaction IDs (2^32 = 4.3 billion possible values). For MVCC visibility comparison to work correctly, only a 2-billion value range is usable at any given time. If autovacuum falls so far behind that unfrozen tuples approach the 2-billion transaction boundary, PostgreSQL will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log warnings at 200M transactions remaining&lt;/li&gt;
&lt;li&gt;Shift to READ-ONLY mode at 1M transactions remaining
&lt;/li&gt;
&lt;li&gt;Force shutdown if the limit is reached&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is not a theoretical concern—it's a safety mechanism that has forced database shutdowns in under-monitored systems. Autovacuum must keep up, or the database fails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknpunbwv37i29rs3hwbl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknpunbwv37i29rs3hwbl.png" alt="Transaction ID Wraparound Protection Lifecycle" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Documentation Recovery: Why AI Alone Is Insufficient
&lt;/h2&gt;

&lt;p&gt;When the incident was first encountered, AI-based diagnostic tools provided generic suggestions: "lower cost_delay," "increase cost_limit," "check maintenance_work_mem."&lt;/p&gt;

&lt;p&gt;These suggestions were not wrong, but they were not calibrated to the specific situation. The threshold formula, the rationale for scale factor adjustment, and the risks of cluster-level configuration changes were not apparent from AI output alone.&lt;/p&gt;

&lt;p&gt;The fix required returning to foundational documentation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL VACUUM documentation&lt;/strong&gt; (official): Explained the cost mechanism and threshold formula&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Aurora PostgreSQL tuning guide&lt;/strong&gt;: Provided context-specific guidance for managed Aurora instances&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Textbook references on MVCC&lt;/strong&gt;: Clarified why dead tuples accumulate and how autovacuum prevents wraparound&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The colleague's insistence on reading the documentation forced a deeper investigation that revealed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The mathematical formula driving autovacuum trigger decisions&lt;/li&gt;
&lt;li&gt;The specific interaction between bulk update workloads and default thresholds&lt;/li&gt;
&lt;li&gt;The risks and benefits of table-level vs. cluster-level configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; Base concepts—MVCC, dead tuples, autovacuum thresholds, wraparound protection—are non-negotiable foundations. Understanding these requires reading documentation. Once the foundations are understood, AI tools can accelerate diagnosis and suggest configurations. But without foundations, suggestions are just knobs to turn without understanding consequences.&lt;/p&gt;

&lt;p&gt;Combining both—foundational reading with AI-assisted diagnosis—yields better outcomes than either alone.&lt;/p&gt;


&lt;h2&gt;
  
  
  Diagnostic Queries for Future Incidents
&lt;/h2&gt;

&lt;p&gt;When autovacuum CPU spikes occur, these queries provide immediate visibility:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dead tuple status:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dead_pct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_autovacuum&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Active autovacuum activity:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EPOCH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;query_start&lt;/span&gt;&lt;span class="p"&gt;))::&lt;/span&gt;&lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;runtime_sec&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%VACUUM%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Churn rate analysis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;n_tup_upd&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;n_tup_del&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_mods&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_tup_upd&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;n_tup_del&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;churn_pct&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Transaction age (wraparound risk):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;datname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datfrozenxid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;txid_age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;setting&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_settings&lt;/span&gt; 
   &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'autovacuum_freeze_max_age'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datfrozenxid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;txid_remaining&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_database&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;datallowconn&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;txid_age&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This incident demonstrated that PostgreSQL's VACUUM mechanism requires active tuning based on workload patterns. Unlike Oracle's Automatic Undo Retention Management—which automatically recycles undo tablespace based on retention policies—PostgreSQL requires explicit configuration of autovacuum thresholds calibrated to specific table churn patterns.&lt;/p&gt;

&lt;p&gt;The resolution came not from tweaking random parameters, but from understanding the mathematical formulas governing autovacuum behavior and applying them methodically at table scope.&lt;/p&gt;

&lt;p&gt;The broader lesson is methodological: when facing production database performance issues, foundational understanding of mechanisms (documented in official sources) combined with diagnostic data yields better outcomes than parameter suggestions alone.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL Official: &lt;a href="https://www.postgresql.org/docs/current/sql-vacuum.html" rel="noopener noreferrer"&gt;VACUUM&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PostgreSQL Official: &lt;a href="https://www.postgresql.org/docs/current/runtime-config-autovacuum.html" rel="noopener noreferrer"&gt;Autovacuum&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS Database Blog: &lt;a href="https://aws.amazon.com/blogs/database/understanding-autovacuum-in-amazon-rds-for-postgresql-environments/" rel="noopener noreferrer"&gt;Understanding autovacuum in Amazon RDS for PostgreSQL&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Percona: &lt;a href="https://www.percona.com/blog/overcoming-vacuum-wraparound/" rel="noopener noreferrer"&gt;Overcoming VACUUM WRAPAROUND&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CYBERTEC: &lt;a href="https://www.cybertec-postgresql.com/en/autovacuum-wraparound-protection-in-postgresql/" rel="noopener noreferrer"&gt;Autovacuum wraparound protection&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
    </item>
    <item>
      <title>From Zombie Connections to XactSync: A Post-Mortem on Aurora Postgres CPU Spikes</title>
      <dc:creator>Pranay Ravi</dc:creator>
      <pubDate>Sat, 23 May 2026 20:54:26 +0000</pubDate>
      <link>https://dev.to/pranay_raavi/from-zombie-connections-to-xactsync-a-post-mortem-on-aurora-postgres-cpu-spikes-55ek</link>
      <guid>https://dev.to/pranay_raavi/from-zombie-connections-to-xactsync-a-post-mortem-on-aurora-postgres-cpu-spikes-55ek</guid>
      <description>&lt;p&gt;The alert fires at 1:28 PM. By the time you open CloudWatch, CPU is already back to normal. No application errors. No user complaints. Nothing obviously broken.&lt;/p&gt;

&lt;p&gt;Do you close the ticket and move on, or dig in?&lt;/p&gt;

&lt;p&gt;This post is about why you should always dig in — and exactly how we traced a "self-resolving" CPU spike on Aurora PostgreSQL 16 all the way back to a mismatch between pipeline commit logic and the database storage engine.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Engine:&lt;/strong&gt; Aurora PostgreSQL 16.1 on &lt;code&gt;db.x2g.xlarge&lt;/code&gt; (4 vCPUs, 128GB RAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload:&lt;/strong&gt; Kafka stream-based pipeline inserting ~365 rows/sec average into a daily-partitioned table, with burst peaks near 825/sec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication:&lt;/strong&gt; Oracle GoldenGate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; AWS CloudWatch Database Insights + AppDynamics Database Agent&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What the Dashboard Showed
&lt;/h2&gt;

&lt;p&gt;The first stop is always &lt;strong&gt;CloudWatch&lt;/strong&gt; $\rightarrow$ &lt;strong&gt;Database Insights&lt;/strong&gt; $\rightarrow$ &lt;strong&gt;Database Load&lt;/strong&gt;. This chart shows Average Active Sessions (AAS) broken down by wait type:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Color&lt;/th&gt;
&lt;th&gt;Wait Type&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🟢 &lt;strong&gt;Green&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CPU&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Active computation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟠 &lt;strong&gt;Orange&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;IO:AuroraStorageLogAllocate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;WAL/storage log writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟣 &lt;strong&gt;Purple&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;LWLock:BufferContent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Buffer manager contention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🟤 &lt;strong&gt;Brown&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Timeout:SpinDelay&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Spinlock spin waits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🩷 &lt;strong&gt;Pink&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;LWLock:WALInsert&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;WAL insert lock contention&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Our spike was &lt;strong&gt;pure green&lt;/strong&gt;—CPU only, no I/O or lock contention. The &lt;em&gt;Top SQL&lt;/em&gt; tab showed a high-volume &lt;code&gt;INSERT&lt;/code&gt; into a partitioned event log table running at ~825 calls/sec with 0.17ms average latency. Fast and normal individually. The volume was the question.&lt;/p&gt;

&lt;p&gt;However, the &lt;em&gt;Database Telemetry&lt;/em&gt; tab revealed two ominous signals that often catch a developer's attention:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Max time idle in transaction:&lt;/strong&gt; Climbing linearly for 3.2 hours straight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vacuum: Max used transaction IDs:&lt;/strong&gt; On a steady linear climb.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When you see both of those linear trends on a single chart, it usually points to a common suspect in relational databases: &lt;em&gt;something has been holding an open transaction for a very long time, threatening your autovacuum health.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Background: The Core Mechanics of Postgres MVCC
&lt;/h2&gt;

&lt;p&gt;To understand why long-running transactions are so closely tracked, it helps to look at how PostgreSQL manages Multi-Version Concurrency Control (MVCC).&lt;/p&gt;

&lt;p&gt;Unlike database engines that keep old record data in a separate utility tablespace (like Oracle's Undo Tablespace), PostgreSQL relies on a &lt;strong&gt;Heap-Based Approach&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Postgres Heap Reality:&lt;/strong&gt; There is no separate undo space. When a row is updated or deleted, the old version (called a "dead tuple") remains physically stored inside the exact same table page. It stays there until the background &lt;code&gt;VACUUM&lt;/code&gt; process scans the table and marks that dead space as reusable for future inserts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If an application leaves a transaction open, it anchors the &lt;code&gt;xmin&lt;/code&gt; (the baseline transaction ID floor for that snapshot). &lt;strong&gt;&lt;code&gt;VACUUM&lt;/code&gt; cannot safely clean up any dead tuples that were generated after the oldest active transaction's `xmin&lt;/strong&gt;`. This can trigger a problematic domino effect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session opens → acquires xmin snapshot
       ↓
Session goes idle (no explicit commits or rollbacks)
       ↓
Dead tuples accumulate → VACUUM cannot clear them out
       ↓
Autovacuum loops continuously, wasting CPU scanning bloated tables
       ↓
Transient CPU spikes alert operations

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because autovacuum periodically wakes up to evaluate tables, the resulting CPU spikes look completely intermittent. The alert clears, engineers close the ticket as a glitch, and the underlying architectural friction persists.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Find the Stuck Sessions
&lt;/h2&gt;

&lt;p&gt;To hunt down the culprit, we ran a quick diagnostic query against &lt;code&gt;pg_stat_activity&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;usename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;application_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;client_addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;state_change&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;idle_duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;xact_start&lt;/span&gt;   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;transaction_age&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'idle in transaction'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;xact_start&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The results:&lt;/strong&gt; 35 sessions, all originating from the same connection IP, belonging to the replication user. The oldest transaction had been open for &lt;strong&gt;133 days&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The queries they were idling on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pg_settings&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'ansi_force_foreign_key_checks'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pg_type&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;typtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'e'&lt;/span&gt; &lt;span class="k"&gt;and&lt;/span&gt; &lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(...);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are classic &lt;strong&gt;compatibility probe queries&lt;/strong&gt; issued during connection initialization by integration tooling like GoldenGate. The integration platform had been connecting, executing its basic discovery probes, and abandoning the sessions without explicitly closing them every time it restarted or reconnected over a four-month period.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Verify the Vacuum Blocker (The Twist)
&lt;/h2&gt;

&lt;p&gt;Finding &lt;code&gt;idle in transaction&lt;/code&gt; sessions is a major red flag, but we had to verify if they were genuinely blocking data cleanup. Not all idle sessions hold an active database snapshot.&lt;/p&gt;

&lt;p&gt;We executed the source-of-truth query to find actual vacuum blockers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;usename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;backend_xmin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backend_xmin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;xmin_age_in_transactions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;xact_start&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;session_age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;state&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;backend_xmin&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backend_xmin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⚠️ &lt;strong&gt;This is where our initial MVCC-bloat hypothesis collapsed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The zombie integration sessions did &lt;em&gt;not&lt;/em&gt; appear in this list; their &lt;code&gt;backend_xmin&lt;/code&gt; fields were entirely &lt;code&gt;NULL&lt;/code&gt;. Because they had only executed basic, read-only catalog queries during initialization, they never acquired a persistent data snapshot transaction ID. They were &lt;em&gt;not&lt;/em&gt; blocking vacuum. The actual oldest &lt;code&gt;xmin&lt;/code&gt; holders were normal application sessions with transaction ages under 3,000—completely standard for a busy transaction engine.&lt;/p&gt;

&lt;p&gt;While these zombie connections were a clear operational hazard (consuming connection slots and wasting system memory), they weren't driving our CPU alert.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Check the Partitions
&lt;/h2&gt;

&lt;p&gt;With zombie connections ruled out as vacuum blockers, we turned our focus to the physical table health. Because our target event log table is a &lt;strong&gt;partitioned table&lt;/strong&gt;, statistics are tracked natively at the individual child partition level, not the parent logical table. We targeted the table prefix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dead_ratio_pct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_autovacuum&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'event_log%'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; &lt;code&gt;0&lt;/code&gt; dead tuples across all active partitions.&lt;/p&gt;

&lt;p&gt;The partitioned strategy was working flawlessly. The application used an insert-only pattern on the active daily partition, and old partitions were dropped entirely rather than vacuumed. This meant &lt;code&gt;VACUUM&lt;/code&gt; bloat was completely ruled out.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Finding the Real Culprit via Wait States
&lt;/h2&gt;

&lt;p&gt;With MVCC bloat off the table, we pivoted to the APM wait state drill-down for the intensive &lt;code&gt;INSERT&lt;/code&gt; query during the exact minutes of the CPU spike:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ClientRead&lt;/code&gt; (54%):&lt;/strong&gt; The database engine spent more than half its time completely idle, waiting for the application client to send the next query over the network wire.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Active&lt;/code&gt; (24.7%):&lt;/strong&gt; Time spent physically executing the raw insert statements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;XactSync&lt;/code&gt; (17.9%):&lt;/strong&gt; The engine waiting for the Write-Ahead Log (WAL) to be flushed and synchronized to durable storage upon a &lt;code&gt;COMMIT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AuroraStorageLogAllocate&lt;/code&gt; (2.4%):&lt;/strong&gt; Storage block layer allocation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two metrics tell the final story here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;XactSync&lt;/code&gt; at 17.9%:&lt;/strong&gt; This wait state occurs when an explicit &lt;code&gt;COMMIT&lt;/code&gt; forces PostgreSQL to flush WAL records. At a peak volume of 825 inserts per second, the application was issuing &lt;strong&gt;825 individual commits per second&lt;/strong&gt;. This meant 825 distinct sync operations forcing the storage layer to acknowledge writes sequentially.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ClientRead&lt;/code&gt; at 54%:&lt;/strong&gt; This is the classic signature of single-row, one-at-a-time application database calls. The application loop was executing: &lt;em&gt;Send row $\rightarrow$ Execute $\rightarrow$ Commit $\rightarrow$ Wait for network Ack $\rightarrow$ Send next row.&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The shape of the CPU spike closely mirrored a &lt;strong&gt;backlog flush pattern&lt;/strong&gt;. Messages would pile up briefly in the upstream Kafka topic, and then the consumer would drain them all at once in a massive burst of single-row, high-frequency synchronous commits.&lt;/p&gt;




&lt;h2&gt;
  
  
  Root Cause Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hypothesis&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;th&gt;Technical Proof&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dead Tuple Accumulation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ &lt;strong&gt;Ruled Out&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pg_stat_user_tables&lt;/code&gt; showed a 0% dead tuple ratio across all database partitions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Zombie Connections Blocking Vacuum&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ &lt;strong&gt;Ruled Out&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;backend_xmin&lt;/code&gt; was explicitly &lt;code&gt;NULL&lt;/code&gt; for all zombie replication sessions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Burst Insert Flush + Per-Row Commits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Root Cause&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High &lt;code&gt;XactSync&lt;/code&gt; (17.9%) paired with a massive &lt;code&gt;ClientRead&lt;/code&gt; (54%) network wait pattern.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Action Plan
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fix 1: Terminate and Block the Zombie Connections
&lt;/h3&gt;

&lt;p&gt;Even though they didn't block vacuum in this specific instance, leaving 35 abandoned sessions active for 133 days is risky. They leak connection resources and backend memory.&lt;/p&gt;

&lt;p&gt;First, we safely terminated the dangling backends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_terminate_backend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;xact_start&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;xact_age&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;usename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'repl_agent'&lt;/span&gt; 
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'idle in transaction'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we implemented an automated defense line. We updated the cluster configuration parameters to enforce a strict timeout globally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;idle_in_transaction_session_timeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;600000 # 10 minutes in milliseconds&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If any session sits idle inside an uncommitted transaction block for more than 10 minutes, Postgres will now automatically drop the connection and free up system resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix 2: Implement Micro-Batching in the Consumer Pipeline
&lt;/h3&gt;

&lt;p&gt;The true fix for the CPU spike required changing how the application interacted with the storage engine. Instead of committing every single row individually, we modified the JDBC producer configuration to use a batch size of &lt;code&gt;500&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Old Pattern (825 WAL Syncs/sec):&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="k"&gt;COMMIT&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;WAL&lt;/span&gt; &lt;span class="n"&gt;fsync&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Wait&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="k"&gt;COMMIT&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;WAL&lt;/span&gt; &lt;span class="n"&gt;fsync&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Wait&lt;/span&gt;

&lt;span class="c1"&gt;-- New Pattern (1.7 WAL Syncs/sec):&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="k"&gt;COMMIT&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;WAL&lt;/span&gt; &lt;span class="n"&gt;fsync&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before Batching&lt;/th&gt;
&lt;th&gt;After Batching (Est.)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Commits/sec (Burst)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~825&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;WAL Fsyncs/sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~825&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.7&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;XactSync&lt;/code&gt; Wait Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;17.9%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 1.0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transient CPU Spikes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Frequent&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Eliminated&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Essential Checklist Queries for Triage
&lt;/h2&gt;

&lt;p&gt;Add these queries to your incident response runbook for analyzing database load:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Identify Long-Running Snapshot Blockers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;usename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;backend_xmin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backend_xmin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;xmin_age_transactions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;xact_start&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;session_age&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;backend_xmin&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backend_xmin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; 
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Spot Dangerous &lt;code&gt;Idle in Transaction&lt;/code&gt; Sessions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;usename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client_addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;xact_start&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;transaction_age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;state_change&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;idle_duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;LEFT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;last_query&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'idle in transaction'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;xact_start&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'5 minutes'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;xact_start&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Check Table Churn &amp;amp; Dead Tuple Ratios (Including Partitions)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dead_ratio_pct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_autovacuum&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'your_table_prefix%'&lt;/span&gt;  &lt;span class="c1"&gt;-- Handles partitioned child tables&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;Tip:&lt;/strong&gt; A &lt;code&gt;dead_ratio_pct&lt;/code&gt; &amp;gt; 10% indicates a strong need for space recovery; &amp;gt; 20% means the table is critically bloated and query execution performance will likely degrade.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  What's Next?
&lt;/h3&gt;

&lt;p&gt;If you found this breakdown helpful, keep an eye out for our next post. We will be diving into the opposite side of database optimization: &lt;strong&gt;When Autovacuum goes wrong.&lt;/strong&gt; We will walk through a separate production incident where a high-churn table choked on over &lt;strong&gt;124 million dead tuples&lt;/strong&gt; (a brutal 16.6% dead ratio) due to misconfigured autovacuum cost thresholds. Stay tuned!&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>aws</category>
      <category>database</category>
      <category>devops</category>
    </item>
    <item>
      <title>What It Actually Takes to Audit Aurora PostgreSQL on AWS</title>
      <dc:creator>Pranay Ravi</dc:creator>
      <pubDate>Thu, 21 May 2026 23:28:33 +0000</pubDate>
      <link>https://dev.to/pranay_raavi/what-it-actually-takes-to-audit-aurora-postgresql-on-aws-l7j</link>
      <guid>https://dev.to/pranay_raavi/what-it-actually-takes-to-audit-aurora-postgresql-on-aws-l7j</guid>
      <description>&lt;p&gt;Most operational infrastructure starts this way: a requirement appears before the architecture does.&lt;/p&gt;

&lt;p&gt;One day the team needed a database audit solution. Not in a planning doc — someone asked, and I had to build something. I'd owned the Oracle audit pipeline already, so I knew what the destination looked like. The question was what it would take to get there on Aurora PostgreSQL and AWS.&lt;/p&gt;

&lt;p&gt;The short answer: more than you'd expect. The longer answer is this article.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We're Actually Auditing — and Why It Matters
&lt;/h2&gt;

&lt;p&gt;The scope here is specific: &lt;strong&gt;individual human users making direct DML changes&lt;/strong&gt; (SELECT, INSERT, UPDATE, DELETE) to application tables.&lt;/p&gt;

&lt;p&gt;Application service accounts are expected to modify data — that's their function. The risk surface is direct human access. A developer connected via psql. A support engineer running an ad-hoc update. A credential that shouldn't have had access to begin with. Those are the actions that need an audit trail in a regulated environment.&lt;/p&gt;

&lt;p&gt;This distinction shapes every architectural decision: we enable pgAudit per individual user account, not cluster-wide. Application accounts are excluded entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Oracle Baseline
&lt;/h2&gt;

&lt;p&gt;Before getting into the build, it's worth framing the comparison.&lt;/p&gt;

&lt;p&gt;On Oracle, Unified Auditing (an Enterprise-tier feature) handles this with a single policy definition at the intersection of user, action, and object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;AUDIT&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;user_dml_activity&lt;/span&gt;
  &lt;span class="n"&gt;ACTIONS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;app_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
  &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="s1"&gt;'SYS_CONTEXT(&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;USERENV&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;,&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;SESSION_USER&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;) != &lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;APP_SVC_ACCOUNT&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;
  &lt;span class="n"&gt;EVALUATE&lt;/span&gt; &lt;span class="n"&gt;PER&lt;/span&gt; &lt;span class="k"&gt;SESSION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;AUDIT&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;user_dml_activity&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Records land in &lt;code&gt;UNIFIED_AUDIT_TRAIL&lt;/code&gt; — a structured, SQL-queryable audit view. From there, the data can be exported on a schedule to any downstream analytics platform such as Elasticsearch, Splunk, OpenSearch, or a dedicated audit store. Scheduled queries or alerting rules can then detect patterns like bulk deletes, after-hours access, or unexpected DDL changes and trigger notifications through PagerDuty, Opsgenie, or an observability platform via HTTP webhook. Once the audit policy is defined, the downstream detection and alerting pipeline is relatively small because the audit data is already structured and queryable.&lt;/p&gt;

&lt;p&gt;One important operational caveat with Oracle's approach: if the audit tablespace fills up, Oracle halts the database rather than silently dropping audit records — it guarantees completeness at the cost of availability. Tablespace monitoring becomes a hard operational requirement. pgAudit writing to the log stream instead of a table sidesteps this entirely: if log delivery has issues, the database keeps running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pgAudit is different.&lt;/strong&gt; There's no policy-based object/action targeting. You set a log class (&lt;code&gt;read&lt;/code&gt;, &lt;code&gt;write&lt;/code&gt;, &lt;code&gt;ddl&lt;/code&gt;, &lt;code&gt;all&lt;/code&gt;) per user. Records go to the PostgreSQL log stream — on Aurora, that means CloudWatch Logs. There's no queryable view. You're working with text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;AUDIT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;SESSION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;WRITE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;"INSERT INTO orders (customer_id, amount) VALUES ($1, $2)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works. But extracting structured, alertable signal from a log stream requires deliberate engineering.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm45v45s1kz0hyv67e6hr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm45v45s1kz0hyv67e6hr.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Aurora PostgreSQL  (pgAudit per-user)
        │
        ▼
CloudWatch Logs
        │
EventBridge  rate(5 min)
        │
        ▼
Lambda  → runs Log Insights query → filters noise → publishes count metric
        │
        ▼
CloudWatch Alarm  (count &amp;gt; 0)
        │
        ▼
SNS → incident platform + email
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Lambda exists because &lt;strong&gt;CloudWatch Alarms cannot evaluate Log Insights query results directly&lt;/strong&gt; — there's no native bridge. Lambda runs the query, counts filtered results, and publishes a standard CloudWatch metric. The alarm evaluates that metric. Lambda invocation logs also give you an independent timestamped record of every detection event — useful when compliance asks "when was this first noticed?"&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Full code, IAM policies, SNS configurations, and Terraform module:&lt;/strong&gt; &lt;a href="https://github.com/pcraavi/PostgreSQL-Audit" rel="noopener noreferrer"&gt;github.com/pcraavi/PostgreSQL-Audit&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Gotcha 1: The Silent pgAudit Setup Failure
&lt;/h2&gt;

&lt;p&gt;Enable &lt;code&gt;pgaudit&lt;/code&gt; in &lt;code&gt;shared_preload_libraries&lt;/code&gt;, reboot the cluster. Simple.&lt;/p&gt;

&lt;p&gt;Except — &lt;code&gt;shared_preload_libraries&lt;/code&gt; only loads the binary into shared memory. You also need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;pgaudit&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without it: zero audit records. Zero error messages. Nothing in CloudWatch. The cluster had been rebooted, the parameter was confirmed, logs were being exported — and there was nothing to show for it.&lt;/p&gt;

&lt;p&gt;The diagnostic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_extension&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;extname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pgaudit'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- 0 rows returned&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was the entire problem. One missing statement.&lt;/p&gt;

&lt;p&gt;Then enable per-user logging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;john_doe&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;pgaudit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="s1"&gt;'all'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Verify&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;usename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;useconfig&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_user&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;usename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'john_doe'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Gotcha 2: The Noise Problem
&lt;/h2&gt;

&lt;p&gt;The first time you look at raw audit logs from a production cluster, the signal-to-noise ratio is genuinely bad:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;AUDIT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;SESSION&lt;/span&gt;&lt;span class="p"&gt;,...,&lt;/span&gt;&lt;span class="k"&gt;READ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="p"&gt;,,,&lt;/span&gt;&lt;span class="nv"&gt;"SELECT version()"&lt;/span&gt;
&lt;span class="n"&gt;AUDIT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;SESSION&lt;/span&gt;&lt;span class="p"&gt;,...,&lt;/span&gt;&lt;span class="k"&gt;READ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="p"&gt;,,,&lt;/span&gt;&lt;span class="nv"&gt;"SELECT * FROM pg_shdescription..."&lt;/span&gt;
&lt;span class="n"&gt;AUDIT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;SESSION&lt;/span&gt;&lt;span class="p"&gt;,...,&lt;/span&gt;&lt;span class="k"&gt;READ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="p"&gt;,,,&lt;/span&gt;&lt;span class="nv"&gt;"SET application_name='DBeaver 23.2.0'"&lt;/span&gt;
&lt;span class="n"&gt;AUDIT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;SESSION&lt;/span&gt;&lt;span class="p"&gt;,...,&lt;/span&gt;&lt;span class="k"&gt;READ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="p"&gt;,,,&lt;/span&gt;&lt;span class="nv"&gt;"SELECT oid, typarray FROM pg_type WHERE typname=$1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every database tool (DBeaver, DataGrip, pgAdmin) fires a catalog query sequence on connect. Every JDBC driver runs initialization SELECTs. Every connection pool runs health checks. Alert on this raw stream and you're alerting on noise constantly — which means you stop paying attention, which defeats the audit entirely.&lt;/p&gt;

&lt;p&gt;The filter query was built incrementally by watching actual production traffic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;logStream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;AUDIT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt;
    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;DELETE&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;version&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;pg_shdescription&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;pg_catalog&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;application_name&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;DBeaver&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;PostgreSQL&lt;/span&gt; &lt;span class="n"&gt;JDBC&lt;/span&gt; &lt;span class="n"&gt;Driver&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;datname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;\$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your exclusion list will grow. Every application stack generates its own connection init patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotcha 3: The KMS + CloudWatch Alarms Incompatibility
&lt;/h2&gt;

&lt;p&gt;Encrypt the SNS topic at rest. Straightforward — use &lt;code&gt;alias/aws/sns&lt;/code&gt;, the AWS-managed SNS key. Done.&lt;/p&gt;

&lt;p&gt;Except the alarm stopped delivering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CloudWatch Alarms does not have authorization to access the SNS topic encryption key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS-managed keys have immutable key policies. You cannot grant CloudWatch &lt;code&gt;kms:GenerateDataKey&lt;/code&gt;. The fix is a customer-managed KMS key with an explicit CloudWatch service grant in the key policy. The managed key works fine if Lambda is publishing directly to SNS — the limitation is specific to the &lt;strong&gt;CloudWatch Alarms → SNS&lt;/strong&gt; delivery path.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Key policy and full SNS security configuration (HTTPS enforcement, cross-account lockdown): &lt;a href="https://github.com/pcraavi/PostgreSQL-Audit/tree/main/sns" rel="noopener noreferrer"&gt;github.com/pcraavi/PostgreSQL-Audit/tree/main/sns&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Gotcha 4: The Incident Platform Deduplication Problem
&lt;/h2&gt;

&lt;p&gt;First alert: fired correctly.&lt;br&gt;
Second alert, third, fourth: nothing.&lt;/p&gt;

&lt;p&gt;The incident platform logs showed requests arriving and the "Create Alert" action starting — but no alert created. The issue: platforms like Opsgenie, PagerDuty, and VictorOps deduplicate by alarm name (used as the alert alias). If that alert is already open or acknowledged, subsequent triggers are suppressed.&lt;/p&gt;

&lt;p&gt;Correct behavior for most alarms. For an audit alert that should fire every detection window, it needs handling.&lt;/p&gt;

&lt;p&gt;Fix options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp the alarm name&lt;/strong&gt; at deploy time (&lt;code&gt;$(date +%s)&lt;/code&gt; suffix) — unique alias per deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure auto-close&lt;/strong&gt; when the alarm returns to OK — fresh alert on each ALARM transition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Override the alias template&lt;/strong&gt; in the integration to include a timestamp from the alert payload&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Why Not Database Activity Streams?
&lt;/h2&gt;

&lt;p&gt;Experienced AWS engineers will immediately ask this. DAS provides near-real-time activity streaming to Kinesis with structured output. It's a real solution.&lt;/p&gt;

&lt;p&gt;The reason we didn't use it: it introduces Kinesis as a required dependency, adds per-event cost, and requires a consumer layer for decryption and processing. For teams without an existing Kinesis pipeline, that's meaningful operational surface area.&lt;/p&gt;

&lt;p&gt;pgAudit + CloudWatch uses infrastructure Aurora already depends on. Lambda and SNS are general-purpose services. There's no proprietary tooling, no specialized operational knowledge required. Any engineer with standard AWS experience can maintain it. That portability and low learning curve matters when you're deploying across multiple accounts.&lt;/p&gt;

&lt;p&gt;GuardDuty RDS Protection is complementary — it handles threat detection, not structured audit trails. OpenSearch subscription filters would work but introduce an OpenSearch cluster as a dependency, which adds cost and operational overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Note on pgAudit Overhead
&lt;/h2&gt;

&lt;p&gt;Overhead is minimal at this audit scope. Individual human users in a regulated production environment are not expected to generate high transaction volumes — the audit target is ad-hoc access, not application traffic. The per-session overhead of writing to the PostgreSQL log is negligible when audit scope is limited to non-application users.&lt;/p&gt;

&lt;p&gt;For high-volume clusters where audit is a compliance requirement: account for log overhead in instance sizing. Set CloudWatch log retention to match your compliance window rather than keeping logs indefinitely — use &lt;code&gt;aws logs put-retention-policy&lt;/code&gt; to enforce it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ship filtered audit records to a structured store.&lt;/strong&gt; CloudWatch Log Insights is a query tool, not a data store. Extending the Lambda to write each filtered record to DynamoDB or an RDS audit table gives you the closest equivalent to Oracle's &lt;code&gt;UNIFIED_AUDIT_TRAIL&lt;/code&gt;: indexed, queryable, long-term audit history. Kinesis Firehose → S3 → Athena is another path for columnar query performance over large windows. The Lambda architecture makes this extension natural — the query and filtering logic is already there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terraform from day one.&lt;/strong&gt; The full stack — Lambda, EventBridge rule, SNS topic, IAM role, CloudWatch alarm, KMS key — fits in a single reusable module. Deploying to additional accounts should be &lt;code&gt;terraform apply&lt;/code&gt; with environment variables, not a re-run of CLI commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tag every resource.&lt;/strong&gt; &lt;code&gt;Environment&lt;/code&gt;, &lt;code&gt;ClusterName&lt;/code&gt;, &lt;code&gt;Owner&lt;/code&gt;. Untagged audit infrastructure across multiple accounts becomes unauditable infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Each component here has a clean interface and behaves predictably in isolation. The engineering is in the integration contracts: Log Insights results don't flow to alarms without mediation; AWS-managed KMS keys don't work with CloudWatch delivery; incident platform deduplication suppresses repeated alarm transitions. None of these are documented prominently — they surface when components are wired together under production conditions.&lt;/p&gt;

&lt;p&gt;Start with the Log Insights filter query. Validate the signal before building anything around it. The filtering logic is the foundation. The rest is plumbing.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Full implementation:&lt;/strong&gt; Lambda code, IAM policies, SNS topic policy, KMS key policy, CloudWatch alarm, Terraform module → &lt;a href="https://github.com/pcraavi/PostgreSQL-Audit" rel="noopener noreferrer"&gt;github.com/pcraavi/PostgreSQL-Audit&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Drop a comment if you've hit different noise patterns in your environment, or if you've implemented the Lambda → structured audit table extension — curious how others have approached long-term audit retention on Aurora.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>postgres</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>Automated 25 Minutes of My Morning With a Prompt (Not a Script)</title>
      <dc:creator>Pranay Ravi</dc:creator>
      <pubDate>Thu, 21 May 2026 03:07:21 +0000</pubDate>
      <link>https://dev.to/pranay_raavi/automated-25-minutes-of-my-morning-with-a-prompt-not-a-script-2d4k</link>
      <guid>https://dev.to/pranay_raavi/automated-25-minutes-of-my-morning-with-a-prompt-not-a-script-2d4k</guid>
      <description>&lt;p&gt;Every serious engineering org I've worked in has the same split personality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One side:&lt;/strong&gt; A modern observability stack. AppDynamics, Datadog, whatever the current favorite is. Real-time metrics, distributed traces, beautiful dashboards, alert routing. Years of investment. It works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The other side:&lt;/strong&gt; A long tail of legacy monitoring tools that predate the current stack by a decade. Tools that don't speak OpenTelemetry. Tools that can't push to a webhook. Tools whose output format is, and will remain, a 73-row HTML table emailed at 11:15 UTC every morning.&lt;/p&gt;

&lt;p&gt;These two sides don't talk to each other. And in most orgs, someone fills that gap manually for 25–30 minutes every morning.&lt;/p&gt;

&lt;p&gt;That someone was me.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fykro6x1odvcxp36qn7us.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fykro6x1odvcxp36qn7us.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Gap Nobody Draws on the Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;The box that should exist on every enterprise architecture diagram but never does:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"Dave reads his email for 25 minutes"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is what I call an &lt;strong&gt;observability seam&lt;/strong&gt; — the boundary between your modern monitoring stack and legacy outputs it can't ingest. These seams exist because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Legacy tools were built before modern observability conventions&lt;/li&gt;
&lt;li&gt;Integration work is permanently lower priority than feature work&lt;/li&gt;
&lt;li&gt;The cost is diffuse (distributed across many humans' mornings) rather than concentrated, so it never hits "fix it" priority&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The insight I kept coming back to: this isn't a technology problem. It's an &lt;strong&gt;architectural surface area nobody drew&lt;/strong&gt;. Once you see it that way, the solution becomes tractable.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Legacy Tail Actually Looked Like
&lt;/h2&gt;

&lt;p&gt;My specific environment had four data sources, none of which fed the observability stack:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Frequency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Replication latency alerts&lt;/td&gt;
&lt;td&gt;Email (body text)&lt;/td&gt;
&lt;td&gt;Multiple times daily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DB backup status report&lt;/td&gt;
&lt;td&gt;Email (73-row HTML table)&lt;/td&gt;
&lt;td&gt;Daily at 11:15 UTC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure/AWS notifications&lt;/td&gt;
&lt;td&gt;Email (subject-line pattern)&lt;/td&gt;
&lt;td&gt;20–50 per day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment Review&lt;/td&gt;
&lt;td&gt;Confluence page backed by Jira macros&lt;/td&gt;
&lt;td&gt;Checked manually&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of them have APIs. None of them feed my SIEM or APM. The only integration surface available was: &lt;em&gt;it arrives as email, or it lives at a known URL.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That constraint is also the opportunity.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Agent as Integration Layer
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8oyo8rns3fjim6ogetyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8oyo8rns3fjim6ogetyo.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core decision was to treat an LLM agent not as a productivity tool, but as a &lt;strong&gt;seam-closing integration layer&lt;/strong&gt; — something that sits between heterogeneous, unstructured legacy outputs and a human who needs a consistent, actionable daily summary.&lt;/p&gt;

&lt;p&gt;The agent does four things each morning:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Replication Latency — Signal extraction from noise
&lt;/h3&gt;

&lt;p&gt;Find the latest alert, read the body, and report whether the body is actually populated.&lt;/p&gt;

&lt;p&gt;This sounds trivial. It isn't. Replication alerts sometimes fire with &lt;em&gt;empty bodies&lt;/em&gt; — the email arrived, but the content didn't. Manually, this is easy to miss because you see the email and assume it's fine. The agent makes "body is empty" an explicit flagged state. That's catching a &lt;strong&gt;monitoring failure&lt;/strong&gt;, not just monitoring an alert.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Backup Report — Structured extraction from semi-structured HTML
&lt;/h3&gt;

&lt;p&gt;Parse the 73-row HTML table, tally the &lt;code&gt;BACKUP_STATUS&lt;/code&gt; column, surface only the rows that didn't complete.&lt;/p&gt;

&lt;p&gt;Output: &lt;em&gt;"69 completed, 0 failed, 3 with no backup — db01, db02, db05."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Reading time: three seconds. A human eyeballing 73 rows is slower and occasionally misses things.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Infrastructure Notifications — Classification at scale
&lt;/h3&gt;

&lt;p&gt;Twenty to fifty emails daily, each following the pattern &lt;code&gt;STATUS - Alert (Server Name : X and DB Name : Y)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The agent classifies every subject line, counts by status, and surfaces only the non-SUCCESS entries with a direct link to each email.&lt;/p&gt;

&lt;p&gt;Output: &lt;em&gt;"21 SUCCESS, 1 WARNING on SERVER04U / DB09."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That one line is the entire actionable output of fifty emails.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Deployment Review — Live data from a known URL
&lt;/h3&gt;

&lt;p&gt;Read the Confluence page, identify the Jira macro structure, query the underlying filters directly, and report live issue counts. Bridges the gap between "someone updated a Confluence page" and "here's the actual current state of production deployments."&lt;/p&gt;




&lt;h2&gt;
  
  
  Why an LLM Instead of a Python Script?
&lt;/h2&gt;

&lt;p&gt;A reasonable engineer will ask: couldn't this be IMAP + regex + cron?&lt;/p&gt;

&lt;p&gt;Yes — and in many cases that's the right answer. Large enterprises already run Airflow, Power Automate, Splunk SOAR, Logic Apps. This isn't automation entering a vacuum.&lt;/p&gt;

&lt;p&gt;The honest comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Strength&lt;/th&gt;
&lt;th&gt;Breaks when…&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Python + regex + cron&lt;/td&gt;
&lt;td&gt;Deterministic, auditable, fast&lt;/td&gt;
&lt;td&gt;Email format changes, new alert pattern appears&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ETL / SOAR pipeline&lt;/td&gt;
&lt;td&gt;Scalable, governed, integrated&lt;/td&gt;
&lt;td&gt;Requires schema agreement upstream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM agent&lt;/td&gt;
&lt;td&gt;Tolerates format variance, low setup cost&lt;/td&gt;
&lt;td&gt;Output is nondeterministic, harder to audit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What the LLM reduces is &lt;strong&gt;integration friction&lt;/strong&gt; — not the need for integration.&lt;/p&gt;

&lt;p&gt;The backup report column order can shift. The AWS notification subject-line pattern can vary. A new data source can be added in a sentence of English rather than a week of schema work.&lt;/p&gt;

&lt;p&gt;That flexibility has a cost: you trade determinism for adaptability. For a morning triage digest where a missed edge case surfaces in the next run, that tradeoff is acceptable. For a system that triggers automated remediation, it isn't.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Right mental model:&lt;/strong&gt; LLM agents lower the cost of closing observability seams. They don't replace deterministic pipelines where correctness guarantees matter.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Design Principles Worth Keeping
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Independent section isolation
&lt;/h3&gt;

&lt;p&gt;Each source is handled independently. A connector outage in one section doesn't abort the others — the agent renders an "unavailable" note and continues. A partial report is vastly more valuable than a silent failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Absence as a signal
&lt;/h3&gt;

&lt;p&gt;Each section includes a sanity check that fires if no emails are found for that source.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;"No replication alerts received today"&lt;/code&gt; is &lt;strong&gt;bolded as a warning&lt;/strong&gt;, not silently skipped.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This paid off in month two, when the backup report email silently stopped arriving after a mail routing change. The sanity check flagged it immediately. Manual triage would have assumed a clean run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Delta orientation
&lt;/h3&gt;

&lt;p&gt;The report answers &lt;em&gt;"what changed or failed since yesterday"&lt;/em&gt;, not &lt;em&gt;"what is the current state of everything."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Confirming sameness has no information value but costs the same attention as confirming change. A delta-oriented report routes attention only where it's needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime date resolution
&lt;/h3&gt;

&lt;p&gt;The prompt never hardcodes a date. "Today" is resolved against the local clock at execution time. Small thing. Prevents a class of subtle bugs where a stale prompt runs with yesterday's scope.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Prompt (Sanitized)
&lt;/h2&gt;

&lt;p&gt;The prompt itself is the deliverable worth preserving:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;You are producing the daily 9 AM operations report. Run silently — this prompt is self-contained.

REQUIRED CONNECTORS
&lt;span class="p"&gt;-&lt;/span&gt; Microsoft 365 / Outlook (email search + read_resource on mail:// URIs)
&lt;span class="p"&gt;-&lt;/span&gt; Atlassian (getConfluencePage; optionally searchJiraIssuesUsingJql)

If any connector is missing or errors out, render that section with a clear
"⚠ connector unavailable" note and continue. Never abort the whole report.

SCOPE OF "TODAY"
"Today" = the calendar day the task fires, in local time.
Always pass afterDateTime: "today" to outlook_email_search.

====
&lt;span class="gu"&gt;SECTION 1 — Replication Latency Monitoring (latest alert today)
====
&lt;/span&gt;&lt;span class="p"&gt;1.&lt;/span&gt; Search for latest alert today.
&lt;span class="p"&gt;2.&lt;/span&gt; read_resource on its URI for the full body.
&lt;span class="p"&gt;3.&lt;/span&gt; Report: count of alerts today, receivedDateTime, sender, and either the
   body text or "Body is empty" if it's only whitespace.

====
&lt;span class="gu"&gt;SECTION 2 — DB Backup Report (failure check)
====
&lt;/span&gt;&lt;span class="p"&gt;1.&lt;/span&gt; Find today's report email.
&lt;span class="p"&gt;2.&lt;/span&gt; Parse rows; tally BACKUP_STATUS column.
&lt;span class="p"&gt;3.&lt;/span&gt; State explicitly: "X completed, Y failed, Z with no backup."
   &lt;span class="gs"&gt;**Bold**&lt;/span&gt; the failure count if &amp;gt; 0.

====
&lt;span class="gu"&gt;SECTION 3 — Infra notifications today
====
&lt;/span&gt;&lt;span class="p"&gt;1.&lt;/span&gt; Fetch alerts from sender, limit 50.
&lt;span class="p"&gt;2.&lt;/span&gt; Parse subject pattern: "STATUS - Alert (Server : X and DB : Y)"
&lt;span class="p"&gt;3.&lt;/span&gt; Summarize counts by status. &lt;span class="gs"&gt;**Bold**&lt;/span&gt; anything not SUCCESS.
&lt;span class="p"&gt;4.&lt;/span&gt; List non-SUCCESS items with a webLink to each.

====
&lt;span class="gu"&gt;SECTION 4 — Deployment Review (Confluence)
====
&lt;/span&gt;&lt;span class="p"&gt;1.&lt;/span&gt; getConfluencePage, contentFormat="html".
&lt;span class="p"&gt;2.&lt;/span&gt; If JIRA tool available, run each filter and include live issue counts.

====
&lt;span class="gu"&gt;OUTPUT FORMAT
====
&lt;/span&gt;&lt;span class="p"&gt;-&lt;/span&gt; Plain markdown. One H1 with today's date. Four H2 sections in order.
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="gs"&gt;**Bold**&lt;/span&gt; anything requiring attention.
&lt;span class="p"&gt;-&lt;/span&gt; End with "Sources:" — no duplicate links.
&lt;span class="p"&gt;-&lt;/span&gt; Tone: factual. No filler. No apology lines.

====
&lt;span class="gu"&gt;SANITY CHECKS
====
&lt;/span&gt;&lt;span class="p"&gt;-&lt;/span&gt; Section 1 zero results: &lt;span class="gs"&gt;**"No replication alerts today — monitoring may be down."**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Section 2 zero results: &lt;span class="gs"&gt;**"Backup report email not received yet."**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Section 3 zero results: &lt;span class="gs"&gt;**"No infra notifications today — verify alerting pipeline."**&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Section 4 failure: include the direct Confluence URL for manual access.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things worth noting:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Section independence is explicit.&lt;/strong&gt; Each section is instructed to fail gracefully and continue. Fault isolation encoded directly in the prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sanity checks are the highest-value lines.&lt;/strong&gt; Flagging "no email received today" catches the failure mode where the upstream monitoring system has broken silently — the exact failure mode that checklist-style human review tends to miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The prompt is a versioned artifact.&lt;/strong&gt; As the environment changes — new email formats, new Confluence pages, connector upgrades — the prompt evolves. Treat it as a living document.&lt;/p&gt;




&lt;h2&gt;
  
  
  Honest Tradeoffs
&lt;/h2&gt;

&lt;p&gt;This wouldn't be a useful post without naming the downsides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM output is nondeterministic.&lt;/strong&gt; Identical inputs can produce slightly different summaries across runs. I spot-check the raw email count against the agent's tally once a week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token truncation is real.&lt;/strong&gt; A 73-row HTML table pushed through a context window can get silently trimmed. Defensive prompting mitigates but doesn't eliminate this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The desktop-up dependency.&lt;/strong&gt; In my setup, scheduled tasks fire when the desktop app is open. A laptop asleep at 9 AM means the report runs when you open it. Acceptable for a daily digest; not for time-critical monitoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trust takes time to calibrate.&lt;/strong&gt; The first week I ran this, I verified everything manually anyway. By week four, I'd stopped double-checking, and I had enough incident data to know the report was reliable. That calibration period is part of the deployment.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pattern Generalizes
&lt;/h2&gt;

&lt;p&gt;Once you see the shape — &lt;em&gt;agent reads heterogeneous sources at a known cadence, filters to deltas, surfaces only what requires human attention&lt;/em&gt; — it appears everywhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;On-call handoff summaries&lt;/strong&gt; (PagerDuty + Slack + Jira + APM, joined and summarized)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sprint health reports&lt;/strong&gt; (blocked tickets, aging tickets, no-owner tickets — weekly, automated)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security alert triage&lt;/strong&gt; (SOC mailers often emit 200 events to find 3 that matter)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance evidence collection&lt;/strong&gt; (SOC 2 renewals are largely a gathering problem, not an analysis problem)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The domain changes. The architecture doesn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mental Model Shift
&lt;/h2&gt;

&lt;p&gt;The efficiency gain is real — roughly 25 minutes a day × 250 working days ≈ &lt;strong&gt;100 hours annually&lt;/strong&gt;. But that's not the most interesting outcome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From push to pull.&lt;/strong&gt; Most monitoring tools send you everything and you decide what matters. The cost of sending is zero for the system and enormous for the human. An agent flips that: I'll ask each morning for the delta.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From dashboards to digests.&lt;/strong&gt; A dashboard is only valuable if someone looks at it. A daily digest that synthesizes five sources and filters to a single page is more operationally useful than five perfect dashboards, because it solves the actual constraint — &lt;strong&gt;attention&lt;/strong&gt; — rather than the assumed one — visualization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From checklists to deltas.&lt;/strong&gt; The morning health check used to be: open these five places, confirm each is green. That's cognitive load spent confirming sameness. The right version is &lt;em&gt;only tell me what changed&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Is, and Isn't
&lt;/h2&gt;

&lt;p&gt;This is not an argument for replacing your observability infrastructure with LLM prompts. Your APM is doing things this agent will never do: real-time anomaly detection, distributed trace correlation, infrastructure topology mapping. The agent doesn't compete with that.&lt;/p&gt;

&lt;p&gt;What the agent does is address the &lt;strong&gt;seam&lt;/strong&gt; — the gap between the modern observability stack and the legacy outputs it can't ingest.&lt;/p&gt;

&lt;p&gt;Making the seam visible is the first step. Closing it is the second. The agent is the bridge.&lt;/p&gt;

&lt;p&gt;If your morning starts with "let me just check a few things in my email" — you have an observability seam. The tooling to close it is available now, and the implementation cost is a well-written prompt.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F619adfdpud19nbv16ea4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F619adfdpud19nbv16ea4.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What observability seams do you have in your environment? Curious what data sources people are working around. Drop a comment.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>monitoring</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Can AI Agents Replace Enterprise Workflow Orchestration? A Real-World Test — OpenClaw. n8n. Claude Dispatch. A side-by-side comparison..</title>
      <dc:creator>Pranay Ravi</dc:creator>
      <pubDate>Sun, 17 May 2026 21:58:49 +0000</pubDate>
      <link>https://dev.to/pranay_raavi/can-ai-agents-replace-enterprise-workflow-orchestration-a-real-world-test-openclaw-n8n-claude-1hho</link>
      <guid>https://dev.to/pranay_raavi/can-ai-agents-replace-enterprise-workflow-orchestration-a-real-world-test-openclaw-n8n-claude-1hho</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg55iazexdoak4m1xq3ch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg55iazexdoak4m1xq3ch.png" alt=" " width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;"A database administrator's honest investigation into whether the new wave of AI automation tools can handle enterprise-grade workflows — or whether the boring answer is still the right one."&lt;br&gt;
tags: n8n, automation, database, devops&lt;/p&gt;




&lt;p&gt;&lt;em&gt;A database administrator's honest investigation into whether the new wave of AI automation tools can handle enterprise-grade workflows — or whether the boring answer is still the right one.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal7p5ugc1ic5gg1nmhza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal7p5ugc1ic5gg1nmhza.png" alt=" " width="800" height="1201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The workflow that started the question — and ended up answering it&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Everyone was talking about these tools. I got curious.
&lt;/h2&gt;

&lt;p&gt;I work as a database administrator. I support hundreds of databases across multiple environments — dev, staging, non-prod, production — in a HIPAA-regulated organization. Access management is a constant, grinding operational burden. Developers need access. Analysts need access. Application owners need access. And every single request needs to be approved, documented, and auditable.&lt;/p&gt;

&lt;p&gt;For a long time, the process looked like this: someone would message the DBA, the DBA would manually create a Jira ticket and a Word document, chase down approvals from a manager, a database manager, and the security team, manually create the account, and email the credentials back. Days could pass. Follow-ups stacked up. The security team had to trust that the paperwork was right.&lt;/p&gt;

&lt;p&gt;Then I started hearing about Claude Dispatch. And OpenClaw. Both were being described as AI tools that could receive a message and take action — automate tasks, call APIs, connect to services. The demos looked impressive. The communities were excited.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;And I thought: wait. Could one of these actually solve the problem I have been living with for years?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I had already built something with n8n — a workflow automation tool. But I genuinely wanted to know whether I had picked the right tool, or whether something newer and smarter had passed it by. So I did what any engineer would do: I used my actual problem as the test.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem I needed to solve
&lt;/h2&gt;

&lt;p&gt;Before comparing any tools, let me describe the workflow precisely, because the details are what make the comparison meaningful.&lt;/p&gt;

&lt;p&gt;A developer — or a data analyst, or a product manager — needs access to a specific database. The request needs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to their direct manager for approval&lt;/li&gt;
&lt;li&gt;Then to the database manager for approval&lt;/li&gt;
&lt;li&gt;Then to the security team for final approval&lt;/li&gt;
&lt;li&gt;Create a full audit trail at every step&lt;/li&gt;
&lt;li&gt;Call a pre-existing API to provision the account with the correct access level&lt;/li&gt;
&lt;li&gt;Deliver the credentials back to the requester&lt;/li&gt;
&lt;li&gt;Spin up a Jira ticket for the network team to open the firewall port&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every approval is sequential — no step fires unless the previous one passes. If anyone says no, the chain stops and the requester is notified. If security does not approve, no account gets created. Full stop.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical. It runs in a regulated environment where access to production data is governed by HIPAA. The audit trail is not nice-to-have. It is required.&lt;/p&gt;

&lt;p&gt;The access levels themselves are structured — not free text. Read only. Read/write. Dev owner. Application owner. DBA. Each maps to a specific API endpoint. Each produces a deterministic result. The central database inventory that drives all of this is a relational database populated automatically when infrastructure is created through Terraform. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nobody should be able to bypass it.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqthftls95b7kxn8o4edc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqthftls95b7kxn8o4edc.png" alt=" " width="800" height="1201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1: The complete approval chain — from Webex message to provisioned credentials and dual Jira audit trail&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  First candidate: Claude Dispatch
&lt;/h2&gt;

&lt;p&gt;Claude Dispatch is Anthropic's answer to the question: what if you could delegate tasks to your AI from your phone and come back to find them done? It lives inside Claude Cowork — a desktop agent product — and creates a persistent connection between your mobile app and the Claude Desktop app running on your computer.&lt;/p&gt;

&lt;p&gt;The pitch is genuinely compelling. Send a message from your phone, Claude acts on your desktop: reads files, calls APIs, summarizes documents, delivers results. For personal productivity this is interesting. For ad-hoc delegation it is actually useful.&lt;/p&gt;

&lt;p&gt;So I asked the obvious question: could Dispatch receive a Webex message, run an approval chain, call my database API, and write to Jira?&lt;/p&gt;

&lt;p&gt;Here is where the investigation got honest quickly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dispatch requires the Claude Desktop app to be running on your computer. The moment the laptop sleeps, it stops.&lt;/li&gt;
&lt;li&gt;There is no server. There is no always-on process. There is no execution log.&lt;/li&gt;
&lt;li&gt;The workflow is driven by an LLM reasoning about what to do — not a deterministic set of rules. The same input could produce a different output on a different day.&lt;/li&gt;
&lt;li&gt;There is no concept of a sequential approval gate. Claude does not wait for a human to respond before deciding the next step.&lt;/li&gt;
&lt;li&gt;There is no audit trail. No timestamps. No record of who approved what and when.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;A workflow that depends on a laptop staying awake is not an enterprise workflow. It is a personal convenience. There is nothing wrong with that — but it is a different category of tool entirely.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Cost-wise, Dispatch is bundled with Claude Pro at $20 per month or Max at $100 per month. Accessible pricing. But the architecture disqualifies it for this use case before the price even matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  Second candidate: OpenClaw
&lt;/h2&gt;

&lt;p&gt;OpenClaw is a different kind of tool. It is open-source, self-hostable, and designed as a personal AI assistant that runs on your own infrastructure. You can connect it to Webex, Telegram, Slack, WhatsApp — it listens on those channels and uses an LLM to decide what action to take in response to a message.&lt;/p&gt;

&lt;p&gt;The self-hosted angle immediately made it more interesting for regulated environments. If you run it on a VPS rather than your laptop, it can operate 24/7. And because it is open-source under an MIT license, the software itself costs nothing. Your real costs are the VPS — roughly $5 to $15 per month — and the API tokens from whichever LLM provider you connect.&lt;/p&gt;

&lt;p&gt;So OpenClaw gets past the first disqualification that knocked out Dispatch. It can stay on. Good start.&lt;/p&gt;

&lt;p&gt;But then I pushed further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenClaw has no concept of a deterministic approval chain. It reasons about what to do. If I ask it to get manager approval before proceeding, it will try — but there is no guarantee it handles every edge case the same way every time.&lt;/li&gt;
&lt;li&gt;There is no built-in error handling or retry logic. If an API call fails, the agent may or may not handle it gracefully.&lt;/li&gt;
&lt;li&gt;There are no execution logs in any structured, auditable format. The LLM's reasoning is not a HIPAA audit trail.&lt;/li&gt;
&lt;li&gt;There is no native Jira integration. You can make API calls, but you are building that logic yourself in an environment with no visual workflow editor.&lt;/li&gt;
&lt;li&gt;Setup requires real DevOps experience — Docker, VPS configuration, model routing. Not a weekend project for someone who just needs automation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenClaw is genuinely impressive for what it is: a powerful, flexible personal AI assistant for technical users who want to automate their own workflows. It is not what I needed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The core tension is this: OpenClaw lets an AI decide what to do. In a regulated environment, I need a system that does exactly what it is configured to do — every single time.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The one I already had: n8n
&lt;/h2&gt;

&lt;p&gt;n8n is not the newest tool in this comparison. It is not the most talked-about. It does not have a viral GitHub repository or a growing community of people sharing AI agent demos. It is a workflow automation platform — visual, node-based, deterministic.&lt;/p&gt;

&lt;p&gt;I had already built the database access workflow in n8n before I started this investigation. What the investigation forced me to do was articulate why it works where the others do not.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;n8n runs on a server. Always on. No desktop dependency.&lt;/li&gt;
&lt;li&gt;Every workflow execution is logged — step by step, with timestamps. If something fails, you know exactly where and why.&lt;/li&gt;
&lt;li&gt;The approval chain is explicit: manager node fires, waits for a webhook response, branches on yes or no. Database manager node fires. Security node fires. No LLM is deciding the order. The order is the configuration.&lt;/li&gt;
&lt;li&gt;Jira, Webex, and REST API integrations are native — no custom code required to connect them.&lt;/li&gt;
&lt;li&gt;When security approves, a Jira ticket is created automatically with every approval documented: who approved, their role, the timestamp. A second ticket fires for the network team. The requester receives credentials via Webex.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The entire chain — from chat message to provisioned account — is deterministic, auditable, and server-side.&lt;/strong&gt; It does not matter whether my laptop is on. It does not matter whether the LLM is having a creative day. The workflow does what it is built to do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting them side by side
&lt;/h2&gt;

&lt;p&gt;Here is what the investigation produced — not as opinion, but as a structured comparison against the actual requirements of the problem.&lt;/p&gt;

&lt;p&gt;![Figure 2: Tool comparison — enterprise workflow criteria]&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36mziav4nlqsqilicocg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36mziav4nlqsqilicocg.png" alt=" " width="800" height="717"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2: How the three tools compare across the criteria that actually matter for enterprise workflows&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The green cells are not a coincidence. n8n was built for exactly this category of problem — structured, multi-step, multi-system, always-on automation with governance requirements. Dispatch and OpenClaw were built for a different problem: intelligent, flexible, personal task delegation.&lt;/p&gt;

&lt;p&gt;Neither of those things is wrong. They are just different categories.&lt;/p&gt;




&lt;h2&gt;
  
  
  But is n8n actually accessible? Or is it just the enterprise answer no one wants to hear?
&lt;/h2&gt;

&lt;p&gt;This is the question I kept coming back to. Because n8n has a reputation for requiring technical knowledge. Webhooks, JSON payloads, API authentication, conditional branching. It is not a no-code tool in the purest sense.&lt;/p&gt;

&lt;p&gt;And yet — compared to OpenClaw, which requires DevOps expertise to host and configure, and compared to Dispatch, which requires you to trust an LLM to handle regulated processes — n8n is actually the most accessible path to a production-grade solution.&lt;/p&gt;

&lt;p&gt;The visual builder is genuinely good. The template library covers most common patterns. The community is large. And the self-hosted Community Edition is free — you pay only for the server, which can be as little as $4 a month.&lt;/p&gt;

&lt;p&gt;What n8n asks for is clarity of thought. You need to understand your process before you can automate it. That is not a technical barrier. That is just good engineering.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In 2008, I had my first exposure to BPM tools — business process management software. Back then, automating a multi-step approval workflow required consultants, enterprise licenses, and six-month implementation projects. n8n in 2026 is that same capability, accessible to a single engineer on a modest budget, in a weekend.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI-native tools will get there. The direction is right. But for workflows where consistency and auditability are non-negotiable, deterministic automation still wins. Not because AI is not impressive — it is. But because a HIPAA auditor does not accept "the AI usually does it right" as an answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the real world added that no tool comparison captures
&lt;/h2&gt;

&lt;p&gt;There is one thing I did not discover until the workflow was running: some users have managers on paper who are not actually the owners or decision-makers for the systems being requested.&lt;/p&gt;

&lt;p&gt;Organizational charts say one thing. Actual accountability sits somewhere else. When the approval request went to the wrong person, the workflow stalled — not because the automation failed, but because the data it depended on was wrong.&lt;/p&gt;

&lt;p&gt;No tool comparison surfaces this. You only find it when you run the thing on real people in a real organization.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is one of the most useful things a well-designed workflow can do: make your organizational data gaps visible. The automation did not hide the problem. It exposed it. And that forced the fix.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  So — can the new AI tools replace n8n for this?
&lt;/h2&gt;

&lt;p&gt;No. Not yet. Not for this class of problem.&lt;/p&gt;

&lt;p&gt;Claude Dispatch is a remote control for your desktop. It is well-designed and genuinely useful for personal delegation. But it has no server, no audit trail, and no deterministic logic — three things that are non-negotiable in a regulated environment.&lt;/p&gt;

&lt;p&gt;OpenClaw is a powerful personal AI assistant that technical users can self-host and extend. It can call APIs and respond to messages. But it has no structured approval chain, no execution logging, and no enterprise governance features.&lt;/p&gt;

&lt;p&gt;n8n is not the flashiest answer. It did not go viral. It does not use an LLM to decide what to do next. But it runs reliably, it logs everything, it integrates natively with Jira and Webex, and it does exactly what you configure it to do — every single time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The right tool is not the newest tool. It is the tool that matches the shape of your problem. And some problems have a shape that requires determinism, not intelligence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The question I started with — can these AI tools solve what n8n solves — turned out to be the wrong question. The better question is: what kind of automation are you building? If the answer involves regulated data, sequential human approvals, and a legal requirement to prove who did what and when, the answer is still n8n. If the answer involves personal productivity, intelligent delegation, and flexible task handling, Dispatch and OpenClaw are worth a serious look.&lt;/p&gt;

&lt;p&gt;Both of those things can be true at the same time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Principal Engineer | 16+ years in databases, cloud &amp;amp; observability | Oracle · PostgreSQL · Kafka · AWS · Splunk · AppD | Platform engineering &amp;amp; AI delivery&lt;/em&gt;&lt;/p&gt;

</description>
      <category>n8n</category>
      <category>automation</category>
      <category>database</category>
      <category>devops</category>
    </item>
    <item>
      <title>How I Built a Zero-Subscription Local AI Stack — Inspired by a 60-Second YouTube Short</title>
      <dc:creator>Pranay Ravi</dc:creator>
      <pubDate>Sun, 17 May 2026 02:33:21 +0000</pubDate>
      <link>https://dev.to/pranay_raavi/how-i-built-a-completely-free-local-ai-stack-inspired-by-a-60-second-youtube-short-3e39</link>
      <guid>https://dev.to/pranay_raavi/how-i-built-a-completely-free-local-ai-stack-inspired-by-a-60-second-youtube-short-3e39</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a Completely Free Local AI Stack — Inspired by a 60-Second YouTube Short
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;By Pranaychandra Ravi&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;It started with a YouTube Short. Someone on my feed casually demonstrated connecting a local AI model to Claude Code and I stopped mid-scroll. No API key. No subscription. No code leaving their machine. I had to know how it worked.&lt;/p&gt;

&lt;p&gt;What followed was a deep dive into local AI — Ollama, Gemma4, Docker, Open WebUI, vector databases, context windows, and a Python script that made my local model generate an ASCII diagram of the Earth and Moon. This post documents everything I learned, every question I asked, and every mistake I made along the way. If you're curious about running AI entirely on your own hardware, this one is for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  First Question: Wait, Is This Actually Free?
&lt;/h2&gt;

&lt;p&gt;My first instinct was skepticism. Claude Code is Anthropic's product. Surely using it requires a Claude subscription?&lt;/p&gt;

&lt;p&gt;The short answer is &lt;strong&gt;no — not when you pair it with Ollama and a local model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's what I learned: Claude Code is the &lt;em&gt;agent&lt;/em&gt; — the tool that reads your files, runs commands, edits code, and manages multi-step tasks in your terminal. By default it calls Anthropic's API, which costs money. But Claude Code exposes environment variables that let you redirect those API calls anywhere you want — including a local Ollama server running on your own machine.&lt;/p&gt;

&lt;p&gt;Ollama added official support for Anthropic's Messages API format, meaning Claude Code can talk to it natively. No hacks, no middleware, no subscription. The only cost is your own electricity and hardware.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code  →  talks to  →  Ollama (local server)  →  runs  →  Your model
                              (no Anthropic servers involved)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  So What Exactly Is Ollama?
&lt;/h2&gt;

&lt;p&gt;Before I could set anything up I needed to understand what Ollama actually is, because "install Ollama" doesn't tell you much.&lt;/p&gt;

&lt;p&gt;Think of Ollama as two things in one:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. A model manager&lt;/strong&gt; — it downloads, stores, and organizes AI models on your machine. Like a package manager but for AI brains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. A local API server&lt;/strong&gt; — once running, it exposes an endpoint at &lt;code&gt;http://localhost:11434&lt;/code&gt; that any application can call. Your code, Claude Code, Open WebUI, VS Code extensions — anything that speaks the Anthropic or OpenAI API format can connect to it.&lt;/p&gt;

&lt;p&gt;This is the key insight I kept coming back to: &lt;strong&gt;Ollama itself has no intelligence&lt;/strong&gt;. It's an empty engine. You have to download a model — a large file containing all the AI's weights and knowledge — before anything useful happens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without a model:   Ollama = empty server, useless
With a model:      Ollama = fully local AI, free forever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Downloading Your First Model — Which One?
&lt;/h2&gt;

&lt;p&gt;This is where hardware matters. I have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;32GB RAM&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NVIDIA GPU with ~11GB VRAM&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Core i9 processor&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With an NVIDIA card, Ollama automatically uses CUDA — no setup needed. Your GPU handles inference and it's dramatically faster than CPU-only.&lt;/p&gt;

&lt;p&gt;The key concept here is &lt;strong&gt;VRAM vs RAM&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model fits in VRAM  →  GPU handles everything  →  Very fast ✅
Model too big for VRAM  →  spills into system RAM  →  Slower ⚠️
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 11GB VRAM I can fit most 7B–13B parameter models entirely in GPU memory, which means fast, snappy responses.&lt;/p&gt;

&lt;p&gt;After thinking through my use cases — coding help, image analysis, document review — I landed on &lt;strong&gt;Gemma4&lt;/strong&gt; (Google's multimodal model, ~12GB). Here's why it beat out alternatives like Qwen3.6 (28GB):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Gemma4&lt;/th&gt;
&lt;th&gt;Qwen3.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Size&lt;/td&gt;
&lt;td&gt;~12GB&lt;/td&gt;
&lt;td&gt;~28GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fits in 11GB VRAM&lt;/td&gt;
&lt;td&gt;Nearly (tiny RAM overflow)&lt;/td&gt;
&lt;td&gt;Partial (big RAM spill)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image understanding&lt;/td&gt;
&lt;td&gt;✅ Yes (multimodal)&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coding quality&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed on my hardware&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Slower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;My use cases included image-to-text extraction and converting images to coloring pages — Qwen3.6 can't do either because it's text-only. Gemma4 won.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One command. It downloads, verifies, and stores the model. You can see progress in the terminal.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture in Plain English
&lt;/h2&gt;

&lt;p&gt;Before going further, I want to share the mental model that made everything click for me:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│                    YOUR COMPUTER                    │
│                                                     │
│  ┌─────────────┐    ┌──────────────┐               │
│  │ Claude Code │───▶│    Ollama    │               │
│  │  (terminal) │    │ :11434 (API) │               │
│  └─────────────┘    └──────┬───────┘               │
│                            │                        │
│  ┌─────────────┐    ┌──────▼───────┐               │
│  │  Open WebUI │───▶│   Gemma4    │               │
│  │  (browser)  │    │  (the brain) │               │
│  └─────────────┘    └─────────────┘               │
│                                                     │
│  ┌─────────────┐                                   │
│  │  Python API │───▶ http://localhost:11434        │
│  │   scripts   │                                   │
│  └─────────────┘                                   │
└─────────────────────────────────────────────────────┘
              Zero data leaves your machine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three different interfaces. One local model. Everything private.&lt;/p&gt;




&lt;h2&gt;
  
  
  Context Windows — What Are They and Why Do They Matter?
&lt;/h2&gt;

&lt;p&gt;One of the most important concepts I clarified was the &lt;strong&gt;context window&lt;/strong&gt; — the model's working memory. It's the maximum amount of text a model can "see" at once in a conversation. Exceed it and it starts forgetting the beginning.&lt;/p&gt;

&lt;p&gt;Here's the reality check comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claude Sonnet 4.5&lt;/th&gt;
&lt;th&gt;Gemma4 (local)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;200,000 tokens&lt;/td&gt;
&lt;td&gt;~8,000–32,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Approximate pages&lt;/td&gt;
&lt;td&gt;~150,000 words&lt;/td&gt;
&lt;td&gt;~6,000–24,000 words&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6 years of tax docs&lt;/td&gt;
&lt;td&gt;Handles comfortably&lt;/td&gt;
&lt;td&gt;Would overflow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Your VRAM directly affects how large a context window your local model can hold. More VRAM = more of the model loaded = bigger context available.&lt;/p&gt;

&lt;p&gt;You can manually increase it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run gemma4 &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 32768
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For single documents, images, or focused coding tasks — perfectly fine. For analyzing six years of tax filings all at once? That's where Claude's 200k context is a genuine advantage local models can't match yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  Can Local Models Search the Internet?
&lt;/h2&gt;

&lt;p&gt;Short answer: &lt;strong&gt;No, not by default.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Local models are frozen at their training date. They have no internet connection during your conversation. This was an important distinction to understand.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude (this chat)  →  Has web search tool  →  Knows current events ✅
Gemma4 (local)     →  No internet          →  Knowledge frozen at training ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This raised an interesting follow-up question though. When I used Gemini to analyze my tax filing and it spotted mistakes — was it searching the internet to find them?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No.&lt;/strong&gt; And this was a real misconception I had.&lt;/p&gt;

&lt;p&gt;Gemini found tax errors because tax law, IRS rules, and common filing mistakes were baked into the model during training. It learned from millions of tax documents, accounting textbooks, and IRS publications. During your session it's not googling anything — it's applying trained knowledge to your specific document.&lt;/p&gt;

&lt;p&gt;Think of it like a tax accountant. They studied tax law for years. When reviewing your return they're not searching Google — they're applying what they already know to what you show them.&lt;/p&gt;

&lt;p&gt;Local models work the same way. The difference is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini/Claude&lt;/strong&gt;: More recent training data, larger knowledge base, up-to-date tax law changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma4 local&lt;/strong&gt;: Good foundational knowledge, may be slightly behind on very recent rule changes, but &lt;strong&gt;your documents never leave your machine&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For sensitive financial documents, that privacy trade-off is significant.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connecting Claude Code to Gemma4
&lt;/h2&gt;

&lt;p&gt;This was surprisingly simple. Claude Code reads three environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_AUTH_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ollama
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:11434
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or using Ollama's built-in launcher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama launch claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Claude Code started up I saw this at the bottom of the welcome screen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gemma4 · API Usage Billing · pranayraavi@gmail.com's Organization
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That confirms it's using Gemma4 through Ollama. No Anthropic billing. No subscription.&lt;/p&gt;

&lt;p&gt;What you get with this setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ File reading and editing across your project&lt;/li&gt;
&lt;li&gt;✅ Terminal command execution&lt;/li&gt;
&lt;li&gt;✅ Multi-step agentic coding tasks&lt;/li&gt;
&lt;li&gt;✅ Git operations&lt;/li&gt;
&lt;li&gt;✅ MCP connectors and plugins&lt;/li&gt;
&lt;li&gt;✅ Project context awareness&lt;/li&gt;
&lt;li&gt;⚠️ Intelligence capped at Gemma4's capability (weaker than Claude Sonnet/Opus)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Python API Test
&lt;/h2&gt;

&lt;p&gt;Before setting up a GUI I wanted to confirm the raw API worked. Here's the script I wrote:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a hello world in ascii diagram of moon and earth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          (           )
         /              \
  ----(---O---)    (------)  &amp;lt;-- Orbit Path
 /  /   \    /  /   \
|   |     | | |     |   |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemma4, running entirely on my machine, responding to a Python script. No API key. No internet. Completely local. This was the moment it really clicked.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up Open WebUI — The ChatGPT-Like Interface
&lt;/h2&gt;

&lt;p&gt;For a proper GUI I went with &lt;strong&gt;Open WebUI&lt;/strong&gt; — a beautiful, feature-rich interface that runs locally and connects to Ollama.&lt;/p&gt;

&lt;p&gt;First attempt using pip failed because I had Python 3.13 and Open WebUI requires Python 3.11 or 3.12:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;ERROR: Could not find a version that satisfies the requirement open-webui
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So I went the Docker route instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing Docker Desktop
&lt;/h3&gt;

&lt;p&gt;Docker Desktop is free for personal use. Download from &lt;code&gt;docker.com/products/docker-desktop&lt;/code&gt;. During install, WSL 2 backend gets configured automatically on Windows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running Open WebUI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;docker&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;run&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;127.0.0.1:3000:8080&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;open-webui&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;-v&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;open-webui:/app/backend/data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;--add-host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;host.docker.internal:host-gateway&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;`
&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;ghcr.io/open-webui/open-webui:main&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I initially tried &lt;code&gt;-p 3000:80&lt;/code&gt; which caused a port conflict (another process was using port 3000 on my machine). Switching to &lt;code&gt;-p 127.0.0.1:3000:8080&lt;/code&gt; fixed it.&lt;/p&gt;

&lt;p&gt;Confirmed it was running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;netstat&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-ano&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;findstr&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nx"&gt;3000&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c"&gt;# TCP  0.0.0.0:3000  LISTENING  ← Docker up and running&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="n"&gt;curl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;http://localhost:3000&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="c"&gt;# StatusCode: 200 OK  ← Server responding&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then opened &lt;code&gt;http://localhost:3000&lt;/code&gt; in Chrome and saw the Open WebUI interface with Gemma4 auto-detected.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Real Test — Image to Text Extraction
&lt;/h3&gt;

&lt;p&gt;One of the reasons I picked Gemma4 over Qwen3.6 was its multimodal capability — it can actually &lt;em&gt;see&lt;/em&gt; images. I put this to the test immediately.&lt;/p&gt;

&lt;p&gt;I had a photo of handwritten chess notes and uploaded it directly into the Open WebUI chat. The prompt was simple: &lt;em&gt;"convert this image to text"&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Gemma4 thought for 11 seconds and returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FORK/DOUBLE ATTACK

When we attack two or more pieces at the same time then it is known
as fork or double attack

Note- Knights are good at making fork.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a perfect transcription of handwritten text — extracted entirely locally, no cloud OCR service, no API key, nothing leaving my machine. It even generated a relevant follow-up suggestion: &lt;em&gt;"Are there other kinds of tactical attacks besides forks, like pins or skewers?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the multimodal capability in action:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Handwritten text extracted accurately&lt;/li&gt;
&lt;li&gt;✅ Context understood (chess notes)&lt;/li&gt;
&lt;li&gt;✅ Intelligent follow-up suggested&lt;/li&gt;
&lt;li&gt;✅ 100% local — image never left my PC&lt;/li&gt;
&lt;li&gt;✅ Free&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For anyone with scanned documents, handwritten notes, receipts, or any image containing text — this works out of the box with Gemma4 in Open WebUI.&lt;/p&gt;




&lt;h2&gt;
  
  
  Document Upload and RAG — How It Actually Works
&lt;/h2&gt;

&lt;p&gt;One of the most powerful features of Open WebUI is document upload with &lt;strong&gt;RAG (Retrieval Augmented Generation)&lt;/strong&gt;. This is how you can upload your AWS docs, tax returns, or any PDFs and chat with them.&lt;/p&gt;

&lt;p&gt;Here's what happens under the hood:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You upload PDF
      ↓
Open WebUI splits it into chunks
      ↓
Converts chunks to embeddings (mathematical vectors)
      ↓
Stores in ChromaDB (local vector database)
      ↓
You ask a question
      ↓
ChromaDB finds the most relevant chunks
      ↓
Sends chunks to Gemma4 as context
      ↓
Gemma4 answers based on YOUR document
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything is stored locally at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C:\Users\lavan\AppData\Roaming\open-webui\data\
  📁 vector_db    ← document embeddings (ChromaDB)
  📁 uploads      ← original files
  📄 webui.db     ← chat history (SQLite)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your documents never leave your machine. ChromaDB is completely free and open source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One important limitation&lt;/strong&gt;: RAG finds &lt;em&gt;relevant chunks&lt;/em&gt;, not the entire document. If an answer spans many sections of a large document, it might miss some context. The workaround is to upload smaller, focused documents rather than one giant PDF.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Stack — What I Now Have Running
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Ollama          — model manager and local API server
✅ Gemma4          — the AI model (multimodal, ~12GB)
✅ Claude Code     — agentic coding with local model
✅ Open WebUI      — browser-based chat interface with document upload
✅ Python API      — scripts calling the model directly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total monthly cost: &lt;strong&gt;$0&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;p&gt;After going through all of this, here's the practical split I settled on:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Coding with file editing&lt;/td&gt;
&lt;td&gt;Claude Code + Gemma4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image analysis / image to text&lt;/td&gt;
&lt;td&gt;Open WebUI + Gemma4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document Q&amp;amp;A (private)&lt;/td&gt;
&lt;td&gt;Open WebUI + RAG + Gemma4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web research / current events&lt;/td&gt;
&lt;td&gt;Claude.ai or Perplexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex reasoning / large context&lt;/td&gt;
&lt;td&gt;Claude.ai (paid)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tax doc analysis (all years)&lt;/td&gt;
&lt;td&gt;Claude.ai or NotebookLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quick Python scripts calling AI&lt;/td&gt;
&lt;td&gt;Direct Ollama API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Honest Reflections
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What surprised me&lt;/strong&gt;: How straightforward the setup actually was once I understood the mental model. Ollama is the server, the model is the brain, everything else just connects to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I underestimated&lt;/strong&gt;: The quality gap between local models and Claude Sonnet/Opus is real. For simple tasks Gemma4 is impressive. For complex multi-step reasoning, Claude's frontier models are noticeably stronger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I'd tell myself at the start&lt;/strong&gt;: Local AI is not a replacement for cloud AI — it's a complement. Use local for private, repetitive, or experimental tasks. Use cloud AI for research, complex reasoning, and anything that benefits from a larger context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The privacy win is real&lt;/strong&gt;: For sensitive documents — financial records, personal data, proprietary code — local AI is genuinely better from a privacy standpoint. Your data does not leave your machine. Full stop.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ollama: &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;ollama.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Open WebUI: &lt;a href="https://openwebui.com" rel="noopener noreferrer"&gt;openwebui.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Claude Code: &lt;a href="https://claude.ai/code" rel="noopener noreferrer"&gt;claude.ai/code&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Ollama + Claude Code docs: &lt;a href="https://docs.ollama.com/integrations/claude-code" rel="noopener noreferrer"&gt;docs.ollama.com/integrations/claude-code&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Docker Desktop (free): &lt;a href="https://docker.com/products/docker-desktop" rel="noopener noreferrer"&gt;docker.com/products/docker-desktop&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;All of this runs on a Windows machine with 32GB RAM, an NVIDIA GPU with ~11GB VRAM, and a Core i9 processor. If you have similar hardware you can replicate this entire stack in an afternoon.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
