Inside the pgEdge AI DBA Workbench: How Ellie Actually Works (v1.0.0 Now Available)

#ai #opensource #programming #postgres

TL;DR: The pgEdge AI DBA Workbench is four components on a shared Postgres datastore: a collector, a server, an alerter, and a React client that renders the dashboards and the chat panel where Ellie lives. Ellie is an agentic loop that drives any LLM you choose (Claude, OpenAI, Gemini, Ollama, or anything OpenAI-compatible) through a fixed set of database-aware tool calls. The LLM queries Postgres through those tools - read-only, typed, and auditable - rather than a raw connection. Anomaly detection runs three tiers: z-score baselines, pgvector similarity against historical patterns, and LLM escalation for the residual cases. Source on GitHub under the PostgreSQL Licence.

Most developers I talk to have a similar reaction when somebody tells them an LLM is going to manage their database: a polite nod, followed by a question about how the model is supposed to know the difference between a healthy pg_stat_activity snapshot on a quiet Tuesday and a runaway transaction on a Black Friday morning. It is a fair question, and it is the question we have been trying to answer whilst building the pgEdge AI DBA Workbench and its conversational interface, Ellie. You can run the Workbench as a pure monitoring platform with the AI switched off, and it is a perfectly capable one; but after months of working with Ellie enabled, none of us choose to.

Four Components and One Datastore

The Workbench is four Go and TypeScript services that share a PostgreSQL datastore. The collector daemon connects to each monitored instance and pulls dozens of metric probes covering pg_stat_statements, pg_stat_activity, replication state, and OS-level resource usage. Each probe writes to a dedicated time-series table indexed on connection_id and collection_time, with cadence configurable per scope.

The server exposes a JSON-RPC 2.0 MCP and REST endpoints over HTTPS and brokers every interaction between the LLM and your databases. The alerter runs anomaly detection and threshold evaluation. The client is a React SPA that renders the dashboards, the cluster topology graph, and the chat panel where Ellie lives.

One thing before the tool-use loop: MCP, the Model Context Protocol that Anthropic published in late 2024, is the standard for how LLMs invoke external tools. The Workbench MCP server isn't just Ellie's backend. It exposes the same tool catalogue to any HTTP MCP-compatible client - Claude Code, Cursor, VS Code (GitHub Copilot), Windsurf, or a custom agent - so if you'd rather drive the same diagnostics from your IDE, you can.

Please note: Claude Desktop does not support HTTP transport for MCP servers.

Ellie’s Tool-Use Loop

Ellie is not a separate service; she is an agentic loop implemented in the client that talks to a configurable LLM backend through the server. We currently support Claude, ChatGPT, and local Ollama models, as well as other providers with an OpenAI compatible API such as LM Studio or EXO. The choice of model matters less than people tend to assume because the LLM never queries your database directly. Instead, it returns tool calls that the client dispatches to /api/v1/tools/execute, and the server executes them under the caller’s bearer token against per-token connection pools so that no two users share state.

When you ask Ellie a question, the loop pulls together the same information a human DBA would. She maps the database with get_schema_info, looks up the metric catalogue via describe_probe, and pulls the relevant history with query_metrics. She walks the alert log with get_alert_history, which surfaces acknowledgments and false-positive marks alongside the raw firings, so she can distinguish a recurring noisy rule from a genuinely new symptom. get_blackouts keeps her honest about scheduled maintenance, so a metric spike inside a known blackout is not flagged as an incident. test_query validates any SQL she intends to show against the live planner, and the system prompt forbids her from emitting a statement that has not been through that gate. execute_explain returns plans, and search_knowledgebase gives her semantic access to a curated corpus covering PostgreSQL, the major extensions, and the pgEdge product family, so her answers cite documentation rather than the model’s pre-training. The memory tools (store_memory, recall_memories, delete_memory) let her carry forward the quirks of a particular cluster across conversations.

The same agentic loop drives the AI Analysis button on every chart, alert, and dashboard panel. Clicking it runs a one-shot version of the loop, seeded with the panel’s data, time range, server context, and timeline events in that window; the result drops into a modal you can read or download as Markdown, no conversation needed.

Three-Tier Anomaly Detection

The alerter is the part that earns its keep when nobody is watching. Tier one is plain z-score analysis against rolling baselines computed by get_metric_baselines, which catches the bulk of obvious deviations cheaply. Tier two embeds the current metric window and runs a pgvector similarity search against historical anomalies (stored in plain Postgres tables you can SELECT from yourself), so a pattern resembling last quarter’s connection storm gets flagged before symptoms cascade. Tier three escalates the residual cases to an LLM with the relevant probe context, and only fires when the cheaper tiers cannot classify the signal with confidence. Twenty-six threshold rules run alongside this, scoped per estate, cluster, or server, and feed Slack, Mattermost, webhook, and email notifications.

A Word on Postgres

One detail for the Postgres crowd: the pgvector dependency in tier two is a quiet endorsement of using Postgres as the substrate for the AI itself rather than bolting on a separate vector store. Everything Ellie remembers, every anomaly she has previously classified, and every embedding she searches lives in tables you can SELECT from yourself. Replication coverage is broad: the collector probes pg_stat_replication, pg_replication_slots, and pg_stat_subscription for streaming and logical setups, plus spock_resolutions and spock_exception_log for Spock active-active deployments.

The Workbench is now generally available and the source is on GitHub under the PostgreSQL Licence. Give it a bash without AI enabled and then with, and prepare to be surprised how productive it makes you. If you’d rather have a live demo of the Workbench first, we do those too - book one here.