<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nick Zitzer</title>
    <description>The latest articles on DEV Community by Nick Zitzer (@nickzitzer).</description>
    <link>https://dev.to/nickzitzer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3866542%2F28c5f359-e156-4e73-aca1-2b8628ad4d7d.png</url>
      <title>DEV Community: Nick Zitzer</title>
      <link>https://dev.to/nickzitzer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nickzitzer"/>
    <language>en</language>
    <item>
      <title>How we built a 43-connector CMDB with LLM pattern-learning discovery</title>
      <dc:creator>Nick Zitzer</dc:creator>
      <pubDate>Wed, 08 Apr 2026 16:40:04 +0000</pubDate>
      <link>https://dev.to/nickzitzer/how-we-built-a-43-connector-cmdb-with-llm-pattern-learning-discovery-5cg7</link>
      <guid>https://dev.to/nickzitzer/how-we-built-a-43-connector-cmdb-with-llm-pattern-learning-discovery-5cg7</guid>
      <description>&lt;p&gt;&lt;em&gt;Repo: &lt;a href="https://github.com/Happy-Technologies-LLC/configbuddy" rel="noopener noreferrer"&gt;https://github.com/Happy-Technologies-LLC/configbuddy&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;A few years back, I was working on infrastructure where nobody trusted the CMDB. The data was perpetually stale. Connectors broke after every API update. Nobody used it for decisions — it was a ritual: we had a CMDB because enterprises have CMDBs.&lt;/p&gt;

&lt;p&gt;This post is about what I built to solve that problem, why I made the architectural choices I made, and what I'd do differently if I started over. The project is called ConfigBuddy. It's now open-source under Apache 2.0, and I'm writing this less as an announcement and more as an architecture deep-dive for other people working on infrastructure tooling.&lt;/p&gt;

&lt;p&gt;Skip to the parts you care about:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The problem with traditional CMDB discovery&lt;/li&gt;
&lt;li&gt;Why Neo4j over PostgreSQL for the primary store&lt;/li&gt;
&lt;li&gt;The connector split: 17 TypeScript + 26 JSON&lt;/li&gt;
&lt;li&gt;Pattern-learning discovery: discover once, replay forever&lt;/li&gt;
&lt;li&gt;The identity resolution problem nobody talks about&lt;/li&gt;
&lt;li&gt;Unified credentials with protocol affinity&lt;/li&gt;
&lt;li&gt;The enrichment pipeline (ITIL + TBM + BSM)&lt;/li&gt;
&lt;li&gt;Failure modes and what I'd do differently&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;1. The problem with traditional CMDB discovery&lt;/h2&gt;

&lt;p&gt;If you've worked with a CMDB at enterprise scale, you know the failure modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stale data.&lt;/strong&gt; Discovery runs nightly, weekly, sometimes monthly. By the time anyone looks at the CMDB during an incident, half the records are wrong. The graph of relationships you trust during a P1 is the graph from three weeks ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connector breakage.&lt;/strong&gt; Every external system has an API. Every API changes. Connectors break silently — sometimes for weeks before anyone notices the gap in the data. The maintenance burden of keeping 30+ connectors current is enormous, and the work is invisible until something fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-asset pricing.&lt;/strong&gt; Most commercial CMDB platforms charge per discovered CI. This creates perverse incentives — teams turn off discovery for "low-value" assets, which means the CMDB only sees the things finance approved, not the things that actually exist. Shadow IT thrives in the gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No ownership.&lt;/strong&gt; The CMDB is everybody's data and nobody's job. Without continuous validation, it decays. Without ownership, validation never happens.&lt;/p&gt;

&lt;p&gt;The technology to solve this exists. Discovery tooling is mature. Graph databases are mature. LLM APIs are widely available. The problem isn't missing technology — it's that nobody has put them together with the right &lt;em&gt;economic model&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The economic model that ConfigBuddy is built around: &lt;strong&gt;discovery should be cheap to run and cheap to extend.&lt;/strong&gt; If it's cheap to run, you run it more often, and the data stays fresh. If it's cheap to extend, you cover more systems, and the gaps shrink. Cost is the variable that drives quality, not the other way around.&lt;/p&gt;




&lt;h2&gt;2. Why Neo4j over PostgreSQL for the primary store&lt;/h2&gt;

&lt;p&gt;This was the first major architectural decision and the one I get the most questions about.&lt;/p&gt;

&lt;p&gt;CMDB data is graph-shaped by nature. A configuration item (a server, a database, an application) has relationships to other CIs — runs on, depends on, hosted in, owned by, communicates with. The classic CMDB question — "if this database goes down, what services are affected?" — is a graph traversal problem.&lt;/p&gt;

&lt;p&gt;In a relational model, that traversal looks like a multi-table join. For a moderately complex environment, blast radius analysis can require 10-15 table joins, plus recursive CTEs to handle transitive dependencies. The query is slow, the SQL is unreadable, and the database planner has to make hard choices about index usage.&lt;/p&gt;

&lt;p&gt;In Neo4j with Cypher, the same query is 3-4 hops:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;db:&lt;/span&gt;&lt;span class="n"&gt;Database&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;id:&lt;/span&gt; &lt;span class="s1"&gt;'prod-customers-01'&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:DEPENDS_ON&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;svc:&lt;/span&gt;&lt;span class="n"&gt;BusinessService&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;svc.name&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;svc.criticality&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;svc.criticality&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Three lines. The query planner is built for graph traversal, so it's fast even with millions of nodes. And the query is readable to anyone who understands the data model — you can hand it to an SRE during an incident and they can modify it on the fly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tradeoff:&lt;/strong&gt; Operational overhead. Neo4j is a real database with real ops requirements. Backups are different. Causal clustering for HA is a non-trivial setup. The team learning curve for Cypher is real, even though the language is small. If your team has deep PostgreSQL expertise and zero graph database experience, you'll feel the friction.&lt;/p&gt;

&lt;p&gt;ConfigBuddy uses Neo4j as the primary store but syncs to PostgreSQL (with TimescaleDB) for the analytics data mart. Neo4j is the source of truth for relationships. PostgreSQL is the source of truth for time-series analytics, historical reporting, and the kind of aggregation that graph databases handle poorly. The two stores are kept in sync via an ETL pipeline.&lt;/p&gt;

&lt;p&gt;I'd make the same choice again. The graph traversal economics are decisive for the primary use case (impact analysis), and the PostgreSQL data mart catches the cases where graph queries are the wrong tool.&lt;/p&gt;




&lt;h2&gt;3. The connector split: 17 TypeScript + 26 JSON&lt;/h2&gt;

&lt;p&gt;ConfigBuddy ships with 43 connectors. They fall into two architectural categories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;17 TypeScript connectors&lt;/strong&gt; for systems where discovery requires real business logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS, Azure, GCP — multi-account/multi-subscription enumeration, IAM walking, region iteration&lt;/li&gt;
&lt;li&gt;ServiceNow, Jira — incremental sync with watermarking, complex auth flows&lt;/li&gt;
&lt;li&gt;VMware vSphere, Kubernetes — cluster traversal, parent-child relationship inference&lt;/li&gt;
&lt;li&gt;SCCM, Active Directory — protocol-specific quirks (LDAP paging, SCCM WMI)&lt;/li&gt;
&lt;li&gt;A few more in the same category&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the systems where a JSON config can't capture what discovery actually requires. You need real code to handle pagination quirks, retry logic for rate limits, multi-step OAuth flows, and the messy realities of production APIs.&lt;/p&gt;
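&lt;p&gt;To make "real code" concrete, here's a minimal sketch of the backoff-and-retry shape that rate-limited APIs force on these connectors. The names (&lt;code&gt;withRetry&lt;/code&gt;, &lt;code&gt;RetryOptions&lt;/code&gt;) are illustrative, not ConfigBuddy's actual connector API:&lt;/p&gt;

```typescript
// Illustrative sketch only: the shape of retry logic a code-based connector
// needs for rate-limited APIs. Names here are hypothetical, not ConfigBuddy's.
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
}

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Calls fn; on failure waits baseDelayMs * 2^attempt and tries again.
async function withRetry(fn: any, opts: RetryOptions) {
  let lastError: any;
  for (let attempt = 0; opts.maxAttempts > attempt; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Exponential backoff: 1x, 2x, 4x the base delay, and so on.
      await sleep(opts.baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

&lt;p&gt;A JSON schema can declare &lt;em&gt;that&lt;/em&gt; an endpoint rate-limits; it can't express per-vendor quirks like this cleanly, which is the line between the two connector categories.&lt;/p&gt;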

&lt;p&gt;&lt;strong&gt;26 JSON declarative connectors&lt;/strong&gt; for systems with well-documented REST/GraphQL APIs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"github"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"auth"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bearer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"credentialKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"github_token"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"endpoints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"repositories"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/user/repos"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"pagination"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"link_header"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ciType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"code-repository"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"fieldMapping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.full_name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"owner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.owner.login"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$.language"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a working GitHub connector in ~20 lines of JSON. The framework handles authentication, pagination, rate limiting, error handling, retry logic, and field mapping. Adding a new JSON connector for a system I haven't covered yet takes hours, not days.&lt;/p&gt;
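&lt;p&gt;Under the hood, the field-mapping step of a declarative connector can be sketched like this, assuming a simplified dot-path subset of JSONPath. &lt;code&gt;resolvePath&lt;/code&gt; and &lt;code&gt;mapRecord&lt;/code&gt; are hypothetical names for illustration:&lt;/p&gt;

```typescript
// Sketch of the fieldMapping step of a declarative connector, supporting a
// simplified "$.dot.path" subset of JSONPath. Names are illustrative.
interface FieldMapping { [ciField: string]: string; }

// Resolve a "$.a.b" path against one raw API object.
function resolvePath(obj: any, path: string): any {
  const parts = path.replace('$.', '').split('.');
  return parts.reduce((cur: any, key: string) => (cur == null ? undefined : cur[key]), obj);
}

// Turn one raw API record into a CI using the connector's fieldMapping block.
function mapRecord(raw: any, mapping: FieldMapping, ciType: string) {
  const ci: any = { ciType };
  for (const field of Object.keys(mapping)) {
    ci[field] = resolvePath(raw, mapping[field]);
  }
  return ci;
}
```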

&lt;p&gt;&lt;strong&gt;Why the split?&lt;/strong&gt; Because pretending all connectors are the same is how connector libraries become unmaintainable. The systems that need code, need code. The systems that don't, shouldn't. By drawing the line explicitly, the maintenance burden of the JSON connectors collapses to almost nothing — they're configuration, not software.&lt;/p&gt;

&lt;p&gt;The JSON declarative framework is the part of the codebase I'm most likely to extract into a standalone repo. It's reusable well beyond ConfigBuddy, and other projects could benefit from it.&lt;/p&gt;




&lt;h2&gt;4. Pattern-learning discovery: discover once, replay forever&lt;/h2&gt;

&lt;p&gt;This is the part of ConfigBuddy that I think is genuinely novel, and the part that took the longest to get right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The setup:&lt;/strong&gt; Imagine you want to discover everything in a customer's AWS account, but you don't know what's in there. Traditional discovery runs a fixed strategy — call EC2 DescribeInstances, then S3 ListBuckets, then RDS DescribeDBInstances, etc. That strategy works for AWS, but it's hand-coded by an engineer who knew what to look for.&lt;/p&gt;

&lt;p&gt;What if you didn't know what to look for? What if the system was something custom, or a niche cloud, or a homegrown internal platform? The traditional answer is "write a new connector," which takes engineering time you don't have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pattern-learning approach:&lt;/strong&gt; On first discovery of an unknown system, ConfigBuddy hands the problem to an LLM (Anthropic Claude or OpenAI, configurable). The LLM is given the system's API documentation (or, failing that, the API responses themselves) and asked to figure out a discovery strategy.&lt;/p&gt;

&lt;p&gt;The LLM produces a strategy — &lt;em&gt;"call this endpoint, paginate this way, extract these fields, map them to these CI types."&lt;/em&gt; This is captured as a &lt;strong&gt;pattern&lt;/strong&gt;: an actual TypeScript code string, stored in PostgreSQL, executed in a sandboxed VM (vm2) on subsequent runs.&lt;/p&gt;
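&lt;p&gt;The replay mechanism can be sketched in a few lines. ConfigBuddy uses vm2; the sketch below substitutes Node's built-in &lt;code&gt;node:vm&lt;/code&gt; module so it stays self-contained. Note that &lt;code&gt;node:vm&lt;/code&gt; on its own is not a security boundary:&lt;/p&gt;

```typescript
// Sketch of pattern replay: a learned pattern is a stored code string,
// evaluated against a fresh context on every run. ConfigBuddy uses vm2;
// Node's built-in vm module is used here so the sketch is self-contained
// (node:vm alone is NOT a hardened sandbox -- illustration only).
import { createContext, runInContext } from 'node:vm';

// A toy "learned" pattern: enumerate items and map them to CI stubs.
const storedPattern = `
  const cis = [];
  for (const item of api.listItems()) {
    cis.push({ id: item.id, ciType: 'demo-resource' });
  }
  cis;
`;

// Replay: evaluate the stored strategy with the live API handle injected.
function replayPattern(pattern: string, api: any) {
  const context = createContext({ api });
  // The completion value of the last statement is the replay result.
  return runInContext(pattern, context, { timeout: 1000 });
}
```

&lt;p&gt;Because the context is rebuilt on every run, replay discovers the system's &lt;em&gt;current&lt;/em&gt; state; only the strategy is cached.&lt;/p&gt;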

&lt;p&gt;The economics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;LLM cost&lt;/th&gt;
&lt;th&gt;Discovery cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;First discovery&lt;/td&gt;
&lt;td&gt;~$0.40 (one-time)&lt;/td&gt;
&lt;td&gt;LLM API call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Second discovery&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Pattern replay (sandboxed code execution)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Third discovery&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Pattern replay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nth discovery&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;Pattern replay&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You pay once to learn how to discover a system type, then run forever for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; Most discovery operations are highly repetitive. The same AWS account looks the same on Tuesday as it did on Monday. The same Kubernetes cluster has the same shape. The LLM isn't doing novel reasoning every run — it's doing novel reasoning &lt;em&gt;once&lt;/em&gt; and then we cache the result as executable code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's not just caching:&lt;/strong&gt; A naive cache stores the &lt;em&gt;results&lt;/em&gt; of discovery. Pattern learning stores the &lt;em&gt;strategy&lt;/em&gt;. When the underlying system changes (new resources, new accounts, new namespaces), pattern replay still works because the strategy is parameterized — it's discovering the current state, not replaying the previous state. Only when the API itself changes does the pattern need to be regenerated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern versioning:&lt;/strong&gt; When an API change breaks a pattern replay, the system detects the failure (sandboxed execution fails or returns malformed data), regenerates the pattern via LLM, and stores it as a new version. Old patterns are kept for rollback.&lt;/p&gt;

&lt;p&gt;In my own usage, after the initial learning period, the vast majority of discovery runs are pattern replays with zero LLM cost. The first month of a new environment might cost $20-50 in LLM calls. Every subsequent month is essentially free.&lt;/p&gt;




&lt;h2&gt;5. The identity resolution problem nobody talks about&lt;/h2&gt;

&lt;p&gt;This was 30% of v2 development time. It should have been day-one architecture. Most CMDB conversations focus on discovery; identity resolution is where production deployments actually break.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Your AWS connector finds an EC2 instance with ID &lt;code&gt;i-0a1b2c3d&lt;/code&gt;. Your ServiceNow connector finds a CI with serial number &lt;code&gt;ABC123XYZ&lt;/code&gt;. Your network scanner finds a host at &lt;code&gt;10.20.30.40&lt;/code&gt; with hostname &lt;code&gt;web-prod-01&lt;/code&gt;. These are all the same server. How does the CMDB know that?&lt;/p&gt;

&lt;p&gt;If you don't solve this, you get duplicate CIs everywhere. Three records for one server. Twelve records for one application. Impact analysis becomes meaningless because the graph thinks these are different things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ConfigBuddy's approach: a six-tier matching cascade.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a new CI arrives from a connector, the identity resolver checks for matches in this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;External ID match&lt;/strong&gt; — does the new CI's connector-assigned ID match an existing CI's external ID for the same source? (e.g., AWS instance ID &lt;code&gt;i-0a1b2c3d&lt;/code&gt; already known)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serial number match&lt;/strong&gt; — does the new CI's serial match an existing CI's serial? (Hardware serials are highly reliable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UUID match&lt;/strong&gt; — does the new CI's UUID match? (Common for VMs, less common for physical hardware)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MAC address match&lt;/strong&gt; — does any MAC on the new CI match any MAC on an existing CI? (MAC addresses are reasonably stable for physical NICs but can be ephemeral for cloud)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FQDN match&lt;/strong&gt; — does the fully-qualified domain name match? (Usually reliable but vulnerable to DNS misconfiguration)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composite fuzzy match&lt;/strong&gt; — does a combination of hostname, IP, OS, and other fields meet a similarity threshold? (Last resort, lowest confidence)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first match wins. If no tier matches, it's treated as a new CI.&lt;/p&gt;
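&lt;p&gt;The cascade reduces to an ordered list of matcher functions. This is a simplified sketch, not ConfigBuddy's actual schema, and the fuzzy tier is stubbed out:&lt;/p&gt;

```typescript
// Sketch of the six-tier cascade as an ordered list of matcher functions.
// The CI shape and tiers are simplified; the composite fuzzy tier is omitted.
interface CI {
  id: string;
  externalId?: string;
  serial?: string;
  uuid?: string;
  macs?: string[];
  fqdn?: string;
}

type Matcher = (incoming: CI, existing: CI) => boolean;

const tiers: Matcher[] = [
  // 1. Connector-assigned external ID
  (a, b) => (a.externalId ? a.externalId === b.externalId : false),
  // 2. Hardware serial number
  (a, b) => (a.serial ? a.serial === b.serial : false),
  // 3. UUID
  (a, b) => (a.uuid ? a.uuid === b.uuid : false),
  // 4. Any shared MAC address
  (a, b) => (a.macs ?? []).some((m) => (b.macs ?? []).includes(m)),
  // 5. Fully-qualified domain name
  (a, b) => (a.fqdn ? a.fqdn === b.fqdn : false),
  // 6. Composite fuzzy match would go here (last resort, lowest confidence)
];

// First match wins; no match means the incoming record is a new CI.
function resolveIdentity(incoming: CI, existing: CI[]): CI | null {
  for (const tier of tiers) {
    for (const candidate of existing) {
      if (tier(incoming, candidate)) return candidate;
    }
  }
  return null;
}
```

&lt;p&gt;Ordering the tiers by reliability means a cheap, high-confidence match short-circuits before the expensive fuzzy logic ever runs.&lt;/p&gt;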

&lt;p&gt;&lt;strong&gt;Source authority ranking:&lt;/strong&gt; When two sources disagree about the same field (AWS says the instance type is &lt;code&gt;t3.large&lt;/code&gt;, ServiceNow says &lt;code&gt;t3.medium&lt;/code&gt;), ConfigBuddy uses configurable source authority rules to decide which wins. By default: ServiceNow outranks Nmap, AWS outranks SSH, vendor-managed sources outrank inferred sources. Authority rules are per-field, so AWS can be authoritative for &lt;code&gt;instanceType&lt;/code&gt; while ServiceNow is authoritative for &lt;code&gt;owner&lt;/code&gt; and &lt;code&gt;costCenter&lt;/code&gt;.&lt;/p&gt;
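&lt;p&gt;Per-field authority is essentially a merge that walks each field's ranked source list and takes the first value it finds. A minimal sketch, with illustrative rules and shapes:&lt;/p&gt;

```typescript
// Sketch of per-field source authority: each field takes its value from the
// highest-ranked source that reports one. Rules and shapes are illustrative.
interface AuthorityRules { [field: string]: string[]; } // best source first

interface Observation {
  source: string;
  fields: { [field: string]: any };
}

function mergeByAuthority(observations: Observation[], rules: AuthorityRules) {
  const merged: { [field: string]: any } = {};
  for (const field of Object.keys(rules)) {
    // Walk the ranked sources; stop at the first one that reports this field.
    for (const source of rules[field]) {
      const obs = observations.find((o) => o.source === source);
      if (obs) {
        if (field in obs.fields) {
          merged[field] = obs.fields[field];
          break;
        }
      }
    }
  }
  return merged;
}
```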

&lt;p&gt;&lt;strong&gt;Why this is hard:&lt;/strong&gt; Every tier has edge cases. MAC addresses spoof. UUIDs collide across hypervisors. Hostnames are reused. DNS is wrong. Source authority rules conflict. The identity resolver has to handle all of this without losing data and without creating false matches.&lt;/p&gt;

&lt;p&gt;The thing I'd tell anyone building a CMDB: &lt;strong&gt;start with identity resolution, not discovery.&lt;/strong&gt; You can always add more discovery sources. You can never recover from bad identity resolution because the graph is permanently corrupted by bad joins.&lt;/p&gt;




&lt;h2&gt;6. Unified credentials with protocol affinity&lt;/h2&gt;

&lt;p&gt;A CMDB platform needs credentials for everything it discovers. AWS needs an IAM access key. ServiceNow needs OAuth or basic auth. Active Directory needs an LDAP bind user. SSH needs a private key. SNMP needs a community string. Multiply by 43 connectors and you have a credential management problem.&lt;/p&gt;

&lt;p&gt;ConfigBuddy uses a &lt;strong&gt;unified credential vault&lt;/strong&gt; with &lt;strong&gt;protocol affinity matching&lt;/strong&gt;. Instead of associating each credential with a specific connector, credentials are stored with a protocol type (&lt;code&gt;aws-iam&lt;/code&gt;, &lt;code&gt;oauth2&lt;/code&gt;, &lt;code&gt;bearer&lt;/code&gt;, &lt;code&gt;basic&lt;/code&gt;, &lt;code&gt;ssh-key&lt;/code&gt;, &lt;code&gt;ldap&lt;/code&gt;, &lt;code&gt;snmp-v3&lt;/code&gt;) and metadata about which systems they're authorized for.&lt;/p&gt;

&lt;p&gt;When a connector needs to authenticate, it asks the credential vault: &lt;em&gt;"Give me a credential of type &lt;code&gt;oauth2&lt;/code&gt; that's authorized for &lt;code&gt;servicenow.cleveland.example.com&lt;/code&gt;."&lt;/em&gt; The vault returns the appropriate credential. If multiple credentials match, the most-specific match wins (system-specific beats organization-wide).&lt;/p&gt;
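&lt;p&gt;The lookup itself is simple to sketch: filter by protocol, then prefer the most specific scope. The &lt;code&gt;Credential&lt;/code&gt; shape below is hypothetical:&lt;/p&gt;

```typescript
// Sketch of protocol-affinity credential lookup: filter by protocol type,
// then prefer the most specific scope. Shapes and names are illustrative.
interface Credential {
  protocol: string;   // e.g. 'oauth2', 'ssh-key', 'snmp-v3'
  scope: string;      // '*' for org-wide, or a specific hostname
  secretRef: string;  // pointer into the vault, never the raw secret
}

function findCredential(vault: Credential[], protocol: string, system: string) {
  const candidates = vault
    .filter((c) => c.protocol === protocol)
    .filter((c) => c.scope === system || c.scope === '*');
  // Most specific wins: an exact system match beats an org-wide '*'.
  const exact = candidates.find((c) => c.scope === system);
  return exact ?? candidates.find((c) => c.scope === '*') ?? null;
}
```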

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; Credential rotation. When a credential is rotated, you update it in one place — the vault — and every connector that uses it picks up the new value automatically. No connector code touches credentials directly. The vault is the only thing that knows the secret values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The honest disclosure:&lt;/strong&gt; Right now, the vault stores credentials as base64-encoded strings, not encrypted. This is the most embarrassing TODO in the codebase, and it's the top item on the v3.1 roadmap. For homelab use on a trusted network, that may be an acceptable risk, but be clear about what it is: base64 is obfuscation, not encryption, and anyone who can read the database can decode it (though at that point you have bigger problems). For production deployment beyond a trusted network, wait for the AES-256 implementation or put your own encryption layer in front of the vault.&lt;/p&gt;

&lt;p&gt;I'm telling you this in the architecture post because if you decide to deploy ConfigBuddy and you find out about the credential issue from someone else, you'll feel misled. I'd rather you find out from me.&lt;/p&gt;




&lt;h2&gt;7. The enrichment pipeline: ITIL + TBM + BSM at discovery time&lt;/h2&gt;

&lt;p&gt;Most CMDBs treat discovery and enrichment as separate processes. Discovery finds the CIs. Enrichment runs later — sometimes much later — to add business context. By the time enrichment finishes, the CIs are already stale.&lt;/p&gt;

&lt;p&gt;ConfigBuddy runs three enrichment engines inline with discovery:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ITIL v4 service management.&lt;/strong&gt; Every CI is classified into ITIL service categories at ingestion. Configuration items, service assets, change records — all populated as the discovery pipeline runs, not as a batch job overnight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TBM v5 cost transparency.&lt;/strong&gt; Cost attribution happens at discovery time. AWS resources are tagged with cost data from Cost Explorer; on-prem resources are attributed to cost centers via the source authority rules. By the time a CI lands in the graph, you can query "what does this service cost per month?" without a separate cost analytics run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BSM impact analysis.&lt;/strong&gt; Business service relationships are inferred from the technical graph. If a database CI is tagged with &lt;code&gt;application=customer-portal&lt;/code&gt;, the enrichment engine creates a &lt;code&gt;SUPPORTS&lt;/code&gt; relationship from the database to the customer-portal business service. Blast radius is queryable from day one.&lt;/p&gt;
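&lt;p&gt;The tag-to-relationship inference can be sketched as a pure function from tagged CIs to edges. The edge shape here is illustrative, not the actual graph schema:&lt;/p&gt;

```typescript
// Sketch of tag-driven BSM inference: any CI tagged application=X gets a
// SUPPORTS edge to business service X. Edge shape is illustrative only.
interface TaggedCI {
  id: string;
  tags: { [key: string]: string };
}

function inferSupportsEdges(cis: TaggedCI[]) {
  const edges: { from: string; rel: string; to: string }[] = [];
  for (const ci of cis) {
    const app = ci.tags['application'];
    if (app) {
      edges.push({ from: ci.id, rel: 'SUPPORTS', to: 'svc:' + app });
    }
  }
  return edges;
}
```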

&lt;p&gt;The reason these run inline rather than as batch jobs: &lt;strong&gt;freshness compounds&lt;/strong&gt;. If discovery runs hourly and enrichment runs nightly, your business context can be almost a full day stale. If enrichment runs inline, business context is exactly as fresh as discovery. For incident response, this is the difference between a useful CMDB and a decorative one.&lt;/p&gt;




&lt;h2&gt;8. Failure modes and what I'd do differently&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What I'd do the same:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Neo4j as primary store. Graph traversal economics are decisive.&lt;/li&gt;
&lt;li&gt;Pattern-learning discovery. The economics are too good to ignore.&lt;/li&gt;
&lt;li&gt;The connector split. JSON declarative connectors are the only way to keep maintenance sustainable.&lt;/li&gt;
&lt;li&gt;Inline enrichment. Freshness compounds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What I'd do differently:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identity resolution first.&lt;/strong&gt; I built discovery first and bolted identity resolution on later. It's the wrong order. Start with identity resolution architecture, then layer discovery on top.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential encryption from day one.&lt;/strong&gt; I deferred this and it became the most embarrassing TODO in the codebase. AES-256 from the start would have taken half a day; retrofitting it now is a multi-week project because so many components touch credentials.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus metrics from day one.&lt;/strong&gt; ConfigBuddy emits structured JSON logs but exposes no Prometheus endpoint. Anyone running it in production with a Prometheus + Grafana stack has to write a custom exporter. This is on the v3 roadmap but it should have been v1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better pattern versioning UX.&lt;/strong&gt; Pattern versioning works but the UI for reviewing pattern diffs and approving new versions is minimal. It needs to be first-class.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connector test harness.&lt;/strong&gt; Each connector has tests, but there's no standardized harness for testing connectors against synthetic API responses. Adding one would make community contributions much easier.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes I've had to design around:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM pattern learning is probabilistic. First-run discoveries can vary slightly based on how the LLM interprets ambiguous API responses. Pattern replay (deterministic) is the primary run path after the learning phase for exactly this reason.&lt;/li&gt;
&lt;li&gt;Neo4j is the source of truth. If the graph goes down, discovery results buffer in Kafka until it recovers. Recovery is automatic but slow under sustained backpressure.&lt;/li&gt;
&lt;li&gt;Pattern versioning edge cases. When an API change breaks a pattern replay, "latest wins" isn't always correct — sometimes the change affects one field mapping, not the whole strategy. Manual review is sometimes needed.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;ConfigBuddy is open-source under Apache 2.0. The repo is at &lt;strong&gt;&lt;a href="https://github.com/Happy-Technologies-LLC/configbuddy" rel="noopener noreferrer"&gt;https://github.com/Happy-Technologies-LLC/configbuddy&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I'm not building a company around it. I'm releasing it as a credibility artifact and a research contribution to the CMDB conversation. If it's useful to you, star it, fork it, contribute to it. If you want commercial support, custom connectors, or implementation help, reach out to &lt;a href="mailto:commercial@happy-tech.biz"&gt;commercial@happy-tech.biz&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The two areas I most want feedback on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identity resolution.&lt;/strong&gt; How are you handling multi-source CI matching? What edge cases have you hit that the six-tier cascade doesn't cover?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern learning.&lt;/strong&gt; Has anyone else tried this approach? I've read a lot of CMDB literature and haven't seen the LLM-compile-and-replay pattern documented elsewhere. If you've seen it or built something similar, I'd love to compare notes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Drop questions in &lt;a href="https://github.com/Happy-Technologies-LLC/configbuddy/discussions" rel="noopener noreferrer"&gt;GitHub Discussions&lt;/a&gt; or find me on LinkedIn. The best ideas in ConfigBuddy came from people telling me what was wrong with the previous version.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;ConfigBuddy is built and maintained by Happy Technologies LLC. Apache 2.0 licensed. CLA required for contributions.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>cmdb</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
