Adding an LLM to your application usually means writing the same infrastructure over and over: define JSON schemas for each tool, dispatch tool calls, drive the agentic loop, wire up a vector store, manage embedding calls, handle session state. Before you know it the actual feature is buried under plumbing.
Daimon is a Go sidecar that takes a different approach. Drop the binary next to your app, write a YAML config, and you get a fully operational LLM endpoint — with vector search, graph queries, session memory, and a complete agentic loop — without writing any of that wiring yourself.
The key idea: when you declare a vector store or graph database in the config, Daimon auto-generates LLM tools for it ({name}_search, {name}_cypher, etc.) and injects them into every LLM call. The model can use them immediately. You named the component; you got the tools.
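Daimon's generator itself isn't shown in this post, but judging from the tool names that appear later in the walkthrough (icd10_search, icd10_upsert, icd10_graph_cypher), the naming convention can be sketched roughly like this. This is an illustration of the convention, not Daimon's actual code:

```python
def generated_tool_names(name: str, component_kind: str) -> list[str]:
    """Sketch of Daimon's tool-naming convention (illustrative only).

    Declared component names map directly to tool names; hyphens appear
    to become underscores so the result is a valid tool identifier.
    """
    safe = name.replace("-", "_")
    if component_kind == "vector_store":
        return [f"{safe}_search", f"{safe}_upsert"]
    if component_kind == "graph_store":
        return [f"{safe}_cypher"]
    return []

print(generated_tool_names("icd10", "vector_store"))      # ['icd10_search', 'icd10_upsert']
print(generated_tool_names("icd10-graph", "graph_store")) # ['icd10_graph_cypher']
```

Declare a component named `icd10`, and the model sees `icd10_search` on its next call; no registration code on your side.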
To show what this looks like end-to-end, we'll build a medical chart ICD-10 coding pipeline. Given a medical transcription, it will:
- Semantically search a vector store of 46k ICD-10 codes for candidates
- Verify each candidate against a Neo4j taxonomy graph
- Return only confirmed codes, with confidence scores grounded in the graph — not the model
The entire LLM + tools setup fits in one YAML file. The Python client is about 60 lines. Let's build it.
What you'll need:
- Docker
- Python 3.11+
- An Anthropic API key (or swap to a fully local model — covered at the end)
- The Daimon binary
Step 1: Clone the Repo
git clone https://github.com/sonicboom15/medchart
cd medchart
The repo contains everything: the Daimon config, the ICD-10 loader, and the coding pipeline. Here's the layout:
daimon-config/config.yaml ← the entire LLM + tools setup
docker-compose.yml ← Neo4j, Qdrant, Ollama
icd10/loader.py ← CMS XML parser → Qdrant + Neo4j
pipeline/coder.py ← Daimon client → CodeAssignment[]
pipeline/ingest.py ← batch processor for MTSamples CSV
Step 2: Install Daimon
Pick the method for your platform:
macOS / Linux — Homebrew
brew tap sonicboom15/tap
brew install daimon
Windows — winget
winget install sonicboom15.daimon
Windows — Scoop
scoop bucket add sonicboom15 https://github.com/sonicboom15/scoop-bucket
scoop install daimon
Debian / Ubuntu
# Download the .deb from https://github.com/sonicboom15/daimon/releases/latest
sudo dpkg -i daimon_*_linux_amd64.deb
RHEL / Fedora
# Download the .rpm from https://github.com/sonicboom15/daimon/releases/latest
sudo rpm -i daimon_*_linux_amd64.rpm
Build from source (requires Go 1.23+)
git clone https://github.com/sonicboom15/daimon.git
cd daimon && make build
# binary at ./bin/daimon
Verify the install:
daimon --version
Step 3: Start the Infrastructure
docker compose up -d
This starts:
- Qdrant on port 6333 — vector store for ICD-10 code embeddings
- Neo4j on ports 7474 (browser) and 7687 (Bolt) — taxonomy graph
- Ollama on port 11434 — local embedding model

Remove the deploy: GPU block from docker-compose.yml if you don't have an NVIDIA GPU.
Pull the embedding model into Ollama:
docker exec -it medchart-ollama-1 ollama pull nomic-embed-text
This is the only model Ollama needs. The LLM itself uses the Anthropic API.
Step 4: Understand the Config
Open daimon-config/config.yaml. This single file is the entire pipeline setup:
components:
  # Local embedder via Ollama (OpenAI-compatible endpoint)
  - name: embedder
    type: embedding/openai
    metadata:
      base_url: http://localhost:11434/v1
      api_key: local
      model: nomic-embed-text
      dimensions: "768"

  # 46k ICD-10 code descriptions in Qdrant
  - name: icd10
    type: qdrant
    metadata:
      base_url: http://localhost:6333
      collection: icd10
      create_if_missing: "true"
      embedder: embedder
      dimensions: "768"

  # ICD-10 taxonomy hierarchy in Neo4j
  - name: icd10-graph
    type: neo4j
    metadata:
      bolt_url: bolt://localhost:7687
      username: neo4j
      password: medchart

  # Claude Sonnet — api_key read from ANTHROPIC_API_KEY env var
  - name: coder
    type: anthropic
    metadata:
      default_model: claude-sonnet-4-6
    defaults:
      temperature: 0.1
      max_tokens: 1024
      system: |
        You are a clinical coding assistant. Given a medical chart excerpt, identify
        and assign the correct ICD-10-CM codes using this exact two-phase workflow:

        PHASE 1 — Search:
        Call icd10_search for each distinct clinical concept in the note.
        Issue all searches in a single parallel batch.

        PHASE 2 — Verify:
        For each candidate code, call icd10_graph_cypher:
          MATCH (n:Code {code: $code}) RETURN n.code, n.short_desc
        Issue all verifications in a single parallel batch.
        Only include codes where this query returns a result.

        Return ONLY a JSON array — no prose, no markdown:
        [{"code": "J30.1", "description": "Allergic rhinitis due to pollen"}]
A few things worth understanding here:
Wiring order matters. Daimon resolves components top to bottom: embedders first, then vector stores (which can reference an embedder by name via embedder: embedder), then graph stores, then LLMs.
Auto-generated tools. Because you declared icd10 (Qdrant) and icd10-graph (Neo4j), the coder LLM automatically gets these tools on every call — no code required:
- icd10_search — embed a query and search Qdrant
- icd10_upsert — insert or update a code
- icd10_graph_cypher — run any Cypher query against Neo4j (the hyphen in the icd10-graph component name becomes an underscore in the tool name)
The system prompt references icd10_search and icd10_graph_cypher by name. That's all the wiring there is.
Batched tool calls. The system prompt tells the model to issue all searches in one batch and all verifications in another. Daimon executes concurrent tool calls in parallel, so two waves of 8 calls run about as fast as two single calls.
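Because the prompt demands a bare JSON array as the final answer, the client side can parse the model's last text chunk directly. A minimal sketch; the fence-stripping fallback is my own defensive addition (smaller local models sometimes wrap JSON in markdown despite instructions), not necessarily what the repo does:

```python
import json

def parse_code_array(text: str) -> list[dict]:
    """Parse the model's JSON-only reply into a list of code dicts."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line, then the trailing closing fence.
        cleaned = cleaned.split("\n", 1)[1]
        cleaned = cleaned.rsplit("```", 1)[0]
    return json.loads(cleaned)

print(parse_code_array('[{"code": "J30.1", "description": "Allergic rhinitis due to pollen"}]'))
```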
Step 5: Start Daimon
export ANTHROPIC_API_KEY=sk-ant-...
daimon serve --config daimon-config/config.yaml
You should see all four components register:
INFO registered embedder name=embedder type=embedding/openai
INFO registered vector store name=icd10 type=qdrant
INFO registered graph store name=icd10-graph type=neo4j
INFO registered LLM component name=coder type=anthropic
INFO daimon listening addr=127.0.0.1:3500
Step 6: Load the ICD-10 Data
Download the CMS 2025 ICD-10-CM tabular XML:
https://www.cms.gov/files/zip/2025-code-tables-tabular-and-index.zip
Extract icd10cm_tabular_2025.xml into the data/ folder. Then install dependencies and run the loader:
pip install -e ".[dev]"
python -m icd10.loader data/icd10cm_tabular_2025.xml
The loader does three things:

1. Parses the XML. The CMS file is a nested hierarchy of chapters → sections → diagnosis codes → subcodes. The parser walks this tree recursively and produces a flat list of ICD10Code objects.
2. Loads into Qdrant. Each code is upserted via Daimon's /v1/memory/icd10 endpoint with the full description as content and the code, category, block, and chapter as metadata. IDs are deterministic uuid5 hashes of the code string, so re-runs are idempotent.
3. Loads into Neo4j. The taxonomy becomes a graph via Daimon's /v1/graph/icd10-graph/cypher endpoint:

Chapter -[:PART_OF]-> Block -[:PART_OF]-> Category -[:IS_A]-> Code

All codes get a :Code label; 3-character codes (like J30, E66) also get :Category. Everything is batched into groups of 500 using Cypher UNWIND to keep it fast and memory-efficient.
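The idempotent IDs and the UNWIND batching can be sketched in a few lines. The namespace constant and helper names here are illustrative, not lifted from icd10/loader.py:

```python
import uuid

# Any fixed namespace works; the point is that the same code string always
# yields the same UUID, so re-running the loader overwrites instead of duplicating.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "icd10")

def point_id(code: str) -> str:
    """Deterministic uuid5 ID for a code string."""
    return str(uuid.uuid5(NAMESPACE, code))

def chunked(items: list, size: int = 500):
    """Yield successive batches for one UNWIND statement each."""
    for i in range(0, len(items), size):
        yield items[i : i + size]

# One parameterized query per batch of 500 rows, instead of one query per code:
UPSERT_CYPHER = """
UNWIND $rows AS row
MERGE (c:Code {code: row.code})
SET c.short_desc = row.short_desc
"""

assert point_id("J30.1") == point_id("J30.1")  # re-runs produce the same ID
```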
This takes about 10 minutes. When it finishes, verify the counts:
# Vector store: should show ~46k points
curl http://localhost:6333/collections/icd10 | python -m json.tool
Open the Neo4j browser at http://localhost:7474 (user: neo4j, password: medchart) and run:
MATCH (n:Code) RETURN count(n)
Step 7: Run the Coder
Download mtsamples.csv from Kaggle and save it to data/. Then:
python -m pipeline.ingest --csv data/mtsamples.csv --out data/coded.jsonl --limit 10
You'll see the agentic tool calls stream in real time as Daimon drives the loop:
Allergic Rhinitis
→ icd10_search allergic rhinitis seasonal
→ icd10_search asthma mild intermittent
→ icd10_graph_cypher J30.9
→ icd10_graph_cypher J45.20
✓ J30.9 Allergic rhinitis, unspecified (100%)
✓ J45.20 Mild intermittent asthma, uncomplicated (100%)
Laparoscopic Gastric Bypass Consult
→ icd10_search morbid obesity BMI
→ icd10_search gastroesophageal reflux
→ icd10_search knee pain bilateral
→ icd10_search low back pain
→ icd10_search nicotine dependence cigarettes
→ icd10_search allergy penicillin
→ icd10_graph_cypher E66.01
→ icd10_graph_cypher K21.9
→ icd10_graph_cypher M25.561
→ icd10_graph_cypher M25.562
→ icd10_graph_cypher M54.50
→ icd10_graph_cypher F17.210
→ icd10_graph_cypher Z88.0
✓ E66.01 Morbid (severe) obesity due to excess calories (100%)
✓ K21.9 Gastro-esophageal reflux disease w/o esophagitis (100%)
✓ M25.561 Pain in right knee (100%)
✓ M25.562 Pain in left knee (100%)
✓ M54.50 Low back pain, unspecified (100%)
✓ F17.210 Nicotine dependence, cigarettes, uncomplicated (100%)
✓ Z88.0 Allergy status to penicillin (100%)
Each record in coded.jsonl looks like this:
{
  "sample_name": "Laparoscopic Gastric Bypass Consult",
  "specialty": "Bariatrics",
  "codes": [
    {"code": "E66.01",  "description": "Morbid (severe) obesity due to excess calories", "confidence": 1.0, "verified": true},
    {"code": "K21.9",   "description": "Gastro-esophageal reflux disease without esophagitis", "confidence": 1.0, "verified": true},
    {"code": "M25.561", "description": "Pain in right knee", "confidence": 1.0, "verified": true},
    {"code": "M25.562", "description": "Pain in left knee", "confidence": 1.0, "verified": true},
    {"code": "M54.50",  "description": "Low back pain, unspecified", "confidence": 1.0, "verified": true},
    {"code": "F17.210", "description": "Nicotine dependence, cigarettes, uncomplicated", "confidence": 1.0, "verified": true},
    {"code": "Z88.0",   "description": "Allergy status to penicillin", "confidence": 1.0, "verified": true}
  ]
}
Notice M25.561 and M25.562 — bilateral knee pain correctly split into separate laterality codes. That's proper ICD-10 specificity, and the model inferred it from the clinical language without any special instruction.
The confidence scores come from the graph verification step in pipeline/coder.py, not from the model's own calibration:
- 1.0 — exact code confirmed in Neo4j. Use it.
- 0.5 — category exists but the specific subcode does not. Flag for human review.
- 0.0 — not found at all. Hallucinated. Discard.
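That three-tier grading boils down to a small decision function. A sketch of the idea, with exists_in_graph standing in for an icd10_graph_cypher lookup (the helper name is hypothetical; see pipeline/coder.py for the real logic):

```python
def score_code(code: str, exists_in_graph) -> float:
    """Grade a candidate ICD-10 code against the taxonomy graph.

    exists_in_graph(code) stands in for a Cypher lookup like:
      MATCH (n:Code {code: $code}) RETURN n.code
    """
    if exists_in_graph(code):
        return 1.0   # exact code confirmed: use it
    category = code.split(".")[0]
    if category != code and exists_in_graph(category):
        return 0.5   # category exists, subcode doesn't: flag for human review
    return 0.0       # not in the taxonomy at all: hallucinated, discard

# Toy graph containing a category and one subcode:
known = {"J30", "J30.9"}
lookup = lambda c: c in known
print(score_code("J30.9", lookup))   # 1.0
print(score_code("J30.4", lookup))   # 0.5
print(score_code("Q99.999", lookup)) # 0.0
```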
Bonus: Run Fully Local (No API Key)
Change the coder component in daimon-config/config.yaml to point at Ollama's OpenAI-compatible endpoint:

- name: coder
  type: llamacpp                  # was: anthropic
  metadata:
    base_url: http://localhost:11434/v1
    api_key: local
    default_model: qwen2.5:7b     # was: claude-sonnet-4-6
Pull the model:
docker exec -it medchart-ollama-1 ollama pull qwen2.5:7b
Restart Daimon. Everything else — the Qdrant store, the Neo4j graph, the Python scripts — stays identical. qwen2.5:7b is accurate enough for most records during development. qwen2.5:14b gets you meaningfully closer to Claude quality if you have the VRAM.
What's Actually Happening
It's worth stepping back to see what Daimon is doing behind the scenes, because the Python client code is surprisingly short:
with daimon.Client(base_url="http://localhost:3500") as client:
    for chunk in client.llm("coder").converse(
        messages=[{"role": "user", "content": chart_text}]
    ):
        if chunk.type == "text":
            text_chunks.append(chunk.text)
That single converse call triggers a full agentic workflow inside Daimon:

- Daimon calls Claude with the chart text and the auto-generated tool definitions for icd10_search and icd10_graph_cypher
- Claude calls icd10_search for each clinical concept — Daimon embeds the query via Ollama and searches Qdrant, all in parallel
- Claude calls icd10_graph_cypher to verify each candidate code — Daimon runs the Cypher queries against Neo4j, again in parallel
- Claude returns the final JSON array
- The converse generator yields text chunks back to your code
Your code never touched an embedding client, a Qdrant client, a Neo4j session, or an agentic loop. The YAML config wired all of it.
Where To Go From Here
The full source is at github.com/sonicboom15/medchart.
Daimon itself is at github.com/sonicboom15/daimon — it supports OpenAI, Anthropic, and any Ollama-compatible model, with Qdrant, Chroma, pgvector, Redis, Neo4j, and Memgraph as backends.
The pattern here — vector search + graph verification + agentic LLM loop — applies to a lot of domains beyond medical coding. If you're building something similar and find yourself writing the same infrastructure glue, Daimon is worth a look.