If you've built AI features in Java recently, you know the drill: choose an LLM SDK, wire up HTTP clients, handle SSE parsing, build a session store, figure out RAG, repeat for every provider you want to support.
Daimon takes a different approach. It's a Go sidecar that runs next to your app and exposes a unified HTTP API for LLM inference, vector memory, graph queries, and session management — all wired from a single YAML file. Your application only talks to localhost.
Today the Daimon Java SDK (io.github.sonicboom15:daimon-client) lands on Maven Central. Here's what you can do with it.
Installation
Gradle (build.gradle):
dependencies {
    implementation 'io.github.sonicboom15:daimon-client:0.4.1'
}
Maven (pom.xml):
<dependency>
    <groupId>io.github.sonicboom15</groupId>
    <artifactId>daimon-client</artifactId>
    <version>0.4.1</version>
</dependency>
Requires Java 17+. The only transitive dependency is Gson.
Step 1: Configure and start the sidecar
Create config.yaml:
components:
  - name: assistant            # your name — "gpt", "local", whatever
    type: anthropic
    metadata:
      api_key: ${ANTHROPIC_API_KEY}
      default_model: claude-opus-4-7
Start the sidecar (grab the binary from the GitHub releases page):
daimon serve --config config.yaml
# Listening on :3500
Or run it with Go installed:
go install github.com/sonicboom15/daimon/cmd/daimon@latest
daimon serve --config config.yaml
Step 2: Chat
Client client = new Client();
String reply = client.chat("assistant", "What is the capital of France?");
System.out.println(reply); // Paris
Three lines. No HTTP wiring. No SSE parsing. No JSON boilerplate.
"assistant" is the component name you chose in config.yaml — nothing more. Want to switch from Anthropic to GPT-4o or Gemini? Change two lines in the YAML. The Java code stays exactly as it is.
Step 3: Streaming
For long responses where you want to display tokens as they arrive:
LLMClient llm = client.llm("assistant");
for (String fragment : llm.stream("Write a haiku about distributed systems")) {
    System.out.print(fragment);
    System.out.flush();
}
stream() returns a lazy Iterable<String> backed by the SSE connection — no threads, no callbacks.
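If you also want the complete reply after streaming it, plain Java is enough; nothing beyond stream() itself is needed:

StringBuilder full = new StringBuilder();
for (String fragment : llm.stream("Write a haiku about distributed systems")) {
    System.out.print(fragment);  // render tokens as they arrive
    full.append(fragment);       // accumulate the complete reply
}
System.out.println();
String haiku = full.toString();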
Step 4: Sessions (stateful conversations)
Attach a session_id to link requests into a conversation. The sidecar keeps the history server-side.
LLMClient llm = client.llm("assistant");
ChatOptions session = ChatOptions.builder()
        .sessionId("user-42")
        .build();
llm.chat("My name is Alice.", session);
String reply = llm.chat("What's my name?", session);
System.out.println(reply); // Alice
// Clear when done
llm.clearSession("user-42");
Sessions default to in-memory storage. Add a session/redis or session/postgres component to make them persistent across restarts.
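For illustration, a Redis-backed session component might look like this. The url metadata key is an assumption on my part, so check the Daimon docs for the exact schema:

components:
  - name: sessions
    type: session/redis
    metadata:
      url: redis://localhost:6379   # assumed key name; verify against the docs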
Step 5: Vector memory (RAG)
This is where things get interesting. Daimon can query a vector store on every request and inject the top results as context — automatically, before the LLM ever sees the message.
Update config.yaml:
components:
  - name: docs
    type: inmemory             # BM25, no external service needed
  - name: assistant
    type: anthropic
    metadata:
      api_key: ${ANTHROPIC_API_KEY}
      memory_store: docs       # inject top-5 docs before every chat call
Java side:
MemoryStoreClient mem = client.memory("docs");
// Index some facts
mem.upsert("The Eiffel Tower is 330 metres tall and located in Paris.", "eiffel", null);
mem.upsert("The Colosseum is 48 metres tall and located in Rome.", "colosseum", null);
mem.upsert("The Burj Khalifa is 828 metres tall and located in Dubai.", "burj", null);
// Ask the LLM — relevant docs are injected automatically
String reply = client.chat("assistant", "Which famous landmark is tallest?");
System.out.println(reply); // mentions Burj Khalifa
You can also query the store directly:
List<MemoryResult> results = mem.query("tall structures", 3);
for (MemoryResult r : results) {
    System.out.printf("[%.2f] %s%n", r.score(), r.content());
}
Swap type: inmemory for type: chroma, type: qdrant, type: pgvector, or type: redis without touching a single line of Java.
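As a sketch, a Qdrant-backed version of the docs component could look like the following. The url and collection metadata keys are my assumptions, not confirmed API:

components:
  - name: docs                     # same name, so the Java side stays identical
    type: qdrant
    metadata:
      url: http://localhost:6333   # assumed metadata keys; check the docs
      collection: docs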
Step 6: Graph queries
Daimon also exposes graph stores through the same thin client.
Add to config.yaml:
  - name: kg
    type: neo4j
    metadata:
      bolt_url: bolt://localhost:7687
      password: secret
Java side:
GraphStoreClient graph = client.graph("kg");
graph.addNode("alice", List.of("Person"), Map.of("name", "Alice", "role", "engineer"));
graph.addNode("daimon", List.of("Project"), Map.of("name", "Daimon"));
graph.addEdge("alice", "daimon", "MAINTAINS", null);
List<Map<String, Object>> rows = graph.cypher(
    "MATCH (p:Person)-[:MAINTAINS]->(proj) RETURN p.name, proj.name",
    null
);
rows.forEach(row -> System.out.println(row.get("p.name") + " maintains " + row.get("proj.name")));
// Alice maintains Daimon
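The second argument to cypher(), passed as null above, carries the query parameters. Assuming it accepts a plain Map (I haven't confirmed this against the docs), a parameterized version of the same lookup would be:

// Hypothetical parameterized query; assumes the params argument is a Map<String, Object>
List<Map<String, Object>> engineers = graph.cypher(
    "MATCH (p:Person {role: $role})-[:MAINTAINS]->(proj) RETURN p.name, proj.name",
    Map.of("role", "engineer")
);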
For a deeper look at graph + LLM pipelines, see my earlier article: Build a Medical Chart Coding Pipeline with Daimon, Claude, and Neo4j.
Putting it together: a self-populating knowledge assistant
Here's a complete example that combines all three — LLM, memory, and graph — in under 50 lines:
import io.github.sonicboom15.daimon.*;

import java.util.List;
import java.util.Map;

public class KnowledgeAssistant {
    public static void main(String[] args) {
        Client client = new Client();

        // ── Populate memory store ───────────────────────────────
        MemoryStoreClient mem = client.memory("docs");
        mem.upsert("Java 17 introduced sealed classes and pattern matching for instanceof.", "java17", null);
        mem.upsert("Java 21 introduced virtual threads (Project Loom) and record patterns.", "java21", null);
        mem.upsert("Java 23 introduced Markdown documentation comments (JEP 467).", "java23", null);

        // ── Populate graph store ────────────────────────────────
        GraphStoreClient graph = client.graph("kg");
        graph.addNode("java17", List.of("Release"), Map.of("version", "17", "year", "2021"));
        graph.addNode("java21", List.of("Release"), Map.of("version", "21", "year", "2023"));
        graph.addEdge("java17", "java21", "FOLLOWED_BY", null);

        // ── Ask a question — docs are injected automatically ────
        // (memory_store: docs is set on the assistant component in config.yaml)
        LLMClient llm = client.llm("assistant");
        String answer = llm.chat("What major features came in Java 21?");
        System.out.println("Answer: " + answer);

        // ── Direct graph query ──────────────────────────────────
        var timeline = graph.cypher(
            "MATCH (a:Release)-[:FOLLOWED_BY]->(b:Release) RETURN a.version, b.version ORDER BY a.year",
            null
        );
        System.out.println("Timeline: " + timeline);
    }
}
The LLM will mention virtual threads and record patterns because the sidecar queried the memory store with "What major features came in Java 21?" and prepended the matching document as context.
What providers does Daimon support?
All configured in YAML — no code changes when you swap:
| Type | Provider |
|---|---|
| anthropic | Claude (opus-4-7, sonnet-4-6, haiku-4-5) |
| openai | GPT-4o, GPT-4o-mini, o1, o3 |
| gemini | Gemini 2.0 Flash, 1.5 Pro |
| mistral | Mistral Large, Small, Nemo |
| llamacpp | Any local OpenAI-compatible server (Ollama, LM Studio) |
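To go fully local, point the llamacpp component at an OpenAI-compatible endpoint such as Ollama's. The base_url metadata key here is a guess at the naming, so verify it against the docs:

components:
  - name: assistant
    type: llamacpp
    metadata:
      base_url: http://localhost:11434/v1   # assumed key; Ollama's OpenAI-compatible endpoint
      default_model: llama3.1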
Links
- Example repo: github.com/sonicboom15/daimon-java-example
- GitHub: github.com/sonicboom15/daimon
- Maven Central: io.github.sonicboom15:daimon-client
- Docs: sonicboom15.github.io/daimon
- Previous article: Build a Medical Chart Coding Pipeline with Daimon, Claude, and Neo4j
Feedback, issues, and PRs are welcome.