DEV Community

Leiv Eriksson

Choosing a graph over a database was the easy part

Designing the node-and-edge skeleton that would have to hold everything

The moment you decide to model your data as a graph, you are betting that relationships are more important than records — and you can't easily change your mind later. I made that bet on a Tuesday morning with a whiteboard marker in one hand and a half-finished coffee in the other, sketching nodes and edges for a system that didn't exist yet. The schema felt elegant. It felt obvious, even. And that feeling — the dangerous one, the one that makes you commit — is exactly when the doubt sets in. What if the thing that feels obvious now is the thing you'll be refactoring in six weeks, cursing yourself for not seeing the edge case that was always there?

It wasn't six weeks. It was four days.


Where we are

In the first chapter, I described the research desk's memory problem: analyst notes scattered across email threads, earnings prep that lived in someone's head, no clean way to ask "what do we know about this company right now?" The answer we kept circling back to was not a better database. It was a different kind of database — one that could model not just entities, but the tissue connecting them.

That chapter ended with a vague architectural direction. This one is where we had to make it real.

The project that I hope to build is basically a digital twin for the research division — a living graph of every company under coverage, every analyst who covers it, every note ever written, every filing pulled from EDGAR, every material event captured from exchange feeds. Not a data warehouse. Not a CRM. Something that can answer questions like: What has changed about this issuer in the last 30 days, and who on the team should care?


The problem

The first decision was the one that felt least like a decision: should this be relational or graph?

I asked the AI what I should do. No-brainer, it said: Postgres. It seemed to know Postgres better, having been trained on endless examples online. The ecosystem is mature, the tooling is excellent, and when something goes wrong at 2am, all that Stack Overflow and GitHub training data means the AI has an answer ready in a heartbeat. Choosing anything else seemed risky.

But the data we were modeling was fundamentally relational in the graph sense, not the table sense. A company has analysts. Analysts write notes. Notes reference events. Events trigger filings. Filings update estimates. Estimates change coverage posture. Every one of those connections is a first-class fact — not a foreign key you join through, but a relationship with its own properties and its own queryable meaning.

The moment you start drawing that on a whiteboard, a graph isn't exotic. It's just honest.

We evaluated two options seriously. Neo4j was the obvious candidate — battle-tested, Cypher is expressive, the community is large. But we were building something that needed to run embedded, close to the application layer, without the operational overhead of managing a separate server process in the early stages. That pointed us toward an alternative: an embeddable graph database with Cypher support, columnar storage, and a Python API that felt like it was designed by people who had actually suffered through the alternatives.

The risk was real. The alternative was newer, production references were fewer, and the AI's training data was thinner. We were going to be among the early adopters in a financial context, which is exactly the kind of sentence that makes compliance teams nervous. Good thing I haven't told them about the project yet (but for now I'm working on publicly available data ONLY).


Into the unknown

The schema came together fast — maybe too fast.

The core node types were clear from day one: Company, Analyst, Sector, ResearchNote, Filing, MaterialEvent, CoverageRecord, BondIssue, EstimateSnapshot. The edges were where the thinking happened. A COVERS relationship between Analyst and Company isn't just a link — it has a start date, a rating, a target price. A REFERENCES edge between a ResearchNote and a MaterialEvent carries context about why that event mattered to that note.
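In the same declarative spirit as the node declarations, the edge types could be captured as a dictionary too. This is a hypothetical sketch — the file name, dictionary shape, and type strings are my illustration, not the project's actual code:

```python
# graph/schema/edges.py (hypothetical sketch, not the project's actual file)
# Each edge carries its own typed properties, because relationships here
# are first-class facts, not bare foreign keys.

EDGE_DEFINITIONS = {
    "COVERS": {
        "from": "Analyst",
        "to": "Company",
        "properties": {
            "start_date": "DATE",
            "rating": "STRING",
            "target_price": "DOUBLE",
        },
    },
    "REFERENCES": {
        "from": "ResearchNote",
        "to": "MaterialEvent",
        "properties": {
            "context": "STRING",
        },
    },
    "AUTHORED": {
        "from": "Analyst",
        "to": "ResearchNote",
        "properties": {},
    },
}
```

The payoff of keeping edges declarative is the same as for nodes: the bootstrap can walk one structure, and a schema review is a diff on a dictionary rather than an archaeology dig through DDL.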

Here's a simplified version of how node types were declared in Python, binding the schema definition directly to the graph client:

```python
# graph/schema/nodes.py (illustrative excerpt)

NODE_DEFINITIONS = {
    "Company": {
        "ticker": "STRING",
        "isin": "STRING",
        "name": "STRING",
        "sector": "STRING",
        "market_cap": "DOUBLE",
        "last_updated": "TIMESTAMP",
    },
    "ResearchNote": {
        "note_id": "STRING",
        "title": "STRING",
        "published_date": "DATE",
        "sentiment": "STRING",
        "word_count": "INT64",
    },
    "MaterialEvent": {
        "event_id": "STRING",
        "event_type": "STRING",
        "event_date": "TIMESTAMP",
        "headline": "STRING",
        "source": "STRING",
    },
}
```

The bootstrap script would walk these definitions and issue the equivalent CREATE NODE TABLE calls against the graph DB. Clean. Declarative. And — critically — easy to modify before anything real was written to the graph.
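The walk itself is simple enough to sketch. Everything below is illustrative: `build_node_ddl`, `bootstrap`, and the primary-key mapping are hypothetical helpers, and the exact DDL dialect depends on the embedded engine, so treat the syntax as a shape rather than gospel:

```python
# graph/schema/bootstrap.py (illustrative sketch; helper names are hypothetical)

def build_node_ddl(label: str, fields: dict, primary_key: str) -> str:
    """Render one CREATE NODE TABLE statement from a schema dict entry."""
    columns = ", ".join(f"{name} {dtype}" for name, dtype in fields.items())
    return f"CREATE NODE TABLE {label}({columns}, PRIMARY KEY ({primary_key}))"


def bootstrap(conn, node_definitions: dict, primary_keys: dict) -> None:
    """Walk the declarative schema and issue DDL against the graph client.

    `conn` is an assumed client object with an execute(statement) method.
    """
    for label, fields in node_definitions.items():
        conn.execute(build_node_ddl(label, fields, primary_keys[label]))
```

Because the DDL is generated from one dictionary, changing the schema before go-live really was just editing that dictionary and re-running the bootstrap.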

The FastAPI layer went up in parallel, and this is where we made a choice I'm still glad about: the API routers never touched the graph directly. From day one, all graph access was routed through service classes — CompanyService, and later GraphQueryService — that translated between HTTP semantics and Cypher. The routers asked questions in domain language. The services answered in graph language.

```python
# api/routers/companies.py (illustrative pattern)

@router.get("/{ticker}/snapshot")
async def get_company_snapshot(
    ticker: str,
    service: CompanyService = Depends(get_company_service),
):
    snapshot = await service.get_full_snapshot(ticker)
    if not snapshot:
        raise HTTPException(status_code=404, detail="Company not found")
    return snapshot
```

The seam between router and service sounds obvious in retrospect. At the time, under pressure to get something queryable, it required active discipline to not just inline the Cypher into the route handler and move on.

The first real failure came from the connectors. We had live integrations running against SEC EDGAR and stock exchange news sites, plus stub connectors for Bloomberg and FactSet running in mock mode. The data coming back was messier than the schema expected — companies with missing ISINs, events with ambiguous timestamps, filings that referenced entities not yet in the graph. The graph's strictness, which felt like a feature in the design phase, became a source of friction the moment real data arrived.

We spent quite some time writing transformer logic that could have been avoided but wasn't, because the real world doesn't normalize itself for you.
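That transformer logic looked roughly like this. The field names and fallback rules are assumptions for illustration; what matters is the defensive shape — quarantine records with no stable identifier, keep missing ISINs as explicit nulls, and coerce the timestamp formats the connectors disagreed on:

```python
# ingestion/transformers/company.py (illustrative sketch; field names assumed)
from datetime import datetime, timezone


def normalize_company(raw: dict):
    """Coerce a messy connector payload into something the schema will accept.

    Returns None when the record has no usable identifier, so the caller
    can quarantine it instead of writing junk into the graph.
    """
    ticker = (raw.get("ticker") or "").strip().upper()
    if not ticker:
        return None

    ts = raw.get("last_updated")
    if isinstance(ts, str):
        # Connectors disagreed on timestamp formats; ISO 8601 is the contract.
        try:
            ts = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except ValueError:
            ts = None

    return {
        "ticker": ticker,
        "isin": raw.get("isin") or None,  # missing ISINs stay explicit nulls
        "name": (raw.get("name") or "").strip(),
        "last_updated": ts or datetime.now(timezone.utc),
    }
```

Every connector got its own transformer with this contract, which is also where entities referenced before they exist get stubbed out for later enrichment.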


What worked

The first time I ran a traversal query against real data, I understood why people build graph databases.

The query was simple: give me every research note written about companies in a specific sector, ordered by date, with the analyst and any associated material events. In a relational model, that's three or four joins, and you're mentally tracking the join path the whole time. In Cypher, it reads almost like the question itself:

```cypher
MATCH (a:Analyst)-[:AUTHORED]->(n:ResearchNote)-[:COVERS]->(c:Company)-[:IN_SECTOR]->(s:Sector)
WHERE s.name = $sector_name
OPTIONAL MATCH (n)-[:REFERENCES]->(e:MaterialEvent)
RETURN a.name, n.title, n.published_date, c.ticker, e.headline
ORDER BY n.published_date DESC
```

(Note the `WHERE` sits directly after the first `MATCH` — in Cypher, a `WHERE` placed after an `OPTIONAL MATCH` filters only the optional pattern, which is a subtle way to get wrong answers.)

There's no impedance mismatch. The shape of the query matches the shape of the question. For a system where new question types arrive every week from analysts who didn't write the code, that matters enormously.

The GraphQueryService — introduced in the second commit — was where the architecture paid off properly. Instead of 23 inline Cypher queries scattered across five API routers, we centralized everything into typed methods with clear contracts. The routers became thin. The service became testable. And when an analyst asked for a query type we hadn't anticipated, we added one method in one place.

The agents came online against this foundation, and that's when the system started feeling like something. The company snapshot agent could traverse the graph and produce a structured summary of everything known about an issuer. The earnings prep agent could pull analyst coverage history, recent notes, and upcoming filing dates in a single coherent pass. The report drafting agent had something real to work with.

The Prefect flow layer, added four days later, gave the pipelines retry logic and observability without requiring us to rewrite anything fundamental. The graph contract held.


What this changed

The graph bet forced an ontology conversation we would have had to have anyway — we just had it before writing a single line of application logic, which is the right time.

If you're considering a graph database for a production system, the operational concern is real but manageable. The harder thing is the schema: once analysts and pipelines are writing to the graph, changing a node type is a migration, not a refactor. We learned to be conservative about what became a node versus what stayed as a property. Events became nodes. Statuses stayed as properties. That line matters.

I'd also instrument the query layer earlier. We added logging to GraphQueryService in the second commit, but we should have had it from the first bootstrap. You want to know which traversals are expensive before something is slow in production, not after.

The seam between data connectors and graph writers — the transformer pipeline — is where the real complexity lives. Not in the graph itself.


Signpost

The foundation is in place: a typed graph, a service layer, a set of agents that can actually reason over connected data. What comes next is the part everyone underestimates — pulling in credit data alongside equity data and discovering that the two worlds have almost nothing in common except the company name. Bonds don't have tickers. Covenants don't have analogues in equity filings. And the graph that felt complete on Tuesday morning turns out to have a very significant gap.

How do you model instruments that fundamentally resist being modeled the same way — without fracturing your schema into two separate systems that can't talk to each other?


Part 2 of 7 in the series "Building a research hive mind"
