Leiv Eriksson

Posted on Mar 18

Credit data is messier than equity data, always

#fintech #dataengineering #graphdatabases #python

Adding bonds, news, and a credit universe to a graph built for stocks

I load the first batch of bond data into the graph and run a quick sanity check — how many Company nodes do we have? The answer is 350. It should be closer to 220. I pull a sample and there it is: the same company, twice. One node carries an equity ticker, a Bloomberg ID, clean edges to earnings and research notes. The other carries a bond ISIN as its primary identifier, floating loose, connected to nothing meaningful. Same legal entity. Two nodes. No relationship between them. The graph I spent two chapters describing as a flexible, expressive data model has just revealed its first real crack — and I haven't even finished loading the credit universe yet.

Where we are

In chapter one, I made the case for a graph over a relational database. In chapter two, I built the foundational schema: Company nodes, ResearchNote edges, earnings events, equity coverage records. It was clean work. Equity research has clean abstractions. A company has a ticker. A ticker has an ISIN. An ISIN is unique. You know where you stand.

That world ends the moment you add bonds.

The graph at this point holds a working equity universe — companies, tickers, analysts, research notes, earnings events. The agents built on top of it can draft a snapshot, prep for an earnings call, pull relevant context. The thing works, for stocks. The mandate now is to expand it: add bonds, add credit news, add a credit coverage universe, and wire all of it into the same graph so that equity and credit research can share context. Simple in theory.

The problem

Bond data doesn't have the same clean identity primitives that equity data does.

An equity issuer is identified by its ticker, its primary exchange ISIN, maybe a CUSIP or SEDOL. These map onto each other reasonably well. You can pick one as your canonical identifier and live with it.

A bond issuer is identified by its legal entity name, a registration number if you're lucky, a Bloomberg issuer ID if you're in that ecosystem, and then each individual bond has its own ISIN — a different ISIN from the equity, a different ISIN from every other bond the same company has issued. A company with five bonds outstanding has at least five ISINs floating around that all resolve to the same legal entity. None of them are the company's "primary" ISIN in any meaningful sense.

When I built the initial Company node schema, I included a primary_isin field. It was designed for equity: the ISIN of the listed share. When I started loading bond terms from the Stamdata connector, the pipeline dutifully populated primary_isin with whatever ISIN it found first. For a credit counterparty with no equity listing, that meant a bond ISIN ended up as the company's primary identifier. The dedup logic relied on primary_isin being unique per company, so it broke silently: every bond from the same issuer carries a different ISIN, so each one looked like a company the graph had never seen, and a new Company node was created every time a different bond from the same issuer came through.

That's how you get 350 nodes when you should have 220.
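
To make the failure mode concrete, here is a minimal sketch of a get-or-create keyed on primary_isin. It is not the actual pipeline code: the in-memory dict stands in for the graph, and get_or_create_company and all ISIN values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Company:
    name: str
    primary_isin: str


# Stand-in for the graph: Company nodes indexed by primary_isin (hypothetical).
nodes_by_isin: dict[str, Company] = {}


def get_or_create_company(name: str, isin: str) -> Company:
    """Naive dedup: treat primary_isin as the company's unique key."""
    if isin not in nodes_by_isin:
        nodes_by_isin[isin] = Company(name=name, primary_isin=isin)
    return nodes_by_isin[isin]


# One issuer, three instruments: the listed share and two bonds (made-up ISINs).
get_or_create_company("Amwood AS", "NO0000000001")  # equity ISIN
get_or_create_company("Amwood AS", "NO0012345678")  # bond ISIN
get_or_create_company("Amwood AS", "NO0087654321")  # another bond ISIN

company_count = len(nodes_by_isin)
distinct_names = {c.name for c in nodes_by_isin.values()}
print(company_count, len(distinct_names))  # three nodes, one legal entity
```

Nothing fails and nothing is logged. The uniqueness check passes every time because each ISIN really is new; it is just the wrong thing to be checking.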

Into the unknown

The first thing I build is the Stamdata connector — search_issuers() and get_all_issues_for_issuer(), pulling bond terms: coupon type, maturity, currency, seniority, covenant flags. The data is rich. Stamdata's coverage of Nordic credit is excellent, and getting 13 BondIssue nodes loaded with full terms feels like real progress.

Then I add the news pipeline. Nordic credit markets run on a handful of wire services, and getting those NewsItem nodes into the graph — linked to the relevant companies and bond issues — is the piece that will eventually make the system feel current rather than archival. The news_writer.py and the migration script go in without drama. News is actually the cleanest of the new data types; a news item has a timestamp, a headline, a body, and a set of entities it mentions. Easy to model.
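
As a rough sketch of that shape, a NewsItem might look like the following. This uses stdlib dataclasses to stay dependency-free and is not taken from news_writer.py; the field names, the mentions() helper, and the sample values are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class NewsItem:
    published_at: datetime
    headline: str
    body: str
    # Identifiers of the Company / BondIssue nodes the item mentions
    # (hypothetical field; these become edges in the graph).
    entities: frozenset[str] = field(default_factory=frozenset)

    def mentions(self, entity: str) -> bool:
        return entity in self.entities


# Made-up example item.
item = NewsItem(
    published_at=datetime(2024, 3, 1, 8, 30, tzinfo=timezone.utc),
    headline="Amwood AS announces bond tap issue",
    body="(body text)",
    entities=frozenset({"Amwood AS", "NO0012345678"}),
)
print(item.mentions("Amwood AS"), item.mentions("Some Other AS"))
```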

The context builder is where things get genuinely interesting. Until this point, agents had been querying the graph for structured data and feeding it into LLM prompts as raw field values. The context builder — graph/services/context_builder.py, 235 lines that didn't exist a week ago — is the first piece of code that actually assembles graph data into something an LLM can reason about. It traverses company → bonds → recent news → research notes and produces a structured narrative block. This is the piece I've been building toward.

But I can't properly test it against credit companies because the credit companies aren't properly in the graph yet. And the reason they aren't properly in the graph is the duplicate node problem.

# What we were doing — using bond ISINs as company primary identifiers
company = CompanyNode(
    name="Amwood AS",
    primary_isin="NO0012345678",  # this is a bond ISIN, not an equity ISIN
    asset_class="Credit"
)

# The result: a second node gets created when the equity pipeline
# later encounters Amwood with a different ISIN

The dedup script, scripts/fix_credit_entities.py, does the forensics. 127 duplicate Company nodes. For each group of duplicates, I elect a canonical node: first, and critically, preferring nodes whose primary_isin looks like an equity ISIN rather than a bond ISIN; then, among those, taking the one with the highest research note count. Then I re-link all edges (ResearchNote edges, BondIssue edges, coverage records) from the duplicates to the canonical node, and delete the duplicates.

# Simplified version of the canonical election logic
def elect_canonical(nodes: list[CompanyNode]) -> CompanyNode:
    # Prefer nodes with equity-style ISINs (bond ISINs become BondIssue attributes)
    equity_candidates = [n for n in nodes if not looks_like_bond_isin(n.primary_isin)]
    pool = equity_candidates if equity_candidates else nodes
    # Among candidates, prefer the one with the most attached research
    return max(pool, key=lambda n: n.note_count)
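
The re-linking pass could look roughly like this. It is a sketch over an in-memory edge list, not the contents of scripts/fix_credit_entities.py. The node_id field, the Edge triples, the HAS_NOTE relationship name, and this version of looks_like_bond_isin (a lookup against ISINs already loaded as bond issues) are all assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CompanyNode:
    node_id: str
    name: str
    primary_isin: str | None
    note_count: int = 0


# Assumption: ISINs already loaded as bond issues, used to spot bond ISINs.
known_bond_isins = {"NO0012345678"}


def looks_like_bond_isin(isin: str | None) -> bool:
    return isin is not None and isin in known_bond_isins


def elect_canonical(nodes: list[CompanyNode]) -> CompanyNode:
    equity_candidates = [n for n in nodes if not looks_like_bond_isin(n.primary_isin)]
    pool = equity_candidates if equity_candidates else nodes
    return max(pool, key=lambda n: n.note_count)


# Edges as (source_id, relationship, target_id) triples.
Edge = tuple[str, str, str]


def merge_duplicates(
    group: list[CompanyNode], edges: list[Edge]
) -> tuple[CompanyNode, list[Edge]]:
    """Re-point every edge touching a duplicate at the canonical node."""
    canonical = elect_canonical(group)
    duplicate_ids = {n.node_id for n in group} - {canonical.node_id}

    def remap(node_id: str) -> str:
        return canonical.node_id if node_id in duplicate_ids else node_id

    # A set drops edges that become identical after remapping.
    relinked = sorted({(remap(s), rel, remap(t)) for s, rel, t in edges})
    return canonical, relinked


equity = CompanyNode("c1", "Amwood AS", "NO0000000001", note_count=4)
credit = CompanyNode("c2", "Amwood AS", "NO0012345678", note_count=0)
edges = [("c1", "HAS_NOTE", "n1"), ("c2", "ISSUED", "b1")]

canonical, relinked = merge_duplicates([equity, credit], edges)
print(canonical.node_id, relinked)
```

The bond that hung off the duplicate ends up attached to the canonical node, and only then is it safe to delete the duplicate.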

350 nodes becomes 222. The graph is smaller and more truthful.

What worked

The fix itself is unglamorous — a dedup script, an asset_class tag, a re-linking pass. The lesson it leaves behind is permanent.

The right model separates company identity from instrument identity. A Company node represents a legal entity. It gets a primary_isin only if it has a listed equity — and if it does, that field holds the equity ISIN. Bond ISINs belong on BondIssue nodes, which are their own first-class entities in the graph with their own attributes: coupon, maturity, currency, seniority, covenant flags. The relationship (:Company)-[:ISSUED]->(:BondIssue) carries the connection. You never confuse the issuer with the instrument.
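
A minimal sketch of what that separation means for the loader, under stated assumptions: the in-memory dict stands in for the graph, the issued list stands in for ISSUED edges, and keying companies on a normalised legal name is my placeholder for whatever legal-entity key is available (a registration number would be sturdier where one exists).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BondIssue:
    isin: str  # the bond ISIN lives on the instrument


@dataclass
class Company:
    legal_name: str
    primary_isin: str | None = None  # equity ISIN only; None for unlisted issuers
    issued: list[BondIssue] = field(default_factory=list)  # stands in for ISSUED edges


# Assumption: companies are keyed by a normalised legal name.
companies: dict[str, Company] = {}


def entity_key(legal_name: str) -> str:
    return " ".join(legal_name.casefold().split())


def load_bond(issuer_name: str, bond_isin: str) -> Company:
    """Attach a bond to its issuer without ever touching primary_isin."""
    company = companies.setdefault(entity_key(issuer_name), Company(legal_name=issuer_name))
    if all(b.isin != bond_isin for b in company.issued):
        company.issued.append(BondIssue(isin=bond_isin))
    return company


load_bond("Amwood AS", "NO0012345678")
amwood = load_bond("amwood  as", "NO0087654321")  # same issuer, sloppier spelling
print(len(companies), amwood.primary_isin, [b.isin for b in amwood.issued])
```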

The asset_class field on Company — Equity, Credit, or Both — is a coverage tag, not an identity tag. After the dedup and the Excel coverage universe load, the graph settles at 305 companies: 168 equity-only, 47 credit-only, 90 tagged Both. Those 90 are the interesting ones — companies that a credit analyst and an equity analyst might both have a view on, and where the graph can now start surfacing connections between those views.
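
Because the tag describes coverage rather than identity, it can be derived from universe membership alone. A small sketch, assuming the two coverage universes are available as sets of company names (the function name and the sample names are made up):

```python
from __future__ import annotations


def coverage_tag(name: str, equity_universe: set[str], credit_universe: set[str]) -> str | None:
    """Derive asset_class from which coverage universes list the company."""
    in_equity = name in equity_universe
    in_credit = name in credit_universe
    if in_equity and in_credit:
        return "Both"
    if in_equity:
        return "Equity"
    if in_credit:
        return "Credit"
    return None  # not covered by either desk


# Made-up universes for illustration.
equity_universe = {"Alpha ASA", "Beta ASA"}
credit_universe = {"Beta ASA", "Amwood AS"}

tags = {
    name: coverage_tag(name, equity_universe, credit_universe)
    for name in sorted(equity_universe | credit_universe)
}
print(tags)
```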

# After the fix: BondIssue as a first-class node
from datetime import date

from pydantic import BaseModel  # assumed: BaseModel here is pydantic's


class BondIssueNode(BaseModel):
    isin: str                    # bond ISIN; lives here, not on Company
    issuer_name: str
    coupon_rate: float | None
    maturity_date: date | None
    currency: str
    seniority: str | None
    has_covenants: bool

The context builder, once the graph is clean, actually works. Give it a company name; it returns a structured block covering the equity position, any outstanding bonds with their key terms, recent news items, and the most relevant research notes. This is what gets wired into the agent prompts in the next commit cycle. The graph earns its place not by storing data but by traversing it.
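
To give a feel for the traversal, here is a toy version of the idea. It is not graph/services/context_builder.py: the nested dict stands in for graph queries, and build_context, the field names, the output layout, and the sample data are all assumptions.

```python
from __future__ import annotations

# Toy graph as nested dicts; the real builder queries the graph database.
GRAPH = {
    "Amwood AS": {
        "primary_isin": None,  # credit-only issuer, no listed equity
        "bonds": [
            {"isin": "NO0012345678", "currency": "NOK", "seniority": "Senior secured"},
        ],
        "news": [
            {"date": "2024-03-01", "headline": "Amwood AS announces bond tap issue"},
        ],
        "notes": [{"title": "Credit initiation"}],
    }
}


def build_context(company: str, graph: dict = GRAPH, max_news: int = 3) -> str:
    """Walk company -> bonds -> recent news -> research notes into one text block."""
    node = graph[company]
    lines = [f"Company: {company}"]
    lines.append(f"Equity ISIN: {node['primary_isin'] or 'not listed'}")
    lines.append("Bonds:")
    for bond in node["bonds"]:
        lines.append(f"  - {bond['isin']} ({bond['currency']}, {bond['seniority']})")
    lines.append("Recent news:")
    newest_first = sorted(node["news"], key=lambda n: n["date"], reverse=True)
    for item in newest_first[:max_news]:
        lines.append(f"  - {item['date']}: {item['headline']}")
    lines.append("Research notes:")
    for note in node["notes"]:
        lines.append(f"  - {note['title']}")
    return "\n".join(lines)


context = build_context("Amwood AS")
print(context)
```

The point of the block is that the LLM receives one labelled, ordered piece of text per company instead of a bag of raw field values.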

What this changed

I had assumed the hardest part of expanding into credit would be the data sourcing — finding the right connectors, navigating the identifier mess, parsing bond terms out of documents. That was genuinely fiddly work. But the hardest part was the schema flaw I didn't know I had until the new data revealed it.

Real financial data will always find the crack in your abstraction. Equity data and credit data both represent claims on the same companies, but the market conventions around how those companies are identified, how their instruments are described, and how information about them is distributed are entirely different. Any system that tries to model both has to resist the temptation to unify prematurely. The Company node isn't an equity node with credit data bolted on, and it isn't a credit node with an equity ticker attached. It's a legal entity node, and the equity and credit representations hang off it as separate subgraphs.

The dedup work also forced a conversation about what "canonical" means when you have conflicting data from multiple sources. The heuristic I landed on — prefer the node with the most attached research, prefer equity ISINs for primary_isin — is defensible but not perfect. There are edge cases. There will always be edge cases. The script is in the repo and it will run again when the next batch of messy data arrives.

What I'd do differently: define the BondIssue node as a first-class entity from day one, before any credit data touches the graph. Don't let primary_isin be ambiguous for a single commit cycle. Identifier discipline is the kind of thing that feels like over-engineering until the moment you're staring at 127 duplicate nodes at midnight.

The graph now knows about bonds, news, and a proper credit universe. The context builder can traverse all of it and hand an LLM something genuinely useful to reason about. The scaffolding is there.

What it lacks is urgency. A static snapshot of a company's bonds and research notes is valuable, but trading floors run on now — price moves, covenant triggers, breaking news. The next chapter is about making the platform feel alive: real-time event ingestion, the question of what the graph should forget, and whether a graph database is actually the right tool for anything that moves at market speed.

What's been your experience expanding a data model you thought was solid into a messier adjacent domain? Did you rebuild from scratch, or patch and tag your way through like I did?

Part 3 of 7 in the series "Building a research hive mind"

Top comments (1)

Mayckon Giovani • Mar 19

Calling credit data “messier” than equity data is technically correct, but it undersells what’s actually going on.

Equity data feels clean not because the world is clean, but because the abstraction is forgiving. You get a ticker, a venue, a price surface that is continuously observable, and a relatively stable notion of identity. Even when things are wrong, the model still holds well enough to hide it.

Credit doesn’t give you that luxury. In many cases, you don’t even have a reliable price signal to anchor anything, which already forces you into indirect modeling approaches instead of direct observation. From that point on, everything becomes inference layered on incomplete structure.

And the structure itself is the real problem.

What people call “messiness” is usually the symptom of trying to project an entity-centric worldview onto something that is fundamentally contract-centric. A credit instrument is not just “an asset tied to an issuer”, it’s a bundle of constraints, triggers, priorities, and state transitions that evolve over time. Covenants change, capital structure shifts, restructurings happen, and the meaning of the instrument drifts without any clean event boundary.

So you don’t just lose cleanliness, you lose stable identity.

That’s why even in formal modeling, institutions struggle with data limitations and long-horizon uncertainty, because the underlying system doesn’t produce consistent, high-frequency, observable signals in the first place. You’re trying to model a dynamic, partially observed state machine with sparse and irregular emissions.

Using graphs helps, but only in the sense that it stops lying to you about relationships. It doesn’t solve the harder problem, which is defining the ontology of those relationships in a way that survives time, restructuring, and edge cases. If anything, it forces you to confront the ambiguity earlier instead of burying it under joins and lookup tables.

The uncomfortable conclusion is that a unified model between equity and credit only works if you accept asymmetry at the core. Equity compresses reality into a tradable signal. Credit exposes the underlying contractual complexity and refuses to be normalized.

So the issue isn’t that credit data is messy.

It’s that credit is closer to how the world actually behaves, and most systems are designed around the assumption that the world prefers to be simplified.