<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Leiv Eriksson</title>
    <description>The latest articles on DEV Community by Leiv Eriksson (@leivleivleiv).</description>
    <link>https://dev.to/leivleivleiv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3815566%2F5790ca13-782e-4a17-857d-d77634e40c18.jpg</url>
      <title>DEV Community: Leiv Eriksson</title>
      <link>https://dev.to/leivleivleiv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/leivleivleiv"/>
    <language>en</language>
    <item>
      <title>Credit data is messier than equity data, always</title>
      <dc:creator>Leiv Eriksson</dc:creator>
      <pubDate>Wed, 18 Mar 2026 22:35:11 +0000</pubDate>
      <link>https://dev.to/leivleivleiv/credit-data-is-messier-than-equity-data-always-1fig</link>
      <guid>https://dev.to/leivleivleiv/credit-data-is-messier-than-equity-data-always-1fig</guid>
      <description>&lt;p&gt;&lt;em&gt;Adding bonds, news, and a credit universe to a graph built for stocks&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I load the first batch of bond data into the graph and run a quick sanity check — how many Company nodes do we have? The answer is 350. It should be closer to 220. I pull a sample and there it is: the same company, twice. One node carries an equity ticker, a Bloomberg ID, clean edges to earnings and research notes. The other carries a bond ISIN as its primary identifier, floating loose, connected to nothing meaningful. Same legal entity. Two nodes. No relationship between them. The graph I spent two chapters describing as a flexible, expressive data model has just revealed its first real crack — and I haven't even finished loading the credit universe yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where we are
&lt;/h2&gt;

&lt;p&gt;In chapter one, I made the case for a graph over a relational database. In chapter two, I built the foundational schema: Company nodes, ResearchNote edges, earnings events, equity coverage records. It was clean work. Equity research has clean abstractions. A company has a ticker. A ticker has an ISIN. An ISIN is unique. You know where you stand.&lt;/p&gt;

&lt;p&gt;That world ends the moment you add bonds.&lt;/p&gt;

&lt;p&gt;The graph at this point holds a working equity universe — companies, tickers, analysts, research notes, earnings events. The agents built on top of it can draft a snapshot, prep for an earnings call, pull relevant context. The thing works, for stocks. The mandate now is to expand it: add bonds, add credit news, add a credit coverage universe, and wire all of it into the same graph so that equity and credit research can share context. Simple in theory.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Bond data doesn't have the same clean identity primitives that equity data does.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An equity issuer is identified by its ticker, its primary exchange ISIN, maybe a CUSIP or SEDOL. These map onto each other reasonably well. You can pick one as your canonical identifier and live with it.&lt;/p&gt;

&lt;p&gt;A bond issuer is identified by its legal entity name, a registration number if you're lucky, a Bloomberg issuer ID if you're in that ecosystem, and then each individual bond has its own ISIN — a different ISIN from the equity, a different ISIN from every other bond the same company has issued. A company with five bonds outstanding has at least five ISINs floating around that all resolve to the same legal entity. None of them are the company's "primary" ISIN in any meaningful sense.&lt;/p&gt;

&lt;p&gt;When I built the initial Company node schema, I included a &lt;code&gt;primary_isin&lt;/code&gt; field. It was designed for equity — the ISIN of the listed share. When I started loading bond terms from the Stamdata connector, the pipeline dutifully populated &lt;code&gt;primary_isin&lt;/code&gt; with whatever ISIN it found first. For a credit counterparty with no equity listing, that meant a bond ISIN ended up as the company's primary identifier. Now the dedup logic that relied on &lt;code&gt;primary_isin&lt;/code&gt; uniqueness broke silently, and a new Company node was created every time a different bond from the same issuer came through.&lt;/p&gt;

&lt;p&gt;That's how you get 350 nodes when you should have 220.&lt;/p&gt;




&lt;h2&gt;
  
  
  Into the unknown
&lt;/h2&gt;

&lt;p&gt;The first thing I build is the Stamdata connector — &lt;code&gt;search_issuers()&lt;/code&gt; and &lt;code&gt;get_all_issues_for_issuer()&lt;/code&gt;, pulling bond terms: coupon type, maturity, currency, seniority, covenant flags. The data is rich. Stamdata's coverage of Nordic credit is excellent, and getting 13 BondIssue nodes loaded with full terms feels like real progress.&lt;/p&gt;

&lt;p&gt;Then I add the news pipeline. Nordic credit markets run on a handful of wire services, and getting those NewsItem nodes into the graph — linked to the relevant companies and bond issues — is the piece that will eventually make the system feel current rather than archival. The &lt;code&gt;news_writer.py&lt;/code&gt; and the migration script go in without drama. News is actually the cleanest of the new data types; a news item has a timestamp, a headline, a body, and a set of entities it mentions. Easy to model.&lt;/p&gt;
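&lt;p&gt;"Easy to model" translates into something like the following — a minimal dataclass sketch of a NewsItem, not the project's actual &lt;code&gt;news_writer.py&lt;/code&gt; schema:&lt;/p&gt;

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class NewsItem:
    """Minimal news node: a timestamp, a headline, a body, and the
    entities it mentions (issuer keys to link edges against)."""
    headline: str
    body: str
    published_at: datetime
    mentioned_entities: list[str] = field(default_factory=list)

# A single wire item; entity keys are illustrative.
item = NewsItem(
    headline="Amwood AS taps bond market",
    body="...",
    published_at=datetime(2026, 3, 18, 9, 30),
    mentioned_entities=["amwood-as"],
)
```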

&lt;p&gt;The context builder is where things get genuinely interesting. Until this point, agents had been querying the graph for structured data and feeding it into LLM prompts as raw field values. The context builder — &lt;code&gt;graph/services/context_builder.py&lt;/code&gt;, 235 lines that didn't exist a week ago — is the first piece of code that actually &lt;em&gt;assembles&lt;/em&gt; graph data into something an LLM can reason about. It traverses company → bonds → recent news → research notes and produces a structured narrative block. This is the piece I've been building toward.&lt;/p&gt;

&lt;p&gt;But I can't properly test it against credit companies because the credit companies aren't properly in the graph yet. And the reason they aren't properly in the graph is the duplicate node problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What we were doing — using bond ISINs as company primary identifiers
&lt;/span&gt;&lt;span class="n"&gt;company&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CompanyNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Amwood AS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;primary_isin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NO0012345678&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# this is a bond ISIN, not an equity ISIN
&lt;/span&gt;    &lt;span class="n"&gt;asset_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Credit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The result: a second node gets created when the equity pipeline
# later encounters Amwood with a different ISIN
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dedup script — &lt;code&gt;scripts/fix_credit_entities.py&lt;/code&gt; — does the forensics. 127 duplicate Company nodes. For each set of duplicates, I elect a canonical node: prefer the one whose &lt;code&gt;primary_isin&lt;/code&gt; looks like an equity ISIN rather than a bond ISIN, and among those, the one with the most attached research notes. Then I re-link all edges — ResearchNote edges, BondIssue edges, coverage records — from the duplicate to the canonical, and delete the duplicate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified version of the canonical election logic
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;elect_canonical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CompanyNode&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;CompanyNode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Prefer nodes with equity-style ISINs (bond ISINs become BondIssue attributes)
&lt;/span&gt;    &lt;span class="n"&gt;equity_candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;looks_like_bond_isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;primary_isin&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="n"&gt;pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;equity_candidates&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;equity_candidates&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;
    &lt;span class="c1"&gt;# Among candidates, prefer the one with the most attached research
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;note_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
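&lt;p&gt;The re-linking pass that follows the election can be sketched over plain dicts rather than the real graph client — &lt;code&gt;merge_into_canonical&lt;/code&gt; is a hypothetical name, and the real script issues graph mutations instead of rewriting a list:&lt;/p&gt;

```python
# Hypothetical sketch of the re-linking pass: move every edge off the
# duplicate node onto the canonical one, so the duplicate can be deleted.
def merge_into_canonical(edges: list[dict], canonical_id: str, duplicate_id: str) -> list[dict]:
    """Rewrite edge endpoints so nothing references the duplicate."""
    merged = []
    for edge in edges:
        src = canonical_id if edge["src"] == duplicate_id else edge["src"]
        dst = canonical_id if edge["dst"] == duplicate_id else edge["dst"]
        merged.append({**edge, "src": src, "dst": dst})
    return merged

edges = [
    {"src": "note-1", "dst": "amwood-dup", "type": "ABOUT"},
    {"src": "amwood-dup", "dst": "bond-NO0012345678", "type": "ISSUED"},
]
relinked = merge_into_canonical(edges, "amwood-canonical", "amwood-dup")
```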



&lt;p&gt;350 nodes becomes 222. The graph is smaller and more truthful.&lt;/p&gt;




&lt;h2&gt;
  
  
  What worked
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The fix itself is unglamorous — a dedup script, an &lt;code&gt;asset_class&lt;/code&gt; tag, a re-linking pass. The lesson it leaves behind is permanent.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The right model separates company identity from instrument identity. A Company node represents a legal entity. It gets a &lt;code&gt;primary_isin&lt;/code&gt; only if it has a listed equity — and if it does, that field holds the equity ISIN. Bond ISINs belong on BondIssue nodes, which are their own first-class entities in the graph with their own attributes: coupon, maturity, currency, seniority, covenant flags. The relationship &lt;code&gt;(:Company)-[:ISSUED]-&amp;gt;(:BondIssue)&lt;/code&gt; carries the connection. You never confuse the issuer with the instrument.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;asset_class&lt;/code&gt; field on Company — &lt;code&gt;Equity&lt;/code&gt;, &lt;code&gt;Credit&lt;/code&gt;, or &lt;code&gt;Both&lt;/code&gt; — is a coverage tag, not an identity tag. After the dedup and the Excel coverage universe load, the graph settles at 305 companies: 168 equity-only, 47 credit-only, 90 tagged &lt;code&gt;Both&lt;/code&gt;. Those 90 are the interesting ones — companies that a credit analyst and an equity analyst might both have a view on, and where the graph can now start surfacing connections between those views.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After the fix: BondIssue as a first-class node
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BondIssueNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;                    &lt;span class="c1"&gt;# bond ISIN — lives here, not on Company
&lt;/span&gt;    &lt;span class="n"&gt;issuer_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;coupon_rate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;maturity_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;seniority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;has_covenants&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
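&lt;p&gt;The &lt;code&gt;asset_class&lt;/code&gt; coverage tag can be derived mechanically from which universes a company appears in. A toy sketch, with illustrative function and set names:&lt;/p&gt;

```python
# Hypothetical sketch: deriving the asset_class coverage tag.
# It's a tag about coverage, not an identity field.
def coverage_tag(company_id: str, equity_universe: set[str], credit_universe: set[str]) -> str:
    in_equity = company_id in equity_universe
    in_credit = company_id in credit_universe
    if in_equity and in_credit:
        return "Both"
    if in_credit:
        return "Credit"
    if in_equity:
        return "Equity"
    return "Uncovered"

# Toy universes for illustration.
equity = {"acme", "amwood-as"}
credit = {"amwood-as", "nordkredit"}
```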



&lt;p&gt;The context builder, once the graph is clean, actually works. Give it a company name; it returns a structured block covering the equity position, any outstanding bonds with their key terms, recent news items, and the most relevant research notes. This is what gets wired into the agent prompts in the next commit cycle. The graph earns its place not by storing data but by traversing it.&lt;/p&gt;
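&lt;p&gt;The assembly step can be sketched in miniature, assuming the traversal results have already been fetched. Names like &lt;code&gt;build_context_block&lt;/code&gt; are illustrative, not the real &lt;code&gt;context_builder.py&lt;/code&gt; API:&lt;/p&gt;

```python
# Hypothetical sketch: turn pre-fetched graph data into a narrative
# block an LLM can reason about.
def build_context_block(company: dict, bonds: list[dict], news: list[str], notes: list[str]) -> str:
    """Assemble graph query results into a structured prompt block."""
    lines = [f"# {company['name']} ({company.get('ticker', 'unlisted')})"]
    if bonds:
        lines.append("## Outstanding bonds")
        for b in bonds:
            lines.append(f"- {b['isin']}: {b['coupon']}% due {b['maturity']}")
    if news:
        lines.append("## Recent news")
        lines.extend(f"- {h}" for h in news)
    if notes:
        lines.append("## Research notes")
        lines.extend(f"- {t}" for t in notes)
    return "\n".join(lines)

block = build_context_block(
    {"name": "Amwood AS", "ticker": "AMW"},
    [{"isin": "NO0012345678", "coupon": 7.5, "maturity": "2029-06-15"}],
    ["Amwood AS taps bond market"],
    ["Q4 preview: margin pressure"],
)
```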




&lt;h2&gt;
  
  
  What this changed
&lt;/h2&gt;

&lt;p&gt;I had assumed the hardest part of expanding into credit would be the data sourcing — finding the right connectors, navigating the identifier mess, parsing bond terms out of documents. That was genuinely fiddly work. But the hardest part was the schema flaw I didn't know I had until the new data revealed it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real financial data will always find the crack in your abstraction.&lt;/strong&gt; Equity data and credit data both represent claims on the same companies, but the market conventions around how those companies are identified, how their instruments are described, and how information about them is distributed are entirely different. Any system that tries to model both has to resist the temptation to unify prematurely. The Company node isn't an equity node with credit data bolted on, and it isn't a credit node with an equity ticker attached. It's a legal entity node, and the equity and credit representations hang off it as separate subgraphs.&lt;/p&gt;

&lt;p&gt;The dedup work also forced a conversation about what "canonical" means when you have conflicting data from multiple sources. The heuristic I landed on — prefer the node with the most attached research, prefer equity ISINs for &lt;code&gt;primary_isin&lt;/code&gt; — is defensible but not perfect. There are edge cases. There will always be edge cases. The script is in the repo and it will run again when the next batch of messy data arrives.&lt;/p&gt;

&lt;p&gt;What I'd do differently: define the BondIssue node as a first-class entity from day one, before any credit data touches the graph. Don't let &lt;code&gt;primary_isin&lt;/code&gt; be ambiguous for a single commit cycle. Identifier discipline is the kind of thing that feels like over-engineering until the moment you're staring at 127 duplicate nodes at midnight.&lt;/p&gt;




&lt;p&gt;The graph now knows about bonds, news, and a proper credit universe. The context builder can traverse all of it and hand an LLM something genuinely useful to reason about. The scaffolding is there.&lt;/p&gt;

&lt;p&gt;What it lacks is urgency. A static snapshot of a company's bonds and research notes is valuable, but trading floors run on &lt;em&gt;now&lt;/em&gt; — price moves, covenant triggers, breaking news. The next chapter is about making the platform feel alive: real-time event ingestion, the question of what the graph should forget, and whether a graph database is actually the right tool for anything that moves at market speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's been your experience expanding a data model you thought was solid into a messier adjacent domain? Did you rebuild from scratch, or patch and tag your way through like I did?&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part 3 of 7 in the series "Building a research hive mind"&lt;/em&gt;&lt;/p&gt;

</description>
      <category>fintech</category>
      <category>dataengineering</category>
      <category>graphdatabases</category>
      <category>python</category>
    </item>
    <item>
      <title>Choosing a graph over a database was the easy part</title>
      <dc:creator>Leiv Eriksson</dc:creator>
      <pubDate>Thu, 12 Mar 2026 00:55:46 +0000</pubDate>
      <link>https://dev.to/leivleivleiv/choosing-a-graph-over-a-database-was-the-easy-part-2el4</link>
      <guid>https://dev.to/leivleivleiv/choosing-a-graph-over-a-database-was-the-easy-part-2el4</guid>
      <description>&lt;p&gt;&lt;em&gt;Designing the node-and-edge skeleton that would have to hold everything&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The moment you decide to model your data as a graph, you are betting that relationships are more important than records — and you can't easily change your mind later. I made that bet on a Tuesday morning with a whiteboard marker in one hand and a half-finished coffee in the other, sketching nodes and edges for a system that didn't exist yet. The schema felt elegant. It felt obvious, even. And that feeling — the dangerous one, the one that makes you commit — is exactly when the doubt sets in. What if the thing that feels obvious now is the thing you'll be refactoring in six weeks, cursing yourself for not seeing the edge case that was always there?&lt;/p&gt;

&lt;p&gt;It wasn't six weeks. It was four days.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where we are
&lt;/h2&gt;

&lt;p&gt;In the first chapter, I described the research desk's memory problem: analyst notes scattered across email threads, earnings prep that lived in someone's head, no clean way to ask "what do we know about this company right now?" The answer we kept circling back to was not a better database. It was a different &lt;em&gt;kind&lt;/em&gt; of database — one that could model not just entities, but the tissue connecting them.&lt;/p&gt;

&lt;p&gt;That chapter ended with a vague architectural direction. This one is where we had to make it real.&lt;/p&gt;

&lt;p&gt;The project that I hope to build is basically a digital twin for the research division — a living graph of every company under coverage, every analyst who covers it, every note ever written, every filing pulled from EDGAR, every material event captured from exchange feeds. Not a data warehouse. Not a CRM. Something that can answer questions like: &lt;em&gt;What has changed about this issuer in the last 30 days, and who on the team should care?&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;The first decision was the one that felt least like a decision: should this be relational or graph?&lt;/p&gt;

&lt;p&gt;I asked the AI what I should do. No-brainer, it said: Postgres. It clearly knew Postgres best, having trained on endless examples online. The ecosystem is mature, the tooling is excellent, and when something goes wrong at 2am, years of Stack Overflow and GitHub training data mean the AI has an answer ready in a heartbeat. Choosing anything else seemed risky.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But the data we were modeling was fundamentally relational in the graph sense, not the table sense.&lt;/strong&gt; A company has analysts. Analysts write notes. Notes reference events. Events trigger filings. Filings update estimates. Estimates change coverage posture. Every one of those connections is a first-class fact — not a foreign key you join through, but a relationship with its own properties and its own queryable meaning.&lt;/p&gt;

&lt;p&gt;The moment you start drawing that on a whiteboard, a graph isn't exotic. It's just honest.&lt;/p&gt;

&lt;p&gt;We evaluated two options seriously. Neo4j is the obvious choice — battle-tested, Cypher is expressive, the community is large. But we were building something that needed to run embedded, close to the application layer, without the operational overhead of managing a separate server process in the early stages. That pointed us toward an alternative: an embeddable graph database with Cypher support, columnar storage, and a Python API that felt like it was designed by people who had actually suffered through the alternatives.&lt;/p&gt;

&lt;p&gt;The risk was real. The alternative was newer, production references were fewer, and the training data was thinner. We were going to be among the early adopters in a financial context, which is exactly the kind of sentence that makes compliance teams nervous. Good thing I haven't told them about the project yet (for now, I'm working on publicly available data ONLY).&lt;/p&gt;




&lt;h2&gt;
  
  
  Into the unknown
&lt;/h2&gt;

&lt;p&gt;The schema came together fast — maybe too fast.&lt;/p&gt;

&lt;p&gt;The core node types were clear from day one: &lt;code&gt;Company&lt;/code&gt;, &lt;code&gt;Analyst&lt;/code&gt;, &lt;code&gt;Sector&lt;/code&gt;, &lt;code&gt;ResearchNote&lt;/code&gt;, &lt;code&gt;Filing&lt;/code&gt;, &lt;code&gt;MaterialEvent&lt;/code&gt;, &lt;code&gt;CoverageRecord&lt;/code&gt;, &lt;code&gt;BondIssue&lt;/code&gt;, &lt;code&gt;EstimateSnapshot&lt;/code&gt;. The edges were where the thinking happened. A &lt;code&gt;COVERS&lt;/code&gt; relationship between &lt;code&gt;Analyst&lt;/code&gt; and &lt;code&gt;Company&lt;/code&gt; isn't just a link — it has a start date, a rating, a target price. A &lt;code&gt;REFERENCES&lt;/code&gt; edge between a &lt;code&gt;ResearchNote&lt;/code&gt; and a &lt;code&gt;MaterialEvent&lt;/code&gt; carries context about &lt;em&gt;why&lt;/em&gt; that event mattered to that note.&lt;/p&gt;

&lt;p&gt;Here's a simplified version of how node types were declared in Python, binding the schema definition directly to the graph client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# graph/schema/nodes.py (illustrative excerpt)
&lt;/span&gt;
&lt;span class="n"&gt;NODE_DEFINITIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Company&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ticker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;market_cap&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DOUBLE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_updated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ResearchNote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;note_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;published_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DATE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INT64&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MaterialEvent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;headline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bootstrap script would walk these definitions and issue the equivalent &lt;code&gt;CREATE NODE TABLE&lt;/code&gt; calls against the graph DB. Clean. Declarative. And — critically — easy to modify before anything real was written to the graph.&lt;/p&gt;
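&lt;p&gt;That walk can be sketched as a pure string-rendering step — the real bootstrap issues the statements through the graph client, and the exact DDL syntax and primary-key handling here (first declared field) are simplifications:&lt;/p&gt;

```python
# Illustrative sketch: walk a definitions dict and render the
# CREATE NODE TABLE statements the bootstrap would issue.
def render_ddl(node_definitions: dict[str, dict[str, str]]) -> list[str]:
    statements = []
    for table, fields in node_definitions.items():
        cols = ", ".join(f"{name} {dtype}" for name, dtype in fields.items())
        pk = next(iter(fields))  # simplification: first field is the key
        statements.append(f"CREATE NODE TABLE {table}({cols}, PRIMARY KEY ({pk}))")
    return statements

ddl = render_ddl({
    "Company": {"ticker": "STRING", "name": "STRING", "market_cap": "DOUBLE"},
})
```

&lt;p&gt;Keeping the schema as data rather than scattered DDL strings is what made it cheap to change before anything real hit the graph.&lt;/p&gt;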

&lt;p&gt;The FastAPI layer went up in parallel, and this is where we made a choice I'm still glad about: &lt;strong&gt;the API routers never touched the graph directly.&lt;/strong&gt; From day one, all graph access was routed through service classes — &lt;code&gt;CompanyService&lt;/code&gt;, and later &lt;code&gt;GraphQueryService&lt;/code&gt; — that translated between HTTP semantics and Cypher. The routers asked questions in domain language. The services answered in graph language.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# api/routers/companies.py (illustrative pattern)
&lt;/span&gt;
&lt;span class="nd"&gt;@router.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/{ticker}/snapshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_company_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CompanyService&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Depends&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;get_company_service&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_full_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Company not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The seam between router and service sounds obvious in retrospect. At the time, under pressure to get something queryable, it took active discipline not to inline the Cypher into the route handler and move on.&lt;/p&gt;

&lt;p&gt;The first real failure came from the connectors. We had live integrations running against SEC EDGAR and stock exchange news sites, plus stub connectors for Bloomberg and FactSet running in mock mode. The data coming back was messier than the schema expected — companies with missing ISINs, events with ambiguous timestamps, filings that referenced entities not yet in the graph. The graph's strictness, which felt like a feature in the design phase, became a source of friction the moment real data arrived.&lt;/p&gt;

&lt;p&gt;We spent considerable time writing transformer logic we had hoped to avoid, because the real world doesn't normalize itself for you.&lt;/p&gt;
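&lt;p&gt;The transformer code itself isn't in this post, but the shape of the problem is easy to sketch. Something like the following (field names hypothetical, not our actual schema) sat between every connector and the graph writer:&lt;/p&gt;

```python
from datetime import datetime, timezone

def normalize_company_record(raw: dict) -> dict:
    """Illustrative transformer: coerce a messy connector payload
    into the shape the graph writer expects. Names are hypothetical."""
    record = {}

    # Missing ISINs were common; fall back to any identifier we have
    # and flag the record so a later reconciliation pass can fix it.
    record["isin"] = raw.get("isin")
    record["identifier"] = raw.get("isin") or raw.get("ticker") or raw.get("lei")
    record["needs_reconciliation"] = raw.get("isin") is None

    # Ambiguous timestamps: assume UTC when no offset is present,
    # and record that assumption rather than silently guessing.
    ts = raw.get("timestamp")
    if ts is not None:
        parsed = datetime.fromisoformat(ts)
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
            record["timestamp_assumed_utc"] = True
        record["timestamp"] = parsed.isoformat()

    return record
```

&lt;p&gt;The pattern that mattered wasn't the field mapping — it was recording every assumption the transformer made, so bad guesses could be audited later instead of disappearing into the graph.&lt;/p&gt;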




&lt;h2&gt;
  
  
  What worked
&lt;/h2&gt;

&lt;p&gt;The first time I ran a traversal query against real data, I understood why people build graph databases.&lt;/p&gt;

&lt;p&gt;The query was simple: give me every research note written about companies in a specific sector, ordered by date, with the analyst and any associated material events. In a relational model, that's three or four joins, and you're mentally tracking the join path the whole time. In Cypher, it reads almost like the question itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;a:&lt;/span&gt;&lt;span class="n"&gt;Analyst&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:AUTHORED&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;n:&lt;/span&gt;&lt;span class="n"&gt;ResearchNote&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:COVERS&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;c:&lt;/span&gt;&lt;span class="n"&gt;Company&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:IN_SECTOR&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;s:&lt;/span&gt;&lt;span class="n"&gt;Sector&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;OPTIONAL&lt;/span&gt; &lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:REFERENCES&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;e:&lt;/span&gt;&lt;span class="n"&gt;MaterialEvent&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;s.name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;$sector_name&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;a.name&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n.title&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n.published_date&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c.ticker&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e.headline&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n.published_date&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's no impedance mismatch. The shape of the query matches the shape of the question. For a system where new question types arrive every week from analysts who didn't write the code, that matters enormously.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;GraphQueryService&lt;/code&gt; — introduced in the second commit — was where the architecture paid off properly. Instead of 23 inline Cypher queries scattered across five API routers, we centralized everything into typed methods with clear contracts. The routers became thin. The service became testable. And when an analyst asked for a query type we hadn't anticipated, we added one method in one place.&lt;/p&gt;
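&lt;p&gt;The pattern is easy to sketch, even without the real class. Each analyst-facing question becomes one typed method, and the database runner is injected so the service stays testable (method and field names here are illustrative, not the actual interface):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class NoteSummary:
    analyst: str
    title: str
    published_date: str
    ticker: str

class GraphQueryService:
    """Sketch of the centralized-query pattern: every Cypher query
    lives here, behind a method with a clear, typed contract."""

    def __init__(self, run_query):
        # run_query is injected; in production it would wrap a Neo4j
        # session, in tests any callable returning rows as dicts.
        self._run = run_query

    def notes_for_sector(self, sector_name):
        query = (
            "MATCH (a:Analyst)-[:AUTHORED]->(n:ResearchNote)"
            "-[:COVERS]->(c:Company)-[:IN_SECTOR]->(s:Sector) "
            "WHERE s.name = $sector_name "
            "RETURN a.name AS analyst, n.title AS title, "
            "n.published_date AS published_date, c.ticker AS ticker "
            "ORDER BY n.published_date DESC"
        )
        rows = self._run(query, {"sector_name": sector_name})
        return [NoteSummary(**row) for row in rows]
```

&lt;p&gt;Routers call these methods and nothing else. When a new question type arrives, it becomes one new method in one place, with one new return type.&lt;/p&gt;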

&lt;p&gt;&lt;strong&gt;The agents came online against this foundation, and that's when the system started feeling like something.&lt;/strong&gt; The company snapshot agent could traverse the graph and produce a structured summary of everything known about an issuer. The earnings prep agent could pull analyst coverage history, recent notes, and upcoming filing dates in a single coherent pass. The report drafting agent had something real to work with.&lt;/p&gt;

&lt;p&gt;The Prefect flow layer, added four days later, gave the pipelines retry logic and observability without requiring us to rewrite anything fundamental. The graph contract held.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this changed
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The graph bet forced an ontology conversation we would have needed anyway&lt;/strong&gt; — we just had it before writing a single line of application logic, which is the right time.&lt;/p&gt;

&lt;p&gt;If you're considering a graph database for a production system, the operational concern is real but manageable. The harder thing is the schema: once analysts and pipelines are writing to the graph, changing a node type is a migration, not a refactor. We learned to be conservative about what became a node versus what stayed as a property. Events became nodes. Statuses stayed as properties. That line matters.&lt;/p&gt;

&lt;p&gt;I'd also instrument the query layer earlier. We added logging to &lt;code&gt;GraphQueryService&lt;/code&gt; in the second commit, but we should have had it from the first bootstrap. You want to know which traversals are expensive before something is slow in production, not after.&lt;/p&gt;
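&lt;p&gt;The instrumentation we eventually added can be as small as a decorator on the query methods — a minimal sketch, assuming you just want wall-clock timing in the logs:&lt;/p&gt;

```python
import functools
import logging
import time

log = logging.getLogger("graph_query")

def timed_query(fn):
    # Wraps a query method and logs its wall-clock cost, so expensive
    # traversals show up in the logs before they show up in production.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("%s took %.1f ms", fn.__name__, elapsed_ms)
    return wrapper

@timed_query
def notes_for_sector(sector_name):
    return []  # stand-in for a real traversal
```

&lt;p&gt;Ten lines, written on day one, would have told us which traversals were expensive months before anyone complained.&lt;/p&gt;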

&lt;p&gt;The seam between data connectors and graph writers — the transformer pipeline — is where the real complexity lives. Not in the graph itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  Signpost
&lt;/h2&gt;

&lt;p&gt;The foundation is in place: a typed graph, a service layer, a set of agents that can actually reason over connected data. What comes next is the part everyone underestimates — pulling in credit data alongside equity data and discovering that the two worlds have almost nothing in common except the company name. Bonds don't have tickers. Covenants don't have analogues in equity filings. And the graph that felt complete on Tuesday morning turns out to have a very significant gap.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How do you model instruments that fundamentally resist being modeled the same way — without fracturing your schema into two separate systems that can't talk to each other?&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part 2 of 7 in the series "Building a research hive mind"&lt;/em&gt;&lt;/p&gt;

</description>
      <category>graphdatabases</category>
      <category>architecture</category>
      <category>python</category>
      <category>fastapi</category>
    </item>
    <item>
      <title>The research desk has a memory problem</title>
      <dc:creator>Leiv Eriksson</dc:creator>
      <pubDate>Mon, 09 Mar 2026 22:49:00 +0000</pubDate>
      <link>https://dev.to/leivleivleiv/the-research-desk-has-a-memory-problem-3dk6</link>
      <guid>https://dev.to/leivleivleiv/the-research-desk-has-a-memory-problem-3dk6</guid>
      <description>&lt;p&gt;&lt;em&gt;Why a securities firm needed a brain, not another dashboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An analyst leans across the desk and asks: "What's our current stance on XYZ Inc — the one that filed something last quarter?" Thirty seconds, maybe a minute at most, is how long this should take. But the analyst who covered the name has quit. Instead, what follows is a small, painful expedition. Someone opens a Bloomberg terminal. Someone else searches their inbox for a coverage note that was definitely emailed around. A third person remembers a conversation from an earnings call but can't locate the transcript. Twelve browser tabs and twenty-five minutes later, the picture assembles itself from fragments. The answer was always there. It was just distributed across four systems, two inboxes, and one analyst's increasingly unreliable memory. That moment — that unnecessary expedition — is where this project begins.&lt;/p&gt;




&lt;h2&gt;
  
  
  The beginning
&lt;/h2&gt;

&lt;p&gt;I have worked as a research analyst my entire career. I had never written a line of Python, but I've always focused on building infrastructure to remove friction from my everyday tasks. Peer comps are cumbersome, so I built spreadsheets that pull from Bloomberg and our own estimate databases. Updating PowerPoints sucks, so I built chart packs that refresh automatically (though only with expensive third-party plug-ins, since Excel and PowerPoint hate each other on a biblical level). But context gathering, context switching, and working with a complete mess of information at all times seemed like part of the job. I simply had neither the time nor the knowledge to build a universal research tool that gave me everything I wanted, when I wanted it. That is, until the latest generation of AI coding tools surfaced.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where we are
&lt;/h2&gt;

&lt;p&gt;This is the first dispatch in a seven-part series about building a research intelligence platform for a securities firm. I'm writing it as the foundation is being laid: a single enormous commit landing with 79 files and north of forty thousand lines of code — everything from graph schema definitions to live data connectors to the first generation of AI agents.&lt;/p&gt;

&lt;p&gt;I'll document the decisions as honestly as I can. The architecture choices, the wrong turns, the moments where a clean idea collided with a messy reality. This first chapter doesn't touch much code. It's about the problem itself, and why the problem turned out to be more interesting — and more stubborn — than it first appeared.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Research analysts at a securities firm operate under a specific kind of cognitive load that is almost invisible until you start mapping it. They carry coverage universes in their heads. They know which companies are approaching earnings season. They remember that a particular issuer's CFO said something cautious on a call six months ago. They recall a ratings change that happened before a junior colleague joined the team.&lt;/p&gt;

&lt;p&gt;This knowledge is real and valuable. It is also almost entirely unstructured.&lt;/p&gt;

&lt;p&gt;The firm had tools, of course. Terminal access for market data. A distribution platform for outbound research notes. A flood of stock notifications from a plethora of different sites — Bloomberg, FactSet, you name it. The data existed and we had it all. What didn't exist was any connective tissue between it — no way to ask a question that crossed system boundaries, no way to surface what the firm &lt;em&gt;collectively&lt;/em&gt; knew about a company at a given moment.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;br&gt;
&lt;strong&gt;The invisible cost isn't any single wasted lookup. It's the compounding drag of every analyst rebuilding context from scratch, every time, for every question.&lt;/strong&gt;&lt;br&gt;

&lt;/div&gt;


&lt;p&gt;What ratings changes has this analyst published in the last six months? What material events has this issuer filed since our last note? What were the key estimate revisions going into the last earnings cycle? Every one of these questions had an answer. None of them had a fast path to it, and the paths that did exist were long and painful.&lt;/p&gt;

&lt;p&gt;The temptation, at this point, is to reach for a SaaS tool. Notion, Confluence, a better intranet, a fancier search layer over the existing systems. I gave that route genuine consideration. The problem is that knowledge management tools are designed around documents and pages — human-authored artifacts that someone has already synthesized. What we were dealing with was something different: a dense web of &lt;em&gt;relationships&lt;/em&gt; between entities. Companies. Analysts. Ratings. Events. Filings. Notes. Bonds. The meaning lived not in any single document but in the connections between things.&lt;/p&gt;

&lt;p&gt;A search index would tell you what documents mention a company. It wouldn't tell you that this company's coverage analyst changed eight months ago, that coverage intensity has dropped since the change, that the last three notes were all published within a week of an earnings filing, and that a material event landed last Tuesday that nobody has formally reacted to yet. That's not a search problem. That's a graph problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Into the unknown
&lt;/h2&gt;

&lt;p&gt;I want to be honest about the vertigo of this moment. Deciding to build a custom graph-backed platform rather than assemble something from existing parts is not a modest commitment. It means owning the schema. It means owning the ingestion pipelines. It means owning the query layer, the agent layer, the API, the interface. It means that when something breaks at 7:45am before a market open, the person on call is you. I feel physically sick thinking about the responsibility I could end up with if this actually makes it all the way to production.&lt;/p&gt;

&lt;p&gt;I went looking for evidence that this was the right call rather than an elaborate form of scope creep. I studied what Palantir built for institutional knowledge management. I looked at how Cognite approaches industrial knowledge graphs in the energy sector. I tried to understand what was so ingenious about Databricks. I read everything I could find about how these firms used graph-based and other approaches to make sense of data — and, more importantly, of the connections between the data.&lt;/p&gt;

&lt;p&gt;What I kept returning to was a structural observation: &lt;strong&gt;financial research knowledge is fundamentally relational, and relational knowledge degrades when you store it in flat structures.&lt;/strong&gt; A research note isn't just a document. It's a relationship between an analyst, a company, a rating, a date, a set of estimates, and a market context. Strip away those relationships and you have a PDF. Keep them and you have something you can reason about.&lt;/p&gt;

&lt;p&gt;The early schema sketches were humbling. My first attempt at modeling the domain felt clean — companies, analysts, notes, events — until I started trying to answer real questions against it. The schema didn't know the difference between an analyst &lt;em&gt;covering&lt;/em&gt; a company and an analyst &lt;em&gt;having covered&lt;/em&gt; a company. It couldn't represent the difference between a rating that was current and one that had been superseded. Temporal validity is genuinely hard to model, and I'd underestimated it.&lt;/p&gt;

&lt;p&gt;I also underestimated how much the data connectors would teach me about the domain. Building the integration with regulatory filing feeds forced me to understand what "material event" actually means in practice versus in theory. The filings integration surfaced the gap between how events are &lt;em&gt;categorized&lt;/em&gt; and how they're actually &lt;em&gt;used&lt;/em&gt; by analysts.&lt;/p&gt;

&lt;p&gt;
  Early naive approach — the code that never shipped
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Flattening event data into a document store
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;store_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issuer_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;published&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;search_index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The problem: we've lost the relationship between the event
# and the company's coverage record, analyst assignment,
# and any notes published in response to it.
# Querying "what did we do after this event?" becomes impossible.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;

&lt;p&gt;That code never made it into the real system. But writing it clarified exactly why it couldn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  What worked
&lt;/h2&gt;

&lt;p&gt;The decision that unlocked everything was committing to graph-native modeling from the start, rather than treating the graph as a layer on top of something relational.&lt;/p&gt;

&lt;p&gt;The node types that eventually stabilised — Company, Analyst, Sector, CoverageRecord, ResearchNote, MaterialEvent, Filing, BondIssue, EstimateSnapshot — weren't designed top-down. They emerged from asking "what questions do analysts actually ask?" and working backwards. Every node type represents a thing that analysts reason about. Every edge type represents a relationship that changes the answer to a question.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Excerpt from graph/schema/nodes.py — illustrating the principle
# A CoverageRecord isn't just a link between Analyst and Company.
# It carries its own temporal properties and state.
&lt;/span&gt;
&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CoverageRecord&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;analyst_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;company_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;target_price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;coverage_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;
    &lt;span class="n"&gt;coverage_end&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;      &lt;span class="c1"&gt;# None = currently active
&lt;/span&gt;    &lt;span class="n"&gt;is_primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;last_note_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;coverage_end&lt;/code&gt; field looks trivial. It took three schema iterations to get there. Without it, you cannot answer "who covered this company before the current analyst?" Without it, you cannot audit continuity of coverage. Without it, you cannot detect that a company in your universe has gone unreferenced for ninety days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The schema is an argument about what matters.&lt;/strong&gt; Every field is a claim that this piece of information is worth carrying forward. Getting the schema right — genuinely right — turned out to be the most intellectually demanding part of the foundation phase.&lt;/p&gt;

&lt;p&gt;The agent architecture followed a similar principle. Rather than a single general-purpose assistant, the system needed specialists: a company snapshot agent that could assemble a complete current picture, an earnings preparation agent that could pull together everything relevant before a call, a material event monitor that watched for regulatory filings and surfaced them to the right analyst. Each agent is narrow. The graph is what makes the narrowness sufficient — because every agent has access to the full relational context, a focused question gets a rich answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this changed
&lt;/h2&gt;

&lt;p&gt;The commit that landed this foundation — 79 files, forty-thousand-plus lines — is almost certainly the densest single push this project will ever see. Normally that's a warning sign. In this case it reflects something real: you can't build half a knowledge graph. The schema, the connectors, the writers, the query layer — they form a system. They only work together.&lt;/p&gt;

&lt;p&gt;What I'd do differently is start the data connector work earlier, and treat it as domain research rather than engineering. Every connector teaches you something about the data it touches. The bond data integration changed my understanding of how the fixed-income side of the coverage universe needed to be modeled. I'd have designed a better initial schema if I'd done that integration first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deeper lesson is that knowledge infrastructure is never purely a technical problem.&lt;/strong&gt; It's a problem about how people think, what they need to know, and when they need to know it. The right architecture is the one that mirrors the actual cognitive work — not the one that's technically elegant in isolation.&lt;/p&gt;

&lt;p&gt;The dashboard, the agents, the API — those are expressions of an idea about what the research desk could become. The graph is the idea itself.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: the graph database selection process looked like a technical decision. It turned out to be a question about operational reality — and the answer surprised me.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the most valuable piece of knowledge at your organisation that currently lives only in someone's head? And what would it take to change that?&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://dev.to/leivleivleiv" class="crayons-btn crayons-btn--primary"&gt;Follow along — Part 2 drops soon&lt;/a&gt;
&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part 1 of 7 in the series "Building a research hive mind"&lt;/em&gt;&lt;/p&gt;

</description>
      <category>fintech</category>
      <category>ai</category>
      <category>python</category>
      <category>graphdatabases</category>
    </item>
  </channel>
</rss>
