MatBanik

Posted on May 20 • Originally published at matbanik.info

You Got Your Whole Genome Sequenced. Now What?

#health #nutrition #lifestyle #wellness

Published May 20, 2026 on matbanik.info

You paid somewhere between $300 and $500 for whole genome sequencing. Maybe Nebula, maybe Dante, maybe one of the newer providers that keep undercutting each other. You got a folder. Inside that folder: a VCF file with 4.86 million rows of variant data and a dashboard that shows you maybe 1% of what's in there.

The dashboard answers their questions. Ancestry breakdown. A handful of trait associations. Maybe a carrier screening if you're lucky. But your questions — the ones that actually matter to you — don't have a button.

"What are my CYP enzyme metabolizer phenotypes?" No button.

"Which variants did I inherit from my father versus my mother?" Definitely no button.

"What does ClinVar say about that variant my doctor flagged last week?" Not even close.

This post is about closing that gap. I'll show why it exists, what it looks like when it's gone, and how to set it up yourself — locally, without uploading your genome anywhere.

The "Now What?" Problem

The consumer whole-genome sequencing market has a strange asymmetry. Getting sequenced is easy: 30x coverage costs under $500 in 2026. The price curve has been falling for years and it's still falling.

But interpretation hasn't kept up.

You get a VCF file with millions of variant rows. Each row is a position in your genome that differs from the reference. Some of those differences matter a lot. Most don't. And the consumer platforms that sold you the test? They show you a curated slice — pre-built reports with pre-built answers.

That's fine for the questions they anticipated. It's useless for the ones they didn't.
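To make "each row is a position that differs from the reference" concrete, here is a minimal sketch of reading one VCF data row in Python. It assumes a single-sample VCF with the standard tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, sample). The `Variant` class and `parse_vcf_line` helper are illustrative names, not part of any tool mentioned in this post, and the example row is synthetic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    chrom: str
    pos: int
    rsid: Optional[str]      # None when the ID column is "."
    ref: str
    alt: List[str]           # one or more alternate alleles
    genotype: Optional[str]  # e.g. "0/1"; None if there is no sample column


def parse_vcf_line(line: str) -> Optional[Variant]:
    """Parse one tab-separated VCF data row; return None for header lines."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt = fields[:5]
    genotype = None
    if len(fields) >= 10:
        # Column 9 (FORMAT) names the per-sample keys; column 10 holds the values.
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        genotype = sample.get("GT")
    return Variant(chrom, int(pos), None if vid == "." else vid,
                   ref, alt.split(","), genotype)


# Synthetic row for illustration only (made-up position and rsID).
row = "chr1\t12345\trs0000001\tG\tA\t50\tPASS\t.\tGT:DP\t0/1:31"
print(parse_vcf_line(row))
```

A genotype of `0/1` means one reference allele and one alternate allele (heterozygous). Multiply that by a few million rows and you have the file your dashboard is summarizing.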

Here's what gets lost in the average consumer dashboard:

Non-SNP variation. Structural variants, copy number variations, mitochondrial DNA. Most consumer platforms skip these entirely or treat them as secondary.

Family-aware queries. If you've sequenced your parents too, you're sitting on a trio — three genomes that can tell you which variants came from whom. Zero consumer platforms offer trio analysis.

Flexible evidence lookup. You read about a variant in a research paper. You want to know if you carry it, what ClinVar says, what the population frequency is across ancestry groups. The dashboard doesn't have a "look up an arbitrary rsID" button.

The 23andMe bankruptcy in March 2025 made this concrete. Within 24 hours of the Chapter 11 filing, the site saw 1.5 million visits — a 526% spike. The data-deletion page got 376,000 hits on day one, 480,000 on day two. People realized their most personal dataset was sitting on someone else's server, and they wanted it back.

The demand for local-first analysis isn't hypothetical. It's measured in deletion-page clicks.

What It Looks Like When You Close the Gap

The stack I've been running has three components:

GeneChat-MCP handles local VCF queries. It reads your variant files directly — nothing leaves your machine. You can ask about specific genes, scan for known pathogenic variants, compare inheritance patterns across family members.

OpenCRAVAT-MCP connects to cloud annotation databases, but only sends rsIDs (the public identifiers for known variants, like rs4988235 for lactase persistence). Your actual genotype stays local. What comes back: population frequencies, functional predictions, protein interaction data, regulatory annotations — the context that makes a variant meaningful.

Pomera handles session notes. When you're working through complex queries, you want persistent context.

All three run inside your IDE through the Model Context Protocol. Antigravity, Codex, Claude Code — anything that supports MCP. You ask in natural language. The IDE routes the query to the right tool. You get grounded answers from your actual files.

No uploads. No subscription tiers. No waiting for a report.

Figure: Same desk, different story. The data is organized, queryable, and entirely local.

Four Examples from a Real Genome

I've run all of these queries on real data. Full reports — with complete tables, source databases, and version stamps — are available in the examples directory on GitHub. Here's what they found.

Drug Metabolism (Pharmacogenomics)

Query: "What are my CYP enzyme metabolizer phenotypes for drug metabolism?"

Gene	Likely Phenotype	Key Finding
CYP2C19	Normal Metabolizer	No *2, *3, or *17 alleles
CYP2D6	Intermediate Metabolizer	Heterozygous *4 carrier (*1/*4)
CYP2C9	Normal Metabolizer	No *2 or *3 alleles
CYP3A5	Non-expresser (*3/*3)	Common European genotype

CYP2D6 matters most here. It metabolizes roughly 25% of all prescribed drugs — codeine, tramadol, tamoxifen, many antidepressants, several beta-blockers. An intermediate metabolizer status means reduced enzyme activity. Codeine won't convert to morphine as efficiently. Some antidepressants may need dose adjustments.

This is factual genotype data, not a prescription. But it's exactly the kind of information worth discussing with a prescriber before they write a script for tramadol.

Over a hundred FDA drug labels reference pharmacogenomic biomarkers. Your prescriber may not know your metabolizer status. You can bring this information to them.
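For readers curious how a diplotype like *1/*4 becomes "Intermediate Metabolizer", here is a minimal sketch of the activity-score approach CPIC uses for CYP2D6, as I understand it: each allele gets a function value, the two values are summed, and the sum maps to a phenotype. The allele table below is a tiny illustrative subset, the thresholds should be checked against the current CPIC tables, and `cyp2d6_phenotype` is a hypothetical helper, not GeneChat's implementation.

```python
# Illustrative subset only: *1 and *2 are normal-function alleles,
# *4 and *5 are no-function alleles. Real tables cover many more alleles.
ALLELE_ACTIVITY = {"*1": 1.0, "*2": 1.0, "*4": 0.0, "*5": 0.0}


def cyp2d6_phenotype(diplotype: str) -> str:
    """Map a diplotype such as "*1/*4" to a likely metabolizer phenotype."""
    first, second = diplotype.split("/")
    score = ALLELE_ACTIVITY[first] + ALLELE_ACTIVITY[second]
    if score == 0:
        return "Poor Metabolizer"
    if score < 1.25:
        return "Intermediate Metabolizer"
    if score <= 2.25:
        return "Normal Metabolizer"
    # Only reachable with gene duplications, which this sketch does not model.
    return "Ultrarapid Metabolizer"


print(cyp2d6_phenotype("*1/*4"))
```

The sketch ignores copy number changes and hybrid alleles, which is exactly where CYP2D6 calling gets hard in practice. Treat it as an explanation of the mapping, not a caller.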

→ Full report: Pharmacogenomics Profile

Figure: Four enzymes, four results. The amber one, CYP2D6, processes roughly a quarter of all prescribed drugs.

What You Inherited from Whom (Trio Analysis)

Query: "For these well-known variants, which parent did I inherit them from?"

Three VCF files. Three genomes. Standard trio logic: if you're heterozygous and one parent carries the variant while the other doesn't, you know which side it came from.

Variant	Gene	Inheritance
rs1801131	MTHFR (A1298C)	Paternal — father is het, mother is wildtype
rs17822931	ABCC11 (earwax type)	Maternal — mother is het, father is wildtype
rs4988235	MCM6 (lactase)	Both parents — one allele from each
rs1050450	GPX1 (antioxidant)	Not inherited — mother carries it, subject is wildtype

That last row is the one people don't expect. Your parent carries a variant. You didn't inherit it. That's a question you can answer with trio analysis and literally nothing else available to consumers.
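The trio logic above is simple enough to sketch in a few lines. This is a minimal illustration, assuming biallelic sites and VCF-style genotype strings ("0/0", "0/1", "1/1"); the function names are hypothetical and this is not the code GeneChat-MCP runs.

```python
def alt_count(genotype: str) -> int:
    """Count non-reference alleles in a GT string such as "0/1" or "1|1"."""
    alleles = genotype.replace("|", "/").split("/")
    # Missing calls (".") are treated as reference here, which is a simplification.
    return sum(1 for allele in alleles if allele not in ("0", "."))


def infer_origin(child: str, father: str, mother: str) -> str:
    """Classify where the child's alternate allele(s) came from."""
    c, f, m = alt_count(child), alt_count(father), alt_count(mother)
    if c == 0:
        return "not inherited" if (f or m) else "absent in trio"
    if c == 2:
        return "both parents" if (f and m) else "inconsistent"
    # Child is heterozygous from here on.
    if f and not m:
        return "paternal"
    if m and not f:
        return "maternal"
    if f and m:
        return "ambiguous"  # both parents carry it; genotypes alone cannot tell
    return "possible de novo or genotyping error"


print(infer_origin("0/1", "0/1", "0/0"))
```

Note the "ambiguous" case: when the child is heterozygous and both parents carry the variant, genotypes alone can't assign a side. Resolving that takes phasing information from nearby variants.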

→ Full report: Trio Inheritance Analysis

Figure: Three genomes, three orbs. The threads of light trace what was inherited, and what wasn't.

ClinVar Genome Scan

Query: "Scan my genome for ClinVar pathogenic variants."

ClinVar is the NIH's public archive of clinically relevant variants: the ones linked to diseases, drug responses, or other phenotypes. A full scan of my 4.86 million variants found:

100 pathogenic variants across 41 genes
25 drug-response variants

That sounds alarming until you dig in. Most entries flagged as "pathogenic" have conflicting classifications. One lab calls it pathogenic, another calls it benign, a third says uncertain significance. The database captures this disagreement, which is actually valuable — it tells you where the science is still unsettled.

A handful of variants had consistent pathogenic classifications across multiple submitters. Those are worth reviewing with a genetic counselor. The rest are noise, or at least noise until more evidence accumulates.
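Separating the consistent calls from the contested ones can be sketched as a small triage function. This assumes records from the ClinVar VCF release, where the INFO column carries a `CLNSIG` (clinical significance) and a `CLNREVSTAT` (review status) field. The bucket names and the `triage` helper are my own illustration, not the scan's actual logic.

```python
# Review statuses treated as "multiple submitters agree or an expert body reviewed it".
STRONG_REVIEW = {
    "criteria_provided,_multiple_submitters,_no_conflicts",
    "reviewed_by_expert_panel",
    "practice_guideline",
}


def parse_info(info: str) -> dict:
    """Split a VCF INFO string ("K1=V1;K2=V2") into a dictionary."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def triage(info: str) -> str:
    """Sort a ClinVar record into a coarse review bucket."""
    fields = parse_info(info)
    significance = fields.get("CLNSIG", "").lower()
    review = fields.get("CLNREVSTAT", "")
    # Check "conflicting" first: that label also contains the word "pathogenicity".
    if "conflicting" in significance:
        return "conflicting"
    if "pathogenic" in significance:
        if review in STRONG_REVIEW:
            return "consistent pathogenic"
        return "pathogenic, limited review"
    return "other"


record = "CLNSIG=Pathogenic;CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts"
print(triage(record))
```

Only the first bucket is the "review with a genetic counselor" pile. The others are the entries where the evidence is thin or the submitters disagree.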

→ Full report: ClinVar Variant Scan

Deep Variant Annotation (OpenCRAVAT)

Query: "Give me a deep annotation of rs4988235."

This is where the rsID-only cloud query earns its keep. A single call — annotate_rsid("rs4988235") — returned over 150 annotation fields for the lactase persistence variant:

CADD score: Functional impact prediction
Population frequencies across seven ancestry groups: 60.2% in Europeans, 0.3% in East Asians (this variant spread alongside dairy farming in northern Europe, so its geographic distribution tells a roughly 10,000-year story)
56 protein interactors: The broader molecular network
Regulatory element data: Where this variant sits in the genome's control architecture

Getting this normally requires a bioinformatics pipeline — downloading databases, running annotation tools, parsing output formats. Here it's one function call that sends only the rsID, not your genotype.
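The privacy property here comes down to what goes into the outbound request. As a minimal sketch of the idea (not OpenCRAVAT-MCP's actual payload format, which I'm not reproducing here), a request builder can whitelist the rsID and drop everything else from the local record. `build_annotation_request` and the record keys are hypothetical.

```python
import re

# Public dbSNP-style identifiers look like "rs" followed by digits.
RSID_PATTERN = re.compile(r"rs\d+")


def build_annotation_request(local_record: dict) -> dict:
    """Return an outbound payload containing only the public rsID.

    Genotype, sample name, and any other local fields are never copied.
    """
    rsid = local_record.get("rsid")
    if not rsid or not RSID_PATTERN.fullmatch(rsid):
        raise ValueError("record has no valid rsID; nothing to send")
    return {"rsid": rsid}


local = {"rsid": "rs4988235", "genotype": "0/1", "sample": "me"}
print(build_annotation_request(local))
```

Building the payload from a whitelist rather than deleting sensitive keys means a new field added to the local record later can't leak by accident.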

→ Full report: OpenCRAVAT Deep Annotation

Two additional reports — trait associations and polygenic risk scores — are available in the GitHub examples directory.

Clone, Configure, Query

What You Need

Your VCF file(s) from any whole-genome sequencing provider
An agentic IDE that supports MCP (Antigravity, Codex, Claude Code)
Python 3.10+ and conda (for GeneChat-MCP)
About 2 GB of disk space for annotation databases (ClinVar, SnpEff, GWAS Catalog, PGS models)
Optional: an OpenCRAVAT cloud account (free tier) for deep annotation

Architecture

Your IDE (chat)
  Antigravity / Codex / Claude Code
       │                │
       ▼                ▼
  ┌──────────┐   ┌──────────────┐
  │ GeneChat │   │  OpenCRAVAT  │
  │   MCP    │   │     MCP      │
  │ (local)  │   │ (cloud API)  │
  └────┬─────┘   └──────┬───────┘
       │                │
       ▼                │ rsIDs only
  ┌──────────┐          │ (no genome data)
  │ Your VCF │          ▼
  │  files   │     CADD, REVEL,
  │ (local)  │     gnomAD, BioGRID
  └──────────┘

Setup

I'm not going to reproduce the README here. If you can configure an MCP server in your IDE, you can follow the instructions in the repo.

Clone it: github.com/matbanik/agentic-genomics

The setup has three pieces: GeneChat-MCP (local VCF querying), OpenCRAVAT-MCP (cloud annotation), and your IDE's MCP configuration. The README walks through each one.

Two things that trip people up:

VCF indexing. Your VCF files need to be indexed with tabix before the first query. GeneChat expects .vcf.gz + .vcf.gz.tbi pairs. If the index is missing, queries will fail silently or throw cryptic errors. The repo documents this, but it's the single most common setup issue.

Contig format mismatch. Some sequencing providers use chr1, chr2, chr3 prefixes. Others use bare 1, 2, 3. If your VCF uses one format and the reference databases expect the other, variant lookups will miss. The repo handles the conversion, but it's worth knowing why a query might return "not found" when you know the variant is there.
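Both pitfalls are easy to check before the first query. The sketch below finds `.vcf.gz` files with no `.tbi` index next to them and converts contig names between the `chr1` and bare `1` styles (including the `chrM` / `MT` mitochondrial naming). The helper names are my own and are not part of the repo. A missing index can usually be created with `tabix -p vcf yourfile.vcf.gz`, assuming the file was compressed with bgzip rather than plain gzip.

```python
from pathlib import Path
from typing import List


def missing_indexes(directory: str) -> List[Path]:
    """Return the .vcf.gz files in a directory that have no .tbi index beside them."""
    return sorted(
        vcf for vcf in Path(directory).glob("*.vcf.gz")
        if not Path(str(vcf) + ".tbi").exists()
    )


def to_bare(contig: str) -> str:
    """Convert "chr1" style names to bare "1" style ("chrM" becomes "MT")."""
    if contig == "chrM":
        return "MT"
    return contig[3:] if contig.startswith("chr") else contig


def to_prefixed(contig: str) -> str:
    """Convert bare "1" style names to "chr1" style ("MT" becomes "chrM")."""
    if contig in ("MT", "M"):
        return "chrM"
    return contig if contig.startswith("chr") else "chr" + contig


print(to_bare("chr7"), to_prefixed("X"))
```

If a lookup returns "not found" for a variant you know you carry, comparing the contig names in your VCF header against the database's convention is a quick first check.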

Trio analysis? Same setup, more genomes. Register each family member's VCF file and query across all of them.

What You Can Ask

These are natural-language prompts you can type directly into your IDE. The agent routes each one to the right MCP tool automatically.

What is my CYP2D6 metabolizer status?

Scan my genome for ClinVar pathogenic variants

Which of these variants did I inherit from my mother?

Calculate my BMI polygenic risk score

Give me a deep annotation of rs4988235

What GWAS associations exist for caffeine metabolism?

→ All six example reports on GitHub

The Landscape

You're Not Alone

This space is moving fast. A few projects worth knowing about.

ClawBio came out of the UK AI Agent Hackathon at Imperial College. It's a Python CLI and library — not an MCP server, so the architecture differs, but the goal overlaps. They've built two tools I haven't seen elsewhere: gwas-lookup federates queries across nine GWAS databases simultaneously, and clinpgx pulls from PharmGKB, CPIC guidelines, and FDA label annotations in one call. Complementary work, different interface paradigm.

Academic signals are emerging too. A paper in Briefings in Bioinformatics formalized the MCPmed framework for medical AI agents. EMBL has BioContextAI in development. cBioPortal — the cancer genomics database — now has an MCP interface. IBM Research presented related work at ISMB.

On the open-source side: Bio-MCP provides general bioinformatics tool access, gget-mcp wraps the gget library for gene/protein queries, and IGV-MCP connects to the Integrative Genomics Viewer for visualization.

The pattern is clear. Genomic data is becoming queryable through conversational interfaces. The question is whether that happens on your machine or someone else's.

What This Isn't

I want to be direct about boundaries.

This does not diagnose disease. A ClinVar "pathogenic" flag is not a diagnosis — it's a database entry reflecting submitted evidence, often with conflicting interpretations.

This does not recommend treatments. A CYP2D6 intermediate metabolizer status is a genotype fact. What to do about it is a clinical decision that depends on your full medical context, other medications, and your prescriber's judgment.

This does not provide nutritional advice. Your MTHFR status does not tell you what supplements to take.

Star-allele calls — CYP2D6 *1/*4, CYP2C19 *1/*1 — are factual genotype data. They describe what variants you carry. Translating that into action requires a human with clinical training and your complete picture.

If you find something concerning, discuss it with a genetic counselor or your prescriber. That's not a hedge; it's how this works.

Source databases are named and versioned in each query output: PharmVar for star allele definitions, CPIC for genotype-phenotype mappings, ClinVar for clinical variant classifications, gnomAD for population frequencies.

Closing

Your genome is the most personal dataset you'll ever own. Right now, most of it sits in a folder you've never opened — or on a server you can't control.

The tools to change that are free, open, and run on your machine.

Your data. Your machine. Your questions.

If you get stuck during setup, open an issue on GitHub. That's what it's there for.

Resources

⚠️ Disclaimer: This is a factual genotype report, not medical advice. Discuss actionable findings with your prescriber or genetic counselor. Source databases: PharmVar, CPIC, ClinVar, gnomAD — versions stated per query.

Originally published on matbanik.info. Cross-posted with ❤️ to Dev.to.

Top comments (2)

Sam Rivera • May 31

great article on local-first genome analysis. as someone who quit smoking and started tracking health metrics, the pharmacogenomics angle really resonates — knowing how you metabolize medications is something most people never think about until they need it. the CYP2D6 finding is especially relevant since so many common prescriptions depend on it. love that this keeps everything on your own machine.

MatBanik • Jun 1

Sam, thanks for reading and for the thoughtful comment. Really appreciate you taking the time.

And congrats on quitting smoking, that’s a huge shift. I think you nailed the part that makes pharmacogenomics feel less abstract.

Curious how you’d want to use this kind of info: as a one-time reference to bring up with a doctor, or as something that lives alongside the health metrics you’re already tracking?