Penfield

Posted on Jun 15

We Built an Open-Source Promethease Alternative in a Month. Here's the Architecture.

Seven public genomics databases. One offline pipeline. A self-contained HTML report. How we wired together ClinVar, PharmGKB, GWAS Catalog, SNPedia, AlphaMissense, CADD, and gnomAD into a single CLI tool that runs on your machine and costs nothing.

The problem

Promethease was the standard. For years, if you had raw data from a consumer DNA test and wanted to know what any of it meant, you paid $12, uploaded your file, and got a report cross-referencing your variants against SNPedia.

It worked. Then MyHeritage acquired it. SNPedia froze. No new curation since mid-2023. The reports still dump 25,000 entries on you with little prioritization. You still have to upload your most sensitive personal data to someone else's server. And the price is now $25.

Research-grade tools like VEP and ANNOVAR are powerful but require bioinformatics expertise and don't handle consumer file formats. Paid services ($25-$1,400+) bundle genotyping with proprietary reports, but their annotation pipelines are black boxes. Many do not clearly identify all the databases they query. Only some offer reports separately.

We wanted something that didn't exist: a comprehensive annotation pipeline that queries every major public database, runs entirely offline, preserves privacy, supports the file formats people actually have, and lets you audit every result back to its primary source and related research.

So we built Allelix.

What it does

Allelix is a Python CLI tool. Give it a raw genotype file from any major consumer DNA test. It cross-references every variant against seven public databases in a single pass and outputs an interactive HTML report you can open in any browser, offline, forever.

pip install allelix
allelix db update
allelix analyze your_raw_data.txt --output report.html

The databases, each pulled from their authoritative public source:

ClinVar (NCBI): clinical significance classifications with review status
PharmGKB (Stanford): drug-gene interactions with CPIC evidence levels
GWAS Catalog (EBI/NHGRI): genome-wide association study findings
SNPedia: community annotations with magnitude scoring
gnomAD (Broad Institute): population allele frequencies from 807,162 individuals
AlphaMissense (DeepMind): pathogenicity predictions for all possible single amino acid substitutions
CADD (University of Washington): variant deleteriousness scoring

Every annotation cites its source database. Nothing is a black box. The tool says "ClinVar classifies this variant as pathogenic," never "this variant is pathogenic." That distinction isn't a disclaimer. It's a design constraint that affects model naming, report wording, and category labeling throughout the codebase.

There's also a hosted analysis service for users who want to try it without installing anything. Files are processed and deleted within minutes. But the real point is local execution: Your data, your machine, your report.

The architecture

The pipeline has three stages: parse, annotate, report.

Parsers handle six consumer DNA formats natively: 23andMe, AncestryDNA, FTDNA, LivingDNA, MyHeritage, and Tempus. Each normalizer produces a common internal representation. Adding a new format means adding one file to allelix/parsers/ and registering it. Many tools in this space only support two or three formats. We support six because real-world data arrives in whatever format the test provider shipped.
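
As a rough illustration of the normalizer-plus-registry idea, here is a minimal sketch of two parsers emitting one shared record type. The names Genotype, PARSERS, and register are hypothetical and are not taken from the Allelix codebase. The file layouts are simplified versions of the commonly documented 23andMe and AncestryDNA raw-data layouts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator


@dataclass(frozen=True)
class Genotype:
    """Common internal representation (field names are assumptions)."""
    rsid: str
    chrom: str
    pos: int
    alleles: str  # e.g. "AG"


Parser = Callable[[Iterable[str]], Iterator[Genotype]]
PARSERS: Dict[str, Parser] = {}  # hypothetical registry: format name -> parser


def register(name: str) -> Callable[[Parser], Parser]:
    """Decorator that adds a parser function to the registry."""
    def decorator(func: Parser) -> Parser:
        PARSERS[name] = func
        return func
    return decorator


@register("23andme")
def parse_23andme(lines: Iterable[str]) -> Iterator[Genotype]:
    # '#' comment lines, then tab-separated: rsid, chromosome, position, genotype
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        rsid, chrom, pos, genotype = line.split("\t")
        if "-" in genotype:  # "--" marks a no-call
            continue
        yield Genotype(rsid, chrom, int(pos), genotype)


@register("ancestrydna")
def parse_ancestrydna(lines: Iterable[str]) -> Iterator[Genotype]:
    # '#' comment lines, a column header row, then tab-separated:
    # rsid, chromosome, position, allele1, allele2
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#") or line.startswith("rsid"):
            continue
        rsid, chrom, pos, a1, a2 = line.split("\t")
        if "0" in (a1, a2):  # "0" marks a no-call
            continue
        yield Genotype(rsid, chrom, int(pos), a1 + a2)
```

With this shape, everything downstream only ever sees Genotype records, so supporting another vendor is one more decorated function.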

Annotators query local database copies. Each annotator is independent and streams results rather than loading everything into memory. This matters when you're processing 600,000+ variants against databases containing billions of rows.
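
To make the streaming point concrete, here is a minimal sketch of an annotator written as a generator over a local SQLite copy. The table name, column names, and the annotate_stream function are assumptions for illustration, not the actual Allelix schema or API.

```python
import sqlite3
from typing import Dict, Iterable, Iterator, Tuple


def annotate_stream(
    conn: sqlite3.Connection,
    variants: Iterable[Tuple[str, str]],
) -> Iterator[Dict[str, str]]:
    """Yield one annotation per matching variant without building a list.

    `variants` is any iterable of (rsid, genotype) pairs, so it can itself
    be a generator fed by a parser. The `clinvar` table is hypothetical.
    """
    cur = conn.cursor()
    for rsid, genotype in variants:
        row = cur.execute(
            "SELECT significance, review_status FROM clinvar WHERE rsid = ?",
            (rsid,),
        ).fetchone()
        if row is None:
            continue  # no record for this variant in this source
        significance, review_status = row
        yield {
            "source": "ClinVar",
            "rsid": rsid,
            "genotype": genotype,
            "significance": significance,
            "review_status": review_status,
            # Attributed wording: the source makes the claim, not the tool.
            "statement": f"ClinVar classifies {rsid} as {significance.lower()}",
        }
```

Because both the input and the output are lazy, memory use stays flat no matter how many variants pass through.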

Reports come in four flavors: interactive HTML (self-contained, works offline), JSON (schema-versioned), terminal output (Rich-formatted), and PLINK export for research workflows. There are also specialized modes for pharmacogenomics and methylation pathways, plus a filter-file mode for custom panels.

Database updates are on demand via allelix db update, so you can pull fresh data whenever you want. Freshness detection uses MD5/ETag headers, so you only download what changed. The analyze subcommand also updates any source older than 7 days automatically, unless you opt out.
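
The freshness check can be pictured as a comparison between the validators a server returns and the ones saved from the last download. This is a minimal sketch under that assumption; needs_download and its parameters are hypothetical names, and the real implementation may compare checksums differently.

```python
from typing import Mapping, Optional


def needs_download(
    remote_headers: Mapping[str, str],
    cached_etag: Optional[str],
    cached_md5: Optional[str] = None,
) -> bool:
    """Return True when the remote file appears to differ from the local copy.

    `remote_headers` would typically come from an HTTP HEAD request.
    Header names are matched case-insensitively.
    """
    headers = {k.lower(): v.strip() for k, v in remote_headers.items()}

    etag = headers.get("etag")
    if etag and cached_etag:
        return etag != cached_etag.strip()

    md5 = headers.get("content-md5")
    if md5 and cached_md5:
        return md5 != cached_md5.strip()

    # No usable validator on either side: assume stale and re-download.
    return True
```

Falling back to "download" when no validator is available trades some bandwidth for never silently keeping stale data.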

Build detection: trust the data, not the header

Human genomes are mapped against reference assemblies. The two most common currently in circulation are GRCh37 and GRCh38. The same genetic variant has different position numbers in each build. If a tool thinks the data is GRCh37 when it's actually GRCh38, every position-based lookup returns the wrong annotation. You'd get results for the wrong variants entirely, without warning.

We discovered that some vendors put one build number in the file header while the actual position data corresponds to a different build.

Allelix doesn't trust headers. It uses 11 sentinel SNPs across 7 chromosomes where the position differs between builds, checks the actual position values in the file, and calls the build from the data. When a header/data mismatch is detected, the report warns the user. This is documented in ADR-0021.
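
The sentinel approach can be sketched as a simple vote. The sentinel IDs and coordinates below are made-up placeholders, not the 11 SNPs Allelix actually uses, and detect_build and header_mismatch are hypothetical names.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# Placeholder table: rsid -> (GRCh37 position, GRCh38 position).
# These IDs and coordinates are invented for illustration only.
SENTINELS: Dict[str, Tuple[int, int]] = {
    "rs_demo_1": (1000, 1500),
    "rs_demo_2": (2000, 2600),
    "rs_demo_3": (3000, 3700),
}


def detect_build(positions: Dict[str, int], min_votes: int = 2) -> Optional[str]:
    """Call the genome build from observed positions, ignoring any header.

    `positions` maps rsid -> position as found in the user's file.
    Returns None when the evidence is too thin or is tied.
    """
    votes: Counter = Counter()
    for rsid, (pos37, pos38) in SENTINELS.items():
        observed = positions.get(rsid)
        if observed == pos37:
            votes["GRCh37"] += 1
        elif observed == pos38:
            votes["GRCh38"] += 1
    if not votes:
        return None
    (best, count), *rest = votes.most_common()
    if count < min_votes or (rest and rest[0][1] == count):
        return None
    return best


def header_mismatch(header_build: Optional[str], data_build: Optional[str]) -> bool:
    """True when both builds are known and they disagree."""
    return (
        header_build is not None
        and data_build is not None
        and header_build != data_build
    )
```

Spreading sentinels across several chromosomes means a single odd locus cannot flip the call on its own.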

Every other tool we've evaluated trusts the header.

Building the CADD cache

CADD scores every possible single nucleotide variant in the human genome for deleteriousness. The full dataset is ~81GB compressed.

Allelix doesn't ship that. We wrote a build script that loads ~80 million position keys from gnomAD, AlphaMissense, and ClinVar, then streams through the full CADD file keeping only matching rows. The output is a compact SQLite cache hosted on HuggingFace. Users get it via allelix db update alongside everything else. The build script ships with the repo for reproducibility, but nobody needs to run it.
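
The filtering step amounts to a streaming set-membership test. The sketch below assumes the standard CADD TSV layout (comment lines starting with "#", then Chrom, Pos, Ref, Alt, RawScore, PHRED) and a hypothetical cache table. It is not the build script that ships with the repo.

```python
import sqlite3
from typing import Iterable, Set, Tuple

PositionKey = Tuple[str, int]  # (chromosome, position)


def build_cache(
    cadd_lines: Iterable[str],
    wanted: Set[PositionKey],
    conn: sqlite3.Connection,
) -> int:
    """Stream CADD rows and keep only those at wanted positions.

    Returns the number of rows written. Only `wanted` is held in memory;
    the CADD file is read one line at a time.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cadd ("
        "chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, raw REAL, phred REAL, "
        "PRIMARY KEY (chrom, pos, ref, alt))"
    )
    kept = 0
    with conn:  # one transaction, committed on success
        for line in cadd_lines:
            if line.startswith("#"):
                continue
            chrom, pos, ref, alt, raw, phred = line.rstrip("\n").split("\t")[:6]
            if (chrom, int(pos)) not in wanted:
                continue
            conn.execute(
                "INSERT OR REPLACE INTO cadd VALUES (?, ?, ?, ?, ?, ?)",
                (chrom, int(pos), ref, alt, float(raw), float(phred)),
            )
            kept += 1
    return kept


# Usage against a real download might look like this (file names are placeholders):
#   import gzip
#   with gzip.open("cadd_scores.tsv.gz", "rt") as fh:
#       build_cache(fh, wanted_keys, sqlite3.connect("cadd_cache.sqlite"))
```

A Python set of ~80 million tuples is memory-hungry, so a production script would likely use a more compact key encoding, but the control flow stays the same.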

Composite scoring

Most annotation tools that surface SNPedia data just pass through the magnitude scores. We built a cross-source composite scoring system that aggregates severity signals across all seven databases. The highest severity annotation across all sources surfaces first. ClinVar pathogenicity, GWAS effect sizes, PharmGKB evidence levels, and SNPedia magnitudes all feed into a unified ranking.

The result is that clinically significant findings sort to the top of the report regardless of which database flagged them. A pathogenic ClinVar variant and a high-magnitude SNPedia entry both surface above a benign population-frequency hit, even if the SNPedia magnitude alone would have buried it.
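
One way to picture "highest severity across all sources surfaces first" is to map each source's signal onto a shared 0 to 1 scale and rank by the maximum. The mapping tables, field names, and thresholds below are illustrative assumptions, not Allelix's actual weights.

```python
import math
from typing import Dict, List, Tuple

# Assumed mappings onto a shared 0..1 severity scale.
CLINVAR_SEVERITY = {
    "pathogenic": 1.0,
    "likely pathogenic": 0.8,
    "uncertain significance": 0.4,
    "likely benign": 0.1,
    "benign": 0.0,
}
PHARMGKB_SEVERITY = {"1A": 1.0, "1B": 0.9, "2A": 0.7, "2B": 0.6, "3": 0.3, "4": 0.1}


def severity(annotation: dict) -> float:
    """Normalize one annotation to 0..1 (unknown sources score 0)."""
    source = annotation["source"]
    if source == "ClinVar":
        return CLINVAR_SEVERITY.get(annotation["significance"].lower(), 0.0)
    if source == "PharmGKB":
        return PHARMGKB_SEVERITY.get(annotation["level"], 0.0)
    if source == "SNPedia":
        # SNPedia magnitude runs from 0 to 10.
        return min(max(annotation["magnitude"], 0.0), 10.0) / 10.0
    if source == "GWAS Catalog":
        # Distance of the odds ratio from 1 on a log scale, capped at OR = 3.
        return min(abs(math.log(annotation["odds_ratio"])) / math.log(3.0), 1.0)
    return 0.0


def rank(variants: Dict[str, List[dict]]) -> List[Tuple[str, float]]:
    """Rank variants by their single most severe annotation, highest first."""
    scored = [
        (rsid, max((severity(a) for a in annotations), default=0.0))
        for rsid, annotations in variants.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Taking the maximum rather than a sum keeps a single strong signal from being diluted by several weak ones on the same variant.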

How it was actually built

Allelix was built by one person in approximately one month. At v1.9.0: 197 files, 30,000+ lines of Python, 1,279 tests across 50 test files, 35 architecture decision records, and 1,292 tests providing 93.16% coverage. All six parsers verified against real data. Cross-parser gold standards ensuring identical annotation output regardless of input format. Annotation accuracy verified variant-by-variant against source databases.

This was only possible because of AI-assisted development. We used Claude Code for implementation and Penfield for managing the research, architecture, and design decisions across dozens of sessions over the course of the project.

The project didn't start as "build a genotyping tool." It started as wanting a comprehensive, private, reproducible analysis of raw genetic data. That meant weeks of research before writing code: genome builds, chip architectures, file formats, database schemas, annotation conventions, no-call patterns, strand orientation, population frequency interpretation. Hundreds of connected findings, design decisions, and architectural tradeoffs that needed to persist across sessions and be retrievable by context.

Penfield handled that. Every research finding was stored, tagged, and connected to related findings in a knowledge graph. Design decisions could be recalled with full context. Session handoffs preserved not just what was decided, but why. Competitive intelligence, technical constraints, regulatory considerations, and implementation details all lived in one persistent, searchable, connected memory system.

Without persistent AI memory, a project this domain-heavy either takes one person six months or requires a team. The genomics knowledge alone is far too large for a single conversation context. Penfield made it possible to build up domain expertise incrementally across weeks and query it on demand when making implementation decisions.

The thesis: synthesis is approaching free

Here's the architectural bet underlying everything about Allelix.

The genetics industry has traditionally bundled three things: raw data (the lab test), annotation (looking up what variants mean in research databases), and synthesis (turning annotations into a narrative report with recommendations).

The first two have real costs. Lab tests require physical infrastructure. Annotation databases require curation, hosting, and maintenance.

Synthesis is rapidly becoming free. Take Allelix's JSON output, hand it to any frontier language model with proper source gating, and you get a customized report that matches or exceeds what many paid services produce, built on more recent research data. You can focus it on pharmacogenomics, cardiovascular risk, or whatever you actually care about. You can regenerate it every time a database updates. You can integrate your own lab work into the analysis.

Pre-generated one-size-fits-all reports age the moment they're created. Many of the "proprietary interpretation layers" in this space are structured data from public databases plus a synthesis pass that many LLMs can closely replicate, if not exceed, on demand.

The durable value is in the data layer: freshness, transparency, traceability, and user control. That's what Allelix provides. The interpretation layer is yours to build however you want.

What's coming

VCF and whole-genome sequencing support (v2.0.0): Most consumer genotyping chips cover roughly 600,000 to 700,000 positions, about 0.02% of the roughly 3.1 billion base pairs in the genome. Whole genome sequencing covers everything. That means millions of variants per individual, including rare and novel variants that aren't yet classified in any clinical database. This is where pathogenicity predictors like CADD and the planned REVEL integration become critical. For chip data, many variants already have ground truth in ClinVar. For the majority of WGS data, computational predictions are all you have.

REVEL integration: An ensemble score built from 13 individual pathogenicity and conservation tools that generally outperforms its individual components for missense variant classification. Purpose-built for the rare and novel variants that WGS surfaces.

Try it

pip install allelix
allelix db update
allelix analyze your_raw_data.txt --output report.html

Or use the hosted analysis service to try it without installing anything (but be aware of the privacy tradeoff).

Your data stays on your machine. The report works offline. Every annotation traces to its source. The code is open and auditable.

GitHub: github.com/dial481/allelix
Website: allelix.io
License: AGPL-3.0

Allelix was built with AI-assisted development using Claude Code for implementation and Penfield for persistent memory across the research, architecture, design, and development process. Zero to v1.x in approximately one month. v2.0.0 with VCF/gVCF/WGS support is under active development.