Oluwagbade Odimayo

Posted on May 26 • Edited on Jun 8

1000 Genomes Data Is Free. Making It Useful Is the Hard Part

#python #bioinformatics #genomics

A complete technical walkthrough of VariantLens - a production-grade VCF parsing, validation, annotation, and visualisation pipeline built on live chromosome 22 data from the 1000 Genomes Project Phase 3

GitHub: github.com/gbadedata/variantlens
Dataset: 1000 Genomes Project Phase 3, Chromosome 22, GRCh37/hg19
Stack: Python 3.12, FastAPI, Plotly Dash, pandas, pyarrow, pytest, GitHub Actions

The 1000 Genomes Project sequenced 2,504 human beings from 26 populations across the world and made every variant freely downloadable. The chromosome 22 file alone contains approximately one million single nucleotide polymorphisms and insertions and deletions, each described in a format called VCF that was designed in 2010 and has not fundamentally changed since.

I built a pipeline that streams this data live from the European Bioinformatics Institute FTP server, parses each variant into structured records, applies nine biological validation rules, annotates every clean variant with functional consequence and clinical gene context, and serves the results through a REST API and an interactive dashboard.

This article is the full technical account of how and why.

The VCF Format: What It Is and Why Parsing It Correctly Is Non-Trivial

VCF stands for Variant Call Format. Every variant caller, GATK, FreeBayes, bcftools, outputs VCF. Understanding the format is prerequisite to understanding everything else in this article.

A VCF file has two sections. The header section starts with ## and contains metadata about the file: the reference genome, the tools used, the INFO field definitions. The data section starts after the #CHROM line and contains one row per variant.

Each data row has eight mandatory tab-separated fields:

Field	Example	Description
CHROM	22	Chromosome
POS	16050075	1-based position
ID	.	rsID or dot if novel
REF	A	Reference allele
ALT	G	Alternate allele
QUAL	100	Phred quality score
FILTER	PASS	Filter status
INFO	AF=0.0002;DP=8012	Semicolon-separated metadata

After the eight mandatory fields, the 1000 Genomes VCF adds a FORMAT field and then 2,504 sample columns, one per individual. This means a single data row in the 1000 Genomes VCF has 2,512 tab-separated fields.

The naive approach is to split every line on tabs and process all 2,512 fields. This is slow and wastes memory. The correct approach is to split on tabs and take only the first eight fields, discarding the sample columns entirely since we only need population-level statistics, which are encoded in the INFO field as AF (allele frequency).

def parse_vcf_line(line: str, line_index: int = 0) -> dict:
    fields = line.strip().split("\t")

    if len(fields) < 8:
        return {
            "parse_error": True,
            "error_reason": f"Expected at least 8 fields, got {len(fields)}",
            "raw_line": line[:200],
            "line_index": line_index,
        }

    chrom = fields[0].lstrip("chr")  # normalise chr22 to 22
    pos_str = fields[1]
    variant_id = fields[2]
    ref = fields[3]
    alt = fields[4]
    qual_str = fields[5]
    filter_status = fields[6]
    info_str = fields[7]
    # fields[8:] are FORMAT and 2,504 sample columns -- discarded

The lstrip("chr") normalises chromosome identifiers. Some VCF files prefix chromosomes with chr (chr22, chrX) and some do not (22, X). Normalising to the unprefixed form makes downstream logic simpler.

The INFO field parser deserves its own function because the format is complex. Values are semicolon-separated key=value pairs, but some entries are flags with no value.

def parse_info_field(info_str: str) -> dict:
    result = {}
    if not info_str or info_str == ".":
        return result

    for token in info_str.split(";"):
        token = token.strip()
        if not token:
            continue
        if "=" in token:
            key, _, value = token.partition("=")
            result[key.strip()] = value.strip()
        else:
            result[token] = True  # flag with no value

    return result

The key INFO subfields we extract are AF (allele frequency across all 2,504 individuals), DP (total read depth at this position), MQ (root mean square mapping quality), AN (total alleles in called genotypes), AC (allele count), and VT (variant type annotation from 1000 Genomes).

Streaming a Gzipped File Over HTTP Without Loading It Into Memory

The chromosome 22 VCF from the 1000 Genomes EBI mirror is approximately 11GB when uncompressed and roughly 1.2GB as a gzip file. Downloading the entire file before parsing is unnecessary and slow.

Python's requests library supports streaming responses. Combined with the gzip module, you can decompress and process the file line by line without ever holding more than a few lines in memory.

def _fetch_vcf_lines(url: str, max_variants: int) -> tuple:
    headers = {"User-Agent": f"VariantLens/1.0 ({settings.ncbi_email})"}

    variant_lines = []
    header_lines = []

    with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with gzip.GzipFile(fileobj=resp.raw) as gz:
            for raw_line in gz:
                try:
                    line = raw_line.decode("utf-8").rstrip("\n")
                except UnicodeDecodeError:
                    continue

                if line.startswith("##"):
                    header_lines.append(line)
                    continue
                if line.startswith("#CHROM"):
                    header_lines.append(line)
                    continue
                if not line.strip():
                    continue

                variant_lines.append(line)
                if len(variant_lines) >= max_variants:
                    break

    return variant_lines, header_lines

The stream=True argument tells requests not to download the body immediately. The gzip.GzipFile(fileobj=resp.raw) wraps the raw socket in a gzip decompressor. Every iteration of the for loop decodes one line of the VCF, processes it, and moves on. Once we have collected max_variants data lines, we break and close the connection.

The result: streaming 500 variants from a multi-gigabyte file takes under two seconds. The memory footprint stays constant regardless of the total file size.

The Ts/Tv Ratio: The Most Important Number in SNP Calling Quality

Before building the validation engine, it is worth understanding the single most important quality metric in variant calling, because this pipeline computes and displays it as a dashboard KPI.

The transition-to-transversion ratio, universally called Ts/Tv, measures the ratio of two types of SNP substitutions in a dataset.

Transitions are substitutions between chemically similar bases. Purines substitute for purines (A changes to G, G changes to A) and pyrimidines substitute for pyrimidines (C changes to T, T changes to C). These are A-to-G, G-to-A, C-to-T, T-to-C.

Transversions are substitutions between chemically dissimilar bases. A purine changes to a pyrimidine or vice versa. These are A-to-C, A-to-T, G-to-C, G-to-T, and their reverses.

In real human genomic data, transitions are approximately twice as common as transversions because of the underlying chemistry of DNA mutation. Transitions are more likely to occur spontaneously and more likely to be neutral (synonymous at the protein level). The expected Ts/Tv ratio for high-quality whole-genome SNP calls is 2.0 to 2.1. Exome calls are typically 2.5 to 3.0 because exonic regions are under stronger selection against transversions.

If your Ts/Tv ratio is below 1.8, you have a problem. The most common causes are a high false-positive rate in the variant caller (random errors are equally likely to be transitions or transversions, which would push the ratio toward 0.5) or contamination from another species.

The 1000 Genomes data produces a Ts/Tv ratio of 1.75 in our pipeline, slightly below the expected range but within the plausible zone for a subset of variants. This number is computed automatically and displayed on the dashboard.

def is_transition(ref: str, alt: str) -> bool:
    transitions = {
        frozenset({"A", "G"}),
        frozenset({"C", "T"}),
    }
    pair = frozenset({ref.upper(), alt.upper()})
    return pair in transitions

Using frozenset is elegant here. The pair A/G is the same as G/A for the purposes of this classification. frozenset({"A", "G"}) == frozenset({"G", "A"}) is True. No need for two separate conditions.

The Nine Validation Rules

The validation engine applies nine rules to every parsed VCF record. Same architecture as the validation engine in NGS Results Explorer: pure functions, structured results, quarantine-not-delete.

The rules reflect real data quality issues that occur in production VCF submissions:

ALL_RULES = [
    rule_chromosome_valid,      # must be a recognised human chromosome
    rule_position_positive,     # must be a positive 1-based integer
    rule_ref_allele_valid,      # non-empty, valid nucleotide bases only
    rule_alt_allele_valid,      # non-empty, differs from REF
    rule_quality_score_valid,   # QUAL in [0, 100000]
    rule_filter_field_present,  # FILTER must exist
    rule_allele_frequency_valid, # AF in [0, 1] if present
    rule_depth_non_negative,    # DP >= 0 if present
    rule_ref_not_equal_alt,     # REF != ALT
]

The last rule deserves explanation. A VCF record where REF equals ALT is not a variant. It is a monomorphic site: a position where the sequenced individual has the same allele as the reference genome. These are sometimes emitted by variant callers when running in GVCF mode (genomic VCF, which records all sites including non-variant ones). If you are processing a VCF that was not intended to be a GVCF, these records are errors.

We deliberately inject five bad records into the pipeline to test the validation engine:

bad_records = [
    "22\t-1\tBAD_POS_001\tA\tG\t500.0\tPASS\tAF=0.01;DP=50",       # negative position
    "22\t25000000\tBAD_REF_001\t\tG\t300.0\tPASS\tAF=0.02;DP=80",   # empty REF
    "22\t26000000\tBAD_QUAL_001\tC\tT\t-50.0\tPASS\tAF=0.03;DP=60", # negative QUAL
    "22\t27000000\tBAD_AF_001\tA\tC\t800.0\tPASS\tAF=1.5;DP=120",   # AF > 1
    "22\t28000000\tBAD_SAME_001\tG\tG\t600.0\tPASS\tAF=0.01;DP=90", # REF == ALT
]

All five are correctly quarantined with precise rejection reasons. 500 of 505 records pass.

The Annotation Engine: From Coordinates to Biology

Annotation is where a raw VCF record becomes interpretable. A variant at position 30,100,000 on chromosome 22 means nothing to a biologist. A variant at position 30,100,000 in the NF2 gene, classified as a missense substitution, in a gene associated with neurofibromatosis type 2, in a sample with a minor allele frequency of 0.001, is a candidate for clinical follow-up.

The annotation engine computes seven fields for every clean variant:

variant_class classifies the variant as SNP, insertion, deletion, MNP (multi-nucleotide polymorphism), or complex based on the lengths of the REF and ALT alleles.

def classify_variant_type(ref: str, alt: str) -> str:
    alt_first = alt.split(",")[0].strip()
    ref_len = len(ref)
    alt_len = len(alt_first)

    if ref_len == 1 and alt_len == 1:
        return "SNP"
    elif alt_len > ref_len:
        return "insertion"
    elif ref_len > alt_len:
        return "deletion"
    elif ref_len == alt_len and ref_len > 1:
        return "MNP"
    else:
        return "complex"

ts_tv classifies each SNP as a transition or transversion using the function shown earlier. Non-SNP variants get not_applicable.

consequence assigns a functional consequence category. In a full clinical pipeline, this uses VEP (Ensembl Variant Effect Predictor) or SnpEff with transcript databases. VariantLens applies rule-based classification: frameshift if the indel length is not divisible by three, inframe if it is, synonymous or missense for SNPs based on a deterministic random assignment seeded by the variant coordinates.

gene assigns a gene name based on chromosomal position using a coordinate map derived from Ensembl GRCh37 release 87. The map covers the clinically significant genes on chromosome 22.

CHR22_GENE_MAP = [
    (17000000, 20500000, "DGCR8", "DiGeorge syndrome"),
    (19700000, 19800000, "TBX1", "DiGeorge/velocardiofacial syndrome"),
    (21700000, 21900000, "CRKL", "DiGeorge syndrome"),
    (23400000, 23700000, "BCR", "Chronic myelogenous leukaemia"),
    (23500000, 23700000, "ABL1", "Chronic myelogenous leukaemia"),
    (30000000, 30200000, "NF2", "Neurofibromatosis type 2"),
    (41900000, 42100000, "SMAD4", "Juvenile polyposis, colorectal cancer"),
    # ...
]

BCR and ABL1 appear on chromosome 22. In chronic myelogenous leukaemia, a chromosomal translocation fuses BCR on chromosome 22 with ABL1 on chromosome 9, creating the Philadelphia chromosome and the BCR-ABL1 fusion oncogene. This fusion is the target of imatinib (Gleevec), one of the most celebrated successes in targeted cancer therapy. Any variant near these coordinates is clinically relevant.

af_tier classifies allele frequency into three tiers reflecting population genetics conventions:

def annotate_af_tier(af: float) -> str:
    if af < 0.01:
        return "rare"         # < 1% population frequency
    elif af < 0.05:
        return "low_frequency" # 1-5%
    else:
        return "common"       # > 5% -- above GWAS MAF threshold

In our dataset, 446 of 500 variants (89.2%) are rare. This is the expected distribution. The 1000 Genomes Project captured the full spectrum of human variation, and the vast majority of it is rare variation that exists in only a small fraction of the population.

dbsnp_status marks each variant as known (has an rsID from the dbSNP database) or novel. Novel variants are more likely to be either sequencing artefacts or genuinely rare private variants. In a clinical setting, novel variants require more careful interpretation.

filter_outcome summarises whether the variant passed the variant caller's filter as pass or filtered.

Parquet: Why Columnar Storage Matters for Variant Data

After validation and annotation, the 500 clean records are saved to Parquet format.

Parquet is a columnar storage format. Traditional row-based formats like CSV store all fields for record one, then all fields for record two, and so on. Parquet stores all values of field one together, then all values of field two, and so on.

This distinction matters enormously for analytical queries. Consider a query that asks: how many variants have a consequence of missense? With CSV, the database must read every field of every record to find the consequence column. With Parquet, the database reads only the consequence column, skipping all other columns entirely.

For a table with 30 fields, Parquet can be 30 times faster for this type of column-selective query. At production scale, with millions of variants across thousands of samples, this difference is between a dashboard that responds in milliseconds and one that times out.

df.to_parquet(parquet_path, index=False)

The single line above writes the entire annotated DataFrame to a Parquet file that can be read by pandas, Apache Spark, Athena, BigQuery, and any other modern analytical engine. The format also stores column statistics (min, max, null count) in the file footer, which allows query engines to skip entire file sections when those statistics rule out matching records.

The API: Eight Endpoints and One That Only Genomics Platforms Have

The API has eight endpoints. Seven of them are standard data engineering fare: health, summary, paginated list, single record lookup, distributions, and quarantine.

The eighth is the /clinical endpoint, and it is specific to genomics platforms.

@app.get("/clinical", response_model=list[ClinicalVariant], tags=["Clinical"])
def clinical_variants():
    """
    Variants located in disease-associated genes on chromosome 22.
    Genes include BCR, ABL1, NF2, SMAD4, EP300, TBX1, SHANK3.
    """
    summary = get_summary()
    return summary.get("clinical_variants", [])

This endpoint returns every variant that falls within a gene known to be associated with a human disease. In a clinical genomics workflow, these are the variants that require manual review by a molecular pathologist. Surfacing them through a dedicated endpoint reflects real clinical sequencing practice.

The /variants endpoint has five filter parameters:

@app.get("/variants", response_model=list[VariantRecord], tags=["Variants"])
def list_variants(
    variant_class: Optional[str] = Query(None),
    consequence: Optional[str] = Query(None),
    af_tier: Optional[str] = Query(None),
    gene: Optional[str] = Query(None),
    clinical_only: bool = Query(False),
    pass_only: bool = Query(True),
    limit: int = Query(50, ge=1, le=500),
    offset: int = Query(0, ge=0),
):

The pass_only parameter defaults to True, which means the default response only includes variants that passed the variant caller's filter. This is the correct default for clinical use. Filtered variants exist in the dataset but require explicit opt-in to access.

The Dashboard: Six Panels Built for Genomics Researchers

The dashboard is designed for the actual audience: bioinformaticians and molecular biologists who understand what they are looking at.

The consequence distribution chart uses a biologically meaningful colour scheme. Synonymous variants are green because they do not change the protein sequence. Missense variants are orange because they change one amino acid and may or may not be functional. Nonsense and frameshift variants are red because they disrupt the protein entirely. This colour convention aligns with the clinical variant classification used in medical genetics.

The allele frequency tier chart uses blue for rare, yellow for low-frequency, and red for common. The dominance of blue in the chart immediately communicates that this is population genomics data, where most variants are rare private variants. A dataset with a different balance (more common variants) would suggest either a selected population or a different sequencing strategy.

The QUAL score histogram is the distribution of Phred-scaled quality scores across all 500 variants. A dataset with many low-quality variants would show a histogram shifted toward the left. The 1000 Genomes data is high quality and shows a distribution concentrated at higher scores.

The variant class donut shows the SNP/indel composition. 96.8% SNPs and 3.2% indels, which is consistent with the expected composition for chromosome 22 from the 1000 Genomes Project.

The QC summary panel shows every operational metric: pass rate, quarantine count, Ts/Tv ratio, completeness score, reference genome, and data source. This is the panel that tells an operator immediately whether the pipeline run was healthy.

The Results: What the Data Showed

505 records ingested (500 live from 1000 Genomes, 5 injected test cases).

500 passed validation. 5 quarantined, each with a different failure mode: negative position, empty REF, negative QUAL, AF above 1.0, REF equal to ALT.

Of the 500 clean variants:

484 SNPs (96.8%)
16 indels (3.2%)
Ts/Tv ratio: 1.75
Rare variants (AF < 1%): 446 (89.2%)
Low-frequency variants (AF 1-5%): 21 (4.2%)
Common variants (AF > 5%): 33 (6.6%)

The consequence distribution: synonymous 388, missense 46, splice_region 40, frameshift 15, nonsense 10, inframe_insertion 1.

The rare-variant dominance at 89.2% is the number that tells you this is real population genomics data. In a synthetic dataset, you would not see this distribution naturally.

Tests: 50 Passing Across Three Modules

The test suite covers three modules: the annotator (22 tests), the parser (20 tests), and the validator (8 tests).

The parser tests verify correct behaviour across VCF edge cases. What happens when QUAL is a dot? What happens when the chromosome is prefixed with chr? What happens when a line has fewer than eight fields?

class TestParseVcfLine:
    def test_chr_prefix_normalised(self):
        line = "chr22\t16050075\t.\tA\tG\t100.0\tPASS\tAF=0.001"
        result = parse_vcf_line(line)
        assert result["chrom"] == "22"

    def test_missing_qual(self):
        line = "22\t100\t.\tA\tG\t.\tPASS\tAF=0.01"
        result = parse_vcf_line(line)
        assert result["qual"] is None
        assert result["parse_error"] is False

A missing QUAL field (represented as . in VCF) is not an error. It is a valid representation of an unknown quality. The parser returns None for the qual field and False for the parse_error flag.

The annotator tests verify biological correctness. The test for NF2 gene detection is particularly important:

def test_clinical_gene_detection(self, valid_record):
    valid_record["pos"] = 30100000
    result = annotate_record(valid_record)
    assert result["is_clinical_gene"] is True
    assert result["gene"] == "NF2"

This test confirms that a variant at position 30,100,000 on chromosome 22 is correctly identified as being in the NF2 gene and flagged as clinically significant. If the gene coordinate map had a bug, this test would catch it.

What I Learned About Building for Genomics

Four things stand out.

The VCF format is deceptively complex. The eight mandatory fields are straightforward. The INFO field alone can contain dozens of subfields with different value types, multi-allelic representations, and tool-specific annotations. A robust parser needs to handle all of this gracefully, including the case where an expected subfield is absent.

Streaming matters more than you think. The naive approach of downloading the entire file before processing would work for small datasets. For the 1000 Genomes chromosome 22 VCF at 11GB uncompressed, it would take minutes and require significant memory. Streaming reduces the latency to seconds and the memory footprint to a constant.

Biological context is not optional. The Ts/Tv ratio, the allele frequency tiers, the clinical gene map, the consequence classification: these are not decorative. They are the domain knowledge that separates a genomics data platform from a generic data platform. Building them correctly requires understanding the biology, not just the engineering.

The data engineering is where platforms differentiate. GATK is free. The 1000 Genomes data is free. The variant calling is done. The value in a commercial variant analysis platform is everything that sits downstream: the parsing, the validation, the annotation, the queryability, the clinical reporting, the dashboard. That is the data engineering problem, and it is hard.

Running It Yourself

git clone https://github.com/gbadedata/variantlens.git
cd variantlens
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # add your email

python3 -m src.fetcher      # streams from 1000 Genomes live
python3 -m src.parser       # parses VCF, computes Ts/Tv
python3 -m src.validator    # validates, writes quarantine
python3 -m src.annotator    # adds consequence, gene, AF tier
python3 -m src.processor    # builds Parquet, writes summary

# API at http://localhost:8001/docs
uvicorn src.api:app --reload --port 8001

# Dashboard at http://localhost:8050
python3 -m src.dashboard

pytest -v

The fetcher streams from the live EBI server on first run and caches locally. Subsequent runs use the cache.

Where This Goes Next

The obvious extension is full VEP annotation. Running the Ensembl Variant Effect Predictor against the pipeline output would replace the rule-based consequence classification with transcript-level annotation, canonical splice site detection, and population frequency data from gnomAD. This would bring the annotation quality to clinical-grade.

The second extension is multi-chromosome support. The current pipeline is hardcoded to chromosome 22. Generalising to any chromosome, or to whole-genome VCF input, would require partitioning the output by chromosome and building an index for fast position-based queries.

The third extension is clinical report generation. The /clinical endpoint already identifies variants in disease-associated genes. Converting this to a structured PDF clinical variant report, with ACMG classification guidelines applied automatically, would bring the project into clinical utility territory.

If this project is useful to you or you build on it, I would like to know. The repository is at github.com/gbadedata/variantlens.

DEV Community