Ali Sher

Grants to Investments Part 2-3: Models and Pipelines

🚀 Grants ETL Pipeline: Rust + Transformer-Based Classification

📌 Overview

I built an end-to-end ETL pipeline to ingest, classify, and analyze Canadian government grant data. The project combines:

  • ⚡ High-performance data extraction using Rust
  • 🧠 Semantic classification using BERT (zero-shot)
  • 📊 Structured output ready for downstream analytics and dashboarding

This project demonstrates systems design, data engineering, and applied NLP in a production-style pipeline.


🧩 Extraction Layer (Rust)

The Problem

The Grants Canada portal has no accessible API, only an HTML-rendered search interface. I needed a way to extract structured data at scale.

The Solution

I built a custom scraper targeting the paginated search endpoint:
https://search.open.canada.ca/grants/?page={}&sort=agreement_start_date+desc
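As a rough illustration of how the page placeholder in that URL gets filled in, here is a minimal Python sketch (the scraper itself is written in Rust, so this is not its code). The names `SEARCH_URL` and `page_urls` are hypothetical, and numbering pages from 1 is an assumption.

```python
from typing import Iterator

# URL template from the post; `{}` is the page-number placeholder.
SEARCH_URL = "https://search.open.canada.ca/grants/?page={}&sort=agreement_start_date+desc"


def page_urls(first_page: int, last_page: int) -> Iterator[str]:
    """Yield one search URL per page, in order (both bounds inclusive)."""
    for page in range(first_page, last_page + 1):
        yield SEARCH_URL.format(page)


# Example: the first three result pages (assumes pages are numbered from 1).
urls = list(page_urls(1, 3))
```

Because `page_urls` is a generator, a crawler can request pages one at a time instead of building the full URL list up front.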

Key Decisions

I started with Python but switched to Rust for performance at scale. The Rust scraper uses:

  • scraper: HTML parsing
  • csv: structured output

The scraper is designed to handle large-scale ingestion without excessive memory use or runtime.
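The memory point can be illustrated with a streaming write: each record goes to the CSV output as soon as it is parsed, so nothing accumulates in memory. This is a Python sketch of the idea, not the Rust implementation; `write_rows`, `fake_scrape`, the column names, and the sample row are all hypothetical.

```python
import csv
import io
from typing import Iterable, Iterator, TextIO

# Hypothetical column names, loosely based on the fields in a grant record.
FIELDS = ["agreement", "agreement_number", "recipient", "amount", "location"]


def write_rows(rows: Iterable[dict], out: TextIO) -> int:
    """Write rows to `out` one at a time and return how many were written."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for row in rows:  # `rows` can be a generator, so only one row is held at once
        writer.writerow(row)
        count += 1
    return count


def fake_scrape() -> Iterator[dict]:
    """Stand-in for the scraper: yields parsed records lazily."""
    yield {
        "agreement": "Example agreement",
        "agreement_number": "EX-001",
        "recipient": "Example recipient",
        "amount": "1000.00",
        "location": "Example City, Quebec, CA",
    }


buffer = io.StringIO()  # a real run would use open("grants.csv", "w", newline="")
written = write_rows(fake_scrape(), buffer)
```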

Outcome

✅ Successfully extracted structured grant data into CSV
✅ Significantly faster ingestion vs. the prior Python-based workflow

📄 Sample Record

Agreement: European Space Agency (ESA)'s Space Weather Training Course
Agreement Number: 25COBLLAMY
Date Range: Mar 11, 2026 → Mar 27, 2026
Description: Supports Canadian students attending international space training events
Recipient: Canadian Space Agency
Amount: $1,000.00
Location: La Prairie, Quebec, CA
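Fields like the amount and date range arrive as display strings, so downstream analytics would likely need them typed. The sketch below shows one way to do that in Python, assuming the formats seen in the sample record (`$1,000.00` and `Mar 11, 2026 → Mar 27, 2026`); the function names are hypothetical and not part of the original pipeline.

```python
from datetime import date, datetime
from decimal import Decimal
from typing import Tuple


def parse_amount(text: str) -> Decimal:
    """Convert a string such as '$1,000.00' to an exact Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def parse_date(text: str) -> date:
    """Convert a string such as 'Mar 11, 2026' to a date.

    `%b` matches abbreviated month names in the current locale
    (English in the default C locale).
    """
    return datetime.strptime(text.strip(), "%b %d, %Y").date()


def parse_date_range(text: str, separator: str = "→") -> Tuple[date, date]:
    """Split 'start → end' on the separator and parse both sides."""
    start_text, end_text = text.split(separator)
    return parse_date(start_text), parse_date(end_text)


amount = parse_amount("$1,000.00")
start, end = parse_date_range("Mar 11, 2026 → Mar 27, 2026")
duration_days = (end - start).days
```

`Decimal` is used instead of `float` so that dollar amounts are stored exactly.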


🧠 Transformation + Classification

Objective

Categorize grants into meaningful sectors for analytics and discovery, making the data explorable beyond its raw fields.

Categories

CATEGORIES = [
    "Housing & Shelter",
    "Education & Training",
    "Employment & Entrepreneurship",
    "Business & Innovation",
    "Health & Wellness",
    "Environment & Energy",
    "Community & Nonprofits",
    "Research & Academia",
    "Indigenous Programs",
    "Public Safety & Emergency Services",
    "Agriculture & Rural Development",
    "Arts, Culture & Heritage",
    "Civic & Democratic Engagement"
]

🤖 Model Choice

I evaluated two approaches:

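| Approach | Verdict |
| --- | --- |
| Traditional ML (clustering) | Not selected: clusters still need labeled data to map onto the categories, and the grouping is less semantic |
| BERT via Hugging Face (zero-shot) | ✅ Selected |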

Why zero-shot BERT?

  • No labeled dataset required
  • Strong semantic understanding out-of-the-box
  • Fast to implement and iterate

⚙️ Inference Pipeline

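# Assumes `classifier` is a Hugging Face zero-shot-classification pipeline
# and `df` is a pandas DataFrame with one grant description per row in 'text'.
print("Running classification...")
predictions = []

for text in df['text']:
    result = classifier(text, candidate_labels=CATEGORIES)

    # Labels come back sorted by score, so index 0 is the best match.
    predictions.append({
        'predicted_category': result['labels'][0],
        'confidence_score': result['scores'][0]
    })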

Each grant description gets mapped to its most semantically relevant category, with a confidence score attached.
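To make the shape of that mapping concrete, here is a small self-contained sketch of the same loop. It swaps the real model for a hypothetical `keyword_stub` so it runs without downloading anything. The stub only mimics the `labels` and `scores` layout that a Hugging Face zero-shot pipeline returns (sorted best first); its scores are simple word-overlap counts and carry no real meaning.

```python
from typing import Callable, Dict, List

# A classifier is any callable that returns a dict shaped like the output of a
# Hugging Face zero-shot pipeline: labels sorted by descending score.
Classifier = Callable[..., Dict[str, list]]


def top_predictions(texts: List[str], classifier: Classifier, labels: List[str]) -> List[dict]:
    """Keep only the best label and its score for each text."""
    predictions = []
    for text in texts:
        result = classifier(text, candidate_labels=labels)
        predictions.append({
            "predicted_category": result["labels"][0],
            "confidence_score": result["scores"][0],
        })
    return predictions


def keyword_stub(text: str, candidate_labels: List[str]) -> Dict[str, list]:
    """Hypothetical stand-in for the real model: scores a label by word overlap."""
    words = set(text.lower().split())
    raw = []
    for label in candidate_labels:
        label_words = set(label.lower().replace("&", " ").split())
        raw.append(len(words & label_words) + 1)  # +1 avoids an all-zero total
    total = sum(raw)
    ranked = sorted(zip(candidate_labels, raw), key=lambda pair: pair[1], reverse=True)
    return {
        "sequence": text,
        "labels": [label for label, _ in ranked],
        "scores": [score / total for _, score in ranked],
    }


labels = ["Education & Training", "Health & Wellness", "Research & Academia"]
rows = top_predictions(
    ["Supports students attending international space training events"],
    keyword_stub,
    labels,
)
```

Passing the classifier in as an argument keeps the loop testable: the stub can be replaced by the real pipeline without touching `top_predictions`.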


🧼 Data Quality

The source data was highly structured and clean, which meant:

  • Minimal preprocessing required
  • Faster iteration on modeling and pipeline integration
  • No time lost on data wrangling before getting to the interesting parts

📦 Next Steps

The pipeline is actively being extended:

  • 🗄️ Load Layer → Persist classified data in a database
  • 📊 Analytics Dashboard → Visualize funding trends by category, region, and time
  • ⏱️ Pipeline Orchestration → Automate ingestion + inference end-to-end

💡 Key Takeaways

  1. Rust is a legit choice for ETL scraping, not just systems programming. The performance gains over Python are real and measurable.
  2. Zero-shot BERT punches above its weight for classification tasks without labeled data. It's a great first-pass model.
  3. Modular pipeline design pays off early: separating extraction, transformation, and load made iteration much faster.
  4. Don't over-engineer: the right tool for each layer matters more than using a single stack.

🔗 Links


Open to opportunities in Data Science, ML Engineering, and Data Engineering. Feel free to reach out at alisher213@outlook.com
