Traditional data exploration requires knowing what questions to ask. SQL queries, dashboards, BI tools: all are built around predefined questions.
But what if you could explore a data warehouse conversationally? Ask open-ended questions and have an AI discover patterns you never thought to look for?
This is what the Model Context Protocol (MCP) enables. I built an MCP server for analyzing the Rust ecosystem and let Claude explore it and answer questions about it.
0. Background
The Rust ecosystem has exploded in recent years, but understanding its evolution requires more than browsing crates.io. Which crates are becoming foundational infrastructure? What dependencies are growing fastest? Are there data quality issues in the registry itself?
To answer these questions, I needed access to the raw data, and crates.io provides exactly that. You can check it out on their Data Access page.
I built a complete data engineering project for this data that runs locally and then made it AI-explorable using MCP. This post shows how.
If you wish to follow along, visit Rust Crates Analytics Repo to get started!
1. Introduction to crates.io data
Before we get to the cool stuff and show some interesting findings, we need to understand the data we get from crates.io.
When you download and extract the DB dump file (download link), you get a PostgreSQL DB dump with clear instructions in the README.md on how to load it into your PostgreSQL instance.
Alongside the instructions, you'll see a data folder with CSV files holding the actual contents. Below is the ERD of the DB, along with some quick notes on what the tables represent.
Entities
- crates - Rust packages published to crates.io
- versions - Specific releases of a crate (e.g. serde v1.0.228)
- categories - Taxonomic classifications for organizing crates (e.g. science::bioinformatics, finance, algorithms)
- keywords - User-defined tags for discoverability (e.g. cargo, sql, machine-learning)
- teams - Organizational accounts that can own crates (GitHub)
- users - Individual developer accounts (GitHub)
Fact Tables
- version_downloads - Time series download counts per version per day
- crate_downloads - All-time total download counts per crate
Junction Tables
- crates_categories - Links crates to their categories
- crate_owners - Links crates to users or teams who own them
- crates_keywords - Links crates to their keywords
- dependencies - Links versions to the crates they depend on
Support Tables
- metadata - a single-row table holding the total_downloads column
- reserved_crate_names - Administrative list of protected/unavailable crate names
- default_versions - Links crates to their default version with version count
crates.io ERD
The central tables that we are most interested in are version_downloads and dependencies.
However, for performance reasons and according to their README.md, version_downloads only contains the last 3 months of data. So depending on the day you download the DB dump, you get a 3-month window of download time series, which is completely understandable. But what if someone wants to look further back into the history of the Rust ecosystem?
At this link, you'll find that crates.io publishes a CSV dump for each day that falls outside the 3-month window shipped with the DB dump. In contrast to the version_downloads table, these CSVs only contain the version_id and downloads columns, so you'll have to infer the date from the filename if you wish to extend your version_downloads table.
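Since these daily CSVs lack a date column, the date must be recovered from the filename. Here's a minimal sketch in Python, assuming each dump is named after its date (e.g. 2024-01-01.csv or 2024-01-01.csv.gz; the exact naming in the real archive may differ, so adjust the pattern to what you actually downloaded):

```python
from datetime import date, datetime
from pathlib import Path


def date_from_archive_filename(path: str) -> date:
    """Infer the download date from an archive CSV filename.

    Assumes the daily dump is named after its date, e.g.
    '2024-01-01.csv' or '2024-01-01.csv.gz' (hypothetical naming).
    """
    # Strip directories, then drop '.csv' / '.csv.gz' suffixes
    stem = Path(path).name.split(".")[0]
    return datetime.strptime(stem, "%Y-%m-%d").date()


print(date_from_archive_filename("archives/2024-01-01.csv.gz"))
```

The backfill script can then attach this date to every row of the corresponding CSV before inserting into stg_version_downloads.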
The other tables reflect the actual state of the ecosystem at the time you download the DB dump. Tables like crates (with created_at, updated_at and documentation), dependencies, categories and the rest contain the latest updates as of the download.
This means we can snapshot the crates.io DB on a daily basis and observe changes across the ecosystem: for example, which crates and versions were deleted, or whether any dependencies changed. We'll talk more about this later in the post as one of the project's use cases.
Recap
For the purpose of the project, this is what we have to understand about the crates.io data:
- the crates.io DB dump is updated daily and reflects the state of the ecosystem at the time of download
- after each update, one day drops out of the 3-month window and is published as a CSV in the version downloads archives
- the most important tables are crates, versions, version_downloads and dependencies
2. Rust Crates Analytics - Architecture Overview
Now that we understand the crates.io data and its relationships, we want to ingest both the DB dump and the archives. But how do we do that?
We will design a data warehouse with ELT pipeline, where we will define schemas:
- raw - Direct CSV loads from crates.io dumps
- staging - Cleaned and validated data
Usually we'd also build a marts schema for analytics-ready tables, but for the purposes of this post we'll focus on the raw-to-staging transformations.
For Extract phase, we will download the crates.io DB dump (tar.gz) and extract the CSV files.
For Load phase, we will import all CSVs into the raw schema using a full refresh strategy on all tables, meaning that each new dump completely replaces the raw tables, giving us a clean snapshot of crates.io's current state.
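As a sketch, a full-refresh load of one table into the raw schema might look like this in DuckDB SQL (the file path is illustrative; the repo's load script handles this for every CSV):

```sql
-- CREATE OR REPLACE drops the previous snapshot and loads the new one,
-- making each load idempotent per dump
CREATE SCHEMA IF NOT EXISTS raw;
CREATE OR REPLACE TABLE raw.crates AS
SELECT * FROM read_csv_auto('data/extracted/crates.csv');
```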
Now for the Transformation phase, this is where we want to apply data quality rules:
- Each table in staging has the prefix stg_
- Normalize all timestamps to UTC
- Apply incremental updates to stg_version_downloads, meaning we only add new dates instead of reprocessing the entire table. The first run ingests all available dates from the raw version_downloads; incremental updates apply to subsequent runs
- Full refresh on all dimension tables (categories, crates, versions, ...) to capture updates
- Enforce data contracts and run quality tests
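The incremental logic for stg_version_downloads can be sketched as a dbt model along these lines (a simplified illustration; the source name and exact columns are assumptions, and the actual model in the repo may differ):

```sql
{{ config(materialized='incremental', unique_key=['version_id', 'date']) }}

SELECT
    version_id,
    downloads,
    date
FROM {{ source('crates_io', 'version_downloads') }}
{% if is_incremental() %}
-- On subsequent runs, only pull dates newer than what staging already holds
WHERE date > (SELECT MAX(date) FROM {{ this }})
{% endif %}
```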
This design handles the crates.io dump's 3 month rolling window efficiently while maintaining a complete historical version downloads archive in staging.
The crates.io dump has a 3 month rolling window for version_downloads, but historical archives dating back to 2014 are available. To build a complete dataset, we need a backfill strategy that ingests older dates directly from the archives into stg_version_downloads.
The backfill script accepts parameters for how far back to ingest. I've loaded the full archive back to 2014-11-11, but you can choose any date range based on your storage and analysis needs.
We also snapshot crates, dependencies, and versions tables to capture historical changes between dumps.
Given our architecture, we also have specific constraints:
- No infrastructure overhead - No servers or cloud setup
- Cross-platform - Must work on Windows, Mac, Linux
- Fast iteration - Setup and updates in minutes, not hours
- Storage efficient - Must handle the project on a standard laptop (~8GB of RAM, ~20GB free disk space)
Based on these constraints and the architecture, we chose:
- Python - Flexible scripting with rich ecosystem for data processing. Works across all platforms.
- uv - Fast Python package and project manager. Handles dependencies and virtual environments with minimal configuration
- DuckDB - Embedded OLAP database (single file, no server), works locally, optimized for analytical queries on large datasets
- dbt - Transformations as SQL code with built-in testing and snapshotting framework, database agnostic (can swap DuckDB for other databases)
- streamlit - Python-based visualization tool, quick prototyping for data validation.
The whole project and its dependencies are managed via uv. The only prerequisite is having uv installed on your machine.
The total DuckDB file size for data from 2014-11-11 to 2025-11-29 was ~10GB.
3. Traditional Exploration - Brief insight into the workflow
As a data engineer, I wanted to gain insights from the Rust crates data. We'll walk through an interesting investigation, the orphan versions in the version downloads time series data, which in the end turned into one of the data quality checks.
The question we want to answer is: are there any version_ids in stg_version_downloads that are orphaned? An orphan version_id is one that no longer exists in the versions table, which potentially breaks the foreign key constraint.
For this, we'll use the staging schema and assume the transformed data has been loaded, ideally with some backfill, though that's not required.
I used the DBeaver tool for this, but any other DB client that supports DuckDB works too.
First, we need to understand the staging.stg_version_downloads schema
```sql
DESCRIBE staging.stg_version_downloads;
```
We get columns:
|column_name|column_type|
|-----------|-----------|
|version_id |BIGINT |
|downloads |BIGINT |
|date |DATE |
Then we need to understand the staging.stg_versions, which has columns:
|column_name |column_type|
|-------------|-----------|
|id |BIGINT |
|crate_id |BIGINT |
|num |VARCHAR |
|num_no_build |VARCHAR |
|yanked |BOOLEAN |
...
Now that we have the schemas of these two tables, we can write a simple join query
```sql
SELECT
    svd.version_id AS version_id,
    sv.num AS version_num,
    sv.num_no_build AS version_num_no_build,
    sv.yanked AS version_yanked,
    svd.date,
    svd.downloads
FROM staging.stg_version_downloads svd
LEFT JOIN staging.stg_versions sv ON svd.version_id = sv.id
LIMIT 25;
```
This gives us some basic information from the versions dimension table, and we can run COUNT(*) to get the total number of records in version_downloads. A quick data quality check here is to compare COUNT(*) on stg_version_downloads before and after the join with stg_versions, making sure the versions table introduces no duplicate entries.
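That duplicate check can be collapsed into a single query: if stg_versions.id is unique, the two counts must match.

```sql
SELECT
    (SELECT COUNT(*) FROM staging.stg_version_downloads) AS base_count,
    (SELECT COUNT(*)
     FROM staging.stg_version_downloads svd
     LEFT JOIN staging.stg_versions sv ON svd.version_id = sv.id) AS joined_count;
```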
Let's try to find orphan records with the next query:

```sql
SELECT
    COUNT(*) AS nr_records
FROM staging.stg_version_downloads svd
LEFT JOIN staging.stg_versions sv ON svd.version_id = sv.id
WHERE sv.id IS NULL;
```
Now, two things can happen here:
- If you have just set up the project, this will return 0
- If you have updated staging with the update_duckdb.sh script at least once, OR you have backfilled some archives into version downloads, there is a 99.99% chance of a non-zero count, and it will keep growing as you ingest more backfill data
We'll present monthly downloads for valid vs orphan version IDs from 2014-11-11 to 2025-11-29
This was a surprising discovery and raised the question of why it was happening. After some research into the crates.io usage policy, I found this paragraph:
Crate owners can delete their crates under certain conditions: the crate has been published for less than 72 hours, or the crate only has a single owner, the crate has been downloaded less than 1000 times for each month it has been published, and the crate is not depended upon by any other crate on crates.io. If these conditions are not met, the crate will not be deleted. In exceptional cases crate owners may contact the crates.io team to request deletion of a crate that does not meet these conditions.
And this is true: the crates.io DB dump does honor ON DELETE CASCADE for deleted crates and their related tables, such as versions and version_downloads. If you download the crates.io DB dump, you'll notice there aren't any orphan version ids in the version_downloads table.
However, the issue lies in the version downloads archives, where these version ids still persist in the daily CSV dumps. This is a problem if you are enforcing the foreign key constraint, and dbt tests will warn about orphan version ids once you've backfilled some dates from the version downloads archives.
The key discovery here: while crates.io honors deletion in the database dump, the historical archives still contain download records for deleted versions. Usually in these situations, you have to follow the source system's rules and remove these orphan version ids yourself while backfilling.
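If you decide to follow the source system and drop the orphans, an anti-join during (or after) backfill does the job. A sketch against the staging schema:

```sql
-- Remove download rows whose version no longer exists upstream
DELETE FROM staging.stg_version_downloads
WHERE NOT EXISTS (
    SELECT 1
    FROM staging.stg_versions sv
    WHERE sv.id = stg_version_downloads.version_id
);
```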
To monitor this, I've created a streamlit dashboard showing:
- Monthly Downloads counts (2014-2025) and orphaned version_id trends
- Summary metrics: total crates, versions, orphaned versions, total downloads
This investigation took roughly 3 hours of work:
- Exploring schemas and writing queries
- Researching the crates.io deletion policy
- Building and refining the Streamlit dashboard
- Documenting the findings
This process worked (I even used Claude to help with the project), but it was still time-intensive.
I wondered: could MCP make this exploration faster and help me discover related issues I hadn't thought to check?
Well, let's see!
4. The MCP Layer: Turning SQL into Conversations
From Model Context Protocol:
MCP (Model Context Protocol) is an open-source standard for connecting AI applications to external systems.
Using MCP, AI applications like Claude or ChatGPT can connect to data sources (e.g. local files, databases), tools (e.g. search engines, calculators) and workflows (e.g. specialized prompts)—enabling them to access key information and perform tasks.
This looked very promising for our analytical workflow, so I've decided to make a basic MCP server and connect it to Claude Code to see how well it would interact with the DuckDB for Rust Crates Analytics project.
To understand how MCP works, let's look at its three core concepts:
- Tools are functions that your LLM can actively call; it decides when to use them based on user requests. Tools can write to databases, call external APIs, modify files, or trigger other logic. An example, in our case, would be executing a SELECT query to list the 10 most downloaded crates of all time.
- Resources are passive data sources that provide read-only access to information for context, such as file contents, database schemas or API documentation. An example would be our dbt YAML files containing schema definitions for both raw and staging.
- Prompts are pre-built instruction templates that tell the model to work with specific tools and resources. An example would be a setup for gathering context of the Rust Crates Analytics
Based on these core features and our Rust Crates Analytics project, I designed a basic MCP server:
Tools
- query_duckdb - executes read-only SELECT queries against DuckDB
- list_tables - lists all available tables in the staging schema (useful for verification and exploration)
Resources
- Schema definitions for all staging tables (columns, types, relationships)
Prompts
- /setup_crates_analytics_context - loads project context
Let's have a look at the implementation of the MCP server here:
```python
import logging
import os

import duckdb
from mcp.server.fastmcp import FastMCP
from mcp.types import TextContent, GetPromptResult, PromptMessage

logging.basicConfig(
    level=logging.DEBUG,
    filename=os.path.join(os.path.dirname(os.path.dirname(__file__)), 'output.log'),
    format='%(pathname)s - %(message)s'
)


def get_crates_duckdb_path():
    """Resolve the path of the crates.duckdb file relative to the project root."""
    project_root_directory = os.path.dirname(os.path.dirname(__file__))
    return os.path.join(project_root_directory, "data", "crates.duckdb")


mcp = FastMCP("duckdb_crates_server")


@mcp.tool()
async def list_tables():
    """List all tables in the raw, staging and marts schemas"""
    conn = duckdb.connect(get_crates_duckdb_path(), read_only=True)
    try:
        result = conn.execute("""
            SELECT table_schema, table_name
            FROM information_schema.tables
            WHERE table_schema IN ('raw', 'staging', 'marts')
            ORDER BY table_schema, table_name
        """).fetchdf()
    finally:
        conn.close()
    return [TextContent(
        type="text",
        text=f"Available tables:\n\n{result.to_markdown(index=False)}"
    )]


@mcp.tool()
async def query_duckdb(sql: str):
    """Execute a read-only SQL query against the crates.io DuckDB database"""
    # Safety check: the connection is read-only anyway, this just fails fast
    if not sql.strip().upper().startswith('SELECT'):
        return [TextContent(
            type="text",
            text="Error: Only SELECT queries are allowed"
        )]
    conn = duckdb.connect(get_crates_duckdb_path(), read_only=True)
    try:
        result = conn.execute(sql).fetchdf()
    finally:
        conn.close()
    return [TextContent(
        type="text",
        text=f"{result.to_markdown(index=False)}"
    )]


@mcp.resource("schema://staging/tables")
async def get_available_dbt_models():
    """Column definitions and relationships for all staging tables"""
    project_root = os.path.dirname(os.path.dirname(__file__))
    schema_path = os.path.join(project_root, "transformations", "models", "staging")
    # Read all .yml files in the staging models folder
    schema_content = []
    for yml_file in sorted(os.listdir(schema_path)):
        if yml_file.endswith('.yml'):
            with open(os.path.join(schema_path, yml_file), 'r') as f:
                schema_content.append(f"# {yml_file}\n{f.read()}\n")
    return "\n".join(schema_content)


@mcp.prompt()
async def setup_crates_analytics_context():
    """Load context about the Rust Crates Analytics project"""
    return GetPromptResult(
        messages=[
            PromptMessage(
                role="user",
                content=TextContent(
                    type="text",
                    text="""You are exploring the Rust Crates Analytics data warehouse.

**Available Data:**
- Staging schema contains cleaned, validated data from crates.io
- Date range: figure out from the query_duckdb tool, please
- You can expect around ~200k crates, ~1.75M versions, billions of download records

**Key Tables:**
- stg_crates: Rust packages
- stg_versions: Specific releases
- stg_version_downloads: Download history (fact table)
- stg_dependencies: Version dependencies
- stg_categories, stg_keywords: Metadata

**Available Tools:**
- query_duckdb: Run SELECT queries
- list_tables: See available tables

**Available Resources:**
- schema://staging/tables: Full schema definitions

**Setup:**
Please load the context using the Resources for schema to let you know of columns, types, data quality tests, relationships.
That way, you already know about the Rust Crates DWH structure and you'll know how to create queries.
Know that you can only create SELECT queries, don't try creating any other queries like DESCRIBE, INSERT, etc.
DO NOT FORGET TO USE staging. schema for the tables that you get enlisted, please

**Your Task:**
Help analyze the Rust ecosystem. Suggest interesting questions to explore, write efficient queries, and explain findings in context."""
                )
            )
        ]
    )


if __name__ == "__main__":
    mcp.run(transport="streamable-http")
```
This implements everything we've discussed before. To run the server:

```shell
uv run mcp/mcp_duckdb_http.py
```
From now on, I will use the Claude Code CLI and show my conversations with it. First, we want to add the MCP server to it. To do that, run the following in a terminal in your project directory:

```shell
claude mcp add --transport http duckdb_crates http://127.0.0.1:8000/mcp
```

Verify it with:

```shell
claude mcp list
```

You should see that it is Connected and ready for use.
Now let's look at what it does with the prompt we've defined in the MCP server with /setup_crates_analytics_context:
Claude automatically loads the schema resource and queries for data summary

Context loaded - Claude is ready to explore the data and provides some suggestions

The /setup_crates_analytics_context prompt accomplishes three things:
- Focuses exploration - directs Claude to the staging schema (where the clean data lives), avoiding token waste on raw tables
- Loads context proactively - triggers Claude to read the schema resources and query summary stats before you ask questions
- Sets the analytical tone - positions Claude as a data exploration assistant for the Rust ecosystem
With this setup, Claude understands the data structure upfront rather than discovering it query-by-query.
Remember the orphan IDs? Let's see how Claude figures them out in the same session, with the preloaded context:
In a few minutes, Claude with the right MCP setup discovers the orphan IDs based on my prompt, and even offers suggestions as to why this is happening. Claude didn't just find the orphans; it suggested potential causes without me having to research separately. This is augmentation, not just acceleration.
What took hours of work now takes minutes, and analysis becomes much easier.
But MCP isn't just about replicating manual work faster. Let's see what Claude discovered when I asked open-ended questions I hadn't thought to explore...
5. MCP Discovery: What Claude Found
With the MCP setup complete and orphan IDs discovered in a few minutes, I wanted to explore something more open-ended: dependency patterns across the entire Rust ecosystem.
Let's see what Claude suggested we look into for dependencies in the Rust Crates Analytics project. In the same session, I prompted: Let's not get deep into that. What do you suggest we do for dependencies?
It was interesting to see recommendations addressing the underlying problem of the orphan version IDs in the version downloads table, giving us options we can now decide whether to act on.
But I was more interested in the analysis of the dependencies themselves, so I prompted: Okay, thank you for your suggestions! I was wondering more about the analysis of the dependencies themselves, what do you recommend?
There were many suggestions, such as:
- Normal vs build vs dev dependency distribution
- Optional dependency usage (feature-gated deps)
- Dependency depth (how deep are dependency chains?)
- Circular dependencies detection
- Abandoned dependency risk (deps on unmaintained crates)
- Which features are most commonly enabled?
- Cluster analysis (which crates are used together?)
and many more.
I was interested in the Cluster analysis, let's see how Claude Code handles this discovery:
This is what Claude Code found:
Rust Ecosystem Cluster Analysis
1. Most Common Crate Pairs
The "holy trinity" combinations that appear everywhere:
| Crate Pair | Co-occurrences | Pattern |
|--------------------|----------------|-------------------------------|
| serde + serde_json | 396,229 | Universal serialization stack |
| serde + thiserror | 205,937 | Data + error handling |
| serde + tokio | 199,644 | Async + serialization |
| quote + syn | 94,596 | Proc macro toolkit |
| tokio + futures | 116,835 | Async runtime foundation |
2. Ecosystem Clusters Identified
A. The Serde Ecosystem (Serialization)
Core dependencies that travel with serde:
- serde_json (396K co-occurrences) - JSON support
- serde_derive (82K) - Derive macros
- serde_yaml (46K) - YAML support
- bincode (49K) - Binary encoding
Insight: If a project uses serde, it almost always uses serde_json (58% of the time).
B. The Tokio Ecosystem (Async Runtime)
The async stack travels together:
- futures (117K) - Async primitives
- async-trait (88K) - Async in traits
- tokio-util (37K) - Tokio utilities
- tokio-stream (31K) - Stream processing
- tracing-subscriber (41K) - Observability
- hyper (32K) - HTTP foundation
Insight: Tokio projects are 55% likely to also depend on futures, and 37% on async-trait.
C. The Proc Macro Ecosystem
Tight cluster for macro authors:
- syn + quote + proc-macro2 (all ~80-95K co-occurrences)
- darling (7.8K with syn) - Attribute parsing
- heck (6.6K) - Case conversion
- convert_case (5.6K) - Name transformations
Insight: 92% of crates using syn also use quote.
D. Web Framework Clusters
HTTP Client: reqwest (144K total deps)
- Paired with: tokio, serde_json, url, bytes
Web Frameworks (by popularity):
1. axum (21K deps) - Modern, gaining fast
2. actix-web (15K deps) - Mature, stable
3. warp (7K deps) - Declining
4. rocket (4.5K deps) - Older async
5. tide (1.7K deps) - Minimal usage
3. Most Common Triple Combinations
The standard stacks that define project types:
| Triple | Occurrences | Project Type |
|--------------------------------|-------------|-----------------------|
| serde + serde_json + tokio | 152K | Async web services |
| serde + serde_json + thiserror | 129K | Data processing apps |
| serde + serde_json + reqwest | 108K | HTTP clients |
| serde + clap + serde_json | 99K | CLI tools with config |
4. Category-Specific Patterns
CLI Tools use:
1. clap (75K) - Argument parsing (appears in 74% of CLI crates)
2. serde + serde_json - Config files
3. colored (13K) - Terminal colors
4. indicatif (11.5K) - Progress bars
5. crossterm (11K) - Terminal control
Web Programming prefers:
1. serde (30K) - Data serialization
2. tokio (19K) - Async runtime
3. reqwest (12K) - HTTP client
4. wasm-bindgen (4K) - WASM support
Cryptography projects cluster around:
1. serde (17K) - Serialize keys/data
2. rand (11K) - Random number generation
3. sha2 (6.7K) - Hashing
4. hex (5.9K) - Hex encoding
5. zeroize (5.4K) - Secure memory clearing
---
Key Insights & Recommendations
1. The "Foundation 7"
These crates define modern Rust development:
- serde - Universal serialization (appears in ~36% of all dependencies)
- tokio - Async runtime standard
- thiserror - Ergonomic error types
- anyhow - Error handling for applications
- clap - CLI standard
- tracing - Modern logging/instrumentation
- reqwest - HTTP client de facto standard
2. Ecosystem Consolidation
- Async: Tokio has won (296K deps vs alternatives)
- CLI: Clap is dominant (230K deps)
- Serialization: Serde is universal (679K deps)
- Proc macros: syn+quote are required (94K pair)
3. Project Archetypes
You can predict project type by dependency cluster:
Web Service:
tokio + serde + serde_json + reqwest + thiserror + tracing
CLI Tool:
clap + serde + serde_json + anyhow + colored
Proc Macro:
syn + quote + proc-macro2
Crypto/Blockchain:
serde + rand + sha2 + hex + thiserror
This looks pretty good, and it even offered a feature suggestion: predicting what a project is about purely from which crates are commonly used together, viewed strictly from a dependency perspective.
Critical note: if you read through the analysis carefully, it sometimes makes questionable claims, like Insight: If a project uses serde, it almost always uses serde_json (58% of the time), where 58% is hardly "almost always" and contradicts its own "universal" claim. This reinforces an important principle:
MCP accelerates analysis, but doesn't replace domain expertise. AI generates hypotheses quickly; experts validate critically. Always verify findings against actual data.
So, what did this discovery reveal?
- The Foundation 7 - Core crates that define modern Rust (serde, tokio, clap, etc.)
- Ecosystem archetypes - You can predict project type by dependency cluster
- Consolidation winners - tokio won async, clap won CLI, serde won serialization
- Common stacks - Standard combinations like serde + tokio + thiserror for async services
If you're following along, try asking your AI agent to identify crates with missing or mismatched categories. You might discover interesting categorization issues in the ecosystem.
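As a concrete starting point for that exercise, a query for crates with no category at all could look like this (a sketch: the junction table name stg_crates_categories and its crate_id column are assumed from the ERD):

```sql
SELECT c.id, c.name
FROM staging.stg_crates c
LEFT JOIN staging.stg_crates_categories cc ON cc.crate_id = c.id
WHERE cc.crate_id IS NULL
LIMIT 25;
```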
6. Try It Yourself
The complete project is available on GitHub: https://github.com/saidsalihefendic/rust-crates-analytics
Suggested explorations:
- /setup_crates_analytics_context - Load context and start exploring
- What are the fastest-growing crates in the last year?
- Find crates with suspicious download patterns
- Which dependencies are becoming obsolete?
The MCP Server works with any MCP-compatible client (Claude Desktop, Claude Code CLI, Gemini CLI, Codex CLI, ...). Experiment and see what you discover.
7. What's next?
This was Part I of the series, where we laid the foundations of the Rust Crates Analytics project, explained our architecture, and introduced the MCP layer on top of it.
The whole project was inspired by ClickPy - Analytics for PyPI packages and Model Context Protocol.
In Part II, we will cover advanced MCP features and how they tie into the Rust Crates Analytics project, lay out the lessons learned, and examine the capabilities, limitations, and cost of MCP.
Thanks for reading, and I look forward to seeing you in Part II!
What did you discover? I'd love to hear what patterns you find in the Rust ecosystem. Drop a comment here with your discoveries!






