DEV Community

Jordan Hudgens


Santa Came Early: I Just Published a Rust Crate and CLI Tool to Take Care of AI Markdown Citations for Good

A practical guide to lazy regex compilation, efficient string manipulation, and publishing production-ready Rust crates, plus how I discovered that building a Rust and regex based code library can actually be even more frustrating than it sounded initially. Hopefully my tears will fuel your dev joy.


The Problem That Shouldn't Exist (But Does)

If I weren't already bald, the last 5 years of wrangling integrations with AI LLM build-outs would have ensured that my trips to the barber were no longer needed. Topping my list of frustrations has been those annoying, ever-present AI-generated citations. It's gotten to the point that I picture the models smirking as they send back a response, their GPUs powered by the anger and stress they've figured out how to extract from my 2am rants. Every time, I read their 'reasoning' literally acknowledging my prompt begging them not to include citations in the generated content, immediately followed by an excitedly delivered 3,000-word article with enough links to make a Wikipedia page blush.

Even more aggravating than manually inspecting and editing every deep research task that the various AI agents performed for our AI marketing platform was this: every few weeks I'd have one of the AI coding agents spend a full deep research session combing through npm, RubyGems, and crates.io with the singular purpose of finding a code library that would do literally nothing besides remove citations from 100% of the Markdown strings that the 100+ AI tools we use throughout our platform generate. After 5 years of unanswered coding prayers and 3 letters that I don't think Santa even glanced at, and while blasting my Taylor Swift motivational coding playlist, the "I'm the problem, it's me" lyrics weren't simply the best descriptor for my multiple marriages; they also explained why I still didn't have the tool I needed.

So instead of continuing my previous software engineering strategy of waiting for some open source fairy to publish what I needed, I decided that I had not struggled quite enough in my coding career. Not only would I build the tool I needed and share it with the world, I was also going to see if Rust syntax got easier to read once I coated every key line in regex.


Spoiler alert: I quickly discovered that when Satan's uncle was designing the Rust language, he realized that a typical regex implementation didn't fill his tank of dev tears like he needed, and so I got to learn that there are multiple, completely different regex crates in the Rust ecosystem. And when we get to the performance benchmark comparisons later on, you will see that there is a right and a wrong answer when it comes to which approach to take for this specific scenario.


Also, I spent WAY too long trying to decide between Satan's uncle being the Rust syntax designer and my discovery that a weekend straight of wrestling with Rust and regex is the real devil's threeway, so please just pretend I went with whichever one you found more entertaining and we'll get started.


The Introduction Perplexity Said I Should Have Gone Right Into Instead of Everything Above

So let's get into it. If you've ever used ChatGPT, Claude, or Perplexity to generate content—blog posts, documentation, research summaries—you've encountered this:

AI research shows promising results in natural language processing[1][2][3]. 
Recent studies indicate significant improvements[4][source:1].

[1]: https://example.com/study1
[2]: https://example.com/study2
[3]: https://example.com/study3
[4]: https://example.com/study4

Those citations are useful for verification, but completely unwanted when you're publishing to a CMS, generating documentation, or processing AI responses in a streaming pipeline.

After searching for a Rust solution and finding nothing, I built markdown-ai-cite-remove. This article isn't a marketing pitch—it's a deep dive into the patterns, optimizations, and decisions that make a text processing library production-ready in Rust.


Part 1: The Architecture Decision — Regex vs. Parser

Why Not a Full Markdown Parser?

My first instinct was to use pulldown-cmark or comrak to parse the markdown into an AST, walk the tree, remove citation nodes, and reconstruct the string.

Problems with this approach:

  1. Overhead: Full parsing is expensive when you only need pattern matching
  2. Reconstruction loss: Converting AST back to markdown can alter formatting
  3. Complexity: More code = more bugs for a simple task

The insight: AI citations follow predictable regex patterns. We don't need to understand markdown semantics—we need fast pattern matching and replacement.

The Hybrid Approach

The library uses a multi-pass regex strategy that's both fast and accurate:

// Pass 1: Remove inline citations [1][2][3]
// Pass 2: Remove named citations [source:1][ref:2]
// Pass 3: Remove reference link definitions
// Pass 4: Remove reference section headers
// Pass 5: Normalize whitespace

Each pass handles one concern, making the code testable and debuggable.
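To make the idea of a single-concern pass concrete, here's a std-only sketch of pass 1 (inline numeric citations) using a hand-rolled scanner instead of the regex crate — the library itself uses compiled regexes, but the logic of "find `[digits]`, skip it, keep everything else" is the same:

```rust
fn strip_inline_numeric(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'[' {
            // Scan for a run of digits followed by ']'
            let mut j = i + 1;
            while j < bytes.len() && bytes[j].is_ascii_digit() {
                j += 1;
            }
            if j > i + 1 && j < bytes.len() && bytes[j] == b']' {
                i = j + 1; // skip the whole "[123]" span
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    // Only complete ASCII "[123]" spans were skipped, so this is valid UTF-8.
    String::from_utf8(out).expect("valid UTF-8")
}

fn main() {
    let cleaned = strip_inline_numeric("AI research shows promising results[1][2][3].");
    assert_eq!(cleaned, "AI research shows promising results.");
    println!("{cleaned}");
}
```

Note that `[not a citation]` is left alone because the scan requires at least one digit immediately before the closing bracket — the same specificity the regex patterns below rely on.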


Part 2: Lazy Static Regex — The Critical Optimization

The Anti-Pattern That Kills Performance

The most common Rust regex mistake:

// ❌ DON'T DO THIS - Compiles regex on every call
fn remove_citations_bad(text: &str) -> String {
    let re = Regex::new(r"\[\d+\]").unwrap(); // Expensive!
    re.replace_all(text, "").to_string()
}

Every call to Regex::new() parses and compiles the pattern. For a library processing thousands of documents, this is catastrophic.

The Modern Solution: std::sync::LazyLock

As of Rust 1.80, the standard library includes LazyLock (previously you'd use lazy_static or once_cell):

use std::sync::LazyLock;
use regex::Regex;

// ✅ Compiled once, used forever
static INLINE_NUMERIC: LazyLock<Regex> = LazyLock::new(|| {
    Regex::new(r"\[\d+\]").unwrap()
});

static INLINE_NAMED: LazyLock<Regex> = LazyLock::new(|| {
    Regex::new(r"\[(?:source|ref|cite|note):\d+\]").unwrap()
});

static REFERENCE_LINK: LazyLock<Regex> = LazyLock::new(|| {
    Regex::new(r"(?m)^\[\d+\](?::\s*|\s+)https?://[^\n]+$").unwrap()
});

Why this matters:

| Approach | Per document | 10,000 documents |
| --- | --- | --- |
| Compile per call | ~40ms | ~400 seconds total |
| Lazy static | ~27μs | ~0.27 seconds total |

That's a 1000x+ speedup from a single architectural change.
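You can verify the "compiled once" guarantee with nothing but the standard library (Rust 1.80+). Here an atomic counter stands in for the expensive `Regex::new()` call, and a `String` stands in for the compiled regex; no matter how many "documents" we process, the initializer runs exactly once:

```rust
use std::sync::LazyLock;
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many times the initializer actually runs.
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

// Stand-in for an expensive Regex::new() call.
static PATTERN: LazyLock<String> = LazyLock::new(|| {
    INIT_COUNT.fetch_add(1, Ordering::SeqCst);
    r"\[\d+\]".to_string()
});

fn main() {
    // Simulate processing many documents: the "compile" runs once.
    for _ in 0..10_000 {
        let _ = PATTERN.len();
    }
    assert_eq!(INIT_COUNT.load(Ordering::SeqCst), 1);
    println!("initialized {} time(s)", INIT_COUNT.load(Ordering::SeqCst));
}
```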

The patterns.rs Module Pattern

Centralizing all regex patterns in one module provides:

  1. Single source of truth for pattern definitions
  2. Compile-time documentation of what patterns match
  3. Easy testing of individual patterns
  4. Clear modification path when AI providers change citation formats

// src/patterns.rs
use std::sync::LazyLock;
use regex::Regex;

pub struct Patterns {
    pub inline_numeric: &'static Regex,
    pub inline_named: &'static Regex,
    pub reference_link: &'static Regex,
    pub reference_header: &'static Regex,
    pub reference_entry: &'static Regex,
}

static INLINE_NUMERIC: LazyLock<Regex> = LazyLock::new(|| {
    Regex::new(r"\[\d+\]").unwrap()
});

// ... other patterns ...

impl Patterns {
    pub fn get() -> Self {
        Self {
            inline_numeric: &INLINE_NUMERIC,
            inline_named: &INLINE_NAMED,
            reference_link: &REFERENCE_LINK,
            reference_header: &REFERENCE_HEADER,
            reference_entry: &REFERENCE_ENTRY,
        }
    }
}

Part 3: Regex Pattern Design for AI Citations

Understanding the Target Patterns

AI citations come in several flavors:

# Inline numeric (most common)
Text here[1] and more[2][3] content[20].

# Named sources (Perplexity style)
Studies show[source:1] that results[ref:2] indicate[cite:3]...

# Reference link definitions
[1]: https://example.com/article
[2]: https://another-source.com

# Reference sections
## References
[1] Author, A. (2024). Article Title. Journal Name.
[2] Author, B. (2023). Another Article. Conference.

Crafting Efficient Patterns

Pattern 1: Inline Numeric Citations

r"\[\d+\]"

Simple, fast, and handles [1] through [999] and beyond. No greedy quantifiers, no backtracking.

Pattern 2: Named Citations

r"\[(?:source|ref|cite|note):\d+\]"

The (?:...) is a non-capturing group—we don't need the match content, just removal. This is faster than (...).

Pattern 3: Reference Links (Multiline)

r"(?m)^\[\d+\](?::\s*|\s+)https?://[^\n]+$"

Breaking this down:

  • (?m) — Multiline mode: ^ and $ match line boundaries
  • ^\[\d+\] — Line starts with [number]
  • (?::\s*|\s+) — Followed by : or just whitespace
  • https?://[^\n]+$ — URL to end of line

Pattern 4: Reference Headers

r"(?m)^#{1,6}\s*(?:References?|Citations?|Sources?|Bibliography|Notes?)\s*$"

Matches ## References, # Citations, ### Sources, etc.

Avoiding Regex Pitfalls

1. Greedy vs. Lazy Quantifiers

// ❌ Greedy wildcards - the match balloons to the whole line, and in
//    backtracking engines this shape can blow up (Rust's regex crate
//    guarantees linear time, but the extra scanning still costs)
r".*\[\d+\].*"

// ✅ Specific - fast and predictable
r"\[\d+\]"

2. Anchoring When Possible

// ❌ Scans entire document for each potential match
r"\[\d+\]:\s*https?://.*"

// ✅ Anchored to line start - much faster
r"(?m)^\[\d+\]:\s*https?://[^\n]+$"

3. Character Classes Over Wildcards

// ❌ Vague - what . excludes depends on mode flags like (?s)
r"^\[\d+\]:.*$"

// ✅ Fast - [^\n] is specific
r"^\[\d+\]:[^\n]+$"

Part 4: The Cleaner Architecture

Configuration-Driven Processing

Different use cases need different behavior:

#[derive(Debug, Clone)]
pub struct RemoverConfig {
    pub remove_inline_citations: bool,
    pub remove_reference_links: bool,
    pub remove_reference_headers: bool,
    pub remove_reference_entries: bool,
    pub normalize_whitespace: bool,
    pub remove_blank_lines: bool,
    pub trim_lines: bool,
}

impl RemoverConfig {
    /// Remove everything (default)
    pub fn default() -> Self { /* ... */ }

    /// Only inline [1][2][3], keep reference sections
    pub fn inline_only() -> Self { /* ... */ }

    /// Only reference sections, keep inline citations
    pub fn references_only() -> Self { /* ... */ }
}

Why this matters: A user cleaning blog posts wants everything gone. A user building a citation extraction tool wants to keep inline markers but strip URLs.

The Builder Pattern

let cleaner = CitationRemover::builder()
    .remove_inline(true)
    .remove_references(false)
    .normalize_whitespace(true)
    .build();

This provides a fluent API that's both readable and extensible.
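For readers new to the pattern, here's a minimal std-only sketch of how such a builder can be wired up. The names here are illustrative, not the crate's actual API: each setter consumes and returns the builder, and `build()` produces the final configuration.

```rust
#[derive(Debug, Clone)]
struct RemoverConfig {
    remove_inline: bool,
    remove_references: bool,
}

struct CitationRemoverBuilder {
    config: RemoverConfig,
}

impl CitationRemoverBuilder {
    fn new() -> Self {
        // Sensible defaults: remove everything.
        Self {
            config: RemoverConfig {
                remove_inline: true,
                remove_references: true,
            },
        }
    }

    // Each setter takes `self` by value and returns it, enabling chaining.
    fn remove_inline(mut self, on: bool) -> Self {
        self.config.remove_inline = on;
        self
    }

    fn remove_references(mut self, on: bool) -> Self {
        self.config.remove_references = on;
        self
    }

    fn build(self) -> RemoverConfig {
        self.config
    }
}

fn main() {
    let config = CitationRemoverBuilder::new()
        .remove_inline(true)
        .remove_references(false)
        .build();
    assert!(config.remove_inline);
    assert!(!config.remove_references);
    println!("{config:?}");
}
```

Because the setters consume `self`, a half-configured builder can't be accidentally reused, and new options can be added later without breaking existing call sites.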

Stateless Design for Thread Safety

pub struct CitationRemover {
    config: RemoverConfig,
    // No mutable state! Patterns are static references.
}

impl CitationRemover {
    pub fn remove_citations(&self, markdown: &str) -> String {
        // Pure function - same input always produces same output
        let mut result = markdown.to_string();

        if self.config.remove_inline_citations {
            result = self.remove_inline(&result);
        }
        // ... more passes ...

        result
    }
}

Because CitationRemover has no mutable state, it's Send + Sync automatically—safe to share across threads without locks.
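Here's what that buys you in practice, sketched with a toy stand-in for `CitationRemover` (it just uppercases input): because the struct holds no mutable state, an `Arc` is all you need to fan work out across threads, with no `Mutex` in sight.

```rust
use std::sync::Arc;
use std::thread;

// Toy stand-in for a stateless processor like CitationRemover.
struct Processor;

impl Processor {
    fn process(&self, input: &str) -> String {
        input.to_uppercase()
    }
}

fn main() {
    let shared = Arc::new(Processor);

    // Four threads share one processor instance concurrently, lock-free.
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let p = Arc::clone(&shared);
            thread::spawn(move || p.process(&format!("doc {i}")))
        })
        .collect();

    for handle in handles {
        let out = handle.join().unwrap();
        assert!(out.starts_with("DOC "));
        println!("{out}");
    }
}
```

The compiler enforces this for free: if `Processor` ever grew interior mutable state that wasn't `Sync`, the `thread::spawn` call would stop compiling.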


Part 5: Testing Real-World AI Output

The Testing Philosophy

Unit tests catch regressions. Integration tests with real AI output catch reality.

#[test]
fn test_real_chatgpt_response() {
    let input = include_str!("../tests/fixtures/chatgpt_response.md");
    let result = remove_citations(input);

    // Verify no citations remain
    assert!(!result.contains("[1]"));
    assert!(!result.contains("[source:"));
    assert!(!result.contains("https://"));

    // Verify content preserved
    assert!(result.contains("AI research shows"));
    assert!(result.contains("## Key Findings"));
}

#[test]
fn test_real_perplexity_response() {
    let input = include_str!("../tests/fixtures/perplexity_response.md");
    let result = remove_citations(input);

    // Perplexity uses different citation styles
    assert!(!result.contains("[source:1]"));
    assert!(!result.contains("## Sources"));
}

Edge Cases That Break Naive Implementations

1. Code blocks containing bracket syntax

Here's an array: `let arr = [1, 2, 3];`
And a citation[1].

[1]: https://example.com

A naive \[\d+\] pattern would incorrectly match inside the code block. Solution: process code blocks separately. (Rust's regex crate deliberately omits look-around, so negative lookbehind isn't an option there; the fancy-regex crate adds it, at the cost of the linear-time guarantee.)
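One simple way to process code spans separately, sketched std-only with a placeholder for the real passes: split the document on backticks and only clean the segments outside them. Since each backtick toggles code context, even-indexed segments are prose and odd-indexed segments are code (this assumes balanced backticks, which well-formed markdown has):

```rust
// Placeholder for the real multi-pass citation removal.
fn strip(segment: &str) -> String {
    segment.replace("[1]", "")
}

fn clean_outside_code(input: &str) -> String {
    input
        .split('`')
        .enumerate()
        .map(|(i, seg)| if i % 2 == 0 { strip(seg) } else { seg.to_string() })
        .collect::<Vec<_>>()
        .join("`")
}

fn main() {
    let cleaned = clean_outside_code("Index `arr[1]` and a citation[1].");
    // The bracket inside the code span survives; the citation doesn't.
    assert_eq!(cleaned, "Index `arr[1]` and a citation.");
    println!("{cleaned}");
}
```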

2. Markdown links vs. citations

Check out [this article](https://example.com)[1].

[1]: https://citation.com

The [this article](url) is a valid markdown link that should be preserved. The [1] citation should be removed. Pattern specificity is key.

3. Nested or malformed citations

Results[[1]][2] show improvement[3][source:4].

Real AI output is messy. Test with messy input.

Property-Based Testing with Proptest

use proptest::prelude::*;

proptest! {
    #[test]
    fn never_crashes_on_arbitrary_input(s in ".*") {
        // Should never panic, regardless of input
        let _ = remove_citations(&s);
    }

    #[test]
    fn preserves_non_citation_content(s in "[a-zA-Z ]+") {
        // Plain text without brackets should pass through unchanged
        let result = remove_citations(&s);
        assert_eq!(result.trim(), s.trim());
    }

    #[test]
    fn removes_all_numeric_citations(n in 1u32..1000) {
        let input = format!("Text[{}] here.", n);
        let result = remove_citations(&input);
        assert!(!result.contains(&format!("[{}]", n)));
    }
}

Part 6: Benchmarking with Criterion

Setting Up Criterion

# Cargo.toml
[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "citation_removal"
harness = false

// benches/citation_removal.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion, Throughput};
use markdown_ai_cite_remove::remove_citations;

fn bench_simple_inline(c: &mut Criterion) {
    let input = "Text[1] with[2] citations[3].";

    let mut group = c.benchmark_group("simple_inline");
    group.throughput(Throughput::Bytes(input.len() as u64));

    group.bench_function("remove", |b| {
        b.iter(|| remove_citations(black_box(input)))
    });

    group.finish();
}

fn bench_real_chatgpt(c: &mut Criterion) {
    let input = include_str!("../tests/fixtures/chatgpt_response.md");

    let mut group = c.benchmark_group("real_chatgpt");
    group.throughput(Throughput::Bytes(input.len() as u64));

    group.bench_function("remove", |b| {
        b.iter(|| remove_citations(black_box(input)))
    });

    group.finish();
}

criterion_group!(benches, bench_simple_inline, bench_real_chatgpt);
criterion_main!(benches);

Interpreting Results

simple_inline/remove    time:   [580 ns 585 ns 590 ns]
                        thrpt:  [91.2 MiB/s 92.0 MiB/s 92.8 MiB/s]

real_chatgpt/remove     time:   [17.8 μs 18.0 μs 18.2 μs]
                        thrpt:  [640 MiB/s 650 MiB/s 660 MiB/s]

Key metrics:

  • Latency: How long does one operation take?
  • Throughput: How many bytes per second can we process?

For a streaming API processing AI responses in real-time, sub-100μs latency is essential.
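A line-oriented streaming pipeline can be sketched with just `std::io` — read a line, strip it, write it immediately. The `strip_line` here is a placeholder for the library's inline passes; whole-document passes like reference-section removal need buffering instead, since they depend on context across lines:

```rust
use std::io::{BufRead, BufReader, Write};

// Placeholder for the real inline citation passes.
fn strip_line(line: &str) -> String {
    line.replace("[1]", "").replace("[2]", "")
}

// Generic over reader and writer, so it works with stdin/stdout,
// files, sockets, or in-memory buffers alike.
fn stream_clean<R: BufRead, W: Write>(input: R, mut output: W) -> std::io::Result<()> {
    for line in input.lines() {
        writeln!(output, "{}", strip_line(&line?))?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let input = BufReader::new("Results[1] improved[2].\nMore text.".as_bytes());
    let mut out = Vec::new();
    stream_clean(input, &mut out)?;
    assert_eq!(String::from_utf8(out).unwrap(), "Results improved.\nMore text.\n");
    Ok(())
}
```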

Baseline Comparisons

# Save current performance as baseline
cargo bench -- --save-baseline main

# Make changes, then compare
cargo bench -- --baseline main

This catches performance regressions before they ship.


Part 7: The CLI with Clap

A library is useful. A CLI makes it accessible.

// src/bin/mdcr.rs
use clap::Parser;
use std::io::{self, Read, Write};
use std::fs;
use markdown_ai_cite_remove::remove_citations;

#[derive(Parser)]
#[command(name = "mdcr")]
#[command(about = "Remove AI citations from markdown")]
struct Cli {
    /// Input file (reads from stdin if not provided)
    input: Option<String>,

    /// Output file (writes to stdout if not provided)
    #[arg(short, long)]
    output: Option<String>,

    /// Show processing details
    #[arg(long)]
    verbose: bool,
}

fn main() -> io::Result<()> {
    let cli = Cli::parse();

    // Read input
    let input = match &cli.input {
        Some(path) => {
            if cli.verbose {
                eprintln!("Reading from: {}", path);
            }
            fs::read_to_string(path)?
        }
        None => {
            let mut buffer = String::new();
            io::stdin().read_to_string(&mut buffer)?;
            buffer
        }
    };

    if cli.verbose {
        eprintln!("Input size: {} bytes", input.len());
    }

    // Process
    let output = remove_citations(&input);

    if cli.verbose {
        eprintln!("Output size: {} bytes", output.len());
        eprintln!("Removed: {} bytes", input.len() - output.len());
    }

    // Write output
    match &cli.output {
        Some(path) => fs::write(path, output)?,
        None => io::stdout().write_all(output.as_bytes())?,
    }

    Ok(())
}

Usage:

# Pipe mode
echo "Text[1] here." | mdcr

# File to file
mdcr input.md -o output.md

# Verbose
mdcr input.md --verbose

Part 8: Publishing to crates.io

Cargo.toml Best Practices

[package]
name = "markdown-ai-cite-remove"
version = "0.1.0"
edition = "2021"
rust-version = "1.70"
authors = ["Your Name <email@example.com>"]
license = "MIT OR Apache-2.0"
description = "High-performance removal of AI-generated citations from Markdown"
repository = "https://github.com/yourname/markdown-ai-cite-remove"
documentation = "https://docs.rs/markdown-ai-cite-remove"
keywords = ["markdown", "ai", "citation", "text-processing", "cleanup"]
categories = ["text-processing", "command-line-utilities"]
readme = "README.md"

[dependencies]
regex = "1.10"
thiserror = "1.0"

[dev-dependencies]
criterion = "0.5"
proptest = "1.4"

[[bin]]
name = "mdcr"
path = "src/bin/mdcr.rs"

Pre-Publish Checklist

# 1. All tests pass
cargo test --all-features

# 2. Clippy is happy
cargo clippy -- -D warnings

# 3. Formatting is correct
cargo fmt --check

# 4. Docs build without warnings
cargo doc --no-deps

# 5. Dry-run publish
cargo publish --dry-run

# 6. Check what's included
cargo package --list

Conclusion: What I Learned

Building markdown-ai-cite-remove reinforced several Rust principles:

  1. Lazy static initialization isn't optional for regex — it's a 1000x+ performance difference
  2. Configuration structs with sensible defaults make libraries flexible without being complex
  3. Real-world test fixtures catch bugs that synthetic tests miss
  4. Criterion benchmarks prevent performance regressions
  5. A CLI companion makes libraries accessible to non-Rust developers

The crate is live at crates.io/crates/markdown-ai-cite-remove and handles the citation patterns from ChatGPT, Claude, Perplexity, and Gemini.

If you're building text processing tools in Rust, I hope these patterns help. And if you're tired of manually deleting [1][2][3] from AI output—give the crate a try.


What text processing challenges have you solved in Rust? Drop a comment below—I'd love to hear about your approach.

