A practical guide to lazy regex compilation, efficient string manipulation, and publishing production-ready Rust crates, plus the story of how I discovered that building a Rust-and-regex code library can be even more frustrating than it initially sounded. Hopefully my tears will fuel your dev joy.
The Problem That Shouldn't Exist (But Does)
If I wasn't already bald, the last 5 years of wrangling integrations with LLM build-outs would have ensured my trips to the barber were no longer needed. Topping my list of frustrations: those annoying, ever-present AI-generated citations. It's gotten to the point that I picture the models smirking as they send back a response, their GPUs powered by the anger and stress they've figured out how to extract from my 2am rants. Their 'reasoning' traces literally show them reading my prompt begging them not to include citations, immediately followed by them excitedly sending me a 3,000-word article with enough links to make a Wikipedia page blush.
Even more aggravating than manually inspecting and editing every deep-research task the various AI agents performed for our AI marketing platform was the ritual that followed: every few weeks I'd have one of the AI coding agents spend a full deep-research session combing through npm, RubyGems, and crates.io with a singular purpose, finding a library that would do literally nothing besides remove citations from 100% of the Markdown strings that the 100+ AI tools in our platform generate. After 5 years of unanswered coding prayers and 3 letters I don't think Santa even glanced at, and while blasting my Taylor Swift motivational coding playlist, the "I'm the problem, it's me" lyrics weren't simply the best descriptor of my multiple marriages; they also explained why I still didn't have the tool I needed.
So instead of continuing my previous software engineering strategy of waiting for some open source fairy to publish what I needed, I decided I clearly had not struggled enough in my coding career. Not only would I build the tool I needed and share it with the world, I'd also find out whether Rust syntax gets easier to read once you coat every key line in regex.
Spoiler alert: I quickly discovered that when Satan's uncle was designing the Rust ecosystem, he realized that a typical regex implementation didn't fill his tank of dev tears like he needed, which is how I learned that Rust has multiple, completely different regex crates (the automata-based regex and the backtracking fancy-regex, among others). And when we get to the performance benchmark comparisons, you'll see there's a right and a wrong answer for this specific scenario.
^ Also, I spent WAY too long deciding between "Satan's uncle designed the Rust syntax" and "a weekend straight of wrestling Rust and regex is the real devil's threeway," so please just pretend I went with whichever one you found more entertaining and we'll get started.
The Introduction Perplexity Said I Should Have Gone Right Into Instead of Everything Above
So let's get into it. If you've ever used ChatGPT, Claude, or Perplexity to generate content—blog posts, documentation, research summaries—you've encountered this:
AI research shows promising results in natural language processing[1][2][3].
Recent studies indicate significant improvements[4][source:1].
[1]: https://example.com/study1
[2]: https://example.com/study2
[3]: https://example.com/study3
[4]: https://example.com/study4
Those citations are useful for verification, but completely unwanted when you're publishing to a CMS, generating documentation, or processing AI responses in a streaming pipeline.
After searching for a Rust solution and finding nothing, I built markdown-ai-cite-remove. This article isn't a marketing pitch—it's a deep dive into the patterns, optimizations, and decisions that make a text processing library production-ready in Rust.
Part 1: The Architecture Decision — Regex vs. Parser
Why Not a Full Markdown Parser?
My first instinct was to use pulldown-cmark or comrak to parse the markdown into an AST, walk the tree, remove citation nodes, and reconstruct the string.
Problems with this approach:
- Overhead: Full parsing is expensive when you only need pattern matching
- Reconstruction loss: Converting AST back to markdown can alter formatting
- Complexity: More code = more bugs for a simple task
The insight: AI citations follow predictable regex patterns. We don't need to understand markdown semantics—we need fast pattern matching and replacement.
The Hybrid Approach
The library uses a multi-pass regex strategy that's both fast and accurate:
// Pass 1: Remove inline citations [1][2][3]
// Pass 2: Remove named citations [source:1][ref:2]
// Pass 3: Remove reference link definitions
// Pass 4: Remove reference section headers
// Pass 5: Normalize whitespace
Each pass handles one concern, making the code testable and debuggable.
Part 2: Lazy Static Regex — The Critical Optimization
The Anti-Pattern That Kills Performance
The most common Rust regex mistake:
// ❌ DON'T DO THIS - Compiles regex on every call
fn remove_citations_bad(text: &str) -> String {
let re = Regex::new(r"\[\d+\]").unwrap(); // Expensive!
re.replace_all(text, "").to_string()
}
Every call to Regex::new() parses and compiles the pattern. For a library processing thousands of documents, this is catastrophic.
The Modern Solution: std::sync::LazyLock
As of Rust 1.80, the standard library includes LazyLock (previously you'd use lazy_static or once_cell):
use std::sync::LazyLock;
use regex::Regex;
// ✅ Compiled once, used forever
static INLINE_NUMERIC: LazyLock<Regex> = LazyLock::new(|| {
Regex::new(r"\[\d+\]").unwrap()
});
static INLINE_NAMED: LazyLock<Regex> = LazyLock::new(|| {
Regex::new(r"\[(?:source|ref|cite|note):\d+\]").unwrap()
});
static REFERENCE_LINK: LazyLock<Regex> = LazyLock::new(|| {
Regex::new(r"(?m)^\[\d+\](?::\s*|\s+)https?://[^\n]+$").unwrap()
});
Why this matters:
| Approach | Per document | 10,000 documents |
|---|---|---|
| Compile per call | ~40 ms | ~400 s total |
| Lazy static | ~27 µs | ~0.27 s total |
That's a 1000x+ speedup from a single architectural change.
The patterns.rs Module Pattern
Centralizing all regex patterns in one module provides:
- Single source of truth for pattern definitions
- Compile-time documentation of what patterns match
- Easy testing of individual patterns
- Clear modification path when AI providers change citation formats
// src/patterns.rs
use std::sync::LazyLock;
use regex::Regex;
pub struct Patterns {
pub inline_numeric: &'static Regex,
pub inline_named: &'static Regex,
pub reference_link: &'static Regex,
pub reference_header: &'static Regex,
pub reference_entry: &'static Regex,
}
static INLINE_NUMERIC: LazyLock<Regex> = LazyLock::new(|| {
Regex::new(r"\[\d+\]").unwrap()
});
// ... other patterns ...
impl Patterns {
pub fn get() -> Self {
Self {
inline_numeric: &INLINE_NUMERIC,
inline_named: &INLINE_NAMED,
reference_link: &REFERENCE_LINK,
reference_header: &REFERENCE_HEADER,
reference_entry: &REFERENCE_ENTRY,
}
}
}
Part 3: Regex Pattern Design for AI Citations
Understanding the Target Patterns
AI citations come in several flavors:
# Inline numeric (most common)
Text here[1] and more[2][3] content[20].
# Named sources (Perplexity style)
Studies show[source:1] that results[ref:2] indicate[cite:3]...
# Reference link definitions
[1]: https://example.com/article
[2]: https://another-source.com
# Reference sections
## References
[1] Author, A. (2024). Article Title. Journal Name.
[2] Author, B. (2023). Another Article. Conference.
Crafting Efficient Patterns
Pattern 1: Inline Numeric Citations
r"\[\d+\]"
Simple, fast, and handles [1] through [999]. No greedy quantifiers, no backtracking.
Pattern 2: Named Citations
r"\[(?:source|ref|cite|note):\d+\]"
The (?:...) is a non-capturing group—we don't need the match content, just removal. This is faster than (...).
Pattern 3: Reference Links (Multiline)
r"(?m)^\[\d+\](?::\s*|\s+)https?://[^\n]+$"
Breaking this down:
- `(?m)` enables multiline mode: `^` and `$` match line boundaries
- `^\[\d+\]` requires the line to start with `[number]`
- `(?::\s*|\s+)` matches a `:` or just whitespace after the bracket
- `https?://[^\n]+$` consumes the URL to the end of the line
Pattern 4: Reference Headers
r"(?m)^#{1,6}\s*(?:References?|Citations?|Sources?|Bibliography|Notes?)\s*$"
Matches ## References, # Citations, ### Sources, etc.
Avoiding Regex Pitfalls
1. Unnecessary Wildcards Around the Match
// ❌ Wasteful - the wrapping .* adds nothing, and in backtracking
// engines (fancy-regex, PCRE) patterns like this invite catastrophic backtracking
r".*\[\d+\].*"
// ✅ Specific - fast and predictable
r"\[\d+\]"
(Rust's regex crate uses finite automata and cannot backtrack catastrophically, but tight patterns are still faster.)
2. Anchoring When Possible
// ❌ Scans entire document for each potential match
r"\[\d+\]:\s*https?://.*"
// ✅ Anchored to line start - much faster
r"(?m)^\[\d+\]:\s*https?://[^\n]+$"
3. Character Classes Over Wildcards
// ❌ Slow - . matches everything
r"^\[\d+\]:.*$"
// ✅ Fast - [^\n] is specific
r"^\[\d+\]:[^\n]+$"
Part 4: The Cleaner Architecture
Configuration-Driven Processing
Different use cases need different behavior:
#[derive(Debug, Clone)]
pub struct RemoverConfig {
pub remove_inline_citations: bool,
pub remove_reference_links: bool,
pub remove_reference_headers: bool,
pub remove_reference_entries: bool,
pub normalize_whitespace: bool,
pub remove_blank_lines: bool,
pub trim_lines: bool,
}
impl RemoverConfig {
/// Remove everything (default)
pub fn default() -> Self { /* ... */ }
/// Only inline [1][2][3], keep reference sections
pub fn inline_only() -> Self { /* ... */ }
/// Only reference sections, keep inline citations
pub fn references_only() -> Self { /* ... */ }
}
Why this matters: A user cleaning blog posts wants everything gone. A user building a citation extraction tool wants to keep inline markers but strip URLs.
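To make the elided preset bodies concrete, here's one plausible fill-in. The exact flag values are my assumptions about sensible behavior for each preset, not the crate's actual source:

```rust
// Hypothetical preset bodies; field values are assumptions, not crate code.
#[derive(Debug, Clone)]
pub struct RemoverConfig {
    pub remove_inline_citations: bool,
    pub remove_reference_links: bool,
    pub remove_reference_headers: bool,
    pub remove_reference_entries: bool,
    pub normalize_whitespace: bool,
    pub remove_blank_lines: bool,
    pub trim_lines: bool,
}

impl RemoverConfig {
    /// Strip inline markers like [1] but leave the reference section intact.
    pub fn inline_only() -> Self {
        Self {
            remove_inline_citations: true,
            remove_reference_links: false,
            remove_reference_headers: false,
            remove_reference_entries: false,
            normalize_whitespace: true,
            remove_blank_lines: false,
            trim_lines: false,
        }
    }

    /// Keep inline markers, strip the reference section and its links.
    pub fn references_only() -> Self {
        Self {
            remove_inline_citations: false,
            remove_reference_links: true,
            remove_reference_headers: true,
            remove_reference_entries: true,
            normalize_whitespace: true,
            remove_blank_lines: true,
            trim_lines: false,
        }
    }
}
```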
The Builder Pattern
let cleaner = CitationRemover::builder()
.remove_inline(true)
.remove_references(false)
.normalize_whitespace(true)
.build();
This provides a fluent API that's both readable and extensible.
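A plausible implementation of that builder, flattening the config into three flags for brevity; the method names mirror the usage example above, but the internals are my assumption rather than the crate's code:

```rust
// Sketch of a consuming builder; names follow the usage example above.
#[derive(Debug, Clone)]
pub struct CitationRemover {
    remove_inline: bool,
    remove_references: bool,
    normalize_whitespace: bool,
}

#[derive(Debug, Clone, Default)]
pub struct Builder {
    remove_inline: bool,
    remove_references: bool,
    normalize_whitespace: bool,
}

impl Builder {
    // Each setter takes `self` by value and returns it, enabling chaining.
    pub fn remove_inline(mut self, yes: bool) -> Self {
        self.remove_inline = yes;
        self
    }
    pub fn remove_references(mut self, yes: bool) -> Self {
        self.remove_references = yes;
        self
    }
    pub fn normalize_whitespace(mut self, yes: bool) -> Self {
        self.normalize_whitespace = yes;
        self
    }
    pub fn build(self) -> CitationRemover {
        CitationRemover {
            remove_inline: self.remove_inline,
            remove_references: self.remove_references,
            normalize_whitespace: self.normalize_whitespace,
        }
    }
}

impl CitationRemover {
    pub fn builder() -> Builder {
        Builder::default()
    }
}
```

The consuming (`self`-by-value) style keeps each chained call allocation-free and makes half-built states unrepresentable outside the builder.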
Stateless Design for Thread Safety
pub struct CitationRemover {
config: RemoverConfig,
// No mutable state! Patterns are static references.
}
impl CitationRemover {
pub fn remove_citations(&self, markdown: &str) -> String {
// Pure function - same input always produces same output
let mut result = markdown.to_string();
if self.config.remove_inline_citations {
result = self.remove_inline(&result);
}
// ... more passes ...
result
}
}
Because CitationRemover has no mutable state, it's Send + Sync automatically—safe to share across threads without locks.
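You can turn that guarantee into a compile-time regression test. This is a common Rust idiom, sketched here with stand-in types shaped like the ones above:

```rust
use std::sync::Arc;
use std::thread;

// Stand-ins with the same shape as the article's types.
#[derive(Debug)]
struct RemoverConfig { remove_inline_citations: bool }
#[derive(Debug)]
struct CitationRemover { config: RemoverConfig }

// Only type-checks if T: Send + Sync. If CitationRemover ever gains a
// RefCell or Rc field, this call stops compiling.
fn assert_send_sync<T: Send + Sync>() {}

fn main() {
    assert_send_sync::<CitationRemover>();

    // Because it's stateless, one instance can be shared across threads
    // with a bare Arc, no Mutex required.
    let remover = Arc::new(CitationRemover {
        config: RemoverConfig { remove_inline_citations: true },
    });
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let r = Arc::clone(&remover);
            thread::spawn(move || r.config.remove_inline_citations)
        })
        .collect();
    for h in handles {
        assert!(h.join().unwrap());
    }
}
```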
Part 5: Testing Real-World AI Output
The Testing Philosophy
Unit tests catch regressions. Integration tests with real AI output catch reality.
#[test]
fn test_real_chatgpt_response() {
let input = include_str!("../tests/fixtures/chatgpt_response.md");
let result = remove_citations(input);
// Verify no citations remain
assert!(!result.contains("[1]"));
assert!(!result.contains("[source:"));
assert!(!result.contains("https://"));
// Verify content preserved
assert!(result.contains("AI research shows"));
assert!(result.contains("## Key Findings"));
}
#[test]
fn test_real_perplexity_response() {
let input = include_str!("../tests/fixtures/perplexity_response.md");
let result = remove_citations(input);
// Perplexity uses different citation styles
assert!(!result.contains("[source:1]"));
assert!(!result.contains("## Sources"));
}
Edge Cases That Break Naive Implementations
1. Code blocks containing bracket syntax
Here's an array: `let arr = [1, 2, 3];`
And a citation[1].
[1]: https://example.com
A naive \[\d+\] pattern would incorrectly match inside the code block. Solution: split out code spans and fenced blocks and process only the prose segments. (Rust's regex crate deliberately omits lookbehind and other lookarounds; if you need them, fancy-regex supports them at the cost of backtracking.)
2. Markdown links vs. citations
Check out [this article](https://example.com)[1].
[1]: https://citation.com
The [this article](url) part is a valid markdown link that must be preserved; the trailing [1] citation should be removed. Pattern specificity is key: \[\d+\] only matches brackets whose content is purely digits, so ordinary link text is safe.
3. Nested or malformed citations
Results[[1]][2] show improvement[3][source:4].
Real AI output is messy. Test with messy input.
Property-Based Testing with Proptest
use proptest::prelude::*;
proptest! {
#[test]
fn never_crashes_on_arbitrary_input(s in ".*") {
// Should never panic, regardless of input
let _ = remove_citations(&s);
}
#[test]
fn preserves_non_citation_content(s in "[a-zA-Z ]+") {
// Plain text without brackets should pass through unchanged
let result = remove_citations(&s);
assert_eq!(result.trim(), s.trim());
}
#[test]
fn removes_all_numeric_citations(n in 1u32..1000) {
let input = format!("Text[{}] here.", n);
let result = remove_citations(&input);
assert!(!result.contains(&format!("[{}]", n)));
}
}
Part 6: Benchmarking with Criterion
Setting Up Criterion
# Cargo.toml
[dev-dependencies]
criterion = "0.5"
[[bench]]
name = "citation_removal"
harness = false
// benches/citation_removal.rs
use criterion::{black_box, criterion_group, criterion_main, Criterion, Throughput};
use markdown_ai_cite_remove::remove_citations;
fn bench_simple_inline(c: &mut Criterion) {
let input = "Text[1] with[2] citations[3].";
let mut group = c.benchmark_group("simple_inline");
group.throughput(Throughput::Bytes(input.len() as u64));
group.bench_function("remove", |b| {
b.iter(|| remove_citations(black_box(input)))
});
group.finish();
}
fn bench_real_chatgpt(c: &mut Criterion) {
let input = include_str!("../tests/fixtures/chatgpt_response.md");
let mut group = c.benchmark_group("real_chatgpt");
group.throughput(Throughput::Bytes(input.len() as u64));
group.bench_function("remove", |b| {
b.iter(|| remove_citations(black_box(input)))
});
group.finish();
}
criterion_group!(benches, bench_simple_inline, bench_real_chatgpt);
criterion_main!(benches);
Interpreting Results
simple_inline/remove time: [580 ns 585 ns 590 ns]
thrpt: [91.2 MiB/s 92.0 MiB/s 92.8 MiB/s]
real_chatgpt/remove time: [17.8 μs 18.0 μs 18.2 μs]
thrpt: [640 MiB/s 650 MiB/s 660 MiB/s]
Key metrics:
- Latency: How long does one operation take?
- Throughput: How many bytes per second can we process?
For a streaming API processing AI responses in real-time, sub-100μs latency is essential.
Baseline Comparisons
# Save current performance as baseline
cargo bench -- --save-baseline main
# Make changes, then compare
cargo bench -- --baseline main
This catches performance regressions before they ship.
Part 7: The CLI with Clap
A library is useful. A CLI makes it accessible.
// src/bin/mdcr.rs
use clap::Parser;
use std::io::{self, Read, Write};
use std::fs;
use markdown_ai_cite_remove::remove_citations;
#[derive(Parser)]
#[command(name = "mdcr")]
#[command(about = "Remove AI citations from markdown")]
struct Cli {
/// Input file (reads from stdin if not provided)
input: Option<String>,
/// Output file (writes to stdout if not provided)
#[arg(short, long)]
output: Option<String>,
/// Show processing details
#[arg(long)]
verbose: bool,
}
fn main() -> io::Result<()> {
let cli = Cli::parse();
// Read input
let input = match &cli.input {
Some(path) => {
if cli.verbose {
eprintln!("Reading from: {}", path);
}
fs::read_to_string(path)?
}
None => {
let mut buffer = String::new();
io::stdin().read_to_string(&mut buffer)?;
buffer
}
};
if cli.verbose {
eprintln!("Input size: {} bytes", input.len());
}
// Process
let output = remove_citations(&input);
if cli.verbose {
eprintln!("Output size: {} bytes", output.len());
eprintln!("Removed: {} bytes", input.len().saturating_sub(output.len()));
}
// Write output
match &cli.output {
Some(path) => fs::write(path, output)?,
None => io::stdout().write_all(output.as_bytes())?,
}
Ok(())
}
Usage:
# Pipe mode
echo "Text[1] here." | mdcr
# File to file
mdcr input.md -o output.md
# Verbose
mdcr input.md --verbose
Part 8: Publishing to crates.io
Cargo.toml Best Practices
[package]
name = "markdown-ai-cite-remove"
version = "0.1.0"
edition = "2021"
rust-version = "1.80"
authors = ["Your Name <email@example.com>"]
license = "MIT OR Apache-2.0"
description = "High-performance removal of AI-generated citations from Markdown"
repository = "https://github.com/yourname/markdown-ai-cite-remove"
documentation = "https://docs.rs/markdown-ai-cite-remove"
keywords = ["markdown", "ai", "citation", "text-processing", "cleanup"]
categories = ["text-processing", "command-line-utilities"]
readme = "README.md"
[dependencies]
regex = "1.10"
thiserror = "1.0"
clap = { version = "4.5", features = ["derive"] } # needed by the mdcr binary
[dev-dependencies]
criterion = "0.5"
proptest = "1.4"
[[bin]]
name = "mdcr"
path = "src/bin/mdcr.rs"
Pre-Publish Checklist
# 1. All tests pass
cargo test --all-features
# 2. Clippy is happy
cargo clippy -- -D warnings
# 3. Formatting is correct
cargo fmt --check
# 4. Docs build without warnings
cargo doc --no-deps
# 5. Dry-run publish
cargo publish --dry-run
# 6. Check what's included
cargo package --list
Conclusion: What I Learned
Building markdown-ai-cite-remove reinforced several Rust principles:
- Lazy static initialization isn't optional for regex — it's a 1000x+ performance difference
- Configuration structs with sensible defaults make libraries flexible without being complex
- Real-world test fixtures catch bugs that synthetic tests miss
- Criterion benchmarks prevent performance regressions
- A CLI companion makes libraries accessible to non-Rust developers
The crate is live at crates.io/crates/markdown-ai-cite-remove and handles the citation patterns from ChatGPT, Claude, Perplexity, and Gemini.
If you're building text processing tools in Rust, I hope these patterns help. And if you're tired of manually deleting [1][2][3] from AI output—give the crate a try.
What text processing challenges have you solved in Rust? Drop a comment below—I'd love to hear about your approach.


