Mirko Dimartino

Your Java Regex Just Silently Broke in Production. Here's How to Make That Impossible.

A colleague pushes a fix for a validation bug. The regex looks right. Tests pass. Two weeks later, a user reports that their perfectly valid email address is being rejected — because the fix introduced an unbalanced bracket that the compiler never complained about.

This is the nature of raw regex in Java. The compiler has no opinion. The mistake lives quietly in a string literal until runtime hands you the bill.
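To make the failure mode concrete, here is a small sketch using only java.util.regex. The class name and the broken pattern are made up for illustration. The string literal compiles fine under javac, and the unclosed bracket only surfaces when Pattern.compile runs.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RawRegexFailure {
    // Hypothetical broken pattern: the '[' is never closed.
    // To javac this is just a String, so the build stays green.
    static final String BROKEN = "^[a-z+@example\\.com$";

    /** Returns true if java.util.regex accepts the given regex string. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // Only reached at runtime, e.g. "Unclosed character class"
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles(BROKEN));      // false
        System.out.println(compiles("^[a-z]+$"));  // true
    }
}
```

If the pattern is compiled lazily, inside a request handler for example, the exception does not appear until that code path is first hit.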

I built Sift to make this class of bug unrepresentable.


What Sift actually does

Sift is a fluent DSL for building regular expressions in Java. Its core idea is simple: instead of writing a string, you traverse a type-state machine. Each method returns only the next valid state — so wrong transitions don't exist as methods, and the compiler rejects incomplete or structurally invalid patterns before your code ever runs.
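The type-state idea can be illustrated with a deliberately tiny builder. This is not Sift's implementation: the class and method names below (TinyRegex, NeedsQuantifier, NeedsType, Complete) are hypothetical, and the sketch covers only two quantifiers and two character classes. It shows how returning a different type from each step lets the compiler enforce the order of calls.

```java
// Hypothetical mini type-state builder, for illustration only (not Sift's real classes).
final class TinyRegex {
    private TinyRegex() {}

    /** State 1: a quantifier must be chosen before any character class. */
    static final class NeedsQuantifier {
        private final String soFar;
        NeedsQuantifier(String soFar) { this.soFar = soFar; }
        NeedsType exactly(int n) { return new NeedsType(soFar, "{" + n + "}"); }
        NeedsType oneOrMore()    { return new NeedsType(soFar, "+"); }
    }

    /** State 2: only character classes are offered here. */
    static final class NeedsType {
        private final String soFar;
        private final String quantifier;
        NeedsType(String soFar, String quantifier) {
            this.soFar = soFar;
            this.quantifier = quantifier;
        }
        Complete digits()  { return new Complete(soFar + "[0-9]" + quantifier); }
        Complete letters() { return new Complete(soFar + "[a-zA-Z]" + quantifier); }
    }

    /** State 3: a finished piece; either continue the chain or emit the string. */
    static final class Complete {
        private final String soFar;
        Complete(String soFar) { this.soFar = soFar; }
        NeedsQuantifier then() { return new NeedsQuantifier(soFar); }
        String build()         { return soFar; }
    }

    static NeedsQuantifier start() { return new NeedsQuantifier(""); }

    public static void main(String[] args) {
        String regex = start().exactly(3).letters().then().oneOrMore().digits().build();
        System.out.println(regex); // [a-zA-Z]{3}[0-9]+
        // start().digits();       // does not compile: NeedsQuantifier has no digits()
    }
}
```

Because NeedsQuantifier simply has no digits() method, skipping the quantifier is a compile error rather than a runtime surprise.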

The before/after speaks for itself:

// Before — what does this even do?
Pattern p = Pattern.compile("^[\\p{Lu}][\\p{L}\\p{Nd}_]{3,15}+[0-9]?$");

// After — your IDE guides every step
String regex = Sift.fromStart()
    .exactly(1).upperCaseLettersUnicode()       // Must start with an uppercase letter
    .then()
    .between(3, 15).wordCharactersUnicode().withoutBacktracking() // ReDoS-safe
    .then()
    .optional().digits()                         // May end with a digit
    .andNothingElse()
    .shake();

// Result: ^[\p{Lu}][\p{L}\p{Nd}_]{3,15}+[0-9]?$

Same output. Zero runtime overhead. And if you try to skip the quantifier step and call .digits() directly, it simply doesn't compile.
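As a quick sanity check, the string shown in the "Result" comment can be fed to java.util.regex directly. The sketch below assumes that string is exactly what the chain emits; the class name and sample inputs are made up.

```java
import java.util.regex.Pattern;

class UsernameCheck {
    // Copied from the "Result" comment, with backslashes doubled for a Java string literal.
    static final Pattern USERNAME =
        Pattern.compile("^[\\p{Lu}][\\p{L}\\p{Nd}_]{3,15}+[0-9]?$");

    static boolean isValid(String candidate) {
        return USERNAME.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Mirko"));         // true:  uppercase letter + 4 word characters
        System.out.println(isValid("Zoe_2026"));      // true:  digits and '_' count as word characters
        System.out.println(isValid("\u00C9lodie"));   // true:  \p{Lu} accepts non-ASCII uppercase
        System.out.println(isValid("mirko"));         // false: must start with an uppercase letter
        System.out.println(isValid("Ab"));            // false: needs at least 3 characters after the first
    }
}
```

One detail worth knowing: because \p{Nd} already covers digits and the quantifier is possessive, a trailing digit is normally consumed by the word-character run, and the final [0-9]? only comes into play once the 15-character limit is reached.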


The LEGO brick approach

The real power emerges when you start composing patterns from named building blocks.

Every Sift.fromAnywhere() call returns a SiftPattern<Fragment> — an unanchored, reusable piece that can be embedded anywhere without carrying unwanted ^ anchors. Patterns built with fromStart() or sealed with andNothingElse() become SiftPattern<Root> — they cannot be embedded. Attempting it is a compile-time error.

// Define named building blocks
SiftPattern<Fragment> year  = Sift.fromAnywhere().exactly(4).digits();
SiftPattern<Fragment> month = Sift.fromAnywhere().exactly(2).digits();
SiftPattern<Fragment> day   = Sift.fromAnywhere().exactly(2).digits();
SiftPattern<Fragment> dash  = Sift.fromAnywhere().character('-');

// Compose into a reusable date block
SiftPattern<Fragment> date = year.followedBy(dash, month, dash, day);

// Embed inside a larger log parser
String logRegex = Sift.fromStart()
    .of(date)
    .followedBy('\t')
    .then().oneOrMore().upperCaseLetters()  // log level: INFO, WARN, ERROR
    .followedBy('\t')
    .then().oneOrMore().anyCharacter()
    .andNothingElse()
    .shake();

// Result: ^[0-9]{4}-[0-9]{2}-[0-9]{2}\t[A-Z]+\t.+$

The date fragment is independently testable, readable by name, and reusable across your codebase without copy-paste.
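Here is a rough plain-JDK sketch of what "independently testable" can look like, assuming the regex text from the "Result" comment. The class name and sample inputs are invented, and string concatenation stands in for the fragment composition that Sift does with types.

```java
import java.util.regex.Pattern;

class LogLineCheck {
    // Regex text taken from the "Result" comment; "\\t" is the regex escape for a tab.
    static final String DATE = "[0-9]{4}-[0-9]{2}-[0-9]{2}";

    // The unanchored fragment can be tested on its own...
    static final Pattern DATE_ANYWHERE = Pattern.compile(DATE);
    // ...and reused inside the anchored log-line pattern.
    static final Pattern LOG_LINE = Pattern.compile("^" + DATE + "\\t[A-Z]+\\t.+$");

    static boolean containsDate(String text) { return DATE_ANYWHERE.matcher(text).find(); }
    static boolean isLogLine(String line)    { return LOG_LINE.matcher(line).matches(); }

    public static void main(String[] args) {
        System.out.println(containsDate("Released on 2026-03-13."));           // true
        System.out.println(isLogLine("2026-03-13\tINFO\tServer started"));     // true
        System.out.println(isLogLine("2026-3-13\tINFO\tServer started"));      // false: month needs 2 digits
        System.out.println(isLogLine("2026-03-13 INFO Server started"));       // false: spaces, not tabs
    }
}
```

With raw strings nothing stops someone from concatenating an already anchored pattern into the middle of another one; that is the mistake the Fragment/Root distinction is described as ruling out.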


Patterns are extraction tools, not just validators

This is the part that usually surprises people. In v5.6, every SiftPattern ships a complete extraction API — no Matcher boilerplate required.

Named group extraction

NamedCapture yearGroup  = SiftPatterns.capture("year",  Sift.exactly(4).digits());
NamedCapture monthGroup = SiftPatterns.capture("month", Sift.exactly(2).digits());
NamedCapture dayGroup   = SiftPatterns.capture("day",   Sift.exactly(2).digits());

SiftPattern<?> datePattern = Sift.fromStart()
    .namedCapture(yearGroup)
    .followedBy('-')
    .then().namedCapture(monthGroup)
    .followedBy('-')
    .then().namedCapture(dayGroup)
    .andNothingElse();

Map<String, String> fields = datePattern.extractGroups("2026-03-13");
// → { "year": "2026", "month": "03", "day": "13" }
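For comparison, this is roughly the Matcher boilerplate that a call like extractGroups replaces. It is a plain java.util.regex sketch, not Sift's internals: the regex string is an assumption about what the chain above would emit, and the class and method names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateGroups {
    // Assumed equivalent of the Sift chain above, written by hand with named groups.
    static final Pattern DATE =
        Pattern.compile("^(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})$");

    // java.util.regex on Java 8 cannot list group names, so they are repeated here.
    private static final String[] GROUP_NAMES = {"year", "month", "day"};

    /** Returns the named groups of the first match, or an empty map if there is none. */
    static Map<String, String> extract(String input) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (input == null) {
            return fields; // null-safe by hand
        }
        Matcher m = DATE.matcher(input);
        if (m.find()) {
            for (String name : GROUP_NAMES) {
                fields.put(name, m.group(name));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(extract("2026-03-13")); // {year=2026, month=03, day=13}
        System.out.println(extract("13/03/2026")); // {}
    }
}
```

Note that the group names appear twice, once in the regex and once in the array, which is exactly the kind of duplication that drifts out of sync.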

Extracting all matches across a text

List<String> prices = Sift.fromAnywhere()
    .oneOrMore().digits()
    .extractAll("Order: 3 items at 25 and 40 euros");
// → ["3", "25", "40"]

Extracting all named groups across multiple matches

List<Map<String, String>> allMatches = invoicePattern.extractAllGroups(largeDocument);
// → [{"id": "INV-001", "amount": "250"}, {"id": "INV-002", "amount": "80"}, ...]

Lazy streaming for large inputs

Sift.fromAnywhere().oneOrMore().lettersUnicode()
    .streamMatches(largeText)
    .filter(word -> word.length() > 5)
    .forEach(System.out::println);
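For readers who want to see the same idea without Sift, the JDK offers Matcher.results(), which returns a lazily populated stream of matches. That method needs Java 9 or later, so the sketch below is not a description of how Sift (which targets Java 8 bytecode) implements streamMatches. The regex \p{L}+ is an assumption about what oneOrMore().lettersUnicode() emits, and the class name is made up.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LongWords {
    // Assumed equivalent of oneOrMore().lettersUnicode()
    static final Pattern WORD = Pattern.compile("\\p{L}+");

    /** Collects every run of letters longer than five characters. */
    static List<String> longWords(String text) {
        return WORD.matcher(text)
                   .results()                    // Java 9+: matches are found as the stream is consumed
                   .map(MatchResult::group)
                   .filter(word -> word.length() > 5)
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(longWords("Sift makes regular expressions readable"));
        // [regular, expressions, readable]
    }
}
```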

The full API — all null-safe:

| Method | Returns | Description |
| --- | --- | --- |
| containsMatchIn(input) | boolean | Is there at least one match? |
| matchesEntire(input) | boolean | Does the entire string match? |
| extractFirst(input) | Optional<String> | First match, or empty |
| extractAll(input) | List<String> | All matches |
| extractGroups(input) | Map<String, String> | Named groups from first match |
| extractAllGroups(input) | List<Map<String, String>> | Named groups from all matches |
| replaceFirst(input, replacement) | String | Replace first match |
| replaceAll(input, replacement) | String | Replace all matches |
| splitBy(input) | List<String> | Split around matches |
| streamMatches(input) | Stream<String> | Lazy stream of all matches |

ReDoS mitigation built in

The withoutBacktracking() call you saw earlier emits a possessive quantifier, such as the {3,15}+ in that example or \w++. Two other tools are available:

// Atomic group — locks a sub-pattern once matched
SiftPattern<Fragment> safe = Sift.fromAnywhere()
    .oneOrMore().digits()
    .preventBacktracking(); // wraps in (?>...)

// Lazy quantifier — matches as few characters as possible
Sift.fromAnywhere()
    .oneOrMore().anyCharacter().asFewAsPossible(); // generates .+?

Secure patterns become the path of least resistance — you don't have to remember whether it's *+ or *?.
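The difference between the three forms is easy to observe with java.util.regex alone. The sketch below is a small demonstration with made-up inputs, not a ReDoS benchmark: it shows that possessive quantifiers and atomic groups refuse to give characters back, and that a lazy quantifier stops at the earliest possible point.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BacktrackingDemo {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    /** Returns the first match of regex in input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: \d+ grabs all five digits, then backtracks one so the final \d can match.
        System.out.println(matches("\\d+\\d", "12345"));      // true
        // Possessive: \d++ keeps all five digits, so nothing is left for the final \d.
        System.out.println(matches("\\d++\\d", "12345"));     // false
        // Atomic group: same effect as possessive, applied to a whole sub-pattern.
        System.out.println(matches("(?>\\d+)\\d", "12345"));  // false

        // Greedy vs lazy: as many characters as possible vs as few as possible.
        System.out.println(firstMatch("<.+>", "<a><b>"));     // <a><b>
        System.out.println(firstMatch("<.+?>", "<a><b>"));    // <a>
    }
}
```

The two false results show the trade-off: giving up backtracking removes the exponential retry paths behind ReDoS, but it also changes what the pattern accepts, so it has to be applied where nothing after the quantifier needs those characters.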


Jakarta Validation — no more duplicated regex

If you use Bean Validation, you've probably written the same @Pattern across multiple DTOs and then forgotten to sync them when the rule changed. Sift solves this with @SiftMatch:

// Define the rule once
public class PromoCodeRule implements SiftRegexProvider {
    @Override
    public String getRegex() {
        return Sift.fromStart()
            .atLeast(4).letters()
            .then()
            .exactly(3).digits()
            .andNothingElse()
            .shake();
    }
}

// Reuse it everywhere — compiled once at bootstrap, zero overhead per request
public record ApplyPromoRequest(
    @SiftMatch(
        value   = PromoCodeRule.class,
        flags   = { SiftMatchFlag.CASE_INSENSITIVE },
        message = "Invalid promo code format"
    )
    String promoCode
) {}
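The "compiled once" part is ordinary JDK behaviour that is worth seeing in isolation. The sketch below does not use @SiftMatch or Jakarta Validation at all: it assumes the rule above produces something equivalent to ^[a-zA-Z]{4,}[0-9]{3}$ (the exact string Sift emits may differ), and the class name is made up.

```java
import java.util.regex.Pattern;

class PromoCodeCheck {
    // Assumed output of PromoCodeRule; compiled once when the class is initialised.
    private static final Pattern PROMO =
        Pattern.compile("^[a-zA-Z]{4,}[0-9]{3}$", Pattern.CASE_INSENSITIVE);

    /** Per-call work is only the match itself; the Pattern is shared and thread-safe. */
    static boolean isValid(String code) {
        return code != null && PROMO.matcher(code).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("SAVE123")); // true
        System.out.println(isValid("save123")); // true
        System.out.println(isValid("abc123"));  // false: only three letters
        System.out.println(isValid("SAVE12"));  // false: only two digits
    }
}
```

A validator annotation does the same thing in a more reusable form: the regex lives in one provider class, and every DTO field that references it shares the single compiled Pattern.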

Ready-made patterns — SiftCatalog

For common formats, SiftCatalog provides production-ready, ReDoS-safe patterns. All are Fragment-typed — they compose cleanly with your own chains.

// Standalone validation
boolean valid = SiftCatalog.email().matchesEntire("user@example.com");

// Embedded in a larger pattern
String regex = Sift.fromStart()
    .of(SiftCatalog.uuid())
    .followedBy('/')
    .then().of(SiftCatalog.isoDate())
    .andNothingElse()
    .shake();

Available: uuid(), ipv4(), macAddress(), email(), webUrl(), isoDate().


Getting started

Gradle:

implementation 'com.mirkoddd:sift-core:<latest>'

// Optional: Jakarta Validation integration
implementation 'com.mirkoddd:sift-annotations:<latest>'

Maven:

<dependency>
    <groupId>com.mirkoddd</groupId>
    <artifactId>sift-core</artifactId>
    <version>latest</version> <!-- placeholder: substitute the latest released version number -->
</dependency>

Java 8 bytecode. Zero runtime dependencies. Tested on JVM 8, 11, 17, and 21.

👉 GitHub — mirkoddd/Sift
📖 Sift Cookbook — real-world recipes: UUID validation, TSV log parsing, lookarounds, conditional patterns, nested structures, and more.


The compiler is the best test suite you have. Sift puts it to work on your regex too.
