The Purpose:
Modern distributed systems such as API gateways, microservices, payment processors, fraud engines, and event-driven pipelines generate massive volumes of logs.
Buried deep inside these logs are sensitive data elements such as:
- Credit card information
- SSNs/TaxIDs
- Email addresses
- Access Tokens/API Keys
- Customer IDs
- Session identifiers
If these logs reach Splunk, ELK, CloudWatch, S3, or shared storage without redaction, companies can violate PCI DSS, GDPR, HIPAA, SOX, CCPA, and internal info-sec policies.
Goal:
Build a real-time log interceptor that uses the Aho-Corasick algorithm to detect sensitive values at scale, and tokenize them before they leave the application.
This Gives:
🔒 Security (no raw PII/PCI stored in logs)
⚡ Speed (Aho–Corasick matches every pattern in a single linear-time scan of the text)
🔁 Consistency (same sensitive value → same token)
🧩 Clean integration (pluggable for Spring Boot logs)
Why Aho–Corasick for Log Interception?
- Regex is powerful but slow, especially when every log line must be checked against 50+ sensitive patterns, each scanned independently.
- Aho–Corasick builds a finite automaton (trie + failure links) that searches all patterns simultaneously.
Benefits:
✔ Matches thousands of patterns in one scan
✔ No backtracking
✔ Works well for streaming logs
✔ Perfect for hot paths (interceptors, filters, appenders)
If your application handles millions of log lines an hour, Aho–Corasick can be a real helping hand here, consuming far fewer resources than running dozens of regexes per line.
Architecture
Implementation
Step 1: Add the dependency:
<dependency>
    <groupId>org.ahocorasick</groupId>
    <artifactId>ahocorasick</artifactId>
    <version>0.6.3</version>
</dependency>
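If you build with Gradle instead, the same coordinates apply:

implementation 'org.ahocorasick:ahocorasick:0.6.3'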
Step 2: Define your sensitive patterns:
One caveat up front: the org.ahocorasick library matches literal strings, not regular expressions, so format-based patterns like the ones below cannot be fed into the trie directly. Keep them as compiled regexes for a separate pass (applied inside the filter in Step 5), and reserve the trie for literal values (Step 3).

List<Pattern> sensitivePatterns = List.of(
    Pattern.compile("\\b\\d{16}\\b"),                    // 16-digit card numbers
    Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b"),       // SSNs
    Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+"), // email addresses
    Pattern.compile("Bearer [A-Za-z0-9_-]+")             // bearer tokens
);
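A quick, illustrative sanity check of the card-number pattern against a sample line:

Matcher m = sensitivePatterns.get(0).matcher("card 4532123412341234 charged");
if (m.find()) {
    System.out.println("matched: " + m.group()); // prints: matched: 4532123412341234
}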
Step 3: Create the Aho-Corasick trie:
The trie is built from literal keywords: concrete values you already know are sensitive, such as API keys, service credentials, or customer identifiers loaded at startup (loadKnownSensitiveValues() below is a hypothetical helper standing in for that lookup).

// Literal values known to be sensitive, e.g. loaded from a vault or database.
List<String> knownSensitiveValues = loadKnownSensitiveValues();

Trie trie = Trie.builder()
        .onlyWholeWords()
        .ignoreCase()
        .addKeywords(knownSensitiveValues)
        .build();
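A single parseText call then reports every known value in one pass over the line, with offsets (this assumes john.doe@gmail.com was among the loaded values):

for (Emit e : trie.parseText("login by john.doe@gmail.com succeeded")) {
    System.out.println(e.getStart() + "-" + e.getEnd() + ": " + e.getKeyword());
}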
Step 4: Create a tokenizer (deterministic, so the same value always yields the same token):
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

@Component
public class Tokenizer {

    private final Map<String, String> cache = new ConcurrentHashMap<>();

    public String tokenize(String value) {
        return cache.computeIfAbsent(value, v -> {
            String encoded = Base64.getEncoder()
                    .encodeToString(v.getBytes(StandardCharsets.UTF_8));
            // Guard: short values can encode to fewer than 10 characters.
            return "TOKEN_" + encoded.substring(0, Math.min(10, encoded.length()));
        });
    }
}
This is simple deterministic tokenization: the same value always maps to the same token. Note that a truncated Base64 prefix is trivially decodable, so treat it as a placeholder rather than a security control. It can be replaced with:
- Vault-based tokenization.
- Hash-based irreversible tokenization (a sketch follows this list).
- KMS-encrypted reversible tokenization.
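A minimal sketch of the hash-based option, assuming a salted SHA-256 digest fits your threat model (salt here is a hypothetical secret byte[] loaded from configuration). It stays deterministic, so the same value still maps to the same token, but the token can no longer be decoded:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public String tokenize(String value) {
    try {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        digest.update(salt); // hypothetical secret salt from config
        byte[] hash = digest.digest(value.getBytes(StandardCharsets.UTF_8));
        // URL-safe Base64 keeps tokens log-friendly; 12 chars is an arbitrary cut.
        return "TOKEN_" + Base64.getUrlEncoder().withoutPadding()
                .encodeToString(hash).substring(0, 12);
    } catch (NoSuchAlgorithmException e) {
        throw new IllegalStateException("SHA-256 unavailable", e);
    }
}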
Step 5: Create the log filter:
Below is an example Logback TurboFilter; there are also purpose-built Logback and Log4j masking extensions if you prefer an off-the-shelf integration.
public class SensitiveDataFilter extends TurboFilter {
    // Re-logging the masked message re-enters this filter on the same
    // thread; this flag prevents infinite recursion.
    private static final ThreadLocal<Boolean> MASKING =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    private Trie trie;                       // built as in Step 3
    private List<Pattern> sensitivePatterns; // compiled as in Step 2
    private Tokenizer tokenizer;             // see the wiring note below

    @Override
    public FilterReply decide(Marker marker, Logger logger, Level level,
                              String format, Object[] params, Throwable t) {
        if (format == null || MASKING.get()) return FilterReply.NEUTRAL;
        String safe = maskSensitive(format);
        if (safe.equals(format)) return FilterReply.NEUTRAL; // nothing sensitive
        MASKING.set(Boolean.TRUE);
        try {
            logger.log(marker, Logger.FQCN,
                    Level.toLocationAwareLoggerInteger(level), safe, params, t);
        } finally {
            MASKING.set(Boolean.FALSE);
        }
        return FilterReply.DENY; // drop the original, unmasked event
    }

    private String maskSensitive(String message) {
        String masked = message;
        // Pass 1: every literal keyword found in one Aho-Corasick scan.
        // Emit offsets refer to the original text, so extract matches from it.
        for (Emit e : trie.parseText(message)) {
            String matched = message.substring(e.getStart(), e.getEnd() + 1);
            masked = masked.replace(matched, tokenizer.tokenize(matched));
        }
        // Pass 2: format-based values via the compiled regexes from Step 2.
        for (Pattern p : sensitivePatterns) {
            masked = p.matcher(masked).replaceAll(m -> tokenizer.tokenize(m.group()));
        }
        return masked;
    }
}

This sketch masks only the message/format string; a production version would also scan the params array, since parameterized values are just as likely to carry sensitive data.
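One wiring caveat: Logback instantiates TurboFilters from XML before the Spring context exists, so the Spring-managed Tokenizer cannot be injected into the filter directly. A minimal sketch of one way to bridge the two lifecycles (TokenizerHolder and TokenizerWiring are hypothetical names):

import java.util.concurrent.atomic.AtomicReference;
import org.springframework.stereotype.Component;

public final class TokenizerHolder {
    // The filter reads from here; a default instance covers early startup logs.
    public static final AtomicReference<Tokenizer> INSTANCE =
            new AtomicReference<>(new Tokenizer());
}

@Component
class TokenizerWiring {
    TokenizerWiring(Tokenizer tokenizer) {
        TokenizerHolder.INSTANCE.set(tokenizer); // swap in the Spring bean
    }
}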
Step 6: Register the filter in logback-spring.xml:

<configuration>
    <turboFilter class="com.example.logging.SensitiveDataFilter"/>
</configuration>
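With the filter registered, ordinary SLF4J calls are masked transparently; for example (PaymentService is a hypothetical class):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentService {
    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    public void record() {
        log.info("Received payment for card 4532123412341234 from john.doe@gmail.com");
    }
}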
Every log line now flows through the Aho-Corasick + tokenization pipeline before it reaches any appender.
Example in Action:
Incoming log line:

Received payment for card 4532123412341234 from john.doe@gmail.com

The masking passes find:
- 4532123412341234 (card-number regex)
- john.doe@gmail.com (email regex, or the trie if it is a known literal value)

The tokenizer converts them (truncated Base64 from the Step 4 tokenizer):
- TOKEN_NDUzMjEyMz
- TOKEN_am9obi5kb2

Final log output:

Received payment for card TOKEN_NDUzMjEyMz from TOKEN_am9obi5kb2
📊 Performance: Why This Scales
Regex:
Regex works well for simple matching, but it slows down once dozens of sensitive patterns are involved: each pattern scans the line independently, so cost grows roughly as O(k·n) for k patterns over a line of length n, with potential backtracking on top. It is suitable for small, simple cases but not for large-scale log scanning. A naive baseline is sketched below.
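For contrast, the naive per-pattern loop looks like this (maskWithRegexOnly is a hypothetical helper):

String maskWithRegexOnly(String line, List<Pattern> patterns, Tokenizer tokenizer) {
    String masked = line;
    for (Pattern p : patterns) { // k independent scans of the same line
        masked = p.matcher(masked).replaceAll(m -> tokenizer.tokenize(m.group()));
    }
    return masked;
}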
Aho–Corasick:
Aho–Corasick performs multi-pattern matching in time linear in the length of the scanned text: after a one-time build over the keyword set, each search runs in O(n + z) for text length n and z matches, no matter how many patterns are loaded. All patterns are compiled into a single trie with failure links, allowing simultaneous matching. This makes it ideal for high-volume log streams and real-time tokenization scenarios.
ML-based Approaches
Machine learning text-classification models require far more memory and compute and are typically unnecessary for deterministic pattern detection. They are better suited for semantic or NLP tasks rather than explicit pattern extraction like credit cards, SSNs, or email detection in logs.
