Introduction
Handling unstructured or 'dirty' data is a common challenge in security research, especially when dealing with raw logs, network traffic captures, or user input that lacks consistent formatting. Automating the cleaning process is essential for reliable analysis, but what happens when the tools or codebases involved are poorly documented? This post walks through how a security researcher used Go to clean dirty data efficiently despite working with little to no documentation.
Understanding the Challenge
In many security contexts, raw data is inconsistent, containing malformed entries, extraneous characters, or incomplete records. The goal is to transform this data into a normalized format suitable for further analysis or detection algorithms.
Key requirements include:
- Removing or normalizing inconsistent delimiters
- Filtering out malformed entries
- Extracting relevant fields
- Ensuring high performance for large datasets
The main obstacle is working with legacy or undocumented code that performs partial cleaning but is hard to comprehend.
Approach: Reverse Engineering and Incremental Development
Since proper documentation is absent, the researcher adopted a step-by-step approach:
- Examine existing code heuristically
- Write small test cases to confirm assumptions
- Incrementally refactor and modularize the cleaning logic
- Write comprehensive tests to document expected behaviors
This approach builds understanding incrementally while keeping the existing behavior intact and the code safe to change.
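For example, a small characterization test in a _test.go file can pin down what an undocumented helper actually does before any refactoring begins. The sketch below is purely illustrative: legacyClean is a hypothetical stand-in for the legacy routine, and the expected values would simply record whatever behavior is observed today.

package main

import "testing"

// Hypothetical characterization test: legacyClean stands in for the
// undocumented cleaning routine; the expected values record its observed
// behavior so that later refactors can be checked against it.
func TestLegacyCleanObservedBehavior(t *testing.T) {
    cases := map[string]string{
        "192.168.1.1 , GET /index.html": "192.168.1.1,GET /index.html",
        "   ":                           "",
    }
    for in, want := range cases {
        if got := legacyClean(in); got != want {
            t.Errorf("legacyClean(%q) = %q, want %q", in, got, want)
        }
    }
}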
Implementing the Solution in Go
Assume the raw input contains unstructured entries such as:
"192.168.1.1, GET /index.html, -"
"invalid entry"
"10.0.0.5|POST /api/data|200"
The cleaning process involves parsing these entries, normalizing delimiters, and filtering invalid entries.
Step 1: Define the data structure
package main

// LogEntry holds the normalized fields extracted from a single raw entry.
type LogEntry struct {
    IP     string
    Method string
    Path   string
    Status string
}
Step 2: Implement a parsing function with fallback logic
import (
    "errors"
    "strconv" // used by the validation helpers in Step 3
    "strings"
)

// parseLine normalizes a single raw line, trying the delimiters observed in the data.
func parseLine(line string) (*LogEntry, error) {
    // Try common delimiters
    var parts []string
    if strings.Contains(line, ",") {
        parts = strings.Split(line, ",")
    } else if strings.Contains(line, "|") {
        parts = strings.Split(line, "|")
    } else {
        return nil, errors.New("unknown delimiter")
    }

    // Basic validation
    if len(parts) < 3 {
        return nil, errors.New("malformed entry")
    }

    // The request field looks like "GET /index.html"; split out the method
    // here and let extractPath pull out the path.
    request := strings.TrimSpace(parts[1])
    method := strings.SplitN(request, " ", 2)[0]

    return &LogEntry{
        IP:     strings.TrimSpace(parts[0]),
        Method: method,
        Path:   extractPath(request),
        Status: strings.TrimSpace(parts[2]),
    }, nil
}
// extractPath returns the path portion of a request field,
// e.g. "GET /index.html" -> "/index.html".
func extractPath(request string) string {
    parts := strings.SplitN(request, " ", 2)
    if len(parts) == 2 {
        return parts[1]
    }
    return ""
}
Step 3: Filter and clean data
// cleanData parses every raw line and keeps only the entries that validate.
func cleanData(lines []string) []LogEntry {
    var cleaned []LogEntry
    for _, line := range lines {
        if entry, err := parseLine(line); err == nil && isValid(entry) {
            cleaned = append(cleaned, *entry)
        }
    }
    return cleaned
}

// isValid applies example validation: IP format, method, status code.
func isValid(entry *LogEntry) bool {
    if !isValidIP(entry.IP) || entry.Method == "" || entry.Status == "" {
        return false
    }
    return true
}

// isValidIP performs a minimal dotted-quad check; net.ParseIP from the
// standard library is a stricter alternative if IPv6 support is needed.
func isValidIP(ip string) bool {
    parts := strings.Split(ip, ".")
    if len(parts) != 4 {
        return false
    }
    // Basic numeric validation
    for _, p := range parts {
        if num, err := strconv.Atoi(p); err != nil || num < 0 || num > 255 {
            return false
        }
    }
    return true
}
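Putting it together, a small driver (not part of the original codebase, added here only to show usage) runs the pipeline over the sample entries from earlier. It lives in the same package as the functions above and adds "fmt" to the imports.

import "fmt"

func main() {
    raw := []string{
        "192.168.1.1, GET /index.html, -",
        "invalid entry",
        "10.0.0.5|POST /api/data|200",
    }

    for _, e := range cleanData(raw) {
        fmt.Printf("%s %s %s %s\n", e.IP, e.Method, e.Path, e.Status)
    }
    // Expected output ("invalid entry" has no recognizable delimiter and is dropped):
    //   192.168.1.1 GET /index.html -
    //   10.0.0.5 POST /api/data 200
}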
Final Remarks
This approach illustrates a pragmatic way to reverse engineer and refactor legacy undocumented code for data cleaning tasks. By incrementally understanding code behavior, writing tests, and modularizing logic, security researchers can reliably transform dirty data into actionable intelligence.
While working without documentation is challenging, it also encourages deeper understanding of the data and code, which ultimately enhances the robustness of security workflows.
Feel free to experiment with this pattern for other unstructured data cleaning tasks in Go, leveraging the language’s strong standard library for string manipulation and concurrency support for high performance.
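As a starting point for the concurrency angle, here is a minimal sketch, assuming the parseLine and isValid functions defined above: it streams a large file with bufio.Scanner and fans parsing out to a fixed pool of goroutines. The function name, worker count, and file-based input are illustrative choices, not part of the original code.

import (
    "bufio"
    "os"
    "sync"
)

// cleanFileConcurrently streams lines from a file and parses them on a
// small worker pool. Scanner errors are ignored here to keep the sketch short.
func cleanFileConcurrently(path string, workers int) ([]LogEntry, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    lines := make(chan string)
    results := make(chan LogEntry)

    // Workers parse and validate lines as they arrive.
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for line := range lines {
                if entry, err := parseLine(line); err == nil && isValid(entry) {
                    results <- *entry
                }
            }
        }()
    }

    // Close results once every worker has finished.
    go func() {
        wg.Wait()
        close(results)
    }()

    // Feed the file to the workers line by line.
    go func() {
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            lines <- scanner.Text()
        }
        close(lines)
    }()

    var cleaned []LogEntry
    for entry := range results {
        cleaned = append(cleaned, entry)
    }
    return cleaned, nil
}

Whether the goroutine overhead pays off depends on the workload; for simple string splitting, a single-threaded bufio.Scanner loop is often fast enough, so measure before committing to the pool.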