Mohammad Waseem

Mastering Dirty Data Cleanup in TypeScript with Open Source Tools

Handling dirty data is a common challenge for data engineers and developers alike. Inconsistent formats, missing values, erroneous entries, and unstandardized fields can significantly degrade the quality of analytics and downstream processes. Pairing TypeScript with open source tools provides a robust, maintainable, and type-safe approach to cleaning and normalizing data.

The Challenge of Dirty Data

Dirty data often originates from various sources—user inputs, third-party APIs, logs, or legacy systems—each with its own quirks and inconsistencies. Typical issues include:

  • Inconsistent casing or spelling
  • Duplicate or missing records
  • Malformed entries or incorrect data types
  • Unstructured or semi-structured data formats

Addressing these issues requires a systematic approach that combines data validation, transformation, and deduplication.
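Before reaching for libraries, the shape of such a validate → transform → deduplicate pipeline can be sketched in plain TypeScript. The record shape and helper names below are illustrative assumptions for this sketch, not part of any library:

```typescript
type RawRecord = Record<string, unknown>;

interface CleanRecord {
  email: string;
  name: string;
}

// Stage 1: validation — keep only records with the fields we need
function isValid(r: RawRecord): r is { email: string; name: string } {
  return typeof r.email === "string" && typeof r.name === "string";
}

// Stage 2: transformation — normalize casing and whitespace
function normalize(r: { email: string; name: string }): CleanRecord {
  return { email: r.email.trim().toLowerCase(), name: r.name.trim() };
}

// Stage 3: deduplication — keep the first record seen per email
function dedupe(records: CleanRecord[]): CleanRecord[] {
  const seen = new Set<string>();
  return records.filter((r) => {
    if (seen.has(r.email)) return false;
    seen.add(r.email);
    return true;
  });
}

function cleanPipeline(raw: RawRecord[]): CleanRecord[] {
  return dedupe(raw.filter(isValid).map(normalize));
}
```

The library-backed version in the following sections follows the same three stages, swapping each hand-rolled helper for a more robust equivalent.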

Strategy Overview

Our approach involves using open source TypeScript libraries to validate, normalize, and clean data in a scalable manner. Key tools include:

  • io-ts for runtime data validation and decoding
  • lodash for utility functions like deduplication and deep cloning
  • date-fns for date parsing and formatting
  • Custom transformation functions for normalization

This stack ensures type safety, extensibility, and integration with existing TypeScript codebases.

Implementation Details

Let's walk through a practical example: cleaning a list of user records with inconsistent formats.

Step 1: Define Data Types and Validation Schema

import * as t from 'io-ts';
import { isRight } from 'fp-ts/Either';

type User = {
  name: string;
  email: string;
  dateOfBirth: string;
};

const UserCodec = t.type({
  name: t.string,
  email: t.string,
  dateOfBirth: t.string, // Expect date in 'YYYY-MM-DD' or other formats
});

// Sample raw data
const rawData = [
  { name: "john doe", email: "JOHN@EXAMPLE.COM", dateOfBirth: "1980/01/15" },
  { name: "Jane Smith", email: "jane.smith@sample.com", dateOfBirth: "1985-07-20" },
  // ... more entries
];

Step 2: Validate and Decode Data

const validatedData: User[] = rawData.flatMap((item) => {
  const decoded = UserCodec.decode(item);
  // Keep only records that decode successfully; decode once per record
  return isRight(decoded) ? [decoded.right] : [];
});

Step 3: Normalize Data

import { parse, format } from 'date-fns';

// Helper functions
function normalizeName(name: string): string {
  // Title-case: lowercase everything, then capitalize the first letter of each word
  return name.trim().toLowerCase().replace(/\b\w/g, (c) => c.toUpperCase());
}

function normalizeEmail(email: string): string {
  return email.trim().toLowerCase();
}

const ACCEPTED_FORMATS = ['yyyy/MM/dd', 'yyyy-MM-dd'];

function parseDate(dateStr: string): string {
  // Try each accepted input format and emit a canonical 'yyyy-MM-dd' string
  for (const fmt of ACCEPTED_FORMATS) {
    const parsed = parse(dateStr, fmt, new Date());
    if (!isNaN(parsed.getTime())) {
      return format(parsed, 'yyyy-MM-dd');
    }
  }
  return ''; // Unrecognized or invalid date
}

// Apply normalization
const cleanedData = validatedData.map(user => ({
  name: normalizeName(user.name),
  email: normalizeEmail(user.email),
  dateOfBirth: parseDate(user.dateOfBirth),
}));

Step 4: Deduplicate Records

import { uniqBy } from 'lodash';

const uniqueUsers = uniqBy(cleanedData, (user) => user.email);
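Note that `uniqBy` keeps the first record it sees for each key, so any preference ordering (for example, most recently updated record first) should be applied before deduplicating. For reference, a dependency-free equivalent of this step using a `Map` (the interface and function names here are illustrative):

```typescript
interface CleanUser {
  name: string;
  email: string;
  dateOfBirth: string;
}

// Keep the first user seen for each email, mirroring lodash's uniqBy
function uniqueByEmail(users: CleanUser[]): CleanUser[] {
  const byEmail = new Map<string, CleanUser>();
  for (const user of users) {
    if (!byEmail.has(user.email)) byEmail.set(user.email, user);
  }
  return [...byEmail.values()];
}
```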

Summary

By combining io-ts for validation, lodash for utility functions, and date-fns for date handling, this approach provides a comprehensive solution for cleaning dirty data in TypeScript projects. This strategy ensures data integrity, supports scalability, and maintains type safety, making it an ideal choice for enterprise-grade data pipelines.

Regularly updating and customizing normalization functions to match your data sources will further enhance data quality. With open source tools, you can also extend this pipeline to include more complex transformations and validations as needed.
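As one example of such an extension, a stricter email check can be layered in as a plain predicate before normalization. The regex below is a deliberately simple sketch of my own, not a full RFC 5322 validator; a real pipeline might instead wire this predicate into an io-ts refinement:

```typescript
// Illustrative email shape check: something@something.tld, no whitespace
const SIMPLE_EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function isPlausibleEmail(email: string): boolean {
  return SIMPLE_EMAIL_RE.test(email.trim());
}
```

Filtering with `validatedData.filter((u) => isPlausibleEmail(u.email))` would then drop records that decode as strings but are not usable addresses.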

Final Thoughts

Data cleaning is vital for trustworthy analytics and decision-making. Embracing a systematic, TypeScript-based methodology leverages type safety and modularity, empowering developers to build resilient data workflows. Adopting these practices and leaning on the open source ecosystem positions your projects for long-term success.

