Dr. Malte Polley
GenAI in the Field of Data Cleansing: First Steps

TL;DR

In this post, we dive into how Generative AI (GenAI) can supercharge data cleansing, specifically for cleaning up email addresses and birthdays. We tackle the usual headaches of using regular expressions and show how GenAI can step in to make things easier and faster. You'll find code snippets for generating fake data, adding some errors, and using Amazon Bedrock for smart error-fixing. The results? GenAI nailed all the birthday fixes, but a few emails tripped it up due to special characters.

Introduction: PowerUp for RegEx Implementation

GenAI has been discussed in many different fields. For a long time, I have been thinking about how to apply GenAI to data preparation and integration. So far, my approach has been to use regular expressions to first identify invalid values in a table, and then, in a second step, to build a rule set in Spark, Python, or SQL to correct the inaccurate values.

Both steps are tedious due to edge cases and separate development streams. Still, the work has been valuable: we see significant improvements in data analysis and in the creation of data products. Ultimately, harmonizing data sources creates opportunities in every organization.

Due to this effort and the value of data cleansing, I want to share my first steps in leveraging GenAI to reduce the implementation effort. This is possible because a prompt describes a generic task, like "fix this string according to ... if it contains errors, mistakes, and typos," and the GenAI model in the background decides on its own how to fix the string. With such an approach, we combine regex development and error correction in a single step. One caveat: the response behavior may vary from run to run.
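To make the contrast concrete, here is a minimal sketch (not from the original project) of both routes for a date field: the classic approach needs a detection pattern plus hand-written correction rules, while the GenAI approach needs only the generic instruction.

import re

# Classic route: one regex to detect invalid values ...
BIRTHDAY_PATTERN = re.compile(r"^\d{2}\.\d{2}\.\d{4}$")

# ... plus hand-written correction rules for every edge case we can anticipate
def fix_birthday_classic(value: str) -> str:
    value = re.sub(r"[ /-]", ".", value)  # normalize separators
    # ... further rules for missing leading zeros, two-digit years, etc.
    return value

# GenAI route: one generic instruction covers all of these cases at once
GENAI_TASK = "Fix this string according to the pattern DD.MM.YYYY if it contains errors, mistakes, and typos."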

Creating a Sample: Cleaning Email Addresses and Birthdays

To showcase my experiences so far, we will create a sample resembling data from an internal system. It contains emails and birthdays of our customers. Unfortunately, our input system does not validate the data that customers enter, so customers occasionally send incorrect data to our CRM.

Data Producer: Creating a Ground Truth for Evaluation Purposes

Our system will be mocked by a Python script. It first creates valid, expected inputs with 100% accuracy. Afterwards, we introduce typical errors and typos into the entries of our dataset with a 50% chance. Let's start with emails.

# src/dataproducer.py
# (methods of the DataProducer class)

import random

def generate_random_email(self):
    # Build "first.last@domain" from the configured name pools
    first_name = random.choice(self.first_names)
    last_name = random.choice(self.last_names)
    domains = ['example.com', 'test.com', 'sample.com', 'demo.com']
    email = f"{first_name}.{last_name}@{random.choice(domains)}"
    return email.lower()

def introduce_typo_email(self, email: str):
    # With a 50% chance, corrupt the email at a random position
    if random.random() < 0.5:
        email_list = list(email)
        index = random.randint(0, len(email) - 1)
        action = random.choice(['replace', 'add', 'remove', 'case_change'])
        if action == 'replace':
            email_list[index] = random.choice('abcdefghijklmnopqrstuvwxyz')
        elif action == 'add':
            email_list.insert(index, random.choice('abcdefghijklmnopqrstuvwxyz'))
        elif action == 'remove' and len(email) > 1:
            email_list.pop(index)
        elif action == 'case_change':
            email_list[index] = email_list[index].upper() if email_list[index].islower() else email_list[index].lower()
        return ''.join(email_list)
    return email

The function generate_random_email() creates random emails based on a predefined set of German first and last names and four sample domains. The function introduce_typo_email() introduces an error with a 50% chance by replacing, adding, or removing a letter in the email, or by flipping a letter's case.
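For illustration, calling the two methods might look like this (assuming the DataProducer constructor shown later in src/datacleaner.py; the concrete outputs are random, so the values in the comments are just examples):

import dataproducer as dp

producer = dp.DataProducer(
    rows=100,
    first_names=['Max', 'Sophie'],
    last_names=['Müller', 'Weber'],
)
email = producer.generate_random_email()                  # e.g. 'sophie.weber@demo.com'
email_typo = producer.introduce_typo_email(email=email)   # e.g. 'sophie.webr@demo.com', or unchanged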

The second data field will be the birthday of our customers.

from datetime import datetime, timedelta

def generate_correct_birthday(self):
    # Random date between 01.01.1900 and 31.12.2023, formatted as DD.MM.YYYY
    start_date = datetime(1900, 1, 1)
    end_date = datetime(2023, 12, 31)
    random_date = start_date + timedelta(days=random.randint(0, (end_date - start_date).days))
    return random_date.strftime("%d.%m.%Y")

def introduce_typo_birthday(self, birthday):
    # Strip the leading zero from the day with a 50% chance
    if random.random() < 0.5:
        birthday = birthday.lstrip('0')
    # Truncate the century ("19xx"/"20xx" -> "xx") with a 50% chance; this must
    # happen while the separator is still "." so the split below works
    if random.random() < 0.5:
        year = birthday.split('.')[-1]
        if (year.startswith("19") or year.startswith("20")) and random.choice([True, False]):
            year = year[2:]
        birthday = '.'.join(birthday.split('.')[:-1] + [year])
    # Swap the "." separator for " ", "/", or "-" with a 50% chance
    if random.random() < 0.5:
        birthday = birthday.replace('.', random.choice([' ', '/', '-']))
    return birthday

The function generate_correct_birthday() creates birthdays in the format DD.MM.YYYY, each between 01.01.1900 and 31.12.2023. The typos we introduce include replacing the separator "." with a space, "/", or "-", stripping leading zeros, and truncating the century prefix ("19" or "20") from the year.
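Purely for illustration again (reusing the producer instance from the email sketch above; the concrete typos are random):

birthday = producer.generate_correct_birthday()                      # e.g. '06.06.2006'
birthday_typo = producer.introduce_typo_birthday(birthday=birthday)
# Possible results: '6.06.2006', '06/06/2006', '06.06.06', or the unchanged value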

Finally, we write the dataset to a CSV file. For this blog post, we limit the number of rows to 100.

import pandas as pd

def create_pandas_dataframe(self):
    emails_correct = [self.generate_random_email() for _ in range(self.rows)]
    emails_typo = [self.introduce_typo_email(email=email) for email in emails_correct]
    birthdays_correct = [self.generate_correct_birthday() for _ in range(self.rows)]
    birthdays_typo = [self.introduce_typo_birthday(birthday=birthday) for birthday in birthdays_correct]
    data = {
        'EmailCorrect': emails_correct,
        'EmailTypo': emails_typo,
        'BirthdayCorrect': birthdays_correct,
        'BirthdayTypo': birthdays_typo,
    }
    return pd.DataFrame(data)

def write_as_csv(self, dataframe: pd.DataFrame):
    dataframe.to_csv(path_or_buf='data/out.csv', index=False)  

We execute the snippets above from the file src/datacleaner.py.

# src/datacleaner.py
# snippet

import logging

import dataproducer as dp  # the module shown above

logging.basicConfig(level=logging.INFO)

def create_data():
    logging.info("Creating DataFrame")
    data_producer = dp.DataProducer(
        rows=100,
        first_names=['Max', 'Sophie', 'Leon', 'Marie', 'Paul', 'Emma', 'Lukas', 'Hannah', 'Tim', 'Anna'],
        last_names=['Müller', 'Schmidt', 'Schneider', 'Fischer', 'Weber', 'Meyer', 'Wagner', 'Becker', 'Hoffmann', 'Schulz'],
    )
    df = data_producer.create_pandas_dataframe()
    data_producer.write_as_csv(dataframe=df)
    logging.info("Wrote down DataFrame as csv")

if __name__ == "__main__":
    create_data()

GenAI-based Data Cleansing: Invoking Claude Haiku for Dynamic RegEx Evaluation at Runtime

To implement dynamic GenAI-based RegEx behavior at runtime, we need to handle four things. First, we need to create a prompt that holds the relevant information. Second, we need to prepare the data so it can be processed by our GenAI model. Third, we need to invoke the Amazon Bedrock endpoint. Finally, the result from the endpoint needs to fit into and be written back to our DataFrame.

Prompting, Retrieving, and Running the Implementation

I followed this website to create the prompt. The following snippet shows how to create dynamic prompts for different data fields. I marked the instructions and the data with XML tags as suggested on the website.

def prompt_builder(field_name: str, pattern: str, data: list[str]):
    prompt = f"""<instructions>You are in charge of ensuring data quality. 
Check the following data fields. They represent {field_name} 
from our CRM of our German customers. Fix all fields if they show typos or other mistakes. Ensure your answers follow this pattern: {pattern}.
Return only a comma-separated list and keep the order of the input data.</instructions> <data>{data}</data>"""

    return prompt

The idea is a role-based prompt: state the task clearly and directly, and be precise about the expected output format and, in our case, the order of the results.
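For the birthday field, the rendered prompt looks roughly like this (with a shortened data list for readability):

print(prompt_builder(field_name="Birthday", pattern="DD.MM.YYYY", data=["6/06/2006", "22 12 1957"]))
# <instructions>You are in charge of ensuring data quality.
# Check the following data fields. They represent Birthday
# from our CRM of our German customers. Fix all fields if they show typos or other mistakes. Ensure your answers follow this pattern: DD.MM.YYYY.
# Return only a comma-separated list and keep the order of the input data.</instructions> <data>['6/06/2006', '22 12 1957']</data>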

The comma-separated list is a preparation for integrating the results into our existing DataFrame. To do this, I implemented the following code:

import json

def process_output(bedrock_response: dict, json_parsable: bool = False):
    # Bedrock returns a streaming body; the model's answer sits in content[0].text
    output = bedrock_response['body'].read()
    output = output.decode('utf-8')
    output = json.loads(output)['content'][0]['text']
    if json_parsable:
        # Return the parsed JSON directly instead of splitting the raw text
        return json.loads(output)
    return [elem.strip() for elem in output.split(",")]

It's important to mention that the result is returned as natural-language text. That's why I need to post-process the output a little to extract the individual elements. The processing is relatively straightforward: it splits on commas and removes the leading space in front of each value (e.g. " 05.12.1988").
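To see what the function does without calling AWS, we can emulate Bedrock's streaming body with io.BytesIO (a sketch with a hand-crafted payload):

import io
import json

mocked = {
    'body': io.BytesIO(json.dumps(
        {'content': [{'text': '06.06.2006, 22.12.1957, 23.03.1985'}]}
    ).encode('utf-8'))
}
print(process_output(bedrock_response=mocked))
# ['06.06.2006', '22.12.1957', '23.03.1985']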

Now that we have declared the input and output handling, we just need to invoke the endpoint:

def invoke_endpoint(bedrock_client, prompt: str, top_k: int, max_tokens: int, top_p: float, temperature: float):
    body = json.dumps(
        {
            'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': prompt}]}],
            'anthropic_version': 'bedrock-2023-05-31',
            'max_tokens': max_tokens,
            'temperature': temperature,
            'top_p': top_p,
            'top_k': top_k,
        }
    )
    modelId = 'anthropic.claude-3-haiku-20240307-v1:0'
    accept = "*/*"
    contentType = "application/json"
    response = bedrock_client.invoke_model(
        body=body,
        modelId=modelId,
        accept=accept,
        contentType=contentType,
    )
    return response

Et voilà, the final script:

# src/datacleaner.py
# snippet

import json
import logging

import boto3
import numpy as np
import pandas as pd
from botocore.config import Config

import dataproducer as dp  # provides the DataProducer class

logging.basicConfig(level=logging.INFO)

def invoke_endpoint(bedrock_client, prompt: str, top_k: int, max_tokens: int, top_p: float, temperature: float):
    body = json.dumps(
        {
            'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': prompt}]}],
            'anthropic_version': 'bedrock-2023-05-31',
            'max_tokens': max_tokens,
            'temperature': temperature,
            'top_p': top_p,
            'top_k': top_k,
        }
    )
    modelId = 'anthropic.claude-3-haiku-20240307-v1:0'
    accept = "*/*"
    contentType = "application/json"
    response = bedrock_client.invoke_model(
        body=body,
        modelId=modelId,
        accept=accept,
        contentType=contentType,
    )
    return response

def prompt_builder(field_name: str, pattern: str, data: list[str]):
    prompt = f"""<instructions>You are in charge of ensuring data quality. 
Check the following data fields. They represent {field_name} 
from our CRM of our German customers. Fix all fields if they show typos or other mistakes. Ensure your answers follow this pattern: {pattern}.
Return only a comma-separated list and keep the order of the input data.</instructions> <data>{data}</data>"""

    return prompt

def process_output(bedrock_response: dict, json_parsable: bool = False):
    output = bedrock_response['body'].read()
    output = output.decode('utf-8')
    output = json.loads(output)['content'][0]['text']
    if json_parsable:
        return json.loads(output)
    return [elem.strip() for elem in output.split(",")]

def gen_ai_processing(create_new_data: bool, aws_region: str, data_path: str):
    # The flag is named create_new_data so it does not shadow the create_data() function above
    config = Config(
        region_name=aws_region,
    )

    logging.info("Creating or reading data ...")
    if create_new_data:
        create_data()
        logging.info("Created data ...")
    # Read the CSV in both cases so df is always defined
    df = pd.read_csv(filepath_or_buffer=data_path)
    logging.info("Read data ...")

    bedrock_client = boto3.client("bedrock-runtime", config=config)

    logging.info("Running Bedrock inference for data cleansing")
    for entry in [
        {
            "FieldName": "Email",
            "Pattern": "first_name.last_name@company.com",
            "ColumnTypo": "EmailTypo",
            "ColumnCorrect": "EmailCorrect"
        },
        {
            "FieldName": "Birthday",
            "Pattern": "DD.MM.YYYY",
            "ColumnTypo": "BirthdayTypo",
            "ColumnCorrect": "BirthdayCorrect"
        }
    ]:
        prompt = prompt_builder(
            field_name=entry["FieldName"],
            pattern=entry["Pattern"],
            data=df[entry["ColumnTypo"]].tolist()
        )

        logging.info(f"Running Bedrock inference for {entry['FieldName']}")

        llm_res = invoke_endpoint(
            bedrock_client=bedrock_client,
            prompt=prompt,
            temperature=0.2, 
            top_k=100, 
            top_p=0.2, 
            max_tokens=1024
        )

        processed_output = process_output(bedrock_response=llm_res, json_parsable=False)

        df[f"{entry['FieldName']}Bedrock"] = processed_output

        comparison_column = np.where(df[f"{entry['FieldName']}Bedrock"] == df[entry["ColumnCorrect"]], True, False)
        df[f"CompareBedrockOrg{entry['FieldName']}"] = comparison_column

    df.to_csv(path_or_buf='data/bedrock_out.csv', index=False)  

if __name__ == "__main__":
    gen_ai_processing(
        create_new_data=False,
        aws_region="eu-central-1",
        data_path="data/out.csv",
    )

Analyzing the Results

Now it's time to analyze our dynamic RegEx implementation with GenAI. Since we know the ground truth, we can simply compare Bedrock's results against it. Initially, we have 100 entries each for birthdays and emails; the script introduced typos into 45 emails and 31 birthdays.
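For reference, the typo counts can be read directly from the ground-truth columns; a quick check against data/out.csv (a sketch, assuming the file generated above) looks like this:

import pandas as pd

df = pd.read_csv('data/out.csv')
print((df['EmailCorrect'] != df['EmailTypo']).sum())        # 45 in my run
print((df['BirthdayCorrect'] != df['BirthdayTypo']).sum())  # 31 in my run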

With Bedrock, all birthdays could be fixed:

INFO:root:   BirthdayCorrect BirthdayTypo BirthdayBedrock  CompareBedrockOrgBirthday
0       06.06.2006    6/06/2006      06.06.2006                       True
5       22.12.1957   22 12 1957      22.12.1957                       True
6       23.03.1985     23.03.85      23.03.1985                       True
11      22.07.2010   22.07.2010      22.07.2010                       True
14      28.03.1970   28/03/1970      28.03.1970                       True

Five emails, however, could not be fixed:

INFO:root:                EmailCorrect                 EmailTypo              EmailBedrock  CompareBedrockOrgEmail
0       tim.weber@sample.com       ti.weber@sample.com       ti.weber@sample.com                   False
5      marie.müller@test.com     marie.müller@test.com    marie.mueller@test.com                   False
6   sophie.hoffmann@demo.com   sophie.offmann@demo.com   sophie.offmann@demo.com                   False
11    hannah.müller@test.com   hannah.müller@testf.com   hannah.mueller@test.com                   False
14   paul.müller@example.com  payul.müller@example.com  paul.mueller@example.com                   False

However, if you look closer, Bedrock actually fixed 3 of the 5 correctly: email addresses cannot contain German special characters, so transliterating "müller" to "mueller" is arguably the right call, even though it differs from our ground truth. Only two names ("Tim" and "Hoffmann") could not be corrected.
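If you want to reproduce the comparison yourself, a short sketch over data/bedrock_out.csv (using the comparison columns generated above) could look like this:

import pandas as pd

df = pd.read_csv('data/bedrock_out.csv')
for field in ['Email', 'Birthday']:
    accuracy = df[f'CompareBedrockOrg{field}'].mean()
    print(f"{field}: {accuracy:.0%} exact matches with the ground truth")
# In my run: Email 95%, Birthday 100%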

Final Words

I was very impressed by this approach. However, I am fully aware of the limitations of this blog post. The dataset is small; in larger settings, chunking will be crucial, as Bedrock has strict limits for batch-style requests (see the sketch below). The examples are also relatively simple, and more complex data fields might show lower success rates. Reproducing the experiment five times yielded consistently identical results, but this stability still needs to be evaluated on larger datasets.
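As a starting point for larger datasets, chunking the input before prompting could look like this (a sketch, assuming the functions above; the chunk size would need tuning against the model's token limits):

def cleanse_in_chunks(bedrock_client, field_name: str, pattern: str, values: list[str], chunk_size: int = 50):
    # Send the values to Bedrock in slices so each prompt stays within token limits
    results = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        prompt = prompt_builder(field_name=field_name, pattern=pattern, data=chunk)
        response = invoke_endpoint(
            bedrock_client=bedrock_client,
            prompt=prompt,
            temperature=0.2,
            top_k=100,
            top_p=0.2,
            max_tokens=1024,
        )
        results.extend(process_output(bedrock_response=response))
    return results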

I hope you gained some insight into GenAI-based RegEx implementation for data cleansing. Feel free to reach out if you want to discuss; I will be happy to! :-)

Happy coding!
