
Jeffrey Ip for Confident AI

Posted on • Edited on • Originally published at confident-ai.com

Generating synthetic data with LLMs - Part 1

The ability to use AI to generate data out of thin air is one of those things that seem too good to be true — think about it, you can get your hands on quality data without needing to manually collect, clean, and annotate massive datasets.

But, as you might expect, synthetic data is not without its caveats😔. Although it is convenient, efficient, and cost-effective, the quality of synthetic data is only as good as the method used to generate it. Settle for rudimentary methods, and you’ll end up with unusable datasets that don’t represent real-world data well 🤯.

In this article, I’m going to share how we managed to generate realistic textual synthetic data at Confident AI. Let's dive right into it 😊.

What is synthetic data?

First and foremost, synthetic data is artificially generated data that attempts to simulate real-world data 🤖. Unlike real-world data, which is collected from observations or actual events (e.g., tweets on the platform formerly known as Twitter 🐦), synthetic data is made up, sometimes entirely, but more commonly based on a small subset of real-world data (a process also known as data augmentation).
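To make the distinction concrete, here’s a minimal, non-LLM sketch of data augmentation: start from a couple of real “seed” sentences and produce synthetic variants through word substitution. The seed sentences and substitution table here are made up purely for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Two real "seed" sentences (hypothetical examples)
seeds = [
    "The delivery arrived late.",
    "Customer support resolved my issue quickly.",
]

# Words we have known-safe alternatives for
substitutions = {
    "late": ["two days late", "after the promised date"],
    "quickly": ["within minutes", "on the first call"],
}

def augment(sentence):
    """Create a synthetic variant by swapping in random alternatives."""
    words = [rng.choice(substitutions.get(w, [w])) for w in sentence.rstrip(".").split()]
    return " ".join(words) + "."

# Turn 2 real sentences into 4 synthetic ones
synthetic = [augment(s) for s in seeds for _ in range(2)]
for s in synthetic:
    print(s)
```

Real augmentation pipelines are far more sophisticated (back-translation, paraphrasing models, and, as we’ll see, LLMs), but the principle is the same: a small amount of real data anchors a larger synthetic set.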

This kind of data is often used for testing, training, and validating machine learning models, especially in scenarios where real-world data is scarce or difficult to collect ✅.

The struggles in generating textual data

Historically, while the demand for synthetic data continued to rise steadily over the years, advancements in generation methods struggled to keep pace.

Methods available at the time were often simplistic, relying on basic statistical techniques, or else too domain-specific to generalize; either way, they lacked the complexity to mimic real-world data in a meaningful way ❌.

Let’s take Generative Adversarial Networks (GANs) as an example. GANs employed a novel architecture of two neural networks — a generator and a discriminator — that competed with each other. The competition between these two networks resulted in the generation of highly realistic and complex synthetic data.

However, as one might have guessed from the title of this article, there were still major drawbacks when leveraging GANs to generate textual data 😔.

To list a few, here are the major pitfalls that GANs suffer from:

  1. Mode Collapse: A phenomenon where the generator starts to produce the same output (or very similar outputs) over and over again.
  2. Difficult to train: GANs are notoriously hard to train, with issues like vanishing/exploding gradients and oscillations in loss.
  3. Long-Range Dependencies: Textual data often involve long-range dependencies (e.g., the subject of a sentence affecting a verb that appears much later), and capturing these effectively is a challenge even for advanced GAN architectures.
  4. Very Needy: They require lots of data to train on (ironically).

Needless to say, there are a lot of hurdles to overcome and consider when it comes to textual data. Let’s cut to the chase and see why you should use LLMs instead.

Generating Synthetic Data with LLMs

Love it or hate it, large language models (LLMs) like GPT-4 have democratized textual synthetic data. Let’s say I want to generate some questions related to the topic of synthetic data. All I have to do is ask ChatGPT, or use OpenAI’s API to generate a batch programmatically. For example, here’s how you can do it in Python (note: I’m using GPT-3.5):

import os
import openai  # note: this example uses the pre-v1.0 openai Python SDK

openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
    {
      "role": "user",
      "content": "Generate 5 questions one might have on synthetic data."
    }
  ],
  temperature=1,
  max_tokens=256
)

print(response.choices[0].message.content)

Here’s a sample output 👀:

1. What are the key advantages of using synthetic data over real-world data in machine learning models?

2. How is synthetic data generated and how closely can it mimic the characteristics of original datasets?

3. Is synthetic data reliable for training machine learning models in sensitive sectors like healthcare or finance?

4. What are the ethical considerations associated with using synthetic data, especially when it is used to replace or supplement personally identifiable information?

5. Can synthetic data be used to address the challenges of data imbalance in machine learning, and if so, how effective is it compared to traditional resampling techniques?

While the generated data is quite varied, it may not accurately reflect real-world conditions, making it less useful for certain applications. Fortunately, by carefully crafting the input prompts, we can improve the authenticity of the synthetic data.

Using Dynamic Prompt Templates to Make Synthetic Data Realistic

The pervasive problem with synthetic data generation is that there’s often a mismatch between the generative distribution and the distribution of real-world data. However, due to the versatile and adaptable nature of LLMs, we can easily ground generated data by dynamically changing the prompt (basically string interpolation!) 🥳. For example, you might want to wrap the OpenAI API call in a function instead and make it accept additional context as a parameter:

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

def generate_synthetic_data(context):
  prompt = f""" 
    Generate 5 questions one might have on synthetic data. In your questions, also take into account the context below.
    Context: {context}
  """

  response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
      {
        "role": "user",
        "content": prompt
      }
    ],
    temperature=1,
    max_tokens=256
  )

  return response.choices[0].message.content

print(generate_synthetic_data("Data Privacy is a huge concern for enterprises."))

Here’s a sample output (don’t forget we’re using GPT-3.5!):

1. How can synthetic data help enterprises address data privacy concerns while still maintaining the ability to perform data analytics and testing?

2. What are the key differences between real data and synthetic data in terms of their privacy implications for enterprises?

3. Are there any legal or regulatory considerations that enterprises should be aware of when using synthetic data to safeguard data privacy?

4. How can enterprises ensure that synthetic data accurately represents their real data while preserving the privacy of sensitive information?

5. What are the potential limitations or challenges that enterprises may face when implementing synthetic data solutions to protect data privacy, and how can they mitigate these challenges effectively?

As you can see, the output is much more relevant and a significant improvement over the earlier run without dynamic prompting.
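One practical note: the model returns the questions as a single numbered string, so before adding them to a dataset you’ll want to split them into individual items. Here’s a minimal sketch using Python’s re module; the sample_output string below is a shortened, hypothetical version of the output above.

```python
import re

# Raw model output: a numbered list returned as one string (shortened example)
sample_output = """1. How can synthetic data help enterprises address data privacy concerns?

2. What are the key differences between real data and synthetic data?

3. Are there any legal or regulatory considerations enterprises should be aware of?"""

# Split on leading "N." markers, one per line, and drop empty segments
questions = [
    q.strip()
    for q in re.split(r"^\s*\d+\.\s*", sample_output, flags=re.MULTILINE)
    if q.strip()
]

for q in questions:
    print(q)
```

Since we’re generating with temperature=1, outputs vary between runs, so don’t hard-code assumptions about the exact wording, only the numbered-list format (and even that is worth validating before parsing).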

Conclusion

In this article, we explored ways to contextualize synthetic data effectively. LLMs like GPT-3.5 offer a simple yet powerful way of generating data through careful prompt design.

Stay tuned for our Part 2 guide on diversifying your synthetic data set!

(Thanks for reading my first article! Follow me on Twitter to keep up with my journey building Confident AI: https://twitter.com/jeffr_yyy, and come give our GitHub repo a star ⭐!)
