adriens

🪄 Enhance/fix data quality w. openai's API 🦾

❔ About

🤔 Sometimes a lack of data, or data quality issues, prevents you from producing insights.

💡 What if you could call AI to the rescue to fix or enhance some of that data?

I first started with some prompt engineering in ChatGPT.

☝️ Notice

Guessing gender from first names can seem useless, a bit dumb, or nerdy. Yes, but...

  • 🗺️ This work relies on OpenAI, which acts as a language-agnostic first-name parser
  • 💡 This work is just an illustration of how prompt engineering and OpenAI's API can help review and fix many kinds of data quality issues, and a concrete example of how you might enrich your enterprise data pipeline

🎯 Target

The purpose of this article is to see how OpenAI's API can help on a very specific, testable dataset.

📝 Kaggle Notebook

In this short notebook I will:

  1. 📥 Download data
  2. 🐼 Load data in pandas
  3. 🦾 Call OpenAI's API to guess each first name's gender
  4. ⚖️ Compare guessed vs. real data
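
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names `firstname` and `gender`, the helper names, and the tiny in-memory sample are all assumptions, and the API call is replaced by a stand-in function so the sketch runs offline.

```python
from typing import Callable

import pandas as pd


def add_guesses(df: pd.DataFrame, guess: Callable[[str], str],
                name_col: str = "firstname") -> pd.DataFrame:
    """Return a copy of df with a 'guessed_gender' column.

    Each distinct name is guessed only once, which keeps the number
    of (paid) API calls as low as possible.
    """
    unique_names = df[name_col].dropna().unique()
    guesses = {name: guess(name) for name in unique_names}
    out = df.copy()
    out["guessed_gender"] = out[name_col].map(guesses)
    return out


def accuracy(df: pd.DataFrame, truth_col: str = "gender") -> float:
    """Share of rows where the guess matches the reference value."""
    return float((df["guessed_gender"] == df[truth_col]).mean())


# Toy sample standing in for the downloaded CSV (hypothetical columns).
sample = pd.DataFrame({
    "firstname": ["Marie", "Jean", "Marie", "Louis"],
    "gender": ["F", "M", "F", "M"],
})


def fake_guess(name: str) -> str:
    """Stand-in for the API call: a naive rule, for illustration only."""
    return "F" if name.endswith("e") else "M"


scored = add_guesses(sample, fake_guess)
print(f"accuracy: {accuracy(scored):.0%}")
```

With the real dataset, `sample` would typically come from `pd.read_csv(...)` on the downloaded file, and `fake_guess` would be swapped for a function that calls the API.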

🍿 Demo

🗃️ Input Dataset

I have used the top-10-prenoms-a-noumea-depuis-1860 open dataset from data.gouv.nc:

Top 10 des Prénoms à Nouméa depuis 1860 — Open Data NC

This dataset lists the ten most frequently given first names in Nouméa since 1860, according to the civil registry. Update frequency: annual

data.gouv.nc

🤖 The text-davinci-003 model

I used text-davinci-003, one of the GPT-3.5 models, as they can:

"understand and generate natural language or code."

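
As a rough sketch of what a call to this model could look like: the prompt wording, the `M`/`F`/`U` answer codes, and the helper names below are my own assumptions, not the prompt used in the notebook. The call assumes version 1 or later of the `openai` Python package and an `OPENAI_API_KEY` environment variable; the model name comes from the article and may no longer be served by the API.

```python
def build_prompt(firstname: str) -> str:
    """Build a short, constrained prompt (hypothetical wording)."""
    return (
        "Guess the gender usually associated with the following first name.\n"
        "Answer with a single letter: M, F, or U if unsure.\n"
        f"First name: {firstname}\n"
        "Answer:"
    )


def parse_answer(text: str) -> str:
    """Normalise the raw completion to 'M', 'F' or 'U' (unknown)."""
    cleaned = text.strip().upper()
    return cleaned[0] if cleaned and cleaned[0] in {"M", "F"} else "U"


def guess_gender(firstname: str, model: str = "text-davinci-003") -> str:
    """Ask the completions endpoint for a guess and normalise the reply."""
    # Imported lazily so the helpers above work without the package installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.completions.create(
        model=model,
        prompt=build_prompt(firstname),
        max_tokens=3,
        temperature=0,  # favour repeatable answers over creative ones
    )
    return parse_answer(response.choices[0].text)
```

Keeping the prompt builder and the answer parser separate from the network call makes both easy to test without spending any API credit.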

📊 Results 👏

☝️ Notice

Notice that I have put the guessed value in a dedicated structure, so we can easily flag it as AI-generated when reporting its metadata.
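
One possible shape for such a structure is sketched below. The field names (`value`, `source`, `model`), the `gender_enriched` key, and the helper are hypothetical; the notebook's actual layout may differ.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class EnrichedValue:
    """A value plus the metadata needed to tell where it came from."""
    value: Optional[str]
    source: str                  # "original" or "ai"
    model: Optional[str] = None  # only set for AI-generated values


def enrich_gender(record: dict, guess: Callable[[str], str], model: str) -> dict:
    """Keep the original gender when present, else store a flagged AI guess."""
    existing = record.get("gender")
    if existing:
        enriched = EnrichedValue(existing, "original")
    else:
        enriched = EnrichedValue(guess(record["firstname"]), "ai", model)
    # The source record is left untouched; the enriched value lives in its own key.
    return {**record, "gender_enriched": asdict(enriched)}
```

Because the guess never overwrites the source column, downstream reports can filter on `source` to include or exclude AI-generated values.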

💰 Gains

  • 📈 Data quality
  • 💡 Better decisions & opportunities
  • 💸 Makes the cost of poor data quality visible (API calls are not free)
  • 🧠 Create more intelligence

👨‍🔬 Further optimizations

  • Benchmark models to spend as little money as possible while getting the best possible results
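
Such a benchmark could be organised along these lines. This is only a sketch: the `Candidate` structure, the helper names, and the per-call costs are placeholders I made up, not real prices or real model results.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Candidate:
    name: str
    guess: Callable[[str], str]
    cost_per_call: float  # hypothetical figure, not a real price


def benchmark(candidates: Sequence[Candidate],
              labelled: Sequence[Tuple[str, str]]) -> list:
    """Score each candidate on (name, true_gender) pairs and total its cost."""
    rows = []
    for candidate in candidates:
        hits = sum(candidate.guess(name) == truth for name, truth in labelled)
        rows.append({
            "model": candidate.name,
            "accuracy": hits / len(labelled),
            "cost": candidate.cost_per_call * len(labelled),
        })
    return rows


def cheapest_good_enough(rows: list, min_accuracy: float) -> Optional[dict]:
    """Pick the cheapest candidate that reaches the accuracy threshold."""
    good = [row for row in rows if row["accuracy"] >= min_accuracy]
    return min(good, key=lambda row: row["cost"]) if good else None
```

In practice each `guess` would wrap a call to a different model, and `labelled` would be a sample of the dataset whose gender is already known.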

🔭 News & perspectives

Making the most of AI: The latest lessons from MIT Sloan Management Review | MIT Sloan

Knowing how to evaluate AI tools, manage data effectively, and share data strategically will help leaders see the results from their AI investments.

mitsloan.mit.edu