DEV Community: Saverio Bertocci

Stop using Regex for E-commerce scraping. I built an AI API that normalizes product data instantly.

Saverio Bertocci — Tue, 17 Mar 2026 13:32:21 +0000

If you've ever built a scraper, a dropshipping importer, or a PIM (Product Information Management) system, you know the absolute nightmare of dealing with unstructured product data.

You scrape a supplier's website expecting a clean table with sizes and colors, but instead, you get this raw text string:

"Nike Air Max mens sneakers size 42 blue synthetic material"

Or even worse, it mixes several languages:

"Zapatillas de running Nike Air Max uomo blu taglia 42"

The old way: The Regex Nightmare ❌
Historically, we had to write dozens of regular expressions to catch variations of "Size", "SZ", "Taglia", or map 50 different color names to a standard English list. One typo from the supplier, and the script breaks. Your Shopify catalog ends up with weird tags like Color: blu scuro impermeabile.
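
To make the fragility concrete, here is a minimal sketch of the keyword-based approach. The pattern and the `extractSize` helper are hypothetical, written only to illustrate the problem, and are not taken from a real importer.

```javascript
// Hypothetical keyword-based size extractor: every supplier spelling
// has to be listed by hand in the alternation.
const SIZE_PATTERN = /\b(?:size|sz|taglia)\s*[:.]?\s*(\d{2}(?:[.,]\d)?)\b/i;

function extractSize(text) {
  const match = SIZE_PATTERN.exec(text);
  return match ? match[1] : null;
}

console.log(extractSize("Nike Air Max mens sneakers size 42 blue synthetic material")); // "42"
console.log(extractSize("Zapatillas de running Nike Air Max uomo blu taglia 42")); // "42"

// One typo, or a keyword nobody listed, and the attribute is silently lost.
console.log(extractSize("Nike Air Max uomo blu tagila 42")); // null
console.log(extractSize("Nike Air Max hombre azul talla 42")); // null
```

Every new supplier tends to add another alternative to the pattern, and the same problem repeats for colors, materials, and gender.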

The new way: Structured AI Outputs ✅
I got tired of fixing broken parsers, so I built a dedicated backend using Node.js, Express, and GPT-4o-mini with strict JSON schemas.

Instead of searching for keywords, the LLM reads the context, translates everything to standard English, and maps it to specific e-commerce attributes.

If you send the messy text from above, the API returns this exact JSON structure:

```json
{
  "success": true,
  "data": {
    "brand": "Nike",
    "model": "Air Max",
    "category": "sneakers",
    "gender": "men",
    "size": "42",
    "color": "blue",
    "material": "synthetic",
    "pack_size": null,
    "normalized_title": "Nike Air Max sneakers men blue size 42"
  }
}
```

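
The schema and prompt behind the API are not shown in this post, so the following is only a sketch of what a strict schema for the `data` object could look like with OpenAI's Structured Outputs (`response_format` of type `json_schema`). The field names come from the sample response; the schema name, the field types, the system prompt, and the `buildRequestBody` helper are assumptions. The function only builds the request body and does not call the network.

```javascript
// Field names come from the sample response; the types, the schema name,
// and the system prompt are assumptions made for illustration.
const nullableString = { type: ["string", "null"] };

const productProperties = {
  brand: nullableString,
  model: nullableString,
  category: nullableString,
  gender: nullableString,
  size: nullableString,
  color: nullableString,
  material: nullableString,
  pack_size: nullableString,
  normalized_title: { type: "string" },
};

const productSchema = {
  type: "object",
  properties: productProperties,
  // Strict mode expects every property to be listed as required and no
  // extra keys; "optional" values are expressed as nullable instead.
  required: Object.keys(productProperties),
  additionalProperties: false,
};

// Builds a Chat Completions request body without sending it.
function buildRequestBody(rawText) {
  return {
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: "Extract product attributes and translate them to English.",
      },
      { role: "user", content: rawText },
    ],
    response_format: {
      type: "json_schema",
      json_schema: { name: "normalized_product", strict: true, schema: productSchema },
    },
  };
}

console.log(JSON.stringify(buildRequestBody("Nike Air Max uomo blu taglia 42"), null, 2));
```

Making missing attributes nullable rather than optional is what lets a field such as `pack_size` come back as `null` instead of being dropped from the response.
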
I wrapped it into a public API
Since building the prompt logic, handling LLM latency, and hosting the infrastructure takes a lot of time, I wrapped the whole logic into a plug-and-play API.

If you are building an automated Shopify importer, maintaining local SEO catalogs, or just cleaning up messy supplier CSVs with Python or Zapier, you can use it right now.

👉 Check out E-commerce Product Normalizer (AI) on RapidAPI

rapidapi.com

There is a free tier available (50 calls/month) so you can test it directly in the RapidAPI playground without any commitment.

I'd love to hear your feedback! How do you guys currently handle messy product feeds from clients or suppliers?

How to Extract Structured Contact Data from Messy Emails using AI (and Validate Italian VATs)

Saverio Bertocci — Mon, 16 Mar 2026 16:18:26 +0000

As developers, we’ve all been there: a client asks you to build a system to capture leads from incoming emails, WhatsApp messages, or a generic "Contact Us" text area.

You expect structured data, but what you actually get from users is this:

"Hi, I'm Mario Rossi from Milan. I need a quote. You can call me at 333 12 34 567. My company VAT is 12345678901. Thanks."

Good luck parsing that with Regex! 😅
Phone numbers have random spaces, names are mixed with cities, and validating the VAT number usually requires writing a custom Modulo 10 algorithm.

The Solution: AI + Mathematical Validation
I got tired of maintaining fragile regular expressions, so I decided to build a dedicated backend using Node.js, Express, and OpenAI's GPT-4o-mini.

The goal was simple: send raw text in, get guaranteed clean JSON out.

Instead of just relying on the LLM to guess if a VAT number is valid, I built a hybrid system:

1. The AI extracts the entities (Name, Phone, City, VAT, Intent).

2. The Node.js backend runs the VAT number through the official Modulo 10 checksum to verify that its check digit is mathematically consistent.

3. The phone number is automatically stripped of spaces and formatted with the international +39 prefix.
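
The phone clean-up step could look roughly like this. It is a minimal sketch, not the API's actual code: the `normalizeItalianPhone` name and the loose length check are assumptions, and it only handles numbers that are Italian or already carry a +39 or 0039 prefix.

```javascript
// Assumption: inputs are Italian numbers, optionally already prefixed
// with +39 or 0039. Anything else returns null for manual review.
function normalizeItalianPhone(raw) {
  // Keep only digits and plus signs (drops spaces, dots, dashes, brackets).
  let cleaned = raw.replace(/[^\d+]/g, "");
  if (cleaned.startsWith("0039")) cleaned = "+39" + cleaned.slice(4);
  if (!cleaned.startsWith("+")) cleaned = "+39" + cleaned;
  // Loose sanity check: +39 followed by 6 to 11 digits.
  return /^\+39\d{6,11}$/.test(cleaned) ? cleaned : null;
}

console.log(normalizeItalianPhone("333 12 34 567")); // "+393331234567"
console.log(normalizeItalianPhone("0039 333 1234567")); // "+393331234567"
console.log(normalizeItalianPhone("call me later")); // null
```

Returning `null` instead of a best guess keeps unparseable numbers out of the CRM, which matters more than squeezing a value out of every message.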

What the output looks like
If you send the messy text from the example above, the system returns this clean JSON:

```json
{
  "success": true,
  "extracted_data": {
    "person_name": "Mario Rossi",
    "city": "Milan",
    "phone": "+393331234567",
    "vat_number": "12345678901",
    "intent": "quote",
    "is_vat_valid": false
  }
}
```

(Notice how it automatically detected the VAT is fake because it failed the Modulo 10 math check!)
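
For the curious, the arithmetic behind that `false` can be reproduced in a few lines. This is a sketch of the Modulo 10 check-digit calculation, with `expectedCheckDigit` as a hypothetical helper name; it is not the API's source code.

```javascript
// Computes the check digit expected for the first ten digits of a
// Partita IVA, following the Luhn (Modulo 10) scheme.
function expectedCheckDigit(firstTen) {
  let sum = 0;
  for (let i = 0; i < 10; i++) {
    let d = Number(firstTen[i]);
    // Counting positions from 1, the even positions sit at odd indexes:
    // double them and subtract 9 when the result exceeds 9.
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return (10 - (sum % 10)) % 10;
}

// Odd positions:            1 + 3 + 5 + 7 + 9 = 25
// Even positions, doubled:  4 + 8 + 3 + 7 + 0 = 22
// Total 47, so the last digit should be (10 - 7) % 10 = 3, not 1.
console.log(expectedCheckDigit("1234567890")); // 3
```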

I made it available as an API
Since building the infrastructure, handling the OpenAI prompts for structured outputs, and hosting the server takes time, I wrapped the whole thing into a plug-and-play API.

If you are building a bot, automating leads with Zapier/n8n, or just handling messy inputs, you can use it right now.

👉 Smart Contact Extractor (Italian AI) on RapidAPI

https://rapidapi.com/x4v1er94/api/smart-contact-extractor-italian-ai

There is a free basic tier available, so you can test it directly in the RapidAPI playground without pulling out your credit card.

I also published a lighter, free-forever API just for strict validation (without the AI extraction part) if you already have structured forms: Italian Data Normalizer.

https://rapidapi.com/x4v1er94/api/italian-data-normalizer

Let me know what you think in the comments! How do you currently handle unstructured leads in your projects?

Why Regex Is Never Enough for Italian Forms (And How to Fix It with an API)

Saverio Bertocci — Mon, 16 Mar 2026 12:59:05 +0000

If you've ever built a checkout form or a CRM for the Italian market, you know the struggle.

You ask the user for a phone number, an address, or a VAT Number (Partita IVA), and you get a wild mix of formats. People write "v.le" instead of "Viale", add random spaces in their phone numbers, and type 10 digits for a VAT number instead of 11.

The standard developer reaction is to write a massive Regex. But here is the problem: Regex is not enough.

The "Modulo 10" Problem

For example, the Italian VAT Number (Partita IVA) is 11 digits long. A simple /^[0-9]{11}$/ regex will let any string of 11 digits pass.
However, the Italian Revenue Agency uses the Luhn Algorithm (Modulo 10) to validate VAT numbers. The 11th digit is actually a check digit calculated mathematically from the first 10.

If you don't validate it mathematically, your database will be filled with fake or mistyped VAT numbers.
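
A minimal sketch of that check in plain JavaScript, combining the format regex with the Luhn sum over all 11 digits. The `isValidPartitaIva` name is hypothetical, and passing the checksum only means the number is well formed, not that it belongs to a registered company.

```javascript
// Format check plus Luhn (Modulo 10) checksum for an 11-digit Partita IVA.
// Sketch only: it does not confirm that the number is actually registered.
function isValidPartitaIva(value) {
  if (!/^[0-9]{11}$/.test(value)) return false;
  let sum = 0;
  for (let i = 0; i < 11; i++) {
    let d = value.charCodeAt(i) - 48; // "0" is char code 48
    // Double every second digit (2nd, 4th, ... 10th), subtracting 9
    // when the doubled value exceeds 9.
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  // A valid number makes the total a multiple of 10.
  return sum % 10 === 0;
}

console.log(isValidPartitaIva("12345678903")); // true
console.log(isValidPartitaIva("12345678901")); // false (wrong check digit)
console.log(isValidPartitaIva("1234567890")); // false (only 10 digits)
```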

The Solution: Offload the dirty work

I got tired of copy-pasting the Modulo 10 algorithm and address-cleaning functions into every new Node.js project. So, over a weekend, I decided to package all these rules into a single micro-service.

I built the Italian Data Normalizer API.

It takes messy inputs like this:


```json
{
  "street": "v.le trastevere 10",
  "city": "ROMA",
  "province": "rm",
  "zip": "153"
}
```

And returns beautifully formatted data, calculating the Modulo 10 for VATs and cleaning the strings:

```json
{
  "street": "Viale Trastevere, 10",
  "city": "Roma",
  "province": "RM",
  "zip": "00153"
}
```
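
To give an idea of the kind of rules involved, here is a rough sketch that reproduces the transformation above. The abbreviation table, the helper names, and the house-number heuristic are all assumptions for illustration; the API's real rule set is not documented in this post and is likely broader.

```javascript
// Hypothetical rules for illustration only.
const STREET_ABBREVIATIONS = new Map([
  ["v.", "Via"],
  ["v.le", "Viale"],
  ["p.za", "Piazza"],
  ["p.zza", "Piazza"],
  ["c.so", "Corso"],
]);

// Lowercases the text, then capitalizes the first letter of each word.
function titleCase(text) {
  return text
    .toLowerCase()
    .replace(/(^|\s)(\p{L})/gu, (_, space, letter) => space + letter.toUpperCase());
}

function normalizeAddress({ street, city, province, zip }) {
  const words = street.trim().split(/\s+/);
  const expanded = STREET_ABBREVIATIONS.get(words[0].toLowerCase());
  if (expanded) words[0] = expanded;

  // Separate a trailing house number (e.g. "10" or "10/A") with a comma.
  let houseNumber = "";
  if (words.length > 1 && /^\d+[\/\w]*$/.test(words[words.length - 1])) {
    houseNumber = ", " + words.pop().toUpperCase();
  }

  return {
    street: titleCase(words.join(" ")) + houseNumber,
    city: titleCase(city.trim()),
    province: province.trim().toUpperCase(),
    // Italian postal codes have 5 digits; restore dropped leading zeros.
    zip: zip.trim().padStart(5, "0"),
  };
}

console.log(
  normalizeAddress({ street: "v.le trastevere 10", city: "ROMA", province: "rm", zip: "153" })
);
```

The zip padding is the easiest rule to overlook: spreadsheets often store postal codes as numbers, so Roman codes such as 00153 lose their leading zeros before they ever reach your form handler.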

Try it for free
Instead of keeping it private, I published it on RapidAPI. There is a free tier (100 requests/month) which is more than enough for testing or small projects.

You don't even have to write the fetch requests yourself. I made a tiny open-source wrapper in JavaScript.

Check out the wrapper on GitHub and grab your free API key from the README:
👉 https://github.com/x4v1er94/italian-data-utils-js.git

I'd love to hear your feedback. Try to break it with weird inputs and let me know if I missed any edge cases!