Ozhaya

Posted on May 19 • Edited on May 20

I Built an ML-Powered Email Validation API

#api #ai #python #machinelearning

I built an ML model using XGBoost to catch auto-generated disposable emails when blacklists can't keep up. Most validators rely on MX records, SMTP checks, or blacklists, but disposable emails have real mailboxes, so MX and SMTP checks return valid. That's why I added an ML model that estimates the risk of accepting an email based on its username and domain.

Compare these two emails:

john.doe@gmail.com

and

r9lo6tngee825@yzcalo.com

You can immediately tell the second one is fake, but why exactly? Is it the numbers, the consonant/vowel ratio, the length? We don't need to know the exact rules. We can train an XGBoost model on labelled data to figure it out, using features like digit count, length, and consonant/vowel ratio to predict whether a username or domain is legitimate.
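As a rough illustration of the features mentioned above, here is a minimal sketch of character-level feature extraction. The function name extract_features and the exact feature set are assumptions made for this example, not the API's actual feature code.

```python
VOWELS = set("aeiou")


def extract_features(text: str) -> dict:
    """Compute simple character-level features for a username or domain name."""
    lowered = text.lower()
    length = len(lowered)
    digits = sum(ch.isdigit() for ch in lowered)
    vowels = sum(ch in VOWELS for ch in lowered)
    # Any letter that is not a vowel counts as a consonant; dots, dashes
    # and digits are ignored here.
    consonants = sum(ch.isalpha() and ch not in VOWELS for ch in lowered)
    return {
        "length": length,
        "digit_count": digits,
        "digit_ratio": digits / length if length else 0.0,
        "vowel_count": vowels,
        "consonant_count": consonants,
        # Fall back to the raw consonant count when there are no vowels,
        # to avoid dividing by zero.
        "consonant_vowel_ratio": consonants / vowels if vowels else float(consonants),
    }


human = extract_features("john.doe")
generated = extract_features("r9lo6tngee825")
print(human["digit_count"], generated["digit_count"])
```

Each dictionary could become one row of a pandas DataFrame, with the label (legitimate or disposable) as the target column.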

Under the hood, the API combines MX records, blacklist checking, role detection, syntax validation, and new domain detection for basic coverage. On top of that, ML-powered scoring handles what static methods miss. It also supports batch validation of up to 30 emails per request. SMTP validation is intentionally excluded as disposable emails have real mailboxes so it returns valid anyway, and it adds significant latency for no benefit in this use case.

For the ML side, I used pandas for feature engineering, an 80/20 train-test split via sklearn's train_test_split, and XGBoost as the classifier. Features include pairs of vowels, consonants, digits and ratios for each. One feature that greatly increased the accuracy of the username score was determining whether the username contained a name. For example,

john.doe@gmail.com

contains john and doe, which makes it automatically more likely to be safe than

r9lo6tngee825@yzcalo.com

even if we disregarded the domains.
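The post doesn't say how names are detected, so the following is only a sketch of one possible approach: split the username into alphabetic tokens and look them up in a name list. KNOWN_NAMES and contains_name are hypothetical, and the tiny set here stands in for a much larger list.

```python
import re

# Hypothetical, deliberately tiny name list for illustration only.
KNOWN_NAMES = {"john", "doe", "jane", "smith", "maria"}


def contains_name(username: str, names: set = KNOWN_NAMES) -> bool:
    """Return True if any alphabetic token of the username is a known name."""
    # Split on anything that is not a lowercase letter (dots, digits, dashes).
    tokens = [t for t in re.split(r"[^a-z]+", username.lower()) if t]
    return any(token in names for token in tokens)


print(contains_name("john.doe"))       # True
print(contains_name("r9lo6tngee825"))  # False
```

This matches whole tokens only, so a username like johndoe with no separator would be missed. Substring matching would catch it, at the cost of more accidental hits on short names.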

Here's the API response for the fake email:

{
  "email": "r9lo6tngee825@homvela.com",
  "valid_email_structure": true,
  "is_role": false,
  "mx_records": true,
  "not_disposable": true,
  "new_domain": false,
  "domain_risk": 0.23987430334091187,
  "name_risk": 0.9953031539916992,
  "valid_email": true
}

This response shows us that even though the email is disposable, the traditional methods failed to detect it. However, name_risk is incredibly high, showing us that the email is very risky to accept. Interestingly, despite the domain being fake, domain_risk remains low - this is because the model is trained on patterns in the domain name itself, which doesn't follow the same conventional patterns as usernames.

Contrast this with a real email:

{
  "email": "john.doe@gmail.com",
  "valid_email_structure": true,
  "is_role": false,
  "mx_records": true,
  "not_disposable": true,
  "new_domain": false,
  "domain_risk": 0.221174955368042,
  "name_risk": 0.00679133040830493,
  "valid_email": true
}

This shows us that the email is safe to accept, as both the traditional and ML methods consider it safe. It's also worth noting that the domain_risk for gmail.com is similar to the fake domain's, which shows us that domain_risk alone isn't reliable except in special cases of unconventionally named domains.

This is why combining all fields, rather than relying on any single one, gives you the most accurate result.
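One way to combine the fields is a small decision function over the response. This is a sketch, not the API's own logic: the function name assess and the three outcomes are assumptions, and the 0.7 threshold is borrowed from the quick start example further down.

```python
def assess(response: dict, name_risk_threshold: float = 0.7) -> str:
    """Turn an API response dict into 'reject', 'review' or 'accept'."""
    # Hard failures from the traditional checks.
    if not response["valid_email"] or not response["not_disposable"]:
        return "reject"
    # Soft signals: a high ML risk score or a newly registered domain.
    if response["name_risk"] > name_risk_threshold or response["new_domain"]:
        return "review"
    return "accept"


fake = {
    "email": "r9lo6tngee825@homvela.com",
    "valid_email_structure": True,
    "is_role": False,
    "mx_records": True,
    "not_disposable": True,
    "new_domain": False,
    "domain_risk": 0.23987430334091187,
    "name_risk": 0.9953031539916992,
    "valid_email": True,
}
print(assess(fake))  # review
```

domain_risk is left out of the decision here because, as the two responses above show, it barely separates the fake domain from gmail.com.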

Limitations
Although I trained a model for domain_risk, its usefulness is limited because domain names offer few distinguishing patterns. Unlike usernames, which often follow recognisable human patterns like names or words, auto-generated domains can look surprisingly normal compared to genuine ones, making it much harder for the model to distinguish between real and fake. A prime example of this is visible in the responses above, where the fake domain scores about 0.24 and gmail.com about 0.22, a negligible difference.

Another takeaway was that blacklists are less effective than I initially expected. While they work well for well-known disposable providers, the sheer volume of auto-generated domains makes it nearly impossible to maintain a comprehensive and up-to-date list. I still kept them, as a blacklist takes few resources to implement and adds negligible latency.

The reason I kept traditional methods is that ML is still just prediction; it can produce false positives (real email flagged as risky) and false negatives (disposable email missed). Combining both methods reduces the impact of either failure mode.
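To make the two failure modes concrete, here is a small sketch that counts them at a given threshold. The function count_errors is hypothetical and the scored pairs are toy values made up for illustration, not results from the model.

```python
from typing import List, Tuple


def count_errors(scored: List[Tuple[float, bool]], threshold: float = 0.7) -> Tuple[int, int]:
    """Count (false_positives, false_negatives) for (risk_score, is_disposable) pairs."""
    # False positive: a real email whose score is above the threshold.
    false_positives = sum(1 for score, disposable in scored if score > threshold and not disposable)
    # False negative: a disposable email whose score is at or below the threshold.
    false_negatives = sum(1 for score, disposable in scored if score <= threshold and disposable)
    return false_positives, false_negatives


# Toy data: (risk score, whether the email is actually disposable).
toy = [(0.9, True), (0.8, False), (0.2, True), (0.1, False)]
print(count_errors(toy))
```

Raising the threshold trades false positives for false negatives and lowering it does the reverse, which is why a second, independent signal such as a blacklist helps.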

Try It Yourself
The API is available on RapidAPI with a free tier of 100 requests per month. If you're building a signup form, cleaning an email list, or trying to prevent fraud, give it a try and let me know what you think.

rapidapi.com

Note: the first request may be slow due to a cold start on the free server - subsequent requests will be faster.

Check out my interactive Swagger UI documentation. You can see every endpoint, all parameters, and example responses in one place.

Python Library
Installation:

pip install identify-fake-email

Quick Start:

from identify_fake_email.client import EmailValidator

client = EmailValidator("YOUR_RAPIDAPI_KEY")

result = client.validate("user@example.com")

if result.name_risk > 0.7 or not result.valid_email:
    print("Suspicious - review")
else:
    print("Safe to accept")

For bulk validation:

emails = ["user1@gmail.com", "user2@gmail.com"]
results = client.validate_bulk(emails)

for result in results:
    if result.name_risk > 0.7 or not result.valid_email:
        print("Suspicious - review")
    else:
        print("Safe to accept")
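Since batch validation is limited to 30 emails per request, longer lists need to be split first. Here is a minimal sketch of a chunking helper. chunk_emails is a hypothetical name, and whether validate_bulk already splits large lists internally is not something the post states.

```python
from typing import Iterator, List

MAX_BATCH_SIZE = 30  # per-request limit mentioned earlier in the post


def chunk_emails(emails: List[str], size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` emails."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(emails), size):
        yield emails[start:start + size]


all_emails = [f"user{i}@example.com" for i in range(65)]
print([len(batch) for batch in chunk_emails(all_emails)])  # [30, 30, 5]
```

Each batch could then be passed to client.validate_bulk in turn, collecting the results into one list.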

Also covered this on Medium with a broader overview.
