Last month, I launched my first profitable data API after getting frustrated with the existing email extraction tools. Here's the complete breakdown of how I went from idea to $500 MRR in 30 days.
## The problem with DIY scrapers
I was building a lead generation tool for my consulting business and needed to extract emails from company websites. The existing solutions were either:
- Too expensive ($200+/month for basic plans)
- Unreliable (failed on modern SPAs)
- Limited (couldn't handle JavaScript-heavy sites)
- Slow (30+ second response times)
After spending two weeks wrestling with Beautiful Soup and Puppeteer, I realized I was solving a problem many developers face. That's when I decided to build a robust API and monetize it.
## Architecture: Express → Railway → RapidAPI
Here's my tech stack:
The `package.json` dependencies:

```json
{
  "express": "^4.18.2",
  "puppeteer": "^21.5.0",
  "cheerio": "^1.0.0-rc.12",
  "validator": "^13.11.0",
  "rate-limiter-flexible": "^3.0.4",
  "helmet": "^7.1.0"
}
```
Why Express? Fast development, huge ecosystem, and easy deployment.
Why Railway? One-click deployments, automatic SSL, and affordable scaling. Their free tier was perfect for testing.
Why RapidAPI? Built-in marketplace, handles billing/auth, and gives you instant credibility.
The architecture is straightforward:
- RapidAPI forwards requests to my Railway-hosted Express server
- Puppeteer launches a headless Chrome instance
- Page content gets parsed with Cheerio
- Emails are extracted, validated, and returned as JSON
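That flow can be sketched end to end with the browser step stubbed out. To be clear, `handleExtract` and `fetchText` are names I'm using for illustration here, not the real codebase — in production the fetch step is Puppeteer:

```javascript
// Minimal sketch of the request pipeline: validate → fetch → parse → respond.
// `fetchText` stands in for the Puppeteer step so the flow is easy to follow.
const EMAIL_RE = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;

async function handleExtract(url, fetchText) {
  // 1. Validate input before doing any expensive work
  let parsed;
  try {
    parsed = new URL(url);
  } catch {
    return { status: 400, body: { error: 'Valid URL required' } };
  }

  // 2. Fetch the rendered page text (headless Chrome in production)
  const text = await fetchText(parsed.href);

  // 3. Extract and dedupe email addresses
  const emails = [...new Set(text.match(EMAIL_RE) || [])];

  // 4. Shape the JSON response
  return { status: 200, body: { url: parsed.href, emails, count: emails.length } };
}
```

Everything besides step 2 is cheap, synchronous logic, which keeps the expensive browser work isolated and swappable.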
## One real endpoint walkthrough with code
Let me show you the core `/extract` endpoint:

```javascript
app.post('/extract', async (req, res) => {
  const { url, deep_scan = false } = req.body;

  // Validation
  if (!url || !validator.isURL(url)) {
    return res.status(400).json({ error: 'Valid URL required' });
  }

  let browser;
  try {
    browser = await puppeteer.launch({
      headless: 'new',
      args: ['--no-sandbox', '--disable-dev-shm-usage']
    });
    const page = await browser.newPage();
    await page.setUserAgent('Mozilla/5.0 (compatible; EmailExtractor/1.0)');

    // Navigate with timeout
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 15000 });

    // Extract emails from current page
    let emails = await extractEmailsFromPage(page);

    // Deep scan: follow internal links
    if (deep_scan && emails.length < 3) {
      // Compute the hostname in Node and pass it in: page variables
      // aren't available inside the browser-side $$eval callback
      const hostname = new URL(page.url()).hostname;
      const links = await page.$$eval('a[href]', (anchors, host) =>
        anchors
          .map(a => a.href)
          .filter(href => href.includes(host))
          .slice(0, 5), // Limit to prevent abuse
        hostname
      );

      for (const link of links) {
        try {
          await page.goto(link, { waitUntil: 'domcontentloaded', timeout: 8000 });
          const pageEmails = await extractEmailsFromPage(page);
          emails = [...new Set([...emails, ...pageEmails])]; // Dedupe
        } catch (err) {
          // Silently continue if individual pages fail
        }
      }
    }

    res.json({
      url,
      emails: emails.slice(0, 50), // Limit results
      count: emails.length,
      deep_scan,
      processed_at: new Date().toISOString()
    });
  } catch (error) {
    console.error('Extraction failed:', error);
    res.status(500).json({
      error: 'Failed to extract emails',
      message: error.message
    });
  } finally {
    if (browser) await browser.close();
  }
});
```
```javascript
async function extractEmailsFromPage(page) {
  return await page.evaluate(() => {
    const emailRegex = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;
    const text = document.body.innerText;
    const emails = text.match(emailRegex) || [];

    // Filter out common false positives
    return emails.filter(email =>
      !email.includes('example.com') &&
      !email.includes('test.com') &&
      !email.includes('placeholder')
    );
  });
}
```
The key insights here:
- Always validate inputs first
- Use proper timeout handling for web scraping
- Implement result limits to prevent abuse
- Clean up resources (browser instances) in `finally` blocks
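The false-positive filter above is purely string-based. A slightly stricter post-filter — this is a sketch with blocklists I'm making up, not the production code — also catches image filenames like `logo@2x.png`, which match the email regex but aren't addresses:

```javascript
// Stricter post-filter for scraped email candidates (illustrative sketch).
const BLOCKED_DOMAINS = ['example.com', 'test.com', 'yourdomain.com'];
const IMAGE_EXTENSIONS = ['.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp'];

function cleanEmails(candidates) {
  const seen = new Set();
  const result = [];
  for (const raw of candidates) {
    const email = raw.toLowerCase();           // normalize case before deduping
    const domain = email.split('@')[1] || '';
    if (seen.has(email)) continue;                                   // duplicate
    if (BLOCKED_DOMAINS.includes(domain)) continue;                  // placeholder domain
    if (IMAGE_EXTENSIONS.some(ext => email.endsWith(ext))) continue; // "logo@2x.png"-style match
    seen.add(email);
    result.push(email);
  }
  return result;
}
```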
## Pricing strategy for API monetization
I studied competitor pricing and went with a freemium model:
- Free: 50 requests/month
- Basic ($9/month): 1,000 requests + deep scan
- Pro ($29/month): 5,000 requests + priority support
- Enterprise ($99/month): 25,000 requests + custom features
The sweet spot was the Pro plan. Most developers need more than 1,000 requests but don't want to pay enterprise pricing.
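RapidAPI enforces quotas at the gateway, but it helps to keep the tiers in one internal table for dashboards and sanity checks. A sketch (the field names are mine; the free tier's deep-scan flag is an assumption):

```javascript
// Plan quotas mirroring the tiers above (sketch; RapidAPI does the actual
// metering, so this table only backs internal checks and reporting).
const PLANS = {
  free:       { price: 0,  monthlyRequests: 50,    deepScan: false },
  basic:      { price: 9,  monthlyRequests: 1000,  deepScan: true  },
  pro:        { price: 29, monthlyRequests: 5000,  deepScan: true  },
  enterprise: { price: 99, monthlyRequests: 25000, deepScan: true  },
};

function quotaRemaining(plan, usedThisMonth) {
  const tier = PLANS[plan];
  if (!tier) throw new Error(`Unknown plan: ${plan}`);
  return Math.max(0, tier.monthlyRequests - usedThisMonth);
}
```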
Pricing lessons:
- Start higher than you think (I initially priced Basic at $5)
- Offer clear value jumps between tiers
- Include one "expensive" tier to make others look reasonable
## Lessons learned

### 1. Monitoring is everything
I use Railway's built-in metrics plus custom logging. Memory leaks from unclosed browser instances killed my server twice in the first week.
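The kind of check that catches those leaks early is a periodic health snapshot — this is a sketch of the idea, not the production logging:

```javascript
// Periodic health snapshot (sketch): heap/RSS in MB plus a counter of live
// Puppeteer browsers, so a leak shows up in the logs before the server dies.
let openBrowsers = 0; // increment after puppeteer.launch(), decrement after close()

function healthSnapshot() {
  const mem = process.memoryUsage();
  return {
    rssMb: Math.round(mem.rss / 1024 / 1024),
    heapUsedMb: Math.round(mem.heapUsed / 1024 / 1024),
    openBrowsers,
    at: new Date().toISOString(),
  };
}

// e.g. setInterval(() => console.log(JSON.stringify(healthSnapshot())), 60_000);
```

If `openBrowsers` trends upward while traffic is flat, a code path is skipping the `finally` cleanup.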
### 2. Rate limiting saves money
Without proper limits, one user can spin up dozens of Puppeteer instances and crash your server:
```javascript
const { RateLimiterMemory } = require('rate-limiter-flexible');

// 10 requests per 60 seconds, keyed by RapidAPI user (falling back to IP)
const rateLimiter = new RateLimiterMemory({ points: 10, duration: 60 });

app.use((req, res, next) => {
  rateLimiter.consume(req.headers['x-rapidapi-user'] || req.ip)
    .then(() => next())
    .catch(() => res.status(429).json({ error: 'Too many requests' }));
});
```
### 3. Error handling is user experience
Return meaningful error messages. "Internal server error" helps nobody.
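One pattern that helps is mapping known scraping failures to specific statuses and actionable messages. A sketch — the message strings and status choices are illustrative, not the API's exact responses:

```javascript
// Map known failure modes to useful responses (sketch). Navigation timeouts
// and DNS failures are the two most common errors Puppeteer surfaces here;
// anything unrecognized still falls through to a generic 500.
function toErrorResponse(error) {
  const msg = String((error && error.message) || error);
  if (/timeout/i.test(msg)) {
    return { status: 504, body: { error: 'Target site took too long to respond. Try again, or disable deep_scan.' } };
  }
  if (/net::ERR_NAME_NOT_RESOLVED/.test(msg)) {
    return { status: 422, body: { error: 'Domain could not be resolved. Check the URL.' } };
  }
  return { status: 500, body: { error: 'Failed to extract emails', message: msg } };
}
```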
### 4. Documentation drives adoption
I spent 40% of my time writing clear API docs with examples. It shows in the conversion rate.
## What's next?
The API is profitable and growing. Next features:
- Email validation/verification
- Social media profile extraction
- Webhook support for async processing
Building APIs taught me that solving developer problems can be incredibly rewarding—both personally and financially.
Want to try it out? Check out the Email Extractor API and let me know what you think. I'm always looking for feedback from fellow developers!
What APIs are you building? Drop a comment—I'd love to hear about your projects.