Most influencer discovery tools charge $200-500/month. I built one that is cheap to run and finds real influencer profiles with names, follower counts, bios, and emails across Instagram, TikTok, and YouTube.
Here's exactly how it works, what broke along the way, and the architecture that finally made it reliable.
The Problem
A brand asked me to find 50 fitness micro-influencers on Instagram with contact info. The options were:
- Upfluence: $478/month minimum
- Modash: $299/month
- Manual research: 3 hours on Instagram, copy-pasting into a spreadsheet
I figured I could automate this for pennies.
The Architecture (What Actually Works)
After three failed approaches, here's what stuck:
Google SERP search (via Apify GOOGLE_SERP proxy)
-> Extract social profile URLs from search results
-> HTTP fetch each profile (via Apify residential proxy + Googlebot UA)
-> Parse OG meta tags for real names, follower counts, bios
-> Output structured data
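The second step of that pipeline (pulling profile URLs out of search results) can be sketched roughly like this. It is a minimal illustration, not the code from the repo: the regex patterns, the `NON_PROFILE` set, and the `extract_profiles` name are all assumptions made for the example.

```python
import re

# Hypothetical patterns, one per platform; each captures the handle.
PROFILE_PATTERNS = {
    "instagram": re.compile(r"https?://(?:www\.)?instagram\.com/([A-Za-z0-9_.]+)/?"),
    "tiktok": re.compile(r"https?://(?:www\.)?tiktok\.com/@([A-Za-z0-9_.]+)"),
    "youtube": re.compile(r"https?://(?:www\.)?youtube\.com/@([A-Za-z0-9_.\-]+)"),
}

# Instagram path segments that are posts or app pages, not profiles (assumed list).
NON_PROFILE = {"p", "reel", "reels", "explore", "stories", "tv", "accounts"}


def extract_profiles(text):
    """Return unique {platform, username} dicts found in raw SERP text."""
    found = []
    seen = set()
    for platform, pattern in PROFILE_PATTERNS.items():
        for match in pattern.finditer(text):
            # Drop a trailing period picked up from sentence punctuation.
            handle = match.group(1).rstrip(".")
            if platform == "instagram" and handle.lower() in NON_PROFILE:
                continue
            key = (platform, handle.lower())
            if key not in seen:
                seen.add(key)
                found.append({"platform": platform, "username": handle})
    return found
```

Deduplicating on a lowercased handle matters because the same profile tends to show up in several search results.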
The key insight: you don't need to render Instagram pages in a browser. Instagram serves complete Open Graph meta tags to Googlebot. A simple HTTP GET with the right User-Agent through a residential proxy returns everything you need.
For example, fetching https://www.instagram.com/kayla_itsines/ with a Googlebot header returns:
og:title: "KAYLA ITSINES (@kayla_itsines). Instagram photos and videos"
og:description: "16M Followers, 845 Following, 8,977 Posts"
Real name, follower count, post count. No browser. No login. No CAPTCHA.
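Here is a rough sketch of the parsing step using only the Python standard library. It assumes the og:title and og:description strings follow the shape shown above; the class name, regexes, and output keys are illustrative, not taken from the actual actor.

```python
import re
from html.parser import HTMLParser


class OGParser(HTMLParser):
    """Collects <meta property="og:..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and attr.get("content") is not None:
            # Keep the first occurrence of each property.
            self.og.setdefault(prop, attr["content"])


# Assumed shapes, based on the example above.
TITLE_RE = re.compile(r"^(?P<name>.*?)\s*\(@(?P<handle>[^)]+)\)")
COUNTS_RE = re.compile(
    r"(?P<followers>[\d.,]+[KMB]?) Followers, "
    r"(?P<following>[\d.,]+[KMB]?) Following, "
    r"(?P<posts>[\d.,]+[KMB]?) Posts"
)


def parse_profile(html):
    """Extract display name, handle, and counts from a profile page's OG tags."""
    parser = OGParser()
    parser.feed(html)
    title = TITLE_RE.search(parser.og.get("og:title", ""))
    counts = COUNTS_RE.search(parser.og.get("og:description", ""))
    return {
        "displayName": title.group("name") if title else None,
        "username": title.group("handle") if title else None,
        "followers": counts.group("followers") if counts else None,
        "posts": counts.group("posts") if counts else None,
    }
```

If Instagram changes the wording of either tag, the regexes simply stop matching and the fields come back as `None`, which is a convenient signal to fall back to another extraction path.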
What Broke (And How I Fixed It)
Attempt 1: Puppeteer + Apify Proxy
Used PuppeteerCrawler to search Google and visit profiles. Google CAPTCHA'd me. Instagram detected headless Chrome. Got 0 results.
Attempt 2: crawl4ai on VPS (direct IP)
Deployed crawl4ai (real Chromium) on a cheap Contabo VPS. Worked for normal sites but Google and Instagram both blocked the datacenter IP. 0 results again.
Attempt 3: crawl4ai + Apify proxy pipeline
The fix: route crawl4ai's browser traffic through Apify's proxy pool.
- Google searches go through GOOGLE_SERP proxy group (designed for Google)
- Instagram profile fetches go through RESIDENTIAL proxy group (residential IPs)
- Use a lightweight HTTP fetch endpoint (no browser needed for profile pages)
This is what finally worked consistently.
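The routing rule can be sketched like this. Treat it as an illustration under assumptions: the proxy URL shape (`groups-<GROUP>` as the username on `proxy.apify.com:8000`) is how I understand Apify's proxy connection settings, so verify it against your own account, and the function names are made up for the example.

```python
from urllib.parse import urlparse

# Googlebot's published desktop User-Agent string.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def proxy_group_for(url):
    """Pick the proxy group: Google searches get GOOGLE_SERP, everything else RESIDENTIAL."""
    host = urlparse(url).hostname or ""
    if host == "google.com" or host.endswith(".google.com"):
        return "GOOGLE_SERP"
    return "RESIDENTIAL"


def build_request(url, proxy_password):
    """Return the proxies and headers to use for one HTTP fetch."""
    group = proxy_group_for(url)
    # Assumed Apify proxy URL format; check it against your proxy settings page.
    proxy = f"http://groups-{group}:{proxy_password}@proxy.apify.com:8000"
    headers = {}
    if group == "RESIDENTIAL":
        # Profile pages are fetched as Googlebot to get the OG meta tags.
        headers["User-Agent"] = GOOGLEBOT_UA
    return {"url": url, "proxies": {"http": proxy, "https": proxy}, "headers": headers}
```

The returned dict maps directly onto an HTTP client call, for example `requests.get(r["url"], proxies=r["proxies"], headers=r["headers"], timeout=30)`. As far as I know, the SERP proxy expects plain `http://` Google URLs, so check that before pointing it at `https://`.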
The Gemma 4 Enhancement
The VPS also runs Google's Gemma 4 (2B parameter model) via Ollama. When the regex-based profile extraction from SERP results misses something, Gemma acts as an intelligent fallback:
"Given these Google search results, extract all Instagram profile URLs,
usernames, display names, and follower counts. Return JSON."
With think: false (disabling chain-of-thought reasoning), Gemma responds in 3-5 seconds instead of 60. For simple classification tasks, the thinking overhead isn't worth it.
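A minimal sketch of that fallback call, assuming Ollama's `/api/generate` endpoint and its `think`, `format`, and `stream` request fields. The model tag below is a placeholder (use whatever tag your Ollama install lists), and both function names are hypothetical.

```python
import json


def build_ollama_payload(serp_text, model="your-gemma-model-tag"):
    """Build the request body for POST http://localhost:11434/api/generate."""
    prompt = (
        "Given these Google search results, extract all Instagram profile URLs, "
        "usernames, display names, and follower counts. Return JSON.\n\n"
        + serp_text
    )
    return {
        "model": model,      # placeholder tag, not a real model name
        "prompt": prompt,
        "stream": False,     # one JSON response instead of a token stream
        "think": False,      # skip the reasoning phase for speed
        "format": "json",    # ask Ollama to constrain output to JSON
    }


def parse_llm_json(response_body):
    """Decode the model's JSON answer; return None if anything is malformed."""
    try:
        outer = json.loads(response_body)
        return json.loads(outer.get("response", ""))
    except (json.JSONDecodeError, AttributeError, TypeError):
        return None
```

Returning `None` on bad output keeps the fallback from crashing the run. Small models do occasionally emit broken JSON, so the caller should treat the LLM result as optional.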
Real Results
Running "beauty" niche on Instagram, 5 results requested:
| Username | Real Name | Followers | Source |
|---|---|---|---|
| @mikaylajmakeup | Mikayla Jane Nogueira | 3M | og_meta_enriched |
| @ericataylor2347 | Erica Taylor | 2M | og_meta_enriched |
| @darcybylauren | lauren janelle | 189K | og_meta_enriched |
| @amandaensing | Amanda Ensing | 1M | og_meta_enriched |
| @jamiegenevieve | Jamie Genevieve | 1M | og_meta_enriched |
All real names (not just handles), all real follower counts, all in about 2 minutes.
Cost Breakdown
| Component | Monthly Cost |
|---|---|
| Contabo VPS (6 vCPU, 12GB RAM) | Under $15 |
| Apify Creator Plan | $1 |
| Apify proxy usage | ~$2-5 per 1000 searches |
| Total | ~$11-14/month |
Compare that to $200-500/month for commercial influencer tools.
The Code
The full source is on GitHub: influencer-marketing-intel
Or try it directly on Apify (no code needed): Influencer Marketing Intelligence
Input:
{
  "niche": "beauty",
  "platforms": ["instagram", "tiktok", "youtube"],
  "maxResults": 50,
  "followerRange": "micro_10k_100k"
}
Output: structured JSON with username, displayName, estimatedFollowers, bio, contactEmails, nicheTags, profileUrl for each influencer found.
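To show how a followerRange value could be applied to the abbreviated counts that come back from the OG tags ("189K", "3M"), here is a small sketch. It guesses that the range label encodes its bounds as `<label>_<min>_<max>`, based only on the single example value above; the real actor may define its ranges differently, and all function names are hypothetical.

```python
import re

MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text):
    """Convert strings like '189K', '3M', or '8,977' to an int, or None if unparseable."""
    m = re.fullmatch(r"\s*([\d.,]+)\s*([KMB]?)\s*", text, flags=re.IGNORECASE)
    if not m:
        return None
    try:
        number = float(m.group(1).replace(",", ""))
    except ValueError:
        return None
    return int(round(number * MULTIPLIERS[m.group(2).upper()]))


def parse_range(label):
    """Assumed format '<label>_<min>_<max>', e.g. 'micro_10k_100k' gives (10000, 100000)."""
    m = re.fullmatch(r"[a-z]+_(\d+[kmb]?)_(\d+[kmb]?)", label)
    if not m:
        return None
    return parse_count(m.group(1)), parse_count(m.group(2))


def in_range(followers, label):
    """True if an abbreviated follower string falls inside the labeled range."""
    count = parse_count(followers)
    bounds = parse_range(label)
    if count is None or bounds is None:
        return False
    low, high = bounds
    return low <= count <= high
```

Note that abbreviated counts are rounded by Instagram, so a profile shown as "100K" may sit slightly above or below the boundary.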
What I'd Do Differently
Start with the OG meta approach from day one. I wasted weeks trying to make Puppeteer work on Instagram. The Googlebot UA trick was the breakthrough.
Don't fight anti-bot systems; route around them. Residential proxies cost pennies and save hours of debugging.
Local LLMs for extraction are underrated. Gemma 4 on a VPS replaces brittle regex patterns. When Instagram changes their HTML structure, Gemma adapts. Regex doesn't.
I build scraping tools: 57 actors on the Apify Store, 869 users. If you have a data problem that needs automating, I've probably already built the tool.
Follow the build log: @ai_in_it on X