I Scraped 50,000 Conference Records to Build a Sales Lead Engine — Here's the Python Code
A client came to me three months ago with a straightforward ask: build them a pipeline of conference leads for their B2B outbound sales team. They sell sponsorship packages to tech conferences. Their current process was a sales rep manually Googling "tech conference 2024" for two hours every Monday morning. They wanted automation.
Simple enough, I thought. Then I actually started building it.
The Eventbrite API Trap
My first instinct was the Eventbrite API. It's well-documented, returns clean JSON, has pagination. I had a working prototype in maybe 45 minutes.
Then I actually looked at the results.
For "machine learning conference," the API returned 340 events. When I cross-referenced against manual searches across Eventbrite itself, Luma, Meetup, Sessionize, and a handful of niche tech directories — I was missing roughly 70% of the actual conferences. Not small stuff either. I was missing events with 2,000+ attendees and $50K sponsorship packages.
The Eventbrite API has a coverage problem. It's great for consumer events. Tech conferences — especially the specialized ones a B2B sales team actually wants — are scattered across five different platforms with zero standardization.
So I ended up scraping five sources: Eventbrite (web, not API), Luma, Meetup, Sessionize, and Conferize. About 50,000 records total after deduplication.
Here's where it got painful.
Date Format Hell Is Real
Every platform formats dates differently. I'm not talking about ISO 8601 vs. MM/DD/YYYY. I mean:
"March 4–6, 2025""4 Mar 2025""Tuesday, March 4th at 2pm EST"-
"Q1 2025"(yes, really) -
"TBD"(even better)
I wrote a parsing function that I'm genuinely not proud of, but it works:
```python
import re

import requests
from bs4 import BeautifulSoup
from dateutil import parser as dateparser


def fetch_and_parse_event(url, session):
    response = session.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    raw_date = soup.select_one(".event-date, [data-date], time")
    date_str = raw_date.get_text(strip=True) if raw_date else ""

    # Strip ordinal suffixes dateutil chokes on: "4th", "21st", "3rd"
    date_str = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", date_str)

    # Collapse ranges like "March 4-6, 2025" to their start date. Only drop
    # the "-6" part so the year survives, and only when the string contains
    # letters, so ISO dates like "2025-05-14" pass through untouched.
    if re.search(r"[A-Za-z]", date_str):
        date_str = re.sub(r"(\d{1,2})\s*[–—-]\s*\d{1,2}\b", r"\1", date_str).strip()

    try:
        parsed_date = dateparser.parse(date_str, fuzzy=True)
    except (ValueError, OverflowError):
        parsed_date = None

    title = soup.select_one("h1")
    return {
        "name": title.get_text(strip=True) if title else "",
        "date": parsed_date.strftime("%Y-%m-%d") if parsed_date else "unknown",
        "raw_date": date_str,
    }
```
The fuzzy=True flag on dateutil.parser does a lot of heavy lifting. It's not perfect — "Q1 2025" still falls through to None — but it handled about 94% of the formats I hit without manual intervention. I kept raw_date in the output so the client's team could manually fix edge cases. About 1,200 records ended up with "unknown" dates out of 50,000. Acceptable.
The JavaScript-Rendered Pages Problem
Luma killed me. Their event pages are fully client-side rendered. requests returns a skeleton HTML shell with basically no content — the actual event data loads via React after the initial response. Same issue on about 40% of Meetup's event detail pages.
I solved this with Playwright. I'm not going to paste the full implementation here because it's about 150 lines with retry logic and browser context management, but the core approach was:
- Spin up a persistent browser context with Playwright
- Navigate to the page and wait for a specific selector (like the event title h1) to appear before scraping
- Extract the rendered HTML and hand it off to BeautifulSoup for parsing
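Since the full implementation is too long to paste, here's a minimal sketch of just the fetch step above. It assumes Chromium and a throwaway profile directory; the retry logic and error handling are omitted:

```python
def fetch_rendered_html(url, wait_selector="h1", timeout_ms=15000):
    # Import lazily so the requests-only code path never needs
    # a browser installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # A persistent context keeps cookies and cache across pages,
        # which cuts load time on repeat visits to the same site.
        context = p.chromium.launch_persistent_context(
            "/tmp/pw-profile", headless=True
        )
        page = context.new_page()
        page.goto(url, timeout=timeout_ms)
        # Block until the JS app has actually rendered the event title
        page.wait_for_selector(wait_selector, timeout=timeout_ms)
        html = page.content()
        context.close()
    return html
```

The rendered HTML then feeds into BeautifulSoup exactly like a requests response body would.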
The runtime difference is significant. Pure requests does about 8-12 pages per second on a good connection. Playwright drops that to roughly 1-2 pages per second because you're running a real browser. For 50,000 records, that matters. I ended up using requests wherever the page was server-rendered and only falling back to Playwright when I detected the JS-only pattern.
One signal that worked reliably: if the response body is under ~5KB for what should be a rich event page, it's probably shell HTML and you need the browser.
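That signal is cheap to codify. A sketch, with the ~5KB threshold as a tunable constant:

```python
MIN_CONTENT_BYTES = 5 * 1024  # ~5KB: below this, assume a JS shell


def needs_browser(response):
    """True when a response body is too small to be a real,
    server-rendered event page, so Playwright should take over."""
    return len(response.content) < MIN_CONTENT_BYTES
```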
What the Output Actually Looks Like
After all the parsing, here's the schema I settled on for the client's CRM import:
```json
{
  "name": "PyCon US 2025",
  "date": "2025-05-14",
  "location": "Pittsburgh, PA",
  "speakers": ["Guido van Rossum", "Brett Cannon"],
  "ticket_url": "https://us.pycon.org/2025/tickets/",
  "organizer_email": "pycon-organizers@python.org",
  "source": "eventbrite",
  "attendee_count": 2800
}
```
organizer_email is the field the sales team cares most about. It's also the hardest to get. About 60% of events have a contact email somewhere on the page — usually in the footer, the organizer profile, or buried in the FAQ section. The other 40% require a secondary lookup against the organizer's website, which I did with a separate enrichment pass using Hunter.io's API.
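The enrichment pass was roughly this shape. This is a sketch assuming Hunter.io's v2 domain-search endpoint; pick_email is a hypothetical helper that just takes the first hit:

```python
HUNTER_ENDPOINT = "https://api.hunter.io/v2/domain-search"


def enrich_organizer_email(domain, api_key):
    """Look up contact emails for an organizer's domain via Hunter.io.
    Returns the first email found, or None."""
    import requests  # imported here so the parsing helper below stays dependency-free

    resp = requests.get(
        HUNTER_ENDPOINT,
        params={"domain": domain, "api_key": api_key},
        timeout=15,
    )
    if resp.status_code != 200:
        return None
    return pick_email(resp.json())


def pick_email(payload):
    # Hunter nests results under "data" -> "emails" -> [{"value": ...}]
    emails = payload.get("data", {}).get("emails", [])
    return emails[0]["value"] if emails else None
```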
speakers is almost never structured data. It's usually a blob of text in a "Speakers" section. I ended up using a simple regex pattern to find names followed by titles ("John Smith, CTO at Acme") and called it good enough. Precision was about 80%. Good enough for lead qualification, not good enough for a formal database.
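The pattern was nothing fancy. A sketch of the idea — the exact regex below is illustrative, not the production one, and it deliberately trades recall for precision (two-plus-word capitalized names, capitalized company names):

```python
import re

# Matches "First Last, Title at Company" — e.g. "John Smith, CTO at Acme"
SPEAKER_RE = re.compile(
    r"\b([A-Z][a-z]+(?:\s[A-Z][a-z]+)+)"   # name: two or more capitalized words
    r",\s+([A-Z][A-Za-z/ ]*?)"             # title: lazily match up to " at "
    r"\s+at\s+"
    r"([A-Z][\w&\-]*(?:\s[A-Z][\w&\-]*)*)" # company: capitalized words only
)


def extract_speakers(text):
    return [
        {"name": m[0], "title": m[1].strip(), "company": m[2].strip()}
        for m in SPEAKER_RE.findall(text)
    ]
```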
Scaling to 50K Records: The IP Rotation Problem
At around 8,000 requests, Meetup blocked me. Eventbrite was more patient — I got to about 15,000 before hitting 429s. Luma was aggressive; I got blocked after maybe 500 requests from a single IP.
The fix is proxy rotation. Here's the pattern I use — dead simple, no fancy library needed:
```python
import itertools
import time

import requests

PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

proxy_pool = itertools.cycle(PROXIES)


def get_with_rotation(url, retries=3):
    for attempt in range(retries):
        proxy = next(proxy_pool)
        session = requests.Session()
        session.proxies = {"http": proxy, "https": proxy}
        try:
            response = session.get(url, timeout=15)
            if response.status_code == 200:
                return response
            elif response.status_code == 429:
                time.sleep(2 ** attempt)  # exponential backoff
        except requests.exceptions.RequestException:
            continue
    return None
```
I used residential proxies from Webshare for this project. Cost was around $40/month for the volume I needed. Datacenter proxies are cheaper but Meetup fingerprints them almost immediately. Worth the extra spend.
Between proxy rotation and adding randomized delays between 0.5 and 2.5 seconds, I got through 50,000 records over about 72 hours of runtime without significant blocking issues. Not fast. But reliable.
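The randomized delay is a thin wrapper. A sketch — polite is a hypothetical helper; pass it any fetch function, such as get_with_rotation above:

```python
import random
import time


def polite(fetch, min_delay=0.5, max_delay=2.5):
    """Wrap a fetch function so every call sleeps a random interval
    first, so requests don't land at a fixed, detectable cadence."""
    def wrapped(url):
        time.sleep(random.uniform(min_delay, max_delay))
        return fetch(url)
    return wrapped
```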
The Maintenance Reality
Here's the honest part: this thing breaks constantly. In the three months since I delivered it, I've had to patch it four times:
- Luma changed their CSS class names (broke the title selector)
- Eventbrite added a cookie consent modal that blocked scraping
- Sessionize restructured their speaker section entirely
- One source just disappeared
Scrapers are not set-and-forget. If you're building something like this internally, budget for ongoing maintenance. I'd estimate 2-4 hours per month minimum for a multi-source setup like this one.
If you'd rather skip the maintenance overhead entirely, I also published this as an Apify actor: https://apify.com/lanky_quantifier/conference-event-scraper — runs in the cloud, handles the proxy rotation and JS rendering infrastructure, no setup needed on your end.
What I'd Do Differently
A few things I'd change if I rebuilt this from scratch:
Start with Playwright everywhere. The hybrid approach (requests where possible, Playwright as fallback) saved runtime but added code complexity. The performance gain wasn't worth the debugging headaches.
Build the deduplication layer first. I added it last and had to reprocess a lot of data. Same conference appears on three platforms with slightly different names, dates, and descriptions. Fuzzy matching on event name + date + city works okay, but it's not trivial.
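A sketch of that fuzzy check using only the standard library. It assumes each record is a dict with name, date, and city keys (hypothetical schema), and the 0.85 threshold is illustrative:

```python
from difflib import SequenceMatcher


def same_event(a, b, threshold=0.85):
    """Two records likely refer to one conference when date and city
    agree exactly and the names are similar enough."""
    if a["date"] != b["date"] or a["city"].lower() != b["city"].lower():
        return False
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return name_sim >= threshold
```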
Don't trust attendee count. Sources report this wildly differently. Some show capacity, some show registered attendees, some show historical peak attendance. I flagged it as "estimated" in the final output, but I should have just dropped it.
The client's sales team ended up qualifying about 3,400 leads from the 50,000 records — roughly 7%. Their previous manual process was yielding maybe 200 leads per month. So the ROI was there, even with all the rough edges.
What's your go-to approach for scraping JS-heavy event sites? Playwright, Puppeteer, or something else?