I recently built a prospecting agent with Python to find local businesses on Google’s lower-ranked pages and pitch them SEO services.
The initial version was... promising but flawed.
It tried to pitch Indeed.com because they didn't have a local phone number. It told Ford Dealerships their site was "down" because their firewall blocked my bot. It sent robotic emails starting with "Fail: H1 Missing" ... not exactly a charming opener.
I realized that to make this tool useful, I needed to move from a simple scraper to a true agent. Here is the breakdown of how I refactored the code to filter noise, crawl for contacts, and use GenAI to write personalized campaigns.
Step 1: Filtering the Noise (The "No-Go" List)
The first problem with scraping generic keywords is that half the results aren't businesses, they are directories, job boards, and government sites. My script was wasting resources auditing ZipRecruiter and Texas.gov.
The Fix:
I made the clean_and_deduplicate function more robust with a strict blocklist, expanding the existing list significantly and grouping domains into "Job Boards," "Government," "Social Media," and "National Brands" (like Penske) that wouldn't hire a local agency anyway.
# We filter these out before we even attempt an audit
DIRECTORY_DOMAINS = [
    'indeed', 'glassdoor', 'ziprecruiter',  # Job Boards
    '.gov', 'texas.gov', 'fmcsa',           # Gov sites
    'yelp', 'yellowpages', 'bbb.org',       # Directories
    'penske', 'uhaul', 'ford.com'           # National Brands
]

def is_directory(url):
    # True if the URL matches any blocked domain
    return any(domain in url.lower() for domain in DIRECTORY_DOMAINS)
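Applying the filter is then a one-liner over the raw SERP results; scrape_serp and the keyword below are hypothetical placeholders for the existing scraping step:
# scrape_serp is a hypothetical stand-in for the existing SERP scraping step
raw_urls = scrape_serp("diesel mechanic dallas")
leads = [url for url in raw_urls if not is_directory(url)]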
Result: The list went from ~200 "leads" to ~70 actual local businesses.
Step 2: Smarter On-Page Auditing
My original script checked for H1 tags using exact string matching. If the keyword was diesel mechanic and the H1 was Best Diesel Mechanic in Texas, the script marked it as a FAIL.
The Fix: Fuzzy Logic
I switched to token-based set matching. If the H1 contains at least 50% of the target keywords, it passes.
# Breaking strings into sets of words for flexible matching
required_words = set(keyword.lower().split())
found_words = set(h1_text.lower().split())

# Calculate intersection
matches = required_words.intersection(found_words)
match_percentage = len(matches) / len(required_words) if required_words else 0

# If >=50% overlap, it's a Pass.
if match_percentage >= 0.5:
    audit_data['H1_Audit_Result'] = "Pass"
else:
    audit_data['H1_Audit_Result'] = "Fail"
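A quick worked example of the new rule, using the case from above:
# Both target words appear in the H1, so the overlap is 2/2 = 100% -> Pass
required_words = set("diesel mechanic".split())
found_words = set("best diesel mechanic in texas".split())
print(len(required_words & found_words) / len(required_words))  # 1.0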
Step 3: Distinguishing "Broken" from "Blocked"
Originally, if a site returned a 403 Forbidden, my script flagged it as "Actionable: Server Error." Pitching a client saying "Your site is down" when it's actually just secure is a great way to look incompetent.
The Fix: Handling Firewalls
I updated the requests logic to explicitly catch blocking statuses (403, 406, 429, 503) and mark them as "Blocked" so they're filtered out later. Now, the agent only flags genuine connection errors (like a 500 or an SSLError) as actionable leads.
except RequestException as e:
    # HTTPError carries the response object; pure connection errors do not.
    response = getattr(e, 'response', None)
    # If the server explicitly blocked us (Firewall/WAF), it's not a lead.
    if response is not None and response.status_code in [403, 406, 429, 503]:
        audit_data['Error_Status'] = "Blocked"
        return audit_data  # Stop processing, we will filter this out later
    # Real connection errors (DNS failure, Timeout) are actual leads.
    # We want to pitch "Site Health" services to these.
    audit_data['Error_Status'] = f"Error: {e.__class__.__name__}"
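For context, here is roughly what the try side looks like; calling raise_for_status() is what turns a 403 into a catchable HTTPError (a RequestException subclass). fetch_page is a simplified, illustrative wrapper, not the exact code from the repo:
import requests
from requests.exceptions import RequestException

def fetch_page(url, audit_data):
    # Illustrative wrapper around the audit request
    try:
        response = requests.get(url, timeout=10, headers={'User-Agent': 'Mozilla/5.0'})
        response.raise_for_status()  # 403/406/429/503 raise HTTPError here
        return response.text
    except RequestException:
        # ...the Blocked-vs-Error handling shown above goes here
        return None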
Step 4: The "Gap Analysis"
This was the strategic game-changer. A site with a missing H1 tag isn't necessarily a good lead. But a business with 50 five-star reviews and a missing H1 tag? That is a gold mine.
I integrated a secondary API call to fetch the Google Business Profile (GBP) ratings for every prospect to identify "Hidden Gems": businesses with great real-world reputations but poor digital presence.
# We categorize the lead before generating the pitch
is_gbp_strong = gbp_rating >= 4.0 and gbp_reviews >= 10
is_gbp_missing = gbp_rating == 0
# Strategy A: Strong GBP + Weak Site = "Your site hurts your reputation"
# Strategy B: No GBP + Weak Site = "You are invisible"
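To make the two strategies concrete, here is a minimal sketch of how a lead could be routed before the pitch is generated; pick_strategy and site_has_issues are illustrative names, and the thresholds mirror the snippet above:
def pick_strategy(gbp_rating, gbp_reviews, site_has_issues):
    # Strategy A: great real-world reputation, weak site
    if gbp_rating >= 4.0 and gbp_reviews >= 10 and site_has_issues:
        return "A: Your site hurts your reputation"
    # Strategy B: no Google Business Profile at all
    if gbp_rating == 0 and site_has_issues:
        return "B: You are invisible"
    # Everything else isn't worth a personalized pitch
    return "Skip"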
Step 5: The "Missing Link" - The Crawler
At this point, I had great prospects, but I was missing the most important piece of data: The Email Address. Many local businesses don't put their email in the Header; they hide it on the "Contact Us" page.
The Fix: The Spider Logic
I upgraded the agent to act like a human user:
- Scan Home: Look for "mailto:" links or regex matches.
- Heuristic Scoring: If no email is found, scan all links and score them. "/contact-us" gets 100 points, "/about" gets 30 points.
- The Hop: The agent navigates to the highest-scoring URL and scrapes that page.
from urllib.parse import urljoin

def find_best_contact_url(soup, base_url):
    # Heuristic Scoring Logic: score every link on the page, keep the best one
    best_candidate, best_score = None, 0
    for link in soup.find_all('a', href=True):
        url_path = link['href'].lower()
        link_text = link.get_text(strip=True).lower()
        score = 0
        if 'contact' in url_path: score += 100
        if 'contact' in link_text: score += 50
        if link.find_parent('footer'): score += 10
        if score > best_score:
            best_candidate, best_score = urljoin(base_url, link['href']), score
    # Returns the URL with the highest score to crawl next
    return best_candidate
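To show how the hop fits together, here's a minimal sketch of the two-step flow, assuming requests and BeautifulSoup; find_email and the regex are illustrative, not the exact code from the repo:
import re
import requests
from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def find_email(url):
    # Step 1: scan the homepage HTML (catches mailto: hrefs and plain-text emails)
    html = requests.get(url, timeout=10).text
    emails = EMAIL_RE.findall(html)
    if emails:
        return emails[0]
    # Step 2: hop to the highest-scoring contact page and scan that
    soup = BeautifulSoup(html, 'html.parser')
    contact_url = find_best_contact_url(soup, url)
    if not contact_url:
        return None
    return next(iter(EMAIL_RE.findall(requests.get(contact_url, timeout=10).text)), None)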
This logic alone saved ~40% of leads that would have otherwise been discarded as "No Contact Info."
Step 6: From Template to GenAI (Gemini Integration)
Finally, I tackled the outreach itself. My previous email template was rigid and impersonal. I wanted a 3-email sequence that felt human.
The Fix: Google Gemini 2.5 Flash
I integrated the Gemini API, choosing the Flash model because it's fast and cheap enough to generate campaigns in bulk. Instead of using a fixed string, I feed the Audit Data + GBP Data into a prompt.
The AI generates a 3-stage campaign:
- Email 1: The Hook (Referencing their specific Reputation vs. Site Gap).
- Email 2: The Value (Educational content about the error found).
- Email 3: The Breakup (Professional closing).
import google.generativeai as genai  # assumes genai.configure(api_key=...) was called earlier

# Feeding the Gap Strategy into the LLM
prompt = f"""
PROSPECT: {company}, Rating: {gbp_rating} stars.
ISSUES: H1: {h1_status}, NAP: {nap_status}
STRATEGY:
1. If rating > 4.0, praise reputation but warn about site errors.
2. Explain WHY {h1_status} kills rankings.
3. Gentle breakup.
OUTPUT FORMAT: JSON {{ "subject_1": "...", "body_1": "..." }}
"""
model = genai.GenerativeModel('gemini-2.5-flash')
response = model.generate_content(prompt)
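The response comes back as text even though the prompt asks for JSON, so it still has to be parsed; here's a minimal sketch assuming the model occasionally wraps its output in markdown fences (parse_campaign is an illustrative helper):
import json

def parse_campaign(response_text):
    # Strip optional ```json fences before parsing (illustrative, not the repo's exact code)
    cleaned = response_text.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(cleaned)

campaign = parse_campaign(response.text)
print(campaign["subject_1"])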
The Result
The agent now runs autonomously. It scans SERPs, filters junk, crawls for emails across multiple pages, and uses LLMs to write custom campaigns.
The Metrics:
- Raw Scrape: 200 URLs
- After Cleaning Directories: 70 Businesses
- Actionable Leads (With Emails): ~30 High-Quality Prospects
Key Takeaway: When developing an agent, or any tool in general, iteration is king. You have to know what you currently have and what's missing to reach the output you want. In my case, the difference between "just a script" and an "agent" is the ability to handle imperfection: hopping pages when data is missing, understanding context, and generating dynamic output. This project has become something I look forward to working on, and the most exciting part is that there's still room to grow.
🔗 Check out the Code:
You can find the full source code and contribute to the project on GitHub:
https://github.com/Rafa-romero-dev/seo-agent
A special thank you to the Dev.to team for featuring my previous article in the Top 7 Featured Dev Posts of the Week!
What do you think I should focus on next? What could use some refinement? Let me know in the comments!