DEV Community

Cover image for I Got Tired of Writing Cold Emails. So I Built an AI Agent to Do It for Me.
Ayush Singh Tomar
Ayush Singh Tomar

Posted on

I Got Tired of Writing Cold Emails. So I Built an AI Agent to Do It for Me.

I Got Tired of Writing Cold Emails. So I Built an AI Agent to Do It for Me.

B2B sales reps spend hours researching a single lead — reading LinkedIn profiles, Googling the company, checking for recent news, then writing a personalized email that doesn't sound like a template. Most of that work is repetitive pattern-matching. I wanted to see if an AI agent could do it better, faster, and without copy-paste.

The result is SalesAgent — paste a LinkedIn URL, get a researched lead profile, an ML-based score (0–100), and a hyper-personalized cold email. End to end in under 45 seconds. No templates. No manual research. Just paste and go.

Live demo: salesagent-theta.vercel.app
GitHub: github.com/ayush-s-tomar/salesagent


What It Does

  1. You paste a LinkedIn profile URL into the React frontend
  2. The FastAPI backend triggers a LangGraph agent
  3. The agent runs live Tavily web searches to research the lead and their company
  4. A scikit-learn model scores the lead 0–100 based on six signals
  5. Groq's LLaMA generates a personalized cold email referencing real company events
  6. You get: lead summary, score with breakdown, and a ready-to-send email

Here's what it looks like in action — I ran it on Satya Nadella's LinkedIn profile. It found Microsoft's recent AI keynote announcements and referenced them directly in the email:

Subject: Microsoft's Quantum Leap: Can You Keep Up with the AI Revolution?

Dear Satya, I was excited to watch your recent keynote on Microsoft's AI advancements, including the launch of seven new MAI models and the introduction of Majorana 2, your quantum computer...

That's not a template. The agent found that news in real time and wrote around it.


Architecture

SalesAgent Architecture Diagram

SalesAgent running on Satya Nadella LinkedIn profile

Three nodes. Each one enriches the context for the next. The scoring node doesn't call an LLM — it runs a trained ML model, which is faster and more deterministic for a classification task like this.

Stack: LangGraph · FastAPI · React · scikit-learn · Groq · Tavily · Render · Vercel


How Each Part Works

1. Research Node — Tavily + LangGraph

The agent calls Tavily's search API twice per lead:

  • Search 1: "{name} {company} LinkedIn" — pulls profile signals (title, summary, skills)
  • Search 2: "{company} news funding jobs 2024" — checks for recent company activity

Tavily returns structured results with titles, URLs, and content snippets. The LangGraph research node processes these into six binary/numeric signals that feed the scorer:

signals = {
    "has_company": bool,      # Is company name known?
    "has_title": bool,        # Is job title known?
    "skills_count": int,      # Number of skills (0–15)
    "has_summary": bool,      # Does profile have a summary?
    "has_news": bool,         # Did Tavily find company news?
    "has_jobs": bool,         # Did Tavily find job postings?
}
Enter fullscreen mode Exit fullscreen mode

has_news and has_jobs are the most valuable signals — they tell you whether the company is active and growing right now. That matters more than whether a LinkedIn summary exists.

2. Scoring Node — scikit-learn

The scorer uses a Gradient Boosting Classifier trained on 500 synthetic samples generated with numpy. Labels were assigned using a weighted formula:

score = (
    has_news    * 0.30 +   # Company is in the news = hot lead
    has_jobs    * 0.25 +   # Hiring = growing, budget exists
    has_title   * 0.20 +   # We know who we're targeting
    has_summary * 0.15 +   # They invest in their profile
    skills_count * 0.05 +  # Proxy for profile completeness
    has_company * 0.05     # Basic data quality check
)
Enter fullscreen mode Exit fullscreen mode

Why ML instead of just an LLM scoring the lead? Two reasons: speed and determinism. An LLM call adds 2–3 seconds and gives you a different score every run. A trained classifier runs in milliseconds and gives you the same score for the same inputs every time — which matters when you're building something people actually use.

In production, you'd retrain on real CRM data — won vs lost deals — with richer features like funding stage, company size, industry vertical, and email response rate. But for a portfolio project with no CRM access, synthetic training with domain-informed weights gets you a working, explainable scorer.

3. Email Generation Node — Groq + LLaMA

The email node takes the full lead context — name, title, company, recent news, job postings — and injects it into a structured prompt:

System: You are an expert B2B sales copywriter. Write emails that are 
specific, short, and reference real context. Never use generic openers.

User: Lead: {name}, {title} at {company}
Recent news: {news_snippet}
Score: {score}/100
Write a 3-paragraph cold email referencing the news above.
Enter fullscreen mode Exit fullscreen mode

The key constraint is "never use generic openers" — without this, LLaMA defaults to "I hope this email finds you well." With it, every email opens with a specific reference to something real about the company.


What Broke (The Honest Part)

This is where I spent most of my time. Real projects break in ways tutorials never show you.

1. Groq Model Deprecations — Three Times

llama-3.3-70b-versatile failed. Switched to llama3-70b-8192. That was decommissioned. Tried llama3-groq-70b-8192-tool-use-preview — tool-calling didn't work properly. Ended up on llama-3.1-8b-instant, which is smaller but stable.

The lesson: never hardcode a model string. In a production system, this belongs in a config file or environment variable so you can swap it without touching code.

2. Tool-Calling Schema Bug — 400 Failed Generation

Groq was rejecting my tool schemas with a failed_generation 400 error. After multiple attempts to isolate it, the issue was that I was passing input_schema directly instead of extracting properties and required separately.

Wrong:

"input_schema": tool.input_schema
Enter fullscreen mode Exit fullscreen mode

Right:

"parameters": {
    "type": "object",
    "properties": tool.input_schema["properties"],
    "required": tool.input_schema.get("required", [])
}
Enter fullscreen mode Exit fullscreen mode

This took longer than it should have because the error message (failed_generation) gave no hint about the schema structure. If you're hitting this — check your tool schema first.

3. Interface Mismatch Between graph.py and llm.py

graph.py was calling run_with_tools(prompt=..., system=...) and expecting a (text, tool_log) tuple back. llm.py was written to accept messages=[] and return a dict. Classic interface mismatch between two files written in isolation.

Every bug from this — the prompt vs messages confusion, the system kwarg error, the tuple vs dict return type — cost me hours of debugging that a typed interface contract would have caught in seconds.

4. Python 3.14 on Render

pydantic-core failed to build because no wheel exists for Python 3.14. Fix: force PYTHON_VERSION=3.11.9 in Render's environment variables.

If you're deploying to Render: always pin your Python version explicitly. Don't trust Render's default.

5. Cached ML Model on Render

After rebalancing the scoring weights in scorer.py, the old model.pkl was still cached on disk. The score stayed stuck at 19/100 until I added rm -f ml/model.pkl to the build command to force a retrain on every deploy.

This one was subtle. The code was right. The model was wrong. Nothing in the logs told me the model was stale.


What I'd Do Differently

Define the LLM interface contract on day one.

The biggest source of bugs was graph.py and llm.py making different assumptions about function signatures, return types, and argument names — and those assumptions were never written down anywhere.

If I rebuilt SalesAgent today, the first file I'd create:

# contracts.py — written before any other code

def run_with_tools(prompt: str, system: str) -> tuple[str, list[dict]]:
    """Run LLM with tool-calling. Returns (response_text, tool_call_log)."""
    ...

def chat(messages: list[dict], system: str) -> str:
    """Simple chat completion. Returns response string."""
    ...
Enter fullscreen mode Exit fullscreen mode

One typed file, agreed upfront. Every bug from the interface mismatch would have been caught before a single line of agent logic was written.

Beyond that — in a production version, I'd:

  • Add conversation memory so the agent learns from past outreach (what worked, what didn't)
  • Replace synthetic training data with real CRM data (won/lost deals) for the scorer
  • Add email open tracking to close the feedback loop and retrain the scorer on outcomes

Try It

Live demo: salesagent-theta.vercel.app
GitHub: github.com/ayush-s-tomar/salesagent

Paste any LinkedIn URL and see what it generates. The email quality varies with how much Tavily finds — the more public news about a company, the better the output.

If you're building something similar or have feedback on the ML scoring approach, I'd genuinely like to hear it — connect with me on LinkedIn.


Stack: LangGraph · FastAPI · React · scikit-learn · Groq LLaMA 3.1 · Tavily · Render · Vercel

Tags: #ai #python #machinelearning #langchain #buildinpublic

Top comments (0)