Automate Company Intelligence: Scrape Tech Stacks & Hiring Signals in Seconds (A Tutorial)

Stop manually checking “BuiltWith.” Here is how to programmatically scan company websites for tech stacks, hiring signals, and sales triggers.

In the world of growth engineering and data pipelines, information is leverage. But gathering that information usually sucks.

You know the drill. You find a potential lead or a competitor. You open their website. You inspect the DOM to find copyright dates. You check their /careers page to see if they are scaling. You verify their tech stack with a browser extension.

It takes 10 minutes per company. It’s tedious. And it doesn’t scale.
For developers, building a custom scraper to do this is a headache. You have to handle headless browsers (Puppeteer/Playwright), proxy rotation, anti-bot detection, CAPTCHAs, and the endless variety of DOM structures.

There is a better way. In this guide, I’ll show you how to use a pre-built Company Intelligence Scanner (an Apify Actor) to extract over 25 categories of business intelligence from any URL with a single API call.

We will cover:

  • The Data: What you can actually get (it’s more than just HTML).

  • The Setup: Running the scanner via the Apify Console.

  • The Code: Automating it using Python or Node.js for your pipelines.

The Data: What Are We Actually Scanning For?

Most web scrapers are “dumb”: they just grab innerText. This scanner is designed to understand business logic. It uses heuristics and pattern matching to extract structured JSON data; a toy version of that idea is sketched right after the list below.

Here is what it finds:

🛠 Tech Stack: Detects 3,000+ technologies (e.g., “They use HubSpot, React, and Stripe”).

📈 Sales Motion: Identifies if a company is PLG (free trials, self-serve) or Sales-Led (request demo, contact sales).

💼 Hiring Signals: Scans for ATS systems (Greenhouse, Lever) and active job postings—strong indicators of budget and growth.

💰 Financials: Looks for funding announcements and pricing tiers.

🔒 Compliance: Checks for SOC2, GDPR, and CCPA badges (crucial for vendor assessment).
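
To make “heuristics and pattern matching” concrete, here is a toy sketch of how a sales-motion classifier can work. This is not the Actor’s actual implementation; the keyword lists and the tie-breaking logic are illustrative assumptions.

# A toy sales-motion classifier. The keyword lists are illustrative
# assumptions, not the Actor's real detection rules.
PLG_SIGNALS = ["start for free", "free trial", "sign up", "get started"]
SALES_LED_SIGNALS = ["request a demo", "book a demo", "contact sales", "talk to sales"]

def classify_sales_motion(page_text: str) -> dict:
    text = page_text.lower()
    plg_hits = [kw for kw in PLG_SIGNALS if kw in text]
    sales_hits = [kw for kw in SALES_LED_SIGNALS if kw in text]
    if len(plg_hits) > len(sales_hits):
        return {"type": "PLG", "signals": plg_hits}
    if sales_hits:
        return {"type": "Sales-Led", "signals": sales_hits}
    return {"type": "Unknown", "signals": []}

print(classify_sales_motion("Book a demo or talk to sales today."))
# {'type': 'Sales-Led', 'signals': ['book a demo', 'talk to sales']}

The real scanner layers many such checks (plus network-request analysis) on top of each other, but the principle is the same: patterns in, structured JSON out.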

The Tutorial: How to Run Your First Scan

We are going to use Apify, a cloud platform for web scraping and automation. You don’t need a paid account to follow along; the free tier is sufficient for testing.

Step 1: Access the Actor
Navigate to the Company Data Enrichment for Pennies Actor in the Apify Store and click Try for free. This will open the Apify Console.

Step 2: Configure the Input
The input is incredibly simple. You don’t need to write selectors. You just need the URLs.

In the input field, you can enter a list of domains. For this example, let’s scan a few AI companies to see how they differ.

{
  "start_urls": [
    "https://openai.com",
    "https://anthropic.com",
    "https://jasper.ai"
  ],
  "apify_proxy": {
    "use_apify_proxy": false,
    "apify_proxy_groups": ["RESIDENTIAL"]
  }
}

💡 Pro Tip: Run the Actor without the proxy first (“use_apify_proxy”: false). Only enable the proxy afterward for sites that block or throttle requests. This saves you money.

Step 3: Run and Wait
Click the green Start button.

The Actor will launch a headless browser in the cloud. It visits the sites, scrolls to trigger lazy-loading scripts, analyzes network requests to find analytics trackers, and parses the HTML for specific keywords.

It usually takes about 10–60 seconds per URL (URLs are processed concurrently in batches).
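
The Actor handles all of this in the cloud, but if you are curious what the browser side looks like, here is a minimal local sketch of the same idea using Playwright. The tracker domains are illustrative assumptions, not the Actor’s detection list.

# pip install playwright && playwright install chromium
# A minimal local sketch: visit a page, scroll to trigger lazy-loading,
# and watch network requests for known analytics domains (illustrative list).
from playwright.sync_api import sync_playwright

TRACKER_HINTS = ["google-analytics.com", "segment.com", "hs-scripts.com"]

def scan(url: str) -> list:
    found = set()

    def on_request(request):
        for hint in TRACKER_HINTS:
            if hint in request.url:
                found.add(hint)

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("request", on_request)
        page.goto(url, wait_until="networkidle")
        # Scroll to the bottom to trigger lazy-loaded scripts
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)
        browser.close()
    return sorted(found)

print(scan("https://www.example.com"))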

The Output: Analyzing the JSON
Once the run finishes, you get a clean JSON dataset. This is where the magic happens. Let’s look at a few key fields from a sample result.

  1. The Technology Profile
Instead of just knowing they use “JavaScript,” you get specific categories. This is gold for lead scoring or competitive analysis.
"technologies": [
  {
    "name": "HubSpot",
    "categories": ["Marketing automation"],
    "description": "Marketing and sales software..."
  },
  {
    "name": "Segment",
    "categories": ["CDP & Customer Profiles", "Analytics"]
  }
]
  2. The “High Intent” Signals
The scanner looks for keywords indicating the company is trying to solve specific problems (like "migrate", "scale", "enterprise").
"signals": {
  "intentKeywords": ["integrate", "enterprise", "solution"],
  "hiring": {
    "isHiring": true,
    "atsDetected": ["Greenhouse"],
    "jobLinks": [{ "text": "We're hiring!", "href": "..." }]
  }
}
  3. Business Model & Sales Motion
This attempts to classify the business model based on site structure and calls to action (CTAs).
"businessModel": {
  "primary": "B2B",
  "confidence": "high",
  "b2bSignals": ["contact sales", "solutions for teams"]
},
"salesMotion": {
  "type": "Sales-Led",
  "signals": ["book a demo", "talk to sales"]
}
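Because everything is structured, turning these fields into a lead score takes only a few lines. A minimal sketch, assuming the field names from the samples above (the weights are arbitrary):

# A minimal lead-scoring sketch over one dataset item. Field names follow
# the sample output above; the weights are arbitrary assumptions.
def score_lead(item: dict) -> int:
    score = 0
    tech_names = {t["name"] for t in item.get("technologies", [])}
    if "HubSpot" in tech_names:  # already invested in marketing tooling
        score += 2
    signals = item.get("signals", {})
    if signals.get("hiring", {}).get("isHiring"):  # budget and growth
        score += 3
    if "enterprise" in signals.get("intentKeywords", []):
        score += 2
    if item.get("salesMotion", {}).get("type") == "Sales-Led":
        score += 1  # matches an outbound-friendly motion
    return score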

For Developers: Automating the Pipeline
You don’t want to manually click buttons in a console. You want to pipe this data directly into your Postgres database, Airtable, or CRM.

You can use the Apify API Client to run this programmatically.

Python Example 🐍
Here is a script that scans a list of domains and prints out their sales motion and tech stack.

from apify_client import ApifyClient

# Initialize with your API Token
client = ApifyClient("YOUR_API_TOKEN")

# Configuration
run_input = {
    "start_urls": ["https://linear.app", "https://notion.so"],
    "apify_proxy": {
        "use_apify_proxy": False,  # Enable if blocked
        "apify_proxy_groups": ["RESIDENTIAL"]
    }
}

print("🕵️  Scanning companies...")

# Run the Actor
# Replace with the specific Actor ID you are using
run = client.actor("YOUR_ACTOR_ID").call(run_input=run_input)

# Fetch results
dataset_items = client.dataset(run["defaultDatasetId"]).list_items().items

for item in dataset_items:
    company = item.get("company", {}).get("name")
    motion = item.get("salesMotion", {}).get("type")

    # Filter for tech (e.g., specific Analytics tools)
    analytics = [t["name"] for t in item.get("technologies", []) 
                 if "Analytics" in t.get("categories", [])]

    print(f"--- {company} ---")
    print(f"Strategy: {motion}")
    print(f"Analytics Stack: {', '.join(analytics)}")
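From there, persisting the results is a one-liner per row. Here is one possible storage sketch using psycopg2 and a single JSONB column; the table layout (and the assumption that each item carries its scanned URL) is mine, not an official schema.

# pip install psycopg2-binary
# One possible storage sketch: a JSONB column keyed by URL. The table layout
# and the "url" field on each item are assumptions, not an official schema.
import json
import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/leads")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS company_scans (
            url        text PRIMARY KEY,
            scanned_at timestamptz DEFAULT now(),
            payload    jsonb NOT NULL
        )
    """)
    for item in dataset_items:  # from the script above
        cur.execute(
            """INSERT INTO company_scans (url, payload) VALUES (%s, %s)
               ON CONFLICT (url) DO UPDATE
               SET payload = EXCLUDED.payload, scanned_at = now()""",
            (item.get("url"), json.dumps(item)),
        )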

Use Cases: How to Monetize This Data

  • Hyper-Personalized Cold Outreach: Bad email: "Do you need marketing help?" Good email: "I noticed you’re using HubSpot and represent a Sales-Led organization. I also saw you’re actively hiring for SDRs. We help companies with this exact stack scale outbound..."

  • Competitor Monitoring: Set up a cron job (Apify handles scheduling) to scan your top 5 competitors every week, and get a Slack alert if they change pricing or add an "Enterprise" page. A sketch of the diffing logic follows this list.

  • Investment Due Diligence: VCs can scan hundreds of startups and filter by "PLG" motion and "Stripe" usage to find mature, scalable products.
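
For the monitoring use case, the diffing logic is simple once last week’s payload is stored. A hedged sketch, where the Slack webhook URL and the compared fields are my assumptions:

# A sketch of the weekly competitor diff. The webhook URL is a placeholder
# and the compared fields are assumptions based on the sample output above.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def diff_and_alert(url: str, previous: dict, current: dict) -> None:
    changes = []
    for field in ("salesMotion", "businessModel"):
        if previous.get(field) != current.get(field):
            changes.append(f"{field} changed: {previous.get(field)} -> {current.get(field)}")
    # A new "enterprise" intent keyword is a strong repositioning signal
    old_kw = set(previous.get("signals", {}).get("intentKeywords", []))
    new_kw = set(current.get("signals", {}).get("intentKeywords", []))
    if "enterprise" in new_kw - old_kw:
        changes.append('New intent keyword: "enterprise"')
    if changes:
        requests.post(SLACK_WEBHOOK_URL, json={"text": f"{url}:\n" + "\n".join(changes)})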

Final Thoughts

The web is a database; it’s just unstructured. Tools like this bridge the gap between "messy HTML" and "actionable database."

Whether you are an engineer building an enrichment tool or a founder trying to understand your market, automating this research saves hundreds of hours of manual clicking.

If you want to store this output right away, the Postgres sketch above is one starting point. Drop a comment if a fuller schema (Postgres or MongoDB) would be useful.
