# Automated AI Agent Workflows with n8n & JSON Extraction

#aiagents #automation #dataextraction #api

## TL;DR

To build automated website enrichment and competitor research workflows for AI agents, use n8n to orchestrate the pipeline and a web scraping API to convert public HTML pages into structured JSON. By passing target URLs from your CRM into an n8n HTTP Request node, requesting JSON format from the scraper, and feeding the output into an AI agent node, you can continuously extract competitor pricing, feature sets, and firmographic data without writing custom parsers.

## The Architecture of an Enrichment Workflow

AI agents require structured context. Feeding raw, unparsed HTML into an LLM's context window drives up token costs, degrades reasoning, and invites hallucinated data. To automate competitor research or lead enrichment, the pipeline must standardize the input before it reaches the agent.

An effective n8n enrichment pipeline consists of four stages:

1. **Triggering**: A CRM webhook, database event, or cron schedule initiates the workflow with a target URL.
2. **Extraction**: A request is made to a scraping API to fetch the publicly accessible page and return it as a structured JSON object.
3. **Reasoning**: The AI agent processes the structured JSON against a specific prompt to extract insights (e.g., pricing tiers, feature lists).
4. **Storage**: The structured insights are pushed back to the originating CRM or database.
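
The four stages can be sketched as plain functions chained together. This is an illustration of the data flow only, not n8n code: every function name and the sample data are hypothetical placeholders standing in for the trigger, the scraping API call, the AI agent node, and the database write.

```python
# Hypothetical stand-ins for the four n8n stages; each one takes and returns plain data.

def trigger() -> str:
    # In n8n this would be a webhook, database event, or schedule.
    return "https://example-competitor.com/pricing"

def extract(url: str) -> dict:
    # Placeholder for the scraping API call that returns structured JSON.
    return {"url": url, "content": {"plans": [{"name": "Pro", "price": "$49/mo"}]}}

def reason(page: dict) -> dict:
    # Placeholder for the AI agent: maps page JSON to an internal schema.
    plans = page["content"]["plans"]
    return {"source": page["url"], "pricing_tiers": [p["name"] for p in plans]}

def store(insights: dict, sink: list) -> None:
    # Placeholder for the CRM or database write.
    sink.append(insights)

def run_pipeline(sink: list) -> dict:
    insights = reason(extract(trigger()))
    store(insights, sink)
    return insights

database: list = []
result = run_pipeline(database)
print(result)
```

Keeping each stage a function of its input mirrors how n8n passes items from node to node, which makes it easy to swap one stage without touching the others.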

## Building the n8n Pipeline

n8n is a node-based workflow automation tool that excels at integrating APIs and LLMs. We will build a pipeline that monitors competitor pricing pages and enriches a central database.

### Step 1: Triggering the Workflow

Start by adding a Webhook node or a Schedule node in n8n.
If you are enriching inbound leads, a Webhook node is optimal. Configure your CRM to send a POST request to the n8n webhook URL containing the lead's company website.

For continuous competitor research, use a Schedule node set to run weekly, followed by a database node (like PostgreSQL or Supabase) that pulls a list of competitor URLs to check.
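
CRM records often hold bare domains or mixed-case hostnames, so it can help to normalize the target URL before the extraction step. The sketch below shows one way to do that in Python. The `website` field name and the `target_url_from_payload` helper are assumptions for illustration. The helper also forces `https` and drops any port or query string, which may not suit every site.

```python
from urllib.parse import urlparse

def target_url_from_payload(payload: dict) -> str:
    """Return a normalized https URL from a CRM webhook payload.

    The "website" field name is an assumption; use whatever your CRM sends.
    """
    raw = str(payload.get("website") or "").strip()
    if not raw:
        raise ValueError("payload has no website value")
    # CRMs often store bare domains, so add a scheme before parsing.
    if "://" not in raw:
        raw = "https://" + raw
    parsed = urlparse(raw)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"not a usable web URL: {raw!r}")
    path = parsed.path or "/"
    # parsed.hostname is already lowercased by urlparse.
    return f"https://{parsed.hostname}{path}"

print(target_url_from_payload({"website": "Example-Competitor.com"}))
```

In n8n the same logic could live in a Code node placed between the trigger and the HTTP Request node.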

### Step 2: Structured Data Extraction

Once you have the target URL, you need to extract the data. Traditional scraping requires building brittle CSS selectors. Instead, we use AlterLab to request the page and return a structured JSON representation of the content.

Add an HTTP Request node in n8n.
Configure it to make a POST request to the scraping API endpoint.

If you are testing locally outside of n8n, you can run the same extraction with cURL or Python.

```bash title="Terminal" {3-5}
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example-competitor.com/pricing", "formats": ["json"]}'
```

For custom applications or dedicated orchestration scripts outside of n8n, you can use our [Python SDK](https://alterlab.io/web-scraping-api-python) to handle the extraction synchronously.



```python title="enrichment_task.py" {4-6}
import alterlab
import json

client = alterlab.Client("YOUR_API_KEY")
# Requesting 'json' format instructs the API to parse the layout automatically
response = client.scrape("https://example-competitor.com/pricing", formats=["json"])
data = response.json()

# Pretty-print the structured result for inspection
print(json.dumps(data, indent=2))
```

### Step 3: Handling JavaScript and Anti-Bot Systems

Modern public websites, especially e-commerce platforms and SaaS sites, rely heavily on Single Page Application (SPA) architectures. A plain GET request to such a site often returns only an empty root `<div>`, because the content is rendered client-side by JavaScript.

Furthermore, data collection systems frequently encounter rate limits or bot detection mechanisms, even when accessing public information at respectful intervals. When building reliable automated workflows, robust anti-bot handling is a requirement, not an optional feature.

By offloading the HTTP request to a dedicated API, your n8n workflow does not need to manage headless browser instances, proxy rotation, or retries. The API renders the JavaScript, handles the network complexities, and returns the final DOM state as structured data.
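
To see why a plain GET is not enough, it can be useful to check whether a fetched page is just an unrendered SPA shell. The following is a rough heuristic sketch using only the Python standard library. The `looks_like_spa_shell` name and the 200-character threshold are arbitrary assumptions, not part of any API mentioned here.

```python
from html.parser import HTMLParser

class _VisibleText(HTMLParser):
    """Collects text that is not inside <script>, <style>, or <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def looks_like_spa_shell(html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little visible text suggests content is rendered by JavaScript."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars

shell = (
    '<html><head><title>Acme</title><script>var a=1;</script></head>'
    '<body><div id="root"></div><script src="/app.js"></script></body></html>'
)
rendered = (
    "<html><body><div id='root'>"
    + "<p>Pro plan $49 per month with unlimited seats.</p>" * 10
    + "</div></body></html>"
)
print(looks_like_spa_shell(shell), looks_like_spa_shell(rendered))
```

A check like this is only a diagnostic. It does not render anything, so pages flagged as shells still need a JavaScript-capable fetch.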

### Step 4: Structuring the Data for AI Agents

With the JSON data in n8n, add an AI Agent node (listed under Advanced AI).
Connect your preferred LLM provider (OpenAI, Anthropic, or local via Ollama).

Configure the AI node with a system prompt that enforces strict JSON output. The agent's job is to read the extracted page content and map it to your internal schema.

Example System Prompt for the AI Node:

```text title="n8n_system_prompt.txt"
You are a firmographic data extraction agent.
Analyze the provided JSON representation of a competitor's pricing page.
Extract the pricing tiers, the cost of each tier, and the core features included.
Output your response STRICTLY as a JSON object matching this schema:
{
  "company_name": "string",
  "pricing_tiers": [
    {
      "tier_name": "string",
      "price_monthly": "number",
      "core_features": ["string"]
    }
  ]
}
Do not include markdown formatting or conversational text.
```

Map the output of the HTTP Request node (the scraped JSON) to the input of the AI agent node. The agent will parse the structured web data and output a clean, standardized object that matches your database schema.
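
Even with a strict prompt, an LLM can occasionally return malformed or incomplete JSON, so it is worth validating the reply before it reaches your database. Here is a minimal sketch that checks the agent's output against the schema from the system prompt. The `validate_agent_output` helper is hypothetical and uses only the standard library.

```python
import json

def validate_agent_output(raw: str) -> dict:
    """Parse the agent's reply and check it against the prompt's schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on non-JSON replies
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    if not isinstance(data.get("company_name"), str):
        raise ValueError("company_name must be a string")
    tiers = data.get("pricing_tiers")
    if not isinstance(tiers, list):
        raise ValueError("pricing_tiers must be a list")
    for i, tier in enumerate(tiers):
        if not isinstance(tier, dict):
            raise ValueError(f"pricing_tiers[{i}] must be an object")
        if not isinstance(tier.get("tier_name"), str):
            raise ValueError(f"pricing_tiers[{i}].tier_name must be a string")
        price = tier.get("price_monthly")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(price, bool) or not isinstance(price, (int, float)):
            raise ValueError(f"pricing_tiers[{i}].price_monthly must be a number")
        features = tier.get("core_features")
        if not isinstance(features, list) or not all(isinstance(f, str) for f in features):
            raise ValueError(f"pricing_tiers[{i}].core_features must be a list of strings")
    return data

reply = '{"company_name": "Acme", "pricing_tiers": [{"tier_name": "Pro", "price_monthly": 49, "core_features": ["SSO"]}]}'
print(validate_agent_output(reply)["pricing_tiers"][0]["tier_name"])
```

If validation fails, one option is to route the item to an error branch or retry the agent call rather than writing partial data downstream.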

### Step 5: Routing Data to Your Target System

The final step in n8n is writing the enriched data to its destination. Add a node for your target system (e.g., PostgreSQL, Salesforce, or HubSpot).

Map the strictly formatted JSON output from the AI agent directly into the corresponding fields of your database or CRM. If this workflow runs on a schedule, you can add an intermediate diff-checking node to compare the newly extracted pricing against the last known pricing in your database, only triggering an alert or update if the competitor has changed their tiers.
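
The diff check described above can be sketched as a comparison of two `pricing_tiers` lists keyed by `tier_name`. This is a simplified illustration: the `diff_pricing` and `has_changes` helpers are hypothetical, and the comparison looks only at `price_monthly`, not at feature lists.

```python
def diff_pricing(previous: list[dict], current: list[dict]) -> dict:
    """Compare two pricing_tiers lists keyed by tier_name."""
    old = {t["tier_name"]: t["price_monthly"] for t in previous}
    new = {t["tier_name"]: t["price_monthly"] for t in current}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": {
            name: {"from": old[name], "to": new[name]}
            for name in sorted(old.keys() & new.keys())
            if old[name] != new[name]
        },
    }

def has_changes(diff: dict) -> bool:
    """True when any tier was added, removed, or repriced."""
    return bool(diff["added"] or diff["removed"] or diff["changed"])

last_known = [
    {"tier_name": "Starter", "price_monthly": 19},
    {"tier_name": "Pro", "price_monthly": 49},
]
latest = [
    {"tier_name": "Starter", "price_monthly": 19},
    {"tier_name": "Pro", "price_monthly": 59},
    {"tier_name": "Enterprise", "price_monthly": 199},
]
diff = diff_pricing(last_known, latest)
print(diff, has_changes(diff))
```

Only items where `has_changes` is true would need to continue to the alert or update branch.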

## Extending the Workflow

Once the basic pipeline is operational, you can expand its capabilities:

1. **Pagination**: For e-commerce category pages, use n8n's Loop Over Items node to follow pagination links extracted in the initial request.
2. **Multi-Page Context**: Scrape the target's homepage, `/about`, and `/pricing` pages in parallel HTTP nodes. Merge the JSON outputs into a single text block before passing it to the AI agent to provide comprehensive context for lead enrichment.
3. **Webhook Responses**: If using AlterLab, you can configure the API to push results to an n8n Webhook trigger asynchronously. Refer to the [documentation](https://alterlab.io/docs) for configuring asynchronous webhook deliveries to prevent n8n execution timeouts on heavy pages.
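
For the multi-page case, the merge step can be as simple as labeling each page's JSON and joining the pieces. Below is a rough sketch. The `build_context` helper, the `=== path ===` separator, and the per-page character cap are all assumptions chosen for illustration.

```python
import json

def build_context(pages: dict[str, dict], max_chars_per_page: int = 4000) -> str:
    """Join per-page JSON into one labeled text block for the agent."""
    sections = []
    for path, page_json in pages.items():
        # Compact separators keep the block small; the cap bounds each page's share.
        body = json.dumps(page_json, ensure_ascii=False, separators=(",", ":"))
        sections.append(f"=== {path} ===\n{body[:max_chars_per_page]}")
    return "\n\n".join(sections)

pages = {
    "/": {"title": "Acme"},
    "/pricing": {"plans": ["Pro"]},
}
context = build_context(pages)
print(context)
```

Labeling each section with its path lets the agent attribute a fact to the page it came from. Note that truncating at a character cap can cut the JSON mid-value, so the result is meant as prompt text, not something to parse again.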

## Summary

Automating competitor research and website enrichment requires standardizing unstructured web data. By orchestrating workflows in n8n, offloading the browser rendering and extraction to an API, and using AI agents to map the resulting JSON to your internal schemas, you create a resilient, scalable data pipeline. You avoid writing brittle CSS selectors, eliminate the overhead of managing headless browsers, and ensure your databases are continuously enriched with the latest publicly available information.