Meir

How I Built an AI Agent That Maps Any Company's Supply Chain in Real-Time

Supply chain visibility is one of those problems that sounds straightforward until you actually try to solve it. Every company has suppliers. Every supplier has their own suppliers. Customers have distributors who have sub-distributors. And almost none of it is in a clean, queryable database.

I wanted to see if an AI agent could discover this web of relationships automatically - given only a company name - by reading the same public evidence a human analyst would: news articles, supplier directories, annual reports, procurement documents, partner pages, and ISO certification databases.

The result is Supply Chain Sankey: an agent that autonomously discovers upstream suppliers and downstream distributors for any company, backs every relationship with source URLs and confidence scores, and renders the whole thing as an interactive Sankey diagram.

Here's exactly how it works and what I learned building it.


The Problem With Supply Chain Discovery

If you want to know who supplies Apple, you could:

  1. Read Apple's annual report (partial, high-level)
  2. Scrape SEC filings (incomplete, lags by months)
  3. Buy expensive commercial intelligence data
  4. Hire analysts to manually research it

None of these scale, and most are expensive. But the information is publicly available — it's just scattered across thousands of web pages, PDFs, press releases, and partner directories.

This is the gap AI agents can fill. An agent that can search, scrape, and reason over public web content can replicate what an analyst does, but at machine speed and scale.


Architecture Overview

Before diving into the details, here's the high-level architecture. Four distinct systems do the heavy lifting:

  • Bright Data — web search and scraping (bypasses bot detection)
  • AWS Bedrock AgentCore — serverless agent hosting with auto-scaling
  • LangGraph — orchestrating the multi-step agent pipeline
  • Amazon Nova 2 Lite — the LLM doing query planning, evidence analysis, and classification

Try it live: demos.brightdata.com/supplychain-sankey

Step 1: Query Planning

The agent starts with just a company name. The first LLM call generates 6-7 targeted search queries designed to surface supply chain signals:

PLANNER_SYSTEM_PROMPT = """
Generate {max_queries} targeted search queries to discover
{direction} supply chain relationships for {company}.

Queries should surface:
- Supplier directories and partner pages
- Procurement documents and RFQ filings
- ISO/IATF/AS9100 certificates (often list suppliers by name)
- Annual reports with named suppliers/customers
- Trade publications with named relationships
- filetype:pdf site:example.com for procurement docs
"""

For a query like "Apple Inc. upstream suppliers", the planner might generate:

  • "Apple Inc" supplier manufacturing partner site:apple.com
  • "Apple supply chain" TSMC Samsung filetype:pdf
  • "Apple Inc" component procurement 2024
  • site:linkedin.com "Apple supplier" manufacturer
  • "Apple supplier" ISO certificate directory

The key insight: you want diverse query types. Procurement PDFs contain different information than press releases, which differ from supplier directories.


Step 2: Web Search via Bright Data

Each query goes to Bright Data's Web Unlocker API. This is where Bright Data does something standard HTTP clients can't: it handles anti-bot systems, CAPTCHAs, JavaScript rendering, and geo-targeting automatically.

import requests
from urllib.parse import quote

def bright_data_search(query: str, num_results: int = 5) -> list[dict]:
    """Run a Google search through Bright Data's Web Unlocker."""
    response = requests.get(
        "https://api.brightdata.com/request",
        headers={"Authorization": f"Bearer {BRIGHT_DATA_API_KEY}"},
        params={
            "url": f"https://www.google.com/search?q={quote(query)}&num={num_results}",
            "zone": BRIGHT_DATA_ZONE,
            "data_format": "raw",
        },
    )
    response.raise_for_status()
    # Parse and normalize to [{url, title, snippet}]
    return parse_search_results(response.json())

The 6-7 queries run in parallel using ThreadPoolExecutor with 6 workers, so the search phase takes roughly as long as a single query instead of six sequential ones.
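The fan-out can be sketched like this (the helper name and the injectable `search_fn` parameter are mine, not from the repo):

```python
from concurrent.futures import ThreadPoolExecutor

def run_queries_in_parallel(queries, search_fn, max_workers=6):
    """Run every planner query concurrently; pool.map preserves query order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_query_results = list(pool.map(search_fn, queries))
    # Flatten [[{url, title, snippet}, ...], ...] into one candidate list
    return [hit for results in per_query_results for hit in results]
```

With 6 workers and 6-7 queries, wall-clock time is dominated by the slowest single search rather than the sum of all of them.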

One thing I learned: bounding search results matters. Too few (< 2) and you miss relationships. Too many (> 8) floods the URL ranking step with noise. The sweet spot is 5 results per query.


Step 3: URL Ranking

After parallel search, you have ~40-50 URLs. Scraping all of them would be slow and wasteful. An LLM ranking step selects the top 10:

URL_RANKER_SYSTEM_PROMPT = """
Rank these URLs by their likely value for supply chain discovery.

PREFER:
- Supplier/partner directories
- Procurement documents and RFQ filings
- ISO/IATF certification databases (often list supplier names)
- Annual reports with named counterparties
- PDFs (strong signal — often formal business documents)

DEPRIORITIZE:
- Generic company about pages
- Repair manuals and user guides
- News without named relationships
- Login walls
- Social media profiles
"""

This step is cheap (small prompt, short response) but saves significant time in the expensive scraping step.
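Applying the ranker's output deterministically might look like this, assuming the LLM returns JSON shaped like `{"ranked_urls": [...]}` (that field name is my guess, not the repo's contract). Filtering against the original candidate set guards against the model inventing URLs:

```python
import json

def select_top_urls(candidate_urls, ranker_response_text, top_k=10):
    """Parse the ranker's JSON and keep the top_k URLs, dropping any
    URL that did not actually come from the search step."""
    ranked = json.loads(ranker_response_text).get("ranked_urls", [])
    seen = set(candidate_urls)
    return [u for u in ranked if u in seen][:top_k]
```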


Step 4: Scraping with Bright Data

The selected URLs get scraped in parallel. Bright Data handles two content types differently:

Web pages → fetched as cleaned Markdown:

response = requests.get(
    "https://api.brightdata.com/request",
    headers={"Authorization": f"Bearer {BRIGHT_DATA_API_KEY}"},
    params={
        "url": target_url,
        "zone": BRIGHT_DATA_ZONE,
        "data_format": "markdown",  # Bright Data converts HTML → Markdown
    },
)

PDFs → fetched as binary, then extracted with pypdf:

import io

import pypdf
import requests

if url.endswith(".pdf") or "filetype=pdf" in url:
    binary_response = requests.get(
        "https://api.brightdata.com/request",
        headers={"Authorization": f"Bearer {BRIGHT_DATA_API_KEY}"},
        params={"url": url, "zone": BRIGHT_DATA_ZONE, "data_format": "raw"},
    )
    pdf_reader = pypdf.PdfReader(io.BytesIO(binary_response.content))
    # extract_text() can return None for image-only pages
    text = "\n".join(page.extract_text() or "" for page in pdf_reader.pages)

PDFs are often the most valuable source — formal supplier lists, procurement specifications, ISO audit reports. The binary + pypdf approach handles them cleanly.

The data_format="markdown" trick is worth highlighting: instead of receiving raw HTML and parsing it, Bright Data strips boilerplate and returns clean content. This alone reduces the LLM context needed in the next step.


Step 5: LLM Evidence Reflection

Each scraped page gets sent to the LLM for compression. A raw supplier directory page might be 15,000 characters; what we actually need is 500 characters of structured evidence.

REFLECTION_SYSTEM_PROMPT = """
Extract ONLY supply-chain-relevant content from this page.

Return JSON:
{
  "content": "compressed evidence text (max 900 chars)",
  "highlights": ["key relationship 1", "key relationship 2"],
  "signal_score": 0.0-1.0,
  "named_entities": ["Company A", "Company B"]
}

INCLUDE: supplier/customer/partner relationships, ISO/IATF certificates,
procurement language, shipping/manufacturing evidence, distributor networks.

EXCLUDE: marketing copy, product descriptions, pricing, contact info,
job listings, generic company information.
"""

The signal_score field is critical. Pages scoring ≤ 0.05 get dropped before the expensive downstream LLM calls. This acts as a noise filter — a repair manual might mention a company's name but has zero supply chain signal.
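The gate itself is one line of deterministic Python (function name is mine):

```python
def filter_low_signal(evidence_items, threshold=0.05):
    """Drop pages whose reflection step scored at or below the threshold,
    so the expensive counterparty-parsing LLM calls only see real signal."""
    return [e for e in evidence_items if e.get("signal_score", 0.0) > threshold]
```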


Step 6: Counterparty Parsing and Edge Construction

High-signal evidence gets parsed to extract actual relationships:

# LLM returns structured counterparty data
{
  "counterparties": [
    {
      "name": "Taiwan Semiconductor Manufacturing Company",
      "relationship_type": "supplier",
      "confidence": 0.92,
      "direction": "upstream",
      "evidence_snippet": "TSMC manufactures Apple's A-series chips..."
    }
  ]
}

Edge construction is then deterministic — no LLM needed:

# `parsed` is the structured LLM output above; `upstream` is True when
# discovering suppliers, so edges always point toward the root company
for counterparty in parsed.counterparties:
    edge = {
        "source": counterparty.name if upstream else company,
        "target": company if upstream else counterparty.name,
        "direction": counterparty.direction,
        "relationship_type": counterparty.relationship_type,
        "confidence": counterparty.confidence,
        "tier": -1 if upstream else +1,
        "evidence_urls": source_urls,
    }
    proposed_edges.append(edge)

Deduplication merges edges with the same (source, target, direction, relationship_type) pair, taking the highest confidence score.
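A sketch of that merge. The post only specifies keeping the highest confidence; unioning the evidence URLs of duplicate edges is my assumption:

```python
def dedupe_edges(edges):
    """Merge edges sharing (source, target, direction, relationship_type),
    keeping the max confidence and the union of evidence URLs."""
    merged = {}
    for e in edges:
        key = (e["source"], e["target"], e["direction"], e["relationship_type"])
        if key not in merged:
            merged[key] = dict(e)  # copy so we don't mutate the input
        else:
            kept = merged[key]
            kept["confidence"] = max(kept["confidence"], e["confidence"])
            kept["evidence_urls"] = sorted(
                set(kept["evidence_urls"]) | set(e["evidence_urls"])
            )
    return list(merged.values())
```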


Step 7: Entity Filtering

One problem with LLM-extracted entity names: generic terms slip through. An evidence snippet saying "Apple works with authorized suppliers" might produce a node named "authorized suppliers" — which isn't a real company.

An entity filter LLM call handles this:

ENTITY_FILTER_SYSTEM_PROMPT = """
Classify each candidate entity name as a real organization or a generic placeholder.

KEEP: "Ingram Micro (China) Limited", "Foxconn Technology Group", "TSMC"
DROP: "Suppliers", "Partners", "Manufacturers", "Authorized Dealers", "OEM vendors"

Return JSON: {"keep": [...], "drop": [...]}
"""

This step significantly reduces noise in the final graph.
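Applying the filter's drop list back to the edge set might be done like this (helper name and edge handling are my sketch, built on the JSON contract in the prompt above):

```python
def apply_entity_filter(edges, drop_names, root_company):
    """Remove edges whose counterparty landed in the filter's 'drop' list."""
    dropped = set(drop_names)
    kept_edges = []
    for e in edges:
        # The counterparty is whichever endpoint is not the root company
        counterparty = e["target"] if e["source"] == root_company else e["source"]
        if counterparty not in dropped:
            kept_edges.append(e)
    return kept_edges
```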


Running on AWS Bedrock AgentCore

The entire LangGraph pipeline runs inside AWS Bedrock AgentCore — a managed runtime for AI agents that handles containerization, scaling, health checks, and observability.

The config is minimal:

# .bedrock_agentcore.yaml
agentId: supplychain_sankey-dKHoaWDft4
region: us-east-1
runtime:
  language: python
  version: "3.11"
  architecture: arm64
network:
  mode: PUBLIC
observability:
  enabled: true

The entrypoint wraps the LangGraph call:

from bedrock_agentcore import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

@app.entrypoint
def handler(payload, context):
    company = payload["company"]
    direction = payload.get("direction", "both")
    mode = payload.get("mode", "discover")

    result = run_supply_chain_discovery(
        company=company,
        direction=direction,
        mode=mode,
        tier_offset=payload.get("tier_offset", 0),
        existing_entities=payload.get("existing_entities", []),
    )

    return result

AgentCore handles the health check endpoint (/ping returns HEALTHY_BUSY while processing), CloudWatch logs, and deployment via CodeBuild → ECR. What would otherwise require a full ECS service or Kubernetes setup is reduced to a single agentcore deploy command.

The Lambda proxy sits in front, routing requests from the Next.js frontend to the AgentCore runtime:

# lambda_invoker/handler.py
response = bedrock_agentcore_client.invoke_agent_runtime(
    agentRuntimeArn=AGENT_ARN,
    payload=json.dumps(request_body),
)
# Stream chunks back to the frontend

The Visualization: D3 Sankey

The final output is a Sankey JSON structure that D3 renders as a flow diagram:

{
  "nodes": [
    {"id": "Apple Inc.", "tier": 0},
    {"id": "TSMC", "tier": -1},
    {"id": "Ingram Micro", "tier": 1}
  ],
  "links": [
    {
      "source": "TSMC",
      "target": "Apple Inc.",
      "tier": -1,
      "direction": "upstream",
      "relationship_type": "supplier",
      "confidence": 0.92,
      "status": "confirmed",
      "evidence_urls": ["https://..."]
    }
  ]
}

Node colors encode position in the supply chain:

  • Blue (tier 0): the root company
  • Purple (tier < 0): upstream suppliers
  • Cyan (tier > 0): downstream distributors

Link colors encode confidence:

  • Green: confirmed relationships
  • Amber: unconfirmed (plausible but not definitively verified)
  • Red: rejected

Clicking a node opens an evidence drawer showing the source URLs. Clicking expand buttons triggers another AgentCore invocation to discover the supplier's suppliers — recursively building a deeper map.
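An expansion request reuses the same entrypoint as the initial discovery. This payload is illustrative: the field names mirror the handler shown earlier, but the "expand" mode value and the concrete tier numbers here are my assumptions:

```python
# Hypothetical payload for expanding the TSMC node one tier deeper
expand_payload = {
    "company": "TSMC",           # the node being expanded, not the root
    "direction": "upstream",
    "mode": "expand",
    "tier_offset": -1,           # TSMC sits at tier -1; its suppliers land at tier -2
    "existing_entities": ["Apple Inc.", "TSMC", "Ingram Micro"],
}
```

Passing existing_entities lets the agent skip counterparties already on the diagram instead of rediscovering them.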


What Can Go Wrong

Generic entity leakage: Even with the entity filter, terms like "key partners" occasionally survive. A second-pass filter with more specific examples in the prompt helps.

PDF extraction failures: Some PDFs are image-only (scanned documents). pypdf extracts nothing. The fallback — fetching the PDF URL as markdown — sometimes returns something useful from the surrounding HTML, but often doesn't. Worth logging these separately.

LLM hallucination in entity names: The evidence reflection step occasionally generates entity names that combine two separate companies from the same page. Adding "only extract names that appear verbatim in the source" to the prompt reduced this significantly.


Results

Running the agent on Apple Inc. produces a Sankey diagram with ~14-18 nodes and ~16-20 links in roughly 45-90 seconds, depending on scraping latency. The relationships discovered include:

  • TSMC, Samsung Electro-Mechanics, Murata, Foxconn (upstream suppliers)
  • Ingram Micro, TD Synnex, Authorized Resellers (downstream distributors)
  • Each backed by actual source URLs with specific evidence snippets

Expanding a supplier node (e.g., TSMC) then discovers TSMC's upstream relationships — recursively building a multi-tier map.


Key Takeaways

  • Bright Data Web Unlocker removes the anti-bot problem entirely — pages that block standard scrapers are fetched cleanly, including JavaScript-rendered content and PDFs
  • Signal scoring before LLM calls is essential for cost control — filter obvious noise early
  • AWS Bedrock AgentCore makes deploying a LangGraph agent to production straightforward — no container orchestration overhead
  • Amazon Nova 2 Lite runs reliably at temperature=0.0 for structured JSON extraction tasks
  • Entity filtering is a non-optional step — without it, generic role names pollute the graph

FAQ

Can this run against any company?
Yes — any publicly traded or well-documented private company with a web presence. Less-documented companies return fewer nodes.

How accurate are the relationships?
Confidence scores reflect evidence quality. "Confirmed" edges (green) come from multiple independent sources. "Unconfirmed" (amber) means one plausible source.
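That mapping could be expressed as a small deterministic function; the thresholds below are illustrative, not the repo's actual cutoffs:

```python
def edge_status(confidence, independent_source_count):
    """Map evidence quality to a display status (thresholds are my guesses)."""
    if independent_source_count >= 2 and confidence >= 0.7:
        return "confirmed"    # rendered green
    if confidence >= 0.3:
        return "unconfirmed"  # rendered amber
    return "rejected"         # rendered red
```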

What does Bright Data cost for this workload?
A single discovery run makes ~50 search requests and ~10 scrape requests. At typical Web Unlocker pricing this is a few cents per run.

Can I expand to 3rd-tier suppliers?
Yes — the expand button triggers a new AgentCore invocation with tier_offset set so the new nodes appear at the correct depth in the existing graph.

Does this replace commercial supply chain intelligence tools?
No. Commercial tools have curated databases, historical tracking, and coverage guarantees. This agent is best for discovery and exploration — finding relationships you didn't know to look for.


Running It Yourself

The full repo is open source. Deploy order:

  1. AgentCore: agentcore deploy (handles ECR + runtime provisioning)
  2. Lambda proxy: sam deploy in infra/
  3. Frontend: npm run dev with LAMBDA_URL set

Required env vars: BRIGHT_DATA_API_KEY, BRIGHT_DATA_ZONE.

GitHub repo → supplychain-sankey

If you build something similar or extend this to other domains (M&A networks, academic citation graphs, ecosystem mapping), drop a comment — I'd like to see what variations emerge.

Top comments (2)

Varun S

I was thinking to build something similar but you beat me to it. great work!

Meir

Thank you Varun! It's open source - feel free to tweak it (or contribute, which is even better)!