CIZO · Posted on DEV Community

Building an AI Sourcing System for Industrial Components: What We Learned

I want to talk about a class of AI system design problem that doesn't get much attention.

Most of the AI architecture content out there is about making systems smarter, faster, or more capable. RAG pipelines, fine-tuning, better prompts, multi-agent orchestration. All useful stuff.

But there's a different problem that comes up the moment you deploy AI into a domain where output precision is non-negotiable. And the solution isn't a smarter model. It's a smarter system design around the model.

We ran into this building an AI sourcing system for industrial components. Here's what we learned.


The Setup

The goal was to let procurement engineers search for industrial components — bolts, springs, fasteners — using natural language. Instead of navigating complex filter forms and needing to know exact specs upfront, they could just describe what they need.

Simple enough idea. The problem showed up the moment we thought carefully about what "wrong output" means in this domain.

In a content recommendation system, a bad result is annoying. The user scrolls past it.

In industrial procurement:

wrong material standard → wrong part → installed in machinery → failure

That changes the entire design constraint. You're not optimizing for recall or a good average. You're optimizing to never return a technically invalid result.


Why the Obvious Architecture Fails Here

The first approach most teams reach for looks like this:

User query → LLM → Results

The LLM interprets the query, maybe generates some filters, searches the catalog, returns results. Clean. Simple. Fast to build.

Here's the failure mode.

When a user types "spring for high load in a small space", the query is underspecified. A lot of parameters are missing — load range, material, dimensions, spring type, applicable standard. The LLM will fill those in. That's what LLMs do. They generate plausible completions.

The generated values might be totally reasonable. They also might be wrong. And the system won't flag which parameters it assumed versus which it derived from actual input. It just responds.

In practice, this creates what I'd call the confident wrong answer problem. The output is well-formatted, sounds authoritative, and is technically incorrect. That's harder to catch than an obvious error, and in this domain, it gets acted on.

We tested this extensively in the early design phase. With a standard LLM-driven search approach, the system would return results that passed a surface-level inspection but failed when checked against engineering standards. Not every time. Just often enough that the system couldn't be trusted in production.


The Core Design Insight

Here's the reframe that unlocked the right architecture:

AI is reliable at understanding intent. AI is not reliable at making technical decisions.

These feel like the same thing but they're not. Understanding that a user means "corrosion-resistant outdoor bolt" is a language task. LLMs are genuinely good at this. Deciding that the correct material is A4 stainless steel conforming to ISO 4017 is an engineering task. That requires domain rules, constraints, and validated logic — not statistical inference over training data.

The mistake in the obvious architecture is giving AI both jobs. The fix is splitting them.


The Architecture We Built

User Input (Natural Language)
        ↓
┌─────────────────────────┐
│   Intent Extraction     │  ← AI layer. Interprets meaning only.
│   (AI — NLU)            │
└─────────────────────────┘
        ↓
┌─────────────────────────┐
│  Parameter Structuring  │  ← Engineering logic. Converts intent
│  Layer                  │    to bounded valid ranges.
└─────────────────────────┘
        ↓
┌─────────────────────────┐
│   Controlled Search     │  ← Searches on structured params,
│                         │    not raw NL query.
└─────────────────────────┘
        ↓
┌─────────────────────────┐
│   Validation Layer      │  ← Every candidate checked before
│                         │    surfacing. No exceptions.
└─────────────────────────┘
        ↓
   Precise Output

Six layers in total: the four boxes above, numbered 2 through 5, bookended by input handling and output delivery. Each one has a narrow, specific responsibility. Let me go through the important ones.


Layer 2: Intent Extraction (The AI Layer)

This is where the LLM lives. Its job is purely extractive. Given a natural language query, return structured intent.

For "I need a corrosion-resistant bolt for outdoor use":

{
  "product_type": "bolt",
  "use_case": "outdoor",
  "requirements": ["corrosion_resistant"]
}

That's it. The LLM does not:

  • Generate material specifications
  • Suggest applicable standards
  • Produce dimensional ranges
  • Make any technical decisions

It interprets language. Nothing else. The moment you let it do more, you've reintroduced guesswork into the pipeline.
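One cheap way to hold that line is to validate the model's output against a strict allow-list before anything downstream sees it. Here's a minimal sketch (the LLM call itself is out of scope; `parseIntent` and the field names are illustrative, not our production code):

```typescript
// Guard for the intent-extraction layer: any field outside the intent
// schema is rejected outright, so technical values the model "helpfully"
// adds (materials, standards, dimensions) can never leak downstream.

type Intent = {
  product_type: string;
  use_case: string;
  requirements: string[];
};

const ALLOWED_KEYS = new Set(["product_type", "use_case", "requirements"]);

function parseIntent(raw: string): Intent {
  const obj = JSON.parse(raw);
  for (const key of Object.keys(obj)) {
    if (!ALLOWED_KEYS.has(key)) {
      // e.g. the model added "material": "stainless_a4" — a technical decision
      throw new Error(`Disallowed field in intent: ${key}`);
    }
  }
  if (
    typeof obj.product_type !== "string" ||
    typeof obj.use_case !== "string" ||
    !Array.isArray(obj.requirements)
  ) {
    throw new Error("Intent does not match schema");
  }
  return obj as Intent;
}
```

Rejecting (rather than silently stripping) extra fields is deliberate: a model that starts emitting technical fields is a prompt regression you want surfaced, not hidden.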


Layer 3: Parameter Structuring (The Critical Layer)

This is the layer most teams skip, and it's the most important one.

The extracted intent goes into a structured parameter engine built on engineering rules. This engine converts intent into bounded valid possibilities — not final values, but candidate sets with explicit exclusions.

For our bolt example:

Input intent:
  product_type: bolt
  use_case: outdoor
  requirements: [corrosion_resistant]

Output parameters:
  material_candidates: [stainless_a2, stainless_a4]
  coating_candidates: [hot_dip_galvanized, none_if_a4]
  standard_candidates: [iso_4017, iso_4018, din_933]
  exclusions: [carbon_steel_uncoated, aluminum_without_load_check]
  open_parameters: [size, thread_pitch, length]

Two rules this layer enforces:

  1. No parameter is assumed. Every value is either derived from the intent, mapped from engineering rules, or left explicitly open. Never blindly filled in.

  2. Exclusions are applied before search. The system doesn't search for carbon steel outdoor bolts and then filter them out. They never enter the search space at all.

The difference between "filter out after retrieval" and "exclude before search" matters more than you'd think. Filtering after retrieval still means the system briefly considered invalid candidates. Upstream exclusions make those candidates structurally impossible to return.
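To make the shape of this layer concrete, here's a toy version of the rules mapping, hardcoded to the bolt example above. The real engine is a much larger rules table; the function and type names here are illustrative:

```typescript
// Tiny rules engine sketch: intent in, bounded candidate sets out.
// Nothing is guessed — every output value is rule-derived, excluded,
// or left explicitly open for the user to supply.

type Intent = { product_type: string; use_case: string; requirements: string[] };

type StructuredParams = {
  material_candidates: string[];
  standard_candidates: string[];
  exclusions: string[];
  open_parameters: string[];
};

function structureParameters(intent: Intent): StructuredParams {
  const params: StructuredParams = {
    material_candidates: [],
    standard_candidates: [],
    exclusions: [],
    // Dimensions were never stated, so they stay open — never filled in.
    open_parameters: ["size", "thread_pitch", "length"],
  };

  if (intent.product_type === "bolt") {
    params.standard_candidates.push("iso_4017", "iso_4018", "din_933");
  }
  if (
    intent.use_case === "outdoor" ||
    intent.requirements.includes("corrosion_resistant")
  ) {
    params.material_candidates.push("stainless_a2", "stainless_a4");
    // Exclusion attached here, before any search runs.
    params.exclusions.push("carbon_steel_uncoated");
  }
  return params;
}
```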


Layer 4: Controlled Search

Search runs on the structured parameters from Layer 3. Not on the original natural language query.

search_params = {
    "product_type": "bolt",
    "material": ["stainless_a2", "stainless_a4"],
    "standards": ["iso_4017", "din_933"],
    "application_tags": ["outdoor"],
    "exclude_materials": ["carbon_steel_uncoated"]
}

candidates = catalog.search(search_params)

No semantic similarity search on specs. No fuzzy matching on technical fields. Structured query against structured catalog data.

This eliminates a whole class of retrieval errors where results are linguistically related to the query but technically incompatible with the requirement.
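In spirit, the catalog query looks like the following (a minimal in-memory stand-in — the real system runs this against PostgreSQL; types and field names are illustrative). Note that the exclusion list shrinks the search space itself, before matching starts:

```typescript
// Structured search over structured catalog data. Exclusions are applied
// to the search space first, so invalid parts never enter the match loop —
// this is "exclude before search", not "filter after retrieval".

type Part = {
  id: string;
  product_type: string;
  material: string;
  standard: string;
  tags: string[];
};

type SearchParams = {
  product_type: string;
  material: string[];
  standards: string[];
  application_tags: string[];
  exclude_materials: string[];
};

function search(catalog: Part[], p: SearchParams): Part[] {
  // Step 1: exclusions remove candidates from the space outright.
  const space = catalog.filter(
    (part) => !p.exclude_materials.includes(part.material)
  );
  // Step 2: exact structured matching — no fuzzy or semantic matching.
  return space.filter(
    (part) =>
      part.product_type === p.product_type &&
      p.material.includes(part.material) &&
      p.standards.includes(part.standard) &&
      p.application_tags.every((t) => part.tags.includes(t))
  );
}
```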


Layer 5: Validation (Zero Tolerance)

Every candidate from Layer 4 goes through validation before it can reach the user.

For each candidate:
  ✓ Parameter compatibility check
      (do all specs work together as a system?)
  ✓ Standard compliance check
      (does this part actually conform to claimed standard?)
  ✓ Availability check
      (is this sourceable right now?)
  ✓ Constraint compatibility
      (no spec conflicts)

  If any check fails → reject
  User never sees it

No approximate matches. No "this is close." The output only contains candidates that have passed every check.

This is the layer that draws the hard line between "probably correct" and "confirmed correct."
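Structurally, the gate is just an all-or-nothing check chain. A sketch, with placeholder checks standing in for the real compatibility, compliance, and availability logic:

```typescript
// Zero-tolerance validation gate: a candidate surfaces only if it passes
// every check. One failure → rejected, and the user never sees it.
// The check bodies here are stubs; the real ones encode engineering rules.

type Candidate = {
  id: string;
  specs: Record<string, string>;
  in_stock: boolean;
};

type Check = (c: Candidate) => boolean;

const checks: Check[] = [
  (c) => c.specs["material"] !== undefined,      // parameter compatibility (stub)
  (c) => (c.specs["standard"] ?? "").length > 0, // standard compliance (stub)
  (c) => c.in_stock,                             // availability
];

function validate(candidates: Candidate[]): Candidate[] {
  // every() gives the hard line: no partial credit, no "close enough".
  return candidates.filter((c) => checks.every((check) => check(c)));
}
```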


Cross-Cutting Systems

Three systems run in parallel across the full pipeline.

Pattern Learning Engine
Stores validated parameter combinations from successful query-to-result flows. When similar intent comes in again, pre-validated mappings get reused. Reduces re-derivation overhead and improves consistency over time.
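The core of it is a cache keyed on normalized intent — normalization matters, because "corrosion-resistant outdoor bolt" and "outdoor bolt, corrosion resistant" must hit the same entry. A hypothetical sketch (function names are mine, not the production system's):

```typescript
// Pattern-learning cache sketch: validated intent→parameter mappings are
// stored under a normalized key and reused when equivalent intent recurs.

type Intent = { product_type: string; use_case: string; requirements: string[] };

function intentKey(intent: Intent): string {
  // Sort requirements so their order in the query doesn't matter.
  return [
    intent.product_type,
    intent.use_case,
    [...intent.requirements].sort().join("+"),
  ].join("|");
}

const validatedMappings = new Map<string, unknown>();

function remember(intent: Intent, params: unknown): void {
  validatedMappings.set(intentKey(intent), params);
}

function recall(intent: Intent): unknown | undefined {
  return validatedMappings.get(intentKey(intent));
}
```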

Feedback and Correction Loop
Any mismatch that gets flagged — by users, by downstream checks, by monitoring — triggers a rule update in the parameter engine. The same error doesn't repeat. This is how the system gets more accurate with use rather than degrading.

Quality Monitoring
Tracks accuracy metrics per query type. Fires alerts when result quality starts to drift. Essential when the catalog scales — a catalog growing from 20k to 200k parts will introduce edge cases the original rules didn't cover, and you want to catch those before they affect production output.

One honest admission: we built the feedback loop later than we should have. It came in partway through the project rather than from the start. Every wrong result before that was a missed learning opportunity. In any future project like this, the correction loop goes in on day one.


Tech Stack

Layer              Technology
-----------------  -------------------------------------
Intent extraction  OpenAI GPT-4o / Claude API
Parameter engine   Custom rules engine (Node.js)
Search             PostgreSQL with structured queries
Validation         Rules + AI-assisted constraint checks
Feedback loop      Event logging + async rule updates
Monitoring         Custom metrics + alerting
Inventory          Client-specific API integrations

The stack is conventional. Nothing exotic. The architecture around the stack is what matters.


What the Results Looked Like

After this went live in a real industrial procurement workflow:

  • Incorrect component matches dropped significantly
  • Decision time for procurement engineers decreased
  • Engineers who had been skeptical of AI results started trusting and using the system

That last point is the meaningful one. Experienced engineers in technical domains are appropriately skeptical of AI output. When they start trusting a system, it's because the system has actually earned it — not because the UI looks good.


Design Principles Worth Keeping

A few things we'd apply to any AI system in a precision-critical domain.

Separate intent understanding from technical decision-making. They're different cognitive tasks that need different system components. Conflating them is where most AI accuracy problems start.

Structure before search. Always convert natural language to structured parameters before hitting retrieval. Raw NL queries introduce semantic drift into the retrieval layer.

Upstream exclusions beat downstream filters. If something is technically invalid, exclude it before it enters the search space. Don't retrieve it and filter it out.

Validation is not optional. In high-stakes AI, validation isn't a quality-of-life feature. It's the feature that makes the system trustworthy.

Build correction loops early. Every wrong output is training data for your rule engine — but only if you're capturing and acting on it from the start.


Closing Thought

The failure mode for AI in technical domains is predictable. AI gets too much autonomy over decisions it can't make reliably. The system produces confident-sounding output that's technically wrong. It erodes trust. The project either gets abandoned or lives permanently in "supervised review" mode where a human checks every output.

The fix isn't a more capable model. It's a more carefully designed system around the model. One where AI does what it's genuinely good at — understanding language — and deterministic logic does what it's genuinely good at — enforcing constraints and validating results.

AI for understanding.
System for control.

If you're building something similar, or running into this class of problem, happy to talk through it. Find us at cizotech.com or drop a comment below.


Built by the team at CIZO. We build production-grade AI systems, mobile apps, and IoT solutions. hello@cizotech.com

canonical_url: "https://cizotech.com/blog/ai-controlled-industrial-sourcing-decision-system"
