
Yurii Lozinskyi

Building an AI Matching Engine Without Big Tech Resources

Pairfect IO Case Study + Practical Framework

Most people think matching in marketplaces is just filters + sorting.

It isn’t.
Matching is architecture. It's the mechanism that decides who should meet whom.

When matching fails, the entire marketplace collapses — no UX, no design, and no advertising budget can save it.

This post is about how we built an AI-powered matching engine for Pairfect IO, a marketplace connecting brands with influencers — without:

  • training data
  • behavioral signals
  • feedback loops
  • GPUs
  • ML ops stacks
  • Pinecone/Milvus/Weaviate
  • and without 200 ML engineers like LinkedIn

Everything ran on PostgreSQL + pgvector, with explainability, determinism, and an evolution path.

If you're building a marketplace and need matching that works before you have Big Tech data — this is for you.


Why Matching Is Harder Than It Looks

Matching looks trivial from the outside. But production-grade matching is an outcome-driven system.

Take LinkedIn. Their matching works because it learns from:

  • applications
  • acceptance rates
  • recruiter behavior
  • network overlap
  • engagement signals
  • retention data

In other words: LinkedIn doesn’t “guess relevance”. It learns relevance from outcomes.

Now contrast that with a seed-stage marketplace.

Pairfect started with:

  • no labeled data
  • no behavioral data
  • no interactions
  • no click-through signals
  • no embeddings graph
  • no GPUs
  • Postgres as the only accepted infra

Completely different world.

Yet a common mistake early teams make is trying to copy Big Tech architecture without Big Tech data. It doesn’t work.


The Real Beginning: Constraints, Not Models

Most teams begin matching by asking:

“Which ML model should we use?”

We started by asking a different question:

“What constraints make certain architectures impossible?”

Below is a simplified version of our real constraint table:

Constraint → Impact

  • Self-funded → No GPUs, no distributed systems
  • Must run on Postgres → Matching logic must be SQL-native
  • No labels → No LTR, no two-tower training
  • CPU only → Lightweight embeddings only
  • MVP in 3 months → Simple > complex
  • Need explainability → No black-box ranking
  • Sparse metadata → Must extract from text
  • Minimal DevOps → No vector DB clusters

This table was the architecture.

Before we wrote a single line of code, we knew what we couldn’t build.

And ironically, that saved Pairfect, a self-funded startup.


Defining What “Good Match” Means (Critical & Often Missed)

You cannot architect matching until you define what a good match means in your domain.

For LinkedIn, a “good match” means:

hired + retained

For Pairfect, a “good match” meant:

  • semantic fit between campaign & influencer
  • audience expectations align
  • tone compatibility
  • price compatibility
  • content format alignment
  • worldview alignment (yes, this matters in the creator space)

If your team cannot answer:

“What constitutes a good match here?”

Then any discussion of embeddings vs rules vs transformers is premature.


Why We Didn’t Go Straight for SOTA Models

We evaluated the standard architectural options. Most didn’t survive the constraint filter:

Option → Why Not (At MVP Stage)

  • Rules-only → Too rigid
  • Pure embeddings → Too noisy without deterministic anchors
  • LLM ranking → Too slow + expensive on CPU
  • Learning-to-Rank → Needs labeled data
  • Two-tower → Needs training data + GPUs
  • Collaborative filtering → Needs behavior data
  • Graph models → Needs graph maturity

That left one viable category:

Hybrid Matching

Not because it's “cool” — but because it’s appropriate for the stage.


The Architecture: Hybrid Matching in Practice

Our hybrid pipeline looked like this:

Hard Filters → One-Hot Features → Embeddings → Fusion → Top-K

Breakdown:

1. Hard Filters

Eliminate impossible cases upfront:

  • price
  • language
  • content format
  • region
  • campaign type

This strips out obvious noise before any scoring happens.

Example (simplified):

SELECT *
FROM influencers
WHERE price BETWEEN 500 AND 1500
  AND language = 'en'
  AND region = 'eu'
  AND format @> ARRAY['video']::text[];

2. One-Hot Signals

Encode domain knowledge explicitly:

  • tone
  • niche
  • vertical
  • channel
  • creative style

This prevents “semantic nonsense” (e.g., matching a financial brand with a prank channel).

-- "campaign" is the campaign being matched (:campaign_id is a placeholder parameter)
SELECT i.influencer_id,
       (CASE WHEN i.tone = campaign.tone THEN 1 ELSE 0 END) AS tone_match,
       (CASE WHEN i.vertical = campaign.vertical THEN 1 ELSE 0 END) AS vertical_match
FROM influencers i
CROSS JOIN campaigns AS campaign
WHERE campaign.id = :campaign_id;

3. Embeddings

We generated embeddings for:

  • bios
  • captions
  • descriptions
  • LLM summaries

Stored in pgvector, similarity via cosine.

-- "campaign" is the campaign being matched; <=> is pgvector's cosine distance operator
SELECT i.influencer_id,
       1 - (i.bio_embedding <=> campaign.bio_embedding) AS semantic_score
FROM influencers i
CROSS JOIN campaigns AS campaign
WHERE campaign.id = :campaign_id
ORDER BY semantic_score DESC
LIMIT 50;
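
The storage side stays small too. A minimal setup sketch, assuming 384-dimensional embeddings from a lightweight CPU encoder (the dimension and index parameters here are illustrative, not Pairfect's exact values):

CREATE EXTENSION IF NOT EXISTS vector;

ALTER TABLE influencers
  ADD COLUMN bio_embedding vector(384);  -- dimension depends on the encoder

-- approximate nearest-neighbour index for cosine distance
CREATE INDEX ON influencers
  USING ivfflat (bio_embedding vector_cosine_ops)
  WITH (lists = 100);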

4. Rank Fusion (RRF)

This was surprisingly powerful.

Reciprocal Rank Fusion (RRF) let us merge multiple ranking signals into one stable ranking, without any training:

Score = Σ 1 / (k + rank_i)
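
For example, with k = 60, a candidate ranked 1st semantically, 5th on tone, and 3rd on vertical scores 1/61 + 1/65 + 1/63 ≈ 0.048, while one ranked 2nd, 1st, and 1st scores 1/62 + 1/61 + 1/61 ≈ 0.049: consistently strong ranks beat a single spike.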

Example (simplified in SQL/CTE form):

-- "candidates" is the filtered set carrying semantic_score, tone_match, vertical_match
WITH ranked AS (
  SELECT influencer_id,
         ROW_NUMBER() OVER (ORDER BY semantic_score DESC) AS r1,
         ROW_NUMBER() OVER (ORDER BY tone_match DESC) AS r2,
         ROW_NUMBER() OVER (ORDER BY vertical_match DESC) AS r3
  FROM candidates
)
SELECT influencer_id,
       (1.0 / (60 + r1)) +   -- k = 60, the conventional RRF constant
       (1.0 / (60 + r2)) +
       (1.0 / (60 + r3)) AS final_score
FROM ranked
ORDER BY final_score DESC
LIMIT 10;

Benefits:

  • no ML pipeline
  • consistent behavior
  • explainable scoring
  • cheap to compute
  • resistant to noisy embeddings

5. Top-K Output

Return a shortlist, not an infinite scroll.

Top 10 most compatible influencers
+ explanation layer

This is not personalization; it is decision support.


Why Everything Ran on PostgreSQL

Our entire matching system ran on:

PostgreSQL + pgvector + CPU

Reasons:

  • infra should reduce risk, not increase it
  • one system > five microservices
  • fewer moving parts = fewer failures
  • debugging in SQL is fast & deterministic
  • product iteration > infra optimization

Hot take:

infra is not tooling, infra is liability

Especially at the MVP stage.
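
To make the "one system" point concrete: the whole pipeline fits into a single statement. A simplified sketch, reusing the toy schema from the examples above (the campaign budget and metadata columns are assumptions for illustration):

WITH campaign AS (
  SELECT * FROM campaigns WHERE id = :campaign_id
),
filtered AS (                              -- 1. hard filters
  SELECT i.*
  FROM influencers i, campaign c
  WHERE i.price BETWEEN c.budget_min AND c.budget_max
    AND i.language = c.language
    AND i.region = c.region
),
scored AS (                                -- 2 + 3. one-hot signals + semantic similarity
  SELECT f.influencer_id,
         (CASE WHEN f.tone = c.tone THEN 1 ELSE 0 END)         AS tone_match,
         (CASE WHEN f.vertical = c.vertical THEN 1 ELSE 0 END) AS vertical_match,
         1 - (f.bio_embedding <=> c.bio_embedding)             AS semantic_score
  FROM filtered f, campaign c
),
ranked AS (                                -- per-signal ranks
  SELECT influencer_id,
         ROW_NUMBER() OVER (ORDER BY semantic_score DESC) AS r1,
         ROW_NUMBER() OVER (ORDER BY tone_match DESC)     AS r2,
         ROW_NUMBER() OVER (ORDER BY vertical_match DESC) AS r3
  FROM scored
)
SELECT influencer_id,                      -- 4 + 5. RRF fusion, Top-K
       (1.0 / (60 + r1)) + (1.0 / (60 + r2)) + (1.0 / (60 + r3)) AS final_score
FROM ranked
ORDER BY final_score DESC
LIMIT 10;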


Explainability Was a Feature, Not a Nice-to-Have

We built full explainability into the matching layer:

  • why this recommendation
  • which signals contributed
  • how fusion scored them
  • what would disqualify it
  • how to override
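
As a sketch of what that can look like in SQL, assuming a "fused" relation that carries the per-signal scores and ranks produced by the fusion step above (the names are illustrative), the explanation payload is assembled right next to the score:

SELECT influencer_id,
       final_score,
       jsonb_build_object(
         'semantic_score', semantic_score,   -- cosine similarity from pgvector
         'semantic_rank',  r1,
         'tone_match',     tone_match,       -- one-hot signals
         'tone_rank',      r2,
         'vertical_match', vertical_match,
         'vertical_rank',  r3
       ) AS explanation
FROM fused
ORDER BY final_score DESC
LIMIT 10;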

Trust matters in early marketplaces.

LinkedIn can hide behind a black box.

Startups cannot.


The Evolution Path (Critical CTO Work)

Founders often ask:

“Will hybrid scale forever?”

No. And it doesn’t need to.

Our planned evolution path looked like this:

Hybrid → Behavioral Signals → LTR → Two-Tower → Graph → RL → Agents

Where each step unlocks the next:

  • hybrid gives usable matching Day 1
  • behavior gives labels
  • labels enable LTR
  • scale enables encoders
  • graph enables multi-objective optimization
  • RL enables personalization
  • agents enable reasoning

This is how marketplace intelligence actually grows in the real world.


Final Lessons

Three lessons emerged from building Pairfect:

Lesson 1 — Matching is not a model problem; it’s a business constraint problem
Lesson 2 — Appropriate complexity wins at the MVP stage. Over-engineering extends time-to-market
Lesson 3 — You don’t need Big Tech architecture without Big Tech data

The goal is not to replicate LinkedIn.

The goal is to build a system that is honest about your stage and prepared to evolve.


If you’re building something similar

Happy to discuss:

  • marketplace matching
  • ranking architectures
  • hybrid systems
  • pgvector setups
  • evolution paths

DMs open.
