Using MongoDB Atlas Vector Search + OpenAI Embeddings
By Emmanuel Ifeanyi Mechie
Overview
This document explains how I built a fully semantic, AI-powered talent matching system using:
- MongoDB Atlas Vector Search
- OpenAI text-embedding-3-small
- Node.js + TypeScript
The system solves the limitations of traditional keyword matching by using high-dimensional vector embeddings to understand the meaning behind job requirements and user skills, enabling semantic candidate-job matching with high accuracy and drastically improved performance.
Why We Needed a Semantic Matching System
The previous approach used simple keyword matching:
β "React Developer" only matched profiles with the exact phrase
β "Problem solving" β "Analytical thinking"
β Recruiter results returned irrelevant profiles
β Matching time was ~12 seconds
I redesigned the matching engine so it understands semantics, not words β resulting in 12x faster (10-12s β 1s) and significantly more accurate results.
System Architecture Overview
Job Requirements → Embedding → MongoDB Vector Search → Ranked Candidate List
User Skills → Embeddings → MongoDB (UserSkills collection)
Skill Library → Precomputed Embeddings → MongoDB (SkillEmbedding collection)
MongoDB handles vector indexing using its native Atlas Vector Search, which gives us:
- No external vector DB needed
- Fast cosine similarity matching
- Filtering + vector search in the same aggregation pipeline
- Seamless integration with existing collections
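To make the flow concrete, here is a condensed sketch of what happens on a single search request. The JobRequirements shape, the EmbeddingService class name, and the runVectorSearch / assignFitLabels helpers are illustrative placeholders; each real step is detailed in the sections below.

```typescript
// Illustrative end-to-end flow; helper names are placeholders, not the exact project API.
interface JobRequirements {
  roleTitle: string;
  skills: string[];
  experienceLevel?: string;
  duration?: string;
  notes?: string;
}

async function matchCandidates(job: JobRequirements) {
  // Step 1: turn the whole job description into a single 1536-dim vector
  const queryEmbedding = await EmbeddingService.generateJobEmbedding(
    job.roleTitle,
    job.skills,
    job.experienceLevel,
    job.duration,
    job.notes,
  );

  // Steps 3-4: Atlas $vectorSearch ranks candidates by cosine similarity
  const candidates = await runVectorSearch(queryEmbedding);

  // Steps 5-6: percentile-based fit labels + top 20 relevant skills per candidate
  return assignFitLabels(candidates);
}
```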
Step 1: Generating Embeddings
I use OpenAI text-embedding-3-small (1536 dimensions) for:
- Job requirements (role title, skills, experience level, duration, notes)
- User profiles (first name, last name, email, aggregated skills)
- Individual skills (precomputed & cached in the SkillEmbedding collection)
User Profile Embedding (TypeScript)
static async generateUserEmbedding(
firstName: string,
lastName: string,
email: string,
skills: string[],
): Promise<number[]> {
// Format skills (convert underscores to spaces, capitalize)
const formattedSkills = skills
.map((skill) =>
skill
.split('_')
.map((word) => word.charAt(0).toUpperCase() + word.slice(1))
.join(' '),
)
.join(', ');
// Create text for embedding
const embeddingText = `Professional Profile:
Name: ${firstName} ${lastName}
Email: ${email}
Skills: ${formattedSkills}`;
// `client` is the OpenAI SDK client instance, initialized elsewhere (e.g. new OpenAI())
const response = await client.embeddings.create({
model: 'text-embedding-3-small',
input: embeddingText,
dimensions: 1536,
});
return response.data[0].embedding;
}
Job Requirements Embedding
static async generateJobEmbedding(
roleTitle: string,
skills: string[],
experienceLevel?: string,
duration?: string,
notes?: string,
): Promise<number[]> {
// Format skills similarly
const formattedSkills = skills
.map((skill) =>
skill
.split('_')
.map((word) => word.charAt(0).toUpperCase() + word.slice(1))
.join(' '),
)
.join(', ');
let embeddingText = `Job Requirements:
Role: ${roleTitle}
Required Skills: ${formattedSkills}`;
if (experienceLevel) {
embeddingText += `\nExperience Level: ${experienceLevel}`;
}
if (duration) {
embeddingText += `\nDuration: ${duration}`;
}
if (notes) {
embeddingText += `\nAdditional Notes: ${notes}`;
}
const response = await client.embeddings.create({
model: 'text-embedding-3-small',
input: embeddingText,
dimensions: 1536,
});
return response.data[0].embedding;
}
Step 2: Storing Embeddings in MongoDB
UserSkills Collection
Stores user profile embeddings with aggregated skills:
{
"_id": "123",
"user": ObjectId("456"),
"profile": ObjectId("789"),
"aggregatedSkills": ["react", "node_js", "problem_solving_skills"],
"embedding": [1536 floats],
"embeddingModel": "text-embedding-3-small",
"embeddingUpdatedAt": "2025-01-20T10:00:00Z",
"createdAt": "2025-01-01T10:00:00Z",
"updatedAt": "2025-01-20T10:00:00Z"
}
SkillEmbedding Collection
Stores reusable, precomputed skill embeddings for performance:
{
"_id": "abc",
"skill": "team_leadership",
"embedding": [1536 floats],
"embeddingModel": "text-embedding-3-small",
"embeddingUpdatedAt": "2025-01-15T10:00:00Z",
"createdAt": "2025-01-15T10:00:00Z"
}
Key Point: Pre-computing skill embeddings eliminates the need to generate them on-the-fly during candidate searches, dramatically improving performance.
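For reference, here is a minimal Mongoose sketch of the SkillEmbedding model (the project uses Mongoose/Typegoose; the exact field options below are assumptions based on the document shape shown above):

```typescript
import { Schema, model } from 'mongoose';

// Minimal sketch of the SkillEmbedding collection; mirrors the JSON document above.
const skillEmbeddingSchema = new Schema(
  {
    skill: { type: String, required: true, unique: true },
    embedding: { type: [Number], required: true }, // 1536 floats
    embeddingModel: { type: String, default: 'text-embedding-3-small' },
    embeddingUpdatedAt: { type: Date, default: Date.now },
  },
  { timestamps: true },
);

export const SkillEmbeddingModel = model('SkillEmbedding', skillEmbeddingSchema);
```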
Step 3: Creating the MongoDB Vector Search Index
In MongoDB Atlas, I created a vector search index named userSkills_vector_index:
{
"fields": [
{
"path": "embedding",
"type": "vector",
"numDimensions": 1536,
"similarity": "cosine"
}
]
}
Important: The index is on the embedding field in the userSkills collection.
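The index can also be created programmatically instead of through the Atlas UI. Here is a minimal sketch using the Node.js driver (assumes driver 6.6+, an Atlas cluster, and the database/collection names shown, which are placeholders):

```typescript
import { MongoClient } from 'mongodb';

// One-off script: create the Atlas Vector Search index from code.
// Adjust the database and collection names to your deployment.
async function createVectorIndex(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  try {
    const userSkills = client.db('talent').collection('userskills');
    await userSkills.createSearchIndex({
      name: 'userSkills_vector_index',
      type: 'vectorSearch',
      definition: {
        fields: [
          { path: 'embedding', type: 'vector', numDimensions: 1536, similarity: 'cosine' },
        ],
      },
    });
  } finally {
    await client.close();
  }
}
```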
Step 4: Running the Vector Search
When an employer searches for candidates:
- Combine the job requirements (role, skills, experience, duration, notes)
- Generate a single 1536-dimension embedding using generateJobEmbedding()
- Query MongoDB using the $vectorSearch aggregation stage
- Join with the profiles collection to get user details
- Apply filters (availability, search text, etc.) after the lookup
MongoDB Aggregation Pipeline
const vectorSearchPipeline: any[] = [
{
$vectorSearch: {
index: "userSkills_vector_index",
path: "embedding",
queryVector: queryEmbedding,
numCandidates: 100, // Search more candidates, filter down
limit: 100, // Get top 100 from vector search
filter: {
// Only filter on userskills collection fields
embedding: { $exists: true, $ne: null },
},
},
},
// Join with Profile to get user details
{
$lookup: {
from: "profiles",
localField: "profile",
foreignField: "_id",
as: "profileDetails",
},
},
{
$unwind: {
path: "$profileDetails",
preserveNullAndEmptyArrays: false,
},
},
// Apply profile filters AFTER the lookup
// (Vector search can only filter on same collection)
{
$match: {
"profileDetails.deletionFlag": { $ne: true },
"profileDetails.type": "student",
...(availability ? { "profileDetails.availability": availability } : {}),
...(search
? {
$or: [
{ "profileDetails.firstName": { $regex: search, $options: "i" } },
{ "profileDetails.lastName": { $regex: search, $options: "i" } },
{ "profileDetails.email": { $regex: search, $options: "i" } },
],
}
: {}),
},
},
// Add vector search score
{
$addFields: {
vectorScore: { $meta: "vectorSearchScore" },
},
},
// Sort by vector score
{
$sort: { vectorScore: -1 },
},
];
Key Implementation Detail: We must apply profile-related filters (deletionFlag, type, availability, search) after the $lookup stage because MongoDB Vector Search can only filter on fields from the same collection (userSkills). This was a critical learning during implementation.
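Executing the pipeline is then a single aggregate call. In the sketch below, UserSkillsModel, assignFitLabels, and addTopRelevantSkills are illustrative names for the pieces described in Steps 5 and 6, not the exact project code:

```typescript
// Run the vector search; Mongoose passes the pipeline straight through to MongoDB.
const allCandidates = await UserSkillsModel.aggregate(vectorSearchPipeline);

// Step 5: percentile-based fit labels need the full result set...
const labelled = assignFitLabels(allCandidates);

// Step 6: ...then paginate and compute top-20 relevant skills only for the page returned.
const pageNumber = 1;
const pageSize = 20;
const page = labelled.slice((pageNumber - 1) * pageSize, pageNumber * pageSize);
const results = await Promise.all(page.map(addTopRelevantSkills));
```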
Step 5: Ranking & Fit Classification
I use percentile-based scoring to ensure meaningful distribution:
- Sort all candidates by vector similarity score (descending)
- Calculate percentiles:
- High Fit: Top 10%
- Good Fit: Next 20% (top 10-30%)
- Moderate Fit: Next 30% (top 30-60%)
- Low Fit: Bottom 40% (60-100%)
- Not Fit: Filtered out (not returned)
This ensures ranking is meaningful even when embedding scores vary widely.
Fit Label Assignment
// Sort candidates by vector score
const sortedCandidates = candidates.sort(
(a, b) => b.vectorScore - a.vectorScore
);
const total = sortedCandidates.length;
const highFitIndex = Math.floor(total * 0.1); // Top 10%
const goodFitIndex = Math.floor(total * 0.3); // Top 30%
const moderateFitIndex = Math.floor(total * 0.6); // Top 60%
const filteredCandidates = [];
for (let i = 0; i < sortedCandidates.length; i++) {
const candidate = sortedCandidates[i];
const score = candidate.vectorScore;
if (i < highFitIndex) {
candidate.fitLabel = "High Fit";
} else if (i < goodFitIndex) {
candidate.fitLabel = "Good Fit";
} else if (i < moderateFitIndex) {
candidate.fitLabel = "Moderate Fit";
} else if (score >= 0.5) {
candidate.fitLabel = "Low Fit";
} else {
candidate.fitLabel = "Not Fit";
continue; // Skip "Not Fit" candidates
}
filteredCandidates.push(candidate);
}
Step 6: Top 20 Relevant Skills
For each candidate, I show only the top 20 most relevant skills (not all their skills). This is done by:
- Using pre-computed skill embeddings from the SkillEmbedding collection
- Calculating cosine similarity between each candidate skill and the job requirements
- Prioritizing exact matches first, then semantic similarity
- Returning only the top 20
Performance Note: This ranking is only done for candidates that will be returned (after pagination), not all 100 candidates from vector search.
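Here is a sketch of that ranking step. It assumes every candidate skill already has a cached vector from the SkillEmbedding collection and that jobEmbedding is the vector produced by generateJobEmbedding(); the rankTopSkills helper is illustrative rather than the exact project code:

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank a candidate's skills against the job and keep only the top `limit`.
function rankTopSkills(
  candidateSkills: { skill: string; embedding: number[] }[],
  jobSkills: string[], // raw skill keys from the job requirements
  jobEmbedding: number[],
  limit = 20,
): string[] {
  return candidateSkills
    .map(({ skill, embedding }) => ({
      skill,
      // Exact matches get a score above the cosine range so they always rank first
      score: jobSkills.includes(skill)
        ? 2
        : cosineSimilarity(embedding, jobEmbedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((entry) => entry.skill);
}
```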
Step 7: Performance Optimization
My first implementation was slow (~12 seconds) because it generated embeddings during every request.
I solved this by:
✅ Pre-computing skill embeddings
Stored in SkillEmbedding collection. When a new skill appears, we generate and cache its embedding.
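A minimal sketch of that get-or-create cache, reusing the SkillEmbeddingModel shape from Step 2 and the same OpenAI client as above (the helper name is illustrative):

```typescript
// Return a cached skill embedding, generating and persisting it on first use.
async function getSkillEmbedding(skill: string): Promise<number[]> {
  const cached = await SkillEmbeddingModel.findOne({ skill }).lean();
  if (cached?.embedding?.length) {
    return cached.embedding;
  }

  // First time we see this skill: embed it once and cache the vector.
  const response = await client.embeddings.create({
    model: 'text-embedding-3-small',
    input: skill.split('_').join(' '), // simplified formatting
    dimensions: 1536,
  });
  const embedding = response.data[0].embedding;

  await SkillEmbeddingModel.create({
    skill,
    embedding,
    embeddingModel: 'text-embedding-3-small',
    embeddingUpdatedAt: new Date(),
  });

  return embedding;
}
```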
✅ Caching user profile embeddings
Updated only when a user updates their skills (via career map changes).
✅ Optimized MongoDB pipeline
- Filtering happens after the vector search and $lookup stages for faster execution
- Skills are ranked only for candidates that will be returned (not all 100)
✅ Robust error handling
The system fails gracefully if vector search doesn't work: no false positives. If embedding generation fails, the API returns an error rather than falling back to keyword matching.
Final Results
| Metric | Before | After |
|---|---|---|
| Response Time | ~12s | ~1s (12x faster) |
| Accuracy | Low keyword matching | High semantic relevance |
| Architecture | Scattered & inefficient | Unified vector-powered system |
| Skills Display | All skills (cluttered) | Top 20 relevant skills |
| Fit Classification | Binary (match/no match) | 5-tier percentile system |
The system now understands relationships like:
- "communication" β "collaboration skills"
- "problem solving" β "analytical thinking"
- "frontend developer" β "React, UI, JavaScript"
This drastically improves talent matching accuracy and user experience.
Tech Stack
- Backend: Node.js, Express, TypeScript
- Database: MongoDB Atlas with Vector Search
- AI: OpenAI Embeddings API (text-embedding-3-small)
- ODM: Mongoose / Typegoose
- Similarity Metric: Cosine Similarity
Conclusion
This upgrade transformed our recruitment intelligence from a slow keyword-based matcher into a fast, AI-powered semantic engine that truly understands job requirements and candidate capabilities.
The key was leveraging MongoDB Atlas Vector Search: no separate vector database needed, seamless integration with existing data, and native performance optimizations.
Result: a 12x faster, semantically accurate, production-ready talent matching system.