Nithin Bharadwaj

8 Essential Techniques for Building High-Performance JavaScript Search Engines That Scale

As a best-selling author, I invite you to explore my books on Amazon. Don't forget to follow me on Medium and show your support. Thank you! Your support means the world!

Building a full-text search engine in JavaScript requires balancing precision and performance. Through practical implementation, I've discovered these eight techniques form a robust foundation for search functionality that scales gracefully.

Text preparation starts with tokenization. Breaking content into meaningful units requires handling edge cases like hyphenated words and apostrophes. My approach combines regex with custom logic:

tokenize(text) {
  const words = text
    .toLowerCase()
    .replace(/(\w)'(\w)/g, '$1$2')  // Handle contractions ("don't" -> "dont")
    .match(/\b[\w']+\b/g) || [];    // Word boundary detection

  return words.filter(word =>
    word.length > 2 &&
    !this.stopWords.has(word)
  );
}
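
To see the tokenizer in action, here's a minimal harness; the stopWords set is my own assumption, since the original only references it:

// Assumed setup: a small object carrying the stopWords set the method expects
const engine = {
  stopWords: new Set(['the', 'and', 'for', 'with']),
  tokenize(text) {
    const words = text
      .toLowerCase()
      .replace(/(\w)'(\w)/g, '$1$2')
      .match(/\b[\w']+\b/g) || [];
    return words.filter(w => w.length > 2 && !this.stopWords.has(w));
  }
};

console.log(engine.tokenize("The quick fox doesn't wait for the hounds"));
// -> [ 'quick', 'fox', 'doesnt', 'wait', 'hounds' ]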

For normalization, I combine stemming with synonym mapping: a simplified Porter-style stemmer reduces words to their roots, while a synonym registry maps variants onto a canonical form:

const synonyms = {
  "running": "run",
  "ran": "run",
  "runs": "run"
};

normalize(term) {
  // Check the raw term first so registry entries like "running" still match,
  // then fall back to the stemmed form
  if (synonyms[term]) return synonyms[term];
  const base = this.stemmer(term);
  return synonyms[base] || base;
}

// Simplified Porter stemmer implementation
stemmer(word) {
  if (word.endsWith("ing")) return word.replace(/ing$/, "");
  if (word.endsWith("ly")) return word.replace(/ly$/, "");
  return word;
}
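
As a quick check of the combined pipeline, here are standalone arrow-function versions of the methods above, reusing the synonyms map:

const stem = word => {
  if (word.endsWith("ing")) return word.replace(/ing$/, "");
  if (word.endsWith("ly")) return word.replace(/ly$/, "");
  return word;
};
const normalizeTerm = term => synonyms[term] || synonyms[stem(term)] || stem(term);

console.log(normalizeTerm("running")); // -> "run" (direct synonym hit)
console.log(normalizeTerm("quickly")); // -> "quick" (stemmed, no synonym)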

Index construction uses inverted indexes with positional data. Delta encoding compresses positions while maintaining quick access:

addDocument(id, content) {
  const tokens = this.tokenize(content);
  tokens.forEach((token, position) => {
    if (!this.index.has(token)) {
      this.index.set(token, new PostingsList());
    }
    this.index.get(token).add(id, position);
  });
}

class PostingsList {
  constructor() {
    this.entries = [];
    this.lastId = -1;
  }

  add(id, position) {
    if (id !== this.lastId) {
      // First occurrence in this document: store the absolute position
      this.entries.push({ id, positions: [position], last: position });
      this.lastId = id;
    } else {
      // Delta encoding: store the gap from the previous occurrence
      const lastEntry = this.entries[this.entries.length - 1];
      lastEntry.positions.push(position - lastEntry.last);
      lastEntry.last = position;
    }
  }
}
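
To make the encoding concrete, here's a small decoder (not in the original) for the entry shape produced by add above, where positions[0] is absolute and every later value is the gap from the previous occurrence:

function decodePositions(entry) {
  const absolute = [];
  let current = 0;
  entry.positions.forEach((delta, i) => {
    current = i === 0 ? delta : current + delta;
    absolute.push(current);
  });
  return absolute;
}

decodePositions({ id: 7, positions: [3, 2, 5], last: 10 }); // -> [3, 5, 10]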

Query parsing handles complex expressions. I implement a single-pass, stack-based parser that supports boolean operators and a title: field prefix:

parseQuery(input) {
  const tokens = input.match(/(title:)?("[^"]+"|\S+)/g) || [];
  const stack = [];
  let pendingOp = null;

  tokens.forEach(token => {
    if (token === "AND" || token === "OR") {
      // Operators are infix: hold the operator until the
      // right-hand operand arrives
      pendingOp = token.toLowerCase();
      return;
    }

    let node;
    if (token.startsWith("title:")) {
      const term = token.slice(6).replace(/"/g, '');
      node = { type: "field", field: "title", term };
    } else {
      node = { type: "term", term: token.replace(/"/g, '') };
    }

    if (pendingOp) {
      const left = stack.pop();
      stack.push({ type: pendingOp, left, right: node });
      pendingOp = null;
    } else {
      stack.push(node);
    }
  });

  return stack.pop();
}
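
For example, assuming parseQuery is callable as shown, a mixed query produces a small expression tree:

const ast = parseQuery('title:"inverted index" AND compression');
/* ->
{
  type: "and",
  left:  { type: "field", field: "title", term: "inverted index" },
  right: { type: "term",  term: "compression" }
}
*/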

Relevance scoring incorporates multiple dimensions. Beyond TF-IDF, I factor in proximity and field boosts:

calculateScore(doc, queryTerms) {
  let score = 0;

  queryTerms.forEach(term => {
    const postings = this.index.get(term);
    if (!postings) return;

    // Postings live on the PostingsList's entries array
    const docPosting = postings.entries.find(p => p.id === doc.id);
    if (!docPosting) return;

    // Term frequency component (occurrences / document length)
    const tf = docPosting.positions.length / doc.wordCount;

    // Inverse document frequency (entries.length = document frequency)
    const idf = Math.log(this.totalDocs / postings.entries.length);

    // Proximity bonus for adjacent terms
    const proximity = this.calculateProximityBonus(docPosting.positions);

    // Field boost (title vs content)
    const fieldBoost = doc.title.includes(term) ? 3 : 1;

    score += tf * idf * proximity * fieldBoost;
  });

  return score;
}
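
The original doesn't show calculateProximityBonus, so here's one plausible sketch. It assumes absolute positions (decode delta-encoded entries first, e.g. with decodePositions above) and rewards tightly clustered occurrences; a fuller version would compare positions across different query terms:

calculateProximityBonus(positions) {
  if (positions.length < 2) return 1;

  // Find the smallest gap between consecutive occurrences
  let minGap = Infinity;
  for (let i = 1; i < positions.length; i++) {
    minGap = Math.min(minGap, positions[i] - positions[i - 1]);
  }

  // Adjacent occurrences (gap 1) double the score; larger gaps decay toward 1
  return 1 + 1 / minGap;
}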

Fuzzy matching uses a trie-based approach for efficient similarity searches. This avoids expensive Levenshtein calculations across all terms:

class FuzzyMatcher {
  constructor() {
    this.trie = new Trie();
  }

  build(wordList) {
    wordList.forEach(word => this.trie.insert(word));
  }

  findSimilar(term, maxEdits = 2) {
    return this.trie.search(term, maxEdits);
  }
}

// Trie node implementation
class TrieNode {
  constructor() {
    this.children = {};
    this.isWord = false;
  }
}
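
The Trie class itself isn't shown above, so here's a sketch of the part that matters: search() keeps one Levenshtein table row per trie node and prunes any branch whose best possible distance already exceeds maxEdits, which is what avoids comparing against every term:

class Trie {
  constructor() {
    this.root = new TrieNode();
  }

  insert(word) {
    let node = this.root;
    for (const ch of word) {
      node.children[ch] ??= new TrieNode();
      node = node.children[ch];
    }
    node.isWord = true;
  }

  search(term, maxEdits) {
    const results = [];
    // Row 0: distance from the empty prefix to each prefix of term
    const firstRow = Array.from({ length: term.length + 1 }, (_, i) => i);

    const walk = (node, ch, prefix, prevRow) => {
      // Build the next Levenshtein row for this node's character
      const row = [prevRow[0] + 1];
      for (let i = 1; i <= term.length; i++) {
        const cost = term[i - 1] === ch ? 0 : 1;
        row.push(Math.min(row[i - 1] + 1, prevRow[i] + 1, prevRow[i - 1] + cost));
      }

      if (node.isWord && row[term.length] <= maxEdits) {
        results.push({ word: prefix, distance: row[term.length] });
      }

      // Prune: no descendant can do better than the row minimum
      if (Math.min(...row) <= maxEdits) {
        for (const [next, child] of Object.entries(node.children)) {
          walk(child, next, prefix + next, row);
        }
      }
    };

    for (const [ch, child] of Object.entries(this.root.children)) {
      walk(child, ch, ch, firstRow);
    }
    return results;
  }
}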

Faceted filtering implements counting and filtering through per-value document sets, a lightweight stand-in for bitmap indexes. This maintains performance during multi-category drill-downs:

class FacetIndex {
  constructor() {
    this.categoryMaps = new Map();
  }

  addValue(docId, field, value) {
    if (!this.categoryMaps.has(field)) {
      this.categoryMaps.set(field, new Map());
    }

    const fieldMap = this.categoryMaps.get(field);
    if (!fieldMap.has(value)) {
      fieldMap.set(value, new Set());
    }

    fieldMap.get(value).add(docId);
  }

  getCounts(field, activeFilters) {
    const counts = [];
    const fieldMap = this.categoryMaps.get(field) || new Map();

    fieldMap.forEach((docSet, value) => {
      if (this.passesFilters(docSet, activeFilters)) {
        counts.push({ value, count: docSet.size });
      }
    });

    return counts;
  }
}
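
The passesFilters helper is referenced but not shown; here's one reasonable reading, assuming activeFilters is an array of Sets of document IDs (one per applied filter). A value passes if at least one of its documents survives every filter; a fuller version would also report the intersection size instead of docSet.size:

passesFilters(docSet, activeFilters) {
  if (!activeFilters || activeFilters.length === 0) return true;

  for (const docId of docSet) {
    if (activeFilters.every(filterSet => filterSet.has(docId))) {
      return true;
    }
  }
  return false;
}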

Highlighting leverages DOM manipulation with basic security safeguards: matches are marked inside the text nodes of a detached parsed document, and query terms are escaped before being compiled into a regex:

generateHighlight(content, terms) {
  const parser = new DOMParser();
  const doc = parser.parseFromString(content, 'text/html');
  const walker = doc.createTreeWalker(doc.body, NodeFilter.SHOW_TEXT);

  // Escape regex metacharacters so user terms can't break the pattern
  const escaped = terms.map(t => t.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'));
  const termRegex = new RegExp(`(${escaped.join('|')})`, 'gi');
  const nodes = [];

  while (walker.nextNode()) {
    nodes.push(walker.currentNode);
  }

  nodes.forEach(node => {
    const newHtml = node.textContent.replace(
      termRegex,
      '<mark class="highlight">$1</mark>'
    );
    const wrapper = doc.createElement('div');
    wrapper.innerHTML = newHtml;
    node.replaceWith(...wrapper.childNodes);
  });

  return doc.body.innerHTML;
}
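
A quick example of the expected output, assuming generateHighlight is callable as a function:

const html = generateHighlight(
  '<p>Inverted indexes make search fast.</p>',
  ['search', 'index']
);
// -> '<p>Inverted <mark class="highlight">index</mark>es make
//     <mark class="highlight">search</mark> fast.</p>'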

Performance optimization uses Web Workers and incremental indexing. This keeps the UI responsive during heavy operations:

// Main thread
const indexWorker = new Worker('search-worker.js');

indexWorker.postMessage({
  type: 'ADD_DOCUMENT',
  doc: { id: 123, content: "..." }
});

indexWorker.onmessage = (event) => {
  if (event.data.type === 'SEARCH_RESULTS') {
    displayResults(event.data.results);
  }
};

// Inside search-worker.js
self.addEventListener('message', (event) => {
  if (event.data.type === 'ADD_DOCUMENT') {
    const { id, content } = event.data.doc;
    index.addDocument(id, content);   // matches addDocument(id, content) above
  } else if (event.data.type === 'SEARCH') {
    const results = index.search(event.data.query);
    self.postMessage({ type: 'SEARCH_RESULTS', results });
  }
});
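
The incremental half isn't shown above. One way to sketch it, reworking the ADD_DOCUMENT branch inside the same worker (the index object is assumed as before): queue incoming documents and index them in small batches, yielding between batches so SEARCH messages still get handled promptly:

// Inside search-worker.js: batch queued documents so a large indexing job
// doesn't block incoming messages for its whole duration
const pending = [];
let draining = false;

function drainQueue(batchSize = 50) {
  if (draining) return;
  draining = true;

  (function step() {
    const batch = pending.splice(0, batchSize);
    batch.forEach(doc => index.addDocument(doc.id, doc.content));

    if (pending.length > 0) {
      setTimeout(step, 0); // yield to the event loop between batches
    } else {
      draining = false;
    }
  })();
}

self.addEventListener('message', (event) => {
  if (event.data.type === 'ADD_DOCUMENT') {
    pending.push(event.data.doc);
    drainQueue();
  }
});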

For large datasets, I implement index sharding. Distributing indexes across multiple workers significantly improves throughput:

class IndexShard {
  constructor(shardCount) {
    // One persistent worker per shard, each owning its own index.
    // Spawning a fresh worker per query would leak threads and
    // leave the workers without any indexed documents.
    this.workers = Array.from(
      { length: shardCount },
      () => new Worker('shard-worker.js')
    );
  }

  addDocument(id, content) {
    // Route each document to a shard by id
    const shardIndex = id % this.workers.length;
    this.workers[shardIndex].postMessage({ type: 'ADD_DOCUMENT', id, content });
  }

  async search(query) {
    // Fan the query out to every shard, then merge the ranked results
    const results = await Promise.all(
      this.workers.map(worker =>
        new Promise(resolve => {
          worker.onmessage = e => resolve(e.data.results);
          worker.postMessage({ type: 'SEARCH', query });
        })
      )
    );
    return results.flat().sort((a, b) => b.score - a.score);
  }
}
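
For completeness, shard-worker.js isn't shown in the original; a minimal sketch might mirror the earlier worker, with each shard owning one engine for its slice of documents (importScripts and the 'search-engine.js' bundle name are my assumptions):

// Inside shard-worker.js
importScripts('search-engine.js'); // assumed script exposing SearchEngine

const engine = new SearchEngine();

self.addEventListener('message', (event) => {
  const { type, id, content, query } = event.data;
  if (type === 'ADD_DOCUMENT') {
    engine.addDocument(id, content);
  } else if (type === 'SEARCH') {
    self.postMessage({ results: engine.search(query) });
  }
});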

These techniques form a cohesive search pipeline. The tokenizer prepares raw text, the indexer structures it for retrieval, and the query processor translates user intent. Scoring algorithms determine relevance while faceting enables exploratory search. Performance optimizations ensure responsiveness even with substantial datasets.

Through careful implementation, JavaScript proves capable of handling sophisticated search requirements. The solution scales efficiently by leveraging modern browser capabilities while maintaining precision through algorithmic rigor. Each component plays a vital role in transforming raw data into meaningful results.

📘 Check out my latest ebook for free on my channel!

Be sure to like, share, comment, and subscribe to the channel!


101 Books

101 Books is an AI-driven publishing company co-founded by author Aarav Joshi. By leveraging advanced AI technology, we keep our publishing costs incredibly low—some books are priced as low as $4—making quality knowledge accessible to everyone.

Check out our book Golang Clean Code available on Amazon.

Stay tuned for updates and exciting news. When shopping for books, search for Aarav Joshi to find more of our titles. Use the provided link to enjoy special discounts!

Our Creations

Be sure to check out our creations:

Investor Central | Investor Central Spanish | Investor Central German | Smart Living | Epochs & Echoes | Puzzling Mysteries | Hindutva | Elite Dev | JS Schools


We are on Medium

Tech Koala Insights | Epochs & Echoes World | Investor Central Medium | Puzzling Mysteries Medium | Science & Epochs Medium | Modern Hindutva
