Hi, I'm Fernando Agustín Soria, a Computer Engineering student and web developer. Today, I will demonstrate a simple approach to perform full-text search on a relational database using TypeScript and Sequelize. Please note that this algorithm is not exclusive to this language, and an ORM (Object-Relational Mapping) like Sequelize is not necessary to retrieve data from a relational database; it's just a practical example.
Before we start, here's the repository with the minimum reproducible example.
Github fersoria001 trie example
Full-text search is a powerful technique for efficiently retrieving textual data from large collections of documents. When dealing with extensive text databases or search engines, achieving fast and accurate search results becomes a significant challenge. This is where data structures like a trie come into play. A trie, short for retrieval tree, is a tree-like data structure specifically designed for efficient string searching and pattern matching. Its versatility in handling dynamic text datasets and providing rapid lookups makes it ideal for full-text search. In this introduction, we will explore the concept of implementing full-text search from scratch using a trie, covering its construction, query processing, and the advantages it offers in terms of speed and scalability.
The trie's fundamental building block is the node, with each node representing a character in a string. By organizing nodes in a tree-like structure, the trie efficiently stores and manages vast amounts of textual data. During the construction phase, individual words or phrases are inserted into the trie, sharing common prefixes among different entries to minimize storage requirements. This compact representation enables fast search operations and pattern matching, making the trie a preferred data structure for full-text search systems. As queries are entered into the system, the trie's recursive traversal process efficiently matches the input against stored strings, quickly identifying relevant results. Furthermore, the trie's suitability in handling dynamic datasets, such as those found in real-time search engines, makes it a valuable tool for developers and researchers seeking high-performance full-text search capabilities. In the subsequent sections, we will explore trie construction, query processing, and optimizations to further enhance the overall performance of full-text search using this versatile data structure.
Often a database can contain entities like this:
We can obtain metadata by iterating through the collection during the initialization of the trie. This process allows us to gather additional information related to the text data and associate it with each word or phrase in the trie. The metadata can be helpful in providing context or additional attributes to the search results, enhancing the overall user experience.
export type PreToken = {
words: string;
id: number;
};
export type PostToken = {
word: string[];
id: number;
};
Fill the PreTokens with product id and full text from each product.
let products: Product[] = [];
let tokens: PreToken[] = [];
initializeDatabase(MyConnection.instance.sequelize);
products = await Product.findAll();
products.forEach((product) => {
if (product) {
const tokenTitle: PreToken = {
word: product.title,
id: product.id,
};
const tokenDescription: PreToken = {
word: product.description,
id: product.id,
};
tokens.push(tokenTitle, tokenDescription);
}
});
And we can have words that starts or not with the same pre-fix, our trie will store them like this:
It can be demonstrated with recurrence equations that the time complexity of the trie structure search algorithm is O(N * L), where L is the length of the word.
To construct the trie data structure for full-text search, each node should have the following components: an array called "children" to hold references to child nodes or null values, a boolean variable "isEndOfWord" to indicate whether the node marks the end of a word, and a "keyAndId" Set associated with metadata we want to index in the node. To limit the maximum grade of the trie, we consider the alphabet size, which is typically 26 for English characters.
const ALPHABET_SIZE = 26;
export class Node {
child: Array<Node | null>;
isEndOfWord: boolean;
keyAndId: Set<number>;
word?: string;
constructor() {
this.child = new Array(ALPHABET_SIZE);
this.isEndOfWord= false;
this.keyAndId = new Set();
for (let i = 0; i < ALPHABET_SIZE; i++) {
this.child[i] = null;
}
}
}
Our trie structure will have 2 methods insert and find.
export class Trie {
root: Node;
constructor() {
this.raiz = new Nodo();
}
/** Search for a word in the Trie.
* @param {string} word - The word to search for.
* @returns {Node | null} - The last node of the word in the Trie if found, or null if the word is not present.
*
* The search method is used to look for a given word in the Trie data structure.
* It starts at the root node of the Trie and iterates over each character in the word.
* For each character, it calculates the corresponding index based on its ASCII value and the character 'a'.
* If a child node at the calculated index exists, it moves to that child and continues the search.
* If a child node does not exist at any point, the method returns null, indicating that the word is not present in the Trie.
* The method returns the last node of the word in the Trie if the word is found, or null if the word is not present.
*/
search(word: string): Node | null {
let level;
let length = word.length;
let index;
let pCrawl = this.root;
for (level = 0; level < length; level++) {
index = word[level].charCodeAt(0) - "a".charCodeAt(0);
if (pCrawl.child[index] == null) return null;
pCrawl = pCrawl.child[index]!;
}
return pCrawl;
}
/** Insert a word into the Trie along with its associated ID (primary key).
* @param {string} word - The word to insert.
* @param {number} id - The primary key associated with the word.
*
* The insert method is used to add a new word to the Trie data structure along with its associated ID.
* It starts at the root node of the Trie and checks if the word already exists in the Trie by calling the search method.
* If the word is already present and marked as an end of a word, it simply adds the new ID to the set of associated IDs and returns.
* If the word is not present, it proceeds to insert it into the Trie.
* For each character in the word, it calculates the corresponding index based on its ASCII value and the character 'a'.
* If a child node at the calculated index does not exist, it creates a new node and sets it as the child at that index.
* The method then moves down the Trie and continues inserting characters until the end of the word is reached.
* After the loop, it marks the last node as the end of the word, stores the word, and associates the provided ID with it.
* If the word already exists but is not marked as an end of a word, it means the word is a prefix of another word already in the Trie.
* In such cases, the method will insert the word as a new word and not a prefix.
*/
insert(word: string, id: number): void {
let level;
let length = word.length;
let index;
let pCrawl = this.root;
let itExist = this.search(word);
if (itExist != null && itExist.isEndOfWord) {
itExist.keyAndId.add(id);
return;
}
for (level = 0; level < length; level++) {
index = word[level].charCodeAt(0) - "a".charCodeAt(0);
if (pCrawl.child[index] == null) pCrawl.child[index] = new Node();
pCrawl = pCrawl.child[index]!;
}
pCrawl.keyAndId.add(id);
pCrawl.isEndOfWord = true;
pCrawl.word = word;
}
With the "product" entity, we can efficiently create a reverse index of the data using the following approach:
First, we retrieve raw data from each product. Then, we define the token type that needs to be inserted into the trie. For instance, if we can only insert one word at a time when calling the insert function of the trie, we need to split the raw data into an array while maintaining the same product ID as a reference.
Suppose the raw data is a complete paragraph; in that case, we split it into an array of words, with each word representing a token to be inserted into the trie along with the corresponding product ID.
By following this method, we can efficiently build a trie data structure to facilitate full-text search on the product data. The trie will enable us to quickly locate products based on specific keywords or phrases, providing users with fast and accurate search results. Additionally, by associating the product ID with each token in the trie, we can directly access the full product information once a search query yields relevant results. This reverse index approach significantly enhances the performance and functionality of our product search system, contributing to an improved user experience and streamlined information retrieval.
Remove accents,diacritics specific to the language we want to search on.
private tokenize(object: PreToken[]): PostToken[] {
let modified: PostToken[] = [];
for (const token of object) {
let splited: PostToken = {
words: this.splitTextIntoWords(token.word as string),
id: token.id,
};
modified.push(splited);
}
return modified;
}
private splitTextIntoWords(text: string | null): string[] {
if (!text) {
return [];
}
const regexPattern = /[^\wÑñ]+/gim;
const strings: string[] = text
.toLocaleLowerCase()
.normalize("NFD") // Normalize accents and diacritics
.replace(/[\u0300-\u036f]/g, "") // Remove combining diacritical marks
.split(regexPattern)
.filter((word) => !this.commonWords.has(word));
return strings;
}
then we can tokenize and insert the words into the trie
const trie = new Trie();
const tokenizedTokens: PostToken[] = this.tokenize(tokens);
for (const token of tokenizedTokens) {
for (const word in token.word) {
trie.insert(token.word[word], token.id);
}
}
Indeed, with the reverse index built using the trie, searching for specific words and obtaining the associated product IDs becomes remarkably straightforward.
let ids: Set<number> = new Set();
const queries = [ "word1", "word2", "word3" ]
for (const query of queries) {
const trieResult = trie.search(query);
if (trieResult != null && trieResult.isEndOfWord) {
trieResult.keyAndId.forEach((id) => {
ids.add(id);
});
}
}
With the IDs, we can obtain the products referenced by the words in the trie from our database using either an ORM or other methods. However, assume we have a toys store expands, it encompasses not only products but also clients, employees, and more. Consequently, our trie grows with the data, necessitating a strategy to determine the data present in the trie and identify words that yield exact matches.
We can utilize this depth-first method to implement a fast solution for searching and suggesting words and -in principle- avoid the use of a parser or the usage of different flags in the nodes.
/**
* Returns an array of words that have the given prefix.
*
* @param {string} prefix - The prefix to search for.
* @returns {string[]} - An array of words with the given prefix.
*
* Function searchWordsWithPrefix:
* - Initialize an empty array called 'words' to store words with the given prefix.
* - Define a recursive depth-first search function (dfs) to traverse the tree.
* - If the current node marks the end of a word, add the currentWord to 'words'.
* - Iterate through the children of the current node and explore valid paths.
* - Search for the node corresponding to the given prefix in the data structure.
* - If the prefix exists, perform dfs from that node to find words with the prefix.
* - Return the 'words' array containing all the words found with the given prefix.
*
* Data Structure:
* - The function traverses a tree-like data structure where each node represents
* a character in the alphabet and paths from root to nodes form words.
* - Each node can have up to ALPHABET_SIZE children, each representing the next
* character in a word.
*
* Note: ALPHABET_SIZE is a constant representing the number of characters in the alphabet.
*/
searchWordsWithPrefix(prefix: string): string[] {
const words: string[] = [];
function dfs(node: MyNode, currentWord: string) {
if (node.isEndOfWord) {
words.push(currentWord);
}
for (let i = 0; i < ALPHABET_SIZE; i++) {
if (node.child[i]) {
const nextChar = String.fromCharCode(i + "a".charCodeAt(0));
dfs(node.child[i]!, currentWord + nextChar);
}
}
}
const pCrawl = this.buscarPalabra(prefix);
if (pCrawl) {
dfs(pCrawl, prefix);
}
return words;
}
Follow me for more content!
Fernando Agustín Soria.
Top comments (0)