Jonathan Gamble

Posted on Oct 2, 2021 • Edited on Oct 26, 2022 • Originally published at code.build

Quick Firestore Frontend Search Index

#firestore #firebase #angular #rxjs

If you didn't know that you can get full-text search capabilities in Firestore, read my article on my adv-firestore-functions search package.

UPDATE 10/23/22 - This is now easy to do with my j-firebase package.

However, as great as Firebase Functions are, sometimes we just want a simple and quick way to search through our data. Unfortunately, the Firebase team has not built this natively yet.

So, I wanted to create a quick way to index your data from the frontend...

Note: This post uses Angular examples, but the premise applies to any framework.

Soundex

The core of this code is based on the soundex function, which has been used in SQL databases for generations to emulate a fuzzy search. It translates your text so that similar sounds in the English language are stored as the same string. Versions of this algorithm exist for other languages as well; just search for 'french' + 'soundex', for example.

  soundex(s: string) {
    // guard: an empty string would otherwise come back as "UNDE"
    if (!s) return "";
    const a = s.toLowerCase().split("");
    const f = a.shift() as string;
    let r = "";
    const codes = {
      a: "",
      e: "",
      i: "",
      o: "",
      u: "",
      b: 1,
      f: 1,
      p: 1,
      v: 1,
      c: 2,
      g: 2,
      j: 2,
      k: 2,
      q: 2,
      s: 2,
      x: 2,
      z: 2,
      d: 3,
      t: 3,
      l: 4,
      m: 5,
      n: 5,
      r: 6,
    } as any;
    r = f + a
      .map((v: string) => codes[v])
      .filter((v: any, i: number, b: any[]) =>
        i === 0 ? v !== codes[f] : v !== b[i - 1])
      .join("");
    return (r + "000").slice(0, 4).toUpperCase();
  }
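
To see the effect, here is a minimal standalone sketch of the same logic outside of any class, so it can be run on its own. The free function name `soundexCode` and the sample words are assumptions for illustration; the letter table mirrors the one above, with the digits written as strings.

```typescript
// Standalone copy of the soundex logic above (hypothetical name: soundexCode).
const CODES: { [letter: string]: string } = {
  a: "", e: "", i: "", o: "", u: "",
  b: "1", f: "1", p: "1", v: "1",
  c: "2", g: "2", j: "2", k: "2", q: "2", s: "2", x: "2", z: "2",
  d: "3", t: "3",
  l: "4",
  m: "5", n: "5",
  r: "6",
};

function soundexCode(word: string): string {
  if (!word) return ""; // assumption: empty input yields an empty code
  const letters = word.toLowerCase().split("");
  const first = letters.shift() as string;
  const digits = letters
    .map((ch) => CODES[ch]) // h, w, y and non-letters map to undefined
    .filter((code, i, all) =>
      // drop a code that repeats the previous one (or the first letter's code)
      i === 0 ? code !== CODES[first] : code !== all[i - 1])
    .join(""); // join() renders undefined as an empty string
  return (first + digits + "000").slice(0, 4).toUpperCase();
}

console.log(soundexCode("Robert"), soundexCode("Rupert")); // R163 R163
console.log(soundexCode("Smith"), soundexCode("Smyth")); // S530 S530
```

Words that sound alike end up with the same four-character code, which is what makes the stored index tolerant of small spelling differences.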

Create the Index

Based on my relevant search index, I created a simple frontend version you can use in your app.

async searchIndex(opts: {
  ref: DocumentReference<DocumentData>,
  after: any,
  fields: string[],
  del?: boolean,
  useSoundex?: boolean
}) {

  opts.del = opts.del || false;
  // default to true, but keep an explicit false (|| true would override it)
  opts.useSoundex = opts.useSoundex ?? true;

  const allCol = '_all';
  const searchCol = '_search';
  const termField = '_term';
  const numWords = 6;

  const colId = opts.ref.path.split('/').slice(0, -1).join('/');

  // reference to the search index document
  const searchRef = doc(
    this.afs,
    `${searchCol}/${colId}/${allCol}/${opts.ref.id}`
  );

  if (opts.del) {
    await deleteDoc(searchRef);
  } else {

    let data: any = {};
    let m: any = {};

    // go through each field to index
    for (const field of opts.fields) {

      // new indexes
      let fieldValue = opts.after[field];

      // if array, turn into string
      if (Array.isArray(fieldValue)) {
        fieldValue = fieldValue.join(' ');
      }
      let index = this.createIndex(fieldValue, numWords);

      // if using soundex, encode each word of every phrase
      if (opts.useSoundex) {
        const temp = [];
        for (const i of index) {
          temp.push(i.split(' ').map(
            (v: string) => this.fm.soundex(v)
          ).join(' '));
        }
        index = temp;
        for (const phrase of index) {
          if (phrase) {
            let v = '';
            const t = phrase.split(' ');
            while (t.length > 0) {
              const r = t.shift();
              v += v ? ' ' + r : r;
              // increment for relevance
              m[v] = m[v] ? m[v] + 1 : 1;
            }
          }
        }
      } else {
        for (const phrase of index) {
          if (phrase) {
            let v = '';
            for (let i = 0; i < phrase.length; i++) {
              v = phrase.slice(0, i + 1).trim();
              // increment for relevance
              m[v] = m[v] ? m[v] + 1 : 1;
            }
          }
        }
      }
    }
    data[termField] = m;

    data = {
      ...data,
      slug: opts.after.slug,
      title: opts.after.title
    };

    try {
      await setDoc(searchRef, data)
    } catch (e: any) {
      console.error(e);
    }
  }
}

And you will also need the index function:

  createIndex(html: string, n: number): string[] {
    // create document after text stripped from html
    function createDocs(text: string) {
      const finalArray: string[] = [];
      const wordArray = text
        .toLowerCase()
        .replace(/[^\p{L}\p{N}]+/gu, ' ')
        .replace(/ +/g, ' ')
        .trim()
        .split(' ');
      do {
        finalArray.push(
          wordArray.slice(0, n).join(' ')
        );
        wordArray.shift();
      } while (wordArray.length !== 0);
      return finalArray;
    }
    // strip text from html
    function extractContent(html: string) {
      const tmp = document.createElement('div');
      tmp.innerHTML = html;
      return tmp.textContent || tmp.innerText || '';
    }
    // strip the html, then build the phrases
    return createDocs(
      extractContent(html)
    );
  }

Note: For SSR, never access the document directly; inject the framework's document variable instead.
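
The windowing step inside createIndex can be tried without a browser. The following is a minimal sketch that skips the HTML stripping and, as a simplifying assumption, treats only ASCII letters and digits as word characters (the function above uses Unicode classes). The name `phraseWindows` is hypothetical.

```typescript
// Sketch of the sliding-window step only; no DOM, ASCII-only word splitting.
function phraseWindows(text: string, maxWords: number): string[] {
  const words = text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ") // assumption: ASCII instead of \p{L}\p{N}
    .trim()
    .split(" ");
  const windows: string[] = [];
  // one window per starting word, each at most maxWords long
  for (let start = 0; start < words.length; start++) {
    windows.push(words.slice(start, start + maxWords).join(" "));
  }
  return windows;
}

console.log(phraseWindows("Star Wars is great!", 3));
// [ 'star wars is', 'wars is great', 'is great', 'great' ]
```

Every phrase starts at a different word, which is what lets a search match from the middle of a sentence.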

Usage

To use it, after you update data you want searchable, update the index:

  async indexPost(id: string, data: any) {
    await this.searchIndex({
      ref: doc(this.afs, 'posts', id),
      after: data,
      fields: ['content', 'title', 'tags']
    });
  }

Pass in all your doc data as after, your document ref as ref, and the fields you want searchable as fields. The rest is done automatically. If you're deleting a post, simply pass in del: true, and it will delete the index.

You will end up with an index document that holds the title, the slug, and a _term map of each searchable phrase to its count.

The beauty is, it will automatically store more relevant items with a higher number. If you mention star wars 7 times, it will have a relevance of 7.
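
The counting is easiest to follow on a small example. Here is a rough sketch of how the _term map gets its numbers. It mirrors the word-by-word loop in searchIndex but takes already-encoded phrases as input; the function name `countTerms` and the sample phrases (the soundex codes for "star wars star wars" split into windows) are assumptions for illustration.

```typescript
// Sketch: count every leading word sequence of every phrase.
function countTerms(phrases: string[]): { [term: string]: number } {
  const counts: { [term: string]: number } = {};
  for (const phrase of phrases) {
    if (!phrase) continue;
    let prefix = "";
    for (const word of phrase.split(" ")) {
      prefix = prefix ? prefix + " " + word : word;
      counts[prefix] = (counts[prefix] || 0) + 1; // increment for relevance
    }
  }
  return counts;
}

// Windows for "star wars star wars" after soundex (S360 = star, W620 = wars)
const termCounts = countTerms([
  "S360 W620 S360 W620",
  "W620 S360 W620",
  "S360 W620",
  "W620",
]);
console.log(termCounts["S360 W620"]); // 2
```

The phrase appears twice in the text, so its key ends up with a count of 2.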

Searching

To actually use the indexing for searching, you need to grab the term on your frontend through a form keyup value, and run the search like so:

  /**
  * Search posts by term
  * @param term
  * @returns Observable of search
  */
  searchPost(term: string) {
    term = term.split(' ')
      .map(
        (v: string) => this.ns.soundex(v)
      ).join(' ');
    return collectionData(
      query(
        collection(this.afs, '_search/posts/_all'),
        orderBy('_term.' + term),
      ),
      { idField: 'id' }
    ).pipe(
      take(1),
      debounceTime(100)
    );
  }

As you can see, all search indexes are stored in _search/{YOUR COLLECTION}/_all/{YOUR DOC ID}. The field _term will contain all of your searchable data.

This will return an observable with all of the documents that match your query. It also saves some of the document data (here the title and slug) in the search document for easy access and fewer reads. You could easily just print the 'title' of each document if you wanted an autocomplete, or the whole documents if you have a full search.
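
To reason about what the query returns without hitting Firestore, the same selection can be emulated in memory. This is only a sketch under assumptions: the `SearchDoc` shape, the sample data, and `searchLocal` are hypothetical. It relies on the fact that ordering by a field only returns documents that contain that field. The results here are sorted highest count first; in Firestore, orderBy sorts ascending unless 'desc' is passed as the second argument.

```typescript
interface SearchDoc {
  id: string;
  title: string;
  _term: { [term: string]: number };
}

// Emulates orderBy('_term.' + term, 'desc'): keep docs that have the key,
// then sort by its count so the most relevant document comes first.
function searchLocal(docs: SearchDoc[], term: string): SearchDoc[] {
  return docs
    .filter((d) => typeof d._term[term] === "number")
    .sort((a, b) => (b._term[term] || 0) - (a._term[term] || 0));
}

const sampleDocs: SearchDoc[] = [
  { id: "a", title: "Space films", _term: { S360: 1, "S360 W620": 1 } },
  { id: "b", title: "Star Wars guide", _term: { S360: 7, "S360 W620": 7 } },
  { id: "c", title: "Cooking", _term: { C252: 3 } },
];

console.log(searchLocal(sampleDocs, "S360 W620").map((d) => d.title));
// [ 'Star Wars guide', 'Space films' ]
```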

FAQ

1) Why do we duplicate the data in an index, and not just store the searchable information on the regular document as well?
- Speed. You don't want to read all of the search data unless you're doing an actual search. In NoSQL, duplicating data is how reads are kept efficient.
2) If I do this on the frontend, am I going to slow down my app with code that should be on the backend?
- No. Not if you build your app efficiently. You should only be loading read functions for most users. If a user is logged in, and wants to edit a post, or whatever searchable document, only then should these write functions be lazy-loaded. The soundex function, however, should be shared for searching and indexing.
- If you use a router, you should update your document, redirect to that page, then run the index function in the background.

Example

// add post info
try {
  this.id = await this.db.setPost(data, this.id, publish);
} catch (e: any) {
  console.error(e);
  error = true;
}

if (publish && !error) {
  this.sb.showMsg(this.messages.published);
  this.router.navigate(['/post', this.id, slug]);

  // create search index
  data.content = this.markdownService.compile(data.content);
  await this.db.indexPost(this.id, data);
}

After you publish your data, display the message, redirect, then run the search index in the background while you continue to browse.

Note: If you use a markdown service, you may need to compile your content to HTML before you can index it. Look at how your app works.

You may not have to do all that, as you will find this function is really fast.

3) What about security? Data integrity?

In reality, if a user wants to mess with their own index, let them. Their index is based on their content, so they have full access to those words in their index anyway. However, we don't want them messing with someone else's index, so we can use this Firestore rule:

function searchIndex() {
  let docPath = /databases/$(database)/documents/$(request.path[4])/$(request.path[6]);
  return get(docPath).data.authorId == request.auth.uid;
} 
match /_search/{document=**} {
  allow read;
  allow write: if searchIndex();
}

This only lets them edit an index document when the authorId of the matching source document equals the logged-in user's ID. You may need to change that field name based on your app.
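
To make the indexes in that rule concrete, here is a small sketch that performs the same lookup on a plain string. It assumes path segments are counted from zero starting at `databases`, which is what the rule above implies; `sourceDocPath` and the sample IDs are made up for illustration.

```typescript
// For /databases/{db}/documents/_search/{collection}/_all/{docId}:
// segment 4 is the source collection and segment 6 is the document ID.
function sourceDocPath(searchPath: string): string {
  const seg = searchPath.split("/").filter((s) => s.length > 0);
  return "/" + [seg[0], seg[1], seg[2], seg[4], seg[6]].join("/");
}

console.log(
  sourceDocPath("/databases/(default)/documents/_search/posts/_all/abc123")
);
// /databases/(default)/documents/posts/abc123
```

The rule then reads the authorId from that source document, so the index entry is only writable by the author of the post it describes.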

4) What if I store data in many languages?
- Don't use the soundex function. Pass in useSoundex: false, or better yet, just modify the code to remove the soundex function. You will still have an exact search similar to LIKE 'Term%' in SQL, which only matches text starting with 'Term'. It will also automatically sort by relevance of the term in your data. You could also theoretically change the soundex function depending on the language you're searching in.
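
The prefix behaviour without soundex can be sketched in a few lines. The loop below mirrors the character-by-character branch of searchIndex; `prefixTerms` and the sample phrases are assumptions for illustration.

```typescript
// Sketch: every leading substring of a phrase becomes a searchable key.
function prefixTerms(phrases: string[]): { [term: string]: number } {
  const counts: { [term: string]: number } = {};
  for (const phrase of phrases) {
    for (let i = 0; i < phrase.length; i++) {
      const key = phrase.slice(0, i + 1).trim();
      counts[key] = (counts[key] || 0) + 1;
    }
  }
  return counts;
}

const prefixes = prefixTerms(["star wars", "wars"]);
console.log("sta" in prefixes); // true  (like LIKE 'sta%')
console.log("war" in prefixes); // true  (the second phrase starts at "wars")
console.log("tar" in prefixes); // false (no phrase starts with "tar")
```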

And now you have a fully working search index without Firebase Functions.

For more info, see the backend version, which has a few more features (create indexes by field instead of _all etc).

Note: If you have a very large dataset, you could get a "too many index entries for entity" error or a Firestore "exceeds the maximum size" document error. If that is the case, consider parsing out pre tags, shortening your allowable article length, only adding the needed fields (like title) to the document, or writing custom code to split the index into multiple documents (I may do this eventually).

UPDATE: I fixed the bug that created overly large indexes. Check the code above: it now runs either the soundex block or the text block, not both!

Happy searching.