<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sagnik Bandyopadhyay</title>
    <description>The latest articles on DEV Community by Sagnik Bandyopadhyay (@sagnikbanerjeesb).</description>
    <link>https://dev.to/sagnikbanerjeesb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F558885%2F5782bffc-903b-48b4-a6c3-b8310b7b15f6.jpg</url>
      <title>DEV Community: Sagnik Bandyopadhyay</title>
      <link>https://dev.to/sagnikbanerjeesb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sagnikbanerjeesb"/>
    <language>en</language>
    <item>
      <title>BigQuery deduplication strategies</title>
      <dc:creator>Sagnik Bandyopadhyay</dc:creator>
      <pubDate>Sun, 30 Apr 2023 22:14:50 +0000</pubDate>
      <link>https://dev.to/sagnikbanerjeesb/bigquery-deduplication-strategies-230</link>
      <guid>https://dev.to/sagnikbanerjeesb/bigquery-deduplication-strategies-230</guid>
      <description>&lt;h2&gt;
  
  
  Problem statement
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Context
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Lets assume that we have data pipeline(s) dumping messages into Google BigQuery tables (lets call them raw tables).&lt;/li&gt;
&lt;li&gt;There maybe duplicate messages being stored in the raw table due to reasons like:

&lt;ul&gt;
&lt;li&gt;Duplicate messages sent from source&lt;/li&gt;
&lt;li&gt;Message inserted multiple times due to network issues and retries between the data pipeline and big query (although this can be addressed to some extent by using unique request ids while loading the data into BQ)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;BigQuery doesn't have unique indexes&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Assumptions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Duplicate messages have the same identifier (maybe message_id), although they may differ by any additional metadata injected by the data pipeline, e.g.: receive_timestamp&lt;/li&gt;
&lt;li&gt;The message has a business timestamp that is provided by upstream systems&lt;/li&gt;
&lt;li&gt;Messages can arrive late (maybe several days)&lt;/li&gt;
&lt;li&gt;Lets assume that we have a raw orders table with these fields:

&lt;ul&gt;
&lt;li&gt;message_id: unique id identifying one message&lt;/li&gt;
&lt;li&gt;message_timestamp: timestamp provided by the source systems&lt;/li&gt;
&lt;li&gt;order_id: unique id identifying an order&lt;/li&gt;
&lt;li&gt;order_amount: amount against the order&lt;/li&gt;
&lt;li&gt;receive_timestamp: receive time injected by the ingestion pipeline&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Objective
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;These messages need to be exposed as a data product to consumers in de-duplicated form and the consumers shouldn't have to worry about de-duplication&lt;/li&gt;
&lt;li&gt;Cost effectiveness. &lt;strong&gt;NOTE: BQ billing is based on number of bytes scanned and stored according to the on demand pricing model.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Solution 1
&lt;/h2&gt;

&lt;p&gt;One way would be to create a view that selects distinct entries from the raw table. E.g. view SQL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

CREATE OR REPLACE VIEW views.orders AS
SELECT
  DISTINCT
    message_id,
    message_timestamp,
    order_id,
    order_amount
FROM raw.orders


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now a consumer could query this view with an SQL like: &lt;code&gt;SELECT * FROM views.orders WHERE DATE(message_timestamp) BETWEEN '2023-01-01' AND '2023-01-31'&lt;/code&gt; to get orders placed in the month of january.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If the raw orders table is partitioned on message_timestamp field, then BiqQuery will query only those entries belonging to January's partition.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;However, &lt;em&gt;if we need some more flexibility in the uniqueness criteria then a DISTINCT keyword doesn't help&lt;/em&gt;. Example use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;receive_timestamp needs to be part of the view: The same message may have multiple entries with different receive_timestamps, in which case de-duplication by message_id won't work. Although in this specific case one could argue that receive_timestamp should not be exposed to consumers.&lt;/li&gt;
&lt;li&gt;in case of multiple entries with same message_id, the entry with the latest receive time needs to be selected&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Solution 2
&lt;/h2&gt;

&lt;p&gt;We can bring in more flexibility into the uniqueness criteria by using a QUALIFY CLAUSE in the view:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

CREATE OR REPLACE VIEW views.orders AS
SELECT
  *
FROM raw.orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY message_id ORDER BY receive_timestamp DESC) = 1


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This works, however, if the table is partitioned by message_timestamp, and consumers query the view with filtering on message_timestamp using a query like: &lt;code&gt;SELECT * FROM views.orders WHERE DATE(message_timestamp) BETWEEN '2023-01-01' AND '2023-01-31'&lt;/code&gt;, that &lt;em&gt;does not reduce the partitions scanned, and BQ ends up scanning all the partitions&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It would have worked if we could have placed the WHERE clause before the QUALIFY clause, something like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

CREATE OR REPLACE VIEW views.orders AS
SELECT
  *
FROM raw.orders
WHERE DATE(message_timestamp) BETWEEN '2023-01-01' AND '2023-01-31'
QUALIFY ROW_NUMBER() OVER (PARTITION BY message_id ORDER BY receive_timestamp DESC) = 1


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;em&gt;But while defining the view, we do not know the exact message_timestamp that needs to used for filtering, because it is known only at run time.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Solution 3
&lt;/h2&gt;

&lt;p&gt;What if we could have the WHERE clause before the QUALIFY clause, and inject the values at runtime?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/table-functions" rel="noopener noreferrer"&gt;Table functions&lt;/a&gt; can do that !&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

CREATE OR REPLACE TABLE FUNCTION functions.orders(start_date DATE, end_date DATE) AS 
SELECT 
  *
FROM raw.orders
WHERE DATE(message_timestamp) BETWEEN start_date AND end_date
QUALIFY ROW_NUMBER() OVER (PARTITION BY message_id ORDER BY receive_timestamp DESC) = 1


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Consumers can then query this function like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

SELECT * FROM functions.orders('2023-01-01', '2023-01-31') WHERE &amp;lt;other filters&amp;gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Here only January's partition is scanned by BQ.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To summarise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can create functions (lets call them raw table functions) on top of raw tables that accept data ranges as input and injects those parameters into the WHERE clause before QUALIFY&lt;/li&gt;
&lt;li&gt;Based on business needs, we may also create higher order functions if needed by composing one or more raw table functions (across different domains) so that the input date can be propagated till the lowest level.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Solution 4-A
&lt;/h2&gt;

&lt;p&gt;Can we retain the semantics of a table ? One way to do that would be to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dump the raw data first into a raw table partitioned by receive time&lt;/li&gt;
&lt;li&gt;Run a scheduled query to take the latest partition from the raw table and merge it into another deduplicated table (deduplicated table is partitioned by message_timestamp)&lt;/li&gt;
&lt;li&gt;Remove old partitions from the raw table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sample scheduled query that can be run daily to scan yesterday's raw records and merge them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

MERGE INTO dedup.orders dest USING (
SELECT 
  *
FROM raw_partitioned_by_receive_timestamp.orders
WHERE DATE(receive_timestamp) = CURRENT_DATE()-1
QUALIFY ROW_NUMBER() OVER (PARTITION BY message_id ORDER BY receive_timestamp DESC) = 1
) src
ON dest.message_id = src.message_id

WHEN NOT MATCHED THEN INSERT(message_id, message_timestamp, order_id, order_amount, receive_timestamp)
VALUES (src.message_id, src.message_timestamp, src.order_id, src.order_amount, src.receive_timestamp)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This scheduled query will scan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only one partition of raw table&lt;/li&gt;
&lt;li&gt;One column (message_id because that is part of the merge ON clause) from &lt;strong&gt;all partitions of the deduplicated table&lt;/strong&gt;. Can we optimise this ?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also note that at query time (e.g.: when a consumer queries the deduplicated table to fetch orders for one month), the deduplicated table's partitions will also be scanned based on partition column based where clause.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution 4-B
&lt;/h2&gt;

&lt;p&gt;We optimise the previous solution by replacing the scheduled merge query with a scheduled two step stored procedure.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;As a first step, we query the minimum and maximum message_timestamp from the latest partition of raw table like this:
```
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;SELECT&lt;br&gt;
  DATE(min(message_timestamp)) AS minimum_message_timestamp,&lt;br&gt;
  DATE(max(message_timestamp)) AS maximum_message_tiemstamp&lt;br&gt;
FROM raw_partitioned_by_receive_timestamp.orders&lt;br&gt;
WHERE&lt;br&gt;
  DATE(receive_timestamp) = CURRENT_DATE()-1&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
2. We can:
   1. select entries from deduplicated table within the minimum_message_timestamp and maximum_message_tiemstamp window
   2. compare the latest raw table's partition with the entries selected in previous step to figure out new entries that need to be added to the deduplicated table and then insert only those entries

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;INSERT INTO dedup.orders&lt;br&gt;
WITH existing AS (&lt;br&gt;
  SELECT&lt;br&gt;
    message_id&lt;br&gt;
  FROM dedup.orders&lt;br&gt;
  WHERE&lt;br&gt;
    message_timestamp BETWEEN &amp;lt;&amp;gt; AND &amp;lt;&amp;gt;&lt;br&gt;
)&lt;br&gt;
SELECT&lt;br&gt;
  raw.*&lt;br&gt;
FROM raw_partitioned_by_receive_timestamp.orders raw&lt;br&gt;
LEFT JOIN existing&lt;br&gt;
  ON raw.message_id = existing.message_id&lt;br&gt;
WHERE&lt;br&gt;
  receive_timestamp = CURRENT_DATE()-1&lt;br&gt;
  AND existing.message_id IS NULL&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
NOTE: It is important to perform these two steps separately and not combine them into one INSERT query because the partition based where clause provides cost benefit only if the operands are provided statically.

This scans:
- One column (message_timestamp) of latest partition of raw table
- All columns from latest partition of raw table
- One column (message_id) from few partitions of deduplicated table. Number of partitions scanned depends on the minimum and maximum message time obtained from the raw table's latest partition.
- At query time selected partitions of deduplicated table are also scanned

---

## Trade-offs

| | Sol 1 |  Sol 3 | Sol 4-B
|-|-|-|-|
| **Uniqueness criteria** | Works efficiently only if upstream provided attributes are exactly same for duplicate entries with same message_id, and if only upstream provided attributes are required to be projected through the view | + Flexible uniqueness definition |  + Flexible uniqueness definition |
| **Semantics** | + View semantics | Consumers need to work with functions. Functions do not have user friendly schemas like views | + View semantics
| **Latency** | + No delay | + No delay | Depends on the frequency of scheduled query |
| **Development / maintenance overhead** | + Simple to maintain | + Simple to maintain | Relatively complex to maintain. May even need alerting to check for accidental duplicates |

From a cost point of view, this is the decision matrix:

![Use solution 1 or 3 for high duplication with high query frequency, high duplication with low query frequency, low duplication with low query frequency ; whereas use solution 4-B for high duplication with high query frequency](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y6qss67s9xrsro8deejp.png)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>bigquery</category>
      <category>sql</category>
      <category>googlecloud</category>
      <category>data</category>
    </item>
    <item>
      <title>Partial text search on Mongo</title>
      <dc:creator>Sagnik Bandyopadhyay</dc:creator>
      <pubDate>Mon, 06 Jun 2022 19:38:03 +0000</pubDate>
      <link>https://dev.to/sagnikbanerjeesb/partial-text-search-on-mongo-46j3</link>
      <guid>https://dev.to/sagnikbanerjeesb/partial-text-search-on-mongo-46j3</guid>
      <description>&lt;h3&gt;
  
  
  Scenario
&lt;/h3&gt;

&lt;p&gt;Lets consider a Mongo collection &lt;code&gt;topics&lt;/code&gt; that looks somewhat like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"computer science"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"numerical computation"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medical science"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"information technology"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sci fi"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We would like to retrieve documents whose title "contains" &lt;code&gt;comp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here are few possible ways that we can consider:&lt;/p&gt;

&lt;h3&gt;
  
  
  OPTION 1 - Simple regex query
&lt;/h3&gt;

&lt;p&gt;We can perform a &lt;a href="https://www.mongodb.com/docs/manual/reference/operator/query/regex/"&gt;regex query&lt;/a&gt; on the title field, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nx"&gt;comp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="sr"&gt;/i&lt;/span&gt;&lt;span class="err"&gt;)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OR&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;$regex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;comp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;$options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;i&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NOTE: Option &lt;code&gt;i&lt;/code&gt; is used to perform case insensitive search.&lt;/p&gt;

&lt;p&gt;Mongo will need to examine every document in the collection and perform a regex match on the title field.&lt;/p&gt;

&lt;p&gt;We can define an index (nothing fancy, the usual index) on the title field, like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;createIndex&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now if we perform the regex query, mongo will not examine all the documents, but it still needs to examine all the index keys, and then examine and retrieve only those documents that matched.&lt;/p&gt;

&lt;p&gt;This works good for a small collection, but for large collections (having more than 100k or 1M docs) the query becomes slow, and it also consumes a lot of resources from the mongo db instances, which is definitely bad.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This solution is the simplest to implement, but doesn't scale well.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  OPTION 2 - Index on tokenised words
&lt;/h3&gt;

&lt;p&gt;If we are okay to support searching only by words, and not partial words, then tokenising the title field can help. E.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"computer science"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"titleTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"computer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"science"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"numerical computation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"titleTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"numerical"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"computation"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medical science"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"titleTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"medical"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"science"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"information technology"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"titleTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"information"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"technology"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sci fi"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"titleTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"sci"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fi"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While saving / updating documents, we can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;convert the &lt;code&gt;title&lt;/code&gt; field to lower case (to support case insensitive searches)&lt;/li&gt;
&lt;li&gt;split the text into multiple words&lt;/li&gt;
&lt;li&gt;save those words as an array, as &lt;code&gt;titleTokens&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And then we can define a &lt;a href="https://www.mongodb.com/docs/manual/core/index-multikey/"&gt;multi key&lt;/a&gt; index on &lt;code&gt;titleTokens&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now if we need to search for a text, we can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;convert the search text to lower case&lt;/li&gt;
&lt;li&gt;split the search text into multiple words (using the same algorithm used to tokenise the &lt;code&gt;title&lt;/code&gt; field)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.mongodb.com/docs/manual/tutorial/query-arrays/"&gt;Query the array&lt;/a&gt; &lt;code&gt;titleTokens&lt;/code&gt; with the tokens obtained from search text, e.g.:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;titleTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;$all&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;science&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Above query will retrieve all documents having the token "science" in the &lt;code&gt;titleTokens&lt;/code&gt; field, i.e.: document #1 and #3&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;OR&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;titleTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;$all&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;computer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;science&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Above query will retrieve all documents having both "computer" and "science" tokens, and hence will retrieve document #1&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Using this solution we cannot search by partial words like "comp", but can search with individual words pretty fast.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  OPTION 3 - Text index
&lt;/h3&gt;

&lt;p&gt;Mongo has a provision for defining one &lt;a href="https://www.mongodb.com/docs/manual/core/index-text/"&gt;text index&lt;/a&gt; per collection. This works pretty much the same way as OPTION 2, i.e., the title field's value will be tokenised into words and only search by full words (not partial words) is supported.&lt;/p&gt;

&lt;p&gt;The index can be created like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;createIndex&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is one way to query the text index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;$text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;$search&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;computer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Above query will retrieve all documents having the word "computer" in the &lt;code&gt;title&lt;/code&gt; field.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here is another:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;$text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;$search&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;computer science&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Above query will retrieve all documents having either "computer" or "science" in the &lt;code&gt;title&lt;/code&gt; field, i.e.: documents #1 and #3&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yet another:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="na"&gt;$text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;$search&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;computer sci&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a phrase query. It will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tokenise the search text into "computer" and "sci"&lt;/li&gt;
&lt;li&gt;fetch documents containing either "computer" (this will match doc #1) or "sci" (this will match doc #5)&lt;/li&gt;
&lt;li&gt;among all these documents, mongo will examine the title fields and check whether the exact phrase "computer sci" is present or not. only doc #1 will pass this check, which will be returned as the result of the query&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Using this solution we can efficiently search for full words or for phrases containing at least one full word.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  OPTION 4 - Atlas search
&lt;/h3&gt;

&lt;p&gt;This is only applicable when mongo is running on &lt;a href="https://www.mongodb.com/atlas/database"&gt;atlas&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Atlas provides a very powerful way to &lt;a href="https://www.mongodb.com/atlas/search"&gt;search&lt;/a&gt; through documents for partial words. &lt;code&gt;autocomplete&lt;/code&gt; indexing and searching perfectly meets our use case.&lt;/p&gt;

&lt;p&gt;We can define a search index on &lt;code&gt;topics&lt;/code&gt; following these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Choose an &lt;a href="https://www.mongodb.com/docs/atlas/atlas-search/analyzers/"&gt;analyser&lt;/a&gt;, maybe &lt;code&gt;whitespace&lt;/code&gt;. This will tokenise the field's value into words&lt;/li&gt;
&lt;li&gt;Disable dynamic mapping (assuming that we already know which field to index, and the field name / hierarchy doesn't change)&lt;/li&gt;
&lt;li&gt;Define explicit fields to index, in this case it can be the &lt;code&gt;title&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;Also define two datatypes for the field:

&lt;ol&gt;
&lt;li&gt;String - for complete text matching (also needed for autocomplete match if we want to increase the search score of completely matched strings)&lt;/li&gt;
&lt;li&gt;Autocomplete - for partial matching

&lt;ol&gt;
&lt;li&gt;Choose the tokeniser for Autocomplete datatype:

&lt;ol&gt;
&lt;li&gt;edgeGram: This will allow prefix searches against tokenised words. E.g.: searching by "comp" will return entities #1 and #2, but searching by "puter" wont return anything&lt;/li&gt;
&lt;li&gt;nGram: This will behave like a contains search. E.g.: searching by "put", will return entities #1 and #2&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;
&lt;li&gt;Define:

&lt;ol&gt;
&lt;li&gt;minGrams: The minimum length from which tokenisation will start. If minGrams is 2, then by searching for "c" we wont get any results, we need to have at least 2 characters. We need to be careful with this field. Setting a very low minGrams value (like 1) with nGram tokeniser will result in a huge index, and setting something very high won't be useful enough.&lt;/li&gt;
&lt;li&gt;maxGrams: The maximum length of characters which will be tokenised&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the Atlas search index is ready, we can query the mongo collection using an aggregation pipeline like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;$search&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;atlas&lt;/span&gt; &lt;span class="nx"&gt;search&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;autocomplete&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;comp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;title&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NOTE:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;These search indexes can accomplish a lot more (fuzzy search, compound search, and maybe more).&lt;/li&gt;
&lt;li&gt;The search index can be defined via &lt;a href="https://registry.terraform.io/providers/mongodb/mongodbatlas/latest/docs/resources/search_index"&gt;terraform&lt;/a&gt; (if we want to follow infrastructure as code :-)).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Using this solution we can efficiently search for prefixes of words or for any part of the text, if we run mongo on atlas.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Above options were easy solutions to the problem. Below are some not so easy solutions:&lt;/p&gt;

&lt;h3&gt;
  
  
  ELASTIC SEARCH
&lt;/h3&gt;

&lt;p&gt;If we strictly do not want to use atlas search, we may consider:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Continuously syncing the mongo collection (maybe only the title and id fields) to elastic search&lt;/li&gt;
&lt;li&gt;Defining appropriate autocomplete edgeGram / nGram indexes&lt;/li&gt;
&lt;li&gt;Executing queries against elastic and retrieving matched documents by id from mongo. Or we may also consider storing the whole document in elastic itself (in that case we need not retrieve docs by id from mongo as we already get the full document from elastic itself).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  FROM SCRATCH !
&lt;/h3&gt;

&lt;p&gt;If none of the above options work for us, we can build our own indexes using prefix trie or suffix trie, and point to mongo document ids (or any database reference) from them. Searching the index for all possibilities for a given search text may become a difficult / costly operation. We may consider caching the top results in each trie node. If space becomes an issue, we can distribute the trie across multiple nodes.&lt;/p&gt;

&lt;p&gt;This is definitely not an easy job. We'll encounter multiple challenges / questions related to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-computation of search results in each trie node: What algorithm to use? How to keep the trie index in sync with the underlying mongo collection? How much latency is acceptable? How will the data infrastructure look like?&lt;/li&gt;
&lt;li&gt;Trie sharding: How to evenly distribute the trie across multiple nodes?&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Please feel free to suggest better ideas / corrections to this content. Thanks in advance.&lt;/p&gt;

</description>
      <category>mongodb</category>
    </item>
    <item>
      <title>Why asymmetric encryption?</title>
      <dc:creator>Sagnik Bandyopadhyay</dc:creator>
      <pubDate>Thu, 14 Jan 2021 18:26:26 +0000</pubDate>
      <link>https://dev.to/sagnikbanerjeesb/why-asymmetric-encryption-2na7</link>
      <guid>https://dev.to/sagnikbanerjeesb/why-asymmetric-encryption-2na7</guid>
      <description>&lt;h2&gt;
  
  
  What is encryption?
&lt;/h2&gt;

&lt;p&gt;Say Raj needs to send a message to Priya. Easy right? But, there is one catch: whatever message Raj sends to Priya will be visible to the whole world. Argh!&lt;/p&gt;

&lt;p&gt;Raj comes up with an idea: Raj will not send the actual message to Priya. Instead, he will transform it to another form which will appear nonsense to the whole world, but Priya will be able to transform it back to the real message.&lt;/p&gt;

&lt;p&gt;E.g.:&lt;/p&gt;

&lt;p&gt;Say the actual message is: &lt;code&gt;Hello&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Raj applies some mathematical operation to "Hello" and it becomes: &lt;code&gt;abc123&lt;/code&gt; (aka encryption). Which is total nonsense to everyone.&lt;/p&gt;

&lt;p&gt;But Priya knows about the reverse mathetimatical operation which will transform &lt;code&gt;abc123&lt;/code&gt; to &lt;code&gt;Hello&lt;/code&gt; (aka decryption).&lt;/p&gt;

&lt;p&gt;Great! Problem solved!&lt;/p&gt;




&lt;h2&gt;
  
  
  Types of encryption
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Symmetric&lt;/strong&gt; encryption: There is one key used to encrypt the message. The same key can be used to decrypt the message. Example algorithms: AES, DES, ...&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asymmetric&lt;/strong&gt; encryption: One key is used to encrypt the message. But a different key is used to decrypt the message. The same key cannot be used to perform both encryption and decryption. Some use cases are illustrated below. Example algorithms: RSA, ECDSA, ECDH, ...&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Use case 1
&lt;/h2&gt;

&lt;p&gt;Here is a conversation between Raj and Priya over the internet on an un-secure chatting platform, i.e.: all these chats are visible to everyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raj&lt;/strong&gt;: Hello Priya, here is our revenue forecast for this year: &lt;code&gt;jskdfn&lt;/code&gt;. Its confidential, hence encrypted with AES Key: &lt;code&gt;A!123&lt;/code&gt;. You can use the same to decode it.&lt;br&gt;
&lt;em&gt;[Raj used symmetric encryption to encrypt the confidential data and also shared the key used to encrypt. NOTE: the same key can be used to decrypt it]&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;Priya&lt;/strong&gt;: Raj, what did you do! Now everybody knows the secret key and they can use it decrypt the revenue forecast! The whole world knows about it now.&lt;/p&gt;

&lt;h3&gt;
  
  
  What went wrong?
&lt;/h3&gt;

&lt;p&gt;Raj encrypted the confidential information with a symmetric key. All good so far. But Raj shared the symmetric key with Priya over an unsecured channel, which is a blunder, because once people get to know the symmetric key, they can decrypt the message.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alternate scenario
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Raj&lt;/strong&gt;: Hello Priya, can you please share your &lt;strong&gt;PUBLIC key&lt;/strong&gt; [&lt;em&gt;more on PUBLIC and PRIVATE keys later&lt;/em&gt;]?&lt;br&gt;
&lt;strong&gt;Priya&lt;/strong&gt;: Here it is: &lt;code&gt;pubkey001&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Raj&lt;/strong&gt;: Here is our revenue forecast encrypted with your Public Key: &lt;code&gt;ueihwf&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Priya&lt;/strong&gt;: Thanks, I will use my &lt;strong&gt;PRIVATE key&lt;/strong&gt; to decrypt the message.&lt;/p&gt;

&lt;h4&gt;
  
  
  What happened here?
&lt;/h4&gt;

&lt;p&gt;With asymmetric encryption, we get a pair of keys:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Private Key: As the name suggests its supposed to remain PRIVATE always. It shouldn't be shared with anyone.&lt;/li&gt;
&lt;li&gt;A Public Key: Its supposed to be public - shared with everyone freely.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If we encrypt a message using a Public Key, it can be decrypted ONLY with the matching Private Key (and not with the Public Key). Similarly if we encrypt a message with a Private Key, it can be decrypted ONLY with the matching Public Key.&lt;/p&gt;

&lt;p&gt;In the alternate scenario illustrated above, Priya shared her &lt;strong&gt;PUBLIC&lt;/strong&gt; key with Raj. Raj encrypted the confidential message with the &lt;strong&gt;PUBLIC&lt;/strong&gt; key and shared with Priya. Now even if anybody else gets to know about the Public Key and encrypted message, they WON'T be able to decrypt it, as it can be decrypted ONLY with the &lt;strong&gt;PRIVATE&lt;/strong&gt; Key. And only Priya has the Private key which she hasn't shared with anyone.&lt;/p&gt;

&lt;p&gt;NOTE: It is not always possible to encrypt an entire message with an asymmetric key (encrypting long messages with asymmetric keys will be quite slow and expensive). A general practise is to GENERATE a random symmetric key, and encrypt the symmetric key alone with the asymmetric key. And, the actual message content is encrypted with the symmetric key. So we get a payload of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Symmetric key encrypted with asymmetric key&lt;/li&gt;
&lt;li&gt;Message contents encrypted with symmetric key&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The recipient can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieve the symmetric key by decrypting it with the matching asymmetric key pair&lt;/li&gt;
&lt;li&gt;Decrypt the actual message contents with the symmetric key obtained above&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Use case 2
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Sam&lt;/strong&gt;: Hi Priya, Raj has left this message for you: "Sell all my shares today".&lt;br&gt;
&lt;strong&gt;Priya&lt;/strong&gt;: Hello Sam. How do I know that it is really Raj who left that message? Can you please ask Raj to digitally sign this message with his &lt;strong&gt;PRIVATE&lt;/strong&gt; key?&lt;br&gt;
&lt;strong&gt;Sam&lt;/strong&gt;: Sure, Raj has sent the signature as well: &lt;code&gt;mzakld&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Priya&lt;/strong&gt;: Cool, since I already have Raj's &lt;strong&gt;PUBLIC&lt;/strong&gt; key, I can now verify whether it was really Raj who sent the message.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is this digital signature?
&lt;/h3&gt;

&lt;p&gt;In the previous use case, we saw that the message was encrypted with the Public Key. Here it's kinda reverse. This is how Raj signs the message:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;He generates a &lt;a href="https://www.educative.io/edpresso/what-is-hashing"&gt;Hash&lt;/a&gt; of the message.&lt;/li&gt;
&lt;li&gt;And then encrypts the Hash with his &lt;strong&gt;PRIVATE&lt;/strong&gt; key, which becomes the signature of the message&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  How does Priya verify the signature?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Priya should already be knowing Raj's &lt;strong&gt;PUBLIC&lt;/strong&gt; key (Remember, PUBLIC keys are meant to be public and known by everyone)&lt;/li&gt;
&lt;li&gt;She decrypts the signature (&lt;code&gt;mzakld&lt;/code&gt;) using Raj's &lt;strong&gt;PUBLIC&lt;/strong&gt; key. Lets call it - HASH_VALUE_1&lt;/li&gt;
&lt;li&gt;And then independently generates a hash of the message contents ("Sell all my shares today"). Lets call it - HASH_VALUE_2&lt;/li&gt;
&lt;li&gt;She verifies whether HASH_VALUE_1 is equal to HASH_VALUE_2.&lt;/li&gt;
&lt;li&gt;Anybody other than Raj will not have Raj's &lt;strong&gt;PRIVATE&lt;/strong&gt; key. If any other Private key is used to sign the message, then when the signature is decrypted using Raj's &lt;strong&gt;PUBLIC&lt;/strong&gt; key, it won't match with the hash of the message contents. Thus Priya will know that the message was not from Raj.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;NOTES:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the message is being sent over an un-secure channel, then the message (and optionally its signature as well) can be encrypted using the approach described in Use case 1.&lt;/li&gt;
&lt;li&gt;Depending on the use case, additional attributes (like UUID, timestamp) maybe included in the message to ensure every message is unique. This will help in preventing replay attacks.&lt;/li&gt;
&lt;li&gt;In all cases, the Private Key always remains private and is never ever shared with anyone.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Use case 3
&lt;/h2&gt;

&lt;p&gt;Assume that Priya has received Raj's public key prior to this conversation and she is certain that the Public key really belongs to Raj.&lt;/p&gt;

&lt;p&gt;Below is a conversation taking place of an un-secure hackable channel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raj&lt;/strong&gt;: Hi Priya&lt;br&gt;
&lt;strong&gt;Priya&lt;/strong&gt;: Hello Raj. I don't know for sure whether you are really Raj or not. To prove that you are indeed Raj, I challenge you to encrypt this random text that I just generated: &lt;code&gt;njsdks&lt;/code&gt; using your &lt;strong&gt;PRIVATE&lt;/strong&gt; key.&lt;br&gt;
&lt;strong&gt;Raj&lt;/strong&gt;: Sure, here it is: &lt;code&gt;zsdeaf&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Priya&lt;/strong&gt;: Cool, now I can decrypt &lt;code&gt;zsdeaf&lt;/code&gt; using Raj's &lt;strong&gt;PUBLIC&lt;/strong&gt; key and verify whether it matches the original challenge text (&lt;code&gt;njsdks&lt;/code&gt;) that I had generated.&lt;br&gt;
&lt;strong&gt;Priya&lt;/strong&gt;: I have GENERATED a random symmetric key and have encrypted it with your &lt;strong&gt;PUBLIC&lt;/strong&gt; key: &lt;code&gt;opwefi&lt;/code&gt;. Going forward lets encrypt all our messages with this symmetric key.&lt;br&gt;
&lt;strong&gt;Raj&lt;/strong&gt;: Nice. I can decrypt the symmetric key (&lt;code&gt;opwefi&lt;/code&gt;) using my &lt;strong&gt;PRIVATE&lt;/strong&gt; key, and use the decrypted symmetric key to encrypt / decrypt further communications with Priya.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is happening?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Priya challenged Raj to encrypt a random text that Priya had generated with Raj's Private Key.&lt;/li&gt;
&lt;li&gt;Since only Raj has access to his own Private Key, only Raj should be able to encrypt the challenge text with his Private key.&lt;/li&gt;
&lt;li&gt;And the encrypted challenge text can only be decrypted with Raj's Public Key which is known to Priya (and maybe everybody else).&lt;/li&gt;
&lt;li&gt;So Priya is able to verify whether Raj's Private Key was indeed used to encrypt the challenge text, which will prove that the person she is talking to has access to Raj's Private Key, which can only be Raj himself (of course unless Raj's private key has been stolen / leaked :-p).&lt;/li&gt;
&lt;li&gt;Next, Priya generated a random symmetric key, encrypted it with Raj's Public Key and shared it.&lt;/li&gt;
&lt;li&gt;Only Raj will be able to decrypt it as Raj alone has access to Raj's Private Key.&lt;/li&gt;
&lt;li&gt;Going forward they can use the symmetric key to encrypt all messages as it will be faster and less expensive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Above steps are a very simplified version of SSL/TLS handshake.&lt;/p&gt;

&lt;p&gt;Optionally, Raj can also challenge Priya to prove herself. This is somewhat similar to mutual TLS, however such mutual TLS can happen only if both the entities are known to each other.&lt;/p&gt;




&lt;p&gt;The three use cases mentioned above are fundamental building blocks of the https protocol and blockchain (blockchain is so much more though).&lt;/p&gt;

&lt;p&gt;Hope you liked the reading. Please share your feedback as comments or DM / connect with me over &lt;a href="https://www.linkedin.com/in/sagnik-bandyopadhyay-74a492138/"&gt;LinkedIn&lt;/a&gt;!&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
