DEV Community: Sagnik Bandyopadhyay

BigQuery deduplication strategies

Sagnik Bandyopadhyay — Sun, 30 Apr 2023 22:14:50 +0000

Problem statement

Context

Let's assume that we have data pipeline(s) dumping messages into Google BigQuery tables (let's call them raw tables).
There may be duplicate messages stored in the raw table for reasons like:
- Duplicate messages sent from the source
- Messages inserted multiple times due to network issues and retries between the data pipeline and BigQuery (although this can be addressed to some extent by using unique request ids while loading the data into BQ)
BigQuery doesn't have unique indexes, so it cannot reject these duplicates on its own.

Assumptions

Duplicate messages have the same identifier (maybe message_id), although they may differ by any additional metadata injected by the data pipeline, e.g.: receive_timestamp
The message has a business timestamp that is provided by upstream systems
Messages can arrive late (maybe several days)
Let's assume that we have a raw orders table with these fields:
- message_id: unique id identifying one message
- message_timestamp: timestamp provided by the source systems
- order_id: unique id identifying an order
- order_amount: amount against the order
- receive_timestamp: receive time injected by the ingestion pipeline

Objective

These messages need to be exposed as a data product to consumers in de-duplicated form and the consumers shouldn't have to worry about de-duplication
Cost effectiveness. NOTE: BQ billing is based on number of bytes scanned and stored according to the on demand pricing model.

Solution 1

One way would be to create a view that selects distinct entries from the raw table. E.g. view SQL:



CREATE OR REPLACE VIEW views.orders AS
SELECT
  DISTINCT
    message_id,
    message_timestamp,
    order_id,
    order_amount
FROM raw.orders

Now a consumer could query this view with SQL like: SELECT * FROM views.orders WHERE DATE(message_timestamp) BETWEEN '2023-01-01' AND '2023-01-31' to get orders placed in the month of January.

If the raw orders table is partitioned on the message_timestamp field, then BigQuery will scan only the partitions that cover January.

However, if we need more flexibility in the uniqueness criteria, the DISTINCT keyword doesn't help. Example use cases:

receive_timestamp needs to be part of the view: The same message may have multiple entries with different receive_timestamps, so the rows are no longer identical and DISTINCT will not collapse them. Although in this specific case one could argue that receive_timestamp should not be exposed to consumers.
In case of multiple entries with the same message_id, the entry with the latest receive time needs to be selected.
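
To make the limitation concrete, here is a minimal Python sketch that mimics SELECT DISTINCT over a few made-up rows. The sample rows and the `distinct` helper are assumptions for illustration only and have nothing to do with how BigQuery executes the query.

```python
from datetime import datetime

# Hypothetical raw rows: message m1 was delivered twice with different receive times.
raw_orders = [
    {"message_id": "m1", "order_id": "o1", "order_amount": 100,
     "receive_timestamp": datetime(2023, 1, 5, 10, 0)},
    {"message_id": "m1", "order_id": "o1", "order_amount": 100,
     "receive_timestamp": datetime(2023, 1, 5, 10, 7)},
    {"message_id": "m2", "order_id": "o2", "order_amount": 250,
     "receive_timestamp": datetime(2023, 1, 6, 9, 30)},
]

def distinct(rows, columns):
    """Mimic SELECT DISTINCT over the given columns."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result

upstream_only = distinct(raw_orders, ["message_id", "order_id", "order_amount"])
with_receive_time = distinct(
    raw_orders, ["message_id", "order_id", "order_amount", "receive_timestamp"]
)

print(len(upstream_only))      # 2: the duplicate of m1 collapses
print(len(with_receive_time))  # 3: the duplicate survives because receive_timestamp differs
```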

Solution 2

We can bring more flexibility into the uniqueness criteria by using a QUALIFY clause in the view:



CREATE OR REPLACE VIEW views.orders AS
SELECT
  *
FROM raw.orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY message_id ORDER BY receive_timestamp DESC) = 1

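This works. However, if the table is partitioned by message_timestamp and consumers filter the view on message_timestamp using a query like: SELECT * FROM views.orders WHERE DATE(message_timestamp) BETWEEN '2023-01-01' AND '2023-01-31', the filter does not reduce the partitions scanned, and BQ ends up scanning all the partitions. Presumably this is because the consumer's filter applies to the output of the window function, and since message_timestamp is not part of PARTITION BY, applying it earlier could change which row is ranked first.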

It would have worked if we could have placed the WHERE clause before the QUALIFY clause, something like:



CREATE OR REPLACE VIEW views.orders AS
SELECT
  *
FROM raw.orders
WHERE DATE(message_timestamp) BETWEEN '2023-01-01' AND '2023-01-31'
QUALIFY ROW_NUMBER() OVER (PARTITION BY message_id ORDER BY receive_timestamp DESC) = 1

But while defining the view, we do not know the exact message_timestamp range that needs to be used for filtering, because it is known only at run time.

Solution 3

What if we could have the WHERE clause before the QUALIFY clause, and inject the values at runtime?

Table functions can do that!



CREATE OR REPLACE TABLE FUNCTION functions.orders(start_date DATE, end_date DATE) AS 
SELECT 
  *
FROM raw.orders
WHERE DATE(message_timestamp) BETWEEN start_date AND end_date
QUALIFY ROW_NUMBER() OVER (PARTITION BY message_id ORDER BY receive_timestamp DESC) = 1

Consumers can then query this function like this:



SELECT * FROM functions.orders('2023-01-01', '2023-01-31') WHERE <other filters>

Here only January's partition is scanned by BQ.

To summarise:

We can create functions (let's call them raw table functions) on top of raw tables that accept date ranges as input and inject those parameters into the WHERE clause before QUALIFY.
Based on business needs, we may also create higher order functions by composing one or more raw table functions (across different domains) so that the input dates are propagated down to the lowest level.
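
The order of operations (filter on the partition column first, then keep the latest row per message_id) can be sketched in plain Python. This is only an analogy under stated assumptions: `raw_partitions` is a made-up in-memory stand-in for a table partitioned by message date, and `orders` is a hypothetical counterpart of the table function above.

```python
from datetime import date, datetime

# Hypothetical raw table, stored as one list of rows per message date (the partition).
raw_partitions = {
    date(2022, 12, 31): [
        {"message_id": "m0", "message_timestamp": datetime(2022, 12, 31, 23, 0),
         "receive_timestamp": datetime(2022, 12, 31, 23, 1)},
    ],
    date(2023, 1, 5): [
        {"message_id": "m1", "message_timestamp": datetime(2023, 1, 5, 10, 0),
         "receive_timestamp": datetime(2023, 1, 5, 10, 1)},
        {"message_id": "m1", "message_timestamp": datetime(2023, 1, 5, 10, 0),
         "receive_timestamp": datetime(2023, 1, 5, 10, 9)},  # retry of m1
    ],
    date(2023, 2, 1): [
        {"message_id": "m2", "message_timestamp": datetime(2023, 2, 1, 8, 0),
         "receive_timestamp": datetime(2023, 2, 1, 8, 2)},
    ],
}

def orders(start_date, end_date):
    """Filter on the partition column first, then keep the latest row per message_id."""
    scanned = []
    latest = {}
    for day in sorted(raw_partitions):
        if not (start_date <= day <= end_date):
            continue  # partition pruned: its rows are never read
        scanned.append(day)
        for row in raw_partitions[day]:
            current = latest.get(row["message_id"])
            if current is None or row["receive_timestamp"] > current["receive_timestamp"]:
                latest[row["message_id"]] = row
    return list(latest.values()), scanned

rows, scanned_partitions = orders(date(2023, 1, 1), date(2023, 1, 31))
print([r["message_id"] for r in rows])  # ['m1']
print(scanned_partitions)               # [datetime.date(2023, 1, 5)]
```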

Solution 4-A

Can we retain the semantics of a table? One way to do that would be to:

Dump the raw data first into a raw table partitioned by receive time
Run a scheduled query to take the latest partition from the raw table and merge it into another deduplicated table (deduplicated table is partitioned by message_timestamp)
Remove old partitions from the raw table

Sample scheduled query that can be run daily to scan yesterday's raw records and merge them:



MERGE INTO dedup.orders dest USING (
SELECT 
  *
FROM raw_partitioned_by_receive_timestamp.orders
WHERE DATE(receive_timestamp) = CURRENT_DATE()-1
QUALIFY ROW_NUMBER() OVER (PARTITION BY message_id ORDER BY receive_timestamp DESC) = 1
) src
ON dest.message_id = src.message_id

WHEN NOT MATCHED THEN INSERT(message_id, message_timestamp, order_id, order_amount, receive_timestamp)
VALUES (src.message_id, src.message_timestamp, src.order_id, src.order_amount, src.receive_timestamp)

This scheduled query will scan:

Only one partition of the raw table
One column (message_id, because it is part of the merge ON clause) from all partitions of the deduplicated table. Can we optimise this?

Also note that at query time (e.g. when a consumer queries the deduplicated table to fetch orders for one month), the deduplicated table's partitions are also scanned, limited by the WHERE clause on the partition column.
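
The insert-only merge can be pictured with a small Python sketch. The dictionaries and the `merge_batch` helper below are assumptions made for illustration; they mirror the WHEN NOT MATCHED THEN INSERT branch, not BigQuery's actual execution.

```python
from datetime import datetime

def latest_per_message(rows):
    """Keep the row with the latest receive_timestamp for each message_id."""
    latest = {}
    for row in rows:
        current = latest.get(row["message_id"])
        if current is None or row["receive_timestamp"] > current["receive_timestamp"]:
            latest[row["message_id"]] = row
    return latest

def merge_batch(dedup, batch):
    """Insert batch rows whose message_id is not yet in dedup; return the inserted ids."""
    inserted = []
    for message_id, row in latest_per_message(batch).items():
        # The membership test looks across the whole dedup table, which is
        # why message_id is read from every partition of the destination.
        if message_id not in dedup:  # WHEN NOT MATCHED THEN INSERT
            dedup[message_id] = row
            inserted.append(message_id)
    return inserted

# Hypothetical deduplicated table keyed by message_id.
dedup = {"m1": {"message_id": "m1", "receive_timestamp": datetime(2023, 1, 4, 9, 0)}}

# Hypothetical batch: yesterday's raw partition.
batch = [
    {"message_id": "m1", "receive_timestamp": datetime(2023, 1, 5, 8, 0)},   # late duplicate
    {"message_id": "m2", "receive_timestamp": datetime(2023, 1, 5, 8, 5)},
    {"message_id": "m2", "receive_timestamp": datetime(2023, 1, 5, 8, 6)},   # retry of m2
]

inserted = merge_batch(dedup, batch)
print(inserted)       # ['m2']
print(sorted(dedup))  # ['m1', 'm2']
```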

Solution 4-B

We optimise the previous solution by replacing the scheduled merge query with a scheduled two step stored procedure.

1. As a first step, we query the minimum and maximum message_timestamp from the latest partition of the raw table like this:

```
SELECT
  DATE(MIN(message_timestamp)) AS minimum_message_timestamp,
  DATE(MAX(message_timestamp)) AS maximum_message_timestamp
FROM raw_partitioned_by_receive_timestamp.orders
WHERE
  DATE(receive_timestamp) = CURRENT_DATE()-1
```

2. We can:
   1. select entries from the deduplicated table within the minimum_message_timestamp and maximum_message_timestamp window
   2. compare the latest raw table's partition with the entries selected in the previous step to figure out new entries that need to be added to the deduplicated table, and then insert only those entries

```
INSERT INTO dedup.orders
WITH existing AS (
  SELECT
    message_id
  FROM dedup.orders
  WHERE
    -- values obtained in step 1, passed in as constants
    DATE(message_timestamp) BETWEEN <minimum_message_timestamp> AND <maximum_message_timestamp>
)
SELECT
  raw.*
FROM raw_partitioned_by_receive_timestamp.orders raw
LEFT JOIN existing
  ON raw.message_id = existing.message_id
WHERE
  DATE(raw.receive_timestamp) = CURRENT_DATE()-1
  AND existing.message_id IS NULL
-- keep one row per message_id within the batch, as in Solution 4-A
QUALIFY ROW_NUMBER() OVER (PARTITION BY raw.message_id ORDER BY raw.receive_timestamp DESC) = 1
```


NOTE: It is important to perform these two steps separately and not combine them into one INSERT query because the partition based where clause provides cost benefit only if the operands are provided statically.

This scans:
- One column (message_timestamp) of latest partition of raw table
- All columns from latest partition of raw table
- One column (message_id) from few partitions of deduplicated table. Number of partitions scanned depends on the minimum and maximum message time obtained from the raw table's latest partition.
- At query time selected partitions of deduplicated table are also scanned
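
The two steps can be sketched in Python to show why only a few partitions of the deduplicated table are read. Everything here is an assumption for illustration: `dedup_partitions` and `batch` are made-up in-memory stand-ins, and `load_batch` is a hypothetical counterpart of the stored procedure.

```python
from datetime import date, datetime

def row(message_id, message_ts, receive_ts):
    return {"message_id": message_id, "message_timestamp": message_ts,
            "receive_timestamp": receive_ts}

# Hypothetical deduplicated table: message date (partition) -> {message_id: row}.
dedup_partitions = {
    date(2022, 11, 1): {"m0": row("m0", datetime(2022, 11, 1, 9), datetime(2022, 11, 1, 9, 1))},
    date(2023, 1, 4): {"m1": row("m1", datetime(2023, 1, 4, 9), datetime(2023, 1, 4, 9, 1))},
    date(2023, 1, 5): {"m2": row("m2", datetime(2023, 1, 5, 9), datetime(2023, 1, 5, 9, 1))},
}

# Hypothetical latest raw partition (everything received on 2023-01-06).
batch = [
    row("m2", datetime(2023, 1, 5, 9), datetime(2023, 1, 6, 0, 5)),    # duplicate of a stored row
    row("m3", datetime(2023, 1, 4, 23), datetime(2023, 1, 6, 0, 10)),  # new, arrived late
    row("m3", datetime(2023, 1, 4, 23), datetime(2023, 1, 6, 0, 12)),  # retry of m3
]

def load_batch(dedup_partitions, batch):
    # Step 1: find the message date window covered by the batch.
    message_dates = [r["message_timestamp"].date() for r in batch]
    low, high = min(message_dates), max(message_dates)

    # Step 2a: read message_ids only from partitions inside that window.
    scanned = sorted(d for d in dedup_partitions if low <= d <= high)
    existing = set()
    for d in scanned:
        existing.update(dedup_partitions[d])

    # Step 2b: keep the latest row per message_id in the batch, insert the unseen ones.
    latest = {}
    for r in batch:
        current = latest.get(r["message_id"])
        if current is None or r["receive_timestamp"] > current["receive_timestamp"]:
            latest[r["message_id"]] = r
    inserted = []
    for message_id, r in latest.items():
        if message_id not in existing:
            dedup_partitions.setdefault(r["message_timestamp"].date(), {})[message_id] = r
            inserted.append(message_id)
    return inserted, scanned

inserted, scanned = load_batch(dedup_partitions, batch)
print(inserted)                          # ['m3']
print([d.isoformat() for d in scanned])  # ['2023-01-04', '2023-01-05']
```

The 2022-11-01 partition is never read, which is the saving over Solution 4-A.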

---

## Trade-offs

| | Sol 1 | Sol 3 | Sol 4-B |
|-|-|-|-|
| **Uniqueness criteria** | Works efficiently only if upstream provided attributes are exactly same for duplicate entries with same message_id, and if only upstream provided attributes are required to be projected through the view | + Flexible uniqueness definition |  + Flexible uniqueness definition |
| **Semantics** | + View semantics | Consumers need to work with functions. Functions do not have user friendly schemas like views | + View semantics |
| **Latency** | + No delay | + No delay | Depends on the frequency of scheduled query |
| **Development / maintenance overhead** | + Simple to maintain | + Simple to maintain | Relatively complex to maintain. May even need alerting to check for accidental duplicates |

From a cost point of view, this is the decision matrix:

![Use solution 1 or 3 for high duplication with high query frequency, high duplication with low query frequency, low duplication with low query frequency ; whereas use solution 4-B for high duplication with high query frequency](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y6qss67s9xrsro8deejp.png)

Partial text search on Mongo

Sagnik Bandyopadhyay — Mon, 06 Jun 2022 19:38:03 +0000

Scenario

Let's consider a Mongo collection topics that looks somewhat like:

[
  {
    "_id": 1,
    "title": "computer science"
  },
  {
    "_id": 2,
    "title": "numerical computation"
  },
  {
    "_id": 3,
    "title": "medical science"
  },
  {
    "_id": 4,
    "title": "information technology"
  },
  {
    "_id": 5,
    "title": "sci fi"
  },
  ...
]

We would like to retrieve documents whose title "contains" comp.

Here are a few possible ways that we can consider:

OPTION 1 - Simple regex query

We can perform a regex query on the title field, like this:

db.topics.find({title: /.*comp.*/i})

db.topics.find({title: {$regex: "comp", $options: "i"}})

NOTE: Option i is used to perform case insensitive search.

Mongo will need to examine every document in the collection and perform a regex match on the title field.

We can define an index (nothing fancy, the usual index) on the title field, like:

db.topics.createIndex({title: 1})

Now if we perform the regex query, mongo will not examine all the documents, but it still needs to examine all the index keys, and then examine and retrieve only those documents that matched.

This works well for a small collection, but for large collections (say, more than 100k or 1M docs) the query becomes slow, and it also consumes a lot of resources on the mongo db instances, which is definitely bad.

This solution is the simplest to implement, but doesn't scale well.

OPTION 2 - Index on tokenised words

If we are okay with supporting search only by whole words, and not partial words, then tokenising the title field can help. E.g.:

[
  {
    "_id": 1,
    "title": "computer science",
    "titleTokens": ["computer", "science"]
  },
  {
    "_id": 2,
    "title": "numerical computation",
    "titleTokens": ["numerical", "computation"]
  },
  {
    "_id": 3,
    "title": "medical science",
    "titleTokens": ["medical", "science"]
  },
  {
    "_id": 4,
    "title": "information technology",
    "titleTokens": ["information", "technology"]
  },
  {
    "_id": 5,
    "title": "sci fi",
    "titleTokens": ["sci", "fi"]
  },
  ...
]

While saving / updating documents, we can:

convert the title field to lower case (to support case insensitive searches)
split the text into multiple words
save those words as an array, as titleTokens

And then we can define a multikey index on titleTokens.

Now if we need to search for a text, we can:

convert the search text to lower case
split the search text into multiple words (using the same algorithm used to tokenise the title field)
Query the array titleTokens with the tokens obtained from search text, e.g.:

db.topics.find({titleTokens: {$all: ["science"]}})

The above query will retrieve all documents having the token "science" in the titleTokens field, i.e.: documents #1 and #3

db.topics.find({titleTokens: {$all: ["computer", "science"]}})

The above query will retrieve all documents having both "computer" and "science" tokens, and hence will retrieve document #1

Using this solution we cannot search by partial words like "comp", but can search with individual words pretty fast.
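
The tokenise-then-match idea can be sketched in Python as a tiny in-memory inverted index. The `tokenise` and `search_all` helpers are hypothetical names, and the splitting rule (lower-case, split on non-alphanumeric characters) is an assumption; the real work would be done by the multikey index in Mongo.

```python
import re
from collections import defaultdict

topics = {
    1: "computer science",
    2: "numerical computation",
    3: "medical science",
    4: "information technology",
    5: "sci fi",
}

def tokenise(text):
    """Lower-case the text and split it on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

# token -> set of document ids, playing the role of the index on titleTokens
index = defaultdict(set)
for doc_id, title in topics.items():
    for token in tokenise(title):
        index[token].add(doc_id)

def search_all(text):
    """Return ids of documents containing every token of the search text ($all semantics)."""
    tokens = tokenise(text)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)

print(search_all("science"))           # [1, 3]
print(search_all("Computer Science"))  # [1]
print(search_all("comp"))              # [] because partial words are not indexed
```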

OPTION 3 - Text index

Mongo has a provision for defining one text index per collection. This works pretty much the same way as OPTION 2, i.e., the title field's value will be tokenised into words and only search by full words (not partial words) is supported.

The index can be created like this:

db.topics.createIndex({title: "text"})

This is one way to query the text index:

db.topics.find({$text: {$search: "computer"}})

The above query will retrieve all documents having the word "computer" in the title field.

Here is another:

db.topics.find({$text: {$search: "computer science"}})

The above query will retrieve all documents having either "computer" or "science" in the title field, i.e.: documents #1 and #3

Yet another:

db.topics.find({$text: {$search: "\"computer sci\""}})

This is a phrase query. It will:

tokenise the search text into "computer" and "sci"
fetch documents containing either "computer" (this will match doc #1) or "sci" (this will match doc #5)
among all these documents, mongo will examine the title fields and check whether the exact phrase "computer sci" is present. Only doc #1 passes this check, so it is returned as the result of the query

Using this solution we can efficiently search for full words or for phrases containing at least one full word.
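
The two-stage behaviour described above (match on whole words, then check the phrase) can be sketched in Python. This is a simplification under stated assumptions: `text_search` and `phrase_search` are hypothetical helpers, and the sketch ignores the stemming and stop-word handling that a real text index applies.

```python
topics = {
    1: "computer science",
    2: "numerical computation",
    3: "medical science",
    4: "information technology",
    5: "sci fi",
}

def text_search(query):
    """OR semantics over whole words, like a plain search string."""
    words = set(query.lower().split())
    return sorted(
        doc_id for doc_id, title in topics.items()
        if words & set(title.lower().split())
    )

def phrase_search(phrase):
    """Find candidates by whole words first, then keep titles containing the exact phrase."""
    candidates = text_search(phrase)
    return [doc_id for doc_id in candidates if phrase.lower() in topics[doc_id].lower()]

print(text_search("computer science"))  # [1, 3]
print(text_search("computer sci"))      # [1, 5]
print(phrase_search("computer sci"))    # [1]
```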

OPTION 4 - Atlas search

This is only applicable when mongo is running on atlas.

Atlas provides a very powerful way to search through documents for partial words. Its autocomplete indexing and searching fits our use case well.

We can define a search index on topics following these steps:

Choose an analyser, maybe whitespace. This will tokenise the field's value into words
Disable dynamic mapping (assuming that we already know which field to index, and the field name / hierarchy doesn't change)
Define explicit fields to index, in this case it can be the title field
Also define two datatypes for the field:
1. String - for complete text matching (also needed for autocomplete match if we want to increase the search score of completely matched strings)
2. Autocomplete - for partial matching
  1. Choose the tokeniser for Autocomplete datatype:
    1. edgeGram: This will allow prefix searches against tokenised words. E.g.: searching by "comp" will return entities #1 and #2, but searching by "puter" won't return anything
    2. nGram: This will behave like a contains search. E.g.: searching by "put", will return entities #1 and #2
  2. Define:
    1. minGrams: The minimum length from which tokenisation will start. If minGrams is 2, then by searching for "c" we won't get any results; we need at least 2 characters. We need to be careful with this field. Setting a very low minGrams value (like 1) with the nGram tokeniser will result in a huge index, and setting something very high won't be useful enough.
    2. maxGrams: The maximum length of characters which will be tokenised
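
To see what the two tokenisers produce and why minGrams affects index size, here is a small Python sketch. The functions `edge_grams` and `n_grams` are hypothetical helpers written for illustration; they are not Atlas's implementation, only the general idea of edge n-grams and n-grams.

```python
def edge_grams(word, min_grams, max_grams):
    """Prefixes of word with lengths from min_grams to max_grams."""
    longest = min(max_grams, len(word))
    return [word[:n] for n in range(min_grams, longest + 1)]

def n_grams(word, min_grams, max_grams):
    """All substrings of word with lengths from min_grams to max_grams."""
    longest = min(max_grams, len(word))
    return [
        word[i:i + n]
        for n in range(min_grams, longest + 1)
        for i in range(len(word) - n + 1)
    ]

print(edge_grams("computer", 2, 4))             # ['co', 'com', 'comp']
print("put" in n_grams("computer", 2, 4))       # True: nGram supports "contains"
print("puter" in edge_grams("computer", 2, 8))  # False: edgeGram only indexes prefixes

# Number of indexed terms for a single 8-letter word with minGrams = 1:
print(len(n_grams("computer", 1, 8)), len(edge_grams("computer", 1, 8)))  # 36 8
```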

Once the Atlas search index is ready, we can query the mongo collection using an aggregation pipeline like this:

db.topics.aggregate(
[
  {
    $search: {
      index: <name of atlas search index>,
      autocomplete: {
        query: "comp",
        path: "title"
      }
    }
  }
]
)

NOTE:

These search indexes can accomplish a lot more (fuzzy search, compound search, and maybe more).
The search index can be defined via terraform (if we want to follow infrastructure as code :-)).

Using this solution we can efficiently search for prefixes of words or for any part of the text, if we run mongo on atlas.

The above options were easy solutions to the problem. Below are some not so easy solutions:

ELASTIC SEARCH

If we strictly do not want to use atlas search, we may consider:

Continuously syncing the mongo collection (maybe only the title and id fields) to elastic search
Defining appropriate autocomplete edgeGram / nGram indexes
Executing queries against elastic and retrieving matched documents by id from mongo. Or we may also consider storing the whole document in elastic itself (in that case we need not retrieve docs by id from mongo as we already get the full document from elastic itself).

FROM SCRATCH !

If none of the above options work for us, we can build our own indexes using prefix trie or suffix trie, and point to mongo document ids (or any database reference) from them. Searching the index for all possibilities for a given search text may become a difficult / costly operation. We may consider caching the top results in each trie node. If space becomes an issue, we can distribute the trie across multiple nodes.

This is definitely not an easy job. We'll encounter multiple challenges / questions related to:

Pre-computation of search results in each trie node: What algorithm to use? How to keep the trie index in sync with the underlying mongo collection? How much latency is acceptable? What will the data infrastructure look like?
Trie sharding: How to evenly distribute the trie across multiple nodes?
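
As a starting point, a prefix trie that caches a bounded list of document ids in each node could look like the Python sketch below. The class names, the `max_cached` limit and the "first come, first cached" policy are all assumptions for illustration; ranking, syncing with mongo and sharding are left out.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.doc_ids = []  # cached results for the prefix ending at this node

class PrefixTrie:
    def __init__(self, max_cached=10):
        self.root = TrieNode()
        self.max_cached = max_cached

    def insert(self, word, doc_id):
        """Walk the word character by character, caching doc_id on every prefix node."""
        node = self.root
        for ch in word.lower():
            node = node.children.setdefault(ch, TrieNode())
            if doc_id not in node.doc_ids and len(node.doc_ids) < self.max_cached:
                node.doc_ids.append(doc_id)

    def search(self, prefix):
        """Return the cached document ids for a prefix, or [] if no word starts with it."""
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.doc_ids)

topics = {
    1: "computer science",
    2: "numerical computation",
    3: "medical science",
    4: "information technology",
    5: "sci fi",
}

trie = PrefixTrie()
for doc_id, title in topics.items():
    for word in title.split():
        trie.insert(word, doc_id)

print(trie.search("comp"))   # [1, 2]
print(trie.search("sci"))    # [1, 3, 5]
print(trie.search("puter"))  # []
```

To support a "contains" search instead of a prefix search, one option is to insert every suffix of each word as well, at the cost of a much larger trie.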

Please feel free to suggest better ideas / corrections to this content. Thanks in advance.

Why asymmetric encryption?

Sagnik Bandyopadhyay — Thu, 14 Jan 2021 18:26:26 +0000

What is encryption?

Say Raj needs to send a message to Priya. Easy right? But, there is one catch: whatever message Raj sends to Priya will be visible to the whole world. Argh!

Raj comes up with an idea: Raj will not send the actual message to Priya. Instead, he will transform it to another form which will appear nonsense to the whole world, but Priya will be able to transform it back to the real message.

E.g.:

Say the actual message is: Hello

Raj applies some mathematical operation to "Hello" and it becomes: abc123 (aka encryption). Which is total nonsense to everyone.

But Priya knows about the reverse mathematical operation which will transform abc123 to Hello (aka decryption).

Great! Problem solved!

Types of encryption

Symmetric encryption: There is one key used to encrypt the message. The same key can be used to decrypt the message. Example algorithms: AES, DES, ...
Asymmetric encryption: One key is used to encrypt the message. But a different key is used to decrypt the message. The same key cannot be used to perform both encryption and decryption. Some use cases are illustrated below. Example algorithms: RSA, ECDSA, ECDH, ...

Use case 1

Here is a conversation between Raj and Priya over the internet on an insecure chat platform, i.e.: all these chats are visible to everyone.

Raj: Hello Priya, here is our revenue forecast for this year: jskdfn. It's confidential, hence encrypted with AES Key: A!123. You can use the same to decode it.
[Raj used symmetric encryption to encrypt the confidential data and also shared the key used to encrypt. NOTE: the same key can be used to decrypt it]
Priya: Raj, what did you do! Now everybody knows the secret key and they can use it to decrypt the revenue forecast! The whole world knows about it now.

What went wrong?

Raj encrypted the confidential information with a symmetric key. All good so far. But Raj shared the symmetric key with Priya over an unsecured channel, which is a blunder, because once people get to know the symmetric key, they can decrypt the message.

Alternate scenario

Raj: Hello Priya, can you please share your PUBLIC key [more on PUBLIC and PRIVATE keys later]?
Priya: Here it is: pubkey001.
Raj: Here is our revenue forecast encrypted with your Public Key: ueihwf.
Priya: Thanks, I will use my PRIVATE key to decrypt the message.

What happened here?

With asymmetric encryption, we get a pair of keys:

A Private Key: As the name suggests, it's supposed to remain PRIVATE always. It shouldn't be shared with anyone.
A Public Key: It's supposed to be public - shared with everyone freely.

If we encrypt a message using a Public Key, it can be decrypted ONLY with the matching Private Key (and not with the Public Key). Similarly if we encrypt a message with a Private Key, it can be decrypted ONLY with the matching Public Key.

In the alternate scenario illustrated above, Priya shared her PUBLIC key with Raj. Raj encrypted the confidential message with the PUBLIC key and shared with Priya. Now even if anybody else gets to know about the Public Key and encrypted message, they WON'T be able to decrypt it, as it can be decrypted ONLY with the PRIVATE Key. And only Priya has the Private key which she hasn't shared with anyone.

NOTE: It is not always practical to encrypt an entire message with an asymmetric key (encrypting long messages with asymmetric keys is quite slow and expensive). A general practice is to GENERATE a random symmetric key, encrypt the actual message content with that symmetric key, and encrypt the symmetric key alone with the recipient's public key. So we get a payload of:

Symmetric key encrypted with asymmetric key
Message contents encrypted with symmetric key

The recipient can:

Retrieve the symmetric key by decrypting it with the private key that matches the public key used to encrypt it
Decrypt the actual message contents with the symmetric key obtained above
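
The payload and the recipient's two steps can be sketched with a toy example in Python. Loud assumptions: the key pair is textbook RSA built from two small known primes with no padding, and the symmetric cipher is a home-made XOR keystream, so none of this is secure. The names `encrypt_for`, `decrypt_with` and `keystream_xor` are hypothetical; a real system would use a vetted cryptography library with AES and RSA-OAEP or similar.

```python
import hashlib
import secrets

# Toy RSA key pair built from two known Mersenne primes. Far too small and
# unpadded for real use; it only illustrates the flow.
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
E = 65537                          # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (Python 3.8+)

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR data with a SHA-256 based keystream."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def encrypt_for(public_key, message: bytes):
    """Return (symmetric key encrypted with public key, message encrypted with symmetric key)."""
    e, n = public_key
    symmetric_key = secrets.token_bytes(16)  # 128 random bits, smaller than N
    wrapped_key = pow(int.from_bytes(symmetric_key, "big"), e, n)
    return wrapped_key, keystream_xor(symmetric_key, message)

def decrypt_with(private_key, wrapped_key, ciphertext: bytes) -> bytes:
    d, n = private_key
    # Step 1: recover the symmetric key with the private key.
    symmetric_key = pow(wrapped_key, d, n).to_bytes(16, "big")
    # Step 2: decrypt the message contents with the symmetric key.
    return keystream_xor(symmetric_key, ciphertext)

priya_public, priya_private = (E, N), (D, N)

wrapped_key, ciphertext = encrypt_for(priya_public, b"revenue forecast")
print(decrypt_with(priya_private, wrapped_key, ciphertext))  # b'revenue forecast'
```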

Use case 2

Sam: Hi Priya, Raj has left this message for you: "Sell all my shares today".
Priya: Hello Sam. How do I know that it is really Raj who left that message? Can you please ask Raj to digitally sign this message with his PRIVATE key?
Sam: Sure, Raj has sent the signature as well: mzakld
Priya: Cool, since I already have Raj's PUBLIC key, I can now verify whether it was really Raj who sent the message.

What is this digital signature?

In the previous use case, we saw that the message was encrypted with the Public Key. Here it's kinda reverse. This is how Raj signs the message:

He generates a Hash of the message.
And then encrypts the Hash with his PRIVATE key, which becomes the signature of the message

How does Priya verify the signature?

Priya should already know Raj's PUBLIC key (Remember, PUBLIC keys are meant to be public and known by everyone)
She decrypts the signature (mzakld) using Raj's PUBLIC key. Let's call it - HASH_VALUE_1
And then independently generates a hash of the message contents ("Sell all my shares today"). Let's call it - HASH_VALUE_2
She verifies whether HASH_VALUE_1 is equal to HASH_VALUE_2.
Anybody other than Raj will not have Raj's PRIVATE key. If any other Private key is used to sign the message, then when the signature is decrypted using Raj's PUBLIC key, it won't match with the hash of the message contents. Thus Priya will know that the message was not from Raj.
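
The sign and verify steps above can be sketched with toy textbook RSA in Python. Loud assumptions: the key is built from two small known primes, there is no padding, and the hash is simply reduced modulo the toy modulus, so this is not secure and only mirrors the HASH_VALUE_1 / HASH_VALUE_2 comparison. The helper names and the imposter's exponent are made up for illustration.

```python
import hashlib

# Toy RSA key pair for Raj, built from two known Mersenne primes (not secure).
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
PHI = (P - 1) * (Q - 1)
E = 65537            # Raj's PUBLIC exponent
D = pow(E, -1, PHI)  # Raj's PRIVATE exponent (Python 3.8+)

def digest(message: bytes) -> int:
    """Hash the message and reduce it into the range of the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N

def sign(message: bytes, private_exponent: int) -> int:
    """Encrypt the hash of the message with a private exponent."""
    return pow(digest(message), private_exponent, N)

def verify(message: bytes, signature: int, public_exponent: int) -> bool:
    hash_value_1 = pow(signature, public_exponent, N)  # decrypt the signature
    hash_value_2 = digest(message)                     # hash the message independently
    return hash_value_1 == hash_value_2

message = b"Sell all my shares today"
signature = sign(message, D)

# An imposter signs with some other private exponent (hypothetical, not Raj's).
imposter_exponent = pow(257, -1, PHI)
forged = sign(message, imposter_exponent)

print(verify(message, signature, E))                         # True
print(verify(b"Sell all my shares tomorrow", signature, E))  # False: message was altered
print(verify(message, forged, E))                            # False: wrong private key
```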

NOTES:

If the message is being sent over an insecure channel, then the message (and optionally its signature as well) can be encrypted using the approach described in Use case 1.
Depending on the use case, additional attributes (like UUID, timestamp) may be included in the message to ensure every message is unique. This will help in preventing replay attacks.
In all cases, the Private Key always remains private and is never ever shared with anyone.

Use case 3

Assume that Priya has received Raj's public key prior to this conversation and she is certain that the Public key really belongs to Raj.

Below is a conversation taking place over an insecure, hackable channel.

Raj: Hi Priya
Priya: Hello Raj. I don't know for sure whether you are really Raj or not. To prove that you are indeed Raj, I challenge you to encrypt this random text that I just generated: njsdks using your PRIVATE key.
Raj: Sure, here it is: zsdeaf.
Priya: Cool, now I can decrypt zsdeaf using Raj's PUBLIC key and verify whether it matches the original challenge text (njsdks) that I had generated.
Priya: I have GENERATED a random symmetric key and have encrypted it with your PUBLIC key: opwefi. Going forward let's encrypt all our messages with this symmetric key.
Raj: Nice. I can decrypt the symmetric key (opwefi) using my PRIVATE key, and use the decrypted symmetric key to encrypt / decrypt further communications with Priya.

What is happening?

Priya challenged Raj to encrypt a random text that Priya had generated with Raj's Private Key.
Since only Raj has access to his own Private Key, only Raj should be able to encrypt the challenge text with his Private key.
And the encrypted challenge text can only be decrypted with Raj's Public Key which is known to Priya (and maybe everybody else).
So Priya is able to verify whether Raj's Private Key was indeed used to encrypt the challenge text, which will prove that the person she is talking to has access to Raj's Private Key, which can only be Raj himself (of course unless Raj's private key has been stolen / leaked :-p).
Next, Priya generated a random symmetric key, encrypted it with Raj's Public Key and shared it.
Only Raj will be able to decrypt it as Raj alone has access to Raj's Private Key.
Going forward they can use the symmetric key to encrypt all messages as it will be faster and less expensive.

The above steps are a very simplified version of the SSL/TLS handshake.

Optionally, Raj can also challenge Priya to prove herself. This is somewhat similar to mutual TLS, however such mutual TLS can happen only if both the entities are known to each other.

The three use cases mentioned above are fundamental building blocks of the https protocol and blockchain (blockchain is so much more though).

Hope you liked the reading. Please share your feedback as comments or DM / connect with me over LinkedIn!