<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aniketh Deshpande</title>
    <description>The latest articles on DEV Community by Aniketh Deshpande (@anikethsdeshpande).</description>
    <link>https://dev.to/anikethsdeshpande</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F212621%2F92bf2932-4c1c-4ae4-87c2-96232ff403dc.jpeg</url>
      <title>DEV Community: Aniketh Deshpande</title>
      <link>https://dev.to/anikethsdeshpande</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anikethsdeshpande"/>
    <language>en</language>
    <item>
      <title>Predicate Pushdown - Understanding Practically With An Example</title>
      <dc:creator>Aniketh Deshpande</dc:creator>
      <pubDate>Wed, 17 Apr 2024 19:42:28 +0000</pubDate>
      <link>https://dev.to/anikethsdeshpande/predicate-pushdown-understanding-practically-with-an-example-4b51</link>
      <guid>https://dev.to/anikethsdeshpande/predicate-pushdown-understanding-practically-with-an-example-4b51</guid>
      <description>&lt;p&gt;What is predicate pushdown?&lt;/p&gt;

&lt;p&gt;The immediate, textbook answer we get on searching is&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Predicate pushdown is a query optimisation technique used in database technologies&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Okay, so it is an optimisation technique. But that still does not tell us...&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How is the optimisation happening? 🤨&lt;/li&gt;
&lt;li&gt;What is a predicate? 🤔&lt;/li&gt;
&lt;li&gt;What exactly is the meaning of pushed down here? 🤷‍♂️&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since you are reading this article, I'm sure you have the same questions in mind!&lt;/p&gt;

&lt;p&gt;Now, let's explore this interesting topic practically in &lt;strong&gt;&lt;em&gt;PySpark&lt;/em&gt;&lt;/strong&gt; using &lt;code&gt;explain()&lt;/code&gt;&lt;br&gt;
(A similar phenomenon can be observed in relational databases as well.)&lt;/p&gt;

&lt;p&gt;1] Read a CSV file containing employee information.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;emp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;\
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;\
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data/employee.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;emp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppxfeaejoac8noucyd86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppxfeaejoac8noucyd86.png" alt="df.show" width="700" height="476"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;emp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;formatted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxg2ushs3wglcaltw18wk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxg2ushs3wglcaltw18wk.png" alt="pyspark explain read df" width="800" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, step (1) corresponds to reading the CSV file.&lt;/p&gt;

&lt;p&gt;2] Let's do a group-by to get the number of employees in each department.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;emp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;deptID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqawodk4tq9qwuy14eu5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqawodk4tq9qwuy14eu5.png" alt="emp group by" width="284" height="278"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;formatted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjehpma6xwkaybwow9kby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjehpma6xwkaybwow9kby.png" alt="df explain" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, step (2) corresponds to the &lt;code&gt;group by&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;3] Let's filter the data for department number 10 only.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dept_10&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deptID&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dept_10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0a88h1wfr5ccnexkdaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0a88h1wfr5ccnexkdaz.png" alt="Image description" width="348" height="184"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dept_10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;formatted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ma0nok0jxtp0zuky4x0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ma0nok0jxtp0zuky4x0.png" alt="Image description" width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, we can see that step (2) is filtering and step (3) is grouping. &lt;/p&gt;

&lt;p&gt;Now here is the catch: &lt;br&gt;
If we go strictly by the sequence of operations we wrote, grouping should be done first and then filtering. &lt;/p&gt;

&lt;p&gt;However, the optimiser does the filtering first and then the grouping: grouping is an expensive operation, so it is cheaper to group only the rows that survive the filter. And since the filter is on the grouping column itself, this reordering does not change the result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiywjb9ic2nu0l5ue64m7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiywjb9ic2nu0l5ue64m7.png" alt="Image description" width="544" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the physical plan we see that the filter (the &lt;code&gt;predicate&lt;/code&gt;) is pushed down below the grouping. That is why it is called a &lt;code&gt;pushdown&lt;/code&gt;! &lt;/p&gt;
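&lt;p&gt;The idea can be mimicked in plain Python (a toy sketch, not what Spark actually executes; the rows and department IDs below are made up for illustration). Grouping only the rows that survive the filter does less work, and the result is identical because the filter is on the grouping column:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical employee rows: (name, deptID)
rows = [("a", 10), ("b", 20), ("c", 10), ("d", 30), ("e", 10)]

def group_then_filter(rows):
    # Written order: group all 5 rows, then keep dept 10
    counts = Counter(dept for _, dept in rows)
    return {d: c for d, c in counts.items() if d == 10}

def filter_then_group(rows):
    # Optimised order: keep the 3 matching rows, then group just those
    kept = [r for r in rows if r[1] == 10]
    return dict(Counter(dept for _, dept in kept))

# Same answer either way, but the second version groups far fewer rows
assert group_then_filter(rows) == filter_then_group(rows) == {10: 3}
```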

</description>
      <category>spark</category>
      <category>optimisation</category>
      <category>sql</category>
      <category>interview</category>
    </item>
    <item>
      <title>Write Through</title>
      <dc:creator>Aniketh Deshpande</dc:creator>
      <pubDate>Fri, 22 Mar 2024 13:06:39 +0000</pubDate>
      <link>https://dev.to/anikethsdeshpande/write-through-81n</link>
      <guid>https://dev.to/anikethsdeshpande/write-through-81n</guid>
      <description>&lt;p&gt;Write through cache is a simple to implement caching mechanism.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Here, newly arrived data is written into the cache as well as persisted to disk or a database, and atomicity is maintained across the two writes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are two ways to implement it:&lt;br&gt;
1] The application writes data to the cache and the database simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmz9ls68gioad30eo5d0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmz9ls68gioad30eo5d0v.png" alt="Write through cache" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2] The application writes data to the cache, and the cache then writes the data to the database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy577t78bpgorm1dqudns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy577t78bpgorm1dqudns.png" alt="Write through cache 2" width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;
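&lt;p&gt;Variant 1] above can be sketched in a few lines of Python (a minimal illustration only: plain dicts stand in for the cache and the database):&lt;/p&gt;

```python
class WriteThroughCache:
    """Toy write-through cache: every write goes to the cache and the
    backing store together. 'db' stands in for any persistent storage;
    the names here are illustrative only."""

    def __init__(self):
        self.cache = {}
        self.db = {}

    def write(self, key, value):
        # The application updates both locations as one operation.
        self.cache[key] = value
        self.db[key] = value

    def read(self, key):
        # Subsequent reads are served from the cache.
        return self.cache.get(key, self.db.get(key))

store = WriteThroughCache()
store.write("user:1", "Aniketh")
assert store.cache["user:1"] == store.db["user:1"] == "Aniketh"
```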

&lt;h5&gt;
  
  
  Advantages:
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;Simple to implement.&lt;/li&gt;
&lt;li&gt;Faster response times.&lt;/li&gt;
&lt;li&gt;Data integrity, because of the atomic nature of the write operation. &lt;/li&gt;
&lt;li&gt;Lower latency for subsequent reads.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Disadvantages:
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cache pollution: since every write goes into the cache, it can fill up with rarely read data, leading to more cache evictions, which can add latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Not suitable for write-intensive scenarios: write operations are slower than with other methods, because data must be written to both the cache and persistent storage every time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cache</category>
      <category>redis</category>
      <category>systemdesign</category>
      <category>writepolicy</category>
    </item>
    <item>
      <title>Database Caching Strategies</title>
      <dc:creator>Aniketh Deshpande</dc:creator>
      <pubDate>Fri, 22 Mar 2024 12:43:08 +0000</pubDate>
      <link>https://dev.to/anikethsdeshpande/cache-write-policies-419b</link>
      <guid>https://dev.to/anikethsdeshpande/cache-write-policies-419b</guid>
      <description>&lt;p&gt;We often face high latencies while fetching data from Database and are unable to meet SLA. Caching is one of the solutions to implement after DB query and table optimisations. &lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Cache&lt;/strong&gt; is used to store data so that it can be delivered to the client faster than from persistent systems like a database or disk.&lt;/p&gt;

&lt;p&gt;Popular caching tools are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://memcached.org/" rel="noopener noreferrer"&gt;Memcached&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are multiple ways in which data can be written into and read from the cache. Let us explore the most prominently used methods or policies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cache Aside
&lt;/h2&gt;

&lt;p&gt;Cache Aside or Lazy Loading is one of the cache write policies or strategies.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the cache-aside method, the application is responsible for storing data into the cache.&lt;/li&gt;
&lt;li&gt;When the client sends a request to the application, it looks for the data in the cache.&lt;/li&gt;
&lt;li&gt;If the data is found in the cache, it is called a &lt;em&gt;cache-hit&lt;/em&gt;. The data is fetched from the cache and returned to the client.&lt;/li&gt;
&lt;li&gt;However, if the data is not found in the cache, called a &lt;em&gt;cache-miss&lt;/em&gt;, the application queries the database for the data, writes it into the cache, and sends the response to the client.&lt;/li&gt;
&lt;li&gt;Since we store data in the cache only when it is necessary, this strategy is also called Lazy Loading.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flh6koyxc7792iu568usl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flh6koyxc7792iu568usl.png" alt="Cache Aside - Write Strategy" width="538" height="287"&gt;&lt;/a&gt;&lt;/p&gt;
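&lt;p&gt;The steps above can be sketched in Python (a toy illustration: plain dicts stand in for a real cache such as Redis and a real database):&lt;/p&gt;

```python
class CacheAside:
    """Toy cache-aside / lazy loading: the application checks the cache
    first and only loads from the 'database' on a miss."""

    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:          # cache-hit
            self.hits += 1
            return self.cache[key]
        self.misses += 1               # cache-miss: query the DB...
        value = self.db[key]
        self.cache[key] = value        # ...and lazily populate the cache
        return value

db = {"user:1": "Aniketh"}
c = CacheAside(db)
c.get("user:1")   # first read misses and loads from the DB
c.get("user:1")   # second read is served from the cache
assert (c.misses, c.hits) == (1, 1)
```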

&lt;h3&gt;
  
  
  Advantages:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Cost effective&lt;/em&gt;, because only the frequently accessed data is stored in the cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Disadvantages:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The response time can be slow on a cache miss, because it involves extra I/O operations to fetch the data from the DB and store it in the cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;If the system can tolerate an initial delay and the same data is fetched repeatedly, this mechanism works best.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Write through
&lt;/h2&gt;

&lt;p&gt;Write-through is a simple-to-implement caching mechanism.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Here, newly arrived data is written into the cache as well as persisted to disk or a database, and atomicity is maintained across the two writes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are two ways to implement it:&lt;br&gt;
1] The application writes data to the cache and the database simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmz9ls68gioad30eo5d0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmz9ls68gioad30eo5d0v.png" alt="Write through cache" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2] The application writes data to the cache, and the cache then writes the data to the database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy577t78bpgorm1dqudns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy577t78bpgorm1dqudns.png" alt="Write through cache 2" width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Advantages:
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;Faster response times.&lt;/li&gt;
&lt;li&gt;Data integrity, because of the atomic nature of the write operation. &lt;/li&gt;
&lt;li&gt;Lower latency for subsequent reads.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Disadvantages:
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cache pollution: since every write goes into the cache, it can fill up with rarely read data, leading to more cache evictions, which can add latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Not suitable for write-intensive scenarios: write operations are slower than with other methods, because data must be written to both the cache and persistent storage every time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;If the system cannot tolerate an initial delay and writes are infrequent, this mechanism works best. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Also, if the number of records to be cached is fixed, this approach can give the best results.&lt;/p&gt;


&lt;p&gt;Thank you :)&lt;/p&gt;

</description>
      <category>gratitude</category>
    </item>
    <item>
      <title>Cache Aside</title>
      <dc:creator>Aniketh Deshpande</dc:creator>
      <pubDate>Fri, 22 Mar 2024 12:41:36 +0000</pubDate>
      <link>https://dev.to/anikethsdeshpande/cache-aside-1ioa</link>
      <guid>https://dev.to/anikethsdeshpande/cache-aside-1ioa</guid>
      <description>&lt;p&gt;Cache Aside or Lazy Loading is one of the cache write policies or strategies.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the cache-aside method, the application is responsible for storing data into the cache.&lt;/li&gt;
&lt;li&gt;When the client sends a request to the application, it looks for the data in the cache.&lt;/li&gt;
&lt;li&gt;If the data is found in the cache, it is called a &lt;em&gt;cache-hit&lt;/em&gt;. The data is fetched from the cache and returned to the client.&lt;/li&gt;
&lt;li&gt;However, if the data is not found in the cache, called a &lt;em&gt;cache-miss&lt;/em&gt;, the application queries the database for the data, writes it into the cache, and sends the response to the client.&lt;/li&gt;
&lt;li&gt;Since we store data in the cache only when it is necessary, this strategy is also called Lazy Loading.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flh6koyxc7792iu568usl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flh6koyxc7792iu568usl.png" alt="Cache Aside - Write Strategy" width="538" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Advantages:
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;The implementation is simple.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Disadvantages:
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;The response time can be slow on a cache miss, because it involves extra I/O operations to fetch the data from the DB and store it in the cache.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gratitude</category>
      <category>tailwindcss</category>
      <category>opensource</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Shallow Copy Vs Deep Copy</title>
      <dc:creator>Aniketh Deshpande</dc:creator>
      <pubDate>Sat, 29 Oct 2022 12:57:16 +0000</pubDate>
      <link>https://dev.to/anikethsdeshpande/shallow-copy-vs-deep-copy-56jf</link>
      <guid>https://dev.to/anikethsdeshpande/shallow-copy-vs-deep-copy-56jf</guid>
      <description>&lt;p&gt;In our day to day development tasks, we come across the need to copy objects and perform various operations.&lt;/p&gt;

&lt;p&gt;Python provides two important functions in the &lt;strong&gt;copy&lt;/strong&gt; module - &lt;em&gt;copy&lt;/em&gt; and &lt;em&gt;deepcopy&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Let us understand the difference between the two and their respective use cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shallow Copy - copy.copy(obj)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It makes a copy of the object at the surface (top) level. &lt;/li&gt;
&lt;li&gt;The new object holds references to the contents of the original.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferp0i8785js06ednujn9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferp0i8785js06ednujn9.png" alt="copy.copy()" width="752" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;However, the point to note here is that if the contents of x are mutable, then y holds references to the contents of x: any modification to the contents of x is reflected in y as well.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67ef2wud6bu5y3rq9ve7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67ef2wud6bu5y3rq9ve7.png" alt="shallow copy" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example we see that modifying the contents of x[1] also modified the contents of y[1].&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o45byvy1gst3yy8q5tw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o45byvy1gst3yy8q5tw.png" alt="shallow copy" width="800" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, we see that modifying the contents of y also modified the contents of x.&lt;/p&gt;
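&lt;p&gt;Here is a runnable version of the shallow-copy behaviour shown in the screenshots (the list values are illustrative):&lt;/p&gt;

```python
import copy

x = [1, [2, 3]]
y = copy.copy(x)          # shallow copy: y is a new list, but y[1] IS x[1]

x[1].append(4)            # mutate a nested (mutable) element of x
assert y[1] == [2, 3, 4]  # ...and the change shows up in y as well

x[0] = 99                 # rebinding a top-level slot of x, however,
assert y[0] == 1          # does not affect y
```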

&lt;h3&gt;
  
  
  Deep Copy - copy.deepcopy(obj)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;deepcopy copies the object recursively: it copies the contents of each nested object, not merely references to them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;So we should use deepcopy when we need a fully independent copy of the object's contents.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53buud6tqel6affms9k9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53buud6tqel6affms9k9.png" alt="Deepcopy" width="726" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In deepcopy, any changes to the contents of x do not affect y, unlike shallow copy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn23a8du7apog6p8shsfx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn23a8du7apog6p8shsfx.png" alt="Deepcopy" width="774" height="172"&gt;&lt;/a&gt;&lt;/p&gt;
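&lt;p&gt;And the deepcopy counterpart of the same experiment (again with illustrative values):&lt;/p&gt;

```python
import copy

x = [1, [2, 3]]
y = copy.deepcopy(x)       # recursive copy: y[1] is a brand-new list

x[1].append(4)             # mutating x's nested list...
assert y[1] == [2, 3]      # ...leaves y untouched
assert x[1] == [2, 3, 4]
assert x[1] is not y[1]    # the nested objects are fully independent
```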




&lt;p&gt;Copying objects properly is a very important &lt;em&gt;basic Python concept&lt;/em&gt;. This knowledge helps in writing error-free code when copying objects.&lt;/p&gt;

&lt;p&gt;Thank you&lt;br&gt;
Aniketh Deshpande&lt;/p&gt;

</description>
      <category>python</category>
      <category>deepcopy</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Dead Letter Queue</title>
      <dc:creator>Aniketh Deshpande</dc:creator>
      <pubDate>Sun, 23 Oct 2022 11:50:55 +0000</pubDate>
      <link>https://dev.to/anikethsdeshpande/dead-letter-queue-1ml2</link>
      <guid>https://dev.to/anikethsdeshpande/dead-letter-queue-1ml2</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is going to be a very short introduction to dead letter queues. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What is a dead letter queue??
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Dead letter queues are message queues deployed specifically to hold messages that could not be delivered to their intended queues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Messages sometimes fail to get delivered because the intended queue is unavailable or full.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;DLQ support in some popular message queue tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS SQS (Simple Queue Service) supports DLQ.&lt;/li&gt;
&lt;li&gt;RabbitMQ (free and open source) also supports DLQ.&lt;/li&gt;
&lt;li&gt;Redis does not support DLQ.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
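&lt;p&gt;The pattern can be sketched with a toy example in Python (the queue capacity and message names are made up; real brokers such as SQS or RabbitMQ handle this through configured dead-letter policies):&lt;/p&gt;

```python
from collections import deque

def deliver(message, queue, dlq, queue_capacity=2):
    """Toy delivery: if the intended queue is full, park the message
    on the dead letter queue instead of silently dropping it."""
    if len(queue) >= queue_capacity:
        dlq.append(message)
        return "dead-lettered"
    queue.append(message)
    return "delivered"

queue, dlq = deque(), deque()
results = [deliver(m, queue, dlq) for m in ("m1", "m2", "m3")]
assert results == ["delivered", "delivered", "dead-lettered"]
assert list(dlq) == ["m3"]  # m3 survives for inspection or redelivery
```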




&lt;p&gt;Thanks for reading :)&lt;br&gt;
DLQ is a very important component of a scalable and resilient software architecture. This article only provides an introduction to the concept and helps readers with useful links.&lt;/p&gt;

&lt;p&gt;The following are some useful resources that give more in-depth information.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sqs-dead-letter-queue-management-experience-queues/" rel="noopener noreferrer"&gt;https://aws.amazon.com/about-aws/whats-new/2021/12/amazon-sqs-dead-letter-queue-management-experience-queues/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/antirez/disque#dead-letter-queue" rel="noopener noreferrer"&gt;https://github.com/antirez/disque#dead-letter-queue&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://stackoverflow.com/questions/13824879/how-to-resend-from-dead-letter-queue-using-redis-mq" rel="noopener noreferrer"&gt;https://stackoverflow.com/questions/13824879/how-to-resend-from-dead-letter-queue-using-redis-mq&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I request the readers of this post to kindly add more helpful URLs in the comment section, or share your experiences using a DLQ in a real-world project. I believe that would help beginners get an even better idea of how to use this in their projects.&lt;/p&gt;

&lt;p&gt;Thank You&lt;br&gt;
Aniketh Deshpande&lt;/p&gt;

</description>
      <category>messagequeue</category>
      <category>systemdesign</category>
      <category>aws</category>
      <category>rabbitmq</category>
    </item>
    <item>
      <title>Change Data Capture - PostgreSQL</title>
      <dc:creator>Aniketh Deshpande</dc:creator>
      <pubDate>Sat, 22 Oct 2022 14:58:18 +0000</pubDate>
      <link>https://dev.to/anikethsdeshpande/change-data-capture-postgresql-bi8</link>
      <guid>https://dev.to/anikethsdeshpande/change-data-capture-postgresql-bi8</guid>
      <description>&lt;p&gt;&lt;strong&gt;Change Data Capture&lt;/strong&gt; is the concept of recording the changes in the database table fields.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It is very helpful in use cases where we want to track the creation, update, and deletion of records in a table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We might want to use this information to make changes in other databases, notify customers, or notify other services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Example: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save a copy of this data in a warehouse post transform.&lt;/li&gt;
&lt;li&gt;Trigger notification service to notify users about this change.&lt;/li&gt;
&lt;li&gt;Cache the data.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  CDC In Postgres
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;strong&gt;Notify/Listen&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;NOTIFY&lt;/code&gt; provides a mechanism for interprocess communication between the database and the service that is &lt;code&gt;LISTEN&lt;/code&gt;ing on a notification channel.&lt;/li&gt;
&lt;li&gt;One or more services can listen on the same notification channel.&lt;/li&gt;
&lt;li&gt;The channel name is often the name of the database or table. However, the user is free to choose any suitable name.&lt;/li&gt;
&lt;li&gt;A trigger fires on every change to the table and calls a function that formats the notification message.&lt;/li&gt;
&lt;li&gt;The message usually contains the table name and a payload string.&lt;/li&gt;
&lt;li&gt;The listening service subscribes to the channel and receives the message from the DB.&lt;/li&gt;
&lt;li&gt;The service can then act on this message and perform further operations.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Simple to implement: a trigger plus a notify function on the database side, and a listening service on the application side.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Weak reliability. Notifications are not persisted, so any message sent while the listening service is down is lost.&lt;/li&gt;
&lt;/ul&gt;
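That reliability con can be made concrete with a minimal in-memory sketch (plain Python standing in for Postgres; all class and method names below are invented for illustration, this is not the real protocol): notifications are delivered only to listeners connected at the moment of NOTIFY.

```python
# Sketch of the NOTIFY/LISTEN delivery model: fire-and-forget, no persistence.
class Channel:
    def __init__(self):
        self.listeners = []          # currently connected LISTEN-ers

    def listen(self, queue):
        self.listeners.append(queue)

    def unlisten(self, queue):
        self.listeners.remove(queue)

    def notify(self, payload):
        # Delivered only to listeners connected right now; never stored.
        for q in self.listeners:
            q.append(payload)

channel = Channel()
inbox = []                                # our service's message queue

channel.listen(inbox)
channel.notify("users: row 1 inserted")   # received

channel.unlisten(inbox)                   # service goes down
channel.notify("users: row 2 inserted")   # lost forever

channel.listen(inbox)                     # service comes back up
channel.notify("users: row 3 inserted")   # received

print(inbox)   # only rows 1 and 3 made it
```

Row 2 never reaches the service, which is exactly why NOTIFY/LISTEN alone is unsuitable when every change must be processed.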




&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Using &lt;strong&gt;Debezium&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debezium is an open source tool for capturing changes in database tables based on the WAL (write-ahead log).&lt;/li&gt;
&lt;li&gt;The tool provides connectors for a variety of databases.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;source connector&lt;/em&gt; captures changes from the source database.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;sink connector&lt;/em&gt; writes the captured changes into the destination system.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
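As a rough illustration, a Debezium Postgres source connector is typically registered with Kafka Connect using a JSON config along these lines (the connector name, hostnames, and credentials below are placeholders, and property names vary between Debezium versions, so check the documentation for your release):

```json
{
  "name": "inventory-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "cdc_password",
    "database.dbname": "inventory",
    "topic.prefix": "inventory"
  }
}
```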

&lt;h4&gt;
  
  
  Pros:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Captured changes can be streamed to Kafka, where they are persisted. This makes the pipeline highly reliable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Debezium does not automatically handle schema changes; users need to propagate schema updates themselves, otherwise there can be data loss.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Thank you&lt;br&gt;
Aniketh Deshpande&lt;/p&gt;

</description>
      <category>debezium</category>
      <category>postgres</category>
      <category>cdc</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>DB Locking - Why and How?</title>
      <dc:creator>Aniketh Deshpande</dc:creator>
      <pubDate>Fri, 21 Oct 2022 18:39:15 +0000</pubDate>
      <link>https://dev.to/anikethsdeshpande/db-locking-why-and-how-2c2p</link>
      <guid>https://dev.to/anikethsdeshpande/db-locking-why-and-how-2c2p</guid>
      <description>&lt;p&gt;If you have ever sent concurrent requests to your DB to modify the same data, you may have come across a phenomenon called &lt;strong&gt;The Double Booking Problem!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Double Booking Problem arises when &lt;strong&gt;&lt;em&gt;two or more&lt;/em&gt;&lt;/strong&gt; threads read the same data point and one thread unknowingly overwrites the changes made by another, resulting in inconsistency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Let's look at an example to understand it better.&lt;/li&gt;
&lt;li&gt;Suppose we have a bus booking application.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;User 1 and User 2, both read Seat_21 as available.&lt;/li&gt;
&lt;li&gt;User 1 books the seat. However, User 2 is unaware of this, having already read Seat_21 as available.&lt;/li&gt;
&lt;li&gt;User 2 also books the seat. Now the seat info is overwritten and the seat is allotted to User 2.&lt;/li&gt;
&lt;li&gt;Because the same resource ends up being booked twice, this is called &lt;em&gt;The Double Booking Problem!&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The following flow diagram illustrates this clearly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiux83lrregifjdthinrc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiux83lrregifjdthinrc.png" alt="Double Booking Problem Flowchart" width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In order to avoid this, we need to use locking.&lt;/li&gt;
&lt;/ul&gt;
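The lost update above can be reproduced with a tiny sketch (plain Python standing in for the DB; the function names are invented for illustration):

```python
# Naive read-then-write with no locking: the classic double booking.
seats = {"Seat_21": "available"}

def read_seat(seat):
    return seats[seat]

def book_seat(seat, user):
    seats[seat] = user           # blind overwrite, no check

# Both users read the seat as available...
u1_view = read_seat("Seat_21")
u2_view = read_seat("Seat_21")
assert u1_view == "available" and u2_view == "available"

book_seat("Seat_21", "User 1")   # User 1 books first
book_seat("Seat_21", "User 2")   # User 2 blindly overwrites

print(seats["Seat_21"])          # "User 2" -- User 1's booking is lost
```

Nothing in `book_seat` checks whether the state changed since it was read, which is precisely what the locking schemes below fix.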




&lt;h3&gt;
  
  
  Locking in DynamoDB
&lt;/h3&gt;

&lt;p&gt;Let us see how we can lock dynamodb objects.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Optimistic Locking
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The spotlight feature of optimistic locking is, it does not have a lock as such. Instead a &lt;strong&gt;version number&lt;/strong&gt; is attached to the record and it is incremented whenever the record is updated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple users are allowed to read the document or record.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When the users try to commit their changes, the change related to the first request is accepted as the version numbers match in the record and in the commit request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Subsequent users trying to commit their changes receive a &lt;em&gt;ConditionalCheckFailedException&lt;/em&gt;!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Based on this, the users can sync the updated record and make their changes and retry committing the changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Therefore, the double booking problem is eliminated.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
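The version-check logic can be sketched in plain Python (in real DynamoDB this is done with a conditional write, e.g. a ConditionExpression on the version attribute in boto3; the names below are invented for illustration):

```python
class CommitConflict(Exception):
    # Stands in for DynamoDB rejecting the write on a failed condition check.
    pass

# The record carries a version number alongside its data.
record = {"seat": "Seat_21", "owner": None, "version": 1}

def commit(expected_version, owner):
    # The write succeeds only if the version has not moved underneath us.
    if record["version"] != expected_version:
        raise CommitConflict("stale version, re-read and retry")
    record["owner"] = owner
    record["version"] = expected_version + 1

v1 = record["version"]   # both users read version 1
v2 = record["version"]

commit(v1, "User 1")     # first commit wins; version becomes 2

rejected = False
try:
    commit(v2, "User 2") # second commit carries a stale version
except CommitConflict:
    rejected = True      # User 2 must re-read and retry

print(record)   # owner stays User 1, version is 2
```

User 2's failed commit is the cue to sync the latest record and retry, as described above.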

&lt;h4&gt;
  
  
  2. Pessimistic Locking
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;In the pessimistic locking mechanism, &lt;strong&gt;locks&lt;/strong&gt; are used explicitly. The record being updated is locked when a user starts a transaction.&lt;/li&gt;
&lt;li&gt;Since the record is locked, other users trying to update it are informed that the record is locked, and their reads can fail as well.&lt;/li&gt;
&lt;li&gt;Although equally effective, this mechanism carries the overhead of implementing and managing locks.&lt;/li&gt;
&lt;li&gt;Here too, the double booking problem is eliminated.&lt;/li&gt;
&lt;/ul&gt;
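The same flow under pessimistic locking, again as a plain Python sketch (a real system would use row locks or a lock manager; all names below are invented):

```python
class RecordLockedError(Exception):
    pass

record = {"seat": "Seat_21", "owner": None, "locked_by": None}

def begin_update(user):
    # Acquire the lock before touching the record.
    if record["locked_by"] is not None:
        raise RecordLockedError("record is locked by " + record["locked_by"])
    record["locked_by"] = user

def commit_update(user, owner):
    assert record["locked_by"] == user   # only the lock holder may write
    record["owner"] = owner
    record["locked_by"] = None           # release the lock

begin_update("User 1")           # User 1 locks the record

blocked = False
try:
    begin_update("User 2")       # User 2 is told the record is locked
except RecordLockedError:
    blocked = True

commit_update("User 1", "User 1")
print(record["owner"])           # User 1 keeps the seat
```

Unlike the optimistic variant, the conflict is detected up front, before any work is done, at the cost of maintaining the lock state.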




&lt;p&gt;Thanks for reading this blog.&lt;br&gt;
Aniketh Deshpande&lt;br&gt;
India&lt;/p&gt;

</description>
      <category>database</category>
      <category>locking</category>
      <category>systemdesign</category>
      <category>dynamodb</category>
    </item>
    <item>
      <title>TimescaleDB Tablespaces</title>
      <dc:creator>Aniketh Deshpande</dc:creator>
      <pubDate>Fri, 21 Oct 2022 04:59:45 +0000</pubDate>
      <link>https://dev.to/anikethsdeshpande/timescaledb-tablespaces-4dh3</link>
      <guid>https://dev.to/anikethsdeshpande/timescaledb-tablespaces-4dh3</guid>
      <description>&lt;h3&gt;
  
  
  Tablespace
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tablespace&lt;/strong&gt; is a storage location where the actual data is stored. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The tuples belonging to the same table could be stored in different tablespaces.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tablespaces are mainly used to store data of different priorities on different kinds of disks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Example: data of active users or recently active customers can be stored on &lt;em&gt;fast disk types&lt;/em&gt; like SSD or flash. Old, infrequently used, or archived data can be stored on &lt;em&gt;less expensive and slower&lt;/em&gt; storage like HDD.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
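The tiering rule itself is just a routing decision on data age; as a toy illustration (tablespace names and the cutoff below are made up):

```python
# Route data to a storage tier by age: hot data to SSD, cold data to HDD.
HOT_DAYS = 30   # hypothetical cutoff

def pick_tablespace(age_days):
    # ages 0..HOT_DAYS-1 count as hot and go to the fast tier
    if age_days in range(0, HOT_DAYS):
        return "tablespace_ssd"
    return "tablespace_hdd"

print(pick_tablespace(3))     # tablespace_ssd
print(pick_tablespace(400))   # tablespace_hdd
```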




&lt;h3&gt;
  
  
  TimescaleDB
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;TimescaleDB is an open source time series database. It extends PostgreSQL and supports most Postgres commands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Docker Image:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker pull timescale/timescaledb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;TimescaleDB has the concepts of hypertables and chunks. Hypertables are PostgreSQL tables that partition their data into chunks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The chunks are created primarily based on the time field.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Chunks older than a certain date can be moved to a slow disk like HDD, while the latest, heavily used data is kept on a fast disk like SSD or flash.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This is where &lt;em&gt;tablespaces&lt;/em&gt; help.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;TimescaleDB command to move chunks:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT move_chunk(
  chunk =&amp;gt; '_timescaledb_internal._hyper_1_4_chunk',
  destination_tablespace =&amp;gt; 'tablespace_2',
  index_destination_tablespace =&amp;gt; 'tablespace_3',
  reorder_index =&amp;gt; 'conditions_device_id_time_idx',
  verbose =&amp;gt; TRUE
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For detailed information, use the following link: &lt;a href="https://legacy-docs.timescale.com/v1.7/api#move_chunk" rel="noopener noreferrer"&gt;https://legacy-docs.timescale.com/v1.7/api#move_chunk&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  AWS Volumes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If the TSDB instance is hosted in a Kubernetes cluster on AWS, the TSDB pod would be provisioned with an AWS volume for persistent storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS supports several volume types. Based on speed and cost requirements, we can select the appropriate volumes for TSDB tablespaces.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS volume types: &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Thank you for reading the blog :) &lt;br&gt;
Aniketh Deshpande&lt;/p&gt;

</description>
      <category>timescaledb</category>
      <category>tablespace</category>
      <category>scaling</category>
      <category>aws</category>
    </item>
    <item>
      <title>Redis Timeseries</title>
      <dc:creator>Aniketh Deshpande</dc:creator>
      <pubDate>Thu, 20 Oct 2022 19:00:04 +0000</pubDate>
      <link>https://dev.to/anikethsdeshpande/redis-timeseries-4bnm</link>
      <guid>https://dev.to/anikethsdeshpande/redis-timeseries-4bnm</guid>
      <description>&lt;p&gt;Redis is an amazing tool to cache data. It supports different data types to help us cache different kinds of data.&lt;/p&gt;

&lt;p&gt;The following are some of the data types available as of Redis 7.0 (Timeseries is provided by the RedisTimeSeries module rather than core Redis):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Strings&lt;/li&gt;
&lt;li&gt;Hashes&lt;/li&gt;
&lt;li&gt;Lists&lt;/li&gt;
&lt;li&gt;Sets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sorted Sets&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timeseries&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this article, we shall focus mainly on caching timeseries data in redis.&lt;/p&gt;

&lt;p&gt;We can cache timeseries data in the following ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Using the Redis-Timeseries extension&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Save data using &lt;code&gt;TS.ADD KEY TIMESTAMP VALUE&lt;/code&gt;&lt;br&gt;
where KEY is the name of the timeseries,&lt;br&gt;
TIMESTAMP is the field used for sorting the elements,&lt;br&gt;
and VALUE is the reading at the given timestamp.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fetch records using the &lt;code&gt;range&lt;/code&gt; command: &lt;code&gt;TS.RANGE KEY FROM_TS TO_TS&lt;/code&gt;, where FROM_TS and TO_TS are the lower and upper bounds of the timestamp search space.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
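To make the semantics concrete, here is a tiny in-memory model of those two commands (plain Python, not the real RedisTimeSeries client; the function names are invented):

```python
# In-memory stand-in for TS.ADD / TS.RANGE semantics.
series = {}

def ts_add(key, timestamp, value):
    # Values must be numeric, mirroring RedisTimeSeries.
    assert isinstance(value, (int, float)), "only numbers are supported"
    series.setdefault(key, {})[timestamp] = value

def ts_range(key, from_ts, to_ts):
    samples = series.get(key, {})
    # keep samples whose integer timestamp lies in from_ts..to_ts inclusive
    return sorted(
        (ts, v) for ts, v in samples.items()
        if ts in range(from_ts, to_ts + 1)
    )

ts_add("soil_moisture", 1000, 41.5)
ts_add("soil_moisture", 2000, 39.2)
ts_add("soil_moisture", 3000, 37.8)

print(ts_range("soil_moisture", 1000, 2000))
# [(1000, 41.5), (2000, 39.2)]
```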

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE: the value field is numeric (a double). It supports only numbers.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Therefore, it is very helpful for saving single-value records such as stock prices or soil moisture readings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;However, it is not possible to save lists, tuples, or maps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In that case, we can make use of Sorted Sets.&lt;/p&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Sorted Sets&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ZADD KEY Timestamp RECORD&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Here, ZADD saves data into a sorted set. KEY is the series name, the timestamp is the score used for sorting, and the record can be a string. Hence we can save JSON strings in the record field.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To fetch data from sorted sets, use &lt;code&gt;ZRANGE KEY FROM_TS TO_TS BYSCORE&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;BYSCORE&lt;/code&gt; option filters by timestamp (the score) rather than by index.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
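A minimal model of the sorted-set approach (plain Python, not the real redis-py client, whose equivalent calls would be zadd and a by-score range query; the function names below are invented):

```python
import json

# In-memory stand-in for a Redis sorted set scored by timestamp.
zsets = {}

def zadd(key, score, member):
    zsets.setdefault(key, []).append((score, member))

def zrange_by_score(key, from_ts, to_ts):
    entries = zsets.get(key, [])
    # return members whose integer score lies in from_ts..to_ts, sorted by score
    return [m for s, m in sorted(entries) if s in range(from_ts, to_ts + 1)]

# JSON strings let us store a whole object per timestamp,
# which plain Redis-Timeseries values cannot do.
zadd("readings", 1000, json.dumps({"moisture": 41.5, "temp": 21}))
zadd("readings", 2000, json.dumps({"moisture": 39.2, "temp": 22}))

for member in zrange_by_score("readings", 1000, 2000):
    print(json.loads(member)["moisture"])
# 41.5 then 39.2
```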




&lt;h3&gt;
  
  
  Docker Image For Redis Timeseries
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;docker pull redislabs/redistimeseries&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Link: &lt;a href="https://hub.docker.com/r/redislabs/redistimeseries" rel="noopener noreferrer"&gt;https://hub.docker.com/r/redislabs/redistimeseries&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Thank you for reading the blog. Please suggest improvements and like the blog.&lt;/p&gt;

&lt;p&gt;Aniketh Deshpande&lt;/p&gt;

</description>
      <category>redis</category>
      <category>timeseries</category>
      <category>cache</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>ML Ops</title>
      <dc:creator>Aniketh Deshpande</dc:creator>
      <pubDate>Sat, 10 Oct 2020 02:54:07 +0000</pubDate>
      <link>https://dev.to/anikethsdeshpande/ml-ops-31mc</link>
      <guid>https://dev.to/anikethsdeshpande/ml-ops-31mc</guid>
      <description>&lt;p&gt;MLOps is an emerging field in the tech world. As the name suggests, it is a combination of machine learning and DevOps. MLOps mainly focuses on the fast and efficient delivery of machine learning models.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>devops</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
