<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sathyasarathi</title>
    <description>The latest articles on DEV Community by Sathyasarathi (@sathyasarathi90).</description>
    <link>https://dev.to/sathyasarathi90</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F334509%2F6f9151d3-0709-4a54-a840-071eede4916b.jpg</url>
      <title>DEV Community: Sathyasarathi</title>
      <link>https://dev.to/sathyasarathi90</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sathyasarathi90"/>
    <language>en</language>
    <item>
      <title>Event Streaming Platforms: Kafka - A Hands-On Tutorial</title>
      <dc:creator>Sathyasarathi</dc:creator>
      <pubDate>Thu, 01 Jun 2023 14:22:54 +0000</pubDate>
      <link>https://dev.to/sathyasarathi90/event-streaming-platforms-kafka-a-hands-on-tutorial-213</link>
      <guid>https://dev.to/sathyasarathi90/event-streaming-platforms-kafka-a-hands-on-tutorial-213</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdatawrangler.mo.cloudinary.net%2Fimages%2Ffeatured-post%2Fpost18.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdatawrangler.mo.cloudinary.net%2Fimages%2Ffeatured-post%2Fpost18.jpg" alt="Kafka"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ever wondered how systems such as Apache Kafka [Open Source], Google Pub/Sub [Google Managed], Azure Event Hubs [Azure Managed], etc. function?&lt;/p&gt;

&lt;p&gt;These distributed messaging/event streaming platforms facilitate near-realtime data streaming and parallel processing for ETL and analytical workloads.&lt;/p&gt;

&lt;p&gt;The practical applications include the ability to build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Near-real time data pipelines - E.g. Geolocation updates from mobile phones, Shipping orders, Order Deliveries&lt;/li&gt;
&lt;li&gt;Large scale data pipelines where the batch jobs can be too expensive - E.g. Payment transactions, Sensor measurements from IoT devices or medical equipment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Simply put, these are message transportation systems that are highly performant and resilient. This post will help you understand the components of Kafka and will get you up and running on your local machine. Kafka, being a stateful system, gives the ability to read the stream at different positions and speeds, and to replay messages from the past (within the configured retention period).&lt;/p&gt;

&lt;h2&gt;
  
  
  Components of Kafka
&lt;/h2&gt;

&lt;p&gt;Apache Kafka consists of several key components that work together to provide a scalable and distributed streaming platform. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdatawrangler.mo.cloudinary.net%2Fimages%2Fpost%2F18-kafka%2Fimg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdatawrangler.mo.cloudinary.net%2Fimages%2Fpost%2F18-kafka%2Fimg1.png" alt="Fig1: The Kafka Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main components of Kafka are as follows:&lt;/p&gt;

&lt;h3&gt;
  
  
  Topics
&lt;/h3&gt;

&lt;p&gt;Topics hold the logical stream of data. A topic is analogous to a folder in a filesystem or a table in a database; the events are analogous to the files in a folder or the rows in a table.&lt;/p&gt;

&lt;p&gt;You can have any number of topics. Applications write, or publish, messages to the topics. Internally, topics are made up of multiple partitions, and the messages are distributed across multiple nodes (also known as brokers) for resilience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Partitions &amp;amp; Offsets
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, topics are stored as multiple partitions across different nodes for data resilience. Each message stored within a partition gets an incremental id, known as its offset.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Offsets are incremental and, once assigned, an offset is never re-assigned, even after the messages have been flushed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The ordering of offsets is maintained within a partition but not guaranteed across different partitions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data in a partition is retained only for a limited time (the default is one week). This retention period can be changed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data is &lt;strong&gt;immutable&lt;/strong&gt; and can't be manipulated once it has been written to a partition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data is assigned to a partition randomly unless a partition is explicitly specified.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
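
&lt;p&gt;As a rough mental model (a toy sketch, not Kafka's actual storage engine), a partition behaves like an append-only log whose positions are the offsets:&lt;/p&gt;

```python
# Toy model of a partition: an append-only log in which a message's
# offset is simply its position in that log.
class Partition:
    def __init__(self):
        self.log = []          # append-only message store
        self.next_offset = 0   # offsets are never re-assigned

    def append(self, message):
        offset = self.next_offset
        self.log.append((offset, message))
        self.next_offset += 1  # strictly increasing, even if old data expires
        return offset

p0 = Partition()
print(p0.append("order created"))   # offset 0
print(p0.append("order shipped"))   # offset 1

# Expiring old data does not reset offsets:
p0.log.pop(0)                       # retention removes the oldest message
print(p0.append("order delivered")) # offset 2 - offsets keep increasing
```

Consumers track their position in this log by offset, which is what makes reading at different speeds and replaying past messages possible.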

&lt;p&gt;But, does it matter if the partition is randomly assigned? &lt;/p&gt;

&lt;h3&gt;
  
  
  Brokers
&lt;/h3&gt;

&lt;p&gt;Event streaming platforms are distributed systems made up of multi-node clusters. The nodes of a cluster are known as brokers. At least three brokers are generally recommended for a well-functioning cluster. The data is distributed and replicated across these brokers.&lt;/p&gt;

&lt;p&gt;Like any other distributed system, connecting to a single broker (the &lt;strong&gt;&lt;em&gt;bootstrap broker&lt;/em&gt;&lt;/strong&gt;) gives access to the entire cluster. Topics are distributed across the brokers, and the number of brokers that hold a copy of the data is decided by the &lt;strong&gt;&lt;em&gt;replication factor&lt;/em&gt;&lt;/strong&gt;. This is what brings resilience into the picture: if a broker is down, another broker can serve the data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Replication Golden Rule:&lt;/em&gt;&lt;/strong&gt; Even though multiple brokers can hold a copy of the data, only one broker can be the leader for a given partition. The other brokers synchronize and replicate the data. If the leader goes down, one of the followers is elected to become the leader.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each partition has a leader and multiple ISRs (in-sync replicas). A distributed coordination service such as Apache ZooKeeper decides on the leader and takes care of replication.&lt;/p&gt;

&lt;h2&gt;
  
  
  Publishing Data to Kafka (Write)
&lt;/h2&gt;

&lt;p&gt;To write data to Apache Kafka, you need to create a producer application that connects to the Kafka cluster and publishes messages to the desired Kafka topic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Producers
&lt;/h3&gt;

&lt;p&gt;Applications known as &lt;strong&gt;Producers&lt;/strong&gt; write the data to topics. They know which broker and partition to write to, and they auto-recover from broker failures. The load is balanced across the brokers. If no key is specified when sending data, the producer distributes messages across partitions in a round-robin fashion.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Send Modes
&lt;/h3&gt;

&lt;p&gt;Producers can be configured to wait for acknowledgement of data writes. The following are the possible send modes; one of them can be configured per producer.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Send Mode&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Data Loss&lt;/th&gt;
&lt;th&gt;Use-case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;acks=0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Producer won't wait for acknowledgement - dangerous, but no performance lag&lt;/td&gt;
&lt;td&gt;Possible data loss&lt;/td&gt;
&lt;td&gt;Suitable for high-volume transfers where occasional loss is acceptable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;acks=1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Producer waits for the leader broker to acknowledge&lt;/td&gt;
&lt;td&gt;Limited data loss&lt;/td&gt;
&lt;td&gt;Suitable for most data streams (widely used)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;acks=all&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The leader and all other in-sync replicas receive the data and acknowledge&lt;/td&gt;
&lt;td&gt;No data loss, but slower performance&lt;/td&gt;
&lt;td&gt;Suitable for secure and persistent storage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
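
&lt;p&gt;In client configuration, these modes map to the producer's &lt;code&gt;acks&lt;/code&gt; property. A minimal, illustrative properties fragment (the values are examples, not recommendations):&lt;/p&gt;

```properties
# Producer client settings (standard Kafka producer config names)
bootstrap.servers=localhost:9092
acks=all          # 0, 1, or all - trades latency for durability
retries=3         # retry transient failures instead of dropping data
```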

&lt;h3&gt;
  
  
  Message Keys
&lt;/h3&gt;

&lt;p&gt;If the ordering of messages matters, message keys are mandatory. For example, consider an ecommerce or delivery app in which the key is the &lt;code&gt;order_id&lt;/code&gt;. An order passes through event types such as "ordered", "shipped", "in-transit" and "delivered", and it is crucial that these are processed in the same sequence; this is done by assigning them all the same message key.&lt;/p&gt;

&lt;p&gt;A message key can be a hash, a number, a string, etc. If the message key is null, data is sent round robin (i.e. in the case of two partitions - the first message is sent to partition 1, the second message to partition 2, the third message to partition 1, and so on).&lt;/p&gt;

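
&lt;p&gt;The key-to-partition mapping can be sketched as follows. Note that this is only an illustration: Kafka's real default partitioner hashes the key bytes with murmur2, not MD5, and newer clients use a "sticky" strategy rather than strict round robin for null keys:&lt;/p&gt;

```python
import hashlib
from itertools import count

NUM_PARTITIONS = 3

def partition_for(key, counter=count()):
    """Illustrative partitioner: keyed messages hash to a fixed partition,
    null-keyed messages are spread round-robin. (Kafka's real default
    partitioner uses murmur2 hashing, not MD5 - this only shows the idea.)"""
    if key is None:
        return next(counter) % NUM_PARTITIONS           # round robin
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# All events for the same order land in the same partition, so their
# relative order is preserved:
events = ["ordered", "shipped", "in-transit", "delivered"]
partitions = {partition_for("order-42") for _ in events}
print(partitions)  # a single partition id
```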

&lt;h2&gt;
  
  
  Subscribing to Data in Kafka (Read)
&lt;/h2&gt;

&lt;p&gt;Applications known as &lt;strong&gt;Consumers&lt;/strong&gt; read the data from topics, organized into consumer groups (comparable to subscriptions in other systems).&lt;/p&gt;

&lt;h3&gt;
  
  
  Consumers
&lt;/h3&gt;

&lt;p&gt;Consumers read the data from topics. They know which broker to read from, and they can auto-recover in case of failures. Data is read in order within each partition.&lt;/p&gt;

&lt;p&gt;When a consumer reads from multiple partitions, the reading order across partitions can be arbitrary; within each partition, however, the data is read in order.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consumer Groups
&lt;/h3&gt;

&lt;p&gt;Consumers read from multiple partitions by reading in consumer groups, with the work divided among the consumers of a group. Internally, consumers use a group coordinator and a consumer coordinator to assign consumers to partitions.&lt;/p&gt;

&lt;p&gt;If there are more consumers than partitions, some of the consumers will be inactive. These inactive consumers are still helpful: when an active consumer goes down, an inactive consumer can replace it.&lt;/p&gt;

&lt;p&gt;For Example:&lt;/p&gt;

&lt;p&gt;A dashboard application can have two consumers under consumer group 1, reading from three partitions at a time.&lt;br&gt;
A notification application can have one consumer reading from multiple partitions at the same time.&lt;/p&gt;
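
&lt;p&gt;The division of partitions among a group's consumers can be sketched like this (a simplified round-robin assignment; Kafka's actual assignors - range, round-robin, cooperative-sticky - are more involved):&lt;/p&gt;

```python
def assign(partitions, consumers):
    """Toy partition assignment: deal partitions out to consumers in turn.
    (Kafka's real assignors are more sophisticated, but the effect is the
    same: each partition has exactly one consumer within a group.)"""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Dashboard app: two consumers sharing three partitions
print(assign(["p0", "p1", "p2"], ["dash-1", "dash-2"]))

# With more consumers than partitions, the extras sit idle as hot spares:
print(assign(["p0", "p1"], ["c1", "c2", "c3"]))
```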
&lt;h3&gt;
  
  
  Installing Kafka
&lt;/h3&gt;

&lt;p&gt;Setting up Apache Kafka involves: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downloading Kafka&lt;/li&gt;
&lt;li&gt;Configuring the Kafka cluster&lt;/li&gt;
&lt;li&gt;Starting the necessary components. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a high-level overview of the steps to set up Kafka:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Download Kafka:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Go to the Apache Kafka website (&lt;a href="https://kafka.apache.org/downloads" rel="noopener noreferrer"&gt;https://kafka.apache.org/downloads&lt;/a&gt;) and download the desired Kafka version.&lt;br&gt;
Extract the downloaded archive file to a directory on your system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configure Kafka:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Navigate to the Kafka installation directory and locate the config subdirectory.&lt;br&gt;
Modify the configuration files as per your requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;config/server.properties&lt;/code&gt;: This file contains the configuration for Kafka brokers. You may need to update properties such as &lt;code&gt;broker.id&lt;/code&gt;, &lt;code&gt;listeners&lt;/code&gt;, &lt;code&gt;log.dirs&lt;/code&gt;, and &lt;code&gt;zookeeper.connect&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;config/zookeeper.properties&lt;/code&gt;: This file contains the configuration for the ZooKeeper server used by Kafka. Update properties such as &lt;code&gt;dataDir&lt;/code&gt; and &lt;code&gt;clientPort&lt;/code&gt; if necessary.&lt;/li&gt;
&lt;/ul&gt;
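
&lt;p&gt;For a single-node local setup, the relevant &lt;code&gt;config/server.properties&lt;/code&gt; entries look roughly like this (the paths and ports are illustrative defaults):&lt;/p&gt;

```properties
# config/server.properties - minimal single-broker example
broker.id=0
listeners=PLAINTEXT://localhost:9092
log.dirs=/tmp/kafka-logs
zookeeper.connect=localhost:2181
```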

&lt;p&gt;&lt;strong&gt;Starting the Necessary Components&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Start ZooKeeper&lt;/em&gt;&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka relies on ZooKeeper for coordination and maintaining cluster state. Start ZooKeeper by executing the following command in a new terminal or command prompt window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;bin/zookeeper-server-start.sh config/zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep the ZooKeeper instance running while setting up and using Kafka.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Start Kafka brokers&lt;/em&gt;&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a separate terminal or command prompt window, start the Kafka brokers by executing the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;bin/kafka-server-start.sh config/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can start multiple brokers if you want to set up a Kafka cluster. In that case, you need to configure unique properties such as &lt;code&gt;broker.id&lt;/code&gt;, &lt;code&gt;listeners&lt;/code&gt;, and &lt;code&gt;log.dirs&lt;/code&gt; for each broker.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Create topics&lt;/em&gt;&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka uses topics to organize and categorize messages. You can create a topic using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;bin/kafka-topics.sh &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--topic&lt;/span&gt; &amp;lt;topic_name&amp;gt; &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; localhost:9092 &lt;span class="nt"&gt;--partitions&lt;/span&gt; 1 &lt;span class="nt"&gt;--replication-factor&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adjust the &lt;code&gt;&amp;lt;topic_name&amp;gt;&lt;/code&gt; as desired, and specify the appropriate bootstrap-server (Kafka broker) address, partitions, and replication factor.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Verify the setup&lt;/em&gt;&lt;/strong&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To ensure that Kafka is running and the setup is correct, you can use various Kafka command-line tools to produce and consume messages from topics. For example:&lt;/p&gt;

&lt;p&gt;Produce messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;bin/kafka-console-producer.sh &lt;span class="nt"&gt;--topic&lt;/span&gt; &amp;lt;topic_name&amp;gt; &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consume messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;bin/kafka-console-consumer.sh &lt;span class="nt"&gt;--topic&lt;/span&gt; &amp;lt;topic_name&amp;gt; &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; localhost:9092 &lt;span class="nt"&gt;--from-beginning&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These steps provide a basic setup of Kafka for local development and testing purposes. For production environments, additional configurations and considerations are necessary, such as securing the cluster, configuring replication, and setting up additional components like Kafka Connect or Kafka Streams.&lt;/p&gt;

&lt;p&gt;This gives you a basic understanding of the Kafka architecture and how to get started. Event streaming platforms allow for high-throughput, real-time data streaming and processing across multiple applications and systems. Some practical use-cases include: messaging systems, large-volume log aggregation and transport, stream processing, data pipelines, and IoT data processing.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>eventdriven</category>
      <category>datascience</category>
      <category>database</category>
    </item>
    <item>
      <title>Git: Best Practices For Better Collaboration</title>
      <dc:creator>Sathyasarathi</dc:creator>
      <pubDate>Sat, 04 Dec 2021 11:03:00 +0000</pubDate>
      <link>https://dev.to/sathyasarathi90/git-best-practices-for-better-collaboration-34ff</link>
      <guid>https://dev.to/sathyasarathi90/git-best-practices-for-better-collaboration-34ff</guid>
      <description>&lt;ul&gt;
&lt;li&gt;Always have individual branches for your environments. The developers can create branches out of the &lt;code&gt;dev&lt;/code&gt; branch and work on their respective branches👩‍💻👨‍💻.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;main
|
|------dev
        |--- helsinki/feature1
        |--- tokyo/feature2
                |--- nairobi/feature3 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Protect 🛡️ the &lt;code&gt;main&lt;/code&gt; and &lt;code&gt;dev&lt;/code&gt; branches so that no one accidentally pushes to or deletes them. Even if the branches are not protected, never 🛑 push directly to the &lt;code&gt;main&lt;/code&gt; and &lt;code&gt;dev&lt;/code&gt; branches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Work on your own feature branch 🌳 that is branched out of dev branch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; git checkout dev
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; helsinki/feature-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Commit and push your changes 🟩 to the remote branch 🌲 before you end your day, every evening 🗓️.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Devices can be replaced but the lost code can never be.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Rebase your changes 🟩 with your source branch 🌲 before you begin the day 🌅, to avoid merge conflicts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;Feature Branch ← Source Branch&lt;/code&gt; (Rebase)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; git branch
dev
branch1
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; git checkout branch1
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; git rebase dev
Successfully rebased and updated refs/heads/branch1.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;In case of conflicts, DO NOT PANIC! 🤯🙅🏻&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why do the conflicts occur?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Merge conflicts may occur if competing changes are made to the same line of a file or when a file is deleted that another person is attempting to edit.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;a. You clone from the main branch&lt;br&gt;
b. Your colleague clones from the same branch&lt;br&gt;
c. Your colleague pushes a commit and merges their changes into the main branch&lt;br&gt;
d. You push your commit, and when you try to merge your change into the main branch, 🟥 a merge conflict occurs.&lt;/p&gt;
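
&lt;p&gt;When a conflict occurs, Git marks the competing hunks directly inside the affected file; everything between the markers is yours to reconcile (the file contents here are made up for illustration):&lt;/p&gt;

```
&lt;&lt;&lt;&lt;&lt;&lt;&lt; HEAD
print("hello from your branch")
=======
print("hello from main")
&gt;&gt;&gt;&gt;&gt;&gt;&gt; main
```

Edit the file to keep the version you want (or a combination), remove the markers, then stage it with `git add`.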

&lt;p&gt;&lt;code&gt;Feature branch → Source Branch&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Merge conflicts can occur while rebasing too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; git rebase main
Auto-merging main.py
CONFLICT &lt;span class="o"&gt;(&lt;/span&gt;content&lt;span class="o"&gt;)&lt;/span&gt;: Merge conflict &lt;span class="k"&gt;in &lt;/span&gt;main.py
error: could not apply cce2b78... added another statement
Resolve all conflicts manually, mark them as resolved with
&lt;span class="s2"&gt;"git add/rm &amp;lt;conflicted_files&amp;gt;"&lt;/span&gt;, &lt;span class="k"&gt;then &lt;/span&gt;run &lt;span class="s2"&gt;"git rebase --continue"&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
You can instead skip this commit: run &lt;span class="s2"&gt;"git rebase --skip"&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
To abort and get back to the state before &lt;span class="s2"&gt;"git rebase"&lt;/span&gt;, run &lt;span class="s2"&gt;"git rebase --abort"&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
Could not apply cce2b78... added another statement

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; git status
interactive rebase &lt;span class="k"&gt;in &lt;/span&gt;progress&lt;span class="p"&gt;;&lt;/span&gt; onto bbf1e2e
Last &lt;span class="nb"&gt;command &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;1 &lt;span class="nb"&gt;command &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;:
   pick cce2b78 added another statement
No commands remaining.
You are currently rebasing branch &lt;span class="s1"&gt;'branch2'&lt;/span&gt; on &lt;span class="s1"&gt;'bbf1e2e'&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;fix conflicts and &lt;span class="k"&gt;then &lt;/span&gt;run &lt;span class="s2"&gt;"git rebase --continue"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;use &lt;span class="s2"&gt;"git rebase --skip"&lt;/span&gt; to skip this patch&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;use &lt;span class="s2"&gt;"git rebase --abort"&lt;/span&gt; to check out the original branch&lt;span class="o"&gt;)&lt;/span&gt;

Unmerged paths:
  &lt;span class="o"&gt;(&lt;/span&gt;use &lt;span class="s2"&gt;"git restore --staged &amp;lt;file&amp;gt;..."&lt;/span&gt; to unstage&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;use &lt;span class="s2"&gt;"git add &amp;lt;file&amp;gt;..."&lt;/span&gt; to mark resolution&lt;span class="o"&gt;)&lt;/span&gt;
    both modified:   main.py

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; nano main.py

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s1"&gt;'made changes'&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; git status

interactive rebase &lt;span class="k"&gt;in &lt;/span&gt;progress&lt;span class="p"&gt;;&lt;/span&gt; onto bbf1e2e
Last &lt;span class="nb"&gt;command &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;1 &lt;span class="nb"&gt;command &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;:
   pick cce2b78 added another statement
No commands remaining.
You are currently editing a commit &lt;span class="k"&gt;while &lt;/span&gt;rebasing branch &lt;span class="s1"&gt;'branch2'&lt;/span&gt; on &lt;span class="s1"&gt;'bbf1e2e'&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;use &lt;span class="s2"&gt;"git commit --amend"&lt;/span&gt; to amend the current commit&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;use &lt;span class="s2"&gt;"git rebase --continue"&lt;/span&gt; once you are satisfied with your changes&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; git rebase
 &lt;span class="nt"&gt;--continue&lt;/span&gt;
Successfully rebased and updated refs/heads/branch2.

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; git push origin branch2
Merge branch &lt;span class="s1"&gt;'branch2'&lt;/span&gt; of https://github.com/datawrangl3r/conflict into branch2
To https://github.com/datawrangl3r/conflict.git
 &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;rejected]        branch2 -&amp;gt; branch2 &lt;span class="o"&gt;(&lt;/span&gt;non-fast-forward&lt;span class="o"&gt;)&lt;/span&gt;
error: failed to push some refs to &lt;span class="s1"&gt;'https://github.com/datawrangl3r/conflict.git'&lt;/span&gt;
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Integrate the remote changes &lt;span class="o"&gt;(&lt;/span&gt;e.g.
hint: &lt;span class="s1"&gt;'git pull ...'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; before pushing again.
hint: See the &lt;span class="s1"&gt;'Note about fast-forwards'&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s1"&gt;'git push --help'&lt;/span&gt; &lt;span class="k"&gt;for &lt;/span&gt;details.

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; git pull origin branch2
hint: Pulling without specifying how to reconcile divergent branches is
hint: discouraged. You can squelch this message by running one of the following
hint: commands sometime before your next pull:
hint:
hint:   git config pull.rebase &lt;span class="nb"&gt;false&lt;/span&gt;  &lt;span class="c"&gt;# merge (the default strategy)&lt;/span&gt;
hint:   git config pull.rebase &lt;span class="nb"&gt;true&lt;/span&gt;   &lt;span class="c"&gt;# rebase&lt;/span&gt;
hint:   git config pull.ff only       &lt;span class="c"&gt;# fast-forward only&lt;/span&gt;
hint:
hint: You can replace &lt;span class="s2"&gt;"git config"&lt;/span&gt; with &lt;span class="s2"&gt;"git config --global"&lt;/span&gt; to &lt;span class="nb"&gt;set &lt;/span&gt;a default
hint: preference &lt;span class="k"&gt;for &lt;/span&gt;all repositories. You can also pass &lt;span class="nt"&gt;--rebase&lt;/span&gt;, &lt;span class="nt"&gt;--no-rebase&lt;/span&gt;,
hint: or &lt;span class="nt"&gt;--ff-only&lt;/span&gt; on the &lt;span class="nb"&gt;command &lt;/span&gt;line to override the configured default per
hint: invocation.
From https://github.com/datawrangl3r/conflict
 &lt;span class="k"&gt;*&lt;/span&gt; branch            branch2    -&amp;gt; FETCH_HEAD
Merge made by the &lt;span class="s1"&gt;'recursive'&lt;/span&gt; strategy.

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; git push origin branch2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Always delete ❌ your branch upon completion of a feature request.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Follow my Blog @ &lt;a href="https://www.datawrangler.in"&gt;https://www.datawrangler.in&lt;/a&gt;&lt;br&gt;
Github @ &lt;a href="https://github.com/datawrangl3r"&gt;https://github.com/datawrangl3r&lt;/a&gt;&lt;/p&gt;

</description>
      <category>github</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
    <item>
      <title>ETL &amp; Enterprise Level Practices</title>
      <dc:creator>Sathyasarathi</dc:creator>
      <pubDate>Thu, 03 Dec 2020 18:36:22 +0000</pubDate>
      <link>https://dev.to/sathyasarathi90/etl-enterprise-level-practices-4gn3</link>
      <guid>https://dev.to/sathyasarathi90/etl-enterprise-level-practices-4gn3</guid>
      <description>&lt;p&gt;ETL Strategies &amp;amp; Pipelines have now become inevitable for cloud business needs. There are several ETL tools in the market ranging from open-source ones such as Airflow, Luigi, Azkaban, Oozie to enterprise solutions such as Azure Data Factory, AWS Glue, Alteryx, Fivetran, etc. But what makes the data pipelines to be Industry-ready and robust? The practices and the industrial standards that are put into the architecture and design of the pipelines do. In this article, we touch upon these aspects and characteristics of Enterprise-level practices that can be followed to make a robust data pipeline. &lt;/p&gt;

&lt;h2&gt;
  
  
  ETL-Alphabets
&lt;/h2&gt;

&lt;p&gt;We will be explaining the ETL strategies, alphabetically as shown in the illustration below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---Att_Oao--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lodl3adgr4kxstngpx05.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---Att_Oao--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lodl3adgr4kxstngpx05.jpg" alt="Illustration: ETL-Alphabets"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Archival Strategies
&lt;/h3&gt;

&lt;p&gt;Archive older unused data, to keep the data warehouses clean and performant. By performing such cleanups, unused indexes can be cleared, and unused space can be retrieved.&lt;/p&gt;

&lt;h3&gt;
  
  
  Be a responsible citizen
&lt;/h3&gt;

&lt;p&gt;Select only those columns that are required for the analysis - while performing the extraction, select only the required columns. This helps increase the performance of the data warehouse and also speeds up the extraction process.&lt;br&gt;
Power down resources when not in use. If you are using clusters for processing or databases for staging, make sure to power down the resources that are not in use.&lt;/p&gt;
&lt;h3&gt;
  
  
  Cache Data
&lt;/h3&gt;

&lt;p&gt;Accessing a table more often? Why not cache that table for rapid usage? For example, Spark’s &lt;code&gt;createOrReplaceTempView&lt;/code&gt; function creates a mere lazy reference to the table, whereas &lt;code&gt;saveAsTable()&lt;/code&gt; persists the table to HDFS. This won’t make much of a difference for smaller datasets, yet it could be a pivotal improvement while transforming larger datasets.&lt;/p&gt;
&lt;h3&gt;
  
  
  Data Security &amp;amp; Anonymity
&lt;/h3&gt;

&lt;p&gt;Mask sensitive information, or better, don’t bring it to the target data warehouse at all. Always perform your ETL over VPNs or secured groups, and always use a key vault service (Vault, Azure Key Vault, AWS Secrets Manager, etc.) in case you don’t have a universal authentication identity service provider protected by individual service principals.&lt;/p&gt;
&lt;h3&gt;
  
  
  Extract Only Necessary Fields
&lt;/h3&gt;

&lt;p&gt;You don’t always require a &lt;code&gt;select * from table&lt;/code&gt;. Select only those fields that are required from the source database tables. If you are wondering whether you might need other columns in the future, many ETL solutions, such as Azure Data Factory, provide options to facilitate schema drift. This can be a major game-changer for the performance and efficiency of the ETL processes.&lt;/p&gt;
&lt;h3&gt;
  
  
  Find the Appropriate Business Date Column
&lt;/h3&gt;

&lt;p&gt;Identifying the fields is one thing; identifying the business date columns is another. The business date columns are essential since the data lake timestamp columns are not always reliable, and these columns are needed for performing incremental changes to your data warehouse.&lt;/p&gt;
&lt;h3&gt;
  
  
  Generalize the ETL solution
&lt;/h3&gt;

&lt;p&gt;The ETL solution that you create should be able to accommodate changes dynamically. Changes shouldn’t bring about havoc, rework, or major code changes. Make sure to bring in a lot of dynamic parameters and configurable placeholders such as File Location, File System Location, Logging Location, Name of Facts, Dimensions, Staging Tables, etc.&lt;/p&gt;
&lt;h3&gt;
  
  
  High-Level Data flow
&lt;/h3&gt;

&lt;p&gt;Keep your ETL pipelines and the corresponding data flows as simple as possible. Plan before you start: draft a blueprint of what your pipeline is meant to accomplish. This helps in identifying the activities that are tricky and prone to failure, such as processing at the staging layer and copying to the destination data warehouse.&lt;/p&gt;
&lt;h3&gt;
  
  
  Incremental Data Refreshes
&lt;/h3&gt;

&lt;p&gt;Incremental data refreshes are nothing but bringing in only the data that has changed or been newly inserted into the Data Lake. Business date stamps can be very helpful here. Perform incremental refreshes wherever applicable, since a complete refresh of a whole year’s data can be expensive if done frequently.&lt;/p&gt;
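&lt;p&gt;A minimal sketch of the watermark pattern (plain Python; the rows and the watermark value are hypothetical): only rows with a business date newer than the last successful load move downstream.&lt;/p&gt;

```python
from datetime import date

# Hypothetical lake rows carrying a business date column.
rows = [
    {"id": 1, "business_date": date(2023, 1, 10), "amount": 100},
    {"id": 2, "business_date": date(2023, 1, 15), "amount": 250},
    {"id": 3, "business_date": date(2023, 1, 20), "amount": 75},
]

# Watermark = business date of the last successful load.
watermark = date(2023, 1, 15)

# Incremental refresh: only rows newer than the watermark are reprocessed.
delta = [r for r in rows if r["business_date"] > watermark]
print(delta)  # only the row with id=3
```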
&lt;h3&gt;
  
  
  Joins in Queries
&lt;/h3&gt;

&lt;p&gt;Joins in queries are fine, as long as you know what you are doing and the query plan is kept in check. Make sure to review the conditions used to join, and filter the results before you join, if possible.&lt;/p&gt;
&lt;h3&gt;
  
  
  Keys-Surrogates &amp;amp; Composites
&lt;/h3&gt;

&lt;p&gt;Surrogate keys are columns added to the data warehouse tables to ensure uniqueness in the destination tables. These keys are essential for performing an upsert operation (insert, and update if the insert fails) on the destination tables. The choice of columns is based on the business context and, usually, these keys are framed as an SHA hash of multiple columns combined using a delimiter. &lt;/p&gt;

&lt;p&gt;E.g.: in a fact table recording users purchasing products, with the columns Product_id, Order_timestamp, Customer_id, and Delivery_date, uniqueness can be achieved for each entry by creating a surrogate key like SHA(Product_id, ‘|’, Order_timestamp, ‘|’, Customer_id). If the Delivery_date later changes, the newer date can be updated in the data warehouse based on the surrogate key.&lt;/p&gt;
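&lt;p&gt;The surrogate key above can be sketched with the standard library’s hashlib (SHA-256 here; the column values are hypothetical). Note that Delivery_date stays out of the key, so a changed delivery date still maps to the same warehouse row:&lt;/p&gt;

```python
import hashlib

def surrogate_key(*parts, delimiter="|"):
    """Build a deterministic surrogate key by hashing delimited column values."""
    raw = delimiter.join(str(p) for p in parts)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

key_v1 = surrogate_key("P-100", "2023-06-01T10:00:00", "C-7")
key_v2 = surrogate_key("P-100", "2023-06-01T10:00:00", "C-7")

# The key is stable for the same grain, so an updated Delivery_date
# can be upserted in place against the same surrogate key.
print(key_v1 == key_v2)  # True
```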

&lt;p&gt;While querying the data warehouse, it’s good practice to check the query plan to verify that the indexes are being used; if not, the required indexes can be created. For queries filtering on multiple columns, those columns can be identified and a composite index combining them can be created for faster access.&lt;/p&gt;
&lt;h3&gt;
  
  
  Logging
&lt;/h3&gt;

&lt;p&gt;Logging helps track down pesky ETL run failures. If the ETL application provides logging out of the box, no external logging is required; if not, it’s essential to have a custom script that does the logging.&lt;/p&gt;
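&lt;p&gt;A minimal custom-logging sketch with Python’s standard logging module (the logger name and messages are made up; a real run would log to a file or a log store rather than an in-memory buffer):&lt;/p&gt;

```python
import io
import logging

# Route ETL run logs to a stream; a file path would be typical in practice.
buffer = io.StringIO()
logger = logging.getLogger("etl_run")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(buffer))

logger.info("extract started: source=orders")
try:
    raise ValueError("staging row count mismatch")  # simulated failure
except ValueError:
    logger.exception("load failed")  # records the traceback too

log_text = buffer.getvalue()
print(log_text)
```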
&lt;h3&gt;
  
  
  Monitor
&lt;/h3&gt;

&lt;p&gt;Mechanisms to monitor the ETL processes need to be in place. For example, Azure Data Factory and Airflow have really good interfaces for tracking failures. In addition, mechanisms such as e-mail alerts and Slack notifications can be very helpful. In short, a holistic view of the pipeline’s health needs to be available at all times.&lt;/p&gt;
&lt;h3&gt;
  
  
  Nested Loops-Avoid at all costs
&lt;/h3&gt;

&lt;p&gt;In most situations, the memory blowups observed in ETL jobs are due to the join strategies, which can be found in the query plans. Depending on the database and processing engine, the chosen strategy may vary. &lt;/p&gt;

&lt;p&gt;For example: In Spark, the join strategy can be rated from ‘worst’ to ‘best’ as follows:&lt;/p&gt;

&lt;p&gt;Broadcast nested loop join (worst) &amp;gt; Cartesian Product &amp;gt; sort-merge join &amp;gt; Hash Joins (best)&lt;/p&gt;

&lt;p&gt;As noted above, the nested loop join is the last resort. A nested loop join strategy looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for row_1 in relation_1:
    for row_2 in relation_2:
        # join condition is evaluated for every pair of rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An O(n × m), effectively quadratic, complexity is not something you want to experience on a large dataset.&lt;/p&gt;
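&lt;p&gt;By contrast, the hash join at the top of the ranking can be sketched in a few lines of plain Python (the relations are hypothetical): build a hash table on the smaller relation, then probe it once per row of the larger one, which is O(n + m) rather than O(n × m).&lt;/p&gt;

```python
# (order_id, customer_id) and (customer_id, name) relations, made up for illustration.
orders = [(1, "C-1"), (2, "C-2"), (3, "C-1")]
customers = [("C-1", "Alice"), ("C-2", "Bob")]

build = {}
for cust_id, name in customers:          # build phase: hash the smaller relation
    build.setdefault(cust_id, []).append(name)

joined = []
for order_id, cust_id in orders:         # probe phase: one lookup per row
    for name in build.get(cust_id, []):
        joined.append((order_id, name))

print(joined)  # [(1, 'Alice'), (2, 'Bob'), (3, 'Alice')]
```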

&lt;h3&gt;
  
  
  Optimize &amp;amp; Scale
&lt;/h3&gt;

&lt;p&gt;ETL pipelines can be optimized by finding the right time window in which to execute them. For example, when scheduling a pipeline to extract data from the production database, take the production business hours into consideration so that the transactional queries of the business applications are not hindered. Choosing the right tools, processing clusters, and strategies can help scale the pipeline; for example, the size of the processing cluster can be decided by looking at query runtime statistics and the volume of rows extracted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parallelize wherever possible
&lt;/h3&gt;

&lt;p&gt;Parallel processing is almost always more efficient than sequential processing. Big data storage solutions and non-columnar destinations are parallel in nature, and recent versions of relational databases such as Postgres can parallelize queries too. The individual steps of the ETL can also be parallelized if the tool provides for it.&lt;/p&gt;
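&lt;p&gt;A small sketch of parallelizing independent extract steps with Python’s concurrent.futures (the extract function and table names are hypothetical stand-ins for I/O-bound source queries):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def extract(table):
    # Stand-in for an I/O-bound extract, e.g. a query against a source DB.
    return f"{table}:done"

tables = ["orders", "customers", "products", "shipments"]

# Threads suit I/O-bound extracts; executor.map preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(extract, tables))

print(results)
```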

&lt;h3&gt;
  
  
  Quality of Data
&lt;/h3&gt;

&lt;p&gt;The source can be in any form, i.e., the data in data lakes can take varied shapes. However, the target data warehouse should always hold clean data, and the ETL scripts need to make sure that whatever is written to the destination is of high quality. The quality parameters include maintaining precision in floats, preventing varchar overflows, splitting columns on the delimiters present (‘|’, ‘,’, ‘\t’), using appropriate compression formats, etc.&lt;/p&gt;
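&lt;p&gt;A toy cleaning function covering three of the checks above (the limits, field names, and rules are hypothetical; real pipelines would drive these from the target schema):&lt;/p&gt;

```python
def clean_row(row, varchar_limit=10, float_places=2):
    """Apply simple quality rules before writing a row to the warehouse."""
    amount = round(row["amount"], float_places)   # control float precision
    note = row["note"][:varchar_limit]            # prevent varchar overflow
    city, country = row["location"].split("|")    # split on the '|' delimiter
    return {"amount": amount, "note": note, "city": city, "country": country}

cleaned = clean_row({
    "amount": 19.98765,
    "note": "this note is far too long for the column",
    "location": "Chennai|India",
})
print(cleaned)
```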

&lt;h3&gt;
  
  
  Recovery &amp;amp; Checkpoints
&lt;/h3&gt;

&lt;p&gt;ETL pipelines are to be designed pessimistically, by asking the ‘what-if’ questions. Assume there’s an outage in the source data lake/datastore: will the pipeline be capable of pausing and resuming, or of recovering and restoring from the failed checkpoint? Designing such fail-safe strategies makes the pipelines robust. &lt;/p&gt;

&lt;h3&gt;
  
  
  Schemas &amp;amp; SCDs
&lt;/h3&gt;

&lt;p&gt;Warehouse and ETL designs are driven purely by the business requirements, and engineers need to ask the right questions to design for the pipeline and BI requirements that the business needs. Schemas &amp;amp; Slowly Changing Dimensions are the major steering components of the warehouse design. The schema can be a Star schema or a Snowflake schema, depending on how fast it evolves. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Star schemas are fairly static, and each dimension is represented by a single dimension table.&lt;/li&gt;
&lt;li&gt;Snowflake schemas are a lot more granular: the dimension tables are normalized further and split into multiple tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition, depending on the need to store the historic information of the facts or dimensions, the choice needs to be made between SCD1 and SCD2. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SCD1 updates the existing rows in place when the key matches.&lt;/li&gt;
&lt;li&gt;SCD2 preserves and invalidates the older rows and inserts newer ones, promoting them to be the latest. &lt;/li&gt;
&lt;/ul&gt;
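&lt;p&gt;The SCD1/SCD2 contrast can be sketched with in-memory rows (a hypothetical customer dimension; real implementations would also track effective dates):&lt;/p&gt;

```python
def scd1_upsert(dim, key, attrs):
    """SCD1: overwrite the matching row in place; history is lost."""
    for row in dim:
        if row["key"] == key:
            row.update(attrs)
            return
    dim.append({"key": key, **attrs})

def scd2_upsert(dim, key, attrs):
    """SCD2: invalidate the current row, insert a new row marked current."""
    for row in dim:
        if row["key"] == key and row["is_current"]:
            row["is_current"] = False
    dim.append({"key": key, "is_current": True, **attrs})

dim1 = [{"key": "C-1", "city": "Chennai"}]
scd1_upsert(dim1, "C-1", {"city": "Bengaluru"})

dim2 = [{"key": "C-1", "city": "Chennai", "is_current": True}]
scd2_upsert(dim2, "C-1", {"city": "Bengaluru"})

print(len(dim1), len(dim2))  # 1 row after SCD1, 2 rows after SCD2
```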

&lt;h3&gt;
  
  
  Temp Tables-Use them Wisely
&lt;/h3&gt;

&lt;p&gt;As mentioned in ‘Cache Data’, temporary views are merely references to the physical tables. They are accessible only on the cluster that created them, live only as long as that cluster runs, and can be slow to operate on since the data is neither held in memory nor persisted. True to their name, these temp views are destroyed when the cluster terminates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Upserts
&lt;/h3&gt;

&lt;p&gt;As mentioned in ‘Schemas &amp;amp; SCDs’, SCD1 depends on the unique key on which the upsert operation is based. Upserts are nothing but inserts that become updates on conflict; here, the conflict is the inability to insert because the key already exists, which in turn triggers the update.&lt;/p&gt;
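&lt;p&gt;As a concrete sketch (using SQLite from the Python standard library as a stand-in for the warehouse engine; the table is hypothetical, and the &lt;code&gt;ON CONFLICT ... DO UPDATE&lt;/code&gt; syntax shown requires SQLite 3.24 or newer), the same statement inserts on first sight of a key and updates on conflict:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (surrogate_key TEXT PRIMARY KEY, delivery_date TEXT)")

upsert = """
    INSERT INTO fact (surrogate_key, delivery_date) VALUES (?, ?)
    ON CONFLICT(surrogate_key) DO UPDATE SET delivery_date = excluded.delivery_date
"""
conn.execute(upsert, ("abc123", "2023-06-05"))  # insert: key not present yet
conn.execute(upsert, ("abc123", "2023-06-07"))  # conflict: row is updated

rows = conn.execute("SELECT * FROM fact").fetchall()
print(rows)  # [('abc123', '2023-06-07')]
```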

&lt;h3&gt;
  
  
  Views
&lt;/h3&gt;

&lt;p&gt;Creating views on top of complex queries can be very helpful when the results of a query need to be computed and used only a minimal number of times. If the results are too expensive to store, views can come in quite handy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where Conditions
&lt;/h3&gt;

&lt;p&gt;While performing the transformations, selecting only the bucket of data required for processing can save a large amount of compute time. ‘Where’ clauses help achieve this, especially during incremental refreshes.&lt;/p&gt;
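&lt;p&gt;A tiny illustration of the bucket idea (SQLite from the Python standard library; table, dates, and values are made up): the where predicate confines the run to the slice it actually needs.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, business_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "2023-05-30", 10.0), (2, "2023-05-31", 20.0), (3, "2023-06-01", 30.0)],
)

# Transform only the bucket needed for this run, not the whole table.
bucket = conn.execute(
    "SELECT id, amount FROM sales WHERE business_date >= ?", ("2023-06-01",)
).fetchall()
print(bucket)  # [(3, 30.0)]
```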

</description>
      <category>etl</category>
      <category>adf</category>
      <category>database</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Red Vs Blue Pill — Making the right decision on the choice of BI Server</title>
      <dc:creator>Sathyasarathi</dc:creator>
      <pubDate>Mon, 09 Mar 2020 21:20:00 +0000</pubDate>
      <link>https://dev.to/sathyasarathi90/red-vs-blue-pill-making-the-right-decision-on-the-choice-of-bi-server-5fpc</link>
      <guid>https://dev.to/sathyasarathi90/red-vs-blue-pill-making-the-right-decision-on-the-choice-of-bi-server-5fpc</guid>
      <description>&lt;h3&gt;
  
  
  Red Vs Blue Pill — Making the right decision on the choice of BI Server
&lt;/h3&gt;

&lt;p&gt;This universe is governed by a cosmic duality, sets of two opposing and complementing principles or cosmic energies that can be observed in nature. There’s Yin and there’s Yang…&lt;/p&gt;

&lt;p&gt;There’s relational and there are non-relational databases…&lt;/p&gt;

&lt;p&gt;And of course, the visualization tools; There’s &lt;strong&gt;Tableau Server&lt;/strong&gt; and there’s &lt;strong&gt;Power BI Embedded&lt;/strong&gt;.!!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--P9owduXS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/400/1%2Al9-vhqfT73n9MFiBxQKPUA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P9owduXS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/400/1%2Al9-vhqfT73n9MFiBxQKPUA.png" alt=""&gt;&lt;/a&gt;The Yin &amp;amp; Yang of the present BI universe&lt;/p&gt;

&lt;p&gt;There’s no such thing as a perfect BI server that solves everyone’s (the admin’s, the data scientist’s, the analyst’s, the end user’s) problems. Each of these tools outshines the other in several aspects, and we are about to settle the score once and for all with a one-on-one comparison between Tableau Server and Power BI Embedded.&lt;/p&gt;

&lt;p&gt;By the time you are done reading this article, you should be able to answer the following question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Should you choose Tableau Server or Power BI Embedded?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZvbgwjP7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/498/1%2ACswfnXP7zABBwuSkf0fhpg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZvbgwjP7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/498/1%2ACswfnXP7zABBwuSkf0fhpg.gif" alt=""&gt;&lt;/a&gt;After this, there is no turning back.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choice of the Cloud &amp;amp; Databases:
&lt;/h3&gt;

&lt;p&gt;Are you using Microsoft SQL Server, or are your servers hosted on Microsoft Azure? Well, why not use Power BI? Yes, you heard me right: you’ve got a good edge over Tableau. Check whether the data source you are using is developed or maintained by Microsoft or, even better, check from &lt;a href="https://docs.microsoft.com/en-us/power-bi/power-bi-data-sources"&gt;this link&lt;/a&gt; whether your database can connect without requiring an on-premises data gateway.&lt;/p&gt;

&lt;p&gt;Of course, this is not the case with Tableau; with a wide variety of connectors and drivers to establish connectivity, Tableau is much friendlier in terms of database support. Also, with the support for Tableau Server packages on Linux, Tableau servers have become much more efficient at managing resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e26j8Ffr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/412/0%2AfRrTPHAEMQZYRzIx.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e26j8Ffr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/412/0%2AfRrTPHAEMQZYRzIx.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Extracts &amp;amp; Refreshes:
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How huge is your data? And are you using extracts or live connections?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of course, you may follow all the good practices; if your extract still happens to be huge, Tableau seems to be the better choice. The ability to devise a multi-node architecture for Tableau Server makes it a better fit for improving the performance of data extracts and refreshes.&lt;br&gt;&lt;br&gt;
 In the case of Power BI, the rate of refreshes depends on the size and configuration of the data gateway instance deployed in the same subnet as the databases (for non-Microsoft databases).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bzDyOaHh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/480/0%2AvvvpkOXbDVHxxHNJ.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bzDyOaHh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/480/0%2AvvvpkOXbDVHxxHNJ.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualization Renders:
&lt;/h3&gt;

&lt;p&gt;Visualization is the core of the BI exercise, and if those dashboards are not loading on time, that’s gonna hit you real bad. Rendering is entirely dependent on the hardware, especially the memory. In Tableau, you get to control your hardware, whether it’s a cluster setup or a single-node setup. This is not the case with Power BI, where you need to purchase a premium dedicated capacity for the rendering to happen; the lowest configuration, A1, has a capacity of about 300 renders per hour at a price of approximately $1 per hour.&lt;br&gt;&lt;br&gt;
Tableau is the winner here since you are not limited by the number of renders happening.&lt;/p&gt;

&lt;p&gt;Well, it doesn’t stop there: when your audience grows and you go up the ladder from the A1 to the A6 configuration, you will find yourself shelling out a huge amount of money (which could go all the way up to 36 times the cost of A1).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bmkWPgyU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/540/0%2AKzzsVDHqMbidfGk8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bmkWPgyU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/540/0%2AKzzsVDHqMbidfGk8.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Scale:
&lt;/h3&gt;

&lt;p&gt;How many viewers are about to look at your dashboards? Is it going to be the dashboard with your visualizations on it, or just a PDF rendering of your dashes? For client-side renderings, Tableau offers a cluster setup that can be configured with multiple nodes, each of which can be assigned specific processes, with dozens of other tweakable options in hand; scalability is at the Tableau admins’ disposal.&lt;/p&gt;

&lt;p&gt;Does that mean Power BI is up to no good? It does have a dedicated capacity that can be sized based on the peak-hour refresh rates and the number of renders happening in an hour.&lt;/p&gt;

&lt;p&gt;We don’t have any winners for this round since both the services incur similar costs when it comes to the licensing aspect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sEJtZ-IQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/480/0%2AukWitRBPwyxClReZ.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sEJtZ-IQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/480/0%2AukWitRBPwyxClReZ.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Licenses &amp;amp; Pricing:
&lt;/h3&gt;

&lt;p&gt;By this time, you might have figured out the right BI tool for the job. And that’s the reason why pricing comes at the end.&lt;/p&gt;

&lt;p&gt;This topic by itself would require a separate post to cover as a whole; we will try to cover it as crisply as we can in this section.&lt;/p&gt;

&lt;p&gt;Let’s look at a hypothetical scenario, shall we? Consider an organization with about 200 data scientists, and let’s estimate the pricing for both Tableau Server &amp;amp; Power BI Embedded.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u8I39fmx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/500/0%2A0AAY4I5VVbF2h5NT.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u8I39fmx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/500/0%2A0AAY4I5VVbF2h5NT.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tableau Server — Requisite &amp;amp; Cost:&lt;/strong&gt;   &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;License for the users to use Tableau Desktop &amp;amp; publish to Tableau Server [70$ per user per month].
&lt;/li&gt;
&lt;li&gt;A Tableau Server hosted on a compute instance with at least 8 physical cores and 32 gigabytes of RAM (if a cluster setup is required for better performance, add two more instances with at least 4 cores and 16 gigabytes of RAM each)
[Single node — 0.70$ per hour; Cluster 0.70$ + 0.38$ + 0.38$ = 1.46$ per hour].
&lt;/li&gt;
&lt;li&gt;A Server creator license with the required number of creators [70$ per user per month].&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Cost incurred: (200*70)+(1.46*730)+(200*70) = &lt;strong&gt;&lt;em&gt;29065.8$ /month&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
(For a cluster Setup)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Power BI Embedded — Requisite &amp;amp; Cost:&lt;/strong&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pro License for the individual users [9.99$ per user per month].
&lt;/li&gt;
&lt;li&gt;A Power BI Embedded dedicated capacity for performing the refreshes on a custom workspace — let’s consider A4 for the sake of the discussion [$8.06 per hour].
&lt;/li&gt;
&lt;li&gt;A data gateway inside the same subnet as that of the database (recommended configuration of 4 cores and 16 Gigabytes of RAM) [0.38$ per hour].&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Cost incurred: (200*9.99)+(8.06*730)+(730*0.38) = &lt;strong&gt;&lt;em&gt;8159.2$ per month&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
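&lt;p&gt;As a quick sanity check, the two monthly estimates above can be reproduced with a few lines of Python (using the article’s figure of 730 hours per month and the rates as quoted):&lt;/p&gt;

```python
HOURS_PER_MONTH = 730
users = 200

# Tableau Server (cluster): two per-user license line items plus three compute nodes.
tableau = (users * 70) + (1.46 * HOURS_PER_MONTH) + (users * 70)

# Power BI Embedded: Pro licenses, an A4 capacity, and a data gateway VM.
power_bi = (users * 9.99) + (8.06 * HOURS_PER_MONTH) + (0.38 * HOURS_PER_MONTH)

print(round(tableau, 1), round(power_bi, 1))  # 29065.8 8159.2
```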

&lt;p&gt;Now that we have the facts right, it’s time to pick our winner!! Except there isn’t one… Both are good contenders, and it all comes down to your requirements &amp;amp; usage.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Let’s build Visualisations; not wage wars!!&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.datawrangler.in/2020/03/red-vs-blue-pill-making-right-decision.html"&gt;&lt;em&gt;https://www.datawrangler.in&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on March 9, 2020.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>powerbiembedded</category>
      <category>tableauserver</category>
      <category>tableau</category>
      <category>comparison</category>
    </item>
    <item>
      <title>Postgres: Optimization &amp; Beyond</title>
      <dc:creator>Sathyasarathi</dc:creator>
      <pubDate>Sat, 10 Aug 2019 14:37:00 +0000</pubDate>
      <link>https://dev.to/sathyasarathi90/postgres-optimization-beyond-5am9</link>
      <guid>https://dev.to/sathyasarathi90/postgres-optimization-beyond-5am9</guid>
      <description>&lt;p&gt;Postgres, one of the widely used Relational Database Management System; has been widely adopted due to its ability to handle different workloads such as web services, warehouses, etc.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Fun Fact:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;The name Postgres comes from its predecessor, UC Berkeley’s Ingres database (INteractive GRaphics REtrieval System); it is Post-INGRES.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are times when the performance is straightforward, and other times when the expected performance is not met and the database requires some tweaking in the form of structural modifications to the tables, query tuning, configuration improvements, etc.&lt;/p&gt;

&lt;p&gt;This article will provide some useful pointers and action plans to become a power-user in optimizing Postgres.&lt;/p&gt;

&lt;p&gt;What to do when a query is slow?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zNRpr4eO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/500/0%2Audxpi-nT0n98jkcu.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zNRpr4eO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/500/0%2Audxpi-nT0n98jkcu.gif" alt=""&gt;&lt;/a&gt;What to do when a query is slow?&lt;/p&gt;

&lt;p&gt;In most cases, a slow query is due to the absence of indexes on the fields used in the where clause of the query.&lt;/p&gt;

&lt;p&gt;That should have solved the problem, right? RIGHT?&lt;/p&gt;

&lt;p&gt;You:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f83H7hDE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/500/0%2Ajjb-Ad2kwbCn-yAx.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f83H7hDE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/500/0%2Ajjb-Ad2kwbCn-yAx.gif" alt=""&gt;&lt;/a&gt;Not helping at all.!!&lt;/p&gt;

&lt;p&gt;I hear you; Life ain’t Fair, or &lt;em&gt;Is it?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not every index on the fields in the WHERE clause will be helpful; it all depends on the query plan prepared by the optimizer. Prepend &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; to the query and run it to see the query plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; Use &lt;a href="https://explain.depesz.com/"&gt;https://explain.depesz.com/&lt;/a&gt; to visualize and analyze your query plan. The color formatting gives a straightforward readout for debugging the reason for the slowness.&lt;/p&gt;

&lt;p&gt;The query plan itself can provide a whole lot of information about where the resources are overflowing. Given below, are a few of those keywords that you can find in the query plan and what they mean to you and the query performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sequential Scan:&lt;/strong&gt; Yes, you read that right. The scan occurs sequentially: the filter runs over the whole table and returns the rows that match the condition, which can be very expensive and exhaustive. In the case of a single page / small table, sequential scans are pretty fast.&lt;/p&gt;

&lt;p&gt;But for larger tables; In order to speed up the query, the sequential scan needs to be changed to an &lt;strong&gt;Index Scan&lt;/strong&gt;. This can be done by creating indexes on the &lt;strong&gt;columns that are present in the where clause&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Index Scans / Index Only Scans:&lt;/strong&gt; Index Scans denote that the indexes are being properly used. Just make sure that the analyzing &amp;amp; vacuuming happens once in a while. This keeps all the dead tuples out of the way and allows the optimizer to choose the right index for the scan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bitmap Index Scan:&lt;/strong&gt; And this right here is the bummer. Bitmap Index Scans are accompanied by Bitmap Heap Scans on top. These scans occur mostly when one tries to retrieve multiple rows but not all, based on multiple logical conditions in the where clause.&lt;/p&gt;

&lt;p&gt;It basically creates a bitmap out of the pages of the table, based on the condition provided (hence the Bitmap Heap Scan on top). The query can be sped up by creating a &lt;strong&gt;composite index A.K.A multicolumn index&lt;/strong&gt;; which changes this scan to an Index Scan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caution:&lt;/strong&gt; A query can use a composite index only when it filters on a leftmost prefix of the indexed columns, so order the index columns to match how your queries filter (the textual order inside the where clause itself does not matter to the planner).&lt;/p&gt;
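&lt;p&gt;As a rough sketch of the idea (using SQLite from the Python standard library as a stand-in for Postgres, with made-up table, column, and index names; SQLite’s &lt;code&gt;EXPLAIN QUERY PLAN&lt;/code&gt; is only a loose analogue of Postgres’s &lt;code&gt;EXPLAIN&lt;/code&gt;), you can watch a composite index get picked up by the planner:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT, payload TEXT)")
conn.execute("CREATE INDEX idx_user_type ON events (user_id, event_type)")

# EXPLAIN QUERY PLAN shows how SQLite intends to execute the query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT payload FROM events WHERE user_id = ? AND event_type = ?",
    (1, "click"),
).fetchall()

detail = plan[0][-1]  # the human-readable plan line
print(detail)
```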

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---UVj3GId--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/500/1%2A9GzLaUAPbJKNGJ_2cBwMUg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---UVj3GId--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/500/1%2A9GzLaUAPbJKNGJ_2cBwMUg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Summarizing:
&lt;/h4&gt;

&lt;p&gt;Indexes are good; Unused Indexes are Bad;&lt;br&gt;&lt;br&gt;
Having Too many Indexes is OK, as long as they are being used at some point.&lt;/p&gt;

&lt;p&gt;More RAM for the DB is Good. VACUUM &amp;amp; ANALYZE of tables is too good!!!&lt;br&gt;&lt;br&gt;
ARCHIVAL of Old Data → Being a good citizen and you are awesome!!&lt;/p&gt;

&lt;p&gt;For optimal performance, the following settings (which require a restart of the server) need to be made in the postgresql.conf file present at &lt;code&gt;/etc/postgresql/10/main/postgresql.conf&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Allowed Values&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;shared_buffers &lt;/td&gt;
&lt;td&gt; 75% of RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;work_mem &lt;/td&gt;
&lt;td&gt; 25% of RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;maintenance_work_mem &lt;/td&gt;
&lt;td&gt;Min: 256MB; Max:512MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Consider a scenario where the Postgres server has 160 GB of RAM:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Allowed Values&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;shared_buffers&lt;/td&gt;
&lt;td&gt;120GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;work_mem&lt;/td&gt;
&lt;td&gt;40GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;maintenance_work_mem&lt;/td&gt;
&lt;td&gt;256MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;
  
  
  Steps to Optimize a query:
&lt;/h4&gt;

&lt;p&gt;1) Run &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on your query; if that takes too long, run &lt;code&gt;EXPLAIN&lt;/code&gt; instead.&lt;/p&gt;

&lt;p&gt;2) Copy the output and paste it onto the dialogue box @ &lt;a href="https://explain.depesz.com/"&gt;https://explain.depesz.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3) Check the Stats of your query:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Index Scans / Index Only Scans&lt;/strong&gt; are the best and &lt;strong&gt;no changes&lt;/strong&gt; need to be made.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sequential Scans,&lt;/strong&gt; can be &lt;strong&gt;converted into&lt;/strong&gt;  &lt;strong&gt;Index Scans&lt;/strong&gt; by &lt;strong&gt;creating the index for the particular column&lt;/strong&gt; in the where clause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bitmap Heap Scans,&lt;/strong&gt; can be &lt;strong&gt;converted into Index Scans&lt;/strong&gt; by &lt;strong&gt;creating composite indexes&lt;/strong&gt; A.K.A multicolumn indexes, whose &lt;strong&gt;leading columns match&lt;/strong&gt; the filters in the where clause, as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;indexName&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;tableName&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Field1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;Field2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note to Self: Index &amp;amp; Optimize.!!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qgcrDAtF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/448/1%2AXs4J9Eyfxo_fvZSRp5HHFg.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qgcrDAtF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/448/1%2AXs4J9Eyfxo_fvZSRp5HHFg.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.datawrangler.in/2019/08/postgres-optimization-beyond-level-up.html"&gt;&lt;em&gt;https://www.datawrangler.in&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on August 10, 2019.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>sql</category>
      <category>database</category>
      <category>index</category>
      <category>postgres</category>
    </item>
    <item>
      <title>CD-Stream:CDC Replicator Tool</title>
      <dc:creator>Sathyasarathi</dc:creator>
      <pubDate>Tue, 30 Oct 2018 21:12:00 +0000</pubDate>
      <link>https://dev.to/sathyasarathi90/cd-stream-cdc-replicator-tool-13hp</link>
      <guid>https://dev.to/sathyasarathi90/cd-stream-cdc-replicator-tool-13hp</guid>
      <description>&lt;h4&gt;
  
  
  ++ Cons on ETL pipelines
&lt;/h4&gt;

&lt;p&gt;Just another day at the workplace;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;5 minutes post the boot:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You hear everyone complain that the production database is slow. You quickly start to investigate; exploring all possible outcomes on the dashboards...&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Could it have been the long-running slow query which you had raised a ticket for the production support to fix? Or Is it one of the queries run based on a non-indexed column?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BCZFJnIB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/500/0%2A5b7aE4SKkbTm-jb8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BCZFJnIB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/500/0%2A5b7aE4SKkbTm-jb8.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;6th Minute and 15 minutes later:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, you hear the fellow data-analysts lament over their failed reports.&lt;/p&gt;

&lt;p&gt;You now realize that your CPU has taken on a humongous query load, and you understand that your relational database system has gone for a toss into an eternal slumber.&lt;/p&gt;

&lt;h3&gt;
  
  
  And all of this due to a slow-running query in your ETL pipeline..!! Ding. Ding... Ding…!! We have a winner!!!
&lt;/h3&gt;

&lt;p&gt;Alright, let’s back it up a little bit.&lt;/p&gt;

&lt;p&gt;Chances are, you did or used one of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SELECT * FROM production_database.table WHERE updated_at BETWEEN x AND y;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bulk exports and Dumps once in every few minutes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Long-running and forgotten Zombie Crontabs&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
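The first bullet is the classic offender. A minimal sketch of such a delta-export cron job, with purely illustrative database, table, host, and timestamp values:

```shell
# Hypothetical nightly delta-export job of the kind listed above;
# all names and timestamps are illustrative.
START="2018-10-29 00:00:00"
END="2018-10-30 00:00:00"
QUERY="SELECT * FROM production_database.table WHERE updated_at BETWEEN '${START}' AND '${END}'"
# Each run fires this wide SELECT at the production primary and ships
# the rows over the wire, e.g.:
#   mysql -h prod-primary -e "$QUERY"
echo "$QUERY"
```

Run from cron every few minutes, this is exactly the load pattern that buries the production CPU.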

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dH1CZBKE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/311/0%2ANWQ-pURqbOWtGPdr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dH1CZBKE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/311/0%2ANWQ-pURqbOWtGPdr.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s put it this way... If your DB size is under 500 GB, it’s OK to do selects; but once it exceeds 500 GB, unless budget is no concern, do not ever do a bulk select and transfer the rows over the wire to the destination database in the form of a pipeline.&lt;/p&gt;

&lt;h4&gt;
  
  
  Enter CDC:
&lt;/h4&gt;

&lt;p&gt;CDC, a.k.a. Change Data Capture, is there to assist this data wrangling exercise. If your data is growing rapidly and your BI/BA teams need access to that sweet-sweet DWH (data warehouse), THIS right here is the way to go.&lt;/p&gt;

&lt;h4&gt;
  
  
  You meant the replication? — I don’t wanna get stuck with my OLTP DB Engine:
&lt;/h4&gt;

&lt;p&gt;Well, you are not alone. For many reasons, data wranglers generally prefer a database engine different from that of the OLTP system: the main ones are query performance, the need for triggers, and the ease of re-running transformation jobs. But doing so requires setting up Airflow clusters and connectors to transform and load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nvyGd-au--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/500/0%2AS7iFWzeO5Ts9cYJ2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nvyGd-au--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/500/0%2AS7iFWzeO5Ts9cYJ2.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wouldn’t it be great if you could do all this without doing a bulk selection from the production database? Of course, you can...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CD-Stream&lt;/strong&gt; is a cross-database CDC driven replicator tool that currently supports replication between &lt;strong&gt;&lt;em&gt;MySQL and Postgres&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The tool runs queues that process the change events recorded in the binary logs of the source database and replicates them to a destination database of an entirely different engine.&lt;/p&gt;
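For MySQL sources, binlog-based CDC generally presumes row-level binary logging is switched on; a typical my.cnf excerpt might look like the sketch below (values are illustrative, check them against your own setup):

```ini
# [mysqld] settings commonly required by binlog-based CDC tools
[mysqld]
server-id        = 1
log_bin          = mysql-bin
binlog_format    = ROW
binlog_row_image = FULL
```

With statement-based logging, the binlog records SQL text rather than row images, which is not enough to reconstruct the changed rows downstream.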

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ctdh6Uir--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2AzO-ZxOZB5LpJtGsY.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ctdh6Uir--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/600/0%2AzO-ZxOZB5LpJtGsY.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you complete the setup described on the project page, &lt;a href="https://github.com/datawrangl3r/cd-stream"&gt;CD-Stream&lt;/a&gt;, there’s a directory called ‘sample’ in the project which contains some intensive DDL and data-insertion scripts for you to evaluate and exercise.&lt;/p&gt;

&lt;p&gt;And that, &lt;em&gt;People&lt;/em&gt;; is how you wrangle production data!!!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.datawrangler.in/2018/10/cd-streamcdc-replicator-tool-cons-on.html"&gt;&lt;em&gt;https://www.datawrangler.in&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on October 30, 2018.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>postgres</category>
      <category>python</category>
      <category>cdc</category>
      <category>etl</category>
    </item>
    <item>
      <title>ES Index — S3 Snapshot &amp; Restoration:</title>
      <dc:creator>Sathyasarathi</dc:creator>
      <pubDate>Fri, 15 Dec 2017 16:26:00 +0000</pubDate>
      <link>https://dev.to/sathyasarathi90/es-index-s3-snapshot-restoration-73p</link>
      <guid>https://dev.to/sathyasarathi90/es-index-s3-snapshot-restoration-73p</guid>
      <description>&lt;h3&gt;
  
  
  ES Index — S3 Snapshot &amp;amp; Restoration:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f3LgfzRm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/484/0%2AtB85jys_DWuTYI7o.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f3LgfzRm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/484/0%2AtB85jys_DWuTYI7o.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The question is... What brings you here? Fed up with all the searches on how to back up and restore specific indices?&lt;/p&gt;

&lt;p&gt;Fear not, for your search quest ends here!&lt;/p&gt;

&lt;p&gt;After going through dozens of tiny gists and manual pages, here it is... We’ve done all the heavy lifting for you.&lt;/p&gt;

&lt;p&gt;The following tutorial was tested on &lt;strong&gt;&lt;em&gt;elasticsearch V5.4.0&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And before we proceed, remember:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do’s:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Make sure that the &lt;strong&gt;&lt;em&gt;backed-up cluster’s elasticsearch version is &amp;lt;= (less than or equal to) the restoring cluster’s version&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dont’s:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unless it’s highly necessary;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;curl -XDELETE '&lt;a href="http://localhost:9200/nameOfTheIndex"&gt;http://localhost:9200/nameOfTheIndex&lt;/a&gt;'&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;#deletes a specific index&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Especially not, when you are drunk!:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;curl -XDELETE '&lt;a href="http://localhost:9200/_all"&gt;http://localhost:9200/_all&lt;/a&gt;'&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;#deletes all indexes&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;(This is where the drunk part comes in..!!)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GEJcxrRo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/500/0%2APaiwTUgOnjUipmrX.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GEJcxrRo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/500/0%2APaiwTUgOnjUipmrX.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step1: Install S3 plugin Support
&lt;/h3&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;bin/elasticsearch-plugin &lt;span class="nb"&gt;install &lt;/span&gt;repository-s3  
&lt;span class="c"&gt;# (or)  &lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; /usr/share/elasticsearch/bin/elasticsearch-plugin &lt;span class="nb"&gt;install &lt;/span&gt;repository-s3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Which command you use depends on where the elasticsearch-plugin executable is installed. The plugin enables the elasticsearch instance to communicate with AWS S3 buckets.&lt;/p&gt;
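As a quick sanity check (assuming elasticsearch listens on localhost:9200), the _cat API lists installed plugins; repository-s3 should appear once the node has been restarted after the install:

```shell
# List installed plugins; repository-s3 should show up in the output.
CAT_URL="http://localhost:9200/_cat/plugins?v"
curl -s "$CAT_URL"
```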

&lt;h3&gt;
  
  
  Step2: Input the Snapshot registration settings
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;METHOD&lt;/strong&gt; : PUT&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="http://localhost:9200/_snapshot/logs_backup?verify=false&amp;amp;pretty"&gt;http://localhost:9200/_snapshot/logs_backup?verify=false&amp;amp;pretty&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PAYLOAD:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="err"&gt;“type”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“s&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="err"&gt;“settings”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="err"&gt;“bucket”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“WWWWWW”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="err"&gt;“region”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“us-east&lt;/span&gt;&lt;span class="mi"&gt;-1&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="err"&gt;“access_key”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“XXXXXX”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="err"&gt;“secret_key”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“YYYYYY”&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In the URL:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;logs_backup&lt;/strong&gt;: Name of the snapshot file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In the payload JSON:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;bucket&lt;/strong&gt;: “WWWWWW” is where you enter the name of the bucket.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;access_key &amp;amp; secret_key&lt;/strong&gt;: The values “XXXXXX” and “YYYYYY” are where we key in the access key and secret key for the bucket, based on the &lt;em&gt;IAM&lt;/em&gt; policies. If you need any help finding them, here’s a link that should guide you through (&lt;a href="https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/"&gt;https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/&lt;/a&gt;).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;region&lt;/strong&gt;: region where the bucket is hosted (choose any from: &lt;a href="http://docs.aws.amazon.com/general/latest/gr/rande.html"&gt;http://docs.aws.amazon.com/general/latest/gr/rande.html&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This should give a response as ‘&lt;strong&gt;{“acknowledged”: “true”}&lt;/strong&gt;’.&lt;/p&gt;
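Put together as a single curl call, the registration from this step might look like the sketch below (bucket name and keys are the same placeholders as in the payload above):

```shell
# Register the S3 repository 'logs_backup'; placeholders as in the payload above.
PAYLOAD='{
  "type": "s3",
  "settings": {
    "bucket": "WWWWWW",
    "region": "us-east-1",
    "access_key": "XXXXXX",
    "secret_key": "YYYYYY"
  }
}'
curl -s -XPUT "http://localhost:9200/_snapshot/logs_backup?verify=false" -d "$PAYLOAD"
```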

&lt;h3&gt;
  
  
  Step3: Cloud-Sync — list all Snapshots
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="http://localhost:9200/_cat/snapshots/logs_backup?v"&gt;http://localhost:9200/_cat/snapshots/logs_backup?v&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In the URL&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;logs_backup&lt;/strong&gt;: Name of the snapshot file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Time to sync up the list of snapshots. If all our settings have been synced up just fine, we should end up with a list of indices close to the one shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4CtFgRRg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/811/0%2AF29ShkFj4tm_F5VC.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4CtFgRRg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/811/0%2AF29ShkFj4tm_F5VC.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step4: Creating a Snapshot
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;METHOD&lt;/strong&gt; : PUT&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="http://localhost:9200/_snapshot/logs_backup/type_of_the_backup?wait_for_completion=true"&gt;http://localhost:9200/_snapshot/logs_backup/type_of_the_backup?wait_for_completion=true&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PAYLOAD:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="err"&gt;“indices”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;“logstash&lt;/span&gt;&lt;span class="mf"&gt;-2017.11&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="err"&gt;“include_global_state”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="err"&gt;“compress”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="err"&gt;“encrypt”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In the URL&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;logs_backup&lt;/strong&gt;: Name of the snapshot file&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;type_of_the_backup&lt;/strong&gt;: Could be any string&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In the payload JSON&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;indices&lt;/strong&gt;: Corresponds to the index to be backed up to the S3 bucket. To back up multiple indices under a single restoration point, pass them as an array.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;include_global_state&lt;/strong&gt;: set to ‘false’ just to make sure there’s cross-version compatibility. &lt;strong&gt;&lt;em&gt;WARNING If set to ‘true’, the index can be restored only to the ES of the source version.&lt;/em&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;compress&lt;/strong&gt;: enables compression of the index meta files backed up to S3.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;encrypt&lt;/strong&gt;: Set this when extra encryption of the backed-up indices is necessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This should give a response as ‘&lt;strong&gt;{“acknowledged”: “true”}&lt;/strong&gt;’&lt;/p&gt;
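As a single curl call, the snapshot creation from this step might be sketched as follows (index and snapshot names taken from the example above):

```shell
# Snapshot the logstash-2017.11.21 index into the logs_backup repository.
PAYLOAD='{
  "indices": "logstash-2017.11.21",
  "include_global_state": false,
  "compress": true,
  "encrypt": true
}'
curl -s -XPUT "http://localhost:9200/_snapshot/logs_backup/type_of_the_backup?wait_for_completion=true" -d "$PAYLOAD"
```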

&lt;h3&gt;
  
  
  Step5: Restoring a Snapshot:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;METHOD&lt;/strong&gt; : POST&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;URL:&lt;/strong&gt; &lt;a href="http://localhost:9200/_snapshot/name_of_the_backup/index_to_be_restored/_restore"&gt;http://localhost:9200/_snapshot/name_of_the_backup/index_to_be_restored/_restore&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PAYLOAD:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="err"&gt;“ignore_unavailable”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="err"&gt;“include_global_state”:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In the URL:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;logs_backup&lt;/strong&gt;: Name of the snapshot file&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;index_to_be_restored&lt;/strong&gt;: Any of the snapshot ids listed in Step 3&lt;/p&gt;

&lt;p&gt;In the payload JSON:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ignore_unavailable&lt;/strong&gt;: It’s safe to set this to true, to avoid unwanted checks.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;include_global_state&lt;/strong&gt;: set to ‘false’ just to make sure there’s cross-version compatibility. &lt;strong&gt;WARNING&lt;/strong&gt;  &lt;strong&gt;If set to ‘true’, the index can be restored only to the ES of the source version.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This should give a response as ‘ &lt;strong&gt;{“acknowledged”: “true”}&lt;/strong&gt; ’&lt;/p&gt;
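As a single curl call, the restore might be sketched as below; the snapshot name is illustrative (use one listed in Step 3), and the _restore endpoint is invoked with POST:

```shell
# Restore a snapshot from the logs_backup repository; the snapshot name
# here is illustrative -- use one listed in Step 3.
PAYLOAD='{
  "ignore_unavailable": true,
  "include_global_state": false
}'
curl -s -XPOST "http://localhost:9200/_snapshot/logs_backup/index_to_be_restored/_restore" -d "$PAYLOAD"
```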

&lt;p&gt;Et voilà! The restoration is complete.&lt;/p&gt;

&lt;p&gt;And don’t forget to recycle the space corresponding to the index by safely deleting it — Reuse, Reduce &amp;amp; Recycle :)&lt;/p&gt;

&lt;p&gt;Happy Wrangling!!!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://www.datawrangler.in/2017/12/es-index-s3-snapshot-restoration.html"&gt;&lt;em&gt;https://www.datawrangler.in&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on December 15, 2017.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>snapshot</category>
      <category>restore</category>
      <category>backup</category>
      <category>index</category>
    </item>
  </channel>
</rss>
