<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sunny Srinidhi</title>
    <description>The latest articles on DEV Community by Sunny Srinidhi (@contactsunny).</description>
    <link>https://dev.to/contactsunny</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F28411%2F72a1b312-19ce-4038-9ebe-600deff7d4a2.jpg</url>
      <title>DEV Community: Sunny Srinidhi</title>
      <link>https://dev.to/contactsunny</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/contactsunny"/>
    <language>en</language>
    <item>
      <title>Streamline Data Transfer with AWS DataSync: A Comprehensive Guide</title>
      <dc:creator>Sunny Srinidhi</dc:creator>
      <pubDate>Sat, 09 Mar 2024 04:20:41 +0000</pubDate>
      <link>https://dev.to/contactsunny/streamline-data-transfer-with-aws-datasync-a-comprehensive-guide-4p88</link>
      <guid>https://dev.to/contactsunny/streamline-data-transfer-with-aws-datasync-a-comprehensive-guide-4p88</guid>
      <description>&lt;p&gt;In today’s data-driven world, the need for efficient and reliable data transfer solutions has never been greater. Enterprises across various industries are constantly seeking ways to streamline their data workflows, ensuring that critical information is seamlessly synchronized between on-premises and cloud environments. AWS DataSync emerges as a powerful tool in this landscape, offering robust capabilities to facilitate fast, secure, and automated data transfers.&lt;/p&gt;




&lt;h1&gt;
  
  
  What is AWS DataSync?
&lt;/h1&gt;

&lt;p&gt;AWS DataSync is a fully managed service designed to simplify and accelerate data transfer between on-premises storage systems and AWS storage services such as Amazon S3, Amazon EFS (Elastic File System), and Amazon FSx (including FSx for Windows File Server and FSx for Lustre). With DataSync, organizations can migrate large volumes of data, synchronize files between different storage systems, or replicate data for backup and disaster recovery purposes.&lt;/p&gt;




&lt;h1&gt;
  
  
  How Does AWS DataSync Work?
&lt;/h1&gt;

&lt;p&gt;At its core, AWS DataSync operates on a simple yet effective mechanism. It combines a DataSync agent, deployed alongside self-managed storage such as an on-premises NFS or SMB server (transfers between AWS storage services need no agent), with a centralized control plane managed by AWS. The agent moves data over an optimized, purpose-built transfer protocol, ensuring minimal latency and maximum throughput.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7163gwlj1ww81gzkfd2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7163gwlj1ww81gzkfd2f.png" alt="Image description" width="800" height="472"&gt;&lt;/a&gt;&lt;br&gt;
Figure 1: AWS DataSync Architecture&lt;/p&gt;

&lt;p&gt;DataSync employs incremental data transfer techniques, only transferring the changes made to files since the last synchronization. This ensures efficiency by minimizing the amount of data transmitted during subsequent sync operations. Additionally, the service offers built-in data integrity checks and encryption, safeguarding data during transit and at rest.&lt;/p&gt;
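&lt;p&gt;The incremental behavior can be sketched in a few lines of Python. This is only an illustration of change detection in general (keep a digest per file, and re-transfer only files whose digest changed), not DataSync&#8217;s actual implementation:&lt;/p&gt;

```python
import hashlib
import os

def file_digest(path):
    """Return a SHA-256 digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def changed_files(source_dir, known_digests):
    """Return only the files whose contents changed since the last sync."""
    changed = {}
    for name in os.listdir(source_dir):
        digest = file_digest(os.path.join(source_dir, name))
        if known_digests.get(name) != digest:
            changed[name] = digest
    return changed
```

&lt;p&gt;On the first run every file is reported; subsequent runs report only files whose contents changed, which is what keeps repeated syncs cheap.&lt;/p&gt;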




&lt;h1&gt;
  
  
  Use Cases and Examples
&lt;/h1&gt;

&lt;h3&gt;
  
  
  1. Data Migration to the Cloud
&lt;/h3&gt;

&lt;p&gt;Imagine a multinational corporation seeking to migrate terabytes of data from its on-premises data center to Amazon S3 for better scalability and cost-effectiveness. By utilizing AWS DataSync, the organization can orchestrate the migration process seamlessly. DataSync enables parallel transfers, allowing multiple files to be transferred concurrently, thereby minimizing migration time.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Hybrid Cloud Workflows
&lt;/h3&gt;

&lt;p&gt;In a hybrid cloud environment, where organizations maintain a combination of on-premises infrastructure and cloud resources, DataSync plays a pivotal role in synchronizing data between these disparate environments. For instance, a media production company may use on-premises storage for editing and processing large video files, while leveraging Amazon S3 for archival and distribution. DataSync ensures that files are synchronized between these environments in near real-time, enabling seamless collaboration and resource optimization.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Disaster Recovery and Backup
&lt;/h3&gt;

&lt;p&gt;Business continuity and disaster recovery are critical concerns for enterprises of all sizes. AWS DataSync provides an efficient solution for replicating data to AWS storage services, ensuring that organizations have up-to-date backups in case of unforeseen events. By configuring DataSync to regularly synchronize data between on-premises storage and Amazon S3, businesses can minimize data loss and expedite recovery processes during emergencies.&lt;/p&gt;




&lt;h1&gt;
  
  
  Getting Started with AWS DataSync
&lt;/h1&gt;

&lt;p&gt;Setting up AWS DataSync is straightforward and can be accomplished in a few simple steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Create a DataSync Agent (if needed): Deploy a DataSync agent alongside any self-managed storage involved in the transfer, such as an on-premises NFS or SMB server. The agent reads from or writes to that storage on DataSync's behalf; transfers between AWS storage services don't require one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define a Task: Configure a DataSync task to specify the source location, destination location, and any additional parameters such as transfer options and scheduling preferences.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitor and Manage Tasks: Once the task is activated, monitor its progress using the AWS Management Console or command-line interface. DataSync provides detailed metrics and logs to track transfer performance and identify any issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optimize Performance: Fine-tune DataSync settings to optimize performance based on your specific requirements. Adjust parameters such as concurrency, bandwidth throttling, and data validation options to achieve optimal transfer speeds and reliability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
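&lt;p&gt;The same steps can also be driven through the AWS SDK instead of the console. The sketch below uses boto3&#8217;s DataSync client to wire up an S3-to-S3 task; the bucket ARNs, role ARN, and task name are placeholders, and an on-premises source would additionally need an agent and, say, an NFS location:&lt;/p&gt;

```python
# Assumes boto3 is available: client = boto3.client("datasync")
def start_s3_to_s3_sync(client, source_bucket_arn, dest_bucket_arn, role_arn, name):
    """Create two S3 locations, a task between them, and start one execution."""
    # Register the source and destination buckets as DataSync locations.
    src = client.create_location_s3(
        S3BucketArn=source_bucket_arn,
        S3Config={"BucketAccessRoleArn": role_arn},
    )
    dst = client.create_location_s3(
        S3BucketArn=dest_bucket_arn,
        S3Config={"BucketAccessRoleArn": role_arn},
    )
    # A task ties a source location to a destination location.
    task = client.create_task(
        SourceLocationArn=src["LocationArn"],
        DestinationLocationArn=dst["LocationArn"],
        Name=name,
        # Verify data integrity once the transfer completes.
        Options={"VerifyMode": "POINT_IN_TIME_CONSISTENT"},
    )
    # Kick off a single run of the task.
    run = client.start_task_execution(TaskArn=task["TaskArn"])
    return run["TaskExecutionArn"]
```

&lt;p&gt;Passing the client in as a parameter keeps the function easy to exercise without touching a real AWS account.&lt;/p&gt;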




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;AWS DataSync offers a robust solution for simplifying and accelerating data transfer workflows in hybrid cloud environments. Whether it’s migrating data to the cloud, synchronizing files across distributed storage systems, or ensuring data resilience through backups, DataSync provides the tools and capabilities to meet diverse business needs. By leveraging its intuitive interface, scalable architecture, and seamless integration with AWS services, organizations can streamline their data operations and unlock new opportunities for innovation and growth.&lt;/p&gt;

&lt;p&gt;In a data-centric world where agility and efficiency are paramount, AWS DataSync empowers enterprises to stay ahead of the curve and harness the full potential of their data assets.&lt;/p&gt;




&lt;p&gt;And if you like what you see here, or on my personal blog, my Medium blog, or my Dev.to blog, and would like to see more such helpful technical posts in the future, consider supporting me on Patreon and GitHub.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding the Battle of Database Storage: Row-Oriented vs. Columnar</title>
      <dc:creator>Sunny Srinidhi</dc:creator>
      <pubDate>Fri, 08 Mar 2024 15:00:02 +0000</pubDate>
      <link>https://dev.to/contactsunny/understanding-the-battle-of-database-storage-row-oriented-vs-columnar-ggp</link>
      <guid>https://dev.to/contactsunny/understanding-the-battle-of-database-storage-row-oriented-vs-columnar-ggp</guid>
      <description>&lt;p&gt;In the vast landscape of databases, where data reigns supreme, lies a critical decision point for architects and developers alike: row-oriented storage versus columnar storage. This choice can significantly impact performance, storage efficiency, and query speed, making it imperative to grasp the nuances of each approach. In this exploration, we delve into the depths of row-oriented and columnar storage, unraveling their intricacies and shedding light on which might be the best fit for your data-driven endeavours.&lt;/p&gt;




&lt;h1&gt;
  
  
  Row-Oriented Storage: The Traditional Approach
&lt;/h1&gt;

&lt;p&gt;Row-oriented storage is the conventional method of storing data in databases. In this model, data is organised and stored row by row, with each row containing all the fields (attributes or columns) for a particular record. This approach aligns well with transactional databases where operations primarily involve fetching entire rows.&lt;/p&gt;

&lt;p&gt;One of the primary advantages of row-oriented storage is its suitability for transactional processing. Since transactions typically involve working with complete records, retrieving entire rows becomes efficient. Additionally, row-oriented storage simplifies data updates and inserts, as the entire record is stored contiguously.&lt;/p&gt;

&lt;p&gt;However, row-oriented storage has its drawbacks, particularly in analytics and reporting scenarios. When querying for specific columns across multiple rows, row-oriented storage may exhibit performance bottlenecks. This is because retrieving data by column necessitates scanning through each row, resulting in increased I/O overhead and reduced query speed, especially for large datasets.&lt;/p&gt;




&lt;h1&gt;
  
  
  Columnar Storage: A Paradigm Shift in Data Warehousing
&lt;/h1&gt;

&lt;p&gt;Columnar storage, on the other hand, flips the traditional row-oriented approach on its head by storing data vertically, column by column. In this model, each column is stored separately, with all values for a particular column grouped together. This design is well-suited for analytical workloads where queries typically involve aggregations, filtering, and analysing specific attributes across multiple records.&lt;/p&gt;

&lt;p&gt;The key advantage of columnar storage lies in its efficiency for analytics and reporting. Since columns are stored separately, queries that involve selecting specific attributes can be executed much faster compared to row-oriented storage. Columnar databases excel at handling complex analytical queries, such as those commonly found in data warehousing environments.&lt;/p&gt;
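&lt;p&gt;The difference between the two layouts is easy to see in miniature. In the illustrative Python sketch below, the same three orders are held once as a list of records and once as a dictionary of columns; summing a single attribute only has to touch one list in the columnar layout:&lt;/p&gt;

```python
# Row-oriented: each record is stored together, as in a transactional database.
rows = [
    {"order_id": 1, "product_id": "A", "quantity": 2},
    {"order_id": 2, "product_id": "B", "quantity": 1},
    {"order_id": 3, "product_id": "A", "quantity": 5},
]

# Columnar: each attribute is stored together, as in an analytical store.
columns = {
    "order_id": [1, 2, 3],
    "product_id": ["A", "B", "A"],
    "quantity": [2, 1, 5],
}

def total_quantity_rows(rows):
    # Touches every record in full just to read one attribute.
    return sum(r["quantity"] for r in rows)

def total_quantity_columns(columns):
    # Reads a single contiguous list; the other columns are never touched.
    return sum(columns["quantity"])
```

&lt;p&gt;Both functions return the same total, but the columnar version never reads the order IDs or product IDs at all, which is exactly the I/O saving analytical engines exploit at scale.&lt;/p&gt;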

&lt;p&gt;Moreover, columnar storage often boasts superior compression capabilities. Since columns tend to contain repeated values, compression algorithms can exploit this redundancy to achieve significant space savings. This not only reduces storage costs but also improves query performance by minimising disk I/O.&lt;/p&gt;
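&lt;p&gt;Run-length encoding is one simple scheme of this kind (real columnar engines combine several, such as dictionary and delta encoding). A sketch, with an illustrative country column:&lt;/p&gt;

```python
def run_length_encode(values):
    """Compress a column by collapsing runs of repeated values into (value, count) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            # Extend the current run.
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            # Start a new run.
            encoded.append((v, 1))
    return encoded

country = ["US", "US", "US", "US", "IN", "IN", "UK"]
```

&lt;p&gt;Seven stored values collapse to three pairs; the more repetitive the column, the larger the saving, which is why sorted, low-cardinality columns compress so well.&lt;/p&gt;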

&lt;p&gt;However, columnar storage may not be as efficient for transactional workloads. Due to its design, inserting or updating individual records entails modifying multiple columnar structures, which can introduce overhead, particularly in write-heavy environments. As a result, columnar databases are often used in conjunction with row-oriented databases, with each optimised for their respective workload types.&lt;/p&gt;




&lt;h1&gt;
  
  
  Choosing the Right Storage Model
&lt;/h1&gt;

&lt;p&gt;Selecting the appropriate storage model depends on various factors, including the nature of your data, the types of queries you intend to run, and the performance requirements of your application. Here are some considerations to help guide your decision:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Workload Type: Determine whether your application primarily handles transactional processing or analytical queries. For transactional workloads, row-oriented storage may be more suitable, whereas columnar storage shines in analytical scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Query Patterns: Analyse the types of queries your application frequently executes. If your queries involve selecting specific columns across a large number of records, columnar storage is likely the better choice. Conversely, if queries primarily retrieve entire records, row-oriented storage may suffice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance Requirements: Consider the performance characteristics required by your application. If query speed and scalability are paramount, columnar storage may offer superior performance for analytical workloads. However, if transactional throughput is critical, row-oriented storage might be more appropriate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Compression and Storage Efficiency: Evaluate the potential for data compression and storage savings offered by each storage model. Columnar storage often excels in this regard, particularly for datasets with high redundancy in column values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hybrid Approaches: Explore hybrid approaches that leverage both row-oriented and columnar storage based on workload requirements. This allows you to optimise performance and efficiency for different types of queries within the same database system.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Examples of Row-Oriented and Columnar Storage in Action
&lt;/h1&gt;

&lt;p&gt;Let's delve deeper into real-world scenarios where row-oriented and columnar storage shine, illustrating the practical implications of each approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 1: Online Retail Database
&lt;/h2&gt;

&lt;p&gt;Consider an online retail database that stores information about customer orders, including order details such as customer name, product ID, quantity, and purchase date.&lt;/p&gt;

&lt;p&gt;Row-Oriented Storage: In a row-oriented storage model, each order is stored as a single row, with all order details contained within that row. This layout is well-suited for transactional processing, as retrieving an entire order—such as when processing a purchase or updating customer information—can be done efficiently.&lt;/p&gt;

&lt;p&gt;Columnar Storage: In a columnar storage model, each attribute (e.g., customer name, product ID, quantity) is stored separately. This design excels when running analytical queries, such as calculating total sales by product or analysing customer purchasing patterns. By storing similar data types together, columnar storage enables faster query execution and better compression ratios, leading to improved performance and reduced storage costs for analytical workloads.&lt;/p&gt;
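&lt;p&gt;A query like &#8220;total sales by product&#8221; only ever needs the product ID and quantity columns. It can be sketched over plain Python lists standing in for two stored columns (the column names and values here are illustrative):&lt;/p&gt;

```python
from collections import defaultdict

# Two columns from the orders table; every other column can be skipped entirely.
product_id = ["A", "B", "A", "C", "B"]
quantity = [2, 1, 5, 3, 4]

def sales_by_product(product_id, quantity):
    """Aggregate quantities per product by walking the two columns in step."""
    totals = defaultdict(int)
    for pid, qty in zip(product_id, quantity):
        totals[pid] += qty
    return dict(totals)
```

&lt;p&gt;In a columnar store, this aggregation scans two compact arrays rather than every full order record.&lt;/p&gt;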

&lt;h2&gt;
  
  
  Example 2: Financial Data Analysis
&lt;/h2&gt;

&lt;p&gt;Imagine a financial institution analysing vast amounts of market data to identify trends, assess risk, and make informed investment decisions.&lt;/p&gt;

&lt;p&gt;Row-Oriented Storage: In a row-oriented storage model, each financial instrument's data (e.g., stock prices, trading volume, market indicators) is stored as a single row. This structure facilitates transactional processing, enabling quick updates to individual records and efficient retrieval of complete datasets for real-time trading.&lt;/p&gt;

&lt;p&gt;Columnar Storage: In a columnar storage model, financial data attributes are stored separately, with each column containing data for a specific metric (e.g., closing price, volume traded, price-to-earnings ratio). Analysing historical stock prices or performing complex calculations across multiple securities becomes more efficient with columnar storage, as it allows for parallel processing of columnar data and optimised query execution for analytical queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 3: Healthcare Records Management
&lt;/h2&gt;

&lt;p&gt;Consider a healthcare database managing patient records, including demographic information, medical history, and treatment details.&lt;/p&gt;

&lt;p&gt;Row-Oriented Storage: In a row-oriented storage model, each patient's data is stored as a single row, encompassing all relevant information. This format facilitates patient-centric operations, such as updating medical records or retrieving comprehensive patient profiles for clinical assessments.&lt;/p&gt;

&lt;p&gt;Columnar Storage: In a columnar storage model, healthcare data attributes (e.g., patient age, diagnosis code, treatment duration) are stored separately by column. Analysing population health trends, conducting epidemiological studies, or querying specific medical conditions across a large patient cohort becomes more efficient with columnar storage. The ability to selectively retrieve and process relevant columns enhances query performance and supports data-driven decision-making in healthcare.&lt;/p&gt;

&lt;h2&gt;
  
  
  Applying Storage Models to Real-World Scenarios
&lt;/h2&gt;

&lt;p&gt;In each of these examples, the choice between row-oriented and columnar storage depends on the specific requirements and use cases of the application. By understanding the strengths and trade-offs of each storage model, organisations can make informed decisions when designing database systems to meet their performance, scalability, and analytical needs. Whether optimising for transactional processing, analytical querying, or storage efficiency, selecting the right storage model is paramount in harnessing the full potential of data-driven applications.&lt;/p&gt;




&lt;h1&gt;
  
  
  Visual Representation of Data in Each Storage Type
&lt;/h1&gt;

&lt;p&gt;Let's illustrate row-oriented and columnar storage using the same dataset representing sales transactions in an online retail store. This dataset contains information about customer orders, including order ID, customer name, product ID, quantity, and purchase date.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mrylk5csc1hu8uk36p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mrylk5csc1hu8uk36p7.png" alt="Image description" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Row-Oriented Storage
&lt;/h2&gt;

&lt;p&gt;In row-oriented storage, each order is stored as a single row, encompassing all order details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mrylk5csc1hu8uk36p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5mrylk5csc1hu8uk36p7.png" alt="Image description" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Columnar Storage
&lt;/h2&gt;

&lt;p&gt;In columnar storage, each attribute (e.g., Order ID, Customer Name, Product ID) is stored separately.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0chvjbi786lb43o2idbc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0chvjbi786lb43o2idbc.png" alt="Image description" width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Comparison
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Row-Oriented Storage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Efficient for transactional operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simplicity in updating and retrieving entire records.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Disadvantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Querying specific attributes across multiple records may result in performance bottlenecks, especially for large datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Columnar Storage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Superior performance for analytical queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Efficient compression and storage savings due to redundancy in column values.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Disadvantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Not as efficient for transactional processing, particularly in write-heavy environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In the perpetual tug-of-war between row-oriented and columnar storage, there is no one-size-fits-all solution. Each storage model comes with its own set of advantages and trade-offs, making it essential to carefully assess your requirements and choose accordingly. Whether you prioritise transactional throughput, analytical performance, or storage efficiency, understanding the nuances of row-oriented and columnar storage is crucial for building robust and scalable database systems. By leveraging the strengths of each approach, you can unlock the full potential of your data and propel your applications to new heights of efficiency and performance.&lt;/p&gt;

</description>
      <category>database</category>
      <category>bigdata</category>
      <category>storage</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Cleaning And Normalizing Data Using AWS Glue DataBrew</title>
      <dc:creator>Sunny Srinidhi</dc:creator>
      <pubDate>Tue, 18 Jan 2022 04:04:46 +0000</pubDate>
      <link>https://dev.to/aws-builders/cleaning-and-normalizing-data-using-aws-glue-databrew-1e10</link>
      <guid>https://dev.to/aws-builders/cleaning-and-normalizing-data-using-aws-glue-databrew-1e10</guid>
      <description>&lt;p&gt;A major part of any data pipeline is the cleaning of data. Depending on the project, cleaning data could mean a lot of things. But in most cases, it means normalizing data and bringing data into a format that is accepted within the project. For example, it could be extracting date and time components from a timestamp column into multiple columns, converting the case of a string column, or &lt;a href="https://blog.contactsunny.com/data-science/label-encoder-vs-one-hot-encoder-in-machine-learning" rel="noopener noreferrer"&gt;label encoding&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We write code for this, and then design a pipeline to automate it using tools such as &lt;a href="https://blog.contactsunny.com/data-science/getting-started-with-apache-airflow" rel="noopener noreferrer"&gt;Apache Airflow&lt;/a&gt;. We have been doing this for years now. But as we move more and more of our processing and pipelines to the cloud, we can clean data using tools provided by our cloud infrastructure providers. And usually, such tools don’t require any coding. Being a developer myself, I can see how this can seem threatening to my job. But I don’t want to get into that argument, because I think that’s moot.&lt;/p&gt;

&lt;p&gt;Anyway, in this post, we’ll take a look at AWS Glue DataBrew and how we can clean our datasets using this. For this &lt;a href="https://blog.contactsunny.com/proof-of-concepts-pocs" rel="noopener noreferrer"&gt;POC&lt;/a&gt;, I’m using S3 to both store the input and the processed output (mostly because that’s the easiest option I see) and using an NYC OpenData dataset. To be more specific, I’m using the &lt;a href="https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95" rel="noopener noreferrer"&gt;Motor Vehicle Collisions – Crashes&lt;/a&gt; dataset.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is AWS Glue DataBrew?
&lt;/h2&gt;

&lt;p&gt;We already know that AWS Glue is a tool for designing extract, transform, and load (ETL) pipelines. It provides a lot of features for creating and running ETL jobs. DataBrew takes this one step further by also providing features to clean and transform data, readying it for further processing or for feeding machine learning models.&lt;/p&gt;

&lt;p&gt;DataBrew provides over 250 built-in transformations to get started with. These include filtering data, converting data into standard formats, fixing data-quality issues, extracting values from columns using regular expressions, and much more. The best part is that these transformations come ready-made in DataBrew, which means we don’t need to write code for any of them.&lt;/p&gt;

&lt;p&gt;The advantage of that is data analysts can design the transformation pipelines themselves without having to wait for developers to code the transformations. You can apply these transformations at any stage of the pipeline, especially if you already have a Glue ETL job.&lt;/p&gt;

&lt;p&gt;Another advantage here is that all of this is serverless. This means you only pay for what you use, and you don’t pay for anything when the pipeline isn’t running. Also, it can scale automatically depending on the amount of data coming in.&lt;/p&gt;

&lt;p&gt;So, let’s get started with DataBrew now. We’re first going to create a new project and a recipe. Then we’ll download the sample dataset and upload it to DataBrew. Then, we’ll apply a few transformations and create a job to run the transformations on the entire dataset. Finally, we’ll see what the transformed data looks like.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started With AWS Glue DataBrew
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Creating a project
&lt;/h3&gt;

&lt;p&gt;First, head over to the &lt;a href="https://console.aws.amazon.com/databrew/home?region=us-east-1#create-project" rel="noopener noreferrer"&gt;AWS Glue DataBrew console&lt;/a&gt; and create a new project. Next, provide a project name and a recipe name, as you can see from the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vofvkuatywshgplihw3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vofvkuatywshgplihw3.png" alt="create project" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this POC, we can leave all the configurations to the defaults. After providing a name for the project and the recipe, scroll down to the Select a dataset section to upload the sample CSV dataset. If you haven’t yet downloaded the sample dataset, you can get it from the NYC OpenData website &lt;a href="https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the Select a dataset section, select the New dataset option to the right of the screen. Provide a name to the dataset. Then select the File upload option in the Connect to a new dataset option. You can see this configuration in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70dcyrxlais4cc8pyhea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70dcyrxlais4cc8pyhea.png" alt="create dataset" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the Enter S3 destination option, select an S3 bucket for hosting the source data and the transformed data. If you don’t have a bucket created for this, you can create one from here as well.&lt;/p&gt;

&lt;p&gt;Finally, scroll down to the bottom of the page and click the Create project button to create the project. This will take some time, because the file has to be uploaded to the S3 bucket first, a bunch of resources have to be provisioned, and the file has to be previewed. During this time, you’ll see a progress indicator like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbb0e7ysaa86qupuzgu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbbb0e7ysaa86qupuzgu2.png" alt="preparing data" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Applying transformations
&lt;/h3&gt;

&lt;p&gt;Once the data is prepared, you will see it presented in a tabular form with a bunch of options at the top of the table. Here, we can select individual rows, select transformations, preview the results, and then apply the transformations. We can also manipulate the columns, for example by duplicating or splitting them. These count as transformations too, but I wanted to call them out explicitly because they are very commonly applied.&lt;/p&gt;

&lt;p&gt;We’ll start applying transformations with the first column, which is the CRASH DATE column. In most analytics use cases, we want to aggregate something over years. We usually have timestamp columns from which we extract the year component in queries. But queries become much simpler and more performant if we have a dedicated column for the year value. So we’ll now extract the year from the date column and save it as a new column called CRASH YEAR.&lt;/p&gt;

&lt;p&gt;For this, select the column, click the three dots at the top-right corner of the column, select the Extract menu, and then select the Date-time values option. This will open up a new configuration pane to the right side of the page, which is as shown in the screenshot below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjs87mh634oqfvwbboh2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjs87mh634oqfvwbboh2.png" alt="crash year" width="385" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, the Create column options, the Source column, and the Extract option sections should be already filled. In the Date-time unit to extract section, select Year from the dropdown menu, and enter CRASH YEAR as the Destination column field value. This will create a new column with the name CRASH YEAR with the year component extracted from the date column. Click the Preview button at the bottom to preview the changes. This should add a new column to the dataset, as shown in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbgw81e14bb1v5iqawe9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbgw81e14bb1v5iqawe9.png" alt="year preview" width="605" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click the Apply button to apply the transformation.&lt;/p&gt;
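&lt;p&gt;As an aside, the same year-extraction step can be sketched outside DataBrew with plain pandas. This is a hypothetical equivalent, not what DataBrew runs internally; the sample rows are invented, and only the column names follow the tutorial’s dataset.&lt;/p&gt;

```python
# Hypothetical pandas equivalent of DataBrew's "extract Date-time values" step:
# derive a CRASH YEAR column from CRASH DATE. The sample rows are made up.
import pandas as pd

df = pd.DataFrame({"CRASH DATE": ["2021-04-14", "2019-12-02"]})
df["CRASH DATE"] = pd.to_datetime(df["CRASH DATE"])

# Extract just the year component into a dedicated column.
df["CRASH YEAR"] = df["CRASH DATE"].dt.year
print(df["CRASH YEAR"].tolist())  # [2021, 2019]
```
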

&lt;p&gt;Next, we’ll transform the CRASH TIME column to format it in the HH:MM format. It is already in a standard format, but for the sake of experimentation, we’ll do this anyway. So let’s repeat the same process again. Click the three dots at the top-right corner of the column, select the Format menu, and then click the Date-time formats menu. Then scroll down to the end of the sub-menu and select More options. This will open up a configuration pane similar to the one we saw while extracting the year component earlier. Use the following screenshot as a reference to configure this transformation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkq1pt59n9dp9hc9guhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkq1pt59n9dp9hc9guhh.png" alt="time config" width="387" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click the Preview changes option to see the difference after applying the transformation. It should be similar to the screenshot below. And finally, click the Apply button to apply the transformation. It is important to note that these changes will not be applied immediately after hitting the Apply button. The Apply button only saves the transformation so that we can go back and change it anytime we want. The transformations will change the data only after we run the job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rapzzbyxkiv1zsvz5qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rapzzbyxkiv1zsvz5qr.png" alt="time preview" width="608" height="572"&gt;&lt;/a&gt;&lt;/p&gt;
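&lt;p&gt;The HH:MM reformat above also has a simple pandas analogue. Again, this is a hypothetical sketch with made-up sample values, not DataBrew’s internal implementation.&lt;/p&gt;

```python
# Hypothetical pandas equivalent of reformatting CRASH TIME to HH:MM.
# The sample times are invented; only the idea mirrors the DataBrew step.
import pandas as pd

times = pd.Series(["9:05", "14:30"])
formatted = pd.to_datetime(times, format="%H:%M").dt.strftime("%H:%M")
print(formatted.tolist())  # ['09:05', '14:30']
```
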

&lt;p&gt;Now, we’ll apply the final transformation to another column. For machine learning use-cases, we need to eliminate as many text columns as we can. And in cases where we’re working with categorical data, we can map them to numbers to make sure the machine learning models don’t develop any biases. In our dataset, we have one such column – CONTRIBUTING FACTOR VEHICLE 1. This process of converting textual, categorical data into numbers is called Categorical Mapping. There’s one other way of doing this, called &lt;a href="https://blog.contactsunny.com/tech/label-encoder-vs-one-hot-encoder-in-machine-learning" rel="noopener noreferrer"&gt;One Hot Encoding&lt;/a&gt;.&lt;/p&gt;
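&lt;p&gt;To make the difference between the two options concrete, here’s a hypothetical pandas sketch of both. The factor values are invented; only the technique mirrors what DataBrew offers.&lt;/p&gt;

```python
# Hypothetical sketch of categorical mapping vs one-hot encoding in pandas.
import pandas as pd

factors = pd.Series(["Driver Inattention", "Unspecified", "Driver Inattention"])

# Categorical mapping: each distinct label becomes one integer code.
codes = factors.astype("category").cat.codes
print(codes.tolist())  # [0, 1, 0]

# One-hot encoding: one 0/1 column per distinct label, with no implied ordering.
one_hot = pd.get_dummies(factors)
print(sorted(one_hot.columns))  # ['Driver Inattention', 'Unspecified']
```
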

&lt;p&gt;Glue DataBrew provides both options. Click the three dots at the top-right corner of the column to open the context menu and scroll to the end; you’ll see both the Categorical mapping and One-hot encode column options. For this example, we’ll go with categorical mapping, so select that option to open the configuration panel. Match the configuration as shown in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6rvuba2vqdrky8j1m2s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6rvuba2vqdrky8j1m2s.png" alt="data mapping config" width="381" height="624"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click the Preview changes button to preview the changes, which should look something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrct73w9as5yfcnirraf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrct73w9as5yfcnirraf.png" alt="data mapping preview" width="602" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click the Apply button again to apply this transformation. We should see a new column added with the mapped values. Now, for an actual project, this transformation isn’t going to help much for machine learning, as the model may interpret the numerical values as some sort of ranking instead of categorical data. We would need one-hot encoding for that. But we’ll stick with this for this example.&lt;/p&gt;

&lt;p&gt;After this, click the Recipe button at the top right corner of the table. You should now see all the transformations that we applied to the dataset. It should look something similar to the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyo6y5ivtx0p4mpn0642r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyo6y5ivtx0p4mpn0642r.png" alt="recipes" width="784" height="824"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From here, we should be able to easily edit any of these recipes or remove them. We can even add a new transformation from this pane. This comes in very handy when we want to check all the transformations we’re applying to a dataset in one place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running the job
&lt;/h3&gt;

&lt;p&gt;We are now finally ready to run the job to apply all these transformations to the entire dataset. You should see the Run job button at the top right corner of the page. Click that button and take a break now. This will take a few minutes to complete.&lt;/p&gt;

&lt;p&gt;In the meantime, we can look at the data lineage to visualize and understand the complete pipeline. Click the LINEAGE button that’s to the right of the Run job button. Here, you’ll see the flow of data from the S3 bucket, through the transformation recipes, and back to the S3 bucket. It should look similar to the one shown in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d8x15wastqqlikgocyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d8x15wastqqlikgocyc.png" alt="data lineage" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that we have one dataset created from source files in the S3 bucket. From that dataset, we have one project that contains six recipes. These six recipes are currently being applied to the dataset using a job, and finally, the transformed data is saved back to the S3 bucket.&lt;/p&gt;




&lt;h2&gt;
  
  
  That’s it
&lt;/h2&gt;

&lt;p&gt;And that’s pretty much it. This is, and I repeat, not a typical production pipeline. Usually, the data is streaming into the system using tools such as Kinesis Data Streams. The transformations are applied on the fly with micro-batching and stored in a data warehouse. But if the use case is for batch processing, S3 could be the source.&lt;/p&gt;

&lt;p&gt;Nonetheless, this exercise should help you get started with DataBrew and explore the tool’s capabilities for advanced transformations. And if the tool doesn’t provide a transformation that you need, or if you need to add a custom transformation, you can always use &lt;a href="https://blog.contactsunny.com/?s=lambda" rel="noopener noreferrer"&gt;Lambda functions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;AWS has great tools for creating data pipelines with transformations, data lakes and warehouses, machine learning models, and more. You can check &lt;a href="https://blog.contactsunny.com/?s=aws" rel="noopener noreferrer"&gt;here&lt;/a&gt; for more resources, getting-started guides, and POCs on AWS services. And if you want me to explore a particular tool or service for you, do let me know in the comments.&lt;/p&gt;

&lt;p&gt;And if you like what you see here, or on my &lt;a href="https://medium.com/@contactsunny" rel="noopener noreferrer"&gt;Medium blog&lt;/a&gt; and &lt;a href="https://blog.contactsunny.com" rel="noopener noreferrer"&gt;personal blog&lt;/a&gt;, and would like to see more of such helpful technical posts in the future, consider supporting me on &lt;a href="https://www.patreon.com/contactsunny" rel="noopener noreferrer"&gt;Patreon&lt;/a&gt; and &lt;a href="https://github.com/sponsors/contactsunny" rel="noopener noreferrer"&gt;Github&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>bigdata</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Dunning-Kruger Effect In Tech</title>
      <dc:creator>Sunny Srinidhi</dc:creator>
      <pubDate>Sun, 28 Nov 2021 15:04:49 +0000</pubDate>
      <link>https://dev.to/contactsunny/the-dunning-kruger-effect-in-tech-5acp</link>
      <guid>https://dev.to/contactsunny/the-dunning-kruger-effect-in-tech-5acp</guid>
      <description>&lt;p&gt;This is not the kind of post I usually write on my blog. This is more of a psychology lecture than a how-to tech tutorial. But it’s not completely irrelevant as well, because I’m going to talk about my experience with the Dunning-Kruger effect in tech that I’ve seen over the last decade.&lt;/p&gt;

&lt;p&gt;I’ve always been interested in learning more about psychology and how the brain works. Because the field is so bloody interesting, and also, it helps me understand people, spot patterns, and make better decisions. But I never write about it because, decidedly, I’m not an expert in the field, and the Dunning-Kruger effect is about understanding that very fact. But first, let’s understand what it is.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dunning-Kruger Effect
&lt;/h2&gt;

&lt;p&gt;To put it simply, the Dunning-Kruger effect is one of the many types of cognitive bias in which people tend to believe they are smarter and more capable than they really are. That’s probably the simplest way the effect can be explained. And if you’ve spent any time in the tech industry, you don’t need me to tell you how common this is.&lt;/p&gt;

&lt;p&gt;In his book The Descent of Man, Charles Darwin wrote that “Ignorance more frequently begets confidence than does knowledge.” The Dunning-Kruger effect is more common than you might expect, and I’m very sure you have experienced this yourself. You might not want to believe that you are one of “those people,” but you most likely are. That’s because the Dunning-Kruger effect spares nobody.&lt;/p&gt;

&lt;p&gt;There have been many studies done in the social psychology space around this. But the most prominent and most discussed is the one behind the Dunning-Kruger effect, named after the researchers David Dunning and Justin Kruger. The study they conducted wasn’t really that big or wide by itself. They performed four experiments with around 100 participants (or fewer) in each of them.&lt;/p&gt;

&lt;p&gt;If you follow the Jimmy Kimmel show, you’d be familiar with the Lie Witness News segment, in which his crew go to popular locations and ask random people questions that are mostly made up, and see how they react. In one such episode, people were asked if the movie Godzilla was insensitive to the people who survived the giant lizard attack on Tokyo in 1954. You’d be surprised by how people reacted.&lt;/p&gt;

&lt;p&gt;The reason most people go with the flow is that they don’t want to appear dumb or clueless, especially in front of a camera. But in lying about it, they make themselves look dumber, especially in front of a camera. Do you see the irony? It’s natural to feel dumb when you admit that you don’t know something. But the reality is that when you admit you don’t know something, you’re opening up an opportunity for yourself to learn something new. But when you lie about it, and continue to do so, you’re not only making yourself appear dumb and ignorant, but you start developing the false confidence that you know stuff and can hold a conversation about it.&lt;/p&gt;

&lt;p&gt;It’s not only depriving you of knowledge, but it’s also putting you on the path to dumbness, ignorance, and arrogance. You might have heard this from a lot of people before: knowing the boundaries of your knowledge is the most important education you can have. If you are educated about what you know and what you don’t, and if you are able to differentiate between the two, consider yourself smarter than most people and smarter than what you think you are.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Is This Related To Tech?
&lt;/h2&gt;

&lt;p&gt;Enough of the psychology lecture, you say? I got the hint that you’re not really interested in that. So let’s see how this is related to our tech industry. You might have heard and read that you need to “fake it till you make it.” We all do that, and it’s necessary at times. I did it when I was fresh out of college and new in the tech industry. I was surrounded by some really smart people, the entrepreneurial kind. These people held patents to their names, and I was this kid who thought he knew what he was doing.&lt;/p&gt;

&lt;p&gt;So I did fake it, till I made it. But the problem is, some people don’t stop the faking even after making it. I see the Dunning-Kruger effect a lot in such people. It might not be right of me to talk about others, so let me tell you my story.&lt;/p&gt;

&lt;p&gt;Early in my career, actually when I was still in college, I learnt how to build websites using PHP, Java, and MySQL. This was a very common and popular stack (the LAMP stack) back in the day. And because I was able to write APIs for CRUD operations and design websites to perform a few UI tricks, I thought I had mastered software engineering. So in my CV, I started rating myself 7 or 8 out of 10 in the said technologies. Remember, I was fresh out of college without any industry experience.&lt;/p&gt;

&lt;p&gt;Now, 8 years later, if you ask me to rate myself on those same technologies, I would not rate myself more than 4 or 5. It doesn’t mean that I’m a worse software engineer (or developer) or that I don’t have the same confidence in myself anymore. But it means that now I know what I know and what I don’t know.&lt;/p&gt;

&lt;p&gt;I went through the same phase 5-6 years back when I switched from software engineering to data engineering (don’t get me started on those designations). I started with Apache Spark for a project, was able to do a lot of stuff with RDDs in Java, and thought again that that was it, I was an expert in big data. But as I started reading up on stuff and interviewing for data-related jobs, I again realised that I just don’t know what I don’t know.&lt;/p&gt;

&lt;p&gt;That’s something I now say to everybody I know: knowing what you know, and realising that there’s a lot more that you don’t know, is very important. If you don’t realise that, you are just making a fool of yourself, trust me.&lt;/p&gt;

&lt;p&gt;Once I started realising this, I started reading up on the subject, and that’s when I first came across the studies done on it. I thought, wait, I can do my own study on this. I don’t know if this is ethical or not, but whenever I interviewed people, I started asking one or two questions in the interview that I made up on the spot. These were of course related to the technology that the candidate was interviewing for. But I made sure that the answers to these questions never had any effect on the outcome of the interview. And I’m sure that I’ve been put through such questions myself in the interviews that I’ve given over the years.&lt;/p&gt;

&lt;p&gt;But in those interviews, I realised that the Dunning-Kruger effect is actually real. I was taken aback by the answers the interviewees would give to such questions. There were maybe just a handful of people who admitted that they didn’t know what I was talking about or that there’s nothing of the sort that I was describing.&lt;/p&gt;

&lt;p&gt;I have nothing against the idea of “fake it till you make it.” It is in fact necessary sometimes. But you need to be candid about it to yourself before the fake you consumes the real you. It can become destructive. And I’m not against new engineers who rate themselves high on their CVs, because they are just inexperienced and don’t know what they don’t know.&lt;/p&gt;

&lt;p&gt;But once you gain the experience and the knowledge, make sure you are cognizant of the fact that there’s a lot more to learn. Nobody, in my opinion, can be an expert in everything. You might be an expert in your field, but you can’t deny the fact that you don’t know it all.&lt;/p&gt;

&lt;p&gt;So, the Dunning-Kruger effect is very much real in the tech industry. And if this is news to you, maybe it’s time to evaluate yourself, just to make sure you’re not oblivious to the fact that you don’t know it all.&lt;/p&gt;




&lt;p&gt;And if you like what you see here, or on my Medium blog and personal blog, and would like to see more of such helpful technical posts in the future, consider supporting me on Patreon and Github.&lt;/p&gt;

</description>
      <category>thedunningkrugereffect</category>
      <category>technology</category>
      <category>psychology</category>
      <category>mindfullness</category>
    </item>
    <item>
      <title>Understanding Apache Hive LLAP</title>
      <dc:creator>Sunny Srinidhi</dc:creator>
      <pubDate>Fri, 19 Nov 2021 04:50:55 +0000</pubDate>
      <link>https://dev.to/contactsunny/understanding-apache-hive-llap-4l60</link>
      <guid>https://dev.to/contactsunny/understanding-apache-hive-llap-4l60</guid>
      <description>&lt;p&gt;Apache Hive is a complex system when you look at it, but once you go looking for more info, it’s more interesting than complex. There are multiple query engines available for Hive, and then there’s LLAP on top of the query engines to make real-time, interactive queries more workable. Live Long And Process, or LLAP as it’s more popularly known as, is an awesome concept and execution when you learn more about it.&lt;/p&gt;

&lt;p&gt;And that’s exactly what I did; I read a few documents and other posts about LLAP and got a better understanding of it. I understood the Tez query engine too, so expect a post about that soon. In this post, I’ll unpack whatever I learnt about LLAP and how it’s awesome if you use it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hive Query Engines
&lt;/h2&gt;

&lt;p&gt;There are three query engines for Hive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MapReduce (MR)&lt;/li&gt;
&lt;li&gt;Tez&lt;/li&gt;
&lt;li&gt;Spark&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  MapReduce (MR)
&lt;/h3&gt;

&lt;p&gt;MapReduce is the first query engine that shipped with Hive, and this is also the slowest of the bunch. When you submit a Hive query with MR as the query engine, every query gets converted into MapReduce jobs and gets submitted to YARN containers. YARN, or Yet Another Resource Negotiator, is common between MR and Tez query engines. But the problem with the MR query engine is that all queries need to be converted to MR jobs. The conversion itself takes time. So you can imagine how this query engine becomes slow with a lot of latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Tez
&lt;/h3&gt;

&lt;p&gt;To overcome this latency issue, Tez was introduced as a query engine with later versions of Hive. Tez uses Directed Acyclic Graphs (DAGs) to process queries instead of MapReduce jobs. This greatly reduces latency and improves query response times. With the latest version of Hive, even though the MR query engine is deprecated, it is still the default engine. But you get a deprecation warning whenever you enter the Hive shell, along with a recommendation to switch to either Tez or Spark as the query engine. And it is universally suggested to switch to Tez. We’ll see why in my next post.&lt;/p&gt;
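&lt;p&gt;Switching engines is a one-line session setting. The property below is a standard Hive configuration property; whether a given value works depends on what your Hive build ships with.&lt;/p&gt;

```sql
-- Switch the execution engine for the current Hive session.
-- hive.execution.engine accepts mr (deprecated), tez, or spark.
SET hive.execution.engine=tez;
```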

&lt;h3&gt;
  
  
  Apache Spark
&lt;/h3&gt;

&lt;p&gt;Finally, we have Apache Spark as the third query engine option. And Spark, by far, is the fastest of them all. There are claims that Spark can improve the performance of Hive by as much as 100x, though that’s a very bold claim. Tez doesn’t offer such a high boost in performance; the commonly accepted figure is 10x. So you might say, well, Spark is the clear winner, right? Well, that depends! Tez and Spark both use DAGs to optimise the query performance over MR. So there can be a lot of parallel or concurrent execution of tasks in both.&lt;/p&gt;

&lt;p&gt;The choice between Tez and Spark for a query engine totally comes down to your application and needs. But no matter what you choose, LLAP can sit on top of that query engine and improve the performance even further. For example, we know that Spark can cache (persist) data in memory or on disk if we need that data again. But this caching is available only within the same Spark job; another Spark job can’t access that data. What if I tell you that’s possible with LLAP? I know, that’s crazy. So, let’s see what else LLAP can do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Live Long And Process (LLAP)
&lt;/h2&gt;

&lt;p&gt;We’re finally going to talk about LLAP. The first thing you need to know, as I already made clear, is that LLAP is not a query engine. It sits on top of the query engine to make query and data processing that much faster. If you imagine the various Hive components to be arranged as a stack, you have HDFS at the bottom, YARN on top of it, and then Hive itself at the top. Now imagine a layer of cache and in-memory processing on top of HDFS. This means a lot of requests don’t go to HDFS at all. That’s what, at a very high level, LLAP does.&lt;/p&gt;

&lt;p&gt;I’m definitely over-simplifying stuff when I say LLAP is a caching and in-memory processing layer. There’s definitely a better way to put it. Let me elaborate.&lt;/p&gt;

&lt;p&gt;You can think of LLAP as just another YARN application running on all data nodes in a Hadoop cluster. The only difference is that LLAP is a long-lived process (hence the name). But this doesn’t mean that it eats up all your resources. It can be configured as a very small process for simple queries, or it can be configured to dynamically scale up and down whenever required. This brings in a very big difference compared to Spark. And because LLAP still works on YARN, it brings in all the advantages of YARN, such as its distributed nature and fault tolerance.&lt;/p&gt;

&lt;p&gt;LLAP runs daemon processes on data nodes, and these daemons are not tied to a user who is issuing Hive queries. And this is a very important distinction, because this allows LLAP to reuse cached data across users. So if you and I both fire similar queries on the same table, LLAP will be able to use the cache that’s already available for both our queries, and that’ll boost the performance for both of us. Without LLAP, both queries would have to perform the same operations individually. And you can imagine how that’s not very optimal.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLAP is not a query engine, and it’s optional
&lt;/h3&gt;

&lt;p&gt;You should’ve realised by now that LLAP is completely optional. It’s not mandatory to use LLAP for Hive queries. You only use LLAP if you want to improve the responsiveness of Hive queries, both in interactive and batch modes. Even when you use LLAP, not every part of a query is executed within LLAP. LLAP is not meant for that; that’s what query engines are for. LLAP takes the parts of a query that can benefit from the cache or the long-lived processes.&lt;/p&gt;
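&lt;p&gt;Because it’s optional, LLAP usage is controlled through configuration. These are the standard Hive properties involved; the defaults and the set of accepted values vary by Hive version and distribution, so treat this as a sketch rather than a definitive setup.&lt;/p&gt;

```sql
-- Run query work in LLAP daemons instead of plain YARN containers.
SET hive.execution.mode=llap;       -- container | llap
-- Control how aggressively query fragments are pushed into LLAP
-- (accepted values include none, auto, and only, depending on version).
SET hive.llap.execution.mode=only;
```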

&lt;p&gt;LLAP doesn’t promise anywhere that it’ll execute the entire query itself. In fact, it’s actually the query engines that orchestrate what can go into LLAP and what can’t. And for now, the Tez query engine and other frameworks such as Pig can use LLAP in their stack. Unfortunately, support for MapReduce is not yet planned, so don’t hold your breath for it.&lt;/p&gt;

&lt;p&gt;And because LLAP is still built to work with YARN, the resource allocation is completely under YARN’s control. This is also the reason LLAP daemon nodes are able to talk to each other and share data across nodes. Another advantage here is that the daemon processes themselves don’t require a lot of resources to function. YARN allocates the minimum required for the processes themselves and will increase the resource allocation as and when required, depending on the workload. And to avoid heap or JVM memory issues, cached data is always kept off-heap and in large buffers. So processing of aggregations such as group by and joins will be much faster in LLAP compared to the query engines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query fragments
&lt;/h3&gt;

&lt;p&gt;I already mentioned that LLAP usually executes parts of queries, not the complete query. These parts of queries are known as query fragments. These fragments include filters, data transformations, partial aggregations, projections, sorting, bucketing, joins, semi-joins, hash joins, etc. And it is to be noted that only certain “blessed” UDFs and Hive code are accepted into LLAP.&lt;/p&gt;

&lt;p&gt;For stability of the process and security of the data, LLAP doesn’t localise any code, and executes it on the fly. And because the daemon is not tied to any particular user (as already mentioned), an LLAP node can allow parallel execution of various query fragments, across queries and sessions. This is one of the primary reasons for the improved performance. More good news, for developers specifically, is that LLAP APIs are directly available via client SDKs. You can directly specify relational transformations using these APIs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Input and Output
&lt;/h3&gt;

&lt;p&gt;As I already mentioned, the daemons themselves have a very small footprint, and that’s because everything else is mostly done by offloading the work to multiple threads. Input and output, for example, are offloaded to threads. And transformations are done in separate threads. So as soon as an I/O thread makes the data ready, the data is passed on to separate threads for processing. This makes the I/O threads available for new I/O operations.&lt;/p&gt;

&lt;p&gt;The data is passed further in the process to other threads in run-length encoded (RLE) columnar format. This reduces copying data across threads and processes to a great extent. And by extension, caching is also in the same RLE format. You can start seeing the benefits here.&lt;/p&gt;

&lt;p&gt;I/O and caching depend heavily on knowledge of the file format that the data is stored in. This is necessary if I/O and caching have to be performant. So LLAP has externalized this knowledge with the help of plugins. And to start with, ORC is the first file format supported by LLAP. This is one of the reasons why there is an increase in the adoption of ORC as the preferred file format for external Hive tables.&lt;/p&gt;

&lt;p&gt;When it comes to caching, both metadata and the data itself are cached. As I mentioned in previous sections, data is cached off-heap to avoid other potential issues. But metadata, on the other hand, is stored in-process as Java objects. This makes sure that even if the data itself is evicted from the cache, the metadata is still in memory to avoid some overhead.&lt;/p&gt;

&lt;p&gt;I touched upon data eviction in the cache. And as you might expect, there are various policies for data eviction. By default, the LRFU policy is used. But you can plug in any other policy at any time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transactions
&lt;/h3&gt;

&lt;p&gt;This is one big area of debate in Hive, to transact or not to transact. But that’s a topic for a separate blog post. When it comes to LLAP, it understands transactions. And it is smart enough to perform transformations (such as merging of delta files) before the data is cached. If there are various transactions performed on the same tables, which is usually the case, LLAP can store multiple versions of data for each such variation. And the correct version will be fetched from the cache on the fly depending on the query. This makes sure that the same set of transformations is not performed on the same data over and over again, thereby reducing a lot of processing time.&lt;/p&gt;




&lt;p&gt;Understanding LLAP doesn’t stop here. There’s a lot more to it, and the more you try to understand it, the more interesting it becomes. I’m planning to write more about it as and when I explore more. But for now, this is all I have. Understanding how LLAP works will make it a lot easier to write queries, and also to write queries in a way that can make use of these optimisations to reduce latency. I hope this helped you a slight bit with your Hadoop or Hive journey.&lt;/p&gt;




&lt;p&gt;And if you like what you see here, or on my &lt;a href="https://medium.com/@contactsunny"&gt;Medium blog&lt;/a&gt; and &lt;a href="https://blog.contactsunny.com"&gt;personal blog&lt;/a&gt;, and would like to see more of such helpful technical posts in the future, consider supporting me on &lt;a href="https://www.patreon.com/contactsunny"&gt;Patreon&lt;/a&gt; and &lt;a href="https://github.com/sponsors/contactsunny"&gt;Github&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>datascience</category>
      <category>database</category>
      <category>apachehive</category>
    </item>
    <item>
      <title>Installing Zsh and Oh-my-zsh on Windows 11 with WSL2</title>
      <dc:creator>Sunny Srinidhi</dc:creator>
      <pubDate>Sat, 06 Nov 2021 03:10:34 +0000</pubDate>
      <link>https://dev.to/contactsunny/installing-zsh-and-oh-my-zsh-on-windows-11-with-wsl2-1p5i</link>
      <guid>https://dev.to/contactsunny/installing-zsh-and-oh-my-zsh-on-windows-11-with-wsl2-1p5i</guid>
      <description>&lt;p&gt;&lt;em&gt;Before we begin, you might ask, why am I writing on something this trivial? I sold off my old MacBook Pro because I’m super excited about the new M1 Pro MacBook Pros. I have pre-ordered one of those and am waiting for it to come. Till then, I’m left with my gaming PC which is an Asus Zephyrus. So I thought I’ll make the best of it till I get my Mac. This will be series of posts in which I show to setup a Windows 11 PC to work with big data tools and technologies, mostly Hadoop. So let’s get into it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now that Windows 11 is officially available to the public, and given that Windows is the most popular operating system used by developers (according to Stack Overflow; refer to the screenshot below), I thought it would be fun to install a new WSL system on my Windows 11 PC and try to set it up for big data tools. But before I could do that, I had to install Zsh and oh-my-zsh on it. I switched to Zsh over half a decade ago and haven’t used Bash or any other shell ever since. I just love how zsh looks, the ability to theme it, and the community surrounding it. So, I started with zsh, and here is how you can install it too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhbwfl8hdk0pwg2ss7ub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhbwfl8hdk0pwg2ss7ub.png" alt="StackOverflow Dev Survey"&gt;&lt;/a&gt;&lt;br&gt;
Source: stackoverflow&lt;/p&gt;




&lt;h2&gt;
  
  
  Installing WSL2 on Windows 11
&lt;/h2&gt;

&lt;p&gt;The process of installing WSL on Windows 11 is no different from that on Windows 10. And if you’re upgrading from a Windows 10 PC that already had WSL installed, you don’t have to do anything extra. I already had mine set up on Windows 10 with Ubuntu 20.04 LTS, so that was readily available for me. But I wanted to set up everything fresh just to see if anything had changed (spoiler: it hasn’t). So I installed Debian this time.&lt;/p&gt;

&lt;p&gt;To install any distro of Linux on Windows 11, just open up the Microsoft Store and search for your favorite Linux distro. In the screenshot below, you can see that if I search for Ubuntu, I get multiple versions of Ubuntu listed in the store.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kimct3tztqfcrqbn2ev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kimct3tztqfcrqbn2ev.png" alt="Ubuntu in Microsoft Store"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just select the version that you want to install and click the Get button. After the installation is complete, the shell should open automatically to finish the setup.&lt;/p&gt;
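&lt;p&gt;If you prefer the command line, Windows 11 can also install WSL2 and a distro in one step from an elevated PowerShell or Command Prompt. The distro name below is just an example; the list subcommand shows what’s available:&lt;/p&gt;

```shell
# Install WSL2 together with a distro of your choice (example: Debian)
wsl --install -d Debian

# List the distros available in the online catalog
wsl --list --online
```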

&lt;p&gt;Soon after the installation is complete, you’ll see a Bash shell similar to the one in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5gqxmz1lz0yv5v5q3lhw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5gqxmz1lz0yv5v5q3lhw.png" alt="with bash"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the prompt is pretty ugly, at least to me. I know I can customize it, but not to the extent of what’s already available with Zsh. Now that we have Debian installed, let’s update the packages and install Zsh and oh-my-zsh.&lt;/p&gt;




&lt;h2&gt;
  
  
  Installing Zsh and Oh-my-zsh
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Updating package list and upgrade packages
&lt;/h3&gt;

&lt;p&gt;Before we install any packages, we first have to update the package list so that we can upgrade the packages we already have installed and fetch references to the latest versions of all dependencies. This is pretty simple and takes just a few seconds. Run the following command in the terminal to update the references:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once you update the package list, if any updates are available (and most probably there will be), you can run the following command to upgrade all installed packages. This step is optional.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nb"&gt;sudo &lt;/span&gt;apt upgrade


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Installing dependencies
&lt;/h3&gt;

&lt;p&gt;Now that we have updated the package list, we have to install the dependencies we’ll need for Zsh and oh-my-zsh. The list isn’t exhaustive; it’s just two packages. These two likely come pre-installed on a full-fat desktop version of Linux, but because this is a WSL image, it’s pretty stripped down.&lt;/p&gt;

&lt;p&gt;Anyway, the two packages we need are wget and git. To be clear, these aren’t required for installing Zsh itself, but for oh-my-zsh. And to be completely honest, you can technically install oh-my-zsh without them, but there are benefits to installing it this way. For instance, oh-my-zsh will automatically check for updates if you install it using git.&lt;/p&gt;

&lt;p&gt;To install these two packages, run the following two commands one after another in your terminal:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;wget
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;git


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Together, they shouldn’t take more than a minute to install; these are pretty small packages. Once you’re done with these two, we can finally move on to installing Zsh.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing Zsh
&lt;/h3&gt;

&lt;p&gt;In some cases, Zsh may already be installed, even in the WSL version of Ubuntu or Debian. I can’t say for sure because, as I mentioned, I already had Ubuntu installed on WSL. But if it’s not installed, it’s just one little command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;zsh


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And that’s it, you have Zsh installed on your Windows 11 PC using WSL. This excites me very much for some reason. But we’re not yet done. Let’s install oh-my-zsh.&lt;/p&gt;
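&lt;p&gt;Note that installing Zsh doesn’t make it your login shell; the oh-my-zsh installer in the next step usually offers to switch it for you, but you can also do it manually with chsh:&lt;/p&gt;

```shell
# Make Zsh the default shell for the current user
# (you may be prompted for your password; open a new session for it to take effect)
chsh -s "$(which zsh)"
```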

&lt;h3&gt;
  
  
  Installing oh-my-zsh
&lt;/h3&gt;

&lt;p&gt;If you don’t know what oh-my-zsh is, you can read all about it here. Installing it is another simple command, but this time we won’t use the apt package manager; instead, we’ll use wget to download the install script from the Git repo and run it on our machine. To install oh-my-zsh, run the following command in your terminal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;wget https://raw.github.com/robbyrussell/oh-my-zsh/master/tools/install.sh &lt;span class="nt"&gt;-O&lt;/span&gt; -&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because it’s an open-source package, you can just go to the link in that command and see for yourself what the script does. The whole process should look similar to what you see in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qpu02quwb9iwo6gguxg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qpu02quwb9iwo6gguxg.png" alt="oh-my-zsh windows 11"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Soon after the install, you’ll see the prompt change in your terminal as the default theme is applied. You can explore how to install more themes and customize the prompt to your heart’s content.&lt;/p&gt;
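&lt;p&gt;Changing the theme is a one-line edit in ~/.zshrc: set the ZSH_THEME variable to any theme bundled under ~/.oh-my-zsh/themes (the name below is just an example), then open a new terminal:&lt;/p&gt;

```shell
# In ~/.zshrc: pick a bundled oh-my-zsh theme (example choice: "agnoster")
ZSH_THEME="agnoster"
```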




&lt;h2&gt;
  
  
  Adding Useful Aliases
&lt;/h2&gt;

&lt;p&gt;If you’re a developer and use Git a lot, there are some commands you’ll be typing every day. It shouldn’t come as a surprise that most developers keep handy aliases for these commands to make life in the terminal a bit easier. To add aliases, you’ll have to edit the .zshrc file in your home directory. Run the following command to open the file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

vi ~/.zshrc


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will open the file in the vi editor. Scroll to the end of the file and add the following lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="nb"&gt;alias &lt;/span&gt;&lt;span class="nv"&gt;ll&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ls -ltra"&lt;/span&gt;
&lt;span class="nb"&gt;alias &lt;/span&gt;&lt;span class="nv"&gt;gd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"git diff"&lt;/span&gt;
&lt;span class="nb"&gt;alias &lt;/span&gt;&lt;span class="nv"&gt;gcmsg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"git commit -m"&lt;/span&gt;
&lt;span class="nb"&gt;alias &lt;/span&gt;&lt;span class="nv"&gt;gitc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"git checkout"&lt;/span&gt;
&lt;span class="nb"&gt;alias &lt;/span&gt;&lt;span class="nv"&gt;gitm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"git checkout master"&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As you can see, these are pretty simple aliases, but they save a surprising number of keystrokes over the course of a day. So, that’s pretty much it.&lt;/p&gt;
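&lt;p&gt;The new aliases won’t be available in the current session until the config is reloaded; either open a new terminal or source the file:&lt;/p&gt;

```shell
# Reload ~/.zshrc so the aliases defined above take effect immediately
source ~/.zshrc
```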




&lt;p&gt;And if you like what you see here, or on my &lt;a href="https://medium.com/@contactsunny" rel="noopener noreferrer"&gt;Medium blog&lt;/a&gt; and &lt;a href="https://blog.contactsunny.com" rel="noopener noreferrer"&gt;personal blog&lt;/a&gt;, and would like to see more of such helpful technical posts in the future, consider supporting me on &lt;a href="https://www.patreon.com/contactsunny" rel="noopener noreferrer"&gt;Patreon&lt;/a&gt; and &lt;a href="https://github.com/sponsors/contactsunny" rel="noopener noreferrer"&gt;Github&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>windows</category>
      <category>ohmyzsh</category>
      <category>zsh</category>
      <category>wsl</category>
    </item>
    <item>
      <title>Installing Hadoop on the new M1 Pro and M1 Max MacBook Pro</title>
      <dc:creator>Sunny Srinidhi</dc:creator>
      <pubDate>Sat, 06 Nov 2021 02:57:27 +0000</pubDate>
      <link>https://dev.to/contactsunny/installing-hadoop-on-the-new-m1-pro-and-m1-max-macbook-pro-48e9</link>
      <guid>https://dev.to/contactsunny/installing-hadoop-on-the-new-m1-pro-and-m1-max-macbook-pro-48e9</guid>
      <description>&lt;p&gt;In the previous series of posts, I wrote about &lt;a href="https://blog.contactsunny.com/data-science/installing-hadoop-on-windows-11-with-wsl2"&gt;how to install the complete Hadoop stack on Windows 11 using WSL 2&lt;/a&gt;. And now that the new MacBook Pro laptops are available with the brand new M1 Pro and M1 Max SOCs, here’s a guide on how to install the same Hadoop stack on these laptops. Because both M1 Pro and M1 Max use the same architecture, the steps you need to follow to install Hadoop is the same. So it doesn’t matter which MacBook you got, the steps given here should work for you. So, let’s get started.&lt;/p&gt;




&lt;h2&gt;
  
  
  Installing Dependencies
&lt;/h2&gt;

&lt;p&gt;There are two important dependencies that you’ll need to install to make Hadoop work. These aren’t optional, unless you have them installed already. So make sure you install these dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing JDK
&lt;/h3&gt;

&lt;p&gt;As always, we start with dependencies. To begin with, let’s get JDK 8 installed, because Hadoop is largely dependent on Java. There are two ways of installing a JDK on an M1 Mac: using Homebrew, or directly from a vendor. We’re going to install Azul’s OpenJDK build (Zulu), which is super easy to install and is also a certified JDK. You can download the JDK from here. As you’ll see, there are multiple versions available, but let’s stick to 8 for now. Also, make sure you download the ARM 64-bit version of the JDK.&lt;/p&gt;

&lt;p&gt;Once you download the installer, the installation itself is pretty easy: just open it and follow the steps in the wizard. It shouldn’t take more than a couple of minutes. Once you’re done, make sure to export the path to the Java home directory, as this will be used not just by Hadoop but by a lot of other packages as well. For this, get the installation path (which should be very similar to the one given below) and add this to your .zshrc file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;JAVA_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I’m pretty sure you can find the JDK in the exact same path if you installed JDK 8. Anyway, let’s move on to the next dependency.&lt;/p&gt;
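&lt;p&gt;To confirm the variable is picked up, open a new terminal (or source your .zshrc) and check that Java resolves from the exported path; this sketch assumes the default Zulu 8 location shown above:&lt;/p&gt;

```shell
# Should print the Zulu installation path and the Java 8 version string
echo "$JAVA_HOME"
"$JAVA_HOME/bin/java" -version
```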

&lt;h3&gt;
  
  
  Enable SSH to localhost
&lt;/h3&gt;

&lt;p&gt;Unlike Linux or Windows, SSH is already installed on Macs. We only need to enable the feature and add our security keys so that we don’t need to provide our passwords every time. First, let’s enable SSH or remote login feature. For this, open up your System Preferences app and find the Sharing menu. From there, on the services list to the left, search for “Remote Login” and enable it. This will be disabled by default. Below is a screenshot for reference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fduasrzpewcol71993gwh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fduasrzpewcol71993gwh.png" alt="enable remote logic Mac" width="800" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, we have to create a security key for being able to SSH into the localhost. Run the following command to generate a key. Follow the instructions on screen to provide all the required information.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh-keygen &lt;span class="nt"&gt;-t&lt;/span&gt; rsa &lt;span class="nt"&gt;-P&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; ~/.ssh/id_rsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the key is generated, we have to copy it over to the authorized keys file so that it can be used for password-less SSH. This is important because Hadoop expects password-less SSH to be available and enabled. So run the following command to copy over the key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.ssh/id_rsa.pub &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.ssh/authorized_keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In some cases, SSH might not work if the key files have overly permissive access. To avoid this, run the following command to restrict permissions on the private key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod &lt;/span&gt;0600 ~/.ssh/id_rsa.pub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that’s it. SSH should be working fine now. To be extra sure, let’s try it out with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don’t get any errors from that command, it’s all working as expected. You’re now logged into another session in your terminal over SSH. So let’s log out from there and come back to our previous session. To do this, hit CTRL + D. You should see something similar to the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgp590vrqosjn7sqf8sl8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgp590vrqosjn7sqf8sl8.png" alt="ssh localhost Mac" width="626" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And we now have all dependencies installed and working.&lt;/p&gt;




&lt;h2&gt;
  
  
  Installing Hadoop
&lt;/h2&gt;

&lt;p&gt;The first step to installing Hadoop is to actually download it. As of this writing, the latest version of Hadoop is 3.3.1, and you can download it from here. You’ll get a .tar.gz file. To decompress it, just double-click the package; it’ll expand into a directory with all the contents. You can move this directory wherever you want the Hadoop installation to live.&lt;/p&gt;
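&lt;p&gt;If you’d rather stay in the terminal, the same archive can be fetched and unpacked with curl and tar. The URL below points at the Apache archive for 3.3.1, and the destination path is just an example; adjust both to taste:&lt;/p&gt;

```shell
# Download the Hadoop 3.3.1 binary release from the Apache archive
curl -LO https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz

# Unpack it and move it to wherever the installation should live (example path)
tar -xzf hadoop-3.3.1.tar.gz
mv hadoop-3.3.1 ~/programs/
```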

&lt;p&gt;Because we’re installing Hadoop on our local machine, we’re going to do a single-node deployment, which is also known as pseudo-distributed mode deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting the environment variables
&lt;/h3&gt;

&lt;p&gt;We have to set a bunch of environment variables. The best part is, you have to customize only one variable. The others are just copy-paste. Anyway, following are the variables I’m talking about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Hadoop&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/Users/sunny/programs/hadoop-3.3.1/
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_INSTALL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_MAPRED_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_COMMON_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_HDFS_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;YARN_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_COMMON_LIB_NATIVE_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;/lib/native
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;:&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;/sbin:&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;/bin
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_OPTS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-Djava.library.path=&lt;/span&gt;&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;&lt;span class="s2"&gt;/lib/nativ"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, you only have to change the value of the first environment variable, HADOOP_HOME. Set it to reflect the path where you have placed the Hadoop directory. Also, it is a good idea to place these export statements in the .zshrc file so that these variables are exported every time automatically instead of you having to do it. Once you place it in the file, make sure you source it so that it takes effect immediately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuring Hadoop
&lt;/h3&gt;

&lt;p&gt;Next, we’ll have to edit a few files to change the config for various Hadoop components. Let’s start that with the file hadoop-env.sh. Run the following command to open the file in the editor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;vim &lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;/etc/hadoop/hadoop-env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, find the line that is exporting the $JAVA_HOME variable and uncomment it. Here, you have to provide the same path that you did when you installed Java earlier. For me, that’s the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;JAVA_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we have to edit the core-site.xml file. Here we have to provide the temporary directory for Hadoop and also the default name for the Hadoop file system. Open the file in the editor using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;vim &lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;/etc/hadoop/core-site.xml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll find an empty file here with a few comments and an empty configuration block. You can delete everything and replace it with the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&amp;lt;configuration&amp;gt;
    &amp;lt;property&amp;gt;
        &amp;lt;name&amp;gt;hadoop.tmp.dir&amp;lt;/name&amp;gt;
        &amp;lt;value&amp;gt;/Users/sunny/hdfs/tmp/&amp;lt;/value&amp;gt;
    &amp;lt;/property&amp;gt;
    &amp;lt;property&amp;gt;
        &amp;lt;name&amp;gt;fs.default.name&amp;lt;/name&amp;gt;
        &amp;lt;value&amp;gt;hdfs://127.0.0.1:9000&amp;lt;/value&amp;gt;
    &amp;lt;/property&amp;gt;
&amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure you create the temp directory that you configure here. Next, we have to edit the HDFS config file hdfs-site.xml. To do this, open the file in the editor using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;vim &lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;/etc/hadoop/hdfs-site.xml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this configuration file, we are setting the HDFS data node directory, HDFS name node directory, and the HDFS replication factor. Here again you should get a file with an empty configuration block. Replace that with the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&amp;lt;configuration&amp;gt;
  &amp;lt;property&amp;gt;
      &amp;lt;name&amp;gt;dfs.name.dir&amp;lt;/name&amp;gt;
      &amp;lt;value&amp;gt;/Users/sunny/hdfs/namenode&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
      &amp;lt;name&amp;gt;dfs.data.dir&amp;lt;/name&amp;gt;
      &amp;lt;value&amp;gt;/Users/sunny/hdfs/datanode&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
      &amp;lt;name&amp;gt;dfs.replication&amp;lt;/name&amp;gt;
      &amp;lt;value&amp;gt;1&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
&amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
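&lt;p&gt;The directories referenced in core-site.xml and hdfs-site.xml don’t exist yet; mkdir -p creates all of them in one go (paths match the examples above; adjust for your username):&lt;/p&gt;

```shell
# Create the HDFS tmp, namenode, and datanode directories used in the configs
mkdir -p ~/hdfs/tmp ~/hdfs/namenode ~/hdfs/datanode
```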



&lt;p&gt;And again, make sure you create the data node and name node directories. Next, we have the MapReduce config file. To open this in the editor, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;vim &lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;/etc/hadoop/mapred-site.xml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can replace the configuration block with the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&amp;lt;configuration&amp;gt; 
  &amp;lt;property&amp;gt; 
    &amp;lt;name&amp;gt;mapreduce.framework.name&amp;lt;/name&amp;gt; 
    &amp;lt;value&amp;gt;yarn&amp;lt;/value&amp;gt; 
  &amp;lt;/property&amp;gt; 
&amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, it’s a simple configuration which specifies the MapReduce framework name. And finally, we have the YARN configuration file, yarn-site.xml. Open the file in the editor using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;vim &lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;/etc/hadoop/yarn-site.xml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the following configuration to the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&amp;lt;configuration&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.aux-services&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;mapreduce_shuffle&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.aux-services.mapreduce.shuffle.class&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;org.apache.hadoop.mapred.ShuffleHandler&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.resourcemanager.hostname&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;127.0.0.1&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.acl.enable&amp;lt;/name&amp;gt;
    &amp;lt;value&amp;gt;0&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
  &amp;lt;property&amp;gt;
    &amp;lt;name&amp;gt;yarn.nodemanager.env-whitelist&amp;lt;/name&amp;gt;   
    &amp;lt;value&amp;gt;JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME&amp;lt;/value&amp;gt;
  &amp;lt;/property&amp;gt;
&amp;lt;/configuration&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There’s nothing to change in this configuration. And finally, we’re done configuring Hadoop. We can now move on to formatting the name node and starting Hadoop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Formatting the HDFS name node
&lt;/h3&gt;

&lt;p&gt;It’s important to format the HDFS name node before starting the Hadoop service for the first time. This, obviously, makes sure there’s no junk anywhere in the name node. And once you start using HDFS more frequently, you’ll realize that you’re formatting the name node more often than you thought you would, at least on your development machine. Anyway, to format the name node, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hdfs namenode &lt;span class="nt"&gt;-format&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you get the shutdown notification for name node, the formatting is complete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Starting all of Hadoop
&lt;/h3&gt;

&lt;p&gt;Finally, we’re at the best part of this activity, starting and using Hadoop. Now, there are many ways of starting Hadoop depending on what components you actually want to use. For example, you can start only YARN, or HDFS along with it, etc. For this activity, we’ll just start everything. To do this, the Hadoop distribution provides a handy script. And because you have already exported a bunch of environment variables earlier, you don’t even have to search for that script, it’s already in your path. Just run the following command and wait for it to finish:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;start-all.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will take a few seconds, as the script waits for the first 10 seconds without doing anything, giving you a chance to cancel the operation if you started it by mistake. Just hold on and you should see output similar to the following screenshot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45upuxxeg7ud893azpdk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45upuxxeg7ud893azpdk.png" alt="start-all.sh Mac" width="800" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This tells us that all components of Hadoop are up and running. If you want to make sure, you can run the jps command to get a list of all the running processes. You should see at least the following services:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu53964bng2rk8dxz73uf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu53964bng2rk8dxz73uf.png" alt="jps" width="354" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that’s it. You’re now running Hadoop on your M1 Pro or M1 Max MacBook Pro. To make sure, you can use the following simple HDFS command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hdfs dfs &lt;span class="nt"&gt;-ls&lt;/span&gt; /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will list all files and directories at the root HDFS directory. If it’s a brand new deployment, you shouldn’t find much there. You’ll get a list similar to the one shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ibxnxh54bh9sb8iuym4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ibxnxh54bh9sb8iuym4.png" alt="hdfs dfs -ls /" width="800" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s pretty much it. We’re done!&lt;/p&gt;
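&lt;p&gt;One last tip: when you’re done experimenting, the matching stop script (also on your PATH thanks to the exports above) shuts all the daemons down cleanly:&lt;/p&gt;

```shell
# Stop all HDFS and YARN daemons started by start-all.sh
stop-all.sh
```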




&lt;p&gt;And if you like what you see here, or on my &lt;a href="https://medium.com/@contactsunny"&gt;Medium blog&lt;/a&gt; and &lt;a href="https://blog.contactsunny.com"&gt;personal blog&lt;/a&gt;, and would like to see more of such helpful technical posts in the future, consider supporting me on &lt;a href="https://www.patreon.com/contactsunny"&gt;Patreon&lt;/a&gt; and &lt;a href="https://github.com/sponsors/contactsunny"&gt;Github&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>hadoop</category>
      <category>bigdata</category>
      <category>macbook</category>
      <category>programming</category>
    </item>
    <item>
      <title>Getting Started With Apache Airflow</title>
      <dc:creator>Sunny Srinidhi</dc:creator>
      <pubDate>Mon, 11 Oct 2021 09:27:05 +0000</pubDate>
      <link>https://dev.to/contactsunny/getting-started-with-apache-airflow-18fl</link>
      <guid>https://dev.to/contactsunny/getting-started-with-apache-airflow-18fl</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furxjmajo1mg8ns7h179x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furxjmajo1mg8ns7h179x.jpg" alt="workflow" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Airflow is another awesome tool that I discovered just recently. Just a couple of months after discovering it, I can’t imagine not using it now. It’s reliable, configurable, and dynamic. Because it’s all driven by code, you can version control it too. It’s just awesome! But wait, what the heck is Apache Airflow?&lt;/p&gt;

&lt;p&gt;Apache Airflow is a workflow orchestration tool, to put it simply. Using Airflow, you can define the various steps involved in your workflow (data projects or not), define the relations between these various steps, and then schedule those steps as well. But wait, this sounds more like defining CRON jobs. You wouldn’t be entirely wrong there. But obviously, this is much more than just a fancy CRON job. Let’s dive deep and understand it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Apache Airflow
&lt;/h2&gt;

&lt;p&gt;The official definition of Airflow on its Apache homepage is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Airflow is a platform created by the community to programmatically author, schedule and monitor workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://airflow.apache.org/"&gt;Apache Airflow&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That should give you a high-level understanding of what Airflow is. Like most Apache projects, it’s community-driven and open source. Workflows are programmatically authored, which means you write code not just to define a workflow, but also to schedule and update it. As I already mentioned, you can version Airflow code and easily update or roll it back at will. Oh, by the way, Airflow DAGs are written in Python.&lt;/p&gt;

&lt;p&gt;Wait, what’s a DAG, you ask? DAG stands for Directed Acyclic Graph. This shouldn’t be new to you if you’ve already worked with tools such as Apache Spark. Spark jobs are internally converted to DAGs as well, which you can see visually from the web UI whenever a Spark job is running. Airflow uses the same concept to chain various operations together in a workflow.&lt;/p&gt;
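&lt;p&gt;To make the DAG idea concrete, here’s a tiny toy model (this has nothing to do with Airflow’s internals, and the task names are made up): tasks are nodes, dependencies are directed edges, and any topological ordering of the graph is a valid run order:&lt;/p&gt;

```python
# Toy model of a DAG: tasks are nodes, dependencies are edges.
# This is NOT Airflow code; it only illustrates the scheduling idea.
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
dag = {
    "create_table": {"create_database"},
    "load_data": {"create_table", "download_csv"},
}

# static_order() yields one valid execution order for the graph.
order = list(TopologicalSorter(dag).static_order())
print(order)
```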

&lt;p&gt;Each operation in Airflow is defined using an operator. And of course, there are &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/index.html?highlight=operator#module-airflow.operators"&gt;various operators&lt;/a&gt;. For example, if you want an operation in a DAG just to denote a virtual entity, you can use a DummyOperator. Similarly, if you want to execute a bash command or run a bash script file, you can use the BashOperator.&lt;/p&gt;

&lt;p&gt;Similar to operators, there’s support for &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/plugins.html"&gt;plugins&lt;/a&gt; as well, where you can integrate third-party plugins to bring more functionality to Airflow. You can write your own plugins too; this is fairly straightforward, as writing a plugin is mostly like writing a Python package. If you have experience with Python, you’ll understand this from the &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/plugins.html"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Another way to extend the functionality of Airflow is to use &lt;a href="https://airflow.apache.org/docs/apache-airflow-providers/index.html"&gt;providers packages&lt;/a&gt;. These differ from plugins in that a providers package can include new operators, sensors, hooks, and transfer operators to extend Airflow. And as you’d expect with any popular open source platform, there is already a rich ecosystem of providers. You can see the full list of current providers &lt;a href="https://airflow.apache.org/docs/#providers-packages-docs-apache-airflow-providers-index-html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now that we understand at a very high level what Apache Airflow is, let’s look at a simple example workflow to see how we can use Airflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Sample Airflow Workflow
&lt;/h2&gt;

&lt;p&gt;Before we get started, if you are anything like me, you’ll want Airflow set up locally so that you can just copy-paste the code and see if it works. If that’s you, check out the installation &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html"&gt;instructions&lt;/a&gt;, because there are a bunch of ways to install Airflow. Once you have Airflow up and running, make sure you switch from Derby to either MySQL or PostgreSQL as the backend database, because Derby causes issues with scheduling and running DAGs, even locally. You can see how to make the switch &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html"&gt;here&lt;/a&gt;. Don’t worry, it’s pretty simple and shouldn’t take more than a couple of minutes if you already have MySQL or Postgres installed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding the sample workflow
&lt;/h3&gt;

&lt;p&gt;Even though the dataset itself isn’t important for this Airflow workflow, I just want to touch upon it, as this is one of the most frequently used datasets in the world of data science. The NYC Taxi and Limousine Commission (TLC) exposes each month’s taxi data as CSVs, for free. So I’m using that, or rather a subset of it. To be more specific, I’m using the yellow taxi data from December of 2020. You can get more info on this over &lt;a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page"&gt;here&lt;/a&gt;.&lt;/p&gt;
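&lt;p&gt;For a feel of the data’s shape, here’s a fabricated two-row sample in the same column layout (only the first few columns of the yellow taxi schema, with made-up values; the real CSVs come from the TLC site):&lt;/p&gt;

```python
# Fabricated sample rows mimicking the yellow taxi CSV layout.
# The column names match the TLC schema; the values are invented.
import csv
import io

sample = """VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance
1,2020-12-01 00:05:00,2020-12-01 00:15:00,1,2.4
2,2020-12-01 00:10:00,2020-12-01 00:30:00,2,5.1
"""

rows = list(csv.DictReader(io.StringIO(sample)))
print(len(rows), rows[0]["trip_distance"])
```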

&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;

&lt;p&gt;Finally, let’s get into the juicy bit. First, we need to define a DAG in our Python file. This DAG, which is an instance of the DAG class, is what tells Airflow that this particular file should be scanned and listed as a DAG in Airflow’s database. And this DAG takes in a bunch of default arguments, which are literally called default arguments, I’m not even kidding. So let’s start by defining these default arguments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sunny&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;depends_on_past&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sunny@contactsunny.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_on_failure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email_on_retry&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should be pretty much self-explanatory, so I’m not going to talk much about each field in that object. Next, with this set of default arguments, we’ll define the DAG itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;load_nyc_taxi_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DAG to load NYC Taxi data to Hive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;days_ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sunny&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sample&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first argument to the DAG here is called the DAG ID. This, as you could imagine, is what uniquely identifies a DAG. The second argument is the set of default arguments that we defined in the previous step. The others are, again, pretty self-explanatory.&lt;/p&gt;
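&lt;p&gt;Two of those arguments are worth a quick note: &lt;code&gt;schedule_interval=timedelta(days=1)&lt;/code&gt; simply means “run once a day”, and &lt;code&gt;days_ago(2)&lt;/code&gt; resolves to a concrete start date. Here’s a rough stdlib approximation of what &lt;code&gt;days_ago&lt;/code&gt; computes, just for intuition (it is not Airflow’s actual helper):&lt;/p&gt;

```python
# A rough stand-in for Airflow's days_ago() helper, for illustration only:
# "n days ago", truncated to midnight UTC.
from datetime import datetime, timedelta, timezone

def days_ago(n: int) -> datetime:
    midnight = datetime.now(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return midnight - timedelta(days=n)

print(days_ago(2).isoformat())
```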

&lt;p&gt;The next step is to define the various operations that we want to perform in this workflow. So let’s talk about that first. For this example, I’m using Hive as the datastore for the yellow taxi data that I’m downloading. Why Hive? Well, why not? You can of course change any or all of this. Anyway, the following are the operations that the DAG is going to perform, in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a database in Hive.&lt;/li&gt;
&lt;li&gt;Create a table in the database that we just created.&lt;/li&gt;
&lt;li&gt;Download the NYC yellow taxi data CSV.&lt;/li&gt;
&lt;li&gt;Import that data into the table we created in Hive.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s pretty simple, right? Anyway, let’s look at each of these steps in detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a database in Hive
&lt;/h3&gt;

&lt;p&gt;This is a simple &lt;code&gt;create database&lt;/code&gt; query in Hive. But I’m not going to log in to the Hive shell and then execute this query. Instead, I have written a &lt;code&gt;.hql&lt;/code&gt; file that I’ll submit to the hive command, which will in turn execute the query that’s in the file. The &lt;code&gt;.hql&lt;/code&gt; file itself is pretty simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;database&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="k"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;nyc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I have this query in a file called &lt;code&gt;create_database.hql&lt;/code&gt;. And how exactly do I submit this to hive? That’s pretty simple too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;hive&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;mnt&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;poc&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;airflow&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;create_database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hql&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we know what exactly we’re doing in this operation, let’s define a task in Airflow. We’ll use the BashOperator to create this task, as we’re running a bash command to perform this operation. The Python code to define this task is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;create_database&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;create_database&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hive -f /mnt/d/code/poc/airflow/create_database.hql&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first argument to the BashOperator is a &lt;code&gt;task_id&lt;/code&gt;. This is the ID that will uniquely identify the task in the given DAG. So make sure you don’t have duplicate task IDs. The &lt;code&gt;bash_command&lt;/code&gt; is the actual bash command that will be executed when Airflow triggers this task. That’s pretty much it for this task. Let’s move to the next task.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a table
&lt;/h3&gt;

&lt;p&gt;Now that we have the database, let’s create a table to hold the NYC yellow taxi data. I have created this table definition based on the data in the CSV file that I downloaded from the source. So if you are using some other dataset, make sure to change the table definition accordingly.&lt;/p&gt;

&lt;p&gt;Similar to the query for creating the database, I have the query for creating the table in a &lt;code&gt;.hql&lt;/code&gt; file called &lt;code&gt;create_table.hql&lt;/code&gt;. The query itself is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;create&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt; &lt;span class="n"&gt;nyc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nyc_yellow_taxi_trips &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;VendorID&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tpep_pickup_datetime&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tpep_dropoff_datetime&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;passenger_count&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;trip_distance&lt;/span&gt; &lt;span class="n"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;RatecodeID&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;store_and_fwd_flag&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PULocationID&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DOLocationID&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payment_type&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fare_amount&lt;/span&gt; &lt;span class="n"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;extra&lt;/span&gt; &lt;span class="n"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mta_tax&lt;/span&gt; &lt;span class="n"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tip_amount&lt;/span&gt; &lt;span class="n"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tolls_amount&lt;/span&gt; &lt;span class="n"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;improvement_surcharge&lt;/span&gt; &lt;span class="n"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_amount&lt;/span&gt; &lt;span class="n"&gt;double&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;congestion_surcharge&lt;/span&gt; &lt;span class="n"&gt;double&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;COMMENT&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NYC Yellow Taxi Trips&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;DELIMITED&lt;/span&gt;
&lt;span class="n"&gt;FIELDS&lt;/span&gt; &lt;span class="n"&gt;TERMINATED&lt;/span&gt; &lt;span class="n"&gt;BY&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the bash command to submit this query to Hive is also similar to the previous command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;hive&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;mnt&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;poc&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;airflow&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;create_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hql&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, let’s define the task in Airflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;create_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'create_table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'hive -f /mnt/d/code/poc/airflow/create_table.hql'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing new here; this is very similar to the previous task, just with a different task ID and command. Next, we download the sample data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Downloading sample dataset
&lt;/h3&gt;

&lt;p&gt;You can skip this step if you want to manually download the CSV file. But because I’m automating a workflow, I thought it would be better to automate the download as well. In a real-world setup, getting the data for ingestion is often a separate workflow in itself, which is another reason I wanted to have this as a task. Now, how am I downloading the dataset?&lt;/p&gt;

&lt;p&gt;I’m using the &lt;code&gt;wget&lt;/code&gt; command to get the &lt;code&gt;.csv&lt;/code&gt; file and place it in a path that I have already created. You only have to create that path once, because in most cases it stays constant. Anyway, the &lt;code&gt;wget&lt;/code&gt; command itself is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://nyc-tlc.s3.amazonaws.com/trip+data/yellow_tripdata_2020-12.csv &lt;span class="nt"&gt;-P&lt;/span&gt; /mnt/d/Backup/hive_sample_datasets/nyc_taxi_trips/nyc_yellow_taxi_trips/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I have this command in a shell script called &lt;code&gt;download_dataset.sh&lt;/code&gt;. Now you might ask: since this is a bash command, and the BashOperator can take a bash command directly, why create a shell script to run it? Valid question. Because we have a file for each of the queries, I thought we’d continue that trend and create a file for this as well. You can directly use the bash command in the BashOperator itself, and that would work the same way. Anyway, the task definition for this is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;download_nyc_yellow_taxi_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;download_nyc_yellow_taxi_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/mnt/d/code/poc/airflow/download_dataset.sh &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
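<p>Under the hood, the BashOperator essentially shells out the command string and fails the task if the command exits non-zero. Here’s a rough standalone analogue of that behaviour using the standard library (with a harmless <code>echo</code> standing in for the actual <code>wget</code> call):</p>

```python
# What a BashOperator boils down to: run a shell command, capture its
# output, and raise if the exit code is non-zero (check=True).
# The echo below is a harmless stand-in for the real wget command.
import subprocess

result = subprocess.run(
    "echo downloading yellow_tripdata_2020-12.csv",
    shell=True, capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
```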



&lt;h3&gt;
  
  
  Load data to table
&lt;/h3&gt;

&lt;p&gt;At this stage, we have the database, the table, and the sample dataset. The only part pending is loading the data into Hive. This is another Hive query that I have written in another &lt;code&gt;.hql&lt;/code&gt; file called &lt;code&gt;load_data_to_table.hql&lt;/code&gt;. The query itself is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;load&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="k"&gt;local&lt;/span&gt; &lt;span class="n"&gt;inpath&lt;/span&gt; &lt;span class="s1"&gt;'/mnt/d/Backup/hive_sample_datasets/nyc_taxi_trips/nyc_yellow_taxi_trips/yellow_tripdata_2020-12.csv'&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="n"&gt;nyc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nyc_yellow_taxi_trips&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see from the query, I’m loading the file from a local path. In a real-world setup, you’d copy all the files to HDFS first and then load them from HDFS into Hive. Because it’s distributed, of course. You can do that here as well, but I didn’t see it adding much to this demo, so I just skipped it. Now, the task definition for this step is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;load_data_to_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;load_data_to_table&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hive -f /mnt/d/code/poc/airflow/load_data_to_table.hql&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dag&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Orchestrating the workflow
&lt;/h3&gt;

&lt;p&gt;We have all the tasks defined now, so let’s define the sequence of execution. Doing that is pretty simple: you use the &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt; operator to chain two tasks together. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;task1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;task2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this means is that when the Airflow DAG is triggered, &lt;code&gt;task1&lt;/code&gt; will be executed first. Once that’s complete, &lt;code&gt;task2&lt;/code&gt; will be executed. This will happen in sequence. But what if you have another task, &lt;code&gt;task3&lt;/code&gt; that you want to trigger after &lt;code&gt;task1&lt;/code&gt; along with &lt;code&gt;task2&lt;/code&gt;? You can do that in parallel like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;task1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;task2&lt;/span&gt;
&lt;span class="n"&gt;task1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;task3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, as soon as &lt;code&gt;task1&lt;/code&gt; is complete, tasks &lt;code&gt;task2&lt;/code&gt; and &lt;code&gt;task3&lt;/code&gt; will be executed in parallel. We don’t need this parallelism in our example though. For us, the sequence is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;create_database&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;create_table&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;download_nyc_yellow_taxi_data&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;load_data_to_table&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait, we only defined the four tasks in the middle. What are the start and end tasks? Glad you asked. Those two are DummyOperators used to represent the start and the end of the workflow. You can use DummyOperators anywhere in the workflow, basically wherever you want, but they do absolutely nothing more than add a node to the DAG.&lt;/p&gt;
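<p>If you’re curious why <code>&gt;&gt;</code> works at all on task objects: Airflow’s operators overload Python’s <code>__rshift__</code>. Here’s a stripped-down toy (not Airflow’s real classes) showing how <code>task1 &gt;&gt; task2</code> can record a dependency edge:</p>

```python
# Toy illustration of Airflow-style >> chaining via __rshift__.
# This is not Airflow's actual implementation.
class Task:
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other: "Task") -> "Task":
        self.downstream.append(other)
        return other  # returning `other` makes chains like a >> b >> c work

start = Task("start")
create_database = Task("create_database")
end = Task("end")

start >> create_database >> end
print([t.task_id for t in start.downstream + create_database.downstream])
```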

&lt;p&gt;Now, if you open the Airflow web UI and open your DAG, you should see something similar to the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1lit4zeng1239j0pbcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1lit4zeng1239j0pbcm.png" alt="Airflow DAG" width="800" height="178"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The green outline you see around each task indicates that the task completed successfully. If, for some reason, any of the tasks fails, you’ll see a red outline around that task, like the example shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgadd7pay5o8ak6tt3by9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgadd7pay5o8ak6tt3by9.png" alt="Airflow DAG Failed DAG" width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Running A DAG
&lt;/h2&gt;

&lt;p&gt;Running a DAG is pretty straightforward, at least for our example as we don’t have any schedule in place. At the top right corner of the Airflow web UI, you’ll see the trigger DAG button, which looks like a play button. Click on that, and you’ll get a couple of options. The first option is to trigger the DAG, and the second is to trigger the DAG with some options. We’ll not look at the second option for now. You can see this in the screenshot below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsp9nh3o0xuc1khuni9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsp9nh3o0xuc1khuni9t.png" alt="Running A DAG" width="297" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click the Trigger DAG option and you should see the DAG trigger now. But for this to happen, make sure you have enabled the DAG first. By default, when Airflow discovers a new DAG, it doesn’t enable it; you have to do that manually the first time. Enabling the DAG is pretty simple. At the top left corner of the Airflow web UI, where you see the DAG ID displayed in large font, you’ll see a toggle to the left of the DAG ID, as shown in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbx7rgf81smw7dn1np6s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbx7rgf81smw7dn1np6s.png" alt="Enabling a DAG" width="666" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That toggle will be off by default. Toggle it on to enable the DAG; it’s as simple as that. If you skip this step and trigger the DAG directly, Airflow will not throw an error or put up a notice that you have to enable it first. This is quite unfortunate, so do remember to do this for all new DAGs.&lt;/p&gt;

&lt;p&gt;When you trigger a DAG, Airflow takes you to the tree view by default. You can see the tree view for our sample DAG in the screenshot below. You can change this behavior to instead take you to the graph view, but that’s a preference you’ll have to discover for yourself. I prefer to switch to the graph view when I trigger a DAG, because as workflows get complex and you add more tasks to them, it becomes difficult to tell which task is running in the tree view. The graph view is much easier, at least for me.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F671npqd0rk6w8ti06bpf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F671npqd0rk6w8ti06bpf.png" alt="Tree View of a DAG" width="669" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If a task fails, you can click on it to get a modal with options to restart the task and all downstream tasks from that point, check the logs, or just mark it as a success. Once you start writing more complex workflows, you’ll get a chance to experiment with the various features of the tool.&lt;/p&gt;




&lt;p&gt;That’s pretty much it. If you want to look at the complete code base for this instead of stitching together all these pieces, have a look at my &lt;a href="https://github.com/contactsunny/apache_airflow_poc"&gt;Github repo&lt;/a&gt; which has the same code. And if you think my understanding of Airflow is wrong anywhere, please leave a comment and correct me.&lt;/p&gt;




&lt;p&gt;And if you like what you see here, or on my &lt;a href="https://medium.com/@contactsunny"&gt;Medium blog&lt;/a&gt; and &lt;a href="//blog.contactsunny.com"&gt;personal blog&lt;/a&gt;, and would like to see more of such helpful technical posts in the future, consider supporting me on &lt;a href="https://www.patreon.com/contactsunny"&gt;Patreon&lt;/a&gt; and &lt;a href="https://github.com/sponsors/contactsunny"&gt;Github&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>bigdata</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Fake (almost) everything with Faker</title>
      <dc:creator>Sunny Srinidhi</dc:creator>
      <pubDate>Fri, 01 Oct 2021 05:48:22 +0000</pubDate>
      <link>https://dev.to/contactsunny/fake-almost-everything-with-faker-38ee</link>
      <guid>https://dev.to/contactsunny/fake-almost-everything-with-faker-38ee</guid>
      <description>&lt;p&gt;I was recently tasked with creating some random customer data, with names, phone numbers, addresses, and the usual other stuff. At first, I thought I'll just generate random strings and numbers (some gibberish) and call it a day. But then I remembered my colleagues using a package for that. I know, there's always a package for everything, well, almost everything.&lt;/p&gt;

&lt;p&gt;Anyway, I thought I'd give it a shot. I've started to write some serious Python code these days and thought it would be a good idea to explore the various packages available for Python. I executed the pip command, downloaded the package, and started generating some random people in my CSV files. It was fun. So I thought I'd document this process because, given my history, I'll definitely forget about Faker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fea0tmemjyh0kr8r0hfjr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fea0tmemjyh0kr8r0hfjr.jpg" alt="Group of people" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;Installing Faker&lt;/h2&gt;

&lt;p&gt;Installing Faker is no different than installing any other Python package using pip. You can use any one of the following commands to install Faker.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;pip install faker
pip3 install faker
python -m pip install faker
python3 -m pip install faker&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Depending on the version of Python you have installed, use the appropriate command to install the Faker package. It shouldn't take more than a couple of minutes.&lt;/p&gt;




&lt;h2&gt;Importing Faker into your code and initialising it&lt;/h2&gt;

&lt;p&gt;Importing the Faker package into your code is also nothing different. Simply add the following import statement at the beginning of your Python file and you should be good to go.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from faker import Faker&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once you have imported the package, you need to create an object of the Faker class. You can do that using the following command. The locale parameter is optional though; you can skip it and you'll totally be fine.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;faker = Faker(locale='en_US')&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Let's look at what it can do first&lt;/h2&gt;

&lt;p&gt;Before we dive into the code, let's have a look at what it can do for us first.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;My name is Mx. Linda Dunn III , I'm a gender neutral person. You can call me at 001-099-311-6470, or email me at caroljohnson@hotmail.com, or visit my home at 2703 Fitzpatrick Squares Suite 785
New Crystal, MN 18112

My name is Dr. John Harris MD , I'm a male. You can call me at (276)611-1727, or email me at combstiffany@brown-rivers.org, or visit my home at 7409 Peterson Locks Apt. 270
South Kimfurt, IL 79246

My name is Dr. Ann Huynh DVM , I'm a female. You can call me at 543.024.8936, or email me at timothy30@shea-poole.com, or visit my home at 5144 Rubio Island
South Kenneth, WI 22855&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is the output of a simple Python script that I wrote to generate fake customer data, or fake people. Looking at this, it's amazing how realistic it looks. And the code I used to get this output is the following:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from faker import Faker

faker = Faker(locale='en_US')

print("My name is %s %s %s , I'm a gender neutral person. You can call me at %s, or email me at %s, or visit my home at %s" % 
    (faker.prefix_nonbinary(), faker.name_nonbinary(), faker.suffix_nonbinary(), faker.phone_number(), faker.ascii_free_email(), faker.address())
)

print("My name is %s %s %s , I'm a male. You can call me at %s, or email me at %s, or visit my home at %s" % 
    (faker.prefix_male(), faker.name_male(), faker.suffix_male(), faker.phone_number(), faker.ascii_company_email(), faker.address())
)

print("My name is %s %s %s , I'm a female. You can call me at %s, or email me at %s, or visit my home at %s" % 
    (faker.prefix_female(), faker.name_female(), faker.suffix_female(), faker.phone_number(), faker.company_email(), faker.address())
)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can see now how easy it is to generate large numbers of fake customers, for testing of course. And the fun doesn't end here. There's a lot more where that came from. For example, you can generate a whole company:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;The company I just created!
David PLC
Providers of Horizontal value-added knowledge user
Phone: 001-891-255-4642x93803
Email: ksanchez@cochran.com
234 Torres Ports
West Rhonda, AL 96210&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see from the output above, we provide some great horizontal value-added knowledge user. That's supposed to be the company's catch phrase. &lt;/p&gt;

&lt;p&gt;And I kid you not, there's a method called &lt;strong&gt;bs()&lt;/strong&gt;. I don't know when you'd ever use it, but you can call Faker's bs() any time you want. See what I did there?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb564xqcwz67xeygv83h4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb564xqcwz67xeygv83h4.png" alt="faker bs() method" width="278" height="164"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;How does this help?&lt;/h2&gt;

&lt;p&gt;Well, I thought you'd have already figured that part out. Anyway, when you need data to test with, and you need that data to be as true to reality (or as realistic) as possible, you can use Faker to easily and quickly generate it. &lt;/p&gt;

&lt;p&gt;Actually, I'm not sure about the "quickly" part of my last sentence. It's definitely easy to generate the data. But generating one million customer records with first name, last name, email, phone, etc. took almost 350 seconds on a 2019 16-inch base model MacBook Pro. So make of it what you will. &lt;/p&gt;




&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;Nonetheless, this is definitely a very handy and fun package to have in your arsenal. You can generate any number of customers or friends (swing however you swing) very easily, with a complete offline and online profile for each person. You can generate home phone and email, work phone and email, home address, work address, interests, profiles, credit cards, license plate numbers, and a lot more. So do head over to the &lt;a href="https://github.com/joke2k/faker?ref=pythonrepo.com" rel="noreferrer noopener"&gt;package's Github repo&lt;/a&gt;, take a look around, and take it for a spin. The source code is pretty easy to understand as well.&lt;/p&gt;

&lt;p&gt;And if you like what you see here, or on my &lt;a rel="noreferrer noopener" href="https://medium.com/@contactsunny"&gt;Medium blog&lt;/a&gt;, and would like to see more of such helpful technical posts in the future, consider supporting me on &lt;a rel="noreferrer noopener" href="https://www.patreon.com/contactsunny"&gt;Patreon&lt;/a&gt; and &lt;a rel="noreferrer noopener" href="https://github.com/sponsors/contactsunny"&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.patreon.com/bePatron?u=28955887"&gt;Become a Patron!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>bigdata</category>
      <category>datascience</category>
      <category>programming</category>
    </item>
    <item>
      <title>out() vs. outE() – JanusGraph and Gremlin</title>
      <dc:creator>Sunny Srinidhi</dc:creator>
      <pubDate>Wed, 03 Mar 2021 03:24:43 +0000</pubDate>
      <link>https://dev.to/contactsunny/out-vs-oute-janusgraph-and-gremlin-m5n</link>
      <guid>https://dev.to/contactsunny/out-vs-oute-janusgraph-and-gremlin-m5n</guid>
      <description>&lt;p&gt;If you are new to JanusGraph and the Gremlin query language, like I am, you would be confused about the &lt;code&gt;out()&lt;/code&gt;, &lt;code&gt;outE()&lt;/code&gt;, &lt;code&gt;in()&lt;/code&gt;, and &lt;code&gt;inE()&lt;/code&gt; methods. If you look at examples of these functions, you’ll not be able to comprehend the difference easily. Or is it just me?&lt;/p&gt;

&lt;p&gt;Anyway, I got confused and it took me a while to understand there is a difference, and there isn’t. Let me explain.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Sample Graph
&lt;/h1&gt;

&lt;p&gt;Before we look at the differences, let’s look at a sample graph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpo1lyxrh1fmy12tt89js.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpo1lyxrh1fmy12tt89js.png" alt="Sample JanusGraph graph" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see from the graph above, we have four vertices and three edges. The vertex in the middle with the property "name": "sunny" is the vertex from where we’ll start our traversal. The other three vertices are the items that I bought from an e-commerce website. They are a smartphone, a laptop, and a monitor. The relationship is represented with edges labelled bought.&lt;/p&gt;

&lt;p&gt;The edges have another property called count, and as you can tell, they represent the number of times I have bought these items. So I bought three smartphones, two laptops, and one monitor. This is the data we’re going to work with.&lt;/p&gt;

&lt;p&gt;Now, we’ll first get a reference to our starting vertex with the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sunny = g.V().has('name', 'sunny').next()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now have all the data we need to understand the difference between these functions.&lt;/p&gt;




&lt;h1&gt;
  
  
  &lt;code&gt;out()&lt;/code&gt; vs. &lt;code&gt;outE()&lt;/code&gt;
&lt;/h1&gt;

&lt;p&gt;We already know that we use the &lt;code&gt;outE()&lt;/code&gt; function to traverse an edge that is going out of the current vertex. We pass in the label of one or more edges to the function. From our e-commerce example, if I want to get all the items that I have bought, I’ll run the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g.V(sunny).outE('bought').inV()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This would give us all the vertices which have a ‘bought’ relationship with the current vertex. But you’d have also seen the following query for the same use case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g.V(sunny).out('bought')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, they are performing the same traversal and returning the same results. I found out that when you’re using the &lt;code&gt;outE().inV()&lt;/code&gt; combination, you can simply replace it with &lt;code&gt;out()&lt;/code&gt;. It’s a shorthand, or an alias, for the long form &lt;code&gt;outE().inV()&lt;/code&gt;. But then, why would you use &lt;code&gt;outE()&lt;/code&gt; at all?&lt;/p&gt;

&lt;p&gt;Suppose you want to filter or limit the traversal based on other properties of the edge. For example, in our sample graph, I want to get only the items that I have bought more than once. We have the count property on each of our bought edges, and we can use that to filter our vertices. For this, the query is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g.V(sunny).outE('bought').has('count', gt(1)).inV()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we can use the &lt;code&gt;has()&lt;/code&gt; function on edges as well, to filter out edges with a particular property. This ability to filter is not available when you use the &lt;code&gt;out()&lt;/code&gt; function, because the result of the &lt;code&gt;out()&lt;/code&gt; function is vertices. So if you call the &lt;code&gt;has()&lt;/code&gt; function on that result, you’ll be filtering on the vertices and not the edges. I hope I’m not complicating things.&lt;/p&gt;
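&lt;p&gt;To make the contrast concrete, here’s an illustrative query the other way around. Assuming the item vertices also carry a name property, the &lt;code&gt;has()&lt;/code&gt; here applies to the item vertices, not to the bought edges:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g.V(sunny).out('bought').has('name', 'laptop')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;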




&lt;h1&gt;
  
  
  &lt;code&gt;in()&lt;/code&gt; vs. &lt;code&gt;inE()&lt;/code&gt;
&lt;/h1&gt;

&lt;p&gt;It’s the same story with the &lt;code&gt;in()&lt;/code&gt; and &lt;code&gt;inE()&lt;/code&gt; functions as well. If you want to filter edges based on extra properties, you use the &lt;code&gt;inE()&lt;/code&gt; function instead of &lt;code&gt;in()&lt;/code&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  There’s more…
&lt;/h1&gt;

&lt;p&gt;Using the &lt;code&gt;outE()&lt;/code&gt; or &lt;code&gt;inE()&lt;/code&gt; functions gives you access to more functions that can be used on edges, such as aliasing them using the &lt;code&gt;as()&lt;/code&gt; function, the &lt;code&gt;count()&lt;/code&gt; function, etc. You can have a look at the documentation to see the list of all functions available on edges.&lt;/p&gt;
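&lt;p&gt;As a small illustrative example, you can alias the edge with &lt;code&gt;as()&lt;/code&gt; and select it back later in the traversal to read its properties:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g.V(sunny).outE('bought').as('purchase').inV().select('purchase').values('count')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;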




&lt;p&gt;I hope this has not confused you more than you already are. I thought this would clear things up for at least a few people who have the same questions as me when they’re still getting started with JanusGraph and Gremlin. Let me know if this helped, or didn’t.&lt;/p&gt;




&lt;p&gt;And if you like what you see here, or on my &lt;a href="https://blog.contactsunny.com"&gt;personal&lt;/a&gt; or &lt;a href="https://medium.com/@contactsunny"&gt;Medium&lt;/a&gt; blogs, and would like to see more of such helpful technical posts in the future, consider supporting me on &lt;a href="https://www.patreon.com/contactsunny"&gt;Patreon&lt;/a&gt; and &lt;a href="https://github.com/sponsors/contactsunny"&gt;Github&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>janusgraph</category>
      <category>graph</category>
      <category>gremlin</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Getting Started With JanusGraph</title>
      <dc:creator>Sunny Srinidhi</dc:creator>
      <pubDate>Thu, 25 Feb 2021 11:14:16 +0000</pubDate>
      <link>https://dev.to/contactsunny/getting-started-with-janusgraph-3cm4</link>
      <guid>https://dev.to/contactsunny/getting-started-with-janusgraph-3cm4</guid>
      <description>&lt;p&gt;JanusGraph is a graph processing tool that can process graphs stored on clusters with multiple nodes. JanusGraph is designed for massive clusters and for real-time traversals and analytics queries.&lt;/p&gt;

&lt;p&gt;In this post, we’ll look at a few queries that you would want to run the very first time you install JanusGraph and start playing with the Gremlin console. I did this just yesterday, so it’s still fresh in my memory. I saw the need to write this as I didn’t find a few of my questions answered in the official documentation, and I had to hunt down the information on the world wide web, which took some quality time. So, let’s get started.&lt;/p&gt;




&lt;h1&gt;
  
  
  Installing JanusGraph
&lt;/h1&gt;

&lt;p&gt;Installing JanusGraph is pretty straightforward, almost. I did face one issue and some confusion while I was trying to get the servers up on my local machine. But upon research, I found out that many people are facing the same issue, and that it could be a bug. I didn’t have enough time to get into the weeds, so I just did what I had to do to get it started.&lt;/p&gt;

&lt;p&gt;To begin with, the official documentation is good enough for instructions on how to install and get the servers up. The issue I was talking about is that you need both Cassandra and Elasticsearch running to get the Gremlin server up. You can of course use alternatives to both Cassandra (such as HBase, Google Cloud Bigtable, Berkeley DB, or in-memory) and Elasticsearch (such as Lucene, Solr, etc.), but I went the easiest route to get started with it.&lt;/p&gt;

&lt;p&gt;The documentation says that you can skip mixed index backends (Elasticsearch, Solr, Lucene, etc.) and use only storage backends, but it doesn’t work. You need a mixed index backend along with a storage backend. So I had to download Elasticsearch as well and run it locally to get the Gremlin server up. This was a bit inconvenient, but no big deal, I guess. Now that we have all the moving parts moving as expected, let’s look at the data we’re going to play with.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Sample Dataset
&lt;/h1&gt;

&lt;p&gt;Here again I took the easy route of using the sample dataset that ships with JanusGraph. It’s an interesting one, called The Graph of the Gods. This data model, as seen below, is called a Property Graph Model. This particular property graph model shows a few Gods from the Roman pantheon, with their names, age, pets, places of living, and a few other relationships.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sz6iiqrqbhitgyqkxnk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sz6iiqrqbhitgyqkxnk.png" alt="property-graph-model-janusgraph" width="768" height="698"&gt;&lt;/a&gt;&lt;br&gt;
Source: JanusGraph&lt;br&gt;
You can see that there are a few special symbols in the graph. They have special meanings, of course.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7o6j9rq26t5dodwawdum.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7o6j9rq26t5dodwawdum.png" alt="property graph model meaning" width="800" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we have the graph and know what the different edges mean, let’s see if we can traverse the graph. To do that, we first need to get a starting point, which is one vertex. We can use the &lt;code&gt;.has("property", "value")&lt;/code&gt; method on the graph to find a vertex with a particular value. I’ll again go with the official example here.&lt;/p&gt;

&lt;p&gt;We’ll start by getting the reference of the vertex with the “name” being “saturn.” For this, run the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;saturn = g.V().has('name', 'saturn').next()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty straightforward. Now that we have the reference to the vertex in the &lt;code&gt;saturn&lt;/code&gt; variable, let’s see how we can traverse to Saturn’s grandkid, Hercules. To find the grandson of Saturn, we first need to find the son of Saturn. From the graph, we can see that there’s an incoming edge called “father” from Jupiter to Saturn. We can use this edge to traverse further. Because we have an incoming edge, we’ll use the &lt;code&gt;in()&lt;/code&gt; method on the &lt;code&gt;saturn&lt;/code&gt; variable. Following is the query for this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hercules = g.V(saturn).in('father').in('father')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;in()&lt;/code&gt; method accepts one parameter, which is the edge that has to be traversed. In this example, it’s the ‘father’ edge. So from Saturn, we go to the next vertex which has a ‘father’ edge coming into Saturn, which in this case is Jupiter. From there, we again have the same traversal – &lt;code&gt;in('father')&lt;/code&gt;. This will take us to the son of Jupiter, which is Hercules.&lt;/p&gt;

&lt;p&gt;If you want to go the other way, that is, finding the grandfather of Hercules, you can run the following queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hercules = g.V().has('name', 'hercules').next()
saturn = g.V(hercules).out('father').out('father')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, we’ve used the &lt;code&gt;out()&lt;/code&gt; method here instead of &lt;code&gt;in()&lt;/code&gt;. This is because we are traversing on the edge ‘father’ which is going out of the &lt;code&gt;hercules&lt;/code&gt; vertex. It took me a while to understand when to use which method, but it’s very logical once I figured it out.&lt;/p&gt;




&lt;h1&gt;
  
  
  Loops
&lt;/h1&gt;

&lt;p&gt;You would’ve observed that in the last couple of queries, we used the &lt;code&gt;in('father')&lt;/code&gt; and &lt;code&gt;out('father')&lt;/code&gt; methods twice in succession. This is a clear example where loops can be used instead of chaining the same method multiple times. And with JanusGraph, you can. There’s a simple &lt;code&gt;repeat()&lt;/code&gt; method that makes this very easy. Below are the same queries rewritten with the &lt;code&gt;repeat()&lt;/code&gt; method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hercules = g.V(saturn).repeat(__.in('father')).times(2).next()
saturn = g.V(hercules).repeat(__.out('father')).times(2).next()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I don’t think it could be easier than this.&lt;/p&gt;




&lt;h1&gt;
  
  
  Selecting Multiple Vertices
&lt;/h1&gt;

&lt;p&gt;Sometimes you’d want to query for multiple vertices in a single query. For example, if you want to query for the parents of Hercules, you’ll have to query one parent first, and then the other. You can imagine how this could slow things down when you are working with a distributed graph. To make things easier and quicker, you can pass a comma-separated list of edge labels to the &lt;code&gt;out()&lt;/code&gt; method. So if you want to query the parents of Hercules, you can just do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g.V(hercules).out('father', 'mother').values('name')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So elegant.&lt;/p&gt;




&lt;p&gt;As I mentioned at the beginning of this post, this is just getting started. There’s a lot more to JanusGraph and Gremlin. I’ll be posting more about these two as and when I learn new things. So watch out for this space if you are interested.&lt;/p&gt;




&lt;p&gt;And if you like what you see here, or on my &lt;a href="https://blog.contactsunny.com"&gt;personal&lt;/a&gt; or &lt;a href="https://medium.com/@contactsunny"&gt;Medium&lt;/a&gt; blogs, and would like to see more of such helpful technical posts in the future, consider supporting me on &lt;a href="https://www.patreon.com/contactsunny"&gt;Patreon&lt;/a&gt; and &lt;a href="https://github.com/sponsors/contactsunny"&gt;Github&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>janusgraph</category>
      <category>bigdata</category>
      <category>graph</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A few basic (but powerful) ImageMagick commands</title>
      <dc:creator>Sunny Srinidhi</dc:creator>
      <pubDate>Sat, 06 Feb 2021 13:04:02 +0000</pubDate>
      <link>https://dev.to/contactsunny/a-few-basic-but-powerful-imagemagick-commands-1gdc</link>
      <guid>https://dev.to/contactsunny/a-few-basic-but-powerful-imagemagick-commands-1gdc</guid>
      <description>&lt;h6&gt;
  
  
  Originally published on &lt;a href="https://blog.contactsunny.com/tech/a-few-basic-but-powerful-imagemagick-commands"&gt;my personal blog&lt;/a&gt;
&lt;/h6&gt;

&lt;p&gt;If you’re not sure what ImageMagick is, it’s one of the greatest tools you could have on your computer for manipulating images and a few other types of files. In this post, I’m going to list a few of the commands which come in very handy in a variety of situations.&lt;/p&gt;

&lt;p&gt;Recently I got the opportunity to work on a project where a lot of images had to be manipulated — changing the resolution and keeping the file size in check. I wrote the script in PHP and ImageMagick was the weapon of choice. Before you read any further, if you are not aware of ImageMagick and the features it offers, go through their website first.&lt;/p&gt;

&lt;p&gt;Also, I’d like to point out that this is more for personal reference, so that after a few months when I forget the commands or need to manipulate more images, I can come back here and get all the commands I need.&lt;/p&gt;

&lt;p&gt;One more thing, I work with Ubuntu — my personal laptop runs Ubuntu, my work laptop runs Ubuntu, my personal AWS EC2 instance runs Ubuntu, and almost all of the cloud computers I work on run Ubuntu. So the commands given below are all tested on Ubuntu (16.04 to be precise).&lt;/p&gt;

&lt;p&gt;Let’s get started then, shall we?&lt;/p&gt;




&lt;h1&gt;
  
  
  Change image resolution
&lt;/h1&gt;

&lt;p&gt;This is, I think, one of the most basic image manipulations you’d do. And therefore, the command is also pretty simple to understand and remember, even for a non-techie.&lt;/p&gt;

&lt;p&gt;Consider we have an image called image.jpg which is around 2816×2112 pixels, that’s like a 6MP image. If you want to reduce that to say around 1024×768 pixels, you just do it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;convert image.jpg -resize 1024x768 output_file.jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that’s it. You’ll have a new image called output_file.jpg whose resolution is somewhere close to 1024×768 pixels. I say close to that resolution because, by default, ImageMagick tries to maintain the aspect ratio. So if the resolution that you specify doesn’t maintain the original aspect ratio of the image, ImageMagick will use a resolution which is close to what you specified while maintaining the aspect ratio.&lt;/p&gt;

&lt;p&gt;But this default behaviour is easy to override. You can use the ‘!’ symbol with the resolution to tell ImageMagick that the resolution has to be maintained strictly. So the command would now look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;convert image.jpg -resize 1024x768! output_file.jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But this might not always work the way you expect, because the ! symbol has a special meaning in *nix shells (history expansion). To make sure nothing funny happens and this works as expected, you need to “escape” the ! in the command. You can do that by adding a ‘\’ in front of the ! like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;convert image.jpg -resize 1024x768\! output_file.jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it. If your input file image.jpg is not corrupt, you’ll now have a new image with the resolution that you specified. It’s that simple.&lt;/p&gt;
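&lt;p&gt;One more variant worth knowing before we move on: if you don’t want to work out pixel dimensions at all, -resize also accepts a percentage. For example, to shrink an image to half its original width and height (the 50% here is just an example value):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;convert image.jpg -resize 50% output_file.jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The aspect ratio is naturally preserved here, since both dimensions shrink by the same factor.&lt;/p&gt;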




&lt;h1&gt;
  
  
  Reducing an image’s file size, or reducing the quality of an image
&lt;/h1&gt;

&lt;p&gt;If you’re a developer who builds web apps with a lot of images, you know the importance of file size. You can’t have a 5MB image on your webpage, which will take ages to download. That makes for a horrible user experience. The solution is to keep the file size as small as possible. When it comes to images, “compressing” an image is the most common technique to achieve a smaller file size. This can be done in two ways — reduce the image resolution (which we already discussed), or reduce the “quality” of the image.&lt;/p&gt;

&lt;p&gt;Since we already discussed how to reduce the resolution of an image, let’s see how you can reduce its quality. ImageMagick provides an easy option with its convert command to set the quality of an image. You can do it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;convert image.jpg -quality 75 output_file.jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command can shrink a 3MB file to less than 300KB. The quality value 75 in the command can roughly be thought of as reducing the quality of the image to 75%, or by 25%.&lt;/p&gt;

&lt;p&gt;For JPG (JPEG) images, the quality ranges from 1 to 100. 1 gives the highest compression but the lowest image quality, while 100 gives the highest image quality but very little compression. If ImageMagick can determine the quality of the input file, it uses that as the default; if not, the default value is 92.&lt;/p&gt;

&lt;p&gt;Experts also say that 75 is the best trade-off between image quality and compression, and in most cases it gets the job done. For other image types (such as PNG), the -quality option takes a number of other values. For more on that, take a look at the official documentation, and you’ll be amazed at the possibilities.&lt;/p&gt;
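&lt;p&gt;As a side note, ImageMagick’s identify command can estimate the quality an existing JPEG was saved at — %Q is its format specifier for that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;identify -format '%Q' image.jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is handy for deciding whether re-compressing a file is even worth it.&lt;/p&gt;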




&lt;h1&gt;
  
  
  Create a collage of images (montage)
&lt;/h1&gt;

&lt;p&gt;Another use case is creating collages of images. We see these kinds of images a lot on social networks these days. If you’re the kind of person who makes collages, you’ll love this command. No more uploading images to websites, waiting for them to convert, and downloading them again. You don’t even have to hand your images over to anybody else now.&lt;/p&gt;

&lt;p&gt;Say you have four images and you want to create a 2×2 collage of those four images. You use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;montage image1.jpg image2.jpg image3.jpg image4.jpg output_montage.jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create a new image called output_montage.jpg, but it might not look how you expected it to — it’ll have four very small images arranged in a 2×2 matrix. The images get resized because of the default options that the montage command uses, which are “-geometry 120x120&amp;gt;+4+3”. Now what does that mean? The “&amp;gt;” tells montage to shrink any input image larger than 120x120 pixels down to fit within 120x120, the +4+3 adds spacing around each tile, and those thumbnails are what make up the collage. To override this behaviour, you can just provide the -geometry option yourself with the command.&lt;/p&gt;

&lt;p&gt;There’s also a -tile command, using which you can specify how many tiles you want. But if you aren’t specific about this and just want to have a collage of your images, you can let ImageMagick figure out the best possible tile arrangement.&lt;/p&gt;
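&lt;p&gt;If you do want control over the arrangement, here’s a quick sketch combining -tile and -geometry to force a strict 2×2 grid (the file names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;montage image1.jpg image2.jpg image3.jpg image4.jpg -tile 2x2 -geometry +2+2 output_montage.jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;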

&lt;p&gt;So, to get the best possible collage of your four images, use this command with the -geometry option:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;montage image1.jpg image2.jpg image3.jpg image4.jpg -geometry +2+2 output_montage.jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I’m sure you’ll like the output. Let me know. 🙂&lt;/p&gt;


&lt;h1&gt;
  
  
  Convert an image to a PDF
&lt;/h1&gt;

&lt;p&gt;This is another task we come across frequently. And I must say, converting images to PDFs is one of the simplest things ImageMagick does. Let’s suppose you want to convert that awesome 2×2 collage you created earlier to a PDF. How do you do it? Like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;convert output_montage.jpg output_montage.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wasn’t that easy? Go ahead, try it.&lt;/p&gt;
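&lt;p&gt;And it’s not limited to one image: pass several images and convert will produce a single PDF with one page per image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;convert image1.jpg image2.jpg image3.jpg output.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One caveat: some distributions ship an ImageMagick security policy that blocks PDF output, so if you see a “not authorized” error, that policy file is the place to look.&lt;/p&gt;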




&lt;h1&gt;
  
  
  Bonus: Merging PDF files
&lt;/h1&gt;

&lt;p&gt;You might think that you could easily merge several PDF files into one using the convert command like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;convert pdf1.pdf pdf2.pdf output_pdf.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You actually can. But don’t! The problem is that convert rasterizes each page of the PDF into an image, so the quality drops and the file size balloons — if your input files are 10MB each, the output can end up close to 1GB. This is very, very bad. So to merge several PDF files, I’ll give you another command using another tool called GhostScript. Most *nix systems come with GhostScript pre-installed. If you don’t have it already, head over to their website for installation instructions.&lt;/p&gt;

&lt;p&gt;Now, you can easily merge PDF files into one file using the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=temp.pdf pdf1.pdf pdf2.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don’t worry about what each option means, unless you’re a techie and interested in how it works. The GhostScript website has all the information you’d need.&lt;/p&gt;
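&lt;p&gt;One gs option that is worth knowing, though: -dPDFSETTINGS lets GhostScript downsample embedded images while it merges, which shrinks the output even further. The /ebook preset targets roughly 150 dpi (this sketch reuses the same input files as above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -sOutputFile=temp.pdf pdf1.pdf pdf2.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;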

&lt;p&gt;Most programming languages have ImageMagick support, either built in or through third-party packages. So you can make use of these commands in your projects the next time you have to work with images.&lt;/p&gt;

&lt;p&gt;ImageMagick can work with blobs as well, so you can take the base64 string of an image and work with that directly. It’s really versatile and handy.&lt;/p&gt;
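&lt;p&gt;For instance, on the command line, convert can read from stdin and write to stdout if you use “-” in place of a file name, with an optional format prefix. So you can pipe image data straight through it without touching the disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;convert jpg:- -resize 50% png:- &amp;lt; image.jpg &amp;gt; output.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;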




&lt;p&gt;I hope this was of some use. Let me know in the comments if I made any mistake here, or if you have any more cool commands.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>tooling</category>
      <category>editing</category>
      <category>imagemagick</category>
    </item>
  </channel>
</rss>
