<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ashish Mishra</title>
    <description>The latest articles on DEV Community by Ashish Mishra (@arglee).</description>
    <link>https://dev.to/arglee</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1017537%2F7cdbf876-5d8f-4470-8d55-35eec76d4436.jpeg</url>
      <title>DEV Community: Ashish Mishra</title>
      <link>https://dev.to/arglee</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arglee"/>
    <language>en</language>
    <item>
      <title>Chrome Extensions using Vite + Typescript + React: Stepwise Process</title>
      <dc:creator>Ashish Mishra</dc:creator>
      <pubDate>Tue, 24 Oct 2023 02:32:35 +0000</pubDate>
      <link>https://dev.to/arglee/chrome-extensions-using-vite-typescript-react-stepwise-process-2ddp</link>
      <guid>https://dev.to/arglee/chrome-extensions-using-vite-typescript-react-stepwise-process-2ddp</guid>
      <description>&lt;p&gt;Chrome extensions come in really handy when it comes to blocking ads, improving productivity, managing cluttered tabs, and of course improving the readability of code on GitHub for developers.❤️&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Fc9XKCRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AsWqccE_AfTucc7Z8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Fc9XKCRf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AsWqccE_AfTucc7Z8.png" alt="" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image Courtesy: singlequote.blog&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this blog, let’s create a Chrome extension in an easy 4-step process using &lt;a href="https://react.dev/"&gt;React&lt;/a&gt;, &lt;a href="https://www.typescriptlang.org/"&gt;Typescript&lt;/a&gt;, and &lt;a href="https://vitejs.dev/guide/why.html"&gt;Vite bundler&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This tutorial assumes Node.js is already installed; if not, you can follow this guide to &lt;a href="https://singlequote.blog/chrome-extension-using-node-rollup-plugin-stepwise-process/"&gt;set up Node.js and dependencies on the development/local machine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now that we are all set, let’s begin!&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Initialise a new Project
&lt;/h3&gt;

&lt;p&gt;Create a new project using Vite.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm create vite@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will prompt for a few inputs from the user:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Project name:&lt;/strong&gt; The name you want to give your project.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select a framework:&lt;/strong&gt; Choose ‘React’, as that is what this tutorial uses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Select a variant:&lt;/strong&gt; Choose ‘TypeScript’ to follow along with this tutorial.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MNMOBOo5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/558/0%2ATx52vye-FcK2X1aM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MNMOBOo5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/558/0%2ATx52vye-FcK2X1aM.png" alt="" width="558" height="224"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image Courtesy: singlequote.blog&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Install Dependencies &amp;amp; Run Application on Local
&lt;/h3&gt;

&lt;p&gt;Now change the directory to the created/initialized folder and install the dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; cd vite-ts-react-test
&amp;gt; npm install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, to test that everything worked, run the following command in the terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will print a message on the command line similar to the screenshot below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5-irgMUx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/684/0%2AUCdwuiNdCVvVxbaF.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5-irgMUx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/684/0%2AUCdwuiNdCVvVxbaF.png" alt="" width="684" height="302"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image Courtesy: singlequote.blog&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This signifies that everything is running fine so far. Once you open the address shown above — &lt;a href="http://localhost:5173/"&gt;http://localhost:5173/&lt;/a&gt; — in the browser, you will see the Vite welcome page, something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---ZTO2PQD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/918/0%2AU_PnruXEZpWBnkim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---ZTO2PQD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/918/0%2AU_PnruXEZpWBnkim.png" alt="" width="800" height="481"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image Courtesy: singlequote.blog&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Next, exit the Vite dev server using ‘CTRL+C’ and run the command below to build the project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm run build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kALPUZwo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/924/0%2AnnZ1RvP6rQTkmj5I.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kALPUZwo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/924/0%2AnnZ1RvP6rQTkmj5I.png" alt="" width="800" height="365"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image Courtesy: singlequote.blog&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Output similar to the above will appear in the terminal.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Create a Chrome Extension and Validate it
&lt;/h3&gt;

&lt;p&gt;At the end of step 2, the boilerplate will be ready inside the project directory. You will see a lot of files inside the directory, but we are only interested in a few of them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;dist:&lt;/strong&gt; The “&lt;em&gt;build&lt;/em&gt;” command creates this folder, copying files from other folders as instructed in the config file. The default configuration works for this tutorial, so we will not touch anything inside it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;public:&lt;/strong&gt; We will add our static files to this folder, and the “&lt;em&gt;build&lt;/em&gt;” command will copy them into the dist folder once the build succeeds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;src:&lt;/strong&gt; This is where the magic happens: it holds the application code, and it is where we will write the TypeScript shown later in this blog.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GdfHzgG7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/451/0%2A0OKcuFLdYTYhRPp4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GdfHzgG7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/451/0%2A0OKcuFLdYTYhRPp4.png" alt="" width="451" height="408"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image Courtesy: singlequote.blog&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now go to the &lt;strong&gt;&lt;em&gt;public&lt;/em&gt;&lt;/strong&gt; folder and create a new file, &lt;strong&gt;&lt;em&gt;manifest.json&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "manifest_version": 3,
    "name": "vite-ts-react-test",
    "version": "1.0",
    "description": "",
    "action": {
        "default_popup": "index.html"
    },
    "permissions": [
        "scripting",
        "tabs",
        "activeTab"
    ],
    "host_permissions": [
        "https://*/*",
        "http://*/*"
    ],
    "icons": {
        "16": "images/16x16.png",
        "32": "images/32x32.png"
    }
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the command below to build the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm run build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now go to chrome://extensions, enable Developer mode if it is not already on, and click “Load unpacked” to pick the &lt;strong&gt;&lt;em&gt;dist&lt;/em&gt;&lt;/strong&gt; folder from the local file system.&lt;/p&gt;

&lt;p&gt;Voila!! Our Chrome Extension is ready to try.&lt;/p&gt;

&lt;p&gt;If you have followed along to this point, the new extension will appear in your extensions list. Go ahead and click the extension in the Chrome toolbar, and you will see the Vite welcome page mentioned &lt;a href="https://singlequote.blog/chrome-extension-using-vite-typescript-react-stepwise-process/#Vite-Welcome-page-singlequoteblog"&gt;above on this page&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Scripting in Chrome Extension
&lt;/h3&gt;

&lt;p&gt;In this step, let’s do some scripting to see that it works. Go to the &lt;em&gt;“src”&lt;/em&gt; folder to write some TypeScript code.&lt;/p&gt;

&lt;p&gt;In this script, we will use the Chrome API to change the background color of the current web page.&lt;/p&gt;

&lt;p&gt;Let’s install the Chrome API type definitions using the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install -D @types/chrome
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now open the src/App.tsx file in the project directory and change the code as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// import { useState } from 'react'
import reactLogo from './assets/react.svg'
import viteLogo from '/vite.svg'
import './App.css'

function App() {
  // const [count, setCount] = useState(0) 
  const changeColorOnClick = async () =&amp;gt; {
    let [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
    chrome.scripting.executeScript({
      target: { tabId: tab.id! },
      func: () =&amp;gt; {
        document.body.style.backgroundColor = 'green';
      }
    });
  }
  return (
    &amp;lt;&amp;gt;
      &amp;lt;div&amp;gt;
        &amp;lt;a href="https://vitejs.dev" target="_blank"&amp;gt;
          &amp;lt;img src={viteLogo} className="logo" alt="Vite logo" /&amp;gt;
        &amp;lt;/a&amp;gt;
        &amp;lt;a href="https://react.dev" target="_blank"&amp;gt;
          &amp;lt;img src={reactLogo} className="logo react" alt="React logo" /&amp;gt;
        &amp;lt;/a&amp;gt;
      &amp;lt;/div&amp;gt;
      &amp;lt;h1&amp;gt;Vite + React&amp;lt;/h1&amp;gt;
      &amp;lt;div className="card"&amp;gt;
        &amp;lt;button onClick={() =&amp;gt; changeColorOnClick()}&amp;gt;
         Change Color
        &amp;lt;/button&amp;gt;
        &amp;lt;p&amp;gt;
          Edit &amp;lt;code&amp;gt;src/App.tsx&amp;lt;/code&amp;gt; and save to test HMR
        &amp;lt;/p&amp;gt;
      &amp;lt;/div&amp;gt;
      &amp;lt;p className="read-the-docs"&amp;gt;
        Click on the Vite and React logos to learn more
      &amp;lt;/p&amp;gt;
    &amp;lt;/&amp;gt;
  )
}
export default App
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you look closely, most of this code was already there. The only new addition is the function “&lt;strong&gt;&lt;em&gt;changeColorOnClick&lt;/em&gt;&lt;/strong&gt;”.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const changeColorOnClick = async () =&amp;gt; {
    let [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
    chrome.scripting.executeScript({
      target: { tabId: tab.id! },
      func: () =&amp;gt; {
        document.body.style.backgroundColor = 'green';
      }
    });
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now don’t forget to build it again and refresh the extension from chrome://extensions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm run build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and it’s done!!&lt;/p&gt;

&lt;p&gt;Give it a try by visiting any website with a mostly white background, like &lt;a href="http://singlequote.blog/"&gt;singlequote.blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zm2fJuU6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/600/0%2AyfNDERWCqPu6OJa8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zm2fJuU6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/600/0%2AyfNDERWCqPu6OJa8.gif" alt="" width="600" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tips and tricks&lt;/strong&gt;: Running the &lt;strong&gt;&lt;em&gt;build&lt;/em&gt;&lt;/strong&gt; command after every change is tiresome and error-prone, and if you forget to refresh the extension you will wonder why your changes are not showing up. Rollup can rescue us here; read about &lt;a href="https://singlequote.blog/chrome-extension-using-node-rollup-plugin-stepwise-process/"&gt;Chrome Extension using Node + Rollup plugin: Stepwise Process&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>javascripttips</category>
      <category>vitejs</category>
      <category>chromeextension</category>
    </item>
    <item>
      <title>Terminal/Commandline Trick: Multiprocessing Progress Bar — Python atpbar</title>
      <dc:creator>Ashish Mishra</dc:creator>
      <pubDate>Sat, 02 Sep 2023 14:43:20 +0000</pubDate>
      <link>https://dev.to/arglee/terminalcommandline-trick-multiprocessing-progress-bar-python-atpbar-4fl5</link>
      <guid>https://dev.to/arglee/terminalcommandline-trick-multiprocessing-progress-bar-python-atpbar-4fl5</guid>
      <description>&lt;h3&gt;
  
  
  Terminal/Commandline Trick: Multiprocessing Progress Bar — Python atpbar
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N8MPK9d7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2An68QjylBz7DwaYv-.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N8MPK9d7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2An68QjylBz7DwaYv-.jpeg" alt="" width="800" height="478"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Courtesy: Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this blog, we will discuss a simple yet very useful library for Python terminal/command-line use cases. We often run command-line Python scripts that process a high volume of data or files, and we want a way to track how many files have been processed and how fast each process is moving.&lt;/p&gt;

&lt;p&gt;Command-line progress bars are here to rescue us. There are multiple progress bars available, and you can &lt;a href="http://singlequote.blog"&gt;read more about open-source Python command-line progress bars&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  How to create a progress bar on the command line in Python?
&lt;/h3&gt;

&lt;p&gt;In this blog, we will discuss &lt;a href="https://pypi.org/project/atpbar/"&gt;atpbar&lt;/a&gt;, a multiprocessing-enabled progress bar for the Python terminal. atpbar provides the following features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy to install.&lt;/li&gt;
&lt;li&gt;A minimalistic progress bar without any fancy UX, so it is quite simple to use.&lt;/li&gt;
&lt;li&gt;Compatible with multiprocessing and multithreading.&lt;/li&gt;
&lt;li&gt;Every subprocess or thread can be given its own name.&lt;/li&gt;
&lt;li&gt;Multiple terminal progress bars grow simultaneously to show the progress of loops running in &lt;a href="https://docs.python.org/3/library/threading.html"&gt;threading&lt;/a&gt; or &lt;a href="https://docs.python.org/3/library/multiprocessing.html"&gt;multiprocessing&lt;/a&gt; tasks.&lt;/li&gt;
&lt;li&gt;Compatible with &lt;a href="https://jupyter.org/"&gt;Jupyter Notebook&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;On terminals where the progress bar cannot be drawn, it can report the status as numbers instead.&lt;/li&gt;
&lt;li&gt;atpbar is an iterable that wraps another iterable and shows progress bars for both outer and inner iterations.&lt;/li&gt;
&lt;li&gt;If a loop exits early via a break or an exception, the progress bar stops right there.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q0iD-BvZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Ae6tHuV0vP-MO1bm3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q0iD-BvZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Ae6tHuV0vP-MO1bm3.png" alt="" width="800" height="134"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;atpbar by singlequote.blog&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  How to install python atpbar for commandline progress bar?
&lt;/h3&gt;

&lt;p&gt;Create a virtualenv, if you don’t already have one, using the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;virtualenv -p python3.9 
venv source venv/bin/activate 
python3 --version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now install atpbar using the command below to get a multiprocessing-capable terminal/command-line progress bar.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -U atpbar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How to use atpbar?
&lt;/h3&gt;

&lt;p&gt;You can find more details on the exact implementation on the &lt;a href="https://pypi.org/project/atpbar/"&gt;Python Package Index&lt;/a&gt; or on the &lt;a href="https://github.com/alphatwirl/atpbar"&gt;GitHub page of atpbar&lt;/a&gt;. In this article, I will explain the functionality in brief.&lt;/p&gt;

&lt;h4&gt;
  
  
  One loop
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time, random
from atpbar import atpbar
n = random.randint(1000, 10000)
for i in atpbar(range(n)):
    time.sleep(0.0001)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Python terminal progress bar will look something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JXAU3Ru4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Ax-JS9unlMPDCnOkg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JXAU3Ru4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Ax-JS9unlMPDCnOkg.png" alt="" width="800" height="53"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Single loop: Singlequote.blog&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In order for atpbar to show a progress bar, the wrapped iterable needs to have a length. If the length cannot be obtained by len(), atpbar won't show a progress bar.&lt;/p&gt;
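That requirement can be illustrated with a tiny dependency-free sketch. The helper name has_length is ours, not part of atpbar; it simply mimics the idea that the decision boils down to whether len() succeeds on the wrapped iterable.

```python
# Minimal sketch of the len() requirement: a sized iterable (list, range)
# answers len(), while a plain generator raises TypeError.
# The helper name has_length is illustrative, not an atpbar API.
def has_length(iterable):
    try:
        len(iterable)
        return True
    except TypeError:
        return False

print(has_length([1, 2, 3]))            # a list is sized: True
print(has_length(i for i in range(3)))  # a generator is not: False
```

So if you want a bar while looping over a generator, materialize it first, for example atpbar(list(gen)).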
&lt;h4&gt;
  
  
  Nested loops
&lt;/h4&gt;

&lt;p&gt;atpbar can show progress bars for nested loops, as shown in the example below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for i in atpbar(range(4), name='outer'):
    n = random.randint(1000, 10000)
    for j in atpbar(range(n), name='inner {}'.format(i)):
        time.sleep(0.0001)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, the outer loop iterates 4 times, and on each iteration an inner loop of random length runs to completion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b8FO6uo0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AR4yBejQC74OWfsPg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b8FO6uo0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AR4yBejQC74OWfsPg.png" alt="" width="800" height="115"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Nest loop: Singlequote.blog&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Threading
&lt;/h4&gt;

&lt;p&gt;atpbar can show multiple progress bars for loops concurrently iterating in different threads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from atpbar import flush
import threading

def run_with_threading():
    nthreads = 5
    def task(n, name):
        for i in atpbar(range(n), name=name):
            time.sleep(0.0001)
    threads = []
    for i in range(nthreads):
        name = 'thread {}'.format(i)
        n = random.randint(5, 100000)
        t = threading.Thread(target=task, args=(n, name))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    flush()

run_with_threading()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As shown in the screenshot below, the tasks run concurrently, and the Python terminal progress bars show the status of each task simultaneously.&lt;/p&gt;

&lt;p&gt;One important thing to notice here is the flush() call: it returns once all loops have finished, letting the main thread or main program finish updating the progress bars.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HSoljNUG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2ARBC20GaPAY2GoTV_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HSoljNUG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2ARBC20GaPAY2GoTV_.png" alt="" width="800" height="137"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Threading atpbar&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As a task completes, the progress bar for the task moves up. The progress bars for active tasks are at the bottom.&lt;/p&gt;

&lt;h4&gt;
  
  
  Multiprocessing
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import multiprocessing
multiprocessing.set_start_method('fork', force=True)

from atpbar import register_reporter, find_reporter, flush

def run_with_multiprocessing():
    def task(n, name):
        for i in atpbar(range(n), name=name):
            time.sleep(0.0001)
    def worker(reporter, task, queue):
        register_reporter(reporter)
        while True:
            args = queue.get()
            if args is None:
                queue.task_done()
                break
            task(*args)
            queue.task_done()
    nprocesses = 4
    ntasks = 10
    reporter = find_reporter()
    queue = multiprocessing.JoinableQueue()
    for i in range(nprocesses):
        p = multiprocessing.Process(target=worker, args=(reporter, task, queue))
        p.start()
    for i in range(ntasks):
        name = 'task {}'.format(i)
        n = random.randint(5, 100000)
        queue.put((n, name))
    for i in range(nprocesses):
        queue.put(None)
    queue.join()
    flush()

run_with_multiprocessing()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When using atpbar with multiprocessing, two more functions come into play:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;find_reporter()&lt;/strong&gt; — Must be called in the main thread or main process; it returns the reporter that collects progress updates from subprocesses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;register_reporter()&lt;/strong&gt; — Must be called inside every new subprocess. Once registered, progress from that subprocess is tracked by the main process, and a new terminal progress bar is created for it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The simultaneously growing Python terminal progress bars will look something like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U19H0HOL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A9bvNpl4Kcr_nbFYA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U19H0HOL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2A9bvNpl4Kcr_nbFYA.png" alt="" width="800" height="242"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;multiprocessing atpbar&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[AUTHOR’S CORNER]&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is part one of the&lt;/em&gt; &lt;a href="http://singlequote.blog"&gt;&lt;em&gt;progress bar in Python series&lt;/em&gt;&lt;/a&gt;&lt;em&gt;. Stay tuned for more such articles on&lt;/em&gt; &lt;a href="http://singlequote.blog"&gt;&lt;em&gt;singlequote.blog&lt;/em&gt;&lt;/a&gt;&lt;em&gt;. If you found this exercise helpful, motivate me to write more such posts by sharing this with your friends, family, and colleagues.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://singlequote.blog/terminal-commandline-tricks-multiprocessing-progress-bar-python-atpbar/"&gt;&lt;em&gt;https://singlequote.blog&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on September 2, 2023.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>pythontricks</category>
      <category>progressbar</category>
      <category>commandlineinterface</category>
      <category>python</category>
    </item>
    <item>
      <title>Personal Growth: Journey From Burnout to Breakthrough</title>
      <dc:creator>Ashish Mishra</dc:creator>
      <pubDate>Fri, 01 Sep 2023 14:21:24 +0000</pubDate>
      <link>https://dev.to/arglee/personal-growth-journey-from-burnout-to-breakthrough-30gj</link>
      <guid>https://dev.to/arglee/personal-growth-journey-from-burnout-to-breakthrough-30gj</guid>
      <description>&lt;p&gt;Personal Growth — How to Reclaim Your Energy and Purpose in life to achieve more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PproyVw_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AtfxP3MyA2zuFZXZB" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PproyVw_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AtfxP3MyA2zuFZXZB" alt="" width="800" height="449"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Nubelson Fernandes on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the last blog, I discussed the &lt;a href="https://singlequote.blog/be-5-times-more-productive-stepwise-guide/"&gt;steps to achieve maximum output by investing significantly less time&lt;/a&gt; and &lt;a href="https://singlequote.blog/be-5-times-more-productive-stepwise-guide/"&gt;how to improve your productivity at work&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this blog, we will discuss the next topic in productivity and personal growth: “burnout”.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is burnout?
&lt;/h3&gt;

&lt;p&gt;“&lt;strong&gt;Burnout&lt;/strong&gt;” — Is it a real thing, or is it another excuse for &lt;em&gt;absenteeism&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;According to a study, approximately 28% of workers say they feel burned out “very often” or “always” at work, while 24% report that they “rarely” or “never” feel burned out at work.&lt;/p&gt;

&lt;p&gt;There are days when we feel productive and energetic even after working long hours, and there are draining days when we feel burned out before lunchtime. This happens to all of us. Agreed?&lt;/p&gt;

&lt;p&gt;Burning out directly impacts our productivity and personal growth. I have published a productivity tips blog on &lt;a href="https://singlequote.blog/be-5-times-more-productive-stepwise-guide/"&gt;How to increase your productivity at least 5 times without any extra effort&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this blog, I am not considering frequent burnout due to health issues. This blog is for people who tend to burn out easily when overwhelmed by tasks, either professionally or personally.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the burnout coping strategies?
&lt;/h3&gt;

&lt;p&gt;Let’s try to understand this using an example of 4 candidates (Sandra, Dave, John, and Laura) working at a company. All 4 work in the same environment, have similar facilities, and perform comparably. Now let’s observe how each behaves when facing burnout or frustration at work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v9f9ABo1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2ACc3Zody3bw0dY-pB.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v9f9ABo1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2ACc3Zody3bw0dY-pB.png" alt="" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image by Ashish Mishra on singlequote.blog&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sandra&lt;/strong&gt; tries to escape from work and go for a run or the gym and this helps her think straight and now she feels more energetic and motivated once she is back to work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dave&lt;/strong&gt; engages with people outside of work who are in a similar field and seeks help by discussing the situation with them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;John,&lt;/strong&gt; on the other hand, tries to discuss the situation with his team and tries to understand the problem better from his current team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Laura&lt;/strong&gt; escapes from all distractions, gets herself a silent corner, and thinks about the situation.&lt;/p&gt;

&lt;p&gt;Every working professional suffers from burnout from time to time. The key is how to cope. People adopt different coping strategies, and the examples above depict each individual’s approach.&lt;/p&gt;

&lt;p&gt;According to an article published by Gallup, &lt;a href="https://www.gallup.com/cliftonstrengths/en/472067/fighting-burnout-strengths.aspx"&gt;Fighting Burnout with Strengths&lt;/a&gt;, people have different coping strategies when it comes to dealing with burnout. A survey of 3,000 employees was conducted to figure out their coping strategies, which were categorized into 4 themes, also known as CliftonStrengths:&lt;/p&gt;

&lt;h4&gt;
  
  
  Executing Theme:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;To cope with burnout or frustration, these people tend to return to work as quickly as possible or do some physical activity (exercise): anything that gives them a sense of accomplishment. People in this category take immense satisfaction in being busy and productive.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Influencing Theme:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;These people tend to spend more time with family or friends outside of work, or talk with them about how their work can contribute to their future goals. People with a dominant Influencing theme also speak up for others and make sure they are heard. Talking about the future gives them a sense of a bright future, which they turn into a strength to work toward.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Relationship-Building Theme:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;People with a dominant Relationship-Building theme are aware of how involving others creates good relationships among team members. They take time to think about how others feel, and they take pride in including more people in the conversation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Strategic Thinking Theme:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Strategic thinkers tend to allow themselves the space to think through their frustration when burned out. They are more likely than others to stop and take time to think through their situation or take more breaks during the workday to relax.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Which coping strategy is suitable for personal growth?
&lt;/h3&gt;

&lt;p&gt;No wonder people have various coping strategies. However, further research revealed a few more interesting facts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The most common coping strategies were not the most effective ones, and yet people preferred to stick with these approaches.&lt;/li&gt;
&lt;li&gt;More interestingly, people were aware of the “not so effective” nature of their approach and still went ahead with it.&lt;/li&gt;
&lt;li&gt;The same person did not choose the same theme for every situation; at different times, they drew on strengths from different themes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, if you look closely at the survey result below, you will find that, whatever the preferred strategy of each theme, people always tend to do one thing in common: “Stop and take time to think through their situation”.&lt;/p&gt;

&lt;p&gt;Hence, there are multiple techniques to cope with burnout; you just need to follow one of the strategies to overcome it. There is no good or bad theme, and the choice does not reflect a person’s personality. So, the next time you feel burnout setting in, take your time and go with whichever strategy feels right and viable to you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;So, when was the last time you stopped to think about the burnout you were experiencing? If you are reading this blog, you are either planning to overcome this issue or already on your path to personal growth. I have curated a &lt;a href="https://singlequote.blog/be-5-times-more-productive-stepwise-guide/"&gt;handmade plan for personal development and time management&lt;/a&gt;. It will help you create a plan and analyze the issues with your daily schedule.&lt;/p&gt;

&lt;p&gt;An important point: the above plan is not a silver bullet that can resolve everything. It needs your full attention. A line from the song &lt;a href="https://genius.com/Sons-of-the-east-nothing-comes-easy-lyrics"&gt;Nothing comes easy&lt;/a&gt; describes it well:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“And nothing comes easy, no nothing at all. But if we believe it, we’re a hundred foot tall”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sons of the East&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;[AUTHOR’S CORNER]&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I think I am more of an “Influencing Theme” person; most of the time, I tend to discuss the situation with someone outside work to get an unbiased view. Quite often, I am also an “Executing Theme” person, as I like going to the gym or for a run in such situations.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A few bonus articles to read:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="http://no-nonsense%20guide%20to%20measuring%20productivity/"&gt;&lt;em&gt;No-Nonsense Guide to Measuring Productivity&lt;/em&gt;&lt;/a&gt; &lt;em&gt;by&lt;/em&gt; &lt;a href="https://hbr.org/1988/01/no-nonsense-guide-to-measuring-productivity"&gt;&lt;em&gt;Harward Business Review&lt;/em&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.simplypsychology.org/maslow.html"&gt;&lt;em&gt;Maslow’s Hierarchy of needs&lt;/em&gt;&lt;/a&gt; &lt;em&gt;is one of the best models for understanding personal development and growth.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Courtesy&lt;/strong&gt; : &lt;a href="https://www.gallup.com/cliftonstrengths/en/472067/fighting-burnout-strengths.aspx"&gt;Gallup.com&lt;/a&gt; for survey results, videos, and text suggestions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://singlequote.blog/personal-growth-journey-from-burnout-to-breakthrough/"&gt;&lt;em&gt;https://singlequote.blog&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on September 1, 2023.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>motivation</category>
      <category>personalgrowth</category>
      <category>burnout</category>
      <category>successfulpeople</category>
    </item>
    <item>
      <title>Be 5 times more Productive: Stepwise Guide &amp; Tips — Single Quote</title>
      <dc:creator>Ashish Mishra</dc:creator>
      <pubDate>Sat, 20 May 2023 13:43:54 +0000</pubDate>
      <link>https://dev.to/arglee/be-5-times-more-productive-stepwise-guide-tips-single-quote-1302</link>
      <guid>https://dev.to/arglee/be-5-times-more-productive-stepwise-guide-tips-single-quote-1302</guid>
      <description>&lt;h3&gt;
  
  
  Be 5 times more Productive: Stepwise Guide &amp;amp; Tips — Single Quote
&lt;/h3&gt;

&lt;p&gt;The project delivery deadline is a few days away and you are racing against time. Each and every second has become highly valuable. During these overwhelming times, we often ask ourselves:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can I deliver more in less time? What can I do to improve my productivity at work? How can I improve my decision-making power?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eI9xc4eZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AGp-k5JUcW96Ypwzb" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eI9xc4eZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AGp-k5JUcW96Ypwzb" alt="" width="800" height="534"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Jo Szczepanska on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Don’t we…?&lt;/p&gt;

&lt;p&gt;What if I said this is possible, and I have a few productivity tips and answers to all the above questions? A very simple 4-step exercise can help you become at least 5 times more productive. But to achieve this, I need your full attention for the 10 days of this exercise.&lt;/p&gt;

&lt;h3&gt;
  
  
  How has this productivity tip helped me?
&lt;/h3&gt;

&lt;p&gt;I have been practicing this exercise for the past couple of years. It has helped me systematically design my life to achieve close to my desired productivity without upsetting the balance between my personal and professional life.&lt;/p&gt;

&lt;p&gt;Prior to this exercise, I often felt overwhelmed with work. This frequently forced me to skip lunch or shortchange my family and friends, and it even led to sleep deprivation. During that time, I knew that if I wanted to live with satisfaction and peace, I somehow needed to manage things without falling into the pit of these unhealthy habits.&lt;/p&gt;

&lt;p&gt;I started doing research, and unsurprisingly, there is a common name for such a situation: the “&lt;strong&gt;Urgency Trap&lt;/strong&gt;”.&lt;/p&gt;

&lt;h3&gt;
  
  
  How can this productivity tip help you?
&lt;/h3&gt;

&lt;p&gt;Very often we are overwhelmed with work, and a new task becomes high priority even before we can finish the last one. This is called the “Urgency Trap”, and it creates huge problems in our day-to-day decision-making.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Urgency Trap&lt;/em&gt;&lt;/strong&gt;&lt;em&gt; is a paradox because it limits the very thing that could help us be more innovative, efficient, and effective: &lt;/em&gt;&lt;strong&gt;our critical thinking&lt;/strong&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.harvardbusiness.org/to-improve-critical-thinking-dont-fall-into-the-urgency-trap/"&gt;&lt;em&gt;Harward Business Review&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Thus, the target for this exercise is to keep yourself free from this “Urgency Trap”.&lt;/p&gt;

&lt;p&gt;Now, a bit of backstory about this exercise. It is based on Eisenhower’s method of decision-making, with a few modifications of mine to make it more user-friendly and easier to use. Let’s dive into the stepwise process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stepwise Guide/Tips to improve productivity
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Step 1: Download the Eisenhower Matrix Excel Template
&lt;/h3&gt;

&lt;p&gt;There are multiple Excel templates available on the internet; we will discuss a few that I find good. Both of the templates mentioned below are compatible with Microsoft Excel and Google Sheets, and the Gumroad SingleQuote Excel template is the cheaper option:&lt;/p&gt;

&lt;p&gt;Once you click on the above link, you will land on the Gumroad screen, which will look something like the image below. At the bottom right, fill in the amount in the price box and click “I want this”. Once done, you can download the Excel template.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Explore Excel/Google Sheet Productivity Template
&lt;/h3&gt;

&lt;p&gt;Double-click on the downloaded Excel sheet to open it in &lt;a href="https://www.microsoft.com/en-us/microsoft-365/excel"&gt;Microsoft Excel&lt;/a&gt; or you can also open this in &lt;a href="https://sheets.google.com/"&gt;Google Sheets&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To open this in Google Sheets, open a new sheet and click &lt;strong&gt;&lt;em&gt;File&lt;/em&gt;&lt;/strong&gt; and then &lt;strong&gt;&lt;em&gt;Open&lt;/em&gt;&lt;/strong&gt; to import it from local.&lt;/p&gt;

&lt;p&gt;Once opened, at the bottom of the sheet, you will find multiple tabs as shown in the image below.&lt;/p&gt;

&lt;p&gt;Every tab represents something, Let’s find out how these tabs will be used in this exercise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instructions:&lt;/strong&gt; The instruction tab explains in brief how to use the template.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Day:&lt;/strong&gt; Example day is useful if you are a first-time user of this template. This can help you categorize your tasks with some examples. Example days will help you in 2 ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Day 1 — Day 10:&lt;/strong&gt; This is where you fill in your daily activities during the 10-day exercise. You need to record what you have done every half an hour.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A 30-min window might seem very aggressive at first, but there is a reason for it. Originally, I started with a 1-hour window, but I realized that if I did something productive for 40–45 mins and nothing important for the next 15–20 mins, that time became difficult to track, and these 15–20 min windows can add up to more than 1.5–2.0 hrs over the day.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt; Once you click here, you will find a graph. It shows the time spent per day on urgent/important (Quadrant 1) tasks of the Eisenhower Matrix over the 10-day period.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;As per the Eisenhower Matrix, our day-to-day activities should fall under Quadrant 1, which requires smart prioritization. But as a newbie, we can fall into the trap of thinking everything is important, and this exercise will help you avoid exactly that.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Analysis tab shows, over time, how much of your time goes to urgent + important tasks. The target is to increase this percentage over the course of the exercise. If the number is not increasing, go back to the previous day’s tasks and identify which ones did not help you complete something of real importance, as stated above.&lt;/p&gt;
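&lt;p&gt;Under the hood, the Analysis tab boils down to simple arithmetic. As a rough sketch (the data layout here is my own assumption, not the template’s exact structure), the Quadrant 1 percentage for a day could be computed like this:&lt;/p&gt;

```python
# Each 30-minute slot is recorded as a pair of "Y"/"N" flags taken from
# the Urgent and Important columns of the template.
def quadrant1_percentage(slots):
    """Return the share (in %) of tracked slots that were both urgent and important."""
    if not slots:
        return 0.0
    q1 = sum(1 for urgent, important in slots if urgent == "Y" and important == "Y")
    return 100.0 * q1 / len(slots)

# Four tracked slots, two of them both urgent AND important:
day1 = [("Y", "Y"), ("Y", "N"), ("N", "Y"), ("Y", "Y")]
print(quadrant1_percentage(day1))  # 2 of 4 slots: 50.0
```

&lt;p&gt;The Analysis graph simply plots this percentage for each of the 10 days so you can watch the trend.&lt;/p&gt;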

&lt;h3&gt;
  
  
  Step 3: Let’s start with day one
&lt;/h3&gt;

&lt;p&gt;By now, we understand what this exercise is about and how it works. Before we start with day 1, there are a few prerequisites.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3.1: Decide your productive hours in the day
&lt;/h4&gt;

&lt;p&gt;Decide which hours of the day you want to be productive in. We are not robots; tracking each and every moment of the day would be overwhelming, and we do not want to fall into the &lt;a href="https://singlequote.blog/be-productive-by-avoiding-the-urgency-trap/"&gt;urgency trap&lt;/a&gt; of adding tasks while we are trying to avoid it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Most of the people I surveyed about this template mentioned that they started tracking around 1 hr before their actual office hours and ended about 1 hr after. Yours could be different.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3.2: Start adding your tasks
&lt;/h4&gt;

&lt;p&gt;Now you are ready to start adding tasks. Your template will look something like the image below. At the end of every 30 mins, record how you spent them by choosing from the options available. Let’s discuss what each option stands for and how to use it:&lt;/p&gt;

&lt;p&gt;There are 6 columns in the sheet:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time Slot Column:&lt;/strong&gt; Very obvious, nothing to explain here. Time slots are already part of the defined template. No user intervention is required at this point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task Column:&lt;/strong&gt; A brief description of the task. It should be detailed enough that you can identify it later, yet brief enough that it does not take much time to add.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Urgent Column:&lt;/strong&gt; If the task meets the urgency criteria, mark it “Y”; otherwise mark it “N”. This is an important column, and not all tasks can be urgent. Choose wisely, or &lt;a href="https://singlequote.blog/eisenhower-method-to-dodge-all-the-blues/"&gt;refer to this document for guidance on choosing urgent, important, and non-important tasks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important Column:&lt;/strong&gt; If you think that the task you just completed was important then mark it “Y” else mark it “N”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WH Column:&lt;/strong&gt; This stands for “Working Hours”. There are 3 options: “Work”, “Meeting”, and “Learnings”. This will help you understand how you are spending your day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NWH Column:&lt;/strong&gt; This stands for “Non-Working Hours”. If the task was not related to work or the office, mark it “Y”. For example, “Call Mom” or “Gym” falls under NWH.&lt;/p&gt;
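&lt;p&gt;To make the columns concrete, here is a minimal sketch of one 30-minute entry and how its Urgent/Important flags map to an Eisenhower quadrant. The field names and the sample task are my own illustrations, mirroring the six columns rather than copying the template:&lt;/p&gt;

```python
# One 30-minute slot, mirroring the template's six columns.
entry = {
    "time_slot": "09:00-09:30",
    "task": "Fix production bug",  # hypothetical task
    "urgent": "Y",
    "important": "Y",
    "wh": "Work",   # Working Hours: Work / Meeting / Learnings
    "nwh": "N",     # Non-Working Hours task? "Y" or "N"
}

def quadrant(urgent, important):
    """Map the Urgent/Important flags to an Eisenhower quadrant number (1 to 4)."""
    if urgent == "Y" and important == "Y":
        return 1  # do first
    if urgent == "N" and important == "Y":
        return 2  # schedule
    if urgent == "Y" and important == "N":
        return 3  # delegate
    return 4      # eliminate

print(quadrant(entry["urgent"], entry["important"]))  # prints 1
```

&lt;p&gt;The sheet performs essentially this classification for you when it builds the end-of-day matrix.&lt;/p&gt;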

&lt;h4&gt;
  
  
  Step 3.3: End your day
&lt;/h4&gt;

&lt;p&gt;At the end of the day, there are a few things that you need to take care of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Observe how you spent your time (in minutes) on some productive work vs meetings vs learning. Refer to the image below.&lt;/li&gt;
&lt;li&gt;On the basis of tasks and time division between work/meeting or learnings, add details on “According to you how was your day? Productive/Not-Productive/Average.”&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Also, at the end of the day, the sheet will automatically create an Eisenhower Matrix to show where your time went across all 4 quadrants. Take a few minutes to analyze it: if you are not spending most of your time in &lt;strong&gt;Quadrant 1&lt;/strong&gt;, the &lt;strong&gt;top-left box&lt;/strong&gt;, then there is an issue with your prioritization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Start your next day
&lt;/h3&gt;

&lt;p&gt;The next day, before starting work, go to the “Analysis” tab and check how much time you spent on Quadrant 1 (Urgent + Important) tasks yesterday. The target should be to avoid activities and tasks that do not fall into Quadrant 1, or at least Quadrant 2.&lt;/p&gt;

&lt;p&gt;During this exercise, once we finish a task, we retrospect: was it really urgent? If we had skipped it, would anything have been impacted? Could it have been delegated to someone else on the team, or moved to some other time in the future?&lt;/p&gt;

&lt;p&gt;We need to ask ourselves these questions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://singlequote.blog/eisenhower-method-to-dodge-all-the-blues/"&gt;You can read more about the Quadrants of Eisenhower Matrix and what task to put into what quadrant.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this in mind, check your calendar and remove or reschedule such meetings and tasks from your task list.&lt;/p&gt;

&lt;p&gt;At the end of the 10th day, if you have been working diligently with the template, you will have figured out a pattern: what kinds of tasks or meetings consume your time and how they can be avoided.&lt;/p&gt;

&lt;h3&gt;
  
  
  Productivity Tips: What’s next, once you are done with the Exercise?
&lt;/h3&gt;

&lt;p&gt;The above productivity tip is not a silver bullet. If you think that once you are done with this 10-day exercise it will resolve all your problems and you will never have to do anything again, then, my friend, you are mistaken: our responsibilities at the workplace keep changing over time.&lt;/p&gt;

&lt;p&gt;So, we should repeat this 10-day exercise every year, every few months, or whenever we feel we are losing control of our time and need to return to a schedule where we can achieve more in less time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick Tip:&lt;/strong&gt; If you are not using Excel on your desktop and are opening this sheet in a browser, use a browser that you do not use for work. This lets you find it quickly with Cmd+Tab and saves time while entering task details. For example, if you use Chrome for official work on a MacBook, you can use &lt;a href="https://www.apple.com/in/safari/"&gt;Safari&lt;/a&gt; for the sheet.&lt;/p&gt;

&lt;p&gt;If you find this exercise helpful then motivate me to write more such posts for you. Share this with your friends, family, and colleagues to help them be more productive in life.&lt;/p&gt;

&lt;p&gt;Ciao…&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;[Author’s Corner]&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The 2×2 matrix we used in our exercise is also known as the “Eisenhower Matrix”.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The 34th President of the United States,&lt;/em&gt; &lt;a href="https://singlequote.blog/you-like-ike-i-like-ike-everybody-likes-ike-president-eisenhower/"&gt;&lt;strong&gt;&lt;em&gt;Dwight D. Eisenhower&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt;&lt;em&gt;, conceptualized the Eisenhower Matrix (or Eisenhower Decision Matrix) during one of his speeches. Decades later, author Stephen Covey built a framework around it and popularized it in his book&lt;/em&gt; &lt;a href="https://en.wikipedia.org/wiki/The_7_Habits_of_Highly_Effective_People"&gt;&lt;em&gt;The 7 Habits of Highly Effective People&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;That is the reason it is also known as the “&lt;/em&gt;&lt;strong&gt;&lt;em&gt;Covey Eisenhower Matrix&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;” or sometimes just the “&lt;/em&gt;&lt;strong&gt;&lt;em&gt;Covey Matrix&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;”.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;So, from now on, if anyone asks you “Can the Eisenhower method help me improve my productivity? What is the right way to use the Eisenhower Matrix? How can we use the Eisenhower Matrix in the most optimized way to improve decision-making power?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You can share these productivity tips with them to help them achieve good things in life.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://singlequote.blog/be-5-times-more-productive-stepwise-guide/"&gt;&lt;em&gt;https://singlequote.blog&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on May 20, 2023.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>productivitytools</category>
      <category>eisenhowermatrix</category>
      <category>productivitytips</category>
      <category>productivityhacks</category>
    </item>
    <item>
      <title>You like Ike, I like Ike, everybody likes Ike</title>
      <dc:creator>Ashish Mishra</dc:creator>
      <pubDate>Mon, 01 May 2023 18:54:51 +0000</pubDate>
      <link>https://dev.to/arglee/you-like-ike-i-like-ike-everybody-likes-ike-2igp</link>
      <guid>https://dev.to/arglee/you-like-ike-i-like-ike-everybody-likes-ike-2igp</guid>
      <description>&lt;p&gt;Bit of an unusual title for a blog? True, Now for some context — &lt;a href="https://www.youtube.com/watch?v=YP7WaUPACuY"&gt;watch this short clip&lt;/a&gt; shared by New York Historical Society and come back. This was a political public announcement of President Eisenhower ( &lt;a href="https://www.whitehouse.gov/about-the-white-house/presidents/dwight-d-eisenhower/"&gt;Dwight David Eisenhower&lt;/a&gt;) during his 34th presidential campaign.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j-EaeO3u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AMu8bJA9WIHvPrcuT1EbRxA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j-EaeO3u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AMu8bJA9WIHvPrcuT1EbRxA.jpeg" alt="" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As per sources on the internet, between the years 1892–1909 in Kansas, when Dwight and his elder brother were in school, his brother was nicknamed “Big Ike” and he became “Little Ike”. The nickname followed him to the United States Military Academy at West Point, where “Little” was dropped, and it stayed with him through his presidency of the United States from &lt;em&gt;January 20, 1953 – January 20, 1961&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  A quick glance at President Eisenhower’s achievements
&lt;/h3&gt;

&lt;p&gt;During his two terms as president, he made many important contributions: constructing the Interstate Highway System, creating NASA, bringing an armistice to the Korean War, promoting Atoms for Peace, dealing with crises in Lebanon, Suez, Berlin, and Hungary, establishing the U.S. Information Agency, welcoming Alaska and Hawaii into the union, and managing to keep the Cold War with Russia cold.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--162eMS49--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/400/1%2AT0PnMWOJiG9YXa2MyXDMtw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--162eMS49--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/400/1%2AT0PnMWOJiG9YXa2MyXDMtw.jpeg" alt="" width="400" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Dwight D. Eisenhower. Image Courtesy: whitehouse.gov&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Before becoming President, he served as a general in the United States Army and as the Allied Forces Supreme Commander during World War II. He also later became NATO’s first supreme commander.&lt;/p&gt;

&lt;p&gt;Amazed? I was too when I first read about &lt;strong&gt;President Eisenhower’s accomplishments&lt;/strong&gt; during my research.&lt;/p&gt;

&lt;p&gt;If you wish to know more about his life, you can read more about President Eisenhower at the &lt;a href="https://www.eisenhowerlibrary.gov/eisenhowers"&gt;Eisenhower Library&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How was President Eisenhower able to achieve so much in his lifetime?
&lt;/h3&gt;

&lt;p&gt;Thanks to social media, we often hear about renowned personalities like President Eisenhower, who accomplished in their lifetimes more than most of us can even imagine.&lt;/p&gt;

&lt;p&gt;Have you ever wondered how some of these legends were able to weave their lives in such a poetic manner? What were they doing that others were not?&lt;/p&gt;

&lt;p&gt;All of us have the same 24 hours in a day, yet only a few achieve so much. How many times have you spent all your time managing crises and fires, only to end the day feeling completely drained, with nothing of real significance to show for it?&lt;/p&gt;

&lt;p&gt;President Eisenhower, during one of his speeches, stated:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;What is important is seldom urgent and what is urgent is seldom important.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;- Dwight D. Eisenhower&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This particular line left a deep impact on me, and it may be the difference between people like President Eisenhower and people like us.&lt;/p&gt;

&lt;p&gt;Very often, I find myself questioning where my priority lies in a given situation. When there are multiple crises, everything seems important; how do we prioritize and plan for long-term goals during such times? Yes, this is a tricky problem. Let us see what President Eisenhower had to say about this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Who can define for us with accuracy the difference between the long and short term! Especially whenever our affairs seem to be in crisis, we are almost compelled to give our first attention to the urgent present rather than to the important future.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;- Dwight D. Eisenhower, 1961 address to the Century Association&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To make it clearer, President Eisenhower explained that it is essential to distinguish what is urgent now from what can be pushed to later. Basically, the trick lies in finding what is most beneficial for the future.&lt;/p&gt;

&lt;p&gt;Today, we know that theory as the “&lt;a href="https://singlequote.blog/eisenhower-method-to-dodge-all-the-blues/"&gt;&lt;strong&gt;Eisenhower Matrix&lt;/strong&gt;&lt;/a&gt;” or “&lt;a href="https://singlequote.blog/eisenhower-method-to-dodge-all-the-blues/"&gt;&lt;strong&gt;Eisenhower Method&lt;/strong&gt;&lt;/a&gt;” of time management.&lt;/p&gt;

&lt;p&gt;By his own account, this approach helped President Eisenhower prioritize and deal with many high-stakes crises he faced as a US Army general, as NATO’s Supreme Allied Commander, and eventually as President of the United States.&lt;/p&gt;

&lt;p&gt;Though President Eisenhower only conceptualized this, decades later author Stephen Covey created a framework around it and popularized it in his book &lt;a href="https://en.wikipedia.org/wiki/The_7_Habits_of_Highly_Effective_People"&gt;The 7 Habits of Highly Effective People&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What can we do to align our time with important things?
&lt;/h3&gt;

&lt;p&gt;I know what you are thinking: “This is straightforward, and someone else cannot categorize this for me.”&lt;/p&gt;

&lt;p&gt;And yes, you are absolutely correct. Each person needs to measure what is important to them using the framework, which we will discuss later in the blog.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://singlequote.blog/be-5-times-more-productive-stepwise-guide/"&gt;How to use Eisenhower Matrix to measure and improve productivity&lt;/a&gt; — can help you design your Day in the most productive way. I have tried to simplify the above concepts and created a template in the most easy-to-understand and use format.&lt;/p&gt;

&lt;p&gt;This is a 10-day exercise where you add your activities every hour during your work day, or for whatever time you want to optimize. Through this activity, you can also analyze your daily tasks; for example, was this really a meeting or task that needed your time, or could it have been done without your presence?&lt;/p&gt;

&lt;p&gt;This lets you categorize and prioritize your tasks accordingly. At the end of the day, check the pattern of your most and least productive tasks, trying to enhance the former and phase out the latter.&lt;/p&gt;

&lt;p&gt;After a few days (around the 5th day), you will start seeing an impact: you will begin to get more time for tasks of real significance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://singlequote.blog/eisenhower-method-to-dodge-all-the-blues/"&gt;Eisenhower Method to avoid/dodge ‘Monday Blues’&lt;/a&gt; — can help you categorize the task on the right quadrant of the Eisenhower Matrix.&lt;/p&gt;

&lt;p&gt;If you wish to know more about the Eisenhower Matrix and how to use it in the most optimized way, check out this &lt;a href="https://singlequote.blog/be-5-times-more-productive-stepwise-guide/"&gt;stepwise process to use the Eisenhower Matrix and improve productivity&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you love working in Excel and it is your bread and butter, then this &lt;a href="https://arglee.gumroad.com/l/EisenhowerMatrixExcelTemplate"&gt;Excel template for the Eisenhower Matrix&lt;/a&gt; is nothing less than a treat for you. The template not only lets you add your hourly tasks but also gives you a graphical presentation of them. Read more about it &lt;a href="https://singlequote.blog/be-5-times-more-productive-stepwise-guide/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://singlequote.blog/you-like-ike-i-like-ike-everybody-likes-ike-president-eisenhower/"&gt;&lt;em&gt;https://singlequote.blog&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on May 1, 2023.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>eisenhower</category>
      <category>eisenhowermatrix</category>
      <category>productivityhacks</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Python Dynamic Configuration — Python-Trick</title>
      <dc:creator>Ashish Mishra</dc:creator>
      <pubDate>Tue, 04 Apr 2023 07:19:08 +0000</pubDate>
      <link>https://dev.to/arglee/python-dynamic-configuration-python-trick-2hm7</link>
      <guid>https://dev.to/arglee/python-dynamic-configuration-python-trick-2hm7</guid>
      <description>&lt;h3&gt;
  
  
  Python Dynamic Configuration — Python-Trick
&lt;/h3&gt;

&lt;p&gt;In this blog, we will discuss one easy-to-use Python trick that can come in very handy for a quick program or repository you are creating for an individual project or for your organization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--C5yp-gp1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AJA6uvyFyidZaYk1w" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--C5yp-gp1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AJA6uvyFyidZaYk1w" alt="" width="800" height="534"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Jantine Doornbos on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I am sure you are already aware of the importance of configuration files in any programming language; managing them is a crucial part of any software development process.&lt;/p&gt;

&lt;p&gt;To understand this, we will first discuss the traditional approach, and then we will see how a small Python class can save you from repeatedly editing your config parser.&lt;/p&gt;
&lt;h3&gt;
  
  
  Traditional Approach
&lt;/h3&gt;

&lt;p&gt;To make both approaches easier to compare, we will take an example and implement each approach to see the impact of the trick.&lt;/p&gt;

&lt;p&gt;Let’s create a folder named &lt;em&gt;configuration&lt;/em&gt; and add these two files inside it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_DnEbUIj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/534/0%2Aq_CHkvYAzmfWN3Hm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_DnEbUIj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/534/0%2Aq_CHkvYAzmfWN3Hm.png" alt="" width="534" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;settings.ini&lt;/em&gt;&lt;/strong&gt; file looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[RUN]
num_cores = 2
num_files = -1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, to parse the configuration, we use another file, say &lt;strong&gt;config.py&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import configparser

parser = configparser.ConfigParser()
parser.read_file(open(os.path.join(os.path.dirname(os.path.abspath( __file__ )), "settings.ini")))

NUM_CORES = parser.get('RUN','num_cores')

print ("Number of cores available : ", parser.get('RUN','num_cores'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of the above script will look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;➜ configuration python config.py
Number of cores available : 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is straightforward and easy: every time we want to add a new config value, we just add a new variable and use it inside the code. So, let’s say we also want the value of &lt;em&gt;NUM_FILES&lt;/em&gt; in the code; the new code will look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import configparser

parser = configparser.ConfigParser()
parser.read_file(open(os.path.join(os.path.dirname(os.path.abspath( __file__ )), "settings.ini")))

NUM_CORES = parser.get('RUN','num_cores')
NUM_FILES = parser.get('RUN','num_files')

print ("Number of cores available : ", parser.get('RUN','num_cores'))
print ("Number of cores available : ", parser.get('RUN','num_files'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the &lt;em&gt;NUM_FILES&lt;/em&gt; variable can be used across your project to get the value of this setting.&lt;/p&gt;

&lt;p&gt;But how many times have you had to add these same boilerplate lines for every new variable? You would agree this can be painful and can halt your train of thought.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic configuration:
&lt;/h3&gt;

&lt;p&gt;What if I told you there is an easy way out: you can initialize the whole .ini file dynamically, with no need to parse every variable explicitly?&lt;/p&gt;

&lt;p&gt;Now, in order to do that, I have added a new file &lt;strong&gt;dynamic_config_parser.py&lt;/strong&gt; and its content will look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import configparser

class DynamicConfig:
    def __init__ (self, conf):
        if not isinstance(conf, dict):
            raise TypeError(f'dict expected, found {type(conf). __name__ }')

        self._raw = conf
        for key, value in self._raw.items():
            setattr(self, key, value)

class DynamicConfigInit:
    """
    This class is used to dynamically load static variables from the settings.ini file. Any va
    variable declared in the settings.ini can be parsed directly.
    """
    def __init__ (self, conf):
        if not isinstance(conf, configparser.ConfigParser):
            raise TypeError(f'ConfigParser expected, found {type(conf). __name__ }')

        self._raw = conf
        for key, value in self._raw.items():
            setattr(self, key, DynamicConfig(dict(value.items())))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and now your &lt;strong&gt;config.py&lt;/strong&gt; will look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import configparser
from dynamic_config_parser import DynamicConfigInit

parser = configparser.ConfigParser()
parser.read_file(open(os.path.join(os.path.dirname(os.path.abspath( __file__ )), "settings.ini")))
STATIC_CONFIG = DynamicConfigInit(parser)

print("Number of cores from dynamic config: ", STATIC_CONFIG.RUN.num_cores)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voilà! That’s it. We are done.&lt;/p&gt;

&lt;p&gt;As you can see, there is no longer any need to add every variable to config.py each time; dynamic config initialization handles this for you, saving you a long, repetitive file for a huge configuration.&lt;/p&gt;
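&lt;p&gt;One caveat worth noting: configparser hands every value back as a string, so numeric options such as &lt;em&gt;num_cores&lt;/em&gt; still need explicit conversion at the point of use. A minimal sketch of this, building the same settings in memory instead of reading settings.ini from disk:&lt;/p&gt;

```python
import configparser

# Same [RUN] section as settings.ini above, built in memory for illustration
parser = configparser.ConfigParser()
parser.read_string("[RUN]\nnum_cores = 2\nnum_files = -1\n")

class DynamicConfig:
    def __init__(self, conf):
        for key, value in conf.items():
            setattr(self, key, value)

class DynamicConfigInit:
    def __init__(self, conf):
        for key, value in conf.items():
            setattr(self, key, DynamicConfig(dict(value.items())))

STATIC_CONFIG = DynamicConfigInit(parser)

# Values come back as strings; convert explicitly where a number is needed
print(type(STATIC_CONFIG.RUN.num_cores).__name__)  # str
print(int(STATIC_CONFIG.RUN.num_cores) * 2)        # 4
```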

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://singlequote.blog/python-dynamic-configuration/"&gt;&lt;em&gt;https://singlequote.blog&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on April 4, 2023.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>pythonwebdeveloper</category>
      <category>pythontricks</category>
      <category>pythonscripting</category>
      <category>pythonsetup</category>
    </item>
    <item>
      <title>Continuous Data Load from S3 to Snowflake(Snowpipe): Stepwise Process, Benchmarks &amp; Cost</title>
      <dc:creator>Ashish Mishra</dc:creator>
      <pubDate>Tue, 21 Feb 2023 02:02:44 +0000</pubDate>
      <link>https://dev.to/arglee/continuous-data-load-from-s3-to-snowflakesnowpipe-stepwise-process-benchmarks-cost-2fb8</link>
      <guid>https://dev.to/arglee/continuous-data-load-from-s3-to-snowflakesnowpipe-stepwise-process-benchmarks-cost-2fb8</guid>
<description>&lt;p&gt;What is continuous data load in Snowflake? What are the ways to achieve it? What are the costs and benchmarks of the solution? How can the significant cost of continuous data load to Snowflake be reduced? And, last but not least, what are the limitations of the solution?&lt;/p&gt;

&lt;p&gt;We will discuss each of these questions in this blog.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GcQh6Mwi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AyxhcVaB-42xGEkAE" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GcQh6Mwi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AyxhcVaB-42xGEkAE" alt="" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Aaron Burden on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://singlequote.blog/bulk-data-load-from-s3-to-snowflake-stepwise-process-benchmarks-cost/"&gt;In the last post on bulk data load&lt;/a&gt;(&lt;a href="https://arglee.medium.com/bulk-data-load-from-s3-to-snowflake-stepwise-process-benchmarks-cost-2bbb77b16ac7"&gt;medium link&lt;/a&gt;), we discussed 2 things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How to load high-frequency bulk data (micro-batches) into Snowflake using the COPY command.&lt;/li&gt;
&lt;li&gt;What the solution costs when the frequency is higher.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Please go through the last post &lt;a href="https://singlequote.blog/bulk-data-load-from-s3-to-snowflake-stepwise-process-benchmarks-cost/"&gt;here&lt;/a&gt; if you are reading my articles for the first time; it explains how AWS and Snowflake perform the handshake to share data between the two technologies.&lt;/p&gt;
&lt;h3&gt;
  
  
  What is Continuous Data Load in Snowflake?
&lt;/h3&gt;

&lt;p&gt;Snowflake provides a serverless solution: files arriving in object storage (S3 in this case) are loaded into a Snowflake table without any external resource, and Snowflake charges for the compute used during loading.&lt;/p&gt;

&lt;p&gt;Continuous data load can be achieved in Snowflake in several ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Snowpipe:&lt;/strong&gt; The easiest and one of the most popular solutions, as it requires the least effort and is categorized as a zero-code solution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka Connector for Snowflake:&lt;/strong&gt; Reads data from Apache Kafka topics and loads it into a Snowflake table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Third-Party Data Integration Tools:&lt;/strong&gt; &lt;a href="https://docs.snowflake.com/en/user-guide/ecosystem-etl.html"&gt;Read more about these integrations on Snowflake official website&lt;/a&gt; about existing and new integrations.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  How Does Snowpipe Work?
&lt;/h3&gt;

&lt;p&gt;Snowpipe loads data from files as soon as they are available in the stage, running a COPY statement to load the data into Snowflake. But how does Snowflake know that there are new staged files available?&lt;/p&gt;

&lt;p&gt;There are two ways to detect staged files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Event Notification from S3 bucket. &lt;a href="https://singlequote.blog/listing-billion-number-of-s3-objects-into-sqs-challenges-benchmarks/#Challenges_with_the_Process"&gt;Read more about s3 event notification and cost of the process here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Calling Snowpipe REST endpoints.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this blog, we will focus on the first option, using event notifications from cloud storage (S3 in this case) to automate continuous data load with Snowpipe. At the end of this blog, we will also discuss which solution should be preferred for high-frequency data and which is most cost-effective.&lt;/p&gt;
&lt;h3&gt;
  
  
  What are the Steps to Load Continuous Data using Snowpipe?
&lt;/h3&gt;

&lt;p&gt;Continuous data load in Snowflake using Snowpipe is a five-step setup process, plus a final validation step.&lt;/p&gt;

&lt;p&gt;A few of the initial steps to configure access permissions are the same as in &lt;a href="https://arglee.medium.com/bulk-data-load-from-s3-to-snowflake-stepwise-process-benchmarks-cost-2bbb77b16ac7"&gt;Bulk load data using Copy Command&lt;/a&gt;. We will link to the previous post where possible to keep this blog concise.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Create Storage Integration &amp;amp; Access Permissions
&lt;/h3&gt;

&lt;p&gt;Refer to &lt;a href="https://arglee.medium.com/bulk-data-load-from-s3-to-snowflake-stepwise-process-benchmarks-cost-2bbb77b16ac7"&gt;step 2 in this document&lt;/a&gt; (&lt;a href="https://arglee.medium.com/bulk-data-load-from-s3-to-snowflake-stepwise-process-benchmarks-cost-2bbb77b16ac7"&gt;Medium Link&lt;/a&gt;) for storage integration.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Create Stage Objects
&lt;/h3&gt;

&lt;p&gt;Refer to &lt;a href="https://arglee.medium.com/bulk-data-load-from-s3-to-snowflake-stepwise-process-benchmarks-cost-2bbb77b16ac7"&gt;step 3 in this document&lt;/a&gt; (&lt;a href="https://arglee.medium.com/bulk-data-load-from-s3-to-snowflake-stepwise-process-benchmarks-cost-2bbb77b16ac7"&gt;Medium Link&lt;/a&gt;) for stage objects.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Create a Pipe with Auto-Ingest Enabled
&lt;/h3&gt;

&lt;p&gt;The pipe uses the &lt;a href="https://arglee.medium.com/bulk-data-load-from-s3-to-snowflake-stepwise-process-benchmarks-cost-2bbb77b16ac7"&gt;COPY INTO &amp;lt;table&amp;gt;&lt;/a&gt; command internally to load data automatically whenever a new file-ingestion notification is received.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create pipe mypipe auto_ingest=true as
  copy into production_object_storage
  from @my_csv_stage
  file_format = (type = 'CSV');
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Specifying &lt;em&gt;AUTO_INGEST=TRUE&lt;/em&gt; is important: it tells the pipe to read event notifications sent from an S3 bucket to an SQS queue when new data is ready to load.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 4: Provide Permission to the User that will Run Snowpipe
&lt;/h3&gt;

&lt;p&gt;If you are an account admin or have a higher access level (for example, because you are trying this on a newly created personal Snowflake account), you can skip this step; otherwise, grant permissions to the current role and user as follows:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Create a role to contain the Snowpipe privileges
use role securityadmin;

create or replace role snowpipe1;

-- Grant the required privileges on the database objects
grant usage on database snowpipe_db to role snowpipe1;

grant usage on schema snowpipe_db.public to role snowpipe1;

grant insert, select on snowpipe_db.public.mytable to role snowpipe1;

grant usage on stage snowpipe_db.public.mystage to role snowpipe1;

-- Grant the OWNERSHIP privilege on the pipe object
grant ownership on pipe snowpipe_db.public.mypipe to role snowpipe1;

-- Grant the role to a user
grant role snowpipe1 to user jsmith;

-- Set the role as the default role for the user
alter user jsmith set default_role = snowpipe1;
&lt;/code&gt;&lt;/pre&gt;



&lt;h3&gt;
  
  
  Step 5: Configure S3 for Event Notification for Snowpipe Configuration
&lt;/h3&gt;

&lt;p&gt;This is an important step. For convenience, a Snowflake-managed SQS queue can be used to receive new events from your S3 bucket.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Does Snowpipe Load Data into Snowflake?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As depicted in the diagram below, as soon as new files arrive at the external stage (in this case, your personal or your organization’s bucket), an event notification is triggered and stored in the SQS queue (which is Snowflake-managed, hosted in Snowflake’s AWS cloud, outside your organization’s account and VPC).&lt;/p&gt;

&lt;p&gt;A pipe (consumer) configured on the Snowflake side keeps listening to the SQS queue, signals Snowflake to provision compute, and starts ingesting data into Snowflake using the COPY command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Dw7p06-h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2At8wqedFBQivs24NNOE8k5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Dw7p06-h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2At8wqedFBQivs24NNOE8k5g.png" alt="" width="800" height="853"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image Courtesy: snowflake.com&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What remains now is to create the S3 event notification and configure the SQS queue.&lt;/p&gt;

&lt;p&gt;To get the SQS ARN (managed by Snowflake), run the following command in Snowflake and copy the SQS ARN from the notification_channel column, as shown in the picture below.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;show pipes;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iwD3iS17--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2ANAUEaZcl4Z4v4PbFfU0ytg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iwD3iS17--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2ANAUEaZcl4Z4v4PbFfU0ytg.png" alt="" width="800" height="83"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now follow this sequence to create the S3 event notification:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AWS S3 console &amp;gt;&amp;gt; Properties &amp;gt;&amp;gt; Event Notification &amp;gt;&amp;gt; Create Event Notification &amp;gt;&amp;gt; add path &amp;amp; SQS ARN &amp;gt;&amp;gt; Save/Submit&lt;/p&gt;
&lt;/blockquote&gt;
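&lt;p&gt;The console sequence above can also be scripted. Here is a hedged sketch of the notification payload you would pass to boto3’s &lt;em&gt;put_bucket_notification_configuration&lt;/em&gt;; the SQS ARN comes from the show pipes output, and the bucket name and prefix are placeholders for your own values:&lt;/p&gt;

```python
def snowpipe_notification_config(sqs_arn: str, prefix: str = "") -> dict:
    """Build an S3 notification configuration that routes object-created
    events under `prefix` to the Snowflake-managed SQS queue."""
    return {
        "QueueConfigurations": [
            {
                "QueueArn": sqs_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
                },
            }
        ]
    }

# With boto3 this would be applied as (assumption: default AWS credentials):
#   boto3.client("s3").put_bucket_notification_configuration(
#       Bucket="my-bucket",
#       NotificationConfiguration=snowpipe_notification_config(arn, "data/"))
cfg = snowpipe_notification_config("arn:aws:sqs:us-east-1:000000000000:sf-snowpipe-demo", "data/")
print(cfg["QueueConfigurations"][0]["Events"])  # ['s3:ObjectCreated:*']
```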

&lt;p&gt;If you are still with me, all the steps are complete, and now is the time to test the continuous load by uploading files to the S3 bucket.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Validate the Data Load using Snowpipe
&lt;/h3&gt;

&lt;p&gt;Now, if you have uploaded files to S3, you will start seeing the data in the table. If the data is still not loaded, you probably have these questions in mind.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;How to check if Snowpipe is successfully established?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run the following command on Snowflake to check if the connection is successfully established. Here ‘mypipe’ is the name of the pipe created in Step 3.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select SYSTEM$PIPE_STATUS('mypipe');
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;This command will return a JSON response, where “executionState”:”RUNNING” denotes that everything has been set up properly and is working as expected.&lt;/p&gt;
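&lt;p&gt;If you are checking this from a script, the JSON response is easy to inspect. A small sketch; the sample payload below is a trimmed, hypothetical response, and you would fetch the real one through your Snowflake connector of choice:&lt;/p&gt;

```python
import json

def pipe_is_healthy(status_json: str) -> bool:
    """True when SYSTEM$PIPE_STATUS reports the pipe as RUNNING."""
    return json.loads(status_json).get("executionState") == "RUNNING"

# Hypothetical response, trimmed to the fields discussed in this post
sample = '{"executionState": "RUNNING", "pendingFileCount": 0}'
print(pipe_is_healthy(sample))  # True
```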

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;I have uploaded the file to S3, so why is the data still not visible in the table via Snowpipe?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have uploaded the file on S3, but data is not yet loaded onto the table, then check the issue with the content using this command:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from table(validate_pipe_load(
  pipe_name=&amp;gt;'MYPIPE',
  start_time=&amp;gt;dateadd(hour, -24, current_timestamp())));
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;This command returns a result, a sample of which is shown below, showing whether there was a problem with the data or an incompatible schema.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PwZVowmf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/640/1%2AeqMqmeEh_ouNMhSCIuKsiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PwZVowmf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/640/1%2AeqMqmeEh_ouNMhSCIuKsiw.png" alt="" width="640" height="464"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image by Ashish Mishra&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The SYSTEM$PIPE_STATUS command shows the last ingested file, so why is the data not visible in the target table?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are chances that the &lt;strong&gt;SYSTEM$PIPE_STATUS&lt;/strong&gt; command shows the correct &lt;em&gt;lastIngestedFilePath&lt;/em&gt; and &lt;em&gt;lastForwardedFilePath&lt;/em&gt; even though the file was not loaded; this happens because there is some problem with the file, as mentioned in the question above.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;When using Snowpipe, why am I not able to see any entry in load_history table for success or failure?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data loaded (or failed) via Snowpipe does not create entries in load_history; Snowpipe uses ‘&lt;em&gt;pipe_usage_history&lt;/em&gt;’ for this purpose instead.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select *
  from table(information_schema.pipe_usage_history(
    date_range_start=&amp;gt;dateadd('hour',-24,current_timestamp()),
    pipe_name=&amp;gt;'mypipe'));
&lt;/code&gt;&lt;/pre&gt;



&lt;h3&gt;
  
  
  How Is Snowpipe Cost Calculated?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Compute resources are necessary to decompress, decrypt, and transform the new data; Snowflake charges for this compute usage.&lt;/li&gt;
&lt;li&gt;Apart from this, an overhead cost for managing files in the internal Snowpipe queue is also included, and this overhead grows in proportion to the number of files queued for loading and the size of the files loaded (a file that takes longer to load increases queue time and thus cost).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To validate this and get some approximation, I ran a few experiments:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snowpipe charges approximately 0.06 USD per 1000 files queued.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my experiment, loading a 10K-row CSV file consumed around 0.000287202 credits = 0.001148808 USD.&lt;/p&gt;
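&lt;p&gt;Putting those two numbers together gives a quick back-of-the-envelope estimate. The rates below come from my experiment (at roughly 4 USD per credit); your contract rates and file sizes will differ:&lt;/p&gt;

```python
# Rates observed in the experiment above (assumptions, not official pricing)
QUEUE_OVERHEAD_PER_FILE_USD = 0.06 / 1000   # ~0.06 USD per 1,000 files queued
COMPUTE_PER_FILE_USD = 0.000287202 * 4      # one 10K-row CSV, at 4 USD/credit

def daily_snowpipe_cost_usd(files_per_day: int) -> float:
    """Rough daily cost: per-file queue overhead plus per-file compute."""
    return files_per_day * (QUEUE_OVERHEAD_PER_FILE_USD + COMPUTE_PER_FILE_USD)

print(round(daily_snowpipe_cost_usd(100_000), 2))  # ~120.88 USD at 100K files/day
```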

&lt;blockquote&gt;
&lt;p&gt;One important point I noticed while doing this experiment: there is no definite latency on a Snowpipe load, and sometimes it takes unexpectedly long to load data, with not much you can do during that time. I think Snowflake should add more transparency to the whole process.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  How to reduce cost in Snowpipe?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Files should be roughly 10–100 MB compressed (Snowflake documentation suggests around 100–250 MB, but my recommendation is to keep files smaller).&lt;/li&gt;
&lt;li&gt;Try to keep the total time to aggregate file data within one minute. If your source application takes longer than a minute to accumulate the data, consider writing split data files once per minute.&lt;/li&gt;
&lt;li&gt;If files arrive at the stage location more frequently, with more than one file coming within a minute, Snowpipe will use its internal load queue to manage them and the overhead cost will increase.&lt;/li&gt;
&lt;li&gt;Snowflake recommends enabling S3 event filtering for Snowpipe to reduce event noise, latency, and finally cost.&lt;/li&gt;
&lt;li&gt;If file size in the above range is not possible, then consider removing &lt;strong&gt;“SKIP_FILE”&lt;/strong&gt; as the default option for Snowpipe, as it might waste a lot of resources and credits and might cause a huge delay. The better option would be &lt;strong&gt;“CONTINUE”&lt;/strong&gt; for such cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Things to Note:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Only Snowflake hosted on AWS supports AWS S3 event notifications for Snowpipe (at the time this blog was posted).&lt;/li&gt;
&lt;li&gt;When you use AWS SQS notifications, event data moves out of your current VPC, and this traffic is not protected by your VPC.&lt;/li&gt;
&lt;li&gt;All data types are supported, including semi-structured data types such as JSON and Avro.&lt;/li&gt;
&lt;li&gt;Snowpipe does not reload a file with the same name even if it is modified later, because Snowpipe maintains its own load metadata, and modifying the file doesn’t update that metadata.&lt;/li&gt;
&lt;li&gt;Snowpipe maintains load history only for 14 days, a modified file can be loaded after 14 days as metadata will not be valid then. While bulk data load keeps this metadata for 64 days.&lt;/li&gt;
&lt;li&gt;Snowpipe does not guarantee that files will be loaded in the same order they were staged, though unless files are being staged at exceptionally high velocity, an order violation is unlikely to be visible or detectable. To handle this, it is recommended to load smaller files once per minute. Load order is not maintained because multiple processes pull files from the queue, and depending on how long each takes to load, the data can appear out of sequence.&lt;/li&gt;
&lt;li&gt;By default, Snowflake applies &lt;strong&gt;“SKIP_FILE”&lt;/strong&gt; when there is an error in loading files, while copy command uses &lt;strong&gt;ABORT_STATEMENT&lt;/strong&gt; as the default behavior.&lt;/li&gt;
&lt;li&gt;Snowflake caches the temporary credentials for a period that cannot exceed the 60 minutes expiration time. If you revoke access from Snowflake, users might be able to list files and load data from the cloud storage location until the cache expires.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion:
&lt;/h3&gt;

&lt;p&gt;After my experiment with Snowpipe, these are my takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It could not handle high-velocity workloads, where many files arrive at a fast pace, and maintaining order was, of course, a big issue.&lt;/li&gt;
&lt;li&gt;When files are larger (between 100 MB and 200 MB), loading takes unexpectedly long, while the same file can be loaded in less time using the COPY command.&lt;/li&gt;
&lt;li&gt;Once a file lands on S3, there is little visibility into file status and queue length, so checking why files were not loaded, and debugging in general, becomes tricky. (If you have found a solution for this, please comment.)&lt;/li&gt;
&lt;li&gt;Snowpipe is good for very small files at very low frequency, where the latency of file ingestion is not a determining factor.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  References:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3#step-3-configure-security"&gt;Snowflake Official Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;



</description>
      <category>realtime</category>
      <category>snowflake</category>
      <category>datawarehous</category>
      <category>snowpipe</category>
    </item>
    <item>
      <title>Listing Billion Number of S3 Objects into SQS: Challenges &amp; Benchmarks</title>
      <dc:creator>Ashish Mishra</dc:creator>
      <pubDate>Sat, 18 Feb 2023 08:14:36 +0000</pubDate>
      <link>https://dev.to/arglee/listing-billion-number-of-s3-objects-into-sqs-challenges-benchmarks-l15</link>
      <guid>https://dev.to/arglee/listing-billion-number-of-s3-objects-into-sqs-challenges-benchmarks-l15</guid>
<description>&lt;p&gt;Can the S3 event notification service scale enough to handle a billion high-velocity events, and can SQS handle these events without any data drop? This blog is all about unveiling these use cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet3qvbrox0wuvcdk7spb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet3qvbrox0wuvcdk7spb.png" width="800" height="829"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have been working in the healthcare industry for the past 8 years, and one of the most interesting problems in this industry is skewness and variation in data. The day you think you have handled all the issues with the data is the day you receive double what you faced earlier.&lt;/p&gt;

&lt;p&gt;On a daily basis, we receive more than 200 million files, and this number can reach billions in a single historical data transfer. Another interesting aspect is file size, which ranges from a few bytes to 20 MB.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s the Goal?
&lt;/h3&gt;

&lt;p&gt;The problem/target is very generic.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data is coming from multiple sources into the S3 bucket.&lt;/li&gt;
&lt;li&gt;Parse the data using standard parsers which can scale automatically to parse huge amounts of files.&lt;/li&gt;
&lt;li&gt;Clean and transform the data into the standard format.&lt;/li&gt;
&lt;li&gt;Map the transformed source schema to &lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a#:~:text=Data%20Warehouse%20%26%20Data%20Lake%20%26%20Data%20Mart%3A" rel="noopener noreferrer"&gt;warehouse&lt;/a&gt; schema.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni4mg721r53tiut2mxbt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni4mg721r53tiut2mxbt.png" width="733" height="344"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The end Goal is to get data to the data warehouse in the standard schema&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Challenges with the Process
&lt;/h3&gt;

&lt;p&gt;As per the above diagram, there are 4 steps. For this blog, we will limit our discussion to S3 source file arrival and parsers only, where we are interacting with billions of files in chronological order.&lt;/p&gt;

&lt;p&gt;The system faces multiple challenges when the number of objects in the S3 increases exponentially:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exploration of Data:&lt;/strong&gt; Listing objects on the S3 bucket becomes very slow and tiresome for &lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a#:~:text=Data%20Architect%20%26%20Data%20Engineer" rel="noopener noreferrer"&gt;data engineers&lt;/a&gt; while exploring the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parsing and Cleansing:&lt;/strong&gt; Parsing and cleansing bulk data becomes challenging due to the slow listing of S3 objects (as of this writing, a single-threaded Python program takes approximately 20 minutes to list 1 million objects using the standard AWS SDK).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reprocessing and Error Handling:&lt;/strong&gt; Handling processed vs unprocessed &lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a" rel="noopener noreferrer"&gt;data&lt;/a&gt; becomes tricky due to a large number of objects and in case of any error re-processing of data takes a lot of time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Management:&lt;/strong&gt; &lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a" rel="noopener noreferrer"&gt;Metadata management&lt;/a&gt; becomes tricky if the metadata(Size, Modified Time, Ingested Time etc) of objects is not already available(&lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a" rel="noopener noreferrer"&gt;Read more about metadata management and its challenges here&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
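&lt;p&gt;To see why listing hurts at this scale: ListObjectsV2 returns at most 1,000 keys per call, so a single-threaded lister must make one sequential API round trip per thousand objects. A minimal sketch with boto3 (the bucket name is a placeholder):&lt;/p&gt;

```python
import math


def pages_needed(n_objects: int, page_size: int = 1000) -> int:
    """ListObjectsV2 returns at most `page_size` keys per call, so a
    single-threaded lister makes this many sequential round trips."""
    return math.ceil(n_objects / page_size)


def list_keys(bucket: str, prefix: str = ""):
    """Yield every key under `prefix`, one page (at most 1000 keys) at a time."""
    import boto3  # deferred import; needs AWS credentials to actually run

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]


# 1M objects => 1,000 sequential round trips, which is where the
# ~20-minute listing time observed above comes from.
print(pages_needed(1_000_000))
```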
&lt;h3&gt;
  
  
  How to Solve the Above Problem?
&lt;/h3&gt;

&lt;p&gt;The solution is also straightforward, but can we scale it enough to handle a billion objects?&lt;/p&gt;

&lt;p&gt;We decided to keep track of each file arriving in the S3 bucket and save that information in a database or other persistent storage, which our parsers can later access to overcome the problems mentioned above.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjft4xnvqybfwy03r2i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjft4xnvqybfwy03r2i4.png" width="800" height="271"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Capture events from S3 into SQS &amp;amp; store them in persistent storage for later access&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Though it is clear from the &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-enable-disable-notification-intro.html" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt; that S3 event notifications can get us where we want to go, we wanted to check whether they can handle huge, high-velocity traffic with files arriving at &lt;strong&gt;4000–5000 objects/second&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Experiment
&lt;/h3&gt;

&lt;p&gt;To check the validity of this approach, a quick experiment is required, consisting of these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Set up S3 and SQS with required permissions:&lt;/strong&gt; Create an S3 bucket with the prefix required for the problem and &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-enable-disable-notification-intro.html" rel="noopener noreferrer"&gt;SQS with the required permission&lt;/a&gt; to start receiving S3 events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate high-paced data load:&lt;/strong&gt; Create a script that generates random 1 KB files and streams them to the S3 bucket, and another script that keeps checking the delay of the messages on the SQS queue and compares their count to the number of messages sent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run and Monitor:&lt;/strong&gt; Now run the scripts to load data and compare delay.&lt;/li&gt;
&lt;/ol&gt;
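&lt;p&gt;The loader from step 2 can be sketched roughly as below (bucket and prefix are placeholders; the real experiment ran many such workers in parallel):&lt;/p&gt;

```python
import os
import time


def random_payload(n_bytes: int = 1024) -> bytes:
    """A random 1 KB blob, matching the experiment's file size."""
    return os.urandom(n_bytes)


def load_files(bucket: str, prefix: str, count: int) -> float:
    """Upload `count` random files and return the achieved files/second."""
    import boto3  # deferred import; needs AWS credentials to actually run

    s3 = boto3.client("s3")
    start = time.time()
    for i in range(count):
        s3.put_object(Bucket=bucket, Key=f"{prefix}/file_{i}.bin",
                      Body=random_payload())
    return count / (time.time() - start)
```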

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmjwg73kldw768frj37z.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmjwg73kldw768frj37z.gif" width="1152" height="648"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Top GIF: Script to load data, and count SQS events &amp;amp; their delay, Bottom: S3 &amp;amp; SQS data sample&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  It's Result Time:
&lt;/h3&gt;

&lt;p&gt;Once we parallelized the above scenario and started loading files, the results surprised me.&lt;/p&gt;

&lt;p&gt;With the automation script, I was able to reach up to &lt;strong&gt;2,500 file creations per second&lt;/strong&gt; on S3, and the maximum delay between a file landing and SQS receiving the event notification for the PUT event was &lt;strong&gt;100ms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is the sample message that you receive on the SQS for the PUT event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Records": [
    {
      "eventVersion": "2.1",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventTime": "2023-01-22T18:06:53.713Z",
      "eventName": "ObjectCreated:Put",
      "userIdentity": {
        "principalId": "AWS:&amp;lt;PRINICIALID&amp;gt;"
      },
      "requestParameters": {
        "sourceIPAddress": "&amp;lt;IP ADDRESS&amp;gt;"
      },
      "responseElements": {
        "x-amz-request-id": "12345678AF67",
        "x-amz-id-2": "wertyuiofghjkcvbn456789rtyui45678tyui56789rtyuiopertydfghjkcvbn"
      },
      "s3": {
        "s3SchemaVersion": "1.0",
        "configurationId": "s3_sqs_load_testing",
        "bucket": {
          "name": "s3_sqs_load_testing",
          "ownerIdentity": {
            "principalId": "&amp;lt;PRINCIPAL ID&amp;gt;"
          },
          "arn": "arn:aws:s3:::s3_sqs_load_testing"
        },
        "object": {
          "key": "s3_sqs_load_testing/base_path/1st_iterator/20230122/test.csv",
          "size": 14311,
          "eTag": "987fghjk789hj89",
          "versionId": "SdgdghjdbhfjhYUhjdhfj",
          "sequencer": "0063CD7B3D9FA2B9E6"
        }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
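&lt;p&gt;The consumer side of the solution can be sketched as follows: poll SQS, extract bucket/key/size/time from messages shaped like the sample above, and persist them (SQLite here stands in for "any persistent storage"; the queue URL is a placeholder):&lt;/p&gt;

```python
import json
import sqlite3


def parse_records(body: str):
    """Extract (bucket, key, size, event_time) rows from one S3 event message."""
    rows = []
    for rec in json.loads(body).get("Records", []):
        s3 = rec["s3"]
        rows.append((s3["bucket"]["name"], s3["object"]["key"],
                     s3["object"].get("size", 0), rec["eventTime"]))
    return rows


def store(db: sqlite3.Connection, rows) -> None:
    """Persist parsed events so parsers never have to list S3 again."""
    db.execute("CREATE TABLE IF NOT EXISTS s3_files"
               "(bucket TEXT, key TEXT, size INTEGER, event_time TEXT)")
    db.executemany("INSERT INTO s3_files VALUES (?, ?, ?, ?)", rows)
    db.commit()


def poll_forever(queue_url: str) -> None:
    import boto3  # deferred import; needs AWS credentials to actually run

    sqs, db = boto3.client("sqs"), sqlite3.connect("s3_files.db")
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            store(db, parse_records(msg["Body"]))
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
```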



&lt;h3&gt;
  
  
  Estimated Cost for the Problem:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtlw1f0a7kwqlybqz5hj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtlw1f0a7kwqlybqz5hj.png" width="800" height="360"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Pricing as of January 23* for US-EAST region. Please confirm exact prices from official AWS website&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As shown in the above diagram, pricing for the solution is nominal compared to what it solves for your &lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a" rel="noopener noreferrer"&gt;metadata management system&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The above cost covers 1 million objects, but we wanted the cost for a billion. Since 1 billion is 1,000 million, the estimated cost for 1 billion objects will be around 5.857 × 1,000 = $5,857.&lt;/p&gt;

&lt;p&gt;This is the total cost of the solution. Since you are already receiving files on S3 that cannot be removed, this adds approximately ~15% extra cost on top of the current S3 system. Interestingly, if you look at the whole journey, from the parsers (which used to list the same objects multiple times) to metadata management (which lists files for data engineers to explore), the net overhead drops further, to approximately ~10%.&lt;/p&gt;

&lt;p&gt;Now, this extra 10% cost has increased &lt;strong&gt;the total efficiency of the system by 2.5X&lt;/strong&gt;, and parsers are able to finish the job in 2.5 times less time than before, which is a good win.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitations of the Current Approach
&lt;/h3&gt;

&lt;p&gt;Though the above experiment was a success, there are still a few limitations that might be deal-breakers for some use cases. Those are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Amazon SQS queue must be in the same AWS Region as your Amazon S3 bucket (&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-event-types-and-destinations.html#:~:text=The%20Amazon%20SQS%20queue%20must%20be%20in%20the%20same%20AWS%20Region%20as%20your%20Amazon%20S3%20bucket" rel="noopener noreferrer"&gt;&lt;em&gt;mention&lt;/em&gt;&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Enabling notifications is a bucket-level operation and notifications need to be enabled for each bucket separately, though events can be published on a single queue (&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-enable-disable-notification-intro.html" rel="noopener noreferrer"&gt;mention&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;After you create or change the bucket notification configuration, it usually takes about five minutes for the changes to take effect (&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-enable-disable-notification-intro.html" rel="noopener noreferrer"&gt;mention&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;When notifications are first enabled or the notification configuration is changed, an s3:TestEvent occurs. If your use case forces very frequent configuration changes, the consumer needs additional logic to skip these messages (&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-enable-disable-notification-intro.html" rel="noopener noreferrer"&gt;mention&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Event notifications aren’t guaranteed to arrive in the same order that the events occurred. However, notifications from events that create objects (PUTs) and delete objects contain a sequencer, which can be used to determine the order of events for a given object key. If you compare the sequencer strings from two event notifications on the same object key, the event notification with the greater sequencer hexadecimal value is the event that occurred later.&lt;/li&gt;
&lt;li&gt;Across events for different buckets or objects within a bucket, the sequencer value should not be considered useful for ordering comparisons (&lt;a href="https://stackoverflow.com/questions/55287965/s3-notification-occurrence-or-sqs-reception-sequencer-ordering#:~:text=Across%20events%20for%20different%20buckets%20or%20objects%20within%20a%20bucket%2C%20the%20sequencer%20value%20should%20not%20be%20considered%20useful%20for%20ordering%20comparisons" rel="noopener noreferrer"&gt;Check this stackoverflow answer&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;If you’re using event notifications to maintain a separate database or index of your Amazon S3 objects, AWS recommends that you compare and store the sequencer values as you process each event notification.&lt;/li&gt;
&lt;li&gt;Each 64 KB chunk of a payload is billed as 1 request (for example, an API action with a 256 KB payload is billed as 4 requests)&lt;/li&gt;
&lt;li&gt;Every Amazon SQS action counts as a request. On the S3 side, the GET per-request charge covers handling the actual request for the file (checking whether it exists, checking permissions, fetching it from storage, and preparing to return it to the requester) each time it is downloaded; the data transfer charge covers the actual transfer of the file’s contents from S3 to the requester over the Internet. If you include a link to a file on your site but the user never downloads it and the browser doesn’t load it, auto-play it, or pre-load it, S3 knows nothing about that, so you wouldn’t be billed. The same is true for pre-signed URLs: those don’t result in any billing unless they’re actually used, because they’re generated on your server.&lt;/li&gt;
&lt;/ul&gt;
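&lt;p&gt;The sequencer comparison mentioned above can be sketched like this. Sequencers for the same object key may differ in length, so the shorter one is right-padded with zeros before a lexicographic comparison (the approach suggested in AWS examples); treat this as an illustrative sketch:&lt;/p&gt;

```python
def later_event(seq_a: str, seq_b: str) -> int:
    """Order two S3 event sequencers for the SAME object key.
    Returns -1 if event a occurred first, 1 if event b did, 0 if equal."""
    width = max(len(seq_a), len(seq_b))
    # Right-pad the shorter hex string with zeros, then compare
    # lexicographically (uppercase hex sorts in numeric order per digit).
    a, b = seq_a.ljust(width, "0"), seq_b.ljust(width, "0")
    return (a > b) - (a < b)
```

If you maintain a separate database or index of objects, store the sequencer with each key and only overwrite a record when the incoming event's sequencer is the greater one.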

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Though there are a few limitations for some use cases when you take this to production, the experiment was successful. Here is what was achieved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The overall speed of file exploration by data engineers and stewards increased up to &lt;strong&gt;1.5x&lt;/strong&gt;, as listing became easy and files can be grouped by regex pattern, which is not easy with direct S3 exploration (&lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a#:~:text=Data%20Steward%20%26%20Data%20Analyst%20%26%20Data%20Scientist%3A" rel="noopener noreferrer"&gt;&lt;em&gt;read more about data engineers vs. data stewards here&lt;/em&gt;&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;The overall speed of parsing files increased by &lt;strong&gt;2.5x&lt;/strong&gt;, as listing can be avoided during parsing and parsers can be scaled by a pre-determined scaling factor (based on the number of files).&lt;/li&gt;
&lt;li&gt;Better &lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a#:~:text=Blog%20by%20Dataedo-,Data%20Democratization%3A,-Empowering%20employees%20and" rel="noopener noreferrer"&gt;Data Democratization&lt;/a&gt; for the organization by enabling rich searching over data.&lt;/li&gt;
&lt;li&gt;… and many more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;[Writer’s Corner]&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A data platform and analytics are basic necessities not only for IT companies but also for non-IT companies, and object storage is one of the most common components of a data platform. Keeping object storage healthy (un-messy) should be treated as the highest priority.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>programming</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Bulk Data Load from S3 to Snowflake: Stepwise Process, Benchmarks &amp; Cost</title>
      <dc:creator>Ashish Mishra</dc:creator>
      <pubDate>Sat, 11 Feb 2023 12:02:32 +0000</pubDate>
      <link>https://dev.to/arglee/bulk-data-load-from-s3-to-snowflake-stepwise-process-benchmarks-cost-3d4e</link>
      <guid>https://dev.to/arglee/bulk-data-load-from-s3-to-snowflake-stepwise-process-benchmarks-cost-3d4e</guid>
      <description>&lt;p&gt;What will be the total cost if we load micro batches of data using bulk load from S3 to Snowflake at high frequency? Can snowflake load huge data files in a single go?&lt;/p&gt;

&lt;p&gt;In this blog, we will look at how bulk load works in Snowflake, the syntax to bulk-load data into Snowflake, and the cost of loading frequent data using bulk load from S3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sI2nn-dh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2A4Ri0K39ViOtOh_lKeHNp6w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sI2nn-dh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2A4Ri0K39ViOtOh_lKeHNp6w.png" alt="" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Bulk Data load from S3 to Snowflake. Image Courtesy: singlequote.blog&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are 2 ways to load data into Snowflake:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bulk Load using COPY INTO&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/p/c63828c74b8e/edit"&gt;Continuous Load using SNOWPIPE&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this blog, we will discuss the &lt;em&gt;Bulk Load using Copy Into&lt;/em&gt; command. &lt;a href="https://medium.com/p/c63828c74b8e/"&gt;Refer here&lt;/a&gt; if you want to learn more about continuous load using Snowpipe in Snowflake with multiple notification systems (S3/SQS).&lt;/p&gt;

&lt;p&gt;Bulk data load is a 4-step process. Before we deep-dive into each step, you can refer to the SQL script below for a quick walkthrough.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Create file format to let system know the format
create file format if not exists single_quote_csv_format
   type = 'CSV'
   field_delimiter = ','
   skip_header = 1;

-- Check if any storage integration already exists   
show storage integrations;

-- Create storage integration with role arn received from AWS console.
create or replace storage integration single_quote_blog_s3_integration
  type = external_stage
  storage_provider = 'S3'
  storage_aws_role_arn = 'arn:aws:iam::&amp;lt;id&amp;gt;:role/single-quote-blog-snowflake-s3'
  enabled = true
  storage_allowed_locations = ('s3://single-quote-blog/sample_blog_data/');

-- Describe storage integration to get the external_id to be applied on the AWS IAM role.
describe storage integration single_quote_blog_s3_integration;

-- Create external stage with created storage integration
create stage if not exists single_quote_blog_stage
file_format = single_quote_csv_format
url = 's3://single-quote-blog/sample_blog_data/json_sample_files/'
storage_integration = single_quote_blog_s3_integration;

-- Create a table to load data into the snowflake table
CREATE TABLE if not exists single_quote_employee_data(name VARCHAR(255), dob DATE, designation VARCHAR(255), event_time TIMESTAMP);

-- Run copy command to load data into snowflake table
copy into single_quote_employee_data
  from @single_quote_blog_stage/random_object_storage_data.csv
  on_error = 'skip_file';

-- Verify the loaded records from load history table
select * from SNOWFLAKE.ACCOUNT_USAGE.load_history order by last_load_time desc limit 10;

-- Run count on the table to validate the records loaded into the table
select count(*) from single_quote_employee_data;

-- Check few records for sanity
SELECT * FROM single_quote_employee_data;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we are aware of the quick syntax and steps, let us understand each step in brief and how to connect with the S3 bucket using AWS console.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Create File Format Objects&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;File format helps Snowflake understand how the data in the file should be interpreted and processed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create or replace file format single_quote_csv_format
   type = 'CSV'
   field_delimiter = ','
   skip_header = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Create Storage Integration &amp;amp; Access Permissions
&lt;/h3&gt;

&lt;p&gt;In layman's terms, Storage integration enables a handshake between Snowflake and your S3 bucket.&lt;/p&gt;

&lt;p&gt;The Snowflake storage integration object stores a generated Identity and Access Management (IAM) entity for your external cloud storage (S3); to complete the handshake, you are required to add the entity provided by Snowflake to your authorized keys/entities.&lt;/p&gt;

&lt;p&gt;Setting up storage integration involves multiple steps, in this sequence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Create AWS Policy &amp;gt;&amp;gt; Create AWS Role &amp;amp; attach policy &amp;gt;&amp;gt; Create SF Storage Integration &amp;gt;&amp;gt; Update Trust relationship using External ID &amp;amp; user arn from storage Integration&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now let us understand each of the above steps in detail.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2.1: Create AWS policy:
&lt;/h4&gt;

&lt;p&gt;Create AWS policy, that will give permission to Snowflake to be able to access files in the folder (and sub-folders).&lt;/p&gt;

&lt;p&gt;This includes the following steps in order:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AWS Console &lt;strong&gt;&amp;gt;&amp;gt;&lt;/strong&gt; Search IAM &lt;strong&gt;&amp;gt;&amp;gt;&lt;/strong&gt; Policies &lt;strong&gt;&amp;gt;&amp;gt;&lt;/strong&gt; Create Policy &lt;strong&gt;&amp;gt;&amp;gt;&lt;/strong&gt; Click JSON &lt;strong&gt;&amp;gt;&amp;gt;&lt;/strong&gt; Add JSON permission &lt;strong&gt;&amp;gt;&amp;gt;&lt;/strong&gt; Add policy name &lt;strong&gt;&amp;gt;&amp;gt;&lt;/strong&gt; Save&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Give the policy a unique name, like “singlequote_sf_s3_policy”, as it will be used in a later step while creating the AWS role.&lt;/p&gt;

&lt;p&gt;Paste the following JSON into the “JSON” tab and click “Next” to save the policy&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::single_quote_blog/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::single_quote_blog",
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "sample_blog_data/*"
                    ]
                }
            }
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 2.2: Create AWS role arn
&lt;/h4&gt;

&lt;p&gt;Create an AWS role for Snowflake. To create the role, follow these steps in order.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AWS Console &amp;gt;&amp;gt; Search IAM &amp;gt;&amp;gt; Click on Roles &amp;gt;&amp;gt; Create role &amp;gt;&amp;gt; AWS Service &amp;gt;&amp;gt; Use case (S3) &amp;gt;&amp;gt; select policy created in last step &amp;gt;&amp;gt; Add name &amp;gt;&amp;gt; Click Create role&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Step 2.3: Create Storage Integration
&lt;/h4&gt;

&lt;p&gt;Now copy the “role arn” from the Roles screen of the AWS console. It must be something like this —&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;arn:aws:iam::&amp;lt;id&amp;gt;:role/single-quote-blog-snowflake-s3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now go back to the Snowflake database and run the following command —&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create or replace storage integration single_quote_blog_storage_int
  type = external_stage
  storage_provider = 'S3'
  storage_aws_role_arn = 'arn:aws:iam::&amp;lt;id&amp;gt;:role/single-quote-blog-snowflake-s3'
  enabled = true
  storage_allowed_locations = ('s3://single-quote-blog/sample_blog_data/');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are other options available on the above SQL command which can allow or disallow certain locations. &lt;a href="https://docs.snowflake.com/en/sql-reference/sql/create-storage-integration.html"&gt;Please refer to the official Snowflake documentation for more information&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you are not sure whether a storage integration is already available, or if you do not want to mess up current data pipelines running in production, it is recommended to list all storage integrations (using the show command above) and then describe them to check whether the storage location is already covered.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2.4: Modify the Trust Relationship for the Role
&lt;/h4&gt;

&lt;p&gt;Once you have created the storage integration, Snowflake generates an external ID, which must be added to the trust relationship. The external ID and the Snowflake user ARN are required to grant access between your AWS resource (i.e., S3) and a third party (i.e., Snowflake).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Describe storage integration to get IAM user ARN &amp;amp; external ID &amp;gt;&amp;gt; AWS Console &amp;gt;&amp;gt; Roles &amp;gt;&amp;gt; Trust Relationships &amp;gt;&amp;gt; Add JSON for trust relationship&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Copy the external ID from the output of this command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;describe storage integration single_quote_blog_storage_int;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Uo0bDAxE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/415/1%2ATex-WR4TxyvBXD9oZS4S0A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Uo0bDAxE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/415/1%2ATex-WR4TxyvBXD9oZS4S0A.png" alt="" width="415" height="281"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Result of “describe storage” on snowflake&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Get the “STORAGE_AWS_IAM_USER_ARN” and “STORAGE_AWS_EXTERNAL_ID” from the result above, fill them into the JSON below, and paste it into the trust relationship.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::&amp;lt;id&amp;gt;:user/&amp;lt;id&amp;gt;"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "ForAnyValue:StringEquals": {
                    "sts:ExternalId": [
                        "&amp;lt;external_id_1",
                        "&amp;lt;external_id_2&amp;gt;"
                    ]
                }
            }
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Create Stage Objects
&lt;/h3&gt;

&lt;p&gt;A Snowflake stage specifies where the data files are stored (i.e., “staged”) so that the data in the files can be loaded into a table. Stages can be external or internal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create or replace stage single_quote_blog_stage
file_format = single_quote_csv_format
url = 's3://single-quote-blog/sample_blog_data/json_sample_files/'
storage_integration = single_quote_blog_storage_int;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Using the COPY command, you can load directly from the bucket, but Snowflake recommends creating an external stage that references the bucket and using that external stage instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MsdcYvmD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AMwrgBgfu9ZX3aPHH.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MsdcYvmD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AMwrgBgfu9ZX3aPHH.png" alt="" width="800" height="781"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image Courtesy: snowflake.com&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Validate that the stage is correct by listing the files on it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;list @single_quote_blog_stage;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Copy Data into Target Table
&lt;/h3&gt;

&lt;p&gt;Before running the copy command, create the table if it does not already exist, then load data into it using the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;copy into single_quote_employee_data
  from @single_quote_blog_stage/random_object_storage_data.csv
  on_error = 'skip_file';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Validate the successfully loaded data using the load_history table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from SNOWFLAKE.ACCOUNT_USAGE.load_history 
order by last_load_time desc limit 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B4P6pZJL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AFJlzQHmhBMzuA058LrJ6iw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B4P6pZJL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AFJlzQHmhBMzuA058LrJ6iw.png" alt="" width="800" height="127"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Output for load_history table. Image by Snowflake Documentation&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmarks &amp;amp; Cost for the Bulk Data Load
&lt;/h3&gt;

&lt;p&gt;Let’s try to load some big files on the ‘M’ Snowflake warehouse size.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FaMgvSpK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/914/1%2AUeGKxgJHquMvsyyBg2PAFg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FaMgvSpK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/914/1%2AUeGKxgJHquMvsyyBg2PAFg.png" alt="" width="800" height="200"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Benchmarks on CSV load for Snowflake. Image by Author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now, as discussed at the start of this blog, I wanted to load data onto the Snowflake at a higher frequency. These are the benchmarks and cost of the solution:&lt;/p&gt;

&lt;p&gt;Here are the considerations for the high-frequency data scenario:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each file has 1M records (5 columns) of size ~ 100MB.&lt;/li&gt;
&lt;li&gt;Each &lt;strong&gt;100MB file takes around 8.2 seconds to load into Snowflake on XS warehouse size.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Considering the Snowflake edition as Business Critical, which costs $4/credit: at warehouse size XS, it will utilize 1 credit/hour, i.e., $4/hour or $0.067/min.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.snowflake.com/en/user-guide/cost-understanding-overall.html#:~:text=Because%20Snowflake%20utilizes%20per%2Dsecond%20billing%20(with%20a%2060%2Dsecond%20minimum%20each%20time%20the%20warehouse%20starts)"&gt;The minimum cost for every time warehouse starts (60 Secs)&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
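&lt;p&gt;Under the assumptions above, the worst-case cost (the warehouse suspends between micro-batches, so every load pays the 60-second minimum) can be sketched as a quick calculation; the constants are taken from the assumptions, not official pricing:&lt;/p&gt;

```python
CREDIT_PRICE = 4.0       # $/credit, Business Critical edition (assumption 3)
CREDITS_PER_HOUR = 1.0   # XS warehouse (assumption 3)
MIN_BILLED_SECONDS = 60  # per-second billing, 60-second minimum (assumption 4)


def cost_per_load(load_seconds: float) -> float:
    """Dollar cost of one COPY INTO run on a freshly resumed XS warehouse."""
    billed = max(load_seconds, MIN_BILLED_SECONDS)
    return billed / 3600 * CREDITS_PER_HOUR * CREDIT_PRICE


def daily_cost(loads_per_day: int, load_seconds: float = 8.2) -> float:
    """Worst case: every micro-batch pays the 60-second minimum."""
    return loads_per_day * cost_per_load(load_seconds)


# One load per minute, all day: 1,440 loads at ~$0.067 each.
print(round(daily_cost(24 * 60), 2))
```

This matches the point below: the more frequent the micro-batches, the more 60-second minimums you pay.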

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tSOk4CdU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1006/1%2Alf02Hy-qAPT6A3NT_Kgv0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tSOk4CdU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1006/1%2Alf02Hy-qAPT6A3NT_Kgv0w.png" alt="" width="800" height="142"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Cost for loading 100MB file throughout the day. Image by Author&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It is clear from the costing above that more frequent data loads add more cost to the Snowflake cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Points to be noted:
&lt;/h3&gt;

&lt;p&gt;A few points to keep in mind while loading CSV data into the database:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whenever we create a stage or storage integration, a hidden ID is assigned to each, and the two are linked to each other. Therefore, recreating a storage integration (using CREATE OR REPLACE STORAGE INTEGRATION) breaks the association between the storage integration and any stage that references it: even if the names are the same, the newly generated ID will be different. It is then necessary to re-establish all the connections.&lt;/li&gt;
&lt;li&gt;The copy command can also be executed without a storage integration, though this is not recommended. The syntax for a direct copy command looks like this; alternatively, you can create a stage using credentials:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Crendentials in copy command
Copy into single_quote_employee_data 
from 's3://single-quote-blog/sample_blog_data/json_sample_files/random_object_storage_data.csv' 
credentials=(AWS_KEY_ID = '&amp;lt;access_key&amp;gt;' AWS_SECRET_KEY = '&amp;lt;secret_key&amp;gt;')  
file_format = (type = 'CSV');

-- Credentials in stage
CREATE STAGE single_quote_employee_data 
URL = 's3://single-quote-blog/sample_blog_data/json_sample_files/' 
CREDENTIALS = (AWS_KEY_ID = ' *******' AWS_SECRET_KEY = '*********');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Snowflake expects each record in a CSV file to be separated by newlines and the fields (i.e. individual values) in each record to be separated by commas. If &lt;strong&gt;&lt;em&gt;different&lt;/em&gt;&lt;/strong&gt; characters are used as record and field delimiters, you must &lt;strong&gt;&lt;em&gt;explicitly&lt;/em&gt;&lt;/strong&gt; specify them as part of the file format when loading.&lt;/li&gt;
&lt;li&gt;If the fields in the file do not align with the table schema, &lt;a href="https://docs.snowflake.com/en/user-guide/data-load-transform.html"&gt;Transforming Data During a Load&lt;/a&gt; can be used to reshape the data to match the table schema.&lt;/li&gt;
&lt;/ul&gt;
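&lt;p&gt;For instance, a hypothetical pipe-delimited file could be loaded by declaring the delimiters explicitly in the file format (the stage name and SKIP_HEADER setting below are illustrative, not from the dataset above):&lt;/p&gt;

```sql
-- Hypothetical example: the file uses '|' as the field delimiter,
-- so it must be declared explicitly (the defaults are ',' and newline)
COPY INTO single_quote_employee_data
FROM @single_quote_stage
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = '|' RECORD_DELIMITER = '\n' SKIP_HEADER = 1);
```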

&lt;h3&gt;
  
  
  Conclusion:
&lt;/h3&gt;

&lt;p&gt;The COPY command is one of the most common methods of loading bulk data into the database and is among the most prominently used commands across the industry. One challenge with bulk loading is that it is not yet a ready-to-use, zero-code flow. In times of zero-code pipelines, another alternative is using a pipe to load data into the database. But that is the topic for the next blog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;[Writer’s Corner]&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I work in the Data Platform group at Innovaccer as a Senior Engineering Manager, and while writing this blog I can proudly say that the Innovaccer Data Platform has been rated&lt;/em&gt; &lt;strong&gt;&lt;em&gt;Best in KLAS Data and Analytics Platform for 2 years in a row!&lt;/em&gt;&lt;/strong&gt; &lt;a href="https://innovaccer.com/resources/news/innovaccer-recognized-the-best-in-klas-data-and-analytics-platform-again?utm_source=linkedin&amp;amp;utm_medium=social&amp;amp;utm_campaign=KLAS"&gt;&lt;em&gt;Read more about the news here&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>snowflake</category>
      <category>microbatching</category>
      <category>bulkdata</category>
      <category>snowflakepricing</category>
    </item>
    <item>
      <title>Listing Billion Number of S3 Objects into SQS: Challenges &amp; Benchmarks</title>
      <dc:creator>Ashish Mishra</dc:creator>
      <pubDate>Sun, 29 Jan 2023 02:36:26 +0000</pubDate>
      <link>https://dev.to/arglee/listing-billion-number-of-s3-objects-into-sqs-challenges-benchmarks-1843</link>
      <guid>https://dev.to/arglee/listing-billion-number-of-s3-objects-into-sqs-challenges-benchmarks-1843</guid>
<description>&lt;p&gt;Can the S3 event notification service scale enough to handle a billion high-velocity events, and can SQS handle these events without any data drop? This blog is all about putting these questions to the test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet3qvbrox0wuvcdk7spb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet3qvbrox0wuvcdk7spb.png" width="800" height="829"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have been working in the healthcare industry for the past 8 years, and one of the most interesting problems in this industry is the skewness and variation in data. The day you think you have handled all the issues with the data is the day you receive double what you faced earlier.&lt;/p&gt;

&lt;p&gt;On a daily basis, we receive more than 200 million files, and this number can reach into the billions during a single historical data transfer. Another interesting aspect is file size, which ranges from a few bytes to 20 MB.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s the Goal?
&lt;/h3&gt;

&lt;p&gt;The problem statement is quite generic:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data is coming from multiple sources into the S3 bucket.&lt;/li&gt;
&lt;li&gt;Parse the data using standard parsers that can scale automatically to handle huge numbers of files.&lt;/li&gt;
&lt;li&gt;Clean and transform the data into the standard format.&lt;/li&gt;
&lt;li&gt;Map the transformed source schema to &lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a#:~:text=Data%20Warehouse%20%26%20Data%20Lake%20%26%20Data%20Mart%3A" rel="noopener noreferrer"&gt;warehouse&lt;/a&gt; schema.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni4mg721r53tiut2mxbt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni4mg721r53tiut2mxbt.png" width="733" height="344"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The end Goal is to get data to the data warehouse in the standard schema&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Challenges with the Process
&lt;/h3&gt;

&lt;p&gt;As per the above diagram, there are 4 steps. For this blog, we will limit our discussion to S3 source file arrival and the parsers only, where we interact with billions of files in chronological order.&lt;/p&gt;

&lt;p&gt;The system faces multiple challenges when the number of objects in the S3 increases exponentially:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exploration of Data:&lt;/strong&gt; Listing objects on the S3 bucket becomes very slow and tiresome for &lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a#:~:text=Data%20Architect%20%26%20Data%20Engineer" rel="noopener noreferrer"&gt;data engineers&lt;/a&gt; while exploring the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parsing and Cleansing:&lt;/strong&gt; Parsing and cleansing of bulk data become challenging due to the slow listing of S3 objects (at the time of writing, a single-threaded Python program takes approximately 20 minutes to list 1 million objects using the standard AWS SDK).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reprocessing and Error Handling:&lt;/strong&gt; Tracking processed vs. unprocessed &lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a" rel="noopener noreferrer"&gt;data&lt;/a&gt; becomes tricky with a large number of objects, and in case of any error, re-processing the data takes a lot of time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Management:&lt;/strong&gt; &lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a" rel="noopener noreferrer"&gt;Metadata management&lt;/a&gt; becomes tricky if the metadata (size, modified time, ingested time, etc.) of objects is not already available (&lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a" rel="noopener noreferrer"&gt;read more about metadata management and its challenges here&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;
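&lt;p&gt;To see why listing is slow: S3’s ListObjectsV2 API returns at most 1,000 keys per response, and each subsequent page requires the continuation token from the previous one, so a naive listing is strictly sequential. A back-of-the-envelope sketch of the number of chained API calls involved (the function name is ours):&lt;/p&gt;

```python
import math

# S3's ListObjectsV2 returns at most 1,000 keys per response, and each
# next page needs the previous response's continuation token, so a naive
# listing makes one API call after another.
KEYS_PER_CALL = 1_000

def sequential_list_calls(num_objects: int) -> int:
    """Chained ListObjectsV2 calls needed to enumerate num_objects keys."""
    return math.ceil(num_objects / KEYS_PER_CALL)

print(sequential_list_calls(1_000_000))      # 1,000 chained calls for 1M objects
print(sequential_list_calls(1_000_000_000))  # 1,000,000 calls for a billion
```

&lt;p&gt;At the roughly 20 minutes per million objects observed above, a billion objects would keep a single-threaded lister busy for about two weeks, which is exactly why an event-driven index becomes attractive.&lt;/p&gt;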
&lt;h3&gt;
  
  
  How to Solve the Above Problem?
&lt;/h3&gt;

&lt;p&gt;The solution is also straightforward, but can we scale it enough to handle a billion objects?&lt;/p&gt;

&lt;p&gt;We decided to keep track of each file arriving in the S3 bucket and save those records in a database or other persistent storage, which our parsers can later access to overcome the above-mentioned problems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjft4xnvqybfwy03r2i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjft4xnvqybfwy03r2i4.png" width="800" height="271"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Capture events from S3 into SQS &amp;amp; store them in persistent storage for later access&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Though it is clear from the &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-enable-disable-notification-intro.html" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt; that S3 event notifications can get us where we want to be, we wanted to check whether they can handle huge, high-velocity event streams where files arrive at a rate of &lt;strong&gt;4000–5000 objects/second.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Experiment
&lt;/h3&gt;

&lt;p&gt;In order to validate this, a quick experiment is required, which includes these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Set up S3 and SQS with required permissions:&lt;/strong&gt; Create an S3 bucket with the prefix required for the problem and &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-enable-disable-notification-intro.html" rel="noopener noreferrer"&gt;SQS with the required permission&lt;/a&gt; to start receiving S3 events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate high-paced data load:&lt;/strong&gt; Create a script that generates random 1 KB files and streams them to the S3 bucket, and another script that keeps checking the delay of the messages on SQS and compares their count with the number of messages sent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run and Monitor:&lt;/strong&gt; Now run the scripts to load data and measure the delay.&lt;/li&gt;
&lt;/ol&gt;
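&lt;p&gt;A rough sketch of the load-generation side (the key layout is made up, and the actual upload call is left as a comment since it needs boto3 plus AWS credentials):&lt;/p&gt;

```python
import os
import uuid

def make_payload(size_bytes: int = 1024) -> bytes:
    """Generate a random payload of the requested size (1 KB by default)."""
    return os.urandom(size_bytes)

def next_key(prefix: str = "base_path/1st_iterator") -> str:
    """Build a unique object key so parallel writers never collide."""
    return f"{prefix}/{uuid.uuid4().hex}.csv"

payload = make_payload()
key = next_key()
# In the real script this runs in a tight, parallelized loop with boto3:
#   s3.put_object(Bucket="s3_sqs_load_testing", Key=key, Body=payload)
```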

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmjwg73kldw768frj37z.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzmjwg73kldw768frj37z.gif" width="1152" height="648"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Top GIF: Script to load data, and count SQS events &amp;amp; their delay, Bottom: S3 &amp;amp; SQS data sample&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  It’s Result Time:
&lt;/h3&gt;

&lt;p&gt;Once we parallelized the above scenarios and started loading files, the results surprised me.&lt;/p&gt;

&lt;p&gt;With the automation script, I was able to reach up to &lt;strong&gt;2500 file creations per second&lt;/strong&gt; on S3, and the maximum delay between a file landing and SQS receiving the event notification for the PUT event was &lt;strong&gt;100 ms.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the sample message that you receive on the SQS for the PUT event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Records": [
    {
      "eventVersion": "2.1",
      "eventSource": "aws:s3",
      "awsRegion": "us-east-1",
      "eventTime": "2023-01-22T18:06:53.713Z",
      "eventName": "ObjectCreated:Put",
      "userIdentity": {
        "principalId": "AWS:&amp;lt;PRINICIALID&amp;gt;"
      },
      "requestParameters": {
        "sourceIPAddress": "&amp;lt;IP ADDRESS&amp;gt;"
      },
      "responseElements": {
        "x-amz-request-id": "12345678AF67",
        "x-amz-id-2": "wertyuiofghjkcvbn456789rtyui45678tyui56789rtyuiopertydfghjkcvbn"
      },
      "s3": {
        "s3SchemaVersion": "1.0",
        "configurationId": "s3_sqs_load_testing",
        "bucket": {
          "name": "s3_sqs_load_testing",
          "ownerIdentity": {
            "principalId": "&amp;lt;PRINCIPAL ID&amp;gt;"
          },
          "arn": "arn:aws:s3:::s3_sqs_load_testing"
        },
        "object": {
          "key": "s3_sqs_load_testing/base_path/1st_iterator/20230122/test.csv",
          "size": 14311,
          "eTag": "987fghjk789hj89",
          "versionId": "SdgdghjdbhfjhYUhjdhfj",
          "sequencer": "0063CD7B3D9FA2B9E6"
        }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
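&lt;p&gt;A consumer draining the queue only needs a handful of these fields to build a file index. A minimal sketch of extracting them from a message body (the function name and choice of fields are ours):&lt;/p&gt;

```python
import json

def extract_object_records(message_body: str) -> list:
    """Pull bucket, key, size, and event time out of an S3 event notification."""
    event = json.loads(message_body)
    rows = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        rows.append({
            "bucket": s3["bucket"]["name"],
            "key": s3["object"]["key"],
            "size": s3["object"]["size"],
            "event_time": record["eventTime"],
        })
    return rows

# Tiny demo with the fields from the sample message above:
body = json.dumps({"Records": [{
    "eventTime": "2023-01-22T18:06:53.713Z",
    "s3": {"bucket": {"name": "s3_sqs_load_testing"},
           "object": {"key": "base_path/test.csv", "size": 14311}},
}]})
rows = extract_object_records(body)
print(rows[0]["key"], rows[0]["size"])  # base_path/test.csv 14311
```

&lt;p&gt;Each extracted row can then be written to the persistent store, keyed by bucket and object key, for the parsers to query later.&lt;/p&gt;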



&lt;h3&gt;
  
  
  Estimated Cost for the Problem:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtlw1f0a7kwqlybqz5hj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvtlw1f0a7kwqlybqz5hj.png" width="800" height="360"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Pricing as of January ’23 for the US-EAST region. Please confirm exact prices on the official AWS website&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As shown in the above diagram, pricing for the solution is nominal compared to what it solves for your &lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a" rel="noopener noreferrer"&gt;metadata management system&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The above considers the cost for 1 million objects, but we wanted the cost for a billion. The estimated cost for 1 Billion objects will be around 5.857*100 = $585.7.&lt;/p&gt;

&lt;p&gt;This is the total cost of the system, but since files are already landing on S3 (a cost that cannot be removed), we are adding approximately ~15% extra cost on top of the current S3 system. The interesting part is that if you look at the whole journey, from the parsers (which otherwise list the same objects multiple times) to metadata management (which lists files for data engineers to explore), the extra cost reduces further to approximately ~10%.&lt;/p&gt;

&lt;p&gt;Now, this extra ~10% cost has increased &lt;strong&gt;the total efficiency of the system by 2.5X&lt;/strong&gt;: parsers are able to finish the job in 2.5 times less time than they used to take, which is a good win.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitations of the Current Approach
&lt;/h3&gt;

&lt;p&gt;Though the above experiment was a success, there are still a few limitations that might be deal-breakers for some use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Amazon SQS queue must be in the same AWS Region as your Amazon S3 bucket (&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-event-types-and-destinations.html#:~:text=The%20Amazon%20SQS%20queue%20must%20be%20in%20the%20same%20AWS%20Region%20as%20your%20Amazon%20S3%20bucket" rel="noopener noreferrer"&gt;&lt;em&gt;mention&lt;/em&gt;&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Enabling notifications is a bucket-level operation and notifications need to be enabled for each bucket separately, though events can be published on a single queue (&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-enable-disable-notification-intro.html" rel="noopener noreferrer"&gt;mention&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;After you create or change the bucket notification configuration, it usually takes about five minutes for the changes to take effect (&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-enable-disable-notification-intro.html" rel="noopener noreferrer"&gt;mention&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;When notifications are first enabled, or the notification configuration is changed, an s3:TestEvent occurs. If your use case forces very frequent configuration changes, additional logic is required in the consumer to skip these messages (&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-enable-disable-notification-intro.html" rel="noopener noreferrer"&gt;mention&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Event notifications aren’t guaranteed to arrive in the same order that the events occurred. However, notifications from events that create objects (PUTs) and delete objects contain a sequencer, which can be used to determine the order of events for a given object key. If you compare the sequencer strings from two event notifications on the same object key, the event notification with the greater hexadecimal sequencer value is the event that occurred later.&lt;/li&gt;
&lt;li&gt;Across events for different buckets or objects within a bucket, the sequencer value should not be considered useful for ordering comparisons (&lt;a href="https://stackoverflow.com/questions/55287965/s3-notification-occurrence-or-sqs-reception-sequencer-ordering#:~:text=Across%20events%20for%20different%20buckets%20or%20objects%20within%20a%20bucket%2C%20the%20sequencer%20value%20should%20not%20be%20considered%20useful%20for%20ordering%20comparisons" rel="noopener noreferrer"&gt;Check this stackoverflow answer&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;If you’re using event notifications to maintain a separate database or index of your Amazon S3 objects, AWS recommends that you compare and store the sequencer values as you process each event notification.&lt;/li&gt;
&lt;li&gt;Each 64 KB chunk of a payload is billed as 1 request (for example, an API action with a 256 KB payload is billed as 4 requests).&lt;/li&gt;
&lt;li&gt;Every Amazon SQS action counts as a request. On the S3 side, the GET per-request charge covers handling the request for the file (checking whether it exists, checking permissions, fetching it from storage, and preparing to return it) each time it is downloaded, while the data transfer charge covers the actual transfer of the file’s contents from S3 to the requester over the Internet. If you include a link to a file on your site but the user doesn’t download it, and the browser doesn’t load, pre-load, or auto-play it, S3 knows nothing about it, so you aren’t billed. The same holds for pre-signed URLs: they don’t result in any billing unless they’re actually used, because they’re generated on your server.&lt;/li&gt;
&lt;/ul&gt;
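&lt;p&gt;The sequencer comparison described above can be sketched by treating each sequencer as a hexadecimal integer; this is only meaningful for two notifications on the same object key, and the sample values below are made up:&lt;/p&gt;

```python
def occurred_later(seq_a: str, seq_b: str) -> bool:
    """True if the event with sequencer seq_a happened after seq_b.

    Sequencers are hexadecimal strings; for notifications on the SAME
    object key, the greater hexadecimal value is the later event.
    """
    return int(seq_a, 16) > int(seq_b, 16)

# Hypothetical sequencers from a PUT and a later DELETE of one key:
put_seq = "0063CD7B3D9FA2B9E6"
delete_seq = "0063CD7B4E0012AB34"
print(occurred_later(delete_seq, put_seq))  # True
```

&lt;p&gt;If you maintain a separate index, storing the winning sequencer per key this way lets late-arriving notifications be discarded safely.&lt;/p&gt;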

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Though a few limitations remain for some use cases when you want to take this to production, the experiment was successful. Here is what I did achieve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The overall speed of file exploration by data engineers and stewards increased up to &lt;strong&gt;1.5x,&lt;/strong&gt; as listing became easy for them and they can group files on the basis of a regex pattern, which is not very easy in the case of direct S3 exploration (&lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a#:~:text=Data%20Steward%20%26%20Data%20Analyst%20%26%20Data%20Scientist%3A" rel="noopener noreferrer"&gt;&lt;em&gt;Read more about Data Engineers vs Data Stewards here&lt;/em&gt;&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;The overall speed of parsing files increased by &lt;strong&gt;2.5x&lt;/strong&gt;, as listing can be avoided during parsing and parsers can be scaled by a pre-determined scaling factor (on the basis of the number of files).&lt;/li&gt;
&lt;li&gt;Better &lt;a href="https://arglee.medium.com/demystifying-metadata-management-part-1-cd86ee78228a#:~:text=Blog%20by%20Dataedo-,Data%20Democratization%3A,-Empowering%20employees%20and" rel="noopener noreferrer"&gt;Data Democratization&lt;/a&gt; for the organization by enabling rich searching over data.&lt;/li&gt;
&lt;li&gt;… and many more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;[Writer’s Corner]&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data platforms and analytics are a basic necessity not only for IT but also for non-IT companies, and object storage is one of the most common attributes of a data platform. Keeping object storage healthy (un-messy) should be treated with the highest priority.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataingestionplatfor</category>
      <category>bigdata</category>
      <category>awss3withpython</category>
      <category>aws</category>
    </item>
    <item>
      <title>Demystifying Metadata Management — Part 1</title>
      <dc:creator>Ashish Mishra</dc:creator>
      <pubDate>Tue, 17 Jan 2023 10:27:02 +0000</pubDate>
      <link>https://dev.to/arglee/demystifying-metadata-management-part-1-1hmd</link>
      <guid>https://dev.to/arglee/demystifying-metadata-management-part-1-1hmd</guid>
      <description>&lt;h3&gt;
  
  
  Demystifying Metadata Management — Part 1
&lt;/h3&gt;

&lt;p&gt;Metadata Management provides a base for an organization’s Data Platform Architecture. Let’s understand each component and its role in metadata Management.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m7m7F0Q6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AM4hqmXco3kRYXUX7ewfZsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m7m7F0Q6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AM4hqmXco3kRYXUX7ewfZsg.png" alt="" width="800" height="430"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image courtesy infopulse.com&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data is a collection of raw and unorganized facts that can be used in calculating, reasoning or planning. Without proper processing and organizing, it is useless. That’s where metadata comes into play.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Good read on Data:&lt;/em&gt; &lt;a href="https://dataedo.com/blog/data-vs-metadata"&gt;&lt;em&gt;Blog by Dataedo&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F2qnLhsI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AV7AgJk1qJYTyNgLU0aS44Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F2qnLhsI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AV7AgJk1qJYTyNgLU0aS44Q.png" alt="" width="800" height="852"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image Courtesy: Dataedo by &lt;a class="mentioned-user" href="https://dev.to/piotr"&gt;@piotr&lt;/a&gt; kononow&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;MetaData:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Metadata is simply data about data. It means it is a description and context of the data. It helps to organize, find and understand data, through information such as format, origin, creation date, modification date, etc.&lt;/p&gt;

&lt;p&gt;Data stores information, but if you don’t know how to interpret it, you don’t have access to this information. Metadata enables you to understand data and extract the information.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Metadata, you see, is really a love note — it might be to yourself, but in fact it’s a love note to the person after you, or the machine after you, where you’ve saved someone that amount of time to find something by telling them what this thing is.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cit.&lt;/em&gt; &lt;a href="http://ascii.textfiles.com/archives/3181"&gt;&lt;em&gt;Jason Scott’s Weblog&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xvHhuVVf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AAzhWcFKZSTy_OGHKhulb9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xvHhuVVf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AAzhWcFKZSTy_OGHKhulb9g.png" alt="" width="800" height="393"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image Courtesy: ontotext.com&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Good read on metadata:&lt;/em&gt; &lt;a href="https://dataedo.com/blog/data-vs-metadata"&gt;&lt;em&gt;Blog by Dataedo&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Democratization:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Empowering employees and stakeholders of an organization with the right set of tools that enables them to make informed decisions.&lt;/p&gt;

&lt;p&gt;Data democratization is the &lt;em&gt;ongoing process&lt;/em&gt; of enabling everybody in an organization, irrespective of their technical know-how, to &lt;em&gt;work with data&lt;/em&gt; &lt;em&gt;comfortably&lt;/em&gt;, to feel &lt;em&gt;confident talking about it&lt;/em&gt;, and as a result, &lt;em&gt;make data-informed decisions&lt;/em&gt; and &lt;em&gt;build customer experiences powered by data.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Data Democratization has answers to questions like:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;“Experts in my company are too busy to help me”.&lt;/p&gt;

&lt;p&gt;“I do not have access to data”&lt;/p&gt;

&lt;p&gt;“I can not trust the data”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Data democratization is an ongoing process and needs a cultural shift, because it depends on another ongoing process called &lt;em&gt;Data Literacy.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mytuU7-6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2A5lXSSIK9g_x1D8mS7n8bYw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mytuU7-6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2A5lXSSIK9g_x1D8mS7n8bYw.png" alt="" width="800" height="466"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image Courtesy: Arpit Choudhury from his medium blog&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Good read on Data Democratization&lt;/em&gt;: &lt;a href="https://towardsdatascience.com/what-the-heck-is-data-democratization-39b86eb27aa6"&gt;&lt;em&gt;Blog by Towards datascience&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Literacy:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The ability to &lt;em&gt;read, analyze, work&lt;/em&gt; and &lt;em&gt;communicate&lt;/em&gt; with data — known as data literacy — is now so critical to companies that it has been hailed as the second language of business by &lt;a href="https://www.gartner.com/smarterwithgartner/a-data-and-analytics-leaders-guide-to-data-literacy#:~:text=Gartner%20defines%20data%20literacy%20as,case%2C%20application%20and%20resulting%20value."&gt;Gartner&lt;/a&gt;. The global pandemic highlighted its importance, with many companies starting to rely on data to detect new patterns, respond to changing customer behavior and make first-of-a-kind decisions in a new environment of many unknown factors.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Poor data literacy is ranked as the second-biggest internal roadblock to the success of the CDO’s office, according to the Gartner Annual Chief Data Officer Survey.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In upcoming years, data literacy will become essential in driving business value, demonstrated by its formal inclusion in over 80% of data and analytics strategies and change management programs.&lt;/p&gt;

&lt;p&gt;One common misconception about Data Democratization and Literacy is that now everyone in the company will know everything related to the data, will get you details about the data in no time, and there will be no need for a &lt;em&gt;Subject Matter Expert&lt;/em&gt; or &lt;em&gt;Data Architect&lt;/em&gt;. This is not true.&lt;/p&gt;

&lt;p&gt;Data Literacy and Democratization provide a way for people to be independent, complete their tasks, and take the company in the right direction, leaving no place for presumption.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Good read on Data Literacy:&lt;/em&gt; &lt;a href="https://thedataliteracyproject.org/files/downloads/The%20Seven%20Principles%20of%20Data%20Literacy%20by%20Qlik%20and%20Accenture.pdf"&gt;&lt;em&gt;Blog by thedataliteracyproject&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eCxTkJI---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AnWPDxu1IOTQ83S6fHxBepg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eCxTkJI---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AnWPDxu1IOTQ83S6fHxBepg.png" alt="" width="800" height="892"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image Courtesy: Dataedo by &lt;a class="mentioned-user" href="https://dev.to/piotr"&gt;@piotr&lt;/a&gt; kononow&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Architect &amp;amp; Data Engineer:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.striim.com/blog/data-architect-vs-data-engineer-an-overview-of-two-in-demand-roles/#:~:text=The%20data%20architect%20and%20data%20engineer%20titles%20are%20closely%20related,responsible%20for%20creating%20that%20vision."&gt;The data architect and data engineer titles&lt;/a&gt; are closely related and, as such, frequently confused. The difference in both roles lies in their primary responsibilities.&lt;/p&gt;

&lt;p&gt;Data architects design the vision and blueprint of the organization’s data framework, while the data engineer is responsible for creating that vision.&lt;/p&gt;

&lt;p&gt;Data architects provide technical expertise and guide data teams on bringing business requirements to life; data engineers ensure data is readily available, secure, and accessible to stakeholders (data scientists, data analysts) when they need it.&lt;/p&gt;

&lt;p&gt;Data architects have substantial experience in data modeling, data integration, and data design and are often experienced in other data roles; data engineers have a strong foundation in programming with software engineering experience.&lt;/p&gt;

&lt;p&gt;The data architect and the data engineer work together to build the organization’s data system.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Good read on Data Architect vs Data Engineer:&lt;/em&gt; &lt;a href="https://rstak.medium.com/data-science-vs-data-engineering-vs-data-architecture-3a05785f7f48"&gt;&lt;em&gt;Blog by rsTask&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HhPaBG94--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2A4EewBA3kzi8hnQOvQSMd9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HhPaBG94--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2A4EewBA3kzi8hnQOvQSMd9w.png" alt="" width="800" height="451"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image Courtesy: Arun Elangovan&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Steward &amp;amp; Data Analyst &amp;amp; Data Scientist:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Data Analysts&lt;/em&gt;&lt;/strong&gt; gather data from various databases and warehouses, then filter and clean it. &lt;strong&gt;&lt;em&gt;Data Scientists&lt;/em&gt;&lt;/strong&gt; perform ad-hoc data mining and gather large sets of structured and unstructured data from several sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Data Analysts&lt;/em&gt;&lt;/strong&gt; write complex SQL queries and scripts to collect, store, manipulate, and retrieve data from RDBMSs such as MS SQL Server, Oracle DB, and MySQL. &lt;strong&gt;&lt;em&gt;Data Scientists&lt;/em&gt;&lt;/strong&gt; use various statistical methods and data visualization techniques to design and evaluate advanced statistical models over vast volumes of data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Data Analysts&lt;/em&gt;&lt;/strong&gt; create different reports with the help of charts and graphs using Excel and BI tools. &lt;strong&gt;&lt;em&gt;Data Scientists&lt;/em&gt;&lt;/strong&gt; build AI models using various algorithms and built-in libraries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Data Analysts&lt;/em&gt;&lt;/strong&gt; spot trends and patterns in complex datasets. &lt;strong&gt;&lt;em&gt;Data Scientists&lt;/em&gt;&lt;/strong&gt; automate tedious tasks and generate insights using machine learning models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At a high level, the Data Steward handles day-to-day operations based on the policies created by the Data Architect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tD-jY090--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/866/0%2ALNH7cHTprLTsjN4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tD-jY090--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/866/0%2ALNH7cHTprLTsjN4s.png" alt="" width="800" height="248"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Data Engineers are the Bridge by Jennifer Shalamanov&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The data steward is the “go-to” guy for everyone working with data within the company. Typical data steward roles and responsibilities can be grouped as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operational Oversight — a data steward oversees the lifecycle of a data set. They are responsible for defining and implementing rules and regulations for the day-to-day operational and administrative management of data and systems.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.simplilearn.com/data-quality-article"&gt;Data Quality&lt;/a&gt; — data steward responsibilities include establishing data quality metrics and requirements, like setting acceptable values, ranges, and parameters for every data element.&lt;/li&gt;
&lt;li&gt;Privacy, Security, and Risk Management — data protection is a key aspect of data steward responsibilities. A steward must establish regulations and conventions that govern data proliferation to ensure that data privacy controls are exercised in all processes.&lt;/li&gt;
&lt;li&gt;Policies and Procedures — data stewards also establish policies and procedures for data access, including authorization criteria based on the individual and/or their role.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Good read on Data Steward vs Data Analyst:&lt;/em&gt; &lt;a href="https://www.simplilearn.com/what-is-data-stewardship-article#:~:text=Data%20Analyst%3A%20How%20the%20Two,patterns%2C%20and%20predict%20future%20results."&gt;&lt;em&gt;Blog by Simplilearn&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Warehouse &amp;amp; Data Lake &amp;amp; Data Mart:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Data warehouse (DW)&lt;/strong&gt; is a system for aggregating data from connected databases — and then transforming and storing it in an analytics-ready state. The main benefits of a data warehouse are effective data consolidation, fast pre-processing, and easy self-access for business users. &lt;em&gt;The key constraint of using a data warehouse solution is the need to pre-transform all data using standard schemas. This increases the usage costs and reduces scalability potential.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data warehouse solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Synapse Analytics&lt;/li&gt;
&lt;li&gt;Amazon Redshift&lt;/li&gt;
&lt;li&gt;Google BigQuery&lt;/li&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aJ2Q4zDa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/772/1%2AA7BOcfgiLJeqqfAUXa4q1A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aJ2Q4zDa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/772/1%2AA7BOcfgiLJeqqfAUXa4q1A.png" alt="" width="772" height="756"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image Courtesy: Dataedo by &lt;a class="mentioned-user" href="https://dev.to/piotr"&gt;@piotr&lt;/a&gt; kononow&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data lake&lt;/strong&gt; is a centralized cloud-based repository for storing raw (unprocessed, non-cataloged, or pre-cleansed) data from various systems. Unlike DWs, data lake technology allows storing both structured and unstructured data of any size (as object blobs or files). Cloud data lakes are also more scalable and support more querying methods for data retrieval and analysis — a factor data scientists greatly appreciate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data lake solutions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Data Lake&lt;/li&gt;
&lt;li&gt;Amazon S3&lt;/li&gt;
&lt;li&gt;Apache Hadoop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Mart&lt;/strong&gt; is a more focused subset of the data present in a Data Warehouse. It is generally concerned with a single team or department, like finance, marketing, or sales. It is smaller, more focused, and may contain summaries of data that best serve its community of users. A data mart might be a portion of a data warehouse, too.&lt;/p&gt;

&lt;p&gt;A Data Mart has a few benefits over giving all departments access to the full warehouse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost-efficiency&lt;/li&gt;
&lt;li&gt;Simplified data access&lt;/li&gt;
&lt;li&gt;Quicker access to insights&lt;/li&gt;
&lt;li&gt;Simpler data maintenance&lt;/li&gt;
&lt;li&gt;Easier and faster implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Good read on Data Warehouse vs Data Lake:&lt;/em&gt; &lt;a href="https://aws.amazon.com/data-warehouse/"&gt;&lt;em&gt;Blog by AWS&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion:
&lt;/h3&gt;

&lt;p&gt;This is the first part of a series on metadata management and will help in building the conceptual blocks of metadata management.&lt;/p&gt;

&lt;p&gt;Please stay tuned for the next parts of the series, where we will discuss metadata management in detail and walk through building metadata management for an example organization.&lt;/p&gt;

&lt;p&gt;Please comment if you want me to focus on the metadata management of a specific industry like E-Commerce, Healthcare, or Offline Retail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keep Learning:&lt;/strong&gt; Please refer to the next part of the series: &lt;a href="https://medium.com/@arglee/demystifying-metadata-management-part-2-378434352bdb"&gt;&lt;em&gt;Demystifying Metadata Management — Part 2&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>bigdataplatform</category>
      <category>datawarehousesolutio</category>
      <category>datalake</category>
      <category>metadatamanagement</category>
    </item>
    <item>
      <title>Name Matching Techniques: Useful Algorithms, Their Problems, &amp; Absolute Solutions</title>
      <dc:creator>Ashish Mishra</dc:creator>
      <pubDate>Mon, 02 Jan 2023 07:59:44 +0000</pubDate>
      <link>https://dev.to/arglee/name-matching-techniques-useful-algorithms-their-problems-absolute-solutions-1lb5</link>
      <guid>https://dev.to/arglee/name-matching-techniques-useful-algorithms-their-problems-absolute-solutions-1lb5</guid>
<description>&lt;p&gt;A concise guide to the &lt;em&gt;Names &amp;amp; Text Matching Algorithms&lt;/em&gt; available, and the right way to decide the best algorithm on the basis of the use case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F976%2F0%2Avsd2Vkmqfc64hOxC" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F976%2F0%2Avsd2Vkmqfc64hOxC" width="976" height="607"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image courtesy of &lt;a href="https://undraw.co/" rel="noopener noreferrer"&gt;https://undraw.co/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a name-matching problem?
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Data is the new Oil, Analytics is the Refinery and Intelligence is the Gasoline which drives the Growth.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tiffani Bova&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Simple yet very preeminent quotation. If you look closely at the quotation above, you will find every organization in the world, independent of its size or location, working to achieve this goal. If you look more in-depth, you will find it is made up of 3 entities: &lt;em&gt;DATA, ANALYTICS, and INTELLIGENCE.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There is no Growth without Intelligence, no Intelligence without Analytics, and no Analytics without Data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv70flqz7qhxcy3o2oqmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv70flqz7qhxcy3o2oqmy.png" width="581" height="81"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we know that Data is the basic need of any organization and that the Internet is the most common source of data. Let’s deep dive into the name-matching problem.&lt;/p&gt;

&lt;p&gt;The Internet is overloaded with data, and there are numerous ways an organization can get data from multiple entities. Data is nothing but a resource for an organization but data without proper identification (Consistency and Integrity) will very soon turn into a liability if not nurtured properly.&lt;/p&gt;

&lt;p&gt;Here, nurturing is nothing but combining this data with the right demographics (person names, ages, contacts, etc.). If an organization has data without proper consistency, it might show wrong metrics, which can lead the organization to bad decisions.&lt;/p&gt;

&lt;p&gt;Name matching is important to identify a person’s behavior, whether daily, weekly, or occasional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Example&lt;/strong&gt;, a person looks for shoes during 8–9 PM every 150 days via social media advertisements, with the exception that he also goes shoe shopping around 10–15 days before Christmas using multiple search engines and e-commerce websites. This information can be used to target the right set of customers at the right time.&lt;/p&gt;

&lt;p&gt;If an organization fails to identify the same person across multiple browsers, social media, or e-commerce platforms, then it might lose out on sales and waste capital on digital marketing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqv1qnpamhl6ne5fdhdeh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqv1qnpamhl6ne5fdhdeh.png" width="576" height="754"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Overwhelming information available on the Internet&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A lot of organizations are involved in mining demographic information from across the applications like e-mail, customer or patient records, news articles, and business or political memorandums that they might have received.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sources of name or text variation
&lt;/h3&gt;

&lt;p&gt;By now, we have an understanding of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Why Data is important for an Organization!&lt;/li&gt;
&lt;li&gt;What is a name-matching problem and Why this is important for an Organization!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hence, we will move to the next phase to understand the origin of this problem. Is there a way to identify the source so we can choose our algorithm on the basis of the source?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Example,&lt;/strong&gt; if data is coming from a third-party application form where the name length is limited to 20 characters, then all names that are exactly 20 characters long can be checked for abbreviations or truncation and handled accordingly, without running other costly algorithms. A few such techniques can definitely save a significant amount of time and dollars.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3027lbdfb5rv4sr4rwba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3027lbdfb5rv4sr4rwba.png" width="711" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are a few examples that will help you identify the cases for your organization and will help you apply the right algorithm for the right problem.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Handwritten OCR&lt;/strong&gt; : &lt;em&gt;q and g&lt;/em&gt; or &lt;em&gt;m and rn&lt;/em&gt; can be misread as similar.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual Keyboard Data Entry or Over Phone:&lt;/strong&gt; Neighbouring keyboard keys like &lt;em&gt;n and m&lt;/em&gt; or &lt;em&gt;e and r.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limitation of maximum length allowed:&lt;/strong&gt; The input field forces people to use abbreviations or initials only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deliberately provided modified name:&lt;/strong&gt; Due to a lack of faith in the organization, or to avoid public data exposure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Different Demographics:&lt;/strong&gt; The data entry persona does not know the right format for other demographics; for example, someone sitting in the US can put the wrong spelling for Indian or Spanish names.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are basic examples of how an entity (end user or data entry professional) can create multiple variations of names within the same organization, or across data retrieved from multiple organizations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Types and Examples of different referential / demographic data
&lt;/h3&gt;

&lt;p&gt;Let’s take a few more examples where spelling mistakes or phonetic name variation could occur.&lt;/p&gt;

&lt;p&gt;One of the most important parts of making analytics score better is to standardize &lt;em&gt;referential data&lt;/em&gt;, which includes names and addresses. If we look only into the above-mentioned problem, then we can divide the matching problem into two categories.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qmbvv68ken7golh7w9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qmbvv68ken7golh7w9k.png" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Name Matching
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Personal Names Matching:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While there is only one correct spelling for many words, there are often several valid spelling variations for personal names. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;NickName&lt;/em&gt; like &lt;em&gt;'Bill' rather than 'William'&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Phonetically Similar&lt;/em&gt; like &lt;em&gt;'Gail', 'Gale' and 'Gayle'&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Personal names sometimes change over time, e.g. after marriage or a change of religion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Company Names Matching:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Abbreviations&lt;/em&gt; like &lt;em&gt;LTD, Ltd, Limited&lt;/em&gt; or &lt;em&gt;Corp, Corporation&lt;/em&gt; or &lt;em&gt;IBM and International Business Machine&lt;/em&gt; can be used interchangeably across documents&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Typographical&lt;/em&gt; Errors like &lt;em&gt;Oracle and Orcle&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Omissions&lt;/em&gt; like &lt;em&gt;Goyal &amp;amp; Sons and Goyals&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
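&lt;p&gt;A small normalization pass often takes care of the abbreviation and punctuation cases above before any fuzzy algorithm runs. A minimal sketch in Python (the suffix map here is illustrative, not exhaustive):&lt;/p&gt;

```python
import re

# Illustrative map of legal-suffix abbreviations; extend it for your own data.
SUFFIXES = {"ltd": "limited", "corp": "corporation", "inc": "incorporated"}

def normalize_company(name):
    """Lowercase, strip punctuation, and expand known suffix abbreviations."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(SUFFIXES.get(t, t) for t in tokens)
```

&lt;p&gt;With this, “Goyal &amp;amp; Sons Ltd.” and “goyal sons limited” normalize to the same string before any matching runs.&lt;/p&gt;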

&lt;p&gt;&lt;strong&gt;Product Name Matching:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Canon PowerShot a20IS”, “NEW powershot A20 IS from Canon”, and “Digital Camera Canon PS A20IS” should all match “Canon PowerShot A20 IS”&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Text Matching:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Address Matching:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;134 Ashewood Walk, Summerhill Lane, Portlaoise&lt;/li&gt;
&lt;li&gt;134 Summerhill Ln, Ashewood Walk, Summerhill, Portlaoise, Co. Laois, R32 C52X, Ireland&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Product review and NLP fuzzy matching&lt;/strong&gt; : Product reviews can be confusing due to very similar product names, or review text that does not give a very clear picture. Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;John Snow&lt;/strong&gt; : What’s so fuzzy about this?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Danny&lt;/strong&gt; : I think it was a good fuzz.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similarly, there could be multiple other examples where name or text matching can make a huge difference, and collating all that information is very crucial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Name matching is so hard
&lt;/h3&gt;

&lt;p&gt;By now, we have an understanding of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Why Data is important for an Organization!&lt;/li&gt;
&lt;li&gt;What is a name-matching problem and Why this is important for an Organization!&lt;/li&gt;
&lt;li&gt;Sources and examples of multiple name variations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, let’s understand why name matching is so hard and why most companies are still not able to figure out a solution with a 100% match score.&lt;/p&gt;

&lt;p&gt;Names are heavily influenced by people’s cultural backgrounds, and first names, middle names, and last names can be represented in different ways.&lt;/p&gt;

&lt;p&gt;Names are often recorded with different spellings, and applying exact matching leads to poor results. The most common cases that I can come up with are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spelling Errors (80%)&lt;/li&gt;
&lt;li&gt;Spelling Variation: ‘O’Connor’, ‘OConnor’ and ‘O Connor’&lt;/li&gt;
&lt;li&gt;Phonetical Similarity: ‘Gail’ &amp;amp; ‘Gayle’&lt;/li&gt;
&lt;li&gt;Tina Smith might be recorded as ‘Christine J. Smith’ and as ‘C.J.Smith-Miller’&lt;/li&gt;
&lt;li&gt;Short forms: like ‘Bob’ for Robert, or ‘Liz’ for Elizabeth&lt;/li&gt;
&lt;li&gt;Some European countries favour compound names: ‘Hans-Peter’ or ‘Jean-Pierre’&lt;/li&gt;
&lt;li&gt;Hispanic and Arabic names can contain more than two surnames.&lt;/li&gt;
&lt;li&gt;Record Linkage, where a record contains more than just names&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These examples show why the name-matching problem is so hard, though there are multiple ways to resolve the issues. Let’s discuss those techniques in detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Name Matching Techniques
&lt;/h3&gt;

&lt;p&gt;By now, we understand the sources of different name variations and their complexities. To improve matching accuracy, many different techniques for approximate name matching have been developed over the past four decades, and new techniques are still being invented.&lt;/p&gt;

&lt;p&gt;Most of the solutions are based on pattern matching, phonetic encoding, or combinations of these two approaches.&lt;/p&gt;

&lt;p&gt;In this document, we will discuss each approach at a very high level; I will add more blogs detailing each technique.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Please Note&lt;/strong&gt; : &lt;em&gt;The description of each algorithm has been deliberately kept short to limit the length of this blog; there are separate blogs for each algorithm.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phonetic Encoding:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Soundex:&lt;/strong&gt; It keeps the first letter of a string and converts the rest into numbers using the mapping/encoding table below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwy17f2tnkymyio3qppyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwy17f2tnkymyio3qppyd.png" width="390" height="242"&gt;&lt;/a&gt;&lt;/p&gt;
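&lt;p&gt;To make the encoding concrete, here is a minimal Soundex sketch in Python following the digit table above; it is simplified and ignores the special h/w separator rule of the full algorithm:&lt;/p&gt;

```python
def soundex(name):
    """Simplified Soundex: keep the first letter, map the rest to digits,
    collapse adjacent duplicate codes, drop vowel codes, pad to 4 chars."""
    groups = ["aeiouyhw", "bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]
    codes = {c: str(d) for d, letters in enumerate(groups) for c in letters}
    digits = [codes[c] for c in name.lower() if c in codes]
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:  # collapse runs of the same code
            collapsed.append(d)
    rest = "".join(d for d in collapsed[1:] if d != "0")  # drop vowel codes
    return (name[0].upper() + rest + "000")[:4]
```

&lt;p&gt;This maps ‘Gail’, ‘Gale’, and ‘Gayle’ to the same code (G400), which is exactly the phonetic grouping these techniques rely on.&lt;/p&gt;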

&lt;p&gt;&lt;strong&gt;2. Phonex:&lt;/strong&gt; It is a variation of Soundex that tries to improve the encoding quality by pre-processing names according to their English pronunciation before the encoding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Phonix:&lt;/strong&gt; This encoding goes a step further than Phonex and applies more than one hundred transformation rules to groups of letters before encoding on the basis of the encoding table below. This algorithm is comparatively slow due to the large number of transformations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1g12blpdfnm6imlhg8q5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1g12blpdfnm6imlhg8q5.png" width="344" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. NYSIIS:&lt;/strong&gt; (The &lt;em&gt;New York State Identification Intelligence System&lt;/em&gt;) It is based on transformation rules similar to Phonex and Phonix, but it returns a code that is only made of letters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Double Metaphone:&lt;/strong&gt; Specialized for European and Asian names. &lt;strong&gt;For example&lt;/strong&gt;, &lt;em&gt;‘kuczewski’ will be encoded as ‘ssk’ and ‘xfsk’&lt;/em&gt;, accounting for different spelling variations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Fuzzy Soundex:&lt;/strong&gt; It is based on q-gram substitution. When the Fuzzy Soundex technique is combined with a q-gram based pattern matching algorithm, its accuracy is better than Soundex.&lt;/p&gt;

&lt;p&gt;The table below shows how different algorithms create encodings for the same word.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gnt36fx8n30twkdyy9x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gnt36fx8n30twkdyy9x.png" width="558" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern Matching:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Levenshtein or Edit Distance:&lt;/strong&gt; The smallest number of edit operations (insertions, deletions, and substitutions) required to change one string into another.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Damerau-Levenshtein Distance:&lt;/strong&gt; Measures the difference between two sequences. It is a variant of the Levenshtein distance, with the addition of a provision for the transposition of two adjacent characters. The Damerau-Levenshtein distance between two strings is the minimum number of operations (consisting of insertions, deletions, substitutions, and transpositions of two adjacent characters) required to transform one string into the other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bag Distance:&lt;/strong&gt; The bag distance algorithm compares two sets of items, and calculates the distance between them as the number of items that are present in one set but not the other. It is a simple and fast algorithm, but it does not take into account the order or frequency of the items in the sets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smith-Waterman:&lt;/strong&gt; A dynamic programming algorithm used for local sequence alignment. It aligns two sequences in a way that maximizes the number of matching characters while allowing for the insertion of gaps to optimize the alignment. It was originally developed to find optimal alignments between biological sequences, like DNA or proteins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Longest common sub-string (LCS):&lt;/strong&gt; The longest common substring (LCS) is a string that is common to two or more strings and is the longest string that is a substring of all the strings. It is used to find the similarities between two or more strings and is often used in text comparison, data mining, and natural language processing. There are various algorithms for finding the LCS, including dynamic programming and suffix trees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Q-grams&lt;/strong&gt; : The Q-gram algorithm is a method for comparing strings by breaking them down into fixed-length substrings, or “grams”, and comparing the set of grams for each string. It is a fast and simple algorithm, but it can be less accurate than other methods, as it does not take into account the order of the grams or the distances between them. Q-grams are often used in spelling correction, text search, and information retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positional Q-grams&lt;/strong&gt; : Positional Q-grams is a variant of the Q-gram algorithm that takes into account the position of the grams within the string. It is used for comparing strings by breaking them down into fixed-length substrings and comparing the set of grams for each string, while also considering the positions of the grams in the original string. Positional Q-grams can be more accurate than regular Q-grams, as it takes into account the order of the grams within the string. It is often used in information retrieval and natural language processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip-grams&lt;/strong&gt; : Skip-grams were evaluated in experiments on multi-lingual texts from different European languages and showed improved results compared to bigrams, trigrams, edit distance, and the longest common sub-string technique.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compression&lt;/strong&gt; : Compression matching is a method for comparing strings by comparing their compressed representations. The idea is that strings that are similar will compress to a smaller size than strings that are dissimilar. To perform compression matching, the strings are first compressed using a lossless compression algorithm, such as gzip or bzip2. The compressed representations of the strings are then compared using a string distance metric, such as the Levenshtein distance or the Jaccard coefficient. Compression matching can be an effective way to compare strings, especially for large strings or datasets. However, it can be computationally expensive, as it requires the strings to be compressed and decompressed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jaro&lt;/strong&gt; : The Jaro algorithm is commonly used for name matching in data linkage systems. It accounts for insertions, deletions, and transpositions. The algorithm calculates the number c of common characters (agreeing on characters that are within half the length of the longer string) and the number of transpositions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Winkler&lt;/strong&gt; : The Winkler algorithm improves upon the Jaro algorithm by applying ideas from empirical studies which found that fewer errors typically occur at the beginning of names. The Winkler algorithm therefore increases the Jaro similarity measure for agreeing initial characters (up to four).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sorted-Winkler&lt;/strong&gt; : If a string contains more than one word (i.e. it contains at least one whitespace or other separators), then the words are first sorted alphabetically before the Winkler technique is applied (to the full strings). The idea is that (unless there are errors in the first few letters of a word) sorting of swapped words will bring them into the same order, thereby improving the matching quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permuted-Winkler&lt;/strong&gt; : In this more complex approach Winkler comparisons are performed over all possible permutations of words, and the maximum of all calculated similarity values is returned.&lt;/li&gt;
&lt;/ol&gt;
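&lt;p&gt;As a concrete reference point for the pattern matching family, here is a short dynamic programming sketch of the Levenshtein (edit) distance from item 1; most of the other distances in this list are refinements of the same idea:&lt;/p&gt;

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

&lt;p&gt;For the typo example earlier, the distance between ‘Oracle’ and ‘Orcle’ is 1 (one deleted letter).&lt;/p&gt;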

&lt;h3&gt;
  
  
  Combined Techniques:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Editex&lt;/strong&gt; : Aims at improving phonetic matching accuracy by combining edit distance-based methods with the letter-grouping techniques of Soundex and Phonix. The edit costs in Editex are 0 if two letters are the same, 1 if they are in the same letter group, and 2 otherwise. Comparison experiments in [34] showed that Editex performed better than edit distance, q-grams, Phonix, and Soundex on a large database containing around 30,000 surnames. Similar to basic edit distance, the time and space complexities of Editex are quadratic in the lengths of the strings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Syllable alignment distance&lt;/strong&gt; : This recently developed technique, called &lt;em&gt;Syllable Alignment Pattern Searching (SAPS)&lt;/em&gt; [13], is based on the idea of matching two names syllable by syllable, rather than character by character. It uses the Phonix transformation (without the final numerical encoding phase) as a preprocessing step, and then applies a set of rules to find the beginning of syllables. An edit distance based approach is used to find the distance between two strings.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Problem with current Techniques
&lt;/h3&gt;

&lt;p&gt;We have discussed the characteristics of personal names and the potential sources of variations and errors in them, and we presented an overview of both pattern matching and phonetic encoding-based name matching techniques. Experimental results on different real data sets have shown that there is no single best technique available. The characteristics of the name data to be matched, as well as computational requirements, have to be considered when selecting a name-matching technique.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Python or Databases can be utilized
&lt;/h3&gt;

&lt;p&gt;Python also has a lot of phenomenal libraries created by a lot of good developers. A few libraries worth looking at:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;TheFuzz&lt;/li&gt;
&lt;li&gt;HMNI&lt;/li&gt;
&lt;li&gt;FuzzyWuzzy&lt;/li&gt;
&lt;li&gt;Namematcher&lt;/li&gt;
&lt;li&gt;dedupe&lt;/li&gt;
&lt;li&gt;TF-IDF Vectorizer (scikit-learn)&lt;/li&gt;
&lt;/ol&gt;
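&lt;p&gt;The libraries above are all third-party; Python’s standard library also ships &lt;em&gt;difflib&lt;/em&gt;, whose Ratcliff/Obershelp-style similarity ratio is often good enough for a first pass, with zero extra dependencies:&lt;/p&gt;

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity in [0, 1] via difflib's
    Ratcliff/Obershelp sequence matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

&lt;p&gt;Identical names score 1.0, and a one-letter typo like ‘Orcle’ vs ‘Oracle’ still scores above 0.9.&lt;/p&gt;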

&lt;p&gt;Not only Python but other languages also have similar libraries, which provide built-in functionality and can save a few lines of complex code.&lt;/p&gt;

&lt;p&gt;Some databases also provide in-built matching functions, which can definitely save some network bandwidth in your production system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Jaro-Winkler by Snowflake&lt;/li&gt;
&lt;li&gt;Jaccard Index by Snowflake&lt;/li&gt;
&lt;li&gt;Edit Distance by Snowflake&lt;/li&gt;
&lt;li&gt;Levenshtein by Postgres&lt;/li&gt;
&lt;li&gt;Trigrams by Postgres&lt;/li&gt;
&lt;li&gt;AWS + Azure in-built Machine Learning&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Personal name matching is very challenging, and more research into the characteristics of both name data and matching techniques has to be conducted in order to better understand why certain techniques perform better than others, and which techniques are most suitable for what type of data. A more detailed analysis of the types and distributions of errors is needed to better understand how certain types of errors influence the performance of matching techniques.&lt;/p&gt;

&lt;p&gt;I hope you have learned something new today; please stay tuned for more such blogs. In the near future, I plan to add more detailed documentation for each of the algorithms, where we will not only discuss each algorithm in detail with examples but also discuss their implementation and some real production use cases that I have been involved in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References and Courtesy:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Australian National University(TR-CS-06–02) by Peter Christen&lt;/li&gt;
&lt;li&gt;&lt;a href="https://undraw.co/" rel="noopener noreferrer"&gt;https://undraw.co/&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Appendix&lt;/strong&gt; :&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://singlequote.blog/name-matching-techniques-useful-algorithms-their-problems-absolute-solutions/" rel="noopener noreferrer"&gt;&lt;em&gt;https://singlequote.blog&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on January 2, 2023.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>namematching</category>
      <category>phonetics</category>
      <category>soundex</category>
      <category>algorithms</category>
    </item>
  </channel>
</rss>
