DEV Community: Satish Chandra Gupta

MLOps: Machine Learning Lifecycle

Satish Chandra Gupta — Tue, 13 Sep 2022 05:55:07 +0000

Building machine learning products or ML-assisted product features involves two distinct disciplines:

Model Development: Data Scientists — highly skilled in statistics, linear algebra, and calculus — train, evaluate, and select the best-performing statistical or neural network model.
Model Deployment: Developers — highly skilled in software design and engineering — build a robust software system, deploy it on the cloud, and scale it to serve a huge number of concurrent model inference requests.

Of course, that is a gross over-simplification. Building useful and successful ML-assisted products takes several other vital areas of expertise:

Data Engineering: Build data pipelines to collect data from disparate sources, curate and transform it, and turn it into homogeneous, clean data that can be safely used for training models.
Product Design: Understand business needs, identify impactful objectives and relevant business metrics; define product features or user stories for those objectives; recognize the underlying problems that ML is better suited to solve; design the user experience not only to use ML model predictions seamlessly with the rest of the product features but also to collect user (re)actions as implicit evaluation of the model results, and use them to improve the models.
Security Analysis: Ensure that the software system, data, and model are secure, and no Personally Identifiable Information (PII) is revealed by combining model results and other publicly available information or data.
AI Ethics: Ensure adherence to all applicable laws, and add measures to protect against any kind of bias (e.g. limit the scope of the model, add human oversight, etc.)

As more models are being deployed in production, the importance of MLOps has naturally grown. There is an increasing focus on the seamless design and functioning of ML models within the overall product. Model Development can’t be done in a silo given the consequences it may have on the product and business.

We need an ML lifecycle that is attuned to the realities of ML-assisted products and MLOps. It should facilitate visibility for all stakeholders, without causing too many changes in the existing workflows of data scientists and engineers.

In the rest of the article, I first give an overview of the typical Model Development and Software Development workflows, and then how to bring the two together for adapting to the needs of building ML-assisted products in the MLOps era.

Serverless Computing on AWS, Azure, and Google Cloud

Satish Chandra Gupta — Tue, 06 Sep 2022 04:22:58 +0000

Amazon launched its first AWS services in 2006: S3 for Storage and EC2 for Compute. Now there is a plethora of services and offerings from various cloud vendors like AWS, Azure, and Google Cloud. It can be overwhelming to evaluate so many options and pick the right one for you.

Selecting the right infrastructure alternative is not only about the technical requirements of your application but also about expertise in your team and the growth stage of your business.

In this article, you will learn about the kinds of cloud deployment alternatives available today, and how to select the right one for your needs. We will also dig into serverless alternatives, and what can make the migration across alternatives easier as your team and business grow.

SQL Joins: Inner, Left, Right, and Full

Satish Chandra Gupta — Mon, 25 Jul 2022 14:57:00 +0000

For data engineers and data scientists, SQL is valuable expertise. All modern data warehouses as well as data or delta lakes have SQL interfaces. Understanding how an SQL query is executed can help you write fast and economical queries.

Once you understand SQL concepts like join and window, you will intuitively understand the similar APIs in Spark, Pandas, or R. After all, SQL is one of the ancient programming languages that are still best for what they were invented for (another such example is C for system programming).

SQL first appeared in 1974 for accessing data in relational databases, and its latest standard was published in 2016. It is a declarative language (i.e. a programmer specifies what to do, and not how to do it) with strong static typing.

SQL Join is one of the most common and important operations to merge data with different attributes (or columns) from different sources (or tables). In this article, you will learn about various kinds of SQL joins and best practices to write efficient joins on large tables.
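
To make the four join kinds concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables and their rows are hypothetical, invented only for illustration. Because older SQLite builds lack RIGHT and FULL joins, the sketch writes the right join as a left join with the tables swapped and emulates the full join with UNION ALL; other databases let you write RIGHT JOIN and FULL OUTER JOIN directly.

```python
import sqlite3

# Hypothetical tables: customer 2 and 3 have no orders, order 12 has no customer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO orders VALUES (10, 1, 250), (11, 1, 100), (12, 4, 75);
""")


def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# INNER JOIN: only rows that have a match in both tables.
inner_rows = run("""
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
""")

# LEFT JOIN: every customer, with NULLs where no order matches.
left_rows = run("""
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""")

# RIGHT JOIN, written as a LEFT JOIN with the tables swapped:
# every order, with NULLs where no customer matches.
right_rows = run("""
    SELECT c.name, o.amount
    FROM orders AS o
    LEFT JOIN customers AS c ON o.customer_id = c.id
    ORDER BY o.id
""")

# FULL JOIN, emulated: the left join plus the orders that matched no customer.
full_rows = run("""
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    UNION ALL
    SELECT NULL, o.amount
    FROM orders AS o
    LEFT JOIN customers AS c ON o.customer_id = c.id
    WHERE c.id IS NULL
""")

for label, rows in [("inner", inner_rows), ("left", left_rows),
                    ("right", right_rows), ("full", full_rows)]:
    print(label, rows)
```

The inner join returns only Asha's two orders, the left join adds Ben and Chen with a NULL amount, the right join adds the orphan order with a NULL name, and the full join contains all of them.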

Software Architecture Design and Engineering at a Startup

Satish Chandra Gupta — Wed, 29 Dec 2021 14:03:34 +0000

The great thing about starting a new project is that you get a clean slate. No baggage of design choices that you hated to look at every day in your last project. But how many times have you seen a shiny new project not turning into the same intractable mess?

It is more likely to happen in a fast-paced startup. The faster the pace, the sooner it happens. So how do you move fast without being trapped in analysis paralysis, while keeping technical debt at a manageable level?

You design for change. Ignore the refrain that prevention is better than cure. Instead of preventing the mess, you should embrace it and mitigate it when it happens. That’s what we have done at Slang Labs.

In this article, I discuss:

Startup Reality: forces and constraints in a startup.
Engineering Philosophy: our philosophy to manage that reality.
Slang Architecture: evolution of Slang microservices and SDKs guided by our philosophy.

Top 10 Programming Languages in 2022

Satish Chandra Gupta — Wed, 29 Dec 2021 13:58:03 +0000

What if programming languages were stocks? And you had to make a portfolio to fetch good returns in 2022?

You probably have seen various surveys and analyses listing the most popular programming languages. Those cover the universe of programming languages, like the S&P 500 does for the stock market. What would be the best portfolio for you to be successful and outperform the rest?

Of course, it depends on your risk profile, or whether your focus is on a specific sector, e.g., web, mobile, enterprise, machine learning, or edge/embedded.

But let's say you want to pick 10 stocks for a diversified portfolio and a medium-risk appetite. Stocks are of 3 types:

Large Cap: Big corporations with stable businesses, like Fortune 500 companies. The upside is stable but not manifold, and the downside is limited.
Mid Cap: Mid-size companies with a high chance of becoming large-cap in the future. These offer much higher returns, but can also go down significantly.
Small Cap: Upcoming companies. Currently very small, but showing high potential. These might turn out multi-baggers but are very risky too.

If you invest only in large caps, your returns will be subdued. If you invest only in small caps, you might hit a jackpot, but can also go bust and lose your shirt. A diversified portfolio allocates money to each asset class. That keeps returns stable, and also has a fair chance at higher returns.

In this article, I present an opinionated portfolio of 10 general-purpose programming languages, with 50% large-caps, 30% mid-caps, and 20% small-caps. These languages will suffice for most of the work done by most teams and organizations.

Actionable Insights from 4 Types of Data Analytics

Satish Chandra Gupta — Wed, 29 Dec 2021 13:51:01 +0000

“Let’s collect all data we can, and we will fish for insights later.” Have you heard this before?

That approach seldom works. On rare occasions when it does work a little, the RoI is very low w.r.t. the cost of collecting, processing, and storing volumes of data.
Analytics yields better returns when you start with a goal.

Besides, not all analytics are equal. Fun stats are amusing. But actionable insights that can guide you to the next steps are way more valuable.

The Drivetrain Approach offers a systematic way to produce actionable insights. It has four steps:

Define Objective: Start by defining your goal.
Specify Levers: Specify the inputs that you control, the levers you can pull to influence the outcome.
Collect Data: Figure out what data you need to collect for measuring the effect of pulling those levers.
Identify Actions: Analyze data and build statistical models to compute which lever to move and how much to achieve the desired outcome.

This article applies the Drivetrain Approach to a problem end to end: from starting with a goal, to collecting and analyzing data, to identifying actions that achieve that goal.

NoSQL vs. SQL: 12 Datastores for Your Application

Satish Chandra Gupta — Tue, 08 Jun 2021 11:51:42 +0000

How do you choose a database? Maybe you assess whether the use case needs a relational database. Depending on the answer, you pick your favorite SQL or NoSQL datastore, and make it work. It is a prudent tactic: a known devil is better than an unknown angel.

Picking the right datastore can simplify your application. A wrong choice can add friction. This article will help you expand your list of known devils. It covers the following:

Database constituents that define a datastore’s characteristics.
Datastores categorized by data types: unstructured, structured (tabular), and various semi-structured (NoSQL) types.
Datastores specialized for various use cases.
Decision flow chart to navigate the landscape of on-prem and on-cloud alternatives.

Indic Language Stack for Voice Assistants and Conversational AI

Satish Chandra Gupta — Thu, 17 Dec 2020 04:27:09 +0000

Bhārat Bhāṣā Stack will catalyze Voice Assistant and Conversational AI innovations for vernacular Indic languages as India Stack did for FinTech.

A decade ago, nobody could imagine how digital payments happen today in India. Even street vendors accept amounts as small as 50 rupees (less than a dollar) on a mobile phone. Neither the seller nor the buyer has to pay any transaction fee. Street vendors don't have deep pockets to build payment gateways either.

It became possible due to the Unified Payments Interface (UPI) of the India Stack. It is the digital infrastructure for authentication, payment, and authorization. Every bank implements UPI, so all mobile wallets are interoperable and free of cost.

India speaks several languages, and a large part of the population is neither tech-savvy nor English literate. In-app Voice Assistants are the natural choice to take the benefits of the internet to the masses.

But building these assistants takes a huge investment that small companies cannot afford. Just as India Stack offered digital payment infrastructure, we need an affordable Indic Language Stack for conversational AI.

Indic Language Stack

The Indic Language Stack for Conversational AI consists of voice and language technologies:

Listen: Convert speech audio to text. It is called Automatic Speech Recognition (ASR) or Speech-to-Text (STT).
Understand: Understand meaning or intent in the text, and extract important entities. It is called Natural Language Understanding (NLU).
Speak: Ask questions to clarify, confirm, or seek needed information from the user. It is called Speech Synthesis or Text-to-Speech (TTS).
Translate: Humans speak different languages. Applications may need to translate text from one language to another. It is called Machine Translation (MT).
Phonetically Translate: Many people type Indic languages using phonetic spellings on Roman keyboards. Computers may need to phonetically translate that text into Indic language scripts. It is called Transliteration.
See/Read: The ability to recognize images of handwritten or printed characters. It is called Optical Character Recognition (OCR).

It has layers that offer diverse entry points based on the needs and maturity of an organisation:

Script: India has several scripts, almost one per language.
Data: Training data is the biggest and most expensive barrier.
Models: Even when data is available, training deep learning models is unaffordable for small organisations.
Software as a Service (SaaS): SaaS frees developers from hosting the model and managing the service infrastructure. It makes it easier to start building applications.
Software Development Kit (SDK): SDKs in popular programming languages and OS platforms form the final layer. SDKs can use the models or SaaS.

Ecosystem Participants

It will take systematic and sustained collaboration to design and build the Bhārat Bhāṣā Stack:

Academia sharing research papers with code on Conversational AI problems relevant to India.
Industry building voice-enabled products and services for the common man.
Government playing a role like the one in building India Stack.
Industry Bodies speeding up collaboration through conferences and consortiums.


12 Ways of Applying a Function to Python Pandas DataFrame

Satish Chandra Gupta — Sat, 10 Oct 2020 02:43:40 +0000

Applying a function to rows of a Pandas DataFrame is one of the most common operations during data wrangling. There are many ways of doing it.

I plotted the performance of various ways of applying a function to each row of a Pandas DataFrame, for up to a million rows.

I was surprised to see itertuples() beating apply(), and humble list comprehension beating them both.

Until now, I had been using apply() whenever I found vectorization difficult. Somehow I thought it was the second-best option.

I have been using "%timeit" often. In this exercise, I learned how to do line-level profiling in Python and also how to plot performance over input size.
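
As a rough illustration of the comparison described above, the sketch below applies one row-wise function in four ways and times each with the standard timeit module. The price_with_tax function, the column names, and the data are hypothetical, and the sketch covers only four of the article's twelve approaches. Timings depend on the machine, the pandas version, and the input size, so treat the printed numbers as indicative only.

```python
import timeit

import pandas as pd


def price_with_tax(price, tax_rate):
    """Hypothetical row-wise function used only for this comparison."""
    return price * (1 + tax_rate)


# Hypothetical data: 3,000 rows with two numeric columns.
df = pd.DataFrame({
    "price": [100.0, 250.0, 80.0] * 1000,
    "tax_rate": [0.05, 0.18, 0.12] * 1000,
})


def with_apply(frame):
    # apply() with axis=1 builds a Series object for every row.
    return frame.apply(
        lambda row: price_with_tax(row["price"], row["tax_rate"]), axis=1
    ).tolist()


def with_itertuples(frame):
    # itertuples() yields lightweight namedtuples instead of Series.
    return [price_with_tax(row.price, row.tax_rate)
            for row in frame.itertuples(index=False)]


def with_list_comprehension(frame):
    # Iterate over the columns directly, skipping row objects entirely.
    return [price_with_tax(p, t)
            for p, t in zip(frame["price"], frame["tax_rate"])]


def vectorized(frame):
    # Whole-column arithmetic, when the function can be expressed that way.
    return (frame["price"] * (1 + frame["tax_rate"])).tolist()


for func in (with_apply, with_itertuples, with_list_comprehension, vectorized):
    seconds = timeit.timeit(lambda: func(df), number=3)
    print(f"{func.__name__:<26} {seconds:.4f} s for 3 runs")
```

All four functions return the same values; only the per-row overhead differs.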

Architecture for High-Throughput Low-Latency Big Data Pipeline on Cloud

Satish Chandra Gupta — Sat, 25 Apr 2020 01:02:20 +0000

Scalable and efficient data pipelines are as important for the success of analytics and ML as reliable supply lines are for winning a war.

For deploying big-data analytics, data science, and machine learning (ML) applications in the real world, analytics tuning and model training is only around 25% of the work. Approximately 50% of the effort goes into making data ready for analytics and ML. The remaining 25% of the effort goes into making insights and model inferences easily consumable at scale. The data pipeline puts it all together. It is the railroad on which heavy and marvelous wagons of ML run. Long-term success depends on getting the data pipeline right.

This article gives an introduction to the data pipeline and an overview of architecture alternatives.

Python Microservices, Part 4: API, Object, and Storage Data Models

Satish Chandra Gupta — Fri, 24 Apr 2020 03:33:26 +0000

Design an API data model for communicating with the service, an object model for the application logic, and a storage model for persisting the data.

A data model organizes data elements and formalizes their relationships with one another. In database design, data modeling is the process of analyzing application requirements and designing conceptual, logical, and physical data models for storage. However, data storage is only one, albeit an important, aspect of microservices.

There are three related but distinct data models in a microservice:

API Data Model for interactions: It defines the schema of the data payload that can be sent to or received from the endpoints of a microservice. Also known as communication or exchange data model.
Object Data Model for computations: It is designed for efficient business logic implementation. Also known as application data model or data structures.
Storage Data Model for persistence: It defines the schema of various, occasionally redundant, data stores and caches.
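
The three models above can be sketched side by side for a single entity. Everything in this example is an assumption made for illustration: the contact entity, the JSON field names, the Contact class, and the two-table SQLite schema are hypothetical and not taken from the article's service.

```python
import json
import sqlite3
from dataclasses import dataclass

# API data model: the JSON payload exchanged with clients (hypothetical schema).
api_payload = '{"fullName": "Asha Rao", "emails": ["asha@example.com", "asha@work.example"]}'


@dataclass
class Contact:
    """Object data model: shaped for the application logic."""
    first_name: str
    last_name: str
    emails: list

    @classmethod
    def from_api(cls, payload):
        data = json.loads(payload)
        first, _, last = data["fullName"].partition(" ")
        return cls(first, last, list(data["emails"]))

    def to_api(self):
        return json.dumps({
            "fullName": f"{self.first_name} {self.last_name}",
            "emails": self.emails,
        })


# Storage data model: normalized tables, one row per email address.
SCHEMA = """
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE emails (contact_id INTEGER REFERENCES contacts(id), email TEXT);
"""


def save(conn, contact):
    """Persist a Contact and return its generated id."""
    cursor = conn.execute(
        "INSERT INTO contacts (first_name, last_name) VALUES (?, ?)",
        (contact.first_name, contact.last_name),
    )
    conn.executemany(
        "INSERT INTO emails (contact_id, email) VALUES (?, ?)",
        [(cursor.lastrowid, email) for email in contact.emails],
    )
    return cursor.lastrowid


def load(conn, contact_id):
    """Rebuild a Contact object from its stored rows."""
    first, last = conn.execute(
        "SELECT first_name, last_name FROM contacts WHERE id = ?", (contact_id,)
    ).fetchone()
    emails = [row[0] for row in conn.execute(
        "SELECT email FROM emails WHERE contact_id = ? ORDER BY rowid", (contact_id,)
    )]
    return Contact(first, last, emails)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
contact = Contact.from_api(api_payload)
restored = load(conn, save(conn, contact))
print(restored.to_api())
```

Keeping the three shapes separate means the payload format, the in-memory class, and the table layout can each change without forcing a change in the other two; only the conversion functions need updating.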

Python Microservices, Part 3: Effective Canonical Logging across Services

Satish Chandra Gupta — Thu, 23 Apr 2020 02:03:31 +0000

Learn how to design, implement, test, and configure canonical logging across microservices using Python and the Tornado web framework.

Nature is a meticulous logger, and its logs are beautiful. Calcium carbonate layers in a seashell are nature’s log of ocean temperature, water quality, and food supply. Annual rings in tree cambium are nature’s log of dry and rainy seasons and forest fires. Fossils in the layers in sedimentary rocks are nature’s log of the flora and fauna life that existed at the time...

(Based on a PyCon India 2019 tutorial.)
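
A minimal sketch of the idea, under stated assumptions: "canonical" is taken here to mean that every service emits log lines in one shared key=value format, including a request id that lets you correlate lines across services. The field names, the CanonicalFormatter class, and the make_logger helper are hypothetical. The sketch uses only the standard logging module and does not show Tornado integration.

```python
import io
import json
import logging
import uuid


class CanonicalFormatter(logging.Formatter):
    """Render every record as the same ordered key=value line (hypothetical format)."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        fields = {
            "level": record.levelname,
            "service": self.service,
            # request_id arrives through the `extra` argument of the logging call.
            "request_id": getattr(record, "request_id", "-"),
            "msg": record.getMessage(),
        }
        # json.dumps quotes and escapes each value so lines stay parseable.
        return " ".join(f"{key}={json.dumps(value)}" for key, value in fields.items())


def make_logger(service, stream):
    """Create a logger for one service that writes canonical lines to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(CanonicalFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines from the root logger
    return logger


# Two hypothetical services share one request id for the same user request.
stream = io.StringIO()
orders_log = make_logger("orders", stream)
billing_log = make_logger("billing", stream)

request_id = uuid.uuid4().hex
orders_log.info("order created", extra={"request_id": request_id})
billing_log.info("invoice issued", extra={"request_id": request_id})

lines = stream.getvalue().splitlines()
print("\n".join(lines))
```

Because both services emit the same fields in the same order, a single search for the request id returns the full trail of one request across services.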