<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Argha Sarkar</title>
    <description>The latest articles on DEV Community by Argha Sarkar (@argha_sarkar).</description>
    <link>https://dev.to/argha_sarkar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1899998%2F9afd7341-aa8d-489c-87d1-ca37462c57f5.png</url>
      <title>DEV Community: Argha Sarkar</title>
      <link>https://dev.to/argha_sarkar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/argha_sarkar"/>
    <language>en</language>
    <item>
      <title>The MLOps Compass: A Local Guide to Building Your First Reproducible ML Pipeline</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Sun, 14 Sep 2025 09:54:40 +0000</pubDate>
      <link>https://dev.to/argha_sarkar/the-mlops-compass-a-local-guide-to-building-your-first-reproducible-ml-pipeline-340j</link>
      <guid>https://dev.to/argha_sarkar/the-mlops-compass-a-local-guide-to-building-your-first-reproducible-ml-pipeline-340j</guid>
      <description>&lt;p&gt;The MLOps Compass: A Beginner's Guide to Building a Reproducible ML Pipeline&lt;br&gt;
As a data scientist or machine learning engineer, you know that building a great model is only half the battle. The real challenge is getting that model into production and making sure the entire process is reliable and repeatable. This is the world of MLOps, where you connect the dots between model development and operational deployment.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk you through a hands-on project to build your very own reproducible MLOps pipeline right on your laptop. We'll use three fantastic open-source tools—Git, DVC, and Docker—to manage our code, data, and application environment. By the time we're done, you'll have a containerized model that's ready to go.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Laying the Foundation with Git and Python
&lt;/h3&gt;

&lt;p&gt;Every solid project starts with a good foundation. We'll use Git for version control and a clean Python virtual environment to keep our dependencies organized.&lt;/p&gt;

&lt;p&gt;First, let's create a new project folder and initialize a Git repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir mlops-local-project
cd mlops-local-project
git init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, create a virtual environment. This is a crucial step that keeps your project's dependencies separate from everything else on your computer, so you'll never run into weird conflicts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, define your project's Python libraries in a &lt;code&gt;requirements.txt&lt;/code&gt; file and install them with &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;. We'll need a few common ones: &lt;code&gt;scikit-learn&lt;/code&gt;, &lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;joblib&lt;/code&gt;, &lt;code&gt;flask&lt;/code&gt;, and &lt;code&gt;gunicorn&lt;/code&gt; (used later to serve the app in the container). Don't forget to create a &lt;code&gt;.gitignore&lt;/code&gt; file to prevent things like your virtual environment from being committed to Git.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Versioning Your Data with DVC
&lt;/h3&gt;

&lt;p&gt;Git is great for code, but it just doesn't work for large data files or models. That's where &lt;code&gt;DVC (Data Version Control)&lt;/code&gt; comes in. DVC versions your data and models by creating small, lightweight metadata files that Git can track. The actual large files are stored in a local cache, so your Git repo stays lean and clean.&lt;/p&gt;

&lt;p&gt;To get started, initialize DVC in your project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dvc init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For our example, we'll use a simple dataset from &lt;code&gt;scikit-learn&lt;/code&gt;. We'll write a Python script to save the data as a CSV, and then use DVC to start tracking it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Prepare the data using a Python script
python src/prepare_data.py

# Add the data to DVC and commit the metadata file to Git
dvc add data/iris.csv
git add data/iris.csv.dvc
git commit -m "Add iris dataset with DVC"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
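&lt;p&gt;The &lt;code&gt;src/prepare_data.py&lt;/code&gt; script referenced above can be as small as the following sketch (the post doesn't show the script itself, so treat the exact file layout as an assumption):&lt;/p&gt;

```python
# Hypothetical src/prepare_data.py: save the iris dataset as a CSV for DVC to track
from pathlib import Path

from sklearn.datasets import load_iris


def prepare_data():
    # load_iris(as_frame=True) returns the features plus a 'target' column
    df = load_iris(as_frame=True).frame
    Path("data").mkdir(exist_ok=True)
    df.to_csv("data/iris.csv", index=False)


if __name__ == "__main__":
    prepare_data()
```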



&lt;p&gt;Now, every time you make a change to your data, you can simply run &lt;code&gt;dvc add&lt;/code&gt; again. DVC updates the small &lt;code&gt;.dvc&lt;/code&gt; metadata file, so Git records the new version without a massive, slow commit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: The Pipeline - Training a Model
&lt;/h3&gt;

&lt;p&gt;Reproducibility is a core part of MLOps. To make sure your model training can be repeated exactly, we'll define a DVC pipeline. This pipeline tracks all the dependencies between your data, code, and the final model, guaranteeing a consistent outcome.&lt;/p&gt;

&lt;p&gt;First, write your training script &lt;code&gt;(src/train.py)&lt;/code&gt;, which will load the DVC-tracked data and train a simple &lt;code&gt;RandomForestClassifier&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# A simple script to train and save a model.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import joblib

def train_model():
    df = pd.read_csv('data/iris.csv')
    X = df.drop('target', axis=1)
    y = df['target']

    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)

    joblib.dump(model, 'models/model.joblib')

if __name__ == "__main__":
    train_model()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, create a &lt;code&gt;dvc.yaml&lt;/code&gt; file to define your pipeline. This file tells DVC that the train stage depends on &lt;code&gt;data/iris.csv&lt;/code&gt; and &lt;code&gt;src/train.py&lt;/code&gt; and produces &lt;code&gt;models/model.joblib&lt;/code&gt; as output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Simplified DVC.yaml for our pipeline
stages:
  train:
    cmd: python src/train.py
    deps:
    - data/iris.csv
    - src/train.py
    outs:
    - models/model.joblib
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, whenever you want to re-run your training process, you simply run &lt;code&gt;dvc repro&lt;/code&gt;. DVC checks for changes in the dependencies and re-executes only the stages that need it, which saves a ton of time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Containerization with Docker
&lt;/h3&gt;

&lt;p&gt;The final piece of the puzzle is to package your application and all its dependencies into a single, portable unit. Docker is the industry standard for this. It ensures your application will run exactly the same way, no matter what computer you run it on.&lt;/p&gt;

&lt;p&gt;First, create a simple Flask API (&lt;code&gt;app.py&lt;/code&gt;) that loads your trained model and exposes a &lt;code&gt;/predict&lt;/code&gt; endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# A Flask app to serve predictions
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load('models/model.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    # assumes a JSON body like {"features": [5.1, 3.5, 1.4, 0.2]}
    features = request.get_json()['features']
    prediction = model.predict([features])[0]
    return jsonify({'prediction': int(prediction)})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
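&lt;p&gt;To sanity-check the endpoint without starting a server, Flask's built-in test client works well. The sketch below is self-contained: it trains a throwaway stand-in model inline instead of loading &lt;code&gt;models/model.joblib&lt;/code&gt;, and the &lt;code&gt;features&lt;/code&gt; JSON field is an assumed request format:&lt;/p&gt;

```python
# Exercising a /predict endpoint via Flask's test client (no running server needed)
from flask import Flask, jsonify, request
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in model trained inline so the example runs on its own
iris = load_iris()
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(iris.data, iris.target)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    prediction = int(model.predict([features])[0])
    return jsonify({"prediction": prediction})

client = app.test_client()
resp = client.post("/predict", json={"features": [5.1, 3.5, 1.4, 0.2]})
print(resp.get_json())
```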



&lt;p&gt;Finally, create a &lt;code&gt;Dockerfile&lt;/code&gt; that defines the environment for your application. This file specifies the Python version, installs your dependencies, and copies your code into the container.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Dockerfile to containerize the app
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "app:app"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With these files in place, you can build and run your container with just two commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t mlops-service .
docker run -p 5000:5000 mlops-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This project gives you a solid foundation in MLOps. Once you're comfortable with this local workflow, you'll be well-equipped to add more advanced steps like continuous integration and continuous deployment (CI/CD) and cloud provisioning.&lt;/p&gt;

&lt;p&gt;Download link - &lt;a href="https://github.com/argha-sarkar/PyDVC-Docker-MLOps-Sandbox" rel="noopener noreferrer"&gt;PyDVC-Docker-MLOps-Sandbox&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Simple Linear Regression: The Complete Guide</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Thu, 11 Sep 2025 17:55:15 +0000</pubDate>
      <link>https://dev.to/argha_sarkar/simple-linear-regression-the-complete-guide-5377</link>
      <guid>https://dev.to/argha_sarkar/simple-linear-regression-the-complete-guide-5377</guid>
      <description>&lt;h2&gt;
  
  
  &lt;em&gt;Simple Linear Regression&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Simple Linear Regression is a statistical and machine learning model that fits a linear relationship between one independent variable (the feature) and a dependent variable (the output). The goal of the model is to find the straight-line equation that most accurately predicts the output value from the input.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Why is Simple Linear Regression used?&lt;/em&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When the outcome needs to be easily predicted using a single input variable.&lt;/li&gt;
&lt;li&gt;To analyze linear relationships in data.&lt;/li&gt;
&lt;li&gt;To create models quickly, easily, and understandably.&lt;/li&gt;
&lt;li&gt;In prediction and understanding trends.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Model equation:&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;y = a + bx&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;y = dependent variable (what you want to predict)&lt;/li&gt;
&lt;li&gt;x = independent variable (input)&lt;/li&gt;
&lt;li&gt;a = y-intercept (where the line crosses the y-axis)&lt;/li&gt;
&lt;li&gt;b = slope (how much y increases/decreases when x changes)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;How does it work?&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;The "Least Squares Method" is used to minimize the distance from each data point, so that the relationship between the output value and the input value can be properly understood.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Use Cases and Applications:&lt;/em&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Determining the price of a house (by looking at the size)&lt;/li&gt;
&lt;li&gt;Analyzing the relationship between sales and advertising&lt;/li&gt;
&lt;li&gt;Evaluating student study time and results&lt;/li&gt;
&lt;li&gt;Predicting disease symptoms and patient conditions in medicine&lt;/li&gt;
&lt;li&gt;The relationship between rainfall and crop production in agriculture&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Example:&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Let's say we want to know how the price (y) increases when the size of the house (x) increases. We did a linear regression with some house data. The model said that the average price increases by 5000 rupees per square foot and the base price is 1,00,000 rupees. Then the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y = 100000 + 5000x

Now the price of a 100 square feet house will be:

100000 + 5000×100 = 6,00,000 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this way, the future price of a house can be easily estimated.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Worked Example:&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Let's say you have some data on the size (square feet) and price (in thousands of taka) of a house—&lt;br&gt;
Size (x): 50, 60, 70, 80, 90&lt;br&gt;
Price (y): 150, 180, 210, 240, 270&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Step:&lt;/em&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;First find the average of x and y.&lt;/li&gt;
&lt;li&gt;Find the slope b:
b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²&lt;/li&gt;
&lt;li&gt;Find the intercept a:
a = ȳ - b x̄&lt;/li&gt;
&lt;li&gt;Model: y = a + b x&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Python code:&lt;/em&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sample data
x = [50, 60, 70, 80, 90]
y = [150, 180, 210, 240, 270]

# Calculate means
x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# Calculate slope (b)
numerator = sum((xi - x_mean)*(yi - y_mean) for xi, yi in zip(x, y))
denominator = sum((xi - x_mean)**2 for xi in x)
b = numerator / denominator

# Calculate intercept (a)
a = y_mean - b * x_mean

print(f"Regression Equation: y = {a:.2f} + {b:.2f}x")

# Predict price for 75 sqft
x_new = 75
y_pred = a + b * x_new
print(f"Predicted price for {x_new} sqft: {y_pred:.2f} thousand")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
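&lt;p&gt;As a cross-check, the same fit can be reproduced with &lt;code&gt;scikit-learn&lt;/code&gt;'s &lt;code&gt;LinearRegression&lt;/code&gt; (a sketch; for this data the least-squares line works out to an intercept of 0 and a slope of 3):&lt;/p&gt;

```python
# The same simple linear regression fitted with scikit-learn
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[50], [60], [70], [80], [90]])  # size in sqft (one feature column)
y = np.array([150, 180, 210, 240, 270])       # price in thousands

model = LinearRegression().fit(X, y)
print(f"y = {model.intercept_:.2f} + {model.coef_[0]:.2f}x")
print(model.predict([[75]])[0])  # predicted price for 75 sqft
```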



&lt;p&gt;Running this code will give you a straight line that you can use to predict the price of the new size.&lt;/p&gt;

&lt;h4&gt;
  
  
  For more code details, please visit my GitHub
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://github.com/argha-sarkar/Simple-Linear-Regression-" rel="noopener noreferrer"&gt;GitHub - Simple Linear Regression&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>HashMap: What, Why, Where and How to Manage Data Quickly!</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Tue, 09 Sep 2025 08:15:20 +0000</pubDate>
      <link>https://dev.to/argha_sarkar/hashmap-what-why-where-and-how-to-manage-data-quickly-44k8</link>
      <guid>https://dev.to/argha_sarkar/hashmap-what-why-where-and-how-to-manage-data-quickly-44k8</guid>
      <description>&lt;p&gt;&lt;strong&gt;HashMap&lt;/strong&gt; is a data structure that stores data in the form of Key-Value. Its main advantage is that data can be searched and updated very quickly, because the data is located using the hashcode of the key. So anywhere where data needs to be searched quickly, HashMap is very useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Why use?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For fast search, insert, and delete operations&lt;/li&gt;
&lt;li&gt;To avoid duplicate entries for the same key, since a key's value can simply be updated&lt;/li&gt;
&lt;li&gt;For database indexing, caching, counting, and frequency tracking&lt;/li&gt;
&lt;/ul&gt;
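&lt;p&gt;The counting/frequency-tracking use case above is a short pattern with Python's &lt;code&gt;dict&lt;/code&gt; (Python's built-in hash map); the word list here is just made-up sample data:&lt;/p&gt;

```python
# Frequency counting with a dict used as a hash map
words = ["apple", "banana", "apple", "cherry", "banana", "apple"]

freq = {}
for w in words:
    # dict.get returns 0 when the key is missing, so no pre-initialization needed
    freq[w] = freq.get(w, 0) + 1

print(freq)  # {'apple': 3, 'banana': 2, 'cherry': 1}
```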

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Where is it used?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Getting user information by user ID&lt;/li&gt;
&lt;li&gt;Looking up a product name by its product code&lt;/li&gt;
&lt;li&gt;In cache mechanisms&lt;/li&gt;
&lt;li&gt;Index mapping in array algorithms (such as LeetCode's Two Sum)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Small Python code example (Two Sum Problem):&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nums = [2, 7, 11, 15]
target = 9

def twoSum(nums, target):
    hashmap = {}
    for i, num in enumerate(nums):
        complement = target - num
        if complement in hashmap:
            return [hashmap[complement], i]
        hashmap[num] = i

print(twoSum(nums, target))  # Output: [0, 1]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we search the nums array for two numbers whose sum equals the target. The hashmap lets the code find the indices of those two numbers in a single pass.&lt;/p&gt;

&lt;p&gt;This code is explained below -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. def twoSum(nums, target):
I define a function that takes a list of numbers (nums) and a target number.

2. hashmap = {}
I create an empty dictionary (our HashMap), where each key will be a number and its value will be that number's index.

3. for i, num in enumerate(nums):
I loop through each number in the nums list and its index.

4. complement = target - num
I compute the complement: the number that, added to the current number, equals the target.

5. if complement in hashmap:
I check whether the required number is in the hashmap or not.

6. return [hashmap[complement], i]
If it is, I return the indices of the two numbers.

7. hashmap[num] = i
If not, I add the current number and its index to the hashmap.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this way, we can quickly find the positions of two numbers whose sum is equal to the target by running the loop once.&lt;/p&gt;

&lt;p&gt;HashMap is a smart way to find data quickly, storing data as key-value pairs. Using it, searches, inserts, updates, everything is much faster. So whenever fast data access is necessary, HashMap is the best companion! 🚀&lt;/p&gt;

</description>
      <category>leetcode</category>
      <category>code</category>
      <category>programming</category>
      <category>programmers</category>
    </item>
    <item>
      <title>Essential Skills Every Aspiring Data Scientist Should Acquire for Career Success (2025)</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Tue, 21 Jan 2025 09:42:22 +0000</pubDate>
      <link>https://dev.to/argha_sarkar/essential-skills-every-aspiring-data-scientist-should-acquire-for-career-success-2025-1f52</link>
      <guid>https://dev.to/argha_sarkar/essential-skills-every-aspiring-data-scientist-should-acquire-for-career-success-2025-1f52</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The role of data scientists is becoming increasingly important in the world of technology and data-driven decision-making. Every day, businesses and industries generate huge amounts of data that require skilled professionals to analyze and interpret. But which skills are essential to become a successful data scientist? In this blog, we will discuss the skills every aspiring data scientist needs to succeed in this competitive field.&lt;/p&gt;

&lt;p&gt;Data science is a multifaceted field that requires a mix of technical, analytical, and domain-specific skills. As an aspiring data scientist, there are some essential skills that you need to acquire:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Programming and coding skills&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which languages to focus on: Python and R are the most widely used programming languages in data science, popular for their versatility, ease of use, and widely used libraries for data processing and analysis.&lt;/li&gt;
&lt;li&gt;Libraries and Frameworks: Proficiency in libraries like Pandas, NumPy, Matplotlib, and Scikit-learn in Python, or dplyr, ggplot2, and caret in R is essential.&lt;/li&gt;
&lt;li&gt;SQL and Database Management: Proficiency in SQL (Structured Query Language) is essential for querying and manipulating data in databases, as data is typically stored in relational database management systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Statistical Analysis and Probability&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding of statistical concepts: Data scientists must have a solid foundation in statistics, such as hypothesis testing, regression analysis, probability distributions, and p-values.&lt;/li&gt;
&lt;li&gt;Data Exploration and Analysis: Before building complex models, you need to clean your data, identify outliers, and identify patterns using exploratory data analysis (EDA).&lt;/li&gt;
&lt;li&gt;A/B Testing: An important concept in statistical testing is A/B testing, which is used to evaluate different approaches or solutions by comparing two groups.&lt;/li&gt;
&lt;/ul&gt;
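&lt;p&gt;For instance, the A/B-testing idea above often boils down to a two-sample t-test comparing the two groups; the sketch below uses &lt;code&gt;scipy&lt;/code&gt; with made-up numbers for the two variants:&lt;/p&gt;

```python
# A minimal A/B test: compare two groups with an independent two-sample t-test
from scipy import stats

group_a = [12, 11, 14, 10, 13, 12, 11, 13]  # e.g. daily signups under variant A
group_b = [15, 14, 16, 15, 17, 14, 16, 15]  # daily signups under variant B

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (conventionally under 0.05) suggests the difference is real
```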

&lt;p&gt;&lt;strong&gt;3. Machine Learning and Algorithms&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supervised and Unsupervised Learning: A deep understanding of supervised learning (e.g. regression, classification) and unsupervised learning (e.g. clustering, dimensionality reduction) is required.&lt;/li&gt;
&lt;li&gt;Model Evaluation: Ability to evaluate models using cross-validation, ROC curves, precision-recall curves, and performance metrics (e.g. accuracy and F1-score).&lt;/li&gt;
&lt;li&gt;Deep Learning: With the advancement of data science, familiarity with deep learning concepts such as neural networks and the use of TensorFlow, PyTorch frameworks is becoming important.&lt;/li&gt;
&lt;/ul&gt;
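&lt;p&gt;The model-evaluation point above can be tried in a few lines with cross-validation (a sketch using &lt;code&gt;scikit-learn&lt;/code&gt;'s &lt;code&gt;cross_val_score&lt;/code&gt; on the built-in iris dataset):&lt;/p&gt;

```python
# Model evaluation sketch: 5-fold cross-validated accuracy
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores.mean())  # mean accuracy across the 5 folds
```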

&lt;p&gt;&lt;strong&gt;4. Data Visualization and Communication&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visualization Tools: A key skill for a data scientist is to present results in a visual format. Libraries such as Tableau, Power BI, and Matplotlib, Seaborn, Plotly in Python are important.&lt;/li&gt;
&lt;li&gt;Telling stories with data: It is important not only to analyze data, but also to communicate it clearly, so that technical and non-technical stakeholders can understand it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Big Data Technology&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managing large datasets: The ability to use big data tools like Hadoop, Spark and cloud platforms like AWS, Google Cloud, Azure is crucial, especially in managing large datasets.&lt;/li&gt;
&lt;li&gt;Distributed Computing: Having knowledge of distributed computing frameworks and parallel processing is helpful for managing large-scale data pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Business acumen&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Domain knowledge: Having a strong understanding of the industry you are working in (economics, healthcare, marketing, etc.) is helpful for data scientists, so that they can build more effective models and provide actionable insights.&lt;/li&gt;
&lt;li&gt;Problem solving: Data scientists need to be able to transform business problems into data science problems, which requires strong analytical thinking and strategic problem solving.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The field of data science is vast and rapidly changing, but focusing on the above skills will help you build a strong career foundation. If you want to become an aspiring data scientist or improve your current skills, focus on programming, statistics, machine learning, data visualization, big data technology, and business acumen. By focusing on these skills, you will be able to stay competitive in this exciting field.&lt;br&gt;
By regularly improving your skills and staying up to date with the latest technologies and trends, you can become a sought-after data scientist.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>bigdata</category>
      <category>cloud</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Unlocking the World of Data Science: A Beginner’s Guide to Getting Started</title>
      <dc:creator>Argha Sarkar</dc:creator>
      <pubDate>Sat, 18 Jan 2025 20:21:12 +0000</pubDate>
      <link>https://dev.to/argha_sarkar/unlocking-the-world-of-data-science-a-beginners-guide-to-getting-started-593o</link>
      <guid>https://dev.to/argha_sarkar/unlocking-the-world-of-data-science-a-beginners-guide-to-getting-started-593o</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In today’s data-driven world, data science has become a key pillar of business innovation, scientific discovery, and everyday life. Predicting trends, improving customer experiences, or making life-saving medical decisions—data science is behind it all. But what exactly is data science and why is it important?&lt;br&gt;
If you’re new to the field and want to know how data can transform industries, you’ve come to the right place. This guide will give you a simple understanding of data science and why it matters in the digital age.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Data Science?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The main objective of Data Science is to collect, analyze and interpret large amounts of data to gain valuable insights. It combines various fields such as statistics, mathematics, computer science and domain expertise to transform raw data into actionable information. Data Science can help discover hidden patterns, make predictions and make informed decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Key components of Data Science:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Data Collection:&lt;/strong&gt; Collecting raw data from various sources, such as databases, sensors or the internet.&lt;br&gt;
&lt;strong&gt;2. Data Cleaning:&lt;/strong&gt; Removing errors and inconsistencies in the data to make it reliable and accurate.&lt;br&gt;
&lt;strong&gt;3. Data Analysis:&lt;/strong&gt; Identifying trends, patterns and relationships in data using statistical methods.&lt;br&gt;
&lt;strong&gt;4. Machine Learning:&lt;/strong&gt; Applying algorithms so that computers can learn from data to make predictions or decisions.&lt;br&gt;
&lt;strong&gt;5. Data Visualization:&lt;/strong&gt; Presenting data in graphical formats such as charts, graphs and dashboards, so that it is easy for stakeholders to understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why is data science important?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In our increasingly digital world, organizations are generating vast amounts of data every second. The real power of this data lies in how we process and analyze it. Here’s why data science is so important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Informed decision-making:&lt;/strong&gt; By analyzing data, businesses can make more strategic, informed decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictive power:&lt;/strong&gt; Through machine learning and algorithms, data scientists can predict trends and behaviors, such as customer preferences or stock market fluctuations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation and efficiency:&lt;/strong&gt; Data science automates routine tasks, saving time and resources, and increasing efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitive advantage:&lt;/strong&gt; Organizations that use data science can stay ahead of their competitors, as they can gain insights that drive innovation and growth.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Data Science Process: From Data to Insights&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It is important to understand the data science process, as it explains how data scientists generate valuable results from data. Here is a brief description of the data science process in plain language:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Problem Definition:&lt;/strong&gt; Every data science project begins with a clear goal. It could be predicting customer churn, optimizing inventory management, or improving product recommendations—defining the problem is the first step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Collection:&lt;/strong&gt; The next step is to collect the necessary data from a variety of sources—such as the organization’s internal data, third-party sources, or public datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Cleaning and Preparation:&lt;/strong&gt; Raw data is usually incomplete. Data scientists spend a lot of time cleaning it and converting it to the right format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exploratory Data Analysis (EDA):&lt;/strong&gt; In this stage, data scientists explore the data using statistical tools and discover trends and patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model building:&lt;/strong&gt; Using machine learning algorithms or statistical methods, data scientists build models that can predict outcomes or classify data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation and improvement:&lt;/strong&gt; The model is tested and its accuracy is verified. If necessary, changes are made to improve the performance of the model or algorithm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Communication:&lt;/strong&gt; Finally, the results are presented through graphs, charts, and reports, making them available and actionable for decision makers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Skills Required to Become a Data Scientist&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you want to pursue a career in data science, you need to acquire a number of skills. Some skills are technical, while others depend on your approach to data and problem solving. Some of the important skills for people who want to become a data scientist are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Programming languages:&lt;/strong&gt; Proficiency in languages like Python, R, and SQL is essential for analyzing and processing data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mathematics and statistics:&lt;/strong&gt; A solid understanding of probability, statistics, and algebra is essential for building accurate models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine learning:&lt;/strong&gt; Knowledge of machine learning algorithms will help you build predictive models and identify patterns in data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data visualization:&lt;/strong&gt; Proficiency in data presentation using Tableau, Power BI, or Python libraries (such as Matplotlib, Seaborn) is important.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Big data technologies:&lt;/strong&gt; Familiarity with big data tools like Hadoop, Spark, or NoSQL is helpful for working with large datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In today’s fast-paced world, data science opens up a path that helps uncover hidden insights, improve decision-making, and drive innovation across industries. If you are a business leader looking to stay ahead of the competition or are an aspiring data scientist, understanding the basic concepts of data science is the first step.&lt;br&gt;
By harnessing the power of data, we can not only solve today’s challenges, but also pave the way for a smarter and more efficient future. So, step into the world of data science and start exploring its vast potential!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
