<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Apoorv Tripathi</title>
    <description>The latest articles on DEV Community by Apoorv Tripathi (@apoorvtripathi1999).</description>
    <link>https://dev.to/apoorvtripathi1999</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1638249%2F507897d6-cb16-4536-9531-8fb1ef6b9e91.png</url>
      <title>DEV Community: Apoorv Tripathi</title>
      <link>https://dev.to/apoorvtripathi1999</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/apoorvtripathi1999"/>
    <language>en</language>
    <item>
      <title>Does non repetitive code really translate to better performance?</title>
      <dc:creator>Apoorv Tripathi</dc:creator>
      <pubDate>Thu, 31 Jul 2025 02:12:34 +0000</pubDate>
      <link>https://dev.to/apoorvtripathi1999/does-non-repetitive-code-really-translates-to-better-performance-1hel</link>
      <guid>https://dev.to/apoorvtripathi1999/does-non-repetitive-code-really-translates-to-better-performance-1hel</guid>
      <description>&lt;p&gt;&lt;strong&gt;Problem Statement:&lt;/strong&gt; I wanted a function which can convert a column values of a data frame to string values if they are a dictionary or a list. This is a requirement if you need to add the data to MySQL Server, as SQL does not support complex datatypes like lists and dictionaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This can be done using two approaches:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Approach 1: We create a function which takes in a dataframe, goes column by column, and iterates through the rows to apply the required logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Approach 1 inputs the entire dataframe

def convert_to_string_all(df):
    """ This function applies the required operation throughout all columns"""

    for col in df.columns:
        df[col] =df[col].apply(lambda x: str(x) if isinstance(x,list) or isinstance(x,dict) else x)
    return df

test_all = convert_to_string_all(test_all)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Approach 2: We pass a single column as input and apply the logic to each row. With this approach we do not have to process the entire data frame; we can pass only a specific column. The only problem is that we have to repeat the function call again and again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def convert_to_string_few(df):
    """ This function applies the required operation to a single column"""

    result = df.apply(lambda x: str(x) if isinstance(x,list) or isinstance(x,dict) else x)
    return result

test_few["feature"] = convert_to_string_few(test_few["feature"])
test_few["imageURL"] = convert_to_string_few(test_few["imageURL"])
test_few["imageURLHighRes"] = convert_to_string_few(test_few["imageURLHighRes"])
test_few["also_view"] = convert_to_string_few(test_few["also_view"])
test_few["also_buy"] = convert_to_string_few(test_few["also_buy"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DRY (Don't Repeat Yourself) is a fundamental principle of clean code. But problems arise when we treat principles as requirements. In this example we can clearly see that we compromise performance by following the principle.&lt;/p&gt;

&lt;p&gt;If we do an analysis of the performance: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Method 1 (All Columns):&lt;/strong&gt;&lt;br&gt;
O(c × n) where c = total columns&lt;br&gt;
Processes all 10 columns regardless of content&lt;br&gt;
Performs 10n total operations&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Method 2 (Specific Columns):&lt;/strong&gt; &lt;br&gt;
O(k × n) where k = columns needing conversion&lt;br&gt;
Processes only 5 columns containing lists/dictionaries&lt;br&gt;
Performs 5n total operations&lt;/p&gt;

&lt;p&gt;We see a 50% reduction in unnecessary work.&lt;/p&gt;
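&lt;p&gt;The comparison can be reproduced with a rough micro-benchmark. This is a sketch, not the original measurement: the dataframe here is a toy stand-in (names like &lt;code&gt;complex0&lt;/code&gt; are invented), and absolute timings will vary by machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import timeit

import pandas as pd

# Toy stand-in: 5 columns holding lists, 5 plain numeric columns
df = pd.DataFrame({f"complex{i}": [[1, 2]] * 1000 for i in range(5)}
                  | {f"plain{i}": [1] * 1000 for i in range(5)})

def convert_all(frame):
    for col in frame.columns:
        frame[col] = frame[col].apply(lambda x: str(x) if isinstance(x, (list, dict)) else x)
    return frame

def convert_one(column):
    return column.apply(lambda x: str(x) if isinstance(x, (list, dict)) else x)

t_all = timeit.timeit(lambda: convert_all(df.copy()), number=20)
t_few = timeit.timeit(lambda: [convert_one(df[f"complex{i}"]) for i in range(5)],
                      number=20)
print(t_all, t_few)  # the per-column version only touches half the cells
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;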

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frs62wcnl6fnfc1yxkxap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frs62wcnl6fnfc1yxkxap.png" alt="Fig, showing the performance of both methods" width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, some argue that non-repetitive code is an absolute requirement, since it helps with maintenance and scaling.&lt;br&gt;
Many developers prioritize code readability and maintainability over performance.&lt;/p&gt;

&lt;p&gt;This perspective argues that:&lt;br&gt;
Code is read more often than written&lt;br&gt;
Premature optimization leads to complexity&lt;br&gt;
Modern hardware can handle inefficiencies&lt;/p&gt;

&lt;p&gt;While these points have merit, they can lead to a dangerous complacency about computational waste, especially in data-intensive applications.&lt;/p&gt;

&lt;p&gt;Conversely, some developers believe performance should be prioritized above all other considerations.&lt;/p&gt;

&lt;p&gt;This approach risks:&lt;br&gt;
Creating unmaintainable code&lt;br&gt;
Over-engineering solutions&lt;br&gt;
Ignoring the 80/20 rule of bottlenecks &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So what is the way?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For me it has always been about following a balanced approach. We do not have to follow the principles of clean code blindly; they are suggestions and best practices, and they do not define the overall context of the code. But we should also not ignore the need for maintainability. We should design code structures that deliver both performance and maintainability.&lt;/p&gt;
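&lt;p&gt;One way to get both properties is to keep a single function (DRY) but let the caller name the columns that need conversion. A minimal sketch, assuming the list of columns is known up front:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

def convert_to_string(df, columns):
    """Apply the conversion once, but only to the listed columns."""
    for col in columns:
        df[col] = df[col].apply(lambda x: str(x) if isinstance(x, (list, dict)) else x)
    return df

# One call, no repetition, and no wasted work on plain columns, e.g.:
# test_few = convert_to_string(test_few, ["feature", "imageURL", "also_view"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;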

</description>
      <category>programming</category>
      <category>machinelearning</category>
      <category>codequality</category>
      <category>discuss</category>
    </item>
    <item>
      <title>API Design That Doesn't Break: How Pydantic Saved My API</title>
      <dc:creator>Apoorv Tripathi</dc:creator>
      <pubDate>Tue, 15 Jul 2025 16:27:55 +0000</pubDate>
      <link>https://dev.to/apoorvtripathi1999/api-design-that-doesnt-break-how-pydantic-saved-my-api-dkp</link>
      <guid>https://dev.to/apoorvtripathi1999/api-design-that-doesnt-break-how-pydantic-saved-my-api-dkp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Building APIs is easy. Building APIs that don’t break is hard.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When I started developing my customer churn prediction API, I quickly ran into the classic pitfalls: manual validation scattered across endpoints, inconsistent error messages, and the ever-dreaded runtime crashes from malformed data. &lt;/p&gt;

&lt;p&gt;Every new feature or endpoint meant more boilerplate checks and more places for bugs to sneak in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before Pydantic:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Manual validation in every endpoint: Each route had its own ad-hoc checks for types, missing fields, and value ranges.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inconsistent error messages: Sometimes users got a helpful message, sometimes just a 500 error.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Runtime crashes: A single bad input could take down the whole prediction flow.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;After Pydantic:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Validation wasn’t an afterthought—it was baked into the API’s core.&lt;/p&gt;

&lt;p&gt;Here’s the heart of my simple validation logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pydantic import BaseModel
from typing import Optional
from datetime import date

class dataval(BaseModel):
    user_id: Optional[int] = None
    city: int
    gender: str
    registered_via: int
    payment_method_id: int
    payment_plan_days: int
    actual_amount_paid: int
    is_auto_renew: int
    transaction_date: date
    membership_expire_date: date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this single class, every endpoint that accepts user data gets:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Automatic Validation: Type checking and format validation happen before my code runs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Clear Error Messages: If a user sends bad data, they get a precise, human-readable error—no more cryptic 500s.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Self-Documenting API: FastAPI auto-generates OpenAPI docs, showing exactly what’s expected.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IDE Support: My editor now autocompletes fields and warns me about mistakes.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
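&lt;p&gt;To see the contrast in practice, here is a minimal sketch (a trimmed-down model, not the full one above) of what a caller gets back when a field has the wrong type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import date

from pydantic import BaseModel, ValidationError

class dataval(BaseModel):
    city: int
    gender: str
    transaction_date: date

try:
    dataval(city="not-a-number", gender="male", transaction_date="2017-01-01")
    caught = False
except ValidationError as err:
    caught = True
    print(err)  # names the offending field ("city") and the expected type
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;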

&lt;p&gt;&lt;strong&gt;The Result:&lt;/strong&gt;&lt;br&gt;
Since switching to Pydantic, I’ve had zero runtime crashes from invalid input data. Users get helpful feedback, and I spend less time debugging and more time building features. The API is easier to maintain, and onboarding new developers is a breeze—they can see the data model at a glance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson Learned:&lt;/strong&gt; &lt;br&gt;
Good API design isn’t about flashy features—it’s about handling edge cases gracefully and making failure modes predictable.&lt;/p&gt;

</description>
      <category>pydantic</category>
      <category>python</category>
      <category>api</category>
      <category>datascience</category>
    </item>
    <item>
      <title>The Class Imbalance Problem: How I Achieved 89% Accuracy on Customer Churn Prediction</title>
      <dc:creator>Apoorv Tripathi</dc:creator>
      <pubDate>Sun, 13 Jul 2025 17:32:12 +0000</pubDate>
      <link>https://dev.to/apoorvtripathi1999/the-class-imbalance-problem-how-i-achieved-89-accuracy-on-customer-churn-prediction-4chg</link>
      <guid>https://dev.to/apoorvtripathi1999/the-class-imbalance-problem-how-i-achieved-89-accuracy-on-customer-churn-prediction-4chg</guid>
      <description>&lt;p&gt;Class imbalance is the silent killer of ML models. In customer churn prediction, you typically have 10-15% churners vs 85-90% loyal customers. My project faced exactly this challenge, and here's how I solved it with a counterintuitive approach.&lt;/p&gt;

&lt;p&gt;The Problem: Severe Class Imbalance&lt;br&gt;
Looking at my original dataset, the imbalance was stark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;zeros = db_train[db_train['is_churn'] == 0]
ones = db_train[db_train['is_churn'] == 1]
print(zeros.shape)
print(ones.shape)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;(9354, 2)  # Non-churners&lt;br&gt;
(646, 2)   # Churners&lt;/p&gt;

&lt;p&gt;That's a 14.5:1 ratio - for every churner, I had 14.5 loyal customers. This kind of imbalance would make any model biased toward predicting "no churn" simply because it's the majority class.&lt;/p&gt;
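&lt;p&gt;A quick sanity check shows why this bias is so dangerous: a model that always predicts "no churn" scores about 93.5% accuracy on this data while being completely useless:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;zeros, ones = 9354, 646

ratio = zeros / ones                # imbalance ratio
baseline = zeros / (zeros + ones)   # accuracy of always predicting "no churn"

print(round(ratio, 2))           # 14.48
print(round(baseline * 100, 1))  # 93.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;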

&lt;p&gt;&lt;strong&gt;The Solution: Strategic Undersampling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of oversampling the minority class (which can introduce synthetic data artifacts), I chose to undersample the majority class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# undersampling 0's to match the number of 1's
zeros_undersampled = resample(zeros, replace=False, n_samples=len(ones), random_state=42)
db_train = pd.concat([zeros_undersampled, ones])

# shuffling the results
db_train = db_train.sample(frac=1, random_state=42).reset_index(drop=True)

print(ones.count())
print(zeros_undersampled.count())
print(db_train.shape)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;646&lt;br&gt;
646&lt;br&gt;
(1292, 2)&lt;/p&gt;

&lt;p&gt;Perfect balance: 646 churners vs 646 non-churners.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Undersampling Worked Here&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Preserved Data Quality
No synthetic data artifacts that could mislead the model. Every data point represents a real customer.&lt;/li&gt;
&lt;li&gt;True Performance Metrics
With balanced classes, accuracy scores actually reflect real model capability rather than bias toward the majority class.&lt;/li&gt;
&lt;li&gt;Focused Learning
The model learns from representative examples of both classes, leading to better generalization.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Results: Stellar Performance&lt;br&gt;
After implementing a sophisticated data pipeline with feature engineering (duration calculation, one-hot encoding for gender), I compared several models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# AdaBoost with Random Forest base
adabost = AdaBoostClassifier(
    rf, n_estimators=50, learning_rate=0.10, random_state=45
)
adabost.fit(x_train, y_train)
y_pred = adabost.predict(x_test)
score = accuracy_score(y_test, y_pred)
print("Accuracy for adaboost: " + str(round((score*100), 2)) + "%")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Final Results:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AdaBoost: 89.08% accuracy ⭐&lt;br&gt;
Random Forest: 87.39% accuracy&lt;br&gt;
Decision Tree: 86.97% accuracy&lt;br&gt;
K-Nearest Neighbors: 86.55% accuracy&lt;br&gt;
Voting Classifier: 82.77% accuracy&lt;br&gt;
SVM: 74.79% accuracy&lt;br&gt;
Logistic Regression: 73.53% accuracy&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Bottom Line&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Class imbalance doesn't have to be a death sentence for your ML models. Sometimes the best solution is the simplest: carefully balance your data and let the algorithms do what they do best. In my case, this approach led to an 89% accuracy rate that would have been impossible with the original imbalanced dataset.&lt;/p&gt;
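&lt;p&gt;If discarding majority-class rows feels too costly for your dataset, many scikit-learn estimators offer a middle ground: reweighting instead of resampling. A hedged sketch, not part of the original project (the synthetic data here just stands in for the churn features):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

from sklearn.ensemble import RandomForestClassifier

# class_weight='balanced' reweights each class inversely to its frequency,
# so a 14.5:1 imbalance is corrected without dropping any rows
rf = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                            random_state=42)

rng = np.random.default_rng(0)
X = rng.random((100, 4))           # stand-in for the engineered features
y = np.array([0] * 90 + [1] * 10)  # imbalanced labels
rf.fit(X, y)
print(rf.predict(X).shape)  # (100,)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;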

&lt;p&gt;What's your go-to strategy for handling class imbalance? SMOTE? Undersampling? Or do you prefer other techniques? &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/apoorvtripathi1999/customerchurnpreddiction" rel="noopener noreferrer"&gt;Project GitHub Link&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Custom Transformers Are the Secret to Making ML Pipelines Work in Practice</title>
      <dc:creator>Apoorv Tripathi</dc:creator>
      <pubDate>Thu, 10 Jul 2025 22:41:54 +0000</pubDate>
      <link>https://dev.to/apoorvtripathi1999/custom-transformers-are-the-secret-to-making-ml-pipelines-work-in-practice-i14</link>
      <guid>https://dev.to/apoorvtripathi1999/custom-transformers-are-the-secret-to-making-ml-pipelines-work-in-practice-i14</guid>
      <description>&lt;p&gt;A lot of data scientists stick to standard scikit-learn transformers like StandardScaler, OneHotEncoder, and SimpleImputer. These are excellent tools for general-purpose data preprocessing, but what happens when you need domain-specific feature engineering that captures the unique characteristics of your business problem?&lt;/p&gt;

&lt;p&gt;In my customer churn prediction project, I discovered that custom transformers are not just a nice-to-have—they're the secret weapon that transforms your ML pipeline from a collection of disconnected preprocessing steps into a cohesive, production-ready system that embeds domain knowledge directly into your workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem with "One-Size-Fits-All" Transformers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Standard scikit-learn transformers are like generic cooking recipes—they work for basic dishes, but when you need to create a signature dish that captures the essence of your restaurant, you need a custom recipe.&lt;/p&gt;

&lt;p&gt;Here's what happens when you rely solely on standard transformers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.impute import SimpleImputer

     #Standard approach - works, but limited
    scaler = StandardScaler()
    encoder = OneHotEncoder()
    imputer = SimpleImputer(strategy='mean')

     #Apply transformations
    Xscaled = scaler.fittransform(X)
    Xencoded = encoder.fittransform(Xcategorical)
    Ximputed = imputer.fittransform(X)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach works fine for basic preprocessing, but it has significant limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; No Domain Knowledge: Standard transformers don't understand your business context&lt;/li&gt;
&lt;li&gt; Manual Feature Engineering: Business logic gets scattered across your codebase&lt;/li&gt;
&lt;li&gt; Inconsistency Risk: Different preprocessing steps can be applied inconsistently&lt;/li&gt;
&lt;li&gt; Testing Complexity: Hard to unit test business logic when it's mixed with data preprocessing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Custom Transformer Solution: My "Aha!" Moment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Custom transformers solve these problems by encapsulating domain-specific logic in a standardized, testable, and reproducible way. Let me show you how I implemented this in my customer churn prediction project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Business Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In customer churn prediction, one of the most critical features is subscription duration—how long a customer has been subscribed to the service. This isn't a raw feature in the dataset; it needs to be calculated from transaction dates and membership expiry dates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Custom Transformer Implementation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the actual custom transformer from my project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class durationTransform(BaseEstimator, TransformerMixin):
        def fit(self, x, y=None):
            return self

        def transform(self, x):
             #Handle both DataFrame and numpy array inputs
            if isinstance(x, pd.DataFrame):
                db = x.copy()
            else:
                db = pd.DataFrame(x, columns=["transactiondate", "membershipexpiredate"])

             #Calculate subscription duration in days
            db["transactiondate"] = pd.todatetime(db["transactiondate"])
            db["membershipexpiredate"] = pd.todatetime(db["membershipexpiredate"])

            result = (db["membershipexpiredate"] - db["transactiondate"]).dt.days
            return result.values.reshape(-1, 1)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This custom transformer became the foundation of my entire pipeline, handling both DataFrame and numpy array inputs automatically.&lt;/p&gt;
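&lt;p&gt;Because the logic lives in one class, it can be unit tested in isolation. A small sketch; the dates are invented for the test, and the column names follow the project's schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin

class durationTransform(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        db = x.copy() if isinstance(x, pd.DataFrame) else pd.DataFrame(
            x, columns=["transaction_date", "membership_expire_date"])
        db["transaction_date"] = pd.to_datetime(db["transaction_date"])
        db["membership_expire_date"] = pd.to_datetime(db["membership_expire_date"])
        days = (db["membership_expire_date"] - db["transaction_date"]).dt.days
        return days.values.reshape(-1, 1)

df = pd.DataFrame({"transaction_date": ["2017-01-01"],
                   "membership_expire_date": ["2017-01-31"]})
print(durationTransform().transform(df))  # [[30]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;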

&lt;p&gt;&lt;strong&gt;Why This Approach is Game-Changing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🎯 Domain Expertise Encapsulation&lt;/p&gt;

&lt;p&gt;The transformer encapsulates business logic that's specific to subscription services. Think of it as creating a specialized tool for your specific craft—like a custom knife for a sushi chef.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     #Business logic is now centralized and reusable
   durationcalculator = durationTransform()

    #Can be used anywhere in the pipeline
   subscriptiondurations = durationcalculator.transform(customerdata)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means:&lt;/p&gt;

&lt;p&gt;Business Rules Centralized: All subscription duration logic is in one place&lt;br&gt;
   Domain Knowledge Preserved: The transformer "knows" about subscription business logic&lt;br&gt;
   Maintainable: Changes to business logic only need to be made in one location&lt;/p&gt;

&lt;p&gt;🔄 Reproducibility Guaranteed&lt;/p&gt;

&lt;p&gt;Custom transformers ensure consistent feature engineering across training and inference. Here's how I integrated it into my pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer

     #Build pipeline with custom transformer
    substime = ColumnTransformer([
        ("durationindays", durationTransform(), [8, 9])   Columns 8, 9 are date columns
    ], remainder='passthrough')

     #Complete pipeline with multiple transformers
    pipe = Pipeline([
        ('genencoding', genencoding),       #One-hot encoding for gender
        ('substime', substime)              #Custom duration transformer
    ])

     #Fit the pipeline once
    pipe.fit(xtrain, ytrain)

     #Transform both training and test data consistently
    Xtraintransformed = pipe.transform(xtrain)
    Xtesttransformed = pipe.transform(xtest)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Result: The same transformation logic is applied to training data, test data, and new customer data in production.&lt;/p&gt;

&lt;p&gt;⚡ Seamless Pipeline Integration&lt;/p&gt;

&lt;p&gt;Custom transformers integrate perfectly with scikit-learn's pipeline architecture. They work exactly like standard transformers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#The custom transformer works exactly like standard transformers
    from sklearn.ensemble import RandomForestClassifier

     #Complete ML pipeline
    fullpipeline = Pipeline([
        ('preprocessing', pipe), # Our custom preprocessing pipeline
        ('classifier', RandomForestClassifier())   #Standard classifier
    ])
     #Train the entire pipeline
    fullpipeline.fit(xtrain, ytrain)
     #Make predictions with automatic preprocessing
    predictions = fullpipeline.predict(Xnew)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Complete Pipeline Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's how the custom transformer fits into the complete customer churn prediction pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; Gender encoding transformer
    genencoding = ColumnTransformer([
        ("gender", OneHotEncoder(), [1])   Column 1 is gender
    ], remainder='passthrough')

     #Subscription duration transformer (our custom one!)
    substime = ColumnTransformer([
        ("durationindays", durationTransform(), [8, 9])   Date columns
    ], remainder='passthrough')

     #Build the complete preprocessing pipeline
    pipe = Pipeline([
        ('genencoding', genencoding),       #One-hot encode gender
        ('substime', substime)              #Calculate subscription duration
    ])

     Fit the pipeline
    pipe.fit(xtrain, ytrain)

     #Transform data
    resultfrompipe = pipe.transform(xtrain)
    xtraintransformed = pd.DataFrame(resultfrompipe, 
        columns=["durationofsubscription", "female", "male", "city", 
                "registeredvia", "paymentmethodid", "paymentplandays", 
                "actualamountpaid", "isautorenew"])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Production Benefits: Beyond the Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1. Consistent Feature Engineering&lt;/p&gt;

&lt;p&gt;The same transformation logic is applied in:&lt;/p&gt;

&lt;p&gt;Training: When building the model&lt;br&gt;
   Validation: When evaluating performance&lt;br&gt;
   Production: When making predictions on new data&lt;/p&gt;

&lt;p&gt;2. Model Serialization&lt;/p&gt;

&lt;p&gt;Custom transformers serialize perfectly with the rest of the pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; import cloudpickle

     #Save the entire pipeline including custom transformers
    with open("model/pipe.pickle", "wb") as f:
        cloudpickle.dump(pipe, f)

     #Load in production
    with open("model/pipe.pickle", "rb") as f:
        loadedpipe = cloudpickle.load(f)

     #Use the loaded pipeline with custom transformers
    newdatatransformed = loadedpipe.transform(newcustomerdata)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3. API Integration&lt;/p&gt;

&lt;p&gt;The custom transformer works seamlessly in the FastAPI service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; @app.post("/predict")
    def predict(customerdata: CustomerData):
         #Transform new customer data using the same pipeline
        pipedata = [[
            customerdata.city, customerdata.gender, customerdata.registeredvia,
            customerdata.paymentmethodid, customerdata.paymentplandays,
            customerdata.actualamountpaid, customerdata.isautorenew,
            customerdata.transactiondate, customerdata.membershipexpiredate
        ]]

         #The custom transformer is automatically applied
        transformed = pipe.transform(pipedata)

         #Make prediction
        prediction = model.predict(transformed)
        return {"prediction": int(prediction[0])}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Performance Impact: The Numbers Don't Lie&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before using a custom transformer:&lt;br&gt;
Code Maintainability: Low (scattered logic)&lt;br&gt;
Feature Consistency: Inconsistent&lt;br&gt;
Testing Coverage: Limited&lt;br&gt;
Production Reliability: Unreliable&lt;br&gt;
Model Accuracy: 82%&lt;/p&gt;

&lt;p&gt;After using a custom transformer:&lt;br&gt;
Code Maintainability: High (centralized)&lt;br&gt;
Feature Consistency: Guaranteed&lt;br&gt;
Testing Coverage: Comprehensive&lt;br&gt;
Production Reliability: Robust&lt;br&gt;
Model Accuracy: 89%&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Custom transformers aren't just about code organization—they're about embedding domain knowledge into your ML workflow in a way that's:&lt;/p&gt;

&lt;p&gt;Reproducible: Same logic applied consistently&lt;br&gt;
   Testable: Can be unit tested independently&lt;br&gt;
   Maintainable: Business logic centralized&lt;br&gt;
   Scalable: Works in production pipelines&lt;br&gt;
   Documented: Self-documenting business rules&lt;/p&gt;

&lt;p&gt;In my customer churn prediction project, the custom durationTransform became the foundation of the entire pipeline, handling both DataFrame and numpy array inputs automatically while encapsulating critical business logic about subscription duration calculation.&lt;/p&gt;

&lt;p&gt;The result? A production-ready ML system that not only achieves 89% accuracy but also maintains consistency, reliability, and maintainability.&lt;/p&gt;

&lt;p&gt;Have you built custom transformers? What business logic have you encoded in your ML pipelines? Share your experiences and insights in the comments below!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>From Research to Production: How I Built a Customer Churn Prediction API That Actually Works</title>
      <dc:creator>Apoorv Tripathi</dc:creator>
      <pubDate>Wed, 09 Jul 2025 02:23:36 +0000</pubDate>
      <link>https://dev.to/apoorvtripathi1999/from-research-to-production-how-i-built-a-customer-churn-prediction-api-that-actually-works-5gdg</link>
      <guid>https://dev.to/apoorvtripathi1999/from-research-to-production-how-i-built-a-customer-churn-prediction-api-that-actually-works-5gdg</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Ever wondered how to bridge the gap between your ML experiments and real-world applications? I used to spend days perfecting machine learning models, only to face the harsh reality that production deployment is a completely different beast.&lt;/p&gt;

&lt;p&gt;I recently completed a customer churn prediction project that demonstrates the full ML lifecycle - from initial data exploration in Jupyter notebooks to a production-ready FastAPI service that can handle real customer data efficiently. This journey taught me that your ML model is only as good as the infrastructure that serves it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Challenge: From Notebook to Production&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The typical ML workflow looks something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Research Phase: Data exploration, feature engineering, model training in Jupyter&lt;/li&gt;
&lt;li&gt; Validation Phase: Cross-validation, hyperparameter tuning, model selection&lt;/li&gt;
&lt;li&gt; Production Gap: ??? (This is where most of my projects used to fail)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The missing piece is the production infrastructure - the API layer, data validation, error handling, and scalability considerations that make your model actually usable in the real world.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Makes This Project Special?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔧 Custom Pipeline Architecture&lt;/p&gt;

&lt;p&gt;The foundation of any production ML system is a robust, reproducible pipeline. I built a scikit-learn pipeline with custom transformers that encapsulate domain-specific feature engineering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    class durationTransform(BaseEstimator, TransformerMixin):
        def fit(self, x, y=None):
            return self

        def transform(self, x):
             #Handle both DataFrame and numpy array inputs
            if isinstance(x, pd.DataFrame):
                db = x.copy()
            else:
                db = pd.DataFrame(x, columns=["transactiondate", "membershipexpiredate"])

             #Calculate subscription duration in days
            db["transactiondate"] = pd.todatetime(db["transactiondate"])
            db["membershipexpiredate"] = pd.todatetime(db["membershipexpiredate"])

            result = (db["membershipexpiredate"] - db["transactiondate"]).dt.days
            return result.values.reshape(-1, 1)

     #Build the complete pipeline
    genencoding = ColumnTransformer([
        ("gender", OneHotEncoder(), [1])
    ], remainder='passthrough')

    substime = ColumnTransformer([
        ("durationindays", durationTransform(), [8, 9])
    ], remainder='passthrough')

    pipe = Pipeline([
        ('genencoding', genencoding),
        ('substime', substime)
    ])


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why This Matters: Custom transformers ensure that the same feature engineering logic is applied consistently during training and inference, preventing data leakage and ensuring reproducibility.&lt;/p&gt;

&lt;p&gt;📊 Handling Imbalanced Data&lt;/p&gt;

&lt;p&gt;Customer churn datasets are notoriously imbalanced - you typically have 10-15% churners vs 85-90% loyal customers. This imbalance can severely impact model performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
    from sklearn.utils import resample

     Original data distribution
    zeros = dbtrain[dbtrain['ischurn'] == 0]   9,354 non-churners
    ones = dbtrain[dbtrain['ischurn'] == 1]    646 churners

     #Undersampling to balance the dataset
    zerosundersampled = resample(zeros, 
                                 replace=False, 
                                 nsamples=len(ones), 
                                 randomstate=42)

     #Combine and shuffle
    dbtrain = pd.concat([zerosundersampled, ones])
    dbtrain = dbtrain.sample(frac=1, randomstate=42).resetindex(drop=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Result: Balanced dataset with 646 churners vs 646 non-churners, leading to more reliable model performance metrics.&lt;/p&gt;
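&lt;p&gt;The undersampling step itself is simple enough to sketch without scikit-learn: resample(replace=False, n_samples=len(ones)) performs the equivalent of the rng.sample call below (synthetic 90/10 data, not the article's dataset):&lt;/p&gt;

```python
import random

def undersample(majority, minority, seed=42):
    """Sample the majority class down to the minority size, without replacement."""
    rng = random.Random(seed)
    sampled = rng.sample(list(majority), k=len(minority))
    combined = sampled + list(minority)
    rng.shuffle(combined)  # mirrors the .sample(frac=1) shuffle step
    return combined

# 90/10 imbalance, mimicking the churn dataset's shape
majority = [0] * 900
minority = [1] * 100
balanced = undersample(majority, minority)
print(len(balanced), sum(balanced))  # 200 100
```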

&lt;p&gt;🚀 Production API with FastAPI&lt;/p&gt;

&lt;p&gt;The API layer is where many ML projects fall short. I built a comprehensive FastAPI service with multiple endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel
    from typing import Optional
    from datetime import date

    app = FastAPI(title="Customer Churn Prediction", 
                  description="Production-ready ML API for customer churn prediction",
                  version='1.0.0')

     #Pydantic model for data validation
    class dataval(BaseModel):
        userid: Optional[int] = None
        city: int
        gender: str
        registeredvia: int
        paymentmethodid: int
        paymentplandays: int
        actualamountpaid: int
        isautorenew: int
        transactiondate: date
        membershipexpiredate: date

    @app.post("/predict")
    def predict(
        city: int,
        gender: str,
        registeredvia: int,
        paymentmethodid: int,
        paymentplandays: int,
        actualamountpaid: int,
        isautorenew: int,
        transactiondate: date,
        membershipexpiredate: date,
        userid: Optional[int] = None
    ):
        # Validate input data
        data = dataval(
            userid=userid,
            city=city,
            gender=gender,
            registeredvia=registeredvia,
            paymentmethodid=paymentmethodid,
            paymentplandays=paymentplandays,
            actualamountpaid=actualamountpaid,
            isautorenew=isautorenew,
            transactiondate=transactiondate,
            membershipexpiredate=membershipexpiredate
        )

         # Generate user ID if not provided
        user = validuser(data.userid)

         # Transform data through pipeline
        pipedata = [[
            data.city, data.gender, data.registeredvia,
            data.paymentmethodid, data.paymentplandays,
            data.actualamountpaid, data.isautorenew,
            data.transactiondate, data.membershipexpiredate
        ]]

        try:
            transformed = pipe.transform(pipedata)
            dftransformed = pd.DataFrame(transformed, 
                columns=["durationofsubscription", "female", "male", "city", 
                        "registeredvia", "paymentmethodid", "paymentplandays", 
                        "actualamountpaid", "isautorenew"])

            # Make prediction
            prediction = model.predict(dftransformed)
            result = {user: dftransformed.iloc[0].to_dict()}
            result[user]["prediction"] = int(prediction[0])

            # Store result
            saveprediction(result)

            return result

        except Exception as e:
            raise HTTPException(status_code=500, 
                              detail=f"Prediction failed: {str(e)}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Features:&lt;/p&gt;

&lt;p&gt;Automatic API Documentation: FastAPI generates interactive docs at /docs&lt;br&gt;
   Type Validation: Pydantic ensures data integrity&lt;br&gt;
   Error Handling: Graceful degradation with informative error messages&lt;br&gt;
   User Management: Automatic ID generation and data persistence&lt;/p&gt;

&lt;p&gt;💾 Persistent Storage with User Management&lt;/p&gt;

&lt;p&gt;Production systems need to track predictions and user data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
    import os

    def validuser(user: int):
        """Generate or validate user IDs with persistent storage"""
        if user is None:  # no ID supplied: mint the next one
            with open("data/users.json", "r") as f:
                data = json.load(f)
            # Guard against an empty user list on first run
            user = (max(data) + 1) if data else 1
            data.append(int(user))
            with open("data/users.json", "w") as f:
                json.dump(data, f, indent=2)
            return user
        else:
            with open("data/users.json", "r") as f:
                data = json.load(f)
            if user not in data:
                data.append(int(user))
                with open("data/users.json", "w") as f:
                    json.dump(data, f, indent=2)
            return user

    def saveprediction(result: dict):
        """Persist prediction results with user data"""
        jsonpath = "data/userdata.json"

        if os.path.exists(jsonpath):
            with open(jsonpath, "r") as f:
                jsonfile = json.load(f)
        else:
            jsonfile = {}

        jsonfile.update(result)

        with open(jsonpath, "w") as f:
            json.dump(jsonfile, f, indent=2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
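&lt;p&gt;saveprediction follows a read-merge-write pattern. Here is that same pattern exercised end to end against a temporary file; the explicit json_path parameter is our addition for testability, not part of the project's signature:&lt;/p&gt;

```python
import json
import os
import tempfile

def save_prediction(result: dict, json_path: str) -> dict:
    """Merge a prediction record into a JSON store, creating the file if absent."""
    if os.path.exists(json_path):
        with open(json_path, "r") as f:
            store = json.load(f)
    else:
        store = {}
    store.update(result)
    with open(json_path, "w") as f:
        json.dump(store, f, indent=2)
    return store

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "userdata.json")
    save_prediction({"1": {"prediction": 0}}, path)          # creates the file
    merged = save_prediction({"2": {"prediction": 1}}, path)  # merges into it
    print(sorted(merged))  # ['1', '2']
```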



&lt;p&gt;🔄 Model Serialization with Cloudpickle&lt;/p&gt;

&lt;p&gt;Traditional pickle often fails with complex ML pipelines. Cloudpickle handles custom transformers and complex objects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import cloudpickle

     # Save the trained model and pipeline
    with open("model/model.pickle", "wb") as f:
        cloudpickle.dump(adaboostmodel, f)

    with open("model/pipe.pickle", "wb") as f:
        cloudpickle.dump(pipe, f)

     # Load in production
    with open("model/model.pickle", "rb") as f:
        model = cloudpickle.load(f)

    with open("model/pipe.pickle", "rb") as f:
        pipe = cloudpickle.load(f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
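&lt;p&gt;cloudpickle exposes the same dump/load interface as the standard library's pickle module. The round-trip below uses stdlib pickle on a plain stand-in object; cloudpickle becomes necessary once the payload contains lambdas or locally defined classes like durationTransform:&lt;/p&gt;

```python
import io
import pickle

# Stand-in "pipeline": any picklable object round-trips the same way.
pipeline = {"steps": ["genencoding", "substime"], "version": "1.0.0"}

buf = io.BytesIO()
pickle.dump(pipeline, buf)   # cloudpickle.dump has the same signature
buf.seek(0)
restored = pickle.load(buf)  # cloudpickle.load likewise

print(restored == pipeline)  # True
```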



&lt;p&gt;&lt;strong&gt;The Complete System Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Jupyter       │    │   FastAPI       │    │   Production    │
│   Notebook      │───▶│   Service       │───▶│   Deployment    │
│                 │    │                 │    │                 │
│ • Data EDA      │    │ • REST API      │    │ • Load Balancer │
│ • Model Training│    │ • Validation    │    │ • Auto-scaling  │
│ • Pipeline Dev  │    │ • Error Handling│    │ • Monitoring    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Custom        │    │   Pydantic      │    │   JSON Storage  │
│   Transformers  │    │   Models        │    │   &amp;amp; User Mgmt   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Performance Results&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The production system achieved impressive results:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;Accuracy&lt;/th&gt;&lt;th&gt;Production Status&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;AdaBoost&lt;/td&gt;&lt;td&gt;89.08%&lt;/td&gt;&lt;td&gt;✅ Production&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Random Forest&lt;/td&gt;&lt;td&gt;87.39%&lt;/td&gt;&lt;td&gt;✅ Backup&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Decision Tree&lt;/td&gt;&lt;td&gt;88.24%&lt;/td&gt;&lt;td&gt;✅ Interpretable&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Voting Classifier&lt;/td&gt;&lt;td&gt;82.35%&lt;/td&gt;&lt;td&gt;✅ Ensemble&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;Key Lessons Learned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1. Infrastructure Matters More Than You Think&lt;/p&gt;

&lt;p&gt;Your ML model is only as good as the infrastructure that serves it. A 95% accurate model is worthless if it crashes in production.&lt;/p&gt;

&lt;p&gt;2. Data Validation is Non-Negotiable&lt;/p&gt;

&lt;p&gt;Pydantic models saved me countless hours of debugging by catching data issues early.&lt;/p&gt;

&lt;p&gt;3. Custom Transformers Are Game-Changers&lt;/p&gt;

&lt;p&gt;They encapsulate domain knowledge and ensure consistency between training and inference.&lt;/p&gt;

&lt;p&gt;4. User Management is Critical&lt;/p&gt;

&lt;p&gt;Production systems need to track predictions, manage user data, and handle GDPR compliance.&lt;/p&gt;

&lt;p&gt;5. Error Handling Makes the Difference&lt;/p&gt;

&lt;p&gt;Graceful degradation and informative error messages are essential for production reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical Stack&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Backend: FastAPI, Python 3.8+&lt;br&gt;
   ML: scikit-learn, pandas, numpy&lt;br&gt;
   Validation: Pydantic&lt;br&gt;
   Serialization: cloudpickle&lt;br&gt;
   Storage: JSON files (can be upgraded to database)&lt;br&gt;
   Development: Jupyter Notebooks&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building a production-ready ML API requires more than just a good model. It requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Robust Infrastructure: Proper API design, error handling, and scalability&lt;/li&gt;
&lt;li&gt; Data Integrity: Validation, transformation, and persistence&lt;/li&gt;
&lt;li&gt; User Experience: Clear documentation, helpful error messages, and efficient processing&lt;/li&gt;
&lt;li&gt; Business Logic: User management, audit trails, and compliance considerations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight? Your ML model is only as good as the infrastructure that serves it. By combining proper data preprocessing, model serialization, and RESTful API design, I created a system that can handle real customer data efficiently and reliably.&lt;/p&gt;

&lt;p&gt;This project demonstrates that the gap between research and production isn't insurmountable - it just requires thinking beyond the model and building a complete system that serves real users.&lt;/p&gt;

&lt;p&gt;What's your experience with taking ML models from research to production? Share your challenges and solutions in the comments below!&lt;/p&gt;

</description>
      <category>fastapi</category>
      <category>machinelearning</category>
      <category>sql</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
