<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jee Soo Jhun</title>
    <description>The latest articles on DEV Community by Jee Soo Jhun (@jeesoo).</description>
    <link>https://dev.to/jeesoo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2375640%2Fbebf0115-bf29-4f94-8d01-a124a34eaeb7.jpg</url>
      <title>DEV Community: Jee Soo Jhun</title>
      <link>https://dev.to/jeesoo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jeesoo"/>
    <language>en</language>
    <item>
      <title>✨ Data Preprocessing: The Secret Sauce to Delicious Machine Learning ✨</title>
      <dc:creator>Jee Soo Jhun</dc:creator>
      <pubDate>Fri, 08 Nov 2024 03:15:03 +0000</pubDate>
      <link>https://dev.to/jeesoo/data-preprocessing-the-secret-sauce-to-delicious-machine-learning-4fc1</link>
      <guid>https://dev.to/jeesoo/data-preprocessing-the-secret-sauce-to-delicious-machine-learning-4fc1</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcau4vsnvrllurkcaorw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcau4vsnvrllurkcaorw.jpg" alt="cooking" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Imagine you're a chef. 🍳 You have the freshest ingredients, top-of-the-line equipment, and a recipe for the most amazing dish. But what if those ingredients are dirty, not chopped properly, or even rotten? 🤢 Disaster, right?&lt;/p&gt;

&lt;p&gt;That's where data preprocessing comes in! It's like washing, chopping, and preparing your ingredients (data) before you start cooking (building your machine learning model). 🔪  Without it, your model might end up with a bad case of "garbage in, garbage out." 🗑️&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why is Data Preprocessing So Important? 🤔&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;Shiny and Clean Data:&lt;/u&gt; Just like you wouldn't want to eat a dirty apple, your model doesn't like dirty data. Preprocessing removes errors, inconsistencies, and missing values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;A Feast for Your Model:&lt;/u&gt; Preprocessing transforms data into a format that your model can easily digest. This can involve scaling, encoding, and creating new features.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;Boosting Performance:&lt;/u&gt; Clean and well-prepared data helps your model learn more effectively and make better predictions. 🚀&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;Unlocking Insights:&lt;/u&gt; Preprocessing can reveal hidden patterns and relationships in your data, leading to new discoveries. 💡&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Key Steps in Data Preprocessing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;1️⃣ &lt;u&gt;Data Cleaning&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;This is like washing your ingredients. 🍎  It involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handling missing values (filling them in or removing them; both options are sketched below).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Load the data
data = pd.read_csv('your_data.csv')

# Fill missing values in the 'age' column with the mean
data['age'] = data['age'].fillna(data['age'].mean())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
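
&lt;p&gt;If a column is mostly empty, filling in values can do more harm than good, so dropping those rows is the other option the bullet mentions. A minimal sketch, reusing the same 'age' column from the snippet above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Alternatively, drop the rows where 'age' is still missing
data = data.dropna(subset=['age'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;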



&lt;ul&gt;
&lt;li&gt;Removing duplicates.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Remove duplicate rows
data.drop_duplicates(inplace=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Correcting errors and inconsistencies.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Convert city names to lowercase for consistency
data['city'] = data['city'].str.lower()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;2️⃣ &lt;u&gt;Data Transformation&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;This is where you chop and prepare your ingredients. 🥕 It includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Scaling:&lt;/em&gt; Bringing features to a similar scale (standardization, normalization).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Pitfall:&lt;/strong&gt;&lt;/em&gt; Scaling the entire dataset before splitting it into training and testing sets. This causes data leakage and makes your model look unrealistically good during testing.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;How to Avoid It:&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Always split your data first, then scale only the training data, and apply the same scaler to the test data afterward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import StandardScaler

# Standardize the 'age' feature
scaler = StandardScaler()
data['age_scaled'] = scaler.fit_transform(data[['age']])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Encoding:&lt;/em&gt; Converting categorical variables into numbers (one-hot encoding, label encoding). &lt;strong&gt;&lt;em&gt;Order Matters!&lt;/em&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Pitfall:&lt;/strong&gt;&lt;/em&gt; Applying label encoding to ordinal data (like 'Low', 'Medium', 'High') without considering the natural order, or using label encoding on non-ordinal data, which can mislead models into thinking there's a hierarchy. 🤔&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;How to Avoid It:&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Use label encoding only for ordinal data where the order makes sense.&lt;/p&gt;

&lt;p&gt;For non-ordinal data, stick to one-hot encoding to avoid misinterpreted relationships.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import OneHotEncoder

# One-hot encode the 'city' feature
encoder = OneHotEncoder(handle_unknown='ignore')
encoded_features = encoder.fit_transform(data[['city']]).toarray()  
encoded_df = pd.DataFrame(encoded_features)
data = pd.concat([data, encoded_df], axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
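
&lt;p&gt;And when a category really is ordinal (like 'Low', 'Medium', 'High'), keep the order explicit instead of letting an encoder guess it. A minimal sketch, assuming a hypothetical 'priority' column with those three levels:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Encode 'priority' so that Low=0, Medium=1, High=2 (the natural order is preserved)
priority_order = ['Low', 'Medium', 'High']
data['priority_encoded'] = pd.Categorical(data['priority'], categories=priority_order, ordered=True).codes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;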



&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Feature Engineering:&lt;/em&gt; Creating new features from existing ones (e.g., combining "age" and "income" to create "age_income_group").
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a new feature 'age_income_group'
data['age_income_group'] = pd.cut(data['age'], bins=[0, 30, 60, 100], 
                                  labels=['Young', 'Middle-aged', 'Senior'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;3️⃣ &lt;u&gt;Data Reduction&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Sometimes you have too many ingredients! This step helps you simplify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Dimensionality reduction:&lt;/em&gt; Reducing the number of features (PCA).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.decomposition import PCA

# Apply PCA to reduce the number of features
pca = PCA(n_components=2) 
principal_components = pca.fit_transform(data[['feature1', 'feature2', 'feature3']])  
pca_df = pd.DataFrame(data=principal_components, columns=['principal component 1', 'principal component 2'])
data = pd.concat([data, pca_df], axis=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Sampling:&lt;/em&gt; Selecting a smaller representative subset of your data.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split

# Split data into training and testing sets
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Real-World Example: Predicting Customer Churn&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's imagine you're a cool telecom company (like the one with the talking animals in their commercials 😜) trying to predict which customers are about to say "see ya later!" 👋  You have a bunch of data about your customers, but it's a bit messy... kinda like that junk drawer in your kitchen. 🤪 Time to tidy up!&lt;/p&gt;

&lt;p&gt;Here's where the magic of data preprocessing comes in! ✨ We'll use Python and some handy libraries (pandas and scikit-learn) to whip this data into shape. 💪&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Set a seed for reproducibility (so you get the same results!)
np.random.seed(42)

# 1. Create a synthetic dataset (pretend this is your real customer data!)
n_samples = 1000
data = {
    'age': np.random.randint(18, 65, n_samples),
    'gender': np.random.choice(['Male', 'Female'], n_samples),
    'location': np.random.choice(['Urban', 'Suburban', 'Rural'], n_samples),
    'monthly_bill': np.random.normal(50, 15, n_samples),
    'data_usage': np.random.exponential(10, n_samples),
    'call_duration': np.random.normal(300, 100, n_samples),
    'num_customer_service_calls': np.random.randint(0, 10, n_samples),
    'contract_length': np.random.choice([12, 24], n_samples),
    'churned': np.random.choice([True, False], n_samples, p=[0.2, 0.8]),  # 20% churn rate
}
df = pd.DataFrame(data)

# 2. Introduce some missing values (because real-world data is never perfect! 😜)
missing_indices = np.random.choice(df.index, size=int(n_samples * 0.1), replace=False)
df.loc[missing_indices, 'call_duration'] = np.nan

# 3. Fill in those missing values with the average call duration
imputer = SimpleImputer(strategy='mean')
df['call_duration'] = imputer.fit_transform(df[['call_duration']])

# 4. One-hot encode those pesky categorical features (like gender and location)
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_features = encoder.fit_transform(df[['gender', 'location']])
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['gender', 'location']))
df = pd.concat([df, encoded_df], axis=1)
df.drop(['gender', 'location'], axis=1, inplace=True)

# 5. Split the data into training and testing sets (like dividing a pizza! 🍕)
X = df.drop('churned', axis=1)
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# 6. Standardize the numerical features so they play nicely together 😊
#    (fit the scaler on the training set only, then reuse it on the test set, to avoid leakage)
scaler = StandardScaler()
numerical_features = ['monthly_bill', 'data_usage', 'call_duration', 'num_customer_service_calls']
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ta-da! ✨  Now our data is clean, transformed, and ready for a machine learning model to work its magic. 🧙‍♂️&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Here's what we did:&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;Created a fake dataset:&lt;/u&gt; We pretended this was our real customer data with info like age, gender, location, monthly bill, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;Made some values go missing:&lt;/u&gt; Because, let's be real, data is never perfect! 😜&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;Filled in the missing values:&lt;/u&gt; We used the average call duration to fill in the blanks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;One-hot encoded categorical features:&lt;/u&gt; We converted categories (like "Male" and "Female") into numbers for our model to understand.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;Split the data:&lt;/u&gt; We divided our data into training and testing sets, just like splitting a pizza with a friend! 🍕&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;Standardized numerical features:&lt;/u&gt; We made sure all our numerical features had a similar range of values, fitting the scaler on the training set only to avoid leakage.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we're all set to build a model that can predict which customers are likely to churn. This will help our awesome telecom company keep their customers happy and prevent them from switching to the competition. 😎&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;My Thoughts as a Budding Data Scientist&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Data preprocessing is like the foundation of a house. 🏠 Without a strong foundation, everything else crumbles.  It's a crucial step that can make or break your machine learning project. I'm excited to continue learning about advanced preprocessing techniques and apply them to real-world problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stay tuned for the next post where we'll actually build and train our churn prediction model! 🚀&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>datapreprocessing</category>
      <category>beginners</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
