Table of Contents
- Welcome to Day 1
- What is Scikit-Learn?
- Setting Up Your Environment
- Understanding Scikit-Learn's API
- Basic Data Preprocessing
- Building Your First Model
- Model Evaluation Metrics
- Example Project: Iris Classification
- Conclusion and Next Steps
- Summary of Day 1
1. Welcome to Day 1
Welcome to Day 1 of your 90-day journey to becoming a Scikit-Learn boss! Today, we'll lay the foundation by introducing Scikit-Learn, setting up your development environment, understanding its core API, performing basic data preprocessing, building your first machine learning model, and evaluating its performance.
2. What is Scikit-Learn?
Scikit-Learn is one of the most popular open-source machine learning libraries for Python. It provides simple and efficient tools for data mining and data analysis, built on top of NumPy, SciPy, and Matplotlib.
Key Features:
- Wide Range of Algorithms: Classification, regression, clustering, dimensionality reduction, and more.
- Consistent API: Makes it easy to switch between different models (see the short sketch after this list).
- Integration with Other Libraries: Works seamlessly with pandas, NumPy, and Matplotlib.
- Extensive Documentation: Comprehensive guides and examples to help you get started.
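To see the consistent API in action, here is a minimal sketch on a tiny made-up dataset (not part of this tutorial): two very different estimators are trained and queried with exactly the same fit/predict calls.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset, purely for illustration
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

# The same two calls work for both estimators
for model in (LogisticRegression(), DecisionTreeClassifier(random_state=0)):
    model.fit(X, y)
    print(model.__class__.__name__, model.predict([[0], [3]]))
```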
3. Setting Up Your Environment
Before diving into Scikit-Learn, ensure your development environment is properly set up.
Installing Scikit-Learn
You can install Scikit-Learn using `pip` or `conda`.
Using `pip`:
```bash
pip install scikit-learn
```
Using `conda`:
```bash
conda install scikit-learn
```
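After installing, a quick way to confirm that everything worked is to import the package and print its version:
```python
# Verify the installation
import sklearn
print(sklearn.__version__)
```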
Setting Up a Virtual Environment
Creating a virtual environment helps manage dependencies and keep your projects organized.
Using `venv`:
```bash
# Create a virtual environment named 'env'
python -m venv env

# Activate the virtual environment
# On Windows:
env\Scripts\activate
# On macOS/Linux:
source env/bin/activate
```
Using `conda`:
```bash
# Create a conda environment named 'ml_env'
conda create -n ml_env python=3.8

# Activate the environment
conda activate ml_env
```
4. Understanding Scikit-Learn's API
Scikit-Learn follows a consistent and intuitive API design, making it easy to implement various machine learning algorithms.
The Estimator API
In Scikit-Learn, most objects follow the Estimator API, which includes:
- Estimator: An object that learns from data via the `fit` method.
- Predictor: An object that makes predictions via the `predict` method.
Fit and Predict Methods
- `fit(X, y)`: Trains the model on the data.
- `predict(X)`: Makes predictions based on the trained model.
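As a minimal sketch of this two-step workflow (on a tiny made-up dataset, not the Iris data used later):
```python
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, two well-separated classes
X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)                    # 1. learn from the training data
print(model.predict([[2], [11]]))  # 2. predict labels for new samples
```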
Pipelines
Pipelines allow you to chain multiple processing steps, ensuring that all steps are applied consistently during training and prediction.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create a pipeline that first scales the features, then fits a classifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# Train the pipeline (X_train and y_train come from an earlier train/test split)
pipeline.fit(X_train, y_train)

# Make predictions (the same scaling is applied automatically)
predictions = pipeline.predict(X_test)
```
5. Basic Data Preprocessing
Data preprocessing is essential to prepare your data for machine learning models.
Handling Missing Values
Missing values can adversely affect model performance. Scikit-Learn provides tools to handle them.
```python
from sklearn.impute import SimpleImputer
import pandas as pd

# Sample DataFrame
data = {
    'Age': [25, None, 35, 40],
    'Salary': [50000, 60000, None, 80000]
}
df = pd.DataFrame(data)

# Impute missing values with mean
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
print(df)
```
Encoding Categorical Variables
Machine learning models require numerical input. Convert categorical variables using encoding techniques.
```python
from sklearn.preprocessing import OneHotEncoder

# Sample DataFrame
data = {
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York']
}
df = pd.DataFrame(data)

# One-Hot Encoding (scikit-learn >= 1.2 uses sparse_output=False;
# older releases use sparse=False instead)
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['City']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['City']))
df = pd.concat([df, encoded_df], axis=1)
print(df)
```
Feature Scaling
Scaling puts features on comparable ranges so that variables with large numeric values don't dominate the model simply because of their units.
```python
from sklearn.preprocessing import StandardScaler

# Sample DataFrame
data = {
    'Height': [150, 160, 170, 180, 190],
    'Weight': [50, 60, 70, 80, 90]
}
df = pd.DataFrame(data)

# Standardization
scaler = StandardScaler()
df[['Height_Scaled', 'Weight_Scaled']] = scaler.fit_transform(df[['Height', 'Weight']])
print(df)
```
6. Building Your First Model
Let's build a simple classification model using the Iris dataset.
Loading a Dataset
Scikit-Learn provides easy access to common datasets.
```python
from sklearn.datasets import load_iris
import pandas as pd

# Load Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='Species')
```
Splitting the Data
Divide the data into training and testing sets.
```python
from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
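For classification tasks it is often worth stratifying the split so that class proportions are preserved in both sets; `train_test_split` supports this via the `stratify` argument. This is optional here, since the Iris classes are already balanced:
```python
# Optional variant: keep the class distribution identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```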
Training a Simple Classifier
We'll use a Logistic Regression classifier.
```python
from sklearn.linear_model import LogisticRegression

# Initialize the model
model = LogisticRegression(max_iter=200)

# Train the model
model.fit(X_train, y_train)
```
Making Predictions
Use the trained model to make predictions.
```python
# Make predictions
predictions = model.predict(X_test)
print(predictions)
```
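Because Logistic Regression is a probabilistic classifier, you can also inspect the predicted class probabilities instead of only the hard labels:
```python
# Class probabilities for the first few test samples;
# columns are ordered according to model.classes_
probabilities = model.predict_proba(X_test)
print(model.classes_)
print(probabilities[:5].round(3))
```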
7. Model Evaluation Metrics
Evaluate the performance of your model using various metrics.
Accuracy
Measures the proportion of correct predictions.
```python
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy * 100:.2f}%")
```
Precision, Recall, and F1-Score
These metrics provide more insight into model performance, especially for imbalanced datasets.
```python
from sklearn.metrics import classification_report

report = classification_report(y_test, predictions, target_names=iris.target_names)
print(report)
```
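If you prefer individual numbers over the full report, the same metrics are available as separate functions; for a multi-class problem like Iris you choose an averaging strategy (here `average='macro'`, an unweighted mean over the three classes):
```python
from sklearn.metrics import precision_score, recall_score, f1_score

print("Precision (macro):", precision_score(y_test, predictions, average='macro'))
print("Recall (macro):   ", recall_score(y_test, predictions, average='macro'))
print("F1-score (macro): ", f1_score(y_test, predictions, average='macro'))
```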
Confusion Matrix
Visualizes the performance of a classification model.
```python
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, predictions)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
```
8. Example Project: Iris Classification
Let's consolidate what you've learned by building a complete classification pipeline.
Project Overview
Objective: Develop a machine learning pipeline to classify Iris species based on flower measurements.
Tools: Python, Scikit-Learn, pandas, Matplotlib, Seaborn
Step-by-Step Guide
1. Load and Explore the Dataset
```python
from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='Species')

# Combine features and target
df = pd.concat([X, y], axis=1)
print(df.head())

# Visualize pairplot
sns.pairplot(df, hue='Species', palette='Set1')
plt.show()
```
2. Data Preprocessing
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
3. Building and Training the Model
```python
from sklearn.linear_model import LogisticRegression

# Initialize and train the model
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)
```
4. Making Predictions and Evaluating the Model
```python
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Predictions
predictions = model.predict(X_test_scaled)

# Evaluation
print("Classification Report:")
print(classification_report(y_test, predictions, target_names=iris.target_names))

# Confusion Matrix
cm = confusion_matrix(y_test, predictions)
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens',
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
```
9. Conclusion and Next Steps
Congratulations on completing Day 1 of "Becoming a Scikit-Learn Boss in 90 Days"! Today, you laid the groundwork by understanding what Scikit-Learn is, setting up your environment, navigating its API, performing basic data preprocessing, building your first machine learning model, and evaluating its performance.
What's Next?
- Day 2: Supervised Learning - Classification Algorithms: Dive deeper into various classification algorithms like Decision Trees, K-Nearest Neighbors, and Support Vector Machines.
- Day 3: Supervised Learning - Regression Algorithms: Explore regression techniques including Linear Regression, Ridge, Lasso, and Elastic Net.
- Day 4: Model Evaluation and Selection: Learn about cross-validation, hyperparameter tuning, and model selection strategies.
- Day 5: Unsupervised Learning - Clustering and Dimensionality Reduction: Understand clustering algorithms like K-Means and techniques like PCA.
- Day 6: Advanced Feature Engineering: Master techniques to create and select features that enhance model performance.
- Day 7: Ensemble Methods: Explore ensemble techniques like Bagging, Boosting, and Stacking.
- Day 8: Model Deployment with Scikit-Learn: Learn how to deploy your models into production environments.
- Days 9-90: Specialized Topics and Projects: Engage in specialized topics and comprehensive projects to solidify your expertise.
Tips for Success
- Practice Regularly: Apply the concepts through exercises and real-world projects.
- Engage with the Community: Join forums, attend webinars, and collaborate with peers.
- Stay Curious: Continuously explore new features and updates in Scikit-Learn.
- Document Your Work: Keep a detailed journal of your learning progress and projects.
Keep up the great work, and stay motivated as you continue your journey to mastering Scikit-Learn and machine learning!
Summary of Day 1
- What is Scikit-Learn?: Introduced Scikit-Learn as a powerful machine learning library in Python.
- Setting Up Your Environment: Installed Scikit-Learn and set up a virtual environment for project management.
- Understanding Scikit-Learn's API: Explored the Estimator API, fit and predict methods, and the use of pipelines.
- Basic Data Preprocessing: Learned how to handle missing values, encode categorical variables, and scale features.
- Building Your First Model: Developed a simple Logistic Regression classifier using the Iris dataset.
- Model Evaluation Metrics: Evaluated the model using accuracy, precision, recall, F1-score, and the confusion matrix.
- Example Project: Iris Classification: Completed a full machine learning pipeline from data loading to model evaluation.