Supervised learning workflow

#beginners #python #machinelearning #datascience

Linear regression workflow.

Collect data
Collecting data means gathering the relevant sources: CSV files, databases, Excel spreadsheets, APIs, etc.
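A minimal sketch of loading a CSV with pandas. The CSV text and its column names (PassengerId, Age, Fare, Cabin) are made up for illustration, and an in-memory string stands in for a real file so the snippet runs on its own. pandas also provides pd.read_excel and pd.read_sql for the other sources.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """PassengerId,Age,Fare,Cabin
1,22,7.25,
2,38,71.28,C85
3,,7.92,
"""

# pd.read_csv accepts a file path, a URL, or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # (rows, columns)
print(df.isna().sum())  # missing values per column
```

With a real dataset, the io.StringIO(...) argument would simply be replaced by the path to the file.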

Prepare data
Data pre-processing involves cleaning the data: handling missing values, removing duplicates, etc.

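Examples:

df.info() - prints a summary of the dataset structure in pandas (column names, dtypes, non-null counts).
df.drop(columns = ['Cabin'], inplace = True) - drops a column that has lots of missing values.
df['Age'] = df['Age'].fillna(df['Age'].median()) - fills missing values with the median. The median is used because it is less sensitive to outliers than the mean. Running df.info() again afterwards confirms that no missing values remain.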
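Put together, the cleaning steps above might look like the sketch below. The small DataFrame is invented for illustration (only the Age and Cabin column names come from the examples above), and drop_duplicates is added to cover the duplicate-removal step.

```python
import numpy as np
import pandas as pd

# Small made-up dataset with missing values.
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 80.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 30.00],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
})

# Drop a column that is mostly missing.
df = df.drop(columns=["Cabin"])

# Fill missing ages with the median, which is robust to the outlier (80).
df["Age"] = df["Age"].fillna(df["Age"].median())

# Remove exact duplicate rows, if any.
df = df.drop_duplicates()

# Confirm that no missing values remain.
df.info()
```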
Visualization.

example heatmap.
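One common choice is a correlation heatmap. The sketch below assumes that is what is wanted, uses randomly generated data with made-up column names (Age, Fare, Spend), and draws the plot with plain matplotlib.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up numeric data; "Spend" is built to correlate strongly with "Fare".
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.normal(30, 10, 100),
    "Fare": rng.normal(30, 15, 100),
})
df["Spend"] = 2 * df["Fare"] + rng.normal(0, 5, 100)

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()

# Draw the correlation matrix as a colored grid.
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(list(corr.columns))
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(list(corr.columns))
fig.colorbar(im, ax=ax)
# In an interactive session, call plt.show() to display the figure.
```

Strongly correlated pairs show up as cells near +1 or -1, which helps spot redundant features before training.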

Split data into train and test sets.
Splitting the data makes it possible to detect overfitting, i.e. a model that performs very well on the training data but poorly on new data. Evaluating on a held-out test set shows whether the model has learned general patterns or has just memorized the training data.
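A minimal sketch of the split using scikit-learn's train_test_split. The arrays are placeholders, and the 80/20 ratio and random_state value are common choices rather than requirements.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```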

Training the model
Training involves adjusting model parameters to minimize error using training data.
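As a small illustration, the sketch below fits scikit-learn's LinearRegression to made-up points that lie exactly on the line y = 3x + 2, so the fitted parameters can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data that follows y = 3x + 2 exactly.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 3 * X_train[:, 0] + 2

# fit() finds the slope and intercept that minimize the squared error.
model = LinearRegression()
model.fit(X_train, y_train)

print(model.coef_, model.intercept_)
```

The printed slope and intercept should be very close to 3 and 2, since the data contains no noise.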

Make predictions on the test data.
After training, the model uses the learned patterns to predict outcomes on data it has not seen, here the test set.
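A sketch of the prediction step on synthetic data. The data is generated from y = 3x + 2 plus a little noise, purely as an assumption for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: a linear relationship plus small random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.5, size=50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train on the training rows only, then predict the held-out rows.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare the first few predictions with the true values.
print(np.column_stack([y_test[:3], y_pred[:3]]))
```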

Evaluate the model's performance.
Model evaluation involves measuring how well the predictions match the real values on the test data, using metrics such as mean squared error (MSE) and R².
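The sketch below computes a few common regression metrics with sklearn.metrics. The y_test and y_pred arrays are hand-written stand-ins for real test targets and model predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hand-written stand-ins for true test values and model predictions.
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

mse = mean_squared_error(y_test, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_test, y_pred)  # average absolute error
r2 = r2_score(y_test, y_pred)              # 1.0 would be a perfect fit

print(mse, rmse, mae, r2)
```

Lower MSE, RMSE and MAE are better, while R² closer to 1 is better.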

Regularization.

Regularization is a technique used to reduce overfitting by preventing the model from becoming too complex.

Ridge (L2) and Lasso (L1)

Ridge

Adds a squared penalty on the coefficients. Shrinks the coefficients toward zero but does not make them exactly zero, so it keeps all features while reducing the magnitude of their weights.

Lasso

Adds an absolute-value penalty on the coefficients. Can shrink some coefficients to exactly zero, which effectively removes uninformative features.
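To see the difference, the sketch below fits both models on synthetic data where only the first two of five features actually influence the target. The data and the alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 5 features, but only the first two affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.5, size=200)

# alpha controls the penalty strength in both models.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge:", ridge.coef_)  # all coefficients shrunk, none exactly zero
print("Lasso:", lasso.coef_)  # uninformative features set to exactly zero
```

Larger alpha means stronger regularization. In practice alpha is usually tuned, for example with cross-validation.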

Classification.

Logistic regression workflow.

Split data
Splitting the data makes it possible to detect overfitting, i.e. a model that performs very well on the training data but poorly on new data. Evaluating on a held-out test set shows whether the model has learned general patterns or has just memorized the training data.

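Scaling data
Feature scaling is the process of putting all numerical features on a similar scale so that no feature dominates the others just because of its size. The scaler is fitted on the training set only and then applied to the test set, so that no information from the test data leaks into training.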
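A minimal sketch with scikit-learn's StandardScaler, which rescales each feature to zero mean and unit variance. The two columns (think age and income) are made-up values on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (e.g. age and income).
X_train = np.array([[20.0, 20000.0], [30.0, 50000.0], [40.0, 80000.0]])
X_test = np.array([[30.0, 65000.0]])

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics for the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)
```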

Training the model.
Training involves adjusting model parameters to minimize error using training data.

Making predictions
After training, the model uses the learned patterns to predict the class of data it has not seen, here the test set.
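The split, scale, train and predict steps can be sketched together as below. The two-feature dataset is synthetic, and wrapping the scaler and the model in make_pipeline is one possible design, not the only one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem: class 1 when the two features sum to more than 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# stratify=y keeps the class balance similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data, then the classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)               # predicted class labels (0 or 1)
y_proba = clf.predict_proba(X_test)[:, 1]  # estimated probability of class 1

print(y_pred[:5], y_proba[:5])
```

Using a pipeline means the test data is scaled with the statistics learned from the training data automatically.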

Evaluating the model
Model evaluation involves measuring how well the predicted classes match the true classes on the test data, for example with accuracy, precision and recall.
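A short sketch of common classification metrics from sklearn.metrics. The label arrays are hand-written stand-ins for real test labels and predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-written stand-ins for true labels and predicted labels.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_test, y_pred)    # share of correct predictions
precision = precision_score(y_test, y_pred)  # of predicted 1s, how many are really 1
recall = recall_score(y_test, y_pred)        # of real 1s, how many were found
f1 = f1_score(y_test, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

Accuracy alone can be misleading when one class is much more frequent than the other, which is why precision and recall are usually reported alongside it.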

Display the confusion matrix
The confusion matrix is used to evaluate classification models by counting correct and incorrect predictions for each class: true positives, true negatives, false positives and false negatives.
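A minimal sketch using sklearn.metrics.confusion_matrix, again with hand-written labels rather than real model output.

```python
from sklearn.metrics import confusion_matrix

# Hand-written stand-ins for true labels and predicted labels.
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# For a binary problem the four cells unpack in this order.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```

For a plot instead of a printed array, scikit-learn 1.0 and later provide ConfusionMatrixDisplay.from_predictions(y_test, y_pred).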