Vivek0712

Posted on Jan 6, 2024

Speed-Dataing for hackers!

#machinelearning #cleanlab #azureml #datascience

For the ML enthusiasts, self-taught developers engaging in hackathons, and the professionals and data scientists striving to streamline ML processes, this blog is your compass in the world of machine learning. Whether you're tuning your ML operations or seeking a Rapid Application Development (RAD) strategy for Data Science, you've come to the right place.

If any of these roles resonates with you, then you, my friend, are a hacker at heart. As a hackathon veteran with 60+ hackathons behind me, and a professional MLOps Architect and startup consultant for AI/ML, I will demonstrate how to speed up the machine learning journey from start to finish while maintaining the integrity and accuracy of your data and models.

Tackling Stroke Prediction with ML: A No-Code Journey with Cleanlab, Azure, and Amazon SageMaker

Unraveling the Stroke Prediction Dataset from Kaggle

In this blog, we embark on an insightful journey through the realm of machine learning, utilizing the Stroke Prediction Dataset from Kaggle as our guiding star. This dataset holds the potential to unravel predictive insights about stroke occurrences, based on parameters such as gender, age, health conditions, and smoking status. Each data entry is a window into a patient's profile, aiding us in understanding the likelihood of a stroke. Our goal? To predict whether a patient is at risk of a stroke (1) or not (0), using this rich dataset.
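If you want a quick look at the target before opening any tool, a few lines of Python are enough. This is only a minimal sketch: the sample rows below are made up for illustration, and the column names are assumptions based on the fields described above (only the "stroke" label column is named in this post).

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows for illustration; the real file has more columns and rows.
SAMPLE_CSV = """id,gender,age,smoking_status,stroke
1,Male,67,formerly smoked,1
2,Female,61,never smoked,0
3,Female,49,smokes,0
4,Male,80,never smoked,1
5,Female,35,never smoked,0
"""

def label_counts(csv_text, label_col="stroke"):
    """Count how many rows carry each value of the label column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_col] for row in reader)

counts = label_counts(SAMPLE_CSV)
total = sum(counts.values())
for label, n in sorted(counts.items()):
    print(f"stroke={label}: {n} rows ({n / total:.0%})")
```

To run it on the real download, read the file's text and pass it to `label_counts` in place of `SAMPLE_CSV`.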

The No-Code Edge: Introducing Cleanlab

But here's the twist – we are going to do all of this without writing a single line of code! I introduce you to one of my secret weapons: Cleanlab. This tool is designed to simplify the entire machine learning process, from data cleaning to model selection, making it accessible even to those without extensive coding experience. To validate our approach, we will be benchmarking our results specifically against one of the prominent cloud-based AutoML tools: Azure Automated Machine Learning.

Exploring Cleanlab Studio: Uploading Your Dataset

Getting Started with Cleanlab Studio

To begin your no-code machine learning journey with Cleanlab Studio, follow these steps:

Visit Cleanlab.ai and sign up for an account, or visit Cleanlab's GitHub page.
Utilize the free trial offer to get started.

Navigating the Dashboard

Once logged in, you'll be greeted with an intuitive dashboard. Here's what you'll find:

Projects Section

List of ML Projects: Each project is associated with a dataset and includes details such as dataset quality and issues resolved.

Datasets Section

Dataset Listings: Find datasets by name along with details like modality, the number of rows, and upload date.

Uploading Your Dataset

Upload Dataset: Click "Upload Dataset" on your dashboard.
Select File: Choose "Upload from your computer" and select the stroke prediction dataset file.
Dataset Details: Confirm dataset name and designate an "ID column" for unique data identification.

Schema Confirmation: Review the detected fields (columns) by Cleanlab Studio.

Dataset Preview: Visually inspect a sample of your dataset to ensure correctness.
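The "ID column" and schema steps go more smoothly if the file is sane before you upload it. The sketch below is an optional pre-check, not something Cleanlab Studio requires: it looks for duplicate IDs and makes a rough guess at each column's type. The sample rows and the `id` column name are assumptions for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical rows; note the repeated id and the non-numeric bmi entry.
SAMPLE = """id,age,bmi,gender
1,67,36.6,Male
2,61,N/A,Female
2,49,28.1,Female
"""

def find_duplicate_ids(csv_text, id_col="id"):
    """Return the ID values that appear on more than one row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = Counter(row[id_col] for row in reader)
    return sorted(value for value, n in seen.items() if n > 1)

def guess_column_types(csv_text):
    """Label a column 'numeric' if every non-empty value parses as a float, else 'text'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    types = {}
    for row in reader:
        for col, val in row.items():
            if val == "":
                continue  # empty cells say nothing about the type
            try:
                float(val)
                kind = "numeric"
            except ValueError:
                kind = "text"
            # Once a column has been seen as text, it stays text.
            if types.get(col) != "text":
                types[col] = kind
    return types

print("Duplicate IDs:", find_duplicate_ids(SAMPLE))
print("Guessed types:", guess_column_types(SAMPLE))
```

A column you expect to be numeric but that comes back as "text" usually hides placeholder strings, which is worth knowing before you confirm the schema.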

Creating a Project and Cleaning Data in Cleanlab Studio

After uploading your dataset, proceed with the following steps to create a project and clean your data:

1. Create Project

Click on the + Create Project button once your dataset is ready.

2. Project Details

Enter a name for your new project, such as "Stroke Prediction". Select the appropriate cleaning task for your data, which in this case is "Tabular Classification".

3. Type of Classification

Choose "Multi-Class" as the type of classification, since each data point in the stroke prediction dataset has exactly one outcome (stroke or no stroke) rather than several labels at once.

4. Label Column

Identify the "stroke" column as the label for prediction within your dataset.

5. Dataset Sample

A sample of your dataset is provided for review. Ensure that the data appears as expected.

6. Feature Fields

Check the feature fields you wish to include in your machine learning model.

7. Use Cleanlab Auto-ML

Enable Cleanlab Auto-ML to allow the platform to automatically train and combine multiple models for optimal data cleaning.

8. Project Setting

Select between the "Fast" and "Regular" cleaning options. "Fast" provides quicker but potentially less precise results, while "Regular" is more thorough and yields higher quality at the cost of more time.

9. Start Cleaning

Click Clean my data to initiate the auto-cleaning process. Cleanlab will process your data and send an email notification when the results are ready.

Insights from Data Cleaning and Model Evaluation

Once the cleaning run finishes, Cleanlab Studio lists the specific issues it identified within our dataset.

The platform's auto-training feature has evaluated several machine learning models to determine the best performer with the cleaned data.

Improving Data Quality with Cleanlab Studio

Clean Top K

This feature offers a bulk action capability, allowing you to address issues across numerous data points at once. The actions you select here will affect the top data points, which are determined by the currently applied sort and filter settings within Cleanlab Studio.

Auto-fix

When you choose the 'Auto-fix' option, Cleanlab Studio automatically applies its recommended actions to the selected data points.
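To make the idea concrete, here is a toy sketch of "apply the suggested fix to the top K rows under the current sort". It is an illustration of the concept only, not Cleanlab Studio's implementation, and the field names (`issue_score`, `suggested_label`) are hypothetical.

```python
def auto_fix_top_k(rows, k, sort_key="issue_score"):
    """Apply each row's suggested label to the k highest-scoring rows.

    Returns new row dicts and leaves the input untouched.
    """
    ranked = sorted(rows, key=lambda r: r[sort_key], reverse=True)
    top_ids = {r["id"] for r in ranked[:k]}
    fixed = []
    for r in rows:
        r = dict(r)  # copy so the caller's data is not mutated
        if r["id"] in top_ids and r["suggested_label"] is not None:
            r["label"] = r["suggested_label"]
        fixed.append(r)
    return fixed

# Hypothetical rows: a higher issue_score means the label looks more suspicious.
rows = [
    {"id": 1, "label": 0, "suggested_label": 1, "issue_score": 0.91},
    {"id": 2, "label": 0, "suggested_label": None, "issue_score": 0.10},
    {"id": 3, "label": 0, "suggested_label": 1, "issue_score": 0.75},
    {"id": 4, "label": 1, "suggested_label": 0, "issue_score": 0.40},
]

cleaned = auto_fix_top_k(rows, k=2)
print([r["label"] for r in cleaned])
```

Only rows 1 and 3 are changed here; row 4 has a suggestion but falls outside the top 2, which is exactly why the sort and filter you have applied matter before a bulk action.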

Steps for Training and Deployment

Name your model, e.g., stroke-prediction-cleaned-model.

Click Deploy Model to begin the process.

Post-deployment

Track the deployment status on the dashboard.
Review the model evaluation for accuracy and performance metrics.
Use the provided Python API for integration, with your API key for secure access.
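For the integration step, the sketch below shows roughly what calling a deployed model from Python can look like. Treat it as an assumption-laden outline rather than the official snippet: it assumes the `cleanlab-studio` package is installed, that `Studio(api_key)`, `get_model(model_id)` and `predict(...)` behave as in Cleanlab's documentation, and the environment variable name `CLEANLAB_API_KEY` is my own choice. Copy the exact snippet and model ID from your dashboard when in doubt.

```python
import os

def load_api_key(env=None, var_name="CLEANLAB_API_KEY"):
    """Read the API key from an environment mapping instead of hard-coding it.

    The variable name is a hypothetical convention, not one Cleanlab mandates.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} before calling the deployed model")
    return key

def predict_with_deployed_model(api_key, model_id, batch):
    """Send a pandas DataFrame of new patient rows to a deployed model.

    Assumes `pip install cleanlab-studio`; the import is lazy so this file
    still loads on machines where the package is absent.
    """
    from cleanlab_studio import Studio

    studio = Studio(api_key)
    model = studio.get_model(model_id)
    return model.predict(batch)
```

Keeping the key in an environment variable means it never lands in a notebook or a Git history, which matters when a hackathon repo goes public.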

Deploying your model is now just a click away, paving the way for insightful predictions.

Benchmarking Against Azure Automated Machine Learning

Next, I followed the standard steps for creating a machine learning model with Azure Automated Machine Learning, so that we can compare the results.

Azure Model Training Overview

Algorithm: A Voting Ensemble approach was utilized.
AUC Weighted: Achieved an AUC weighted score of 0.85427.
Accuracy: The model attained an accuracy of 95.127%.
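A note on reading these numbers: when one class is rare, as stroke cases tend to be, accuracy alone can flatter a model, so it helps to put it next to a majority-class baseline and a ranking metric such as AUC. The sketch below uses made-up labels and scores, and its plain binary AUC is a simplified stand-in, not Azure's exact weighted AUC computation.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most common label."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)

def binary_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.5, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy:         ", accuracy(y_true, y_pred))
print("majority baseline:", majority_baseline_accuracy(y_true))
print("binary AUC:       ", binary_auc(y_true, scores))
```

If a model's accuracy sits close to the majority baseline, AUC and per-class metrics tell you far more about whether it actually finds the at-risk patients.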

Cleanlab's AutoML Advantages

Set side by side with Azure's Automated Machine Learning, Cleanlab's AutoML stood out in this experiment, particularly in terms of accuracy. Beyond matching or surpassing Azure's accuracy, Cleanlab made cleaning and training on the data more straightforward and needed less training time.

Streamlined Model Deployment and Evaluation

Cleanlab's platform is designed for simplicity and ease of use, streamlining the deployment process. It ensures that models are not only deployed efficiently but also monitored with a user-friendly interface, offering a clear and concise view of your model's performance.

Conclusion

The Stroke Prediction Dataset from Kaggle served not just as a dataset but as a gateway to understanding a critical global health issue. With the World Health Organization citing stroke as the second leading cause of death, our problem statement transcended a mere technical challenge; it became a mission to potentially save lives through predictive analysis.

Cleanlab Studio, with its no-code AutoML solution, demonstrated that sophisticated machine learning is not confined to experts. It showed that with the right tools, we could make significant strides in medical predictions, even under the crunch of hackathon and organisation deadlines. Our comparative analysis with Azure's Automated Machine Learning further highlighted Cleanlab's efficiency and accuracy, bolstering our confidence in the solutions we developed.

As we draw this blog to a close, the key takeaway is clear: the union of purposeful problem statements with powerful, accessible ML tools like Cleanlab can lead to breakthroughs in not just technology but also in societal well-being. This project was more than an exercise in ML proficiency; it was a step towards leveraging technology for the greater good. The path ahead is promising, and as we continue to innovate, we do so with the hope of making a tangible impact on the world.
