DEV Community

Rishabh Jain

Mastering Dataset Acquisition: A Comprehensive Guide

Whether you are learning, practicing, or building a machine learning project, the first thing you need is a dataset suited to the task.

However, working with a dataset involves more than downloading it. The full process includes collecting, cleaning, and verifying the data, along with several smaller tasks along the way.

Chapter 1: Understanding Your Project

  1. Understand your project thoroughly first, because the project determines what your dataset needs to contain.

  2. For example, suppose you need a dataset about taxi customers. Its features can vary significantly depending on when the data was collected, why it was collected, and how. Some datasets record customers' arrival and departure times, while others also record the tips customers offered. This variety is why careful planning and a clear understanding of your project matter before you pick a dataset.
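One way to make this planning concrete is to write down the features your project needs and check each candidate dataset against that list. The sketch below is only an illustration: the column names (`arrival_time`, `departure_time`, `tip_amount`) and the sample data are made up for the taxi example, not taken from any real dataset.

```python
import csv
import io

# Hypothetical features a taxi-tips project might need (assumed names).
REQUIRED_FEATURES = {"arrival_time", "departure_time", "tip_amount"}


def missing_features(csv_text, required=REQUIRED_FEATURES):
    """Return the required column names that are absent from a CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    present = {name.strip().lower() for name in header}
    return sorted(set(required) - present)


# A made-up candidate dataset that records times and fares but not tips.
candidate = (
    "arrival_time,departure_time,fare\n"
    "2024-01-01 08:00,2024-01-01 08:20,12.5\n"
)
print(missing_features(candidate))
```

If the returned list is not empty, the dataset probably does not fit the project as planned, and you either look elsewhere or adjust the plan.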

Chapter 2: Knowing the right sources

  1. Kaggle: A platform for data science and machine learning competitions that also hosts datasets for practice and exploration.

  2. UCI Machine Learning Repository: A collection of databases, domain theories, and data generators widely used by the machine learning community.

  3. Google Dataset Search: Google's tool for finding datasets stored across the web.

  4. GitHub: Many researchers and organizations share datasets in GitHub repositories. You can search for them using specific keywords.

  5. AWS Public Datasets: Amazon Web Services hosts a variety of public datasets that can be accessed for free.

  6. UCR Time Series Classification/Clustering Databases: A collection of time series datasets for classification and clustering tasks.

  7. Reddit Datasets: A subreddit where users share interesting datasets they've found or collected.

  8. Data.gov: The home of the U.S. Government's open data, with thousands of datasets on various topics.

  9. FiveThirtyEight Datasets: Datasets behind the articles and investigations published by FiveThirtyEight.

  10. OpenML: An online platform for sharing and organizing machine learning datasets.
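Once a source gives you a direct download link, saving the file locally takes only a few lines. This is a minimal sketch using the Python standard library. The function name `download_dataset` and the URL in the usage comment are placeholders, and sources that require a login or their own API client are not handled here.

```python
import shutil
import urllib.request
from pathlib import Path


def download_dataset(url, dest):
    """Stream the file at `url` to the local path `dest` and return that path."""
    dest = Path(dest)
    # Create the target folder if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Copy in chunks so large files are not loaded into memory at once.
        shutil.copyfileobj(response, out)
    return dest


# Hypothetical usage (placeholder URL, not a real dataset):
# download_dataset("https://example.com/taxi_trips.csv", "data/taxi_trips.csv")
```

Keeping the raw download untouched in its own folder makes it easy to redo later steps without fetching the file again.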

Chapter 3: Convert the dataset to the format you want to work in (cough...csv...cough)
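As an illustration, here is a rough sketch of one common conversion: a JSON array of flat records turned into CSV text with the standard library. The function name `json_records_to_csv` and the sample records are invented for this example, and nested JSON would need flattening first, which this sketch does not attempt.

```python
import csv
import io
import json


def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)
    # Collect column names in first-seen order so no field is dropped.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    # Fields missing from a record are written as empty cells.
    writer.writerows(records)
    return buffer.getvalue()


# Made-up sample records with slightly different fields.
sample = '[{"trip_id": 1, "tip": 2.5}, {"trip_id": 2, "fare": 9.0}]'
print(json_records_to_csv(sample))
```

Records that lack a field end up with an empty cell in that column, which is something the cleaning step then has to deal with.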

Chapter 4: Clean the data and run your analysis on it. 😎
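What cleaning means depends on the dataset, but two frequent steps are dropping rows with missing values and removing exact duplicates. The sketch below shows both, followed by one simple statistic. The column names and sample rows are made up, and it assumes every row has the same number of fields as the header.

```python
import csv
import io
from statistics import mean


def clean_rows(csv_text, required):
    """Parse CSV text, dropping rows with blank required fields and exact duplicates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Trim stray whitespace around every value.
        row = {key: (value or "").strip() for key, value in row.items()}
        if any(not row.get(name) for name in required):
            continue  # a required field is blank
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue  # exact duplicate of an earlier row
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


# Made-up data: one duplicate row, one row with a missing tip, one with extra spaces.
raw = "trip_id,tip\n1,2.0\n1,2.0\n2,\n3, 4.0 \n"
rows = clean_rows(raw, required=["trip_id", "tip"])
average_tip = mean(float(row["tip"]) for row in rows)
print(len(rows), average_tip)
```

Dropping incomplete rows is the simplest choice, not always the right one. Depending on the project, filling in missing values may preserve more of the data.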
