DEV Community

Rishabh Jain

Mastering Dataset Acquisition: A Comprehensive Guide

Whether you are learning, practicing, or building a Machine Learning project, the first thing you need is a dataset suited to the task.

Acquiring a dataset, however, is a process in its own right: the data has to be collected, cleaned, and verified before it is ready to use.

Chapter 1: Understanding Your Project

  1. Understand your project thoroughly first, because the project determines what your dataset needs to contain.

  2. For instance, suppose you need a dataset about taxi customers. Its features can vary significantly depending on when the data was collected, for what purpose, and by what method. Some datasets record customers' arrival and departure times, while others include the tips customers paid. This variety in features is why careful planning and a clear understanding of your project matter before you choose a dataset.
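One way to act on this is to write down the features your project needs and check each candidate dataset against that list. The sketch below is only an illustration: the column names (`pickup_time`, `dropoff_time`, `tip_amount`, `fare`) and the sample data are hypothetical, not taken from any real taxi dataset.

```python
import csv
import io

def missing_features(csv_text, required):
    """Return the required column names that the CSV header lacks, sorted."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip().lower() for name in header}
    return sorted(name for name in required if name.lower() not in present)

# Hypothetical taxi dataset: it has trip times and fares, but no tip column.
sample = "pickup_time,dropoff_time,fare\n2024-01-01 08:00,2024-01-01 08:20,12.5\n"
required_columns = ["pickup_time", "dropoff_time", "tip_amount"]

print(missing_features(sample, required_columns))  # prints ['tip_amount']
```

If the list comes back non-empty, the dataset does not cover what the project needs, and it is probably worth looking at another source before investing time in cleaning.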

Chapter 2: Knowing the Right Sources

  1. Kaggle: A platform for data science and machine learning competitions that also hosts datasets for practice and exploration.

  2. UCI Machine Learning Repository: A collection of databases, domain theories, and data generators widely used by the machine learning community.

  3. Google Dataset Search: Google's tool for finding datasets stored across the web.

  4. GitHub: Many researchers and organizations share datasets in GitHub repositories. You can search for repositories with datasets using specific keywords.

  5. AWS Public Datasets: Amazon Web Services hosts a variety of public datasets that can be accessed for free.

  6. UCR Time Series Classification/Clustering Databases: A collection of time series datasets for classification and clustering tasks.

  7. Reddit Datasets: A subreddit where users share interesting datasets they've found or collected.

  8. Data.gov: The home of the U.S. Government's open data. It provides access to thousands of datasets on various topics.

  9. FiveThirtyEight Datasets: Datasets related to articles and investigations published by FiveThirtyEight.

  10. OpenML: An online platform for sharing and organizing machine learning datasets.

Chapter 3: Convert the dataset to fit your needs and the format you want to work in (cough...csv...cough)
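As a rough illustration of such a conversion, here is a minimal sketch that turns a JSON array of flat records into CSV using only Python's standard library. It assumes the source data is a list of flat (non-nested) objects; the function name `json_records_to_csv` and the sample fields are made up for this example.

```python
import csv
import io
import json

def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)

    # Collect every key in first-seen order so no column is silently dropped.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    # restval="" leaves a cell empty when a record lacks that key.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Hypothetical records: the second one has an extra "tip" field.
raw = '[{"trip_id": 1, "fare": 12.5}, {"trip_id": 2, "fare": 8.0, "tip": 1.5}]'
print(json_records_to_csv(raw))
```

Records that are missing a field end up with an empty cell rather than raising an error, which keeps the conversion step simple and leaves the decision about missing values to the cleaning step.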

Chapter 4: Clean the data, then apply analytics to it. 😎
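What cleaning involves depends heavily on the dataset, but a very small sketch might look like the following. It assumes two simple rules, dropping exact duplicate rows and dropping rows whose numeric column is empty or not a number, and then computes one basic statistic. The function name `clean_rows`, the column names, and the sample data are all hypothetical.

```python
import csv
import io
import statistics

def clean_rows(csv_text, numeric_column):
    """Drop exact duplicate rows and rows whose numeric column is missing or invalid."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = tuple(row.items())  # identifies an exact duplicate row
        if key in seen:
            continue
        seen.add(key)
        try:
            # float("") raises ValueError; float(None) raises TypeError
            row[numeric_column] = float(row.get(numeric_column))
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric value
        cleaned.append(row)
    return cleaned

# Hypothetical data: one duplicate row, one empty fare, one non-numeric fare.
sample = "trip_id,fare\n1,12.5\n1,12.5\n2,\n3,abc\n4,7.5\n"
rows = clean_rows(sample, "fare")

print([r["trip_id"] for r in rows])             # prints ['1', '4']
print(statistics.mean(r["fare"] for r in rows))  # prints 10.0
```

Real projects usually need more than this (outlier handling, type conversion for several columns, consistent date formats), so treat these two rules as a starting point rather than a complete cleaning recipe.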
