How to Use the OpenClaw Dataset Finder Skill
The OpenClaw ecosystem provides a collection of reusable skills that extend
the capabilities of the OpenClawCLI. One of the most useful skills for
data‑oriented workflows is the dataset finder. This skill consolidates
access to several major data repositories, allowing you to search, preview,
download, and document datasets without leaving your terminal. In this guide
we will walk through the purpose of the skill, its prerequisites, installation
steps, core features, repository‑specific commands, and practical examples
that show how to integrate it into a typical machine learning project.
What the Dataset Finder Skill Does
At its core, the dataset finder skill is a wrapper around a set of Python
scripts that interact with the APIs of Kaggle, Hugging Face Hub, the UCI
Machine Learning Repository, and Data.gov. When you invoke the skill you are
essentially running python scripts/dataset.py with a subcommand that
specifies the repository and the action you want to perform. The skill
supports the following high‑level tasks:
- Searching for datasets across multiple sources using natural‑language‑like queries.
- Downloading datasets with automatic format detection and optional file‑level selection.
- Previewing dataset characteristics such as shape, column types, missing values, and basic statistics without loading the entire file into memory.
- Generating data cards (markdown documentation) that capture description, schema, statistics, usage examples, license, and citation information.
- Managing a local catalog of downloaded datasets so you can quickly list what you have on disk.
Because the skill is accessed through a single command line interface, you can
switch between repositories with a simple change of subcommand, making it
ideal for exploratory data analysis, benchmarking, and building reproducible
pipelines.
Prerequisites and Installation
Before you can use the dataset finder skill you need to have the OpenClawCLI
installed and a few Python packages available. The skill itself does not
require a separate installation; it is part of the openclaw/skills
repository. However, the underlying scripts depend on several libraries:
- kaggle – for interacting with the Kaggle API.
- datasets – the Hugging Face datasets library.
- pandas – for data preview and basic statistics.
- huggingface-hub – for Hub metadata and streaming downloads.
- requests – generic HTTP calls used by the UCI and Data.gov connectors.
- beautifulsoup4 – for parsing HTML responses from Data.gov.
OpenClaw recommends installing these dependencies inside a virtual environment
to avoid polluting your system Python. The installation steps are:
1. Clone the OpenClaw skills repository (if you have not already):
   git clone https://github.com/openclaw/skills.git
2. Navigate to the dataset-finder skill directory:
   cd skills/dataset-finder
3. Create and activate a virtual environment:
   python -m venv venv
   source venv/bin/activate
   (Windows: venv\Scripts\activate)
4. Install the required packages:
   pip install kaggle datasets pandas huggingface-hub requests beautifulsoup4
5. Ensure the OpenClawCLI is available globally (installed via pip install openclawcli or from clawhub.ai).
For Kaggle you also need to place your kaggle.json API token in the default
location (~/.kaggle/kaggle.json on Linux/macOS or
%USERPROFILE%\.kaggle\kaggle.json on Windows) and set the file permissions
to 600.
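The permission step can be sketched as follows. To stay safe to run, the demonstration operates on a throwaway temporary copy with placeholder credentials; in practice the target is ~/.kaggle/kaggle.json.

```shell
# Create a token file and restrict it to owner read/write (mode 600).
# Placeholder values; the real file comes from your Kaggle account page.
tokdir="$(mktemp -d)"
printf '{"username":"your-user","key":"your-key"}' > "$tokdir/kaggle.json"
chmod 600 "$tokdir/kaggle.json"
ls -l "$tokdir/kaggle.json"   # permissions column reads -rw-------
```

The chmod step matters because the Kaggle client warns when the token is readable by other users.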
Core Features Explained
1. Multi‑Repository Search
The skill provides a unified search verb that works across all supported
sources. You simply specify the repository name followed by a query string.
Example:
python scripts/dataset.py kaggle search "house prices"
python scripts/dataset.py huggingface search "sentiment analysis"
python scripts/dataset.py uci search "classification"
python scripts/dataset.py datagov search "census"
Each repository exposes additional flags to narrow results (file type,
license, task, language, etc.). The search output includes a concise list with
title, identifier, size, last update, download count, and a direct URL.
2. Dataset Download
Once you have identified a dataset you can download it with the download
subcommand. The skill automatically detects the file format and writes the
files to a local datasets/ folder inside the skill directory (or a path you
configure). Supported formats include CSV, TSV, JSON, JSONL, Parquet, Excel
(XLS/XLSX), ZIP, HDF5, and Feather. For large repositories like Hugging Face
you can enable streaming to avoid loading the entire dataset into memory:
python scripts/dataset.py huggingface download "large-dataset" --streaming
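The same memory-frugal idea applies to local files as well. As a minimal sketch (not part of the skill itself), pandas can read a large CSV in fixed-size chunks so only one chunk is in memory at a time:

```python
import pandas as pd

def total_rows(path: str, chunksize: int = 10_000) -> int:
    """Stream a CSV in chunks, keeping only one chunk in memory at a time."""
    rows = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        rows += len(chunk)  # replace with real per-chunk processing
    return rows
```

Passing chunksize to pd.read_csv returns an iterator of DataFrames rather than a single frame, which is the standard way to process files larger than RAM.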
3. Dataset Preview
Before committing to a full download you can inspect a local file with the
preview command. This command loads just enough data to compute:
- Number of rows and columns (shape)
- Column names and inferred data types
- Missing value counts per column
- Basic statistics (mean, standard deviation, min, max) for numeric columns
- Memory usage estimate
- A sample of the first few rows
Example:
python scripts/dataset.py preview datasets/housing.csv
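A preview of this kind can be reproduced with a few lines of pandas. The sketch below caps the number of rows read to keep memory use low; the cap and the exact set of fields are illustrative choices, not the skill's actual defaults.

```python
import pandas as pd

def preview_csv(path: str, nrows: int = 1000) -> dict:
    """Summarise a CSV without loading the whole file into memory."""
    df = pd.read_csv(path, nrows=nrows)
    return {
        "shape": df.shape,                                   # (rows read, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),           # inferred column types
        "missing": df.isna().sum().to_dict(),                # missing values per column
        "stats": df.describe().to_dict(),                    # numeric summary stats
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "head": df.head().to_dict(orient="records"),         # first-row sample
    }
```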
4. Data Card Generation
Documentation is a critical part of reproducible research. The skill can
generate a markdown data card that summarises a dataset. The command:
python scripts/dataset.py datacard datasets/housing.csv --output README.md
produces a file that includes:
- Dataset title and description
- Schema overview (column names, types, units)
- Statistics summary (from the preview step)
- Usage example code snippet
- License information and citation details (if available)
These cards can be committed alongside your project or uploaded to a model
card hub.
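A minimal generator in this spirit might look like the sketch below. The section layout mirrors the list above; everything about the skill's real output format is an assumption.

```python
import pandas as pd

def make_data_card(path: str, title: str) -> str:
    """Render a minimal markdown data card for a CSV file."""
    df = pd.read_csv(path)
    lines = [f"# {title}", "", "## Schema", "", "| column | dtype |", "| --- | --- |"]
    for col, dtype in df.dtypes.items():
        lines.append(f"| {col} | {dtype} |")
    lines += ["", "## Statistics", "", f"Rows: {len(df)}; Columns: {df.shape[1]}"]
    lines += ["", "## Usage", "", f'Load with: pd.read_csv("{path}")']
    return "\n".join(lines)
```

Writing the returned string to a README.md next to the data file gives you a versionable documentation artifact.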
5. Local Dataset Management
The skill maintains a simple index of what you have downloaded. Running:
python scripts/dataset.py list
shows each dataset’s local path, size, and the date it was fetched. This helps
avoid duplicate downloads and makes it easy to clean up stale data.
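Conceptually, the listing is a walk over the datasets/ folder. A stdlib-only sketch follows; the real skill may keep a richer index, and the fetch date here is approximated by the file's modification time.

```python
from datetime import datetime
from pathlib import Path

def list_local_datasets(root: str = "datasets") -> list[dict]:
    """Report path, size, and fetch date for each file under the datasets folder."""
    entries = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            stat = path.stat()
            entries.append({
                "path": str(path),
                "size_bytes": stat.st_size,
                "fetched": datetime.fromtimestamp(stat.st_mtime).date().isoformat(),
            })
    return entries
```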
Repository‑Specific Commands
While the generic search and download verbs work across all sources, each
repository also exposes specialized options that reflect its unique metadata.
Kaggle
To use Kaggle you must have an API token. After setting up kaggle.json you
can:
- Search with filters: --file-type csv --license CC0 --sort-by hotness --max-results 10
- Download a specific file inside a dataset:
  python scripts/dataset.py kaggle download "username/dataset" --file "train.csv"
- List all files in a dataset without downloading:
  python scripts/dataset.py kaggle list "username/dataset-name"
Hugging Face Hub
The Hugging Face connector lets you filter by task, language, multimodal flag,
and benchmark status. You can also download a specific split or configuration:
python scripts/dataset.py huggingface search "translation" --task translation --language fr
python scripts/dataset.py huggingface download "wmt14" --config "de-en" --split test
Streaming is activated with --streaming and works well for multi‑gigabyte
text corpora.
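Batch-wise consumption of a streamed source can be sketched with the stdlib alone. In the sketch below a plain generator stands in for the stream of examples that --streaming would yield; the batching helper itself is generic.

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(stream: Iterable, batch_size: int) -> Iterator[list]:
    """Group a (possibly unbounded) stream into fixed-size batches."""
    it = iter(stream)
    while batch := list(islice(it, batch_size)):
        yield batch

# A generator standing in for streamed dataset examples:
fake_stream = ({"text": f"example {i}"} for i in range(10))
for batch in batched(fake_stream, 4):
    pass  # process each batch of up to 4 examples here
```

Because islice pulls only batch_size items at a time, memory use stays bounded regardless of how large the underlying corpus is.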
UCI ML Repository
UCI datasets are often small, tabular collections ideal for quick experiments.
You can filter by task type, minimum number of samples, minimum number of
features, and data type (tabular, text, image, time‑series):
python scripts/dataset.py uci search "regression" --min-samples 500 --min-features 5 --data-type tabular
The --include-metadata flag downloads the accompanying description file.
Data.gov
Data.gov hosts US government open data. You can narrow results by organization, tags, and file format:
python scripts/dataset.py datagov search "health" --organization "cdc.gov" --format csv
python scripts/dataset.py datagov download "dataset-id-12345"
Putting It All Together: Example Workflow
Imagine you are starting a new project to predict house prices. You want to
explore several possible data sources before deciding which one to use. Below
is a step‑by‑step illustration of how the dataset finder skill streamlines
this process.
1. Search Kaggle:
   python scripts/dataset.py kaggle search "house prices" --file-type csv --sort-by hotness --max-results 5
   This returns a list of the most popular housing-price datasets on Kaggle.
2. Preview a promising candidate:
   After identifying the "House Prices – Advanced Regression Techniques" dataset, you download just the preview metadata:
   python scripts/dataset.py preview datasets/house_prices.csv
   You see that the file has 1460 rows, 81 columns, and a mix of numeric and categorical features.
3. Download the full dataset:
   python scripts/dataset.py kaggle download "zillow/zecon"
   The skill writes the CSV to datasets/zillow_zecon/.
4. Generate a data card:
   python scripts/dataset.py datacard datasets/zillow_zecon/train.csv --output docs/house_prices_card.md
   The resulting markdown file is ready to be committed to your repository.
5. Check for complementary sources:
   You also search the UCI ML Repository for any additional housing-related data:
   python scripts/dataset.py uci search "housing" --data-type tabular
   If you find a useful supplement, you repeat the preview/download/card steps.
6. List what you have locally:
   Running python scripts/dataset.py list shows both the Kaggle and UCI datasets with their sizes, confirming you have everything you need.
This workflow demonstrates how the skill reduces context‑switching, eliminates
manual web‑scraping, and guarantees that every dataset you acquire is
accompanied by a basic documentation artifact.
Tips and Best Practices
- Always keep your virtual environment activated when running the skill to avoid version conflicts.
- For Kaggle, protect your kaggle.json file; never commit it to a public repository.
- When downloading very large Hugging Face datasets, combine --streaming with the datasets library's batched iteration to process data in chunks.
- Use the data card generation step early in your project; it serves as a living document that you can update as you augment the dataset.
- Periodically run python scripts/dataset.py list and remove datasets you no longer need to free disk space.
- If you encounter authentication errors with Data.gov, verify that your internet connection allows outbound HTTPS calls and that no proxy is interfering.
Conclusion
The OpenClaw dataset finder skill is a powerful, command‑line driven tool that
consolidates access to some of the most widely used data repositories in the
machine learning ecosystem. By providing a single interface for search,
download, preview, documentation, and local management, it saves developers
and data scientists countless hours of manual effort. Whether you are building
a quick prototype, conducting a literature review, or assembling data for a
production model, integrating this skill into your workflow will make the data
acquisition phase more efficient, reproducible, and transparent. Start by
installing the prerequisite packages, configuring your API tokens, and trying
out the example commands above—you will quickly see how the dataset finder
simplifies the journey from raw data to insight.
The skill can be found at: finder/SKILL.md