Minexa.ai

Posted on Jun 19 • Edited on Jun 24

Scraping environmental data from OpenEI with Minexa.ai

OpenEI (Open Energy Information) is a platform developed and maintained by the National Renewable Energy Laboratory (NREL) with funding from the U.S. Department of Energy. It hosts a large catalog of energy-related datasets. The search page at data.openei.org/search lists datasets across topics like solar resources, utility rates, building energy use, and grid data. Each listing includes a title, organization, tags, license type, and a link to the full dataset record.

If you need to collect this data at scale — to build a dataset index, track what gets published over time, or feed an internal research tool — copying it manually is not realistic. This is where the Minexa.ai Chrome extension comes in.

Watch the full walkthrough first

Before going through the screenshots below, the video tutorial covers the entire process end to end. It is the fastest way to understand how the extraction works on OpenEI.

Watch full video demo

How the extraction works, stage by stage

Rather than listing steps in isolation, here is what each stage of the process actually does and what you see on screen.

Starting point: the Minexa extension

Once the extension is installed, opening it from any page brings up the Minexa home screen. This is where all your scrapers and jobs are managed.

The extension works directly in your browser — no separate app, no dashboard to log into from another tab.

Navigating to the target page

Browse to data.openei.org/search. This is the dataset search listing page. You can see all the dataset cards that Minexa will detect and extract from.

Minexa works on the page currently open in your browser, so there is no URL to paste into a separate interface. You are already on the right page.

Confirming the page

After opening the extension popup, you click 'I'm on the right page'. This tells Minexa to begin analyzing the current page structure.

From this point, Minexa takes over the detection process automatically.

Pagination detection

Minexa scans the page and identifies how it paginates. For OpenEI, it detects the next-page mechanism and shows you the pagination elements it found. You review them and click Continue.

You do not configure this manually. Minexa reads the page structure and figures out the pagination pattern on its own.

Choosing your scraping depth

After pagination is confirmed, Minexa asks whether you want to scrape just the list page or also follow each dataset link and extract detail page data. For most research use cases, list-only is sufficient. For deeper extraction, the detail mode pulls additional fields from each individual dataset record page.

This two-layer extraction capability means a single job can produce both the summary data from the list and the full metadata from each dataset page.

Simple or advanced mode

Before the job starts, you choose between simple mode (Minexa picks the most relevant fields automatically) and advanced mode (you can review and adjust the field selection). For most users, simple mode produces a clean, complete output without any additional configuration.

Container detection

Minexa automatically highlights the repeating container on the page — the element that wraps each dataset listing. This is the structural anchor it uses to identify where one result ends and the next begins.
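
This post does not describe how Minexa finds the container internally. As a rough illustration of the general idea only, the sketch below counts how often each tag and class combination appears in a page and treats the most frequent one as the likely repeating container. The markup, the class names, and the helper `most_repeated_signature` are all hypothetical; the real OpenEI page is structured differently.

```python
from collections import Counter
from html.parser import HTMLParser


class SignatureCounter(HTMLParser):
    """Counts how often each (tag, class) signature appears in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class:  # elements without a class attribute are ignored
            self.counts[(tag, css_class)] += 1


def most_repeated_signature(html):
    """Return the most frequent (tag, class) pair and its count."""
    parser = SignatureCounter()
    parser.feed(html)
    return parser.counts.most_common(1)[0]


# Hypothetical markup, used only to make the sketch runnable.
SAMPLE_HTML = """
<div class="results">
  <div class="card"><h3>U.S. Solar Resource Data</h3></div>
  <div class="card"><h3>Utility Rate Database</h3></div>
  <div class="card"><h3>Example dataset</h3></div>
</div>
"""

signature, count = most_repeated_signature(SAMPLE_HTML)
print(signature, count)
```

A real detector would need to look at nesting and sibling structure as well, since a frequency count alone is easily fooled by repeated icons or navigation links.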

Field discovery

After detecting the container, Minexa surfaces all the data points it found within each listing. These appear as labeled columns — title, organization, tags, license, and more. You do not need to specify these upfront.

This is one of the more useful aspects of the tool: if you are not sure what fields are available on a page, Minexa shows you rather than asking you to define them first.

API and code samples

At the configuration stage, Minexa also surfaces ready-to-use code samples in JSON and Python, along with an API request view. This is useful if you want to integrate the scraper into an existing pipeline.

Job summary with scheduling and Google Sheets options

Before running, you see a summary screen. From here you can connect a Google Sheet for live output or set up a recurring schedule so the job runs automatically without manual triggering.

Scheduling is particularly relevant for OpenEI since new datasets are added regularly. A weekly scheduled run keeps your local dataset index current without any manual work.
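
If you keep the export from each scheduled run, a small script can tell you what is new since the previous one. The sketch below assumes each run is exported as a JSON-style list of records with a `title` field, and it uses the title as the identity of a dataset, which is an assumption: a dataset URL would be a more reliable key if your export includes one. The function name `new_datasets` is hypothetical, and the two lists are inline stand-ins for last week's and this week's files.

```python
def new_datasets(previous, current, key="title"):
    """Return records in `current` whose key value does not appear in `previous`."""
    seen = {record.get(key) for record in previous}
    return [record for record in current if record.get(key) not in seen]


# Inline stand-ins for two consecutive weekly exports.
last_week = [
    {"title": "U.S. Solar Resource Data",
     "organization": "National Renewable Energy Laboratory"},
]
this_week = [
    {"title": "U.S. Solar Resource Data",
     "organization": "National Renewable Energy Laboratory"},
    {"title": "Utility Rate Database", "organization": "NREL"},
]

for record in new_datasets(last_week, this_week):
    print("New:", record["title"])
```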

Running the job

The scraper appears in your jobs list with a Run button. Once triggered, extraction begins across all detected pages.

Results during and after the run

As the job runs, data populates in a table view in real time. Once complete, you can export to Excel or JSON.

What the extracted data looks like

Here is a sample of what the JSON output contains after a completed run on the OpenEI search page:

[
  {
    "title": "U.S. Solar Resource Data",
    "organization": "National Renewable Energy Laboratory",
    "tags": "solar, irradiance, GHI, DNI",
    "license": "Public Domain"
  },
  {
    "title": "Utility Rate Database",
    "organization": "NREL",
    "tags": "electricity rates, tariffs, utilities",
    "license": "Creative Commons"
  }
]

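Each row corresponds to one dataset listing. Field names are consistent across all pages. The values appear to be carried over as listed on the site: in the sample above, the same organization shows up as both "National Renewable Energy Laboratory" and "NREL", and the tags arrive as a single comma-separated string.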
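
Before relying on an export, it can be worth checking that every record really carries the fields you expect. The following is a minimal sketch of such a check. The field list is taken from the sample above, the helper `records_with_missing_fields` is hypothetical, and the second record has its tags blanked out on purpose to show what a problem looks like.

```python
EXPECTED_FIELDS = {"title", "organization", "tags", "license"}


def records_with_missing_fields(records, expected=EXPECTED_FIELDS):
    """Return (index, missing_fields) for records lacking a field or holding an empty value."""
    problems = []
    for index, record in enumerate(records):
        missing = sorted(field for field in expected if not record.get(field))
        if missing:
            problems.append((index, missing))
    return problems


sample = [
    {"title": "U.S. Solar Resource Data",
     "organization": "National Renewable Energy Laboratory",
     "tags": "solar, irradiance, GHI, DNI",
     "license": "Public Domain"},
    # Tags deliberately emptied for this illustration.
    {"title": "Utility Rate Database",
     "organization": "NREL",
     "tags": "",
     "license": "Creative Commons"},
]

print(records_with_missing_fields(sample))
```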

Working with the exported data in Python

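import json

# Load the JSON file exported from Minexa (the filename is an example)
with open('openei_datasets.json', 'r', encoding='utf-8') as f:
    datasets = json.load(f)

# Print one "title | organization" line per dataset listing
for dataset in datasets:
    print(dataset.get('title'), '|', dataset.get('organization'))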

This gives you a quick way to scan titles and organizations, or pipe the data into a pandas DataFrame for further filtering and analysis.
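
For filtering by topic, the tags are easier to work with as a list than as one comma-separated string. The sketch below uses only the standard library and assumes the `tags` field has the shape shown in the sample output. The helper names `split_tags` and `tag_counts` are hypothetical, and the records are inlined so the example runs without an export file.

```python
from collections import Counter


def split_tags(record):
    """Return the record's comma-separated tags as a list of trimmed strings."""
    raw = record.get("tags") or ""
    return [tag.strip() for tag in raw.split(",") if tag.strip()]


def tag_counts(records):
    """Count how many records carry each tag."""
    counts = Counter()
    for record in records:
        counts.update(split_tags(record))
    return counts


# Inline copy of the sample records shown earlier.
datasets = [
    {"title": "U.S. Solar Resource Data", "tags": "solar, irradiance, GHI, DNI"},
    {"title": "Utility Rate Database", "tags": "electricity rates, tariffs, utilities"},
]

# Titles of all datasets tagged "solar"
solar_titles = [d["title"] for d in datasets if "solar" in split_tags(d)]
print(solar_titles)
print(tag_counts(datasets).most_common(3))
```

The same `split_tags` helper can be applied to a DataFrame column if you move on to pandas.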

The scraper configuration is saved after the first run. The next time you trigger it, Minexa skips the detection phase entirely and goes straight to extraction.

If you want to get started, the extension is available at minexa.ai.

DEV Community