<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Minexa.ai</title>
    <description>The latest articles on DEV Community by Minexa.ai (@minexa_ai).</description>
    <link>https://dev.to/minexa_ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869943%2F73932daf-eb97-4f7d-9609-358fb83dd487.png</url>
      <title>DEV Community: Minexa.ai</title>
      <link>https://dev.to/minexa_ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/minexa_ai"/>
    <language>en</language>
    <item>
      <title>The complete web scraping process: what each stage actually involves</title>
      <dc:creator>Minexa.ai</dc:creator>
      <pubDate>Wed, 01 Jul 2026 12:06:43 +0000</pubDate>
      <link>https://dev.to/minexa_ai/the-complete-web-scraping-process-what-each-stage-actually-involves-ck8</link>
      <guid>https://dev.to/minexa_ai/the-complete-web-scraping-process-what-each-stage-actually-involves-ck8</guid>
      <description>&lt;p&gt;Web scraping is not one task. It is a sequence of distinct stages, each with its own failure modes. Understanding what each stage does makes it easier to decide where to invest time and where to offload work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 1: Define what you actually need
&lt;/h2&gt;

&lt;p&gt;Before touching any code or tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What data do you need?&lt;/strong&gt; Be specific. Product prices, job titles, property addresses, and review scores all live in different parts of a page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where is it?&lt;/strong&gt; Identify the exact pages. Is it a list page, a detail page, or both?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How often do you need it?&lt;/strong&gt; A one-off export is a different problem from a weekly recurring dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vague goals produce broken scrapers. Specificity at this stage saves hours later.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 2: Inspect the site structure
&lt;/h2&gt;

&lt;p&gt;Open your browser's developer tools and look at the HTML before writing anything.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the content in the initial HTML response, or is it loaded afterwards by JavaScript?&lt;/li&gt;
&lt;li&gt;Are the data points inside consistent, repeating containers?&lt;/li&gt;
&lt;li&gt;How does pagination work? Next page button, infinite scroll, or a load more trigger?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Static content is straightforward to parse. Dynamic content requires JavaScript rendering, which adds complexity to any custom build.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 3: Check the ethical and legal boundaries
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rate limiting&lt;/strong&gt;: Do not hammer a server. Introduce delays between requests. One request per second is a common starting point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personal data&lt;/strong&gt;: Avoid collecting personally identifiable information without a clear legal basis.&lt;/li&gt;
&lt;/ul&gt;
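
&lt;p&gt;A minimal sketch of the one-request-per-second starting point, in Python. The &lt;code&gt;Throttle&lt;/code&gt; class and its parameter names are illustrative, not part of any library. The clock and sleep functions are injectable only so the pacing logic can be checked without real waiting.&lt;/p&gt;

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # no request has been made yet

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
        return now
```

&lt;p&gt;Calling &lt;code&gt;throttle.wait()&lt;/code&gt; immediately before each request keeps the spacing in one place, so the interval can be raised later without touching the fetch code.&lt;/p&gt;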




&lt;h2&gt;
  
  
  Stage 4: Choose your approach
&lt;/h2&gt;

&lt;p&gt;Three broad paths exist:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Write it yourself&lt;/strong&gt;&lt;br&gt;
Python with requests and BeautifulSoup handles static pages well. For JavaScript-heavy sites, you need a headless browser like Playwright.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://example.com/listings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# stop on 4xx/5xx instead of parsing an error page&lt;/span&gt;
&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;listing-card&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;span&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# find() returns None when a card has no price element&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works. But you are also writing pagination logic, error handling, retry logic, and output validation yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Use a dedicated extraction tool&lt;/strong&gt;&lt;br&gt;
Tools like &lt;a href="https://chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh" rel="noopener noreferrer"&gt;Minexa.ai&lt;/a&gt; handle detection, pagination, JavaScript rendering, and output formatting automatically. You confirm what it found rather than specifying it manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Pass pages to an AI model&lt;/strong&gt;&lt;br&gt;
Works for one-off tasks on small volumes. Becomes unreliable and expensive at scale, particularly when pages contain multiple similar values that the model has to disambiguate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 5: Extract the data
&lt;/h2&gt;

&lt;p&gt;Whether you write selectors manually or use a tool, extraction has the same sub-steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fetch&lt;/strong&gt; the HTML&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parse&lt;/strong&gt; it into a navigable structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locate&lt;/strong&gt; the elements containing your target data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pull&lt;/strong&gt; the values out&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean&lt;/strong&gt; them (strip whitespace, normalize formats)&lt;/li&gt;
&lt;/ul&gt;
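
&lt;p&gt;The cleaning step is the one most often left vague, so here is a rough Python sketch of it. Both helper names are made up for illustration, and &lt;code&gt;clean_price&lt;/code&gt; assumes a dot as the decimal separator and a comma as the thousands separator, which will not hold for every site.&lt;/p&gt;

```python
import re
from decimal import Decimal


def clean_text(raw):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", raw).strip()


def clean_price(raw):
    """Turn a display price such as ' $1,299.00 ' into a Decimal.

    Assumes '.' is the decimal separator and ',' groups thousands.
    Returns None when the text contains no number at all.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; for text like "Call for price" keeps a missing value visibly missing instead of turning it into a zero.&lt;/p&gt;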

&lt;p&gt;One thing worth knowing: many sites have two layers of data. The list page shows summary information, and each result links to a detail page with fuller content. If you need both, your scraper has to follow those links and repeat the extraction on each detail page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fepajng7pfo840qf433te.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fepajng7pfo840qf433te.png" alt="Minexa list and detail page scraping" width="690" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Minexa handles this natively. After detecting the list, you can instruct it to follow each result's link and extract the detail page content in the same run, with no extra configuration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 6: Handle pagination
&lt;/h2&gt;

&lt;p&gt;Most datasets span multiple pages. Your options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find the next page URL and loop&lt;/li&gt;
&lt;li&gt;Simulate scroll events for infinite scroll&lt;/li&gt;
&lt;li&gt;Click a load more button programmatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each requires different logic. Minexa detects the pagination type automatically and follows it across all pages without any setup.&lt;/p&gt;
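
&lt;p&gt;For the first option, the loop itself can be sketched in a few lines of Python. This is an illustration under assumptions, not a complete scraper: &lt;code&gt;fetch_page&lt;/code&gt; is a hypothetical function you would write yourself that takes a URL and returns the rows on that page plus the next page URL, or &lt;code&gt;None&lt;/code&gt; on the last page.&lt;/p&gt;

```python
def collect_all_pages(start_url, fetch_page, max_pages=100):
    """Follow 'next page' links until there are none left.

    fetch_page is caller-supplied: it takes a URL and returns
    (rows, next_url), with next_url set to None on the last page.
    """
    rows = []
    seen = set()
    url = start_url
    while url is not None and url not in seen and max_pages > len(seen):
        seen.add(url)  # guards against pagination that loops back on itself
        page_rows, url = fetch_page(url)
        rows.extend(page_rows)
    return rows
```

&lt;p&gt;The &lt;code&gt;seen&lt;/code&gt; set and the &lt;code&gt;max_pages&lt;/code&gt; cap are there because some sites link the last page back to the first, which would otherwise run forever.&lt;/p&gt;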




&lt;h2&gt;
  
  
  Stage 7: Store and validate the output
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Storage options by scale:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Small&lt;/td&gt;
&lt;td&gt;CSV, Excel, JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;PostgreSQL, MySQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;NoSQL, data warehouse&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Validation checks to run:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are any expected fields missing?&lt;/li&gt;
&lt;li&gt;Are numeric fields stored as numbers, not strings?&lt;/li&gt;
&lt;li&gt;Are there duplicate rows from overlapping pagination?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This step is often skipped, and skipping it causes problems downstream. Minexa returns null for missing values rather than fabricating a substitute, which makes validation simpler because you are checking for nulls rather than hunting for plausible-looking wrong values.&lt;/p&gt;
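
&lt;p&gt;The three checks listed above can be sketched as a single pass over the rows. This is one possible shape in Python, assuming each row is a dict; the function name, its parameters, and the wording of the problem messages are all illustrative.&lt;/p&gt;

```python
from numbers import Number


def validate_rows(rows, required_fields, numeric_fields=(), key_field=None):
    """Return a list of (row_index, problem) tuples; an empty list means clean."""
    problems = []
    seen_keys = set()
    for index, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) is None:
                problems.append((index, f"missing {field}"))
        for field in numeric_fields:
            value = row.get(field)
            # bool counts as a Number in Python, so exclude it explicitly
            if value is not None and (isinstance(value, bool) or not isinstance(value, Number)):
                problems.append((index, f"{field} is not numeric"))
        if key_field is not None:
            key = row.get(key_field)
            if key in seen_keys:
                problems.append((index, f"duplicate {key_field}"))
            seen_keys.add(key)
    return problems
```

&lt;p&gt;Using a stable field such as the listing URL as &lt;code&gt;key_field&lt;/code&gt; is usually enough to catch rows repeated across overlapping pages.&lt;/p&gt;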




&lt;h2&gt;
  
  
  Stage 8: Monitor and maintain
&lt;/h2&gt;

&lt;p&gt;Websites change. A class name update, a layout redesign, or a new anti-bot layer can break a scraper silently or noisily.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor output quality on each run&lt;/li&gt;
&lt;li&gt;Set up alerts for empty results or format changes&lt;/li&gt;
&lt;li&gt;Have a retraining or rewrite process ready&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Minexa, retraining after a site redesign takes the same few minutes as the original setup. The scraper ID stays stable, so downstream integrations do not break.&lt;/p&gt;

&lt;p&gt;For recurring data needs, Minexa supports scheduled runs so the job executes automatically without manual triggering each time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Minexa fits in this workflow
&lt;/h2&gt;

&lt;p&gt;Minexa does not replace understanding the process. It replaces the implementation of the hardest parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No selector writing&lt;/li&gt;
&lt;li&gt;No pagination logic&lt;/li&gt;
&lt;li&gt;No JavaScript rendering setup&lt;/li&gt;
&lt;li&gt;No output schema definition&lt;/li&gt;
&lt;li&gt;Automatic field discovery across any page structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The extension trains on a page once, then reuses that structure indefinitely. The same scraper that took a few minutes to set up can run against thousands of structurally similar pages without repeating setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh" rel="noopener noreferrer"&gt;Install the Minexa.ai extension&lt;/a&gt; and run your first extraction in under ten minutes.&lt;/p&gt;




&lt;p&gt;For more on how extraction actually works under the hood, read: &lt;a href="https://www.minexa.ai/post/why-beginners-keep-hitting-the-same-wall-with-web-scraping-and-what-actually-gets-them-past-it" rel="noopener noreferrer"&gt;What actually happens when Minexa extracts data from a page&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>10 things every developer learns the hard way about web scraping</title>
      <dc:creator>Minexa.ai</dc:creator>
      <pubDate>Fri, 26 Jun 2026 17:38:16 +0000</pubDate>
      <link>https://dev.to/minexa_ai/10-things-every-developer-learns-the-hard-way-about-web-scraping-2pd0</link>
      <guid>https://dev.to/minexa_ai/10-things-every-developer-learns-the-hard-way-about-web-scraping-2pd0</guid>
      <description>&lt;p&gt;Starting with a simple &lt;code&gt;GET&lt;/code&gt; request feels natural. Then the site returns a 403. Then you add headers, and it still fails. Then you discover the page needs JavaScript to render. Sound familiar? Here are the lessons most developers pick up the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. A plain HTTP request is rarely enough
&lt;/h2&gt;

&lt;p&gt;Sending a basic request works on static pages, but most modern sites require JavaScript execution before any useful content appears. The HTML you receive without a browser runtime is often just a loading shell.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Headers matter more than you think
&lt;/h2&gt;

&lt;p&gt;Browsers send a detailed fingerprint with every request: browser type, version, OS, accepted encodings, and more. A minimal HTTP client sends almost none of that. Sites use this difference to flag automated requests almost immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Cookies are not optional
&lt;/h2&gt;

&lt;p&gt;Many sites tie session state to cookies set on the first visit. If your scraper does not carry those cookies forward, subsequent requests either fail or return incomplete data. A proper cookie jar is not a nice-to-have.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Headless browsers are powerful but expensive to scale
&lt;/h2&gt;

&lt;p&gt;Puppeteer and Playwright solve the JavaScript rendering problem well. But running a headless browser per page at scale requires real infrastructure: memory, concurrency management, and a solid retry strategy. It is not a lightweight solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Anti-bot systems track more than your IP
&lt;/h2&gt;

&lt;p&gt;IP blocking is just the start. Modern protection layers also analyze mouse movement patterns, timing between requests, TLS fingerprints, and behavioral signals across sessions. Rotating IPs alone does not get you far on well-protected sites.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Raw HTML is not your output format
&lt;/h2&gt;

&lt;p&gt;Puppeteer gives you a DOM. BeautifulSoup gives you a parse tree. Neither gives you structured data. Turning raw HTML into clean, typed, consistently named fields is a separate problem that takes real effort to get right at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Selectors break when sites update
&lt;/h2&gt;

&lt;p&gt;CSS selectors and XPath expressions are tied to the current structure of a page. When a site updates its layout, those selectors silently stop working or start returning wrong values. Maintenance overhead grows with every scraper you own.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Scaling changes everything
&lt;/h2&gt;

&lt;p&gt;A scraper that works for 50 pages often breaks at 5,000. Concurrency limits, rate limiting, memory leaks in long-running browser instances, and downstream parsing failures all become real problems only at volume.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Silent failures are the worst kind
&lt;/h2&gt;

&lt;p&gt;If a field is missing and your scraper does not surface that explicitly, you end up with gaps in your dataset that are hard to detect later. A scraper that returns null loudly is far easier to operate than one that silently returns the wrong value.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. You can skip most of this with the right extraction layer
&lt;/h2&gt;

&lt;p&gt;Tools like the &lt;a href="https://www.minexa.ai" rel="noopener noreferrer"&gt;Minexa.ai API&lt;/a&gt; handle JavaScript rendering, anti-bot evasion, and structured field extraction without requiring you to write or maintain selectors. You train a scraper once using the browser extension, get a stable scraper ID, and call the API with that ID against any list of URLs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.minexa.ai/data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bearer YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scraper_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4821&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;columns&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;top_25&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;urls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://example.com/listings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extraction is DOM-based and deterministic: every field maps to a fixed position in the page structure. If a value is not found, the output returns null rather than a fabricated substitute. No guessing, no hallucination risk.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpotth4rcl9oat5taidy9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpotth4rcl9oat5taidy9.png" alt="Minexa API request structure explained" width="690" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For developers managing many different URLs across recurring jobs, the practical approach is to set up your own cron jobs and pass URL batches to the API on each run. The extraction side stays stable while your scheduling logic stays in your own infrastructure.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://minexa.stoplight.io/docs/minexa/" rel="noopener noreferrer"&gt;Minexa.ai API docs&lt;/a&gt; cover the full request structure, credit consumption by page type, and how to handle paginated API responses in Python.&lt;/p&gt;

&lt;p&gt;If you want more context on what breaks in production scraping pipelines and why, this is worth reading: &lt;a href="https://www.minexa.ai/post/why-your-scraping-setup-works-in-testing-but-breaks-in-production" rel="noopener noreferrer"&gt;Why your scraping setup works in testing but breaks in production&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AI agents for web scraping: what actually works vs. what gets oversold</title>
      <dc:creator>Minexa.ai</dc:creator>
      <pubDate>Fri, 26 Jun 2026 17:32:04 +0000</pubDate>
      <link>https://dev.to/minexa_ai/ai-agents-for-web-scraping-what-actually-works-vs-what-gets-oversold-539p</link>
      <guid>https://dev.to/minexa_ai/ai-agents-for-web-scraping-what-actually-works-vs-what-gets-oversold-539p</guid>
      <description>&lt;p&gt;The AI agent hype cycle has reached web scraping. Teams are shipping agentic workflows that use LLMs to browse pages, extract fields, and pipe structured data into downstream systems. Some of it works well. A lot of it is solving the wrong problem in the most expensive way possible.&lt;/p&gt;

&lt;p&gt;Here is a more grounded look at what is actually happening.&lt;/p&gt;




&lt;h2&gt;
  
  
  'Agents are better than traditional scrapers because they self-heal'
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Partially true. The tradeoff is often ignored.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The self-healing argument is real. When a site changes its layout, an LLM-based agent can sometimes reason its way to the right field without requiring a rewrite. That is genuinely useful, especially for teams that do not have dedicated scraping engineers.&lt;/p&gt;

&lt;p&gt;But self-healing is not free. It means the agent is making judgment calls about which value on the page maps to which field in your schema. For pages with a single unambiguous value per field, this is fine. For pages with multiple similar values, like two prices, two dates, or two address lines, the model picks one. It does not always pick the right one. And it rarely signals that it was uncertain.&lt;/p&gt;

&lt;p&gt;At small scale, this is manageable. Across tens of thousands of pages running on a schedule, it becomes a data quality problem that is hard to detect because the output still looks valid.&lt;/p&gt;

&lt;p&gt;Structural extraction tools avoid this entirely. They bind each output field to a specific position in the page's DOM. If the value is not there, the field returns empty. No guessing, no silent substitution.&lt;/p&gt;




&lt;h2&gt;
  
  
  'Non-technical people can now build scrapers using prompts'
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is the most underrated real win.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Several engineering teams have noted that the biggest productivity gain from agentic scraping is not raw performance. It is that non-technical staff can now describe what they want in plain language instead of filing a ticket and waiting for an engineer to write Playwright selectors.&lt;/p&gt;

&lt;p&gt;This is a legitimate shift. Freeing engineers from writing and maintaining scraping scripts is valuable, and it is happening in practice across teams in compliance, operations, and research.&lt;/p&gt;

&lt;p&gt;The same shift is available without agents. Tools like &lt;a href="https://www.minexa.ai" rel="noopener noreferrer"&gt;Minexa.ai&lt;/a&gt; let non-technical users browse to a page, confirm what was detected automatically, and export structured data, without writing a single line of code or understanding how the page is built. The scraper trains itself based on the page structure, and the same configuration runs again on any structurally similar page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4pg964w8jl7f6fp18cj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4pg964w8jl7f6fp18cj.png" alt="Minexa.ai end-to-end no-code scraping flow" width="690" height="310"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  'Token costs are manageable'
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;At low volume, yes. At scale, this is where budgets break.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A full HTML page can contain a very large amount of text, far more than the visible content. When you pass that to an LLM for extraction, you are paying for all of it, including navigation menus, scripts, footers, and boilerplate. Even with stripped HTML, the token count per page is significant.&lt;/p&gt;

&lt;p&gt;For a one-off extraction of a handful of pages, this is not a concern. For a recurring job that processes thousands of pages per day, the cost scales directly with volume and page size. There is no efficiency gain at scale.&lt;/p&gt;
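
&lt;p&gt;To make the linear scaling concrete, here is a back-of-the-envelope calculation in Python. Every number in it is hypothetical and chosen for round arithmetic: it is not a measured token count or a quoted price from any provider, and it counts input tokens only.&lt;/p&gt;

```python
def monthly_llm_cost(pages_per_day, tokens_per_page, usd_per_million_tokens, days=30):
    """Input-token cost only; output tokens and retries would add to this."""
    total_tokens = pages_per_day * tokens_per_page * days
    return total_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical figures for illustration, not measured or quoted prices.
small_job = monthly_llm_cost(pages_per_day=10, tokens_per_page=20_000, usd_per_million_tokens=1.0)
large_job = monthly_llm_cost(pages_per_day=5_000, tokens_per_page=20_000, usd_per_million_tokens=1.0)
print(small_job, large_job)  # prints: 6.0 3000.0
```

&lt;p&gt;Under these assumptions, 500 times the pages costs exactly 500 times as much. Nothing in the formula gets cheaper per page as volume grows.&lt;/p&gt;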

&lt;p&gt;Structural extraction does not work this way. The cost per page stays flat regardless of how much HTML the page contains, because the tool reads the structure, not the content. Whether you extract ten rows or ten thousand from the same page type, the setup cost is the same.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Try Minexa.ai:&lt;/strong&gt; &lt;a href="https://chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh" rel="noopener noreferrer"&gt;Install the Chrome extension&lt;/a&gt; and run your first extraction in a few minutes.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  'LLMs are good enough for production data pipelines'
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For specific use cases, yes. For general structured extraction at scale, the failure modes matter.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLMs are genuinely good at tasks that require reading and synthesis: summarizing documents, classifying content, and extracting information from unstructured text where the schema is loose. These are areas where the model's ability to reason adds real value.&lt;/p&gt;

&lt;p&gt;For structured web extraction, the requirements are different. You want the same field to return the same value every time, across every page, with no variance. You want empty results when data is missing, not fabricated placeholders. You want costs that are predictable at volume.&lt;/p&gt;

&lt;p&gt;LLM-based extraction can meet these requirements for simple pages. For pages with dense or ambiguous content, the accuracy and cost profile becomes harder to control.&lt;/p&gt;




&lt;h2&gt;
  
  
  What actually works well together
&lt;/h2&gt;

&lt;p&gt;The most practical pattern is not 'agents vs. traditional scrapers' but knowing which tool fits which job.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One-off extraction from a handful of pages:&lt;/strong&gt; An LLM or a quick manual copy-paste is often faster than setting up any tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recurring structured extraction from many pages:&lt;/strong&gt; DOM-based tools are more consistent and cheaper at volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unstructured content that needs interpretation:&lt;/strong&gt; LLMs add real value here because the task requires reasoning, not just reading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-technical users who need recurring datasets:&lt;/strong&gt; A no-code tool with automatic field detection and scheduling handles this without engineering involvement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Minexa.ai covers the second and fourth cases. It detects page structure automatically, handles all common pagination types without configuration, supports scheduled runs, and exports to Excel, Google Sheets, or JSON. For teams that need data from many pages on a regular basis, that is a more reliable foundation than an agent making field-level judgment calls on every run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fayaqojvxb0oo0l7da0ih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fayaqojvxb0oo0l7da0ih.png" alt="Minexa.ai deterministic extraction vs LLM hallucination" width="690" height="370"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The honest summary
&lt;/h2&gt;

&lt;p&gt;AI agents have made scraping more accessible and reduced the engineering overhead for certain workflows. That is real progress. But the framing that agents simply replace structured extraction tools is not accurate, and teams that have shipped both in production know the difference.&lt;/p&gt;

&lt;p&gt;Accuracy at scale, predictable costs, and consistent output are not marketing claims. They are the actual requirements of a production data pipeline. Choose your tooling based on those requirements, not on which approach has more momentum in the current hype cycle.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;See how Minexa.ai handles structured extraction:&lt;/strong&gt; &lt;a href="https://www.minexa.ai" rel="noopener noreferrer"&gt;minexa.ai&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For more on what LLM-based extraction actually costs at volume, this breakdown is worth reading: &lt;a href="https://www.minexa.ai/post/why-using-ai-to-collect-web-data-at-scale-costs-more-than-you-think" rel="noopener noreferrer"&gt;Why using AI to collect web data at scale costs more than you think&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scraping real estate data: what actually works and where most pipelines break</title>
      <dc:creator>Minexa.ai</dc:creator>
      <pubDate>Fri, 26 Jun 2026 17:31:00 +0000</pubDate>
      <link>https://dev.to/minexa_ai/scraping-real-estate-data-what-actually-works-and-where-most-pipelines-break-1h01</link>
      <guid>https://dev.to/minexa_ai/scraping-real-estate-data-what-actually-works-and-where-most-pipelines-break-1h01</guid>
      <description>&lt;p&gt;Real estate data is publicly visible on dozens of platforms. Prices, listing dates, square footage, agent details, location components — all sitting right there on the page. Yet collecting it reliably at any meaningful scale is genuinely difficult, and most pipelines that start clean eventually break.&lt;/p&gt;

&lt;p&gt;Here is what actually goes wrong, and what a working setup looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real problems with real estate scraping
&lt;/h2&gt;

&lt;p&gt;Real estate sites are among the most scraper-hostile categories on the web. Heavy JavaScript rendering, aggressive anti-bot layers, geo-targeted content, and frequent layout changes all compound into maintenance work that never really stops.&lt;/p&gt;

&lt;p&gt;Beyond the infrastructure problems, there is a subtler issue: &lt;strong&gt;field consistency&lt;/strong&gt;. A property listing page often contains multiple address components, multiple price figures (asking price, last sale price, estimated value), and multiple date fields. When you build a selector-based scraper, you map each field manually. When a site updates its layout, those mappings silently break and return wrong values or nothing at all.&lt;/p&gt;

&lt;p&gt;LLM-based extraction sounds like a fix but introduces a different problem. A language model parsing a listing page may swap asking price and last sale price because both look like prices and context is ambiguous. It may merge street address, city, and postcode into a single string instead of preserving them as separate fields. These errors do not always produce an exception. They produce plausible-looking data that requires validation downstream.&lt;/p&gt;

&lt;p&gt;At a few hundred pages that validation is manageable. At tens of thousands of pages per month, it becomes a significant and ongoing cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  What deterministic extraction changes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.minexa.ai" rel="noopener noreferrer"&gt;Minexa.ai&lt;/a&gt; is a web extraction platform that takes a different approach. Instead of writing selectors or passing HTML to a language model, you train a scraper once using a browser extension, and Minexa binds each data field to its exact DOM position. That binding is stable across all structurally similar pages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpotth4rcl9oat5taidy9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpotth4rcl9oat5taidy9.png" alt="Minexa API request structure for developers integrating extraction pipelines" width="690" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For real estate specifically, this matters because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Asking price and last sale price always come from their respective DOM elements, not inferred from context&lt;/li&gt;
&lt;li&gt;Address components are extracted as separate columns, not merged&lt;/li&gt;
&lt;li&gt;Missing values return &lt;code&gt;null&lt;/code&gt;, never a fabricated fallback&lt;/li&gt;
&lt;li&gt;The same scraper runs identically on page 1 and page 50,000&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh" rel="noopener noreferrer"&gt;Install the Minexa Chrome extension&lt;/a&gt; and train your first real estate scraper in under five minutes.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How the workflow looks in practice
&lt;/h2&gt;

&lt;p&gt;You open a listing page in Chrome, select the HTML container holding the data block you want, and click 'Create Scraper'. Minexa analyzes the structure and automatically discovers all data points within that container. No schema definition required. The whole process takes two to five minutes.&lt;/p&gt;

&lt;p&gt;Once the scraper exists, you get a &lt;code&gt;scraper_id&lt;/code&gt;. From that point, extraction runs through the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.minexa.ai/data/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6241&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;columns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_30&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example-realestate.com/listing/101&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example-realestate.com/listing/102&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraping&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;js_render&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proxy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verified&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threads&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;columns: ["top_30"]&lt;/code&gt; parameter tells Minexa to return the thirty highest-ranked data points from the scraper. You can also pass explicit column names once you know which fields you need. Both approaches cost the same and return the same underlying data.&lt;/p&gt;
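&lt;p&gt;As a rough sketch of switching between the two column modes, the helper below builds the request body in the same shape as the example above. The function name &lt;code&gt;build_request&lt;/code&gt; and the column names passed to it are hypothetical; use the names your trained scraper actually reports.&lt;/p&gt;

```python
def build_request(scraper_id, urls, columns=None, threads=5):
    """Build a request body in the shape used in the example above."""
    return {
        "batches": [{
            "scraper_id": scraper_id,
            # Fall back to the ranked selection when no explicit names are given.
            "columns": list(columns) if columns else ["top_30"],
            "urls": list(urls),
            "scraping": {"js_render": True, "proxy": "verified", "retry": 3},
        }],
        "threads": threads,
    }


# Hypothetical column names, for illustration only.
body = build_request(
    6241,
    ["https://example-realestate.com/listing/101"],
    columns=["asking_price", "last_sale_price", "postcode"],
)
print(body["batches"][0]["columns"])
```

&lt;p&gt;Keeping the body in one function means a retrained scraper only changes the &lt;code&gt;scraper_id&lt;/code&gt; argument.&lt;/p&gt;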

&lt;h2&gt;
  
  
  When the site changes layout
&lt;/h2&gt;

&lt;p&gt;Minexa fails loudly, not silently. If a listing site updates its page structure and the trained scraper no longer matches, affected fields return &lt;code&gt;null&lt;/code&gt; or an explicit error rather than a wrong value. You open the updated page in the extension, select the new container, and create a new scraper. Same two-to-five minute process. The only required code change is updating &lt;code&gt;scraper_id&lt;/code&gt; in your request body.&lt;/p&gt;

&lt;p&gt;This is a meaningful operational difference from selector-based scrapers, which can quietly match the wrong element for days before anyone notices the data is wrong.&lt;/p&gt;
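&lt;p&gt;One way to act on those &lt;code&gt;null&lt;/code&gt; values is to track the null rate per field and flag a scraper for retraining when it jumps. The sketch below assumes the extracted rows have already been parsed into a list of dictionaries, which is an assumption about how you store the results rather than a documented response format. The field names and the 0.5 threshold are illustrative.&lt;/p&gt;

```python
def null_rates(rows):
    """Return the share of rows in which each field is None."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            seen, missing = counts.get(field, (0, 0))
            counts[field] = (seen + 1, missing + (value is None))
    return {field: missing / seen for field, (seen, missing) in counts.items()}


def fields_to_retrain(rows, threshold=0.5):
    """List fields whose null rate meets or exceeds the threshold."""
    return sorted(f for f, rate in null_rates(rows).items() if rate >= threshold)


# Hypothetical rows: asking_price is null in two of three listings.
rows = [
    {"asking_price": None, "postcode": "AB1 2CD"},
    {"asking_price": None, "postcode": None},
    {"asking_price": "450000", "postcode": "EF3 4GH"},
]
print(fields_to_retrain(rows))
```

&lt;p&gt;A check like this runs after every batch and costs almost nothing, since it only counts values that are already in memory.&lt;/p&gt;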

&lt;h2&gt;
  
  
  A note on scale and cost
&lt;/h2&gt;

&lt;p&gt;For real estate pipelines running tens of thousands of pages per month, the cost structure of LLM-based extraction becomes a real constraint. A full HTML property page can run to hundreds of thousands of tokens. At that size, even the cheapest available models cost significantly more per page than a flat-rate extraction platform. Minexa's pricing is per page, not per token, so page size does not affect cost.&lt;/p&gt;

&lt;p&gt;For teams already collecting real estate data and hitting reliability or cost walls, the &lt;a href="https://minexa.stoplight.io/docs/minexa/" rel="noopener noreferrer"&gt;Minexa API docs&lt;/a&gt; cover the full request structure, scraping configuration options, and how to handle dynamic content.&lt;/p&gt;

&lt;p&gt;If you are building or maintaining a real estate data pipeline, the extraction layer is worth getting right once rather than patching repeatedly.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Minexa cost breakdown: when LLM-based extraction stops making financial sense</title>
      <dc:creator>Minexa.ai</dc:creator>
      <pubDate>Fri, 19 Jun 2026 09:25:58 +0000</pubDate>
      <link>https://dev.to/minexa_ai/minexa-cost-breakdown-when-llm-based-extraction-stops-making-financial-sense-1i6j</link>
      <guid>https://dev.to/minexa_ai/minexa-cost-breakdown-when-llm-based-extraction-stops-making-financial-sense-1i6j</guid>
      <description>&lt;p&gt;Most developers building data pipelines start with an LLM. It feels like the obvious move: pass the HTML, describe what you want, get structured JSON back. It works fine at low volume. Then the bills arrive.&lt;/p&gt;

&lt;p&gt;This article breaks down exactly where the math stops working, using real token counts from real pages across six content categories.&lt;/p&gt;

&lt;h2&gt;
  
  
  The token problem nobody talks about upfront
&lt;/h2&gt;

&lt;p&gt;When you pass a page to an LLM for extraction, you are paying for every token in that HTML. The average full HTML page, with whitespace cleaned but otherwise intact, runs around &lt;strong&gt;572,000 tokens&lt;/strong&gt;. That is not a worst-case number. That is the average across job listings, ecommerce pages, property results, review pages, search results, and hotel booking pages.&lt;/p&gt;

&lt;p&gt;You have two options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strip the HTML&lt;/strong&gt; down to DOM tags and text only, which brings the average to roughly 39,000 tokens. Cheaper, but you risk removing markup that contains the data you need, and it requires preprocessing logic you now have to maintain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pass full HTML&lt;/strong&gt; and pay for every token. Safe and low-maintenance, but the cost per page becomes significant fast.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no clean middle ground. A context cap saves money but silently truncates pages mid-content with no error signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the numbers actually look like
&lt;/h2&gt;

&lt;p&gt;Here is the cost to process &lt;strong&gt;120,000 pages per month&lt;/strong&gt; using stripped HTML (39k tokens/page):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5 nano&lt;/td&gt;
&lt;td&gt;$285&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;$773&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5 mini&lt;/td&gt;
&lt;td&gt;$1,410&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;$5,280&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$15,820&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Minexa Startup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$60&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

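&lt;p&gt;Switch to full HTML and the input token count grows by roughly 15x (about 572k versus 39k tokens per page), so every LLM figure above rises by more than an order of magnitude. GPT-5 nano goes from $285 to $3,480 (about 12x). Claude Sonnet 4.6 goes from $15,820 to $207,980 (about 13x). Minexa stays at $60 because its pricing is per page, not per token.&lt;/p&gt;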
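&lt;p&gt;The arithmetic behind these figures can be sketched in a few lines. The page count and token averages come from this article; the price of $0.05 per million input tokens is an assumption used for illustration, and output tokens and retries are left out.&lt;/p&gt;

```python
def input_cost_usd(pages, tokens_per_page, usd_per_million_tokens):
    """Input-token cost only; output tokens and retries are not included."""
    total_tokens = pages * tokens_per_page
    return total_tokens / 1_000_000 * usd_per_million_tokens


PAGES = 120_000
PRICE = 0.05  # assumed input price in USD per million tokens, for illustration

stripped = input_cost_usd(PAGES, 39_000, PRICE)   # stripped HTML average
full = input_cost_usd(PAGES, 572_000, PRICE)      # full HTML average
print(round(stripped), round(full), round(full / stripped, 1))
```

&lt;p&gt;Under these assumptions the input cost is about $234 for stripped HTML and about $3,432 for full HTML, a ratio of about 14.7. Output tokens add a roughly fixed amount per page in both cases, which would explain why total monthly figures grow by somewhat less than the input ratio.&lt;/p&gt;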

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vpgux4o6rznr4hnuh16.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vpgux4o6rznr4hnuh16.png" alt="Minexa vs LLM cost at scale" width="690" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The only range where LLMs are competitive
&lt;/h2&gt;

&lt;p&gt;At roughly 10,000 pages per month on stripped HTML, the cheapest nano-class models (GPT-5 nano, GPT-4.1 nano, Gemini Flash Lite) land between $24 and $43. Minexa Personal is $15 as a flat monthly floor, so even here Minexa is cheaper. But the gap is small enough that tooling familiarity might reasonably win the decision.&lt;/p&gt;

&lt;p&gt;Beyond that threshold, the gap widens fast and does not close.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh" rel="noopener noreferrer"&gt;Try Minexa free and train your first scraper in under 10 minutes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The costs that do not show up in token pricing
&lt;/h2&gt;

&lt;p&gt;Token cost is only part of the picture. LLM extraction pipelines carry indirect costs that compound at volume:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation overhead.&lt;/strong&gt; LLMs can return plausible but wrong values without any error signal. A job listing with salary, equity, and bonus in similar formats might come back with equity assigned to the salary field. At 100,000 pages, that translates to thousands of rows requiring downstream validation logic or manual review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retry logic.&lt;/strong&gt; Inconsistent JSON field naming across responses, occasional fabricated values, and schema drift all require retry and normalization code. That is engineering time spent on infrastructure rather than the actual product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt maintenance.&lt;/strong&gt; When a target site updates its layout, LLM prompts may need adjustment to keep extraction accurate. This is not always obvious until data quality degrades silently.&lt;/p&gt;

&lt;p&gt;Minexa's DOM-based extraction returns null when a field is absent and raises an explicit error when a page does not match the trained scraper. There is no silent failure mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three real scenarios
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ecommerce price monitoring (~80,000 pages/month).&lt;/strong&gt; Using GPT-5 nano on stripped HTML with a 20% retry overhead: approximately $230/month. After training a single Minexa scraper on the product page structure: $60/month on the Startup plan. Price fields are always pulled from the DOM element bound during training, and a layout change surfaces as an explicit error rather than a wrong price.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real estate listings (~200,000 pages/month, full HTML).&lt;/strong&gt; GPT-5 mini on full HTML: over $29,000/month. Minexa Business plan: $500/month. The LLM had been occasionally swapping asking price and last sale price; Minexa eliminated the issue by binding each column to its specific DOM element.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lead generation from directories (~50,000 pages/month).&lt;/strong&gt; Mistral Small 2 on stripped HTML: approximately $485/month. Minexa Startup: $60/month. Inconsistent JSON field naming across LLM responses required a normalization step that was eliminated entirely after switching.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8kaqr73yp2406rtkbrg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F8kaqr73yp2406rtkbrg1.png" alt="Real-world cost savings switching from LLM extraction to Minexa" width="690" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Minexa's pricing works
&lt;/h2&gt;

&lt;p&gt;Minexa charges per page extracted, not per token. Page size does not affect credit consumption. A 600,000-token page costs the same number of credits as a 20,000-token page.&lt;/p&gt;

&lt;p&gt;The three plans:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personal&lt;/strong&gt; ($15/month): 10,000 credits, 3 threads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Startup&lt;/strong&gt; ($60/month): 120,000 credits, 10 threads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business&lt;/strong&gt; ($500/month): 2,000,000 credits, 100 threads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: credit consumption can be higher for pages requiring JavaScript rendering or aggressive anti-bot handling. The baseline figures above apply to standard pages.&lt;/p&gt;
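&lt;p&gt;To see which plan a given volume lands on, the plan table above can be turned into a small lookup. The &lt;code&gt;credits_per_page&lt;/code&gt; multiplier is a hypothetical knob for pages that consume more than one credit; the actual rate is not stated here, so treat any value other than 1 as a guess to be replaced with observed usage.&lt;/p&gt;

```python
PLANS = [  # (name, usd_per_month, credits) as listed above, ordered by price
    ("Personal", 15, 10_000),
    ("Startup", 60, 120_000),
    ("Business", 500, 2_000_000),
]


def cheapest_plan(pages_per_month, credits_per_page=1.0):
    """Return (name, price) of the first plan that covers the volume, or None."""
    needed = pages_per_month * credits_per_page
    for name, price, credits in PLANS:
        if credits >= needed:
            return name, price
    return None  # volume exceeds the largest listed plan


print(cheapest_plan(80_000))
print(cheapest_plan(80_000, credits_per_page=2.0))
```

&lt;p&gt;With one credit per page, 80,000 pages fits the Startup plan. If the same pages averaged two credits each, the 160,000 credits needed would exceed Startup and push the workload to Business.&lt;/p&gt;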

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;Training a scraper takes 2 to 5 minutes via the &lt;a href="https://chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh" rel="noopener noreferrer"&gt;Minexa Chrome extension&lt;/a&gt;. You hover over the HTML container holding your target data, confirm the selection, and Minexa generates a reusable scraper with a stable &lt;code&gt;scraper_id&lt;/code&gt;. From there, extraction runs through the API with a straightforward POST request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"batches"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6241&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"columns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"top_30"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"https://example.com/listing/1"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scraping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"js_render"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"proxy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"verified"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"threads"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extension generates this code for you after training. You update the URLs and run it.&lt;/p&gt;

&lt;p&gt;If you are currently running an LLM extraction pipeline above 10,000 pages per month, the &lt;a href="https://minexa.stoplight.io/docs/minexa/" rel="noopener noreferrer"&gt;full API documentation&lt;/a&gt; has everything needed to evaluate a migration.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scraping app store listings at scale with the Minexa API</title>
      <dc:creator>Minexa.ai</dc:creator>
      <pubDate>Fri, 19 Jun 2026 09:17:44 +0000</pubDate>
      <link>https://dev.to/minexa_ai/scraping-app-store-listings-at-scale-with-the-minexa-api-1hd3</link>
      <guid>https://dev.to/minexa_ai/scraping-app-store-listings-at-scale-with-the-minexa-api-1hd3</guid>
      <description>&lt;p&gt;App store pages are structured consistently. Every listing has a title, a rating, a review count, a description, a category, a developer name, and often a price or in-app purchase flag. For app analytics platforms, that consistency is exactly what makes them worth scraping at scale.&lt;/p&gt;

&lt;p&gt;The challenge is volume. Tracking thousands of apps across multiple store pages, refreshing data regularly, and keeping fields aligned across runs is not something you want to rebuild every time a page shifts. This is where the &lt;a href="https://minexa.stoplight.io/docs/minexa/" rel="noopener noreferrer"&gt;Minexa API&lt;/a&gt; fits in.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the workflow starts: train once in the extension
&lt;/h2&gt;

&lt;p&gt;Before you write a single API call, you train a scraper using the Minexa Chrome extension. Browse to an app listing detail page, let Minexa detect the structure automatically, confirm the fields, and save the scraper. That scraper gets a stable &lt;code&gt;scraper_id&lt;/code&gt; you will reuse in every API call going forward.&lt;/p&gt;

&lt;p&gt;This is the core developer pattern with Minexa: visual setup in the browser, programmatic execution via API.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffzo20asq6mv4lyhv150n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ffzo20asq6mv4lyhv150n.png" alt="Minexa developer workflow" width="690" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once trained, the scraper works on any app listing page that shares the same structure. You do not retrain for every app. One scraper, thousands of pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making the API call
&lt;/h2&gt;

&lt;p&gt;The endpoint for extracting data is &lt;code&gt;https://api.minexa.ai/data&lt;/code&gt;. You send a POST request with your &lt;code&gt;scraper_id&lt;/code&gt;, the list of URLs you want to process, and the columns you want back.&lt;/p&gt;

&lt;p&gt;Here is a Python example for detail-mode extraction across a batch of app listing pages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_api_key_here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;app_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://appstorehub.com/app/4821&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://appstorehub.com/app/4822&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://appstorehub.com/app/4823&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6374&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;app_urls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;columns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;developer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;review_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.minexa.ai/data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are not sure which column names are available, you can use &lt;code&gt;"columns": ["top_40"]&lt;/code&gt; instead of listing them manually. Minexa will return the top 40 ranked data points it found on the page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling paginated responses
&lt;/h2&gt;

&lt;p&gt;When you submit a large batch of URLs, the API returns results in pages. A response that has more results behind it includes a &lt;code&gt;next_token&lt;/code&gt; field; send that token with the next request to fetch the following page, and stop when the field is absent. Here is a checkpoint-based loop to collect everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;all_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;next_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next_token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;next_token&lt;/span&gt;

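    &lt;span class="n"&gt;http_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.minexa.ai/data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# stop on HTTP errors rather than checkpointing an error body&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;http_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;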

    &lt;span class="n"&gt;all_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
    &lt;span class="n"&gt;next_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpoint.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;next_token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total records collected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Saving a checkpoint after each iteration means a network interruption does not cost you the work already done.&lt;/p&gt;
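&lt;p&gt;To resume a run rather than just keep its rows, the checkpoint also needs the pagination token. The sketch below is one way to do that, assuming a local JSON file; the helper names &lt;code&gt;save_checkpoint&lt;/code&gt; and &lt;code&gt;load_checkpoint&lt;/code&gt; and the file layout are illustrative and not part of the Minexa API.&lt;/p&gt;

```python
import json
import os
import tempfile


def save_checkpoint(path, results, next_token):
    """Write rows and the pagination token atomically (hypothetical file layout)."""
    state = {"results": results, "next_token": next_token}
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file first so a crash mid-write cannot corrupt the checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (results, next_token); start fresh when no checkpoint exists."""
    if not os.path.exists(path):
        return [], None
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return state.get("results", []), state.get("next_token")
```

&lt;p&gt;On restart, you would call &lt;code&gt;load_checkpoint&lt;/code&gt; before the loop and send the restored token with the first request. Whether a stored token is still accepted after a long pause depends on the API, so check the docs before relying on it.&lt;/p&gt;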

&lt;h2&gt;
  
  
  Credit consumption on app store pages
&lt;/h2&gt;

&lt;p&gt;App store listing pages often load ratings, reviews, and media assets dynamically via JavaScript. Pages with heavy dynamic content or anti-bot protection may consume more than one credit per page. Plan your batch sizes accordingly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgg5t1d3p45t1jr3v5h8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgg5t1d3p45t1jr3v5h8.png" alt="API credit consumption guide" width="690" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just use an LLM for this?
&lt;/h2&gt;

&lt;p&gt;App listing pages contain multiple similar numeric fields: aggregate rating, rating count, number of reviews per version, and sometimes a separate score for the current version. An LLM reading the raw HTML has to decide which number maps to which field. It does not always get this right, and it does not always signal when it is uncertain.&lt;/p&gt;

&lt;p&gt;Minexa binds each column to a specific position in the DOM. The same field returns the same value every run, regardless of what else is on the page. If a value is missing, the output is null, not a guess.&lt;/p&gt;

&lt;p&gt;For an analytics platform ingesting data from thousands of apps on a recurring basis, that determinism matters more than flexibility.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Get started:&lt;/strong&gt; &lt;a href="https://minexa.stoplight.io/docs/minexa/" rel="noopener noreferrer"&gt;Read the full API docs&lt;/a&gt; and train your first app listing scraper in the extension before writing any code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Scheduling your runs
&lt;/h2&gt;

&lt;p&gt;The API itself does not manage scheduling. If you need to refresh app data daily or weekly, set up a cron job on your end and pass the updated URL list to the API on each run. This gives you full control over timing, batching, and retry logic within your own infrastructure.&lt;/p&gt;
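&lt;p&gt;A rough sketch of the script a cron job might call on each run. It only prepares request bodies from a refreshed URL list; sending them works the same way as in the pagination loop. The field names follow the batch request format Minexa documents, while the scraper ID, batch size, and URLs are placeholders.&lt;/p&gt;

```python
def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_payloads(urls, scraper_id, batch_size=50):
    """Build one request body per batch of URLs (batch size is an assumption)."""
    return [
        {"batches": [{"scraper_id": scraper_id, "columns": ["top_30"], "urls": batch}]}
        for batch in chunk(urls, batch_size)
    ]


if __name__ == "__main__":
    # Placeholder URLs; a real run would read the refreshed list from a file or database.
    urls = [f"https://example.com/app/{n}" for n in range(120)]
    payloads = build_payloads(urls, scraper_id=4712, batch_size=50)
    print(len(payloads), "request bodies ready to send")
```

&lt;p&gt;A crontab entry such as &lt;code&gt;0 6 * * 1 python3 refresh_apps.py&lt;/code&gt; would run it every Monday at 06:00; the script name is hypothetical.&lt;/p&gt;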

&lt;p&gt;For smaller, fixed lists of app URLs, the Chrome extension's built-in scheduling is simpler to configure. For larger or dynamic URL sets, the cron-plus-API approach scales better.&lt;/p&gt;

&lt;h2&gt;
  
  
  What app analytics platforms actually get out of this
&lt;/h2&gt;

&lt;p&gt;With a trained scraper and a few dozen lines of Python, an analytics platform can maintain a structured, refreshable dataset of app listings covering name, developer, category, rating, review volume, pricing model, and description text. That data feeds directly into trend analysis, competitive benchmarking, category ranking models, and review sentiment pipelines.&lt;/p&gt;

&lt;p&gt;The extraction setup is done once. After that, running it on a new batch of URLs needs no new extraction work: you pass the URLs to the same trained scraper.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://minexa.stoplight.io/docs/minexa/" rel="noopener noreferrer"&gt;Explore Minexa API docs&lt;/a&gt; to see the full request schema and response format before you build.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scraping environmental data from OpenEI with Minexa.ai</title>
      <dc:creator>Minexa.ai</dc:creator>
      <pubDate>Fri, 19 Jun 2026 09:09:56 +0000</pubDate>
      <link>https://dev.to/minexa_ai/scraping-environmental-data-from-openei-with-minexaai-2gl6</link>
      <guid>https://dev.to/minexa_ai/scraping-environmental-data-from-openei-with-minexaai-2gl6</guid>
      <description>&lt;p&gt;OpenEI (Open Energy Information) is a platform maintained by the U.S. Department of Energy that hosts a large catalog of energy-related datasets. The search page at &lt;code&gt;data.openei.org/search&lt;/code&gt; lists datasets across topics like solar resources, utility rates, building energy use, and grid data. Each listing includes a title, organization, tags, license type, and a link to the full dataset record.&lt;/p&gt;

&lt;p&gt;If you need to collect this data at scale — to build a dataset index, track what gets published over time, or feed an internal research tool — copying it manually is not realistic. This is where the &lt;a href="https://chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh" rel="noopener noreferrer"&gt;Minexa.ai Chrome extension&lt;/a&gt; comes in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Watch the full walkthrough first
&lt;/h2&gt;

&lt;p&gt;Before going through the screenshots below, the video tutorial covers the entire process end to end. It is the fastest way to understand how the extraction works on OpenEI.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://www.youtube.com/watch?v=Hpixz0mpI9I" rel="noopener noreferrer"&gt;Watch full video demo&lt;/a&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  How the extraction works, stage by stage
&lt;/h2&gt;

&lt;p&gt;Rather than listing steps in isolation, here is what each stage of the process actually does and what you see on screen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Starting point: the Minexa extension&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the extension is installed, opening it from any page brings up the Minexa home screen. This is where all your scrapers and jobs are managed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fkm1nhfb26dfr226pke6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fkm1nhfb26dfr226pke6a.png" alt="Minexa home page" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The extension works directly in your browser — no separate app, no dashboard to log into from another tab.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Navigating to the target page&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Browse to &lt;code&gt;data.openei.org/search&lt;/code&gt;. This is the dataset search listing page. You can see all the dataset cards that Minexa will detect and extract from.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fzsbuzsiemoh4yaojhgeu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fzsbuzsiemoh4yaojhgeu.png" alt="OpenEI search page loaded" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Minexa works on the page currently open in your browser, so there is no URL to paste into a separate interface. You are already on the right page.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Confirming the page&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After opening the extension popup, you click 'I'm on the right page'. This tells Minexa to begin analyzing the current page structure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fk6hs5v20xrrigrm5ght3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fk6hs5v20xrrigrm5ght3.png" alt="Extension popup with confirmation button" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this point, Minexa takes over the detection process automatically.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Pagination detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Minexa scans the page and identifies how it paginates. For OpenEI, it detects the next-page mechanism and shows you the pagination it found. You review it and click Continue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fu8iaowxpiujk3d90t42d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fu8iaowxpiujk3d90t42d.png" alt="Pagination detected" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You do not configure this manually. Minexa reads the page structure and figures out the pagination pattern on its own.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Choosing your scraping depth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After pagination is confirmed, Minexa asks whether you want to scrape just the list page or also follow each dataset link and extract detail page data. For most research use cases, list-only is sufficient. For deeper extraction, the detail mode pulls additional fields from each individual dataset record page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F6cn68mpzcrk994ky9gqm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F6cn68mpzcrk994ky9gqm.png" alt="List or detail scraping option" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This two-layer extraction capability means a single job can produce both the summary data from the list and the full metadata from each dataset page.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Simple or advanced mode&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before the job starts, you choose between simple mode (Minexa picks the most relevant fields automatically) and advanced mode (you can review and adjust the field selection). For most users, simple mode produces a clean, complete output without any additional configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fnw55ezypf9wqsdb9yrgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fnw55ezypf9wqsdb9yrgr.png" alt="Simple or advanced scraping options" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Container detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Minexa automatically highlights the repeating container on the page — the element that wraps each dataset listing. This is the structural anchor it uses to identify where one result ends and the next begins.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F5kf45vz98mh744nibiod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F5kf45vz98mh744nibiod.png" alt="Container highlighted automatically" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Field discovery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After detecting the container, Minexa surfaces all the data points it found within each listing. These appear as labeled columns — title, organization, tags, license, and more. You do not need to specify these upfront.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxlicd7pnly0btuyy5bg7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fxlicd7pnly0btuyy5bg7.png" alt="All data points extracted" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is one of the more useful aspects of the tool: if you are not sure what fields are available on a page, Minexa shows you rather than asking you to define them first.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;API and code samples&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the configuration stage, Minexa also surfaces ready-to-use code samples in JSON and Python, along with an API request view. This is useful if you want to integrate the scraper into an existing pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fs7u8wom22mnz4bls37l8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fs7u8wom22mnz4bls37l8.png" alt="Code samples and API request" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Job summary with scheduling and Google Sheets options&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before running, you see a summary screen. From here you can connect a Google Sheet for live output or set up a recurring schedule so the job runs automatically without manual triggering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Frvu1hxkhzrkibt8q0v5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Frvu1hxkhzrkibt8q0v5n.png" alt="Job summary screen" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scheduling is particularly relevant for OpenEI since new datasets are added regularly. A weekly scheduled run keeps your local dataset index current without any manual work.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Running the job&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The scraper appears in your jobs list with a Run button. Once triggered, extraction begins across all detected pages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fswhi36lvbrp455y3wglw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fswhi36lvbrp455y3wglw.png" alt="Jobs list with run button" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Results during and after the run&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As the job runs, data populates in a table view in real time. Once complete, you can export to Excel or JSON.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Flpj2vz9n7s0zvdakyl65.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Flpj2vz9n7s0zvdakyl65.png" alt="Scraped data table after job finishes" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What the extracted data looks like
&lt;/h2&gt;

&lt;p&gt;Here is a sample of what the JSON output contains after a completed run on the OpenEI search page:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"U.S. Solar Resource Data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"organization"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"National Renewable Energy Laboratory"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"solar, irradiance, GHI, DNI"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"license"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Public Domain"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Utility Rate Database"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"organization"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NREL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"electricity rates, tariffs, utilities"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"license"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Creative Commons"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each object corresponds to one dataset listing, and the field names stay the same across all pages.&lt;/p&gt;




&lt;h2&gt;
  
  
  Working with the exported data in Python
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;openei_datasets.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;organization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a quick way to scan titles and organizations. From here you can load the same list into a pandas DataFrame for further filtering and analysis.&lt;/p&gt;
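&lt;p&gt;Before reaching for pandas, the standard library is enough for a first pass. This sketch uses the two sample rows from the JSON output above, splits the comma-separated &lt;code&gt;tags&lt;/code&gt; string into a list, and counts licenses and tags. The helper name &lt;code&gt;split_tags&lt;/code&gt; is illustrative.&lt;/p&gt;

```python
from collections import Counter

# Sample rows copied from the JSON output shown above.
datasets = [
    {"title": "U.S. Solar Resource Data", "organization": "National Renewable Energy Laboratory",
     "tags": "solar, irradiance, GHI, DNI", "license": "Public Domain"},
    {"title": "Utility Rate Database", "organization": "NREL",
     "tags": "electricity rates, tariffs, utilities", "license": "Creative Commons"},
]


def split_tags(row):
    """Turn the comma-separated tags string into a list of trimmed tags."""
    raw = row.get("tags") or ""
    return [tag.strip() for tag in raw.split(",") if tag.strip()]


# Count how often each license and each individual tag appears.
license_counts = Counter(row.get("license") for row in datasets)
tag_counts = Counter(tag for row in datasets for tag in split_tags(row))

print(license_counts.most_common())
print(tag_counts.most_common(3))
```

&lt;p&gt;In a real run you would replace the inline list with the result of &lt;code&gt;json.load&lt;/code&gt; on the exported file.&lt;/p&gt;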




&lt;p&gt;The scraper configuration is saved after the first run. The next time you trigger it, Minexa skips the detection phase entirely and goes straight to extraction. If you want to get started, the extension is available at &lt;a href="https://www.minexa.ai" rel="noopener noreferrer"&gt;minexa.ai&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What actually happens when Minexa extracts data from a page</title>
      <dc:creator>Minexa.ai</dc:creator>
      <pubDate>Tue, 16 Jun 2026 10:21:37 +0000</pubDate>
      <link>https://dev.to/minexa_ai/what-actually-happens-when-minexa-extracts-data-from-a-page-17o6</link>
      <guid>https://dev.to/minexa_ai/what-actually-happens-when-minexa-extracts-data-from-a-page-17o6</guid>
      <description>&lt;p&gt;If you have looked at Minexa before and wondered what is actually going on under the hood when it extracts data, this article walks through the internal mechanics in plain terms. Not the marketing pitch. The actual behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does 'container locking' mean?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before collecting any values, Minexa identifies and locks onto the exact section of a page that holds the target data. This prevents it from accidentally pulling values from visually similar but unrelated sections, like a sidebar showing related products with their own prices, or a footer with duplicate navigation text.&lt;/p&gt;

&lt;p&gt;The user selects a parent HTML element in the browser extension, not individual fields. Minexa then works inside that container, discovering columns within it rather than scanning the whole page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does it pick which selector to use for each field?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each discovered data point, Minexa evaluates multiple candidate selectors and ranks them by structural stability and content regularity across pages. The final selector is the one that consistently targets the correct element even when minor layout differences exist between pages on the same site.&lt;/p&gt;

&lt;p&gt;This is why a scraper trained on one product page works correctly across thousands of structurally similar product pages without modification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the top_n columns parameter?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you make an API request, you do not have to know the column names upfront. You can pass &lt;code&gt;"top_30"&lt;/code&gt;, &lt;code&gt;"top_10"&lt;/code&gt;, or another &lt;code&gt;top_n&lt;/code&gt; value, and Minexa returns that many columns, ranked by its relevance algorithm.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"batches"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4712&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"columns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"top_30"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"https://example.com/product/99"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scraping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"js_render"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"proxy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"verified"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"threads"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ranking is deterministic, so &lt;code&gt;"top_30"&lt;/code&gt; always maps to the same set of columns in the same order. Once you know which columns matter, you can switch to named fields like &lt;code&gt;["price", "availability", "brand"]&lt;/code&gt; with no other changes needed.&lt;/p&gt;
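&lt;p&gt;As a small illustration of that switch, the sketch below copies the request body from the example above and replaces only the &lt;code&gt;columns&lt;/code&gt; list. The helper &lt;code&gt;with_columns&lt;/code&gt; is a made-up convenience function, not part of any Minexa client.&lt;/p&gt;

```python
def with_columns(payload, columns):
    """Return a copy of the request body with every batch's columns replaced."""
    updated = dict(payload)
    updated["batches"] = [dict(batch, columns=list(columns)) for batch in payload["batches"]]
    return updated


# Exploratory request: same shape as the JSON example above.
exploratory = {
    "batches": [{
        "scraper_id": 4712,
        "columns": ["top_30"],
        "urls": ["https://example.com/product/99"],
        "scraping": {"js_render": True, "proxy": "verified"},
    }],
    "threads": 5,
}

# Production request: identical except for the named columns.
production = with_columns(exploratory, ["price", "availability", "brand"])
print(production["batches"][0]["columns"])
```

&lt;p&gt;Because the helper builds new dictionaries, the exploratory request stays untouched and can still be used to inspect the ranked columns later.&lt;/p&gt;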

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5bgb2cbpqyfa9uxzw88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5bgb2cbpqyfa9uxzw88.png" alt="Minexa columns parameter: top_n vs named fields explained" width="690" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does nested data look like in the output?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When extracted content maps to multiple elements, Minexa returns a list of objects instead of a flat string. Each object includes a tag, type, attribute, and value field.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"study_locations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tag"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"span"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Austin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"attribute"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tag"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"span"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Denver"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"attribute"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In most cases you only need the value. In Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;study_locations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For deeply structured content like article body text, the tag and attribute metadata let you filter and reconstruct the original content precisely. This does require some extra handling compared to flat columns, which is a real tradeoff worth knowing about before you start.&lt;/p&gt;
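
&lt;p&gt;As a rough illustration of that extra handling, the sketch below filters nested items by tag and type before taking their values. The third item (a link with an href attribute) and the helper name select_values are hypothetical, added only to show why filtering can matter; the real output shape may differ.&lt;/p&gt;

```python
# Hypothetical row shaped like the nested output shown above.
# The third item is invented for illustration only.
row = {
    "study_locations": [
        {"tag": "span", "type": "text", "value": "Austin", "attribute": None},
        {"tag": "span", "type": "text", "value": "Denver", "attribute": None},
        {"tag": "a", "type": "link", "value": "https://example.com/sites", "attribute": "href"},
    ]
}


def select_values(items, tag=None, item_type=None):
    """Return the values of nested items matching the given tag and/or type."""
    selected = []
    for item in items or []:  # tolerate a null field
        if tag is not None and item.get("tag") != tag:
            continue
        if item_type is not None and item.get("type") != item_type:
            continue
        selected.append(item.get("value"))
    return selected


text_values = select_values(row["study_locations"], tag="span", item_type="text")
print(text_values)
```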

&lt;p&gt;&lt;strong&gt;Is extraction deterministic?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Running the same scraper on the same page always produces identical JSON output as long as the underlying HTML has not changed. This differs from LLM-based extraction where outputs can vary between runs due to temperature settings, prompt drift, or model updates.&lt;/p&gt;

&lt;p&gt;For testing and validation pipelines, this matters a lot. You can run the same page twice and diff the outputs with confidence.&lt;/p&gt;
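
&lt;p&gt;One way to do that diff, sketched with placeholder data rather than real Minexa output: serialize each run to canonical JSON (sorted keys) and compare hashes. The rows and the helper name fingerprint are assumptions for illustration.&lt;/p&gt;

```python
import hashlib
import json


def fingerprint(rows):
    """Hash a canonical JSON encoding so key order does not affect the result."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder outputs standing in for two runs of the same scraper on the same page.
run_1 = [{"title": "Example", "price": "19.99"}]
run_2 = [{"price": "19.99", "title": "Example"}]

same = fingerprint(run_1) == fingerprint(run_2)
print("identical" if same else "outputs differ")
```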

&lt;p&gt;&lt;strong&gt;What happens when something goes wrong?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Minexa is designed to fail loudly. If a page structure changes and the scraper no longer matches the HTML, affected fields return null or the scraper raises an explicit error. It never silently returns a wrong value.&lt;/p&gt;

&lt;p&gt;If you accidentally submit a URL with the wrong scraper_id (for example, a category page when the scraper was trained on a detail page), Minexa returns an error flagging the mismatch rather than attempting extraction on mismatched content.&lt;/p&gt;

&lt;p&gt;This is meaningfully different from selector-based scrapers that can quietly match the wrong element, and from LLM pipelines that may return a plausible-looking but fabricated value with no error signal at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if the site redesigns?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a site changes its layout substantially, the scraper will start returning errors or null values. That is the signal to retrain. You open an affected page in the extension, select the updated container, and create a new scraper. It takes the same 2 to 5 minutes as the original setup.&lt;/p&gt;
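
&lt;p&gt;A minimal sketch of how you might watch for that signal on your side: compute the share of null values per column in a batch and flag columns that cross a threshold. The 0.5 threshold, the column names, and the sample rows are assumptions, not Minexa defaults.&lt;/p&gt;

```python
def null_rates(rows, columns):
    """Return the fraction of rows where each column is missing or null."""
    total = len(rows)
    rates = {}
    for column in columns:
        nulls = sum(1 for row in rows if row.get(column) is None)
        rates[column] = nulls / total if total else 0.0
    return rates


def columns_to_review(rows, columns, threshold=0.5):
    """List columns whose null rate meets or exceeds the (assumed) threshold."""
    rates = null_rates(rows, columns)
    return [column for column in columns if rates[column] >= threshold]


# Hypothetical batch of extracted rows.
rows = [
    {"title": "A", "price": None},
    {"title": "B", "price": None},
    {"title": "C", "price": "12.00"},
    {"title": None, "price": None},
]
flagged = columns_to_review(rows, ["title", "price"])
print(flagged)
```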

&lt;p&gt;Retraining creates a new scraper with a new scraper_id. Column names may also change since the scraper is generated fresh. The required code updates are therefore small: swap the scraper_id in your request body and verify the column names you rely on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can you skip live crawling if you already have the HTML?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. If you already have the pages saved as HTML files and served from a public URL (behind AWS CloudFront, for example), you can pass them via &lt;code&gt;file_urls&lt;/code&gt; and set &lt;code&gt;js_render&lt;/code&gt; to false. This is the cheapest scraping configuration since no live fetching or rendering is needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"js_render"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"proxy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"verified"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file_urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"https://9343.cloudfront.net/html-page-1.html"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"https://original-site.com/page-1"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;urls&lt;/code&gt; field here holds the original source URLs so extracted data can be mapped back to the real page it came from. The two arrays are 1-to-1 by index.&lt;/p&gt;
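
&lt;p&gt;Because the pairing is by index, it can help to keep each stored file and its source URL together until the request is built. The sketch below mirrors only the keys shown in the fragment above; the second pair and the helper name build_file_request are hypothetical.&lt;/p&gt;

```python
import json


def build_file_request(pairs, js_render=False, proxy="verified"):
    """Build a request fragment from (file_url, original_url) pairs.

    Splitting the pairs here keeps the two arrays aligned by index.
    """
    file_urls = [file_url for file_url, _ in pairs]
    urls = [original_url for _, original_url in pairs]
    return {
        "scraping": {"js_render": js_render, "proxy": proxy},
        "file_urls": file_urls,
        "urls": urls,
    }


# The first pair comes from the example above; the second is a hypothetical addition.
pairs = [
    ("https://9343.cloudfront.net/html-page-1.html", "https://original-site.com/page-1"),
    ("https://9343.cloudfront.net/html-page-2.html", "https://original-site.com/page-2"),
]
body = build_file_request(pairs)
print(json.dumps(body, indent=2))
```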

&lt;p&gt;&lt;strong&gt;Where to go from here&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh" rel="noopener noreferrer"&gt;Minexa Chrome extension&lt;/a&gt; is the fastest way to train your first scraper and get the pre-generated Python code. The extension also has a drop-down of ready-made scraping scenarios you can copy directly, which saves time compared to reading through the full API docs.&lt;/p&gt;

&lt;p&gt;Full API reference is at &lt;a href="https://minexa.stoplight.io/docs/minexa/" rel="noopener noreferrer"&gt;minexa.stoplight.io/docs/minexa&lt;/a&gt; if you want to go deeper on request parameters and scraping configurations.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Nested data, null fields, and the quiet failures nobody talks about in web scraping</title>
      <dc:creator>Minexa.ai</dc:creator>
      <pubDate>Sun, 14 Jun 2026 16:01:25 +0000</pubDate>
      <link>https://dev.to/minexa_ai/nested-data-null-fields-and-the-quiet-failures-nobody-talks-about-in-web-scraping-3jp3</link>
      <guid>https://dev.to/minexa_ai/nested-data-null-fields-and-the-quiet-failures-nobody-talks-about-in-web-scraping-3jp3</guid>
      <description>&lt;p&gt;Most scraping bugs are not crashes. They are wrong values that look right.&lt;/p&gt;

&lt;p&gt;A price field returns a number. The number is plausible. It passes your type check. It lands in your database. Three weeks later someone notices the sale price and the original price have been swapped on roughly 8% of records. No error was raised. No log entry flagged it. The pipeline just quietly extracted the wrong element.&lt;/p&gt;

&lt;p&gt;This is the failure mode that actually costs time, and it shows up in three distinct places.&lt;/p&gt;

&lt;h2&gt;
  
  
  The selector drift problem
&lt;/h2&gt;

&lt;p&gt;CSS selectors and XPath expressions are written against a snapshot of a page. When a site updates its layout, the selector either breaks visibly (returns nothing) or drifts silently (matches a different element that happens to exist at the same path). The second case is worse. A positional selector that used to land on the &lt;code&gt;.price-now&lt;/code&gt; element and lands on &lt;code&gt;.price-was&lt;/code&gt; after a redesign will not throw an exception. It will just return the wrong number, consistently, at scale.&lt;/p&gt;

&lt;p&gt;Traditional scraping gives you no structural guarantee. You write the selector, you hope the site does not change, and you build monitoring on top to catch drift after the fact.&lt;/p&gt;
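
&lt;p&gt;That after-the-fact monitoring often takes the form of cross-field sanity checks. As one hedged example, the sketch below flags rows where the sale price exceeds the original price, a common symptom of the swap described earlier. The field names sale_price and original_price are hypothetical.&lt;/p&gt;

```python
from decimal import Decimal


def swapped_price_rows(rows):
    """Return rows where the sale price exceeds the original price."""
    suspicious = []
    for row in rows:
        sale, original = row.get("sale_price"), row.get("original_price")
        if sale is None or original is None:
            continue  # presence is a separate check
        if Decimal(sale) > Decimal(original):
            suspicious.append(row)
    return suspicious


# Hypothetical extracted rows; the second one looks swapped.
rows = [
    {"sale_price": "19.99", "original_price": "29.99"},
    {"sale_price": "49.00", "original_price": "35.00"},
    {"sale_price": None, "original_price": "10.00"},
]
suspicious = swapped_price_rows(rows)
print(len(suspicious), "suspicious row(s)")
```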

&lt;h2&gt;
  
  
  The LLM ambiguity problem
&lt;/h2&gt;

&lt;p&gt;LLM-based extraction has a different failure signature. On pages with multiple visually similar fields, like a job listing with a salary range, an equity range, and a bonus figure, the model picks based on proximity and pattern rather than structural position. It is usually right. At 100,000 pages, 'usually right' still adds up: even a 2% misattribution rate would mean 2,000 incorrectly attributed rows, with no error signal attached.&lt;/p&gt;

&lt;p&gt;The hallucination that is hardest to catch is not fabrication. It is field swapping: the correct value extracted into the wrong column. Schema conformance failures are also common. If a price is unavailable, some models return 0 or a nearby value rather than null. Both pass downstream validation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The nested data problem
&lt;/h2&gt;

&lt;p&gt;This one is underappreciated. Many pages contain fields that are not flat strings. A clinical trials page might have four separate date fields rendered as sibling span elements. A property listing might have address components spread across multiple tags.&lt;/p&gt;

&lt;p&gt;When Minexa extracts nested content, it returns a list of objects rather than a collapsed string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"study_locations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tag"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"span"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Akishima"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tag"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"span"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Atsugi"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get the values in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;study_locations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tag and type metadata (and an attribute key, where the response includes one) let you filter when multiple object types are present. This is a real tradeoff: nested fields require more handling than flat columns, but the structure is explicit and accurate rather than collapsed and potentially wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://minexa.stoplight.io/docs/minexa/" rel="noopener noreferrer"&gt;Check the Minexa API docs for the full parameter reference&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Minexa handles this structurally
&lt;/h2&gt;

&lt;p&gt;Minexa is a deterministic, DOM-based extraction platform. You train a scraper once using the &lt;a href="https://chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh" rel="noopener noreferrer"&gt;Chrome extension&lt;/a&gt; by selecting the HTML container that holds your target data block. Minexa locks onto that container and discovers all data points within it automatically.&lt;/p&gt;

&lt;p&gt;Each column is bound to a specific DOM element via a consolidated selector chosen for structural stability. Running the same scraper on the same page always returns identical JSON. No temperature variance, no prompt sensitivity.&lt;/p&gt;

&lt;p&gt;When a field is absent from the HTML, the output is null. Never a fabricated default.&lt;/p&gt;

&lt;p&gt;When a URL is submitted with the wrong scraper_id, the API returns an explicit mismatch error rather than attempting extraction on the wrong structure.&lt;/p&gt;

&lt;p&gt;A minimal API call looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"batches"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scraper_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6241&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"columns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"top_30"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"https://example.com/listing/99"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scraping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"js_render"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"proxy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"verified"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"threads"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;columns&lt;/code&gt; parameter accepts either &lt;code&gt;top_N&lt;/code&gt; for automatic ranked field selection or explicit column names generated during training. Both cost the same.&lt;/p&gt;
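
&lt;p&gt;To make the two modes concrete, here is a small sketch that builds the same request body either way. It only reproduces the keys shown in the example above; the helper name build_request and the explicit column names are hypothetical, and no endpoint is called.&lt;/p&gt;

```python
import json


def build_request(scraper_id, urls, columns=None, top_n=30, threads=5):
    """Build a request body shaped like the example above.

    Pass explicit column names, or leave columns as None to request top_N.
    """
    selected = list(columns) if columns else [f"top_{top_n}"]
    return {
        "batches": [{
            "scraper_id": scraper_id,
            "columns": selected,
            "urls": list(urls),
            "scraping": {"js_render": True, "proxy": "verified"},
        }],
        "threads": threads,
    }


urls = ["https://example.com/listing/99"]
auto = build_request(6241, urls)
# "title" and "price" are placeholder names; use the ones generated during training.
explicit = build_request(6241, urls, columns=["title", "price"])
print(json.dumps(auto, indent=2))
```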

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpotth4rcl9oat5taidy9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpotth4rcl9oat5taidy9.png" alt="Minexa API request structure explained" width="690" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a site redesigns and the scraper starts returning nulls or explicit errors, you retrain in 2 to 5 minutes via the extension. The only required code change is updating the scraper_id, plus any explicit column names you reference if they changed during retraining.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fail-loudly design matters more than it sounds
&lt;/h2&gt;

&lt;p&gt;Selector-based scrapers can fail silently. So can LLMs. Both can produce outputs that look valid and are not. Minexa is designed to surface structural problems as explicit errors rather than letting wrong data propagate.&lt;/p&gt;

&lt;p&gt;This changes what your validation layer needs to do. Instead of checking whether extracted values are plausible, you can trust that a non-null value came from the correct DOM position. Your checks shift from 'does this look right' to 'is this field present'.&lt;/p&gt;
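
&lt;p&gt;A presence check of that kind can be very small. The sketch below splits rows into complete and incomplete based on a list of required fields; the field names and sample rows are hypothetical.&lt;/p&gt;

```python
def missing_fields(row, required):
    """Return the required fields that are absent or null in a row."""
    return [field for field in required if row.get(field) is None]


def partition_rows(rows, required):
    """Split rows into (complete, incomplete) based on required fields."""
    complete, incomplete = [], []
    for row in rows:
        (incomplete if missing_fields(row, required) else complete).append(row)
    return complete, incomplete


# Hypothetical extracted rows.
rows = [
    {"title": "Senior Engineer", "salary": "$150,000"},
    {"title": "Staff Engineer", "salary": None},
]
complete, incomplete = partition_rows(rows, ["title", "salary"])
print(len(complete), "complete,", len(incomplete), "need review")
```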

&lt;p&gt;At scale, that is a meaningful reduction in downstream cleanup work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.minexa.ai" rel="noopener noreferrer"&gt;Start with the extension, get your first dataset in under 10 minutes&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scraping scholarship data from Scholarship America with Minexa.ai</title>
      <dc:creator>Minexa.ai</dc:creator>
      <pubDate>Sun, 14 Jun 2026 15:59:28 +0000</pubDate>
      <link>https://dev.to/minexa_ai/scraping-scholarship-data-from-scholarship-america-with-minexaai-5f79</link>
      <guid>https://dev.to/minexa_ai/scraping-scholarship-data-from-scholarship-america-with-minexaai-5f79</guid>
      <description>&lt;p&gt;Scholarship America maintains one of the largest publicly browsable scholarship databases in the US. The browse page at scholarshipamerica.org lists hundreds of programs with award amounts, deadlines, eligible institutions, and geographic scope. If you are building a scholarship aggregator, doing financial aid research, or just trying to track what programs are open and when, that data is useful but locked inside paginated HTML.&lt;/p&gt;

&lt;p&gt;This article walks through how to pull it out using the &lt;a href="https://chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh" rel="noopener noreferrer"&gt;Minexa.ai Chrome extension&lt;/a&gt; without writing any scraping code.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the page looks like
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04r8okuuvzfualufjzzk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04r8okuuvzfualufjzzk.png" alt="Scholarship America browse page loaded" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The browse page at &lt;code&gt;scholarshipamerica.org/students/browse-scholarships/&lt;/code&gt; renders a paginated list. Each entry includes a program name, award amount, deadline, eligible institution types, and a location tag. The pagination uses a URL pattern (&lt;code&gt;?_paged=2&lt;/code&gt;, &lt;code&gt;?_paged=3&lt;/code&gt;, etc.), which Minexa detects automatically.&lt;/p&gt;
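
&lt;p&gt;Minexa handles this for you, but purely to make the URL pattern concrete, the pages could be enumerated as below. The https scheme, the assumption that page 1 is the bare URL, and the page count are all guesses for illustration.&lt;/p&gt;

```python
from urllib.parse import urlencode

# Assumed scheme; the path comes from the article.
BASE_URL = "https://scholarshipamerica.org/students/browse-scholarships/"


def page_urls(last_page):
    """Return browse-page URLs for pages 1..last_page."""
    urls = [BASE_URL]  # assumption: page 1 is the bare URL
    for page in range(2, last_page + 1):
        urls.append(f"{BASE_URL}?{urlencode({'_paged': page})}")
    return urls


for url in page_urls(3):
    print(url)
```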




&lt;h2&gt;
  
  
  Opening Minexa on the page
&lt;/h2&gt;

&lt;p&gt;Once the extension is installed and you are on the browse page, click the Minexa icon. The popup asks you to confirm you are on the right page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frx0kxocqe6ibb16nqxds.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frx0kxocqe6ibb16nqxds.png" alt="Extension popup with confirm button" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After confirming, Minexa scans the page and surfaces the pagination it found.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2zhk4zhpl64k0qx1um1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2zhk4zhpl64k0qx1um1.png" alt="Pagination detected screen" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No configuration needed here. Hit Continue.&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing scraping depth
&lt;/h2&gt;

&lt;p&gt;This is a decision point most scraping tools skip entirely. Minexa asks whether you want to scrape the list only, or the list plus the detail page behind each link.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieq3zv9anr3lv5mq65ng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieq3zv9anr3lv5mq65ng.png" alt="List or list-plus-detail choice" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For most research use cases, the list data is sufficient. If you need full program descriptions or eligibility criteria from each individual scholarship page, the detail option handles that in the same run.&lt;/p&gt;




&lt;h2&gt;
  
  
  Automatic container and field detection
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjg13waexep82uvz4oi0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjg13waexep82uvz4oi0r.png" alt="Container auto-highlighted" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Minexa highlights the repeating container it identified. You do not click on individual fields. It finds all data points inside the container on its own and presents them for review.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02utrxn2yuw7fq8axxm8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02utrxn2yuw7fq8axxm8.png" alt="All extracted columns visible" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is useful when you are not sure what fields are available. You see everything Minexa found before committing to a run.&lt;/p&gt;




&lt;h2&gt;
  
  
  Video walkthrough
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=foWNcPtPJpw" rel="noopener noreferrer"&gt;Watch full video walkthrough&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Sample output
&lt;/h2&gt;

&lt;p&gt;Here is a cleaned sample from the extracted JSON (meta fields and field prefixes removed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scholarship_program_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"#RAREis Scholarship Fund"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"award_amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$5,000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"event_date_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"April 28, 2026 3:00 PM CT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"eligible_institutions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Community or Technical College, Four-Year University, Graduate School"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"National"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Closed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"program_link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scholarshipamerica.org/scholarship/rareis/"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scholarship_program_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Amazon Future Engineer Scholarship"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"award_amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Up to $40,000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"event_date_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"January 26, 2026 3:00 PM CT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"eligible_institutions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Community or Technical College, Four-Year University"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"National"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Closed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"program_link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scholarshipamerica.org/scholarship/amazonfutureengineer/"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each row is one scholarship. Fields map directly to what is on the page.&lt;/p&gt;




&lt;h2&gt;
  
  
  Working with the data in Python
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scholarships.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# "or" also covers keys that are present but null; .get() defaults only cover missing keys&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scholarship_program_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;award_amount&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TBD&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;deadline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;event_date_time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;deadline&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The export from Minexa is already structured, so there is no parsing or cleanup step before this runs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scheduling and export options
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq3et9c6vsr4e6e9tbf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq3et9c6vsr4e6e9tbf5.png" alt="Job summary with schedule and Google Sheets options" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the scraper is configured, you can schedule it to run on a recurring basis and push results directly to Google Sheets. For a scholarship database that updates seasonally, this means your dataset stays current without manual re-runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkzfzqm6mnrx5vh2jcey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkzfzqm6mnrx5vh2jcey.png" alt="Final data table with export options" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Export to Excel, JSON, or Google Sheets directly from the results view.&lt;/p&gt;




&lt;p&gt;If you want to try this on Scholarship America or any other paginated listing site, the Minexa.ai extension is available at the &lt;a href="https://chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh" rel="noopener noreferrer"&gt;Chrome Web Store&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Stop reaching for an LLM every time you need to extract web data</title>
      <dc:creator>Minexa.ai</dc:creator>
      <pubDate>Sun, 14 Jun 2026 15:54:43 +0000</pubDate>
      <link>https://dev.to/minexa_ai/stop-reaching-for-an-llm-every-time-you-need-to-extract-web-data-1kcd</link>
      <guid>https://dev.to/minexa_ai/stop-reaching-for-an-llm-every-time-you-need-to-extract-web-data-1kcd</guid>
      <description>&lt;p&gt;There is a pattern that has become surprisingly common in backend and data engineering work: someone needs structured data from a website, reaches for an LLM API, feeds it raw HTML, and calls it done. It works well enough at small scale. Then the bill arrives, or the extracted fields start drifting, and the whole thing needs rethinking.&lt;/p&gt;

&lt;p&gt;Using an LLM to parse HTML is not wrong by default. But it is often the wrong tool chosen for the wrong reason: convenience at the prototype stage rather than fitness for production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually goes wrong with LLM extraction at scale
&lt;/h2&gt;

&lt;p&gt;The issues are not dramatic. They are quiet and cumulative.&lt;/p&gt;

&lt;p&gt;A product page shows a sale price and a crossed-out original price. Both look like prices to a language model. Depending on surrounding text, token context, and model temperature, the model may return either one under &lt;code&gt;sale_price&lt;/code&gt;. It will not tell you it is uncertain. You find out during a data audit three weeks later.&lt;/p&gt;

&lt;p&gt;A clinical trials page has four date fields. An LLM assigns the wrong date to a label roughly once per hundred rows or more, because the values are structurally similar and the model picks based on proximity rather than DOM position. At 50,000 pages, a 1% error rate means at least 500 silently wrong rows.&lt;/p&gt;

&lt;p&gt;This is not a criticism of LLMs as a category. It is a description of what probabilistic text generation does when applied to a task that requires structural precision.&lt;/p&gt;

&lt;h2&gt;
  
  
  The alternative: train once, extract deterministically
&lt;/h2&gt;

&lt;p&gt;The Minexa.ai API takes a different approach. You train a scraper once using the browser extension — point at the HTML container holding your data, confirm the selection, and Minexa generates a reusable scraper with a stable &lt;code&gt;scraper_id&lt;/code&gt;. That scraper is backed by consolidated DOM selectors, not prompts. Every field maps to a specific element. Running the same scraper on the same page always produces identical JSON.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpotth4rcl9oat5taidy9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpotth4rcl9oat5taidy9.png" alt="Minexa API request structure explained" width="690" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once trained, you call the API with your URLs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;batches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraper_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4821&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;columns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_30&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;urls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com/listing/9981&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraping&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;js_render&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proxy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verified&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;js_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wait_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page_init&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wait_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threads&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;columns&lt;/code&gt; parameter accepts either a named list like &lt;code&gt;["price", "availability", "brand"]&lt;/code&gt; or a &lt;code&gt;top_n&lt;/code&gt; shorthand. Using &lt;code&gt;"top_30"&lt;/code&gt; returns the 30 highest-ranked columns by Minexa's relevance algorithm. The ranking is deterministic, so &lt;code&gt;top_30&lt;/code&gt; always maps to the same 30 fields, which makes it safe for production without locking in a schema upfront.&lt;/p&gt;

&lt;p&gt;Up to 50,000 URLs can go into a single batch request. The &lt;code&gt;threads&lt;/code&gt; value controls parallel processing up to your plan's limit.&lt;/p&gt;
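&lt;p&gt;The two ways of filling &lt;code&gt;columns&lt;/code&gt; can be sketched as a small helper. This is a minimal illustration, not part of the Minexa client: &lt;code&gt;build_batch&lt;/code&gt; and &lt;code&gt;uses_shorthand&lt;/code&gt; are hypothetical names, and the &lt;code&gt;scraping&lt;/code&gt; settings are simply copied from the request shown above.&lt;/p&gt;

```python
import re

# Matches the "top_N" shorthand described above, e.g. "top_30".
TOP_N_PATTERN = re.compile(r"^top_(\d+)$")


def uses_shorthand(columns):
    """Return True when columns is a single top_N entry such as ["top_30"]."""
    return len(columns) == 1 and TOP_N_PATTERN.match(columns[0]) is not None


def build_batch(scraper_id, urls, columns):
    """Build one entry for the "batches" list, mirroring the request above."""
    if not columns:
        raise ValueError("columns must contain a top_N entry or explicit names")
    return {
        "scraper_id": scraper_id,
        "columns": list(columns),
        "urls": list(urls),
        # Settings copied from the article's example request.
        "scraping": {
            "js_render": True,
            "proxy": "verified",
            "timeout": 30,
            "js_code": [{"wait_time": 2}, {"page_init": True}, {"wait_time": 4}],
            "retry": 3,
        },
    }


listing = ["https://example.com/listing/9981"]
named = build_batch(4821, listing, ["price", "availability", "brand"])
ranked = build_batch(4821, listing, ["top_30"])
print(uses_shorthand(named["columns"]), uses_shorthand(ranked["columns"]))  # False True
```

&lt;p&gt;Explicit names pin the schema in your own code, while the shorthand leaves the field choice to the ranking.&lt;/p&gt;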

&lt;h2&gt;
  
  
  When a field is missing, you get null, not a guess
&lt;/h2&gt;

&lt;p&gt;Minexa is designed to fail loudly. If a page structure changes and a trained selector no longer finds its target, the affected field returns &lt;code&gt;null&lt;/code&gt; or an explicit error. It never borrows a value from a nearby element or invents a plausible substitute.&lt;/p&gt;

&lt;p&gt;This contrasts directly with LLM behavior, where a missing price might come back as &lt;code&gt;$0.00&lt;/code&gt; or a missing date might be filled from another date field on the page. Both outputs are plausible, both are wrong, and neither is flagged.&lt;/p&gt;

&lt;p&gt;If you submit a URL with the wrong &lt;code&gt;scraper_id&lt;/code&gt; (a page type the scraper was not trained on), Minexa returns an error indicating the mismatch rather than attempting extraction on mismatched structure.&lt;/p&gt;
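&lt;p&gt;Because a missing field surfaces as &lt;code&gt;null&lt;/code&gt;, a simple count per column is enough to notice a broken selector. The sketch below assumes the extracted rows have already been parsed into a list of Python dictionaries; the exact response envelope is not shown in this post, so &lt;code&gt;rows&lt;/code&gt; and &lt;code&gt;null_counts&lt;/code&gt; are illustrative names only.&lt;/p&gt;

```python
from collections import Counter


def null_counts(rows, required_columns):
    """Count how often each required column is None or absent across rows."""
    counts = Counter()
    for row in rows:
        for column in required_columns:
            # dict.get returns None for both a null value and a missing key.
            if row.get(column) is None:
                counts[column] += 1
    return dict(counts)


# Hypothetical rows standing in for parsed API output.
rows = [
    {"price": "19.99", "brand": "Acme"},
    {"price": None, "brand": "Acme"},
    {"brand": None},
]
print(null_counts(rows, ["price", "brand"]))  # {'price': 2, 'brand': 1}
```

&lt;p&gt;A sudden jump in the count for one column is a reasonable trigger for checking whether the page layout has changed.&lt;/p&gt;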

&lt;h2&gt;
  
  
  Handling pre-scraped HTML
&lt;/h2&gt;

&lt;p&gt;If you already have HTML stored (on CloudFront, S3, or anywhere publicly accessible), you can skip live crawling entirely using &lt;code&gt;file_urls&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scraping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"js_render"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"proxy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"verified"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file_urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"https://your-cdn.cloudfront.net/page-1.html"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"https://original-site.com/listing/1"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;file_urls&lt;/code&gt; and &lt;code&gt;urls&lt;/code&gt; are 1-to-1. Minexa reads from the stored HTML and maps output back to the original URL. This is the lowest-credit configuration since no rendering is needed.&lt;/p&gt;
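&lt;p&gt;One way to keep the two lists aligned is to build both from a single list of pairs. This is a sketch under assumptions: &lt;code&gt;build_stored_html_batch&lt;/code&gt; is a hypothetical helper, and placing &lt;code&gt;scraper_id&lt;/code&gt; and &lt;code&gt;columns&lt;/code&gt; next to &lt;code&gt;file_urls&lt;/code&gt; follows the batch entry shown earlier rather than anything confirmed here.&lt;/p&gt;

```python
def build_stored_html_batch(scraper_id, columns, pairs):
    """Build a batch entry from (file_url, original_url) pairs.

    Keeping each stored file next to its source URL makes the two lists
    equal in length and identically ordered by construction.
    """
    if not pairs:
        raise ValueError("at least one (file_url, original_url) pair is required")
    return {
        "scraper_id": scraper_id,
        "columns": list(columns),
        # No rendering is needed because the HTML is already stored.
        "scraping": {"js_render": False, "proxy": "verified"},
        "file_urls": [file_url for file_url, _ in pairs],
        "urls": [original_url for _, original_url in pairs],
    }


batch = build_stored_html_batch(
    4821,
    ["top_30"],
    [
        ("https://your-cdn.cloudfront.net/page-1.html", "https://original-site.com/listing/1"),
        ("https://your-cdn.cloudfront.net/page-2.html", "https://original-site.com/listing/2"),
    ],
)
print(len(batch["file_urls"]) == len(batch["urls"]))  # True
```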

&lt;h2&gt;
  
  
  The cost reality at scale
&lt;/h2&gt;

&lt;p&gt;LLMs price by token. A full DOM-rendered HTML page averages around 572,000 tokens. At that size, GPT-4o-mini costs roughly $0.086 per page. At 120,000 pages per month, that is $10,320. Minexa's Startup plan handles the same volume for $60. Even with stripped HTML at 38,965 tokens, GPT-4o-mini costs $773 for 120,000 pages versus $60 on Minexa.&lt;/p&gt;

&lt;p&gt;Minexa's cost does not change based on page size. A 600K-token page costs the same credit as a 10K-token page.&lt;/p&gt;
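&lt;p&gt;The comparison above is plain arithmetic and can be reproduced in a few lines. The sketch uses only the figures quoted in this section; &lt;code&gt;cost_ratio&lt;/code&gt; is an illustrative helper name.&lt;/p&gt;

```python
PAGES_PER_MONTH = 120_000
MINEXA_STARTUP_USD = 60  # Startup plan price quoted above

# GPT-4o-mini figures quoted above.
llm_full_html_usd = round(0.086 * PAGES_PER_MONTH)  # per-page cost times volume
llm_stripped_usd = 773  # monthly total for stripped HTML


def cost_ratio(llm_usd, flat_usd):
    """How many times more the LLM bill is than the flat plan price."""
    return round(llm_usd / flat_usd, 1)


print(llm_full_html_usd)  # 10320
print(cost_ratio(llm_full_html_usd, MINEXA_STARTUP_USD))  # 172.0
print(cost_ratio(llm_stripped_usd, MINEXA_STARTUP_USD))  # 12.9
```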

&lt;blockquote&gt;
&lt;p&gt;Start extracting structured data from any site in under 10 minutes: &lt;a href="https://www.minexa.ai" rel="noopener noreferrer"&gt;minexa.ai&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Nested data and what to do with it
&lt;/h2&gt;

&lt;p&gt;When extracted content is structurally nested, Minexa returns a list of objects with metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"locations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tag"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"span"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Berlin"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tag"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"span"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Vienna"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In most cases you only need the values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;locations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;tag&lt;/code&gt; and &lt;code&gt;attribute&lt;/code&gt; metadata are available when you need to filter by element type, which is useful for pages with long mixed content like article bodies.&lt;/p&gt;
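&lt;p&gt;Filtering on that metadata might look like the following. The sample &lt;code&gt;row&lt;/code&gt; extends the nested example above with one invented link entry for illustration, and &lt;code&gt;values_with_tag&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
# Sample data in the nested shape shown above; the "a" entry is invented.
row = {
    "locations": [
        {"tag": "span", "type": "text", "value": "Berlin"},
        {"tag": "a", "type": "text", "value": "View on map"},
        {"tag": "span", "type": "text", "value": "Vienna"},
    ]
}


def values_with_tag(items, tag):
    """Keep only the values whose source element matches the given tag."""
    return [item["value"] for item in items if item.get("tag") == tag]


print(values_with_tag(row["locations"], "span"))  # ['Berlin', 'Vienna']
```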

&lt;h2&gt;
  
  
  What retraining looks like
&lt;/h2&gt;

&lt;p&gt;When a site redesigns and the existing scraper starts returning nulls or errors, you open the updated page in the extension, select the new container, and create a new scraper. This takes the same 2-5 minutes as the original training. The only required code change is updating &lt;code&gt;scraper_id&lt;/code&gt; in your request body and checking whether any column names you depend on have changed.&lt;/p&gt;

&lt;p&gt;Retraining creates a new scraper from scratch. Column names may differ because they are generated fresh. Minexa does not attempt to preserve labels from the previous scraper.&lt;/p&gt;
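&lt;p&gt;Since column names can change after retraining, it may help to compare the names your pipeline depends on against one row from the new scraper before switching over. A minimal sketch, assuming a row is a dictionary keyed by column name; &lt;code&gt;missing_columns&lt;/code&gt; and the sample values are hypothetical.&lt;/p&gt;

```python
def missing_columns(required_columns, sample_row):
    """Return the required column names that the sample row does not contain."""
    return sorted(set(required_columns) - set(sample_row))


required = ["price", "availability", "brand"]

# Hypothetical row from a freshly retrained scraper with one renamed column.
new_scraper_row = {"price": "19.99", "stock_status": "In stock", "brand": "Acme"}

print(missing_columns(required, new_scraper_row))  # ['availability']
```

&lt;p&gt;An empty list means the downstream code can keep its column names and only the &lt;code&gt;scraper_id&lt;/code&gt; needs updating.&lt;/p&gt;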

&lt;blockquote&gt;
&lt;p&gt;Read the full API docs and explore scraping scenarios: &lt;a href="https://minexa.stoplight.io/docs/minexa/" rel="noopener noreferrer"&gt;minexa.stoplight.io/docs/minexa&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The actual takeaway
&lt;/h2&gt;

&lt;p&gt;LLM extraction is a reasonable starting point for low-volume, exploratory work where output variance is acceptable. Once you are running tens of thousands of pages per month, or once field accuracy matters for downstream use, the tradeoffs shift significantly: token costs compound, silent errors accumulate, and validation logic adds engineering overhead that was never in the original estimate.&lt;/p&gt;

&lt;p&gt;Deterministic DOM-based extraction does not solve every problem, but for structured data at scale from consistent page types, it is the more predictable and cost-stable path.&lt;/p&gt;

&lt;p&gt;Install the Minexa Chrome extension, train a scraper on your target page, and pull the auto-generated Python code directly from the extension: &lt;a href="https://chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh" rel="noopener noreferrer"&gt;chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What actually happens when your scraping pipeline hits 100,000 pages</title>
      <dc:creator>Minexa.ai</dc:creator>
      <pubDate>Sun, 14 Jun 2026 15:45:41 +0000</pubDate>
      <link>https://dev.to/minexa_ai/what-actually-happens-when-your-scraping-pipeline-hits-100000-pages-34b1</link>
      <guid>https://dev.to/minexa_ai/what-actually-happens-when-your-scraping-pipeline-hits-100000-pages-34b1</guid>
      <description>&lt;p&gt;Most scraping projects start small. A few hundred pages, a quick script, maybe an LLM call to parse the HTML. It works fine. Then the scope grows.&lt;/p&gt;

&lt;p&gt;At 100,000 pages per month, the decisions you made at 1,000 pages start costing real money and real engineering time. This post walks through what that scaling curve actually looks like, and where the hidden costs appear.&lt;/p&gt;

&lt;h2&gt;
  
  
  The token problem nobody budgets for
&lt;/h2&gt;

&lt;p&gt;When you pass a full HTML page to an LLM for extraction, you are not passing a few paragraphs. Real pages across job listings, ecommerce, real estate, and review sites average around 572,739 tokens per page after DOM rendering and whitespace cleanup.&lt;/p&gt;

&lt;p&gt;At that size, here is what a single page costs across common models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost per page (full HTML)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5 nano&lt;/td&gt;
&lt;td&gt;$0.029&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;$0.086&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5 mini&lt;/td&gt;
&lt;td&gt;$0.145&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;$0.584&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;$1.330&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$1.733&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 100,000 pages per month, GPT-5 nano costs roughly $2,900. GPT-4o-mini costs $8,600. Claude Sonnet 4.6 costs $173,300.&lt;/p&gt;

&lt;p&gt;You can strip the HTML first to reduce tokens. Stripped pages average around 38,965 tokens. That brings GPT-5 nano down to $0.0024 per page, or $240 for 100,000 pages. But stripping HTML requires preprocessing work, and you risk removing markup that contains the data you actually need. There is no clean middle ground.&lt;/p&gt;
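&lt;p&gt;The monthly totals follow directly from the per-page costs in the table. The short script below reproduces them from the quoted figures; &lt;code&gt;monthly_cost&lt;/code&gt; is an illustrative helper, and the prices are the ones listed in this post rather than live rates.&lt;/p&gt;

```python
# Per-page costs for full HTML, copied from the table above.
COST_PER_PAGE_FULL_HTML = {
    "GPT-5 nano": 0.029,
    "GPT-4o-mini": 0.086,
    "GPT-5 mini": 0.145,
    "Claude Haiku 4.5": 0.584,
    "GPT-4o": 1.330,
    "Claude Sonnet 4.6": 1.733,
}
PAGES = 100_000


def monthly_cost(cost_per_page, pages):
    """Total cost in USD, rounded to cents to hide floating-point noise."""
    return round(cost_per_page * pages, 2)


for model, per_page in COST_PER_PAGE_FULL_HTML.items():
    print(f"{model}: ${monthly_cost(per_page, PAGES):,.0f}")

# Stripped HTML with GPT-5 nano, using the per-page figure quoted above.
print(f"GPT-5 nano (stripped): ${monthly_cost(0.0024, PAGES):,.0f}")
```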

&lt;h2&gt;
  
  
  Flat cost regardless of page size
&lt;/h2&gt;

&lt;p&gt;Minexa.ai is a Chrome extension-based scraper training tool that connects to an API for batch extraction. You train a scraper once by selecting an HTML container on a page in the extension. Minexa identifies all data points inside that container automatically. The result is a &lt;code&gt;scraper_id&lt;/code&gt; you reference in every subsequent API call.&lt;/p&gt;

&lt;p&gt;The extraction cost is credit-based, not token-based. A page costs the same whether it is 10,000 tokens or 600,000 tokens. A volume of 100,000 pages per month fits within the Startup plan, which costs $60 and covers up to 120,000 pages. The Business plan at $500 covers up to 2,000,000 pages.&lt;/p&gt;
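&lt;p&gt;Picking a plan for a given volume is a simple lookup. The sketch below knows only the two plans named in this post, so any other tiers are outside its scope; &lt;code&gt;cheapest_plan&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
# (name, price in USD, pages covered), ordered from cheapest to most expensive.
# Only the two plans mentioned in this post are listed.
PLANS = [
    ("Startup", 60, 120_000),
    ("Business", 500, 2_000_000),
]


def cheapest_plan(pages_per_month):
    """Return (name, price) of the first plan that covers the volume, else None."""
    for name, price_usd, page_limit in PLANS:
        if page_limit >= pages_per_month:
            return name, price_usd
    return None


name, price = cheapest_plan(100_000)
print(f"{name}: ${price} total, ${price / 100_000:.4f} per page")
# Startup: $60 total, $0.0006 per page
```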

&lt;p&gt;&lt;a href="https://www.minexa.ai" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vpgux4o6rznr4hnuh16.png" alt="Minexa cost per page at scale vs LLM models" width="690" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What the API request looks like
&lt;/h2&gt;

&lt;p&gt;After training a scraper in the extension, you click 'API Request' to get pre-generated Python code. The core request structure looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.minexa.ai/data/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;batches&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scraper_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6241&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;columns&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;top_30&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;urls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://example.com/listing/1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://example.com/listing/2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scraping&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;js_render&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;js_code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wait_time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;page_init&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wait_time&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;proxy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;verified&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retry&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;threads&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;api-key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;columns&lt;/code&gt; parameter accepts &lt;code&gt;top_N&lt;/code&gt; notation to return the top N ranked fields automatically, or explicit column names if you want specific fields only. The &lt;code&gt;threads&lt;/code&gt; parameter controls how many URLs are processed in parallel, up to your plan limit. Up to 50,000 URLs can be submitted in a single batch request.&lt;/p&gt;
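&lt;p&gt;If a job has more URLs than fit in one request, the list can be split before submission. A minimal sketch, assuming the 50,000 limit stated above applies to each request; &lt;code&gt;chunk_urls&lt;/code&gt; is an illustrative helper and the URLs are placeholders.&lt;/p&gt;

```python
MAX_URLS_PER_REQUEST = 50_000  # limit stated above


def chunk_urls(urls, size=MAX_URLS_PER_REQUEST):
    """Yield consecutive slices of urls, each with at most size entries."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Placeholder URLs standing in for a real crawl list.
urls = [f"https://example.com/listing/{i}" for i in range(120_000)]
print([len(chunk) for chunk in chunk_urls(urls)])  # [50000, 50000, 20000]
```

&lt;p&gt;Each chunk would then go into the &lt;code&gt;urls&lt;/code&gt; field of its own request body.&lt;/p&gt;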

&lt;p&gt;If you already have HTML stored on S3 or CloudFront, you can pass &lt;code&gt;file_urls&lt;/code&gt; pointing to those files and set &lt;code&gt;js_render: false&lt;/code&gt;. This is the cheapest scraping configuration since no live crawling is needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happens when a site redesigns
&lt;/h2&gt;

&lt;p&gt;This is where most scraping pipelines require the most maintenance. With hand-written selector-based scrapers, a layout change can silently return wrong values for weeks before anyone notices.&lt;/p&gt;

&lt;p&gt;With Minexa, a structural mismatch produces an explicit error or null values rather than a plausible-looking wrong value. When that happens, you open the updated page in the extension, select the new container, and create a new scraper. This takes the same 2 to 5 minutes as the original setup.&lt;/p&gt;

&lt;p&gt;The only required code change is updating the &lt;code&gt;scraper_id&lt;/code&gt; in your request body and checking whether any column names you rely on have shifted.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Try it yourself:&lt;/strong&gt; Install the &lt;a href="https://chromewebstore.google.com/detail/minexa-ai-scraper/ddljgbflolmninnkfcbdikabbjeapdnh" rel="noopener noreferrer"&gt;Minexa Chrome extension&lt;/a&gt; and train your first scraper in one session.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The reliability cost that does not show up in token pricing
&lt;/h2&gt;

&lt;p&gt;LLM extraction pipelines at scale require validation logic. Field mapping errors, swapped values between visually similar fields, and fabricated defaults when a value is missing all produce rows that look correct but are not.&lt;/p&gt;

&lt;p&gt;Minexa binds each column to a specific DOM element identified during training. The same field always maps to the same element. If the element is absent, the output is null. No value is invented.&lt;/p&gt;

&lt;p&gt;At 100,000 pages, even a 1% error rate means 1,000 rows requiring manual review or retry logic. That indirect cost does not appear in any token pricing table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical takeaway
&lt;/h2&gt;

&lt;p&gt;If your extraction volume is below roughly 10,000 pages per month and you are already using stripped HTML, the cheapest LLMs are competitive on price. Above that threshold, or any time you are working with full HTML, the cost gap widens sharply and does not close.&lt;/p&gt;

&lt;p&gt;The engineering overhead also compounds at scale. Prompt maintenance, schema drift, validation pipelines, and retry logic all grow with volume. A scraper trained once in the Minexa extension and called via API does not require any of that.&lt;/p&gt;

&lt;p&gt;Full API documentation is at &lt;a href="https://minexa.stoplight.io/docs/minexa/" rel="noopener noreferrer"&gt;minexa.stoplight.io/docs/minexa&lt;/a&gt; if you want to explore the complete parameter reference before starting.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
