<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ben</title>
    <description>The latest articles on DEV Community by Ben (@benthepythondev).</description>
    <link>https://dev.to/benthepythondev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3960119%2F6b3cea53-7207-4e2d-92b1-1073e28fd866.png</url>
      <title>DEV Community: Ben</title>
      <link>https://dev.to/benthepythondev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/benthepythondev"/>
    <language>en</language>
    <item>
      <title>How to Scrape Local Job Ads from Kleinanzeigen (Germany) — Python + No-Code</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Thu, 25 Jun 2026 20:48:25 +0000</pubDate>
      <link>https://dev.to/benthepythondev/how-to-scrape-local-job-ads-from-kleinanzeigen-germany-python-no-code-4ke5</link>
      <guid>https://dev.to/benthepythondev/how-to-scrape-local-job-ads-from-kleinanzeigen-germany-python-no-code-4ke5</guid>
      <description>&lt;p&gt;Most job-scraping guides point you at corporate ATS boards — Greenhouse, Lever,&lt;br&gt;
Ashby. Those only carry mid-to-large-company tech and office roles. Much of the&lt;br&gt;
&lt;strong&gt;local German labour market&lt;/strong&gt; (trades, care, hospitality, retail, mini-jobs) is&lt;br&gt;
posted somewhere else: the &lt;strong&gt;Jobs (Stellenangebote)&lt;/strong&gt; section of&lt;br&gt;
Kleinanzeigen. If you want hyper-local hiring data, that's the source.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Kleinanzeigen for jobs?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hidden inventory&lt;/strong&gt; — small businesses, shops, restaurants and private employers
post here directly; these roles are absent from corporate boards and aggregators.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hyper-local&lt;/strong&gt; — search any city or PLZ with a radius.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured fields&lt;/strong&gt; — employment type, job category, hourly/monthly pay and
experience level, once you parse the detail page.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Parsing the German job attributes
&lt;/h2&gt;

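&lt;p&gt;The key is knowing what each German attribute actually holds, because the labels are&lt;br&gt;
misleading: &lt;code&gt;Arbeitszeit&lt;/code&gt; literally means "working hours" and &lt;code&gt;Art&lt;/code&gt; literally means "type",&lt;br&gt;
so it's easy to map them backwards:&lt;/p&gt;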

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Arbeitszeit&lt;/code&gt; → &lt;strong&gt;employment type&lt;/strong&gt; (Vollzeit / Teilzeit / Minijob)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Art&lt;/code&gt; → &lt;strong&gt;job category&lt;/strong&gt; (Reinigungskraft, Verkäufer/-in, …)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Stundenlohn&lt;/code&gt; → &lt;strong&gt;hourly wage&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Berufserfahrung&lt;/code&gt; → &lt;strong&gt;experience level&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
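    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;            &lt;span class="c1"&gt;# keys tried in priority order
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="c1"&gt;# exact label first, so "art" cannot hit a longer label containing it
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="c1"&gt;# then fall back to substring matching
&lt;/span&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;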
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employment_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arbeitszeit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anstellung&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;job_category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;art&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experience&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;berufserfahrung&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;erfahrung&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stundenlohn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gehalt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
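&lt;p&gt;The &lt;code&gt;salary&lt;/code&gt; value that &lt;code&gt;parse_job&lt;/code&gt; returns is still a raw string. Here is a minimal sketch for turning it into a number, assuming values shaped like &lt;code&gt;15,50 €&lt;/code&gt; or &lt;code&gt;2.400 €&lt;/code&gt; (German format: dot for thousands, comma for decimals). The helper name &lt;code&gt;parse_amount&lt;/code&gt; and the sample strings are illustrative assumptions, not taken from the site:&lt;/p&gt;

```python
import re

# Assumed raw shapes such as "15,50 €" or "2.400 € pro Monat".
# German number format: "." groups thousands, "," marks decimals.
# First alternative: grouped thousands; second: plain digits.
AMOUNT = re.compile(r"\d{1,3}(?:\.\d{3})+(?:,\d+)?|\d+(?:,\d+)?")


def parse_amount(raw):
    """Return the first amount found in `raw` as a float, or None."""
    if not raw:
        return None
    match = AMOUNT.search(raw)
    if match is None:
        return None
    # Drop thousands separators, then switch the decimal comma to a dot.
    return float(match.group().replace(".", "").replace(",", "."))


hourly = parse_amount("15,50 €")
monthly = parse_amount("2.400 € pro Monat")
```

&lt;p&gt;Text without any number (for example "nach Vereinbarung") yields &lt;code&gt;None&lt;/code&gt;, so unparseable pay stays visibly missing instead of becoming zero.&lt;/p&gt;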


&lt;p&gt;The attribute rows come off the detail page the same way as in any other Kleinanzeigen&lt;br&gt;
section: read the &lt;code&gt;.addetailslist--detail--value&lt;/code&gt; node, then strip its text from the row text&lt;br&gt;
to get the label. You'll also want a &lt;strong&gt;residential DE proxy&lt;/strong&gt;, plus the&lt;br&gt;
location-autocomplete endpoint to turn a city name into the internal location ID.&lt;/p&gt;
&lt;h2&gt;
  
  
  One handy quirk: titles contain the role
&lt;/h2&gt;

&lt;p&gt;Unlike cars (where titles often omit the make), &lt;strong&gt;job titles reliably contain the&lt;br&gt;
role&lt;/strong&gt; — &lt;em&gt;"Reinigungskraft (m/w/d)"&lt;/em&gt;, &lt;em&gt;"LKW Fahrer in Vollzeit"&lt;/em&gt;. So a client-side&lt;br&gt;
title filter on a keyword like &lt;code&gt;fahrer&lt;/code&gt; or &lt;code&gt;pflegekraft&lt;/code&gt; works well, on top of the&lt;br&gt;
location browse.&lt;/p&gt;
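&lt;p&gt;The title filter can be sketched in a few lines. This assumes each scraped ad is a dict with a &lt;code&gt;title&lt;/code&gt; key; the function name &lt;code&gt;filter_by_title&lt;/code&gt; and the sample rows are made up for illustration:&lt;/p&gt;

```python
def filter_by_title(rows, keyword):
    """Keep rows whose title contains `keyword`, ignoring case."""
    needle = keyword.casefold()
    # A missing or None title is treated as an empty string.
    return [row for row in rows if needle in (row.get("title") or "").casefold()]


rows = [
    {"title": "Reinigungskraft (m/w/d)"},
    {"title": "LKW Fahrer in Vollzeit"},
    {"title": None},
]
drivers = filter_by_title(rows, "fahrer")
```

&lt;p&gt;Because this is a plain substring match, &lt;code&gt;fahrer&lt;/code&gt; also catches compounds such as "Beifahrer" or "Fahrerin", which may or may not be what you want.&lt;/p&gt;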
&lt;h2&gt;
  
  
  The catch: scale and upkeep
&lt;/h2&gt;

&lt;p&gt;Paging results, fetching each detail page politely, mapping the German attributes&lt;br&gt;
correctly, extracting pay from messy free text, and rotating residential IPs — then&lt;br&gt;
maintaining it — adds up fast.&lt;/p&gt;
&lt;h2&gt;
  
  
  The no-code option
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;&lt;a href="https://apify.com/benthepythondev/kleinanzeigen-jobs-scraper" rel="noopener noreferrer"&gt;Kleinanzeigen Jobs Scraper&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
on Apify does it for you: enter a role and a location (or leave the role blank for&lt;br&gt;
everything in the area), click Run, get clean rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"keyword"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fahrer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"locationCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"münchen"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"radiusKm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"employmentType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Vollzeit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxResults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is one clean row per ad (title, job category, employment type, salary and pay&lt;br&gt;
period, experience, company, city, PLZ, posting date and the listing URL), ready&lt;br&gt;
for a spreadsheet, a database, or an LLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common use cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recruiting &amp;amp; sourcing&lt;/strong&gt; — find local hires and employers outside the big boards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Labour-market analysis&lt;/strong&gt; — track local demand, wages and employment types by region.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job-board aggregation&lt;/strong&gt; — enrich your board with hyper-local German listings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lead generation&lt;/strong&gt; — small employers actively hiring are strong B2B leads.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How is this different from an ATS scraper?&lt;/strong&gt; Corporate boards carry mid/large-company&lt;br&gt;
roles only. Kleinanzeigen carries local, hourly, trade, care and mini-jobs — a&lt;br&gt;
completely different, hyper-local inventory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need an API key?&lt;/strong&gt; No — a residential DE proxy is used automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is this legal?&lt;/strong&gt; You're reading publicly available listing data. Use it responsibly&lt;br&gt;
and within applicable laws and Kleinanzeigen's terms.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building something with German job data? The &lt;a href="https://apify.com/benthepythondev/kleinanzeigen-jobs-scraper" rel="noopener noreferrer"&gt;Kleinanzeigen Jobs Scraper&lt;/a&gt; handles the scraping so you can focus on the product.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>germany</category>
      <category>jobs</category>
    </item>
    <item>
      <title>How to Scrape Used-Car Listings from Kleinanzeigen (Germany) — Python + No-Code</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Thu, 25 Jun 2026 20:47:39 +0000</pubDate>
      <link>https://dev.to/benthepythondev/how-to-scrape-used-car-listings-from-kleinanzeigen-germany-python-no-code-38nb</link>
      <guid>https://dev.to/benthepythondev/how-to-scrape-used-car-listings-from-kleinanzeigen-germany-python-no-code-38nb</guid>
      <description>&lt;p&gt;If you want &lt;strong&gt;used-car data from Germany&lt;/strong&gt;, everyone reaches for mobile.de — but&lt;br&gt;
that's mostly the dealer market. Much of the &lt;strong&gt;private-seller&lt;/strong&gt; market lives on&lt;br&gt;
&lt;strong&gt;Kleinanzeigen Autos&lt;/strong&gt;, and it's where the deals (and the arbitrage) are. This post&lt;br&gt;
shows how to pull structured car listings — make, model, year, mileage, fuel,&lt;br&gt;
gearbox, power and price — out of it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Kleinanzeigen for cars?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Private sellers&lt;/strong&gt; — typically cheaper than dealer platforms; ideal for
deal-hunting and market analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Different inventory&lt;/strong&gt; — a complement to mobile.de, not a duplicate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich attributes&lt;/strong&gt; — once you parse the detail page you get the full German spec
sheet, not just a title and a price.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Parsing the German car attributes
&lt;/h2&gt;

&lt;p&gt;Kleinanzeigen stores the spec sheet as label/value rows. Map the German labels to&lt;br&gt;
clean English keys, with a regex fallback on the title for year and mileage when a&lt;br&gt;
field is missing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;YEAR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\b(?:19|20)\d{2}\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;KM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;([\d.]+)\s*km\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_car&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;
    &lt;span class="n"&gt;er&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;erstzulassung&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
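    &lt;span class="n"&gt;ym&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;YEAR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;er&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;YEAR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# mileage: spec-sheet value first, title as fallback; "120.000 km" becomes 120000
&lt;/span&gt;    &lt;span class="n"&gt;km&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kilometerstand&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;KM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;km_digits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;km&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;km&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;make&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;marke&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;modell&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ym&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ym&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mileage_km&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;km_digits&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;km_digits&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;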
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fuel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kraftstoff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transmission&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;getriebe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;power&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;leistung&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The attribute rows themselves come off the detail page like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lxml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;attrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;li&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.addetailslist--detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;li&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.addetailslist--detail--value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;li&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As with any Kleinanzeigen section, you'll want a &lt;strong&gt;residential DE proxy&lt;/strong&gt; and the&lt;br&gt;
location-autocomplete endpoint to turn a city name into the internal location ID.&lt;/p&gt;
&lt;h2&gt;
  
  
  One gotcha: make is a facet, not free text
&lt;/h2&gt;

&lt;p&gt;Kleinanzeigen treats the car make as a structured filter, and many listing titles&lt;br&gt;
omit the brand (e.g. &lt;em&gt;"320d Touring, Automatik"&lt;/em&gt;). So a naive keyword search on&lt;br&gt;
"BMW" misses cars. The robust approach is to browse the Autos category by location&lt;br&gt;
and filter make/model client-side from the parsed &lt;code&gt;make&lt;/code&gt; field — or paste a search&lt;br&gt;
URL with the make facet already selected.&lt;/p&gt;
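&lt;p&gt;A rough sketch of the client-side filter, assuming each parsed car is a dict with &lt;code&gt;make&lt;/code&gt; and &lt;code&gt;model&lt;/code&gt; keys as returned by &lt;code&gt;parse_car&lt;/code&gt;. The function name &lt;code&gt;filter_cars&lt;/code&gt; and the sample values are invented for illustration:&lt;/p&gt;

```python
def filter_cars(cars, make, model=None):
    """Keep cars whose parsed make (and optionally model) matches, ignoring case."""
    want_make = make.casefold()
    want_model = model.casefold() if model else None
    kept = []
    for car in cars:
        # Compare against the parsed attribute, never the title.
        if (car.get("make") or "").casefold() != want_make:
            continue
        if want_model and want_model not in (car.get("model") or "").casefold():
            continue
        kept.append(car)
    return kept


# Hypothetical rows; note that the first title never mentions the brand.
cars = [
    {"make": "BMW", "model": "3er", "title": "320d Touring, Automatik"},
    {"make": "Audi", "model": "A4", "title": "Audi A4 Avant"},
    {"make": None, "model": None, "title": "Kombi, TÜV neu"},
]
bmws = filter_cars(cars, "bmw")
```

&lt;p&gt;The first row is kept even though "BMW" never appears in its title, which is exactly the case a keyword search misses.&lt;/p&gt;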
&lt;h2&gt;
  
  
  The catch: scale and upkeep
&lt;/h2&gt;

&lt;p&gt;Paging hundreds of results, fetching each detail page politely, normalizing the&lt;br&gt;
German spec sheet, handling the make-facet quirk, and rotating residential IPs —&lt;br&gt;
then keeping it alive as the site changes — is real, ongoing work.&lt;/p&gt;
&lt;h2&gt;
  
  
  The no-code option
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;&lt;a href="https://apify.com/benthepythondev/kleinanzeigen-autos-scraper" rel="noopener noreferrer"&gt;Kleinanzeigen Autos Scraper&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
on Apify does it for you: pick a location (and optional price/year/mileage filters),&lt;br&gt;
click Run, get clean rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"locationCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"berlin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxPrice"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"minYear"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2016&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxMileage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;120000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxResults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is one clean row per car — make, model, year, first registration, mileage,&lt;br&gt;
price, fuel, transmission, power, condition, color, city and the listing URL —&lt;br&gt;
ready for a spreadsheet, a database, or an LLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common use cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deal-hunting &amp;amp; arbitrage&lt;/strong&gt; — spot underpriced private-seller cars fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Price &amp;amp; market analysis&lt;/strong&gt; — track asking prices by make/model/region over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dealers &amp;amp; resellers&lt;/strong&gt; — source private inventory and leads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automotive apps&lt;/strong&gt; — power a search product with structured listings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How is this different from mobile.de?&lt;/strong&gt; Kleinanzeigen is dominated by private&lt;br&gt;
sellers and budget cars — a different inventory, useful for arbitrage and broader&lt;br&gt;
coverage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need an API key?&lt;/strong&gt; No — a residential DE proxy is used automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is this legal?&lt;/strong&gt; You're reading publicly available listing data. Use it responsibly&lt;br&gt;
and within applicable laws and Kleinanzeigen's terms.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building something with German car data? The &lt;a href="https://apify.com/benthepythondev/kleinanzeigen-autos-scraper" rel="noopener noreferrer"&gt;Kleinanzeigen Autos Scraper&lt;/a&gt; handles the scraping so you can focus on the product.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>germany</category>
      <category>api</category>
    </item>
    <item>
      <title>How to Scrape Real Estate Listings from Kleinanzeigen (Germany) — Python + No-Code</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Thu, 25 Jun 2026 20:46:40 +0000</pubDate>
      <link>https://dev.to/benthepythondev/how-to-scrape-real-estate-listings-from-kleinanzeigen-germany-python-no-code-28j9</link>
      <guid>https://dev.to/benthepythondev/how-to-scrape-real-estate-listings-from-kleinanzeigen-germany-python-no-code-28j9</guid>
      <description>&lt;p&gt;Kleinanzeigen (formerly eBay Kleinanzeigen) is Germany's largest classifieds site,&lt;br&gt;
and its &lt;strong&gt;Immobilien&lt;/strong&gt; section is one of the richest sources of &lt;strong&gt;private-landlord&lt;/strong&gt;&lt;br&gt;
rental and sale listings anywhere in the DACH region — the kind of inventory that&lt;br&gt;
never makes it onto the big paid portals. If you're building a rental-market&lt;br&gt;
dashboard, a relocation tool, or just hunting for underpriced flats, this is where&lt;br&gt;
the data is.&lt;/p&gt;

&lt;p&gt;The problem: Kleinanzeigen renders server-side German HTML, blocks datacenter IPs,&lt;br&gt;
and hides the structured attributes (Zimmer, Wohnfläche, Kaltmiete) inside the&lt;br&gt;
detail page. Here's how to get clean data out of it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Kleinanzeigen for real estate?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Private landlords&lt;/strong&gt; — many listings are direct from owners, often below market
and absent from Immowelt/ImmoScout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hyper-local&lt;/strong&gt; — search any PLZ or city with a radius, down to the district.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured fields&lt;/strong&gt; — rooms, living space, cold/warm rent, deposit, availability
date and more, once you parse the detail page.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The hard part: location IDs and German attributes
&lt;/h2&gt;

&lt;p&gt;Two things trip up a naive scraper:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Location resolution.&lt;/strong&gt; Kleinanzeigen doesn't search by city name directly — it
uses internal location IDs (e.g. Berlin → &lt;code&gt;l3331&lt;/code&gt;). You resolve them through its
autocomplete endpoint:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;münchen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.kleinanzeigen.de/s-ort-empfehlungen.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; {"_6411": "München", ...}  the digits after the underscore are the location id
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
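
&lt;p&gt;To turn that response into usable IDs, a minimal sketch might look like this. It assumes the payload has the shape shown in the comment above (keys such as &lt;code&gt;_6411&lt;/code&gt;, values holding the place name); &lt;code&gt;extract_location_ids&lt;/code&gt; is a hypothetical helper name, not part of any library.&lt;/p&gt;

```python
import re


def extract_location_ids(payload):
    """Map place name to numeric location id.

    Assumes keys look like "_6411" (underscore plus digits) and values
    hold the display name, as in the comment above.
    """
    ids = {}
    for key, name in payload.items():
        match = re.fullmatch(r"_(\d+)", key)
        if match:  # ignore keys that do not follow the "_digits" pattern
            ids[name] = int(match.group(1))
    return ids


# Hypothetical payload mirroring the comment above, not a captured response.
sample = {"_6411": "München"}
print(extract_location_ids(sample))
```

&lt;p&gt;Keeping the parsing in a small pure function makes it easy to re-check if the endpoint's response shape ever changes.&lt;/p&gt;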


&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Attribute parsing.&lt;/strong&gt; On the detail page each attribute is a label/value pair
without a clean class on the label, so you read the value node and strip it back
out of the row text:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="c1"&gt;# `html` is the detail page's HTML string, fetched beforehand (e.g. with httpx)
&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lxml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;attrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;li&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.addetailslist--detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;li&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.addetailslist--detail--value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;li&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="c1"&gt;# attrs -&amp;gt; {"Wohnfläche": "72 m²", "Zimmer": "3", "Kaltmiete": "1.150 €", ...}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You also need a &lt;strong&gt;residential DE proxy&lt;/strong&gt; — datacenter IPs get a soft block fast.&lt;/p&gt;
&lt;h2&gt;
  
  
  The catch: scale and upkeep
&lt;/h2&gt;

&lt;p&gt;One listing is easy. A useful dataset needs paging through hundreds of results,&lt;br&gt;
fetching every detail page politely, normalizing German fields into clean keys&lt;br&gt;
(&lt;code&gt;rooms&lt;/code&gt;, &lt;code&gt;living_space_sqm&lt;/code&gt;, &lt;code&gt;cold_rent_eur&lt;/code&gt;), handling umlauts in city slugs, and&lt;br&gt;
rotating residential IPs, then keeping it all working as the markup shifts. That's days of&lt;br&gt;
build work plus ongoing maintenance.&lt;/p&gt;
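
&lt;p&gt;As an illustration of the normalization step alone, here is a rough sketch that maps the German labels to the clean keys named above and parses German-formatted numbers. The &lt;code&gt;KEY_MAP&lt;/code&gt; table and both function names are assumptions made for this example; a real pipeline would cover many more labels.&lt;/p&gt;

```python
import re

# Hypothetical mapping from German labels to the clean keys named above.
KEY_MAP = {
    "Zimmer": "rooms",
    "Wohnfläche": "living_space_sqm",
    "Kaltmiete": "cold_rent_eur",
}


def parse_german_number(text):
    """Turn '1.150 €' or '72,5 m²' into a float; return None if no number."""
    match = re.search(r"\d[\d.]*(?:,\d+)?", text)
    if not match:
        return None
    # German format: '.' groups thousands, ',' marks the decimals.
    return float(match.group(0).replace(".", "").replace(",", "."))


def normalize(attrs):
    """Keep only the labels we know and convert their values to numbers."""
    row = {}
    for label, raw in attrs.items():
        key = KEY_MAP.get(label)
        if key:  # drop labels without a clean key
            row[key] = parse_german_number(raw)
    return row


print(normalize({"Wohnfläche": "72 m²", "Zimmer": "3", "Kaltmiete": "1.150 €"}))
```

&lt;p&gt;Values that contain no number at all (for example a free-text note) come back as &lt;code&gt;None&lt;/code&gt;, so they are easy to spot later.&lt;/p&gt;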
&lt;h2&gt;
  
  
  The no-code option
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;&lt;a href="https://apify.com/benthepythondev/kleinanzeigen-immobilien-scraper" rel="noopener noreferrer"&gt;Kleinanzeigen Immobilien Scraper&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
on Apify does all of it: enter a city or PLZ (names auto-resolve to the right&lt;br&gt;
location ID) and an optional price/room filter, click Run, and get clean rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"locationCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"münchen"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"radiusKm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxResults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is one tidy row per listing — title, rooms, living space, cold/warm rent,&lt;br&gt;
deposit, address, PLZ, images and the listing URL — ready for a spreadsheet, a&lt;br&gt;
database, or an LLM. Residential DE proxies and attribute parsing are built in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common use cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rental-market analysis&lt;/strong&gt; — track asking rents by district and room count over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deal-hunting&lt;/strong&gt; — surface underpriced private-landlord flats the moment they post.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relocation &amp;amp; proptech apps&lt;/strong&gt; — power search with structured German listings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lead generation&lt;/strong&gt; — reach private landlords directly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Do I need an API key?&lt;/strong&gt; No. There is no key to configure, and a residential DE proxy is used automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I search by city name?&lt;/strong&gt; Yes — city names auto-resolve to Kleinanzeigen's&lt;br&gt;
internal location IDs; you can also pass a PLZ or paste a full search URL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is this legal?&lt;/strong&gt; You're reading publicly available listing data. Use it responsibly&lt;br&gt;
and within applicable laws and Kleinanzeigen's terms.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building something with German property data? The &lt;a href="https://apify.com/benthepythondev/kleinanzeigen-immobilien-scraper" rel="noopener noreferrer"&gt;Kleinanzeigen Immobilien Scraper&lt;/a&gt; handles the scraping so you can focus on the product.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>germany</category>
      <category>api</category>
    </item>
    <item>
      <title>Pushshift Alternative: Scrape Historical Reddit Posts &amp; Comments (2026)</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Tue, 23 Jun 2026 19:49:50 +0000</pubDate>
      <link>https://dev.to/benthepythondev/pushshift-alternative-scrape-historical-reddit-posts-comments-2026-13k2</link>
      <guid>https://dev.to/benthepythondev/pushshift-alternative-scrape-historical-reddit-posts-comments-2026-13k2</guid>
      <description>&lt;p&gt;Pushshift was the de facto way to pull historical Reddit data — until access got heavily restricted and it stopped being the easy public option it once was. If you need posts and comments older than what Reddit's API will give you, here's a working alternative in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the official Reddit API isn't enough
&lt;/h2&gt;

&lt;p&gt;Reddit's API caps listings at roughly &lt;strong&gt;1,000 items&lt;/strong&gt; per endpoint, so you can't page back through a large subreddit's full history. That's fine for recent data, useless for backfilling months or years — which is exactly why people used Pushshift.&lt;/p&gt;

&lt;h2&gt;
  
  
  The alternative: Reddit Archive Scraper
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://apify.com/benthepythondev/reddit-archive-scraper" rel="noopener noreferrer"&gt;Reddit Archive Scraper&lt;/a&gt; on Apify pulls historical Reddit posts and comments from the community &lt;strong&gt;PullPush&lt;/strong&gt; archive (Pushshift's successor) — by subreddit, date range, and keyword — reaching data the official API can't. For current/live data, the &lt;a href="https://apify.com/benthepythondev/reddit-scraper" rel="noopener noreferrer"&gt;Reddit Scraper&lt;/a&gt; returns posts and comments as clean markdown for AI/RAG.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"subreddits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"datascience"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"since"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2020-01-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"until"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2022-12-31"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"keyword"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"career"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pushshift vs. this
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Pushshift (today)&lt;/th&gt;
&lt;th&gt;Reddit Archive Scraper&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Access&lt;/td&gt;
&lt;td&gt;Restricted / limited&lt;/td&gt;
&lt;td&gt;Open, run on demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;History depth&lt;/td&gt;
&lt;td&gt;Was full archive&lt;/td&gt;
&lt;td&gt;Years, via PullPush&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup&lt;/td&gt;
&lt;td&gt;API + auth hurdles&lt;/td&gt;
&lt;td&gt;No API key, run + export&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;Raw JSON&lt;/td&gt;
&lt;td&gt;Structured JSON / CSV&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Live data&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Pair with Reddit Scraper&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI / RAG datasets&lt;/strong&gt; — years of real Q&amp;amp;A and discussion as training context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Academic &amp;amp; social research&lt;/strong&gt; — longitudinal analysis of communities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trend &amp;amp; sentiment analysis&lt;/strong&gt; — track how opinion shifted over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brand/competitor history&lt;/strong&gt; — every historical mention in a subreddit.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What replaced Pushshift?&lt;/strong&gt; The PullPush archive is the community successor; the Reddit Archive Scraper wraps it so you can query by subreddit, date, and keyword without managing access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why can't I just use the Reddit API for old posts?&lt;/strong&gt; It caps listings at ~1,000 items, so deep history isn't reachable through it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What format is the output?&lt;/strong&gt; Structured JSON/CSV; the live Reddit Scraper outputs AI-ready markdown with token counts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is it legal?&lt;/strong&gt; It reads publicly available Reddit data. Use responsibly and within applicable terms.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Need Reddit history beyond the API's limit? The &lt;a href="https://apify.com/benthepythondev/reddit-archive-scraper" rel="noopener noreferrer"&gt;Reddit Archive Scraper&lt;/a&gt; pulls years of posts by subreddit, date, and keyword.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>ai</category>
    </item>
    <item>
      <title>Hunter.io Alternative: Find &amp; Verify Business Emails by Domain (Pay-As-You-Go)</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Tue, 23 Jun 2026 19:48:10 +0000</pubDate>
      <link>https://dev.to/benthepythondev/hunterio-alternative-find-verify-business-emails-by-domain-pay-as-you-go-278p</link>
      <guid>https://dev.to/benthepythondev/hunterio-alternative-find-verify-business-emails-by-domain-pay-as-you-go-278p</guid>
      <description>&lt;p&gt;Hunter.io is a popular email finder, but its plans are monthly subscriptions with credit caps — overkill if you just need to enrich a list now and then. If you want to find and verify business emails &lt;strong&gt;without a recurring subscription&lt;/strong&gt;, here's a pay-per-result alternative.&lt;/p&gt;

&lt;h2&gt;
  
  
  The job to be done
&lt;/h2&gt;

&lt;p&gt;Given a person's name and their company domain, you want the most likely &lt;strong&gt;verified&lt;/strong&gt; work email — for sales outreach, recruiting, or CRM enrichment. The mechanics are the same everywhere: generate the common corporate patterns, then verify which one actually receives mail (MX + SMTP), while detecting catch-all domains so you don't get false positives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The alternative: Smart Email Finder &amp;amp; Verifier
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://apify.com/benthepythondev/smart-email-finder-verifier" rel="noopener noreferrer"&gt;Smart Email Finder &amp;amp; Verifier&lt;/a&gt; on Apify does exactly that — from a name + domain it generates ~14 patterns, verifies via SMTP where possible, detects catch-all domains, and returns a confidence score. You pay per verified email found, with no monthly fee.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"contacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"first_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jane"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"last_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"domain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"acme.com"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Hunter.io vs. pay-per-use finder
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Hunter.io&lt;/th&gt;
&lt;th&gt;This finder&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pricing&lt;/td&gt;
&lt;td&gt;Monthly subscription + credit caps&lt;/td&gt;
&lt;td&gt;Pay per verified email, no subscription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;MX + SMTP + catch-all detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bulk&lt;/td&gt;
&lt;td&gt;Yes (plan-limited)&lt;/td&gt;
&lt;td&gt;Yes — array or CSV&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confidence score&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (0–100)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Always-on teams&lt;/td&gt;
&lt;td&gt;Lists, projects, on-demand enrichment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cold outreach&lt;/strong&gt; — turn a list of names + companies into reachable contacts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recruiting&lt;/strong&gt; — reach candidates directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CRM enrichment&lt;/strong&gt; — fill missing email fields in bulk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR &amp;amp; partnerships&lt;/strong&gt; — find the right person at a target company.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Do it responsibly
&lt;/h2&gt;

&lt;p&gt;Use verified business emails for legitimate, consented outreach; follow CAN-SPAM / GDPR and always include an opt-out. Don't target personal data.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is there a cheaper Hunter.io alternative?&lt;/strong&gt; Yes — a pay-per-result finder avoids monthly subscriptions; you only pay for emails actually found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How accurate is it?&lt;/strong&gt; High for companies with standard formats and non-catch-all domains, where results are SMTP-verified. Big-tech domains block SMTP probes from datacenter IPs, so for those you get a pattern-based best guess with a lower confidence score.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I run a whole list?&lt;/strong&gt; Yes — pass an array of contacts or CSV for bulk enrichment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need an API key?&lt;/strong&gt; No — give it a name and domain.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Turning a name list into reachable contacts? The &lt;a href="https://apify.com/benthepythondev/smart-email-finder-verifier" rel="noopener noreferrer"&gt;Smart Email Finder &amp;amp; Verifier&lt;/a&gt; finds and verifies professional emails by domain — bulk, pay-as-you-go.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>api</category>
    </item>
    <item>
      <title>BuiltWith Alternative: Detect Any Website's Tech Stack via API (No Subscription)</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Tue, 23 Jun 2026 19:46:37 +0000</pubDate>
      <link>https://dev.to/benthepythondev/builtwith-alternative-detect-any-websites-tech-stack-via-api-no-subscription-1c0c</link>
      <guid>https://dev.to/benthepythondev/builtwith-alternative-detect-any-websites-tech-stack-via-api-no-subscription-1c0c</guid>
      <description>&lt;p&gt;BuiltWith is the go-to for "what is this site built with" — but its paid plans start around &lt;strong&gt;$295/month&lt;/strong&gt;, which is a lot if you just need to enrich a lead list or check a few hundred domains now and then. If you don't need an enterprise subscription, here's a pay-as-you-go alternative that returns the same kind of data via a simple API.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you actually need
&lt;/h2&gt;

&lt;p&gt;Most people reaching for BuiltWith want one of three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sales intelligence&lt;/strong&gt; — "does this company use Shopify / HubSpot / Salesforce?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lead scoring&lt;/strong&gt; — filter a list by the tech a prospect runs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitive research&lt;/strong&gt; — see a competitor's analytics, CMS, and hosting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For all of these, you need to point a tool at a list of domains and get back the technologies — not pay for a yearly seat.&lt;/p&gt;

&lt;h2&gt;
  
  
  The alternative: Website Tech Stack Detector
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://apify.com/benthepythondev/website-tech-detector" rel="noopener noreferrer"&gt;Website Tech Stack Detector&lt;/a&gt; (and its sibling &lt;a href="https://apify.com/benthepythondev/tech-stack-detector" rel="noopener noreferrer"&gt;Tech Stack Detector&lt;/a&gt;) on Apify detects 100+ technologies on any site (frameworks, CMS, analytics, hosting, payment, marketing tools) and returns clean JSON. You pay per site checked, with no monthly subscription.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"urls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"https://example.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://shopify-store.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output (per site): detected technologies grouped by category, plus the domain and a success flag — ready for a spreadsheet, CRM, or your own scoring model.&lt;/p&gt;
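
&lt;p&gt;For lead scoring, filtering that output by technology could be sketched as below. The record layout (&lt;code&gt;domain&lt;/code&gt;, &lt;code&gt;success&lt;/code&gt;, and a &lt;code&gt;technologies&lt;/code&gt; dict keyed by category) is an assumption based on the description above; the actual field names may differ, and &lt;code&gt;sites_using&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
def sites_using(records, technology):
    """Return the domains whose detected stack includes `technology`."""
    wanted = technology.lower()
    hits = []
    for record in records:
        if not record.get("success"):
            continue  # skip sites the detector could not read
        groups = record.get("technologies", {}).values()
        techs = [name.lower() for group in groups for name in group]
        if wanted in techs:
            hits.append(record["domain"])
    return hits


# Hypothetical records for illustration only; real field names may differ.
records = [
    {"domain": "example.com", "success": True,
     "technologies": {"CMS": ["WordPress"]}},
    {"domain": "shopify-store.com", "success": True,
     "technologies": {"Ecommerce": ["Shopify"]}},
]
print(sites_using(records, "shopify"))
```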

&lt;h2&gt;
  
  
  BuiltWith vs. pay-per-use detector
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;BuiltWith (paid plans)&lt;/th&gt;
&lt;th&gt;This detector&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pricing&lt;/td&gt;
&lt;td&gt;Subscription (~$295+/mo)&lt;/td&gt;
&lt;td&gt;Pay per site, no subscription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Large, ongoing programs&lt;/td&gt;
&lt;td&gt;Lists, enrichment, occasional checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access&lt;/td&gt;
&lt;td&gt;Web app + API (higher tiers)&lt;/td&gt;
&lt;td&gt;API + no-code UI out of the box&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;Reports&lt;/td&gt;
&lt;td&gt;Clean JSON / CSV / Excel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lock-in&lt;/td&gt;
&lt;td&gt;Annual plans&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you run BuiltWith daily across millions of domains, their subscription may be worth it. If you enrich lists or check domains on demand, pay-per-use wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Outbound sales&lt;/strong&gt; — find every prospect on Shopify/Webflow/HubSpot before you pitch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lead enrichment&lt;/strong&gt; — append tech stack to a CRM list in bulk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitive &amp;amp; market research&lt;/strong&gt; — map what tools a segment uses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recruiting/agencies&lt;/strong&gt; — qualify clients by their stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is there a free or cheaper BuiltWith alternative?&lt;/strong&gt; Yes — a pay-per-use detector like the one above avoids BuiltWith's monthly subscription; you only pay for sites you check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does it detect the same technologies?&lt;/strong&gt; It covers 100+ common technologies across frameworks, CMS, analytics, hosting, payments, and marketing tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I check a bulk list of domains?&lt;/strong&gt; Yes — pass an array of URLs and get one structured record per site.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need an API key?&lt;/strong&gt; No — run it via API or the no-code UI.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Need it for a list of domains? The &lt;a href="https://apify.com/benthepythondev/website-tech-detector" rel="noopener noreferrer"&gt;Website Tech Stack Detector&lt;/a&gt; checks them in one run — no subscription, just pay per site.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>api</category>
    </item>
    <item>
      <title>How to Find Anyone's Professional Email from Name + Company (2026 Guide)</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Mon, 22 Jun 2026 14:15:08 +0000</pubDate>
      <link>https://dev.to/benthepythondev/how-to-find-anyones-professional-email-from-name-company-2026-guide-1ca0</link>
      <guid>https://dev.to/benthepythondev/how-to-find-anyones-professional-email-from-name-company-2026-guide-1ca0</guid>
      <description>&lt;p&gt;Finding a verified work email from just a name and a company domain is the backbone of cold outreach, recruiting, and sales prospecting. Here's how the pattern-and-verify approach works, how to do it in Python, and a no-code option that does the whole thing for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea: permute, then verify
&lt;/h2&gt;

&lt;p&gt;Most companies use one of a handful of email formats. Given &lt;code&gt;Jane Doe&lt;/code&gt; at &lt;code&gt;acme.com&lt;/code&gt;, the likely candidates are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jane&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acme.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;patterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# jane.doe@
&lt;/span&gt;    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# janedoe@
&lt;/span&gt;    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# jdoe@
&lt;/span&gt;    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# jane@
&lt;/span&gt;    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;@&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# doej@
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the easy part. The hard part is &lt;strong&gt;verification&lt;/strong&gt;: confirming which candidate actually receives mail without sending a test email. That's done with an MX lookup to find the domain's mail server, followed by an SMTP handshake that stops after &lt;code&gt;RCPT TO&lt;/code&gt; (and you'll hit greylisting, catch-all domains, and rate limits along the way).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Python reality
&lt;/h2&gt;

&lt;p&gt;You can script MX + SMTP checks with &lt;code&gt;dnspython&lt;/code&gt; and &lt;code&gt;smtplib&lt;/code&gt;, but at any meaningful volume you'll also need IP rotation (many mail servers block repeated probes), catch-all detection, and back-off logic, which is why most teams use a service.&lt;/p&gt;
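
&lt;p&gt;As a rough illustration of the catch-all detection and temporary-failure handling mentioned above, here is a minimal sketch using only the standard library. It assumes you have already resolved the MX host (for example with &lt;code&gt;dnspython&lt;/code&gt;); the function names &lt;code&gt;rcpt_code&lt;/code&gt; and &lt;code&gt;classify&lt;/code&gt;, the labels, and the HELO/sender values are hypothetical choices for this example, not part of any tool described here.&lt;/p&gt;

```python
import smtplib
import uuid


def rcpt_code(mx_host, address, helo="example.org", sender="probe@example.org"):
    """Ask the mail server whether it would accept `address`; no message is sent."""
    # Placeholder HELO name and sender: replace with values you control.
    with smtplib.SMTP(mx_host, 25, timeout=10) as smtp:
        smtp.helo(helo)
        smtp.mail(sender)
        code, _reply = smtp.rcpt(address)
    return code


def classify(candidates, domain, probe):
    """Label candidates using probe(address), a callable returning an SMTP reply code."""
    # A random mailbox that should not exist. If the server accepts it,
    # the domain is catch-all and RCPT replies tell us nothing.
    decoy = f"{uuid.uuid4().hex}@{domain}"
    if probe(decoy) == 250:
        return {c: "unverifiable (catch-all)" for c in candidates}

    labels = {}
    for address in candidates:
        code = probe(address)
        if code == 250:
            labels[address] = "accepted"
        elif code // 100 == 4:
            # 4xx is a temporary failure, commonly greylisting or rate limiting.
            labels[address] = "retry later"
        else:
            labels[address] = "rejected"
    return labels
```

&lt;p&gt;To run it against a real server, pass something like &lt;code&gt;lambda a: rcpt_code(mx_host, a)&lt;/code&gt; as &lt;code&gt;probe&lt;/code&gt;. Keeping the network call behind a callable makes the labelling logic easy to test offline.&lt;/p&gt;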

&lt;h2&gt;
  
  
  The no-code option
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://apify.com/benthepythondev/smart-email-finder-verifier" rel="noopener noreferrer"&gt;Smart Email Finder &amp;amp; Verifier&lt;/a&gt; generates the common patterns and verifies them (SMTP where possible) with a confidence score, from just a name + domain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"contacts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"fullName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jane Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"domain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"acme.com"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get the most likely address, the patterns tried, and a confidence rating — ready to drop into your CRM or outreach tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sales prospecting&lt;/strong&gt; — turn a lead list of names + companies into reachable contacts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recruiting&lt;/strong&gt; — reach candidates directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PR &amp;amp; partnerships&lt;/strong&gt; — find the right person at a target company.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CRM enrichment&lt;/strong&gt; — fill missing email fields in bulk.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A note on doing this responsibly
&lt;/h2&gt;

&lt;p&gt;Use verified business emails for legitimate, consented outreach, follow CAN-SPAM / GDPR rules, and always include an opt-out. Don't harvest or target private, non-business addresses.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How accurate is pattern + verify?&lt;/strong&gt; High for companies that use a standard address format on a non-catch-all domain. A catch-all domain accepts mail for every address, so an SMTP check can't tell a real mailbox from a made-up one; the confidence score flags those cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need an API key?&lt;/strong&gt; Not for the &lt;a href="https://apify.com/benthepythondev/smart-email-finder-verifier" rel="noopener noreferrer"&gt;Smart Email Finder &amp;amp; Verifier&lt;/a&gt; — give it a name and domain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I run a whole list?&lt;/strong&gt; Yes — pass an array of contacts for bulk enrichment.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Turning a name list into reachable contacts? The &lt;a href="https://apify.com/benthepythondev/smart-email-finder-verifier" rel="noopener noreferrer"&gt;Smart Email Finder &amp;amp; Verifier&lt;/a&gt; finds and verifies professional emails by domain — bulk, with confidence scores.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>api</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>How to Scrape Reddit Posts &amp; Comments for AI / RAG (Python + No-Code)</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Mon, 22 Jun 2026 14:13:38 +0000</pubDate>
      <link>https://dev.to/benthepythondev/how-to-scrape-reddit-posts-comments-for-ai-rag-python-no-code-4d46</link>
      <guid>https://dev.to/benthepythondev/how-to-scrape-reddit-posts-comments-for-ai-rag-python-no-code-4d46</guid>
      <description>&lt;p&gt;Reddit is one of the richest sources of real human opinion on the internet — which makes it gold for RAG pipelines, sentiment analysis, and market research. Here's how to pull Reddit posts and comments in 2026, the limits to know about, and a no-code option that outputs clean markdown ready for embeddings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 1: the official Reddit API (PRAW)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;praw&lt;/span&gt;

&lt;span class="n"&gt;reddit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;praw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Reddit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client_secret&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reddit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subreddit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dataengineering&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;comments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace_more&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;comments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works, but note the catches: you need an app + OAuth credentials, the free tier is rate-limited, and &lt;strong&gt;listings are capped at ~1000 items&lt;/strong&gt; — so you can't page back through a large subreddit's full history.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 2: historical archives (beyond the 1000-item cap)
&lt;/h2&gt;

&lt;p&gt;For data older than what the API will return, the community archive &lt;strong&gt;PullPush&lt;/strong&gt; (the successor to Pushshift) lets you query historical posts and comments by subreddit, date range, and keyword — useful for backfilling years of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 3: no-code / markdown output for AI
&lt;/h2&gt;

&lt;p&gt;If your goal is an AI/RAG dataset, you mostly want clean text, not JSON plumbing. The &lt;a href="https://apify.com/benthepythondev/reddit-scraper" rel="noopener noreferrer"&gt;Reddit Scraper&lt;/a&gt; returns posts, comments, and user data as &lt;strong&gt;AI-ready markdown&lt;/strong&gt; (with word/token counts), no API keys to manage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"subreddits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"MachineLearning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"LocalLLaMA"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"top"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxPosts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"includeComments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
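
&lt;p&gt;If you go the PRAW route instead, the markdown conversion is something you write yourself. A minimal sketch, assuming each post has already been reduced to a plain dict with &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;body&lt;/code&gt;, and &lt;code&gt;comments&lt;/code&gt; keys (a hypothetical shape chosen for this example, not the scraper's actual output format), and using a rough whitespace-split word count rather than a real tokenizer:&lt;/p&gt;

```python
def post_to_markdown(post):
    """Render one post dict as a markdown document plus a rough word count."""
    lines = [f"# {post['title']}", ""]
    if post.get("body"):
        lines += [post["body"].strip(), ""]
    if post.get("comments"):
        lines += ["## Comments", ""]
        for comment in post["comments"]:
            # Collapse whitespace so each comment stays on a single bullet line.
            lines.append("- " + " ".join(comment.split()))
    markdown = "\n".join(lines).strip() + "\n"
    # Whitespace split; markdown markers such as "#" and "-" are counted too.
    return {"markdown": markdown, "word_count": len(markdown.split())}


example = {
    "title": "Best orchestrator for small teams?",
    "body": "We run about ten pipelines.",
    "comments": ["Cron is fine\nat that size.", "Try a managed scheduler."],
}
print(post_to_markdown(example)["markdown"])
```

&lt;p&gt;For embeddings you would likely swap the word count for your model's tokenizer, since token limits are what the chunking step actually has to respect.&lt;/p&gt;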



&lt;p&gt;For deep history beyond the API's cap, there's a companion &lt;a href="https://apify.com/benthepythondev/reddit-archive-scraper" rel="noopener noreferrer"&gt;Reddit Archive Scraper&lt;/a&gt; that pulls years of posts from the archive by date range.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which should you use?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small, live pulls?&lt;/strong&gt; The official API via PRAW is free and fine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI/RAG datasets or historical backfill?&lt;/strong&gt; A managed scraper saves you OAuth, rate-limit handling, and the markdown conversion.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG / fine-tuning datasets&lt;/strong&gt; — real Q&amp;amp;A and discussion as training context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentiment &amp;amp; trend analysis&lt;/strong&gt; — track opinion on products, tickers, or topics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Market research&lt;/strong&gt; — mine niche subreddits for pain points and feature requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community monitoring&lt;/strong&gt; — watch mentions of your brand or competitors.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Do I need Reddit API credentials?&lt;/strong&gt; For PRAW, yes. The managed scraper handles access for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why only ~1000 posts from the API?&lt;/strong&gt; Reddit caps listing pagination. Use an archive source for deeper history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What format is best for AI?&lt;/strong&gt; Markdown or plain text with token counts — which is what the &lt;a href="https://apify.com/benthepythondev/reddit-scraper" rel="noopener noreferrer"&gt;Reddit Scraper&lt;/a&gt; outputs.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building an AI app on Reddit data? The &lt;a href="https://apify.com/benthepythondev/reddit-scraper" rel="noopener noreferrer"&gt;Reddit Scraper&lt;/a&gt; gives you posts and comments as clean markdown — no API keys, no rate-limit juggling.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>ai</category>
    </item>
    <item>
      <title>How to Scrape Yahoo Finance Stock Data in 2026 (Python + No-Code API)</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Mon, 22 Jun 2026 13:50:22 +0000</pubDate>
      <link>https://dev.to/benthepythondev/how-to-scrape-yahoo-finance-stock-data-in-2026-python-no-code-api-25dk</link>
      <guid>https://dev.to/benthepythondev/how-to-scrape-yahoo-finance-stock-data-in-2026-python-no-code-api-25dk</guid>
      <description>&lt;p&gt;Yahoo Finance is still the most accessible source of free stock market data — real-time quotes, historical prices, fundamentals, analyst ratings, and news. This guide covers the practical ways to pull it in 2026, the gotchas, and a no-code option when you don't want to babysit a scraper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 1: yfinance (Python)
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;yfinance&lt;/code&gt; library wraps Yahoo's endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yfinance&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;yf&lt;/span&gt;

&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Ticker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AAPL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fast_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;          &lt;span class="c1"&gt;# real-time-ish quote
&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;period&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1mo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="c1"&gt;# historical OHLCV
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;                         &lt;span class="c1"&gt;# latest news
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's great for notebooks and quick analysis. The downsides at scale: Yahoo rate-limits aggressive polling, the unofficial endpoints change without notice, and you'll be maintaining retry/back-off logic and proxies once you go beyond a handful of tickers.&lt;/p&gt;
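
&lt;p&gt;The retry/back-off part can be kept independent of how you fetch. A minimal sketch, assuming your own fetch function raises a &lt;code&gt;RateLimited&lt;/code&gt; exception when it sees a 429 (both &lt;code&gt;RateLimited&lt;/code&gt; and &lt;code&gt;with_backoff&lt;/code&gt; are hypothetical names for this example, not part of &lt;code&gt;yfinance&lt;/code&gt;):&lt;/p&gt;

```python
import random
import time


class RateLimited(Exception):
    """Raised by the caller's fetch function when the server answers 429."""


def with_backoff(fetch, retries=5, base=1.0, sleep=time.sleep):
    """Call fetch(), retrying on RateLimited with exponential back-off and jitter."""
    for attempt in range(retries):
        try:
            return fetch()
        except RateLimited:
            if attempt == retries - 1:
                raise  # out of retries: let the caller decide what to do
            # Delay doubles each attempt (base, 2*base, 4*base, ...),
            # plus random jitter so parallel workers don't retry in lockstep.
            sleep(base * 2 ** attempt + random.uniform(0, base))
```

&lt;p&gt;You would wrap a call such as &lt;code&gt;lambda: yf.Ticker("AAPL").history(period="1mo")&lt;/code&gt; after mapping whatever rate-limit error your client raises to &lt;code&gt;RateLimited&lt;/code&gt;. The &lt;code&gt;sleep&lt;/code&gt; parameter is injectable so the logic can be tested without waiting.&lt;/p&gt;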

&lt;h2&gt;
  
  
  Option 2: hit the JSON endpoints directly
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;yfinance&lt;/code&gt; is really calling endpoints like &lt;code&gt;https://query1.finance.yahoo.com/v8/finance/chart/AAPL&lt;/code&gt;. You can call them yourself, but you'll quickly run into consent cookies, crumb tokens, and 429s — which is why most people either use the library or a managed API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 3: no-code / API (no maintenance)
&lt;/h2&gt;

&lt;p&gt;If you want clean JSON without maintaining any of the above, the &lt;a href="https://apify.com/benthepythondev/yahoo-finance-scraper" rel="noopener noreferrer"&gt;Yahoo Finance Scraper&lt;/a&gt; on Apify returns quotes, company info, historical prices, financials, analyst recommendations, and news for any list of tickers — via API or a simple UI, no key on your side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"symbols"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"AAPL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MSFT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NVDA"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dataTypes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"quote"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"history"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"news"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get one structured record per ticker, ready for a dashboard, a backtest, or an LLM. Because it runs on managed infrastructure, the rate-limit and rotation problems are handled for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which should you use?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exploring in a notebook?&lt;/strong&gt; &lt;code&gt;yfinance&lt;/code&gt; is perfect and free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Building a product or pulling many tickers on a schedule?&lt;/strong&gt; A managed API saves you the cat-and-mouse of cookies, crumbs, and rate limits.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trading dashboards &amp;amp; alerts&lt;/strong&gt; — live quotes and historical series.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backtesting&lt;/strong&gt; — clean OHLCV history across many tickers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research &amp;amp; screeners&lt;/strong&gt; — fundamentals and analyst data at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI/LLM finance apps&lt;/strong&gt; — structured market data as context.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is Yahoo Finance data free?&lt;/strong&gt; For personal and most research use, yes. Respect Yahoo's terms for redistribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I avoid rate limits?&lt;/strong&gt; Back off on 429s and cache responses, or use a managed API that handles rotation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can I get fundamentals and news too?&lt;/strong&gt; Yes — &lt;code&gt;yfinance&lt;/code&gt; exposes them, and the &lt;a href="https://apify.com/benthepythondev/yahoo-finance-scraper" rel="noopener noreferrer"&gt;Yahoo Finance Scraper&lt;/a&gt; returns them in one structured response.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Pulling Yahoo Finance at scale? The &lt;a href="https://apify.com/benthepythondev/yahoo-finance-scraper" rel="noopener noreferrer"&gt;Yahoo Finance Scraper&lt;/a&gt; gives you quotes, history, fundamentals, and news as clean JSON — no key, no rate-limit headaches.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>api</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>How to Scrape Jobs from Greenhouse, Lever &amp; Ashby (Free Python + No-Code)</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Mon, 22 Jun 2026 13:48:35 +0000</pubDate>
      <link>https://dev.to/benthepythondev/how-to-scrape-jobs-from-greenhouse-lever-ashby-free-python-no-code-4n56</link>
      <guid>https://dev.to/benthepythondev/how-to-scrape-jobs-from-greenhouse-lever-ashby-free-python-no-code-4n56</guid>
      <description>&lt;p&gt;If you've ever tried to build a job board, a sourcing tool, or a hiring-trends dashboard, you've hit the same wall: the big job aggregators are stale, rate-limited, and wrapped in anti-bot defenses. The fix most people miss is that &lt;strong&gt;thousands of companies publish their open roles through public ATS APIs&lt;/strong&gt; — Greenhouse, Lever, and Ashby all expose JSON endpoints that anyone can read, no key required.&lt;/p&gt;

&lt;p&gt;This post shows the free Python way to pull jobs from each, and a one-click no-code option if you'd rather skip the maintenance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why scrape ATS endpoints instead of job boards?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First-party and fresh&lt;/strong&gt; — you read the same data that powers the company's careers page, the moment a role opens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Working apply links&lt;/strong&gt; — every job links to the real application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No blocking&lt;/strong&gt; — these are public JSON APIs, so no proxies or headless browsers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Greenhouse jobs API (Python)
&lt;/h2&gt;

&lt;p&gt;Greenhouse exposes a board endpoint per company token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;company&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stripe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# the slug in boards.greenhouse.io/&amp;lt;slug&amp;gt;
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://boards-api.greenhouse.io/v1/boards/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/jobs?content=true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jobs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;absolute_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;content=true&lt;/code&gt; includes the full HTML job description.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lever jobs API (Python)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;company&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# the slug in jobs.lever.co/&amp;lt;slug&amp;gt;
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.lever.co/v0/postings/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;?mode=json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;cats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;categories&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hostedUrl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Ashby jobs API (Python)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# the slug in jobs.ashbyhq.com/&amp;lt;slug&amp;gt;
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.ashbyhq.com/posting-api/job-board/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jobs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;—&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;— remote:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isRemote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The catch: scale and upkeep
&lt;/h2&gt;

&lt;p&gt;One company is easy. But a useful product needs &lt;strong&gt;dozens of companies across all three ATS platforms&lt;/strong&gt;, normalized into one schema, de-duplicated, and filtered by keyword, location, and remote status. It also has to keep running as companies come and go. That's where a maintained tool saves you days.&lt;/p&gt;
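
&lt;p&gt;To make the normalize, de-duplicate and filter steps concrete, here is a minimal sketch. The &lt;code&gt;Job&lt;/code&gt; schema and the helper names are illustrative, and the raw field names (&lt;code&gt;absolute_url&lt;/code&gt;, &lt;code&gt;hostedUrl&lt;/code&gt;, &lt;code&gt;workplaceType&lt;/code&gt;, &lt;code&gt;jobUrl&lt;/code&gt;) are assumptions about what each public endpoint returns, so check them against a live response first. The Greenhouse remote flag is only a guess based on the location text.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Shared schema for postings from any ATS (illustrative, not official)."""
    source: str
    company: str
    title: str
    location: str
    remote: bool
    url: str


def normalize(source, company, raw):
    """Map one raw posting from a given ATS onto the shared Job schema."""
    if source == "greenhouse":
        location = (raw.get("location") or {}).get("name") or ""
        # Assumption: no explicit remote flag, so guess from the location text.
        remote = "remote" in location.lower()
        return Job(source, company, raw["title"], location, remote,
                   raw.get("absolute_url", ""))
    if source == "lever":
        location = (raw.get("categories") or {}).get("location") or ""
        remote = raw.get("workplaceType") == "remote"
        return Job(source, company, raw["text"], location, remote,
                   raw.get("hostedUrl", ""))
    if source == "ashby":
        return Job(source, company, raw["title"], raw.get("location") or "",
                   bool(raw.get("isRemote")), raw.get("jobUrl", ""))
    raise ValueError(f"unknown source: {source}")


def dedupe(jobs):
    """Keep the first job seen for each (company, title, location) key."""
    seen = set()
    unique = []
    for job in jobs:
        key = (job.company.lower(), job.title.lower(), job.location.lower())
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique


def matches(job, keywords=(), remote_only=False):
    """Return True if the job passes the keyword and remote filters."""
    if remote_only and not job.remote:
        return False
    title = job.title.lower()
    return not keywords or any(k.lower() in title for k in keywords)
```

&lt;p&gt;Even this small version needs one branch per ATS, and each branch breaks whenever a provider renames a field. That is the upkeep cost in miniature.&lt;/p&gt;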

&lt;h2&gt;
  
  
  The no-code option
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://apify.com/benthepythondev/ats-jobs-aggregator" rel="noopener noreferrer"&gt;ATS Jobs Aggregator&lt;/a&gt; on Apify does exactly this: it pulls live jobs from Greenhouse, Lever, Ashby, SmartRecruiters, and Recruitee in one run, normalized into a single dataset, with keyword/location/remote filters and de-duplication built in. It ships with a curated list of well-known companies so it returns results on the first click, and you can drop in your own company slugs to track exactly the employers you care about.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sources"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"greenhouse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lever"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ashby"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"keywords"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"engineer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"remoteOnly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxResults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output is one clean row per job — title, company, location, department, employment type, remote flag, apply URL, and posted date — ready for a spreadsheet, a database, or an LLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common use cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Niche job boards&lt;/strong&gt; — power a regional or role-specific board with fresh listings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sourcing &amp;amp; recruiting&lt;/strong&gt; — see who's hiring for a role across dozens of companies at once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Talent &amp;amp; market intelligence&lt;/strong&gt; — track hiring velocity and locations by company.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sales signals&lt;/strong&gt; — companies hiring for a function often signal budget and growth.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Do I need an API key?&lt;/strong&gt; No. The three job-board endpoints used above (Greenhouse, Lever, and Ashby) are public and need no authentication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I find a company's slug?&lt;/strong&gt; It's in their careers URL: &lt;code&gt;boards.greenhouse.io/&amp;lt;slug&amp;gt;&lt;/code&gt;, &lt;code&gt;jobs.lever.co/&amp;lt;slug&amp;gt;&lt;/code&gt;, &lt;code&gt;jobs.ashbyhq.com/&amp;lt;slug&amp;gt;&lt;/code&gt;.&lt;/p&gt;
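
&lt;p&gt;If you have a list of careers URLs rather than slugs, the lookup can be scripted. This is a small sketch that assumes the three URL shapes listed above; &lt;code&gt;parse_careers_url&lt;/code&gt; and &lt;code&gt;ATS_HOSTS&lt;/code&gt; are hypothetical names, and companies that host their careers page on a custom domain will not match.&lt;/p&gt;

```python
from urllib.parse import urlparse

# Host of the public careers page mapped to the ATS it belongs to.
ATS_HOSTS = {
    "boards.greenhouse.io": "greenhouse",
    "jobs.lever.co": "lever",
    "jobs.ashbyhq.com": "ashby",
}


def parse_careers_url(url):
    """Return (ats, slug) for a known careers URL, or None if unrecognized."""
    # Accept URLs pasted without a scheme, such as "jobs.lever.co/acme".
    parsed = urlparse(url if "//" in url else "https://" + url)
    ats = ATS_HOSTS.get(parsed.netloc.lower())
    parts = [p for p in parsed.path.split("/") if p]
    if ats is None or not parts:
        return None
    # The slug is the first path segment; anything after it is a job ID.
    return ats, parts[0]
```

&lt;p&gt;For example, &lt;code&gt;parse_careers_url("jobs.lever.co/acme/1234-abcd")&lt;/code&gt; returns &lt;code&gt;("lever", "acme")&lt;/code&gt;, where &lt;code&gt;acme&lt;/code&gt; is a made-up slug.&lt;/p&gt;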

&lt;p&gt;&lt;strong&gt;Is this legal?&lt;/strong&gt; You're reading publicly published job data. Use it responsibly and within applicable terms.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building something with job data? The &lt;a href="https://apify.com/benthepythondev/ats-jobs-aggregator" rel="noopener noreferrer"&gt;ATS Jobs Aggregator&lt;/a&gt; handles the scraping so you can focus on the product.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>api</category>
    </item>
    <item>
      <title>Best Game &amp; Media Data Scrapers and APIs in 2026</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Tue, 02 Jun 2026 11:55:38 +0000</pubDate>
      <link>https://dev.to/benthepythondev/best-game-media-data-scrapers-and-apis-in-2026-25gi</link>
      <guid>https://dev.to/benthepythondev/best-game-media-data-scrapers-and-apis-in-2026-25gi</guid>
      <description>&lt;p&gt;&lt;em&gt;Building a game-deal site, a music database, an anime app, or a media catalog? Here are the best no-code ways to pull structured data about games, music, anime and apps in 2026 — with honest pros and cons.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; For game prices and metadata use the &lt;a href="https://apify.com/benthepythondev/steam-scraper" rel="noopener noreferrer"&gt;Steam Scraper&lt;/a&gt;. For music, podcasts and apps, the &lt;a href="https://apify.com/benthepythondev/itunes-scraper" rel="noopener noreferrer"&gt;iTunes Scraper&lt;/a&gt;. For vinyl and releases, the &lt;a href="https://apify.com/benthepythondev/discogs-scraper" rel="noopener noreferrer"&gt;Discogs Scraper&lt;/a&gt;. For anime, the &lt;a href="https://apify.com/benthepythondev/anime-scraper" rel="noopener noreferrer"&gt;Anime Scraper&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to look for in a media data tool
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Official/keyless source&lt;/strong&gt; — avoids breakage and captchas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich fields&lt;/strong&gt; — price, rating, genre, release date, artwork, IDs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search + filters&lt;/strong&gt; — by keyword, category, country store.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean output&lt;/strong&gt; — flat JSON/CSV you can load straight into an app or DB.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Steam Scraper — game prices, ratings &amp;amp; metadata
&lt;/h2&gt;

&lt;p&gt;Searches the Steam store by keyword and returns name, price, discount, Metacritic score, genres, developers, publishers, release date, description, recommendations and platforms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; keyless and fast; great for price tracking and deal aggregators; rich metadata via the appdetails enrichment.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Steam store catalog only (not other game stores).&lt;br&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; game-deal sites, price trackers, and games databases.&lt;/p&gt;

&lt;p&gt;➡️ &lt;a href="https://apify.com/benthepythondev/steam-scraper" rel="noopener noreferrer"&gt;Steam Scraper&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. iTunes Scraper — music, podcasts, apps &amp;amp; more
&lt;/h2&gt;

&lt;p&gt;Wraps Apple's official iTunes Search API. One tool covers music, podcasts, apps, audiobooks, movies, TV shows and ebooks — returning title, artist, price, genre, rating, artwork and preview/view URLs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; official Apple data; keyless; covers many media types and country stores in one input.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Apple ecosystem data (not Spotify/Google Play).&lt;br&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; media catalogs, podcast/music discovery tools, and app research.&lt;/p&gt;

&lt;p&gt;➡️ &lt;a href="https://apify.com/benthepythondev/itunes-scraper" rel="noopener noreferrer"&gt;iTunes Scraper&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Discogs Scraper — vinyl, CDs &amp;amp; music releases
&lt;/h2&gt;

&lt;p&gt;Searches the Discogs database — the world's largest catalog of music releases — by artist, album, genre or year. Returns title, year, country, genre, style, format, label, catalog number, barcode and cover image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; unmatched release/pressing detail; keyless; great for collectors and marketplaces.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; music-only; rate-limited for very large pulls.&lt;br&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; record collecting, discography databases, and music marketplace analytics.&lt;/p&gt;

&lt;p&gt;➡️ &lt;a href="https://apify.com/benthepythondev/discogs-scraper" rel="noopener noreferrer"&gt;Discogs Scraper&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Anime Scraper — MyAnimeList data
&lt;/h2&gt;

&lt;p&gt;Pulls anime data from MyAnimeList via Jikan, a public but unofficial MyAnimeList API: title, type, episodes, status, score, rank, popularity, studios, genres, synopsis, season/year and artwork. Search by title or fetch top-ranked lists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; rich anime metadata; keyless; supports top-N lists for leaderboards.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; anime-specific (by design).&lt;br&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; anime databases, recommendation tools, and Discord bots.&lt;/p&gt;

&lt;p&gt;➡️ &lt;a href="https://apify.com/benthepythondev/anime-scraper" rel="noopener noreferrer"&gt;Anime Scraper&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Raw underlying APIs (Steam, iTunes, Jikan, Discogs)
&lt;/h2&gt;

&lt;p&gt;You can call each underlying API yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; free, full control.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; different formats, pagination, and rate limits per source; you build and maintain the glue code.&lt;br&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; developers who want to script it themselves.&lt;/p&gt;
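
&lt;p&gt;As a rough idea of what scripting it yourself involves, here is a minimal sketch for one source only. It assumes Apple's iTunes Search API with its &lt;code&gt;term&lt;/code&gt;, &lt;code&gt;media&lt;/code&gt;, &lt;code&gt;country&lt;/code&gt; and &lt;code&gt;limit&lt;/code&gt; parameters; the function names and the four output fields are illustrative choices, not part of any of the scrapers above.&lt;/p&gt;

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ITUNES_SEARCH = "https://itunes.apple.com/search"


def build_search_url(term, media="music", country="US", limit=25):
    """Build an iTunes Search API URL from its query parameters."""
    query = urlencode({"term": term, "media": media,
                       "country": country, "limit": limit})
    return f"{ITUNES_SEARCH}?{query}"


def flatten(result):
    """Reduce one raw result to a flat row; missing fields become None."""
    return {
        "title": result.get("trackName") or result.get("collectionName"),
        "artist": result.get("artistName"),
        "genre": result.get("primaryGenreName"),
        "price": result.get("trackPrice"),
    }


def search(term, **kwargs):
    """Call the API and return flat rows (performs a network request)."""
    with urlopen(build_search_url(term, **kwargs), timeout=30) as response:
        payload = json.load(response)
    return [flatten(r) for r in payload.get("results", [])]
```

&lt;p&gt;Steam, Jikan and Discogs each need their own URL builder and their own &lt;code&gt;flatten&lt;/code&gt;, plus pagination and rate-limit handling, which is the glue code mentioned in the cons.&lt;/p&gt;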




&lt;h2&gt;
  
  
  Quick comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scraper&lt;/th&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Key fields&lt;/th&gt;
&lt;th&gt;Keyless&lt;/th&gt;
&lt;th&gt;No-code&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Steam&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Games&lt;/td&gt;
&lt;td&gt;price, discount, Metacritic, genres&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;iTunes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Music/podcasts/apps&lt;/td&gt;
&lt;td&gt;artist, price, genre, artwork&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Discogs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vinyl/music&lt;/td&gt;
&lt;td&gt;year, label, format, cat#&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Anime&lt;/td&gt;
&lt;td&gt;score, rank, studios, genres&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Raw APIs&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;Mostly (Discogs search needs a token)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How to start (no code)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Open any scraper above on Apify.&lt;/li&gt;
&lt;li&gt;Enter a keyword (and country store or category where relevant).&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;maxResults&lt;/code&gt; and run.&lt;/li&gt;
&lt;li&gt;Export JSON/CSV, or schedule recurring runs to keep data fresh.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;For media and game data in 2026, no-code scrapers built on public APIs are the sweet spot, with clean output, no key setup and no maintenance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Games → &lt;a href="https://apify.com/benthepythondev/steam-scraper" rel="noopener noreferrer"&gt;Steam Scraper&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Music/podcasts/apps → &lt;a href="https://apify.com/benthepythondev/itunes-scraper" rel="noopener noreferrer"&gt;iTunes Scraper&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Vinyl/releases → &lt;a href="https://apify.com/benthepythondev/discogs-scraper" rel="noopener noreferrer"&gt;Discogs Scraper&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Anime → &lt;a href="https://apify.com/benthepythondev/anime-scraper" rel="noopener noreferrer"&gt;Anime Scraper&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>api</category>
      <category>datascience</category>
      <category>gamedev</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>How to Build a Research Paper Dataset for RAG &amp; LLMs (No Code, 2026)</title>
      <dc:creator>Ben</dc:creator>
      <pubDate>Tue, 02 Jun 2026 11:54:22 +0000</pubDate>
      <link>https://dev.to/benthepythondev/how-to-build-a-research-paper-dataset-for-rag-llms-no-code-2026-54ji</link>
      <guid>https://dev.to/benthepythondev/how-to-build-a-research-paper-dataset-for-rag-llms-no-code-2026-54ji</guid>
      <description>&lt;p&gt;&lt;em&gt;Grounding an LLM or running a literature review? You need a clean corpus of papers — titles, abstracts, authors, citations, PDF links. Here's how to build one in minutes without writing a scraper, pulling from arXiv, OpenAlex and PubMed.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll build:&lt;/strong&gt; a structured JSON dataset of academic papers on your topic, ready to drop into a vector database, a notebook, or a RAG pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why not just hit the APIs directly?
&lt;/h2&gt;

&lt;p&gt;arXiv, OpenAlex and PubMed are all free and open, but each returns a different format (Atom XML from arXiv, nested JSON from OpenAlex, E-utilities XML from PubMed), with its own pagination and rate limits. Wiring that up and flattening it is a half-day of glue code you'll have to maintain.&lt;/p&gt;

&lt;p&gt;Instead, we'll use three no-code scrapers that wrap those official APIs and return clean, flat JSON:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://apify.com/benthepythondev/arxiv-scraper" rel="noopener noreferrer"&gt;arXiv Scraper&lt;/a&gt; — CS/ML/physics preprints&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/benthepythondev/openalex-scraper" rel="noopener noreferrer"&gt;OpenAlex Scraper&lt;/a&gt; — 250M+ works across all fields, with citations&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://apify.com/benthepythondev/pubmed-scraper" rel="noopener noreferrer"&gt;PubMed Scraper&lt;/a&gt; — 37M+ biomedical citations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1 — Pick your source by field
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Machine learning / CS / physics → &lt;strong&gt;arXiv&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Anything, with citation counts → &lt;strong&gt;OpenAlex&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Medicine / life sciences → &lt;strong&gt;PubMed&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For an AI/ML RAG corpus, start with arXiv.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Run the arXiv Scraper
&lt;/h2&gt;

&lt;p&gt;Open the &lt;a href="https://apify.com/benthepythondev/arxiv-scraper" rel="noopener noreferrer"&gt;arXiv Scraper&lt;/a&gt; and set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"allFields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"retrieval augmented generation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sortBy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"newest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxResults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use the advanced query syntax for precision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"searchQuery"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cat:cs.CL AND abs:retrieval augmented"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"maxResults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each result includes the abstract and a direct PDF link:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"arxiv_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2605.30351v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VideoMLA: Low-Rank Latent KV Cache..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"authors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Hidir Yesiltepe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jiazhen Hu"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"abstract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Long-rollout causal video diffusion..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"primary_category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cs.CV"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pdf_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://arxiv.org/pdf/2605.30351v1"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3 — Add breadth and citations with OpenAlex
&lt;/h2&gt;

&lt;p&gt;To rank by impact or cover non-arXiv venues, run the &lt;a href="https://apify.com/benthepythondev/openalex-scraper" rel="noopener noreferrer"&gt;OpenAlex Scraper&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"searchQuery"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"retrieval augmented generation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sortBy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"citations"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fromYear"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxResults"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get &lt;code&gt;cited_by_count&lt;/code&gt;, &lt;code&gt;is_open_access&lt;/code&gt; and PDF links — perfect for prioritising the most influential papers in your corpus.&lt;/p&gt;
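
&lt;p&gt;Once the dataset is exported, prioritising by impact is a short sort. This sketch assumes each record is a dict carrying the &lt;code&gt;cited_by_count&lt;/code&gt; and &lt;code&gt;is_open_access&lt;/code&gt; fields mentioned above; &lt;code&gt;top_cited&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
def top_cited(papers, n=10, open_access_only=False):
    """Return the n most-cited papers, optionally only open-access ones."""
    pool = [p for p in papers
            if not open_access_only or p.get("is_open_access")]
    # Treat a missing or null citation count as zero so sorting never fails.
    return sorted(pool, key=lambda p: p.get("cited_by_count") or 0,
                  reverse=True)[:n]
```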

&lt;h2&gt;
  
  
  Step 4 — Export and load into your pipeline
&lt;/h2&gt;

&lt;p&gt;From the Storage tab, export each dataset as &lt;strong&gt;JSON&lt;/strong&gt; (or pull it via the API). Then in your RAG pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the &lt;code&gt;abstract&lt;/code&gt; (and the &lt;code&gt;pdf_url&lt;/code&gt; if you want full text) as your documents.&lt;/li&gt;
&lt;li&gt;Keep &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;authors&lt;/code&gt;, &lt;code&gt;doi&lt;/code&gt;, &lt;code&gt;arxiv_id&lt;/code&gt; as metadata for citations.&lt;/li&gt;
&lt;li&gt;Chunk, embed, and load into your vector DB (Pinecone, Weaviate, pgvector, etc.).&lt;/li&gt;
&lt;/ol&gt;
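
&lt;p&gt;The first two steps and the chunking part of the third can be sketched as below. The word-window chunker, its default sizes and the &lt;code&gt;to_documents&lt;/code&gt; output shape are assumptions chosen for illustration; embedding and the vector DB upload are left out because they depend on the provider you pick.&lt;/p&gt;

```python
def chunk_text(text, max_words=120, overlap=20):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def to_documents(paper, **chunk_kwargs):
    """Turn one paper record into chunk dicts carrying citation metadata."""
    metadata = {key: paper.get(key)
                for key in ("title", "authors", "doi", "arxiv_id")}
    base = paper.get("arxiv_id") or paper.get("doi") or paper.get("title")
    chunks = chunk_text(paper.get("abstract") or "", **chunk_kwargs)
    return [{"id": f"{base}#{i}", "text": chunk, "metadata": metadata}
            for i, chunk in enumerate(chunks)]
```

&lt;p&gt;Most abstracts fit in one or two chunks at these sizes. The overlap matters more if you also chunk the full text behind &lt;code&gt;pdf_url&lt;/code&gt;.&lt;/p&gt;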

&lt;p&gt;Because each scraper returns flat JSON with field names that stay consistent across runs, there's no per-source parsing: merge the three datasets, de-duplicate papers that appear in more than one source (for example by DOI), and you're done.&lt;/p&gt;
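
&lt;p&gt;A merge along those lines might look like the sketch below. It assumes every record has a &lt;code&gt;title&lt;/code&gt; and, where available, a &lt;code&gt;doi&lt;/code&gt; that is either a bare DOI or a &lt;code&gt;https://doi.org/&lt;/code&gt; URL; both helper names are hypothetical. It needs Python 3.9+ for &lt;code&gt;str.removeprefix&lt;/code&gt;.&lt;/p&gt;

```python
import re


def paper_key(paper):
    """Prefer the DOI as identity; fall back to a normalized title."""
    doi = (paper.get("doi") or "").lower()
    doi = doi.removeprefix("https://doi.org/")
    if doi:
        return ("doi", doi)
    title = re.sub(r"[^a-z0-9]+", " ", (paper.get("title") or "").lower())
    return ("title", title.strip())


def merge_datasets(*datasets):
    """Merge paper lists, filling gaps when the same paper appears twice."""
    merged = {}
    for dataset in datasets:
        for paper in dataset:
            record = merged.setdefault(paper_key(paper), {})
            for field, value in paper.items():
                # Earlier sources win; later ones only fill empty fields.
                if record.get(field) in (None, "", []):
                    record[field] = value
    return list(merged.values())
```

&lt;p&gt;One limitation of this approach: a preprint with no DOI and its published version with one get different keys, so they stay as two records unless you add a title-based second pass.&lt;/p&gt;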

&lt;h2&gt;
  
  
  Step 5 — Keep it fresh
&lt;/h2&gt;

&lt;p&gt;Schedule the arXiv Scraper with &lt;code&gt;sortBy: newest&lt;/code&gt; to append new papers on your topic each week, so your RAG index stays current without re-scraping everything.&lt;/p&gt;
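
&lt;p&gt;To append only what is new after each scheduled run, you can compare IDs against what is already indexed. This sketch assumes &lt;code&gt;arxiv_id&lt;/code&gt; values carry a version suffix as in the sample result above (&lt;code&gt;2605.30351v1&lt;/code&gt;); &lt;code&gt;base_id&lt;/code&gt; and &lt;code&gt;new_papers&lt;/code&gt; are hypothetical helpers.&lt;/p&gt;

```python
import re


def base_id(arxiv_id):
    """Drop a trailing version suffix, e.g. '2605.30351v1' -> '2605.30351'."""
    return re.sub(r"v\d+$", "", arxiv_id)


def new_papers(existing_ids, latest_run):
    """Return papers from the latest run whose base ID is not yet indexed."""
    known = {base_id(i) for i in existing_ids}
    fresh = []
    for paper in latest_run:
        pid = base_id(paper["arxiv_id"])
        if pid not in known:
            known.add(pid)
            fresh.append(paper)
    return fresh
```

&lt;p&gt;Comparing base IDs means a revised version of a paper you already hold is skipped. If you would rather re-embed updated abstracts, compare the full versioned ID instead.&lt;/p&gt;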

&lt;h2&gt;
  
  
  Bonus — add real-world signal
&lt;/h2&gt;

&lt;p&gt;Papers tell you what researchers say; forums tell you what practitioners say. To enrich a dataset with public discussion, add the &lt;a href="https://apify.com/benthepythondev/reddit-archive-scraper" rel="noopener noreferrer"&gt;Reddit Archive Scraper&lt;/a&gt; to pull historical threads from subreddits like r/MachineLearning by date range and keyword.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;Three no-code scrapers, three short inputs, and you've got a clean, multi-source research corpus — no XML parsing, no rate-limit juggling. Start with the &lt;a href="https://apify.com/benthepythondev/arxiv-scraper" rel="noopener noreferrer"&gt;arXiv Scraper&lt;/a&gt; and layer in &lt;a href="https://apify.com/benthepythondev/openalex-scraper" rel="noopener noreferrer"&gt;OpenAlex&lt;/a&gt; and &lt;a href="https://apify.com/benthepythondev/pubmed-scraper" rel="noopener noreferrer"&gt;PubMed&lt;/a&gt; as you need breadth.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>rag</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
