DEV Community

Yuvraj Raghuvanshi

Originally published at yuvrajraghuvanshis.Medium

The Website That Looked Like It Needed Selenium (But Didn’t)

For my thesis I needed a large corpus of Hindi poetry. Hindwi is one of the better maintained Hindi literature archives on the internet. Thousands of poems, hundreds of poets, content spanning from the 8th century to contemporary writers. It had everything I needed.

I didn’t plan to spend much time on the scraper. Collect the data, move on.

That didn’t happen.

The Obvious Problem

Visit hindwi.org/poets and you’ll see a listing of poets. Scroll down and more appear. Visit an individual poet’s page and the same thing happens — poems load as you scroll. This is the pattern that makes every scraper writer reach for Selenium almost reflexively. The content isn’t in the initial HTML. JavaScript is loading it dynamically. You need a browser.

So I set up Selenium. Headless Chrome, scroll simulation, wait for elements to appear, extract content. It worked. It was also agonizingly slow.

The real problem wasn’t just speed — it was that Selenium is fundamentally impractical to parallelize. You can’t easily spin up ten browser instances and scrape ten poets simultaneously the way you can with threads making HTTP requests. Each browser instance carries its own rendering engine, memory space, and JavaScript runtime. The resource cost compounds quickly, and the coordination between instances is a nightmare. Even with aggressive parallelism, back-of-envelope math on 25,000+ poems made it clear this would take days, not hours.

There had to be a better way.

Ten Minutes in DevTools

Before writing any more Selenium code, I opened the browser DevTools Network tab and watched what actually happened when the page loaded more content.

This is always worth doing before committing to browser automation. Dynamic-looking behavior on the frontend is still, at the network level, just HTTP requests. The browser has to get the data from somewhere. The question is whether that somewhere is directly reachable.

On Hindwi, when you scroll to the bottom of the poets listing, the browser fires a request like this:

https://www.hindwi.org/PoetCollection?lang=2&pageNumber=2&Info=poet
&StartsWith=&keyword=&typeID=659186cb-44e7-4d94-8b1a-fc70f939a733
&TypeSlug=poets&contentFilter=&_=1777462454692

Plain GET request. No authentication tokens in the body, no encrypted signatures, no WebSocket handshake. Just query parameters. The _=1777462454692 at the end is a cache-busting timestamp the browser adds automatically; the server doesn't validate it, so scrapers can ignore it entirely.

The response that came back was raw HTML — not JSON, not XML. Just HTML cards containing poet names, dates, and profile links, ready to be injected into the DOM. So the website wasn’t serving a proper API, but it was serving something structured, paginated, and directly reachable over plain HTTP.
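Once the request shape is known, it can be rebuilt entirely outside the browser. A minimal sketch using only the standard library, with the parameter values copied from the captured URL above (and the cache-buster dropped):

```python
from urllib.parse import urlencode

# Rebuild the captured request URL by hand. The trailing `_` cache-buster
# is omitted since the server does not validate it.
params = {
    "lang": 2,
    "pageNumber": 2,
    "Info": "poet",
    "StartsWith": "",
    "keyword": "",
    "typeID": "659186cb-44e7-4d94-8b1a-fc70f939a733",
    "TypeSlug": "poets",
    "contentFilter": "",
}
url = "https://www.hindwi.org/PoetCollection?" + urlencode(params)
print(url)
# Fetching this URL (e.g. with requests.get(url).text) returns the same
# HTML fragment the browser injects into the DOM.
```

Incrementing pageNumber walks the whole listing; no browser state is involved at any point.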


Screenshot: DevTools Network tab showing the /PoetCollection request and its HTML response body

The next question was: how does the browser know what URL to request for page 3, page 4, page 5? Where does that information come from?

The URL Was Sitting Right There

I looked at the page source. And there they were — all of them, already embedded in the initial HTML response:

<div class="contentLoadMore">
    <div class="contentLoadMorePaging" 
         data-url="/PoetCollection?lang=2&pageNumber=3&Info=poet
                  &StartsWith=&keyword=&typeID=659186cb-44e7-4d94-8b1a-fc70f939a733
                  &TypeSlug=poets&contentFilter=">
        <svg class="screenLoader" ...></svg>
    </div>
</div>

The site pre-embeds the URLs for every subsequent page inside data-url attributes on div.contentLoadMorePaging elements. The JavaScript reads these attributes and fires the requests when you scroll into view. But from a scraper's perspective, the URLs are already there in the first response: you don't need to scroll anything. You just parse them out and fetch them directly.

This was the moment Selenium became irrelevant.

What looked like dynamic JavaScript-driven content was really just a simple pattern: fetch the initial page, extract the hidden data-url values, make those HTTP requests directly. No browser. No scroll simulation. No waiting for DOM mutations.
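That pattern fits in a few lines of BeautifulSoup. A sketch against an inline HTML fragment shaped like the snippet above (the URLs here are shortened placeholders, not the site's full query strings):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Fragment shaped like Hindwi's listing page source shown above;
# the data-url values are illustrative placeholders.
html = """
<div class="contentLoadMore">
  <div class="contentLoadMorePaging"
       data-url="/PoetCollection?lang=2&amp;pageNumber=2&amp;Info=poet"></div>
  <div class="contentLoadMorePaging"
       data-url="/PoetCollection?lang=2&amp;pageNumber=3&amp;Info=poet"></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect every pagination URL embedded in the initial response.
urls = [
    urljoin("https://www.hindwi.org", div["data-url"])
    for div in soup.select("div.contentLoadMorePaging[data-url]")
]
print(urls)
```

Each extracted URL is then just another plain GET request.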


Screenshot: page source with the contentLoadMorePaging div and data-url attribute visible

The Same Pattern, Everywhere

Once I knew what to look for, I checked the individual poet pages. Same pattern. A poet with more than 50 poems (Mona Gulati, for example) has this in her initial page response:

<div class="contentLoadMore">
    <div class="contentLoadMorePaging" 
         data-url="/PoetCollection?lang=2&pageNumber=2&info=ghazals
                  &SEO_Slug=kavita&Id=34074990-5be7-43e9-8a85-6aaa0be4833c
                  &Info=ghazal&StartsWith=a&typeID=659186cb-...
                  &contentType=kavita&sort=popularity-desc&filter=">
    </div>
    <div class="contentLoadMorePaging" 
         data-url="/PoetCollection?lang=2&pageNumber=3...">
    </div>
</div>

Both page 2 and page 3 are listed upfront in the first response. The site hands you the complete roadmap immediately. Fetch once, and you know exactly what to fetch next — no interaction, no scrolling, no waiting.

This held for dohas, quotes, and every other content type on the site. The contentLoadMorePaging pattern was consistent across all of Hindwi. Understanding it once meant the whole site was open.

Turning the Insight Into Code

The scraper that came out of this is conceptually simple. For the poet listing, hit the /PoetCollection endpoint and keep incrementing pageNumber until you get an empty response:

def _get_paginated_poet_cards(self, info, extra_params=None):
    page = 1
    while True:
        params = {"lang": 2, "pageNumber": page, "Info": info}
        if extra_params:
            params.update(extra_params)

        soup = get_soup(POETS_ENDPOINT, params=params)
        cards = soup.select("div.poetColumn")
        if not cards:
            break
        yield from cards
        page += 1

For poem lists, fetch the poet’s kavita page, parse whatever poems are already in the initial HTML, then extract and follow every data-url:

def _extract_poem_metadata(self, kavita_url):
    soup = get_soup(kavita_url)
    poems = self._parse_poem_list(soup)

    pagination_divs = soup.select("div.contentLoadMorePaging[data-url]")
    seen_urls = set()
    for div in pagination_divs:
        data_url = div.get("data-url")
        if not data_url or data_url in seen_urls:
            continue
        seen_urls.add(data_url)
        full_url = urljoin("https://www.hindwi.org", data_url)
        paginated_soup = get_soup(full_url)
        poems.extend(self._parse_poem_list(paginated_soup))
    return poems

No browser. No scroll events. Just one fetch for the initial page plus one per embedded data-url.

One thing worth mentioning about _parse_poem_list: the initial page and the dynamically loaded fragment pages use different CSS classes for their poem cards. The initial listing uses div.rt_contentBodyListItems, while the paginated HTML fragments come back using div.contentListItems.nwPoetListBody. I caught this when certain poets were returning suspiciously fewer poems than their profile pages suggested: the paginated content was being silently skipped because the selector only matched the first class. A multi-selector handles both:

cards = soup.select(
    "div.rt_contentBodyListItems, div.contentListItems.nwPoetListBody"
)

This is exactly the kind of thing that produces wrong results silently. No error, no exception — just a poem count that’s quietly lower than it should be.
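A quick way to convince yourself the multi-selector covers both shapes is to run it against a fragment containing one card of each kind (the card contents here are placeholders):

```python
from bs4 import BeautifulSoup

# One card of each shape: initial-page class and paginated-fragment class.
html = """
<div class="rt_contentBodyListItems">initial-page card</div>
<div class="contentListItems nwPoetListBody">paginated-fragment card</div>
"""
soup = BeautifulSoup(html, "html.parser")
cards = soup.select(
    "div.rt_contentBodyListItems, div.contentListItems.nwPoetListBody"
)
print(len(cards))  # 2: both card shapes matched
```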


Screenshot: terminal output showing a poet being processed with their correct poem count

Extracting the Poems

Each poem lives on its own URL. The page serves the text in Devanagari and, for many poems, a Romanized transliteration toggled by a button. In the HTML, both versions are already present — just hidden or shown depending on which toggle is active:

# Devanagari
hindi_div = soup.find("div", {"class": "pMC", "data-roman": "off"})

# Romanized (not every poem has one; guard before descending)
roman_div = soup.find("div", id="HindwiRoman")
roman_pmc = (
    roman_div.find("div", {"class": "pMC", "data-roman": "on"})
    if roman_div else None
)

The text itself is structured as <p> tags containing <span> tags per word or phrase. Joining the spans within each paragraph gives one line:

for p in hindi_div.find_all("p"):
    line = " ".join(span.get_text(strip=True) for span in p.find_all("span"))
    if line.strip():
        hindi_lines.append(line)

Both versions get saved as separate plain text files. Not every poem has a Romanized version, so the code returns None for the roman field when it doesn't exist rather than an empty list, preserving the distinction between "no Roman version" and "Roman version is blank."
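Putting the two extractions together, a hypothetical parse_poem helper illustrating that None-vs-empty distinction (the function name and return shape are mine for illustration, not the scraper's actual API):

```python
from bs4 import BeautifulSoup

def parse_poem(html):
    """Return (hindi_lines, roman_lines); roman_lines is None when the
    poem has no Romanized version at all. Sketch, not the real scraper."""
    soup = BeautifulSoup(html, "html.parser")

    def lines_from(div):
        # One line per <p>, built by joining its <span> words.
        out = []
        for p in div.find_all("p"):
            line = " ".join(s.get_text(strip=True) for s in p.find_all("span"))
            if line.strip():
                out.append(line)
        return out

    hindi_div = soup.find("div", {"class": "pMC", "data-roman": "off"})
    hindi = lines_from(hindi_div) if hindi_div else []

    roman = None
    roman_div = soup.find("div", id="HindwiRoman")
    if roman_div:
        pmc = roman_div.find("div", {"class": "pMC", "data-roman": "on"})
        if pmc:
            roman = lines_from(pmc)
    return hindi, roman

# Devanagari-only sample (placeholder transliterated words):
sample = '<div class="pMC" data-roman="off"><p><span>pahli</span> <span>pankti</span></p></div>'
print(parse_poem(sample))  # (['pahli pankti'], None)
```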


Screenshot: a poem page on Hindwi showing the Devanagari text alongside the Roman toggle

Concurrency — The Real Payoff

With Selenium out of the picture, threading became trivial. The poem scraper processes all poets concurrently with a thread pool:

def scrape_poems(self, max_workers=10):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(self._process_poet, poet, i, total)
                   for i, poet in enumerate(self.poets)]
        for future in as_completed(futures):
            future.result()

Ten threads making lightweight HTTP requests is nothing. This is what was completely impractical with Selenium — ten browser instances would have needed a dedicated server to run without thrashing. Ten request threads ran fine on a laptop, barely registering on the CPU.

Every request goes through a shared get_soup wrapper that enforces a 1-second politeness delay and retries with exponential backoff on failures. Errors at any level (a single poem, an entire poet) get logged and skipped rather than crashing the thread. The run completed cleanly over about two hours. A small number of URLs consistently returned server errors and landed in the log; everything else went through without issue.
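The wrapper itself is small. A sketch of what get_soup might look like under those constraints (the signature, retry count, and timeout are my assumptions, not the scraper's exact code):

```python
import time
import requests
from bs4 import BeautifulSoup

def get_soup(url, params=None, retries=3, delay=1.0):
    """Fetch a URL politely and return parsed HTML.
    Sketch only: signature and retry policy are assumptions."""
    for attempt in range(retries):
        time.sleep(delay)  # politeness delay before every request
        try:
            resp = requests.get(url, params=params, timeout=30)
            resp.raise_for_status()
            return BeautifulSoup(resp.text, "html.parser")
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # give up after the last retry; caller logs and skips
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
```

Because each thread goes through this one choke point, adding workers never turns into hammering the server.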

The Result

Two hours. 25,000+ poems across hundreds of poets. Devanagari and Romanized versions where available. Structured metadata including titles, URLs, slugs, and categories per poem. Around 300MB of text in total.

The dependency list tells the whole story:

beautifulsoup4==4.13.4
requests==2.32.4

No Selenium, no browser drivers, no Playwright, no headless Chrome. Just HTTP requests and HTML parsing.

What I Took From This

The instinct to reach for Selenium when you see dynamic content is understandable — it’s the safe default that definitely works. But dynamic content loading just means the browser is making HTTP requests after the initial page load. Those requests go somewhere, return something, and in most cases can be replicated directly.

The contentLoadMorePaging pattern on Hindwi is a good illustration of how often websites like this are more accessible than they appear. The site wasn't hiding anything. It was handing out pagination URLs in plain HTML, sitting in data-url attributes, ready to be read. JavaScript just happened to be the first thing reading them; a scraper can read them just as easily.

Ten minutes in the Network tab before writing any scraping code is almost always worth it. In this case, it was the difference between days of Selenium pain and a two-hour requests script that finished before lunch.

This article is for educational purposes — all ethical considerations have been addressed, including measures such as rate limiting and conducting scraping during periods of low website traffic.

This article was rewritten using AI chatbots.

April 30, 2026
