<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Saugata Roy Arghya</title>
    <description>The latest articles on DEV Community by Saugata Roy Arghya (@saugata).</description>
    <link>https://dev.to/saugata</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2183008%2Fa549501e-20ec-430f-bdb0-fa448c6601ad.jpeg</url>
      <title>DEV Community: Saugata Roy Arghya</title>
      <link>https://dev.to/saugata</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saugata"/>
    <language>en</language>
    <item>
      <title>How I Scraped our Private Google Site in a Semi-Automated way</title>
      <dc:creator>Saugata Roy Arghya</dc:creator>
      <pubDate>Sat, 05 Jul 2025 16:51:42 +0000</pubDate>
      <link>https://dev.to/saugata/how-i-scraped-our-private-google-site-in-a-semi-automated-way-1b8j</link>
      <guid>https://dev.to/saugata/how-i-scraped-our-private-google-site-in-a-semi-automated-way-1b8j</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/saugataroyarghya/Google-Site-Scrapper" rel="noopener noreferrer"&gt;Github&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ever faced a project that felt like hitting a brick wall? Mine looked straightforward, until it wasn’t: an internal Google Sites instance, access-protected and tied to a workspace domain. That meant standard scraping tools stopped dead at the login page.&lt;/p&gt;

&lt;p&gt;After countless hours of trial, error, and debugging, I finally cracked it. The solution? A hybrid approach that combines manual intervention with automation, resulting in a seamless, robust system.&lt;/p&gt;

&lt;p&gt;In this post, I’m sharing the full story: the roadblocks I faced, the strategies I tried (and why they failed), and the final working Python script, step by step.&lt;/p&gt;

&lt;h3&gt;The Core Problem: Why It's "Impossible"&lt;/h3&gt;

&lt;p&gt;Modern web applications, especially Google's, are designed to resist basic scraping. The primary roadblock is authentication: you can't just send a username and password anymore; you have to handle potential two-factor authentication (2FA), CAPTCHAs, and complex JavaScript-driven login flows. A purely automated script running on a server can't do this.&lt;/p&gt;

&lt;h3&gt;The Breakthrough: A Hybrid "Human-in-the-Loop" Architecture&lt;/h3&gt;

&lt;p&gt;The solution was to stop thinking about it as a single, fully automated task. We broke it down into a hybrid system where a human and a robot collaborate:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Manual Authentication (The Human Part):&lt;/strong&gt; A script opens a browser for a human to perform the complex login. It then saves the session "key" (the cookies).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Automated Scraping (The Robot Part):&lt;/strong&gt; A separate, powerful headless script uses that session key to do the heavy lifting—visiting every page, downloading all content, and saving it in an organized way.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Toolkit&lt;/h3&gt;

&lt;p&gt;This solution relies on a few key Python libraries. You'll want a &lt;code&gt;requirements.txt&lt;/code&gt; file with the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# requirements.txt
playwright
httpx
beautifulsoup4
lxml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
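
&lt;p&gt;One setup step that's easy to miss: installing the &lt;code&gt;playwright&lt;/code&gt; package alone isn't enough. Playwright drives real browser binaries, which are downloaded with a separate command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -r requirements.txt
playwright install chromium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;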






&lt;h3&gt;The Rocky Road: Our Initial Failures&lt;/h3&gt;

&lt;p&gt;Before arriving at the final script, we hit several walls. Our first attempt was to use a higher-level scraping library (like Crawl4AI), but it didn't offer the granular control needed for the interactive login. This forced us to use Playwright directly.&lt;/p&gt;

&lt;p&gt;This led to our first major bug: a &lt;code&gt;NotImplementedError&lt;/code&gt; on Windows. It turns out Playwright launches its browser as a subprocess, and on Windows that requires an &lt;code&gt;asyncio&lt;/code&gt; event loop with subprocess support; with the wrong event loop policy active, the launch fails.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lesson Learned:&lt;/strong&gt; Always account for platform differences. The fix was to explicitly set a compatible event loop policy for Windows right at the start of the script. This was a critical lesson in writing robust, cross-platform code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;The Implementation: Building the Scraper&lt;/h2&gt;

&lt;p&gt;Let's build the script, function by function.&lt;/p&gt;

&lt;h3&gt;Step 1: Handling Authentication&lt;/h3&gt;

&lt;p&gt;First, we need a way to perform the manual login and get the session cookies. This function opens a visible browser, lets you log in, and then saves the cookies for the automated part to use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_auth_cookies&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Launches a browser for login to get authentication cookies.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;platform&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;win32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_event_loop_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;WindowsSelectorEventLoopPolicy&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--- 👤 YOUR TURN: AUTHENTICATION ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please complete the login process in the browser window...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# We wait until the URL is back on the Google Site domain
&lt;/span&gt;            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BASE_DOMAIN&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/**&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Login successful! Extracting session cookies...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔒 Headed browser closed. Authentication complete.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Login timed out.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
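
&lt;p&gt;One optional refinement, not shown in the script above: persist the cookies to disk so you don't have to repeat the manual login on every run. A minimal sketch (the cache path is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

COOKIE_FILE = "session_cookies.json"  # illustrative cache path

def save_cookies(cookies):
    """Cache the cookie list Playwright returned."""
    with open(COOKIE_FILE, "w") as f:
        json.dump(cookies, f)

def load_cookies():
    """Return cached cookies, or None if a fresh login is needed."""
    try:
        with open(COOKIE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Google session cookies do expire, so fall back to &lt;code&gt;get_auth_cookies()&lt;/code&gt; whenever requests start coming back as login redirects.&lt;/p&gt;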



&lt;h3&gt;Step 2: The Scraper Engine&lt;/h3&gt;

&lt;p&gt;Now for the main part. This function, &lt;code&gt;scrape_site_headless&lt;/code&gt;, takes the cookies and the list of pages to visit, then iterates through them in a headless browser.&lt;/p&gt;

&lt;p&gt;The key part here is how we wait for each page to load. We use &lt;code&gt;wait_until="networkidle"&lt;/code&gt; with a generous 90-second timeout. This is the most reliable way to ensure complex pages with lots of embedded &lt;code&gt;iframes&lt;/code&gt; are fully loaded before we try to read them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Failure Note:&lt;/strong&gt; Initially, I tried using &lt;code&gt;wait_until="load"&lt;/code&gt; with a short, fixed &lt;code&gt;wait_for_timeout()&lt;/code&gt;. This failed constantly on the heavier pages like the homepage, resulting in a &lt;code&gt;TimeoutError&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Lesson Learned:&lt;/strong&gt; For complex, dynamic sites, a patient &lt;code&gt;networkidle&lt;/code&gt; wait is far more reliable than a blind, fixed delay.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_site_headless&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;initial_links&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Launches a headless browser to scrape all pages.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- 🤖 MY TURN: HEADLESS SCRAPING ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cookies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;link_info&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;initial_links&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;url_to_scrape&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;link_info&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;initial_links&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) Scraping: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;url_to_scrape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# This is the patient waiting strategy that solved the timeout errors
&lt;/span&gt;                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url_to_scrape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networkidle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;90000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="c1"&gt;# ... The content extraction logic will go here ...
&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    ❌ Failed to scrape &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;url_to_scrape&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_closed&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Step 3: Extracting Content with Context&lt;/h3&gt;

&lt;p&gt;This is where we solve the context problem. Inside the scraping loop, we'll get the full HTML, parse it with BeautifulSoup, find and download images/docs, and replace them with placeholders &lt;em&gt;before&lt;/em&gt; saving the final text.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Failure Note:&lt;/strong&gt; Finding the embedded document links was the hardest part. My first attempts to find them by looking for "Pop-out" buttons or simple link tags failed because the links are hidden deep inside &lt;code&gt;iframes&lt;/code&gt; with non-obvious selectors. I even tried a brute-force regex search on the page's internal JavaScript variables, which also proved unreliable.&lt;br&gt;
&lt;strong&gt;The breakthrough came when we created a debug script to save the page's full HTML&lt;/strong&gt; and manually inspected it. We discovered that Google embeds the direct download and open links in special &lt;code&gt;data-&lt;/code&gt; attributes (like &lt;code&gt;data-embed-download-url&lt;/code&gt;).&lt;br&gt;
&lt;strong&gt;Lesson Learned:&lt;/strong&gt; When you're stuck, stop guessing and find a way to look at the raw source your script is seeing.&lt;/p&gt;
&lt;/blockquote&gt;
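
&lt;p&gt;That debug script was nothing fancy; the point is just to dump exactly what the scraper sees so you can search it by hand. Something along these lines (the output path is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;async def dump_page_html(cookies, url, out_path="debug_page.html"):
    """Save the fully rendered HTML of one page for manual inspection."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(storage_state={"cookies": cookies})
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle", timeout=90000)
        html = await page.content()
        await browser.close()
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(html)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;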

&lt;p&gt;Here's the logic that goes inside the &lt;code&gt;try&lt;/code&gt; block of the &lt;code&gt;scrape_site_headless&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This code goes inside the `try` block in the function above
&lt;/span&gt;
&lt;span class="c1"&gt;# Get HTML from the main page and all its frames
&lt;/span&gt;&lt;span class="n"&gt;full_html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;frames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;full_html&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lxml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Prepare directories and lists
&lt;/span&gt;&lt;span class="n"&gt;page_output_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_page_folder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;images_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;doc_links_to_save&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;image_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="c1"&gt;# Find all &amp;lt;img&amp;gt; tags, download the image, and replace with a placeholder
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    🖼️  Finding and downloading images...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;img_tag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;img&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img_tag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;src&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;image_counter&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;saved_filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;download_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;images_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;image_counter&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;saved_filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;placeholder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;[IMAGE: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;saved_filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;img_tag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace_with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;placeholder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Find all embedded documents, download or link them, and replace with a placeholder
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    📎 Finding and downloading documents...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;embed_div&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data-embed-doc-id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}):&lt;/span&gt;
    &lt;span class="n"&gt;download_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embed_div&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data-embed-download-url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;download_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is a downloadable file like a PDF
&lt;/span&gt;        &lt;span class="n"&gt;saved_filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;download_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;download_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;saved_filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;placeholder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;[DOWNLOADED_DOCUMENT: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;saved_filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;embed_div&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace_with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;placeholder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# This is an interactive doc, so we save the link
&lt;/span&gt;        &lt;span class="n"&gt;open_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embed_div&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data-embed-open-url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;open_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;doc_links_to_save&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;open_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;placeholder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;[DOCUMENT_LINK: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;open_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;embed_div&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace_with&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;placeholder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Finally, get the clean text from our modified HTML
&lt;/span&gt;&lt;span class="n"&gt;page_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save the final results to files
&lt;/span&gt;&lt;span class="nf"&gt;save_final_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_links_to_save&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
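
&lt;p&gt;The snippet above also calls two small helpers, &lt;code&gt;create_page_folder&lt;/code&gt; and &lt;code&gt;save_final_content&lt;/code&gt;, that aren't shown in full. A minimal version might look like this (the folder-naming scheme is illustrative, and &lt;code&gt;os&lt;/code&gt; and &lt;code&gt;re&lt;/code&gt; are assumed to be imported at the top of the script):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from urllib.parse import urlparse

OUTPUT_ROOT = "scraped_site"  # illustrative output root

def create_page_folder(page):
    """Create an output folder named after the page's URL path."""
    path = urlparse(page.url).path.strip("/") or "home"
    slug = re.sub(r"[^a-zA-Z0-9_-]+", "_", path)
    page_dir = os.path.join(OUTPUT_ROOT, slug)
    os.makedirs(page_dir, exist_ok=True)
    return page_dir

def save_final_content(page_dir, text, doc_links):
    """Write the extracted text and any interactive document links to files."""
    with open(os.path.join(page_dir, "content.txt"), "w", encoding="utf-8") as f:
        f.write(text)
    if doc_links:
        with open(os.path.join(page_dir, "document_links.txt"), "w", encoding="utf-8") as f:
            f.write("\n".join(doc_links))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;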



&lt;h3&gt;Step 4: The Reliable File Downloader&lt;/h3&gt;

&lt;p&gt;We learned that using the browser to navigate to download links can fail. The robust solution is to use a direct HTTP client (&lt;code&gt;httpx&lt;/code&gt;) with our session cookies. This function handles that for both images and documents.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Failure Note:&lt;/strong&gt; Before switching to &lt;code&gt;httpx&lt;/code&gt;, my first attempt was to use Playwright's &lt;code&gt;page.goto()&lt;/code&gt; to download the files. This resulted in a cryptic &lt;code&gt;net::ERR_ABORTED&lt;/code&gt; error. That happens because &lt;code&gt;page.goto()&lt;/code&gt; is for &lt;em&gt;navigating to a webpage&lt;/em&gt;; when the server responds with a file instead of a page, the browser aborts the navigation.&lt;br&gt;
&lt;strong&gt;Lesson Learned:&lt;/strong&gt; Use the right tool for the job. A direct HTTP client is the correct and robust way to handle file downloads.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;unquote&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mimetypes&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;download_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_cookies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_prefix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Downloads a file directly using an HTTP client.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;file_url&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;file_url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data:image&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="n"&gt;cookie_jar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Cookies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;session_cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cookie_jar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cookie_jar&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;follow_redirects&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;120.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="c1"&gt;# Try to get the real filename from the server
&lt;/span&gt;            &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file_prefix&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content-disposition&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;fn_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;filename=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;([^&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]+)&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content-disposition&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;fn_match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;unquote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn_match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# Fallback to guessing the extension
&lt;/span&gt;                &lt;span class="n"&gt;ext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mimetypes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;guess_extension&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content-type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
                &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_prefix&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;ext&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

            &lt;span class="n"&gt;filepath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;save_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;      - Could not download &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Step 5: Putting It All Together&lt;/h3&gt;

&lt;p&gt;Finally, we need a &lt;code&gt;main&lt;/code&gt; function to orchestrate the entire process: get the cookies, find all the pages to scrape, and then kick off the headless scraper.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The functions to create folders and save text go here...
# def create_page_folder(page): ...
# def save_final_content(page_dir, text, docs): ...
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Authenticate and get cookies
&lt;/span&gt;    &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_auth_cookies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Get the list of all internal pages to scrape
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- 🤖 Getting initial links to scrape ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ... (this part uses a temporary headless browser to get the nav links) ...
&lt;/span&gt;    &lt;span class="c1"&gt;# This logic is complex, so for the article, we'll just summarize it.
&lt;/span&gt;
    &lt;span class="c1"&gt;# 3. Start the main scraping job with the cookies and link list
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;internal_links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;scrape_site_headless&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;internal_links&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;🎉 All tasks complete.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
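
&lt;p&gt;For completeness, the link-gathering step that &lt;code&gt;main()&lt;/code&gt; summarizes boils down to loading the start page with the saved cookies and collecting every same-domain anchor. A rough sketch of one way to do it, reusing the &lt;code&gt;START_URL&lt;/code&gt; and &lt;code&gt;BASE_DOMAIN&lt;/code&gt; constants from earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from urllib.parse import urljoin, urlparse

async def get_internal_links(cookies):
    """Collect unique same-domain links from the site's navigation."""
    links, seen = [], set()
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(storage_state={"cookies": cookies})
        page = await context.new_page()
        await page.goto(START_URL, wait_until="networkidle", timeout=90000)
        hrefs = await page.eval_on_selector_all("a[href]", "els =&amp;gt; els.map(e =&amp;gt; e.href)")
        await browser.close()
    for href in hrefs:
        url = urljoin(START_URL, href)
        if urlparse(url).netloc == BASE_DOMAIN and url not in seen:
            seen.add(url)
            links.append({"url": url})
    return links
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With that in place, &lt;code&gt;internal_links = await get_internal_links(cookies)&lt;/code&gt; is what would slot in where the summary comment sits in &lt;code&gt;main()&lt;/code&gt;.&lt;/p&gt;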



&lt;h3&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;Scraping modern web apps is a battle of persistence. The task might look "impossible" at first, but breaking the problem down and using the right tool for each part of the job makes it achievable. By combining a manual login with an automated scraper and using direct HTTP requests for downloads, we built a robust and reliable solution. The key was to inspect the target, understand its behavior, and adapt our strategy.&lt;/p&gt;

&lt;h3&gt;Acknowledgment&lt;/h3&gt;

&lt;p&gt;I'd like to note that this project was developed in close collaboration with an AI assistant.&lt;/p&gt;

</description>
      <category>playwright</category>
      <category>googlesite</category>
      <category>webscraping</category>
      <category>automation</category>
    </item>
    <item>
      <title>My Journey into Novel Creation Using Generative AI: Day 1</title>
      <dc:creator>Saugata Roy Arghya</dc:creator>
      <pubDate>Wed, 25 Dec 2024 20:07:08 +0000</pubDate>
      <link>https://dev.to/saugata/my-journey-into-novel-creation-using-generative-ai-day-1-2g8i</link>
      <guid>https://dev.to/saugata/my-journey-into-novel-creation-using-generative-ai-day-1-2g8i</guid>
      <description>&lt;p&gt;As someone passionate about exploring the capabilities of Generative AI, I recently embarked on a project to create literature using LLMs (Large Language Models). This is my first attempt at implementing a fully automated novel-writing pipeline, and I'm thrilled to share my experience so far. Here’s how it went on Day 1.&lt;/p&gt;

&lt;h2&gt;Choosing the Tools&lt;/h2&gt;

&lt;p&gt;For this project, I decided to use the Groq client because of the incredible speed of Groq’s LPU (Language Processing Unit). While I could have opted for Ollama on my own computer, Groq’s efficiency was the clear winner for this task. Additionally, I brainstormed the overall implementation strategy with Microsoft Copilot, which proved invaluable in refining my ideas.&lt;/p&gt;

&lt;p&gt;To test my approach, I created a Jupyter Notebook and started building the foundation of the project. My long-term plan includes deploying the process with Streamlit for a more interactive experience.&lt;/p&gt;

&lt;h2&gt;Tackling the Challenges&lt;/h2&gt;

&lt;p&gt;Generating long, coherent, and compelling novels is no small feat, especially given the token limits of LLMs. My initial attempts fell short of the desired quality, so I refined the process as follows:&lt;/p&gt;

&lt;h3&gt;1. &lt;strong&gt;Outlining the Novel&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;The first step was generating a broad outline of the novel using the LLM. This outline served as the backbone for the entire story.&lt;/p&gt;
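<br/>
&lt;p&gt;As a concrete illustration, here is roughly what this step looks like with the Groq Python SDK; the model name and prompt are placeholders, not my exact ones:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def generate_outline(premise):
    """Ask the model for a broad, high-level outline of the novel."""
    response = client.chat.completions.create(
        model="llama3-70b-8192",  # illustrative model name
        messages=[
            {"role": "system", "content": "You are a novelist planning a new book."},
            {"role": "user", "content": f"Write a broad outline for a novel about: {premise}"},
        ],
    )
    return response.choices[0].message.content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The chapter-list step below follows the same pattern with a different prompt.&lt;/p&gt;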

&lt;h3&gt;2. &lt;strong&gt;Creating Chapters and Subplots&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Next, I instructed the LLM to generate a list of chapters with detailed descriptions, including key scenes and subplots. This ensured the story had structure and direction.&lt;/p&gt;

&lt;h3&gt;3. &lt;strong&gt;Expanding Each Chapter&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Each chapter was then developed into a comprehensive story. To maintain coherence, the LLM had access to the outline and previous chapters, ensuring continuity across the narrative.&lt;/p&gt;
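
&lt;p&gt;Sketching the continuity trick: each call carries the outline plus the most recent chapters in its prompt, reusing the &lt;code&gt;client&lt;/code&gt; from the sketch above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def write_chapter(outline, previous_chapters, chapter_brief):
    """Draft the next chapter while keeping earlier context in the prompt."""
    # `client` as defined in the earlier sketch
    recent = "\n\n".join(previous_chapters[-2:])  # last two chapters for continuity
    response = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[
            {"role": "system", "content": "You are writing a novel one chapter at a time."},
            {"role": "user", "content": f"Outline:\n{outline}\n\nRecent chapters:\n{recent}\n\nWrite the next chapter: {chapter_brief}"},
        ],
    )
    return response.choices[0].message.content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;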

&lt;h3&gt;4. &lt;strong&gt;Overcoming Token Limitations&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Despite the automation, token limitations occasionally interrupted the flow. To address this, I implemented summarization with overlapping chunking. This technique allowed the model to work within its constraints while retaining contextual integrity.&lt;/p&gt;
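
&lt;p&gt;The idea behind overlapping chunking is simple: split the text so that each chunk repeats the tail of the previous one, then summarize chunk by chunk. A minimal sketch, again reusing the &lt;code&gt;client&lt;/code&gt; from earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def chunk_with_overlap(text, chunk_size=4000, overlap=400):
    """Split text into chunks that share an overlapping tail, so context isn't lost at boundaries.

    Sizes here are in characters, a rough proxy for tokens.
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def summarize_story_so_far(chapters):
    """Summarize each chunk and join the summaries into a compact running context."""
    summaries = []
    for chunk in chunk_with_overlap("\n\n".join(chapters)):
        response = client.chat.completions.create(
            model="llama3-70b-8192",
            messages=[{"role": "user", "content": f"Summarize this story excerpt:\n{chunk}"}],
        )
        summaries.append(response.choices[0].message.content)
    return "\n".join(summaries)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;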

&lt;h2&gt;The Results: A Promising Start&lt;/h2&gt;

&lt;p&gt;The automation worked beautifully! With just a single click, the entire story was generated as the LLMs "conversed" among themselves. The resulting novel was creative, with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A compelling plot and well-executed climax and twists.&lt;/li&gt;
&lt;li&gt;Vivid descriptions that set immersive scenes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, there were areas for improvement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rushed Pacing:&lt;/strong&gt; While the stage-setting was effective, transitions felt abrupt as the story moved to the next plot point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of Emotional Depth:&lt;/strong&gt; The narrative occasionally felt mechanical, missing the nuanced emotions that make characters and events truly resonate.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Plans for Day 2&lt;/h2&gt;

&lt;p&gt;To address these issues, I’ll focus on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Better Prompt Engineering:&lt;/strong&gt; Crafting prompts that encourage the LLM to add more emotional depth and smooth transitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refining Chapters:&lt;/strong&gt; Feeding the chapters back into the LLM for iterative enhancements, focusing on making them more emotionally engaging and less rushed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrating Web Search and Databases:&lt;/strong&gt; Exploring ways to incorporate real-world data into the pipeline for creating other forms of literature.&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;This is just the beginning of my journey using LLMs for creative writing. I'm excited about the potential of this technology and eager to see where it takes me. I'll be sure to share my progress and learnings along the way.&lt;/p&gt;

</description>
      <category>genai</category>
      <category>literature</category>
      <category>beginners</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
