Ever faced a project that felt like hitting a brick wall? I was working on something that looked straightforward, until it wasn't: an internal Google Sites instance, access-protected and tied to a workspace domain. That meant standard scraping tools stopped dead at the login page.
After countless hours of trial, error, and debugging, I finally cracked it. The solution? A hybrid approach that combines manual intervention with automation, resulting in a seamless, robust system.
In this post, I’m sharing the full story: the roadblocks I faced, the strategies I tried (and why they failed), and the final working Python script—step by step:
The Core Problem: Why It's "Impossible"
Modern web applications, especially from Google, are designed to prevent basic scraping. The primary roadblock is authentication. You can't just send a username and password anymore; you need to handle potential 2-Factor Authentication (2FA), captchas, and complex JavaScript-driven login flows. A purely automated script running on a server can't do this.
The Breakthrough: A Hybrid "Human-in-the-Loop" Architecture
The solution was to stop thinking about it as a single, fully-automated task. We broke it down into a hybrid system where a human and a robot collaborate:
- Manual Authentication (The Human Part): A script opens a browser for a human to perform the complex login. It then saves the session "key" (the cookies).
- Automated Scraping (The Robot Part): A separate, powerful headless script uses that session key to do the heavy lifting—visiting every page, downloading all content, and saving it in an organized way. (A minimal sketch of this handoff follows.)
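To make the handoff concrete: the "session key" is just the list of cookies Playwright captures after the login. One simple option (my own sketch; the save_cookies/load_cookies helpers and the cookies.json filename are not part of the article's code) is to persist that list to disk, so the two phases can even run as separate scripts:

import json

def save_cookies(cookies, path="cookies.json"):
    """Persist the cookie list returned by Playwright's context.cookies()."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookies, f)

def load_cookies(path="cookies.json"):
    """Reload a previously saved cookie list for the headless scraping phase."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)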
Toolkit
This solution relies on a few key Python libraries. You'll want a requirements.txt file with the following (after installing them, also run playwright install chromium once so Playwright has a browser binary to drive):
# requirements.txt
playwright
httpx
beautifulsoup4
lxml
The Rocky Road: Our Initial Failures
Before arriving at the final script, we hit several walls. Our first attempt was to use a higher-level scraping library (like Crawl4AI), but it didn't offer the granular control needed for the interactive login. This forced us to use Playwright directly.
This led to our first major bug: a NotImplementedError on Windows. It turns out the asyncio event loop the script was getting on Windows doesn't support the subprocess calls Playwright uses to launch its browser driver.
Lesson Learned: Always account for platform differences. The fix was to explicitly set an event loop policy that does support subprocesses (the Proactor loop) on Windows, right at the start of the script and before any event loop is created. This was a critical lesson in writing robust, cross-platform code.
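Concretely, the guard sits at module level, right after the imports, so it has already run by the time asyncio.run() at the bottom of the script creates the event loop. A minimal version:

import asyncio
import sys

# Playwright starts its browser driver as a subprocess; on Windows, the Proactor
# event loop supports subprocesses while the Selector loop does not.
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())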
The Implementation: Building the Scraper
Let's build the script, function by function.
Step 1: Handling Authentication
First, we need a way to perform the manual login and get the session cookies. This function opens a visible browser, lets you log in, and then saves the cookies for the automated part to use.
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError

# START_URL and BASE_DOMAIN are module-level constants for your site,
# e.g. the site's landing page URL and "sites.google.com".

async def get_auth_cookies():
    """Launches a browser for login to get authentication cookies."""
    print("--- 👤 YOUR TURN: AUTHENTICATION ---")
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(START_URL)
        print("Please complete the login process in the browser window...")
        try:
            # We wait until the URL is back on the Google Site domain
            await page.wait_for_url(f"**/{BASE_DOMAIN}/**", timeout=300000)
            print("✅ Login successful! Extracting session cookies...")
            cookies = await context.cookies()
            await browser.close()
            print("🔒 Headed browser closed. Authentication complete.")
            return cookies
        except PlaywrightTimeoutError:
            print("❌ Login timed out.")
            await browser.close()
            return None
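A side note on the session "key": this function returns the raw cookie list, which is exactly what the headless phase consumes in the next step. If you'd rather let Playwright snapshot the whole session (cookies plus local storage) to a file, its storage_state API can do that; a minimal sketch, with state.json as an assumed filename:

# Optional alternative: snapshot the whole session instead of handing cookies around.
# In get_auth_cookies(), just before `await browser.close()`:
await context.storage_state(path="state.json")

# Then the headless phase can build its context straight from that file:
context = await browser.new_context(storage_state="state.json")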
Step 2: The Scraper Engine
Now for the main part. This function, scrape_site_headless, takes the cookies and the list of pages to visit, then iterates through them in a headless browser.
The key part here is how we wait for each page to load. We use wait_until="networkidle" with a generous 90-second timeout. This is the most reliable way to ensure complex pages with lots of embedded iframes are fully loaded before we try to read them.
Failure Note: Initially, I tried using wait_until="load" with a short, fixed wait_for_timeout(). This failed constantly on the heavier pages like the homepage, resulting in a TimeoutError.
Lesson Learned: For complex, dynamic sites, a patient networkidle wait is far more reliable than a blind, fixed delay.
async def scrape_site_headless(cookies, initial_links):
    """Launches a headless browser to scrape all pages."""
    print("\n--- 🤖 MY TURN: HEADLESS SCRAPING ---")
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(storage_state={"cookies": cookies})
        for i, link_info in enumerate(initial_links):
            url_to_scrape = link_info['url']
            print(f"({i+1}/{len(initial_links)}) Scraping: {url_to_scrape}")
            page = await context.new_page()
            try:
                # This is the patient waiting strategy that solved the timeout errors
                await page.goto(url_to_scrape, wait_until="networkidle", timeout=90000)
                # ... The content extraction logic will go here ...
            except Exception as e:
                print(f"  ❌ Failed to scrape {url_to_scrape}. Error: {e}")
            finally:
                if not page.is_closed():
                    await page.close()
        await browser.close()
Step 3: Extracting Content with Context
This is where we solve the context problem. Inside the scraping loop, we'll get the full HTML, parse it with BeautifulSoup, find and download images/docs, and replace them with placeholders before saving the final text.
Failure Note: Finding the embedded document links was the hardest part. My first attempts to find them by looking for "Pop-out" buttons or simple link tags failed because the links are hidden deep inside iframes with non-obvious selectors. I even tried a brute-force regex search on the page's internal JavaScript variables, which also proved unreliable.
The breakthrough came when we created a debug script to save the page's full HTML and manually inspected it. We discovered that Google embeds the direct download and open links in special data- attributes (like data-embed-download-url).
Lesson Learned: When you're stuck, stop guessing and find a way to look at the raw source your script is seeing.
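If you get stuck the same way, a small debug helper like this one (my own sketch, not the article's exact script) dumps everything the scraper actually sees into a single file you can search for data- attributes:

async def dump_page_html(page, path="debug_page.html"):
    """Save the HTML of the main document plus every frame for manual inspection."""
    html_parts = [await page.content()]
    for frame in page.frames:
        try:
            html_parts.append(await frame.content())
        except Exception:
            pass  # detached or still-loading frames may refuse; skip them
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n<!-- FRAME BOUNDARY -->\n".join(html_parts))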
Here's the logic that goes inside the try block of the scrape_site_headless function:
# This code goes inside the `try` block in the function above

# Get HTML from the main page and all its frames
full_html = await page.content()
for frame in page.frames:
    try:
        full_html += await frame.content()
    except Exception:
        pass
soup = BeautifulSoup(full_html, "lxml")

# Prepare directories and lists
page_output_dir = create_page_folder(page)
images_dir = os.path.join(page_output_dir, "images")
os.makedirs(images_dir, exist_ok=True)
doc_links_to_save = set()
image_counter = 0

# Find all <img> tags, download the image, and replace with a placeholder
print(f"  🖼️ Finding and downloading images...")
for img_tag in soup.find_all('img'):
    src = img_tag.get('src')
    if src and src.startswith('http'):
        image_counter += 1
        saved_filename = await download_file(cookies, src, images_dir, f"image_{image_counter}")
        if saved_filename:
            placeholder = f"\n[IMAGE: {os.path.join('images', saved_filename)}]\n"
            img_tag.replace_with(placeholder)

# Find all embedded documents, download or link them, and replace with a placeholder
print(f"  📎 Finding and downloading documents...")
for embed_div in soup.find_all('div', attrs={'data-embed-doc-id': True}):
    download_url = embed_div.get('data-embed-download-url')
    if download_url:
        # This is a downloadable file like a PDF
        saved_filename = await download_file(cookies, download_url, page_output_dir, "document")
        if saved_filename:
            placeholder = f"\n[DOWNLOADED_DOCUMENT: {saved_filename}]\n"
            embed_div.replace_with(placeholder)
    else:
        # This is an interactive doc, so we save the link
        open_url = embed_div.get('data-embed-open-url')
        if open_url:
            doc_links_to_save.add(open_url)
            placeholder = f"\n[DOCUMENT_LINK: {open_url}]\n"
            embed_div.replace_with(placeholder)

# Finally, get the clean text from our modified HTML
page_text = soup.get_text(separator='\n', strip=True)

# Save the final results to files
save_final_content(page_output_dir, page_text, list(doc_links_to_save))
Step 4: The Reliable File Downloader
We learned that using the browser to navigate to download links can fail. The robust solution is to use a direct HTTP client (httpx) with our session cookies. This function handles that for both images and documents.
Failure Note: Before switching to httpx, my first attempt was to use Playwright's page.goto() to download the files. This resulted in a cryptic net::ERR_ABORTED error. This happens because page.goto() is for navigating to a webpage, not for handling file downloads, which the server provides differently.
Lesson Learned: Use the right tool for the job. A direct HTTP client is the correct and robust way to handle file downloads.
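For completeness, Playwright does have a download API of its own, built around user actions rather than navigation; a minimal sketch, where the "text=Download" selector is hypothetical and, depending on your Playwright version, the context may need accept_downloads=True:

# Playwright's native download handling (not the approach used in this project).
async with page.expect_download() as download_info:
    await page.click("text=Download")  # hypothetical button that triggers the download
download = await download_info.value
await download.save_as(download.suggested_filename)

For direct, cookie-authenticated URLs like the data-embed-download-url values above, though, a plain HTTP client is simpler and more predictable.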
import mimetypes
import os
import re
from urllib.parse import unquote

import httpx

async def download_file(session_cookies, file_url, save_dir, file_prefix):
    """Downloads a file directly using an HTTP client."""
    if not file_url or file_url.startswith('data:image'):
        return None
    # Rebuild the browser session inside httpx using the saved cookies
    cookie_jar = httpx.Cookies()
    for cookie in session_cookies:
        cookie_jar.set(cookie['name'], cookie['value'], domain=cookie['domain'])
    try:
        async with httpx.AsyncClient(cookies=cookie_jar, follow_redirects=True, timeout=120.0) as client:
            response = await client.get(file_url)
            response.raise_for_status()
            # Try to get the real filename from the server
            filename = file_prefix
            if 'content-disposition' in response.headers:
                fn_match = re.search(r'filename="([^"]+)"', response.headers['content-disposition'], re.IGNORECASE)
                if fn_match:
                    filename = unquote(fn_match.group(1))
            else:
                # Fallback to guessing the extension
                ext = mimetypes.guess_extension(response.headers.get("content-type", "")) or ""
                filename = f"{file_prefix}{ext}"
            filepath = os.path.join(save_dir, filename)
            with open(filepath, "wb") as f:
                f.write(response.content)
            return filename
    except Exception as e:
        print(f"  - Could not download {file_url}. Error: {e}")
        return None
Step 5: Putting It All Together
Finally, we need a main function to orchestrate the entire process: get the cookies, find all the pages to scrape, and then kick off the headless scraper.
# The functions to create folders and save text go here...
# def create_page_folder(page): ...
# def save_final_content(page_dir, text, docs): ...
async def main():
    # 1. Authenticate and get cookies
    cookies = await get_auth_cookies()
    if not cookies:
        return

    # 2. Get the list of all internal pages to scrape
    print("\n--- 🤖 Getting initial links to scrape ---")
    # ... (this step uses a temporary headless browser to collect the site's nav
    # links into `internal_links`; the logic is long, so it's only summarized here) ...

    # 3. Start the main scraping job with the cookies and link list
    if internal_links:
        await scrape_site_headless(cookies, internal_links)
    print("\n🎉 All tasks complete.")

if __name__ == "__main__":
    # On Windows, the event-loop policy guard from earlier sits at the very top of
    # the script, so it has already run by the time this call creates the loop.
    asyncio.run(main())
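The article keeps create_page_folder and save_final_content summarized; for readers who want something runnable, here is one possible minimal version of each (the folder-naming scheme, the OUTPUT_DIR constant, and the output filenames are my assumptions, not the original code):

import os
import re
from urllib.parse import urlparse

OUTPUT_DIR = "scraped_site"  # assumed top-level output folder

def create_page_folder(page):
    """Derive a filesystem-safe folder name from the page URL and create it."""
    slug = urlparse(page.url).path.strip("/").replace("/", "_") or "home"
    slug = re.sub(r"[^A-Za-z0-9._-]", "_", slug)
    page_dir = os.path.join(OUTPUT_DIR, slug)
    os.makedirs(page_dir, exist_ok=True)
    return page_dir

def save_final_content(page_dir, text, docs):
    """Write the extracted page text and any collected document links to disk."""
    with open(os.path.join(page_dir, "content.txt"), "w", encoding="utf-8") as f:
        f.write(text)
    if docs:
        with open(os.path.join(page_dir, "document_links.txt"), "w", encoding="utf-8") as f:
            f.write("\n".join(docs))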
Conclusion
Scraping modern web apps is a battle of persistence. A task like this can look "impossible" at first, but breaking the problem down and using the right tools for each part of the job makes it achievable. By combining a manual login with an automated scraper and using direct HTTP requests for downloads, we were able to build a robust and reliable solution. The key was to inspect the target, understand its behavior, and adapt our strategy.
Acknowledgment
I'd like to note that this project was developed in close collaboration with an AI assistant.