Ever faced a project that felt like hitting a brick wall? I was working on something that looked straightforward, until it wasn't: an internal Google Sites instance, access-protected and tied to a workspace domain. That meant standard scraping tools stopped dead at the login page.
After countless hours of trial, error, and debugging, I finally cracked it. The solution? A hybrid approach that combines manual intervention with automation, resulting in a seamless, robust system.
In this post, I’m sharing the full story: the roadblocks I faced, the strategies I tried (and why they failed), and the final working Python script—step by step:
The Core Problem: Why It's "Impossible"
Modern web applications, especially from Google, are designed to prevent basic scraping. The primary roadblock is authentication. You can't just send a username and password anymore; you need to handle potential 2-Factor Authentication (2FA), captchas, and complex JavaScript-driven login flows. A purely automated script running on a server can't do this.
The Breakthrough: A Hybrid "Human-in-the-Loop" Architecture
The solution was to stop thinking about it as a single, fully-automated task. We broke it down into a hybrid system where a human and a robot collaborate:
- Manual Authentication (The Human Part): A script opens a browser for a human to perform the complex login. It then saves the session "key" (the cookies).
- Automated Scraping (The Robot Part): A separate, powerful headless script uses that session key to do the heavy lifting—visiting every page, downloading all content, and saving it in an organized way. (A minimal sketch of this handoff follows.)
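To make the handoff concrete: the "session key" is just the list of cookies Playwright captures after the login. One simple option (my own sketch; the save_cookies/load_cookies helpers and the cookies.json filename are not part of the article's code) is to persist that list to disk, so the two phases can even run as separate scripts:

import json

def save_cookies(cookies, path="cookies.json"):
    """Persist the cookie list returned by Playwright's context.cookies()."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookies, f)

def load_cookies(path="cookies.json"):
    """Reload a previously saved cookie list for the headless scraping phase."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)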
Toolkit
This solution relies on a few key Python libraries. You'll want a requirements.txt file with the following (after installing them, also run playwright install chromium once so Playwright has a browser binary to drive):
# requirements.txt
playwright
httpx
beautifulsoup4
lxml
The Rocky Road: Our Initial Failures
Before arriving at the final script, we hit several walls. Our first attempt was to use a higher-level scraping library (like Crawl4AI), but it didn't offer the granular control needed for the interactive login. This forced us to use Playwright directly.
This led to our first major bug: a NotImplementedError on Windows. It turns out the asyncio event loop the script was getting on Windows doesn't support the subprocess calls Playwright uses to launch its browser driver.
Lesson Learned: Always account for platform differences. The fix was to explicitly set an event loop policy that does support subprocesses (the Proactor loop) on Windows, right at the start of the script and before any event loop is created. This was a critical lesson in writing robust, cross-platform code.
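Concretely, the guard sits at module level, right after the imports, so it has already run by the time asyncio.run() at the bottom of the script creates the event loop. A minimal version:

import asyncio
import sys

# Playwright starts its browser driver as a subprocess; on Windows, the Proactor
# event loop supports subprocesses while the Selector loop does not.
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())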
The Implementation: Building the Scraper
Let's build the script, function by function.
Step 1: Handling Authentication
First, we need a way to perform the manual login and get the session cookies. This function opens a visible browser, lets you log in, and then saves the cookies for the automated part to use.
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError

# START_URL and BASE_DOMAIN are module-level constants for your site,
# e.g. the site's landing page URL and "sites.google.com".

async def get_auth_cookies():
    """Launches a browser for login to get authentication cookies."""
    print("--- 👤 YOUR TURN: AUTHENTICATION ---")
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(START_URL)
        print("Please complete the login process in the browser window...")
        try:
            # We wait until the URL is back on the Google Site domain
            await page.wait_for_url(f"**/{BASE_DOMAIN}/**", timeout=300000)
            print("✅ Login successful! Extracting session cookies...")
            cookies = await context.cookies()
            await browser.close()
            print("🔒 Headed browser closed. Authentication complete.")
            return cookies
        except PlaywrightTimeoutError:
            print("❌ Login timed out.")
            await browser.close()
            return None
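A side note on the session "key": this function returns the raw cookie list, which is exactly what the headless phase consumes in the next step. If you'd rather let Playwright snapshot the whole session (cookies plus local storage) to a file, its storage_state API can do that; a minimal sketch, with state.json as an assumed filename:

# Optional alternative: snapshot the whole session instead of handing cookies around.
# In get_auth_cookies(), just before `await browser.close()`:
await context.storage_state(path="state.json")

# Then the headless phase can build its context straight from that file:
context = await browser.new_context(storage_state="state.json")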
Step 2: The Scraper Engine
Now for the main part. This function, scrape_site_headless, takes the cookies and the list of pages to visit, then iterates through them in a headless browser.
The key part here is how we wait for each page to load. We use wait_until="networkidle" with a generous 90-second timeout. This is the most reliable way to ensure complex pages with lots of embedded iframes are fully loaded before we try to read them.
Failure Note: Initially, I tried using wait_until="load" with a short, fixed wait_for_timeout(). This failed constantly on the heavier pages like the homepage, resulting in a TimeoutError.
Lesson Learned: For complex, dynamic sites, a patient networkidle wait is far more reliable than a blind, fixed delay.
async def scrape_site_headless(cookies, initial_links):
    """Launches a headless browser to scrape all pages."""
    print("\n--- 🤖 MY TURN: HEADLESS SCRAPING ---")
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(storage_state={"cookies": cookies})
        for i, link_info in enumerate(initial_links):
            url_to_scrape = link_info['url']
            print(f"({i+1}/{len(initial_links)}) Scraping: {url_to_scrape}")
            page = await context.new_page()
            try:
                # This is the patient waiting strategy that solved the timeout errors
                await page.goto(url_to_scrape, wait_until="networkidle", timeout=90000)
                # ... The content extraction logic will go here ...
            except Exception as e:
                print(f"  ❌ Failed to scrape {url_to_scrape}. Error: {e}")
            finally:
                if not page.is_closed():
                    await page.close()
        await browser.close()
Step 3: Extracting Content with Context
This is where we solve the context problem. Inside the scraping loop, we'll get the full HTML, parse it with BeautifulSoup, find and download images/docs, and replace them with placeholders before saving the final text.
Failure Note: Finding the embedded document links was the hardest part. My first attempts to find them by looking for "Pop-out" buttons or simple link tags failed because the links are hidden deep inside iframes with non-obvious selectors. I even tried a brute-force regex search on the page's internal JavaScript variables, which also proved unreliable.
The breakthrough came when we created a debug script to save the page's full HTML and manually inspected it. We discovered that Google embeds the direct download and open links in special data- attributes (like data-embed-download-url).
Lesson Learned: When you're stuck, stop guessing and find a way to look at the raw source your script is seeing.
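If you get stuck the same way, a small debug helper like this one (my own sketch, not the article's exact script) dumps everything the scraper actually sees into a single file you can search for data- attributes:

async def dump_page_html(page, path="debug_page.html"):
    """Save the HTML of the main document plus every frame for manual inspection."""
    html_parts = [await page.content()]
    for frame in page.frames:
        try:
            html_parts.append(await frame.content())
        except Exception:
            pass  # detached or still-loading frames may refuse; skip them
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n<!-- FRAME BOUNDARY -->\n".join(html_parts))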
Here's the logic that goes inside the try block of the scrape_site_headless function:
# This code goes inside the `try` block in the function above

# Get HTML from the main page and all its frames
full_html = await page.content()
for frame in page.frames:
    try:
        full_html += await frame.content()
    except Exception:
        pass
soup = BeautifulSoup(full_html, "lxml")

# Prepare directories and lists
page_output_dir = create_page_folder(page)
images_dir = os.path.join(page_output_dir, "images")
os.makedirs(images_dir, exist_ok=True)
doc_links_to_save = set()
image_counter = 0

# Find all <img> tags, download the image, and replace with a placeholder
print(f"  🖼️ Finding and downloading images...")
for img_tag in soup.find_all('img'):
    src = img_tag.get('src')
    if src and src.startswith('http'):
        image_counter += 1
        saved_filename = await download_file(cookies, src, images_dir, f"image_{image_counter}")
        if saved_filename:
            placeholder = f"\n[IMAGE: {os.path.join('images', saved_filename)}]\n"
            img_tag.replace_with(placeholder)

# Find all embedded documents, download or link them, and replace with a placeholder
print(f"  📎 Finding and downloading documents...")
for embed_div in soup.find_all('div', attrs={'data-embed-doc-id': True}):
    download_url = embed_div.get('data-embed-download-url')
    if download_url:
        # This is a downloadable file like a PDF
        saved_filename = await download_file(cookies, download_url, page_output_dir, "document")
        if saved_filename:
            placeholder = f"\n[DOWNLOADED_DOCUMENT: {saved_filename}]\n"
            embed_div.replace_with(placeholder)
    else:
        # This is an interactive doc, so we save the link
        open_url = embed_div.get('data-embed-open-url')
        if open_url:
            doc_links_to_save.add(open_url)
            placeholder = f"\n[DOCUMENT_LINK: {open_url}]\n"
            embed_div.replace_with(placeholder)

# Finally, get the clean text from our modified HTML
page_text = soup.get_text(separator='\n', strip=True)

# Save the final results to files
save_final_content(page_output_dir, page_text, list(doc_links_to_save))
Step 4: The Reliable File Downloader
We learned that using the browser to navigate to download links can fail. The robust solution is to use a direct HTTP client (httpx) with our session cookies. This function handles that for both images and documents.
Failure Note: Before switching to httpx, my first attempt was to use Playwright's page.goto() to download the files. This resulted in a cryptic net::ERR_ABORTED error. This happens because page.goto() is for navigating to a webpage, not for handling file downloads, which the server provides differently.
Lesson Learned: Use the right tool for the job. A direct HTTP client is the correct and robust way to handle file downloads.
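For completeness, Playwright does have a download API of its own, built around user actions rather than navigation; a minimal sketch, where the "text=Download" selector is hypothetical and, depending on your Playwright version, the context may need accept_downloads=True:

# Playwright's native download handling (not the approach used in this project).
async with page.expect_download() as download_info:
    await page.click("text=Download")  # hypothetical button that triggers the download
download = await download_info.value
await download.save_as(download.suggested_filename)

For direct, cookie-authenticated URLs like the data-embed-download-url values above, though, a plain HTTP client is simpler and more predictable.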
import mimetypes
import os
import re
from urllib.parse import unquote

import httpx

async def download_file(session_cookies, file_url, save_dir, file_prefix):
    """Downloads a file directly using an HTTP client."""
    if not file_url or file_url.startswith('data:image'):
        return None
    # Rebuild the browser session inside httpx using the saved cookies
    cookie_jar = httpx.Cookies()
    for cookie in session_cookies:
        cookie_jar.set(cookie['name'], cookie['value'], domain=cookie['domain'])
    try:
        async with httpx.AsyncClient(cookies=cookie_jar, follow_redirects=True, timeout=120.0) as client:
            response = await client.get(file_url)
            response.raise_for_status()
            # Try to get the real filename from the server
            filename = file_prefix
            if 'content-disposition' in response.headers:
                fn_match = re.search(r'filename="([^"]+)"', response.headers['content-disposition'], re.IGNORECASE)
                if fn_match:
                    filename = unquote(fn_match.group(1))
            else:
                # Fallback to guessing the extension
                ext = mimetypes.guess_extension(response.headers.get("content-type", "")) or ""
                filename = f"{file_prefix}{ext}"
            filepath = os.path.join(save_dir, filename)
            with open(filepath, "wb") as f:
                f.write(response.content)
            return filename
    except Exception as e:
        print(f"  - Could not download {file_url}. Error: {e}")
        return None
Step 5: Putting It All Together
Finally, we need a main function to orchestrate the entire process: get the cookies, find all the pages to scrape, and then kick off the headless scraper.
# The functions to create folders and save text go here...
# def create_page_folder(page): ...
# def save_final_content(page_dir, text, docs): ...
async def main():
    # 1. Authenticate and get cookies
    cookies = await get_auth_cookies()
    if not cookies:
        return

    # 2. Get the list of all internal pages to scrape
    print("\n--- 🤖 Getting initial links to scrape ---")
    # ... (this step uses a temporary headless browser to collect the site's nav
    # links into `internal_links`; the logic is long, so it's only summarized here) ...

    # 3. Start the main scraping job with the cookies and link list
    if internal_links:
        await scrape_site_headless(cookies, internal_links)
    print("\n🎉 All tasks complete.")

if __name__ == "__main__":
    # On Windows, the event-loop policy guard from earlier sits at the very top of
    # the script, so it has already run by the time this call creates the loop.
    asyncio.run(main())
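The article keeps create_page_folder and save_final_content summarized; for readers who want something runnable, here is one possible minimal version of each (the folder-naming scheme, the OUTPUT_DIR constant, and the output filenames are my assumptions, not the original code):

import os
import re
from urllib.parse import urlparse

OUTPUT_DIR = "scraped_site"  # assumed top-level output folder

def create_page_folder(page):
    """Derive a filesystem-safe folder name from the page URL and create it."""
    slug = urlparse(page.url).path.strip("/").replace("/", "_") or "home"
    slug = re.sub(r"[^A-Za-z0-9._-]", "_", slug)
    page_dir = os.path.join(OUTPUT_DIR, slug)
    os.makedirs(page_dir, exist_ok=True)
    return page_dir

def save_final_content(page_dir, text, docs):
    """Write the extracted page text and any collected document links to disk."""
    with open(os.path.join(page_dir, "content.txt"), "w", encoding="utf-8") as f:
        f.write(text)
    if docs:
        with open(os.path.join(page_dir, "document_links.txt"), "w", encoding="utf-8") as f:
            f.write("\n".join(docs))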
Conclusion
Scraping modern web apps is a battle of persistence. A task like this can look "impossible" at first, but breaking the problem down and using the right tools for each part of the job makes it achievable. By combining a manual login with an automated scraper and using direct HTTP requests for downloads, we were able to build a robust and reliable solution. The key was to inspect the target, understand its behavior, and adapt our strategy.
Acknowledgment
I'd like to note that this project was developed in close collaboration with an AI assistant.