We’ve all been there: waking up at 6:00 AM, frantically refreshing a hospital registration page, only to find the "Available" slots turn into "Fully Booked" in milliseconds. Traditional automation scripts often fail because modern web interfaces are dynamic, loaded with shadow DOMs, and protected by anti-bot measures.
In this tutorial, we are building a next-generation AI Agent for Browser Automation. By combining the power of Playwright, LLMs (Large Language Models), and Redis, we’ll create an agent that doesn't just click buttons—it understands the page layout to execute complex medical appointment workflows. If you've been looking into LLM automation or autonomous agents, this guide is for you.
The Architecture: How the Agent "Sees" and "Acts"
Unlike legacy Selenium scripts that rely on brittle XPaths, our agent uses an LLM to interpret the DOM structure and decide on the next best action.
```mermaid
graph TD
    A[User Goal: Book Cardiologist] --> B{Agent Brain - LLM}
    B --> C[Playwright: Capture DOM/Screenshot]
    C --> D[State Manager - Redis]
    D --> B
    B --> E[Action: Click/Type/Select]
    E --> F{Success?}
    F -- No --> B
    F -- Yes --> G[Notify User via SMS/Email]
```
Prerequisites
To follow along, you’ll need:
- Python 3.10+
- Playwright: The modern standard for reliable end-to-end testing.
- OpenAI API Key: (Or any LLM provider) to act as the agent's brain.
- Redis: To handle session persistence and task queuing.
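If you're starting from scratch, the setup looks something like this (assuming the standard `playwright`, `openai`, and `redis` packages from PyPI, and a local Redis server):

```shell
# Install the Python dependencies
pip install playwright openai redis

# Download the browser binaries Playwright will drive
playwright install chromium
```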
Step 1: Setting Up the Browser Environment
First, let's initialize our environment. We use Playwright because it handles async operations and modern web frameworks much better than its predecessors.
```python
import asyncio
from playwright.async_api import async_playwright

async def get_page_content(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)  # Keep it visible for debugging
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(url)
        # Grab only the visible text -- far fewer tokens than the raw HTML
        content = await page.evaluate("() => document.body.innerText")
        await browser.close()
        return content
```
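One caveat: `innerText` is cheap on tokens, but it throws away the selectors the planner will need later. A middle ground (a sketch, not the only way to do it) is to collect interactive elements with their attributes and trim the list to a rough character budget before it goes into the prompt. The element dictionaries below are a hypothetical shape -- adapt them to whatever your `page.evaluate` call actually returns:

```python
def summarize_elements(elements, max_chars=4000):
    """Render interactive elements as compact one-line descriptions,
    stopping once the summary would exceed a character budget
    (characters are a crude but serviceable proxy for tokens)."""
    lines = []
    used = 0
    for el in elements:
        # Hypothetical shape: el = {"tag": "button", "id": "submit", "text": "Book now"}
        desc = f'<{el["tag"]} id="{el.get("id", "")}">{el.get("text", "")[:60]}'
        if used + len(desc) > max_chars:
            break
        lines.append(desc)
        used += len(desc)
    return "\n".join(lines)
```

The resulting string is what you'd hand to the planner in Step 2 instead of raw text.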
Step 2: The LLM "Planner" Logic
The core of our AutoGPT-style agent is the reasoning loop. We feed the simplified DOM structure to the LLM and ask it to generate the next Playwright command.
```python
import json
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def generate_action(dom_structure, objective):
    prompt = f"""
You are an expert web automation agent.
Objective: {objective}
Current Page Content: {dom_structure}
Return only a JSON object with the 'action' (click, type, wait) and 'selector'.
"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    # Parse the model's reply into a dict the executor can act on
    return json.loads(response.choices[0].message.content)
```
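LLMs occasionally return a malformed plan no matter how firm the prompt is, so it's worth validating the parsed reply before acting on it. Here's a minimal guard, assuming the planner's output has already been parsed into a dict (`validate_plan` and `ALLOWED_ACTIONS` are names I'm introducing, not part of any library):

```python
ALLOWED_ACTIONS = {"click", "type", "wait"}

def validate_plan(plan):
    """Return the plan if it is a usable action dict, else None.
    A None result tells the loop to re-prompt instead of crashing."""
    if not isinstance(plan, dict):
        return None
    if plan.get("action") not in ALLOWED_ACTIONS:
        return None
    # 'click' and 'type' need a selector; 'type' also needs text
    if plan["action"] in ("click", "type") and not plan.get("selector"):
        return None
    if plan["action"] == "type" and "text" not in plan:
        return None
    return plan
```

In the main loop, a `None` result would skip execution and loop back to the planner with fresh page state.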
Step 3: Handling State with Redis 🏎️
When dealing with high-traffic registration periods, we need to ensure our agent doesn't perform redundant actions if the script restarts. We use Redis to cache the current step and session cookies.
```python
import redis

# decode_responses=True gives us str back instead of bytes
r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

def save_session_state(user_id, step_data):
    # Store the progress so the agent can resume if blocked
    # (the key expires after one hour)
    r.set(f"session:{user_id}", step_data, ex=3600)

def get_current_step(user_id):
    return r.get(f"session:{user_id}")
```
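Redis stores flat strings, so structured progress (step number, session cookies) needs to be serialized before it goes in. A simple sketch with `json` -- the `encode_step`/`decode_step` helpers and their field names are my own convention, not anything Redis-specific:

```python
import json

def encode_step(step_number, cookies, notes=""):
    """Pack structured progress into the flat string Redis expects."""
    return json.dumps({"step": step_number, "cookies": cookies, "notes": notes})

def decode_step(raw):
    """Reverse of encode_step; returns None when nothing was saved yet."""
    if raw is None:
        return None
    return json.loads(raw)
```

Usage would look like `save_session_state("user42", encode_step(3, await context.cookies()))`, with `decode_step(get_current_step("user42"))` on resume.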
Step 4: Putting it All Together (The Loop)
The agent enters a loop: it observes the page, decides what to do, executes the action via Playwright, and checks if the goal is met.
```python
async def run_appointment_agent(target_url, goal):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto(target_url)
        completed = False
        while not completed:
            # 1. Take a 'snapshot' of the interactive UI as text --
            #    the LLM needs a string, not live ElementHandles
            ui_elements = await page.eval_on_selector_all(
                "button, input, a",
                "els => els.map(el => el.outerHTML).join('\\n')",
            )
            # 2. Ask the LLM for the next move (a dict, per Step 2)
            plan = generate_action(ui_elements, goal)
            # 3. Execute the action
            if plan['action'] == 'click':
                await page.click(plan['selector'])
            elif plan['action'] == 'type':
                await page.fill(plan['selector'], plan['text'])
            # Check for the success condition
            if "Success" in await page.content():
                completed = True
                print("🚀 Appointment secured!")
        await browser.close()
```
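As written, the loop spins forever if the success string never appears. A sensible guard is to cap the number of attempts and back off between them; the delay helper below is one common pattern (exponential backoff with jitter -- the function name and defaults are mine):

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential backoff with jitter: roughly 1s, 2s, 4s... capped at 30s.
    The random factor keeps retry timing from looking machine-regular."""
    delay = min(base * (2 ** attempt), cap)
    return delay * random.uniform(0.5, 1.5)
```

In the agent, you'd replace `while not completed:` with `for attempt in range(max_attempts):` and call `await asyncio.sleep(backoff_delay(attempt))` at the end of each failed pass.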
Going Beyond the Basics: Production Patterns
Building a local script is easy, but scaling this to handle thousands of concurrent registration attempts requires sophisticated architectural patterns. You need to handle proxy rotation, CAPTCHA solving, and complex retry logic.
For more production-ready examples and advanced patterns on scaling LLM-based agents, I highly recommend checking out the WellAlly Tech Blog. They have fantastic deep dives into AI engineering that helped me optimize my agent's latency by 40%.
Challenges & Ethics 🛡️
- Rate Limiting: Always implement `asyncio.sleep()` to mimic human behavior.
- Anti-Bot: Use `playwright-stealth` to avoid being flagged by Cloudflare.
- Ethics: Ensure you are using automation within the legal bounds of the healthcare provider's Terms of Service. This project is for educational purposes on how AI Agents can simplify complex UI interactions.
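The rate-limiting point can be made concrete with a small pacing helper: rather than a fixed `asyncio.sleep(1)` between actions (which is trivially fingerprintable), jitter the delay around a human-ish mean. A sketch, with parameters I picked for illustration:

```python
import asyncio
import random

async def human_pause(mean=1.2, spread=0.4, floor=0.3):
    """Sleep for a randomized, human-ish interval between actions.
    Gaussian jitter around a mean, clamped to a minimum floor."""
    delay = max(floor, random.gauss(mean, spread))
    await asyncio.sleep(delay)
    return delay
```

Call `await human_pause()` between each click or fill in the main loop.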
Conclusion
By moving from hard-coded selectors to an LLM-driven reasoning loop, we've created an agent that is resilient to UI changes and capable of navigating complex "human-only" workflows.
What's next? You could integrate a Vision model (like GPT-4o) to actually "see" the screen instead of parsing the DOM, making it even more robust.
Are you building browser agents? Let’s chat in the comments! 👇