Every month, Major League Hacking (MLH) hosts a Global Hack Week, a free event where developers can learn new skills, build their portfolios, and connect with other hackers. MLH is the world's largest developer community, with over 5M software creators in 100+ countries. Every year, MLH hosts 1000+ events online and in person where community members come together to learn, build, and share the latest and greatest technology.
We were invited to present at Global Hack Week Cloud, where I ran a live session introducing Gaffa and how it makes building with web data significantly easier. Below are the key moments from the session.
▶️ What is web scraping and why does it matter?
I opened the session by covering the fundamentals of web scraping, the practice of extracting data from websites that don't offer an API. The internet is the world's largest database, but most of it isn't neatly packaged for developers, and scraping is getting harder every year. Modern JavaScript frameworks mean pages often don't include their data in the initial HTML response, and many sites actively detect and block automated requests. Tools like Playwright, Selenium, and BeautifulSoup have long been the go-to stack, but they require significant setup, maintenance, and infrastructure to run reliably at scale.
We also touched on the legal question that arises whenever scraping is discussed. Scraping publicly accessible data is generally accepted and widely used across industries, from price comparison to financial data feeds to AI training sets. The areas to avoid are personal data, content behind a login, and anything that puts undue load on a site, particularly smaller, nonprofit ones.
▶️ Introducing Gaffa and the API playground
The session then moved into a walkthrough of Gaffa itself. Gaffa is a web browser automation API. You send a POST request with a URL and a list of actions, and Gaffa executes them in a real, hosted browser and returns the result. No infrastructure to manage, no proxies to configure, no bot detection to fight.
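To make the request shape concrete, here is a minimal sketch of what such a call might look like. The endpoint URL, action names, and field names below are illustrative assumptions based on the session's description, not Gaffa's documented schema:

```python
import json

# Hypothetical payload for a hosted-browser API in the style described
# in the session: one URL plus a list of browser actions to execute.
API_URL = "https://api.gaffa.dev/v1/browser"  # assumed endpoint, for illustration

payload = {
    "url": "https://example.com/signup",
    "actions": [
        # Action names and fields here are assumptions, not documented API.
        {"type": "fill", "selector": "#email", "value": "me@example.com"},
        {"type": "click", "selector": "button[type=submit]"},
    ],
}

# Sending it would be a single authenticated POST, e.g. with requests:
#   response = requests.post(API_URL, json=payload,
#                            headers={"Authorization": f"Bearer {API_KEY}"})
body = json.dumps(payload)
```

The point is the shape of the interaction: one HTTP request in, one result out, with the browser itself running on hosted infrastructure.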
The API Playground is the best place to get started. It lets you build and test browser requests interactively, with built-in examples covering common scenarios. During the session, I walked through a live form-filling example, including enabling request recording so you can see exactly what the browser did.
▶️ Demo: Scraping a webpage and asking questions with AI
The first full demo showed how to scrape a Wikipedia article and use it as context for an OpenAI Q&A session. The workflow is straightforward: use Gaffa's generate_markdown action to strip a page down to clean, LLM-ready text, then pass that markdown to the model with a question.
The key insight here is that markdown is a much more efficient way to feed web content into a language model than raw HTML. It removes noise while preserving the page's structure and meaning. The demo showed the model correctly answering questions about the article content and, importantly, telling us when an answer wasn't present, a behavior we prompted for explicitly.
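The workflow can be sketched in a few lines. The `generate_markdown` action name comes from the session; the payload shape and the OpenAI call are assumptions shown for illustration:

```python
import json

# Step 1 (assumed payload shape): ask the hosted browser for clean,
# LLM-ready markdown of the page rather than raw HTML.
scrape_payload = {
    "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "actions": [{"type": "generate_markdown"}],
}

def build_prompt(markdown: str, question: str) -> str:
    """Step 2: wrap the markdown and question in a grounded prompt.

    The instruction to admit a missing answer mirrors the behavior
    prompted for explicitly in the demo.
    """
    return (
        "Answer the question using only the article below. "
        "If the answer is not present in the article, say so.\n\n"
        f"ARTICLE:\n{markdown}\n\nQUESTION: {question}"
    )

# Step 3: with the OpenAI SDK this would be roughly:
#   client.chat.completions.create(
#       model="gpt-4o-mini",
#       messages=[{"role": "user", "content": build_prompt(md, question)}],
#   )
```

Because the markdown is far smaller than the equivalent HTML, the same context window fits much more of the actual article content.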
The full example is available in the Gaffa Python Examples GitHub repository.
▶️ Demo: Extracting structured data with parse_json
The second demo is where things get particularly powerful. Rather than asking free-form questions, parse_json lets you define a data schema and have Gaffa use an AI model to extract exactly the fields you need from any page, regardless of its structure.
In the session, I used the Python Wikipedia page as an example, extracting the title, creator, release year, summary, and key features. The schema is defined as a JSON object with named fields, types, and per-field descriptions that act as mini-prompts to guide the model.
One practical detail that came up with a real client: you can use field descriptions to enforce a specific output format, for example, specifying that a country field should return a two-letter ISO Alpha-2 code rather than whatever format appears on the page. The model handles the mapping automatically.
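A schema along these lines might look as follows. The exact field layout, the `"description"` mini-prompt convention, and the action shape are assumptions reconstructed from the talk, not the documented API; the country field shows the ISO-code formatting trick:

```python
import json

# Hypothetical parse_json schema: named fields, types, and per-field
# descriptions that act as mini-prompts guiding the extraction model.
schema = {
    "title": {"type": "string", "description": "The page's main title"},
    "creator": {"type": "string", "description": "The language's original author"},
    "release_year": {"type": "integer", "description": "Year of first release"},
    "country": {
        "type": "string",
        # The description enforces an output format regardless of how
        # the country is written on the page.
        "description": "Creator's country as a two-letter ISO Alpha-2 code",
    },
}

payload = {
    "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "actions": [{"type": "parse_json", "schema": schema}],
}
serialized = json.dumps(payload)
```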
The same action also works on online PDFs. I demonstrated this against a hosted academic paper, extracting the title, abstract, author names, and institutional affiliations, the kind of data that varies in layout across every paper you'd encounter, making it almost impossible to extract reliably with traditional selectors. The result was a clean JSON object ready to insert directly into a database.
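The PDF case differs mainly in the target URL and in list-valued fields like authors. The URL below is a placeholder, and the array-typed fields and field names are illustrative assumptions:

```python
# Sketch of the PDF demo: the same hypothetical parse_json action,
# pointed at a hosted paper instead of an HTML page.
paper_schema = {
    "title": {"type": "string", "description": "Paper title"},
    "abstract": {"type": "string", "description": "Full abstract text"},
    "authors": {
        "type": "array",
        "description": "Author names in the order listed on the paper",
    },
    "affiliations": {
        "type": "array",
        "description": "Institutional affiliations matching the authors",
    },
}

pdf_payload = {
    "url": "https://example.org/papers/sample.pdf",  # placeholder URL
    "actions": [{"type": "parse_json", "schema": paper_schema}],
}
```

Because the model reads the rendered document rather than matching selectors, the same schema works across papers whose layouts differ wildly.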
Both examples are available in the Gaffa Python Examples GitHub repository.
▶️ The MLH challenges
As part of Global Hack Week, we put together a set of Gaffa challenges for attendees:
- Sign up for a Gaffa account and redeem the MLH credit code for $20 of free credits
- Send your first request in the API Playground
- Use a browser request to subscribe to our newsletter via the Gaffa demo site
- Extract the title, summary, and author from a Gaffa blog post using parse_json
If you're working through these and run into any issues, reach out via support, and we'll help you get unstuck.
Had a great experience with Gaffa! It was my first time doing browser automation, and sending that first API request to print an HTML page to PDF felt like magic. The step-by-step challenges made a complex topic really approachable.
— A Global Hack Week participant
A huge thank you to the MLH team, particularly Rosendo, for the opportunity to present to their community. It was a genuinely great audience, full of thoughtful questions about scraping legality, dynamic sites, speed, and cost. If you were in the session or are just now finding this post, thanks for watching and reading.
If you want to try everything covered in the session, sign up for a free Gaffa account and head to the API Playground to make your first request. The demo site, Python examples, and documentation are all there waiting for you.