This is a submission for the Bright Data Web Scraping Challenge: Build a Web Scraper API to Solve Business Problems
What I Built
I really enjoy the daily TLDR newsletters, but I often find them lacking when it comes to more niche news topics. The Bright Data web scraping challenge seemed like a good opportunity to tackle this problem.
The end result was a daily TLDR-style newsletter, featuring topics of my choice, scraped autonomously and written entirely by AI, delivered to my email inbox every morning.
This might also qualify under "Most Creative Use of Web Data for AI Models." However, since I'm not actually training or fine-tuning an LLM, I think the primary prompt, "Build a Web Scraper API to Solve Business Problems," is the best fit.
Demo
Full code can be found on GitHub.
Below is example output: a "niche" newsletter that was in my inbox at 8:30am this morning.
I think this qualifies as niche enough...
How I Used Bright Data
I made use of the Web Scraper API endpoints - specifically, the Reddit Post and Google News datasets.
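As a rough illustration of what a Web Scraper API call looks like, here is a sketch of triggering a dataset collection with Python's standard library. The dataset ID is a placeholder (real IDs come from your Bright Data dashboard), and the environment variable name is an assumption:

```python
"""Sketch of triggering a Bright Data Web Scraper API collection.
The dataset ID below is a placeholder; use the one from your dashboard."""
import json
import os
import urllib.request

BRIGHTDATA_TOKEN = os.environ.get("BRIGHTDATA_API_TOKEN", "<your-token>")
REDDIT_DATASET_ID = "gd_xxxxxxxx"  # placeholder dataset ID


def build_trigger_request(dataset_id: str, urls: list[str]) -> urllib.request.Request:
    """Build the POST request that kicks off a scrape for the given URLs."""
    endpoint = f"https://api.brightdata.com/datasets/v3/trigger?dataset_id={dataset_id}"
    body = json.dumps([{"url": u} for u in urls]).encode()
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Authorization": f"Bearer {BRIGHTDATA_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_trigger_request(REDDIT_DATASET_ID, ["https://www.reddit.com/r/selfhosted/"])
# urllib.request.urlopen(req)  # uncomment once real credentials are set
```

The trigger call returns a snapshot ID that you poll for results; see the Bright Data docs and the repo for the full flow.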
The Idea
The idea is simple:
- Find hot news posts on subreddit(s) of interest
- Summarize the Reddit comments (since I can't form an opinion on my own /s)
- Collect links from other news outlets for diverse viewpoints
- Send a pretty email with links, discussion summaries, and emojis for that sweet, sweet retention
- Schedule steps 1-4 to run every day before I wake up so my newsletter is ready for my morning brew
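The steps above can be sketched as a small pipeline. Every function here is a stub standing in for the real implementation in the repo (the names are illustrative, not the exact ones used):

```python
# Illustrative sketch of the daily pipeline; each stub stands in for
# the real implementation in the repo.

def fetch_hot_posts(subreddit: str) -> list[dict]:
    # Stub: the real version calls the Bright Data Reddit Post dataset.
    return [{"title": f"Hot post from r/{subreddit}", "comments": ["great", "terrible"]}]


def summarize_comments(post: dict) -> str:
    # Stub: the real version asks gpt-4o-mini to summarize the thread.
    return f"{len(post['comments'])} comments, opinions split"


def fetch_related_news(post: dict) -> list[str]:
    # Stub: the real version queries the Bright Data Google News dataset.
    return ["https://example.com/coverage"]


def get_newsletter(subreddit: str) -> list[dict]:
    posts = fetch_hot_posts(subreddit)                # step 1: hot posts
    for post in posts:
        post["summary"] = summarize_comments(post)    # step 2: summarize discussion
        post["related"] = fetch_related_news(post)    # step 3: diverse coverage
    return posts                                      # step 4 formats and emails these
```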
APIs and Data Sources
We could sign up for a Reddit developer account and fetch the extra news article links from various APIs, but using Bright Data means we don't have to go to multiple places for the data we want.
Besides the content itself, we will also need a way to summarize the Reddit discussions and a way to format the email. For this I chose OpenAI's gpt-4o-mini.
Finally, we'll want to send this to ourselves as an email. I decided on Mailgun since it's simple enough to get started and they have a generous free tier.
Design
This is the high-level design I settled on for the end-to-end workflow. There are always better solutions, but this one worked well enough and made sense to me.
For full details, see the code on GitHub.
After coding it up, I simply scheduled it to run every day at 8:30am.
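In practice a cron entry is the simplest way to schedule this, but the equivalent in-process wait is easy to compute with the standard library (the `get_newsletter` call in the comment is the hypothetical entry point):

```python
from datetime import datetime, timedelta


def seconds_until(hour: int, minute: int, now: datetime) -> float:
    """Seconds from `now` until the next occurrence of hour:minute."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # already past today's slot; wait for tomorrow
    return (target - now).total_seconds()


# In-process scheduling loop (a cron entry does the same job with less code):
# while True:
#     time.sleep(seconds_until(8, 30, datetime.now()))
#     get_newsletter("selfhosted")  # hypothetical entry point
```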
Results
I'm very happy with the outputs!
The addition of the Google News aggregation step means I often get several links to multiple outlets covering the same topic. I like being able to read different articles on the same story. Oftentimes the first article I read from Reddit won't have the full facts or is biased one way or another (shocked-pikachu.jpg).
The quality and consistency of the Bright Data API responses is impressive.
I'm also pleasantly surprised at the quality of gpt-4o-mini's HTML formatting, given how little prompt engineering I did to write the emails.
"Email template:\n{html_template}\n\nArticles w/summary:{articles}\n---\nGiven the above email template and articles with summary, format the articles into the email template. Replace all brackets with content from articles/comments. Be sure to include the article links in href. Do not respond with anything other than raw HTML, do not enclose HTML in quotes.",
If you try this out for yourself, be sure to change the subreddit when making the call to get_newsletter(). Active subreddits with several new posts a day are the best candidates for fresh, interesting newsletters.
Check out the demo on GitHub.