This is a submission for the Bright Data Web Scraping Challenge: Scrape Data from Complex, Interactive Websites
What I Built
I have developed a project that scrapes Yahoo Finance to collect the latest financial news and world stock indices. The scraped data is analyzed using AI via the OpenAI API, providing insights such as trend summaries and sentiment analysis. The results can then be automatically sent to a Telegram bot channel for convenient access.
The scraper supports scheduling, allowing it to run at regular intervals (e.g., every hour) or execute immediately when needed.
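This run-now-or-on-a-schedule behavior could be sketched roughly like this; `node-cron` matches the cron syntax used in `SCHEDULE`, and `runScraper` is a placeholder for the actual scraping entry point (both names are illustrative, not necessarily the project's own):

```javascript
// Decide how to run based on the SCHEDULE setting:
// an empty value means "run once immediately", otherwise
// the value is treated as a cron expression.
function resolveRunMode(schedule) {
  return schedule && schedule.trim() !== '' ? 'scheduled' : 'immediate';
}

async function start(runScraper, schedule = process.env.SCHEDULE) {
  if (resolveRunMode(schedule) === 'immediate') {
    await runScraper();
    return;
  }
  // Lazily require node-cron only when a schedule is actually set.
  const cron = require('node-cron');
  cron.schedule(schedule, () => runScraper()); // e.g. '0 * * * *' = every hour
}
```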
The news is scraped by dynamically scrolling down to load additional content; each story is then opened in a new browser tab to extract its full text. By default, as defined by `YAHOO_FINANCE_NEWS_PAGES_LIMIT` in the `.env` file, only the first page (the 10 most recent stories) is processed, which keeps the aggregated message sent to OpenAI compact and within the token limit. To customize this, you can raise the page limit, select a different OpenAI model, or adjust the `OPENAI_MAX_TOKENS` token limit to analyze more extensive results.
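The scroll-and-aggregate flow described above can be sketched as follows. `autoScroll` uses the standard Puppeteer `page` API; `buildAggregatedMessage` is a hypothetical helper that approximates the token budget with a simple character budget:

```javascript
// Scroll to the bottom repeatedly until no new content appears,
// so the page lazy-loads additional stories (Puppeteer page API).
async function autoScroll(page, maxRounds = 10) {
  let previousHeight = 0;
  for (let round = 0; round < maxRounds; round++) {
    const height = await page.evaluate(() => document.body.scrollHeight);
    if (height === previousHeight) break; // nothing new was loaded
    previousHeight = height;
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await new Promise((resolve) => setTimeout(resolve, 1000)); // let content load
  }
}

// Aggregate story texts into one message for OpenAI, stopping before
// a rough character budget is exceeded (a crude stand-in for tokens).
function buildAggregatedMessage(stories, maxChars) {
  const parts = [];
  let used = 0;
  for (const story of stories) {
    const text = `${story.title}\n${story.content}\n\n`;
    if (used + text.length > maxChars) break;
    parts.push(text);
    used += text.length;
  }
  return parts.join('');
}
```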
The scraper can be connected to the Bright Data Scraping Browser for advanced scraping needs, such as resolving CAPTCHAs or bypassing IP blocking. When `BROWSER_WS` is set in the configuration file, the scraper connects to a remote browser instance.
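A minimal sketch of that switch, assuming Puppeteer (the function names are illustrative):

```javascript
// Pick the browser mode from the BROWSER_WS setting: a non-empty
// WebSocket endpoint means "connect to the remote Scraping Browser",
// otherwise launch Puppeteer locally.
function browserMode(browserWs) {
  return browserWs && browserWs.trim() !== '' ? 'remote' : 'local';
}

async function getBrowser(browserWs = process.env.BROWSER_WS) {
  const puppeteer = require('puppeteer'); // loaded lazily for this sketch
  if (browserMode(browserWs) === 'remote') {
    // Bright Data exposes the Scraping Browser as a CDP WebSocket endpoint.
    return puppeteer.connect({ browserWSEndpoint: browserWs });
  }
  return puppeteer.launch({ headless: true });
}
```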
Configuration
The project can be configured using the `.env.local` file, which overrides the `.env` file.
The default project configuration lives in the `.env` file:
```
# WebSocket endpoint for a remote browser instance.
# If empty, Puppeteer will launch a local browser instance.
BROWSER_WS=
# Cron schedule for running the scraper.
# Default: Every hour. Leave empty to run the scraper immediately upon starting.
SCHEDULE='0 * * * *'
# Notification channels for sending alerts or reports.
# Supported values: 'telegram'. Use a comma-separated list for multiple channels.
NOTIFICATION_CHANNELS=
# Enable or disable AI analysis of scraping results.
# Set to 'true', '1' to enable, or leave empty to disable AI analysis.
OPENAI_ENABLED=false
# OpenAI API key for accessing AI services.
# Obtain your API key from https://platform.openai.com/account/api-keys
OPENAI_API_KEY=
# OpenAI model to use for analysis.
# Use 'gpt-3.5-turbo' for cost-efficiency or 'gpt-4o' for more advanced analysis.
OPENAI_MODEL=gpt-3.5-turbo
# Maximum number of tokens for AI responses.
# Increase this value for longer analyses, keeping in mind token limits for your selected model.
OPENAI_MAX_TOKENS=1000
# Instruction for guiding the AI behavior during analysis.
# Customize this text to match your use case or desired output format.
OPENAI_INSTRUCTION='You are a financial analyst AI. Analyze the following financial news data, summarize trends, and highlight key positive or negative sentiments about the stock market.'
# Telegram bot token for sending notifications.
# Obtain this token from BotFather in Telegram.
TELEGRAM_BOT_TOKEN=
# Telegram chat ID for sending messages.
# Use the chat ID of the recipient (user or group) where notifications should be sent.
TELEGRAM_CHAT_ID=
# Maximum number of news pages to scrape from Yahoo Finance.
# This limits how many pages of news will be processed during each scraping run.
# Example:
# - Set to 1 to scrape only the first page of news.
# - Set to 5 to scrape up to 5 pages of news (if available).
YAHOO_FINANCE_NEWS_PAGES_LIMIT=1
```
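To illustrate the override behavior, here is a deliberately tiny hand-rolled parser. The real project would more likely rely on a package such as dotenv; all names here are illustrative:

```javascript
// Parse a minimal KEY=value format: one assignment per line,
// '#' lines are comments, surrounding single quotes are stripped.
function parseEnv(text) {
  const vars = {};
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith('#')) continue;
    const eq = trimmed.indexOf('=');
    if (eq === -1) continue;
    const key = trimmed.slice(0, eq).trim();
    let value = trimmed.slice(eq + 1).trim();
    if (value.startsWith("'") && value.endsWith("'")) value = value.slice(1, -1);
    vars[key] = value;
  }
  return vars;
}

// Values from .env.local win over those from .env.
function loadConfig(envText, envLocalText = '') {
  return { ...parseEnv(envText), ...parseEnv(envLocalText) };
}
```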
How to Run the Scraper
You need to install the required node modules to set up and run the scraper. While I recommend the `pnpm` package manager, you can also use `npm` or `yarn` if preferred.
Installation
Run the following command to install the necessary packages:

```
pnpm install
```
Starting
To start the scraper with the default configuration, run:

```
pnpm start
```
Logs and Debugging
When the scraper runs, it will output detailed log messages to the console.
These logs include:
- The configuration used during execution.
- The progress and results of each scraping iteration.
GitHub Repo: link
How I Used Bright Data
Using the Bright Data Scraping Browser offers significant advantages for complex scenarios such as resolving CAPTCHAs and bypassing IP blocking; frequent access to Yahoo risked triggering IP bans and CAPTCHA challenges. In addition, the consent form only appears on a local browser instance, whereas the Bright Data Scraping Browser handles it automatically. However, it comes with certain trade-offs:
- Connection Stability: The WebSocket (WS) connection to the remote browser instance sometimes becomes interrupted or unresponsive, requiring robust error handling.
- Inconsistency: The latest Yahoo Finance news occasionally fails to load on the remote browser instance for unknown reasons, even though it works flawlessly on a local browser instance. However, the local instance cannot handle IP blocking, CAPTCHA challenges, or other advanced scenarios that the Bright Data Scraping Browser effectively manages. While the Bright Data browser is an excellent tool for web scraping, debugging and understanding what happens under the hood can be time-consuming and challenging.
- Navigation Limit: The remote browser's navigation limit can be exceeded when opening each story in a new tab, causing the process to fail.
To address these challenges, I implemented a retry system to automatically handle connection interruptions and unexpected issues with the remote browser. However, in some cases, even this is insufficient to ensure flawless operation.
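A retry helper along these lines (illustrative, not the project's exact implementation) can absorb transient remote-browser failures with exponential backoff:

```javascript
// Retry an async operation a few times with exponential backoff,
// to ride out dropped WebSocket connections and other transient
// remote-browser failures.
async function withRetry(operation, { attempts = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < attempts - 1) {
        const delay = baseDelayMs * 2 ** attempt; // 500ms, 1s, 2s, ...
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError; // every attempt failed
}
```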
Conclusion
This project showcases the potential of combining advanced web scraping techniques with AI analysis to extract meaningful insights from dynamic and complex websites like Yahoo Finance. Using tools like the Bright Data Scraping Browser, the scraper handles challenges such as CAPTCHA resolution and IP blocking, making it highly effective for large-scale and sophisticated scraping tasks.
Bright Data significantly enhances scraping capabilities, but it introduces complexities, such as debugging remote instances and managing stability.
OpenAI analysis adds value by transforming raw scraped data into actionable summaries, trends, and sentiment analysis. This, combined with flexible scheduling and Telegram notifications, makes the project a versatile tool for monitoring financial news and stock indices.
Ultimately, this submission demonstrates how modern scraping tools and AI can work together to tackle real-world data challenges, paving the way for more advanced and user-friendly applications in the future.
Thank you for taking the time to explore this project!