Kenan Can

SEO Performance Analysis Tool: AI-Powered SEO Insights with Complex Web Scraping

This is a submission for the Bright Data Web Scraping Challenge, qualifying for two prompts:

  1. Scrape Data from Complex, Interactive Websites
  2. Most Creative Use of Web Data for AI Models

What I Built

Meet the SEO Performance Analysis Tool: A comprehensive SEO analytics platform that combines complex web scraping with AI-powered insights. This tool helps SEO professionals and content creators optimize their websites by:

  • Analyzing website performance using Google Lighthouse metrics
  • Identifying and analyzing top competitors
  • Providing AI-powered content optimization suggestions
  • Generating detailed SEO reports

Key Features:

  • πŸ“Š Lighthouse Performance Analysis: Mobile and desktop performance metrics, accessibility scores, and SEO ratings
  • πŸ” Competitor Analysis: Automatic competitor detection and content comparison
  • πŸ“ Content Analysis: AI-powered structural analysis and SEO recommendations
  • πŸ“ˆ Visual Reports: Interactive charts and comparative analysis
  • πŸ€– AI Integration: Google Gemini AI for intelligent content analysis

Demo

Live Demo: SEO Performance Analysis Tool

Source Code: GitHub Repository

Screenshots

  1. Main Interface: Clean and intuitive interface for URL and keyword input

  2. Lighthouse Analysis: Complex web scraping in action, showing performance metrics

  3. Competitor Analysis: AI-powered competitor content comparison

  4. Content Analysis: Detailed content optimization recommendations

How I Used Bright Data

1. Complex Web Scraping with Scraping Browser

The tool leverages Bright Data's Scraping Browser to handle complex, JavaScript-heavy websites:

# lighthouse.py
from selenium.webdriver import Remote, ChromeOptions
from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def get_lighthouse(target_url: str):
    # SBR_WEBDRIVER is the Scraping Browser endpoint from the Bright Data dashboard
    sbr_connection = ChromiumRemoteConnection(SBR_WEBDRIVER, 'goog', 'chrome')
    driver = Remote(sbr_connection, options=ChromeOptions())

    try:
        # Navigate to PageSpeed Insights
        encoded_url = f"https://pagespeed.web.dev/analysis?url={target_url}"
        driver.get(encoded_url)

        # Challenge 1: Wait for dynamic content loading
        WebDriverWait(driver, 60).until(
            EC.presence_of_element_located((By.CLASS_NAME, "lh-report"))
        )
        # Snapshot the mobile report so we can detect when it changes
        report_text = driver.find_element(By.CLASS_NAME, "lh-report").text

        # Challenge 2: Handle tab switching for desktop analysis
        desktop_tab = WebDriverWait(driver, 20).until(
            EC.element_to_be_clickable((By.ID, "desktop_tab"))
        )
        actions = ActionChains(driver)
        actions.move_to_element(desktop_tab).click().perform()

        # Challenge 3: Verify the report content changed after the tab switch
        WebDriverWait(driver, 20).until(
            lambda d: d.find_element(By.CLASS_NAME, "lh-report").text != report_text
        )
    finally:
        driver.quit()

Challenges Overcome:

  • Handling dynamic JavaScript content on PageSpeed Insights
  • Managing complex user interactions (tab switching between mobile/desktop)
  • Extracting structured data from interactive reports
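Once the report has rendered, the category scores can be pulled straight out of the page source. Here is a minimal sketch of that extraction; the `lh-gauge__*` class names are assumptions based on Lighthouse's report markup, which Google may change at any time:

```python
from bs4 import BeautifulSoup

def parse_lighthouse_scores(report_html: str) -> dict:
    """Map category labels (Performance, SEO, ...) to their 0-100 scores."""
    soup = BeautifulSoup(report_html, "html.parser")
    scores = {}
    for gauge in soup.select(".lh-gauge__wrapper"):
        label = gauge.select_one(".lh-gauge__label")
        value = gauge.select_one(".lh-gauge__percentage")
        if label and value and value.text.strip().isdigit():
            scores[label.text.strip()] = int(value.text.strip())
    return scores

# Hypothetical fragment of a rendered report, for illustration only:
sample = """
<div class="lh-gauge__wrapper">
  <div class="lh-gauge__percentage">92</div>
  <div class="lh-gauge__label">Performance</div>
</div>
<div class="lh-gauge__wrapper">
  <div class="lh-gauge__percentage">88</div>
  <div class="lh-gauge__label">SEO</div>
</div>
"""
```

With this, `parse_lighthouse_scores(driver.page_source)` would yield a plain dict ready for charting in Streamlit.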

2. Web Unlocker for Competitor Analysis

Used Bright Data's Web Unlocker to access competitor content reliably:

# compare_pages.py - Competitor Content Access
import requests
from bs4 import BeautifulSoup

def fetch_html_content(url: str) -> tuple:
    try:
        # Ensure the URL has a proper scheme
        if not url.startswith(('http://', 'https://')):
            url = 'https://' + url

        # Bright Data API configuration
        api_url = "https://api.brightdata.com/request"
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {get_api_key('BRIGHTDATA_API_KEY')}"
        }
        payload = {
            "zone": "web_unlocker1",
            "url": url,
            "format": "raw"
        }

        # Make the request through Bright Data's Web Unlocker
        response = requests.post(api_url, json=payload, headers=headers)

        if response.status_code == 200:
            # Keep only the headings and paragraphs for the AI comparison
            soup = BeautifulSoup(response.text, 'html.parser')
            tags = soup.find_all(['h1', 'h2', 'h3', 'p'])
            return url, ''.join(str(tag) for tag in tags)
        return url, None
    except Exception as e:
        print(f"Error fetching HTML content from {url}: {e}")
        return url, None

3. SERP API for Competitor Discovery

Integrated Bright Data's SERP API to identify top competitors:

# compare_pages.py - Competitor Discovery
def get_top_competitor(keyword: str, our_domain: str) -> str:
    try:
        api_url = "https://api.brightdata.com/request"

        # Challenge: Get real-time SERP results and find a relevant competitor
        encoded_keyword = requests.utils.quote(keyword)

        payload = {
            "zone": "serp_api1",
            "url": f"https://www.google.com/search?q={encoded_keyword}",
            "format": "raw"
        }
        headers = {
            "Authorization": f"Bearer {get_api_key('BRIGHTDATA_API_KEY')}",
            "Content-Type": "application/json"
        }

        response = requests.post(api_url, json=payload, headers=headers)

        if response.status_code == 200:
            # Parse the organic results with BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')
            for result in soup.find_all("div", {"class": "g"}):
                anchor = result.find('a')
                link = anchor.get('href') if anchor else None
                # The first absolute link that isn't our own domain wins
                if link and link.startswith('http') and our_domain not in link:
                    return link
        return None

    except Exception as e:
        st.error(f"Error finding competitor: {str(e)}")
        return None

AI Integration Pipeline

  1. Data Collection: Use Bright Data services to gather:

    • Performance metrics (Lighthouse)
    • Competitor content
    • SERP data
  2. Data Processing: Structure collected data for AI analysis

  3. AI Analysis: Use Google Gemini AI to:

    • Compare content quality
    • Generate SEO recommendations
    • Analyze content structure
  4. Visualization: Present insights through Streamlit interface
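The heart of step 3 is turning the scraped pages into a single prompt for the model. A minimal sketch of that prompt assembly is below; the Gemini call itself is shown commented out, and the `google-generativeai` usage and model name are assumptions to verify against Google's docs:

```python
def build_comparison_prompt(our_content: str, competitor_content: str, keyword: str) -> str:
    """Assemble scraped content into one prompt for the LLM comparison step."""
    return (
        f"You are an SEO analyst. Target keyword: {keyword}\n\n"
        f"--- OUR PAGE ---\n{our_content}\n\n"
        f"--- COMPETITOR PAGE ---\n{competitor_content}\n\n"
        "Compare structure and content quality, then list concrete "
        "SEO recommendations for our page."
    )

# Sending it to Gemini might look like this (requires an API key):
# import google.generativeai as genai
# genai.configure(api_key=get_api_key("GEMINI_API_KEY"))
# model = genai.GenerativeModel("gemini-pro")
# response = model.generate_content(
#     build_comparison_prompt(our_html, competitor_html, "seo tools")
# )
# recommendations = response.text
```

Keeping prompt assembly in a pure function like this makes the AI step testable without spending API calls.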

Tech Stack

  • Frontend: Streamlit
  • Backend: Python
  • Scraping: Bright Data (Scraping Browser, Web Unlocker, SERP API)
  • AI: Google Gemini AI
  • Data Visualization: Plotly

Additional Prompt Qualifications

This project qualifies for two prompts:

  1. Scrape Data from Complex, Interactive Websites: The tool successfully handles JavaScript-heavy pages like PageSpeed Insights, managing dynamic content loading and complex user interactions through Bright Data's Scraping Browser.

  2. Most Creative Use of Web Data for AI Models: The project creates an innovative AI pipeline by combining web-scraped data (performance metrics, competitor content, SERP results) with Google Gemini AI to generate intelligent SEO insights and recommendations.

Team Submission

This submission was created by Kenan Can.

Thank you for reviewing my submission! Let's make SEO analysis smarter with the power of web scraping and AI.


Top comments (10)

Hilal Kara β€’

This project offers an excellent solution to the problem it addresses. Congratulations!

Kenan Can β€’

Thank you for your feedback! πŸ™

Can Uçanefe ‒

That's the spirit, and exactly what I've been looking for for a very long time... Thanks for the solution you made for all of us.

Kenan Can β€’

Thank you for your kind words! Glad it's helpful! πŸ™

Anl Egr β€’

This is great content. It gives very good tips on what to pay attention to in complex data extraction processes.

Kenan Can β€’

Thank you! Glad the insights about data extraction were helpful! πŸ™Œ

Melike Sultan Can β€’

Really enjoyed this! The combination of AI and web scraping for SEO offers great insights.

Kenan Can β€’

Thank you! Glad you found it useful! πŸ™Œ

Terraflop β€’

How would you integrate Bright Data's proxy service to target specific countries for gathering localized search engine results?

Kenan Can β€’

For country-specific targeting with Bright Data proxy, you can use the country parameter in your configuration:

payload = {
    "zone": "serp_api1",
    "country": "us",  # target country code
    "url": f"https://www.google.com/search?q={encoded_keyword}",
    "format": "raw"
}
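To make that targeting reusable, the payload construction can be factored into a small helper. This is a sketch built from the snippet above; treat the `country` parameter name and the `serp_api1` zone as values to verify against your Bright Data zone configuration:

```python
from urllib.parse import quote

def build_serp_payload(keyword: str, country: str = "us") -> dict:
    """Build a Bright Data SERP request payload targeting one country."""
    return {
        "zone": "serp_api1",
        "country": country,  # two-letter country code, e.g. "de", "jp"
        "url": f"https://www.google.com/search?q={quote(keyword)}",
        "format": "raw",
    }
```

Swapping the `country` argument then lets the same `get_top_competitor` flow surface localized competitors per market.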

