Developing robust WordPress plugins often benefits from a structured approach. This tutorial dives into creating a powerful WordPress Content Scraping and knowledge base synchronization plugin, leveraging a solid boilerplate structure. We'll explore how to automatically scan your website's pages, scrape their content, and store it efficiently in your database to build a custom knowledge base.
This guide is designed for beginners looking to understand WordPress plugin architecture and implement advanced functionalities like web scraping within the WordPress ecosystem.
Why Automate WordPress Content Scraping?
In today's dynamic web, keeping an external AI or search index updated with your site's latest content can be a challenge. Manually updating knowledge bases is tedious and prone to errors. By implementing an automated WordPress Content Scraping mechanism, you can:
- Maintain a Fresh Knowledge Base: Ensure your AI chatbots or internal search features always have the most current information.
- Improve Data Consistency: Reduce discrepancies between your live site and its stored knowledge.
- Save Time: Eliminate manual content extraction and updates.
We'll achieve this by building upon a well-organized WordPress Plugin Boilerplate.
The Foundation: WordPress Plugin Boilerplate
A boilerplate provides a standardized, organized, and extensible base for your plugin development. It separates concerns, making your code cleaner and easier to maintain. For this project, we'll imagine using a structure similar to this:
my-wp-plugin/
├── 📄 my-wp-plugin.php
├── 📁 includes/
│ ├── 📄 class-main.php # Plugin orchestrator
│ ├── 📄 class-loader.php # Hook manager
│ ├── 📄 class-activator.php # Activation handler
│ └── 📄 class-deactivator.php # Deactivation handler
├── 📁 admin/ # Backend functionality ONLY
│ ├── 📄 class-admin.php # Admin controller
│ └── 📁 partials/ # Admin HTML templates
│ └── 📄 admin-settings.php
├── 📁 public/ # Frontend functionality ONLY
│ └── 📄 class-public.php # Frontend controller
├── 📁 assets/ # ALL static files
│ ├── 📁 css/
│ └── 📁 js/
│ ├── 📄 admin.js
│ └── 📄 knowledge-base.js
└── 📁 languages/
This structure helps us organize our PHP logic (includes, admin, public) and static assets (assets).
Core Components for Your WordPress Content Scraper
1. The Admin Controller (class-admin.php)
The AICA_Plugin_Admin class is responsible for all backend functionalities, including managing settings and handling AJAX requests for our content scraper.
Here's a snippet showing how it registers settings and handles the content synchronization AJAX:
// In my-wp-plugin/admin/class-admin.php
class AICA_Plugin_Admin {
// ... other properties and constructor ...
public function register_settings() {
// ... other settings ...
register_setting($this->plugin_slug . '_prompt_settings', 'ai_agent_knowledge_file_id');
}
public function handle_sync_website_ajax() {
check_ajax_referer('admin_proposal_nonce', 'nonce');
if (!current_user_can('manage_options')) wp_send_json_error('Unauthorized');
$step = isset($_POST['sync_step']) ? sanitize_text_field(wp_unslash($_POST['sync_step'])) : 'discovery';
$db = $this->get_db_manager();
// Initialize your scraper class (e.g., AICA_Plugin_SimpleHTMLDOM)
$scraper = new AICA_Plugin_SimpleHTMLDOM(); // This would be your custom scraper implementation
if ($step === 'discovery') {
$posts = get_posts([
'post_type' => ['page', 'post'],
'posts_per_page' => -1,
'post_status' => 'publish'
]);
$pages = [];
$excluded_slugs = ['cart', 'checkout', 'my-account', 'logout', 'login', 'register', 'sample-page'];
foreach ($posts as $post) {
if (in_array($post->post_name, $excluded_slugs)) continue;
$url = get_permalink($post->ID);
$sync_data = $db->get_sync_status($url);
$pages[] = [
'url' => $url,
'title' => $post->post_title,
'status' => $sync_data ? 'Synced' : 'Pending',
'last' => $sync_data ? date_i18n(get_option('date_format'), strtotime($sync_data['last_synced'])) : '-'
];
}
wp_send_json_success(['pages' => $pages]);
}
if ($step === 'sync_page') {
$url = isset($_POST['url']) ? esc_url_raw(wp_unslash($_POST['url'])) : '';
$content = $scraper->scrape_page($url); // The actual scraping happens here
if (empty($content)) {
wp_send_json_error('Page empty or ignored');
}
$db->update_knowledge_base($url, $content); // Store the scraped content
wp_send_json_success([
'status' => 'Synced',
'last' => current_time('mysql')
]);
}
}
// ... other methods ...
}
The handle_sync_website_ajax method is crucial. It has two main steps:
- Discovery: It fetches all published WordPress pages and posts, filters out common irrelevant pages (like cart/checkout), and prepares a list of URLs to be synced.
- Sync Page: For each URL, it uses an instance of AICA_Plugin_SimpleHTMLDOM (which you would implement or integrate from a library) to scrape_page() and extract its content. This extracted content is then passed to the database manager.
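For this handler to actually run, it must be registered on WordPress's admin AJAX hooks. The boilerplate wires hooks through its loader class; a minimal sketch of the wiring (the action name sync_website_knowledge matches what the JavaScript posts, while the constructor arguments are assumptions) might look like:

```php
// Hypothetical hook wiring, e.g. performed by the loader in class-main.php.
$admin = new AICA_Plugin_Admin( $plugin_slug, $version ); // constructor args assumed

// The hook suffix must match the `action` field posted by knowledge-base.js.
add_action( 'wp_ajax_sync_website_knowledge', array( $admin, 'handle_sync_website_ajax' ) );
// Deliberately no wp_ajax_nopriv_ counterpart: only logged-in users reach this hook,
// and handle_sync_website_ajax() additionally checks the manage_options capability.
```

Registering only the `wp_ajax_` variant (and not `wp_ajax_nopriv_`) keeps the sync endpoint away from anonymous visitors.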
2. The Database Manager (class-db-manager.php)
The AICA_Plugin_DB_Manager handles all interactions with the database, including creating custom tables and storing/retrieving scraped content.
// In my-wp-plugin/includes/class-db-manager.php
class AICA_Plugin_DB_Manager {
// ... properties and constructor ...
public static function create_table( $plugin_slug ) {
global $wpdb;
$clean_slug = str_replace('-', '_', sanitize_key($plugin_slug));
$kb_table = $wpdb->prefix . $clean_slug . '_knowledge_base'; // Our knowledge base table
$charset_collate = $wpdb->get_charset_collate();
$sql_kb = "CREATE TABLE $kb_table (
id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
source_url VARCHAR(255) NOT NULL,
content LONGTEXT NOT NULL,
last_synced DATETIME NOT NULL,
PRIMARY KEY  (id),
UNIQUE KEY source_url (source_url(191))
) $charset_collate;";
require_once ABSPATH . 'wp-admin/includes/upgrade.php';
dbDelta( $sql_kb ); // Creates or updates the table
}
public function update_knowledge_base($url, $content) {
global $wpdb;
$table = $this->get_kb_table_name();
$content = trim($content); // Ensure content is clean
// Uses wpdb->replace to either insert new or update existing records based on source_url
return $wpdb->replace(
$table,
[
'source_url' => esc_url_raw($url),
'content' => $content,
'last_synced' => current_time('mysql')
],
['%s', '%s', '%s']
);
}
public function get_sync_status($url) {
global $wpdb;
$table = $this->get_kb_table_name();
return $wpdb->get_row(
$wpdb->prepare("SELECT last_synced FROM $table WHERE source_url = %s", $url),
ARRAY_A
);
}
}
The create_table method is called during plugin activation to set up our _knowledge_base table. This table stores the source_url, the content scraped from that URL, and the last_synced timestamp. The update_knowledge_base method uses wpdb->replace which is convenient for either inserting new pages or updating existing ones.
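In the boilerplate, activation-time work lives in the activator class, so create_table would typically be called from there. A minimal sketch, assuming the boilerplate's class and constant naming conventions:

```php
// In my-wp-plugin/my-wp-plugin.php
register_activation_hook( __FILE__, array( 'AICA_Plugin_Activator', 'activate' ) );

// In includes/class-activator.php
class AICA_Plugin_Activator {
    public static function activate() {
        // dbDelta() inside create_table() creates the table on first activation
        // and migrates it if the schema changes in a later plugin version.
        AICA_Plugin_DB_Manager::create_table( 'my-wp-plugin' );
    }
}
```

Because dbDelta() is idempotent, re-activating the plugin after an update safely upgrades the table rather than erroring on an existing one.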
3. The Admin User Interface (admin-settings.php & knowledge-base.js)
The admin-settings page (admin/partials/admin-settings.php) provides the user interface for initiating and monitoring the WordPress Content Scraping process.
// Snippet from admin/partials/admin-settings.php for the Knowledge Base tab
<?php elseif ($active_tab === 'prompt'): ?>
<form method="post" action="options.php">
<?php settings_fields($this->plugin_slug . '_prompt_settings'); ?>
<div class="postbox professional-box knowledge-sync-wrapper">
<div class="postbox-header" id="toggle-sync-table">
<h2 class="hndle">Website Knowledge Sync</h2>
<!-- ... UI elements for status and toggle ... -->
</div>
<div class="inside" id="sync-collapsible-content">
<div id="discovery-loader">...</div>
<div id="sync-progress-bar">...</div>
<div id="sync-success-msg">...</div>
<div id="sync-data-container">
<table class="wp-list-table widefat fixed striped">
<thead>
<tr><th>Page Source</th><th>Status</th><th>Last Sync</th></tr>
</thead>
<tbody id="sync-table-body"></tbody>
</table>
<div style="margin-top:20px;">
<button type="button" id="run-bulk-sync" class="button button-hero button-primary"></button>
<button type="button" id="start-discovery-refresh" class="button button-secondary">Refresh Map</button>
</div>
</div>
</div>
</div>
<!-- ... other AI Persona settings ... -->
<?php submit_button('Save AI Persona & Knowledge'); ?>
</form>
The JavaScript file (assets/js/admin/knowledge-base.js) handles the client-side logic:
- It triggers the discovery AJAX call on page load (if on the correct tab).
- It populates the table with discovered pages and their sync status.
- It handles the "Run Bulk Sync" button click, iterating through pending pages and making individual sync_page AJAX calls.
- It updates the progress bar and displays success messages.
// In my-wp-plugin/assets/js/admin/knowledge-base.js
jQuery(document).ready(function($) {
let pages = [];
// Auto-load Discovery on Page Load
const urlParams = new URLSearchParams(window.location.search);
if (urlParams.get('tab') === 'prompt') {
runDiscovery();
}
// Function to initiate page discovery via AJAX
function runDiscovery() {
$('#discovery-loader').show(); // Show loading spinner
$('#sync-data-container, #sync-success-msg, #sync-progress-bar').hide(); // Hide other elements
$.post(adminProposalData.ajax_url, {
action: 'sync_website_knowledge',
sync_step: 'discovery',
nonce: adminProposalData.nonce
}, function(res) {
$('#discovery-loader').hide();
if(res.success) {
pages = res.data.pages;
renderRows(pages); // Populate the table
updateSyncButtonState(); // Update sync button text
$('#sync-data-container').fadeIn();
$('#sync-summary-badge').text(pages.length + ' Pages Found').show();
}
});
}
// Bulk Sync Logic
$('#run-bulk-sync').on('click', async function() {
const $btn = $(this).prop('disabled', true).text('Syncing...');
$('#sync-progress-bar').slideDown();
$('#sync-success-msg').hide();
let syncedCount = 0;
for(let i=0; i < pages.length; i++) {
let p = pages[i];
let percent = Math.round(((i + 1) / pages.length) * 100);
$('#sync-msg').text(`Processing: ${p.title}`);
$('#sync-fill').css('width', percent + '%');
$('#sync-percent').text(percent + '%');
// Send individual AJAX request for each page sync
const response = await $.post(adminProposalData.ajax_url, {
action: 'sync_website_knowledge',
sync_step: 'sync_page',
url: p.url,
nonce: adminProposalData.nonce
});
if(response.success) {
syncedCount++;
const $row = $(`tr[data-url="${p.url}"]`);
$row.find('.status-badge').attr('class', 'status-badge status-approved').text('Synced');
$row.find('.last-sync-col').text('Just now');
pages[i].status = 'Synced'; // Update local state
}
}
// Completion UI sequence
$('#sync-progress-bar').delay(500).slideUp(400, function() {
$('#sync-success-msg').fadeIn();
updateSyncButtonState();
setTimeout(function() {
$('#sync-success-msg').fadeOut(500);
}, 4000);
});
});
// ... other UI functions like renderRows, updateSyncButtonState ...
});
This JavaScript orchestrates the entire synchronization process, providing real-time feedback to the user on the progress of the WordPress Content Scraping.
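Note that the script relies on a global adminProposalData object for the AJAX URL and nonce. That object does not exist in the browser by default; a hedged sketch of how the admin controller could provide it via wp_localize_script() (the script handle and the AICA_PLUGIN_URL constant are assumptions, not shown in the original code):

```php
// Hypothetical enqueue method in admin/class-admin.php, hooked to admin_enqueue_scripts.
public function enqueue_scripts() {
    $handle = $this->plugin_slug . '-knowledge-base';
    wp_enqueue_script(
        $handle,
        AICA_PLUGIN_URL . 'assets/js/admin/knowledge-base.js', // URL constant assumed
        array( 'jquery' ),
        $this->version,
        true // load in the footer
    );
    // Exposes adminProposalData.ajax_url and adminProposalData.nonce to the script.
    wp_localize_script( $handle, 'adminProposalData', array(
        'ajax_url' => admin_url( 'admin-ajax.php' ),
        'nonce'    => wp_create_nonce( 'admin_proposal_nonce' ), // must match check_ajax_referer()
    ) );
}
```

The nonce action string here must match the one passed to check_ajax_referer() in handle_sync_website_ajax, or every AJAX request will be rejected.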
Deep Dive: The AICA_Plugin_SimpleHTMLDOM for Content Scraping
At the heart of our WordPress Content Scraping mechanism is the AICA_Plugin_SimpleHTMLDOM class. This class acts as a wrapper for the popular simple_html_dom PHP library, making it easy to fetch and parse HTML content from URLs.
Integrating the simple_html_dom Library
First, we need to ensure the simple_html_dom.php file is loaded. In a typical WordPress plugin setup using a boilerplate, you might place external libraries in a vendor/ directory.
// In my-wp-plugin/includes/class-simplehtmldom.php
defined( 'ABSPATH' ) || exit;
class AICA_Plugin_SimpleHTMLDOM {
public function __construct() {
$this->load_lib();
}
private function load_lib() {
// Path to the simple_html_dom.php library
$lib_path = AICA_PLUGIN_PATH . 'vendor/simplehtmldom/simple_html_dom.php';
if ( file_exists( $lib_path ) ) {
require_once $lib_path;
} else {
// Handle error if library not found (e.g., log it)
error_log('Simple HTML DOM library not found at: ' . $lib_path);
}
}
// ... rest of the class methods ...
}
The load_lib() method checks if the simple_html_dom.php file exists at the expected path and then includes it using require_once. This ensures the library's functions (like str_get_html) are available when needed.
How scrape_page Works
The scrape_page method is where the actual content extraction happens. Let's break down its functionality:
// Inside AICA_Plugin_SimpleHTMLDOM class
public function scrape_page( $url ) {
// 1. Fetch the HTML content from the URL
$response = wp_remote_get( $url, ['timeout' => 15] );
if ( is_wp_error( $response ) ) return ''; // Handle errors
$html_content = wp_remote_retrieve_body( $response );
// 2. Use simple_html_dom to parse the HTML string
$html = str_get_html( $html_content );
if ( ! $html ) return ''; // Return empty if parsing fails
// 3. Early exit for 'Page Not Found' titles
$title = $html->find('title', 0) ? $html->find('title', 0)->plaintext : '';
if (stripos($title, 'Page not found') !== false) {
$html->clear();
return '';
}
// 4. Remove unwanted elements (scripts, styles, navigation, footers, etc.)
foreach($html->find('script, style, nav, footer, header, noscript, .gdpr, .cookie, #cookie-law-info-bar') as $item) {
$item->outertext = '';
}
// 5. Extract the pure plain text from the cleaned HTML
$text = $html->plaintext;
// 6. Perform a "nuclear cleanup" on the text
// Convert HTML entities to their plain-text characters
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
// Replace all types of whitespace (multiple spaces, newlines, tabs) with a single space
$text = preg_replace('/\s+/u', ' ', $text);
// Clean up memory used by simple_html_dom
$html->clear();
unset($html);
return trim($text); // Return the final, cleaned text
}
// ... other methods like sync_website_content ...
This method demonstrates a robust way to fetch, clean, and extract meaningful text content from a web page, which is essential for accurate WordPress Content Scraping.
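If you would rather not bundle a third-party library, the same clean-and-extract step can be approximated with PHP's built-in DOM extension. This is a sketch of an alternative, not part of the original class:

```php
// Alternative plain-text extraction using ext-dom instead of simple_html_dom.
function aica_extract_plaintext( $html_content ) {
    if ( '' === trim( $html_content ) ) {
        return '';
    }
    $dom = new DOMDocument();
    // Suppress parser warnings triggered by imperfect real-world markup.
    @$dom->loadHTML( $html_content, LIBXML_NOERROR | LIBXML_NOWARNING );
    $xpath = new DOMXPath( $dom );
    // Drop the same noise elements the simple_html_dom version removes.
    // iterator_to_array() snapshots the list before nodes are detached.
    foreach ( iterator_to_array( $xpath->query( '//script|//style|//nav|//footer|//header|//noscript' ) ) as $node ) {
        $node->parentNode->removeChild( $node );
    }
    $text = html_entity_decode( $dom->textContent, ENT_QUOTES, 'UTF-8' );
    return trim( preg_replace( '/\s+/u', ' ', $text ) );
}
```

The trade-off: DOMDocument ships with virtually every PHP install, but simple_html_dom's CSS-style selectors (like the class selectors `.gdpr` and `.cookie` above) are more convenient than hand-written XPath.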
Key Takeaways
- Structured Development: Using a boilerplate significantly improves plugin organization and maintainability.
- AJAX for Background Tasks: WordPress AJAX is ideal for long-running tasks like content scraping, preventing UI freezes.
- Custom Database Tables: For storing custom data like scraped content, creating your own tables is often more efficient than using wp_options or wp_posts.
- Client-Side Feedback: Providing progress bars and status updates through JavaScript enhances the user experience for intensive operations.
- Targeted Scraping: By querying specific post_type values and filtering slugs, you can control exactly which content your WordPress Content Scraping targets.
- Simple HTML DOM: A powerful PHP library that simplifies parsing and manipulating HTML content for scraping purposes.
Next Steps
This framework provides a solid foundation for your WordPress Content Scraping plugin. You could extend it by:
- Implementing a more sophisticated AICA_Plugin_SimpleHTMLDOM class to handle various content structures or even external websites.
- Adding scheduling options to run the sync automatically using WordPress Cron.
- Integrating with external AI services to process the scraped content for advanced knowledge base features.
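The scheduling idea above could be wired with WP-Cron along these lines (the hook name aica_daily_knowledge_sync is a placeholder, not something defined by the plugin):

```php
// Hypothetical daily sync via WP-Cron.
register_activation_hook( __FILE__, function () {
    if ( ! wp_next_scheduled( 'aica_daily_knowledge_sync' ) ) {
        wp_schedule_event( time(), 'daily', 'aica_daily_knowledge_sync' );
    }
} );
register_deactivation_hook( __FILE__, function () {
    // Always clean up the schedule when the plugin is deactivated.
    wp_clear_scheduled_hook( 'aica_daily_knowledge_sync' );
} );

add_action( 'aica_daily_knowledge_sync', function () {
    $scraper = new AICA_Plugin_SimpleHTMLDOM();
    $db      = new AICA_Plugin_DB_Manager(); // constructor signature assumed
    $posts   = get_posts( array(
        'post_type'      => array( 'page', 'post' ),
        'posts_per_page' => -1,
        'post_status'    => 'publish',
    ) );
    foreach ( $posts as $post ) {
        $url     = get_permalink( $post->ID );
        $content = $scraper->scrape_page( $url );
        if ( ! empty( $content ) ) {
            $db->update_knowledge_base( $url, $content );
        }
    }
} );
```

Keep in mind that WP-Cron only fires on page visits; for a low-traffic site you may want to disable it and trigger wp-cron.php from a real system cron job instead.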
What are your thoughts on building an automated content scraper for WordPress? Share your ideas and experiences in the comments below!