The Discovery: An Undocumented JSON API
The official HKEx website is a maze of JavaScript and session-based navigation. Scraping it directly with tools like Selenium would be slow, brittle, and a constant maintenance headache. I knew there had to be a better way.
After some digging in my browser's network tab while using the official search portal, I found a hidden gem: an undocumented JSON API. The website's frontend was making calls to a titleSearchServlet.do endpoint that returned clean, structured JSON data. This was the key. By mimicking these API calls, I could bypass the browser entirely and get the data directly from the source.
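Mimicking those calls is straightforward with a `requests.Session` (so the server's cookies persist). A minimal sketch — the servlet path comes straight from the network tab, but the host and query-parameter names below are my stand-ins for the real, undocumented ones:

```python
import requests

# Sketch of calling the hidden endpoint directly. SEARCH_URL's host and the
# parameter names ("sortDir", "rowRange", "category") are assumptions here.
SEARCH_URL = "https://www1.hkexnews.hk/search/titleSearchServlet.do"

def build_search_params(row_range: str = "1-100", sort_dir: str = "0") -> dict:
    """Assemble the query string for one paginated call (names assumed)."""
    return {"sortDir": sort_dir, "rowRange": row_range, "category": "0"}

def fetch_page(session: requests.Session, row_range: str) -> dict:
    """Fetch one page of filing metadata as parsed JSON."""
    resp = session.get(SEARCH_URL, params=build_search_params(row_range))
    resp.raise_for_status()
    return resp.json()
```

Because the session carries cookies between calls, the server treats the script exactly like the browser it is imitating.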
The Stack: Python, Requests, and SurrealDB
With the API discovered, I chose a simple but powerful stack:
- Python: For its rich data processing ecosystem and ease of use.
- Requests: A straightforward library for making the necessary HTTP calls to the HKEx API.
- SurrealDB: A multi-model database that was a perfect fit for this project. I could store the filing metadata as structured documents and, more importantly, create graph relationships between companies and filings.
Architecture: A Two-Phase Pipeline
The process is broken down into two main phases: scraping the metadata and then enriching it with the full document content.
Phase 1: Scraping Filing Metadata
The first step is to fetch the metadata for every filing. Since the HKEx API limits searches to one-month intervals when no stock code is specified, I had to generate monthly date chunks and iterate through them.
Here's how the generate_monthly_chunks function works:
```python
from calendar import monthrange
from datetime import datetime
from typing import List, Tuple

def generate_monthly_chunks(date_from: datetime, date_to: datetime) -> List[Tuple[datetime, datetime]]:
    """Generate (chunk_from, chunk_to) pairs in 1-month increments (newest first)."""
    chunks: List[Tuple[datetime, datetime]] = []
    cursor = datetime(date_to.year, date_to.month, 1)
    while cursor >= datetime(date_from.year, date_from.month, 1):
        chunk_start = max(cursor, date_from)
        _, last_day = monthrange(cursor.year, cursor.month)
        chunk_end = min(datetime(cursor.year, cursor.month, last_day), date_to)
        chunks.append((chunk_start, chunk_end))
        # Step the cursor back to the first day of the previous month.
        if cursor.month == 1:
            cursor = datetime(cursor.year - 1, 12, 1)
        else:
            cursor = datetime(cursor.year, cursor.month - 1, 1)
    return chunks

# Example: a window from 2025-11-15 to 2026-01-20 yields, newest first:
#   (2026-01-01, 2026-01-20), (2025-12-01, 2025-12-31), (2025-11-15, 2025-11-30)
```
For each chunk, the fetch_chunk_via_api function first sends a POST request to the search page to set the date range in the server's session, then makes paginated GET requests to the JSON API endpoint to retrieve all the records.
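The pagination loop at the heart of that function can be sketched as a pure helper (`paginate` is a hypothetical name; the session POST and the actual GET wiring from the snippet above are omitted so the logic runs on its own):

```python
from typing import Callable, Dict, List

def paginate(fetch_page: Callable[[int, int], List[Dict]], page_size: int = 100) -> List[Dict]:
    """Collect records page by page until a short page signals the end."""
    records: List[Dict] = []
    start = 1
    while True:
        # fetch_page(a, b) stands in for a GET with rowRange "a-b".
        page = fetch_page(start, start + page_size - 1)
        records.extend(page)
        if len(page) < page_size:  # a short (or empty) page means we're done
            return records
        start += page_size
```

Stopping on the first short page avoids a separate "total count" request, which the undocumented API may or may not expose reliably.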
The raw JSON from the API looks like this:
```json
{
  "FILE_INFO": "53KB",
  "NEWS_ID": "12022263",
  "STOCK_NAME": "ZHONGTAIFUTURES",
  "STOCK_CODE": "01461",
  "TITLE": "Articles of Association",
  "FILE_TYPE": "PDF",
  "DATE_TIME": "11/02/2026 19:10",
  "FILE_LINK": "/listedco/listconews/sehk/2026/0211/2026021100854.pdf"
}
```
This data is parsed, cleaned, and stored in a SCHEMAFULL table in SurrealDB called exchange_filing.
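The cleaning step amounts to type-normalising each field. A sketch — the field names match the raw JSON above, but the output shape and the base host are illustrative, not the project's exact schema:

```python
from datetime import datetime

BASE_HOST = "https://www1.hkexnews.hk"  # assumed document host

def clean_record(raw: dict) -> dict:
    """Normalise one raw API record before storing it in exchange_filing."""
    return {
        "news_id": raw["NEWS_ID"],
        "stock_code": raw["STOCK_CODE"],
        "stock_name": raw["STOCK_NAME"].strip(),
        "title": raw["TITLE"].strip(),
        "file_type": raw["FILE_TYPE"].lower(),
        "file_size": raw["FILE_INFO"],
        # DATE_TIME is day-first: "11/02/2026 19:10" is 11 February 2026.
        "filed_at": datetime.strptime(raw["DATE_TIME"], "%d/%m/%Y %H:%M"),
        "file_url": BASE_HOST + raw["FILE_LINK"],
    }
```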
Phase 2: Downloading and Extracting Content
With the metadata in place, the next step is to download the actual filing documents (PDF, HTML, or Excel) and extract their content. This is done in parallel using a ThreadPoolExecutor for efficiency.
```python
def _download_document(url: str, filing_id: str) -> Tuple[bytes, int, str]:
    # ... (implementation to download the document)
```
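The fan-out itself follows the standard `ThreadPoolExecutor` pattern. In this sketch, `download` is a stub standing in for `_download_document` so the example runs without touching the network:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List, Tuple

def download(url: str, filing_id: str) -> Tuple[str, int]:
    # Stub: a real implementation would GET the URL and return the bytes.
    return filing_id, len(url)

def download_all(jobs: List[Tuple[str, str]], max_workers: int = 8) -> Dict[str, int]:
    """Download filings in parallel, collecting results keyed by filing id."""
    results: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download, url, fid): fid for url, fid in jobs}
        for fut in as_completed(futures):
            fid, size = fut.result()
            results[fid] = size
    return results
```

Threads (rather than processes) are the right fit here because each worker spends most of its time blocked on network I/O.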
Once downloaded, the text and any structured tables are extracted using PyMuPDF for PDFs and BeautifulSoup for HTML. This extracted content is then saved back to the corresponding record in the exchange_filing table.
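For the HTML branch, extraction boils down to collecting the text nodes. A stdlib-only sketch — the project itself uses BeautifulSoup; this version avoids third-party imports so it runs anywhere:

```python
from html.parser import HTMLParser
from typing import List

class TextExtractor(HTMLParser):
    """Accumulate non-empty text nodes while parsing."""
    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())

def extract_html_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```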
The Graph Model: Connecting the Dots
This is where SurrealDB's multi-model capabilities shine. I wanted to not only store the filings but also understand the relationships between them. I defined two types of graph edges using TYPE RELATION:
- (company)-[has_filing]->(filing): This links a company to the filings it has released.
- (filing)-[references_filing]->(company): This links a filing to other companies mentioned in its title.
This simple graph model allows for powerful queries, such as "find all filings from company X that mention company Y," which would be complex and slow to execute in a traditional relational database.
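That query can be expressed as a single graph traversal. An illustrative sketch assembled as a Python string — the record ids are placeholders, and I've filtered on the filing title here (since the reference edges are derived from titles) to keep the example simple; syntax follows SurrealDB's graph `SELECT` with a destination filter:

```python
# Hypothetical query: filings released by one company whose titles mention
# another company's name. $company and $other_name are bound parameters.
QUERY = """
SELECT ->has_filing->(exchange_filing WHERE title CONTAINS $other_name) AS filings
FROM $company;
"""

def build_query_params(company_id: str, other_name: str) -> dict:
    """Bind the placeholders for the traversal above."""
    return {"company": company_id, "other_name": other_name}
```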
Here's a snippet of the code that creates the has_filing edges:
```python
def link_filings_to_companies(ticker_set: set | None = None) -> int:
    # ...
    log("Linking filings to companies via graph edges...")
    # ...
    update_query = (
        f"UPDATE {COMPANY_TABLE} SET filings += {filing_id}; "
        f"RELATE {company_id}->has_filing->{filing_id} CONTENT {{ at: {filing_date} }};"
    )
    # ...
```
From a Single Script to an Open-Source Project
The initial version of this tool was a single, 1500-line Python script. While functional, it was difficult to maintain and not very user-friendly. I decided to refactor it into a proper, modular open-source project.
This involved:
- Decoupling Dependencies: Removing hardcoded dependencies on my private company data table, making the graph linking feature optional and configurable.
- Modularization: Breaking the monolithic script into logical modules (api.py, db.py, extractor.py, etc.).
- Packaging: Creating a pyproject.toml file to make the project installable via pip.
- CLI: Building a user-friendly command-line interface with argparse.
- Documentation: Writing a comprehensive README.md with installation instructions, configuration details, and usage examples.
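For reference, a minimal pyproject.toml along those lines might look like this (the version, dependency pins, and console-script entry point are illustrative, not the project's actual metadata):

```toml
[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"

[project]
name = "hkex-filing-scraper"
version = "0.1.0"                      # placeholder
requires-python = ">=3.10"
dependencies = [
    "requests",
    "surrealdb",
    "pymupdf",
    "beautifulsoup4",
]

[project.scripts]
hkex-scraper = "hkex_filing_scraper.cli:main"   # assumed entry point
```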
Conclusion and Next Steps
The result is hkex-filing-scraper, a robust and easy-to-use tool for building a comprehensive database of HKEx filings. It's now available on GitHub and installable via PyPI.
GitHub Repo: https://github.com/simonplmak-cloud/hkex-filing-scraper
This project was a fun journey into reverse engineering, data pipeline design, and the power of multi-model databases. Future plans could include adding support for other exchanges like the SEC EDGAR database or building a web interface to explore the data.
I encourage you to check out the repository, try it out for your own financial analysis projects, and contribute if you find it useful. Feedback is always welcome!