Scalable Keyword Marking System
Building a real-time keyword highlighting system for millions of users and billions of keywords presents significant architectural challenges. This post addresses two of them: database partitioning and efficient keyword matching.
Problem Overview
- 1 Million Users: Each user can store up to 10,000 keywords.
- 10 Billion Keywords: the assumed total system capacity (1 million users × 10,000 keywords each).
- Real-time Requirement: Sub-second retrieval and highlighting of all user keywords on any given page, including overlapping matches (e.g., "Dhaka," "Dhaka City").
Source code
https://github.com/asadpstu/keyword_match
Database: PostgreSQL
You can run it as a Docker container. Just copy and paste the command below into your terminal, and within a few moments your PostgreSQL database will be up and running (if the my_pg_network network does not exist yet, create it first with docker network create my_pg_network):
docker run -d --name my-postgres-db --network my_pg_network -p 5434:5432 -e POSTGRES_DB=mydatabase -e POSTGRES_USER=myuser -e POSTGRES_PASSWORD=mypassword -v pg_data:/var/lib/postgresql/data postgres:16-alpine
Download DBeaver, or use your favorite database tool. Then create the parent table, hash-partitioned by user_id:
`CREATE TABLE keywords (user_id BIGINT NOT NULL, keyword TEXT NOT NULL, PRIMARY KEY (user_id, keyword)) PARTITION BY HASH (user_id);`
Next, create the partitions of the keywords table, e.g. 256 of them:
DO $$
BEGIN
  FOR i IN 0..255 LOOP
    EXECUTE format('CREATE TABLE keywords_p%s PARTITION OF keywords FOR VALUES WITH (MODULUS 256, REMAINDER %s);', i, i);
  END LOOP;
END;
$$;
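Before running queries, it may help to seed a few sample rows for a user. Below is a minimal sketch in Python, assuming the psycopg2 package and the connection settings from the Docker command above (database mydatabase, user myuser, password mypassword, port 5434); it is not part of the linked repo:

```python
# Seed a handful of keywords for user 1 so the practice queries return rows.
# Connection settings are assumed to match the docker run command above.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5434,
    dbname="mydatabase", user="myuser", password="mypassword",
)

sample_keywords = ["Dhaka", "Dhaka City", "keyword", "partitioning"]

with conn, conn.cursor() as cur:
    for kw in sample_keywords:
        # ON CONFLICT DO NOTHING makes the script safe to re-run.
        cur.execute(
            "INSERT INTO keywords (user_id, keyword) VALUES (%s, %s) "
            "ON CONFLICT DO NOTHING",
            (1, kw),
        )

conn.close()
```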
Here are a few queries that may come in handy if you want to practice:
-- Query the parent table; the planner routes the lookup to the matching partition.
SELECT * FROM keywords WHERE user_id = 1;

-- Query a single partition directly (here, the partition that user 1 hashed to).
SELECT * FROM keywords_p125 WHERE user_id = 1;

-- Show which partition each row is actually stored in.
SELECT tableoid::regclass AS partition_name, *
FROM keywords
WHERE user_id = 1;

-- Confirm in the plan that only one partition is scanned (partition pruning).
EXPLAIN SELECT user_id, keyword FROM keywords WHERE user_id = 1;
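On the application side, fetching one user's keyword list is the lookup that partitioning keeps fast, since the planner prunes it to a single partition. A minimal sketch, again assuming psycopg2 and the same connection settings (not from the linked repo):

```python
# Load all keywords for one user; with hash partitioning on user_id,
# this query is routed to a single partition.
import psycopg2

def load_keywords(conn, user_id):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT keyword FROM keywords WHERE user_id = %s",
            (user_id,),
        )
        return [row[0] for row in cur.fetchall()]

conn = psycopg2.connect(
    host="localhost", port=5434,
    dbname="mydatabase", user="myuser", password="mypassword",
)
print(load_keywords(conn, 1))
conn.close()
```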
Why Database Partitioning is important
Without partitioning, a single, massive table may lead to:
- Slow Queries: Indexing and scanning across billions of rows become prohibitive.
- Large, Inefficient Indexes: Unable to fit in memory, causing I/O bottlenecks.
- High Maintenance Overhead: VACUUM, ANALYZE, and REINDEX operations are lengthy and disruptive.
- Slow Data Management: Deleting old user data is resource-intensive.
With hash partitioning on user_id (e.g., into 256 partitions), we achieve:
- Fast Queries: Queries only hit a relevant, small partition (1/256th of the data).
- Smaller, In-Memory Indexes: Faster lookups.
- Efficient Maintenance: Operations run quickly per partition, with less impact.
- Fast Data Pruning: bulk data that lines up with a partition can be removed near-instantly by dropping or truncating that partition, and a single user's deletes touch only one partition.
- Enhanced Parallelism: Concurrent access and maintenance across partitions improve throughput.
This directly translates to sub-second query times even at massive scale, unlike the multi-second (or worse) queries seen without partitioning.
Keyword Matching: The Role of the Aho–Corasick Algorithm (Trie Data Structure)
Retrieving keywords is only one part of the problem. The other is efficiently matching thousands of keywords against page content, which the Aho–Corasick algorithm solves. It builds a trie of the keywords, augmented with failure links, so a single linear-time scan of the text detects all occurrences, including overlapping matches, which is vital for real-time performance.
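Here is a minimal sketch of the idea in Python. It is not taken from the linked repo, and the keyword list is hard-coded for illustration; in practice it would be the per-user list loaded from the partitioned table as shown above:

```python
from collections import deque

def build_automaton(keywords):
    """Build the Aho–Corasick structures: a trie of transitions (goto),
    failure links (fail), and the keywords ending at each node (output)."""
    goto = [{}]      # goto[state] maps a character to the next state
    fail = [0]       # failure link for each state
    output = [[]]    # keywords that end at each state
    # 1) Insert every keyword into the trie.
    for kw in keywords:
        state = 0
        for ch in kw:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(kw)
    # 2) Breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            # Any keyword ending at fail[s] also ends at s (overlapping matches).
            output[s].extend(output[fail[s]])
    return goto, fail, output

def find_matches(text, automaton):
    """Scan the text once, yielding (start, end, keyword) for every occurrence."""
    goto, fail, output = automaton
    state = 0
    for i, ch in enumerate(text):
        # Follow failure links until a transition on ch exists (or we hit the root).
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for kw in output[state]:
            yield (i - len(kw) + 1, i + 1, kw)

# In practice the keyword list comes from the per-user query shown earlier.
user_keywords = ["Dhaka", "Dhaka City", "keyword"]
automaton = build_automaton(user_keywords)

page_text = "Welcome to Dhaka City, the keyword highlighting demo."
for start, end, kw in sorted(find_matches(page_text, automaton)):
    print(start, end, kw)
```

Running it reports both "Dhaka" and "Dhaka City" at the same start offset, plus "keyword" later in the text, which is exactly the overlapping-match behavior the highlighter needs.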
Summary: Partitioning vs. Non-Partitioning
| Feature | Without Partitioning | With Partitioning |
|---|---|---|
| Query Speed | Degrades with scale (seconds+) | Stays fast (sub-second) due to smaller scope |
| Index Size | Huge, inefficient, disk-bound | Smaller, efficient, fits in memory per partition |
| Maintenance Ops | Slow, blocking | Fast, isolated, less impactful |
| Data Pruning | Painfully slow row-by-row | Near-instant (drop or truncate a partition) |
| Scalability | Hits bottlenecks quickly | Efficient, leverages parallelism |
Conclusion
Database partitioning, especially by user_id, combined with an efficient keyword matching algorithm like Aho–Corasick, makes it possible to build a performant and scalable keyword marking system. I would suggest partitioning your tables early in any system that may handle large amounts of per-user data.