Scalable Keyword Marking System
Building a real-time keyword highlighting system for millions of users and billions of keywords presents significant architectural challenges. This post addresses two of them: database partitioning and efficient keyword matching.
Problem Overview
- 1 Million Users: Each user can store up to 10,000 keywords.
- 10 Billion Keywords: the assumed total system capacity (1 million users × 10,000 keywords each).
- Real-time Requirement: Sub-second retrieval and highlighting of all user keywords on any given page, including overlapping matches (e.g., "Dhaka," "Dhaka City").
Source code
https://github.com/asadpstu/keyword_match
Database: PostgreSQL
You can run it as a Docker container. Just copy and paste the command below into your terminal, and within a few moments your PostgreSQL database will be up and running (if the my_pg_network network does not exist yet, create it first with docker network create my_pg_network):
docker run -d --name my-postgres-db --network my_pg_network -p 5434:5432 -e POSTGRES_DB=mydatabase -e POSTGRES_USER=myuser -e POSTGRES_PASSWORD=mypassword -v pg_data:/var/lib/postgresql/data postgres:16-alpine
Download DBeaver, or use your favorite database tool. Then create the parent table, hash-partitioned by user_id:
`CREATE TABLE keywords (user_id BIGINT NOT NULL, keyword TEXT NOT NULL, PRIMARY KEY (user_id, keyword)) PARTITION BY HASH (user_id);`
Next, create the partitions of the keywords table, e.g. 256 of them:
DO $$
BEGIN
  FOR i IN 0..255 LOOP
    EXECUTE format('CREATE TABLE keywords_p%s PARTITION OF keywords FOR VALUES WITH (MODULUS 256, REMAINDER %s);', i, i);
  END LOOP;
END;
$$;
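Before running queries, it may help to seed a few sample rows for a user. Below is a minimal sketch in Python, assuming the psycopg2 package and the connection settings from the Docker command above (database mydatabase, user myuser, password mypassword, port 5434); it is not part of the linked repo:

```python
# Seed a handful of keywords for user 1 so the practice queries return rows.
# Connection settings are assumed to match the docker run command above.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5434,
    dbname="mydatabase", user="myuser", password="mypassword",
)

sample_keywords = ["Dhaka", "Dhaka City", "keyword", "partitioning"]

with conn, conn.cursor() as cur:
    for kw in sample_keywords:
        # ON CONFLICT DO NOTHING makes the script safe to re-run.
        cur.execute(
            "INSERT INTO keywords (user_id, keyword) VALUES (%s, %s) "
            "ON CONFLICT DO NOTHING",
            (1, kw),
        )

conn.close()
```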
Here are a few queries that may come in handy if you want to practice:
-- Query the parent table; the planner routes the lookup to the matching partition.
SELECT * FROM keywords WHERE user_id = 1;

-- Query a single partition directly (here, the partition that user 1 hashed to).
SELECT * FROM keywords_p125 WHERE user_id = 1;

-- Show which partition each row is actually stored in.
SELECT tableoid::regclass AS partition_name, *
FROM keywords
WHERE user_id = 1;

-- Confirm in the plan that only one partition is scanned (partition pruning).
EXPLAIN SELECT user_id, keyword FROM keywords WHERE user_id = 1;
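On the application side, fetching one user's keyword list is the lookup that partitioning keeps fast, since the planner prunes it to a single partition. A minimal sketch, again assuming psycopg2 and the same connection settings (not from the linked repo):

```python
# Load all keywords for one user; with hash partitioning on user_id,
# this query is routed to a single partition.
import psycopg2

def load_keywords(conn, user_id):
    with conn.cursor() as cur:
        cur.execute(
            "SELECT keyword FROM keywords WHERE user_id = %s",
            (user_id,),
        )
        return [row[0] for row in cur.fetchall()]

conn = psycopg2.connect(
    host="localhost", port=5434,
    dbname="mydatabase", user="myuser", password="mypassword",
)
print(load_keywords(conn, 1))
conn.close()
```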
Why Database Partitioning is important
Without partitioning, a single, massive table may lead to:
- Slow Queries: Indexing and scanning across billions of rows become prohibitive.
- Large, Inefficient Indexes: Unable to fit in memory, causing I/O bottlenecks.
- High Maintenance Overhead: VACUUM, ANALYZE, and REINDEX operations are lengthy and disruptive.
- Slow Data Management: Deleting old user data is resource-intensive.
With hash partitioning on user_id (e.g., into 256 partitions), we achieve:
- Fast Queries: Queries only hit a relevant, small partition (1/256th of the data).
- Smaller, In-Memory Indexes: Faster lookups.
- Efficient Maintenance: Operations run quickly per partition, with less impact.
- Fast Data Pruning: bulk data that lines up with a partition can be removed near-instantly by dropping or truncating that partition, and a single user's deletes touch only one partition.
- Enhanced Parallelism: Concurrent access and maintenance across partitions improve throughput.
This directly translates to sub-second query times even at massive scale, unlike the multi-second (or worse) queries seen without partitioning.
Keyword Matching: The Role of the Aho–Corasick Algorithm (Trie Data Structure)
Retrieving keywords is only one part of the problem. The other is efficiently matching thousands of keywords against page content, which the Aho–Corasick algorithm solves. It builds a trie of the keywords, augmented with failure links, so a single linear-time scan of the text detects all occurrences, including overlapping matches, which is vital for real-time performance.
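Here is a minimal sketch of the idea in Python. It is not taken from the linked repo, and the keyword list is hard-coded for illustration; in practice it would be the per-user list loaded from the partitioned table as shown above:

```python
from collections import deque

def build_automaton(keywords):
    """Build the Aho–Corasick structures: a trie of transitions (goto),
    failure links (fail), and the keywords ending at each node (output)."""
    goto = [{}]      # goto[state] maps a character to the next state
    fail = [0]       # failure link for each state
    output = [[]]    # keywords that end at each state
    # 1) Insert every keyword into the trie.
    for kw in keywords:
        state = 0
        for ch in kw:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(kw)
    # 2) Breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            # Any keyword ending at fail[s] also ends at s (overlapping matches).
            output[s].extend(output[fail[s]])
    return goto, fail, output

def find_matches(text, automaton):
    """Scan the text once, yielding (start, end, keyword) for every occurrence."""
    goto, fail, output = automaton
    state = 0
    for i, ch in enumerate(text):
        # Follow failure links until a transition on ch exists (or we hit the root).
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for kw in output[state]:
            yield (i - len(kw) + 1, i + 1, kw)

# In practice the keyword list comes from the per-user query shown earlier.
user_keywords = ["Dhaka", "Dhaka City", "keyword"]
automaton = build_automaton(user_keywords)

page_text = "Welcome to Dhaka City, the keyword highlighting demo."
for start, end, kw in sorted(find_matches(page_text, automaton)):
    print(start, end, kw)
```

Running it reports both "Dhaka" and "Dhaka City" at the same start offset, plus "keyword" later in the text, which is exactly the overlapping-match behavior the highlighter needs.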
Summary: Partitioning vs. Non-Partitioning
| Feature | Without Partitioning | With Partitioning |
|---|---|---|
| Query Speed | Degrades with scale (seconds+) | Stays fast (sub-second) due to smaller scope |
| Index Size | Huge, inefficient, disk-bound | Smaller, efficient, fits in memory per partition |
| Maintenance Ops | Slow, blocking | Fast, isolated, less impactful |
| Data Pruning | Painfully slow row-by-row | Near-instant (drop or truncate a partition) |
| Scalability | Hits bottlenecks quickly | Efficient, leverages parallelism |
Conclusion
Database partitioning, especially by user_id, combined with an efficient keyword matching algorithm like Aho–Corasick, makes it possible to build a performant and scalable keyword marking system. I would suggest partitioning your tables early in any system that may handle large amounts of per-user data.