Introduction
Ever needed powerful search capabilities in your application but dreaded the complexity of managing Elasticsearch or OpenSearch? You're not alone. While these dedicated search engines are powerful, they introduce significant operational overhead: another cluster to deploy, ETL pipelines to maintain, and constant synchronization headaches.
What if you could get search engine capabilities directly inside PostgreSQL? That's exactly what pg_search offers. In this post, I'll walk you through what pg_search is, why it can't run on Amazon RDS, and how we architected a solution using EC2 that keeps everything in PostgreSQL while delivering modern search features.
What is pg_search?
pg_search is a PostgreSQL extension from ParadeDB that transforms your database into a full-featured search engine. Instead of basic substring matching with LIKE or ILIKE, you get:
- Full-text search with BM25 ranking (the algorithm behind modern search engines)
- Vector search for semantic similarity using embeddings
- Hybrid search that combines text relevance and vector similarity
- All using standard SQL queries
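To make the BM25 bullet concrete: the algorithm scores a document by weighing how often each query term appears (term frequency) against how rare the term is across the corpus (inverse document frequency), with a penalty for long documents. A toy standalone sketch of the scoring math — illustrative only, not pg_search's actual implementation:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against a query with the BM25 formula.

    corpus: list of documents (each a list of terms), used for IDF
    and average document length. k1 and b are the usual defaults.
    """
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    n = len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))   # rarity weight
        tf = doc_terms.count(term)                         # term frequency
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * (tf * (k1 + 1)) / denom
    return score

corpus = [
    ["configure", "postgres", "search"],
    ["postgres", "backup", "guide"],
    ["cooking", "pasta", "recipes"],
]
query = ["postgres", "search"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
# The document matching both query terms scores highest; the
# unrelated document scores zero.
```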
Here's a simple example (pg_search's SQL API has changed across versions; recent releases use the @@@ match operator together with paradedb.score rather than a standalone ranking function):
SELECT id, title,
       paradedb.score(id) AS score
FROM articles
WHERE title @@@ 'configure postgres search'
   OR body @@@ 'configure postgres search'
ORDER BY score DESC
LIMIT 10;
This returns ranked results like a real search engine, not just rows that contain your keywords.
Why pg_search Instead of Traditional Search Solutions?
You might reach for pg_search when:
Plain Postgres queries aren't enough. Basic LIKE queries don't provide relevance ranking, and scaling text search across multiple columns becomes difficult.
You want to avoid separate infrastructure. Solutions like Elasticsearch require:
- Additional clusters to deploy and monitor
- ETL pipelines to sync data from Postgres
- Managing consistency between two systems
- Extra latency from network hops
You prefer the Postgres ecosystem. With pg_search, everything stays in Postgres. Your backups, permissions, transactions, and tooling all work the same way. No new stack to learn.
The RDS Challenge
Here's the catch: Amazon RDS doesn't support pg_search.
RDS only allows a pre-approved list of extensions. Since RDS is a managed service, AWS controls what can be loaded into the PostgreSQL process. If you try:
CREATE EXTENSION pg_search;
You'll get:
ERROR: extension "pg_search" is not available
This limitation is by design. To use pg_search, you need a self-managed PostgreSQL instance where you control the installation.
Our Solution: RDS + EC2 Architecture
Since we can't install pg_search on RDS, we use a hybrid approach:
- Amazon RDS remains our primary database for all application writes and transactions
- EC2 PostgreSQL runs as a search replica with pg_search installed
- Data flows from RDS to EC2 using logical replication
- Applications query RDS for normal operations and EC2 for search
Here's the architecture:
┌─────────────────────────────┐
│ Application │
│ (API / backend / service) │
└────────────┬────────────────┘
│
Write / Read
│
┌────────────▼──────────────┐
│ Amazon RDS PostgreSQL │
│ (Primary DB) │
└────────────┬──────────────┘
│
Logical Replication
│
┌────────────▼──────────────┐
│ Amazon EC2 │
│ PostgreSQL + pg_search │
│ (Search / Analytics) │
└────────────▲──────────────┘
│
Search Queries
│
┌────────────┴──────────────┐
│ Application │
└───────────────────────────┘
Everything stays inside your private AWS VPC with no external services.
Setting Up pg_search on EC2
Step 1: Provision Your EC2 Instance
Launch an EC2 instance (t3.medium or larger) in the same VPC as your RDS instance. Place it in a private subnet and attach sufficient storage (100+ GB gp3 EBS).
Configure security groups to allow:
- EC2 → RDS communication on port 5432
- Your admin access via SSH
Step 2: Install PostgreSQL with pg_search
We'll use Docker for a reproducible setup. Create a Dockerfile:
FROM postgres:16
# Install build dependencies
RUN apt-get update && apt-get install -y \
build-essential git curl pkg-config libssl-dev \
libclang-dev clang postgresql-server-dev-16 \
libicu-dev wget ca-certificates
# Install Rust toolchain
ENV CARGO_HOME=/usr/local/cargo
ENV PATH=$CARGO_HOME/bin:$PATH
RUN curl https://sh.rustup.rs -sSf | sh -s -- -y && \
. "$CARGO_HOME/env" && \
cargo install cargo-pgrx --version 0.15.0 --locked && \
cargo pgrx init --pg16=$(which pg_config)
# Install pgvector (optional, for vector search)
RUN wget https://github.com/pgvector/pgvector/archive/refs/tags/v0.5.1.tar.gz && \
tar -xvzf v0.5.1.tar.gz && \
cd pgvector-0.5.1 && make && make install && \
cd .. && rm -rf pgvector-0.5.1 v0.5.1.tar.gz
# Build and install pg_search
RUN git clone https://github.com/paradedb/paradedb.git && \
cd paradedb/pg_search && \
cargo pgrx install --release && \
cd ../.. && rm -rf paradedb
Create a docker-compose.yml:
version: '3.8'
services:
  postgres:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: pgsearch-ec2
    restart: always
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: your_strong_password
    ports:
      - "5432:5432"
    volumes:
      - /mnt/pgdata:/var/lib/postgresql/data
    command:
      - "postgres"
      - "-c"
      - "wal_level=logical"
      - "-c"
      - "max_replication_slots=10"
      - "-c"
      - "max_wal_senders=10"
      - "-c"
      - "shared_preload_libraries=pg_search"
Start the container:
docker compose up -d
Step 3: Enable the Extensions
Connect to your EC2 PostgreSQL instance:
psql -h localhost -U postgres
Create your database and enable extensions:
CREATE DATABASE app_db;
\c app_db
CREATE EXTENSION pg_search;
CREATE EXTENSION vector; -- if using vector search
Step 4: Create Search Indexes
Now you can create powerful search indexes on your tables:
-- BM25 full-text search index
-- (the key_field column must be included in the indexed column list)
CREATE INDEX documents_search_idx
ON documents
USING bm25 (id, title, body)
WITH (key_field = 'id');
-- Vector similarity index (for semantic search)
CREATE INDEX documents_embedding_idx
ON documents
USING ivfflat (embedding vector_l2_ops)
WITH (lists = 100);
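The lists = 100 value above isn't arbitrary. The pgvector README suggests roughly rows / 1000 lists for tables up to about a million rows, and sqrt(rows) beyond that; more lists means faster but coarser approximate search. A small helper encoding that heuristic — treat it as a starting point and tune against your own recall and latency measurements:

```python
import math

def recommended_ivfflat_lists(row_count: int) -> int:
    """Heuristic from the pgvector README: rows/1000 up to ~1M rows,
    sqrt(rows) above that. Always returns at least 1."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, int(math.sqrt(row_count)))

print(recommended_ivfflat_lists(100_000))    # 100, matching the index above
print(recommended_ivfflat_lists(4_000_000))  # 2000
```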
Syncing Data from RDS to EC2
Installing pg_search is only half the solution. We need continuous data replication from RDS to EC2.
Initial Data Load
First, take a snapshot of your RDS data:
# Dump from RDS
pg_dump -h your-rds-endpoint.amazonaws.com \
-U your_user -d app_db \
-Fc -f app_db_dump.backup
# Restore to EC2 (skip CREATE DATABASE if app_db already exists from Step 3)
psql -h localhost -U postgres -c "CREATE DATABASE app_db;"
pg_restore -h localhost -U postgres \
-d app_db app_db_dump.backup
Configure RDS for Logical Replication
In your RDS parameter group, set:
rds.logical_replication = 1
max_replication_slots = 50 (or more, depending on your needs)
max_wal_senders = 10
max_slot_wal_keep_size = 2048 (MB — safety net to avoid WAL filling storage)
Apply changes and reboot your RDS instance.
Create a replication user on RDS:
CREATE ROLE repl_user WITH LOGIN REPLICATION PASSWORD 'secure_password';
GRANT CONNECT ON DATABASE app_db TO repl_user;
GRANT USAGE ON SCHEMA public TO repl_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO repl_user;
Create a publication on RDS:
-- Replicate all tables
CREATE PUBLICATION app_pub FOR ALL TABLES;
-- Or specific tables only
-- CREATE PUBLICATION app_pub FOR TABLE documents, users;
Set Up Subscription on EC2
On your EC2 PostgreSQL instance:
CREATE SUBSCRIPTION app_sub
CONNECTION 'host=your-rds-endpoint.amazonaws.com port=5432
            user=repl_user password=secure_password dbname=app_db'
PUBLICATION app_pub
WITH (
  slot_name = 'app_slot',
  create_slot = true,
  copy_data = false, -- data already loaded via pg_restore
  enabled = true
);
Verify Replication
Check replication status on EC2:
SELECT subname, pid, latest_end_time,
now() - latest_end_time AS delay
FROM pg_stat_subscription;
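If search freshness matters, it's worth wrapping that query in a periodic health check. A sketch of the application-side logic — the 30-second threshold is an arbitrary assumption; choose one that matches how stale your search results are allowed to be:

```python
from datetime import timedelta

# Query to run on the EC2 subscriber (same as above):
LAG_SQL = """
SELECT subname, now() - latest_end_time AS delay
FROM pg_stat_subscription;
"""

def lag_is_healthy(delay: timedelta, threshold_seconds: float = 30.0) -> bool:
    """Return True while replication lag stays under the threshold.

    `delay` is the interval returned by the query above; most Postgres
    drivers map it to a timedelta.
    """
    return delay.total_seconds() < threshold_seconds

# Example readings:
print(lag_is_healthy(timedelta(seconds=2)))   # healthy
print(lag_is_healthy(timedelta(minutes=5)))   # should trigger an alert
```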
Test with a simple insert on RDS:
INSERT INTO documents (id, title, body)
VALUES (gen_random_uuid(), 'Test', 'Replication check');
The row should appear on EC2 within seconds.
Running Search Queries
Now your application can run powerful search queries against the EC2 instance:
Full-text search with ranking:
SELECT id, title,
       paradedb.score(id) AS score
FROM documents
WHERE title @@@ 'postgres configuration'
   OR body @@@ 'postgres configuration'
ORDER BY score DESC
LIMIT 10;
Vector similarity search:
SELECT id, title,
1 - (embedding <=> $1) AS similarity
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;
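For hybrid search, the BM25 score and the cosine similarity live on different scales, so they need to be normalised before blending. One common approach is a weighted sum after min-max normalisation — this is an assumption for illustration, not pg_search's built-in hybrid mechanism, and reciprocal rank fusion is a popular alternative:

```python
def normalize(scores):
    """Min-max normalise raw BM25 scores into [0, 1] so they are
    comparable with cosine similarity."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_score(bm25_norm: float, cosine_sim: float, alpha: float = 0.5) -> float:
    """Blend normalised text relevance with semantic similarity.
    alpha = 1.0 means pure keyword search, 0.0 pure vector search."""
    return alpha * bm25_norm + (1 - alpha) * cosine_sim

raw_bm25 = [7.2, 3.1, 0.4]    # from the BM25 query
cosines  = [0.62, 0.88, 0.15] # from the vector query
blended = [hybrid_score(b, c) for b, c in zip(normalize(raw_bm25), cosines)]
```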
Your application connects to RDS for writes and normal reads, and to EC2 for search operations.
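In application code, that split can be as simple as routing by operation type. A minimal sketch — the DSN strings and operation names are hypothetical placeholders for your own configuration:

```python
# Hypothetical endpoints -- substitute your real hosts and credentials.
RDS_DSN = "host=your-rds-endpoint.amazonaws.com dbname=app_db"
EC2_DSN = "host=your-ec2-search-host.internal dbname=app_db"

# Operations served by the pg_search replica.
SEARCH_OPS = {"fulltext_search", "vector_search", "hybrid_search"}

def dsn_for(operation: str) -> str:
    """Route search operations to the EC2 replica; everything else
    (writes and normal reads) goes to the RDS primary."""
    return EC2_DSN if operation in SEARCH_OPS else RDS_DSN

print(dsn_for("vector_search"))  # EC2 replica
print(dsn_for("create_order"))   # RDS primary
```

Keeping the routing in one helper also makes it easy to fall back to RDS (with plain LIKE queries) if the search replica is ever unavailable.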
Keeping New Tables in Sync
If you used FOR ALL TABLES in your publication, new tables are automatically included on the publisher side, but the subscriber must refresh the subscription before it "sees" them:
ALTER SUBSCRIPTION app_sub REFRESH PUBLICATION WITH (copy_data = true);
You can automate this with a cron job on EC2:
*/15 * * * * psql -h localhost -U postgres -d app_db \
-c "ALTER SUBSCRIPTION app_sub REFRESH PUBLICATION WITH (copy_data = true);"
This ensures new tables created on RDS are picked up and synced to EC2 regularly.
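If you'd rather refresh only when something has actually changed, you can compare the publisher's table list (pg_publication_tables on RDS) against what the subscriber is already syncing (pg_subscription_rel joined to pg_class on EC2). The comparison itself is trivial; a sketch with hypothetical inputs:

```python
def tables_missing_on_subscriber(publisher_tables, subscriber_synced):
    """Given table names from pg_publication_tables (RDS) and the
    subscriber's synced set (EC2), report what a REFRESH PUBLICATION
    would pick up."""
    return sorted(set(publisher_tables) - set(subscriber_synced))

missing = tables_missing_on_subscriber(
    ["documents", "users", "orders"],  # publisher side
    ["documents", "users"],            # subscriber side
)
# → ["orders"]
```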
Pros and Cons
Advantages
✅ Advanced search features – BM25 ranking, vector search, and hybrid search capabilities
✅ Postgres-native – Use SQL and standard Postgres tooling instead of learning new systems
✅ RDS stays primary – Keep RDS for transactions, backups, and managed infrastructure
✅ Private network – Everything stays inside your VPC with no external dependencies
✅ Flexible extensions – Install any Postgres extension on EC2, not limited by RDS restrictions
Challenges
❌ Additional infrastructure – You now manage a self-hosted Postgres instance
❌ Replication complexity – Setting up and maintaining logical replication requires care
❌ Replication lag – Search data trails RDS by a few seconds, so results are near real-time rather than instant (acceptable for most use cases)
❌ Dual endpoints – Application needs to connect to two different databases
❌ Higher costs – Extra EC2 instance and storage expenses
Conclusion
We chose this architecture because it gives us powerful search capabilities without the operational complexity of maintaining a separate search infrastructure. While managing an EC2 Postgres instance adds some overhead, it's significantly simpler than running Elasticsearch clusters and ETL pipelines.
For teams already comfortable with PostgreSQL, this approach feels natural. Everything stays in the Postgres ecosystem, uses familiar SQL, and remains inside your private network. The trade-off of managing one additional Postgres node is worthwhile for the search capabilities you gain.
If you're considering this architecture, start small: set up a test EC2 instance, replicate a subset of tables, and validate that the search performance meets your needs before committing to production deployment.
Have questions about this setup? Feel free to reach out or leave a comment below.