necologicLabs

Posted on Jan 5, 2025

Real-Time Data Integration Techniques Using PostgreSQL Foreign Data Wrapper (FDW)

#aws #aurora #postgres #database

Introduction

When operating relational databases on AWS, there may be situations where you need to exchange data between multiple databases. For example, you might have Aurora PostgreSQL as your source database containing a Products table, and you want to reference it in real time from a PostgreSQL database (the target) located in a different VPC. You then wish to perform a JOIN with the Transactions table on the target side.

No need for data duplication or ETL
Real-time access to the latest source data
Flexible use for historical aggregation and reporting

These requirements can be met using Foreign Data Wrapper (FDW). In this article, we’ll start by explaining what FDW is, then discuss its advantages and disadvantages, network configurations, and finally walk through the setup steps in order.

What Is Foreign Data Wrapper (FDW)?

Foreign Data Wrapper (FDW) is an extension feature of PostgreSQL that provides a mechanism allowing you to treat external data sources (e.g., tables from other databases) as if they were local tables.

In practice, you install an FDW extension (such as postgres_fdw), then configure “servers,” “user mappings,” and “foreign tables” to reference the tables located on a remote database.

Advantages and Disadvantages of Using FDW

Advantages

Real-Time Access

You can always retrieve the latest data from the source database. There’s no need for ETL batch processing, making it simple to leverage the most up-to-date information.
Improved Development Efficiency

By simply writing SQL JOIN statements, you can seamlessly integrate local and remote tables, simplifying application and data analysis implementations.
Reduced Data Duplication

Compared to a batch approach that duplicates data, FDW can help minimize storage costs and prevent unnecessary data redundancy.

Disadvantages

Performance Risks

Because access is remote, query response times can vary depending on network latency and the load on the source database.
High Dependency on the Source DB

If the source database experiences failures or heavy load, any queries passing through FDW will be directly affected.
Limited Functionality in Some Cases

Certain PostgreSQL features (like transaction management or triggers) may not be available on foreign tables.

Network Configuration Steps (VPC Peering or Transit Gateway)

Even if the source (Aurora PostgreSQL) and the target (PostgreSQL) are in the same account but reside in different VPCs, you need to configure one of the following methods to securely enable network connectivity:

VPC Peering
- Create a peering connection between the source VPC and the target VPC.
- Update the route tables in each VPC to include a route pointing to the peering connection for the relevant IP addresses.
- Open port 5432 (PostgreSQL port) in the security groups’ inbound/outbound rules.
AWS Transit Gateway
- If you need to connect multiple VPCs or multiple Regions from a single hub, use a Transit Gateway.
- Attach each VPC to the Transit Gateway and configure the necessary routes.
- Similarly, make sure security groups and NACLs allow traffic on the required ports.

In either case, be sure to configure the correct security groups to grant access to Aurora.

Step-by-Step Guide to Setting Up FDW

Below is a set of concrete steps assuming the following configuration:

Source: Aurora PostgreSQL (in a separate VPC)
- Endpoint: maindb-instance-1.ctqegcgyi6u1.us-east-1.rds.amazonaws.com
- Contains a Products table
Target: PostgreSQL (in a separate VPC)
- Endpoint: subdb-instance-1.ctqegcgyi6u1.us-east-1.rds.amazonaws.com
- Contains a Transactions table and needs to JOIN with Products

1. Preparation on the Aurora PostgreSQL (Source) Side

VPC Peering or Transit Gateway Configuration
- Ensure that the Aurora VPC and the target PostgreSQL VPC can communicate over the network.
- In the Aurora instance’s security group, allow inbound traffic on port 5432 from the target side.

Table Definition in the Source Database

CREATE TABLE Products (
  product_id UUID PRIMARY KEY,
  product_name TEXT NOT NULL,
  category TEXT NOT NULL,
  price NUMERIC(10, 2) NOT NULL,
  stock INTEGER NOT NULL
);

Assumes data is already populated.

2. Preparation on the PostgreSQL (Target) Side

Launch the Target PostgreSQL Instance
- This could be RDS for PostgreSQL, PostgreSQL on EC2, a container-based setup, etc.
- Make sure port 5432 is also open to accept connections.

Table Definition in the Target Database

CREATE TABLE Transactions (
  transaction_id UUID PRIMARY KEY,
  product_id UUID NOT NULL,
  quantity INTEGER NOT NULL,  
  total_price NUMERIC(10, 2) NOT NULL,
  transaction_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

This will be used in the future together with the foreign_products table (FDW) for JOIN operations.

Installing and Enabling FDW

On the target database (PostgreSQL), make postgres_fdw available for use:

-- Execute on the target database 
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

Note: Depending on your managed service or PostgreSQL version, this extension may be disabled by default. You may need to enable it via the management console or similar interface.

Creating a Server Connection (CREATE SERVER)

CREATE SERVER source_aurora FOREIGN DATA WRAPPER postgres_fdw 
OPTIONS (   
    host 'maindb-instance-1.ctqegcgyi6u1.us-east-1.rds.amazonaws.com',
    port '5432',
    dbname 'postgres'  -- Name of the source database
);

Here, “source_aurora” is simply an identifier.

The “host” and “port” specify the Aurora connection details.

“dbname” is the source database name.

Creating a User Mapping (CREATE USER MAPPING)

Set up the authentication details needed to access the source via FDW:

CREATE USER MAPPING FOR target_user SERVER source_aurora  
OPTIONS (
    user 'aurora_user',
    password 'aurora_password'
);

Sample:

CREATE USER MAPPING FOR postgres SERVER source_aurora  
OPTIONS (  
    user 'postgres',   
    password 'xxx' 
 );

“FOR target_user” refers to a user present on the target PostgreSQL side.

“user” and “password” are valid credentials for Aurora PostgreSQL.

Creating a Foreign Table (CREATE FOREIGN TABLE)

Define the Products table from the Aurora source as a “foreign table” on the target:

CREATE FOREIGN TABLE foreign_products (
    product_id UUID,  
    product_name TEXT,  
    category TEXT,  
    price NUMERIC(10, 2), 
    stock INTEGER
) SERVER source_aurora 
OPTIONS (   
    schema_name 'public',  -- Schema in the source DB
    table_name 'products'  -- Table name in the source DB
);

This allows you to run a simple query, such as SELECT * FROM foreign_products; on the target database, which in turn retrieves data from the Products table in the remote Aurora database.

Tip: For PostgreSQL 9.5 and later, you can use the IMPORT FOREIGN SCHEMA feature to import multiple tables from the source schema at once, which is convenient when you need to work with numerous tables.

Confirming the Configuration

You can verify foreign table information using the following query:

SELECT 
    fs.srvname AS server_name, 
    fs.srvoptions AS server_options, 
    um.umuser AS user_id,
    um.umoptions AS user_options,    
    ft.foreign_table_name AS table_name, 
    ft.foreign_table_schema AS table_schema 
FROM    
    pg_foreign_server fs 
LEFT JOIN    
    pg_user_mappings um
ON 
    fs.oid = um.srvid  -- changed from umserver to srvid 
LEFT JOIN     
    information_schema.foreign_tables ft 
ON 
    fs.srvname = ft.foreign_server_name;

Sample Results

Attribute	Value
server_name	source_aurora
server_options	{host=maindb-instance-1.ctqegcgyi6u1.us-east-1.rds.amazonaws.com, port=5432, dbname=postgres}
user_id	16400
user_options	{user=postgres, password=xxx}
table_name	foreign_products
table_schema	public

Example of a JOIN Query

Once the configuration is set up correctly, you can easily join the target’s Transactions table with the remote foreign_products:

SELECT   
    t.transaction_id,  
    f.product_name,   
    f.category,    
    t.quantity,       
    t.total_price,      
    t.transaction_date  
FROM      
    transactions t 
JOIN   
    foreign_products f
ON    
    t.product_id = f.product_id
WHERE    
    f.category = 'Electronics';

Fields such as f.product_name and f.category come from the Aurora table in real time.
t.transaction_date is a local column in the target (Transactions) table.

This makes it straightforward to leverage the latest source database information on the target side.

Sample Results (Excerpt)

            transaction_id            | product_name |  category   | quantity | total_price |  transaction_date   
--------------------------------------+--------------+-------------+----------+-------------+---------------------
 f8b8b097-75ce-47e6-8fb8-42adbf3456fc | Growth       | Electronics |        4 |      421.16 | 2025-01-02 21:57:27
 8b8a1390-6f24-4d07-9239-adc5df9ab92e | Lose         | Electronics |        6 |      448.08 | 2025-01-03 03:29:54
 86386a4b-4eda-4ff4-9854-87b65d58e3aa | Without      | Electronics |        7 |     2537.57 | 2025-01-01 20:44:28
 203291b1-24af-443e-ad34-092151a99c61 | Wonder       | Electronics |        8 |     2098.00 | 2025-01-02 14:48:15
 fd75b3fa-f564-4599-9faf-c29dd42f82e3 | How          | Electronics |        3 |     1176.63 | 2025-01-02 15:22:55
 42a281a5-de30-4bf2-b528-bd3b942e35b6 | Pattern      | Electronics |        2 |      186.56 | 2025-01-01 06:54:58
 03a7c005-9f1c-46ca-b102-352dd0ed32ae | Win          | Electronics |        6 |     2955.60 | 2025-01-01 13:39:38
 6bc0bdd8-0d9e-47a8-853a-69ac235ca983 | Successful   | Electronics |        2 |      986.18 | 2025-01-02 23:59:29
 aaffdb6e-dbd1-4f33-a788-d62bdbf012bd | Every        | Electronics |        1 |      184.57 | 2025-01-02 07:48:06
 76f0a5db-297a-42fc-b8e6-d0e874031471 | Sing         | Electronics |        3 |     1426.11 | 2025-01-03 02:17:25
 b1dd0e57-4b49-413c-9514-aeb879200138 | Paper        | Electronics |        5 |      481.60 | 2025-01-01 04:53:51
 23f60790-6cb8-4c67-8c2c-60afe30289fb | Teacher      | Electronics |        6 |      429.30 | 2025-01-02 21:37:51
 638f7951-7a70-4da5-b87d-41e681143811 | Society      | Electronics |        3 |      302.22 | 2025-01-04 04:47:11

Performance

In PostgreSQL’s Foreign Data Wrapper (FDW), there is no built-in caching mechanism. Each time you access a foreign table, a query is sent to the remote database (the server configured for FDW) to retrieve the latest data.

Below are some performance measurements taken when retrieving data within the same AWS Region. The following query yielded these results:

(5987 rows) Time: 1094.468 ms (00:01.094)

Overall, this performance is quite good and can be considered practical for real-world use.

Tuning 1

PostgreSQL Foreign Data Wrapper (FDW) can push parts of query planning to the remote database to streamline communication. This is often referred to as query pushdown. Thanks to pushdown, not all data must be retrieved every time.

Pushdown Explained

FDW minimizes unnecessary data transfer by sending parts of the query to the remote database. For instance:

WHERE clauses
LIMIT
Some JOIN conditions
Queries that select only specific columns (SELECT column_name)

Query Optimization Tuning

Sample Query

To illustrate room for improvement, let’s consider a less optimal SQL query:

SELECT 
    p1.product_name,
    SUM(t.quantity) AS total_sold
FROM   
    transactions t 
JOIN (    
    SELECT 
        * 
    FROM
        foreign_products ) p1
ON
    t.product_id = p1.product_id 
WHERE 
    t.transaction_date >= '2022-01-01' 
AND  
    p1.category = 'Electronics'
AND  
    p1.product_id 
IN (
    SELECT
        product_id
    FROM
        foreign_products   
    WHERE 
        stock > 10
 )
GROUP BY
    p1.product_name;

(830 rows) Time: 531.426 ms

Although not terrible, it can be improved. Let’s analyze the query plan via EXPLAIN VERBOSE (sql)

Problem Analysis

Double Scanning of the Remote Table

The remote table foreign_products is being scanned twice:

First scan:

SELECT 
    product_id,
    product_name
FROM
    public.products 
WHERE
    ((category = 'Electronics'))

- Retrieves 6 rows for `category = 'Electronics'`.

Second scan:

SELECT
    product_id 
FROM 
    public.products 
WHERE
     ((stock > 10))

- Retrieves 683 rows for `stock > 10`.

Problem:

- The remote SQL is executed twice, increasing the communication cost with the remote database.
- Data is unnecessarily processed by a local Hash Join.

Unnecessary Local Processing

- There is a local Hash Join happening.
- Conditions like `category` and `stock` that could be handled remotely are instead processed locally, reducing efficiency.

Full Table Scan on the Local transactions Table

- The local `transactions` table is undergoing a Seq Scan:

    `Seq Scan on public.transactions t (cost=0.00..685.00 rows=29997 width=20)    Filter: (t.transaction_date >= '2022-01-01')`

- Impact:
    - Because there is no index on `transaction_date`, all 29,997 rows are being scanned.

Points for Improvement

Perform JOIN and condition filtering remotely

Apply conditions (category = 'Electronics' and stock > 10) on the remote database to reduce unnecessary local joins. For instance:

WITH filtered_products AS (
    SELECT
    product_id, 
    product_name   
    FROM       
    foreign_products    
    WHERE        
    category = 'Electronics'   
    AND         
    stock > 10 
)
SELECT     
    p.product_name,   
    SUM(t.quantity) AS total_sold 
FROM   
    transactions t 
JOIN      
    filtered_products p      
ON 
    t.product_id = p.product_id
WHERE    
    t.transaction_date >= '2022-01-01' 
GROUP BY
    p.product_name;

Add an Index to the Local Table

Create an index on transaction_date in the transactions table to avoid a full table scan.

Results

(830 rows) Time: 387.544 ms

This is roughly a 30% reduction in execution time.

From the query plan, you can see that remote access has been consolidated into a single call. While this example is not unique to foreign tables, inefficiencies in the query become more evident when using FDW. Hence, analyzing queries with EXPLAIN is crucial for avoiding performance issues.

Tuning 2 – Using Materialized Views for Caching

As another method of performance tuning, consider using caching. Since FDW does not provide built-in caching by default, you can use a Materialized View to cache data locally. This approach offloads frequent remote queries and complex processes from the remote database to the local environment, reducing load on the remote data source and further shortening query execution times.

Materialized View

Below is an example of caching the entire foreign_products table:

CREATE MATERIALIZED VIEW products AS SELECT * FROM foreign_products;

SQL

Next, use the materialized view to filter data. In the following query, only the reference to the table has been changed from foreign_products to the materialized view named products.

WITH filtered_products AS (
    SELECT    
        product_id, 
        product_name   
    FROM       
        products    
    WHERE         
        category = 'Electronics' 
    AND          
        stock > 10 
) 
SELECT    
    p.product_name, 
    SUM(t.quantity) AS total_sold
FROM     
    transactions t 
JOIN     
    filtered_products p 
ON 
    t.product_id = p.product_id
 WHERE 
    t.transaction_date >= '2022-01-01' 
GROUP BY 
    p.product_name;

(830 rows) Time: 353.511 ms

This yields roughly a 10% reduction in execution time. The exact effect will vary depending on data volume, query complexity, master database location, and how often the view is accessed. In this example, the entire table was cached, but caching just the subset of data you need may improve performance even more.

Because a materialized view will not reflect real-time data updates, its use depends on your requirements. For instance, in a batch workflow, you could refresh the materialized view (to get the latest data) before running the main process, then use the view locally for faster batch processing. This can be very effective.

Operational Points and Considerations

Performance Monitoring

- Since FDW is a remote connection, network latency may significantly impact queries.
- If you perform frequent JOIN operations, consider indexing and caching strategies on the source side as needed.

Load on the Source Database

- FDW directly references the latest data on the source. Repeatedly issuing heavy queries can affect the source’s performance.
- Monitor query frequency and complexity, and consider using materialized views if appropriate.

Permissions and Security

- In addition to VPC peering and security group settings, ensure proper roles and privileges are set at the database level to restrict unwanted access.

Combining Partial Data Migration

- If certain data does not need to be accessed in real time, using ETL or snapshots to bring it into the target might be faster.
- FDW is not a one-size-fits-all solution; combine various approaches based on the specific use case.

Dependency on PostgreSQL

- Because this approach depends on PostgreSQL features, it may constrain future system architecture decisions. Conduct sufficient reviews before committing to this design.

Summary

By using Foreign Data Wrapper (FDW), you can JOIN tables across Aurora PostgreSQL (source) and Aurora PostgreSQL (target) even if they reside in separate VPCs, making them appear as if they were part of the same database. Proper network configuration (VPC Peering or Transit Gateway) and basic FDW setup (server, user mapping, and foreign tables) enable real-time data access with minimal effort.

Advantages: Real-time access, enhanced development efficiency, reduced data duplication
Disadvantages: Potential performance risks, dependency on the source DB, limited functionality in certain scenarios

The key is to set up your network, security, and source-side optimizations with care. This approach is especially effective for scenarios where on-demand access to the latest data is required alongside seamless local table integration on the target side. Consider introducing it in your next project if it meets your requirements.

DEV Community