In today's data-driven landscape, ensuring real-time data synchronization between databases is crucial for maintaining data consistency and enabling immediate data access across systems. This guide presents a detailed approach for synchronizing multiple master-detail tables from one Oracle database to another in real time using Apache Kafka and Kafka Connect's built-in JDBC connectors. This solution offers a lightweight alternative to using Debezium or other third-party tools, streamlining the process while minimizing operational overhead.
Table of Contents
- Introduction
- Architecture Overview
- Understanding the Capabilities and Limitations
- Source Oracle Database Configuration
- Kafka Connect JDBC Source Connector Configuration
- Kafka Connect JDBC Sink Connector Configuration
- Handling Master-Detail Table Relationships
- Performance Optimization and Monitoring
- Deployment Considerations
- Conclusion
Introduction
Synchronizing data between Oracle databases in real time can be challenging, especially when dealing with high volumes of transactions and complex master-detail relationships. While tools like Debezium provide robust change data capture (CDC) capabilities, they can add complexity and overhead to your architecture.
This guide explores how to leverage Kafka Connect's built-in JDBC connectors to achieve near real-time synchronization of Oracle tables. We will cover the necessary configurations, best practices, and considerations to implement this solution effectively.
Architecture Overview
The proposed architecture consists of three main components:
- Source Layer: Oracle database from which data changes are captured using Kafka Connect's JDBC source connector.
- Messaging Layer: Apache Kafka cluster that transports and stores the change events.
- Target Layer: Oracle database where the changes are applied using Kafka Connect's JDBC sink connector.
This architecture leverages Kafka Connect’s managed connectors to simplify implementation, reduce the impact on the source database, and ensure scalability.
Understanding the Capabilities and Limitations
Before proceeding, it's crucial to understand the capabilities and limitations of using Kafka Connect's built-in JDBC connectors for real-time synchronization:
- Data Capture: The JDBC source connector polls the database at regular intervals to capture new data based on specified criteria.
- Change Types: The connector can capture inserts and, with certain configurations, updates. It cannot detect deletes, and it does not provide full change data capture (CDC).
- Latency: The polling mechanism introduces latency determined by the poll.interval.ms setting.
- Consistency: Without CDC, there is a risk of missing changes that occur between polls, especially in high-throughput environments.
For scenarios requiring full CDC with support for inserts, updates, and deletes, consider using tools like Debezium or Oracle GoldenGate.
Source Oracle Database Configuration
Enabling Necessary Database Features
While the JDBC source connector does not require supplemental logging or CDC features on the Oracle database, certain configurations can improve performance and reliability.
-- (Optional) Increase the size of the redo logs to handle load
ALTER DATABASE ADD LOGFILE GROUP 4 (
'/path/to/logfile4a.log',
'/path/to/logfile4b.log'
) SIZE 100M;
-- (Optional) Adjust undo retention to ensure consistent reads
ALTER SYSTEM SET UNDO_RETENTION = 900 SCOPE=BOTH;
Creating a Dedicated User for Kafka Connect JDBC Source Connector
Create a user with the minimal privileges the connector needs:
CREATE USER kafka_user IDENTIFIED BY password;
GRANT CONNECT TO kafka_user;
-- Grant read access only to the tables being replicated
GRANT SELECT ON INVENTORY.CUSTOMERS TO kafka_user;
GRANT SELECT ON INVENTORY.ORDERS TO kafka_user;
GRANT SELECT ON SALES.PRODUCTS TO kafka_user;
These permissions allow the Kafka Connect JDBC source connector to read the replicated tables. Granting SELECT ANY TABLE also works, but it is far broader than the connector requires.
Kafka Connect JDBC Source Connector Configuration
The JDBC source connector imports data from the source Oracle database into Kafka topics. Because the connector's schema.pattern setting accepts a single schema pattern, the example below covers the tables in the INVENTORY schema; a second connector covering the SALES schema is shown after the configuration notes.
{
  "name": "oracle-source-inventory-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:oracle:thin:@source-oracle-host:1521/ORCLCDB",
    "connection.user": "kafka_user",
    "connection.password": "password",
    "table.whitelist": "CUSTOMERS,ORDERS",
    "schema.pattern": "INVENTORY",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "LAST_UPDATE_TIME",
    "incrementing.column.name": "ID",
    "topic.prefix": "oracle-source-INVENTORY.",
    "poll.interval.ms": "5000",
    "validate.non.null": "true",
    "numeric.mapping": "best_fit"
  }
}
Key Configuration Parameters:
- table.whitelist: The tables to capture data from, resolved within the schema selected by schema.pattern.
- schema.pattern: The schema whose table metadata is fetched. This takes a single pattern, not a comma-separated list, which is why tables spread across several schemas are handled with one connector per schema.
- mode: Set to "timestamp+incrementing" to capture inserts and updates based on a timestamp column combined with an incrementing column.
- timestamp.column.name: The column used to detect new or updated rows (e.g., LAST_UPDATE_TIME).
- incrementing.column.name: The unique, strictly incrementing column (usually the primary key).
- topic.prefix: Prefix prepended to each table name to form the Kafka topic name; here it yields topics such as oracle-source-INVENTORY.CUSTOMERS, matching the sink configuration below.
- poll.interval.ms: Interval in milliseconds at which the connector polls for new data.
Note: Ensure that the timestamp.column.name and incrementing.column.name columns are properly indexed for optimal performance.
Kafka Connect JDBC Sink Connector Configuration
The JDBC sink connector writes Kafka topic data to the target Oracle database. Below is an example configuration:
{
  "name": "oracle-sink-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "oracle-source-INVENTORY.CUSTOMERS,oracle-source-INVENTORY.ORDERS,oracle-source-SALES.PRODUCTS",
    "connection.url": "jdbc:oracle:thin:@target-oracle-host:1521/ORCLCDB",
    "connection.user": "kafka_user",
    "connection.password": "password",
    "auto.create": "true",
    "auto.evolve": "true",
    "insert.mode": "upsert",
    "delete.enabled": "false",
    "pk.mode": "record_value",
    "pk.fields": "ID",
    "table.name.format": "${topic}",
    "batch.size": "3000"
  }
}
Key Configuration Parameters:
- insert.mode: Set to "upsert" to handle both inserts and updates.
- pk.mode: Specifies how the primary key is determined (record_value uses fields from the record value).
- pk.fields: The fields used as primary keys in the target database.
- auto.create and auto.evolve: Automatically create tables and evolve their schemas in the target database.
- delete.enabled: Set to false, since deletes are not captured by the JDBC source connector in this setup.
Note: The table.name.format parameter controls how topics are mapped to target table names. With the schema-qualified topic names used here, ${topic} would yield unwieldy table names such as oracle-source-INVENTORY.CUSTOMERS, so it is common to rename the topic with a transform before it reaches the sink, as shown below.
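One way to handle this mapping is the RegexRouter single message transform (SMT) that ships with Kafka Connect. The fragment below is a sketch to merge into the sink connector's config block; it renames a topic such as oracle-source-INVENTORY.CUSTOMERS to CUSTOMERS before the sink resolves table.name.format, so the target table becomes CUSTOMERS.
{
  "transforms": "stripPrefix",
  "transforms.stripPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.stripPrefix.regex": "oracle-source-[^.]+\\.(.*)",
  "transforms.stripPrefix.replacement": "$1",
  "table.name.format": "${topic}"
}
If the same table name exists in more than one schema, keep the schema qualifier in the replacement or run separate sink connectors per schema so two sources are not written into one target table.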
Handling Master-Detail Table Relationships
Correctly handling master-detail relationships requires careful configuration to preserve referential integrity on the target.
Ensuring Data Consistency
- Ordering: Kafka preserves ordering only within a single topic partition, so there is no guarantee that a master row is applied before its detail rows when they travel through separate topics. In practice, either allow a brief window of inconsistency (upserts converge once both sides have been applied) or relax foreign-key enforcement on the target, for example with deferrable constraints.
- Transactions: Kafka Connect does not apply changes in multi-statement transactions that span topics. The batch.size parameter only controls how many records the sink writes per batch; it does not provide atomicity across tables.
Custom Query Modes
If necessary, use the query setting to control exactly what the connector retrieves, for example to denormalize a master-detail join into a single topic. When query is combined with an incremental mode, the connector appends its own WHERE clause containing the timestamp and incrementing criteria, so the supplied query must not contain a WHERE clause of its own (wrap joins or filters in a subquery instead):
"query": "SELECT * FROM INVENTORY.CUSTOMERS"
This lets you fine-tune the data capture process to suit complex relationships; a fuller sketch follows.
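As a fuller illustration, the sketch below defines a dedicated connector (query cannot be combined with table.whitelist) that publishes a denormalized view of orders together with their customer. The join column CUSTOMER_ID and the NAME column are assumptions for illustration, and the outer subquery keeps the generated WHERE clause unambiguous. Note that when query is used, topic.prefix is taken as the complete topic name rather than a prefix.
{
  "name": "oracle-source-orders-enriched",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:oracle:thin:@source-oracle-host:1521/ORCLCDB",
    "connection.user": "kafka_user",
    "connection.password": "password",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "LAST_UPDATE_TIME",
    "incrementing.column.name": "ID",
    "query": "SELECT * FROM (SELECT O.ID, O.LAST_UPDATE_TIME, O.CUSTOMER_ID, C.NAME AS CUSTOMER_NAME FROM INVENTORY.ORDERS O JOIN INVENTORY.CUSTOMERS C ON O.CUSTOMER_ID = C.ID)",
    "topic.prefix": "oracle-source-ORDERS_ENRICHED",
    "poll.interval.ms": "5000"
  }
}
Add the resulting topic to the sink connector's topics list if the denormalized view should also be replicated to the target database.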
Performance Optimization and Monitoring
Optimizing Kafka Connect Source Connector
- Increase tasks.max: Parallelize data ingestion across multiple tasks if the connector processes multiple tables.
- Proper Indexing: Ensure that timestamp.column.name and incrementing.column.name are indexed.
- Adjust poll.interval.ms: Balance latency and processing load by configuring an appropriate polling interval (a combined example follows this list).
Optimizing Kafka Connect Sink Connector
- Batch Processing: Adjust the batch.size parameter to optimize writes to the target database (an illustrative fragment follows this list).
- Connection Pooling: Configure the JDBC driver to use connection pooling for better performance.
Monitoring
- Kafka Connect REST API: Monitor connector status, task configurations, and runtime errors (an example status payload follows this list).
- JMX Metrics: Enable JMX to collect detailed metrics for connectors and tasks.
- Logging: Configure logging levels appropriately to capture necessary information without overwhelming the logging system.
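For example, querying a connector's status endpoint (GET /connectors/oracle-source-inventory-connector/status) returns a JSON payload roughly shaped like the sketch below; the worker addresses are placeholders. A state other than RUNNING for the connector or any task is the signal to inspect the failing task's trace field and the worker logs.
{
  "name": "oracle-source-inventory-connector",
  "connector": {
    "state": "RUNNING",
    "worker_id": "connect-worker-1:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "connect-worker-1:8083"
    }
  ],
  "type": "source"
}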
Deployment Considerations
Infrastructure Requirements
- Kafka Connect Cluster: Deploy in distributed mode for scalability and fault tolerance.
- Kafka Brokers: Use a multi-broker Kafka cluster to handle throughput and provide resilience.
- Database Resources: Ensure that both source and target Oracle databases have sufficient resources to handle additional load.
Best Practices
- Staging Environment: Test configurations in a non-production environment to validate behavior.
- Security: Use secure connections (e.g., SSL/TLS) for database and Kafka connections.
- High Availability: Implement appropriate backup and failover mechanisms for the databases and Kafka cluster.
Conclusion
Leveraging Kafka Connect's built-in JDBC connectors provides a lightweight solution for synchronizing Oracle tables in near real time. While this approach has limitations—particularly in capturing deletes and providing full CDC—it can be effective for scenarios that primarily involve inserts and updates.
By carefully configuring the connectors and understanding the operational considerations, you can achieve reliable data replication between Oracle databases, maintaining data consistency and supporting your organization's real-time data needs.
Additional Resources
- Kafka Connect JDBC Connector Documentation
- Oracle JDBC Driver Documentation
- Kafka Connect Best Practices
Disclaimer: The configurations and examples provided in this guide are for illustrative purposes. Please refer to the official Apache Kafka and Oracle documentation for the most up-to-date and detailed information.