Abhishek Mishra for Dozer

Originally published at getdozer.io

Creating a Low-Latency Cache with Dozer and PostgreSQL

In the realm of software development and data management, the efficiency of data retrieval plays a crucial role in the overall performance of an application. One effective way to enhance this efficiency is by implementing a cache. A cache is a high-speed data storage layer that stores a subset of data, typically transient in nature, so that future requests for that data are served faster than is possible by accessing the data’s primary storage location.

This blog post explores how to enhance these aspects by implementing a low-latency cache using Dozer and PostgreSQL. We delve into the importance of caching in data management, the limitations of not having a cache, and the unique advantages of using Dozer as a cache layer. We also provide a guide on how to use Dozer for real-time data ingestion from PostgreSQL. This post is a must-read for developers seeking to improve their application's performance and responsiveness, and for those looking to simplify their data infrastructure.

Why Do We Need a Cache on Top of PostgreSQL?

PostgreSQL is a powerful, open-source object-relational database system. It is highly extensible and enables high performance on complex queries and heavy loads. However, like any other database, it has its limitations.

When a database grows in size and complexity, the time taken to retrieve data can increase significantly. This is especially true for complex queries that involve multiple joins and aggregations. In such scenarios, every millisecond counts, and the latency can add up quickly, leading to a slow and unresponsive application.

This is where a cache comes into play. By storing frequently accessed data in a high-speed storage layer closer to the application, a cache significantly reduces the time taken to serve requests for that data.

What's the Problem with Not Having a Cache?

Without a cache, every time a client requests data, the application has to query the database. This can be a time-consuming operation, especially if the database is large or the query is complex. This can lead to increased latency and a poor user experience.

Moreover, without a cache, the database can become a bottleneck, especially under heavy load. Every read and write operation involves a significant amount of work, and as the number of operations increases, the time taken to complete these operations also increases. This can lead to increased CPU and memory usage, and in extreme cases, it can even lead to database crashes.

Why Not PostgreSQL Read Replica?

While a PostgreSQL read replica can certainly help offload some of the read operations from your primary database, it's not without its limitations. The process of querying a read replica still involves a network round trip and the overhead of a database query. These factors can contribute to latency, especially when dealing with large volumes of data or complex queries. A cache like the one provided by Dozer, on the other hand, can serve data much faster than a read replica.

Whether to use a read replica or a cache layer like Dozer ultimately depends on your specific use case and requirements. A read replica can improve read performance by distributing read traffic across multiple instances, but because every request still executes a full SQL query over the network, it may not provide the same level of performance improvement as a cache layer like Dozer.

Moreover, Dozer doesn't make a distinction between types of data sources. Developers get a seamless experience building products on top of application databases such as Postgres and MySQL, data warehouses such as Snowflake, and cloud storage such as S3 and Delta Lake. Dozer can also consume real-time events and Ethereum data.

Why Dozer?

In the context of database management, most OLTP workloads involve random disk I/O usage. Given that disks, including SSDs, are slower in performance than RAM, database systems use caching to increase performance. Caching is all about storing data in memory (RAM) for faster access at a later point in time.

One approach to database caching involves replicating data using Change Data Capture (CDC) to an alternative database optimized for your queries. This method is increasingly being adopted for operations that require extensive data reading. However, it comes with its own set of challenges such as ensuring real-time data, managing indexing, guaranteeing availability, handling schema changes, and considering costs. Dozer addresses these pain points by providing an end-to-end system that takes care of the complexity of caching logic, allowing developers to focus on building their applications.
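
To make this replication pattern concrete, here is a minimal, purely conceptual sketch in Python, not Dozer's implementation: a stream of change events, of the kind a CDC pipeline emits, keeps a local in-memory copy of a table up to date, and reads are then served from that copy instead of the primary database.

# Conceptual sketch only: apply a stream of CDC-style change events to an
# in-memory copy of a table. Dozer automates this pattern end to end;
# nothing here is Dozer's actual code.
cache = {}  # primary key -> row

def apply_change(event):
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        cache[key] = event["row"]
    elif op == "delete":
        cache.pop(key, None)

# Example events, in the order a CDC stream might deliver them.
events = [
    {"op": "insert", "key": 1, "row": {"title": "Dune", "seats": 120}},
    {"op": "update", "key": 1, "row": {"title": "Dune", "seats": 118}},
]

for event in events:
    apply_change(event)

# Reads now hit the local copy rather than the primary database.
print(cache[1])  # {'title': 'Dune', 'seats': 118}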

Dozer, an open-source data API backend, provides a solution to these challenges. It connects to any of your data sources, transforms and stores the data in an embedded cache powered by LMDB, automatically creates secondary indexes, and instantly generates low-latency REST and gRPC APIs. Dozer is implemented entirely in Rust, which keeps data processing fast and resource usage low.

Without a cache layer, data retrieval would rely solely on the performance of the PostgreSQL database and its internal caching mechanisms. This can result in slower response times and increased load on the database server. By using a cache layer like Dozer, you can improve the performance of data retrieval and reduce the load on the database server.
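
To see why an embedded cache answers reads so quickly, here is a small sketch using the Python lmdb bindings. It is only an illustration of the kind of memory-mapped key-value store Dozer builds on, not Dozer's internal code: both the write and the read are local operations with no network round trip and no SQL planning.

import lmdb

# Open (or create) an embedded, memory-mapped key-value store.
env = lmdb.open("example-cache", map_size=10 * 1024 * 1024)

# Write a record: a local B-tree insert, no network hop, no SQL parsing.
with env.begin(write=True) as txn:
    txn.put(b"movie:1", b'{"title": "Dune", "seats": 118}')

# Read it back: a direct lookup against the memory-mapped file.
with env.begin() as txn:
    print(txn.get(b"movie:1"))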

How to Use Dozer as a Cache

Architecture Diagram

Here's a simple architecture diagram showing how Dozer can be used as a cache layer for a PostgreSQL database:

[Architecture diagram: Dozer as a cache layer for a PostgreSQL database]

In a nutshell:

  1. Dozer begins by ingesting data in real-time from the PostgreSQL database using the PostgreSQL connector and stores this data in its cache.
  2. When a client sends a request for data, Dozer processes the request using the low-latency APIs it has generated, and retrieves the required data from its cache to serve the client.
  3. Write requests are handled by the backend API, which communicates directly with the PostgreSQL database. Once the write operation is confirmed, PostgreSQL sends a write confirmation back to the backend API.

The key point here is that data ingestion into Dozer's cache and the serving of data to the client through Dozer's APIs occur in parallel. This setup ensures quick and efficient access to data, enhancing the performance of your application; the sketch below illustrates the read/write split.
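
A minimal sketch of this split follows. The hostnames, ports, and endpoint names are assumptions made up for illustration; substitute the addresses of your own Dozer deployment and backend service.

import requests

DOZER_API = "http://localhost:8080"    # assumed address of Dozer's generated REST API
BACKEND_API = "http://localhost:3000"  # assumed address of your write-path backend

def list_bookings():
    # Read path: served straight from Dozer's cache, not from PostgreSQL.
    response = requests.get(f"{DOZER_API}/bookings")
    response.raise_for_status()
    return response.json()

def create_booking(payload):
    # Write path: the backend API writes to PostgreSQL, and the change then
    # flows back into Dozer's cache via the real-time PostgreSQL connector.
    response = requests.post(f"{BACKEND_API}/bookings", json=payload)
    response.raise_for_status()
    return response.json()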

Dozer provides a PostgreSQL connector that allows it to ingest data in real-time from your PostgreSQL database. The configuration can be as simple as the following:

app_name: movie-booking
connections:
  - config: !Postgres
      user: postgres
      password: postgres
      host: 0.0.0.0
      port: 5438
      database: database_name
    name: connection_name

sql: |
  -- transformation and
  -- analytical queries here.

sources:
  - name: source_name_1
    table_name: table_name
    columns:
    connection: !Ref connection_name

  ...

endpoints:
  - name: endpoint_name
    path: /endpoint_path
    table_name: table_name
    index:
      primary_key:
        - primary_key_column_name

This configuration allows Dozer to ingest data in real-time from your PostgreSQL database and store it in a high-performance cache. This cache can then be used to serve low-latency data APIs to your application.
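
Once Dozer is running with this configuration, the endpoint defined above can be queried over REST. The snippet below is only a sketch; the host, port, and response shape are assumptions, so check the Dozer documentation for the exact request and query formats supported by your version.

import requests

# Assumed local Dozer REST address plus the endpoint path from the config above.
BASE_URL = "http://localhost:8080"

response = requests.get(f"{BASE_URL}/endpoint_path")
response.raise_for_status()

# The records returned here come from Dozer's cache, not from PostgreSQL.
print(response.json())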

Conclusion

Implementing Dozer with its ability to ingest data in real time from PostgreSQL can significantly improve the performance of your application by reducing the time taken to access frequently used data, making it more efficient and responsive. By leveraging Dozer's real-time SQL engine, automatic secondary indexing, and instant API generation, developers can drastically lower the cost, complexity, and effort involved in putting together the data infrastructure necessary to build data applications.

Happy coding, Happy Data APIng! πŸš€πŸ‘©β€πŸ’»πŸ‘¨β€πŸ’»
