<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrejus Voitovas</title>
    <description>The latest articles on DEV Community by Andrejus Voitovas (@ssmellow).</description>
    <link>https://dev.to/ssmellow</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1567934%2F95d86aee-39fd-4135-a023-37cfd0af9526.jpg</url>
      <title>DEV Community: Andrejus Voitovas</title>
      <link>https://dev.to/ssmellow</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ssmellow"/>
    <language>en</language>
    <item>
      <title>From Legacy to Modern: Creating Self-Testable APIs for Seamless Integration</title>
      <dc:creator>Andrejus Voitovas</dc:creator>
      <pubDate>Tue, 14 Jan 2025 05:59:53 +0000</pubDate>
      <link>https://dev.to/ssmellow/from-legacy-to-modern-creating-self-testable-apis-for-seamless-integration-c8p</link>
      <guid>https://dev.to/ssmellow/from-legacy-to-modern-creating-self-testable-apis-for-seamless-integration-c8p</guid>
      <description>&lt;p&gt;In every project there is a phase when you have to rewrite a legacy application into a new one. It’s often the case that refactoring is not only an unavoidable, but also extremely challenging task. One of the primary goals of such integrations is to ensure that both systems - new and old - produce consistent outputs. &lt;/p&gt;

&lt;p&gt;In this article, we'll dive into how we built a self-testable API mechanism. The goal is simple - to build a system that is able to validate its own input and output and notify the developers of any inconsistencies. With this system in place, we can then ensure that both systems produce identical results by running thousands of different scenarios.&lt;/p&gt;

&lt;p&gt;We’ll cover the architecture of such systems, the challenges we face in such situations, and what the limitations of ensuring the smooth coexistence of both systems are.&lt;/p&gt;

&lt;p&gt;   &lt;/p&gt;

&lt;h2&gt;
  
  
  Legacy
&lt;/h2&gt;

&lt;p&gt;Imagine the legacy system has been serving various business-critical operations for about 10 years. While it has proven reliable over the years, it follows several outdated standards that make new integrations nearly impossible. One of the primary complications is the diversity of data formats it handles. The legacy system supports both JSON and XML formats - an unfortunate result of an unsuccessful attempt to move from XML to JSON. This introduces complexity during integration. Another challenge is that the endpoints are used by a variety of third-party clients, so we can't make even the most minimal changes, because we don't know how they parse data or how they are integrated. This not only increases the complexity of the system, but it also means that we have to consider how the system can be refactored and replaced without ongoing operations being disrupted. And lastly, the legacy system doesn't follow RESTful architecture. There are a lot of different naming conventions, input variants (query parameters, body), authentication mechanisms, etc., which makes it less predictable and harder to standardize across different parts of the system.&lt;/p&gt;

&lt;p&gt;Given these constraints, our primary challenge is ensuring that our new system can be integrated smoothly into the legacy system, whilst maintaining consistent and verifiable outputs. Here, the self-testable API we have developed (more on that later) will act as a bridge, allowing us to automate the validation process between the two systems.&lt;/p&gt;

&lt;p&gt;   &lt;/p&gt;

&lt;h2&gt;
  
  
  New application
&lt;/h2&gt;

&lt;p&gt;Our new system is built on modern principles, but because numerous third-party clients use the legacy API, certain aspects must remain unchanged during the migration phase. Specifically, we can’t modify the authentication mechanism, API paths, or input formats, as these are what ensure seamless compatibility with existing clients.&lt;/p&gt;

&lt;p&gt;The main focus now is on migrating the code to the new system, while taking great pains to run as many tests as possible to verify its reliability. Only once the refactored code is fully tested and validated can we move forward with the plan to rewrite the API under a new version, /api/v2. The self-testable API is the main component of this process, as it allows us to validate that both the new and the legacy systems return the same outputs without disrupting current clients.&lt;/p&gt;

&lt;p&gt;   &lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Now, it's time to look at both systems from an architectural perspective. This is quite straightforward, but in any case we should have a clear picture of it. Both systems share a single MySQL database to access the same dataset. In front of these two systems, we have an HAProxy load balancer that routes traffic based on predefined rules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgrg4vh2s48xzc7gmn47.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgrg4vh2s48xzc7gmn47.png" alt="Architecture" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;   &lt;/p&gt;

&lt;h2&gt;
  
  
  Solution Design: Step-by-step implementation
&lt;/h2&gt;

&lt;p&gt;So, it’s time to present technical details on how we implemented this self-testable mechanism. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0px4rrgkz1yky2wa5lyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0px4rrgkz1yky2wa5lyk.png" alt="Diagram" width="800" height="886"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this diagram you can see that we have both a New and a Legacy system. When an API request is received, we forward the request along with all the data directly to the legacy system, and then wait for its response. After the legacy system has responded, we immediately serve this output to the client to prevent any delays. We also gather all the incoming request data into a message and send it to RabbitMQ using Symfony Messenger. As part of the asynchronous processing, we send the same API call to the new system for validation. This entire process runs in the background to prevent any performance impact on the primary API response.&lt;/p&gt;
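
&lt;p&gt;To make this more concrete, here is a hedged sketch of the listener side - the class, the message object, and the route names are illustrative, not the exact production code. The listener runs after the response has been sent and dispatches the captured request for background validation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use Symfony\Component\HttpKernel\Event\TerminateEvent;
use Symfony\Component\Messenger\MessageBusInterface;

final class ShadowRequestListener
{
    // Routes already released to the new system are excluded from shadowing
    private const RELEASED_ROUTES = ['api_orders_v2'];

    public function __construct(private MessageBusInterface $bus)
    {
    }

    // kernel.terminate fires after the legacy response has been sent,
    // so dispatching never delays the client
    public function onKernelTerminate(TerminateEvent $event): void
    {
        $request = $event-&amp;gt;getRequest();

        if (in_array($request-&amp;gt;attributes-&amp;gt;get('_route'), self::RELEASED_ROUTES, true)) {
            return;
        }

        $this-&amp;gt;bus-&amp;gt;dispatch(new ShadowRequestMessage(
            $request-&amp;gt;getPathInfo(),
            $request-&amp;gt;getMethod(),
            $request-&amp;gt;query-&amp;gt;all(),
            (string) $request-&amp;gt;getContent(),
            (string) $event-&amp;gt;getResponse()-&amp;gt;getContent(),
        ));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;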

&lt;p&gt;Once we receive the output from the new system, we compare it to the legacy system’s response. If there’s a mismatch between the two, an error is logged, and the development team is automatically notified, allowing us to detect bugs or inconsistencies between the systems without affecting the client’s experience.&lt;/p&gt;
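
&lt;p&gt;The consuming side can then replay the call and compare. Again, this is only a sketch under the same assumptions - the message fields and the new system’s client service are invented for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use Psr\Log\LoggerInterface;
use Symfony\Component\Messenger\Attribute\AsMessageHandler;
use Symfony\Contracts\HttpClient\HttpClientInterface;

#[AsMessageHandler]
final class ShadowRequestHandler
{
    public function __construct(
        private HttpClientInterface $newSystemClient,
        private LoggerInterface $logger,
    ) {
    }

    public function __invoke(ShadowRequestMessage $message): void
    {
        // Replay the original call against the new system
        $response = $this-&amp;gt;newSystemClient-&amp;gt;request($message-&amp;gt;method, $message-&amp;gt;path, [
            'query' =&amp;gt; $message-&amp;gt;query,
            'body' =&amp;gt; $message-&amp;gt;body,
        ]);

        // Compare with the legacy output captured earlier and alert on mismatch
        if ($response-&amp;gt;getContent(false) !== $message-&amp;gt;legacyResponse) {
            $this-&amp;gt;logger-&amp;gt;error('Shadow mismatch detected', ['path' =&amp;gt; $message-&amp;gt;path]);
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;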

&lt;p&gt;After the new code has been fully tested and verified, we release the API. Releasing means that we exclude a specific route from the shadow mechanism. At this point, all incoming requests are routed exclusively to the new system, bypassing the legacy system entirely. The legacy API calls are no longer needed, marking a complete migration of the route. We repeat this process from the beginning for each route.&lt;/p&gt;

&lt;p&gt;If we dive deeper and look at it from the Symfony perspective, it looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/andriusvo/2276ea326e2ca19fc87ae984237214da" rel="noopener noreferrer"&gt;Kernel request Event Listener&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/andriusvo/0175870310b1f74436013f2e8e908eb2" rel="noopener noreferrer"&gt;Symfony Message Handler&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;   &lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;But not everything goes as smoothly as it seems. Every system and approach has its limitations, and ours is no exception. Since both the legacy and new systems rely on a single database, we face a significant challenge: we cannot modify the data during testing. This means that our self-testable API approach is only viable for read-only endpoints or those that don’t involve data modification. For any endpoints that involve creating, updating, or deleting data, this approach isn’t usable, as changes made by one system would affect the other, leading to inaccurate comparisons and potential data integrity issues.&lt;/p&gt;

&lt;p&gt;   &lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this post, we have looked at how we can verify whether the refactoring process has gone well. As I have outlined in this article, we forward requests to the legacy system and present the output right away to ensure that everything works as before. Meanwhile, we use Symfony Messenger to check the outputs against the new system in the background. We might still face some issues, as sharing the same database can create challenges, particularly for endpoints that modify data. Despite this, the approach presented above is very useful, as it helps us to ensure that our migration process is bug-free, while reducing risks and allowing us to deliver a smoother shift to the new system.&lt;/p&gt;




&lt;p&gt;Copyright &lt;a href="https://nfq.lt" rel="noopener noreferrer"&gt;NFQ Technologies&lt;/a&gt; 2025&lt;/p&gt;

</description>
      <category>api</category>
      <category>testing</category>
      <category>symfony</category>
    </item>
    <item>
      <title>Doctrine’s Collection filter method - a double-edged sword</title>
      <dc:creator>Andrejus Voitovas</dc:creator>
      <pubDate>Wed, 11 Dec 2024 03:54:22 +0000</pubDate>
      <link>https://dev.to/ssmellow/doctrines-collection-filter-method-a-double-edged-sword-he</link>
      <guid>https://dev.to/ssmellow/doctrines-collection-filter-method-a-double-edged-sword-he</guid>
      <description>&lt;p&gt;As a developer, you're probably no stranger to the challenges associated with large datasets. From optimizing your code to managing database queries, there's a lot to consider when working with massive amounts of data. But have you ever stopped to consider the impact of your choice of data structure?&lt;/p&gt;

&lt;p&gt;Enter Doctrine Collections. These powerful data structures are a staple of object-oriented PHP development, offering a convenient way to manage complex data relationships. However, as any seasoned developer would tell you, working with Doctrine Collections can also be a double-edged sword.&lt;/p&gt;

&lt;p&gt;On the one hand, these collections offer a powerful and flexible way to organize and manipulate your data. On the other hand, they can quickly become a bottleneck when working with large datasets, slowing down your application and hindering your ability to make progress.&lt;/p&gt;

&lt;p&gt;In this article, we'll explore the challenges and opportunities presented by Doctrine Collections, and provide you with a number of practical tips and strategies for managing large datasets. Whether you're a beginner or an experienced developer, you'll come away with a deeper understanding of how to optimize your code and get the most out of your data structures.&lt;/p&gt;

&lt;p&gt;  &lt;/p&gt;

&lt;h2&gt;
  
  
  Filtering
&lt;/h2&gt;

&lt;p&gt;Doctrine Collections have many implemented methods, including “map”, “matching”, and “filter”. In this article, we’ll be focusing specifically on the last of these, which is used to filter a collection of objects based on a given Closure. This can help you to simplify your code and make it more readable by allowing you to express the filtering criteria in a concise and declarative manner. The “filter” method is a good choice in cases where you want to display only a subset of the objects in a collection based on some criteria, such as those meeting specific conditions. For instance, you might want to filter a collection of customer orders that match a specific order state.&lt;/p&gt;

&lt;p&gt;  &lt;/p&gt;

&lt;h2&gt;
  
  
  How does filtering work under the hood?
&lt;/h2&gt;

&lt;p&gt;When you call the “filter” method on a collection, Doctrine creates a new instance of the collection class and iterates over the original collection. An entity is added to the new collection only when the filter function for it returns “true”.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;protected function createFrom(array $elements)
{
    return new static($elements);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The “filter” method uses PHP's built-in “array_filter” function to apply the given Closure to each entity in the collection. The “array_filter” function takes an array as its input and returns a new array that contains only the elements that satisfy the callback.&lt;/p&gt;

&lt;p&gt;With Doctrine's collection classes, the input to “array_filter” is an array of entities. The “filter” itself, on the other hand, is a callable that takes an entity as its input and returns a boolean value, indicating whether the entity should be included in the filtered collection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public function filter(Closure $p)
{
    return $this-&amp;gt;createFrom(array_filter($this-&amp;gt;elements, $p, ARRAY_FILTER_USE_BOTH));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
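
&lt;p&gt;As a standalone illustration of this behaviour, plain “array_filter” with a closure keeps only the elements for which the callback returns true - note that it also preserves the original array keys:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$states = ['new', 'fulfilled', 'cancelled', 'fulfilled'];

$fulfilled = array_filter($states, static fn (string $state) =&amp;gt; $state === 'fulfilled');

// Keys are preserved: [1 =&amp;gt; 'fulfilled', 3 =&amp;gt; 'fulfilled']
var_dump($fulfilled);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;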



&lt;p&gt;Once the filtered collection has been created, Doctrine returns it as the result of the “filter” method call, leaving the original collection unchanged.&lt;/p&gt;

&lt;p&gt;  &lt;/p&gt;

&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;To identify the disadvantages of using the “filter” method, let’s work through a specific example on Symfony with Doctrine. We have already installed and prepared:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Symfony 6.2 with Doctrine ORM 2.14&lt;/li&gt;
&lt;li&gt;PHP 8.2&lt;/li&gt;
&lt;li&gt;MySQL 8 (InnoDB Engine)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s begin with a situation. For the following example, I’ll be using Order and Customer entities from the Sylius e-commerce framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Order Entity&lt;/li&gt;
&lt;li&gt;Customer Entity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To test the “filter()” method, I’ve prepared 100,000 rows of orders and 100,000 rows of customers as the data input. In the following case, we’ll be trying to find all the specific customer orders with the status “Fulfilled”. Presently, the customer has 11,653 orders – 8,658 with the “Fulfilled” status, and 2,995 with other statuses. This test is executed 10 times in order to calculate average times and check memory consumption. All examples are compared with the logically identical QueryBuilder method to see the clear difference between use cases.&lt;/p&gt;
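
&lt;p&gt;For reference, the logically identical QueryBuilder variant used in the comparison might look something like this (a repository-level sketch; the method and parameter names are assumed rather than quoted from the benchmark code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public function findFulfilledOrders(CustomerInterface $customer): array
{
    return $this-&amp;gt;createQueryBuilder('o')
        -&amp;gt;andWhere('o.customer = :customer')
        -&amp;gt;andWhere('o.state = :state')
        -&amp;gt;setParameter('customer', $customer)
        -&amp;gt;setParameter('state', 'fulfilled')
        -&amp;gt;getQuery()
        -&amp;gt;getResult();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;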

&lt;p&gt;For the following test, I’ve prepared a method using the above-mentioned “filter”:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public function getFulfilledOrdersWithFilter(): Collection
{
    return $this-&amp;gt;orders-&amp;gt;filter(static fn(Order $order) =&amp;gt; $order-&amp;gt;getState() === 'fulfilled');
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cx08qrr3kq0swrzpquj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cx08qrr3kq0swrzpquj.jpg" alt="Orders with filter" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this data table, we can clearly see that the “filter” method is, on average, 38.5% slower than Doctrine QueryBuilder. Furthermore, memory consumption is significantly (27.2%) higher than with the QueryBuilder. Why is that? As I’ve mentioned in the previous section, the Doctrine Collection loads all the order objects that belong to the customer and only then keeps those that meet the given criteria. This consumes significantly more memory and takes longer to complete.&lt;/p&gt;

&lt;p&gt;  &lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;How can the code be refactored to avoid using the “filter” Collection method? With the code provided, we actually have two options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Doctrine Query Builder: As the above example shows, Doctrine QueryBuilder is the most efficient in terms of time and memory. This method fetches rows and matches them with the provided criteria. After fetching, subsequent PHP-level operations have less data to work with, resulting in a faster and less memory-intensive process.&lt;/li&gt;
&lt;li&gt;Doctrine Criteria: “Matching” is another useful method of Doctrine Collections. We can use this method along with the “Criteria” object.
Let’s see how we might refactor the code using Doctrine Criteria and the “matching” method. Since both are still going to use Doctrine collections, our aim will be to find a way to make this query faster. For this test, I’ve prepared a method using a Criteria object:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public function getFulfilledOrdersWithCriteria(): Collection
{
    $expressionBuilder = Criteria::expr();

    $criteria = Criteria::create();
    $criteria-&amp;gt;where($expressionBuilder-&amp;gt;eq('state', 'fulfilled'));

    return $this-&amp;gt;orders-&amp;gt;matching($criteria);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksj9ffzs3x0m6cxx3xa6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksj9ffzs3x0m6cxx3xa6.jpg" alt="Orders with criteria" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the data above, we can see that using “Criteria” takes about as long as QueryBuilder, and consumes almost as much memory. Given that both the “filter” and the “matching” methods operate on the same collection, why do they differ in terms of duration and memory consumption? The answer is that, if the collection has not yet been loaded from the database, “matching” applies the Criteria at the SQL level to optimize access to large collections. Meanwhile, the “filter” method iterates over rows already selected from the database. This explains the virtually identical duration and memory requirements of the Criteria and QueryBuilder approaches.&lt;/p&gt;

&lt;p&gt;  &lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Doctrine Collections and QueryBuilders serve different purposes and have different performance characteristics.&lt;/p&gt;

&lt;p&gt;The “filter” method on a Doctrine Collection is used to filter the elements of the collection based on a given condition. This method works on the collection in memory and requires that all the collection elements be loaded into memory before filtering. This can be an expensive operation if the collection is large, as it consumes a lot of memory.&lt;/p&gt;

&lt;p&gt;On the other hand, the QueryBuilder is used to create SQL queries that operate at the database level. The QueryBuilder constructs a SQL query that is sent to the database, and only the matching rows are returned. This approach is much more efficient when dealing with large datasets, as it only retrieves the relevant data from the database, rather than requiring all of it to be loaded into memory.&lt;/p&gt;

&lt;p&gt;Moreover, the “matching” method is faster than “filter” when you have a large collection and a complex filtering condition. This is because the matching method will create a SQL query that is optimized for database filtering. The resulting SQL query will fetch only the data that matches the filtering condition from the database, instead of fetching all the data and then filtering it in PHP – which is what the “filter” method does.&lt;/p&gt;

&lt;p&gt;In conclusion, if you are dealing with large datasets, using QueryBuilder is generally faster and more memory-efficient than using the “filter” method on a Doctrine Collection. However, if you are dealing with a small dataset, the performance difference may be negligible, and you can choose based on personal preference or convenience.&lt;/p&gt;




&lt;p&gt;Copyright &lt;a href="https://nfq.lt" rel="noopener noreferrer"&gt;NFQ Technologies&lt;/a&gt; 2024&lt;/p&gt;

</description>
      <category>doctrine</category>
      <category>symfony</category>
      <category>database</category>
    </item>
  </channel>
</rss>
