Enhancing Symfony Application Performance: Overcoming Many-to-Many Challenges with Postgres and ClickHouse

Introduction
In this article, I would like to share my experience optimizing the performance of many-to-many relationships in a Symfony application backed by PostgreSQL, and how ClickHouse helped us along the way.

I want to start by saying that I'm a proponent of writing less code whenever possible. The best code is no code at all, and the best solution is a standard one described in the documentation. Especially today, when everything has already been implemented, all that's left is to configure and use it correctly. What could be better for a Symfony application than creating a data schema and entities using the maker bundle, simply specifying property names and relationships?

However, sometimes the task forces us to look at familiar things from a different angle.

The goal
So, the goal is quite simple: there are financial transactions - operations that are tagged with key-value pairs, for example:

merchant: someMerchant,
provider: someProvider,
customer: someCustomer,
card: someCardToken,
country: IL

We need to manage limits in various ways, for example:

  • the sum of operations for someProvider must not exceed 5 million USD per month,
  • the number of operations from a single payer at someMerchant should not exceed 10 per day,
  • the sum of operations for someCardToken from outside Russia should not exceed 50,000 USD per week for anotherMerchant, and so on.

Using the PHP + Symfony/Doctrine stack, the schema is represented as a many-to-many relationship as shown below:
[Diagram: many-to-many schema linking operation and tag through the operation_tag join table]
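
In plain SQL, the same schema boils down to three tables. Here is a simplified sketch, with column types and constraints being my assumptions inferred from the entities and queries below:

CREATE TABLE operation (
    operation_id BIGINT    PRIMARY KEY,
    amount       BIGINT    NOT NULL,
    created_at   TIMESTAMP NOT NULL
    -- ... other columns omitted
);

CREATE TABLE tag (
    tag_id BIGINT       PRIMARY KEY,
    name   VARCHAR(255) NOT NULL,
    value  VARCHAR(255) NOT NULL
);

CREATE TABLE operation_tag (
    operation_id BIGINT NOT NULL REFERENCES operation (operation_id),
    tag_id       BIGINT NOT NULL REFERENCES tag (tag_id),
    PRIMARY KEY (operation_id, tag_id)
);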

And here are the generated entity definitions:

/**
 * @ORM\Table(name="operation")
 * @ORM\Entity()
 */
class Operation
{
    // ......

    /**
     * @var ArrayCollection|Tag[]
     *
     * @ORM\ManyToMany(targetEntity="Tag")
     * @ORM\JoinTable(
     *     name="operation_tag",
     *     joinColumns={@ORM\JoinColumn(name="operation_id", referencedColumnName="operation_id")},
     *     inverseJoinColumns={@ORM\JoinColumn(name="tag_id", referencedColumnName="tag_id")}
     * )
     */
    private $tags;

    // ......
}

/**
 * @ORM\Table(name="tag")
 * @ORM\Entity()
 */
class Tag
{
    /**
     * @ORM\Column(name="tag_id", type="bigint", nullable=false)
     * @ORM\Id
     * ...
     */
    private int $tagId;

    /**
     * @ORM\Column(name="name", type="string", length=255, nullable=false)
     */
    private string $name;

    /**
     * @ORM\Column(name="value", type="string", length=255, nullable=false)
     */
    private string $value;

    // ...
}

The schema is slightly complicated by the fact that tags are not just simple strings but key-value pairs.

So, for example, calculating the limit of operations for someProvider from a unique customer at someMerchant would look like this:

-- Calculate the limit of operations for someProvider from a unique customer at someMerchant
SELECT
    SUM(o.amount)
FROM
    operation o
INNER JOIN
    operation_tag ot1 ON ot1.operation_id = o.operation_id
INNER JOIN
    tag t1 ON ot1.tag_id = t1.tag_id
INNER JOIN
    operation_tag ot2 ON ot2.operation_id = o.operation_id
INNER JOIN
    tag t2 ON ot2.tag_id = t2.tag_id
INNER JOIN
    operation_tag ot3 ON ot3.operation_id = o.operation_id
INNER JOIN
    tag t3 ON ot3.tag_id = t3.tag_id
WHERE
    t1.name = 'provider'
    AND t1.value = 'someProvider'
    AND t2.name = 'merchant'
    AND t2.value = 'someMerchant'
    AND t3.name = 'customer_id'
    AND t3.value = 'someUniqueCustomerId'
    AND o.created_at BETWEEN '2023-08-01' AND '2023-08-31'

Execution time: ~1.5 seconds.

And if the aggregation involves not 3 but 4 tags, there will be 8 joins, and so on.
If multiple such queries are part of the check, the total execution time can become significant.

We initially assumed that the performance bottleneck was PostgreSQL itself. After some quick research, we decided to try another database that would be much faster at this kind of query, especially with the data volume growing.

Our selection criteria were as follows:

  • Support for SQL or something similar so that we could rewrite only the database layer with minimal impact on the business logic
  • Good performance on analytical queries and aggregations over large amounts of data
  • A reliable PHP client available
  • Affordable price
  • A well-developed community and support

NoSQL databases were ruled out immediately by criterion 1. Oracle didn't fit criteria 4 and 5. As a result, we settled on ClickHouse: it is known for its high performance on huge datasets, and it's free.

Let's go.

The syntax for creating a table in ClickHouse is similar to MySQL's:

CREATE TABLE IF NOT EXISTS ourdb.operation
(
    ...
    amount UInt64,
    tags Nested (
        name String,
        value String
    )
)
-- a table engine must usually be specified; MergeTree ordered by created_at is assumed here
ENGINE = MergeTree()
ORDER BY created_at


One of ClickHouse's strengths is how well it handles denormalized data. Goodbye, joins! The Nested type effectively embeds an inner table inside each row, which means every operation carries its own set of tags in denormalized form.
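
For context, inserting into a Nested structure means passing each nested column as a parallel array. A minimal sketch using only the columns shown above (the elided columns and the actual client call are left out):

INSERT INTO ourdb.operation (amount, `tags.name`, `tags.value`)
VALUES (
    1000,
    ['provider', 'merchant', 'customer_id'],
    ['someProvider', 'someMerchant', 'someUniqueCustomerId']
);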

And then, the ClickHouse query aggregating the sum looked like this:

SELECT SUM(amount) AS result
FROM operation
WHERE created_at >= '2023-08-01 00:00:00' AND created_at <= '2023-08-31 23:59:59'
 AND tags.value[indexOf(tags.name, 'provider')] = 'someProvider'
 AND tags.value[indexOf(tags.name, 'merchant')] = 'someMerchant'
 AND tags.value[indexOf(tags.name, 'customer_id')] = 'someUniqueCustomerId'

We run the query, and the execution time is 0.4 seconds! And it stays that way for any number of tags, each of which would have turned into another pair of expensive joins in the many-to-many schema.

Nice. However, ClickHouse also has its drawbacks. First, it is excellent for selects but much less efficient for inserts and updates. There is even an official recommendation to insert data in batches of at least 1,000 rows. Of course, we could have ignored that recommendation, but we didn't want to run such an experiment in production. Batching, however, meant waiting for a batch to accumulate, which can cause critical trouble when transactions come in too fast: race conditions and data inconsistency. Taking this into account, we tried a hybrid approach - keeping the most recent, operational part of the transactions (created within the last hour) in PostgreSQL, handling the rest in ClickHouse, and then summing up the two results. It had some effect, but it didn't fundamentally boost performance, because the expensive PostgreSQL joins remained expensive joins even on a smaller data set.
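
To make the hybrid idea more concrete, here is a rough sketch of the two queries whose results would be summed up in application code. It is simplified to a single tag, and the one-hour boundary is an assumption:

-- PostgreSQL: the "hot" operations created within the last hour (many-to-many schema)
SELECT COALESCE(SUM(o.amount), 0)
FROM operation o
INNER JOIN operation_tag ot1 ON ot1.operation_id = o.operation_id
INNER JOIN tag t1 ON ot1.tag_id = t1.tag_id
WHERE t1.name = 'provider' AND t1.value = 'someProvider'
  AND o.created_at >= NOW() - INTERVAL '1 hour';

-- ClickHouse: everything older than one hour within the limit period
SELECT SUM(amount)
FROM operation
WHERE tags.value[indexOf(tags.name, 'provider')] = 'someProvider'
  AND created_at >= '2023-08-01 00:00:00'
  AND created_at < now() - INTERVAL 1 HOUR;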

What if we tried to implement this nested structure in PostgreSQL itself? It has great functionality for JSON. Let's try!

ALTER TABLE operation
    ADD tags_nested JSONB DEFAULT '{}' NOT NULL;

CREATE INDEX ix_operation_tags_nested ON operation
    USING gin(tags_nested jsonb_path_ops);

We added a tags_nested column of type jsonb along with a GIN index using the jsonb_path_ops operator class, which is compact and fast for the containment (@>) searches we need on these flat maps; the default jsonb_ops class supports a broader set of operators, such as key-existence checks, at the cost of a larger index.
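
The storage format matters here: the containment query below assumes tags_nested holds an array of single-pair objects, one per tag. A possible backfill from the existing many-to-many tables, sketched here for illustration:

-- Build an array like [{"provider": "someProvider"}, {"merchant": "someMerchant"}, ...]
UPDATE operation o
SET tags_nested = sub.tags
FROM (
    SELECT ot.operation_id,
           jsonb_agg(jsonb_build_object(t.name, t.value)) AS tags
    FROM operation_tag ot
    INNER JOIN tag t ON t.tag_id = ot.tag_id
    GROUP BY ot.operation_id
) sub
WHERE sub.operation_id = o.operation_id;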

Let's execute the updated select statement:

SELECT
    SUM(o.amount)
FROM
    operation o
WHERE
    o.tags_nested::jsonb @> '[{"provider":"someProvider"},{"merchant":"someMerchant"},{"customer_id":"someUniqueCustomerId"}]'
    AND o.created_at BETWEEN '2023-08-01' AND '2023-08-31'

The execution time, regardless of the number of tags specified, does not exceed 0.3 seconds! Could it be that simple?

Here, the containment operator @> works well for finding rows whose tags contain all of the specified pairs.

However, this also illustrates the drawback of this schema and of the approach in general - we've tied the business logic to the implementation of a specific database. In MySQL, for example, the same check would have to be rewritten using JSON_CONTAINS(). It breaks the ORM abstraction principle.
But realizing that we're likely to change the database only together with the entire infrastructure and codebase in some abstract future, we overcame these doubts.
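
For illustration, the MySQL variant of the same check might look roughly like this, assuming the same array-of-pairs storage format (a sketch for illustration only):

SELECT SUM(o.amount)
FROM operation o
WHERE JSON_CONTAINS(o.tags_nested, '{"provider": "someProvider"}')
  AND JSON_CONTAINS(o.tags_nested, '{"merchant": "someMerchant"}')
  AND JSON_CONTAINS(o.tags_nested, '{"customer_id": "someUniqueCustomerId"}')
  AND o.created_at BETWEEN '2023-08-01' AND '2023-08-31';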

This is what the transition moment looked like on the chart:
[Chart: the moment of the transition]

Conclusion
We decided to postpone ClickHouse until harder times, when PostgreSQL can no longer handle the load.

Spoiler: now, two years later, that time has come, which is what pushed me to share this experience in an article. We are currently reviving ClickHouse and adapting it to the new requirements. Fortunately, we never removed it from the code repository, we just disabled it.

I decided not to overwhelm the article with query plans and simplified the examples significantly; with all the technical details included, they would only distract from the point.

Thanks for reading. Feel free to share your opinion if you've had a similar experience, or just leave a note.
Peace to everyone!
