ClickHouse has been popular in recent years as an analytic database. It is noted for the high speed and is often used as the benchmark by vendors of many other analytic databases whose selling points are high performance. Many users will consider turning to ClickHouse when they encounter database performance problems.
ClickHouse has an outstanding feature. Its column-oriented wide table runs fast, thanks to the excellent compression probably. However, it does not have any other merits. Users who hope to use it to solve database performance problems will likely feel disappointed.
We compare the TPC-H 100G performance of ClickHouse with that of Oracle in the same hardware environment. Below only lists the test results (Time unit: second). Find complete test report HERE.
TPC-H database performance test is relatively basic and – generally speaking, not complicated. But not all tests are simple single table traversal, there are also JOINs and subqueries. In those cases, ClickHouse’s performance is very poor. There are quite a few computing tasks that can’t be finished at all; SQL statements of a few tasks are even considered too complicated to be executed. ClickHouse is not even as good as Oracle in handling them. Oracle is not a specialized analytic database and is much slower, but it can complete all tasks.
The specific reason is probably as mentioned above. ClickHouse only does a good job in data storage compression, which makes simple traversal very fast; but for slightly more complex computations, it has rather limited optimization capabilities and just cannot finish the execution.
So, users who want to use ClickHouse to solve performance problems should first examine the complexity of their computing tasks by comparing them with TPC-H.
Now we still use the TPC-H data to generate a table having a lot of columns and perform the multi-dimensional analysis ClickHouse is best at. Results are listed below (Time unit: second). Find complete test report HERE.
ClickHouse is good at handling scenarios involving wide table traversal and has the highest performance. Its performance, however, drops sharply once a join is involved. More complicated joins will cause memory overflow, which once again verifies the above cause analysis.
It seems that ClickHouse is fast only when it handles the simple single table traversals that do not involve any associations. But such “fast speed” ClickHouse can achieve applies too narrow application scenarios. Is it worthwhile to introduce a database specifically to do just such a simple job?
Unlike ClickHouse that has an exaggerated fame, esProc SPL mentioned in the above test report is the real performance champion.
esProc SPL is also open-source. It is developed in Java, but it can far outperform the C++ developed ClickHouse in many performance optimization scenarios.
Strictly speaking, esProc isn’t an analytic database. But it offers high-performance storage schemas (column-oriented storage, compression, etc.) and corresponding algorithm libraries, which enables it to completely replace the analytic database’s computing functionality.
Unlike ClickHouse’s other database competitors, esProc does not adopt the SQL syntax, but uses the more concise SPL (Structured Process Language) instead. The design can overcome SQL’s defects and help achieve high-performance algorithms SQL finds it difficult to or cannot implement. Find easy to understand illustrations HERE. If the database based on the SQL system is still used, the product will still have the SQL limits and cannot make the most use of the hardware resources to obtain the best performance possible, even if it can surpass ClickHouse in certain aspects.
Let’s compare the esProc performances in the above test report with those of ClickHouse:
esProc SPL noticeably outstrips ClickHouse in performance as it executes all tasks very fast. It has comprehensive, overwhelming advantages over ClickHouse.
When data volume grows, ClickHouse is indeed better and faster than esProc SPL in handling traversal of a single wide table it is good at. Yet, as wide tables are used in most cases to avoid low-speed join operations (by increasing data volume and performing more complicated data preparations), they become unnecessary when esProc SPL’s unique join optimization strategy enables higher speed than ClickHouse wide tables. Wide tables do not have any merits any longer and become useless once they lose the performance advantages.
Though ClickHouse is faster in handling simple, non-association summarization on single tables, it isn’t orders of magnitude faster than esProc (after all, CPU and hard disk merely have this speed). In this case, even if task characteristics can be used to perform optimization, we will still find it difficult to work it out based on ClickHouse. Many optimization logics cannot be achieved in SQL and C++ programming is needed in the higher level to complete the goal. This is complicated and difficult. SPL, however, has much more powerful programming ability and can make good use of the task characteristics to write optimization code.
For example, contemporary multi-dimensional analyses almost always involve multi-index summarization. We can write multi-purpose traversal algorithm in SPL to compute multiple aggregate values during one traversal. Even if SPL is slightly slower than ClickHouse in handling the single-index computation, it can significantly surpass it when dealing with the multi-index summarization:
(Find the complete test report HERE)
For more complicated computations, such as funnel analysis, they are too complex to be tested using the ClickHouse. But esProc SPL can test them. Find related information HERE.
In a nutshell, esProc SPL has complete and comprehensive performance advantages, but ClickHouse’s performance advantages only apply in a very narrow range of applications.
Here is one real-world use case. A space-time collision problem involves about 25 billion rows of data. It seems that the SQL statement is not complicated:
WITH DT AS ( SELECT DISTINCT id, ROUND(tm/900)+1 as tn, loc FROM T WHERE tm<3*86400)
SELECT * FROM (
SELECT B.id id, COUNT( DISINCT B.tn ) cnt
FROM DT AS A JOIN DT AS B ON A.loc=B.loc AND A.tn=B.tn
WHERE A.id=a AND B.id<>a
GROUP BY id )
ORDER BY cnt DESC
LIMIT 20
Traditional databases run too slow and users turn to ClickHouse to solve the problem. ClickHouse runs over 30 minutes under a 5-node cluster environment and falls short of expectation. With the same amount of data, it takes SPL code one node and less than 6 minutes to complete the computation, which exceeds users’ expectations. Taking account of the gap between hardware resources, SPL is over 25 times faster than ClickHouse.
(Unlike the code produced by other programming languages, SPL code is written in a grid. Find related information HERE).
The SQL DISTINCT computation involves HASH values and their comparison. It becomes computation-intensive when data volume is huge. The self-join and the further COUNT(DISTINCT) will also seriously slow down the performance. SPL can make the most use of ordered grouping and ordinal-number-based location that SQL does not have to effectively avoid the high complexity self-joins and DISTINCT operations. Though SPL does not have the advantage over ClickHouse in the aspect of storage efficiency and Java is a little slower than C++, it still boosts the performance by orders of magnitude.
Top comments (1)
Open source SPL:github.com/SPLWare/esProc