Building a lightweight Trino distribution

resurfacelabs
Resurface, an API system of record, turns every API call into a durable transaction. Capture, store, and explore every REST and GraphQL call.

Too many data frameworks built for large scale have unacceptable complexity at small scale. But with a few tweaks, Trino scales down to run nicely on small single-container configurations.

(Trino is the new brand for PrestoSQL, an open source distributed query engine.)

Official Docker image is large

The Docker image provided by the Trino team (trinodb/trino) is 1.32 GB when extracted. It includes a full CentOS distribution, which is a safe and comfortable choice. But that is pretty large for cases where Trino is embedded in another application, as we do with Resurface.

Picking a smaller base image

Much of the weight of the official Trino container comes from its CentOS base image.

FROM azul/zulu-openjdk-centos:11

Switching to an Alpine-based distribution like adoptopenjdk cuts the download size dramatically.

FROM adoptopenjdk/openjdk11:jdk-11.0.10_9-alpine-slim

⚠️ Pick your Alpine distribution carefully! We've seen significant performance degradation in Java applications on Alpine distributions that don't include glibc. The adoptopenjdk containers perform well while still being relatively small.

Reducing the number of connectors

The next step is optional, but has a big impact on container size. Trino ships with many pre-installed connectors, each of which requires supporting libraries.

However, these connectors aren't all strictly required. For our single-container distributions, we strip out all the optional connectors except for our own Resurface connector.

rm -rf /opt/trino/plugin/accumulo &&\
rm -rf /opt/trino/plugin/atop &&\
rm -rf /opt/trino/plugin/bigquery &&\
rm -rf /opt/trino/plugin/blackhole &&\
rm -rf /opt/trino/plugin/cassandra &&\
rm -rf /opt/trino/plugin/clickhouse &&\
rm -rf /opt/trino/plugin/druid &&\
rm -rf /opt/trino/plugin/elasticsearch &&\
rm -rf /opt/trino/plugin/example-http &&\
rm -rf /opt/trino/plugin/geospatial &&\
rm -rf /opt/trino/plugin/google-sheets &&\
rm -rf /opt/trino/plugin/hive-hadoop2 &&\
rm -rf /opt/trino/plugin/iceberg &&\
rm -rf /opt/trino/plugin/jmx &&\
rm -rf /opt/trino/plugin/kafka &&\
rm -rf /opt/trino/plugin/kinesis &&\
rm -rf /opt/trino/plugin/kudu &&\
rm -rf /opt/trino/plugin/local-file &&\
rm -rf /opt/trino/plugin/memsql &&\
rm -rf /opt/trino/plugin/ml &&\
rm -rf /opt/trino/plugin/mongodb &&\
rm -rf /opt/trino/plugin/mysql &&\
rm -rf /opt/trino/plugin/oracle &&\
rm -rf /opt/trino/plugin/phoenix &&\
rm -rf /opt/trino/plugin/phoenix5 &&\
rm -rf /opt/trino/plugin/pinot &&\
rm -rf /opt/trino/plugin/postgresql &&\
rm -rf /opt/trino/plugin/prometheus &&\
rm -rf /opt/trino/plugin/raptor-legacy &&\
rm -rf /opt/trino/plugin/redis &&\
rm -rf /opt/trino/plugin/redshift &&\
rm -rf /opt/trino/plugin/sqlserver &&\
rm -rf /opt/trino/plugin/teradata-functions &&\
rm -rf /opt/trino/plugin/thrift &&\
rm -rf /opt/trino/plugin/tpcds &&\
rm -rf /opt/trino/plugin/tpch
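
The same pruning can also be written as a keep-list, so the step does not need updating when a new Trino release adds connectors. This is a minimal sketch, not what our Dockerfile does: prune_plugins is a hypothetical helper, and the directory name of the connector you keep is an assumption you need to check against your own image.

```shell
# Remove every plugin directory under $1 except the names listed after it.
# Hypothetical helper, not part of Trino or the Resurface Dockerfile.
prune_plugins() {
  # Refuse to run without a real directory, so a typo cannot expand to "/*/".
  if [ "$#" -lt 1 ] || [ ! -d "$1" ]; then
    echo "prune_plugins: expected an existing plugin directory" >&2
    return 1
  fi
  plugin_dir=$1
  shift
  for plugin_path in "$plugin_dir"/*/; do
    [ -d "$plugin_path" ] || continue   # the glob matched nothing
    plugin_name=$(basename "$plugin_path")
    keep=no
    for wanted in "$@"; do
      if [ "$plugin_name" = "$wanted" ]; then keep=yes; fi
    done
    # Delete only the directories that are not on the keep-list.
    if [ "$keep" = no ]; then rm -rf "$plugin_path"; fi
  done
}
```

Calling prune_plugins /opt/trino/plugin resurface would then leave only a plugin directory named resurface (a placeholder name) in place. The explicit list above is easier to audit, while the keep-list is less likely to miss a connector added in a later release.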

Tuning memory parameters

Trino is very tunable when it comes to memory usage. Beyond that, the Trino team doesn't discourage small configurations. When I had the chance to ask Martin Traverso about this, he said they expect Trino to pass all tests on a small laptop-sized configuration, just the same as on a large one. That reaction gave us renewed confidence to experiment with smaller configurations.

For our smallest containers, we limit Trino to 1 GB of memory using these standard parameters:

query.max-length=1000000
query.max-memory=1000MB
query.max-memory-per-node=1000MB
query.max-total-memory=1000MB
query.max-total-memory-per-node=1000MB
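
If you build images for more than one memory budget, the four memory limits can be generated from a single number instead of being edited by hand. A minimal sketch, assuming you want all four limits set to the same value as in the listing above; render_memory_config is a hypothetical helper, not a Trino tool.

```shell
# Print the four Trino memory limits for a budget given in whole megabytes.
# Hypothetical helper; the property names are the ones used in this article.
render_memory_config() {
  limit_mb=${1:-}
  # Accept only a non-empty string of digits.
  case "$limit_mb" in
    ''|*[!0-9]*)
      echo "render_memory_config: expected a whole number of MB" >&2
      return 1
      ;;
  esac
  for key in query.max-memory query.max-memory-per-node \
             query.max-total-memory query.max-total-memory-per-node; do
    printf '%s=%sMB\n' "$key" "$limit_mb"
  done
}
```

For example, render_memory_config 1000 prints the same four memory lines shown above, ready to be appended to your Trino config.properties file. query.max-length is left out because it caps the length of the SQL text, not memory.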

If you're still seeing out-of-memory conditions, you may also want to reduce the memory used by the query cache. This is especially important if your SQL statements are large, or if your transaction rates are relatively high so that a lot of query history data is being cached.

query.max-history=20
query.min-expire-age=1s

Final results

Following these steps yields a stable and high-performing Trino configuration that is 391 MB. That's just 30% of the download size of the standard Trino container! This doesn't come without tradeoffs, but it's great to have this range of flexibility.
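
The 30% figure is simple arithmetic on the two sizes quoted in this article. A small sketch for redoing the comparison with your own numbers, assuming 1.32 GB is read as 1320 MB; percent_of is a hypothetical helper.

```shell
# Print what percentage the first size is of the second, to one decimal place.
# Both arguments must be in the same unit (megabytes here).
percent_of() {
  awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", part / whole * 100 }'
}
```

Running percent_of 391 1320 prints 29.6, which rounds to the 30% quoted above.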

If you're looking for a minimal Trino container image, you can use ours as a base. (The version tag corresponds to the Trino version.)

FROM resurfaceio/trino-minimal:358

Or you can inspect this Dockerfile for ideas on how to build your own lightweight Trino image.

https://github.com/resurfaceio/containers/blob/master/trino/trino-minimal.dockerfile
