I created a CDK project to self-host the observability tool (OSS) Langfuse v3 on AWS.
https://github.com/mazyu36/langfuse-with-aws-cdk
In this article, I'll share how to use it, the architecture, and some troubleshooting know-how.
The official self-hosting documentation (https://langfuse.com/self-hosting) is also helpful.
Architecture
I've built Langfuse v3 on AWS. Essentially, it's a simple configuration that maps each component of Langfuse's architecture onto an AWS managed service.
Langfuse Application (Web, Worker)
I've deployed the Langfuse container image to ECS on Fargate.
Service-to-service communication with ClickHouse (described below) uses ECS Service Connect. Since the Web and Worker act as clients (the requesting side) in this communication, Service Connect only needs to be defined with client settings.
If you want to reduce costs and only need basic connectivity, plain Service Discovery should also work fine.
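For reference, a client-only Service Connect configuration in CDK might look like the following sketch (`cluster` and `webTaskDefinition` are assumed to exist, and the namespace name is illustrative; omitting `services` makes the service a pure client):

```ts
import * as ecs from 'aws-cdk-lib/aws-ecs';

// Sketch: no `services` entry, so this service only consumes Service Connect (client mode).
const webService = new ecs.FargateService(this, 'WebService', {
  cluster,
  taskDefinition: webTaskDefinition,
  serviceConnectConfiguration: {
    namespace: 'langfuse.local', // illustrative namespace shared with the ClickHouse service
  },
});
```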
According to the documentation, the following is recommended for production environments:
- All containers should have at least 2 CPUs and 4 GB of RAM
- Langfuse Web should have at least two instances for redundancy

> For production environments, we recommend to use at least 2 CPUs and 4 GB of RAM for all containers. You should have at least two instances of the Langfuse Web container for high availability. For auto-scaling, we recommend to add instances once the CPU utilization exceeds 50% on either container.
In the CDK implementation, you can configure whether to use Fargate Spot in lib/stack-config.ts. In development environments, you can use Fargate Spot to reduce costs.
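As a sketch of the idea (the flag name is hypothetical, standing in for the actual setting in lib/stack-config.ts), Fargate Spot is selected via capacity provider strategies:

```ts
import * as ecs from 'aws-cdk-lib/aws-ecs';

// Hypothetical flag from the per-environment config
const useFargateSpot = true;

cluster.enableFargateCapacityProviders(); // required before FARGATE_SPOT can be used

const service = new ecs.FargateService(this, 'LangfuseWeb', {
  cluster,
  taskDefinition,
  capacityProviderStrategies: useFargateSpot
    ? [{ capacityProvider: 'FARGATE_SPOT', weight: 1 }]
    : [{ capacityProvider: 'FARGATE', weight: 1 }],
});
```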
ClickHouse - OLAP
ClickHouse is also running directly as a container image on ECS on Fargate. EFS is mounted for data persistence.
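A minimal sketch of the EFS mount (`taskDefinition`, `clickhouseContainer`, and `fileSystem` from aws-efs are assumed to exist; volume name is illustrative, and ClickHouse keeps its data under /var/lib/clickhouse):

```ts
// Attach the EFS file system to the ClickHouse task definition
taskDefinition.addVolume({
  name: 'clickhouse-data',
  efsVolumeConfiguration: {
    fileSystemId: fileSystem.fileSystemId,
    transitEncryption: 'ENABLED', // encrypt NFS traffic in transit
  },
});

clickhouseContainer.addMountPoints({
  containerPath: '/var/lib/clickhouse', // ClickHouse's data directory
  sourceVolume: 'clickhouse-data',
  readOnly: false,
});
```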
PostgreSQL - OLTP
PostgreSQL (the OLTP store) runs on Aurora Serverless v2.
In the CDK implementation, lib/stack-config.ts allows enabling Zero Capacity for cost reduction.
When enabled, the database pauses after a certain period with no connections. A connection attempt in this state takes about 15 seconds while the database resumes, so clients need to retry after a short wait. This is intended for development environments.
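A sketch of the zero-capacity setting in CDK (`vpc` is assumed to exist and the engine version is illustrative; `serverlessV2MinCapacity: 0` enables scale-to-zero on supported engine versions and recent CDK releases):

```ts
import * as rds from 'aws-cdk-lib/aws-rds';

const cluster = new rds.DatabaseCluster(this, 'Postgres', {
  engine: rds.DatabaseClusterEngine.auroraPostgres({
    version: rds.AuroraPostgresEngineVersion.VER_16_4, // illustrative version
  }),
  vpc,
  writer: rds.ClusterInstance.serverlessV2('Writer'),
  serverlessV2MinCapacity: 0, // lets the cluster pause when idle (Zero Capacity)
  serverlessV2MaxCapacity: 2,
});
```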
S3 - Blob Storage
S3 is used directly as blob storage for events and traces.
Cache/Queue
Redis/Valkey is needed for caching and asynchronous communication between Web and Worker. In the CDK project, ElastiCache is used, with Valkey as the engine for cost benefits.
There are several points to note about ElastiCache, which I'll explain in detail.
Use REDIS_CONNECTION_STRING when in-transit encryption is enabled
ElastiCache supports in-transit encryption (TLS). In CDK, setting `transitEncryptionMode` to `required` allows only TLS connections.
```ts
const cache = new elasticache.CfnReplicationGroup(this, 'Resource', {
  // omit
  transitEncryptionEnabled: true,
  transitEncryptionMode: 'required', // Allow only TLS communication. 'preferred' allows both TLS and non-TLS.
  // omit
  authToken: secret.secretValue.unsafeUnwrap(),
});
```
When setting the connection information for Langfuse Web/Worker via environment variables, there are two methods:
1. Use `REDIS_CONNECTION_STRING`
2. Set `REDIS_HOST`, `REDIS_PORT`, and `REDIS_AUTH`
https://langfuse.com/self-hosting/infrastructure/cache#configuration
When only TLS communication is allowed, method 1 (`REDIS_CONNECTION_STRING`) must be used (but see the update below). Method 2 will not connect and outputs no error (I got stuck on this).
(Updated February 15, 2025)
As of Langfuse v3.28.0, TLS can also be configured with method 2 (`REDIS_HOST`, `REDIS_PORT`, `REDIS_AUTH`) by setting the `REDIS_TLS_ENABLED` environment variable to `true` on Langfuse Web/Worker. Since this eliminates the need to generate a `REDIS_CONNECTION_STRING` value, method 2 is now generally preferred.
I have also updated the CDK implementation to use method 2.
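As a rough sketch of the environment variable configuration (illustrative, not the repository's exact code; `taskDefinition`, `cache` (the CfnReplicationGroup), and `authTokenSecret` are assumed to exist):

```ts
import * as ecs from 'aws-cdk-lib/aws-ecs';

taskDefinition.addContainer('web', {
  image: ecs.ContainerImage.fromRegistry('langfuse/langfuse:3'),
  environment: {
    REDIS_HOST: cache.attrPrimaryEndPointAddress, // ElastiCache primary endpoint
    REDIS_PORT: '6379',
    REDIS_TLS_ENABLED: 'true', // available from Langfuse v3.28.0
  },
  secrets: {
    // Keep the auth token out of plain environment variables
    REDIS_AUTH: ecs.Secret.fromSecretsManager(authTokenSecret),
  },
});
```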
For versions prior to v3.28.0, you need to use the `REDIS_CONNECTION_STRING` configuration described below: for TLS-only communication, set `REDIS_CONNECTION_STRING` to a `rediss://...` URL.
This is due to the implementation in Langfuse's packages/shared/src/server/redis/redis.ts. Langfuse uses `ioredis`, and the instance is created as follows:
```ts
const instance = env.REDIS_CONNECTION_STRING
  ? new Redis(env.REDIS_CONNECTION_STRING, {
      ...defaultRedisOptions,
      ...additionalOptions,
    })
  : env.REDIS_HOST
    ? new Redis({
        host: String(env.REDIS_HOST),
        port: Number(env.REDIS_PORT),
        password: String(env.REDIS_AUTH),
        // No TLS configuration
        ...defaultRedisOptions,
        ...additionalOptions,
      })
    : null;
```
When `REDIS_HOST` is used, individual host settings are applied, but the `tls` property is required for TLS communication (reference: ioredis documentation):
```ts
const redis = new Redis({
  host: "redis.my-service.com",
  tls: {}, // TLS configuration is necessary
});
```
However, Langfuse does not currently expose this option, so using `REDIS_HOST` etc. makes TLS communication impossible.
On the other hand, `ioredis` establishes a TLS connection when the connection string starts with `rediss`:
```ts
const redis = new Redis("rediss://redis.my-service.com"); // Specify TLS with rediss
```
So in Langfuse, using `REDIS_CONNECTION_STRING` enables TLS communication.
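For reference, before v3.28.0 the connection string had to be assembled manually. A minimal sketch, assuming an `authToken` variable and the `cache` CfnReplicationGroup from earlier (the `rediss://[user]:[password]@host:port` form is the standard Redis URL convention):

```ts
// Sketch: build a TLS connection string for REDIS_CONNECTION_STRING
// (empty username, auth token as password -- illustrative, verify against your setup)
const redisConnectionString = `rediss://:${authToken}@${cache.attrPrimaryEndPointAddress}:6379`;
```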
Set noeviction
The parameter `maxmemory-policy` must be set to `noeviction` (no eviction) to ensure that queue jobs are not removed from the cache.

> You must set the parameter maxmemory-policy to noeviction to ensure that the queue jobs are not evicted from the cache.

https://langfuse.com/self-hosting/infrastructure/cache#deployment-options
To keep failed jobs from accumulating indefinitely, retries and failed-job retention are configured (example from IngestionQueue):
```ts
IngestionQueue.instance = newRedis
  ? new Queue<TQueueJobTypes[QueueName.IngestionQueue]>(
      QueueName.IngestionQueue,
      {
        connection: newRedis,
        defaultJobOptions: {
          removeOnComplete: true, // Remove job after success
          removeOnFail: 100_000, // Maximum number of failed jobs to retain
          attempts: 5, // Number of retries
          backoff: {
            // Retry with exponential backoff
            type: "exponential",
            delay: 5000,
          },
        },
      },
    )
  : null;
```
`noeviction` can be set via a parameter group in CDK:
```ts
/**
 * We must set the parameter `maxmemory-policy` to `noeviction` to ensure that the queue jobs are not evicted from the cache.
 * @see https://langfuse.com/self-hosting/infrastructure/cache#deployment-options
 */
const parameterGroup = new elasticache.CfnParameterGroup(this, 'RedisParameterGroup', {
  cacheParameterGroupFamily: 'valkey8',
  description: 'Custom parameter group for Langfuse ElastiCache',
  properties: {
    'maxmemory-policy': 'noeviction', // here
  },
});
```
Cluster mode and ElastiCache Serverless are not supported
At the time of writing, Langfuse Web/Worker does not support Redis/Valkey cluster mode. Therefore, ElastiCache cluster mode cannot be used.
> Langfuse handles failovers between read-replicas, but does not support Redis cluster mode for now, i.e. there is no sharding support.
https://langfuse.com/self-hosting/infrastructure/cache#managed-redisvalkey-by-cloud-providers
This means ElastiCache Serverless also cannot be used currently. AWS documentation also states that ElastiCache Serverless runs in cluster mode.
> ElastiCache Serverless runs Valkey, Memcached, or Redis OSS in cluster mode and is only compatible with clients that support TLS.
https://docs.aws.amazon.com/AmazonElastiCache/latest/dg/WhatIs.corecomponents.html
If you try to use ElastiCache Serverless, you'll frequently encounter CROSSSLOT errors:
```
2025-02-04T09:09:03.525Z warn Redis connection error: CROSSSLOT Keys in request don't hash to the same slot
```
The cause is operations on multiple keys that hash to different slots. In cluster mode, each key is hashed to a slot and stored accordingly, and a multi-key operation requires all keys to be in the same hash slot. A detailed explanation can be found in this article:
https://dev.to/inspector/resolved-crossslot-keys-error-in-redis-cluster-mode-enabled-3kec
Since Langfuse's current implementation does not account for cluster mode, these errors occur, and they cannot be resolved through infrastructure settings.
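As an illustration of the error (not Langfuse code), here is a minimal `ioredis` sketch showing how multi-key commands behave in cluster mode, assuming a hypothetical cluster endpoint:

```ts
import Redis from "ioredis";

// Hypothetical cluster endpoint, for illustration only
const cluster = new Redis.Cluster([{ host: "my-cluster.example.com", port: 6379 }]);

async function demo() {
  // These keys typically hash to different slots -> CROSSSLOT error
  await cluster.mset("queue:a", "1", "queue:b", "2").catch(console.error);

  // Keys sharing the hash tag "{queue}" land in the same slot, so this succeeds
  await cluster.mset("{queue}:a", "1", "{queue}:b", "2");
}

demo().finally(() => cluster.disconnect());
```

Making this work would require Langfuse itself to use hash tags or a cluster-aware client, which is why it cannot be fixed on the infrastructure side.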
Usage
Please refer to the README for details. Below is a simple deployment method and the flow for verifying the operation.
CDK Configuration and Deployment
The CDK implementation allows parameters to be defined for each environment (dev/stg/prod), and the deployment target environment is specified as a context. The general deployment method is as follows:
- Clone the repository and install the necessary libraries with `npm ci`.
- Set the parameters for the corresponding environment in bin/app-config and lib/stack-config.
- Deploy with `npx cdk deploy --context env=ENV_NAME`.
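As a sketch of how the context switch works in a CDK entry point (names are illustrative, not the repository's exact code):

```ts
// bin/app.ts (illustrative sketch)
import * as cdk from 'aws-cdk-lib';

const app = new cdk.App();

// Read the target environment from `--context env=ENV_NAME` (default: dev)
const envName: string = app.node.tryGetContext('env') ?? 'dev';

// Hypothetical per-environment parameters
const configs: Record<string, { useFargateSpot: boolean }> = {
  dev: { useFargateSpot: true },
  prod: { useFargateSpot: false },
};
const config = configs[envName];
```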
The deployment takes about 20 minutes. The URL is output in the Outputs.
```
✅ LangfuseWithAwsCdkStack-dev
✨ Deployment time: 1238.58s
Outputs:
# omit
LangfuseWithAwsCdkStack-dev.LangfuseURL = https://langfuse.example.com
# omit
✨ Total time: 1247.3s
```
Initial Setup
After opening the URL, first sign up by entering your email address and other details, then click `Sign up`.
If Aurora Serverless v2 Zero Capacity is enabled and the database is idle, you may encounter a DB error like the following (the error message is hard to read due to the color...). At this point the database starts waking up from idle, so wait a while (about 15-30 seconds) and click `Sign up` again.
First, create an Organization by clicking `New Organization`.
Set the Organization name and click `Create`.
If you want to add members, configure them here. I will not set this up here, so click `next`.
Next, set the Project name. After setting it, click `create`.
Finally, select `API Keys` and click `Create new API Keys`.
Once the Secret key and Public key are issued, make a note of them. This completes the initial setup.
Operation Verification
Perform operation verification using the issued API keys. Here, we use `curl`.
First, set the API keys and hostname as environment variables in an appropriate environment.
```sh
export LANGFUSE_SECRET_KEY="YOUR_SECRET_KEY"
export LANGFUSE_PUBLIC_KEY="YOUR_PUBLIC_KEY"
export LANGFUSE_HOST="YOUR_LANGFUSE_URL"
```
Next, call the ingestion (trace ingestion) API with `curl`. Note that the content is a dummy, so it contains almost no real data.
```sh
curl -X POST "$LANGFUSE_HOST/api/public/ingestion" \
  -u "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "batch": [
      {
        "type": "trace-create",
        "id": "'$(uuidgen)'",
        "timestamp": "'$(date -u +"%Y-%m-%dT%H:%M:%S.000Z")'",
        "metadata": null,
        "body": {
          "id": "'$(uuidgen)'",
          "name": "test",
          "timestamp": "'$(date -u +"%Y-%m-%dT%H:%M:%S.000Z")'",
          "public": false
        }
      }
    ],
    "metadata": null
  }'
```
If a 201 response is output, it is working correctly.
```json
{
  "successes": [
    {
      "id": "523EFC5E-8BAC-485D-9CC3-C049B5F64FA4",
      "status": 201
    }
  ],
  "errors": []
}
```
You can check the traces in the browser under `Tracing` -> `Traces`.
If the executed data is included, the operation verification is complete 🎉
Troubleshooting Approach
I'll summarize the troubleshooting steps I actually took during the build.
While building, the Langfuse application itself started, but I struggled with issues such as ingestion (trace ingestion) failing and traces not appearing on screen.
There is documentation available for troubleshooting during self-hosting.
https://langfuse.com/self-hosting/troubleshooting
In particular, if ingestion is not working, the "Missing Events After POST /api/public/ingestion" section lists points to check.
Ingestion means ingesting traces into Langfuse, either from a program using the SDK or by calling the API with the curl command shown earlier. If traces cannot be confirmed in the browser, the process is failing somewhere.
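For reference, a minimal ingestion example with the Langfuse TypeScript SDK might look like the following (assuming the `langfuse` npm package and the keys issued earlier):

```ts
import { Langfuse } from "langfuse";

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: process.env.LANGFUSE_HOST, // e.g. https://langfuse.example.com
});

// Create a dummy trace, then flush the SDK's buffered events before exiting
langfuse.trace({ name: "test" });
await langfuse.shutdownAsync();
```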
The general flow of ingestion is as follows:
- When received via the web, upload the event to the bucket and set a job in the cache queue.
- The worker picks up the job from the cache queue, reads the event from the bucket, and finally stores it in ClickHouse.
- When accessed from the browser, the web retrieves the data from ClickHouse and displays it.
Note: You can get a feel for this by looking at packages/shared/src/server/ingestion/processEventBatch.ts and worker/src/queues/ingestionQueue.ts.
Based on the above process flow and documentation, I performed troubleshooting as follows:
- Check the logs of the Web and Worker.
- Check the bucket (S3).
- Check the cache (ElastiCache).
- Check ClickHouse.
I will describe what I mainly did in order.
1. Check the Logs of the Web and Worker
First, check the logs of the Web and Worker as the logs of Langfuse itself. If there is an error due to infrastructure layer configuration issues, the cause can be identified here.
However, at the time of writing, not many logs are output, and the asynchronous queue processing often makes the cause hard to pinpoint.
2. Check the Bucket (S3)
When ingestion is performed, the event JSON is stored in S3. If the JSON does not exist in S3, the connection between the Web and S3 is likely not working; check the network and permission settings.
3. Check the Cache (ElastiCache)
If the JSON is stored in S3, next check whether jobs are being stored and processed in ElastiCache.
Use `valkey-cli` to connect (if using Redis, use `redis-cli` instead). The setup is covered in the following documentation:
https://docs.aws.amazon.com/AmazonElastiCache/latest/dg/connect-tls.html
After setting up, connect to the cache with the following command:
```sh
$ valkey-cli -h <Primary or Configuration Endpoint> --tls -a 'your-password' -p 6379
```
Once connected, tail the command stream with `MONITOR`:
```
$ MONITOR
```
If only `OK` appears after executing the command and nothing else is displayed, it is likely that the Web/Worker cannot connect to ElastiCache; recheck the connection settings. If they are connected, commands will be streamed.
Also check the `ingestion-queue` as follows. If the Web is connected to ElastiCache and jobs are being added to the queue, something like the following should be displayed:
```
$ KEYS bull:ingestion-queue:*
# Result
1) "bull:ingestion-queue:2"
2) "bull:ingestion-queue:active"
3) "bull:ingestion-queue:3"
4) "bull:ingestion-queue:stalled"
5) "bull:ingestion-queue:3:lock"
6) "bull:ingestion-queue:meta"
7) "bull:ingestion-queue:id"
8) "bull:ingestion-queue:events"
9) "bull:ingestion-queue:1"
10) "bull:ingestion-queue:1:lock"
11) "bull:ingestion-queue:2:lock"
12) "bull:ingestion-queue:stalled-check"
```
When I was troubleshooting, this was the state: the keys existed, but the traces were not displayed in the browser, so I dug deeper.
First, check for failed jobs with the following command:
```
$ ZRANGE bull:ingestion-queue:failed 0 -1 WITHSCORES
# Result
1) "2"
2) "1738464474985"
3) "3"
4) "1738464474986"
5) "1"
6) "1738464474996"
```
Jobs with IDs 1-3 have failed. Let's look at the details of job ID 2:
```
$ HGETALL bull:ingestion-queue:2
# Result
1) "data"
# ... omit
17) "stacktrace"
18) "[\"Error: upstream request timeout\\n at parseError (/app/node_modules/.pnpm/@clickhouse+client-common@1.4.0/node_modules/@clickhouse/client-common/dist/error/parse_error.js:37:39)\\n at ClientRequest.onResponse
# ...omit
```
The error cause is `upstream request timeout`, and it occurs in ClickHouse-related code.
So the Worker presumably timed out when it tried to reach ClickHouse while processing the job.
In the issue I encountered, the root cause was a missing security group rule: the security group of the ClickHouse Fargate service had no inbound rule allowing connections from the Worker's Fargate service, so they could not communicate.
In this case, the Worker's logs output no errors, so the cause could not be identified without digging into ElastiCache.
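In CDK terms, the missing rule corresponds to something like the following sketch (`clickhouseService` and `workerService` are assumed Fargate services; ClickHouse listens on 8123 for HTTP and 9000 for the native protocol):

```ts
import * as ec2 from 'aws-cdk-lib/aws-ec2';

// FargateService implements IConnectable, so `.connections` works on both sides
clickhouseService.connections.allowFrom(
  workerService,
  ec2.Port.tcp(8123),
  'Allow Langfuse Worker to query ClickHouse over HTTP',
);
```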
4. Check ClickHouse
If everything seems fine on ElastiCache, check if data is stored in ClickHouse. First, connect to ClickHouse with ECS Exec.
```sh
aws ecs execute-command --cluster cluster-name \
  --task task-id \
  --container container-name \
  --interactive \
  --command "/bin/sh"
```
Then start the client with `clickhouse-client`. Output like the following indicates a successful start:
```
# clickhouse-client
ClickHouse client version 25.1.2.3 (official build).
Connecting to localhost:9000 as user clickhouse.
Connected to ClickHouse server version 25.1.2.
```
Then, check whether data is stored using SQL, just as with an RDB:
```sql
USE langfuse;
SHOW TABLES;
SELECT * FROM traces ORDER BY created_at DESC LIMIT 10;
```
Based on the results, you can suspect the following (although in such cases errors are likely to appear in the Web or Worker logs, so the issue may already have been identified at step 1, checking the logs of the Web and Worker):
- No data in ClickHouse: It is likely that the job processing results could not be stored, so the communication between the Worker and ClickHouse is suspicious.
- Data exists in ClickHouse: If data exists but is not displayed in the browser, the communication between the Web and ClickHouse is suspicious.
Conclusion
If you find any issues or have suggestions for additions to the CDK project, please let me know!
I plan to add support for CloudFront VPC origins in the near future.