DEV Community

Sarma

Design: Vector-Database: Qdrant-Cluster

Is all of this (CDK & Clusters) too much to handle? Seek professional services from Qdrant.tech; failing that, from me or my current employer.

Summary of Challenges:

  1. This is a Database!
  2. Need multi-AZ resiliency when cluster is created.
  3. Must automatically maintain multi-AZ resiliency even after innumerable node failures. How do we delegate to AWS (ECS) the job of maintaining this multi-AZ balance by REPLACING nodes, ensuring resiliency & DLP (data-loss protection)?
  4. Need high-speed robust multi-AZ replication (of storage/data)
  5. Avoid re-inventing the wheel, that too .. on AWS Cloud
  6. Avoid deploying any of my own mechanisms for:-
    1. Resiliency, incl. automatically spinning up a new "node", if any node becomes unhealthy.
    2. inter-node / inter-process Communication
    3. Replication of Storage-layer or just database-data
    4. Monitoring & Alerting
  7. Storage-Persistence challenges
    1. Qdrant, like any database, does file-Locking (highly parallel threads attempting to lock the file-system)
    2. Multiple copies of data across AZs in sync all the time, almost instantly ready to copy to a new replica-node on any AZ-outage.
    3. EBS Volumes can NOT be replicated to another AZ or to another Region.
    4. A "crash" (of a single-node Qdrant-Vector-DB) can leave the contents of the EBS Volumes in a "failed-data-integrity" status that may prevent the replacement node from spinning up; This is a classic "Crashed-Database problem" which has NO simple solution.
  8. Self-managed cluster:
    1. Using Fargate-Tasks/Containers as "nodes" (instead of traditional EC2s)
    2. Qdrant requires a cluster's "nodes/tasks/containers" to be spun up in SEQUENCE, in a specific order.
    3. Worse, Qdrant MANDATES that each SUBSEQUENT "node/task/container" be given information about ANY one of the "preceding" ones.
    4. If the 1st Qdrant "node/task/container" that was created goes down, replacing it requires special-handling!
    5. Fargate-Tasks/Containers, by design, have ephemeral PRIVATE-IPs.
    6. Database requires NO PUBLIC-IPs, but "Qdrant-Dashboard" (website) is built-in into the database! And, we need access to it!
    7. Snapshots-based restore/recovery (usually a manual activity) requires use of this "Qdrant-Dashboard".

Must-have Requirements

  1. Per Qdrant's documentation and its reliance on the "Raft" protocol, they recommend a cluster of minimum 3 "nodes/tasks/containers". Studying Qdrant's documentation clarifies that it should be 3 per shard, and that for future-proofing they recommend starting with 2 shards per collection.
    • So, in total 6! That is, 2 nodes/containers/tasks in each AZ, each containing a different Shard. Thereby, each Shard is replicated across 3 AZs.
    • You can definitely make do with 4, but Raft's quorum behavior needs studying first.
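The arithmetic above can be sketched as a tiny placement helper. This is an illustration of the article's numbers only (the shard IDs and AZ names are made up); Qdrant itself decides actual shard placement.

```python
# Illustrative placement math for the article's numbers (2 shards x 3 AZs).
# AZ names are made up; Qdrant itself decides real shard placement.
def cluster_size(shards: int, replicas_per_shard: int) -> int:
    """Total nodes when every shard-replica gets its own node."""
    return shards * replicas_per_shard

def az_layout(shards: int, azs: int) -> dict:
    """One replica of every shard in each AZ (one node per shard per AZ)."""
    return {shard: [f"az-{i + 1}" for i in range(azs)]
            for shard in range(1, shards + 1)}

print(cluster_size(shards=2, replicas_per_shard=3))  # 6 nodes in total
print(az_layout(shards=2, azs=3))  # each of the 2 shards replicated across 3 AZs
```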
  2. Backup Snapshots
    1. Must have 100% integrity.
    2. Ideally, avoid taking down the "database/cluster/writers" before taking a snapshot
    3. Leverage the database's INHERENT-ability to create snapshots.
    4. Avoid storage-level snapshots (given the decades-long history of the "Crashed-Database problem")
  3. Maintenance Scenarios:-
    1. Automation of snapshots as appropriate for the D.R. RPO; we chose an RPO of 15 minutes.
    2. Snapshots made every 15 minutes; Last 24 hourly-snapshots to be retained; Last 30 daily-snapshots to be retained; Last 12 monthly-snapshots to be retained.
    3. Based on your IT-process sophistication and based on the size of your collection(s), choose between Shard-level snapshots -versus- collection-snapshots. We chose collection-snapshots. Note: Avoid storage-level snapshots.
    4. Resilient-storage of snapshots across regions. FYI: Typical choices are S3, EFS & AWS-Backup. "Standard" EBS can Not do this.
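The retention schedule above is a classic grandfather-father-son rotation. Below is a hedged sketch of the pruning logic (the bucket keys and cutoffs are assumptions mirroring the stated schedule); it returns which snapshot timestamps to keep.

```python
from datetime import datetime, timedelta

def snapshots_to_keep(snaps, now):
    """Grandfather-father-son pruning: keep the newest snapshot per hourly /
    daily / monthly bucket for the last 24 hours / 30 days / 12 months.
    Bucket keys and cutoffs are assumptions mirroring the stated schedule."""
    buckets = [
        (lambda t: (t.year, t.month, t.day, t.hour), now - timedelta(hours=24)),
        (lambda t: (t.year, t.month, t.day),         now - timedelta(days=30)),
        (lambda t: (t.year, t.month),                now - timedelta(days=365)),
    ]
    keep = set()
    for key, cutoff in buckets:
        newest = {}
        for t in snaps:
            if t >= cutoff and (key(t) not in newest or t > newest[key(t)]):
                newest[key(t)] = t
        keep.update(newest.values())
    return keep

now = datetime(2025, 1, 10, 12, 0)
recent = [now - timedelta(minutes=15 * i) for i in range(8)]  # 2 hours of 15-min snaps
print(sorted(snapshots_to_keep(recent, now)))  # newest per hour: 10:45, 11:45, 12:00
```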
  4. Security:
    1. httpS from outside Vpc, via a single-endpoint no matter what changes happen within Qdrant-cluster.
    2. "native" ApiKey.
  5. Monitoring:
    1. Basic "node" monitoring (O/S level)
    2. Database monitoring (Qdrant-API-level)
    3. Shard-health monitoring (Qdrant-API-level)
    4. Replacement-of-nodes monitoring
    5. Periodic "101% Healthy" monitoring-checks (actually trying to create a new "test-collection", insert data & run a query).
  6. Software Licensing (for use within a Proprietary product sold to Clients)

    1. Will Qdrant's licensing support this Enterprise-grade Cluster being used in a commercial-product? Answer is yes, as of Dec 2025.
  7. Connecting all "nodes/tasks/containers" together as a cluster:

    1. Challenge: Fargate tasks have ephemeral private IPs, no predictable FQDNs/server-names.
      • Best Solution is ECS Service Discovery :- It creates predictable DNS names (e.g., qdrant-node-1.my-cluster-name.local)
      • Alternatives are: Parameter Store Coordination , EFS-Based Coordination (Leverage existing EFS) , ALB + Health Check Discovery (Use load balancer for discovery)
    2. Blame the "Raft" protocol and/or Qdrant, but you are REQUIRED to fully bring up the 1st node/task/container, and wait for it to finish initialization, before bringing up any other node/task/container.
      • Can it be designed simpler !? Sure! You are welcome anytime to put in a PR on Qdrant's official-GitHub.
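Per Qdrant's distributed-deployment documentation, the first peer is started with only a `--uri`, and every later peer adds a `--bootstrap` flag pointing at an already-running peer. Below is a sketch that generates the startup commands from ECS Service Discovery DNS names (the hostnames and port 6335 follow the examples above; verify the flags against your Qdrant version).

```python
# Command generator for the mandated start-up SEQUENCE. Flag names follow
# Qdrant's distributed-deployment docs (--uri / --bootstrap); verify them
# against your Qdrant version. Hostnames mimic ECS Service Discovery DNS.
def startup_commands(nodes, port=6335):
    cmds = []
    for i, host in enumerate(nodes):
        cmd = f"./qdrant --uri http://{host}:{port}"
        if i > 0:  # every later peer bootstraps off the (already healthy) 1st
            cmd += f" --bootstrap http://{nodes[0]}:{port}"
        cmds.append(cmd)
    return cmds

nodes = [f"qdrant-node-{i}.my-cluster-name.local" for i in range(1, 4)]
for c in startup_commands(nodes):
    print(c)
```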
  8. Advanced:

    1. If I have a large cluster, then I should be utilizing every single Qdrant-node (for faster "sharded" queries).
    2. Not sure how Qdrant leverages the 3-copies (across 3 AZs) of each shard for improving query-performance.
    3. But, should you ever want to use a VPC-Lambda to split-up queries across ALL Qdrant-nodes, the Lambda must be able to get the list of PRIVATE-IPs of each Qdrant-node.
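Once such a Lambda has the private IPs, the fan-out itself is a scatter-gather: query every node, then merge the per-node top-k lists into a global top-k. A minimal merge sketch (assuming higher score = better match, as with cosine similarity; the IDs and scores are made up):

```python
import heapq

def merge_topk(per_node_hits, k):
    """Scatter-gather merge: each node returned its own (id, score) top-k;
    combine them into a global top-k. Assumes higher score = better match."""
    all_hits = [hit for hits in per_node_hits for hit in hits]
    return heapq.nlargest(k, all_hits, key=lambda h: h[1])

# Made-up per-node results from two Qdrant nodes:
node_a = [("p1", 0.95), ("p2", 0.80)]
node_b = [("p9", 0.90), ("p3", 0.10)]
print(merge_topk([node_a, node_b], k=3))  # [('p1', 0.95), ('p9', 0.9), ('p2', 0.8)]
```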

Decisions

  1. Use pure CDK / pure CloudFormation.
  2. Avoid Custom-Resources in CDK/CloudFormation at all costs.
  3. Avoid having to "do stuff" after deployment of AWS resources. Avoid using Lambdas completely w.r.t. ANY cluster modifications (post-deployment).
    1. The cluster will need to run 24x7x365 and must self-adjust. Having lambdas run periodically is Never viable for resilience.
    2. Periodically doing Scale-up/down will significantly raise the risk of Data-Loss. No more Scaling-down on a schedule, daily.
      1. New ability to take a snapshot (of --EACH-- shard or of --EACH-- collection). This is important due to implications of SHARDING.
      2. New ability to save SnapShots onto an EFS-filesystem
      3. Restore a collection from a SnapShot instead (avoid DIY).
  4. These articles require you to custom-build your own Docker-container-image (from Qdrant's source-code in GitHub).
    • If your personal-requirements/constraints are forcing you to avoid changing/enhancing the Dockerfile (so that any "drop-in" official Qdrant image will also work) .. you'll then need 3 different ECS-services. Details are out of scope of these articles.
  5. No more EFS (instead of EBS) as the primary filesystem for ALL Task-instances/Containers (recall the file-locking challenge above).
  6. Only production will use a Qdrant-6-node-Cluster. All other environments will use a single-node Qdrant-container/Task (which has a scheduled daily downtime).
    • Should clusters be considered too expensive, then a single-node Qdrant Vector-DB along with Snapshots can potentially support an RPO of 15-minutes, but this will require automation to thoroughly test the "snapshots".
  7. Each Fargate-Task will have just one Container; In these articles, they are synonymous.
  8. Native (non-storage-level) snapshots are mandatory from day 1, no matter what the final architecture & design turn out to be.
  9. Snapshots (with zero overhead) should be automatically stored resiliently across regions. Choices are S3, EFS & AWS Backup. Simplest is EFS (with multi-region replication), mounted at /qdrant/snapshots inside the container.

End.
