Is all of this (CDK & Clusters) too much to handle? Seek professional services from Qdrant.tech; if not, from me or my current employer.
Summary of Challenges:
- This is a Database!
- Need multi-AZ resiliency when cluster is created.
- Must automatically maintain multi-AZ resiliency even after innumerable node-failures; How to delegate to AWS (ECS) the job of maintaining this multi-AZ balance by REPLACING nodes, ensuring resiliency & DLP (data-loss protection)?
- Need high-speed robust multi-AZ replication (of storage/data)
- Avoid re-inventing the wheel, that too .. on AWS Cloud
- Avoid deploying any of my own mechanisms for:-
- Resiliency, incl. automatically spinning up a new "node", if any node becomes unhealthy.
- inter-node / inter-process Communication
- Replication of Storage-layer or just database-data
- Monitoring & Alerting
- Storage-Persistence challenges
- Qdrant, like any database, does file-locking (highly parallel threads attempting to lock files on the file-system).
- Multiple copies of the data across AZs must stay in sync at all times, ready to be copied almost instantly to a new replica-node upon any AZ-outage.
- EBS Volumes can NOT be replicated to another AZ or to another Region.
- A "crash" (of a single-node Qdrant-Vector-DB) can leave the contents of the EBS Volumes in a "failed-data-integrity" status that may prevent the replacement node from spinning up; This is a classic "Crashed-Database problem" which has NO simple solution.
- Self-managed cluster:
- Using Fargate-Tasks/Containers as "nodes" (instead of traditional EC2s)
- Qdrant requires a cluster's "nodes/tasks/containers" .. to be spun up in SEQUENCE, in a specific order.
- Worse, Qdrant MANDATES that each SUBSEQUENT "node/task/container" be given information about ANY one of the "preceding" ones.
- If the 1st Qdrant "node/task/container" that was created goes down, replacing it requires special-handling!
- Fargate-Tasks/Containers, by design, have ephemeral PRIVATE-IPs.
- Database requires NO PUBLIC-IPs, but "Qdrant-Dashboard" (website) is built-in into the database! And, we need access to it!
- Snapshots-based restore/recovery (usually a manual activity) requires use of this "Qdrant-Dashboard".
Must-have Requirements
- Per Qdrant's documentation, and given its reliance on the "Raft" protocol, they recommend a cluster of minimum 3 "nodes/tasks/containers". Studying Qdrant's documentation further clarifies that it should be 3 per shard, and that for future-proofing they recommend starting with 2 shards per collection.
- So, in total 6! That is, 2 nodes/containers/tasks in each AZ, each containing a different Shard. Thereby, each Shard is replicated across 3 AZs.
- You can definitely make do with 4, but RAFT's behavior needs studying.
- Backup Snapshots
- Must have 100% integrity.
- Ideally, avoid taking down the "database/cluster/writers" before taking a snapshot
- Leverage the database's INHERENT-ability to create snapshots.
- Avoid storage-level snapshots (given the decades-long history of the "Crashed-Database problem")
- Maintenance Scenarios:-
- Automation of snapshots as appropriate for the D.R. RPO; We chose an RPO of 15 minutes.
- Snapshots made every 15 minutes; Last 24 hourly-snapshots to be retained; Last 30 daily-snapshots to be retained; Last 12 monthly-snapshots to be retained.
- Based on your IT-process sophistication and based on the size of your collection(s), choose between Shard-level snapshots -versus- collection-snapshots. We chose collection-snapshots. Note: Avoid storage-level snapshots.
- Resilient storage of snapshots across regions. FYI: Typical choices are S3, EFS & AWS-Backup. "Standard" EBS can NOT do this.
- Security:
- `httpS` from outside the Vpc, via a single-endpoint, no matter what changes happen within the Qdrant-cluster.
- "native" ApiKey.
- Monitoring:
- Basic "node" monitoring (O/S level)
- Database monitoring (Qdrant-API-level)
- Shard-health monitoring (Qdrant-API-level)
- Replacement-of-nodes monitoring
- Periodic "101% Healthy" monitoring-checks (actually trying to create a new "test-collection", insert data & run a query). A minimal sketch of such a check follows this list.
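To make that last monitoring-check concrete, here is a minimal sketch of such a "101% Healthy" probe against Qdrant's REST API. It assumes (my choices, not part of the original design) that the cluster is reachable at a single internal endpoint in `QDRANT_URL`, that the "native" ApiKey is in `QDRANT_API_KEY`, and that the throw-away collection is named `healthcheck-test`.

```typescript
// Minimal "101% Healthy" probe: create a test collection, insert a point, query it, clean up.
const QDRANT_URL = process.env.QDRANT_URL!;   // e.g. the single internal https endpoint (assumption)
const HEADERS = { "api-key": process.env.QDRANT_API_KEY!, "content-type": "application/json" };

async function qdrant(method: string, path: string, body?: unknown): Promise<any> {
  const res = await fetch(`${QDRANT_URL}${path}`, {
    method,
    headers: HEADERS,
    body: body ? JSON.stringify(body) : undefined,
  });
  if (!res.ok) throw new Error(`${method} ${path} -> HTTP ${res.status}`);
  return res.json();
}

export async function healthCheck(): Promise<void> {
  const name = "healthcheck-test";   // arbitrary throw-away collection name (assumption)
  // 1. Create a small throw-away collection.
  await qdrant("PUT", `/collections/${name}`, { vectors: { size: 4, distance: "Cosine" } });
  try {
    // 2. Insert one point and wait until it is persisted.
    await qdrant("PUT", `/collections/${name}/points?wait=true`, {
      points: [{ id: 1, vector: [0.1, 0.2, 0.3, 0.4], payload: { probe: true } }],
    });
    // 3. Run a query and verify the point comes back.
    const result = await qdrant("POST", `/collections/${name}/points/search`, {
      vector: [0.1, 0.2, 0.3, 0.4],
      limit: 1,
    });
    if (!result.result?.length) throw new Error("101%-healthy check: query returned no results");
  } finally {
    // 4. Always clean up the test collection.
    await qdrant("DELETE", `/collections/${name}`);
  }
}
```

Whatever scheduler runs this probe can surface a failure through your normal alerting.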
Software Licensing (for use within a Proprietary product sold to Clients)
- Will Qdrant's licensing support this Enterprise-grade Cluster being used in a commercial product? The answer is yes, as of Dec 2025.
Connecting all "nodes/tasks/containers" together as a cluster:
- Challenge: Fargate tasks have ephemeral private IPs, and no predictable FQDNs/server-names.
- Best Solution is ECS Service Discovery :- It creates predictable DNS names (e.g., `qdrant-node-1.my-cluster-name.local`). A CDK sketch of this follows this section.
- Alternatives are: Parameter Store Coordination, EFS-Based Coordination (leverage the existing EFS), ALB + Health-Check Discovery (use the load balancer for discovery).
- Blame the "Raft" protocol and/or Qdrant, but .. you are REQUIRED to fully bring up the 1st node/task/container, and then wait for it to finish initialization, before bringing up any other node/task/container.
- Can it be designed simpler!? Sure! You are welcome anytime to put in a PR on Qdrant's official GitHub.
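Here is a hedged CDK (TypeScript) sketch of the ECS Service Discovery approach, combined with the "bring up node-1 first" requirement. The namespace `my-cluster-name.local`, the `qdrant-node-N` service names, the 3-node count, and the one-ECS-service-per-node layout are illustrative assumptions, not necessarily the article's final design.

```typescript
import * as cdk from "aws-cdk-lib";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as ecs from "aws-cdk-lib/aws-ecs";
import * as servicediscovery from "aws-cdk-lib/aws-servicediscovery";
import { Construct } from "constructs";

// Sketch only: one Fargate service per Qdrant node, each registered in a private
// Cloud Map namespace so it gets a predictable DNS name.
export class QdrantClusterStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, "Vpc", { maxAzs: 3 });

    const cluster = new ecs.Cluster(this, "EcsCluster", {
      vpc,
      defaultCloudMapNamespace: {
        name: "my-cluster-name.local",                      // private DNS namespace (placeholder name)
        type: servicediscovery.NamespaceType.DNS_PRIVATE,
      },
    });

    const makeNode = (index: number): ecs.FargateService => {
      const taskDef = new ecs.FargateTaskDefinition(this, `QdrantTaskDef${index}`, {
        cpu: 1024,
        memoryLimitMiB: 4096,
      });
      taskDef.addContainer("qdrant", {
        image: ecs.ContainerImage.fromRegistry("qdrant/qdrant"),   // drop-in official image
        environment: { QDRANT__CLUSTER__ENABLED: "true" },
        // The "native" ApiKey from the Security requirements could be injected here via the
        // `secrets` prop (e.g. QDRANT__SERVICE__API_KEY from Secrets Manager).
        // Node 1 only announces itself; every later node bootstraps off node 1's predictable
        // DNS name -- this is the "sequence" requirement in action.
        command:
          index === 1
            ? ["./qdrant", "--uri", "http://qdrant-node-1.my-cluster-name.local:6335"]
            : ["./qdrant",
               "--bootstrap", "http://qdrant-node-1.my-cluster-name.local:6335",
               "--uri", `http://qdrant-node-${index}.my-cluster-name.local:6335`],
        portMappings: [{ containerPort: 6333 }, { containerPort: 6334 }, { containerPort: 6335 }],
        logging: ecs.LogDrivers.awsLogs({ streamPrefix: `qdrant-node-${index}` }),
      });
      return new ecs.FargateService(this, `QdrantNode${index}`, {
        cluster,
        taskDefinition: taskDef,
        desiredCount: 1,
        cloudMapOptions: {
          name: `qdrant-node-${index}`,                     // => qdrant-node-N.my-cluster-name.local
          dnsRecordType: servicediscovery.DnsRecordType.A,
          dnsTtl: cdk.Duration.seconds(10),                 // ephemeral IPs => keep the TTL short
        },
      });
    };

    // Node 1 is created (and must stabilize) before the others join. This is only a coarse,
    // CloudFormation-level ordering; fine-grained "wait until initialized" still relies on
    // Qdrant's own retries and your health checks. A 6-node layout follows the same pattern.
    const node1 = makeNode(1);
    for (let i = 2; i <= 3; i++) {
      makeNode(i).node.addDependency(node1);
    }
  }
}
```

Security-group rules for ports 6333-6335 between the services, the ALB in front for `httpS` access, and the storage volumes are deliberately left out of this sketch.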
Advanced:
- If I have a large cluster, then I should be utilizing every single Qdrant-node (for faster "sharded" queries).
- I am not sure how Qdrant leverages the 3 copies (across 3 AZs) of each shard to improve query-performance.
- But, should you ever want to use a VPC-Lambda to split up queries across ALL Qdrant-nodes, the Lambda must be able to get the list of PRIVATE-IPs of each Qdrant-node. A sketch of that lookup follows below.
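A hedged sketch of that lookup, using the AWS SDK for JavaScript v3 from inside a VPC-Lambda. The cluster and per-node ECS service names are passed in by the caller (they are whatever you named them in your CDK stack).

```typescript
// Discover the PRIVATE-IPs of all Qdrant Fargate tasks via the ECS API.
import { ECSClient, ListTasksCommand, DescribeTasksCommand } from "@aws-sdk/client-ecs";

const ecsClient = new ECSClient({});

export async function getQdrantPrivateIps(
  clusterName: string,
  serviceNames: string[],
): Promise<string[]> {
  const ips: string[] = [];
  for (const serviceName of serviceNames) {
    // List the RUNNING tasks behind each ECS service (one service per Qdrant node in this setup).
    const { taskArns = [] } = await ecsClient.send(
      new ListTasksCommand({ cluster: clusterName, serviceName, desiredStatus: "RUNNING" }),
    );
    if (taskArns.length === 0) continue;

    const { tasks = [] } = await ecsClient.send(
      new DescribeTasksCommand({ cluster: clusterName, tasks: taskArns }),
    );
    for (const task of tasks) {
      // Fargate tasks expose their ENI details on the task's "attachments".
      for (const attachment of task.attachments ?? []) {
        const ip = attachment.details?.find((d) => d.name === "privateIPv4Address")?.value;
        if (ip) ips.push(ip);
      }
    }
  }
  return ips;
}
```

An alternative is to query the Cloud Map namespace directly (via DNS or the Cloud Map API), since every node is already registered there by ECS Service Discovery.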
Decisions
- Use pure CDK / pure CloudFormation.
- Avoid Custom-Resources in CDK/CloudFormation at all costs.
- Avoid having to "do stuff" after deployment of aws-resources.
- Avoid using Lambdas completely w.r.t. ANY Cluster modifications (post-deployment).
- The cluster will need to run 24x7x365 and must self-adjust. Having Lambdas run periodically is never viable for resilience.
- Periodically doing scale-up/down significantly raises the risk of data-loss. So, no more scaling down on a daily schedule.
- New ability to take a snapshot (of --EACH-- shard or of --EACH-- collection). This is important due to the implications of SHARDING; a sketch of the collection-snapshot API call appears at the end of this list.
- New ability to save SnapShots onto an EFS-filesystem
- Restore a collection from a SnapShot instead (avoid DIY).
- These articles require you to custom-build your own Docker-container-image (from Qdrant's source-code in GitHub).
- If your personal-requirements/constraints are forcing you to avoid changing/enhancing the `Dockerfile` (so that any "drop-in" official Qdrant image will also work) .. you'll then need 3 different ECS-services. Details are out of scope of these articles.
- No more EFS (instead of EBS) as primary filesystem, for ALL Task-instances/Containers.
- Only `production` will use a Qdrant 6-node-Cluster. All other environments will use a single-node Qdrant-container/Task (which has a scheduled daily downtime).
- Should clusters be considered too expensive, then a single-node Qdrant Vector-DB along with Snapshots can potentially support an RPO of 15 minutes, but this will require automation to thoroughly test the "snapshots".
- Each Fargate-Task will have just one Container; In these articles, they are synonymous.
- Native (non-storage-level) SnapShots are mandatory from day 1, needed no matter what the final architecture & design.
- Snapshots (with zero overhead) should be automatically stored resiliently across regions. Choices are S3, EFS & AWS Backup. Simplest is EFS (with multi-region replication), by mounting `/qdrant/snapshots` into the container. A CDK sketch of that mount, and of the snapshot-automation itself, follows below.
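A hedged CDK (TypeScript) fragment for that `/qdrant/snapshots` mount. It assumes the `vpc`, task definition, Qdrant container and Fargate service constructs already exist (e.g., from the cluster sketch earlier); the construct IDs and the volume name are placeholders.

```typescript
import { Construct } from "constructs";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as ecs from "aws-cdk-lib/aws-ecs";
import * as efs from "aws-cdk-lib/aws-efs";

// Mounts a (new) EFS filesystem at /qdrant/snapshots for a Qdrant Fargate service.
// The caller passes in the constructs created in the cluster sketch earlier.
export function mountSnapshotsEfs(
  scope: Construct,
  vpc: ec2.IVpc,
  taskDef: ecs.FargateTaskDefinition,
  container: ecs.ContainerDefinition,
  service: ecs.FargateService,
): efs.FileSystem {
  const snapshotsFs = new efs.FileSystem(scope, "SnapshotsEfs", {
    vpc,
    encrypted: true,
    // Cross-region resiliency of the snapshot files themselves would come from
    // EFS replication, configured separately (not shown here).
  });

  // Register the EFS volume on the task definition ...
  taskDef.addVolume({
    name: "qdrant-snapshots",
    efsVolumeConfiguration: {
      fileSystemId: snapshotsFs.fileSystemId,
      transitEncryption: "ENABLED",
    },
  });

  // ... and mount it where the official Qdrant image keeps its snapshots.
  container.addMountPoints({
    sourceVolume: "qdrant-snapshots",
    containerPath: "/qdrant/snapshots",
    readOnly: false,
  });

  // Allow the Fargate tasks to reach the EFS mount targets over NFS (port 2049).
  snapshotsFs.connections.allowDefaultPortFrom(service);

  return snapshotsFs;
}
```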
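And a hedged sketch of the snapshot-automation itself, driven by Qdrant's own collection-snapshot REST API rather than any storage-level mechanism. The `QDRANT_URL`/`QDRANT_API_KEY` environment variables and the simple keep-the-newest-96 pruning (roughly 24 hours of 15-minute snapshots) are my assumptions; the hourly/daily/monthly retention tiers would live in whatever scheduler invokes this.

```typescript
// 15-minute, collection-level snapshot automation using Qdrant's REST snapshot API.
const BASE = process.env.QDRANT_URL!;                      // e.g. the single internal endpoint (assumption)
const HEADERS = { "api-key": process.env.QDRANT_API_KEY!, "content-type": "application/json" };

async function call(method: string, path: string): Promise<any> {
  const res = await fetch(`${BASE}${path}`, { method, headers: HEADERS });
  if (!res.ok) throw new Error(`${method} ${path} -> HTTP ${res.status}`);
  return res.json();
}

export async function snapshotCollection(
  collection: string,
  keep = 96, // ~24 hours of 15-minute snapshots (assumption)
): Promise<void> {
  // 1. Ask Qdrant itself (NOT the storage layer) to create a collection snapshot.
  //    The file lands under /qdrant/snapshots, i.e. on the EFS filesystem discussed above.
  await call("POST", `/collections/${collection}/snapshots`);

  // 2. Prune: list existing snapshots and delete the oldest ones beyond the retention count.
  const { result: snapshots } = await call("GET", `/collections/${collection}/snapshots`);
  const ordered = [...snapshots].sort((a: any, b: any) =>
    String(a.creation_time).localeCompare(String(b.creation_time)),
  );
  for (const old of ordered.slice(0, Math.max(0, ordered.length - keep))) {
    await call("DELETE", `/collections/${collection}/snapshots/${old.name}`);
  }
}
```

Restores stay database-level too: Qdrant exposes a snapshot-recover endpoint for collections, so there is no DIY copying at the storage layer.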
End.