DEV Community

Sarma

Design: Vector-Database: Qdrant-Cluster

Is all of this (CDK & Clusters) too much to handle? Seek professional services from Qdrant.tech; failing that, from me or my current employer.

Summary of Challenges:

  1. This is a Database!
  2. Need multi-AZ resiliency when cluster is created.
  3. Must automatically maintain multi-AZ resiliency even after innumerable node failures. How do we delegate to AWS (ECS) the job of maintaining this multi-AZ balance by REPLACING nodes, ensuring resiliency & DLP (data-loss protection)?
  4. Need high-speed robust multi-AZ replication (of storage/data)
  5. Avoid re-inventing the wheel, that too .. on AWS Cloud
  6. Avoid deploying any of my own mechanisms for:-
    1. Resiliency, incl. automatically spinning up a new "node", if any node becomes unhealthy.
    2. inter-node / inter-process Communication
    3. Replication of Storage-layer or just database-data
    4. Monitoring & Alerting
  7. Storage-Persistence challenges
    1. Qdrant, like any database, does file-Locking (highly parallel threads attempting to lock the file-system)
    2. Multiple copies of data across AZs in sync all the time, almost instantly ready to copy to a new replica-node on any AZ-outage.
    3. EBS Volumes can NOT be replicated to another AZ or to another Region.
    4. A "crash" (of a single-node Qdrant-Vector-DB) can leave the contents of the EBS Volumes in a "failed-data-integrity" status that may prevent the replacement node from spinning up; This is a classic "Crashed-Database problem" which has NO simple solution.
  8. Self-managed cluster:
    1. Using Fargate-Tasks/Containers as "nodes" (instead of traditional EC2s)
    2. Qdrant requires a cluster's "nodes/tasks/containers" to be spun up in SEQUENCE, in a specific order.
    3. Worse, Qdrant MANDATES that each SUBSEQUENT "node/task/container" be given information about ANY one of the "preceding" ones.
    4. If the 1st Qdrant "node/task/container" that was created goes down, replacing it requires special-handling!
    5. Fargate-Tasks/Containers, by design, have ephemeral PRIVATE-IPs.
    6. Database requires NO PUBLIC-IPs, but "Qdrant-Dashboard" (website) is built-in into the database! And, we need access to it!
    7. Snapshots-based restore/recovery (usually a manual activity) requires use of this "Qdrant-Dashboard".

Must-have Requirements

  1. Per Qdrant's documentation and its reliance on the "Raft" protocol, they recommend a cluster of minimum 3 "nodes/tasks/containers". Studying Qdrant's documentation clarifies that it should be 3 per shard, and that for future-proofing they recommend starting with 2 shards per collection.
    • So, in total 6! That is, 2 nodes/containers/tasks in each AZ, each containing a different Shard. Thereby, each Shard is replicated across 3 AZs.
    • You can definitely make do with 4, but Raft's quorum behavior needs studying first.
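The arithmetic above can be sketched as a tiny placement helper. This is an illustration of the article's numbers only (the shard IDs and AZ names are made up); Qdrant itself decides actual shard placement.

```python
# Illustrative placement math for the article's numbers (2 shards x 3 AZs).
# AZ names are made up; Qdrant itself decides real shard placement.
def cluster_size(shards: int, replicas_per_shard: int) -> int:
    """Total nodes when every shard-replica gets its own node."""
    return shards * replicas_per_shard

def az_layout(shards: int, azs: int) -> dict:
    """One replica of every shard in each AZ (one node per shard per AZ)."""
    return {shard: [f"az-{i + 1}" for i in range(azs)]
            for shard in range(1, shards + 1)}

print(cluster_size(shards=2, replicas_per_shard=3))  # 6 nodes in total
print(az_layout(shards=2, azs=3))  # each of the 2 shards replicated across 3 AZs
```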
  2. Backup Snapshots
    1. Must have 100% integrity.
    2. Ideally, avoid taking down the "database/cluster/writers" before taking a snapshot
    3. Leverage the database's INHERENT-ability to create snapshots.
    4. Avoid storage-level snapshots (given the decades-long history of the "Crashed-Database problem")
  3. Maintenance Scenarios:-
    1. Automation of snapshots as appropriate for the D.R. RPO; we chose an RPO of 15 minutes.
    2. Snapshots made every 15 minutes; Last 24 hourly-snapshots to be retained; Last 30 daily-snapshots to be retained; Last 12 monthly-snapshots to be retained.
    3. Based on your IT-process sophistication and based on the size of your collection(s), choose between Shard-level snapshots -versus- collection-snapshots. We chose collection-snapshots. Note: Avoid storage-level snapshots.
    4. Resilient-storage of snapshots across regions. FYI: Typical choices are S3, EFS & AWS-Backup. "Standard" EBS can Not do this.
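The retention schedule above is a classic grandfather-father-son rotation. Below is a hedged sketch of the pruning logic (the bucket keys and cutoffs are assumptions mirroring the stated schedule); it returns which snapshot timestamps to keep.

```python
from datetime import datetime, timedelta

def snapshots_to_keep(snaps, now):
    """Grandfather-father-son pruning: keep the newest snapshot per hourly /
    daily / monthly bucket for the last 24 hours / 30 days / 12 months.
    Bucket keys and cutoffs are assumptions mirroring the stated schedule."""
    buckets = [
        (lambda t: (t.year, t.month, t.day, t.hour), now - timedelta(hours=24)),
        (lambda t: (t.year, t.month, t.day),         now - timedelta(days=30)),
        (lambda t: (t.year, t.month),                now - timedelta(days=365)),
    ]
    keep = set()
    for key, cutoff in buckets:
        newest = {}
        for t in snaps:
            if t >= cutoff and (key(t) not in newest or t > newest[key(t)]):
                newest[key(t)] = t
        keep.update(newest.values())
    return keep

now = datetime(2025, 1, 10, 12, 0)
recent = [now - timedelta(minutes=15 * i) for i in range(8)]  # 2 hours of 15-min snaps
print(sorted(snapshots_to_keep(recent, now)))  # newest per hour: 10:45, 11:45, 12:00
```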
  4. Security:
    1. httpS from outside Vpc, via a single-endpoint no matter what changes happen within Qdrant-cluster.
    2. "native" ApiKey.
  5. Monitoring:
    1. Basic "node" monitoring (O/S level)
    2. Database monitoring (Qdrant-API-level)
    3. Shard-health monitoring (Qdrant-API-level)
    4. Replacement-of-nodes monitoring
    5. Periodic "101% Healthy" monitoring-checks (actually trying to create a new "test-collection", insert data & run a query).
  6. Software Licensing (for use within a Proprietary product sold to Clients)

    1. Will Qdrant's licensing support this Enterprise-grade Cluster being used in a commercial-product? Answer is yes, as of Dec 2025.
  7. Connecting all "nodes/tasks/containers" together as a cluster:

    1. Challenge: Fargate tasks have ephemeral private IPs, no predictable FQDNs/server-names.
      • Best Solution is ECS Service Discovery :- It creates predictable DNS names (e.g., qdrant-node-1.my-cluster-name.local)
      • Alternatives are: Parameter Store Coordination , EFS-Based Coordination (Leverage existing EFS) , ALB + Health Check Discovery (Use load balancer for discovery)
    2. Blame the "Raft" protocol and/or Qdrant, but you are REQUIRED to fully bring up the 1st node/task/container, and wait for it to finish initialization, before bringing up any other node/task/container.
      • Can it be designed simpler !? Sure! You are welcome anytime to put in a PR on Qdrant's official-GitHub.
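Per Qdrant's distributed-deployment documentation, the first peer is started with only a `--uri`, and every later peer adds a `--bootstrap` flag pointing at an already-running peer. Below is a sketch that generates the startup commands from ECS Service Discovery DNS names (the hostnames and port 6335 follow the examples above; verify the flags against your Qdrant version).

```python
# Command generator for the mandated start-up SEQUENCE. Flag names follow
# Qdrant's distributed-deployment docs (--uri / --bootstrap); verify them
# against your Qdrant version. Hostnames mimic ECS Service Discovery DNS.
def startup_commands(nodes, port=6335):
    cmds = []
    for i, host in enumerate(nodes):
        cmd = f"./qdrant --uri http://{host}:{port}"
        if i > 0:  # every later peer bootstraps off the (already healthy) 1st
            cmd += f" --bootstrap http://{nodes[0]}:{port}"
        cmds.append(cmd)
    return cmds

nodes = [f"qdrant-node-{i}.my-cluster-name.local" for i in range(1, 4)]
for c in startup_commands(nodes):
    print(c)
```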
  8. Advanced:

    1. If I have a large cluster, then I should be utilizing every single Qdrant-node (for faster "sharded" queries).
    2. Not sure how Qdrant leverages the 3-copies (across 3 AZs) of each shard for improving query-performance.
    3. But, should you ever want to use a VPC-Lambda to split-up queries across ALL Qdrant-nodes, the Lambda must be able to get the list of PRIVATE-IPs of each Qdrant-node.
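Once such a Lambda has the private IPs, the fan-out itself is a scatter-gather: query every node, then merge the per-node top-k lists into a global top-k. A minimal merge sketch (assuming higher score = better match, as with cosine similarity; the IDs and scores are made up):

```python
import heapq

def merge_topk(per_node_hits, k):
    """Scatter-gather merge: each node returned its own (id, score) top-k;
    combine them into a global top-k. Assumes higher score = better match."""
    all_hits = [hit for hits in per_node_hits for hit in hits]
    return heapq.nlargest(k, all_hits, key=lambda h: h[1])

# Made-up per-node results from two Qdrant nodes:
node_a = [("p1", 0.95), ("p2", 0.80)]
node_b = [("p9", 0.90), ("p3", 0.10)]
print(merge_topk([node_a, node_b], k=3))  # [('p1', 0.95), ('p9', 0.9), ('p2', 0.8)]
```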

Decisions

  1. Use pure CDK / pure CloudFormation.
  2. Avoid Custom-Resources in CDK/CloudFormation at all costs.
  3. Avoid having to "do stuff" after deployment of AWS resources. Avoid using Lambdas completely w.r.t. ANY cluster modifications (post-deployment).
    1. The cluster will need to run 24x7x365 and must self-adjust. Having lambdas run periodically is Never viable for resilience.
    2. Periodically doing Scale-up/down will significantly raise the risk of Data-Loss. No more Scaling-down on a schedule, daily.
      1. New ability to take a snapshot (of --EACH-- shard or of --EACH-- collection). This is important due to implications of SHARDING.
      2. New ability to save SnapShots onto an EFS-filesystem
      3. Restore a collection from a SnapShot instead (avoid DIY).
  4. These articles require you to custom-build your own Docker-container-image (from Qdrant's source-code in GitHub).
    • If your personal-requirements/constraints are forcing you to avoid changing/enhancing the Dockerfile (so that any "drop-in" official Qdrant image will also work) .. you'll then need 3 different ECS-services. Details are out of scope of these articles.
  5. No more EFS (instead of EBS) as the primary filesystem for ALL Task-instances/Containers (recall the file-locking challenge above).
  6. Only production will use a Qdrant-6-node-Cluster. All other environments will use a single-node Qdrant-container/Task (which has a scheduled daily downtime).
    • Should clusters be considered too expensive, then a single-node Qdrant Vector-DB along with Snapshots can potentially support an RPO of 15-minutes, but this will require automation to thoroughly test the "snapshots".
  7. Each Fargate-Task will have just one Container; In these articles, they are synonymous.
  8. Native (non-storage-level) snapshots are mandatory from day 1, no matter what the final architecture & design turn out to be.
  9. Snapshots (with zero overhead) should be automatically stored resiliently across regions. Choices are S3, EFS & AWS Backup. Simplest is EFS (with multi-region replication), mounted at /qdrant/snapshots inside the container.

End.
