SciForce

Designing a Secure, Automated Virtual Datacenter for Multi-Tenant Virtualization

Client Profile

The client is a hardware and infrastructure provider developing a platform for delivering virtual data centers as a scalable, cost-efficient service. The project’s goal was to enable enterprise customers to deploy and manage computing resources — including virtual machines, storage, and network components — through a unified, automated environment.

The platform was designed to integrate physical infrastructure with software-defined orchestration, providing secure tenant isolation, flexible resource allocation, and end-to-end automation. By relying on open-source technologies and custom orchestration components, the client aimed to achieve the reliability and manageability of enterprise-grade systems while keeping operational costs under control.

Challenge

1) Technology and Architecture
The project involved bringing together physical servers, virtualization tools, and Kubernetes orchestration into one system that could securely host multiple tenant environments. The architecture was revised many times as the team refined how clusters communicate, how isolation is maintained, and how resources are allocated.

During testing, performance issues appeared when virtualization was layered on top of virtual machines instead of running on physical hardware. The team also had to configure networking, DNS, and routing to ensure full tenant isolation, and set up persistent storage that could handle replication, recovery, and data preservation after workloads stopped running.

2) Tooling and Platform Limitations
The team faced obstacles choosing technologies that met both technical and cost requirements. Commercial platforms offered strong reliability but were too expensive for the project’s budget goals. Many open-source options were unstable, lacked important features, or required extensive setup and maintenance. Most also came without reliable support, creating additional risks for production use.

3) Operations and Maintenance
The system had to be easy to support with a small DevOps team and without using complex commercial tools. To achieve this, the team focused on automating deployment, monitoring, and recovery to minimize manual work. Some open-source components required advanced expertise, and finding specialists who could maintain them was difficult. Ensuring stability and data recovery also had to be done with limited resources, using internal tools instead of large-scale infrastructure typical for big cloud providers.

4) Business and Cost Constraints
The main goal was to create a virtual data center platform that stayed affordable without losing key functionality. Expensive commercial tools didn’t fit this goal, so the team focused on open-source technologies and custom development. Each design decision was reviewed for its cost impact, from storage and networking to automation and maintenance. A high level of automation helped reduce manual work and operating expenses, making the platform more sustainable and cost-effective for both the provider and its clients.

Solution

Scalable Virtualization Layer
The platform was built on Kubernetes and KubeVirt, allowing it to run both virtual machines and containerized applications in one system. Its two-layer design included a management cluster for overall control and tenant clusters that were created automatically for each client. Each tenant cluster worked as a separate Kubernetes environment with its own computing power, storage, and network. All tasks, from setup and scaling to removal, were automated through Kubernetes APIs, helping the system scale smoothly and maintain stable performance.

Diagram: Virtual Data Center Platform Architecture
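
To make the two-layer idea concrete, here is a minimal sketch of how the management layer might start a virtual machine for a tenant through the Kubernetes API, using the Python kubernetes client and KubeVirt's VirtualMachine custom resource. The namespace, VM name, image, and sizing values are hypothetical placeholders, not the platform's actual configuration.

```python
from kubernetes import client, config

# Load credentials for the management cluster (kubeconfig on a workstation,
# or config.load_incluster_config() when running inside the cluster).
config.load_kube_config()

custom_api = client.CustomObjectsApi()

# Minimal KubeVirt VirtualMachine definition; names and sizes are placeholders.
vm_manifest = {
    "apiVersion": "kubevirt.io/v1",
    "kind": "VirtualMachine",
    "metadata": {"name": "tenant-a-vm-01", "namespace": "tenant-a"},
    "spec": {
        "running": True,
        "template": {
            "spec": {
                "domain": {
                    "devices": {"disks": [{"name": "rootdisk", "disk": {"bus": "virtio"}}]},
                    "resources": {"requests": {"memory": "4Gi", "cpu": "2"}},
                },
                "volumes": [
                    {"name": "rootdisk",
                     "containerDisk": {"image": "quay.io/containerdisks/ubuntu:22.04"}}
                ],
            }
        },
    },
}

# Create the VM through the KubeVirt CRD; KubeVirt's controllers then schedule it
# onto a physical node alongside regular container workloads.
custom_api.create_namespaced_custom_object(
    group="kubevirt.io",
    version="v1",
    namespace="tenant-a",
    plural="virtualmachines",
    body=vm_manifest,
)
```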

Tenant Isolation and Networking
Each tenant cluster was deployed as a fully functional Kubernetes cluster with its own networking, storage, and API, and was configured to prevent cross-tenant access. Workloads in tenants could connect to the internet while remaining fully isolated from other environments and the management cluster. Network policies and routing were configured to keep communication secure and stable between internal services and external systems.
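
One common way to express this kind of isolation is with Kubernetes NetworkPolicy objects. The sketch below, written with the Python kubernetes client against a hypothetical tenant-a namespace, allows traffic between pods in the same namespace plus DNS and outbound internet, and blocks everything else; the platform's real policies are not published, so this is only an illustration of the pattern.

```python
from kubernetes import client, config

config.load_kube_config()
net_api = client.NetworkingV1Api()

# Hypothetical policy for the "tenant-a" namespace: allow traffic between pods
# in the same namespace and egress to the internet, block everything else.
policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="tenant-isolation", namespace="tenant-a"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(),        # applies to every pod in the namespace
        policy_types=["Ingress", "Egress"],
        ingress=[client.V1NetworkPolicyIngressRule(
            _from=[client.V1NetworkPolicyPeer(
                pod_selector=client.V1LabelSelector()  # only peers from this same namespace
            )]
        )],
        egress=[
            # DNS lookups (UDP port 53).
            client.V1NetworkPolicyEgressRule(
                ports=[client.V1NetworkPolicyPort(protocol="UDP", port=53)]
            ),
            # Outbound internet, excluding private ranges used by other tenants
            # and the management cluster (example CIDRs).
            client.V1NetworkPolicyEgressRule(
                to=[client.V1NetworkPolicyPeer(
                    ip_block=client.V1IPBlock(cidr="0.0.0.0/0",
                                              _except=["10.0.0.0/8", "192.168.0.0/16"])
                )]
            ),
        ],
    ),
)

net_api.create_namespaced_network_policy(namespace="tenant-a", body=policy)
```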

Storage and Data Management
The platform used a storage layer to keep data available even after workloads were stopped or clusters restarted. The storage system included replication and automatic recovery to prevent data loss if a node or disk failed. Through custom Kubernetes storage classes, each tenant could create and manage their own volumes, ensuring reliable and consistent access to data.

Diagram: Storage and Data Management
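
A tenant-facing storage class of the kind described above might be registered roughly like this. The provisioner name (csi.example.com) and the replicas parameter are hypothetical stand-ins for whichever distributed CSI backend actually sits underneath.

```python
from kubernetes import client, config

config.load_kube_config()
storage_api = client.StorageV1Api()

# Hypothetical storage class: the provisioner name and the "replicas" parameter
# are placeholders for the real CSI driver and its settings.
tenant_class = client.V1StorageClass(
    metadata=client.V1ObjectMeta(name="tenant-replicated"),
    provisioner="csi.example.com",            # placeholder CSI driver name
    parameters={"replicas": "3"},             # keep three copies of every volume
    reclaim_policy="Retain",                  # keep data after the workload is gone
    allow_volume_expansion=True,
    volume_binding_mode="WaitForFirstConsumer",
)

storage_api.create_storage_class(body=tenant_class)
```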

Automation and Control Layer
The platform included an API service and web interface that allowed clients to deploy and manage their own environments. The API handled all cluster operations — creating, scaling, and deleting — and automatically allocated CPU, memory, and storage within set limits. All provisioning and setup were done through Kubernetes APIs, making the process fully automated and removing the need for manual work from operators.
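
The case study does not publish the API itself, but a minimal sketch of the idea, assuming a Python web framework such as FastAPI sitting in front of the Kubernetes APIs, could look like the following. The endpoint path, the limit values, and the provision_tenant_cluster helper are hypothetical.

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Hypothetical per-tenant ceilings enforced by the control layer.
LIMITS = {"cpu": 64, "memory_gib": 256, "storage_gib": 2048}


class ClusterRequest(BaseModel):
    tenant: str
    cpu: int
    memory_gib: int
    storage_gib: int


def provision_tenant_cluster(req: ClusterRequest) -> str:
    """Placeholder for the code that creates the Cluster API objects for this tenant."""
    return f"{req.tenant}-cluster"


@app.post("/v1/clusters")
def create_cluster(req: ClusterRequest):
    # Reject requests that exceed the tenant's allocation before touching Kubernetes.
    for resource, limit in LIMITS.items():
        if getattr(req, resource) > limit:
            raise HTTPException(status_code=400, detail=f"{resource} exceeds tenant limit")

    cluster_id = provision_tenant_cluster(req)
    return {"cluster_id": cluster_id, "status": "provisioning"}
```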

Resource Provisioning Model
Clients could use the platform’s web interface to choose how much CPU, memory, storage, and GPU power they needed for their workloads. The system automatically assigned these resources through Kubernetes APIs while following the set limits for each tenant. When workloads stopped, the system released the computing resources back into the shared pool, and persistent volumes kept the stored data available for future use.
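
Inside Kubernetes, per-tenant ceilings like these are typically expressed as ResourceQuota objects. A minimal sketch with made-up numbers and a hypothetical tenant namespace:

```python
from kubernetes import client, config

config.load_kube_config()
core_api = client.CoreV1Api()

# Example quota for one tenant namespace; the numbers are illustrative only.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="tenant-a-quota", namespace="tenant-a"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "32",
            "requests.memory": "128Gi",
            "requests.storage": "1Ti",
            "requests.nvidia.com/gpu": "2",  # GPU quota, assuming the GPU device plugin is installed
        }
    ),
)

core_api.create_namespaced_resource_quota(namespace="tenant-a", body=quota)
```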

Cost Optimization and Sustainability
The platform used open-source technologies and in-house tools, which removed licensing fees and lowered overall costs. Automation handled key processes like provisioning, scaling, monitoring, and recovery, so the system required only a small DevOps team for support. Better resource use and built-in recovery features helped reduce infrastructure expenses while keeping performance stable and reliable.

Features

- Self-Service and Unified Management
Clients can deploy, scale, and remove complete virtual data centers through a single web interface or API. All provisioning of software-defined data centers happens automatically, with full integration into DevOps pipelines.

- Dedicated and Secure Tenant Environments
Each client operates in a dedicated Kubernetes cluster with its own compute, storage, and networking resources. Network isolation ensures full separation between tenants while maintaining secure internet connectivity.

Diagram: Networking and Isolation Structure

- High-Availability Storage
Each environment uses a distributed, fault-tolerant storage system that replicates data across multiple nodes. If a node or disk fails, the platform restores replicas automatically, keeping data safe and workloads online.

- Automated Monitoring and Recovery
The platform continuously checks the health of all components using Kubernetes-native monitoring tools, and auto-healing restarts or replaces failed components to keep workloads running.

- Secure External Connectivity
Tenants can expose their workloads to the internet through managed load balancers. Outbound connections are filtered through isolated gateways with firewall rules to maintain security boundaries.

- Role-based Access Control
Users authenticate only within their assigned tenant cluster. Access permissions define who can view or manage resources, ensuring each team operates independently and securely.

- Data Lifecycle Controls
When workloads stop, compute and network resources are released automatically. Persistent storage can be retained or deleted based on policy, allowing tenants to keep essential data while optimizing available capacity.

Development Process

1) Defined architecture and stack
After evaluating traditional virtualization and open-source options, the team designed a two-layer Kubernetes-based system — a central management cluster governing isolated tenant clusters through Cluster API and KubeVirt. Key steps included:

  • Comparing commercial (VMware, OpenShift) and open-source stacks for performance, scalability, and cost.
  • Choosing Kubernetes, Cluster API, and KubeVirt to manage both containers and virtual machines directly on physical hardware.
  • Establishing tenant isolation and automated provisioning as core design principles.
  • Validating architecture and automation logic through early proof-of-concept testing before full rollout.

2) Automated management cluster setup
The team created Infrastructure-as-Code (IaC) scripts to deploy the management (control) cluster in a consistent and repeatable way. The automation provisioned Kubernetes control-plane nodes, configured networking and monitoring components, and ran health checks to confirm readiness.
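
The IaC scripts themselves are not shown in the case study, but the readiness check at the end of the bootstrap could look roughly like this sketch, which polls the management cluster until every node reports Ready:

```python
import time

from kubernetes import client, config


def wait_for_nodes_ready(timeout_s: int = 600, interval_s: int = 10) -> None:
    """Poll the cluster until every node reports the Ready condition."""
    config.load_kube_config()
    core_api = client.CoreV1Api()

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        nodes = core_api.list_node().items
        not_ready = [
            node.metadata.name
            for node in nodes
            if not any(c.type == "Ready" and c.status == "True"
                       for c in (node.status.conditions or []))
        ]
        if nodes and not not_ready:
            print(f"All {len(nodes)} nodes are Ready")
            return
        print(f"Waiting on nodes: {not_ready or 'none registered yet'}")
        time.sleep(interval_s)

    raise TimeoutError("Management cluster did not become ready in time")


if __name__ == "__main__":
    wait_for_nodes_ready()
```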

3) Implemented tenant lifecycle management
The team used Cluster API to automate how tenant clusters were created, configured, scaled, and removed. Each tenant automatically received its own compute, network, and storage limits during setup. When a tenant was deleted, cleanup scripts released and reused the freed resources. The process was tested under multiple simultaneous deployments to ensure the system stayed stable and efficient.
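
Cluster API represents each tenant cluster as custom resources inside the management cluster, so lifecycle operations reduce to creating and deleting those objects. A simplified sketch with hypothetical names, omitting the infrastructure- and control-plane-specific resources a real Cluster API setup also requires:

```python
from kubernetes import client, config

config.load_kube_config()
custom_api = client.CustomObjectsApi()

CAPI_GROUP, CAPI_VERSION, CAPI_PLURAL = "cluster.x-k8s.io", "v1beta1", "clusters"


def create_tenant_cluster(tenant: str) -> None:
    """Register a new Cluster object; Cluster API controllers do the actual provisioning."""
    cluster = {
        "apiVersion": f"{CAPI_GROUP}/{CAPI_VERSION}",
        "kind": "Cluster",
        "metadata": {"name": f"{tenant}-cluster", "namespace": tenant},
        "spec": {
            "clusterNetwork": {"pods": {"cidrBlocks": ["10.244.0.0/16"]}},
            # In a real setup, spec.infrastructureRef and spec.controlPlaneRef would
            # point at provider-specific resources (e.g. KubeVirt-backed machines).
        },
    }
    custom_api.create_namespaced_custom_object(
        group=CAPI_GROUP, version=CAPI_VERSION,
        namespace=tenant, plural=CAPI_PLURAL, body=cluster,
    )


def delete_tenant_cluster(tenant: str) -> None:
    """Remove the Cluster object; controllers tear down the machines and free the resources."""
    custom_api.delete_namespaced_custom_object(
        group=CAPI_GROUP, version=CAPI_VERSION,
        namespace=tenant, plural=CAPI_PLURAL, name=f"{tenant}-cluster",
    )
```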

4) Configured networking and isolation
The team established a secure and fully isolated network setup for each tenant, ensuring reliable communication and data protection across the platform. Key steps included:

  • Per-tenant network segmentation: Separate subnets, routing rules, and DNS zones created for each environment.
  • Strict access controls: Firewall and network policies preventing any cross-tenant traffic.
  • Secure connectivity: Managed internet egress and tenant-controlled ingress for external services.
  • Verification and testing: Functional and security tests confirming DNS, routing, and complete isolation from both other tenants and the management cluster.
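
Such verification can be scripted as well. The sketch below, with hypothetical pod and service names, execs into a test pod in one tenant and checks that external DNS resolves while a service in another tenant does not respond; it assumes the test image ships nslookup and wget.

```python
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
core_api = client.CoreV1Api()


def exec_in_pod(namespace: str, pod: str, command: list[str]) -> str:
    """Run a command inside a pod and return its combined output."""
    return stream(
        core_api.connect_get_namespaced_pod_exec,
        pod, namespace,
        command=command,
        stderr=True, stdin=False, stdout=True, tty=False,
    )


# External DNS should resolve from inside tenant-a (hypothetical test pod name).
print(exec_in_pod("tenant-a", "nettest", ["nslookup", "example.com"]))

# A service living in tenant-b must NOT be reachable from tenant-a; with the
# isolation policies in place this request is expected to fail or time out.
print(exec_in_pod(
    "tenant-a", "nettest",
    ["wget", "-qO-", "--timeout=5", "http://app.tenant-b.svc.cluster.local"],
))
```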

5) Deployed distributed storage layer
The team introduced a resilient, multi-node storage system to keep tenant data safe and accessible at all times.

  • Implemented a distributed backend integrated through Kubernetes CSI drivers.
  • Configured real-time replication and automatic failover to recover from node or disk outages.
  • Enabled dynamic volume provisioning to scale storage capacity as workloads grew.
  • Stress-tested the system under simulated hardware failures to verify stability and data integrity.
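
The failure drills from that list can be sketched as follows: cordon a node to simulate taking it out of service, evict its pods, and let the scheduler and the replicated storage bring the workloads back on healthy nodes. The node name is a placeholder.

```python
from kubernetes import client, config

config.load_kube_config()
core_api = client.CoreV1Api()


def simulate_node_failure(node_name: str) -> None:
    """Cordon a node and delete its pods so they reschedule onto healthy nodes."""
    # Mark the node unschedulable, as if it had failed or been pulled for maintenance.
    core_api.patch_node(node_name, {"spec": {"unschedulable": True}})

    # Evict everything that was running there; controllers recreate the pods
    # elsewhere and the replicated volumes reattach on the new nodes.
    pods = core_api.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}"
    ).items
    for pod in pods:
        core_api.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)


# Example drill against a hypothetical worker node.
simulate_node_failure("worker-03")
```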

6) Integrated storage into tenant clusters
The team enabled tenants to provision and manage persistent volumes directly through Kubernetes APIs, providing full compatibility with standard workflows. Each tenant cluster was configured with its own storage classes, quotas, and reclaim policies, defining how capacity was allocated, expanded, and released.

To ensure reliability, the storage layer maintained data persistence through workload restarts, cluster scaling, and lifecycle events such as upgrades or migration.
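
From a tenant's point of view, requesting storage is an ordinary PersistentVolumeClaim against one of the platform-provided storage classes. A minimal sketch, reusing the hypothetical tenant-replicated class from the earlier storage example:

```python
from kubernetes import client, config

config.load_kube_config()
core_api = client.CoreV1Api()

# Ordinary PVC against the platform-provided class; size and names are examples.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="app-data", namespace="tenant-a"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="tenant-replicated",
        resources=client.V1ResourceRequirements(requests={"storage": "50Gi"}),
    ),
)

core_api.create_namespaced_persistent_volume_claim(namespace="tenant-a", body=pvc)
```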

7) Developed API and web interface
The team built a REST API that handled all tenant operations — creating, scaling, pausing, and deleting clusters — by connecting directly to Kubernetes. On top of it, they developed a web dashboard where clients could manage their environments through a simple self-service interface with live status, resource usage, and activity logs. Authentication and role-based access ensured that each user could securely access only their own tenant resources.

Diagram: Automation and Control Flow
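
The tenant-scoped access control described in this step maps naturally onto Kubernetes RBAC. A minimal sketch with hypothetical names, granting one tenant's user group full rights inside its own namespace and nothing anywhere else:

```python
from kubernetes import client, config

config.load_kube_config()
rbac_api = client.RbacAuthorizationV1Api()

# Role: full access to common resources, but only inside the tenant namespace.
role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "tenant-admin", "namespace": "tenant-a"},
    "rules": [{
        "apiGroups": ["", "apps", "kubevirt.io"],
        "resources": ["*"],
        "verbs": ["*"],
    }],
}

# RoleBinding: attach the role to the tenant's user group (group name is a placeholder).
binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "tenant-admin-binding", "namespace": "tenant-a"},
    "subjects": [{"kind": "Group", "name": "tenant-a-admins",
                  "apiGroup": "rbac.authorization.k8s.io"}],
    "roleRef": {"kind": "Role", "name": "tenant-admin",
                "apiGroup": "rbac.authorization.k8s.io"},
}

rbac_api.create_namespaced_role(namespace="tenant-a", body=role)
rbac_api.create_namespaced_role_binding(namespace="tenant-a", body=binding)
```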

8) Performed end-to-end testing
The team verified that the platform operated reliably under real conditions. Tests confirmed correct automation for cluster creation, scaling, and teardown, as well as effective auto-healing, data replication, and recovery during simulated failures. Network isolation was validated across tenants, and performance tests measured provisioning speed, recovery time, and stability under load. All results were documented to support future optimization.

Impact

  • Cost efficiency: Reduced infrastructure and licensing expenses by over 60% through open-source technologies and in-house automation.
  • Lean operations: Lowered the maintenance workload to a two-person DevOps team without losing reliability.
  • Faster provisioning: Cut environment deployment time from several hours to under 15 minutes with full automation.
  • Better resource utilization: Improved capacity efficiency by 30–40% through automated scaling and cleanup logic.
  • High availability: Achieved 99.9% uptime and reduced downtime incidents by over 70% using built-in replication and recovery.
  • Scalability: Enabled seamless onboarding of new enterprise tenants with minimal manual effort.
