<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: InfraCloud Technologies</title>
    <description>The latest articles on DEV Community by InfraCloud Technologies (@infracloud).</description>
    <link>https://dev.to/infracloud</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2341%2Fce0cd413-808c-4464-bb0a-ffa4ca2b8558.png</url>
      <title>DEV Community: InfraCloud Technologies</title>
      <link>https://dev.to/infracloud</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/infracloud"/>
    <language>en</language>
    <item>
      <title>Batch Scheduling on Kubernetes: Comparing Apache YuniKorn, Volcano.sh, and Kueue</title>
      <dc:creator>Rahul Kadam</dc:creator>
      <pubDate>Mon, 26 May 2025 12:39:49 +0000</pubDate>
      <link>https://dev.to/infracloud/batch-scheduling-on-kubernetes-comparing-apache-yunikorn-volcanosh-and-kueue-5b1g</link>
      <guid>https://dev.to/infracloud/batch-scheduling-on-kubernetes-comparing-apache-yunikorn-volcanosh-and-kueue-5b1g</guid>
      <description>&lt;p&gt;Batch processing plays a vital role in many modern systems, especially for data processing, machine learning training, ETL, etc. Kubernetes, while historically designed for long-running services, has expanded its capabilities to support these batch workloads. With the help of specialized tools, Kubernetes has become a robust platform for handling resource-intensive and time-sensitive tasks.&lt;/p&gt;

&lt;p&gt;In this blog post, we will dive into batch scheduling on Kubernetes, the challenges it entails, and compare three powerful open-source tools - Apache YuniKorn, Volcano.sh, and Kueue - that cater to batch scheduling needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to batch processing on Kubernetes
&lt;/h2&gt;

&lt;p&gt;Batch processing refers to executing a series of tasks or jobs without requiring immediate manual interaction. Jobs like data transformation, building software in CI/CD workflows, or running AI/ML training typically fall under batch processing. Unlike traditional workloads, such as APIs or databases that run continuously and focus on uptime, batch workloads are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finite in nature, running for a fixed duration, and then terminating.&lt;/li&gt;
&lt;li&gt;Resource-heavy, often requiring large, bursty allocations of CPU, memory, or GPUs.&lt;/li&gt;
&lt;li&gt;Dependency-driven, with tasks needing to execute in a specific order or framework, like parallel execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With its evolving ecosystem, Kubernetes has become the go-to orchestrator for running batch jobs. However, the default Kubernetes scheduler, designed for general-purpose workloads, struggles with certain nuances of batch processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges in batch scheduling on Kubernetes
&lt;/h2&gt;

&lt;p&gt;Batch scheduling has some fundamentally different requirements compared to normal workloads. A good batch scheduling tool is designed to mitigate these challenges effectively. Here’s a look at common challenges in batch scheduling and how robust tools can resolve them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource contention&lt;/strong&gt;&lt;br&gt;
Batch workloads often compete for limited resources like GPUs, which can lead to inefficiencies or deadlocks. For example, in a multi-tenant AI/ML environment with shared resources, if Team A submits a low-priority job before Team B’s high-priority job, the low-priority job can block the high-priority one simply because it arrived first: the default scheduler works on a first-come, first-served basis, without considering priorities. To address this, tools should implement fair scheduling and resource quotas to ensure resources are shared proportionally among teams, preventing any single workload from monopolizing resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gang scheduling&lt;/strong&gt;&lt;br&gt;
Distributed tasks like ML model training require all associated pods to start simultaneously. Without coordination, resource wastage becomes inevitable. A good tool provides gang scheduling, ensuring simultaneous resource allocation for all tasks in a job to avoid partial execution.&lt;/p&gt;
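
&lt;p&gt;As a sketch of the idea, gang scheduling is typically expressed as a minimum group size: the scheduler places either all members of the group or none. Volcano (covered later in this post) models this with a PodGroup resource; the values below are illustrative and worth checking against the Volcano documentation for your version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: ml-training-group
spec:
  minMember: 4          # place no pods of this group until all 4 fit
  minResources:         # optional: reserve a floor of resources for the gang
    cpu: "8"
    memory: 16Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;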

&lt;p&gt;&lt;strong&gt;Dependency handling&lt;/strong&gt;&lt;br&gt;
Many batch workflows consist of interdependent tasks where one cannot begin until another completes. For example, in ETL jobs, the extraction step must complete before the transformation can run; in CI/CD pipelines, the build must happen before the deployment job is triggered. Tools that support workflow dependencies ensure seamless execution by automatically scheduling tasks in the correct order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Job prioritization&lt;/strong&gt;&lt;br&gt;
Not all jobs have the same urgency. A job running critical functionality should be able to preempt a regular, non-critical job to meet business deadlines; to achieve this, we need a way to tag jobs by priority. Effective tools enable priority-based scheduling and preemption, dynamically reallocating resources to accommodate high-priority workloads without delay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;br&gt;
Kubernetes clusters handling hundreds or thousands of batch jobs require scalable solutions. Batch scheduling tools must efficiently manage large-scale workloads, optimizing cluster-wide resources without breaking under pressure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-tenancy&lt;/strong&gt;&lt;br&gt;
Shared environments introduce complexities where multiple teams or projects need equitable resource access. Tools with multi-tenancy support ensure fair resource distribution, often through hierarchical queues, so that workloads from different users coexist peacefully. For example, a company with a shared Kubernetes cluster of 100 GPUs may need to divide it among teams like NLP, data science, and analytics, with the condition that 50% of the GPUs are reserved for the NLP team.&lt;/p&gt;

&lt;p&gt;By bridging these gaps, specialized tools enhance the capacity of Kubernetes to execute batch workloads effectively, reducing inefficiencies and improving overall cluster health.&lt;/p&gt;
&lt;h2&gt;
  
  
  Available tools for batch scheduling on Kubernetes
&lt;/h2&gt;

&lt;p&gt;Several open-source tools aim to bridge the gap left by Kubernetes’ default scheduler. Let’s examine three key players: Apache YuniKorn, Volcano, and Kueue.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Apache YuniKorn: Supports both batch and non-batch workloads
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvz3ut0luga8q1w6kgh7r.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvz3ut0luga8q1w6kgh7r.webp" alt="[Image source](https://yunikorn.apache.org/docs/1.3.0/design/architecture/)"&gt;&lt;/a&gt;&lt;br&gt;
[&lt;a href="https://yunikorn.apache.org/docs/1.3.0/design/architecture/" rel="noopener noreferrer"&gt;Image source&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://yunikorn.apache.org/docs/" rel="noopener noreferrer"&gt;Apache YuniKorn&lt;/a&gt; is a universal scheduler for Kubernetes and other platforms, focusing on multi-tenant resource sharing. It is designed to replace the default Kubernetes scheduler and seamlessly supports batch and non-batch workloads.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; To get started with Apache YuniKorn, explore the &lt;a href="https://yunikorn.apache.org/docs/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt; Unified scheduling for batch and service workloads, hierarchical queues, and fairness policies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Multi-tenant clusters requiring hierarchical resource sharing and fairness. The example below shows how Apache YuniKorn sets up hierarchical queues for resource management in a multi-tenant environment, with guaranteed and maximum resources per queue to enable fair scheduling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: ConfigMap
metadata:
 name: yunikorn-configs
 namespace: yunikorn
apiVersion: v1
data:
 admissionController.accessControl.externalGroups: "admin,^group-$"
 queues.yaml: |
   partitions:
   - name: default
     queues:
       - name: root
         queues:
         - name: system
           adminacl: " admin"
           resources:
             guaranteed:
               {memory: 2G, vcore: 2}
             max:
               {memory: 6G, vcore: 6}
         - name: tenants
           resources:
             guaranteed:
               {memory: 2G, vcore: 2}
             max:
               {memory: 4G, vcore: 8}
           queues:
             - name: group-a
               adminacl: " group-a"
               resources:
                 guaranteed:
                   {memory: 1G, vcore: 1}
                 max:
                   {memory: 2G, vcore: 4}
             - name: group-b
               adminacl: " group-b"
               resources:
                 guaranteed:
                   {memory: 1G, vcore: 1}
                 max:
                   {memory: 2G, vcore: 4}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Volcano.sh: Best for high-performance workloads
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pkniai8ay75qmdku1g1.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pkniai8ay75qmdku1g1.webp" alt="[Image source](https://volcano.sh/en/docs/architecture/)"&gt;&lt;/a&gt;&lt;br&gt;
[&lt;a href="https://volcano.sh/en/docs/architecture/" rel="noopener noreferrer"&gt;Image source&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://volcano.sh/en/docs/" rel="noopener noreferrer"&gt;Volcano&lt;/a&gt; is a batch scheduling system designed for high-performance workloads like AI/ML, deep learning, and big data processing. Volcano supports popular computing frameworks such as Spark, TensorFlow, PyTorch, Flink, Argo, MindSpore, Ray, and PaddlePaddle. Unlike Apache Yunikorn, Volcano works alongside Kubernetes’ default scheduler.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; Learn more about using Volcano in its &lt;a href="https://volcano.sh/en/docs/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt; Gang scheduling, resource bin packing, and managing job dependencies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Resource-intensive workloads requiring advanced scheduling policies. The capabilities of Volcano make it popular in AI and big data applications. For running machine learning workloads, teams often need to schedule hundreds of GPU-intensive training jobs that compete for limited GPU resources.&lt;/p&gt;

&lt;p&gt;Standard Kubernetes schedulers operate at the pod level, leading to resource fragmentation and inefficient GPU usage. Volcano solves this by introducing &lt;strong&gt;gang scheduling&lt;/strong&gt; - ensuring that either all pods in a job are scheduled simultaneously, or none are. This avoids partial resource allocation, which is critical for parallel jobs like distributed training (e.g., TensorFlow, PyTorch). The code snippet below shows how Volcano can help with it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-training
spec:
  minAvailable: 4  # Gang scheduling: all 4 pods must be scheduled together
  schedulerName: volcano
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          containers:
            - image: tensorflow/tensorflow:latest
              name: tensorflow-worker
          restartPolicy: Never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Kueue: Kubernetes-native job queueing
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bq4wbd1sy47irysbhhi.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bq4wbd1sy47irysbhhi.webp" alt="[Image source](https://kueue.sigs.k8s.io/docs/overview/)"&gt;&lt;/a&gt;&lt;br&gt;
[&lt;a href="https://kueue.sigs.k8s.io/docs/overview/" rel="noopener noreferrer"&gt;Image source&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kueue.sigs.k8s.io/docs/overview/" rel="noopener noreferrer"&gt;Kueue&lt;/a&gt; is a Kubernetes-native job queueing system built for batch workloads. It focuses on managing job queues, resource quotas, and priorities while complementing the Kubernetes ecosystem.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; Learn more about Kueue from the &lt;a href="https://kueue.sigs.k8s.io/docs/overview/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt; Native Kubernetes integration, resource reservation, and priority-based scheduling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/xMmskWIlktA"&gt;
  &lt;/iframe&gt;
&lt;br&gt;
&lt;em&gt;Video description: With the skyrocketing demand for GPUs and problems with obtaining the hardware in requested quantities in desired locations, the need for multicluster batch jobs is stronger than ever. During this talk, Ricardo Rocha &amp;amp; Marcin Wielgus show how you can automatically find the needed capacity across multiple clusters, regions, or clouds, dispatch the jobs there, and monitor their status.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Kueue provides the simplest Kubernetes-based batch job orchestration with resource quotas. In multi-tenant Kubernetes clusters, organizations often deploy mixed workloads - like ML training jobs, simulations, and large batch ETL pipelines - but they may want maximum native Kubernetes compatibility without introducing a new scheduler. Kueue stands out by working with the default Kubernetes scheduler rather than replacing it. Instead of scheduling pods directly, Kueue manages suspending, queuing, and admitting entire jobs via Kubernetes-native fields (spec.suspend), letting the default scheduler handle the pod placement once admitted.&lt;/p&gt;

&lt;p&gt;Compared to Volcano (which uses a custom scheduler) and Apache YuniKorn (which fully replaces the Kubernetes scheduler), Kueue’s “job-first, pod-native” design allows clean, minimal, and incremental adoption. It is ideal for teams that cannot risk introducing a new scheduler binary into production clusters but still need advanced batch queuing behavior.&lt;/p&gt;

&lt;p&gt;Here is the code snippet showing how Kueue manages suspend/resume and admission operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: etl-batch-job
  labels:
    kueue.x-k8s.io/queue-name: batch-queue
spec:
  suspend: true  # Kueue will unsuspend once resources are available
  parallelism: 5
  completions: 5
  template:
    spec:
      containers:
      - name: etl
        image: busybox
        command: ["sh", "-c", "echo Processing; sleep 30"]
      restartPolicy: Never
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: batch-queue
spec:
  clusterQueue: cluster-queue  # the ClusterQueue holding the admission quota
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
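
&lt;p&gt;The LocalQueue only points jobs at a quota; the quota itself lives in a cluster-scoped ClusterQueue backed by a ResourceFlavor. A minimal, illustrative pairing (names and quota values are placeholders) could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}    # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 10
      - name: memory
        nominalQuota: 32Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;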



&lt;h2&gt;
  
  
  Feature comparison: Apache YuniKorn vs Volcano.sh vs Kueue
&lt;/h2&gt;

&lt;p&gt;The following table compares Apache YuniKorn, Volcano.sh, and Kueue across key features:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Apache YuniKorn&lt;/th&gt;
&lt;th&gt;Volcano.sh&lt;/th&gt;
&lt;th&gt;Kueue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Batch Workload Support&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gang Scheduling&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fair Scheduling&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resource Quotas&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency Handling&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Tenancy Support&lt;/td&gt;
&lt;td&gt;✅ (Strong Support)&lt;/td&gt;
&lt;td&gt;⚠️ Limited&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ease of Use&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Steeper Learning Curve&lt;/td&gt;
&lt;td&gt;Easy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration with K8s&lt;/td&gt;
&lt;td&gt;Standard APIs&lt;/td&gt;
&lt;td&gt;Custom APIs/Annotations&lt;/td&gt;
&lt;td&gt;Native Integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Preemption Support&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replaces Default Scheduler&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌ (Works alongside)&lt;/td&gt;
&lt;td&gt;❌ (Works alongside)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Selecting the right batch processing tool
&lt;/h2&gt;

&lt;p&gt;Choosing the right batch scheduling tool for Kubernetes depends on the specific requirements of your workloads and organization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache YuniKorn is ideal for environments requiring a universal scheduler that replaces Kubernetes’ default scheduler, offering excellent multi-tenancy and hierarchical fairness for both batch and service workloads.&lt;/li&gt;
&lt;li&gt;Volcano.sh shines in high-performance environments like AI/ML and deep learning, thanks to its advanced features like gang scheduling and job dependency management.&lt;/li&gt;
&lt;li&gt;Kueue is perfect for Kubernetes-native setups, providing robust job queueing, priority scheduling, and resource quotas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By aligning these tools with your workload demands — be it fairness, scalability, or advanced scheduling policies — you can leverage Kubernetes to its full potential for batch processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final words
&lt;/h2&gt;

&lt;p&gt;In this blog, we explored and compared different tools for batch processing in Kubernetes — YuniKorn, Volcano, and Kueue. Each has its strengths. While these tools can significantly streamline batch processing, implementing them effectively can sometimes be challenging.&lt;/p&gt;

&lt;p&gt;If you’re unsure which tool best fits your environment or need help setting things up, please reach out to our &lt;a href="https://www.infracloud.io/kubernetes-consulting-partner/" rel="noopener noreferrer"&gt;Kubernetes experts&lt;/a&gt;. We’re here to help you get the most out of your batch processing setup. To have a discussion about this blog post or to ask any questions, please find me on &lt;a href="https://www.linkedin.com/in/rahul-kadam-864a77a2/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>batchscheduling</category>
      <category>volcano</category>
      <category>apacheyunikorn</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Istio Consulting and Enterprise Implementation Partner</title>
      <dc:creator>InfraCloud</dc:creator>
      <pubDate>Thu, 03 Apr 2025 10:32:57 +0000</pubDate>
      <link>https://dev.to/infracloud/istio-consulting-and-enterprise-implementation-partner-ph0</link>
      <guid>https://dev.to/infracloud/istio-consulting-and-enterprise-implementation-partner-ph0</guid>
      <description>&lt;p&gt;According to a CNCF survey, 47% of tech leaders struggle with a lack of expert service mesh engineers when adopting a service mesh. If you're struggling with Istio setup or management, you're not alone. Istio can be tricky, with challenges like complicated setups, scaling issues, security risks, and hard-to-solve problems. These can lead to downtime and frustration.&lt;/p&gt;

&lt;p&gt;With InfraCloud's Istio consultation and implementation services, enterprises, growing startups, and Fortune 500 companies gain access to expert guidance to streamline deployment, resolve challenges efficiently, optimize performance, and ensure your service mesh operates seamlessly and securely.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How can InfraCloud help you with Istio Consulting &amp;amp; Implementation?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;InfraCloud has been officially certified as an Istio Professional Service Provider. With our team of experts by your side, you can confidently deploy, manage, and optimize Istio for your infrastructure needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Istio Consulting &amp;amp; Advisory&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Adopting a service mesh like Istio can significantly improve visibility, security, and traffic management in your microservices architecture. However, its complexity and the need for proper integration often make implementation challenging. InfraCloud’s Istio specialists can help simplify this process. We start by assessing your current architecture, infrastructure, and services to identify the best implementation strategy. Based on this assessment, Istio mesh experts from InfraCloud provide detailed recommendations, a tailored adoption roadmap, and a deployment plan to ensure a smooth transition. With our service mesh expertise, you can confidently leverage Istio to optimize your microservices environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Implement Production Grade Istio&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;InfraCloud’s service mesh experts can guide you in implementing a production-grade Istio setup built on industry standards and best practices. With our Istio consultants, you get seamless deployment across diverse infrastructures like Kubernetes, VMs, and cloud environments. We implement deployment patterns tailored to your needs, whether it’s a single-cluster or multi-cluster mesh, and configure Active-Active or Active-Passive failover strategies to achieve zero downtime. With InfraCloud’s networking engineers' guidance, you can confidently deploy the Istio service mesh to enhance your microservices architecture with resilience and scalability.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Istio Configuration &amp;amp; Integration with Observability Tools&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With our Istio experts, you can configure your services for advanced traffic routing and application-level features, creating a highly reliable and fault-tolerant platform. We set up service-level properties like circuit breaking, timeouts, retries, and fault injection to enhance fault tolerance. Our Istio service mesh specialists also configure traffic routing with deployment patterns such as Blue-Green, Canary deployments, A/B testing, and staged rollouts. Additionally, InfraCloud Istio experts integrate observability tools like Prometheus, Grafana, Jaeger/Zipkin, and Kiali to provide a single pane of glass for performance metrics, empowering you to monitor and optimize your services effortlessly.&lt;/p&gt;
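
&lt;p&gt;As an illustration of the canary pattern mentioned above, an Istio VirtualService can split traffic between two versions by weight (the service and subset names below are hypothetical, and the subsets would be defined in a matching DestinationRule):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-canary
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90        # 90% of traffic stays on the stable version
    - destination:
        host: reviews
        subset: v2
      weight: 10        # 10% goes to the canary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;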

&lt;h3&gt;
  
  
  &lt;strong&gt;Implement Security Across Services with Istio&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With InfraCloud’s team of consultants, you can have end-to-end security for your infrastructure and applications. We help you implement zero-trust security models and encrypted service-to-service communication using mTLS. Our Istio service mesh experts conduct regular security audits to maintain compliance with the latest standards and integrate authentication, authorization, and audit tools to safeguard your services and data. Additionally, we configure your existing identity provider tools with Istio for seamless, integrated security, providing you with a robust and reliable security framework for your microservices environment.&lt;/p&gt;
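
&lt;p&gt;For instance, enforcing mTLS across the mesh can be as small as a single PeerAuthentication policy; this is a sketch of the common mesh-wide configuration, not a drop-in production setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # applying it in the root namespace makes it mesh-wide
spec:
  mtls:
    mode: STRICT            # accept only mutually-authenticated TLS traffic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;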

&lt;h3&gt;
  
  
  &lt;strong&gt;Enterprise Istio Support &amp;amp; Training&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;InfraCloud offers comprehensive Istio support through a team of experienced experts and engineers to manage your day-to-day operations and provide ongoing assistance. Istio services include support for regular maintenance, patches, and version upgrades, ensuring your Istio deployment remains up-to-date and reliable. We also provide mentorship and training to help you build in-house Istio expertise. Additionally, our team conducts regular architecture reviews and helps establish a knowledge base for easy reference. Whether you’re starting with Istio or need help managing existing operations, InfraCloud is here to support you at every step of your Istio adoption journey.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Looking for Professional Istio Support?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Running Istio can be tough, with challenges like complicated setups, scaling problems, security risks, and troubleshooting difficulties. These issues can cause downtime and frustration. With InfraCloud's Istio support, you get expert assistance to fix problems fast, improve performance, and keep your service mesh running smoothly and securely.&lt;/p&gt;

&lt;p&gt;Feel free to get in touch with our &lt;a href="https://www.infracloud.io/istio-consulting/" rel="noopener noreferrer"&gt;Istio consulting team&lt;/a&gt; for a no-obligation chat about your Istio queries.&lt;/p&gt;

</description>
      <category>istio</category>
      <category>istioconsulting</category>
    </item>
    <item>
      <title>Metrics at a Glance for Production Clusters</title>
      <dc:creator>Ruturaj Kadikar</dc:creator>
      <pubDate>Mon, 31 Mar 2025 14:18:50 +0000</pubDate>
      <link>https://dev.to/infracloud/metrics-at-a-glance-for-production-clusters-5e35</link>
      <guid>https://dev.to/infracloud/metrics-at-a-glance-for-production-clusters-5e35</guid>
      <description>&lt;p&gt;Keeping a close eye on your production clusters is not just good practice—it’s essential for survival. Whether you’re managing applications at scale or ensuring robust service delivery, understanding the vital signs of your clusters through metrics is like having a dashboard in a race car, giving you real-time insights and foresight into performance bottlenecks, resource usage and the operational health of your car.&lt;/p&gt;

&lt;p&gt;However, a lot happens in any cluster. There are so many metrics to track that the sheer volume of observability data you collect can itself become an obstacle to seeing what is actually happening. That’s why you should collect only the important metrics: the ones that offer a complete picture of your cluster’s health without overwhelming you.&lt;/p&gt;

&lt;p&gt;In this blog post, we will cut through the complexity and spotlight the essential metrics you need on your radar to quickly detect and address issues as they arise. From CPU usage to network throughput, we’ll break down each metric, show you how to monitor them effectively and provide the queries that get you the data you need. Before we dive into the specifics of which metrics to monitor, let's understand the foundational monitoring principles that guide our approach. We'll explore the RED and USE methods along with the Four Golden Signals, providing a robust framework for what to measure and why it matters in maintaining the health of your production clusters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Principles
&lt;/h2&gt;

&lt;p&gt;Effective monitoring is the cornerstone of maintaining the health and performance of your production clusters. It helps you catch issues early, optimize resource usage, and ensure that your systems are running smoothly. In this section, we introduce two essential monitoring frameworks — USE and RED — and the Four Golden Signals. These principles provide a structured approach to monitoring, making it easier to interpret vast amounts of data and identify critical performance metrics. By understanding and applying these principles, you can transform raw data into actionable insights that keep your systems in top shape.&lt;/p&gt;

&lt;h3&gt;
  
  
  RED and USE method
&lt;/h3&gt;

&lt;p&gt;In modern systems, keeping track of numerous metrics can be overwhelming, especially when troubleshooting or simply checking for issues. To make this easier, you can use two helpful acronyms: &lt;a href="https://orangematter.solarwinds.com/2017/10/05/monitoring-and-observability-with-use-and-red/" rel="noopener noreferrer"&gt;USE and RED&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.brendangregg.com/usemethod.html" rel="noopener noreferrer"&gt;The USE Method (Utilization, Saturation, Errors) was introduced by Brendan Gregg,&lt;/a&gt; a renowned performance engineer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilization: Measures how busy your resources are.&lt;/li&gt;
&lt;li&gt;Saturation: Shows how much backlog or congestion there is.&lt;/li&gt;
&lt;li&gt;Errors: Counts the number of error events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://grafana.com/blog/2018/08/02/the-red-method-how-to-instrument-your-services" rel="noopener noreferrer"&gt;The RED Method was introduced by Tom Wilkie&lt;/a&gt;. Drawing from his experiences at Google, Wilkie developed this methodology to focus on three key metrics for monitoring microservices (Rate, Errors, and Duration):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rate: Measures the request throughput.&lt;/li&gt;
&lt;li&gt;Errors: Tracks the error rates.&lt;/li&gt;
&lt;li&gt;Duration: Measures how long requests take to be processed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The USE method focuses on resource performance from an internal perspective, while the RED method looks at request performance from an external, workload-focused perspective. Together, they give you a comprehensive view of system health by covering both resource usage and workload behavior. By using these standard performance metrics, USE and RED provide a solid foundation for monitoring and diagnosing issues in complex systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Four Golden Signals
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;&lt;a href="https://sre.google/sre-book/monitoring-distributed-systems/" rel="noopener noreferrer"&gt;Four Golden Signals&lt;/a&gt;&lt;/strong&gt; — &lt;strong&gt;Latency&lt;/strong&gt;, &lt;strong&gt;Traffic&lt;/strong&gt;, &lt;strong&gt;Errors&lt;/strong&gt;, and &lt;strong&gt;Saturation&lt;/strong&gt; — are foundational metrics introduced in Google's Site Reliability Engineering (SRE) practices to monitor system performance and reliability. According to this method, dashboards should address all the fundamental questions about your service. These signals are essential for understanding system performance and should be prioritized when selecting metrics to monitor.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency: Refers to the time taken to handle a request, distinguishing between successful and failed requests.&lt;/li&gt;
&lt;li&gt;Traffic: Measures the demand placed on the system, typically quantified by metrics like HTTP requests per second or network I/O rate.&lt;/li&gt;
&lt;li&gt;Errors: Represents the rate of failed requests, including explicit errors like HTTP 500s and implicit errors like incorrect content responses.&lt;/li&gt;
&lt;li&gt;Saturation: Indicates how "full" the service is, emphasizing the most constrained resources and predicting impending saturation for proactive maintenance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7q9d1oib1ruqggn43dh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7q9d1oib1ruqggn43dh.png" alt="Four Golden Signals" width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By monitoring these four golden signals and promptly alerting administrators or support engineers when issues arise, your cluster will benefit from comprehensive monitoring coverage, ensuring reliability and performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Four Golden Signals for Comprehensive Monitoring
&lt;/h2&gt;

&lt;p&gt;If you're managing a production Kubernetes cluster, you know the importance of staying on top of your monitoring game. We’re here to simplify your monitoring approach by integrating the RED and USE methods with Google's Four Golden Signals, enabling comprehensive monitoring from a single dashboard. This approach allows you to swiftly spot and address issues, ensuring your cluster operates smoothly without the hassle of jumping between multiple dashboards. To get started, you can download the &lt;a href="https://grafana.com/grafana/dashboards/21073-monitoring-golden-signals/" rel="noopener noreferrer"&gt;Monitoring Golden Signals for Kubernetes Grafana dashboard&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c3jqs1hxv5wapcu28ko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7c3jqs1hxv5wapcu28ko.png" alt="Monitoring Golden Signals for Kubernetes Grafana dashboard" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s jump into each golden signal to understand what metrics should be monitored to track them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traffic: What is it and how to monitor it?
&lt;/h3&gt;

&lt;p&gt;If we think of Kubernetes as a city's road system, then pods are the cars, nodes are the streets, and services are the traffic lights that manage the flow. Monitoring in Kubernetes, then, is like using traffic cameras and sensors at intersections to keep everything moving smoothly and avoid traffic jams.&lt;/p&gt;

&lt;p&gt;Network I/O is like the main roads that handle cars coming into and going out of the city. If these roads are too busy, it slows everything down. The API server is like a Regional Transport Office (RTO), regulating and overseeing all operations within the cluster, much like traffic and vehicle management in a region. Monitoring traffic to external services such as databases is also important, similar to watching vehicles travel to other cities. You can use tools like the &lt;a href="https://www.infracloud.io/blogs/monitoring-endpoints-kubernetes-blackbox-exporter/" rel="noopener noreferrer"&gt;blackbox exporter to keep an eye on traffic leaving Kubernetes&lt;/a&gt;. The key, as with a road network, is to pinpoint the junctions where traffic flow should be measured.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff30aefts9o0it4pyca7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff30aefts9o0it4pyca7v.png" alt="How to monitor traffic" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below, we outline the primary metrics for monitoring traffic.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;br&gt;(Dashboard Title)&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why Monitor this Metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ingress Traffic (Istio)&lt;/td&gt;
&lt;td&gt;istio_requests_total&lt;/td&gt;
&lt;td&gt;Tracks the total number of requests handled by Istio, essential for understanding ingress controller load and overall health.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Server Traffic&lt;/td&gt;
&lt;td&gt;apiserver_request_total&lt;/td&gt;
&lt;td&gt;Measures the number of API server requests, which helps in monitoring control plane load and identifying potential bottlenecks.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Server Traffic&lt;/td&gt;
&lt;td&gt;workqueue_adds_total&lt;/td&gt;
&lt;td&gt;Indicates the total number of items added to work queues, helping identify workload spikes and manage resource allocation effectively.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node Traffic&lt;/td&gt;
&lt;td&gt;node_network_receive_bytes_total&lt;br&gt;node_network_transmit_bytes_total&lt;/td&gt;
&lt;td&gt;Monitors data received/transmitted by nodes, which is crucial for identifying and addressing network capacity issues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node Traffic&lt;/td&gt;
&lt;td&gt;node_network_receive_packets_total&lt;br&gt;node_network_transmit_packets_total&lt;/td&gt;
&lt;td&gt;Monitors the number of packets received/transmitted, important for analyzing network traffic, identifying issues, and maintaining robust network communication.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workload Traffic&lt;/td&gt;
&lt;td&gt;container_network_receive_bytes_total&lt;br&gt;container_network_transmit_bytes_total&lt;/td&gt;
&lt;td&gt;Vital for monitoring the amount of network traffic received/transmitted by containers, ensuring proper traffic handling and performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage Operations&lt;/td&gt;
&lt;td&gt;storage_operation_duration_seconds_bucket&lt;/td&gt;
&lt;td&gt;Provides insights into storage operation performance, helping diagnose and address slow disk access issues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CoreDNS Requests/s&lt;/td&gt;
&lt;td&gt;coredns_dns_requests_total&lt;/td&gt;
&lt;td&gt;Monitors the number of DNS queries handled by CoreDNS, ensuring reliable service discovery and network performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
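&lt;p&gt;To make the traffic panels above concrete, here is a minimal sketch (in plain Python rather than PromQL) of how a dashboard turns a monotonically increasing counter such as &lt;code&gt;apiserver_request_total&lt;/code&gt; into a requests-per-second figure; the scrape values are made up:&lt;/p&gt;

```python
# Minimal sketch of how a rate panel treats a monotonically increasing
# counter such as apiserver_request_total: take two scrapes and divide
# the increase by the elapsed time. Sample values are illustrative only.

def per_second_rate(samples):
    """samples: list of (timestamp_seconds, counter_value) tuples."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two scrapes over a positive interval")
    # Counters only go up (barring resets), so the increase is v1 - v0.
    return (v1 - v0) / (t1 - t0)

# Two scrapes 30 seconds apart: the counter grew by 600 requests.
scrapes = [(0, 1_000), (30, 1_600)]
print(per_second_rate(scrapes))  # 20.0 requests/second
```

&lt;p&gt;In PromQL this is what the &lt;code&gt;rate()&lt;/code&gt; function computes over a counter, with the added benefit that it handles counter resets for you.&lt;/p&gt;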

&lt;h3&gt;
  
  
  Latency: What is it and how to monitor it?
&lt;/h3&gt;

&lt;p&gt;To understand latency in Kubernetes, let's return to the city traffic analogy. Latency in Kubernetes is like delays in a city’s traffic system, where slowdowns at various points affect overall efficiency. If a major road is under construction or blocked due to an accident (slow data processing), cars must take detours, increasing travel time. Similarly, when a microservice is overloaded, requests pile up, causing system-wide slowdowns.&lt;/p&gt;

&lt;p&gt;Traffic lights that take too long to change (rate-limited APIs or overloaded queues) create long waiting lines, much like API call delays that hold up processing. Pod startup delays are like malfunctioning traffic signals: cars sit idle and congestion builds up, just as new pods that take too long to initialize slow down request handling.&lt;/p&gt;

&lt;p&gt;During rush hour congestion, roads get overwhelmed, making travel slower for everyone. In Kubernetes, when resources like CPU and memory are exhausted, requests are delayed, affecting performance. Likewise, a single-lane road with no passing option (sequential processing) forces cars to crawl behind slow-moving vehicles, just as inefficient sequential request handling slows down application performance.&lt;/p&gt;

&lt;p&gt;Just as city planners use traffic monitoring and smart infrastructure to optimize flow, engineers must track key latency metrics to prevent bottlenecks in Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F578qwxeu9t4jiw8illwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F578qwxeu9t4jiw8illwg.png" alt="How to monitor latency" width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component (Dashboard Title)&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why Monitor This Metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pod Start Duration&lt;/td&gt;
&lt;td&gt;kubelet_pod_start_duration_seconds_bucket&lt;/td&gt;
&lt;td&gt;Monitors time taken for pods to start, crucial for optimizing scaling and recovery processes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod Startup Latency&lt;/td&gt;
&lt;td&gt;kubelet_pod_worker_duration_seconds_bucket&lt;/td&gt;
&lt;td&gt;Tracks duration of pod operations, important for assessing pod management efficiency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ETCD Cache Duration 99th Quantile&lt;/td&gt;
&lt;td&gt;etcd_request_duration_seconds_bucket&lt;/td&gt;
&lt;td&gt;Essential for monitoring ETCD request processing latency, impacts overall cluster performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Server Request Duration 99th Quantile&lt;/td&gt;
&lt;td&gt;apiserver_request_duration_seconds_bucket&lt;/td&gt;
&lt;td&gt;Important for understanding API server response times, indicates control plane health.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Server Work Queue Latency&lt;/td&gt;
&lt;td&gt;workqueue_queue_duration_seconds_bucket&lt;/td&gt;
&lt;td&gt;Measures delays in API server work queues, vital for spotting potential processing issues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Server Work Queue Depth&lt;/td&gt;
&lt;td&gt;workqueue_depth&lt;/td&gt;
&lt;td&gt;Provides insight into API server queue load, critical for preventing system overloads.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CoreDNS Request Duration&lt;/td&gt;
&lt;td&gt;coredns_dns_request_duration_seconds_bucket&lt;/td&gt;
&lt;td&gt;Tracks CoreDNS DNS request processing times, key for efficient network resolution.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
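&lt;p&gt;Several of the latency metrics above end in &lt;code&gt;_bucket&lt;/code&gt;, meaning they are Prometheus histograms: cumulative counts of requests that completed within each upper bound. A quantile such as the 99th percentile is estimated by finding the bucket that contains the target rank and interpolating linearly within it. Here is a minimal Python sketch of that idea (bucket bounds and counts are illustrative only, and cumulative counts are assumed to be strictly increasing):&lt;/p&gt;

```python
# Sketch of the interpolation behind a histogram-quantile panel: given
# cumulative bucket counts from a *_duration_seconds_bucket metric,
# find the bucket containing the target rank and interpolate linearly.

def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound_seconds, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total  # the rank of the request we want, e.g. the 99th of 100
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within this bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests total; the 99th percentile falls in the 0.5s-1.0s bucket.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]
print(histogram_quantile(0.99, buckets))
```

&lt;p&gt;In practice you would let PromQL's &lt;code&gt;histogram_quantile()&lt;/code&gt; do this for you; the sketch only illustrates why the result is an estimate whose precision is bounded by the bucket layout.&lt;/p&gt;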

&lt;h3&gt;
  
  
  Errors: What are they and how to monitor them?
&lt;/h3&gt;

&lt;p&gt;Continuing our analogy, think of a Kubernetes cluster as a city’s road system: everything needs to move smoothly for the city to function well. But what happens when things go wrong?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CoreDNS crashes: It’s like traffic signals failing. Without proper directions, cars (data) can’t find their way, leading to confusion and delays.&lt;/li&gt;
&lt;li&gt;API Server goes down: This is like losing the central traffic control center. The entire system becomes unresponsive, and nothing moves.&lt;/li&gt;
&lt;li&gt;Pod failures: These are like car breakdowns. A few stalled cars won’t stop the whole city, but they slow down traffic in specific lanes (services).&lt;/li&gt;
&lt;li&gt;Node issues (like DiskPressure): Imagine a major road being closed. Cars (pods) have to reroute, leading to congestion and bottlenecks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just as traffic disruptions cause delays and frustration, Kubernetes failures impact SLAs, SLOs, and user experience. That’s why monitoring errors is like a real-time traffic control system. It detects problems early and helps keep everything running smoothly. The following metrics will help monitor the errors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzv596bfaesyv4nmmqi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzv596bfaesyv4nmmqi4.png" alt="How to monitor errors" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component (Dashboard Title)&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Why Monitor this Metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CoreDNS&lt;/td&gt;
&lt;td&gt;coredns_cache_misses_total&lt;/td&gt;
&lt;td&gt;Tracks cache misses in CoreDNS, important for identifying DNS resolution issues affecting cluster connectivity and performance.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Server Errors&lt;/td&gt;
&lt;td&gt;apiserver_request_total&lt;/td&gt;
&lt;td&gt;Monitors API server request errors, crucial for detecting and diagnosing failures in handling cluster management tasks.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nodes&lt;/td&gt;
&lt;td&gt;kube_node_spec_unschedulable&lt;/td&gt;
&lt;td&gt;Counts nodes that are unschedulable, essential for understanding cluster capacity and scheduling issues.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nodes&lt;/td&gt;
&lt;td&gt;kube_node_status_condition&lt;/td&gt;
&lt;td&gt;Tracks node conditions like 'OutOfDisk', 'DiskPressure', 'MemoryPressure', important for preemptive system health alerts.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubelet&lt;/td&gt;
&lt;td&gt;kubelet_runtime_operations_errors_total&lt;/td&gt;
&lt;td&gt;Measures error rates in kubelet operations, key for maintaining node and pod health.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workloads&lt;/td&gt;
&lt;td&gt;kube_pod_status_phase&lt;/td&gt;
&lt;td&gt;Monitors pods in failed states, critical for identifying failed workloads and ensuring reliability.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;node_network_receive_errs_total, node_network_transmit_errs_total&lt;/td&gt;
&lt;td&gt;Monitors network errors in data transmission and reception, vital for maintaining robust network communication.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
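&lt;p&gt;As a rough illustration of how an error panel is derived from a counter labelled by response code, such as &lt;code&gt;apiserver_request_total&lt;/code&gt;, the sketch below sums the 5xx series and divides by the overall total; the label sets and counts are made up:&lt;/p&gt;

```python
# Sketch of the error-ratio calculation a dashboard derives from
# apiserver_request_total, which is labelled by HTTP response code:
# sum the 5xx series and divide by the total. Sample data is made up.

samples = {
    # (verb, code) -> request count over the window
    ("GET", "200"): 940,
    ("POST", "201"): 30,
    ("GET", "500"): 20,
    ("POST", "503"): 10,
}

total = sum(samples.values())
errors = sum(v for (_, code), v in samples.items() if code.startswith("5"))
print(f"error ratio: {errors / total:.1%}")  # 30 of 1000 requests failed
```

&lt;p&gt;The same shape of calculation, a filtered sum over a labelled counter divided by the unfiltered sum, applies to kubelet runtime errors and network transmit/receive errors as well.&lt;/p&gt;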

&lt;h3&gt;
  
  
  Saturation: What is it and how to monitor it?
&lt;/h3&gt;

&lt;p&gt;To explain saturation, let's return to the city traffic analogy once more. CPU and memory utilization are akin to monitoring the flow of vehicles — too much traffic causes congestion, slowing down the city. Node resource exhaustion is similar to key intersections getting overwhelmed, which can halt traffic across the city. Network capacity matches the width and condition of roads; inadequate capacity leads to bottlenecks. Monitoring the top ten nodes and pods with the highest resource utilization is like tracking the busiest areas in the city to prevent and manage traffic jams more effectively. This approach ensures smooth operation and prevents system slowdowns.&lt;/p&gt;

&lt;p&gt;The following metrics help quickly identify possible slowdowns in Kubernetes clusters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rrk8l0p0w67hbbzdff0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rrk8l0p0w67hbbzdff0.png" alt="How to monitor saturation" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component (Dashboard Title)&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Importance of Monitoring that Metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cluster Memory Utilization&lt;/td&gt;
&lt;td&gt;node_memory_MemFree_bytes, node_memory_MemTotal_bytes, node_memory_Buffers_bytes, node_memory_Cached_bytes&lt;/td&gt;
&lt;td&gt;Tracks memory usage metrics to prevent saturation and ensure resource availability.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cluster CPU Utilization&lt;/td&gt;
&lt;td&gt;node_cpu_seconds_total&lt;/td&gt;
&lt;td&gt;Monitors CPU usage to prevent overload and maintain performance efficiency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node Count&lt;/td&gt;
&lt;td&gt;kube_node_labels&lt;/td&gt;
&lt;td&gt;Counts the number of nodes, essential for scaling and resource allocation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVCs&lt;/td&gt;
&lt;td&gt;kube_persistentvolumeclaim_info&lt;/td&gt;
&lt;td&gt;Tracks persistent volume claims, important for storage resource management.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node Parameters&lt;/td&gt;
&lt;td&gt;node_filefd_maximum&lt;br&gt;node_filefd_allocated&lt;/td&gt;
&lt;td&gt;Tracks maximum file descriptors, prevents resource exhaustion.&lt;br&gt;Tracks allocated file descriptors, prevents resource exhaustion.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node Parameters&lt;/td&gt;
&lt;td&gt;node_sockstat_sockets_used&lt;/td&gt;
&lt;td&gt;Monitors sockets in use, crucial for system stability.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Node Parameters&lt;/td&gt;
&lt;td&gt;node_nf_conntrack_entries&lt;br&gt;node_nf_conntrack_entries_limit&lt;/td&gt;
&lt;td&gt;Tracks active network connections, ensures capacity isn't exceeded.&lt;br&gt;Monitors conntrack entries limit, prevents network tracking overload.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
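&lt;p&gt;The cluster memory utilization panel combines the four &lt;code&gt;node_memory_*&lt;/code&gt; gauges listed above: memory that is free, buffered, or cached is treated as reclaimable, and the remainder counts as used. A minimal sketch with illustrative values:&lt;/p&gt;

```python
# Sketch of the memory utilization formula built from the node_memory_*
# gauges: free, buffered, and cached memory is treated as available,
# and the rest counts as used. Values (in bytes) are illustrative only.

mem_total = 16 * 1024**3   # node_memory_MemTotal_bytes
mem_free = 2 * 1024**3     # node_memory_MemFree_bytes
buffers = 1 * 1024**3      # node_memory_Buffers_bytes
cached = 5 * 1024**3       # node_memory_Cached_bytes

used = mem_total - mem_free - buffers - cached
utilization = used / mem_total
print(f"memory utilization: {utilization:.0%}")  # 8 GiB used of 16 GiB
```

&lt;p&gt;Alerting on this ratio (rather than on raw free bytes) keeps the threshold meaningful across nodes of different sizes.&lt;/p&gt;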

&lt;p&gt;Note: This dashboard is designed specifically for infrastructure monitoring. To cover application insights, you need to create similar dashboards from application metrics, assuming the relevant metrics are available. Additionally, you can generate metrics from logs as needed and incorporate them into these dashboards to achieve a unified view.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By meticulously applying these Four Golden Signals in our monitoring strategy, we ensure a proactive approach to infrastructure management. This not only helps in quick problem resolution but also aids in efficient resource utilization, ultimately enhancing the performance and stability of your Kubernetes cluster. With this comprehensive view provided by this single-dashboard approach, Kubernetes administrators and SREs can effortlessly manage cluster health, allowing them to focus on strategic improvements and innovation. No more navigating through complex monitoring setups—everything you need is now in one place, streamlined for efficiency and effectiveness.&lt;/p&gt;

&lt;p&gt;I hope you found this post informative and engaging. I’d love to hear your thoughts on this post; let’s connect and start a conversation on &lt;a href="https://www.linkedin.com/in/ruturaj-kadikar/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>Enterprise-grade Linkerd Support for Production Environment</title>
      <dc:creator>InfraCloud</dc:creator>
      <pubDate>Mon, 31 Mar 2025 10:12:32 +0000</pubDate>
      <link>https://dev.to/infracloud/enterprise-grade-linkerd-support-for-production-environment-1jp9</link>
      <guid>https://dev.to/infracloud/enterprise-grade-linkerd-support-for-production-environment-1jp9</guid>
      <description>&lt;p&gt;A &lt;a href="https://www.cncf.io/wp-content/uploads/2022/05/CNCF_Service_Mesh_MicroSurvey_Final.pdf" rel="noopener noreferrer"&gt;CNCF survey reveals&lt;/a&gt; that 47% of tech leaders see a lack of engineering expertise as the biggest challenge in using a service mesh. If managing Linkerd feels difficult, you're not alone. Running Linkerd can be challenging due to its steep learning curve, complex setup, resource optimization needs, and the intricacies of troubleshooting in distributed systems. These problems can result in downtime and frustration.&lt;/p&gt;

&lt;p&gt;InfraCloud's Linkerd support team provides expert assistance to swiftly resolve issues, enhance performance, and ensure your service mesh operates seamlessly and securely.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How can InfraCloud help you with Linkerd Support?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;InfraCloud is an officially recognized Linkerd Professional Service Provider. With our experts by your side, you can deploy and manage Linkerd with confidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Adopting Linkerd from scratch&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our CNCF-certified engineers will craft a tailored plan to support your Linkerd adoption and deployment from the ground up. From initial setup to seamless operation, our team makes sure Linkerd integrates effortlessly into your environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Ad hoc Linkerd support&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you need a one-time solution for a specific Linkerd issue without ongoing support, our Linkerd support team is here to help. We’ll resolve your issue without tying you to a long-term contract.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;24*7 Linkerd support with unlimited incidents&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Emergencies can happen anytime, and we're here for you 24/7/365 to handle your Linkerd (P1 &amp;amp; P2) issues. Our Linkerd experts are always ready to tackle critical problems and ensure your system runs smoothly, so you never have to worry about a lack of support. We provide fast resolutions whenever you need assistance.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Receive secure updates&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Installing Linkerd updates without testing them might harm your infrastructure and cause disruptions. Our experts test each update thoroughly before applying it, ensuring smooth performance, stability, and security for your system.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Scalable solutions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Transitioning from a single cluster to a multi-cluster or multi-cloud service mesh poses challenges. However, with the right setup, scaling across different environments becomes straightforward. The InfraCloud team specializes in developing scalable service meshes, guaranteeing optimal performance as your infrastructure grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Multi-platform support&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We provide multi-platform support, including cloud providers' managed Kubernetes distributions, VMs, and on-prem installations. The InfraCloud team can help configure and optimize your environment across these platforms, making sure your infrastructure runs smoothly no matter where it is hosted.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Fail-safe plan&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Cluster failures can result in downtime and impact service availability. InfraCloud's Linkerd support utilizes Global Failover and Locality-aware Routing to effectively address these challenges, reducing disruptions by efficiently routing traffic and enhancing system resilience.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What does InfraCloud’s Linkerd support process look like?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Incident Reporting&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can contact our dedicated Linkerd support team through Slack or Jira, based on the severity of the issue. Our transparent support model guides you on how to report incidents, ensuring prompt attention and appropriate prioritization.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Acknowledgment&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;InfraCloud’s Linkerd support team will quickly acknowledge your reported issue and assign the right engineers with the expertise to resolve it. You’ll receive regular updates throughout the process.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Solution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our experienced Linkerd experts will carefully assess the situation and create a detailed plan to resolve the issue. We focus on providing an effective, tailored solution that meets your specific needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Collaboration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We'll provide a detailed action plan and guide you through the resolution process, offering Linkerd support as needed to ensure smooth implementation and minimize any disruption.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Resolution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once the issue is resolved, we'll confirm with your team that Linkerd is working as expected. After verification, we'll close the ticket and ensure your service mesh infrastructure is stable and secure.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Professional Linkerd support tiers for businesses of every size&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Managing Linkerd can be challenging due to complex configurations, scaling difficulties, security concerns, and troubleshooting issues. These challenges can result in downtime and frustration. With InfraCloud Linkerd support, you get expert assistance to quickly resolve problems, enhance performance, and ensure your service mesh operates securely and seamlessly.&lt;/p&gt;

&lt;p&gt;Feel free to get in touch with our &lt;a href="https://www.infracloud.io/linkerd-support/" rel="noopener noreferrer"&gt;Linkerd support team&lt;/a&gt; for a non-obligatory chat about your Linkerd queries and any doubts.&lt;/p&gt;

</description>
      <category>linkerd</category>
      <category>linkerdsupport</category>
      <category>linkerdconsulting</category>
    </item>
    <item>
      <title>Enterprise-grade Istio Support for Production Environment</title>
      <dc:creator>InfraCloud</dc:creator>
      <pubDate>Mon, 31 Mar 2025 09:54:16 +0000</pubDate>
      <link>https://dev.to/infracloud/enterprise-grade-istio-support-for-production-environment-4can</link>
      <guid>https://dev.to/infracloud/enterprise-grade-istio-support-for-production-environment-4can</guid>
      <description>&lt;p&gt;According to &lt;a href="https://www.cncf.io/wp-content/uploads/2022/05/CNCF_Service_Mesh_MicroSurvey_Final.pdf" rel="noopener noreferrer"&gt;a CNCF survey&lt;/a&gt;, 47% of tech leaders say that a shortage of engineering expertise is the biggest non-technical challenge when using a service mesh. If you are struggling to run Istio, you are not alone. Running Istio can be difficult, with challenges like complex configurations, scaling issues, security risks, and troubleshooting. These problems can lead to downtime and frustration.&lt;/p&gt;

&lt;p&gt;With InfraCloud Istio support, you get expert help to resolve issues quickly, optimize performance, and ensure your service mesh runs smoothly and securely.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How can InfraCloud help you with Istio Support?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;InfraCloud is officially recognized as an Istio Professional Service Provider. With our experts on your side, you can confidently deploy and manage Istio.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Adopting Istio from scratch&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our CNCF-certified engineers will create a customized plan to guide your Istio adoption and deployment, starting from scratch. From setup to smooth operation, our experts will make Istio work seamlessly in your environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Ad hoc Istio support&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you need a one-time fix for a specific Istio issue without ongoing support and maintenance, our Istio support team has you covered. We will help you resolve the specific issue you are facing, without binding you to a long-term contract.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;24*7 Istio support with unlimited incidents&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Emergencies can arise at any time, day or night, and we're available 24/7/365 to handle your Istio emergencies (P1 &amp;amp; P2). Our Istio expert team is always ready to resolve critical issues and keep your system running smoothly, providing fast resolutions whenever you need assistance.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Receive secure updates&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Installing unchecked Istio updates can damage your infrastructure and disrupt operations. Our Istio specialists will thoroughly test updates before upgrading Istio, making sure everything functions as expected and your system remains stable and secure.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Scalable solutions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Scaling from a single cluster to a multi-cluster or multi-cloud service mesh can be complex. With the right configuration, you can achieve seamless scalability across environments. The InfraCloud team specializes in building scalable service meshes, optimizing performance as your infrastructure grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Multi-platform support&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We provide multi-platform support, including cloud providers' managed Kubernetes distributions, VMs, and on-prem installations. The InfraCloud team can help configure and optimize your environment across these platforms, making sure your infrastructure runs smoothly no matter where it is hosted. Let us simplify your setup!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Fail-safe plan&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Cluster failures can cause significant downtime and disrupt service availability. With Global Failover and Locality-aware Routing, InfraCloud’s Istio support can help manage these failures effectively, minimizing the impact by routing traffic efficiently and maintaining system resilience.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What does InfraCloud’s Istio support process look like?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0rc37a6eo8dco9ql266.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh0rc37a6eo8dco9ql266.png" alt="InfraCloud Istio support tiers" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Istio Support Tiers by InfraCloud&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Incident Reporting&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can reach out to our dedicated Istio support team via Slack or Jira, depending on the severity of the issue. Our clear support model helps you determine the best way to report an incident, ensuring quick attention and proper prioritization.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Acknowledgment&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;InfraCloud’s Istio support team will promptly acknowledge your reported issue and assign the right Istio engineers with the expertise to address it. You’ll receive timely updates throughout the process.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Solution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our experienced Istio experts will thoroughly analyze the situation and develop a comprehensive plan to resolve the issue. We take a detailed approach to ensure that the solution is effective and tailored to your specific needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Collaboration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We believe in close collaboration with your team. We will present a detailed action plan and guide your team through the resolution process, offering Istio support as needed to ensure smooth implementation and minimize disruption.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Resolution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once the problem is successfully resolved, we will verify with your team that Istio is functioning as expected. After confirming, we’ll close the ticket and ensure your service mesh infrastructure is fully stable and secure.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Professional Istio support tiers for businesses of every size&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Running Istio can be difficult, with challenges like complex configurations, scaling issues, security risks, and troubleshooting. These problems can lead to downtime and frustration. With InfraCloud Istio support, you get expert help to resolve issues quickly, optimize performance, and ensure your service mesh runs smoothly and securely.&lt;/p&gt;

&lt;p&gt;Feel free to get in touch with our &lt;a href="https://www.infracloud.io/istio-support/" rel="noopener noreferrer"&gt;Istio support team&lt;/a&gt; for a non-obligatory chat about your Istio queries and any doubts.&lt;/p&gt;

</description>
      <category>istio</category>
      <category>istiosupport</category>
    </item>
    <item>
      <title>Get Kubernetes Consulting Services from Certified Experts</title>
      <dc:creator>InfraCloud</dc:creator>
      <pubDate>Thu, 27 Mar 2025 09:54:37 +0000</pubDate>
      <link>https://dev.to/infracloud/get-kubernetes-consulting-services-from-certified-experts-1lea</link>
      <guid>https://dev.to/infracloud/get-kubernetes-consulting-services-from-certified-experts-1lea</guid>
<description>&lt;p&gt;More than 60% of enterprises have adopted Kubernetes to power their cloud-native applications, drawn by its scalability and resilience. However, its complexity can make implementation challenging.&lt;/p&gt;

&lt;p&gt;As a trusted Kubernetes Certified Service Provider (KCSP) backed by the CNCF, InfraCloud offers expertise in Kubernetes consulting, deployment services, and support. In 2023, InfraCloud won the Stratus Award for Cloud Computing in the Kubernetes category. Whether you are starting with Kubernetes or looking to make the best use of it, InfraCloud can help with consulting, training, and support.&lt;/p&gt;

&lt;h2&gt;
  
  
  How can InfraCloud help you with Kubernetes Consulting &amp;amp; Support?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Kubernetes Advisory Services
&lt;/h3&gt;

&lt;p&gt;InfraCloud's Kubernetes consulting services make it easy for companies to adopt Kubernetes and leverage its capabilities. Our skilled consultants know all the major cloud providers—AWS, GCP, and Azure—and will help you assess your current processes against industry standards while creating a clear plan for successful implementation. Our team of Kubernetes experts will analyze your existing infrastructure to see how ready your applications are and give you practical recommendations for moving to a cloud-native setup. With a straightforward deployment plan that matches your business goals, you'll be ready to begin your Kubernetes journey.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes Application Development &amp;amp; Migration
&lt;/h3&gt;

&lt;p&gt;With our Kubernetes experts, you can build new applications or migrate from a monolithic architecture to microservices seamlessly. We break down your legacy system into smaller, manageable containerized workloads, making it ready for Kubernetes. Our team helps organizations create Kubernetes-ready apps from scratch, allowing for a fresh start in their IT journey.&lt;/p&gt;

&lt;p&gt;By deconstructing tightly integrated monolith applications into loosely coupled microservices, we provide unmatched flexibility and communication between services. Additionally, we offer a single pane of glass for managing all your Kubernetes clusters across different environments, simplifying multicluster management.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes Deployment Partner
&lt;/h3&gt;

&lt;p&gt;Our team of Kubernetes experts has worked with over 100 customers, and we will help you navigate the complexities of deploying Kubernetes in production. We ensure you get it right the first time, whether in the cloud or on-premises, guiding you from strategy to implementation. We assist in deploying production-ready Kubernetes clusters on private, public, or bare metal environments, following the best practices outlined during our consultancy.&lt;/p&gt;

&lt;p&gt;Our Kubernetes consultants enable the provisioning of highly available clusters that support auto-healing and scalability to meet ongoing demand. We also integrate essential applications like Prometheus and implement necessary security measures and RBAC policies. If you're deploying machine learning models, our Kubernetes consultants are here to support you every step of the way.&lt;/p&gt;

&lt;h3&gt;
  
  
  DevSecOps - Enabling Security &amp;amp; Compliance
&lt;/h3&gt;

&lt;p&gt;With DevSecOps, our Kubernetes specialists empower your teams to innovate, collaborate, and deliver code faster—securely. We enable role-based access to nodes, network segmentation, and the implementation of network policies, along with automatic upgrades. Our experts enhance security by integrating open source and third-party tools, as well as continuous container scanning to detect and identify threats in near real-time. With an effective and secure environment, we help you sign, scan, deploy packages, and administer clusters with confidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes Day 2 Support
&lt;/h3&gt;

&lt;p&gt;Businesses of all sizes can rely on InfraCloud’s experienced team of Kubernetes architects and engineers to fix any configuration issues across public cloud, private cloud, or bare metal environments. InfraCloud provides tooling for monitoring, proxy servers, networking, and policy enforcement, and enables observability and vulnerability scanning to detect issues in real time, preventing any business impact.&lt;/p&gt;

&lt;p&gt;Our Kubernetes support team stays ready for any situation by implementing a comprehensive Kubernetes backup and disaster recovery strategy. Additionally, InfraCloud supports cost management by offering detailed cost reporting at the business unit or executive level. Clients also receive assured assistance for regular Kubernetes updates, patches, and rapid releases, ensuring their systems are always secure and up-to-date.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes Training
&lt;/h3&gt;

&lt;p&gt;InfraCloud’s Kubernetes consultants and architects empower your engineers with the skills they need to confidently manage your new cloud-native environment. Our Kubernetes experts assist in implementing the necessary cultural changes by promoting container best practices. Through hands-on training workshops, InfraCloud coaches and DevRels train your team to containerize, deploy, and configure applications effectively. By mastering Kubernetes concepts such as networking, architecture, authentication, scaling, and storage, your team will be well-equipped to overcome daily challenges with ease.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Looking for Kubernetes Support?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our certified Kubernetes experts are here to guide you through every step—whether it's deployment, management, or scaling. With InfraCloud’s tailored Kubernetes consulting services, you’ll gain a resilient, secure, and cost-efficient cloud infrastructure without the complexity.&lt;/p&gt;

&lt;p&gt;Feel free to get in touch with our &lt;a href="https://www.infracloud.io/kubernetes-consulting-partner/" rel="noopener noreferrer"&gt;Kubernetes consulting services&lt;/a&gt; for a non-obligatory chat about your Kubernetes queries and any doubts.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>kubernetesexperts</category>
      <category>kubernetesconsulting</category>
    </item>
    <item>
      <title>AI Bare Metal and Orchestration Platform by InfraCloud</title>
      <dc:creator>InfraCloud</dc:creator>
      <pubDate>Thu, 27 Mar 2025 09:37:59 +0000</pubDate>
      <link>https://dev.to/infracloud/ai-bare-metal-and-orchestration-platform-by-infracloud-4o78</link>
      <guid>https://dev.to/infracloud/ai-bare-metal-and-orchestration-platform-by-infracloud-4o78</guid>
<description>&lt;p&gt;InfraCloud’s AI bare metal platform delivers ready-to-use GPU instances with a preconfigured software stack, so you can skip the setup and dive right into your work. Our AI orchestration platform uses Kubernetes to manage AI resources smoothly, letting you share GPUs and maximize efficiency without the hassle of configuration and setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does InfraCloud's AI Bare Metal and Orchestration Platform help you?&lt;/strong&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  On-Demand GPUs with Fast Booting and Reliable Performance
&lt;/h3&gt;

&lt;p&gt;Offer high-performance GPUs whenever platform users need them, billed by the minute or hour for flexibility. Our platform ensures fast-booting instances, so users can jump into work right away. Robust built-in storage and networking enable a smooth experience and uninterrupted performance, even with demanding AI workloads.&lt;/p&gt;
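&lt;p&gt;As a rough illustration of what per-minute billing means in practice (the rate below is hypothetical, not an actual price):&lt;/p&gt;

```python
import math

def gpu_cost_cents(minutes_used, rate_cents_per_minute):
    """Bill per whole minute actually consumed: a short experiment
    costs a fraction of what reserving hardware outright would."""
    return math.ceil(minutes_used) * rate_cents_per_minute

# A 90.5-minute training run at a hypothetical 5 cents/minute
# is billed as 91 minutes.
cost = gpu_cost_cents(90.5, 5)
```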

&lt;h3&gt;
  
  
  Be Productive from the First Hour with ML in a Box
&lt;/h3&gt;

&lt;p&gt;Start your machine learning experiments and projects right away using our preconfigured instances. With ML in a box, you can dive into your work without delay, selecting your preferred framework—like TensorFlow or PyTorch—and using a familiar IDE such as Jupyter Notebooks or VSCode. Everything you need is set up with the AI bare metal platform, enabling a smooth transition from setup to productive coding in no time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Achieve Effective Auto-Healing and Auto-Scaling with Kubernetes
&lt;/h3&gt;

&lt;p&gt;Kubernetes orchestration helps you create a platform that automatically heals and scales your containerized workloads. This smart management of GPU cloud resources helps lower costs by using features like scale-to-zero and cluster autoscaler. You’ll experience smooth performance while optimizing resource use, so you only pay for what you need, all while keeping your AI applications reliable and efficient.&lt;/p&gt;
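&lt;p&gt;The scaling rule at the heart of this is simple to sketch. The function below mirrors the Horizontal Pod Autoscaler's proportional formula, with a scale-to-zero shortcut for fully idle workloads. It is a simplification: real autoscalers add stabilization windows and other safeguards.&lt;/p&gt;

```python
import math

def desired_replicas(current, current_metric, target_metric, max_replicas):
    """Proportionally scale replicas so the per-replica metric
    (e.g., GPU utilization %) moves toward the target."""
    if current_metric == 0:
        return 0  # scale-to-zero: no load, no replicas, no cost
    desired = math.ceil(current * current_metric / target_metric)
    return max(1, min(desired, max_replicas))

# 2 replicas running at 90% utilization against a 50% target
# scale up to ceil(2 * 90 / 50) = 4 replicas.
```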

&lt;h3&gt;
  
  
  Efficient Resource Allocation for Multiple Workloads
&lt;/h3&gt;

&lt;p&gt;With the AI platform, you can allocate resources to different workloads by combining various scheduling techniques tailored to your needs. Options like fair-share scheduling, guaranteed quotas, and GPU over-provisioning let you match specific AI tasks with the best-suited hardware configurations, using dynamic resource allocation and node pooling for maximum efficiency and performance.&lt;/p&gt;
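&lt;p&gt;As a toy example of the fair-share idea (team names and weights here are invented for illustration), a GPU pool can be divided in proportion to configured weights:&lt;/p&gt;

```python
def fair_share(total_gpus, weights):
    """Split a GPU pool among teams proportionally to their weights,
    handing out remainder GPUs to the heaviest-weighted teams first."""
    total_weight = sum(weights.values())
    alloc = {team: (total_gpus * w) // total_weight
             for team, w in weights.items()}
    # distribute GPUs left over by integer division
    leftover = total_gpus - sum(alloc.values())
    for team in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        alloc[team] += 1
    return alloc

# 8 GPUs split 3:1 between a training team and an inference team
share = fair_share(8, {"training": 3, "inference": 1})
```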

&lt;h3&gt;
  
  
  Monitor Your GPU Cloud Health with Built-In Observability
&lt;/h3&gt;

&lt;p&gt;Keep a close eye on the health of your GPU cloud using our built-in observability. Proactive capacity planning helps you maximize uptime and ensure your AI infrastructure consistently meets demand. With real-time insights, you can quickly address potential issues and maintain smooth operations for your applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Looking to deploy AI on Bare Metal?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With the AI bare metal and orchestration platform by InfraCloud, you get high-performance GPU instances and a hassle-free setup, so you can focus on your projects without worrying about infrastructure. Our efficient resource management and seamless scaling ensure your AI applications run smoothly. Feel free to explore how we're helping organizations &lt;a href="https://www.infracloud.io/build-ai-cloud/" rel="noopener noreferrer"&gt;build AI cloud&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aicloud</category>
      <category>baremetal</category>
    </item>
    <item>
      <title>Boost Developer Engagement with DevRel as a Service</title>
      <dc:creator>InfraCloud</dc:creator>
      <pubDate>Thu, 27 Mar 2025 09:27:13 +0000</pubDate>
      <link>https://dev.to/infracloud/boost-developer-engagement-with-devrel-as-a-service-36nb</link>
      <guid>https://dev.to/infracloud/boost-developer-engagement-with-devrel-as-a-service-36nb</guid>
      <description>&lt;p&gt;Our Developer Relations (DevRel) service connects your product with the developer community. Our experienced dev advocates create engaging content, develop courses, and lead community events. Our team's extensive network and deep understanding of developers help amplify your product's reach. Through strategic content creation and community engagement, we boost your product's visibility, drive adoption, and build a loyal developer following.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What do we Offer in DevRel as a Service?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;InfraCloud offers comprehensive DevRel as a Service, including content strategy and writing, video planning and production, and custom course creation to effectively engage and educate the developer community around your product.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Content Strategy, Planning &amp;amp; Writing
&lt;/h3&gt;

&lt;p&gt;Deliver premium content that speaks to your users' challenges, educates them, and builds trust with your brand. Our DevRel team, featured in top tech publications like TheNewStack, DZone, and CNCF, will research your product and craft a content strategy. We’ll ensure your message reaches the right developers at the right time. With high-quality, in-depth content, we’ll empower both developers and your support team. Our tailored distribution strategy will drive user engagement and increase your product's visibility across multiple channels.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Video Planning &amp;amp; Production
&lt;/h3&gt;

&lt;p&gt;InfraCloud DevRel experts will create engaging video content that educates, empowers, and connects with developers. We’ll help you choose the most effective formats, like technical tutorials, live coding, or architecture breakdowns, to showcase your product’s capabilities. Our team can evangelize your brand at virtual hackathons, live Q&amp;amp;As, and conferences, building a strong developer community around your product. We’ll also develop tailored video tutorials to train your team, streamline onboarding, and ensure consistent knowledge transfer, boosting productivity across your organization.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Course Creation, Done for You
&lt;/h3&gt;

&lt;p&gt;InfraCloud’s DevRel engineers are CNCF-recognized, certified course creators, published authors, and founders of the School of Kubernetes. Our expert team specializes in creating in-depth courses for your product, guiding learners from beginner to pro. We’ll develop training materials that equip the next generation of developers, ensuring your product’s long-term success. Our DevRel team handles everything from outlining learning objectives and structuring modules to creating engaging content with technical diagrams and building exam frameworks. With our DevRel support, turning your expertise into a valuable online course is seamless and effective.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Ready to scale &amp;amp; reach the tech audience?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Managing developer relationships can be time-consuming and challenging, often requiring specialized knowledge and resources. Our team provides expert support in content creation, course development, and community building to help you succeed. Let InfraCloud's expert developer advocates help you with developer relationships and drive your product’s success. Feel free to get in touch with our &lt;a href="https://www.infracloud.io/devrel-as-a-service/" rel="noopener noreferrer"&gt;DevRel team&lt;/a&gt; for a non-obligatory chat about your queries and any doubts.&lt;/p&gt;

</description>
      <category>devrel</category>
      <category>developerelations</category>
      <category>devadvocate</category>
      <category>developerengagement</category>
    </item>
    <item>
      <title>AI Cloud: What, Why, and How?</title>
      <dc:creator>Sarvani swapna priya yallapragada</dc:creator>
      <pubDate>Mon, 03 Feb 2025 12:17:06 +0000</pubDate>
      <link>https://dev.to/infracloud/ai-cloud-what-why-and-how-1m18</link>
      <guid>https://dev.to/infracloud/ai-cloud-what-why-and-how-1m18</guid>
<description>&lt;p&gt;The rapid growth of AI applications across industries has led to significant changes, particularly with the adoption of deep learning and generative AI, which provide a competitive advantage in areas such as drug discovery in pharmaceutical R&amp;amp;D and fraud detection in banking and e-commerce.&lt;/p&gt;

&lt;p&gt;However, these advancements require substantial infrastructure that on-premises solutions often fail to support due to high initial costs, inflexible resources, inefficient GPU management and the rapid evolution of GPU hardware technologies. Additionally, increasing data requirements and the need for global availability complicate the ability to meet the dynamic demands of modern AI workloads. &lt;/p&gt;

&lt;p&gt;When scaling AI applications, on-premises infrastructure struggles to provide the computational power, memory, and storage that AI workloads require, leading to inefficiencies. Large datasets can cause delays, while geographic limitations hinder global scalability. Resource competition and slow networking further disrupt performance in shared environments.&lt;/p&gt;

&lt;p&gt;This blog post introduces you to AI Cloud: what it is and how it allows you to deploy and scale your AI workloads. We'll also look at the various components of AI Cloud, along with migration strategies and challenges.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is AI Cloud and what are its key features and benefits?
&lt;/h2&gt;

&lt;p&gt;AI Cloud is a suite of cloud services that provide on-demand access to AI applications, tools, and infrastructure. It enables organizations to leverage pre-trained models and advanced AI functionalities, including computer vision, natural language processing (NLP), and predictive analytics, without the need for complex system development.&lt;/p&gt;

&lt;p&gt;The key features of AI Cloud are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides a robust platform that efficiently manages and allocates computing resources to optimize performance and scalability.&lt;/li&gt;
&lt;li&gt;Offers capabilities for organizing, storing, and managing training datasets to ensure data quality and accessibility.&lt;/li&gt;
&lt;li&gt;Features advanced tools that streamline the development and deployment of machine learning (ML) models, reducing time and complexity.&lt;/li&gt;
&lt;li&gt;Delivers support for real-time predictions, enabling businesses to respond to dynamic data inputs and user demands quickly.&lt;/li&gt;
&lt;li&gt;Includes a diverse range of AI services that are easily accessible via APIs, allowing organizations to integrate advanced functionalities into their applications seamlessly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This flexibility allows businesses to scale their AI usage according to demand, making it a cost-effective solution that improves efficiency and drives innovation. By offering the tools to harness the potential of AI fully, AI Cloud empowers businesses to optimize performance while maintaining data sovereignty. It eliminates the need for significant infrastructure investment, enabling easier access to cutting-edge AI capabilities.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Feature&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;AI Cloud&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;On-Premise&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Setup Cost&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;Low, pay-as-you-go
   &lt;/td&gt;
   &lt;td&gt;High upfront investment
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Flexibility &amp;amp; Scalability&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;High flexibility with a wide range of services; scalable on-demand
   &lt;/td&gt;
   &lt;td&gt;Customizable, but limited scalability and requires hardware upgrades
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;GPU Management&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;Managed automatically
   &lt;/td&gt;
   &lt;td&gt;Requires manual management
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Deployment Speed&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;Quick with pre-built tools
   &lt;/td&gt;
   &lt;td&gt;Slower, custom setup is needed
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why is AI Cloud important?
&lt;/h2&gt;

&lt;p&gt;The increasing complexity of AI models, combined with the growing demands of data-driven applications, has exposed the limitations of traditional infrastructure. Businesses are increasingly deploying AI to enable user personalization, automate processes, and more. These applications require immense computational resources, low latency, and scalability - all of which AI Cloud is uniquely positioned to provide.&lt;/p&gt;

&lt;p&gt;It helps organizations stay agile and competitive by providing facilities for training, deploying, and optimizing their AI workloads. Emerging AI-driven solutions, like &lt;a href="https://www.infracloud.io/blogs/developing-ai-agent-for-smart-contextual-qna/" rel="noopener noreferrer"&gt;intelligent agents for smart contextual Q&amp;amp;A&lt;/a&gt;, are revolutionizing customer engagement by providing real-time, personalized interactions that adapt to user needs, improving efficiency and satisfaction.&lt;/p&gt;

&lt;p&gt;Compared to on-premise setups or traditional cloud workloads, AI Cloud promises unmatched scalability, flexibility, and cost efficiency. It provides a pay-as-you-use pricing model that eliminates heavy upfront investments while ensuring that resources will be dynamically allocated according to demand. In addition, AI Cloud accelerates the time-to-market by simplifying infrastructure management and providing high-performance tools to train and deploy models. &lt;/p&gt;

&lt;p&gt;It is also powered by robust security, compliance, and performance optimization, ultimately allowing organizations to scale AI capabilities globally in a reliable and efficient manner, making it the future of hosting demanding workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does AI Cloud work?
&lt;/h2&gt;

&lt;p&gt;Here is how AI Cloud works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI Cloud offers GPU clusters, scalable storage, and low-latency networking, optimized for handling large datasets and complex models and managed with advanced virtualization and containerization tools.&lt;/li&gt;
&lt;li&gt;The platform divides workloads across multiple nodes, enabling rapid processing of data and algorithms with distributed computing.&lt;/li&gt;
&lt;li&gt;Built-in tools streamline training, testing, deployment, and &lt;a href="https://www.infracloud.io/observability-consulting/" rel="noopener noreferrer"&gt;monitoring&lt;/a&gt;, freeing users to focus on innovation rather than infrastructure.&lt;/li&gt;
&lt;li&gt;AI Cloud quickly ingests, preprocesses, and stores large volumes of data, &lt;a href="https://www.infracloud.io/blogs/inference-parallelism/" rel="noopener noreferrer"&gt;accelerating machine learning model training&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;APIs, hybrid cloud configurations, and data migration tools allow smooth integration with existing IT systems and legacy workflows, &lt;a href="https://www.infracloud.io/monolith-microservices-modernization/" rel="noopener noreferrer"&gt;modernizing operations without disruption&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;The platform supports easy scaling across various applications, making it adaptable to different applications and operational needs.&lt;/li&gt;
&lt;li&gt;Operates on a pay-as-you-go model, automates processes for enhanced efficiency, and continuously refines AI models to meet evolving business challenges.&lt;/li&gt;
&lt;/ul&gt;
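&lt;p&gt;The "divides workloads across multiple nodes" step above can be sketched as a simple data-sharding function. This is a bare-bones illustration of distributed processing, not a real scheduler:&lt;/p&gt;

```python
def shard(items, num_nodes):
    """Deal items round-robin so each node gets a near-equal slice
    to process in parallel."""
    return [items[i::num_nodes] for i in range(num_nodes)]

# 10 records spread across 3 worker nodes: every record lands on
# exactly one node, and slice sizes differ by at most one.
slices = shard(list(range(10)), 3)
```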

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5qg0gd6qsvq235x6ucl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5qg0gd6qsvq235x6ucl.png" alt="Building AI Cloud webinar" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Watch our &lt;a href="https://www.youtube.com/watch?v=IncYz7UpNgo" rel="noopener noreferrer"&gt;webinar on building an AI cloud&lt;/a&gt;, where Vishal and Sanket explained what makes a GPU cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core components of AI Cloud
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Compute resources
&lt;/h3&gt;

&lt;p&gt;AI Cloud relies on robust compute infrastructure tailored for demanding workloads. High-Performance Computing (HPC) clusters provide the raw power necessary for training and running AI models. GPUs and TPUs offer accelerated processing, drastically reducing the time required for computation-intensive tasks like deep learning. Specialized hardware, such as AI accelerators and custom chips, further optimize performance, while power and cooling systems support the high-density requirements of these components, ensuring reliability and efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data management &amp;amp; storage
&lt;/h3&gt;

&lt;p&gt;AI Cloud handles data with advanced data lakes and warehouses designed to store and manage vast datasets. Data ingestion and integration tools streamline data flow from various sources while governance and security frameworks protect data integrity and privacy. These systems ensure data is accessible, compliant, and ready for AI model training and inference.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI/ML services &amp;amp; frameworks
&lt;/h3&gt;

&lt;p&gt;AI Cloud offers a plethora of services that support the entire AI/ML lifecycle, from training to deployment, simplifying model development for developers, while pre-trained models and APIs accelerate application integration. MLOps enables efficient model monitoring, versioning, and updates, so AI systems can be maintained and scaled seamlessly over time.&lt;/p&gt;

&lt;p&gt;It supports the AI/ML lifecycle with frameworks like TensorFlow, PyTorch, and JAX for training; services like SageMaker for deployment; &lt;a href="https://www.infracloud.io/blogs/introduction-to-mlops/" rel="noopener noreferrer"&gt;MLOps&lt;/a&gt; tools like MLflow, Kubeflow, and TFX; and pre-trained APIs like Google Vision, AWS Rekognition, and OpenAI for seamless integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Networking &amp;amp; connectivity
&lt;/h3&gt;

&lt;p&gt;Efficient AI operations frequently depend on high-speed networks, particularly when managing large datasets and distributed workloads. AI Cloud solutions focus on optimizing network architectures to reduce latency and ensure consistent performance for demanding AI applications. Secure and reliable connections are prioritized, safeguarding data during transmission and supporting real-time AI applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security &amp;amp; compliance
&lt;/h3&gt;

&lt;p&gt;AI Cloud incorporates robust measures to address data privacy and security, using encryption, access controls, and advanced protocols to protect sensitive information. It complies with industry regulations such as GDPR and HIPAA, ensuring data is handled responsibly. It is also committed to AI ethics and reducing bias, with strategies to maintain fairness and transparency in AI models. These measures are integral to its design, helping organizations adopt AI responsibly and sustainably.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyte58kx8gecst3h8x098.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyte58kx8gecst3h8x098.png" alt="AI Cloud core components" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Understanding the &lt;a href="https://www.infracloud.io/blogs/exploring-ai-model-inference/" rel="noopener noreferrer"&gt;underlying infrastructure and optimization strategies&lt;/a&gt; is necessary for efficient resource utilization, minimizing latency, and maximizing the performance of AI models while adapting to the evolving demands of various applications and workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges and mitigation strategies in AI Cloud
&lt;/h2&gt;

&lt;p&gt;Even though AI Cloud has plenty of benefits, it also brings some challenges that need to be addressed. &lt;/p&gt;

&lt;h3&gt;
  
  
  Challenges in AI Cloud adoption
&lt;/h3&gt;

&lt;p&gt;There are several major challenges in adopting AI Cloud:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technical integration and infrastructure setup&lt;/strong&gt;: Integrating AI workloads with existing IT systems and establishing the right cloud infrastructure poses significant challenges, as misalignment can lead to delays and performance issues. Selecting appropriate hardware, storage, and networking configurations is essential for smooth deployment and optimized resource management. &lt;a href="https://www.infracloud.io/build-ai-cloud/" rel="noopener noreferrer"&gt;InfraCloud’s AI cloud&lt;/a&gt; simplifies this by offering pre-configured AI cloud solutions that seamlessly integrate with existing systems, reducing complexity and ensuring optimal performance while adhering to data residency and compliance requirements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud dependency and vendor lock-in&lt;/strong&gt;: Heavy reliance on a single cloud provider can create challenges such as vendor switching, unforeseen costs, reduced flexibility, and complicated scaling efforts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost management&lt;/strong&gt;: AI workloads often require significant resources, and without careful planning, costs can escalate quickly. Scaling resources efficiently while maintaining financial transparency is a persistent challenge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data privacy, security, compliance, and prompt injection&lt;/strong&gt;: Handling sensitive data in the cloud necessitates stringent security measures like encryption and access controls to protect against breaches. Compliance with data privacy regulations (e.g., GDPR, CCPA) is crucial. Our AI cloud ensures security frameworks and compliance readiness are in place, helping handle data responsibly while meeting global regulatory standards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power, cooling, and environmental impact&lt;/strong&gt;: The high computational power required by AI models results in increased energy consumption and cooling needs, presenting logistical and environmental challenges, especially for on-premises or hybrid setups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ensuring reliable AI outcomes&lt;/strong&gt;: AI models are prone to generating unreliable or hallucinated results. Building reliable validation pipelines, monitoring outputs, and continuously refining models are critical to delivering dependable outcomes from AI applications. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High initial investments (CAPEX)&lt;/strong&gt;: Transitioning to AI Cloud often requires substantial upfront costs for private or hybrid configurations, including specialized hardware and skilled personnel.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Mitigation strategies
&lt;/h3&gt;

&lt;p&gt;The mitigation strategies below can help you achieve a seamless implementation and optimal performance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security and compliance&lt;/strong&gt;: Use advanced encryption and regular audits to secure data. Adhere to privacy regulations to build trust.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized scalability and cost control&lt;/strong&gt;: Choose scalable cloud solutions with transparent pricing; utilize cost-monitoring tools for resource adjustments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible cloud architectures&lt;/strong&gt;: Adopt hybrid or multi-cloud strategies to avoid vendor lock-in and enhance flexibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Energy efficiency and sustainability&lt;/strong&gt;: Implement energy-efficient hardware and optimize workloads to reduce costs and environmental impact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outcome reliability practices&lt;/strong&gt;: Integrate rigorous testing and monitoring to ensure reliable AI outcomes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CAPEX reduction and scalability&lt;/strong&gt;: Leverage the cloud’s pay-as-you-go model to minimize upfront costs while validating ROI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preventing prompt injection&lt;/strong&gt;: Implement strict input validation and monitoring systems to mitigate malicious attempts to manipulate AI models. Regularly update training datasets to counteract injection patterns.&lt;/li&gt;
&lt;/ul&gt;
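&lt;p&gt;As a minimal sketch of what strict input validation for prompt injection might look like (the patterns and length limit below are illustrative assumptions; a real deployment would layer this with model-side guardrails and output monitoring):&lt;/p&gt;

```python
import re

# Illustrative deny-list of common injection phrasings. Assumption:
# a production filter would be far more comprehensive and updated regularly.
SUSPICIOUS_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"disregard (the )?system prompt",
    r"reveal (your )?system prompt",
]

def is_suspicious(prompt: str) -> bool:
    """Return True if the prompt matches a known injection pattern."""
    lowered = prompt.lower()
    return any(re.search(pattern, lowered) for pattern in SUSPICIOUS_PATTERNS)

def validate_prompt(prompt: str, max_len: int = 4000) -> str:
    """Basic input validation: a length cap plus pattern screening."""
    if len(prompt) > max_len:
        raise ValueError("prompt too long")
    if is_suspicious(prompt):
        raise ValueError("prompt rejected by injection filter")
    return prompt
```

&lt;p&gt;Pattern screening alone is easy to bypass, which is why continuous monitoring and refreshed training data remain essential complements.&lt;/p&gt;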

&lt;h2&gt;
  
  
  Major AI Cloud platforms
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Feature&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Amazon Web Services (AWS)&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Microsoft Azure&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Google Cloud Platform (GCP)&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;IBM Cloud&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;AI/ML Services&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;SageMaker for training and deployment
   &lt;/td&gt;
   &lt;td&gt;Azure Machine Learning
   &lt;/td&gt;
   &lt;td&gt;Vertex AI for training and deployment
   &lt;/td&gt;
   &lt;td&gt;Watson Studio and Watson Machine Learning
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Model as a Service&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;AWS Bedrock, Amazon Polly, Amazon Rekognition    
   &lt;/td&gt;
   &lt;td&gt;Azure Cognitive Services, Azure AI Models    
   &lt;/td&gt;
   &lt;td&gt;AI Hub, AutoML, Vertex AI Models
   &lt;/td&gt;
   &lt;td&gt;Watson AI services, Watson Visual Recognition
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Compute Resources&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;EC2 Instances, Elastic Inference, AWS Lambda
   &lt;/td&gt;
   &lt;td&gt;Virtual Machines, Azure Kubernetes Service
   &lt;/td&gt;
   &lt;td&gt;Compute Engine, TPUs, GPUs
   &lt;/td&gt;
   &lt;td&gt;Bare Metal Servers, Cloud Functions
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Data Storage&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;S3, Redshift, Data Lake Formation
   &lt;/td&gt;
   &lt;td&gt;Blob Storage, Data Lake Storage
   &lt;/td&gt;
   &lt;td&gt;Cloud Storage, BigQuery
   &lt;/td&gt;
   &lt;td&gt;Cloud Object Storage, Db2
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Security &amp;amp; Compliance&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;Extensive compliance certifications (GDPR, HIPAA)
   &lt;/td&gt;
   &lt;td&gt;Compliance with industry regulations (GDPR, HIPAA)
   &lt;/td&gt;
   &lt;td&gt;Google Cloud Security, SOC 2, GDPR
   &lt;/td&gt;
   &lt;td&gt;IBM Cloud Security, IBM X-Force
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Integration &amp;amp; Tools&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;AWS AI tools, ML Frameworks (TensorFlow, PyTorch)
   &lt;/td&gt;
   &lt;td&gt;Pre-built models, Cognitive Services
   &lt;/td&gt;
   &lt;td&gt;TensorFlow, AutoML, TensorFlow Extended
   &lt;/td&gt;
   &lt;td&gt;Open-source tools, Integration with other IBM systems
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Emerging AI cloud solutions offer specialized services, including custom AI chips, edge computing capabilities, and decentralized data storage, paving the way for more tailored and innovative AI applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world applications and industry use cases of AI Cloud
&lt;/h2&gt;

&lt;p&gt;AI Cloud is revolutionizing industries by providing scalable, flexible, and efficient solutions for complex problems. Here are some real-world examples of how businesses are leveraging AI Cloud:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare&lt;/strong&gt;: Healthcare providers use AI Cloud to train diagnostic models across institutions while maintaining data privacy. It enables the use of federated learning, securely and efficiently managing the complexities of distributed model training.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manufacturing&lt;/strong&gt;: Companies use AI Cloud platforms to create digital replicas of their factory operations to enable real-time analysis of sensor data to optimize production processes and improve efficiency. Centralized cloud platforms make it easier to connect data from multiple facilities without the need for expensive local infrastructure. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Financial services&lt;/strong&gt;: Banks benefit from AI Cloud’s ability to dynamically scale computing resources. Top companies use AI infrastructure to process millions of trading data points during peak trading hours, ensuring cost-effective elasticity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retail&lt;/strong&gt;: Global retailers utilize AI Cloud's geographic distribution for efficient operations. AI infrastructures are used to deploy low-latency inventory management models tailored to each store’s region, streamlining compliance and updates.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final words
&lt;/h2&gt;

&lt;p&gt;AI Cloud represents the next frontier in AI-driven innovation, offering a powerful and scalable infrastructure to meet the growing demands of modern applications. By offering access to robust computing resources, specialized AI/ML services, and a flexible framework, AI Cloud empowers businesses to accelerate innovation, enhance their competitive edge, drive automation, improve user experiences, and make informed decisions. &lt;/p&gt;

&lt;p&gt;With continued advancements in machine learning, data processing, and infrastructure optimization, the role of AI Cloud will only grow in importance, driving advancements across various industries and shaping the future of technology. &lt;/p&gt;

&lt;p&gt;Ready to take the next step in AI-driven innovation? If you’re looking for experts who can help you scale or build your AI infrastructure, reach out to our &lt;a href="https://www.infracloud.io/build-ai-cloud/" rel="noopener noreferrer"&gt;AI &amp;amp; GPU Cloud experts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you found this post valuable and informative, subscribe to our weekly newsletter for more posts like this. I’d love to hear your thoughts on this post, so do start a conversation on &lt;a href="https://www.linkedin.com/in/ysspriya/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>vCluster Consulting and Support Partner for Enterprises</title>
      <dc:creator>InfraCloud</dc:creator>
      <pubDate>Wed, 29 Jan 2025 11:23:36 +0000</pubDate>
      <link>https://dev.to/infracloud/vcluster-consulting-and-support-partner-for-enterprises-3ooe</link>
      <guid>https://dev.to/infracloud/vcluster-consulting-and-support-partner-for-enterprises-3ooe</guid>
      <description>&lt;p&gt;Helping businesses of all sizes leverage vCluster with our expert consulting, implementation, &amp;amp; enterprise support services.&lt;/p&gt;

&lt;h2&gt;
  
  
  10X Cheaper than ‘real’ Kubernetes Clusters &amp;amp; Better Performance
&lt;/h2&gt;

&lt;p&gt;With vCluster, you can instantly provision and seamlessly manage multiple virtual Kubernetes clusters that are 10x cheaper than ‘real’ clusters without trading off performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Significant Cost Savings
&lt;/h3&gt;

&lt;p&gt;vCluster offers significant cost-saving advantages by maximizing resource utilization and reducing infrastructure expenses. By allowing multiple virtual clusters to run on a single Kubernetes cluster, enterprises can save up to 70% on cloud costs. Managing fewer Kubernetes clusters reduces operational overhead, ultimately lowering overall costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Faster Development
&lt;/h3&gt;

&lt;p&gt;vCluster boosts speed and flexibility by enabling you to set up isolated virtual clusters for development and testing quickly. With vCluster, you can provision K8s clusters as fast as you can make namespaces. Your team can try out new features without affecting production, speeding up software development and reducing time-to-market.&lt;/p&gt;

&lt;h3&gt;
  
  
  Better Security and Isolation
&lt;/h3&gt;

&lt;p&gt;With vCluster, you get security and isolation within the virtual cluster, as if you had a separate Kubernetes cluster. Each virtual cluster has its own API server and control plane, keeping your data safe and isolated from others. You get full control within your virtual Kubernetes cluster but limited access to the host cluster, reducing the risk of misuse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Seamless Management
&lt;/h3&gt;

&lt;p&gt;vCluster makes managing virtual Kubernetes clusters easier by letting you run separate, independent environments on a single server. Each team can work within their virtual K8s cluster without getting in each other’s way, while admins can view everything from one dashboard. It also keeps everything running smoothly by syncing resources, which reduces resource overhead in Kubernetes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How can InfraCloud help you with vCluster consulting?
&lt;/h2&gt;

&lt;p&gt;Unlock the power of Kubernetes multi-tenancy using vCluster to reduce cost and operational overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  vCluster Adoption &amp;amp; Implementation
&lt;/h3&gt;

&lt;p&gt;Companies looking to adopt vCluster to leverage virtual Kubernetes clusters efficiently can use InfraCloud’s consulting services for smooth vCluster adoption and implementation. Our vCluster consultants have experience in virtualization, multi-tenancy, and managing idle clusters, and we can help you configure and use vCluster correctly from the first day.&lt;/p&gt;

&lt;p&gt;-&amp;gt;Our team will inspect your tech stack and project requirements, audit your multi-cluster setup, and analyze your readiness and maturity to adopt vCluster.&lt;/p&gt;

&lt;p&gt;-&amp;gt;Team of vCluster consultants will make a comprehensive plan to implement vCluster with security, scalability, and flexibility in mind so your team can leverage virtual Kubernetes clusters at scale.&lt;/p&gt;

&lt;p&gt;-&amp;gt;Be it vCluster OSS or vCluster Pro, you do not have to worry about the setup. We will help you at every step to achieve secure and optimized virtual Kubernetes clustering and multi-tenancy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Migrate to vCluster
&lt;/h3&gt;

&lt;p&gt;vCluster offers great flexibility and scalability, but your team has to allocate resources, set security policies, and integrate it with the existing development workflows. Replacing namespaces and ’real’ Kubernetes clusters with virtual clusters may overwhelm your team. InfraCloud’s vCluster experts can walk you through the migration for a seamless experience.&lt;/p&gt;

&lt;p&gt;-&amp;gt;Our vCluster consulting services experts will assess your existing namespaces and Kubernetes clusters that must be migrated to virtual clusters. We develop a detailed roadmap and plan tailored to your organization’s needs, outlining each step of the migration journey.&lt;/p&gt;

&lt;p&gt;-&amp;gt;Our Kubernetes virtualization and multi-tenancy specialists will execute the migration, ensuring that all workloads are transferred seamlessly to vCluster with minimal disruption. We will also conduct thorough testing to validate the functionality and performance of your new virtual clusters.&lt;/p&gt;

&lt;p&gt;-&amp;gt;Integrate vCluster with your existing CI/CD pipelines and other tools to optimize your development workflows. With InfraCloud consultants, you can confidently add vCluster to your SDLC without worrying about breaking the pipeline or production.&lt;/p&gt;

&lt;h3&gt;
  
  
  FinOps with vCluster
&lt;/h3&gt;

&lt;p&gt;With InfraCloud’s expertise in vCluster and FinOps, you can effectively achieve significant cost savings while enhancing your Kubernetes management capabilities.&lt;/p&gt;

&lt;p&gt;-&amp;gt;Our experts implement vCluster to enable multiple teams to share physical resources effectively, eliminating the need for separate clusters. This approach reduces infrastructure costs while ensuring seamless collaboration.&lt;/p&gt;

&lt;p&gt;-&amp;gt;InfraCloud consultants set up isolated environments with vCluster for rapid provisioning, enabling your team to test and deploy applications faster. Your team can do quicker application testing and deployment, ultimately leading to reduced time to market and a competitive edge.&lt;/p&gt;

&lt;p&gt;-&amp;gt;We help your organization streamline operations by avoiding duplicated Kubernetes tools and services across multiple clusters. Our expertise ensures your teams operate efficiently on a shared platform, lowering operational overhead and maximizing Kubernetes ROI.&lt;/p&gt;

&lt;h3&gt;
  
  
  vCluster Support &amp;amp; Training
&lt;/h3&gt;

&lt;p&gt;With our vCluster experts on your side, your team will get 24x7 support, and our team will help resolve your Kubernetes cluster issues as a priority. Our vCluster consultants and experts equip your engineers and developers with the skills to build and manage a virtual Kubernetes cluster at scale and achieve multi-tenancy.&lt;/p&gt;

&lt;p&gt;-&amp;gt;A team of vCluster experts will always be active in providing ongoing support. We will test vCluster upgrades in a controlled environment, ensuring their security while conducting performance optimizations to enhance efficiency and drive cost savings.&lt;/p&gt;

&lt;p&gt;-&amp;gt;Need a one-time fix? Our vCluster experts will offer ad-hoc support. You can get help from vCluster experts without worrying about long-term commitment.&lt;/p&gt;

&lt;p&gt;-&amp;gt;Post-implementation training by InfraCloud’s vCluster experts and consultants will help you build in-house expertise. To empower your team for success, we provide tailored vCluster training sessions, making sure they are well equipped to manage and operate within the vCluster environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why choose InfraCloud for vCluster Consulting Services?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Certified Developers
&lt;/h3&gt;

&lt;p&gt;170 in-house engineers, including 4 CKS, 51 CKA &amp;amp; 19 Certified Kubernetes Application Developers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Domain Expertise
&lt;/h3&gt;

&lt;p&gt;Implement the best practices that we have learned while working with 100+ clients.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Mover Advantage
&lt;/h3&gt;

&lt;p&gt;Partner with the first Kubernetes service provider in India and second in APAC.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;p&gt;Our training focuses on building knowledge of core concepts with practical experiences.&lt;/p&gt;

&lt;h3&gt;
  
  
  CNCF Certified Provider
&lt;/h3&gt;

&lt;p&gt;InfraCloud is a proud CNCF Silver Member, and Kubernetes Certified Service Provider (KCSP).&lt;/p&gt;

&lt;h3&gt;
  
  
  Expand Easily
&lt;/h3&gt;

&lt;p&gt;With InfraCloud, easily scale up the team of engineers without the hassle of hiring or training.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ready to Transform your Business with vCluster?
&lt;/h2&gt;

&lt;p&gt;Schedule a call with our vCluster expert to understand how our &lt;a href="https://www.infracloud.io/vcluster-consulting-support/" rel="noopener noreferrer"&gt;vCluster consulting services&lt;/a&gt; can help you, and why startups and Fortune 500 companies alike trust us.&lt;/p&gt;

</description>
      <category>vcluster</category>
      <category>vclustersupport</category>
      <category>vclusterconsulting</category>
    </item>
    <item>
      <title>Kyverno Consulting &amp; Implementation Partner</title>
      <dc:creator>InfraCloud</dc:creator>
      <pubDate>Wed, 29 Jan 2025 11:21:08 +0000</pubDate>
      <link>https://dev.to/infracloud/kyverno-consulting-implementation-partner-4bj7</link>
      <guid>https://dev.to/infracloud/kyverno-consulting-implementation-partner-4bj7</guid>
      <description>&lt;p&gt;Helping startups and enterprises with Kubernetes policy management with Kyverno consulting, implementation, &amp;amp; professional support services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Policy as Code with Kyverno Managed Policies
&lt;/h2&gt;

&lt;p&gt;Kyverno streamlines Kubernetes security by automating policy enforcement, reducing operational overhead, and enabling self-service for developers while ensuring compliance with industry standards.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enhanced Security
&lt;/h3&gt;

&lt;p&gt;Security breaches can be devastating—leading to data loss, financial impact, and reputational damage. Kyverno helps you protect your Kubernetes clusters by enforcing robust security policies, ensuring vulnerabilities are minimized, and threats are addressed proactively.&lt;/p&gt;

&lt;h3&gt;
  
  
  Better Compliance
&lt;/h3&gt;

&lt;p&gt;Compliance issues can be expensive and damaging to your reputation, but Kyverno makes it easy to stay on track. By automating policy enforcement and providing clear audit trails, Kyverno helps you meet regulations effortlessly while maintaining control and protecting your organization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Reduction
&lt;/h3&gt;

&lt;p&gt;Cloud costs can quickly spiral out of control due to over-provisioning and inefficient resource use, eating into your budget and limiting growth opportunities. By automating resource limits and quotas and optimizing usage, Kyverno can reduce your daily cloud expenses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Faster Delivery
&lt;/h3&gt;

&lt;p&gt;CI/CD workflows often struggle with misaligned policies, leading to inconsistent deployments and security vulnerabilities. With Kyverno’s policy samples and declarative YAML syntax (just like Kubernetes), our policy-as-code consultants make adoption and implementation faster &amp;amp; seamless.&lt;/p&gt;
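&lt;p&gt;As an illustration of that declarative style, a minimal Kyverno validation policy might look like the sketch below (the policy name and the required &lt;code&gt;team&lt;/code&gt; label are illustrative, not part of any standard policy set):&lt;/p&gt;

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label          # illustrative name
spec:
  validationFailureAction: Enforce  # reject non-compliant resources; use Audit to only report
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "All Pods must carry a 'team' label."
        pattern:
          metadata:
            labels:
              team: "?*"            # any non-empty value
```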

&lt;h2&gt;
  
  
  How can InfraCloud help you with Kyverno consulting and support?
&lt;/h2&gt;

&lt;p&gt;Explore our Kyverno expertise, and why startups and Fortune 500 companies alike trust us:&lt;/p&gt;

&lt;h3&gt;
  
  
  Kyverno Adoption &amp;amp; Implementation
&lt;/h3&gt;

&lt;p&gt;Implementing Kyverno in your Kubernetes environment can be challenging, with the need for seamless integration and effective policy management. Our experienced consultants are here to guide you through the full policy engine adoption, ensuring a smooth transition so you can leverage the full potential of Kyverno for your organizational goals.&lt;/p&gt;

&lt;p&gt;-&amp;gt;Kyverno experts from InfraCloud start by thoroughly assessing your current Kubernetes policy management and understanding your security, compliance, and operational requirements. By identifying gaps and opportunities, we use Kyverno to implement policies that align with your requirements.&lt;/p&gt;

&lt;p&gt;-&amp;gt;InfraCloud will develop a tailored K8s policy management roadmap that outlines the adoption process. The plan includes timelines, stages, and resource allocation for a smooth implementation, focusing on minimizing disruption while maximizing the benefits of Kyverno.&lt;/p&gt;

&lt;p&gt;-&amp;gt;An experienced team will implement &amp;amp; deploy Kyverno within your Kubernetes environment. We will apply policies, configure settings, and ensure the system operates smoothly without disrupting existing workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kyverno Migration
&lt;/h3&gt;

&lt;p&gt;Kyverno migration services by InfraCloud are designed to help you seamlessly transition from legacy systems, manual policies, or other policy engines like Open Policy Agent (OPA) and jsPolicy to Kyverno-managed policies.&lt;/p&gt;

&lt;p&gt;-&amp;gt;A team of Kyverno experts from InfraCloud will assess your current Kubernetes policies and security needs. We’ll assess your environment, map your current policies, and define Kyverno-based policies to replace the existing policies.&lt;/p&gt;

&lt;p&gt;-&amp;gt;Kyverno specialists will join your team to design custom Kyverno-managed policies tailored to your environment. We consider your organization’s specific security requirements, compliance standards, and operational needs, developing scalable and maintainable policies.&lt;/p&gt;

&lt;p&gt;-&amp;gt;InfraCloud’s Kyverno experts will handle the migration and test everything thoroughly in staging, fix any issues, and carry out a smooth rollout to production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Policy Management and Optimization
&lt;/h3&gt;

&lt;p&gt;With Kyverno-managed policies, we enhance Kubernetes compliance, leveraging its automation features for governance. Our Kyverno experts will create and improve policies that fit your business goals. Leverage Kyverno’s powerful capabilities to make clear, declarative policies using familiar YAML syntax, making sure they are easy to manage and maintain.&lt;/p&gt;

&lt;p&gt;-&amp;gt;Whether you need validation rules, mutation logic, resource generation, image verification, or cleanup policies, InfraCloud’s Kyverno team will tailor them to integrate seamlessly into your workflows. We can create custom policies from scratch or modify sample policies to suit your needs. This approach enhances governance and helps you adapt quickly to evolving regulatory requirements.&lt;/p&gt;

&lt;p&gt;-&amp;gt;InfraCloud’s continuous compliance monitoring services utilize Kyverno’s automation features to ensure that your resources remain compliant with defined policies at all times. We set up monitoring systems that provide real-time alerts and comprehensive reporting, allowing you to proactively address potential violations before they escalate into critical issues.&lt;/p&gt;

&lt;p&gt;-&amp;gt;Kyverno and CI/CD experts from InfraCloud help you seamlessly integrate Kyverno into your CI/CD workflows, enabling automated compliance checks during the development lifecycle. This helps identify configuration issues early, reducing the risk of deploying insecure applications and streamlining your release cycles.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kyverno Enterprise Support
&lt;/h3&gt;

&lt;p&gt;With our Kyverno experts on your side, your team will get 24x7 emergency support services, and our team will help with policies and compliance issues as a priority. Our consultants and experts equip your engineers and developers with the skills to confidently create, enforce, and manage Kubernetes policies, ensuring robust security and compliance across your clusters.&lt;/p&gt;

&lt;p&gt;-&amp;gt;A team of Kyverno experts will always be active in providing ongoing support. Kyverno policies will be tested in a safe environment so they work reliably in production, making it easier to manage policies across your Kubernetes clusters.&lt;/p&gt;

&lt;p&gt;-&amp;gt;Need a one-time fix? Our Kyverno experts will offer ad-hoc support. You can get help from Kyverno experts without worrying about long-term commitment.&lt;/p&gt;

&lt;p&gt;-&amp;gt;InfraCloud’s Kyverno experts provide post-implementation training to help your team build in-house expertise. Our tailored sessions enable your team to manage and operate confidently within the Kyverno environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why choose InfraCloud for Kyverno Consulting Services?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Certified Developers
&lt;/h3&gt;

&lt;p&gt;170 in-house engineers, including 4 CKS, 51 CKA &amp;amp; 19 Certified Kubernetes Application Developers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Domain Expertise
&lt;/h3&gt;

&lt;p&gt;Implement the best practices that we have learned while working with 100+ clients.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Mover Advantage
&lt;/h3&gt;

&lt;p&gt;Partner with the first Kubernetes service provider in India and second in APAC.&lt;/p&gt;

&lt;h3&gt;
  
  
  OSS Contributions
&lt;/h3&gt;

&lt;p&gt;InfraCloud engineers are one of the top OSS contributors to the Kyverno project.&lt;/p&gt;

&lt;h3&gt;
  
  
  CNCF Certified Provider
&lt;/h3&gt;

&lt;p&gt;InfraCloud is a proud CNCF Silver Member, and Kubernetes Certified Service Provider (KCSP).&lt;/p&gt;

&lt;h3&gt;
  
  
  Expand Easily
&lt;/h3&gt;

&lt;p&gt;With InfraCloud, easily scale up the team of engineers without the hassle of hiring or training.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ready to Transform your Business with Kyverno?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.infracloud.io/kyverno-consulting-support/" rel="noopener noreferrer"&gt;Schedule a call with our Kyverno expert to understand how our Kyverno consulting services can help you&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kyverno</category>
      <category>kyvernosupport</category>
      <category>kyvernoconsulting</category>
    </item>
    <item>
      <title>The Quest for HA and DR in Loki</title>
      <dc:creator>Pavaningithub</dc:creator>
      <pubDate>Thu, 02 Jan 2025 11:49:03 +0000</pubDate>
      <link>https://dev.to/infracloud/the-quest-for-ha-and-dr-in-loki-12fl</link>
      <guid>https://dev.to/infracloud/the-quest-for-ha-and-dr-in-loki-12fl</guid>
      <description>&lt;p&gt;According to the &lt;a href="https://www.atlassian.com/incident-management/kpis/cost-of-downtime#:~:text=financial%20grim%20reaper.-,The%20average%20cost%20of%20downtime,-The%20average%20cost" rel="noopener noreferrer"&gt;2016 Ponemon Institute research&lt;/a&gt;, the average cost of downtime is nearly $9,000 per minute. Downtime not only costs money but also hurts your competitive edge and brand reputation. Organizations can prepare for downtime by identifying its root causes, and for that, they need visibility into how their software and infrastructure are running. Many tools help aggregate this information, and one of the most popular is Loki.&lt;/p&gt;

&lt;p&gt;However, keeping Loki available under pressure is another problem. Recently, our team was running a single monolithic instance of Loki as a private logging solution for our application microservices rather than for observing Kubernetes clusters. The logs were stored on an EBS filesystem. We wanted our system to be more robust and resilient, so we set out to implement High Availability (HA) and Disaster Recovery (DR) for our microservice application.&lt;/p&gt;

&lt;p&gt;But it was difficult due to the following reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running clustered Loki is not possible with the file system store unless the file system is shared in some fashion (NFS, for example).&lt;/li&gt;
&lt;li&gt;Using shared file systems with Loki can lead to instability.&lt;/li&gt;
&lt;li&gt;Shared file systems are prone to several issues, including inconsistent performance, locking problems, and increased risk of data corruption, especially under high load.&lt;/li&gt;
&lt;li&gt;Durability of the data depends solely on the file system’s reliability, which can be unpredictable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our team decided to use object stores like S3 or GCS. Object stores are specifically engineered for high durability and provide advanced behind-the-scenes mechanisms—such as automatic replication, versioning, and redundancy—to ensure your data remains safe and consistent, even in the face of failures or surges.&lt;/p&gt;

&lt;p&gt;In this blog post, we will share how we achieved high availability (HA) and configured disaster recovery (DR) for Loki with AWS S3 as our object store. This ensures we can prevent or minimize data loss and business disruption from catastrophic events. First, let’s briefly discuss Loki and see what makes it different.&lt;/p&gt;
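&lt;p&gt;For reference, pointing Loki at S3 is mostly a configuration change. A sketch of the relevant fragment is below; the bucket name, region, and schema date are placeholders, and exact keys can vary between Loki versions:&lt;/p&gt;

```yaml
common:
  replication_factor: 3            # replicate each stream across 3 ingesters for HA
  storage:
    s3:
      region: us-east-1            # placeholder
      bucketnames: my-loki-chunks  # placeholder
schema_config:
  configs:
    - from: "2024-01-01"           # placeholder start date
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h
```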

&lt;h2&gt;
  
  
  What is Loki, and how does it help with observability?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://grafana.com/docs/loki/latest/" rel="noopener noreferrer"&gt;Loki&lt;/a&gt; is a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus. Loki differs from Prometheus by focusing on logs instead of metrics, and collecting logs via push, instead of pull. It is designed to be very cost-effective and highly scalable. Unlike other logging systems, Loki does not index the contents of the logs but only indexes metadata about your logs as a set of labels for each log stream. &lt;/p&gt;

&lt;p&gt;A log stream is a set of logs that share the same labels. Labels help Loki to find a log stream within your data store, so having a quality set of labels is key to efficient query execution.&lt;/p&gt;
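&lt;p&gt;For example, in LogQL (Loki’s query language) a stream is selected purely by its labels, and filtering on log content happens afterwards (the label names here are illustrative):&lt;/p&gt;

```logql
{app="payments", env="prod"} |= "error"

sum by (pod) (rate({app="payments", env="prod"} |= "error" [5m]))
```

&lt;p&gt;The first query fetches error lines from one stream selector; the second turns them into a per-pod error rate. Both stay cheap only if the label set is small and low-cardinality.&lt;/p&gt;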

&lt;p&gt;Log data is then compressed and stored in chunks in an object store such as Amazon Simple Storage Service (S3) or Google Cloud Storage (GCS) or, for development or proof of concept, on the file system. A small index and highly compressed chunks simplify the operation and significantly lower Loki's cost. Now, we can understand the Loki deployment modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Loki Deployment modes
&lt;/h2&gt;

&lt;p&gt;Loki is a distributed system composed of multiple microservices, each responsible for specific tasks. These microservices can be deployed independently or together in a unique build mode where all services coexist within the same binary. Understanding the available deployment modes helps you decide how to structure these microservices to achieve optimal performance, scalability, and resilience in your environment. Different modes will impact how Loki's components—like the Distributor, Ingester, Querier, and others—interact and how efficiently they manage logs. &lt;/p&gt;

&lt;p&gt;The list of Loki microservices includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache Generation Loader&lt;/li&gt;
&lt;li&gt;Compactor&lt;/li&gt;
&lt;li&gt;Distributor&lt;/li&gt;
&lt;li&gt;Index-gateway&lt;/li&gt;
&lt;li&gt;Ingester&lt;/li&gt;
&lt;li&gt;Ingester-Querier&lt;/li&gt;
&lt;li&gt;Overrides Exporter&lt;/li&gt;
&lt;li&gt;Querier&lt;/li&gt;
&lt;li&gt;Query-frontend&lt;/li&gt;
&lt;li&gt;Query-scheduler&lt;/li&gt;
&lt;li&gt;Ruler&lt;/li&gt;
&lt;li&gt;Table Manager (deprecated)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Different deployment modes
&lt;/h3&gt;

&lt;p&gt;Loki offers different deployment modes, which allow us to build a highly available logging system. We need to choose the modes considering our log reads/writes rate, maintenance overhead, and complexity. Loki can be deployed in three modes, each suited for varying scales and complexity. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monolithic mode&lt;/strong&gt;: The &lt;a href="https://grafana.com/docs/loki/latest/get-started/deployment-modes/#monolithic-mode" rel="noopener noreferrer"&gt;monolithic mode&lt;/a&gt; is the simplest option, where all of Loki’s microservices run within a single binary or Docker image using the &lt;code&gt;all&lt;/code&gt; target. The &lt;code&gt;target&lt;/code&gt; flag specifies which microservices run on startup. This mode is ideal for getting started with Loki, as it can handle log volumes of up to approximately 20 GB/day. High availability can be achieved by running multiple instances of the monolithic setup.&lt;/p&gt;
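
&lt;p&gt;As a quick sketch, starting the single binary in this mode looks like the following; the config file path is an example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;loki -config.file=/etc/loki/loki.yaml -target=all
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;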

&lt;p&gt;&lt;strong&gt;Simple Scalable Deployment (SSD) mode&lt;/strong&gt;: The &lt;a href="https://grafana.com/docs/loki/latest/get-started/deployment-modes/#simple-scalable" rel="noopener noreferrer"&gt;Simple Scalable Deployment (SSD) mode&lt;/a&gt; is the preferred mode for most installations and is the default configuration when installing Loki via Helm charts. This mode balances simplicity and scalability by separating the execution paths into distinct targets: READ, WRITE, and BACKEND. These targets can be scaled independently based on business needs, allowing this deployment to handle up to a few terabytes of logs per day. The SSD mode requires a reverse proxy, such as Nginx, to route client API requests to the appropriate read or write nodes, and this setup is included by default in the Loki Helm chart. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microservices Deployment mode&lt;/strong&gt;: The &lt;a href="https://grafana.com/docs/loki/latest/get-started/deployment-modes/#microservices-mode" rel="noopener noreferrer"&gt;microservices deployment mode&lt;/a&gt; is the most granular and scalable option, where each Loki component runs as a separate process specified by individual targets. While this mode offers the highest control over scaling and cluster management, it is also the most complex to configure and maintain. Therefore, microservices mode is recommended only for huge Loki clusters or operators requiring precise control over the infrastructure.&lt;/p&gt;

&lt;h2&gt;Achieving high availability (HA) in Loki&lt;/h2&gt;

&lt;p&gt;To achieve HA in Loki, we would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure multiple Loki instances using the &lt;code&gt;memberlist_config&lt;/code&gt; configuration.&lt;/li&gt;
&lt;li&gt;Use a shared object store for logs, such as:

&lt;ul&gt;
&lt;li&gt;AWS S3&lt;/li&gt;
&lt;li&gt;Google Cloud Storage&lt;/li&gt;
&lt;li&gt;Any self-hosted storage&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Set the &lt;code&gt;replication_factor&lt;/code&gt; to 3.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;These steps help ensure your logging service remains resilient and responsive.&lt;/p&gt;

&lt;h4&gt;&lt;strong&gt;Memberlist Config&lt;/strong&gt;&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;memberlist_config&lt;/code&gt; is a key configuration element for achieving high availability in distributed systems like Loki. It enables the discovery and communication between multiple Loki instances, allowing them to form a cluster. This configuration is essential for synchronizing the state of the ingesters and ensuring they can share information about data writes, which helps maintain consistency across your logging system.&lt;/p&gt;

&lt;p&gt;In a high-availability setup, &lt;code&gt;memberlist_config&lt;/code&gt; facilitates the dynamic management of instances, allowing the system to respond to failures and maintain service continuity. Other factors contributing to high availability include quorum, Write-Ahead Log (WAL), and replication factor.&lt;/p&gt;
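
&lt;p&gt;A minimal sketch of this block in Loki's configuration; the join address assumes the headless memberlist service created by the Helm chart in a &lt;code&gt;loki&lt;/code&gt; namespace:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;memberlist:
  join_members:
    # 7946 is the default gossip port
    - loki-memberlist.loki.svc.cluster.local:7946
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;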

&lt;h4&gt;&lt;strong&gt;Replication Factor, Quorum, and Write-Ahead Log (WAL)&lt;/strong&gt;&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Replication factor:&lt;/strong&gt; Typically set to 3, the replication factor ensures that data is written to multiple ingesters (servers), preventing data loss during restarts or failures. Having multiple copies of the same data increases redundancy and reliability in your logging system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quorum:&lt;/strong&gt; With a replication factor of 3, at least 2 out of 3 writes must succeed to avoid errors. This means the system can tolerate the loss of one ingester without losing any data. If two ingesters fail, however, the system will not be able to process writes successfully, thus emphasizing the importance of having a sufficient number of active ingesters to maintain availability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write-Ahead Log (WAL):&lt;/strong&gt; The Write-Ahead Log provides an additional layer of protection against data loss by logging incoming writes to disk. This mechanism is enabled by default and ensures that even if an ingester crashes, the data can be recovered from the WAL. The combination of replication and WAL is crucial for maintaining data integrity, as it ensures that your data remains consistent and retrievable, even in the face of component failures.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We chose the Simple Scalable Deployment (SSD) mode as the default deployment method for running Loki instead of using multiple instances in monolithic mode for high availability. The SSD mode strikes a balance between ease of use and the ability to scale independently, making it an ideal choice for our needs. Additionally, we opted to use AWS S3 as the object store while running our application and Loki in AWS EKS services, which provides a robust and reliable infrastructure for our logging needs.&lt;/p&gt;

&lt;p&gt;To streamline the setup process, refer to the &lt;a href="https://grafana.com/docs/loki/latest/configure/storage/#aws-deployment-s3-single-store" rel="noopener noreferrer"&gt;Terraform example&lt;/a&gt; code snippet to create the required AWS resources, such as IAM roles, policies, and an S3 bucket with appropriate bucket policies. This code helps automate the provisioning of the necessary infrastructure, ensuring that you have a consistent and repeatable environment for running Loki with high availability.&lt;/p&gt;

&lt;h5&gt;Guide to install Loki&lt;/h5&gt;

&lt;p&gt;Following the guide, you can &lt;a href="https://grafana.com/docs/loki/latest/setup/install/helm/install-scalable/#install-the-simple-scalable-helm-chart" rel="noopener noreferrer"&gt;install Loki in Simple Scalable mode&lt;/a&gt; with AWS S3 as the object store. Below are the Helm chart values for reference, which you can customize based on your requirements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml&lt;/span&gt;
&lt;span class="c1"&gt;# Grafana loki parameters: https://grafana.com/docs/loki/latest/configure/&lt;/span&gt;

&lt;span class="na"&gt;loki&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;storage_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# using tsdb instead of boltdb&lt;/span&gt;
    &lt;span class="na"&gt;tsdb_shipper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;active_index_directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/loki/tsdb-shipper-active&lt;/span&gt;
      &lt;span class="na"&gt;cache_location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/loki/tsdb-shipper-cache&lt;/span&gt;
      &lt;span class="na"&gt;cache_ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;72h&lt;/span&gt; &lt;span class="c1"&gt;# Can be increased for faster performance over longer query periods, uses more disk space&lt;/span&gt;
      &lt;span class="na"&gt;shared_store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3&lt;/span&gt;
  &lt;span class="na"&gt;schemaConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2020-10-24&lt;/span&gt;
        &lt;span class="na"&gt;store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tsdb&lt;/span&gt;
        &lt;span class="na"&gt;object_store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3&lt;/span&gt;
        &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v12&lt;/span&gt;
        &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;index_&lt;/span&gt;
          &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;24h&lt;/span&gt;
  &lt;span class="na"&gt;commonConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path_prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/loki&lt;/span&gt;
    &lt;span class="na"&gt;replication_factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;ring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;kvstore&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;store&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;memberlist&lt;/span&gt;
  &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;bucketNames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;chunks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-s3-bucket-name&lt;/span&gt;
      &lt;span class="na"&gt;ruler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-s3-bucket-name&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3&lt;/span&gt;
    &lt;span class="na"&gt;s3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# endpoint is required if we are using aws IAM user secret access id and key to connect to s3&lt;/span&gt;
      &lt;span class="c1"&gt;# endpoint: "s3.amazonaws.com"&lt;/span&gt;
      &lt;span class="c1"&gt;# Region of the bucket&lt;/span&gt;
      &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3-bucket-region&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To ensure the Loki pods are in the &lt;code&gt;Running&lt;/code&gt; state, use the command &lt;code&gt;kubectl get pods -n loki&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpf5uceu8sxlgs86xzkz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpf5uceu8sxlgs86xzkz.png" alt="Multiple replicas of Loki" width="780" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this setup, we are running multiple replicas of Loki read, write, and backend pods.&lt;/p&gt;

&lt;p&gt;With a &lt;code&gt;replication_factor&lt;/code&gt; of 3, it is imperative to ensure that both the write and backend are operating with 3 replicas; otherwise, the quorum will fail, and Loki will be unavailable.&lt;/p&gt;

&lt;p&gt;The following image illustrates Loki's integration with Amazon S3 for log storage in a single-tenant environment. In this configuration, logs are organized into two primary folders within the S3 bucket: &lt;strong&gt;index&lt;/strong&gt; and &lt;strong&gt;fake&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Index folder:&lt;/strong&gt; This folder contains the index files that allow Loki to efficiently query and retrieve log data. The index serves as a mapping of log entries, enabling fast search operations and optimizing the performance of log retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fake folder:&lt;/strong&gt; This folder stores the actual log data. When multi-tenancy is disabled, Loki assigns all logs the default tenant ID &lt;code&gt;fake&lt;/code&gt;, so chunks land in a folder of that name; despite the label, it holds the real logs generated by your applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Failqdn4g1zbzd6i81fq8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Failqdn4g1zbzd6i81fq8.png" alt="Loki is running with HA" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now Loki is running with HA. Using &lt;a href="https://grafana.com/docs/loki/latest/query/logcli/" rel="noopener noreferrer"&gt;logcli&lt;/a&gt;, we can verify the logs by querying the Loki instances.&lt;/p&gt;
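
&lt;p&gt;For example, assuming the gateway is reachable at the address below and the label is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;export LOKI_ADDR=http://loki-gateway.loki.svc:3100
logcli query '{app="my-app"}' --since=1h --limit=20
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;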

&lt;h2&gt;Exploring approaches for disaster recovery&lt;/h2&gt;

&lt;p&gt;Loki is a critical component of our application stack, responsible for aggregating logs from multiple microservices and displaying them in the web application console for end-user access. These logs need to be retained for an extended period—up to 90 days.&lt;/p&gt;

&lt;p&gt;As part of our disaster recovery (DR) strategy for the application stack, ensuring the availability and accessibility of logs during a disaster is crucial. If Region-1 becomes unavailable, the applications must continue to run and access logs seamlessly. To address this, we decided to implement high availability for Loki by running two instances in separate regions. If one Loki instance fails, the instance in the other region should continue to handle both read and write operations for the logs.&lt;/p&gt;

&lt;p&gt;We explored three different approaches to setting up DR for Loki, intending to enable read and write capabilities across both Region-1 and Region-2, ensuring fault tolerance and uninterrupted log management.&lt;/p&gt;

&lt;h3&gt;Approach 1: Implementing S3 Cross-Region Replication&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html" rel="noopener noreferrer"&gt;AWS S3 Cross-Region Replication (CRR)&lt;/a&gt;&lt;/strong&gt; is a feature that allows you to automatically replicate objects from one S3 bucket to another bucket in a different AWS region. This is particularly useful for enhancing data durability, availability, and compliance by ensuring that your data is stored in multiple geographic locations. With CRR enabled, any new objects added to your source bucket are automatically replicated to the destination bucket, providing a backup in case of regional failures or disasters.&lt;/p&gt;

&lt;p&gt;In Loki, setting up S3 CRR means that logs written to a single S3 bucket are automatically duplicated to another region. This setup ensures that logs are accessible even if one region encounters issues. However, when using multiple cross-region instances of Loki pointing to the same S3 bucket, there can be &lt;a href="https://taisho6339.gitbook.io/grafana-loki-deep-dive/ingestion-process/flush-memory-chunks" rel="noopener noreferrer"&gt;delays in log accessibility due to how Loki handles log flushing&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;Flushing logs and configuration parameters&lt;/h4&gt;

&lt;p&gt;When logs are generated, Loki stores them in chunks, which are temporary data structures that hold log entries before they are flushed to the object store (in this case, S3). The flushing process is controlled by two critical parameters: &lt;code&gt;max_chunk_age&lt;/code&gt; and &lt;code&gt;chunk_idle_period&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://grafana.com/blog/2021/02/16/the-essential-config-settings-you-should-use-so-you-wont-drop-logs-in-loki/#essential-configuration-settings-1" rel="noopener noreferrer"&gt;Max Chunk Age&lt;/a&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;max_chunk_age&lt;/code&gt; parameter defines the maximum time a log stream can be buffered in memory before it is flushed to the object store. When this value is set to a lower threshold (less than 2 hours), Loki flushes chunks more frequently. This leads to higher storage input/output (I/O) activity but reduces memory usage because logs are stored in S3 more often.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Conversely, if &lt;code&gt;max_chunk_age&lt;/code&gt; is set to a higher value (greater than 2 hours), it results in less frequent flushing, which can lead to higher memory consumption. In this case, there is also an increased risk of data loss if an ingester (the component that processes and writes logs) fails before the buffered data is flushed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://grafana.com/blog/2021/02/16/the-essential-config-settings-you-should-use-so-you-wont-drop-logs-in-loki/#essential-configuration-settings-1" rel="noopener noreferrer"&gt;Chunk Idle Period&lt;/a&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;chunk_idle_period&lt;/code&gt; parameter determines how long Loki waits for new log entries in a stream before considering that stream idle and flushing the chunk. A lower value (less than 2 hours) can lead to the creation of too many small chunks, increasing the storage I/O demands.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On the other hand, setting a higher value (greater than 2 hours) allows inactive streams to retain logs in memory longer, which can enhance retention but may lead to potential memory inefficiency if many streams become idle.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
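
&lt;p&gt;Both parameters sit under the &lt;code&gt;ingester&lt;/code&gt; block in Loki's configuration; the values below are only indicative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;ingester:
  max_chunk_age: 2h        # flush a chunk no later than 2h after it was opened
  chunk_idle_period: 30m   # flush a chunk once its stream has been idle for 30m
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;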

&lt;p&gt;This example shows querying logs from one Loki instance, which points to the CRR-enabled S3 bucket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgummfhofx1zx813620gh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgummfhofx1zx813620gh.png" alt="querying logs from one Loki instance" width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we are querying the logs from another Loki instance, which reads from the same CRR-enabled S3 bucket. You can observe a delay of ~2 hours in the retrieved logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftothjkao3zl8x01doi3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftothjkao3zl8x01doi3y.png" alt="querying logs from another Loki instance" width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this approach, in the event of a disaster or failover in one region, there is a risk of losing up to 2 hours of log data. This potential data loss occurs because logs that have not yet been flushed from memory to the S3 bucket during that time frame may not be recoverable if the ingester fails. &lt;/p&gt;

&lt;p&gt;Also, &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html#crr-scenario" rel="noopener noreferrer"&gt;Cross-Region Replication&lt;/a&gt; is an asynchronous process, but the objects are eventually replicated. Most objects replicate within 15 minutes, but &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/replication-troubleshoot.html" rel="noopener noreferrer"&gt;sometimes replication can take a couple of hours or more&lt;/a&gt;. Several factors affect replication time, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The size of the objects to replicate.&lt;/li&gt;
&lt;li&gt;The number of objects to replicate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, if Amazon S3 is replicating more than 3,500 objects per second, then there might be latency while the destination bucket &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html" rel="noopener noreferrer"&gt;scales up for the request rate&lt;/a&gt;. Therefore, we wanted real-time logs to be accessible from both instances of Loki running in different regions, so we decided against using AWS S3 Cross-Region Replication (CRR). This choice was made to minimize delays and ensure that logs could be retrieved promptly from both instances without the 2-hour latency associated with chunk flushing when using CRR. Instead, we focused on optimizing our setup to enable immediate log access across regions.&lt;/p&gt;

&lt;h3&gt;Approach 2: Utilizing the S3 Multi-Region Access Point&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/s3/features/multi-region-access-points/" rel="noopener noreferrer"&gt;Amazon S3 Multi-Region Access Points&lt;/a&gt; (MRAP) offer a global endpoint for routing S3 request traffic across multiple AWS Regions, simplifying the architecture by eliminating complex networking setups. While Loki does not directly support MRAP endpoints, this feature can still enhance your logging solution. MRAP allows for centralized log management, improving performance by routing requests to the nearest S3 bucket, which reduces latency. It also boosts redundancy and reliability by rerouting traffic during regional outages, ensuring logs remain accessible. Additionally, MRAP can help minimize cross-region data transfer fees, making it a cost-effective option. However, at the time of this writing, there is a known bug that prevents Loki from effectively using this endpoint. Understanding MRAP can still be beneficial for future scalability and efficiency in your logging infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff50rpcq7nwkq70llsvzr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff50rpcq7nwkq70llsvzr.png" alt="Utilizing the S3 Multi-Region Access Point" width="800" height="94"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Approach 3: Employing Vector as a Sidecar&lt;/h3&gt;

&lt;p&gt;We decided to use &lt;a href="https://vector.dev/" rel="noopener noreferrer"&gt;Vector&lt;/a&gt;, a lightweight and ultra-fast tool for building observability pipelines. With Vector, we could collect, transform, and route logs to AWS S3.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Our infrastructure consists of one S3 bucket and one Loki instance per region.&lt;/li&gt;
&lt;li&gt;Vector runs as a sidecar alongside the application pods.&lt;/li&gt;
&lt;li&gt;Since the EKS clusters are connected via a transit gateway, we configured private endpoints for both Loki instances. We don't want to expose them publicly, as they carry application logs.&lt;/li&gt;
&lt;li&gt;We configured Vector sources to read the application logs, a transform to process them, and sinks to write to both Loki instances.&lt;/li&gt;
&lt;/ol&gt;
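
&lt;p&gt;A trimmed sketch of the sidecar wiring in an application pod spec; the image tags, names, and ConfigMap are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;spec:
  containers:
    - name: app
      image: my-app:latest
      volumeMounts:
        - name: app-logs
          mountPath: /var/log        # the app writes its log files here
    - name: vector
      image: timberio/vector:latest-alpine  # pin a specific version in practice
      args: ["--config", "/etc/vector/vector.yaml"]
      volumeMounts:
        - name: app-logs
          mountPath: /var/log        # Vector tails the same files
          readOnly: true
        - name: vector-config
          mountPath: /etc/vector
  volumes:
    - name: app-logs
      emptyDir: {}
    - name: vector-config
      configMap:
        name: vector-sidecar-config  # holds a vector.yaml like the one below
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;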

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8841pfth3u7zn40tuw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8841pfth3u7zn40tuw9.png" alt="Employing Vector as a Sidecar" width="693" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This way, all logs are ingested into and available from both Loki instances, with no need for cross-region replication or for sharing the same bucket across regions.&lt;/p&gt;

&lt;h3&gt;Vector configuration&lt;/h3&gt;

&lt;p&gt;Vector Remap Language (VRL) is an expression-oriented language designed for transforming observability data (logs and metrics) in a safe and performant manner.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://vector.dev/docs/reference/configuration/sources/" rel="noopener noreferrer"&gt;Sources&lt;/a&gt; collect or receive data from observability data sources into Vector.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://vector.dev/docs/reference/configuration/transforms/" rel="noopener noreferrer"&gt;Transforms&lt;/a&gt; manipulate, or change that observability data as it passes through your topology.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://vector.dev/docs/reference/configuration/sinks/" rel="noopener noreferrer"&gt;Sinks&lt;/a&gt; send data onward from Vector to external services or destinations.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;data_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/vector-data-dir&lt;/span&gt;
&lt;span class="na"&gt;sinks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Write events to Loki in the same cluster&lt;/span&gt;
  &lt;span class="na"&gt;loki_write&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;encoding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;codec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;json&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://loki-write.loki:3100&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;my_transform_id&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki&lt;/span&gt;
  &lt;span class="c1"&gt;# Write events to Loki in the cross-region cluster&lt;/span&gt;
  &lt;span class="na"&gt;loki_cross&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;encoding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;codec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;json&lt;/span&gt;
    &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://loki-write.aws-us-west-2.loki&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;my_transform_id&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki&lt;/span&gt;
&lt;span class="c1"&gt;# Define the source to read log file&lt;/span&gt;
&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;my_source_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
    &lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/var/log/**/*.log&lt;/span&gt;
&lt;span class="c1"&gt;# Define the transform to parse syslog messages&lt;/span&gt;
&lt;span class="na"&gt;transforms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;my_transform_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;remap&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;my_source_id&lt;/span&gt;
    &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;. = parse_json(.message)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this setup, Vector &lt;strong&gt;collects logs&lt;/strong&gt; from the &lt;code&gt;/var/log/&lt;/code&gt; directory, &lt;strong&gt;parses&lt;/strong&gt; each event's message as JSON, replacing the entire event with the parsed JSON object, and &lt;strong&gt;sends&lt;/strong&gt; it to two Loki destinations (local and cross-region). The configuration ensures logs are delivered in &lt;strong&gt;JSON format&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The journey to achieving high availability (HA) and disaster recovery (DR) for Loki has been challenging and enlightening. Through exploring various deployment modes and approaches, we've gained a deeper understanding of ensuring our logging system can withstand and recover from potential disruptions. The successful implementation of a Simple Scalable Mode with an S3 backend and the innovative use of Vector as a sidecar have fortified our system's resilience and underscored the importance of proactive planning and continuous improvement in our infrastructure.&lt;/p&gt;

&lt;p&gt;I hope you found this post informative and engaging. I’d love to hear your thoughts on this post; let’s connect and start a conversation on &lt;a href="https://www.linkedin.com/in/pavan-n-g-94744b105/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;. Looking for help with Kubernetes? Do check out how we’re helping startups &amp;amp; enterprises as a &lt;a href="https://www.infracloud.io/kubernetes-consulting-partner/" rel="noopener noreferrer"&gt;Kubernetes consulting services provider&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
