<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Saket</title>
    <description>The latest articles on DEV Community by Saket (@archcode01).</description>
    <link>https://dev.to/archcode01</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F458131%2F4613eb16-9d13-42d1-8933-0566a8210398.png</url>
      <title>DEV Community: Saket</title>
      <link>https://dev.to/archcode01</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/archcode01"/>
    <language>en</language>
    <item>
      <title>Extend Bitnami Cassandra Image to customize the configuration in cassandra.yaml</title>
      <dc:creator>Saket</dc:creator>
      <pubDate>Mon, 29 Apr 2024 12:07:42 +0000</pubDate>
      <link>https://dev.to/archcode01/extend-bitnami-cassandra-image-to-customize-the-configuration-in-cassandrayaml-nma</link>
      <guid>https://dev.to/archcode01/extend-bitnami-cassandra-image-to-customize-the-configuration-in-cassandrayaml-nma</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fst12bpcw0r8uxm52sb9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fst12bpcw0r8uxm52sb9w.png" alt="Cassandra on Kubernetes" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cassandra is a NoSQL column-family database. It is generally recommended for use cases that need fast writes. Because it is stateful, Cassandra is deployed on Kubernetes as a StatefulSet, unlike stateless applications, which are deployed as Deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bitnami Cassandra Image
&lt;/h2&gt;

&lt;p&gt;There are multiple benefits to using the images from &lt;a href="https://github.com/bitnami/containers/tree/main/bitnami/cassandra"&gt;Bitnami&lt;/a&gt;; we can refer to their GitHub repo for additional details. &lt;br&gt;
The Bitnami Cassandra image gives us the option to override a few of the configurations in the cassandra.yaml file by passing the values as environment variables.&lt;br&gt;
For example, when we provide the environment variable CASSANDRA_CLUSTER_NAME to the container, its value is written to the cluster_name field in cassandra.yaml.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#cassandra.yaml
..
...
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: 'Dev Cassandra Cluster'
...
..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The container starts with a command that executes a shell script. When it runs, this script generates a cassandra.yaml file from the default configuration and any parameters provided to the container. Once the file is generated, it is placed in the appropriate location, and the last step of the script starts the Cassandra process.&lt;br&gt;
As with the cluster_name configuration explained above, there are various other configurations that can be updated by providing values through environment variables. For the full list of such variables, please go through the GitHub page of the Bitnami Cassandra image. &lt;/p&gt;
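&lt;p&gt;To make the env-var-to-yaml mechanism concrete, here is a small Python sketch of the idea. The actual Bitnami logic lives in bash scripts; the variable-to-field mapping below is a simplified assumption for illustration only.&lt;/p&gt;

```python
import re

# Simplified, hypothetical mapping from environment variables to
# cassandra.yaml fields -- the real Bitnami scripts handle many more.
ENV_TO_YAML = {
    "CASSANDRA_CLUSTER_NAME": "cluster_name",
    "CASSANDRA_NUM_TOKENS": "num_tokens",
}

def render_config(template: str, env: dict) -> str:
    """Rewrite the `key: value` line for every mapped env var that is set."""
    for env_var, yaml_key in ENV_TO_YAML.items():
        if env_var in env:
            template = re.sub(
                rf"^{yaml_key}:.*$",
                f"{yaml_key}: '{env[env_var]}'",
                template,
                flags=re.MULTILINE,
            )
    return template

base = "cluster_name: 'Test Cluster'\nnum_tokens: 16\n"
print(render_config(base, {"CASSANDRA_CLUSTER_NAME": "Dev Cassandra Cluster"}))
```

&lt;p&gt;Running it with CASSANDRA_CLUSTER_NAME provided rewrites only the cluster_name line, which mirrors how the image overrides individual fields while leaving the rest of the defaults untouched.&lt;/p&gt;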
&lt;h2&gt;
  
  
  Need for further customization
&lt;/h2&gt;

&lt;p&gt;The Bitnami image does provide custom configuration for some of the fields in cassandra.yaml. While working on a POC we encountered a problem with queries failing because the tombstone failure threshold was exceeded (a quick search will explain what tombstones are in Cassandra).&lt;br&gt;
The default tombstone failure threshold is 100000, but in our POC use case there was nothing to worry about even if this threshold was breached. So we wanted a way to customize this setting on the Bitnami Cassandra image we were using. The nearest solution we found was to provide an entire cassandra.yaml file to the image. That way we could override whichever configuration we wanted, and it worked in a simple single-instance cluster. But when we scaled the cluster to 3 instances, the custom configuration file prevented the new instances from joining the cluster because of its seed node address setting. &lt;/p&gt;
&lt;h2&gt;
  
  
  Bitnami Image code on Github
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f1rjzatin4lcccz6t3s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f1rjzatin4lcccz6t3s.png" alt="Bitnami Cassandra Github" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we check the rootfs/opt/bitnami/scripts folder, it has a few shell scripts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06zatur15o8m9gyn3stj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06zatur15o8m9gyn3stj.png" alt="Bitnami Cassandra Scripts Github" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;libcassandra.sh&lt;/u&gt;&lt;/strong&gt; -&amp;gt; responsible for setting up cassandra.yaml by reading the configurations from environment variables. &lt;br&gt;
&lt;strong&gt;&lt;u&gt;cassandra-env.sh&lt;/u&gt;&lt;/strong&gt; -&amp;gt; sourced by libcassandra.sh; responsible for injecting the environment variables into the environment using export.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to customize the Bitnami image to override configuration through environment variables
&lt;/h2&gt;

&lt;p&gt;In our case we wanted to override the tombstone failure threshold in cassandra.yaml. To achieve this, we introduced new variables towards the bottom of the cassandra-env.sh file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash
#
# Environment configuration for cassandra

# The values for all environment variables will be set in the below order of precedence
# 1. Custom environment variables defined below after Bitnami defaults
# 2. Constants defined in this file (environment variables with no default), i.e. BITNAMI_ROOT_DIR
# 3. Environment variables overridden via external files using *_FILE variables (see below)
# 4. Environment variables set externally (i.e. current Bash context/Dockerfile/userdata)

# Load logging library
# shellcheck disable=SC1090,SC1091
. /opt/bitnami/scripts/liblog.sh

export BITNAMI_ROOT_DIR="/opt/bitnami"
export BITNAMI_VOLUME_DIR="/bitnami"
…
…
..

# Custom environment variables may be defined below
export CASSANDRA_TOMBSTONE_WARN_THRESHOLD="${CASSANDRA_TOMBSTONE_WARN_THRESHOLD:-1000}"
export CASSANDRA_TOMBSTONE_FAILURE_THRESHOLD="${CASSANDRA_TOMBSTONE_FAILURE_THRESHOLD:-100000}"


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code exports the failure threshold value if it is passed in as an environment variable, or defaults it to 100000; the warn threshold variable works the same way with a default of 1000.&lt;br&gt;
In the libcassandra.sh file there is a function called cassandra_setup_cluster(). This function actually sets up the cassandra.yaml file, which holds all the configuration for the instance.&lt;br&gt;
It calls the function below for each configuration in cassandra.yaml that needs to be customized or overridden.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cassandra_yaml_set "listen_address" "$host"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So we can call the same function and set the tombstone-related threshold values accordingly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        cassandra_yaml_set "tombstone_warn_threshold" "$CASSANDRA_TOMBSTONE_WARN_THRESHOLD"
        debug "setting tombstone_warn_threshold to $CASSANDRA_TOMBSTONE_WARN_THRESHOLD"
        cassandra_yaml_set "tombstone_failure_threshold" "$CASSANDRA_TOMBSTONE_FAILURE_THRESHOLD"
        debug "setting tombstone_failure_threshold to $CASSANDRA_TOMBSTONE_FAILURE_THRESHOLD"       

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
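&lt;p&gt;Putting the two pieces together: the export with a default in cassandra-env.sh plus the cassandra_yaml_set calls above give an externally supplied value precedence over the baked-in default. A minimal Python sketch of that ${VAR:-default} precedence, for illustration only (the image itself does this in bash):&lt;/p&gt;

```python
import os

def resolve(name: str, default: str) -> str:
    """Mimic bash's ${VAR:-default}: use the external value when it is
    set and non-empty, otherwise fall back to the image's default."""
    value = os.environ.get(name)
    return value if value else default

# Caller overrides the failure threshold; the warn threshold keeps its default.
os.environ["CASSANDRA_TOMBSTONE_FAILURE_THRESHOLD"] = "200000"
os.environ.pop("CASSANDRA_TOMBSTONE_WARN_THRESHOLD", None)

print(resolve("CASSANDRA_TOMBSTONE_FAILURE_THRESHOLD", "100000"))  # 200000
print(resolve("CASSANDRA_TOMBSTONE_WARN_THRESHOLD", "1000"))       # 1000
```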



&lt;p&gt;With the help of a Dockerfile we can now build the custom Cassandra image and push it to an image repository. &lt;br&gt;
When this image is used via the Docker CLI or in a Kubernetes pod, and we want to override the default tombstone threshold values, we can pass the environment variable below with our custom override value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Docker run Cassandra –iimage mycustomcassandra:latest -e CASSANDRA_TOMBSTONE_FAILURE_THRESHOLD=200000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or in a Kubernetes StatefulSet definition as follows:&lt;br&gt;
(&lt;em&gt;Please check the last two env values passed to the StatefulSet&lt;/em&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: cassandra
  namespace: cassandra
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/instance: cassandra
      app.kubernetes.io/name: cassandra
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: cassandra
        app.kubernetes.io/name: cassandra
    spec:
      containers:
        - name: cassandra
          image: my-custom-cassandra:latest
          command:
            - bash
            - '-ec'
            - &amp;gt;
              # Node 0 is the password seeder

              if [[ $POD_NAME =~ (.*)-0$ ]]; then
                  echo "Setting node as password seeder"
                  export CASSANDRA_PASSWORD_SEEDER=yes
              else
                  # Only node 0 will execute the startup initdb scripts
                  export CASSANDRA_IGNORE_INITDB_SCRIPTS=1
              fi

              /opt/bitnami/scripts/cassandra/entrypoint.sh
              /opt/bitnami/scripts/cassandra/run.sh
          ports:
            - name: intra
              containerPort: 7000
              protocol: TCP
            - name: tls
              containerPort: 7001
              protocol: TCP
            - name: jmx
              containerPort: 7199
              protocol: TCP
            - name: cql
              containerPort: 9042
              protocol: TCP
            - name: thrift
              containerPort: 9160
              protocol: TCP
          env:
            - name: BITNAMI_DEBUG
              value: 'false'
            - name: CASSANDRA_CLUSTER_NAME
              value: cassandra
            - name: CASSANDRA_SEEDS
              value: cassandra-0.cassandra-headless.cassandra.svc.cluster.local
            - name: CASSANDRA_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: cassandra
                  key: cassandra-password
            - name: POD_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: CASSANDRA_USER
              value: cassandra
            - name: CASSANDRA_NUM_TOKENS
              value: '256'
            - name: CASSANDRA_DATACENTER
              value: dc1
            - name: CASSANDRA_ENDPOINT_SNITCH
              value: SimpleSnitch
            - name: CASSANDRA_KEYSTORE_LOCATION
              value: /opt/bitnami/cassandra/certs/keystore
            - name: CASSANDRA_TRUSTSTORE_LOCATION
              value: /opt/bitnami/cassandra/certs/truststore
            - name: CASSANDRA_CLIENT_ENCRYPTION
              value: 'true'
            - name: CASSANDRA_TRUSTSTORE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: cassandra-tls-pass
                  key: truststore-password
            - name: CASSANDRA_KEYSTORE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: cassandra-tls-pass
                  key: keystore-password
            - name: CASSANDRA_RACK
              value: rack1
            - name: CASSANDRA_ENABLE_RPC
              value: 'true'
            - name: CASSANDRA_TRANSPORT_PORT_NUMBER
              value: '7000'
            - name: CASSANDRA_JMX_PORT_NUMBER
              value: '7199'
            - name: CASSANDRA_CQL_PORT_NUMBER
              value: '9042'
            - name: CASSANDRA_TOMBSTONE_FAILURE_THRESHOLD
              value: '200000'
            - name: CASSANDRA_TOMBSTONE_WARN_THRESHOLD
              value: '2000'
          resources:
            limits:
              cpu: '3'
              memory: 16Gi
            requests:
              cpu: 1500m
              memory: 8Gi
          volumeMounts:
            - name: data
              mountPath: /bitnami/cassandra
            - name: certs-shared
              mountPath: /opt/bitnami/cassandra/certs
          livenessProbe:
            exec:
              command:
                - /bin/bash
                - '-ec'
                - |
                  nodetool status
            initialDelaySeconds: 60
            timeoutSeconds: 5
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 5
          readinessProbe:
            exec:
              command:
                - /bin/bash
                - '-ec'
                - |
                  nodetool status | grep -E "^UN\\s+${POD_IP}"
            initialDelaySeconds: 60
            timeoutSeconds: 5
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 5
          securityContext:
            runAsUser: 1001
            runAsNonRoot: true
            allowPrivilegeEscalation: false
  volumeClaimTemplates:
    - kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: data
        creationTimestamp: null
        labels:
          app.kubernetes.io/instance: cassandra
          app.kubernetes.io/name: cassandra
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Ti
        storageClassName: default
        volumeMode: Filesystem

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>cassandra</category>
      <category>kubernetes</category>
      <category>bitnami</category>
    </item>
    <item>
      <title>Add Custom Headers to Outgoing request from applications deployed on Kubernetes</title>
      <dc:creator>Saket</dc:creator>
      <pubDate>Fri, 26 Apr 2024 15:24:11 +0000</pubDate>
      <link>https://dev.to/archcode01/add-custom-headers-to-outgoing-request-from-applications-deployed-on-kubernetes-37n7</link>
      <guid>https://dev.to/archcode01/add-custom-headers-to-outgoing-request-from-applications-deployed-on-kubernetes-37n7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Kubernetes is today's de facto platform for deploying containerized applications, including microservices, because of the benefits it offers: automated deployment, scaling, security, resiliency, load balancing and self-healing, among many others. &lt;br&gt;
This article explains how we can automate the insertion of custom headers into outgoing requests from applications deployed on Kubernetes with the Istio service mesh.&lt;/p&gt;
&lt;h2&gt;
  
  
  Communication between Pods using Kubernetes Service
&lt;/h2&gt;

&lt;p&gt;Each deployed pod gets its own unique internal IP address in the cluster. When applications running inside pods want to connect to other pods, they can connect using the pod name or pod IP address. &lt;br&gt;
When an application has multiple replicas, and therefore multiple pods, a Kubernetes Service is used to reach the destination. &lt;br&gt;
The diagram below shows pod-to-pod communication through a Kubernetes Service. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbx34i7iiplkw9xour0i3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbx34i7iiplkw9xour0i3.png" alt="Pod communication inside Kubernetes with K8S Service" width="751" height="411"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Istio Service Mesh
&lt;/h2&gt;

&lt;p&gt;A service mesh is an infrastructure layer that handles communication and networking between the services in an application. It manages traffic, monitors performance and enforces policies. &lt;br&gt;
Some of the benefits of using a service mesh include service discovery, traffic control, observability, security and compliance.&lt;br&gt;
&lt;a href="https://istio.io/latest/"&gt;Istio&lt;/a&gt; is an open-source service mesh which offers all of the above-mentioned features of a typical service mesh. You can read more about it here: &lt;a href="https://istio.io/latest/docs/reference/config/networking/envoy-filter/"&gt;Istio / Architecture&lt;/a&gt;&lt;br&gt;
Istio uses an extended version of the Envoy proxy. Envoy is a high-performance proxy developed in C++ that mediates all inbound and outbound traffic for all services in the service mesh. Envoy proxies are the only Istio components that interact with data plane traffic. Istio deploys Envoy proxies as sidecar containers alongside workloads, logically augmenting the services with many of Envoy's built-in features.&lt;/p&gt;

&lt;p&gt;Below diagram shows the communication across applications when Istio Service Mesh is used.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2l5j7pliy2nuif1gly8p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2l5j7pliy2nuif1gly8p.png" alt="Pod communication inside Kubernetes with Istio" width="751" height="351"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Insert Custom Headers to Outgoing Requests
&lt;/h2&gt;

&lt;p&gt;The Envoy proxy intercepts all incoming and outgoing requests to/from all the containers running in the pod. This configuration can be extended to implement various functionalities such as logging, auditing, authentication and request modification. Envoy filters are applied additively to the proxy configuration, so we can add multiple filters with different functionalities.&lt;br&gt;
Here we will see how we can extend this functionality to add custom headers to outgoing requests from the pods.&lt;/p&gt;

&lt;p&gt;We need to create an EnvoyFilter (an Istio Kubernetes CRD) and override the function that is invoked when a request is intercepted by the proxy.&lt;br&gt;
The following code enables Envoy's Lua filter for all outbound HTTP calls on service port 8080 of the myapplication service pod with the label "app: myapplication". The Lua filter adds a custom header key and value to the request handle's headers object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: customheaders-filter
spec:
  workloadSelector:
    labels:
      app: myapplication
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_OUTBOUND
      listener:
        portNumber: 8080
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
            subFilter:
              name: envoy.filters.http.router
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.lua
        typed_config:
          "@type": "type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua"
          inlineCode: |

              function envoy_on_request(request_handle)
                request_handle:headers():add( "my-custom-header", "my-custom-header-value" )
              end

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also extend this function to add the headers only if the destination host is the expected one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  function envoy_on_request(request_handle)
    local destinationHost = request_handle:headers():get(":authority")
    if destinationHost ~= nil and destinationHost ~= '' then
      request_handle:logInfo(" found authority  :"..destinationHost)
    else
      request_handle:logInfo(" no authority found")
      destinationHost = ''
    end

    request_handle:logInfo("destination host -&amp;gt; "..destinationHost) 
    request_handle:logInfo("configured host -&amp;gt; {{ $customHeadersByHost.host }}")

    if destinationHost:find({{ $customHeadersByHost.host | quote }}, 0, true) ~= nil then
      {{- range $header := $customHeadersByHost.headers }}
      request_handle:headers():add({{ $header.name | quote }} , {{ $header.value | quote }} )
      {{- end }}
      request_handle:logInfo(" added headers for host - {{ $customHeadersByHost.host }}")
    else
      request_handle:logInfo(" destination does not match configured host ")
    end

  end

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
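&lt;p&gt;The control flow of that Lua filter (read the :authority header, compare it against the configured host, and add the configured headers on a match) can be sketched in plain Python. This is only an illustration of the logic, not Envoy's API, and the host and header names are made-up examples:&lt;/p&gt;

```python
def add_headers_for_host(headers: dict, config: dict) -> dict:
    """Add the configured headers when the request's :authority
    (destination host) contains the configured host string."""
    destination = headers.get(":authority", "")
    if config["host"] in destination:
        for header in config["headers"]:
            headers[header["name"]] = header["value"]
    return headers

# Hypothetical configuration, analogous to the $customHeadersByHost values above.
config = {
    "host": "api.example.com",
    "headers": [{"name": "my-custom-header", "value": "my-custom-header-value"}],
}

request = {":authority": "api.example.com:8080"}
print(add_headers_for_host(request, config))
```

&lt;p&gt;Requests whose destination does not contain the configured host pass through unchanged, just as in the Lua version's else branch.&lt;/p&gt;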



&lt;p&gt;In order to view the Envoy filter proxy logs from the pods, we need to add the annotation below to the pod template if logging is restricted at the Istio system level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  sidecar.istio.io/logLevel: info
spec:
  ..
  ..


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To view the logs added in the envoy_on_request function above, we can use the kubectl command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs -n mynamespace myappliation-pod -c istio-proxy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
    <item>
      <title>Apache Airflow - For Beginners</title>
      <dc:creator>Saket</dc:creator>
      <pubDate>Mon, 18 Mar 2024 10:47:39 +0000</pubDate>
      <link>https://dev.to/archcode01/apache-airflow-for-beginners-2kn4</link>
      <guid>https://dev.to/archcode01/apache-airflow-for-beginners-2kn4</guid>
      <description>&lt;h2&gt;
  
  
  Airflow: A Beginner's Guide to Data Orchestration in Data Pipelines.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyinby4gycjftxir2zlp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyinby4gycjftxir2zlp.png" alt="Image description" width="638" height="601"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data pipelines are the lifeblood of modern data-driven applications. They automate the flow of data from various sources to one or more destinations, transforming it along the way. But managing these pipelines, especially complex ones, can be a challenge. Apache Airflow, an open-source solution, helps overcome this challenge. This is a very basic introduction to Apache Airflow for absolute beginners.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Brief History of Airflow
&lt;/h3&gt;

&lt;p&gt;Airflow was originally created by engineers at Airbnb to manage their ever-growing internal data processing pipelines. Recognizing its potential, Airbnb open-sourced Airflow, making it available to the wider developer community; it joined the Apache Incubator in 2016. Today, Airflow is a mature and widely adopted tool, used by companies of all sizes to orchestrate and automate their data workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Airflow Works
&lt;/h3&gt;

&lt;p&gt;At its core, Airflow is a workflow orchestration tool. It allows you to define complex data pipelines as Directed Acyclic Graphs (DAGs). A DAG is a series of tasks with dependencies, ensuring they run in the correct order. Acyclic means the flow of data never returns to a node in the graph that it has already passed through. &lt;/p&gt;
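&lt;p&gt;The ordering guarantee a DAG gives can be illustrated with a few lines of Python: a topological sort of the dependency graph yields an order in which every task runs only after its upstream tasks. The task names here are hypothetical:&lt;/p&gt;

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Each task maps to the set of upstream tasks that must finish first.
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# static_order() raises CycleError if the graph is not acyclic.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']
```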

&lt;p&gt;Here's a simplified breakdown of how Airflow works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Define Tasks:&lt;/strong&gt; You define individual tasks within your workflow using Python code. Other languages are also supported through plugins, or by using the KubernetesPodOperator to define these tasks. These tasks could involve data extraction, transformation, loading, analysis, cleaning or any other operation needed in your pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set Dependencies:&lt;/strong&gt; You specify dependencies between tasks. For example, a data transformation task might depend on a data extraction task completing successfully before it can run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schedule Workflows:&lt;/strong&gt; Airflow allows you to schedule your workflows to run at specific times (cron expressions) or at regular intervals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring and Alerting:&lt;/strong&gt; Airflow provides a web interface to monitor the status of your workflows and tasks. You can also configure alerts to be sent if tasks fail or encounter issues.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpp7r1r8rjql56svhe2q2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpp7r1r8rjql56svhe2q2.png" alt="Image description" width="731" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying Airflow in Production
&lt;/h3&gt;

&lt;p&gt;For production environments, proper deployment of Airflow is essential. Here are some considerations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Implement role-based access control (RBAC) to restrict access to workflows and resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Availability:&lt;/strong&gt; Consider setting up a high-availability configuration with redundant Airflow servers and worker nodes to ensure uptime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring and Logging:&lt;/strong&gt; Integrate Airflow with monitoring tools to track performance and identify potential issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Advantages of Apache Airflow
&lt;/h3&gt;

&lt;p&gt;Here's a breakdown of some key advantages Apache Airflow holds over other commonly available open-source data orchestration tools:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maturity and Large Community:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extensive Resources:&lt;/strong&gt; Airflow, being a mature project, boasts a vast collection of documentation, tutorials, and readily available solutions for common challenges. This translates to easier learning, faster troubleshooting, and a wealth of resources to tap into.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong Community Support:&lt;/strong&gt; The active and large Airflow community provides valuable support through forums, discussions, and contributions. This can be immensely helpful when encountering issues or seeking best practices.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Flexibility and Ease of Use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python-Based Approach:&lt;/strong&gt;  Airflow leverages Python for defining tasks, making it accessible to users with varying coding experience. Python is a widely used language familiar to many data professionals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User-Friendly Interface:&lt;/strong&gt; The web interface provides a decent level of user-friendliness for monitoring and managing workflows. It allows for visualization of task dependencies and overall workflow status.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scalability and Extensibility:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Scaling:&lt;/strong&gt; Airflow scales horizontally by adding more worker nodes to distribute the workload. This ensures smooth operation even for complex workflows with numerous tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich Plugin Ecosystem:&lt;/strong&gt; Airflow offers a vast plugin ecosystem. These plugins extend its functionality for various data sources, operators (specific actions within workflows), and integrations with other tools. This allows for customization and tailoring Airflow to fit your specific needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DAG Paradigm:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clear Visualization:&lt;/strong&gt;  Airflow utilizes Directed Acyclic Graphs (DAGs) to define workflows. This allows for a clear visual representation of task dependencies and execution order. This visual approach promotes maintainability and simplifies debugging, especially for intricate pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Additional Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Robust Scheduling:&lt;/strong&gt; Airflow provides extensive scheduling capabilities. You can define how often tasks should run using cron expressions, periodic intervals, or custom triggers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring and Alerting:&lt;/strong&gt; Built-in monitoring and alerting functionalities offer visibility into your workflows and send notifications for potential issues or task failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Airflow supports role-based access control (RBAC). This allows you to define user permissions and secure access to workflows and resources within your data pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Open-Source Alternatives to Airflow
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Luigi:&lt;/strong&gt; While offering simplicity and Python-based workflows, Luigi might lack the same level of flexibility and scalability as Airflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefect:&lt;/strong&gt; Prefect excels in user-friendliness and visual design, but its in-process execution model might not be suitable for all scenarios compared to Airflow's worker-based approach.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dagster:&lt;/strong&gt; Prioritizing data lineage is valuable, but Dagster might have a steeper learning curve compared to Airflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Argo Workflows/Kubeflow Pipelines:&lt;/strong&gt; For Kubernetes-centric environments, these tools offer tight integration, but they might not be as general-purpose as Airflow for broader data orchestration needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is an example of a simple "Hello World" DAG in Apache Airflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python_operator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;say_hello&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Prints a hello world message&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello World from Airflow!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define the DAG
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello_world_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Set the start date for the DAG
&lt;/span&gt;    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Don't schedule this DAG to run automatically
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

  &lt;span class="c1"&gt;# Define a Python task to print the message
&lt;/span&gt;  &lt;span class="n"&gt;say_hello_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;say_hello_world&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;say_hello&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Explanation of the above simple DAG&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Import Libraries:&lt;/strong&gt; We import necessary libraries from &lt;code&gt;airflow&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define Hello World Function:&lt;/strong&gt; The &lt;code&gt;say_hello&lt;/code&gt; function simply prints a message.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create DAG:&lt;/strong&gt; We define a DAG using the &lt;code&gt;DAG&lt;/code&gt; object. Here, we specify:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dag_id&lt;/code&gt;: Unique identifier for the DAG (here, "hello_world_dag").&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;start_date&lt;/code&gt;: The earliest date on which the DAG can run (here, March 19, 2024).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;schedule_interval&lt;/code&gt;: Set to &lt;code&gt;None&lt;/code&gt; as we don't want this DAG to run automatically (you can trigger it manually in the Airflow UI).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define Task:&lt;/strong&gt; We define a Python task using &lt;code&gt;PythonOperator&lt;/code&gt;. Here, we specify:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;task_id&lt;/code&gt;: Unique identifier for the task (here, "say_hello_world").&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;python_callable&lt;/code&gt;: The Python function to be executed (here, the &lt;code&gt;say_hello&lt;/code&gt; function).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Running the DAG:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save this code as a Python file (e.g., &lt;code&gt;hello_world_dag.py&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Place this file in your Airflow DAGs directory (usually &lt;code&gt;$AIRFLOW_HOME/dags&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Go to the Airflow UI (usually at &lt;code&gt;http://localhost:8080&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Find your "hello_world_dag" DAG and click on it.&lt;/li&gt;
&lt;li&gt;You'll see the task "say_hello_world". Click on the play button next to it to run the task manually.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If everything is set up correctly and executes without error, you should see the "Hello World from Airflow!" message printed in the task logs. This is a simple example, but it demonstrates the basic structure of defining a DAG and tasks in Airflow. &lt;/p&gt;

</description>
      <category>airflow</category>
      <category>apache</category>
      <category>beginners</category>
      <category>learning</category>
    </item>
    <item>
      <title>Setting Up a Spark Cluster on Kubernetes Using Helm</title>
      <dc:creator>Saket</dc:creator>
      <pubDate>Fri, 15 Sep 2023 14:36:30 +0000</pubDate>
      <link>https://dev.to/archcode01/setting-up-a-spark-cluster-on-kubernetes-using-helm-30ib</link>
      <guid>https://dev.to/archcode01/setting-up-a-spark-cluster-on-kubernetes-using-helm-30ib</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt; is a powerful distributed data processing engine that can handle large-scale data processing tasks efficiently. As a cluster computing framework, Spark offers a complete solution to many common problems: ETL and warehousing, stream data processing, and the usual supervised and unsupervised learning use cases in data analytics and predictive modelling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt;, on the other hand, is a popular container orchestration platform that simplifies the deployment and management of containerized applications. Combining Spark and Kubernetes allows you to harness the benefits of both technologies for running Spark workloads in a scalable and flexible manner. With Kubernetes, scaling Spark becomes straightforward: you scale the master or worker nodes with a single command. In this article, we'll guide you through the process of setting up a Spark cluster on Kubernetes using Helm, a package manager for Kubernetes.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Prerequisites:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before proceeding, make sure you have the following prerequisites:&lt;/p&gt;

&lt;p&gt;A Kubernetes cluster: Ensure you have a functioning Kubernetes cluster to deploy your Spark applications.&lt;/p&gt;

&lt;p&gt;Helm installed: Install Helm, the package manager for Kubernetes, on your local machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up the Spark Cluster on Kubernetes:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Install the Spark Helm Chart:&lt;br&gt;
To deploy Spark on Kubernetes, we'll use the Bitnami Spark Helm chart. Open your terminal and add the Bitnami Helm repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Customize Configuration (Optional):&lt;br&gt;
You can customize the Spark cluster configuration by creating a values.yaml file. This file allows you to set parameters such as the number of master or worker nodes, CPU/memory allocation, and the Spark version, or to add supporting jars through init containers. Refer to the Bitnami Spark Helm chart documentation for the available configuration options.&lt;br&gt;
For example, in values.yaml you can update the image tag to match the Spark version you need, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
...
image:
  registry: docker.io
  repository: bitnami/spark
  tag: 3.2.0
  digest: ""
...
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
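&lt;p&gt;Similarly, the worker count and resource allocation can be adjusted in values.yaml. The snippet below is an illustrative sketch; the field names follow the Bitnami Spark chart, but verify them against the chart version you are using:&lt;/p&gt;

```yaml
# Illustrative values.yaml excerpt -- check field names against your chart version.
worker:
  replicaCount: 3        # number of Spark worker pods
  resources:
    requests:
      cpu: 2
      memory: 4Gi
```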



&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; Deploy the Spark Cluster:&lt;br&gt;
With the customization (if any) done, it's time to deploy the Spark cluster using Helm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install spark bitnami/spark -f path/to/your/values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command deploys the Spark master and worker nodes as specified in the configuration, along with all the Kubernetes components (services, secrets, etc.) required for the Spark cluster to function properly.&lt;br&gt;
It creates a StatefulSet each for the master and the workers, which can be scaled independently as per requirements.&lt;br&gt;
It also creates a headless service and a regular service to access the master and worker nodes,&lt;br&gt;
along with other supporting resources defined by the Helm chart.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Monitor the Spark Cluster:&lt;br&gt;
After the deployment is complete, you can monitor the Spark cluster through the Spark Web UI. Find the Spark master service and its IP by listing the services in the namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get svc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, access the Spark Web UI in your browser using the obtained IP and port 8080 (default Spark UI port).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5:&lt;/strong&gt; Submit Spark Applications:&lt;br&gt;
With the Spark cluster up and running, you can now submit your Spark applications for processing. Using kubectl, we can exec into the master pod, a worker pod, or any other pod in the same namespace as the master and worker nodes, and run spark-submit from there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Running Spark application on Kubernetes cluster
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://spark-master-headless:7077 \
  --deploy-mode cluster \
  --executor-memory 5G \
  --executor-cores 8 \
  /spark-home/examples/jars/spark-examples_versionxx.jar 80

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the bundled spark-examples jar and its &lt;code&gt;SparkPi&lt;/code&gt; class are used as the application; replace them with your own application jar, main class, input and output paths, and any required configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Setting up a Spark cluster on Kubernetes using Helm brings together the power of Spark's distributed computing and Kubernetes' container orchestration capabilities. This combination allows you to scale your Spark workloads efficiently and take advantage of Kubernetes' resource management and fault tolerance features. With Helm's ease of use, you can quickly deploy and manage Spark clusters, making it a valuable approach for big data processing in cloud-native environments. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Software Application Telemetry and Open Telemetry Protocol</title>
      <dc:creator>Saket</dc:creator>
      <pubDate>Mon, 28 Aug 2023 10:57:54 +0000</pubDate>
      <link>https://dev.to/archcode01/software-application-telemetry-and-open-telemetry-protocol-3983</link>
      <guid>https://dev.to/archcode01/software-application-telemetry-and-open-telemetry-protocol-3983</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;What does Telemetry mean in software applications?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Telemetry in software applications refers to the collection and analysis of data from software systems. This data can be used to monitor the performance of the system, identify problems, and improve the system's design and implementation. This helps you understand what is going on inside the application.&lt;br&gt;
Telemetry data can be collected from a variety of sources, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sensors: Collect data about the physical environment, such as temperature, humidity, and pressure.&lt;/li&gt;
&lt;li&gt;Logs: Collect data about the activities of the software system, such as errors, warnings, and performance metrics.&lt;/li&gt;
&lt;li&gt;Events: Collect data about specific events that occur in the software system, such as user logins, page views, and API calls.&lt;/li&gt;
&lt;/ul&gt;
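&lt;p&gt;As a minimal sketch of the events source above (the event and field names are illustrative), an application can emit structured JSON events through its logger for a collector to parse and analyze later:&lt;/p&gt;

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("telemetry")

def emit_event(name: str, **attributes) -> str:
    """Serialize one telemetry event as a JSON line and log it."""
    event = {"event": name, "ts": time.time(), **attributes}
    line = json.dumps(event, sort_keys=True)
    logger.info(line)
    return line

# An application would call this at interesting points, e.g. on login:
emitted = emit_event("user_login", user_id="u-42", success=True)
```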

&lt;p&gt;Telemetry data can be analyzed using a variety of tools and techniques, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Statistical analysis: Used to identify trends and patterns in the data.&lt;/li&gt;
&lt;li&gt;Machine learning: Used to build models that can predict future behavior based on historical data.&lt;/li&gt;
&lt;li&gt;Visualization: Used to make the data easier to understand and interpret.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Telemetry data can be used to improve software applications in a variety of ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring the performance of the system: Monitor CPU usage, memory usage, and network traffic to identify problems and take corrective action.&lt;/li&gt;
&lt;li&gt;Identifying problems: Identify problems with the software system, such as errors, warnings, and performance bottlenecks. This data can be used to fix the problems and improve the system's reliability.&lt;/li&gt;
&lt;li&gt;Improving the system's design and implementation: Telemetry data can be used to identify areas where the system can be improved, such as performance, scalability, and security.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Telemetry is a powerful tool for collecting and analysing data, helping you gain insights into the behaviour of applications and improve their overall quality.&lt;/p&gt;


&lt;h2&gt;
  
  
  What is Open Telemetry Protocol?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenTelemetry Protocol (OTLP)&lt;/strong&gt; is a standard way to collect and export telemetry data from software systems. It is a general-purpose protocol that can be used to collect data from a variety of sources, including applications, services, and infrastructure.&lt;br&gt;
OTLP is based on the Protocol Buffers language, which is a language-neutral, efficient way to serialize structured data. This makes it easy to transport telemetry data between different systems.&lt;br&gt;
OTLP defines two main types of data: traces and metrics. Traces represent the execution of a single request or transaction, while metrics represent measurements of the state of a system.&lt;br&gt;
OTLP can be used to collect telemetry data from a variety of sources, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applications: OTLP can be used to instrument applications to collect data about their execution, such as the time it takes to respond to requests.&lt;/li&gt;
&lt;li&gt;Services: OTLP can be used to collect data about the performance of services, such as the number of requests they are handling and the amount of time they are taking to respond.&lt;/li&gt;
&lt;li&gt;Infrastructure: OTLP can be used to collect data about the performance of infrastructure components, such as servers, networks, and databases.&lt;/li&gt;
&lt;/ul&gt;
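&lt;p&gt;The two data shapes OTLP defines, traces and metrics, can be pictured with a much-simplified sketch. The field names below are illustrative and do not mirror the actual OTLP protobuf schema:&lt;/p&gt;

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One unit of a trace: a named, timed operation within a request."""
    name: str
    trace_id: str
    start_ns: int = field(default_factory=time.monotonic_ns)
    end_ns: Optional[int] = None

    def end(self) -> None:
        self.end_ns = time.monotonic_ns()

    @property
    def duration_ns(self) -> int:
        assert self.end_ns is not None, "span not ended yet"
        return self.end_ns - self.start_ns

@dataclass
class MetricPoint:
    """One measurement of the state of a system."""
    name: str
    value: float
    unit: str = ""

span = Span(name="handle_request", trace_id="abc123")
span.end()
cpu = MetricPoint(name="cpu.utilization", value=0.42, unit="1")
```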

&lt;p&gt;Once telemetry data has been collected using OTLP, it can be exported to a variety of backends, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observability platforms: Observability platforms, such as Prometheus and Grafana, can be used to visualize and analyze telemetry data.&lt;/li&gt;
&lt;li&gt;Logging systems: Logging systems, such as ELK and Splunk, can be used to store and search telemetry data.&lt;/li&gt;
&lt;li&gt;Data warehouses: Data warehouses, such as Snowflake and BigQuery, can be used to store telemetry data for long-term analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OTLP&lt;/strong&gt; is a powerful and versatile protocol that can be used to collect and export telemetry data from a variety of sources and in a variety of environments.&lt;/p&gt;

&lt;p&gt;Here are some of the benefits of using OTLP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is a standard protocol, so it can be used to collect data from a variety of systems.&lt;/li&gt;
&lt;li&gt;It is based on Protocol Buffers, which is a language-neutral, efficient way to serialize data.&lt;/li&gt;
&lt;li&gt;It is easy to use and implement.&lt;/li&gt;
&lt;li&gt;It is supported by a wide range of tools and platforms.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>telemetry</category>
      <category>otpl</category>
    </item>
    <item>
      <title>Kubernetes - Monitor custom application metrics using Prometheus and Istio.</title>
      <dc:creator>Saket</dc:creator>
      <pubDate>Thu, 10 Aug 2023 14:32:19 +0000</pubDate>
      <link>https://dev.to/archcode01/kubernetes-monitor-custom-application-metrics-using-prometheus-and-istio-1081</link>
      <guid>https://dev.to/archcode01/kubernetes-monitor-custom-application-metrics-using-prometheus-and-istio-1081</guid>
      <description>&lt;p&gt;Kubernetes pod level metrics are captured by out of the box combination of Istio, Prometheus and Grafana components. In order to get your application metrics (e.g.: jvm metrics, custom counter metrics etc) onto Grafana, we can configure the existing setup in such a way that Istio will capture the metrics from pod and applications and merge them before sending to prometheus.&lt;/p&gt;

&lt;p&gt;In order to do this, we need the application to expose its metrics on some endpoint (e.g., Spring Boot provides an out-of-the-box facility to expose JVM/application metrics on the actuator/prometheus API).&lt;/p&gt;
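&lt;p&gt;For an application without such a built-in facility, a Prometheus-style metrics endpoint can be hand-rolled. Below is a minimal sketch using only the Python standard library; the metric name and port are illustrative:&lt;/p&gt;

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # incremented by the application as it handles requests

def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    return (
        "# HELP app_requests_total Total requests handled by the app.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {REQUEST_COUNT}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the metrics on the path the pod annotations point at."""
    def do_GET(self):
        if self.path == "/actuator/prometheus":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To expose it: HTTPServer(("", 8080), MetricsHandler).serve_forever()
```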

&lt;p&gt;Once this is set up, we can add annotations to the pod in the deployment YAML, which help Istio scrape the metrics from the application, merge them with the pod metrics, and send them to Prometheus.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;annotations:
  prometheus.io/path: "/actuator/prometheus"
  prometheus.io/port: "80"
  prometheus.io/scheme: "http"
  prometheus.io/scrape: "true"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this setup, the application metrics will be scraped from the pod and will be available along with the other metrics on the Grafana dashboard.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Janusgraph OLAP Traversal not working with Cassandra backend with client SSL enabled.</title>
      <dc:creator>Saket</dc:creator>
      <pubDate>Fri, 23 Jun 2023 17:14:30 +0000</pubDate>
      <link>https://dev.to/archcode01/janusgraph-olap-traversal-not-working-with-cassandra-backend-with-client-ssl-enabled-3c4f</link>
      <guid>https://dev.to/archcode01/janusgraph-olap-traversal-not-working-with-cassandra-backend-with-client-ssl-enabled-3c4f</guid>
      <description>&lt;h2&gt;
  
  
  What is olap traversal in graph database?
&lt;/h2&gt;

&lt;p&gt;OLAP, which stands for OnLine Analytical Processing, is one of the ways to traverse a graph database in parallel, in batch operations.&lt;br&gt;
JanusGraph OLAP traversal makes use of distributed graph processing by leveraging the Gremlin plugins for Apache Hadoop and Apache Spark.&lt;br&gt;
For more information on this topic, please refer to:&lt;br&gt;
&lt;a href="https://docs.janusgraph.org/advanced-topics/hadoop/"&gt;JanusGraph with TinkerPop’s Hadoop-Gremlin - JanusGraph&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;We had a working setup of JanusGraph version 0.5.2, where we were able to insert and query (OLTP) data as needed. We were exploring JanusGraph OLAP traversal for some reporting and analytical requirements. However, when we tried to follow the instructions in the JanusGraph documentation, we were not able to connect to the SSL-enabled Cassandra when traversing the graph in OLAP mode through Gremlin queries. The Cassandra database was set up with SSL, expecting a truststore with client connection requests. OLTP queries, the regular way of working with the graph, worked fine and in line with the official documentation.&lt;/p&gt;

&lt;p&gt;Below is the working OLTP configuration, janusgraph-cql-oltp.properties:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=cql
storage.hostname=cassandra.cassandra.svc.cluster.local
storage.username=cassandra
storage.password=cassandra123
storage.cql.keyspace=janusgraph
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
storage.lock.wait-time = 60000
storage.cql.ssl.enabled=true
storage.cql.ssl.truststore.location=/etc/config/tls/truststore
storage.cql.ssl.truststore.password=secretpasswd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we load this configuration in the Gremlin console and run a simple traversal, we are able to fetch the expected results.&lt;/p&gt;

&lt;p&gt;Below is the OLAP configuration, which throws an error when connecting to Cassandra with SSL enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat

gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true
# JanusGraph Cassandra InputFormat configuration
# These properties define the connection settings that were used while writing data to JanusGraph.
janusgraphmr.ioformat.conf.storage.backend=cql
# This specifies the hostname &amp;amp; port for Cassandra data store.
janusgraphmr.ioformat.conf.storage.hostname=cassandra.cassandra.svc.cluster.local
janusgraphmr.ioformat.conf.storage.port=9042
janusgraphmr.ioformat.conf.storage.username=cassandra
janusgraphmr.ioformat.conf.storage.password=cassandra123
janusgraphmr.ioformat.conf.storage.cql.keyspace=janusgraph
janusgraphmr.ioformat.conf.storage.lock.wait-time = 60000
janusgraphmr.ioformat.conf.storage.cql.ssl.enabled=true
janusgraphmr.ioformat.conf.storage.cql.ssl.truststore.location=/etc/config/tls/truststore
janusgraphmr.ioformat.conf.storage.cql.ssl.truststore.password=cassandra123

janusgraphmr.ioformat.conf.storage.ssl.enabled=true
janusgraphmr.ioformat.conf.storage.ssl.truststore.location=/etc/config/tls/truststore
janusgraphmr.ioformat.conf.storage.ssl.truststore.password=cassandra123

janusgraphmr.ioformat.conf.storage.cql.read-consistency-level=ONE

storage.lock.wait-time = 60000
storage.cql.ssl.enabled=true
storage.cql.ssl.client-authentication-enabled=true
storage.cql.ssl.truststore.location=/etc/config/tls/truststore
storage.cql.ssl.truststore.password=cassandra123

janusgraphmr.ioformat.conf.cache.db-cache = true
janusgraphmr.ioformat.conf.cache.db-cache-clean-wait = 20
janusgraphmr.ioformat.conf.cache.db-cache-time = 180000
janusgraphmr.ioformat.conf.cache.db-cache-size = 0.5

cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.widerows=true

# # SparkGraphComputer Configuration #
spark.master=local[*]
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we load the graph object in the Gremlin console, we can see that the properties are loaded correctly. But when we traverse the graph as described in the documentation, we get a Cassandra connection error related to the SSL configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gremlin&amp;gt; graph=HadoopGraph.open('/janusgraph-full-0.5.2/conf/olap.properties')
==&amp;gt;hadoopgraph[cqlinputformat-&amp;gt;nulloutputformat]
gremlin&amp;gt; g=graph.traversal().withComputer(SparkGraphComputer)
==&amp;gt;graphtraversalsource[hadoopgraph[cqlinputformat-&amp;gt;nulloutputformat], sparkgraphcomputer]
gremlin&amp;gt; graph.configuration()
//// i can see all the properties from the file loaded here
gremlin&amp;gt; g.V().limit(1)
07:34:44 WARN  org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer  - class org.apache.hadoop.mapreduce.lib.output.NullOutputFormat does not implement PersistResultGraphAware and thus, persistence options are unknown -- assuming all options are possible
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: cassandra.cassandra.svc.cluster.local/10.0.165.158:9042 (com.datastax.driver.core.exceptions.TransportException: [cassandra.cassandra.svc.cluster.local/10.0.165.158:9042] Connection has been closed))
Type ':help' or ':h' for help.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We could verify from the Cassandra logs that a connection was attempted but the request was rejected for SSL reasons. Below are the logs from the Cassandra instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO  [epollEventLoopGroup-2-4] 2023-05-02 07:34:58,809 Message.java:826 - Unexpected exception during request; channel = [id: 0xeb0e017f, L:/10.12.0.224:9042 ! R:/10.12.0.135:60316]
io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record: 0400000001000000500003000b43514c5f56455253494f4e0005332e302e30000e4452495645525f56455253494f4e0005332e392e30000b4452495645525f4e414d4500144461746153746178204a61766120447269766572
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1057) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:411) [netty-all-4.0.44.Final.jar:4.0.44.Final]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Finally found the missing piece
&lt;/h2&gt;

&lt;p&gt;After trying several combinations to pass the SSL information to the connection configuration, we were still not able to establish a connection with Cassandra and successfully execute an OLAP query.&lt;br&gt;
We posted this as a question on &lt;a href="https://stackoverflow.com/questions/76152740/janusgraph-olap-traversal-connection-with-cassandra-using-trusstore-config-not"&gt;stackoverflow&lt;/a&gt;, the Discord channel, and Google Groups, hoping to receive some help from the community. We finally got a response from a Discord community member, and it worked out; the Discord channel for JanusGraph and Gremlin users is quite active. The configuration parameters that need to be populated for the SSL connection are not mentioned in the documentation; they are present in the code, and the reference is below. These, however, work with the latest versions of JanusGraph, and we verified this with the 0.6.0 and 1.0.0-rc2 versions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--P-rzn_V3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7fsbwqnhv92smewtz63i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P-rzn_V3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7fsbwqnhv92smewtz63i.png" alt="Image description" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The OLAP connection configuration was updated with the entries below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cassandra.input.native.ssl.trust.store.password=cassandra123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, the updated OLAP traversal configuration looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
    gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
    gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
    gremlin.hadoop.jarsInDistributedCache=true
    gremlin.hadoop.inputLocation=none
    gremlin.hadoop.outputLocation=output
    gremlin.spark.persistContext=true
    janusgraphmr.ioformat.conf.storage.backend=cql
    janusgraphmr.ioformat.conf.storage.hostname=cassandra-headless.cassandra.svc.cluster.local
    janusgraphmr.ioformat.conf.storage.port=9042
    janusgraphmr.ioformat.conf.storage.username=cassandra
    janusgraphmr.ioformat.conf.storage.password=cassa@2@2!
    janusgraphmr.ioformat.conf.storage.cql.keyspace=janusgraph
    janusgraphmr.ioformat.conf.storage.cql.read-consistency-level=ONE
    janusgraphmr.ioformat.conf.storage.cql.ssl.enabled=true
    janusgraphmr.ioformat.conf.storage.cql.ssl.truststore.location=/tmp/security/truststore
    janusgraphmr.ioformat.conf.storage.cql.ssl.truststore.password=cassandra123
    storage.cql.read-consistency-level=ONE
    janusgraphmr.ioformat.conf.cache.db-cache = true
    janusgraphmr.ioformat.conf.cache.db-cache-clean-wait = 20
    janusgraphmr.ioformat.conf.cache.db-cache-time = 180000
    janusgraphmr.ioformat.conf.cache.db-cache-size = 0.5
    cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
    cassandra.input.native.keep.alive=true
    cassandra.input.native.ssl.trust.store.path=/tmp/security/truststore
    cassandra.input.native.ssl.trust.store.password=cassa@2@2!
    storage.cql.protocol-version=V4 
    spark.master=local[*]
    spark.executor.memory=3g
    spark.serializer=org.apache.spark.serializer.KryoSerializer
    spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator
    spark.cassandra.input.fetch.size_in_rows=500

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the above configuration we were able to traverse the graph using OLAP traversal and achieve our objective.&lt;/p&gt;

</description>
      <category>graphdb</category>
      <category>olap</category>
      <category>cassandra</category>
      <category>ssl</category>
    </item>
  </channel>
</rss>
