DEV Community: Anas T

A Beginner's Guide to Auto-Instrumenting a Flask App with OpenTelemetry and SigNoz

Anas T — Tue, 08 Apr 2025 14:07:48 +0000

Understanding what your code is doing — really doing — is essential. That’s where observability comes in.

In this hands-on tutorial, you’ll learn how to instrument a simple Python Flask app using OpenTelemetry and send that data to SigNoz, an open-source observability platform. Step by step, we’ll walk through everything you need to get visibility into your application.

By the end, you’ll know how to:

Enable auto-instrumentation
Add custom spans
Track custom metrics
Capture logs
Correlate logs with traces

What Is OpenTelemetry and Why Use It?

OpenTelemetry is an open-source framework that helps you collect data about your application's performance and behavior. It gathers three types of telemetry data: traces, metrics, and logs. The data together provide a complete picture of how your app is working.

For beginners, OpenTelemetry is valuable because it simplifies monitoring without requiring deep expertise, and it works with many tools, like SigNoz, to display the data.

Understanding Telemetry Data: Traces, Metrics, and Logs

Telemetry data is the information your app produces to show what it's doing:

Traces track the journey of a request through your app
Metrics measure things like how many requests happen
Logs record specific events or messages.

Together, they help you spot problems, measure performance, and debug issues.

What Is SigNoz and How Does It Fit In?

SigNoz is an open-source tool that takes the telemetry data from OpenTelemetry and turns it into easy-to-read charts and dashboards. It's designed to work seamlessly with OpenTelemetry, making it perfect for anyone who wants to see their app's data without complex setup. SigNoz will be our window into the Flask app's performance in this tutorial.

Setting Up the Environment

Before we start instrumenting our Flask app, we need to prepare our system with the right tools. This section walks you through:

Installing Python
Creating a virtual environment to manage packages safely
Setting up SigNoz locally

After following these steps, you'll have a clean workspace ready for OpenTelemetry.

Installing Python and Creating a Virtual Environment

We'll use Python and Flask, a simple web framework, for our app. First, ensure that Python 3.8 or newer is installed by running python3 --version on your terminal.

If Python is not installed, download it from python.org or use your system's package manager (e.g., sudo apt install python3 on Ubuntu).

Verify that pip is available by running pip3 --version; if it is not, install it with sudo apt install python3-pip.

We'll create a virtual environment to avoid conflicts with your system's Python. This self-contained Python setup lets you install packages without affecting the rest of your system.

Run these commands in your terminal:

python3 -m venv otel-venv
source otel-venv/bin/activate

After running source otel-venv/bin/activate, your terminal prompt should change to show the environment name, e.g., (otel-venv), indicating you're in the virtual environment. Now, any packages you install with pip will stay isolated here.

If you ever need to exit the virtual environment, type deactivate.
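If the prompt does not change, or you are unsure which interpreter is active, a short script can confirm it. The following is a minimal sketch using only the standard library; the function name in_virtualenv is illustrative, not part of any tool used in this tutorial.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # In a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Python:", sys.version.split()[0])
    print("Inside a virtual environment:", in_virtualenv())
    print("Minimum version met (3.8+):", sys.version_info >= (3, 8))
```

Run it with python3 after activating otel-venv; the second line should report True.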

Setting Up SigNoz Locally

SigNoz will collect and display our telemetry data, and we'll run it using Docker. Open a terminal and run these commands to set up SigNoz:

git clone -b main https://github.com/SigNoz/signoz.git
cd signoz/deploy/
./install.sh

This downloads SigNoz and starts it. Once the installation finishes, the SigNoz UI will be accessible at http://localhost:8080.

You've now set up the backend to receive data from our app.
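Before moving on, it can help to check that the containers are actually listening. The sketch below only tests whether a TCP connection succeeds; it assumes the UI port 8080 mentioned above and the OTLP gRPC port 4317 used later in this tutorial, and the helper name port_open is made up for this example.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or the host could not be resolved.
        return False


if __name__ == "__main__":
    # 8080 is the UI port from this tutorial; 4317 is the OTLP gRPC port.
    for name, port in [("SigNoz UI", 8080), ("OTLP gRPC receiver", 4317)]:
        state = "reachable" if port_open("localhost", port) else "not reachable"
        print(f"{name} on localhost:{port} is {state}")
```

A "not reachable" result usually means the containers are still starting or the install script did not finish.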

Creating a Simple Flask Application

Let's build a basic Flask app to instrument. Create a file called app.py and add this code:

from flask import Flask

app = Flask(__name__)

@app.route('/')
def home():
    return "Welcome to the Flask App!"

@app.route('/task')
def task():
    return "Task completed!"

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)

This app has two endpoints: one to greet users (http://localhost:5000) and another to simulate a task (http://localhost:5000/task). This gives us something simple yet practical to monitor.

Auto-Instrumenting the Flask App with OpenTelemetry

Auto-instrumentation lets OpenTelemetry automatically track your app's activity without changing much code. It's a beginner-friendly way to monitor and capture data like request times and errors.

Installing OpenTelemetry Packages

With your virtual environment active (run source otel-venv/bin/activate if it's not), install the necessary OpenTelemetry packages.

In your terminal, run:

pip install opentelemetry-distro \
opentelemetry-exporter-otlp \
flask \
opentelemetry-instrumentation-flask \
opentelemetry-instrumentation-logging

Then run:

opentelemetry-bootstrap --action=install

opentelemetry-distro: Provides auto-instrumentation for Python apps.
opentelemetry-exporter-otlp: Sends data to SigNoz using the OTLP protocol.
flask: Ensures Flask is installed for our app.
opentelemetry-instrumentation-flask: Instruments Flask applications to automatically capture traces.
opentelemetry-instrumentation-logging: Hooks into Python's built-in logging module to enrich log messages with trace and span context.
opentelemetry-bootstrap --action=install: Automatically installs required dependencies for OpenTelemetry instrumentation based on detected libraries in your environment.

Configuring Auto-Instrumentation

We'll use a command that wraps our app with OpenTelemetry to enable auto-instrumentation. Update your terminal command to run the app like this:

opentelemetry-instrument --traces_exporter otlp --metrics_exporter otlp --logs_exporter otlp --service_name flask-app python3 app.py

This command tells OpenTelemetry to:

Collect traces, metrics, and logs
Send them to SigNoz (running locally at localhost:4317 by default).
Name our service flask-app for easy identification in SigNoz.

You've now auto-instrumented the app, and it's ready to send basic telemetry data to SigNoz.
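The same settings can also be supplied as environment variables instead of command-line flags, which is handy in scripts or container definitions. The sketch below builds that environment in Python. The variable names are the standard OpenTelemetry ones; the endpoint value assumes SigNoz's OTLP gRPC receiver on the default local port, and the helper names build_env and run_instrumented are hypothetical.

```python
import os
import subprocess

OTEL_ENV = {
    "OTEL_SERVICE_NAME": "flask-app",
    "OTEL_TRACES_EXPORTER": "otlp",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    # Assumption: SigNoz's OTLP gRPC receiver on the default local port.
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
}


def build_env(base=None):
    """Return a copy of the environment with the OpenTelemetry settings added."""
    env = dict(os.environ if base is None else base)
    env.update(OTEL_ENV)
    return env


def run_instrumented(script="app.py"):
    """Start the app under opentelemetry-instrument using only env vars."""
    return subprocess.run(
        ["opentelemetry-instrument", "python3", script],
        env=build_env(),
        check=False,
    )
```

Calling run_instrumented() from a small launcher script should then behave like the command above, provided opentelemetry-instrument is on the PATH of the active virtual environment.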

Viewing Auto-Instrumented Traces

Open http://localhost:5000 and http://localhost:5000/task a few times to generate data. In SigNoz, go to the “Traces” tab. You'll see traces for requests to / and /task, with the timing and endpoints captured automatically by OpenTelemetry.

This shows how requests flow through your app.
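Refreshing the browser by hand gets tedious, so here is a small traffic generator as a sketch. It uses only the standard library and assumes the app is running on localhost:5000 as configured above; the function name generate_traffic is illustrative.

```python
import urllib.error
import urllib.request
from collections import Counter


def generate_traffic(urls, rounds=5, timeout=2.0):
    """Request each URL `rounds` times and count the HTTP status codes seen."""
    statuses = Counter()
    for _ in range(rounds):
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    statuses[resp.status] += 1
            except urllib.error.HTTPError as err:
                # 4xx/5xx responses still reach the app and produce traces.
                statuses[err.code] += 1
            except OSError:
                # The app is not running or the port is wrong.
                statuses["unreachable"] += 1
    return statuses


if __name__ == "__main__":
    print(generate_traffic(["http://localhost:5000/", "http://localhost:5000/task"]))
```

Each successful request should show up as one trace in SigNoz a few seconds later.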

Adding Custom Spans for Detailed Tracing

Auto-instrumentation is great, but custom spans let you track specific parts of your code, like a slow task. A span is a single unit of work in a trace, giving you detailed timing info.

Modifying the App to Include Custom Spans

Update app.py to add a custom span for the /task endpoint:

from flask import Flask
from opentelemetry import trace

app = Flask(__name__)
tracer = trace.get_tracer(__name__)

@app.route('/')
def home():
    return "Welcome to the Flask App!"

@app.route('/task')
def task():
    with tracer.start_as_current_span("process-task"):
        return "Task completed!"

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)

Here, tracer.start_as_current_span("process-task") creates a custom span around the task logic. Rerun the app with the same opentelemetry-instrument command we used above and generate data.

Analysing Custom Spans

In the same “Traces” tab, click a /task trace to see the “process-task” span with its duration.

Sending Custom Metrics to Track Performance

Metrics measure things over time, like how many requests your app handles. Custom metrics let you track what matters to you, beyond what auto-instrumentation provides.

Implementing Custom Metrics in the App

Add a custom metric to count requests to the /task endpoint. Update app.py:

from flask import Flask
from opentelemetry import trace, metrics

app = Flask(__name__)
tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
task_counter = meter.create_counter("task_requests", description="Number of task requests")

@app.route('/')
def home():
    return "Welcome to the Flask App!"

@app.route('/task')
def task():
    with tracer.start_as_current_span("process-task"):
        task_counter.add(1)
        return "Task completed!"

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)

The task_counter metric increments each time /task is called.

Rerun the app with the same opentelemetry-instrument command we used above and generate new data.

Viewing Custom Metrics in SigNoz

In SigNoz, follow these steps to create a dashboard with the custom metric:

1. Click “Dashboards” in the navigation.
2. Create a new dashboard.
3. Click “+ New Panel.”
4. Choose “Time Series” as the Panel Type.
5. In the Query Builder, select task_requests from the Metrics dropdown.
6. Set Aggregation to “Increase” (for counters).
7. Save the panel and dashboard (e.g., name it “Flask Metrics”).
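Why “Increase” for counters? A counter only ever goes up, so its raw value is less interesting than how much it grew in each time window. The sketch below is a simplified model of that idea, not SigNoz's actual query engine, and the readings are made-up values.

```python
def increase_per_interval(samples):
    """Convert cumulative counter readings into per-interval increases."""
    increases = []
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increases.append(current - previous)
        else:
            # A drop means the counter restarted (e.g. the app was relaunched),
            # so everything counted since the restart is the increase.
            increases.append(current)
    return increases


if __name__ == "__main__":
    readings = [0, 3, 5, 9, 2]  # hypothetical task_requests readings
    print(increase_per_interval(readings))  # [3, 2, 4, 2]
```

Plotted over time, these per-interval values show request bursts that a steadily climbing total would hide.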

Capturing Logs and Correlating with Traces

Logs are messages your app produces, like status updates or errors. Correlating them with traces links these messages to specific requests, making it easier to debug issues by seeing the full context of a request's journey.

In this section, we'll add logging to our Flask app, enable trace-log correlation using OpenTelemetry, and view the results in SigNoz.

Adding Logging to the Flask App

We'll update app.py to include logging with trace context, using OpenTelemetry to capture and correlate logs with traces. We'll also use auto-instrumentation for Flask traces and metrics.

Update app.py with the following code:

import logging
from flask import Flask, request
from opentelemetry import trace
from opentelemetry import metrics
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.logging import LoggingInstrumentor
import time

# Configure logging with OpenTelemetry
class OpenTelemetryLogFormatter(logging.Formatter):
    def format(self, record):
        # Add trace and span context to log record
        current_span = trace.get_current_span()
        if current_span:
            span_context = current_span.get_span_context()
            if span_context and span_context.is_valid:
                record.trace_id = f"{span_context.trace_id:032x}"
                record.span_id = f"{span_context.span_id:016x}"
            else:
                record.trace_id = "00000000000000000000000000000000"
                record.span_id = "0000000000000000"
        else:
            record.trace_id = "00000000000000000000000000000000"
            record.span_id = "0000000000000000"
        return super().format(record)

# Create formatter
formatter = OpenTelemetryLogFormatter(
    '%(asctime)s - %(name)s - %(levelname)s - trace_id=%(trace_id)s span_id=%(span_id)s - %(message)s'
)

# Set up root logger
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(formatter)
root_logger.addHandler(handler)

# Create Flask app
app = Flask(__name__)

# Disable Werkzeug's default logger
app.logger.handlers = []
werkzeug_logger = logging.getLogger('werkzeug')
werkzeug_logger.disabled = True

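# Create a custom logger for HTTP access.
# Its records propagate to the root logger, which already has the handler
# attached; adding the handler here too would print every access line twice.
http_logger = logging.getLogger('http.access')
http_logger.setLevel(logging.INFO)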

# Custom logging middleware
@app.before_request
def before_request():
    # Store request start time
    request.start_time = time.time()

@app.after_request
def after_request(response):
    # Calculate request duration
    duration_ms = (time.time() - request.start_time) * 1000

    # Log the request with the current trace context
    http_logger.info(
        '%s - - [%s] "%s %s %s" %s %s (%.2fms)',
        request.remote_addr,
        time.strftime('%d/%b/%Y %H:%M:%S'),
        request.method,
        request.path,
        request.environ.get('SERVER_PROTOCOL', ''),
        response.status_code,
        response.content_length or '-',
        duration_ms
    )

    return response

# Instrument Flask for traces and logs
FlaskInstrumentor().instrument_app(app)
LoggingInstrumentor().instrument()

# Custom metric
meter = metrics.get_meter(__name__)
task_counter = meter.create_counter(
    "task_requests", 
    description="Number of task requests"
)

# Tracer for custom spans
tracer = trace.get_tracer(__name__)

@app.route('/')
def home():
    logging.info("Home endpoint accessed")
    return "Welcome to the Flask App!"

@app.route('/task')
def task():
    with tracer.start_as_current_span("task_processing"):
        task_counter.add(1)
        logging.info("Task endpoint accessed")
        return "Task completed!"

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)

Custom Log Formatter:
- OpenTelemetryLogFormatter extracts trace and span IDs from the current span context
- Falls back to all-zero IDs when there is no valid span context, so formatting a log record never fails
- Formats IDs as hexadecimal strings with proper padding
HTTP Request Logging:
- Disables Werkzeug's default logger, which doesn't capture trace context
- Implements custom request hooks with before_request and after_request
- Logs HTTP requests within the trace context to ensure correlation
- Includes request duration similar to standard web server logs
Auto-Instrumentation:
- FlaskInstrumentor().instrument_app(app) automatically generates traces for Flask HTTP requests
- LoggingInstrumentor().instrument() enables log collection with trace context
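The padding in the formatter matters because trace IDs are 128-bit integers and span IDs are 64-bit integers, and backends expect them as fixed-width hexadecimal strings. Here is a standalone sketch of just that formatting step; the helper names are illustrative and the random IDs only mimic the size of real ones.

```python
import secrets


def format_trace_id(trace_id: int) -> str:
    """Render a 128-bit trace ID as 32 lowercase hex characters."""
    return f"{trace_id:032x}"


def format_span_id(span_id: int) -> str:
    """Render a 64-bit span ID as 16 lowercase hex characters."""
    return f"{span_id:016x}"


if __name__ == "__main__":
    print(format_trace_id(255))  # 000000000000000000000000000000ff
    print(format_span_id(255))   # 00000000000000ff
    # Randomly generated IDs, similar in size to what a tracer would produce.
    print(format_trace_id(secrets.randbits(128)))
    print(format_span_id(secrets.randbits(64)))
```

Without the zero padding, an ID with leading zero bits would come out shorter and might not match the ID stored with the trace.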

Enabling Log-Trace Correlation

To correlate logs with traces, we'll use OpenTelemetry's auto-instrumentation to capture traces, metrics, and logs and send them to SigNoz.

Run the app with the following commands:

export OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true
opentelemetry-instrument --traces_exporter otlp --metrics_exporter otlp --logs_exporter otlp --service_name flask-app python3 app.py

OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED=true enables OpenTelemetry to capture logs and include trace context.
opentelemetry-instrument runs the app with auto-instrumentation, exporting traces, metrics, and logs to SigNoz.

This setup links log messages to their corresponding traces using trace IDs.
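To make the idea concrete, here is a standard-library sketch of the general mechanism: some piece of code attaches the ID of the active request to every log record before it is formatted. This is an illustration under simplifying assumptions, not the OpenTelemetry implementation; the context variable, filter, and helper names are all hypothetical.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the tracer's "current span" context.
current_trace_id = contextvars.ContextVar("current_trace_id", default="0" * 32)


class TraceContextFilter(logging.Filter):
    """Attach the active trace ID to every record that passes through."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


def build_logger(stream):
    """Create a logger whose output lines carry the trace ID."""
    logger = logging.getLogger("correlation-demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceContextFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, trace_id):
    """Simulate one request: set the trace context, log, then restore it."""
    token = current_trace_id.set(trace_id)
    try:
        logger.info("Task endpoint accessed")
    finally:
        current_trace_id.reset(token)


if __name__ == "__main__":
    out = io.StringIO()
    log = build_logger(out)
    handle_request(log, "a" * 32)
    handle_request(log, "b" * 32)
    print(out.getvalue(), end="")
```

Because every line carries the ID of the request that produced it, a backend can group logs by trace without any change to the log messages themselves.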

Analyzing correlated logs with traces

Follow these steps:

Visit http://localhost:5000/ and http://localhost:5000/task to generate some requests
Go to the “Logs” tab in SigNoz
Add otelTraceID and otelSpanID in the column settings

The logs are now shown alongside their trace and span IDs.

You can also click on a log entry to see its details. You'll see a trace_id field, which you can click to jump to the matching trace in the “Traces” tab.

Alternatively, go to the “Traces” tab, select a trace, and click “Go to related logs” to see the logs associated with that trace.

The correlation links each log to its request's journey, helping you debug issues by seeing both the log messages and the trace details together.

Keep digging

Observability opens a window into your application's inner workings. It’s how you learn from your systems. And the more you see, the better you build. This is just the start. Keep digging.

What is Envoy Proxy?

Anas T — Tue, 16 May 2023 18:11:30 +0000

Why is Envoy Proxy required?

Organizations moving their applications from a monolithic to a microservices architecture face plenty of challenges. Managing and monitoring the sheer number of distributed services across Kubernetes and public cloud often exhausts app developers, cloud teams, and SREs. Below are some of the major network-level operational hassles of microservices, which show why Envoy proxy is required.

Lack of secure network connection

Kubernetes is not inherently secure because services are allowed to talk to each other freely. It poses a great threat to the infrastructure since an attacker who gains access to a pod can move laterally across the network and compromise other services. This can be a huge problem for security teams, as it is harder to ensure the safety and integrity of sensitive data. Also, the traditional perimeter-based firewall approach and intrusion detection systems will not help in such cases.

Complying with security policies is a huge challenge

No developer enjoys writing security logic for authentication and authorization instead of working on business problems. However, organizations that must adhere to regulations such as HIPAA or GDPR often ask their developers to build security logic, such as mTLS encryption, into their applications. This has two consequences: frustrated developers, and security policies implemented locally and in silos.

Lack of visibility due to complex network topology

Typically, microservices are distributed across multiple Kubernetes clusters and cloud providers. Communication between these services within and across cluster boundaries will contribute to a complex network topology in no time. As a result, it becomes hard for Ops teams and SREs to have visibility over the network, which impedes their ability to identify and resolve network issues in a timely manner. This will lead to frequent application downtime and compromised SLA.

Complicated service discovery

Services are frequently created and destroyed in a dynamic microservices environment. The static configurations used by older-generation proxies cannot keep track of services in such an environment. This makes it difficult for application engineers to configure communication logic between services, because they have to manually update the configuration file whenever a service is deployed or deleted. As a result, application developers spend more of their time configuring networking logic than coding business logic.

Inefficient load balancing and traffic routing

It is crucial for platform architects and cloud engineers to ensure effective traffic routing and load balancing between services. However, it is a time-consuming and error-prone process for them to manually configure routing rules and load balancing policies for each service, especially when they have a fleet of them. Also, traditional load balancers with simple algorithms would result in inefficient resource utilization and suboptimal load balancing in the case of microservices. All these lead to increased latency, and service unavailability due to improper traffic routing.

With the rise in the adoption of microservices architecture, there was a need for a fast, intelligent proxy that can handle the complex service-to-service connection across the cloud.

Introducing Envoy proxy

Envoy is an open-source edge and service proxy, originally developed by Lyft to facilitate their migration from a monolith to cloud-native microservices architecture. It also serves as a communication bus for microservices (refer to fig. A below) across the cloud, enabling them to communicate with each other in a rapid, secure, and efficient manner.

Envoy proxy abstracts network and security from the application layer to an infrastructure layer. This helps application developers simplify developing cloud-native applications by saving hours spent on configuring network and security logic.

Envoy proxy provides advanced load balancing and traffic routing capabilities that are critical to run large, complex distributed applications. Also, the modular architecture of Envoy helps cloud and platform engineers to customize and extend its capabilities.

Envoy proxy architecture with Istio

Envoy proxies are deployed as sidecar containers alongside application containers. The sidecar proxy then intercepts and takes care of the service-to-service connection (refer to fig B below) and provides a variety of features. This network of proxies is called a data plane, and it is configured and monitored from a control plane provided by Istio. These two components together form the Istio service mesh architecture, which provides a powerful and flexible infrastructure layer for managing and securing microservices.

Envoy proxy features

Envoy proxy offers the following features at a high level. (Visit Envoy docs for more information on the features listed below.)

Out-of-process architecture: It means that the Envoy proxy runs independently as a separate process, apart from the application process. It can be deployed as a sidecar proxy and also as a gateway without requiring any changes to the application. Envoy is also compatible with any application language like Java or C++, which provides greater flexibility for application developers.
L3/L4 and L7 filter architecture: Envoy supports filters and allows customizing traffic at the network layer (L3/L4) and at the application layer (L7). This allows for more control over the network traffic and offers granular traffic management capabilities such as TLS client certificate authentication, buffering, rate limiting, and routing/forwarding.
HTTP/2 and HTTP/3 support: Envoy supports HTTP/1.1, HTTP/2, and HTTP/3 (currently in alpha) protocols. This enables seamless communication between clients and target servers using different versions of HTTP.
HTTP L7 routing: Envoy’s HTTP L7 routing subsystem can route and redirect requests based on various criteria, such as path, authority, and content type. This feature is useful for building front/edge proxies and service-to-service meshes.
gRPC support: Envoy supports gRPC, a Google RPC framework that uses HTTP/2 or above as its underlying transport. Envoy can act as a routing and load balancing substrate for gRPC requests and responses.
Service discovery and dynamic configuration: Envoy supports service discovery and dynamic configuration through a layered set of APIs that provide dynamic updates about backend hosts, clusters, routing, listening sockets, and cryptographic material. This allows for centralized management and simpler deployment, with options for DNS resolution or static config files.
Health checking: For building an Envoy mesh, service discovery is treated as an eventually consistent process. Envoy has a health checking subsystem that can perform active and passive health checks to determine healthy load balancing targets.
Advanced load balancing: Envoy’s self-contained proxy architecture allows it to implement advanced load balancing techniques, such as automatic retries, circuit breaking, request shadowing, and outlier detection, in one place, accessible to any application.
Front/edge proxy support: Using the same software at the edge provides benefits such as observability, management, and identical service discovery and load balancing algorithms. Envoy’s feature set makes it well-suited as an edge proxy for most modern web application use cases, including TLS termination, support for multiple HTTP versions, and HTTP L7 routing.
Best-in-class observability: Envoy provides robust statistics support for all subsystems and supports distributed tracing via third-party providers, making it easier for SREs and Ops teams to monitor and debug problems occurring at both the network and application levels.
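To illustrate how passive health checking and outlier detection can work together with load balancing, here is a deliberately simplified Python model. It is a sketch of the concept, not Envoy's implementation: real outlier detection also re-admits hosts after an ejection period, which is omitted here, and the class name, threshold, and host addresses are made up for the example.

```python
import itertools


class OutlierAwareBalancer:
    """Round-robin over hosts, ejecting any host after repeated failures."""

    def __init__(self, hosts, max_consecutive_failures=3):
        self.hosts = list(hosts)
        self.max_failures = max_consecutive_failures
        self.failures = {host: 0 for host in self.hosts}
        self.ejected = set()
        self._cycle = itertools.cycle(self.hosts)

    def pick(self):
        """Return the next healthy host, or None if all are ejected."""
        for _ in range(len(self.hosts)):
            host = next(self._cycle)
            if host not in self.ejected:
                return host
        return None

    def report(self, host, success):
        """Passive health check: learn from the outcome of real requests."""
        if success:
            self.failures[host] = 0
            return
        self.failures[host] += 1
        if self.failures[host] >= self.max_failures:
            self.ejected.add(host)


if __name__ == "__main__":
    lb = OutlierAwareBalancer(["10.0.0.1", "10.0.0.2"], max_consecutive_failures=2)
    lb.report("10.0.0.2", success=False)
    lb.report("10.0.0.2", success=False)  # second failure ejects the host
    print([lb.pick() for _ in range(4)])  # only 10.0.0.1 is returned
```

Keeping this logic in the proxy rather than in each service is what makes it available to applications written in any language.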

Given its powerful set of features, Envoy proxy has become a popular choice for organizations to manage and secure microservices. In practice, it has two main use cases.

Use cases of Envoy proxy

Envoy proxy can be used as both a sidecar service proxy and a gateway.

Envoy sidecar proxy

As we have seen in the Istio architecture, Envoy proxy constitutes the data plane and manages the traffic flow between services deployed in the mesh. The sidecar proxy provides features such as service discovery, load balancing, traffic routing, etc., and offers visibility and security to the network of microservices.

Envoy Gateway as API

Envoy proxy can be deployed as an API gateway and as an ingress (see the Envoy Gateway project). Envoy Gateway is deployed at the edge of the cluster to manage external traffic flowing into the cluster and between multicloud applications (north-south traffic). Envoy Gateway helps application developers who were toiling to configure Envoy proxy (Istio-native) as an API gateway and ingress controller rather than purchasing a third-party solution like NGINX. With Envoy Gateway, they have a central location to configure and manage ingress and egress traffic, and to apply security policies such as authentication and access control.

Below is a diagram of Envoy Gateway architecture and its components.

To read more about Envoy API gateway architecture, features, and learn how to get started with it, follow this link: What is Envoy Gateway, and why is it required for Kubernetes?

Benefits of Envoy proxy

Envoy’s ability to abstract network and security layers offers several benefits for IT teams such as developers, SREs, cloud engineers, and platform teams. Following are a few of them.

Effective network abstraction

The out-of-process architecture of Envoy helps it to abstract the network layer from the application to its own infrastructure layer. This allows for faster deployment for application developers, while also providing a central plane to manage communication between services.

Fine-grained traffic management

With its support for the network (L3/L4) and application (L7) layers, Envoy provides flexible and granular traffic routing, such as traffic splitting, retry policies, and load balancing.
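Traffic splitting is the easiest of these to picture in code. The sketch below models a weighted split between two versions of a service in plain Python; it illustrates the concept only and is not how Envoy is configured. The version names and the 90/10 weights are hypothetical.

```python
import random
from collections import Counter


def pick_version(weights, rng=random):
    """Choose a backend version in proportion to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


if __name__ == "__main__":
    # Hypothetical canary rollout: 90% of requests to v1, 10% to v2.
    split = {"v1": 90, "v2": 10}
    rng = random.Random(42)  # seeded only to make the demo repeatable
    counts = Counter(pick_version(split, rng) for _ in range(10_000))
    print(counts)
```

Over many requests the observed share of each version approaches its weight, which is what lets a team expose a new release to a small slice of traffic first.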

Ensure zero trust security at L4/L7 layers

Envoy proxy helps to implement authentication among services inside a cluster with stronger identity verification mechanisms like mTLS and JWT. You can achieve authorization at the L7 layer with Envoy proxy easily and ensure zero trust. (You can implement AuthN/Z policies with Istio service mesh — the control plane for Envoy.)

Control east-west and north-south traffic for multicloud apps

Since enterprises deploy their applications into multiple clouds, it is important to understand and control the traffic or communication in and out of the data centers. Since Envoy proxy can be used as a sidecar and also an API gateway, it can help manage east-west traffic and also north-south traffic, respectively.

Monitor traffic and ensure optimum platform performance

Envoy aims to make the network understandable by emitting statistics, which are divided into three categories: downstream statistics for incoming requests, upstream statistics for outgoing requests, and server statistics for describing the Envoy server instance. Envoy also provides logs and metrics that provide insights into traffic flow between services, which is also helpful for SREs and Ops teams to quickly detect and resolve any performance issues.

Get started with Envoy Proxy

Below are some resources to help you get started with Envoy Proxy.

How to set up Envoy Proxy in Linux

The following video will give you a high-level overview of Envoy architecture and components such as listeners, network filter chains, routers, and clusters. It will be followed by a demo of installing Envoy on Ubuntu. You will also see a sample Flask application and how Envoy configuration is written to define all the components.

Deploying Envoy in K8s and Configuring as Load Balancer

This video discusses different deployment types and their use cases, and it shows a demo of Envoy deployment into Kubernetes and how to set it as a load balancer (edge proxy).

About IMESH

IMESH offers solutions to help organizations adopt Istio service mesh without any implementation or operational hassle. IMESH provides a platform built on top of Istio and Envoy API gateway to help start with Istio from Day 1. The platform is hardened for production and is fit for multicloud and hybrid cloud applications.

IMESH also provides consulting services and expertise to help you adopt Istio rapidly in your organization. We make it easier to deploy Istio into production and ensure there are no unintended container crashes or application misbehavior. IMESH also offers a strong visibility layer on top of Istio, which provides Ops and SREs with a multicluster view of services, dependencies, and network traffic. If you are interested, please talk to an Istio expert or book an Istio demo.

5 Reasons Why You Should Choose Enterprise Istio Over DIY

Anas T — Mon, 15 May 2023 09:07:44 +0000

Istio service mesh is an open-source platform that simplifies the security and network of cloud-native applications. It abstracts the network and security layer from the application into the infrastructure layer. This is helpful in securing and managing communication between microservices, improving developer experience, and achieving zero trust networks.

There are two operating models for Istio: DIY and managed Istio with enterprise support. The DIY approach, where organizations deploy and manage Istio on their own, is prone to some challenges, such as lack of technical expertise, resource availability, etc. In this blog, we will explore these challenges of self-managing Istio in detail and see how enterprise Istio support can bring immense value.

Challenges of self-managing Istio

Self-managing any open-source software comes with a set of challenges. A complex piece of software like Istio is no exception to this. Below are the 5 dimensions of challenges enterprises will face while implementing Istio in the DIY operating model.

5 dimensions of challenges with Istio DIY model

1. Ownership problems

Most open-source products are not plug-and-play solutions. Implementing them requires someone or a team to take full ownership of their implementation and maintenance. Otherwise, it may lead to suboptimal performance, unpatched security vulnerabilities, and downtime. So the following questions have to be addressed before considering Istio or any other open-source solutions:

Who will analyze if the product is hardened and ensure it does not compromise enterprise security?
Who will perform the required experimentation or chaos engineering?
Who will troubleshoot when the software breaks?
Who will be in charge of fixing vulnerabilities and bugs?
Who will take care of the product’s lifecycle management?
Who will collaborate with multiple departments for its enterprise-level implementation?

Ideally, Istio owners also need to closely follow Istio’s development to keep the product up-to-date with the latest releases and security patches. But the problem is that it takes time for the community to patch specific bugs and fix security vulnerabilities. Waiting for the fix to be released and letting the infrastructure be vulnerable meanwhile is not a brilliant solution. So Istio owners will have to investigate and resolve the issue themselves, which requires an in-depth understanding of Istio’s architecture and underlying infrastructure.

Even assuming developers or platform teams have learned Istio well enough to implement and maintain it, they will still have to do a lot of troubleshooting. This raises the question: what should developers ultimately do, maintain an open-source solution like Istio or code business logic?

2. Learning curve

Enterprises that are confident in their IT teams’ competency often decide to handle Istio by themselves. However, operating Istio in the DIY model may backfire, given its complexity. Istio is a heavyweight platform that abstracts the network and security infrastructure. Understanding it completely, including its shortcomings and how to enhance it, means climbing a steep learning curve.

On top of learning Istio itself, there is the data plane, which is implemented using Envoy proxies. Envoy proxies are powerful and configurable proxies that have their own set of features, configurations, and capabilities that need to be understood separately. Above all, in general, developers also have to understand the following concepts to learn Istio properly:

Kubernetes API controller
CNI (Container Network Interface)
Ingress and egress
Multicluster configurations
API gateway integrations
And the list goes on.

Note that I am not saying this to overwhelm or discourage anyone from learning Istio, but to shed some light on the reality of the learning curve in the DIY model of service mesh. And it is true whether you are using Istio or Linkerd.

3. Documentation problems

We can consider documentation pages the “soul” of any open-source project. They help users understand the product and use it effectively. Some solutions, such as Kubernetes, are known for maintaining comprehensive documentation. In a sense, the documentation makes or breaks an open-source product. If the steps outlined on the page constantly throw errors, users will eventually hesitate to dig deeper and stop using the product.

Istio has a well-maintained documentation page with clear instructions and step-by-step tutorials. However, there are a few problems here and there. For example, those who had followed Istio multi-cluster with multi-network setup a few weeks back must have gotten errors while implementing and configuring Istio, until Ravi Verma, CTO of IMESH, fixed it recently.

The challenge here is that there is a lack of volunteers to update and maintain the documentation page on time. The speed at which Istio evolves is also another reason adding to this. And the rapid evolution has made some documentation or examples outdated, maybe because of functionality changes or interface changes. They will no longer be accurate or applicable to the current version of Istio you are using.

In such a case, when the documentation is not working as expected, the developers will have to reach out to Istio maintainers through the Slack channel to sort it out. This can be a time-consuming process and result in delayed troubleshooting.

4. Customization challenges

Open-source solutions are tested in a specific environment. They are suitable only for environments similar to the ones in which they are tested (this is not unique to open source). In real-life scenarios, enterprises run different environments and the requirements vary from one to another. They will need a lot of customization to make the product integrate seamlessly with their existing tech stack.

Also, with Istio, there is no one-size-fits-all use case. Organizations have different requirements, and it takes a certain amount of resources to make Istio fit their needs. Below are some what-if scenarios where heavy customization is required:

The latest version of Istio (v1.17) supports K8s versions 1.23, 1.24, 1.25, and 1.26. What if you have an older version of K8s?
What if you need to implement Istio in hybrid environments — Fargate/Lambda/GKE/on-prem VMs?
What if you have to integrate Istio with AWS CA or DigiCert Enterprise PKI Manager or any other certificate manager to implement certificate rotation for mTLS?
What if you are already using an API gateway, such as Kong, Mulesoft, or Apigee? How do you make Istio integrate with infrastructure you have already invested in?
What if you are using AuthN/AuthZ providers like Microsoft AD with x versions for which Istio does not provide out-of-the-box support?
What if you are using Spinnaker or Argo CD or Tekton CD for deployment and want to configure Istio and Envoy as resources for deployment into multiple clusters?
What if you want a complicated architecture like Istio-on-Istio deployment?
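
The first what-if above can be turned into a quick pre-flight check. The following is a minimal sketch, assuming the compatibility data is exactly what this article states (Istio v1.17 with K8s 1.23 to 1.26); the names SUPPORTED_K8S, k8s_minor, and is_supported are hypothetical helpers for illustration, not part of Istio or Kubernetes tooling.

```python
import re

# Kubernetes minor versions supported by Istio 1.17, as stated in this article.
# Assumption: maintain this table yourself from the official support matrix.
SUPPORTED_K8S = {"1.17": {"1.23", "1.24", "1.25", "1.26"}}


def k8s_minor(version: str) -> str:
    """Reduce a version string such as 'v1.24.9' to its 'major.minor' form."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {version!r}")
    return f"{match.group(1)}.{match.group(2)}"


def is_supported(istio_minor: str, k8s_version: str) -> bool:
    """Return True if the cluster version is listed for the given Istio release."""
    return k8s_minor(k8s_version) in SUPPORTED_K8S.get(istio_minor, set())


if __name__ == "__main__":
    # Hypothetical cluster versions used only to exercise the check.
    for cluster_version in ("v1.24.9", "1.22.17"):
        if is_supported("1.17", cluster_version):
            status = "supported"
        else:
            status = "not listed, plan an upgrade or customization"
        print(f"Kubernetes {cluster_version} with Istio 1.17: {status}")
```

A check like this only flags the mismatch; deciding whether to upgrade Kubernetes or pin an older Istio release remains a planning question.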

These are only a few of the custom requirements/integrations enterprises require while configuring Istio. Sometimes they can be much more complicated.

5. Version upgrades

There are plenty of challenges in upgrading open-source solutions that are implemented enterprise-wide. Upgrading to a new version may introduce compatibility issues with existing applications and infrastructure, and this can lead to unexpected downtime. Sufficient technical expertise is essential to carry out version upgrades so that they do not break customizations and integrations or bring down your applications.

(Trust me, this is the first thing that comes to everyone’s mind during version upgrade conversations.)

Source: stack.io

Istio is one of the most popular projects in CNCF, with over 500 companies including Google, Microsoft, and IBM contributing to it. The active community of contributors of Istio leads to faster releases of new features. However, dealing with upgrades itself can become a problem for developers. For instance, matching the Istio version to the underlying version of Kubernetes can be painful and time-consuming.

There are many questions DevOps, architects, and cloud platform teams need to ask themselves:

Is your current Istio version inadequate to meet your requirements?
Does the new version, say v1.17, have all the features to help achieve your security and network management goals?
Is the new Istio version stable?
Will the new features significantly outweigh the cost of migrating to the new Istio version?
What is the estimated cost of running the latest version of Istio?
What is the estimated impact of the new version of Istio?
What are the CVE risk mitigation strategies? How will you ensure FIPS compliance?

Finding all these answers can be overwhelming for a team, and the discussion can stretch for months. Meanwhile, there is another major Istio release.

Recently, Istio launched ambient mesh, which has a completely different architecture aimed at solving the latency problem with Istio. Ambient mesh will be very compelling for existing Istio users who want to cut down their cloud bills. But in the DIY model, it only creates more work, pulling developers away from their core tasks to focus on upgrading Istio.

How enterprise Istio support can help

The negative impact of managing Istio in a DIY operating model can be catastrophic at times, for example when it results in unexpected downtime. And since the developers maintaining Istio have to spend too much time learning and troubleshooting Istio along with their core work, the possible degradation of developer experience cannot be overlooked either.

One ideal solution to avoid all the challenges of self-managing Istio and its negative impact is to choose enterprise Istio support. The enterprise Istio support solution is there to make the implementation of Istio and its management painless. IMESH provides enterprise Istio support, and here are three ways we can help.

Faster Istio implementation for multicloud and multicluster

Enabling Istio to secure multicluster and multicloud apps can cause endless troubleshooting in the DIY approach. Although Istio is Kubernetes-native, it needs careful planning and engineering to enable traffic management between multiple clusters and VMs.

IMESH offers support for all kinds of workloads deployed in private or public clouds, Kubernetes, or VMs. Watch the video below to have a peek at how to get started with multicluster Istio in EKS and GKE.

Enterprise customization for production usage

Each enterprise requires custom integrations to implement Istio because they have different processes, tools, security standards, and governance policies. IMESH offers pre-built integrations and customizations for over 40 DevOps tools. We also integrate with CD tools, such as Spinnaker and Argo CD, to help you deploy the security and network policies rapidly into target clusters.

(Non-exhaustive) List of software that IMESH can integrate Istio service mesh with:

Data centers: AWS/GCP/Azure/On-prem VMs
Kubernetes: On-prem K8s/EKS/AKS/GKE
API Gateway: Mulesoft/Apigee/Kong
Ingress controllers: NGINX/HAProxy/Ambassador
CD tools: Spinnaker/GitHub Actions/Tekton
GitOps tool: Argo CD/Flux CD/Argo Rollouts
SCM: Git/Bitbucket/GitLab
SSO(IAM): Google SSO/OAuth2.0/SAML/Okta
RBAC: LDAP/Azure AD
Key management: Vault
Certificate management: Let's Encrypt/AWS CA/GCP CA/SPIRE
Monitoring: Prometheus/Zipkin/SkyWalking/Stackdriver
Logging: Datadog/Splunk
Tracing: Jaeger
Notification: Slack/Jira/MS Teams/PagerDuty
Configuration management: Terraform

Rely on Istio experts for lifecycle management and CVE fixes

VPs and Directors of Engineering face a problem when a single person is doing a good job with Istio. If one team member in your organization is in charge of everything Istio, what happens when they leave suddenly? This can lead to operational disruptions and downtime, especially if you have configured Istio for multicluster/multicloud communications. From our interactions with various enterprises, we can say for sure that it happens quite a lot; Istio experts are hard to find and even harder to retain.

With IMESH enterprise Istio support, you can ensure the availability of Istio experts around the clock. They specialize in setting up Istio for enterprise applications across multicloud environments. Istio experts from IMESH can also help in seamlessly onboarding your developers, Ops, SRE, and Platform teams to Istio. Regardless of your team size, environment size, or cluster size, IMESH’s Istio support team is obsessed with quickly providing value.

To understand the level of expertise our team has and experience it firsthand, you can watch some videos from the IMESH YouTube channel where our Istio experts give demonstrations of Istio.

IMESH enterprise Istio support and services

Benefits of IMESH enterprise Istio support

IMESH provides Istio training, administration, and operation training. Besides, we provide advanced concepts training to developers, application/platform teams, SecOps, and SREs. Some major benefits of IMESH enterprise Istio support include the following:

3X faster time to implement Istio for your Kubernetes and VM workloads within SLA
$2Mn savings in the cost of ownership of maintaining Istio and Envoy with faster time to vulnerability fixes
100% security of traffic and data-in-transit with faster implementation of certificates and authorization and authentication in your environment.

Interested in trying enterprise Istio (free pilot)?

Open-source solutions are not cheap. It takes a huge amount of resources to deploy and manage them. As we saw above, Istio is no exception to this pattern. IMESH helps enterprises with their Istio and Envoy journey, and you can test Istio before production by booking a free Istio pilot. Our experts will help you deploy Istio in a pilot environment to determine the security benefits it can bring to your fleet of microservices:

Check the core Istio features
Implement mTLS
Experiment with advanced network strategies
Visualize network performance and behavior
Evaluate the cost, time, and risk for the project

Book the free Istio pilot here: https://imesh.ai/book-istio-service-mesh-pilot.html

The post 5 Reasons Why You Should Choose Enterprise Istio Over DIY appeared first on IMESH.

Zero Trust Network for Microservices with Istio

Anas T — Fri, 24 Mar 2023 07:41:06 +0000

Security was mostly perimeter-based while building monolithic applications. This means securing the network perimeter and access control using firewalls. With the advent of microservices architecture, static and network-based perimeters are no longer effective.

Nowadays, applications are deployed and managed by container orchestration systems like Kubernetes, which are spread across the cloud. Zero trust network (ZTN) is a different approach to secure data across cloud-based networks. In this article, we will explore how ZTN can help secure microservices.

What is Zero Trust Network (ZTN)?

Zero trust network is a security paradigm that does not grant implicit trust to users, devices, and services, and continuously verifies their identity and authorization to access resources.

In a microservices architecture, if a service (the server) receives a request from another service (the client), the server should not assume the trustworthiness of the client. The server should authenticate and authorize the client first, on every request, and only then allow the communication to happen securely (refer to fig. A below).

Fig. A – A Zero Trust Network (ZTN) environment where continuous authentication and authorization are enforced between microservices across multicloud

Why is a zero trust network environment essential for microservices?

The importance of securing the network and data in a distributed network of services cannot be stressed enough. Below are a few challenges that make a ZTN environment necessary for microservices:

Lack of ownership of the network: With microservices, applications moved from a single perimeter to multiple clouds and data centers. As a result, the network is also distributed, which gives intruders a larger attack surface.
Increased network and security breaches: Data and security breaches among cloud providers have become increasingly common since applications moved to public clouds. In 2022, nearly half of all data breaches occurred in the cloud.
Managing multicluster network policies has become tedious: Organizations deploy hundreds of services across multiple Kubernetes clusters and environments. Network policies are local to clusters and do not usually work for multiple clusters. They require a lot of customization and development to define and implement security and routing policies in multicluster and multicloud traffic. Thus, configuring and managing consistent network policies and firewall rules for each service becomes an everlasting and frustrating process.
Service-to-service connection is not inherently secure in K8s: By default, one service can talk to any other service inside a cluster. So, if a service pod is compromised, an attacker can quickly move on to other services in that cluster (a pattern known as lateral movement). Kubernetes does not provide out-of-the-box encryption or authentication for communication between pods or services. Although mTLS can be enabled for workloads running on K8s, doing so is a complex process and has to be implemented manually for each service.
Lack of visibility into the network traffic: If there is a security breach, the Ops and SRE teams should be able to react to the incident quickly. Poor real-time visibility into the network traffic across environments becomes a bottleneck for SREs trying to diagnose issues in time. This impedes incident response, which leads to a high mean time to recovery (MTTR) and catastrophic security risks.

In theory, a zero trust network (ZTN) philosophy solves all the above challenges. Istio service mesh helps Ops and SREs to implement ZTN and secure microservices across the cloud.

Please read top 10 pillars of zero trust network considered by top CISOs.

How Istio service mesh enables ZTN for microservices

Istio is a popular open-source service mesh implementation software that provides a way to manage and secure communication between microservices. Istio abstracts the network into a dedicated layer of infrastructure and provides visibility and control over all communication between microservices.

Istio works by injecting an Envoy proxy (a small sidecar daemon) alongside each service in the mesh (refer to fig. B). Envoy is an L4 and L7 proxy that helps ensure secure connections and network connectivity among the microservices. The Istio control plane allows users to manage all these Envoy proxies, for example by defining security and network policies centrally and cascading them to the proxies. (More on Istio architecture and its components will be explained soon in another blog.)

Fig B – Istio using Envoy proxy to secure connections between services across clusters and clouds

Istio simplifies enforcing a ZTN environment for microservices across the cloud. Inspired by Gartner Zero Trust Network Access, we have outlined four pillars of zero trust network that can be implemented by Istio.

Four pillars of zero trust network enforced by Istio service mesh

1. Enforcing Authentication with Istio

Security teams would be required to create authentication logic for each service to verify the identity of users (humans or machines) that sent requests. This is necessary to ensure the trustworthiness of the user.

In Istio, it can be done by configuring peer-to-peer and request authentication policies using PeerAuthentication and RequestAuthentication custom resources (CRDs):

Peer authentication policies involve authenticating service-to-service communication using mTLS. That is, certificates are issued for both the client and server to verify the identity of each other.

Below is a sample PeerAuthentication resource that enforces strict mTLS authentication for all workloads in the foo namespace:

apiVersion: security.istio.io/v1beta1  
kind: PeerAuthentication  
metadata:  
  name: default  
  namespace: foo  
spec:  
  mtls:  
    mode: STRICT

Request authentication policies let the server verify the identity carried by the request itself. Here, the client attaches a JWT (JSON Web Token) to the request for server-side authentication.

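Below is a sample RequestAuthentication policy created in the foo namespace. It specifies that a JWT presented on incoming requests to the my-app workloads must be issued by the issuer listed under jwtRules and must verify against the public keys published at its jwksUri. Note that requests carrying no token at all are still accepted unless an AuthorizationPolicy also requires one.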

apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: jwt-example
  namespace: foo
spec:
  selector:
    matchLabels:
      app: my-app
  jwtRules:
  - issuer: "https://issuer.example.com"
    jwksUri: "https://issuer.example.com/keys"

Both authentication policies are stored in Istio configuration storage.

2. Implementing authorization with Istio

Authorization is verifying whether the authenticated user is allowed to access a server (access control) and perform the specific action. Continuous authorization prevents malicious users from accessing services, which ensures their safety and integrity.

AuthorizationPolicy is another Istio CRD that provides access control for services deployed in the mesh. It helps in creating policies to deny, allow, and also perform custom actions against an inbound request. Istio allows setting multiple policies with different actions for granular access control to the workloads.

The following AuthorizationPolicy denies POST requests from workloads in the dev namespace to workloads in the foo namespace.

apiVersion: security.istio.io/v1beta1  
kind: AuthorizationPolicy  
metadata:  
  name: httpbin  
  namespace: foo  
spec:  
  action: DENY  
  rules:
  - from:
    - source:
        namespaces: ["dev"]
    to:
    - operation:
        methods: ["POST"]
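
To make the matching logic of this policy concrete, here is a small Python model of it. This is a simplified sketch, not Istio's implementation: it assumes the DENY policy above is the only policy applied to the workload, and the Request, DenyRule, and is_allowed names are hypothetical. A rule matches only when both the source namespace and the HTTP method match.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Request:
    """Hypothetical view of an inbound request as the policy sees it."""
    source_namespace: str
    method: str


@dataclass(frozen=True)
class DenyRule:
    """One rule of a DENY policy: 'from' namespaces AND 'to' methods."""
    namespaces: Tuple[str, ...]
    methods: Tuple[str, ...]

    def matches(self, request: Request) -> bool:
        # Both conditions must hold for the rule to match the request.
        return (
            request.source_namespace in self.namespaces
            and request.method in self.methods
        )


def is_allowed(request: Request, deny_rules: Iterable[DenyRule]) -> bool:
    """Deny if any rule matches; otherwise allow.

    Assumption: no ALLOW or CUSTOM policies exist for the workload.
    """
    return not any(rule.matches(request) for rule in deny_rules)


# Mirrors the httpbin policy above: deny POST from the dev namespace.
HTTPBIN_DENY = DenyRule(namespaces=("dev",), methods=("POST",))

if __name__ == "__main__":
    for req in (Request("dev", "POST"), Request("dev", "GET"), Request("prod", "POST")):
        verdict = "allowed" if is_allowed(req, [HTTPBIN_DENY]) else "denied"
        print(f"{req.method} from {req.source_namespace}: {verdict}")
```

Under these assumptions, only POST requests from dev are denied; GET requests from dev and POST requests from other namespaces still pass.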

3. Multicluster and multicloud visibility with Istio

Another important pillar of ZTN is network and service visibility. SREs and Ops teams would require real-time monitoring of traffic flowing between microservices across cloud and cluster boundaries. Having deep visibility into the network would help SREs quickly identify the root cause of anomalies, develop resolution, and restore the applications.

Istio provides visibility into traffic flow and application health by collecting the following telemetry data from the data plane and control plane of the mesh.

Logs: Istio collects all kinds of logs such as services logs, API logs, access logs, gateway logs, etc., which will help to understand the behavior of an application. Logs also help in faster troubleshooting and diagnosis of network incidents.
Metrics: They help to understand the real-time performance of services so that anomalies can be identified and services fine-tuned at runtime. Istio provides many metrics in addition to the 4 golden ones, which are error rates, traffic, latency, and saturation.
Distributed tracing: It is the tracing and visualizing of requests flowing through multiple services in a mesh. Distributed tracing helps understand interactions between microservices and provides a holistic view of service-to-service communication in the mesh.
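
As a rough illustration of three of the golden signals mentioned above, the sketch below computes error rate, traffic, and a latency percentile from a handful of request records. The RequestRecord type and the sample data are hypothetical and stand in for whatever your metrics backend exposes; saturation is left out because it needs resource usage data rather than request data.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RequestRecord:
    """Hypothetical per-request sample: HTTP status and duration."""
    status: int
    duration_ms: float


def error_rate(records: Sequence[RequestRecord]) -> float:
    """Share of requests that ended with a 5xx status."""
    if not records:
        return 0.0
    errors = sum(1 for record in records if record.status >= 500)
    return errors / len(records)


def traffic_rps(records: Sequence[RequestRecord], window_seconds: float) -> float:
    """Requests per second over the observation window."""
    return len(records) / window_seconds


def latency_percentile(records: Sequence[RequestRecord], percentile: float) -> float:
    """Nearest-rank percentile of request duration in milliseconds."""
    if not records:
        raise ValueError("no records to compute latency from")
    ordered = sorted(record.duration_ms for record in records)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up samples covering a 2-second window.
    samples = [
        RequestRecord(200, 12.0),
        RequestRecord(200, 20.0),
        RequestRecord(503, 250.0),
        RequestRecord(200, 18.0),
    ]
    print("error rate:", error_rate(samples))
    print("traffic (req/s):", traffic_rps(samples, window_seconds=2))
    print("p95 latency (ms):", latency_percentile(samples, 95))
```

In practice these values would come from the mesh's metrics pipeline rather than from in-memory lists; the point here is only what each signal measures.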

4. Network auditing with Istio

Auditing is analyzing logs of a process over a period with the goal to optimize the overall process. Audit logs provide auditors with valuable insights into network activity, including details on each access, the methods used, traffic patterns, etc. This information is useful to understand the communication process in and out of the data center and public clouds.

Istio provides information about who accessed (or requested) which resources and when. This is important for auditors who investigate faulty situations and then suggest steps to improve the overall performance of the network and the security of cloud-native applications.

Deploy Istio for a better security posture

The challenges around securing networks and data in a microservices architecture are going to be increasingly complex. Attackers are always ahead in finding vulnerabilities and exploiting them before anyone in the SRE team gets time to notice.

Implementing a zero trust network will provide visibility and secure Kubernetes clusters from internal or external threats. Istio service mesh can lead this endeavor from the front, with its ability to implement zero trust out of the box. IMESH helps enterprises onboard and adopt Istio service mesh without any operational hassle. Check out our offerings.

About IMESH

IMESH offers solutions to help you avoid errors during the experimentation of implementing Istio and fend off operational issues. IMESH provides a platform built on top of Istio and Envoy API gateway to help start with Istio from Day-1. IMESH Istio platform is hardened for production and is fit for multicloud and hybrid cloud applications. IMESH also provides consulting services and expertise to help you adopt Istio rapidly in your organization.

IMESH also provides a strong visibility layer on top of Istio which provides Ops and SREs a multicluster view of services, dependencies, and network traffic. The visibility layer also provides details of logs, metrics, and traces to help Ops folks to troubleshoot any network issues faster.

The post Zero Trust Network for Microservices with Istio appeared first on IMESH.