Alain Airom

How to gather OpenTelemetry Metrics in Instana with ‘no’ Instana agent on your ‘production’ infrastructure (part 2)

This is the second part of a series of articles on exporting OpenTelemetry traces to an Instana back-end.

The previous article discussed a simple use case of Instana/OpenTelemetry integration for Go applications.

The current article demonstrates a more complex, enterprise/production-oriented usage.

How to gather OpenTelemetry Metrics in Instana with ‘no’ Instana agent on your ‘production’ infrastructure
IBM® Instana® Observability is the gold standard of incident prevention with automated full-stack visibility, 1-second granularity, and 3 seconds to notify. With today’s highly dynamic and complex cloud environments, the average cost of an hour of downtime can reach six figures and beyond. Traditional application performance monitoring (APM) tools simply aren’t fast enough to keep up or thorough enough to contextualize the issues identified. Also, they are typically limited to super users who must complete months of training to use them.

IBM Instana Observability goes beyond traditional APM solutions by democratizing observability so anyone across DevOps, SRE, platform engineering, ITOps, and development can get the data they want with the context they need. Instana automatically delivers continuous high-fidelity data at 1-second granularity and end-to-end traces with the context of logical and physical dependencies across mobile, web, applications, and infrastructure.

Instana can monitor all sorts of platforms and application types once the agent is deployed on the target infrastructure.


But there could be use cases where an organization already uses an observability tool and would like to connect to the Instana back-end without deploying the Instana agent directly on its existing infrastructure. In this article, we will walk through exporting OpenTelemetry metrics to Instana.

This use case was recently brought to us by one of our customers.

To test the feasibility, only a minimal set of requirements is needed. Basically, what you need is:

  • your code (duh)
  • a bastion to install the Instana agent.

What is the Instana agent?

The Instana agent runs on a host, collects telemetry from everything it discovers there, and forwards it to the Instana back-end. Once its OpenTelemetry plugin is enabled (as we will do below), the agent also acts as an OTLP endpoint, listening for gRPC on port 4317 and for HTTP on port 4318.

The demonstration/proof of concept

The demo code I used to put this use case in place was written by Petr Styblo (https://www.linkedin.com/in/styblope/).

The GitHub repo mocks a microservices-based application, with services written in different languages such as Node.js, Go, and Python, and relying on technologies such as Redis and Kafka.

The demo app can be run locally using Docker Compose, or it can be deployed to a Kubernetes/OpenShift cluster (local, remote, or cloud-based). All the instructions to run the app in either environment are fully documented (commands to execute, YAML files, and a Helm chart for Kubernetes deployments).

The use case mocks an Instana agent deployed on a bastion server; all OpenTelemetry traces/metrics go through this server to become visible on the Instana back-end.
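
To make the idea concrete, here is a minimal, hypothetical Python sketch (not taken from the demo repo, which wires the endpoint through environment variables instead) of a service pointing its OTLP trace exporter at the agent on the bastion:

# OTLP EXPORTER SKETCH (hypothetical)
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

AGENT_HOST = "xxx.xxx.xxx.xxx"  # placeholder: the bastion running the Instana agent

provider = TracerProvider(resource=Resource.create({"service.name": "demo-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=f"http://{AGENT_HOST}:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo")
with tracer.start_as_current_span("hello"):
    pass  # this span goes to the agent, which forwards it to the Instana back-end

The demo achieves the same effect for every service at once through the OTEL_EXPORTER_OTLP_ENDPOINT variable in the “.env” file shown below.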

Running everything locally to monitor a microservices application on the Instana back-end

When running the demo app locally, the agent directory should contain a “.env” file as in the example below:

# AGENT .env
# Instana agent configuration
INSTANA_AGENT_KEY="your backend instana agent key"
INSTANA_DOWNLOAD_KEY="your backend instana agent key"
INSTANA_AGENT_ENDPOINT="the instana backend you want to use"
INSTANA_AGENT_ENDPOINT_PORT=443 # this is the port for SaaS backend only
INSTANA_LOG_LEVEL=INFO            # INFO, DEBUG, TRACE, ERROR or OFF.
# EUM settings
INSTANA_EUM_URL=
INSTANA_EUM_KEY=
INSTANA_AGENT_ZONE=otel-demo # the name to track on the back-end

The microservices part needs an “.env” file too:

# App .env
# Images
IMAGE_VERSION=1.4.0
IMAGE_NAME=styblope/otel-demo

# Instana
INSTANA_AGENT_HOST=xxx.xxx.xxx.xxx
INSTANA_AGENT_PORT=42699

# Collector
# Demo Platform
ENV_PLATFORM=local

# OpenTelemetry Collector
# OTEL_COLLECTOR_HOST=otelcol
OTEL_COLLECTOR_HOST=${INSTANA_AGENT_HOST:-otelcol}
OTEL_COLLECTOR_PORT=4317
OTEL_EXPORTER_OTLP_ENDPOINT=http://${OTEL_COLLECTOR_HOST}:${OTEL_COLLECTOR_PORT}
PUBLIC_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4318/v1/traces

# OpenTelemetry Resource Definitions
OTEL_RESOURCE_ATTRIBUTES="service.namespace=opentelemetry-demo"

# Metrics Temporality
OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative

# ******************
# Core Demo Services
# ******************
# Ad Service
AD_SERVICE_PORT=9555
AD_SERVICE_ADDR=adservice:${AD_SERVICE_PORT}

# Cart Service
CART_SERVICE_PORT=7070
CART_SERVICE_ADDR=cartservice:${CART_SERVICE_PORT}

# Checkout Service
CHECKOUT_SERVICE_PORT=5050
CHECKOUT_SERVICE_ADDR=checkoutservice:${CHECKOUT_SERVICE_PORT}

# Currency Service
CURRENCY_SERVICE_PORT=7001
CURRENCY_SERVICE_ADDR=currencyservice:${CURRENCY_SERVICE_PORT}

# Email Service
EMAIL_SERVICE_PORT=6060
EMAIL_SERVICE_ADDR=http://emailservice:${EMAIL_SERVICE_PORT}

# Feature Flag Service
FEATURE_FLAG_SERVICE_PORT=8081
FEATURE_FLAG_SERVICE_ADDR=featureflagservice:${FEATURE_FLAG_SERVICE_PORT}
FEATURE_FLAG_SERVICE_HOST=feature-flag-service
FEATURE_FLAG_GRPC_SERVICE_PORT=50053
FEATURE_FLAG_GRPC_SERVICE_ADDR=featureflagservice:${FEATURE_FLAG_GRPC_SERVICE_PORT}

# Frontend
FRONTEND_PORT=8080
FRONTEND_ADDR=frontend:${FRONTEND_PORT}

# Frontend Proxy (Envoy)
FRONTEND_HOST=frontend
ENVOY_PORT=8080

# Load Generator
LOCUST_WEB_PORT=8089
LOCUST_USERS=10
LOCUST_HOST=http://${FRONTEND_ADDR}
LOCUST_WEB_HOST=loadgenerator
LOCUST_AUTOSTART=true
LOCUST_HEADLESS=false

# Payment Service
PAYMENT_SERVICE_PORT=50051
PAYMENT_SERVICE_ADDR=paymentservice:${PAYMENT_SERVICE_PORT}

# Product Catalog Service
PRODUCT_CATALOG_SERVICE_PORT=3550
PRODUCT_CATALOG_SERVICE_ADDR=productcatalogservice:${PRODUCT_CATALOG_SERVICE_PORT}

# Quote Service
QUOTE_SERVICE_PORT=8090
QUOTE_SERVICE_ADDR=http://quoteservice:${QUOTE_SERVICE_PORT}

# Recommendation Service
RECOMMENDATION_SERVICE_PORT=9001
RECOMMENDATION_SERVICE_ADDR=recommendationservice:${RECOMMENDATION_SERVICE_PORT}

# Shipping Service
SHIPPING_SERVICE_PORT=50050
SHIPPING_SERVICE_ADDR=shippingservice:${SHIPPING_SERVICE_PORT}

# ******************
# Dependent Services
# ******************
# Kafka
KAFKA_SERVICE_PORT=9092
KAFKA_SERVICE_ADDR=kafka:${KAFKA_SERVICE_PORT}

# Redis
REDIS_PORT=6379
REDIS_ADDR=redis-cart:${REDIS_PORT}

# ********************
# Telemetry Components
# ********************
# Grafana
GRAFANA_SERVICE_PORT=3000
GRAFANA_SERVICE_HOST=grafana

# Jaeger
JAEGER_SERVICE_PORT=16686
JAEGER_SERVICE_HOST=jaeger

# Prometheus
PROMETHEUS_SERVICE_PORT=9090
PROMETHEUS_SERVICE_HOST=prometheus
PROMETHEUS_ADDR=${PROMETHEUS_SERVICE_HOST}:${PROMETHEUS_SERVICE_PORT}

We can also dig into the provided code to discover how each microservice sends its traces to OpenTelemetry. The example below is taken from the Python microservice used as a recommender system on the mock eCommerce site that the application provides.

# LOGGER.PY
#!/usr/bin/python
# Copyright The OpenTelemetry Authors
# SPDX-License-Identifier: Apache-2.0

import logging
import sys
from pythonjsonlogger import jsonlogger
from opentelemetry import trace


class CustomJsonFormatter(jsonlogger.JsonFormatter):
    def add_fields(self, log_record, record, message_dict):
        super(CustomJsonFormatter, self).add_fields(log_record, record, message_dict)
        if not log_record.get('otelTraceID'):
            log_record['otelTraceID'] = trace.format_trace_id(trace.get_current_span().get_span_context().trace_id)
        if not log_record.get('otelSpanID'):
            log_record['otelSpanID'] = trace.format_span_id(trace.get_current_span().get_span_context().span_id)

def getJSONLogger(name):
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    formatter = CustomJsonFormatter('%(asctime)s %(levelname)s [%(name)s] [%(filename)s:%(lineno)d] [trace_id=%(otelTraceID)s span_id=%(otelSpanID)s] - %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
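As an illustration of how this helper is meant to be used (a hypothetical snippet, not part of the repo): when the returned logger is called inside an active span, the otelTraceID and otelSpanID fields are filled in automatically.

# LOGGER USAGE SKETCH (hypothetical)
from logger import getJSONLogger

logger = getJSONLogger('recommendationservice-server')
logger.info("service starting")  # emitted as one JSON line; trace/span IDs are set when a span is active
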
# METRICS.PY
#!/usr/bin/python
# Copyright The OpenTelemetry Authors
# SPDX-License-Identifier: Apache-2.0


def init_metrics(meter):

    # Recommendations counter
    app_recommendations_counter = meter.create_counter(
        'app_recommendations_counter', unit='recommendations', description="Counts the total number of given recommendations"
    )

    rec_svc_metrics = {
        "app_recommendations_counter": app_recommendations_counter,
    }

    return rec_svc_metrics
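The same goes for the metrics helper above (a hypothetical usage sketch, assuming a meter provider has already been configured, e.g. by OpenTelemetry auto-instrumentation):

# METRICS USAGE SKETCH (hypothetical)
from opentelemetry import metrics
from metrics import init_metrics

meter = metrics.get_meter_provider().get_meter("recommendationservice")
rec_svc_metrics = init_metrics(meter)

# Record three recommendations served from the catalog
rec_svc_metrics["app_recommendations_counter"].add(3, {'recommendation.type': 'catalog'})
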
# recommendation_server.py
#!/usr/bin/python
# Copyright The OpenTelemetry Authors
# SPDX-License-Identifier: Apache-2.0

# Python
import os
import random
from concurrent import futures

# Pip
import grpc
from opentelemetry import trace, metrics
from opentelemetry._logs import set_logger_provider
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import (
    OTLPLogExporter,
)
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.sdk.resources import Resource

# Local
import logging
import demo_pb2
import demo_pb2_grpc
from grpc_health.v1 import health_pb2
from grpc_health.v1 import health_pb2_grpc

from metrics import (
    init_metrics
)

cached_ids = []
first_run = True

class RecommendationService(demo_pb2_grpc.RecommendationServiceServicer):
    def ListRecommendations(self, request, context):
        prod_list = get_product_list(request.product_ids)
        span = trace.get_current_span()
        span.set_attribute("app.products_recommended.count", len(prod_list))
        logger.info(f"Receive ListRecommendations for product ids:{prod_list}")

        # build and return response
        response = demo_pb2.ListRecommendationsResponse()
        response.product_ids.extend(prod_list)

        # Collect metrics for this service
        rec_svc_metrics["app_recommendations_counter"].add(len(prod_list), {'recommendation.type': 'catalog'})

        return response

    def Check(self, request, context):
        return health_pb2.HealthCheckResponse(
            status=health_pb2.HealthCheckResponse.SERVING)

    def Watch(self, request, context):
        return health_pb2.HealthCheckResponse(
            status=health_pb2.HealthCheckResponse.UNIMPLEMENTED)


def get_product_list(request_product_ids):
    global first_run
    global cached_ids
    with tracer.start_as_current_span("get_product_list") as span:
        max_responses = 5

        # Formulate the list of characters to list of strings
        request_product_ids_str = ''.join(request_product_ids)
        request_product_ids = request_product_ids_str.split(',')

        # Feature flag scenario - Cache Leak
        if check_feature_flag("recommendationCache"):
            span.set_attribute("app.recommendation.cache_enabled", True)
            if random.random() < 0.5 or first_run:
                first_run = False
                span.set_attribute("app.cache_hit", False)
                logger.info("get_product_list: cache miss")
                cat_response = product_catalog_stub.ListProducts(demo_pb2.Empty())
                response_ids = [x.id for x in cat_response.products]
                cached_ids = cached_ids + response_ids
                cached_ids = cached_ids + cached_ids[:len(cached_ids) // 4]
                product_ids = cached_ids
            else:
                span.set_attribute("app.cache_hit", True)
                logger.info("get_product_list: cache hit")
                product_ids = cached_ids
        else:
            span.set_attribute("app.recommendation.cache_enabled", False)
            cat_response = product_catalog_stub.ListProducts(demo_pb2.Empty())
            product_ids = [x.id for x in cat_response.products]

        span.set_attribute("app.products.count", len(product_ids))

        # Create a filtered list of products excluding the products received as input
        filtered_products = list(set(product_ids) - set(request_product_ids))
        num_products = len(filtered_products)
        span.set_attribute("app.filtered_products.count", num_products)
        num_return = min(max_responses, num_products)

        # Sample list of indices to return
        indices = random.sample(range(num_products), num_return)
        # Fetch product ids from indices
        prod_list = [filtered_products[i] for i in indices]

        span.set_attribute("app.filtered_products.list", prod_list)

        return prod_list

def must_map_env(key: str):
    value = os.environ.get(key)
    if value is None:
        raise Exception(f'{key} environment variable must be set')
    return value

def check_feature_flag(flag_name: str):
    flag = feature_flag_stub.GetFlag(demo_pb2.GetFlagRequest(name=flag_name)).flag
    return flag.enabled

if __name__ == "__main__":
    service_name = must_map_env('OTEL_SERVICE_NAME')

    # Initialize Traces and Metrics
    tracer = trace.get_tracer_provider().get_tracer(service_name)
    meter = metrics.get_meter_provider().get_meter(service_name)
    rec_svc_metrics = init_metrics(meter)

    # Initialize Logs
    logger_provider = LoggerProvider(
        resource=Resource.create(
            {
                'service.name': service_name,
            }
        ),
    )
    set_logger_provider(logger_provider)
    log_exporter = OTLPLogExporter(insecure=True)
    logger_provider.add_log_record_processor(BatchLogRecordProcessor(log_exporter))
    handler = LoggingHandler(level=logging.NOTSET, logger_provider=logger_provider)

    # Attach OTLP handler to logger
    logger = logging.getLogger('main')
    logger.addHandler(handler)

    catalog_addr = must_map_env('PRODUCT_CATALOG_SERVICE_ADDR')
    ff_addr = must_map_env('FEATURE_FLAG_GRPC_SERVICE_ADDR')
    pc_channel = grpc.insecure_channel(catalog_addr)
    ff_channel = grpc.insecure_channel(ff_addr)
    product_catalog_stub = demo_pb2_grpc.ProductCatalogServiceStub(pc_channel)
    feature_flag_stub = demo_pb2_grpc.FeatureFlagServiceStub(ff_channel)

    # Create gRPC server
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))

    # Add class to gRPC server
    service = RecommendationService()
    demo_pb2_grpc.add_RecommendationServiceServicer_to_server(service, server)
    health_pb2_grpc.add_HealthServicer_to_server(service, server)

    # Start server
    port = must_map_env('RECOMMENDATION_SERVICE_PORT')
    server.add_insecure_port(f'[::]:{port}')
    server.start()
    logger.info(f'Recommendation service started, listening on port {port}')
    server.wait_for_termination()

Once the repo is cloned and the values are set, you can use the Docker Compose commands provided to test/run the app.

The application is accessible at http://localhost:8080/

You can monitor the Instana back-end for traces (my tests used a test/local back-end; let’s move on).


Running a remote agent bastion

This is great, but now let’s try it with a remote machine used as the bastion for the Instana agent.

The high-level idea of the architecture is the same as before: the application exports its OpenTelemetry data to the agent on the bastion, which forwards it to the Instana back-end.


Test the app with a remote agent

  • I provisioned a minimal Ubuntu VSI on IBM Cloud (you can use either classic or VPC mode).
  • Then we need an Instana back-end to generate the agent for our server. For that purpose, I used the IBM Instana dev sandbox to generate an agent for my VSI.


  • Copy the generated agent installation script provided by the back-end, connect to the remote server via “ssh root@xx.xx.x.xx”, then paste and execute it:
#!/bin/bash

curl -o setup_agent.sh https://setup.instana.io/agent && chmod 700 ./setup_agent.sh && sudo ./setup_agent.sh -a xxxxxxxxxxxx -d xxxxxxxxxxxx -t dynamic -e ingress-xxx-xxx.instana.io:443   
  • You need to uncomment/modify the following sections:
nano /opt/instana/agent/etc/instana/configuration.yaml
# Hardware & Zone
com.instana.plugin.generic.hardware:
  enabled: true # disabled by default
  availability-zone: 'the-name-you-choose-to-see-your-bastion-server-on-instana-backend'

  • This is my example (picked up by the Instana back-end after a few seconds).


  • and enable the OpenTelemetry section:
# OpenTelemetry Collector
com.instana.plugin.opentelemetry:
  enabled: true
  grpc:
    enabled: true
  http:
    enabled: true
  • Then you’ll need to restart the Instana agent on your bastion server.
cd /opt/instana/agent/bin
./stop
./status   # if you want to check
./start

Now again you can run the microservice part of the cloned repo locally (or for example from your K8s deployment).

Network adjustments required in the case of a remote bastion agent (here, on IBM Cloud)

As the IBM Cloud platform was used to provision a VSI and deploy the agent, some network configuration is required.

To allow the application’s telemetry to reach the bastion agent, whose ports only listen on localhost, you should put a reverse proxy such as Nginx in place.

Install Nginx as a reverse proxy

Log into your bastion and install Nginx:

sudo apt update
sudo apt install nginx

If you have a firewall, you should also adjust it:

sudo ufw app list

In my case, I didn’t put any firewall configuration in place.

Configuration of the reverse proxy

As discussed earlier, the microservices app has its own specific configuration, which must now be changed to reflect the use of a reverse proxy.

First, we give a different port number for the Instana agent in the .env file.

# Instana
INSTANA_AGENT_HOST=xxx.xxx.xxx.xxx
INSTANA_AGENT_PORT=42700

Then we configure our reverse proxy as shown below in /etc/nginx/sites-enabled/default (the file name is ‘default’):

##
# You should look at the following URL's in order to grasp a solid understanding
# of Nginx configuration files in order to fully unleash the power of Nginx.
# https://www.nginx.com/resources/wiki/start/
# https://www.nginx.com/resources/wiki/start/topics/tutorials/config_pitfalls/
# https://wiki.debian.org/Nginx/DirectoryStructure
#
# In most cases, administrators will remove this file from sites-enabled/ and
# leave it as reference inside of sites-available where it will continue to be
# updated by the nginx packaging team.
#
# This file will automatically load configuration files provided by other
# applications, such as Drupal or Wordpress. These applications will be made
# available underneath a path with that package name, such as /drupal8.
#
# Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples.
##

# Default server configuration
#
server {
  listen 80 default_server;
  listen [::]:80 default_server;

  # SSL configuration
  #
  # listen 443 ssl default_server;
  # listen [::]:443 ssl default_server;
  #
  # Note: You should disable gzip for SSL traffic.
  # See: https://bugs.debian.org/773332
  #
  # Read up on ssl_ciphers to ensure a secure configuration.
  # See: https://bugs.debian.org/765782
  #
  # Self signed certs generated by the ssl-cert package
  # Don't use them in a production server!
  #
  # include snippets/snakeoil.conf;

  root /var/www/html;

  # Add index.php to the list if you are using PHP
  index index.html index.htm index.nginx-debian.html;

  server_name _;

  location / {
    # First attempt to serve request as file, then
    # as directory, then fall back to displaying a 404.
    try_files $uri $uri/ =404;
  }

  # pass PHP scripts to FastCGI server
  #
  #location ~ \.php$ {
  # include snippets/fastcgi-php.conf;
  #
  # # With php-fpm (or other unix sockets):
  # fastcgi_pass unix:/run/php/php7.4-fpm.sock;
  # # With php-cgi (or other tcp sockets):
  # fastcgi_pass 127.0.0.1:9000;
  #}

  # deny access to .htaccess files, if Apache's document root
  # concurs with nginx's one
  #
  #location ~ /\.ht {
  # deny all;
  #}
}

# Instana agent: expose external port 42700, proxied to the local agent on 42699
server {
        access_log /var/log/nginx/access_42700.log;
        error_log /var/log/nginx/error.log error;
        listen 42700;
        listen [::]:42700;
        location / {
                proxy_pass http://localhost:42699;
        }
}

# OTLP/gRPC: expose external port 42717, proxied to the local OTLP gRPC port 4317
server {
        access_log /var/log/nginx/access_42717.log;
        error_log /var/log/nginx/error.log error;
        listen 42717 http2;
        location / {
                grpc_pass 127.0.0.1:4317;
        }
}

# OTLP/HTTP: expose external port 42718, proxied to the local OTLP HTTP port 4318
server {
        access_log /var/log/nginx/access_42718.log;
        error_log /var/log/nginx/error.log error;
        listen 42718;
        listen [::]:42718;
        location / {
                proxy_pass http://localhost:4318;
        }
}

# Virtual Host configuration for example.com
#
# You can move that to a different file under sites-available/ and symlink that
# to sites-enabled/ to enable it.
#
#server {
# listen 80;
# listen [::]:80;
#
# server_name example.com;
#
# root /var/www/example.com;
# index index.html;
#
# location / {
#   try_files $uri $uri/ =404;
# }
#}

In this configuration, the OpenTelemetry-specific ports 4317 (gRPC) and 4318 (HTTP), which only listen on localhost, become visible and reachable through a remote connection too!

Now we restart the Nginx server to apply the changed configuration.

systemctl restart nginx
systemctl status nginx
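
Before pointing the application at the bastion, it is worth checking that the three proxied ports actually answer from the outside. Here is a quick sanity-check sketch (a hypothetical helper, assuming Python 3 on your workstation; the bastion IP is a placeholder):

# CHECK_PORTS.PY (hypothetical helper)
import socket

BASTION = "xxx.xxx.xxx.xxx"  # placeholder: your bastion's public IP

for port in (42700, 42717, 42718):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(3)
        status = "open" if s.connect_ex((BASTION, port)) == 0 else "closed/filtered"
        print(f"{BASTION}:{port} -> {status}")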

OpenTelemetry traces visible on the Instana back-end

Example of OpenTelemetry traces on the Instana back-end.


And there we go! We have the result we wanted.

The next step is to give more user-friendly names to the services! Stay tuned, and thanks for reading.

The dream-team who contributed to this project

First, Petr Styblo (Cloud-Native Solutions Architect @ibm) for pointing me to his repo and adjusting his code on the fly for my needs.

Many thanks to Mathieu Figiel (Technical Sales Specialist Observability @ibm) for his valuable help on Instana and network configuration.

And also many thanks to Badreddine Boutanzit (SRE @ibm Client Engineering) for his valuable adjustments to the reverse proxy implementation.

Last but not least, my buddy Keyvan Tofighi (APM/ARM Specialist @ibm), who brought us into his project, for his help and knowledge of the Instana back-end.
