DEV Community: Reliably

Observing the Reliability of your Java Apps and Services with Spring Boot, Micrometer, Prometheus & Reliably

Russ Miles — Wed, 04 Aug 2021 14:49:16 +0000

Here at Reliably we are huge fans of Spring Boot and Micrometer, the dimensional metrics instrumentation library. Together they provide a rich set of metrics that make a great foundation for the Service Level Indicators backing your Reliably Service Level Objectives as Code.

As of Spring Boot 2.0, Micrometer became the default instrumentation library for the huge range of Spring Boot applications, from monoliths to microservices. With Micrometer baked in by default, we started to explore just how easy it would be to bring Reliably's "Developer-First" SLOs to bear on your Spring Boot apps and services.

In this article we share our findings including how Reliably really can work "Bootifully" (TM, Josh Long :) ) with all your Spring Boot apps and services, out of the box and with no extra code required!

NOTE: The full coded sample for this article is available on GitHub.

The Setup: Spring Boot, Prometheus & Reliably

The exercise we wanted to conduct was to show how you could define and collaborate on Reliably Service Level Objectives that were measuring the availability of a simple Spring Boot service. To do this we needed three pieces in the mix:

With this approach, Spring Boot and Micrometer would expose dimensional metrics for Prometheus to scrape. Then Reliably would use Prometheus queries to collate Service Level Indicators to back the Service Level Objectives being observed. Simple? Actually, it is…

NOTE: We chose Prometheus for this particular article but we could just as easily have picked one of the other tools supported by Micrometer and Reliably, such as DataDog.

Sourcing Metrics from our Spring Boot Service

To get things built as quickly and easily as possible, we used the Spring Initializr to generate a very simple HTTP-based application that did nothing more than provide a default root / response of "Greetings from Spring Boot!" to provide the service that we'd look to observe our SLOs on.

As mentioned in the introduction, Spring Boot applications come with all the power of Micrometer by default. The only thing we needed to do was make sure that our Spring Boot service's metrics could be scraped by Prometheus, by adding a single line to our service's application.properties:

Setting up Prometheus to Scrape the Metrics

Next we added a simple Scraper configuration to our instance of Prometheus to periodically grab all the Micrometer metrics for our Spring Boot service from the endpoint we configured in the previous step:

With this config in place, Prometheus will grab the metrics from our Spring Boot service every 15 seconds.

Creating and Observing the Reliably SLOs as Code

The final step was for us to use the new Prometheus support in the Reliably CLI v0.23.0 to create our SLO with an SLI implemented as an appropriate Prometheus query:

This SLO definition includes a Service Level Indicator (SLI) that queries Prometheus for the appropriate metrics to help us judge that the SLO is being met, or not.
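To make the relationship between the indicator and the objective concrete, here is a minimal Python sketch. The function names and the sample counts are illustrative assumptions and not part of the Reliably CLI; the sketch only shows the arithmetic behind "is the SLO being met, or not".

```python
def availability_percent(good: int, total: int) -> float:
    """Share of good (for example 2xx) responses, as a percentage."""
    if total == 0:
        raise ValueError("no requests observed in the window")
    return 100.0 * good / total


def objective_met(indicator_percent: float, objective_percent: float) -> bool:
    """An SLO is met when the indicator is at or above the target."""
    return indicator_percent >= objective_percent


# Hypothetical counts for one window: 9891 good responses out of 10000.
sli = availability_percent(9891, 10000)
print(f"{sli:.2f}% -> met: {objective_met(sli, 99.0)}")
```

With a 99% objective, an indicator just below 99% means the objective is not met for that window.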

Pushing your SLIs to Reliably

With your Service Level Objective, and corresponding Prometheus-driven Service Level Indicators, in hand, you can now begin pushing the SLO and SLIs over time to Reliably using the reliably slo agent command:

$ reliably slo agent -i 10
INFO[0000] --- starting slo indicator agent ---         
INFO[0000] getting indicators for objective: [name='99% of requests return 2xx over last 1 hour', service='my spring boot app'] 
INFO[0000] indicator percent: [98.91] for objective: [name='99% of requests return 2xx over last 1 hour', service='my spring boot app']

Observing your SLOs from your Command Line

With our Spring Boot service running and receiving requests, our Prometheus instance scraping the available metrics, and Reliably monitoring the above SLO, we have successfully defined and can observe a Java Spring Boot application's reliability with as little code as possible! The final step is to observe the status of your SLO using the reliably slo report -w -m reliably.yaml command:

Where to go next: We need … You!

Our goal is to shift reliability left by making it as easy as possible for you and your team to define, observe and learn how to make your systems reliable. As such we are constantly looking to make it easier for people to collaborate, code and observe Service Level Objectives and Indicators for their own bespoke needs.

You can check out all the different tools that we currently integrate with in our docs, but if there's something you don't see then please get in touch or maybe even raise a ticket and PR yourself on our free and open source Reliably CLI project.

How does chaos engineering relate to the mathematical definitions of chaos?

Mick Roper — Thu, 29 Jul 2021 12:56:05 +0000

Recently, we had the pleasure of sharing Reliably's ideas on proactive reliability practices with a fabulous group of devops, engineers, architects and SREs at a bank. In conversations that followed, we discussed how chaos engineering relates to the mathematical definitions of chaos.

In my view Chaos Engineering principles do align with mathematical chaos, where a chaotic dynamic system is highly sensitive to input conditions, and can generate a non-linear outcome depending on those conditions, as well as the evolution of those conditions over time.

A good analogy would be weather - it exhibits many of the same conditions of a complex computer system where the input conditions are themselves complex systems, and therefore the outcome is complex and hard to model accurately (which is why 'weather predictions' should be renamed 'weather probabilities').

Any sufficiently complex system can be subject to the behaviours described by chaos theory, and chaos engineering is simply exploring how the system might respond to turbulent conditions. Turbulence was carefully chosen as the term in the Principles of Chaos because it evokes the weather system example as any attempt at proving and predicting from design principles will have the same problems as predicting the weather.

The upside with computer systems over weather systems is that, thanks to tools like the Chaos Toolkit (CTK), we don't simulate the entire system (like you have to do with weather) but instead run a bounded, controlled experiment that allows us to develop an appreciation for the probability of an outcome when given certain inputs.

I believe that the goal of modern software architecture is to manage the amount of dynamism in a given system. That's a big reason for using things like 'Bounded Context' from DDD, de-coupled service-oriented architectures and *-as-code tools like Terraform, Snyk and Reliably to manage and understand the occurrence of complex events that might impact our system. By reducing the opportunity for an unknown, unplanned input we reduce the opportunity for an unknown, unplanned output from our system.

These ideas have informed the development and roadmaps of both Reliably and the Chaos Toolkit.

Chaos Toolkit enables developer-first discovery of your system's weaknesses through the exploration and testing of your systems as code.

Reliably provides developer-first cloud native application reliability for teams. It enables developers to:

Define service level objectives as code with your team and all stakeholders.
Observe service level objective trends over time, surfacing detected reliability weaknesses as you code, continuously with gates and guardrails for Service Level Objective trends and detected weaknesses, right in your own CI/CD platform.
Alert teams when your reliability is trending in the wrong direction.
Explore the impact of chaotic conditions on your reliability through chaos engineering experiments.
Fix reliability problems using the best advice for your infrastructure.
Verify continuously if your reliability fixes are having the right effects on your Service Level Objectives.
Learn, collaborate and share how reliability is managed amongst your team and across your entire organisation.

If you are interested in the concepts and best practices for pro-active reliability and want to learn more, you may find the following resources useful:

Get started with Chaos Toolkit here and join the Chaos Toolkit community on Slack here.
Get started with Reliably here. The easiest and quickest way of getting started with Reliably is to run the Reliably CLI on your machine with your local source code files. You can find a link to the install guide here.
Join the Reliability meetup group to learn and share skills and experience with fellow engineers who are introducing pro-active reliability in their organisations. The meetup was paused for a little while due to Covid, but will resume meetups in person and online from September 2021.

Are you introducing chaos engineering or proactive reliability practices in your organisation? I'd love to hear more about your experience! Ping me on Slack or share your thoughts below!

eBPF for SRE with Reliably

Sylvain Hellegouarch — Wed, 28 Jul 2021 21:20:46 +0000

eBPF is a funny piece of technology. It is based on BPF, which is almost as old as Linux itself, and yet eBPF has been trending heavily for the past couple of years.

In my book, eBPF is a system event generator. By tapping into that event pool and listening at the right level, you can gain tons of insight into your system.

Funnily enough, SRE has also been trending for a couple of years, and it is precisely about the health of the system.

So, could these two be made for each other? Well, maybe not in such candid terms, but there is something very appealing about bringing them closer.

SRE introduced Service Level Objectives (SLO) and Service Level Indicators (SLI). An SLO encodes what good looks like for the aspects of your system that matter to you. An SLI encodes the metrics that are aggregated and compared against the SLO's target. eBPF is an interesting data source for indicators.

For instance, say you have this eBPF program (via BCC):

#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>

#define IP_TCP  6

int http_filter(struct __sk_buff *skb) {
    u8 *cursor = 0;

    // let's not care for anything not Ethernet or TCP
    struct ethernet_t *ethernet = cursor_advance(cursor, sizeof(*ethernet));
    if (!(ethernet->type == 0x0800)) {
        return 0;
    }

    struct ip_t *ip = cursor_advance(cursor, sizeof(*ip));
    if (ip->nextp != IP_TCP) {
        return 0;
    }

    return -1;
}

Simple socket filtering really. We only keep TCP packets to inspect them in user-land and ignore the rest.
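For readers less familiar with packet layouts, the same decision can be sketched in plain Python on raw bytes. This is an illustration only, not code from the talk: the function and constant names are made up, and it assumes untagged Ethernet frames (no VLAN header) carrying IPv4.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
IP_PROTO_TCP = 6
ETH_HEADER_LEN = 14


def is_ipv4_tcp(frame: bytes) -> bool:
    """Return True when an Ethernet frame carries an IPv4/TCP packet."""
    # we need the Ethernet header plus at least the first 10 bytes of IPv4
    if len(frame) < ETH_HEADER_LEN + 10:
        return False
    # EtherType is the last two bytes of the Ethernet header (big-endian)
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return False
    # the IPv4 protocol field sits at byte offset 9 of the IP header
    return frame[ETH_HEADER_LEN + 9] == IP_PROTO_TCP


def fake_frame(ethertype: int, proto: int) -> bytes:
    """Build a zero-filled frame with only the two fields we inspect."""
    eth = bytes(12) + struct.pack("!H", ethertype)
    ip_header = bytearray(20)
    ip_header[9] = proto
    return eth + bytes(ip_header)


print(is_ipv4_tcp(fake_frame(0x0800, 6)))   # IPv4 + TCP
print(is_ipv4_tcp(fake_frame(0x0800, 17)))  # IPv4 + UDP
print(is_ipv4_tcp(fake_frame(0x86DD, 6)))   # IPv6 EtherType
```

The kernel-side filter makes the same two checks, only much earlier, so uninteresting packets never reach user-land.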

From user-land, we now have a Python program that injects this program into the kernel:

class MySocketHndl(psocket.SocketHndl):
    def __init__(self, b: BPF, timeout: int = None, iface: str = "lo"):
        """
        BCC gives us a socket to listen from. Bind to it.
        """
        function_http_filter = b.load_func("http_filter", BPF.SOCKET_FILTER)
        BPF.attach_raw_socket(function_http_filter, iface)
        socket_fd = function_http_filter.sock

        self._socket = socket.fromfd(
            socket_fd, socket.PF_PACKET, socket.SOCK_RAW,
            socket.IPPROTO_IP)

        # blocks forever when timeout is None
        self._socket.settimeout(timeout)

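from contextlib import contextmanager

@contextmanager
def bpfsock(iface: str = "lo"):
    """
    Loads the BPF program and starts listening on the socket we attach
    to the interface. Cleanup when finished.
    """
    # load outside the try block: if this fails there is nothing to clean up
    b = BPF(src_file="ebpf.c")
    psock = None
    try:
        psock = MySocketHndl(b=b, iface=iface)
        yield psock
    finally:
        # psock stays None when attaching to the interface failed
        if psock is not None:
            psock.close()
        b.cleanup()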

We simply attach to the interface and add our filter to the socket used to listen on the interface. Now we can process packets as we see them. We then dismiss any packet we don't care about, here anything not HTTP, and we parse valid packets as HTTP requests/responses, using the awesome pypacker.

def filter_pkt(eth: ethernet.Ethernet, target_port: int = 8000) -> bool:
    """
    Process only packets that are going to or from the target server.
    """
    if eth[ethernet.Ethernet, ip.IP, tcp.TCP] is not None:
        tcp_p = eth[tcp.TCP]
        if tcp_p.dport == target_port or tcp_p.sport == target_port:
            return True
    return False

def process():
    with bpfsock(iface="lo") as psock:
        for pkt in psock.recvp_iter(filter_match_recv=filter_pkt):
            h = http.HTTP(pkt[tcp.TCP].body_bytes)

Boom, we're gold!

From there on, all we have to do is collect some information about the requests/responses we see (duration, status code, path requested...) and aggregate ratios over the time window we are interested in tracking.

total_count = class_2xx = good_latency_count = 0
for req in requests:
    # only count requests that completed inside the current window
    if not (window_start <= req["end"] < window_end):
        continue
    total_count += 1
    # our SLO latency is 150ms
    if req["duration"] <= 0.15:
        good_latency_count += 1
    # any 2xx status counts as a good response
    if 200 <= req["status"] < 300:
        class_2xx += 1

# nothing to report when no request fell inside the window
if total_count > 0:
    indicators.put(
        (
            "availability", last_push, next_push, path,
            100.0 * (class_2xx / total_count)
        )
    )
    indicators.put(
        (
            "latency", last_push, next_push, path,
            100.0 * (good_latency_count / total_count)
        )
    )

Nothing fancy here.
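If you want to try the arithmetic without capturing any traffic, here is a self-contained variant of the same aggregation run against a handful of made-up requests. The function name, the sample data and the treatment of every 2xx status as a good response are assumptions of this sketch.

```python
from typing import Dict, List, Optional, Tuple


def window_ratios(
    requests: List[Dict[str, float]],
    window_start: float,
    window_end: float,
    latency_target: float = 0.15,
) -> Optional[Tuple[float, float]]:
    """Return (availability %, latency %) for one window, or None if empty."""
    total = good_status = good_latency = 0
    for req in requests:
        # skip requests that ended outside the window
        if not (window_start <= req["end"] < window_end):
            continue
        total += 1
        if 200 <= req["status"] < 300:
            good_status += 1
        if req["duration"] <= latency_target:
            good_latency += 1
    if total == 0:
        return None
    return 100.0 * good_status / total, 100.0 * good_latency / total


# Made-up requests: "end" and "duration" are in seconds.
sample = [
    {"end": 1.0, "status": 200, "duration": 0.05},
    {"end": 2.0, "status": 500, "duration": 0.02},
    {"end": 3.0, "status": 200, "duration": 0.30},
    {"end": 4.0, "status": 204, "duration": 0.10},
    {"end": 12.0, "status": 200, "duration": 0.01},  # outside the window
]
print(window_ratios(sample, 0.0, 10.0))
```

Returning None for an empty window keeps "no traffic" distinct from "0% good", which would otherwise look like an outage.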

Now that we have our indicators, we can send them to Reliably to generate our SLO results:

def send_indicators():
    indicator_type, from_ts, to_ts, path, value = indicators.get()

    headers = {"Authorization": f"Bearer {TOKEN}"}
    indicator = {
        "metadata": {
            "labels": {
                "category": indicator_type,
                "path": path
            }
        },
        "spec": {
            "from": f"{from_ts.isoformat()}Z",
            "to": f"{to_ts.isoformat()}Z",
            "percent": value
        }
    }

    if indicator_type == "latency":
        indicator["metadata"]["labels"]["percentile"] = "100"
        indicator["metadata"]["labels"]["latency_target"] = "150ms"

    httpx.put(reliably_url, headers=headers, json=indicator)

Again, nothing fancy.

At this stage, Reliably can now generate SLO results you can start viewing using the Reliably CLI:

$ reliably slo report
Refreshing SLO report every 3 seconds. Press CTRL+C to quit.
                                                                  Current  Objective  / Time Window     Type             Trend      
  Service #1: ebpf-2021-demo                                 
  ❌ 99% of the responses are under 150ms                          98.99%      99.5%  / 10s             Latency          ✕ ✕ ✕ ✓ ✕  
  ❌ 99% of the responses to our users are in the 2xx class        98.66%        99%  / 10s             Availability     ✓ ✓ ✕ ✓ ✕

Kaboom! We have now successfully mapped low-level eBPF events to high-level SLO constructs.

Obviously, this is a rather trivial showcase but it's promising. Nevertheless, there is still some way to go before the whole process becomes more attractive, as eBPF's UX is perhaps not as transparent as one could hope for.

Still, so much fun!

The code can be found at https://github.com/Lawouach/ebpf-2021-talk.

Bringing reliability closer to you with Reliably and DataDog

Sylvain Hellegouarch — Fri, 23 Jul 2021 10:14:04 +0000

As engineers we care about our users, at least we ought to :) They depend on us and our services to run just fine. This is reliability in a nutshell.

Site Reliability Engineering, or SRE if you're casual, has gained momentum to codify this view on reliability. This article does not detail SRE itself. Instead it focuses on how we can use one of its tools, Service Level Objectives (or SLO for short), to signal a loss of reliability as close to engineers as we can.

Let's say we have a web application like this one below:

from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route

async def homepage(request):
    return JSONResponse({'hello': 'world'})

app = Starlette(debug=True, routes=[
    Route('/', homepage),
])

Nothing fancy about it, just a Hello World example. We can run it as follows:

$ uvicorn --reload server:app

Here, server is the name of the Python module containing that code (server.py). The --reload flag allows us to change the code and let uvicorn restart the server automatically.

We can access this server as follows:

$ curl localhost:8000/

Now, let's run a basic load against this server using hey:

$ hey -c 3 -q 10 -z 20s http://localhost:8000/

Summary:
  Total:    20.0125 secs
  Slowest:  0.0164 secs
  Fastest:  0.0020 secs
  Average:  0.0046 secs
  Requests/sec: 29.9813

  Total data:   10200 bytes
  Size/request: 17 bytes

Response time histogram:
  0.002 [1] |
  0.003 [152]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.005 [184]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.006 [195]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.008 [53]    |■■■■■■■■■■■
  0.009 [11]    |■■
  0.011 [1] |
  0.012 [1] |
  0.014 [0] |
  0.015 [0] |
  0.016 [2] |


Latency distribution:
  10% in 0.0027 secs
  25% in 0.0033 secs
  50% in 0.0043 secs
  75% in 0.0055 secs
  90% in 0.0067 secs
  95% in 0.0074 secs
  99% in 0.0083 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0020 secs, 0.0164 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0005 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0002 secs
  resp wait:    0.0043 secs, 0.0017 secs, 0.0161 secs
  resp read:    0.0001 secs, 0.0001 secs, 0.0007 secs

Status code distribution:
  [200] 600 responses

This will gently load our server without going overboard.

We likely want to monitor this server, so why not use DataDog to do so? It looks like this:

from ddtrace import config, patch
import ddtrace.profiling.auto
from starlette.applications import Starlette
from starlette.responses import JSONResponse
from starlette.routing import Route


async def homepage(request):
    return JSONResponse({'hello': 'world'})


patch(starlette=True)
config.starlette['service_name'] = 'my-test-service'

app = Starlette(debug=True, routes=[
    Route('/', homepage),
])

What differs is that we are importing DataDog's ddtrace to push request traces to the local DataDog agent. The agent is started as follows in a different terminal:

$ export DD_API_KEY=...
$ export DD_SITE=datadoghq.eu

$ docker run --rm -it --name dd-agent \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v /proc/:/host/proc/:ro \
-v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
-e DD_API_KEY=${DD_API_KEY} \
-e DD_SITE=${DD_SITE} \
-e DD_APM_ENABLED=true \
-e DD_APM_NON_LOCAL_TRAFFIC=true \
-p 8126:8126/tcp \
gcr.io/datadoghq/agent:latest

After a couple of minutes, you'll be able to search for metrics from this application on DataDog. Look for metrics with starlette in the name.

Could we now trick the application into raising odd errors to fake a faulty service? Why yes of course! By simply returning a 4xx or 5xx class of errors at random from time to time:

import random


from ddtrace import config, patch
import ddtrace.profiling.auto
from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import JSONResponse
from starlette.routing import Route


async def index(request: Request) -> JSONResponse:
    if random.random() > 0.91:
        return JSONResponse({'error': 'boom'}, status_code=500)
    return JSONResponse({'hello': 'world'})


patch(starlette=True)
config.starlette['distributed_tracing'] = True
config.starlette['service_name'] = 'my-frontend-service'

app = Starlette(debug=True, routes=[
    Route('/', index),
])

Let's see how this impacts our client now. Let's run our mild load again:

$ hey -c 3 -q 10 -z 20s http://localhost:8000/

Summary:
  Total:    20.0120 secs
  Slowest:  0.0189 secs
  Fastest:  0.0018 secs
  Average:  0.0051 secs
  Requests/sec: 29.9820

  Total data:   10142 bytes
  Size/request: 16 bytes

Response time histogram:
  0.002 [1] |
  0.004 [146]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.005 [193]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.007 [146]   |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.009 [101]   |■■■■■■■■■■■■■■■■■■■■■
  0.010 [10]    |■■
  0.012 [0] |
  0.014 [0] |
  0.016 [0] |
  0.017 [1] |
  0.019 [2] |


Latency distribution:
  10% in 0.0029 secs
  25% in 0.0036 secs
  50% in 0.0050 secs
  75% in 0.0065 secs
  90% in 0.0074 secs
  95% in 0.0079 secs
  99% in 0.0092 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0018 secs, 0.0189 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0005 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0002 secs
  resp wait:    0.0048 secs, 0.0017 secs, 0.0187 secs
  resp read:    0.0002 secs, 0.0001 secs, 0.0011 secs

Status code distribution:
  [200] 542 responses
  [500] 58 responses

Notice how the summary now shows that some responses were errors, as per our change above. Yay, we broke something!
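As a quick sanity check on those numbers, the availability for this run can be worked out from the status code distribution alone. This is a small sketch; the 99% target is the one used for the SLO later in this article, and the variable names are illustrative.

```python
# Status code counts reported by hey in the run above.
status_counts = {200: 542, 500: 58}

total = sum(status_counts.values())
good = sum(n for code, n in status_counts.items() if 200 <= code < 300)
availability = 100.0 * good / total

# A 99% target leaves room for 1% of requests to fail.
target = 99.0
allowed_failures = total * (100.0 - target) / 100.0

print(f"availability: {availability:.2f}%")
print(f"failures: {total - good}, allowed at {target}%: {allowed_failures:.0f}")
```

With 58 failures against roughly 6 allowed, this run is far outside what a 99% objective tolerates.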

Can we now ask DataDog about these recorded errors? Yes we can:

# datadog info (change these to fit your own)
export DD_API_KEY=
export DD_APP_KEY=
export DD_SITE=datadoghq.eu

# your query data
$ export query="(sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() - sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count()) / (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count())"
$ export from=$(date "+%s" -d "15 min ago")
$ export to=$(date "+%s")

$ curl -G -s -X GET "https://api.${DD_SITE}/api/v1/query" \
--data-urlencode "from=${from}" \
--data-urlencode "to=${to}" \
--data-urlencode "query=${query}" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: ${DD_API_KEY}" \
-H "DD-APPLICATION-KEY: ${DD_APP_KEY}" | jq .

The query we are running may look daunting but is rather straightforward. We take the total number of requests and subtract the ones that ended in error. We then divide by the total again, which gives us the ratio of good requests to all requests (multiply by 100 for a percentage).
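Since the same metric scope is repeated three times in that query, it can help to assemble it programmatically. The helper below is a hypothetical convenience written for this article, not part of any DataDog client; it only does string formatting and reproduces the query shown above.

```python
def good_ratio_query(service: str, resource: str) -> str:
    """Build the "(hits - errors) / hits" query for one traced resource."""
    # doubled braces emit the literal { } that delimit the tag scope
    scope = f"{{service:{service},resource_name:{resource}}}"
    hits = f"sum:trace.starlette.request.hits{scope}.as_count()"
    errors = f"sum:trace.starlette.request.errors{scope}.as_count()"
    return f"({hits} - {errors}) / ({hits})"


query = good_ratio_query("my-test-service", "get_/")
print(query)
```

The `hits - errors` part is the count of good events and the trailing `hits` is the count of all events, which is the numerator/denominator split Reliably asks for.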

Great, we now have a query we can use to create a service level objective (SLO) that will tell us how our service is doing over time. Let's use Reliably for this.

$ reliably slo init
? What is the name of the service you want to declare SLOs for? my-frontend-service
| Paste your 'numerator' (good events) datadog query: (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() - sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count())
| Paste your 'denominator' (total events) datadog query: (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count())
? What is your target for this SLO (in %)? 99
? What is your observation window for this SLO? custom
? Define your custom observation window PT5M

? What is the name of this SLO? 99% of frontend responses over last 5 minutes are 2xx
SLO '99% of frontend responses over last 5 minutes are 2xx' added to Service 'my-frontend-service'

? Do you want to add another SLO? No
Service 'my-frontend-service' added

? Do you want to add another Service? No

✓ Your manifest has been saved to ./reliably.yaml

In a nutshell, we created a file that contains the definition of the SLO:

apiVersion: reliably.com/v1
kind: Objective
metadata:
  labels:
    name: 99% of requests  over last 5 minutes
    service: my-test-service
spec:
  indicatorSelector:
    datadog_denominator_query: sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count()
    datadog_numerator_query: (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() - sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count())
  objectivePercent: 99
  window: 1h0m0s

Now we can let Reliably know about it:

$ reliably slo sync

Finally, while the application is still running with some load injected into it, start fetching data from DataDog, using the query we saw earlier and let Reliably consolidate them over the window duration given in the objective:

$ reliably slo agent -i3

Now open a new terminal and run the following:

$ reliably slo report -w

This will show you the SLO report for your service as computed by Reliably.

So what happened exactly? Well, let's zoom in on a section of the SLO:

  indicatorSelector:
    datadog_denominator_query: sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count()
    datadog_numerator_query: (sum:trace.starlette.request.hits{service:my-test-service,resource_name:get_/}.as_count() - sum:trace.starlette.request.errors{service:my-test-service,resource_name:get_/}.as_count())

The indicatorSelector property is how the magic happens. It serves two purposes:

it gives the reliably slo agent command the means to know which provider to use, here DataDog, and therefore how to fetch the required datapoints, here the two queries. These datapoints are stored as indicators on Reliably
it declares how the objective and its indicators are mapped together

That second point is key. Indicators themselves are not declared as entities (or objects) as objectives are. Instead they are merely a stream of values consumed by Reliably when sent by a client (reliably slo agent or via the API directly). Upon receiving an indicator, Reliably looks at its labels and matches them against the indicatorSelector of every objective (in the current organization). This tells us that objectives and indicators are loosely coupled. The fact that the reliably.yaml manifest contains the selector doesn't define the indicator, only how to match indicators to objectives.
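The loose coupling can be pictured with a small Python sketch. This is an illustration under assumptions, not Reliably's implementation: it assumes matching means "every key/value in the selector is present in the indicator's labels", and the objective names and labels are made up.

```python
from typing import Dict, List

Labels = Dict[str, str]


def selector_matches(selector: Labels, indicator_labels: Labels) -> bool:
    """True when every selector entry appears with the same value in the labels."""
    return all(indicator_labels.get(k) == v for k, v in selector.items())


def matching_objectives(objectives: Dict[str, Labels], indicator_labels: Labels) -> List[str]:
    """Names of the objectives whose selector matches the incoming indicator."""
    return [
        name for name, selector in objectives.items()
        if selector_matches(selector, indicator_labels)
    ]


# Hypothetical objectives, keyed by name, each with its indicator selector.
objectives = {
    "latency under 150ms": {"category": "latency", "path": "/"},
    "responses are 2xx": {"category": "availability", "path": "/"},
}

# A hypothetical incoming indicator carrying a few extra labels.
incoming = {"category": "latency", "path": "/", "percentile": "100"}
print(matching_objectives(objectives, incoming))
```

Because the match is driven by labels alone, an indicator can arrive from any client and still land on the right objective, and several objectives can share one indicator stream.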

At this stage, you have a simple declaration of a service level objective that relies on DataDog's data to compute it. Since the SLO is just a file, you can now store it alongside your code base and use it as part of your CI/CD pipeline to automate decisions about releasing. We'll see this in a future article using GitHub Actions.

The code for this article can be found on GitHub.