<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Dzyoba</title>
    <description>The latest articles on DEV Community by Alex Dzyoba (@dzeban).</description>
    <link>https://dev.to/dzeban</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F83314%2F24678d60-d25c-4970-9c7b-9caad4f83f37.jpeg</url>
      <title>DEV Community: Alex Dzyoba</title>
      <link>https://dev.to/dzeban</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dzeban"/>
    <language>en</language>
    <item>
      <title>Envoy first impression</title>
      <dc:creator>Alex Dzyoba</dc:creator>
      <pubDate>Fri, 25 Jan 2019 00:00:00 +0000</pubDate>
      <link>https://dev.to/dzeban/envoy-first-impression-3871</link>
      <guid>https://dev.to/dzeban/envoy-first-impression-3871</guid>
      <description>&lt;p&gt;When I was doing &lt;a href="https://alex.dzyoba.com/blog/nginx-mirror/"&gt;traffic mirroring with nginx&lt;/a&gt; I’ve stumbled upon a surprising problem – nginx was delaying original request if mirror backend was slow. This is really bad because you expect that mirroring is “fire and forget”. Anyway, I’ve solved this by mirroring only part of the traffic but this drove me to find another proxy that could have mirror traffic without such problems. This is when I finally found time and energy to look into &lt;a href="https://www.envoyproxy.io/"&gt;Envoy&lt;/a&gt; – I’ve heard a lot of great things about it and always wanted to get my hands dirty with it.&lt;/p&gt;

&lt;p&gt;Just in case you’ve never heard of it – Envoy is a proxy server that is most commonly used in a service mesh scenario, but it can also be an edge proxy.&lt;/p&gt;

&lt;p&gt;In this post, I will look only at the &lt;strong&gt;edge proxy scenario&lt;/strong&gt; because I’ve never maintained a service mesh. Keep that use case in mind. Also, I will inevitably compare Envoy to nginx because that’s what I know and use.&lt;/p&gt;

&lt;h2&gt;What’s great about Envoy&lt;/h2&gt;

&lt;p&gt;The main reason I wanted to try Envoy was a handful of compelling features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observability&lt;/li&gt;
&lt;li&gt;Advanced load balancing policies&lt;/li&gt;
&lt;li&gt;Active checks&lt;/li&gt;
&lt;li&gt;Extensibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s unpack that list!&lt;/p&gt;

&lt;h3&gt;Observability&lt;/h3&gt;

&lt;p&gt;Observability is one of the most thorough features in Envoy. One of its design principles is to provide transparency in network communication, given how complex modern systems have become with all this microservices madness.&lt;/p&gt;

&lt;p&gt;Out of the box it provides &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/statistics"&gt;lots of metrics&lt;/a&gt; for various metrics systems, including Prometheus.&lt;/p&gt;
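&lt;p&gt;For reference, exposing the stats only takes an &lt;code&gt;admin&lt;/code&gt; block in the config – a minimal sketch (the bind address and port are my choice):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Expose the Envoy admin interface; its /stats/prometheus
# endpoint serves metrics in the Prometheus text format.
admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then point Prometheus (or just &lt;code&gt;curl&lt;/code&gt;) at &lt;code&gt;http://localhost:9901/stats/prometheus&lt;/code&gt;.&lt;/p&gt;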

&lt;p&gt;To get that kind of insight in nginx you have to buy &lt;a href="https://www.nginx.com/products/nginx/live-activity-monitoring"&gt;nginx plus&lt;/a&gt; or use the VTS module, which means compiling nginx on your own. Hopefully, my project &lt;a href="https://github.com/alexdzyoba/nginx-vts-build"&gt;nginx-vts-build&lt;/a&gt; will help – I’m building nginx with the VTS module as a drop-in replacement for stock nginx, with a systemd service and basic configs. Think of it as an nginx distro. Currently, it has only one release, for Debian 9, but I’m open to suggestions. If you have a feature request, please let me know. But let’s get back to Envoy.&lt;/p&gt;

&lt;p&gt;In addition to metrics, Envoy &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/tracing"&gt;can be integrated with distributed tracing systems&lt;/a&gt; like Jaeger.&lt;/p&gt;

&lt;p&gt;And finally, it can &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/operations/traffic_capture"&gt;capture the traffic&lt;/a&gt; for further analysis with Wireshark.&lt;/p&gt;

&lt;p&gt;I’ve only looked at Prometheus metrics and they are quite nice!&lt;/p&gt;

&lt;h3&gt;Advanced load balancing&lt;/h3&gt;

&lt;p&gt;Load balancing in Envoy is &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/load_balancing/load_balancers"&gt;very feature-rich&lt;/a&gt;. Not only does it support round-robin, weighted and random policies, but also load balancing with consistent hashing algorithms like ketama and maglev. The point of the latter is fewer changes in traffic patterns when the upstream cluster is rebalanced.&lt;/p&gt;
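&lt;p&gt;The policy is picked per cluster via &lt;code&gt;lb_policy&lt;/code&gt; – a hedged sketch (the cluster name and endpoint are made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Consistent-hashing load balancing: RING_HASH is the
# ketama-style option, MAGLEV the other consistent hash.
clusters:
- name: backend
  type: STRICT_DNS
  connect_timeout: 1s
  lb_policy: RING_HASH   # or MAGLEV, ROUND_ROBIN, LEAST_REQUEST, RANDOM
  hosts:
  - socket_address:
      address: backend.local
      port_value: 10000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that for consistent hashing to actually kick in, the route also needs a hash policy (e.g. on a header or cookie) so Envoy knows what to hash on.&lt;/p&gt;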

&lt;p&gt;Again, you can get the same &lt;a href="https://www.nginx.com/products/nginx/load-balancing"&gt;advanced features in nginx&lt;/a&gt;, but only if you pay for nginx plus.&lt;/p&gt;
&lt;h3&gt;Active checks&lt;/h3&gt;

&lt;p&gt;To &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/operations/traffic_capture"&gt;check the health&lt;/a&gt; of the upstream endpoints, Envoy actively sends requests and expects valid answers – only endpoints that answer stay in the upstream cluster. This is a very nice feature that open source nginx lacks (but &lt;a href="https://docs.nginx.com/nginx/admin-guide/load-balancer/http-health-check/#active-health-checks"&gt;nginx plus has&lt;/a&gt;).&lt;/p&gt;
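&lt;p&gt;A hedged sketch of what an active HTTP check looks like in a cluster definition (the path and thresholds are arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Active HTTP health checking for a cluster: an endpoint failing
# 3 checks in a row is ejected until it passes 2 in a row.
clusters:
- name: backend
  # (type, connect_timeout, hosts as in the configs below)
  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 3
    healthy_threshold: 2
    http_health_check:
      path: /healthz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;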
&lt;h3&gt;Extensibility&lt;/h3&gt;

&lt;p&gt;You can configure Envoy as a &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/redis"&gt;Redis proxy&lt;/a&gt;, &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/dynamo"&gt;DynamoDB filter&lt;/a&gt;, &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/mongo"&gt;MongoDB filter&lt;/a&gt;, &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/grpc"&gt;grpc proxy&lt;/a&gt;, &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/network_filters/mysql_proxy_filter"&gt;MySQL filter&lt;/a&gt;, &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/network_filters/thrift_proxy_filter"&gt;Thrift filter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is not a killer feature, imho, given that support for most of these protocols is experimental, but it’s nice to have and shows that Envoy is extensible.&lt;/p&gt;

&lt;p&gt;It also supports &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http_filters/lua_filter#config-http-filters-lua"&gt;Lua scripting out of the box&lt;/a&gt;. For nginx you have to use &lt;a href="https://openresty.org/"&gt;OpenResty&lt;/a&gt;.&lt;/p&gt;
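&lt;p&gt;A tiny hedged example of the Lua filter (the header name is made up) – the function runs for every request before the router filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Inline Lua that stamps an extra header onto each request.
http_filters:
- name: envoy.lua
  config:
    inline_code: |
      function envoy_on_request(request_handle)
        request_handle:headers():add("x-proxied-by", "envoy")
      end
- name: envoy.router
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;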
&lt;h2&gt;What’s not so great about Envoy&lt;/h2&gt;

&lt;p&gt;The features above alone make a very good case for using Envoy. However, I found a few things that keep me from switching to it from nginx:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No caching&lt;/li&gt;
&lt;li&gt;No static content serving&lt;/li&gt;
&lt;li&gt;Lack of flexible configuration&lt;/li&gt;
&lt;li&gt;Docker-only packaging&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;No caching&lt;/h3&gt;

&lt;p&gt;Envoy &lt;a href="https://github.com/envoyproxy/envoy/issues/868"&gt;doesn’t support caching of responses&lt;/a&gt;. This is a must-have feature for an edge proxy, and nginx implements it really well.&lt;/p&gt;
&lt;h3&gt;No static content serving&lt;/h3&gt;

&lt;p&gt;While Envoy does networking really well, it doesn’t touch the filesystem apart from initial config file loading and &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/runtime#config-runtime"&gt;runtime configuration handling&lt;/a&gt;. If you were thinking of serving static frontend files (js, html, css), you’re out of luck – Envoy doesn’t support that. Nginx, again, does it very well.&lt;/p&gt;
&lt;h3&gt;Lack of flexible configuration&lt;/h3&gt;

&lt;p&gt;Envoy is configured via YAML, and its configuration feels very explicit. I think that’s actually a good thing – explicit is better than implicit. But I feel that Envoy configuration is bounded by the features specifically implemented in Envoy. &lt;em&gt;Maybe it’s a lack of experience with Envoy and old habits&lt;/em&gt;, but in nginx, with maps, the rewrite module (with the &lt;code&gt;if&lt;/code&gt; directive) and other nice modules, I have a very flexible config system that lets me implement almost anything. The cost of this flexibility is, of course, a good portion of complexity – nginx configuration requires some learning and practice, but in my opinion it’s worth it.&lt;/p&gt;
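&lt;p&gt;To show what I mean by flexibility, here is a hedged nginx sketch (all names are made up) that routes requests to a different upstream based on a header – one &lt;code&gt;map&lt;/code&gt; plus a variable in &lt;code&gt;proxy_pass&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Pick an upstream group by the X-Canary request header.
map $http_x_canary $pool {
    default canary_off;
    "1"     canary_on;
}

upstream canary_off { server backend.local:10000; }
upstream canary_on  { server canary.local:10001; }

server {
    listen 8000;

    location / {
        proxy_pass http://$pool;
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;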

&lt;p&gt;Nevertheless, Envoy supports dynamic configuration, though it’s not like you can change some part of the configuration via a REST call – it’s about the discovery of configuration settings. That’s what the whole XDS protocol is about, with its EDS, CDS, RDS and what-not-DS.&lt;/p&gt;

&lt;p&gt;Citing &lt;a href="https://github.com/envoyproxy/data-plane-api/blob/master/XDS_PROTOCOL.md"&gt;docs&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Envoy discovers its various dynamic resources via the filesystem or by &lt;em&gt;querying&lt;/em&gt; one or more management servers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The emphasis is mine – I wanted to note that you have to provide a server that responds to Envoy’s discovery (XDS) requests.&lt;/p&gt;

&lt;p&gt;However, there is no ready-made solution that implements Envoy’s XDS protocol. There was &lt;a href="https://github.com/turbinelabs/rotor"&gt;rotor&lt;/a&gt;, but the company behind it shut down, so the project is mostly dead.&lt;/p&gt;

&lt;p&gt;There is Istio, but it’s a monster I don’t want to touch right now. Also, if you’re on Kubernetes there is &lt;a href="https://github.com/heptio/contour"&gt;Heptio Contour&lt;/a&gt;, but not everybody needs and uses Kubernetes.&lt;/p&gt;

&lt;p&gt;In the end, you could implement your own XDS service using &lt;a href="https://github.com/envoyproxy/go-control-plane"&gt;go-control-plane stubs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But that doesn’t seem to be widely used. What I saw most people do is use DNS for EDS and CDS. Especially remembering that Consul has a DNS interface, it seems we can use Consul to dynamically provide the list of hosts to Envoy. This isn’t big news, because I can (and do) use Consul to provide the list of backends for nginx by using a DNS name in &lt;code&gt;proxy_pass&lt;/code&gt; together with the &lt;code&gt;resolver&lt;/code&gt; directive.&lt;/p&gt;
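&lt;p&gt;A hedged sketch of that DNS-based approach on the Envoy side (the Consul service name is an example): a &lt;code&gt;STRICT_DNS&lt;/code&gt; cluster keeps re-resolving the name and updates its endpoints accordingly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Cluster whose members come from Consul's DNS interface.
clusters:
- name: backend
  type: STRICT_DNS          # re-resolve and track all returned IPs
  connect_timeout: 1s
  dns_refresh_rate: 5s
  hosts:
  - socket_address:
      address: backend.service.consul
      port_value: 10000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;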

&lt;p&gt;Also, &lt;a href="https://www.consul.io/docs/connect/proxies/envoy.html"&gt;Consul Connect supports Envoy&lt;/a&gt; for proxying requests but this is not about Envoy – this is about how awesome Consul is!&lt;/p&gt;

&lt;p&gt;So this whole dynamic configuration thing in Envoy is really confusing and hard to follow, because whenever you try to google it you get bombarded with posts about Istio, which is distracting.&lt;/p&gt;
&lt;h3&gt;Docker-only packaging&lt;/h3&gt;

&lt;p&gt;Envoy is officially distributed only as Docker images. This is a minor thing but it just annoys me. Also, I don’t like that the Docker images don’t have version tags. Maybe it’s intended so you always run the latest version, but it seems very strange.&lt;/p&gt;
&lt;h3&gt;Conclusion on not-so-great parts&lt;/h3&gt;

&lt;p&gt;In the end, I’m not saying Envoy is bad in any way – from my point of view it just has a different focus: advanced proxying and an out-of-process service mesh data plane. The edge proxy part is just a bonus that is suitable in some, but not many, situations.&lt;/p&gt;
&lt;h2&gt;What about mirroring?&lt;/h2&gt;

&lt;p&gt;With that being said, let’s see Envoy in practice and repeat the mirroring experiments from my previous post.&lt;/p&gt;

&lt;p&gt;Here are 2 minimal configs – one for nginx and the other for Envoy. Both do the same thing – simply proxy requests to some backend service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# nginx proxy config

upstream backend {
    server backend.local:10000;
}

server {
    server_name proxy.local;
    listen 8000;

    location / {
        proxy_pass http://backend;
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;





&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Envoy proxy config&lt;/span&gt;
&lt;span class="na"&gt;static_resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;listeners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;listener_0&lt;/span&gt;
    &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;socket_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
        &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
        &lt;span class="na"&gt;port_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8001&lt;/span&gt;
    &lt;span class="na"&gt;filter_chains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;envoy.http_connection_manager&lt;/span&gt;
        &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;stat_prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingress_http&lt;/span&gt;
          &lt;span class="na"&gt;route_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;virtual_hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local_service&lt;/span&gt;
              &lt;span class="na"&gt;domains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
              &lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/"&lt;/span&gt;
                &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
          &lt;span class="na"&gt;http_filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;envoy.router&lt;/span&gt;
  &lt;span class="na"&gt;clusters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STATIC&lt;/span&gt;
    &lt;span class="na"&gt;connect_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1s&lt;/span&gt;
    &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;socket_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;127.0.0.1&lt;/span&gt;
          &lt;span class="na"&gt;port_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;They perform identically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ # Load test nginx
$ hey -z 10s -q 1000 -c 1 -t 1 http://proxy.local:8000

Summary:
  Total:    10.0006 secs
  Slowest:  0.0229 secs
  Fastest:  0.0002 secs
  Average:  0.0004 secs
  Requests/sec: 996.7418

  Total data:   36881600 bytes
  Size/request: 3700 bytes

Response time histogram:
  0.000 [1] |
  0.002 [9963]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.005 [3] |
  0.007 [0] |
  0.009 [0] |
  0.012 [0] |
  0.014 [0] |
  0.016 [0] |
  0.018 [0] |
  0.021 [0] |
  0.023 [1] |

...

Status code distribution:
  [200] 9968 responses


$ # Load test Envoy
$ hey -z 10s -q 1000 -c 1 -t 1 http://proxy.local:8001

Summary:
  Total:    10.0006 secs
  Slowest:  0.0307 secs
  Fastest:  0.0003 secs
  Average:  0.0007 secs
  Requests/sec: 996.1445

  Total data:   36859400 bytes
  Size/request: 3700 bytes

Response time histogram:
  0.000 [1] |
  0.003 [9960]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.006 [0] |
  0.009 [0] |
  0.012 [0] |
  0.015 [0] |
  0.019 [0] |
  0.022 [0] |
  0.025 [0] |
  0.028 [0] |
  0.031 [1] |

...

Status code distribution:
  [200] 9962 responses

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Anyway, let’s check the crucial part – mirroring to a backend with a delay. A quick reminder – nginx, in that case, throttles the original request, thus affecting your production users.&lt;/p&gt;

&lt;p&gt;Here is the mirroring config for Envoy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Envoy mirroring config&lt;/span&gt;
&lt;span class="na"&gt;static_resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;listeners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;listener_0&lt;/span&gt;
    &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;socket_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
        &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
        &lt;span class="na"&gt;port_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8001&lt;/span&gt;
    &lt;span class="na"&gt;filter_chains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;envoy.http_connection_manager&lt;/span&gt;
        &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;stat_prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingress_http&lt;/span&gt;
          &lt;span class="na"&gt;route_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;virtual_hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local_service&lt;/span&gt;
              &lt;span class="na"&gt;domains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
              &lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/"&lt;/span&gt;
                &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
                  &lt;span class="na"&gt;request_mirror_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mirror&lt;/span&gt;
          &lt;span class="na"&gt;http_filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;envoy.router&lt;/span&gt;
  &lt;span class="na"&gt;clusters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STATIC&lt;/span&gt;
    &lt;span class="na"&gt;connect_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1s&lt;/span&gt;
    &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;socket_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;127.0.0.1&lt;/span&gt;
          &lt;span class="na"&gt;port_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mirror&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;STATIC&lt;/span&gt;
    &lt;span class="na"&gt;connect_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1s&lt;/span&gt;
    &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;socket_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;127.0.0.1&lt;/span&gt;
          &lt;span class="na"&gt;port_value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Basically, we’ve added &lt;code&gt;request_mirror_policy&lt;/code&gt; to the main route and defined the cluster for mirroring. Let’s load test it!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ hey -z 10s -q 1000 -c 1 -t 1 http://proxy.local:8001

Summary:
  Total:    10.0012 secs
  Slowest:  0.0046 secs
  Fastest:  0.0003 secs
  Average:  0.0008 secs
  Requests/sec: 997.6801

  Total data:   36918600 bytes
  Size/request: 3700 bytes

Response time histogram:
  0.000 [1] |
  0.001 [2983]  |■■■■■■■■■■■■■■■■■
  0.001 [6916]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.002 [72]    |
  0.002 [2] |
  0.002 [0] |
  0.003 [0] |
  0.003 [3] |
  0.004 [0] |
  0.004 [0] |
  0.005 [1] |

...

Status code distribution:
  [200] 9978 responses

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Zero errors and amazing latency! This is a victory, and it proves that Envoy’s mirroring is truly “fire and forget”!&lt;/p&gt;
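&lt;p&gt;By the way, if mirroring 100% of traffic to the test backend is too much, the same policy accepts a runtime key for mirroring only a fraction of requests – a hedged sketch (the key name is arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Mirror only a runtime-controlled share of requests.
route:
  cluster: backend
  request_mirror_policy:
    cluster: mirror
    runtime_key: routing.mirror_percent
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When the runtime key is set, Envoy looks up the percentage of requests to mirror there; without a key, everything is mirrored as in the config above.&lt;/p&gt;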

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Envoy’s networking is of exceptional quality – its mirroring is well thought out, its load balancing is very advanced, and I like the active health check feature.&lt;/p&gt;

&lt;p&gt;I’m not convinced to use it in the edge proxy scenario, because there you might need web server features like caching, static content serving and flexible configuration.&lt;/p&gt;

&lt;p&gt;As for the service mesh – I’ll surely evaluate Envoy for that when the opportunity arises, so stay tuned – subscribe to the &lt;a href="https://alex.dzyoba.com/feed"&gt;blog Atom feed&lt;/a&gt; and check &lt;a href="https://twitter.com/AlexDzyoba/"&gt;my twitter @AlexDzyoba&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That’s it for now, till the next time!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>nginx mirroring tips and tricks</title>
      <dc:creator>Alex Dzyoba</dc:creator>
      <pubDate>Mon, 14 Jan 2019 00:00:00 +0000</pubDate>
      <link>https://dev.to/dzeban/nginx-mirroring-tips-and-tricks-17c2</link>
      <guid>https://dev.to/dzeban/nginx-mirroring-tips-and-tricks-17c2</guid>
      <description>&lt;p&gt;Lately, I’ve been playing with nginx and its relatively new &lt;a href="http://nginx.org/en/docs/http/ngx_http_mirror_module.htm"&gt;&lt;strong&gt;mirror&lt;/strong&gt; module&lt;/a&gt; which appeared in 1.13.4. The mirror module allows you to copy requests to another backend while ignoring answers from it. The example use cases for this are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-production testing by observing how your new system handles real production traffic&lt;/li&gt;
&lt;li&gt;Logging of requests for security analysis. This is &lt;a href="https://docs.wallarm.com/en/admin-en/mirror-traffic-en.htm"&gt;what the Wallarm tool does&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Copying requests for data science research&lt;/li&gt;
&lt;li&gt;etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’ve used it for pre-production testing of a newly rewritten system to see how well (if at all ;-) it can handle the production workload. There are some non-obvious problems and tips that I didn’t find when I started this journey, so now I want to share them.&lt;/p&gt;

&lt;h2&gt;Basic setup&lt;/h2&gt;

&lt;p&gt;Let’s begin with a simple setup. Say, we have some backend that handles production workload and we put a proxy in front of it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yVb-uywo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://alex.dzyoba.com/img/nginx-mirror-basic-setup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yVb-uywo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://alex.dzyoba.com/img/nginx-mirror-basic-setup.png" alt="nginx basic setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the nginx config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;upstream backend {
    server backend.local:10000;
}

server {
    server_name proxy.local;
    listen 8000;

    location / {
        proxy_pass http://backend;
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There are 2 parts – a backend and a proxy. The proxy (nginx) listens on port 8000 and just passes requests to the backend on port 10000. Nothing fancy, but let’s do a quick load test to see how it performs. I’m using the &lt;a href="https://github.com/rakyll/hey"&gt;&lt;code&gt;hey&lt;/code&gt; tool&lt;/a&gt; because it’s simple and allows generating a constant load instead of bombarding as hard as possible like many other tools do (wrk, apache benchmark, siege).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ hey -z 10s -q 1000 -n 100000 -c 1 -t 1 http://proxy.local:8000

Summary:
  Total:    10.0016 secs
  Slowest:  0.0225 secs
  Fastest:  0.0003 secs
  Average:  0.0005 secs
  Requests/sec: 995.8393

  Total data:   6095520 bytes
  Size/request: 612 bytes

Response time histogram:
  0.000 [1] |
  0.003 [9954]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.005 [4] |
  0.007 [0] |
  0.009 [0] |
  0.011 [0] |
  0.014 [0] |
  0.016 [0] |
  0.018 [0] |
  0.020 [0] |
  0.022 [1] |

Latency distribution:
  10% in 0.0003 secs
  25% in 0.0004 secs
  50% in 0.0005 secs
  75% in 0.0006 secs
  90% in 0.0007 secs
  95% in 0.0007 secs
  99% in 0.0009 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0003 secs, 0.0225 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0008 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0003 secs
  resp wait:    0.0004 secs, 0.0002 secs, 0.0198 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0012 secs

Status code distribution:
  [200] 9960 responses
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Good, most of the requests are handled in less than a millisecond and there are no errors – that’s our baseline.&lt;/p&gt;

&lt;h2&gt;Basic mirroring&lt;/h2&gt;

&lt;p&gt;Now, let’s add a test backend and mirror traffic to it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DUI8Ed7x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://alex.dzyoba.com/img/nginx-mirror-mirror-setup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DUI8Ed7x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://alex.dzyoba.com/img/nginx-mirror-mirror-setup.png" alt="nginx mirror setup"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The basic mirroring is configured like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;upstream backend {
    server backend.local:10000;
}

upstream test_backend {
    server test.local:20000;
}

server {
    server_name proxy.local;
    listen 8000;

    location / {
        mirror /mirror;
        proxy_pass http://backend;
    }

    location = /mirror {
        internal;
        proxy_pass http://test_backend$request_uri;
    }

}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We add the &lt;code&gt;mirror&lt;/code&gt; directive to mirror requests to an internal location, and then define that internal location. Inside it we can do whatever nginx allows, but for now we simply proxy pass all requests.&lt;/p&gt;
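&lt;p&gt;Since the internal location is ordinary nginx config, you can, for example, mirror only a share of the traffic – a hedged sketch using &lt;code&gt;split_clients&lt;/code&gt; (the percentage and variable name are arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In the http block: flag ~10% of requests for mirroring.
split_clients $request_id $mirror {
    10%     on;
    *       "";
}

# Replaces the internal location from the config above.
location = /mirror {
    internal;
    if ($mirror = "") {
        return 204;
    }
    proxy_pass http://test_backend$request_uri;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;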

&lt;p&gt;Let’s load test it again to check how mirroring affects the performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ hey -z 10s -q 1000 -n 100000 -c 1 -t 1 http://proxy.local:8000

Summary:
  Total:    10.0010 secs
  Slowest:  0.0042 secs
  Fastest:  0.0003 secs
  Average:  0.0005 secs
  Requests/sec: 997.3967

  Total data:   6104700 bytes
  Size/request: 612 bytes

Response time histogram:
  0.000 [1] |
  0.001 [9132]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.001 [792]   |■■■
  0.001 [43]    |
  0.002 [3] |
  0.002 [0] |
  0.003 [2] |
  0.003 [0] |
  0.003 [0] |
  0.004 [1] |
  0.004 [1] |

Latency distribution:
  10% in 0.0003 secs
  25% in 0.0004 secs
  50% in 0.0005 secs
  75% in 0.0006 secs
  90% in 0.0007 secs
  95% in 0.0008 secs
  99% in 0.0010 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0003 secs, 0.0042 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0009 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0002 secs
  resp wait:    0.0004 secs, 0.0002 secs, 0.0041 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0021 secs

Status code distribution:
  [200] 9975 responses
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It’s pretty much the same – millisecond latency and no errors. And that’s good because it proves that mirroring itself doesn’t affect original requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mirroring to buggy backend
&lt;/h2&gt;

&lt;p&gt;That’s all nice and dandy but what if mirror backend has some bugs and sometimes replies with errors? What would happen to the original requests?&lt;/p&gt;

&lt;p&gt;To test this I’ve made a &lt;a href="https://github.com/dzeban/mirror-backend"&gt;trivial Go service&lt;/a&gt; that can inject errors randomly. Let’s launch it&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mirror-backend -errors
2019/01/13 14:43:12 Listening on port 20000, delay is 0, error injecting is true
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;and see what load testing will show:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ hey -z 10s -q 1000 -n 100000 -c 1 -t 1 http://proxy.local:8000

Summary:
  Total:    10.0008 secs
  Slowest:  0.0027 secs
  Fastest:  0.0003 secs
  Average:  0.0005 secs
  Requests/sec: 998.7205

  Total data:   6112656 bytes
  Size/request: 612 bytes

Response time histogram:
  0.000 [1] |
  0.001 [7388]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.001 [2232]  |■■■■■■■■■■■■
  0.001 [324]   |■■
  0.001 [27]    |
  0.002 [6] |
  0.002 [2] |
  0.002 [3] |
  0.002 [2] |
  0.002 [0] |
  0.003 [3] |

Latency distribution:
  10% in 0.0003 secs
  25% in 0.0003 secs
  50% in 0.0004 secs
  75% in 0.0006 secs
  90% in 0.0007 secs
  95% in 0.0008 secs
  99% in 0.0009 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0003 secs, 0.0027 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0008 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0001 secs
  resp wait:    0.0004 secs, 0.0002 secs, 0.0026 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0006 secs

Status code distribution:
  [200] 9988 responses
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Nothing changed at all! And that’s great because errors in the mirror backend don’t affect the main backend. The nginx mirror module ignores responses to mirror subrequests, so this behavior is intended.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mirroring to a slow backend
&lt;/h2&gt;

&lt;p&gt;But what if our mirror backend is not returning errors but is just plain slow? How will the original requests behave? Let’s find out!&lt;/p&gt;

&lt;p&gt;My mirror backend has an option to delay every request by a configured number of seconds. Here I’m launching it with a 1 second delay:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mirror-backend -delay 1
2019/01/13 14:50:39 Listening on port 20000, delay is 1, error injecting is false
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;So let’s see what the load test shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ hey -z 10s -q 1000 -n 100000 -c 1 -t 1 http://proxy.local:8000

Summary:
  Total:    10.0290 secs
  Slowest:  0.0023 secs
  Fastest:  0.0018 secs
  Average:  0.0021 secs
  Requests/sec: 1.9942

  Total data:   6120 bytes
  Size/request: 612 bytes

Response time histogram:
  0.002 [1] |■■■■■■■■■■
  0.002 [0] |
  0.002 [1] |■■■■■■■■■■
  0.002 [0] |
  0.002 [0] |
  0.002 [0] |
  0.002 [1] |■■■■■■■■■■
  0.002 [1] |■■■■■■■■■■
  0.002 [0] |
  0.002 [4] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.002 [2] |■■■■■■■■■■■■■■■■■■■■

Latency distribution:
  10% in 0.0018 secs
  25% in 0.0021 secs
  50% in 0.0022 secs
  75% in 0.0023 secs
  90% in 0.0023 secs
  0% in 0.0000 secs
  0% in 0.0000 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0007 secs, 0.0018 secs, 0.0023 secs
  DNS-lookup:   0.0003 secs, 0.0002 secs, 0.0006 secs
  req write:    0.0001 secs, 0.0001 secs, 0.0002 secs
  resp wait:    0.0011 secs, 0.0007 secs, 0.0013 secs
  resp read:    0.0002 secs, 0.0001 secs, 0.0002 secs

Status code distribution:
  [200] 10 responses

Error distribution:
  [10]  Get http://proxy.local:8000: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;What? 1.9 rps? Where is my 1000 rps? We’ve got errors? What’s happening?&lt;/p&gt;

&lt;p&gt;Let me explain how mirroring in nginx works.&lt;/p&gt;

&lt;h3&gt;
  
  
  How mirroring in nginx works
&lt;/h3&gt;

&lt;p&gt;When the request is coming to nginx and if mirroring is enabled, nginx will create a mirror subrequest and do what mirror location specifies – in our case, it will send it to the mirror backend.&lt;/p&gt;

&lt;p&gt;But the thing is that the subrequest is linked to the original request, so &lt;em&gt;as far as I understand&lt;/em&gt;, until that mirror subrequest is finished, the original request is throttled.&lt;/p&gt;

&lt;p&gt;That’s why we get ~2 rps in the previous test – &lt;code&gt;hey&lt;/code&gt; sent 10 requests and got responses, then sent the next 10 requests, but those stalled because the previous mirror subrequests were still delayed; then the timeout kicked in and errored the last 10 requests.&lt;/p&gt;

&lt;p&gt;If we increase the timeout in hey to, say, 10 seconds, we will receive no errors but only 1 rps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ hey -z 10s -q 1000 -n 100000 -c 1 -t 10 http://proxy.local:8000

Summary:
  Total:    10.0197 secs
  Slowest:  1.0018 secs
  Fastest:  0.0020 secs
  Average:  0.9105 secs
  Requests/sec: 1.0978

  Total data:   6732 bytes
  Size/request: 612 bytes

Response time histogram:
  0.002 [1] |■■■■
  0.102 [0] |
  0.202 [0] |
  0.302 [0] |
  0.402 [0] |
  0.502 [0] |
  0.602 [0] |
  0.702 [0] |
  0.802 [0] |
  0.902 [0] |
  1.002 [10]    |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■

Latency distribution:
  10% in 1.0011 secs
  25% in 1.0012 secs
  50% in 1.0016 secs
  75% in 1.0016 secs
  90% in 1.0018 secs
  0% in 0.0000 secs
  0% in 0.0000 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0001 secs, 0.0020 secs, 1.0018 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0005 secs
  req write:    0.0001 secs, 0.0000 secs, 0.0002 secs
  resp wait:    0.9101 secs, 0.0008 secs, 1.0015 secs
  resp read:    0.0002 secs, 0.0001 secs, 0.0003 secs

Status code distribution:
  [200] 11 responses
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;So the point here is that &lt;strong&gt;if mirror subrequests are slow then the original requests will be throttled&lt;/strong&gt;. I don’t know how to fix this, but I know a workaround – mirror only part of the traffic. Let me show you how.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mirroring part of the traffic
&lt;/h2&gt;

&lt;p&gt;If you’re not sure that mirror backend can handle the original load you can mirror only some part of the traffic – for example, 10%.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;mirror&lt;/code&gt; directive is not configurable and replicates all requests to the mirror location, so it’s not obvious how to do this. The key is the internal mirror location – as I’ve said, you can do anything with mirrored requests there. So here is how I did it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 1  upstream backend {
 2      server backend.local:10000;
 3  }
 4  
 5  upstream test_backend {
 6      server test.local:20000;
 7  }
 8  
 9  split_clients $remote_addr $mirror_backend {
10      50% test_backend;
11      *   "";
12  }
13  
14  server {
15      server_name proxy.local;
16      listen 8000;
17  
18      access_log /var/log/nginx/proxy.log;
19      error_log /var/log/nginx/proxy.error.log info;
20  
21      location / {
22          mirror /mirror;
23          proxy_pass http://backend;
24      }
25  
26      location = /mirror {
27          internal;
28          if ($mirror_backend = "") {
29              return 400;
30          }
31  
32          proxy_pass http://$mirror_backend$request_uri;
33      }
34  
35  }
36  
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;First of all, in the mirror location we proxy pass to the upstream taken from the &lt;code&gt;$mirror_backend&lt;/code&gt; variable (line 32). This variable is set in the &lt;code&gt;split_clients&lt;/code&gt; block (lines 9-12) based on the client remote address. What &lt;code&gt;split_clients&lt;/code&gt; does is set the right-hand variable according to the distribution of the left-hand key. In our case, we look at the request’s remote address (the &lt;code&gt;$remote_addr&lt;/code&gt; variable) and for 50% of remote addresses we set &lt;code&gt;$mirror_backend&lt;/code&gt; to &lt;code&gt;test_backend&lt;/code&gt;; for the rest it’s set to an empty string. Finally, the partial mirroring itself happens in the mirror location – if the &lt;code&gt;$mirror_backend&lt;/code&gt; variable is empty we reject the mirror subrequest, otherwise we &lt;code&gt;proxy_pass&lt;/code&gt; it. Remember that a failure in a mirror subrequest doesn’t affect the original request, so it’s safe to drop it with an error status.&lt;/p&gt;

&lt;p&gt;The beauty of this solution is that you can split traffic for mirroring based on any variable or combination of variables. If you want to really differentiate your users then the remote address may not be the best split key – a user may use many IPs or change them. In that case, you’re better off using a user-sticky key like an API key. To mirror 50% of traffic based on the &lt;code&gt;apikey&lt;/code&gt; query parameter we just change the key in &lt;code&gt;split_clients&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;split_clients $arg_apikey $mirror_backend {
    50% test_backend;
    * "";
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;When we query apikeys from 1 to 20, only about half of them (11) will be mirrored. Here is the curl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ for i in {1..20};do curl -i "proxy.local:8000/?apikey=${i}" ;done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;and here is the log of mirror backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
2019/01/13 22:34:34 addr=127.0.0.1:47224 host=test_backend uri="/?apikey=1"
2019/01/13 22:34:34 addr=127.0.0.1:47230 host=test_backend uri="/?apikey=2"
2019/01/13 22:34:34 addr=127.0.0.1:47240 host=test_backend uri="/?apikey=4"
2019/01/13 22:34:34 addr=127.0.0.1:47246 host=test_backend uri="/?apikey=5"
2019/01/13 22:34:34 addr=127.0.0.1:47252 host=test_backend uri="/?apikey=6"
2019/01/13 22:34:34 addr=127.0.0.1:47262 host=test_backend uri="/?apikey=8"
2019/01/13 22:34:34 addr=127.0.0.1:47272 host=test_backend uri="/?apikey=10"
2019/01/13 22:34:34 addr=127.0.0.1:47278 host=test_backend uri="/?apikey=11"
2019/01/13 22:34:34 addr=127.0.0.1:47288 host=test_backend uri="/?apikey=13"
2019/01/13 22:34:34 addr=127.0.0.1:47298 host=test_backend uri="/?apikey=15"
2019/01/13 22:34:34 addr=127.0.0.1:47308 host=test_backend uri="/?apikey=17"
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And the most awesome thing is that the partitioning in &lt;code&gt;split_clients&lt;/code&gt; is consistent – requests with &lt;code&gt;apikey=1&lt;/code&gt; will always be mirrored.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So this was my experience with the nginx mirror module so far. I’ve shown you how to mirror all of the traffic, and how to mirror part of it with the help of the &lt;code&gt;split_clients&lt;/code&gt; module. I’ve also covered error handling and the non-obvious problem where normal requests are throttled by a slow mirror backend.&lt;/p&gt;

&lt;p&gt;Hope you’ve enjoyed it! Subscribe to the &lt;a href="https://alex.dzyoba.com/feed"&gt;Atom feed&lt;/a&gt;. I also post &lt;a href="https://twitter.com/AlexDzyoba/"&gt;on twitter @AlexDzyoba&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That’s it for now, till the next time!&lt;/p&gt;

</description>
      <category>nginx</category>
      <category>devops</category>
    </item>
    <item>
      <title>tzconv - convert time between timezones</title>
      <dc:creator>Alex Dzyoba</dc:creator>
      <pubDate>Wed, 15 Aug 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/dzeban/tzconv---convert-time-between-timezones-53d</link>
      <guid>https://dev.to/dzeban/tzconv---convert-time-between-timezones-53d</guid>
      <description>&lt;p&gt;I made a nice little thing called &lt;code&gt;tzconv&lt;/code&gt; – &lt;a href="https://github.com/alexdzyoba/tzconv"&gt;https://github.com/alexdzyoba/tzconv&lt;/a&gt;. It’s a CLI tool that converts time between timezones and it’s useful (at least for me) when you investigate done incident and need to match times.&lt;/p&gt;

&lt;p&gt;Imagine, you had an incident that happened at 11:45 your local time but your logs in ELK or Splunk are in UTC. So, what time was 11:45 in UTC?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tzconv utc 11:45
08:45
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Boom! You got it!&lt;/p&gt;

&lt;p&gt;You can add a third parameter to convert time from a specific timezone instead of your local one. For instance, your alert system sent you an email with a central European time and your server log timestamps are in Eastern time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ tzconv neyork 20:20 cet
14:20

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that I’ve mistyped New York and it still worked. That’s because locations are not matched exactly but fuzzy searched!&lt;/p&gt;
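&lt;p&gt;I won’t claim this is the algorithm &lt;code&gt;tzconv&lt;/code&gt; actually uses, but the effect can be sketched with a simple in-order character match – enough for &lt;code&gt;neyork&lt;/code&gt; to find &lt;code&gt;America/New_York&lt;/code&gt;:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// fuzzyMatch reports whether every character of the query
// appears, in order, inside the candidate. Byte-indexed slicing
// is fine here because tz database names are ASCII.
func fuzzyMatch(query, candidate string) bool {
	q := strings.ToLower(query)
	c := strings.ToLower(candidate)
	for _, r := range q {
		i := strings.IndexRune(c, r)
		if i == -1 {
			return false // this character never shows up
		}
		c = c[i+1:] // keep scanning after the match
	}
	return true
}

func main() {
	fmt.Println(fuzzyMatch("neyork", "America/New_York")) // true
	fmt.Println(fuzzyMatch("neyork", "Europe/Paris"))     // false
}
```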

&lt;p&gt;You can find more examples in the &lt;a href="https://github.com/alexdzyoba/tzconv/blob/master/README.md#examples"&gt;project README&lt;/a&gt;. Feel free to contribute, I’ve got a couple of things I would like to see implemented – check the &lt;a href="https://github.com/alexdzyoba/tzconv/issues"&gt;issues page&lt;/a&gt;. The tool itself is written in Go and quite simple yet useful.&lt;/p&gt;

&lt;p&gt;That’s it for now, till the next time!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Peculiarities of c10k client</title>
      <dc:creator>Alex Dzyoba</dc:creator>
      <pubDate>Wed, 04 Jul 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/dzeban/peculiarities-of-c10k-client-1ain</link>
      <guid>https://dev.to/dzeban/peculiarities-of-c10k-client-1ain</guid>
      <description>&lt;p&gt;There is a well-known problem called &lt;a href="https://en.wikipedia.org/wiki/C10k_problem"&gt;c10k&lt;/a&gt;. The essence of it is to handle 10000 concurrent clients on a single server. This problem was conceived &lt;a href="http://www.kegel.com/c10k.html"&gt;in 1999 by Dan Kegel&lt;/a&gt; and at that time it made the industry to rethink the way the web servers were handling connections. Then-state-of-the-art solution to allocate a thread for each client started to leak facing the upcoming web scale. Nginx was born to solve this problem by embracing event-driven I/O model provided by a shiny new &lt;a href="https://en.wikipedia.org/wiki/Epoll"&gt;epoll&lt;/a&gt; system call (in Linux).&lt;/p&gt;

&lt;p&gt;Times were different back then, and now we can have a really beefy server with a 10G network, 32 cores and 256 GiB RAM that can easily handle that amount of clients, so c10k is not much of a problem even with threaded I/O. But, anyway, I wanted to check how various solutions like threads and non-blocking async I/O would handle it, so I started to write some silly servers in my &lt;a href="https://github.com/dzeban/c10k"&gt;c10k repo&lt;/a&gt; – and then I got stuck because I needed some tools to test my implementations.&lt;/p&gt;

&lt;p&gt;Basically, I needed a &lt;strong&gt;c10k client&lt;/strong&gt;. And I actually wrote a couple – one in Go and the other in C with &lt;em&gt;libuv&lt;/em&gt;. I’m going to also write the one in Python 3 with &lt;em&gt;asyncio&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;While writing each client I found 2 peculiarities – how to make it bad and how to make it slow.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to make it bad
&lt;/h2&gt;

&lt;p&gt;By making it bad I mean making it really c10k – creating a lot of connections to the server, thus saturating its resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Go client
&lt;/h3&gt;

&lt;p&gt;I started with the client in Go and quickly stumbled upon the first roadblock. When I was making 10 concurrent HTTP requests with simple &lt;code&gt;"net/http"&lt;/code&gt; requests there were only 2 TCP connections&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ lsof -p $(pgrep go-client) -n -P
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
go-client 11959 avd cwd DIR 253,0 4096 1183846 /home/avd/go/src/github.com/dzeban/c10k
go-client 11959 avd rtd DIR 253,0 4096 2 /
go-client 11959 avd txt REG 253,0 6240125 1186984 /home/avd/go/src/github.com/dzeban/c10k/go-client
go-client 11959 avd mem REG 253,0 2066456 3151328 /usr/lib64/libc-2.26.so
go-client 11959 avd mem REG 253,0 149360 3152802 /usr/lib64/libpthread-2.26.so
go-client 11959 avd mem REG 253,0 178464 3151302 /usr/lib64/ld-2.26.so
go-client 11959 avd 0u CHR 136,0 0t0 3 /dev/pts/0
go-client 11959 avd 1u CHR 136,0 0t0 3 /dev/pts/0
go-client 11959 avd 2u CHR 136,0 0t0 3 /dev/pts/0
go-client 11959 avd 4u a_inode 0,13 0 12735 [eventpoll]
go-client 11959 avd 8u IPv4 68232 0t0 TCP 127.0.0.1:55224-&amp;gt;127.0.0.1:80 (ESTABLISHED)
go-client 11959 avd 10u IPv4 68235 0t0 TCP 127.0.0.1:55230-&amp;gt;127.0.0.1:80 (ESTABLISHED)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same with &lt;code&gt;ss&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ss -tnp dst 127.0.0.1:80
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 127.0.0.1:55224 127.0.0.1:80 users:(("go-client",pid=11959,fd=8))
ESTAB 0 0 127.0.0.1:55230 127.0.0.1:80 users:(("go-client",pid=11959,fd=10))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reason for this is quite simple – HTTP 1.1 is using persistent connections with TCP keepalive for clients to avoid the overhead of TCP handshake on each HTTP request. Go’s &lt;code&gt;"net/http"&lt;/code&gt; fully implements this logic – it multiplexes multiple requests over a handful of TCP connections. It can be tuned via &lt;a href="https://golang.org/pkg/net/http/#Transport"&gt;&lt;code&gt;Transport&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But I don’t need to tune it, I need to avoid it. And we can avoid it by explicitly creating a TCP connection via &lt;code&gt;net.Dial&lt;/code&gt; and then sending a single request over this connection. Here is the function that does it; it runs concurrently inside a dedicated goroutine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"tcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dial error "&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"GET"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"/index.html"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to create http request"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"localhost"&lt;/span&gt;

    &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to send http request"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bufio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sc"&gt;'\n'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"read error "&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s check that it’s working&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ lsof -p $(pgrep go-client) -n -P
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
go-client 12231 avd cwd DIR 253,0 4096 1183846 /home/avd/go/src/github.com/dzeban/c10k
go-client 12231 avd rtd DIR 253,0 4096 2 /
go-client 12231 avd txt REG 253,0 6167884 1186984 /home/avd/go/src/github.com/dzeban/c10k/go-client
go-client 12231 avd mem REG 253,0 2066456 3151328 /usr/lib64/libc-2.26.so
go-client 12231 avd mem REG 253,0 149360 3152802 /usr/lib64/libpthread-2.26.so
go-client 12231 avd mem REG 253,0 178464 3151302 /usr/lib64/ld-2.26.so
go-client 12231 avd 0u CHR 136,0 0t0 3 /dev/pts/0
go-client 12231 avd 1u CHR 136,0 0t0 3 /dev/pts/0
go-client 12231 avd 2u CHR 136,0 0t0 3 /dev/pts/0
go-client 12231 avd 3u IPv4 71768 0t0 TCP 127.0.0.1:55256-&amp;gt;127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 4u a_inode 0,13 0 12735 [eventpoll]
go-client 12231 avd 5u IPv4 73753 0t0 TCP 127.0.0.1:55258-&amp;gt;127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 6u IPv4 71769 0t0 TCP 127.0.0.1:55266-&amp;gt;127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 7u IPv4 71770 0t0 TCP 127.0.0.1:55264-&amp;gt;127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 8u IPv4 73754 0t0 TCP 127.0.0.1:55260-&amp;gt;127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 9u IPv4 71771 0t0 TCP 127.0.0.1:55262-&amp;gt;127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 10u IPv4 71774 0t0 TCP 127.0.0.1:55268-&amp;gt;127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 11u IPv4 73755 0t0 TCP 127.0.0.1:55270-&amp;gt;127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 12u IPv4 71775 0t0 TCP 127.0.0.1:55272-&amp;gt;127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 13u IPv4 73758 0t0 TCP 127.0.0.1:55274-&amp;gt;127.0.0.1:80 (ESTABLISHED)

$ ss -tnp dst 127.0.0.1:80
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 127.0.0.1:55260 127.0.0.1:80 users:(("go-client",pid=12231,fd=8))
ESTAB 0 0 127.0.0.1:55262 127.0.0.1:80 users:(("go-client",pid=12231,fd=9))
ESTAB 0 0 127.0.0.1:55270 127.0.0.1:80 users:(("go-client",pid=12231,fd=11))
ESTAB 0 0 127.0.0.1:55266 127.0.0.1:80 users:(("go-client",pid=12231,fd=6))
ESTAB 0 0 127.0.0.1:55256 127.0.0.1:80 users:(("go-client",pid=12231,fd=3))
ESTAB 0 0 127.0.0.1:55272 127.0.0.1:80 users:(("go-client",pid=12231,fd=12))
ESTAB 0 0 127.0.0.1:55258 127.0.0.1:80 users:(("go-client",pid=12231,fd=5))
ESTAB 0 0 127.0.0.1:55268 127.0.0.1:80 users:(("go-client",pid=12231,fd=10))
ESTAB 0 0 127.0.0.1:55264 127.0.0.1:80 users:(("go-client",pid=12231,fd=7))
ESTAB 0 0 127.0.0.1:55274 127.0.0.1:80 users:(("go-client",pid=12231,fd=13))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  C client
&lt;/h3&gt;

&lt;p&gt;I also decided to make a C client built on top of libuv for a convenient event loop.&lt;/p&gt;

&lt;p&gt;In my C client, there is no HTTP library so we’re making TCP connections from the start. It works well, creating a connection for each request, so it doesn’t have the problem (more like a feature :-) of the Go client. But when it finishes reading the response it gets stuck and doesn’t return control to the event loop until a very long timeout.&lt;/p&gt;

&lt;p&gt;Here is the response reading callback that seems stuck:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;on_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uv_stream_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;ssize_t&lt;/span&gt; &lt;span class="n"&gt;nread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;uv_buf_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nread&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nread&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;UV_EOF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"close stream"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;uv_connect_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;uv_handle_get_data&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;uv_handle_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;uv_close&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;uv_handle_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;free_close_cb&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;return_uv_err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nread&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It appears that we’re stuck here and wait for some (quite long) time until we finally get EOF.&lt;/p&gt;

&lt;p&gt;This &lt;em&gt;“quite long time”&lt;/em&gt; is actually HTTP keepalive timeout set &lt;a href="https://nginx.ru/en/docs/http/ngx_http_core_module.html#keepalive_timeout"&gt;in nginx and by default it’s 75 seconds&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We can control it on the client though with the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Connection"&gt;&lt;code&gt;Connection&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Keep-Alive"&gt;&lt;code&gt;Keep-Alive&lt;/code&gt;&lt;/a&gt; HTTP headers which are part of HTTP 1.1.&lt;/p&gt;

&lt;p&gt;And that’s the only sane solution, because on the libuv side I had no way to close the connection – I don’t receive EOF because it is sent only when the connection is actually closed.&lt;/p&gt;

&lt;p&gt;So what is happening is that my client creates a connection and sends a request, nginx replies, and then nginx keeps the connection open because it waits for subsequent requests. Tinkering with libuv showed me that, and that’s why I love making things in C – you have to dig really deep and really understand how things work.&lt;/p&gt;

&lt;p&gt;So to solve these hanging requests, I’ve just set the &lt;code&gt;Connection: close&lt;/code&gt; header to force a new connection for each request from the same client and to disable HTTP keepalive. As an alternative, I could just insist on HTTP/1.0, where there is no keepalive.&lt;/p&gt;

&lt;p&gt;Now that it’s creating lots of connections, let’s make the server keep those connections open for a client-specified delay so that the client appears slow.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to make it slow
&lt;/h2&gt;

&lt;p&gt;I needed to make it slow because I wanted my server to spend some time handling the requests while avoiding putting sleeps in the server code.&lt;/p&gt;

&lt;p&gt;Initially, I thought to make reading on the client side slow, i.e. reading one byte at a time or delaying reading the server response. Interestingly, none of these solutions worked.&lt;/p&gt;

&lt;p&gt;I tested my client with nginx by watching the access log with the &lt;a href="https://nginx.ru/en/docs/http/ngx_http_core_module.html#var_request_time"&gt;&lt;code&gt;$request_time&lt;/code&gt;&lt;/a&gt; variable. Needless to say, all of my requests were served in 0.000 seconds. Whatever delay I inserted, nginx seemed to ignore it.&lt;/p&gt;

&lt;p&gt;I started to figure out why by tweaking various parts of the request-response pipeline like the number of connections, response size, etc.&lt;/p&gt;

&lt;p&gt;Finally, I was able to see my delay only when nginx was serving a really big file, like 30 MB, and that’s when it clicked.&lt;/p&gt;

&lt;p&gt;The whole reason for this delay-ignoring behavior was socket buffers. A socket buffer is the piece of memory where the Linux kernel buffers network requests and responses for performance reasons – to send data in big chunks over the network and to mitigate slow clients – and also for other things like TCP retransmission. Socket buffers are like the page cache – all network I/O (with the page cache it’s disk I/O) goes through them unless they are explicitly bypassed.&lt;/p&gt;

&lt;p&gt;So in my case, when nginx received a request, the response written by the send/write syscall was merely stored in the socket buffer, but from nginx’s point of view, it was done. Only when the response was large enough to not fit in the socket buffer would nginx block in the syscall and wait until the client delay elapsed and the socket buffer was read and freed for the next portion of data.&lt;/p&gt;

&lt;p&gt;You can check and tune the size of the socket buffers in &lt;code&gt;/proc/sys/net/ipv4/tcp_rmem&lt;/code&gt; and &lt;code&gt;/proc/sys/net/ipv4/tcp_wmem&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So after figuring this out, I’ve inserted the delay after establishing the connection and before sending the request.&lt;/p&gt;

&lt;p&gt;This way the server will keep client connections around (yay, c10k!) for a client-specified delay.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recap
&lt;/h2&gt;

&lt;p&gt;So in the end, I have 2 c10k clients – one written &lt;a href="https://github.com/dzeban/c10k/blob/master/go-client.go"&gt;in Go&lt;/a&gt; and the other written &lt;a href="https://github.com/dzeban/c10k/blob/master/libuv-client.c"&gt;in C with libuv&lt;/a&gt;. The Python 3 client is on its way.&lt;/p&gt;

&lt;p&gt;All of these clients connect to the HTTP server, wait for a specified delay and then send a GET request with the &lt;code&gt;Connection: close&lt;/code&gt; header.&lt;/p&gt;

&lt;p&gt;This makes the HTTP server keep a dedicated connection for each request and spend some time waiting to emulate I/O.&lt;/p&gt;

&lt;p&gt;That’s how my c10k clients work.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Configuring JMX exporter for Kafka and Zookeeper</title>
      <dc:creator>Alex Dzyoba</dc:creator>
      <pubDate>Sat, 12 May 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/dzeban/configuring-jmx-exporter-for-kafka-and-zookeeper-2440</link>
      <guid>https://dev.to/dzeban/configuring-jmx-exporter-for-kafka-and-zookeeper-2440</guid>
<description>&lt;p&gt;I’ve been using Prometheus for quite some time and really enjoying it. Most things are quite simple – installing and configuring Prometheus is easy, setting up exporters is launch and forget, &lt;a href="https://dev.to/blog/go-prometheus-service/"&gt;instrumenting your code&lt;/a&gt; is bliss. But there are 2 things that I’ve really struggled with:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Grokking data model and PromQL to get meaningful insights.&lt;/li&gt;
&lt;li&gt;Configuring jmx-exporter.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this post, I’ll share the JMX part because I don’t feel that I’ve fully understood the data model and PromQL. So let’s dive into that jmx-exporter thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is jmx-exporter
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/prometheus/jmx_exporter"&gt;jmx-exporter&lt;/a&gt; is a program that reads JMX data from JVM based applications (e.g. Java and Scala) and exposes it via HTTP in a simple text format that Prometheus understand and can scrape.&lt;/p&gt;

&lt;p&gt;JMX is a common technology in the Java world for exporting statistics of a running application and also for controlling it (you can trigger GC with JMX, for example).&lt;/p&gt;

&lt;p&gt;jmx-exporter is a Java application that uses JMX APIs to collect the app and JVM metrics. It is a Java agent, which means it runs inside the same JVM. This gives you the nice benefit of not exposing JMX remotely – jmx-exporter will just collect the metrics and expose them over HTTP in read-only mode.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing jmx-exporter
&lt;/h2&gt;

&lt;p&gt;Because it’s written in Java, jmx-exporter is distributed as a jar, so you just need to download it &lt;a href="https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.3.0/jmx_prometheus_javaagent-0.3.0.jar"&gt;from maven&lt;/a&gt; and put it somewhere on your target host.&lt;/p&gt;

&lt;p&gt;I have an Ansible role for this – &lt;a href="https://github.com/alexdzyoba/ansible-jmx-exporter"&gt;https://github.com/alexdzyoba/ansible-jmx-exporter&lt;/a&gt;. Besides downloading the jar, it’ll also put the configuration file for jmx-exporter in place.&lt;/p&gt;

&lt;p&gt;This configuration file contains rules for rewriting JMX MBeans into Prometheus exposition format metrics. Basically, it’s a collection of regexps that convert MBean strings to Prometheus strings.&lt;/p&gt;
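&lt;p&gt;To give a taste of what these rules look like, here is an illustrative fragment in the style of the example configs (the pattern and metric name below are just a demonstration, not a complete Kafka config):&lt;/p&gt;

```yaml
lowercaseOutputName: true
rules:
  # Rewrite MBeans like kafka.server<type=ReplicaManager, name=PartitionCount><>Value
  # into a metric like kafka_server_replicamanager_partitioncount
  - pattern: 'kafka.server<type=(.+), name=(.+)><>Value'
    name: kafka_server_$1_$2
    type: GAUGE
```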

&lt;p&gt;The &lt;a href="https://github.com/prometheus/jmx_exporter/tree/master/example_configs"&gt;example_configs directory&lt;/a&gt; in jmx-exporter sources contains examples for many popular Java apps including Kafka and Zookeeper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring Zookeeper with jmx-exporter
&lt;/h2&gt;

&lt;p&gt;As I’ve said, jmx-exporter runs inside another JVM as a Java agent to collect JMX metrics. To demonstrate how it all works, let’s run it within Zookeeper.&lt;/p&gt;

&lt;p&gt;Zookeeper is a crucial part of many production systems including Hadoop, Kafka and Clickhouse, so you really want to monitor it. Despite the fact that you can do this with &lt;a href="https://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_zkCommands"&gt;4lw commands&lt;/a&gt; (&lt;code&gt;mntr&lt;/code&gt;, &lt;code&gt;stat&lt;/code&gt;, etc.) and that there &lt;a href="https://github.com/dln/zookeeper_exporter"&gt;are&lt;/a&gt; &lt;a href="https://github.com/lucianjon/zk-exporter"&gt;dedicated&lt;/a&gt; &lt;a href="https://github.com/dabealu/zookeeper-exporter"&gt;exporters&lt;/a&gt;, I prefer to use JMX to avoid constantly querying Zookeeper (those queries add noise to the metrics because 4lw commands are counted as normal Zookeeper requests).&lt;/p&gt;

&lt;p&gt;To scrape Zookeeper JMX metrics with jmx-exporter, you have to pass the following argument when launching Zookeeper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-javaagent:/opt/jmx-exporter/jmx-exporter.jar=7070:/etc/jmx-exporter/zookeeper.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you use the Zookeeper that is distributed with Kafka (you shouldn’t) then pass it via &lt;code&gt;EXTRA_ARGS&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ export EXTRA_ARGS="-javaagent:/opt/jmx-exporter/jmx-exporter.jar=7070:/etc/jmx-exporter/zookeeper.yml"
$ /opt/kafka_2.11-0.10.1.0/bin/zookeeper-server-start.sh /opt/kafka_2.11-0.10.1.0/config/zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you use a standalone Zookeeper distribution, then add it as &lt;code&gt;SERVER_JVMFLAGS&lt;/code&gt; in &lt;code&gt;zookeeper-env.sh&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# zookeeper-env.sh
SERVER_JVMFLAGS="-javaagent:/opt/jmx-exporter/jmx-exporter.jar=7070:/etc/jmx-exporter/zookeeper.yml"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anyway, when you launch Zookeeper you should see the process listening on the specified port (7070 in my case) and responding to &lt;code&gt;/metrics&lt;/code&gt; queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ netstat -tlnp | grep 7070
tcp 0 0 0.0.0.0:7070 0.0.0.0:* LISTEN 892/java

$ curl -s localhost:7070/metrics | head
# HELP jvm_threads_current Current thread count of a JVM
# TYPE jvm_threads_current gauge
jvm_threads_current 16.0
# HELP jvm_threads_daemon Daemon thread count of a JVM
# TYPE jvm_threads_daemon gauge
jvm_threads_daemon 12.0
# HELP jvm_threads_peak Peak thread count of a JVM
# TYPE jvm_threads_peak gauge
jvm_threads_peak 16.0
# HELP jvm_threads_started_total Started thread count of a JVM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
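&lt;p&gt;On the Prometheus side, this endpoint is scraped like any other target – a minimal scrape config could look like this (the hostname is a placeholder):&lt;/p&gt;

```yaml
scrape_configs:
  - job_name: 'zookeeper'
    static_configs:
      - targets: ['zk1.example.com:7070']
```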



&lt;h2&gt;
  
  
  Configuring Kafka with jmx-exporter
&lt;/h2&gt;

&lt;p&gt;Kafka is a message broker written in Scala, so it runs in the JVM, which in turn means that we can use jmx-exporter for its metrics.&lt;/p&gt;

&lt;p&gt;To run jmx-exporter within Kafka, you should set &lt;code&gt;KAFKA_OPTS&lt;/code&gt; environment variable like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ export KAFKA_OPTS='-javaagent:/opt/jmx-exporter/jmx-exporter.jar=7071:/etc/jmx-exporter/kafka.yml'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then launch Kafka (I assume that Zookeeper is already launched as it’s required by Kafka):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ /opt/kafka_2.11-0.10.1.0/bin/kafka-server-start.sh /opt/kafka_2.11-0.10.1.0/conf/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check that jmx-exporter HTTP server is listening:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ netstap -tlnp | grep 7071
tcp6 0 0 :::7071 :::* LISTEN 19288/java
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And scrape the metrics!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s localhost:7071 | grep -i kafka | head
# HELP kafka_server_replicafetchermanager_minfetchrate Attribute exposed for management (kafka.server&amp;lt;type=ReplicaFetcherManager, name=MinFetchRate, clientId=Replica&amp;gt;&amp;lt;&amp;gt;Value)
# TYPE kafka_server_replicafetchermanager_minfetchrate untyped
kafka_server_replicafetchermanager_minfetchrate{clientId="Replica",} 0.0
# HELP kafka_network_requestmetrics_totaltimems Attribute exposed for management (kafka.network&amp;lt;type=RequestMetrics, name=TotalTimeMs, request=OffsetFetch&amp;gt;&amp;lt;&amp;gt;Count)
# TYPE kafka_network_requestmetrics_totaltimems untyped
kafka_network_requestmetrics_totaltimems{request="OffsetFetch",} 0.0
kafka_network_requestmetrics_totaltimems{request="JoinGroup",} 0.0
kafka_network_requestmetrics_totaltimems{request="DescribeGroups",} 0.0
kafka_network_requestmetrics_totaltimems{request="LeaveGroup",} 0.0
kafka_network_requestmetrics_totaltimems{request="GroupCoordinator",} 0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is how to run the jmx-exporter Java agent if you are running Kafka under systemd:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
[Service]
Restart=on-failure
Environment=KAFKA_OPTS=-javaagent:/opt/jmx-exporter/jmx-exporter.jar=7071:/etc/jmx-exporter/kafka.yml
ExecStart=/opt/kafka/bin/kafka-server-start.sh /etc/kafka/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
TimeoutStopSec=600
User=kafka
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Recap
&lt;/h2&gt;

&lt;p&gt;With jmx-exporter you can scrape the metrics of running JVM applications. jmx-exporter runs as a Java agent (inside the target JVM), collects JMX metrics, rewrites them according to the config rules and exposes them in the Prometheus exposition format.&lt;/p&gt;

&lt;p&gt;For a quick setup, check my Ansible &lt;a href="https://github.com/alexdzyoba/ansible-jmx-exporter"&gt;role for jmx-exporter&lt;/a&gt; – &lt;a href="https://galaxy.ansible.com/alexdzyoba/jmx-exporter/"&gt;alexdzyoba.jmx-exporter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That’s all for now, stay tuned by &lt;a href="https://alex.dzyoba.com/feed"&gt;subscribing to the RSS&lt;/a&gt; or follow me on &lt;a href="https://twitter.com/alexdzyoba"&gt;Twitter @AlexDzyoba&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Redis cluster with cross replication</title>
      <dc:creator>Alex Dzyoba</dc:creator>
      <pubDate>Sat, 21 Apr 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/dzeban/redis-cluster-with-cross-replication-2h31</link>
      <guid>https://dev.to/dzeban/redis-cluster-with-cross-replication-2h31</guid>
      <description>&lt;p&gt;In my previous post on &lt;a href="https://dev.to/blog/redis-ha/"&gt;Redis high availability&lt;/a&gt;, I’ve said that Redis cluster has some sharp corners and promised to tell about it.&lt;/p&gt;

&lt;p&gt;This post will cover tricky cases with a cross-replicated cluster only, because that’s what I use. If you have a plain flat topology with single Redis instances on dedicated nodes, you’ll be fine. But that’s not my case.&lt;/p&gt;

&lt;p&gt;So let’s dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;First, let’s define some terms so we understand each other.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node – physical &lt;strong&gt;server&lt;/strong&gt; or VM where you will run the Redis instance.&lt;/li&gt;
&lt;li&gt;Instance – Redis server &lt;strong&gt;process&lt;/strong&gt; in a cluster mode.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Second, let me describe what my Redis cluster topology looks like and what cross-replication is.&lt;/p&gt;

&lt;p&gt;A Redis cluster is built from multiple Redis instances that run in cluster mode. Each instance is isolated because it serves a particular subset of keys in a master or slave &lt;strong&gt;role&lt;/strong&gt;. The emphasis on the role is intentional – there is a separate Redis instance for every shard master and every shard replica, e.g. if you have 3 shards with replication factor 3 (2 additional replicas) you have to run 9 Redis instances. This was my first naive attempt to create a cluster on 3 nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ redis-trib create --replicas 2 10.135.78.153:7000 10.135.78.196:7000 10.135.64.55:7000
&amp;gt;&amp;gt;&amp;gt; Creating cluster
*** ERROR: Invalid configuration for cluster creation.
*** Redis Cluster requires at least 3 master nodes.
*** This is not possible with 3 nodes and 2 replicas per node.
*** At least 9 nodes are required.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(&lt;code&gt;redis-trib&lt;/code&gt; is an “official” tool to create a Redis cluster)&lt;/p&gt;

&lt;p&gt;The important point here is that all of the Redis tools operate with Redis instances, not nodes, so it’s your responsibility to put the instances in the right redundant topology.&lt;/p&gt;

&lt;h2&gt;
  
  
  The motivation for cross replication
&lt;/h2&gt;

&lt;p&gt;A Redis cluster requires at least 3 nodes because to survive a network partition it needs a majority of masters (like in Sentinel). If you want 1 replica, then add another 3 nodes and boom! Now you have a 6-node cluster to operate.&lt;/p&gt;

&lt;p&gt;It’s fine if you work in the cloud where you can just spin up a dozen small nodes that cost you little. Unfortunately, not everyone has joined the cloud party – some have to operate real metal nodes, and server hardware usually starts with something like 32 GiB of RAM and an 8-core CPU, which is real overkill for a Redis node.&lt;/p&gt;

&lt;p&gt;So to save on hardware, we can make a trick and run several instances on a single node (and probably colocate them with other services). But remember that in that case, you have to distribute the masters among the nodes manually and configure &lt;strong&gt;cross-replication&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Cross-replication simply means that you don’t have dedicated nodes for replicas – you just replicate the data to the next node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Falex.dzyoba.com%2Fimg%2Fredis-cluster-cross-replication.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Falex.dzyoba.com%2Fimg%2Fredis-cluster-cross-replication.png" alt="Redis cluster with cross replication"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This way you save on the cluster size – you can make a Redis cluster with 2 replicas on 3 nodes instead of 9. So you have fewer things to operate and the nodes are better utilized – instead of one single-threaded lightweight Redis process on each of 9 nodes, you’ll have 3 such processes on each of 3 nodes.&lt;/p&gt;

&lt;p&gt;To create a cluster you have to run &lt;code&gt;redis-server&lt;/code&gt; with the &lt;code&gt;cluster-enabled yes&lt;/code&gt; parameter. With a cross-replicated cluster you run multiple Redis instances on a node, so you have to run them on separate ports. You can check these &lt;a href="https://linode.com/docs/applications/big-data/how-to-install-and-configure-a-redis-cluster-on-ubuntu-1604/" rel="noopener noreferrer"&gt;two&lt;/a&gt; &lt;a href="http://codeflex.co/configuring-redis-cluster-on-linux/" rel="noopener noreferrer"&gt;manuals&lt;/a&gt; for details, but the essential part is the configs. This is the config file I’m using:&lt;br&gt;
&lt;/p&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;protected-mode no
port {{ redis_port }}
daemonize no
loglevel notice
logfile ""
cluster-enabled yes
cluster-config-file nodes-{{ redis_port }}.conf
cluster-node-timeout 5000
cluster-require-full-coverage no
cluster-slave-validity-factor 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;redis_port&lt;/code&gt; variable takes the values 7000, 7001 and 7002 for each shard. Launch 3 instances of Redis server on ports 7000, 7001 and 7002 on each of the 3 nodes, so you’ll have 9 instances total, and let’s continue.&lt;/p&gt;
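&lt;p&gt;For example, on each node you could start the shards like this (assuming you’ve rendered one config file per port – the paths here are hypothetical):&lt;/p&gt;

```shell
# on each of the 3 nodes
for port in 7000 7001 7002; do
    redis-server /etc/redis/redis-${port}.conf &
done
```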

&lt;h2&gt;
  
  
  Building a cross-replicated cluster
&lt;/h2&gt;

&lt;p&gt;The first surprise may hit you when you build the cluster. If you invoke &lt;code&gt;redis-trib&lt;/code&gt; like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ redis-trib create --replicas 2 10.135.78.153:7000 10.135.78.196:7000 10.135.64.55:7000 10.135.78.153:7001 10.135.78.196:7001 10.135.64.55:7001 10.135.78.153:7002 10.135.78.196:7002 10.135.64.55:7002
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;then it may put all your master instances on a single node. This is happening because, again, it assumes that each instance lives on a separate node.&lt;/p&gt;

&lt;p&gt;So you have to distribute masters and slaves by hand. To do so, first, create a cluster from masters and then add slaves for each master.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a cluster with masters
$ redis-trib create 10.135.78.153:7000 10.135.78.196:7001 10.135.64.55:7002
&amp;gt;&amp;gt;&amp;gt; Creating cluster
&amp;gt;&amp;gt;&amp;gt; Performing hash slots allocation on 3 nodes...
Using 3 masters:
10.135.78.153:7000
10.135.78.196:7001
10.135.64.55:7002
M: 763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000
slots:0-5460 (5461 slots) master
M: f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001
slots:5461-10922 (5462 slots) master
M: 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002
slots:10923-16383 (5461 slots) master
Can I set the above configuration? (type 'yes' to accept): yes
&amp;gt;&amp;gt;&amp;gt; Nodes configuration updated
&amp;gt;&amp;gt;&amp;gt; Assign a different config epoch to each node
&amp;gt;&amp;gt;&amp;gt; Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join.
&amp;gt;&amp;gt;&amp;gt; Performing Cluster Check (using node 10.135.78.153:7000)
M: 763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000
slots:0-5460 (5461 slots) master
0 additional replica(s)
M: 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002
slots:10923-16383 (5461 slots) master
0 additional replica(s)
M: f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001
slots:5461-10922 (5462 slots) master
0 additional replica(s)
[OK] All nodes agree about slots configuration.
&amp;gt;&amp;gt;&amp;gt; Check for open slots...
&amp;gt;&amp;gt;&amp;gt; Check slots coverage...
[OK] All 16384 slots covered.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is our cluster now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;127.0.0.1:7000&amp;gt; CLUSTER NODES                
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524041299000 1 connected 0-5460
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524041299426 2 connected 5461-10922
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 master - 0 1524041298408 3 connected 10923-16383
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now add 2 replicas for each master:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ redis-trib add-node --slave --master-id 763646767dd5492366c3c9f2978faa022833b7af 10.135.78.196:7000 10.135.78.153:7000
$ redis-trib add-node --slave --master-id 763646767dd5492366c3c9f2978faa022833b7af 10.135.64.55:7000 10.135.78.153:7000

$ redis-trib add-node --slave --master-id f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.153:7001 10.135.78.153:7000
$ redis-trib add-node --slave --master-id f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.64.55:7001 10.135.78.153:7000

$ redis-trib add-node --slave --master-id 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.78.153:7002 10.135.78.153:7000
$ redis-trib add-node --slave --master-id 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.78.196:7002 10.135.78.153:7000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, this is our brand new cross-replicated cluster with 2 replicas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ redis-cli -c -p 7000 cluster nodes
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524041947000 1 connected 0-5460
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524041948515 1 connected
0441f7534aed16123bb3476124506251dab80747 10.135.64.55:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524041947094 1 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524043602115 2 connected 5461-10922
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524043601595 2 connected
00eb2402fc1868763a393ae2c9843c47cd7d49da 10.135.64.55:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524043600057 2 connected
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 master - 0 1524041948515 3 connected 10923-16383
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 0 1524041948000 3 connected
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 slave 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 0 1524041947094 3 connected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Failover of a cluster node
&lt;/h2&gt;

&lt;p&gt;If we fail our third node (10.135.64.55) with the &lt;code&gt;DEBUG SEGFAULT&lt;/code&gt; command, the cluster will continue to work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;127.0.0.1:7000&amp;gt; CLUSTER NODES
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524043923000 1 connected 0-5460
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524043924569 1 connected
0441f7534aed16123bb3476124506251dab80747 10.135.64.55:7000@17000 slave,fail 763646767dd5492366c3c9f2978faa022833b7af 1524043857000 1524043856593 1 disconnected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524043924874 2 connected 5461-10922
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524043924000 2 connected
00eb2402fc1868763a393ae2c9843c47cd7d49da 10.135.64.55:7001@17001 slave,fail f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 1524043862669 1524043862000 2 disconnected
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 master,fail - 1524043864490 1524043862567 3 disconnected
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524043924568 4 connected
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 master - 0 1524043924000 4 connected 10923-16383
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that the replica on 10.135.78.196:7002 took over the slot range 10923-16383 and is now the master:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;127.0.0.1:7000&amp;gt; set a 2
-&amp;gt; Redirected to slot [15495] located at 10.135.78.196:7002
OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we restore the Redis instances on the third node, the cluster will recover:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;127.0.0.1:7000&amp;gt; CLUSTER nodes
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524044130000 1 connected 0-5460
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524044131572 1 connected
0441f7534aed16123bb3476124506251dab80747 10.135.64.55:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524044131367 1 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524044130334 2 connected 5461-10922
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524044131876 2 connected
00eb2402fc1868763a393ae2c9843c47cd7d49da 10.135.64.55:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524044131877 2 connected
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 master - 0 1524044131572 4 connected 10923-16383
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524044131000 4 connected
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524044131572 4 connected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, the master was &lt;strong&gt;not restored back&lt;/strong&gt; to the original node – it’s still on the second node (10.135.78.196). After the reboot, the third node contains only slave instances&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ redis-cli -c -p 7000 cluster nodes | grep 10.135.64.55
0441f7534aed16123bb3476124506251dab80747 10.135.64.55:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524044294347 1 connected
00eb2402fc1868763a393ae2c9843c47cd7d49da 10.135.64.55:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524044293138 2 connected
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524044294553 4 connected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and the second node serves 2 master instances.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ redis-cli -c -p 7000 cluster nodes | grep 10.135.78.196
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524044345000 1 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524044345000 2 connected 5461-10922
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 master - 0 1524044345000 4 connected 10923-16383
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, what is interesting is that if the second node fails in this state, we’ll lose 2 out of 3 masters and we’ll &lt;strong&gt;lose the whole cluster&lt;/strong&gt; because there is no master quorum.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ redis-cli -c -p 7000 cluster nodes
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524046655000 1 connected 0-5460
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave,fail 763646767dd5492366c3c9f2978faa022833b7af 1524046544940 1524046544000 1 disconnected
0441f7534aed16123bb3476124506251dab80747 10.135.64.55:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524046654010 1 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master,fail? - 1524046602511 1524046601582 2 disconnected 5461-10922
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524046655039 2 connected
00eb2402fc1868763a393ae2c9843c47cd7d49da 10.135.64.55:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524046656075 2 connected
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 master,fail? - 1524046605581 1524046603746 4 disconnected 10923-16383
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524046654623 4 connected
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524046654515 4 connected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let me reiterate that – with a cross-replicated cluster you may lose the whole cluster after 2 consecutive reboots of single nodes. This is the reason why you’re better off with a dedicated node for each Redis instance; otherwise, with cross-replication, you should really watch the distribution of masters.&lt;/p&gt;

&lt;p&gt;To avoid the situation above, we should manually fail over one of the slaves on the third node to make it a master.&lt;/p&gt;

&lt;p&gt;To do this, we connect to 10.135.64.55:7002, which is now a replica, and issue the &lt;code&gt;CLUSTER FAILOVER&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;127.0.0.1:7002&amp;gt; CLUSTER FAILOVER
OK

127.0.0.1:7002&amp;gt; CLUSTER NODES
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 master - 0 1524047703000 1 connected 0-5460
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524047703512 1 connected
0441f7534aed16123bb3476124506251dab80747 10.135.64.55:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524047703512 1 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524047703000 2 connected 5461-10922
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524047703000 2 connected
00eb2402fc1868763a393ae2c9843c47cd7d49da 10.135.64.55:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524047703110 2 connected
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 myself,master - 0 1524047703000 5 connected 10923-16383
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 0 1524047702510 5 connected
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 slave 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 0 1524047702009 5 connected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Replacing a failed node
&lt;/h2&gt;

&lt;p&gt;Now, suppose we’ve lost our third node entirely and want to replace it with a brand new one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ redis-cli -c -p 7000 cluster nodes
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524047906000 1 connected 0-5460
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524047906811 1 connected
0441f7534aed16123bb3476124506251dab80747 10.135.64.55:7000@17000 slave,fail 763646767dd5492366c3c9f2978faa022833b7af 1524047871538 1524047869000 1 connected
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524047908000 2 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524047907318 2 connected 5461-10922
00eb2402fc1868763a393ae2c9843c47cd7d49da 10.135.64.55:7001@17001 slave,fail f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 1524047872042 1524047869515 2 connected
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 master - 0 1524047907000 6 connected 10923-16383
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524047908336 6 connected
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 master,fail - 1524047871840 1524047869314 5 connected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, we have to make the cluster forget the lost node by issuing &lt;code&gt;CLUSTER FORGET &amp;lt;node-id&amp;gt;&lt;/code&gt; on &lt;strong&gt;every single node&lt;/strong&gt; of the cluster (even slaves). Note that a forgotten node is only banned for 60 seconds, so the command has to reach all nodes within that window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for id in 0441f7534aed16123bb3476124506251dab80747 00eb2402fc1868763a393ae2c9843c47cd7d49da 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66; do 
    for port in 7000 7001 7002; do 
        redis-cli -c -p ${port} CLUSTER FORGET ${id}
    done
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check that we’ve forgotten the failed node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ redis-cli -c -p 7000 cluster nodes
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524048240000 1 connected 0-5460
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524048241342 1 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524048240332 2 connected 5461-10922
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524048240000 2 connected
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 master - 0 1524048241000 6 connected 10923-16383
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524048241845 6 connected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now spin up a new node, install Redis on it and launch 3 new instances with our cluster configuration.&lt;/p&gt;
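&lt;p&gt;By “our cluster configuration” I mean one config file per instance, along these lines (a sketch with typical cluster settings – your exact values may differ):&lt;/p&gt;

```
# redis-7000.conf - repeat for 7001 and 7002 with adjusted values
port 7000
cluster-enabled yes
cluster-config-file nodes-7000.conf
cluster-node-timeout 5000
appendonly yes
```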

&lt;p&gt;These 3 new instances don’t know anything about the cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@redis-replaced ~]# redis-cli -c -p 7000 cluster nodes
9a9c19e24e04df35ad54a8aff750475e707c8367 :7000@17000 myself,master - 0 0 0 connected
[root@redis-replaced ~]# redis-cli -c -p 7001 cluster nodes
3a35ebbb6160232d36984e7a5b97d430077e7eb0 :7001@17001 myself,master - 0 0 0 connected
[root@redis-replaced ~]# redis-cli -c -p 7002 cluster nodes
df701f8b24ae3c68ca6f9e1015d7362edccbb0ab :7002@17002 myself,master - 0 0 0 connected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;so we have to add these Redis instances to the cluster as replicas of the existing masters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ redis-trib add-node --slave --master-id 763646767dd5492366c3c9f2978faa022833b7af 10.135.82.90:7000 10.135.78.153:7000
$ redis-trib add-node --slave --master-id f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.82.90:7001 10.135.78.153:7000
$ redis-trib add-node --slave --master-id 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.82.90:7002 10.135.78.153:7000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we should fail over the third shard to restore the one-master-per-node distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[root@redis-replaced ~]# redis-cli -c -p 7002 cluster failover
OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aaaand, it’s done!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ redis-cli -c -p 7000 cluster nodes
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524049388000 1 connected 0-5460
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524049389000 2 connected
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave df701f8b24ae3c68ca6f9e1015d7362edccbb0ab 0 1524049388000 7 connected
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524049389579 1 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524049389579 2 connected 5461-10922
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 slave df701f8b24ae3c68ca6f9e1015d7362edccbb0ab 0 1524049388565 7 connected
9a9c19e24e04df35ad54a8aff750475e707c8367 10.135.82.90:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524049389880 1 connected
3a35ebbb6160232d36984e7a5b97d430077e7eb0 10.135.82.90:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524049389579 2 connected
df701f8b24ae3c68ca6f9e1015d7362edccbb0ab 10.135.82.90:7002@17002 master - 0 1524049389579 7 connected 10923-16383
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Recap
&lt;/h2&gt;

&lt;p&gt;If you have to deal with bare-metal servers, want a highly available Redis setup and need to utilize your hardware effectively, building a cross-replicated Redis cluster topology is a good option.&lt;/p&gt;

&lt;p&gt;This will work great but there are 2 caveats:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cluster building is a manual process because you have to put masters on separate nodes.&lt;/li&gt;
&lt;li&gt;You have to monitor your masters’ distribution to avoid cluster failure after a single node failure.&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Redis high availability</title>
      <dc:creator>Alex Dzyoba</dc:creator>
      <pubDate>Wed, 28 Mar 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/dzeban/redis-high-availability-5b7b</link>
      <guid>https://dev.to/dzeban/redis-high-availability-5b7b</guid>
<description>&lt;p&gt;Recently, at the place where I work, we started to use Redis for session-like objects storage. Despite these objects being small and short-lived, our service would stop working without them, so the question of Redis high availability arose. It turns out there is no ready-made solution for Redis – there are multiple options with different tradeoffs, and the information is sometimes scarce and scattered across documentation and blog posts. Hence I’m writing this in the shy hope of helping another poor soul like myself solve this problem. I’m by no means a Redis guru but I wanted to share my experience anyway because, after all, it’s my personal blog.&lt;/p&gt;

&lt;p&gt;I’m going to describe high availability in terms of node failure and not persistence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Redis high availability options
&lt;/h2&gt;

&lt;p&gt;Standalone Redis, the good old &lt;code&gt;redis-server&lt;/code&gt; you launch after installation, is easy to set up and use, but it’s not resilient to the failure of the node it’s running on. It doesn’t matter whether you use RDB or AOF – as long as the node is unavailable, you are in trouble.&lt;/p&gt;

&lt;p&gt;Over the years, the Redis community came up with a few high availability options – most of them are built into Redis itself, though some are 3rd party tools. Let’s dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simple Redis replication
&lt;/h2&gt;

&lt;p&gt;Redis has had replication support since, like, forever and it works great – just put &lt;code&gt;slaveof &amp;lt;addr&amp;gt; &amp;lt;port&amp;gt;&lt;/code&gt; in your config file and the instance will start receiving the stream of data from the master.&lt;/p&gt;

&lt;p&gt;You can configure multiple slaves for a master, you can configure a slave of a slave, you can enable slave-only persistence, you can make replication synchronous (it’s async by default) – the list of what you can do with Redis seems bounded only by your imagination. Just read the &lt;a href="https://redis.io/topics/replication"&gt;replication docs&lt;/a&gt; – they’re really great.&lt;/p&gt;

&lt;p&gt;Pros:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quick and simple to setup&lt;/li&gt;
&lt;li&gt;Could be automated via configuration management tools&lt;/li&gt;
&lt;li&gt;Continue to work as long as a single master instance is available - it can survive failures of all of the slave instances.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writes must go to the master&lt;/li&gt;
&lt;li&gt;Slaves may serve reads but because replication is asynchronous you may get stale reads&lt;/li&gt;
&lt;li&gt;It doesn’t shard data, so master and slaves will have unbalanced utilization&lt;/li&gt;
&lt;li&gt;In the case of master failure, you have to elect the new master manually&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last thing is, IMHO, a major downside and that’s where the Redis Sentinel helps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Redis replication with Sentinel
&lt;/h2&gt;

&lt;p&gt;Nobody wants to wake up in the middle of the night just to issue &lt;code&gt;SLAVEOF NO ONE&lt;/code&gt; to elect a new master – it’s pretty silly and should be automated, right? Right. That’s why Redis Sentinel exists.&lt;/p&gt;

&lt;p&gt;Redis Sentinel is a tool that monitors Redis masters and slaves and automatically elects a new master from one of the slaves. This is a really critical task, so you’re better off making Sentinel highly available itself. Luckily, it has built-in clustering which makes it a distributed system.&lt;/p&gt;

&lt;p&gt;Sentinel is a quorum system, meaning that to agree on a new master there must be a majority of Sentinel nodes alive. This has a huge implication for how to deploy Sentinel. There are basically 2 options here – colocate it with the Redis servers or deploy it on a separate cluster. Colocating makes sense because Sentinel is a very lightweight process, so why pay for additional nodes? But in this case we lose resilience: if you colocate Redis server and Sentinel on, say, 3 nodes, you can only lose 1 node, because Sentinel needs 2 nodes to elect the new Redis master. Without Sentinel, we could lose 2 slave nodes. So maybe you should think about a dedicated Sentinel cluster. If you’re in the cloud you could deploy it on some sort of nano instances, but maybe that’s not your case. Tradeoffs, tradeoffs, I know.&lt;/p&gt;
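&lt;p&gt;For reference, the Sentinel side of the 3-node colocated setup above needs surprisingly little configuration – something along these lines (the master name and address are placeholders):&lt;/p&gt;

```
# sentinel.conf - the quorum of 2 matches the 3-node example above
port 26379
sentinel monitor mymaster 10.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
```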

&lt;p&gt;Besides maintaining one more distributed system, with Sentinel you have to change the way your clients work with Redis, because the master can now move. Your application should first go to Sentinel, ask it for the current master and only then work with it. Alternatively, you can build a clever hack with HAProxy – instead of going to Sentinel, put HAProxy in front of the Redis servers and detect the new master with TCP checks. See the example &lt;a href="https://www.haproxy.com/blog/haproxy-advanced-redis-health-check"&gt;at the HAProxy blog&lt;/a&gt;.&lt;/p&gt;
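&lt;p&gt;The gist of that HAProxy trick is a TCP check that marks a server as up only if it reports &lt;code&gt;role:master&lt;/code&gt; – a sketch along the lines of the linked example (addresses are placeholders):&lt;/p&gt;

```
backend redis
    mode tcp
    option tcp-check
    tcp-check send PING\r\n
    tcp-check expect string +PONG
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    tcp-check send QUIT\r\n
    tcp-check expect string +OK
    server redis1 10.0.0.1:6379 check inter 1s
    server redis2 10.0.0.2:6379 check inter 1s
    server redis3 10.0.0.3:6379 check inter 1s
```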

&lt;p&gt;Nevertheless, Sentinel colocated with Redis servers is a really common solution for Redis high availability, for example, Gitlab recommends it in &lt;a href="https://docs.gitlab.com/ee/administration/high_availability/redis.html"&gt;its admin guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Pros:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatically elects a new master when the current one fails. Yay!&lt;/li&gt;
&lt;li&gt;Easy to setup, (seems) easy to maintain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Yet another distributed system to maintain&lt;/li&gt;
&lt;li&gt;May require a dedicated cluster if not colocated with Redis server&lt;/li&gt;
&lt;li&gt;Still doesn’t shard data, so the master will be overutilized compared to the slaves&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Redis cluster
&lt;/h2&gt;

&lt;p&gt;All of the solutions above seem, IMHO, half-assed because they add more moving parts, and these parts are not obvious, at least at first sight. I don’t know any other system that solves an availability problem by adding yet another cluster that must be available itself. It’s just annoying.&lt;/p&gt;

&lt;p&gt;So recent versions of Redis came with the Cluster – a builtin feature that adds sharding, replication and high availability to the known and loved Redis. Within a cluster, you have multiple master instances, each serving a subset of the keyspace. Clients may send requests to any of the masters, which will redirect them to the instance that owns the given key. Masters may have as many replicas as they want, and these replicas will be promoted to master automatically, even without a quorum. Note, though, that a quorum of master instances is required for the whole cluster to work, while no quorum is required for a single shard to keep working, including the new master election.&lt;/p&gt;
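&lt;p&gt;The redirects are driven by hash slots: every key maps to one of 16384 slots via CRC16, and each master owns a range of slots. Here is a sketch of the slot computation (CRC-16/XMODEM written with plain arithmetic; Redis implements the same thing in C):&lt;/p&gt;

```python
def crc16_xmodem(data):
    """CRC-16/XMODEM - the CRC variant Redis Cluster uses for key hashing."""
    crc = 0
    for byte in data:
        crc = (crc ^ (byte * 256)) % 0x10000
        for _ in range(8):
            if crc >= 0x8000:
                crc = ((crc * 2) ^ 0x1021) % 0x10000
            else:
                crc = (crc * 2) % 0x10000
    return crc

def hash_slot(key):
    """Redis Cluster slot for a key: CRC16(key) mod 16384.
    Honors {hash tags} so related keys can land in the same slot."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384

# hash_slot("123456789") == 12739, so that key lives on the master
# owning the 10923-16383 slot range in a default 3-master cluster
```

&lt;p&gt;The hash-tag rule is what makes multi-key operations possible in a cluster: keys sharing the same &lt;code&gt;{tag}&lt;/code&gt; are guaranteed to be in the same slot.&lt;/p&gt;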

&lt;p&gt;Each instance in the Redis cluster (master or slave) should be deployed on a dedicated node but you can configure cross replication where each node will contain multiple instances. There are sharp corners here, though, that I’ll illustrate in the next post, so stay tuned!&lt;/p&gt;

&lt;p&gt;Pros:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shards data across multiple nodes&lt;/li&gt;
&lt;li&gt;Has replication support&lt;/li&gt;
&lt;li&gt;Has builtin failover of the master&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not every library supports it&lt;/li&gt;
&lt;li&gt;May not be as robust (yet) as standalone Redis or Sentinel&lt;/li&gt;
&lt;li&gt;Tooling is wack – building and maintaining a cluster (e.g. replacing a node) is a manual process&lt;/li&gt;
&lt;li&gt;Introduces an extra network hop when the client picks the wrong shard.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Twemproxy
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/twitter/twemproxy"&gt;Twemproxy&lt;/a&gt; is a special proxy for in-memory databases – namely, memcached and Redis – that was built by Twitter. It adds sharding with consistent hashing, so resharding is not that painful, and also maintains persistent connections and enables requests/response pipelining.&lt;/p&gt;

&lt;p&gt;I haven’t tried it because in the era of Redis Cluster it doesn’t seem relevant to me anymore, so I can’t tell the pros and cons, but YMMV.&lt;/p&gt;

&lt;h2&gt;
  
  
  Redis Enterprise
&lt;/h2&gt;

&lt;p&gt;After the initial post, quite a few people reached out to tell me that they have great success with Redis Enterprise from Redis Labs. Check out &lt;a href="https://www.reddit.com/r/devops/comments/86tcry/redis_high_availability/dw9lulm/"&gt;this one from Reddit&lt;/a&gt;. The point is: if you have a really heavy workload, your data is critical and you can afford it, then you should consider their solution.&lt;/p&gt;

&lt;p&gt;You may also check their guide on &lt;a href="https://redislabs.com/redis-features/high-availability"&gt;Redis High Availability&lt;/a&gt; – it is also well written and illustrated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Choosing the right solution for Redis high availability is full of tradeoffs. Nobody knows your situation better than you, so get to know how Redis works – there is no magic here – because in the end you’ll have to maintain the solution. In my case, we chose a Redis cluster with cross replication, after lots of testing and writing a doc with instructions on how to deal with failures.&lt;/p&gt;

&lt;p&gt;That’s all for now, stay tuned for the dedicated Redis cluster post!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to use Ansible with Terraform</title>
      <dc:creator>Alex Dzyoba</dc:creator>
      <pubDate>Fri, 09 Mar 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/dzeban/how-to-use-ansible-with-terraform-4k89</link>
      <guid>https://dev.to/dzeban/how-to-use-ansible-with-terraform-4k89</guid>
<description>&lt;p&gt;Recently, I’ve started using Terraform for creating a cloud test rig and it’s pretty dope. In a matter of a few days, I went from “never used AWS” to “I have a declarative way to create an isolated infrastructure in the cloud”. I’m spinning up a couple of instances in a dedicated subnet inside a VPC, with a security group and a dedicated SSH keypair, and all of this is coded in a mere few hundred lines.&lt;/p&gt;

&lt;p&gt;It’s all nice and dandy, but after creating an instance from some basic AMI I need to provision it. My go-to tool for this is Ansible but, unfortunately, Terraform doesn’t support it natively as it does Chef and Salt. This is unlike &lt;a href="https://www.packer.io/"&gt;Packer&lt;/a&gt;, which has &lt;code&gt;ansible&lt;/code&gt; (remote) and &lt;code&gt;ansible-local&lt;/code&gt; provisioners that &lt;a href="https://dev.to/blog/packer-for-docker/"&gt;I’ve used for creating a Docker image&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So I’ve spent some time and found a few ways to marry Terraform with Ansible that I’ll describe hereafter. But first, let’s talk about provisioning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do we really need provisioning in the cloud?
&lt;/h2&gt;

&lt;p&gt;Instead of using empty AMIs, you could bake your own AMI and skip the provisioning part completely, but I see a giant flaw in this setup: every change, even a small one, requires recreating the whole instance. If the change is somewhere in the base layer, you’ll need to recreate your whole fleet. This quickly becomes unusable for deployment, security patching, adding/removing users, changing configs and other simple things.&lt;/p&gt;

&lt;p&gt;Even more so, if you bake your own AMIs then you still have to provision them somehow, and that’s where tools like Ansible appear again. My recommendation here is, again, to &lt;a href="https://dev.to/blog/packer-for-docker/"&gt;use Packer with Ansible&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So in most cases, I’m strongly for the provisioning because it’s unavoidable anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use Ansible with Terraform
&lt;/h2&gt;

&lt;p&gt;Now, returning to the actual provisioning I found 3 ways to use Ansible with Terraform after reading the heated discussion at &lt;a href="https://github.com/hashicorp/terraform/issues/2661"&gt;this GitHub issue&lt;/a&gt;. Read on to find the one that’s most suitable for you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inline inventory with instance IP
&lt;/h3&gt;

&lt;p&gt;One of the most obvious yet hacky solutions is to invoke Ansible within the &lt;code&gt;local-exec&lt;/code&gt; provisioner. Here is how it looks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provisioner "local-exec" {
    command = "ansible-playbook -i '${self.public_ip},' --private-key ${var.ssh_key_private} provision.yml"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nice and simple, but there is a problem here. The &lt;code&gt;local-exec&lt;/code&gt; provisioner starts without waiting for the instance to launch, so in most cases it will fail because, by the time it tries to connect, nobody is listening.&lt;/p&gt;

&lt;p&gt;As a workaround, you can use a preliminary &lt;code&gt;remote-exec&lt;/code&gt; provisioner that will wait until the connection to the instance is established, and only then invoke the &lt;code&gt;local-exec&lt;/code&gt; provisioner.&lt;/p&gt;

&lt;p&gt;As a result, I have this thingy that plays the role of an “Ansible provisioner”:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;provisioner "remote-exec" {
    inline = ["sudo dnf -y install python"]

    connection {
      type = "ssh"
      user = "fedora"
      private_key = "${file(var.ssh_key_private)}"
    }
  }

  provisioner "local-exec" {
    command = "ansible-playbook -u fedora -i '${self.public_ip},' --private-key ${var.ssh_key_private} provision.yml" 
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make &lt;code&gt;ansible-playbook&lt;/code&gt; work, you have to keep the Ansible code in the same directory as the Terraform code, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ll infra
drwxrwxr-x. 3 avd avd 4.0K Mar 5 15:54 roles/
-rw-rw-r--. 1 avd avd 367 Mar 5 15:19 ansible.cfg
-rw-rw-r--. 1 avd avd 2.5K Mar 7 18:54 main.tf
-rw-rw-r--. 1 avd avd 454 Mar 5 15:27 variables.tf
-rw-rw-r--. 1 avd avd 38 Mar 5 15:54 provision.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This inline inventory will work in most cases, except when you need multiple hosts in the inventory. For example, when you set up a Consul agent you need a list of Consul servers to render its config, and that list usually comes from the inventory. This won’t work here because the inventory contains a single host.&lt;/p&gt;

&lt;p&gt;Anyway, I’m using this approach for the basic things like setting up users and installing some basic packages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic inventory after Terraform
&lt;/h3&gt;

&lt;p&gt;Another simple solution for provisioning infrastructure created by Terraform is to simply not tie Terraform and Ansible together. Create the infrastructure with Terraform and then use Ansible with a dynamic inventory, regardless of how your instances were created.&lt;/p&gt;

&lt;p&gt;So you first create an infra with &lt;code&gt;terraform apply&lt;/code&gt; and then you invoke &lt;code&gt;ansible-playbook -i inventory site.yml&lt;/code&gt;, where &lt;code&gt;inventory&lt;/code&gt; dir contains dynamic inventory scripts.&lt;/p&gt;

&lt;p&gt;This will work great but has a little drawback – if you need to increase the number of instances you must remember to launch Ansible after Terraform.&lt;/p&gt;

&lt;p&gt;That’s what I use to complement the previous approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inventory from Terraform state
&lt;/h3&gt;

&lt;p&gt;There is another interesting thing that might work for you – generate static inventory from Terraform state.&lt;/p&gt;

&lt;p&gt;When you work with Terraform, it maintains the state of your infrastructure, which contains everything including your instances. With a &lt;a href="https://www.terraform.io/docs/backends/index.html"&gt;local backend&lt;/a&gt;, this state is stored in a JSON file that can be easily parsed and converted to an Ansible inventory.&lt;/p&gt;
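&lt;p&gt;Here’s a minimal sketch of that conversion (assuming the legacy pre-0.12 state layout with a &lt;code&gt;modules&lt;/code&gt; list; the instance entry below is made up for illustration):&lt;/p&gt;

```python
import json

# A trimmed-down terraform.tfstate in the legacy (pre-0.12) layout
STATE = json.loads("""
{
  "modules": [{
    "resources": {
      "aws_instance.server": {
        "type": "aws_instance",
        "primary": {"attributes": {"public_ip": "52.51.215.84"}}
      }
    }
  }]
}
""")

def inventory_from_state(state):
    """Render a static Ansible INI inventory, one group per resource name."""
    lines = []
    for module in state.get("modules", []):
        for name, res in sorted(module.get("resources", {}).items()):
            if res.get("type") == "aws_instance":
                group = name.split(".")[1]
                ip = res["primary"]["attributes"]["public_ip"]
                lines.append("[" + group + "]")
                lines.append(ip)
    return "\n".join(lines)

print(inventory_from_state(STATE))  # a [server] group with 52.51.215.84
```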

&lt;p&gt;Here are 2 projects with examples that you can use if you want to go this way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/adammck/terraform-inventory"&gt;https://github.com/adammck/terraform-inventory&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ terraform-inventory -inventory terraform.tfstate
[all]
52.51.215.84

[all:vars]

[server]
52.51.215.84

[server.0]
52.51.215.84

[type_aws_instance]
52.51.215.84

[name_c10k server]
52.51.215.84

[%_1]
52.51.215.84
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/express42/terraform-ansible-example/blob/master/ansible/terraform.py"&gt;https://github.com/express42/terraform-ansible-example/blob/master/ansible/terraform.py&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ~/soft/terraform.py --root . --hostfile
## begin hosts generated by terraform.py ##
52.51.215.84 C10K Server
## end hosts generated by terraform.py ##
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Personally, though, I don’t see the point of this approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ansible plugin for Terraform that didn’t work for me
&lt;/h3&gt;

&lt;p&gt;Finally, there are a few projects that try to make a native-looking Ansible provisioner for Terraform, like the builtin Chef provisioner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/jonmorehouse/terraform-provisioner-ansible"&gt;https://github.com/jonmorehouse/terraform-provisioner-ansible&lt;/a&gt; – this was the first attempt to make such plugin but, unfortunately, it’s not currently maintained and moreover it’s not supported by the current Terraform plugin system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/radekg/terraform-provisioner-ansible"&gt;https://github.com/radekg/terraform-provisioner-ansible&lt;/a&gt; – this one is more recent and currently maintained. It enables this kind of provisioning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
provisioner "ansible" {
    plays {
        playbook = "./provision.yml"
        hosts = ["${self.public_ip}"]
    }
    become = "yes"
    local = "yes"
}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unfortunately, I wasn’t able to make it work, so I blew it off because the first 2 solutions cover all of my cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Terraform and Ansible is a powerful combo that I use for provisioning cloud infrastructure. For basic cloud instance setup, I invoke Ansible with &lt;code&gt;local-exec&lt;/code&gt;, and later I invoke Ansible separately with a dynamic inventory.&lt;/p&gt;

&lt;p&gt;You can find an example of how I do it at &lt;a href="https://github.com/dzeban/c10k/tree/master/infrastructure"&gt;c10k/infrastructure&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks! Until next time!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Instrumenting a Go service for Prometheus</title>
      <dc:creator>Alex Dzyoba</dc:creator>
      <pubDate>Sat, 03 Feb 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/dzeban/instrumenting-a-go-service-for-prometheus-khp</link>
      <guid>https://dev.to/dzeban/instrumenting-a-go-service-for-prometheus-khp</guid>
<description>&lt;p&gt;I’m a big proponent of DevOps practices and have always been keen to operate the things I’ve developed. That’s why I’m really excited about DevOps, SRE, Observability, Service Discovery and other great things which, I believe, will transform our industry into true software &lt;strong&gt;engineering&lt;/strong&gt;. In this blog I’m trying (among other cool stuff I’m doing) to share examples of how you can help yourself or your grumpy Ops guys operate your service. &lt;a href="https://dev.to/blog/go-consul-service/"&gt;Last time&lt;/a&gt; we developed a typical web service, serving data from key-value storage, and added Consul integration for Service Discovery. This time we are going to instrument our code for monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why instrument?
&lt;/h2&gt;

&lt;p&gt;At first, you may wonder why we should instrument our code at all – why not collect the metrics needed for monitoring from the outside, e.g. by installing a Zabbix agent or setting up Nagios checks? There is nothing really wrong with that solution, where you treat monitoring targets as black boxes. But there is another way – white-box monitoring – where your services provide metrics themselves as a result of instrumentation. It’s not about choosing only one way of doing things – these approaches may, and should, supplement each other. For example, you may treat your database servers as a black box providing metrics such as available memory, while instrumenting your database access layer to measure DB request latency.&lt;/p&gt;

&lt;p&gt;It’s all about different points of view and it was discussed &lt;a href="https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html" rel="noopener noreferrer"&gt;in Google’s SRE book&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The simplest way to think about black-box monitoring versus white-box monitoring is that black-box monitoring is symptom-oriented and represents active—not predicted—problems: “The system isn’t working correctly, right now.” White-box monitoring depends on the ability to inspect the innards of the system, such as logs or HTTP endpoints, with instrumentation. White-box monitoring, therefore, allows detection of imminent problems, failures masked by retries, and so forth. … When collecting telemetry for debugging, white-box monitoring is essential. If web servers seem slow on database-heavy requests, you need to know both how fast the web server perceives the database to be, and how fast the database believes itself to be. Otherwise, you can’t distinguish an actually slow database server from a network problem between your web server and your database.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My point is that to gain a real observability of your system you should supplement your existing black-box monitoring with a white-box by instrumenting your services.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to instrument
&lt;/h2&gt;

&lt;p&gt;Now that we’re convinced instrumenting is a good thing, let’s think about what to monitor. A lot of people say that you should instrument everything you can, but I think that’s over-engineering – you should instrument only the things that really matter, to avoid codebase complexity and unnecessary CPU cycles spent collecting a bloat of metrics.&lt;/p&gt;

&lt;p&gt;So what are those &lt;em&gt;things that really matter&lt;/em&gt; that we should instrument for? Well, the same SRE book defines the so-called &lt;strong&gt;four golden signals&lt;/strong&gt; of monitoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traffic or Request Rate&lt;/li&gt;
&lt;li&gt;Errors&lt;/li&gt;
&lt;li&gt;Latency or Duration of the requests&lt;/li&gt;
&lt;li&gt;Saturation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Out of these 4 signals, saturation is the most confusing because it’s not clear how to measure it, or whether that’s even possible in a software system. Saturation mostly applies to hardware resources, which I’m not going to cover here – check &lt;a href="http://www.brendangregg.com/usemethod.html" rel="noopener noreferrer"&gt;Brendan Gregg’s USE method&lt;/a&gt; for that.&lt;/p&gt;

&lt;p&gt;Because saturation is hard to measure in a software system, there is a service-tailored version of the 4 golden signals called &lt;a href="https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/" rel="noopener noreferrer"&gt;“the RED method”&lt;/a&gt;, which lists 3 metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;R&lt;/strong&gt;equest rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E&lt;/strong&gt;rrors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D&lt;/strong&gt;uration (latency) distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s what we’ll instrument for in the &lt;code&gt;webkv&lt;/code&gt; service.&lt;/p&gt;

&lt;p&gt;We will use Prometheus to monitor our service because it’s the go-to tool for monitoring these days – it’s simple, easy to set up and fast. We will need the &lt;a href="https://godoc.org/github.com/prometheus/client_golang/prometheus" rel="noopener noreferrer"&gt;Prometheus Go client library&lt;/a&gt; for instrumenting our code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Instrumenting HTTP handlers
&lt;/h2&gt;

&lt;p&gt;Prometheus works by pulling data from a &lt;code&gt;/metrics&lt;/code&gt; HTTP handler that serves metrics in a simple text-based exposition format, so we need to calculate the RED metrics and export them via a dedicated endpoint.&lt;/p&gt;

&lt;p&gt;Luckily, all of these metrics can be easily exported with an &lt;code&gt;InstrumentHandler&lt;/code&gt; helper.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gh"&gt;diff --git a/webkv.go b/webkv.go
index 94bd025..f43534f 100644
&lt;/span&gt;&lt;span class="gd"&gt;--- a/webkv.go
&lt;/span&gt;&lt;span class="gi"&gt;+++ b/webkv.go
&lt;/span&gt;&lt;span class="p"&gt;@@ -9,6 +9,7 @@&lt;/span&gt; import (
        "strings"
        "time"
&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="gi"&gt;+       "github.com/prometheus/client_golang/prometheus"
&lt;/span&gt;        "github.com/prometheus/client_golang/prometheus/promhttp"
&lt;span class="err"&gt;
&lt;/span&gt;        "github.com/alexdzyoba/webkv/service"
&lt;span class="p"&gt;@@ -32,7 +33,7 @@&lt;/span&gt; func main() {
        if err != nil {
                log.Fatal(err)
        }
&lt;span class="gd"&gt;-       http.Handle("/", s)
&lt;/span&gt;&lt;span class="gi"&gt;+       http.Handle("/", prometheus.InstrumentHandler("webkv", s))
&lt;/span&gt;        http.Handle("/metrics", promhttp.Handler())
&lt;span class="err"&gt;
&lt;/span&gt;        l := fmt.Sprintf(":%d", *port)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And to export the metrics via the &lt;code&gt;/metrics&lt;/code&gt; endpoint, just add another 2 lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gh"&gt;diff --git a/webkv.go b/webkv.go
index 1b2a9d7..94bd025 100644
&lt;/span&gt;&lt;span class="gd"&gt;--- a/webkv.go
&lt;/span&gt;&lt;span class="gi"&gt;+++ b/webkv.go
&lt;/span&gt;&lt;span class="p"&gt;@@ -9,6 +9,8 @@&lt;/span&gt; import (
        "strings"
        "time"
&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="gi"&gt;+       "github.com/prometheus/client_golang/prometheus/promhttp"
+
&lt;/span&gt;        "github.com/alexdzyoba/webkv/service"
 )
&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;@@ -31,6 +33,7 @@&lt;/span&gt; func main() {
                log.Fatal(err)
        }
        http.Handle("/", s)
&lt;span class="gi"&gt;+       http.Handle("/metrics", promhttp.Handler())
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;        l := fmt.Sprintf(":%d", *port)
        log.Print("Listening on ", l)
&lt;span class="err"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that’s it!&lt;/p&gt;

&lt;p&gt;No, seriously, that’s all you need to do to make your service observable. It’s so nice and easy that you don’t have excuses for not doing it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;InstrumentHandler&lt;/code&gt; conveniently wraps your handler and exports the following metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;http_request_duration_microseconds&lt;/code&gt; summary with 50, 90 and 99 percentiles&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;http_request_size_bytes&lt;/code&gt; summary with 50, 90 and 99 percentiles&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;http_response_size_bytes&lt;/code&gt; summary with 50, 90 and 99 percentiles&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;http_requests_total&lt;/code&gt; counter labeled by status code and handler&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;promhttp.Handler&lt;/code&gt; also exports Go runtime information like the number of goroutines and memory stats.&lt;/p&gt;

&lt;p&gt;The point is that you export simple metrics that are cheap to calculate in the service, and everything else is done by Prometheus and its powerful query language, PromQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scraping metrics with Prometheus
&lt;/h2&gt;

&lt;p&gt;Now you need to tell Prometheus about your services so it will start scraping them. We could’ve hardcoded our endpoint with &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#&amp;lt;static_config" rel="noopener noreferrer"&gt;&lt;code&gt;static_configs&lt;/code&gt;&lt;/a&gt; pointing to &lt;code&gt;localhost:8080&lt;/code&gt;. But remember how &lt;a href="https://dev.to/blog/go-consul-service/"&gt;we previously registered our service in Consul&lt;/a&gt;? Prometheus can discover scrape targets from Consul – for our service and any other services – with a single job definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;consul'&lt;/span&gt;
  &lt;span class="na"&gt;consul_sd_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:8500'&lt;/span&gt;
  &lt;span class="na"&gt;relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_consul_service&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;job&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s the pure awesomeness of Service Discovery! Your ops buddy will thank you for that :-)&lt;/p&gt;

&lt;p&gt;(&lt;code&gt;relabel_configs&lt;/code&gt; is needed because otherwise all services would be scraped as &lt;code&gt;consul&lt;/code&gt;)&lt;/p&gt;

&lt;p&gt;Check that Prometheus recognized new targets:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Falex.dzyoba.com%2Fimg%2Fprometheus-consul-discovery.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Falex.dzyoba.com%2Fimg%2Fprometheus-consul-discovery.png" alt="Consul services in Prometheus"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yay!&lt;/p&gt;

&lt;h2&gt;
  
  
  The RED method metrics
&lt;/h2&gt;

&lt;p&gt;Now let’s calculate the metrics for the RED method. The first one is the request rate, which can be calculated from the &lt;code&gt;http_requests_total&lt;/code&gt; metric like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(http_requests_total{job="webkv",code=~"^2.*"}[1m])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We filter the HTTP request counter for the &lt;code&gt;webkv&lt;/code&gt; job and successful HTTP status codes, take the vector of values for the last 1 minute and then apply &lt;code&gt;rate&lt;/code&gt;, which gives the per-second rate of increase averaged over that window. This tells us how many requests per second were successfully handled over the last minute. And because the counter is accumulating, we won’t lose values even if some scrape fails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Falex.dzyoba.com%2Fimg%2Fwebkv-request-rate.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Falex.dzyoba.com%2Fimg%2Fwebkv-request-rate.png" alt="Request rate"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second one is errors. We can calculate an error rate from the same metric, but what we actually want is the percentage of errors. This is how I calculate it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum(rate(http_requests_total{job=“webkv”,code!~“^2.*“}[1m])) 
/ sum(rate(http_requests_total{job=“webkv”}[1m])) 
* 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this error query, we take the rate of error requests, that is, the ones with a non-2xx status code. This gives us a separate series for each status code, like 404 or 500, so we need to &lt;code&gt;sum&lt;/code&gt; them. Next, we do the same &lt;code&gt;sum&lt;/code&gt; and &lt;code&gt;rate&lt;/code&gt; for all of the requests regardless of status to get the overall request rate. Finally, we divide the two and multiply by 100 to get a percentage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Falex.dzyoba.com%2Fimg%2Fwebkv-errors.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Falex.dzyoba.com%2Fimg%2Fwebkv-errors.png" alt="Errors"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, the latency distribution is available directly in the &lt;code&gt;http_request_duration_microseconds&lt;/code&gt; metric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http_request_duration_microseconds{job="webkv"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Falex.dzyoba.com%2Fimg%2Fwebkv-latency.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Falex.dzyoba.com%2Fimg%2Fwebkv-latency.png" alt="Latency"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That was easy, and it’s more than enough for my simple service.&lt;/p&gt;

&lt;p&gt;If you want to instrument for some custom metrics, you can do it just as easily. I’ll show you how to do the same for the Redis requests made from the &lt;code&gt;webkv&lt;/code&gt; handler. It’s not of much use in practice because there is a dedicated &lt;a href="https://github.com/oliver006/redis_exporter" rel="noopener noreferrer"&gt;Redis exporter&lt;/a&gt; for Prometheus, but it works well as an illustration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Instrumenting for the custom metrics (Redis requests)
&lt;/h2&gt;

&lt;p&gt;As you can see from the previous sections, all we need for meaningful monitoring is just 2 metrics – a plain counter for HTTP requests, partitioned by status code, and a &lt;a href="https://prometheus.io/docs/concepts/metric_types/" rel="noopener noreferrer"&gt;summary&lt;/a&gt; for request durations.&lt;/p&gt;

&lt;p&gt;Let’s start with the counter. First, to make things nice, we define a new type &lt;code&gt;Metrics&lt;/code&gt; with Prometheus &lt;code&gt;CounterVec&lt;/code&gt; and add it to the &lt;code&gt;Service&lt;/code&gt; struct:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;--- a/service/service.go
&lt;/span&gt;&lt;span class="gi"&gt;+++ b/service/service.go
&lt;/span&gt;&lt;span class="p"&gt;@@ -13,6 +14,7 @@&lt;/span&gt; type Service struct {
        Port        int
        RedisClient redis.UniversalClient
        ConsulAgent *consul.Agent
&lt;span class="gi"&gt;+       Metrics     Metrics
&lt;/span&gt; }
&lt;span class="gi"&gt;+
+type Metrics struct {
+       RedisRequests *prometheus.CounterVec
+}
+
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we must register our metric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;--- a/service/service.go
&lt;/span&gt;&lt;span class="gi"&gt;+++ b/service/service.go
&lt;/span&gt;&lt;span class="p"&gt;@@ -28,6 +30,15 @@&lt;/span&gt; func New(addrs []string, ttl time.Duration, port int) (*Service, error) {
                Addrs: addrs,
        })
&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="gi"&gt;+       s.Metrics.RedisRequests = prometheus.NewCounterVec(
+               prometheus.CounterOpts{
+                       Name: "redis_requests_total",
+                       Help: "How many Redis requests processed, partitioned by status",
+               },
+               []string{"status"},
+       )
+       prometheus.MustRegister(s.Metrics.RedisRequests)
+
&lt;/span&gt;        ok, err := s.Check()
        if !ok {
                return nil, err
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have created a variable of the &lt;code&gt;CounterVec&lt;/code&gt; type because a plain &lt;code&gt;Counter&lt;/code&gt; represents a single time series, while our status label makes it a vector of time series.&lt;/p&gt;

&lt;p&gt;Finally, we need to increment the counter depending on the status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;--- a/service/redis.go
&lt;/span&gt;&lt;span class="gi"&gt;+++ b/service/redis.go
&lt;/span&gt;&lt;span class="p"&gt;@@ -15,7 +15,9 @@&lt;/span&gt; func (s *Service) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        if err != nil {
                http.Error(w, "Key not found", http.StatusNotFound)
                status = 404
&lt;span class="gi"&gt;+               s.Metrics.RedisRequests.WithLabelValues("fail").Inc()
&lt;/span&gt;        }
&lt;span class="gi"&gt;+       s.Metrics.RedisRequests.WithLabelValues("success").Inc()
&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;        fmt.Fprint(w, val)
        log.Printf("url=\"%s\" remote=\"%s\" key=\"%s\" status=%d\n",
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check that it’s working:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s 'localhost:8080/metrics' | grep redis
# HELP redis_requests_total How many Redis requests processed, partitioned by status
# TYPE redis_requests_total counter
redis_requests_total{status="fail"} 904
redis_requests_total{status="success"} 5433
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nice!&lt;/p&gt;

&lt;p&gt;Calculating the latency distribution is a little bit more involved because we have to time our requests and put the measurements into distribution buckets. Fortunately, there is a very nice &lt;code&gt;prometheus.Timer&lt;/code&gt; helper for measuring time. As for the distribution buckets, Prometheus has a &lt;code&gt;Summary&lt;/code&gt; type that handles them automatically.&lt;/p&gt;

&lt;p&gt;Ok, so first we have to register our new metric (adding it to our &lt;code&gt;Metrics&lt;/code&gt; type):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;--- a/service/service.go
&lt;/span&gt;&lt;span class="gi"&gt;+++ b/service/service.go
&lt;/span&gt;&lt;span class="p"&gt;@@ -18,7 +18,8 @@&lt;/span&gt; type Service struct {
 }
&lt;span class="err"&gt;
&lt;/span&gt; type Metrics struct {
        RedisRequests  *prometheus.CounterVec
&lt;span class="gi"&gt;+       RedisDurations prometheus.Summary
&lt;/span&gt; }
&lt;span class="err"&gt;
&lt;/span&gt; func New(addrs []string, ttl time.Duration, port int) (*Service, error) {
&lt;span class="p"&gt;@@ -39,6 +40,14 @@&lt;/span&gt; func New(addrs []string, ttl time.Duration, port int) (*Service, error) {
        )
        prometheus.MustRegister(s.Metrics.RedisRequests)
&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="gi"&gt;+       s.Metrics.RedisDurations = prometheus.NewSummary(
+               prometheus.SummaryOpts{
+                       Name:       "redis_request_durations",
+                       Help:       "Redis requests latencies in seconds",
+                       Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
+               })
+       prometheus.MustRegister(s.Metrics.RedisDurations)
+
&lt;/span&gt;        ok, err := s.Check()
        if !ok {
                return nil, err
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our new metric is just a &lt;code&gt;Summary&lt;/code&gt;, not a &lt;code&gt;SummaryVec&lt;/code&gt;, because we have no labels. We defined 3 “objectives” – the quantiles to track for the distribution (the 50th, 90th and 99th percentiles), each paired with its allowed error.&lt;/p&gt;

&lt;p&gt;Here is how we measure request latency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;--- a/service/redis.go
&lt;/span&gt;&lt;span class="gi"&gt;+++ b/service/redis.go
&lt;/span&gt;&lt;span class="p"&gt;@@ -5,12 +5,18 @@&lt;/span&gt; import (
        "log"
        "net/http"
        "strings"
&lt;span class="gi"&gt;+
+       "github.com/prometheus/client_golang/prometheus"
&lt;/span&gt; )
&lt;span class="err"&gt;
&lt;/span&gt; func (s *Service) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    status := 200
&lt;span class="err"&gt;
&lt;/span&gt;    key := strings.Trim(r.URL.Path, "/")
&lt;span class="gi"&gt;+
+   timer := prometheus.NewTimer(s.Metrics.RedisDurations)
+   defer timer.ObserveDuration()
+
&lt;/span&gt;    val, err := s.RedisClient.Get(key).Result()
    if err != nil {
            http.Error(w, "Key not found", http.StatusNotFound)
            status = 404
            s.Metrics.RedisRequests.WithLabelValues("fail").Inc()
        }
    s.Metrics.RedisRequests.WithLabelValues("success").Inc()
&lt;span class="err"&gt;
&lt;/span&gt;    fmt.Fprint(w, val)
    log.Printf("url=\"%s\" remote=\"%s\" key=\"%s\" status=%d\n",
        r.URL, r.RemoteAddr, key, status)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yep, it’s that easy. You just create a new timer and defer its observation so it fires on function exit. Although the measurement will also include the time spent on logging, I’m okay with that.&lt;/p&gt;

&lt;p&gt;By default, this timer measures time in seconds. To mimic &lt;code&gt;http_request_duration_microseconds&lt;/code&gt;, we can pass &lt;code&gt;NewTimer&lt;/code&gt; an implementation of the &lt;code&gt;Observer&lt;/code&gt; interface that does the conversion our way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gd"&gt;--- a/service/redis.go
&lt;/span&gt;&lt;span class="gi"&gt;+++ b/service/redis.go
&lt;/span&gt;&lt;span class="p"&gt;@@ -14,7 +14,10 @@&lt;/span&gt; func (s *Service) ServeHTTP(w http.ResponseWriter, r *http.Request) {
&lt;span class="err"&gt;
&lt;/span&gt;        key := strings.Trim(r.URL.Path, "/")
&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="gd"&gt;-       timer := prometheus.NewTimer(s.Metrics.RedisDurations)
&lt;/span&gt;&lt;span class="gi"&gt;+       timer := prometheus.NewTimer(prometheus.ObserverFunc(func(v float64) {
+               us := v * 1000000 // make microseconds
+               s.Metrics.RedisDurations.Observe(us)
+       }))
&lt;/span&gt;        defer timer.ObserveDuration()
&lt;span class="err"&gt;
&lt;/span&gt;        val, err := s.RedisClient.Get(key).Result()
&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="gd"&gt;--- a/service/service.go
&lt;/span&gt;&lt;span class="gi"&gt;+++ b/service/service.go
&lt;/span&gt;&lt;span class="p"&gt;@@ -43,7 +43,7 @@&lt;/span&gt; func New(addrs []string, ttl time.Duration, port int) (*Service, error) {
        s.Metrics.RedisDurations = prometheus.NewSummary(
                prometheus.SummaryOpts{
                        Name:       "redis_request_durations",
&lt;span class="gd"&gt;-                       Help:       "Redis requests latencies in seconds",
&lt;/span&gt;&lt;span class="gi"&gt;+                       Help:       "Redis requests latencies in microseconds",
&lt;/span&gt;                        Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
                })
        prometheus.MustRegister(s.Metrics.RedisDurations)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s 'localhost:8080/metrics' | grep -P '(redis.*durations)'
# HELP redis_request_durations Redis requests latencies in microseconds
# TYPE redis_request_durations summary
redis_request_durations{quantile="0.5"} 207.17399999999998
redis_request_durations{quantile="0.9"} 230.399
redis_request_durations{quantile="0.99"} 298.585
redis_request_durations_sum 3.290851703000006e+06
redis_request_durations_count 15728
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now that we have beautiful metrics, let’s make a dashboard for them!&lt;/p&gt;

&lt;h2&gt;
  
  
  Grafana dashboard
&lt;/h2&gt;

&lt;p&gt;It’s no secret that once you have Prometheus, you will eventually have Grafana showing dashboards for your metrics, because Grafana has built-in support for Prometheus as a data source.&lt;/p&gt;

&lt;p&gt;In my dashboard, I’ve just put our RED metrics and sprinkled some colors. Here is the final dashboard:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Falex.dzyoba.com%2Fimg%2Fwebkv-dashboard.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Falex.dzyoba.com%2Fimg%2Fwebkv-dashboard.png" alt="webkv dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that for the latency graph, I’ve created 3 series, one for each of the 0.5, 0.9 and 0.99 quantiles, and divided them by 1000 to get millisecond values.&lt;/p&gt;
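&lt;p&gt;For example, a single quantile series scaled to milliseconds looks like this in PromQL:&lt;/p&gt;

```
http_request_duration_microseconds{job="webkv", quantile="0.9"} / 1000
```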

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There is no magic here: monitoring the four golden signals or the RED metrics is easy with modern tools like Prometheus and Grafana, and you really need it – without it you’re flying blind. So the next time you develop a service, just add some instrumentation – be nice and cultivate at least some operational sympathy, for great good.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Hitchhiker's guide to the Python imports</title>
      <dc:creator>Alex Dzyoba</dc:creator>
      <pubDate>Sat, 13 Jan 2018 00:00:00 +0000</pubDate>
      <link>https://dev.to/dzeban/hitchhikers-guide-to-the-python-imports-1215</link>
      <guid>https://dev.to/dzeban/hitchhikers-guide-to-the-python-imports-1215</guid>
      <description>&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt; : If you write Python on a daily basis you will find nothing new in this post. It’s for people who &lt;strong&gt;occasionally use Python&lt;/strong&gt; like Ops guys and forget/misuse its import system. Nonetheless, the code is written with Python 3.6 type annotations to entertain an experienced Python reader. As usual, if you find any mistakes, please let me know!&lt;/p&gt;

&lt;h2&gt;
  
  
  Modules
&lt;/h2&gt;

&lt;p&gt;Let’s start with a common Python stanza of&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;invoke_the_real_code&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A lot of people, and I’m no exception, write it as a ritual without trying to understand it. We vaguely know that this snippet makes a difference when you invoke your code from the CLI versus importing it. But let’s try to understand why we really need it.&lt;/p&gt;

&lt;p&gt;For illustration, assume that we’re writing some pizza shop software. It’s &lt;a href="https://github.com/dzeban/python-imports"&gt;on Github&lt;/a&gt;. Here is the &lt;code&gt;pizza.py&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pizza.py file
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;math&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Pizza&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;
    &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;area&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pi&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;pow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;awesomeness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'Carbonara'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;9000&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'pizza.py module name is %s'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Carbonara is the most awesome pizza.'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I’ve added printing of the magical &lt;code&gt;__name__&lt;/code&gt; variable to see how it may change.&lt;/p&gt;

&lt;p&gt;OK, first, let’s run it as a script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 pizza.py
pizza.py module name is __main__
Carbonara is the most awesome pizza.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Indeed, the &lt;code&gt;__name__&lt;/code&gt; global variable is set to &lt;code&gt;__main__&lt;/code&gt; when we invoke the file from the CLI.&lt;/p&gt;

&lt;p&gt;But what if we import it from another file? Here is the &lt;code&gt;menu.py&lt;/code&gt; source code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# menu.py file
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pizza&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pizza&lt;/span&gt;

&lt;span class="n"&gt;MENU&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Pizza&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;Pizza&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Margherita'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Pizza&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Carbonara'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;14.99&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;Pizza&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Marinara'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;16.99&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MENU&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;menu.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 menu.py
pizza.py module name is pizza
[&amp;lt;pizza.Pizza object at 0x7fbbc1045470&amp;gt;, &amp;lt;pizza.Pizza object at 0x7fbbc10454e0&amp;gt;, &amp;lt;pizza.Pizza object at 0x7fbbc1045b38&amp;gt;]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now we see 2 things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The top-level &lt;code&gt;print&lt;/code&gt; statement from &lt;code&gt;pizza.py&lt;/code&gt; was executed on import&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;__name__&lt;/code&gt; in &lt;code&gt;pizza.py&lt;/code&gt; is now set to the filename without the &lt;code&gt;.py&lt;/code&gt; suffix.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So, the thing is, &lt;code&gt;__name__&lt;/code&gt; is a global variable that holds the name of the current Python module.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The module name is set by the interpreter in the &lt;code&gt;__name__&lt;/code&gt; variable&lt;/li&gt;
&lt;li&gt;When a module is invoked from the CLI, its name is set to &lt;code&gt;__main__&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So what is a module, after all? It’s really simple – a module is a file containing Python code that you can execute with the interpreter (the &lt;code&gt;python&lt;/code&gt; program) or import from other modules.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python module is just a file with Python code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Just like when it’s executed, when a module is imported its top-level statements are run. Be aware, though, that they run only once, even if you import the module several times, even from different files.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When you import a module, it’s executed&lt;/li&gt;
&lt;/ul&gt;
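&lt;p&gt;Here is a minimal sketch of that import-once behavior, using a made-up throwaway module named &lt;code&gt;greeting&lt;/code&gt; written to a temp directory:&lt;/p&gt;

```python
# Sketch: a module's top-level code runs only on the first import;
# later imports reuse the cached module object from sys.modules.
# The "greeting" module name is made up for this demo.
import importlib
import os
import sys
import tempfile

tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "greeting.py"), "w") as f:
    f.write("LOAD_COUNT = globals().get('LOAD_COUNT', 0) + 1\n"
            "print('greeting module loaded')\n")
sys.path.insert(0, tmpdir)

first = importlib.import_module("greeting")   # top-level code runs, prints once
second = importlib.import_module("greeting")  # cache hit: nothing printed
print(first is second)  # True: the very same module object
```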

&lt;p&gt;Because modules are just plain files, there is a simple way to import them: take the filename, remove the &lt;code&gt;.py&lt;/code&gt; extension and put it in the &lt;code&gt;import&lt;/code&gt; statement.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To import a module you use the filename without the &lt;code&gt;.py&lt;/code&gt; extension&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What is interesting is that &lt;code&gt;__name__&lt;/code&gt; is set to the filename regardless of how you import it – with &lt;code&gt;import pizza as broccoli&lt;/code&gt;, &lt;code&gt;__name__&lt;/code&gt; will still be &lt;code&gt;pizza&lt;/code&gt;. So:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When imported, the module name is set to the filename without the &lt;code&gt;.py&lt;/code&gt; extension, even if it’s renamed with &lt;code&gt;import module as othername&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
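&lt;p&gt;A quick way to see this without our pizza files is to alias a stdlib module – the alias only binds a new local name, while the module object keeps its real &lt;code&gt;__name__&lt;/code&gt;:&lt;/p&gt;

```python
# The alias is just a different variable pointing at the same module
# object; the module still knows its real name.
import json as broccoli

print(broccoli.__name__)  # json
```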

&lt;p&gt;But what if the module we want to import is not located in the same directory – how can we import it then? The answer lies in the module search path, which we’ll eventually discover while discussing packages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Packages
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A package is a &lt;strong&gt;namespace&lt;/strong&gt; for a collection of modules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The namespace part is important because by itself a package doesn’t provide any functionality – it only gives you a way to group a bunch of your modules.&lt;/p&gt;

&lt;p&gt;There are 2 cases where you really want to put modules into a package. The first is to isolate the definitions of one module from another. In our &lt;code&gt;pizza&lt;/code&gt; module, we have a &lt;code&gt;Pizza&lt;/code&gt; class that might conflict with other people’s Pizza packages (&lt;a href="https://pypi.org/search/?q=pizza"&gt;and we do have some pizza packages on PyPI&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The second case is if you want to distribute your code because&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A package is the minimal unit of code &lt;em&gt;distribution&lt;/em&gt; in Python&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything that you see on PyPI and install via &lt;code&gt;pip&lt;/code&gt; is a package, so in order to share your awesome stuff, you have to make a package out of it.&lt;/p&gt;

&lt;p&gt;Alright, assume we’re convinced and want to convert our 2 modules into a nice package. To do this we need to create a directory with an empty &lt;code&gt;__init__.py&lt;/code&gt; file and move our files into it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pizzapy/
├── __init__.py
├── menu.py
└── pizza.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that’s it – now you have a &lt;code&gt;pizzapy&lt;/code&gt; package!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To make a package, create a directory with an &lt;code&gt;__init__.py&lt;/code&gt; file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember that a package is a &lt;strong&gt;namespace&lt;/strong&gt; for modules, so you don’t import the package itself, you import a module from a package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import pizzapy.menu
pizza.py module name is pizza
&amp;gt;&amp;gt;&amp;gt; pizzapy.menu.MENU
[&amp;lt;pizza.Pizza object at 0x7fa065291160&amp;gt;, &amp;lt;pizza.Pizza object at 0x7fa065291198&amp;gt;, &amp;lt;pizza.Pizza object at 0x7fa065291a20&amp;gt;]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you do the import that way, it may seem too verbose because you need to use the fully qualified name. I guess that’s intentional behavior, because one of the Zen of Python items is “explicit is better than implicit”.&lt;/p&gt;

&lt;p&gt;Anyway, you can always use a &lt;code&gt;from package import module&lt;/code&gt; form to shorten names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from pizzapy import menu
pizza.py module name is pizza
&amp;gt;&amp;gt;&amp;gt; menu.MENU
[&amp;lt;pizza.Pizza object at 0x7fa065291160&amp;gt;, &amp;lt;pizza.Pizza object at 0x7fa065291198&amp;gt;, &amp;lt;pizza.Pizza object at 0x7fa065291a20&amp;gt;]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Package init
&lt;/h3&gt;

&lt;p&gt;Remember how we put an &lt;code&gt;__init__.py&lt;/code&gt; file in a directory and it magically became a package? That’s a great example of convention over configuration – we don’t need to describe any configuration or register anything. Any directory with an &lt;code&gt;__init__.py&lt;/code&gt; file is, by convention, a Python package.&lt;/p&gt;

&lt;p&gt;Besides making a package, &lt;code&gt;__init__.py&lt;/code&gt; serves one more purpose – package initialization. That’s why it’s called init, after all! Initialization is triggered on package import; in other words, importing a package invokes &lt;code&gt;__init__.py&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When you import a package, the &lt;code&gt;__init__.py&lt;/code&gt; module of the package is executed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the &lt;code&gt;__init__&lt;/code&gt; module you can do anything you want, but most commonly it’s used for some package initialization or setting the special &lt;code&gt;__all__&lt;/code&gt; variable. The latter controls star import – &lt;code&gt;from package import *&lt;/code&gt;.&lt;/p&gt;
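&lt;p&gt;Here is a small sketch of how &lt;code&gt;__all__&lt;/code&gt; limits the star import, using a made-up throwaway &lt;code&gt;toppings&lt;/code&gt; module written to a temp directory:&lt;/p&gt;

```python
# Sketch: __all__ is the explicit export list for "from module import *".
# The "toppings" module and its names are made up for this demo.
import os
import sys
import tempfile

tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "toppings.py"), "w") as f:
    f.write(
        "__all__ = ['basil']\n"   # only 'basil' is exported on star import
        "basil = 'green'\n"
        "anchovy = 'salty'\n"     # public-looking, but not exported
    )
sys.path.insert(0, tmpdir)

# Star imports are only allowed at module level, so exercise one via exec
# with a fresh module-level namespace.
ns = {}
exec("from toppings import *", ns)
print("basil" in ns, "anchovy" in ns)  # True False
```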

&lt;p&gt;And because Python is awesome we can do pretty much anything in the &lt;code&gt;__init__&lt;/code&gt; module, even really strange things. Suppose we don’t like the explicitness of imports and want to drag all of the modules’ symbols up to the package level, so we don’t have to remember the actual module names.&lt;/p&gt;

&lt;p&gt;To do that we can import everything from the &lt;code&gt;menu&lt;/code&gt; and &lt;code&gt;pizza&lt;/code&gt; modules in &lt;code&gt;__init__.py&lt;/code&gt; like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pizzapy/__init__.py

from pizzapy.pizza import *
from pizzapy.menu import *
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import pizzapy
pizza.py module name is pizzapy.pizza
pizza.py module name is pizza
&amp;gt;&amp;gt;&amp;gt; pizzapy.MENU
[&amp;lt;pizza.Pizza object at 0x7f1bf03b8828&amp;gt;, &amp;lt;pizza.Pizza object at 0x7f1bf03b8860&amp;gt;, &amp;lt;pizza.Pizza object at 0x7f1bf03b8908&amp;gt;]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No more &lt;code&gt;pizzapy.menu.MENU&lt;/code&gt; or &lt;code&gt;menu.MENU&lt;/code&gt; :-) That way it kinda works like packages in Go. Note, though, that this is discouraged because you’re abusing Python, and if you check in such code you’re going to have a bad time at code review. I’m showing you this just as an illustration, don’t blame me!&lt;/p&gt;

&lt;p&gt;You could rewrite the import more succinctly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pizzapy/__init__.py

from .pizza import *
from .menu import *
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is just another syntax for doing the same thing, called a relative import. Let’s look at it closer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Absolute and relative imports
&lt;/h3&gt;

&lt;p&gt;The 2 code pieces above are the only ways of doing a so-called relative import, because since Python 3 all imports are absolute by default (as in &lt;a href="https://docs.python.org/2.5/whatsnew/pep-328.html"&gt;PEP 328&lt;/a&gt;), meaning that a bare &lt;code&gt;import&lt;/code&gt; looks up modules in the search path and not in the current package. This avoids shadowing of standard modules: otherwise, creating your own &lt;code&gt;sys.py&lt;/code&gt; module and doing &lt;code&gt;import sys&lt;/code&gt; could override the standard library &lt;code&gt;sys&lt;/code&gt; module.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Since Python 3 all imports are absolute by default – &lt;code&gt;import&lt;/code&gt; will look for a system package first&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But if your package has a module called &lt;code&gt;sys&lt;/code&gt; and you want to import it into another module of the same package, you have to make a &lt;strong&gt;relative import&lt;/strong&gt;. To do it you have to be explicit again and write &lt;code&gt;from package.module import somesymbol&lt;/code&gt; or &lt;code&gt;from .module import somesymbol&lt;/code&gt;. That funny single dot before the module name is read as “current package”.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To make a relative import, prefix the module with the package name or a dot&lt;/li&gt;
&lt;/ul&gt;
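&lt;p&gt;The shadowing case above can be sketched with a made-up package (here called &lt;code&gt;shadowpkg&lt;/code&gt;) that ships its own &lt;code&gt;sys&lt;/code&gt; module, built in a temp directory:&lt;/p&gt;

```python
# Sketch: inside a package, a bare "import sys" still means the stdlib,
# while "from .sys import ..." reaches the package-local sys module.
# The "shadowpkg" package name and MARKER symbol are made up for the demo.
import importlib
import os
import sys
import tempfile

tmpdir = tempfile.mkdtemp()
pkg = os.path.join(tmpdir, "shadowpkg")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "sys.py"), "w") as f:
    f.write("MARKER = 'package-local sys'\n")
with open(os.path.join(pkg, "user.py"), "w") as f:
    f.write(
        "import sys                  # absolute: the stdlib sys\n"
        "from .sys import MARKER     # relative: our own sys module\n"
        "IS_STDLIB = hasattr(sys, 'path')\n"
    )
sys.path.insert(0, tmpdir)

user = importlib.import_module("shadowpkg.user")
print(user.MARKER, user.IS_STDLIB)  # package-local sys True
```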
&lt;h3&gt;
  
  
  Executable package
&lt;/h3&gt;

&lt;p&gt;In Python you can invoke a module with the &lt;code&gt;python3 -m &amp;lt;module&amp;gt;&lt;/code&gt; construction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 -m pizza
pizza.py module name is __main__
Carbonara is the most awesome pizza.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But packages can also be invoked this way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 -m pizzapy
/usr/bin/python3: No module named pizzapy.__main__; 'pizzapy' is a package and cannot be directly executed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, it needs a &lt;code&gt;__main__&lt;/code&gt; module, so let’s implement it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pizzapy/__main__.py

from pizzapy.menu import MENU

print('Awesomeness of pizzas:')
for pizza in MENU:
    print(pizza.name, pizza.awesomeness())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 -m pizzapy
pizza.py module name is pizza
Awesomeness of pizzas:
Margherita 300
Carbonara 9000
Marinara 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Adding &lt;code&gt;__main__.py&lt;/code&gt; makes a package executable (invoke it with &lt;code&gt;python3 -m package&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Import sibling packages
&lt;/h3&gt;

&lt;p&gt;And the last thing I want to cover is the import of sibling packages. Suppose we have a sibling package &lt;code&gt;pizzashop&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── pizzapy
│   ├── __init__.py
│   ├── __main__.py
│   ├── menu.py
│   └── pizza.py
└── pizzashop
    ├── __init__.py
    └── shop.py

# pizzashop/shop.py
import pizzapy.menu
print(pizzapy.menu.MENU)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, sitting in the top level directory, if we try to invoke shop.py like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 pizzashop/shop.py
Traceback (most recent call last):
  File "pizzashop/shop.py", line 1, in &amp;lt;module&amp;gt;
    import pizzapy.menu
ModuleNotFoundError: No module named 'pizzapy'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;we get an error saying that our &lt;code&gt;pizzapy&lt;/code&gt; package is not found. But if we invoke it as a module of the &lt;code&gt;pizzashop&lt;/code&gt; package&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 -m pizzashop.shop
pizza.py module name is pizza
[&amp;lt;pizza.Pizza object at 0x7f372b59ccc0&amp;gt;, &amp;lt;pizza.Pizza object at 0x7f372b59ccf8&amp;gt;, &amp;lt;pizza.Pizza object at 0x7f372b59cda0&amp;gt;]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;it suddenly works. What the hell is going on here?&lt;/p&gt;

&lt;p&gt;The explanation for this lies in the Python module search path and it’s greatly described in &lt;a href="https://docs.python.org/3/tutorial/modules.html#the-module-search-path"&gt;the documentation on modules&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The module search path is a list of directories (available at runtime as &lt;code&gt;sys.path&lt;/code&gt;) that the interpreter uses to locate modules. It is initialized with the path to the Python standard modules (&lt;code&gt;/usr/lib64/python3.6&lt;/code&gt;), the &lt;code&gt;site-packages&lt;/code&gt; directories where &lt;code&gt;pip&lt;/code&gt; puts everything you install globally, and also a directory that depends on how you run the module. If you run a module as a file, like &lt;code&gt;python3 pizzashop/shop.py&lt;/code&gt;, the path to the containing directory (&lt;code&gt;pizzashop&lt;/code&gt;) is added to &lt;code&gt;sys.path&lt;/code&gt;. Otherwise, including when running with the &lt;code&gt;-m&lt;/code&gt; option, the current directory (as in &lt;code&gt;pwd&lt;/code&gt;) is added to the module search path. We can check it by printing &lt;code&gt;sys.path&lt;/code&gt; in &lt;code&gt;pizzashop/shop.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pwd
/home/avd/dev/python-imports

$ tree
.
├── pizzapy
│   ├── __init__.py
│   ├── __main__.py
│   ├── menu.py
│   └── pizza.py
└── pizzashop
    ├── __init__.py
    └── shop.py

$ python3 pizzashop/shop.py
['/home/avd/dev/python-imports/pizzashop',
 '/usr/lib64/python36.zip',
 '/usr/lib64/python3.6',
 '/usr/lib64/python3.6/lib-dynload',
 '/usr/local/lib64/python3.6/site-packages',
 '/usr/local/lib/python3.6/site-packages',
 '/usr/lib64/python3.6/site-packages',
 '/usr/lib/python3.6/site-packages']
Traceback (most recent call last):
  File "pizzashop/shop.py", line 5, in &amp;lt;module&amp;gt;
    import pizzapy.menu
ModuleNotFoundError: No module named 'pizzapy'

$ python3 -m pizzashop.shop
['',
 '/usr/lib64/python36.zip',
 '/usr/lib64/python3.6',
 '/usr/lib64/python3.6/lib-dynload',
 '/usr/local/lib64/python3.6/site-packages',
 '/usr/local/lib/python3.6/site-packages',
 '/usr/lib64/python3.6/site-packages',
 '/usr/lib/python3.6/site-packages']
pizza.py module name is pizza
[&amp;lt;pizza.Pizza object at 0x7f2f75747f28&amp;gt;, &amp;lt;pizza.Pizza object at 0x7f2f75747f60&amp;gt;, &amp;lt;pizza.Pizza object at 0x7f2f75747fd0&amp;gt;]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, in the first case we have the &lt;code&gt;pizzashop&lt;/code&gt; dir in our path, so we cannot find the sibling &lt;code&gt;pizzapy&lt;/code&gt; package, while in the second case the current dir (denoted as &lt;code&gt;''&lt;/code&gt;) is in &lt;code&gt;sys.path&lt;/code&gt; and it contains both packages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python has module search path available at runtime as &lt;code&gt;sys.path&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If you run a module as a script file, the containing directory is added to &lt;code&gt;sys.path&lt;/code&gt;; otherwise, the current directory is added to it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This problem of importing a sibling package often arises when people put a bunch of test or example scripts in a directory or package next to the main package. Here are a couple of StackOverflow questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/q/6323860"&gt;https://stackoverflow.com/q/6323860&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/q/6670275"&gt;https://stackoverflow.com/q/6670275&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The good solution is to avoid the problem – put tests or examples in the package itself and use relative imports. The dirty solution is to modify &lt;code&gt;sys.path&lt;/code&gt; at runtime (yay, dynamic!) by adding the parent directory of the needed package. People actually do this even though it’s an awful hack.&lt;/p&gt;
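&lt;p&gt;For illustration only, here is a sketch of that hack with throwaway sibling packages in a temp directory (the package names are made up; in real code people usually derive the inserted path from &lt;code&gt;__file__&lt;/code&gt;):&lt;/p&gt;

```python
# Sketch of the sys.path hack: a sibling package becomes importable only
# after its parent directory is pushed onto the search path at runtime.
import importlib
import os
import sys
import tempfile

root = tempfile.mkdtemp()
for name in ("pizzapy_demo", "shop_demo"):   # made-up sibling packages
    os.makedirs(os.path.join(root, name))
    open(os.path.join(root, name, "__init__.py"), "w").close()

# Without the parent directory on sys.path, the sibling is invisible...
try:
    importlib.import_module("pizzapy_demo")
except ModuleNotFoundError:
    print("not found")

sys.path.insert(0, root)   # ...the hack: add the parent dir at runtime
print(importlib.import_module("pizzapy_demo").__name__)  # pizzapy_demo
```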

&lt;h2&gt;
  
  
  The End!
&lt;/h2&gt;

&lt;p&gt;I hope that after reading this post you’ll have a better understanding of Python imports and can finally decompose that giant script you have in your toolbox without fear. In the end, everything in Python is really simple, and even when it’s not sufficient for your case, you can always monkey patch anything at runtime.&lt;/p&gt;

&lt;p&gt;And on that note, I would like to stop and thank you for your attention. Until next time!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Write your own diff for fun</title>
      <dc:creator>Alex Dzyoba</dc:creator>
      <pubDate>Wed, 27 Dec 2017 00:00:00 +0000</pubDate>
      <link>https://dev.to/dzeban/write-your-own-diff-for-fun-3ckl</link>
      <guid>https://dev.to/dzeban/write-your-own-diff-for-fun-3ckl</guid>
      <description>&lt;p&gt;On the other day, when I was looking at &lt;code&gt;git diff&lt;/code&gt;, I thought “How does it work?“. Brute-force idea of comparing all possible pairs of lines doesn’t seem efficient and indeed it has exponential algorithmic complexity. There must be a better way, right?&lt;/p&gt;

&lt;p&gt;As it turned out, &lt;code&gt;git diff&lt;/code&gt;, like the usual &lt;code&gt;diff&lt;/code&gt; tool, is modeled as a solution to a problem called Longest Common Subsequence (LCS). The idea is really ingenious – when we diff 2 files, we view them as 2 sequences of lines and find their longest common subsequence. Then anything that is not in that subsequence is our diff. Sounds neat, but how can one implement it efficiently (without that exponential complexity)?&lt;/p&gt;

&lt;p&gt;The LCS problem is a classic problem that is best solved with dynamic programming – a somewhat advanced technique in algorithm design that roughly means iteration with memoization.&lt;/p&gt;

&lt;p&gt;I’ve always struggled with dynamic programming because it’s mostly presented through some (in my opinion) artificial problems that are hard for me to work on. But now, when I see something so useful that can help me write a diff, I just can’t resist.&lt;/p&gt;

&lt;p&gt;I used a &lt;a href="https://en.wikipedia.org/wiki/Longest_common_subsequence_problem"&gt;Wikipedia article on LCS&lt;/a&gt; as my guide, so if you want to check the algorithm nitty-gritty, go ahead to the link. I’m going to show you my implementation (that is, of course, &lt;a href="https://github.com/alexdzyoba/diff"&gt;available on GitHub&lt;/a&gt;) to demonstrate how easily you can solve such a seemingly hard problem.&lt;/p&gt;

&lt;p&gt;I’ve chosen Python to implement it and immediately felt grateful because you can copy-paste pseudocode and use it with minimal changes. Here is the diff printing function from Wikipedia article in pseudocode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function printDiff(C[0..m,0..n], X[1..m], Y[1..n], i, j)
    if i &amp;gt; 0 and j &amp;gt; 0 and X[i] = Y[j]
        printDiff(C, X, Y, i-1, j-1)
        print " " + X[i]
    else if j &amp;gt; 0 and (i = 0 or C[i,j-1] ≥ C[i-1,j])
        printDiff(C, X, Y, i, j-1)
        print "+ " + Y[j]
    else if i &amp;gt; 0 and (j = 0 or C[i,j-1] &amp;lt; C[i-1,j])
        printDiff(C, X, Y, i-1, j)
        print "- " + X[i]
    else
        print ""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""Print the diff using LCS length matrix by backtracking it"""&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;print_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"  "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;print_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"+ "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;print_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"- "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not the actual function for my diff printing because it doesn’t handle a few corner cases – it’s just to illustrate Python awesomeness.&lt;/p&gt;

&lt;p&gt;The essence of diffing is building the matrix &lt;code&gt;C&lt;/code&gt; which contains lengths for all subsequences. Building it may seem daunting until you start looking at the simple cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LCS of “A” and “A” is “A”.&lt;/li&gt;
&lt;li&gt;LCS of “AA” and “AB” is “A”.&lt;/li&gt;
&lt;li&gt;LCS of “AAA” and “ABA” is “AA”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building it iteratively, we can define the LCS function:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LCS of 2 empty sequences is the empty sequence.&lt;/li&gt;
&lt;li&gt;LCS of “${prefix1}A” and “${prefix2}A” is LCS(${prefix1}, ${prefix2}) + A&lt;/li&gt;
&lt;li&gt;LCS of “${prefix1}A” and “${prefix2}B” is the longest of LCS(${prefix1}A, ${prefix2}) and LCS(${prefix1}, ${prefix2}B)&lt;/li&gt;
&lt;/ul&gt;
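&lt;p&gt;These three rules translate almost directly into a naive memoized recursion – a sketch for intuition, not the matrix-based version the actual diff uses:&lt;/p&gt;

```python
from functools import lru_cache

def lcs(x, y):
    """Longest common subsequence, following the three rules above."""
    @lru_cache(maxsize=None)
    def go(i, j):
        if i == 0 or j == 0:           # LCS with an empty sequence is empty
            return ""
        if x[i - 1] == y[j - 1]:       # matching tails extend the LCS
            return go(i - 1, j - 1) + x[i - 1]
        a, b = go(i - 1, j), go(i, j - 1)
        return a if len(a) >= len(b) else b  # otherwise take the longer one
    return go(len(x), len(y))

print(lcs("AAA", "ABA"))  # AA
```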

&lt;p&gt;That’s basically the core of dynamic programming – building the solution iteratively, starting from the simple base cases. Note, though, that it works only when the problem has so-called “optimal substructure”, meaning that the solution can be built by reusing previous memoized steps.&lt;/p&gt;

&lt;p&gt;Here is the Python function that builds that length matrix for all subsequences:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lcslen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""Build a matrix of LCS length.

    This matrix will be used later to backtrack the real LCS.
    """&lt;/span&gt;

    &lt;span class="c1"&gt;# This is our matrix comprised of list of lists.
&lt;/span&gt;    &lt;span class="c1"&gt;# We allocate extra row and column with zeroes for the base case of empty
&lt;/span&gt;    &lt;span class="c1"&gt;# sequence. Extra row and column is appended to the end and exploit
&lt;/span&gt;    &lt;span class="c1"&gt;# Python's ability of negative indices: x[-1] is the last elem.
&lt;/span&gt;    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;xi&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;yj&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;xi&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;yj&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
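&lt;p&gt;As a quick sanity check, here is a standalone sketch that runs lcslen on the classic textbook pair of strings (the function is repeated so the snippet runs on its own):&lt;/p&gt;

```python
def lcslen(x, y):
    """Build the LCS length matrix (same function as above)."""
    # The extra row and column of zeroes at the end handle the
    # empty-sequence base case via negative indexing: c[-1] is the last row.
    c = [[0 for _ in range(len(y) + 1)] for _ in range(len(x) + 1)]
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            if xi == yj:
                c[i][j] = 1 + c[i - 1][j - 1]
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c

# "ABCBDAB" vs "BDCABA" is the classic example: their LCS has length 4.
c = lcslen("ABCBDAB", "BDCABA")
print(c[-2][-2])  # 4 -- the bottom-right cell of the "real" matrix
```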



&lt;p&gt;Having the matrix of LCS lengths, we can now reconstruct the actual LCS by backtracking through it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;backtrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""Backtrack the LCS length matrix to get the actual LCS"""&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;backtrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;backtrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;backtrack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
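&lt;p&gt;To see the backtracking in action, here is a standalone sketch (both functions are repeated so it runs on its own); for this textbook pair the tie-breaking above happens to produce "BCBA":&lt;/p&gt;

```python
def lcslen(x, y):
    """Build the LCS length matrix (same function as above)."""
    c = [[0 for _ in range(len(y) + 1)] for _ in range(len(x) + 1)]
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            if xi == yj:
                c[i][j] = 1 + c[i - 1][j - 1]
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c

def backtrack(c, x, y, i, j):
    """Backtrack the LCS length matrix to get the actual LCS."""
    if i == -1 or j == -1:
        return ""
    elif x[i] == y[j]:
        # Matching elements are part of the LCS; collect and step diagonally.
        return backtrack(c, x, y, i - 1, j - 1) + x[i]
    else:
        # Follow the direction of the larger subproblem.
        if c[i][j - 1] > c[i - 1][j]:
            return backtrack(c, x, y, i, j - 1)
        else:
            return backtrack(c, x, y, i - 1, j)

x, y = "ABCBDAB", "BDCABA"
print(backtrack(lcslen(x, y), x, y, len(x) - 1, len(y) - 1))  # BCBA
```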



&lt;p&gt;But for a diff we don’t need the actual LCS – we need the opposite, the lines that are not common. So diff printing is a slightly modified backtrack function with two additional cases for changes at the head of a sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;"""Print the diff using LCS length matrix by backtracking it"""&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;print_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"+ "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;print_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"- "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;print_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"  "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;print_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"+ "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;print_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"- "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
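&lt;p&gt;Here is a standalone sketch of print_diff on two tiny "files" of one-word lines. The functions are repeated so the snippet runs on its own; the out-of-range checks are spelled as equality with -1 (equivalent here, since the indices never drop below -1), and the final branch is a plain else:&lt;/p&gt;

```python
import contextlib
import io

def lcslen(x, y):
    """Build the LCS length matrix (same function as above)."""
    c = [[0 for _ in range(len(y) + 1)] for _ in range(len(x) + 1)]
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            if xi == yj:
                c[i][j] = 1 + c[i - 1][j - 1]
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c

def print_diff(c, x, y, i, j):
    """Print the diff using the LCS length matrix by backtracking it."""
    if i == -1 and j == -1:
        return
    elif i == -1:
        # x is exhausted: everything left in y was added.
        print_diff(c, x, y, i, j - 1)
        print("+ " + y[j])
    elif j == -1:
        # y is exhausted: everything left in x was removed.
        print_diff(c, x, y, i - 1, j)
        print("- " + x[i])
    elif x[i] == y[j]:
        print_diff(c, x, y, i - 1, j - 1)
        print("  " + x[i])
    elif c[i][j - 1] >= c[i - 1][j]:
        print_diff(c, x, y, i, j - 1)
        print("+ " + y[j])
    else:
        print_diff(c, x, y, i - 1, j)
        print("- " + x[i])

x = ["a", "b", "c"]
y = ["a", "c", "d"]
c = lcslen(x, y)
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    print_diff(c, x, y, len(x) - 1, len(y) - 1)
diff_lines = buf.getvalue().splitlines()
print(diff_lines)  # ['  a', '- b', '  c', '+ d']
```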



&lt;p&gt;To invoke it, we read the input files into Python lists of strings and pass them to our diff function. We also add the usual Python boilerplate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lcslen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;print_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Usage: {} &amp;lt;file1&amp;gt; &amp;lt;file2&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;f2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readlines&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And there you go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python3 diff.py f1 f2
+ """Simple diff based on LCS solution"""
+ 
+ import sys
  from lcs import lcslen

  def print_diff(c, x, y, i, j):
+ """Print the diff using LCS length matrix by backtracking it"""
+ 
       if i &amp;gt;= 0 and j &amp;gt;= 0 and x[i] == y[j]:
           print_diff(c, x, y, i-1, j-1)
           print(" " + x[i])
       elif j &amp;gt;= 0 and (i == 0 or c[i][j-1] &amp;gt;= c[i-1][j]):
           print_diff(c, x, y, i, j-1)
- print("+ " + y[j])
+ print("+ " + y[j])
       elif i &amp;gt;= 0 and (j == 0 or c[i][j-1] &amp;lt; c[i-1][j]):
           print_diff(c, x, y, i-1, j)
           print("- " + x[i])
       else:
- print("")
- 
+ print("") # pass?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can check out the full source code at &lt;a href="https://github.com/alexdzyoba/diff"&gt;https://github.com/alexdzyoba/diff&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That’s it. Until next time!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Go service with Consul integration</title>
      <dc:creator>Alex Dzyoba</dc:creator>
      <pubDate>Thu, 14 Dec 2017 00:00:00 +0000</pubDate>
      <link>https://dev.to/dzeban/go-service-with-consul-integration-36mn</link>
      <guid>https://dev.to/dzeban/go-service-with-consul-integration-36mn</guid>
<description>&lt;p&gt;In the world of stateless microservices, which are usually written in Go, we need a way to discover them. This is where Hashicorp’s Consul helps: services register within Consul so other services can discover them via simple DNS or HTTP queries.&lt;/p&gt;

&lt;p&gt;Go has a Consul client library but, alas, I haven’t seen any real examples of how to integrate it into your services. So here I’m going to show you how to do exactly that.&lt;/p&gt;

&lt;p&gt;I’m going to write a service that will serve at some HTTP endpoint and will serve key-value data – I believe this resembles a lot of existing microservices that people write these days. Ours is called &lt;code&gt;webkv&lt;/code&gt; and it’s &lt;a href="https://github.com/alexdzyoba/webkv"&gt;on Github&lt;/a&gt;. Choose the &lt;a href="https://github.com/alexdzyoba/webkv/tree/v1"&gt;“v1” tag&lt;/a&gt; and you’re good to go.&lt;/p&gt;

&lt;p&gt;This service will register itself in Consul with a TTL check that will, well, check the internal health status and send heartbeat-like signals to Consul. Should Consul not receive a signal from our service within the TTL interval, it will mark the service as failed and remove it from query results.&lt;/p&gt;

&lt;p&gt;Side note: Consul also has simple port checks, where the Consul agent judges the health of the service based on port availability. While this is much simpler – you don’t have to add anything to your code – it’s not as powerful as a TTL check. With TTL checks you can inspect the internal state of your service, which is a huge advantage over plain availability – a service may accept queries while its data is stale or invalid. Also, with TTL checks the service status is not limited to a binary good/bad state – it can also be in a warning state.&lt;/p&gt;

&lt;p&gt;All right, to the point! The “v1” version of &lt;code&gt;webkv&lt;/code&gt; uses only the standard library and a bare minimum of dependencies – the Redis client and the Consul API library. Later I’m going to extend it with other niceties like Prometheus integration, structured logging, and sane configuration management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basic Web service
&lt;/h2&gt;

&lt;p&gt;Let’s start with a basic web service that will serve key-value data from Redis.&lt;/p&gt;

&lt;p&gt;First, we parse the &lt;code&gt;port&lt;/code&gt;, &lt;code&gt;ttl&lt;/code&gt;, and &lt;code&gt;addrs&lt;/code&gt; command-line flags. The last one is a list of Redis addresses separated by &lt;code&gt;;&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;port&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"port"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Port to listen on"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;addrsStr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"addrs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"(Required) Redis addrs (may be delimited by ;)"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ttl&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ttl"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Service TTL check duration"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;addrsStr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fprintln&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"addrs argument is required"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PrintDefaults&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;addrs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;addrsStr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;";"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we create a service that implements the &lt;a href="https://golang.org/pkg/net/http/#Handler"&gt;&lt;code&gt;Handler&lt;/code&gt;&lt;/a&gt; interface and launch it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addrs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":%d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Listening on "&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing fancy here. Now let’s look at the service itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"time"&lt;/span&gt;

    &lt;span class="s"&gt;"github.com/go-redis/redis"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Service&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;        &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;TTL&lt;/span&gt;         &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;
    &lt;span class="n"&gt;RedisClient&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UniversalClient&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Service&lt;/code&gt; type holds a name, a TTL, and a Redis client handle. It’s instantiated like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addrs&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"webkv"&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TTL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RedisClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewUniversalClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UniversalOptions&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Addrs&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;addrs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Check&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Check&lt;/code&gt; method issues the Redis &lt;code&gt;PING&lt;/code&gt; command to check that we’re ok. This will be used later for the Consul registration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Check&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RedisClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ping&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now the implementation of the &lt;code&gt;ServeHTTP&lt;/code&gt; method that is invoked to process requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;

    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RedisClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Key not found"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatusNotFound&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;404&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"url=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt; remote=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt; key=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt; status=%d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RemoteAddr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basically, we take the URL path from the request and use it as the key for a Redis &lt;code&gt;GET&lt;/code&gt; command. We then return the value, or 404 in case of an error. Finally, we log the request with a quick and dirty structured logging message in &lt;a href="https://brandur.org/logfmt"&gt;logfmt format&lt;/a&gt;.&lt;/p&gt;
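&lt;p&gt;logfmt is nothing more than space-separated &lt;code&gt;key="value"&lt;/code&gt; pairs, so it’s trivial to emit by hand. Here is a tiny helper (hypothetical, not part of webkv) that builds such a line deterministically:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// logfmtLine renders fields as space-separated key="value" pairs.
// Keys are sorted so the output is deterministic.
func logfmtLine(fields map[string]interface{}) string {
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, fmt.Sprint(fields[k])))
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(logfmtLine(map[string]interface{}{
		"url":    "/blink",
		"status": 200,
	}))
	// prints: status="200" url="/blink"
}
```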

&lt;p&gt;Launch it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./webkv -addrs 'localhost:6379'
2017/12/13 21:44:15 Listening on :8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl 'localhost:8080/blink'
182
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And see the log message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2017/12/13 21:44:29 url="/blink" remote="[::1]:35020" key="blink" status=200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Consul integration
&lt;/h2&gt;

&lt;p&gt;Now let’s make our service discoverable via Consul. Consul has a simple HTTP API to register services that you could use directly via &lt;code&gt;net/http&lt;/code&gt;, but we will use its &lt;a href="https://godoc.org/github.com/hashicorp/consul/api"&gt;Go library&lt;/a&gt;.&lt;/p&gt;
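&lt;p&gt;For reference, the raw HTTP API variant is a single &lt;code&gt;PUT&lt;/code&gt; to &lt;code&gt;/v1/agent/service/register&lt;/code&gt;. A minimal sketch with plain &lt;code&gt;net/http&lt;/code&gt; (helper names are mine, not from the article):&lt;/p&gt;

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// buildRegistration marshals a minimal service definition for the
// PUT /v1/agent/service/register endpoint of the Consul HTTP API.
func buildRegistration(name, ttl string) ([]byte, error) {
	def := map[string]interface{}{
		"Name":  name,
		"Check": map[string]string{"TTL": ttl},
	}
	return json.Marshal(def)
}

// register sends the definition to a Consul agent.
func register(agent string, payload []byte) error {
	url := fmt.Sprintf("http://%s/v1/agent/service/register", agent)
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(payload))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("registration failed: %s", resp.Status)
	}
	return nil
}

func main() {
	payload, err := buildRegistration("webkv", "10s")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))
	// With a local agent running this would be:
	// register("localhost:8500", payload)
}
```

&lt;p&gt;The Go library essentially wraps this same HTTP API in typed structs.&lt;/p&gt;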

&lt;p&gt;The Consul Go library doesn’t have examples, BUT, it has tests! Tests are nice because they not only give you confidence in your lib and validate the sanity of your code structure and API, but also provide a set of usage examples. &lt;a href="https://github.com/hashicorp/consul/blob/9f2989424e75ecbcaecb990cf7616ea8ad128adf/api/agent_test.go#L383"&gt;Here is an example&lt;/a&gt; from the Consul API test suite for service registration and TTL checks.&lt;/p&gt;

&lt;p&gt;Looking at these tests, we can tell that we interact with Consul by creating a &lt;code&gt;Client&lt;/code&gt; and then getting a handle for a particular &lt;a href="https://www.consul.io/api/index.html"&gt;endpoint&lt;/a&gt; like &lt;code&gt;/agent&lt;/code&gt; or &lt;code&gt;/kv&lt;/code&gt;. Each endpoint has a corresponding Go type. The agent endpoint is responsible for service registration and for sending health checks. To store an &lt;code&gt;Agent&lt;/code&gt; handle, we extend our &lt;code&gt;Service&lt;/code&gt; type with a new pointer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;consul&lt;/span&gt; &lt;span class="s"&gt;"github.com/hashicorp/consul/api"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Service&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;        &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;TTL&lt;/span&gt;         &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;
    &lt;span class="n"&gt;RedisClient&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UniversalClient&lt;/span&gt;
    &lt;span class="n"&gt;ConsulAgent&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;consul&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, in the &lt;code&gt;Service&lt;/code&gt; “constructor”, we add the creation of a Consul agent handle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addrs&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;consul&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;consul&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultConfig&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConsulAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we use the agent to register our service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;    &lt;span class="n"&gt;serviceDef&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;consul&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AgentServiceRegistration&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Check&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;consul&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AgentServiceCheck&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;TTL&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TTL&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConsulAgent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServiceRegister&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;serviceDef&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key thing here is the &lt;code&gt;Check&lt;/code&gt; part, where we tell Consul how it should check our service. In our case, we say that we ourselves will send heartbeat-like signals to Consul, and if no signal arrives within the TTL, Consul marks our service as failed. A failed service is not returned in DNS or HTTP API query results.&lt;/p&gt;

&lt;p&gt;After the service is registered, we have to periodically send a TTL check signal of the Pass, Fail or Warn type. It must arrive before the TTL expires to keep the service from being marked as failed. We’ll do it in a separate goroutine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UpdateTTL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Check&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;UpdateTTL&lt;/code&gt; method uses &lt;a href="https://golang.org/pkg/time/#Ticker"&gt;&lt;code&gt;time.Ticker&lt;/code&gt;&lt;/a&gt; to periodically invoke the actual update function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;UpdateTTL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;check&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ticker&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTicker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TTL&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;check&lt;/code&gt; argument is a function that returns the service status. Based on its result, we send either a pass or a fail check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (s *Service) update(check func() (bool, error)) {
    ok, err := check()
    if !ok {
        log.Printf("err=\"Check failed\" msg=\"%s\"", err.Error())
        if agentErr := s.ConsulAgent.FailTTL("service:"+s.Name, err.Error()); agentErr != nil {
            log.Print(agentErr)
        }
    } else {
        if agentErr := s.ConsulAgent.PassTTL("service:"+s.Name, ""); agentErr != nil {
            log.Print(agentErr)
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The check function that we pass to the goroutine is the one we used earlier when creating the service; it simply returns the boolean status of the Redis &lt;code&gt;PING&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;And that’s it! This is how it all works together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We launch the &lt;code&gt;webkv&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;It connects to Redis and starts serving on the given port&lt;/li&gt;
&lt;li&gt;It connects to the Consul agent and registers the service with a TTL check&lt;/li&gt;
&lt;li&gt;Every TTL/2 seconds we check the service status by PINGing Redis and send a Pass check&lt;/li&gt;
&lt;li&gt;Should Redis connectivity fail, we detect it and send a Fail check that removes our service instance from DNS and HTTP API query results, so clients don’t get errors or invalid data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To see it in action, you need to launch Consul and Redis. You can launch Consul with &lt;code&gt;consul agent -dev&lt;/code&gt; or start a normal cluster. How to launch Redis depends on your distro; on my Fedora it’s just &lt;code&gt;systemctl start redis&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now launch the &lt;code&gt;webkv&lt;/code&gt; like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./webkv -addrs localhost:6379 -port 8888
2017/12/14 19:00:29 Listening on :8888
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query the Consul for services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ dig +noall +answer @127.0.0.1 -p 8600 webkv.service.dc1.consul
webkv.service.dc1.consul. 0 IN A 127.0.0.1

$ curl localhost:8500/v1/health/service/webkv?passing
[
    {
        "Node": {
            "ID": "a4618035-c73d-9e9e-2b83-24ece7c24f45",
            "Node": "alien",
            "Address": "127.0.0.1",
            "Datacenter": "dc1",
            "TaggedAddresses": {
                "lan": "127.0.0.1",
                "wan": "127.0.0.1"
            },
            "Meta": {
                "consul-network-segment": ""
            },
            "CreateIndex": 5,
            "ModifyIndex": 6
        },
        "Service": {
            "ID": "webkv",
            "Service": "webkv",
            "Tags": [],
            "Address": "",
            "Port": 0,
            "EnableTagOverride": false,
            "CreateIndex": 15,
            "ModifyIndex": 37
        },
        "Checks": [
            {
                "Node": "alien",
                "CheckID": "serfHealth",
                "Name": "Serf Health Status",
                "Status": "passing",
                "Notes": "",
                "Output": "Agent alive and reachable",
                "ServiceID": "",
                "ServiceName": "",
                "ServiceTags": [],
                "Definition": {},
                "CreateIndex": 5,
                "ModifyIndex": 5
            },
            {
                "Node": "alien",
                "CheckID": "service:webkv",
                "Name": "Service 'webkv' check",
                "Status": "passing",
                "Notes": "",
                "Output": "",
                "ServiceID": "webkv",
                "ServiceName": "webkv",
                "ServiceTags": [],
                "Definition": {},
                "CreateIndex": 15,
                "ModifyIndex": 141
            }
        ]
    }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now if we stop Redis, we’ll see the log messages&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
2017/12/14 19:29:19 err="Check failed" msg="EOF"
2017/12/14 19:29:27 err="Check failed" msg="dial tcp [::1]:6379: getsockopt: connection refused"
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and that Consul doesn’t return our service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ dig +noall +answer @127.0.0.1 -p 8600 webkv.service.dc1.consul
$ # empty reply

$ curl localhost:8500/v1/health/service/webkv?passing
[]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Starting Redis again brings the service back to healthy.&lt;/p&gt;

&lt;p&gt;So, basically, this is it: a basic web service with Consul integration for service discovery and health checking. Check out the full source code at &lt;a href="https://github.com/alexdzyoba/webkv"&gt;github.com/alexdzyoba/webkv&lt;/a&gt;. Next time we’ll add metrics export to monitor our service with Prometheus.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
