<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pradeep Chhetri</title>
    <description>The latest articles on DEV Community by Pradeep Chhetri (@p_chhetri).</description>
    <link>https://dev.to/p_chhetri</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F177313%2Fb6dc5e92-66e2-47b6-9e5e-6a094999a719.jpg</url>
      <title>DEV Community: Pradeep Chhetri</title>
      <link>https://dev.to/p_chhetri</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/p_chhetri"/>
    <language>en</language>
    <item>
      <title>CSE138 Lecture 2 Notes</title>
      <dc:creator>Pradeep Chhetri</dc:creator>
      <pubDate>Sun, 19 Apr 2020 06:07:41 +0000</pubDate>
      <link>https://dev.to/p_chhetri/cse138-lecture-2-notes-53md</link>
      <guid>https://dev.to/p_chhetri/cse138-lecture-2-notes-53md</guid>
      <description>&lt;h3&gt;
  
  
  CSE138 Lecture 2 Notes
&lt;/h3&gt;

&lt;h3&gt;
  
  
  What is a Distributed System ?
&lt;/h3&gt;

&lt;p&gt;According to Leslie Lamport:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A system where I can’t get my work done because some computer which I never heard of crashed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;According to Martin Kleppmann:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A system running on several nodes and characterised by &lt;em&gt;partial failures.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  What is Partial Failure ?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Machine crashing&lt;/li&gt;
&lt;li&gt;Network failure&lt;/li&gt;
&lt;li&gt;Messages being dropped&lt;/li&gt;
&lt;li&gt;Software misbehaviour&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Part of the computation happens while another part doesn’t.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the event of such a partial failure, you don’t want the whole system to degrade. Your system must continue working.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cloud Computing vs HPC Philosophy
&lt;/h4&gt;

&lt;p&gt;HPC Philosophy:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Treat partial failure as total failure i.e. if something does fail, the computation starts over completely from scratch. To avoid losing all progress, this relies on checkpointing (every so often the progress is saved, and if a failure happens, the system rolls back to the last checkpoint).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Cloud Computing Philosophy:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It involves working around such partial failures and expecting those kinds of failures.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Two Nodes Scenario
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JwuHNr0Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AEgwnP-Wwr3id7Y4ZNtgTdg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JwuHNr0Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AEgwnP-Wwr3id7Y4ZNtgTdg.png" alt=""&gt;&lt;/a&gt;System of two machines&lt;/p&gt;

&lt;p&gt;Possible Failure Scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request from &lt;em&gt;Machine A&lt;/em&gt; gets lost: maybe someone removed the cable.&lt;/li&gt;
&lt;li&gt;Request from &lt;em&gt;Machine A&lt;/em&gt; is slow and &lt;em&gt;Machine B&lt;/em&gt; never receives it: network congestion or some sort of message queue either on &lt;em&gt;Machine A&lt;/em&gt; side or &lt;em&gt;Machine B&lt;/em&gt; side. &lt;em&gt;Machine A&lt;/em&gt; thought that it sent the request but &lt;em&gt;Machine B&lt;/em&gt; never received it.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Machine B&lt;/em&gt; crashed: &lt;em&gt;Machine A&lt;/em&gt; sent the message and &lt;em&gt;Machine B&lt;/em&gt; received the message and crashed.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Machine B&lt;/em&gt; is slow in processing.&lt;/li&gt;
&lt;li&gt;Response from &lt;em&gt;Machine B&lt;/em&gt; is slow and &lt;em&gt;Machine A&lt;/em&gt; never receives it.&lt;/li&gt;
&lt;li&gt;Response from &lt;em&gt;Machine B&lt;/em&gt; gets lost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Does &lt;em&gt;Machine A&lt;/em&gt; have any way to distinguish between these situations ? No.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you send a request to another machine and don’t receive a response, it is impossible to know WHY (without having global knowledge of the system).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Other possible failure scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Machine B&lt;/em&gt; is lying or refusing to answer.&lt;/li&gt;
&lt;li&gt;Cosmic rays flipping bits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These kinds of scenarios are called &lt;a href="https://en.wikipedia.org/wiki/Byzantine_fault"&gt;Byzantine Faults&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;How do real systems deal with it ?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a machine sends a message, it needs some sort of timeout: if the message gets no response within that time, give up and assume failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why might it be a mistake to assume failure due to a timeout ?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iPHR7UVj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ArNBjv_0AY7y3LsVJmkHD0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iPHR7UVj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ArNBjv_0AY7y3LsVJmkHD0w.png" alt=""&gt;&lt;/a&gt;System of two machines&lt;/p&gt;

&lt;p&gt;In this case, there is a possibility that the value of x gets incremented twice: &lt;em&gt;Machine A&lt;/em&gt; asked &lt;em&gt;Machine B&lt;/em&gt; to increment x a second time because it never got the &lt;em&gt;ok&lt;/em&gt; for the first message and the timeout expired.&lt;/p&gt;
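&lt;p&gt;This retry hazard can be sketched in a few lines of Python (a toy model for illustration, not real networking code): if only the acknowledgement is lost, the retry makes the non-idempotent increment happen twice.&lt;/p&gt;

```python
class MachineB:
    """Toy server: increments x on request and replies 'ok'."""
    def __init__(self):
        self.x = 0

    def handle(self, msg):
        if msg == "increment x":
            self.x += 1
        return "ok"

def send_with_retry(server, msg, ack_lost=True):
    """Toy Machine A: resends the request when the ack is lost.
    The server already processed the first request, so the
    retry applies the increment a second time."""
    server.handle(msg)       # first attempt: processed, but...
    if ack_lost:             # ...the 'ok' never makes it back
        server.handle(msg)   # timeout fires, A retries
    return "ok"

b = MachineB()
send_with_retry(b, "increment x")
print(b.x)  # 2 -- incremented twice although A intended once
```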

&lt;p&gt;Let’s assume the following:&lt;/p&gt;

&lt;p&gt;Maximum delay between &lt;em&gt;Machine A&lt;/em&gt; and &lt;em&gt;Machine B (and vice-versa)&lt;/em&gt; is  &lt;strong&gt;&lt;em&gt;d&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Maximum time processing a request is  &lt;strong&gt;&lt;em&gt;r&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How long should &lt;em&gt;Machine A&lt;/em&gt; wait ? &lt;strong&gt;&lt;em&gt;2d + r&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This rules out the uncertainty due to slowness, but still leaves other kinds of uncertainty to deal with. Furthermore, most of the time we don’t have this sort of guarantee on the maximum delay; we can only make assumptions about it.&lt;/p&gt;

&lt;p&gt;According to Prof. Peter Alvaro:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In distributed systems, not only do we have to deal with partial failures but we also have to deal with unbounded latency.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Why do we want to have a distributed system ?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Data too big to fit on a single machine.&lt;/li&gt;
&lt;li&gt;You want things to be faster even though the data can fit on a single machine.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Time and Clocks
&lt;/h3&gt;

&lt;h4&gt;
  
  
  What are clocks for ?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Mark points in time&lt;/em&gt;: E.g. this item in my browser cache will expire on April 10, 2020 at 08:00 hours.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Durations or intervals of time:&lt;/em&gt; E.g. this user spent 4 minutes and 55 seconds on our website.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Computers have two types of clocks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;time-of-day clocks&lt;/em&gt;: tell you what time it is. They are usually synchronised between machines using NTP (Network Time Protocol). They are bad for measuring &lt;em&gt;durations or intervals&lt;/em&gt; because time-of-day clocks can jump backward, e.g. due to daylight saving time or leap seconds. They are okayish (but not great) for &lt;em&gt;marking points in time&lt;/em&gt; because clock synchronisation is only so accurate, and we often need finer-grained resolution to prevent certain kinds of bugs. Hence we aren’t going to use them much in distributed systems.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;monotonic clocks:&lt;/em&gt; they only ever move forward, i.e. they’re a certain kind of counter; maybe it counts the milliseconds since the machine restarted. In Python, the &lt;a href="https://docs.python.org/3/library/time.html"&gt;time&lt;/a&gt; module gives you such a counter via time.monotonic(). It’s completely useless for &lt;em&gt;marking points in time&lt;/em&gt; but it’s good for measuring &lt;em&gt;durations or intervals of time.&lt;/em&gt; We can use these types of clocks to implement &lt;em&gt;timeouts.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
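&lt;p&gt;A minimal sketch of a timeout built on the monotonic clock, using the &lt;em&gt;2d + r&lt;/em&gt; bound from above (the values of d and r and the polling loop are illustrative assumptions, not real networking code):&lt;/p&gt;

```python
import time

def wait_for_response(poll, d=0.2, r=0.1):
    """Wait up to 2d + r seconds for a response.
    time.monotonic() only ever moves forward, so the deadline
    is immune to NTP adjustments and leap seconds.
    `poll` returns the response, or None if nothing arrived yet."""
    deadline = time.monotonic() + 2 * d + r
    while deadline > time.monotonic():
        resp = poll()
        if resp is not None:
            return resp
        time.sleep(0.01)
    return None  # timed out: assume failure (possibly wrongly!)

print(wait_for_response(lambda: "ok"))              # ok
print(wait_for_response(lambda: None, 0.01, 0.01))  # None
```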

&lt;p&gt;Check out the &lt;a href="https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/"&gt;Cloudflare blog on the leap second&lt;/a&gt;. They tried implementing timeouts using the time-of-day clock.&lt;/p&gt;

&lt;p&gt;Both of these kinds of clocks are &lt;strong&gt;physical clocks&lt;/strong&gt;, but in distributed systems we need a different notion of clocks: &lt;strong&gt;logical clocks.&lt;/strong&gt; Logical clocks don’t measure the time of day or elapsed time; instead they only measure the ordering of events (which event happened before another).&lt;/p&gt;

&lt;p&gt;Suppose A happened before B.&lt;/p&gt;

&lt;p&gt;A — — — → B&lt;/p&gt;

&lt;p&gt;What does it tell us ?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;em&gt;could&lt;/em&gt; have caused B.&lt;/li&gt;
&lt;li&gt;B &lt;em&gt;could not&lt;/em&gt; have caused A.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This notion of potential causality is very important. Why ?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debugging: Figuring out possible causes of bug.&lt;/li&gt;
&lt;li&gt;Designing systems.&lt;/li&gt;
&lt;/ul&gt;
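&lt;p&gt;As a concrete (if simplified) illustration, here is a Lamport-style logical clock in Python, sketched as an addition to these notes: it counts events rather than seconds, so if A happened before B, then B’s timestamp is greater than A’s.&lt;/p&gt;

```python
class LamportClock:
    """Logical clock: measures ordering of events, not elapsed time."""
    def __init__(self):
        self.t = 0

    def local_event(self):
        self.t += 1
        return self.t

    def send(self):
        self.t += 1
        return self.t            # timestamp piggybacked on the message

    def receive(self, msg_t):
        # jump ahead of the sender's timestamp, then tick
        self.t = max(self.t, msg_t) + 1
        return self.t

a, b = LamportClock(), LamportClock()
t_send = a.send()           # event on machine A
t_recv = b.receive(t_send)  # causally later event on machine B
print(t_send, t_recv)       # 1 2 -- the receive is ordered after the send
```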

&lt;h4&gt;
  
  
  Resources:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=G0wpsacaYpE"&gt;Lecture Video&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Thank you Prof. Lindsey Kuper for keeping the lectures online.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Extending Python with Rust</title>
      <dc:creator>Pradeep Chhetri</dc:creator>
      <pubDate>Wed, 01 May 2019 17:37:44 +0000</pubDate>
      <link>https://dev.to/p_chhetri/extending-python-with-rust-4pna</link>
      <guid>https://dev.to/p_chhetri/extending-python-with-rust-4pna</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---EbjGIA9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2Am9mLGeyjmV7Flq-t5MA62g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---EbjGIA9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2Am9mLGeyjmV7Flq-t5MA62g.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Introduction:
&lt;/h3&gt;

&lt;p&gt;Python is a great programming language but sometimes it can be a bit of a slowcoach when it comes to performing certain tasks. That’s why developers have been &lt;a href="https://docs.python.org/3/extending/building.html"&gt;building C/C++ extensions&lt;/a&gt; and integrating them with Python to speed up performance. However, writing these extensions is a bit difficult because these low-level languages are not memory-safe, so they don’t guarantee defined behavior. This tends to introduce bugs with respect to memory management. Rust ensures memory safety and hence can easily prevent these kinds of bugs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Slow Python Scenario:
&lt;/h3&gt;

&lt;p&gt;One of the many cases where Python is slow is building up large strings. In Python, the string object is immutable. Each time a string is assigned to a variable, a new object is created in memory to represent the new value. This contrasts with languages like Perl, where a string variable can be modified in place. That’s why the common operation of constructing a long string out of several short segments is not very efficient in Python. Each time you append to the end of a string, the Python interpreter must allocate a new string object and copy the contents of both the existing string and the appended string into it. As the string under manipulation becomes large, this process becomes increasingly slow.&lt;/p&gt;

&lt;p&gt;Problem: Write a function which accepts a positive integer as argument and returns a string concatenating a series of integers from zero to that integer.&lt;/p&gt;

&lt;p&gt;So let’s try solving the above problem in python and see if we can improve the performance by extending it via Rust.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python Implementations:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Method I: Naive appending
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This is the most obvious approach: using the concatenation operator (+=) to append each segment to the string.&lt;/p&gt;
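&lt;p&gt;The embedded gist no longer renders, but the naive version presumably looks something like this (the function name and exact range are my reconstruction):&lt;/p&gt;

```python
def concat_naive(n):
    """Append each integer with +=; every += allocates a new string."""
    result = ""
    for i in range(n):
        result += str(i)
    return result

print(concat_naive(5))  # 01234
```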

&lt;h4&gt;
  
  
  Method II: Build a list of strings and then join them
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This approach is commonly suggested as a very pythonic way to do string concatenation. First a list is built containing each of the component strings, then in a single join operation a string is constructed containing all of the list elements appended together.&lt;/p&gt;
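&lt;p&gt;A sketch of this approach (again a reconstruction of the missing gist):&lt;/p&gt;

```python
def concat_join(n):
    """Collect the pieces in a list, then join them in one pass."""
    pieces = []
    for i in range(n):
        pieces.append(str(i))
    return "".join(pieces)

print(concat_join(5))  # 01234
```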

&lt;h4&gt;
  
  
  Method III: List comprehensions
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This version is extremely compact and is also pretty understandable: create a list of numbers using a list comprehension and then join them all together. This is just an abbreviated version of the last approach, and it consumes pretty much the same amount of memory.&lt;/p&gt;
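&lt;p&gt;A sketch of the list-comprehension version (reconstruction):&lt;/p&gt;

```python
def concat_comprehension(n):
    """Build the list of strings inline and join it."""
    return "".join([str(i) for i in range(n)])

print(concat_comprehension(5))  # 01234
```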

&lt;p&gt;Let’s measure the performance of each of these three approaches and see which one wins. We are going to do this using &lt;a href="https://pypi.org/project/pytest-benchmark/"&gt;pytest-benchmark&lt;/a&gt; module.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Here are the results of the above benchmarks. The lower the value, the better the approach.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Just by looking at the &lt;strong&gt;Mean&lt;/strong&gt; column, one can easily see that the list-comprehension approach is definitely the winner among the three.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rust Implementations:
&lt;/h3&gt;

&lt;p&gt;After trying out a basic implementation of the above problem in Rust, and doing some rough benchmarking using &lt;a href="https://github.com/rust-lang/cargo/blob/master/src/doc/man/cargo-bench.adoc"&gt;cargo-bench&lt;/a&gt;, the results definitely looked promising. Hence, I decided to port the Rust implementation as a shared library using the &lt;a href="https://github.com/dgrunwald/rust-cpython"&gt;rust-cpython&lt;/a&gt; project and call it from a Python program.&lt;/p&gt;

&lt;p&gt;To achieve this, I had to create a Rust crate whose src/lib.rs exposes the function to Python.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Building the above crate creates a &lt;strong&gt;.dylib&lt;/strong&gt; file (on macOS), which needs to be renamed to &lt;strong&gt;.so&lt;/strong&gt; so that Python can import it.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Then we ran the same benchmark as before, this time including the Rust implementation.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This time the result is more interesting.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The Rust extension is definitely the winner. As you increase the number of iterations further, the results are even more promising.&lt;/p&gt;

&lt;p&gt;E.g. for iterations = 1000, the benchmark results were as follows.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  Code:
&lt;/h3&gt;

&lt;p&gt;You can find the code used in the post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/chhetripradeep/rust-python-example"&gt;https://github.com/chhetripradeep/rust-python-example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/chhetripradeep/cargo-bench-example"&gt;https://github.com/chhetripradeep/cargo-bench-example&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion:
&lt;/h3&gt;

&lt;p&gt;I am very new to Rust but these results definitely inspire me to learn more Rust. If you know of a better implementation of the above problem in Rust, do let me know.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PyO3/pyo3"&gt;PyO3&lt;/a&gt; started a fork of &lt;a href="https://github.com/dgrunwald/rust-cpython"&gt;rust-cpython&lt;/a&gt; but definitely has lot more active development and hence on my todo-list of experimentation.&lt;/p&gt;

&lt;p&gt;Distributing your Python module will require the Rust extension to be compiled on the target system because of architecture variations. &lt;a href="https://github.com/getsentry/milksnake"&gt;Milksnake&lt;/a&gt; is an extension of &lt;a href="https://pypi.org/project/setuptools/"&gt;python-setuptools&lt;/a&gt; that allows you to distribute dynamically linked libraries in Python wheels in the most portable way imaginable.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>python3</category>
      <category>rust</category>
      <category>python</category>
    </item>
    <item>
      <title>Cassandra 202: Snitches</title>
      <dc:creator>Pradeep Chhetri</dc:creator>
      <pubDate>Wed, 01 May 2019 15:40:41 +0000</pubDate>
      <link>https://dev.to/p_chhetri/cassandra-202-snitches-4ake</link>
      <guid>https://dev.to/p_chhetri/cassandra-202-snitches-4ake</guid>
      <description>&lt;h3&gt;
  
  
  Introduction:
&lt;/h3&gt;

&lt;p&gt;A snitch is a component that determines the network topology of the whole cassandra cluster. It provides the translation from a node’s IP address to the datacenter &amp;amp; rack it belongs to. This ensures that data is placed in such a way that the cluster can handle rack- or datacenter-level outages.&lt;/p&gt;

&lt;p&gt;To improve the resiliency of our cassandra cluster, we decided to move from &lt;a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archSnitchSimple.html" rel="noopener noreferrer"&gt;SimpleSnitch&lt;/a&gt;, which is the default snitch, to &lt;a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archsnitchGossipPF.html" rel="noopener noreferrer"&gt;GossipingPropertyFileSnitch&lt;/a&gt;, the recommended snitch for production-grade clusters. While the latter is a rack- and datacenter-aware snitch, the former doesn’t recognise any of this information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Facts:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Every node in the cluster must be configured to use same snitch type.&lt;/li&gt;
&lt;li&gt;Since SimpleSnitch assigns every node to rack1 in datacenter1, you can only migrate from SimpleSnitch to GossipingPropertyFileSnitch &lt;em&gt;first&lt;/em&gt;. None of the other snitches like Ec2Snitch or GoogleCloudSnitch are compatible with SimpleSnitch. Migrating to an incompatible snitch &lt;em&gt;directly&lt;/em&gt; can cause data loss.&lt;/li&gt;
&lt;li&gt;Cassandra &lt;em&gt;doesn’t&lt;/em&gt; allow changing the rack or datacenter of a node which already has data in it. Hence the only option in such a case is to first decommission the node and then bootstrap it again.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SimpleSnitch to GPFS Migration:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  I. Change the Snitch configuration of the current cluster
&lt;/h4&gt;

&lt;p&gt;Let’s say you have five nodes in your cluster configured with SimpleSnitch. You can visualize your cluster like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F382%2F1%2AuY89Nh2SMhxjjd0wu9-CFA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F382%2F1%2AuY89Nh2SMhxjjd0wu9-CFA.png"&gt;&lt;/a&gt;5 nodes configured in simplesnitch&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Stop all nodes of the current cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Since GPFS refers to cassandra-rackdc.properties to infer the rack and datacenter of a node, update them in each node as follows&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dc=datacenter1
rack=rack1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Update the snitch type in each node in cassandra.yaml
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;endpoint\_snitch: GossipingPropertyFileSnitch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Start all nodes of the current cluster.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F382%2F1%2AuY89Nh2SMhxjjd0wu9-CFA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F382%2F1%2AuY89Nh2SMhxjjd0wu9-CFA.png"&gt;&lt;/a&gt;5 nodes configured in GPFS&lt;/p&gt;

&lt;h4&gt;
  
  
  II. Update all cassandra clients to be DCAware
&lt;/h4&gt;

&lt;p&gt;Before adding the new datacenter, we need to fulfill these prerequisites:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make sure that clients query the existing datacenter datacenter1 cassandra nodes only:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Ensure clients are configured to use DCAwareRoundRobinPolicy&lt;/li&gt;
&lt;li&gt;Ensure clients are pointing to datacenter1&lt;/li&gt;
&lt;li&gt;Change consistency level from QUORUM to LOCAL_QUORUM&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Update all cassandra keyspaces to use NetworkTopologyStrategy from SimpleStrategy
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER KEYSPACE sample\_keyspace WITH REPLICATION =
{ 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You need to do this for all system-related keyspaces and user-created keyspaces, except a few system keyspaces whose configuration you can’t change.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Altering the keyspace replication settings doesn’t actually move any existing data. It only affects new reads/writes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  III. Add the new datacenter
&lt;/h4&gt;

&lt;p&gt;Now start the nodes of the new datacenter, making sure that they all have the same cluster_name configuration as the current cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F763%2F1%2AwRwXEr20YbKJ6gQ5Be-nug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F763%2F1%2AwRwXEr20YbKJ6gQ5Be-nug.png"&gt;&lt;/a&gt;5 nodes configured in GPFS in two datacenters without knowing each other&lt;/p&gt;

&lt;p&gt;To make sure that cassandra nodes in one datacenter can see the nodes of the other datacenter, add the seed nodes of the new datacenter in all of the old datacenter’s nodes configuration and restart them. Similarly, add the seed nodes of the old datacenter in all of the new datacenter’s nodes configuration and restart them. It is always recommended to use seeds from all datacenters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F757%2F1%2AE_Jaq_jw-TDNMs6DK60YoA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F757%2F1%2AE_Jaq_jw-TDNMs6DK60YoA.png"&gt;&lt;/a&gt;5 nodes configured in GPFS in two datacenters communicating with each other&lt;/p&gt;

&lt;p&gt;Once this is done, you will notice that nodetool status shows both datacenters in its output.&lt;/p&gt;

&lt;p&gt;Although the new datacenter nodes have joined the cluster, they still don’t have any data. To replicate every keyspace from the old datacenter’s nodes to the new datacenter’s nodes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Update every keyspace, adding the expected replica count for the newer datacenter.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER KEYSPACE sample\_keyspace WITH REPLICATION =
{ 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3, 'datacenter2' : 3 };
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can specify a replica count of 1 for the new datacenter to make the rebuild faster.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you specify a replica count of 1 for the new datacenter and change it to, let’s say, 3 later, you’ll need to run nodetool repair -full with the -dc option to repair nodes only in the new datacenter. This may increase the overall time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;Rebuild each node in the new datacenter.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nodetool rebuild -- _datacenter1_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run the rebuild on one or more nodes at the same time, but we suggest running it on only one node at a time. Running on one node at a time reduces the impact on the older datacenter. In our case, running it concurrently on multiple nodes caused out-of-memory issues for cassandra.&lt;/p&gt;

&lt;h4&gt;
  
  
  IV. Verify the newer datacenter is in sync with older datacenter
&lt;/h4&gt;

&lt;h4&gt;
  
  
  V. Remove references of older datacenter
&lt;/h4&gt;

&lt;p&gt;Before starting the decommissioning process for the older datacenter, we need to fulfill these prerequisites:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make sure that clients are pointing to the newer datacenter datacenter2 cassandra nodes only.&lt;/li&gt;
&lt;li&gt;Run a full repair with nodetool repair -full to ensure that all data is propagated from the datacenter being decommissioned. You need to run it on each of the nodes of the older datacenter (one node at a time).&lt;/li&gt;
&lt;li&gt;Update every keyspace removing the older datacenter datacenter1 from replication configuration.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER KEYSPACE sample\_keyspace WITH REPLICATION =
{ 'class' : 'NetworkTopologyStrategy', 'datacenter2' : 3 };
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  VI. Decommission the older datacenter
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Run nodetool decommission on every node of the older datacenter.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  References:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html" rel="noopener noreferrer"&gt;Cassandra: Updating the replication factor&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cassandra</category>
      <category>aws</category>
    </item>
    <item>
      <title>Running a modern infrastructure on Kubernetes</title>
      <dc:creator>Pradeep Chhetri</dc:creator>
      <pubDate>Fri, 09 Feb 2018 01:57:58 +0000</pubDate>
      <link>https://dev.to/p_chhetri/running-a-modern-infrastructure-on-kubernetes-1end</link>
      <guid>https://dev.to/p_chhetri/running-a-modern-infrastructure-on-kubernetes-1end</guid>
      <description>&lt;p&gt;At StashAway, we have been running Docker containers from the very first day. Initially, we were using Rancher as the container orchestrator, but as we grew, we decided to switch to Kubernetes (k8s) — mainly because of its rapidly growing ecosystem and wide adoption.&lt;/p&gt;

&lt;p&gt;This post describes how we use k8s and its tooling stack to run our application in a production-grade environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L2dPRExq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AG9OVxt8KHclvgfedgmyV-Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L2dPRExq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AG9OVxt8KHclvgfedgmyV-Q.png" alt=""&gt;&lt;/a&gt;Google trends graph showing how the interest on various container schedulers changed over time.&lt;/p&gt;

&lt;p&gt;All applications, whether stateless or stateful, need an environment with these fundamental necessities built in:&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Discovery
&lt;/h3&gt;

&lt;p&gt;Containers are elastic in nature: they can come up and go down at any time. Since each container gets a dynamic IP address, registration of each container instance is a must, so that others can communicate with it. Kubernetes supports two modes of discovery: environment variables and DNS-based.&lt;/p&gt;

&lt;p&gt;If you are, for example, running Cassandra inside a container, its IP address will be available both as an environment variable &lt;em&gt;CASSANDRA_HOST&lt;/em&gt; and as a domain name &lt;em&gt;cassandra.default.svc.cluster.local&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;DNS-based service discovery is the more popular of the two, but special care needs to be taken since some DNS client libraries set high DNS cache TTL values. E.g. the &lt;a href="https://docs.oracle.com/javase/8/docs/technotes/guides/net/properties.html"&gt;JVM caches domain names forever by default&lt;/a&gt;.&lt;/p&gt;
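&lt;p&gt;A small sketch of consuming both discovery modes from application code (the names are illustrative: for a service named cassandra, Kubernetes actually injects variables following the SERVICENAME_SERVICE_HOST convention, and the DNS name is resolvable only inside the cluster):&lt;/p&gt;

```python
import os

def cassandra_address():
    """Prefer the environment variable injected at pod start-up,
    falling back to the cluster-internal DNS name (which also works
    for services created after this pod started)."""
    host = os.environ.get("CASSANDRA_SERVICE_HOST")
    if host:
        return host
    return "cassandra.default.svc.cluster.local"

print(cassandra_address())
```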

&lt;h3&gt;
  
  
  Service Addressing
&lt;/h3&gt;

&lt;p&gt;DNS-based names for discovery as shown above are only resolvable from inside the Kubernetes cluster. In order to address a service from outside the cluster, it needs an automatic DNS registration to a third party DNS provider such as AWS Route53, Google CloudDNS, AzureDNS, CloudFlare. To mitigate against the dependency on a single DNS provider, you should consider hosting your zones on multiple providers.&lt;/p&gt;

&lt;p&gt;In the k8s world, this can be achieved easily via &lt;a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/"&gt;annotations&lt;/a&gt; using &lt;a href="https://github.com/kubernetes-incubator/external-dns"&gt;ExternalDNS&lt;/a&gt;. This incubator project takes care of registering a new (sub-)domain as soon as any new k8s service or ingress controller is created. It is also aware of the records it manages via an extra TXT record alongside the primary A record, hence preventing any accidental overwriting of existing records.&lt;/p&gt;

&lt;h3&gt;
  
  
  Routing
&lt;/h3&gt;

&lt;p&gt;With lots of containers and services popping in and out of existence, routing external traffic to healthy containers is challenging. &lt;a href="https://kubernetes.io/docs/concepts/services-networking/ingress/"&gt;Kubernetes Ingress&lt;/a&gt; is the saviour. It provides load balancing, SSL termination and even name based routing. Ingress is just an abstraction layer which can use any software load-balancer as its implementation.&lt;/p&gt;

&lt;p&gt;Current Ingress controller implementations include &lt;a href="https://github.com/kubernetes/ingress-nginx"&gt;Nginx Ingress&lt;/a&gt; (Nginx based), &lt;a href="https://github.com/appscode/voyager"&gt;Voyager&lt;/a&gt; (HAProxy based) and &lt;a href="https://github.com/heptio/contour"&gt;Contour&lt;/a&gt; (EnvoyProxy based). The first is the most mature, and it is what we use (along with an ELB) for all our traffic routing, but it provides only HTTP-based routing. For TCP-based routing, you’ll need Voyager. Contour is very interesting since it brings all the benefits of Envoy, a service proxy designed specifically for modern cloud-native applications: it has first-class support for gRPC and provides features like circuit breaking which are not available in standard load-balancers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring
&lt;/h3&gt;

&lt;p&gt;An application is defined collectively by many Kubernetes objects, such as pods, services and ingresses, so it is important to monitor the state of each of them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt; is definitely the right choice available in open-source to monitor your Kubernetes apps and cluster. It has an inbuilt &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#&amp;lt;kubernetes_sd_config"&gt;discovery&lt;/a&gt; for these k8s objects. Since monitoring without alerting is useless, &lt;a href="https://github.com/prometheus/alertmanager"&gt;Alertmanager&lt;/a&gt; perfectly fills the gap by providing nice integrations like Slack notifications.&lt;/p&gt;

&lt;p&gt;Many people use Prometheus along with &lt;a href="https://github.com/kubernetes/heapster"&gt;Heapster&lt;/a&gt;, which can be integrated with many open-source monitoring solutions such as InfluxDB and Riemann. Those who want fine-grained container-level metrics can add &lt;a href="https://github.com/google/cadvisor"&gt;cAdvisor&lt;/a&gt; to their monitoring stack, too.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logging
&lt;/h3&gt;

&lt;p&gt;When you are running multiple instances of the same image, you can’t afford to log in to each container and tail its logs. Each k8s node needs to run an agent that ships these container logs. Perhaps surprisingly, in the k8s world the EFK stack is more popular than the ELK stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://fluentbit.io/"&gt;Fluent-bit&lt;/a&gt; is a lightweight (alternative to &lt;a href="https://www.fluentd.org/"&gt;Fluentd&lt;/a&gt;) and is a fully Docker- and Kubernetes-aware agent which can be used to push these logs directly to Elasticsearch. It automatically adds kubernetes labels and annotations in each log line. You can also integrate it with Slack for sending notifications in case of any error/exception.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying
&lt;/h3&gt;

&lt;p&gt;We bundle each of our applications, along with its dependencies, as a Docker image. To keep the deployment pipeline clean, it’s very important to templatize the manifests. &lt;a href="https://helm.sh/"&gt;Helm&lt;/a&gt;, the package manager for k8s, is a great way to deploy apps on Kubernetes. There are many community-managed Helm &lt;a href="https://github.com/kubernetes/charts"&gt;charts&lt;/a&gt; which are stable and ready to be used in production.&lt;/p&gt;
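&lt;p&gt;The essence of templatizing with Helm, sketched with illustrative names: values.yaml holds the knobs, and the chart templates interpolate them:&lt;/p&gt;

```yaml
# values.yaml
replicaCount: 3
image:
  repository: registry.example.com/api   # placeholder registry
  tag: "1.4.2"

# templates/deployment.yaml (excerpt, shown here as comments)
#   spec:
#     replicas: {{ .Values.replicaCount }}
#       ...
#         image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```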

&lt;p&gt;Since Helm doesn’t provide a neat way to store secrets, we use &lt;a href="http://docs.ansible.com/ansible/2.4/vault.html"&gt;Ansible Vault&lt;/a&gt; as the source of truth for them. We trigger the helm command-line from Ansible using the &lt;a href="https://gist.github.com/chhetripradeep/12cd9f0b94572cede89e18502b84ced1"&gt;ansible-helm&lt;/a&gt; module.&lt;/p&gt;

&lt;p&gt;One of the pain points of Helm is that someone has to write these charts by first understanding each of the YAML fields. &lt;a href="https://ksonnet.io/"&gt;Ksonnet&lt;/a&gt; aims to &lt;a href="https://blog.heptio.com/the-next-chapter-for-ksonnet-1dcbbad30cb"&gt;remove this pain&lt;/a&gt; by dynamically generating Helm charts on demand.&lt;/p&gt;

&lt;h3&gt;
  
  
  SSL Certificates
&lt;/h3&gt;

&lt;p&gt;The vast majority of modern applications expose an HTTP endpoint. To secure these HTTP services, the basic requirement is to install an SSL certificate and enable encrypted communication.&lt;/p&gt;

&lt;p&gt;Provisioning, installing &amp;amp; updating these certificates can become cumbersome if not automated properly.&lt;/p&gt;

&lt;p&gt;Automatic provisioning of &lt;a href="https://letsencrypt.org/"&gt;Let’s Encrypt&lt;/a&gt; certificates for k8s ingresses can be done via &lt;a href="https://github.com/PalmStoneGames/kube-cert-manager"&gt;kube-cert-manager&lt;/a&gt;. We chose it over &lt;a href="https://github.com/jetstack/kube-lego"&gt;kube-lego&lt;/a&gt; since it supports Let’s Encrypt DNS-based validation challenges, which means it can also issue certs for applications hosted on a private network. It takes care of renewing these certificates, too.&lt;/p&gt;
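&lt;p&gt;Whichever tool you pick, what it ultimately manages is a standard TLS Secret that an Ingress can reference in its tls section (the values here are truncated placeholders):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: app-example-tls
type: kubernetes.io/tls
data:
  tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0t...   # base64 certificate (truncated)
  tls.key: LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0t...   # base64 private key (truncated)
```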

&lt;p&gt;The JetStack folks are developing another tool named &lt;a href="https://github.com/jetstack/cert-manager"&gt;cert-manager&lt;/a&gt; which is pretty interesting, since it will soon be able to use &lt;a href="https://www.vaultproject.io/"&gt;Hashicorp Vault&lt;/a&gt; as a certificate authority.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In this post, we talked about why you should choose Kubernetes for your next project. We will go in depth into a few of these topics in our upcoming blog posts.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We are constantly on the lookout for great tech talent to join our team — &lt;/em&gt;&lt;a href="https://www.stashaway.sg"&gt;&lt;em&gt;visit our website&lt;/em&gt;&lt;/a&gt; &lt;em&gt;to learn more and feel free to reach out to us!&lt;/em&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>containers</category>
      <category>docker</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Impact of Meltdown fix on AWS</title>
      <dc:creator>Pradeep Chhetri</dc:creator>
      <pubDate>Sat, 13 Jan 2018 13:46:38 +0000</pubDate>
      <link>https://dev.to/p_chhetri/impact-of-meltdown-fix-on-aws-ki8</link>
      <guid>https://dev.to/p_chhetri/impact-of-meltdown-fix-on-aws-ki8</guid>
      <description>&lt;p&gt;On 3rd January, Google Project Zero Team &lt;a href="https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html" rel="noopener noreferrer"&gt;disclosed&lt;/a&gt; about the two hardware vulnerabilities: &lt;a href="https://meltdownattack.com/meltdown.pdf" rel="noopener noreferrer"&gt;Meltdown&lt;/a&gt; and &lt;a href="https://spectreattack.com/spectre.pdf" rel="noopener noreferrer"&gt;Spectre&lt;/a&gt;. Whereas Meltdown is specific to Intel processors, Spectre affects almost all modern processors.&lt;/p&gt;

&lt;p&gt;As soon as they were disclosed, all of the cloud providers started working on patching the hypervisors with the fix. In this post, we will talk about how AWS handled the same.&lt;/p&gt;

&lt;p&gt;AWS instances are broadly classified into two categories: &lt;a href="https://en.wikipedia.org/wiki/Paravirtualization" rel="noopener noreferrer"&gt;PVM&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Hardware-assisted_virtualization" rel="noopener noreferrer"&gt;HVM&lt;/a&gt;. While HVM hypervisors were patched online without affecting any running instances, AWS notified customers to reboot their PVM instances before 6th January.&lt;/p&gt;

&lt;p&gt;We noticed significantly increased CPU utilisation across almost all of our instance groups.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2ACclTRJnjEt24auzmzc0UMg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2ACclTRJnjEt24auzmzc0UMg.png"&gt;&lt;/a&gt;Increased CPU Utilisation on our 3-node Cassandra Cluster running on r4.large instances&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;{1}&lt;/strong&gt; By 4th January, &lt;a href="https://aws.amazon.com/security/security-bulletins/AWS-2018-013/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt; had patched the hypervisor with &lt;a href="https://lwn.net/Articles/741878/" rel="noopener noreferrer"&gt;Kernel Page Table Isolation (KPTI)&lt;/a&gt;, which caused a &amp;gt; 100% increase in CPU utilisation. Several Cassandra consulting and managed-hosting companies noticed &lt;a href="https://twitter.com/BenBromhead/status/950245250504601600" rel="noopener noreferrer"&gt;the&lt;/a&gt; &lt;a href="http://thelastpickle.com/blog/2018/01/10/meltdown-impact-on-latency.html" rel="noopener noreferrer"&gt;same&lt;/a&gt;. The performance impact of the KPTI mitigation depends largely on the system calls made by the application, so it varies from workload to workload.&lt;/p&gt;
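&lt;p&gt;Because KPTI adds overhead to every user-to-kernel transition, a quick syscall micro-benchmark gives a rough feel for how exposed a workload is. A minimal sketch in Python (the absolute number is meaningless on its own; compare it before and after patching):&lt;/p&gt;

```python
import os
import time

def syscalls_per_second(n=100_000):
    """Rough syscall-throughput gauge; KPTI taxes every
    user-to-kernel transition, so this drops on patched kernels."""
    start = time.perf_counter()
    for _ in range(n):
        os.getpid()  # a cheap syscall (modern glibc no longer caches getpid)
    return n / (time.perf_counter() - start)

if __name__ == "__main__":
    print(f"{syscalls_per_second():,.0f} getpid calls/sec")
```

&lt;p&gt;Syscall-heavy workloads such as databases and proxies tend to see the largest regression, which is consistent with the Cassandra and RDS graphs above.&lt;/p&gt;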

&lt;p&gt;&lt;strong&gt;{2}&lt;/strong&gt; On 12th January, AWS rolled out something which brought the performance impact back down to pre-Meltdown-patch levels, although AWS hasn’t disclosed any details about it yet.&lt;/p&gt;

&lt;p&gt;We noticed something similar on our RDS instance too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A2oDHNPgOFXs1nGwfOUstUw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A2oDHNPgOFXs1nGwfOUstUw.png"&gt;&lt;/a&gt;Increased CPU Utilisation on our RDS instance&lt;/p&gt;

&lt;p&gt;The AWS patch protects against instance-to-instance concerns (one instance reading the memory of another) and instance-to-hypervisor concerns (an instance reading hypervisor memory). AWS still recommends that all customers upgrade their instance kernels to mitigate any process-to-process concerns.&lt;/p&gt;

</description>
      <category>meltdown</category>
      <category>cloudcomputing</category>
      <category>amazonwebservices</category>
    </item>
  </channel>
</rss>
