Ben Lovy

Posted on Jan 4, 2020

Correct a Beginner About Buzzword Technologies

#beginners #help #distributedsystems

Learning in Public

As I keep learning, DEV knows things. Ask them stuff.

In this post, I'm going to attempt to ELI5 some buzzword tools and technologies that I don't actually know anything about or use myself, purely based on what I've managed to pick up from Wikipedia and blog posts.

Several of these are specifically distributed computing tools managed by the open-source Apache Software Foundation, a category of tools I have had zero exposure to or opportunity to work with as a hobbyist/self-learner.

I'd like you to tell me what I'm wrong about or add clarity where I'm inevitably short of the mark about these things:

Hadoop
Spark
Cassandra
Kafka
RabbitMQ
Containers
Kubernetes
Redis

Hadoop

Apache Hadoop is a tool for working with huge amounts of data. It abstracts the physical hardware from the data operations you are running. Hadoop takes care of dicing up the data and managing where it lives across a bunch of networked computers, each with otherwise separate, self-contained memory and processors. This group is called a "cluster". This allows you to scale horizontally, because the software manages how the hardware is utilized for you. Horizontal scaling means adding more components, as opposed to vertical scaling, which means making your current components more powerful. You can simply throw more computers at Hadoop, with low overhead to bring them online; you don't need to upgrade what you have to get more power.

The method of interacting with Hadoop is called MapReduce. It's called so because it consists of a "map" operation and then a "reduce" operation. A "map" means an operation applied to every element of a sequence - in this case each chunk of the data spread across the cluster. A "reduce" will process a number of sources into one source, combining the results of each individual operation. This is similar to but not exactly like the map() and reduce() functional programming methods, specifically applied to chunking up a workload in a distributed computing environment. It's so powerful because during the "map" stage you've parallelized your operation, as each node in your cluster can run its section simultaneously.
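To make the map-then-reduce shape concrete, here is a toy word count in plain Python. It is only a sketch of the idea, not Hadoop's API: the thread pool stands in for cluster nodes, and the names `map_chunk`, `merge_counts`, and `word_count` are made up for illustration.

```python
from collections import Counter
from functools import reduce
from multiprocessing.dummy import Pool  # thread pool standing in for cluster nodes


def map_chunk(chunk: str) -> Counter:
    """The "map" step: count words in one chunk, independently of the others."""
    return Counter(chunk.lower().split())


def merge_counts(left: Counter, right: Counter) -> Counter:
    """The "reduce" step: combine two partial results into one."""
    return left + right


def word_count(chunks: list[str]) -> Counter:
    # Each chunk is mapped on its own worker, so the map stage runs in parallel.
    with Pool(4) as pool:
        partials = pool.map(map_chunk, chunks)
    # The partial counts are then folded into a single result.
    return reduce(merge_counts, partials, Counter())


chunks = ["the quick brown fox", "the lazy dog", "the fox again"]
totals = word_count(chunks)
print(totals["the"], totals["fox"], totals["dog"])  # 3 2 1
```

The map step never needs to see another chunk, which is what lets each node work on its own slice of the data.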

I've kinda-sorta wanted to build a Raspberry Pi Hadoop cluster for a while, but I don't know what the heck I'd do with it.

Spark

Apache Spark is a system for running operations on massive amounts of data, like MapReduce. Whereas MapReduce runs operations locally on each server, Spark runs operations in-memory. Spark runs in a Hadoop cluster to speed up the operation over the basic MapReduce model by running everything in its nifty in-memory way, cutting down on disk I/O, which can be a bottleneck for some workloads.
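As a rough illustration of why skipping disk between steps matters, here is a toy two-stage pipeline run two ways. This is not Spark or Hadoop code. It assumes the "disk" style writes each stage's output to a file before the next stage reads it back, and every function name here is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def stage_one(numbers):
    """First transformation: square every value."""
    return [n * n for n in numbers]


def stage_two(squares):
    """Second transformation: sum the even squares."""
    return sum(s for s in squares if s % 2 == 0)


def run_via_disk(numbers, workdir: Path) -> int:
    """Chained-jobs style: stage one's output is written out and read back."""
    intermediate = workdir / "stage_one.json"
    intermediate.write_text(json.dumps(stage_one(numbers)))
    return stage_two(json.loads(intermediate.read_text()))


def run_in_memory(numbers) -> int:
    """In-memory style: stage one's output is handed straight to stage two."""
    return stage_two(stage_one(numbers))


data = list(range(10))
with tempfile.TemporaryDirectory() as tmp:
    disk_result = run_via_disk(data, Path(tmp))
memory_result = run_in_memory(data)
print(disk_result, memory_result)  # 120 120
```

Both versions compute the same answer; the difference is only the extra write and read in the middle, which adds up when a job has many stages or loops over the same data repeatedly.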

Hadoop and Spark are both widely used and have different strengths, but I don't know enough to elaborate. What sorts of workloads are each of these used for, and why do these underlying design differences help? Why don't all Hadoop clusters run Spark if it's so dang fast?

More importantly, am I even correct about what these things are?

Cassandra

Apache Cassandra is a NoSQL database like MongoDB that's designed to be run on distributed systems, and supports MapReduce and Hadoop. It uses its own query language, called CQL, that looks kinda like SQL.

That's all I've got here - it's what you use if you need a NoSQL data store in a distributed system.

Kafka

Apache Kafka is another distributed computing tool, this time providing a stream of records. It's kinda like Hadoop in that this stream of records is abstracted from the hardware and Kafka manages any actual physical mapping - or does it run on a Hadoop cluster? This property allows incoming "Consumers" and "Producers" of these streams to not care about physical topology, and store logs that are too big for any one server. Streams can also be connected in some way and processed. It can be used as a replacement for traditional message brokers like RabbitMQ.
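One way to picture the "stream of records" is as an append-only log where each consumer remembers how far it has read. The sketch below is a single-process toy under that assumption, not Kafka's client API; `ToyLog` and its methods are invented for illustration.

```python
from collections import defaultdict


class ToyLog:
    """Append-only list of records; a stand-in for one stream of records."""

    def __init__(self):
        self._records = []
        self._offsets = defaultdict(int)  # consumer name -> next offset to read

    def produce(self, record) -> int:
        """Append a record and return its offset (position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def consume(self, consumer: str, max_records: int = 10) -> list:
        """Return the next batch for this consumer and advance its offset."""
        start = self._offsets[consumer]
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = ToyLog()
for event in ["signup", "login", "purchase"]:
    log.produce(event)

billing_first = log.consume("billing", max_records=2)
billing_rest = log.consume("billing")
analytics_all = log.consume("analytics")
print(billing_first, billing_rest, analytics_all)
```

Reading does not remove anything from the log, so "billing" and "analytics" each see every record at their own pace. That is the main difference from a queue, where a message is typically gone once a consumer has taken it.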

RabbitMQ

RabbitMQ is a traditional message broker. A message broker provides queues for moving messages around in a big system. This allows you to compose your system from small encapsulated disparate parts, even in different programming languages if you like, and use one of several message queuing protocols to pass around what you need between them. This model is called Message-Oriented Middleware or MOM and is easier to scale than a huge complicated monolithic application design.
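The queue-in-the-middle idea can be sketched with Python's standard library. This is not RabbitMQ or any of its protocols: `queue.Queue` stands in for the broker, a thread stands in for a separate service, and the `None` sentinel is just a convention chosen for this example.

```python
import queue
import threading

orders = queue.Queue()  # stand-in for a broker-managed queue
processed = []


def worker():
    """Consumer: pull messages off the queue until the sentinel arrives."""
    while True:
        message = orders.get()
        if message is None:  # sentinel: no more work
            orders.task_done()
            break
        processed.append(f"shipped:{message}")
        orders.task_done()


consumer = threading.Thread(target=worker)
consumer.start()

# Producer: it only knows about the queue, not about who consumes from it.
for order_id in ["A1", "B2", "C3"]:
    orders.put(order_id)
orders.put(None)

orders.join()    # wait until every message has been handled
consumer.join()
print(processed)  # ['shipped:A1', 'shipped:B2', 'shipped:C3']
```

The producer and the consumer never call each other directly, which is what lets you swap either side out, or rewrite it in another language, without touching the other.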

Containers

A container is a lightweight method of virtualization for an operating system without needing to emulate a whole computer. It allows you to separately distribute a set of userspace tools without the kernel. When you enter a container, it uses the host OS's kernel to interact with the CPU and RAM but overlays a separate userspace, providing a sandboxed environment that looks like a completely clean operating system to running software. This is more space and resource efficient than a traditional virtual machine, which steals a segment of the host's resources and emulates a full computer in software to provide the same result.

Docker is a commonly used PaaS for containerization, and allows you to configure these sandboxed environments with a text file called a Dockerfile, and orchestrate multiple containers via a YAML file called docker-compose.yml. This allows you to cheaply and conveniently run different parts of an application, for instance a MySQL database and an API server, in completely separate, sterile, reproducible environments.

Kubernetes

Kubernetes is a layer of abstraction on top of containers. It's a more powerful abstraction than docker-compose.yml. It's an orchestration system that lets you stop thinking in terms of individual containers for these components of a huge system, and think instead in terms of services. It creates and manages clusters of identical containers to run these services. It can automatically scale by spinning up new containers to manage load or by spinning down unneeded ones, helping to manage server costs. It can also silently kill and reboot containers that have entered poisoned states. It sounds like a great idea to me - when should you prefer Docker Compose?
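The "keep N identical containers running" behaviour can be pictured as a loop that compares what you asked for with what is actually running and fixes the difference. The toy below assumes exactly that and nothing more; it is not Kubernetes' API, and `Container`, `Service`, and `reconcile` are hypothetical names.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # source of unique container names


@dataclass
class Container:
    name: str
    healthy: bool = True


@dataclass
class Service:
    desired_replicas: int
    containers: list = field(default_factory=list)


def reconcile(service: Service) -> None:
    """One pass: make the running containers match the desired count."""
    # Drop containers that have gone unhealthy.
    service.containers = [c for c in service.containers if c.healthy]
    # Start replacements until enough are running.
    while len(service.containers) < service.desired_replicas:
        service.containers.append(Container(name=f"web-{next(_ids)}"))
    # Stop extras if the service was scaled down.
    del service.containers[service.desired_replicas:]


web = Service(desired_replicas=3)
reconcile(web)                      # starts web-1, web-2, web-3
web.containers[0].healthy = False   # simulate a container entering a bad state
reconcile(web)                      # web-1 is dropped and web-4 takes its place
web.desired_replicas = 2
reconcile(web)                      # scales down to two containers
names = [c.name for c in web.containers]
print(names)  # ['web-2', 'web-3']
```

You only ever change `desired_replicas` or let a container fail; the loop decides which containers to start or stop, which is the "think in services, not containers" shift.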

Redis

Redis is an in-memory key-value store that supports a variety of abstract data types like Lists and Strings, so you can use it with objects directly from your app's programming language. It's different from a standard relational database in that you don't query an engine which performs your operation; you just run specific operations on these objects directly in memory. It's both a data store and a cache, and manages both persistent storage and fast client-side data retrieval. This allows you to cache user requests and stop hitting your server-side database on repeated requests. It can also be used for messaging like RabbitMQ or Kafka, I guess? Redis seems pretty cool and it's on my 2020 to-learn list, but I don't feel I have a thorough understanding of it.
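The "stop hitting your database on repeated requests" part is often done by checking the cache first and only falling back to the database on a miss. Here is a minimal sketch of that pattern with a dictionary-backed stand-in instead of a real Redis client; `ToyCache`, `slow_database_lookup`, and the 60-second expiry are all assumptions made for the example.

```python
import time


class ToyCache:
    """In-memory key-value store with optional expiry; a stand-in, not Redis."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, ttl_seconds=None):
        expires = time.monotonic() + ttl_seconds if ttl_seconds is not None else None
        self._data[key] = (value, expires)

    def get(self, key):
        value, expires = self._data.get(key, (None, None))
        if expires is not None and time.monotonic() >= expires:
            del self._data[key]  # entry is stale, so forget it
            return None
        return value


stats = {"db_hits": 0}


def slow_database_lookup(user_id):
    """Pretend database query; counts how often it is actually called."""
    stats["db_hits"] += 1
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(cache, user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached  # cache hit: the database is not touched
    user = slow_database_lookup(user_id)
    cache.set(key, user, ttl_seconds=60)
    return user


cache = ToyCache()
first = get_user(cache, 42)
second = get_user(cache, 42)
print(stats["db_hits"])  # 1
```

The second call is answered from memory, so the database is only queried once until the entry expires.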

Photo by stem.T4L on Unsplash

Top comments (11)

Stefanos Kouroupis • Jan 5 '20 • Edited

Kubernetes has way more capabilities than just orchestrating containers. For example, you can also handle multiple versions of an application targeting specific users (gradually releasing a new feature). Docker Compose works with Swarm, which is the Docker alternative to Kubernetes.

Cassandra is really powerful, and it has an ingenious peer-to-peer system for adding and syncing nodes. You could even design a system where specific information exists in specific nodes at specific times. It is specifically designed for consistent writes.

Rabbit and Redis I think are dead on.

Corey McCarty • Jan 5 '20

I'd like to add that Cassandra uses (as one possible replication method) the Gossip protocol, which stems from a paper by Leslie Lamport that I enjoyed called "The Part-Time Parliament". The basic premise comes down to each node asking others for updates to the data that it may have missed.

rhymes • Jan 5 '20

Redis is widely used as a queue, yes. It's supported by tools like Celery (a Python distributed task queue that has multiple backends, both RabbitMQ and Redis, and is/was used by Instagram) or Sidekiq, a similar tool for Ruby.

Lots of "simpler" web apps use distributed queues as a means to offload workload.

But Redis is much more than just a backend: as you hinted, it can be used as a data store and as a cache. Until the upcoming Redis 6 release, it has had an async-only architecture (Redis 6 will add threaded I/O). It has persistence, native support for publisher/subscriber communication, and many data types (hashes, sorted sets). It supports less mainstream features like HyperLogLog, has basic support for geolocation operations, and supports streams for input and output.

There's a lot you can do with Redis :)

Ben Lovy • Jan 5 '20

HyperLogLog

This was an awesome rabbithole, thanks!

rhymes • Jan 6 '20

great read!

David Wickes • Jan 6 '20

RabbitMQ

Does a lot of things. Far too many things for my liking. It's open to being horribly abused as such - I ended up working on a project where it was used for synchronous RPC (ridiculous).

As such I'd favour a more lightweight technology for queuing.

Kubernetes

I worked as a "DevOps Engineer" for half a year. My impression of Kubernetes as a developer is that developers don't need it. Docker Compose sits at the appropriate level of abstraction for developing applications. K8s is at the wrong level of abstraction.

Ben Lovy • Jan 6 '20

Awesome thanks. Do you have an example of what queuing tools you prefer? There's a wide array out there.

For Kubernetes, that makes sense to me from a dev perspective - what k8s does is manage groups of identical containers, so when defining what's actually in each one you shouldn't worry about deployment concerns like that. Once the product is deployed, though, is it not a strict improvement over the built-in behavior of docker compose?