As I keep learning, DEV knows things. Ask them stuff.
In this post, I'm going to attempt to ELI5 some buzzword tools and technologies that I don't actually know anything about or use myself, purely based on what I've managed to pick up from Wikipedia and blog posts.
Several of these are specifically distributed computing tools managed by the open-source Apache Software Foundation, a category of tools I have had zero exposure to or opportunity to work with as a hobbyist/self-learner.
I'd like you to tell me what I'm wrong about or add clarity where I'm inevitably short of the mark about these things:
Apache Hadoop is a tool for working with huge amounts of data. It abstracts the physical hardware from the data operations you are running. Hadoop takes care of dicing up the data and managing where it lives across a bunch of networked computers, each with otherwise separate, self-contained memory and processors. This group is called a "cluster". This lets you scale horizontally, because the software manages how the hardware is utilized for you. Horizontal scaling means adding more components, as opposed to vertical scaling, which means making your current components more powerful. You can simply throw more computers at Hadoop with low overhead to bring them online; you don't need to upgrade what you have to get more power.
The classic method of interacting with Hadoop is called MapReduce. It's called so because it consists of a "map" operation and then a "reduce" operation. A "map" means an operation applied to every element of a sequence - in this case, each chunk of the data spread across the cluster. A "reduce" processes a number of sources down into one, combining the results of each individual operation. This is similar to, but not exactly like, the map() and reduce() methods from functional programming, specifically applied to chunking up a workload in a distributed computing environment. It's so powerful because during the "map" stage you've parallelized your operation, as each node in your cluster can run its section simultaneously.
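To make that concrete, here's the canonical word-count example sketched in plain Python - this is the shape of the idea, not Hadoop's actual API, and all the function names are mine:

```python
from collections import defaultdict
from functools import reduce

# Pretend each string is a chunk of a huge file living on a different node.
chunks = ["the quick brown fox", "the lazy dog", "the end"]

def map_chunk(chunk):
    """Map stage: each node independently emits (word, 1) pairs for its chunk."""
    return [(word, 1) for word in chunk.split()]

def reduce_counts(acc, pair):
    """Reduce stage: merge the pairs from every node into one count table."""
    word, count = pair
    acc[word] += count
    return acc

# The map calls are independent, so they could all run in parallel, one per node...
mapped = [map_chunk(c) for c in chunks]

# ...while the reduce combines their outputs into a single result.
flat = [pair for chunk_result in mapped for pair in chunk_result]
counts = reduce(reduce_counts, flat, defaultdict(int))

print(dict(counts))  # {'the': 3, 'quick': 1, ...}
```

The map stage is where the distributed win happens: nothing in `map_chunk` depends on any other chunk.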
I've kinda-sorta wanted to build a Raspberry Pi Hadoop cluster for a while, but I don't know what the heck I'd do with it.
Apache Spark is a system for running operations on massive amounts of data, like MapReduce. Whereas MapReduce writes its intermediate results to disk between stages, Spark keeps them in memory. Spark can run on a Hadoop cluster (scheduled by YARN, reading from HDFS), though it doesn't have to, and it speeds things up over the basic MapReduce model by running everything in its nifty in-memory way, cutting down on disk I/O, which can be a bottleneck for some workloads.
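A toy illustration of that difference in plain Python - this is not Spark, and the file round-trip is exaggerated for effect, but it shows why chaining stages in memory avoids I/O that the disk-based style pays for between every stage:

```python
import json
import os
import tempfile

data = list(range(10))

# MapReduce-style: every stage round-trips its output through the filesystem.
def stage_via_disk(values, fn):
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump([fn(v) for v in values], f)  # write the intermediate result out...
        path = f.name
    with open(path) as f:                      # ...then read it back for the next stage
        result = json.load(f)
    os.remove(path)                            # clean up the intermediate file
    return result

doubled = stage_via_disk(data, lambda v: v * 2)
shifted = stage_via_disk(doubled, lambda v: v + 1)

# Spark-style: the same two stages chained entirely in memory.
in_memory = [v * 2 + 1 for v in data]

assert shifted == in_memory  # same answer, very different amounts of disk I/O
```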
Hadoop and Spark are both widely used and have different strengths, but I don't know enough to elaborate. What sorts of workloads are each of these used for, and why do these underlying design differences help? Why don't all Hadoop clusters run Spark if it's so dang fast?
More importantly, am I even correct about what these things are?
Apache Cassandra is a NoSQL database that's designed to be run on distributed systems, and it supports MapReduce and Hadoop. It's often mentioned alongside MongoDB, but they're different flavors of NoSQL: MongoDB is a document store, while Cassandra is a wide-column store. It uses its own query language that looks kinda like SQL, called CQL.
That's all I've got here - it's what you use if you need a NoSQL data store in a distributed system.
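For flavor, here's a rough sketch of what CQL looks like (the table and column names are invented; the syntax is real CQL as far as I can tell):

```sql
-- Looks a lot like SQL, but the data model underneath is different:
-- the partition key (here, country) determines which node owns each row.
CREATE TABLE users_by_country (
    country text,
    user_id uuid,
    name text,
    PRIMARY KEY (country, user_id)
);

SELECT name FROM users_by_country WHERE country = 'US';
```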
Apache Kafka is another distributed computing tool, this time providing a stream of records. It's kinda like Hadoop in that this stream of records is abstracted from the hardware: Kafka runs on its own cluster of servers called "brokers" (it doesn't run on Hadoop) and manages the actual physical mapping itself. This property allows "Producers" and "Consumers" of these streams to not care about physical topology, and lets Kafka store logs that are too big for any one server. Streams can also be connected and processed into new streams. It can be used as a replacement for traditional message brokers like RabbitMQ.
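Kafka's core abstraction is surprisingly simple: an append-only log that producers write to and consumers read from at their own pace, each tracking its own offset. A toy in-process sketch of just that idea - real Kafka partitions and replicates this log across brokers, and the class here is entirely made up:

```python
class ToyLog:
    """A stand-in for one Kafka topic partition: an append-only list of records."""
    def __init__(self):
        self.records = []

    def produce(self, record):
        self.records.append(record)   # producers only ever append

    def consume(self, offset):
        return self.records[offset:]  # consumers read from an offset they track themselves

log = ToyLog()
log.produce("user-signed-up")
log.produce("order-placed")

# Two independent consumers, each with its own position in the log.
analytics_offset = 0  # this one replays everything from the start
billing_offset = 1    # this one has already processed the first record

print(log.consume(analytics_offset))  # ['user-signed-up', 'order-placed']
print(log.consume(billing_offset))    # ['order-placed']
```

Because consuming doesn't remove anything, many consumers can read the same stream independently - one of the big differences from a classic queue.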
RabbitMQ is a traditional message broker. A message broker provides queues for moving messages around in a big system. This allows you to compose your system from small, encapsulated, disparate parts - even in different programming languages, if you like - and use one of several message queuing protocols to pass around what you need between them. This model is called Message-Oriented Middleware, or MOM, and is easier to scale than a huge, complicated, monolithic application design.
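The queue idea in miniature, using Python's standard library as a stand-in for the broker - with RabbitMQ, the queue would live in a separate broker process, and the two sides could be entirely different programs in different languages:

```python
import json
import queue

# A stand-in for a broker-managed queue sitting between two services.
orders = queue.Queue()

# Service A (the producer) serializes a message and hands it to the queue...
orders.put(json.dumps({"order_id": 42, "item": "widget"}))

# ...and Service B (the consumer) picks it up whenever it's ready,
# without either service knowing anything about the other.
message = json.loads(orders.get())
print(message["order_id"])  # 42
```

Unlike the Kafka-style log, a message taken off a queue like this is gone - one consumer handles it, which is exactly what you want for work distribution.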
A container is a lightweight method of virtualizing an operating system without needing to emulate a whole computer. It allows you to distribute a set of userspace tools separately, without a kernel. When you run a container, it uses the host OS's kernel to interact with the CPU and RAM but overlays a separate userspace, providing a sandboxed environment that looks like a completely clean operating system to running software. This is more space- and resource-efficient than a traditional virtual machine, which steals a segment of the host's resources and emulates a full computer in software to provide the same result.
Docker is a commonly used platform for containerization (sometimes described as a PaaS). It allows you to configure these sandboxed environments with a text file called a Dockerfile, and orchestrate multiple containers via a YAML file called docker-compose.yml. This allows you to cheaply and conveniently run different parts of an application - for instance, a MySQL database and an API server - in completely separate, sterile, reproducible environments.
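That MySQL-plus-API setup might look something like this in a docker-compose.yml (the service names, ports, and password are placeholders):

```yaml
services:
  db:
    image: mysql:8
    environment:
      MYSQL_ROOT_PASSWORD: example   # placeholder - don't ship this
  api:
    build: .            # built from the Dockerfile in this directory
    ports:
      - "8000:8000"     # expose the API on the host
    depends_on:
      - db              # start the database before the API server
```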
Kubernetes is a layer of abstraction on top of containers, and a more powerful one than docker-compose.yml. It's an orchestration system that lets you stop thinking in terms of individual containers for the components of a huge system, and instead think in terms of services. It creates and manages clusters of identical containers to run these services. It can automatically scale by spinning up new containers to handle load or spinning down unneeded ones, helping to manage server costs. It can also silently kill and restart containers that have entered poisoned states. It sounds like a great idea to me - when should you prefer Docker Compose?
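To make "think in services, not containers" concrete, here's a rough sketch of a Kubernetes Deployment (the names and image are invented): you declare that three identical replicas of a container should always exist, and Kubernetes keeps that true, restarting or rescheduling containers as needed.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3                  # "I want three of these running" - Kubernetes makes it so
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example/api:1.0   # placeholder image
          ports:
            - containerPort: 8000
```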
Redis is an in-memory key-value store that supports a variety of data structures like lists, sets, and strings, shapes that map naturally onto your app's programming language. It's different from a standard relational database in that you don't hand a query to an engine that plans and performs your operation; you run specific commands on these structures directly in memory. It can act as both a data store and a cache: it lives in memory for speed, but can also persist data to disk. Because it sits in front of your primary database on the server side, it lets you cache results and stop hitting your slower database on repeated requests. It can also be used for messaging like RabbitMQ or Kafka, since it has pub/sub built in. Redis seems pretty cool, it's on my 2020 to-learn list, but I don't feel I have a thorough understanding of it.
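One common pattern behind "stop hitting your database on repeated requests" is cache-aside: check the cache first, and only query the database on a miss. A sketch in plain Python, with a dict standing in for the Redis server - the real thing would use a client library (like redis-py) with the same get/set shape:

```python
# A dict standing in for a Redis server; real Redis adds expiry, persistence, etc.
cache = {}
db_hits = 0

def query_database(user_id):
    """Pretend this is an expensive relational-database query."""
    global db_hits
    db_hits += 1
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    if key in cache:               # cache hit: skip the database entirely
        return cache[key]
    user = query_database(user_id)
    cache[key] = user              # cache-aside: populate the cache on the way out
    return user

get_user(7)
get_user(7)     # the second call is served from the cache
print(db_hits)  # 1
```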