DEV Community

loading...

CodingBlocksCodingBlocks

Designing Data-Intensive Applications – SSTables and LSM-Trees

CodingBlocks Play Button Pause Button
play

It’s time to learn about SSTables and LSM-Trees as Joe feels pretty zacked, Michael clarifies what he was looking forward to, and Allen has opinions about Dr Who.

These show notes can be found at https://www.codingblocks.net/episode128 where you be a part of the conversation, in case you’re reading this via your podcast player.

Sponsors

Survey Says

Anonymous Vote
Sign in with Wordpress
Do you leave your laptop plugged in the majority of the time?
  • All. The. Time. Pfft. I don't care about the battery.
  • No, I try to maintain my battery's life expectancy by neither fully charging or discharging it.
  • No, but not because I care.

News

  • Thank you for all of the great reviews:
    • iTunes: devextremis, CaffinatedGamer, Matt Hussey, index out of range
    • Stitcher: Marcos Sagrado, MoarLiekCodingRokzAmirite, Asparges69
  • Sadly, due to COVID-19 (aka Coronavirus), the 15th Annual Orlando Code Camp & Tech Conference has been cancelled. We’ll keep you informed of your next opportunity to kick us in the shins. (orlandocodecamp.com)
    • During this unprecedented time, TechSmith is offering Snagit and Video Review for free through June 2020. (TechSmith)

SSTables and LSM-Trees

SSTables

  • SSTable is short for “Sorted String Table”.
  • SSTable requires that the writes be sorted by key.
    • This means we cannot append the new key/value pairs to the segment immediately because we need to make sure the data is sorted by key first.

What are the benefits of the SSTable over the hash indexed log segments?

  • Merging the segments is much faster, and simpler. It’s basically a mergesort against the segment files being merged. Look at the first key in each file, and take the lowest key (according to the sort order), add it to the new segment file … rinse-n-repeat.
    • When the same key shows in multiple segment files, keep the newer segment’s key/value pair, sticking with the notion that the last written key/value for any given key is the most up to date value.
  • To find keys, you no longer need to keep the entire hash of indexes in memory. Instead, you can use a sparse index where you store a key in memory for every few kilobytes from a segment file
    • This saves on memory.
    • This also allows for quick scans as well.
      • For example, when you search for a key, Michael and the key isn’t in the index, you can find two keys in the sparse index that Michael falls between, such as Micah and Mick, then start at the Micah offset and scan that portion of the segment until you find the Michael key.
  • Another improvement for speeding up read scans is to write chunks of data to disk in compressed blocks. Then, the keys in the sparse index point to the beginning of that compressed block.

So how do you write this to disk in the proper order?

  • If you just write them to disk as you get them, they’ll be out of order in an append only manner because you’re likely going to receive them out of order.
  • One method is to actually write them to disk in a sorted structure. B-Tree is one option. However, maintaining a sorted structure in memory is actually easier than trying to maintain it on disk though, due to well known tree data structures like red-black trees and AVL trees.
    • The keys are sorted as they’re inserted due to the way nodes are shuffled during inserts.
    • This allows you to write the data to memory in any order and retrieve it sorted.
  • When data arrives, write it to the memory balanced tree data structure, such as a red-black tree. This is also referred to as a memtable.
  • Once you’ve reached a predefined size threshold, you dump the data from memory to disk in a new SSTable file.
  • While the new segment is being written to disk, any incoming key/value pairs get written to a new memtable.
  • When serving up read requests, you search in your memtable first, then back to the most recent segment, and so on moving backwards until you find the key you’re looking for.
  • Occasionally run a merge on the segments to get rid of overwritten or deleted items.

Downside of this method?

  • If the database crashes for some reason, the data in the memtable is lost.
  • To avoid this, you can use an append-only, unsorted log for each new record that comes in. If the database crashes, that log file can be used to recreate the memtable.

LSM-Trees

This implementation is the ground work for:

  • LevelDB (GitHub) and RocksDB (GitHub),
  • Databases intended to be embedded in other applications,
    • RocksDB is embedded in Kafka Streams and is used for GlobalKTables.
  • Similar storage engines are used by Cassandra and HBase.
    • Both took some design queues from Google’s BigTable whitepaper, which introduced the terms SSTable and memtable.

All of this was initially described under the name Log-Structured Merge Tree, LSM-Tree.

  • Storage engines that are based on the notion of storing compacted and sorted files are often called LSM storage engines.
    • Lucene, the indexing engine used in Solr and ElasticSearch, uses a very similar process.

Optimizing

  • One of the problems with the LSM-Tree model is that searching for keys that don’t exist can be expensive.
    • Must search the memtable first, then latest segment, then the next oldest segment, etc., all the way back through all the segments.
    • One solution for this particular problem is a Bloom filter.
      • A Bloom filter is a data structure used for approximating what is in a set of data. It can tell you if the key does not exist, saving a lot of I/O looking for the key.
  • There are competing strategies for determining when and how to perform the merge and compaction operations. The most common approaches include:
    • Leveled compaction – Key ranges are split into smaller SSTables and old data is moved to different “levels” allowing the compacting process to use less disk and done incrementally. This is the strategy used by LevelDB and RocksDB.
    • Size-tiered compaction – Smaller and newer SSTables are merged into larger and older SSTables. This is the strategy used by HBase.

Resources We Like

  • Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann (Amazon)
  • Red-black trees in 5 minutes – Insertions (examples) (YouTube)
  • Data Structures – (some) Trees (episode 97)
  • B-Tree Visualization (USFCA)
  • Red Black Tree vs AVL Tree (GeeksforGeeks)
  • How to: Use Bloom Filters in Redis (YouTube)
  • A Busy Developer’s Guide to Database Storage Engines – The Basics (yugabyteDB)

Tip of the Week

  • Save time typing paths by drag-n-dropping a folder from Finder/File Explorer to your command shell. Works on Windows and macOS in Command Prompt, Powershell, Cmder, and Terminal.
  • Popular and seminal white papers curated by Papers We Love (GitHub)
    • See if there is an upcoming PWL meetup in your area (paperswelove.org)
    • And there’s a corresponding Papers We Love Conference (pwlconf.org)
  • Every find yourself in the situation where you’re asked to pivot from your current work to another task that would require you to stash your current changes and change branches? Maybe you do that. Or maybe you clone the repo into another path and work from there? But there’s a pro-tip way. Instead, you can use git worktree to work with your repo in another path without needing to re-clone the repo.
    • For example, git worktree add -b myhotfix /temp master copies the files from master to /temp and creates a new branch named myhotfix.
  • Get your Silicon Valley fix with Mythic Quest. (Apple)
  • Level up your programming skills with exercises and mentors with Exercism. (exercism.io)
    • Exercism has been worth mentioning a few times:
      • Algorithms, Puzzles, and the Technical Interview (episode 26)
      • Deliberate Practice for Programmers (episode 78)
  • Use elasticdump’s import and export tools for Elasticsearch. (GitHub)
  • Use docker run --network="NETWORK-NAME-HERE" to connect a container to an existing Docker network. (docs.docker.com)

Episode source

pic
Editor guide