<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: zhengxin</title>
    <description>The latest articles on DEV Community by zhengxin (@zhengxin).</description>
    <link>https://dev.to/zhengxin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1224404%2Fe94a1efe-101f-4d67-8703-4ca4ec0023fb.png</url>
      <title>DEV Community: zhengxin</title>
      <link>https://dev.to/zhengxin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zhengxin"/>
    <language>en</language>
    <item>
      <title>Kubernetes Cluster as an OpenID Connect Identity Provider</title>
      <dc:creator>zhengxin</dc:creator>
      <pubDate>Sat, 02 Dec 2023 14:29:57 +0000</pubDate>
      <link>https://dev.to/zhengxin/kubernetes-cluster-as-an-openid-connect-identity-provider-bpn</link>
      <guid>https://dev.to/zhengxin/kubernetes-cluster-as-an-openid-connect-identity-provider-bpn</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Traditionally, when executing programs or applications on Virtual Machines (VMs) provided by cloud services, the cloud provider grants specified permissions to the VM. This setup allows the program running on the VM to access designated cloud resources without the need for password authentication. For instance, in Azure, binding a system-assigned or user-assigned managed identity to a VM enables the program running on the VM to operate as the managed identity, inheriting permissions associated with it.&lt;/p&gt;

&lt;p&gt;We aim to replicate this user experience in Kubernetes, whereby a pod in a Kubernetes cluster (k8s) can be granted permissions without requiring the program within the pod to read any long-term credentials like passwords.&lt;/p&gt;

&lt;p&gt;Previously, Azure Kubernetes Service (AKS) had introduced a feature known as &lt;a href="https://learn.microsoft.com/en-us/azure/aks/use-azure-ad-pod-identity"&gt;Pod Identity&lt;/a&gt; to achieve this. However, this feature was deprecated before reaching General Availability (GA) as a more robust and widely accepted solution emerged.&lt;/p&gt;

&lt;p&gt;The feature, &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#service-account-issuer-discovery"&gt;Service Account Issuer Discovery&lt;/a&gt;, stable since Kubernetes v1.21, turns the Kubernetes API server into an OIDC identity provider. Through service accounts, the cluster issues tokens to pods that services outside the cluster can verify, establishing an authentication path between a pod inside the cluster and external services such as those on Azure, AWS, etc.&lt;/p&gt;

&lt;p&gt;Both Azure AKS and AWS EKS have enabled this feature by default, offering convenient methods to configure the Kubernetes cluster OIDC Provider to integrate with their respective access control services, namely Microsoft Entra ID and AWS IAM. This feature is termed differently in their respective managed cluster documentation as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AKS: &lt;a href="https://learn.microsoft.com/en-us/azure/aks/workload-identity-overview?tabs=dotnet"&gt;Workload Identity&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;EKS: &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html"&gt;IAM roles for service accounts (IRSA)&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Any Kubernetes cluster with the "Service Account Issuer Discovery" feature enabled can be integrated with cloud providers, not merely the managed clusters offered by those providers, although the setup may be more involved.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this document, we will delve into the workings of this feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Service Account Issuer Discovery Flow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oB_s4Vve--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/6afUzhF.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oB_s4Vve--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/6afUzhF.jpg" alt="" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Assume we have a Kubernetes cluster, hosting applications purposed for reading and writing Azure Blobs.&lt;/p&gt;

&lt;p&gt;Initially, the cluster administrator should activate the "Service Account Issuer Discovery" feature on the cluster. Following this, the cluster should be configured with the cloud provider, in this instance, Microsoft Entra ID, ensuring that Microsoft Entra ID is cognizant of the cluster's existence. In more precise terms, Microsoft Entra ID and the Kubernetes OIDC provider are federated, establishing a trust relationship.&lt;/p&gt;

&lt;p&gt;Applications running in the cluster are assigned service accounts (if none is explicitly defined, a default service account is bound to every pod). Assigning a service account to a pod results in a JWT token being projected into the pod, mounted as a file on the pod's filesystem.&lt;/p&gt;
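&lt;p&gt;As a sketch of what this looks like from inside the pod, the application can read the projected token file and peek at its claims. The code below is illustrative only: the path shown is the conventional default mount location, and the claims are decoded without signature verification, which is the job of the relying party (here, Microsoft Entra ID).&lt;/p&gt;

```python
import base64
import json

# Conventional default path for the projected service account token
# (the actual mount path depends on the pod spec).
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

def decode_jwt_claims(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature.

    Good enough for inspecting claims such as iss, sub, and aud;
    signature verification is done by the relying party.
    """
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded with padding stripped; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))
```

&lt;p&gt;For a token issued by the cluster, the &lt;code&gt;iss&lt;/code&gt; claim is the cluster's issuer URL and the &lt;code&gt;sub&lt;/code&gt; claim identifies the service account.&lt;/p&gt;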

&lt;p&gt;Through different annotations on the service account, the cluster administrator can manipulate the claims within the token, thus, various service accounts can be endowed with different permissions.&lt;/p&gt;

&lt;p&gt;The application can read the JWT token and forward it to the cloud access control service, in this scenario, Microsoft Entra ID. Upon receiving the token, Microsoft Entra ID validates it. Given the pre-established trust relationship with the Kubernetes cluster OIDC provider, it can ascertain the token’s validity.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: In this step, the cloud access control service may or may not access the Kubernetes API server, depending on whether the JWT token validation is a local or remote process.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Post validation, Microsoft Entra ID issues an access token to the application. The application can then utilize this token to access Azure Blobs or other cloud resources managed by Microsoft Entra ID.&lt;/p&gt;
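&lt;p&gt;To make the exchange concrete, here is a minimal sketch of the token request an application could send, following Microsoft's documented workload identity federation flow (client credentials with a client assertion). The tenant ID, client ID, and scope are placeholders, and the actual HTTP call and error handling are omitted.&lt;/p&gt;

```python
def build_token_request(tenant_id: str, client_id: str,
                        k8s_token: str, scope: str):
    """Build the (url, form) pair for an OAuth2 client_credentials
    request in which the Kubernetes-issued JWT is presented as a
    client assertion instead of a client secret."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "scope": scope,
        "client_assertion_type":
            "urn:ietf:params:oauth:client-assertion-type:jwt-bearer",
        "client_assertion": k8s_token,  # the projected service account token
    }
    return url, form
```

&lt;p&gt;The response to this request carries the access token the application then presents to Azure Blob Storage.&lt;/p&gt;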

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/aks/workload-identity-overview?tabs=dotnet"&gt;Use Microsoft Entra ID Workload Identity with Azure Kubernetes Service (AKS)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#service-account-issuer-discovery"&gt;Service Account Issuer Discovery&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html"&gt;IAM roles for service accounts (IRSA)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>oauth</category>
      <category>oidc</category>
    </item>
    <item>
      <title>Understand Windows Azure Storage Architecture</title>
      <dc:creator>zhengxin</dc:creator>
      <pubDate>Sat, 02 Dec 2023 14:28:54 +0000</pubDate>
      <link>https://dev.to/zhengxin/understand-windows-azure-storage-architecture-3gfn</link>
      <guid>https://dev.to/zhengxin/understand-windows-azure-storage-architecture-3gfn</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Lz70-OAX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/6pbX8xp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Lz70-OAX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/6pbX8xp.jpg" alt="taylor-vick-M5tzZtFCOfs-unsplash" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to Windows Azure Storage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Windows Azure Storage (WAS)&lt;/strong&gt; is Microsoft's answer to the rising demand for robust and accessible cloud storage solutions. This service allows users to store vast amounts of data indefinitely while ensuring accessibility from anywhere at any given time. Its diverse storage options include Blobs for files, Tables for structured data, and Queues for message delivery.&lt;/p&gt;

&lt;p&gt;The strength of WAS lies in its commitment to data resilience, with features such as local and geographic replication. This ensures data remains intact, even in the face of disasters. Additionally, the system prides itself on scalability, with a partitioned global namespace allowing consistent data storage and access from any global location. Other commendable features include multi-tenancy, strong consistency, and cost-effective storage options.&lt;/p&gt;

&lt;p&gt;Delve deeper into WAS's inner workings in this &lt;a href="https://azure.microsoft.com/en-us/blog/sosp-paper-windows-azure-storage-a-highly-available-cloud-storage-service-with-strong-consistency/"&gt;SOSP Paper&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  An Overview of WAS's Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2kwlDyJa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/MhmYCjA.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2kwlDyJa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/MhmYCjA.jpg" alt="" width="800" height="544"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Within the Azure ecosystem, a user can set up one or more Storage Accounts, each identified by a unique name. On account creation, the Location Service assigns it to a primary Storage Stamp and creates a DNS record mapping &lt;em&gt;AccountName.service.core.windows.net&lt;/em&gt; to the Virtual IP (VIP) of that Storage Stamp. Users can then communicate directly with this Storage Stamp for their storage needs.&lt;/p&gt;
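&lt;p&gt;A toy model of this account-to-stamp mapping may help; it is illustrative only, since real placement decisions weigh load, geography, and capacity rather than the round-robin used here.&lt;/p&gt;

```python
class LocationService:
    """Toy Location Service: assigns each new account to a primary
    Storage Stamp and records the DNS name for its VIP."""

    def __init__(self, stamps):
        self.stamps = stamps      # stamp name mapped to its VIP
        self.dns = {}             # hostname mapped to a VIP
        self._next = 0

    def create_account(self, account_name, service="blob"):
        # Pick a primary stamp round-robin for simplicity.
        names = list(self.stamps)
        stamp = names[self._next % len(names)]
        self._next += 1
        host = f"{account_name}.{service}.core.windows.net"
        self.dns[host] = self.stamps[stamp]
        return host
```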

&lt;p&gt;To understand the structure further, consider each Storage Stamp as a cluster of nodes, often spread across various racks within a data center. This is where Azure users will store their data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delving into the Storage Stamp
&lt;/h2&gt;

&lt;p&gt;A Storage Stamp consists of three main layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Front-Ends&lt;/strong&gt;: These act as intermediaries, receiving client requests and forwarding them to the suitable partition server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition Layer&lt;/strong&gt;: This layer processes high-level data abstractions, such as Blob, Table, and Queue. It ensures transaction sequencing and strong consistency while operating atop the stream layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Layer&lt;/strong&gt;: Essentially the base, this layer saves data bits on disks. It's tasked with distributing and replicating data across multiple servers, ensuring data durability within a storage stamp.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  A Closer Look at the Stream Layer
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M5Ka-Xzm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/il8duc1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M5Ka-Xzm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/il8duc1.jpg" alt="" width="800" height="630"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This layer is split into two core components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stream Manager (SM)&lt;/strong&gt;: The SM oversees the Extent Nodes (ENs). It is in charge of actions like creating new extents, assigning them to ENs, and garbage collection. It also uses Paxos to keep its own state consistent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extent Node (EN)&lt;/strong&gt;: Every EN governs a group of disks, maintaining the actual data storage. ENs communicate amongst themselves for the purpose of data replication.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When the Partition Layer seeks to create a new data extent, it asks the SM to assign three ENs. Data is first sent to the primary EN, which acknowledges a successful transaction only once the data has been replicated to the secondary ENs. This intra-stamp replication is synchronous, ensuring internal errors like disk failures or power outages don't result in data loss.&lt;/p&gt;
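&lt;p&gt;The acknowledgment rule can be sketched as follows. This is a simplification: the real stream layer replicates appends over RPC, with extent sealing and failure handling that are omitted here.&lt;/p&gt;

```python
class ExtentNode:
    """Toy EN: stores appended blocks in memory."""
    def __init__(self):
        self.blocks = []

    def append(self, data) -> bool:
        self.blocks.append(data)
        return True  # in a real system this is an RPC that may fail

def replicate_write(data, primary, secondaries) -> bool:
    """The primary EN acknowledges a write only after every secondary
    EN has durably stored it."""
    primary.append(data)
    for en in secondaries:
        if not en.append(data):
            return False  # no ack: the client retries or fails over
    return True           # all replicas hold the data
```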

&lt;h3&gt;
  
  
  Examining the Partition Layer
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lC2j3Hyj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/jrcYXJR.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lC2j3Hyj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgur.com/jrcYXJR.jpg" alt="" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Primarily serving the Front-Ends, the Partition Layer structures the Blob, Table, and Queue using the Stream Layer. Its two main components are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Partition Manager (PM)&lt;/strong&gt;: When Front-Ends need to create or delete an object, they use a partition key and an object name. PM then identifies the appropriate Partition Server (PS) for the request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition Server (PS)&lt;/strong&gt;: The PS collaborates with the Stream Layer to organize data and ensure consistency within its partition.&lt;/li&gt;
&lt;/ol&gt;
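&lt;p&gt;Conceptually, the PM's lookup is a search over sorted partition-key ranges. A minimal sketch, with hypothetical range boundaries and server names:&lt;/p&gt;

```python
import bisect

class PartitionMap:
    """Toy PM index: boundaries[i] is the inclusive upper bound of the
    partition-key range served by servers[i]; boundaries are sorted."""

    def __init__(self, boundaries, servers):
        self.boundaries = boundaries
        self.servers = servers

    def lookup(self, partition_key):
        i = bisect.bisect_left(self.boundaries, partition_key)
        return self.servers[min(i, len(self.servers) - 1)]
```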

&lt;p&gt;An essential task for the Partition Layer is its asynchronous replication. This feature replicates data across various stamps. Inter-stamp replication is asynchronous, happening in the background, off the critical path of the user's request. This replication method ensures data storage in diverse geographical locations, fortifying disaster recovery measures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Windows Azure Storage (WAS) stands as a testament to Microsoft's commitment to delivering a seamless and resilient cloud storage experience. I hope this post gives you a quick understanding of its architecture; for a deeper appreciation of this well-designed, large-scale storage system, I recommend reading the original paper.&lt;/p&gt;

</description>
      <category>database</category>
      <category>azure</category>
      <category>storage</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Unpacking LSM-Trees: The Powerhouse Behind Modern Databases</title>
      <dc:creator>zhengxin</dc:creator>
      <pubDate>Sat, 02 Dec 2023 14:24:43 +0000</pubDate>
      <link>https://dev.to/zhengxin/unpacking-lsm-trees-the-powerhouse-behind-modern-databases-4hdo</link>
      <guid>https://dev.to/zhengxin/unpacking-lsm-trees-the-powerhouse-behind-modern-databases-4hdo</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FndFGZZu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FndFGZZu.jpg" alt="nasa-Q1p7bh3SHj8-unsplash"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Log-Structured Merge-trees (LSM-trees)&lt;/strong&gt; are virtually omnipresent in today's database systems, spanning both SQL and NoSQL architectures. They are the storage layer backbone of various high-profile databases like BigTable, Dynamo, HBase, Cassandra, LevelDB, RocksDB, and AsterixDB, to name a few.&lt;/p&gt;

&lt;p&gt;Designed for write-intensive workloads, an LSM-tree is essentially made up of two segments: an in-memory component and an on-disk component. The in-memory part gets updated with every operation, and data gets flushed to the on-disk component once certain conditions are met.&lt;/p&gt;

&lt;p&gt;In this post, we'll walk you through the evolution of LSM-trees, discuss why they're all the rage, delve into their key components, explore common optimizations, and take a closer look at LevelDB's implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  History of LSM-trees
&lt;/h2&gt;

&lt;p&gt;Discussing database systems often boils down to two primary data update strategies: in-place and out-of-place updates.&lt;/p&gt;

&lt;p&gt;In-place updates involve direct modifications to the data on disk, while out-of-place updates append new data values to a different disk location, leaving the original values unchanged.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FmX1S1jj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FmX1S1jj.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above diagram, when the value of key &lt;code&gt;k1&lt;/code&gt; is updated from &lt;code&gt;v1&lt;/code&gt; to &lt;code&gt;v4&lt;/code&gt;, the in-place strategy overwrites the value directly, while the out-of-place strategy appends a new &lt;code&gt;(k1,v4)&lt;/code&gt; pair to a separate location and leaves the original entry unchanged.&lt;/p&gt;

&lt;p&gt;LSM-tree takes the out-of-place strategy and is designed to be write-optimized. With such context, you can understand the rise of LSM-trees, which is influenced by three key trends:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Modern applications are storing increasingly more data, which is further fueled by the decreasing price of storage and memory.&lt;/li&gt;
&lt;li&gt;As a result, many applications perform more insertions than read queries in their business logic.&lt;/li&gt;
&lt;li&gt;The global move to cloud-based data management further favors immutability-based systems.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This shift toward "more writes than reads" in modern data stores is the root reason LSM-trees are so widely preferred today.&lt;/p&gt;

&lt;p&gt;The LSM-tree model was first introduced by Patrick O'Neil et al. in their 1996 paper "The Log-Structured Merge-Tree (LSM-Tree)." A decade later, in 2006, Google released its pivotal Bigtable paper, which had a profound impact on the database and big data communities: Bigtable employs LSM-trees to manage the tablets that store the actual data. Since then, LSM-trees have gained widespread adoption as the storage layer in numerous NoSQL and even some SQL databases, and they remain an active area of research.&lt;/p&gt;

&lt;h2&gt;
  
  
  Today's LSM-trees
&lt;/h2&gt;

&lt;p&gt;Today's LSM-trees generally consist of two main components: MemTable and SSTable.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: While not all LSM-trees are built with MemTables and SSTables, these components are the most commonly used. This section outlines a typical LSM-tree architecture, using LevelDB as an illustrative example. Each data storage system may have its own unique implementation, but the core concepts remain consistent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FDeo5BUP.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FDeo5BUP.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the LSM-tree structure, the MemTable serves as a temporary in-memory buffer for incoming write operations. This write buffer stores data in RAM, allowing for quick writes. When the MemTable fills up or a specific trigger occurs, its data is sorted and written to disk as an SSTable (Sorted String Table). This batch processing minimizes disk I/O and is particularly advantageous in write-heavy applications. After flushing, a new MemTable is created to host subsequent incoming requests.&lt;/p&gt;
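&lt;p&gt;A minimal sketch of this MemTable life cycle (illustrative only; LevelDB's actual memtable is skiplist-based and flushing runs in the background):&lt;/p&gt;

```python
class MemTable:
    """Minimal memtable: buffered writes, sorted on flush."""

    def __init__(self, limit=4):
        self.data = {}
        self.limit = limit

    def put(self, key, value) -> bool:
        self.data[key] = value
        return len(self.data) >= self.limit  # True means: flush now

    def flush(self):
        # An SSTable here is just the sorted snapshot of the buffer.
        sstable = sorted(self.data.items())
        self.data = {}  # fresh memtable for subsequent writes
        return sstable
```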

&lt;blockquote&gt;
&lt;p&gt;Generally, there's only one active MemTable in the system at any given time, as depicted in the figure above.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;SSTables are immutable files containing a sequence of sorted key-value pairs. They offer several features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key Ordering&lt;/strong&gt;: Keys are sorted, facilitating efficient range queries and ordered data retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexed&lt;/strong&gt;: An index often accompanies an SSTable, enabling fast key lookups by pointing to specific data positions within the file.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FpsL38FX.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FpsL38FX.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SSTables are grouped into "levels" on disk, and each level typically holds more data than the one above it (&lt;em&gt;level L+1&lt;/em&gt; is larger than &lt;em&gt;level L&lt;/em&gt;). When a MemTable is flushed to disk, it creates a new SSTable at level 0. Multiple SSTables can exist at this level, and their key ranges may overlap, as shown above.&lt;/p&gt;

&lt;p&gt;A compaction is triggered when the number of SSTables in a level reaches a certain threshold. In such a case, an SSTable from &lt;em&gt;level L&lt;/em&gt; is chosen, along with any overlapping SSTables from &lt;em&gt;level L+1&lt;/em&gt;. These are merged to form new SSTables at &lt;em&gt;level L+1&lt;/em&gt;.&lt;/p&gt;
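&lt;p&gt;The selection-and-merge step can be sketched as follows. SSTables are modeled here as a key range plus sorted pairs, which glosses over on-disk file formats, tombstones, and size-based splitting of the merged output:&lt;/p&gt;

```python
def overlaps(a, b) -> bool:
    """True if key ranges a and b, given as (lo, hi), intersect."""
    return a[1] >= b[0] and b[1] >= a[0]

def compact(table, next_level):
    """Merge one SSTable from level L into level L+1.

    Each SSTable is modeled as ((lo, hi), sorted_pairs). Returns the
    new contents of level L+1.
    """
    picked = [t for t in next_level if overlaps(table[0], t[0])]
    merged = {}
    for _, pairs in picked:
        merged.update(pairs)      # older (level L+1) data first ...
    merged.update(table[1])       # ... so newer (level L) values win
    pairs = sorted(merged.items())
    survivors = [t for t in next_level if t not in picked]
    return survivors + [((pairs[0][0], pairs[-1][0]), pairs)]
```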

&lt;blockquote&gt;
&lt;p&gt;Compactions are performed in the background, ensuring that read and write operations aren't interrupted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2F5bPx2kK.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2F5bPx2kK.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For instance, if an SSTable in &lt;em&gt;level L&lt;/em&gt; has a key range of 101-150, any SSTables in &lt;em&gt;level L+1&lt;/em&gt; whose key ranges overlap it are also selected for merging; in the figure above, those are the tables covering &lt;code&gt;31-120&lt;/code&gt; and &lt;code&gt;121-150&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FOGmegbq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.imgur.com%2FOGmegbq.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;New SSTables are generated at &lt;em&gt;level L+1&lt;/em&gt; from the merge of the three selected SSTables. How many SSTables are produced depends not only on the key ranges but also on the data size for each key; the primary goal is to keep SSTable sizes uniform. After compaction, the SSTable at &lt;em&gt;level L&lt;/em&gt; is removed, and its data now lives in &lt;em&gt;level L+1&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;As you can see, except for level 0, &lt;strong&gt;SSTables at each level have non-overlapping key ranges&lt;/strong&gt;. This design allows for efficient point or range queries, as each key can be located without scanning multiple SSTables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimization on LSM-trees
&lt;/h2&gt;

&lt;p&gt;While each storage system may have its unique optimizations, several common techniques are often employed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bloom Filters&lt;/strong&gt;: Minimize disk reads by quickly telling whether a key might exist in an SSTable; files that definitely do not contain the key are skipped entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compaction Strategies&lt;/strong&gt;: Tailored to reduce write amplification and read latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitioning&lt;/strong&gt;: Enhances concurrency and reduces contention by dividing data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching&lt;/strong&gt;: Strategies are in place to limit disk access and boost performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexing&lt;/strong&gt;: Auxiliary structures are employed to speed up lookups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compression&lt;/strong&gt;: Saves space and improves I/O efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tuning Size Ratios&lt;/strong&gt;: Balances read and write performance by adjusting the size ratios between levels.&lt;/li&gt;
&lt;/ul&gt;
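&lt;p&gt;To show why the first of these helps, here is a tiny Bloom filter sketch: a negative answer is definitive, so the corresponding SSTable need not be read at all. The bit count and hash count below are arbitrary, not tuned values.&lt;/p&gt;

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, small false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.bitset = [False] * num_bits
        self.num_hashes = num_hashes

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.bitset)

    def add(self, key):
        for p in self._positions(key):
            self.bitset[p] = True

    def might_contain(self, key) -> bool:
        # False means the key is definitely absent: skip the SSTable.
        return all(self.bitset[p] for p in self._positions(key))
```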

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In today's data-rich environment, LSM-trees serve as a sturdy foundation for modern storage systems. Understanding LSM-trees is more than just a primer; it's essential for grasping how contemporary data storage operates. Through this blog, my aim is to equip you with the fundamentals of LSM-trees, so you'll have a clearer understanding the next time you encounter them.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Chen Luo and Michael J. Carey. 2020. &lt;strong&gt;LSM-based Storage Techniques: A Survey.&lt;/strong&gt; The VLDB Journal 29, 1 (January 2020), 393–418. DOI:&lt;a href="https://doi.org/10.1007/s00778-019-00555-y" rel="noopener noreferrer"&gt;https://doi.org/10.1007/s00778-019-00555-y&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil. 1996. &lt;strong&gt;The log-structured merge-tree (LSM-tree)&lt;/strong&gt;. Acta Informatica 33, 4 (June 1996), 351–385. DOI:&lt;a href="https://doi.org/10.1007/s002360050048" rel="noopener noreferrer"&gt;https://doi.org/10.1007/s002360050048&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] Subhadeep Sarkar and Manos Athanassoulis. 2022. &lt;strong&gt;Dissecting, Designing, and Optimizing LSM-based Data Stores.&lt;/strong&gt; In Proceedings of the 2022 International Conference on Management of Data, ACM, Philadelphia PA USA, 2489–2497. DOI:&lt;a href="https://doi.org/10.1145/3514221.3522563" rel="noopener noreferrer"&gt;https://doi.org/10.1145/3514221.3522563&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] leveldb/doc/impl.md at main · &lt;strong&gt;google/leveldb&lt;/strong&gt;. GitHub. Retrieved October 21, 2023 from &lt;a href="https://github.com/google/leveldb/blob/main/doc/impl.md" rel="noopener noreferrer"&gt;https://github.com/google/leveldb/blob/main/doc/impl.md&lt;/a&gt;&lt;/p&gt;

</description>
      <category>lsmtree</category>
      <category>database</category>
      <category>datastructures</category>
    </item>
  </channel>
</rss>
