<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Roman Belshevitz</title>
    <description>The latest articles on DEV Community by Roman Belshevitz (@rbelshevitz).</description>
    <link>https://dev.to/rbelshevitz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F740774%2F6bfe5f6f-8b94-487a-b524-b407e715aefc.png</url>
      <title>DEV Community: Roman Belshevitz</title>
      <link>https://dev.to/rbelshevitz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rbelshevitz"/>
    <language>en</language>
    <item>
      <title>Das U-Boot: from Power-On to initrd</title>
      <dc:creator>Roman Belshevitz</dc:creator>
      <pubDate>Wed, 07 May 2025 15:01:05 +0000</pubDate>
      <link>https://dev.to/rbelshevitz/from-power-on-to-initrd-with-u-boot-35da</link>
      <guid>https://dev.to/rbelshevitz/from-power-on-to-initrd-with-u-boot-35da</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/u-boot/u-boot" rel="noopener noreferrer"&gt;Das U-Boot&lt;/a&gt; is a well-known bootloader which brings embedded Linux devices to life since 1999, it turns 25 this year. &lt;/p&gt;

&lt;p&gt;Today's post walks through the complete U-Boot boot process, covering everything from SoC power-on to launching the Linux &lt;code&gt;initrd&lt;/code&gt;, along with hardware-specific gotchas.&lt;/p&gt;




&lt;h2&gt;
  
  
  The boot flow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;Power On]
   ↓
&lt;span class="o"&gt;[&lt;/span&gt;SoC BootROM] → &lt;span class="o"&gt;[&lt;/span&gt;SPL]
   ↓
&lt;span class="o"&gt;[&lt;/span&gt;U-Boot Proper]
   ↓
&lt;span class="o"&gt;[&lt;/span&gt;Loads kernel + dtb + initrd]
   ↓
&lt;span class="o"&gt;[&lt;/span&gt;bootz / booti]
   ↓
&lt;span class="o"&gt;[&lt;/span&gt;Linux Kernel starts]
   ↓
&lt;span class="o"&gt;[&lt;/span&gt;initrd: /init runs]
   ↓
&lt;span class="o"&gt;[&lt;/span&gt;Switch to rootfs]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  🔌 1. Power-On: SoC BootROM
&lt;/h3&gt;

&lt;p&gt;Every SoC contains a hardcoded BootROM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Executes right after power-on or reset&lt;/li&gt;
&lt;li&gt;Detects the boot source via boot pins&lt;/li&gt;
&lt;li&gt;Loads the Secondary Program Loader (SPL) from flash, SD, UART, or USB&lt;/li&gt;
&lt;li&gt;Minimal hardware is initialized here&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  📦 2. SPL (Secondary Program Loader)
&lt;/h3&gt;

&lt;p&gt;SPL is a tiny version of U-Boot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Brings up DRAM and essential regulators&lt;/li&gt;
&lt;li&gt;Loads full U-Boot into RAM&lt;/li&gt;
&lt;li&gt;Operates within very constrained SRAM limits&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;em&gt;Hardware nuance:&lt;/em&gt; If DRAM init fails here, the system will hang silently. Always verify timing parameters for your DDR/LPDDR.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  🚀 3. U-Boot Proper
&lt;/h3&gt;

&lt;p&gt;Now the full U-Boot runs from RAM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initializes serial console, eMMC/SD, network, USB, etc.&lt;/li&gt;
&lt;li&gt;Parses environment variables (&lt;code&gt;bootargs&lt;/code&gt;, &lt;code&gt;bootcmd&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Loads:

&lt;ul&gt;
&lt;li&gt;Linux kernel (&lt;code&gt;zImage&lt;/code&gt; or &lt;code&gt;Image&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Device Tree Blob (&lt;code&gt;.dtb&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;initrd (&lt;code&gt;initramfs&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Starts the boot using &lt;code&gt;bootz&lt;/code&gt; or &lt;code&gt;booti&lt;/code&gt;
&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bootz &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;kernel_addr_r&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ramdisk_addr_r&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;fdt_addr_r&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Regarding boot device order, modern U-Boot uses the so-called "Distro Boot" framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;bootcmd&lt;/code&gt; runs &lt;code&gt;distro_bootcmd&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;distro_bootcmd&lt;/code&gt; iterates over &lt;code&gt;boot_targets&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;distro_bootcmd&lt;/code&gt; mechanism in U-Boot supports booting using configuration files like &lt;code&gt;extlinux.conf&lt;/code&gt; and scripts like &lt;code&gt;boot.scr&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;extlinux.conf&lt;/code&gt;: A configuration file that specifies kernel, initrd, and device tree paths, allowing for multiple boot entries and parameters.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;boot.scr&lt;/code&gt;: A compiled script containing U-Boot commands, offering a way to automate complex boot sequences.&lt;/li&gt;
&lt;/ul&gt;
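
&lt;p&gt;A minimal &lt;code&gt;extlinux.conf&lt;/code&gt; might look like this (the label, file paths, console device, and root partition below are illustrative; adjust them to your board):&lt;/p&gt;

```plaintext
DEFAULT primary
TIMEOUT 30

LABEL primary
  KERNEL /boot/zImage
  FDT /boot/board.dtb
  INITRD /boot/initramfs.cpio.gz
  APPEND console=ttyS0,115200 root=/dev/mmcblk0p2 rw
```

&lt;p&gt;U-Boot's distro boot scans each boot partition for this file at &lt;code&gt;/extlinux/extlinux.conf&lt;/code&gt; or &lt;code&gt;/boot/extlinux/extlinux.conf&lt;/code&gt;.&lt;/p&gt;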

&lt;p&gt;On each candidate device, U-Boot looks for known boot files: &lt;code&gt;extlinux.conf&lt;/code&gt;, &lt;code&gt;boot.scr&lt;/code&gt;, &lt;code&gt;boot.ini&lt;/code&gt;, or standard kernel/initrd/fdt triplets. The usual boot device order is first eMMC/SD, then first USB, then first NVMe.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;setenv&lt;/code&gt; command in U-Boot is used to define or override environment variables, which control nearly every aspect of how U-Boot boots your system.&lt;/p&gt;

&lt;p&gt;🔧 Basic Syntax&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;setenv variable_name value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;setenv bootargs &lt;span class="s2"&gt;"console=ttyS0,115200 root=/dev/mmcblk0p2 rw"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To persist changes, run &lt;code&gt;saveenv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This writes changes to persistent storage (e.g., SPI NOR, eMMC, NAND, or a UBI volume). If you don't run &lt;code&gt;saveenv&lt;/code&gt;, any updates you may have made to the U-Boot environment will be lost on next system reboot.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Use &lt;code&gt;printenv&lt;/code&gt; and &lt;code&gt;bdinfo&lt;/code&gt; to check variables and memory layout. Be careful: U-Boot does not validate variable syntax, so typos can silently break the boot! To recover from a bad &lt;code&gt;bootcmd&lt;/code&gt; or &lt;code&gt;bootargs&lt;/code&gt;, use the serial console and interrupt boot early (usually by pressing a key).&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  🐧 4. Linux Kernel Boot
&lt;/h3&gt;

&lt;p&gt;The kernel is now in control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unpacks itself into RAM&lt;/li&gt;
&lt;li&gt;Parses command-line and DTB&lt;/li&gt;
&lt;li&gt;Mounts &lt;code&gt;initrd&lt;/code&gt; as the rootfs&lt;/li&gt;
&lt;li&gt;Runs &lt;code&gt;/init&lt;/code&gt; script from initrd&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 If any addresses (kernel/initrd/fdt) overlap in RAM, the kernel &lt;strong&gt;may crash&lt;/strong&gt;. Always double-check memory layout!&lt;/p&gt;
&lt;/blockquote&gt;
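
&lt;p&gt;The overlap rule itself is easy to check on any host before you bake addresses into the environment. A minimal sketch in plain shell; the kernel and initrd addresses and sizes below are made up for illustration:&lt;/p&gt;

```shell
# Succeeds (exit 0) if [startA, startA+sizeA) overlaps [startB, startB+sizeB)
ranges_overlap() {
  a_end=$(( $1 + $2 ))
  b_end=$(( $3 + $4 ))
  if [ "$a_end" -le "$3" ]; then return 1; fi
  if [ "$b_end" -le "$1" ]; then return 1; fi
  return 0
}

# Hypothetical layout: 8 MiB kernel at 0x80200000, 16 MiB initrd at 0x84000000
if ranges_overlap $((0x80200000)) $((0x800000)) $((0x84000000)) $((0x1000000)); then
  echo "OVERLAP: adjust kernel_addr_r / ramdisk_addr_r"
else
  echo "layout OK"
fi
```

&lt;p&gt;With the example numbers above the script prints &lt;code&gt;layout OK&lt;/code&gt;; don't forget that the decompressed kernel and the unpacked initramfs are larger than the files you load.&lt;/p&gt;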




&lt;h3&gt;
  
  
  🐧 5. What Happens in initrd?
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;initrd&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads drivers/modules&lt;/li&gt;
&lt;li&gt;Mounts the actual root filesystem (e.g., &lt;code&gt;/dev/mmcblk0p2&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;May perform decryption, overlay setup, or network discovery&lt;/li&gt;
&lt;li&gt;Executes &lt;code&gt;switch_root&lt;/code&gt; or &lt;code&gt;pivot_root&lt;/code&gt; to hand off to your real system (e.g., &lt;code&gt;systemd&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
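
&lt;p&gt;Put together, a schematic &lt;code&gt;/init&lt;/code&gt; covering these duties might look like the sketch below. It is only meant to run as PID 1 inside an initramfs on the target, never on a build host, and the module and device names are placeholders:&lt;/p&gt;

```shell
#!/bin/sh
# Schematic initramfs /init -- a sketch, not a drop-in script.

# Pseudo-filesystems that early userspace needs
mount -t proc none /proc
mount -t sysfs none /sys
mount -t devtmpfs none /dev

# Load storage drivers shipped inside the initramfs (placeholder module)
modprobe mmc_block || true

# Mount the actual root filesystem (device name is illustrative)
mount /dev/mmcblk0p2 /mnt/root

# Hand off PID 1 to the real init system
exec switch_root /mnt/root /sbin/init
```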




&lt;h2&gt;
  
  
  ⚙️ Hardware-Related Pitfalls
&lt;/h2&gt;

&lt;p&gt;Some common embedded issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DRAM doesn't initialize? &lt;a href="https://docs.u-boot.org/en/latest/develop/memory.html" rel="noopener noreferrer"&gt;Wrong timing in SPL&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Kernel hangs early? &lt;a href="https://lists.denx.de/pipermail/u-boot/2008-May/034686.html" rel="noopener noreferrer"&gt;Overlapping memory regions&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Ethernet/MAC not working? &lt;a href="https://samuel.dionne-riel.com/blog/2024/12/05/dtb-loading-is-harder-than-it-looks.html" rel="noopener noreferrer"&gt;Wrong or mismatched DTB&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;NAND boot fails? &lt;a href="https://adaptivesupport.amd.com/s/question/0D52E00006hpLPbSAM/enabling-ecc-results-in-kernel-panic-during-booting-with-initramfs?language=en_US" rel="noopener noreferrer"&gt;Missing ECC config in U-Boot&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Secure boot halts execution? &lt;a href="https://docs.u-boot.org/en/v2025.01/usage/fit/verified-boot.html" rel="noopener noreferrer"&gt;U-Boot binary not signed&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Sources: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.u-boot.org/en/latest/" rel="noopener noreferrer"&gt;https://docs.u-boot.org/en/latest/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://samuel.dionne-riel.com/blog/2024/12/05/dtb-loading-is-harder-than-it-looks.html" rel="noopener noreferrer"&gt;https://samuel.dionne-riel.com/blog/2024/12/05/dtb-loading-is-harder-than-it-looks.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18842223/U-boot" rel="noopener noreferrer"&gt;Xilinx Wiki: U-boot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://microchip.my.site.com/s/article/SAMA5D3-EDS---Modifying-Device-Tree-Overlays-in-U-Boot-Prompt-using--fdt--utility" rel="noopener noreferrer"&gt;Microchip: Modifying Device Tree Overlays in U-Boot Prompt using 'fdt' utility&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Todo: add more context; describe ARM related nuances...&lt;br&gt;
Cover pic: Pixabay / Pexels&lt;/p&gt;

</description>
      <category>linux</category>
      <category>embedded</category>
    </item>
    <item>
      <title>Landing From Clouds: Why On-Premise Will Eventually Win... In a Large Number of Cases</title>
      <dc:creator>Roman Belshevitz</dc:creator>
      <pubDate>Mon, 05 May 2025 21:36:26 +0000</pubDate>
      <link>https://dev.to/rbelshevitz/landing-from-clouds-why-on-premise-will-eventually-win-4j7o</link>
      <guid>https://dev.to/rbelshevitz/landing-from-clouds-why-on-premise-will-eventually-win-4j7o</guid>
      <description>&lt;p&gt;&lt;em&gt;Update: Well, folks, I edited the title. No one spoke so categorically. Clouds will remain in their niches for some time. But these are strictly niches. In the article I explain why the hype will subside.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“The computer industry is the only industry that is more fashion-driven than women’s fashion.”&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Larry Ellison&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In our era, the cloud is the haute couture of infrastructure. Everyone wants it. Everyone says you &lt;em&gt;need&lt;/em&gt; it. And few ask whether it actually fits.&lt;/p&gt;

&lt;p&gt;This post is for those who remember that &lt;a href="https://news.ycombinator.com/item?id=37965142" rel="noopener noreferrer"&gt;not every problem needs 12 layers of abstraction&lt;/a&gt;. It's for careful builders who believe that &lt;strong&gt;owning your tools&lt;/strong&gt; is better than &lt;strong&gt;renting your soul&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And here are &lt;strong&gt;11 reasons&lt;/strong&gt; why on-premise will outlast the hype.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Cloud Is Renting the Same Hardware for Triple the Price
&lt;/h2&gt;

&lt;p&gt;The cloud sells convenience, but charges premium rent, &lt;strong&gt;forever&lt;/strong&gt;. When you buy servers, you own an asset. When you rent a VM, you feed a meter. The cloud &lt;strong&gt;is like living in a hotel room&lt;/strong&gt; and bragging you don’t have to fix the faucet. &lt;/p&gt;

&lt;p&gt;Sure, but you’re paying 10× the mortgage and still can’t open a window.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. DevOps Became a Way to Work 24/7
&lt;/h2&gt;

&lt;p&gt;DevOps was meant to unify teams. Instead, it blurred boundaries: developer and admin, work and sleep, code and ops. Now every developer is also on call. This "you build it, you run it" culture leads to burnout, not harmony.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Complexity Is Not a Virtue
&lt;/h2&gt;

&lt;p&gt;In the old days, a deployment was &lt;code&gt;make install&lt;/code&gt;. Now it's a procession of YAML files, container registries, ephemeral environments, secrets managers, Terraform pipelines, and CI/CD rituals that often do less than a Bash script with &lt;code&gt;rsync&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The industry calls this &lt;em&gt;“modern infrastructure”&lt;/em&gt;. I call it &lt;em&gt;complexity theater&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Many engineers grew up on the UNIX philosophy — small, composable tools, each doing one job well. But this doesn’t mean they signed up for a balkanized mesh of microservices, &lt;strong&gt;each with its own language, API, database, and dev team&lt;/strong&gt;. Microservices often don’t simplify — they multiply: bugs, failure points, logging silos, and deployment dependencies.&lt;/p&gt;

&lt;p&gt;Abstraction without necessity is just friction in disguise. Ask yourself: Do you need ten services—or just ten functions?&lt;/p&gt;

&lt;h2&gt;
  
  
  4. You Will Regret Vendor Lock-In
&lt;/h2&gt;

&lt;p&gt;When you rent your stack, the landlord can raise the rent—or remove the plumbing. Cloud APIs change, regions go down, and cost models shift. &lt;/p&gt;

&lt;p&gt;Take &lt;strong&gt;CoreWeave&lt;/strong&gt;, a cloud provider for AI workloads: they’ve accumulated &lt;strong&gt;$7.5 billion in debt&lt;/strong&gt;, &lt;a href="https://www.ft.com/content/163c6927-2032-4346-857e-8e3787e4babc" rel="noopener noreferrer"&gt;much of it due to aggressive infrastructure expansion&lt;/a&gt; without sustainable financial models. They now face &lt;strong&gt;$1 billion/year in interest&lt;/strong&gt;. Growth without sovereignty is a debt spiral.  &lt;/p&gt;

&lt;h2&gt;
  
  
  5. Not Every Company Needs Speed. Some Need Sanity
&lt;/h2&gt;

&lt;p&gt;There’s a cult of speed in DevOps: deploy 50 times a day! But most companies — municipalities, utilities, banks — don’t need to push &lt;strong&gt;hourly patches&lt;/strong&gt;. They need reliability.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;2025 Vertice survey&lt;/strong&gt; showed that &lt;strong&gt;55% of CFOs&lt;/strong&gt; &lt;a href="https://www.cfodive.com/news/runaway-cloud-spending-frustrates-finance-execs-vertice/694706" rel="noopener noreferrer"&gt;saw cloud spending increase year-over-year&lt;/a&gt;, and &lt;strong&gt;24% called it “significant.”&lt;/strong&gt; These weren’t startup execs—they were finance leaders. The "move fast" mantra, unchecked, has real financial consequences.  &lt;/p&gt;

&lt;h2&gt;
  
  
  6. Data Wants to Stay Home
&lt;/h2&gt;

&lt;p&gt;For many industries, data is too large or too sensitive to outsource. Petabytes of sensor data, medical records, or real-time video don’t want to live in some abstract region. They want proximity. They want privacy. And often, regulators &lt;strong&gt;demand it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When regulators ask, “Where is the data?” — you can’t answer “uh... &lt;code&gt;us-east-1&lt;/code&gt;”. You need direct control. On-premise allows you to define, audit, and defend your policies with confidence. &lt;/p&gt;

&lt;p&gt;Compliance is not a checkbox, and it is not something to outsource. Regulatory compliance is a legal and financial time bomb, especially when you’re storing personal data somewhere in the cloud. Frameworks like GDPR in Europe and HIPAA in the U.S. demand strict control over where data resides, who can access it, and how it’s processed. Cloud vendors offer compliance-ready services, but when breaches or misconfigurations occur, it’s your organization, not AWS, GCP, or Azure, that writes the check.&lt;/p&gt;

&lt;p&gt;Consider these real-world reminders:&lt;/p&gt;

&lt;p&gt;💰 &lt;em&gt;TikTok's €530 Million Fine:&lt;/em&gt; In May 2025, Ireland's Data Protection Commission &lt;a href="https://www.theguardian.com/technology/2025/may/02/tiktok-fined-530m-for-failing-to-protect-user-data-from-chinese-state" rel="noopener noreferrer"&gt;fined TikTok €530 million&lt;/a&gt; for unlawfully transferring European user data to China without adequate safeguards, violating the General Data Protection Regulation (GDPR). The investigation revealed that TikTok failed to ensure sufficient data protection measures once the data was transferred, posing significant risks under Chinese surveillance laws. &lt;/p&gt;

&lt;p&gt;💰 &lt;em&gt;European Commission's Breach with Microsoft 365:&lt;/em&gt; In March 2024, the European Data Protection Supervisor found that the European Commission's use of Microsoft 365 &lt;a href="https://www.edps.europa.eu/press-publications/press-news/press-releases/2024/european-commissions-use-microsoft-365-infringes-data-protection-law-eu-institutions-and-bodies_en" rel="noopener noreferrer"&gt;infringed several key data protection rules&lt;/a&gt;. The Commission failed to provide appropriate safeguards for personal data transferred outside the EU/EEA, leading to a suspension order on data flows resulting from its use of Microsoft 365 to countries not covered by an adequacy decision. &lt;/p&gt;

&lt;p&gt;💰 &lt;em&gt;British Airways' £20 Million Fine:&lt;/em&gt; British Airways &lt;a href="https://www.theguardian.com/business/2020/oct/16/ba-fined-record-20m-for-customer-data-breach" rel="noopener noreferrer"&gt;faced a £20 million fine&lt;/a&gt; from the UK's Information Commissioner's Office after a 2018 data breach compromised the personal data of over 400,000 customers. The breach was attributed to poor security arrangements, including the storage of payment card details in plaintext and the use of outdated software, highlighting the airline's failure to protect customer data adequately. &lt;/p&gt;

&lt;p&gt;The moral: &lt;strong&gt;HIPAA and GDPR compliance&lt;/strong&gt; in the cloud isn’t automatic—and it certainly isn’t cheap. Keeping sensitive data on-premise, where access and residency are explicitly controlled, isn't just prudent—it might be the only defensible option when auditors come knocking.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. On-Prem Is Easier to Predict and Debug
&lt;/h2&gt;

&lt;p&gt;Multi-tenant clouds suffer from "noisy neighbors," surprise throttling, and vague incidents. With bare metal, what you provision is what you get.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;2025 report by Azul Systems&lt;/strong&gt; &lt;a href="https://www.cio.com/article/3957766/cios-are-overspending-on-the-cloud-but-still-think-its-worth-it.html" rel="noopener noreferrer"&gt;found that&lt;/a&gt; &lt;strong&gt;83% of CIOs overspent on the cloud&lt;/strong&gt;, and &lt;strong&gt;nearly half exceeded budgets by 25% or more&lt;/strong&gt;. You can’t fix what you can’t measure, and cloud abstraction blinds even the best teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. The Cloud Is Not Environmentally Holy
&lt;/h2&gt;

&lt;p&gt;The cloud is sold as green, but hyperscale datacenters consume enormous power, often sourced from fossil fuels. Meanwhile, localized, efficient, low-power on-premise servers (especially ARM-based) can be far more sustainable in specific workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Reclaiming Infrastructure Is Moral
&lt;/h2&gt;

&lt;p&gt;Cloud Capital, a FinOps startup, recently raised $7.7M to combat what they estimate is a &lt;strong&gt;$344 billion/year overspending crisis&lt;/strong&gt; in cloud infrastructure, &lt;a href="https://techstartups.com/2025/04/24/cloud-capital-emerges-from-stealth-with-7-7m-to-tackle-the-344b-cloud-cost-crisis-for-cfos" rel="noopener noreferrer"&gt;projected to hit&lt;/a&gt; &lt;strong&gt;$1 trillion by 2030&lt;/strong&gt;. Their business exists because the DevOps toolchain forgot to add a budget dashboard.  &lt;/p&gt;

&lt;h2&gt;
  
  
  10. Only Rushed Software Farms Need DevOps Teams — And They're Expensive
&lt;/h2&gt;

&lt;p&gt;Let’s be blunt: &lt;strong&gt;DevOps is not cheap&lt;/strong&gt;. You're not just hiring engineers—you’re hiring toolchain babysitters, pipeline janitors, Kubernetes whisperers, and Slack war-room veterans. If your software team ships once a month — or doesn’t need to scale horizontally by Thursday — &lt;strong&gt;why mimic the deployment posture of a fintech unicorn?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most businesses do not need that speed. They need uptime. They need audit trails. They need security. They do not need to pay $160K+ per year for someone to maintain Helm charts and debate YAML indentation style on GitHub.&lt;/p&gt;

&lt;p&gt;DevOps, in reality, serves urgency. But not every team should live in a permanent state of urgency. If you ship carefully and predictably, you don’t need a DevOps team — &lt;strong&gt;you need discipline and a good sysadmin&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  11. Everything Goes to Platforms
&lt;/h2&gt;

&lt;p&gt;What once ran on a server now lives in a platform. Version control? GitHub. Monitoring? Datadog. Deployments? Vercel. CI/CD? GHA &amp;amp; GitLab, until it breaks, then you’re Googling “runners stuck in pending”. The industry is becoming a patchwork of black boxes with dashboards.&lt;/p&gt;

&lt;p&gt;This isn't just infrastructure [as code] — &lt;strong&gt;it’s dependency&lt;/strong&gt;. You don’t maintain software anymore, you rent a slot in someone else’s stack. And every platform adds abstraction, latency, lock-in, and... another monthly invoice.&lt;/p&gt;

&lt;p&gt;Try doing it your way and you're suddenly unsupported, incompatible, or “non-compliant”. There's &lt;strong&gt;no oxygen left for homegrown DevOps&lt;/strong&gt; or pipelines-only “integrations”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The industry has decided: DIY is dangerous&lt;/strong&gt;. Everything must be managed, billed, monitored — as a service.&lt;/p&gt;

&lt;p&gt;But this monoculture &lt;strong&gt;comes at a cost&lt;/strong&gt;. You lose understanding. You lose flexibility. And eventually, you lose the ability to build outside of someone else’s rules. What was once craft is now just configuration.&lt;/p&gt;

&lt;p&gt;On-prem systems, for all their grit, don’t hide from you. You know what it’s doing. You can tune it, fix it, or migrate it — without begging for API rate increases or watching a status page for two hours.&lt;/p&gt;

&lt;p&gt;Everything is becoming a platform. &lt;strong&gt;But platforms are not freedom&lt;/strong&gt;. They are permissioned productivity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Epilogue: The Clouds Will Part
&lt;/h2&gt;

&lt;p&gt;Cloud isn’t evil. But it isn’t holy either. It’s just another tool. The problem is cultural: we've confused speed with value, outsourcing with maturity, and complexity with progress.  &lt;/p&gt;

&lt;p&gt;It’s time to land. Touch the machines again. Grab data back.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover pic: "Piano" (Klavír), 'Pat and Mat' Czech animation series by 'Krátký film Praha' studio.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sysadmin</category>
      <category>myth</category>
    </item>
    <item>
      <title>Accelerating OpenCV with CUDA on Jetson Orin NX: A Complete Build Guide</title>
      <dc:creator>Roman Belshevitz</dc:creator>
      <pubDate>Fri, 21 Feb 2025 14:30:00 +0000</pubDate>
      <link>https://dev.to/rbelshevitz/accelerating-opencv-with-cuda-on-jetson-orin-nx-a-complete-build-guide-525j</link>
      <guid>https://dev.to/rbelshevitz/accelerating-opencv-with-cuda-on-jetson-orin-nx-a-complete-build-guide-525j</guid>
      <description>&lt;h2&gt;
  
  
  What do we have right out of the box?
&lt;/h2&gt;

&lt;p&gt;The NVIDIA Jetson Orin NX is a powerful, community-recognized edge AI platform, designed for real-time computer vision and deep learning applications. &lt;/p&gt;

&lt;p&gt;While &lt;a href="https://developer.nvidia.com/embedded/jetpack-sdk-511" rel="noopener noreferrer"&gt;JetPack 5.1.x&lt;/a&gt; provides an optimized environment with CUDA, cuDNN, and TensorRT, the default &lt;strong&gt;OpenCV&lt;/strong&gt; package in Ubuntu’s repositories does not take full advantage of the GPU. This means that tasks such as object detection, video processing, and feature extraction &lt;strong&gt;run primarily on the CPU&lt;/strong&gt;, significantly limiting performance.&lt;/p&gt;

&lt;p&gt;To unlock the full potential of OpenCV on the Orin NX, we need to build it from source with CUDA and cuDNN enabled. This ensures that image processing and deep learning workloads benefit from GPU acceleration, leading to significant speed improvements. In this guide, we will walk through the entire build process, from installing dependencies to verifying a successful installation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Right OpenCV Version
&lt;/h2&gt;

&lt;p&gt;For JetPack 5.1.x on Ubuntu 20.04, the most robust OpenCV versions are &lt;strong&gt;4.5.5&lt;/strong&gt; and &lt;strong&gt;4.6.0&lt;/strong&gt;. These versions have been tested extensively with CUDA 11 and cuDNN, ensuring compatibility and stability. While newer versions, such as 4.7.x and 4.8.x, are available, they may require additional patches and modifications to work seamlessly on Jetson hardware. I recommend sticking with 4.5.5 unless specific features from newer releases are needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building OpenCV with CUDA
&lt;/h2&gt;

&lt;p&gt;Before we start, it's important to remove any pre-installed OpenCV versions that might interfere with our custom build. &lt;/p&gt;

&lt;p&gt;The default &lt;code&gt;python3-opencv&lt;/code&gt; package from Ubuntu repositories is &lt;em&gt;CPU-only&lt;/em&gt; and does not support CUDA acceleration. To avoid conflicts, remove it along with other OpenCV-related packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt remove --purge -y libopencv-dev libopencv-core-dev libopencv-imgproc-dev python3-opencv
sudo apt autoremove -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After building OpenCV, we must ensure Python correctly loads the CUDA-enabled version. We will set up the &lt;code&gt;PYTHONPATH&lt;/code&gt; accordingly.&lt;/p&gt;
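
&lt;p&gt;For example (the exact site-packages path depends on your Python version and install prefix; the path below assumes Python 3.8 and &lt;code&gt;CMAKE_INSTALL_PREFIX=/usr/local&lt;/code&gt;, so verify it after the build):&lt;/p&gt;

```shell
# Hypothetical install path -- confirm against your actual build output
export PYTHONPATH=/usr/local/lib/python3.8/site-packages:$PYTHONPATH

# Quick sanity check once the build is installed:
python3 -c 'import cv2; print(cv2.__version__, cv2.cuda.getCudaEnabledDeviceCount())'
```

&lt;p&gt;A device count of 1 confirms that Python picked up the CUDA-enabled build.&lt;/p&gt;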

&lt;p&gt;For the Jetson Orin NX, which uses the Ampere architecture, you should adjust the &lt;code&gt;CUDA_ARCH_BIN&lt;/code&gt; to &lt;code&gt;8.7&lt;/code&gt;.&lt;/p&gt;
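
&lt;p&gt;Other Jetson modules use different compute capabilities, so the value is worth double-checking before a multi-hour build. A small helper sketch (the function name is mine, and the mapping covers only the common module families):&lt;/p&gt;

```shell
# Map a Jetson module family to the CUDA_ARCH_BIN value for OpenCV's CMake
cuda_arch_for() {
  case "$1" in
    orin*)      echo "8.7" ;;  # Orin family (Ampere)
    xavier*)    echo "7.2" ;;  # Xavier family (Volta)
    tx2*)       echo "6.2" ;;  # TX2 (Pascal)
    nano*|tx1*) echo "5.3" ;;  # original Nano / TX1 (Maxwell)
    *)          echo "unknown" ;;
  esac
}

cuda_arch_for orin-nx   # prints 8.7
```

&lt;p&gt;When in doubt, confirm the value against NVIDIA's published compute capability tables.&lt;/p&gt;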

&lt;p&gt;💥 Please keep in mind that you will need to download about six hundred megabytes of packages from the Internet! The &lt;code&gt;libcudnn8-dev&lt;/code&gt; package alone weighs 397 MB!&lt;/p&gt;

&lt;p&gt;The following Bash script automates the entire process, ensuring a seamless installation of OpenCV with CUDA and Python bindings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;  &lt;span class="c"&gt;# Exit on error&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-x&lt;/span&gt;  &lt;span class="c"&gt;# Debug mode (prints each command)&lt;/span&gt;

&lt;span class="c"&gt;# Define OpenCV version&lt;/span&gt;
&lt;span class="nv"&gt;OPENCV_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"4.5.5"&lt;/span&gt;

&lt;span class="c"&gt;# Update system packages&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt upgrade &lt;span class="nt"&gt;-y&lt;/span&gt;

&lt;span class="c"&gt;# Install required dependencies&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; build-essential pv cmake ccache git unzip pkg-config &lt;span class="se"&gt;\&lt;/span&gt;
    libjpeg-dev libpng-dev libtiff-dev libavcodec-dev libavformat-dev &lt;span class="se"&gt;\&lt;/span&gt;
    libswscale-dev libv4l-dev v4l-utils libxvidcore-dev libx264-dev &lt;span class="se"&gt;\&lt;/span&gt;
    libgtk-3-dev libcanberra-gtk3-dev libtbb2 libtbb-dev libdc1394-22-dev &lt;span class="se"&gt;\&lt;/span&gt;
    python3-dev python3-numpy python3-pip libopenblas-dev libopenjp2-7-dev liblapack-dev gfortran &lt;span class="se"&gt;\&lt;/span&gt;
    libhdf5-dev libcudnn8-dev

&lt;span class="c"&gt;# Clone OpenCV and contrib modules&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~
git clone &lt;span class="nt"&gt;--branch&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OPENCV_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; https://github.com/opencv/opencv.git
git clone &lt;span class="nt"&gt;--branch&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OPENCV_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; https://github.com/opencv/opencv_contrib.git

&lt;span class="c"&gt;# Create build directory&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/opencv
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;build

&lt;span class="c"&gt;# Configure CMake with CUDA, cuDNN, and TensorRT&lt;/span&gt;
cmake &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;CMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;RELEASE &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;CMAKE_INSTALL_PREFIX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;OPENCV_EXTRA_MODULES_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;~/opencv_contrib/modules &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;WITH_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;CUDA_ARCH_BIN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;8.7 &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;CUDA_ARCH_PTX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;WITH_CUDNN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;OPENCV_DNN_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;ENABLE_FAST_MATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;CUDA_FAST_MATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;WITH_CUBLAS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;WITH_V4L&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;WITH_LIBV4L&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;WITH_OPENGL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;BUILD_OPENCV_PYTHON3&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;BUILD_EXAMPLES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;BUILD_TESTS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;BUILD_DOCS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;BUILD_PERF_TESTS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF &lt;span class="se"&gt;\ &lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;CMAKE_C_COMPILER_LAUNCHER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ccache &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;-D&lt;/span&gt; &lt;span class="nv"&gt;CMAKE_CXX_COMPILER_LAUNCHER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ccache ..

&lt;span class="c"&gt;# Compile OpenCV using all CPU cores&lt;/span&gt;
&lt;span class="nv"&gt;TOTAL_CPP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;find ~/opencv ~/opencv_contrib/modules &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;"*.cpp"&lt;/span&gt; | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
make &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; | pv &lt;span class="nt"&gt;-lep&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;TOTAL_CPP&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Install OpenCV&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;make &lt;span class="nb"&gt;install
sudo &lt;/span&gt;ldconfig

&lt;span class="c"&gt;# Verify installation&lt;/span&gt;
python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import cv2; print(cv2.getBuildInformation())"&lt;/span&gt;

&lt;span class="c"&gt;# Ensure Python recognizes the new OpenCV installation&lt;/span&gt;
&lt;span class="nv"&gt;PYTHON_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import sys; print('python'+sys.version[:3])"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"export PYTHONPATH=/usr/local/lib/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PYTHON_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/site-packages:&lt;/span&gt;&lt;span class="se"&gt;\$&lt;/span&gt;&lt;span class="s2"&gt;PYTHONPATH"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc
&lt;span class="nb"&gt;source&lt;/span&gt; ~/.bashrc

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✅ OpenCV &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;OPENCV_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; built and installed with CUDA!"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
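A side note on the `PYTHON_VERSION` step above: slicing `sys.version` to three characters silently breaks on Python 3.10 and later ("3.10.12" becomes "3.1"), while `sys.version_info` is unambiguous. A quick illustration:

```python
import sys

# Building the tag from sys.version_info is safe on every Python 3.x release:
version_tag = "python{}.{}".format(*sys.version_info[:2])

# The naive three-character slice truncates two-digit minor versions.
# Simulated version banner for a 3.10 interpreter:
banner = "3.10.12 (main, ...)"
naive = "python" + banner[:3]

print(version_tag)  # e.g. python3.10 on a 3.10 interpreter
print(naive)        # python3.1 -- truncated, wrong
```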



&lt;p&gt;Here &lt;code&gt;ccache&lt;/code&gt; (Compiler Cache) speeds up compilation by storing previously compiled object files and reusing them when no source code changes occur. This is useful when frequently tweaking settings &lt;em&gt;or rebuilding&lt;/em&gt; OpenCV.&lt;/p&gt;
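Conceptually, ccache is a content-addressed cache keyed by a hash of the (preprocessed) source: identical input means the stored object file is reused instead of recompiled. A toy Python model of that idea (a sketch, not ccache itself):

```python
import hashlib

cache = {}  # maps hash(source) -> "object file"

def compile_with_cache(source: str) -> str:
    """Toy model of ccache: recompile only when the source actually changed."""
    key = hashlib.sha256(source.encode()).hexdigest()
    if key not in cache:
        # Stand-in for an expensive real compilation step.
        cache[key] = "obj({})".format(key[:8])
    return cache[key]

a = compile_with_cache("int main() { return 0; }")
b = compile_with_cache("int main() { return 0; }")  # cache hit, no recompile
print(a == b, len(cache))  # same object, only one compile happened
```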

&lt;h2&gt;
  
  
  What is this all for?
&lt;/h2&gt;

&lt;p&gt;Building OpenCV from source with CUDA support on the Jetson Orin NX is &lt;strong&gt;an essential optimization&lt;/strong&gt; for developers working with real-time image processing and AI applications. By &lt;strong&gt;leveraging the GPU&lt;/strong&gt; for operations such as object detection, background subtraction, and feature extraction, &lt;a href="https://opencv.org/platforms/cuda/" rel="noopener noreferrer"&gt;performance can improve&lt;/a&gt; &lt;strong&gt;by 5-10x&lt;/strong&gt; compared to CPU-only execution.&lt;/p&gt;

&lt;p&gt;This custom-built OpenCV version seamlessly integrates with Python, ensuring that developers can access GPU acceleration without changing their existing OpenCV-based code. Whether you are deploying deep learning models, processing high-resolution video streams, or performing complex computer vision tasks, this optimized OpenCV installation ensures that your Orin NX operates at its maximum potential.&lt;/p&gt;

&lt;p&gt;With this approach and using the provided script, developers can streamline the build process and focus on developing CUDA-powered applications that take full advantage of NVIDIA's cutting-edge hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance gain
&lt;/h2&gt;

&lt;p&gt;Motion tracking and optical flow run roughly 7-10x faster with CUDA, and moving object detection gets about a 9x boost, e.g. from ~12 &lt;strong&gt;up to 100 FPS&lt;/strong&gt;. Even relatively simple tasks benefit: CUDA provides a 5x-10x speedup for basic image processing.&lt;/p&gt;
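Translating those FPS figures into per-frame time budgets makes the gain concrete (simple arithmetic on the ~12 and 100 FPS numbers above):

```python
# Per-frame time budget before and after GPU acceleration.
cpu_fps, gpu_fps = 12, 100

cpu_ms = 1000 / cpu_fps      # ~83 ms per frame on the CPU
gpu_ms = 1000 / gpu_fps      # 10 ms per frame on the GPU
speedup = gpu_fps / cpu_fps  # ~8.3x for this example

print(f"CPU: {cpu_ms:.1f} ms/frame, GPU: {gpu_ms:.1f} ms/frame, "
      f"speedup: {speedup:.1f}x")
```

At 10 ms per frame there is headroom left for detection and tracking logic within a 30 FPS (33 ms) real-time budget; at 83 ms there is not.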

&lt;h2&gt;
  
  
  Wait, does NVIDIA’s TensorRT or DeepStream need OpenCV?
&lt;/h2&gt;

&lt;p&gt;TensorRT is a library for optimized deep learning inference on GPUs. It &lt;em&gt;does not need OpenCV&lt;/em&gt;, as it processes models (YOLO, ResNet, etc.) directly. You interact with TensorRT using Python (&lt;code&gt;tensorrt&lt;/code&gt; package) or C++. If you need preprocessing (e.g., image resizing, normalization), OpenCV can help but isn’t required.&lt;/p&gt;
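The preprocessing in question can be done with or without OpenCV; the math itself is trivial. A sketch of per-channel normalization using illustrative ImageNet-style mean/std constants (your model's values may differ):

```python
# Illustrative constants only: ImageNet mean/std, commonly used for
# CNN backbones. Check what your specific model expects.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Scale 0-255 RGB to float and apply per-channel mean/std normalization."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))

# A mid-grey pixel lands near zero in every channel after normalization:
print(normalize_pixel((128, 116, 104)))
```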

&lt;p&gt;DeepStream is a full pipeline for video analytics using TensorRT + GStreamer. If you use DeepStream’s GStreamer-based pipeline, &lt;em&gt;OpenCV is optional&lt;/em&gt;. However, if you’re post-processing model output (e.g., drawing bounding boxes), OpenCV can be useful.&lt;/p&gt;

&lt;p&gt;🤖🔎👀 Wishing you keen machine vision! &lt;/p&gt;

&lt;p&gt;Technical specs sources:&lt;br&gt;
a. &lt;a href="https://developer.nvidia.com/cuda-gpus" rel="noopener noreferrer"&gt;https://developer.nvidia.com/cuda-gpus&lt;/a&gt;&lt;br&gt;
b. &lt;a href="https://developer.download.nvidia.com/assets/embedded/secure/jetson/orin_nx/docs/Jetson_Orin_NX_DS-10712-001_v0.5.pdf" rel="noopener noreferrer"&gt;https://developer.download.nvidia.com/assets/embedded/secure/jetson/orin_nx/docs/Jetson_Orin_NX_DS-10712-001_v0.5.pdf&lt;/a&gt;&lt;br&gt;
c. &lt;a href="https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/" rel="noopener noreferrer"&gt;https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/&lt;/a&gt;&lt;br&gt;
d. &lt;a href="https://opencv.org/platforms/cuda/" rel="noopener noreferrer"&gt;https://opencv.org/platforms/cuda/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>nvidia</category>
      <category>jetson</category>
      <category>cuda</category>
      <category>opencv</category>
    </item>
    <item>
      <title>Finding the Right "Brain" and Software for Civilian Drone. Part 2</title>
      <dc:creator>Roman Belshevitz</dc:creator>
      <pubDate>Tue, 19 Mar 2024 13:51:51 +0000</pubDate>
      <link>https://dev.to/rbelshevitz/finding-the-right-brain-and-software-for-civilian-drone-part-2-4h12</link>
      <guid>https://dev.to/rbelshevitz/finding-the-right-brain-and-software-for-civilian-drone-part-2-4h12</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a continuation. The first part is &lt;a href="https://dev.to/rbalashevich/finding-the-right-brain-and-software-for-civilian-drone-2on4"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;About ten years ago there was a US-based company called Aerotenna that made FPGA-based drone controller platforms. They later moved on to specialize in sensors. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://aerotenna.readme.io/docs/unboxing" rel="noopener noreferrer"&gt;The OcPoC-Zynq by Aerotenna&lt;/a&gt; was conceived as a compelling flight controller for those seeking to push the boundaries of drone technology. &lt;/p&gt;

&lt;p&gt;Its compatibility with the open-source PX4 project promised a robust foundation of reliability. &lt;/p&gt;

&lt;p&gt;Unfortunately, the complexity of the implementation (as it turned out, excessive) did not allow all this to come true.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Specifications and Unique Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hybrid Architecture
&lt;/h3&gt;

&lt;p&gt;The heart of the OcPoC-Zynq is a Zynq System-on-Chip (SoC), combining an ARM processor with a programmable Field-Programmable Gate Array (FPGA). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewv2rxrjhhyb9c7mhlkj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewv2rxrjhhyb9c7mhlkj.png" alt=" " width="623" height="505"&gt;&lt;/a&gt;&lt;br&gt;
&lt;small&gt;&lt;em&gt;The similar Z-turn Board V2 schematic diagram. Pic source: Xilinx&lt;/em&gt;&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;This grants immense flexibility for hardware acceleration and customization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Processing Power
&lt;/h3&gt;

&lt;p&gt;The chosen Zynq SoC typically features a dual-core ARM Cortex-A9 processor and a sizeable FPGA fabric, ensuring both baseline flight control capabilities and room for advanced functionality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enhanced I/O
&lt;/h3&gt;

&lt;p&gt;The OcPoC-Zynq boasts numerous configurable input/output pins, supporting a multitude of sensor configurations, communication protocols, and potential payloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpsb21shnbhjul36uj4ss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpsb21shnbhjul36uj4ss.png" alt=" " width="600" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Sensor Redundancy
&lt;/h3&gt;

&lt;p&gt;To maximize reliability, the OcPoC-Zynq was declared to support triple redundancy for critical sensors like the GPS, IMU, and magnetometer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5lr8cok7ocjo7qbs7ek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5lr8cok7ocjo7qbs7ek.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;small&gt;&lt;em&gt;The OcPoC-Zynq is an FPGA+ARM SoC based flight control platform mounted on a drone.&lt;/em&gt;&lt;/small&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A developer-oriented solution
&lt;/h2&gt;

&lt;p&gt;Just replacing an embedded MCU with an FPGA would be a step backward. FPGAs are, however, far better suited to functions such as high-bandwidth digital signal processing. Taken together, these specifications create a platform whose possibilities extend beyond traditional flight controllers. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The FPGA emerges as a space for hardware-level innovation. Instead of relying solely on the main processor, computationally demanding elements of the PX4 system, such as ones described below, could be offloaded to the FPGA. This distribution of tasks promised smoother flight performance and potential for more sophisticated features.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The FPGA's adaptability allows the OcPoC-Zynq &lt;a href="https://web.archive.org/web/20231128165753/https://aerotenna.readme.io/docs/ocpoctm-zynq-mini" rel="noopener noreferrer"&gt;to embrace custom sensors and specialized peripherals&lt;/a&gt;, propelling it beyond the limitations of standardized drone builds. This makes it particularly appealing within a research context, where the integration of experimental sensors or novel technologies is key. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Unfortunately, the site with the documentation has been taken down; some information is still available in the Web Archive's cache.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Perhaps most exciting is the potential the OcPoC-Zynq could have for advanced autonomous flight. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwojq8r7fh81kv92zwqyw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwojq8r7fh81kv92zwqyw.png" alt=" " width="800" height="424"&gt;&lt;/a&gt;&lt;br&gt;
&lt;small&gt;&lt;em&gt;Architecture model for a reconfigurable autopilot board. Source: &lt;a href="https://www.mdpi.com/1424-8220/21/4/1115" rel="noopener noreferrer"&gt;Queensland University of Technology&lt;/a&gt;&lt;/em&gt;&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;The FPGA's processing power offers the opportunity to run AI algorithms directly on the drone. This localizes decision-making processes, potentially leading to real-time obstacle avoidance, sophisticated vision-based navigation, and previously unimaginable flight behaviors.&lt;/p&gt;

&lt;p&gt;The OcPoC flight controller isn't just specialized hardware; it runs a full-fledged operating system (i.e., Ubuntu &lt;code&gt;armhf&lt;/code&gt;) on its Zynq-7010 ARM processor. This allows it to use standard Linux-based flight control software.&lt;/p&gt;

&lt;p&gt;You still can &lt;a href="https://github.com/Aerotenna/OcPoC_Mini_Zynq_Files" rel="noopener noreferrer"&gt;clone the repository&lt;/a&gt; and customize your kernel to your specific project using Xilinx's &lt;code&gt;linux-xlnx&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some functions where the FPGA could be used in a drone
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sensors
&lt;/h3&gt;

&lt;p&gt;A high-accuracy Extended Kalman Filter fused with an Inertial Measurement Unit (EKF/IMU). This would take input from &lt;a href="https://www.ndigital.com/6dof-explained/" rel="noopener noreferrer"&gt;a 6DOF sensor&lt;/a&gt; and provide attitude tracking in flight.&lt;/p&gt;
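To make the filtering idea concrete, here is a minimal scalar Kalman filter in Python. It is a toy stand-in only: a real flight EKF fuses many states nonlinearly, and on this platform would run in FPGA fabric, not Python.

```python
def kalman_step(x, p, z, q=0.01, r=0.5):
    """One predict/update cycle for a scalar state.

    x: state estimate, p: estimate variance, z: noisy measurement,
    q: process noise, r: measurement noise (toy values).
    """
    p = p + q            # predict: uncertainty grows over time
    k = p / (p + r)      # Kalman gain: how much to trust the measurement
    x = x + k * (z - x)  # update: pull the estimate toward the measurement
    p = (1 - k) * p      # update: uncertainty shrinks after measuring
    return x, p

x, p = 0.0, 1.0  # poor initial guess, high uncertainty
for z in [0.9, 1.1, 1.0, 0.95, 1.05]:  # noisy readings of a true value ~1.0
    x, p = kalman_step(x, p, z)

# The estimate converges toward ~1.0 while the variance shrinks.
print(round(x, 2), round(p, 3))
```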

&lt;h3&gt;
  
  
  Cameras
&lt;/h3&gt;

&lt;p&gt;Video processing from multiple cameras. A single DSP is adequate for one camera input. But, multiple cameras would require an FPGA due to the high bandwidth requirement. So, you could do &lt;a href="https://github.com/Aerotenna/OcPoC_Mini_Zynq_Files/blob/3f03e50e0933fdf9934c262014f87c6c7c17a8e5/Kernel_Config/ocpoc_defconfig#L2239" rel="noopener noreferrer"&gt;some stereo-vision&lt;/a&gt; and &lt;a href="https://www.irjet.net/archives/V7/i1/IRJET-V7I189.pdf" rel="noopener noreferrer"&gt;🗎 object recognition&lt;/a&gt; using the FPGA.&lt;/p&gt;
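The bandwidth argument is easy to check with back-of-the-envelope numbers (assuming uncompressed 1080p30 RGB streams; real camera interfaces and pixel formats vary):

```python
def raw_bandwidth_mb_s(width, height, fps, bytes_per_pixel=3):
    """Raw, uncompressed video bandwidth in MB/s."""
    return width * height * bytes_per_pixel * fps / 1e6

one_cam = raw_bandwidth_mb_s(1920, 1080, 30)  # a single 1080p30 RGB stream
stereo = 2 * one_cam                          # a stereo pair doubles it

print(f"{one_cam:.0f} MB/s per camera, {stereo:.0f} MB/s for a stereo pair")
```

Hundreds of megabytes per second of pixel data is exactly the kind of wide, regular streaming workload that FPGA fabric handles well and a single DSP or MCU does not.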

&lt;h2&gt;
  
  
  The further fate of the solution
&lt;/h2&gt;

&lt;p&gt;It's important to note that unlocking the full potential of the OcPoC-Zynq requires a degree of development expertise. Customizing the FPGA involves hardware design skills, and careful integration with the PX4 software might be necessary to fully realize the benefits of hardware acceleration. &lt;/p&gt;

&lt;p&gt;Nonetheless, the Aerotenna OcPoC-Zynq served as an exceptional platform for researchers and innovators eager to shape the future of drone technology within the dynamic PX4 landscape.&lt;/p&gt;

&lt;p&gt;Unfortunately, there is no news today about the development of this project. The project has been discontinued and is no longer commercially available.&lt;/p&gt;

&lt;p&gt;PX4 v1.11 is the last release that has experimental support for this platform.&lt;/p&gt;

&lt;p&gt;So there are certainly use cases for FPGA on flight controllers, and the author reviewed one of them here. &lt;/p&gt;

&lt;p&gt;Performing checksums on the sensors data interface, digital filters and cameras signal processing - these are small tasks that could be outsourced to the FPGA, since they are expensive on the CPU.&lt;/p&gt;

&lt;p&gt;Yet flying, compared to high-speed video processing, &lt;em&gt;is still a relatively slow process&lt;/em&gt; with little data, so regular ARM CPUs can keep up.&lt;/p&gt;

&lt;p&gt;Also, development for an FPGA is more expensive: the parts are pricey, the high pin counts demand costly boards (&amp;gt; 2 layers), and the design software is expensive. Overall, development takes more time, and time is money.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Most modern ARM MCUs offer an attractive balance of performance and affordability. They are readily available, well-supported, and cost less than FPGA-based solutions.&lt;/p&gt;

&lt;p&gt;In addition, over the decade, a large number of dedicated video transmission and processing modules have appeared on the market, taking on these tasks entirely.&lt;/p&gt;

&lt;p&gt;Thus, we can say that highly integrated ARM MCU solutions have gained the upper hand today, at least in the civil mass drone segment.&lt;/p&gt;

</description>
      <category>embedded</category>
      <category>drones</category>
      <category>zynq</category>
    </item>
    <item>
      <title>Finding the Right "Brain" and Software for Civilian Drone. Part 1</title>
      <dc:creator>Roman Belshevitz</dc:creator>
      <pubDate>Fri, 15 Mar 2024 20:36:52 +0000</pubDate>
      <link>https://dev.to/rbelshevitz/finding-the-right-brain-and-software-for-civilian-drone-2on4</link>
      <guid>https://dev.to/rbelshevitz/finding-the-right-brain-and-software-for-civilian-drone-2on4</guid>
      <description>&lt;p&gt;&lt;em&gt;This is like a beginning. See part 2 &lt;a href="https://dev.to/rbalashevich/finding-the-right-brain-and-software-for-civilian-drone-part-2-4h12"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Unmanned Aerial Vehicles (UAVs), colloquially known as drones, have revolutionized industries ranging from agriculture and construction to surveillance and logistics. &lt;/p&gt;

&lt;p&gt;This transformative potential hinges on the intricate interplay between hardware components and software systems. At the heart of this symbiotic relationship lies the Microcontroller Unit (MCU), an indispensable component orchestrating the UAV's operations. However, the efficacy of UAVs is contingent upon the selection of an appropriate MCU and the deployment of a robust operating system tailored to their unique operational exigencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Specialized Operating Requirements of UAVs
&lt;/h2&gt;

&lt;p&gt;UAVs operate in dynamic and resource-constrained environments, necessitating operating systems that diverge from traditional paradigms. Unlike desktop computers or laptops, UAVs prioritize power efficiency to maximize flight time, embodying stringent constraints on memory and processing capabilities. Real-time responsiveness assumes paramount importance to ensure flight stability, thereby demanding operating systems that offer deterministic timing and swift reaction to dynamic environmental stimuli.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-Time Operating Systems: A Tailored Solution
&lt;/h2&gt;

&lt;p&gt;Real-Time Operating Systems (RTOS) have emerged as a tailored solution to address the specialized operational requirements of UAVs. By offering deterministic timing and efficient resource utilization, RTOSes facilitate the seamless execution of critical tasks such as flight control and sensor management. Popular RTOS choices, including &lt;a href="https://www.st.com/content/st_com/en/support/learning/stm32-education/stm32-moocs/freertos-common-microcontroller-software-interface-standard-osv2.html" rel="noopener noreferrer"&gt;FreeRTOS&lt;/a&gt;, &lt;a href="https://hmchung.gitbooks.io/stm32-tutorials/content/nuttx-installation.html" rel="noopener noreferrer"&gt;NuttX&lt;/a&gt;, &lt;a href="https://www.chibios.org/dokuwiki/doku.php?id=chibios:documentation:start" rel="noopener noreferrer"&gt;ChibiOS&lt;/a&gt;, &lt;a href="https://docs.zephyrproject.org/latest/boards/index.html" rel="noopener noreferrer"&gt;Zephyr&lt;/a&gt; and &lt;a href="https://www.rt-thread.io/board.html" rel="noopener noreferrer"&gt;RT-Thread&lt;/a&gt; offer varying strengths in terms of size, security, and hardware support, catering to diverse UAV projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  STM32: A Preferred MCU for UAV Development
&lt;/h2&gt;

&lt;p&gt;The STM32 family of MCUs has garnered widespread acclaim in the realm of UAV development due to its versatility and performance capabilities. With a plethora of options catering to diverse application scenarios, STM32 MCUs offer seamless integration with various RTOS options, thereby enhancing flexibility and scalability in UAV development endeavors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware Abstraction Layers: Facilitating Portability and Focus
&lt;/h2&gt;

&lt;p&gt;Hardware Abstraction Layers (HALs) play a pivotal role in bridging the firmware and hardware components of UAV systems. By encapsulating low-level hardware details and providing a standardized interface, HALs facilitate portability, allowing firmware such as Ardupilot to run seamlessly across different MCUs and RTOSes. Furthermore, HALs enable firmware developers to focus on high-level tasks such as navigation and sensor fusion, thereby enhancing development efficiency and code maintainability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vendors' Role in HAL Provisioning
&lt;/h2&gt;

&lt;p&gt;Semiconductor vendors, exemplified by STMicroelectronics for the STM32 MCUs family, play a crucial role in providing and maintaining Hardware Abstraction Layers (HALs). One of the pivotal reasons for the widespread adoption of STM32 microcontrollers in civilian drone applications is &lt;a href="https://www.st.com/resource/en/user_manual/dm00173145-description-of-stm32l4-l4-hal-and-low-layer-drivers-stmicroelectronics.pdf" rel="noopener noreferrer"&gt;🗎 their robust support&lt;/a&gt; for various communication buses and protocols, which are essential for facilitating seamless integration with a wide range of peripheral devices and external sensors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.st.com/en/microcontrollers-microprocessors/stm32-32-bit-arm-cortex-mcus.html" rel="noopener noreferrer"&gt;STM32 MCUs&lt;/a&gt; boast extensive support for industry-standard communication interfaces, including Universal Asynchronous Receiver-Transmitter (UART), Serial Peripheral Interface (SPI), Inter-Integrated Circuit (I2C), Controller Area Network (CAN), and USB. These communication interfaces enable UAV developers to establish reliable and high-speed data exchange with external components, such as GPS modules, inertial measurement units (IMUs), cameras, and telemetry systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz7cev9kx1bqrdgyjo2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz7cev9kx1bqrdgyjo2z.png" alt=" " width="800" height="835"&gt;&lt;/a&gt;&lt;br&gt;
&lt;small&gt;&lt;em&gt;Pic source: st.com&lt;/em&gt;&lt;/small&gt; 🔎&lt;/p&gt;

&lt;p&gt;Furthermore, STM32 microcontrollers feature built-in support for popular communication protocols commonly used in UAV applications. This includes protocols such as MAVLink, which facilitates communication between onboard flight controllers and ground control stations, as well as protocols like I2C and SPI for interfacing with peripheral sensors and devices.&lt;/p&gt;

&lt;p&gt;By offering comprehensive support for communication buses and protocols, STMicroelectronics empowers drone-makers to build highly interconnected and interoperable UAV systems. This enables seamless integration of diverse sensor arrays, payload systems, and communication modules, thereby enhancing the functionality, versatility, and performance of civilian drones. &lt;/p&gt;

&lt;p&gt;With STM32 MCUs as the backbone of UAV development, drone-makers can involve a rich ecosystem of communication interfaces and protocols to realize their vision of advanced and mission-critical drone applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Zynq UltraScale+ platform, developed by Xilinx (now AMD)
&lt;/h2&gt;

&lt;p&gt;This solution offers several distinct advantages for drone making, particularly in scenarios that demand high computational power, flexibility, and integration of custom hardware accelerators. Below are some key advantages of the Zynq platform for drone development:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html" rel="noopener noreferrer"&gt;Zynq UltraScale+&lt;/a&gt; combines the processing capabilities of Cortex-A53 application processors with the programmable logic of FPGA (Field-Programmable Gate Array) fabric on a single chip. This hybrid architecture &lt;a href="https://linuxgizmos.com/drone-controller-board-runs-linux-on-zynq-ultrascale/" rel="noopener noreferrer"&gt;enables developers to implement custom hardware accelerators&lt;/a&gt; in the FPGA fabric to offload computationally intensive tasks from the CPU, thereby enhancing overall system performance and efficiency.&lt;/p&gt;

&lt;p&gt;Built on the industry success of the Zynq 7000 SoC family, the UltraScale+ MPSoC architecture extends AMD SoCs to enable true heterogeneous multi-processing, with ‘the right engines for the right tasks’ for smarter systems.&lt;/p&gt;

&lt;p&gt;The ARM Cortex-A53 processors integrated into the Zynq UltraScale+ provide significant computational power, enabling drones to execute complex algorithms for tasks such as image processing, computer vision, sensor fusion, and autonomous navigation. This high computational capability is crucial for enabling advanced drone functionalities, including obstacle detection and avoidance, object tracking, and environmental mapping.&lt;/p&gt;

&lt;p&gt;The programmable logic fabric of the Zynq platform allows developers to implement custom hardware accelerators tailored to specific drone applications. The flexibility of FPGAs allows developers to customize algorithms and processing pipelines, unlocking unique capabilities for specialized drone missions. &lt;/p&gt;

&lt;p&gt;Additionally, the reconfigurability of FPGAs enables rapid prototyping and iteration, facilitating agile development processes in the fast-paced UAV industry.&lt;/p&gt;

&lt;p&gt;The Zynq platform offers low-latency processing capabilities, making it suitable for real-time applications in drone control and navigation. By deploying critical control and decision-making algorithms on the ARM processors and offloading latency-sensitive tasks to the FPGA fabric, developers can achieve high responsiveness and stability in drone flight operations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9a6ina95ffsianoufz9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9a6ina95ffsianoufz9q.png" alt=" " width="800" height="677"&gt;&lt;/a&gt;&lt;br&gt;
&lt;small&gt;&lt;em&gt;Pic source: &lt;a href="https://linuxgizmos.com/drone-controller-board-runs-linux-on-zynq-ultrascale/" rel="noopener noreferrer"&gt;linuxgizmos.com&lt;/a&gt;&lt;/em&gt;&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;The platform features a rich set of peripheral interfaces, including GPIOs, UARTs, SPI, I2C, USB, and MIPI, facilitating seamless integration with a wide range of sensors, actuators, communication modules, and external devices. This extensive peripheral support simplifies the development of fully integrated drone systems and enables interoperability with existing UAV components and standards.&lt;/p&gt;

&lt;p&gt;The Zynq platform is available in various configurations with different processing capabilities and FPGA resources, allowing developers to choose the optimal combination of performance and cost for their drone applications. Whether designing lightweight drones for aerial photography or high-end UAVs for surveillance and reconnaissance missions, developers can select the Zynq device that best aligns with their performance requirements and budget constraints.&lt;/p&gt;

&lt;p&gt;Xilinx (the maker of Zynq UltraScale+ devices) &lt;a href="https://docs.amd.com/r/en-US/oslib_rm" rel="noopener noreferrer"&gt;provides a comprehensive HAL&lt;/a&gt; as part of their software development tools and frameworks. The primary HAL access point is usually through the Xilinx SDK. It includes libraries and drivers for the various hardware blocks within Zynq UltraScale+ SoCs, including the ARM processor cores, programmable logic, and peripherals.&lt;/p&gt;

&lt;p&gt;A talented Japanese engineer, &lt;a href="https://github.com/ikwzm/ZynqMP-FPGA-Linux" rel="noopener noreferrer"&gt;Ichiro Kawazome, has composed a repository&lt;/a&gt; which provides a Linux boot image (U-Boot bootloader, kernel, RootFS) for the Zynq UltraScale+ MPSoC.&lt;/p&gt;

&lt;h2&gt;
  
  
  Target Applications
&lt;/h2&gt;

&lt;p&gt;STM32 microcontrollers are ideal for lightweight drones, flight controllers, and embedded systems where power efficiency, real-time responsiveness, and cost-effectiveness are paramount.&lt;/p&gt;

&lt;p&gt;Zynq is suitable for drone applications that require high computational power, flexibility, and customization, such as autonomous navigation, advanced image processing, and real-time decision-making.&lt;/p&gt;

&lt;p&gt;The choice between Zynq and STM32 depends on the specific requirements of the drone application, including computational complexity, power efficiency, real-time performance, and system integration needs. While STM32 MCUs excel in low-power operation and real-time responsiveness, making them suitable for a wide range of embedded applications, including drones, Zynq offers higher computational power and hardware customization capabilities.  &lt;/p&gt;

&lt;p&gt;Zynq offers significantly higher computational power compared to STM32 microcontrollers, thanks to its ARM application processors and FPGA fabric.&lt;/p&gt;

&lt;p&gt;Developers can create customized Linux distributions tailored for Zynq-based drones, integrating necessary drivers, libraries, and applications for drone control, sensor interfacing, and communication. This approach offers flexibility and control over the software stack but requires significant development effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delineating OS and Firmware: A Conceptual Framework
&lt;/h2&gt;

&lt;p&gt;In the architectural hierarchy of UAV systems, the RTOS serves as the foundational layer, orchestrating task scheduling, resource allocation, and timing management. RTOSes can be remarkably small, especially ones designed for the strict memory constraints of embedded systems.&lt;/p&gt;

&lt;p&gt;The HAL, influenced by the MCU vendor, provides a standardized interface for firmware interaction, encapsulating hardware intricacies and enhancing portability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlszwhh2apoijjwkaljj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlszwhh2apoijjwkaljj.png" alt=" " width="704" height="453"&gt;&lt;/a&gt;&lt;br&gt;
&lt;small&gt;&lt;em&gt;The most common structure of the existing flight controller firmware, which consists of layers and modules. Pic source: &lt;a href="https://www.mdpi.com/2226-4310/9/2/62" rel="noopener noreferrer"&gt;Gyeongsang National University, Rep. of Korea&lt;/a&gt;&lt;/em&gt;&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;Firmware, typified by Ardupilot or PX4, operates at a higher level, focusing on flight control, navigation, and hardware interfacing, thereby delineating the boundaries between OS and application-specific functionalities.&lt;/p&gt;

&lt;p&gt;ChibiOS runs under the hood of ArduPilot, while NuttX plays the same role for PX4.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2mwi7bihmfaaz1s5wjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2mwi7bihmfaaz1s5wjk.png" alt=" " width="800" height="718"&gt;&lt;/a&gt;&lt;br&gt;
&lt;small&gt;&lt;em&gt;A high-level overview of a typical "simple" PX4 system based around a flight controller. Pic source: &lt;a href="https://docs.px4.io/main/en/concept/px4_systems_architecture.html" rel="noopener noreferrer"&gt;docs.px4.io&lt;/a&gt;&lt;/em&gt;&lt;/small&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Is There a Place for Linux Above the Ground?
&lt;/h2&gt;

&lt;p&gt;While Real-Time Operating Systems (RTOSes) like FreeRTOS and NuttX have traditionally dominated civilian drone development, thanks to their real-time responsiveness and lean resource usage, Linux-based platforms, and the Zynq platform in particular, have gained traction in certain niche applications within the UAV ecosystem.&lt;/p&gt;

&lt;p&gt;Linux offers several advantages for civilian drone applications, especially in scenarios where complex computational tasks, such as image processing, data analysis, or machine learning, are required. The Zynq platform, which combines the flexibility and ease of use of an ARM processor with the programmable logic capabilities of an FPGA, presents an attractive option for UAV developers looking to leverage the power of Linux in conjunction with customizable hardware acceleration.&lt;/p&gt;

&lt;p&gt;One area where Linux-based platforms like Zynq have found utility is in high-level mission planning, data processing, and decision-making tasks. For instance, Linux enables developers to deploy sophisticated algorithms for autonomous navigation, obstacle avoidance, and environmental mapping, leveraging the extensive software libraries and development tools available within the Linux ecosystem.&lt;/p&gt;

&lt;p&gt;Additionally, Linux-based platforms offer robust networking capabilities, allowing drones to communicate with ground control stations, cloud services, and other drones in a distributed manner. This facilitates collaborative missions, swarm behavior, and real-time data sharing, thereby expanding the scope of civilian drone applications beyond individual flight operations.&lt;/p&gt;

&lt;p&gt;However, it's essential to acknowledge that Linux-based solutions may not be suitable for all drone applications, particularly those that prioritize real-time responsiveness, low latency, and deterministic behavior. In such cases, RTOSes remain the preferred choice due to their ability to guarantee timing constraints and optimize resource utilization.&lt;/p&gt;

&lt;p&gt;In brief: while RTOSes continue to dominate the civilian drone landscape, Linux-based platforms like Zynq offer a compelling alternative for niche applications that require advanced computational capabilities, networking features, and flexibility.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As the UAV industry evolves and the demand for more sophisticated autonomous capabilities grows, the integration of Linux-based solutions alongside traditional RTOSes is likely to become increasingly prevalent, driving innovation and expanding the horizons of civilian drone technology.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Spirit of Open-Source
&lt;/h2&gt;

&lt;p&gt;The following projects have become very significant within the discussed area of development and have gained a wide community over the past ten years:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ardupilot.org/" rel="noopener noreferrer"&gt;ArduPilot&lt;/a&gt; is a widely used open-source autopilot system for UAVs. It offers similar features to PX4, including flight control, mission planning, and telemetry support. ArduPilot supports a variety of airframes and can run on microcontroller-based flight controllers like Arduino boards.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://px4.io/" rel="noopener noreferrer"&gt;PX4 Autopilot&lt;/a&gt; is another popular open-source autopilot system for UAVs. It provides a complete set of flight control algorithms, mission planning tools, and communication protocols. PX4 supports a wide range of airframes and is highly customizable, making it suitable for both hobbyist and commercial UAV applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://mavlink.io/en/" rel="noopener noreferrer"&gt;MAVLink&lt;/a&gt; is an open-source communication protocol used for exchanging telemetry and control messages between UAVs and ground control stations. It provides a lightweight, efficient messaging format and supports various transport protocols, including serial, UDP, and TCP/IP.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://mavsdk.mavlink.io/main/en/index.html" rel="noopener noreferrer"&gt;MAVSDK&lt;/a&gt; (MAVLink Software Development Kit) is a cross-platform, open-source SDK for accessing MAVLink-based UAV systems programmatically. It provides APIs in multiple programming languages, including C++, Python, and Swift, making it suitable for a wide range of UAV applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://opencv.org/" rel="noopener noreferrer"&gt;OpenCV&lt;/a&gt; (Open Source Computer Vision Library) is a popular open-source computer vision library that is widely used in UAV applications for tasks such as object detection, tracking, and image processing. It provides a comprehensive set of algorithms and tools for working with visual data. The reader definitely have to take a look at &lt;a href="https://github.com/Kenil16/master_project" rel="noopener noreferrer"&gt;Kenni Nilsson's project&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://wiki.ros.org/" rel="noopener noreferrer"&gt;ROS&lt;/a&gt; (Robot Operating System) is an open-source robotics middleware framework that provides libraries and tools for building robotic systems. It includes packages for tasks such as sensor integration, localization, mapping, and path planning, &lt;a href="https://robots.ros.org/category/aerial/" rel="noopener noreferrer"&gt;which are applicable to UAVs as well&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://librepilot.atlassian.net/wiki/spaces/LPDOC/pages/2818105/Welcome" rel="noopener noreferrer"&gt;LibrePilot&lt;/a&gt;, an open-source flight control software for UAVs that offers a range of features including stabilization, navigation, and telemetry. It is designed to work with a variety of hardware platforms and offers extensive configurability and customization options.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.expresslrs.org/" rel="noopener noreferrer"&gt;ExpressLRS (ELRS)&lt;/a&gt; is a rapidly growing open-source radio control link designed to surpass the performance limitations of traditional systems. It prioritizes long range, low latency, and high update rates for applications like FPV drone racing and long-range aircraft. Users have extensive control over the firmware, allowing them to fine-tune parameters like transmission power, frequency bands, telemetry options, and much more. This level of customization caters to a variety of use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Trajectory of Drone Software
&lt;/h2&gt;

&lt;p&gt;The trajectory of drone software is poised for significant advancements, propelled by burgeoning research endeavors in secure software engineering, artificial intelligence (AI), and programming language design. &lt;/p&gt;

&lt;p&gt;Anticipated advancements encompass the integration of AI algorithms for onboard decision-making, adoption of safer programming languages to mitigate system vulnerabilities, and enhancements in software security mechanisms to fortify UAV resilience against adversarial threats.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, the selection of an appropriate MCU or SoC and the deployment of a tailored operating system constitute foundational pillars underpinning the efficacy and reliability of UAV systems. &lt;/p&gt;

&lt;p&gt;By embracing Real-Time Operating Systems (RTOS) and leveraging Hardware Abstraction Layers (HALs), developers can navigate the intricate landscape of UAV software development with confidence, ushering in a new era of innovation and transformative potential across diverse industry verticals.&lt;/p&gt;

&lt;p&gt;It is also difficult to overestimate the contribution of the open source community in popularizing drone-centric development!&lt;/p&gt;

&lt;p&gt;Soft landings and clear skies to everyone!&lt;/p&gt;

&lt;p&gt;&lt;small&gt;&lt;em&gt;Most sources are identified by the links given inside the article.&lt;br&gt;Cover pic by Marian A. Juwan, Pixabay.&lt;/em&gt;&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;&lt;em&gt;See also: &lt;a href="https://www.digikey.co.il/en/maker/projects/getting-started-with-stm32-introduction-to-freertos/ad275395687e4d85935351e16ec575b1" rel="noopener noreferrer"&gt;'Getting Started with STM32 - Introduction to FreeRTOS'&lt;/a&gt;, a blog post by Shawn Hymel @ maker.io / DigiKey.&lt;/em&gt;&lt;/small&gt;&lt;/p&gt;

</description>
      <category>embedded</category>
      <category>drones</category>
      <category>stm32</category>
      <category>zynq</category>
    </item>
    <item>
      <title>32 Kernel’s Teeth for “Chewing” the Network Stack on Linux</title>
      <dc:creator>Roman Belshevitz</dc:creator>
      <pubDate>Mon, 29 May 2023 17:02:03 +0000</pubDate>
      <link>https://dev.to/rbelshevitz/32-kernels-teeth-for-chewing-the-network-stack-on-linux-lnc</link>
      <guid>https://dev.to/rbelshevitz/32-kernels-teeth-for-chewing-the-network-stack-on-linux-lnc</guid>
      <description>&lt;p&gt;The topic of tuning the network stack is very narrow and complex the same time. Today, many tips either do not match the current default settings. Some mechanisms are already included in modern kernels. Below is my compilation of what seems to be relevant at the moment. It's mostly about timeouts and memory consumption. There are a lot of them now and they are cheap.&lt;/p&gt;

&lt;p&gt;In this article I provide a set of recommended network configuration settings for optimizing TCP connections on a server. The suggested settings include adjusting parameters related to orphaned TCP sockets, reducing the timeout for sockets in the &lt;code&gt;FIN-WAIT-2&lt;/code&gt; state, configuring TCP keepalive checks, managing memory allocation for TCP connections, disabling syncookies, selecting a congestion control algorithm, expanding the local port range, enabling protection against &lt;code&gt;TIME_WAIT&lt;/code&gt; attacks, increasing the maximum number of open sockets, adjusting buffer sizes for connections, and optionally disabling local ICMP packet redirects. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;These discussed settings aim to improve performance, memory usage, and security on powerful and busy servers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1
&lt;/h3&gt;

&lt;p&gt;⚙️ Increase the value of &lt;code&gt;tcp_max_orphans&lt;/code&gt;, which determines the maximum number of orphaned (not associated with any process) TCP sockets. Each socket consumes approximately &lt;code&gt;64 KB&lt;/code&gt; of memory. Therefore, the parameter should be matched with the available memory on the server.&lt;br&gt;
&lt;code&gt;net.ipv4.tcp_max_orphans = 65536&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2
&lt;/h3&gt;

&lt;p&gt;⚙️ Decrease &lt;code&gt;tcp_fin_timeout&lt;/code&gt; (default is &lt;code&gt;60&lt;/code&gt;). This parameter determines the maximum time a socket can remain in the &lt;code&gt;FIN-WAIT-2&lt;/code&gt; state. This state is used when the other party does not close the connection. Each socket occupies about &lt;code&gt;1.5 KB&lt;/code&gt; of memory, which can consume memory when there are many of them.&lt;br&gt;
&lt;code&gt;net.ipv4.tcp_fin_timeout = 10&lt;/code&gt;&lt;/p&gt;
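&lt;p&gt;To gauge whether this tweak matters on your box, you can count the sockets currently stuck in this state. A minimal sketch using &lt;code&gt;ss&lt;/code&gt; from iproute2, with a rough memory estimate based on the 1.5 KB figure above:&lt;/p&gt;

```shell
# Count sockets lingering in FIN-WAIT-2 and roughly estimate their memory cost
count=$(ss -tan state fin-wait-2 2>/dev/null | tail -n +2 | wc -l)
echo "$count sockets in FIN-WAIT-2, roughly $((count * 3 / 2)) KB"
```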

&lt;h3&gt;
  
  
  3
&lt;/h3&gt;

&lt;p&gt;⚙️ Parameters related to TCP connection checks in the &lt;code&gt;SO_KEEPALIVE&lt;/code&gt; status: &lt;code&gt;keepalive_time&lt;/code&gt; specifies the time after which checks will begin after the last activity on the connection, &lt;code&gt;keepalive_intvl&lt;/code&gt; determines the interval between checks, and &lt;code&gt;keepalive_probes&lt;/code&gt; specifies the number of checks.&lt;br&gt;
&lt;code&gt;net.ipv4.tcp_keepalive_time = 1800&lt;/code&gt;&lt;br&gt;
&lt;code&gt;net.ipv4.tcp_keepalive_intvl = 15&lt;/code&gt;&lt;br&gt;
&lt;code&gt;net.ipv4.tcp_keepalive_probes = 5&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4
&lt;/h3&gt;

&lt;p&gt;⚙️ Pay attention to the parameters &lt;code&gt;net.ipv4.tcp_mem&lt;/code&gt;, &lt;code&gt;net.ipv4.tcp_rmem&lt;/code&gt;, and &lt;code&gt;net.ipv4.tcp_wmem&lt;/code&gt;. They heavily depend on the memory available on the server and are automatically calculated at system boot. In general, it is not necessary to modify them, but sometimes it makes sense to adjust them manually to increase the values.&lt;/p&gt;

&lt;h3&gt;
  
  
  5
&lt;/h3&gt;

&lt;p&gt;⚙️ Disable syncookies (enabled by default), which are sent to a host when the &lt;code&gt;SYN&lt;/code&gt; packet queue for a given socket overflows.&lt;br&gt;
&lt;code&gt;net.ipv4.tcp_syncookies = 0&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  6
&lt;/h3&gt;

&lt;p&gt;⚙️ Special attention should be given to the congestion control algorithm used in TCP networks. There are many algorithms (&lt;code&gt;cubic&lt;/code&gt;, &lt;code&gt;htcp&lt;/code&gt;, &lt;code&gt;bic&lt;/code&gt;, &lt;code&gt;westwood&lt;/code&gt;, etc.), and it is difficult to definitively say which one is better to use. Algorithms show different results in different load scenarios. The kernel parameter &lt;code&gt;tcp_congestion_control&lt;/code&gt; controls this:&lt;br&gt;
&lt;code&gt;net.ipv4.tcp_congestion_control = cubic&lt;/code&gt;&lt;/p&gt;
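&lt;p&gt;Before picking an algorithm, it is worth checking which ones your kernel actually offers. A small inspection snippet reading the standard Linux procfs paths:&lt;/p&gt;

```shell
# List the congestion control algorithms available on this kernel,
# and the one currently in use
cat /proc/sys/net/ipv4/tcp_available_congestion_control
cat /proc/sys/net/ipv4/tcp_congestion_control
```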

&lt;h3&gt;
  
  
  7
&lt;/h3&gt;

&lt;p&gt;⚙️ When the server has a large number of outbound connections, there may not be enough local ports for them. By default, the range &lt;code&gt;32768-60999&lt;/code&gt; is used. It can be expanded:&lt;br&gt;
&lt;code&gt;net.ipv4.ip_local_port_range = 10240 65535&lt;/code&gt;&lt;/p&gt;
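&lt;p&gt;A quick arithmetic check of what the expansion buys you: the default range yields 28,232 usable ports, the expanded one 55,296.&lt;/p&gt;

```shell
# How many ephemeral ports each range yields (both bounds are inclusive)
default_ports=$((60999 - 32768 + 1))
expanded_ports=$((65535 - 10240 + 1))
echo "default: $default_ports ports, expanded: $expanded_ports ports"
```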

&lt;h3&gt;
  
  
  8
&lt;/h3&gt;

&lt;p&gt;⚙️ Enable protection against &lt;code&gt;TIME_WAIT&lt;/code&gt; attacks. By default, it is disabled.&lt;br&gt;
&lt;code&gt;net.ipv4.tcp_rfc1337 = 1&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  9
&lt;/h3&gt;

&lt;p&gt;⚙️ The maximum number of open sockets waiting for connections has a relatively low default value: in kernels up to 5.3 it is &lt;code&gt;128&lt;/code&gt;, and kernel 5.4 raised it to &lt;code&gt;4096&lt;/code&gt;. It makes sense to increase it on busy and powerful servers:&lt;br&gt;
&lt;code&gt;net.core.somaxconn = 16384&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  10
&lt;/h3&gt;

&lt;p&gt;⚙️ On powerful and busy servers, you can increase the default buffer size values for both receiving and transmitting for all connections. This parameter is measured in bytes. By default, it is &lt;code&gt;212992&lt;/code&gt; or &lt;code&gt;208 KB&lt;/code&gt;.&lt;br&gt;
&lt;code&gt;net.core.rmem_default = 851968&lt;/code&gt;&lt;br&gt;
&lt;code&gt;net.core.wmem_default = 851968&lt;/code&gt;&lt;br&gt;
&lt;code&gt;net.core.rmem_max = 12582912&lt;/code&gt;&lt;br&gt;
&lt;code&gt;net.core.wmem_max = 12582912&lt;/code&gt;&lt;/p&gt;
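&lt;p&gt;A quick sanity check of the units behind these numbers:&lt;/p&gt;

```shell
# Convert the raw byte values into human-friendly units
echo "$((212992 / 1024)) KB"          # kernel default per socket
echo "$((851968 / 1024)) KB"          # suggested new default
echo "$((12582912 / 1024 / 1024)) MB" # suggested hard maximum
```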

&lt;h3&gt;
  
  
  11
&lt;/h3&gt;

&lt;p&gt;⚙️ Disable local ICMP packet redirects. This should only be done if your server does not act as a router, i.e., if you have a regular web server.&lt;br&gt;
&lt;code&gt;net.ipv4.conf.all.accept_redirects = 0&lt;/code&gt;&lt;br&gt;
&lt;code&gt;net.ipv4.conf.all.secure_redirects = 0&lt;/code&gt;&lt;br&gt;
&lt;code&gt;net.ipv4.conf.all.send_redirects = 0&lt;/code&gt;&lt;br&gt;
Additionally, you can completely disable kernel-level responses to ICMP requests. It is not a common practice.&lt;/p&gt;
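&lt;p&gt;To persist all of the above across reboots, the settings can be collected into one drop-in file under &lt;code&gt;/etc/sysctl.d/&lt;/code&gt; and applied with &lt;code&gt;sysctl --system&lt;/code&gt;. The file name below is illustrative; pick only the settings that fit your workload:&lt;/p&gt;

```shell
# /etc/sysctl.d/90-tcp-tuning.conf -- file name is illustrative
net.ipv4.tcp_max_orphans = 65536
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_keepalive_time = 1800
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_congestion_control = cubic
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_rfc1337 = 1
net.core.somaxconn = 16384
net.core.rmem_default = 851968
net.core.wmem_default = 851968
net.core.rmem_max = 12582912
net.core.wmem_max = 12582912
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.all.send_redirects = 0
```

&lt;p&gt;Apply it (as root) with &lt;code&gt;sysctl --system&lt;/code&gt;, which reloads every file in the &lt;code&gt;sysctl.d&lt;/code&gt; directories.&lt;/p&gt;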


&lt;p&gt;Sources:&lt;br&gt;
a. &lt;a href="https://man7.org/linux/man-pages/man7/tcp.7.html" rel="noopener noreferrer"&gt;https://man7.org/linux/man-pages/man7/tcp.7.html&lt;/a&gt;&lt;br&gt;
b. &lt;a href="https://cr.yp.to/syncookies.html" rel="noopener noreferrer"&gt;https://cr.yp.to/syncookies.html&lt;/a&gt;&lt;br&gt;
c. &lt;a href="https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/" rel="noopener noreferrer"&gt;https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/&lt;/a&gt;&lt;br&gt;
d. &lt;a href="https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/" rel="noopener noreferrer"&gt;https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/&lt;/a&gt;&lt;br&gt;
e. &lt;a href="https://www.geeksforgeeks.org/tcp-connection-termination/" rel="noopener noreferrer"&gt;https://www.geeksforgeeks.org/tcp-connection-termination/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks to Mike Freemon and Marek 🐦@majek04 Majkowski.&lt;/p&gt;

</description>
      <category>kernel</category>
      <category>tuning</category>
      <category>networking</category>
      <category>linux</category>
    </item>
    <item>
      <title>Liveness Probes: Feel the Pulse of the App</title>
      <dc:creator>Roman Belshevitz</dc:creator>
      <pubDate>Mon, 28 Nov 2022 13:30:16 +0000</pubDate>
      <link>https://dev.to/otomato_io/liveness-probes-feel-the-pulse-of-the-app-133e</link>
      <guid>https://dev.to/otomato_io/liveness-probes-feel-the-pulse-of-the-app-133e</guid>
      <description>&lt;p&gt;This article will provide some helpful examples as the author  examines probes in Kubernetes. A correct probe definition can increase pod availability and resilience!&lt;/p&gt;

&lt;h2&gt;
  
  
  A Kubernetes Liveness Probe: What Is It?
&lt;/h2&gt;

&lt;p&gt;A liveness probe runs a given test to make sure that the application inside a container is alive and working.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚙️ Liveness probes
&lt;/h3&gt;

&lt;p&gt;They are used by the &lt;code&gt;kubelet&lt;/code&gt; to determine when to restart a container. Applications that crash or enter broken states are detected and, in many cases, can be rectified by restarting them.&lt;/p&gt;

&lt;p&gt;A successful configuration of the liveness probe results in no action being taken and no logs being kept. If it fails, the event is recorded, and the container is killed by the &lt;code&gt;kubelet&lt;/code&gt; in accordance with the &lt;code&gt;restartPolicy&lt;/code&gt; settings.&lt;/p&gt;

&lt;p&gt;A liveness probe should be utilized when a pod appears to be running but the application inside might not be working properly. A deadlock is a classic illustration: the pod is operational, yet it is ineffective because it cannot handle traffic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik5g939olte88z1gpa24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik5g939olte88z1gpa24.png" alt=" " width="800" height="515"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;🖼️ Pic source: K21Academy&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Since the &lt;code&gt;kubelet&lt;/code&gt; will check the &lt;code&gt;restartPolicy&lt;/code&gt; and restart the container automatically if it is set to &lt;code&gt;Always&lt;/code&gt; or &lt;code&gt;OnFailure&lt;/code&gt;, liveness probes are not required when the application is configured to crash the container on failure. NGINX, &lt;a href="https://serverfault.com/questions/1003361/how-to-automatically-restart-nginx-when-it-goes-down" rel="noopener noreferrer"&gt;for example&lt;/a&gt;, launches rapidly and exits if it encounters a problem that prevents it from serving pages. You do not need a liveness probe in this case.&lt;/p&gt;

&lt;p&gt;There are common adjustable fields for every type of probe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;initialDelaySeconds&lt;/code&gt;: Probes start running this many seconds after the container is started (default: 0)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;periodSeconds&lt;/code&gt;: How often the probe runs (default: 10)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timeoutSeconds&lt;/code&gt;: Probe timeout (default: 1)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;successThreshold&lt;/code&gt;: Number of successful probes required to mark the container healthy/ready (default: 1)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;failureThreshold&lt;/code&gt;: Number of consecutive failed probes tolerated before the container is deemed unhealthy/not ready (default: 3)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;periodSeconds&lt;/code&gt; field in each of the examples below says that the &lt;code&gt;kubelet&lt;/code&gt; should run a liveness probe every 5 seconds. The &lt;code&gt;initialDelaySeconds&lt;/code&gt; field instructs the &lt;code&gt;kubelet&lt;/code&gt; to delay the first probe for 5 seconds.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;timeoutSeconds&lt;/code&gt; option (time to wait for a reply), &lt;code&gt;successThreshold&lt;/code&gt; (number of successful probe executions required to mark the container healthy), and &lt;code&gt;failureThreshold&lt;/code&gt; (number of failed probe executions required to mark the container unhealthy) can also be customized, if desired.&lt;/p&gt;

&lt;p&gt;All probe types accept these five parameters.&lt;/p&gt;
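&lt;p&gt;As an illustrative sketch (the &lt;code&gt;/health&lt;/code&gt; path, port, and values are assumptions, not taken from the examples in this article), a liveness probe with all five fields spelled out might look like this:&lt;/p&gt;

```yaml
livenessProbe:
  httpGet:
    path: /health        # hypothetical health endpoint
    port: 8080
  initialDelaySeconds: 5 # wait 5 s after the container starts
  periodSeconds: 10      # probe every 10 s
  timeoutSeconds: 1      # wait up to 1 s for a reply
  successThreshold: 1    # one success marks the container healthy
  failureThreshold: 3    # three consecutive failures restart the container
```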
&lt;h2&gt;
  
  
  What other Kubernetes probes are available?
&lt;/h2&gt;

&lt;p&gt;Although liveness probes are the main emphasis of this article, you should be aware that Kubernetes also supports the following other types of probes:&lt;/p&gt;
&lt;h3&gt;
  
  
  ⚙️ Startup probes
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;kubelet&lt;/code&gt; uses startup probes to determine when a container application has begun. When enabled, a startup probe disables liveness and readiness checks until it succeeds, ensuring those checks don't interfere with application startup.&lt;/p&gt;

&lt;p&gt;Startup probes are especially helpful for slow-starting containers, since they prevent the &lt;code&gt;kubelet&lt;/code&gt; from killing them on a failed liveness probe before they have even started. If liveness probes are used on the same endpoint, set the startup probe's &lt;code&gt;failureThreshold&lt;/code&gt; high enough to allow for lengthy startup periods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-api-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-api&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-api&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myrepo/test-api:0.1&lt;/span&gt;
        &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health/startup&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
          &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a Pod starts and the probe fails, Kubernetes retries &lt;code&gt;failureThreshold&lt;/code&gt; times before giving up. For a liveness probe, giving up means restarting the container; for a readiness probe, the Pod is marked &lt;code&gt;Unready&lt;/code&gt;. The default is &lt;code&gt;3&lt;/code&gt;; the minimum value is &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Some startup probe math: why is it important?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;0–10 s: the container has been spun up, but the &lt;code&gt;kubelet&lt;/code&gt; does nothing while waiting for &lt;code&gt;initialDelaySeconds&lt;/code&gt; to pass&lt;/li&gt;
&lt;li&gt;10–20 s: the first probe request is sent but no response comes back, because the app hasn't stood up its APIs yet; this is either a failure due to the 2-second timeout or an immediate TCP connection error&lt;/li&gt;
&lt;li&gt;20–30 s: the app is up but has only started fetching credentials, configuration and so on, so the response to the probe request is a 5xx&lt;/li&gt;
&lt;li&gt;30–210 s: the &lt;code&gt;kubelet&lt;/code&gt; keeps probing, but no success response arrives and the limit set by &lt;code&gt;failureThreshold&lt;/code&gt; is reached. In this case, per the deployment configuration for the startup probe, the pod is restarted after roughly 212 seconds.&lt;/li&gt;
&lt;/ul&gt;
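&lt;p&gt;This kind of timeline follows from simple arithmetic: the longest startup window the &lt;code&gt;kubelet&lt;/code&gt; tolerates before giving up is approximately&lt;/p&gt;

```
max startup time ≈ initialDelaySeconds + failureThreshold × periodSeconds
```

&lt;p&gt;For the startup probe in the Deployment above (&lt;code&gt;failureThreshold: 30&lt;/code&gt;, &lt;code&gt;periodSeconds: 10&lt;/code&gt;), that would be up to roughly 300 seconds; the timeline here uses slightly different values.&lt;/p&gt;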

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1cficnbhlf0oqjn15ak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1cficnbhlf0oqjn15ak.png" alt=" " width="800" height="305"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;🖼️ Pic source: Wojciech Sierakowski (HMH Engineering)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It might be a little excessive to wait more than 3 minutes for the app to launch locally with faked dependencies!&lt;/p&gt;

&lt;p&gt;🎯 It may also be better to shorten this interval if you are absolutely certain that, for example, reading secrets and credentials and establishing connections with DBs and other data sources shouldn't take so long: an overly generous startup window slows down your deployments.&lt;/p&gt;

&lt;p&gt;It is also worth figuring out whether you even need more nodes; you don't want to waste money on resources you don't need. Take a look at &lt;code&gt;kubectl top nodes&lt;/code&gt; to see whether you need to scale the nodes.&lt;/p&gt;

&lt;p&gt;🚧 If a probe fails, the event is recorded and the container is killed by the &lt;code&gt;kubelet&lt;/code&gt; in accordance with the &lt;code&gt;restartPolicy&lt;/code&gt; setting.&lt;/p&gt;

&lt;p&gt;When a container gets restarted, you usually want to check the logs to see why the application went unhealthy. You can do this with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt; --previous
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ⚙️ Readiness probes
&lt;/h3&gt;

&lt;p&gt;Readiness probes track whether the application is able to serve traffic; if the probe fails, no traffic is forwarded to the pod. They are used when an application requires configuration before it becomes usable. An application may also become congested with traffic, causing the probe to fail; this stops further traffic from being routed to the pod and gives it room to recover. When the probe fails, the endpoints controller removes the pod from the service endpoints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-api-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-api&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-api&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myrepo/test-api:0.1&lt;/span&gt;
        &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/ready&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;successThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the readiness probe fails but the liveness probe succeeds, the &lt;code&gt;kubelet&lt;/code&gt; concludes that the container is not yet prepared to receive network traffic, but is making progress in that direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The operation of Kubernetes probes
&lt;/h2&gt;

&lt;p&gt;Probes are controlled by the &lt;code&gt;kubelet&lt;/code&gt;, the primary "node agent" that runs on each node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsrzal6zufqn9o5ivc2as.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsrzal6zufqn9o5ivc2as.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;🖼️ Pic source: Andrew Lock (Datadog). SVG is &lt;a href="https://andrewlock.net/content/images/2020/k8s_probes.svg" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The application needs to support one of the following handlers in order to use a K8S probe effectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ExecAction&lt;/code&gt; handler: Executes a command inside the container. If the command returns a status code of &lt;code&gt;0&lt;/code&gt;, the diagnostic is successful.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TCPSocketAction&lt;/code&gt; handler: Tries to establish a TCP connection to the pod's IP address on a particular port. If the port is discovered to be open, the diagnostic is successful.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;HTTPGetAction&lt;/code&gt; handler: Sends an &lt;code&gt;HTTP GET&lt;/code&gt; request to the pod's IP address on a particular port and predetermined path. If the response code falls between &lt;code&gt;200&lt;/code&gt; and &lt;code&gt;399&lt;/code&gt;, the diagnostic is successful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before version 1.24 Kubernetes did not support gRPC health checks natively. This left the gRPC developers with the following three approaches when they deploy to Kubernetes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2d3lx0swx49kxm1qjn54.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2d3lx0swx49kxm1qjn54.png" alt=" " width="800" height="286"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;🖼️ Pic source: Ahmet Alp Balkan (Twitter, ex-Google)&lt;/em&gt;&lt;/p&gt;
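&lt;p&gt;One common pre-1.24 workaround (a sketch; the binary path and port here are assumptions) was to bundle the &lt;code&gt;grpc_health_probe&lt;/code&gt; binary in the image and call it from an exec probe:&lt;/p&gt;

```yaml
livenessProbe:
  exec:
    command:
    - /bin/grpc_health_probe   # assumed install path inside the image
    - -addr=:2379              # gRPC server address to check
  initialDelaySeconds: 5
  periodSeconds: 5
```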

&lt;p&gt;As of Kubernetes version 1.24, a gRPC handler &lt;a href="https://kubernetes.io/blog/2022/05/13/grpc-probes-now-in-beta/" rel="noopener noreferrer"&gt;can be configured&lt;/a&gt; to be used by the &lt;code&gt;kubelet&lt;/code&gt; for application liveness checks if your application implements the gRPC Health Checking Protocol. Checks that use gRPC rely on the &lt;code&gt;GRPCContainerProbe&lt;/code&gt; &lt;a href="https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/" rel="noopener noreferrer"&gt;feature gate&lt;/a&gt;, which is enabled by default from v1.24, where the feature reached beta.&lt;/p&gt;

&lt;p&gt;When the &lt;code&gt;kubelet&lt;/code&gt; conducts a probe on a container, the result is &lt;code&gt;Success&lt;/code&gt;, &lt;code&gt;Failure&lt;/code&gt;, or &lt;code&gt;Unknown&lt;/code&gt;, depending on whether the diagnostic succeeded, failed, or could not be completed for some other reason.&lt;/p&gt;
&lt;h2&gt;
  
  
  So, how often should you track the pulse?
&lt;/h2&gt;

&lt;p&gt;Before defining a probe, examine the system's behavior and the typical startup times of the pod and its containers, so that you can choose appropriate thresholds. The probe settings should also be revisited as the infrastructure or application changes. For instance, configuring a pod to use more system resources can affect the values that need to be set for its probes.&lt;/p&gt;
&lt;h2&gt;
  
  
  Handlers in action: some examples
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;code&gt;ExecAction&lt;/code&gt; handler: how can it be useful in practice?
&lt;/h3&gt;

&lt;p&gt;🎯 It allows you to run commands inside containers to check a container's liveness. With this option you can examine several aspects of the container's operation, such as the existence of files, their contents, and other conditions accessible at the command level.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ExecAction&lt;/code&gt; is executed inside the container and is deemed failed if the command returns any exit code other than &lt;code&gt;0&lt;/code&gt; (zero).&lt;/p&gt;

&lt;p&gt;The example below demonstrates an &lt;code&gt;exec&lt;/code&gt; probe that uses the &lt;code&gt;cat&lt;/code&gt; command to check whether a file exists at the path &lt;code&gt;/usr/share/liveness/html/index.html&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;liveness&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;liveness-exec&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;liveness&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.k8s.io/liveness:0.1&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
    &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;exec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cat&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/usr/share/liveness/html/index.html&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🚧 If the file is missing, the liveness probe fails and the container is restarted.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;TCPSocketAction&lt;/code&gt; handler: how can it be useful in practice?
&lt;/h3&gt;

&lt;p&gt;In this use case, the liveness probe uses the TCP handler to determine whether port &lt;code&gt;8080&lt;/code&gt; is active and open. With this configuration, the &lt;code&gt;kubelet&lt;/code&gt; will attempt to open a socket to your container on the designated port.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;liveness&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;liveness-tcp&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;liveness&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.k8s.io/liveness:0.1&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
    &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;tcpSocket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🚧 If the socket cannot be opened, the liveness probe fails and the container is restarted.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;HTTPGetAction&lt;/code&gt; handler: how can it be useful in practice?
&lt;/h3&gt;

&lt;p&gt;This case demonstrates the HTTP handler, which will send an HTTP GET request to the &lt;code&gt;/health&lt;/code&gt; path on port &lt;code&gt;8080&lt;/code&gt;. Any status code from &lt;code&gt;200&lt;/code&gt; up to, but not including, &lt;code&gt;400&lt;/code&gt; indicates that the probe was successful.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;liveness&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;liveness-http&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;liveness&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.k8s.io/liveness:0.1&lt;/span&gt;
    &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
        &lt;span class="na"&gt;httpHeaders&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Custom-Header&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ItsAlive&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🚧 If a code outside this range is returned, the probe fails and the container is restarted. Any custom headers you want to transmit can be defined using the &lt;code&gt;httpHeaders&lt;/code&gt; option.&lt;/p&gt;

&lt;h3&gt;
  
  
  gRPC handler: how can it be useful in practice?
&lt;/h3&gt;

&lt;p&gt;The gRPC protocol is on its way to becoming the &lt;em&gt;lingua franca&lt;/em&gt; for communication between cloud-native microservices. If you are deploying gRPC applications to Kubernetes today, you may be wondering about the best way &lt;a href="https://github.com/grpc/grpc/blob/master/doc/health-checking.md" rel="noopener noreferrer"&gt;to configure health checks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This example demonstrates how to check port &lt;code&gt;2379&lt;/code&gt; responsiveness using the gRPC health checking protocol. A port must be specified in order to use a gRPC probe. You must also specify the service if the &lt;a href="https://kubernetes.io/docs/reference/using-api/health-checks/" rel="noopener noreferrer"&gt;health endpoint&lt;/a&gt; is set up on a non-default service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;liveness&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;liveness-gRPC&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;liveness&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.k8s.io/liveness:0.1&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2379&lt;/span&gt;
    &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;grpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2379&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🚧 If the gRPC health check fails, the liveness probe fails and the container is restarted.&lt;/p&gt;

&lt;p&gt;Since built-in gRPC probes do not distinguish between &lt;a href="https://grpc.github.io/grpc/core/md_doc_statuscodes.html" rel="noopener noreferrer"&gt;error codes&lt;/a&gt;, all errors are regarded as probe failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using liveness probes in the wrong way can lead to disaster
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Please remember that the container will be restarted if the liveness probe fails. It is not conventional to examine dependencies in a liveness probe, unlike a readiness probe. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To determine whether the container itself has stopped responding, a liveness probe should be utilized.&lt;/p&gt;

&lt;p&gt;A drawback of a liveness probe is that it may not actually verify the service's responsiveness. For instance, if a service runs two web servers, one for service routes and one for status routes such as readiness and liveness probes or metrics gathering, the service may be slow or inaccessible while the liveness probe route still responds without any issues. To be effective, the liveness probe must exercise the service in a way comparable to how dependent services use it.&lt;/p&gt;

&lt;p&gt;Like the readiness probe, it's crucial to take into account dynamics that change over time. A slight increase in response time, possibly brought on by a brief rise in load, could force the container to restart if the liveness-probe timeout is too short. The restart might put even more strain on the other pods supporting the service, leading to a further cascade of liveness probe failures and worsening the service's overall availability. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fox7ypl60h7ocwj3csckv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fox7ypl60h7ocwj3csckv.png" alt=" " width="720" height="483"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;🖼️ Pic source: Wojciech Sierakowski (HMH Engineering)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These cascade failures can be prevented by configuring liveness probe timeouts on the order of client timeouts and employing a forgiving &lt;code&gt;failureThreshold&lt;/code&gt; count.&lt;/p&gt;
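&lt;p&gt;For instance (the numbers here are illustrative, not prescriptive), a liveness probe tuned against such cascades might pair a client-like timeout with a forgiving failure count:&lt;/p&gt;

```yaml
livenessProbe:
  httpGet:
    path: /health      # hypothetical endpoint
    port: 8080
  timeoutSeconds: 5    # on the order of the client timeout
  periodSeconds: 10
  failureThreshold: 6  # tolerate about a minute of slow responses before restarting
```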

&lt;p&gt;Liveness probes may have a small issue with the container startup latency varying over time (see above about the math). Changes in resource allocation, network topology changes, or just rising load as your service grows could all contribute to this. &lt;/p&gt;

&lt;p&gt;If the &lt;code&gt;initialDelaySeconds&lt;/code&gt; option is insufficient and a container is restarted as a result of a Kubernetes node failure or a liveness probe failure, the application may never start, or may start partially before being repeatedly destroyed and restarted. The &lt;code&gt;initialDelaySeconds&lt;/code&gt; option should therefore be greater than the container's maximum initialization time. &lt;/p&gt;

&lt;h2&gt;
  
  
  Some notable suggestions are:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Keep dependencies out of liveness probes. Liveness probes should be cheap to execute and have consistent response times.&lt;/li&gt;
&lt;li&gt;Set liveness probe timeouts conservatively, so that system dynamics can change temporarily or permanently without causing an excessive number of liveness probe failures. Consider setting client timeouts and liveness-probe timeouts to the same value.&lt;/li&gt;
&lt;li&gt;Set the &lt;code&gt;initialDelaySeconds&lt;/code&gt; option conservatively, so that containers can be restarted reliably even if startup dynamics vary over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The inevitable summary
&lt;/h2&gt;

&lt;p&gt;Properly combining liveness probes with readiness and startup probes increases pod resilience and availability, since a container is automatically restarted when a particular check fails. Choosing the right settings for these probes requires understanding your application.&lt;/p&gt;

&lt;p&gt;The author is thankful to Guy Menachem from Komodor for inspiration! Stable applications in the clouds to you all, folks!&lt;/p&gt;

&lt;h3&gt;
  
  
  More to read:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Traefik &lt;a href="https://doc.traefik.io/traefik/user-guides/grpc/#grpc-examples" rel="noopener noreferrer"&gt;docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kubernetes &lt;a href="https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#probe-v1-core" rel="noopener noreferrer"&gt;API reference&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Guy's &lt;a href="https://komodor.com/blog/kubernetes-health-checks-everything-you-need-to-know/" rel="noopener noreferrer"&gt;post&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>api</category>
    </item>
    <item>
      <title>Kubernetes TLS, Demystified</title>
      <dc:creator>Roman Belshevitz</dc:creator>
      <pubDate>Tue, 11 Oct 2022 18:16:13 +0000</pubDate>
      <link>https://dev.to/otomato_io/possible-paths-2hfc</link>
      <guid>https://dev.to/otomato_io/possible-paths-2hfc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is the anniversary 10th article in this series.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;🛡️ It is more than obvious that a secure connection to any exposed service running in a Kubernetes cluster is important. &lt;/p&gt;

&lt;p&gt;This article assumes that you wish to set up TLS (Transport Layer Security) for your &lt;a href="https://docs.nginx.com/nginx-ingress-controller/" rel="noopener noreferrer"&gt;ingress resource&lt;/a&gt; and that you already have a functioning ingress controller established in your cluster.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Today, SSL's replacement technology is called Transport Layer Security (TLS). TLS is an enhanced version of SSL. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It operates similarly to SSL, using encryption to safeguard the transfer of data and information. Although the term SSL is still commonly used in the industry, the &lt;em&gt;two names are frequently used interchangeably&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting a certificate: what paths can be taken?
&lt;/h2&gt;

&lt;p&gt;A TLS/SSL certificate is the fundamental prerequisite for ingress TLS. These certificates are available to you in the following ways.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Path one&lt;/strong&gt;. Self-signed certificates: the TLS certificate is &lt;a href="https://www.ibm.com/docs/en/api-connect/10.0.1.x?topic=overview-generating-self-signed-certificate-using-openssl" rel="noopener noreferrer"&gt;created and signed&lt;/a&gt; by our own Certificate Authority (root CA). It is a well-known choice for &lt;em&gt;testing scenarios&lt;/em&gt;, where you can distribute the root CA so that browsers will accept the certificate. &lt;/li&gt;
&lt;/ul&gt;
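For path one, a throwaway self-signed certificate can be generated with `openssl` in a single command; a minimal sketch, assuming the placeholder hostname `app.example.com`:

```shell
# Generate a throwaway self-signed certificate and private key
# (365-day validity; the CN below is a placeholder hostname)
openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout server.key -out server.crt \
    -days 365 -subj "/CN=app.example.com"

# Inspect the result
openssl x509 -in server.crt -noout -subject -dates
```

For a more realistic test setup you would instead create a root CA first and sign the server certificate with it, as the linked IBM guide shows.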

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8ol9xl8v4idbbuh8ahf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx8ol9xl8v4idbbuh8ahf.png" alt=" " width="450" height="310"&gt;&lt;/a&gt;&lt;br&gt;
🖼️ &lt;em&gt;Pic source: Bizagi&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Path two&lt;/strong&gt;. Get an SSL certificate: for production use cases, you must &lt;a href="https://www.google.com/search?q=buy+ssl+certificate" rel="noopener noreferrer"&gt;purchase&lt;/a&gt; an SSL certificate from a reputable certificate authority that operating systems and browsers trust. But you must bear in mind that a so-called &lt;em&gt;wildcard certificate&lt;/em&gt; suitable for protecting all subdomains in a domain can cost $300+/year from major commercial issuers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Path three&lt;/strong&gt;. Use a Let's Encrypt certificate: Let's Encrypt is a reputable certificate authority that issues &lt;em&gt;free&lt;/em&gt; TLS certificates. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  A few words about Let's Encrypt
&lt;/h3&gt;

&lt;p&gt;It is a non-profit organization founded in 2014 by &lt;a href="https://letsencrypt.org/2022/09/12/remembering-peter-eckersley.html" rel="noopener noreferrer"&gt;enthusiasts&lt;/a&gt; fighting for privacy and security. &lt;/p&gt;

&lt;p&gt;The challenge–response protocol used to automate enrollment with the certificate authority is called Automated Certificate Management Environment (&lt;a href="https://letsencrypt.org/how-it-works/" rel="noopener noreferrer"&gt;ACME&lt;/a&gt;). It can query either Web servers or DNS servers controlled by the holder of the domain covered by the certificate to be issued.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzzu0229x0x0tye8d3fk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuzzu0229x0x0tye8d3fk.png" alt=" " width="689" height="380"&gt;&lt;/a&gt;&lt;br&gt;
🖼️ &lt;em&gt;Pic source: Let's Encrypt&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you are interested in the implementation of the protocol, read what &lt;a href="https://blog.acolyer.org/2020/02/12/lets-encrypt-an-automated-certificate-authority-to-encrypt-the-entire-web/" rel="noopener noreferrer"&gt;Adrian Colyer&lt;/a&gt; from SpringSource writes about them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each SSL certificate has an &lt;em&gt;expiration date&lt;/em&gt;. So, before the certificate expires, you need to &lt;em&gt;rotate&lt;/em&gt; it. For instance, Let's Encrypt certificates have a &lt;em&gt;three-month&lt;/em&gt; (90-day) lifetime (and &lt;a href="https://letsencrypt.org/2015/11/09/why-90-days.html" rel="noopener noreferrer"&gt;here&lt;/a&gt; they tell why).&lt;/p&gt;
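Checking when a certificate expires is a one-liner with `openssl`. A self-contained sketch (the 90-day self-signed certificate here just mimics the Let's Encrypt lifetime; names are placeholders):

```shell
# Issue a short-lived (90-day) self-signed certificate to mimic
# the Let's Encrypt lifetime
openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout le-demo.key -out le-demo.crt \
    -days 90 -subj "/CN=demo.example.com"

# Print its expiration date; in production, alert well before this date
openssl x509 -in le-demo.crt -noout -enddate
```

For a live server you would feed `openssl x509` from `openssl s_client -connect host:443` instead of a local file.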

&lt;p&gt;Later in this article series, the author will dwell on the &lt;strong&gt;third path&lt;/strong&gt; in detail. Why? Because this path is interesting for its relative self-sufficiency and [relative] independence from commercial / state-owned certificate issuers. In general, the motto of this path is: "If you made something with your own hands, you know how it works, so you're better adapted to survival!"&lt;/p&gt;

&lt;p&gt;Of course, the Let's Encrypt approach does not fit &lt;em&gt;every&lt;/em&gt; need, but it works for academic purposes and for startups.&lt;/p&gt;

&lt;p&gt;But first, let's look at the situation "in manual mode" and simply try to associate a certificate with a protected application. So, &lt;/p&gt;
&lt;h3&gt;
  
  
  Chicken or egg?
&lt;/h3&gt;

&lt;p&gt;The &lt;em&gt;ingress controller&lt;/em&gt;, not the ingress resource, is in charge of SSL. In other words, the ingress controller &lt;em&gt;accesses&lt;/em&gt; the TLS certificates you provide to the ingress resource as a Kubernetes &lt;code&gt;secret&lt;/code&gt; and incorporates them into its configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs68iypo5revwa9w665am.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs68iypo5revwa9w665am.png" alt=" " width="768" height="467"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Setup TLS/SSL certificates for ingress
&lt;/h2&gt;

&lt;p&gt;Let's examine the procedures for setting up TLS for ingress. We'll start by launching a test application on the cluster. This application will be used to test our TLS-secured ingress.&lt;/p&gt;

&lt;p&gt;First, create a new namespace, &lt;code&gt;trial&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create namespace trial
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the following as &lt;code&gt;hello-app.yaml&lt;/code&gt;; it contains the &lt;code&gt;Deployment&lt;/code&gt; and &lt;code&gt;Service&lt;/code&gt; objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello-app&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trial&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rbalashevich/hello-app:2.0"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello-service&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trial&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
    &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy the application with a command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f hello-app.yaml -n trial
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Create a Kubernetes TLS Secret
&lt;/h3&gt;

&lt;p&gt;It is necessary to make the SSL certificate a Kubernetes secret. It will subsequently be referenced in the &lt;code&gt;tls&lt;/code&gt; block of the &lt;code&gt;Ingress&lt;/code&gt; resource.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;server.crt&lt;/code&gt; (certificate with its CA trust chain) and &lt;code&gt;server.key&lt;/code&gt; (private key) SSL files are assumed to be available from a Certificate Authority, your company, or, as a last resort, self-signed.&lt;/p&gt;

&lt;p&gt;⚠️ A private key is created by you (the certificate owner) when you request your certificate with a Certificate Signing Request (CSR). In other words, you generate a private key together with the CSR. You submit the CSR to the certificate authority and keep the private key in a safe place. &lt;/p&gt;
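This CSR step can be sketched with `openssl`; the subject below reuses the example hostname from this article and is, of course, a placeholder:

```shell
# Generate a fresh private key and a CSR for it in one command.
# server.key never leaves your machine; only server.csr is
# submitted to the certificate authority.
openssl req -new -newkey rsa:2048 -nodes \
    -keyout server.key -out server.csr \
    -subj "/CN=app.hosting.cloudprovider.com"

# Sanity-check the CSR before submitting it
openssl req -in server.csr -noout -verify -subject
```

The CA then returns the signed `server.crt` corresponding to this key.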

&lt;p&gt;As for the big three public cloud providers, they have instructions for exporting certificates: &lt;a href="https://docs.aws.amazon.com/acm/latest/userguide/export-private.html" rel="noopener noreferrer"&gt;AWS CM&lt;/a&gt;, &lt;a href="https://cloud.google.com/sdk/gcloud/reference/privateca/certificates" rel="noopener noreferrer"&gt;GCP CAS&lt;/a&gt;, &lt;a href="https://learn.microsoft.com/en-us/azure/key-vault/certificates/how-to-export-certificate?tabs=azure-cli" rel="noopener noreferrer"&gt;Azure KV&lt;/a&gt;.&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;And yes, keep the private key (&lt;code&gt;server.key&lt;/code&gt;)!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let's use the &lt;code&gt;server.crt&lt;/code&gt; and &lt;code&gt;server.key&lt;/code&gt; files to construct a Kubernetes secret of &lt;code&gt;tls&lt;/code&gt; type (SSL certificates). In the &lt;code&gt;trial&lt;/code&gt; namespace, where the &lt;code&gt;hello-app&lt;/code&gt; deployment is located, we are creating the secret.&lt;/p&gt;

&lt;p&gt;Run the &lt;code&gt;kubectl&lt;/code&gt; command listed below from the directory where your &lt;code&gt;.crt&lt;/code&gt; and &lt;code&gt;.key&lt;/code&gt; files are located, or supply the &lt;em&gt;absolute path&lt;/em&gt; to the files. The name &lt;code&gt;hello-app-tls&lt;/code&gt; is made up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create secret tls hello-app-tls \
    --namespace trial \
    --key server.key \
    --cert server.crt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The equivalent YAML manifest is provided below; note that the values under &lt;code&gt;data&lt;/code&gt; must be the &lt;em&gt;base64-encoded&lt;/em&gt; contents of the &lt;code&gt;.crt&lt;/code&gt; and &lt;code&gt;.key&lt;/code&gt; files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello-app-tls&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trial&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/tls&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;server.crt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
       &lt;span class="s"&gt;&amp;lt;crt contents here&amp;gt;&lt;/span&gt;
  &lt;span class="na"&gt;server.key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
       &lt;span class="s"&gt;&amp;lt;private key contents here&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Kubernetes ingress is &lt;a href="https://kubernetes.io/docs/concepts/services-networking/ingress/" rel="noopener noreferrer"&gt;a set of rules&lt;/a&gt; that can be configured to give services externally reachable URLs. Based on this understanding, to turn on a secure connection, we should add a &lt;code&gt;tls&lt;/code&gt; block to the &lt;code&gt;Ingress&lt;/code&gt; object. So, in the &lt;code&gt;trial&lt;/code&gt; namespace, we create a sample TLS-capable ingress resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello-app-ingress&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trial&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;app.hosting.cloudprovider.com&lt;/span&gt;
    &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello-app-tls&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app.hosting.cloudprovider.com"&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/"&lt;/span&gt;
          &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hello-service&lt;/span&gt;
              &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⚠️ Replace &lt;code&gt;app.hosting.cloudprovider.com&lt;/code&gt; with your actual hostname. The &lt;code&gt;host(s)&lt;/code&gt; must be the same in both the &lt;code&gt;rules&lt;/code&gt; and &lt;code&gt;tls&lt;/code&gt; blocks of the &lt;code&gt;Ingress&lt;/code&gt; manifest. In other words, they must match.&lt;/p&gt;

&lt;p&gt;If you are using the NGINX ingress controller and want &lt;em&gt;strict&lt;/em&gt; SSL, you can add an &lt;a href="https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/nginx-configuration/annotations.md" rel="noopener noreferrer"&gt;annotation&lt;/a&gt; supported by your ingress controller. For instance, the annotation &lt;code&gt;nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"&lt;/code&gt; tells the NGINX ingress controller to keep traffic encrypted all the way up to the application.&lt;/p&gt;
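A sketch of where that annotation lives, reusing the names from the earlier example (the backend itself must serve HTTPS for this to be useful):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello-app-ingress
  namespace: trial
  annotations:
    # The controller re-encrypts and speaks HTTPS to the backend pods
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
```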

&lt;h3&gt;
  
  
  The way to make sure
&lt;/h3&gt;

&lt;p&gt;Let's check with &lt;code&gt;curl https://app.hosting.cloudprovider.com -kv&lt;/code&gt; whether the connection to the app is secure now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=app.hosting.cloudprovider.com
*  start date: Oct 6 15:35:07 2022 GMT
*  expire date: Oct 6 15:35:07 2023 GMT
*  issuer: CN=Go Daddy Secure Certificate Authority - G2,
              OU=http://certs.godaddy.com/repository/,
              O="GoDaddy.com, Inc.",L=Scottsdale,ST=Arizona,C=US
*  SSL certificate verify ok.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🔒 If the certificate is valid, the browser will not complain and there will be no frightening warnings either. Voilà, the connection to our app is secure!&lt;/p&gt;

&lt;p&gt;Okay, we've covered the first and second paths. The next step is exploring the third path, which involves a Let's Encrypt certificate. &lt;/p&gt;

&lt;h2&gt;
  
  
  Estne vita vere brevis?
&lt;/h2&gt;

&lt;p&gt;Sed vita est cum dignitate vivendum. As the author noted above, the life of a Let's Encrypt certificate is &lt;a href="https://letsencrypt.org/2015/11/09/why-90-days.html" rel="noopener noreferrer"&gt;short&lt;/a&gt;; that is the price you pay for free certificates. Accordingly, a solution is required that automates the re-issuance of short-lived certificates, right? And such a solution exists: it is &lt;code&gt;cert-manager&lt;/code&gt;! It streamlines the process of getting, renewing, and using certificates by adding certificates and certificate issuers as &lt;em&gt;resource types&lt;/em&gt; in Kubernetes clusters.&lt;/p&gt;

&lt;p&gt;It can generate certificates from a number of supported sources, including Let's Encrypt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w3rx680y8nvz18b9ms8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w3rx680y8nvz18b9ms8.png" alt=" " width="747" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Furthermore, it will check that certificates are current and valid and make an attempt to &lt;strong&gt;renew&lt;/strong&gt; them for a specified period &lt;strong&gt;before&lt;/strong&gt; they &lt;strong&gt;expire&lt;/strong&gt;.&lt;/p&gt;
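As a hedged sketch of that renewal behavior, a Certificate resource might look like the following; the resource names and the `letsencrypt-staging` issuer are assumptions made for illustration:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: hello-app-cert
  namespace: trial
spec:
  secretName: hello-app-tls        # Secret where the issued cert is stored
  duration: 2160h                  # 90 days, the Let's Encrypt lifetime
  renewBefore: 360h                # start renewal 15 days before expiry
  dnsNames:
  - app.hosting.cloudprovider.com
  issuerRef:
    name: letsencrypt-staging      # an Issuer assumed to exist in this namespace
    kind: Issuer
```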

&lt;h3&gt;
  
  
  Install cert-manager on Kubernetes
&lt;/h3&gt;

&lt;p&gt;According to the official &lt;code&gt;cert-manager&lt;/code&gt; documentation, you can install it by using &lt;a href="https://cert-manager.io/docs/installation/kubectl/" rel="noopener noreferrer"&gt;kubectl&lt;/a&gt; or by the provided &lt;a href="https://cert-manager.io/docs/installation/helm/" rel="noopener noreferrer"&gt;helm&lt;/a&gt; chart.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a dedicated Kubernetes namespace for cert-manager
kubectl create namespace cert-manager

# Add official cert-manager repository to helm CLI
helm repo add jetstack https://charts.jetstack.io

# Update Helm repository cache (think of apt update)
helm repo update

# Install cert-manager on Kubernetes
## cert-manager relies on several Custom Resource Definitions (CRDs)
helm install certmgr jetstack/cert-manager \
    --set installCRDs=true \
    --version v1.9.1 \
    --namespace cert-manager
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Issuer&lt;/code&gt; is responsible for issuing certificates. It is the signing authority and, based on its configuration, knows how certificate requests are handled.  &lt;/p&gt;

&lt;p&gt;Cert-manager also creates several objects using different specifications such as &lt;code&gt;CertificateRequest&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;A &lt;code&gt;Certificate&lt;/code&gt; resource is a readable representation of a certificate request. Certificate resources are linked to an &lt;code&gt;Issuer&lt;/code&gt; who is responsible for requesting and renewing the certificate. &lt;/p&gt;

&lt;p&gt;To determine &lt;em&gt;if&lt;/em&gt; a certificate &lt;em&gt;needs to be re-issued&lt;/em&gt;, &lt;code&gt;cert-manager&lt;/code&gt; looks at the &lt;code&gt;spec&lt;/code&gt; of the &lt;code&gt;Certificate&lt;/code&gt; resource and the latest &lt;code&gt;CertificateRequests&lt;/code&gt;, as well as the data in the &lt;code&gt;Secret&lt;/code&gt; containing the certificate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Let's Encrypt: staging or production server?
&lt;/h3&gt;

&lt;p&gt;An &lt;code&gt;Issuer&lt;/code&gt; is a custom resource (CRD) which tells &lt;code&gt;cert-manager&lt;/code&gt; how to sign a &lt;code&gt;Certificate&lt;/code&gt;. Following &lt;a href="https://cert-manager.io/docs/tutorials/getting-started-with-cert-manager-on-google-kubernetes-engine-using-lets-encrypt-for-ingress-ssl/" rel="noopener noreferrer"&gt;this howto (section 7)&lt;/a&gt;, the &lt;code&gt;Issuer&lt;/code&gt; will be configured to connect to the Let's Encrypt staging server, which allows you to test everything without using up your Let's Encrypt &lt;a href="https://letsencrypt.org/docs/rate-limits/" rel="noopener noreferrer"&gt;certificate quota&lt;/a&gt; for the domain name. &lt;/p&gt;

&lt;p&gt;After debugging, you can safely issue a certificate by using LE's production server.&lt;/p&gt;
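A minimal sketch of such a staging Issuer, assuming the NGINX ingress controller and a placeholder contact address:

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: letsencrypt-staging
  namespace: trial
spec:
  acme:
    # Staging endpoint: does not count against production rate limits
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: you@example.com           # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-staging-key  # where the ACME account key is stored
    solvers:
    - http01:
        ingress:
          class: nginx
```

Once everything works, changing `server` to `https://acme-v02.api.letsencrypt.org/directory` switches the Issuer to production.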

&lt;p&gt;For a video describing &lt;code&gt;cert-manager&lt;/code&gt; YAML syntax, the author of this article recommends &lt;a href="https://youtu.be/7m4_kZOObzw" rel="noopener noreferrer"&gt;📽️ Anton Putra's good one&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Let's Encrypt's reputation as a reliable certificate authority has made SSL certificate acquisition simple. Together with the &lt;code&gt;cert-manager&lt;/code&gt; tool, ops engineers can quickly and easily ensure correct transport encryption and interoperability with already-existing parts like NGINX Ingress. In addition to the example mentioned above, &lt;code&gt;cert-manager&lt;/code&gt; can help with trickier situations like those involving wildcard SSL certificates.&lt;/p&gt;

&lt;p&gt;If you're interested in using Let's Encrypt outside of a Kubernetes cluster, take a look at &lt;a href="https://github.com/caddyserver/caddy" rel="noopener noreferrer"&gt;Caddy&lt;/a&gt;, a 43k ⭐ open-source web server, and also at Certbot, a 29k ⭐ ACME client which is open source, too.&lt;/p&gt;

&lt;p&gt;Ever tried using &lt;code&gt;wireshark&lt;/code&gt; to monitor web traffic? Follow &lt;a href="https://www.comparitech.com/net-admin/decrypt-ssl-with-wireshark/" rel="noopener noreferrer"&gt;Aaron Phillips&lt;/a&gt; from Comparitech to learn how.&lt;/p&gt;

&lt;p&gt;Safe connections to you!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ssl</category>
      <category>api</category>
      <category>crd</category>
    </item>
    <item>
      <title>Admission Controllers in Action: Datree's Approach</title>
      <dc:creator>Roman Belshevitz</dc:creator>
      <pubDate>Sun, 11 Sep 2022 09:19:45 +0000</pubDate>
      <link>https://dev.to/otomato_io/admission-controllers-in-action-datrees-approach-143d</link>
      <guid>https://dev.to/otomato_io/admission-controllers-in-action-datrees-approach-143d</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/otomato_io/responsible-approach-to-communicating-with-the-api-server-admission-controllers-3b49"&gt;eighth part&lt;/a&gt;, the author talked about admission controllers. In this, the ninth, we will see how ACs can be used for practical purposes.&lt;/p&gt;

&lt;p&gt;At the same time, this may be considered the second part of the &lt;a href="https://dev.to/otomato_io/datree-a-tool-which-really-shifts-your-cluster-security-even-more-left-1g20"&gt;review&lt;/a&gt;, so both parts will be marked with the appropriate &lt;code&gt;#datree&lt;/code&gt; tag.&lt;/p&gt;

&lt;h2&gt;
  
  
  The originality of Datree's approach
&lt;/h2&gt;

&lt;p&gt;In brief, &lt;a href="https://github.com/datreeio/admission-webhook-datree" rel="noopener noreferrer"&gt;Datree's integration&lt;/a&gt; enables you to check your resources against the defined policy &lt;strong&gt;a moment before&lt;/strong&gt; you put them into a cluster... by leveraging an admission webhook! 😎 &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/datreeio/admission-webhook-datree/blob/main/kube/validating-webhook-configuration.yaml" rel="noopener noreferrer"&gt;The webhook&lt;/a&gt; implemented with &lt;code&gt;ValidatingWebhookConfiguration&lt;/code&gt; will detect &lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#webhook-configuration" rel="noopener noreferrer"&gt;operations&lt;/a&gt; such as &lt;code&gt;CREATE&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt;, and it will start a policy check against the configs related to each operation. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvvkt02z3978vwpma40z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvvkt02z3978vwpma40z.png" alt=" " width="427" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If any configuration errors are discovered, the webhook will refuse the action and show a thorough output with guidance on how to fix each error.&lt;/p&gt;

&lt;p&gt;Once the webhook is installed, every cluster operation it is tied to will trigger a Datree policy check. If there are no configuration errors, the resource will get a green light🚦 to be applied or updated. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔬 Datree functions well both in a full-scale cluster and in a &lt;code&gt;k3s/k3d&lt;/code&gt;-baked one! It makes debugging easier even during local development. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Let's go step by step
&lt;/h2&gt;

&lt;p&gt;Following the Software-as-a-Service paradigm, Datree provides its users access to a misconfigurations database and to a personal workspace on their website, where all the checks initiated by the user are aggregated. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqeq74rpsh6peeouf91yo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqeq74rpsh6peeouf91yo.png" alt=" " width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This cheeky astronaut design will brighten up your day.&lt;/p&gt;

&lt;h3&gt;
  
  
  🎯 Step 1. Access token
&lt;/h3&gt;

&lt;p&gt;They [want to] know everything about you! Well, relax, it is a joke. Sign up or &lt;a href="https://app.datree.io/login" rel="noopener noreferrer"&gt;log in&lt;/a&gt;, then grab your token to access &lt;code&gt;datree&lt;/code&gt; programmatically. API access tokens are widespread in 2022, aren't they? 🔏&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx48pcy2urgxpngqbhzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx48pcy2urgxpngqbhzb.png" alt=" " width="711" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💰 There is only one token available when using the so-called Free Plan, enough for evaluation purposes (up to 4 Kubernetes nodes are supported; service access logs will be stored for two weeks).&lt;/p&gt;

&lt;h3&gt;
  
  
  🎯 Step 2. Set up your CLI environment
&lt;/h3&gt;

&lt;p&gt;The following binaries must be installed on the machine: &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;openssl&lt;/code&gt; (required for creating a certificate authority, CA) and &lt;code&gt;curl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Assume everything is in place. Let's run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ DATREE_TOKEN=[your-token] bash &amp;lt;(curl https://get.datree.io/admission-webhook)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you should see and what should happen to your cluster (yes, API requests are additionally encrypted):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔑 Generating TLS keys...
Generating a RSA private key
Signature ok
subject=CN = webhook-server.datree.svc
Getting CA Private Key
/home/roman
🔗 Creating webhook secret tls...
secret "webhook-server-tls" deleted
secret/webhook-server-tls created
🔗 Creating core resources...
serviceaccount/webhook-server-datree created
clusterrolebinding.rbac.authorization.k8s.io/rolebinding:webhook-server-datree created
clusterrole.rbac.authorization.k8s.io/webhook-server-datree created
deployment.apps/webhook-server configured
service/webhook-server created
deployment "webhook-server" successfully rolled out
🔗 Creating validation webhook resource...
validatingwebhookconfiguration.admissionregistration.k8s.io/webhook-datree configured
🎉 DONE! The webhook server is now deployed and configured
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🎯 Step 3. Protect your access token
&lt;/h3&gt;

&lt;p&gt;Because your token is private and you don't want to store it in your repository, we advise setting or changing it with a separate &lt;code&gt;kubectl patch&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl patch deployment webhook-server -n datree -p '
spec:
  template:
    spec:
      containers:
        - name: server
          env:
            - name: DATREE_TOKEN
              value: "&amp;lt;your-token&amp;gt;"'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
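A plain `env` value still ends up readable in the Deployment spec. As a sketch (not an official Datree recipe; the Secret name and key below are illustrative), you may instead keep the token in a Kubernetes Secret and reference it with `secretKeyRef`:

```yaml
# Hypothetical Secret holding the Datree token (name and key are illustrative)
apiVersion: v1
kind: Secret
metadata:
  name: datree-token
  namespace: datree
type: Opaque
stringData:
  token: "<your-token>"
---
# Patch fragment: source DATREE_TOKEN from the Secret instead of a literal value
spec:
  template:
    spec:
      containers:
        - name: server
          env:
            - name: DATREE_TOKEN
              valueFrom:
                secretKeyRef:
                  name: datree-token
                  key: token
```

This keeps the token out of manifests committed to the repository, though anyone with read access to Secrets in the namespace can still recover it.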



&lt;h3&gt;
  
  
  🎯 Step 4. Deploy something: magic will work for you
&lt;/h3&gt;

&lt;p&gt;The author does not want to reinvent the wheel, so let's take the classic &lt;code&gt;nginx&lt;/code&gt; deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:1.14.2&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's focus on the result of the &lt;code&gt;kubectl apply -f nginx-deployment.yaml&lt;/code&gt; routine (the deployment has been &lt;strong&gt;denied&lt;/strong&gt; by the AC):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl apply -f nginx-deployment.yaml
Error from server: error when creating "nginx-deployment.yaml": admission webhook "webhook-server.datree.svc" denied the request: 
webhook-nginx-deployment-Deployment.tmp.yaml

[V] YAML validation
[V] Kubernetes schema validation

[X] Policy check

❌  Ensure each container has a configured CPU limit  [1 occurrence]
    - metadata.name: nginx-deployment (kind: Deployment)
💡  Missing property object `limits.cpu` - value should be within the accepted boundaries recommended by the organization

❌  Ensure each container has a configured CPU request  [1 occurrence]
    - metadata.name: nginx-deployment (kind: Deployment)
💡  Missing property object `requests.cpu` - value should be within the accepted boundaries recommended by the organization

❌  Ensure each container has a configured liveness probe  [1 occurrence]
    - metadata.name: nginx-deployment (kind: Deployment)
💡  Missing property object `livenessProbe` - add a properly configured livenessProbe to catch possible deadlocks

❌  Ensure each container has a configured memory limit  [1 occurrence]
    - metadata.name: nginx-deployment (kind: Deployment)
💡  Missing property object `limits.memory` - value should be within the accepted boundaries recommended by the organization

❌  Ensure each container has a configured memory request  [1 occurrence]
    - metadata.name: nginx-deployment (kind: Deployment)
💡  Missing property object `requests.memory` - value should be within the accepted boundaries recommended by the organization

❌  Ensure each container has a configured readiness probe  [1 occurrence]
    - metadata.name: nginx-deployment (kind: Deployment)
💡  Missing property object `readinessProbe` - add a properly configured readinessProbe to notify kubelet your Pods are ready for traffic

❌  Prevent workload from using the default namespace  [1 occurrence]
    - metadata.name: nginx-deployment (kind: Deployment)
💡  Incorrect value for key `namespace` - use an explicit namespace instead of the default one (`default`)


(Summary)

- Passing YAML validation: 1/1

- Passing Kubernetes (v1.21.5) schema validation: 1/1

- Passing policy check: 0/1

+-----------------------------------+-----------------------+
| Enabled rules in policy "Default" | 21                    |
| Configs tested against policy     | 1                     |
| Total rules evaluated             | 21                    |
| Total rules skipped               | 0                     |
| Total rules failed                | 7                     |
| Total rules passed                | 14                    |
| See all rules in policy           | https://app.datree.io |
+-----------------------------------+-----------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Opening the link, you'll be redirected to your personal workspace. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6w70bgsoyktv1kyl8kxc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6w70bgsoyktv1kyl8kxc.png" alt=" " width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🎯 Step 5. Is such rigor necessary?
&lt;/h3&gt;

&lt;p&gt;You can audit &lt;a href="https://hub.datree.io/setup/centralized-policy" rel="noopener noreferrer"&gt;reactive policies&lt;/a&gt; and review the invocation history. If the checks are too strict, disable some of the policies. &lt;/p&gt;

&lt;p&gt;For example, not every deployment really needs pre-configured container readiness probes [or CPU &amp;amp; memory limits].  &lt;/p&gt;

&lt;p&gt;Well, &lt;strong&gt;another tryout&lt;/strong&gt; will be with the edited YAML (&lt;a href="https://k8syaml.com/" rel="noopener noreferrer"&gt;Octopus&lt;/a&gt; may be your fellow here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx-deployment&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RollingUpdate&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:1.14.2&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100Mi&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100m&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;200Mi&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we really want to use the &lt;code&gt;default&lt;/code&gt; namespace and are not afraid to do so, let's disable the &lt;code&gt;Prevent workload from using the default namespace&lt;/code&gt; policy in the Datree web UI. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzygl4h25ucxjlxnp20je.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzygl4h25ucxjlxnp20je.png" alt=" " width="685" height="42"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We may also want to free ourselves from the &lt;code&gt;Ensure each container has a configured readiness probe&lt;/code&gt; and &lt;code&gt;Ensure each container has a configured liveness probe&lt;/code&gt; policies.&lt;/p&gt;

&lt;p&gt;Et voilà! 🎭&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl apply -f nginx-deployment-advanced.yaml
deployment.apps/nginx-deployment created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now all checks pass successfully: the admission controller has given us a green light 🚦 and allowed the deployment!&lt;/p&gt;

&lt;h3&gt;
  
  
  🎯 Step 6. Hey, do not climb where it is not necessary
&lt;/h3&gt;

&lt;p&gt;If you want &lt;code&gt;datree&lt;/code&gt; to disregard a namespace, add the label &lt;code&gt;admission.datree/validate=skip&lt;/code&gt; to its configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl label namespaces default "admission.datree/validate=skip"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
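The same label can of course be set declaratively. A minimal namespace manifest (the namespace name is just an example):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: playground                      # example namespace to be ignored
  labels:
    admission.datree/validate: skip    # datree skips resources in this namespace
```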



&lt;h2&gt;
  
  
  How to wipe traces
&lt;/h2&gt;

&lt;p&gt;To delete the label and resume running the &lt;code&gt;datree&lt;/code&gt; webhook on the namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl label namespaces default "admission.datree/validate-"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Uninstall the webhook
&lt;/h2&gt;

&lt;p&gt;Copy the following command and run it in your terminal to remove the webhook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ bash &amp;lt;(curl https://get.datree.io/admission-webhook-uninstall)
validatingwebhookconfiguration.admissionregistration.k8s.io "webhook-datree" deleted
service "webhook-server" deleted
deployment.apps "webhook-server" deleted
secret "webhook-server-tls" deleted
clusterrolebinding.rbac.authorization.k8s.io "rolebinding:webhook-server-datree" deleted
serviceaccount "webhook-server-datree" deleted
clusterrole.rbac.authorization.k8s.io "webhook-server-datree" deleted
namespace/kube-system unlabeled
namespace "datree" deleted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summing up what was said
&lt;/h2&gt;

&lt;p&gt;As you can see, the possibilities of the Kubernetes API are quite broad. The author hopes that he has not only prepared an overview of a useful solution, but also explained the theoretical aspects behind its functionality.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>security</category>
      <category>datree</category>
    </item>
    <item>
      <title>Responsible Approach to Communicating With the API Server: Admission Controllers</title>
      <dc:creator>Roman Belshevitz</dc:creator>
      <pubDate>Fri, 02 Sep 2022 12:25:12 +0000</pubDate>
      <link>https://dev.to/otomato_io/responsible-approach-to-communicating-with-the-api-server-admission-controllers-3b49</link>
      <guid>https://dev.to/otomato_io/responsible-approach-to-communicating-with-the-api-server-admission-controllers-3b49</guid>
      <description>&lt;h2&gt;
  
  
  A bit of theory
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;RBAC&lt;/em&gt; and &lt;em&gt;Network policies&lt;/em&gt; are two fundamental security elements of Kubernetes that you probably already know about if you work with it. These mechanisms are helpful for enforcing fundamental guidelines regarding what operations different users or services within your cluster are permitted to carry out.&lt;/p&gt;

&lt;p&gt;However, there are situations when you require &lt;em&gt;more policy features&lt;/em&gt; or granularity than RBAC or network policies can provide. Alternatively, you might want to &lt;em&gt;run additional checks&lt;/em&gt; to verify a resource &lt;em&gt;before&lt;/em&gt; allowing it to join your cluster.&lt;/p&gt;

&lt;p&gt;Admission Controllers (ACs) allow you to add &lt;em&gt;additional options&lt;/em&gt; to the work of Kubernetes to change or &lt;em&gt;validate objects&lt;/em&gt; when making requests to the Kubernetes API. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkswoeuleimjfswqokxy2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkswoeuleimjfswqokxy2.png" alt=" " width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🖼️ Pic source: Giant Swarm&lt;/p&gt;

&lt;p&gt;The image shows the various parts that make up the API component. A request initiates communication between the API and the admission controller. After the request has been authenticated, the authorization module determines whether the request issuer is permitted to carry out the operation. The admission magic kicks in once the request has been duly approved.&lt;/p&gt;

&lt;p&gt;If the controller rejects the request, then the entire request to the API server is rejected and an error is returned to the end user.&lt;/p&gt;

&lt;p&gt;To activate the controllers discussed, you must specify their names as a list when creating or updating a cluster. After that, &lt;code&gt;kube-apiserver&lt;/code&gt; will be started or restarted with the &lt;code&gt;--enable-admission-plugins&lt;/code&gt; option and the chosen set of admission controllers.&lt;/p&gt;

&lt;p&gt;Passing a controller that is not available for the current version of Kubernetes will return an appropriate error.&lt;/p&gt;

&lt;h2&gt;
  
  
  What exactly can ACs be?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🔬 In a scope of implementation
&lt;/h3&gt;

&lt;p&gt;Admission controllers that are built into and shipped with Kubernetes itself are known as &lt;strong&gt;static&lt;/strong&gt; admission controllers. Not all of them are turned on by default. Cloud providers also adopt some of them, or restrict some of them for their own usage. If you own your Kubernetes deployment, you can enable and utilize them. Some examples:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;LimitRanger&lt;/code&gt; &lt;em&gt;makes sure that any of the restrictions&lt;/em&gt; listed in the &lt;code&gt;LimitRange&lt;/code&gt; object in a namespace &lt;em&gt;are not broken&lt;/em&gt; by incoming requests. Use this admission controller to impose those restrictions if you are utilizing &lt;code&gt;LimitRange&lt;/code&gt; objects in your Kubernetes setup. Applying default resource requests to pods without any specifications is also possible with this AC.&lt;/p&gt;
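To illustrate, here is a `LimitRange` object that `LimitRanger` would enforce in a namespace; the numbers are arbitrary defaults chosen for this sketch:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: dev              # example namespace
spec:
  limits:
    - type: Container
      default:                # applied as limits when a container specifies none
        cpu: 500m
        memory: 256Mi
      defaultRequest:         # applied as requests when a container specifies none
        cpu: 100m
        memory: 128Mi
```

With this in place, a pod created in `dev` without resource settings gets these requests and limits injected, and a pod exceeding them is rejected.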

&lt;p&gt;&lt;code&gt;AlwaysPullImages&lt;/code&gt; &lt;em&gt;changes the image pull policy&lt;/em&gt; for every new Pod. This is useful, for example, in multi-tenant clusters to ensure that only those with the credentials to fetch private images can access them. Without this admission controller, after an image has been pulled to a node, any pod from any user can use it just by knowing the image's name without any authorization checks. This feature must be enabled in the cluster.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NamespaceLifecycle&lt;/code&gt; &lt;em&gt;enforces that a namespace that is undergoing termination cannot have new objects&lt;/em&gt; created in it, and ensures that requests in a non-existent namespace are rejected.&lt;/p&gt;

&lt;p&gt;And there are &lt;strong&gt;dynamic&lt;/strong&gt; ones. See details below.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔬 In a scope of request processing
&lt;/h3&gt;

&lt;p&gt;There are &lt;em&gt;two types&lt;/em&gt; of dynamic admission controllers in Kubernetes, and they work slightly differently. Put shortly, one just &lt;em&gt;validates&lt;/em&gt; the requests, and the other &lt;em&gt;modifies&lt;/em&gt; them if they aren't up to spec.&lt;/p&gt;

&lt;p&gt;⚙️ &lt;strong&gt;The first type&lt;/strong&gt; is the &lt;strong&gt;validating&lt;/strong&gt; admission controller &lt;code&gt;ValidatingAdmissionWebhook&lt;/code&gt;, which proxies the requests to the subscribed webhooks. The Kubernetes API registers the webhooks based on the resource type and the request method. Every webhook runs some logic to validate the incoming resource, and it replies with a verdict to the API. &lt;/p&gt;

&lt;p&gt;In case the validation webhook rejects the request, the Kubernetes API returns a failed HTTP response to the user. Otherwise, it continues with the next admission.&lt;/p&gt;

&lt;p&gt;⚙️ &lt;strong&gt;The second type&lt;/strong&gt; is the &lt;strong&gt;mutating&lt;/strong&gt; admission controller &lt;code&gt;MutatingAdmissionWebhook&lt;/code&gt;, which alters the resource the user has submitted so that default values can be set or the schema can be verified. Cluster administrators can attach mutation webhooks to the API, which executes them similarly to validation webhooks. &lt;/p&gt;

&lt;h2&gt;
  
  
  Hooks! Hooks are everywhere!
&lt;/h2&gt;

&lt;p&gt;Any resource type, including those that are pre-built like pods, jobs, or services, may be the primary resource type for a controller. The issue is that most built-in resources, if not all of them, already come with associated built-in controllers. In order to prevent having &lt;em&gt;many&lt;/em&gt; controllers update the status of a shared object, &lt;em&gt;custom controllers&lt;/em&gt; are frequently built for special resources. &lt;/p&gt;

&lt;p&gt;If resources are merely Kubernetes API endpoints, writing a controller for a resource is just a fancy &lt;strong&gt;way to bind a request handler to an API endpoint&lt;/strong&gt;! &lt;/p&gt;

&lt;p&gt;Conditional resource modification can be implemented using a so-called webhook, which is essentially an &lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#request" rel="noopener noreferrer"&gt;API endpoint&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;It is possible &lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/" rel="noopener noreferrer"&gt;to configure dynamically&lt;/a&gt; what resources are subject to what admission webhooks via &lt;code&gt;ValidatingWebhookConfiguration&lt;/code&gt; or &lt;code&gt;MutatingWebhookConfiguration&lt;/code&gt; kinds. &lt;/p&gt;

&lt;p&gt;Both are available in &lt;code&gt;admissionregistration.k8s.io/v1&lt;/code&gt; API version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl api-versions | grep admiss
admissionregistration.k8s.io/v1
admissionregistration.k8s.io/v1beta1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
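For reference, a minimal `ValidatingWebhookConfiguration` skeleton; the service name, namespace, path, and `caBundle` are placeholders, and the single rule only watches Deployment creation:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-validating-webhook
webhooks:
  - name: webhook-server.example.svc    # placeholder webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail                 # reject requests if the webhook is unreachable
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["deployments"]
    clientConfig:
      service:
        name: webhook-server            # placeholder in-cluster service
        namespace: example
        path: /validate
      caBundle: "<base64-encoded-CA-cert>"
```

Note the `failurePolicy` trade-off: `Fail` is safer from a policy standpoint, but an unavailable webhook then blocks matching API requests; `Ignore` keeps the cluster usable at the cost of skipped checks.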



&lt;h2&gt;
  
  
  How would I activate admission controllers?
&lt;/h2&gt;

&lt;p&gt;The Kubernetes API server flag &lt;code&gt;--enable-admission-plugins&lt;/code&gt; accepts a comma-delimited list of AC plugins to invoke before changing cluster objects. For instance, the following command line activates the &lt;code&gt;LimitRanger&lt;/code&gt; and &lt;code&gt;NamespaceLifecycle&lt;/code&gt; admission control plugins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kube-apiserver --enable-admission-plugins=NamespaceLifecycle,LimitRanger
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⚠️ Note: You may need to apply the parameters in different ways depending on how your Kubernetes cluster is installed and how the API server is launched. For instance, if Kubernetes is deployed using self-hosted Kubernetes, you may need to alter the manifest file for the API server &lt;a href="https://digitalis.io/blog/kubernetes/k3s-lightweight-kubernetes-made-ready-for-production-part-2/" rel="noopener noreferrer"&gt;and/or the &lt;code&gt;systemd&lt;/code&gt; Unit file&lt;/a&gt; if the API server is installed as a &lt;code&gt;systemd&lt;/code&gt; service.&lt;/p&gt;

&lt;p&gt;⚠️ Note: the &lt;code&gt;admissionregistration.k8s.io/v1beta1&lt;/code&gt; API version is deprecated and has been removed in Kubernetes 1.22+ &lt;/p&gt;

&lt;h2&gt;
  
  
  Public cloud providers' implementation
&lt;/h2&gt;

&lt;p&gt;In this case, everything is already set up for you. &lt;/p&gt;

&lt;p&gt;To learn more about using dynamic admission controllers with Amazon EKS, see the &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/platform-versions.html" rel="noopener noreferrer"&gt;Amazon EKS documentation&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/architecture/operator-guides/aks/aks-triage-controllers" rel="noopener noreferrer"&gt;Azure AKS Policy&lt;/a&gt;, Microsoft's implementation of OPA Gatekeeper, is another interesting thing. Involving AC webhooks,  if there are problems in the admission control pipeline, it can block numerous requests to the API server.&lt;/p&gt;

&lt;p&gt;The VMware Tanzu team followed a similar path in their &lt;a href="https://tanzu.vmware.com/developer/guides/platform-security-admission-control/" rel="noopener noreferrer"&gt;Tanzu Kubernetes Grid (TKG)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Of course, OPA Gatekeeper itself is a separate and extensive topic, so more on that another time.&lt;/p&gt;

&lt;p&gt;In the ninth article of the series, the author will talk about how smart people were able to translate the theory described above into a useful solution.&lt;/p&gt;

&lt;p&gt;Be careful and stay tuned!&lt;/p&gt;

&lt;p&gt;Many thanks to Leonid Sandler, Douglas Makey &lt;a class="mentioned-user" href="https://dev.to/douglasmakey"&gt;@douglasmakey&lt;/a&gt;  Mendez Molero, Luca 🐦 @LucaDiMaio11 Di Maio, Kristijan Mitevski and W.T. Chang!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>security</category>
      <category>api</category>
    </item>
    <item>
      <title>Virtual Kubernetes Clusters: What Are They Needed For?</title>
      <dc:creator>Roman Belshevitz</dc:creator>
      <pubDate>Mon, 29 Aug 2022 18:58:50 +0000</pubDate>
      <link>https://dev.to/otomato_io/virtual-kubernetes-clusters-what-are-they-needed-for-4fdd</link>
      <guid>https://dev.to/otomato_io/virtual-kubernetes-clusters-what-are-they-needed-for-4fdd</guid>
      <description>&lt;h2&gt;
  
  
  Developer Wishlist Never Ends
&lt;/h2&gt;

&lt;p&gt;Imagine you can&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;have &lt;strong&gt;many&lt;/strong&gt; virtual clusters &lt;strong&gt;within a single&lt;/strong&gt; cluster, and &lt;/li&gt;
&lt;li&gt;they are &lt;strong&gt;much cheaper&lt;/strong&gt; than the traditional Kubernetes clusters, and &lt;/li&gt;
&lt;li&gt;they require &lt;strong&gt;lower&lt;/strong&gt; management and maintenance &lt;strong&gt;efforts&lt;/strong&gt;. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sounds intriguing, eh? This makes v/clusters ideal for running experiments, continuous integration, and setting up &lt;strong&gt;sandbox&lt;/strong&gt; 🧪 environments.&lt;/p&gt;

&lt;p&gt;So, Loft Labs created such a solution, written natively in Golang, and released it as an ~2k⭐ &lt;a href="https://github.com/loft-sh/vcluster" rel="noopener noreferrer"&gt;open source&lt;/a&gt; project.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's under the hood?
&lt;/h2&gt;

&lt;p&gt;Virtual clusters are fully functional Kubernetes clusters that run on top of other Kubernetes clusters. Instead of being completely separate "real" clusters, virtual clusters utilize the worker nodes and networking of the host cluster. They schedule all workloads into a single namespace of the host cluster and have their own control plane. Much like virtual machines, virtual clusters divide a single physical cluster into several distinct ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k34razhtr4v2t1dp3n8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k34razhtr4v2t1dp3n8.png" alt=" " width="800" height="296"&gt;&lt;/a&gt;&lt;br&gt;
🖼️ Right click, don't even think too long.&lt;/p&gt;

&lt;p&gt;Only the essential Kubernetes components make up the virtual cluster itself: the API server, the controller manager, a storage backend (such as etcd, sqlite, mysql, etc.), and, optionally, a scheduler. In order to minimize virtual cluster overhead, &lt;code&gt;vcluster&lt;/code&gt; by default builds on &lt;code&gt;k3s&lt;/code&gt;, a fully functional, certified, lightweight Kubernetes distribution that compiles the Kubernetes components into a single binary and disables all unnecessary Kubernetes features by default, such as the pod scheduler or specific controllers.&lt;/p&gt;

&lt;p&gt;Other Kubernetes distributions, &lt;a href="https://www.vcluster.com/docs/operator/other-distributions" rel="noopener noreferrer"&gt;such as k0s and vanilla k8s&lt;/a&gt;, are supported in addition to k3s. Besides the control plane, the virtual cluster also includes a Kubernetes hypervisor that simulates networking and worker nodes. Between the virtual and host clusters, this component syncs a few key resources that are crucial for cluster functionality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pods&lt;/strong&gt;: All pods started in the virtual cluster are rewritten before being launched in the virtual cluster's namespace in the host cluster. Environment variables, DNS, service account tokens, and other configurations are updated to point to the virtual cluster rather than the host cluster. From inside the pod, it appears that the pod runs in the virtual cluster rather than the host cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Services&lt;/strong&gt;: All services and endpoints are rewritten and created in the virtual cluster's namespace in the host cluster. The service cluster IPs are shared by the host cluster and the virtual cluster. This implies that there are no performance consequences when a service in the host cluster is accessed from within the virtual cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PersistentVolumeClaims&lt;/strong&gt;: Persistent volume claims created in the virtual cluster are rewritten and created in the virtual cluster's namespace in the host cluster. If they are bound in the host cluster, the relevant persistent volume data is synced back to the virtual cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ConfigMaps &amp;amp; Secrets&lt;/strong&gt;: Only ConfigMaps and Secrets mounted to pods within the virtual cluster are synced to the host cluster; all other ConfigMaps and Secrets are retained only within the virtual cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Other Resources&lt;/strong&gt;: Deployments, StatefulSets, CRDs, service accounts, etc. do not sync with the host cluster; they reside only in the virtual cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Who lost the magic mirror?
&lt;/h2&gt;

&lt;p&gt;For each pod with a &lt;code&gt;spec.nodeName&lt;/code&gt; value it encounters inside the virtual cluster, &lt;code&gt;vcluster&lt;/code&gt; by default creates a &lt;em&gt;fake&lt;/em&gt; node. These &lt;em&gt;fake&lt;/em&gt; nodes are created because &lt;code&gt;vcluster&lt;/code&gt; &lt;em&gt;does not&lt;/em&gt; by default &lt;em&gt;have RBAC permissions&lt;/em&gt; to access the real nodes in the host cluster; doing so would require a cluster role and cluster role binding. Additionally, each node gets a fake &lt;code&gt;kubelet&lt;/code&gt; endpoint that will either forward requests to the real node &lt;em&gt;or rewrite them&lt;/em&gt; to keep virtual cluster names intact.&lt;/p&gt;

&lt;p&gt;Vcluster supports multiple modes to customize node syncing behavior. For a detailed list of the resources that can be synced, &lt;a href="https://www.vcluster.com/docs/architecture/synced-resources" rel="noopener noreferrer"&gt;see details here&lt;/a&gt; in the docs.&lt;/p&gt;

&lt;p&gt;Besides synchronizing virtual and host cluster resources, the hypervisor also proxies certain Kubernetes API calls, such as pod port forwarding or container command execution, to the host cluster. It essentially acts as the virtual cluster's reverse proxy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiay4g5cnqmdlgak8mxih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiay4g5cnqmdlgak8mxih.png" alt=" " width="766" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To ensure proper network operation for the virtual cluster, resources like &lt;code&gt;Service&lt;/code&gt; and &lt;code&gt;Ingress&lt;/code&gt; are synced by default &lt;em&gt;from&lt;/em&gt; the virtual cluster [down] &lt;em&gt;to&lt;/em&gt; the host cluster.&lt;/p&gt;
&lt;h2&gt;
  
  
  There are never too many levels of abstraction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4inkwzd1ss3rnja32wie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4inkwzd1ss3rnja32wie.png" alt=" " width="711" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Certain resources (such as CRDs or RBAC policies) are &lt;em&gt;cluster-wide&lt;/em&gt;, and you can’t isolate them using &lt;em&gt;namespaces&lt;/em&gt;. For instance, it is not feasible to install multiple versions of an operator simultaneously in the same cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl api-resources --namespaced=false|true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although Kubernetes itself already offers namespaces to separate environments, namespaces cannot isolate cluster-scoped resources or the control plane.&lt;/p&gt;

&lt;p&gt;In many circumstances, virtual clusters are also more resilient than namespaces. The virtual cluster keeps its own Kubernetes resource objects in its own data store; these resources are unknown to the host cluster.&lt;/p&gt;

&lt;p&gt;This kind of isolation is good for resilience. Engineers who adopt namespace-based isolation still need access to cluster-scoped resources such as cluster roles, shared CRDs, or persistent volumes. If an engineer breaks one of these shared resources, every team that depends on it will likely experience failures.&lt;/p&gt;

&lt;p&gt;Finally, virtual cluster configuration &lt;em&gt;is independent of physical&lt;/em&gt; cluster configuration. This is excellent for multi-tenancy, since you can easily spin up a fresh environment or an impressive demo application. 😎&lt;/p&gt;

&lt;h2&gt;
  
  
  How it looks in your CLI
&lt;/h2&gt;

&lt;p&gt;Create a file &lt;code&gt;vcluster.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;vcluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rancher/k3s:v1.23.5-k3s1&lt;/span&gt;   &lt;span class="c1"&gt;# Choose k3s version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, install the helm chart using &lt;code&gt;vcluster.yaml&lt;/code&gt; for the chart values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm upgrade --install my-vcluster vcluster \
  --values vcluster.yaml \
  --repo https://charts.loft.sh \
  --namespace host-namespace-1 \
  --repository-config=''
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Access:
&lt;/h3&gt;

&lt;p&gt;Get the admin tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -s -L "https://github.com/loft-sh/vcluster/releases/latest" | sed -nE 's!.*"([^"]*vcluster-linux-amd64)".*!https://github.com\1!p' | xargs -n 1 curl -L -o vcluster &amp;amp;&amp;amp; chmod +x vcluster;
sudo mv vcluster /usr/local/bin;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, connect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Connect and switch the current context to the vcluster
vcluster connect my-vcluster -n my-vcluster

# Switch back context
vcluster disconnect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You have an option to create a separate &lt;code&gt;kubeconfig&lt;/code&gt; to use instead of changing the current context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vcluster connect my-vcluster --update-current=false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or you may execute a command directly in the &lt;code&gt;vcluster&lt;/code&gt; context without changing the &lt;em&gt;current&lt;/em&gt; context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vcluster connect my-vcluster -- kubectl get namespaces
vcluster connect my-vcluster -- bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Run any kubectl, helm, etc. command in your vcluster
kubectl get namespace
kubectl get pods -n kube-system
kubectl create namespace demo-nginx
kubectl create deployment nginx-deployment -n demo-nginx --image=nginx
kubectl get pods -n demo-nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cleanup:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm delete my-vcluster -n vcluster-my-vcluster --repository-config=''
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What if you're planning some serious thing?
&lt;/h2&gt;

&lt;p&gt;Well, the stock K8s distribution &lt;em&gt;is compatible with high availability&lt;/em&gt; in &lt;code&gt;vcluster&lt;/code&gt;. What does high availability mean here? One aspect is making the &lt;code&gt;etcd&lt;/code&gt; database more robust; another is boosting the syncer's performance. As mentioned above, &lt;code&gt;vcluster&lt;/code&gt; uses a so-called syncer, which copies the pods created within the virtual cluster to the underlying host cluster. &lt;/p&gt;

&lt;p&gt;🪲&lt;strong&gt;TL;DR #1:&lt;/strong&gt; &lt;code&gt;etcd&lt;/code&gt; uses a leader-based consensus protocol for consistent data replication and log execution. The etcd cluster members elect a single leader; all other members become followers. When the leader fails, the cluster automatically elects a new one. The election does not happen immediately after the failure, though: since failure detection is timeout-based, electing a new leader takes roughly an election timeout. &lt;/p&gt;

&lt;p&gt;🪲&lt;strong&gt;TL;DR #2:&lt;/strong&gt; Why &lt;em&gt;a minimum of three instances&lt;/em&gt; is recommended for an etcd cluster is &lt;a href="https://etcd.io/docs/v3.5/faq/" rel="noopener noreferrer"&gt;well described here&lt;/a&gt;, first-hand info.&lt;/p&gt;
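&lt;p&gt;To make the three-member recommendation concrete, here is a minimal sketch (plain shell arithmetic, unrelated to any vcluster tooling): an etcd cluster of n members needs a quorum of floor(n/2) + 1 votes, so it tolerates n minus quorum member failures.&lt;/p&gt;

```shell
# Quorum size and fault tolerance for common etcd cluster sizes:
# quorum = n/2 + 1 (integer division), tolerated failures = n - quorum.
for n in 1 3 5; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "members=$n quorum=$quorum tolerated_failures=$tolerated"
done
```

&lt;p&gt;Note that a two-member cluster would tolerate zero failures, which is why three is the sensible minimum.&lt;/p&gt;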

&lt;p&gt;Currently, vcluster's high availability setup does not support single-binary distributions like &lt;code&gt;k0s&lt;/code&gt; and &lt;code&gt;k3s&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;values.yaml&lt;/code&gt; with the following structure in order to run &lt;code&gt;vcluster&lt;/code&gt; in high availability mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Enable HA mode&lt;/span&gt;
&lt;span class="na"&gt;enableHA&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="c1"&gt;# Scale up syncer replicas&lt;/span&gt;
&lt;span class="na"&gt;syncer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

&lt;span class="c1"&gt;# Scale up etcd&lt;/span&gt;
&lt;span class="na"&gt;etcd&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

&lt;span class="c1"&gt;# Scale up controller manager&lt;/span&gt;
&lt;span class="na"&gt;controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

&lt;span class="c1"&gt;# Scale up api server&lt;/span&gt;
&lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;

&lt;span class="c1"&gt;# Scale up DNS server&lt;/span&gt;
&lt;span class="na"&gt;coredns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  To summarize
&lt;/h2&gt;

&lt;p&gt;A fully functioning virtual Kubernetes cluster can be built with &lt;code&gt;vcluster&lt;/code&gt;! Each &lt;code&gt;vcluster&lt;/code&gt; runs inside a namespace of the underlying K8s cluster. It provides better multi-tenancy and isolation than conventional namespaces, and it is cheaper than running separate, fully-fledged clusters.&lt;/p&gt;

&lt;p&gt;Virtual clusters can be a good alternative to running numerous instances of &lt;code&gt;k3s&lt;/code&gt; or &lt;code&gt;k0s&lt;/code&gt; side by side, but they &lt;em&gt;cannot exist on their own without a host&lt;/em&gt; cluster. &lt;/p&gt;

&lt;p&gt;Compared to fully independent Kubernetes clusters, they are faster, lighter, and simpler to spin up. So give virtual clusters a shot &lt;a href="https://komodor.com/learn/git-revert-rolling-back-in-gitops-and-kubernetes/" rel="noopener noreferrer"&gt;if you're tired&lt;/a&gt; of having to reset your local or CI/CD Kubernetes clusters all the time. However, that is a topic for a completely different story, much sadder than what you just read.&lt;/p&gt;

&lt;p&gt;Be in good &amp;amp; non-ghost shape! 👻&lt;/p&gt;

&lt;p&gt;Many thanks to Viktor 🐦@vfarcic Farcic and Mauricio 🐦&lt;a class="mentioned-user" href="https://dev.to/salaboy"&gt;@salaboy&lt;/a&gt; Salatino for inspiration!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>productivity</category>
      <category>k3s</category>
    </item>
    <item>
      <title>How to Stop Rampant Kubernetes Cluster Growth</title>
      <dc:creator>Roman Belshevitz</dc:creator>
      <pubDate>Thu, 25 Aug 2022 18:20:00 +0000</pubDate>
      <link>https://dev.to/otomato_io/how-to-stop-rampant-kubernetes-cluster-growth-4eip</link>
      <guid>https://dev.to/otomato_io/how-to-stop-rampant-kubernetes-cluster-growth-4eip</guid>
      <description>&lt;h2&gt;
  
  
  Some lyrics as an introduction
&lt;/h2&gt;

&lt;p&gt;Edvard Munch's famous painting "The Scream" was first presented to the public at a Berlin exhibition in December 1893. It was conceived as part of the &lt;a href="https://www.dailyartmagazine.com/edvard-munch-and-the-frieze-of-life/" rel="noopener noreferrer"&gt;"Frieze of Life"&lt;/a&gt;, a program cycle of paintings about the spiritual life of a person. Munch wrote about it: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The Frieze of Life” is conceived as a series of paintings connected with each other, which together should give a description of a whole life. A winding line of the coast passes through the picture, behind it is the sea, it is always in motion, and under the crowns of trees there is a diverse life with its sorrows and joys. Frieze is conceived as a poem about life, love and death.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The author of this brief will, of course, not talk about the spiritual life, but about practical approaches that keep thoughts of the terrible and otherworldly at bay and save engineers' nerves.&lt;/p&gt;

&lt;h2&gt;
  
  
  The essence of the Ops' problem
&lt;/h2&gt;

&lt;p&gt;Kubernetes was originally designed to support the consolidation of workloads on a &lt;em&gt;single&lt;/em&gt; cluster. However, there are many problematic scenarios that require a &lt;em&gt;multi-cluster approach&lt;/em&gt; to optimize performance. These may include workloads across regions, fault propagation radius limits, compliance issues, harsh multi-user environments, security, and custom software solutions.&lt;/p&gt;

&lt;p&gt;Unfortunately, this multi-cluster approach poses management challenges, as the complexity of managing a Kubernetes cluster only increases as the size of the cluster increases. The end result is a phenomenon called &lt;em&gt;cluster sprawl&lt;/em&gt;, which occurs when the number of clusters and workloads grows and is not managed coherently.&lt;/p&gt;

&lt;p&gt;The solution to this problem lies in the early and rapid identification and implementation of the best management practices in order to avoid serious work in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kubernetes governance?
&lt;/h2&gt;

&lt;p&gt;Governance refers to a well-defined collection of rules, policies, and procedures that ensures accountability, transparency, and responsibility.&lt;/p&gt;

&lt;p&gt;Governance is also about synchronizing clusters and providing centralized policy management. Kubernetes' governance is defined as a set of rules created with policies that need to be enforced across all clusters. This is a critical component for large enterprises running Kubernetes.&lt;/p&gt;

&lt;p&gt;Typically, this process means applying matching rules across Kubernetes multi-clusters, as well as to the applications running in those clusters. And while governing Kubernetes may seem like overhead, it pays off in the long run, especially in a large organization.&lt;/p&gt;

&lt;p&gt;Assume the enterprise keeps increasing the number of clusters in use without applying governance. These clusters will live under different rules, which will create a huge amount of extra work for the teams in the near future.&lt;/p&gt;

&lt;p&gt;Fortunately, there are only a few very important components to building a successful Kubernetes governance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating successful Kubernetes governance
&lt;/h2&gt;

&lt;p&gt;When considering a successful Kubernetes governance strategy, the first component is to ensure good multi-cluster management and monitoring. You must maintain control over how and where clusters are created and configured, as well as which software versions can be used.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frp9fax3ns5gh9szpzt9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frp9fax3ns5gh9szpzt9u.png" alt=" " width="500" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🔧 Well-built observability
&lt;/h3&gt;

&lt;p&gt;Application development and operations teams should be able to centrally view and manage clusters to better optimize resources and troubleshoot. Solutions in this area are developed, for example, by &lt;a href="https://www.redhat.com/en/technologies/management/advanced-cluster-management" rel="noopener noreferrer"&gt;Red Hat&lt;/a&gt;, &lt;a href="https://platform9.com/blog/eks-plug-and-play-centralized-management-of-your-applications-across-aws-eks-clusters/" rel="noopener noreferrer"&gt;Platform9&lt;/a&gt;,  &lt;a href="https://polaris.docs.fairwinds.com/" rel="noopener noreferrer"&gt;Fairwinds&lt;/a&gt; and even &lt;a href="https://github.com/rancher/opni" rel="noopener noreferrer"&gt;Rancher Labs&lt;/a&gt;. Improved management practices and greater transparency can also save a company from the headaches of a range of security risks and performance issues down the road.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔧 RBAC strategies
&lt;/h3&gt;

&lt;p&gt;Next, enterprises must have an authentication and access control system in place. Having centralized authentication and authorization will help an organization streamline the login process and help keep track of user activity. This will allow application development and operations teams &lt;a href="https://www.techtarget.com/searchitoperations/tutorial/Be-selective-with-Kubernetes-RBAC-permissions" rel="noopener noreferrer"&gt;to ensure that the right people&lt;/a&gt; are doing important tasks in real time.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔧 Policy management
&lt;/h3&gt;

&lt;p&gt;Finally, to govern Kubernetes, enterprises must optimize policy management. Companies need to think about how Kubernetes will impact their development culture and work on finding the right balance between business agility and control. Ultimately, governance (with the appropriate level of flexibility) ensures that businesses can meet customer needs and deploy mission-critical services in a consistent and reliable manner.&lt;/p&gt;

&lt;p&gt;In Kubernetes, Admission Controllers enforce policies on objects during create, update, and delete operations. Admission control is fundamental to policy enforcement in Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/blog/2019/03/21/a-guide-to-kubernetes-admission-controllers/" rel="noopener noreferrer"&gt;Admission controllers&lt;/a&gt; allow you to enforce the adherence to certain practices such as having good labels, annotations, resource limits, or other settings.&lt;/p&gt;

&lt;p&gt;As a CNCF project, Open Policy Agent (OPA) is a great tool to develop and implement such policies at scale throughout an organization. Every request goes through OPA, as illustrated below, and is evaluated against the policies established for the Kubernetes cluster. If the request complies with the policies, it is carried out; if it violates them, OPA rejects it.&lt;/p&gt;

&lt;p&gt;As a good practice, by &lt;a href="https://www.openpolicyagent.org/docs/latest/kubernetes-introduction/#how-does-it-work-with-plain-opa-and-kube-mgmt" rel="noopener noreferrer"&gt;deploying OPA&lt;/a&gt; as an admission controller, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Require specific labels on all resources.&lt;/li&gt;
&lt;li&gt;Require that container images come from the corporate image registry.&lt;/li&gt;
&lt;li&gt;Require that all pods specify resource requests and limits.&lt;/li&gt;
&lt;li&gt;Prevent conflicting Ingress objects from being created.&lt;/li&gt;
&lt;/ul&gt;
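&lt;p&gt;As an illustrative sketch (an assumption, not from the OPA docs linked above), the first rule can be enforced with OPA Gatekeeper's community &lt;code&gt;K8sRequiredLabels&lt;/code&gt; constraint; the constraint name and required label are hypothetical:&lt;/p&gt;

```yaml
# Assumes the K8sRequiredLabels ConstraintTemplate from the Gatekeeper
# community policy library is already installed; names are illustrative.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-owner
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["owner"]
```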

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjlt13feinocj5p1hkmq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjlt13feinocj5p1hkmq.png" alt=" " width="789" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Goals to achieve
&lt;/h2&gt;

&lt;p&gt;But what should be the goals of governance? Where should it be enforced and tested? The four most effective management objectives are security policy, network management, access control, and image management. Let's look at each of these goals one by one:&lt;/p&gt;

&lt;h3&gt;
  
  
  🎯 Security policy
&lt;/h3&gt;

&lt;p&gt;In security policies for governing Kubernetes, it is important to restrict user access to pods in clusters. Cluster users should have well-defined access based on their role.&lt;/p&gt;

&lt;p&gt;To do this, enterprises must implement a security policy that will have rules and conditions related to access and privileges. In this policy, they must specify that containers have read-only access to the file system and that containers and child processes cannot be subject to privilege changes.&lt;/p&gt;
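&lt;p&gt;A minimal sketch of what such restrictions look like in a pod spec (the names and image are hypothetical):&lt;/p&gt;

```yaml
# Illustrative pod spec enforcing the restrictions described above.
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app                        # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # hypothetical image
      securityContext:
        readOnlyRootFilesystem: true        # read-only access to the filesystem
        allowPrivilegeEscalation: false     # no privilege changes for child processes
        runAsNonRoot: true
```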

&lt;h3&gt;
  
  
  🎯 Network management
&lt;/h3&gt;

&lt;p&gt;Network policy plays a very important role in determining which services can communicate with each other. Here, companies must determine which pods and services may interact with each other and which should be isolated. This also applies to pod security in Kubernetes governance.&lt;/p&gt;

&lt;p&gt;The right approach aims to control traffic within Kubernetes clusters. Policies can be based on pods, namespaces, or IPs, depending on governance requirements.&lt;/p&gt;

&lt;p&gt;Each popular CNI plugin uses a different type of configuration for the network setup. For example, &lt;a href="https://projectcalico.docs.tigera.io/networking/determine-best-networking" rel="noopener noreferrer"&gt;Calico&lt;/a&gt; uses layer 3 networking paired with the BGP routing protocol to connect pods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/otomato_io/cilium-ebpf-powered-cni-a-nos-solution-for-modern-clouds-1hl1"&gt;Cilium&lt;/a&gt; configures an overlay network with eBPF on layers 3 to 7. Along with Calico, Cilium supports setting up network policies to restrict traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  🎯 Administration and access control
&lt;/h3&gt;

&lt;p&gt;In access control, when configuring role-based access control (RBAC) policy, administrators need to restrict access to cluster resources. Using Kubernetes objects such as &lt;code&gt;Role&lt;/code&gt;, &lt;code&gt;ClusterRole&lt;/code&gt;, &lt;code&gt;RoleBinding&lt;/code&gt;, and &lt;code&gt;ClusterRoleBinding&lt;/code&gt;, they need to fine-tune access to cluster resources appropriately.&lt;/p&gt;

&lt;p&gt;Because permissions granted by a &lt;code&gt;ClusterRole&lt;/code&gt; apply across the entire cluster, you can use &lt;code&gt;ClusterRole&lt;/code&gt;s to control access to different kinds of resources than you can with &lt;code&gt;Role&lt;/code&gt;s. These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster-scoped resources such as nodes&lt;/li&gt;
&lt;li&gt;Non-resource REST Endpoints &lt;a href="https://kubernetes.io/docs/reference/using-api/health-checks/" rel="noopener noreferrer"&gt;such as&lt;/a&gt; &lt;code&gt;/healthz&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Namespaced&lt;/em&gt; resources across all Namespaces (for example, all Pods across the entire cluster, regardless of Namespace).&lt;/li&gt;
&lt;/ul&gt;
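&lt;p&gt;For contrast, a namespaced &lt;code&gt;Role&lt;/code&gt; limits the same kind of grant to a single namespace; a minimal sketch with hypothetical names:&lt;/p&gt;

```yaml
# A namespaced Role granting read-only access to pods; names are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: test
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
```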

&lt;p&gt;After creating a &lt;code&gt;Role&lt;/code&gt; or &lt;code&gt;ClusterRole&lt;/code&gt;, &lt;a href="https://learnk8s.io/rbac-kubernetes" rel="noopener noreferrer"&gt;you have to assign it&lt;/a&gt; to a user or group of users by creating a &lt;code&gt;RoleBinding&lt;/code&gt; or &lt;code&gt;ClusterRoleBinding&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterRoleBinding&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;testadminclusterbinding&lt;/span&gt;
&lt;span class="na"&gt;subjects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myaccount&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
&lt;span class="na"&gt;roleRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterRole&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cluster-admin&lt;/span&gt;
  &lt;span class="na"&gt;apiGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🎯 Image management
&lt;/h3&gt;

&lt;p&gt;Using public Docker images can increase the speed and flexibility of application development, but there are many vulnerable Docker images, and using them in a production cluster can be very risky.&lt;/p&gt;

&lt;p&gt;Image management is also part of Kubernetes governance. All images that will be used in the cluster must be pre-scanned for vulnerabilities. There are several approaches to finding vulnerabilities. How and where an organization checks for vulnerabilities depends on its preferred workflows. However, it is recommended that you test your images before deploying them to a cluster.&lt;/p&gt;

&lt;p&gt;Hacker activity has increased exponentially in recent years, and loopholes in systems continue to be discovered. Therefore, it is very important for companies to be vigilant when implementing practices to ensure that they only use official, clean, and verified Docker images on a cluster.&lt;/p&gt;

&lt;p&gt;Threat actors can mount sophisticated attacks by hiding malicious scripts or malware in a container image, turning previously trusted third-party artifacts into an attack vector. Static, pattern-based, or signature-based scanners are not effective against this kind of attack because it only manifests at runtime.&lt;/p&gt;

&lt;p&gt;By evaluating the attack kill chain and running images in a secure hosted sandbox environment, several security solutions can reduce this risk. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwa3wpii8ine0jptfqydk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwa3wpii8ine0jptfqydk.png" alt=" " width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These tools, e.g. &lt;a href="https://github.com/aquasecurity/trivy" rel="noopener noreferrer"&gt;trivy&lt;/a&gt; by Aqua Security, are frequently &lt;a href="https://github.com/aquasecurity/trivy-action" rel="noopener noreferrer"&gt;incorporated into CI/CD&lt;/a&gt; pipelines to examine images in a running state both before and after an image is checked into a registry. Malicious behavior or unmet policy requirements can flag an image for registry deletion or block check-in entirely.&lt;/p&gt;
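&lt;p&gt;A sketch of how such a scan might look as a GitHub Actions step using the trivy-action mentioned above (the image reference and thresholds are assumptions):&lt;/p&gt;

```yaml
# Hypothetical CI step: fail the build on HIGH/CRITICAL findings.
- name: Scan image with Trivy
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: registry.example.com/app:1.0   # hypothetical image
    severity: CRITICAL,HIGH
    exit-code: '1'
```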

&lt;h2&gt;
  
  
  Instead of conclusion
&lt;/h2&gt;

&lt;p&gt;Thus, the author has briefly outlined the directions needed to better govern Kubernetes, ensure the security of important enterprise systems and data, and limit cluster growth and possible disorder. Stay strong and focused!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The author is thankful to Arthur Chiao, Oleg Chunikhin (CNCF), Tomas Fernandez (Rendered Text / Semaphore), Mike Jordan (Coredge), Kristijan Mitevski and Steven Zimmerman (Aqua Security) for their contribution to the community.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>productivity</category>
      <category>team</category>
    </item>
  </channel>
</rss>
