<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: XIAOXU CHANG </title>
    <description>The latest articles on DEV Community by XIAOXU CHANG  (@xxisxuxu).</description>
    <link>https://dev.to/xxisxuxu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1101844%2F98d2ca8f-66d8-4275-bf8a-8dbb895ce332.jpg</url>
      <title>DEV Community: XIAOXU CHANG </title>
      <link>https://dev.to/xxisxuxu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xxisxuxu"/>
    <language>en</language>
    <item>
      <title>Inside AIO Sandbox (Part 1): Files &amp; Shell — The Foundations of Agent Execution</title>
      <dc:creator>XIAOXU CHANG </dc:creator>
      <pubDate>Tue, 31 Mar 2026 07:36:56 +0000</pubDate>
      <link>https://dev.to/bytedanceoss/inside-aio-sandbox-part-1-files-shell-the-foundations-of-agent-execution-4pe5</link>
      <guid>https://dev.to/bytedanceoss/inside-aio-sandbox-part-1-files-shell-the-foundations-of-agent-execution-4pe5</guid>
      <description>&lt;p&gt;&lt;em&gt;by AIO Sandbox Team&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Modern AI agents are no longer just generating text—they are expected to &lt;strong&gt;write files, modify code, and execute commands&lt;/strong&gt;.&lt;br&gt;
But doing this directly on your local machine or production systems is risky and hard to control.&lt;br&gt;
This is where &lt;strong&gt;&lt;a href="https://github.com/agent-infra/sandbox" rel="noopener noreferrer"&gt;AIO Sandbox&lt;/a&gt;&lt;/strong&gt; comes in. It provides an &lt;strong&gt;isolated, programmable environment&lt;/strong&gt; where agents can safely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;create and manipulate files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;run shell commands&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;execute code&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;produce artifacts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;and many more...&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike a typical Docker container, which often requires manual configuration to chain tools together, AIO Sandbox integrates a browser, a shell, and a file system into a single environment designed for AI agents. This unified architecture keeps artifacts persistent and accessible across every stage of an AI-driven workflow running inside the sandbox.&lt;br&gt;
In this first post, we’ll focus on the two most fundamental capabilities:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🧩 &lt;strong&gt;Filesystem (state)&lt;/strong&gt; &lt;br&gt;
⚙️ &lt;strong&gt;Shell (execution)&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By the end, you’ll see how these combine into a complete runtime for agents.&lt;/p&gt;

&lt;h1&gt;
  
  
  🌐 Multi-language SDK Support
&lt;/h1&gt;

&lt;p&gt;While this tutorial uses Python, &lt;strong&gt;AIO Sandbox is not limited to Python developers&lt;/strong&gt;.&lt;br&gt;
The &lt;strong&gt;agent-sandbox SDK&lt;/strong&gt; also supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;TypeScript / JavaScript&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go (Golang)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This makes it easy to integrate AIO Sandbox into a wide range of agent frameworks, backend services, and developer stacks.&lt;/p&gt;

&lt;h1&gt;
  
  
  🛠️ Prerequisites
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Python 3.12+&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A running AIO Sandbox instance at &lt;code&gt;http://localhost:8080&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Docker command&lt;/strong&gt;:&lt;br&gt;
&lt;code&gt;docker run --security-opt seccomp=unconfined --rm -it -p 8080:8080 ghcr.io/agent-infra/sandbox:latest&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Python SDK installed&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;pip install agent-sandbox&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  🧠 Mental Model
&lt;/h1&gt;

&lt;p&gt;Think of AIO Sandbox as a &lt;strong&gt;remote, disposable Linux machine&lt;/strong&gt; that your agent controls via APIs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Filesystem → where data and artifacts live&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shell → how actions are executed&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Simple Flow:&lt;br&gt;
Agent → API → Sandbox → Filesystem + Shell&lt;/p&gt;
&lt;/blockquote&gt;
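&lt;p&gt;To make this flow concrete, here is a minimal sketch (plain Python, no running sandbox required) of the call sequence an agent issues. The method names mirror the &lt;code&gt;agent-sandbox&lt;/code&gt; SDK used later in this post; the workspace path and the call-plan representation are hypothetical, for illustration only:&lt;/p&gt;

```python
# Illustrative only: the ordered API calls behind
# "Agent -> API -> Sandbox -> Filesystem + Shell".
# Method names mirror the agent-sandbox SDK; the workdir is a placeholder.
def plan_workflow(workdir: str) -> list[tuple[str, dict]]:
    """Return (api_call, arguments) pairs an agent would send to the sandbox."""
    return [
        # Filesystem: put state into the sandbox
        ("file.write_file", {"file": f"{workdir}/data.txt", "content": "10\n20\n"}),
        # Shell: act on that state
        ("shell.exec_command", {"command": f"cd {workdir} && python3 process.py"}),
        # Filesystem: read the resulting artifact back out
        ("file.read_file", {"file": f"{workdir}/report.txt"}),
    ]

if __name__ == "__main__":
    for call, args in plan_workflow("/home/user/data_agent"):
        print(call, args)
```

&lt;p&gt;Every step is either a filesystem call (state) or a shell call (execution); that two-primitive split is the whole mental model.&lt;/p&gt;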

&lt;h1&gt;
  
  
  Autonomous Data Processing &amp;amp; Validation Agent
&lt;/h1&gt;

&lt;p&gt;Rather than presenting each API in isolation, we will walk through a comprehensive end-to-end agent workflow running inside the sandbox.&lt;br&gt;
This example simulates an agent that executes the following workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Create some data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Read it (look at it)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write a script file (process.py)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;List files (see what exists)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run the script&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Read the output file&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check if output looks correct&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Notice bad data and fix it&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Run the script again&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Read updated output&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Find files created&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Download final result&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What this use case demonstrates&lt;/strong&gt;&lt;br&gt;
A realistic agent loop, enabled by AIO Sandbox File &amp;amp; Shell primitives:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Read → Execute → Read → Validate → Fix → Re-run → Export&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This workflow makes each AIO Sandbox File &amp;amp; Shell primitive feel purposeful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;File primitives&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;write_file&lt;/code&gt; creates data and code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;read_file&lt;/code&gt; lets the agent inspect inputs and outputs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;list_path&lt;/code&gt; gives workspace awareness&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;replace_in_file&lt;/code&gt; lets the agent repair bad input&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;search_in_file&lt;/code&gt; validates expected output&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;find_files&lt;/code&gt; discovers generated artifacts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;download_file&lt;/code&gt; exports results out of the sandbox&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Shell primitive&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;exec_command&lt;/code&gt; runs the actual processing job
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_sandbox&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Sandbox&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Sandbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --------------------------------------------------
# 1. Setup workspace
# --------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;home_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_context&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;home_dir&lt;/span&gt;
&lt;span class="n"&gt;app_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;home_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/data_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;data_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;app_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/data.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;script_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;app_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/process.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;report_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;app_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/report.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sandbox home directory:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;home_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;App directory:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --------------------------------------------------
# 2. Create raw input data (with an intentional error)
# --------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;10
20
INVALID
40
50
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Created raw input data.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --------------------------------------------------
# 3. Read and inspect input data
# --------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;data_preview&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Raw input data:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_preview&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --------------------------------------------------
# 4. Write processing script
# --------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;script_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;numbers = []

with open(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;) as f:
    for line in f:
        try:
            numbers.append(int(line.strip()))
        except ValueError:
            print(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Skipping invalid line:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, line.strip())

total = sum(numbers)
avg = total / len(numbers)

report = f&lt;/span&gt;&lt;span class="se"&gt;\"\"\"&lt;/span&gt;&lt;span class="s"&gt;Report Summary
--------------
Valid Count: {len(numbers)}
Total: {total}
Average: {avg}
&lt;/span&gt;&lt;span class="se"&gt;\"\"\"&lt;/span&gt;&lt;span class="s"&gt;

with open(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;) as f:
    f.write(report)

print(report)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Created processing script.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --------------------------------------------------
# 5. List workspace contents
# --------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;app_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;recursive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Workspace contents:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# --------------------------------------------------
# 6. Execute the processing script
# --------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cd &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;app_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &amp;amp;&amp;amp; python3 process.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;First execution output:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Exit code:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --------------------------------------------------
# 7. Read generated report
# --------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;report_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Generated report:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --------------------------------------------------
# 8. Validate report contents
# --------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search_in_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;report_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Average: .*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Report validation result:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --------------------------------------------------
# 9. Detect bad input and fix it
# --------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;data_check&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INVALID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data_check&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Detected invalid data. Fixing input file...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace_in_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;old_str&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INVALID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;new_str&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read input again after fix
&lt;/span&gt;&lt;span class="n"&gt;updated_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Updated input data:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;updated_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --------------------------------------------------
# 10. Re-run the processing script
# --------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cd &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;app_dir&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &amp;amp;&amp;amp; python3 process.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Second execution output:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Exit code:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --------------------------------------------------
# 11. Read final report again
# --------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;final_report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;report_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Final report:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --------------------------------------------------
# 12. Find generated text artifacts
# --------------------------------------------------
&lt;/span&gt;&lt;span class="n"&gt;artifacts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;app_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Discovered artifacts:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;artifacts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --------------------------------------------------
# 13. Download final report to local machine
# --------------------------------------------------
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_report.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;report_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Final report downloaded locally as final_report.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
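&lt;p&gt;The &lt;code&gt;process.py&lt;/code&gt; logic the agent writes into the sandbox is ordinary Python, so you can sanity-check it locally before involving the sandbox at all. Here is a standalone sketch of the same parse-skip-summarize step (no SDK needed; the function name is ours, not part of any API):&lt;/p&gt;

```python
# Standalone version of the processing step from process.py:
# parse integers line by line, skip invalid lines, then summarize.
def summarize(lines: list[str]) -> dict:
    numbers = []
    for line in lines:
        try:
            numbers.append(int(line.strip()))
        except ValueError:
            print("Skipping invalid line:", line.strip())
    total = sum(numbers)
    return {"count": len(numbers), "total": total, "avg": total / len(numbers)}

if __name__ == "__main__":
    raw = ["10", "20", "INVALID", "40", "50"]
    print(summarize(raw))  # INVALID is skipped: count 4, total 120, avg 30.0
```

&lt;p&gt;Running it on the intentionally broken input reproduces the first sandbox execution: the &lt;code&gt;INVALID&lt;/code&gt; line is skipped and the report averages only the four valid values.&lt;/p&gt;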



&lt;h1&gt;
  
  
  🎯 Key Insight
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;AIO Sandbox gives agents a &lt;strong&gt;safe, programmable runtime&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Files → memory/state&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shell → actions&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Together, they enable real-world workflows like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;code generation and execution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;data processing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;automation pipelines&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;tool orchestration&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  🚀 What’s Next
&lt;/h1&gt;

&lt;p&gt;Thanks for reading! Hope it was helpful! This is just the beginning. In upcoming posts, we’ll dive into topics such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;🌐 Browser automation (CDP-based)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🔌 MCP tool integration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;📓 Jupyter / notebook execution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🤖 OpenClaw integration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🎯 Reinforcement learning inside sandbox&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  💬 Final Thoughts
&lt;/h1&gt;

&lt;p&gt;AIO Sandbox bridges the gap between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;“an LLM that generates text”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“an agent that can actually do things”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it does so safely, reproducibly, and programmatically.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Introducing AIO Sandbox, All-in-One Sandbox Environment for AI Agents</title>
      <dc:creator>XIAOXU CHANG </dc:creator>
      <pubDate>Fri, 27 Mar 2026 02:33:45 +0000</pubDate>
      <link>https://dev.to/bytedanceoss/introducing-aio-sandbox-all-in-one-sandbox-environment-for-ai-agents-18k0</link>
      <guid>https://dev.to/bytedanceoss/introducing-aio-sandbox-all-in-one-sandbox-environment-for-ai-agents-18k0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rz25miqt9hute1a0rjg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rz25miqt9hute1a0rjg.png" alt="banner" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unchecked AI autonomy is a ticking time bomb; it’s time to pull the plug on unfettered, full-system access. We can no longer afford to hand AI agents the 'keys to the kingdom' without oversight. The 'wild west' of AI agents running with total system control is officially over.&lt;/p&gt;

&lt;p&gt;AIO Sandbox is an open-source project designed to solve these problems. It is everything your agent needs, out of the box. No more juggling multiple services. AIO Sandbox ships a complete, pre-wired environment in a single Docker container.&lt;/p&gt;

&lt;p&gt;The AIO (All-in-One) Sandbox is a containerized environment designed for both human developers and AI agents. Its architecture is built around a "Batteries-Included" philosophy, providing a full Linux desktop-like environment inside a single Docker container.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unified Environment:&lt;/strong&gt; One Docker container with shared filesystem. Files downloaded in the browser are instantly accessible in Terminal and VSCode.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Out of the Box:&lt;/strong&gt; Built‑in VNC browser, VS Code, Jupyter, file manager, and terminal—accessible directly via API/SDK.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agent-Ready:&lt;/strong&gt; Pre-configured MCP Server with Browser, File, Terminal, and Markdown tools, ready to use for AI agents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Developer Friendly:&lt;/strong&gt; Cloud-based VSCode with persistent terminals, intelligent port forwarding, and instant frontend/backend previews.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Secure Execution:&lt;/strong&gt; Isolated Python and Node.js sandboxes. Safe code execution without system risks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Production Ready:&lt;/strong&gt; Enterprise-grade Docker deployment. Lightweight, scalable.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Calling all AI agent developers!&lt;/strong&gt; How are you securing your builds? Let’s try running your agent in AIO Sandbox and compare notes.&lt;/p&gt;

&lt;p&gt;AIO Sandbox is open-sourced under the Apache License 2.0. Contributions welcome.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/agent-infra/sandbox" rel="noopener noreferrer"&gt;https://github.com/agent-infra/sandbox&lt;/a&gt;&lt;br&gt;
Official website: &lt;a href="https://sandbox.agent-infra.com/" rel="noopener noreferrer"&gt;https://sandbox.agent-infra.com/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>Kitex/Hertz Empowers LLMs: A Retrospective of Key Features on Its Third Anniversary</title>
      <dc:creator>XIAOXU CHANG </dc:creator>
      <pubDate>Thu, 23 Jan 2025 08:24:37 +0000</pubDate>
      <link>https://dev.to/bytedanceoss/kitexhertz-empowers-llms-a-retrospective-of-key-features-on-its-third-anniversary-3aip</link>
      <guid>https://dev.to/bytedanceoss/kitexhertz-empowers-llms-a-retrospective-of-key-features-on-its-third-anniversary-3aip</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;By Yang Rui from CloudWeGo Team&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/cloudwego" rel="noopener noreferrer"&gt;https://github.com/cloudwego&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It has been three years since CloudWeGo's open-source journey began. Adhering to the principle of &lt;strong&gt;Internal and External Consistency&lt;/strong&gt;, we have continuously iterated on our open-source repositories, releasing features that served ByteDance internally to the external world. From 2023 to 2024, Kitex/Hertz focused on &lt;strong&gt;LLM back-end services&lt;/strong&gt;, &lt;strong&gt;user experience&lt;/strong&gt;, and &lt;strong&gt;performance&lt;/strong&gt;, aiding the rapid development of new business scenarios while continuously optimizing user experience and performance. Meanwhile, Kitex/Hertz has been widely adopted by external enterprises and has attracted numerous external developers, steadily strengthening the CloudWeGo ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowzv71qkfxama8gquemg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowzv71qkfxama8gquemg.png" alt="keywords" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article summarizes the presentation "&lt;strong&gt;Kitex/Hertz Empowers LLMs: A Retrospective of Key Features on The Third Anniversary&lt;/strong&gt;". It introduces the significant features of Kitex/Hertz over the past year, aiming to assist enterprise users and community developers in better applying Kitex/Hertz to build their microservices systems in their projects.&lt;/p&gt;

&lt;h1&gt;
  
  
  Enhanced Streaming Capabilities to Support LLMs
&lt;/h1&gt;

&lt;p&gt;With the rapid development of LLMs and ByteDance's AI applications, &lt;strong&gt;streaming communication&lt;/strong&gt; has emerged as the primary communication mode for LLM application services. To better support business growth, we have optimized streaming communication in microservices in terms of stability, engineering practices, and performance over the past year.&lt;/p&gt;

&lt;h2&gt;
  
  
  Previous Streaming Capabilities of Kitex/Hertz
&lt;/h2&gt;

&lt;p&gt;Both Kitex and Hertz support streaming scenarios. Kitex supports gRPC with better performance than the official gRPC implementation while aligning with its functionality. Hertz supports HTTP Chunked Transfer Encoding and WebSocket. However, these capabilities were insufficient to support the rapid development of LLMs inside ByteDance, for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;More SSE Applications on Clients&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before multimodal models, LLM applications were mainly text-based dialogue scenarios, often using the SSE protocol to return server results to clients in real time. Text push scenarios are simpler, requiring only a browser-friendly, straightforward protocol.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Burden of Transitioning from Thrift to Protobuf&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although gRPC (Protobuf) is commonly used for streaming communication in RPC scenarios, and Kitex also supports gRPC, ByteDance's server-side services primarily use Thrift IDLs. Developers are more familiar with Thrift, and few services use gRPC protocols. However, as the demand for streaming increases, we need to reduce the cognitive burden on developers during the transition, based on internal realities. Additionally, broadly introducing Protobuf-defined services is not conducive to unified IDL/interface management.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Lack of Engineering Practices&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compared to the PingPong model of one-send-one-receive, streaming communication adds complexity in service governance and engineering practices. The industry lacks accumulated engineering practices for streaming communication. Streaming interfaces can be easily misused, affecting service stability. From an observability perspective, there is no definition for streaming monitoring. &lt;/p&gt;

&lt;h2&gt;
  
  
  Streaming Capabilities – SSE/Thrift Streaming
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hertz SSE
&lt;/h3&gt;

&lt;p&gt;SSE (Server-Sent Events) is based on the HTTP protocol and supports unidirectional data push from the server to the client. Its advantages include simplicity, ease of use, and developer-friendliness, making it suitable for text transmission and meeting the basic communication needs of text dialogue models. Compared to WebSocket, SSE is lighter: for text-based dialogue LLM applications, the server only needs to push data to the client without handling the complexity of bidirectional communication. (In voice dialogue scenarios, WebSocket, which is also browser-friendly, is more suitable.) SSE can define different event types, and the client can process data based on the event type. In LLM applications, this can be used to distinguish different kinds of response data (e.g., partial outputs, error messages, status updates).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruydqzlq63w8efh2i335.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fruydqzlq63w8efh2i335.png" alt="Hertz SSE" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, SSE is not suitable for server-to-server communication, for several reasons: servers have high computational and transmission performance requirements, which an inefficient text protocol cannot satisfy; JSON is simple but ill-suited to complex server-side interaction scenarios, where strongly typed RPC is preferred; and some cases require bidirectional streaming communication.&lt;/p&gt;

&lt;p&gt;Therefore, considering ByteDance's internal needs, we choose to support Thrift Streaming.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kitex Thrift Streaming
&lt;/h3&gt;

&lt;p&gt;Streaming communication is used not only in &lt;strong&gt;LLM&lt;/strong&gt; applications but also in other business scenarios. For example, &lt;strong&gt;Douyin Search&lt;/strong&gt; aims to improve performance by streaming RPC results: during the video packaging stage, it retrieves information related to recalled video IDs, bundling results (10 docs) in one request and returning the first completed package. In the &lt;strong&gt;Lark People&lt;/strong&gt; data export scenario, data is retrieved concurrently; if all data were filled into an Excel sheet before returning, excessive data could lead to OOM (Out of Memory), terminating the process abnormally. Enhancing streaming capabilities therefore not only supports the rapid development of LLMs but also meets the development needs of other business scenarios.&lt;/p&gt;

&lt;p&gt;Although Kitex supports gRPC, we recommend using Thrift internally. Supporting diverse protocols can meet various needs, but it is best for a company to establish a single best practice: it minimizes the burden of choice on developers, and the toolchain can provide better support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming Protocols&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Within ByteDance, traffic control for Streaming protocols mainly relies on Service Mesh. However, to quickly support implementation without relying on Service Mesh's support for new protocols, Kitex first supported Thrift Streaming based on gRPC (HTTP2). Since the official gRPC protocol specification supports extending content-type, the implementation is &lt;strong&gt;based on gRPC's RPC communication specification, changing Protobuf encoding to Thrift encoding.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zxhb469s8blvt5okm90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zxhb469s8blvt5okm90.png" alt="stream 1" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thrift over gRPC&lt;/strong&gt; began its Alpha at ByteDance in December 2023 and was officially released in Kitex &lt;strong&gt;v0.9.0&lt;/strong&gt; in March 2024. It is now widely used internally, with usage instructions available on the official website.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service Mesh Compatibility: Based on HTTP2 transmission, no separate support is required from Service Mesh.&lt;/li&gt;
&lt;li&gt;Low Support Cost: The decoding type is explicitly determined based on SubContentType (an extension supported by the gRPC protocol specification).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High Resource Consumption: Flow control and dynamic windows introduce additional overhead.&lt;/li&gt;
&lt;li&gt;Significant Latency Impact: Flow control can significantly degrade latency with heavier traffic or larger packets, requiring users to adjust WindowSize.&lt;/li&gt;
&lt;li&gt;Difficult Troubleshooting: Increased complexity also raises the difficulty of troubleshooting.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
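&lt;p&gt;The SubContentType mechanism mentioned above boils down to inspecting the content-type suffix that the gRPC-over-HTTP/2 specification allows. A generic sketch of that routing decision (the &lt;code&gt;thrift&lt;/code&gt; subtype string here is an assumption for illustration, not necessarily the exact value Kitex sends):&lt;/p&gt;

```python
def grpc_subtype(content_type):
    """Extract the codec subtype from a gRPC content-type header.

    Per the gRPC-over-HTTP/2 spec, content-type is "application/grpc",
    optionally followed by "+subtype" (e.g. "application/grpc+proto").
    A peer can pick the right decoder by inspecting the subtype alone.
    """
    base, _, subtype = content_type.partition("+")
    if base != "application/grpc":
        raise ValueError(f"not a gRPC content-type: {content_type}")
    return subtype or "proto"  # the spec treats a bare value as proto

print(grpc_subtype("application/grpc+thrift"))  # thrift
print(grpc_subtype("application/grpc"))         # proto
```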

&lt;p&gt;Thrift over gRPC can be quickly implemented. However, from the perspectives of performance and troubleshooting, we have developed a Streaming protocol (Streaming over TTHeader) to simplify streaming communication. It is currently under internal debugging and trials, with an expected release in November-December 2024. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to Define Streaming in Thrift&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Users familiar with Thrift know that native Apache Thrift does not support the definition of streaming interfaces. Adding new keywords would make other Thrift parsing tools, including IDE plugins, incompatible. Therefore, defining streaming types for Thrift's RPC methods through annotations ensures parsing compatibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;streaming.mode="bidirectional": Bidirectional Streaming&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;streaming.mode="client": Client Streaming&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;streaming.mode="server": Server Streaming&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
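&lt;p&gt;The three modes correspond to three call shapes. As a rough illustration only (plain Python generators standing in for streams, not Kitex's generated code):&lt;/p&gt;

```python
from typing import Iterable, Iterator

# Server streaming: one request in, a stream of responses out.
def server_stream(req: str) -> Iterator[str]:
    for i in range(3):
        yield f"{req}:part{i}"

# Client streaming: a stream of requests in, one aggregated response out.
def client_stream(reqs: Iterable[str]) -> str:
    return "+".join(reqs)

# Bidirectional streaming: responses interleave with incoming requests.
def bidi_stream(reqs: Iterable[str]) -> Iterator[str]:
    for r in reqs:
        yield r.upper()

print(list(server_stream("q")))       # ['q:part0', 'q:part1', 'q:part2']
print(client_stream(["a", "b"]))      # a+b
print(list(bidi_stream(["x", "y"])))  # ['X', 'Y']
```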

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbaym2pgg0vqla8mt2e2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbaym2pgg0vqla8mt2e2q.png" alt="streaming in Thrift" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both the currently supported Thrift Streaming over gRPC and the upcoming Thrift Streaming over TTHeader use this method to define streaming methods. The client-side will provide options to specify which Streaming protocol to use, while the server-side will support multiple protocols through protocol detection.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Generalized Streaming Invocation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If SSE is used for streaming communication on clients and Thrift Streaming is used on servers, how does the overall communication from clients to servers work?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmi5ysmos9ff8hpi3f1ha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmi5ysmos9ff8hpi3f1ha.png" alt="SSE-thrift streaming" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Taking the internal text dialogue model as an example, the traffic undergoes protocol conversion after passing through the API gateway, and the server uses the Server Streaming type to push data to the client.&lt;/p&gt;

&lt;p&gt;An important capability here is protocol conversion. Additionally, pressure testing and interface testing platforms need to dynamically construct data to test server services.&lt;/p&gt;

&lt;p&gt;Users of Kitex know that Kitex provides generalized invocation for Thrift protocols, primarily supporting such general services. Previously, internal microservices were mainly Thrift PingPong services. Kitex provided generalized invocation for Map, JSON, HTTP data types, as well as binary generalized invocation for traffic forwarding.&lt;/p&gt;

&lt;p&gt;Therefore, for streaming interfaces, Kitex has added support for generalized streaming invocation. Compared to PingPong generalized interfaces, generalized streaming requires separate interfaces for the three streaming types.&lt;/p&gt;
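&lt;p&gt;Conceptually, a JSON generalized server-streaming call routes by method name and transcodes JSON on both ends. A toy sketch with hypothetical names (not Kitex's actual generic API):&lt;/p&gt;

```python
import json

# Hypothetical backend handler; names and shapes are illustrative only.
def chat_stream(req):
    for i, word in enumerate(req["prompt"].split()):
        yield {"seq": i, "token": word}

METHODS = {"Chat": chat_stream}

def generic_server_streaming_call(method, json_request):
    """Generalized invocation: route by method name, decode the JSON
    request, and re-encode each streamed response as JSON."""
    handler = METHODS[method]
    for resp in handler(json.loads(json_request)):
        yield json.dumps(resp)

for frame in generic_server_streaming_call("Chat", '{"prompt": "hello world"}'):
    print(frame)
```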

&lt;ul&gt;
&lt;li&gt;PingPong/Unary Generalized Invocation Interface&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvscacrfgcmhobiv9xt2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvscacrfgcmhobiv9xt2o.png" alt="Image description" width="800" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streaming Generalized Invocation Interface&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0xj5j9fvd4kqizdk5oc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0xj5j9fvd4kqizdk5oc.png" alt="Image description" width="650" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Currently, support for the mainstream &lt;strong&gt;JSON &lt;/strong&gt;data type is complete, and other data types will be supported based on business needs in the future. (Since the Kitex Streaming v2 interface is yet to be released, and to avoid affecting the user experience of generalized streaming, this support has not been officially announced, but the functionality is ready. Users can visit the generalized invocation section on the official website for English documents.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9tprg1meh8ji0h15kjo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9tprg1meh8ji0h15kjo.png" alt="Image description" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  User Experience of Streaming Capability
&lt;/h2&gt;

&lt;p&gt;We have now covered the streaming capabilities that Kitex/Hertz supported in the past, supports today, and will release soon. But do developers who work on streaming interfaces, including those using other frameworks such as official gRPC, know how to use streaming interfaces properly, and how to locate issues when they arise?&lt;/p&gt;

&lt;p&gt;Within ByteDance, as streaming services have evolved, we've noticed a significant increase in feedback issues. On one hand, compared to Thrift PingPong, our support at the basic capability level is still incomplete. On the other hand, developing streaming interfaces requires a deep understanding of proper usage; otherwise, misuse can easily lead to problems. &lt;/p&gt;

&lt;p&gt;Therefore, in 2024, we initiated a &lt;strong&gt;Streaming Optimization Project&lt;/strong&gt;, sorted through various issues, and optimized them one by one. In terms of user experience, some issues are related to streaming interface definitions. After comprehensive consideration, we decided to shed the streaming burden and release the Streaming v2 interface. Below are some of the existing issues and ongoing optimizations. It's difficult to enforce proper usage of streaming interfaces solely from the framework level. Therefore, we will release usage specifications and best practices for streaming interfaces to help users develop high-quality streaming interfaces. If you have better suggestions for streaming usage, we welcome your feedback! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblf4ongdwige20hskpzz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblf4ongdwige20hskpzz.png" alt="Image description" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Taking streaming observability as an example: previously, streaming interface monitoring was not defined separately and reused PingPong reporting, resulting in only overall stream-level information and &lt;strong&gt;lacking Recv/Send monitoring&lt;/strong&gt;. Therefore, when supporting Thrift Streaming, StreamSend &amp;amp; StreamRecv events were added, with the framework recording the time of occurrence and the size of user-transmitted data. Enterprise users with custom Tracer reporting only need to implement the &lt;a href="https://github.com/cloudwego/kitex/blob/v0.9.1/pkg/rpcinfo/tracer.go#L31" rel="noopener noreferrer"&gt;rpcinfo.StreamEventReporter&lt;/a&gt; interface; Kitex will call it after each Recv and Send execution, giving access to the event information for that Recv or Send. Below is the Trace information for Send/Recv within a Stream.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F595jxrg8bgivzi18p22l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F595jxrg8bgivzi18p22l.png" alt="Image description" width="800" height="559"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Review of New Features, User Experience/Performance Improvements
&lt;/h1&gt;

&lt;p&gt;While specialized support and optimization for streaming capabilities have been conducted over the past year, we have also provided other new features to meet user needs, enhance user experience, and continue to improve framework performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  New Features – Thrift/gRPC Multi-Services
&lt;/h2&gt;

&lt;p&gt;The official gRPC framework supports multiple services, but previous versions of Kitex did not, mainly to align with Thrift usage. &lt;strong&gt;Thrift's limitation arises because supporting multiple services would introduce protocol-incompatible changes&lt;/strong&gt;, impacting users. Within ByteDance, the TTHeader protocol is widely used, so we decided to transmit the IDL Service Name via TTHeader to solve the issue of Thrift not supporting multiple services.&lt;/p&gt;

&lt;p&gt;Kitex v0.9.0 officially supports &lt;strong&gt;registering multiple IDL Services within one Server&lt;/strong&gt;, including Thrift and Protobuf. Thrift provides true multi-service functionality at the protocol level based on TTHeader, while being compatible with the old CombineService.&lt;/p&gt;

&lt;p&gt;Here is a brief introduction to CombineService. Kitex previously provided a pseudo-multi-service feature, CombineService, to address the issue of excessively large IDLs (which lead to large generated code and slow compilation). It allows the server to split one IDL Service into multiple ones, but requires that those IDL Services have no methods with the same name (since the protocol does not support multiple services, method routing cannot be done). Ultimately, Kitex merges the multiple IDL Services into one Service, hence the name CombineService.&lt;/p&gt;

&lt;p&gt;With Kitex's new multi-service support, the server &lt;strong&gt;can not only register multiple IDL Services, but also provide both Thrift and Protobuf interfaces simultaneously&lt;/strong&gt;. For example, using Kitex-gRPC (Protobuf) but wanting to switch to Thrift Streaming while ensuring compatibility with old interface traffic, two types of IDL interfaces can be provided for transition.&lt;/p&gt;

&lt;p&gt;Below is an example of registering multiple services on the server:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnz3pzcdqh1p1lsbmd2ah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnz3pzcdqh1p1lsbmd2ah.png" alt="Image description" width="596" height="288"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  New Features – Mixed Retry
&lt;/h2&gt;

&lt;p&gt;Kitex previously provided two retry policies: &lt;strong&gt;Failure Retry&lt;/strong&gt; and &lt;strong&gt;Backup Request&lt;/strong&gt;. Failure Retry can improve success rates (enhancing service SLAs), but most retries are triggered by timeouts, which increases latency; Backup Request can reduce request latency, but a failed response terminates retries.&lt;/p&gt;

&lt;p&gt;In internal practice, businesses generally want to &lt;strong&gt;combine both retries&lt;/strong&gt;, which would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Optimize the overall retry latency of Failure Retry&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improve the request success rate of Backup Request&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, Kitex v0.11.0 supports Mixed Retry, a hybrid retry policy combining Failure Retry and Backup Request.&lt;/p&gt;

&lt;p&gt;To illustrate the differences between the three retry policies, consider a scenario: the first request takes 1200ms, a retried request takes 900ms, RPCTimeout is configured to 1000ms, MaxRetryTimes to 2, and BackupDelay to 200ms.&lt;/p&gt;

&lt;p&gt;Comparing the results of the three retries:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs214jl0zournhwoghpt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs214jl0zournhwoghpt7.png" alt="Image description" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Mixed Retry: &lt;strong&gt;Success&lt;/strong&gt;, cost &lt;strong&gt;1100ms &lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Failure Retry: &lt;strong&gt;Success&lt;/strong&gt;, cost &lt;strong&gt;1900ms &lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Backup Retry: &lt;strong&gt;Failure&lt;/strong&gt;, cost 1000ms &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
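&lt;p&gt;The timings above follow from simple arithmetic. A minimal sketch of this scenario's bookkeeping (illustrative only, not the framework's actual retry algorithm):&lt;/p&gt;

```python
RPC_TIMEOUT = 1000           # ms
BACKUP_DELAY = 200           # ms
FIRST_COST, RETRY_COST = 1200, 900  # latencies of the first and retried request

# Failure Retry: wait for the first attempt to time out, then retry.
failure_cost = min(FIRST_COST, RPC_TIMEOUT) + RETRY_COST  # 1000 + 900 = 1900
failure_ok = RPC_TIMEOUT >= RETRY_COST                    # the retry fits in the timeout

# Backup Request: fire a second request after BackupDelay, but any failed
# response ends the call. The first attempt's timeout (1000ms) arrives
# before the backup would finish (200 + 900 = 1100ms).
backup_finish = BACKUP_DELAY + RETRY_COST
first_finish = min(FIRST_COST, RPC_TIMEOUT)
backup_ok = first_finish > backup_finish            # backup must beat the first result
backup_cost = backup_finish if backup_ok else first_finish

# Mixed Retry: fire a backup after BackupDelay AND keep retrying on failure,
# so the first attempt's timeout does not end the call; the backup wins.
mixed_ok = True
mixed_cost = BACKUP_DELAY + RETRY_COST

print(f"Mixed Retry:   success={mixed_ok}, cost={mixed_cost}ms")
print(f"Failure Retry: success={failure_ok}, cost={failure_cost}ms")
print(f"Backup Request: success={backup_ok}, cost={backup_cost}ms")
```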

&lt;h2&gt;
  
  
  User Experience - Frugal &amp;amp; FastCodec (Thrift)
&lt;/h2&gt;

&lt;p&gt;Both Frugal and FastCodec (Thrift) are high-performance Thrift serialization tools provided by Kitex. Frugal's advantage over FastCodec is that it does not require code generation, significantly addressing the issue of excessively large outputs.&lt;/p&gt;

&lt;p&gt;But two cons remain:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Both Frugal and FastCodec decoding must rely on packets with headers. If it's a Thrift Buffered packet, decoding falls back to the Apache codec. Users need to be clear about the protocol they receive; otherwise, using Frugal cannot completely eliminate code generation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Frugal is based on a JIT implementation. x86 support is complete, but ARM only had a fallback strategy with poor performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To address the protocol binding issue, the new version supports SkipDecode. Test results show that SkipDecode + FastCodec still outperforms the Apache Thrift codec.&lt;/p&gt;

&lt;p&gt;For the Frugal ARM issue, new reflection-based support is provided, eliminating the need for separate support per architecture. Although it uses reflection, it bypasses reflection's internal type checks to achieve higher performance; test results are slightly better than the JIT implementation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguosvokvr8wt1p8cpx55.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguosvokvr8wt1p8cpx55.png" alt="Image description" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  User Experience - Output Reduction and Generation Speed Optimization
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Large generated output, slow code generation, and slow compilation&lt;/strong&gt; are significant pain points for services with long iteration histories within ByteDance. Therefore, Kitex provides several optimizations to reduce output size and speed up code generation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;IDL Trimming&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A complex IDL with a long iteration history contains many obsolete struct definitions, and cleaning them up by hand adds to the development burden. The trimming tool generates code only for the struct definitions actually required by RPC methods, and users can also specify which methods to generate. In pilot projects on large ByteDance repositories, &lt;strong&gt;generation time is halved, and output size is reduced by over 60%&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Usage: &lt;em&gt;$ kitex -module xx -thrift trim_idl xxxx.thrift&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Example effect: In the example below, the trimming tool deleted 60,000 unused structs and 530,000 fields.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fel7386tqnleqsdnf9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fel7386tqnleqsdnf9q.png" alt="Image description" width="800" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;no_fmt Speedup&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After code generation, the output is formatted by default to improve readability, but users rarely read generated code. Disabling the fmt option therefore speeds up generation. &lt;/p&gt;

&lt;p&gt;Usage: &lt;em&gt;$ kitex -module xx -thrift no_fmt xxxx.thrift&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Effect: The P90 generation time for a certain platform within ByteDance decreased from &lt;strong&gt;80s to 20s&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Removing Unnecessary Codes from Kitex&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kitex generates the full Apache Thrift code by default, but in practice only the Codec part is used, and only in fallback scenarios; the rest of the code is not needed. &lt;/p&gt;

&lt;p&gt;Therefore, Kitex v0.10.0 defaults to removing the Thrift Processor and can remove all Apache Thrift code via parameter specification.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Usage: &lt;em&gt;$ kitex -module xxx -thrift no_default_serdes xxx.thrift&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Effect: Output size is reduced by about &lt;strong&gt;50%+&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Frugal Slim Extreme Reduction&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Usage: &lt;em&gt;$ kitex -thrift frugal_tag,&lt;strong&gt;template=slim&lt;/strong&gt; -service p.s.m idl/api.thrift&lt;/em&gt;, which uses Frugal for Thrift serialization.&lt;/p&gt;

&lt;p&gt;Effect: Output size is reduced by about &lt;strong&gt;90%&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  User Experience - kitexcall
&lt;/h2&gt;

&lt;p&gt;Although RPC calls are simpler and more convenient than HTTP, they are not convenient to &lt;strong&gt;test&lt;/strong&gt;: tools must first generate code and then construct request data. The testing platforms mentioned earlier use generalized invocation to construct request data without generated code, but its cost is not low either, since users must first understand the method's usage and how to construct its data.&lt;/p&gt;

&lt;p&gt;To &lt;strong&gt;improve testing convenience&lt;/strong&gt;, a standalone command-line tool, kitexcall, is provided on top of Kitex &lt;strong&gt;JSON generalized invocation&lt;/strong&gt;, allowing users to issue Thrift test requests with JSON data. (This feature was contributed by the community; thank you!) &lt;/p&gt;

&lt;p&gt;Usage: &lt;em&gt;$ kitexcall -idl-path echo.thrift -m echo -d '{"message": "hello"}' -e 127.0.0.1:8888&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Future optimization plans:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Graphical interface for more convenient testing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support for gRPC testing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No need to specify IDL, using server reflection to obtain IDL information&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Performance Optimization – Thrift On-Demand Serialization
&lt;/h2&gt;

&lt;p&gt;As business iterations make IDL definitions increasingly complex, upstream services in production may &lt;strong&gt;only need partial fields&lt;/strong&gt;, yet all fields must be serialized and transmitted, introducing additional performance overhead. With this in mind, Kitex supports on-demand serialization for Thrift.&lt;/p&gt;

&lt;p&gt;Referencing Protobuf, Kitex provides a Thrift FieldMask feature that lets users select which fields to encode, reducing serialization and transmission overhead. &lt;/p&gt;

&lt;p&gt;For example, below, only the Foo field is encoded and returned, ignoring the Bar field:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cep3j501jr6s8tk25t8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cep3j501jr6s8tk25t8.png" alt="Image description" width="668" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The user constructs Bar data, but because the FieldMask marks only the Foo field, the framework encodes only Foo:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lb186aeiajf5als1885.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lb186aeiajf5als1885.png" alt="Image description" width="800" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It also supports letting the peer specify the required fields; for usage details, see the on-demand serialization documentation on the official website.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Optimization – Thrift Memory Allocation Optimization
&lt;/h2&gt;

&lt;p&gt;Kitex continuously monitors RPC performance. Under today's cost pressures, we are exploring deeper optimizations. The routine hot-path optimizations have all been done, so the remaining ones are less conventional. v0.10.0 ships new optimizations focused on memory allocation and GC.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Span Cache&lt;/strong&gt;: Optimizes String/Binary decoding costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-allocates memory, reducing mallocgc calls&lt;/li&gt;
&lt;li&gt;Reduces the actual number of generated objects -&amp;gt; lower GC costs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Centralized memory allocation for container fields&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Similarly, switches from a separate allocation per element to a single centralized allocation&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Span Cache optimizes CPU but increases memory usage. To avoid impacting services with small memory specifications, it is disabled by default and must be enabled explicitly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbwlxb6hrpv12rm4ui1q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbwlxb6hrpv12rm4ui1q.png" alt="Image description" width="658" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Effect: Under extreme testing, throughput is increased by about 10%, and latency is reduced by about 30%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory Analysis Tool
&lt;/h2&gt;

&lt;p&gt;The objects received over RPC/HTTP are constructed, allocated, and populated by the framework before being returned to the user. However, if user code holds onto these objects indefinitely, it can lead to memory leaks. While pprof heap can show where memory is allocated, it cannot show where it is referenced. So how do we determine &lt;strong&gt;who is referencing a Go object&lt;/strong&gt;? &lt;/p&gt;

&lt;p&gt;In fact, GC scans and marks objects, capturing reference relationships. By combining this with variable names and type information, we can analyze the referencing situation of objects. Leveraging Delve, we have developed the &lt;strong&gt;goref &lt;/strong&gt;object reference analysis tool, which was open-sourced in July (&lt;em&gt;github.com/cloudwego/goref&lt;/em&gt;). This addresses the limitation of Go's native tools in analyzing memory references, aiding Go developers &lt;strong&gt;in quickly identifying memory leaks&lt;/strong&gt; and enhancing the Go tooling ecosystem.&lt;/p&gt;

&lt;p&gt;For instance, the pprof Heap Profile in the following image shows that the currently referenced objects are primarily allocated within FastRead (Kitex's deserialization code). It is normal for decoding to allocate memory to construct data, but this flame graph offers limited help in troubleshooting, because the allocation site is usually not the source of the leak.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8qwfmihw4ds3mfjwar2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8qwfmihw4ds3mfjwar2.png" alt="Image description" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, using the goref tool yields the following result: mockCache holds an RPC Resp, preventing memory from being released. The issue is immediately apparent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0yxyrlv8ga7fo8nsgmxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0yxyrlv8ga7fo8nsgmxn.png" alt="Image description" width="800" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion and Outlook
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Enhancing Streaming Capabilities to Support LLMs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Streaming capabilities provided by Kitex/Hertz: gRPC, HTTP 1.1 Chunked, WebSocket, SSE, Thrift Streaming&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SSE &amp;lt;-&amp;gt; Thrift Streaming&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generalized streaming invocations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Streaming capability optimizations to enhance user experience and engineering practices&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Review of New Features, User Experience, and Performance Improvements&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;New Features: Thrift/gRPC multi-service support, Mixed Retry&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;User Experience: Frugal/FastCodec, streamlined outputs, generation speed optimizations, kitexcall&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performance Optimization: Thrift on-demand serialization, memory allocation improvement&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory Analysis Tool: goref&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Outlook
&lt;/h2&gt;

&lt;p&gt;In the coming year, we will continue to enhance streaming capabilities and optimize the user experience. We will provide usage guidelines for streaming interfaces to help users better develop their streaming services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Release Kitex Streaming v2 interface to address historical issues&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Release TTHeader Streaming for better performance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Engineering practices: graceful shutdown, retries, timeout control&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Release streaming-related specifications: error handling, interface usage guidelines&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Furthermore, we will consider strengthening the streaming ecosystem, such as enriching generalized streaming invocations and providing more gateway-friendly support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SSE &amp;lt;-&amp;gt; Thrift Streaming(HTTP2 and TTHeader Streaming)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;WebSocket &amp;lt;-&amp;gt; Thrift Streaming (HTTP2 and TTHeader Streaming)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Binary and Map generalized invocations for Streaming&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Special announcement: Kitex plans to gradually remove Apache Thrift-generated code in future versions. Due to incompatible changes in Apache Thrift v0.14, Kitex has been locked to Apache Thrift v0.13. To resolve this, Kitex will eliminate its dependency on Apache Thrift.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>llm</category>
      <category>backend</category>
      <category>performance</category>
      <category>application</category>
    </item>
    <item>
      <title>KubeAdmiral: next-generation multi-cluster orchestration engine based on Kubernetes</title>
      <dc:creator>XIAOXU CHANG </dc:creator>
      <pubDate>Mon, 15 Apr 2024 09:33:34 +0000</pubDate>
      <link>https://dev.to/bytedanceoss/kubeadmiral-next-generation-multi-cluster-orchestration-engine-based-on-kubernetes-2d0b</link>
      <guid>https://dev.to/bytedanceoss/kubeadmiral-next-generation-multi-cluster-orchestration-engine-based-on-kubernetes-2d0b</guid>
      <description>&lt;p&gt;Project link: &lt;a href="https://github.com/kubewharf/kubeadmiral"&gt;https://github.com/kubewharf/kubeadmiral&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since its release in 2014, Kubernetes has become the de facto standard for cloud native orchestration and scheduling systems, delivering substantial value to infrastructure developers around the world. &lt;/p&gt;

&lt;p&gt;As an increasing number of corporations embrace cloud native technologies and migrate their workloads to Kubernetes, the scale of their clusters grows rapidly. &lt;/p&gt;

&lt;p&gt;The community edition of Kubernetes, capped at 5000 nodes per cluster, is no longer able to keep up with the scale requirements of large-scale enterprise applications. Moreover, many companies are adopting multi-cloud architectures to achieve cost reduction, increased resource and operational efficiency, geographical disaster recovery, and environment isolation. &lt;/p&gt;

&lt;p&gt;As a result, the demand for multi-cluster orchestration and scheduling tools is on the rise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Brief History of Kubernetes at ByteDance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dedicated Clusters
&lt;/h3&gt;

&lt;p&gt;In the early years of ByteDance’s cloud native adoption, each business line operated in separate dedicated clusters due to isolation concerns. However, this led to low resource elasticity and efficiency, observed in several ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each business line had to maintain independent resource buffers for scaling and upgrading.&lt;/li&gt;
&lt;li&gt;Applications were tightly coupled to specific clusters, and manual resource transfers were required to balance utilization as applications scaled.&lt;/li&gt;
&lt;li&gt;SRE teams had to deeply understand both the businesses and the clusters in order to manage resources efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consequently, this resulted in inefficient resource management and suboptimal overall deployment rate.&lt;/p&gt;

&lt;h3&gt;
  
  
  KubeFed v2
&lt;/h3&gt;

&lt;p&gt;To address these challenges, the technical infrastructure team at ByteDance started exploring cluster federation based on KubeFed v2 in 2019. The goal was to pool resources across business lines, reduce unnecessary buffers, and improve the efficiency of resource management. &lt;/p&gt;

&lt;p&gt;KubeFed v2 introduces the concept of host and member clusters. Users create federated workloads (e.g. FederatedDeployment) in the host cluster, and KubeFed schedules and dispatches workloads in the member clusters based on these federated workloads. Each federated workload contains three primary fields: &lt;strong&gt;Template&lt;/strong&gt; (specifying the workload to be dispatched to member clusters), &lt;strong&gt;Placement&lt;/strong&gt; (designating target member clusters), and &lt;strong&gt;Overrides&lt;/strong&gt; (indicating how the template should be varied in some clusters). For example, the following FederatedDeployment instructs KubeFed to create a Deployment in cluster1 and cluster2 with 2 and 3 replicas respectively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: types.kubefed.k8s.io/v1beta1
kind: FederatedDeployment
metadata:
  name: test-deployment
spec:
  template:
    metadata:
      labels:
        app: nginx
    spec:
      replicas: 5
      # more Deployment fields...
  placement:
    clusters:
    - name: cluster1
    - name: cluster2
  overrides: 
  - clusterName: cluster1
    clusterOverrides:
    - path: /spec/replicas
      value: 2
  - clusterName: cluster2
    clusterOverrides:
    - path: /spec/replicas
      value: 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Deployment and ReplicaSet, KubeFed supports dividing the desired replicas across multiple clusters based on ReplicaSchedulingPreference (RSP). Users can configure the weights, minimum replicas, and maximum replicas for each cluster, and the RSP controller computes a valid replica distribution and updates the Placement and Overrides fields of FederatedDeployment or FederatedReplicaSet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7byfqe4yygyjdu1454s.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7byfqe4yygyjdu1454s.jpeg" alt="RSP Scheduling" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;RSP Scheduling (Image credit: &lt;a href="https://www.kubernetes.org.cn/5702.html"&gt;https://www.kubernetes.org.cn/5702.html&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;KubeFed laid the foundation of Kubernetes cluster federation at ByteDance. However, we soon found KubeFed unable to meet our production requirements. The primary pain points were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Uneven resource utilization across clusters – KubeFed’s RSP only supports static cluster weights and lacks the ability to adapt to fluctuations in cluster resources dynamically.&lt;/li&gt;
&lt;li&gt;Service disruption after rescheduling – During rescheduling, replicas might be abruptly migrated between clusters, disrupting service availability.&lt;/li&gt;
&lt;li&gt;Limitations in scheduling semantics – KubeFed supports stateless, replica-based resources through RSP, but lacks support for more diverse resources such as stateful workloads and jobs. Moreover, extending the existing scheduling semantics is difficult.&lt;/li&gt;
&lt;li&gt;High onboarding cost – KubeFed requires the creation of federated objects and is incompatible with the native Kubernetes API. Users and downstream platforms need to completely overhaul their usage patterns.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  KubeAdmiral
&lt;/h2&gt;

&lt;p&gt;With the evolution of cloud native infrastructure at ByteDance, we raised our standards for efficiency, scalability, performance, and cost. Meanwhile, the size and number of our Kubernetes clusters continue to grow phenomenally along with the businesses. Additionally, workloads beyond stateless microservices, including stateful services, storage, offline and machine learning jobs, started embracing cloud native technologies. Against this backdrop, the limitations of KubeFed became increasingly difficult to manage. Therefore, at the end of 2021, we began our endeavor to develop the next generation cluster federation system, building upon KubeFed v2’s foundation. We named it KubeAdmiral to capture our aspiration for it to manage multiple clusters as effectively as a seasoned navy admiral commands a fleet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2dlvg73bei6xvpersq6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2dlvg73bei6xvpersq6.jpeg" alt="Timeline of Kubernetes at ByteDance" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Timeline of Kubernetes at ByteDance&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;KubeAdmiral offers enhanced multi-cluster orchestration and scheduling capabilities for various mainstream business scenarios. Today at ByteDance, KubeAdmiral manages more than 100,000 microservices with more than 10,000,000 pods running on dozens of federated Kubernetes clusters. It supports upwards of 30,000 upgrade and scaling operations daily, and maintains a stable deployment rate of 95-98% without the need for manual intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  KubeAdmiral Feature Highlight
&lt;/h2&gt;

&lt;p&gt;KubeAdmiral not only supports native Kubernetes resources and third-party custom resources, but also offers a rich and extensible scheduling framework. Moreover, it refines numerous aspects of scheduling and dispatching, backed by years of practical production experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjr2n36rj7xbnk50zryre.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjr2n36rj7xbnk50zryre.png" alt="KubeAdmiral architecture diagram" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;KubeAdmiral architecture diagram&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Rich Multi-Cluster Scheduling Capabilities&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The scheduler is a core component of KubeAdmiral responsible for computing the desired placement of workloads in member clusters. When scheduling replica-based workloads, it also computes the appropriate replicas for each cluster. Functioning as KubeAdmiral’s “brain”, its decisions directly impact critical aspects such as fault tolerance, resource efficiency, and stability.&lt;/p&gt;

&lt;p&gt;KubeFed provides the RSP scheduler for replica-based workloads, but its customizability and extensibility are very limited, and modifying its behavior requires code modification. Additionally, it lacks support for stateful services, job-like resources, etc., which require different sets of scheduling semantics.&lt;/p&gt;

&lt;p&gt;KubeAdmiral introduces more comprehensive scheduling semantics. It supports more flexible and fine-grained mechanisms to select clusters via labels, taints, etc., and score clusters based on resource utilization, affinity, and so on. Beyond just replica-based workloads, it also supports scheduling stateful workloads and job-like resources. Additionally, it brings about convenient features such as automatic dependency scheduling (dependencies such as ConfigMaps can automatically follow their Deployment to corresponding member clusters). The scheduling behavior can be configured using a PropagationPolicy object, as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: core.kubeadmiral.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: mypolicy
  namespace: default
spec:
  # Many different ways to select clusters.
  placement:
  # Manually specify desired clusters and replica weights, if required.
  - cluster: cluster-01
    preferences:
      weight: 4
  - cluster: cluster-02
    preferences:
      weight: 3
  - cluster: cluster-03
    preferences:
      weight: 4
  # Filter clusters based on label selectors.
  clusterSelector:
    IPv6: "true"
  # Filter clusters based on affinity.
  clusterAffinity:
  - matchExpressions:
    - key: region
      operator: In
      values:
      - us-east
  # Filter clusters based on taints and tolerations.
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
  # Mode of scheduling - divide or duplicate.
  schedulingMode: Divide
  reschedulePolicy: 
    # Only schedule on creation and do not reschedule afterwards.
    # Suitable for stateful workloads.
    disableRescheduling: false
    # When rescheduling should be triggered.
    # More triggers: reschedule more frequently - favor agility.
    # Fewer triggers: reschedule less frequently - favor stability.
    rescheduleWhen:
      policyContentChanged: true
      clusterLabelsChanged: false
    # Whether to rebalance replicas on reschedule.
    # Enabling rebalance results in optimal placement, but at the potential cost
    # of disrupting existing replicas.
    replicaRescheduling:
      avoidDisruption: true
  # Limit propagation to a single cluster.
  # Suitable for job-like workloads.
  maxClusters: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of writing Overrides manually, KubeAdmiral supports generating Overrides based on OverridePolicy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: core.kubeadmiral.io/v1alpha1
kind: OverridePolicy
metadata:
  name: example
  namespace: default
spec:
  # Flexible ways to select target clusters.
  overrideRules:
  - targetClusters:
      # Select clusters by name.
      clusters:
      - on-prem-1
      - edge-1
      # Select clusters by label.
      clusterSelector:
        region: us-east
        az: az1
      # Select clusters by affinity.
      clusterAffinity:
      - matchExpressions:
        - key: region
          operator: In
          values:
          - us-east
      # Change the container image in the target clusters using jsonpatch.
      overriders:
        jsonpatch:
        - path: "/spec/template/spec/containers/0/image"
          operator: replace
          value: "nginx:test"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Scheduler Extension&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Taking inspiration from kube-scheduler’s design, KubeAdmiral offers a flexible scheduling framework. It simplifies the scheduling process by dividing it into four distinct stages: &lt;strong&gt;Filter, Score, Select, and Replica.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Each stage is handled by individual plugins, creating a logical separation that promotes modularity. For instance, in the provided PropagationPolicy example above, most behaviors are implemented through built-in scheduling plugins. The beauty of this approach is that plugins can be easily added or removed, without any impact on the remaining plugins. This greatly simplifies the scheduler logic and reduces its overall complexity. Although the built-in plugins in KubeAdmiral offer versatile features that cater to common use cases, users have the flexibility to enhance the functionality by creating their own custom scheduling plugins for specific niche scenarios. This empowers users to seamlessly integrate with internal or existing systems. &lt;/p&gt;

&lt;p&gt;The KubeAdmiral scheduler interacts with external plugins via the HTTP protocol, enabling users to extend the scheduling logic with minimal effort and without having to modify the KubeAdmiral control plane. The plugin only needs to output the desired placement, and KubeAdmiral takes care of binding and enforcing those results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67gelcareqx1uxvi1pd5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67gelcareqx1uxvi1pd5.png" alt="Scheduler stages and plugins" width="800" height="343"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Scheduler stages and plugins&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Automatic Migration of Unschedulable Workloads&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For replica scheduling, KubeAdmiral calculates the number of replicas that each member cluster should receive and overrides the replicas field in the template before distributing the resources to the member clusters. After the resources are distributed to member clusters, the kube-scheduler in each member cluster assigns the corresponding pods to available nodes. Thus, a full scheduling chain is completed.&lt;/p&gt;

&lt;p&gt;Occasionally, there are cases where the kube-scheduler fails to find suitable nodes for pods due to reasons including node outages, resource shortages, and unmet node affinity requirements. If left unaddressed, the unschedulable pods will remain pending. KubeAdmiral resolves this by automatically migrating the unschedulable pods to other clusters, enabling better resource utilization overall.&lt;/p&gt;

&lt;p&gt;As an illustration, consider three clusters A, B, and C with an equal weight distribution for six replicas. After the initial scheduling by KubeAdmiral, each cluster receives two replicas. If the two replicas in cluster C fail to be scheduled by kube-scheduler after a while, KubeAdmiral automatically shifts them to clusters A and B, ensuring the desired availability of 6 replicas across all clusters.&lt;/p&gt;


&lt;h3&gt;
  
  
  Dynamic Replica Distribution Based on Real-Time Resource Availability
&lt;/h3&gt;

&lt;p&gt;In a multi-cluster setup, the resource utilization of each cluster fluctuates as machines go online or offline. Relying solely on the static-weight replica scheduling provided by KubeFed RSP can easily lead to skewed resource utilization. Clusters with a high deployment rate are prone to pending pods during upgrades, while clusters with a low deployment rate have idle resources that go to waste.&lt;/p&gt;

&lt;p&gt;As a solution to this, KubeAdmiral introduces dynamic weight scheduling based on real-time cluster resource utilization. It calculates the amount of available resources by collecting the total and allocated resources of each cluster, and uses it as the weight for replica scheduling. This ultimately achieves dynamic load balancing across all member clusters. In practice, we are able to maintain a stable deployment rate of 95-98% or above in all member clusters with this approach.&lt;/p&gt;
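&lt;p&gt;To make the idea concrete, here is a minimal Go sketch of weight-by-available-resources scheduling. The function name, the proportional apportioning, and the round-robin remainder handling are our own illustrative assumptions, not KubeAdmiral’s actual implementation:&lt;/p&gt;

```go
package main

import "fmt"

// distributeByAvailable sketches dynamic weight scheduling: each cluster's
// weight is its free capacity (capacity minus allocated), and replicas are
// apportioned proportionally, with rounding leftovers handed out one by one.
// It assumes at least one cluster has free capacity.
func distributeByAvailable(replicas int, capacity, allocated []int) []int {
	n := len(capacity)
	weights := make([]int, n)
	weightSum := 0
	for i := range capacity {
		weights[i] = max(capacity[i]-allocated[i], 0) // free capacity as weight
		weightSum += weights[i]
	}
	out := make([]int, n)
	given := 0
	for i := range weights {
		out[i] = replicas * weights[i] / weightSum // truncating division
		given += out[i]
	}
	// Hand out the rounding remainder so the counts sum exactly to replicas.
	for i := 0; given != replicas; i = (i + 1) % n {
		if weights[i] != 0 {
			out[i]++
			given++
		}
	}
	return out
}

func main() {
	// Three clusters with 100 units of capacity each, unevenly allocated.
	fmt.Println(distributeByAvailable(9, []int{100, 100, 100}, []int{40, 70, 90}))
}
```

&lt;p&gt;The cluster with the most idle capacity receives the most replicas, so load converges toward balance as the weights are refreshed.&lt;/p&gt;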

&lt;h3&gt;
  
  
  Refined Replicas Rescheduling
&lt;/h3&gt;

&lt;p&gt;KubeFed’s replica rescheduling algorithm usually results in less-than-ideal distributions for scaling operations. As an illustration, consider 30 replicas currently distributed to 3 member clusters A, B, and C with equal weights. If the workload is scaled down to 9 replicas, KubeFed has 2 possible behaviors depending on whether the user enables rebalance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If rebalance = false, KubeFed retains existing replicas, disregarding cluster weights.&lt;/li&gt;
&lt;li&gt;If rebalance = true, KubeFed disregards current distribution and rebalances replicas based on weights.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As seen above, KubeFed is unable to devise a distribution that satisfies fault tolerance and load balancing requirements without compromising service availability. To address this, KubeAdmiral developed a refined replica rescheduling algorithm that guarantees service availability and produces distributions that are as close to the optimal distribution as possible. The gist of the algorithm is to distribute the increment or decrement in replicas, instead of the total replicas.&lt;/p&gt;

&lt;p&gt;Using the same scenario of scaling down from 30 replicas to 9 replicas above, the refined algorithm roughly proceeds as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Current distribution = [15, 15, 0]; total current replicas: 30&lt;/li&gt;
&lt;li&gt;Desired distribution = [3, 3, 3]; total desired replicas: 9&lt;/li&gt;
&lt;li&gt;Distance = desired – current = [-12, -12, 3]; total distance: -21&lt;/li&gt;
&lt;li&gt;For scaling down, remove any positive distance terms; distance = [-12, -12, 0]&lt;/li&gt;
&lt;li&gt;Distribute the total distance -21 using the distance vector [-12, -12, 0] as weights; adjustments = [-10, -11, 0]&lt;/li&gt;
&lt;li&gt;Final distribution = current + adjustments = [15, 15, 0] + [-10, -11, 0] = [5, 4, 0]&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;th&gt;A&lt;/th&gt;
&lt;th&gt;B&lt;/th&gt;
&lt;th&gt;C&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Weight&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replicas before scaling down&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change&lt;/td&gt;
&lt;td&gt;-10&lt;/td&gt;
&lt;td&gt;-11&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replicas after scaling down&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
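&lt;p&gt;The six steps above can be sketched in a few lines of Go. This is a simplified illustration: the truncating division and the tie-breaking order for the leftover decrement are our own assumptions, chosen so the example reproduces the [5, 4, 0] distribution, and may differ from KubeAdmiral’s real algorithm:&lt;/p&gt;

```go
package main

import "fmt"

// rescheduleScaleDown sketches the refined scale-down algorithm: distribute
// the decrement in replicas rather than the total. The truncating division
// and the tie-breaking order for the leftover decrement are illustrative
// choices, not necessarily KubeAdmiral's exact behavior.
func rescheduleScaleDown(current, desired []int) []int {
	n := len(current)
	dist := make([]int, n)
	totalDist := 0
	for i := range current {
		dist[i] = desired[i] - current[i]
		totalDist += dist[i]
	}
	// Step 4: for scaling down, drop positive distance terms.
	weightSum := 0
	for i := range dist {
		dist[i] = min(dist[i], 0)
		weightSum += dist[i]
	}
	// Step 5: distribute the total distance using the distances as weights.
	adj := make([]int, n)
	distributed := 0
	for i := range dist {
		adj[i] = totalDist * dist[i] / weightSum // truncates toward zero
		distributed += adj[i]
	}
	// Hand out the leftover decrement one replica at a time.
	i := n - 1
	for distributed != totalDist {
		if dist[i] != 0 {
			adj[i]--
			distributed--
		}
		i = (i - 1 + n) % n
	}
	// Step 6: final distribution is current plus adjustments.
	out := make([]int, n)
	for j := range current {
		out[j] = current[j] + adj[j]
	}
	return out
}

func main() {
	// Steps 1-2: current [15, 15, 0], desired [3, 3, 3].
	fmt.Println(rescheduleScaleDown([]int{15, 15, 0}, []int{3, 3, 3})) // [5 4 0]
}
```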

&lt;h3&gt;
  
  
  Support for Native Kubernetes Resource API
&lt;/h3&gt;

&lt;p&gt;Unlike KubeFed, which requires users to use an incompatible “federated” API, KubeAdmiral caters to the usage habits of single-cluster Kubernetes users by providing support for native Kubernetes APIs. After the user creates a native resource (such as Deployment), KubeAdmiral’s federate-controller automatically converts it into an internal object for use by downstream KubeAdmiral controllers. This enables users to quickly transition from a single-cluster to a multi-cluster architecture with low onboarding cost.&lt;/p&gt;

&lt;p&gt;However, KubeAdmiral doesn’t stop there. In a single cluster, Kubernetes controllers update the status of resources to reflect their current state. For example, a Deployment’s status reflects its rollout progress and the number of replicas it currently has. Users or upper-layer systems often rely on such status. In a multi-cluster environment, the status is populated on the individual Deployments propagated to member clusters. Users must check the status of resources in each cluster individually, leading to a fragmented perspective and reduced operational efficiency.&lt;/p&gt;

&lt;p&gt;To solve this problem and seamlessly support native resources, KubeAdmiral introduces status aggregation. The KubeAdmiral status-aggregator collects and aggregates the status of individual resources from member clusters and writes it back to the native resources. This allows users to observe the global resource status at a glance.&lt;/p&gt;
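&lt;p&gt;A toy Go sketch of the aggregation idea: per-cluster statuses are summed into one global view. The struct and field names below are illustrative, not KubeAdmiral’s actual types:&lt;/p&gt;

```go
package main

import "fmt"

// deploymentStatus is a hypothetical, simplified stand-in for the status
// fields a Deployment reports in each member cluster.
type deploymentStatus struct {
	Replicas      int
	ReadyReplicas int
}

// aggregate sums member-cluster statuses into one global status, the kind
// of view the status-aggregator writes back to the native resource.
func aggregate(statuses map[string]deploymentStatus) deploymentStatus {
	var out deploymentStatus
	for _, s := range statuses {
		out.Replicas += s.Replicas
		out.ReadyReplicas += s.ReadyReplicas
	}
	return out
}

func main() {
	global := aggregate(map[string]deploymentStatus{
		"cluster-a": {Replicas: 5, ReadyReplicas: 5},
		"cluster-b": {Replicas: 4, ReadyReplicas: 3},
	})
	fmt.Println(global.Replicas, global.ReadyReplicas) // 9 8
}
```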

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;KubeAdmiral has been incubating within ByteDance for a while, and has been an integral part of ByteDance’s internal PaaS platform TCE. Battle-tested by large-scale applications, it has accumulated many valuable practical experiences. To give back to the community, KubeAdmiral has officially been open-sourced on GitHub.&lt;/p&gt;

&lt;p&gt;Looking forward, we plan to continue working on KubeAdmiral, especially in the following areas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continue to improve the orchestration and scheduling capabilities of stateful and job-like workloads, and develop advanced capabilities such as automatic migration and cost-based scheduling to embrace the new era of multi-cloud batch computing.&lt;/li&gt;
&lt;li&gt;Improve user experience and further reduce users’ cognitive burden, striving for a pleasant out-of-the-box experience.&lt;/li&gt;
&lt;li&gt;Improve observability, optimize logging and metrics, and enhance the scheduler’s explainability.&lt;/li&gt;
&lt;li&gt;Explore features such as one-click migration from a single cluster, further smoothing the onboarding experience.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multi-cluster orchestration and scheduling is not a simple topic. We hope our experience and solutions are useful to the community. We look forward to more friends joining the KubeAdmiral community, and welcome everyone to try KubeAdmiral and give us suggestions!&lt;/p&gt;

&lt;p&gt;GitHub repo: &lt;a href="https://github.com/kubewharf/kubeadmiral"&gt;https://github.com/kubewharf/kubeadmiral&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cloudnative</category>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>go</category>
    </item>
    <item>
      <title>Gödel Scheduler open-sourced: a unified scheduler for online and offline workloads</title>
      <dc:creator>XIAOXU CHANG </dc:creator>
      <pubDate>Wed, 03 Apr 2024 09:16:53 +0000</pubDate>
      <link>https://dev.to/bytedanceoss/godel-scheduler-open-sourced-a-unified-scheduler-for-online-and-offline-workloads-4a8i</link>
      <guid>https://dev.to/bytedanceoss/godel-scheduler-open-sourced-a-unified-scheduler-for-online-and-offline-workloads-4a8i</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Since its open-source release in 2014, Kubernetes has rapidly become the de facto standard for container orchestration. The infrastructure team at ByteDance adopted Kubernetes early on to build our private cloud platform. Over the years, ByteDance’s rapid growth across various business lines, including microservices, recommendation/advertising/search services, machine learning &amp;amp; big data, and storage, has led to a substantial increase in the demand for computing resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ty5pkzccf8jyhtwlko3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ty5pkzccf8jyhtwlko3.png" alt="business lines" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Initially, ByteDance managed its online and offline workloads with separate resource pools, each dedicated to distinct business segments. To accommodate the surge in online business demands during significant holidays and major events, the infrastructure team usually needed to plan ahead by reallocating resources from offline to online pools to bolster the capacity for handling increased online activities. While this temporary fix satisfied immediate requirements, the inter-pool resource borrowing process proved to be time-consuming, operationally heavy, and inefficient. &lt;/p&gt;

&lt;p&gt;Furthermore, maintaining separate resource pools for online and offline workloads resulted in significant colocation costs, leaving little scope for enhancing resource utilization. &lt;/p&gt;

&lt;p&gt;Therefore, the infrastructure team sought to implement a unified system for scheduling and managing both online and offline workloads. This initiative aimed to facilitate resource pooling, enhance resource utilization and elasticity, optimize costs and user experiences, and alleviate operational burdens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practice of Unified Scheduling
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Enhancement beyond Kubernetes Default Scheduler:
&lt;/h3&gt;

&lt;p&gt;Since adopting Kubernetes extensively in 2018, ByteDance has continuously optimized various components of Kubernetes for functionality and performance. However, with the containerization of recommendation/advertising/search services in 2019, the native Kubernetes scheduler fell increasingly short of ByteDance’s business requirements in terms of both functionality and performance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In terms of functionality&lt;/strong&gt;, more granular resource scheduling capabilities and flexible preemption strategies were required. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In terms of performance&lt;/strong&gt;, the native Kubernetes default scheduler could only achieve a scheduling throughput of around 10 pods per second in a cluster of 5000 nodes, often causing business upgrades to be bottlenecked, far from meeting the requirements. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, the team introduced a number of key optimizations to the Kubernetes default scheduler, including:&lt;/p&gt;

&lt;h4&gt;
  
  
  Functionality:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Extended the scheduling capabilities to support non-native resources, such as memory bandwidth and network bandwidth.&lt;/li&gt;
&lt;li&gt;Added support for micro-topology scheduling.&lt;/li&gt;
&lt;li&gt;Refactored preemption implementation by providing a pluggable preemption framework to support extending preemption capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Performance:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Optimized the data synchronization mechanism between the Scheduler cache and Snapshot by refactoring data structures and strengthening incremental updates between snapshots.&lt;/li&gt;
&lt;li&gt;Cached scheduling results for homogeneous scheduling units to reduce redundant calculations and improve efficiency.&lt;/li&gt;
&lt;li&gt;Optimized the preemption implementation by reorganizing preemption-related data structures, applying pruning in a timely manner, and reducing unnecessary computation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By implementing the aforementioned optimizations, we successfully enhanced our containerization capabilities to meet ByteDance’s rapidly expanding needs. This led to a remarkable 30-fold increase in scheduling throughput. That is, in a cluster comprising 10,000 nodes, we consistently achieved a scheduling rate of 300 pods per second.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gödel Scheduler
&lt;/h2&gt;

&lt;p&gt;In 2020, ByteDance initiated a unified scheduling and resource management project for both online and offline business operations. The objective was to enhance overall resource utilization, improve operational efficiency, and reduce maintenance overheads. Initially, the plan involved managing both online and offline tasks through a singular scheduling system. However, this approach presented challenges, primarily due to the intricate nature of offline scheduling, which differed markedly from online processes, especially in its demand for high throughput.&lt;/p&gt;

&lt;p&gt;The native Kubernetes scheduler, primarily designed for Pod-level scheduling, was somewhat limited in its support for more complex “Job” scheduling semantics and encountered performance limitations when dealing with these higher-level demands. To effectively address these unique requirements and to better align with the diverse operational needs of ByteDance, the decision was made to develop a bespoke, in-house distributed scheduler. This led to the creation of the Gödel Scheduler, specifically tailored to integrate with the Kubernetes system and to handle the demanding and varied scheduling needs of ByteDance’s expansive and evolving business landscape.&lt;/p&gt;

&lt;p&gt;The Gödel Scheduler is a distributed system crafted to consolidate the scheduling of both online and offline workloads. This scheduler is an enhancement of the Kubernetes (K8s) Scheduler, designed to augment scalability and improve scheduling quality. It is adept at fulfilling the diverse functional and performance demands of ByteDance’s online and offline operations. Key features of the Gödel Scheduler include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Optimistic Concurrency&lt;/strong&gt;: It incorporates optimistic concurrency concepts, moving the most time-consuming unit-to-node matching (filtering and scoring) to the scheduler component. This allows for concurrent execution and improves scheduling throughput in large-scale clusters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two-Layer Scheduling Abstraction (Unit and Pod) and Framework&lt;/strong&gt;: This provides more flexible batch scheduling capabilities, better supporting offline operations while also improving scheduling throughput and system scalability. The extended framework handles special scenarios more effectively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich Functionality and High Performance&lt;/strong&gt;: It meets the demands of various operations including online, offline (batch and stream), and training tasks, achieving true unified scheduling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility with the Kubernetes Ecosystem&lt;/strong&gt;: It can serve as an alternative to the K8s Scheduler, but due to performance and architectural optimizations, its framework interface is not entirely the same as the K8s Scheduler. However, its extensibility remains unaffected, and it can implement scheduling plugins like Kubernetes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below is the architecture diagram of Gödel Scheduler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklw95ffa3hhelotkwss5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklw95ffa3hhelotkwss5.png" alt="architecture diagram" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As outlined, the Gödel Scheduler consists of three primary components: the Dispatcher, the Scheduler, and the Binder. Key to its architecture is the Scheduler component, which is typically deployed in multiple shards to facilitate optimistic concurrency scheduling. This multi-shard deployment enhances its efficiency and scalability. On the other hand, the Dispatcher and the Binder are each deployed as single instances, a configuration that suits their specific roles and responsibilities within the Gödel Scheduler system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dispatcher
&lt;/h3&gt;

&lt;p&gt;The Dispatcher plays a pivotal role in managing application queuing, distribution, and node partitioning. It is comprised of several key components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sort Policy Manager&lt;/strong&gt;: This module handles the queuing of applications. Currently, it implements FIFO and DRF/FairShare queuing strategies, the latter still pending production use. Future enhancements will introduce more sophisticated queuing strategies, including those based on priority values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dispatching Policy Manager&lt;/strong&gt;: Its primary function is to allocate applications across various Scheduler instances. At present, the LoadBalance strategy is employed as the default. Future updates aim to make this feature more versatile and plugin-based.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node Shuffler&lt;/strong&gt;: This component is tasked with organizing cluster nodes relative to the number of Scheduler instances. It assigns each node to a specific node partition, with each Scheduler instance overseeing one partition. During the scheduling process, a Scheduler first considers nodes within its partition before exploring nodes in other partitions. This arrangement is dynamically adjusted in response to changes in node availability or Scheduler count.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition Rules&lt;/strong&gt;: Currently, the system strives for an even distribution of nodes among Scheduler instances. Plans are underway to expand these partition strategies, enhancing their configurability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler Maintainer&lt;/strong&gt;: This element is responsible for monitoring the status of Scheduler instances. It tracks aspects like health status, workload, and the node count within each partition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reconciler&lt;/strong&gt;: Operating periodically, the Reconciler oversees the status of various elements like Pods, Nodes, Schedulers, and SchedulingUnits. It addresses any errors, discrepancies, or deficiencies, ensuring system integrity and performance. &lt;/li&gt;
&lt;/ol&gt;
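&lt;p&gt;The Node Shuffler’s even partitioning can be pictured with a toy Go sketch. The function name and the round-robin strategy are illustrative assumptions, not Gödel’s actual code:&lt;/p&gt;

```go
package main

import "fmt"

// partitionNodes assigns nodes round-robin so each Scheduler shard owns a
// near-equal partition; a real implementation would also rebalance as nodes
// or shards come and go.
func partitionNodes(nodes []string, shards int) [][]string {
	parts := make([][]string, shards)
	for i, node := range nodes {
		s := i % shards
		parts[s] = append(parts[s], node)
	}
	return parts
}

func main() {
	nodes := []string{"n1", "n2", "n3", "n4", "n5"}
	for s, p := range partitionNodes(nodes, 2) {
		fmt.Println("shard", s, "owns", p)
	}
}
```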

&lt;h3&gt;
  
  
  Scheduler
&lt;/h3&gt;

&lt;p&gt;The Scheduler plays a critical role in the decision-making process for scheduling and preempting applications, although it does not execute these decisions itself (that task is handled by the Binder). It operates on a two-tier framework: the Unit Scheduling Framework and the Pod Scheduling Framework. The entire scheduling procedure is segmented into three principal phases: Node Organizing, Unit Scheduling, and Unit Preempting.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Node Organizing&lt;/strong&gt;: This phase involves filtering and sorting nodes to streamline the scheduling process and enhance its quality. It consists of two types of plugins:

&lt;ul&gt;
&lt;li&gt;Locating Plugins: These filter nodes based on specific application information.&lt;/li&gt;
&lt;li&gt;Grouping Plugins: These group nodes according to available resources or Job-level affinities.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unit Scheduling&lt;/strong&gt;: In this stage, nodes are matched and scored in alignment with application requests that have been filtered through the Node Organizing plugins. This process is analogous to the Kubernetes (K8s) Scheduler framework, encompassing:

&lt;ul&gt;
&lt;li&gt;Filtering Plugins: These filter nodes based on the requirements of the application requests.&lt;/li&gt;
&lt;li&gt;Scoring Plugins: These assign scores to nodes that have been filtered in the previous step.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unit Preempting&lt;/strong&gt;: If suitable nodes are not found during the Unit Scheduling phase, the Scheduler progresses to the preemption phase. Here, it tries to free up resources by preempting running application instances to make room for new ones. This phase includes:

&lt;ul&gt;
&lt;li&gt;Victims Searching: Identifying potential victim applications that can be preempted.&lt;/li&gt;
&lt;li&gt;Candidates Sorting: Sorting both nodes and potential victims to identify the most appropriate choices for preemption.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
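&lt;p&gt;To illustrate the filter-then-score flow, here is a hypothetical Go sketch of a pluggable framework in the same spirit. The interfaces and plugins below are our own simplifications and do not mirror Gödel’s real framework API:&lt;/p&gt;

```go
package main

import "fmt"

// FilterPlugin and ScorePlugin mimic the two Unit Scheduling plugin kinds.
type FilterPlugin interface {
	Filter(node string) bool
}

type ScorePlugin interface {
	Score(node string) int
}

// schedule applies the filtering step, then keeps the highest-scoring node.
func schedule(nodes []string, f FilterPlugin, s ScorePlugin) (string, bool) {
	best, bestScore, found := "", 0, false
	for _, n := range nodes {
		if !f.Filter(n) {
			continue // node rejected by the filtering step
		}
		sc := s.Score(n)
		if !found {
			best, bestScore, found = n, sc, true
		} else if bestScore != max(sc, bestScore) {
			// sc beats the current best score.
			best, bestScore = n, sc
		}
	}
	return best, found
}

// minFreeCPU passes nodes whose free CPU covers the request.
type minFreeCPU struct {
	need int
	free map[string]int
}

func (p minFreeCPU) Filter(node string) bool {
	return p.free[node] == max(p.free[node], p.need)
}

// mostFree scores nodes by their free CPU, preferring emptier nodes.
type mostFree struct {
	free map[string]int
}

func (p mostFree) Score(node string) int { return p.free[node] }

func main() {
	free := map[string]int{"n1": 2, "n2": 8, "n3": 4}
	node, ok := schedule([]string{"n1", "n2", "n3"},
		minFreeCPU{need: 4, free: free}, mostFree{free: free})
	fmt.Println(node, ok) // n2 true
}
```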

&lt;h3&gt;
  
  
  Binder
&lt;/h3&gt;

&lt;p&gt;The Binder plays a crucial role in the final stages of the scheduling process, focusing on conflict detection, preemptive operations, and executing the binding of applications to resources. It comprises three main components: ConflictResolver, PreemptionOperator, and UnitBinder.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ConflictResolver&lt;/strong&gt;: This component is tasked with detecting concurrent conflicts in the scheduling process. It operates in two modes:

&lt;ul&gt;
&lt;li&gt;Cross Node Conflict Resolver: Checks for conflicts that might occur across different nodes.&lt;/li&gt;
&lt;li&gt;Single Node Conflict Resolver: Identifies conflicts within a single node.&lt;/li&gt;
&lt;/ul&gt;
If any conflict is detected, the application is immediately rejected and rescheduled.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PreemptionOperator&lt;/strong&gt;: In scenarios where no conflict exists but preemption is necessary, this operator takes charge. It executes the preemption by deleting the victims (applications or processes that need to be terminated to free up resources) and then awaits the final scheduling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UnitBinder&lt;/strong&gt;: This part of the Binder is responsible for the preparatory work required before binding, such as dynamically creating storage volumes, and then carries out the actual binding operation, linking applications to the designated resources.&lt;/li&gt;
&lt;/ol&gt;
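&lt;p&gt;A hypothetical Go sketch of the admission flow these three components describe: resolve conflicts, preempt victims if needed, then bind. All names, and the assumption that each victim frees one request’s worth of CPU, are illustrative, not Gödel’s real types:&lt;/p&gt;

```go
package main

import "fmt"

// decision is a toy stand-in for a scheduling decision handed to the Binder.
type decision struct {
	Pod     string
	Node    string
	Victims []string
}

// admit preempts victims first, then checks for a concurrent conflict on the
// target node before binding; on conflict the pod is rejected for rescheduling.
func admit(freeCPU map[string]int, need int, d decision) string {
	// Preemption: victims release their resources on the target node
	// (assumed here to be one request's worth each).
	for range d.Victims {
		freeCPU[d.Node] += need
	}
	// Conflict detection: another shard may have taken the capacity already.
	if freeCPU[d.Node] != max(freeCPU[d.Node], need) {
		return "rejected" // conflict: reschedule the pod
	}
	freeCPU[d.Node] -= need
	return "bound"
}

func main() {
	free := map[string]int{"n1": 0}
	fmt.Println(admit(free, 4, decision{Pod: "p1", Node: "n1", Victims: []string{"v1"}}))
}
```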

&lt;p&gt;Notably, the current version of the Binder integrates a PodGroup controller. This controller is responsible for managing the state and lifecycle of PodGroups. However, in a future version we plan to remove this functionality from the Binder, transitioning it into an independent controller.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experience
&lt;/h2&gt;

&lt;p&gt;Over the past two years, the Gödel Scheduler has been a cornerstone within ByteDance, offering a wealth of scheduling features and semantics. It has efficiently and reliably supported the operations of ByteDance’s diverse and complex business workloads.&lt;/p&gt;

&lt;p&gt;Building on these architectural enhancements, ByteDance has implemented deep performance optimizations drawing from its experience with the Kubernetes native scheduler. Integrated with ByteDance’s internally refined Kubernetes system, the Gödel Scheduler now sustains an impressive throughput: 2000+ pods/s in a single shard and 5000+ pods/s across multiple shards. ByteDance’s ongoing efforts to expand single-cluster capacity have culminated in its largest production cluster exceeding 20,000 nodes and 1,000,000 pods.&lt;/p&gt;

&lt;p&gt;After years of thorough internal practice and refinement within ByteDance, Gödel Scheduler has reached a state of relative stability. In 2023, SoCC, a top-tier cloud computing conference, accepted our paper on Gödel Scheduler, highlighting ByteDance’s unified approach to large-scale resource management and scheduling. The R&amp;amp;D team was also invited to present the work at the conference. For those interested, the Gödel Scheduler paper is available at &lt;a href="https://dl.acm.org/doi/10.1145/3620678.3624663"&gt;https://dl.acm.org/doi/10.1145/3620678.3624663&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With a commitment to contributing to the open-source community, the ByteDance team decided to open-source the Gödel Scheduler, offering a new scheduling solution that enhances cloud-native experiences for both online and offline services through its outstanding performance and comprehensive scheduling capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Work
&lt;/h2&gt;

&lt;p&gt;Looking ahead, ByteDance is committed to the continual development of the Gödel Scheduler, focusing on enriching its features and enhancing its scalability. A significant area of attention will be optimizing the scheduling throughput in specific challenging scenarios, such as those involving high rates of deployment and frequent preemptions. Through innovative rescheduling strategies, ByteDance aims to tackle the intricate balance between maintaining scheduling performance and enhancing its quality. The overarching goal is to not only preserve the current scheduling throughput but also to substantially elevate the quality of scheduling.&lt;/p&gt;

&lt;p&gt;Moreover, ByteDance places a high priority on ecosystem development. Efforts will be made to ensure Gödel Scheduler’s compatibility with leading systems and frameworks used in various business applications. This initiative will include integration with prominent big data and machine learning frameworks, accompanied by practical usage examples and comprehensive documentation.&lt;/p&gt;

&lt;p&gt;To keep the community engaged and informed, a detailed roadmap for the Gödel Scheduler will be methodically laid out and made available on the Gödel Scheduler Repository. This will provide an opportunity for interested parties to track progress, contribute, and become active participants in the project.&lt;/p&gt;

&lt;p&gt;While the Gödel Scheduler has undergone numerous iterations within ByteDance, been rigorously tested in various scenarios, and demonstrated its effectiveness, ByteDance acknowledges that there is still considerable potential for advancement in terms of generality and standardization. ByteDance warmly invites and encourages members of the community to join in the development of the Gödel Scheduler, believing that collaborative efforts will lead to even greater improvements and innovations.&lt;/p&gt;

&lt;p&gt;Gödel Scheduler Project Repository: &lt;a href="https://github.com/kubewharf/godel-scheduler"&gt;https://github.com/kubewharf/godel-scheduler&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>cloud</category>
      <category>showdev</category>
      <category>go</category>
    </item>
    <item>
      <title>Katalyst: A QoS-based resource management system for workload colocation on Kubernetes</title>
      <dc:creator>XIAOXU CHANG </dc:creator>
      <pubDate>Mon, 01 Apr 2024 11:01:26 +0000</pubDate>
      <link>https://dev.to/bytedanceoss/katalyst-a-qos-based-resource-management-system-for-workload-colocation-on-kubernetes-5g2j</link>
      <guid>https://dev.to/bytedanceoss/katalyst-a-qos-based-resource-management-system-for-workload-colocation-on-kubernetes-5g2j</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This Blog originally published on &lt;a href="https://gokatalyst.io/blog/2023/12/06/katalyst-a-qos-based-resource-management-system-for-workload-colocation-on-kubernetes/"&gt;Katalyst’s blog&lt;/a&gt; by Pengcheng Tang&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The resource usage of web applications tends to fluctuate with changes in the number of visitors, displaying noticeable tidal characteristics. To ensure stability, service providers often allocate resources for their applications according to resource usage during peak periods. These resources can easily be underutilized during off-peak hours.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d2j9gczww5g8h5bqyc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d2j9gczww5g8h5bqyc7.png" alt="resource-provision" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If these idle resources can be reclaimed and temporarily allocated to lower-priority services and returned promptly to online services when needed, the overall resource utilization rate can be significantly improved.&lt;/p&gt;
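&lt;p&gt;In its simplest form, the reclaimable amount can be estimated as the allocation minus the predicted peak usage minus a safety buffer. The Go sketch below is a toy illustration of that idea, not Katalyst’s actual estimation logic:&lt;/p&gt;

```go
package main

import "fmt"

// reclaimable estimates how much of an allocation can be lent to low-priority
// workloads while staying above predicted peak usage plus a safety buffer.
// The inputs (predicted peak, buffer) are illustrative assumptions.
func reclaimable(allocated, predictedPeak, buffer int) int {
	return max(allocated-predictedPeak-buffer, 0)
}

func main() {
	// A service holding 100 cores, predicted to peak at 60, keeping 10 spare.
	fmt.Println(reclaimable(100, 60, 10)) // 30
}
```

&lt;p&gt;The reclaimed amount shrinks to zero as predicted usage approaches the allocation, which is how resources return to the online service when it needs them.&lt;/p&gt;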

&lt;h2&gt;
  
  
  ByteDance colocation practices
&lt;/h2&gt;

&lt;p&gt;ByteDance operates at a massive scale with diverse business types, encompassing various categories such as microservices, advertising, machine learning, big data, and storage. Typically, different business types have distinct resource management requirements at the infrastructure level. The conventional approach involves segmenting resource pools based on business lines or service types to meet customized demands.&lt;/p&gt;

&lt;p&gt;However, this method of resource pool segmentation can lead to resource silos, preventing flexible resource sharing and hindering the overall efficiency of resource utilization and cost optimization. It also adds to the operational burden of managing clusters.&lt;/p&gt;

&lt;p&gt;Furthermore, considering that different types of businesses have complementary SLO requirements and resource usage patterns, the infrastructure team aims to leverage these characteristics fully. They do so through scheduling and control mechanisms to optimize resource efficiency, achieve the convergence and consolidation of resource pools, and assist business teams in attaining lower resource costs and greater elasticity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqndbwv05dmn3bc43pr0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqndbwv05dmn3bc43pr0.png" alt="Types of workloads" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To achieve unified resource management, ByteDance began building a unified infrastructure based on Kubernetes in 2016. At the current stage, ByteDance has essentially completed the containerization of all microservices, advertising, and a significant portion of machine learning and big data businesses. Throughout this process, the infrastructure team has continued to explore resource optimization methods under a unified resource pool and gradually developed a resource pool deployment approach that combines ‘elastic scaling’ and ‘colocation.’&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Elastic Scaling&lt;/strong&gt;: This enables machine-level and NUMA-level resource time-sharing, combining business and system metrics to guide horizontal and vertical scaling strategies for application instances. This ultimately allows offline services to purchase more idle resources at a lower cost, and online services to purchase more peak-time resources at a higher cost through resource market-oriented operations, leading to overall efficiency improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Colocation&lt;/strong&gt;: It offers the ability to oversell resources, making full use of ‘sold but underutilized resources’ in the cluster to deploy more low-priority tasks. Simultaneously, we enhance resource isolation mechanisms across multiple dimensions such as CPU, memory, disk, and network at the system level. Minute-level control mechanisms, combined with intelligent load prediction algorithms, are adopted to ensure service stability according to their SLOs.&lt;/p&gt;
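&lt;p&gt;At its core, overselling is arithmetic on ‘sold but underutilized’ capacity. The sketch below (our illustration, not Katalyst code) computes how much CPU on one node could be re-offered to low-priority tasks:&lt;/p&gt;

```python
def reclaimable_cpu(node_allocatable, online_requests, predicted_online_usage,
                    safety_buffer=0.1):
    """CPU cores requested by online pods but predicted to sit idle,
    minus a safety buffer (a fraction of node capacity) to absorb bursts."""
    idle_sold = max(0.0, online_requests - predicted_online_usage)
    return max(0.0, idle_sold - safety_buffer * node_allocatable)
```

&lt;p&gt;On a 32-core node where online pods request 24 cores but are predicted to use only 10, roughly 10.8 cores could back low-priority tasks. Prediction quality and the isolation mechanisms described above are what make this safe.&lt;/p&gt;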

&lt;p&gt;This solution combines Kubernetes and Yarn systems for joint control. It runs control components of both Kubernetes and Yarn on the same machine, and coordinates the allocatable resources between the two systems through a central coordination component. On top of this joint control system, we achieve real-time resource estimation based on service resource profiles, ensuring more flexible and dynamic resource allocation while meeting various service SLA requirements.&lt;/p&gt;
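&lt;p&gt;A toy version of that coordination step might look as follows; the field names and the buffer policy are illustrative assumptions, not the actual coordination component:&lt;/p&gt;

```python
def split_allocatable(total_cores, predicted_online_peak, buffer_cores=2.0):
    """Reserve the predicted online peak (plus a buffer) for the Kubernetes
    side of a node; hand whatever remains to Yarn for offline tasks."""
    k8s_share = min(total_cores, predicted_online_peak + buffer_cores)
    yarn_share = max(0.0, total_cores - k8s_share)
    return {"kubernetes": k8s_share, "yarn": yarn_share}
```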

&lt;p&gt;During the implementation of this colocation solution, the infrastructure team verified the feasibility of resource pooling, constructed the foundational capabilities for colocation, and improved the overall utilization of core clusters from 23% to 60%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febxl6rwvfe1g4kb2oobb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febxl6rwvfe1g4kb2oobb.png" alt="Colocation Practices" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Katalyst: From Internal Validation to Open Source
&lt;/h2&gt;

&lt;p&gt;After extensive testing with businesses like Douyin and Toutiao, which have large-scale tidal traffic, ByteDance’s cloud-native colocation practices have matured. To help end users in the cloud-native community understand the principles behind large-scale colocation and improve their own resource efficiency, we refactored and enhanced the resource management system in a Kubernetes-native manner and built “Katalyst”, which has now been officially open-sourced. The name “Katalyst” is derived from the word “catalyst,” and the ‘K’ symbolizes its ability to provide enhanced automation of resource management for all workloads running in the Kubernetes ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Is Katalyst
&lt;/h3&gt;

&lt;p&gt;Katalyst originated from ByteDance’s colocation practices and has been extended and supplemented in terms of resource management capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developed entirely within the context of hyperscale colocation practices, achieving true reuse of internal and external technology systems.&lt;/li&gt;
&lt;li&gt;Built on a plugin-based architecture, allowing users to customize modules such as scheduling and control strategies on top of the Katalyst framework.&lt;/li&gt;
&lt;li&gt;Provides one-click deployment templates and comprehensive operation manuals, reducing understanding and deployment costs for end users.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Resource Abstraction
&lt;/h3&gt;

&lt;p&gt;The native Kubernetes Quality of Service (QoS) system does not meet the requirements of large-scale production environments, prompting Katalyst to build a QoS system of its own. Katalyst defines four QoS classes: dedicated_cores, shared_cores, reclaimed_cores, and system_cores. Users assign a QoS class to each application according to its QoS requirements. In ByteDance’s practice, CPU is, in most scenarios, the dominant resource affecting application performance, and users tend to express QoS requirements in terms of CPU as well; so although a QoS requirement encompasses many kinds of resources (e.g., CPU, memory, disk I/O, network bandwidth), the classes are named after CPU. Each class is accompanied by various enhancement mechanisms (e.g., whether NUMA-node binding is required, or whether network affinity and bandwidth restrictions are necessary), enabling differentiated resource allocation and control strategies.&lt;/p&gt;
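&lt;p&gt;In code form, the four classes and their differentiated treatment can be sketched like this. The class names come from Katalyst; the per-class policy fields are our illustration, not the real annotation schema:&lt;/p&gt;

```python
from enum import Enum

class QoSClass(Enum):
    DEDICATED_CORES = "dedicated_cores"  # latency-critical, may bind NUMA nodes
    SHARED_CORES = "shared_cores"        # ordinary online services, shared pool
    RECLAIMED_CORES = "reclaimed_cores"  # best-effort jobs on oversold capacity
    SYSTEM_CORES = "system_cores"        # node-level system daemons

# Hypothetical enhancement knobs per class, for illustration only.
POLICY = {
    QoSClass.DEDICATED_CORES: {"numa_binding": True, "evictable": False},
    QoSClass.SHARED_CORES: {"numa_binding": False, "evictable": False},
    QoSClass.RECLAIMED_CORES: {"numa_binding": False, "evictable": True},
    QoSClass.SYSTEM_CORES: {"numa_binding": False, "evictable": False},
}
```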

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvtfio1fes2wo1k90isb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvtfio1fes2wo1k90isb.png" alt="QoS Classes" width="800" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Through this abstract resource model, Katalyst provides users with a unified resource entry point. Users can accurately express their specific needs by mapping business services to the appropriate QoS class and sales model based on business requirements. This ultimately allows users to obtain resources from a unified resource pool without needing to delve into the underlying details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51molk23jbzzr9yxtzmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51molk23jbzzr9yxtzmp.png" alt="Node View" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Overall Architecture
&lt;/h3&gt;

&lt;p&gt;In its early stages, the colocation architecture had several issues. Although the joint control of the Kubernetes and Yarn systems achieved colocation of online and offline businesses, the complexity of the combined system incurred higher maintenance costs. The architecture also introduced resource overhead from the agents running on each node: while the consumption on an individual node is not significant, the accumulated overhead in hyperscale clusters can be substantial. Moreover, with two control systems, an abnormality at any stage could lead to resource calculation errors. In Katalyst, we optimized and refactored the overall colocation architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F590i78slgt211o3p1ut2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F590i78slgt211o3p1ut2.jpg" alt="Overall Architecture" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the control layer, we consolidated the joint system built on Kubernetes and Yarn into a single Kubernetes-based system. Specifically, we retained the API entry points for both Kubernetes and Yarn at the access layer, while unifying metadata management and resource control inside Katalyst, which is Kubernetes-native.&lt;/p&gt;

&lt;p&gt;At the scheduling layer, Katalyst implemented a coordinated resource scheduling and control mechanism between “centralized scheduling” and “node resource management” based on unified metadata.&lt;/p&gt;

&lt;p&gt;On the node side: Katalyst extends Kubernetes with a module named QoS Resource Manager (QRM). This module enables plugin-based, node-level topology-affinity allocation and reports the topology to the control plane through custom CRDs to facilitate scheduling. At runtime, Katalyst continuously estimates resource allocations from system metrics, service-level indicators, and the QoS requirements of the pods. The allocation results are then dynamically pushed to the Container Runtime Interface (CRI) through the QRM reconcile loops. Both the resource estimation algorithm and the QRM implementation can be customized through plugins, making resource control strategies adaptable to different business scenarios.&lt;/p&gt;
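&lt;p&gt;The reconcile idea can be sketched in a few lines. This is a simplification under assumed pod and metric shapes, not the QRM implementation:&lt;/p&gt;

```python
def reconcile(pods, usage_metrics, apply_to_runtime):
    """One reconcile pass: re-estimate each pod's CPU allocation from live
    usage and push only the changed values down to the container runtime."""
    changes = {}
    for pod in pods:
        usage = usage_metrics[pod["name"]]
        if pod["qos"] == "reclaimed_cores":
            target = usage * 1.2  # best-effort: track usage with 20% headroom
        else:
            target = max(pod["request"], usage)  # never drop below the request
        if target != pod["allocated"]:
            changes[pod["name"]] = target
            apply_to_runtime(pod["name"], target)
    return changes
```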

&lt;p&gt;On the scheduling side: Katalyst extends the scheduler with richer capabilities through the scheduler framework. During scheduling, it takes into account how applications of different QoS classes should share and coordinate resources when running in the same cluster. The scheduler also combines real-time data and service profiles to perform dynamic rebalancing across the entire cluster, reducing cluster vacancy rates and enhancing business stability.&lt;/p&gt;
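&lt;p&gt;As a minimal illustration of QoS-aware placement (our toy scoring, not an actual scheduler plugin), reclaimed_cores pods can be steered toward nodes with the most reclaimable capacity, while online pods prefer the least-utilized nodes:&lt;/p&gt;

```python
def rank_nodes(nodes, pod_qos):
    """Return candidate nodes best-first for the given QoS class."""
    if pod_qos == "reclaimed_cores":
        def score(n): return n["reclaimable_cores"]  # chase idle capacity
    else:
        def score(n): return -n["utilization"]       # spread online load
    return sorted(nodes, key=score, reverse=True)
```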

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5mqo0fe70wmzmqui1yn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5mqo0fe70wmzmqui1yn.png" alt="QoS Resource Manager" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lastly, under a unified control system, we can fully leverage Kubernetes’ advantages in API design. By decoupling internal systems and generalizing control strategies through custom CRDs, we are able to iteratively improve the system through a plugin-based approach, achieving true convergence between internal and external systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;p&gt;Katalyst, as a resource management system, has colocation as one of its core business scenarios. In addition to abstracting the core concepts mentioned above, we have provided and planned various QoS capabilities for Katalyst:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-grained resource estimation strategies: Katalyst supports multiple resource estimation strategies including heuristics, unsupervised learning, and QoS-aware algorithms, improving resource utilization by accurately calculating and predicting the amount of resources that can be reclaimed from the nodes.&lt;/li&gt;
&lt;li&gt;Multi-dimensional resource isolation capabilities: Using technologies such as cgroup, RDT, iocost, tc, etc., Katalyst achieves effective isolation of various resources, including CPU, memory, disk, and network, in different colocation scenarios.&lt;/li&gt;
&lt;li&gt;Multi-level load eviction strategies: Katalyst supports multi-level eviction strategies based on various metrics, ensuring online business QoS while maximizing offline business QoS.&lt;/li&gt;
&lt;/ul&gt;
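&lt;p&gt;The ordering logic behind such multi-level eviction can be sketched as follows: lowest QoS class first, largest consumer first within a class. The pressure model here is an assumption for illustration:&lt;/p&gt;

```python
def pick_evictions(pods, pressure):
    """Evict lowest-QoS pods first (largest consumers within a class) until
    the projected pressure ratio falls back to 1.0 or below."""
    rank = {"reclaimed_cores": 0, "shared_cores": 1, "dedicated_cores": 2}
    candidates = sorted(pods, key=lambda p: (rank[p["qos"]], -p["usage_share"]))
    evicted = []
    for pod in candidates:
        if pressure <= 1.0:
            break
        pressure -= pod["usage_share"]
        evicted.append(pod["name"])
    return evicted
```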

&lt;p&gt;Besides colocation, Katalyst also provides enhanced resource management capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recommendation and autoscaling: Katalyst provides enhanced VPA/HPA capabilities and advanced recommendation algorithms. These help end users estimate pod resource requests/limits and replica counts more accurately, thereby improving deployment rates and resource utilization.&lt;/li&gt;
&lt;li&gt;Tidal (exclusive-mode) colocation: While colocating online and offline applications on the same node yields the largest resource-efficiency gains, it requires all the infrastructure intricacies (e.g., resource isolation and scheduling) to work smoothly, which complicates the overall system. Katalyst therefore also provides an exclusive-mode colocation in which resources are reclaimed at node granularity, so that a node runs either online or offline applications at any given time. This lets users improve resource efficiency at a lower operational cost.&lt;/li&gt;
&lt;li&gt;Node overcommitment: With node overcommitment, Katalyst allows the scheduler to place more pods on a node without end users’ awareness. Meanwhile, Katalyst adopts methods such as interference detection and mitigation and node resource prediction algorithms to guarantee the QoS requirements of higher-priority tasks.&lt;/li&gt;
&lt;li&gt;…&lt;/li&gt;
&lt;/ul&gt;
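&lt;p&gt;The effect of node overcommitment can be illustrated with a toy rule. The watermark value and the fallback behavior are our assumptions, not Katalyst's algorithm:&lt;/p&gt;

```python
def advertised_cpu(capacity_cores, overcommit_ratio, utilization,
                   high_watermark=0.65):
    """Advertise capacity * ratio to the scheduler, but fall back to the real
    capacity once measured utilization crosses the watermark, standing in for
    interference detection and mitigation."""
    if utilization >= high_watermark:
        return float(capacity_cores)
    return capacity_cores * overcommit_ratio
```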

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbg5mvdfedwbf8nbk32bz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbg5mvdfedwbf8nbk32bz.png" alt="Resource Efficiency Suite" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For detailed plans, please refer to the roadmap. We also recently gave a brief introduction to Katalyst at KubeCon China; see &lt;a href="https://www.bilibili.com/video/BV1bc411R7xQ/?spm_id_from=333.999.0.0&amp;amp;vd_source=c09f0713b2507369924e94f4fec6c133"&gt;our talk&lt;/a&gt; for more information.&lt;/p&gt;

&lt;p&gt;While colocation has undergone several iterations within ByteDance, a universal, standardized platform foundation must be refined across many scenarios. We look forward to your participation in the Katalyst community and to hearing about your scenarios and requirements for colocation, resource-efficiency improvement, and more.&lt;/p&gt;

&lt;p&gt;GitHub | &lt;a href="https://github.com/kubewharf/katalyst-core"&gt;https://github.com/kubewharf/katalyst-core&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>opensource</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
