DEV Community: ByteDance Open Source

Inside AIO Sandbox (Part 3): Integrating AIO Sandbox with NVIDIA OpenShell

XIAOXU CHANG — Wed, 29 Apr 2026 13:09:47 +0000

By AIO Sandbox team

Beyond SDKs: Securing the AIO Sandbox with NVIDIA OpenShell

This is the third part of a series of posts diving into AIO Sandbox that provides an isolated, programmable environment where agents can safely execute. Please see previous posts here: Part 1 & Part 2.

As AI agents become more capable, the need for secure, tool-rich execution environments is growing rapidly. Agents increasingly require access to browsers, shells, filesystems, development tools, and external services—but granting those capabilities directly on a host system creates obvious operational and security risks.

This is where sandbox runtimes become essential. Recently, we explored integrating AIO Sandbox with NVIDIA OpenShell, and found that the two systems complement each other extremely well. By leveraging OpenShell’s Bring-Your-Own-Container (BYOC) model, we were able to run AIO Sandbox as the underlying workspace environment while benefiting from OpenShell’s policy controls, isolation model, and gateway architecture.

The result is a powerful combination:

AIO Sandbox provides rich built-in agent capabilities such as browser automation, shell access, file operations, VS Code Server, and MCP integration.
OpenShell provides the secure outer runtime boundary: sandbox lifecycle management, policy enforcement, controlled networking, and operational governance.

This post walks through the architecture, why the combination makes sense, and how BYOC enables a clean integration path.

Why Combine AIO Sandbox and OpenShell?

Each platform solves a different part of the agent infrastructure stack.

AIO Sandbox: Rich Agent Workspace

AIO Sandbox is designed as an all-in-one environment for agents. The sandboxing primitives are exposed as a library with multi-language bindings, so it can be embedded directly into your agent framework.

Typically, an agent might need a browser to fetch data, a Python interpreter to analyze it, and a filesystem to store the results. Managing these as separate services introduces latency and synchronization complexity. AIO sandbox consolidates these requirements into a single containerized environment. It packages commonly needed capabilities into a single containerized workspace:

Computer Interaction: A Chromium browser controllable via the Chrome DevTools Protocol (CDP), with documented support for Playwright.
Code Execution: Pre-configured runtimes for Python and Node.js.
MCP Compatibility: AIO Sandbox exports its services as tools that an LLM can call directly via a standardized protocol.
Standard Tooling: A bash terminal and a file system accessible across modules.
Development Interfaces: Integrated VSCode Server and Jupyter Notebook instances for monitoring and debugging.

This makes it ideal as the inner execution environment where the agent actually performs work.

OpenShell: Secure Runtime Envelope

OpenShell is a Kubernetes-based sandbox platform. Each sandbox is a K8s pod inside a K3s cluster inside Docker container. It is built to manage multiple sandboxes with centralized policy (using OPA policy engine). It also provides support for injecting credentials through a network proxy in order to keep real credentials away from the agent via a proxy. OpenShell focuses on secure agent execution and governance:

Isolated sandbox runtime
Gateway-managed access patterns
Declarative policy controls
Network and filesystem restrictions
Operational auditability
BYOC custom runtime support

This makes OpenShell ideal as the outer control plane around a capable agent workspace.

The Architecture

The integration model is straightforward:

User / API Request
        |
        v
NVIDIA OpenShell Gateway
        |
        v
OpenShell Sandbox (BYOC)
        |
        +--> AIO Sandbox Runtime
               - Browser
               - Shell
               - File APIs
               - VS Code
               - MCP Tools

Instead of replacing AIO Sandbox, OpenShell hosts it securely.
This is important because it preserves the strengths of both systems:

AIO remains the productivity layer.
OpenShell remains the security and governance layer.

What This Means for the Future

As agent systems mature, we are likely to see a layered architecture emerge:
Layer 1: Secure Runtime Envelope: Sandboxing, policy, networking, credentials, governance
Layer 2: Capability Workspace: Browser, shell, files, IDE, tools, APIs
Layer 3: Agent Intelligence: Reasoning model, planning, tool use, memory

The AIO Sandbox + OpenShell combination is an early example of this layered model in practice.

Why Bring-Your-Own-Container (BYOC) Matters

Without BYOC support, integrating specialized agent runtimes into secure sandbox systems often requires invasive rework. OpenShell’s BYOC model removes that friction. It allows you to:

Package your existing runtime as a container
Bring your own dependencies and startup model
Retain existing APIs and workflows
Add OpenShell security controls around it

In our case, we already had AIO Sandbox working independently. BYOC meant we could reuse that investment rather than rebuild functionality natively for OpenShell.

Requirements for Custom Images

While OpenShell supports almost any Linux-based image, there are a few technical requirements:

Base Requirements: The image must contain glibc, /proc filesystem support, and the iproute2 package (for network namespace management).
Use a standard Linux base image

Add sandbox user to the AIO Docker Image
User/Group Handling: For custom images, you still must manually create a sandbox user and group in the Dockerfile, like this:

# You must add this to your custom Dockerfile
RUN groupadd -g 1000 sandbox && \
useradd -m -u 1000 -g sandbox sandbox

If you don't do this, the supervisor will refuse to start the agent because it doesn't have a safe, non-root user to "drop" down to after it finishes setting up the secure environment.

docker build -t ghcr.io/deepak-vij/sandbox-openshell:latest .
docker push ghcr.io/deepak-vij/sandbox-openshell:latest

FROM ghcr.io/agent-infra/sandbox:latest

RUN groupadd -r sandbox && \
    useradd -r -g sandbox -m -d /home/sandbox -s /bin/bash sandbox && \
    mkdir -p /workspace && \
    chown -R sandbox:sandbox /home/sandbox /workspace

Create OpenShell Sandbox
The following CLI command creates a sandbox from existing docker image (AIO Sandbox in our case). The very first time, it also boots up a K3S cluster inside a docker container. Within the K3S cluster, it automatically creates an OpenShell gateway (with mTLS enabled).

# Running your own app
# Pass your start command explicitly — use -- <command> on the CLI.
# The image's CMD/ENTRYPOINT is replaced by the sandbox supervisor at runtime.

openshell sandbox create --from ghcr.io/deepak-vij/sandbox-openshell:latest \
    --forward 18080 \
    -- /usr/local/bin/python-server --host 0.0.0.0 --port 8091 --ws-ping-interval 30 --mcp-config /opt/gem/mcp-hub.json --filter-mcp-servers sandbox --workspace /home/gem

openshell forward start 8091 exceeding-reindeer

Practical Benefits of the Integration

Rich Agent Capabilities Inside a Controlled Boundary
Agents can use browsers, terminals, and tools—but only inside the sandbox.
This dramatically reduces host risk while preserving capability.
Operational Simplicity
Rather than assembling separate browser pods, shell workers, file services, and IDE containers, AIO Sandbox bundles them into one runtime.
OpenShell then governs it as a single managed workload.
Better Security Posture
Instead of exposing powerful tools directly, OpenShell can apply policies such as:

restricted egress networking
ephemeral lifecycle
credential boundaries
isolated filesystem mounts
controlled tool invocation paths

Faster Experimentation
Because AIO Sandbox already includes a rich environment, new agent use cases can be tested quickly without building custom images for every scenario.

Real-Life Example Use Cases

Earlier, we briefly mentioned the 3-layer architecture whereby:

OpenShell contributes: capabilities such as secret injection; secure identity; governed outbound access; sandbox lifecycle; policy control.
AIO Sandbox contributes: capabilities such as browser-use/CUA; shell; files etc. More importantly, AIO Sandbox provides the integrated workspace primitives that allow agents to perform multi-step workflows inside a single runtime.
Agent contributes: capabilities such as reasoning; planning; execution.

With that as the background, the following are a set of possible real-life example use cases.
Autonomous Security Patch Agent – Patch-and-PR Agent
Imagine an enterprise wants an agent that automatically patches an internal service when a new CVE is published. This typically requires:

access to internal Git repo
ability to browse vulnerability reports
edit code
run tests
build artifacts
create pull requests
use API keys securely
operate inside controlled boundaries

Most traditional sandboxes can run code. But they do not provide a full secure workspace + browser + governed secrets + developer environment in one flow.

This demonstrates a broader shift in agent infrastructure: from isolated code runners to governed autonomous workspaces. Instead of merely executing commands in isolation, the combined AIO Sandbox + OpenShell stack can run a full enterprise remediation workflow. OpenShell securely provisions short-lived credentials at runtime, launches an AIO workspace container, and applies outbound network policy.

Inside the workspace, the agent researches a newly disclosed vulnerability, updates the affected repository, runs tests, validates behavior in a browser session, and opens a pull request to the approved internal Git host. When the task completes, the sandbox is destroyed and credentials are revoked.
Other Possible Use Case Scenarios

Incident Response Agent – Investigates production alerts inside a governed sandbox, analyzes logs and dashboards, runs diagnostics, and prepares remediation steps or pull requests while keeping credentials and network access tightly controlled.
Browser QA Agent – Launches real browser sessions to validate user flows, detect UI regressions, capture screenshots, and file reproducible bug reports from an isolated workspace environment.
Autonomous Data Ops Agent – Executes data pipelines, validates schemas, runs transformations, investigates failed jobs, and generates reports or fixes in an ephemeral sandbox with secure access to approved data sources.

Key Observations & Final Thoughts

After integrating the two systems, one architectural principle became clear:

The best agent platforms separate capability surfaces from trust boundaries.

AIO Sandbox excels at capability surfaces.

OpenShell excels at trust boundaries.

Combining them yields a much stronger system than forcing either platform to do both jobs alone.

In Conclusion

OpenShell’s Bring-Your-Own-Container model provides a practical path to secure it without redesigning everything. For us, integrating AIO Sandbox with OpenShell demonstrated how modern agent infrastructure can be both:

high capability
operationally safe

And that balance is exactly what production-grade AI agents will need.

Links & Resources

Inside AIO Sandbox (Part 2): Bridging the Gap — Mastering AI Agents with Browser-Use, and MCP

XIAOXU CHANG — Mon, 27 Apr 2026 05:29:16 +0000

By AIO Sandbox team

This is the second part of a series of posts diving into AIO Sandbox that provides an isolated, programmable environment where agents can safely execute.
In the world of AI agents, we’ve moved past simple text-in, text-out. Today’s agents need to act—they need to browse the web, interact with GUIs, and communicate through standardized protocols. However, running these agents locally is risky. You wouldn't want an experimental script having unfettered access to your main browser or local file system. This is where the AIO Sandbox comes in.
In our previous post we focussed on "Filesystem" & "Shell" capabilities of AIO Sandbox. In this post, we’ll explore ways to supercharge AI agents using the AIO Sandbox from web automation to MCP protocol communication.

1. Seamless Web Automation with browser-use

The general-purpose GUI agents (CUA – Computer-use Agents) integrate advanced reasoning, perception, and action modules, enabling them to perform tasks with human-like adaptability and versatility.
The first challenge for any agent is navigating the web. Using the browser-use library alongside a remote CDP (Chrome DevTools Protocol) session provided by the AIO Sandbox, we can create an agent that feels like it’s "seeing" the web.
A browser-use agent built on the Chrome DevTools Protocol (CDP) is essentially an LLM-powered controller that treats the browser like a programmable environment. Think of CDP (Chrome DevTools Protocol) as “The browser’s brain exposed as an API”, the "backdoor" into the browser's brain. Instead of sending clicks to pixel X, Y coordinates, the commands are sent to the DOM (Document Object Model).
High-level architecture of a browser-use agent

User Task
   ↓
LLM (Planner + Reasoner)
   ↓
Agent Runtime
   ↓
CDP Client (WebSocket)
   ↓
Browser (Chrome/Chromium)

The code below demonstrates how to connect a browser-use agent to a sandboxed browser to perform a specific task: extracting Cisco’s stock price.

import asyncio
from agent_sandbox import Sandbox
from browser_use import Agent, Tools, BrowserProfile, BrowserSession
from browser_use.llm import ChatOpenAI

async def main():
    # 1. Connect to the AIO Sandbox
    sandbox = Sandbox(base_url="http://localhost:8080")
    cdp_url = sandbox.browser.get_info().data.cdp_url

    # 2. Attach the browser session to the remote Sandbox
    browser_session = BrowserSession(
        browser_profile=BrowserProfile(cdp_url=cdp_url, is_local=True)
    )

    task = 'Go to Google, look up "Cisco stock price" (CSCO), and extract the current price.'

    agent = Agent(
        task=task,
        llm=ChatOpenAI(model="gpt-4o"),
        browser_session=browser_session,
    )

    await agent.run()
    await browser_session.kill()

if __name__ == "__main__":
    asyncio.run(main())

Why this matters: By offloading the browser to a sandbox, your agent runs in an isolated environment. If the agent accidentally clicks a malicious link or tries to download a file, it only affects the disposable container, not your host machine.

2.Standardizing Discovery: The Agentic MCP (Model Context Protocol) Loop

Next, we have the Model Context Protocol (MCP). MCP is a standardized way for agents to discover and use tools across different environments without custom integration code for every new feature. MCP is the "native language" of modern AI agents. AIO Sandbox exports its services (such as browser_navigate etc.) as tools that an LLM can call directly. For a human developer, the REST API is the interface. For an AI agent, the MCP Tool Manifest is the interface. By exposing both, the sandbox ensures that:

A Python script can use the sandbox via REST.
An LLM (like Claude) can use the sandbox via MCP. The AIO Sandbox acts as an MCP host, allowing you to list and execute tools (like browser_navigate or shell_execute) through a unified interface.

To see this MCP usage in practice, let’s look at how we can bridge the gap between an LLM and the sandbox using MCP. The following example demonstrates a dynamic 'handshake' where the agent queries the AIO Sandbox for its available capabilities and automatically converts them into a format the LLM understands.

import asyncio
import json
from openai import AsyncOpenAI
from agent_sandbox import AsyncSandbox

# Initialize your AI and your MCP Client
# ai_client = AsyncOpenAI(api_key="your-key-here")
ai_client = AsyncOpenAI(api_key="API-KEY")
mcp_client = AsyncSandbox(base_url="http://localhost:8080")

async def run_agent(user_prompt):
    # 1. Fetch available tools from MCP
    tools_resp = await mcp_client.mcp.list_mcp_tools(server_name="browser")

    # Format MCP tools into OpenAI's required JSON schema
    openai_tools = [
        {
            "type": "function",
            "function": {
                "name": tool.name,
                "description": tool.description,
                "parameters": tool.input_schema # MCP uses JSON Schema, which OpenAI likes
            }
        } for tool in tools_resp.data.tools
    ]

    # 2. Ask the model what to do
    response = await ai_client.chat.completions.create(
        model="gpt-5.4",
        messages=[{"role": "user", "content": user_prompt}],
        tools=openai_tools
    )

    message = response.choices[0].message

    # 3. If the model wants to call a tool, we execute it
    if message.tool_calls:
        for tool_call in message.tool_calls:
            # Parse arguments
            args = json.loads(tool_call.function.arguments)

            print(f"Agent is executing {tool_call.function.name}...")

            # CALL THE MCP SANDBOX
            result = await mcp_client.mcp.execute_mcp_tool(
                server_name="browser",
                tool_name=tool_call.function.name,
                request=args
            )

            # 4. Give the tool's output back to the model
            final_resp = await ai_client.chat.completions.create(
                model="gpt-5.4",
                messages=[
                    {"role": "user", "content": user_prompt},
                    message,
                    {
                        "role": "tool",
                        "tool_call_id": tool_call.id,
                        "content": str(result.data.content[0].text)
                    }
                ]
            )
            print(f"Final Answer: {final_resp.choices[0].message.content}")

if __name__ == "__main__":
    asyncio.run(run_agent("Go to google.com and tell me what the search button says and the button next to it."))

Why MCP is the future: Instead of writing custom logic for every new sandbox capability, your agent can "handshake" with the server, see what it's capable of, and start working immediately.
The Shift: This decouples your code from your tools. If you add a "SQL" or "FileSystem" server to the sandbox tomorrow, your agent automatically knows how to use them without a single line of new code.

3. Conclusion

Building AI agents is no longer just about the prompt. It’s about the infrastructure that supports them. By combining AIO Sandbox for security, browser-use for high-level navigation, and MCP for standardized tool access, you can build agents that are powerful, safe, and interoperable.
The future of autonomous work isn't just a chatbot—it's a sandboxed agent with a browser in one hand and MCP in the other.

Links & Resources

Inside AIO Sandbox (Part 1): Files & Shell — The Foundations of Agent Execution

XIAOXU CHANG — Tue, 31 Mar 2026 07:36:56 +0000

by AIO Sandbox Team

Modern AI agents are no longer just generating text—they are expected to write files, modify code, and execute commands.
But doing this directly on your local machine or production systems is risky and hard to control.
This is where AIO Sandbox comes in. It provides an isolated, programmable environment where agents can safely:

create and manipulate files
run shell commands
execute code
produce artifacts
and many more...

Unlike a typical docker container, which often requires manual configuration for tool-chaining, the AIO Sandbox integrates a browser, a shell, and a file system into a single environment designed for AI agents. This unified architecture ensures that artifacts remain persistent and accessible across every stage of an AI-driven workflow executing within the sandbox.
In this first post, we’ll focus on the two most fundamental capabilities:

🧩 Filesystem (state)
⚙️ Shell (execution)

By the end, you’ll see how these combine into a complete runtime for agents.

🌐 Multi-language SDK Support

While this tutorial uses Python, AIO Sandbox is not limited to Python developers.
The agent-sandbox SDK also supports:

TypeScript / JavaScript
Go (Golang)

👉 This makes it easy to integrate AIO Sandbox into a wide range of agent frameworks, backend services, and developer stacks.

🛠️ Prerequisites

Python 3.12+
A running AIO Sandbox instance at http://localhost:8080

Docker Command:
docker run --security-opt seccomp=unconfined --rm -it -p 8080:8080 ghcr.io/agent-infra/sandbox:latest

Python SDK installed

pip install agent-sandbox

🧠 Mental Model

Think of AIO Sandbox as a remote, disposable Linux machine that your agent controls via APIs.

Filesystem → where data and artifacts live
Shell → how actions are executed

Simple Flow:
Agent → API → Sandbox → Filesystem + Shell

Autonomous Data Processing & Validation Agent

Rather than presenting APIs for these services in isolation, we will demonstrate the following end-to-end comprehensive agent workflow example operating within the sandbox environment.
This example simulates an agent that executes the following workflow:

Create some data
Read it (look at it)
Write a script file (process.py)
List files (see what exists)
Run the script
Read the output file
Check if output looks correct
Notice bad data & Fix the data
Run the script again
Read updated output
Find files created
Download final result

What this use case demonstrates
This use case demonstrates a realistic agent loop enabled using AIO Sandbox File & Shell primitives:

Read → Execute → Read → Validate → Fix → Re-run → Export

This workflow makes each AIO Sandbox File & Shell primitive feel purposeful:

File primitives
- write_file creates data and code
- read_file lets the agent inspect inputs and outputs
- list_path gives workspace awareness
- replace_in_file lets the agent repair bad input
- search_in_file validates expected output
- find_files discovers generated artifacts
- download_file exports results out of the sandbox
Shell primitive
- exec_command runs the actual processing job

from agent_sandbox import Sandbox

client = Sandbox(base_url="http://localhost:8080")

# --------------------------------------------------
# 1. Setup workspace
# --------------------------------------------------
home_dir = client.sandbox.get_context().home_dir
app_dir = f"{home_dir}/data_agent"

data_path = f"{app_dir}/data.txt"
script_path = f"{app_dir}/process.py"
report_path = f"{app_dir}/report.txt"

print("Sandbox home directory:", home_dir)
print("App directory:", app_dir)

# --------------------------------------------------
# 2. Create raw input data (with an intentional error)
# --------------------------------------------------
client.file.write_file(
    file=data_path,
    content="""10
20
INVALID
40
50
""",
)

print("\nCreated raw input data.")

# --------------------------------------------------
# 3. Read and inspect input data
# --------------------------------------------------
data_preview = client.file.read_file(file=data_path)

print("\nRaw input data:")
print(data_preview.data.content)

# --------------------------------------------------
# 4. Write processing script
# --------------------------------------------------
client.file.write_file(
    file=script_path,
    content="""numbers = []

with open("data.txt") as f:
    for line in f:
        try:
            numbers.append(int(line.strip()))
        except:
            print("Skipping invalid line:", line.strip())

total = sum(numbers)
avg = total / len(numbers)

report = f\"\"\"Report Summary
--------------
Valid Count: {len(numbers)}
Total: {total}
Average: {avg}
\"\"\"

with open("report.txt", "w") as f:
    f.write(report)

print(report)
""",
)

print("\nCreated processing script.")

# --------------------------------------------------
# 5. List workspace contents
# --------------------------------------------------
workspace = client.file.list_path(
    path=app_dir,
    recursive=True,
)

print("\nWorkspace contents:")
for entry in workspace.data.files:
    print("-", entry.path)
# --------------------------------------------------
# 6. Execute the processing script
# --------------------------------------------------
result = client.shell.exec_command(
    command=f"cd {app_dir} && python3 process.py"
)

print("\nFirst execution output:")
print(result.data.output)
print("Exit code:", result.data.exit_code)

# --------------------------------------------------
# 7. Read generated report
# --------------------------------------------------
report = client.file.read_file(file=report_path)

print("\nGenerated report:")
print(report.data.content)

# --------------------------------------------------
# 8. Validate report contents
# --------------------------------------------------
search = client.file.search_in_file(
    file=report_path,
    regex=r"Average: .*",
)

print("\nReport validation result:")
print(search)

# --------------------------------------------------
# 9. Detect bad input and fix it
# --------------------------------------------------
data_check = client.file.read_file(file=data_path)

if "INVALID" in data_check.data.content:
    print("\nDetected invalid data. Fixing input file...")

    client.file.replace_in_file(
        file=data_path,
        old_str="INVALID",
        new_str="30",
    )

# Read input again after fix
updated_data = client.file.read_file(file=data_path)

print("\nUpdated input data:")
print(updated_data.data.content)

# --------------------------------------------------
# 10. Re-run the processing script
# --------------------------------------------------
result = client.shell.exec_command(
    command=f"cd {app_dir} && python3 process.py"
)

print("\nSecond execution output:")
print(result.data.output)
print("Exit code:", result.data.exit_code)

# --------------------------------------------------
# 11. Read final report again
# --------------------------------------------------
final_report = client.file.read_file(file=report_path)

print("\nFinal report:")
print(final_report.data.content)

# --------------------------------------------------
# 12. Find generated text artifacts
# --------------------------------------------------
artifacts = client.file.find_files(
    path=app_dir,
    glob="*.txt",
)

print("\nDiscovered artifacts:")
print(artifacts)

# --------------------------------------------------
# 13. Download final report to local machine
# --------------------------------------------------
with open("final_report.txt", "wb") as f:
    for chunk in client.file.download_file(path=report_path):
        f.write(chunk)

print("\nFinal report downloaded locally as final_report.txt")

🎯 Key Insight

AIO Sandbox gives agents a safe, programmable runtime:

Files → memory/state

Shell → actions

Together, they enable real-world workflows like:

code generation and execution
data processing
automation pipelines
tool orchestration

🚀 What’s Next

Thanks for reading! Hope it was helpful! This is just the beginning. In upcoming posts, we’ll dive into topics such as:

🌐 Browser automation (CDP-based)
🔌 MCP tool integration
📓 Jupyter / notebook execution
🤖 OpenClaw integration
🎯 Reinforcement learning inside sandbox

💬 Final Thoughts

AIO Sandbox bridges the gap between:

“LLM that generates text”
and
“Agent that can actually do things”

And it does so safely, reproducibly, and programmatically.

Introducing AIO Sandbox, All-in-One Sandbox Environment for AI Agents

XIAOXU CHANG — Fri, 27 Mar 2026 02:33:45 +0000

Unchecked AI autonomy is a ticking time bomb; it’s time to pull the plug on full system unfettered access. We can no longer afford to give AI agents the 'keys to the kingdom' without oversight. The 'wild west' of AI agents running with total system control is officially over.

AIO Sandbox is an open-source project designed to solve these problems. It is everything your agent needs, out of the box. No more juggling multiple services. AIO Sandbox ships a complete, pre-wired environment in a single Docker container.

The AIO (All-in-One) Sandbox is a containerized environment designed for both human developers and AI agents. Its architecture is built around a "Batteries-Included" philosophy, providing a full Linux desktop-like environment inside a single Docker container.

Unified Environment: One Docker container with shared filesystem. Files downloaded in the browser are instantly accessible in Terminal and VSCode.
Out of the Box: Built‑in VNC browser, VS Code, Jupyter, file manager, and terminal—accessible directly via API/SDK.
Agent-Ready: Pre-configured MCP Server with Browser, File, Terminal, Markdown, Ready-to-use for AI agents.
Developer Friendly: Cloud-based VSCode with persistent terminals, intelligent port forwarding, and instant frontend/backend previews.
Secure Execution: Isolated Python and Node.js sandboxes. Safe code execution without system risks.
Production Ready: Enterprise-grade Docker deployment. Lightweight, scalable.

Calling all AI agent developers! How are you securing your builds? Let’s try running your agent in AIO Sandbox and compare notes.

AIO Sandbox is open-sourced under the Apache License 2.0. Contributions welcome.

GitHub: https://github.com/agent-infra/sandbox
Official website: https://sandbox.agent-infra.com/

Kitex/Hertz Empowers LLMs: A Retrospective of Key Features on Its Third Anniversary

XIAOXU CHANG — Thu, 23 Jan 2025 08:24:37 +0000

By Yang Rui from CloudWeGo Team

Github:https://github.com/cloudwego

It has been three years since CloudWeGo's open-source journey began. Adhering to the principle of Internal and External Consistency, we have continuously iterated on our open-source repository, releasing features that served ByteDance internally to the external world. From 2023 to 2024, Kitex/Hertz focused on LLM back-end services, user experience, and performance, aiding the rapid development of new business scenarios while continuously optimizing user experience and performance. Meanwhile, Kitex/Hertz has been widely adopted by external enterprises and attracted numerous external developers, enhancing our CloudWeGo ecosystem all the way.

This article summarizes the presentation "Kitex/Hertz Empowers LLMs: A Retrospective of Key Features on The Third Anniversary". It introduces the significant features of Kitex/Hertz over the past year, aiming to assist enterprise users and community developers in better applying Kitex/Hertz to build their microservices systems in their projects.

Enhanced Streaming Capabilities to Support LLMs

With the rapid development of LLMs and ByteDance's AI applications, streaming communication has emerged as the primary communication mode for LLM application services. To better support business growth, we have optimized streaming communication in microservices in terms of stability, engineering practices, and performance over the past year.

Previous Streaming Capabilities of Kitex/Hertz

Both Kitex and Hertz support streaming scenarios. Kitex supports gRPC with better performance than the official gRPC and aligns with its functionality. Hertz supports HTTP Chunked Transfer Encoding and WebSocket. However, these capabilities were insufficient to support the rapid development of LLM internally at ByteDance due to several reasons:

More SSE Applications on Clients

Before MultiModal Machine Learning Models, LLM applications were mainly text-based dialogue scenarios, often using the SSE protocol to return server results to clients in real-time. Text push scenarios are simpler, requiring only a browser-friendly, straightforward protocol.

The Burden of Transitioning from Thrift to Protobuf

Although gRPC (Protobuf) is commonly used for streaming communication in RPC scenarios, and Kitex also supports gRPC, ByteDance's server-side services primarily use Thrift IDLs. Developers are more familiar with Thrift, and there are not many services using gRPC protocols. However, as the demand for streaming increases, we need to reduce the cognitive burden on developers during the transition based on internal realities. Additionally, widely increasing services defined by Protobuf is not conducive to unified IDL/interface management.

Lack of Engineering Practices

Compared to the PingPong model of one-send-one-receive, streaming communication adds complexity in service governance and engineering practices. The industry lacks accumulated engineering practices for streaming communication. Streaming interfaces can be easily misused, affecting service stability. From an observability perspective, there is no definition for streaming monitoring.

Streaming Capabilities – SSE/Thrift Streaming

Hertz SSE

SSE (Server-Send Events) is based on the HTTP protocol, supporting unidirectional data push from the server to the client. Its advantages include simplicity, ease of use, and developer-friendly, making it suitable for text transmission and meeting the basic communication needs of text dialogue models. Compared to WebSocket, SSE is lighter. For text-based dialogue LLM applications, the server only needs to push data to the client without handling the complexity of bidirectional communication. However, in voice dialogue scenarios, WebSocket, which is also browser-friendly, is more suitable. SSE can define different event types and process data on the client based on event types. In LLM applications, this can be used to distinguish different types of response data (e.g., partial outputs, error messages, status updates).

However, SSE is not suitable for the server side due to the following reasons: high computational and transmission performance requirements on the server side, unsuitability for inefficient text protocols, JSON's simplicity but unsuitability for complex server-side interaction scenarios, preference for strongly typed RPC, and the need for bidirectional streaming communication in certain cases.

Therefore, considering ByteDance's internal needs, we choose to support Thrift Streaming.

Kitex Thrift Streaming

Streaming communication is used not only in LLM applications but also in other business. For example, Douyin Search aims to improve performance by RPC streaming results. During the video packaging stage, it retrieves information related to recalled video IDs, hoping to bundle services (10 docs) in one request and return the first completed package. In the Lark People data export scenario, data is retrieved concurrently. If all data is filled into an Excel sheet before returning, excessive data can lead to OOM (Out of Memory), causing process exception terminated. Enhancing streaming capabilities not only supports the rapid development of LLMs but also meets the development needs of other business scenarios.

Although Kitex supports gRPC, we recommend using Thrift internally. Supporting diversity can meet various needs. However, it is best for a company to establish a best practice to minimize the burden on developers' choices, and the tool chain system will also be more supportive.

Streaming Protocols

Within ByteDance, traffic control for Streaming protocols mainly relies on Service Mesh. However, to quickly support implementation without relying on Service Mesh's support for new protocols, Kitex first supported Thrift Streaming based on gRPC (HTTP2). Since the official gRPC protocol specification supports extending content-type, the implementation is based on gRPC's RPC communication specification, changing Protobuf encoding to Thrift encoding.

Thrift over gRPC began its Alpha at ByteDance in December 2023 and was officially released in Kitex v0.9.0 in March 2024. It is now widely used internally, with usage instructions available on the official website.

Pros:
- Service Mesh Compatibility: Based on HTTP2 transmission, no separate support is required from Service Mesh.
- Low Support Cost: The decoding type is explicitly determined based on SubContentType (an extension supported by gRPC protocol specification.
Cons:
- High Resource Consumption: Flow control and dynamic windows introduce additional overhead.
- Significant Latency Impact: Flow control can significantly degrade latency with heavier traffic or larger packets, requiring users to adjust WindowSize.
- Difficult Troubleshooting: Increased complexity also raises the difficulty of troubleshooting.

Thrift over gRPC can be quickly implemented. However, from the perspectives of performance and troubleshooting, we have developed a Streaming protocol (Streaming over TTHeader) to simplify streaming communication. It is currently under internal debugging and trials, with an expected release in November-December 2024.

How to Define Streaming in Thrift

Users familiar with Thrift know that native Apache Thrift does not support the definition of streaming interfaces. Adding new keywords would make other Thrift parsing tools, including IDE plugins, incompatible. Therefore, defining streaming types for Thrift's RPC methods through annotations ensures parsing compatibility:

streaming.mode="bidirectional": Bidirectional Streaming
streaming.mode="client": Client Streaming
streaming.mode="server": Server Streaming

Both the currently supported Thrift Streaming over gRPC and the upcoming Thrift Streaming over TTHeader use this method to define streaming methods. The client-side will provide options to specify which Streaming protocol to use, while the server-side will support multiple protocols through protocol detection.

Generalized Streaming Invocation

If SSE is used for streaming communication on clients and Thrift Streaming is used on servers, how does the overall communication from clients to servers work?

Taking the internal text dialogue model as an example, the traffic undergoes protocol conversion after passing through the API gateway, and the server uses the Server Streaming type to push data to the client.

An important capability here is protocol conversion. Additionally, pressure testing and interface testing platforms need to dynamically construct data to test server services.

Users of Kitex know that Kitex provides generalized invocation for Thrift protocols, primarily supporting such general services. Previously, internal microservices were mainly Thrift PingPong services. Kitex provided generalized invocation for Map, JSON, HTTP data types, as well as binary generalized invocation for traffic forwarding.

Therefore, for streaming interfaces, Kitex has added support for generalized streaming invocation. Compared to PingPong generalized interfaces, generalized streaming requires separate interfaces for the three streaming types.

PingPong/Unary Generalized Invocation Interface

Streaming Generalized Invocation Interface

Currently, support for the mainstream JSON data type is complete, and other data types will be supported based on business needs in the future. (Since the Kitex Streaming v2 interface is yet to be released, and to avoid affecting the user experience of generalized streaming, this support has not been officially announced, but the functionality is ready. Users can visit the generalized invocation section on the official website for English documents.)

User Experience of Streaming Capability

Although we have improved the basic cases for streaming and introduced the streaming capabilities that Kitex/Hertz has supported in the past, newly supported, and will soon release, do developers who have worked on streaming interfaces, including those using other frameworks like the official gRPC, know how to properly use streaming interfaces and how to locate issues when they arise?

Within ByteDance, as streaming services have evolved, we've noticed a significant increase in feedback issues. On one hand, compared to Thrift PingPong, our support at the basic capability level is still incomplete. On the other hand, developing streaming interfaces requires a deep understanding of proper usage; otherwise, misuse can easily lead to problems.

Therefore, in 2024, we initiated a Streaming Optimization Project, sorted through various issues, and optimized them one by one. In terms of user experience, some issues are related to streaming interface definitions. After comprehensive consideration, we decided to shed the streaming burden and release the Streaming v2 interface. Below are some of the existing issues and ongoing optimizations. It's difficult to enforce proper usage of streaming interfaces solely from the framework level. Therefore, we will release usage specifications and best practices for streaming interfaces to help users develop high-quality streaming interfaces. If you have better suggestions for streaming usage, we welcome your feedback!

Taking streaming observability as an example, previously, streaming interface monitoring was not defined separately, which reuse PingPong reporting, resulting in only overall stream reporting information and lacking Recv/Send monitoring. Therefore, when supporting Thrift Streaming, StreamSend & StreamRecv events were added, with the framework recording the time of occurrence and the size of user-transmitted data. For custom Tracer reporting by enterprise users, it only requires implementing the rpcinfo.StreamEventReporter interface. Kitex will call this interface after each Recv and Send execution, allowing access to the event information for this Recv and Send. Below is the Trace information for Send/Recv within a Stream.

Review of New Features, User Experience/Performance Improvements

While specialized support and optimization for streaming capabilities have been conducted over the past year, we have also provided other new features to meet user needs, enhance user experience, and continue to improve framework performance.

New Features – Thrift/gRPC Multi-Services

The official gRPC framework supports multi-services, but previous versions of Kitex did not, mainly to align with Thrift usage. Thrift's limitation arises from supporting multi-services introducing protocol incompatibility changes, impacting users. Within ByteDance, the TTHeader protocol is widely used, so we decided to transmit the IDL Service Name via TTHeader to solve the issue of Thrift not supporting multi-services.

Kitex v0.9.0 officially supports registering multiple IDL Services within one Server, including Thrift and Protobuf. Thrift provides true multi-service functionality at the protocol level based on TTHeader, while being compatible with the old CombineService.

Here is a briefly introduction of Combine Service. Kitex previously provided a pseudo-multi-service feature, Combine Service, to address the issue of excessively large IDLs (leading to large code outputs and slow compilation speeds). It allows the server to split one IDL Service into multiple ones, but requires that the multiple IDL Services cannot have methods with the same name (since the protocol does not support multi-services, method routing cannot be done). Ultimately, Kitex merges multiple IDL Services into one Service, hence the name CombineService.

With Kitex's new multi-service support, the server can not only register multiple IDL Services, but also provide both Thrift and Protobuf interfaces simultaneously. For example, using Kitex-gRPC (Protobuf) but wanting to switch to Thrift Streaming while ensuring compatibility with old interface traffic, two types of IDL interfaces can be provided for transition.

Below is an example of registering multiple services on the server:

New Features – Mixed Retry

Kitex previously provided two retries: Failure Retry and Backup Request. Failure Retry can improve success rates (enhancing service SLAs), but most are timeout retries, leading to increased latency; Backup Request can reduce request latency, but if there is a failed return, it terminates retries.

In internal practice, businesses generally express a desire to have both retries, offering advantages to:

Optimize the overall retry latency of Failure Retry
Improve the request success rate of Backup Request

Therefore, Kitex supports Mixed Retry in v0.11.0, a hybrid retry function combining Failure Retry and Backup Request functions.

To facilitate understanding the differences between the three retries, here is a scenario: assume the first request takes 1200ms, the second request takes 900ms, with RPCTimeout configured to 1000ms, MaxRetryTimes to 2, and BackupDelay to 200ms.

Comparing the results of the three retries:

Mixed Retry: Success, cost 1100ms
Failure Retry: Success, cost 1900ms
Backup Retry: Failure, cost 1000ms

User Experience - Frugal & FastCodec (Thrift)

Both Frugal and FastCodec (Thrift) are high-performance Thrift serialization tools provided by Kitex. Frugal's advantage over FastCodec is that it does not require code generation, significantly addressing the issue of excessively large outputs.

But still two cons:

Both Frugal and FastCodec decoding must rely on packets with headers. If it's a Thrift Buffered packet, it will fallback to Apache Codec. Users need to be clear about the received protocol; otherwise, using Frugal cannot completely eliminate code generation.
Frugal is based on JIT implementation, with x86 support completed, but ARM just provides a fallback strategy with poor performance.

Addressing the protocol binding issue, the new version supports SkipDecode. Test results show that SkipDecode + FastCodec still outperforms Apache Thrift Codec.

For the Frugal ARM issue, new reflection support is provided, eliminating the need for separate support for different architectures. Although using reflection, bypassing type checks within reflection achieves higher performance. Test results are slightly better compared to JIT.

User Experience - Output Reduction and Generation Speed Optimization

Large output size, slow output generation,and compilation speeds are significant pain points for services with longer iterations within ByteDance. Therefore, Kitex provides various optimization methods to reduce output size and improve output generation speed.

IDL Trimming

A complex IDL with longer iterations contains many obsolete struct definitions. Actively cleaning up these unnecessary definitions can also increase development burden. The trimming tool supports generating code based on the struct definitions required by RPC methods. Users can also specify which methods to generate. According to pilot projects in large ByteDance repositories, generation time is halved, and output size is reduced by over 60%.

Usage: $ kitex -module xx -thrift **trim_idl xxxx.thrift

Example effect: In the example below, the trimming tool deleted 60,000 unused structs and 530,000 fields.

no_fmt Speedup

After code output generated, the code is formatted by default to improve readability, but users rarely care about that. Therefore, users can disable the fmt option to improve generation speed.

Usage: $ kitex -module xx -thrift **no_fmt xxxx.thrift

Effect: The P90 generation time for a certain platform within ByteDance decreased from 80s to 20s.

Removing Unnecessary Codes from Kitex

Kitex defaults to generating the full Apache Thrift code, but in reality, only the Codec part is used in fallback scenarios, and the rest of the code is not needed.

Therefore, Kitex v0.10.0 defaults to removing the Thrift Processor and can remove all Apache Thrift code via parameter specification.

kitex -module xxx -thrift no_default_serdes xxx.thrift

Usage: $ kitex -module xxx -thrift no_default_serdes xxx.thrift

Effect: Output size is reduced by about 50%+.

Frugal Slim Extreme Reduction

Usage: $ kitex -thrift frugal_tag,template=slim -service p.s.m idl/api.thrift, using Frugal for Thrift serialization.

Effect: Output size is reduced by about 90%.

User Experience - kitexcall

Although RPC calls are simpler and more convenient than HTTP, they are not convenient to test, requiring tools to generate code first and then construct request data. Previously mentioned testing platforms use generalized invocations to construct request data without relying on generated code, but the cost of using generalized invocations is not low. Users must first understand the usage and data construction of the method.

To improve testing convenience, based on Kitex JSON generalized invocations, a separate command tool - kitexcall is provided, allowing users to initiate Thrift tests using JSON data. (This feature is supported by community contributions; thanks here!)

Usage: $ kitexcall -idl-path echo.thrift -m echo -d '{"message": "hello"}' -e 127.0.0.1:8888

Future optimization plans:

Graphical interface for more convenient testing
Support for gRPC testing
No need to specify IDL, using server reflection to obtain IDL information

Performance Optimization – Thrift On-Demand Serialization

As business iterations make IDL definitions increasingly complex, upstream services in production may only need partial fields but need to serialize and transmit all of those, introducing additional performance overhead. Considering this issue, Kitex supports on-demand serialization for Thrift.

Reference Protobuf to provide a Thrift FieldMask feature, allowing users to select encoding fields and optimize serialization and transmission overhead.

For example, below, only the Foo field is encoded and returned, ignoring the Bar field:

User constructs Bar data, but annotate the Foo field, and the framework will only encode Foo:

It also supports specifying required fields by the opposite end; for specific usage, see the on-demand serialization documentation on the official website.

Performance Optimization – Thrift Memory Allocation Optimization

Kitex continuously monitors RPC performance. In the current context of high cost pressures, we are deeply exploring more optimizations. Routine optimizations on hot paths have all been done, but further ones are less conventional. v0.10.0 released new optimizations focusing on memory allocation and GC.

Span Cache: Optimizes String/Binary decoding costs:
- Pre-allocates memory, reducing mallocgc calls
- Reduces the actual number of generated objects -> lower GC costs
Centralized memory allocation for container fields
- Similarly, changes from separate memory allocation for each element to centralized approach

Span Cache can optimize CPU but increases memory usage. To avoid impacting services with small memory specifications, it is not enabled by default and requires user to turn on:

Effect: Under extreme testing, throughput is increased by about 10%, and latency is reduced by about 30%.

Memory Analysis Tool

The received objects in RPC/HTTP are constructed, memory-allocated, and value-assigned by the framework before being returned to the user. However, if the user's code holds onto these objects indefinitely, it can lead to memory leaks. While pprof heap can indicate where memory is allocated, it cannot show where references are made. So, how do we determine who is referencing a Go object?

In fact, GC scans and marks objects, capturing reference relationships. By combining this with variable names and type information, we can analyze the referencing situation of objects. Leveraging Delve, we have developed the goref object reference analysis tool, which was open-sourced in July (github.com/cloudwego/goref). This addresses the limitation of Go's native tools in analyzing memory references, aiding Go developers in quickly identifying memory leaks and enhancing the Go tooling ecosystem.

For instance, the Heap Profile of pprof shown in the following image reveals that the currently referenced objects are primarily allocated within FastRead (Kitex's deserialization code). It is normal for decoding to allocate memory for construct data, but this flame graph offers limited help in troubleshooting as allocated memory addresses are often not the source of memory leaks.

However, using the goref tool yields the following result: mockCache holds an RPC Resp, preventing memory from being released. The issue is immediately apparent.

Conclusion and Outlook

Conclusion

Enhancing Streaming Capabilities to Support LLMs

Streaming capabilities provided by Kitex/Hertz: gRPC, HTTP 1.1 Chunked, WebSocket, SSE, Thrift Streaming
SSE <-> Thrift Streaming
Generalized streaming invocations
Streaming capability optimizations to enhance user experience and engineering practices

Review of New Features, User Experience, and Performance Improvements

New Features: Thrift/gRPC multi-service support, Mixed Retry
User Experience: Frugal/FastCodec, streamlined outputs, generation speed optimizations, kitexcall
Performance Optimization: Thrift on-demand serialization, memory allocation improvement
Memory Analysis Tool: goref

Outlook

In the coming year, we will continue to enhance streaming capabilities and optimize the user experience. We will provide usage guidelines for streaming interfaces to help users better develop their streaming services:

Release Kitex Streaming v2 interface to address historical issues
Release TTHeader Streaming for better performance
Engineering practices: graceful shutdown, retries, timeout control
Release streaming-related specifications: error handling, interface usage guidelines

Furthermore, we will consider strengthening the streaming ecosystem, such as enriching generalized streaming invocations and providing more gateway-friendly support:

SSE <-> Thrift Streaming(HTTP2 and TTHeader Streaming)
WebSocket <-> Thrift Streaming (HTTP2 and TTHeader Streaming)
Binary and Map generalized invocations for Streaming

Special announcement: Kitex plans to gradually remove Apache Thrift-generated code in future versions. Due to incompatible changes in Apache Thrift v0.14, Kitex is forced to be locked-in Apache Thrift v0.13. To resolve this, Kitex will eliminate its dependency on Apache Thrift.

BMF: Frame extraction acceleration- video similarity search with Pinecone

Josh Alphonse — Fri, 10 May 2024 17:42:44 +0000

TL;DR: This is a tutorial on how to create a video similarity search with BMF and Pinecone from scratch. View this project's code on github and test it out in a notebook like colab.

So you might have seen in my last blog post that I showed you how to accelerate video frame extraction using GPU's and Babit multimedia framework. In this blog we are going to improve upon our video frame extractor and create a video similarity search(Reverse video search) utlizing different RAG(Retrival Augemented Gerneation) concepts with Pinecone, the vector database that will help us build knowledgeable AI. Pinecone is designed to perform vector searches effectively. You'll see throughout this blog how we extrapulate vectors from videos to make our search work like a charm. With Pinecone, you can quickly find items in a dataset that are most similar to a query vector, making it handy for tasks like recommendation engines, similar item search, or even detecting duplicate content. It's particularly well-suited for machine learning applications where you deal with high-dimensional data and need fast, accurate similarity search capabilities.
Reverse video search works like reverse image search but uses a video to find other videos that are alike. Essentially, you use a video to look for matching ones. While handling videos is generally more complex and the accuracy might not be as good as with other models, the use of AI for video tasks is growing. Reverse video search is really good at finding videos that are connected and can make other video applications better.
So why would you want to create a video similarity search app?

Here are some reasons:

Content Discovery: It enables users to find videos that are visually or contextually similar to what they're interested in, enhancing content discoverability on platforms like streaming services or stock footage libraries.
Recommendation Systems: Enhances recommendation engines by suggesting content that is similar to a user's viewing history, thus improving user engagement and retention.
Duplicate or Near-duplicate Detection: Helps in identifying copies or slight variations of the same video, which is useful for copyright enforcement or content management.
Categorization and Tagging: Assists in automatically categorizing and tagging videos based on content, which can simplify content management and improve searchability.
User-generated Content Moderation: Useful in moderating platforms where vector similarity can help identify potentially problematic content by comparing new uploads with known flagged videos.
Video Analysis: In fields like surveillance, sports, or medical imaging, it can help in analyzing and identifying specific moments or objects in video sequences.

Oh yeah and of course a similarity search like what we'll do in this blog! I've taken inspiration by reading the Milvus video reserve video search notebook and decided to recreate it using technologies I prefer.
Babit Multimedia framework brings forth all the great things we know and love about FFMPEG and amplifies it all with its multi-language support and GPU acceleration capabilities.
Now you might be familiar with other frame extraction methods using OpenCV, FFmpeg,or GStreamer. These are all great options. However, I'm choosing to use BMF for a few reasons:

Multi-language support- BMF supports the use of Python, GO and C++
Full compatiblity with FFmpeg- BMF is fully compatible with FFmpeg’s processing capabilities and indicators, such as demuxing, decoding, filter, encoding, and muxing capabilities. The configuration of these processing capabilities and the results consistent with FFmpeg’s pts, duration, bitrate, fps and other indicators can be obtained. Very good It satisfies the need to quickly integrate FFmpeg capabilities into projects.
Enhanced Support for NVIDIA GPUs to create enterprise ready GPU accelerated video pipelines
- NVENC/NVDEC/GPU filters work out-of-box by inheriting abilities from FFmpeg.
- High performance frame processing is enabled by integration of CV-CUDA and customized CUDA kernels.
- AI inferencing can be easily integrated into video pipelines using TensorRT.
- Data moving between CPU and GPU can be done by a simple call. Alright, so that's more than just a few reasons, but you get the point! Now let's build a video similarity search/

### The Architecture

Required Python packages to install:

Python 3.9-3.10
pinecone-client
BabitMF-GPU
torch
torchvision>=0.12.0
python-dotenv
av Grab a video. I had a short video stored on a github repo. You can use a video stored on your system or elsewhere. BMF can handle any video format(FFmpeg compatibility!)

Inserting the dataset into a Pinecone index

Let's start with inserting videos from our dataset into our Pinecone index. We do this so that our vector database has knowledge of the videos we will be comparing to the end user's video. This is a necessary starting point for our application.
First, I'm going to create an account on Pinecone and create my first index using Pinecone serverless. Pinecone is a fully managed vector database. You can use the CLI or the dashboard when you log in. Here's to learn how to set it up: https://docs.pinecone.io/guides/getting-started/quickstart.

git clone https://github.com/Joshalphonse/Bmf-Huggingface.git

Install BMF with GPU capabilities

!pip install -qU \
  pinecone-client \
  BabitMF-GPU \
  torch \
  torchvision>=0.12.0 \
  python-dotenv \
  av

Install this video dataset or use your own
The data is organized as follows:

train: candidate videos, 20 classes, 10 videos per class (200 in total)
test: query videos, same 20 classes as train data, 1 video per class (20 in total)
reverse_video_search.csv: a csv file containing an id, path, and label for each video in train data

! curl -L https://github.com/towhee-io/examples/releases/download/data/reverse_video_search.zip -O
! unzip -q -o reverse_video_search.zip

Put the files in a dataframe and convert them to a list

import pandas as pd

df = pd.read_csv('./reverse_video_search.csv', nrows=3) #put the files in the dataframe
video_paths = df['path'].tolist() #convert df to python list

print(video_paths) #check if the video paths

Make sure to import all of the necessary packages. Then create environment variables to manage your configurations. They will make your life a lot easier.
Afterwards, load the CSV file from the data set folder I'm also limiting the list to 3 rows just to speed things up for demo purposes.
We'll also load the ResNet Pretrained model because in the next steps we will use it to generate the vector embeddings.
Lastly, in this code snippet, configure a preprocessing pipeline for images using PyTorch's transforms module, which is often used in deep learning for preparing data before feeding it into a neural network.

import os
import cv2
import av
from pinecone import Pinecone, ServerlessSpec
import numpy as np
import pandas as pd
import torch
import torchvision.transforms as transforms
import torchvision.models as models

PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
PINECONE_ENVIRONMENT = os.environ["PINECONE_ENVIRONMENT"]
PINECONE_DATABASE = os.environ["PINECONE_DATABASE"]

# Replace 'your_pinecone_api_key' with your actual Pinecone API key or use environment variables like I am here
pc = Pinecone(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
index = pc.Index(PINECONE_DATABASE)

#load the CSV file
csv_file = './reverse_video_search.csv'
df = pd.read_csv(csv_file, nrows=3)
video_paths = df['path'].tolist()
print(video_paths) #check if the video paths

#load a pretrained ResNet model
model = models.resnet18(pretrained = True)
model.eval()

#remove the last fully connectected layer
model = torch.nn.Sequential(*list(model.children())[:-1])

# Define the preprocessing transforms
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

Since we have our dataset and our files ready to go, we will iterate over each video path and generate an embedding. I'm also using the av package to handle the video file, so we can open it and do the extraction.
We then iterate over the frames of the video, preprocessing each frame (using a preprocess function that is not shown) and generating an embedding for the frame using a pre-trained ResNet model. These frame embeddings are stored in a list.
Once all the frame embeddings have been collected, then we calculate the average of the embeddings to get a single embedding that represents the entire video.
Now all we have to do is use the Pinecone package to upsert (insert or update) the average video embedding to a Pinecone index, under the namespace 'video_embeddings'. The video path is used as the unique identifier for the embedding

# Iterate over each video path and generate embeddings
for video_path in video_paths:
    # Open the video file
    video = av.open(video_path)

    # Get the first video stream
    video_stream = next(s for s in video.streams if s.type == 'video')

    # Initialize variables for storing embeddings
    embeddings = []

    # Iterate over the video frames
    for frame in video.decode(video=0):
        # Convert the frame to a numpy array
        img = frame.to_ndarray(format='rgb24')

        # Preprocess the frame
        img = preprocess(img)
        img = img.unsqueeze(0)  # Add batch dimension

        # Generate embeddings using the ResNet model
        with torch.no_grad():
            embedding = model(img)
            embedding = embedding.squeeze().numpy()

        # Append the embedding to the list
        embeddings.append(embedding)

    # Convert the list of embeddings to a numpy array
    embeddings = np.array(embeddings)

    # Calculate the average embedding for the video
    avg_embedding = np.mean(embeddings, axis=0)
    print(avg_embedding)

    # Upsert the embedding to Pinecone
    index.upsert(
        vectors=[
            (video_path, avg_embedding.tolist())
        ],
        namespace='video_embeddings'
    )

    print(f"Upserted embedding for video: {video_path}")

Now you can either use the Pinecone CLI or the dashboard to view what in your index we just updated the data to. Check out the picture below.

Searching For A Similar Video

Install ffmpeg and related libraries. For this demo, we don't have to do this step, because ffmpeg libraries are already installed in the Google Colab environment.

sudo apt install ffmpeg

List the ffmpeg libraries. It is expected that the related libraries such libavcodec, libavformat are installed. The output should be shown below:

sudo apt install libdw1

dpkg -l | grep -i ffmpeg

ffmpeg -version

Install the following package to show the BMF C++ logs in the colab console, otherwise only python logs are printed. This step is not necessary if you're not in a Colab or iPython notebook environment.

pip install wurlitzer
%load_ext wurlitzer

Now import all of these dependencies listed and the beginning of our process is the same as our data upsert from above. Use your Pinecone credentials that we stored in a .env file and work with the ResNet18 pretrained model.
The difference here is that we are finally using BMF for frame extraction.

import os
import glob
import numpy as np
import torch
import torchvision.transforms as transforms
import torchvision.models as models
from pinecone import Pinecone
import bmf
import cv2
from IPython import display
from PIL import Image

PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
PINECONE_ENVIRONMENT = os.environ["PINECONE_ENVIRONMENT"]
PINECONE_DATABASE = os.environ["PINECONE_DATABASE"]

# Replace 'your_pinecone_api_key' with your actual Pinecone API key or use environment variables like I am here
pc = Pinecone(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)
index = pc.Index(PINECONE_DATABASE)

model = models.resnet18(pretrained=True)
model.eval()
#remove the last fully connectected layer
model = torch.nn.Sequential(*list(model.children())[:-1])

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

input_video_path = "/content/linedancing.mp4"
output_path = "./extracted-images/simple_%03d.jpg"

graph = bmf.graph({'dump_graph':1})
video = graph.decode({
    "input_path": input_video_path,
}).fps(2)

try:
    (
        bmf.encode(
            video['video'],
            None,
            {
                "output_path": output_path,
                "format": "image2",
                "video_params": {"codec": "jpg"},
            }
        ).run()
    )
    print("Frame extraction completed successfully.")
except Exception as e:
    print(f"Error during frame extraction: {str(e)}")

Next, we will load the extracted query frames and generate the embeddings for our video that we will compare to the ones stored in our Pinecone index.

Let me break it down for you:

Load the extracted query frames: I used the glob module to find all the file paths of the extracted query frames, which are stored in the query_frame_paths variable. These are individual frames extracted from the original video.
Generate embeddings for each query frame: We then iterate over each query frame path, load the image using cv2.imread, preprocess it (using a preprocess function that is not shown), and generate an embedding for the frame using the pre-trained model.
Store the embeddings: The generated embeddings for each frame are stored in the query_embeddings list.
Calculate the average embedding: Once all the frame embeddings have been collected, then we calculate the average of the embeddings to get a single embedding that represents the entire set of query frames. By generating an average embedding for the query frames, we are able to capture the overall visual content of the query, which is a main component of how our similarity search will work.

# Load the extracted query frames and generate embeddings
query_frame_paths = glob.glob(output_path.replace("%03d", "*"))
query_embeddings = []

for frame_path in query_frame_paths:
    frame = cv2.imread(frame_path)
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    frame = preprocess(frame)
    frame = frame.unsqueeze(0)  # Add batch dimension

    with torch.no_grad():
        embedding = model(frame)
        embedding = embedding.squeeze().numpy()

    query_embeddings.append(embedding)

query_embeddings = np.array(query_embeddings)
avg_query_embedding = np.mean(query_embeddings, axis=0)

Lastly, let's perform our similarity search with Pinecone.
The query method from Pinecone will be used to search for the most similar vectors to the avg_query_embedding we created. The top_k parameter is set to 5, which means that the code will retrieve the 5 closest matching vectors to the query(choose whatever number you'd like depending on how many items were upserted into your database. The include_metadata parameter is set to True, which means that we will retrieve the metadata (in this case, the video file paths) associated with the matching vectors.
This step is really straight forward. Pinecone has great documentation and a really easy to use package.

# Perform similarity search using Pinecone
num_results = 5  # Number of similar videos to retrieve
results = index.query(
    vector=avg_query_embedding.tolist(),
    top_k=num_results,
    include_metadata=True,
    namespace='video_embeddings'
)

# Print the most similar video paths
for match in results['matches']:
    video_path = match['id']
    print(f"Similar video: {video_path}")

And our result is....

Bonus

Since I'm using a notebook and I don't want to use up a ton of memory, I also converted all the videos to gifs to view them easier. So here is some bonus code for ya!

def video_to_gif(video_path):
    gif_path = os.path.join(tmp_dir, video_path.split('/')[-1][:-4] + '.gif')
    frames = []
    cap = cv2.VideoCapture(video_path)
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(Image.fromarray(frame))
    cap.release()
    frames[0].save(fp=gif_path, format='GIF', append_images=frames[1:], save_all=True, loop=0)
    return gif_path

# Display the input video as a GIF
html = 'Query video "{}": <br/>'.format(input_video_path.split('/')[-1])
query_gif = video_to_gif(input_video_path)
html_line = '<img src="{}"> <br/>'.format(query_gif)
html += html_line
html += 'Top {} search results: <br/>'.format(num_results)

# Display the similar videos as GIFs
for path in [match['id'] for match in results['matches']]:
    gif_path = video_to_gif(path)
    html_line = '<img src="{}" style="display:inline;margin:1px"/>'.format(gif_path)
    html += html_line

display.HTML(html)

You can do it too

What I've showed you is a niche use case for BMF. Video frame extraction has a lot of use cases outside of our example. There are a ton of features with this framework, especially when it comes to building video processing pipelines. Make sure you check out the BMF documentation and try out some other example apps on the quick experience page for more.

KubeAdmiral: next-generation multi-cluster orchestration engine based on Kubernetes

XIAOXU CHANG — Mon, 15 Apr 2024 09:33:34 +0000

Project link: https://github.com/kubewharf/kubeadmiral

Since its release in 2014, Kubernetes has become the de facto standard for cloud native orchestration and scheduling systems, delivering substantial value to infrastructure developers around the world.

As an increasing number of corporations embrace cloud native technologies and migrate their workloads to Kubernetes, the scale of their clusters grows rapidly.

The community edition of Kubernetes, capped at 5000 nodes per cluster, is no longer able to keep up with the scale requirements of large-scale enterprise applications. Moreover, many companies are adopting multi-cloud architectures to achieve cost reduction, increased resource and operational efficiency, geographical disaster recovery, and environment isolation.

As a result, the demand for multi-cluster orchestration and scheduling tools is on the rise.

Brief History of Kubernetes at ByteDance

Dedicated Clusters

In the early years of ByteDance’s cloud native adoption, each business line operated in separate dedicated clusters due to isolation concerns. However, this led to low resource elasticity and efficiency, observed in several ways:

Each business line had to maintain independent resource buffers for scaling and upgrading.
Applications were tightly coupled to specific clusters, and manual resource transfer was required to balance resource utilization as applications scale.
SRE teams had to deeply understand both the businesses and the clusters in order to manage resources efficiently.

Consequently, this resulted in inefficient resource management and suboptimal overall deployment rate.

KubeFed v2

To address these challenges, the technical infrastructure team at ByteDance started exploring cluster federation based on KubeFed v2 in 2019. The goal is to pool resources across business lines, reduce unnecessary buffers, and improve the efficiency of resource management.

KubeFed v2 introduces the concept of host and member clusters. Users create federated workloads (e.g. FederatedDeployment) in the host cluster, and KubeFed schedules and dispatches workloads in the member clusters based on these federated workloads. Each federated workload contains three primary fields: Template (specifying the workload to be dispatched to member clusters), Placement (designating target member clusters), and Overrides (indicating how the template should be varied in some clusters). For example, the following FederatedDeployments instructs KubeFed to create a Deployment in cluster1 and cluster2 with 2 and 3 replicas respectively.

apiVersion: types.kubefed.k8s.io/v1beta1
kind: FederatedDeployment
metadata:
  name: test-deployment
spec:
  template:
    metadata:
      labels:
        app: nginx
    spec:
      replicas: 5
      # more Deployment fields...
  placement:
    clusters:
    - name: cluster1
    - name: cluster2
  overrides: 
  - clusterName: cluster1
    clusterOverrides:
    - path: /spec/replicas
      value: 2
  - clusterName: cluster2
    clusterOverrides:
    - path: /spec/replicas
      value: 3

For Deployment and ReplicaSet, KubeFed supports dividing the desired replicas across multiple clusters based on ReplicaSchedulingPreference (RSP). Users can configure the weights, minimum replicas, and maximum replicas for each cluster, and the RSP controller computes a valid replica distribution and updates the Placement and Overrides fields of FederatedDeployment or FederatedReplicaSet.

RSP Scheduling (Image credit: https://www.kubernetes.org.cn/5702.html)

KubeFed laid the foundation of Kubernetes cluster federation at ByteDance. However, we soon found KubeFed unable to meet our production requirements. The primary pain points were:

Uneven resource utilization across clusters – KubeFed’s RSP only supports static cluster weights and lacks the ability to adapt to fluctuations in cluster resources dynamically.
Service disruption after rescheduling – During rescheduling, replicas might be abruptly migrated between clusters, disrupting service availability.
Limitations in scheduling semantics – KubeFed supports stateless, replica-based resources through RSP, but lacks support for more diverse resources such as stateful workloads and jobs. Moreover, extending the existing scheduling semantics is difficult.
High onboarding cost – KubeFed requires the creation of federated objects and is incompatible with the native Kubernetes API. Users and downstream platforms need to completely overhaul their usage patterns.

KubeAdmiral

With the evolution of cloud native infrastructure at ByteDance, we raised our standards for efficiency, scalability, performance, and cost. Meanwhile, the size and number of our Kubernetes clusters continue to grow phenomenally along with the businesses. Additionally, workloads beyond stateless microservices, including stateful services, storage, offline and machine learning jobs, started embracing cloud native technologies. Against this backdrop, the limitations of KubeFed became increasingly difficult to manage. Therefore, at the end of 2021, we began our endeavor to develop the next generation cluster federation system, building upon KubeFed v2’s foundation. We named it KubeAdmiral to capture our aspiration for it to manage multiple clusters as effectively as a seasoned navy admiral commands a fleet.

Timeline of Kubernetes at ByteDance

KubeAdmiral offers enhanced multi-cluster orchestration and scheduling capabilities for various mainstream business scenarios. Today at ByteDance, KubeAdmiral manages more than 100,000 microservices with more than 10,000,000 pods running on dozens of federated Kubernetes clusters. It supports upwards of 30,000 upgrade and scaling operations daily, and maintains a stable deployment rate of 95-98% without the need for manual intervention.

KubeAdmiral Feature Highlight

KubeAdmiral not only supports native Kubernetes resources and third-party custom resources, but also offers a rich and extensible scheduling framework. Moreover, it refines numerous aspects of scheduling and dispatching, backed by years of practical production experience.

KubeAdmiral architecture diagram

Rich Multi-Cluster Scheduling Capabilities

The scheduler is a core component of KubeAdmiral responsible for computing the desired placement of workloads in member clusters. When scheduling replica-based workloads, it also computes the appropriate replicas for each cluster. Functioning as KubeAdmiral’s “brain”, its decisions directly impact critical aspects such as fault tolerance, resource efficiency, and stability.

KubeFed provides the RSP scheduler for replica-based workloads, but its customizability and extensibility are very limited, and modifying its behavior requires code modification. Additionally, it lacks support for stateful services, job-like resources, etc., which require different sets of scheduling semantics.

KubeAdmiral introduces more comprehensive scheduling semantics. It supports more flexible and fine-grained mechanisms to select clusters via labels, taints, etc, and score clusters based on resource utilization, affinity, and so on. Beyond just replica-based workloads, it also supports scheduling stateful workloads and job-like resources. Additionally, it brings about convenient features such as automatic dependency scheduling (dependencies such as ConfigMaps can automatically follow their Deployment to corresponding member clusters). The scheduling behavior can be configured using a PropagationPolicy object, as shown below:

apiVersion: core.kubeadmiral.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: mypolicy
  namespace: default
spec:
  # Many different ways to select clusters.
  placement:
  # Manually specify desired clusters and replica weights, if required.
  - cluster: cluster-01
    preferences:
      weight: 4
  - cluster: cluster-02
    preferences:
      weight: 3
  - cluster: cluster-03
    preferences:
      weight: 4
  # Filter clusters based on label selectors.
  clusterSelector:
    IPv6: "true"
  # Filter clusters based on affinity.
  clusterAffinity:
  - matchExpressions:
    - key: region
      operator: In
      values:
      - us-east
  # Filter clusters based on taints and tolerations.
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
  # Mode of scheduling - divide or duplicate.
  schedulingMode: Divide
  reschedulePolicy: 
    # Only schedule on creation and do not reschedule afterwards.
    # Suitable for stateful workloads.
    disableRescheduling: false
    # When rescheduling should be triggered.
    # More triggers: reschedule more frequently - favor agility.
    # Fewer triggers: reschedule less frequently - favor stability.
    rescheduleWhen:
      policyContentChanged: true
      clusterLabelsChanged: false
    # Whether to rebalance replicas on reschedule.
    # Enabling rebalance results in optimal placement, but at the potential cost
    # of disrupting existing replicas.
    replicaRescheduling:
      avoidDisruption: true
  # Limit propagation to a single cluster.
  # Suitable for job-like workloads.
  maxClusters: 1

Instead of writing Overrides manually, KubeAdmiral supports generating Overrides based on OverridePolicy:

apiVersion: core.kubeadmiral.io/v1alpha1
kind: OverridePolicy
metadata:
  name: example
  namespace: default
spec:
  # Flexible ways to select target clusters.
  overrideRules:
  - targetClusters:
      # Select clusters by name.
      clusters:
      - on-prem-1
      - edge-1
      # Select clusters by label.
      clusterSelector:
        region: us-east
        az: az1
      # Select clusters by affinity.
      clusterAffinity:
      - matchExpressions:
        - key: region
          operator: In
          values:
          - us-east
      # Change the container image in the target clusters using jsonpatch.
      overriders:
        jsonpatch:
        - path: "/spec/template/spec/containers/0/image"
          operator: replace
          value: "nginx:test"

Scheduler Extension

Taking inspiration from kube-scheduler’s design, KubeAdmiral offers a flexible scheduling framework. It simplifies the scheduling process by dividing it into four distinct stages: Filter, Score, Select, and Replica.

Each stage is handled by individual plugins, creating a logical separation that promotes modularity. For instance, in the provided PropagationPolicy example above, most behaviors are implemented through built-in scheduling plugins. The beauty of this approach is that plugins can be easily added or removed, without any impact on the remaining plugins. This greatly simplifies the scheduler logic and reduces its overall complexity. Although the built-in plugins in KubeAdmiral offer versatile features that cater to common use cases, users have the flexibility to enhance the functionality by creating their own custom scheduling plugins for specific niche scenarios. This empowers users to seamlessly integrate with internal or existing systems.

The KubeAdmiral scheduler interacts with external plugins via the HTTP protocol, enabling users to extend the scheduling logic with minimal effort and without having to modify the KubeAdmiral control plane. The plugin only needs to output the desired placement, and KubeAdmiral takes care of binding and enforcing those results.

Scheduler stages and plugins

Automatic Migration of Unschedulable Workloads

For replica scheduling, KubeAdmiral calculates the number of replicas that each member cluster should receive and overrides the replicas field in the template before distributing the resources to the member clusters. After the resources are distributed to member clusters, the kube-scheduler in each member cluster assigns the corresponding pods to available nodes. Thus, a full scheduling chain is completed.

Occasionally, there are cases where the kube-scheduler fails to find suitable nodes for pods due to reasons including node outages, resource shortages, and unmet node affinity requirements. If left unaddressed, the unschedulable pods will remain pending. KubeAdmiral resolves this by automatically migrating the unschedulable pods to other clusters, enabling better resource utilization overall.

As an illustration, consider three clusters A, B, and C with an equal weight distribution for six replicas. After the initial scheduling by KubeAdmiral, each cluster receives two replicas. If the two replicas in cluster C fail to be scheduled by kube-scheduler after a while, KubeAdmiral automatically shifts them to clusters A and B, ensuring the desired availability of 6 replicas across all clusters.

Cluster	A	B	C
Weight	10	10	10
Replicas before scaling down	15	15	15
Change	-10	-11	0
Replicas after scaling down	5	4	0

Dynamic Replica Distribution Based on Real-Time Resource Availability

In a multi-cluster setup, the resource utilization of each cluster fluctuates as machines go online or offline. Relying solely on the static weight replica scheduling provided by KubeFed RSP can easily lead to skewed resource utilization. Clusters with a high deployment rate are prone to pod pending during upgrade, while clusters with a low deployment rate have idle resources that are wasted.

As a solution to this, KubeAdmiral introduces dynamic weight scheduling based on real-time cluster resource utilization. It calculates the amount of available resources by collecting the total and allocated resources of each cluster, and uses it as the weight for replica scheduling. This ultimately achieves dynamic load balancing across all member clusters. In practice, we are able to maintain a stable deployment rate of 95-98% or above in all member clusters with this approach.

Refined Replicas Rescheduling

KubeFed’s replica rescheduling algorithm usually results in less than ideal distributions for scaling operations. As an illustration, consider 30 replicas currently distributed to 3 member clusters A, B, and C with equal weights. If the workload is scaled down to 9 replicas, KubeFed has 2 possible behaviors depending whether the user enables rebalance:

If rebalance = false, KubeFed retains existing replicas, disregarding cluster weights.
If rebalance = true, KubeFed disregards current distribution and rebalances replicas based on weights.

As seen above, KubeFed is unable to devise a distribution that satisfies fault tolerance and load balancing requirements without compromising service availability. To address this, KubeAdmiral developed a refined replica rescheduling algorithm that guarantees service availability and produces distributions that are as close to the optimal distribution as possible. The gist of the algorithm is to distribute the increment or decrement in replicas, instead of the total replicas.

Using the same scenario of scaling down from 30 replicas to 9 replicas above, the refined algorithm roughly proceeds as follows:

Current distribution = [15, 15, 0]; total current replicas: 30
Desired distribution = [3, 3, 3]; total desired replicas: 9
Distance = desired – current = [-12, -12, 3]; total distance: -21
For scaling down, remove any positive distance terms; distance = [-12, -12, 0]
Distribute the total distance -21 using the distance vector [-12, -12, 0] as weights; adjustments = [-10, -11, 0]
Final distribution = current + adjustments = [15, 15, 0] + [-10, -11, 0] = [5, 4, 0]

Cluster	A	B	C
Weight	10	10	10
Replicas before scaling down	15	15	15
Change	-10	-11	0
Replicas after scaling down	5	4	0

Support for Native Kubernetes Resource API

Unlike KubeFed, which requires users to use an incompatible “federated” API, KubeAdmiral caters to the usage habits of single-cluster Kubernetes users by providing support for native Kubernetes APIs. After the user creates a native resource (such as Deployment), KubeAdmiral’s federate-controller automatically converts it into an internal object for use by downstream KubeAdmiral controllers. This enables users to quickly transition from a single-cluster to a multi-cluster architecture with low onboarding cost.

However, KubeAdmiral doesn’t stop there. In a single cluster, Kubernetes controllers update the status of resources to reflect their current state. For example, a Deployment‘s status reflects its rollout progress and number of replicas it currently has. Users or upper-layer systems often rely on such status. In a multi-cluster environment, the status is populated on individual Deployments propagated to member clusters. Users must check the status of resources in each cluster individually, leading to a fragmented perspective and reduced operational efficiency.

To solve this problem and seamlessly support native resources, KubeAdmiral introduces status aggregation. The KubeAdmiral status-aggregator collects and aggregates the status of individual resources from member clusters and writes it back to the native resources. This allows users to observe the global resource status at a glance.

Final Thoughts

KubeAdmiral has been incubating within ByteDance for a while, and has been an integral part of ByteDance’s internal PaaS platform TCE. Battle-tested by large-scale applications, it has accumulated many valuable practical experiences. To give back to the community, KubeAdmiral has officially been open-sourced on GitHub.

Looking forward, we plan to continue working on KubeAdmiral, especially in the following areas:

Continue to improve the orchestration and scheduling capabilities of stateful and job-like workloads, and develop advanced capabilities such as automatic migration and cost-based scheduling to embrace the new era of multi-cloud batch computing.
Improve user experience and further reduce users’ cognitive burden, striving for a pleasant out-of-the-box experience.
Improve observability, optimize logging and metrics, and enhance the scheduler’s explainability.
Explore features such as one-click migration from single cluster, further smoothening the onboarding experience.

Multi-cluster orchestration and scheduling is not a simple topic. We hope our experience and solution could be useful to the community. We look forward to more friends joining the KubeAdmiral community, and welcome everyone to try KubeAdmiral and give us suggestions!

GitHub repo: https://github.com/kubewharf/kubeadmiral

Gödel Scheduler open-sourced: a unified scheduler for online and offline workloads

XIAOXU CHANG — Wed, 03 Apr 2024 09:16:53 +0000

Background

Since its open-source release in 2014, Kubernetes has rapidly become the de facto standard for container orchestration. The infrastructure team at ByteDance adopted Kubernetes early on to build our private cloud platform. Over the years, ByteDance’s rapid growth across various business lines, including microservices, recommendation/advertising/search services, machine learning & big data, and storage, has led to a substantial increase in the demand for computing resources.

Initially, ByteDance managed its online and offline workloads with separate resource pools, each dedicated to distinct business segments. To accommodate the surge in online business demands during significant holidays and major events, the infrastructure team usually needed to plan ahead by reallocating resources from offline to online pools to bolster the capacity for handling increased online activities. While this temporary fix satisfied immediate requirements, the inter-pool resource borrowing process proved to be time-consuming, operationally heavy, and inefficient.

Furthermore, maintaining separate resource pools for online and offline workloads resulted in significant colocation costs, leaving little scope for enhancing resource utilization.

Therefore, the infrastructure team sought to implement a unified system for scheduling and managing both online and offline workloads. This initiative aimed to facilitate resource pooling, enhance resource utilization and elasticity, optimize costs and user experiences, and alleviate operational burdens.

Practice of Unified Scheduling

Enhancement beyond Kubernetes Default Scheduler:

Since extensive use of Kubernetes in 2018, ByteDance continuously optimized various components of Kubernetes for functionality and performance. However, with the containerization of recommendation/advertising/search services in 2019, the native Kubernetes scheduler, in terms of both functionality and performance, was farther away from meeting ByteDance’s business requirements.

In terms of functionality, more granular resource scheduling capabilities and flexible preemption strategies were required.
In terms of performance, the native Kubernetes default scheduler could only achieve a scheduling throughput of around 10 pods per second in a cluster of 5000 nodes, often causing business upgrades to be bottlenecked, far from meeting the requirements.

Therefore, the team introduced a number of key optimizations to the Kubernetes default scheduler, including:

Functionality:

Extended the scheduling capabilities to support non-native resources, such as memory bandwidth and network bandwidth.
Supported for micro-topology scheduling.
Refactored preemption implementation by providing a pluggable preemption framework to support extending preemption capabilities.

Performance:

Optimized the data synchronization mechanism between Scheduler cache and Snapshot by refactoring data structure and further strengthening the concept of incremental update between snapshots.
Cached scheduling results for homogenous scheduling units to reduce redundant calculations and improve efficiency.
Optimized preemption implementation by reorganizing preemption-related data structures, applying pruning timely, and reducing unnecessary computation.

By implementing the aforementioned optimizations, we successfully enhanced our containerization capabilities to meet ByteDance’s rapidly expanding needs. This led to a remarkable 30-fold increase in scheduling throughput. That is, in a cluster comprising 10,000 nodes, we consistently achieved a scheduling rate of 300 pods per second.

Gödel Scheduler

In 2020, ByteDance initiated a unified scheduling and resource management project for both online and offline business operations. The objective was to enhance overall resource utilization, improve operational efficiency, and reduce maintenance overheads. Initially, the plan involved managing both online and offline tasks through a singular scheduling system. However, this approach presented challenges, primarily due to the intricate nature of offline scheduling, which differed markedly from online processes, especially in its demand for high throughput.

The native Kubernetes scheduler, primarily designed for Pod-level scheduling, was somewhat limited in its support for more complex “Job” scheduling semantics and encountered performance limitations when dealing with these higher-level demands. To effectively address these unique requirements and to better align with the diverse operational needs of ByteDance, the decision was made to develop a bespoke, in-house distributed scheduler. This led to the creation of the Gödel Scheduler, specifically tailored to integrate with the Kubernetes system and to handle the demanding and varied scheduling needs of ByteDance’s expansive and evolving business landscape.

The Gödel Scheduler is a distributed system crafted to consolidate the scheduling of both online and offline workloads. This scheduler is an enhancement of the Kubernetes (K8s) Scheduler, designed to augment scalability and improve scheduling quality. It is adept at fulfilling the diverse functional and performance demands of ByteDance’s online and offline operations. Key features of the Gödel Scheduler include:

Optimistic Concurrency: It incorporates optimistic concurrency concepts, moving the most time-consuming unit-to-node matching (filtering and scoring) to the scheduler component. This allows for concurrent execution and improves scheduling throughput in large-scale clusters.
Two-Layer Scheduling Abstraction (Unit and Pod) and Framework: This provides more flexible batch scheduling capabilities, better supporting offline operations while also improving scheduling throughput and system scalability. The extended framework handles special scenarios more effectively.
Rich Functionality and High Performance: It meets the demands of various operations including online, offline (batch and stream), and training tasks, achieving true unified scheduling.
Compatibility with the Kubernetes Ecosystem: It can serve as an alternative to the K8s Scheduler, but due to performance and architectural optimizations, its framework interface is not entirely the same as the K8s Scheduler. However, its extensibility remains unaffected, and it can implement scheduling plugins like Kubernetes.

Below is the architecture diagram of Gödel Scheduler.

As outlined, the Gödel Scheduler consists of three primary components: the Dispatcher, the Scheduler, and the Binder. Key to its architecture is the Scheduler component, which is typically deployed in multiple shards to facilitate optimistic concurrency scheduling. This multi-shard deployment enhances its efficiency and scalability. On the other hand, the Dispatcher and the Binder are each deployed as single instances, a configuration that suits their specific roles and responsibilities within the Gödel Scheduler system.

Dispatcher

The Dispatcher plays a pivotal role in managing application queuing, distribution, and node partitioning. It is comprised of several key components:

Sort Policy Manager: This module handles the queuing of applications. Currently, it implements FIFO and DRF/FairShare queuing strategies, the latter still pending production use. Future enhancements will introduce more sophisticated queuing strategies, including those based on priority values.
Dispatching Policy Manager: Its primary function is to allocate applications across various Scheduler instances. At present, the LoadBalance strategy is employed as the default. Future updates aim to make this feature more versatile and plugin-based.
Node Shuffler: This component is tasked with organizing cluster nodes relative to the number of Scheduler instances. It assigns each node to a specific node partition, with each Scheduler instance overseeing one partition. During the scheduling process, a Scheduler first considers nodes within its partition before exploring nodes in other partitions. This arrangement is dynamically adjusted in response to changes in node availability or Scheduler count.
Partition Rules: Currently, the system strives for an even distribution of nodes among Scheduler instances. Plans are underway to expand these partition strategies, enhancing their configurability.
Scheduler Maintainer: This element is responsible for monitoring the status of Scheduler instances. It tracks aspects like health status, workload, and the node count within each partition.
Reconciler: Operating periodically, the Reconciler oversees the status of various elements like Pods, Nodes, Schedulers, and SchedulingUnits. It addresses any errors, discrepancies, or deficiencies, ensuring system integrity and performance.

Scheduler

The Scheduler plays a critical role in the decision-making process for scheduling and preempting applications, although it does not execute these decisions itself (that task is handled by the Binder). It operates on a two-tier framework: the Unit Scheduling Framework and the Pod Scheduling Framework. The entire scheduling procedure is segmented into three principal phases: Node Organizing, Unit Scheduling, and Unit Preempting.

Node Organizing: This phase involves filtering and sorting nodes to streamline the scheduling process and enhance its quality. It consists of two types of plugins:
- Locating Plugins: These filter nodes are based on specific application information.
- Grouping Plugins: These group nodes according to available resources or Job-level affinities.
Unit Scheduling: In this stage, nodes are matched and scored in alignment with application requests that have been filtered through the Node Organizing plugins. This process is analogous to the Kubernetes (K8s) Scheduler framework, encompassing:
- Filtering Plugins: These filter nodes are based on the requisites of the application requests.
- Scoring Plugins: These assign scores to nodes that have been filtered in the previous step.
Unit Preempting: If suitable nodes are not found during the Unit Scheduling phase, the Scheduler progresses to the preemption phase. Here, it tries to free up resources by preempting running application instances to make room for new ones. This phase includes:

Victims Searching: Identifying potential victim applications that can be preempted.
Candidates Sorting: Sorting both nodes and potential victims to identify the most appropriate choices for preemption.

Binder

The Binder plays a crucial role in the final stages of the scheduling process, focusing on conflict detection, preemptive operations, and executing the binding of applications to resources. It comprises three main components: ConflictResolver, PreemptionOperator, and UnitBinder.

ConflictResolver: This component is tasked with detecting concurrent conflicts in the scheduling process. It operates in two modes:
- Cross Node Conflict Resolver: Checks for conflicts that might occur across different nodes.
- Single Node Conflict Resolver: Identifies conflicts within a single node. If any conflict is detected, the application is immediately rejected and rescheduled.
PreemptionOperator: In scenarios where no conflict exists but preemption is necessary, this operator takes charge. It executes the preemption by deleting the victims (applications or processes that need to be terminated to free up resources) and then awaits the final scheduling.
UnitBinder: This part of the Binder is responsible for the preparatory work required before binding, such as dynamically creating storage volumes, and then carries out the actual binding operation, linking applications to the designated resources.

Noteworthy, the current version of the Binder integrates a PodGroup controller. This controller is responsible for managing the state and lifecycle of PodGroups. However, it’s important to note that in a future version we plan to remove this functionality from the Binder, transitioning it into an independent controller.

Experience

Over the past two years, the Gödel Scheduler has been a cornerstone within ByteDance, offering a wealth of scheduling features and semantics. It has efficiently and reliably supported the operations of ByteDance’s diverse and complex business workloads.

Building upon the foundation of architectural enhancements, ByteDance has implemented profound performance optimizations drawing from its experience with the Kubernetes native scheduler. Integrated with ByteDance’s internally refined Kubernetes system, the Gödel Scheduler now boasts an impressive throughput: 2000+ pods/s in a single shard and 5000+ pods/s across multiple shards. ByteDance’s ongoing efforts to expand single-cluster capacity have culminated in their largest prod cluster reaching over 20,000 nodes and more than 1,000,000 pods.

After years of thorough internal practices and enhancements within ByteDance, Gödel Scheduler has achieved a state of relative stability. In 2023, the top-notch cloud computing conference, SoCC, accepted our paper on Gödel Scheduler, highlighting ByteDance’s unified approach to large-scale resource management and scheduling. The RD team was also invited to present the work at the conference. For those interested, the Gödel Scheduler paper is available at https://dl.acm.org/doi/10.1145/3620678.3624663.

With a commitment to contributing to the open-source community, the Bytedance team decided to open-source the Gödel Scheduler, offering a new scheduling solution that enhances cloud-native experiences for both online and offline services through its outstanding performance and comprehensive scheduling capabilities.

Future Work

Looking ahead, ByteDance is committed to the continual development of the Gödel Scheduler, focusing on enriching its features and enhancing its scalability. A significant area of attention will be optimizing the scheduling throughput in specific challenging scenarios, such as those involving high rates of deployment and frequent preemptions. Through innovative rescheduling strategies, ByteDance aims to tackle the intricate balance between maintaining scheduling performance and enhancing its quality. The overarching goal is to not only preserve the current scheduling throughput but also to substantially elevate the quality of scheduling.

Moreover, ByteDance places a high priority on ecosystem development. Efforts will be made to ensure Gödel Scheduler’s compatibility with leading systems and frameworks used in various business applications. This initiative will include integration with prominent big data and machine learning frameworks, accompanied by practical usage examples and comprehensive documentation.

To keep the community engaged and informed, a detailed roadmap for the Gödel Scheduler will be methodically laid out and made available on the Gödel Scheduler Repository. This will provide an opportunity for interested parties to track progress, contribute, and become active participants in the project.

While the Gödel Scheduler has undergone numerous iterations within ByteDance, been rigorously tested in various scenarios, and demonstrated its effectiveness, ByteDance acknowledges that there is still considerable potential for advancement in terms of generality and standardization. ByteDance warmly invites and encourages members of the community to join in the development of the Gödel Scheduler, believing that collaborative efforts will lead to even greater improvements and innovations.

Gödel Scheduler Project Repository: https://github.com/kubewharf/godel-scheduler

Katalyst: A QoS-based resource management system for workload colocation on Kubernetes

XIAOXU CHANG — Mon, 01 Apr 2024 11:01:26 +0000

This Blog originally published on Katalyst’s blog by Pengcheng Tang

The resource usage of web applications tends to fluctuate with changes in the number of visitors, displaying noticeable tidal characteristics. To ensure stability, service providers often allocate resources for their applications according to resource usage during peak periods. These resources can easily be underutilized during off-peak hours.

If these idle resources can be reclaimed and temporarily allocated to lower-priority services and returned promptly to online services when needed, the overall resource utilization rate can be significantly improved.

ByteDance colocation practices

ByteDance operates at a massive scale with diverse business types, encompassing various categories such as microservices, advertising, machine learning, big data, and storage. Typically, different business types have distinct resource management requirements at the infrastructure level. The conventional approach involves segmenting resource pools based on business lines or service types to meet customized demands.

However, this method of resource pool segmentation can lead to resource silos, preventing flexible resource sharing and hindering the overall efficiency of resource utilization and cost optimization. It also adds to the operational burden of managing clusters.

Furthermore, considering that different types of businesses have complementary SLO requirements and resource usage patterns, the infrastructure team aims to leverage these characteristics fully. They do so through scheduling and control mechanisms to optimize resource efficiency, achieve the convergence and consolidation of resource pools, and assist business teams in attaining lower resource costs and greater elasticity.

To achieve unified resource management, ByteDance began building a unified infrastructure based on Kubernetes in 2016. At the current stage, ByteDance has essentially completed the containerization of all microservices, advertising, and a significant portion of machine learning and big data businesses. Throughout this process, the infrastructure team has continued to explore resource optimization methods under a unified resource pool and gradually developed a resource pool deployment approach that combines ’elastic scaling’ and ‘colocation.’ Elastic Scaling: This enables machine-level and Numa-level resource time-sharing, combining business and system metrics to guide horizontal and vertical scaling strategies for application instances. This ultimately allows offline services to purchase more idle resources at a lower cost, and online services to purchase more peak-time resources at a higher cost through resource market-oriented operations, leading to overall efficiency improvement.

Colocation: It offers the ability to oversell resources, making full use of ‘sold but underutilized resources’ in the cluster to deploy more low-priority tasks. Simultaneously, we enhance resource isolation mechanisms across multiple dimensions such as CPU, memory, disk, and network at the system level. Minute-level control mechanisms, combined with intelligent load prediction algorithms, are adopted to ensure service stability according to their SLOs.

This solution combines Kubernetes and Yarn systems for joint control. It runs control components of both Kubernetes and Yarn on the same machine, and coordinates the allocatable resources between the two systems through a central coordination component. On top of this joint control system, we achieve real-time resource estimation based on service resource profiles, ensuring more flexible and dynamic resource allocation while meeting various service SLA requirements.

During the implementation of this colocation solution, the infrastructure team verified the feasibility of resource pooling, constructed the foundational capabilities for colocation, and improved the overall utilization of core clusters from 23% to 60%.

Katalyst: From Internal Validation to Open Source

After undergoing extensive testing with businesses like Douyin and Toutiao, which have large-scale tidal traffic, ByteDance’s cloud-native colocation practices have matured. In order to help end users in cloud native community understand the principles behind large-scale colocation practices and improve overall resource efficiency of their own, we have refactored and enhanced the resource management system in a Kubernetes Native manner and built the resource management system “Katalyst”, which has now been officially open-sourced. The name “Katalyst” is derived from the word “catalyst,” and the ‘K’ symbolizes its ability to provide enhanced automation for resource management for all workloads running within the Kubernetes ecosystem.

What Is Katalyst

Katalyst originated from ByteDance’s colocation practices and has been extended and supplemented in terms of resource management capabilities:

Developed entirely within the context of hyperscale colocation practices, achieving true reuse of internal and external technology systems.
Built based on a plugin-based architecture, allowing users to customize various modules such as scheduling, control strategies on top of the Katalyst Framework.
Provides one-click deployment templates and comprehensive operation manuals, reducing the understanding and deployment costs for end users.

Resource Abstraction

The native Kubernetes Quality of Service (QoS) system does not meet the requirements of large-scale production environments, prompting Katalyst to build a QoS system of its own. Katalyst defines four QoS classes including dedicated_cores, shared_cores, reclaimed_cores and system_cores. Users can assign different QoS class to their applications according to different QoS requirements. Based on our practices in Bytedance, CPU, in most scenarios, is a dominant resource that can affect applications’ performance and users tend to associate QoS requirements with CPU as well. So despite the fact that QoS requirement encompasses various kinds of resources(i.e. CPU, memory, disk io, network bandwidth etc), we named it after CPU. Each level is accompanied by various enhancement mechanisms (e.g., whether NUMA node binding is required or if network affinity and bandwidth restrictions are necessary), enabling differentiated resource allocation and control strategies.

Through this abstract resource model, Katalyst provides users with a unified resource entry point. Users can accurately express their specific needs by mapping business services to the appropriate QoS class and sales model based on business requirements. This ultimately allows users to obtain resources from a unified resource pool without needing to delve into the underlying details.

Overall Architecture

In the early stages, the colocation architecture had several issues: although the joint control of Kubernetes and Yarn systems achieved colocation of online and offline businesses, the complexity of the system incurs more maintenance costs. Additionally, this architecture led to resource usage overhead. It comes from the resource consumption of agents running on each node. While resource consumption on an individual node is not significant, the accumulated overhead in hyperscale clusters can be substantial. Moreover, the use of two control systems increased system complexity. Any abnormalities at any stage could lead to resource calculation errors. In Katalyst, we optimized and refactored the overall colocation architecture:

At the control layer, we integrated the fusion system based on both Kubernetes and Yarn systems into a single Kubernetes-based system. Specifically, we retained the API entry points for Kubernetes and Yarn at the access layer, while unifying metadata management and resource control implementation within Katalyst, which is Kubernetes-native.

At the scheduling layer, Katalyst implemented a coordinated resource scheduling and control mechanism between “centralized scheduling” and “node resource management” based on unified metadata.

On the node side: Katalyst extends Kubernetes with a module named QoS Resource Manager (QRM). This module enables plugin-based node-level topology affinity allocation and reports the topology to the control plane through custom CRDs, facilitating scheduling processes. At runtime, Katalyst continuously makes estimation for resource allocation according to system metrics, service level indicators and QoS requirements of the pods. The allocation result is then dynamically sent to the Container Runtime Interface (CRI) through the QRM reconcile loops. The resource estimation algorithm and QRM implementation can be customized through plugins, making resource control strategies more adaptable to different business scenarios.

On the scheduling side: Katalyst extends scheduler with richer scheduling capabilities through scheduler framework. During scheduling, it takes into account how applications of different QoS classes should allocate and collaborate resources when running in the same cluster. The scheduler also combines real-time data and service profiles to do dynamic rebalancing across the entire cluster, reducing cluster vacancy rates and enhancing business stability.

Lastly, under a unified control system, we can fully leverage Kubernetes’ advantages in API design. By decoupling internal systems and generalizing control strategies through custom CRDs, we are able to iteratively improve the system through a plugin-based approach, achieving true convergence between internal and external systems.

RoadMap

Katalyst, as a resource management system, has colocation as one of its core business scenarios. In addition to abstracting the core concepts mentioned above, we have provided and planned various QoS capabilities for Katalyst:

Fine-grained resource estimation strategies: Katalyst supports multiple resource estimation strategies including heuristics, unsupervised learning, and QoS-aware algorithms, improving resource utilization by accurately calculating and predicting the amount of resources that can be reclaimed from the nodes.
Multi-dimensional resource isolation capabilities: Using technologies such as cgroup, RDT, iocost, tc, etc., Katalyst achieves effective isolation of various resources, including CPU, memory, disk, and network, in different colocation scenarios.
Multi-level load eviction strategies: Katalyst supports multi-level eviction strategies based on various metrics, ensuring online business QoS while maximizing offline business QoS.

Besides colocation, Katalyst also provides enhanced resource management capabilities:

Recommendation and autoscaling: Katalyst provides enhanced VPA/HPA capabilities and advanced recommendation algorithms. This helps end users make a more accurate estimation of pod resource request/limit or replica numbers, hence improve deployment rates and resource utilization.
Tidal(Exclusive-mode) colocation: While colocating online and offline applications on the same node gives more resource efficiency improvement, it requires all the infrastructure intricacies(e.g. Resource isolation, scheduling etc.) to work smoothly, which makes the overall system complicated. Katalyst provides an exclusive-mode colocation where the resource is reclaimed in node granularity so that either online or offline applications can run on the same node simultaneously. This allows users to improve resource efficiency at a lower operation cost.
Node overcommitment: With node overcommitment, Katalyst allows the scheduler to schedule more pods to a node without end users’ awareness. Meanwhile, Katalyst adopts methods like interference detection and mitigation, node resource prediction algorithm and so forth to guarantee the QoS requirement of higher priority tasks.
…

For detailed plans, please refer to the roadmap. Besides, we have just given a brief introduction about Katalyst at Kubecon China. You can also refer to our talk for more information.

While colocation has undergone several iterations within ByteDance, a universal, standardized platform foundation must be refined through various scenarios. We look forward to your participation in the Katalyst community and sharing your scenarios and requirements for colocation, resource efficiency improvement and so forth.

GitHub| https://github.com/kubewharf/katalyst-core

Monorepos with Rspack Just Makes Things Easier

Josh Alphonse — Tue, 06 Feb 2024 23:17:00 +0000

Introduction

When it comes to web dev, it's hard to keep up with all the latest trends. I also can't come up with many examples of ecosystems that move faster than JavaScript's. It feels as if a new version of the iPhone is released every month. However, just because things are moving fast, doesn't mean you have to be left in the dust.

Monorepos are one of the hottest additions to the scene especially for web developers. One of the biggest challenges when building a full stack app is managing complexity. The larger your application becomes, the more complex it gets and it doesn't matter how big your organization or team may be. You've probably experienced this kind of struggle if you're building a product that has many different front-end code bases that share backend's and interfaces.

The way orgs are solving this issue now is with the use of monorepos. Monorepos help mitigate the confusion that comes along with having to manage multiple repositories for multiple distinct projects/products. A monorepo can consolidate your projects all into one repository. Contrary to the name, monorepos are technically not monolithic. A common misconception is that if you use a monorepo, every project in the repository have to be released on the same day. This acutally isn't the case. Just because we are developing our code in the same place doesn't mean we have to deploy everything at the same time nor to the same place. Since your code bases can exist in one repository, it makes common tasks like code sharing and refactoring way easier. This leads to significantly lowering the cost that comes with creating libs, microservices and microfronents. If you want true development flexibility, then you have to give this approach a try. This is why some of the biggest companies are implementing monorepos.

Rspack

So, where does Rspack come into play here?

As you may know, Rspack is a high performance web bundler that offers interoperability with the webpack ecosystem but also build systems like Nx to build smart monorepos! Rspack can bring several key benefits when used in conjunction with NX, primarily by enhancing the performance aspects of the development and build processes. Here are a few ways Rspack can positively impact an NX-powered project :

Speeding up builds:Rspack is known for its quick bundling capabilities. In complex NX workspaces with multiple applications and libraries, faster bundling translates to quicker build times. This is especially beneficial when you're dealing with large-scale projects where build times can significantly impact developer productivity and continuous integration/deployment pipelines.
Reducing setup time:Rspack has a reputation for requiring less configuration out of the box compared to other bundlers, which can simplify the setup process. In an NX workspace, this means you can get applications up and running more quickly, and you can spend less time adjusting bundler configurations and more time developing features for the e-commerce store.
Efficient code splitting: Rspack can facilitate code splitting, which means that each page or feature can have its bundle. This ensures that users are only downloading the code necessary for the page they're visiting, improving load times and enhancing the user experience.
Providing Hot Module Replacement (HMR)While NX manages the overall structure and dependencies in your monorepo, Rspack can provide Hot Module Replacement for a smoother development experience. HMR allows changes to be applied to a running application without needing a full reload, preserving the application state and speeding up the development process.

Although monorepos can solve a bunch of problems for applications of any size, there is one feature in particular that I'm going to highlight in this blog post. That feature being the ability to manage aliases with Rspack and Nx.

Defining aliases with rspack in an NX monorepo provides multiple benefits and solves several issues commonly faced in large-scale projects. Let's take a look before we dive into the tutorial:

Simplified Imports: With aliases, you can avoid long and complicated relative paths that are hard to read and maintain. Aliases provide a clear, concise way to import modules.

// Without alias... How do you even know where your functinos are?! Half the time in guesssing.
import utility from '../../../../libs/utils/src/utility';

// With alias
import utility from '@utils/utility';

Easier Refactoring: When you decide to rearrange your project's folder structure, you won't need to update every import statement. Instead, you only update the alias paths in the rspack configuration.
Code Readability: Aliases can help signify the intent or the origin of the imported module more clearly than relative paths.
Consistency Across Apps and Libs: In a monorepo, you typically have multiple apps and libraries. Aliases ensure that every part of your project refers to shared libs in the same way.
Namespace Clarity: By using aliases that represent features or shared libraries (like @feature-a or @shared), you provide a clear namespace. This indicates that the import is from a shared source and not a local module.
Avoid File Path Errors: As projects grow, relative paths become more prone to errors when files are moved or when the developer is unsure of the current file's depth in the directory structure.
Enhanced Autocompletion: Many IDEs can provide improved autocompletion for imports when aliases are set up properly, making development faster and reducing typos.
Collaboration Enhancement: When working with a team, aliases ensure that everyone is using the same paths to common resources, reducing the cognitive load of understanding where a file is located within the project hierarchy.
Integration with Build Tools: Tools like rspack can understand these aliases and use them to resolve the actual bundle paths during the build process, ensuring the generated bundles reflect the same structure and readability of the source code.
Decoupling the Code: Aliases can help in abstracting the actual file system paths, which means developers can think more about architecture and less about the file system.

Tutorial

Alright, so now that we have some base knowledge of what monorepos are, let's jump into a tutorial. This time around, we're going to set up a monorepo for an e-commerce store using NX and Rspack. We'll configure aliases to simplify module resolution, keeping our import statements clean and readable.

Before we get started

Prerequisites

At the minimum, a basic understanding of Javascript/Typescript and react
Node.js, yarn, or pnpm
Install Nx globally depending on your package manager. I'm using npm add --global nx@latest
Create a project with npx create-nx-workspace myrspackapp --preset=@nx/rspack

Step 1: Creating an NX Workspace

First, we need to create a new NX workspace by running the following command:

npx create-nx-workspace monrepo-example --preset=@nx/rspack

Choose empty as the preset for full customization. Once the setup is complete, navigate into your new workspace directory:

cd monorepo-example

Step 2: Adding Applications and Libraries

With NX, you can have multiple apps and libraries co-existing in a single monorepo. Since we installed Nx globally, we can use the nx command to generate our libraries.

For this example, let's add an application and two libraries:

nx generate @nrwl/react:application store-front
nx generate @nrwl/react:library ui-shared --directory=shared
nx generate @nrwl/react:library product-data --directory=shared

This will set up a React application (store-front) and two shared libraries (ui-shared and product-data) under the libs/shared/ directory.

Step 3: Configuring Aliases in rspack.config.js

Create a rspack.config.js file at the root of your workspace if it doesn't exist, and define your aliases:

const path = require('path');

module.exports = (config) => {
  config.resolve.alias = {
    ...config.resolve.alias,

    // For that beautiful UI:
    '@ui-elements': path.resolve(__dirname, 'libs/shared/ui-elements/src'),

    // For the business logic behind products:
    '@product-services': path.resolve(__dirname, 'libs/shared/product-services/src'),

    // Continue as your project grows...
  };
  return config;
};

Step 4: Using Aliases in Your Code

Now that we have our aliases, we can use them within our applications and libraries. For example, in our store-front app, we can now import from our shared libraries like so:

import { CheckoutButton, StoreBanner } from '@ui-elements';
import { fetchAllProducts } from '@product-services';

Step 5: Building the Application

To build your application with the new alias configurations, simply use the NX build command:

nx build store-front

NX will invoke Rspack with your custom configuration including the aliases, and you should see a successful build output.

Conclusion

By using aliases with rspack in an NX monorepo, you're essentially laying down a scalable, maintainable foundation that streamlines development workflows and helps manage complexity as your project grows. Keep an eye out for more content and if you have any questions and want to join the community find us on Github and Join us on our ByteDance Open Source Discord Server!

BMF 📹 + Hugging Face🤗, The New Video Processing BFFs

Josh Alphonse — Thu, 01 Feb 2024 01:27:15 +0000

TL;DRif you want to test this tutorial before we start, try it out here

Hugging Face has created a major shift in the AI community. It fuels cutting-edge open source machine learning/AI models and datasets. The Hugging Face community is thriving with great ideas and innovations to the point where the possibilities seem endless.

Hugging Face is revolutionizing Natural Language Processing (NLP) with state-of-the-art solutions for tasks like translation, summarization, sentiment analysis, and contextual understanding. Its arsenal of pre-trained models makes it a robust platform for diverse NLP tasks, streamlining the integration of machine learning functionalities. Hugging Face simplifies the training, evaluation, and deployment of models with a user-friendly interface. The more I used Hugging Face in my own personal projects, the more I felt inspired to combine it with Babit Multimedia Framework (BMF).

If you're reading this and are not familiar with BMF, it's a cross-platform multimedia processing framework by ByteDance Open Source. Currently, BMF is used to process over 2 billion videos a day across multiple social media apps. Can this get complex? Yes, it sure can. However, in this article, I'll break it all down, so you know how to create unique experiences across any type of media platform.

Why BMF?

BMF stands out with its multilingual support, putting it ahead in the video processing game. BMF excels in various scenarios like video transcoding, editing, videography, and analysis. The integration of advanced technologies like Hugging Face with BMF is a game-changer for complex multimedia processing challenges.

Before we get started with the tutorial, let me share with you some ideas I envision coming to life with BMF + Hugging Face:

Multimedia Content Analysis: Leveraging Hugging Face's NLP models, BMF can delve deep into textual data associated with multimedia content, like subtitles or comments, for richer insights.
Accessibility: NLP models can automatically generate video captions, enhancing accessibility for the hard-of-hearing or deaf community.
Content Categorization and Recommendation: These models can sort multimedia content based on textual descriptions, paving the way for sophisticated recommendation systems.
Enhanced User Interaction: Sentiment analysis on user comments can offer valuable insights into user engagement and feedback for content improvement.

What now?

Open Source AI is creating the building blocks of the future. Generative AI impacts all industries, and this leads me to think about how generative AI can impact the future of broadcasting and video processing. I experimented with BMF and Hugging Face to create the building blocks for a broadcasting service that uses AI to create unique experiences for viewers. So, enough about the background, let's get it going!

What we'll build

Follow along, as we'll build a video processing pipeline with BMF that uses the runwayml/stable-diffusion-v1-5 model to generate an image to display as an overlayed image ontop of an encoded video. If that didn't make sense, don't worry, here's a picture for reference:

So why is this significant? The image of the panda is AI generated and combined with BMF , we can put it down a processing pipeline to put it on top of our video. Think about! There could be a scenario where you are creating a video broadcasting service and during live streams, you'd like to display images quickly and display them for your audience with a simple prompt. There can also be a scenario where you are using BMF to edit your videos and you'd like to add some AI-generated art. This tutorial is just one example. BMF combined with models created by the Hugging Face community opens up a whole new world of possibilities.

Let's Get Started

Prerequisites:

A GPU(I'm using google Colab A100 GPU. You can also use v100 or TP4 GPUs but they will just run a bit slower)
Install BMFGPU
Python 3.9-3.10 (strictly required to work with bmf)
FFMPEG

You can find all the BMF installation docs here. The docs will highlight more system requirements if you decide to run things locally.

Getting Started

Begin by ensuring that essential toolkits like Hugging Face Transformers and BMF are installed in your Python environment. Use pip for installation:

Initial Setup

First, we'll clone the following repository to get our video that we want to process(If you are coding along and want to use your own video, create your own repo and simply add a video file, preferably a short video and add to easily clone just like I did. You can also just save the video to the directory you're coding in.)

git clone https://github.com/Joshalphonse/Bmf-Huggingface.git

Install BabitMF-GPU to accelerate your video processing pipeline with BMF's GPU capablities

pip install BabitMF-GPU

Install the following dependencies

pip install requests diffusers transformers torch accelerate scipy safetensors moviepy Pillow tqdm numpy modelscope==1.4.2 open_clip_torch pytorch-lightning

Install ffmpeg.BMF framework utilizes the FFmpeg video decoders and encoders as the built-in modules for video decoding and encoding. It's necessary for users to install supported FFmpeg libraries before using BMF.

sudo apt install ffmpeg

dpkg -l | grep -i ffmpeg

ffmpeg -version

This package below is installed to show the BMF C++ logs in the colab console, otherwise only python logs are printed. This step is not necessary if you're not in a Colab or iPython notebook environment.

pip install wurlitzer
%load_ext wurlitzer

Create a new folder in the directory for the github repository we cloned. We'll need this path later on.

import sys
sys.path.insert(0, '/content/Bmf-Huggingface')
print(sys.path)

Creating the Module

Now it's time for the fun part. We'll create a module to process the video.Here's the module I created and I'll break it down for you below.

import bmf
from bmf import bmf_sync, Packet
from bmf import SubGraph
from diffusers import StableDiffusionPipeline
import torch

model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a photo of a panda eating waffles"
image = pipe(prompt).images[0]

image.save("panda_photo.png")

class video_overlay(SubGraph):

    def create_graph(self, option=None):
        # create source stream
        self.inputs.append('source')
        source_stream = self.graph.input_stream('source')
        # create overlay stream
        overlay_streams = []
        for (i, _) in enumerate(option['overlays']):
            self.inputs.append('overlay_' + str(i))
            overlay_streams.append(self.graph.input_stream('overlay_' + str(i)))

        # pre-processing for source layer
        info = option['source']
        output_stream = (
            source_stream.scale(info['width'], info['height'])
                .trim(start=info['start'], duration=info['duration'])
                .setpts('PTS-STARTPTS')
        )

        # overlay processing
        for (i, overlay_stream) in enumerate(overlay_streams):
            overlay_info = option['overlays'][i]

            # overlay layer pre-processing
            p_overlay_stream = (
                overlay_stream.scale(overlay_info['width'], overlay_info['height'])
                    .loop(loop=overlay_info['loop'], size=10000)
                    .setpts('PTS+%f/TB' % (overlay_info['start']))
            )

            # calculate overlay parameter
            x = 'if(between(t,%f,%f),%s,NAN)' % (overlay_info['start'],
                                                 overlay_info['start'] + overlay_info['duration'],
                                                 str(overlay_info['pox_x']))
            y = 'if(between(t,%f,%f),%s,NAN)' % (overlay_info['start'],
                                                 overlay_info['start'] + overlay_info['duration'],
                                                 str(overlay_info['pox_y']))
            if overlay_info['loop'] == -1:
                repeat_last = 0
                shortest = 1
            else:
                repeat_last = overlay_info['repeat_last']
                shortest = 1

            # do overlay
            output_stream = (
                output_stream.overlay(p_overlay_stream, x=x, y=y,
                                      repeatlast=repeat_last)
            )

        # finish creating graph
        self.output_streams = self.finish_create_graph([output_stream])

Code Breakdown:

Importing Required Modules:

import bmf
from bmf import bmf_sync, Packet
from bmf import SubGraph
from diffusers import StableDiffusionPipeline
import torch

bmf and its components are imported to harness the functionalities of the Babit Multimedia Framework for video processing tasks.
SubGraph is a class in BMF, used to create a customizable processing node.
StableDiffusionPipeline is imported from the diffusers library that allows the generation of images using text prompts.
torch is the PyTorch library used for machine learning applications, which Stable Diffusion relies on.

Configuring the Stable Diffusion Model:

model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

The Stable Diffusion model is loaded with the specified model_id.
The torch_dtype parameter ensures the model uses lower precision to reduce memory usage.
.to("cuda") moves the model to GPU for faster computation if CUDA is available.

Generating an Image Using Stable Diffusion:

prompt = "a photo of a panda eating waffles"
image = pipe(prompt).images[0]
image.save("panda_photo.png")

We then set a text prompt to generate an image of "a photo of a panda eating waffles".
The image is created and saved to "panda_photo.png".

Defining a Custom BMF SubGraph for Video Overlay:

class video_overlay(SubGraph):

video_overlay class is derived from SubGraph. This class will define a custom graph for video overlay operations.

Creating the Graph:

def create_graph(self, option=None):

create_graph method is where the actual graph (workflow) of the video and overlays are constructed.

Processing Source and Overlay Streams:

self.inputs.append('source')
source_stream = self.graph.input_stream('source')
overlay_streams = []

Registers input streams for the source and prepares a list of overlay input streams.

Scaling and Trimming Source Video:

info = option['source']
output_stream = (
    source_stream.scale(info['width'], info['height']).trim(start=info['start'], duration=info['duration']).setpts('PTS-STARTPTS'))

The source video is scaled and trimmed according to the specified option. Adjustments are made for the timeline placement.

Scaling and Looping Overlay Streams:

p_overlay_stream = (
    overlay_stream.scale(overlay_info['width'], overlay_info['height']).loop(loop=overlay_info['loop'], size=10000).setpts('PTS+%f/TB' % (overlay_info['start'])))

Each overlay is scaled and looped as needed, providing a dynamic and flexible overlay process.

Overlaying on the Source Stream:

output_stream = (
    output_stream.overlay(p_overlay_stream, x=x, y=y,
                          repeatlast=repeat_last))

Overlays are added to the source stream at the calculated position and with the proper configuration. This allows multiple overlays to exist within the same timeframe without conflicts.

Finalizing the Graph:

self.output_streams = self.finish_create_graph([output_stream])

Final output streams are set, which concludes the creation of the graph. Now, after this, it's time for us to encode the video and display it how we want to.

Applying Hugging Face Model

Let's add our image as an overlay to the video file. Let's break down each section of the code to explain how it

input_video_path = "/content/Bmf-Huggingface/black_and_white.mp4"
logo_path = "/content/panda_photo.png"
output_path = "./complex_edit.mp4"
dump_graph = 0

duration = 10

overlay_option = {
    "dump_graph": dump_graph,
    "source": {
        "start": 0,
        "duration": duration,
        "width": 1280,
        "height": 720
    },
    "overlays": [
        {
            "start": 0,
            "duration": duration,
            "width": 300,
            "height": 200,
            "pox_x": 0,
            "pox_y": 0,
            "loop": 0,
            "repeat_last": 1
        }
    ]
}

my_graph = bmf.graph({
    "dump_graph": dump_graph
})

logo_1 = my_graph.decode({'input_path': logo_path})['video']

video1 = my_graph.decode({'input_path': input_video_path})

overlay_streams = list()
overlay_streams.append(bmf.module([video1['video'], logo_1], 'video_overlay', overlay_option, entry='__main__.video_overlay')[0])

bmf.encode(
    overlay_streams[0],
    video1['audio'],
    {"output_path": output_path}
    ).run()

Let's break this down too

Defining Paths and Options:

input_video_path = "/content/Bmf-Huggingface/black_and_white.mp4"
logo_path = "/content/panda_photo.png"
output_path = "./complex_edit.mp4"
dump_graph = 0
duration = 10

input_video_path: Specifies the file path to the input video.
logo_path: File path to the image (logo) you want to overlay on the video.
output_path: The file path where the edited video will be saved.
dump_graph: A debugging tool in BMF that can be set to 1 to visualize the graph but is set to 0 here, meaning no graph will be dumped.
duration: The duration in seconds for the overlay to be visible in the video.

Overlay Configuration:

overlay_option = {
    "dump_graph": dump_graph,
    "source": {
        "start": 0,
        "duration": duration,
        "width": 1280,
        "height": 720
    },
    "overlays": [
        {
            "start": 0,
            "duration": duration,
            "width": 300,
            "height": 200,
            "pox_x": 0,
            "pox_y": 0,
            "loop": 0,
            "repeat_last": 1
        }
    ]
}

overlay_option: A dictionary that defines the settings for the source video and the overlay.
For the source, the width and height you want to scale the video to, and when the overlay should start and end are specified.
For the overlays, detailed options such as position, size, and behavior (like loop and repeat_last) are defined.

Creating a BMF Graph:

my_graph = bmf.graph({"dump_graph": dump_graph
})

my_graph is an instance of BMF graph which sets up the processing graph (pipeline), with dump_graph passed as an option.

Decoding the Logo and Video Streams:

logo_1 = my_graph.decode({'input_path': logo_path})['video']
video1 = my_graph.decode({'input_path': input_video_path})

The video and logo are loaded and decoded to be processed. This decoding extracts the video streams to be used in subsequent steps.

Creating Overlay Streams:

overlay_streams = list()
overlay_streams.append(bmf.module([video1['video'], logo_1], 'video_overlay', overlay_option, entry='__main__.video_overlay')[0])

An empty list overlay_streams is created to hold the video layers.
The bmf.module function is used to create an overlay module, where the source video and logo are processed using the video_overlay class defined previously with the corresponding options.

Encoding the Final Output:

bmf.encode(
    overlay_streams[0],
    video1['audio'],{"output_path": output_path}).run()

The final video stream, with the overlay applied, and the original audio from the input video are encoded together into a new output file specified by output_path.
The .run() method is called to execute the encoding process.

Our final output should look something like this:

Thats it! We've explored a practical example of utilizing Babit Multimedia Framework (BMF) a video editing task using AI to create an image we can overlay on a video. Now you know how to set up a BMF graph, decode the input streams, create overlay modules, and finally encode the edited video with the overlay in place. In the future, I will consider adding more AI models, like one to improve the resolution, or even a model that creates a video from text. Through the power of BMF and Hugging Face open source models, users can create complex video editing workflows with overlays that can dynamically change over time, offering vast creative possibilities.

Try it out on CoLab and tell us what you think:

https://colab.research.google.com/drive/1eQxiZc2vZeyOggMoFle_b0xnblupbiXd?usp=sharing

Join us on our ByteDance Open Source Discord Server!

Leveraging GPU Acceleration in BMF for High-Performance Video Processing

Josh Alphonse — Thu, 25 Jan 2024 01:26:05 +0000

Introduction

Multimedia processing and GPU acceleration have become a cornerstone for achieving high performance. Babit Multimedia Framework (BMF) harnesses this power, offering unparalleled speed and efficiency for video processing tasks. In this blog post, we'll explore how BMF utilizes GPU acceleration and provide practical examples to help you integrate this capability into your projects.

Understanding GPU Acceleration in BMF

BMF's architecture is designed to exploit the parallel processing capabilities of GPUs. This is crucial for tasks like video transcoding, real-time rendering, and applying complex filters or effects, where the computational intensity can be staggering.

GPU acceleration is like a turbocharged engine in a sports car, propelling you forward at unimaginable speeds. It's all about doing more in less time. Imagine you're editing a video for your youtube channel or streaming a live esports tournament; every millisecond counts. This is where BMF's GPU prowess shines, slicing through processing times like a hot knife through butter.

BMF has carried out Performance optimization in CPU and GPU heterogeneous scenarios that many FFmpeg existing solutions do not have, and enriched the Pipeline. Taking compression and super resolution scenarios as examples, after statistics, the total throughput of BMF has increased by 15%.

BMF's GPU codec is inherited from FFmpeg, using GPU NVENC, NVDEC and other proprietary hardware to accelerate video codec, and using FFmpeg's CUDA filters to accelerate image preprocessing, which is no barrier for users familiar with FFmpeg. At this stage, BMF supports GPU decoding, encoding and one-to-many transcoding.

Key Benefits of GPU Acceleration:

Speed: GPUs can process multiple operations simultaneously, drastically reducing processing time.
Efficiency: Offloading intensive tasks to the GPU frees up the CPU for other operations, improving overall system performance.
Scalability: As video resolutions and processing demands increase, GPUs can scale to meet these challenges.

Setting Up GPU Acceleration in BMF

Before diving into coding, ensure your environment is set up to leverage GPU capabilities. This typically involves installing the necessary GPU drivers and libraries, like CUDA for NVIDIA GPUs. BMF documentation provides detailed setup instructions. You can use tools like co-lab or your own hardware. BMF also has the capability to run on Windows, Mac OS and Linux.

Code Example: Basic GPU-Accelerated Video Processing

Let's start with a simple example of GPU-accelerated video processing in BMF. This example assumes you have BMF and all necessary GPU libraries installed. If you haven't installed it yet, click this link and you can install BMF based on your system set up. You can also use tools like Colab as well. If you're using GPU just make sure you meet the hardware requirements to do so.

Prerequsites:

Python 3.9
Cmake
ffmpeg4
Python, C++, or Go experience

Python code

===========

In this example , BMF implements a call to the GPU codec function for video transcoding. BMF basically follows the parameters of FFmpeg, and the lines of code you'll see that are written in red are where true magic happens.

First create a BMF Graph and Decode model, specify the incoming Hardware Accelerator parameter as cuda, and then you can decode the GPU.

import bmf

def test_gpu_transcode():
    print ("Testing gpu transcoding......")
    input_video_path = "input.flv"
    output_video_path = "output.mp4"

    graph = bmf.graph()

    video = graph.decode({
        "input_path": input_video_path,
        "video_params": {
            "hwaccel": "cuda",
        }
    })

Next, use the CUDA filter for the decoded video stream. In BMF, CUDA filters can be used serially at the same time. In this case, we used Scale cuda and Yadif cuda. Then we passed in the audio & video stream to build an Encode model, specifying Codec as h264_nvenc and Pix format as cuda. Once the entire pipeline is complete, call RUN to start execution.

(bmf.encode(
        video["video"].ff_filter("scale_cuda", w=1280, h=720).ff_filter("yadif_cuda"),
        video["audio"], {
            "output_path": output_video_path,
            "video_params": {
                "codec": "h264_nvenc",
                "pix_fmt": "cuda"
            },
        }
    ).run())

Full Code

import bmf

def test_gpu_transcode():
    print ("Testing gpu transcoding......")
    input_video_path = "input.flv"
    output_video_path = "output.mp4"

    graph = bmf.graph()

    video = graph.decode({
        "input_path": input_video_path,
        "video_params": {
            "hwaccel": "cuda",
        }
    })
    (bmf.encode(
        video["video"].ff_filter("scale_cuda", w=1280, h=720).ff_filter("yadif_cuda"),
        video["audio"], {
            "output_path": output_video_path,
            "video_params": {
                "codec": "h264_nvenc",
                "pix_fmt": "cuda"
            },
        }
    ).run())

Advanced GPU-Accelerated Video Processing

For more complex scenarios, BMF allows fine-tuning of GPU settings and integration with other GPU-accelerated libraries.

we introduce CV-CUDA accelerated image preprocessing. In order to fully mobilize the computing power of CUDA, we introduced CV-CUDA in BMF, which is the acceleration operator base specially developed by Nvidia for Computer Vision applications. At this stage, it provides about 45 common high-performance operators.It provides rich API interfaces such as C/C++/Python API, supports batch input of images of different sizes at the same time, and can realize zero-copy data conversion with other deep learning frameworks, and also provides a variety of scene application examples.

Cuda Operators you can use:

Blur
Crop
Flip
Gamma
Rotate
Scale

def test_gpu_transcode():  # Start of function named 'test_gpu_transcode'
    print("Testing GPU transcoding...")    # Print out a string "Testing GPU transcoding..." in the console

    # Variables containing the paths of the input video and the path to save the output video (transcoded one)
    input_video_path = "input.flv"    # Path to the video file we want to transcode
    output_video_path = "output.mp4"  # Path to save the output video

    # Create a BMF graph to represent a series of processing operations
    graph = bmf.graph()

    # Call the 'decode' function of the created BMF graph. Input is the video file pointed by 'input_video_path'.
    # Use hardware acceleration on the GPU to decode the video (hwaccel means hardware accelerator)
    video = graph.decode({
        "input_path": input_video_path,
        "video_params": {
            "hwaccel": "cuda",    # Use NVIDIA CUDA technology for hardware accelerated decoding
        }
    })

    # Call the 'encode' function to encode the video and audio streams.
    # The input video stream is first processed by a GPU scale module to resize it to 1280x720 pixels.
    # The encoded video will be saved to the path pointed by 'output_video_path'.
    # Use NVIDIA NVENC technology for GPU accelerated encoding,
    # and 'pix_fmt' is set to 'cuda' to let the GPU to read in the processed frames directly from its own memory.
    bmf.encode(
        video["video"].module("scale_gpu", {"size": "1280x720"}),    # Scaling the video to dimension of 1280x720 pixels using GPU
        video["audio"],    # Including the audio stream in the processed video
        {
            "output_path": output_video_path,    # Path to save the output video
            "video_params": {
                "codec": "h264_nvenc",    # Use H.264 codec for video encoding with NVENC technology
                "pix_fmt": "cuda"    # The input video frames are in GPU memory
            }
        }
    ).run()    # Execute the graph operations

# Now Call the above defined function
test_gpu_transcode()    # Call the 'test_gpu_transcode' function to start the whole process

Example: Integrating AI Models for Video Enhancement

BMF's flexibility enables the integration of AI models for tasks like super-resolution or frame interpolation. Here's an example of how you might integrate an AI model for super-resolution. Check out this example

Real-World Sorcery with BMF and GPU Acceleration

Let's list up some real-world scenarios where GPU-accelerated BMF works its magic:

The Live Sports Event: Picture a live sports broadcast. With BMF's GPU acceleration, you can stream high-definition, slow-motion replays almost instantaneously. It's like having the ability to freeze time and zoom in on that crucial game-winning goal.
Hollywood films: In film editing, BMF with GPU acceleration is your special effects wizard. Render stunning visual effects in a fraction of the time, bringing dragons to life or creating epic space battles that look breathtakingly real.
The Viral Video Sensation: For content creators, time is of the essence. GPU-accelerated BMF is like having a superpower to edit and render viral-worthy videos in record time, ensuring you hit the trends before they fade.
The Gaming Livestream: In the gaming world, live streaming with real-time effects is key. With BMF's GPU acceleration, you can stream your gameplay with high-quality graphics and overlays, keeping your audience glued to their screens.
The AI-Powered Masterpiece: Dive into the future with AI-enhanced video processing. From upscaling vintage film footage to crystal-clear quality to applying real-time face filters in a video chat, BMF's GPU acceleration makes it all possible, and at lightning speeds.

GPU acceleration in BMF opens up a world of possibilities for high-performance video processing. By leveraging the power of GPUs, developers can achieve remarkable speed and efficiency in multimedia applications. The examples provided are just a starting point -- the real potential lies in how you apply these capabilities to your unique projects.

Remember, the key to successful implementation is understanding your specific processing requirements and how best to utilize BMF's GPU acceleration features to meet those needs.