How GPU-Powered Coding Agents Can Assist in Development of GPU-Accelerated Software
The Dream: Transcribe Your Entire Media Library on a Device That Fits in Your Hand
Imagine owning a massive Plex media library with hundreds of foreign-language films and TV shows. You want subtitles for everything, but manually sourcing them is a nightmare — mismatched timings, missing translations, incomplete coverage. Tools like Bazarr exist specifically to automate subtitle management for Plex and Sonarr/Radarr libraries, and they ship with built-in integration for whisper-asr-webservice — a self-hosted REST API that wraps OpenAI's Whisper speech recognition model. Point Bazarr at a whisper-asr-webservice endpoint, and it will automatically transcribe and generate subtitles for every piece of media in your library, in any language Whisper supports.
There's just one problem: running Whisper fast enough to be practical requires a GPU, and the existing Docker images only support x86_64 with NVIDIA desktop or server GPUs. If you want a quiet, power-efficient, always-on transcription appliance — something you can tuck behind your NAS and forget about — the NVIDIA Jetson platform is the obvious choice. An Orin Nano draws under 15 watts, fits in the palm of your hand, and packs a 1024-core Ampere GPU with hardware support for the same CUDA operations that Whisper needs. A single portable Docker container running on a Jetson could silently chew through your entire library in the background, generating subtitles on demand whenever new media arrives.
The question was: could we actually build that container? The answer turned out to be a story about how GPU-powered AI coding agents can come full circle — using GPU-accelerated tools to build GPU-accelerated software for GPU-accelerated hardware.
The Historical Pain of Porting to aarch64
Anyone who has tried to compile PyTorch, CTranslate2, or onnxruntime for ARM hardware knows the pain. The Python AI/ML ecosystem was born on x86_64 Linux and macOS, and its package infrastructure carries deep assumptions about that lineage.
PyTorch is the foundation of nearly every modern speech recognition system. On x86, you pip install torch and get a CUDA-enabled wheel in seconds. On aarch64, that same command gives you a CPU-only build — or nothing at all. For years, getting a CUDA-enabled PyTorch on Jetson meant manually compiling from source against NVIDIA's JetPack SDK, a process that could take hours on the device itself and was fragile across JetPack versions. NVIDIA eventually began publishing pre-built wheels through the Jetson AI Lab pip index, but using them correctly requires understanding a subtle and underdocumented packaging conflict: pip's wheel compatibility sorting prefers manylinux_2_28 tags over linux_aarch64, which means if both PyPI and the Jetson index are available as pip sources, pip will happily install the CPU-only PyPI wheel instead of the CUDA-enabled Jetson wheel. You must use --index-url (making Jetson the primary source), not --extra-index-url (which makes it secondary).
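The selection behavior is easier to see with a toy model. This is NOT pip's real resolver code, just a sketch of the rule it applies: among wheels of the same version, candidates are ranked by platform-tag priority, and manylinux tags outrank the plain linux tag regardless of which index each wheel came from.

```python
# Toy model (not pip's actual implementation) of the wheel-selection
# rule: manylinux tags sort ahead of the plain linux tag, so a
# CPU-only manylinux wheel from PyPI beats a CUDA-enabled
# linux_aarch64 wheel from a secondary index.
TAG_PRIORITY = ["manylinux_2_28_aarch64", "manylinux2014_aarch64", "linux_aarch64"]

def pick_wheel(candidates):
    """candidates: list of (index_name, platform_tag); returns the winner."""
    return min(candidates, key=lambda c: TAG_PRIORITY.index(c[1]))

candidates = [
    ("pypi (CPU-only torch)", "manylinux_2_28_aarch64"),
    ("jetson index (CUDA torch)", "linux_aarch64"),
]
winner = pick_wheel(candidates)
print(winner[0])  # the CPU-only PyPI wheel wins on tag priority
```

This is why `--index-url` works where `--extra-index-url` fails: making the Jetson index the primary (and only) source removes the PyPI candidate from the comparison entirely.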
CTranslate2, the inference backend that faster-whisper uses to run Whisper models efficiently, is another casualty. PyPI publishes aarch64 wheels, but they are CPU-only. There is no CUDA-enabled aarch64 wheel. Getting GPU acceleration on Jetson means compiling CTranslate2 from source with -DWITH_CUDA=ON, linking against the JetPack CUDA toolkit, and targeting the correct CUDA compute capability for your specific Jetson hardware.
Poetry, the dependency manager used by whisper-asr-webservice, adds another layer of complexity. Poetry's resolver has no concept of "this package must come from this alternate index for this platform." When you run poetry install, it merges all dependency specifications and resolves them against PyPI. On Jetson, this means Poetry will cheerfully overwrite your carefully pre-installed CUDA-enabled PyTorch with a CPU-only wheel from PyPI, because the version constraint matches and Poetry doesn't know the difference. The project's poetry-core PEP 517 metadata generation also merges [tool.poetry.dependencies] source mappings with [project.optional-dependencies], producing version constraints like torch==2.7.1+cu126 that don't match the actual Jetson wheel version at all.
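One defensive pattern against this overwrite is to freeze the pre-installed CUDA wheels into a pip constraints file before letting any resolver run. The sketch below is illustrative, not the project's actual build code; the `PROTECTED` list is an assumption you would adapt to whatever wheels you pre-installed.

```python
# Sketch: pin the CUDA wheels already present in the image so a later
# dependency-resolution step cannot silently replace them with
# CPU-only PyPI builds.
from importlib.metadata import version, PackageNotFoundError

def pin_installed(packages):
    pins = []
    for pkg in packages:
        try:
            pins.append(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            pass  # not installed -> nothing to protect
    return pins

PROTECTED = ["torch", "torchaudio", "ctranslate2"]  # illustrative list
# Write the result to constraints.txt, then install the rest with:
#   pip install -c constraints.txt -r requirements.txt
print("\n".join(pin_installed(PROTECTED)))
```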
These are not exotic edge cases. They are the default experience of trying to port GPU-accelerated Python software to aarch64. And they are exactly the kind of deeply contextual, multi-layered problems that AI coding agents excel at navigating.
CUDA Architecture on L4T: The Edge Cases That x86 Takes for Granted
NVIDIA's Linux for Tegra (L4T) is the OS layer that underpins JetPack on Jetson devices. While x86 CUDA development benefits from a relatively uniform environment — install the CUDA toolkit, install the driver, compile for sm_70 through sm_90 and let the JIT handle the rest — Jetson development requires precise awareness of the hardware-software matrix:
| Jetson Generation | Compute Capability | L4T Branch | JetPack | CUDA |
|---|---|---|---|---|
| Nano / TX2 | sm_53 / sm_62 | R32.x | 4.x | 10.2 |
| Xavier NX / AGX | sm_72 | R35.x | 5.x | 11.4 |
| Orin Nano / NX / AGX | sm_87 | R36.x | 6.x | 12.6 |
On x86, compiling CTranslate2 with -DCUDA_ARCH_LIST="7.0;7.5;8.0;8.6;8.9;9.0" covers virtually every GPU from 2017 to 2024. On Jetson, you compile for exactly one architecture — 8.7 for Orin — and the base image must match your JetPack version precisely because the CUDA toolkit, cuDNN, and TensorRT are all provided by the L4T base image rather than installed separately.
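The single-architecture rule can be captured in a tiny helper. This function is hypothetical (derived from the table above, not from the project's build scripts), but it shows the shape of the decision: one Jetson generation, one compute capability, one cmake flag.

```python
# Hypothetical helper mapping Jetson generations (from the table above)
# to the single CUDA compute capability to target when compiling
# libraries like CTranslate2 on-device.
JETSON_SM = {
    "nano":   "5.3",
    "tx2":    "6.2",
    "xavier": "7.2",
    "orin":   "8.7",
}

def cuda_arch_flag(generation: str) -> str:
    """Return e.g. '-DCUDA_ARCH_LIST=8.7' for a cmake invocation."""
    return f"-DCUDA_ARCH_LIST={JETSON_SM[generation.lower()]}"

print(cuda_arch_flag("orin"))  # -DCUDA_ARCH_LIST=8.7
```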
There's also the cuBLAS conflict problem: the JetPack base image ships a system cuBLAS in /usr/local/cuda/lib64. When you install nvidia-cudss-cu12 from pip (required because the Jetson PyTorch wheel links against libcudss.so.0), it pulls in nvidia-cublas-cu12 as a transitive dependency. Loading two different versions of cuBLAS at runtime causes CUBLAS_STATUS_ALLOC_FAILED — a cryptic error that only manifests when the model actually tries to run a matrix multiplication on the GPU. The fix is to uninstall the pip cuBLAS immediately after installing cudss, and ensure LD_LIBRARY_PATH does not include the pip cublas lib directory.
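A quick way to catch this class of conflict before it bites at inference time is to scan for multiple libcublas copies on the loader path. The sketch below uses the JetPack default location plus pip's `nvidia-*` package layout as search roots; both paths are assumptions you would adjust for your image.

```python
# Hedged sketch: detect multiple libcublas copies, the condition behind
# the CUBLAS_STATUS_ALLOC_FAILED failure described above.
import glob, os

def find_cublas_dirs(roots):
    dirs = set()
    for root in roots:
        for lib in glob.glob(os.path.join(root, "libcublas.so*")):
            dirs.add(os.path.dirname(lib))
    return dirs

roots = ["/usr/local/cuda/lib64",  # JetPack system cuBLAS
         *glob.glob("/usr/local/lib/python3.*/dist-packages/nvidia/cublas/lib")]
dirs = find_cublas_dirs(roots)
if len(dirs) > 1:
    print("conflict: multiple cuBLAS copies ->", sorted(dirs))
```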
These are the kinds of platform-specific gotchas that would take a human developer hours of Stack Overflow browsing and GitHub issue trawling to diagnose. An AI coding agent with knowledge of the Jetson ecosystem can identify and resolve them in the flow of a single conversation.
The Setup: VS Code, Claude Opus 4.6, and Source Code
The ingredients for this solution were deliberately minimal. We used VS Code as the development environment, outfitted with GitHub Copilot powered by Claude Opus 4.6 as the AI coding agent, with the whisper-asr-webservice source code cloned locally on the Jetson device itself. That's it — an editor, a model, and a codebase. No specialized Jetson development tools, no cross-compilation toolchains, no reference implementations to copy from.
What made this combination potent was the intersection of three capabilities: Claude Opus 4.6's deep knowledge of CUDA toolchains, Python packaging, and Docker multi-stage builds; VS Code's integrated terminal giving the agent direct access to build and test on the target hardware; and the source code providing the agent full visibility into the project's dependency structure, build system, and runtime architecture. The agent could read pyproject.toml to understand Poetry's dependency graph, inspect the existing x86 Dockerfiles for patterns, examine the application code to understand which libraries each ASR engine imports, and then synthesize all of that into a Jetson-specific build — all within a single conversational session.
The Prompt That Started It All
The session began with a single natural-language prompt:
"Bro, I need some help, is there a way you might be able to figure out how to build this project in a Docker container with support for GPU acceleration on the NVIDIA Jetson hardware we are currently running on. Specifically, the openai-whisper, whisperx, and faster-whisper dependencies are going to need to be built from source to include acceleration on this device. Poetry is going to annoy you because the trick will be solving the additional dependencies without breaking the full project. The resulting solution should be a single docker file that builds specifically on Jetson. This is going to be tough, can you try?"
That's it. No architecture document. No step-by-step instructions. No prior Dockerfile to copy from. The agent needed to understand the project structure, identify which components required platform-specific builds, design a multi-stage Dockerfile strategy, and navigate every one of the compatibility landmines described above.
The resulting Dockerfile.jetson is nearly 400 lines of carefully sequenced build steps with extensive documentation explaining why each decision was made — not just what it does. The three-stage build strategy emerged organically:
- Stage 1: Compile CTranslate2 from source with CUDA support, targeting sm_87 for Orin
- Stage 2: Extract Swagger UI static assets from the x86 swagger-ui image (only static JS/CSS, no binaries)
- Stage 3: Assemble the runtime — pre-install CUDA packages from the Jetson AI Lab index, install remaining Python dependencies with constraints protecting CUDA packages, apply compatibility shims
Beyond Code: Testing Containers from Inside VS Code
One of the most powerful aspects of working with an AI coding agent in VS Code is that the agent is not limited to writing code. It has access to the terminal, which means it can build Docker images, run containers, generate test data, hit HTTP endpoints, inspect logs, and tear down environments — all within the same conversation.
This is a fundamentally different paradigm from traditional code generation. The agent doesn't hand you a Dockerfile and say "try building this." It builds the image itself, watches for errors, diagnoses failures, applies fixes, and rebuilds — iteratively, in real-time, on the actual target hardware.
During our session, the agent executed commands like:
```bash
# Build the image on the Jetson
docker build -f Dockerfile.jetson \
  -t whisper-asr-webservice-jetson:jp6.1-cu12.6-py3.10 .

# Start a container with a specific ASR engine
docker run -d --rm --name fw-test --runtime nvidia -p 9000:9000 \
  -e ASR_ENGINE=faster_whisper -e ASR_MODEL=tiny \
  whisper-asr-webservice-jetson:latest

# Test the endpoint with a real audio file
curl -X POST "http://localhost:9000/asr?task=transcribe&language=en&output=json" \
  -H "accept: application/json" \
  -F "audio_file=@/tmp/test_speech.wav"

# Inspect CUDA availability inside the container
docker run --rm --runtime nvidia whisper-asr-webservice-jetson:latest \
  python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```
When a build failed — and across 15+ iterations, many did — the agent read the error output, identified the root cause, modified the Dockerfile, and rebuilt. When a runtime crash occurred, it inspected the Python traceback, traced the issue to a specific library version incompatibility, and created a monkey-patch shim. The entire feedback loop happened inside the VS Code terminal, with the agent operating as both developer and QA engineer simultaneously.
The Testing Strategy: Three Engines, Generated Speech, Real Validation
Whisper-asr-webservice supports three different ASR backends, each with different runtime dependencies, model loading paths, and GPU code paths:
- faster_whisper — Uses CTranslate2 for optimized inference, requires a CUDA-compiled CTranslate2
- openai_whisper — The original OpenAI implementation, uses PyTorch directly
- whisperx — Extends Whisper with word-level timestamps and speaker diarization via pyannote.audio, requires torchaudio and the HuggingFace pipeline
All three needed to work. A container that only runs one engine is only one-third of a solution.
The testing strategy evolved through an important self-correction. The initial approach was to generate a simple test audio file — a sine wave tone — and POST it to the ASR endpoint. This produced a "successful" HTTP 200 response, but the transcription result was empty or garbage because there was no actual speech in the audio. The test was passing but not actually validating anything meaningful.
The agent recognized this limitation and pivoted: instead of a synthetic tone, it used espeak-ng (a text-to-speech engine available on the system) to generate a WAV file containing actual spoken English:
```bash
espeak-ng -w /tmp/test_speech.wav \
  "The quick brown fox jumps over the lazy dog"
```
This produced a test file with clear, recognizable speech. When the ASR engines transcribed it, the response contained actual words that could be verified — not just an HTTP status code, but semantic validation that the speech recognition pipeline was functioning end to end, from audio input through GPU-accelerated model inference to text output.
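The "semantic validation" step can be sketched as a loose word-overlap check: rather than trusting an HTTP 200, compare the returned transcript against the known espeak-ng input, tolerating small ASR differences in casing, punctuation, or a dropped word. The threshold below is an assumption, not the value the agent used.

```python
# Compare an expected utterance against an ASR transcript by the
# fraction of expected words that appear in the output.
import re

def overlap_ratio(expected: str, transcript: str) -> float:
    exp = set(re.findall(r"[a-z']+", expected.lower()))
    got = set(re.findall(r"[a-z']+", transcript.lower()))
    return len(exp & got) / len(exp) if exp else 0.0

expected = "The quick brown fox jumps over the lazy dog"
# e.g. accept the engine's output if overlap_ratio(expected, text) >= 0.8
print(overlap_ratio(expected, "the quick brown fox jumps over the lazy dog."))  # 1.0
```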
Each engine was tested individually by spinning up a fresh container with the appropriate ASR_ENGINE environment variable:
```bash
# Test faster_whisper
docker run -d --rm --name fw-test --runtime nvidia -p 9000:9000 \
  -e ASR_ENGINE=faster_whisper -e ASR_MODEL=tiny \
  whisper-asr-webservice-jetson:latest
# Wait for model download + startup, then curl, then teardown

# Test openai_whisper
docker run -d --rm --name ow-test --runtime nvidia -p 9000:9000 \
  -e ASR_ENGINE=openai_whisper -e ASR_MODEL=tiny \
  whisper-asr-webservice-jetson:latest

# Test whisperx
docker run -d --rm --name wx-test --runtime nvidia -p 9000:9000 \
  -e ASR_ENGINE=whisperx -e ASR_MODEL=tiny \
  whisper-asr-webservice-jetson:latest
```
All three returned successful transcriptions with recognizable text. The whisperx engine additionally returned word-level timestamps, confirming that the torchaudio compatibility shim was working correctly and pyannote's audio processing pipeline was intact.
This test cycle was repeated three times across the session — after the initial build, after the torch.load compatibility fix, and after the huggingface_hub API fix — ensuring that each patch didn't break previously working functionality. The agent managed all of this autonomously: spinning up containers, waiting for startup, sending requests, validating responses, tearing down containers, and reporting results.
The Compatibility Shims: When Libraries Disagree
Three runtime compatibility issues surfaced during testing, each requiring a different kind of fix. Rather than forking upstream libraries or pinning to ancient versions, the agent created a unified compatibility shim — a single Python file loaded at interpreter startup via a .pth file in site-packages. This approach is surgical: it patches only what's broken, at the earliest possible moment, without modifying any installed package.
1. torchaudio API removal: The Jetson AI Lab torchaudio builds strip out the legacy backend API — AudioMetaData, info(), and list_audio_backends() — because the Jetson builds use a different audio backend architecture. But pyannote.audio 3.x still calls these functions. The shim implements them using the soundfile library, which is available and functional on Jetson.
2. torch.load weights_only default: PyTorch 2.6+ changed torch.load() to default to weights_only=True for security. But pyannote's VAD (Voice Activity Detection) model checkpoints contain omegaconf.ListConfig objects that aren't in the allowlist. The tricky part: lightning_fabric passes weights_only=None explicitly, which PyTorch interprets as True. A simple setdefault doesn't work — you have to check if kwargs.get("weights_only") is None and override it. The agent discovered this subtlety by reading the actual traceback and tracing through the call chain.
3. huggingface_hub API deprecation: huggingface_hub 1.5.0 removed the deprecated use_auth_token parameter entirely, but pyannote.audio 3.4.0 and whisperx still pass use_auth_token= instead of token=. The fix required patching not just huggingface_hub.hf_hub_download in the top-level namespace, but also in submodules like huggingface_hub.file_download — because pyannote does from huggingface_hub import hf_hub_download at module level, which copies the function reference before any top-level patch can take effect. The shim pre-imports and patches the submodules so that when pyannote's from import runs, it picks up the already-patched version.
Each of these fixes emerged from the agent observing a runtime failure, diagnosing the root cause by inspecting library source code inside the running container, and implementing the minimal patch needed.
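The three fixes above can be sketched as the skeleton of a single shim module. Module and file names here are hypothetical (the real shim is one Python file imported at startup via a one-line `.pth` file in site-packages), and the patches are written as wrappers so they can be exercised without torch or huggingface_hub installed; the real shim also implements torchaudio's `info()` on top of soundfile, which this sketch omits.

```python
# Skeleton of a unified compatibility shim (names hypothetical).
import functools
from dataclasses import dataclass

# --- 1. torchaudio: backfill the removed legacy API -------------------
@dataclass
class AudioMetaData:            # the fields pyannote.audio expects
    sample_rate: int
    num_frames: int
    num_channels: int
    bits_per_sample: int = 16
    encoding: str = "PCM_S"

def list_audio_backends():
    return ["soundfile"]        # the backend available on Jetson builds

# --- 2. torch.load: weights_only must become False, even when the
# caller passes an explicit weights_only=None (setdefault misses it) --
def patch_weights_only(orig_load):
    def patched(*args, **kwargs):
        if kwargs.get("weights_only") is None:
            kwargs["weights_only"] = False
        return orig_load(*args, **kwargs)
    return patched

# --- 3. huggingface_hub: translate use_auth_token= into token= -------
def patch_use_auth_token(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if "use_auth_token" in kwargs:
            kwargs.setdefault("token", kwargs.pop("use_auth_token"))
        return fn(*args, **kwargs)
    return wrapper

def apply():
    import torch
    torch.load = patch_weights_only(torch.load)

    # Patch the submodule BEFORE pyannote's
    # `from huggingface_hub import hf_hub_download` copies the name.
    import huggingface_hub
    import huggingface_hub.file_download as fd
    fd.hf_hub_download = patch_use_auth_token(fd.hf_hub_download)
    huggingface_hub.hf_hub_download = fd.hf_hub_download
```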
From Working Code to Pull Request — By Prompting
With the container built, tested, and verified across all three engines, the next step was contributing back upstream. The project had open issues requesting exactly this capability:
- Issue #359: "Add Arm support for GPU container"
- Issue #54: "Possible to run on Jetson Nano?"
- Issue #133: "Is it possible to get this on Jetson using the GPU?"
Forking the repository, creating a feature branch, committing the changes, and opening a pull request was accomplished entirely through prompts:
"Fork the repo and make a pull request."
The agent forked ahmetoner/whisper-asr-webservice to toolboc/whisper-asr-webservice, created a feat/jetson-gpu-support branch, committed the Dockerfile and compose file, pushed to the fork, and opened PR #364 with a detailed description including:
- A summary of the target platform and what's included
- A table of key technical decisions with rationale for each
- Verification results from actual hardware testing
- Build and run instructions
- A link to the pre-built Docker Hub image
When additional fixes were made (cuBLAS conflict, torch.load, huggingface_hub), each was committed with a descriptive message and pushed to update the PR:
```
937e2a8 feat: add NVIDIA Jetson GPU support (Dockerfile + compose)
290ffcd fix: remove conflicting pip cuBLAS to fix CUBLAS_STATUS_ALLOC_FAILED
d7096fa fix: patch torch.load for whisperx/pyannote VAD compatibility
19ef291 chore: add container_name to compose file
a6e731b fix: patch huggingface_hub use_auth_token -> token for HF_TOKEN support
```
Later, when asked to reference the relevant upstream issues, the agent searched the issue tracker, identified the three related issues, and updated the PR description with Closes #359 and Relates to #54, #133. The resulting PR is more thorough than most manually created pull requests — every technical decision is documented, every compatibility workaround is explained, and the testing methodology is clear.
The image was also pushed to Docker Hub as toolboc/whisper-asr-webservice-jetson:jp6.1-cu12.6-py3.10, making it immediately available to anyone with a Jetson device — no build required.
The Full Circle: GPUs Building Software for GPUs
There's a satisfying symmetry in this story. The AI coding agent that designed, built, debugged, and tested this container is itself powered by GPU-accelerated inference. The end product — a Docker container running Whisper on Jetson's GPU — is GPU-accelerated software. And the problems we solved — CUDA compute capabilities, cuBLAS library conflicts, GPU-specific wheel selection — are fundamentally GPU problems.
This is what "GPU-assisted development comes full circle" looks like in practice:
- A GPU-powered agent (the LLM) understands the nuances of CUDA architecture, library ABI compatibility, and platform-specific packaging
- It produces GPU-accelerated software (the Jetson Whisper container) that exploits the target hardware's full capabilities
- It validates the result on actual GPU hardware by running containers, executing CUDA operations, and verifying inference output
- It contributes the solution back to the open-source community through a well-documented pull request
The agent didn't just write a Dockerfile. It navigated a maze of platform-specific incompatibilities that have historically been the domain of specialized embedded engineers with deep knowledge of the NVIDIA toolchain. It did this while simultaneously managing Docker builds, generating test data, running HTTP integration tests, managing git workflows, and producing documentation — tasks that span the full spectrum from systems engineering to technical writing.
What This Unlocks
The Jetson platform isn't just for Whisper. The same challenges we solved here — CUDA compilation, pip index conflicts, Poetry resolver workarounds, torchaudio compatibility — apply to virtually every PyTorch-based project that someone wants to run on edge hardware. The pattern is repeatable:
- Identify which dependencies need platform-specific builds
- Source or compile CUDA-enabled versions for the target architecture
- Constrain the package manager to prevent overwriting with CPU-only alternatives
- Shim any API incompatibilities between library versions
- Test on actual hardware with meaningful validation data
An AI coding agent that understands this pattern can port other projects to Jetson — or to any constrained platform — with dramatically less effort than manual development. The developer's role shifts from "figure out why pip install torch gives me a CPU-only wheel on aarch64" to "build this for Jetson" and validating the result.
For the Plex and home media server community specifically, this means a standalone appliance that generates subtitles automatically for any content in any language. Drop a Jetson Orin Nano next to your NAS, run docker compose -f docker-compose.jetson.yml up, point Bazarr at http://jetson:9000, and every new movie or episode that arrives gets transcribed and subtitled without human intervention. All on a device that draws less power than a light bulb.
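A minimal compose file consistent with the run flags used in this post might look roughly like the sketch below. This is a hypothetical reconstruction, not the actual `docker-compose.jetson.yml` from the PR, and the engine and model choices are illustrative.

```yaml
# Hypothetical sketch -- see the PR for the real docker-compose.jetson.yml
services:
  whisper-asr:
    image: toolboc/whisper-asr-webservice-jetson:jp6.1-cu12.6-py3.10
    runtime: nvidia              # expose the Jetson GPU to the container
    ports:
      - "9000:9000"
    environment:
      - ASR_ENGINE=faster_whisper
      - ASR_MODEL=base           # model choice illustrative
    restart: unless-stopped
```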
That's the kind of practical, real-world automation that becomes possible when AI-assisted development makes it trivial to port sophisticated GPU-accelerated software to the hardware that can actually run it where it's needed.
This post documents work performed on an NVIDIA Jetson Orin running JetPack 6.2.2 (L4T R36.5.0, CUDA 12.6) using VS Code with GitHub Copilot powered by Claude Opus 4.6. The resulting pull request is #364 on the whisper-asr-webservice repository. A pre-built container image is available on Docker Hub at toolboc/whisper-asr-webservice-jetson:jp6.1-cu12.6-py3.10.
