CharmPic
AVX2 SIMD Optimization for 12-bit JPEG Decoding in libjpeg-turbo — Pair Programming with Copilot CLI

This is a submission for the GitHub Copilot CLI Challenge.

## What I Built

I added AVX2 SIMD optimizations to libjpeg-turbo's 12-bit JPEG decoding pipeline, achieving a 4.6% end-to-end speedup on Full HD images and 2.5% on 4K images.

libjpeg-turbo is the world's most widely used JPEG library, with highly optimized SIMD paths for 8-bit JPEG. However,
12-bit JPEG (used in medical imaging and high-precision workflows) had zero SIMD support — everything ran as
scalar C code.

Using perf profiling, I identified 3 hotspots and implemented AVX2 intrinsics for each:

| Target | Implementation | Impact |
| --- | --- | --- |
| IDCT (Inverse DCT) | 64-bit arithmetic + AVX2 parallelization | ~3% |
| YCC→RGB Color Conversion | SIMD compute + packed RGB interleave output | ~3% |
| H2V2 Fancy Upsample | 16-bit SIMD weighted interpolation | ~1.8% |

### Why "just 4.6%" matters

libjpeg-turbo is already one of the most optimized codebases in existence. Profiling reveals that 37.6% of CPU time is spent in Huffman decoding — which is structurally impossible to SIMD-ize due to the sequential bit-dependency in the JPEG spec. The SIMD-able portion (IDCT + color conversion + upsampling ≈ 44%) was effectively optimized across all three targets.

### 📊 Benchmarks (AMD Ryzen 9 9950X, GCC 13.3.0, -O3)

| Resolution | Before | After | Improvement |
| --- | --- | --- | --- |
| Full HD (1920×1080) | 27.87 ms | 26.58 ms | 4.6% |
| 4K (3840×2160) | 113.07 ms | 110.26 ms | 2.5% |

🧪 All 662 tests pass — the JPEG compliance tests tolerate zero bit-level differences from the reference output.

## Demo

🔗 Repository: moe-charm/dev_libjpeg-turbo-12bit-simd

### Profiling Breakdown (4K 12-bit JPEG)

```
37.63%  decode_mcu                    ← Huffman decoding (cannot SIMD)
21.95%  jsimd_idct_islow_avx2_12bit   ← ✅ AVX2 optimized
11.38%  ycc_rgb_convert               ← ✅ AVX2 optimized
10.57%  h2v2_fancy_upsample           ← ✅ AVX2 optimized
 8.25%  put_rgb                       ← File I/O
 5.00%  jpeg_fill_bit_buffer          ← Bitstream parsing
```

### Key Implementation Detail

```c
// 12-bit samples are stored as 16-bit → widen to 32-bit for arithmetic → pack back to 16-bit
__m256i y = _mm256_cvtepu16_epi32(_mm_loadu_si128((__m128i *)inptr0));
// ... AVX2 YCC→RGB conversion ...
__m256i r16 = _mm256_packus_epi32(r, zero); // 32-bit → 16-bit pack with unsigned saturation
```

## My Experience with GitHub Copilot CLI

This entire project was built exclusively through Copilot CLI in the terminal — no IDE involved.

### 🔍 The Profile → Implement → Verify Cycle

Copilot CLI handled perf record / perf report execution and analysis, AVX2 intrinsics code generation, and running
all 662 ctest tests — all within a single terminal session.

What stood out:

  • Profiling-driven prioritization — After running perf, Copilot analyzed the results and suggested which function to optimize next based on CPU time share. This data-driven approach kept the work focused on high-impact targets.
  • AVX2 intrinsics generation — Instructions like _mm256_packus_epi32, _mm256_permute4x64_epi64, and _mm256_cvtepu16_epi32 are notoriously hard to get right without reading Intel manuals. Copilot generated correct sequences and understood the cross-lane behavior of AVX2.
  • Debugging bit-level failures — The 12-bit IDCT initially had 1-bit rounding errors that failed compliance tests. Copilot helped diagnose the overflow issue and switch from 32-bit to 64-bit intermediate arithmetic to fix it.
  • A/B testing infrastructure — Copilot proposed and implemented the JPEG12_IDCT_FORCE_C environment variable for toggling between SIMD and scalar paths, enabling clean before/after benchmarking.

### 💡 Why CLI Was Perfect for This

SIMD optimization lives and dies by the "write → build → test → profile → analyze → rewrite" loop. Copilot CLI
keeps this cycle entirely within the terminal — no context switching to an editor. Run cmake --build, see the
result, fix the code, run 662 tests, benchmark — all in one continuous conversation.

### ⚠️ Challenges

  • Rare type system: libjpeg-turbo's 12-bit types (J12SAMPLE, J12SAMPROW, J12SAMPARRAY) barely exist in training data. Copilot initially generated dispatch logic using the compile-time BITS_IN_JSAMPLE macro, but the correct approach requires runtime data_precision checks — since libjpeg-turbo builds a single binary supporting multiple precisions.
  • Measuring small gains: When your baseline is already world-class, proving that a 2-3% improvement is real (not noise) requires careful benchmark design with multiple runs and statistical analysis.

Built with GitHub Copilot CLI + libjpeg-turbo 3.1.x on AMD Ryzen 9 9950X / Ubuntu / GCC 13.3.0
