This is a submission for the GitHub Copilot CLI Challenge
## What I Built
I added AVX2 SIMD optimizations to libjpeg-turbo's 12-bit JPEG decoding pipeline, achieving a 4.6% speedup on
Full HD images and 2.5% on 4K images.
libjpeg-turbo is the world's most widely used JPEG library, with highly optimized SIMD paths for 8-bit JPEG. However,
12-bit JPEG (used in medical imaging and high-precision workflows) had zero SIMD support — everything ran as
scalar C code.
Using perf profiling, I identified 3 hotspots and implemented AVX2 intrinsics for each:
| Target | Implementation | Impact |
| --- | --- | --- |
| IDCT (Inverse DCT) | 64-bit arithmetic + AVX2 parallelization | ~3% |
| YCC→RGB Color Conversion | SIMD compute + packed RGB interleave output | ~3% |
| H2V2 Fancy Upsample | 16-bit SIMD weighted interpolation | ~1.8% |
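The "fancy" upsampler's 16-bit weighted interpolation boils down to a 3:1 blend of the two nearest chroma samples with alternating rounding biases (h2v2 applies the same weighting in both directions). A minimal scalar sketch of the horizontal half, as a simplified model of the algorithm rather than the actual AVX2 kernel (edge pixels omitted for brevity):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified scalar model of "fancy" 2x horizontal chroma upsampling:
 * each output sample is a 3:1 weighted average of the two nearest
 * input samples, with rounding biases 1 and 2 so adjacent outputs
 * round in opposite directions. Edge handling is omitted. */
static void h2v1_fancy_row(const uint16_t *in, uint16_t *out, int n)
{
  for (int i = 1; i < n - 1; i++) {
    out[2 * i]     = (uint16_t)((3 * in[i] + in[i - 1] + 1) >> 2);
    out[2 * i + 1] = (uint16_t)((3 * in[i] + in[i + 1] + 2) >> 2);
  }
}
```

The AVX2 version performs the same multiply-add on 16 samples per iteration using 16-bit lanes.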
### Why "just 4.6%" matters
libjpeg-turbo is already one of the most optimized codebases in existence. Profiling reveals that 37.6% of CPU
time is spent in Huffman decoding — which is structurally impossible to SIMD-ize due to the sequential
bit-dependency in the JPEG spec. The SIMD-able portion (IDCT + color conversion + upsampling ≈ 44%) was effectively
optimized across all three targets.
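That ceiling can be made precise with Amdahl's law. A small sketch using the profile shares quoted above (the per-portion speedup factor in the comments is an illustrative assumption, not a measured figure):

```c
#include <assert.h>

/* Amdahl's law: if a fraction p of decode time is SIMD-able and that
 * portion is accelerated by a factor s, the overall speedup is
 * 1 / ((1 - p) + p / s).
 * With p = 0.44 (IDCT + color conversion + upsampling), even an
 * infinitely fast SIMD path caps the whole decode below ~1.79x;
 * a realistic ~1.12x gain on that portion gives roughly 1.05x
 * overall, i.e. the single-digit percentages measured here. */
static double amdahl(double p, double s)
{
  return 1.0 / ((1.0 - p) + p / s);
}
```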
### 📊 Benchmarks (AMD Ryzen 9 9950X, GCC 13.3.0, -O3)
| Resolution | Before | After | Improvement |
| --- | --- | --- | --- |
| Full HD (1920×1080) | 27.87 ms | 26.58 ms | 4.6% |
| 4K (3840×2160) | 113.07 ms | 110.26 ms | 2.5% |
🧪 All 662 tests pass — JPEG compliance tests allow zero tolerance for bit-level differences.
## Demo
🔗 Repository: moe-charm/dev_libjpeg-turbo-12bit-simd
### Profiling Breakdown (4K 12-bit JPEG)
```
37.63%  decode_mcu                    ← Huffman decoding (cannot SIMD)
21.95%  jsimd_idct_islow_avx2_12bit   ← ✅ AVX2 optimized
11.38%  ycc_rgb_convert               ← ✅ AVX2 optimized
10.57%  h2v2_fancy_upsample           ← ✅ AVX2 optimized
 8.25%  put_rgb                       ← File I/O
 5.00%  jpeg_fill_bit_buffer          ← Bitstream parsing
```
### Key Implementation Detail
```c
// 12-bit samples are 16-bit → widen to 32-bit for arithmetic → pack back to 16-bit
__m256i y = _mm256_cvtepu16_epi32(_mm_loadu_si128((__m128i *)inptr0));
// ... AVX2 YCC→RGB conversion ...
__m256i r16 = _mm256_packus_epi32(r, zero); // 32-bit → 16-bit pack
```
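The pack step is also where overflow safety lives. A scalar model of what `_mm256_packus_epi32` does to each 32-bit lane (an illustrative helper, not library code):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of _mm256_packus_epi32: each 32-bit value
 * is clamped to [0, 65535] with unsigned saturation before narrowing
 * to 16 bits, so the widened 12-bit arithmetic cannot wrap around. */
static uint16_t packus_lane(int32_t v)
{
  if (v < 0)     return 0;
  if (v > 65535) return 65535;
  return (uint16_t)v;
}
```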
## My Experience with GitHub Copilot CLI
This entire project was built exclusively through Copilot CLI in the terminal — no IDE involved.
### 🔍 The Profile → Implement → Verify Cycle
Copilot CLI handled perf record / perf report execution and analysis, AVX2 intrinsics code generation, and running
all 662 ctest tests — all within a single terminal session.
What stood out:
- **Profiling-driven prioritization** — After running `perf`, Copilot analyzed the results and suggested which function to optimize next based on CPU time share. This data-driven approach kept the work focused on high-impact targets.
- **AVX2 intrinsics generation** — Instructions like `_mm256_packus_epi32`, `_mm256_permute4x64_epi64`, and `_mm256_cvtepu16_epi32` are notoriously hard to get right without reading Intel manuals. Copilot generated correct sequences and understood the cross-lane behavior of AVX2.
- **Debugging bit-level failures** — The 12-bit IDCT initially had 1-bit rounding errors that failed compliance tests. Copilot helped diagnose the overflow issue and switch from 32-bit to 64-bit intermediate arithmetic to fix it.
- **A/B testing infrastructure** — Copilot proposed and implemented the `JPEG12_IDCT_FORCE_C` environment variable for toggling between SIMD and scalar paths, enabling clean before/after benchmarking.
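A minimal sketch of how such an environment-variable toggle can be wired up — every name here other than `JPEG12_IDCT_FORCE_C` is a hypothetical placeholder, not an actual libjpeg-turbo symbol:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical A/B toggle: if JPEG12_IDCT_FORCE_C is set in the
 * environment, fall back to the scalar reference IDCT; otherwise use
 * the AVX2 path. One binary then serves both benchmark runs. */
typedef enum { IDCT_PATH_AVX2, IDCT_PATH_SCALAR } idct_path;

static idct_path select_idct_path(void)
{
  return getenv("JPEG12_IDCT_FORCE_C") ? IDCT_PATH_SCALAR
                                       : IDCT_PATH_AVX2;
}
```

Because both paths live in the same binary, the before/after numbers come from identical builds, which removes compiler-flag noise from the comparison.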
### 💡 Why CLI Was Perfect for This
SIMD optimization lives and dies by the "write → build → test → profile → analyze → rewrite" loop. Copilot CLI
keeps this cycle entirely within the terminal — no context switching to an editor. Run cmake --build, see the
result, fix the code, run 662 tests, benchmark — all in one continuous conversation.
### ⚠️ Challenges
- **Rare type system** — libjpeg-turbo's 12-bit types (`J12SAMPLE`, `J12SAMPROW`, `J12SAMPARRAY`) barely exist in training data. Copilot initially generated dispatch logic using the compile-time `BITS_IN_JSAMPLE` macro, but the correct approach requires runtime `data_precision` checks, since libjpeg-turbo builds a single binary supporting multiple precisions.
- **Measuring small gains** — When your baseline is already world-class, proving that a 2–3% improvement is real (not noise) requires careful benchmark design with multiple runs and statistical analysis.
Built with GitHub Copilot CLI + libjpeg-turbo 3.1.x on AMD Ryzen 9 9950X / Ubuntu / GCC 13.3.0