<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jaysmito Mukherjee</title>
    <description>The latest articles on DEV Community by Jaysmito Mukherjee (@jaysmito101).</description>
    <link>https://dev.to/jaysmito101</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F716987%2F7e2853be-840a-41ba-996e-398f64f2a4a3.jpeg</url>
      <title>DEV Community: Jaysmito Mukherjee</title>
      <link>https://dev.to/jaysmito101</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jaysmito101"/>
    <language>en</language>
    <item>
      <title>High-Performance Image Processing with Halide: Building a Custom Sharpening Filter</title>
      <dc:creator>Jaysmito Mukherjee</dc:creator>
      <pubDate>Fri, 15 May 2026 16:07:40 +0000</pubDate>
      <link>https://dev.to/jaysmito101/high-performance-image-processing-with-halide-building-a-custom-sharpening-filter-483n</link>
      <guid>https://dev.to/jaysmito101/high-performance-image-processing-with-halide-building-a-custom-sharpening-filter-483n</guid>
      <description>&lt;h2&gt;
  
  
  High-Performance Image Processing with Halide: Building a Custom Sharpening Filter
&lt;/h2&gt;

&lt;p&gt;Writing functional image processing code in C++ is relatively straightforward. You load an image, write some nested &lt;code&gt;for&lt;/code&gt; loops to iterate over the width and height, apply your mathematical operations to the pixels, and save the result. &lt;/p&gt;

&lt;p&gt;However, writing &lt;em&gt;fast&lt;/em&gt; image processing code is an entirely different beast. &lt;/p&gt;

&lt;p&gt;To squeeze every ounce of performance out of modern hardware, developers are usually forced to implement loop unrolling, manage cache locality, utilize platform-specific SIMD (Single Instruction, Multiple Data) intrinsics, and orchestrate complex multithreading. By the time you finish optimizing your pipeline, the original, elegant mathematical algorithm is entirely buried under a mountain of architecture-specific boilerplate. Worse, if you want to run that same code on a GPU instead of a CPU, you often have to rewrite the entire thing from scratch.&lt;/p&gt;

&lt;p&gt;This is exactly the problem that Halide solves. &lt;/p&gt;

&lt;p&gt;Halide is a domain-specific language embedded within C++, designed specifically for fast, portable computation on images and tensors. It lets developers write pipelines that are easy to read and mathematically pure, while its compiler generates machine code that rivals, and sometimes exceeds, the performance of hand-tuned assembly. &lt;/p&gt;

&lt;p&gt;Let’s dive deep into the philosophy behind this paradigm and build a complete, highly optimized image sharpening filter from scratch.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Core Philosophy: Decoupling Algorithm from Schedule
&lt;/h3&gt;

&lt;p&gt;The fundamental magic of Halide lies in its strict separation of two concepts: &lt;strong&gt;what&lt;/strong&gt; you want to compute, and &lt;strong&gt;how&lt;/strong&gt; you want to compute it.&lt;/p&gt;

&lt;p&gt;In traditional C++, these two concepts are inextricably linked. The structure of your &lt;code&gt;for&lt;/code&gt; loops dictates both the mathematical operation and the memory access pattern. In Halide, these are split:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Algorithm:&lt;/strong&gt; This defines the pure mathematical operations. It describes how the value of a pixel is calculated based on its coordinates. It contains absolutely no information about storage, execution order, threads, or vectorization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Schedule:&lt;/strong&gt; This defines the execution strategy. Once the algorithm is defined, you write a separate set of instructions (the schedule) that tells the compiler how to iterate over the domain. This is where you dictate tile sizes, threading, vectorization, and memory locality.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Because these two concepts are decoupled, you can write your algorithm once and safely experiment with dozens of different performance schedules without ever putting the underlying math at risk. You can switch from single-threaded CPU execution to massively parallel GPU execution with just a few lines of scheduling code.&lt;/p&gt;
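&lt;p&gt;The idea is easy to illustrate even outside of Halide. In the plain C++ sketch below (an analogy, not Halide code), the "algorithm" is a single pure function, and two different "schedules" (a row-major traversal and a tiled traversal) produce bit-identical results:&lt;/p&gt;

```cpp
#include <cassert>
#include <vector>

// The "algorithm": a pure function of coordinates. It says nothing about
// iteration order, threads, or memory layout.
int algorithm(int x, int y) { return x * 3 + y; }

// "Schedule" 1: plain row-major traversal.
std::vector<int> schedule_row_major(int w, int h) {
    std::vector<int> out(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[y * w + x] = algorithm(x, y);
    return out;
}

// "Schedule" 2: tiled traversal (better cache locality for large images).
// The math is untouched; only the iteration order changes.
std::vector<int> schedule_tiled(int w, int h, int tile) {
    std::vector<int> out(w * h);
    for (int ty = 0; ty < h; ty += tile)
        for (int tx = 0; tx < w; tx += tile)
            for (int y = ty; y < ty + tile && y < h; ++y)
                for (int x = tx; x < tx + tile && x < w; ++x)
                    out[y * w + x] = algorithm(x, y);
    return out;
}
```

&lt;p&gt;In Halide this separation is enforced by the language itself: the schedule cannot change what is computed, only when and where.&lt;/p&gt;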




&lt;h3&gt;
  
  
  Understanding the Building Blocks
&lt;/h3&gt;

&lt;p&gt;Before writing the algorithm, it is important to understand the three foundational types you will use when building a pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Var&lt;/code&gt; (Variable):&lt;/strong&gt; Represents a dimensional coordinate in your computational domain. For a standard RGB image, you will typically use &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; for the spatial coordinates and &lt;code&gt;c&lt;/code&gt; for the color channel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Expr&lt;/code&gt; (Expression):&lt;/strong&gt; Represents a mathematical operation or value. Adding two pixels together produces an &lt;code&gt;Expr&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Func&lt;/code&gt; (Function):&lt;/strong&gt; Represents a pipeline stage. You can think of a &lt;code&gt;Func&lt;/code&gt; as a mathematical function that, given a set of coordinates (like &lt;code&gt;x, y, c&lt;/code&gt;), evaluates and returns a computed pixel value. Unlike standard arrays, a &lt;code&gt;Func&lt;/code&gt; represents an infinite domain until it is explicitly constrained and evaluated.&lt;/li&gt;
&lt;/ul&gt;
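&lt;p&gt;If the &lt;code&gt;Func&lt;/code&gt; concept feels abstract, a loose plain-C++ analogy (this is not real Halide code) is a pure function over coordinates: it is defined everywhere, and nothing is computed until you ask for a concrete point:&lt;/p&gt;

```cpp
#include <functional>

// Loose analogy for a Halide Func (not actual Halide): a pure mapping from
// coordinates to values, defined over an unbounded domain. No pixel is
// computed until concrete coordinates are supplied (Halide calls this
// "realizing" the Func over a region).
std::function<int(int, int)> gradient = [](int x, int y) {
    return x + y;
};
```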




&lt;h3&gt;
  
  
  The Theory: Designing a Sharpening Kernel
&lt;/h3&gt;

&lt;p&gt;To sharpen an image, we want to enhance the edges. We can achieve this by applying a discrete convolution kernel. A standard spatial sharpening filter works by amplifying the center pixel and subtracting the values of its immediate orthogonal neighbors (top, bottom, left, and right). &lt;/p&gt;

&lt;p&gt;We will use the following 3x3 convolution matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  0  -1   0
 -1   5  -1
  0  -1   0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mathematically, to calculate the new value for a pixel at coordinates &lt;code&gt;(x, y)&lt;/code&gt;, the formula is:&lt;br&gt;
&lt;code&gt;Output(x, y) = (5 * Input(x, y)) - Input(x-1, y) - Input(x+1, y) - Input(x, y-1) - Input(x, y+1)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;While the math is simple, implementing it robustly requires handling a few critical edge cases.&lt;/p&gt;
&lt;h4&gt;
  
  
  1. The Boundary Problem
&lt;/h4&gt;

&lt;p&gt;What happens when we are evaluating the pixel at &lt;code&gt;x = 0&lt;/code&gt;? The algorithm will ask for the value of &lt;code&gt;Input(-1, y)&lt;/code&gt;. In a standard C++ array, this is an out-of-bounds memory read: undefined behavior that typically manifests as garbage pixel values or a segmentation fault. Halide provides elegant boundary condition handling that automatically clamps out-of-bounds coordinate requests to the nearest valid edge pixel.&lt;/p&gt;
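&lt;p&gt;The effect of this clamping is simple to sketch in plain C++ (an illustration of the idea, not Halide's internal implementation):&lt;/p&gt;

```cpp
#include <algorithm>

// What BoundaryConditions::repeat_edge effectively does to a coordinate:
// out-of-range requests are clamped to the nearest valid index, so asking
// for x = -1 reads the edge pixel at x = 0, and x = width reads x = width - 1.
int clamp_coord(int v, int lo, int hi) {
    return std::min(std::max(v, lo), hi);
}
```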
&lt;h4&gt;
  
  
  2. The Arithmetic Overflow Problem
&lt;/h4&gt;

&lt;p&gt;Standard images store color channels as 8-bit unsigned integers, so pixel values are restricted to the range 0 to 255. If a pixel has a value of 200, multiplying it by 5 yields 1000. In 8-bit arithmetic the result wraps around modulo 256 (1000 becomes 232), creating severe visual artifacts. We must cast our pixels to a wider data type (such as 16-bit signed integers, since the intermediate results of the sharpening formula can also be negative) before performing the math, then clamp the final result back to the 0-255 range before casting back to 8-bit.&lt;/p&gt;
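&lt;p&gt;The wraparound is easy to demonstrate in a few lines of standalone C++ (a sketch of the problem and the fix, independent of Halide):&lt;/p&gt;

```cpp
#include <algorithm>
#include <cstdint>

// The problem: 200 * 5 = 1000 does not fit in 8 bits, so storing it back
// into a uint8_t wraps to 1000 % 256 = 232, a meaningless pixel value.
uint8_t multiply_overflowing(uint8_t p) {
    return static_cast<uint8_t>(p * 5);
}

// The fix: widen to a signed 16-bit intermediate (the sharpening formula can
// also go negative), clamp to [0, 255], then narrow back to 8 bits.
uint8_t multiply_widened(uint8_t p) {
    int16_t wide = static_cast<int16_t>(p) * 5;
    return static_cast<uint8_t>(std::clamp<int16_t>(wide, 0, 255));
}
```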


&lt;h3&gt;
  
  
  Implementation and Scheduling
&lt;/h3&gt;

&lt;p&gt;Once the math is defined safely with proper types and boundaries, we apply the schedule. &lt;/p&gt;

&lt;p&gt;By default, Halide will execute a &lt;code&gt;Func&lt;/code&gt; using a basic, single-threaded nested loop. However, modern CPUs have multiple cores and support vector instructions that process multiple pieces of data with a single instruction. &lt;/p&gt;
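&lt;p&gt;To make that baseline concrete, here is a rough scalar sketch of what the default schedule amounts to for our sharpening pipeline (illustrative plain C++, not Halide's actual generated code):&lt;/p&gt;

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Scalar baseline for the sharpen pipeline: one nested loop over the image,
// clamped borders, 16-bit intermediate math, and a final clamp back to 8 bits.
std::vector<uint8_t> sharpen_scalar(const std::vector<uint8_t>& in, int w, int h) {
    // Clamped, widened read (plays the role of both 'clamped' and 'input_16').
    auto at = [&](int x, int y) {
        x = std::clamp(x, 0, w - 1);
        y = std::clamp(y, 0, h - 1);
        return static_cast<int16_t>(in[y * w + x]);
    };
    std::vector<uint8_t> out(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int16_t v = 5 * at(x, y) - at(x - 1, y) - at(x + 1, y)
                                     - at(x, y - 1) - at(x, y + 1);
            out[y * w + x] = static_cast<uint8_t>(std::clamp<int16_t>(v, 0, 255));
        }
    return out;
}
```

&lt;p&gt;Every scheduling decision Halide makes (threads, vectors, tiles) is a transformation of this loop nest that leaves its results unchanged.&lt;/p&gt;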

&lt;p&gt;For our sharpening tool, we will apply a very effective, yet simple schedule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallelization:&lt;/strong&gt; We will divide the image by its rows (&lt;code&gt;y&lt;/code&gt;) and distribute them across all available CPU cores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vectorization:&lt;/strong&gt; Within each row, we will process the columns (&lt;code&gt;x&lt;/code&gt;) in chunks of 16. This tells the compiler to pack 16 pixels into wide CPU registers and calculate them simultaneously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This optimization takes only a single line of code in Halide.&lt;/p&gt;


&lt;h3&gt;
  
  
  The Complete Code
&lt;/h3&gt;

&lt;p&gt;Here is the fully commented, ready-to-compile C++ source code for the image sharpener.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;"Halide.h"&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;"halide_image_io.h"&lt;/span&gt;&lt;span class="c1"&gt; // Helper library for loading and saving image files&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="n"&gt;Halide&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="n"&gt;Halide&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Tools&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;argc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Ensure the user provided input and output file paths&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argc&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Usage: ./sharpen input.png output.png&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// 1. Load the input image from disk into a Halide Buffer&lt;/span&gt;
    &lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="c1"&gt;// Define our spatial and channel variables&lt;/span&gt;
    &lt;span class="n"&gt;Var&lt;/span&gt; &lt;span class="nf"&gt;x&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"c"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// 2. Handle boundary conditions&lt;/span&gt;
    &lt;span class="c1"&gt;// If the convolution kernel asks for a pixel outside the image (e.g., x = -1),&lt;/span&gt;
    &lt;span class="c1"&gt;// return the value of the nearest edge pixel (x = 0).&lt;/span&gt;
    &lt;span class="n"&gt;Func&lt;/span&gt; &lt;span class="n"&gt;clamped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BoundaryConditions&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;repeat_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// 3. Prevent arithmetic overflow&lt;/span&gt;
    &lt;span class="c1"&gt;// Cast the 8-bit image data to 16-bit integers so our multiplication and &lt;/span&gt;
    &lt;span class="c1"&gt;// subtraction don't wrap around and corrupt the image.&lt;/span&gt;
    &lt;span class="n"&gt;Func&lt;/span&gt; &lt;span class="nf"&gt;input_16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"input_16"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;input_16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int16_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clamped&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="c1"&gt;// 4. THE ALGORITHM: Apply the discrete convolution kernel&lt;/span&gt;
    &lt;span class="n"&gt;Func&lt;/span&gt; &lt;span class="nf"&gt;sharpen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sharpen"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;sharpen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;input_16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                     &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;input_16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                     &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;input_16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                     &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;input_16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                     &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;input_16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// 5. Finalize the output&lt;/span&gt;
    &lt;span class="c1"&gt;// The result might be negative or greater than 255. We clamp the values&lt;/span&gt;
    &lt;span class="c1"&gt;// to the valid 0-255 range, then safely cast back to unsigned 8-bit.&lt;/span&gt;
    &lt;span class="n"&gt;Func&lt;/span&gt; &lt;span class="nf"&gt;output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sharpen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="c1"&gt;// 6. THE SCHEDULE&lt;/span&gt;
    &lt;span class="c1"&gt;// This is where the magic happens. We tell the compiler to evaluate the&lt;/span&gt;
    &lt;span class="c1"&gt;// 'y' coordinates in parallel (utilizing multithreading), and to process&lt;/span&gt;
    &lt;span class="c1"&gt;// the 'x' coordinates in vectorized batches of 16 (utilizing SIMD).&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;vectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// 7. Realize the pipeline&lt;/span&gt;
    &lt;span class="c1"&gt;// Until this point, no actual computation has happened. The 'realize' call&lt;/span&gt;
    &lt;span class="c1"&gt;// triggers the Just-In-Time (JIT) compiler to generate optimized machine code &lt;/span&gt;
    &lt;span class="c1"&gt;// and execute the pipeline over the specified dimensions.&lt;/span&gt;
    &lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;realize&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="p"&gt;()});&lt;/span&gt;

    &lt;span class="c1"&gt;// 8. Save the processed image to disk&lt;/span&gt;
    &lt;span class="n"&gt;save_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Success! Image sharpened.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Compiling and Running the Code
&lt;/h3&gt;

&lt;p&gt;To compile this application, you must have the Halide release binaries available on your system, along with &lt;code&gt;libpng&lt;/code&gt; and &lt;code&gt;libjpeg&lt;/code&gt; to support the image I/O helper functions.&lt;/p&gt;

&lt;p&gt;Because Halide utilizes modern C++ features, you must compile with at least C++17. A standard compilation command using GCC looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;g++ main.cpp &lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="nt"&gt;-I&lt;/span&gt; /path/to/halide/include &lt;span class="nt"&gt;-I&lt;/span&gt; /path/to/halide/tools &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-L&lt;/span&gt; /path/to/halide/lib &lt;span class="nt"&gt;-lHalide&lt;/span&gt; &lt;span class="nt"&gt;-lpng&lt;/span&gt; &lt;span class="nt"&gt;-ljpeg&lt;/span&gt; &lt;span class="nt"&gt;-lpthread&lt;/span&gt; &lt;span class="nt"&gt;-ldl&lt;/span&gt; &lt;span class="nt"&gt;-std&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;c++17 &lt;span class="nt"&gt;-o&lt;/span&gt; sharpen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: Ensure you replace &lt;code&gt;/path/to/halide/&lt;/code&gt; with the actual path where your Halide headers and libraries are located.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once the code is compiled successfully, you can run the executable from your terminal, passing in the image you want to process and the desired name for the output file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./sharpen my_blurry_photo.png crisp_sharpened_photo.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;By abstracting the memory layout and execution loops away from the mathematical logic, Halide drastically reduces the cognitive load required to build complex computer vision pipelines. Our sharpening filter is concise, mathematically readable, and incredibly fast. &lt;/p&gt;

&lt;p&gt;More importantly, it is highly maintainable. If a new hardware architecture is released tomorrow with a completely different optimal memory access pattern, the algorithm itself remains untouched. The developer only needs to adjust the one-line schedule to accommodate the new hardware, ensuring that high-performance image processing code remains future-proof.&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>cpp</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>High Performance GPGPU with Rust and wgpu</title>
      <dc:creator>Jaysmito Mukherjee</dc:creator>
      <pubDate>Sun, 14 Dec 2025 14:46:57 +0000</pubDate>
      <link>https://dev.to/jaysmito101/high-performance-gpgpu-with-rust-and-wgpu-4l9i</link>
      <guid>https://dev.to/jaysmito101/high-performance-gpgpu-with-rust-and-wgpu-4l9i</guid>
      <description>&lt;h1&gt;
  
  
  High Performance GPGPU with Rust and wgpu
&lt;/h1&gt;

&lt;p&gt;General Purpose Graphics Processing Unit programming, or GPGPU, has transformed high-performance computing. By offloading parallelizable tasks to the massive number of cores available on modern graphics cards, developers can achieve performance gains spanning orders of magnitude compared to CPU execution. While CUDA has long been the standard, the ecosystem is evolving. The &lt;code&gt;wgpu&lt;/code&gt; crate in Rust offers a compelling, portable, and safe alternative that runs on Vulkan, Metal, DirectX 12, and even inside web browsers via WebGPU. This article explores how to leverage &lt;code&gt;wgpu&lt;/code&gt; for compute workloads, moving beyond rendering triangles to processing raw data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture of a Compute Application
&lt;/h2&gt;

&lt;p&gt;A GPGPU application differs significantly from a traditional rendering loop. In a graphics context, the pipeline is complex, involving vertex shaders, fragment shaders, rasterization, and depth buffers. A compute pipeline is refreshingly simple by comparison. It consists primarily of data buffers and a compute shader. The workflow involves initializing the GPU device, loading the shader code, creating memory buffers accessible by the GPU, and dispatching "workgroups" to execute the logic.&lt;/p&gt;

&lt;p&gt;The core abstraction in &lt;code&gt;wgpu&lt;/code&gt; involves the Instance, Adapter, Device, and Queue. The Instance is the entry point to the API. The Adapter represents the physical hardware card. The Device is the logical connection that allows you to create resources, and the Queue is where you submit command buffers for execution. Unlike graphics rendering which requires a windowing surface, a compute context can run entirely "headless," making it ideal for background processing tools or server-side applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing the Kernel in WGSL
&lt;/h2&gt;

&lt;p&gt;The logic executed on the GPU is written in the WebGPU Shading Language (WGSL). This language feels like a blend of Rust and GLSL. For a compute shader, we define an entry point decorated with the &lt;code&gt;@compute&lt;/code&gt; attribute and specify a workgroup size. The GPU executes this function in parallel across a 3D grid.&lt;/p&gt;

&lt;p&gt;Consider a simple kernel that performs vector multiplication. We define a storage buffer to hold our input and output data. The built-in variable &lt;code&gt;global_invocation_id&lt;/code&gt; allows us to determine which specific element of the array the current thread should process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// shader.wgsl&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nf"&gt;binding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read_write&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nf"&gt;workgroup_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nf"&gt;builtin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;global_invocation_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;global_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vec3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;global_id&lt;/span&gt;&lt;span class="py"&gt;.x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// Guard against out-of-bounds access if the array size &lt;/span&gt;
    &lt;span class="c1"&gt;// isn't a perfect multiple of the workgroup size&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;arrayLength&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, the workgroup size is set to 64. When we dispatch work from the Rust side, we will calculate how many groups of 64 are needed to cover our data array. The logic inside the function is simple, but the hardware will execute thousands of these instances simultaneously.&lt;/p&gt;
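&lt;p&gt;One practical detail: the number of workgroups to dispatch is normally computed with ceiling division, so that an array length that is not an exact multiple of the workgroup size is still fully covered. A minimal sketch:&lt;/p&gt;

```rust
// Ceiling division: how many workgroups of `workgroup_size` invocations are
// needed to cover `len` elements. The arrayLength guard inside the shader
// discards the overshoot in the final group.
fn workgroup_count(len: u32, workgroup_size: u32) -> u32 {
    (len + workgroup_size - 1) / workgroup_size
}
```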

&lt;h2&gt;
  
  
  Buffer Management and Bind Groups
&lt;/h2&gt;

&lt;p&gt;Memory management is the most critical aspect of GPGPU programming. The CPU and GPU often have distinct memory spaces. To bridge this gap, &lt;code&gt;wgpu&lt;/code&gt; uses buffers. For a compute operation, we typically need a Storage Buffer, which allows the shader to read and write arbitrary data. However, reading GPU memory directly from the CPU is either slow or impossible. Therefore, we often use a Staging Buffer strategy: we create a buffer on the GPU for processing and a separate, mappable buffer that the CPU can read the results from after a copy.&lt;/p&gt;

&lt;p&gt;Once the buffers are created, we must tell the shader where to find them. This is done via Bind Groups. A Bind Group Layout describes the interface—stating that binding slot 0 is a storage buffer. The Bind Group itself connects the actual &lt;code&gt;wgpu::Buffer&lt;/code&gt; object to that slot. This strict separation of layout and data allows &lt;code&gt;wgpu&lt;/code&gt; to validate resource usage before the GPU ever sees a command, preventing many common crashes associated with low-level graphics APIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dispatching the Work
&lt;/h2&gt;

&lt;p&gt;With the pipeline created and data uploaded, we proceed to command encoding. We create a &lt;code&gt;CommandEncoder&lt;/code&gt; and begin a compute pass. Inside this pass, we set the pipeline, set the bind group containing our data buffers, and call &lt;code&gt;dispatch_workgroups&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The dispatch call requires understanding the grid dimensionality. If we have an array of 1024 elements and a shader workgroup size of 64, we must dispatch 16 workgroups on the X-axis (1024 divided by 64).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="nf"&gt;.create_command_encoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nn"&gt;wgpu&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CommandEncoderDescriptor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;cpass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="nf"&gt;.begin_compute_pass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nn"&gt;wgpu&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ComputePassDescriptor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
        &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;timestamp_writes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt; 
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="n"&gt;cpass&lt;/span&gt;&lt;span class="nf"&gt;.set_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;compute_pipeline&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;cpass&lt;/span&gt;&lt;span class="nf"&gt;.set_bind_group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;bind_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[]);&lt;/span&gt;
    &lt;span class="n"&gt;cpass&lt;/span&gt;&lt;span class="nf"&gt;.dispatch_workgroups&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_size&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After dispatching, if we intend to read the results back to the CPU, we must issue a copy command. This command copies the data from the GPU-resident storage buffer into a map-readable staging buffer. Finally, we finish the encoder and submit the command buffer to the queue.&lt;/p&gt;
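&lt;p&gt;Once the results are back on the CPU, it is worth validating them against a CPU reference. A small, purely illustrative sketch: since the shader squares each element, the same transform on the CPU produces the values we expect to find in the staging buffer.&lt;/p&gt;

```rust
// Hypothetical CPU reference for the square kernel: compute the same
// transform on the CPU so the values read back from the staging buffer
// can be sanity-checked.
fn main() {
    let input = vec![1.0_f32, 2.0, 3.0, 4.0];
    let mut expected = Vec::new();
    for x in input {
        expected.push(x * x); // mirrors data[index] * data[index] in WGSL
    }
    println!("{:?}", expected); // [1.0, 4.0, 9.0, 16.0]
}
```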

&lt;h2&gt;
  
  
  Asynchronous Readback
&lt;/h2&gt;

&lt;p&gt;One aspect of &lt;code&gt;wgpu&lt;/code&gt; that often trips up developers coming from blocking APIs is its asynchronous nature. Submitting the work to the queue returns immediately, but the GPU has only just received the instructions. To read the data back, we must map the staging buffer. This is an async operation returning a &lt;code&gt;Future&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To resolve this, the application must poll the device. In a native environment, we call &lt;code&gt;device.poll(wgpu::Maintain::Wait)&lt;/code&gt;. This blocks the main thread until the GPU operations are complete and the map callback has fired. Once the buffer is mapped, we can cast the raw bytes back into a Rust slice, copy the data to a local vector, and unmap the buffer. This creates a synchronization point, ensuring the GPU has finished its heavy lifting before the CPU attempts to interpret the results.&lt;/p&gt;
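&lt;p&gt;The final cast step can be illustrated without a GPU at all. The mapped staging buffer hands us raw little-endian bytes; in real code a crate such as &lt;code&gt;bytemuck&lt;/code&gt; typically performs this reinterpretation, but the sketch below (with hand-written byte values standing in for mapped memory) shows what happens underneath:&lt;/p&gt;

```rust
// Hypothetical illustration of the readback cast: the mapped staging
// buffer yields raw little-endian bytes, which we reinterpret as f32s.
// The byte values below encode 4.0 and 9.0 (2.0 and 3.0 squared).
fn main() {
    let mapped: [u8; 8] = [0x00, 0x00, 0x80, 0x40, 0x00, 0x00, 0x10, 0x41];
    let mut values = Vec::new();
    for chunk in mapped.chunks_exact(4) {
        values.push(f32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]));
    }
    println!("{:?}", values); // [4.0, 9.0]
}
```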

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;wgpu&lt;/code&gt; ecosystem provides a robust foundation for GPGPU programming that prioritizes safety and portability without sacrificing the raw parallel power of the hardware. By standardizing on WGSL and the WebGPU resource model, developers can write compute kernels that run seamlessly on desktop, mobile, and web. While the boilerplate for setting up pipelines and managing memory buffers is more verbose than high-level CPU threading, the payoff is the ability to process massive datasets in parallel, unlocking performance capabilities that are simply unattainable on the CPU alone.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>performance</category>
      <category>rust</category>
    </item>
    <item>
<title>TerraGen3D: 3D Procedural Terrain Generation Tool in OpenGL/C++</title>
      <dc:creator>Jaysmito Mukherjee</dc:creator>
      <pubDate>Fri, 01 Oct 2021 09:31:29 +0000</pubDate>
      <link>https://dev.to/jaysmito101/terragen3d-3d-procedural-terrain-generation-tool-in-opengl-c-375f</link>
      <guid>https://dev.to/jaysmito101/terragen3d-3d-procedural-terrain-generation-tool-in-opengl-c-375f</guid>
      <description>&lt;p&gt;I am making a 3D Procedural Generation Software Completely opensource and free!&lt;/p&gt;

&lt;p&gt;Get it:&lt;br&gt;
&lt;a href="https://github.com/Jaysmito101/TerraGen3D"&gt;https://github.com/Jaysmito101/TerraGen3D&lt;/a&gt;&lt;br&gt;
&lt;a href="https://sourceforge.net/projects/terragen3d/"&gt;https://sourceforge.net/projects/terragen3d/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tutorials : &lt;a href="https://www.youtube.com/playlist?list=PLl3xhxX__M4A74aaTj8fvqApu7vo3cOiZ"&gt;https://www.youtube.com/playlist?list=PLl3xhxX__M4A74aaTj8fvqApu7vo3cOiZ&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Join the Discord Server : &lt;a href="https://discord.gg/AcgRafSfyB"&gt;https://discord.gg/AcgRafSfyB&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  What can this do?
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Generate 3D Terrain Procedurally&lt;/li&gt;
&lt;li&gt;Export Terrain mesh as OBJ&lt;/li&gt;
&lt;li&gt;You can write and test your own shaders&lt;/li&gt;
&lt;li&gt;An Inbuilt IDE for shaders&lt;/li&gt;
&lt;li&gt;Test under different lighting&lt;/li&gt;
&lt;li&gt;A 3D viewer&lt;/li&gt;
&lt;li&gt;A Node-based as well as Layer-based workflow&lt;/li&gt;
&lt;li&gt;Save the project (custom &lt;code&gt;.terr3d&lt;/code&gt; files)&lt;/li&gt;
&lt;li&gt;Height map visualizer in node editor&lt;/li&gt;
&lt;li&gt;Wireframe mode&lt;/li&gt;
&lt;li&gt;Custom Lighting&lt;/li&gt;
&lt;li&gt;Customizable Geometry Shaders included in rendering pipeline&lt;/li&gt;
&lt;li&gt;Skyboxes&lt;/li&gt;
&lt;li&gt;Multithreaded Mesh Generation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.lua.org/"&gt;Lua&lt;/a&gt; scripting to add custom algorithms&lt;/li&gt;
&lt;li&gt;Export to heightmaps (both PNG and a custom format)&lt;/li&gt;
&lt;li&gt;Custom Skyboxes&lt;/li&gt;
&lt;li&gt;Completely usable 3D procedural modelling and texturing pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Future Goals
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Procedural grass and foliage&lt;/li&gt;
&lt;li&gt;Fix more bugs!&lt;/li&gt;
&lt;li&gt;Many more things..&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Screenshots
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SpvamjnJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/Jaysmito101/TerraGen3D/master/Screenshots/Version%25203/Screenshot%2520%281%29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SpvamjnJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/Jaysmito101/TerraGen3D/master/Screenshots/Version%25203/Screenshot%2520%281%29.png" alt="Screenshot 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HxI6XH14--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/Jaysmito101/TerraGen3D/master/Screenshots/Version%25203/Screenshot%2520%282%29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HxI6XH14--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/Jaysmito101/TerraGen3D/master/Screenshots/Version%25203/Screenshot%2520%282%29.png" alt="Screenshot 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_eoefYRh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/Jaysmito101/TerraGen3D/master/Screenshots/Version%25203/Screenshot%2520%283%29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_eoefYRh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/Jaysmito101/TerraGen3D/master/Screenshots/Version%25203/Screenshot%2520%283%29.png" alt="Screenshot 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Support
&lt;/h1&gt;

&lt;p&gt;I am just a high school student, so my code may not be the best quality, but I am trying my best to write good code!&lt;/p&gt;

&lt;p&gt;Any support would be highly appreciated!&lt;/p&gt;

&lt;p&gt;For example, you could add a feature and contribute via a pull request, or you could report any issues you find with the program!&lt;/p&gt;

&lt;p&gt;And the best thing you could do to support this project is to spread the word, so that more people who might be interested can find and use it!&lt;/p&gt;

&lt;p&gt;Please consider tweeting about this! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://ctt.ac/MX5_c"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bsDDv_CG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://clicktotweet.com/img/tweet-graphic-4.png" alt="Tweet: Check out TerraGen3D Free and Open Source Procedural Modelling and Texturing Software : https://github.com/Jaysmito101/TerraGen3D"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Join the Discord Server : &lt;a href="https://discord.gg/AcgRafSfyB"&gt;https://discord.gg/AcgRafSfyB&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
