<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jaysmito Mukherjee</title>
    <description>The latest articles on DEV Community by Jaysmito Mukherjee (@jaysmito101).</description>
    <link>https://dev.to/jaysmito101</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F716987%2F7e2853be-840a-41ba-996e-398f64f2a4a3.jpeg</url>
      <title>DEV Community: Jaysmito Mukherjee</title>
      <link>https://dev.to/jaysmito101</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jaysmito101"/>
    <language>en</language>
    <item>
      <title>High-Performance Image Processing with Halide: Building a Custom Sharpening Filter</title>
      <dc:creator>Jaysmito Mukherjee</dc:creator>
      <pubDate>Fri, 15 May 2026 16:07:40 +0000</pubDate>
      <link>https://dev.to/jaysmito101/high-performance-image-processing-with-halide-building-a-custom-sharpening-filter-483n</link>
      <guid>https://dev.to/jaysmito101/high-performance-image-processing-with-halide-building-a-custom-sharpening-filter-483n</guid>
      <description>&lt;h2&gt;
  
  
  High-Performance Image Processing with Halide: Building a Custom Sharpening Filter
&lt;/h2&gt;

&lt;p&gt;Writing functional image processing code in C++ is relatively straightforward. You load an image, write some nested &lt;code&gt;for&lt;/code&gt; loops to iterate over the width and height, apply your mathematical operations to the pixels, and save the result. &lt;/p&gt;

&lt;p&gt;However, writing &lt;em&gt;fast&lt;/em&gt; image processing code is an entirely different beast. &lt;/p&gt;

&lt;p&gt;To squeeze every ounce of performance out of modern hardware, developers are usually forced to implement loop unrolling, manage cache locality, utilize platform-specific SIMD (Single Instruction, Multiple Data) intrinsics, and orchestrate complex multithreading. By the time you finish optimizing your pipeline, the original, elegant mathematical algorithm is entirely buried under a mountain of architecture-specific boilerplate. Worse, if you want to run that same code on a GPU instead of a CPU, you often have to rewrite the entire thing from scratch.&lt;/p&gt;

&lt;p&gt;This is exactly the problem that Halide solves. &lt;/p&gt;

&lt;p&gt;Halide is a domain-specific language embedded within C++, designed specifically for fast, portable computation on images and tensors. It lets developers write pipelines that are easy to read and mathematically pure, while its compiler generates machine code that rivals, and sometimes exceeds, the performance of hand-tuned assembly. &lt;/p&gt;

&lt;p&gt;Let’s dive deep into the philosophy behind this paradigm and build a complete, highly optimized image sharpening filter from scratch.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Core Philosophy: Decoupling Algorithm from Schedule
&lt;/h3&gt;

&lt;p&gt;The fundamental magic of Halide lies in its strict separation of two concepts: &lt;strong&gt;what&lt;/strong&gt; you want to compute, and &lt;strong&gt;how&lt;/strong&gt; you want to compute it.&lt;/p&gt;

&lt;p&gt;In traditional C++, these two concepts are inextricably linked. The structure of your &lt;code&gt;for&lt;/code&gt; loops dictates both the mathematical operation and the memory access pattern. In Halide, these are split:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Algorithm:&lt;/strong&gt; This defines the pure mathematical operations. It describes how the value of a pixel is calculated based on its coordinates. It contains absolutely no information about storage, execution order, threads, or vectorization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Schedule:&lt;/strong&gt; This defines the execution strategy. Once the algorithm is defined, you write a separate set of instructions (the schedule) that tells the compiler how to iterate over the domain. This is where you dictate tile sizes, threading, vectorization, and memory locality.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Because these two concepts are decoupled, you can write your algorithm once and safely experiment with dozens of different performance schedules without ever putting the underlying math at risk. You can switch from single-threaded CPU execution to massively parallel GPU execution with just a few lines of scheduling code.&lt;/p&gt;
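&lt;p&gt;The idea is easy to illustrate even outside of Halide. In the plain C++ sketch below (an analogy, not Halide code), the "algorithm" is a single pure function, and two different "schedules" (a row-major traversal and a tiled traversal) produce bit-identical results:&lt;/p&gt;

```cpp
#include <cassert>
#include <vector>

// The "algorithm": a pure function of coordinates. It says nothing about
// iteration order, threads, or memory layout.
int algorithm(int x, int y) { return x * 3 + y; }

// "Schedule" 1: plain row-major traversal.
std::vector<int> schedule_row_major(int w, int h) {
    std::vector<int> out(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[y * w + x] = algorithm(x, y);
    return out;
}

// "Schedule" 2: tiled traversal (better cache locality for large images).
// The math is untouched; only the iteration order changes.
std::vector<int> schedule_tiled(int w, int h, int tile) {
    std::vector<int> out(w * h);
    for (int ty = 0; ty < h; ty += tile)
        for (int tx = 0; tx < w; tx += tile)
            for (int y = ty; y < ty + tile && y < h; ++y)
                for (int x = tx; x < tx + tile && x < w; ++x)
                    out[y * w + x] = algorithm(x, y);
    return out;
}
```

&lt;p&gt;In Halide this separation is enforced by the language itself: the schedule cannot change what is computed, only when and where.&lt;/p&gt;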




&lt;h3&gt;
  
  
  Understanding the Building Blocks
&lt;/h3&gt;

&lt;p&gt;Before writing the algorithm, it is important to understand the three foundational types you will use when building a pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Var&lt;/code&gt; (Variable):&lt;/strong&gt; Represents a dimensional coordinate in your computational domain. For a standard RGB image, you will typically use &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; for the spatial coordinates and &lt;code&gt;c&lt;/code&gt; for the color channel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Expr&lt;/code&gt; (Expression):&lt;/strong&gt; Represents a mathematical operation or value. Adding two pixels together produces an &lt;code&gt;Expr&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Func&lt;/code&gt; (Function):&lt;/strong&gt; Represents a pipeline stage. You can think of a &lt;code&gt;Func&lt;/code&gt; as a mathematical function that, given a set of coordinates (like &lt;code&gt;x, y, c&lt;/code&gt;), evaluates and returns a computed pixel value. Unlike standard arrays, a &lt;code&gt;Func&lt;/code&gt; represents an infinite domain until it is explicitly constrained and evaluated.&lt;/li&gt;
&lt;/ul&gt;
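&lt;p&gt;If the &lt;code&gt;Func&lt;/code&gt; concept feels abstract, a loose plain-C++ analogy (this is not real Halide code) is a pure function over coordinates: it is defined everywhere, and nothing is computed until you ask for a concrete point:&lt;/p&gt;

```cpp
#include <functional>

// Loose analogy for a Halide Func (not actual Halide): a pure mapping from
// coordinates to values, defined over an unbounded domain. No pixel is
// computed until concrete coordinates are supplied (Halide calls this
// "realizing" the Func over a region).
std::function<int(int, int)> gradient = [](int x, int y) {
    return x + y;
};
```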




&lt;h3&gt;
  
  
  The Theory: Designing a Sharpening Kernel
&lt;/h3&gt;

&lt;p&gt;To sharpen an image, we want to enhance the edges. We can achieve this by applying a discrete convolution kernel. A standard spatial sharpening filter works by amplifying the center pixel and subtracting the values of its immediate orthogonal neighbors (top, bottom, left, and right). &lt;/p&gt;

&lt;p&gt;We will use the following 3x3 convolution matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  0  -1   0
 -1   5  -1
  0  -1   0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mathematically, to calculate the new value for a pixel at coordinates &lt;code&gt;(x, y)&lt;/code&gt;, the formula is:&lt;br&gt;
&lt;code&gt;Output(x, y) = (5 * Input(x, y)) - Input(x-1, y) - Input(x+1, y) - Input(x, y-1) - Input(x, y+1)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;While the math is simple, implementing it robustly requires handling a few critical edge cases.&lt;/p&gt;
&lt;h4&gt;
  
  
  1. The Boundary Problem
&lt;/h4&gt;

&lt;p&gt;What happens when we are evaluating the pixel at &lt;code&gt;x = 0&lt;/code&gt;? The algorithm will ask for the value of &lt;code&gt;Input(-1, y)&lt;/code&gt;. In a standard C++ array, this is an out-of-bounds memory read: undefined behavior that typically manifests as garbage pixel values or a segmentation fault. Halide provides elegant boundary condition handling that automatically clamps out-of-bounds coordinate requests to the nearest valid edge pixel.&lt;/p&gt;
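&lt;p&gt;The effect of this clamping is simple to sketch in plain C++ (an illustration of the idea, not Halide's internal implementation):&lt;/p&gt;

```cpp
#include <algorithm>

// What BoundaryConditions::repeat_edge effectively does to a coordinate:
// out-of-range requests are clamped to the nearest valid index, so asking
// for x = -1 reads the edge pixel at x = 0, and x = width reads x = width - 1.
int clamp_coord(int v, int lo, int hi) {
    return std::min(std::max(v, lo), hi);
}
```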
&lt;h4&gt;
  
  
  2. The Arithmetic Overflow Problem
&lt;/h4&gt;

&lt;p&gt;Standard images store color channels as 8-bit unsigned integers, so pixel values are restricted to the range 0 to 255. If a pixel has a value of 200, multiplying it by 5 yields 1000. In 8-bit arithmetic the result wraps around modulo 256 (1000 becomes 232), creating severe visual artifacts. We must cast our pixels to a wider data type (such as 16-bit signed integers, since the intermediate results of the sharpening formula can also be negative) before performing the math, then clamp the final result back to the 0-255 range before casting back to 8-bit.&lt;/p&gt;
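&lt;p&gt;The wraparound is easy to demonstrate in a few lines of standalone C++ (a sketch of the problem and the fix, independent of Halide):&lt;/p&gt;

```cpp
#include <algorithm>
#include <cstdint>

// The problem: 200 * 5 = 1000 does not fit in 8 bits, so storing it back
// into a uint8_t wraps to 1000 % 256 = 232, a meaningless pixel value.
uint8_t multiply_overflowing(uint8_t p) {
    return static_cast<uint8_t>(p * 5);
}

// The fix: widen to a signed 16-bit intermediate (the sharpening formula can
// also go negative), clamp to [0, 255], then narrow back to 8 bits.
uint8_t multiply_widened(uint8_t p) {
    int16_t wide = static_cast<int16_t>(p) * 5;
    return static_cast<uint8_t>(std::clamp<int16_t>(wide, 0, 255));
}
```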


&lt;h3&gt;
  
  
  Implementation and Scheduling
&lt;/h3&gt;

&lt;p&gt;Once the math is defined safely with proper types and boundaries, we apply the schedule. &lt;/p&gt;

&lt;p&gt;By default, Halide will execute a &lt;code&gt;Func&lt;/code&gt; using a basic, single-threaded nested loop. However, modern CPUs have multiple cores and support vector instructions that process multiple pieces of data with a single instruction. &lt;/p&gt;
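&lt;p&gt;To make that baseline concrete, here is a rough scalar sketch of what the default schedule amounts to for our sharpening pipeline (illustrative plain C++, not Halide's actual generated code):&lt;/p&gt;

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Scalar baseline for the sharpen pipeline: one nested loop over the image,
// clamped borders, 16-bit intermediate math, and a final clamp back to 8 bits.
std::vector<uint8_t> sharpen_scalar(const std::vector<uint8_t>& in, int w, int h) {
    // Clamped, widened read (plays the role of both 'clamped' and 'input_16').
    auto at = [&](int x, int y) {
        x = std::clamp(x, 0, w - 1);
        y = std::clamp(y, 0, h - 1);
        return static_cast<int16_t>(in[y * w + x]);
    };
    std::vector<uint8_t> out(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int16_t v = 5 * at(x, y) - at(x - 1, y) - at(x + 1, y)
                                     - at(x, y - 1) - at(x, y + 1);
            out[y * w + x] = static_cast<uint8_t>(std::clamp<int16_t>(v, 0, 255));
        }
    return out;
}
```

&lt;p&gt;Every scheduling decision Halide makes (threads, vectors, tiles) is a transformation of this loop nest that leaves its results unchanged.&lt;/p&gt;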

&lt;p&gt;For our sharpening tool, we will apply a very effective, yet simple schedule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallelization:&lt;/strong&gt; We will divide the image by its rows (&lt;code&gt;y&lt;/code&gt;) and distribute them across all available CPU cores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vectorization:&lt;/strong&gt; Within each row, we will process the columns (&lt;code&gt;x&lt;/code&gt;) in chunks of 16. This tells the compiler to pack 16 pixels into wide CPU registers and calculate them simultaneously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This optimization takes only a single line of code in Halide.&lt;/p&gt;


&lt;h3&gt;
  
  
  The Complete Code
&lt;/h3&gt;

&lt;p&gt;Here is the fully commented, ready-to-compile C++ source code for the image sharpener.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;"Halide.h"&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;"halide_image_io.h"&lt;/span&gt;&lt;span class="c1"&gt; // Helper library for loading and saving image files&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="n"&gt;Halide&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="n"&gt;Halide&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Tools&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;argc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Ensure the user provided input and output file paths&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argc&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Usage: ./sharpen input.png output.png&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// 1. Load the input image from disk into a Halide Buffer&lt;/span&gt;
    &lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="c1"&gt;// Define our spatial and channel variables&lt;/span&gt;
    &lt;span class="n"&gt;Var&lt;/span&gt; &lt;span class="nf"&gt;x&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"c"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// 2. Handle boundary conditions&lt;/span&gt;
    &lt;span class="c1"&gt;// If the convolution kernel asks for a pixel outside the image (e.g., x = -1),&lt;/span&gt;
    &lt;span class="c1"&gt;// return the value of the nearest edge pixel (x = 0).&lt;/span&gt;
    &lt;span class="n"&gt;Func&lt;/span&gt; &lt;span class="n"&gt;clamped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BoundaryConditions&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;repeat_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// 3. Prevent arithmetic overflow&lt;/span&gt;
    &lt;span class="c1"&gt;// Cast the 8-bit image data to 16-bit integers so our multiplication and &lt;/span&gt;
    &lt;span class="c1"&gt;// subtraction don't wrap around and corrupt the image.&lt;/span&gt;
    &lt;span class="n"&gt;Func&lt;/span&gt; &lt;span class="nf"&gt;input_16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"input_16"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;input_16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int16_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clamped&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="c1"&gt;// 4. THE ALGORITHM: Apply the discrete convolution kernel&lt;/span&gt;
    &lt;span class="n"&gt;Func&lt;/span&gt; &lt;span class="nf"&gt;sharpen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sharpen"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;sharpen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;input_16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                     &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;input_16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                     &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;input_16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                     &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;input_16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                     &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;input_16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// 5. Finalize the output&lt;/span&gt;
    &lt;span class="c1"&gt;// The result might be negative or greater than 255. We clamp the values&lt;/span&gt;
    &lt;span class="c1"&gt;// to the valid 0-255 range, then safely cast back to unsigned 8-bit.&lt;/span&gt;
    &lt;span class="n"&gt;Func&lt;/span&gt; &lt;span class="nf"&gt;output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sharpen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

    &lt;span class="c1"&gt;// 6. THE SCHEDULE&lt;/span&gt;
    &lt;span class="c1"&gt;// This is where the magic happens. We tell the compiler to evaluate the&lt;/span&gt;
    &lt;span class="c1"&gt;// 'y' coordinates in parallel (utilizing multithreading), and to process&lt;/span&gt;
    &lt;span class="c1"&gt;// the 'x' coordinates in vectorized batches of 16 (utilizing SIMD).&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parallel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;vectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// 7. Realize the pipeline&lt;/span&gt;
    &lt;span class="c1"&gt;// Until this point, no actual computation has happened. The 'realize' call&lt;/span&gt;
    &lt;span class="c1"&gt;// triggers the Just-In-Time (JIT) compiler to generate optimized machine code &lt;/span&gt;
    &lt;span class="c1"&gt;// and execute the pipeline over the specified dimensions.&lt;/span&gt;
    &lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;realize&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channels&lt;/span&gt;&lt;span class="p"&gt;()});&lt;/span&gt;

    &lt;span class="c1"&gt;// 8. Save the processed image to disk&lt;/span&gt;
    &lt;span class="n"&gt;save_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Success! Image sharpened.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Compiling and Running the Code
&lt;/h3&gt;

&lt;p&gt;To compile this application, you must have the Halide release binaries available on your system, along with &lt;code&gt;libpng&lt;/code&gt; and &lt;code&gt;libjpeg&lt;/code&gt; to support the image I/O helper functions.&lt;/p&gt;

&lt;p&gt;Because Halide utilizes modern C++ features, you must compile with at least C++17. A standard compilation command using GCC looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;g++ main.cpp &lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="nt"&gt;-I&lt;/span&gt; /path/to/halide/include &lt;span class="nt"&gt;-I&lt;/span&gt; /path/to/halide/tools &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-L&lt;/span&gt; /path/to/halide/lib &lt;span class="nt"&gt;-lHalide&lt;/span&gt; &lt;span class="nt"&gt;-lpng&lt;/span&gt; &lt;span class="nt"&gt;-ljpeg&lt;/span&gt; &lt;span class="nt"&gt;-lpthread&lt;/span&gt; &lt;span class="nt"&gt;-ldl&lt;/span&gt; &lt;span class="nt"&gt;-std&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;c++17 &lt;span class="nt"&gt;-o&lt;/span&gt; sharpen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: Ensure you replace &lt;code&gt;/path/to/halide/&lt;/code&gt; with the actual path where your Halide headers and libraries are located.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once the code is compiled successfully, you can run the executable from your terminal, passing in the image you want to process and the desired name for the output file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./sharpen my_blurry_photo.png crisp_sharpened_photo.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;By abstracting the memory layout and execution loops away from the mathematical logic, Halide drastically reduces the cognitive load required to build complex computer vision pipelines. Our sharpening filter is concise, mathematically readable, and incredibly fast. &lt;/p&gt;

&lt;p&gt;More importantly, it is highly maintainable. If a new hardware architecture is released tomorrow with a completely different optimal memory access pattern, the algorithm itself remains untouched. The developer only needs to adjust the one-line schedule to accommodate the new hardware, ensuring that high-performance image processing code remains future-proof.&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>cpp</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>High Performance GPGPU with Rust and wgpu</title>
      <dc:creator>Jaysmito Mukherjee</dc:creator>
      <pubDate>Sun, 14 Dec 2025 14:46:57 +0000</pubDate>
      <link>https://dev.to/jaysmito101/high-performance-gpgpu-with-rust-and-wgpu-4l9i</link>
      <guid>https://dev.to/jaysmito101/high-performance-gpgpu-with-rust-and-wgpu-4l9i</guid>
      <description>&lt;h1&gt;
  
  
  High Performance GPGPU with Rust and wgpu
&lt;/h1&gt;

&lt;p&gt;General Purpose Graphics Processing Unit programming, or GPGPU, has transformed high-performance computing. By offloading parallelizable tasks to the massive number of cores available on modern graphics cards, developers can achieve performance gains spanning orders of magnitude compared to CPU execution. While CUDA has long been the standard, the ecosystem is evolving. The &lt;code&gt;wgpu&lt;/code&gt; crate in Rust offers a compelling, portable, and safe alternative that runs on Vulkan, Metal, DirectX 12, and even inside web browsers via WebGPU. This article explores how to leverage &lt;code&gt;wgpu&lt;/code&gt; for compute workloads, moving beyond rendering triangles to processing raw data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture of a Compute Application
&lt;/h2&gt;

&lt;p&gt;A GPGPU application differs significantly from a traditional rendering loop. In a graphics context, the pipeline is complex, involving vertex shaders, fragment shaders, rasterization, and depth buffers. A compute pipeline is refreshingly simple by comparison. It consists primarily of data buffers and a compute shader. The workflow involves initializing the GPU device, loading the shader code, creating memory buffers accessible by the GPU, and dispatching "workgroups" to execute the logic.&lt;/p&gt;

&lt;p&gt;The core abstraction in &lt;code&gt;wgpu&lt;/code&gt; involves the Instance, Adapter, Device, and Queue. The Instance is the entry point to the API. The Adapter represents the physical hardware card. The Device is the logical connection that allows you to create resources, and the Queue is where you submit command buffers for execution. Unlike graphics rendering which requires a windowing surface, a compute context can run entirely "headless," making it ideal for background processing tools or server-side applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing the Kernel in WGSL
&lt;/h2&gt;

&lt;p&gt;The logic executed on the GPU is written in the WebGPU Shading Language (WGSL). This language feels like a blend of Rust and GLSL. For a compute shader, we define an entry point decorated with the &lt;code&gt;@compute&lt;/code&gt; attribute and specify a workgroup size. The GPU executes this function in parallel across a 3D grid.&lt;/p&gt;

&lt;p&gt;Consider a simple kernel that performs vector multiplication. We define a storage buffer to hold our input and output data. The built-in variable &lt;code&gt;global_invocation_id&lt;/code&gt; allows us to determine which specific element of the array the current thread should process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// shader.wgsl&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nf"&gt;binding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read_write&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nf"&gt;workgroup_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nf"&gt;builtin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;global_invocation_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;global_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;vec3&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;global_id&lt;/span&gt;&lt;span class="py"&gt;.x&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// Guard against out-of-bounds access if the array size &lt;/span&gt;
    &lt;span class="c1"&gt;// isn't a perfect multiple of the workgroup size&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;arrayLength&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, the workgroup size is set to 64. When we dispatch work from the Rust side, we will calculate how many groups of 64 are needed to cover our data array. The logic inside the function is simple, but the hardware will execute thousands of these instances simultaneously.&lt;/p&gt;
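&lt;p&gt;One practical detail: the number of workgroups to dispatch is normally computed with ceiling division, so that an array length that is not an exact multiple of the workgroup size is still fully covered. A minimal sketch:&lt;/p&gt;

```rust
// Ceiling division: how many workgroups of `workgroup_size` invocations are
// needed to cover `len` elements. The arrayLength guard inside the shader
// discards the overshoot in the final group.
fn workgroup_count(len: u32, workgroup_size: u32) -> u32 {
    (len + workgroup_size - 1) / workgroup_size
}
```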

&lt;h2&gt;
  
  
  Buffer Management and Bind Groups
&lt;/h2&gt;

&lt;p&gt;Memory management is the most critical aspect of GPGPU programming. The CPU and GPU often have distinct memory spaces. To bridge this gap, &lt;code&gt;wgpu&lt;/code&gt; uses buffers. For a compute operation, we typically need a Storage Buffer, which allows the shader to read and write arbitrary data. However, reading GPU memory directly from the CPU is either slow or impossible. Therefore, we often use a Staging Buffer strategy: we create a buffer on the GPU for processing and a separate, mappable buffer that the CPU can read the results from after a copy.&lt;/p&gt;

&lt;p&gt;Once the buffers are created, we must tell the shader where to find them. This is done via Bind Groups. A Bind Group Layout describes the interface—stating that binding slot 0 is a storage buffer. The Bind Group itself connects the actual &lt;code&gt;wgpu::Buffer&lt;/code&gt; object to that slot. This strict separation of layout and data allows &lt;code&gt;wgpu&lt;/code&gt; to validate resource usage before the GPU ever sees a command, preventing many common crashes associated with low-level graphics APIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dispatching the Work
&lt;/h2&gt;

&lt;p&gt;With the pipeline created and data uploaded, we proceed to command encoding. We create a &lt;code&gt;CommandEncoder&lt;/code&gt; and begin a compute pass. Inside this pass, we set the pipeline, set the bind group containing our data buffers, and call &lt;code&gt;dispatch_workgroups&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The dispatch call requires understanding the grid dimensionality. If we have an array of 1024 elements and a shader workgroup size of 64, we must dispatch 16 workgroups on the X-axis (1024 divided by 64).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="nf"&gt;.create_command_encoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nn"&gt;wgpu&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;CommandEncoderDescriptor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;cpass&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="nf"&gt;.begin_compute_pass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nn"&gt;wgpu&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ComputePassDescriptor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
        &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;timestamp_writes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt; 
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="n"&gt;cpass&lt;/span&gt;&lt;span class="nf"&gt;.set_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;compute_pipeline&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;cpass&lt;/span&gt;&lt;span class="nf"&gt;.set_bind_group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;bind_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[]);&lt;/span&gt;
    &lt;span class="n"&gt;cpass&lt;/span&gt;&lt;span class="nf"&gt;.dispatch_workgroups&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_size&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After dispatching, if we intend to read the results back to the CPU, we must issue a copy command. This command copies the data from the GPU-resident storage buffer into a map-readable staging buffer. Finally, we finish the encoder and submit the command buffer to the queue.&lt;/p&gt;
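&lt;p&gt;Once the results are back on the CPU, it is worth validating them against a CPU reference. A small, purely illustrative sketch: since the shader squares each element, the same transform on the CPU produces the values we expect to find in the staging buffer.&lt;/p&gt;

```rust
// Hypothetical CPU reference for the square kernel: compute the same
// transform on the CPU so the values read back from the staging buffer
// can be sanity-checked.
fn main() {
    let input = vec![1.0_f32, 2.0, 3.0, 4.0];
    let mut expected = Vec::new();
    for x in input {
        expected.push(x * x); // mirrors data[index] * data[index] in WGSL
    }
    println!("{:?}", expected); // [1.0, 4.0, 9.0, 16.0]
}
```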

&lt;h2&gt;
  
  
  Asynchronous Readback
&lt;/h2&gt;

&lt;p&gt;One aspect of &lt;code&gt;wgpu&lt;/code&gt; that often trips up developers coming from blocking APIs is its asynchronous nature. Submitting the work to the queue returns immediately, but the GPU has only just received the instructions. To read the data back, we must map the staging buffer. This is an async operation returning a &lt;code&gt;Future&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To resolve this, the application must poll the device. In a native environment, we call &lt;code&gt;device.poll(wgpu::Maintain::Wait)&lt;/code&gt;. This blocks the main thread until the GPU operations are complete and the map callback has fired. Once the buffer is mapped, we can cast the raw bytes back into a Rust slice, copy the data to a local vector, and unmap the buffer. This creates a synchronization point, ensuring the GPU has finished its heavy lifting before the CPU attempts to interpret the results.&lt;/p&gt;
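&lt;p&gt;The final cast step can be illustrated without a GPU at all. The mapped staging buffer hands us raw little-endian bytes; in real code a crate such as &lt;code&gt;bytemuck&lt;/code&gt; typically performs this reinterpretation, but the sketch below (with hand-written byte values standing in for mapped memory) shows what happens underneath:&lt;/p&gt;

```rust
// Hypothetical illustration of the readback cast: the mapped staging
// buffer yields raw little-endian bytes, which we reinterpret as f32s.
// The byte values below encode 4.0 and 9.0 (2.0 and 3.0 squared).
fn main() {
    let mapped: [u8; 8] = [0x00, 0x00, 0x80, 0x40, 0x00, 0x00, 0x10, 0x41];
    let mut values = Vec::new();
    for chunk in mapped.chunks_exact(4) {
        values.push(f32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]));
    }
    println!("{:?}", values); // [4.0, 9.0]
}
```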

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;wgpu&lt;/code&gt; ecosystem provides a robust foundation for GPGPU programming that prioritizes safety and portability without sacrificing the raw parallel power of the hardware. By standardizing on WGSL and the WebGPU resource model, developers can write compute kernels that run seamlessly on desktop, mobile, and web. While the boilerplate for setting up pipelines and managing memory buffers is more verbose than high-level CPU threading, the payoff is the ability to process massive datasets in parallel, unlocking performance capabilities that are simply unattainable on the CPU alone.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>performance</category>
      <category>rust</category>
    </item>
    <item>
<title>TerraGen3D: 3D Procedural Terrain Generation Tool in OpenGL/C++</title>
      <dc:creator>Jaysmito Mukherjee</dc:creator>
      <pubDate>Fri, 01 Oct 2021 09:31:29 +0000</pubDate>
      <link>https://dev.to/jaysmito101/terragen3d-3d-procedural-terrain-generation-tool-in-opengl-c-375f</link>
      <guid>https://dev.to/jaysmito101/terragen3d-3d-procedural-terrain-generation-tool-in-opengl-c-375f</guid>
      <description>&lt;p&gt;I am making a 3D Procedural Generation Software Completely opensource and free!&lt;/p&gt;

&lt;p&gt;Get it:&lt;br&gt;
&lt;a href="https://github.com/Jaysmito101/TerraGen3D"&gt;https://github.com/Jaysmito101/TerraGen3D&lt;/a&gt;&lt;br&gt;
&lt;a href="https://sourceforge.net/projects/terragen3d/"&gt;https://sourceforge.net/projects/terragen3d/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tutorials : &lt;a href="https://www.youtube.com/playlist?list=PLl3xhxX__M4A74aaTj8fvqApu7vo3cOiZ"&gt;https://www.youtube.com/playlist?list=PLl3xhxX__M4A74aaTj8fvqApu7vo3cOiZ&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Join the Discord Server : &lt;a href="https://discord.gg/AcgRafSfyB"&gt;https://discord.gg/AcgRafSfyB&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  What can this do?
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Generate 3D Terrain Procedurally&lt;/li&gt;
&lt;li&gt;Export Terrain mesh as OBJ&lt;/li&gt;
&lt;li&gt;You can write and test your own shaders&lt;/li&gt;
&lt;li&gt;An Inbuilt IDE for shaders&lt;/li&gt;
&lt;li&gt;Test under different lighting&lt;/li&gt;
&lt;li&gt;A 3D viewer&lt;/li&gt;
&lt;li&gt;A Node-based as well as Layer-based workflow&lt;/li&gt;
&lt;li&gt;Save the project (custom &lt;code&gt;.terr3d&lt;/code&gt; files)&lt;/li&gt;
&lt;li&gt;Height map visualizer in node editor&lt;/li&gt;
&lt;li&gt;Wireframe mode&lt;/li&gt;
&lt;li&gt;Custom Lighting&lt;/li&gt;
&lt;li&gt;Customizable Geometry Shaders included in rendering pipeline&lt;/li&gt;
&lt;li&gt;Skyboxes&lt;/li&gt;
&lt;li&gt;Multithreaded Mesh Generation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.lua.org/"&gt;Lua&lt;/a&gt; scripting to add custom algorithms&lt;/li&gt;
&lt;li&gt;Export to heightmaps (both PNG and a custom format)&lt;/li&gt;
&lt;li&gt;Custom Skyboxes&lt;/li&gt;
&lt;li&gt;Completely usable 3D procedural modelling and texturing pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Future Goals
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Procedural grass and foliage&lt;/li&gt;
&lt;li&gt;Fix more bugs!&lt;/li&gt;
&lt;li&gt;Many more things..&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Screenshots
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SpvamjnJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/Jaysmito101/TerraGen3D/master/Screenshots/Version%25203/Screenshot%2520%281%29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SpvamjnJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/Jaysmito101/TerraGen3D/master/Screenshots/Version%25203/Screenshot%2520%281%29.png" alt="Screenshot 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HxI6XH14--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/Jaysmito101/TerraGen3D/master/Screenshots/Version%25203/Screenshot%2520%282%29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HxI6XH14--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/Jaysmito101/TerraGen3D/master/Screenshots/Version%25203/Screenshot%2520%282%29.png" alt="Screenshot 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_eoefYRh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/Jaysmito101/TerraGen3D/master/Screenshots/Version%25203/Screenshot%2520%283%29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_eoefYRh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/Jaysmito101/TerraGen3D/master/Screenshots/Version%25203/Screenshot%2520%283%29.png" alt="Screenshot 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Support
&lt;/h1&gt;

&lt;p&gt;I am just a high school student, so my code may not be the best quality, but I am trying my best to write good code!&lt;/p&gt;

&lt;p&gt;Any support would be highly appreciated!&lt;/p&gt;

&lt;p&gt;For example, you could add a feature and contribute via a pull request, or you could report any issues you find with the program!&lt;/p&gt;

&lt;p&gt;And the best thing you could do to support this project is to spread the word, so that more people who might be interested can find and use it!&lt;/p&gt;

&lt;p&gt;Please consider tweeting about this! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://ctt.ac/MX5_c"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bsDDv_CG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://clicktotweet.com/img/tweet-graphic-4.png" alt="Tweet: Check out TerraGen3D Free and Open Source Procedural Modelling and Texturing Software : https://github.com/Jaysmito101/TerraGen3D"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Join the Discord Server : &lt;a href="https://discord.gg/AcgRafSfyB"&gt;https://discord.gg/AcgRafSfyB&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
