Every color palette extractor you've used runs K-Means clustering. Coolors, Canva, Adobe. They all use the same algorithm to find dominant colors. Most of them run it on a server. Ours runs it on your GPU, entirely in the browser.
We built a free Palette Extractor that processes 65,536 pixels in 12ms using a WebGL2 fragment shader. This post walks through how K-Means works, why color extraction is the perfect demo, and how 15 lines of GLSL turn the GPU into a parallel K-Means engine.
K-Means in 30 Seconds
The algorithm repeats two steps:
- Assign every data point to the nearest centroid
- Update each centroid to the mean of its assigned points
Loop until nothing changes. For color extraction, data points are pixels (RGB values) and centroids become your palette colors.
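The two-step loop above can be sketched in plain JavaScript. This is a CPU sketch for illustration, not the shader version; `kmeansStep` and the `pixels` layout (arrays of `[r, g, b]` in 0..1) are our own naming here:

```javascript
// One K-Means iteration on RGB points: assign every pixel to its
// nearest centroid, then move each centroid to the mean of its pixels.
function kmeansStep(pixels, centroids) {
  const k = centroids.length;
  const sums = Array.from({ length: k }, () => [0, 0, 0, 0]); // r, g, b, count

  // Assign: squared distance is enough for picking the nearest centroid.
  for (const [r, g, b] of pixels) {
    let best = 0, bestDist = Infinity;
    for (let i = 0; i < k; i++) {
      const [cr, cg, cb] = centroids[i];
      const d = (r - cr) ** 2 + (g - cg) ** 2 + (b - cb) ** 2;
      if (d < bestDist) { bestDist = d; best = i; }
    }
    sums[best][0] += r; sums[best][1] += g; sums[best][2] += b; sums[best][3]++;
  }

  // Update: mean of assigned pixels; an empty cluster keeps its old position.
  return centroids.map(([cr, cg, cb], i) => {
    const [sr, sg, sb, n] = sums[i];
    return n > 0 ? [sr / n, sg / n, sb / n] : [cr, cg, cb];
  });
}
```

Run this in a loop and the centroids drift toward the image's dominant colors.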
It's the most widely used clustering algorithm in ML (IBM), and each iteration scales linearly with the number of points: O(n x K) distance computations.
The Five Steps (With Colors)
Choose K: How many palette colors do you want? K=6 means 6 clusters.
Initialize: Pick K random pixels as starting centroids. Wrong guesses are fine. The algorithm fixes them.
Assign: Every pixel computes the Euclidean distance to every centroid in RGB space and snaps to the nearest one. For a 256x256 image, that's 65,536 pixels x K centroids: at K=6, nearly 400,000 distance calculations per iteration.
Update: Each centroid moves to the average color of its assigned pixels.
Converge: Repeat assign+update until centroids stop moving. Typical images converge in 5-15 iterations.
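The convergence test in step 5 can be a one-liner. This is a hypothetical helper (`hasConverged` and the epsilon choice are ours): stop once no centroid moves more than a small threshold, here less than one 8-bit color step:

```javascript
// Converged when no centroid moved more than epsilon in RGB space.
// epsilon = 1/255 means "less than one 8-bit color step" for 0..1 colors.
function hasConverged(oldCentroids, newCentroids, epsilon = 1 / 255) {
  return oldCentroids.every(([r, g, b], i) => {
    const [nr, ng, nb] = newCentroids[i];
    return Math.hypot(nr - r, ng - g, nb - b) < epsilon;
  });
}
```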
Why the GPU?
The assignment step is embarrassingly parallel. Each pixel's computation is independent. No pixel needs to know about any other pixel. That's exactly what GPUs do: run thousands of identical programs simultaneously.
When we first tried running K-Means in JavaScript on the CPU, anything above 256x256 was noticeably slow. Moving the assignment step to a WebGL2 fragment shader made it instant.
The Fragment Shader
This is the actual production code. Not pseudocode.
```glsl
#version 300 es
precision highp float;

uniform sampler2D u_data;       // pixel colors as texture
uniform vec3 u_centroids[32];   // current centroid positions
uniform int u_k;                // number of clusters
uniform int u_numPoints;        // total pixel count

out vec4 outColor;

void main() {
  ivec2 texCoord = ivec2(gl_FragCoord.xy);
  int texWidth = textureSize(u_data, 0).x;
  int idx = texCoord.y * texWidth + texCoord.x;
  if (idx >= u_numPoints) { outColor = vec4(-1.0); return; }

  vec3 pt = texelFetch(u_data, texCoord, 0).rgb;
  float minDist = 999999.0;
  float bestK = -1.0;

  for (int i = 0; i < 32; i++) {
    if (i >= u_k) break;
    float d = distance(pt, u_centroids[i]);
    if (d < minDist) { minDist = d; bestK = float(i); }
  }

  outColor = vec4(bestK, pt.r, pt.g, pt.b);
}
```
The trick: pixel RGB values are packed into an RGBA32F floating-point texture. The shader runs once per texel, computes distance to each centroid, and outputs the cluster assignment. One drawArrays call processes all 65,536 pixels in parallel.
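The packing step might look like this on the JavaScript side. A sketch: `packPixels` is a hypothetical helper, and the upload call in the trailing comment assumes a WebGL2 context named `gl`:

```javascript
// Pack N pixel colors into an upload buffer for an RGBA32F texture.
// Layout matches the shader's idx = y * texWidth + x; alpha is unused padding.
function packPixels(pixels, texWidth) {
  const texHeight = Math.ceil(pixels.length / texWidth);
  const data = new Float32Array(texWidth * texHeight * 4); // 4 floats per texel
  pixels.forEach(([r, g, b], i) => {
    data[i * 4] = r;
    data[i * 4 + 1] = g;
    data[i * 4 + 2] = b;
    data[i * 4 + 3] = 1.0; // alpha, ignored by the shader
  });
  return { data, texWidth, texHeight };
}

// Upload would then be:
// gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA32F, texWidth, texHeight, 0,
//               gl.RGBA, gl.FLOAT, data);
```

Texels past `numPoints` stay zero-filled; the shader's `idx >= u_numPoints` guard writes -1 for them so the CPU can skip them later.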
GPU K-Means achieves 100x+ speedup over CPU at scale (Springer, 2022).
The Hybrid Loop
The full K-Means loop isn't entirely on the GPU. Assignment parallelizes perfectly, but centroid update requires aggregation:
- CPU uploads centroid positions as shader uniforms
- GPU runs the fragment shader (assignment)
- CPU reads results via readPixels (the bottleneck)
- CPU accumulates sums per cluster, divides to get new centroids
- Repeat until converged
readPixels stalls the pipeline (GPU-to-CPU transfer), but the parallel assignment step more than compensates.
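The CPU side of that readback can be sketched as follows. A hypothetical helper (`updateCentroids` is our naming), assuming the shader's `[clusterIndex, r, g, b]` output layout with -1 marking padding texels:

```javascript
// Turn the readPixels float buffer back into new centroid positions.
// Each texel is [clusterIndex, r, g, b]; index -1 marks padding past numPoints.
function updateCentroids(readback, k, oldCentroids) {
  const sums = Array.from({ length: k }, () => [0, 0, 0, 0]); // r, g, b, count
  for (let i = 0; i < readback.length; i += 4) {
    const cluster = readback[i] | 0;
    if (cluster < 0) continue; // skip padding texels
    sums[cluster][0] += readback[i + 1];
    sums[cluster][1] += readback[i + 2];
    sums[cluster][2] += readback[i + 3];
    sums[cluster][3]++;
  }
  // Mean per cluster; an empty cluster keeps its old centroid.
  return sums.map(([r, g, b, n], i) =>
    n > 0 ? [r / n, g / n, b / n] : oldCentroids[i]
  );
}
```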
Real Numbers
- 256x256 image, K=6: 5-8 iterations, 8-15ms
- 256x256 image, K=12: 8-12 iterations, 15-25ms
- We cap at 50 iterations, but we've never seen an image need more than 20.
Try It
The Palette Extractor is free, runs in any browser with WebGL2 (92% support), and never uploads your images. Drop an image, pick K, get colors.
Full writeup with the interactive K-Means step-by-step visualizer: velocaption.com/blog/k-means-gpu-palette-extractor
We build browser-based tools at Velocaption. Our main product is a desktop video editor for silence removal and auto-captioning. If you're curious about browser performance, we also wrote about how we got 60fps video playback using a similar bypass-the-framework approach.