Alan Rock

Posted on Dec 18, 2024 • Edited on Dec 23, 2024

GpuScript: C# is no longer just for the CPU.

#csharp #gpu #gpgpu #productivity

Introduction

GpuScript lets a software developer program and debug the GPU in C#, turning a single laptop with any GPU into a supercomputer. Depending on the application, GpuScript can increase software development productivity by as much as 50 times and make programs run up to a million times faster. GpuScript is free and open source, and a typical C# programmer can learn it in about 30 minutes.

GpuScript Performance

A popular benchmark comparing various computer languages was made for entertainment purposes. The following figure adds GpuScript to that comparison: it was 100,000 times faster than C and over 10 million times faster than Python. Python/PyPy can incorporate GPU acceleration using vectorization and run 1000 times faster than Python without GPU acceleration, giving a computation time 10 times faster than C but still 10,000 times slower than GpuScript.

Figure: One Billion Nested Loop Language Comparison in μs.

How can GpuScript achieve so much higher performance than other languages that use GPU acceleration? Just because a language or a program reports using GPU acceleration does not mean that it is utilizing the GPU to its full extent. A single GpuScript process usually utilizes only about 20% of the GPU's potential; multiple processes running on the same computer through the GS Cloud are required to take full advantage of both the CPU and GPU cores. Even so, a single process is still several orders of magnitude faster than other languages using GPU acceleration.

GpuScript accomplishes high speeds in several ways:

Minimizing CPU/GPU memory transfers and transferring memory only when necessary
Allowing more code to be ported from the CPU to the GPU
Reducing GPU calls by utilizing Group Shared Memory and Intrinsic Functions when possible
Allowing programs to be designed specifically for the GPU rather than relying on GPU acceleration to speed up code designed to run on the CPU

GpuScript could itself be considered a type of GPU acceleration technique, because GpuScript runs entirely in C# on the CPU, using the standard .NET CLR with no libraries or extensions other than Unity. GpuScript generates GPU code and simulates, controls, and communicates with the GPU, but GpuScript does not actually run on the GPU.

GpuScript is a Language

Languages are often built on other languages: Yale Haskell, an early Haskell implementation, was written in Lisp, and the first C++ compiler, Cfront, translated C++ into C. GpuScript is written in C#. Although GpuScript is similar to C#, it is actually an amalgamation of several languages, including HLSL, ShaderLab, OpenCL, OpenGL, and CUDA. However, there is no need to learn all these languages. C# is all that is required. Java, JavaScript, C++, Python, and other Object-Oriented Programming (OOP) languages share the same core concepts, so a working knowledge of any of them is enough to get started.

GpuScript supports programming and debugging the GPU in OOP, which makes it easier to write large and complex programs that run almost entirely on the GPU. GPU programs are usually difficult to write and debug, so GPU routines are typically small and simple. Small routines require significant memory transfer between the CPU and GPU, which can be reduced by moving more of the workload to the GPU.

Enum

Enumerations are missing from some GPU languages (GLSL, for example, has no enum type) but are fully supported in GpuScript.

  public enum Rate { Low, Medium, High }
  Rate rate = Rate.Low;

Swizzling

Swizzling is a language construct from GPU programming languages that allows reordering or copying vector components. The following code declares three float2 variables: a = float2(0, 1), b = float2(1, 0), and c = float2(1, 1).

  float2 a = f01, b = a.yx, c = b.xx;

GpuScript also supports a type of swizzling for initializing vector components with -1, 0, and 1. The following code declares three float2 variables: a = float2(0, 1), b = float2(0, 0), and c = float2(-1, 1).

  float2 a = f01, b = f00, c = f_1;
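
For readers without Unity at hand, the swizzles above can be mimicked in plain C#. This is only a sketch: the `float2` struct below is a hypothetical stand-in rather than GpuScript's own type, and reading `_` as the -1 component is an assumption drawn from the values listed in the text. Only the names `f01`, `f00`, `f_1`, `yx`, and `xx` come from the article.

```csharp
using System;

// Hypothetical stand-in for a GPU-style 2-component vector (not GpuScript's own type).
public readonly struct float2
{
    public readonly float x, y;
    public float2(float x, float y) { this.x = x; this.y = y; }

    // Swizzles: read-only properties that reorder or copy components.
    public float2 yx => new float2(y, x);
    public float2 xx => new float2(x, x);

    public override string ToString() => $"({x}, {y})";
}

public static class SwizzleDemo
{
    // Initializers named after their components; '_' is assumed to stand for -1.
    public static readonly float2 f01 = new float2(0, 1);
    public static readonly float2 f00 = new float2(0, 0);
    public static readonly float2 f_1 = new float2(-1, 1);

    public static void Main()
    {
        float2 a = f01, b = a.yx, c = b.xx;
        Console.WriteLine($"{a} {b} {c}"); // (0, 1) (1, 0) (1, 1)
    }
}
```

Expressing swizzles as properties keeps the call sites identical to shader code, at the cost of declaring one property per component combination.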

Vector Comparisons

All() and Any() are GPU language constructs for comparing vector components. GpuScript uses operator overloading so that comparing two vectors returns a vector of booleans. This allows All() and Any() to behave the same way on both the CPU and GPU.

  float2 a = f01, b = f10;
  bool all_less = All(a < b); //false, a.y > b.y
  bool any_less = Any(a < b); //true , a.x < b.x
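
The operator overloading described above might look roughly like this in plain C#. The `float2` and `bool2` structs are hypothetical stand-ins written for this sketch, not GpuScript's actual definitions; only `All`, `Any`, and the comparison itself come from the text.

```csharp
using System;

// Hypothetical stand-ins (not GpuScript's own types).
public readonly struct bool2
{
    public readonly bool x, y;
    public bool2(bool x, bool y) { this.x = x; this.y = y; }
}

public readonly struct float2
{
    public readonly float x, y;
    public float2(float x, float y) { this.x = x; this.y = y; }

    // Component-wise comparison returns a vector of bools, as on the GPU.
    // C# requires < and > to be overloaded as a pair.
    public static bool2 operator <(float2 a, float2 b) => new bool2(a.x < b.x, a.y < b.y);
    public static bool2 operator >(float2 a, float2 b) => new bool2(a.x > b.x, a.y > b.y);
}

public static class CompareDemo
{
    public static bool All(bool2 v) => v.x && v.y;
    public static bool Any(bool2 v) => v.x || v.y;

    public static void Main()
    {
        float2 a = new float2(0, 1), b = new float2(1, 0);
        Console.WriteLine(All(a < b)); // False: a.y > b.y
        Console.WriteLine(Any(a < b)); // True:  a.x < b.x
    }
}
```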

New

GPU languages do not require a "new" keyword when declaring vectors or structs, but C# does. To resolve this difference, GpuScript allows both.

  float2 a = new float2(0, 1), b = float2(1, 0);
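
One way both spellings can coexist in ordinary C# is a static method that shares the type's name. The article does not say how GpuScript implements this, so the sketch below is an assumption; `NewDemo` and the minimal `float2` struct are hypothetical.

```csharp
using System;

// Hypothetical minimal vector type for this sketch.
public struct float2
{
    public float x, y;
    public float2(float x, float y) { this.x = x; this.y = y; }
}

public static class NewDemo
{
    // A static method with the same name as the type makes `new` optional at call sites.
    // In a type position `float2` still resolves to the struct; in a call it resolves here.
    public static float2 float2(float x, float y) => new float2(x, y);

    public static void Main()
    {
        float2 a = new float2(0, 1), b = float2(1, 0);
        Console.WriteLine($"{a.x} {a.y} {b.x} {b.y}"); // 0 1 1 0
    }
}
```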

Inheritance

Methods from the base class and from libraries may be overridden in the derived class. This lets a customized library method access the derived class's data directly, without costly reorganization of function input or output data structures.

Memory Transfer

Transferring memory between the CPU and GPU can be costly. GpuScript keeps track of all CPU and GPU memory access to minimize memory transfer and synchronize CPU and GPU memory.
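
The idea of tracking access to avoid needless copies can be illustrated with a pair of dirty flags. This is a minimal sketch under stated assumptions, not GpuScript's mechanism: `TrackedBuffer` and all of its members are hypothetical, and a second array stands in for GPU memory.

```csharp
using System;

// Hypothetical sketch: a buffer that copies between "CPU" and "GPU" arrays only when needed.
public sealed class TrackedBuffer
{
    private readonly float[] cpu, gpu;   // gpu[] stands in for device memory
    private bool cpuDirty, gpuDirty;     // which side holds newer data
    public int Transfers { get; private set; }

    public TrackedBuffer(int n) { cpu = new float[n]; gpu = new float[n]; }

    public void CpuWrite(int i, float v) { SyncToCpu(); cpu[i] = v; cpuDirty = true; }
    public float CpuRead(int i) { SyncToCpu(); return cpu[i]; }

    // Simulated kernel: doubles every element on the "GPU".
    public void GpuDoubleAll()
    {
        SyncToGpu();
        for (int i = 0; i < gpu.Length; i++) gpu[i] *= 2;
        gpuDirty = true;
    }

    private void SyncToGpu()
    {
        if (!cpuDirty) return;           // GPU copy is already current
        Array.Copy(cpu, gpu, cpu.Length); cpuDirty = false; Transfers++;
    }

    private void SyncToCpu()
    {
        if (!gpuDirty) return;           // CPU copy is already current
        Array.Copy(gpu, cpu, gpu.Length); gpuDirty = false; Transfers++;
    }
}

public static class TransferDemo
{
    public static void Main()
    {
        var buf = new TrackedBuffer(4);
        for (int i = 0; i < 4; i++) buf.CpuWrite(i, i + 1); // four writes, no transfer yet
        buf.GpuDoubleAll();  // one upload
        buf.GpuDoubleAll();  // data is already on the "GPU": no transfer
        Console.WriteLine($"{buf.CpuRead(3)} after {buf.Transfers} transfers"); // 16 after 2 transfers
    }
}
```

Two kernel launches and five CPU accesses cost only two copies here, because a copy happens only when the other side holds newer data.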

Intrinsic Functions

GpuScript handles intrinsic interlocked functions similarly to GPU languages. Intrinsic functions only work with integer and unsigned integer buffers, but can be very powerful. For example, InterlockedMin can locate the minimum few elements in an array without sorting. InterlockedAdd can determine the sum of an array, perform matrix multiplication, compute an FFT, sum forces in a Distinct Element model, or sum signals in a neural network node in O(1) per thread. Depending on the application, intrinsic functions can make a program several orders of magnitude faster.
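
A CPU analogue helps show what these intrinsics do. The sketch below uses .NET's `System.Threading.Interlocked` rather than GpuScript, with each `Parallel.For` iteration standing in for one GPU thread. .NET has no built-in interlocked minimum, so `InterlockedMin` here is a hand-written compare-and-swap loop; the class and method names are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class InterlockedDemo
{
    // CPU analogue of a GPU InterlockedMin: retry until our value is stored or is not smaller.
    public static void InterlockedMin(ref int target, int value)
    {
        int seen = Volatile.Read(ref target);
        while (value < seen)
        {
            int prev = Interlocked.CompareExchange(ref target, value, seen);
            if (prev == seen) break;  // our value was stored
            seen = prev;              // another thread changed it; re-check
        }
    }

    public static (int sum, int min) SumAndMin(int[] data)
    {
        int sum = 0, min = int.MaxValue;
        // Each iteration plays the role of one GPU thread doing constant work.
        Parallel.For(0, data.Length, i =>
        {
            Interlocked.Add(ref sum, data[i]);
            InterlockedMin(ref min, data[i]);
        });
        return (sum, min);
    }

    public static void Main()
    {
        var (sum, min) = SumAndMin(new[] { 7, 3, 9, 1, 5 });
        Console.WriteLine($"sum={sum} min={min}"); // sum=25 min=1
    }
}
```

Like their GPU counterparts, these operations work on integers, which is why floating-point data has to be scaled to integers first when it is accumulated this way.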

Group Shared Memory

GroupSharedMemory / AllMemoryBarrierWithGroupSync are GPU language constructs to reduce global GPU update calls. GpuScript supports full debugging of these constructs, significantly speeding up algorithms such as FFT, AppendBuff, prefix sums, sorting, random numbers, or reduction techniques.
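
As an illustration of the kind of algorithm these constructs support, here is a tree reduction over one thread group, simulated sequentially in plain C#. It is a sketch, not GpuScript code: the array `shared` stands in for group shared memory, the inner loop stands in for the threads of the group, and the comment marks where a group sync barrier would sit on a real GPU.

```csharp
using System;

public static class GroupSharedDemo
{
    // Simulates one thread group summing its values in shared memory.
    // input.Length must be a power of two.
    public static int GroupSum(int[] input)
    {
        int[] shared = (int[])input.Clone();  // stands in for groupshared memory
        for (int stride = shared.Length / 2; stride > 0; stride /= 2)
        {
            // Every "thread" with id < stride adds its partner's value.
            for (int id = 0; id < stride; id++)
                shared[id] += shared[id + stride];
            // On a GPU, a group sync barrier would go here before the next pass.
        }
        return shared[0];
    }

    public static void Main()
    {
        Console.WriteLine(GroupSum(new[] { 1, 2, 3, 4, 5, 6, 7, 8 })); // 36
    }
}
```

The group finishes in log2(N) passes without leaving shared memory, so only the final value needs to be written to a global buffer.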

GPU Compute Kernels

GPU languages require each kernel to have a thread block declaration, attachment of all buffers used by the kernel and all methods used by the kernel, and that all methods be declared before they are called. This becomes a daunting task as program complexity and size increase. GpuScript handles all these requirements, so the programmer simply writes and organizes code the same as when working with any OOP language.

Graphics Shaders

GpuScript redesigns how graphics shaders are designed and used, so they are more similar to calling a normal function. This makes it easier to design large and complex graphics systems, displaying 3D volumetric raymarching, axes, legends, signals, and millions of spheres, arrows, lines, and 3D text all using a single graphics shader.

Code Duplication

The CPU code, compute shaders, and graphics shaders often use the same data, variables, and methods, resulting in code duplication. This complicates implementation and debugging. GpuScript hides these details from the programmer, so the programmer only works with a single version of code and data.

GpuScript is a Code Generator

GpuScript presents an alternative approach to programming that is highly efficient and productive. GpuScript generates all the boilerplate code for UI, GPU, and CPU, leaving only program critical code to be filled in by the programmer. GpuScript analyzes the code, and creates a checkbox when a boolean is declared, a button when a method is defined, a textbox with a scrollbar when a float is declared, and a grid of UI elements when a class or struct array is declared. GpuScript does the same for GPU compute kernels, buffers, variables, and graphics shaders. The programmer simply writes code with some additional attribute properties, and GpuScript does the rest. The programmer still has complete control to override the generated UI and code, but the time savings in coding and debugging increases productivity by orders of magnitude. This approach also hides coding details from the programmer, making GPU programming easy to learn, efficient to develop, and run at supercomputer speeds.
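
The mapping from declarations to UI elements can be sketched with .NET reflection. This is a swapped-in technique chosen for brevity: the article describes GpuScript as analyzing code and generating it, whereas the sketch below inspects a compiled type at run time. The `Settings` class, the `UiSketch` class, and the output strings are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical program-critical code: the only part the programmer would write.
public class Settings
{
    public bool showAxes;
    public float radius;
    public void Reset() { }
}

public static class UiSketch
{
    // Maps each declared member to the kind of UI element the text describes.
    public static List<string> Describe(Type t)
    {
        const BindingFlags flags =
            BindingFlags.Public | BindingFlags.Instance | BindingFlags.DeclaredOnly;
        var ui = new List<string>();
        foreach (FieldInfo f in t.GetFields(flags))
        {
            if (f.FieldType == typeof(bool)) ui.Add($"checkbox:{f.Name}");
            else if (f.FieldType == typeof(float)) ui.Add($"textbox+scrollbar:{f.Name}");
            else if (f.FieldType.IsArray) ui.Add($"grid:{f.Name}");
        }
        foreach (MethodInfo m in t.GetMethods(flags))
            if (!m.IsSpecialName) ui.Add($"button:{m.Name}"); // skip property accessors
        return ui;
    }

    public static void Main()
    {
        foreach (string line in Describe(typeof(Settings))) Console.WriteLine(line);
    }
}
```

Running it lists one checkbox, one textbox with a scrollbar, and one button for `Settings`, with no UI code written by hand.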

Opposite of Visual Programming

Visual Programming Languages (VPLs) have been a popular style for decades: Visual Basic, Visual C++, Visual C#, Delphi Pascal, LabVIEW, etc. VPLs are highly manual and difficult to automate, which on average increases UI development time. VPLs require considerable screen space, are difficult to debug, and require several levels of text-based menus and trees for UI settings. VPLs also have a steep learning curve, especially for experienced programmers. GpuScript is entirely text-based, the exact opposite of VPLs. GpuScript generates UI code by examining program-critical code, eliminating the need to develop UI code manually. Because of this high level of automation, the developer spends almost no time building the UI, and the UI is highly consistent across applications. The same approach is applied to the GPU: paradoxically, GpuScript lets the developer program the GPU without actually writing GPU code. This is the key to how GpuScript achieves high productivity with a small learning curve.

A note about UI. GpuScript generates a generic UI using UIBuilder and UIElements in Unity. The appearance of the UI can be customized using UXML files. GpuScript can be instructed to generate applications with no UI, allowing the programmer to implement a custom UI if desired. Since GpuScript is open source, the programmer can modify the code that generates the UI, so that GpuScript can automatically generate a completely different custom UI.

GpuScript is a GPU Simulator

GpuScript simulates all GPU operations on the CPU: scalars, vectors, matrices, methods, buffers, group shared memory, sync operations, and threads. This allows some of the program to run on the CPU while the rest runs on the GPU. Because the GPU is much faster than the CPU for many tasks, this capability is essential for debugging at full scale.

How is it possible to use the CPU with limited thread-pools to simulate the high number of concurrent threads on the GPU? GpuScript accomplishes this with coroutines, or iterator blocks. This allows a comprehensive GPU simulation to be performed entirely on the main thread, without the need to create multiple CPU threads.
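
A small sketch can make the coroutine idea concrete. It is written in plain C# and is not GpuScript's simulator: each simulated GPU thread is an iterator block, each `yield return` is treated as a group sync barrier, and a simple round-robin loop on the calling thread advances every thread to its next barrier. The names `Kernel` and `RunGroup` are hypothetical.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class CoroutineGpuSketch
{
    // One simulated GPU thread. Each `yield return` marks a group sync barrier.
    static IEnumerator Kernel(int id, int[] shared, int[] output)
    {
        shared[id] = id * 10;                        // phase 1: every thread writes its slot
        yield return null;                           // barrier: wait for all writes
        int neighbor = (id + 1) % shared.Length;
        output[id] = shared[id] + shared[neighbor];  // phase 2: safely read a neighbor's slot
    }

    // Runs all threads on the calling thread, advancing each to its next barrier in turn.
    public static int[] RunGroup(int threadCount)
    {
        var shared = new int[threadCount];
        var output = new int[threadCount];
        var threads = new List<IEnumerator>();
        for (int id = 0; id < threadCount; id++) threads.Add(Kernel(id, shared, output));

        bool anyAlive = true;
        while (anyAlive)
        {
            anyAlive = false;
            foreach (IEnumerator t in threads)
                if (t.MoveNext()) anyAlive = true;
        }
        return output;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", RunGroup(4))); // 10, 30, 50, 30
    }
}
```

Because no thread passes a barrier until every thread has reached it, phase 2 always sees the completed writes of phase 1, and no operating-system threads or locks are involved.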

GpuScript Implementation

GpuScript is integrated into Unity and consists of several components: an editor window for building projects and libraries, a series of classes and structs for simulating the GPU, an automated and persistent UI, and a set of hierarchical precompiled libraries, both internal and external. GpuScript is implemented in a mostly Functional Programming style using C#, with HLSL macros and functions to make HLSL and ShaderLab conform to C#. GpuScript automates many Unity tasks related to GPU programming, such as creating compute shaders, graphics shaders, Unity materials, and numerous settings and links.

CPU and GPU Programming Differences

It's no surprise that serial programming and massively parallel programming styles have differences. Simple loops in CPU code can be parallelized, but speed-ups are minimal. Redesigning CPU code is the only way to fully utilize the GPU. Parallel programming is a different style of programming. Speed-ups can be achieved by running multiple models at the same time, each with different parameters. A single FFT can be designed to run slightly faster, but significant speedups can be achieved by running millions of FFTs at the same time.

GpuScript is Open Source

GpuScript is open source and free to use. Unity is also free for individual programmers and small companies.

GpuScript Libraries

GpuScript can be expanded and upgraded with available libraries.

Since the GpuScript libraries are written in GpuScript, the libraries naturally have exceptional performance.

AppendBuff

Append buffers are included in most GPU programming languages and are often specially built directly into the GPU firmware. GpuScript includes a library that rivals the speed and memory requirements of GPU append buffers. AppendBuff supports prefix sums as well as append buffers, uses 32 times less memory than append buffers, and does not require prior knowledge of the append buffer size to avoid crashing the computer. Depending on the GPU, append buffers are notorious for giving incorrect results, which AppendBuff fixes.
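
The link between prefix sums and append buffers can be seen in a few lines. The sketch below is a generic stream compaction written in plain C#, not AppendBuff's implementation: an exclusive prefix sum over keep flags gives every kept element its output slot, so the output size is known exactly before anything is written. The names are hypothetical.

```csharp
using System;

public static class CompactionSketch
{
    // Returns the indices i where keep[i] is true, using an exclusive prefix sum
    // to assign each kept element its slot in the output.
    public static int[] KeptIndices(bool[] keep)
    {
        int n = keep.Length;
        var offset = new int[n + 1];
        for (int i = 0; i < n; i++) offset[i + 1] = offset[i] + (keep[i] ? 1 : 0);

        var result = new int[offset[n]];  // exact size is known before writing
        // Each iteration is independent, so on a GPU this could be one thread per element.
        for (int i = 0; i < n; i++)
            if (keep[i]) result[offset[i]] = i;
        return result;
    }

    public static void Main()
    {
        bool[] keep = { false, true, true, false, true };
        Console.WriteLine(string.Join(", ", KeptIndices(keep))); // 1, 2, 4
    }
}
```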

BDraw

GpuScript includes a library for drawing pixels and billboard graphical objects, including spheres, lines, arrows, signals, and 3D text. Billboards are rectangles that rotate to face the camera. A sphere billboard always faces the camera, but lines, arrows, signals, and text rotate about their local x-axis so that the y- and z-axes face the camera. Billboards are rendered with a pixel shader for high-resolution rendering and interact well with other Unity graphical objects and models. Billboards give the appearance of 3D graphics and are very fast, with the capability to draw tens of millions of billboards at high frame rates.

Figure: BDraw Spheres, Arrows, and Text

OCam_Lib

OCam is a library that includes an orbit camera, multiple camera views, and a legend.

View_Lib

View_Lib is a library that can save and load selected settings, such as camera viewing parameters, in a grid for quick access with a short-cut keystroke.

Report_Lib

Report_Lib is an important library that is a prerequisite for most projects. Report_Lib is to an application what a batch file is to an operating system, except that Report_Lib is more powerful than a batch file or running an application with command line arguments.

Report_Lib is a text file with instructions that can control and automate all aspects of the program. It can generate reports, documentation, or run data analysis with tables, equations, figures, and animations. Report_Lib was used to generate all the documentation for GpuScript and all the libraries documented on the GpuScript.com website.

Report_Lib can perform thorough testing for debugging purposes, for both the CPU and GPU. Report_Lib replaces unit testing and profiling tools, and works well for both computation and graphics.

Project_Lib

Project_Lib adds support for multiple projects in an application. Projects may be selected, created, copied, renamed, or archived.

Backup_Lib

Backup_Lib makes it quick and easy to back up code or data locally or to an external hard drive.

Puppeteer_Lib

Puppeteer_Lib can be used to automate the Google Chrome browser. Almost anything that can be done manually in a browser can be done with the Puppeteer_Lib, including searching and downloading data, language translation, maps, etc. Puppeteer_Lib reduces or eliminates the need for web APIs to perform similar tasks.

Cloud_Lib

Cloud_Lib adds multi-user support and multi-process distributed computing to an application. For example, a desktop computer or laptop may have 8 CPU cores and a GPU. Running a single application may only utilize 20% of the GPU, and the main thread may only run on a single CPU core. Cloud_Lib allows the application to run multiple times on a single computer, fully utilizing the GPU and in this case obtaining a 5 times speedup. A local area network (LAN) with 10 computers could achieve a 50 times speedup. Ten LANs connected across the internet could achieve a 500 times speedup, depending on the application.
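
The arithmetic behind these figures is a simple product. The helper below is a hypothetical illustration, not part of Cloud_Lib, and it gives an ideal-case estimate only: it assumes perfect scaling with no communication or scheduling cost.

```csharp
using System;

public static class CloudSpeedup
{
    // Ideal-case estimate only: assumes perfect scaling and no communication cost.
    public static int Estimate(int gpuUtilizationPercent, int computersPerLan, int lans)
    {
        int processesPerComputer = 100 / gpuUtilizationPercent; // 20% each -> 5 processes
        return processesPerComputer * computersPerLan * lans;
    }

    public static void Main()
    {
        Console.WriteLine(Estimate(20, 1, 1));   // 5   (one computer)
        Console.WriteLine(Estimate(20, 10, 1));  // 50  (one LAN of 10 computers)
        Console.WriteLine(Estimate(20, 10, 10)); // 500 (ten such LANs)
    }
}
```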

Data may be cached for instant access. Methods and coroutines are easily tasked and optimally scheduled for distributed processing. Results are combined when processing is complete.

Cloud_Lib keeps track of all connections, supports multiple licensing for multiple applications with password protection, and supports triple encryption.

Rand

Random numbers are very useful for statistics, simulation, search, integration, scheduling, simulated annealing (traveling salesman problem), and Monte-Carlo methods. Random numbers are quick to initialize and generate on the GPU. The NVIDIA GPU Gems publication devotes 20 pages to a random number generator in CUDA, but states that initializing random numbers on the GPU is beyond the scope of the paper. It can take considerable time to initialize random numbers on the CPU and transfer the memory to the GPU. The Rand library also provides a variety of random number functions for generating different distributions and geometries.

Rand was benchmarked at an equivalent 0.8 nanoseconds per floating-point random number (24 operations) on a GPU rated at 20 TFLOPS.

VGrid_Lib

GpuScript contains a library for 3D volumetric rendering that is unrivaled in speed and resolution. Marching Cubes is commonly used to generate a set of triangles for each voxel. The number of triangles is initially unknown, often requiring extra computation to determine. Ill-formed triangles either result in inconsistencies or require additional computation and storage to remove or avoid. The resulting triangles may require considerable memory storage and be quite inconsistent from one contour to the next. VGrid only requires storing or computing one value per voxel and renders smooth ray-traced contours directly on the GPU without the need to generate triangles. The results are full resolution CT-scans at hundreds of frames per second.

GEM_Lib

Geometric Empirical Modeling (GEM) is a GpuScript library that revolutionizes AI neural networks. The initial motivation for writing GpuScript was to implement GEM on the GPU. GEM solves neural networks directly and instantaneously, including the neural network structure and all weights and offsets. Neural networks are the core of much of AI and machine learning, including the recently popular Large Language Models (LLMs). GEM eliminates the need for trial and error to determine the number of hidden layers and the number and type of nodes in each layer. GEM reduces or eliminates the need to simplify data representation, to reduce non-linearity, dimensionality, clustering, and input correlations. GEM eliminates high computation requirements for training, problems with over-fitting or under-fitting, problems with getting stuck in a local minimum, and problems selecting an optimal learning rate. In other words, GEM eliminates the need for AI experts, AI factories, and GPU super-clusters. GEM is a paradigm shift in AI. A separate paper dedicated to GEM will be published on Dev.to.

Matrix_Lib

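GpuScript contains a high-speed library for matrix operations. Matrix multiplication was a standard benchmark for comparing supercomputer performance. The operation counts used here are O(N^2), as for the product of an N x N matrix with a vector (multiplying two N x N matrices with the standard algorithm is O(N^3)). For a 4096 x 4096 matrix, that is about 16.8 million floating-point multiplications and 16.8 million additions, or roughly 33.5 million operations in total. With one GPU thread per output element, each thread does O(N) work, requiring only 8192 floating-point operations.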

The GpuScript Matrix library scales the matrix and combines multiplications using intrinsic addition. This changes the operation count from O(N^2) to O(1) per thread, meaning that each thread requires only a single floating-point operation. The result is an equivalent 23 PFLOPS on a GPU rated at only 20 TFLOPS. The library can perform a matrix multiplication in the equivalent of 1.44 nanoseconds.

In addition, GEM can be trained on matrix vector pairs and perform matrix inversion and singular value decomposition, typically O(N^3), in O(1). In these cases, the speedup is practically immeasurable.
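
The phrase "scales the matrix and combines multiplications using intrinsic addition" can be illustrated on the CPU. The sketch below is an assumption about the general idea, not the Matrix library's code: it computes a matrix-vector product, converts values to fixed-point integers with an assumed scale of 256, and lets each simulated thread perform one multiply and one atomic add. `ScaledMatVec` and `Scale` are hypothetical names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ScaledMatVec
{
    const int Scale = 256; // assumed fixed-point scale; each product carries Scale * Scale

    // y = A * x, with one simulated thread per (row, column) product.
    public static float[] Multiply(float[,] a, float[] x)
    {
        int rows = a.GetLength(0), cols = a.GetLength(1);
        var acc = new int[rows]; // integer accumulators, since interlocked adds are integer-only

        Parallel.For(0, rows * cols, k =>
        {
            int i = k / cols, j = k % cols;
            int ai = (int)Math.Round(a[i, j] * Scale);
            int xj = (int)Math.Round(x[j] * Scale);
            Interlocked.Add(ref acc[i], ai * xj); // one multiply, one atomic add
        });

        var y = new float[rows];
        for (int i = 0; i < rows; i++) y[i] = acc[i] / (float)(Scale * Scale);
        return y;
    }

    public static void Main()
    {
        var a = new float[,] { { 1f, 2f }, { 0.5f, -1f } };
        var x = new float[] { 2f, 0.25f };
        float[] y = Multiply(a, x);
        Console.WriteLine(y[0] == 2.5f && y[1] == 0.75f); // True
    }
}
```

The trade-off is precision and range: values are rounded to multiples of 1/256, and large matrices or large values can overflow a 32-bit accumulator, so the scale has to be chosen for the data.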

FFT_Lib

The GpuScript Fast Fourier Transform (FFT) library can compute a 4096-sample FFT in the equivalent of 3 nanoseconds. Applying the same technique of rescaling with intrinsic addition could improve performance significantly further and allow transforming signals of arbitrary sizes.

Sort_Lib

Sorting is typically O(N log N) on the CPU and O(log N) on the GPU. The GpuScript Sort library is O(1).

The GpuScript Sort library can sort a 2048-element floating-point array in an equivalent 0.2 nanoseconds. This library will be expanded to sort 4M-element arrays in O(1) using an entirely new sorting algorithm.
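
The article does not describe the algorithm behind Sort_Lib. One well-known technique in which every thread does constant work is rank sort, sketched below in plain C# as an illustration only: one simulated thread per pair (i, j) performs a single comparison and, if needed, a single atomic increment. The class name is hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RankSortSketch
{
    // Rank sort: N * N simulated threads, each doing one comparison and at most
    // one atomic increment. rank[i] counts the elements that must precede data[i].
    public static float[] Sort(float[] data)
    {
        int n = data.Length;
        var rank = new int[n];
        Parallel.For(0, n * n, k =>
        {
            int i = k / n, j = k % n;
            // data[j] precedes data[i] if smaller, or equal with a lower index (tie-break).
            if (data[j] < data[i] || (data[j] == data[i] && j < i))
                Interlocked.Increment(ref rank[i]);
        });

        var sorted = new float[n];
        for (int i = 0; i < n; i++) sorted[rank[i]] = data[i]; // ranks are unique: no collisions
        return sorted;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Sort(new[] { 3f, 1f, 2f, 1f }))); // 1, 1, 2, 3
    }
}
```

The constant per-thread cost is paid for with N * N threads, so a 2048-element array needs about 4 million of them. The sketch assumes the input contains no NaN values.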

Library Dependencies

AppendBuff

Backup

BDraw => AppendBuff

Cloud => Puppeteer

FFT

GEM => AppendBuff, Rand

Matrix

OCam => BDraw

Project

Puppeteer

Rand

Report => Puppeteer

VGrid => BDraw

Views

Conclusion

GpuScript presents a new approach to GPU program development. Debugging is made possible using a full GPU simulator running on a single thread. CPU/GPU memory transfer is significantly reduced using memory IO management and moving the majority of the workload from the CPU to the GPU. UI development is made easier by embedding the UI directly into the code. GPU development is also embedded directly into the code. Code translation allows both GPU computation and graphics to be entirely implemented in C#, without the need to learn CUDA, HLSL, ShaderLab, or other GPU languages.

GpuScript is a paradigm shift in programming. It is not the same as changing from Java to C#, which both have essentially the same productivity and performance. Programmers are used to adapting to small shifts in technology, but GpuScript is an extinction event. If computer languages were selected on the basis of productivity and performance, GpuScript would result in the near extinction of all other programming languages. People are naturally resistant to extreme changes that revolutionize entire industries. Paradigm shifts usually require considerable time to become mainstream.

The bottom line: The average programmer using GpuScript can more efficiently complete projects that run orders of magnitude faster with a smaller learning curve. No matter the dedication, motivation, perseverance, intelligence, or hard work, it's the tools that make all the difference. More grass can be cut with a lawnmower than with scissors, more snow can be moved with a snowplow than with a teaspoon, and more computation can be achieved on a laptop with GpuScript than with any other language.

Link

GpuScript on GitHub, free and open source