Blake Pelton

Posted on • Originally published at danglingpointers.substack.com

Instant neural graphics primitives with a multiresolution hash encoding

This was originally posted on Dangling Pointers. My goal is to help busy people stay current with recent academic developments. Head there to subscribe for regular summaries of computer science research.

Instant neural graphics primitives with a multiresolution hash encoding. Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. SIGGRAPH '22.

Neural Graphics Primitives

A neural graphics primitive is a trained neural network that approximates a function used during graphics rendering. For example, a trained neural network could map texture coordinates to colors, (u,v)→(r,g,b,a), and thus serve as a replacement for a texture map. Another example would be representing a 3D object via a signed distance function, (x,y,z)→distance, implemented with a neural network.
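
To make that concrete, here is a toy, untrained MLP with the texture-map signature. This is a minimal sketch in NumPy, not the paper's implementation; in practice the weights would be learned by fitting the network to an existing texture.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-layer MLP: (u, v) -> (r, g, b, a). Small random init just so
# the outputs stay in a sane range; real weights would be trained.
W1, b1 = 0.1 * rng.standard_normal((64, 2)), np.zeros(64)
W2, b2 = 0.1 * rng.standard_normal((4, 64)), np.zeros(4)

def texture_mlp(u, v):
    h = np.maximum(W1 @ np.array([u, v]) + b1, 0.0)  # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))      # sigmoid -> colors in [0, 1]

print(texture_mlp(0.5, 0.5))  # one (r, g, b, a) sample
```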

Spatial Embedding

The classic way to improve a neural network is to impose some structure on it, reducing the number of parameters required to represent the network in the process. Replacing a fully connected layer with a convolutional layer is an example.

Another example is a token embedding, which maps input token IDs to vectors via a lookup table. One way to define an embedding is as a matrix * vector multiply, where the input vector is one-hot encoded.
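
A quick NumPy sketch of that equivalence (the sizes are arbitrary): reading one row of the embedding table gives the same result as multiplying by a one-hot vector.

```python
import numpy as np

vocab_size, embed_dim = 8, 4
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, embed_dim))  # learned embedding table

token_id = 3

# Embedding as a lookup: read one row of the table.
lookup = E[token_id]

# Embedding as matrix * vector: one-hot encode the token, then multiply.
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0
matvec = E.T @ one_hot

assert np.allclose(lookup, matvec)
```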

The idea proposed in this paper is to add a spatial embedding to the network. Going with the texture map example, rather than require a generic neural network (e.g., a multilayer perceptron (MLP)) to do the full job, (u,v)→(r,g,b,a), start with a (learned) lookup table which maps texture coordinates to a feature vector, and then pass that to the rest of the network. In my informal notation: (u,v)→feature_vec→(r,g,b,a). Because (u,v) texture coordinates are real-valued while the lookup table is defined at integer coordinates, bilinear interpolation is used.
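
Here is a minimal sketch of that first stage in NumPy. The grid resolution and feature width are made-up values for illustration, not the paper's: a learned 2D grid of feature vectors, read at a real-valued (u, v) via bilinear interpolation.

```python
import numpy as np

grid_res, feat_dim = 16, 2
rng = np.random.default_rng(0)
# Learned lookup table: a feature vector at every integer grid point.
grid = rng.standard_normal((grid_res, grid_res, feat_dim))

def spatial_embedding(u, v):
    """Bilinearly interpolate the feature grid at real-valued (u, v) in [0, 1)."""
    x, y = u * (grid_res - 1), v * (grid_res - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, grid_res - 1), min(y0 + 1, grid_res - 1)
    tx, ty = x - x0, y - y0
    # Weighted sum of the 4 surrounding grid points.
    return ((1 - tx) * (1 - ty) * grid[x0, y0]
            + tx * (1 - ty) * grid[x1, y0]
            + (1 - tx) * ty * grid[x0, y1]
            + tx * ty * grid[x1, y1])

feature_vec = spatial_embedding(0.37, 0.81)  # this would feed the MLP
print(feature_vec.shape)  # (2,)
```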

Another way to say this is: completely replacing a texture map with an MLP is asking too much. It may be more efficient to instead replace a large texture map with the composition of a smaller texture map followed by a smaller MLP.

Logically, you can think of the spatial embedding as simply a matrix * vector multiply where most elements of the input vector are known to be zero (e.g., with bilinear filtering, at most 4 vector elements will be non-zero).
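
A small NumPy check of that view (the flattening and sizes are my own): put the four bilinear weights into an otherwise-zero input vector and multiply it against the flattened grid.

```python
import numpy as np

grid_res, feat_dim = 4, 2
rng = np.random.default_rng(0)
grid = rng.standard_normal((grid_res * grid_res, feat_dim))  # table, flattened

u, v = 0.4, 0.7
x, y = u * (grid_res - 1), v * (grid_res - 1)
x0, y0 = int(x), int(y)
tx, ty = x - x0, y - y0

# Build the (mostly zero) input vector: 4 bilinear weights, rest zeros.
sparse_vec = np.zeros(grid_res * grid_res)
for dx, dy, w in [(0, 0, (1 - tx) * (1 - ty)), (1, 0, tx * (1 - ty)),
                  (0, 1, (1 - tx) * ty), (1, 1, tx * ty)]:
    sparse_vec[(x0 + dx) * grid_res + (y0 + dy)] = w

feature_vec = grid.T @ sparse_vec  # matrix * (mostly zero) vector
print(np.count_nonzero(sparse_vec))  # 4
```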

Multi-Resolution Hashing

The paper proposes using multiple embeddings of varying sizes. I suppose you could think of it like a mipmap, with two key differences:

  1. Unlike a mipmap, all levels are read, and the results are concatenated to form the final feature vector.

  2. Parameter counts and storage requirements for the large levels are bounded by limiting the physical embedding dimensions. For levels whose logical size exceeds their physical size, a hash function maps (integer) locations in the logical embedding to locations in the physical embedding. No care is taken to handle collisions; it is assumed the subsequent MLP will learn to work around them (see the sketch after this list).
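
Here is a minimal sketch of the multiresolution hash encoding in NumPy. The level count, table size, and growth factor are made-up values; the spatial-hash primes follow the paper, but unlike the paper, this sketch hashes every level instead of indexing the small ones directly.

```python
import numpy as np

num_levels = 4
feat_dim = 2
table_size = 2**10          # physical rows per level
base_res, growth = 16, 2.0  # logical resolution grows per level
rng = np.random.default_rng(0)
# One learned physical table per level.
tables = [rng.standard_normal((table_size, feat_dim)) for _ in range(num_levels)]

def hash2d(ix, iy):
    # Spatial hash of integer grid coordinates (prime multipliers per the paper).
    return ((ix * 1) ^ (iy * 2654435761)) % table_size

def encode(u, v):
    """Concatenate one bilinearly interpolated feature vector per level."""
    feats = []
    for level, table in enumerate(tables):
        res = int(base_res * growth**level)
        x, y = u * (res - 1), v * (res - 1)
        x0, y0 = int(x), int(y)
        x1, y1 = min(x0 + 1, res - 1), min(y0 + 1, res - 1)
        tx, ty = x - x0, y - y0
        # Each corner's feature comes from the hashed physical table;
        # collisions are simply ignored.
        f = ((1 - tx) * (1 - ty) * table[hash2d(x0, y0)]
             + tx * (1 - ty) * table[hash2d(x1, y0)]
             + (1 - tx) * ty * table[hash2d(x0, y1)]
             + tx * ty * table[hash2d(x1, y1)])
        feats.append(f)
    return np.concatenate(feats)  # shape: (num_levels * feat_dim,)

print(encode(0.37, 0.81).shape)  # (8,)
```

Because the physical table size is fixed, every level costs the same number of parameters no matter how fine its logical resolution; collisions at the fine levels are left for the MLP to disambiguate.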

Results

A picture is worth a thousand words:


[Figure omitted. Source: https://doi.org/10.1145/3528223.3530127]

[Figure omitted. Source: https://doi.org/10.1145/3528223.3530127]

Dangling Pointers

I think the obvious fruit waiting to be picked is neural architecture search. Solid techniques exist for learning the values of parameters in a pre-defined architecture, but humans are still required to define the architecture itself. Why can’t a machine figure out that imposing this structure is a good idea?
