The glTF 2.0 file format has become a standard for sharing 3D assets between applications. Unfortunately, it has a few unnecessary problems, and these problems are quite typical for our industry and the sad state it is in: rushed development, additive thinking, and a disdain for formal correctness. I'm going to rant about them here in the hope that others might avoid them, and that perhaps one day we will have a sane standard.
The WebGL-first problem
glTF 2.0 comes from glTF 1.0, the WebGL Transfer Format. This name is apt, as it was heavily focused on WebGL. It allowed specifying GLSL shaders and structured its data as though it were arguments to WebGL API calls. Supporting glTF 1.0 in other graphics APIs, such as Vulkan, Direct3D or WebGPU, would have been difficult and inefficient. The industry later standardized around PBR shaders/materials. For glTF 2.0, the API-specific shaders were therefore replaced by PBR materials. In other words, while glTF 1.0 was specific about the API (WebGL) and generic about materials (any shader), glTF 2.0 is specific about materials (PBR) but generic about the API. Except that glTF 2.0 still declares its data in a very WebGL-centric fashion. The focus of the project shifted, but the decision-makers were unwilling to "redo" work they had already crossed off once. The standard was rushed and is now in a permanently bad state, and all of us implementors must suffer because of it. Let's get into some examples where this shows.
Meshes are defined using variable attributes. Rather than being a list of values, as in e.g. obj files, the attributes reference (by id) an accessor, defining stride, offset and bufferView id. The buffer view defines an offset and index into a buffer, where finally we get the data. This is basically the input to ArrayView, gl.bufferData() and gl.vertexAttribPointer(), respectively. The specification essentially admits this tight coupling at https://registry.khronos.org/glTF/specs/2.0/glTF-2.0.html#_accessor_componenttype:
- Related WebGL functions: type parameter of vertexAttribPointer(). The corresponding typed arrays are Int8Array, Uint8Array, Int16Array, Uint16Array, Uint32Array, and Float32Array.
In glTF 2.0, specifying data as input to WebGL functions like this is pointless. Not only is it supposedly targeting other APIs as well, but switching to PBR also requires changing the data in some cases, so it will not be fed directly to WebGL functions even in applications that do run on WebGL. Normals and tangents, for instance, need to be generated if absent, which alters the vertex indices.
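To make that last point concrete, here is a minimal sketch (my own illustration, not taken from the spec or any particular loader) of generating flat normals for an indexed mesh. Because every corner of a triangle gets that face's normal, shared vertices have to be duplicated, and the original index buffer no longer applies:

```js
// Generate flat (per-face) normals for an indexed triangle mesh.
// positions: Float32Array of xyz triples; indices: Uint16Array/Uint32Array.
function generateFlatNormals(positions, indices) {
  const outPositions = [];
  const outNormals = [];
  for (let i = 0; i < indices.length; i += 3) {
    // The three corners of this triangle.
    const [a, b, c] = [indices[i], indices[i + 1], indices[i + 2]].map((j) =>
      Array.from(positions.slice(j * 3, j * 3 + 3))
    );
    // Face normal = normalize(cross(b - a, c - a)).
    const u = [b[0] - a[0], b[1] - a[1], b[2] - a[2]];
    const v = [c[0] - a[0], c[1] - a[1], c[2] - a[2]];
    const n = [
      u[1] * v[2] - u[2] * v[1],
      u[2] * v[0] - u[0] * v[2],
      u[0] * v[1] - u[1] * v[0],
    ];
    const len = Math.hypot(n[0], n[1], n[2]) || 1;
    // Each corner gets the same face normal, so a vertex shared between two
    // triangles can no longer be shared: it is emitted once per triangle.
    for (const p of [a, b, c]) {
      outPositions.push(...p);
      outNormals.push(n[0] / len, n[1] / len, n[2] / len);
    }
  }
  // The result is un-indexed; the old index buffer is meaningless now.
  return {
    positions: new Float32Array(outPositions),
    normals: new Float32Array(outNormals),
  };
}
```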
What's more, even in glTF 1.0, which specifically targeted WebGL, a model should not be defined as arguments to API calls, because neither the modeler nor the glTF exporter can know what the most efficient layout is. For instance, interleaving vertex attributes is a performance consideration that can be highly dependent on the application and deployment target. I was once maintaining a WebGL rendering engine for room planning. The graphics were fairly simple, but it did have 4 shadow-casting lights, which is 3 more than normal. As an optimization, I decided to interleave the vertex data. Essentially that means you have a single array with all the information you need for every vertex close together, rather than having a different array for every "field", as sketched below. Performance benefits on my laptop, with integrated graphics, were within the margin of error. As it turned out, the same change tanked performance on GPUs and ARM architectures. The 4 shadow casters each needed to draw every object in the entire scene, but only needed the vertices' positions to do it. Where first they had a tightly-packed array of positions, simply not using the other arrays, now each position was separated by strides of useless data, causing cache misses. The more advanced CPU caching strategy of my laptop with integrated graphics had likely hidden that issue.
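A toy illustration of the two layouts, with made-up values (this is my own example, not anything prescribed by glTF):

```js
// Planar layout: one tightly packed array per attribute. A depth-only
// shadow pass can stream just the positions and never touch the normals.
const positionsPlanar = new Float32Array([
  0, 0, 0,   1, 0, 0,   0, 1, 0,          // x y z per vertex
]);
const normalsPlanar = new Float32Array([
  0, 0, 1,   0, 0, 1,   0, 0, 1,          // nx ny nz per vertex
]);

// Interleaved layout: each vertex's position and normal sit next to each
// other. Reading only the positions now has to jump over the normal bytes
// (a 24-byte stride), which is the cache behaviour that hurt the shadow
// passes described above.
const interleaved = new Float32Array([
  0, 0, 0,  0, 0, 1,                      // vertex 0: position, then normal
  1, 0, 0,  0, 0, 1,                      // vertex 1
  0, 1, 0,  0, 0, 1,                      // vertex 2
]);
```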
The advantages of defining bufferViews and accessors to match the structure of WebGL just aren't there for glTF 2.0, and were arguably never there for glTF 1.0 either, but this indirection makes extracting the actual data we need a lot more strenuous. A few of my colleagues had simply given up, and while it didn't take me that long, the code I wrote definitely isn't the most readable. In comparison, an array of values would have been trivial.
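For a sense of what that indirection costs, here is a minimal sketch of resolving a single accessor to a typed array. It assumes the glTF JSON is already parsed into gltf and the binary buffers are loaded into buffers (an array of ArrayBuffers); the function name is mine, and a real loader also has to handle interleaved strides, sparse accessors and every other component type:

```js
// How many numbers a single element of each accessor type contains.
const COMPONENT_COUNTS = { SCALAR: 1, VEC2: 2, VEC3: 3, VEC4: 4, MAT4: 16 };

function readFloatAccessor(gltf, buffers, accessorIndex) {
  const accessor = gltf.accessors[accessorIndex];
  const view = gltf.bufferViews[accessor.bufferView];
  const buffer = buffers[view.buffer];

  // 5126 is WebGL's FLOAT; this sketch handles nothing else.
  if (accessor.componentType !== 5126) {
    throw new Error("only float accessors handled in this sketch");
  }

  // Offsets from the accessor and the bufferView have to be added together.
  const byteOffset = (view.byteOffset ?? 0) + (accessor.byteOffset ?? 0);
  const length = accessor.count * COMPONENT_COUNTS[accessor.type];

  return new Float32Array(buffer, byteOffset, length);
}
```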
There's another place where it shows that glTF 2.0 was intended to define inputs to WebGL functions: default values and constants. In some places, defaults are missing from the spec, and you have to look up the matching WebGL function. Take samplers' magFilter and minFilter: they are optional, but no default is given at the time of writing.
The accepted values for such variables are also nonsensical outside the context of WebGL. Does 9728 read like a magnification filter? No? Well, it is one, and it means NEAREST. In WebGL.
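In practice every loader ends up with a translation table like the sketch below. The numeric values are straight from the spec (they are WebGL's enum values); the fallback for a missing magFilter is my own guess, since, as noted, the spec does not give one:

```js
// WebGL enum values used by glTF samplers, mapped back to readable names.
const MAG_FILTERS = {
  9728: "NEAREST",
  9729: "LINEAR",
};

function magFilterName(sampler) {
  if (sampler.magFilter === undefined) {
    // The field is optional and the spec names no default; LINEAR is an
    // assumption on my part, pick whatever your renderer prefers.
    return "LINEAR";
  }
  return MAG_FILTERS[sampler.magFilter] ?? "unknown";
}
```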
The moral here is: when performance allows, define data as what it is, rather than trying to be clever about how some application would like to store it internally. Leave that to the application.
Too much stuff
Besides rushed development, glTF 2.0 is also plagued by what I've started to call "additive thinking", as I do not know of an existing name for it. People tend to think adding features will make a thing better. Quite often, the opposite is true. Whether you want to call it the Unix principle, orthogonality or encapsulation, a driving principle of good software engineering is easy composition: you make a thing that performs one function and can be easily combined with other things to make bigger things. This is what makes category theory interesting for software developers; it is the study of composition. The opposite is the God object anti-pattern, where you give too much stuff to one thing, rather than making many small things that can interact.
glTF 2.0 suffers from this too. Not a lot, but it does. I'll name a few things that, in my opinion, should not be in the specification in the way they currently are.
The worst offender is using negative scale as a mirror function. When you scale by a negative value (in a single axis), you mirror the positions of the vertices, but you also flip their winding order. Where first their internal side would get culled because it was e.g. clockwise, now the external side is clockwise and gets culled. This prevents using negative scaling to mirror a mesh as-is. Nevertheless, some engines do allow for it. You can check whether a mesh has a negative scale (by the determinant of its matrix), and cull the other side of the triangle if it does. But changing the cull side requires changing the pipeline state, which is detrimental to performance, so we probably don't want to do that while iterating meshes in the render function. Instead, we complicate the renderer by pre-sorting objects into positive and negative determinants.
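A minimal sketch of that workaround (the function names and structure are mine, not anything the spec mandates):

```js
// glTF stores matrices column-major; the sign of the upper-left 3x3
// determinant tells us whether a node's world transform mirrors geometry.
function isMirrored(m) {
  const det =
    m[0] * (m[5] * m[10] - m[6] * m[9]) -
    m[4] * (m[1] * m[10] - m[2] * m[9]) +
    m[8] * (m[1] * m[6] - m[2] * m[5]);
  return det < 0;
}

// Bucket the scene once per frame so the cull-face state only changes twice,
// instead of potentially toggling it on every draw call.
function sortByDeterminant(nodes) {
  const mirrored = [];
  const regular = [];
  for (const node of nodes) {
    (isMirrored(node.worldMatrix) ? mirrored : regular).push(node);
  }
  return { mirrored, regular };
}
```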
There are other features that I don't think are good defaults to have in the format: vertex animations, multiple texture coordinates, sparse accessors, lines and points... I would much rather have seen them in extensions to keep simple things simple. These features aren't necessary for a lot of e-commerce applications for instance. As it stands, an engine supporting glTF 2.0 needs to be pretty bloody robust and feature-complete, wasting developer time and bloating their engine.
Then again, the cynic in me says that might have been the point. Make the standard format complicated enough to incentivize smaller companies to buy into the commercial engines whose owners are part of the Khronos Group.
Supposing your goal is to make a good standard, the point I'm making is: try to keep it simple, and allow for extensions. On that last part at least, glTF 2.0 is pretty good as far as I can see, though I plan to verify this in the future by writing my own extension.
Please use sum types
Finally, and I am getting tired of this being everywhere, glTF does not leverage sum types to ensure type safety. These are not toys for academics; they are essential in accurately describing your data. Sum types are the dual of the product types every programmer knows as structs, classes, objects, or dicts: product types consist of one type and another type, while sum types consist of one type or another type. Let's take a look at glTF's camera definition in the specification.
That's a lot of text to say a camera needs to be either orthographic or perspective. The type used provides no information about this: the orthographic and perspective fields look independent of each other. A better approach would have been to use a sum of an orthographic camera and a perspective camera, placing the tag (type) as a fixed string alongside the data. E.g. in JavaScript with JSDoc:
/**
 * @type { {
 *   type: "orthographic",
 *   orthographic: camera.orthographic
 * } | {
 *   type: "perspective",
 *   perspective: camera.perspective
 * } }
 */
camera
This would immediately inform the programmer and the type checker of the intention: if it has a type field of "orthographic", it will have an orthographic camera; if it has a type field of "perspective", it will have a perspective camera; and there is no other option. Much shorter than all that text, which explains the same thing but in a format my type checker can't read.
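As a usage sketch (my own example, using the real glTF property names), this is what a checker that understands the union buys you when consuming the camera:

```js
// With the tagged union above, testing the tag narrows the type: only the
// matching variant's payload is considered valid in each branch.
function describeCamera(camera) {
  switch (camera.type) {
    case "orthographic":
      // The checker now knows camera.orthographic exists.
      return `orthographic, ${camera.orthographic.xmag} x ${camera.orthographic.ymag}`;
    case "perspective":
      // And here it knows camera.perspective exists.
      return `perspective, yfov ${camera.perspective.yfov}`;
    default:
      // Exhaustiveness: a new variant would be flagged by the type checker.
      throw new Error("unknown camera type");
  }
}
```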
The good
I should also take some time to mention what glTF 2.0 does well. First off, it exists and can do basically everything I want it to do. When I was modding games, we had complex toolchains between proprietary formats that would always lose the materials, the animations, or more likely both. glTF 2.0 is, at least, a standard, and a well-documented one, with a lot of test cases to boot. That is all the reason I need to implement it over obj, ply, or other formats. It's also not going to change every few months.
There's also GLB, glTF 2.0's binary format, which is very nice to have. It allows packing all data, such as textures and meshes, into a single file, keeping everything that belongs together in one place and making it more portable. I've seen enough broken links in mesh and material files to appreciate this. On the other hand, referencing external resources is still possible, so no efficiency is lost to data duplication in applications that reuse assets.
Finally, glTF 2.0 is built to support extensions, which means it can stay relevant if market demand for some new feature emerges. I myself look forward to toying with an extension in the future.
Out of the file formats for displaying 3D content, glTF is the best we have, but that fact should be rather embarrassing for our industry.
Credits
- Environment map in article header from Andreas Mischok
- glTF files in article header from Khronos