Yesterday some friends and I watched The Game Awards, an industry award show for video games and surrounding media. The TGAs (yes, the acronym isn't very good) have gained a bit of a reputation for being one of the biggest sources of new game announcements, trailers, and the like, alongside other industry events like E3. Last night we saw 23 new trailers, many of which were for entirely new games, and many of which also included this bit of information - either through text on the screen, people introducing or discussing the trailer, or in the trailer's description on YouTube:
Let's talk about why this is frustrating.
For quite a while, there was a bit of an issue with video game trailers: They were good. Quite a bit too good when compared to the game they were supposed to represent. It became a bit of a joke among players when a trailer showed something absolutely beautiful compared to the actual visuals of the game:
It was generally understood at a very basic level why this happened - You could make things in a pre-rendered cutscene that you couldn't make in the game's engine, so you could really use that as an opportunity to wow players with something that's not actually what they're going to get. I want to get a bit into the theory behind why it's easier to display something in a pre-rendered cutscene, from a graphics rendering perspective, and why something being rendered in-engine is something to brag about - or why it shouldn't necessarily be.
The root of the issue isn't really that game engines are inherently worse than other methods of rendering graphics. But before we get into that, let's talk about the theory of how graphics rendering works.
At the end of the day, the goal of all of this is to get an array of pixels to light up on the screen. The end result is really simple - every pixel has a value of red, a value of green, and a value of blue (the upper bounds of these numbers indicate how good the screen and related components are - this is the idea of the "number of colors" a monitor can display), and the goal is to get a set of three numbers into memory for each of those pixels. The entire screen's representation in memory is called the "frame buffer":
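As a concrete sketch, a frame buffer really is nothing more than an array of (R, G, B) triples - one per pixel. The resolution and the 0-255 channel range here are illustrative assumptions:

```python
# A minimal sketch of a frame buffer: one (R, G, B) triple per pixel.
# The resolution and 0-255 channel range are assumptions for illustration.
WIDTH, HEIGHT = 640, 480

# Row-major list of (r, g, b) tuples, initialized to black.
frame_buffer = [(0, 0, 0)] * (WIDTH * HEIGHT)

def set_pixel(x, y, r, g, b):
    """Write one pixel's color into the buffer."""
    frame_buffer[y * WIDTH + x] = (r, g, b)

# "Rendering" is just the job of filling in these numbers for every pixel.
set_pixel(320, 240, 255, 0, 0)  # center pixel becomes pure red
```

Everything a renderer does, no matter how sophisticated, ultimately boils down to writing into a structure like this.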
Of course, the trouble is in calculating those numbers. No matter what overarching model you have for your renderer, whether it be manipulating objects into their on-screen coordinates (a.k.a. rasterization), or projecting out from the camera into the scene to see what object you hit (a.k.a. ray tracing), things typically follow the same very general set of steps:
- Determine an object that is visible at the pixel
- Determine what this object's "base" color is at the point that you're seeing through the pixel
- Modify this color based on other factors like shadows, reflections, etc.
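The three steps above can be sketched as a per-pixel loop. The helper functions here are hypothetical stubs standing in for the real math, not any actual engine's API:

```python
def find_visible_object(pixel, scene):
    # Step 1 (stubbed): determine which object the pixel "sees".
    return scene[0]

def apply_lighting(color, scene):
    # Step 3 (stubbed): darken slightly to stand in for shadow/reflection math.
    return tuple(int(c * 0.8) for c in color)

def shade_pixel(pixel, scene):
    obj = find_visible_object(pixel, scene)   # 1. which object is visible here
    color = obj["base_color"]                 # 2. that object's base color
    return apply_lighting(color, scene)       # 3. shadows, reflections, etc.

scene = [{"name": "blue box", "base_color": (0, 0, 255)}]
print(shade_pixel((320, 240), scene))  # (0, 0, 204)
```

The real versions of steps 1 and 3 are where all the cost lives, as we'll see below.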
Note that this doesn't include more advanced steps like anti-aliasing and such, because things like that are appended to this process afterwards, and aren't really embedded within it. And frankly, the given process is complicated enough to prove my point anyway.
Throughout this process, you're never working with just a single object at a time. Sure, a single object fills a pixel, but in order to determine which object that is, you have to do mathematical operations on the other objects to rule them out. You also have to do operations on other objects to see how they interact with the given object through shadows and reflections.
The first part of this rendering process is illustrated below in the context of a ray tracer, viewed from a perfect side-view to the rendering camera. The pixel being looked at here would be in the very center of the screen, and the rectangles are objects in the scene:
So, judging from just what we see in this scene, the end result should be that the blue object is what we "see" in the pixel. The ray also collides with the green object, but only AFTER it collides with the blue object. And it doesn't collide with the purple object, though another pixel's projection ray will. And the brown object won't even be visible on the screen at all, since it's behind the camera.
So, how do we know this? Or more specifically, how does the renderer know this? In a perfectly naive implementation, we'd just test all objects against the ray to see if they intersect:
- Does the blue object intersect? Yes, with a time value of t1 (indicating how far along the ray the intersection is).
- Does the green object intersect? Yes, with a time value of t2, where t2 > t1.
- Does the purple object intersect? No.
- Does the brown object intersect? No.
So we can then ignore all objects for which the answer above is "No", and take the lowest time value out of all the ones for which the answer is "Yes". That is the object that intersects with the ray.
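That naive test-everything loop might look something like this. The objects and their t-values are made up to match the rectangle example, and the intersection math itself is stubbed out (in reality it depends on the shape):

```python
# Naive visibility: test the ray against every object, keep the closest hit.
def intersect(ray_origin, ray_dir, obj):
    """Return the t-value where the ray hits obj, or None on a miss.
    (Real intersection math depends on the shape; here it's precomputed.)"""
    return obj.get("t")

scene = [
    {"name": "blue",   "t": 2.0},   # hit at t1
    {"name": "green",  "t": 5.0},   # hit at t2 > t1, so it's occluded
    {"name": "purple", "t": None},  # the ray misses it
    {"name": "brown",  "t": None},  # behind the camera
]

def closest_hit(ray_origin, ray_dir, scene):
    hits = [(intersect(ray_origin, ray_dir, o), o) for o in scene]
    hits = [(t, o) for t, o in hits if t is not None]   # drop the "No" answers
    if not hits:
        return None
    return min(hits, key=lambda h: h[0])[1]             # lowest t wins

print(closest_hit((0, 0), (1, 0), scene)["name"])  # blue
```

Note that every object gets tested for every pixel - that's the cost we'll be trying to cheat away later.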
Of course, that literally just tells you which object you're hitting, which is the first item from the general steps above. Next up would be getting a "base color" for the blue object, which is pretty easy - we can just store that info in the scene and say "this object is blue".
The third step is a lot trickier: Determining how shadows and reflections affect that blue. I'm not going to fully map out this process since it gets really complicated, really fast, but suffice it to say that it also includes several more rounds of checking all the objects in the scene (in our naive implementation). Except now we have to see if those objects are blocking the path from various lights in the scene to the blue object, or if light from a light source reflects off another object onto the object in question.
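As a rough sketch of just the shadow part: assume we've already intersected every other object against a "shadow ray" from the hit point toward a light, giving each object a t-value (or None for a miss). All the numbers here are illustrative:

```python
# Naive shadow test: another full pass over the scene's objects, checking
# whether anything sits between the surface (t = 0) and the light.
def occludes(obj_t, light_t):
    """True if an intersection at obj_t lies between the surface and the light."""
    return obj_t is not None and 0.0 < obj_t < light_t

def in_shadow(shadow_ray_hits, light_t):
    return any(occludes(t, light_t) for t in shadow_ray_hits)

# t-values of each scene object along a shadow ray toward a light at t = 10.
hits = [None, 4.0, None]  # one object blocks the light
print(in_shadow(hits, 10.0))  # True
```

And that's one pass per light, per pixel - on top of the visibility pass from step 1.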
All of this takes time. The amount of time it takes to render a scene like this depends on (among other things):
- The number of pixels on the screen (i.e. your resolution)
- The number of lights in the scene
- The number of objects in the scene
That last one is the most important, as it affects multiple steps in the process, and often non-linearly as well.
Theoretically, if time is not a factor, we can get some really pretty pictures from processes like this. However, things get a lot harder when you have a cap on the amount of time you get to display a frame, like 16 ms (the time to display a frame when running at 60 frames per second). You have to do this entire process, for every pixel, every 16 ms. So, you immediately have to make some concessions. Models have to have fewer polygons (my example above has objects represented as rectangles, but in reality those objects would be made of multiple polygons, each of which is an "object" in its own right), which is why a lot of games from the N64/PS1 era were notoriously low-poly and ugly.
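Some quick back-of-the-envelope math shows just how tight that budget is. The 1080p-at-60-fps target here is an assumption for illustration:

```python
# Frame budget arithmetic: at 60 fps, the whole per-pixel pipeline has to
# fit in ~16.7 ms, shared across every pixel on the screen.
fps = 60
width, height = 1920, 1080

frame_budget_ms = 1000 / fps
per_pixel_budget_ns = frame_budget_ms * 1_000_000 / (width * height)

print(f"{frame_budget_ms:.2f} ms per frame")      # 16.67 ms per frame
print(f"{per_pixel_budget_ns:.1f} ns per pixel")  # 8.0 ns per pixel
```

A handful of nanoseconds per pixel, for visibility, shading, shadows, and everything else - which is why all the concessions below exist.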
There are some other concessions you could make as well. You could ignore things like reflections. You could calculate shadows in a different way, flattening the object and coloring it black. These all make the scene look a lot worse, but they speed up the process. The best thing you can do, however, is reduce the number of objects being considered.
So this is the base difference between "pre-rendered" and "in-engine". An in-engine render is actually calculating these frame buffers every time the frame is displayed on the screen, but pre-rendered videos can take as long as needed to generate a frame - you have theoretically unlimited time when you're just trying to get a video out, and you're not displaying it to a player while they're playing the game. That's why pre-rendered cutscenes were notorious for a long time for looking way better than the actual gameplay, and also why being able to get good-looking video out of an "in-engine" render is something that game publishers are bragging about.
But let's think about why we're even able to display the gameplay frame buffers at all: We cheat. A lot. The best example of that is how we reduced the number of polygons on the models - it doesn't look as good, but you're drastically reducing the render time by having fewer objects on screen.
But what if we were to take that idea to its extreme?
What if we were to make changes to the scene that made it drastically easier to render? That would be pretty difficult in the context of a game, where you have to be able to react to unpredictable actions by the player. This is where the concept of interactivity comes in, and it's a major distinguishing factor between generating a cutscene in-engine and generating a video feed in-engine while the user is interacting with it.
So let's make the following assumptions:
- We are trying to make a scene that will be very easy to render in-engine.
- We also want it to look as good as we can while still being rendered in-engine, so we can't do things like reducing the poly count on objects, which would make them look worse.
- The user is not playing the game; all camera and model movements are pre-scripted and known ahead of time. It is a cutscene, after all - we know exactly what's going to happen.
Because of this, we can also do a lot of things beforehand, like figuring out which objects are visible in the scene on each frame.
So let's think about what this means for making our "perfectly renderable scene". We can remove objects that we don't care about on a given frame from the scene entirely. So, in our rectangle example above, we can completely remove the green and brown objects from the scene, because we know they won't be visible on this frame. If the green object moves and becomes visible next frame, then we can add it back to the scene at that point. So, in that super-simple example, we've already cut the time for the first step at least in half, because we don't have to check half of the objects in the scene to see if they're visible. And we've also cut down the time of the third step, because we only have to check for shadows and reflections from objects that we didn't remove.
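A sketch of that cheat: since the camera and every object's motion are fully scripted, visibility can be computed offline and baked into per-frame lists. The object names match the rectangle example; the frame data is made up:

```python
# The "cutscene cheat": precompute which objects can possibly be seen on each
# frame, so the renderer never even considers the rest.
full_scene = {"blue", "green", "purple", "brown"}

# Computed offline, before the cutscene ever runs.
visible_per_frame = {
    0: {"blue", "purple"},           # green and brown culled entirely
    1: {"blue", "purple", "green"},  # green moved into view this frame
}

def scene_for_frame(frame):
    """The renderer only ever sees the pre-culled object set."""
    return visible_per_frame[frame]

print(sorted(scene_for_frame(0)))  # ['blue', 'purple']
```

At runtime, the expensive "test everything" loops run over the small baked set instead of the full scene.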
So, in a more complex example, we could remove almost all of the objects from the scene, only leaving polygons that we already know will be visible to a pixel. This means that, if you have a human-looking model that's perfectly facing the camera, their entire back half would be gone, because it's not needed. That's a lot of polygons.
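One standard member of this family of tricks is back-face culling: a polygon whose normal points in roughly the same direction as the view ray is facing away from the camera, so the camera can only ever see its back side, and it can be dropped. A minimal sketch, with made-up vectors:

```python
# Back-face culling: drop polygons whose normals point away from the camera.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def facing_camera(normal, view_dir):
    """True if the polygon faces the camera (its normal opposes the view ray)."""
    return dot(normal, view_dir) < 0

view_dir = (0, 0, 1)  # camera looking down +z

front_poly_normal = (0, 0, -1)  # points back toward the camera: keep it
back_poly_normal = (0, 0, 1)    # points away: part of the model's back half

print(facing_camera(front_poly_normal, view_dir))  # True
print(facing_camera(back_poly_normal, view_dir))   # False
```

Engines do this at runtime too, but in a scripted cutscene the culling can go much further, since you know the camera will never swing around to the other side.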
And the key aspects of this little experiment are:
- This is still fully rendered in-engine, displaying a frame in less than 16 ms or what have you.
- The scene is not interactive.
This would not work if the scene were interactive. You wouldn't be able to spend all that time beforehand figuring out what polygons are necessary because you don't know if the player is going to suddenly look to the right and see a whole new set of polygons.
And note that I zeroed in on one particular technique here, but there are a lot of different ways you could cheat in something like this while still being perfectly truthful when you say it's being rendered in-engine - as long as it's not interactive.
And that's why I find the concept of bragging that your cutscene is rendered in-engine to be pretty dishonest. The only way you can get something this detailed to render in-engine is if it's also not interactive, and as such, you still get the issue of the gameplay not looking nearly as good.
I'll just wait until I see actual gameplay footage from a review.