TL;DR of the paper 'Revisiting The Vertex Cache: Understanding and Optimizing Vertex Processing on the modern GPU'

I will in this post write down a TL;DR of the paper 'Revisiting The Vertex Cache: Understanding and Optimizing Vertex Processing on the modern GPU'.

It is well-known that modern GPUs have a post-transform cache. If we for instance are drawing some geometry with indices $\{0, 1, 2, 0, 1, 2\}$, then the vertices with indices 0, 1 and 2 will have their vertex shader evaluated only once, for a total of three vertex shader invocations. But for the second occurrences of 0, 1, 2 in this list, the vertex shader will not at all be executed. Instead, the results from the first evaluation will have been stored in some cache, and the results are fetched from this cache. Thus, the vertex shader is only executed 3 times for this index list.

Now let's say we have a very long sequence of indices $\{0, 1, 2, 0, 1, 2, 0, 1, 2\dots\}$. Let's say this sequence has length $N$. If there really is a central vertex cache that is used for all vertex shader invocations, then the vertex shader should only be executed three times, if these indices are rendered.

But this is not what happens on GeForce GTX 1080Ti. If $N \le 96$, then the vertex shader is executed exactly three times, as expected. However, once we exceed 96 vertices, there is a jump. If $N=99$, then the vertex shader is executed 6 times! If there was really a central vertex cache used by all vertex shader invocations, this wouldn't make any sense. The authors claim that this is because the hardware uses a batch-based approach for the vertex cache. That is, we have batches of 96 vertices, and within every such batch a vertex cache is active. The key-point is that the cache doesn't appear to be shared between the batches. The authors denote this number 96 as $\text{MAX_SIZE}$. As another example, For AMD R7 200 we have that $\text{MAX_SIZE}=384$. For Intel hardware, it seems to be much more dynamic, and the authors are unable to find a conclusive number for $\text{MAX_SIZE}$. It appears the batch can be as big as 1791 vertices for Intel hardware, they report.

Another interesting tidbit is that there appears to be a limit on the number of unique vertices within each batch. For NVIDIA hardware, the authors report that cache misses will start occuring as soon as more than $\text{MAX_UNIQUES}=32$ unique vertices are used in a batch. For AMD hardware, they find that $\text{MAX_UNIQUES}=\text{MAX_SIZE}$.

Now, the authors design a mesh optimization algorithm based on these discoveries. Triangles that are near each other should be close to each other in memory, so that the vertex cache can be properly utilized without many cache-misses. The authors have basically discovered that, across batch boundaries, the vertex cache will not be useful. Previous mesh optimization algorithms did not take this fact into account, and so the authors design an algorithm that takes this into account. On NVIDIA hardware, they are able to beat all other mesh optimization algorithms. However, on Intel and AMD hardware, their algorithm is almost always beaten by some other algorithm. This is presumably because NVIDIA hardware has much smaller batch sizes compared to AMD and intel. This means that taking batch sizes into account is much more important for NVIDIA hardware, and that's why the NVIDIA cards gain so much from this optimization.