Errata: L1 Cache Behavior
Caching Mutable Data in L1
In our original instructions for Lab 3 and Lab 4, we wrote that, because the L1 cache is incoherent, typical load instructions will bypass the L1 cache, and only loads accessing read-only data are eligible to be cached:
From Lab 3:
…while the shared L2 behaves like a normal cache – implicitly caching all normal loads and stores from DRAM – the per-SM L1 cache is not coherent across SMs. As a result, most normal loads and stores bypass the L1.
From Lab 4:
Because the L1 caches on the GPU are incoherent, the compiler will typically emit load instructions which bypass the L1 cache by default, and will only emit instructions that use the L1 cache under two circumstances:
- If the compiler can figure out that the memory location you’re reading won’t change from one access to the next. This is somewhat rare and unreliable, but it can happen!
- If you manually tell the compiler that the memory location you’re reading won’t change from one access to the next. You can do this by using the special __ldg(...) function.
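For concreteness, here is a minimal sketch of the __ldg(...) usage described above. This is our own illustrative example, not code from the lab handouts; the kernel and parameter names are made up.

```cuda
// Hypothetical example: 'lut' is a small lookup table that never changes
// while the kernel runs, so we read it through __ldg(...), which tells the
// compiler the data is read-only and eligible for the read-only L1 path.
__global__ void scale_by_lut(const float *lut, const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s = __ldg(&lut[i % 256]);  // read-only load via __ldg
        out[i] = s * in[i];              // ordinary load and store
    }
}
```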
Although NVIDIA’s documentation and public communications are somewhat opaque on this matter, we now have strong evidence that the behavior we described in Lab 3 and Lab 4 is not the full story:
- At least on recent NVIDIA GPUs, it appears that even data which the compiler believes may change in the future can still benefit from being cached in L1.
- At the PTX level, this means that even ld.global instructions without the .nc qualifier (or any other caching qualifiers) can still make use of the L1 cache.
- At the SASS level, this means that LDG instructions without the .CONSTANT qualifier can make use of the L1 cache.
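To make the PTX-level distinction concrete, here is a small sketch (our own example; the function names are made up) of the two load flavors in question. Compiling with something like nvcc -arch=sm_80 -ptx typically produces an ld.global.f32 for the plain load and an ld.global.nc.f32 for the __ldg load, with corresponding LDG and LDG.CONSTANT instructions at the SASS level; the exact output depends on the compiler version and flags.

```cuda
// Hypothetical illustration of which source constructs map to which PTX loads.
__device__ float plain_load(const float *p) {
    return *p;        // expected to compile to ld.global (no .nc qualifier)
}

__device__ float readonly_load(const float *p) {
    return __ldg(p);  // expected to compile to ld.global.nc
}
```

The point of the errata above is that the first flavor, despite lacking the .nc qualifier, may still be served out of the L1 cache on recent GPUs.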
L1 Cache Hit Granularity
In our instructions for Lab 4, we wrote that the L1 cache can only supply data from at most one contiguous 128-byte (32-word) cache line on each cycle:
When you’re using the L1 as a read-only data cache, you can only make use of its full bandwidth if each warp accesses just a single contiguous 32-word cache line in each load instruction. (Reference)
We still recommend touching only one cache line per load instruction as a reasonable default policy when designing high-performance kernels. However, we’ve now seen some evidence online suggesting that the L1 cache can in fact perform multiple tag lookups in parallel per cycle.
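As a reference point for that default policy, here is a sketch (our own example, not from the lab handout) of an access pattern in which each load instruction touches exactly one cache line per warp:

```cuda
// Hypothetical example: the 32 threads of a warp read 32 consecutive 4-byte
// words, so each load instruction touches a single contiguous 128-byte line
// (assuming 'in' is 128-byte aligned and blockDim.x is a multiple of 32).
__global__ void copy_one_line_per_load(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}
```

By contrast, a strided pattern such as reading in[4 * i] spreads a warp’s 32 accesses across 512 bytes, i.e. four 128-byte cache lines per load instruction.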
Further Reading
The course staff is still working to understand the performance characteristics of the L1 cache on modern NVIDIA GPUs in greater detail. If you’re interested in investigating this topic yourself, here are some references which may be relevant:
- Useful background reading to understand the correctness contract which the compiler and hardware are trying to uphold. Does not directly describe the performance characteristics of the microarchitecture, but imposes constraints on its design which may help one make educated guesses at how it must be implemented.
  - See also: A Formal Analysis of the NVIDIA PTX Memory Consistency Model by Lustig et al., 2019.
- CUDA global memory caching documentation
  - Documentation consistent with our original description of the L1 cache’s behavior in the instructions for Lab 3 and Lab 4.
  - This documentation was originally published for NVIDIA’s Maxwell generation of GPUs (released 2014), but the CUDA documentation claims elsewhere that this description of the L1 cache is still relevant on Ampere and Hopper. It is unclear to us if the documentation is fully correct.
- Suggests that on Ampere, all loads, even loads of non-read-only data, are cached by default in L1:
  “…caching at all levels (what ca hint means) is the default behavior, at least for cc 8.0.”
- Suggests that on Volta (released 2017), data is retrieved from L2 and DRAM at the granularity of 32-byte “sectors,” and that the L1 can service up to 4 tag lookups per cycle:
  “The Volta L1 data cache has 128 byte cache lines divided into 4 sectors. For local and global accesses the tag stage can compare all 32 threads at a time. The tag stage can look up 4 tags per cycle resolving a maximum of 16 sectors (4 tags x 4 sectors). On miss the cache will only fetch the unique 32 byte sectors that missed. The full cache line is not automatically fetched from L2.”
- Dissecting the Turing GPU Architecture through Microbenchmarking (Jia et al., 2019)
  - Fascinating presentation on experimentally observing microarchitectural details of NVIDIA’s Turing generation of GPUs (released 2018).
  - Provides clear, direct evidence of the existence of 32-byte sectors in the L1 cache (see slide 32).
  - See also: accompanying 66-page technical report on the Volta architecture, from the same authors.