Errata: L1 Cache Behavior
Caching Mutable Data in L1
In our original instructions for Lab 3 and Lab 4, we wrote that, because the L1 cache is incoherent, typical load instructions will bypass the L1 cache, and only loads accessing read-only data are eligible to be cached:
From Lab 3:
…while the shared L2 behaves like a normal cache – implicitly caching all normal loads and stores from DRAM – the per-SM L1 cache is not coherent across SMs. As a result, most normal loads and stores bypass the L1.
From Lab 4:
Because the L1 caches on the GPU are incoherent, the compiler will typically emit load instructions which bypass the L1 cache by default, and will only emit instructions that use the L1 cache under two circumstances:
- If the compiler can figure out that the memory location you’re reading won’t change from one access to the next. This is somewhat rare and unreliable, but it can happen!
- If you manually tell the compiler that the memory location you’re reading won’t change from one access to the next. You can do this by using the special __ldg(...) function.
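For concreteness, here is a minimal sketch of the __ldg(...) usage described above. This is our own illustrative example, not code from the lab handouts; the kernel and parameter names are made up.

```cuda
// Hypothetical example: 'lut' is a small lookup table that never changes
// while the kernel runs, so we read it through __ldg(...), which tells the
// compiler the data is read-only and eligible for the read-only L1 path.
__global__ void scale_by_lut(const float *lut, const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s = __ldg(&lut[i % 256]);  // read-only load via __ldg
        out[i] = s * in[i];              // ordinary load and store
    }
}
```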
Although NVIDIA’s documentation and public communications are somewhat opaque on this matter, we now have strong evidence that the behavior we described in Lab 3 and Lab 4 is not the full story:
- At least on recent NVIDIA GPUs, it appears that even data which the compiler believes may change in the future can still benefit from being cached in L1.
- At the PTX level, this means that even ld.global instructions without the .nc qualifier (or any other caching qualifiers) can still make use of the L1 cache.
- At the SASS level, this means that LDG instructions without the .CONSTANT qualifier can make use of the L1 cache.
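To make the PTX-level distinction concrete, here is a small sketch (our own example; the function names are made up) of the two load flavors in question. Compiling with something like nvcc -arch=sm_80 -ptx typically produces an ld.global.f32 for the plain load and an ld.global.nc.f32 for the __ldg load, with corresponding LDG and LDG.CONSTANT instructions at the SASS level; the exact output depends on the compiler version and flags.

```cuda
// Hypothetical illustration of which source constructs map to which PTX loads.
__device__ float plain_load(const float *p) {
    return *p;        // expected to compile to ld.global (no .nc qualifier)
}

__device__ float readonly_load(const float *p) {
    return __ldg(p);  // expected to compile to ld.global.nc
}
```

The point of the errata above is that the first flavor, despite lacking the .nc qualifier, may still be served out of the L1 cache on recent GPUs.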
L1 Cache Hit Granularity
In our instructions for Lab 4, we wrote that the L1 cache can only supply data from at most one contiguous 128-byte (32-word) cache line on each cycle:
When you’re using the L1 as a read-only data cache, you can only make use of its full bandwidth if each warp accesses just a single contiguous 32-word cache line in each load instruction. (Reference)
We still recommend touching only one cache line per load instruction as a reasonable default policy when designing high-performance kernels. However, we’ve now seen some evidence online suggesting that the L1 cache can in fact perform multiple tag lookups in parallel per cycle.
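As a reference point for that default policy, here is a sketch (our own example, not from the lab handout) of an access pattern in which each load instruction touches exactly one cache line per warp:

```cuda
// Hypothetical example: the 32 threads of a warp read 32 consecutive 4-byte
// words, so each load instruction touches a single contiguous 128-byte line
// (assuming 'in' is 128-byte aligned and blockDim.x is a multiple of 32).
__global__ void copy_one_line_per_load(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}
```

By contrast, a strided pattern such as reading in[4 * i] spreads a warp’s 32 accesses across 512 bytes, i.e. four 128-byte cache lines per load instruction.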
Further Reading
The course staff is still working to understand the performance characteristics of the L1 cache on modern NVIDIA GPUs in greater detail. If you’re interested in investigating this topic yourself, here are some references which may be relevant:
- Useful background reading to understand the correctness contract which the compiler and hardware are trying to uphold. Does not directly describe the performance characteristics of the microarchitecture, but imposes constraints on its design which may help one make educated guesses at how it must be implemented.
  - See also: A Formal Analysis of the NVIDIA PTX Memory Consistency Model by Lustig et al., 2019.
- CUDA global memory caching documentation
  - Documentation consistent with our original description of the L1 cache’s behavior in the instructions for Lab 3 and Lab 4.
  - This documentation was originally published for NVIDIA’s Maxwell generation of GPUs (released 2014), but the CUDA documentation claims elsewhere that this description of the L1 cache is still relevant on Ampere and Hopper. It is unclear to us if the documentation is fully correct.
- Suggests that on Ampere, all loads, even loads of non-read-only data, are cached by default in L1:
  “…caching at all levels (what ca hint means) is the default behavior, at least for cc 8.0.”
- Suggests that on Volta (released 2017), data is retrieved from L2 and DRAM at the granularity of 32-byte “sectors,” and that the L1 can service up to 4 tag lookups per cycle:
  “The Volta L1 data cache has 128 byte cache lines divided into 4 sectors. For local and global accesses the tag stage can compare all 32 threads at a time. The tag stage can look up 4 tags per cycle resolving a maximum of 16 sectors (4 tags x 4 sectors). On miss the cache will only fetch the unique 32 byte sectors that missed. The full cache line is not automatically fetched from L2.”
- Dissecting the Turing GPU Architecture through Microbenchmarking (Jia et al., 2019)
  - Fascinating presentation on experimentally observing microarchitectural details of NVIDIA’s Turing generation of GPUs (released 2018).
  - Provides clear, direct evidence of the existence of 32-byte sectors in the L1 cache (see slide 32).
  - See also: accompanying 66-page technical report on the Volta architecture, from the same authors.