2DaT wrote: ↑Sat Apr 24, 2021 9:06 am
Even if data does not fit in L1, L2 is often fast enough to not bottleneck anything. Only a fraction of workloads can utilize the L1 bandwidth, such as matrix multiplication or direct convolution - only in these workloads does it make sense to do L1 tiling (with appropriate SIMD). In any other case L2->L1 prefetch would get data faster than you can process it.

mystran wrote: ↑Sat Apr 24, 2021 7:53 am
edit: Well, the only other thing is that caches are set-associative (eg. 4-way or something), so if you have a large power-of-two stride then you might not be able to effectively utilize your whole L1 if all the cachelines you fetch map to the same set, but I don't really see how even this would matter when it comes to an API pushing around a few small buffers for an even smaller number of channels.

Well, the main point of the post was that cache efficiency is not about whether or not your access pattern is totally linear, but rather about whether or not you're making effective use of the bandwidth by utilizing full cache lines when you fetch them into whatever level of cache.
One thing to take home about the set-associativity though is that if you have two aligned 64-byte (or whatever the cacheline size is) chunks of memory next to each other (eg. a single buffer of 128 bytes), then these two cachelines will never compete against each other: being adjacent, they always map to different sets.
That said, tiling for a specific cache size can be dangerous anyway, because if the code then runs on a CPU with a different cache layout and you rely on the layout too heavily, you can get very poor performance as a result. Basically this only really makes sense if you're willing to tune for every different CPU separately (eg. FFTW style).
I'm personally more of a fan of "cache oblivious" design whenever possible, where you assume that the exact cache layout is unknown and try to build your code to be cache friendly for essentially any reasonable layout, but even then it's not about having perfectly linear access patterns, but rather just about dense access patterns and good locality of reference.
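The classic illustration of the cache-oblivious idea is a recursive matrix transpose: you keep splitting the larger dimension in half, so at some recursion depth the working tile fits whatever cache the machine happens to have, without the code ever naming a cache size. A minimal sketch (the base-case size 16 is an arbitrary small constant, not a tuning parameter for any particular cache):

```c
#include <stddef.h>

#define BASE 16 /* small base case; deliberately NOT tuned to a specific cache */

/* Cache-oblivious out-of-place transpose of the sub-block [r0,r1) x [c0,c1)
   of a row-major rows x cols matrix 'src' into the cols x rows matrix 'dst'.
   Recursively halving the larger dimension gives good locality at every
   level of the (unknown) cache hierarchy. */
static void transpose(const float *src, float *dst,
                      size_t r0, size_t r1, size_t c0, size_t c1,
                      size_t rows, size_t cols) {
    size_t dr = r1 - r0, dc = c1 - c0;
    if (dr <= BASE && dc <= BASE) {
        /* Tile is small: plain loops, dense access on both src and dst. */
        for (size_t i = r0; i < r1; ++i)
            for (size_t j = c0; j < c1; ++j)
                dst[j * rows + i] = src[i * cols + j];
    } else if (dr >= dc) {
        size_t rm = r0 + dr / 2; /* split the taller dimension */
        transpose(src, dst, r0, rm, c0, c1, rows, cols);
        transpose(src, dst, rm, r1, c0, c1, rows, cols);
    } else {
        size_t cm = c0 + dc / 2; /* split the wider dimension */
        transpose(src, dst, r0, r1, c0, cm, rows, cols);
        transpose(src, dst, r0, r1, cm, c1, rows, cols);
    }
}
```

Note that this never achieves "perfectly linear" access (the writes to dst are strided within each tile), but every tile it touches is dense and reused while hot, which is exactly the locality-of-reference point above.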