A blossoming paradigm for block-recursive matrix algorithms is presented that, at once, attains excellent performance measured by
It provides a philosophy and tools that allow the programmer to deal with the memory hierarchy invisibly, from L1 and L2 to TLB, paging, and interprocessor communication. Used together, they provide a cache-oblivious style of programming.
Plots are presented to support these claims on an implementation of Cholesky factorization crafted directly from the paradigm in C with a few intrinsic calls. The results in this paper focus on low-level performance, including the new Morton-hybrid representation to take advantage of hardware and compiler optimizations. In particular, this code beats Intel’s Matrix Kernel Library and matches AMD’s Core Math Library, losing a bit on L1 misses while winning decisively on TLB-misses.
Cache misses, TLB, Paging, Cholesky factorization, quad-trees, Morton-hybrid, parallel processing
Seven at one stroke: Results from a cache-oblivious paradigm for scalable matrix algorithms. In MSPC ’06: Proceedings of the 2006 workshop on Memory system performance and correctness, pages 41–50. ACM, New York, NY, USA, 2006. ISBN 1-59593-578-9. doi: 10.1145/1178597.1178604.
and .@inproceedings{adams2006seven, author = {Adams, Michael D. and Wise, David S.}, title = {Seven at One Stroke: Results from a Cache-Oblivious Paradigm for Scalable Matrix Algorithms}, booktitle = {MSPC '06: Proceedings of the 2006 workshop on Memory system performance and correctness}, pages = {41--50}, year = {2006}, address = {New York, NY, USA}, publisher = {ACM}, isbn = {1-59593-578-9}, doi = {10.1145/1178597.1178604}, }