Abstract

As processors get faster and memory speeds fail to keep up, it is essential to design the "cubic" parts of the meat-axe so that the vast majority of memory fetches are served from L2 cache at worst. This requires a fundamental reorganization of matrix multiplication and Gaussian elimination to keep the working set below the L2 cache size. This talk describes my design efforts so far to solve this problem. A new way of adding mod 3 is also presented.
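
To illustrate what keeping the working set small means for multiplication, here is a minimal sketch of a tiled (blocked) matrix multiply modulo a prime. It is not the design described in the talk. The function name matmul_mod_blocked, the block parameter, and the list-of-lists representation are all assumptions made for illustration. Python lists will not show the cache benefit themselves; the point is only the loop structure, in which the inner loops touch three small tiles at a time.

```python
def matmul_mod_blocked(a, b, p, block=64):
    """Multiply a (n x m) by b (m x k) modulo p, one tile at a time.

    Hypothetical illustration: matrices are non-empty lists of rows.
    """
    n, m, k = len(a), len(b), len(b[0])
    c = [[0] * k for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, k, block):
            for l0 in range(0, m, block):
                # Only one tile each of a, b and c is touched by the
                # loops below; block is chosen so these fit in cache.
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for l in range(l0, min(l0 + block, m)):
                        x = row_a[l]
                        if x == 0:
                            continue  # zero entries contribute nothing
                        row_b = b[l]
                        for j in range(j0, min(j0 + block, k)):
                            row_c[j] = (row_c[j] + x * row_b[j]) % p
    return c
```

The tile size would in practice be tuned so that three tiles fit comfortably in the L2 cache; the default of 64 here is arbitrary.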