Modeling Data Locality for Next Generation Systems
Although the introduction of multi-core systems has increased overall processor speed without significantly increasing CPU clock rates, a significant speed disparity remains between the CPU core and main memory. Multi-level caches have long been used to bridge this gap, and conventional cache design favors applications with good locality. The community's understanding of locality, however, is more qualitative than quantitative. A quantitative understanding of locality is essential to exploit the memory hierarchy and achieve maximal performance. The new generation of multi-core systems adds a further challenge: quantifying data locality for multi-threaded programs.
This research models data locality as a function of three parameters: data size, path history, and thread count. The model relies on close cooperation among the compiler, the profiler, and hardware just-in-time monitoring. The compiler provides a global view of the program. The profiler, using traces, has a view of the program's run-time behavior, but this view is based on only a limited number of training inputs. Although the hardware's view is run-specific, its predictions often depend on hardware buffers and are not always effective because of buffer size limitations. The cooperative model being developed combines the advantages of static analysis with those of run-time sampling and profiling, providing an accurate view of program locality for both single-threaded and multi-threaded programs. Given this model, the project explores memory system performance, including managing data movement in conventional multi-level caches as well as non-uniform cache architecture (NUCA) designs, reducing the memory traffic of a state-of-the-art hardware-only region prefetcher, and improving the spatial locality of Java programs.
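To make the idea of a quantitative locality measure concrete, the sketch below computes reuse distance (LRU stack distance) over a short address trace and derives a miss ratio from it. This is an illustration only: the abstract does not name the metric the project uses, so reuse distance, the function names `reuse_distances` and `miss_ratio`, and the sample trace are all assumptions. It also covers a single thread and ignores path history and thread count.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """Return the LRU stack distance of each access in `trace`.

    The distance is the number of distinct addresses touched since the
    previous access to the same address. None marks a first-time (cold)
    access. Hypothetical helper, not taken from the project.
    """
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for addr in trace:
        if addr in stack:
            keys = list(stack.keys())
            # Entries after `addr` in the ordering were used more recently.
            distances.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)
        else:
            distances.append(None)
            stack[addr] = True
    return distances


def miss_ratio(distances, cache_lines):
    """Miss ratio of a fully associative LRU cache holding `cache_lines` lines."""
    misses = sum(1 for d in distances if d is None or d >= cache_lines)
    return misses / len(distances)


if __name__ == "__main__":
    trace = ["a", "b", "c", "a", "b", "d", "a"]  # illustrative trace
    dists = reuse_distances(trace)
    print(dists)
    for size in (2, 3, 4):
        print(size, miss_ratio(dists, size))
```

Under these assumptions, one pass over a trace yields the miss ratio for every cache size at once, which is the sense in which such a measure is quantitative rather than qualitative. The list-based lookup is quadratic in the worst case and is meant for readability, not for long traces.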