On-chip shared caches, data-parallel programming and the relevance of data layout optimizations

The technological trend towards chip multiprocessors (CMPs) and recent advances in using GPUs for general-purpose computation have rekindled interest in data-parallel programming models. Most proposed CMP designs use shared on-chip caches to minimize the number of off-chip accesses. To cope with increasing wire delays as feature sizes shrink, shared on-chip cache designs have adopted a Non-Uniform Cache Architecture (NUCA), based on a banked organization with an on-chip network. For data-parallel programming models, there is a mismatch between such a cache organization and the canonical row-major or column-major layouts of arrays. A compiler therefore needs to perform data layout optimizations that use non-canonical data layouts to improve data locality. In this talk, we discuss an approach that (i) determines the profitability of using non-canonical layouts and (ii) selects layout parameters based on a polyhedral framework. We present preliminary experimental results that demonstrate the value of this approach.
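
The mismatch between a banked cache and a canonical array layout can be illustrated with a small sketch. It is not the approach presented in the talk; it only counts how many cache banks one thread's square tile of a 2D array touches under a row-major layout and under a tiled (blocked) layout. Every parameter below (array size, tile size, bank count, and in particular the assumption that addresses rotate across banks in 2 KB chunks) is hypothetical and chosen only to make the effect visible.

```python
# Hypothetical machine and problem parameters (assumptions for illustration only).
ELEM_BYTES = 8           # one double-precision element
N = 256                  # the array is N x N
TILE = 16                # each thread works on a TILE x TILE block
NUM_BANKS = 16           # number of banks in the shared cache
INTERLEAVE_BYTES = 2048  # consecutive 2 KB chunks rotate across banks (assumed)


def bank_of(byte_offset):
    """Bank that holds a byte offset under the assumed chunk interleaving."""
    return (byte_offset // INTERLEAVE_BYTES) % NUM_BANKS


def row_major_offset(i, j):
    """Byte offset of element (i, j) in the canonical row-major layout."""
    return (i * N + j) * ELEM_BYTES


def tiled_offset(i, j):
    """Byte offset of (i, j) when each TILE x TILE tile is stored contiguously."""
    tiles_per_row = N // TILE
    tile_id = (i // TILE) * tiles_per_row + (j // TILE)
    within_tile = (i % TILE) * TILE + (j % TILE)
    return (tile_id * TILE * TILE + within_tile) * ELEM_BYTES


def banks_touched(offset_fn, i0, j0):
    """Set of banks accessed by the tile whose top-left element is (i0, j0)."""
    return {
        bank_of(offset_fn(i, j))
        for i in range(i0, i0 + TILE)
        for j in range(j0, j0 + TILE)
    }


if __name__ == "__main__":
    print(len(banks_touched(row_major_offset, 32, 48)))  # 16
    print(len(banks_touched(tiled_offset, 32, 48)))      # 1
```

Under these assumed parameters, each row of the array fills exactly one interleaving chunk, so the 16 rows of a tile land in 16 different banks with the row-major layout, while the tiled layout keeps the whole tile in a single bank. With other interleaving granularities the comparison can come out differently, which is one reason profitability has to be determined rather than assumed.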