Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor

Tareq Majed Yasin Malas, Aron Ahmadia, Jed Brown, John A. Gunnels, David E. Keyes

Research output: Contribution to journal › Article › peer-review

4 Scopus citations


Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the CPU. We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM® Blue Gene®/P supercomputer's PowerPC® 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU's instruction set. We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7× speedup over the best previously published results. © The Author(s) 2012.
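The sketch below illustrates what a three-dimensional 27-point stencil kernel computes. It is a plain, unoptimized reference written for this note, not the authors' assembly-synthesis framework or their PowerPC 450 code. The function name `stencil27` and the symmetric weighting scheme are illustrative assumptions: one weight each for the center point, the 6 face neighbors, the 12 edge neighbors and the 8 corner neighbors.

```python
from typing import List, Sequence

Grid = List[List[List[float]]]


def stencil27(src: Grid, w: Sequence[float]) -> Grid:
    """One out-of-place sweep of a symmetric 27-point stencil (hypothetical helper).

    Assumed weighting: w[0] center, w[1] the 6 face neighbors,
    w[2] the 12 edge neighbors, w[3] the 8 corner neighbors.
    Boundary points are copied unchanged; src is not modified.
    """
    nz, ny, nx = len(src), len(src[0]), len(src[0][0])
    # Copy the input so boundary values survive and the update is out of place.
    dst = [[row[:] for row in plane] for plane in src]
    for k in range(1, nz - 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                acc = 0.0
                for dk in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        for di in (-1, 0, 1):
                            # The count of nonzero offsets (0..3) selects the neighbor class.
                            cls = abs(dk) + abs(dj) + abs(di)
                            acc += w[cls] * src[k + dk][j + dj][i + di]
                dst[k][j][i] = acc
    return dst


if __name__ == "__main__":
    ones = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
    out = stencil27(ones, (1.0, 1.0, 1.0, 1.0))
    print(out[1][1][1])  # 27 neighbors, each weighted 1.0
```

Each interior output point reads 27 input values and performs 27 multiply-adds in this naive form, so a full sweep streams the source array through the memory hierarchy. Scheduling that inner work efficiently on an in-order, vectorized core is the problem the paper addresses, and this sketch makes no attempt at it.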
Original language: English (US)
Pages (from-to): 193–209
Number of pages: 17
Journal: International Journal of High Performance Computing Applications
Issue number: 2
State: Published, May 21 2012

Bibliographical note

KAUST Repository Item: Exported on 2020-10-01

ASJC Scopus subject areas

  • Hardware and Architecture
  • Theoretical Computer Science
  • Software

