The IBM research team in Zurich set up a project to develop a methodology for estimating the performance, power consumption and cost of exascale systems. The project is called Algorithms and Machines (A&M) and is part of DOME, a joint program with the Netherlands Institute for Radio Astronomy (ASTRON).
The main objective of this collaboration is to develop technologies to support the Square Kilometre Array (SKA), which is currently under development and will be the world’s largest radio telescope.
The A&M team set out to model an exascale computing system of the kind required by the SKA data processing pipeline. Modelling this system and the software running on it allows an early and fast design-space exploration. To construct the software model, the A&M methodology used a platform-independent software analysis tool that measures software properties such as the scalar and vector instruction mix, available parallelism, memory access patterns and communication behaviour. Because the software models are extracted at application run-time, they can only be collected on current systems, which are orders of magnitude smaller than exascale. To predict the software models at exascale, the methodology used an extrapolation tool that employs advanced statistical techniques. Once extrapolated, the software model was combined with a hardware model that captures the performance constraints and dependencies of a computer system. Because the combined model is expressed as mathematical formulas, a large design space of processor- and network-related hardware parameters can be explored quickly.
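The two steps described above, extrapolating a measured software property and feeding it into an analytical hardware model, can be illustrated with a minimal Python sketch. It is not the A&M tooling: the power-law fit stands in for the unspecified "advanced statistical techniques", the non-overlapping compute-plus-communication formula is a deliberately simple assumption, and all function names and numbers are hypothetical.

```python
import math


def fit_power_law(sizes, values):
    """Least-squares fit of values ~ a * sizes**b, done in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def extrapolate(a, b, size):
    """Predict the property at a problem size that cannot be measured today."""
    return a * size ** b


def estimate_time(instructions, bytes_sent, instr_per_sec, bytes_per_sec):
    """Toy hardware model (assumption): compute and communication do not overlap."""
    return instructions / instr_per_sec + bytes_sent / bytes_per_sec


# Hypothetical instruction counts measured at three small problem sizes.
sizes = [1e3, 1e4, 1e5]
instruction_counts = [2e6, 2e7, 2e8]

a, b = fit_power_law(sizes, instruction_counts)
predicted_instructions = extrapolate(a, b, 1e9)

# Hypothetical hardware parameters: 1e12 instructions/s, 1e9 bytes/s.
predicted_seconds = estimate_time(predicted_instructions, 1e9, 1e12, 1e9)
```

Because the estimate is a closed-form expression, changing a hardware parameter such as `bytes_per_sec` and re-evaluating costs almost nothing, which is what makes a sweep over a large design space cheap.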
To validate the analytical performance estimates, the A&M team required access to systems with different network topologies (e.g., fat-tree and dragonfly). The team contacted EGI for support in obtaining access to such systems. EGI identified the Poznan Supercomputing and Networking Center (PSNC) in Poland as a provider able to offer such an environment and kick-started the collaboration.
PSNC offered access to Orzel / Eagle, a supercomputer with 1.4 PFlops of computing power and a fat-tree interconnect fabric. The A&M team then ran MPI applications with different problem sizes and numbers of MPI processes on the system, using two- and three-level fat-tree configurations. The first validation results, for the MPI-simple implementation of Graph 500 (an MPI benchmark for analytics workloads), showed that the analytical methodology can estimate execution time with an accuracy of 82%, which is a very encouraging result. In the future, more MPI applications will be analysed to validate the A&M methodology.
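The article does not say how the accuracy figure was computed or how the fat-tree was built, so the following Python sketch rests on two labelled assumptions: that accuracy means one minus the relative error between estimated and measured run time, and that the fat-tree is a full-bisection tree built from identical switches. The function names, timings and switch radix are hypothetical and are not taken from the Eagle system.

```python
def estimate_accuracy(estimated_seconds, measured_seconds):
    """Accuracy as 1 - relative error (assumed definition, not from the article)."""
    relative_error = abs(estimated_seconds - measured_seconds) / measured_seconds
    return 1.0 - relative_error


def fat_tree_hosts(switch_radix, levels):
    """Hosts supported by a full-bisection fat-tree of identical switches.

    Each switch uses half of its ports downwards and half upwards, except at
    the top level, which gives 2 * (radix / 2) ** levels hosts.
    """
    if switch_radix % 2 != 0 or levels < 1:
        raise ValueError("radix must be even and levels must be at least 1")
    return 2 * (switch_radix // 2) ** levels


# Hypothetical timings: the model predicts 8.2 s, the run takes 10.0 s.
accuracy = estimate_accuracy(8.2, 10.0)

# Hypothetical 36-port switches in two- and three-level configurations.
two_level_hosts = fat_tree_hosts(36, 2)
three_level_hosts = fat_tree_hosts(36, 3)
```

Under these assumptions, an estimate of 8.2 s against a measured 10.0 s gives an accuracy of about 0.82, and adding a third switch level raises the number of attachable hosts by a factor of half the switch radix.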
This work was done in the context of the joint ASTRON and IBM DOME project and was funded by the Dutch Ministry of Economic Affairs and the Province of Drenthe. The IBM A&M team is very pleased with the involvement of specialists from PSNC and thankful to EGI for its efforts in brokering the partnership with PSNC.
Written by Giuseppe La Rocca, leader of the Community Support team at EGI.