Large-Scale System Monitoring Experiences and Recommendations Workshop paper: HPCMASPA 2018

Ville Ahlgren, Stefan Andersson, Jim Brandt, Nicholas P. Cardo, Sudheer Chunduri, Jeremy Enos, Parks Fields, Ann Gentile, Richard Gerber, Michael Gienger, Joe Greenseid, Annette Greiner, Bilel Hadri, Yun (Helen) He, Dennis Hoppe, Urpo Kaila, Kaki Kelly, Mark Klein, Alex Kristiansen, Steve Leak, Mike Mason, Kevin Pedretti, Jean-Guillaume Piccinali, Jason Repik, Jim Rogers, Susanna Salminen, Mike Showerman, Cary Whitney, Jim Williams

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    8 Scopus citations

    Abstract

    Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are not generally considered core capabilities in system requirements specifications nor in vendor development strategies. In this paper we present work performed at a number of large-scale HPC sites towards developing monitoring capabilities that fill current gaps in ease of problem identification and root cause discovery. We also present our collective views, based on the experiences presented, on needs and requirements for enabling development by vendors or users of effective sharable end-to-end monitoring capabilities.
    Original language: English (US)
    Title of host publication: 2018 IEEE International Conference on Cluster Computing (CLUSTER)
    Publisher: IEEE
    Pages: 532-542
    Number of pages: 11
    ISBN (Print): 9781538683194
    State: Published - 2018

    Bibliographical note

    KAUST Repository Item: Exported on 2021-08-20
    Acknowledgements: This research was supported by and used resources of the Argonne Leadership Computing Facility, which is a U.S. Department of Energy Office of Science User Facility operated under contract DE-AC02-06CH11357. This document is approved for release under LA-UR-18-26485.
