TY - GEN
T1 - Large-Scale System Monitoring Experiences and Recommendations Workshop paper: HPCMASPA 2018
AU - Ahlgren, Ville
AU - Andersson, Stefan
AU - Brandt, Jim
AU - Cardo, Nicholas P.
AU - Chunduri, Sudheer
AU - Enos, Jeremy
AU - Fields, Parks
AU - Gentile, Ann
AU - Gerber, Richard
AU - Gienger, Michael
AU - Greenseid, Joe
AU - Greiner, Annette
AU - Hadri, Bilel
AU - He, Yun (Helen)
AU - Hoppe, Dennis
AU - Kaila, Urpo
AU - Kelly, Kaki
AU - Klein, Mark
AU - Kristiansen, Alex
AU - Leak, Steve
AU - Mason, Mike
AU - Pedretti, Kevin
AU - Piccinali, Jean-Guillaume
AU - Repik, Jason
AU - Rogers, Jim
AU - Salminen, Susanna
AU - Showerman, Mike
AU - Whitney, Cary
AU - Williams, Jim
N1 - Acknowledgements: This research was supported by and used resources of the Argonne Leadership Computing Facility, which is a U.S. Department of Energy Office of Science User Facility operated under contract DE-AC02-06CH11357. This document is approved for release under LA-UR-18-26485.
PY - 2018
Y1 - 2018
AB - Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are not generally considered core capabilities in system requirements specifications nor in vendor development strategies. In this paper we present work performed at a number of large-scale HPC sites towards developing monitoring capabilities that fill current gaps in ease of problem identification and root cause discovery. We also present our collective views, based on the experiences presented, on needs and requirements for enabling development by vendors or users of effective sharable end-to-end monitoring capabilities.
UR - http://hdl.handle.net/10754/670699
UR - https://ieeexplore.ieee.org/document/8514913/
UR - http://www.scopus.com/inward/record.url?scp=85057218940&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2018.00069
DO - 10.1109/CLUSTER.2018.00069
M3 - Conference contribution
SN - 9781538683194
SP - 532
EP - 542
BT - 2018 IEEE International Conference on Cluster Computing (CLUSTER)
PB - IEEE
ER -