TY - GEN
T1 - Toward online testing of federated and heterogeneous distributed systems
AU - Canini, Marco
AU - Jovanovic, Vojin
AU - Venzano, Daniele
AU - Spasojevic, Boris
AU - Crameri, Olivier
AU - Kostic, Dejan
PY - 2011
Y1 - 2011
N2 - Making distributed systems reliable is notoriously difficult. It is even more difficult to achieve high reliability for federated and heterogeneous systems, i.e., those that are operated by multiple administrative entities and have numerous inter-operable implementations. A prime example of such a system is the Internet's inter-domain routing, today based on BGP. We argue that system reliability should be improved by proactively identifying potential faults using an online testing functionality. We propose DiCE, an approach that continuously and automatically explores the system behavior, to check whether the system deviates from its desired behavior. DiCE orchestrates the exploration of relevant system behaviors by subjecting system nodes to many possible inputs that exercise node actions. DiCE starts exploring from current, live system state, and operates in isolation from the deployed system. We describe our experience in integrating DiCE with an open-source BGP router. We evaluate the prototype's ability to quickly detect origin misconfiguration, a recurring operator mistake that causes Internet-wide outages. We also quantify DiCE's overhead and find it to have marginal impact on system performance.
UR - http://www.scopus.com/inward/record.url?scp=85077094476&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85077094476
T3 - Proceedings of the 2011 USENIX Annual Technical Conference, USENIX ATC 2011
SP - 241
EP - 246
BT - Proceedings of the 2011 USENIX Annual Technical Conference, USENIX ATC 2011
PB - USENIX Association
T2 - 2011 USENIX Annual Technical Conference, USENIX ATC 2011
Y2 - 15 June 2011 through 17 June 2011
ER -