Scalable Problem Localization for Distributed Systems: Principles and Practices

Rui Zhang; Bruno C. d. S. Oliveira; Alan Bivens; Steve McKeever

2nd International ICST Conference on Scalable Information Systems

Research Article

Cite (BibTeX):

@INPROCEEDINGS{10.4108/infoscale.2007.896,
    author={Rui Zhang and Bruno C. d. S. Oliveira and Alan Bivens and Steve McKeever},
    title={Scalable Problem Localization for Distributed Systems: Principles and Practices},
    proceedings={2nd International ICST Conference on Scalable Information Systems},
    proceedings_a={INFOSCALE},
    year={2010},
    month={5},
    keywords={Scalability, Problem Localization, Complexity, Decentralization, Hierarchy, Distributed systems},
    doi={10.4108/infoscale.2007.896}
}

Rui Zhang¹,*, Bruno C. d. S. Oliveira¹,*, Alan Bivens²,*, Steve McKeever¹,*

1: Oxford University, Computing Laboratory, Oxford, OX1 3QD, England.
2: IBM T.J. Watson Research Center, Hawthorne, NY 10532, USA.

*Contact email: rui.zhang@comlab.ox.ac.uk, bruno@comlab.ox.ac.uk, jbivens@us.ibm.com, swm@comlab.ox.ac.uk

Abstract

Problem localization is a critical part of providing system management capabilities to modern distributed environments. One key open challenge is for problem localization solutions to scale to systems containing hundreds or even thousands of nodes, whilst remaining fast enough to respond to rapid environment changes and sufficiently cost-effective to avoid overloading any management or application component. This paper meets the challenge by introducing two scalable frameworks applicable to a wide range of existing problem localization solutions: one based on a summary-driven, narrow-down procedure, the other on decomposing and decentralizing the problem localization process. Both frameworks, at their best, are able to achieve O(log N) problem localization time and O(1) per-node communication load. The contrasting natures of the two frameworks give them complementary strengths that make them suitable for different scenarios in practice. We demonstrate our approaches in simulation settings and two real-world environments and show promising scalability benefits that can make a difference in system management operations.
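
To make the O(log N) and O(1) figures concrete, the following is a minimal illustrative sketch of a summary-driven narrow-down, not the paper's actual framework. It assumes a binary hierarchy over the nodes, a per-node anomaly score, and a summary defined as the maximum score in a subtree; the names `build_summaries` and `localize` and the threshold are hypothetical.

```python
def build_summaries(scores):
    """Build a binary summary hierarchy bottom-up.

    levels[0] holds the per-node scores; levels[-1] holds the root summary.
    Assumption: a summary is the maximum score in the subtree. Each node
    reports exactly one value to its parent, i.e. O(1) messages per node.
    """
    levels = [list(scores)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([max(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return levels


def localize(levels, threshold):
    """Descend from the root towards a node whose score exceeds threshold.

    Returns (node_index, steps); node_index is None when the root summary
    shows no problem. One level is inspected per step, so the number of
    steps equals the height of the hierarchy, about log2(N).
    """
    if levels[-1][0] <= threshold:
        return None, 0
    idx, steps = 0, 0
    for depth in range(len(levels) - 2, -1, -1):
        level = levels[depth]
        left = 2 * idx
        # If the left child is healthy, the flagged parent implies that a
        # right child exists and carries the problem.
        idx = left if level[left] > threshold else left + 1
        steps += 1
    return idx, steps


if __name__ == "__main__":
    scores = [0.1] * 1024   # hypothetical healthy baseline
    scores[777] = 0.9       # one hypothetical faulty node
    print(localize(build_summaries(scores), threshold=0.5))
```

With 1024 nodes the descent inspects 10 levels rather than scanning every node, which is the kind of logarithmic behaviour the abstract refers to. A real deployment would also have to handle multiple simultaneous faults and stale summaries, which this sketch ignores.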

Keywords: Scalability, Problem Localization, Complexity, Decentralization, Hierarchy, Distributed systems

Published: 2010-05-16
Modified: 2011-09-11

DOI: http://dx.doi.org/10.4108/infoscale.2007.896