Research Article
Efficient simulation of agent-based models on multi-GPU and multi-core clusters
@INPROCEEDINGS{10.4108/ICST.SIMUTOOLS2010.8822,
  author={Brandon G. Aaby and Kalyan S. Perumalla and Sudip K. Seal},
  title={Efficient simulation of agent-based models on multi-GPU and multi-core clusters},
  proceedings={3rd International ICST Conference on Simulation Tools and Techniques},
  publisher={ICST},
  proceedings_a={SIMUTOOLS},
  year={2010},
  month={5},
  keywords={Agent-based simulation, GPU, Cluster, Threads, MPI, CUDA, Latency hiding, Computational hierarchy, Multi-core},
  doi={10.4108/ICST.SIMUTOOLS2010.8822}
}
Abstract
An effective latency-hiding mechanism is presented for the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as the heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms, and we present a novel analytical model of the trade-off. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. The Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering over 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speedup is obtained on a system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular Java-based simulator. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.