Exploration vs Exploitation with Partially Observable Gaussian Autoregressive Arms

Julia Kuhn; Michel Mandjes; Yoni Nazarathy

8th International Conference on Performance Evaluation Methodologies and Tools

Research Article

Exploration vs Exploitation with Partially Observable Gaussian Autoregressive Arms

Download1426 downloads

Cite: BibTeX Plain Text

@INPROCEEDINGS{10.4108/icst.valuetools.2014.258207,
    author={Julia Kuhn and Michel Mandjes and Yoni Nazarathy},
    title={Exploration vs Exploitation with Partially Observable Gaussian Autoregressive Arms},
    proceedings={8th International Conference on Performance Evaluation Methodologies and Tools},
    publisher={EAI},
    proceedings_a={VALUETOOLS},
    year={2015},
    month={2},
    keywords={restless bandits partially observable whittle index performance evaluation asymptotic dynamics},
    doi={10.4108/icst.valuetools.2014.258207}
}

Julia Kuhn
Michel Mandjes
Yoni Nazarathy
Year: 2015
Exploration vs Exploitation with Partially Observable Gaussian Autoregressive Arms
VALUETOOLS
ICST
DOI: 10.4108/icst.valuetools.2014.258207

Julia Kuhn¹^,*, Michel Mandjes², Yoni Nazarathy³

1: The University of Queensland, University of Amsterdam
2: University of Amsterdam
3: The University of Queensland

*Contact email: j.kuhn@uq.edu.au

Abstract

We consider a restless bandit problem with Gaussian autoregressive arms, where the state of an arm is only observed when it is played and the state-dependent reward is collected. Since arms are only partially observable, a good decision policy needs to account for the fact that information about the state of an arm becomes more and more obsolete while the arm is not being played. Thus, the decision maker faces a tradeoff between exploiting those arms that are believed to be currently the most rewarding (i.e. those with the largest conditional mean), and exploring arms with a high conditional variance. Moreover, one would like the decision policy to remain tractable despite the infinite state space and also in systems with many arms. A policy that gives some priority to exploration is the Whittle index policy, for which we establish structural properties. These motivate a parametric index policy that is computationally much simpler than the Whittle index but can still outperform the myopic policy. Furthermore, we examine the many-arm behavior of the system under the parametric policy, identifying equations describing its asymptotic dynamics. Based on these insights we provide a simple heuristic algorithm to evaluate the performance of index policies; the latter is used to optimize the parametric index.

Keywords: restless bandits, partially observable, whittle index, performance evaluation, asymptotic dynamics

Published: 2015-02-19
Publisher: EAI
Appears in: ACM Digital Library

: http://dx.doi.org/10.4108/icst.valuetools.2014.258207

Exploration vs Exploitation with Partially Observable Gaussian Autoregressive Arms

Abstract

About EAI

Community

Publish with EAI