sas 15(4): e5

Research Article

Exploration vs Exploitation with Partially Observable Gaussian Autoregressive Arms

Download270 downloads
  • @ARTICLE{10.4108/icst.valuetools.2014.258207,
        author={Julia Kuhn and Michel Mandjes and Yoni Nazarathy},
        title={Exploration vs Exploitation with Partially Observable Gaussian Autoregressive Arms},
        journal={EAI Endorsed Transactions on Self-Adaptive Systems},
        volume={1},
        number={4},
        publisher={EAI},
        journal_a={SAS},
        year={2015},
        month={2},
        keywords={restless bandits, partially observable, whittle index, performance evaluation, asymptotic dynamics},
        doi={10.4108/icst.valuetools.2014.258207}
    }
    
  • Julia Kuhn
    Michel Mandjes
    Yoni Nazarathy
    Year: 2015
    Exploration vs Exploitation with Partially Observable Gaussian Autoregressive Arms
    SAS
    EAI
    DOI: 10.4108/icst.valuetools.2014.258207
Julia Kuhn1,*, Michel Mandjes2, Yoni Nazarathy3
  • 1: The University of Queensland, University of Amsterdam
  • 2: University of Amsterdam
  • 3: The University of Queensland
*Contact email: j.kuhn@uq.edu.au

Abstract

We consider a restless bandit problem with Gaussian autoregressive arms, where the state of an arm is only observed when it is played and the state-dependent reward is collected. Since arms are only partially observable, a good decision policy needs to account for the fact that information about the state of an arm becomes more and more obsolete while the arm is not being played. Thus, the decision maker faces a tradeoff between exploiting those arms that are believed to be currently the most rewarding (i.e. those with the largest conditional mean), and exploring arms with a high conditional variance. Moreover, one would like the decision policy to remain tractable despite the infinite state space and also in systems with many arms. A policy that gives some priority to exploration is the Whittle index policy, for which we establish structural properties. These motivate a parametric index policy that is computationally much simpler than the Whittle index but can still outperform the myopic policy. Furthermore, we examine the many-arm behavior of the system under the parametric policy, identifying equations describing its asymptotic dynamics. Based on these insights we provide a simple heuristic algorithm to evaluate the performance of index policies; the latter is used to optimize the parametric index.