Computing an Index Policy for Bandits with Switching Penalties

José Niño-Mora

1st International ICST Workshop on Tools for solving Structured Markov Chains

Research Article

Computing an Index Policy for Bandits with Switching Penalties

Download1198 downloads

Cite: BibTeX Plain Text

@INPROCEEDINGS{10.4108/smctools.2007.1994,
    author={Jos\^{e} Ni\`{o}o-Mora},
    title={Computing an Index Policy for Bandits with Switching Penalties},
    proceedings={1st International ICST Workshop on Tools for solving Structured Markov Chains},
    proceedings_a={SMCTOOLS},
    year={2010},
    month={5},
    keywords={Markov decision processes bandits restless switching costs switching delays index policies},
    doi={10.4108/smctools.2007.1994}
}

José Niño-Mora
Year: 2010
Computing an Index Policy for Bandits with Switching Penalties
SMCTOOLS
ICST
DOI: 10.4108/smctools.2007.1994

José Niño-Mora¹^,*

1: Department of Statistics Universidad Carlos III de Madrid C/ Madrid 126 28903 Getafe (Madrid), Spain

*Contact email: jnimora@alum.mit.edu

Abstract

We address the multiarmed bandit problem with switching penalties including both costs and delays. Asawa and Teneketzis (1996) introduced an index for bandits with switching penalties that partially characterizes optimal policies, attaching to each project state a "continuation index" (its Gittins index) and a "switching index," yet only proposed an index algorithm for the case of switching costs. We present a fast decoupled computation method, which in a first stage computes the continuation index and then, in a second stage, computes the switching index an order of magnitude faster in at most (5/2)n2 + O(n) arithmetic operations for an n-state project. This extends earlier work where we introduced a two-stage index algorithm for the case of switching costs only. We exploit the fact that the Asawa and Teneketzis index is the marginal productivity index of a classic bandit with switching penalties in its semi-Markov restless reformulation, by deploying methods introduced by the author. A computational study demonstrates the dramatic runtime savings achieved by the new algorithm, the near-optimality of the index policy, and its substantial gains against the benchmark Gittins index policy across a wide range of two-and three-project instances.

Keywords: Markov decision processes, bandits, restless, switching costs, switching delays, index policies

Published: 2010-05-16
Modified: 2011-09-18

: http://dx.doi.org/10.4108/smctools.2007.1994

Computing an Index Policy for Bandits with Switching Penalties

Abstract

About EAI

Community

Publish with EAI