3rd International ICST Workshop on Tools for solving Structured Markov Chains

Research Article

Computing an Index Policy for Multiarmed Bandits with Deadlines

  • @INPROCEEDINGS{10.4108/ICST.VALUETOOLS2008.4406,
        author={Jos{\'e} Ni{\~n}o-Mora},
        title={Computing an Index Policy for Multiarmed Bandits with Deadlines},
        proceedings={3rd International ICST Workshop on Tools for solving Structured Markov Chains},
        publisher={ACM},
        proceedings_a={SMCTOOLS},
        year={2010},
        month={5},
        keywords={stochastic scheduling; deadlines; multiarmed bandits; index policies; restless bandits; marginal productivity index; adaptive-greedy algorithm},
        doi={10.4108/ICST.VALUETOOLS2008.4406}
    }
    
  • José Niño-Mora
    Year: 2010
    Computing an Index Policy for Multiarmed Bandits with Deadlines
    SMCTOOLS
    ICST
    DOI: 10.4108/ICST.VALUETOOLS2008.4406
José Niño-Mora1,*
  • 1: Department of Statistics, Universidad Carlos III de Madrid, Avda. Universidad 30, 28911 Leganés (Madrid), Spain
*Contact email: jnimora@alum.mit.edu

Abstract

This paper introduces the multiarmed bandit problem with deadlines, which concerns the dynamic selection of a live project to engage out of a portfolio of Markovian bandit projects expiring after given deadlines, to maximize the expected total discounted or undiscounted reward earned. Although the problem is computationally intractable, a natural heuristic policy is obtained by attaching to each project the finite-horizon counterpart of its Gittins index, and then engaging at each time a live project of highest index. Remarkably, while such a finite-horizon index was introduced in [R. N. Bradt, S. M. Johnson, and S. Karlin (1956). On sequential designs to maximize the sum of n observations. Ann. Math. Statist. 27 1060--1074], an exact polynomial-time algorithm using arithmetic operations does not seem to have been proposed until [J. Niño-Mora (2005). A marginal productivity index policy for the finite-horizon multiarmed bandit problem. In Proceedings of CDC-ECC '05, pp. 1718--1722, IEEE]. Yet, such an adaptive-greedy index algorithm, which draws on methods introduced by the author for restless bandit indexation, has a complexity of O(T³n³) operations for a T-horizon n-state project, rendering it impractical for all but small instances. This paper significantly improves on the complexity of such an algorithm, decoupling it into a recursive T-stage method that performs O(T²n³) arithmetic operations. Moreover, in an insightful special model the complexity is further reduced to O(T²) operations, and closed-form index formulae are given. Computational experiments are reported demonstrating the algorithm's runtime performance, and showing that the proposed index policy is near optimal and can substantially outperform the benchmark greedy and Gittins index policies.
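The index policy described in the abstract can be sketched generically: precompute, for each project, an index as a function of its state and remaining horizon; then, at each time step, engage the live (unexpired) project whose current index is highest. The sketch below is only an illustration of this selection rule under simplifying assumptions; the `Project` class, its hand-supplied index tables, and all numbers are hypothetical and do not implement the paper's O(T²n³) adaptive-greedy index computation.

```python
# Illustrative sketch of an index policy for bandits with deadlines.
# Assumption: each project carries a precomputed table
# index[state][remaining_horizon]; how such indices are computed
# (the paper's contribution) is NOT shown here.
import random

class Project:
    def __init__(self, deadline, index_table, transitions, rewards, rng):
        self.state = 0
        self.deadline = deadline     # project expires after this time
        self.index = index_table     # index[state][remaining] -> float
        self.P = transitions         # P[state] -> next-state probabilities
        self.r = rewards             # r[state] -> one-period reward
        self.rng = rng

    def live(self, t):
        """A project is live while its deadline has not passed."""
        return t < self.deadline

    def engage(self, t):
        """Collect the current reward and advance the project's state."""
        reward = self.r[self.state]
        probs = self.P[self.state]
        self.state = self.rng.choices(range(len(probs)), weights=probs)[0]
        return reward

def index_policy(projects, horizon):
    """At each time, engage a live project of highest current index."""
    total = 0.0
    for t in range(horizon):
        live = [p for p in projects if p.live(t)]
        if not live:
            break
        # Look up each live project's index at its state and remaining life.
        best = max(live, key=lambda p: p.index[p.state][p.deadline - t])
        total += best.engage(t)
    return total
```

For example, with two single-state projects (one paying 2 per period but expiring at time 1, one paying 1 and expiring at time 2) and index tables equal to the per-period rewards, the policy engages the higher-index project first and collects it before its deadline, then switches to the longer-lived one.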