
We consider a restless bandit problem with Gaussian autoregressive arms, where the state of an arm is only observed when it is played and the state-dependent reward is collected. Since arms are only partially observable, a good decision policy needs to account for the fact that information about the state of an arm becomes more and more obsolete while the arm is not being played. Thus, the decision maker faces a tradeoff between exploiting those arms that are believed to be currently the most rewarding (i.e. those with the largest conditional mean), and exploring arms with a high conditional variance. Moreover, one would like the decision policy to remain tractable despite the infinite state space and also in systems with many arms. A policy that gives some priority to exploration is the Whittle index policy, for which we establish structural properties. These motivate a parametric index policy that is computationally much simpler than the Whittle index but can still outperform the myopic policy. Furthermore, we examine the many-arm behavior of the system under the parametric policy, identifying equations describing its asymptotic dynamics. Based on these insights we provide a simple heuristic algorithm to evaluate the performance of index policies; the latter is used to optimize the parametric index.


INTRODUCTION
Inherent to the problem of decision making under partial observability is the tradeoff between exploration and exploitation: Should we collect new information, or opt for the immediate payoff? We investigate this question for a reward observing restless multiarmed bandit problem, where at every point in discrete time the decision maker wants to play a fixed number k out of d arms, with the objective of maximizing the expected (discounted or average) reward achieved over an infinite time horizon. The state of an arm is restless as it evolves also when the arm is not played, and partially observable as it can only be observed whenever the arm is played and the state-dependent reward is collected. This type of bandit problem has drawn much attention in recent literature, e.g. [3,8,9]. We call it reward observing to distinguish from other types of partially observable decision problems, for example those where the decision maker does not have fully accurate information due to measurement errors [11].
In view of the reward observing character of the problem, one would like a model that allows in a non-artificial way to keep track of the decision maker's current need for exploring an arm rather than exploiting others. As a starting point we assume in this paper that state processes are AR(1), Gaussian autoregressions of order 1. Since states are normally distributed, the objectives of exploitation and exploration naturally correspond to the conditional mean and variance of an arm, which at the same time contain all relevant information concerning its state (and thus fully describe the belief state of the arm). The AR(1) model has been found useful for example for modeling channels in wireless networks [1]. It seems that in the context of decision making under reward observability it has previously only been considered in [3], where the myopic (greedy) policy was compared numerically to an ad hoc randomized policy.
While in principle an optimal policy for Markovian restless bandit problems can be obtained with the aid of dynamic programming, in practice this is typically computationally infeasible [12]. Particularly when the system is large (with many arms), one therefore often resorts to the class of index policies, which remain tractable due to a decoupling of arms. For every arm, the information available is mapped to some real-valued priority index which does not depend on the state or history of any other arm. At every time slot the policy then activates those k arms that correspond to the k largest indices.
It has been shown in [15] and more recently in [14] that, under technical conditions that are not satisfied by the model considered in this paper, an index policy known as the Whittle index [16] is asymptotically optimal for restless bandit problems. These results hold as the number of arms tends to infinity while the ratio of played arms, k/d, tends to a constant ρ, a regime also considered in this paper. Furthermore, under the restrictive assumption that arms can be modeled as identically distributed two-state Markov chains (Gilbert-Elliott), the authors of [9] derive non-asymptotic optimality results for the Whittle index for the reward observing decision problem. The key feature of the Gilbert-Elliott model is that due to its simplicity the Whittle index can be computed in closed form. For the AR(1), however, no optimal policy is known and a closed-form expression for the Whittle index does not appear to be available.
Our contributions are both structural and asymptotic. Considering the discounted reward case, we find structural properties of the one-armed subsidy problem associated with the Whittle index. We establish convexity and monotonicity properties which, based on non-restrictive assumptions, imply the existence of a switching curve and the monotonicity of the related Whittle index. These properties motivate a simple parametric index which quantifies the virtue of exploration compared to exploitation in terms of variance and mean. For this index we analyze the mean-field behavior of the system in the average reward case. In particular, we put forward a deterministic measure-valued recursion that approximately describes the distribution of belief states when the number of arms is large. We merge these ideas into a performance evaluation and optimization procedure, which we illustrate to be asymptotically exact.
The paper is organized as follows. In Section 2 we formulate the decision problem. Sections 3 and 4 present our contributions with respect to structural and asymptotic analysis of the problem. We conclude in Section 5.

MODEL AND FRAMEWORK
The state processes X(t) := (X1(t), . . . , Xd(t)) are assumed to be independent, and satisfy the AR(1) recursion

Xi(t + 1) = ϕ Xi(t) + εi(t + 1),   (1)

with (εi(t))t denoting an i.i.d. sequence of N(0, σ^2) random variables. The parameters ϕ, σ are assumed to be known. We restrict our exposition to the case ϕ ∈ (0, 1), whence the processes are stable and observations are positively correlated over time.
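The state dynamics are easy to simulate; the following minimal Python sketch (function and parameter names are ours, not from the paper) generates d independent AR(1) arms started from their stationary distribution N(0, σ^2/(1 − ϕ^2)).

```python
import numpy as np

def simulate_ar1(d, T, phi, sigma, rng=None):
    """Simulate d independent AR(1) processes X_i(t+1) = phi*X_i(t) + eps_i(t+1),
    eps ~ N(0, sigma^2), started from the stationary law N(0, sigma^2/(1-phi^2))."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.empty((T, d))
    X[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - phi**2), size=d)  # stationary start
    for t in range(1, T):
        X[t] = phi * X[t - 1] + rng.normal(0.0, sigma, size=d)
    return X
```

For ϕ ∈ (0, 1) the sample variance of a long run should be close to the stationary variance σ^2/(1 − ϕ^2).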
At every point in time the decision maker may choose whether or not to play arm i, i.e., to observe its state and collect the reward. We denote the action of playing arm i by ai(t) = 1 (active), while ai(t) = 0 (passive) refers to the action of not playing. We require that exactly k arms have to be activated at each decision time, i.e., the action vector a(t) ∈ {0, 1}^d satisfies Σi ai(t) = k. We are in the partially observable setting, that is, the state of an arm is only observed when that arm is activated, while at every time slot all arms evolve to the next state. Thus, the states of the d − k passive arms are unknown to the decision maker, and he has to rely on his belief concerning these states. The belief state of arm i at time t is given by the probability distribution over all of its possible states conditional on the information available at that time. Since this distribution is Normal, the belief state is fully characterized by the conditional mean and variance, defined as

μi(t) := ϕ^(ηi(t)) Xi(t − ηi(t)),   νi(t) := σ^2 (1 − ϕ^(2ηi(t)))/(1 − ϕ^2).   (2)

Here, ηi(t) := min{h ≥ 1 | ai(t − h) = 1} denotes the number of time steps ago arm i was last played and observed. We denote the joint (belief) state space of (μi, νi) by Ψ := Ψ1 × Ψ2, so that (μ, ν) ∈ Ψ^d. It is worth noting that Ψ1 = R and Ψ2 is countable and bounded by [νmin, νmax) := [σ^2, σ^2/(1 − ϕ^2)). The conditional variance increases in ηi, i.e. while the arm is not being played.
The following evolutions show how the belief states are updated in a Markovian manner; these also appear in [3]. Let Yμ,ν denote a generic random variable with distribution N(μ, ν). If action ai(t) is chosen at time t, then at time t + 1,

(μi(t + 1), νi(t + 1)) = (ϕ Y_{μi(t),νi(t)}, σ^2)   if ai(t) = 1,
(μi(t + 1), νi(t + 1)) = (ϕ μi(t), ϕ^2 νi(t) + σ^2)   if ai(t) = 0.   (3)

Here, the realization of Y_{μi(t),νi(t)} corresponds to the realization of the state process the decision maker observes when playing arm i at time t. On the other hand, if the arm is not played, then the previous belief state is updated in a deterministic fashion, as no new observation of the state of arm i is made.
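The two belief updates in (3) can be written as a small helper function; the sketch below uses names of our own choosing.

```python
def update_belief(mu, nu, action, observation, phi, sigma2):
    """One-step belief update for a single arm.
    If played (action=1), 'observation' is the realized state Y ~ N(mu, nu):
    the next conditional mean is phi*Y and the variance resets to sigma2.
    If passive, the belief is propagated deterministically through the AR(1) map."""
    if action == 1:
        return phi * observation, sigma2
    return phi * mu, phi**2 * nu + sigma2
```

Iterating the passive branch shows the conditional variance climbing from σ^2 towards its supremum σ^2/(1 − ϕ^2), which is exactly the growing obsolescence of information discussed above.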
In summary, as time evolves from t to t + 1, given the current belief states (μ, ν) and a policy π : Ψ^d → {0, 1}^d, the following chain of actions takes place: the actions a(t) = π(μ(t), ν(t)) are chosen, the rewards of the active arms are collected, and the belief states are updated according to (3). The aim is to find a policy π so as to maximize the accumulated rewards over an infinite time horizon, as evaluated by the total expected discounted reward criterion

V^π(μ, ν) := lim_{T→∞} E_{μ,ν} [ Σ_{t=0}^{T} β^t Σ_{i=1}^{d} ai(t) Xi(t) ],   (4)

where β ∈ (0, 1), and the subscript indicates conditioning on X(0) ∼ N(μ, diag(ν)), or the average expected reward criterion

G^π(μ, ν) := lim inf_{T→∞} (T + 1)^{-1} E_{μ,ν} [ Σ_{t=0}^{T} Σ_{i=1}^{d} ai(t) Xi(t) ].   (5)

Note that Xi(t) in (4) and (5) can be replaced by μi(t).
Lemma 2.1. The function V π is well-defined in the sense that the limit in (4) exists and is finite. Furthermore, the optimal value function sup π V π is finite.
In view of computational tractability we restrict our exposition to policies from the class of index policies.

INDEX POLICIES
An index policy πγ is a policy determined by an index function γ : Ψ → R that maps the belief state of each arm to some priority index. That is, at every point in time, πγ activates those k arms that correspond to the k largest indices γ(μi(t), νi(t)); ties are broken arbitrarily. Without loss of generality the index function can be written as

γ(μ, ν) = μ + q(μ, ν)

for some known function q : Ψ → R. The basic example is the myopic index γ(μ, ν) = μ, with q ≡ 0. The resulting policy always activates those k arms with the largest expected immediate reward. As it does not account for information growing obsolete (giving full priority to exploitation), the performance of the myopic policy deteriorates as β ↑ 1. A more sophisticated index policy, the Whittle index, is surveyed in the next subsection.
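In code, an index policy is a thin layer on top of any index function γ; the following sketch (our notation, not the paper's) activates the k arms with the largest indices, with ties broken by the sort order.

```python
import numpy as np

def index_policy_actions(mu, nu, k, index_fn):
    """Return the action vector of the index policy: activate the k arms
    with the largest indices gamma(mu_i, nu_i)."""
    idx = index_fn(mu, nu)
    active = np.argsort(idx)[-k:]      # positions of the k largest indices
    a = np.zeros(len(mu), dtype=int)
    a[active] = 1
    return a

# The myopic index: q == 0, i.e. rank arms by conditional mean only.
myopic = lambda mu, nu: mu
```

Any of the indices discussed below (Whittle, parametric) plugs into `index_fn` unchanged; only the ranking function differs.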

Whittle Index
The Whittle index is a generalization of the Gittins index to the restless bandit case in which state processes (or belief states) evolve irrespective of whether an arm is being played or not, as opposed to the classical multiarmed bandit problem [6, 7] in which the states of unplayed arms do not evolve. In order to devise a heuristic policy for a restless multiarmed bandit problem, Whittle [16] considered arms separately (i.e., he considered d decoupled one-armed bandits), and introduced a subsidy paid for leaving the arm under consideration passive. Intuitively this subsidy can be thought of as a substitute for the rewards the decision maker could have obtained from playing other arms in the multiarmed setting; from a theoretical point of view it is a Lagrange multiplier associated with the relaxed constraint that k arms have to be activated on average rather than requiring that exactly k arms be activated. Due to this relaxation, the Whittle index policy is not generally optimal for small systems but, at least under certain conditions that do not apply here, it is asymptotically optimal as k, d → ∞ with k/d → ρ [14, 15]. This is also the asymptotic regime we consider in Section 4.
We show how to obtain the Whittle index when the underlying states are AR(1), and provide structural results.
Consider a special one-armed bandit problem where at each time slot, the decision maker can either activate the arm (a = 1) or leave it passive (a = 0). When it is activated, then the decision maker observes the state and collects the corresponding reward. When the arm remains passive, he obtains a (possibly negative) subsidy λ. The objectives are analogous to (4) and (5), but our focus in this section is on (4). We call this problem one-armed bandit problem with subsidy. The Whittle index is then defined as the smallest subsidy for which it is optimal to leave the arm passive.
Definition 3.1. Let Pλ denote the passive set associated with the one-armed problem with subsidy λ,

Pλ := {(μ, ν) ∈ Ψ : the passive action a = 0 is optimal in (μ, ν)}.

Then the Whittle index associated with this arm and state (μ, ν) is

ω(μ, ν) := inf{λ ∈ R : (μ, ν) ∈ Pλ}.

This way, the Whittle index is obtained from the optimal policy for the one-armed problem with subsidy, which can be derived from the optimal value function as outlined below. The Whittle index policy is sensible only if any arm rested under the subsidy λ remains passive under every subsidy λ' > λ. If this is the case, the one-armed bandit problem is called indexable [16]. Even if the state space is finite, proving indexability is likely to be highly involved [5]; we assume that it holds here, as confirmed through extensive numerical experimentation (see for example Fig. 1).
The discount-optimal value function V^λ := sup_π V^{λ,π} for the one-armed problem with subsidy can be obtained using value iteration (see Prop. 3.2). For the average reward case one can formulate a similar result stating that G^λ := sup_π G^{λ,π} can be found from relative value iteration as defined for example in [13, Section 8.5.5]. First we introduce the operator T v := max_{a∈{0,1}} T_a v, where

T_0 v(μ, ν) := λ + β v(ϕμ, ϕ^2 ν + σ^2),
T_1 v(μ, ν) := μ + β ∫ v(ϕx, σ^2) φμ,ν(x) dx,

with φμ,ν denoting the normal density with mean μ and variance ν. Proofs can be found in the appendix.
Proposition 3.2. The value iteration V_{n+1} := T V_n converges to a unique function V^λ : Ψ → R as n → ∞ that satisfies the Bellman equation V^λ = T V^λ. This V^λ is the discount-optimal value function for the one-armed bandit problem with subsidy λ. An optimal policy for this problem maps (μ, ν) to action a if V^λ(μ, ν) = T_a V^λ(μ, ν).
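For intuition, the value iteration of Prop. 3.2 can be approximated numerically by truncating the age of an arm at some H and discretizing μ. The sketch below is a coarse discretization of our own (not the paper's implementation): it uses Gauss-Hermite quadrature for the expectation in the active branch and linear interpolation on the μ-grid, so boundary effects and truncation error are to be expected.

```python
import numpy as np

def one_armed_value_iteration(lam, phi, sigma2, beta,
                              mu_grid=np.linspace(-8, 8, 241),
                              H=40, n_iter=400, n_quad=30):
    """Approximate value iteration for the one-armed subsidy problem.
    States are (mu, age h), where nu(h) = sigma2*(1-phi^(2(h+1)))/(1-phi^2);
    ages are capped at H-1 and mu is interpolated linearly on mu_grid."""
    nu = sigma2 * (1 - phi ** (2 * (np.arange(H) + 1))) / (1 - phi**2)
    x, w = np.polynomial.hermite.hermgauss(n_quad)  # nodes/weights for e^{-x^2}
    w = w / np.sqrt(np.pi)                          # normalize to a probability rule
    V = np.zeros((H, len(mu_grid)))
    for _ in range(n_iter):
        V_new = np.empty_like(V)
        for h in range(H):
            # active: observe Y ~ N(mu, nu[h]); next belief state (phi*Y, age 0)
            Y = mu_grid[:, None] + np.sqrt(2.0 * nu[h]) * x[None, :]
            cont_active = np.interp(phi * Y, mu_grid, V[0]) @ w
            T1 = mu_grid + beta * cont_active
            # passive: collect subsidy; next belief state (phi*mu, age h+1, capped)
            T0 = lam + beta * np.interp(phi * mu_grid, mu_grid, V[min(h + 1, H - 1)])
            V_new[h] = np.maximum(T0, T1)
        V = V_new
    return mu_grid, nu, V
```

On this discretization one can check the monotonicity of Prop. 3.3 directly: each row V[h] is non-decreasing in μ, since both branches of the operator preserve monotonicity.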

Structural Properties. Let us first consider monotonicity properties of the optimal value function V^λ.

Proposition 3.3. For every ν ∈ Ψ2, the function V^λ(·, ν) is convex, continuous, non-decreasing, and not constant; and for every μ ∈ Ψ1, V^λ(μ, ·) is non-decreasing.

Assumption 1. For every ν ∈ Ψ2 there exist μ̲(ν), μ̄(ν) ∈ R such that the active action is optimal in (μ, ν) whenever μ ≥ μ̄(ν), and the passive action is optimal whenever μ ≤ μ̲(ν).

These properties suggest that the optimal policy for the one-armed problem with subsidy is characterized by a switching curve (defined on the countable space Ψ2).
We conjecture that this assumption generally holds here; it states that if the expected reward obtained when playing is large enough, then it is optimal to indeed play the arm, while if it is very small, we should rather take the subsidy. To see that there is a switching curve, it remains to be proven that between μ̲ and μ̄ there are no μ1 < μ2 such that (μ1, ν) ∈ P^c_λ and (μ2, ν) ∈ Pλ.

Proposition 3.4. If Assumption 1 holds, then a policy that achieves the optimal value function V^λ is a threshold policy: there exists a switching curve (sequence) ζλ : Ψ2 → R such that the active action is optimal in (μ, ν) if and only if μ ≥ ζλ(ν).

Numerical evidence such as provided in Fig. 1 suggests that the switching curve is in fact strictly decreasing, i.e. it is optimal to give some priority to exploration.

Assumption 2. The switching curve is non-increasing.
It follows from the assumptions and Prop. 3.4 that the Whittle index is monotone.

Corollary 3.5. Provided Assumption 1 holds, the Whittle index ω(μ, ν) is non-decreasing in μ. If in addition Assumption 2 holds, then ω(μ, ν) is non-decreasing in ν.
Consequently, the Whittle index policy assigns comparatively larger indices to arms that have not been activated for a longer time. In accordance with this observation, Fig. 2 shows that the correction term q(μ, ν) is positive and increases in ν. It is larger for μ close to zero, which may be explained by noting that exploration is less important if |μ| is large, for in that case there is less uncertainty about the direction in which μ will evolve. Furthermore, we confirmed numerically that the slope of ζλ(ν) increases with β, since in that case exploration becomes more beneficial.

Parametric Index
As no closed form for the Whittle index is available, the Whittle indices have to be computed and stored for every belief state in Ψ, and the evaluation of the optimal value function for the one-armed problem with subsidy is computationally expensive. Therefore, instead of finding the index that is optimal for the one-armed problem with subsidy (the Whittle index), we propose to find the index that is optimal when restricting to a family of parametric functions. A simple example is obtained by picking a function q(μ, ν) that is proportional to ν, the most obvious measure of the decision maker's uncertainty. This yields the parametric index

γθ(μ, ν) := μ + θν,   (9)

where θ ≥ 0, as motivated by Corollary 3.5. The correction term θν allows the decision maker to adjust the priority given to exploration. We denote the associated policy by πθ.
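Computationally the parametric index is trivial compared to the Whittle index. The toy example below (the numbers are purely illustrative) shows how the exploration bonus θν lets a stale, high-variance arm overtake a fresher arm that looks slightly better myopically.

```python
import numpy as np

def parametric_index(mu, nu, theta):
    """Index of (9): exploitation term mu plus exploration bonus theta*nu, theta >= 0."""
    return mu + theta * nu

# Arm 0 is fresh (small nu), arm 1 is stale (large nu) with a slightly lower mean.
mu = np.array([0.5, 0.4])
nu = np.array([1.0, 3.0])
myopic_choice = np.argmax(parametric_index(mu, nu, 0.0))   # theta = 0: picks arm 0
explore_choice = np.argmax(parametric_index(mu, nu, 0.1))  # theta = 0.1: picks arm 1
```

With θ = 0 the index reduces to the myopic one; increasing θ shifts priority towards arms whose information has grown obsolete.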
The parametric index can be related to the Whittle index as follows. Numerical experiments (such as Fig. 1 for discounted, and related experiments for average rewards) suggest that the optimal switching curve may be well approximated by a linear function whose slope is negative and does not depend on λ, while the position of the curve does depend on λ. Such an approximation is given by ζλ(ν) ≈ −θν + λ + c with θ ≥ 0, c ∈ R. Setting ζλ(ν) equal to a given μ ∈ R and solving for λ (which then approximates the Whittle index) suggests using an index of the form (9), where without loss of generality we take c = 0.
In the next section we show that we can explicitly describe the asymptotic dynamics of the system induced by πθ.

SYSTEM WITH MANY ARMS
We investigate the behavior of the system with many arms as d → ∞ and kd/d → ρ. Section 4.1 outlines the main idea: the limiting proportion of belief states remains stable in an equilibrium system with infinitely many arms. In Section 4.2 we relate the equilibrium system to a single arm process, and use this connection to propose an algorithm for performance evaluation. This algorithm is used to optimize πθ in Section 4.3.

Limiting Empirical Distribution
We first informally describe the intuition motivating this section. Consider the system with d arms as before, and to simplify the exposition suppose for now that the system is stationary. Let Γi(t) denote the process of indices associated with arm i, that is, Γi(t) := γ(μi(t), νi(t)). Note that the index processes Γi(t) and Γj(t), i, j = 1, . . . , d, are generally dependent because the belief states of both arms depend on the action that was chosen, which in turn depends on the index of all arms in the system (as they are coupled by the requirement that those k arms with the largest indices are activated). Let us now focus on a single arm i in this system and suppose that its belief state evolves from ψ at time t to another belief state ψ' at time t + 1. When d is small, this certainly changes the proportion of arms with belief state ψ considerably. However, if d is very large, we should be able to find another arm j whose belief state at time t + 1 moves into (close proximity of) ψ. It thus seems reasonable to expect that, as we add more arms to the system, it approaches a mean-field limit in which the proportion of arms associated with a certain belief state remains fixed. Thus, in the limit, the action chosen for a certain arm is independent of the current belief state of any other arm, as there is always the same proportion of arms associated with a certain belief state in the system.
Let us now more formally investigate the proportion of arms that are associated with a certain belief state at time t. We focus on parametric index functions as defined in (9). The empirical measure

M^d(C, t) := (1/d) Σ_{i=1}^{d} 1{(μi(t), νi(t)) ∈ C}   (10)

quantifies the proportion of arms in the d-dimensional system whose belief state falls into C ∈ B(Ψ) at time t, where B(Ψ) denotes the Borel σ-algebra on Ψ. It is related to the measure on indices,

M̄^d(B, t) := (1/d) Σ_{i=1}^{d} 1{μi(t) + θνi(t) ∈ B},  B ∈ B(R).   (11)

We examine the dynamics of M^d(C, t). To this end we enumerate the elements in Ψ2, that is, ν^(h) := σ^2 (1 − ϕ^(2(h+1)))/(1 − ϕ^2), h = 0, 1, 2, . . . , so that h + 1 is the number of time steps since an arm was played last. We refer to h as the age of an arm. Then (10) can be written as, with B ∈ B(R),

M^d_h(B, t) := M^d(B × {ν^(h)}, t),  h = 0, 1, 2, . . . .   (12)

Many-arms Asymptotics. As motivated at the beginning of this section, it is reasonable to believe that the limiting proportion of arms associated with a certain belief state evolves deterministically, and thus, that the dynamics of the limiting system can be described by non-random measures mh(·, t). For brevity we write mh(x, t) for mh((−∞, x], t), and denote by Φμ,ν the normal distribution function with mean μ and variance ν. We define mh(·, t) by the recursion

m0(B, t + 1) := Σ_{h≥0} ∫_{[ℓh(t),∞)} Φ_{ϕμ, ϕ^2 ν^(h)}(B) mh(dμ, t),
m_{h+1}(B, t + 1) := mh({μ < ℓh(t) : ϕμ ∈ B}, t),  h ≥ 0,   (14)

where the age-dependent activation thresholds are given by

ℓh(t) := ℓ(t) − θν^(h),  ℓ(t) := inf{x ∈ R : m̄([x, ∞), t) ≤ ρ}.   (15)

Here, m̄ denotes the measure on indices, i.e.

m̄(B, t) := Σ_{h≥0} mh({μ ∈ R : μ + θν^(h) ∈ B}, t),   (16)
cf. Eqn. (12). Note that ℓh(t) is a threshold such that at time t the policy πθ activates all arms that are of age h and have conditional mean μ(t) ≥ ℓh(t), h ≥ 0. Obviously, if the policy is myopic, then ℓh(t) = ℓ(t) does not depend on the age of an arm and the above expressions can be simplified.
Recursion (14) is obtained based on the dynamics of the belief states as given in (3). The evolution of m0(·, t) is determined by the evolution of the belief states of all arms that have been played in the previous time slot. If h > 0, on the other hand, we use that arms of age h must have been of age h − 1 at the previous decision time; and since they have not been activated, their mean must have been below the threshold ℓ_{h−1}(t − 1).
For the (pre-limit) empirical processes M^d it obviously holds that M^d(Ψ, t) = 1 as well as M^d_0(Ψ1, t) = kd/d. These properties carry over to the limiting measure; this is easily proven by induction using (14)-(16). We believe that (14) indeed describes the mean-field behavior of the dynamical system:

Conjecture 1. As d → ∞ with kd/d → ρ, and provided M^d_h(·, 0) converges weakly to mh(·, 0) for every h ≥ 0, the empirical measures M^d_h(·, t) converge weakly to the deterministic measures mh(·, t) defined by (14), for every t.

Long-run Equilibrium. Note from (14) that for h ≤ t we can express mh(B, t) in terms of m0(·, t − h),

mh(B, t) = m0({μ : ϕ^h μ ∈ B and ϕ^j μ < ℓj(t − h + j) for all 0 ≤ j < h}, t − h).   (17)

Then the fixed-point equation corresponding to (14) is given by

m*_0(B) = Σ_{h≥0} ∫_{[ℓ*_h,∞)} Φ_{ϕμ, ϕ^2 ν^(h)}(B) m*_h(dμ),   (18)

where m*_h(B) := m*_0({μ : ϕ^h μ ∈ B and ϕ^j μ < ℓ*_j for all 0 ≤ j < h}), ℓ*_h := ℓ* − θν^(h), ℓ* := inf{x : m̄*([x, ∞)) ≤ ρ}, and m̄* again denotes the measure on indices, cf. (16).
The above system of equations describes possible equilibrium points of the measure-valued dynamical system. It is intricate due to the coupling of ℓ*, m̄* and the measures m*_h, h ≥ 0. Nevertheless, the system is elegant in that its solution can potentially be described through a single measure, namely m*_0.
For the special case of θ = 0 (myopic) we verified numerically that with arbitrary initial choice {mh(·, 0)} satisfying (17) an equilibrium point satisfying (18) is indeed attracting. Furthermore, when d and t are large enough, the proportion of arms associated with a certain belief state in a simulated system with d arms is indeed fixed and well approximated by the solution to (18) when operated under the myopic policy.

The Equilibrium Index Process
We now relate the system with many arms operated under πθ to a special one-armed process with threshold. For this process the arm is activated whenever the index exceeds a specified threshold ℓ, i.e., a(t) = 1{μ(t) + θν(t) ≥ ℓ}. Because the evolution of the belief state and thus the evolution of the index depends on ℓ, we denote the associated stochastic process of indices by Γ^ℓ(t) := μ(t) + θν(t).
Suppose that ℓ is picked in such a way that we activate with probability ρ; denote it by ℓ*. Then a policy that chooses action ai(t) = 1{μi(t) + θνi(t) ≥ ℓ*} for every arm i in an unconstrained system with d arms is a policy which activates ρd arms on average (this is essentially the idea behind Whittle's relaxation [16]). Thus, as d → ∞, this policy activates approximately a proportion ρ of the arms at every decision time.
We believe that in steady state (as t → ∞ or under stationarity) the equilibrium of the measure-valued dynamical system is directly related to the one-armed process with this particular threshold ℓ*, and that this threshold equals the equilibrium threshold ℓ* of (18).
Conjecture 2. Assume that the index is parametric, and that Γ^ℓ(t) is stationary. Then the equation

P(Γ^ℓ(t) ≥ ℓ) = ρ   (19)

has a unique solution ℓ*, which satisfies Eqn. (18), and the stationary distribution of Γ^{ℓ*}(t) is given by the equilibrium index measure m̄* of (18).

A practical implication of Conj. 1 and 2 combined is that in the limit, as d → ∞ and t → ∞, a parametric index policy πθ is equivalent to the policy that activates arm i in an unconstrained system whenever Γi(t) ≥ ℓ*, where ℓ* is defined by (19). This motivates the following simple algorithm for performance evaluation.
Algorithm Performance evaluation.
1: For large T, determine ℓ* such that T^{-1} Σ_{t=0}^{T} a(t) = ρ is achieved for the one-armed threshold policy with threshold ℓ*.
2: Use the sample path of Step 1 to obtain an estimate Ĝ for the expected average reward of the one-armed system.
3: Output Ĝd := d · Ĝ as an approximation of the expected average reward of the multiarmed system with d arms operated under πθ.
The virtue of this algorithm is that the behavior of the manyarmed system is approximated by simulating a much simpler one-armed problem.
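A minimal implementation of the algorithm might look as follows (function names, parameter defaults, and the bisection for ℓ* are our choices, not the paper's): Step 1 calibrates the threshold, and the same kind of sample path yields the estimate Ĝ of Step 2. Active slots contribute μ to the reward, using the remark after (5) that Xi(t) may be replaced by μi(t).

```python
import numpy as np

def one_arm_run(ell, theta, phi, sigma2, T=50_000, seed=0):
    """Simulate the one-armed threshold process: play iff mu + theta*nu >= ell.
    Returns (activation frequency, average reward per time slot)."""
    rng = np.random.default_rng(seed)
    mu, nu = 0.0, sigma2 / (1.0 - phi**2)   # start from the never-played belief
    plays, reward = 0, 0.0
    for _ in range(T):
        if mu + theta * nu >= ell:          # active: observe, collect, reset variance
            plays += 1
            reward += mu
            y = rng.normal(mu, np.sqrt(nu))
            mu, nu = phi * y, sigma2
        else:                               # passive: deterministic belief decay
            mu, nu = phi * mu, phi**2 * nu + sigma2
    return plays / T, reward / T

def calibrate_threshold(rho, theta, phi, sigma2, lo=-10.0, hi=10.0, tol=1e-2, T=50_000):
    """Step 1: bisect for ell* with activation frequency ~ rho.
    Uses that the activation frequency is non-increasing in the threshold."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        freq, _ = one_arm_run(mid, theta, phi, sigma2, T=T)
        lo, hi = (mid, hi) if freq > rho else (lo, mid)
    return 0.5 * (lo + hi)
```

Step 3 then simply reports d times the estimated per-arm reward; optimizing over θ reduces to repeating this calibration for a grid of θ-values.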

Optimized Parametric Index
The algorithm can be used to approximate the best parameter values for a parametrized index policy. We approximate θ* := arg max_θ Gd(θ) by θ̂* := arg max_θ Ĝ(θ), where Gd(θ) is the average reward obtained under πθ for the problem with d arms, and Ĝ(θ) is the estimator for the one-armed average reward as obtained from Step 2 of the algorithm. Fig. 3 depicts the estimated expected average reward Ĝ(θ) as a function of θ. The figure suggests that for large ϕ, the myopic policy (which corresponds to θ = 0) can be improved significantly.

We now examine the performance of πθ when the parameter is chosen to be θ̂*. In contrast to the approximation Ĝd(θ) that is obtained from the algorithm, we denote the estimated average reward obtained by Monte Carlo simulation of the d-armed system by G̃d(θ), and define θ̃*_d := arg max_θ G̃d(θ). Accordingly, G̃d(θ̂*) and G̃d(θ̃*_d) are the average rewards obtained when simulating the system under πθ, where θ is chosen as θ̂* and θ̃*_d, respectively. In Fig. 4 we compare these quantities to the average rewards obtained when simulating the system under the Whittle index and the myopic policy. Unsurprisingly, the Whittle index policy outperforms the other index policies; in fact, we believe it to be asymptotically optimal. However, the parametrized index does considerably better than the myopic policy. Importantly, we note from Fig. 4 that θ̃*_d is indeed well approximated by θ̂*. Thus, instead of optimizing the parameter by simulating the multidimensional d-armed system, we can approximate the best θ-value directly from the one-armed process with threshold, for any value of d (such that kd = ρd).

CONCLUDING REMARKS
This paper provides a starting point for a rigorous investigation of the structural properties and performance of index policies in partially observable restless bandit problems with AR(1) arms. This incorporates (i) the analysis of the Whittle index as a likely candidate for an asymptotically optimal policy as d → ∞ while kd/d → ρ, and (ii) insights into the behavior of the system in this asymptotic regime. In addition to our conjectures above, we also believe that some form of asymptotic independence holds for the index processes as the number of arms grows large. In this context we mention that Γi, i = 1, . . . , d, are exchangeable [2]. This may yield a path for proving asymptotic independence. The recursions on measures defining the limiting dynamical system can perhaps be treated along the lines of [10].
Furthermore, many of the ideas in this paper can be generalized. For example, the results obtained in Section 3.1 for discounted rewards similarly hold in the average reward case, and the assumptions made in that section generally hold for the problem we consider. Beyond that, we can extend the treatment to AR processes of higher order, heterogeneous arms and bandit problems with correlated arms.

ACKNOWLEDGMENTS
This project is supported by the Australian Research Council grant DP130100156.