Transmission Policies for Energy Harvesting Sensors Based on Markov Chain Energy Supply

Because energy harvesting rates are small and the harvesting process is stochastic, energy management of energy harvesting sensors remains crucial for body networks. Transmission policies for energy harvesting sensors with a Markov chain energy supply over time-varying channels are formulated as an infinite-horizon discounted reward Markov Decision Problem under the assumption that the sensors’ lifetime is geometrically distributed. In this paper, we first propose a low-storage transmission policy for body networks based on the probability of successful transmission. We then narrow the feasible region of the policy parameters from the real domain to a finite discrete set, which makes it workable to obtain the optimal parameters by combining the optimality equations with an enumeration algorithm. Finally, numerical results show that the proposed transmission policies closely approximate the performance of the optimal policies, which can be derived by the policy iteration algorithm. Compared with the optimal policies, the proposed policies have the advantage of low storage.


INTRODUCTION
With the development of energy harvesting techniques, harvesting energy from the environment has received increasing attention in wireless sensor networks. Wireless sensor networks with energy harvesting nodes can operate for long periods. However, due to the low replenishing rate and the stochastic energy renewal process, energy management is still crucial [1].
Many previous works have addressed the optimization of a point-to-point energy harvesting communication system in which the transmitter is equipped with an energy harvesting device. They can be categorized into two classes, offline and online scheduling, based on whether or not the source knows the information about the energy and data arrival processes. A review of optimization in energy harvesting communication systems is given in [2]. Our paper mainly focuses on online scheduling.
Online scheduling assumes that the source has statistical information about the energy harvesting processes [3]-[7]. In [3], a threshold algorithm is proposed for utilizing available energy to transmit packets with different rewards based on a memoryless Markov chain for the battery state, and time correlation in the energy supply is introduced in [4]. In [5], energy management policies for energy harvesting sensor nodes are proposed to maximize the data rate while the data queue stays stable. An online heuristic policy that performs close to the online optimal policy is proposed in [6] for the finite-horizon optimal packet scheduling problem. In [7], a Markov Decision Problem (MDP) is formulated to maximize the number of successfully delivered packets per time slot for an energy harvesting node with infinite capacity. It is proved that the optimal policy is a threshold policy depending on the state of the channel and the length of the energy queue.
However, on the one hand, it is not possible to know the noncausal information before transmitting; on the other hand, the parameters of the underlying stochastic processes change over time in many practical systems. In such cases, neither offline nor online scheduling is applicable. In [8], a learning approach is proposed to learn the optimal transmission policy that maximizes the expected sum of the data transmitted during the transmitter's life over a fixed channel.
In our paper, we consider a time-slotted point-to-point communication system with a time-correlated energy supply, which is similar to [4]. However, the model in [4] did not consider the impact of a limited source lifetime. In our paper, we model the lifetime of the source as a random variable, which means the source can stop its operation at any time slot with a fixed probability. At the beginning of each time slot, the source node decides to transmit a packet or drop it depending on the channel state and the energy queue length. The objective of the source is to maximize the average amount of data successfully transmitted during its lifetime.
The rest of the paper is organized as follows. In Section II, a system model is presented and the problem is formulated as an infinite-horizon discounted reward Markov Decision Process (MDP). In Section III, we define a low-storage transmission policy and present an algorithm to find the optimal policy. Finally, numerical results are presented in Section IV, followed by our conclusions in Section V.

SYSTEM MODEL
In this section, we present our system model for a point-to-point energy harvesting communication system and formulate the optimal transmission problem as an infinite-horizon discounted reward MDP. A time-slotted system is considered, in which the k-th time slot is the interval [k, k + 1), k ∈ Z. Fig. 1 gives the model of the source node with a finite battery. The data queue is assumed to be saturated; that is, there is always data to be sent at the beginning of each time slot. We use E to indicate the energy queue length, and the capacity of the battery is N units. We also assume that the lifetime of the source is a geometric random variable with mean 1/(1 − β), where 0 < 1 − β ≤ 1. In other words, at each slot the source node can terminate its operation with small probability 1 − β, independently and identically across slots.

Figure 2: energy supply model
Let H_k be the channel state during time slot k. We assume that H_k follows a Markov model, where p_lj is the transition probability from state l to state j. h_k (0 ≤ h_k ≤ 1) is the successful transmission probability during slot k and takes one of M values. Here we assume that the channel state process and the energy harvesting process are independent.
At the beginning of each slot, the source node decides to either transmit a packet or discard it. One unit of energy is taken from the energy queue when a packet is transmitted.
We assume that the source node receives feedback from the receiver about the channel state information (CSI) at the beginning of each slot; the source node then takes its action based on the current state. We use a to indicate the action, where a = T means the action is to transmit a packet and a = D means the action is to discard a packet. The objective of the source node is to find the optimal policy that maximizes the expected total number of packets that are received successfully during the lifetime of the source node.
Next, we define S as the state space, whose elements are the triples (E_k, B_{k−1}, H_k); A(i) as the set of possible actions in state i; and π as the policy used by the source node. We define V(i, π) as the expected total reward under policy π with initial state i. It can be calculated by the following expression:
$$V(i, \pi) = E_i^{\pi}\left\{ E_Y\left\{ \sum_{k=1}^{Y} r(s_k, a_k) \right\} \right\} \qquad (1)$$
where k is the time slot index, s_k is the state at time slot k, a_k is the action taken at time slot k, the random variable Y is the lifetime of the source, E_Y denotes the statistical expectation with respect to Y, and E_i^π denotes the expectation of the total reward under policy π with initial state i. The term r(s_k, a_k) is the reward when the state is s_k and the action is a_k, and has the following expression:
$$r(s_k, a_k) = \begin{cases} h_k, & a_k = T \\ 0, & a_k = D \end{cases} \qquad (2)$$
The reward represents the expected number of packets received correctly when the action is taken. If the source chooses action T, which means to transmit a packet, the reward is h_k, which depends on the channel state at time slot k. If the source chooses action D, which means to discard a packet, the reward is 0 since no packet is transmitted.
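Because the lifetime Y of the source is geometrically distributed with mean 1/(1 − β), equation (1) is equivalent to the objective function of an infinite-horizon discounted reward problem, which has the following expression [9]:
$$V(i, \pi) = E_i^{\pi}\left\{ \sum_{k=1}^{\infty} \beta^{k-1} r(s_k, a_k) \right\} \qquad (3)$$
The optimization problem can be expressed as follows:
$$\max_{\pi} V(i, \pi), \quad i \in S \qquad (4)$$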
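The equivalence between a geometrically distributed lifetime and discounting can be checked numerically. The following is a minimal sketch, assuming P(Y = n) = (1 − β)β^(n−1) for n = 1, 2, ... and a deterministic reward sequence; the function names, the reward sequence, and the truncation horizon are illustrative assumptions, not part of the original model.

```python
def lifetime_expected_reward(reward, beta, horizon=2000):
    """E_Y[sum_{k=1}^{Y} r_k] for P(Y = n) = (1 - beta) * beta**(n - 1), n >= 1.

    The infinite sum over n is truncated at `horizon` (an assumption).
    """
    total = 0.0
    prefix = 0.0  # running value of r_1 + ... + r_n
    for n in range(1, horizon + 1):
        prefix += reward(n)
        total += (1.0 - beta) * beta ** (n - 1) * prefix
    return total


def discounted_reward(reward, beta, horizon=2000):
    """sum_{k=1}^{horizon} beta**(k - 1) * r_k, the truncated discounted sum."""
    return sum(beta ** (k - 1) * reward(k) for k in range(1, horizon + 1))


def example_reward(k):
    # Hypothetical deterministic reward sequence, for illustration only.
    return 0.5 if k % 2 == 0 else 0.2


BETA = 0.95
lifetime_value = lifetime_expected_reward(example_reward, BETA)
discounted_value = discounted_reward(example_reward, BETA)
```

Under these assumptions the two values agree up to truncation and rounding error, because the probability that the source is still alive in slot k is β^(k−1).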

ONLINE TRANSMISSION POLICIES
The optimization problem is an infinite-horizon discounted reward MDP. Because in our system the state space S is countable and discrete, and the number of actions for each state is finite, there exists an optimal deterministic stationary policy. Furthermore, the policy iteration algorithm (PIA) [10] can be used to solve the problem and terminates in a finite number of iterations with an optimal stationary policy.
However, when the number of states is large, the algorithm requires a lot of computation to solve the linear system of equations, and also needs large storage to save the optimal policy. So in this section, we present a low-storage policy and an algorithm to find the optimal parameters of the policy.
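For reference, policy iteration on a generic finite MDP can be sketched as follows. This is an illustrative sketch rather than the implementation used in the paper: the data layout (P[s][a] as a list of next-state probabilities, R[s][a] as a scalar reward), the toy two-state example, and the use of iterative rather than direct policy evaluation are all assumptions.

```python
def q_value(s, a, V, P, R, beta):
    """One-step lookahead value of taking action a in state s."""
    return R[s][a] + beta * sum(p * v for p, v in zip(P[s][a], V))


def evaluate(policy, P, R, beta, tol=1e-12):
    """Approximate the solution of V = R_f + beta * P_f V by fixed-point iteration."""
    V = [0.0] * len(policy)
    while True:
        new_V = [q_value(s, policy[s], V, P, R, beta) for s in range(len(policy))]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V


def policy_iteration(P, R, beta):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = [0] * len(P)
    while True:
        V = evaluate(policy, P, R, beta)
        new_policy = []
        for s in range(len(P)):
            best = policy[s]
            for a in range(len(P[s])):
                # Switch only on a strict improvement, so ties cannot cause cycling.
                if q_value(s, a, V, P, R, beta) > q_value(s, best, V, P, R, beta) + 1e-9:
                    best = a
            new_policy.append(best)
        if new_policy == policy:
            return policy, V
        policy = new_policy


# Hypothetical toy MDP: 2 states, 2 actions per state.
TOY_P = [[[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 stays, action 1 moves to state 1
         [[0.0, 1.0], [1.0, 0.0]]]   # state 1: action 0 stays, action 1 moves to state 0
TOY_R = [[0.0, 1.0], [2.0, 0.0]]
toy_policy, toy_values = policy_iteration(TOY_P, TOY_R, 0.9)
```

The returned policy is a list with one action per state, which is the look-up table whose size grows with the number of states.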

Definition of Transmission Policies
The transmission policy is defined as follows:
$$a_k = \begin{cases} D, & E_k = 0 \\ D, & E_k \ge 1,\ h_k < \theta_b \\ T, & E_k \ge 1,\ h_k \ge \theta_b \end{cases} \qquad (5)$$
where E_k is the energy level in the k-th slot, h_k is the successful transmission probability during slot k, and θ_b is the threshold associated with the previous slot's energy harvesting state b, where b can be 1 or 0, and θ_0 ≥ θ_1.
When the energy queue length E_k = 0, the source takes action D. When the energy queue length E_k ≥ 1, the source decides its action according to h_k: when h_k < θ_b, the source takes action D to save energy; when h_k ≥ θ_b, the source takes action T to obtain a good reward. θ_1 is the threshold for B_{k−1} = 1, and θ_0 is the threshold for B_{k−1} = 0. Because more data should be transmitted when the energy harvesting rate is higher, we require θ_0 ≥ θ_1.
In the optimal policy obtained by PIA, the source needs to store a look-up table that contains the states and the corresponding actions. In the proposed policy, the source only needs to store the two thresholds θ_1 and θ_0, so the proposed policy has the advantage of low storage.
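The threshold rule described above can be written as a short function. This is a minimal sketch; the function name, argument names, and the example threshold values are hypothetical and chosen only for illustration.

```python
def threshold_action(energy, prev_harvest_state, success_prob, thresholds):
    """Return "T" (transmit) or "D" (discard) under the two-threshold policy.

    thresholds is the pair (theta_0, theta_1), indexed by the previous slot's
    energy harvesting state b in {0, 1}; theta_0 >= theta_1 is assumed.
    """
    if energy == 0:
        return "D"  # nothing can be sent with an empty energy queue
    theta_b = thresholds[prev_harvest_state]
    return "T" if success_prob >= theta_b else "D"


# Hypothetical thresholds (theta_0, theta_1) = (0.5, 0.2).
EXAMPLE_THRESHOLDS = (0.5, 0.2)
action_bad_channel_harvesting = threshold_action(3, 1, 0.2, EXAMPLE_THRESHOLDS)
action_bad_channel_idle = threshold_action(3, 0, 0.2, EXAMPLE_THRESHOLDS)
```

Only the pair of thresholds has to be stored, regardless of the battery capacity N or the number of channel states M.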

Algorithm of Optimal Threshold
From the policy, we can see that different thresholds lead to different rewards. In this section, we present an algorithm to find the optimal thresholds. The optimal thresholds can be found by combining the optimality equations [10] with an enumeration algorithm. The algorithm proceeds as follows:
1. Sort the channel states in increasing order of successful transmission probability, denoted by h_1, h_2, ..., h_M; the thresholds must satisfy θ_0 ≥ θ_1.
2. Enumerate the possible values of θ_1 and θ_0 to obtain the policy according to equation (5). Then solve the following equation to obtain the rewards:
$$V = R(f) + \beta P(f) V \qquad (6)$$
The thresholds that yield the largest reward are optimal.
In equation (6), V is the expected total reward, R(f) is the immediate reward of each state under policy f, which can be computed according to equation (2), and P(f) is the transition probability matrix under policy f. Next we introduce the formula for computing P(f).
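For a fixed policy f, the expected total reward satisfies the linear system V = R(f) + βP(f)V, that is, (I − βP(f))V = R(f). The sketch below solves it with a small Gaussian elimination routine in pure Python; the function names and the two-state example matrix are assumptions for illustration, and a library linear solver would normally be used for larger state spaces.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= factor * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate_policy(P_f, R_f, beta):
    """Return V solving (I - beta * P_f) V = R_f for a fixed policy f."""
    n = len(R_f)
    A = [[(1.0 if i == j else 0.0) - beta * P_f[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(A, R_f)


# Hypothetical 2-state example: state 0 moves to state 1, state 1 is absorbing.
EXAMPLE_P = [[0.0, 1.0], [0.0, 1.0]]
EXAMPLE_R = [1.0, 2.0]
example_values = evaluate_policy(EXAMPLE_P, EXAMPLE_R, 0.9)
```

In this example the absorbing state earns 2 per slot, so its value is 2/(1 − 0.9) = 20, and the other state has value 1 + 0.9 · 20 = 19.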
The k-th state is denoted by (E_k, B_{k−1}, H_k). When the source takes action T to transmit a packet, one unit of energy is depleted. If B_{k−1} = 1, the source harvests one unit of energy with probability λ_1, so the energy level either remains the same or decreases by one unit. If B_{k−1} = 0, the source cannot harvest energy, so the energy level E_k decreases by one unit. Moreover, the channel state process and the energy harvesting process are independent. So we can derive the following formula. When the source takes action D, the energy level E_k either remains the same or increases by one unit. However, because of the limited battery capacity, when E_k = N the energy level can only remain N. So we can derive the following formula.
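The energy-level part of these transitions can be sketched as below. This covers only the energy component as described in the text; a full entry of P(f) would additionally multiply in the channel transition probability p_lj and the transition probability of the energy harvesting state, which this sketch does not reproduce. The function and argument names are hypothetical.

```python
def next_energy_distribution(energy, action, prev_harvest_state, lam1, capacity):
    """Return {next_energy_level: probability} for one slot.

    Assumptions taken from the description above: action "T" spends one unit,
    one unit is harvested with probability lam1 only when the previous
    harvesting state is 1, and the battery holds at most `capacity` units.
    """
    if action == "T" and energy == 0:
        raise ValueError("cannot transmit with an empty energy queue")
    harvest_prob = lam1 if prev_harvest_state == 1 else 0.0
    after_action = energy - 1 if action == "T" else energy
    dist = {}
    for harvested, prob in ((1, harvest_prob), (0, 1.0 - harvest_prob)):
        if prob == 0.0:
            continue
        nxt = min(after_action + harvested, capacity)  # battery overflow is lost
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist


transmit_while_harvesting = next_energy_distribution(3, "T", 1, 0.25, 10)
discard_at_full_battery = next_energy_distribution(10, "D", 1, 0.25, 10)
```

With action T and B_{k−1} = 1 the level stays at 3 or drops to 2; with action D at E_k = N the level stays at N with probability 1.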

NUMERICAL RESULTS
In this section, we compare the performance of three transmission policies: the optimal policies obtained by PIA, the proposed adaptive transmission policies, and greedy policies, in which the source can take action T only when B_{k−1} = 1 and E_k > 0. The performance measure is the expected total number of packets that are received successfully during the lifetime of the source node with initial state E_k = 0.
We model H_k as a Gilbert-Elliott channel [11] with p_11 = 0.8, p_01 = 0.4, h_1 = 0.5, and h_0 = 0.2. In addition, N = 10 and β = 0.95. Figs. 3-5 show the performance against the energy harvesting probability λ_1 for different parameters of the energy harvesting process. In Fig. 3, p_1 = 0.2 and p_0 = 0.8, which means the source stays in energy harvesting state 0 for a longer time; in Fig. 4, p_1 = 0.5 and p_0 = 0.5, which means the source stays in energy harvesting states 0 and 1 for the same time; in Fig. 5, p_1 = 0.8 and p_0 = 0.2, which means the source stays in energy harvesting state 1 for a longer time. First, the three figures show that the performance of all transmission policies increases with the energy harvesting probability λ_1, because a larger energy harvesting probability makes more energy available to transmit data. Second, the proposed adaptive policy performs better than the greedy policy but worse than the optimal policy. However, when the energy harvesting probability is lower than a threshold, the performance of the adaptive policy is a good approximation of that of the optimal policy. Finally, the performance gap between the optimal and adaptive policies grows as λ_1 increases. This is because the adaptive policy only uses different thresholds for different energy harvesting states; the thresholds do not depend on the energy level.
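As a rough sanity check on the Gilbert-Elliott parameters used in this section, the long-run fraction of time spent in each channel state can be computed from the two transition probabilities. The sketch assumes that p_01 is the probability of moving from state 0 to state 1 and p_11 the probability of staying in state 1, following the p_lj convention of the system model; the function name is hypothetical, and the computed quantities are not results reported in the paper.

```python
def stationary_two_state(p01, p11):
    """Stationary distribution (pi_0, pi_1) of a two-state Markov chain."""
    p10 = 1.0 - p11                 # probability of leaving state 1
    pi1 = p01 / (p01 + p10)         # balance equation: pi_0 * p01 = pi_1 * p10
    return 1.0 - pi1, pi1


pi0, pi1 = stationary_two_state(p01=0.4, p11=0.8)
# Long-run average success probability if a packet were sent in every slot.
mean_success = pi0 * 0.2 + pi1 * 0.5
```

Under these assumptions the channel spends about two thirds of the time in state 1, so the average success probability of always transmitting is about 0.4 per slot.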

CONCLUSIONS
In this paper, we have considered a time-slotted point-to-point energy harvesting system with a time-correlated energy supply and a limited source lifetime over a time-varying channel. Our goal is to find optimal transmission policies that maximize the expected sum of successfully transmitted data during the lifetime of the source. First, assuming that the lifetime of the source is a geometric random variable, we formulated the problem as an infinite-horizon discounted reward MDP. Then we presented a low-storage transmission policy and introduced an algorithm to find the optimal parameters of the policy. Numerical results show that when the energy harvesting probability is lower than a threshold, the performance of the proposed policies is a good approximation of that of the optimal policies, which can be derived through the policy iteration algorithm.

Firstly, we narrow the feasible region of the threshold from the real domain to a finite discrete set. The range of θ_b is [0, 1]. Sort the channel states in increasing order of successful transmission probability, denoted by h_1, h_2, ..., h_M, with h_0 = 0. Suppose h_i < θ_b ≤ h_{i+1} for some i = 0, 1, ..., M − 1. Then when the k-th successful transmission probability is any of {h_1, h_2, ..., h_i}, the source takes action D; when the k-th successful transmission probability is any of {h_{i+1}, h_{i+2}, ..., h_M}, the source takes action T. The effect is the same as setting θ_b = h_{i+1}. If θ_b = h_M + ε with ε > 0, the source can only take action D. So the feasible region of the threshold can be narrowed to the set {h_1, h_2, ..., h_M, h_M + ε}, ε > 0.
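The narrowing argument can be illustrated with a short sketch that builds the discrete candidate set, maps any real threshold to its equivalent candidate, and lists the threshold pairs with θ_0 ≥ θ_1 that an enumeration would visit. The function names, the value of ε, and the example probabilities are assumptions for illustration.

```python
from itertools import product


def candidate_thresholds(success_probs, eps=1e-6):
    """Return [h_1, ..., h_M, h_M + eps] with the h_i sorted in increasing order."""
    levels = sorted(set(success_probs))
    return levels + [levels[-1] + eps]


def effective_threshold(theta, success_probs, eps=1e-6):
    """Map a real threshold to the candidate that induces the same actions."""
    candidates = candidate_thresholds(success_probs, eps)
    for h in candidates[:-1]:
        if theta <= h:              # h_i < theta <= h_{i+1} maps to h_{i+1}
            return h
    return candidates[-1]           # theta above h_M: the source never transmits


def threshold_pairs(success_probs, eps=1e-6):
    """All (theta_0, theta_1) pairs from the candidate set with theta_0 >= theta_1."""
    c = candidate_thresholds(success_probs, eps)
    return [(t0, t1) for t0, t1 in product(c, c) if t0 >= t1]


# Hypothetical example with M = 2 success probabilities.
EXAMPLE_PROBS = [0.5, 0.2]
pairs = threshold_pairs(EXAMPLE_PROBS)
```

With M channel states there are M + 1 candidates per threshold, so at most (M + 1)(M + 2)/2 pairs need to be evaluated.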