Dynamic State Space Partitioning for Adaptive Simulation Algorithms

Adaptive simulation algorithms can automatically change their configuration during runtime to adapt to changing computational demands of a simulation, e.g., triggered by a changing number of model entities or the execution of a rare event. These algorithms can improve the performance of simulations. They can also reduce the configuration effort of the user. When such algorithms rely on machine learning techniques, these advantages come at a cost, i.e., the algorithm needs time to learn good adaptation policies and it must be equipped with the ability to observe its environment. An important challenge is to partition the observations into suitable macro states to improve the effectiveness and efficiency of the learning algorithm. Typically, aggregation algorithms that dynamically partition the state space during runtime, e.g., the adaptive vector quantization algorithm (AVQ), are preferred here. In this paper, we integrate the AVQ into an adaptive simulation algorithm.


INTRODUCTION
Adaptive simulation algorithms change their configuration during runtime automatically to improve the overall performance of a simulation [3]. Adaptations can be necessary due to changing computational demands during a simulation run, e.g., caused by a changing number of entities or the execution of a rare event. In general, it is challenging for a user to select a suitable algorithm and configuration for a specific experiment, let alone to change the algorithm during runtime. Even the developers of complex algorithms are typically not able to properly evaluate the performance of their algorithms for a specific problem [5]. Adaptive algorithms are an approach to deal with these challenges.
Machine learning techniques can be used to learn when to use which configuration automatically. Adaptive algorithms that use such techniques can be reused for different models, modeling languages and simulation experiments [3]. A suitable machine learning technique is reinforcement learning [11]. In reinforcement learning, the following procedure is repeated until a termination criterion is fulfilled, e.g., a simulation run terminates. An agent observes the environment and performs an action. Then, it receives a reward from the environment and it updates its knowledge base. Typically, the agent starts with little knowledge about the environment and it must find a suitable trade-off between the exploration and the exploitation of knowledge. Based on reinforcement learning, we developed a generic adaptive simulation algorithm [3].
However, the advantages of the learning algorithms come at a cost. For example, they must automatically determine how the features of the model and the environment influence the performance. In reinforcement learning, aggregation algorithms are used to deal with this problem, e.g., [9,10,6]. These algorithms dynamically partition the state space into disjoint macro states. A macro state represents a region of the state space with a homogeneous performance behavior. All states within the same macro state are treated equally, i.e., knowledge learned for one state is reused for all other states of the same macro state. The performance of these aggregation algorithms strongly depends on the given problem and the configuration used; using these algorithms properly is a challenge in itself. In this paper, we discuss several aggregation algorithms with respect to their applicability to adaptive simulation algorithms, and we adjust the adaptive vector quantization algorithm [6] so that it can be used for the adaptive simulation algorithm developed in [3].

ADAPTIVE SIMULATOR
The adaptive simulation algorithm called "adaptive simulator" developed in [3] uses reinforcement learning to learn when to use which simulation algorithm and configuration. In reinforcement learning terms, the adaptive simulator represents the agent, all available simulation algorithms and configurations represent the available actions, and all observable information, such as the model state, represents the environment. The adaptive simulator is organized as a wrapper for other simulation algorithms, i.e., it uses other algorithms to compute the actual simulation and it exchanges the current simulation algorithm during an adaptation. The available simulation algorithms and configurations are listed in a finite set A = {a1, a2, ..., an}. During a simulation, simulation events are executed, and after each event execution a data vector ("base state") σ ∈ Σ is observed and appended to a base state trajectory τ ∈ Σ* (Alg. 1, l. 9-11). When the adaptation condition is fulfilled (l. 12), an adaptation is triggered. The adaptation condition function uses a Bayesian changepoint detection algorithm to determine adaptation points [4]. When an adaptation is to be executed, the adaptive simulator first computes the current reward r ∈ R (l. 13), e.g., the event throughput. The knowledge base is updated (l. 23) by using Q-Learning, which learns so-called q-values q ∈ R for each state-action pair (s, a), representing the utility of selecting a after observing s [14]. Afterward, the current state s ∈ S is computed from the base state trajectory τ ∈ Σ*. The state is then mapped to a macro state that represents a region in the state space. So far, we have only used static grids to partition the state space. This approach is simple but it has obvious disadvantages, i.e., finding a suitable scale for the grid is difficult and problem-dependent. Furthermore, different granularities are usually needed in different areas of the state space.
Finally, the new simulation algorithm is selected by using a selection policy (l. 29) and the current simulation algorithm is exchanged with the new one (l. 31). For the selection policy, we use ε-decreasing [12], which is a simple but robust and efficient policy, see [2,3].
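The interplay of the q-value update and the ε-decreasing selection can be sketched as follows. This is a minimal illustration and not the implementation of [3]: the class name `QAgent`, the learning rate, the discount factor, and the schedule ε_t = min(1, ε0 / t) are assumptions made for this example.

```python
import random
from collections import defaultdict


class QAgent:
    """Hypothetical tabular Q-Learning agent with epsilon-decreasing selection."""

    def __init__(self, actions, learning_rate=0.1, discount=0.9,
                 epsilon0=5.0, seed=0):
        self.actions = list(actions)
        self.learning_rate = learning_rate  # assumed value
        self.discount = discount            # assumed value
        self.epsilon0 = epsilon0            # assumed initial exploration weight
        self.steps = 0                      # number of selections made so far
        self.q = defaultdict(float)         # (macro_state, action) -> q-value
        self.rng = random.Random(seed)

    def epsilon(self):
        # Exploration probability shrinks with the number of selections.
        return min(1.0, self.epsilon0 / max(1, self.steps))

    def select(self, macro_state):
        self.steps += 1
        if self.rng.random() < self.epsilon():
            return self.rng.choice(self.actions)  # explore
        # Exploit: action with the highest q-value in this macro state.
        return max(self.actions, key=lambda a: self.q[(macro_state, a)])

    def update(self, prev_state, action, reward, state):
        # Standard Q-Learning update towards reward + discounted best q-value.
        best_next = max(self.q[(state, a)] for a in self.actions)
        target = reward + self.discount * best_next
        key = (prev_state, action)
        self.q[key] += self.learning_rate * (target - self.q[key])
```

In the adaptive simulator, each action would correspond to one simulation algorithm configuration from A, and `update` would be called at every adaptation point with the reward computed from the base state trajectory.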

AGGREGATED STATE SPACES
Due to high-dimensional and real-valued state spaces, it is usually not feasible to learn a suitable selection policy for each state individually. Aggregation algorithms, e.g., [9,10,6,13,1], dynamically partition the state space of a reinforcement learning problem into disjoint macro states to deal with this problem. Typically, these algorithms start with a coarse-grained partitioning of the state space and refine it based on various conditions. For example, the parti-game algorithm [9] assumes that a) the goal is to reach a specific region of the state space, b) there exists a continuous path from the start position to the goal region, c) state transitions are deterministic, and d) the agent can move deliberately through the state space. Due to these requirements, this algorithm cannot be applied to the adaptive simulator. First, the goal of the adaptive simulator is not to reach a specific region of the state space, but to maximize the received rewards. Second, the adaptive simulator cannot move deliberately through the state space, because it cannot deliberately influence model variables that may be used as dimensions of the state space.
An extended and less restricted version of the parti-game algorithm is the decision boundary partitioning algorithm [10]. This algorithm splits two adjacent macro states ms_i and ms_j at the mid-point of their longest dimension if the following three conditions hold based on the current knowledge. First, the best actions of the two macro states differ. Second, in ms_i or in ms_j, the utilities of the best action of ms_i and the best action of ms_j differ by more than Δ_min ∈ R. Third, all actions of both macro states have been visited at least v_min times. The conditions guarantee that only reasonable splits of macro states are executed. Figure 1 (left) illustrates an exemplary state space partitioning that could have been built with the decision boundary partitioning algorithm. This algorithm can be used for the adaptive simulator, but there are various challenges. For example, if many observed states occur inside the same macro state but not inside its neighbors, this algorithm will not split this macro state although a split could be useful. In the worst case, no splits are executed at all because the initial macro states have been set poorly.
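The three split conditions can be read as the following check. This is a sketch of one possible reading, not the code of [10]: the function names, the dictionary-based data layout, and the interpretation of the second condition (comparing the utilities of the two best actions within each macro state separately) are assumptions made for this example.

```python
def should_split(q_i, q_j, visits_i, visits_j, delta_min, v_min):
    """Decide whether two adjacent macro states should be split.

    q_i, q_j: dict mapping each action to its q-value in ms_i / ms_j.
    visits_i, visits_j: dict mapping each action to its visit count.
    Both macro states are assumed to share the same action set.
    """
    best_i = max(q_i, key=q_i.get)
    best_j = max(q_j, key=q_j.get)
    # Condition 1: the best actions of the two macro states differ.
    if best_i == best_j:
        return False
    # Condition 2: in ms_i or in ms_j, the utilities of the two best
    # actions differ by more than delta_min.
    gap_i = q_i[best_i] - q_i[best_j]
    gap_j = q_j[best_j] - q_j[best_i]
    if max(gap_i, gap_j) <= delta_min:
        return False
    # Condition 3: every action was visited at least v_min times in both.
    return all(visits_i[a] >= v_min and visits_j[a] >= v_min for a in q_i)


def split_point(lower, upper):
    """Return (dimension, mid-point) of the longest side of a box-shaped
    macro state given by its lower and upper corner."""
    lengths = [u - l for l, u in zip(lower, upper)]
    dim = lengths.index(max(lengths))
    return dim, (lower[dim] + upper[dim]) / 2.0
```

The check also makes the weakness discussed above visible: it only compares a macro state with its neighbor, so a macro state that receives many observations while its neighbors receive none never satisfies the third condition.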
Another group of aggregation algorithms uses the idea of nearest neighbor vector quantization to create macro states, e.g., [6,13]. These algorithms maintain a codebook CB ⊆ S containing specific states that are called codewords. A nearest neighbor vector quantizer is used to map a state s ∈ S onto the nearest codeword c ∈ CB available in the current codebook. Basically, this mapping creates a partitioning of the state space into disjoint regions, i.e., the macro states (see Figure 1). The idea of these algorithms is to change the codebook frequently, so that all states mapped to a macro state show a similar q-value behavior. A promising algorithm of this group to be applied to the adaptive simulator is the adaptive vector quantization algorithm (AVQ), because a) it does not depend on the definition of a goal state, b) it allows complex shapes of macro states, and c) no initial partitioning for each dimension must be defined. Further, compared to other aggregation algorithms that use nearest neighbor vector quantization, the configuration effort of this algorithm is low. To decide whether new codewords shall be added to the codebook, this algorithm uses a concept based on the accumulated reward (accReward), that "with respect to a particular action is the sum of the total rewards received by continuously taking the same action within a particular cell" [6]. Consequently, this algorithm directly influences the action selection. This approach has to be replaced when applying this algorithm to the adaptive simulator, because it is possible that a poorly performing action is reused frequently.
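The mapping of a state onto its macro state can be sketched in a few lines. The function name `nearest_codeword` and the use of the Euclidean distance are assumptions made for this example; the source does not state which distance measure is used.

```python
import math


def nearest_codeword(state, codebook):
    """Return the index of the codeword closest to the given state.

    state: sequence of floats; codebook: non-empty list of such sequences.
    The Euclidean distance is an assumption made for this sketch.
    """
    if not codebook:
        raise ValueError("codebook must contain at least one codeword")
    return min(range(len(codebook)),
               key=lambda i: math.dist(state, codebook[i]))


# Two codewords induce two macro states separated by their bisector.
codebook = [(0.0, 0.0), (1.0, 1.0)]
macro_state = nearest_codeword((0.9, 0.6), codebook)  # index of (1.0, 1.0)
```

Because the regions are defined only by the codewords, adding or removing a codeword reshapes the neighboring macro states without any per-dimension grid definition.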

AVQ AND ADAPTIVE SIMULATOR
Algorithm 1 illustrates the integration of the AVQ algorithm into the adaptive simulator. As motivated above, we replaced the concept of the accumulated reward. Instead, we use a condition inspired by the decision boundary partitioning algorithm. A state s that is mapped to a codeword c is added to the codebook if the absolute difference between the current reward and the last reward achieved by the same action for any other state s′ mapped to c is higher than a threshold α ∈ R (see l. 16). Generally speaking, a state is added to the codebook if the rewards of an action differ significantly within its macro state. Further, we add a limit m ∈ N for the number of macro states. The reward of the adaptive simulator is the base-2 logarithm of the event throughput. We set α = log2(1.5) ≈ 0.585 by default, i.e., the throughput must differ by at least 50% for a codeword to be added to the codebook. We reused the merging routine of the AVQ (l. 35-41). Thus, at the end of each simulation run, i.e., after finishing the adaptation loop, the merging routine is executed. We set ρ = α² ≈ 0.342 by default, so that this parameter is linked to the condition to add codewords to the codebook.
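The condition for adding codewords can be sketched as follows. This is a simplified illustration under stated assumptions, not Algorithm 1 itself: the class name `ChangedAvqCodebook`, the Euclidean distance, and the bookkeeping of one last reward per (codeword, action) pair are choices made for this example, and the merging routine is omitted.

```python
import math

ALPHA = math.log2(1.5)  # about 0.585: a throughput ratio of at least 1.5
RHO = ALPHA ** 2        # about 0.342: default of the merging parameter


class ChangedAvqCodebook:
    """Hypothetical codebook that grows when rewards differ within a cell."""

    def __init__(self, first_state, alpha=ALPHA, max_macro_states=100):
        self.codewords = [tuple(first_state)]
        self.alpha = alpha
        self.max_macro_states = max_macro_states  # the limit m
        self.last_reward = {}  # (codeword index, action) -> last reward

    def nearest(self, state):
        return min(range(len(self.codewords)),
                   key=lambda i: math.dist(state, self.codewords[i]))

    def observe(self, state, action, reward):
        """Register a reward and return the macro state index of the state."""
        c = self.nearest(state)
        previous = self.last_reward.get((c, action))
        differs = previous is not None and abs(reward - previous) > self.alpha
        has_room = len(self.codewords) < self.max_macro_states
        if differs and has_room and tuple(state) not in self.codewords:
            # Rewards of the same action differ significantly in this
            # macro state, so the state becomes a new codeword.
            self.codewords.append(tuple(state))
            c = len(self.codewords) - 1
        self.last_reward[(c, action)] = reward
        return c
```

Since the reward is the base-2 logarithm of the throughput, a reward difference above α corresponds to a throughput ratio above 1.5, independent of the absolute throughput level.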
Algorithm 1: Pseudo-code for the adaptive simulator [3] extended by the changed AVQ. Q: q-value matrix indexed by state s ∈ S and action a ∈ A. N: matrix of counters for visited (s, a) tuples. s, s′ ∈ S: previous and current state. a ∈ A: action. r ∈ R: reward. R: Σ* → R: reward function. σ ∈ Σ: current base state. τ ∈ Σ*: current base state trajectory (sequence of base states).

We demonstrate the effectiveness and efficiency of our approach with a benchmark using a two-dimensional state space (x, y) ∈ […] other actions. If the state observed by the agent is between the sine and cosine curve, it receives reward 1 for choosing action a1, and reward 0 for the other actions. In the last case, i.e., if the state observed by the agent is below the sine and cosine curve, it receives reward 1 for choosing action a2, and reward 0 for the other actions. The actions chosen by the agent do not influence the next state of the environment; the next state is chosen randomly. Finally, 100 trials are executed with 1000 steps per trial, i.e., the merging process is executed every 1000 steps. Although simple, this setting reflects important aspects of the adaptive simulator scenario, e.g., the actions of the agent do not influence the observed environment states. Figure 2 illustrates the final partitioning of the state space using m ∈ {10, 100}. The success rates (the rate of correct decisions) are shown in Figure 3 (left). As expected, the more macro states are allowed, i.e., the higher m is chosen, the more accurate the state space partitioning becomes and the higher the success rate is. However, the success rate converges to a value of ≈ 0.87, i.e., beyond a certain point it is not worthwhile to increase the number of macro states, because most of the occurring states are already covered. Figure 3 (right) also shows the success rate of a static rectangular grid with different macro state sizes.
Here, the smaller the length and width of a macro state, the worse the success rate, because more redundant knowledge has to be learned.
To demonstrate that the results do not depend on the granularity of the state space dimensions, we repeat the same benchmark experiment with a scaled-down state space with scaled-down sine and cosine curves so that x ∈ [0.52, 0.54] and y ∈ [0.39, 0.41]. The results of this scaled-down benchmark are also shown in Figure 3 (crosses). Whereas the results using the changed AVQ are almost equal, the results using a static grid state space differ significantly, i.e., a granularity that is effective for one problem is usually unsuitable for another.
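The scale dependence of a static grid can be illustrated with a small sketch. The function name `grid_cell` and the two cell sizes 0.1 and 0.001 are hypothetical and are not the sizes used in the benchmark; only the ranges of x and y are taken from the text.

```python
import math


def grid_cell(state, cell_size):
    """Map a state to the index tuple of its cell in a static square grid."""
    return tuple(math.floor(v / cell_size) for v in state)


# Sample the scaled-down state space x in [0.52, 0.54], y in [0.39, 0.41].
samples = [(0.52 + i * 0.001, 0.39 + j * 0.001)
           for i in range(21) for j in range(21)]

# With a hypothetical cell size of 0.1, all samples fall into two cells,
# so the grid cannot separate the regions between the scaled-down curves.
coarse_cells = {grid_cell(s, 0.1) for s in samples}

# With a hypothetical cell size of 0.001, the same samples spread over
# hundreds of cells, each of which has to be learned separately.
fine_cells = {grid_cell(s, 0.001) for s in samples}
```

A codebook-based partitioning avoids this choice, because the positions of the codewords follow the observed states rather than a fixed cell size.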

ML-RULES SIMULATION
Besides the benchmark, we executed a Wnt/β-catenin pathway model [8] implemented in the modeling language ML-Rules [7] to evaluate the adaptive simulator using the changed AVQ. ML-Rules is used to model biochemical reaction networks. For ML-Rules, various simulation algorithms and components exist, e.g., to manage the species and reaction sets. Here, we selected 24 configurations to execute ML-Rules simulations available for the adaptive simulator, i.e., |A| = 24. Apart from the different state space partitioning configurations, we used the default configuration of the adaptive simulator. For the changed AVQ, we used m ∈ {10, 50, 100, 500}. Each simulation run of the Wnt/β-catenin pathway model was executed for 100,000 simulation events. We sequentially executed 100 replications of this simulation. The knowledge base and the state space partitioning of the adaptive simulator were reused over the replications. Consequently, since the merging process of the state space partitioning runs at the end of each simulation run, 100 merging processes were executed. After finishing 100 replications, the average runtime to execute one simulation run was calculated. We repeated this experiment 50 times to get a reliable distribution of the average execution times. Figure 4 illustrates these execution time distributions of the adaptive simulator with m ∈ {10, 50, 100, 500}. Additionally, the average execution time distributions of the 24 non-adaptive simulator configurations are shown. The adaptive simulator achieves almost the performance of the best non-adaptive simulator configuration with all values of m. For the Wnt/β-catenin pathway model, a small number of macro states is apparently sufficient to achieve good results. However, the results show that the performance of the adaptive simulator worsens only slightly with a higher number of macro states owing to a higher exploration effort. This emphasizes the robustness of dynamic state space partitioning algorithms.
Altogether, compared to a random choice of a configuration, the adaptive simulator is significantly more efficient.

CONCLUSION
Adaptive simulation algorithms like the adaptive simulator [3] change their configuration automatically during runtime, 1) to improve the overall performance of the simulation, and 2) to relieve the user of configuring the simulation algorithm. Using learning techniques for these algorithms makes them reusable; however, these techniques come with their own challenges that must be solved. In this paper, we dealt with the problem of dynamic state space partitioning [10,6]. We integrated the adaptive vector quantization algorithm [6] into the adaptive simulator. Our short evaluation has illustrated that applying a dynamic state space partitioning makes the adaptive simulator more efficient and reliable and further reduces its configuration effort.

ACKNOWLEDGMENTS
This research was supported by the German Research Foundation (DFG) via the research grant ESCeMMo (UH-66/14).