Master-Slave TLBO Algorithm for Constrained Global Optimization Problems

INTRODUCTION: The teaching-learning based optimization (TLBO) algorithm is a recently developed algorithm. This work presents the design of a master-slave TLBO algorithm. OBJECTIVES: This research aims to design a master-slave TLBO algorithm that improves performance and system utilization on the CEC2006 single-objective constrained benchmark functions. METHODS: The proposed approach is implemented using OpenMP and CUDA C, a hybrid programming approach that enhances the utilization of the system's computational resources. The device utilization and performance of the proposed approach are evaluated using the CEC2006 benchmark functions. RESULTS: The proposed approach obtains the best results in significantly reduced time for the CEC2006 benchmark functions. The maximum speed-up achieved is 30.14X. The average GPGPU utilization is 90%, and the average utilization of each logical processor is more than 90%. CONCLUSION: The master-slave TLBO algorithm significantly improves the utilization of computational resources and obtains the best results for the CEC2006 benchmark functions.


Introduction
Optimization is defined as "Finding an alternative with the most cost effective or highest achievable performance, by maximizing desired factors and minimizing undesired ones" [1]. Optimization problems may be discrete or continuous, and constrained or unconstrained; problems arising in engineering and other domains are of both kinds. Constrained optimization problems are optimized subject to certain restrictions, for example on resource availability or time. Unconstrained optimization problems are free from such restrictions and are optimized with respect to the design variables, their ranges, and the dimensionality. Both constrained and unconstrained problems may be single-objective or multi-objective [2][3][4][5].
In the literature, various classical methods have been used to solve constrained and unconstrained optimization problems, each with its own merits and demerits. Researchers have also developed nature-inspired approaches to solve complex engineering design and optimization problems [2]. When a new algorithm is proposed, its validity and efficiency need to be evaluated, so it is tested on standard benchmark functions. Testing on benchmark functions establishes the suitability of the algorithm for problems with specific properties. Benchmark functions are of different types: unimodal or multimodal, and single- or multidimensional. Real-world applications fall into these categories, so by applying a proposed algorithm to these benchmark functions, one can determine which kinds of real-world problems the algorithm suits [3][4][5].
Nature-inspired algorithms are problem-solving approaches inspired by phenomena in nature, developed by studying the behaviour of swarms, biological systems, physical and chemical systems, and so on. Popular nature-inspired algorithms include genetic algorithms, particle swarm optimization, the artificial bee colony algorithm, ant colony optimization, the intelligent water drop algorithm, the cuckoo search algorithm, and the teaching-learning based optimization algorithm. The literature reports applications of nature-inspired algorithms in many domains. Recent studies have used nature-inspired approaches for virtual machine placement in the cloud. Adhikari and Amgoth used the intelligent water drop algorithm for workflow scheduling in cloud data centres [6]. Abdessamia et al. developed an energy-efficient virtual machine placement approach using the binary gravitational search algorithm [7]. Jangiti et al. used a heuristic approach to perform virtual machine placement in heterogeneous cloud data centres [8]. Jangiti et al. also presented a bulk-bin-packing-based migration management approach to address the reserved virtual machine request problem in green cloud computing [9]. Tian et al. used a heuristic approach for scheduling virtual machine reservations in cloud data centres [10]. Basiri and Kabiri developed a machine learning-based approach for opinion mining, a subfield of data mining [11]. Sharma et al. presented genetic algorithm (GA) and ontology-based NLP frameworks for online opinion mining [12]. Sharma et al. presented a well-organised study on the use of nature-inspired techniques in the agriculture, finance, healthcare, education and engineering domains, together with publication trends from 2010 to 2019 in the selected domains [13]. Aljarah et al. developed a multi-verse optimization algorithm to solve data clustering problems [14].
Sharma and Kaur presented an analysis of nature-inspired meta-heuristic techniques employed for feature selection in stock prediction, intrusion detection, disease diagnosis, image processing, bioinformatics, agriculture, text mining, robotics, finance, and educational data mining [15]. Mafarja  . Annepu and Rajesh developed a differential evolution algorithm with a self-adaptive mutation factor and crossover probability to solve the node localization problem in wireless sensor networks [22]. Developments in multi- and many-objective nature-inspired optimization techniques and their applications are presented in [23].
Nature-inspired algorithms have algorithm-specific parameters, and their success largely depends on tuning these parameters efficiently; inefficient tuning degrades the solutions of optimization problems. The teaching-learning based optimization (TLBO) algorithm was developed to remove the dependence on algorithm-specific parameters [24]. It requires tuning only the common control parameters, namely the population size and the termination criterion. TLBO is popular and has been used by various researchers to solve complex engineering optimization problems. As processing speeds increase, computer architects are exploring new ways to increase throughput, and one of the most promising techniques is to exploit parallelism. When an application exploits parallelism, resources are used more efficiently and performance increases. The main advantage is that the CPU overhead is minimized. Some applications take a long time because they contain a large number of tasks; distributing these tasks evenly across the available resources improves performance, which is the basic motivation for parallelism. With the advancement of multi-core and many-core processors, making full use of such systems has become essential. As evolutionary algorithms are inherently parallel, parallel implementations of optimization algorithms can take advantage of this property [28-30]. Gong et al. presented a survey on the parallel implementation of evolutionary algorithms and categorised the parallelization strategies adopted to develop parallel versions of evolutionary algorithms; the future research directions they identify highlight the importance of parallel evolutionary algorithms [31].
GPGPU-based implementations of population- and swarm-based algorithms, applied to real-world problems from domains such as data mining, bioinformatics, drug discovery, crystallography, artificial chemistries, and Sudoku, are discussed in [32].
The proposed work presents the design and implementation of a master-slave TLBO algorithm to solve CEC2006 single-objective constrained optimization problems. The motivation behind this work is to reduce the execution time and to enhance device utilization and the performance of the algorithm. The novelty of this work is that the teacher phase and learner phase of the TLBO algorithm are executed on the GPGPU. In sequential program execution, generally only a single core (logical processor) of the multi-core CPU is utilized; in the proposed approach, the OpenMP programming model is used to enhance CPU utilization. Odd-even sorting is implemented to determine the best value from the obtained solutions. The master-slave strategy is adopted to develop the GPGPU-based TLBO algorithm, and the proposed algorithm is tested using thirteen single-objective constrained benchmark functions.
The main contributions of this paper are as follows. A master-slave TLBO algorithm is presented to improve CPU and GPGPU utilization. The proposed approach is implemented using OpenMP-CUDA C, a hybrid parallel programming approach. The proposed algorithm's performance is evaluated using the well-known CEC2006 single-objective constrained benchmark functions. Device utilization and speed-up are measured, and the quality of the results is assessed using statistical measures.
The remainder of the paper is organized as follows: Section 2 discusses related work, Section 3 presents the proposed methodology, Section 4 discusses the single-objective benchmark functions used in this work, Section 5 presents the obtained results and discussion, and Section 6 outlines the conclusion and future research directions.

Related work
This section reviews evolutionary algorithms developed on GPGPU to solve benchmark problems, variants of the TLBO algorithm, and parallel TLBO implementations.
Johannes Hofmann et al. [33] studied genetic algorithms on graphics cards. The authors used a profiler to determine which steps of the genetic algorithm (GA) can be executed on the graphics processing unit so that the algorithm works effectively in parallel. The algorithm is tested on the Weierstrass function. The first phase, random population generation, is performed on the GPU, taking into account the time required to generate random numbers on each card. In the second phase, crossover is performed, with each thread operating on two offspring. The third phase, mutation, is performed in parallel, with each thread operating on a single offspring. In [34], the authors implemented a genetic algorithm for three benchmark functions on GPGPU using CUDA. In [35], the GA is implemented on multi-core and many-core systems using master-slave, coarse-grained and fine-grained approaches; these architectures use thread-level parallelism to improve performance. Luca Mussi et al. [36] discussed possible approaches to parallelizing PSO on graphics hardware, focusing on minimizing data transfer and global memory use by employing one CUDA kernel per swarm. Jitendra Kumar et al. also implemented PSO on GPGPU, generating the initial population on the GPU to minimize the time required for copying data [37]. The Bees algorithm, the artificial bee colony algorithm and the multi-hive artificial bee colony algorithm have been implemented on GPU using CUDA to address benchmark optimization problems [38][39][40]. The ant colony optimization algorithm has been parallelized using CUDA on GPU [41][42].
In [43], an improved version of TLBO based on orthogonal design, with a new selection strategy to decrease the number of generations, is proposed. The class of learners is divided into several vectors, each of which acts as a factor in the orthogonal design. In [44], the TLBO algorithm is modified by adding the concept of a tutorial class; to introduce stochastic variation in the available solutions, a random scale factor is added to the learner solution, which helps maintain diversity and obtain better values on multimodal surfaces. Rao and Patel proposed an improved TLBO with multiple teachers, a self-adaptive teaching factor, and tutorial-based learning [45]. TLBO has been tested on continuous non-linear large-scale benchmark functions [46]. Zou et al. presented a survey of TLBO in which they discussed the working of the basic TLBO algorithm, surveyed its variants and applications, and analysed the algorithm [47].
Researchers have also developed parallel TLBO algorithms, including GPU implementations. Rico-Garcia et al. implemented TLBO on GPU and compared it with a GPU implementation of Jaya on unconstrained benchmark functions, also analysing the GPU utilization of each approach [48]. García-Monzó et al. developed shared-memory and message-passing parallel TLBO algorithms and evaluated them on thirty unconstrained benchmark functions [49]. Other parallel implementations of TLBO for unconstrained benchmark functions can be found in [2, 50].
The literature study shows that TLBO is free of algorithm-specific parameters and has been used to solve various optimization problems, and that several variants of TLBO exist. Evolutionary algorithms have been successfully implemented on GPGPU to solve standard benchmark problems. Only a few researchers have developed TLBO algorithms for multi-core or many-core systems, and parallel TLBO algorithms have mostly been tested on unconstrained optimization problems. These findings motivate the author of this paper to develop a master-slave TLBO algorithm to solve constrained benchmark optimization problems.

Methodology
This section presents the proposed master-slave TLBO algorithm with a flowchart.

Teaching-Learning based optimization algorithm
The Teaching-Learning Based Optimization (TLBO) algorithm was proposed by Rao et al. for solving various types of optimization problems and is inspired by the teaching-learning process in a classroom. Its special feature is that it requires no tuning of algorithm-specific parameters. The detailed working of the TLBO algorithm with a flowchart, demonstrations on manually solved constrained and unconstrained optimization problems, and its applications can be found in [26].
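In basic TLBO, the teacher phase moves each learner X as X_new = X + r (X_teacher - T_F M), where M is the class mean of each design variable, r is a uniform random number in [0, 1] and the teaching factor T_F is randomly 1 or 2; the new learner is kept only if it improves on the old one. A minimal sequential C sketch of this update for one learner follows. The function name and argument layout are hypothetical, random numbers are supplied by the caller (one per design variable, whereas some formulations draw one per iteration), and clamping to the variable bounds is an assumption rather than a detail stated in this paper.

```c
#include <stddef.h>

/* Teacher-phase candidate for one learner in basic TLBO (sequential sketch).
 *   x       : current learner, n design variables
 *   teacher : best learner of the class
 *   mean    : class mean of each design variable
 *   r       : uniform random numbers in [0, 1), one per design variable
 *   tf      : teaching factor, 1 or 2 in basic TLBO
 *   lb, ub  : bounds; clamping the candidate to them is an assumption
 *   out     : candidate learner (caller keeps it only if it is better) */
void teacher_update(size_t n, const double *x, const double *teacher,
                    const double *mean, const double *r, int tf,
                    const double *lb, const double *ub, double *out)
{
    for (size_t j = 0; j < n; ++j) {
        /* Difference mean: pulls the learner towards the teacher and
         * away from tf times the class mean. */
        double diff_mean = r[j] * (teacher[j] - tf * mean[j]);
        double v = x[j] + diff_mean;
        if (v < lb[j]) v = lb[j];
        if (v > ub[j]) v = ub[j];
        out[j] = v;
    }
}
```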

Proposed approach
This section describes the proposed master-slave TLBO algorithm with a flowchart and discusses its execution steps.

Master-slave model
In the literature, several parallelization models exist for implementing evolutionary algorithms in parallel. The most widely used are the master-slave, island, cellular, hierarchical, pool, coevolution and multi-agent models [51], and each model exploits a different level of parallelism. The master-slave approach uses operation-level parallelism, in which compute-intensive operations are executed on the slaves. The limitation of the master-slave model is communication cost, because the slaves communicate frequently with the master to exchange results and data [51]. The development of GPGPUs eases the implementation of master-slave evolutionary algorithms: the CPU acts as the master, whereas the GPGPU works as the slave.
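The master-slave division of work can be illustrated with a minimal C sketch in which the master owns the population and each slave evaluates one individual and returns only its fitness. Here OpenMP threads stand in for the slaves (in the proposed approach the slaves are GPGPU threads); the function names and the sphere objective are hypothetical and purely illustrative. Without `-fopenmp` the pragma is ignored and the loop runs sequentially with the same results.

```c
#include <stddef.h>

/* Objective function type: maps n design variables to a value. */
typedef double (*objective_fn)(const double *x, size_t n);

/* Master-slave evaluation: the master (calling thread) owns the
 * row-major population (np individuals x n variables); each slave
 * evaluates one individual and writes back only its fitness. */
void evaluate_population(const double *pop, size_t np, size_t n,
                         objective_fn f, double *fitness)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)np; ++i) {
        fitness[i] = f(pop + (size_t)i * n, n);
    }
}

/* Example objective (sphere function), used only for illustration. */
double sphere(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t j = 0; j < n; ++j) s += x[j] * x[j];
    return s;
}
```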

Master-slave TLBO algorithm
This section presents the master-slave TLBO algorithm.
The GPGPU-based parallel TLBO algorithm is developed on the master-slave model to solve constrained benchmark problems. The proposed approach designates the CPU as the master and the GPGPU as the slave. The compute-intensive, time-consuming steps are executed on the GPGPU: generation of the initial population, the teacher phase, the learner phase and fitness function evaluation. Only the step that compares the final solutions at the end of the algorithm and identifies the best value is performed on the CPU (master). This design mitigates the main limitation of the master-slave model, namely the time spent on frequent communication between master and slave. Figure 1 presents the flowchart of the proposed algorithm.
The algorithm's execution begins with memory allocation on the CPU and GPGPU. The required input is transferred to the GPGPU, and the initial population is generated on the GPGPU. The TLBO algorithm consists of two phases: the teacher phase executes first, followed by the learner phase, and both are executed on the GPGPU. In the teacher phase, the GPGPU computes the mean of each design variable and transfers the results to the CPU to identify the best individual (teacher). The odd-even sorting scheme is implemented on the CPU to find the best teacher; this step is developed using OpenMP, a multi-core programming approach, to utilize all cores of the multi-core CPU. The identified best value is transferred back to the GPGPU to update the solution, and the learner phase then updates the solution on the GPGPU based on the identified best teacher. If the final solution is better than the earlier one and/or the termination criterion is met, the algorithm exits. The initial population generation, teacher phase and learner phase are implemented using CUDA, the framework developed by Nvidia for programming Nvidia GPUs.
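The per-variable class mean computed in the teacher phase is an independent reduction for each design variable, which is why it maps well onto the GPGPU. A minimal C sketch follows, assuming a row-major population array (np learners by n variables) and using OpenMP threads in place of GPGPU threads; the name `class_mean` is hypothetical.

```c
#include <stddef.h>

/* Class mean of each design variable for a row-major population
 * (np learners x n variables). Each outer iteration is independent,
 * so one thread per design variable can compute it. */
void class_mean(const double *pop, size_t np, size_t n, double *mean)
{
    #pragma omp parallel for
    for (long j = 0; j < (long)n; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < np; ++i) s += pop[i * n + (size_t)j];
        mean[j] = s / (double)np;
    }
}
```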
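For minimization-type constrained single-objective benchmark functions, the teacher with zero or minimum constraint violation is preferred, and in case of a tie, the solution with the minimum objective value is selected. For maximization-type problems, the solution with the maximum objective value is selected. The constraint-violation rule is the same for both types of problems. The pseudo-code of the master-slave TLBO algorithm is presented below. Its learner-phase update steps (20 and 21), where j indexes the design variables, P and Q are two learners, i is the iteration, r_i is a uniform random number, and X'_{total-P,i} is the objective value of learner P after the teacher phase, are:

$$X''_{j,P,i} = X'_{j,P,i} + r_i \left( X'_{j,P,i} - X'_{j,Q,i} \right), \quad \text{if } X'_{\mathrm{total}-P,i} < X'_{\mathrm{total}-Q,i} \qquad (20)$$

$$X''_{j,P,i} = X'_{j,P,i} + r_i \left( X'_{j,Q,i} - X'_{j,P,i} \right), \quad \text{if } X'_{\mathrm{total}-Q,i} < X'_{\mathrm{total}-P,i} \qquad (21)$$

Finally, odd-even sorting is performed to find the best value.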
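The learner-phase update rules (steps 20 and 21 of the pseudo-code) can be sketched in C for one pair of learners P and Q as follows. This is a minimal sketch, assuming a minimization problem and a plain objective comparison; for the constrained problems used here, the comparison would follow the feasibility rule described for teacher selection. The function name, the per-variable random numbers, and the handling of ties (sent to the second rule) are assumptions, since the original rules leave equal objective values unspecified.

```c
#include <stddef.h>

/* Learner-phase candidate for learner P paired with learner Q
 * (minimization):
 *   if f(P) < f(Q): x''_P = x'_P + r (x'_P - x'_Q)   (move away from Q)
 *   otherwise     : x''_P = x'_P + r (x'_Q - x'_P)   (move towards Q)
 * fp, fq are objective values; r holds one uniform random number per
 * design variable. Ties use the second rule (an assumption). */
void learner_update(size_t n, const double *xp, const double *xq,
                    double fp, double fq, const double *r, double *out)
{
    for (size_t j = 0; j < n; ++j) {
        double step = (fp < fq) ? (xp[j] - xq[j]) : (xq[j] - xp[j]);
        out[j] = xp[j] + r[j] * step;
    }
}
```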
The proposed master-slave TLBO algorithm generates the initial solutions on the GPGPU, avoiding the communication time that transferring them from the CPU to the GPGPU would require. This step is implemented in parallel on the GPGPU using kernel-1. Objective function computation and evaluation is a compute-intensive task in any optimization problem: when the number of design variables and the population size are large, objective function evaluation dominates the computational time. Kernel-2 and kernel-3 execute the teacher phase: they compute the mean value of each design variable and the difference between the teacher and the class mean, and update the objective values of the individual learners. When the teacher phase ends, its results are used by the learner phase, which updates each learner using the objective values obtained in the teacher phase. Kernel-4 implements the learner phase: it selects pairs of learners and improves each learner according to the better of the two, which improves the class mean.
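Kernel-1 generates the initial population directly on the GPGPU, using x_ij = lb_j + r_ij (ub_j - lb_j) for learner i and design variable j. The paper does not state how random numbers are generated on the device; the minimal C sketch below assumes a stateless counter-based generator (SplitMix64, a swapped-in choice, not necessarily the authors') so that every entry can be computed independently, as one GPGPU thread per entry would. The function names are hypothetical, and OpenMP threads stand in for GPGPU threads.

```c
#include <stddef.h>
#include <stdint.h>

/* SplitMix64 mixing step: maps a 64-bit value to a well-scrambled one,
 * so each (seed, index) pair yields an independent random number. */
static uint64_t splitmix64(uint64_t z)
{
    z += 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Uniform double in [0, 1) from the top 53 bits of the mixed value. */
static double uniform01(uint64_t seed, uint64_t idx)
{
    return (double)(splitmix64(seed ^ splitmix64(idx)) >> 11)
           / 9007199254740992.0; /* 2^53 */
}

/* Random initial population (row-major, np x n):
 * x_ij = lb_j + r_ij * (ub_j - lb_j). Entry k depends only on (seed, k),
 * so all entries can be generated in parallel. */
void init_population(double *pop, size_t np, size_t n,
                     const double *lb, const double *ub, uint64_t seed)
{
    #pragma omp parallel for
    for (long k = 0; k < (long)(np * n); ++k) {
        size_t j = (size_t)k % n;
        pop[k] = lb[j] + uniform01(seed, (uint64_t)k) * (ub[j] - lb[j]);
    }
}
```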
The obtained objective function values are copied to the host (CPU), where odd-even sorting is implemented using OpenMP to obtain the best value among all the individuals. The algorithm exits when it reaches the predefined termination criterion.
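The master's odd-even sorting step, combined with the feasibility rule for minimization described for teacher selection (smaller constraint violation first, then smaller objective value), can be sketched in C with OpenMP as follows. Odd-even transposition sort runs n phases that alternate between even and odd neighbour pairs; the compare-and-swap operations within a phase touch disjoint pairs, so they can execute in parallel. The `candidate` struct and the function names are hypothetical.

```c
#include <stddef.h>

/* One candidate solution as seen by the master. */
typedef struct {
    double f;         /* objective value */
    double violation; /* total constraint violation, 0 if feasible */
    int    id;        /* index of the learner in the population */
} candidate;

/* Feasibility rule (minimization): smaller violation wins; on a tie,
 * the smaller objective value wins. */
static int better(const candidate *a, const candidate *b)
{
    if (a->violation != b->violation) return a->violation < b->violation;
    return a->f < b->f;
}

/* Odd-even transposition sort: phase p compares pairs (i, i+1) with
 * i = p % 2, p % 2 + 2, ...; pairs in one phase are disjoint, so the
 * inner loop can run in parallel. n phases suffice for n elements. */
void odd_even_sort(candidate *c, size_t n)
{
    for (size_t phase = 0; phase < n; ++phase) {
        long start = (long)(phase % 2);
        #pragma omp parallel for
        for (long i = start; i < (long)n - 1; i += 2) {
            if (better(&c[i + 1], &c[i])) {
                candidate tmp = c[i];
                c[i] = c[i + 1];
                c[i + 1] = tmp;
            }
        }
    }
}
```

After the call, `c[0]` holds the best candidate. Only strictly better elements are swapped, so candidates that compare equal keep their original order.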

Merits and demerits of master-slave TLBO algorithm
The proposed approach has both advantages and limitations, which are briefly explained here. The proposed approach improves the utilization of the system's computational resources while preserving the characteristics of the basic TLBO algorithm: only the common control parameters need to be tuned. The execution time is reduced substantially. The limitation of the proposed approach is that it requires more function evaluations. In addition, the time required to transfer data between the CPU and the GPGPU affects the execution time; this could be reduced by using advanced CUDA libraries. These merits and limitations were verified experimentally.
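Why TLBO needs many function evaluations can be made concrete with a short calculation. Assuming the usual counting for basic TLBO, each generation evaluates every learner twice (once after the teacher phase and once after the learner phase), plus one evaluation per learner for the initial population. With the population size of 256 used in the experiments, 1000 and 2000 generations would give 512,000 and 1,024,000 evaluations, matching the range reported in the conclusion; these generation counts are inferred, not stated in the paper. The function name below is hypothetical.

```c
/* Function-evaluation count of basic TLBO under the usual counting:
 * 2 evaluations per learner per generation (teacher and learner phase),
 * optionally plus one per learner for the initial population. */
long tlbo_function_evaluations(long pop_size, long generations,
                               int count_initial)
{
    return 2L * pop_size * generations + (count_initial ? pop_size : 0L);
}
```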

Constrained benchmark functions
Because the properties of optimization problems vary from one problem to another, testing the strengths and weaknesses of a new or modified optimization algorithm is difficult, and its performance also varies from problem to problem. Benchmark functions make it possible to test the expected performance of an optimization algorithm in practice and in an unbiased manner, and they help researchers understand the behaviour of the algorithm; they are widely used to test evolutionary algorithms. The IEEE Congress on Evolutionary Computation (CEC) announced a set of constrained benchmark optimization problems in 2006. One of the objectives of the proposed work is to test the performance of the proposed master-slave TLBO algorithm on single-objective constrained optimization problems. Hence, the proposed work uses the CEC2006 single-objective constrained benchmark functions, which are widely used in the literature to test newly developed single-objective optimization algorithms. The selected benchmark functions have linear, nonlinear, cubic, and polynomial objective functions, and the number of design variables varies from 2 to 20 depending on the function. The selected benchmark functions provide a common platform for evaluating optimization algorithms, and the CEC2006 problems have characteristics similar to real-world single-objective optimization problems. "Appendix A" presents the definitions of the selected CEC2006 constrained real-parameter optimization benchmark functions [52].
Table 1 presents the constrained benchmark functions with the number of design variables, the function type, the number and nature of constraints, and the known best solution. The known best solution is the best result reported for each benchmark function by any optimization algorithm to date; it helps researchers assess the quality of the solutions obtained by their own approaches.
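As an example of a CEC2006 problem definition, G6 minimizes f(x) = (x1 - 10)^3 + (x2 - 20)^3 subject to g1(x) = -(x1 - 5)^2 - (x2 - 5)^2 + 100 <= 0 and g2(x) = (x1 - 6)^2 + (x2 - 5)^2 - 82.81 <= 0, with 13 <= x1 <= 100 and 0 <= x2 <= 100; its known optimum is f(x*) ≈ -6961.81388 at x* ≈ (14.095, 0.84296). A minimal C sketch that evaluates G6 together with the total constraint violation, taken here as the sum of max(0, g_k) (a common measure, assumed for use with the feasibility rule), follows; the function name is hypothetical.

```c
/* Helpers kept local so the sketch does not need libm. */
static double sq(double v) { return v * v; }
static double positive_part(double v) { return v > 0.0 ? v : 0.0; }

/* CEC2006 problem G6: returns f(x) and writes the total violation
 * sum_k max(0, g_k(x)) of the constraints g_k(x) <= 0 to *violation.
 * Bounds (13 <= x1 <= 100, 0 <= x2 <= 100) are enforced elsewhere. */
double g06(const double x[2], double *violation)
{
    double g1 = -sq(x[0] - 5.0) - sq(x[1] - 5.0) + 100.0;
    double g2 =  sq(x[0] - 6.0) + sq(x[1] - 5.0) - 82.81;
    *violation = positive_part(g1) + positive_part(g2);
    return (x[0] - 10.0) * sq(x[0] - 10.0)
         + (x[1] - 20.0) * sq(x[1] - 20.0);
}
```

For instance, x = (13, 0) gives f = -7973, lower than the known optimum, but violates g1 by 11; under the feasibility rule it therefore ranks behind any feasible solution.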

Results and discussion
The master-slave TLBO algorithm is evaluated in terms of solution quality, speed-up and device utilization. The obtained solutions are evaluated using statistical measures such as the best, mean, and standard deviation (SD). The experimental setup, parameter settings, obtained results, and their analysis are presented in this section.
For the experiments, an Nvidia GeForce GTX 680 GPGPU is used, which has 8 streaming multiprocessors with 1536 CUDA cores, 2 GB of global memory with a speed of 6.0 Gbps, and compute capability 3.0. The host is an Intel i7 processor at 2.40 GHz with 8 GB RAM and 4 logical processors. The proposed approach is implemented on Ubuntu 16.04 using CUDA Toolkit 7.0 with the Thrust library and OpenMP 5.0. The G1 to G13 constrained benchmark functions from the CEC2006 set are used to evaluate the performance of the proposed master-slave TLBO algorithm.
The basic TLBO algorithm has no algorithm-specific parameters, and the proposed parallel version preserves this property: only the common control parameters are initialized at the beginning, as listed in Table 2. These execution parameters need to be selected carefully to obtain good performance, and the settings for the proposed master-slave TLBO algorithm were determined after extensive experimentation. Table 3 presents the results obtained by the master-slave TLBO algorithm for the CEC2006 single-objective constrained benchmark functions in terms of the best value, worst value, standard deviation (SD), and mean; execution times are reported in milliseconds. The proposed master-slave TLBO algorithm is executed 30 times for each function, and the best, mean and SD values are computed over these runs to measure the quality of the results. The SD measures the spread of the solutions around the mean: a low value indicates that the best value obtained in each run is close to the mean, while a high value indicates that the results are far from the mean. The mean indicates where the obtained best values are clustered. The master-slave TLBO obtains the known best results for the G1, G3, G4, G5, G6, G7, G8, G9, G10, and G11 benchmark functions, and results close to the known best values for G2, G12 and G13. The population size used is 256. The recorded parallel execution time ranges from 1.59 milliseconds to 49.034 milliseconds. The obtained results show that the master-slave TLBO achieves the same solution quality as the sequential TLBO, which makes it a promising approach for solving real-world single-objective constrained optimization problems from different domains. Table 4 shows the execution time taken by the sequential TLBO algorithm and the master-slave TLBO algorithm.
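The best, worst, mean and SD values over the 30 independent runs can be computed as in the following minimal C sketch. The struct and function names are hypothetical; the sketch uses the sample standard deviation (n - 1 denominator), an assumption since the paper does not state which form is used, and a small Newton iteration for the square root so that it does not depend on libm.

```c
#include <stddef.h>

typedef struct { double best, worst, mean, sd; } run_stats;

/* Square root by Newton's method, starting at or above the root so the
 * iterates decrease; stops as soon as they stop decreasing. v >= 0. */
static double newton_sqrt(double v)
{
    if (v <= 0.0) return 0.0;
    double x = v > 1.0 ? v : 1.0;
    for (int k = 0; k < 2100; ++k) {
        double next = 0.5 * (x + v / x);
        if (next >= x) break;
        x = next;
    }
    return x;
}

/* Summary of the final objective value of each run (minimization:
 * best = smallest). sd is the sample standard deviation. */
run_stats summarize_runs(const double *v, size_t runs)
{
    run_stats s = {0.0, 0.0, 0.0, 0.0};
    if (runs == 0) return s;
    s.best = s.worst = v[0];
    double sum = 0.0;
    for (size_t k = 0; k < runs; ++k) {
        if (v[k] < s.best)  s.best = v[k];
        if (v[k] > s.worst) s.worst = v[k];
        sum += v[k];
    }
    s.mean = sum / (double)runs;
    if (runs > 1) {
        double ss = 0.0;
        for (size_t k = 0; k < runs; ++k)
            ss += (v[k] - s.mean) * (v[k] - s.mean);
        s.sd = newton_sqrt(ss / (double)(runs - 1));
    }
    return s;
}
```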
Using Amdahl's law, the proposed approach is found to be 11.8X to 30.14X faster than the sequential TLBO algorithm. The lowest speed-up is obtained for G12 and the highest for G1. Both G1 and G12 have quadratic objective functions, but G1 has 9 linear inequality constraints while G12 has 1 non-linear inequality constraint. Compared with G12, G1 takes more time in sequential execution, so there is more work to parallelize.
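A measured speed-up is the ratio of sequential to parallel execution time (as in Table 4), while Amdahl's law relates such a ratio to the fraction f of the sequential run time that is parallelized: S(p) = 1 / ((1 - f) + f / p) for p processing elements. A minimal C sketch of both quantities, and of the law solved for f, follows; the function names are hypothetical.

```c
/* Measured speed-up: sequential time over parallel time (same units). */
double measured_speedup(double t_seq, double t_par)
{
    return t_seq / t_par;
}

/* Amdahl's law: predicted speed-up on p processing elements when a
 * fraction f (0 <= f <= 1) of the sequential run time is parallelized. */
double amdahl_speedup(double f, double p)
{
    return 1.0 / ((1.0 - f) + f / p);
}

/* Amdahl's law solved for f: the parallel fraction implied by a
 * measured speed-up s on p processing elements (p > 1). */
double amdahl_parallel_fraction(double s, double p)
{
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / p);
}
```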
G5 takes the least execution time in both versions. Speed-up and device utilization are two further important metrics for measuring the performance of a parallel implementation. Fig. 3 presents the execution time and device (GPGPU) utilization of the master-slave TLBO algorithm. GPGPU utilization is measured using Nvidia's profiler, CPU utilization is measured using Ubuntu's "System Monitor" application, and speed-up is computed using Amdahl's law. Execution time and device utilization vary with the characteristics of the selected benchmark functions. The average GPGPU utilization is 90%, ranging from 86% to 94%.
Device utilization is highest for G1 (94.83%) and lowest for G11 (86.57%). It is affected by the complexity and nature of the selected problem, the number of idle threads during execution, and divergence among threads. The execution time taken by the master-slave TLBO algorithm ranges from 1.590 milliseconds to 49.034 milliseconds: G5 requires the minimum time and G2 the maximum. The proposed master-slave TLBO algorithm is implemented using a hybrid programming model, viz. OpenMP-CUDA: the CUDA toolkit is used to implement the kernels (the functions executed on the GPGPU), and OpenMP is used to implement the code blocks executed on the CPU. One of the objectives of the proposed work is to improve system utilization. Because the teacher phase and learner phase are executed on the GPGPU, GPGPU utilization ranges from 86% to 94% (Fig. 3). OpenMP enables parallel execution on the multi-core CPU, so that all logical processors are utilized instead of a single one. Fig. 4 presents the average percentage utilization of each logical processor of the experimental system over a 60-second interval. Fig. 4 thus confirms that the master-slave TLBO algorithm utilizes all CPU cores and runs in parallel on both the CPU and the GPGPU. The master part of the proposed approach utilizes all the logical processors (CPU1 to CPU4), and the average utilization of each logical processor is more than 90% over the 60 seconds.

Conclusion and Future Work
This paper presents the design and implementation of a master-slave TLBO algorithm to solve CEC2006 single-objective constrained benchmark optimization problems. The master-slave TLBO algorithm is developed to improve the utilization of the computational resources of available systems. The best, worst, mean, and standard deviation are used as statistical measures of the efficiency of the proposed algorithm. The proposed master-slave TLBO algorithm gives the best results for the G1, G3 to G9, and G11 test functions, and the results obtained for G2, G10, G12 and G13 are close to the best-known results from the literature. The standard deviation confirms that the results obtained for each test function are close to the mean result of that function. Amdahl's law is used to compute the speed-up obtained by the master-slave TLBO algorithm: the proposed algorithm is 11.66X to 30.14X faster than the sequential TLBO algorithm. The parallel execution time varies from 1.80 milliseconds to 49.03 milliseconds depending on the characteristics of the selected benchmark functions. One of the motives behind the parallel implementation is to improve the utilization of the system's computational resources. The average GPGPU utilization is 90%, with a maximum of 94.83% for G1 and a minimum of 86.57% for G11. The average CPU utilization was also recorded and is more than 90% for each logical processor. The number of function evaluations required by the proposed approach ranges from 512,000 to 1,024,000; it is high because the TLBO algorithm has two phases (teacher phase and learner phase). From the obtained results, it is concluded that the master-slave TLBO algorithm is a promising approach for solving CEC2006 single-objective constrained optimization problems while improving system utilization.
In future, the master-slave TLBO algorithm can be applied to large-scale and complex constrained optimization problems, and the advanced features of recent CUDA toolkits and various CUDA libraries can be used in its implementation. Real-world single-objective constrained optimization problems can be solved using the proposed master-slave TLBO algorithm to test its efficiency and system utilization. Other parallelization strategies found in the literature can also be used to develop GPGPU-based parallel TLBO algorithms, and the system utilization of these strategies can be tested on other benchmark problems.