An Energy Efficient Reinforcement Learning Based Clustering Approach for Wireless Sensor Network

Clustering is known to conserve energy and enhance the network lifetime of a Wireless Sensor Network (WSN). Although energy efficiency has been well researched for conventional WSN clustering, learning-based selection of the cluster head has not been studied as extensively. In this research, a Reinforcement Learning (RL) based energy-aware clustering algorithm is proposed in which the neighboring nodes in a cluster select an appropriate Cluster Head (CH) by observing environmental conditions such as energy consumption and coverage, the latter computed as the distance from the CH to the Base Station (BS). Each neighboring node selects an optimal cluster, which minimizes energy consumption and extends network lifetime. The problem of selecting an optimal CH is resolved using the RL approach: the CH with the highest reward point is selected for data communication. The results show energy savings of 7.41%, 3.27%, 4.03%, and 2.79% for 100, 200, 300, and 400 deployed nodes, respectively.


Introduction
Advances in wireless communications and the use of electronic devices have led to the development of low-cost, low-energy, multi-functional Wireless Sensor Networks (WSNs) that can operate in complex environments and support a wide variety of applications [1].
A WSN is a group of autonomous, self-configured nodes used to collect environmental measurements such as temperature and humidity [2]. Using an analog-to-digital converter (ADC), the sensed data is converted into a digital signal and, after processing, transmitted to the main server known as the Base Station (BS). The BS analyses the data, and decisions are taken accordingly. In a WSN, nodes act as source as well as sink nodes [3]. The structure of the WSN is shown in Figure 1. At present, WSNs are used in a range of application fields. In WSN, clustering is one of the best ways to minimize energy consumption and hence prolong network life. The general clustering approach has two communication phases: (i) CH selection and (ii) multiple data communication phases, which in turn consist of intra-cluster and inter-cluster communication. The general scenario of data transmission using the clustering approach is shown in Figure 2. In the common LEACH protocol, a group of nodes (a cluster) is formed based on the distance from the CH. The size of the cluster increases as the distance of the CH from the BS increases, and vice versa.
LEACH is one of the basic routing approaches designed for clustering-based routing in WSN. The distribution of nodes in the clusters should be approximately homogeneous [8]. To equalize the distribution, a number of researchers have extended the LEACH protocol; popular enhancements include E-LEACH, N-LEACH, D-LEACH, and EE-LEACH. E-LEACH uses the residual power of each deployed node for the selection of the CH. A round-robin approach is used, and the round time remains fixed [9]. In N-LEACH and EE-LEACH, the CH is selected by considering the remaining energy of the sensor nodes. A spanning tree is constructed based on the distance between the CH and the BS, which avoids intermediate communication and thereby saves communication energy [10,12]. In D-LEACH, the energy of sensor nodes is balanced by adjusting the threshold function, which is set according to the radius of the nodes [11]. The use of Machine Learning (ML) approaches has also been reported in several recent research articles [13,30,31]. Building on that work, this paper improves ML-based Reinforcement Learning (RL) clustering by using a feedback-based mechanism.

LEACH
LEACH is a self-adaptive clustering algorithm that distributes the energy load equally over the deployed nodes.
LEACH works in two phases, namely CH count calculation and CH selection. Every node sends its data to the CH of its region, and the CH then communicates with its neighboring CHs to reach the destination. In normal approaches, a CH keeps its role as long as it has the highest residual energy, whereas in a rotation-based approach each node gets a chance to become CH for a certain time interval [13]. Every node transfers n bits of data to its corresponding CH using a transmission setup, as shown in Figure 3 [14]. Different authors use different energy consumption models, but there are some standard sets of parameter values, as shown in Table 1 [11,23,22,25].

Energy Consumption in WSNs Model
The energy consumption model is highly dependent on the distance to the BS. An increase in the distance between the transmitter and receiver results in attenuation of the transmitted power [15]. The threshold distance of the free-space model is given by equation (1).
Here, $d_{th}$ is the threshold distance of the space model, $Pl_{fs}$ is the power loss for the free-space model, and $Pl_{mp}$ is the power loss for the multipath model. The energy released during the transmission of n bits over a distance d can be calculated using equation (2).
Here, the subscript Tx denotes transmission, and the subscripts eltx and amp denote the transmitter electronics and the signal amplifier, respectively. The energy dissipated during reception of the signal is given by equation (3). Let the WSN be deployed in a two-dimensional area of $A^2$ with N deployed nodes, grouped into P clusters. Let the distance from the cluster members to the CH be d, and the distance from the CH to the BS be D. The energy consumed by a cluster to forward a single data frame is then calculated using equation (4).
Here, $N/P - 1$ is the average number of cluster members in a given cluster, $E_{CH}$ is the energy dissipated by the CH, and $E_{Cm}$ is the energy dissipated by a cluster member. The energy dissipated by the CH and by a cluster member is given by equations (5) and (6), respectively. Based on the above discussion, the total energy consumed by the WSN nodes during communication under the clustering mechanism is given by equation (7). The power consumed by the microcontroller cannot be optimized. Hence, the only way to optimize power is to optimize the clustering process, which is based on distance [16,17].
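The energy model described above can be illustrated with a minimal sketch of the first-order radio model that is commonly used in the LEACH literature. This is an assumption about the general form of the model, not a reproduction of the paper's equations (1) to (7): the parameter values, the aggregation term `E_DA`, and all function names are hypothetical and are not taken from Table 1.

```python
import math

# Assumed radio parameters: typical values from the LEACH literature,
# NOT taken from this paper's Table 1.
E_ELEC = 50e-9       # J/bit, electronics energy (transmit or receive)
PL_FS = 10e-12       # J/bit/m^2, free-space amplifier loss
PL_MP = 0.0013e-12   # J/bit/m^4, multipath amplifier loss
E_DA = 5e-9          # J/bit, data aggregation at the CH (assumed extra term)

# Threshold distance at which both loss models give the same energy.
D_TH = math.sqrt(PL_FS / PL_MP)


def tx_energy(n_bits, d):
    """Energy to transmit n_bits over distance d."""
    if d < D_TH:
        return n_bits * (E_ELEC + PL_FS * d ** 2)   # free-space regime
    return n_bits * (E_ELEC + PL_MP * d ** 4)       # multipath regime


def rx_energy(n_bits):
    """Energy to receive n_bits."""
    return n_bits * E_ELEC


def cluster_frame_energy(n_bits, n_nodes, n_clusters, d, big_d):
    """Energy one cluster spends on a frame: E_CH + (N/P - 1) * E_Cm."""
    members = n_nodes / n_clusters - 1
    e_cm = tx_energy(n_bits, d)                       # member -> CH
    e_ch = (members * rx_energy(n_bits)               # CH receives from members
            + (members + 1) * n_bits * E_DA           # CH aggregates the frame
            + tx_energy(n_bits, big_d))               # CH -> BS
    return e_ch + members * e_cm


def network_frame_energy(n_bits, n_nodes, n_clusters, d, big_d):
    """Total energy over all P clusters for one frame."""
    return n_clusters * cluster_frame_energy(n_bits, n_nodes, n_clusters, d, big_d)
```

With these assumed constants the threshold distance is roughly 88 m, so member-to-CH links inside a 400×400 cluster would mostly fall in the free-space regime, while long CH-to-BS links would fall in the multipath regime.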
In each round of LEACH, a sensor node decides whether or not to become a CH during cluster formation. The decision depends on the percentage of CHs required in the network and the number of times the sensor node has already been a CH. During this process, the sensor node selects a random number in [0,1]. The node becomes a CH if the number is less than the threshold defined for the current round.
Here, n is a random number that lies between 0 and 1, $P_d$ is the required percentage of CHs, the present round enters the threshold through its modulus term, and G is the set of sensor nodes that were not selected as CHs in the previous round.
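The election rule described above can be sketched as follows. The threshold expression follows the form quoted later in the text, $P_d / (1 - P_d\,(r \bmod 1/P_d))$, with the round written as `round_no` to keep it apart from the random draw; the function names and the way membership in G is passed in are assumptions made for illustration.

```python
import random


def leach_threshold(p_d, round_no, in_g):
    """LEACH threshold for one node in the given round.

    p_d      -- required fraction of CHs (e.g. 0.1 for 10 %)
    round_no -- index of the present round, starting at 0
    in_g     -- True if the node belongs to the eligible set G
    """
    if not in_g:
        return 0.0
    period = round(1 / p_d)                  # rounds per rotation cycle
    return p_d / (1 - p_d * (round_no % period))


def elect_cluster_heads(node_ids, eligible, p_d, round_no, rng):
    """Return the nodes whose random draw falls below the threshold."""
    heads = []
    for node in node_ids:
        threshold = leach_threshold(p_d, round_no, node in eligible)
        if rng.random() < threshold:         # draw in [0, 1)
            heads.append(node)
    return heads


rng = random.Random(42)
heads = elect_cluster_heads(range(100), set(range(100)), 0.1, 0, rng)
```

The threshold grows from $P_d$ at the start of a cycle to about 1 in its last round, so every node that is still in G is elected before the cycle restarts.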

The motivation for Wireless Sensor Networks
Recent advancements in engineering, digital communications, and networking have led to new wireless system designs with high-tech sensors. Such sensors can serve as a bridge between the physical world and the digital world. Sensor nodes are deployed in various fields, mainly in remote areas that are beyond human reach. These sensors sense the external environment and help to protect people from infrastructure failures and accidents, conserve natural resources, protect wildlife, and increase productivity and safety. Using emerging semiconductor technology, one can design microprocessors that are more powerful and more compact than those of preceding generations. This miniaturization of processing, computing, and sensing technologies has led to the creation of small, low-power, and inexpensive sensors, controllers, and actuators. The rest of the paper is organized as follows: related work is provided in section 2. In section 3, the proposed work is described in detail along with the designed algorithm. In section 4, the experimental analysis is discussed in terms of performance parameters. In section 5, the conclusion is presented, followed by the references.

Related work
Clustering is an effective way to reduce the energy consumed by sensor nodes. A number of researchers have proposed different clustering techniques and contributed to prolonging network life [18]. This section discusses their work and is divided into two parts: the state of the art related to (i) LEACH and (ii) RL-based clustering.

Clustering using LEACH based Approach
The most popular clustering protocol used in WSN is the Low-Energy Adaptive Clustering Hierarchy (LEACH) protocol, which extends the life of a network by partitioning the entire network area [19]. However, the LEACH protocol has several disadvantages. Random selection of the CH and the use of single-hop communication result in the early death of nodes that are far from the BS, owing to unbalanced network load. In addition, the protocol does not take residual energy into account (each node has an equal right to act as a CH), which results in an unequal distribution of CHs [20,21]. The scenario of LEACH-based clustering in WSN is shown in Figure 4. The authors of [22] have presented a cluster chain routing protocol (CCRP) that provides balanced and improved CH selection for data transmission. The number of nearby nodes, along with their residual energy, is taken into consideration for CH selection. A chain is formed among all the CHs, and the network lifetime increases because the energy level is distributed equally. Here, the remaining energy is multiplied by the defined threshold, so that the formula written in equation (8) is modified as presented in equation (9).
In short, CH selection is performed by counting the number of nearby nodes within the radio range. Every CH has $1/P_d - 1$ member nodes. The results show that the higher the neighbor node count, the lower the energy consumption rate [22]. Xiaoping et al. (2010) have used the multi-hop concept along with LEACH as a routing protocol to prolong the network life. The present node energy is used together with multi-hop communication. The node's energy is compared with the defined threshold energy level, and the node is selected as CH only if its energy level is higher. For the next round, the same node again becomes a CH if $n < P_d / (1 - P_d\,(n \bmod 1/P_d))$, where n is the randomly generated number [23]. Peng and Li (2010) have presented a Variable Round-LEACH (VR-LEACH) protocol based on the selection of the CH. The CH is selected on the basis of the condition of the sensor nodes. All nodes whose energy level is higher than the defined average energy take part in data communication [24]. Tong and Tang (2010) have presented the Balanced LEACH (B-LEACH) protocol, which resolves the problem of fluctuation in CHs and considers the remaining energy of sensor nodes for CH selection. Based on the member count in the cluster, every round is divided into frames, which are further subdivided into time slots [25]. Yu et al. (2012) have utilized cluster-based routing protocols with the "Energy Aware Clustering approach" to solve the problem of non-uniform energy consumption of distributed nodes. Using this approach, the loss of energy among CHs is minimized by combining inter- and intra-cluster communication. However, sensor nodes with low energy limit the network life, and the nodes with higher energy are responsible for network exhaustion [26]. Nehra and Sharma (2013) have used an enhanced "Power-efficient gathering in sensor information systems" (PEGASIS-E) protocol to prolong network life.
The chain for transmitting data from the source node to the BS is formed based on the average distance measured between the sensor nodes. Although network life is enhanced through reduced energy consumption, chain formation requires information about the location of nodes, which results in poor network scalability and a large delay in data transmission [27]. In 2015, Amgoth et al. presented an "Energy-Aware Routing Algorithm" for WSN. Using this technique, the nodes in the network are divided into multiple groups. Each sensor node begins to select a CH after a delay based on its remaining energy. The technique can extend network life but does not address fault tolerance or the WSN's active scenario [28].
Al-Humidi et al. (2019) have resolved the problem of random CH selection through a newly developed cluster-based approach known as EACCC. It works on the basis of the centralized control cluster (3C) algorithm. Initially, each sensor node sends information about its remaining energy along with its position to the BS. On the basis of the received information, the BS decides to which portion of the network each node belongs. EACCC runs in rounds; in every round the CH is selected, and data transmission takes place after the selection of the CH [29].

Reinforcement Learning in WSN
Over the last couple of years, machine learning techniques have gained a lot of attention. Reinforcement Learning (RL) is a subfield of machine learning, which seeks to use computer programs to create rules from large data sets. Using the RL approach [30], an agent performs an action/rule and, in return, receives a reward point from the outside environment. Based on the reward point, the rules are modified so that optimal results can be achieved. Owing to this convenient feature, RL can readily resolve distributed problems [31] (see Figure 5). Therefore, a few researchers have used RL algorithms to resolve the routing-related problems faced by WSN. This section describes the state of the art in the use of the RL approach in WSN [32].

Figure 5. Block Diagram of RL approach
Boyan and Littman (1994) have presented a Q-learning-based optimized routing approach. The RL technique provides better PDR with reduced data transmission time. A Q value is assigned to each neighboring node. Here, the Q value is calculated from the total time taken by the source node to deliver data to the BS through nearby nodes. Test results showed that the approach is able to deliver data in large networks without prior knowledge of traffic patterns [33]. Wang and Wang (2006) have utilized the least-squares policy iteration approach to learn the routing policy of WSN. Using this approach, the Q value along with the π score is calculated, and based on these parameters, the remaining energy, hop count, and aggregation ratio are decided [34]. The authors of [35] have presented an energy-aware routing protocol based on the RL approach for underwater WSN. The RL technique is used to learn about the situation effectively so that data can be transmitted in the dynamic topology of underwater WSN. Based on the environmental conditions, the reward point is generated from the remaining energy and its distribution among the set of nodes, which is used for balancing energy consumption [35]. Razzaque et al. (2014) have used the RL approach to modify the routing information. Here, the RL approach learns about the environment using information related to the delay and reliability of the nodes and then generates a reward point [36]. Kiani et al. (2015) have used an RL-based routing mechanism to enhance the life span of WSN. Here, the RL approach is utilized for the selection of the CH. The Q point generated by the RL approach is obtained by multiplying the node's Q value and its action's Q value [37]. The work of Soni [38] addresses problems in WSN using the RL approach. Initially, the RLBCA algorithm is designed to manage energy consumption. In addition, ODMST is designed to aggregate data from the CH as required. A record of incoming requests is stored in a table, and the next step is performed according to the recorded data [38].

Materials and Methods
The proposed work is divided into two parts: clustering and route formation using the RL-based approach.

Figure 6. Flow of Proposed Work
Clustering is the process of dividing the entire network into different zones known as clusters.
Each cluster is monitored by a CH node, which collects data and forwards it to the BS. In a large network area, the selection of the CH becomes an important issue: as the network size increases, the distance between the CH and the BS increases, which decreases node lifetime and, when a CH fails, increases the network overhead. Route formation is carried out through the RL algorithm using the neural network concept. The overall flow is shown in Figure 6. An RL-based energy-efficient LEACH routing protocol (RL-LEACH) is designed to select an appropriate CH based on the returned Q point.

Clustering
Initially, N nodes are deployed in a network of variable size. Five different combinations of node count and network size are considered, as presented in Table 2. The nodes are deployed with defined properties such as energy, delay, coordinates, and collision rate by a dedicated node deployment algorithm.
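As an illustration only, a random deployment with the properties listed above might look like the following minimal sketch. It is not the paper's deployment algorithm: the `Node` fields, the uniform placement, and every value range (initial energy, delay, collision rate) are assumptions chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    node_id: int
    x: float               # coordinate in metres
    y: float               # coordinate in metres
    energy: float          # residual energy in joules
    delay: float           # per-hop transmission delay in seconds
    collision_rate: float  # fraction of transmissions that collide


def deploy_nodes(n_nodes, side, rng, initial_energy=2.0):
    """Place n_nodes uniformly at random in a side x side square area."""
    nodes = []
    for i in range(n_nodes):
        nodes.append(Node(
            node_id=i,
            x=rng.uniform(0.0, side),
            y=rng.uniform(0.0, side),
            energy=initial_energy,                 # assumed equal start energy
            delay=rng.uniform(0.001, 0.010),       # assumed range
            collision_rate=rng.uniform(0.0, 0.05)  # assumed range
        ))
    return nodes


rng = random.Random(1)                 # fixed seed for a repeatable topology
nodes = deploy_nodes(100, 1600.0, rng)
```

Seeding the generator makes a topology reproducible, which helps when the same layout has to be evaluated with and without RL.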

• Sub Division of Network
The entire network is sub-divided into a number of clusters, and the node having the highest energy acts as the CH. For example, if a network of dimension 1600×1600 is partitioned into clusters of size 400×400, the network is divided into 16 clusters, each with a unique ID, as presented in Table 3.
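The partitioning step can be sketched as a grid lookup followed by a per-cluster maximum over residual energy. The row-major numbering of cluster IDs is an assumption (the layout of Table 3 is not reproduced here), and the function names and the tuple layout of a node are hypothetical.

```python
def cluster_id(x, y, side=1600.0, cell=400.0):
    """Map a coordinate to a cluster ID in 1..(side/cell)^2, row-major."""
    per_row = int(side // cell)
    col = min(int(x // cell), per_row - 1)   # clamp points on the far edge
    row = min(int(y // cell), per_row - 1)
    return row * per_row + col + 1


def select_heads(nodes, side=1600.0, cell=400.0):
    """Pick the highest-energy node of each cluster as its CH.

    nodes -- iterable of (node_id, x, y, energy) tuples
    Returns a dict {cluster_id: node_id}.
    """
    best = {}
    for node_id, x, y, energy in nodes:
        cid = cluster_id(x, y, side, cell)
        if cid not in best or energy > best[cid][1]:
            best[cid] = (node_id, energy)
    return {cid: node_id for cid, (node_id, _) in best.items()}


# Three example nodes: two fall in cluster 1, one in cluster 2.
heads = select_heads([(1, 10, 10, 0.5), (2, 20, 20, 0.9), (3, 500, 10, 0.3)])
```

For the 1600×1600 area with 400×400 cells this yields IDs 1 to 16, matching the cluster count given in the example above.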

Table 3. Cluster Identification for 1600×1600 Network Area
The CHs that have been formed in the network area are determined using equation (8), in which $I_A$ is the attraction index, calculated using equation (9). Using $T_{CH}$ and $I_A$, the selection of Q points for RL learning is decided.

RL Learning
RL collects information by continuously interacting with the surrounding environment and improves system performance by performing the necessary actions and learning from their outcomes. Q-learning is a category of RL scheme that creates a sequence of observations in terms of reward points. The visualization of the RL approach is illustrated in Figure 7.
The RL model presented in Figure 7 can be embedded inside each sensor node or on the outer surface of the sensor node. For example, each sensor node holds a reward point associated with each adjacent sensor node and each network point in the surrounding environment. In WSN, its main use is to learn about routes and hence provide the best route for data transmission. State sensing indicates the action performed by the destination node. The action indicates the next communicating node to which data is to be forwarded. The reward point reflects the distance between the CH and the destination node: a higher reward point corresponds to a shorter distance from the CH to the destination node and hence better network performance [39].

Figure 7. RL Learning Approach
An RL-based routing protocol is used to optimize the lifetime of the WSN. Using this technique, the next CH is selected based on historical learning and the available evaluated information; node properties such as residual energy, transmission distance from the CH to the BS, and delay are taken into consideration to learn the best routes. In this way, the sensor nodes are linked to the BS with balanced energy consumption.
The Q-learning approach is used within RL: the agent monitors the WSN environment in state (U). Accordingly, an action (C) is taken and a reward point (P) is generated. Based on experience, the strategy is modified. The action with the highest reward point is selected for a given state. If the transition probability is unknown, next-node selection using the simple RL approach is not possible. Therefore, the Q-learning approach is used, which calculates an accurate reward point and hence decides the best path [40]. Let the state of each node be represented by Q(u,c), which is updated after every iteration using equation (10):

$$Q(u_t, c_t) = (1 - \alpha)\,Q(u_t, c_t) + \alpha\left[p_{t+1} + \beta \max_{\forall c_{t+1}} Q(u_{t+1}, c_{t+1})\right] \qquad (10)$$

Here, $Q(u_t, c_t)$ is the accumulated reward generated by the node in state u for action c taken at time t; $p_{t+1}$ is the immediate reward point generated for action c when the state transition takes place from state $U_t$ to $U_{t+1}$; $\forall c_{t+1}$ ranges over the actions in the action set $U_{t+1}(C)$; and $\alpha$ is the learning factor.

The reward points of the current CHs are stored as follows:
Reward = [] // To store the current CH reward points
For I in range of S(CAREA)
Reward(I, 1) = ECONSUPTION(S(CAREA, I))
Reward (
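A single application of the update in equation (10) can be sketched as below. The table layout (a dictionary keyed by state and action), the default values of `alpha` and `beta`, and the reading of β as a discount factor on the best next-state value are assumptions made for illustration; the state and action labels are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.5, beta=0.9):
    """Apply one Q-learning update and return the new Q(state, action).

    q            -- mapping (state, action) -> accumulated reward
    reward       -- immediate reward p_{t+1} for taking `action` in `state`
    next_actions -- actions available in `next_state`
    alpha        -- learning factor
    beta         -- weight on the best next-state value (assumed discount)
    """
    # Best accumulated reward reachable from the next state (0 if terminal).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    q[(state, action)] = ((1 - alpha) * q[(state, action)]
                          + alpha * (reward + beta * best_next))
    return q[(state, action)]


q = defaultdict(float)            # unseen (state, action) pairs start at 0
q[("CH5", "CH12")] = 2.0          # hypothetical value learned earlier
new_value = q_update(q, "S", "CH5", 1.0, "CH5", ["CH7", "CH12"])
```

Here the new entry for forwarding from S to CH5 becomes 0.5 × 0 + 0.5 × (1.0 + 0.9 × 2.0) = 1.4, so good onward options from CH5 raise the value of choosing CH5 in the first place.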

Proposed Reinforcement learning-based LEACH protocol (RL-LEACH)
Using the RL-LEACH approach, the selection of the CH in WSN is performed by considering three factors: energy consumption, coverage area, and distance of the CH from the BS. Initially, the energy level of each node is compared with the defined threshold; if the energy is higher, the node's communication range is checked, and then the distance is checked. Based on these QoS parameters, reward points are generated. The node having the highest reward point is selected as CH. Figure 8 describes the working of the RL-LEACH routing protocol, which selects the CH from the nodes based on their basic properties: energy consumption, coverage area, and distance from the sink node. In the figure, 10 nodes are selected as CH (CH1, CH2……CH5). When the source node wants to transmit data packets, CH1 is selected as the next intermediate node; after that, RL decides which CH nodes are selected as the subsequent intermediate nodes.
The selection of the CH is performed according to the RL-LEACH algorithm, whose quantities are, in order, the number of nodes, the residual energy of the nodes, and the initial energy assigned to the nodes during the initialization of the WSN.
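The three checks described above (energy threshold, communication range, distance to the BS) can be sketched as a filter followed by a reward comparison. The reward used here, an unweighted sum of three normalized terms, is purely an assumption for illustration and is not the paper's reward function; the dictionary keys and parameter names are hypothetical.

```python
import math


def select_cluster_head(sender, candidates, bs, energy_threshold,
                        comm_range, max_bs_dist):
    """Return (candidate_id, reward) of the best CH, or (None, -inf).

    sender     -- (x, y) of the node choosing a CH
    candidates -- list of dicts with keys id, x, y, residual, initial
    bs         -- (x, y) of the base station
    """
    best_id, best_reward = None, -math.inf
    for c in candidates:
        # Step 1: skip nodes whose residual energy is not above the threshold.
        if c["residual"] <= energy_threshold:
            continue
        # Step 2: skip nodes outside the sender's communication range.
        d_link = math.dist(sender, (c["x"], c["y"]))
        if d_link > comm_range:
            continue
        # Step 3: distance from the candidate to the BS.
        d_bs = math.dist((c["x"], c["y"]), bs)
        # Assumed reward: more energy, closer link, closer BS are all better.
        reward = (c["residual"] / c["initial"]
                  + (1 - d_link / comm_range)
                  + (1 - min(d_bs / max_bs_dist, 1.0)))
        if reward > best_reward:
            best_id, best_reward = c["id"], reward
    return best_id, best_reward


candidates = [
    {"id": "A", "x": 100, "y": 0,   "residual": 0.3, "initial": 2.0},
    {"id": "B", "x": 900, "y": 0,   "residual": 1.8, "initial": 2.0},
    {"id": "C", "x": 300, "y": 0,   "residual": 1.0, "initial": 2.0},
    {"id": "D", "x": 300, "y": 300, "residual": 1.6, "initial": 2.0},
]
chosen, reward = select_cluster_head((0, 0), candidates, (800, 800),
                                     energy_threshold=0.5, comm_range=500,
                                     max_bs_dist=2000)
```

In this example A is rejected for low energy and B for being out of range; D wins over C because its higher residual energy and shorter distance to the BS outweigh its longer link to the sender.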

Results and Discussions
This section describes the efficiency of the proposed RL-LEACH algorithm in terms of throughput, Packet Delivery Ratio (PDR), and energy consumption.
The proposed algorithm selects an optimal CH using the RL approach and enhances the lifetime by reducing energy consumption. In the simulation, we considered nodes (N=50, 100, 150, 200) deployed in different areas (A=1000×1000, 1200×1200, 1400×1400, and 1600×1600 square meters). Initially, node parameters such as the energy consumed by each node, transmission delay, collision rate, and the coordinates of each node are taken into consideration. The performance is examined using simple LEACH and the RL-based Q-learning approach with the same parameter values for the nodes. The detailed simulation parameters are presented in Table 4. The protocols are tested under different, randomly generated topologies. The performance parameters are discussed later. The Q-based learning is implemented in MATLAB. The designed network with N nodes, for an area of 1600×1600 divided into 16 clusters {ID1, ID2, ID3, …, ID16}, is shown in Figure 9; the identification of each cluster is given in Table 3.
As shown in Figure 10, the source "S" is defined in cluster ID1 and wants to transmit data to destination "D", located in cluster ID15. The route is created based on the directional orientation of the destination node (D); that is, data from the source node (S) follows a path in the direction of its destination node, as shown in Figure 10. Source node S has four nearby clusters in its coverage range, namely ID2, ID3, ID4, and ID5; the candidate CHs are CH1, CH2, CH3, CH4, and CH5. The CH to which S forwards data depends on the reward point generated by the RL approach. For CH1, CH2, CH3, CH4, and CH5, the generated reward points are given in Table 6. According to these, the reward point of CH5 is higher than that of the other four CHs. Hence, S forwards data to CH5. CH5 in turn checks its nearby clusters, which are CH7, CH9, CH10, and CH12. The RL approach is applied to generate a reward point based on energy consumption, coverage area, and distance. Since the reward point for CH12 is the highest among all, CH5 forwards data to CH12. The same process is repeated by CH12, which finds that CH15 has the highest reward point and passes data to CH15, which is the destination node. In this way, data travels from the S node to the D node. During Q-based RL learning, each cluster member node passes data to its CH, which is within its communication range as mentioned in Table 4. The data is passed to a single CH instead of checking each and every cluster. In this way, an optimal cluster is selected, which consumes less energy and is also near to the BS.
[Figure 10: network layout showing clusters ID1 to ID16, source S, and destination D]
The simulation involves both the member nodes and the CH nodes, and the energy consumed by both is examined for different cluster sizes in the network. The network energy consumption for the 16 clusters formed in Figure 10 is presented in Table 5. As shown in Figure 11, for a fixed number of nodes deployed in each cluster (ID1, ID2, …, ID16), the consumed energy is high for both small and large cluster sizes. For a cluster size of one in a network area of 1600×1600, the consumed energy is 512 J, whereas for a large cluster size of 16, the consumed energy is 442 J. This is because, in a network with a small cluster size, the cluster member nodes have to communicate with nodes placed at a long distance and hence consume high energy for intra-cluster communication.
EAI Endorsed Transactions Scalable Information Systems 04 2021 -06 2021 | Volume 8 | Issue 31 | e6

Figure 11. Network Energy Consumption Vs. Cluster Size
On the other hand, with a large cluster size (see cluster size=14), the energy consumption is low. In this case, however, as the cluster size increases, communication between CHs increases, which consumes more energy than communication among cluster members. Therefore, it is necessary to select an appropriate cluster size so that energy consumption is minimized. Here, this is done by selecting an appropriate CH and hence balancing the energy consumption of inter- and intra-cluster communication in WSN. The graph in Figure 11 shows that the optimal cluster size, where the energy consumption is minimum, is 8.
To obtain an optimal solution, we applied Q-learning and examined the performance of the proposed algorithm in terms of learning and adaptation to a dynamic environment. Q-learning is used for CH selection because of its quick convergence over a short span. The performance of the RL approach is determined by computing the reward point for the different member nodes in the CH based on the energy consumed and the distance of the CH to the BS.
Route = [S CH1 CH5 CH12 CH15 D]
Here, S is the source node and D is the destination node. To create a route from S to D, the RL approach is used for better path selection by avoiding affected nodes as intermediate nodes in the route. During transmission, S transfers the data packets to its own cluster head (CH1), and CH1 then selects CH5 as the next node based on the reward points obtained with the RL technique. The selection of CH5 as the next intermediate node is shown in Figure 10. The corresponding reward points based on energy consumption, coverage area, and distance are listed in Table 6. In the second scenario, the route from S to D is formed using the RL approach with CH5 as the current cluster head. Let clusters CH7, CH9, CH10, and CH12 lie in the coverage area of CH5. The CH to which CH5 transmits the data depends on the reward point generated by the RL approach, as listed in Table 7 and shown graphically in Figure 13. The reward point of CH12 for all three considered parameters (energy consumption, coverage area, and distance) is high compared to the reward points of CH7, CH9, and CH10. Therefore, CH5 forwards data to CH12. CH12 likewise checks the reward points of its nearby clusters and passes data only to the CH having the highest reward point. Let CH8, CH14, CH16, and CH15 lie in the coverage range of the CH12 cluster. The reward points generated from the three parameters (energy consumption, coverage area, and distance) are then checked.
From Figure 13, it is clearly seen that the reward point for CH15 is the highest; CH15 is the destination in our example. Thus, CH12 passes data to CH15 as the destination node. Using the above strategy, the data from the source CH1 reaches the destination CH15.
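The hop-by-hop choice described in this walk-through amounts to a greedy next-hop selection over reward points. The sketch below reproduces the route S, CH1, CH5, CH12, CH15, D from the text; the neighbor lists follow the clusters named above, but every numeric reward is a placeholder invented for the example (the values of Tables 6 to 8 are not reproduced here), and the function and variable names are hypothetical.

```python
def build_route(source, destination, own_head, dest_head, neighbors, reward):
    """Greedily follow the neighbor CH with the highest reward point."""
    route = [source, own_head]
    visited = {own_head}
    current = own_head
    while current != dest_head:
        options = [ch for ch in neighbors.get(current, []) if ch not in visited]
        if not options:
            raise ValueError(f"no unvisited neighbor CH from {current}")
        here = current
        current = max(options, key=lambda ch: reward[(here, ch)])
        visited.add(current)
        route.append(current)
    route.append(destination)
    return route


# Neighbor CHs as named in the text.
NEIGHBORS = {
    "CH1": ["CH2", "CH3", "CH4", "CH5"],
    "CH5": ["CH7", "CH9", "CH10", "CH12"],
    "CH12": ["CH8", "CH14", "CH16", "CH15"],
}

# Placeholder reward points (NOT the paper's table values).
REWARD = {
    ("CH1", "CH2"): 0.41, ("CH1", "CH3"): 0.38,
    ("CH1", "CH4"): 0.52, ("CH1", "CH5"): 0.77,
    ("CH5", "CH7"): 0.35, ("CH5", "CH9"): 0.48,
    ("CH5", "CH10"): 0.55, ("CH5", "CH12"): 0.81,
    ("CH12", "CH8"): 0.30, ("CH12", "CH14"): 0.58,
    ("CH12", "CH16"): 0.62, ("CH12", "CH15"): 0.90,
}

route = build_route("S", "D", "CH1", "CH15", NEIGHBORS, REWARD)
```

A purely greedy walk like this one can reach a dead end in less favorable topologies, which is why the sketch tracks visited CHs and raises an error rather than looping.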

Simulation Results
After route formation is complete, data transmission takes place, and QoS parameters are calculated for the designed model both with and without RL. The computed values of network lifetime with respect to the number of nodes for three scenarios, (i) without RL, (ii) with RL, and (iii) Kiani et al., are presented in Table 9. The results obtained for N=200 deployed nodes with different simulation areas are depicted in

Comparative Analysis
This section presents the comparative analysis for N=100, 200, 300, and 400 nodes under three scenarios: without RL, with RL, and Kiani et al. The comparison is performed on the basis of three parameters: network lifetime, PDR, and energy consumption. The computed values of network lifetime for different numbers of deployed nodes and various network sizes are depicted in Figure 15. From the figure, it is clearly seen that the network life increases as the number of nodes increases. The comparative graph consists of three types of bar, in blue, orange, and grey, depicting the network life without RL, with RL, and for Kiani et al., respectively. Among all, the proposed approach provides the best network life. As the number of nodes (N) increases, the rate of received data packets increases while the data packet transmission rate decreases. This is due to node properties such as hop count, node failure, and node density.
The values of the second parameter, the Packet Delivery Ratio (PDR), are examined with respect to the number of nodes.
Figure 16. Packet Delivery Rate
Figure 16 compares the PDR for three scenarios: CH selection without RL, with RL, and using the Kiani et al. approach. The graph illustrates the PDR of the designed WSN when the three techniques work with 100, 200, 300, and 400 nodes. The proposed approach performs well compared to existing work, and the maximum PDR is obtained when N=400 nodes are deployed. An increase in the packet delivery rate increases network reliability as well as fault tolerance. If the communicating node is found dead, the proposed algorithm selects the previous node and passes its data by treating that node as the CH. The simulation shows that PDR increases with the number of nodes. The percentage increase in PDR obtained using RL compared to the existing Kiani et al. approach for 100, 200, 300, and 400 nodes is 11.25%, 13.92%, 15%, and 14.81%, respectively. The energy consumed by the three approaches with different numbers of nodes is illustrated in Figure 17, along with the values presented in Table 15. The graph shows the improvement of our RL-based cluster selection approach over the without-RL approach and Kiani et al. The energy consumed by the cluster members is determined on the basis of the Mean Square Error (MSE). An efficient network is one designed with the minimum MSE value observed for a particular cluster. From the graph in Figure 17, it is clearly seen that the proposed approach using the RL algorithm has a lower MSE than the without-RL approach and hence consumes less energy. The percentage reduction in energy consumption relative to the without-RL approach for 100, 200, 300, and 400 nodes is 7.41%, 3.27%, 4.03%, and 2.79%, respectively.
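The percentage figures quoted above are relative changes against a baseline. A minimal sketch of how such figures are conventionally computed is given below; the definitions are the usual ones and are assumed rather than confirmed by the text, and the input numbers are arbitrary examples, not values from the paper's tables.

```python
def packet_delivery_ratio(received, sent):
    """Fraction of sent packets that reached the destination."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return received / sent


def percent_increase(baseline, proposed):
    """Relative gain of `proposed` over `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline


def percent_reduction(baseline, proposed):
    """Relative saving of `proposed` against `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Arbitrary example inputs (not taken from the paper).
pdr_gain = percent_increase(packet_delivery_ratio(800, 1000),
                            packet_delivery_ratio(900, 1000))
energy_saving = percent_reduction(200.0, 150.0)
```

Stating the baseline explicitly matters here, because the PDR gain is measured against Kiani et al. while the energy saving is measured against the without-RL approach.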
In short, the main purpose of the designed protocol is to reduce energy consumption and to increase network lifetime and packet delivery rate. All parameters meet the requirements and hence fulfil the need to design a balanced WSN in a dynamic environment.

Conclusion
Similar to other existing networks, WSN is developed for specific applications such as military, rescue operations, agriculture, and many more, each of which requires different features according to its needs. Depending on the network scenario, each requires new communication protocols. In addition, network design factors must be taken into account to achieve the expected performance. Among all factors, energy is the most important parameter in WSN and must be controlled by an appropriate algorithm. In this research, we focused on saving energy by selecting the appropriate CH using the RL approach, and the designed algorithm provides better network life with reduced energy consumption. Initially, the optimal number of clusters in a network is evaluated by analysing the network energy consumption of both inter- and intra-cluster communication. The process of selecting an appropriate CH using the RL approach based on the reward point is also discussed. In the simulation results, the proposed algorithm performed better in terms of network lifetime, packet delivery ratio, and energy consumption. The maximum