Adoption of the activation function fusion approach to identify human activity recognition in a semi-supervised neural network

INTRODUCTION: Neural networks are a popular type of algorithm for human activity monitoring which can build intelligent systems from labelled data in an automated fashion. Obtaining accurately labelled data is costly; it requires time and effort, which can be cumbersome because it interrupts the user activity stream. In conjunction with the ubiquitous presence of embedded technology, neural networks present new research opportunities for human activity monitoring in smart home environments. OBJECTIVES: We propose a human activity classification method that requires a limited amount of labelled data, which consists of a concatenation method for classifying human activities built upon the fusion of neural network activation functions. METHODS: Our methodology builds a neural network model that receives the sensor data through the input layer and then distributes it among different vertical hidden layers, which implement different activation functions simultaneously. Next, a hidden layer combines the activation functions by utilising a concatenation method. Finally, the neural network provides classes to the unlabelled sensing data. We conducted an evaluation utilising an open-access dataset. We compared the activity recognition accuracy of our approach utilising 25%, 50%, and 75% of labelled data against a conventional shallow neural network trained with 100% of the labelled data available. RESULTS: Results show an improvement in the accuracy of the activity classification regardless of the portion of labelled data available. The highest accuracy achieved when using 25% of activation function fusion data outperformed the results obtained when using 100% of labelled data in a conventional shallow network (i.e., an increase in accuracy of 2.7%, 3.7%, 4.8%, and 0.9% across the activity recognition of four subjects).
CONCLUSION: The approach proposed showed an improvement in the accuracy of classifying human activity when a limited amount of labelled data is available.


Introduction
In recent years we have experienced the widespread use of mobile devices such as smartphones and smartwatches. Similarly, the interest in research on embedded sensors in daily life objects (e.g., kitchen appliances) is growing, which enables opportunities in the field of computer engineering to make users' interactions richer through applications that can build on this variety of sensing data. This ubiquitous presence of sensing technology opens the opportunity to assist users in areas such as smart homes [27], rehabilitation [25], health monitoring [33], and activity recognition [23].
Activity recognition is the problem of identifying human actions given a collection of sensed data. It takes advantage of off-the-shelf technologies available within the user's environment (e.g., smartphones, smartwatches) to enable the collection of data from daily living scenarios. In this context, machine learning can help in the automatic recognition of daily living tasks due to its capacity to handle differences between sensor readings and features in the domain, and neural networks have been shown to be an effective machine learning approach in such scenarios [23].

Neural networks
A neural network (NN) is an interconnected assembly of simple processing elements whose functionality is inspired by the human neuron. The processing ability of the network relies on the inter-unit connection strengths obtained through a process of adaptation, learning from a set of training patterns, modelled after the human brain to recognise patterns in a given set of data [24,26].
In order to recognise patterns, a NN learns to approximate a relationship between inputs and outputs (assuming they relate by correlation or causation) [26]. In the process of learning and classification, a NN can rely on two alternatives: i) experience, in which case a successful outcome depends upon a labelled dataset; alternatively, ii) characterising data and finding similarities among them, in which case there is no labelled data available and the NN is meant to learn from clustering, segregation, or association rules. These approaches are known as supervised learning [29] and unsupervised learning [8], respectively.
Supervised learning is an approach that requires a large amount of labelled data for training purposes [3,4], the generation of which is expensive and time-consuming. Unlike supervised learning, semi-supervised learning methods maximise the potential of the portion of labelled data available, aiming to leverage the limited labelled data to achieve higher generalisability across the classification classes.
The problem with activity recognition methods requiring a large amount of labelled data is the limited annotation available in day-to-day scenarios, where human activities are unexpected, and the impact that this lack of labelled data can have on activity recognition models. Unlike supervised learning, the unsupervised approach does not depend on labelled data to cope with its task. However, considering that there is no a priori knowledge, unsupervised approaches can lead to lower accuracy.
Overall, achieving high performance in activity recognition with limited labelled data is a challenging topic that has attracted researchers' attention in recent years [16]. The problem of human activity recognition motivates different approaches, such as feature learning adjustments and weight extrapolation, for which NNs have proven to be effective. For example, Ding et al. proposed a method for efficient NN training using weight extrapolation within the same feature space, demonstrating the effectiveness of feature treatment to achieve correct class classification [6]. Other approaches include handcrafted annotation under the concept of active learning, where a trained model can iteratively be retrained upon the user's feedback annotations as new data emerge [2]. Stikic et al. proposed a graph-based approach, which connects labelled and unlabelled data and builds multiple graphs to propagate the labels based on the similarity between features [31]. Although feasible, these approaches either interrupt the user's day-to-day activities or rely on a single sensory modality. In the context of human activity recognition, disrupting the user's activities can affect the correlation between labelled and unlabelled data, whereas a single sensor modality limits the acquisition of the rich data available nowadays from the ubiquitous technology present in activities of daily living.

Data fusion
In this context, data fusion consists of the integration of multi-sensor data, exploiting the natural synergy brought by multiple sources to achieve inferences that would not be feasible from a single sensor [20]. Some recent research has successfully adopted data fusion as a technique to recognise human activities. For example, G. Abebe and A. Cavallaro described a study in which they concatenated feature information from different technology domains (i.e., inertial sensing and first-person video images) by implementing a classification model over a NN [1]. Similarly, under the concept of transfer learning, N. Hernandez et al. showed the benefit of data fusion by selecting and fusing features from across two different classifier models, with activity recognition accuracy higher than that achieved by the models individually [11]. Z. Wang et al. applied kernel fusion to extreme learning machine techniques, utilising Gaussian kernels [32]. Other research in sensor fusion has shown positive results when analysing mobile and ambient sensing technology for activity recognition. These studies achieved high classification accuracy by fusing different data features, sensor types, or algorithm kernels, which enabled them to maximise the information contained across multi-modal sensor streams.
In this paper, therefore, we focus on the benefits of data fusion as an elegant approach to design a form of feature derived from activation function fusion when only small portions of labelled data are available. Data fusion is a technique that has been explored to improve the recognition of activities in machine learning. Our interest in NNs stems from the computational benefit of deep learning, which implements multiple layers to progressively extract higher-level features. We elaborate on and present the results of our methodology, which proves to be effective when utilising two hidden layers, one of which consists of three vertical layers; this shows the benefit of fusing activation functions and opens the opportunity for developing more complex NNs.
To the best of our knowledge, activation function fusion in NNs is a topic that has not been explored before; in Section 2, we discuss the most closely related work. Section 3 presents our approach and methodology, in which we give a detailed description of the dataset and experimental setting. We conclude the paper with results, future work, and a brief discussion in Sections 4 and 5, respectively.

State of the art
In general, a NN consists of three types of layer: input, hidden, and output. In contrast to a traditional NN, which builds upon a single hidden layer, modern NN structures (e.g., recurrent and feedforward networks) are organised with two or more hidden layers. This approach benefits from the computational power derived from implementing multiple layers to extract higher-level features progressively [9]. Layers are constructed out of nodes, which is where the computation happens. Nodes connect in such a way that each layer's output represents the subsequent layer's input. A node combines input from the data with a set of coefficients calculated based on a given activation function, and fires, or is activated, when it encounters sufficient stimuli.

Activation function
Activation functions are a biological inspiration from activity in the human brain, where different stimuli activate different neurons. In this context, an activation function is an algorithm that shapes the behaviour of a NN by assigning significance to inputs with respect to what the algorithm is trying to learn; hence, it will either amplify or dampen input data, aiming to uncover patterns within the given data [10].
To illustrate the activation function's functionality, in Figure 1 we show a neuron, which receives data as a set of numerical inputs x1, x2, ..., xn that are then combined with a set of weights w1, w2, ..., wn and a bias element in order to produce a single numerical value y, computed by the activation function α as y = α(w1x1 + w2x2 + ... + wnxn + b). In general, the activation function determines the type of function that the NN represents. Some of the most common activation functions are the linear, hyperbolic tangent, logistic, and rectified linear [21]. Given that each activation function processes its input based on particular mathematical properties, some will activate a neuron when their value exceeds a certain threshold, while others will remain inactive. Hence, performing an arithmetical calculation (e.g., adding or subtracting) on the outcomes of different activation functions can lead to the cancellation of their properties. In this regard, data fusion offers the opportunity to associate data without altering their properties.
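The neuron computation above can be sketched in a few lines; the inputs, weights, and bias below are hypothetical values chosen for illustration, and ReLU stands in for the generic activation α:

```python
import numpy as np

def relu(z):
    # Rectified linear activation: passes positive stimuli, zeroes the rest.
    return np.maximum(0.0, z)

def neuron(x, w, b, activation):
    # Basic unit of Figure 1: y = activation(w1*x1 + ... + wn*xn + b).
    return activation(np.dot(w, x) + b)

# Hypothetical inputs, weights, and bias for illustration only.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.1, 0.4])
b = -0.2
y = neuron(x, w, b, relu)  # 0.8*0.5 + 0.1*(-1.0) + 0.4*2.0 - 0.2 = 0.9
```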

Multimodal fusion
Data fusion can be defined as a multiple-level or multifaceted process to detect, associate, correlate, estimate, and combine data and information from several sources [19]. The integration of multi-sensor data exploits the natural synergy brought by multiple sources, allowing for useful information gain and achieving inferences that would not be feasible from a single sensor. In this context, therefore, a large volume of heterogeneous sources (e.g., differing technology capacity or device position) might leverage data analysis and enhance decision-making [20].
There are different architectures for data fusion, which fall into three general categories: direct fusion (i.e., competitive type), representation via feature vectors (i.e., complementary type), and processing of each sensor to achieve high-level inference (i.e., cooperative type). Depending on the data conditions, a particular mechanism should drive the data towards one or another approach. For instance, redundant data can be treated at the raw level involving classic estimation methods, while complementary data can operate at the feature level [13]. Due to the intrinsic nature of the methodology proposed in this paper, we focus on data fusion using two architecture categories. We utilise cooperative fusion, in which the sensing data provided by the input sources represents different parts of the scene (i.e., wearables and kitchen appliances), to achieve a piece of complete global information. We also utilise complementary fusion by adopting different NN activation functions, in which the information provided represents different perspectives of the same scene and is combined to achieve an outcome with properties which are more global than those achieved by a single view. Building the appropriate models for multimodal fusion (i.e., models where the information consists of various sensing technologies) is not trivial; the complexity lies in the technical differences between sensor data.
In recent years, data fusion techniques have been shown to be effective in problems aligned with activity recognition. For example, in previous work, the authors explored the feasibility of recognising walking and standing activities from acceleration and thermal image sensing by fusing the features from both sensing technologies; results showed an improvement in classification of about 10% compared to conventional ensembles [11]. Song et al. investigated and demonstrated an alternative method to create dense embeddings for data using kernel similarities and adopting NN architectures [30]. Oswaldo Ludwig et al. proposed a classifier-fusion schema using learning algorithms, in which they utilised feature extractors and classifier combinations to achieve higher activity recognition accuracy [12]. In related work, Friday et al. compared the performance of single- and multi-sensor fusion for human activity recognition using accelerometer and gyroscope sensors, considering seven classification algorithms. The evaluation results show the significant impact of multi-sensor fusion for recognising human activities, demonstrating the feasibility of data fusion [22].
Another related data fusion approach proposes combining different kernels to enhance the discriminative power of convolutional NN classifiers [14]. In this context, the idea behind kernel fusion relies on utilising multiple kernels, instead of feature selection, to create a discriminative kernel matrix. Kernel fusion has attracted significant attention in the research community; many approaches have been studied [17] and verified as effective in areas such as Extreme Learning Machines [5] and Multiple Kernel Learning [15]. For example, Liu et al. studied Optimal Neighbourhood Kernel Learning, which treats a pre-specified kernel as a "noisy" observation of an optimal kernel and learns the optimal kernel within the neighbourhood by building a constraint within a parametrised model [18]. Wang et al. investigated the benefit of kernel fusion for NNs; they presented a practical method consisting of building a particular form of feature-level fusion derived from combining two or more kernels [32].
In this regard, different fusion techniques have been explored at different abstraction levels, such as data, features, and kernels in convolutional NNs; however, activation function fusion remains unexplored. Given the relevance of the activation function in NNs, in this paper we focus on investigating the benefit of fusing activation functions as an elegant approach to exploit inferences that would not be feasible from a single activation function.

Overview of methodology
As illustrated in Figure 2, the proposed method focuses on scenarios in which only a portion of labelled data is available. Collected sensor data is first prepared by synthesising the motion-sensing data and adopting an imputation method to address missing data (Figure 2-a). The NN receives the sensor data through the input layer and then distributes it among the different hidden layers (Figure 2-b), which implement different activation functions (Figure 2-c). A subsequent hidden layer fuses the previous activation functions' outputs by utilising a concatenation method (Figure 2-d). Finally, the NN provides the respective classes to the unlabelled sensing data (Figure 2-e).
Unlike conventional NNs, our proposed methodology implements a collection of hidden layers growing vertically (Figure 2-c). The motivation for this approach is that we can extract the pattern signature from different activation functions. As shown in Figure 2-c and detailed in line 6 of Algorithm 1, the flexibility of this approach lies in the ability to use an unlimited number of functions. As presented in the pseudocode of Algorithm 1, we first define the collection of activation functions to fuse (i.e., α) and feed the inputLayer with the semi-labelled dataset (i.e., X L). We then create a sequence of hiddenLayers (i.e., lines 4-6) utilising the previously defined activation functions (i.e., α[]); note that the characteristics of the hiddenLayer are flexible, hence the collection of activation functions can be extended as needed. Given that each activation function consists of a particular mathematical property that models the received data, our methodology is designed under the architectural schema of cooperative and complementary data fusion. The cooperative approach combines the sensing data from two different technology domains, such as wearable devices and kitchen appliances, with the benefit of producing a more comprehensive view. On the other hand, the complementary approach enables the representation of different parts of a scene by adopting different activation functions as part of the NN structure, thus obtaining global information to achieve high-level inference. Overall, our methodology's contribution relies upon the benefit obtained at the NN's activation function level, where the network receives a set of inputs which are then processed in parallel by a defined set of activation functions.
In Figure 3, we observe how a single entry provided through the Input layer is replicated and simultaneously distributed across n vertical Hidden layers, each implementing a different activation function. The activation function fusion process happens at the second hidden layer, combining the outputs of the n individual activation functions. The Output layer then analyses the new set of values.
To formally define our methodology, with reference to Figure 3, let α1 and α2 be two valid activation functions and X L the semi-labelled dataset; then α3, expressed as follows, is also a valid activation function:

α3(X L) = α1(X L) || α2(X L)

The activation function fusion consists of the concatenation of all outputs of the form yα1 || yα2, where yα1 is an outcome from α1, yα2 is an outcome from α2, and the yα are outcomes of the classification for the activity labels provided in X L. Hence:

y⊕ = yα1 || yα2

Assuming that there is no restriction between activation functions, this method can assemble a large number of paired activation functions (e.g., tanh & softmax, elu & softplus, relu & linear). In this paper, we have empirically explored the impact of activation function tuples by permuting nine different functions (i.e., tanh, softmax, elu, softplus, softsign, relu, sigmoid, hard_sigmoid, and linear).
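The definition above can be sketched directly: each activation function is applied to the same pre-activation values, and the outputs are concatenated rather than arithmetically combined, so neither function's properties cancel. The values of z below are illustrative assumptions:

```python
import numpy as np

def tanh(z):
    return np.tanh(z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse(activations, z):
    # Activation function fusion: alpha3(z) = alpha1(z) || alpha2(z).
    # Concatenation preserves each function's outcome unaltered.
    return np.concatenate([a(z) for a in activations])

z = np.array([-1.0, 0.0, 1.0])        # pre-activation values of one layer
y_fused = fuse([tanh, sigmoid], z)    # length 6: y_tanh followed by y_sigmoid
```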
Next, we present the evaluation of our approach on a publicly available human behaviour dataset, where we explore the research question: to what extent does activation function fusion benefit transfer learning activity recognition models?

Dataset description
OPPORTUNITY is an open-access collection of sensing data gathered in a realistic environment (i.e., a kitchen). It is built upon ubiquitous sensing technologies (i.e., motion data) available in wearable and mobile devices to showcase the capabilities of motion sensing in activity recognition tasks. In this context, we designed this study to utilise motion technology, given that it is widely available in daily-use devices such as watches and mobile phones, which is of interest to the research community.
We utilised the OPPORTUNITY activity recognition dataset [28], in which data was collected from four subjects performing scripted ADL (Activities of Daily Living) in an adapted home environment that simulates a kitchen. Our particular interest in this dataset stems from the rich overlap of sensing data from heterogeneous sensing devices (i.e., wearable, kitchen appliance, and environmental devices), in which acceleration, inertial measurement unit (IMU), and binary sensors were considered. Each participant completed five repetitions of 17 different activities, such as grooming, relaxing, making/drinking coffee from a cup, preparing/eating a sandwich, and executing a simple-movement drill. The average length of each activity ranged from 2 to 5 seconds (i.e., 107±44 data points) with a sampling frequency of 30 Hz. Overall, the dataset consists of approximately 6 hours of recorded data. We included data from ambient sensors (e.g., dishwasher doors), kitchen appliances (e.g., cutlery), and on-body sensors, which captured the subject's performance from different views simultaneously.

Data preparation
Given the interest of this study, we have focused on motion sensing from tri-axial sensors, as we study the benefits of ADL recognition in a wearable system [28].
The acceleration signal is synthesised by extracting the output voltage, which is mapped from the ax, ay, and az axes, as they are orthogonal to the decomposition of the actions performed. Because the magnitude of the acceleration carries no directional information, it is orientation independent [5].
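As a minimal sketch of this step, the orientation-independent magnitude of a tri-axial sample can be computed as follows; the sample values are hypothetical:

```python
import numpy as np

def acceleration_magnitude(ax, ay, az):
    # Orientation-independent magnitude of a tri-axial acceleration sample:
    # sqrt(ax^2 + ay^2 + az^2) discards the directional information.
    return np.sqrt(ax ** 2 + ay ** 2 + az ** 2)

# A hypothetical (3, 4, 0) sample: the magnitude is 5.0 however the sensor is oriented.
m = acceleration_magnitude(np.array([3.0]), np.array([4.0]), np.array([0.0]))
```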
An imputation method based on the previous sample was implemented to handle missing data in the dataset. Feature data was segmented in windows of 500 ms with 50% overlap. The selection of features was based on previous studies that have proven them to be effective for modelling a NN to classify ADL. Features such as location are not included, since the activities performed were conducted in a single environment (i.e., a kitchen).
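The preparation steps above can be sketched as follows. At the dataset's 30 Hz sampling rate a 500 ms window corresponds to 15 samples; the short signal and 4-sample window below are illustrative only:

```python
import numpy as np

def impute_previous(signal):
    # Replace each missing sample (NaN) with the previous valid sample.
    out = signal.copy()
    for i in range(1, len(out)):
        if np.isnan(out[i]):
            out[i] = out[i - 1]
    return out

def segment(signal, window, overlap=0.5):
    # Split a 1-D signal into fixed-length windows with the given overlap.
    step = int(window * (1.0 - overlap))
    return [signal[s:s + window]
            for s in range(0, len(signal) - window + 1, step)]

sig = impute_previous(np.array([1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))
windows = segment(sig, window=4)  # 4-sample windows, 50% overlap -> step of 2
```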

Experiment setup
We first evaluated the performance of our NN using the wearable sensing data points from the previously described dataset. We explored the benefit of activation function fusion by permuting the nine activation functions available in the Keras API.
A NN's activation function computes in a particular manner (i.e., implementing a respective neuron algorithm) by finding patterns in a given dataset (i.e., y = α(w1x1 + w2x2 + ... + wnxn + b)). In this paper, we leverage the NN by treating the different activation function outcomes (i.e., yαn) as new inputs; therefore, we can compute a set of outcomes from each of the activation functions (i.e., y⊕ = yα1 || yα2 || ... || yαn). In order to showcase our methodology, we conducted our experiments by permuting the different activation functions in tuples consisting of two activation functions.
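Enumerating the candidate tuples is straightforward; whether the order within a tuple matters is an assumption here (we use ordered permutations, which gives 9 × 8 = 72 pairs; unordered combinations would halve that count):

```python
from itertools import permutations

# The nine activation function names explored in this paper.
NAMES = ["tanh", "softmax", "elu", "softplus", "softsign",
         "relu", "sigmoid", "hard_sigmoid", "linear"]

# Ordered tuples of two distinct activation functions to fuse.
pairs = list(permutations(NAMES, 2))  # 9 * 8 = 72 candidate tuples
```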
The dataset was pre-processed using Matlab 2018b (as detailed in Section 3.2). The NN is implemented in TensorFlow, using Keras to build our models. We have empirically settled on a Rectified Linear Unit (ReLU) as the activation function of the input layer and a Sigmoid activation function for the output layer. Our NN is trained for 500 epochs as a fully connected stack which implements two hidden layers, one of which consists of three vertical levels and implements our activation function fusion approach.
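A minimal Keras sketch of this architecture is shown below. The layer widths and feature/class counts are illustrative assumptions, not the paper's exact configuration; only the overall shape (a ReLU input layer, parallel vertical hidden layers, a concatenation fusion layer, and a Sigmoid output layer) follows the description above:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_fusion_model(n_features, n_classes,
                       activations=("tanh", "sigmoid"), units=64):
    # Sketch of the proposed NN: parallel hidden branches, one per
    # activation function, fused by concatenation before the output layer.
    inputs = tf.keras.Input(shape=(n_features,))
    x = layers.Dense(units, activation="relu")(inputs)   # input layer (ReLU)
    branches = [layers.Dense(units, activation=a)(x)     # vertical hidden layers
                for a in activations]
    fused = layers.Concatenate()(branches)               # activation function fusion
    outputs = layers.Dense(n_classes, activation="sigmoid")(fused)
    return tf.keras.Model(inputs, outputs)

# Hypothetical sizes for illustration: 113 input features, 17 activity classes.
model = build_fusion_model(n_features=113, n_classes=17)
```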
In this experiment, we used the leave-one-out cross-validation method for each user, in which the ADL1, ADL2, ADL3, and Drill sessions defined our source dataset, and the ADL4 and ADL5 sessions defined our target dataset, as advised by the OPPORTUNITY authors [28]. The process was conducted for each of the four subjects.

Results
To establish the feasibility of activation function fusion, we considered the scenario in which the kitchen appliances and wearable sensing devices are available simultaneously and data has been labelled.
To answer our research question (i.e., to what extent does activation function fusion benefit transfer learning activity recognition models?), we trained our activity recognition model with randomly sampled 100%, 75%, 50%, and 25% portions of the labelled data from each view in order to simulate four scenarios in which different levels of fused data are available (subsequently referred to as fusion data). If the activation function fusion concept proposed as part of our methodology is feasible, we would expect an improvement in the activity recognition model as more fusion data is provided.
To validate the improvement of our methodology, we built a conventional NN implementing a single hidden layer with ReLU as the activation function, so we can compare the performance of our approach. Results are presented in Table 1. Given the motivation of this paper, which relies on maximising the potential of the labelled data, we next highlight results based on 25% of fusion data.
Tables 2 to 5 show the ten highest activity recognition results achieved when using 25% of the fusion data. For example, in Table 2, the fusion of the "hard sigmoid" and "linear" activation functions shows an improvement of 15.17% compared to the subject benchmark accuracy (i.e., 59.45%). Moreover, a difference in accuracy of 2.74% is observed between the model trained with 25% (i.e., 71.88% accuracy) and 100% (i.e., 74.62% accuracy) of fusion data.
Similarly, in Table 3, we can observe that the highest achievement concurs with the activation function tuple found for Subject 1. For Subject 2, the benchmark is 60.14%; the result for the "hard sigmoid" and "linear" activation functions is 74.94%, representing an improvement of 14.83%. Moreover, a difference in accuracy of 3.72% is also observed when using 25% compared to 100% of fusion data.
As presented in Table 4, for Subject 3 our approach raises the activity recognition accuracy by 24.13% when adopting "sigmoid" and "hard_sigmoid" as the activation function tuple. This improvement over the subject benchmark (i.e., 31.27%) is achieved using only 25% of the fusion data.
As presented in Table 5, for Subject 4 the improvement achieved is 0.9% when comparing the benchmark (i.e., 60%) against the outcome of the model fusing the "softsign" and "sigmoid" activation function tuple.
As can be observed across all tables, a similar accuracy level is achieved regardless of the portion of data used, which represents a positive benefit for recognising activities in scenarios with constrained access to data. Full results are presented in Appendix A.

Discussion and Conclusion
In view of the ubiquitous technology available when undertaking activities of daily living, the underlying activity recognition techniques need to leverage the variability of these technology resources to achieve the highest accuracy of activity recognition, especially when data is partially available.In this paper, we have proposed a novel approach to maximise the accuracy of activity recognition given access to partially labelled, heterogeneous sensor data.
The objective of our methodology was to maximise the accuracy of human activity classification where limited labelled data are available.Unlike other similar studies that explore the feasibility of fusing sensor data, we proposed an elegant approach in which a NN processes data implementing a collection of different activation functions as part of its hidden layers.The hidden layer is built dynamically, which makes this approach flexible to adopt as many activation functions as wanted.To showcase the feasibility of this study, however, we limited the number to two activation functions.
Since a NN's activation function computes in a particular manner by finding patterns in a given dataset, in this paper we leverage the NN by treating the different activation function outcomes (i.e., yαn) as new inputs. In this way, we can compute a concatenated set of outcomes from each of the activation functions (i.e., y⊕ = yα1 || yα2 || ... || yαn).
Our results have demonstrated that the fusion of activation functions can perform well and is feasible for human activity recognition utilising inertial sensor devices worn by a user or embedded in kitchen appliances.
We evaluated our methodology using OPPORTUNITY, an open-access dataset in which subjects perform ADL while wearing sensor devices and interacting with smart kitchen appliances. The empirical study shows an improvement in activity recognition accuracy compared to the traditional approach across four different subjects. Note that there is notable variability in the results of Subject 3, which, as reported by other authors [7], is due to missing data.
As part of our analysis, in Tables 2 to 5, it was observed that the highest accuracy achieved when using only 25% of activation function fusion data outperformed the results obtained when using 100% of labelled data in a conventional shallow network (i.e., an increase in accuracy of 2.7%, 3.7%, 4.8%, and 0.9% across the activity recognition of the four subjects). We hypothesise that, given the benefit of extracting patterns from different perspectives (i.e., activation functions), the model becomes more specific as more data is adopted; thus, using 25% of the data benefits the approach by building a more general model.
Given the positive results and the rich opportunities that this approach opens, in future work we will investigate the extent to which activation function fusion can be extended to three, four, or more functions in the tuple, as well as the characteristics of the data which can affect the performance of the model.
Similarly, we will extend our investigation to consider data collected in naturalistic conditions, in which unpredictable human behaviour presents a paramount challenge in the field.

Figure 1.
Figure 1. Graphical representation depicting a basic unit of a neuron, where x = numerical inputs, w = weights, b = bias, α = activation function, and y = outcome.

Figure 2. Algorithm 1
Figure 2. Flow diagram illustrating the activation function fusion process. It begins by retrieving data from the different sensing technologies (left side), follows the fusion collection of steps that form the core structure of our methodology (centre), and shows the output achieved (right side).

Figure 3.
Figure 3. Graphical representation of the activation function fusion proposed in this method, where X L represents the semi-labelled data retrieved, α stands for the activation functions, y depicts the activation functions' outcomes, and y⊕ their concatenation.

Table 1.
Benchmark results for the activity recognition model utilising a single-hidden-layer NN and 100% of labelled data.

Table 2.
Tuples of the ten highest activity recognition model results for Subject 1, whose benchmark accuracy is 0.59455.

Table 4.
Tuples of the ten highest activity recognition model results for Subject 3, whose benchmark accuracy is 0.31173.

Table 5.
Tuples of the ten highest activity recognition model results for Subject 4, whose benchmark accuracy is 0.60025.