Hyperparameter optimisation for Capsule Networks

Convolutional Neural Networks and their contemporary variants have set the ruling benchmarks for most image-processing tasks, but they resort to pooling techniques that affect classification accuracy and lose spatial-relationship information between the data points involved. Hence, Hinton et al. proposed a layered architecture called Capsule Networks (CapsNets), which outperforms traditional systems by replacing pooling techniques with dynamic routing. CapsNets are thus en route to proving themselves as prospective future benchmarks for visual-imagery tasks, having surpassed existing state-of-the-art results on the MNIST dataset. This paper inspects two novel aspects: the enhancement of this performance on CIFAR-10 through regularisation and hyperparameter optimisation, and the resulting extension of applicability to stochastic numeric healthcare data, which helps uncover new challenges for predictive neural networks.


Introduction
Machine intelligence for visual-imagery tasks such as segmentation, detection, and reconstruction has primarily been propelled by Convolutional Neural Networks (ConvNets), but the major limitations of such systems include loss of information due to max- and average-pooling, together with an inability to encode orientation and positional variation into predictions [1]. Although the brain's information-processing mechanisms differ significantly from how traditional systems are wired, recent advances such as Capsule Networks [2] apply Hebbian learning principles that come closer to emulating human capabilities, using active vectorisation and filtering features on layered dimensionality planes to reduce disparity over disagreement.
Hinton et al. first proposed shape-representation models in parallel systems [3] to synthesise the assignment of object-intrinsic coordinate frames of reference. Individual units represent hypotheses about the interpretation of local fragments of the visual input, while unit interactions encode knowledge of local interactions and constraints on spatial disposition. These interactions organise the parallelised network entities so that a single pattern of activity emerges as the frames converge simultaneously; the representation nevertheless remains viewer-centric, affected by attributes such as the plane of bilateral symmetry, gross elongation, and gravitational or contextual verticals. Since this was implemented using coordinate hardware units, the architecture implicitly couples surrounding elements as coordinate frames that affect the relative environment in which it is based. Activities in corresponding channels can profoundly influence one another, stimulating a rough object segmentation, while computational overheads can be reduced by clustering associated regions to optimise the mapping of distributed encodings.
The idea further evolved into 'transforming autoencoders' [4], which model scalar features as vectors of activities and instantiation parameters that blend into the domains of the given visual entity. The resulting probability is multiplied element-wise with the capsule output as implicit routing learns to detect visual features over time. This probability is expected to remain stable even as the entity varies over the space of appearance transformations, and its value determines the capsule's weight in the overall autoencoder prediction, a reassurance of CapsNets' potential to become future benchmarks.
The nested set of neural layers resorts to dynamic routing of selected features: denoising at the lower levels of capsule prediction precedes the hierarchical routing of local-pool activities to higher-level capsules, which unearth convoluted data patterns and produce highly concise, informative outputs. The routing-by-agreement procedure is summarised below.
For all capsules i in layer l and all capsules j in layer (l+1): b_ij ← 0
For r iterations:
    For all capsules i in layer l: c_ij ← softmax(b_i)
    For all capsules j in layer (l+1): s_j ← Σ_i c_ij û_{j|i}
    For all capsules j in layer (l+1): v_j ← squash(s_j)
    For all capsules i in layer l and all capsules j in layer (l+1): b_ij ← b_ij + û_{j|i} · v_j

Hinton et al. later proposed an expectation-maximisation routing algorithm with logistic units [5], which encodes the relationship between entity and pose and greatly improves the efficiency of capsule routing through recursive updates of the weighted assignment-coefficient matrix and the clustering of probabilities that lie relatively close together. This ConvCaps structure extends dynamic routing to convolutional filters, with the requisite feature maps tiled into kernel-wise batches. The underlying transformation matrix is trained discriminatively by back-propagating through the unrolled iterations between adjacent pairs of capsule layers, enabling an effective representation of part-whole relationships. It also makes capsules markedly more robust to white-box adversarial attacks, significantly less vulnerable than baseline ConvNets, while enhancing the generalisability of the learned pose matrices and the corresponding learning capacity.
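To make the loop concrete, the following is a minimal NumPy sketch of routing-by-agreement under illustrative shapes (1152 primary capsules routed to 10 digit capsules); the prediction vectors û_{j|i} are random here rather than computed from learned transformation matrices W_ij:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, r=3):
    # u_hat: (num_lower, num_upper, dim_upper) prediction vectors u_hat_{j|i}
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))              # b_ij <- 0
    for _ in range(r):
        e = np.exp(b - b.max(axis=1, keepdims=True))  # numerically stable softmax over j
        c = e / e.sum(axis=1, keepdims=True)          # c_ij = softmax(b_i)
        s = np.einsum('ij,ijk->jk', c, u_hat)         # s_j = sum_i c_ij * u_hat_{j|i}
        v = squash(s)                                 # v_j = squash(s_j)
        b = b + np.einsum('ijk,jk->ij', u_hat, v)     # agreement update b_ij += u_hat . v_j
    return v

v = dynamic_routing(0.01 * np.random.randn(1152, 10, 16))
print(v.shape)  # (10, 16)
```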
Capsules have also been applied in healthcare, primarily to lung-tumour and brain-fMRI data. (a) On a lung-disease detection dataset [6], a CapsNet was trained on 400 images to address two levels of the problem statement: (i) is the patient ill, and (ii) if yes, what type of lung-related disease is it? The model first screened fundamental attributes such as gender and age and applied noise filtering. Secondary processing tested the ability of the convolutional units to accelerate convergence and optimised them with spatial-transformation techniques. The experiments showed that CapsNets have an inherent ability to thrive in scenarios involving limited data.
(b) For brain fMRI, a 'CapsNet-based visual reconstruction' [7] was proposed to reconstruct image stimuli by decoding position, orientation, and candidate object categories from activity in the visual cortex, an approach reported to be roughly 10% more accurate than previous state-of-the-art methods. The implementation spans four hypotheses: (i) design and validate an improved architecture that maximises performance accuracy; (ii) investigate over-fitting on a real fMRI dataset; (iii) explore whether the solution extrapolates to the whole brain or remains confined to the segmented tumour; (iv) develop a visualisation paradigm to better convey the learned features. The network was trained end-to-end on nonlinear mappings between the image stimuli and high-level capsule features; after the serviceability of voxels was estimated from encoding performance to accomplish optimal selection, the system was re-trained on these mappings.
Other systems such as CapsuleGAN [8] have been explored, where the discriminator of a generative adversarial network (GAN) is replaced by capsules to model different objective functions, evaluated qualitatively and quantitatively on the Generative Adversarial Metric (GAM). GANs are typically used to represent highly complex distributions but are known to be unstable and to suffer from vanishing gradients, mode collapse, and inadequate mode coverage; CapsuleGAN addresses these issues through better objective functions, sophisticated training strategies, empirical tricks, and structured hyperparameters.
The aforementioned papers lay the basis for the work carried out here. The rest of this paper is structured as follows: Section 2 articulates the architectural framework of Capsule Networks, Section 3 describes the experimental design, Section 4 presents the results and discussion, and Section 5 concludes.

Architectural Framework
The broad architectural framework of Capsule Networks comprises an encoder and a decoder. The encoder consists of a two-dimensional convolutional ReLU layer that detects basic features, a PrimaryCaps layer that produces combinations of those feature outputs, and a DigitCaps layer that generates the loss function and the transformational weight matrix. The decoder consists of three fully connected layers: FC1 (ReLU activation), FC2 (ReLU activation), and FC3 (sigmoid activation). The two components work together to reconstruct the input image while tracking the accuracy and loss performance parameters. The loss, in turn, combines a margin loss computed for each capsule with a reconstruction loss that is scaled down by 0.0005 so that it does not dominate training.
The detailed technical functionality of each CapsNet layer is as follows:
(i) The ReLU convolutional layer: 256 kernels of size 9×9×1, each with a bias term and stride 1, followed by ReLU activation. The layer holds 20,992 parameters and outputs a 20×20×256 tensor.
(ii) The supporting PrimaryCaps layer: 32 capsule channels apply 9×9×256 convolutional kernels (with stride 2) to the 20×20×256 input volume, holding 5,308,672 parameters and outputting a 6×6×8×32 tensor.
(iii) The DigitCaps layer: this 10-capsule digit layer ingests the 6×6×8×32 tensor; within each capsule, a weight matrix maps the 8-dimensional input space to the 16-dimensional capsule output space. The layer outputs a 16×10 matrix and is associated with 1,497,600 parameters.
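A hedged Keras sketch of the encoder stack on a 28×28 single-channel input, useful mainly to confirm the parameter counts quoted above; layer names are illustrative and the stride-2 PrimaryCaps convolution follows the original CapsNet paper rather than the exact code of our experiments:

```python
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(28, 28, 1))
conv1 = layers.Conv2D(256, kernel_size=9, strides=1, activation='relu',
                      name='relu_conv1')(inputs)          # -> 20x20x256, 20,992 params
primary = layers.Conv2D(256, kernel_size=9, strides=2,
                        name='primarycaps_conv')(conv1)   # -> 6x6x256, 5,308,672 params
caps = layers.Reshape((6 * 6 * 32, 8), name='primarycaps')(primary)  # 1152 capsules, dim 8
models.Model(inputs, caps).summary()
```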
The loss function is a weighted sum over correct and incorrect DigitCaps, with $T_k$ defined as 1 for a matching training label and 0 otherwise. The loss becomes zero either when a correct prediction occurs with probability greater than 0.9 for a matching training label, or when an incorrect prediction occurs with probability less than 0.1 for a mismatched training label. The margin loss for capsule $k$ is

$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2, \quad (2)$

with $m^+ = 0.9$, $m^- = 0.1$, and $\lambda = 0.5$. The transformation matrix $W_{ij}$ maps the 8-D capsule output $u_i$ of the previous layer to the 16-D capsule output space for each class $j$, and the magnitude of $v_j$ summarises the associated probability.
The final output $v_j$ for class $j$ is computed with the novel squashing function

$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}, \quad s_j = \sum_i c_{ij} \hat{u}_{j|i},$

where the coupling coefficients $c_{ij}$ measure the likelihood that primary capsule $i$ probabilistically triggers capsule $j$, and $s_j$ is the weighted sum, which the squashing function then shrinks.
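A TensorFlow sketch of the margin loss of Eq. (2) (the squashing function itself appears in the routing sketch above); y_true is assumed to be one-hot ($T_k$) and y_pred to hold the capsule lengths $\|v_k\|$:

```python
import tensorflow as tf

def margin_loss(y_true, y_pred, m_plus=0.9, m_minus=0.1, lam=0.5):
    # L_k = T_k * max(0, m+ - ||v_k||)^2 + lam * (1 - T_k) * max(0, ||v_k|| - m-)^2
    present = y_true * tf.square(tf.maximum(0.0, m_plus - y_pred))
    absent = lam * (1.0 - y_true) * tf.square(tf.maximum(0.0, y_pred - m_minus))
    return tf.reduce_mean(tf.reduce_sum(present + absent, axis=-1))
```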
The decoder is a regulariser network that recreates the original 28×28 image, forcing the capsules to learn features of the data. The first and penultimate layers of the CapsNet decoder use the ReLU activation function, while the last layer retains a sigmoid activation unit.
The first fully connected layer outputs a 512-dimensional vector and holds 82,432 trainable parameters (weights plus biases). The second fully connected layer outputs a 1024-dimensional vector and holds 525,312 trainable parameters. The final fully connected layer outputs a 784-dimensional vector, holding 803,600 trainable parameters; this is reshaped into the final 28×28 output. The total number of parameters in the capsule network is therefore 8,238,608. The CapsNet architectural framework is visualised [9] in Figure 1.
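A minimal Keras sketch of the three-layer decoder; model.summary() reproduces the per-layer parameter counts quoted above (82,432 / 525,312 / 803,600):

```python
from tensorflow.keras import layers, models

decoder = models.Sequential([
    layers.Dense(512, activation='relu', input_shape=(16 * 10,)),  # FC1: 160*512 + 512 = 82,432
    layers.Dense(1024, activation='relu'),                         # FC2: 512*1024 + 1024 = 525,312
    layers.Dense(784, activation='sigmoid'),                       # FC3: 1024*784 + 784 = 803,600
    layers.Reshape((28, 28, 1)),                                   # final 28x28 reconstruction
], name='decoder')
decoder.summary()
```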

Experimental Design
This section details the proposed enhancements and optimisations to Capsule Networks, which fall under the following categories: (i) activation-function optimisation with variants of ReLU [10] and Swish [11] variants [12]; (ii) data augmentation with neural style transfer [13]; (iii) optimisation with additional softmax [14] layers; (iv) code shuffling [15] within the dataset to model stochasticity; (v) hyperparameter tuning using grid search and random search [16]; (vi) parallel implementation of hyperparameter regularisation [17] with early stopping [18].

Object Recognition Tasks
A TensorFlow-backed Keras implementation in a Jupyter notebook environment, run on a Tesla K40c GPU, was the framework for executing CapsNets on the CIFAR-10 dataset [19] for object recognition and reconstruction tasks. Each of the following optimisations and regularisations was executed within this same environment, with the original capsule network architecture at default settings as the baseline benchmark. Each implementation was run for 20 epochs, as results are known to stabilise fairly well thereafter.
Activation Functions. Activation functions, or transfer functions, are non-linear transformations, complex functional mappings between the incoming data and the response variable. In generic form, a neuron computes $Y = f\big(\sum_i w_i x_i + b\big)$, where the pre-activation sum can range from negative to positive infinity. The activation function, applied to the weighted input signal, decides whether a neuron fires; paired with back-propagation, the weights and bias aggregates are iteratively updated from the gradients of the resulting loss metric.
The mathematical definitions of the activation units studied are as follows. The non-linear ReLU activation is defined as $f(x) = \max(0, x)$. ReLU is a simple, efficient, and widely used monotonic function that ensures convergence about six times faster than tanh. It mitigates the vanishing-gradient problem, but its major limitation is that neurons are unlikely to recover once they fall onto the negative side, outputting zero regardless of the input. The leaky ReLU (LReLU) [20], defined as $f(x) = x$ for $x > 0$ and $f(x) = \alpha x$ otherwise (with a small negative slope $\alpha$), is an improved ReLU that deals with this limitation. Newer activations such as Swish, proposed by Ramachandran et al., are defined in terms of the sigmoid as $f(x) = x \cdot \sigma(x)$, and the e-Swish activation with a learnable $\beta$ component is defined as $f(x) = \beta x \cdot \sigma(x)$.
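For reference, minimal TensorFlow sketches of these activations; the e-Swish $\beta$ is shown as a fixed hyperparameter here (the value is illustrative), though it can equally be made trainable as described above:

```python
import tensorflow as tf

def relu(x):
    return tf.maximum(0.0, x)                # max(0, x)

def leaky_relu(x, alpha=0.01):
    return tf.where(x > 0.0, x, alpha * x)   # small negative slope alpha

def swish(x):
    return x * tf.sigmoid(x)                 # x * sigmoid(x)

def e_swish(x, beta=1.5):
    return beta * x * tf.sigmoid(x)          # beta * x * sigmoid(x); beta illustrative
```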
Data Augmentation Using Neural Style Transfer. Data augmentation is the process of enriching the dataset with relevant, synthetically modified data to enhance performance. We applied neural style transfer to representative examples of the dataset, and the resulting images were used to train the network, improving performance. The method introduces two kinds of loss, style loss and content loss, whose weighted sum is

$L_{total}(S, C, G) = \alpha L_{content}(C, G) + \beta L_{style}(S, G), \quad (11)$

where the content loss at layer $l$ is defined as

$L_{content}(C, G, l) = \tfrac{1}{2} \sum_{i,j} \big(a^{l}_{ij}(C) - a^{l}_{ij}(G)\big)^2$

and the style loss over the Gram matrices $G^{l}_{ij} = \sum_k a^{l}_{ik} a^{l}_{jk}$ is defined as

$L_{style}(S, G) = \sum_l \frac{w_l}{4 N_l^2 M_l^2} \sum_{i,j} \big(G^{l}_{ij}(S) - G^{l}_{ij}(G)\big)^2.$

Optimisation with Additional Softmax Layers. During multiclass classification, a softmax layer with as many nodes as classes is placed before the output layer to obtain a probability distribution over all classes:

$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \dots, K.$

These layers are added to the decoder of the CapsNet architecture before the output layer to enhance object-recognition performance.

Hyperparameter Tuning Using Grid Search and Random Search. Grid search and random search explore the same parameter space while searching for parameters that potentially influence learning. Grid search builds a model for every combination of hyperparameters, evaluates each combination, and finally chooses the set of parameters with the highest classification accuracy. Random search instead samples hyperparameter settings at random and generally converges in less time than grid-search tuning; a toy comparison is sketched below.
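A self-contained toy sketch of the two search strategies; validate() is a hypothetical stand-in for training the CapsNet and returning a validation score, and the parameter names and ranges are illustrative:

```python
import itertools
import random

def validate(lr, lam_recon):
    # Placeholder objective standing in for a full CapsNet training run.
    return -abs(lr - 3e-4) - abs(lam_recon - 5e-4)

grid = {'lr': [1e-4, 3e-4, 1e-3], 'lam_recon': [3e-4, 5e-4, 1e-3]}

# Grid search: evaluate every combination, keep the best.
best_grid = max(itertools.product(grid['lr'], grid['lam_recon']),
                key=lambda p: validate(*p))
print('grid search best:', best_grid)

# Random search: sample a fixed budget from continuous ranges; it typically
# reaches a good region in fewer evaluations than the exhaustive grid.
random.seed(0)
samples = [(10 ** random.uniform(-4, -3), 10 ** random.uniform(-3.6, -3))
           for _ in range(6)]
print('random search best:', max(samples, key=lambda p: validate(*p)))
```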
Parallel Implementation of Hyperparameter Regularisation with Early Stopping. Hyperparameter regularisation of the learning rate was implemented on CapsNets using the Sherpa library, which works well on problems with computationally expensive iterative function evaluations. Results surpassed the ReLU benchmark well before the 20th epoch, and early stopping at the 9th epoch was found to avoid over-fitting.
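The Sherpa-driven search loop is not reproduced here; the sketch below shows only the early-stopping half with the standard Keras callback, assuming a compiled model and a held-out validation split:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Halt training once validation loss stops improving; in our runs this
# triggered around the 9th epoch.
early_stop = EarlyStopping(monitor='val_loss', patience=3,
                           restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=20, callbacks=[early_stop])
```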

Healthcare
As the world appraises the billion-dollar healthcare market, which is poised to keep growing, technological advances are essential to meet the large-scale demand for quality care. To ensure affordability, exploring this transformational space with the latest machine-learning systems, such as predictive neural-network frameworks, could be influential progress in this direction. This subsection conceptualises the modelling of CapsNets for the analysis of hyperglycaemia data spanning clinical databases involving 74 million unique cases corresponding to 17 million unique patients, including 70,000 inpatient diabetes encounters. The original linear-regression statistical model suggests that the relationship between HbA1c levels and readmission probability depends primarily on the diagnosis [21]. The results of the study are significant and critical because of their influence on morbidity and mortality rates, which in turn depend on the treatment modality: HbA1c levels of 7% or greater were associated with increased morbidity, whereas both high and low levels were associated with increased mortality [22].
The stochastic numeric healthcare diabetes dataset was mapped to a time-series format with appropriate channel labels and fed into the capsule architecture to predict patient readmission rates from HbA1c levels and other associated values. The Conv2D layer detects basic features, forming a feature map as a 20×20×256 tensor. The PrimaryCaps layer produces combinations of that feature map and outputs a 6×6×8×32 (equivalently 6×6×256) tensor. The DigitCaps layer generates the transformation weight matrix $W_{ij}$ and the ensuing loss function. The three fully connected layers of the decoder hold the bias-inclusive parameter counts given earlier. These layers default to ReLU, although our analysis shows that variants of Swish and ReLU enhance accuracy further, as elaborated in the subsection below. With performance attributes being key indicators in the healthcare sector, this application was chosen to draw attention to the fundamental aspects of the organic framework and to demonstrate intuitive insights into the architecture. A hedged sketch of the input mapping follows.
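The exact feature schema is not reproduced here; the following is a heavily hedged sketch of one plausible mapping, standardising numeric encounter attributes and tiling them into a 28×28 single-channel grid so that an unmodified CapsNet input layer can consume them. The feature count is illustrative, not the actual schema of the diabetes database:

```python
import numpy as np

def to_capsnet_input(features):
    # features: (n_samples, n_features) numeric array of encounter attributes
    x = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
    reps = int(np.ceil(28 * 28 / x.shape[1]))
    flat = np.tile(x, reps)[:, :28 * 28]      # tile features to fill the 784-cell grid
    return flat.reshape(-1, 28, 28, 1)

# e.g. 70,000 inpatient encounters with 49 illustrative numeric attributes
x = to_capsnet_input(np.random.rand(70000, 49))
print(x.shape)  # (70000, 28, 28, 1)
```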

Results and Discussion
The experiment with healthcare data involving HbA1c levels proved remarkable, with a 19.5% relative increase in correlation over the previous linear-regression benchmark when run with the ReLU activation function on CapsNets.
The results of the activation-function optimisations of CapsNets on the CIFAR-10 object recognition tasks, in terms of accuracy, are shown in Table 1 and visualised in Figure 2.
While the leaky ReLU variant performs best, the e-Swish activation consistently outperforms the ReLU benchmark. The results of the other optimisations and regularisations of CapsNets on the CIFAR-10 recognition tasks, in terms of accuracy, are shown in Table 2. The corresponding results for the activation-function optimisations in terms of the loss parameter are shown in Table 3; LReLU outperforms both e-Swish and ReLU on the loss metric. A graphical representation is given in Figure 4, where the y-axis represents the loss and the x-axis the epoch. The results of the other optimisations and regularisations in terms of loss are shown in Table 4. These results are currently being extrapolated to cancer research models, which are expected to surpass ConvNet accuracy benchmarks, with implications and inferences for demographic constitutions. These activation functions could also be tested on datasets of different application, volume, and complexity; newer functions, optimisers, and regularisation techniques, or tweaks to the existing implementations, remain worthwhile research avenues.

Conclusion
From the experiments it is evident that, in terms of activation functions, e-Swish and LReLU not only outperform the currently standard ReLU but also ensure faster convergence and shorter training time. Among the other optimisations and regularisations, additional softmax layers, data augmentation, Sherpa optimisation, and code shuffling outperform the ReLU benchmark, while grid search and random search sit on the ReLU borderline. Future work in this promising direction could apply newer, novel functions to more complex models.
Non-normalised, distributed data with changing behavioural attributes and complex curves continue to pose new challenges. This research venture could help redefine modern processing for time-series analysis and forecasting, where long-standing contemporary systems have often failed with techniques that may be fundamentally flawed.
With the aforementioned in mind, these emerging architectures can be expected to drive the systems of the future, where technologies advance rapidly and landscapes change fast. While current results delineate the scenarios above, the future could open formidable avenues of advancement, not only tackling present challenges but also surfacing and solving problems in this space that are as yet unknown.

Figure 1. Block Diagram of Capsule Network Architecture

Figure 2. Visualisation of activation-function optimisation of CapsNets on the CIFAR-10 dataset in terms of accuracy

Figure 4. Graphical representation of LReLU and e-Swish with reference to loss

Figure 5. Graphical representation of CapsNet optimisation and regularisation with reference to loss

Code Shuffling. We introduced stochasticity into the data so that the model can minimise the training loss while adapting to the dynamic characteristics of healthcare data. Assume the training process minimises a loss function $L_w$ over the training set, with $w$ the weight matrix. If the loss is minimised over, say, $c$ elements of the training set, $L$ is a surface in a $(c+1)$-dimensional space. Geometrically, the loss can be evaluated over any training set and associated weight matrix, but if the resulting surface remained unchanged across training iterations, the model would be susceptible to the local-minimum problem. Code shuffling with mini-batch diversification ensures the surface changes across iterations: if $L_{w_i}$ is a local minimum of the loss at training iteration $i$, the loss surface changes at the next iteration on the reshuffled (stochastic) training set, giving $L_{w_{i+1}}$. Since $L_{w_i}$ differs from $L_{w_{i+1}}$, the point is unlikely to remain a local minimum, so the gradient update can be computed and training continues. A sketch follows.
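A minimal sketch of per-epoch mini-batch shuffling, the mechanism assumed above; train_step stands in for whichever optimiser update is in use:

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffled_batches(x, y, batch_size=128):
    idx = rng.permutation(len(x))             # fresh permutation each epoch
    for start in range(0, len(x), batch_size):
        sel = idx[start:start + batch_size]
        yield x[sel], y[sel]

# for epoch in range(num_epochs):
#     for xb, yb in shuffled_batches(x_train, y_train):
#         train_step(xb, yb)                  # hypothetical update step
```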

Table 1. Activation-function optimisation of CapsNets on the CIFAR-10 dataset in terms of accuracy

Table 3. Comparison between ReLU, LReLU, and e-Swish with reference to the loss parameter

Table 4. Optimisations and regularisations of CapsNets on the CIFAR-10 dataset with respect to the loss parameter