Estimating animal pose using deep learning: a trained deep learning model outperforms morphological analysis

INTRODUCTION: Analyzing animal behavior helps researchers understand animals' decision-making processes, and helper tools are rapidly becoming an indispensable part of many interdisciplinary studies. However, researchers often struggle to estimate animal pose because of the limitations of these tools and their vulnerability to specific environments. Over the years, deep learning has been introduced as an alternative solution to overcome these challenges. OBJECTIVES: This study investigates how deep learning models can be applied to the accurate prediction of animal behavior, compared with traditional morphological analysis based on image pixels. METHODS: The Transparent Omnidirectional Locomotion Compensator (TOLC), a tracking device, is used to record videos with a wide range of animal behavior. The recorded videos contain two insects: a walking red imported fire ant (Solenopsis invicta) and a walking fruit fly (Drosophila melanogaster). Body parts such as the head, legs, and thorax are estimated using an open-source deep learning toolbox. A deep learning model, ResNet-50, is trained to predict the body parts of the fire ant and the fruit fly, respectively. For each insect, 500 image frames were annotated by humans and then compared with the predictions of the deep learning model as well as the points generated by the morphological analysis. RESULTS: The experimental results show that, over the 500 frames of the fire ant, the average distance between the deep learning-predicted centroids and the human-annotated centroids is 2.54, while the average distance between the morphological analysis-generated centroids and the human-annotated centroids is 6.41. For the fruit fly, over the 477 image frames, the average distance between the deep learning-predicted and human-annotated centroids is 2.43, while the average distance between the morphological analysis-generated and human-annotated centroids is 5.06.
CONCLUSION: In this paper, we demonstrate that the deep learning model outperforms traditional morphological analysis in terms of estimating animal pose in a series of video frames.


Introduction
Understanding the relationship between nervous system function and behavior has been a major goal in neuroscience research [2]. Real et al. [3] presented a circuit inference framework that represents neurons and their [...]
Animal behavior is an important part of neuroscience study and has been treated in many fields, such as genetics, psychology, and ethology [6]. To understand animal behavior, high-quality video recordings have been used to capture the motion of behaving animals. However, researchers have suffered from the lack of an efficient method for analyzing animal behavior in the captured video recordings; completing every analysis task for each frame takes too much time. Moreover, judging whether the animal's behavior changes in certain situations can be highly subjective.
Advances in automation have enabled animal scientists and researchers to better study the behavior of animals through the automated analysis of images [7,8]. In bioimage analysis, automated image analysis using a convolutional neural network (CNN) has been a popular method for studying digital images. As a class of neural network models, CNN-based models, such as CNNs [9], recurrent CNN-based models [10], encoder-decoder-based models, generative adversarial networks [11], and CNN-based active learning models [12], have been widely studied and successfully applied to real-world problems [13][14][15]. In particular, Apte et al. showed that deep learning-based image segmentation can be incorporated into an open-source library [16], and Stern et al. presented a high-throughput analysis of animal behavior using a CNN [17]. Further, some studies show that CNN-based models can improve fundamental image analysis operations [18,19]. However, these methods are still time-consuming and require substantial manual effort for training and testing on the captured video recordings.
Animal pose estimation has received much attention because it reduces the time-consuming effort of labeling. Newell et al. [20] introduced a CNN architecture called the stacked hourglass for human pose estimation, and Insafutdinov et al. [21] first presented articulated multi-person pose estimation using the most popular deep learning model, ResNet. Mathis et al. [1] advanced Insafutdinov's work. However, these studies focused on deep learning models and did not evaluate their models against morphological analysis.
In this paper, we investigate how CNN-based models can achieve accurate animal pose predictions using a newly developed open-source software toolbox called DeepLabCut [1,22], compared with traditional morphological image analysis. Raw images of a fruit fly (Drosophila melanogaster) and a fire ant (Solenopsis invicta) are recorded separately by a motion-tracking device called the transparent omnidirectional locomotion compensator (TOLC) [23]. Deep learning models are trained on each video recording created by the TOLC, and the trained models are used to predict animal poses. The animal poses predicted by the trained deep learning models are compared with the poses estimated by morphological image analysis.
The rest of the paper is organized as follows. In Section 2, we describe our TOLC device. In Section 3, we introduce the DeepLabCut open-source software tool. To provide a better understanding of the raw image data and the methods used in this paper, Section 4 gives detailed information about the measurement, raw images, and model configuration. In Section 5, we present our experimental results. Finally, Section 6 concludes the presented work.

Tracking device
Tracking devices have been developed extensively by animal scientists and researchers because tracking behavioral data helps to capture footage of an animal's movement and provides invaluable contributions to many disciplines such as genetics, psychology, and ethology [24,25]. Such a tracking device records the behavior of walking insects, such as fruit flies, crickets, and ants, by compensating for their motion using a transparent sphere. However, behavior observation systems have had difficulty obtaining all behavioral data because of their limited field of view. To address this issue, much attention has been paid to the development of a finite behavior chamber in which a freely walking insect can be observed by restricting the animal's movement.
Meanwhile, in the tethered approach, an insect is tethered to a fixture and walks on top of an air-floated ball in the finite behavior chamber. However, with the tethered method the insect moves its legs only to rotate the ball while its body is immobilized in the fixture. Although this method allows infinite space navigation with a virtual reality system, it has a serious drawback: the tethering process damages the animal and makes long-term behavior studies of an insect difficult. We extended our paper [26] to include a walking red imported fire ant (Solenopsis invicta) without tethering, using the feedback control of the TOLC.
The TOLC [23] is a tracking device that captures an insect's movement by compensating for its motion with the help of a transparent sphere. The TOLC device, shown in Figure 1, detects the motion of the untethered walking insect and brings it back to the desired location by rotating the sphere to counteract the walking motion. The imaging system underneath the transparent sphere captures images of the freely walking insect at 200 Hz under illumination from 850 nm near-infrared (NIR) light-emitting diodes (LEDs), which minimizes visual distraction of the insect. Once the position is detected, three omnidirectional wheels attached to the servomotor rotate the sphere in the desired direction to cancel out the error between the current and target positions. We implemented a proportional-integral-derivative (PID) controller to compute the control input [23]. The rotation is measured by two optical encoders on the side of the sphere and is used to estimate the travel trajectory of the walking insect.
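The PID feedback step can be illustrated with a minimal, textbook discrete controller for a single axis. This is only a sketch under stated assumptions: the gains, the 50-pixel starting offset, and the toy first-order plant (control input treated as a velocity) are hypothetical and are not the TOLC's actual parameters, which are described in [23]. Only the 200 Hz update rate is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller for one axis."""
    kp: float                  # proportional gain (hypothetical value in simulate)
    ki: float                  # integral gain (hypothetical value in simulate)
    kd: float                  # derivative gain (hypothetical value in simulate)
    dt: float                  # control period in seconds
    integral: float = 0.0      # running integral of the error
    prev_error: float = 0.0    # error from the previous update

    def update(self, target: float, measured: float) -> float:
        """Return the control input for the current position error."""
        error = target - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 2000) -> float:
    """Drive a toy plant toward the target and return the final position."""
    dt = 1.0 / 200.0                      # 200 Hz imaging rate, as in the text
    pid = PID(kp=20.0, ki=5.0, kd=0.01, dt=dt)
    position, target = 50.0, 0.0          # pixels; hypothetical initial offset
    for _ in range(steps):
        control = pid.update(target, position)
        position += control * dt          # assumed response: input acts as a velocity
    return position
```

In the real device one such loop would presumably run per image axis, with the control input mapped to wheel speeds; that mapping is outside the scope of this sketch.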

Pose estimation
Pose estimation is a machine learning task that estimates the object position from a series of image frames taken from a video recording at a particular frame rate [27]. Traditionally, the object position has been detected, associated, and tracked using a model reference frame obtained at a previous time, but the accuracy of pose estimation has been limited by the high computational resources required. Today, advances in image analysis methods based on convolutional neural networks (CNNs) enable fast and accurate object pose inference using pre-trained models for pose detection and pose tracking.
DeepLabCut is an image analysis tool that provides capabilities for tracking the behavior of various objects, built on state-of-the-art deep learning models. This deep learning-equipped image analysis tool helps users easily label the subject pose and train a model on the labels for object pose inference. DeepLabCut guides users through the following steps to predict the object pose: creating a project, extracting frames, labeling the object pose, training on the labels, and evaluating the deep learning model. DeepLabCut also lets users retrain an evaluated deep learning model by extracting outlier frames or updating labeled frames.
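The steps listed above map onto functions in DeepLabCut's Python API. The following is a minimal sketch, assuming a DeepLabCut 2.x (TensorFlow-based) installation. The wrapper `run_pose_workflow` and the project name, experimenter name, and video path in the usage comment are hypothetical, and `create_training_dataset` is an intermediate step that the text does not name. The default of 15,000 iterations mirrors the number reported for the centroid experiments in this paper.

```python
def run_pose_workflow(dlc, project, experimenter, videos, maxiters=15000):
    """Run the DeepLabCut steps named in the text, in order.

    `dlc` is the imported `deeplabcut` module; it is passed in as an argument
    so that this sketch can be read and exercised without the library installed.
    """
    # 1. Create a project; returns the path to the project's config file.
    config_path = dlc.create_new_project(project, experimenter, videos)
    # 2. Extract frames from the videos for labeling.
    dlc.extract_frames(config_path, mode="automatic", userfeedback=False)
    # 3. Label the body parts (opens the labeling GUI).
    dlc.label_frames(config_path)
    # 4. Build the training dataset from the labeled frames, then train.
    dlc.create_training_dataset(config_path)
    dlc.train_network(config_path, maxiters=maxiters)
    # 5. Evaluate the trained model against the labeled frames.
    dlc.evaluate_network(config_path)
    # 6. Predict body-part locations in the videos.
    dlc.analyze_videos(config_path, videos)
    return config_path


# Hypothetical usage (requires DeepLabCut to be installed):
#   import deeplabcut
#   run_pose_workflow(deeplabcut, "fire-ant-centroid", "lab", ["/path/to/ant.avi"])
```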
DeepLabCut utilizes ImageNet-pretrained deep residual networks, such as ResNet-50 and ResNet-101, which have been used to estimate human body parts with high performance. ResNet, introduced in [28], is a type of neural network that has received much attention in the computer vision community because it allows networks with a large number of layers to be trained without an increase in training error. Whereas architectures such as the VGG network [29], GoogLeNet [30], and SqueezeNet [31] have suffered in training performance for networks with many layers, ResNet models provide a solution to the problem by fitting the stacked layers to a residual mapping.
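The residual mapping can be written out explicitly. The notation below follows the usual presentation of residual learning and is not taken from this paper: $\mathcal{H}(\mathbf{x})$ is the mapping that a group of stacked layers should ideally realize for input $\mathbf{x}$, and $\mathcal{F}(\mathbf{x})$ is the residual that the layers actually learn.

```latex
% The stacked layers fit the residual F(x) instead of the full mapping H(x);
% a shortcut connection adds the input x back to produce the block output y.
\begin{align}
  \mathcal{F}(\mathbf{x}) &:= \mathcal{H}(\mathbf{x}) - \mathbf{x}, \\
  \mathbf{y} &= \mathcal{F}(\mathbf{x}) + \mathbf{x}.
\end{align}
```

If the ideal mapping is close to the identity, the layers only need to push $\mathcal{F}(\mathbf{x})$ toward zero, which is easier to optimize in deep networks than learning the identity from scratch.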
Moreover, DeepLabCut offers several benefits for feature tracking over similar available tools such as DataJoint, Kinematic, and Openephys. DeepLabCut (i) guides the experimenter through an interface that provides a step-by-step procedure from labeling to training, (ii) minimizes the cost of manual behavior analysis and can achieve human-level accuracy with only a small number of labeled images, (iii) can be easily adapted to analyze behaviors across species, and (iv) is open-source and free to use.
The DeepLabCut procedure consists of several steps: extracting the region of interest (ROI), manually localizing body parts, training a deep neural network (DNN) architecture, and predicting the locations of the body parts in new videos. In particular, the DNN architecture uses ResNet-50 to predict the location of each body part by updating a distinct readout layer and the network weights. The architecture of DeepLabCut with ResNet-50 is shown in Figure 2.
In this paper, we use the ResNet-50 built into DeepLabCut and compare it with traditional morphological analysis. In particular, we investigate how accurately the centroid of an object is predicted by the deep learning model. The results of the deep learning model are compared with the centroids measured by morphological image processing over a sequence of image frames.

Datasets
We created two datasets, one for the fruit fly (Drosophila melanogaster) and one for the fire ant. For the experiment of estimating Drosophila [...]. Fire ants were housed in a plastic chamber with a nest tube, provided with sugar water, and held at standard room temperature (~22°C). Both the fruit fly and the fire ant were left to habituate for an initial 5 minutes before the first experiment. The experiments were conducted in a dark chamber to eliminate unpredicted external stimuli.
To record the two animals, the near-infrared (NIR) camera (FLIR, CM3-U3-31S4M-CS) and the 850 nm light-emitting diodes (Osram, SFH 4655-Z) built into the TOLC were used to capture the animals' motion. The videos and their metadata were stored as a binary file, with 640x376 images for the fruit fly and 720x540 images for the fire ant.
Experiments were performed on one Ubuntu server with 40 cores, 196 GB of memory, and two Quadro P4000 GPUs. ResNet-50 was adopted to learn the centroids of the fruit fly and the fire ant. We obtained 1,000 fruit fly images and 1,000 fire ant images from the videos and split each set into 500 images for training and 500 images for testing. The number of training iterations was 15,000. A cross-entropy loss function with a learning rate of 0.02 was used for training. The cross-entropy loss function represents the L1 distance between a predicted region and an actual region. The loss function is defined between a predicted region, in pixels, and an actual region, in pixels.
The location of the fruit fly's centroid was measured by a morphological filter from the TOLC device. The morphological filter computes the centroid in 2-dimensional space, with an x-axis and a y-axis. The geometric center $(x_m, y_m)$ of the fruit fly's area was compared with the location $(x_p, y_p)$ of the centroid predicted by ResNet-50 in DeepLabCut. Likewise, we obtained the centroid location of the fire ant from both the morphological filter and the deep learning prediction and compared them with each other.
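To make the idea of a pixel-based geometric center concrete, the sketch below computes the centroid of a thresholded binary mask. It is an illustration under assumptions, not the TOLC's filter: the function name `mask_centroid`, the threshold, and the choice that the insect's pixels are brighter than the background are all hypothetical, and the device's actual morphological filter is described in [23].

```python
def mask_centroid(image, threshold):
    """Return the (x, y) geometric center of the foreground pixels.

    `image` is a list of rows of grayscale values. Pixels with a value greater
    than or equal to `threshold` are treated as foreground (an assumption).
    Returns None when no pixel passes the threshold.
    """
    sum_x = 0
    sum_y = 0
    count = 0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if value >= threshold:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None  # no foreground detected in this frame
    return sum_x / count, sum_y / count


# Hypothetical 4x4 frame with a 2x2 bright blob in the middle.
frame = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
centroid = mask_centroid(frame, threshold=5)
```

Because the result depends entirely on which pixels pass the threshold, changes in lighting or in the segmented area shift the centroid directly, which is the dependence on a pixel-wise mask that the comparison with the deep learning model examines.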
Two experiments were conducted, one for the fruit fly and one for the fire ant. To verify the effectiveness of the deep learning model, we compared the deep learning-predicted centroid location $(x_p, y_p)$ with the location $(x_h, y_h)$ of the human-annotated centroid, which was labeled by two students (one graduate student and one undergraduate student). In the labeling process, the centroids in 50 image frames were labeled for training, for both animals. For testing, 500 image frames were labeled using DeepLabCut's graphical user interface (GUI). The Euclidean distance between two centroids was chosen as the measure of dissimilarity in centroid location. With $(x_m, y_m)$ denoting the morphological analysis-generated centroid, the Euclidean distances $d(p, h)$, $d(p, m)$, and $d(m, h)$ used in the experiment are defined as below:
$$d(p, h) = \sqrt{(x_p - x_h)^2 + (y_p - y_h)^2}$$
$$d(p, m) = \sqrt{(x_p - x_m)^2 + (y_p - y_m)^2}$$
$$d(m, h) = \sqrt{(x_m - x_h)^2 + (y_m - y_h)^2}$$
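As a minimal sketch of this comparison, the per-frame Euclidean distance between two centroid sequences and its average over all frames can be computed as follows. The function names and the example coordinates are hypothetical and are not values from the experiments.

```python
import math
from statistics import mean


def euclidean(a, b):
    """Euclidean distance between two (x, y) centroids."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def mean_distance(points_a, points_b):
    """Average per-frame distance between two equally long centroid sequences."""
    if len(points_a) != len(points_b):
        raise ValueError("both sequences must cover the same frames")
    return mean(euclidean(a, b) for a, b in zip(points_a, points_b))


# Hypothetical centroids for two frames.
predicted = [(0.0, 0.0), (1.0, 1.0)]
annotated = [(3.0, 4.0), (1.0, 1.0)]
average = mean_distance(predicted, annotated)
```

The same function applies to each of the three pairings (predicted versus annotated, predicted versus morphological, morphological versus annotated).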

Experiment Results
The distance between the deep learning-predicted centroids and the human-annotated centroids is denoted $d(p, h)$, the distance between the deep learning-predicted centroids and the morphological analysis-generated centroids is denoted $d(p, m)$, and the distance between the morphological analysis-generated centroids and the human-annotated centroids is denoted $d(m, h)$. The Euclidean distances $d(p, h)$, $d(p, m)$, and $d(m, h)$ on the testing dataset for the fire ant are shown in Figure 3. In the figure, the x-axis represents the frame numbers recorded by the TOLC device, while the y-axis represents the Euclidean distances. The solid red line representing $d(p, h)$ indicates that the centroid distances remain stable as the frame number increases, while the dotted blue line representing $d(m, h)$ shows irregular distances as the frame number increases, compared with the solid red line.
The overall results for the three distances on the testing dataset for the fire ant show that the deep learning-predicted centroids are much closer to the human-annotated centroids than the morphological analysis-generated centroids are. These results demonstrate that the error distance of the morphological analysis-generated centroids is much larger than the error distance of the deep learning-predicted centroids.
The Euclidean distances $d(p, h)$, $d(p, m)$, and $d(m, h)$ on the testing dataset for the fruit fly are shown in Figure 4. As before, the x-axis represents the frame numbers recorded by the TOLC device, while the y-axis represents the Euclidean distances. Whereas the morphological analysis-generated centroids were obtained for all 500 frames of the fire ant testing dataset, 23 morphological analysis-generated centroids were missing from the fruit fly testing dataset. Therefore, we performed the experiment on 477 frames of the fruit fly. These new, unseen data were used to determine whether the deep learning model was trained effectively. As a result, we found that the solid red line, representing the distance $d(p, h)$ between deep learning-predicted and human-annotated centroids, remains stable, showing that the deep learning-predicted centroids are very similar to the human-annotated centroids. The dotted blue line, representing the distance $d(m, h)$ between morphological analysis-generated and human-annotated centroids, gives a different result: the distances are irregular as the frame number increases.
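Restricting the comparison to frames that have a morphological centroid can be sketched as below. This is an assumption about how such filtering might be done, not the authors' code: missing centroids are represented by `None`, and the function name and the synthetic coordinates are hypothetical. Only the counts (500 frames, 23 missing, 477 kept) follow the text.

```python
import math


def paired_distances(morphological, annotated):
    """Distances for frames that have a morphological centroid.

    Frames whose morphological centroid is None are skipped.
    Returns (list_of_distances, number_of_skipped_frames).
    """
    distances = []
    skipped = 0
    for m, h in zip(morphological, annotated):
        if m is None:
            skipped += 1
            continue
        distances.append(math.dist(m, h))
    return distances, skipped


# Synthetic data: 500 frames, of which every 22nd frame lacks a centroid.
annotated = [(float(i), 0.0) for i in range(500)]
morphological = [None if i % 22 == 0 else (float(i), 3.0) for i in range(500)]
distances, skipped = paired_distances(morphological, annotated)
```

Dropping the same frames from every comparison keeps the three averages computed over an identical set of frames.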
The cross-entropy losses after 15,000 training iterations were 0.0062 and 0.0062, with a learning rate of 0.02, for the fruit fly and the fire ant, respectively. The average distances for the three centroid comparisons are shown in Table 1. To further validate our results, an additional experiment was performed on other locations: head, front head, front right leg, front left leg, middle right leg, middle left leg, rear right leg, and rear left leg. The number of training iterations was 500,000. A cross-entropy loss function representing the L1 distance between a predicted region and an actual region was used for training, with a learning rate of 0.02. The training loss for each experiment is shown in Figure 5. While the centroid experiments for both the fire ant and the fruit fly were conducted over 15,000 iterations, we used 500,000 iterations for the seven locations. All three experiments show that the loss drops very quickly until around the 2,300th iteration and then becomes constant as the number of iterations increases, demonstrating the effectiveness of the deep learning model in these experiments.
The locations on the fire ant are visualized in Figure 6, which shows NIR images of the fire ant at different times. The seven locations are colored purple, blue, yellow, light blue, orange, light green, and red, respectively.
The trajectories for the seven locations of the fire ant are shown in Figure 7.

Discussion
Our study demonstrates that DeepLabCut can be applied to different species of animals and supports automatic tracking that identifies the behavior patterns of different species with minimal effort. Our experimental results provide guidelines on how the predicted location information (i.e., centroids, head, and the six legs) can be combined. We plan to perform extensive experiments with more annotations to avoid intra-observer variation and learning bias. Moreover, we plan to study conditioned behavior (i.e., positive phototactic movement) in Drosophila by tracking the leg movements. The phototactic response is the behavior of moving directionally in response to a light source. While the negative phototactic response is movement away from the light, the positive phototactic response is movement toward the light. P. Pun et al. [...]

Conclusion
In this paper, we estimated animal pose in a sequence of image frames recorded by the TOLC device. The poses of two insects, a fruit fly and a fire ant, were estimated to explain patterns in the locations of the body parts. Morphological image analysis was performed to estimate the centroid of a body area, and the result was compared with the centroid predicted by a deep learning model. While the morphological process depends heavily on a pixel-wise mask, the deep learning process creates a model that enables markerless pose estimation to locate the centroid of a body area. In addition to this experiment, we studied the patterns of the insects' movement by tracing the locations of the head and six legs. A total of 2,000 image frames were used for the three experiments, and 1,000 image frames were annotated for the centroids of the fire ant and the fruit fly, respectively. The experimental results showed that the overall distance between the deep learning-predicted centroids and the human-annotated centroids is less than the overall distance between the morphological analysis-generated centroids and the human-annotated centroids.