Characterisation of gestural units in light of human-avatar interaction

We present a method for characterizing coverbal gestural units intended for human-avatar interaction. We recorded 12 gesture types using a motion-capture system. We used the marker positions thus obtained to determine the gestural units after stroke segmentation. We complement our linguistic analysis of gestures with an elaboration of our biomechanical hypotheses, our method of segmentation, our characterization hypotheses and the results obtained.


Introduction
Characterization of the meaning of gestures is traditionally based on body-oriented descriptions [36] capturing the gestural elements according to a global description of bodily reference points. We aim to show that the meaning of different Gestural Units (GUs) can be defined on the basis of forms along the upper limb, using multiple reference points that are not limited to body-orientation, but oriented via each of the segments (hand, forearm, arm), thus facilitating an automatic characterization of gestures to exploit in a human-avatar context. This work takes place in the CIGALE project, whose final goal is to create novel human-avatar interactions in the context of theatrical performances, starting from the off-line analysis, characterization and classification of 4 different datasets of gestures we recorded using a motion-capture system; one of these datasets is presented (Sec. 3) and exploited in this work. Once the off-line study is completed, the project will move to on-line analyses in order to evaluate the actual possible interaction between humans and the avatar, whose behavior has been built exploiting the off-line gesture analysis.
After presenting the state of the art from a linguistic and an engineering point of view (Sec. 2), we describe the database and the adopted biomechanical model (Sec. 3). Section 4 provides an overview of the linguistic framework within which our semantic characterization of coverbal gestures is presented. Segmentation of the significant part of the gesture (the stroke [31], Sec. 5) is a necessary preliminary to such characterization, since it is impossible to characterize the meaning of a gesture without knowing when it occurs. For this purpose, an automatic segmentation is presented and tested against the segmentation of two coders, which serves as ground truth, in line with the highest-standard methods of both domains, robotics and linguistics (Sec. 5). Once this operation is validated, automatic characterization relies on centering with respect to the motion variation of one degree of freedom (DOF), prono-supination (Sec. 6). We conclude by summing up our results and presenting future work (Sec. 7).
* Corresponding author. Email: ilaria.renna@gmail.com

State of the art
Gesture segmentation into concatenated units, inspired by Kendon's work [25,26,30], is widely used: a Gesture Unit consists of a series of Gesture Phrases, which are themselves composed of Gesture Phases. Each of the latter units includes the core meaning of the gesture, named the stroke (see Sec. 5). Other studies try to set up a system of syntagmatic rules for movement phases applicable to both gestures and the signs of sign languages [31]. Minor differences exist between these approaches. In the linguistic part of the present study, we adopt Kendon's terminology.
According to their meaning or function, gestures are classified into several categories. McNeill [37], broadly following Kendon's classification [27], divides gestures into beats, which punctuate the discourse; deictics, which include pointing gestures; iconics, which are "images of concrete entities and/or actions"; and metaphorics, which show "images of the abstract".
Our study concerns gestures which belong to both the iconic and metaphoric categories. We adopt Kendon's category of quotable gestures, defined as "those standardized gestures which have fairly stable meanings within a given community and which, on the whole, appear to serve in the place of a complete speech act of same sort" [28].
Gesture/action segmentation is necessary to cut streams of motion into single instances that are consistent with the set of initial model hypotheses and that can be used as training sequences for recognition. In computer vision, different techniques are used to prepare data for gesture recognition, and segmentation concerns image-processing methods that extract features [12,13,33,48] representing the spatial structure of gestures. Such features are then exploited to learn the temporal structure of gestures with different methods, e.g. Hidden Markov Models (HMMs) [7,20,50,54], Baum-Welch [3], parametric models [21,56] and others.
When temporal segmentation is needed, different kinds of methods can be used to investigate motion profiles and trajectories to recognize human gestures. A general approach for segmenting actions is based on concatenating action grammars to model transitions in a gesture or between consecutive gestures [53]. Concatenative grammars can be built, for instance, by joining all models in a common start and end node and by adding a loop-back transition between these two nodes; segmentation and labeling of a complex action sequence is then computed as a minimum-cost path through the network using dynamic programming techniques. Some works [34,43] use such networks for action recognition based on HMMs, others [38,48] on Conditional Random Fields (CRFs) or on semi-Markov models [46].
Another strategy for recognizing gestures consists in dividing video sequences into multiple, overlapping segments using a sliding window; classification is then performed sequentially on all the candidate segments, and peaks in the resulting classification scores are interpreted as gesture locations. Sliding windows are used in many template-based representations [17,59], in combination with dynamic time warping (DTW) [13,39] and even grammars [4]. For example, Abdelkader et al. [1] propose a template-based approach using DTW to align the different trajectories using elastic geodesic distances on the shape space; the gesture templates are then calculated by averaging the aligned trajectories.
A common strategy is to use a generic segmentation method based on detecting motion boundaries, then to classify the resulting segments separately. Such motion boundaries are typically defined as discontinuities and extrema in the acceleration, velocity, or curvature of the observed motions. For example, Ogale et al. [40] segment action sequences by detecting minima and maxima of optical flow inside body silhouettes; Zhao et al. [58] calculate velocity and treat local minima in the velocity as gesture boundaries; Wang et al. [51] treat local minima in acceleration as gesture boundaries, allowing them to construct a motion alphabet whose "characters" are then combined using an HMM; Kahol et al. [24] tested a user-centric gesture segmentation algorithm and developed observer profiles based on how individual users segment motion sequences, encoding gesture boundaries as a binary vector of hierarchically connected body segment activities. Boundary detection methods are attractive because they provide a generic segmentation of the video, which does not depend on the gesture classes; some precautions are needed, however, because they are not stable across viewpoints and are easily confused by the presence of multiple, simultaneous movements.
Movement primitives can also be extracted as joint trajectories using Principal Component Analysis (PCA) [18,23]. In Lim et al. [32], each movement primitive is represented and stored as a set of joint trajectory basis functions that are then extracted via a PCA of human motion-capture data. In [2], gestures computed from inertial sensors are defined by hand paths as a discrete time sequence in Cartesian space; these are converted to training functional data by basis function expansions using B-splines (curve fitting), and then Functional Principal Component Analysis (FPCA) is performed on all the training data to determine a finite set of functional principal components (FPCs) that explain the modes of variation in the data.
Sensors other than those exploited in computer-vision or motion-capture-based approaches can also be used: in [57], for example, accelerometers and multichannel electromyography (EMG) signals are used for segmentation.
As we want to validate our gesture characterization in an off-line situation, we decided to exploit a simple but robust boundary method for gesture segmentation based on arm movement considerations (see Sec. 5).

The dataset
The dataset examined is composed of 91 isolated coverbal symbolic gestures. These coverbal gestures are semantically autonomous and cover all DOF of the upper limb (see Sec. 4.1). Some gestures are performed using the entire upper limb, while others employ a subpart only (for example only the fingers), for a total of 150 different gestures reproduced by one person.

The marker-set and the biomechanical model
Gestures are collected using a 3D motion-capture system of digital infra-red cameras, which ensures the reliability of our understanding of the gesture and the effectiveness of the characterization of avatar motion. The system uses hemispherical reflectors glued to the skin and records their trajectories.
A list of cutaneous markers is established in order to model the body segments in three dimensions (marker-set of 90 points, Fig. 1). This list references the anatomical positions that should be used in modelling each segment as a rigid body. Generally, three non-aligned anatomical reference points are sufficient to define a segment. In our model (Fig. 2), the torso, arm, forearm, and hand segments have been defined based on coordinates of spatial perspective using a standardized method. This method enables the creation of three orthogonal axes for each segment coordinate system [15,55]. It involves calculation of the centres of the wrist, elbow, and shoulder joints, as well as those of the cervical and lumbar regions [15]. For the hand, the origin of the coordinate system is the centre of the wrist joint, Y is the unit vector connecting the midpoint of the 2nd and 5th metacarpal heads to the origin, X is the unit vector normal to the plane containing the origin and the 2nd and 5th metacarpal heads, and Z is the vector product of axes X and Y. For the forearm, the origin of the coordinate system is the centre of the elbow joint, Y is the unit vector connecting the centre of the wrist joint to the origin, X is the unit vector normal to the plane containing the origin and the styloid processes of the ulna and radius, and Z is the vector product of axes X and Y. For the arm, the origin of the coordinate system is the centre of the shoulder joint, Y is the unit vector connecting the centre of the elbow joint to the origin, X is the unit vector normal to the plane containing the origin, the epicondyle and the epitrochlea, and Z is the vector product of axes X and Y. For the torso, the origin of the coordinate system is the centre of the cervical joint, Y is the unit vector connecting the lumbar joint to the origin, Z is the unit vector normal to the plane containing the origin, the lumbar joint and the suprasternal space, and X is the vector product of axes Y and Z.
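As an illustration of the segment coordinate systems described above, the following minimal sketch builds an orthonormal frame for the forearm from an origin (elbow centre), a distal joint centre (wrist centre) and two landmarks (ulnar and radial styloids). The function names, the landmark order (which fixes the sign of X) and the re-orthogonalisation step are assumptions made for the example; they are not taken from the standardized method cited in the text.

```python
import math

def sub(a, b):
    """Component-wise difference a - b of two 3-D points."""
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    """Vector (cross) product of two 3-D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0:
        raise ValueError("degenerate vector: markers are aligned or coincident")
    return tuple(c / n for c in v)

def segment_frame(origin, distal_centre, landmark_a, landmark_b):
    """Return unit axes (x, y, z) of a right-handed segment frame.

    y points from the distal joint centre to the origin; x is normal to the
    plane through the origin and the two landmarks; z completes the triad.
    Assumption: x is re-orthogonalised against y in case the distal centre
    does not lie exactly in the landmark plane.
    """
    y = unit(sub(origin, distal_centre))
    normal = cross(sub(landmark_a, origin), sub(landmark_b, origin))
    z = unit(cross(normal, y))   # z = x × y, computed from the raw normal
    x = cross(y, z)              # unit length because y and z are orthonormal
    return x, y, z

# Hypothetical forearm markers (arbitrary units).
elbow = (0.0, 0.0, 0.0)
ulnar_styloid = (-1.0, -10.0, 0.0)
radial_styloid = (1.0, -10.0, 0.0)
wrist = (0.0, -10.0, 0.0)        # midpoint of the two styloids
x_axis, y_axis, z_axis = segment_frame(elbow, wrist, ulnar_styloid, radial_styloid)
```

Swapping the two landmarks flips X and Z, so a real pipeline would need a fixed convention per side of the body.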
The coordinate system of each joint is defined through sets of adjacent segment coordinates, allowing the description of the three-dimensional articulation of the shoulder, the elbow, and the wrist at every moment of the gesture. To establish the kinematics of the joints, we used a sequence of successive rotations around the mobile axes, using Euler angles [55]. The dynamic sequence of rotations enables the definition of joint coordinates through the axes of two adjacent segments: one axis for the proximal segment and another for the distal segment; and a floating axis, perpendicular to the other two.
The various joint movements of the wrist, elbow and shoulder are calculated thanks to this biomechanical model, as are the palmar/dorsal flexion and the adduction/abduction of the wrist. These correspond to the flexion/extension and adduction/abduction of the hand as described in the action schemas (Sec. 4). The extension/flexion and supination/pronation of the elbow correspond, respectively, to the extension/flexion of the forearm and the supination/pronation of the hand in the action schemas (see Sec. 4.2). Finally, shoulder motion is measured in retropulsion/forward flexion, abduction/adduction and internal/external rotation. These correspond, respectively, to the extension/flexion, abduction/adduction and external/internal rotation of the arm in the action schemas.

Linguistic redefinition in light of human-avatar interaction
The recorded coverbal gestures match emblematic quote gestures [27,42], i.e. semantically autonomous gestures, whose significance is independent of the surrounding discourse. The 91 gestures can be divided into a dozen GUs, with the following senses: reject, refuse, despise, discredit, pass, accept, consider something, consider someone, offer, not care, commit, revere. These semantic labels have been tested and validated with a French-speaking population in a previous study [6].
Each GU corresponds to a particular action schema implementing some (or all) of the segments of the upper limb. Action schemas are based on the motion of various DOF of the segments of the upper limb in a specific order. This order emerges from the differences in range of motion among the DOF involved in the schema. Motion is transferred through moments of inertia attached to each DOF and as a function of the (involuntary) conjoint movement of the longitudinal axis (exterior/interior rotation or pronation/supination) associated with any joint with two DOF [9,10,35]. Thus, for the GU "refuse", for example (Fig. 3), the action diagram shows the motion of the hand towards the forearm.

Flow of motion propagation
In the action diagram, the position of the pole of adduction (motion towards the joint on the plane of the palm) determines the direction of motion propagation. If a movement of adduction is in first or second position, the flow of motion is distal-proximal, going from the hand towards the forearm. If adduction is in third position, then it is the result of the first two, and so does not present significant motion. In that case, the gesture is initiated on the forearm and spreads towards the hand in a proximal-distal flow. We thus define two types of GUs. The first 8 GUs in the list above are built on the hand, while the last 4 (offer, not care, commit, revere) are built on the arm.

Action schema of hand motions
The sequence of hand motions is based on a structure such that the motion or position of the first two DOF causes involuntary motion of the third DOF. This third motion is the result either of a biomechanical constraint related to motion around the longitudinal axis (pronation/supination), or of a sequence based on the moment of inertia. In both cases, the poles of motion in third position are completely determinable and follow the first two movements such that their sequence affects the pole of the third motion. So, the sequence ADD.EXTEN leads to involuntary PRONATION, while the reverse order, EXTEN.ADD, leads to SUPINATION [5,6].

Grouping GUs by direction
Tracking the order of the poles in motion is affected by the range of motion, the temporal sequence of the emergence of motion, the initial position and the acceleration, but these criteria, which vary even among themselves, are difficult to rank hierarchically. On the other hand, it is possible to classify GUs on a formal basis by semantic field (Fig. 4).
Initially, it is necessary to determine the spread of motion: either the gesture starts from the hand and motion goes up the forearm, or it starts in the arm and spreads towards the hand (hand and arm in the diagram). For the hand (Fig. 4, left), the initial prono-supination of the gesture may be marked or unmarked. At the next level, we examine prono-supination with respect to the initial position. This gives us 8 manual action schemas. For the arm (Fig. 4, right), we examine the ADD/ABD position or motion of the arm. Subsequently, the 4 GUs of the arm can be distinguished through prono-supination.

Segmentation of gestural signals
Each gestural signal in our database is composed of a sequence: T-pose, gesture, T-pose. Extraction of the gesture requires automatic segmentation. The generally accepted sequence includes 4 phases [11,36]: 1. resting position; 2. preparation (pre-stroke); 3. core (stroke); 4. retraction (post-stroke). This sequence describes the structure of a gesture. However, it is impossible to find an automatic, objective criterion to extract the stroke, the semantically significant part. This is a complex operation even for a human, and remains uncertain [47].
In our case, segmentation is carried out on the basis of morpho-kinetic properties (as defined by Kendon [29]). Indeed, the preparation of motion consists of a ballistic motion that brings the arm(s) to the core of the motion [8]. This ballistic motion involves acceleration followed by deceleration as the final position is approached, then an acceleration and deceleration symmetrical to the first, and a return to the resting position. T-poses are also characterized by acceleration and deceleration of movement.
In order to extract the stroke of each gesture, we consider the absolute value of the derivative of the Y position of the index finger (the index being, in all cases, the body part that moves the most): a minimum of this signal represents the transition between acceleration and deceleration. For automatic segmentation, the stroke is therefore taken to be the part between the minimal phase that precedes the second maximum (which marks the beginning of a stroke) and the minimal phase that follows the penultimate maximum (the end of a stroke) (Fig. 5). A threshold is set to prevent minimal and maximal phases due to noise (small adjustments or preparatory motions) from being considered.
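The segmentation rule described above can be sketched as follows. This is only one possible reading, under stated assumptions: a "maximal phase" is taken to be a contiguous run of frames whose absolute speed is at or above the threshold, a "minimal phase" is the run below the threshold between two such phases, and the stroke boundary inside a minimal phase is placed at its lowest-speed frame. The function names and the threshold value are hypothetical.

```python
def abs_speed(positions):
    """Absolute frame-to-frame difference of a 1-D position signal."""
    return [abs(b - a) for a, b in zip(positions, positions[1:])]

def movement_phases(speed, threshold):
    """(start, end) index pairs, end exclusive, of runs with speed >= threshold."""
    phases, start = [], None
    for i, value in enumerate(speed):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            phases.append((start, i))
            start = None
    if start is not None:
        phases.append((start, len(speed)))
    return phases

def argmin_between(speed, lo, hi):
    """Index of the smallest value in speed[lo:hi] (first one on ties)."""
    return min(range(lo, hi), key=speed.__getitem__)

def segment_stroke(positions, threshold):
    """Return (start, end) indices of the stroke in the speed signal, or None.

    Start: minimal phase preceding the second maximal phase.
    End:   minimal phase following the penultimate maximal phase.
    """
    speed = abs_speed(positions)
    phases = movement_phases(speed, threshold)
    if len(phases) < 3:
        return None  # preparation, stroke and retraction are all required
    start = argmin_between(speed, phases[0][1], phases[1][0])
    end = argmin_between(speed, phases[-2][1], phases[-1][0])
    return start, end

# Toy Y trajectory: T-pose, preparation, pause, stroke, pause, retraction, T-pose.
y_index = [0, 0, 0, 2, 4, 6, 6, 6, 7, 8, 7, 6, 6, 6, 4, 2, 0, 0, 0]
stroke = segment_stroke(y_index, threshold=0.5)
```

On real motion-capture data the signal would normally be low-pass filtered before differentiation, and the threshold tuned to the noise level of the system.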

Segmentation evaluation
To evaluate automatic segmentation methods, it is necessary to compare an automatic segmenter's performance against the segmentations produced by human judges (coders). In general, methods for performing this comparison take the segmentation of a single coder as the only reference [44]. However, this approach assumes that this single coder is unbiased and able to provide a perfect segmentation. In fact, previous works, e.g. [22], showed that inter-annotator agreement between human coders can be rather poor. Thus, an automatic segmenter should be compared directly against different coders [19] to ensure that it does not over-fit to the preferences and biases of one particular coder.
Given our dual aim, to evaluate inter-annotator agreement on the one hand and automatic segmentation on the other, we decided to adopt two methods: Accurate Temporal Segmentation Rate (ATSR) [45] and F-score [49]. ATSR is a time-based metric that measures performance in terms of accurately detecting the beginning and end of the stroke for each gesture signal. F-score provides more information than accuracy and makes it possible to identify types of errors. Three different cases are evaluated with both methods: 1. automatic segmentation is compared with a first annotator, considered as ground truth (case 1); 2. automatic segmentation is compared with a second annotator, considered as ground truth (case 2); 3. the two annotators are compared against one another (case 3).
For each gesture considered, the ATSR was computed as follows: the Absolute Temporal Segmentation Error (ATSE) is evaluated by summing the absolute temporal errors between the ground truth and the result of the algorithm for the start and stop events and dividing this sum by the total length of the gesture occurrence measured from the ground truth, as formalized in Equation 1. Once the ATSEs are calculated, the ATSR metric is computed by subtracting the average ATSE from 1 in order to obtain the accuracy rate, as shown in Equation 2. A perfectly accurate segmentation produces an ATSR of 1.
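Written out from the verbal description above, the two quantities take the following form. The notation is an assumption made for this sketch and is not necessarily that of [45]: for gesture i, s and e denote the start and stop frames, the superscripts GT and A denote the ground truth and the algorithm, and n is the number of gestures.

```latex
\mathrm{ATSE}(i) =
  \frac{\left| s_i^{GT} - s_i^{A} \right| + \left| e_i^{GT} - e_i^{A} \right|}
       {e_i^{GT} - s_i^{GT}}
  \qquad (1)

\mathrm{ATSR} = 1 - \frac{1}{n} \sum_{i=1}^{n} \mathrm{ATSE}(i)
  \qquad (2)
```

Note that ATSE(i) can exceed 1 when the boundary errors are larger than the ground-truth stroke itself, so with this form the ATSR is bounded above by 1 but not below by 0.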
Equation 1 counts differences that occur frame by frame, so an error is taken into account even when annotations differ by just a few frames. To limit this effect, and thus to avoid small ground-truth timing errors producing irrelevant penalties during the computation of the ATSE [45], it is possible to fix a tolerance value α such that if ATSE(i) < α then ATSE(i) = 0 (Equation 3). As a stroke generally lasts about 100 frames, we took α = 0.2. This corresponds to a global difference of α × 100 = 20 frames (around 0.17 s, given that the acquisition rate is 120 frames/s) compared to the duration of the ground truth, which is an adequate choice considering that, on average, it is easy to have 10-frame errors for each start or stop. In case 1 we obtain ATSR = 0.6038, in case 2 ATSR = 0.5857 and in case 3 ATSR = 0.8707. This kind of method lacks completeness, as it does not categorize the errors [52]. In fact, the errors can be categorized into 5 types, as shown in Figure 6.
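The ATSR computation with the tolerance α can be sketched as below. The function names and the toy intervals are hypothetical; only the rule itself (normalised sum of start and stop errors, errors below α set to zero, one minus the mean) comes from the text.

```python
def atse(ground_truth, detected):
    """Absolute Temporal Segmentation Error for one gesture.

    Both arguments are (start_frame, stop_frame) pairs; the error is
    normalised by the ground-truth duration.
    """
    gt_start, gt_stop = ground_truth
    det_start, det_stop = detected
    return (abs(gt_start - det_start) + abs(gt_stop - det_stop)) / (gt_stop - gt_start)

def atsr(ground_truths, detections, alpha=0.0):
    """Accurate Temporal Segmentation Rate: 1 minus the mean ATSE.

    Errors strictly below the tolerance alpha are counted as zero.
    """
    errors = []
    for gt, det in zip(ground_truths, detections):
        error = atse(gt, det)
        errors.append(0.0 if error < alpha else error)
    return 1.0 - sum(errors) / len(errors)

# Toy example with two 100-frame strokes (hypothetical values).
truth = [(100, 200), (50, 150)]
found = [(105, 210), (50, 190)]      # ATSE = 0.15 and 0.40
strict = atsr(truth, found)          # no tolerance
tolerant = atsr(truth, found, 0.2)   # first error falls under alpha

# The tolerance in the text: 0.2 * 100 frames = 20 frames at 120 frames/s.
tolerance_seconds = 0.2 * 100 / 120
```

In the toy example the first gesture (15 frames of total error on a 100-frame stroke) is forgiven by α = 0.2, while the second (40 frames) is still penalised.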
It is, of course, quite important to know whether the automatic segmentation is wrong but the stroke is preserved (Error 2 in Figure 6) or cut out (all other cases in the figure). In order to assess the quality of our segmentation and of inter-annotator agreement, let us consider precision (p) and recall (r) [16,41]: precision is the fraction of detections that are true positives rather than false positives (Equation 4), while recall is the fraction of true positives that are detected rather than missed (Equation 5). In probabilistic terms, precision is the probability that a detection is valid, and recall is the probability that ground-truth data was detected. Writing TP, FP and FN for the numbers of true positives, false positives and false negatives (missed detections):
$$p = \frac{TP}{TP + FP} \qquad (4)$$
$$r = \frac{TP}{TP + FN} \qquad (5)$$
Precision and recall can be combined in the F-score as follows:
$$F_\beta = (1 + \beta^2)\,\frac{p\,r}{\beta^2\,p + r}$$
When the parameter β = 1, the F-score is said to be balanced and is written F_1:
$$F_1 = \frac{2\,p\,r}{p + r}$$
The F_1 score can be seen as a weighted average of precision and recall; it reaches its best value at 1 and its worst at 0. The results obtained are summed up in Table 1. In general, high F_1 values are obtained; r is higher than p in the comparisons with automatic segmentation, meaning that the algorithm returned most of the relevant results, while p is higher for the inter-annotator agreement, meaning that the annotators produced more relevant than irrelevant detections. Results concerning the kinds of errors are presented in Table 2.
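To make the precision/recall reading concrete, the sketch below scores one detected stroke against one ground-truth stroke. The text does not state how true and false positives are counted, so frame-level counting is an assumption here, as are the function names and the example intervals.

```python
def frame_counts(ground_truth, detected):
    """Frame-level (TP, FP, FN) for two (start, stop) intervals, stop exclusive."""
    gt_frames = set(range(*ground_truth))
    det_frames = set(range(*detected))
    tp = len(gt_frames & det_frames)   # frames in both intervals
    fp = len(det_frames - gt_frames)   # detected but not in the ground truth
    fn = len(gt_frames - det_frames)   # ground-truth frames that were missed
    return tp, fp, fn

def f_score(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F_beta); zero when nothing overlaps."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    return precision, recall, (1 + b2) * precision * recall / (b2 * precision + recall)

# A detection that starts 10 frames early but fully contains the stroke.
p, r, f1 = f_score(*frame_counts((100, 200), (90, 200)))
```

In this example recall is 1 (the whole stroke is preserved) while precision drops below 1 because of the 10 extra frames, which is the pattern of recall exceeding precision reported for the automatic segmentation.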
It is worth highlighting that in Cases 1 and 2, Error 2 is the most frequent. This means that the segmentation method preserves the strokes despite the error. Moreover, the least frequent error is the cut stroke (Error 1). We can therefore assume that the segmentation method presented is robust for the analysis of the gestures presented. For the inter-annotator agreement, we note that most errors stem from one annotator cutting the stroke or from one annotator anticipating the other (Error 3).

Action schema component properties
To characterize the action schemas for each of the coverbal gestures recorded, the segmented signals are transformed into kinematic data in accordance with the biomechanical model (Sec. 3.1). The motion of the different degrees of freedom for each joint in the right upper limb (shoulder, elbow, and wrist) is taken into account and normalized temporally on 101 points [14]. Our starting point for this characterization is the assumption that in human-human communication, the motion of DOF is most visible in prono-supination regardless of the type of gesture (performed on the entire upper limb or only one of its segments). We therefore decided to focus the analysis, initially, on the signals that were temporally aligned with the prono-supination zone, which contains the widest variation. Biomechanical parameters, such as the initial and final positions of each DOF and their maximal range, were taken into account (Fig. 7). We analyzed gestures involving the motion of all segments of the upper limb (33 of the 91 gestures captured). The stages considered in the decision tree (diagram in Fig. 4) move from i) the 1st node (manual or brachial flow) to ii) the 4th node (separation of gestures by prono-supination).
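Temporal normalization on 101 points can be sketched as a resampling of each joint-angle signal onto 0% to 100% of the stroke. The text does not say which interpolation scheme is used, so linear interpolation is an assumption here, and the function name is hypothetical.

```python
def normalize_time(signal, n_points=101):
    """Resample a 1-D signal to n_points samples by linear interpolation.

    Sample k of the output corresponds to k / (n_points - 1) of the
    original duration, so the first and last values are preserved.
    """
    if len(signal) < 2:
        raise ValueError("at least two samples are needed")
    last = len(signal) - 1
    resampled = []
    for k in range(n_points):
        t = k * last / (n_points - 1)   # fractional index into the input
        i = int(t)
        if i >= last:
            resampled.append(float(signal[last]))
        else:
            frac = t - i
            resampled.append(signal[i] * (1 - frac) + signal[i + 1] * frac)
    return resampled

# A three-sample angle trace (degrees) stretched onto 101 points.
angle = normalize_time([0.0, 10.0, 20.0])
```

After this step, strokes of different durations can be compared or averaged sample by sample.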
The first stage involves determination of the flow of motion, from the arm (proximal-distal) or from the hand (distal-proximal). We therefore calculated: 1) for the initial position, the moment at which the minimum and maximum of each DOF appear within the automatically cut stroke; 2) the temporal difference between the minimum and the maximum of the DOF from one segment to another (arm [offer, not care, commit and revere], forearm and hand [for all the other gestures]). Within the latter calculation, the choice of the minimum or maximum value for one DOF or another corresponds to the initial position, and is therefore a priori opposed to the pole seen in motion during the stroke. If hand motion is EXTEN (positive value), then the initial position corresponds to a minimum (flexion, negative value). Thus, for example, for the top line of the diagram in Fig. 3, which illustrates the poles in motion in the "refuse" gesture, ADD.EXTEN > PRO, the initial position chosen was the maximum value of ADD/ABD, the minimum value of FLEX/EXTEN and the minimum value of SUPI/PRO. We set a minimal threshold of 10 frames, corresponding to 2 running video frames at 25 frames/s, for a temporal difference that enables the determination of the flow.
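The flow test described above can be sketched as follows. The interpretation is an assumption: the frame at which a DOF leaves its initial position is approximated by the frame of the extremum opposite to its moving pole, and the segment whose extremum comes first is taken to start the motion. The function names and the return labels are hypothetical.

```python
def initial_position_frame(signal, moving_pole_positive):
    """Frame of the initial position within the stroke.

    If the DOF moves towards its positive pole (e.g. EXTEN), the initial
    position is the minimum of the signal; otherwise it is the maximum.
    """
    pick = min if moving_pole_positive else max
    return pick(range(len(signal)), key=signal.__getitem__)

def flow_direction(hand_frame, arm_frame, threshold=10):
    """Classify the flow from the delay between hand and arm onsets (frames)."""
    delay = arm_frame - hand_frame
    if abs(delay) < threshold:
        return "undetermined"          # below the 10-frame threshold
    return "distal-proximal" if delay > 0 else "proximal-distal"

# Hypothetical traces: the hand extends from frame 2, the arm abducts from frame 20.
hand_flex_ext = [-5, -6, -8, -4, 0, 6, 12] + [12] * 20
arm_abd_add = [10] * 20 + [9, 7, 4, 2, 1, 0, 0]
hand_onset = initial_position_frame(hand_flex_ext, moving_pole_positive=True)
arm_onset = initial_position_frame(arm_abd_add, moving_pole_positive=False)
flow = flow_direction(hand_onset, arm_onset)

# Threshold check: 10 frames at 120 frames/s vs 2 video frames at 25 frames/s.
threshold_seconds = (10 / 120, 2 / 25)
```

With a plateau at the extremum, the first frame of the plateau is returned; a real implementation might prefer the last one, which is closer to the actual onset of motion.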

Discussion
Determination of flow using this method (for the 33 gestures tested, covering the 12 GUs presented in Sec. 4; we underline that each GU occurred between 2 and 3 times) conforms to expectations in 87.88% of cases. Three of the four cases not validated were below the 10-frame threshold and therefore show no determinable flow; a further case (a realization of "revere") shows a flow reverse to expectations. The fourth stage of characterization, wherein poles in motion are used to determine the action schema, was conducted with two types of data (a. and b. below). Initial calculations concern the average of two or three realizations per GU (33 gestures in total), thus covering the 12 averaged GUs for which moving poles and semantic labels are already known. We then determine the maximum range for each DOF, either a. within the boundary of prono-supination as shown in Fig. 7, or b. over a wider range, starting from the stroke and calculating the difference between the final and initial position of each DOF. We thus obtained the motion poles for all DOF that characterize the GUs, that is 60 DOF for 12 GUs. Results for the first type of data (a. in the demarcation of prono-supination) show a recognition rate of 76.67%. The other option (b. in the stroke with the difference in initial and final position) shows a much better characterization rate: 90%. Out of 60 DOF, the opposite pole appears only for 6 expected poles. In both options, with an average of 6 DOF measured per GU, the pole that was most prone to error was the hand in ABD/ADD (36% error in a., 67% error in b.). This pole also shows the smallest range (25° and 35°).
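The reported rates can be cross-checked against the counts given in the text. In the sketch below, the 29 and 54 follow directly from the stated 4 unvalidated cases out of 33 and 6 opposite poles out of 60; the count of 46 for option a. is not stated and is inferred here from the 76.67% rate. The helper name is hypothetical.

```python
def percentage(count, total):
    """Rate in percent, rounded to two decimals as in the text."""
    return round(100 * count / total, 2)

flow_rate = percentage(33 - 4, 33)       # 4 of 33 gestures not validated
option_b_rate = percentage(60 - 6, 60)   # 6 of 60 poles opposite to expectation
option_a_count = round(0.7667 * 60)      # inferred, not stated in the text
option_a_rate = percentage(option_a_count, 60)
```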
Intermediary stages of the characterization (2 and 3 in the diagram of Fig. 4) involve: marking of the initial positions of prono-supination and the ABD vs. ADD motion of the arm (stage 2); and determination of the initial position, of the identical vs. opposed motion of prono-supination, and of the motion pole between PRO and SUPI (stage 3).
The characterization of ABD vs. ADD of the arm and of PRO vs. SUPI is unproblematic. In contrast, marking the initial positions of prono-supination (stage 2) does not give the expected results. In this case, only the range difference of the interior/exterior rotation between the beginning and end of the stroke is significant. For a confidence interval of 95%, there is no overlap between "reject/refuse" on the one hand and "despise" on the other. For the trio "pass/accept/discredit", non-overlap was also checked. Thus, the DOF marking criterion (PRO/SUPI) of stage 2 should be modified into a differential of the range of rotation in relatively marked EXT/INT. For stage 3, the identity or opposition between the initial position and prono-supination is a good criterion, since, with a confidence interval of 95%, there is no overlap between "reject/refuse/despise" on the one hand, and "consider something" on the other. The same is true between "pass/accept/discredit" on the one hand, and "consider someone" on the other.
All in all, the only stages which are not fully satisfactory are stages 1 (with a single case of inversion for "revere") and 4 (with 90% of expected poles). Intermediary stages are 100% reliable.

Conclusion
In this study, we have presented a method for the characterization of 12 gestural units involving the upper limb. A motion-capture system was used to build a reliable gesture database to prove that the meaning of different Gestural Units can be defined on the basis of forms along the upper limb, using multiple reference points that are not limited to body-orientation, but oriented via each of the segments (hand, forearm, arm). For this purpose an automatic segmentation was exploited. Tests of the segmentation protocol demonstrate its robustness in the identification of the stroke necessary for the characterization of gestures.
Simple characterization methods fulfill the requirement to associate each stage with formal semantic tagging. This is so since GUs that share the same poles differ only in the sequence in which these poles appear in the action schema. However, we have yet to characterize 4 of these. In this study, we cannot separate 'refuse' from 'reject' and 'accept' from 'pass'. Still, both groups can be labeled: on the one hand, negative positioning with respect to things, and on the other, the same type of positioning, only positive. Consequently, all gestures are semantically associated with a variable granularity.
As this characterization has been conducted in light of a human-avatar interaction with an off-line system, the next step is to test the presented method in an on-line set-up using a simpler capture system, namely a Kinect, on different subjects reproducing the presented kind of gesture, including in other forms, for example involving only a single moving segment (e.g., the hand).

Figure 3. Action schema for the GU "refuse". This gesture begins with a movement of the hand (Adduction, 1). The order of the motion is numbered. The photograms illustrate the execution of the gesture at different moments of what is gathered in the action schema. The first photogram captures the preparation phase before the stroke.

Figure 4. Diagram presenting the formal presentation of gestures according to semantic characterization.

Figure 5. On the left, the signals of the gesture "revere".On the right, the signals for "accept".From top to bottom: positions of the index for Y, speed (derivative) and absolute value of the speed with automatic individuation of the stroke between two green points.

Figure 7. The signals for different segments of the upper limb. For the hand, supination/pronation in red, abduction/adduction in sky blue and dotted purple for flexion/extension. For the forearm, extension/flexion in dotted dark green. For the arm, internal/external rotation in dotted dark blue, abduction/adduction in dotted green and extension/flexion in yellow. Vertical lines indicate the highest variation in prono-supination.

Table 1. Results obtained for the three cases of study.

Table 2. Errors occurred in the cases of study.