The Speckled Cellist: Classification of Cello Bowing Techniques using the Orient Specks

Cello bowing techniques are classified by applying supervised machine learning methods to sensor data from two inertial sensors called the Orient specks – one worn on the playing wrist and the other attached to the frog of the bow. Twelve different bowing techniques were considered, including variants on a single string and across multiple strings. Results are presented for the classification of these twelve techniques when played singly, and in combination during improvisational play. The results demonstrated that even when limited to two sensors, classification accuracy in excess of 95% was obtained for the individual bowing styles, with the added advantages of a minimalist approach.


INTRODUCTION
Stringed instruments by definition produce sounds from vibrating strings, and can be characterised by the way the strings are made to vibrate: by plucking them (harp, mandolin, sitar), by bowing (violin, cello, sarangi), or by striking them (piano, santoor). This paper is concerned with the cello, a stringed instrument played with a bow (which is a stick with many hairs stretched between its ends). Advanced string competency involves mastering different bowing techniques. The automatic classification of the bowing techniques has attracted interest because of its technical complexity, its possible future applications in musical pedagogy, and its use in the hypercello for triggering and modifying synthesised sounds that accompany the acoustic cello.
Previous research on this topic has been limited to a small number of bowing techniques, has used special bows embedded with sensors, and has been restricted to classifying techniques singly or in a predetermined order, without tackling combinations during improvisational play. The proposed approach uses the Orient specks [13], which combine a 3-axis accelerometer, gyroscope and magnetometer with wireless communication in a single unit measuring 36x28x11 mm and weighing 23 grams (including the battery). Orients have the advantage of being unobtrusive when worn on the playing wrist, and can be attached to the frog of the bow without affecting its balance during playing. A novel set of d attributes was constructed from the sensor data for this task with the aim of making the data separable in d-dimensional space. Results are presented using a classifier based on a Support Vector Machine (SVM) model for three data capture sessions following the minimalist approach, which included twelve bowing techniques and addressed string-crossings and improvisational playing by experienced cellists, each with more than fifteen years of experience.
In the rest of this paper, Section 2 gives a short introduction to different bowing techniques and previous research in this area with an analysis of their strengths and weaknesses; Section 3 sets out the methodology adopted in this paper; Sections 4 and 5 present results and conclusions, respectively.

BACKGROUND AND RELATED WORK

Background
The drawing of the hair of the bow across the strings produces vibrations due to the spontaneous jerking motion called the stick-slip phenomenon. Different bowing techniques [7][8][9] produce characteristic sounds, and the following were considered in this work:


Legato: smoothly connected, without interruption between the notes, whether in one or several bow strokes.
Staccato: a non-legato, martelé-type short bow stroke played with a stop; a detached, well-articulated stroke.
Martelé: a percussive bow stroke characterised by its sharp initial accent and post-stroke articulation.
Spiccato: in eighteenth-century terminology, a style of bowing which produces a dry, detached sound, not necessarily executed off the string. In the nineteenth century, it came to mean a relatively slow, bouncing stroke.
Col legno: to strike the strings with the stick of the bow as opposed to sliding the bow across the string.
Ricochet: involves at least two notes being played in the same 'bow-stroke' (either up or down).
Each of the bowing techniques was played in two variants, except Col legno and Ricochet, for which only the single-string variant was considered:
1. Single: the technique was applied on a single string of the cello.
2. Crossing: the bow moved over multiple strings of the cello, i.e., the bow "crossed" strings.

Related Work
The CyberViolin project [1] presented real-time classification of violin bow strokes such as détaché, martelé, staccato, spiccato, and legato, using an electromagnetic motion tracking system to capture raw gesture data. The data was analysed to extract stroke features and fed to a decision tree for training and classification. An accuracy of 73% was achieved using a set of sparse features such as the length of the bow stroke and its average speed. The accuracy was improved by adding new features such as frequency of bow change, acceleration or deceleration within a stroke, continuity of motion between strokes, bow position (middle, upper, lower), number of changes in a single coordinate (stroke similarity), and the lack of movement within a stroke (stoppage).
Schedel and Fiebrink [2] constructed a real-time cello bow articulation classification system using a bow instrumented with sensors. The commercial K-Bow [5] measures bow acceleration and tilt, bow hair tension, grip pressure and surface area, horizontal distance between frog and tip, vertical distance between fingerboard and bridge, and the tilt of the bow relative to the instrument. A total of eighty features were extracted from this sensor data for the classification of legato, marcato (martelé), tremolo, battuto (col legno), ricochet, hook and spiccato bowing styles. The Wekinator software tool was used for real-time, interactive machine learning [6] with the following inputs: each K-Bow sensor's mean, minimum, and maximum value over a sliding window; the mean, minimum, and maximum of the first- and second-order differences within the same window; and the raw sensor value sampled once per window.
Bevilacqua et al. at IRCAM in Paris developed the augmented violin, an acoustic violin with added sensing capabilities to measure the bow acceleration. A k-NN (k-Nearest Neighbour) classifier was used to recognise détaché, martelé and spiccato bowing styles. Young [4] used a carbon fibre violin bow augmented with force, inertial, and position sensors to record bowing gestures on an electric violin. Principal components were computed for the data set comprising the downward and lateral forces; x, y, z acceleration; and angular velocity about the x, y, and z axes. A k-NN classifier was employed, using the principal components as inputs, to classify six common bowing techniques: accented détaché, détaché lancé, louré, martelé, staccato and spiccato.
The research presented in this paper pushes the state-of-the-art in a number of ways. Firstly, a wider variety of bowing techniques are classified including the legato, staccato, spiccato, tremolo, martelé, col legno and ricochet. Furthermore, the "crossing" of strings is also considered, i.e., the single string and crossing string variants are considered separately for the different bowing techniques where applicable. The classification of such a comprehensive list of bowing articulations has not been attempted before.
Secondly, the approach adopted in this paper is the least intrusive to date compared to other methods: two wearable Orient wireless sensors are used, one on the wrist of the playing hand and the second attached under the frog of the bow, i.e., this could be any bow and does not require a specially adapted one. Such a minimalist approach is in sharp contrast to the K-Bow, which is festooned with a number of sensors: a three-axis accelerometer located inside the frog senses tilt and acceleration of the bow; changes in the grip pressure and surface area of the cellist's bow hand are sensed; an angle-sensitive pressure sensor located in the junction between the bow hair and the frog measures changes in the tension of the bow hair as the cellist plays; a small circuit board beneath the fingerboard of the instrument creates an RF field and an infrared modulated wide-field light cone, whose interactions with the loop antennas inside the bow stick and with the infrared detector inside the frog allow the measurement of the bow position and angle relative to the instrument.
Thirdly, despite the minimalist approach, the classification method adopted in this paper yields an impressive accuracy of up to 98.33% for the individual bowing techniques, and furthermore the method has been applied for the first time towards classifying the bowing techniques in improvised pieces: for constrained improvisation and free improvisation.

Data Capture Sessions
The cellist participating in the data capture session wore an Orient speck on the wrist of the playing hand, and another speck was attached to the frog of the bow. The motion of the bow as the playing hand manoeuvres it is central in characterising the bowing technique. This determined the placement of specks on the bow and on the playing hand (as close as possible to the point of contact with the bow). The exact position of the specks, as shown in Figure 1, was fine-tuned by the cellists to minimise any interference when playing the instrument. Each data capture session consisted of three parts: first, each of the twelve bowing techniques ('Single' and 'Crossing' variants for each of Legato, Staccato, Martelé, Tremolo, and Spiccato, and 'Single' only for Col legno and Ricochet) was played continuously for between 20 and 25 seconds; next, a constrained improvisation piece was played, consisting of a random concatenation of the twelve bowing techniques rather than a recognisable musical piece; finally, the cellist was asked to play freely an improvised piece of music chosen to contain the bowing techniques. The latter two parts each lasted between 120 and 150 seconds. The constrained improvisation piece is a halfway house between the individual bowing styles, where the playing is well pronounced, and the freely improvised musical piece, in which the demarcations between the techniques are fluid and one technique smears into another. The three data capture sessions considered in this paper were also videotaped for future reference. The cellists participating in the study were categorised as expert (Subjects 1 and 2) and intermediate level (Subject 3), to validate the robustness of the method for different playing abilities and styles.

Data Pre-processing
The Orient data stream contains sensor values from a three-axis gyroscope, accelerometer and magnetometer, produced at a sample rate of 100 Hz. As a first step in the pre-processing stage, the magnetometer data was removed because the direction of the magnetic field, or changes to it, does not add any relevant information for describing the movement of the bow. Next, outlier values were removed manually by visual inspection. The data from all the Orients were packed into a vector for each segment so as to represent the capture as an n x d data matrix (footnote 1). This is a standard representation of datasets for data mining tasks and is suitable for use with the Weka tool [10]. Missing sensor values were ignored, but a segment was only considered if no more than 25% of the data was missing from either of the two Orients. It was established empirically that segments with at least 75% of the sensor values from each of the Orients encoded sufficient information for classifying the segment correctly using the methodology described in this paper.

Segmentation
For each capture, the Orient sensor data and its corresponding video were aligned with a precision of 0.05 seconds using their respective timestamps. The Orient data was next divided into segments based on a time window, termed the 'segment duration' (sd). For example, sd = 1.5 corresponds to dividing the actions into 1.5 second segments. The impact of sd on the classification accuracy (%) was studied and the results are summarised in Figure 2. A value of sd between 1 and 2 seconds does not have a significant effect on the accuracy. The final choice of 1 second was a compromise between the ability to pick out short events of interest during the improvised pieces and the ability to encode as much of the macroscopic information in the pieces as possible. For example, sd = 1 will result in 25 training examples for a 25 second capture for LS (Legato on a single string).
Footnote 1: n is the number of segments and d is the number of attributes.
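The windowing and the 25% missing-data rule from the pre-processing stage can be illustrated with a minimal sketch. It assumes each Orient stream is a plain Python list with one entry per 100 Hz sample and `None` marking a missing value; the function names, the `None` convention, and the dropping of a trailing partial window are illustrative assumptions, not details taken from the paper.

```python
SAMPLE_RATE_HZ = 100  # Orient sample rate stated in the text


def segment_stream(samples, sd=1.0, rate_hz=SAMPLE_RATE_HZ):
    """Split a list of samples into consecutive windows of sd seconds.

    A trailing partial window is dropped (an assumption of this sketch).
    """
    window = int(round(sd * rate_hz))
    n_full = len(samples) // window
    return [samples[i * window:(i + 1) * window] for i in range(n_full)]


def has_enough_data(segment, max_missing=0.25):
    """True if no more than max_missing of the samples are missing (None)."""
    missing = sum(1 for s in segment if s is None)
    return missing <= max_missing * len(segment)


def usable_segments(wrist, bow, sd=1.0):
    """Pair wrist and bow windows, keeping pairs where both specks pass."""
    pairs = zip(segment_stream(wrist, sd), segment_stream(bow, sd))
    return [(w, b) for w, b in pairs
            if has_enough_data(w) and has_enough_data(b)]


# A 25 second capture at 100 Hz with sd = 1 gives 25 candidate segments.
wrist = [0.0] * 2500
bow = [0.0] * 2500
segments = usable_segments(wrist, bow, sd=1.0)
```

With sd = 1 the example reproduces the count given in the text: 25 training examples for a 25 second capture.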

Choice of Classifier
The choice of SVM with a normalised polynomial kernel was based on a systematic comparison of different classifier models using the Weka tool [10]. Table 2 summarises the evaluation results using 10-fold cross validation for an artificially shuffled dataset produced by randomly concatenating 1-second segments of individual bowing techniques. Across all the artificially shuffled datasets, and using tuned versions of the classifiers, SVM with a normalised polykernel was consistently the best performer.
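For readers unfamiliar with the kernel, the following sketch shows the general idea of normalising a polynomial kernel so that every instance has unit self-similarity. The base kernel used here, (x·y + 1)^e, is an assumption made for illustration; the exact definition used by Weka's normalised PolyKernel should be checked against the Weka documentation.

```python
import math


def poly_kernel(x, y, e=2.0):
    """Polynomial kernel of order e.

    Assumption: the base kernel includes the +1 lower-order term.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + 1.0) ** e


def normalised_poly_kernel(x, y, e=2.0):
    """k(x, y) / sqrt(k(x, x) * k(y, y)), so that k(x, x) == 1."""
    return poly_kernel(x, y, e) / math.sqrt(
        poly_kernel(x, x, e) * poly_kernel(y, y, e)
    )
```

Normalisation removes the effect of the overall magnitude of an attribute vector, so similarity depends on the direction of the vectors in the kernel's feature space.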

Choice of Features
Six features were selected and their values calculated along the three axes using the sensor data from the two Orient specks, resulting in thirty-six (6 x 3 x 2) attributes. Given the variety of techniques, including on-the-string and off-the-string bowing and those involving bow tilts, it was important to consider sensor data in all three dimensions. The absolute values of the accelerometer and gyroscope readings were used, as the bowing technique may be applied either in an Up-Bow fashion ('pushing' the bow so that its point of contact with the string moves from the tip towards the frog) or in a Down-Bow fashion (drawing the bow so that its point of contact with the string moves from the frog end towards the tip). The features were selected based on published sources on bowing techniques ([8][9][12]) and actual observations, and are listed below:
1. Average acceleration.
2. Variance in the acceleration.
3. Average smoothness of acceleration (mean of the first derivative of the accelerometer readings in a segment).
4. Average angular velocity (mean of the gyroscope readings in a segment).
5. Variance in angular velocity (variance in the gyroscope readings in a segment).
6. Average angular acceleration (mean of the first derivative of the gyroscope readings in a segment).
The utility of each feature was tested by plotting specific parts of the data, e.g., variance in the acceleration was hypothesised to distinguish between Legato and Martelé, as these techniques have different acceleration profiles due to their articulation (see the scatter plot in Figure 3). Furthermore, a Voronoi tessellation of the attribute space was induced by applying a 1-NN classifier model. The minimum classification accuracy on 10-fold cross validation was 90%, which indicates that the data was indeed well separated in the attribute space. SVMAttributeEvaluation was used in Weka to better understand the relative importance of the six features. The results suggest that different features were useful for distinguishing between different types of bowing techniques, but they were inconclusive as to which one or two of these features had greater importance overall. It was also possible to choose a subset of attributes, by not calculating some features along a specific axis, without compromising classification performance. However, such attribute selection did not generalise well, as an optimal subset of attributes for one cellist was not necessarily optimal for the others.
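The six features described in this section can be sketched as follows. The sketch assumes a segment from one speck is a list of `(ax, ay, az, gx, gy, gz)` tuples with missing samples already dropped, that the first derivative is approximated by finite differences of the absolute readings, and that the population variance is used; these choices, the attribute ordering, and all function names are illustrative assumptions.

```python
from statistics import fmean, pvariance


def axis_features(values, rate_hz=100):
    """Mean, variance and mean first derivative of the absolute readings."""
    mags = [abs(v) for v in values]
    # Finite-difference approximation of the first derivative (assumption).
    diffs = [(b - a) * rate_hz for a, b in zip(mags, mags[1:])]
    return [fmean(mags), pvariance(mags), fmean(diffs) if diffs else 0.0]


def speck_features(segment):
    """18 attributes for one speck: 3 statistics for each of 6 channels.

    Channels 0-2 are accelerometer axes, channels 3-5 are gyroscope axes.
    """
    feats = []
    for channel in range(6):
        feats.extend(axis_features([sample[channel] for sample in segment]))
    return feats


def segment_attributes(wrist_segment, bow_segment):
    """36 attributes for one segment: wrist speck followed by bow speck."""
    return speck_features(wrist_segment) + speck_features(bow_segment)


# One second of constant (hypothetical) readings from each speck.
constant = [(1.0, -1.0, 2.0, 0.0, 0.5, -0.5)] * 100
attributes = segment_attributes(constant, constant)
```

Appending the class label to each 36-element vector gives one row of the n x d data matrix used for training.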

Artificially Shuffled Datasets
We considered two "artificially shuffled" datasets for each of the three subjects: 'only single string variants', containing the 7 styles that were executed on a single string, and 'full dataset', containing all 12 styles (Table 1). Each of these datasets had 36+1 attributes including the class label. The mode of evaluation was set to 10-fold cross validation in all cases, as this is one of the most robust techniques for evaluating a classification algorithm. The two parameters of the Support Vector Machine with a Normalised PolyKernel, viz. the complexity parameter (c) and the order of the PolyKernel (e), were set for each dataset by a grid search, and the pair (c,e) that gave the best classification performance was selected. Ties were resolved by taking the lower value in order to guard against overfitting the training data. The weighted average of the area under the Receiver Operating Characteristic (ROC) curve is reported across the classes in the dataset, weighted by the number of instances in each class; it gives the performance of a classifier without regard to class distribution or error costs [11]. The weighted average of the F-measure (the harmonic mean of recall and precision) is also reported across the classes, weighted by the number of instances in each class; a high value of F-measure indicates high values of both recall and precision, which are important measures for evaluating classification performance.

Only Single String Variants
There was only one misclassification for Subject 1: a staccato was classified as a martelé. This can be explained as these two styles are quite similar: staccato is essentially a version of martelé executed with separation but with a less aggressive articulation [12]. In the case of Subject 2, there were three errors, all of which were in distinguishing staccato and legato (2 staccatos were classified as legato and 1 legato was classified as a staccato). Legato and staccato are both on-the-string bowing techniques. However, the key difference is that the former is non-articulated whereas the latter is a well-articulated technique. Thus, when the articulated (or non-articulated) part of the stroke does not fall within the segment, it may be quite difficult to distinguish between these two styles. Finally, there were no errors observed for Subject 3.
Overall, the classification performance as indicated by the three metrics (Accuracy, ROC Area and F-measure) was impressive and the small number of errors was explainable.
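The weighted F-measure reported for these datasets can be computed from a confusion matrix as in the sketch below. The nested-dictionary layout and the example counts are hypothetical and are not the paper's data; they only mirror the kind of single staccato-to-martelé confusion described above.

```python
def weighted_f_measure(confusion):
    """Class-size-weighted F-measure.

    confusion[actual][predicted] holds the number of instances of class
    `actual` that were labelled `predicted`.
    """
    labels = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    weighted = 0.0
    for c in labels:
        tp = confusion[c].get(c, 0)
        actual = sum(confusion[c].values())
        predicted = sum(confusion[a].get(c, 0) for a in labels)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        if precision + recall:
            f = 2 * precision * recall / (precision + recall)
        else:
            f = 0.0
        weighted += f * actual / total
    return weighted


# Hypothetical counts: one staccato segment labelled as martelé.
example = {
    "staccato": {"staccato": 19, "martele": 1},
    "martele": {"martele": 20},
}
score = weighted_f_measure(example)
```

Because each class's F-measure is weighted by its share of the instances, a rare class with poor recall lowers the score less than a common one.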

Table 4: Classification performance on Full Dataset
The distribution of the instances across the 12 classes (single and crossing string bowing techniques) for each subject was as follows:
Subject 1: 20 each from the 12 classes.
The results presented in Table 4 illustrate that detecting string-crossings is challenging; however, there have been notable results. The errors for each subject are summarised below:
Subject 1: There were 4 misclassifications: a staccato crossing was classified as a martelé crossing; a spiccato crossing was classified as a spiccato single, an error in detecting the crossing; and the other two errors are inexplicable and were perhaps caused by noise in the sensor data.
Subject 2: There were fourteen errors, most of which (ten) were related to detecting string-crossings. Staccato and Spiccato share quite a few similarities, especially in terms of articulation, and this is likely to have been the cause of confusion. An explanation for the misclassification of Staccato as Legato was provided earlier in this section. Finally, the error in classifying a Spiccato Crossing as a Legato Crossing is inexplicable and may be attributed to noise in the data.
Subject 3: There were thirteen errors, mainly due to string-crossings. The other errors were in distinguishing between non-articulated (Legato) and articulated strokes (Staccato and Martelé). There was also 1 error in distinguishing between Spiccato and Martelé. These two styles have similar forms of articulation, giving rise to the confusion.
Footnote 4: Refer to Table 1.
In summary, string-crossings result in more misclassifications, possibly because of the segmentation procedure. If the part of the action where the bow "crossed" strings does not fall within the segment, it is virtually impossible to detect the crossing variant. Also, dealing with both variants introduces more noise in the data, which may also lead to a dip in performance.

Improvised Playing
The SVM with a normalised PolyKernel was trained on the corresponding 'Artificially Shuffled' dataset for Subject 2 and Subject 3. The parameter setting used was the one that worked best for the corresponding 'Artificially Shuffled' dataset under 10-fold cross validation. The "Improvised" dataset was loaded as the test set in Weka and had no class labels. The classifier output was obtained, and the prediction for each segment, together with the probability distribution among classes for that segment, was stored. Using the classifier output, "soft" predictions were made, i.e., each segment was marked with the class predicted by the classifier and with the second most likely class based on the probability distribution for that segment. A threshold t was set such that, for each segment, every class label with a probability p > t was produced as an output. The threshold t was controlled so that in most (or all) cases only two predictions were allowed, i.e., the actual prediction produced by the classifier and the second most likely class label from the probability distribution in the Weka classifier output. "Soft" predictions were made because of the inherently subjective nature of the domain: the boundaries and definitions can often be fluid, and the annotations rely on interpretation by musicians.
The ground truth for assessing the accuracy on improvised pieces was obtained from the cellists themselves. A segmented version of the video of the improvised piece was provided to the cellist, who was then asked to annotate each segment. The cellists were advised to mention both techniques if a combination was observed in a segment, or if there was any doubt about the exact bowing technique applied in the segment.
[...] 6 show that the majority of the errors were made in detecting string-crossings, and that the classification performance increases significantly when these are ignored and only the main technique is checked, e.g., LS and LC are both Legatos and so are treated as being the same. There were certain errors that were not observed in the artificially shuffled datasets, such as in detecting Ricochet and Col legno. These may be mitigated by using a larger training set to capture the fluidity and different varieties of the same technique. The other errors have either already been explained in the earlier subsection or were inexplicable, perhaps due to noise in the data. The results for Subjects 2 and 3 are summarised in Figure 4.
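The "soft" prediction rule and the relaxed scoring that ignores string-crossings can be sketched as below. The sketch assumes the classifier output for a segment is a dictionary of class probabilities, that class codes follow the LS/LC pattern (technique letter plus S or C), and that a segment counts as correct when any predicted label matches any annotated label; the code "MS" and the matching rule are assumptions made for illustration.

```python
def soft_prediction(probabilities, t=0.2, max_labels=2):
    """Top prediction plus the runner-up if its probability exceeds t."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    labels = [ranked[0][0]]  # the classifier's own prediction is always kept
    labels += [label for label, p in ranked[1:max_labels] if p > t]
    return labels


def main_technique(label):
    """Drop the single/crossing suffix, e.g. 'LS' and 'LC' both become 'L'.

    Assumes codes end in 'S' (single) or 'C' (crossing).
    """
    if len(label) > 1 and label[-1] in "SC":
        return label[:-1]
    return label


def is_correct(predicted, annotated, ignore_crossing=False):
    """True if any predicted label matches any annotated label."""
    if ignore_crossing:
        predicted = [main_technique(p) for p in predicted]
        annotated = [main_technique(a) for a in annotated]
    return any(p in annotated for p in predicted)


# Hypothetical probability distribution for one improvised segment.
probs = {"LS": 0.5, "LC": 0.3, "MS": 0.2}
labels = soft_prediction(probs, t=0.25)
```

Raising t makes the second label rarer, so the threshold controls how often a segment receives two predictions rather than one.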

DISCUSSION
The approach adopted in this paper is a minimalist one, identifying the lowest number of easy-to-use, wearable sensors which can be used to classify successfully twelve bowing styles. The classification based on SVM has for the first time been extended to classifying bowing styles in improvisational playing of the cello. The similarity between the results obtained in case of the expert (Subject 2) and those for the intermediate level cellist (Subject 3) validates to a limited extent the robustness of the method and indicates the generalisability of the approach. We also evaluated the performance on datasets obtained from the same cellist but in different scenarios such as playing with a different bow and in these cases as well, the results remained consistent.
The task of identifying bowing techniques is a challenging one because of the fluid nature of music in general. The differences between the individual bowing techniques may not be concrete, and a certain degree of overlap is expected among the various techniques. Therefore, methodologies designed to identify bowing techniques are expected to have limitations to varying degrees. The evaluation of such methods also poses a challenge, because purely data-driven approaches are not well suited: they fail to capture the subjective nature of musical evaluation. In light of this observation, we adjusted our evaluation procedure on the improvised pieces. One must realise that in such a domain, absolute precision, even if achievable, may not be desirable. Alternative measures of performance such as inter-annotator agreement may serve to assess the quality of such methods more accurately.
The main limitation of this work is the segmentation procedure. However, an automatic segmentation procedure would require accurate velocity readings along the axes. Bow strokes start and end at velocity zero crossings and not at acceleration zero crossings. Hence, it is not possible to segment the bow strokes accurately based on accelerometer readings alone. The approach of using the zero-crossings in accelerometer readings to segment the individual strokes has not been entirely successful. The large variety of bowing techniques considered in this work (both on-the-string and off-the-string bowing techniques) also poses a challenge for automatic segmentation, e.g., spiccato (off-the-string) will not have velocity zero crossings along the same axis as legato (on-the-string). Hence, this approach would also have to be modified to account for such variability.
In conclusion, twelve bowing styles have been classified with above 95% accuracy for each of the three subjects despite adopting a minimalist approach. The results for the improvised pieces, and their analysis, also demonstrate the potential of this methodology, especially for use in real-world applications such as data-driven pedagogical approaches to support the learning of stringed instruments.
Future work may involve developing an automated segmentation procedure, which, based on our observations, would significantly improve the accuracy of this system, and conducting a deeper investigation into the generalisability of the approach to other bowed stringed instruments. The advantages of this method may then be fully realised in applications such as interactive musical performances and on-the-fly musical compositions.
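To make the point about velocity zero crossings concrete, the sketch below splits a signal into strokes wherever its sign changes. It assumes a bow velocity signal along the bowing axis is already available as a list of floats, which, as noted above, cannot be obtained accurately from accelerometer readings alone; the function names are hypothetical and this is not the segmentation used in the paper.

```python
def zero_crossings(velocity):
    """Indices at which the velocity changes sign (exact zeros are skipped)."""
    crossings = []
    previous = 0.0
    for i, v in enumerate(velocity):
        if v != 0.0:
            if previous != 0.0 and (v > 0) != (previous > 0):
                crossings.append(i)
            previous = v
    return crossings


def split_strokes(velocity):
    """(start, end) index pairs for runs of constant bow direction."""
    bounds = [0] + zero_crossings(velocity) + [len(velocity)]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]


# Hypothetical velocity trace: one direction, then the other, then back.
trace = [1.0, 2.0, 1.0, -1.0, -2.0, -1.0, 1.0, 2.0]
strokes = split_strokes(trace)
```

A practical version would also need to handle sensor noise around zero and off-the-string techniques, whose zero crossings may lie along a different axis.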

ACKNOWLEDGEMENT
We would like to thank cellists Nicola Baroni, Clea Friend and Oliver Perkins for their generous help during data collection and their musical knowledge in interpreting the results. This research was supported by a grant from the Centre for Speckled Computing (www.specknet.org), University of Edinburgh.