Kendall Tau Sequence Distance: Extending Kendall Tau from Ranks to Sequences

An edit distance is a measure of the minimum cost sequence of edit operations to transform one structure into another. Edit distance is most commonly encountered within the context of strings, where Wagner and Fischer's string edit distance is perhaps the most well-known. However, edit distance is not limited to strings. For example, there are several edit distance measures for permutations, including Wagner and Fischer's string edit distance since a permutation is a special case of a string. However, another edit distance for permutations is Kendall tau distance, which is the number of pairwise element inversions. On permutations, Kendall tau distance is equivalent to an edit distance with adjacent swap as the edit operation. A permutation is often used to represent a total ranking over a set of elements. There exist multiple extensions of Kendall tau distance from total rankings (permutations) to partial rankings (i.e., where multiple elements may have the same rank), but none of these are suitable for computing distance between sequences. We set out to explore extending Kendall tau distance in a different direction, namely from the special case of permutations to the more general case of strings or sequences of elements from some finite alphabet. We name our distance metric Kendall tau sequence distance, and define it as the minimum number of adjacent swaps necessary to transform one sequence into the other. We provide two $O(n \lg n)$ algorithms for computing it, and experimentally compare their relative performance. We also provide reference implementations of both algorithms in an open source Java library.


Introduction
There exists a wide variety of metrics for computing the distance between permutations (Cicirello, 2016, 2018, 2019; Cicirello and Cernera, 2013; Fagin et al., 2003; Martí et al., 2005; Meilă and Bao, 2010; Ronald, 1995, 1997, 1998; Sevaux and Sörensen, 2005; Sörensen, 2007). The different permutation metrics that are available consider different characteristics of the permutation depending upon what it represents (e.g., a mapping between two sets, a ranking over the elements of a set, or a path through a graph). There is at least one instance where a metric on strings is suggested for permutations. Specifically, Sörensen (2007) suggested using string edit distance to compute the distance between permutations. In this paper, we go in the other direction, adapting a metric on permutations to the more general case of sequences (i.e., strings, arrays, or any other sequential data). The specific metric that we adapt to sequences is Kendall tau distance. Kendall tau distance is a metric defined for permutations that is itself an adaptation of Kendall tau rank correlation (Kendall, 1938). As a metric on permutations, Kendall tau distance assumes that a permutation represents a ranking over some set (e.g., an individual's preferences over a set of songs or books, etc.), and is the count of the number of pairwise element inversions. We review Kendall tau distance, for permutations, in Section 2, along with existing extensions for handling partial rankings (i.e., instead of a permutation or total ordering, partial orderings with tied ranks are compared).
In the case of permutations, where each element of the set is represented exactly one time in each permutation, Kendall tau distance is the minimum number of adjacent swaps necessary to transform one permutation into the other. Thus, in the case of permutations, Kendall tau distance is an edit distance where the edit operations are adjacent swaps. Due to this relationship, it is sometimes referred to as bubble sort distance, since bubble sort functions via adjacent element swaps. However, as soon as you leave the realm of permutations, existing forms of Kendall tau no longer correspond to an adjacent swap edit distance. We provide an example of this in Section 2.5.
In the case of comparing partial rankings, the existing extensions of Kendall tau distance to partial rankings are fine. However, if we are comparing sequences (e.g., strings, arrays of data points, etc.) that do not represent a ranking, then the partial ranking versions of Kendall tau distance do not apply. We propose a new extension of Kendall tau distance for sequences in Section 3. We call it Kendall tau sequence distance, and show that it meets the requirements of a metric. It is applicable for computing the distance between pairs of sequences, where both sequences are of the same length and consist of the same multiset of elements (i.e., duplicates are allowed, but both sequences must contain the same duplicated elements, each with the same multiplicity). It is otherwise applicable to strings over any alphabet or any other form of sequence (such as an array of integers or an array of floating-point values, etc.). We argue that it is more relevant as a measure of array sortedness than the existing partial ranking adaptations of Kendall tau. In Section 3.3, we provide two O(n lg n) algorithms for computing Kendall tau sequence distance.
We implemented both algorithms in Java, and we have added those reference implementations to JavaPermutationTools (JPT), an open source Java library of data structures and algorithms for computation on permutations and sequences (Cicirello, 2018), which can be found at https://jpt.cicirello.org/. In Section 4, we experimentally compare the relative performance of the two algorithms. The code to replicate these experiments is also available in the code repository of the JPT.

Notation and Assumptions
Without loss of generality, we will assume a permutation of length n is a permutation of the integers in the set S = {1, 2, . . . , n}. Let σ(i), where i ∈ S, be the position of element i in the permutation σ. If the permutation is a ranking over a set of n objects, then σ(i) represents the rank of object i in that ranking. Let p(r), where r ∈ S, be the element in position r of the permutation (or with rank r). Our notation assumes that the index into the permutation begins at 1.
The σ and p are two alternative representations of the permutation. They are related as follows: σ(i) = r ⇐⇒ p(r) = i. Throughout the paper, we will use whichever is more convenient in the given context.
We will initially assume that permutations (whether defined with σ or with p) are true permutations. That is, we assume σ(i) = σ(j) ⇐⇒ i = j and also that p(r1) = p(r2) ⇐⇒ r1 = r2. Therefore, if the application is one of rankings, we assume that there are no ties. In other words, two objects have the same rank only if they are the same object; and each object has only one rank. We relax this assumption later in Section 2.4.
Another way of expressing the Kendall tau rank correlation is as follows:

τ(σ1, σ2) = (|C| − |D|) / (n(n − 1)/2), (2)

where C is the set of concordant pairs, defined as:

C = {(i, j) : i < j ∧ ((σ1(i) < σ1(j) ∧ σ2(i) < σ2(j)) ∨ (σ1(i) > σ1(j) ∧ σ2(i) > σ2(j)))}, (3)

and D is the set of discordant pairs:

D = {(i, j) : i < j ∧ ((σ1(i) < σ1(j) ∧ σ2(i) > σ2(j)) ∨ (σ1(i) > σ1(j) ∧ σ2(i) < σ2(j)))}. (4)

Kendall tau distance
For a function d : S × S → R to be a measure of distance, we must have non-negativity (d(i, j) ≥ 0), identity of indiscernibles (d(i, j) = 0 ⇐⇒ i = j), and symmetry (d(i, j) = d(j, i)). Further, for d : S × S → R to be a metric, it must also satisfy the triangle inequality (d(i, k) ≤ d(i, j) + d(j, k) for all i, j, k ∈ S). The Kendall tau rank correlation coefficient is not a measure of distance (e.g., it clearly doesn't satisfy the first two requirements of non-negativity and identity of indiscernibles). Kendall tau distance (for permutations) is found in the literature in two forms, as follows:

K(σ1, σ2) = |D|, (5)

and

K(σ1, σ2) = |D| / (n(n − 1)/2), (6)

where D is the set of discordant pairs as previously defined in Equation 4. The only difference between these is that in the latter case, the distance is normalized to lie in the interval [0, 1], and in the former case the distance lies in the interval [0, n(n − 1)/2]. We have K(σ1, σ2) = 0 only when σ1 = σ2. And the maximum occurs when σ1 is the reverse of σ2. Kendall tau distance for permutations satisfies all of the metric properties. The version seen in Equation 5 is also equal to the minimum number of adjacent swaps necessary to transform one permutation p1 into the other permutation p2. That is, it is an edit distance where the edit operation is adjacent swap. Consider as an example the permutations σ1 = [2, 4, 1, 3] and σ2 = [4, 1, 3, 2]. Their equivalents in the other notation are p1 = [3, 1, 4, 2] and p2 = [2, 4, 3, 1]. The discordant pairs are D = {(1, 2), (1, 4), (2, 3), (2, 4), (3, 4)}. Thus, K(σ1, σ2) = 5 in this example. You can transform p1 = [3, 1, 4, 2] into p2 via the following sequence of five adjacent swaps: [3, 4, 1, 2], [3, 4, 2, 1], [4, 3, 2, 1], [4, 2, 3, 1], [2, 4, 3, 1] = p2. You cannot do it with fewer than five adjacent swaps in this example.
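The discordant-pair count in this example is easy to verify directly from the definition of D. The following is a minimal brute-force O(n^2) sketch (the class and method names are illustrative, not from the paper's library):

```java
/**
 * Brute-force Kendall tau distance between two rankings, computed directly
 * from the definition of the set of discordant pairs D. sigma1[i] and
 * sigma2[i] hold the ranks of element i+1 in the two rankings.
 */
public final class KendallTau {
    public static int distance(int[] sigma1, int[] sigma2) {
        int d = 0;
        for (int i = 0; i < sigma1.length; i++) {
            for (int j = i + 1; j < sigma1.length; j++) {
                // A pair is discordant when the two rankings order
                // elements i and j in opposite relative order.
                if ((sigma1[i] - sigma1[j]) * (sigma2[i] - sigma2[j]) < 0) {
                    d++;
                }
            }
        }
        return d;
    }
}
```

For the permutations in the example above, `distance(new int[]{2, 4, 1, 3}, new int[]{4, 1, 3, 2})` yields 5, matching the five discordant pairs listed in the text.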
Note that as an adjacent swap edit distance, it specifically concerns the p representation of the permutation and not the σ notation. For example, adjacent swaps on σ1 lead to a shorter sequence (3 swaps): [4, 2, 1, 3], [4, 1, 2, 3], [4, 1, 3, 2] = σ2. However, there is an equivalent operation for the σ notation, swapping consecutive ranks (i.e., rank 1 with 2, 2 with 3, etc.). That is, since p lists the elements in their "ranked" order, an adjacent swap in p is equivalent to exchanging the ranks of two elements whose ranks differ by 1.

Partial ranking Kendall tau distance
We now amend the notation previously introduced in Section 2.1. Specifically, we will now assume that rankings may be partial (i.e., there may be ties). That is, although i = j =⇒ σ(i) = σ(j) is still the case, we now allow σ(i) = σ(j) in cases where i ≠ j (i.e., two different elements may have the same rank).
The simplest way to extend Kendall tau rank correlation or Kendall tau distance to partial rankings is to compute it without modification. That is, compute the number of discordant pairs, etc., and use the definitions of Sections 2.2 and 2.3. The algorithm of Knight (1966) described in the previous section is actually specified to handle partial rankings in this way. In the first sort, where the list of tuples T is sorted by the first component of the tuples, Knight (1966) indicates to break ties using the second component.
Among the potential problems with directly applying Kendall tau distance without modification to partial rankings is that it no longer meets the metric properties. Fagin et al. (2006) developed K^(p), known as the Kendall distance with penalty parameter p, to deal with this, and determined the range of values for the penalty parameter that enables fulfilling the metric properties. Define K^(p) as follows:

K^(p)(σ1, σ2) = |D| + p|E|, (7)

where D is still the set of discordant pairs, as previously defined in Equation 4. Note the strict < and > in the definition of D, and that a tie within either permutation is not a discordant pair. E is the set of pairs that are ties in one permutation, but not the other (i.e., one ranking considers the objects equivalent, but the other does not). Therefore, E is defined as:

E = {(i, j) : i < j ∧ ((σ1(i) = σ1(j) ∧ σ2(i) ≠ σ2(j)) ∨ (σ1(i) ≠ σ1(j) ∧ σ2(i) = σ2(j)))}. (8)

Fagin et al. (2006) showed that K^(p) is a metric when 0.5 ≤ p ≤ 1, that it is what they termed a "near metric" when 0 < p < 0.5, and that it is not a distance when p = 0. We do not use their "near metric" concept here, so we leave it to the interested reader to consult Fagin et al. (2006).
You can compute |D| and |E| without actually computing the sets D and E via the approach of Knight (1966) based on sorting. For example, let T = [(1, 3), (2, 2), (3, 1), (1, 2), (1, 1), (2, 2), (2, 1)] be the list of tuples (σ1(i), σ2(i)). Sort T by the first component of the tuples, breaking ties via the second components, to obtain: T′ = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 2), (3, 1)]. You can finally sort the second components of T′ via mergesort (or another O(n lg n) sort), with the sort modified to count inversions. In this case, there are 8 inversions in T′, which is equal to |D|. It is also straightforward enough to compute |E|.
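The inversion-counting step above can be sketched as a mergesort that tallies crossings during each merge; this sketch (with illustrative names) operates on the second components of T′ after the first sort:

```java
/**
 * Counts inversions (pairs i < j with a[i] > a[j]) in O(n lg n) time
 * using a mergesort modified to tally crossings during each merge.
 */
public final class Inversions {
    public static long count(int[] a) {
        return sort(a.clone(), new int[a.length], 0, a.length - 1);
    }

    private static long sort(int[] a, int[] buf, int lo, int hi) {
        if (lo >= hi) return 0;
        int mid = (lo + hi) >>> 1;
        long inv = sort(a, buf, lo, mid) + sort(a, buf, mid + 1, hi);
        int i = lo, j = mid + 1, k = lo;
        while (i <= mid && j <= hi) {
            if (a[i] <= a[j]) { // ties are not inversions, so take left on equal
                buf[k++] = a[i++];
            } else {
                // a[j] crosses all remaining left-half elements at once
                inv += mid - i + 1;
                buf[k++] = a[j++];
            }
        }
        while (i <= mid) buf[k++] = a[i++];
        while (j <= hi) buf[k++] = a[j++];
        System.arraycopy(buf, lo, a, lo, hi - lo + 1);
        return inv;
    }
}
```

On the second components of T′ from the example, [1, 2, 3, 1, 2, 2, 1], `count` returns 8, matching |D|.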
There is no interpretation where K^(p), or any other partial ranking variation of Kendall tau distance that is based on the number of discordant pairs, is equivalent to an adjacent swap edit distance. The example of this section illustrates this: there are eight discordant pairs (thus K^(p) ≥ 8 unless p is negative), while fewer than eight adjacent swaps are sufficient for sorting the permutation (either 6 or 4, depending upon the interpretation of "adjacent swap" and the representation to which it is applied).

Positions of elements in a sequence are not ranks
If the sequences we are comparing do not define rankings, then the partial ranking variants of Kendall tau distance are not applicable as it would be arbitrary to impose a ranking interpretation upon them, and also likely to lead to a nonsensical interpretation. For example, consider the string s: "abacab". It would be arbitrary to impose a lexicographical order of the characters as if they are ranks (e.g., "a" as 1, "b" as 2, etc), such as transforming s to σ = [1, 2, 1, 3, 1, 2]. Or, if you consider position in the sequence to be an element's rank, then you'd have something meaningless like "a" is simultaneously ranked first, third, and fifth.

Notation and Assumptions
Let s be a sequence of length n, where s(i) ∈ Σ for some alphabet Σ and i ∈ {0, 1, . . . , n − 1}. The alphabet Σ can be a character set for some language, but can also be the set of integers, the set of real numbers, the set of complex numbers, or any other set of elements. The alphabet Σ is not necessarily a finite alphabet, although we do assume finite length sequences (i.e., n is finite).
Without loss of generality, we also assume that the elements of the alphabet Σ can be ordered. The specific ordering does not affect the measure of distance between the sequences.

Kendall tau sequence distance = adjacent swap edit distance
We previously saw in Section 2.3 that the original form of Kendall tau permutation distance is equivalent to an adjacent swap edit distance when applied to permutations (i.e., no duplicates), and specifically when applied to the p representation (and not the σ representation). We also saw that the existing extensions of Kendall tau beyond permutations (e.g., the partial ranking variants) are not equivalent to an adjacent swap edit distance.
We now define the Kendall tau sequence distance, τS, as follows:

τS(s1, s2) = the minimum number of adjacent swaps necessary to transform s1 into s2, (10)

where s1 and s2 are sequences as defined in Section 3.1. We require the lengths of the sequences to be equal, i.e., |s1| = |s2|. And for each element c ∈ Σ, we require count(s1, c) = count(s2, c), where count(s, c) is the number of times that c appears in s. The τS distance is undefined if these conditions do not hold for a specific pair of sequences.
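These two preconditions (equal length and identical element counts) can be verified in linear expected time before computing the distance. A minimal sketch for Java Strings follows; the class and method names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Checks the precondition under which tau_S is defined: the two sequences
 * must have equal length, and every element must occur the same number of
 * times in both (i.e., they are permutations of the same multiset).
 */
public final class TauPrecondition {
    public static boolean defined(String s1, String s2) {
        if (s1.length() != s2.length()) return false;
        // Tally element counts of s1, then decrement while scanning s2.
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : s1.toCharArray()) counts.merge(c, 1, Integer::sum);
        for (char c : s2.toCharArray()) {
            Integer k = counts.get(c);
            if (k == null) return false; // c occurs more often in s2 than in s1
            if (k == 1) counts.remove(c); else counts.put(c, k - 1);
        }
        return counts.isEmpty();
    }
}
```

For example, "abacab" and "bacaba" satisfy the precondition (three a's, two b's, one c each), while "abc" and "abd" do not.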
The τS distance satisfies all of the metric properties. It clearly satisfies non-negativity, identity of indiscernibles, and symmetry. We must have τS(s1, s2) ≥ 0, since it is not possible to apply a negative number of swaps. If s1 = s2, then τS(s1, s2) = 0 since 0 swaps are required to transform a sequence to itself. And if τS(s1, s2) = 0, then s1 = s2, since the only case when a sequence can be transformed to another with 0 adjacent swaps is when the two sequences are identical. It is also clear that τS(s1, s2) = τS(s2, s1), since any sequence of adjacent swaps can be applied in reverse.
The τS distance also satisfies the remaining metric property, the triangle inequality:

τS(s1, s2) ≤ τS(s1, s3) + τS(s3, s2). (11)

The proof is by contradiction. Suppose there exist sequences s1, s2, and s3 such that τS(s1, s2) > τS(s1, s3) + τS(s3, s2). The minimum cost edit sequence from s1 to s3 is τS(s1, s3) (by definition, via Equation 10). Likewise, the minimum cost edit sequence from s3 to s2 is τS(s3, s2). One sequence of edit operations that will transform s1 to s2 is to first transform s1 to s3, and then to transform s3 to s2. The cost of that edit sequence is clearly the sum of the costs of the two portions: τS(s1, s3) + τS(s3, s2). The minimum cost edit sequence to transform s1 to s2 must therefore be no greater than τS(s1, s3) + τS(s3, s2), a contradiction.

Two O(n lg n) algorithms to compute τS
In this section, we present two O(n lg n) algorithms for computing τS. Both rely on an observation related to the optimal sequence of adjacent swaps for editing one sequence s1 to the other s2, and specifically concerning duplicate elements. If a mapping between the elements of s1 and s2 is defined, such that an element is mapped to its corresponding position if the optimal sequence of adjacent swaps is performed, then an element that appears only once in s1 will be mapped to the only occurrence in s2. Furthermore, in such a mapping, if an element appears multiple times, then the k-th occurrence in s1 will be mapped to the k-th occurrence in s2. To see why, consider the sequence [b, c, a, a, d, e]. Swapping the two adjacent copies of element a results in the same sequence. In general, a swap of adjacent identical copies of the same element does not change the sequence, but accrues a cost of 1; an optimal edit sequence therefore never swaps identical elements, which is equivalent to preserving the relative order of duplicates. The two algorithms both generate a mapping of the indices of one sequence that correspond to the elements of the other, as described above. The two algorithms differ in how they generate the mapping. The mapping, once generated, is a permutation of the integers in {0, 1, . . . , n − 1}. And τS is the number of permutation inversions in that mapping.
Algorithm 1. The first of two algorithms for computing τS is given in Figure 1. It first checks that |s1| = |s2|, returning undefined otherwise. It then creates a sorted copy S of s1 (line 4) and a new array M of length n that assigns consecutive integer labels to the distinct elements of S, so that M[n − 1] + 1 is the number of distinct elements. Next, B1 and B2 are arrays of length M[n − 1] + 1 of initially empty queues; for i = 0 to n − 1, the algorithm binary searches S to find the labels of s1[i] and s2[i], and enqueues the index i in the corresponding buckets of B1 and B2, respectively (lines 12-19). The queued indices are then paired off, bucket by bucket, to produce the permutation P that maps each occurrence of an element in one sequence to the corresponding occurrence in the other. Finally, I, the number of inversions in P, is computed (line 28) and returned (line 29). Counting permutation inversions (line 28) is done in O(n lg n) time with a modified mergesort.
The runtime of this first algorithm is therefore O(f_c(m) · n lg n), where f_c(m) is the cost of comparing two elements of size m, due to the sort in line 4 and the block of lines 12-19. This holds in the worst case as well as the average case. If the sequences contain values of a primitive type, such as ASCII or Unicode characters, primitive integers, primitive floating-point numbers, etc., then f_c(m) = O(1), and thus the runtime of the algorithm simplifies to O(n lg n).
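The sort-based approach can be sketched compactly for int arrays. In this sketch (illustrative names, not the library's API), a quadratic inversion count stands in for the paper's modified mergesort to keep the example short:

```java
import java.util.ArrayDeque;
import java.util.Arrays;

/**
 * Sketch of the sort-based algorithm (Algorithm 1) for tau_S on int arrays.
 * Elements are bucketed via binary search in a sorted copy of s1; the k-th
 * occurrence in s1 is paired with the k-th occurrence in s2; tau_S is the
 * number of inversions in the resulting index permutation.
 */
public final class TauSeqSort {
    public static long distance(int[] s1, int[] s2) {
        if (s1.length != s2.length) throw new IllegalArgumentException("tau_S undefined");
        int n = s1.length;
        int[] sorted = s1.clone();
        Arrays.sort(sorted); // the O(n lg n) sort
        ArrayDeque<Integer>[] b1 = buckets(n), b2 = buckets(n);
        for (int i = 0; i < n; i++) {
            // binarySearch is deterministic, so equal elements always land
            // in the same bucket, in both sequences
            b1[Arrays.binarySearch(sorted, s1[i])].add(i);
            int j = Arrays.binarySearch(sorted, s2[i]);
            if (j < 0) throw new IllegalArgumentException("tau_S undefined");
            b2[j].add(i);
        }
        // Pair k-th occurrences, producing a permutation p of {0,...,n-1}.
        int[] p = new int[n];
        for (int j = 0; j < n; j++) {
            if (b1[j].size() != b2[j].size()) throw new IllegalArgumentException("tau_S undefined");
            while (!b1[j].isEmpty()) p[b1[j].poll()] = b2[j].poll();
        }
        // Quadratic inversion count for brevity; the paper counts inversions
        // in O(n lg n) with a modified mergesort.
        long inv = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (p[i] > p[j]) inv++;
        return inv;
    }

    @SuppressWarnings("unchecked")
    private static ArrayDeque<Integer>[] buckets(int n) {
        ArrayDeque<Integer>[] b = new ArrayDeque[n];
        for (int i = 0; i < n; i++) b[i] = new ArrayDeque<>();
        return b;
    }
}
```

For instance, `distance(new int[]{1, 2, 1, 3, 4}, new int[]{1, 3, 4, 1, 2})` yields 5: the index permutation is [0, 4, 3, 1, 2], which has five inversions.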
Algorithm 2. Our second algorithm for computing τS is given in Figure 2. It is similar in function to the first algorithm, but generates the mapping from unique sequence elements to integers differently. Specifically, it uses a hash table rather than a sort. After checking that |s1| = |s2|, it initializes an empty hash table H mapping sequence elements to integers; then, for i = 0 to n − 1, if s1[i] is not already a key in H, it puts the mapping (s1[i], q) in H, where q is a counter of the distinct elements encountered so far. Next, B1 and B2 are arrays of length q of initially empty queues; for i = 0 to n − 1, the index i is enqueued in bucket B1[H(s1[i])] and in bucket B2[H(s2[i])]. Lines 18-25 iterate over the buckets, as in Algorithm 1, to generate the permutation mapping elements between the two sequences. This step is unchanged from Algorithm 1, and thus has a runtime of O(n). Line 26 counts permutation inversions, just like in Algorithm 1, and thus has a runtime of O(n lg n).
The runtime of Algorithm 2 is thus O(f_h(m) · n + n lg n), where f_h(m) is the cost of computing the hash of an element of size m. For sequences of primitive elements, this again simplifies to O(n lg n), but where the only O(n lg n) step is the inversion count of line 26. Therefore, for sequences of primitive elements, such as ASCII or Unicode characters, or primitive integers or floating-point numbers, Algorithm 2 will likely run faster than Algorithm 1.
In this analysis, we assumed that the hash table operations are O(1), which in practice should be achievable with sufficiently large table size and a well-designed hash function for the type of elements contained in the sequences.
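The hash-based approach differs from the sort-based one only in how distinct elements are assigned integer labels. A sketch for arrays of String objects (illustrative names, not the library's API; a quadratic inversion count again stands in for the modified mergesort):

```java
import java.util.ArrayDeque;
import java.util.HashMap;

/**
 * Sketch of the hash-based algorithm (Algorithm 2) for tau_S on arrays of
 * objects. Distinct elements receive consecutive integer labels via one
 * hash table pass over s1, instead of sorting and binary searching.
 */
public final class TauSeqHash {
    public static long distance(String[] s1, String[] s2) {
        if (s1.length != s2.length) throw new IllegalArgumentException("tau_S undefined");
        int n = s1.length;
        // Capacity ceil(n / 0.75) avoids rehashing at the default load factor.
        HashMap<String, Integer> h = new HashMap<>((int) Math.ceil(n / 0.75));
        for (String e : s1) h.putIfAbsent(e, h.size()); // next label = current size
        int q = h.size();
        ArrayDeque<Integer>[] b1 = buckets(q), b2 = buckets(q);
        for (int i = 0; i < n; i++) {
            b1[h.get(s1[i])].add(i);
            Integer j = h.get(s2[i]);
            if (j == null) throw new IllegalArgumentException("tau_S undefined");
            b2[j].add(i);
        }
        // Pair k-th occurrences bucket by bucket, exactly as in Algorithm 1.
        int[] p = new int[n];
        for (int j = 0; j < q; j++) {
            if (b1[j].size() != b2[j].size()) throw new IllegalArgumentException("tau_S undefined");
            while (!b1[j].isEmpty()) p[b1[j].poll()] = b2[j].poll();
        }
        long inv = 0; // quadratic count for brevity; the paper uses mergesort
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (p[i] > p[j]) inv++;
        return inv;
    }

    @SuppressWarnings("unchecked")
    private static ArrayDeque<Integer>[] buckets(int q) {
        ArrayDeque<Integer>[] b = new ArrayDeque[q];
        for (int i = 0; i < q; i++) b[i] = new ArrayDeque<>();
        return b;
    }
}
```

No comparisons between elements are ever performed; only hashing and equality tests are needed, which is why this variant avoids the f_c(m) factor entirely.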
Notes on the Runtimes. In addition to likely running faster for sequences of primitive elements, in many cases we should expect Algorithm 2 to run faster than Algorithm 1 for sequences of elements of an object type. Under any normal circumstances, the cost, f_h(m), to compute a hash of an object of size m should be no more than linear in the size of the object. Thus, the runtime of Algorithm 2 should be no worse than O(mn + n lg n).
Similarly, the cost f_c(m) to compare objects of size m should be no worse than linear in the size of the objects. Thus, the runtime of Algorithm 1 is no worse than O(mn lg n), which is of higher order than the runtime of Algorithm 2. However, it is possible that a comparison of objects of size m may run faster than a hash of an object of size m, since a comparison may short circuit on an object attribute difference found early in the comparison. Therefore, Algorithm 1 may be the preferred algorithm for sequences of large objects. We explore this experimentally in the next section.

Experiments
In this section, we experimentally explore the relative performance of the two algorithms for computing Kendall tau sequence distance. In Section 4.1 we describe our reference implementations of the two algorithms, and explain our experimental setup in Section 4.2. Then, in Section 4.3, we experimentally compare the two algorithms on sequences of primitive values, such as strings of Unicode characters, arrays of integers, and arrays of floating-point values. Section 4.4 compares the performance of the algorithms on arrays of objects of varying sizes.

Reference Implementations in Java
We provide reference implementations of both algorithms from the previous section in an open source Java library available at: https://jpt.cicirello.org. Specifically, the class KendallTauSequenceDistance, in the package org.cicirello.sequences.distance, implements both algorithms. The implementations support computing the Kendall tau sequence distance between Java String objects, arrays of any of Java's primitive types (i.e., char, byte, short, int, long, float, double, boolean), as well as computing the distance between arrays of any object type.
For arrays of objects, the implementation of Algorithm 1 requires the objects to be of a class that implements Java's Comparable interface, since the sort step requires comparing pairs of elements for relative order; while Algorithm 2 requires the objects to be of a class that overrides the hashCode and equals methods of Java's Object class since it relies on a hash table.
To compute the distance between arrays of objects, our implementation of Algorithm 2 uses Java's HashMap class for the hash table, and the default maximum load factor of 0.75. To eliminate the need to rehash to maintain that load factor, we initialize the HashMap's capacity to ⌈n/0.75⌉, where n is the sequence length. In this way, even if every element is unique, no rehashing will be needed.
For computing the distance between arrays of primitive values, as well as for computing the distance between String objects, our implementation of Algorithm 2 uses a set of custom hash table classes (one for each primitive type). All of these hash tables (except the one for bytes) use chaining with singly-linked lists for the buckets. The size of the hash table is set, as above, based on the length of the array to ensure that the load factor is no higher than 0.75. Additionally, we use a table size that is a power of two to enable using a bitwise-and operation rather than a mod to compute indexes. However, we limit the table size to no greater than 2^16 for the two 16-bit primitive types (char and short), and to no greater than 2^30 for all other types. The integer primitive types are hashed in the obvious way for each of the three such types that use 16 to 32 bits (char, short, int). Specifically, char and short values are cast to 32-bit int values. We hash long values with an xor of the right and left 32-bit halves. We hash a float using its 32 bits as an int. We hash a double with an xor of its left and right 32-bit halves, using the result as a 32-bit int. Java's Float and Double classes provide methods for converting the bits of float and double values to int and long values, respectively. We otherwise do not use Java's wrapper classes for the primitive types.
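The primitive-value hash functions described above can be sketched as follows (a sketch consistent with the description; the library's internal classes and names may differ):

```java
/**
 * Sketch of the primitive-value hashes described in the text: 16-bit types
 * widen to int; 64-bit values fold to 32 bits by xor-ing their halves;
 * floating-point values hash via their IEEE-754 bit patterns.
 */
public final class PrimitiveHashes {
    static int hash(char v)  { return v; } // widen 16-bit types to int
    static int hash(short v) { return v; }
    static int hash(int v)   { return v; }

    static int hash(long v)  { return (int) (v ^ (v >>> 32)); } // xor of halves

    static int hash(float v) { return Float.floatToIntBits(v); } // raw 32 bits

    static int hash(double v) { // raw 64 bits, then fold to 32
        long bits = Double.doubleToLongBits(v);
        return (int) (bits ^ (bits >>> 32));
    }

    /** Index into a power-of-two table via bitwise-and rather than mod. */
    static int index(int hash, int tableSize) {
        return hash & (tableSize - 1);
    }
}
```

The power-of-two table size makes `index` a single bitwise-and, which is the motivation stated above for avoiding a mod operation.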
In the case of arrays of bytes, our implementation of Algorithm 2 uses a simple array of length 256 as the hash table, one cell for each of the possible byte values, regardless of byte sequence length. In this way, there are never any hash collisions when computing the distance between arrays of byte values.
For arrays of booleans, we handle the mapping to integers differently regardless of algorithm choice, since it is straightforward to map all false values to 0 and all true values to 1 in linear time.
The KendallTauSequenceDistance class can be configured to use either of the two algorithms. The default is Algorithm 2, since as we will see in Sections 4.3 and 4.4, it is always faster for sequences of primitives and nearly always faster for arrays of objects.

Experimental Setup
Our experiments are implemented in Java 1.8, and we use the Java HotSpot 64-Bit Server VM, on a Windows 10 PC. Our test system has 8GB RAM, with a quadcore AMD A10-5700 APU processor with 3.4 GHz clock speed.

Results on Sequences of Primitives
Strings. Our first set of results is on computing Kendall tau sequence distance between Java String objects. Strings in Java are sequences of 16-bit char values that encode characters in Unicode.
In our experiments, we consider String lengths L ∈ {2^8, 2^9, . . . , 2^17}, and alphabet sizes |Σ| ∈ {4^0, 4^1, . . . , 4^8}. Note that |Σ| = 4^8 = 2^16 is the entire Unicode character set, and that |Σ| = 2^8 includes the ASCII subset of Unicode. The alphabet Σ is just the first |Σ| characters of the Unicode set. For each combination of L and |Σ|, we generate 100 pairs of Strings as follows. The first String in each pair is generated randomly, such that each character in the String is selected uniformly at random from the alphabet Σ. The second String is then a randomly shuffled copy of the first String. We compute the average CPU time to calculate Kendall tau sequence distance, averaged over the 100 random pairs of Strings. Figure 3 shows the results for two of the alphabet sizes: 256 and 65536. String length is on the horizontal axis, and average CPU time is on the vertical axis. Algorithm 2 is consistently faster than Algorithm 1, independent of alphabet size. This is also true of the other alphabet sizes, thus we have excluded those graphs in the interest of brevity. The interested reader can use the code provided in the JPT repository to replicate our experimental data.

The explanation for why alphabet size affects the runtime of the algorithms is straightforward. First, note that a larger alphabet size leads to longer runtime (Figure 3(b) vs Figure 3(a)). A smaller alphabet size means more duplicate characters in the strings. For Algorithm 1, that means that the sort has fewer elements to move. In the case of Algorithm 2, the hash table contains one entry for each unique character in the strings, so the smaller alphabet size leads to fewer hash table entries, which translates to a lower load factor and thus faster hash table lookups.
Arrays of Integers. This next set of results is on computing Kendall tau sequence distance between arrays of int values, where an int in Java is a 32-bit integer. The array lengths L are the same as the String lengths used in Section 4.3, as are the alphabet sizes |Σ|, where the alphabet Σ is just the first |Σ| non-negative integers. We again average CPU times over 100 pairs of randomly generated arrays, where the first array contains integers generated uniformly at random from the alphabet, and the second array in each pair is a randomly shuffled copy of the first. Figure 4 shows the results for two of the alphabet sizes: 256 and 65536. Just as with Strings of characters, Algorithm 2 is consistently faster than Algorithm 1 for computing Kendall tau sequence distance between arrays of 32-bit integers, independent of alphabet size and array length.
Just as in the case of Strings, both algorithms run faster with the smaller alphabet size than with a larger alphabet size. The explanation is the same: smaller alphabet means more duplicate copies of elements, which means sorting is faster (Algorithm 1) and hash table lookups are faster due to reduced load factor (Algorithm 2).

Arrays of Floating-Point Numbers.
In this last case of sequences of primitives, we consider arrays of 64-bit double-precision floating-point numbers, Java's double type. We consider the same array lengths and alphabet sizes as the previous cases, but now the alphabet is a set of floating-point values. Specifically, the alphabet Σ consists of the values 1.0 · x, where x ranges over the first |Σ| non-negative integers. Figure 5 shows the results for two of the alphabet sizes: 256 and 65536. Just as in the previous two cases, Algorithm 2 is consistently faster than Algorithm 1 for computing Kendall tau sequence distance between arrays of 64-bit double-precision floating-point numbers, independent of alphabet size and array length. And again, runtime is longer for both algorithms with the larger alphabet size, for the same reasons as before.

Results on Sequences of Objects
In this section, we explore the performance of the algorithms on computing distance between sequences of objects. Specifically, we use arrays of Java String objects. For example, consider sequences s1 and s2 as follows:

s1 = ["hello", "world", "hello", "blue", "sky"], (12)
s2 = ["hello", "blue", "sky", "hello", "world"]. (13)

These sequences are a Kendall tau sequence distance of 5 from each other. One sequence of five adjacent swaps that transforms s1 into s2 starts by swapping "blue" to the left twice, then swaps "sky" twice to the left, and finally swaps "world" with the rightmost of the two copies of "hello." We use String objects for this set of experiments because it is easy to vary the size of a String object; and it is also relatively easy to create a case where both a hash and a comparison have cost O(m), where m is object size (in this case, length), as well as a case where a comparison costs significantly less than a hash.
We consider array lengths L ∈ {2^8, 2^9, . . . , 2^14}, and alphabet size |Σ| = 256, where the alphabet is a set of String objects. We consider object sizes m ∈ {2^0, 2^1, . . . , 2^11}. Computing a hash of a String of length m has cost O(m) regardless of String content. We consider two cases of String formation. In the first case, each of the 256 Strings in Σ begins with m − 1 copies of Unicode character 0, and they differ only in the last character. In this case, all comparisons also cost O(m), since linear iteration over the entire String object is required to determine how they differ. We will refer to this case as high cost comparisons (HCC). In the second case, each of the 256 Strings in Σ is m copies of the same character, but each of the 256 Strings uses a different character. Comparisons in this case either immediately short circuit on the first character (if they are different) or require linear iteration (if they are identical). We will refer to this case as low cost comparisons (LCC). For each combination of L, m, and HCC vs LCC, we generate 10 pairs of sequences. Each pair contains the same set of objects, but in different random orders. We compute average CPU time across the 10 pairs of sequences.
In Figures 6 and 7, we show average CPU time as a function of sequence length for arrays of String objects 32 characters and 2048 characters in length, respectively. Part (a) of each figure is the HCC case, and part (b) is the LCC case. For the small objects (Figure 6), Algorithm 2 is consistently faster for all sequence lengths in both the HCC and LCC cases, although the performance gap is much narrower in the LCC case.
For the large object case (Figure 7), Algorithm 2 is faster for all sequence lengths in the HCC case (Figure 7(a)). For the LCC case (Figure 7(b)), when the sequence length is long, performance of the two algorithms appears to converge; but for shorter length sequences, Algorithm 1 is faster. To see this more clearly, Figure 8 zooms in on the left side of the graph, where it is clear that Algorithm 1 is faster.

Conclusion
In this paper, we presented a new extension of Kendall tau distance that we call Kendall tau sequence distance. The original Kendall tau distance is a distance metric on permutations. We have adapted it to be applicable for computing distance between general sequences. Both sequences must be of the same length and contain the same multiset of elements (identical element counts); otherwise, the Kendall tau sequence distance is undefined.
We introduced two algorithms for computing Kendall tau sequence distance. If the sequences contain primitive values, such as a string of characters, or an array of primitive integers, etc., then the runtime of both algorithms is O(n lg n). However, the only O(n lg n) step of Algorithm 2 is a permutation inversion count that is shared with Algorithm 1; and thus, Algorithm 2 should be preferred for sequences of primitives. If one is computing the distance between sequences of objects of some more complex type, then the size of the objects in the sequences also impacts the runtime of the algorithms. However, unless the cost of a hash of an object is significantly greater than the cost of an object comparison, Algorithm 2 is still the preferred algorithm.

We provide reference implementations of both algorithms in the Java language. These implementations have been made available in an open source library. Our experiments confirm that Algorithm 2 is the faster algorithm under most circumstances. The code to replicate our experimental data is also available as open source.

EAI Endorsed Transactions on Industrial Networks and Intelligent Systems, 01 2020 - 05 2020 | Volume 7 | Issue 23 | e1