A Highly Concurrent Replicated Data Structure
EAI Endorsed Transactions

A well-defined concurrent replicated data structure is essential to the design of collaborative editing systems, in particular for properties such as out-of-order execution of concurrent operations and data convergence. In this paper, we introduce a novel linear data structure based on a unique identifier scheme required for indexed communication. These identifiers are real numbers that follow a specific pattern of precision. Based on the uniqueness and the total order of these identifiers, we present two concurrency control techniques that achieve a high degree of concurrency according to strong and lazy happened-before relations. Our data structure preserves data convergence, yields better performance and avoids overheads compared to existing approaches.


Introduction
In collaborative editing, a group of individuals may edit a shared document simultaneously. Collaborative editing tools are designed to provide an environment in which authorized users edit a shared document. A user (or participant) may know the others working on the shared document and watch, in real time, the changes performed by other users. Using collaborative editing tools, multiple users are able to make changes at the same time. The users may be in the same location or dispersed geographically [1,2,8,9,11]. Such tools typically manipulate shared objects that have a linear structure in which each element (e.g. character, line, object or paragraph) is indexed by an identifier (say, a position number). Each user is supposed to have a local copy of the shared document so that the availability of data is ensured. In general, the collaboration is performed as follows: each user's operations (e.g. inserting a new element or deleting an existing element at a certain position) are executed locally in a non-blocking manner. Earlier concurrency control techniques turned out to be ineffective because they may ensure consistency at the expense of responsiveness and loss of operations [2,6,10]. Another technique, called Operational Transformation (OT), is proposed in [2]. Generally, it consists of an application-dependent transformation algorithm which modifies the parameters (e.g. position numbers) of operations so that they can be executed regardless of reception order. Identifying elements by position numbers is not sufficient to ensure data convergence using the OT approach. It is also claimed that all previously proposed transformations fail to achieve such convergence [4,5].

Related work
A comparison of several approaches to the problem of collaboratively editing a shared text is presented by Ignat et al. [3]. Operational transformation (OT) [2,9] considers collaborative editing based on non-commutative operations. To this end, OT transforms the arguments of remote operations to take into account the effects of concurrent executions. To execute concurrent operations in either order, OT requires two correctness conditions which remain difficult to satisfy. Imine et al. [4,5] prove that all previously proposed transformations fail to satisfy these conditions. More recently, Weiss et al. [12,13] and Preguiça et al. [7] proposed a new data type, called CRDT (Commutative Replicated Data Type), for collaborative editing. Weiss et al. proposed the Logoot CRDT, which uses a sparse n-ary tree rather than Treedoc's dense binary tree [7]. In Logoot, a position identifier is a list of (long) unique identifiers, and Logoot does not flatten. Also, Logoot has a high overhead compared to Treedoc. The approaches presented in [7,12,13] have some inconveniences:
1. The data structure may grow indefinitely because the deletion operation has no physical effect on the state. Indeed, they mark deleted elements as tombstones in order to make all replicas converge.
2. The position identifiers are very long; sometimes the size of an identifier can exceed the size of the document.
Unlike [7,12,13], we use very dense identifiers (real numbers) to uniquely identify all elements inside the shared document and, instead of a tree data structure, we describe the shared document by a simple sequence data structure. Moreover, we remove elements without using tombstones. Another recent work [14] uses rational numbers to uniquely identify the elements of the shared document, but the size of the document increases during collaboration sessions as the removed elements are only hidden (they use a form of tombstones).
Furthermore, the approaches in [7,12,13] need to be revised, because we observe that the proposed algorithms for generating unique position identifiers do not support some critical situations, which leads to problems such as replica divergence and loss of order preservation over identifiers. As an example, we analyse these two approaches by proposing an identifier exchange scenario. Suppose that, after an exchange of points (updates), two identifiers (in the approach of Weiss et al. [12,13]) are $p = \langle 0, s_A, c \rangle$ and $q = \langle 0, s_B, c \rangle$ (where $s_A$ and $s_B$ are the site identifiers of users A and B respectively, such that $s_A < s_B$). Now suppose that user B inserts only one new character between the characters having identifiers p and q; the new identifier is then $\langle 1, s_B, c \rangle$. This new identifier would have to take its place as follows:
$p = \langle 0, s_A, c \rangle < \langle 1, s_B, c \rangle < \langle 0, s_B, c \rangle = q$
which exposes an ambiguous order alteration problem in the automatic storage buffer of identifiers.
Similarly, constructing an example for the approach of Preguiça et al. [7], let user A insert character 'c', user B insert character 'd' and user C insert character 'e' in a shared document. One of the possible choices for constructing the tree is: node e is the right child of node c and node d is the left child of e. Now let users start concurrent operations between character 'c' and character 'd'; the new nodes will then be mini-nodes as siblings. Let the disambiguator of user B be (counter, siteID) = (c1, 2) at the time of insertion of character 'd'. If he/she intends to insert another character 'f' between c and d, the counter must be incremented; let the new disambiguator be (c2, 2). For this insertion to be possible, an ordering condition on the disambiguators must hold; when it does not, the insertion is not possible or an order alteration problem occurs. Thus, after the exchange of identifiers, we observe that the new identifier computed for one of the users does not lie between the previous two identifiers, or that this new identifier is not comparable under the definition of comparison (proposed by the authors) for two identifiers; consequently, it may cause divergence. The Treedoc approach is a modified form of the approach proposed by Weiss et al. [12,13] and is complicated to implement: for example, to find the order, one has to walk the tree. The technique used to reduce overheads consists in removing information about identifiers, which may cause serious problems in retrieving these identifiers from the storage buffer. Our approach is simple and avoids these drawbacks.

Contributions
This paper presents a novel concurrent replicated data structure for collaborative editors in which each element is identified by a unique real number. Based on a specific pattern of precision, our unique identifier scheme guarantees order preservation (compatible with the order of the elements) and easily achieves data convergence. Moreover, to each user (or peer) we assign a unique real value as an identifier, also generated under a specific precision. A shared document is mapped onto an interval $I = [a, b]$ with $0 \le a < b$ for $a, b \in \mathbb{R}$. For the operations performed by users in the network, corresponding unique identifiers are computed over the interval I, and these identifiers are assigned to elements (e.g. characters or lines) of the shared document or object. Based on the uniqueness and the total order of our identifiers, we present two concurrency control techniques that achieve a high degree of concurrency according to strong and lazy happened-before relations. The first technique relies on time-stamp vectors and allows concurrent operations to be executed in either order. This technique is well suited for collaborative editors where the number of users is fixed. The second technique relies on a lazy happened-before relation and enables us to extend the concurrency even to operations generated by the same user. Using this technique, a collaborative editor can be deployed easily on P2P networks, as it supports dynamic groups where users can leave and join at any time. We validate our data structure with a performance evaluation which shows that our unique identifier scheme is appropriate for a linear data structure.
The current manuscript is organized as follows. Section 2 describes the ingredients of our concurrent replicated data structure. Section 3 presents our technique to generate unique identifiers and a view of editing and modifying a document. In Section 4 we suggest two concurrency control techniques for using our concurrent data structure. Section 5 gives a performance evaluation of our data structure. Section 1.1 compares with previous work, and the conclusion is given in Section 6.

Introducing New Concurrent Replicated Data Structure
This section introduces a new data structure for concurrent editing. It is known that collaborative editors manipulate shared objects that have a linear structure [2,9,11]. This structure can be modeled as a sequence of elements of any data type. For instance, an element may be regarded as a character, a paragraph, a page, an XML node, etc. In [11], it has been shown that this linear structure can be easily extended to a range of multimedia documents, such as Microsoft Word and PowerPoint documents. We consider a shared, replicated document as a sequence of elements and map the document to an interval of real numbers.

Unique Position Weights
We consider a shared document as an ordered set of elements indexed by unique position identifiers that are real numbers. We call these identifiers position weights to distinguish them from the traditional position numbers. The position weights have the following properties:
• Each element in the document (which can be thought of as a separate storage buffer) has a weight in the corresponding interval used to generate new weights.
• Two elements in the document have two different weights: we can always order two different elements.
• The weight of an element is volatile: any position weight can be removed and inserted again at any time without allowing weight redundancy inside the document.
• The order of position weights is compatible with the order of elements: the set of position weights is totally ordered and consistent with the position numbers of the shared document.
Position weights are denoted by $\omega, \omega', \omega_1, \omega_2, \ldots$ Moreover, position weights (or identifiers) obey a strict order relation "<". To insert a new element between two existing elements holding position weights $\omega_1$ and $\omega_2$ such that $\omega_1 < \omega_2$, it suffices to compute a new position weight $\omega_{new}$ such that $\omega_1 < \omega_{new} < \omega_2$. Since the position weights are real numbers, they would theoretically require infinite precision. Therefore, their machine representations are handled carefully in order to preserve uniqueness (i.e., to avoid weight redundancy). In Section 3, we present a method to compute position weights based on the assumptions described above.

Shared Data
The shared data structure can be considered as a sequence of pairs (element, weight) where the elements are ordered by their corresponding weights. Users are able to modify replicas of the data structure by performing either of the following editing operations:
(i) Insert(element, $\omega_{new}$) inserts a new element in the document by associating a new weight with it.
(ii) Remove($\omega_{exist}$) removes the element with an existing weight $\omega_{exist}$; this weight may be created again later.
Multiple users are able to edit the shared document concurrently, and the operations may be replayed on each site as soon as they are received. Unique position weights guarantee convergence even if the operations are performed at different sites in different orders.
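The convergence claim above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class name Replica, its methods and the sample weights are hypothetical and are not taken from the paper; the sketch simply keeps (weight, element) pairs sorted by weight.

```python
import bisect
from decimal import Decimal


class Replica:
    """Hypothetical replica: a sequence of elements kept sorted by weight."""

    def __init__(self):
        self.weights = []   # sorted position weights
        self.elements = []  # elements, parallel to self.weights

    def insert(self, element, weight):
        i = bisect.bisect_left(self.weights, weight)
        if i < len(self.weights) and self.weights[i] == weight:
            raise ValueError("weight already present")  # no weight redundancy
        self.weights.insert(i, weight)
        self.elements.insert(i, element)

    def remove(self, weight):
        i = bisect.bisect_left(self.weights, weight)
        if i == len(self.weights) or self.weights[i] != weight:
            raise KeyError(weight)
        del self.weights[i]
        del self.elements[i]

    def view(self):
        return "".join(self.elements)


# The same three insertions replayed in two different orders.
ops = [("a", Decimal("0.5")), ("b", Decimal("0.75")), ("c", Decimal("0.25"))]
site1, site2 = Replica(), Replica()
for element, weight in ops:
    site1.insert(element, weight)
for element, weight in reversed(ops):
    site2.insert(element, weight)
```

Because each element carries its own unique weight, both replicas end with the same sequence whatever the replay order.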

Practical Implementation
This section describes a method to create unique identifiers. These identifiers are real numbers, follow a specific pattern and have low storage overhead compared to recent approaches (to the best of our knowledge) [7,12,13].
Definition 1. We define precision as the number of digits following the decimal point of a value (rounded to decimal places/to significant digits), e.g., the precision of the values 12.34600 and 12.345 is 5 and 3, respectively.
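As a small illustration of Definition 1, the following Python sketch counts the digits after the decimal point of a value given as a string (so that trailing zeros are preserved). The function name precision is hypothetical; the decimal module is used here only as a convenient stand-in for decimal arithmetic.

```python
from decimal import Decimal


def precision(value: str) -> int:
    """Number of digits following the decimal point (Definition 1)."""
    exponent = Decimal(value).as_tuple().exponent
    return max(0, -exponent)


p1 = precision("12.34600")
p2 = precision("12.345")
```

With the two values of Definition 1, p1 is 5 and p2 is 3.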

Developing the Basic Rules
In this section we explain the basic assumptions made to develop the method.
Mapping a Document to an Interval. To identify each element by a unique position weight, we associate the shared document with an interval I in such a way that 0 and 1 correspond to the beginning and the end of the document, respectively. For simplicity we take the initial real interval as $I = [0, 1]$. Note that the real interval I can be taken with any desired positive length.
Rounding a Value. To create unique identifiers, as a first step, we introduce Function 1, which rounds a given value according to Definition 1. Function 1 takes two parameters: for example, Round a value(x, $p_r$) rounds the decimal part of an expression x to the $p_r$-th decimal place. Let $x \in I$ be a real number. We say that x is correctly rounded to a d-decimal number, which is denoted by $x^{(d)}$.
Precision Pattern. To create unique identifiers, we introduce a precision control technique based on the following assumptions.
• A1: We denote by $p_d$ the default precision (the one commonly used by the programming language) over which we perform computations. For example, in Maple, the precision can be fixed by the global variable "Digits" and floating point arithmetic is done in decimal with rounding, so one can set $p_d$ = Digits.
• A2: We denote by $p_\delta$ the rounding precision taken for small positive real numbers, and the value of $\delta$ is taken as the user/site identifier.
• A3: We denote by $p_r$ the final rounding precision taken to compute a unique identifier; it is kept less than both $p_\delta$ and $p_d$. Moreover, if we round the user/site identifier over $p_r$, the resulting value is negligible and has no significant effect on position weights.
Summarizing the above assumptions, we get the following inequality
$p_r < p_\delta < p_d \quad (1)$
and we keep inequality (1) as the basic principle to create unique identifiers and to perform computations accordingly.
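The effect of assumption A3 can be checked numerically. The Python sketch below uses hypothetical values: a default precision of 10 digits, a user identifier of 0.007 (three decimal places) and a final rounding precision of 1, chosen only for illustration. It verifies that rounding the user identifier at the final rounding precision yields a negligible value.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical precisions chosen for illustration only.
p_d = 10                  # default precision
p_delta = 3               # precision of the user/site identifier
p_r = 1                   # final rounding precision
delta = Decimal("0.007")  # user/site identifier

pattern_holds = p_r < p_delta < p_d

# Round delta to p_r decimal places (ties rounded away from zero).
unit = Decimal(1).scaleb(-p_r)  # 0.1 when p_r = 1
delta_rounded = delta.quantize(unit, rounding=ROUND_HALF_UP)
```

Here delta_rounded is zero, which is what makes the identifier "negligible" at the final rounding precision while still distinguishing users at the finer precision.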

Creating Position Weights for Insertion
Now, suppose that the default precision and the rounding precision are fixed according to the assumptions.
To insert an element between two elements p and q, only the weights of p and q are required. It is known that the classical midpoint formula computes the midpoint of two real values, say a and b, as (a + b)/2. To compute a different midpoint each time, and finitely many midpoints over the same interval, certain modifications are required. As many users compute all possible midpoints over the interval I, we require that:
• Weights computed for a user's modifications must be different from the weights computed for all operations performed by other users;
• The set of weights computed for the operations of each user must be an ordered set.
For a certain rounding precision (say with a fixed number d of decimal places) and a user identifier $\delta$ created under the required conditions, we modify the classical midpoint formula for all $x, y \in I$ with $x < y$, where $x = x^{(d)}$ and $y = y^{(d)}$ (x and y are rounded to d decimal places) and the small real value $\delta$ is subtracted. Notice that the weight computed in this way is not equidistant from the weights x and y, because of the subtraction of $\delta$. Function 2 gives how to compute a new position weight for a given user (i.e., with user identifier $\delta$).
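The following Python sketch illustrates the idea of a user-specific midpoint. It is a sketch under a loudly stated assumption: the text only says that the operands are rounded to d decimal places and that $\delta$ is subtracted, so the form (x + y)/2 − δ used here is an assumed reading, not necessarily the exact formula of Function 2. The names round_value and middle and the three user identifiers are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_value(x: Decimal, places: int) -> Decimal:
    """Round x to the given number of decimal places."""
    return x.quantize(Decimal(1).scaleb(-places), rounding=ROUND_HALF_UP)


def middle(x: Decimal, y: Decimal, delta: Decimal, d: int) -> Decimal:
    """ASSUMED form of the modified midpoint: (x^(d) + y^(d)) / 2 - delta."""
    if not x < y:
        raise ValueError("expected x < y")
    xd, yd = round_value(x, d), round_value(y, d)
    return (xd + yd) / 2 - delta


BEG, END = Decimal(0), Decimal(1)
users = {"U1": Decimal("0.007"), "U2": Decimal("0.004"), "U3": Decimal("0.001")}

# First weight each user would compute on the empty document [0, 1], with d = 1.
first = {name: middle(BEG, END, delta, 1) for name, delta in users.items()}
```

Under this assumed form, the three users obtain 0.493, 0.496 and 0.499: the weights are pairwise different, lie strictly inside the interval, and are not equidistant from its end points.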

Editing a Document and Behavior of Identifier
The main idea of our approach is to provide a non-conflicting execution of concurrent editing operations. We ensure eventual consistency (i.e., the final state of replicas is identical at all sites), provided that every site executes every operation in an order consistent with some happened-before order (see Section 4, which presents two forms of happened-before order). When the data type is a sequence of elements (such as a text document), the out-of-order execution of insertions in the sequence can be obtained with unique and totally ordered identifiers for each element. Our approach is based on associating a position weight with each element. These position weights are unique and totally ordered. Figure 2 presents an overview of how a single user makes insertions, with the corresponding unique identifiers, starting from an empty document. The first position weight is denoted by $p_1$, and all other weights are likewise denoted by $p_i$. Note that new position weights can be on the left or on the right side of an existing weight. In Figure 2, 'updating' denotes the update of the document after each modification (insertion).
Example 2. Suppose that multiple users are participating in a collaborative editing session; this example explains how the corresponding identifiers are generated. The situation is described in Figure 3. In the model that we propose, a user can modify the document by inserting or removing elements (e.g., lines or paragraphs). To perform this task, the corresponding weights are created and removed. Millions of weights can be created with a chosen value of the user identifier (i.e., $\delta$) and by selecting an appropriate rounding precision. These weights can be created locally as well as based on the remote identifiers of the elements during the exchange of elements between the users in the network. To compute new weights, the same criteria have to be followed by all users in the network. Suppose that a shared document marked with 'Beg' and 'End' is mapped to the interval [0, 1] and that the 'updating' action propagates modifications to all participants. Suppose d = 1 for the rounding precision. Suppose that three users, $U_1$ (assigned the user identifier 0.007), $U_2$ (assigned the user identifier 0.004) and $U_3$ (assigned the user identifier 0.001), start editing the shared document at three different sites (site 1, site 2 and site 3), as shown in Figure 3. Let user $U_3$ insert the first character 'A' in the empty document; then the first [...]

Two Concurrency Control Techniques
A stable state in a collaborative editor is achieved when all generated editing operations have been performed at all sites. Let $o_1$ and $o_2$ be two editing operations. A collaborative editor is consistent iff it satisfies the following properties:
• Causality preservation: if $o_1$ happens before $o_2$ then $o_1$ is executed before $o_2$ at all sites.
• Convergence: when all sites have performed the same set of operations, the copies of the shared document are identical.
To satisfy the above consistency criteria, we present in this section two concurrency control techniques for manipulating our concurrent replicated data structure. Each technique implements a strong or lazy happened-before relation and ensures the convergence property.

Concurrency Control with Strong Causality Relation
Let $o_1$ and $o_2$ be operations generated at sites i and j, respectively. We say that $o_2$ causally depends on $o_1$, denoted $o_1 \to o_2$, iff:
• $i = j$ and $o_1$ was generated before $o_2$; or,
• $i \neq j$ and the execution of $o_1$ at site j has happened before the generation of $o_2$.
Operations $o_1$ and $o_2$ are said to be concurrent iff neither $o_1 \to o_2$ nor $o_2 \to o_1$. As a long-established convention in collaborative editors [2,9], time-stamp vectors are used to determine the causality and concurrency relations between operations. A time-stamp vector is associated with each site and each generated operation. Every time-stamp is a vector of integers with a number of entries equal to the number of sites. For a site j, each entry $V_j[i]$ returns the number of operations generated at site i that have already been executed on site j. When an operation o is generated at site i, a copy $V_o$ of $V_i$ is associated with o before its broadcast to other sites. $V_i[i]$ is then incremented by 1. Once o is received at site j, if the local vector $V_j$ "dominates" $V_o$, then o is ready to be executed on site j. In this case, $V_j[i]$ is incremented by 1 after the execution of o. Otherwise, the execution of o is delayed. Let $V_{o_1}$ and $V_{o_2}$ be the time-stamp vectors of $o_1$ and $o_2$, respectively. Using these time-stamp vectors, the causality and concurrency relations are defined as follows:
• $o_1 \to o_2$ iff $V_{o_1}[i] < V_{o_2}[i]$;
• $o_1$ and $o_2$ are concurrent iff $V_{o_1}[i] \ge V_{o_2}[i]$ and $V_{o_2}[j] \ge V_{o_1}[j]$.
Given our concurrent data structure to describe a shared document, all concurrent editing operations (Insert and Remove operations) can be executed in any order provided that the above causality relation is respected. Unfortunately, time-stamp vectors do not support dynamic groups (i.e., users joining or leaving the group at any time), since each time-stamp is a vector of integers whose number of entries is equal to the number of users.
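The time-stamp vector protocol described above can be sketched in Python as follows. This is a minimal sketch under assumptions: the class Site and its methods are hypothetical names, and "dominates" is taken to mean that every entry of the local vector is greater than or equal to the corresponding entry of the operation's vector, which is an assumed reading of the term.

```python
def dominates(local, stamp):
    """ASSUMED meaning of 'dominates': entry-wise greater than or equal."""
    return all(l >= s for l, s in zip(local, stamp))


class Site:
    """Hypothetical site with a time-stamp vector and a waiting queue."""

    def __init__(self, site_id, n_sites):
        self.site_id = site_id
        self.clock = [0] * n_sites  # V_j
        self.pending = []           # received operations not yet ready
        self.executed = []          # operations in execution order

    def generate(self, op):
        stamp = list(self.clock)         # copy V_o of V_i before the increment
        self.clock[self.site_id] += 1
        self.executed.append(op)
        return (self.site_id, stamp, op)  # message to broadcast

    def receive(self, message):
        self.pending.append(message)
        progress = True
        while progress:                   # retry delayed operations
            progress = False
            for msg in list(self.pending):
                origin, stamp, op = msg
                if dominates(self.clock, stamp):
                    self.executed.append(op)
                    self.clock[origin] += 1
                    self.pending.remove(msg)
                    progress = True


s0, s1 = Site(0, 2), Site(1, 2)
m1 = s0.generate("insert a")
m2 = s0.generate("insert b")
s1.receive(m2)                 # arrives first: delayed, m1 not yet executed
delayed = list(s1.executed)
s1.receive(m1)                 # now both become ready, in causal order
```

The fixed length of the vectors in this sketch also makes the limitation mentioned above visible: the number of sites must be known when a Site is created.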

Concurrency Control with Lazy Causality Relation
The time-stamp vectors implement a false causality relation. Indeed, some editing operations performed by a user could be permuted without any effect on the state of the document. For instance, two successive insertions can be executed in any order because their position weights are unique and totally ordered. As no two different users produce the same position weight, an insertion must happen before a removal with the same position weight; they can never be concurrent. Consequently, the only causality relation to be preserved is Insert(element, $\omega_e$) → Remove($\omega_e$). All other operations, not constrained by this relation, can be performed in either order. Using only this simple causality relation enables us to deploy a collaborative editor on P2P networks and to support dynamic groups.
Next, we present two scenarios which show that naively exchanging editing operations may cause consistency issues. For each scenario, we illustrate the consistency problem and sketch a solution.

First Scenario. Consider the scenario in which sites 1 and 2 concurrently remove (operations $o_2$ and $o_3$, respectively) the same character e added by operation $o_1$. When the remove operation $o_2$ (resp. $o_3$) arrives at site 2 (resp. site 1), it cannot be performed because the position weight $\omega_e$ does not exist. Should we consider $o_2$ and $o_3$ as not causally ready? If yes, $o_2$ and $o_3$ will never be causally ready, as we have no information on whether or not $\omega_e$ has been added. Consequently, the waiting queue would grow drastically.
To overcome this problem, we propose that each site maintain a log in which the local remove operations are stored. In this way, $o_2$ and $o_3$ are logged at sites 1 and 2, respectively. When $o_3$ arrives, it is easy to verify in the local log of site 1 that $\omega_e$ has already been removed. Thus, $o_3$ can be ignored. The same processing is carried out for $o_2$ at site 2.

Second Scenario. As shown in the scenario of out-of-order execution for insertions, a user at site 1 adds a character e (operation $o_1$), removes it (operation $o_2$) and next adds another character f at the same position (operation $o_3$). What happens if $o_3$ arrives before $o_1$ at site 2? Note that the position weights $\omega_e$ and $\omega_f$ are computed by the same user (or site) from the same adjacent position weights, $\omega_1$ and $\omega_2$. In this case, $\omega_e$ and $\omega_f$ are equal. This redundancy breaks the structure of our shared document. Indeed, the same position weight then indexes two different characters, "e" and "f". Moreover, at site 2, the remove operation $o_2$ may remove either "e" or "f". This can lead to an inconsistent state.
To still create different position weights, we propose to add a monotonically increasing counter that is incremented by each local insertion operation. Hence, we redefine the set of position weights as $W = \{(\omega, c) \mid \omega \in I \text{ and } c \in \mathbb{N}\}$, which is totally ordered by a relation $\prec$ over pairs $(\omega_1, c_1)$ and $(\omega_2, c_2)$. Note that the first component $\omega$ is still computed according to Function 2. In this case, characters "e" and "f", generated at site 1, have two different position weights (the same $\omega$ but different counters).
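The two fixes discussed in these scenarios (a removal log and counter-extended weights) can be sketched together in Python. Several points are assumptions made for illustration and are not stated in the text: the pairs (ω, c) are compared lexicographically (Python's tuple order), every executed removal is logged rather than only local ones, and a removal that arrives before its insertion is parked in a waiting set. The class name LazyReplica is hypothetical.

```python
import bisect
from decimal import Decimal


class LazyReplica:
    """Hypothetical replica preserving only Insert(w) -> Remove(w)."""

    def __init__(self):
        self.weights = []     # sorted (omega, counter) pairs, ASSUMED lexicographic
        self.elements = {}    # weight pair -> element
        self.removed = set()  # log of removals already executed
        self.waiting = set()  # removals whose insertion has not arrived yet

    def insert(self, weight, element):
        if weight in self.waiting:      # its removal arrived first
            self.waiting.discard(weight)
            self.removed.add(weight)
            return
        if weight in self.removed or weight in self.elements:
            return                      # duplicate delivery: ignore
        bisect.insort(self.weights, weight)
        self.elements[weight] = element

    def remove(self, weight):
        if weight in self.removed:      # first scenario: already removed
            return
        if weight not in self.elements:
            self.waiting.add(weight)    # insertion not received yet
            return
        self.weights.remove(weight)
        del self.elements[weight]
        self.removed.add(weight)

    def view(self):
        return "".join(self.elements[w] for w in self.weights)


# Second scenario: site 1 inserts e, removes it, then inserts f at the
# same omega; the counter keeps the two weights distinct.
w_e = (Decimal("0.5"), 1)
w_f = (Decimal("0.5"), 2)

site1, site2 = LazyReplica(), LazyReplica()
site1.insert(w_e, "e"); site1.remove(w_e); site1.insert(w_f, "f")
site2.insert(w_f, "f"); site2.remove(w_e); site2.insert(w_e, "e")  # out of order
site2.remove(w_e)  # first scenario: a concurrent duplicate removal is ignored
```

In this sketch both replicas end up showing only "f", even though site 2 received the three operations in the reverse order.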

Performance Evaluation
To verify the effectiveness of our approach, an experimental study has been conducted using text documents of various sizes as shared data. In collaborative editors, shared data is presented to the user as a linear structure, and insertion and deletion of elements are based on their position numbers in this structure (i.e., the data view). In our case, we store for each element in the shared data its position weight. To choose an adequate structure, we investigate the implementation of two versions of our data structure, based either on a linear structure or on a binary tree structure. We denote by n the size of the current state, and by list and tree the structures used in the two versions.

Local Insertion / Local Deletion
Whatever the structure used, the following operations are performed to insert a new element e (or delete an existing one) at position i in the view:
• Search for the adjacent position weights $\omega_i$ and $\omega_{i+1}$ in the case of insertion, or for the position weight $\omega_e$ corresponding to the position i of the element to be removed.
• Insert/delete the position weight in list/tree.
• Update the view.
Since insertion/deletion is performed at a given position in the view, a linear structure has the advantage that the adjacent position weights $\omega_i$ and $\omega_{i+1}$ (or $\omega_e$ in the case of deletion) are returned by list[i] and list[i + 1] (or by list[i]), respectively, in constant time. However, when using a tree structure, to return either the adjacent weights $\omega_i$ and $\omega_{i+1}$ of the element to insert (or the weight $\omega_e$ of the element to be removed), the ascending list of all position weights stored in the tree must be computed in O(n) time to extract the required weights.
The second step consists in inserting/deleting the position weight in list/tree. The third step corresponds to inserting/deleting element e in the shared data and refreshing the user view.

Remote Insertion / Remote Deletion
On reception of a remote insertion/deletion operation on element e (with weight $\omega_e$), we proceed as follows:
• Search for the position of $\omega_e$ in list/tree.
• Update the view.
Since list is an ordered set of weights, the position in list of a given element weight $\omega_e$ is computed in O(log(n)) time. It is used first to insert/remove the element in list and also to update the view. In the case of the tree structure, the correspondence between a given weight $\omega_e$ and its position is found by first computing the ascending list of all weights stored in the tree (O(n) time); the position is then returned by a binary search executed in O(log(n)) time over the computed list. In Table 1 we summarize the overall time complexity over list/tree for local/remote operations, both in the worst case (inserting an element at the beginning of the view/deleting the first element) and in the best case (inserting an element at the end of the view/deleting the last element). In each structure, all local operations (resp. remote operations) have the same time complexity. In Figure 5 we present the evaluation of insertion over a text document with a size varying from 10000 to 100000 elements (we take an element to be a small paragraph). It is clear that an implementation based on a linear structure is more efficient. A tree structure offers good performance in terms of search, insertion and deletion of data identified by its position weight (the case of remote operations). These operations are often performed in O(log(n)) time. However, the poor performance encountered with a tree structure is due to the correspondence, for a given element, between its associated position weight and its position in the view. As the approaches in [7,12,13] are based on a tree structure, it is clear that their implementations will present poor performance.
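The list-based lookups discussed in this evaluation can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and the list is assumed to hold the boundary weights 0 and 1 as sentinels so that every view position has two neighbours.

```python
import bisect
from decimal import Decimal


def local_neighbours(weights, i):
    """Adjacent weights for a local insertion at view position i (O(1))."""
    return weights[i], weights[i + 1]


def remote_position(weights, weight):
    """Position of a remote weight in the sorted list (binary search, O(log n))."""
    return bisect.bisect_left(weights, weight)


# Sorted list of weights, with 0 and 1 ASSUMED as 'Beg'/'End' sentinels.
weights = [Decimal("0"), Decimal("0.25"), Decimal("0.5"), Decimal("1")]

left, right = local_neighbours(weights, 1)     # local insertion after position 1
pos = remote_position(weights, Decimal("0.4"))  # remote insertion of weight 0.4
weights.insert(pos, Decimal("0.4"))             # shifting the tail costs O(n)
```

The search is cheap in both cases; the O(n) cost of the list version comes from shifting elements when the weight is actually inserted or deleted.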

Conclusion
This paper presented a new data structure that is well suited for linear structure-based shared documents (such as text documents) in collaborative editors. To ensure a high degree of concurrency, we proposed a new technique to uniquely identify elements inside the shared document. These identifiers are simply real numbers which are manipulated under precision control in order to avoid the problem of infinite precision. This technique is quite simple and guarantees the uniqueness of these identifiers. According to two forms (strong and lazy) of the happened-before relation, we proposed two concurrency control procedures: the first procedure allows concurrent operations to be executed in either order; the second one enables us to extend the concurrency even to operations generated by the same user, as our identifiers are unique and totally ordered. We carried out a performance evaluation which shows that our unique identifier scheme is well suited for linear data structures. In future work, we intend to investigate the impact of our work when undoing operations. Furthermore, we plan to extend our unique identifier scheme to other data structures such as trees and graphs.

Function 1: Round a value
Input: Value to be rounded, Desired precision
Output: Rounded value
1 begin
2   Let dp := Desired precision;
3   x := Value to be rounded;
4   prec ←− 10^dp;
5   y := round(x * prec) / prec;
6   Rounded value ←− y;
7   return Rounded value;
8 end

Here round(x) (see line 5, Function 1) rounds an expression x to the nearest integer. For instance, round(−2.4) returns −2 and Round a value(15.0766647, 4) returns 15.0767, the computation being performed in Maple 12. We denote Round a value(x, $p_r$) by $x^{(p_r)}$, i.e., the value x rounded over precision $p_r$ by Function 1.
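For readers without Maple, Function 1 can be approximated in Python as sketched below. The sketch makes two assumptions: the decimal module stands in for Maple's decimal arithmetic, and ties in round are broken away from zero (ROUND_HALF_UP); the name round_a_value simply mirrors the name of Function 1.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_to_integer(x: Decimal) -> Decimal:
    """round(x): nearest integer (ties ASSUMED to go away from zero)."""
    return x.quantize(Decimal(1), rounding=ROUND_HALF_UP)


def round_a_value(x: Decimal, dp: int) -> Decimal:
    """Sketch of Function 1: y := round(x * prec) / prec with prec = 10^dp."""
    prec = Decimal(10) ** dp
    return round_to_integer(x * prec) / prec


a = round_to_integer(Decimal("-2.4"))
b = round_a_value(Decimal("15.0766647"), 4)
```

On the two examples given for Function 1, a is −2 and b is 15.0767.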

Example 1.
Consider an empty document with corresponding interval I = [0, 1] and rounding precision 5
EAI Endorsed Transactions on Collaborative Computing 12 2015 | Volume 1 | Issue 6 | e4

Function 2: middle (how to compute a new point between two weights)

Figure 2. Situation for single user.
An element is simply identified by a real number in the interval.
The time complexity of the third step is O(n) in each structure, whereas the operation Insert(e, $\omega_e$)/Remove($\omega_e$) in the second step is executed in O(n) over list and in O(log(n)) over tree.