J. Parallel Distrib. Comput. 71 (2011) 354–368

Replicated abstract data types: Building blocks for collaborative applications

Hyun-Gul Roh a,*, Myeongjae Jeon b, Jin-Soo Kim c, Joonwon Lee c

a Department of Computer Science, KAIST, Daejeon, Republic of Korea
b Department of Computer Science, Rice University, Houston, TX, United States
c School of Information and Communication Engineering, Sungkyunkwan University (SKKU), Suwon, Republic of Korea
* Corresponding author. E-mail addresses: hgroh@calab.kaist.ac.kr, knowhunger@gmail.com (H.-G. Roh).

Article history: Received 23 July 2009; received in revised form 11 October 2010; accepted 4 December 2010; available online 15 December 2010. doi:10.1016/j.jpdc.2010.12.006

Keywords: Distributed data structures; Optimistic replication; Replicated abstract data types; Optimistic algorithm; Collaboration

Abstract. For distributed applications requiring collaboration, responsive and transparent interactivity is highly desired. Though such interactivity can be achieved with optimistic replication, maintaining replica consistency is difficult. To support efficient implementations of collaborative applications, this paper extends a few representative abstract data types (ADTs), such as arrays, hash tables, and growable arrays (or linked lists), into replicated abstract data types (RADTs). In RADTs, a shared ADT is replicated and modified with optimistic operations. Operation commutativity and precedence transitivity are two principles that enable RADTs to maintain consistency despite different execution orders. In particular, replicated growable arrays (RGAs) support insertion/deletion/update operations. Over previous approaches to optimistic insertion and deletion, RGAs show significant improvements in performance, scalability, and reliability. © 2010 Elsevier Inc. All rights reserved.

1. Introduction

Optimistic replication is an essential technique for interactive collaborative applications [8,30]. To illustrate replication issues in collaboration, consider the following scenario in an editorial office publishing a daily newspaper. A number of pressmen are editing a newspaper using computerized collaboration tools. Each of them is browsing and editing pages consisting of news items, such as text, pictures, and tables. When a writer collaborates on editing the same article with others, his local interaction is never blocked, but the interactions of the others are shown to him as soon as possible. After all interactions cease, all the copies of the newspaper become consistent.

Human users, the subjects of these applications, prefer high responsiveness and transparent interactivity to strict consistency [8,30,13]. Responsiveness means how quickly the effect of an operation is delivered to users, and interactivity means how freely operations can be performed. Optimistic operations, which are executed first at each local site, make it possible to achieve these properties, but consistency must be maintained even though sites execute operations in different orders. Optimistic replication contrasts with pessimistic concurrency control protocols [30], such as serialization [5,14] or locking [3,12]. Even if a global locking protocol allows optimistic operations [13], it not only requires a state rollback mechanism, but also damages interactivity due to the nature of the locking protocol.
There has been research on genuine optimistic replication oriented to specific services, such as replicated databases [36,37], Usenet [21,9,6], and collaborative textual or graphical editors [8,27,34,33]. However, these service-oriented techniques are inflexible for the various complex functions of modern interactive applications, e.g., electronic blackboards, games, CAD tools, and office tools such as Microsoft Office and Google Docs, all of which can be extended for collaboration. Interactive applications, e.g., CAD tools for designing skyscrapers or spaceships, must manage data of indeterminate shape; one datum may consist of a limited number of elements, another may need quick access to an unlimited number of elements, and yet another may contain ordered elements that are frequently inserted and deleted. Sensible developers would make use of various abstract data types (ADTs) to meet such demands. When those applications are extended for collaboration, however, developers may have to abandon ADTs for shared data owing to inconsistency.

Hence, we suggest replicated abstract data types (RADTs), a novel class of ADTs that can be used as building blocks for collaborative applications. RADTs are multiple copies of a shared ADT replicated over distributed sites. RADTs provide a set of primitive operations corresponding to those of normal ADTs, concealing the details of consistency maintenance. RADTs ensure eventual consistency [30], a weak consistency model for achieving responsiveness and interactivity. By imposing no constraint on operation delivery except causal dependency, we accommodate RADT deployment in general environments. This allows a site to execute operations in any causal order. We model such executions and explore principles to achieve eventual consistency.

This paper suggests two principles that lead to successful designs of non-trivial RADTs. First, operation commutativity (OC) requires that every pair of concurrent operations commutes. Though the concept of commutativity was discussed in many distributed systems [39,1,27], it was not fully assimilated. We formally prove that OC guarantees eventual consistency for all possible execution orders; thus, we mandate that RADT operations satisfy OC. Second, precedence transitivity (PT) requires that all precedence rules be transitive. RADTs require precedence rules to reconcile conflicting intentions. PT is a guideline on how to design remote operations so that RADT operations satisfy OC and preserve their intentions. In short, OC is a sufficient condition for eventual consistency, while PT is a principle for exploiting OC.

We present efficient implementations of three RADTs: replicated fixed-size arrays (RFAs), replicated hash tables (RHTs), and replicated growable arrays (RGAs). Although some key ideas for RFAs and RHTs were already present in the literature [36,37,21,9,6], we introduce them again because they exemplify the concepts of RADTs, and above all because their problems and ideas are inherited by RGAs. RGAs are the other main contribution of this paper; they solve the problem of optimistic insertions and deletions into a replicated ordered set. Since these operations have long been desired in collaborative applications [8,13], the operational transformation (OT) framework has been the classic approach to supporting them. Various OT methods have been introduced [7,27,35,34,19,22], and one of them is adopted by the web collaboration tool Google Wave [11].
However, the OT framework has difficulty in verifying correctness, and an evaluation study on recent OT methods reports that their overheads are non-negligible and that their performance and scalability are poor [18]. Thanks to OC and PT, RGAs are fully verified for correctness, not only for insertions and deletions, but also for updates [29]. In addition, RGAs are superior to most previous work in complexity, scalability, and reliability. Whereas the remote operations of OT methods generally have quadratic time-complexity, remote RGA operations can perform in O(1) time thanks to the proposed s4vector index (SVI) scheme. Our evaluation shows that operations needing hundreds of milliseconds in OT methods take only dozens of microseconds in RGAs. Due to the optimal remote operations and the fixed-size s4vectors, RGAs scale. Additionally, the autonomous causality validation of the SVI scheme gives RGAs a chance to enhance reliability. RGAs, therefore, can be a better alternative to OT methods.

Section 2 describes three RADTs and their inconsistency problems. Sections 3 and 4 formalize OC and PT, respectively. Concrete algorithms of RADTs are proposed in Section 5. We survey the related work in Section 6, and contrast RGAs with previous work in Section 7. Section 8 presents the performance evaluation, and we conclude this paper in Section 9.

2. Problem definition

2.1. Preliminary: causality preservation among operations

The replication system discussed in this paper is characterized by a set of distributed sites and operations, as shown in the time–space diagram of Fig. 1, which describes the propagations and executions of operations. Lamport presented two definitions for causality [15]: the happened-before relation ('→') and the concurrent relation ('‖'). Given a time–space diagram consisting of n operations, all n(n−1)/2 relations are obtained; every pair of distinct operations is in either of the two relations.

Fig. 1. A time–space diagram in which three sites participate. A vector on the right of each operation is its vector clock.

While no uniquely correct order is defined for concurrent operations, the partial orders defined by happened-before relations need to be preserved at every site [27,35] owing to the causality that might exist; e.g., imagine O4 is to delete the object inserted by O2 in Fig. 1. Vector clocks can ensure such causality (or causal execution orders) by preserving happened-before relations [5,35]. Following Birman et al.'s CBCAST scheme [5], in our replication system consisting of N sites, site i updates its own N-tuple vector clock v_i according to lines 2, 8, 10, and 16 in Algorithm 1. To preserve causality, causally unready operations are delayed in a queue (lines 13–15). When an operation O issued at site j (i ≠ j) arrives with its vector clock v_O, O is causally ready iff v_O[j] = v_i[j] + 1 and v_O[k] ≤ v_i[k] for 0 ≤ k ≤ N − 1 and k ≠ j. To illustrate, consider site 2 in Fig. 1. After O1's execution, site 2 has v_2 = [1, 0, 1]. When O4 arrives at site 2 with v_O4 = [2, 1, 1], it is causally unready; thus, it is delayed until O2 has been executed. According to Birman and Cooper [4], CBCAST is 3–5 times faster than ABCAST, which supports a total ordering. Nevertheless, this causality preservation scheme is so strict that it might incur a chain of inessential delays when a site fails to broadcast operations; in Section 7, we discuss relaxing this scheme.
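To make the readiness test concrete, the following C++ fragment is a minimal sketch of the condition above; the function name and the representation of vector clocks as std::vector<int> are our own assumptions, not code from the paper.

    #include <vector>

    // Sketch of the causal-readiness test of Section 2.1 (CBCAST style).
    // An operation O issued at site j, carrying vector clock vO, is
    // causally ready at site i with local clock vi iff vO[j] = vi[j] + 1
    // and vO[k] <= vi[k] for every k != j.
    bool causallyReady(const std::vector<int>& vO,
                       const std::vector<int>& vi, int j) {
        if (vO[j] != vi[j] + 1) return false;     // exactly the next op of site j
        for (int k = 0; k < (int)vi.size(); ++k)  // no missing causal predecessor
            if (k != j && vO[k] > vi[k]) return false;
        return true;
    }

With v_i = [1, 0, 1] at site 2 and v_O4 = [2, 1, 1] issued at site 0, the check fails because v_O4[1] = 1 > v_i[1] = 0, reproducing the delay of O4 in Fig. 1.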
2.2. System model of RADTs

A replicated abstract data type (RADT) is extended from a normal ADT. The system model of RADTs can be summarized as follows, and the main control loop is presented in Algorithm 1.

• An RADT is a particular data structure with a definite set of operation types (OPTYPE).
• RADTs are multiple copies of an RADT, each of which is replicated at one of the distributed sites.
• At a site, a local operation is one issued locally, whereas a remote operation is one received from a remote site.
• At a site, every local operation is immediately executed on the RADT of the site according to its local algorithm.
• Every local operation modifying the local RADT is broadcast to the other sites in the form of a remote operation.
• At a site, every remote operation is immediately executed according to its remote algorithm when it is causally ready.

Algorithm 1 The main control loop of RADTs at site i
 1  MainLoop():
 2    ∀k: v_i[k] := 0;
 3    i := this site ID;
 4    initialize queue Q;
 5    initialize RADT;
 6    while(not aborted)
 7      if(O is a local operation but not Read)
 8        v_i[i] := v_i[i] + 1;
 9        if(RADT.localAlgorithm(O) = true) broadcast (O, v_i);
10        else v_i[i] := v_i[i] − 1;
11      if(O is a Read) RADT.localAlgorithm(O);
12      if(an operation O arrives with v_O from site j)
13        enqueue the pair (O, v_O) into Q;
14      while(there is a causally ready pair in Q)
15        (O, v_O) := dequeue the pair from Q;
16        ∀k: v_i[k] := max(v_i[k], v_O[k]);
17        RADT.remoteAlgorithm(O);

For the operations modifying RADTs, two kinds of algorithms are given: local and remote. In RADTs, local algorithms are almost the same as those of the normal ADTs, but remote algorithms might operate differently in order to maintain consistency. Since an operation is executed first at its local site and later at remote sites, different sites execute operations in different orders. Section 3 goes into detail on operation execution. On the other hand, though local Read operations are allowed without restriction, they are not propagated to remote sites. A Read issued at a site, therefore, is never performed globally, and thus consistency is not defined for Reads. Instead, RADTs guarantee an eventual consistency model, which is defined only for the operations modifying replica states, as follows.

Definition 1 (Eventual Consistency of RADTs). Eventual consistency is the condition that all the states of RADTs are identical after the sites have executed the same set of modifying operations from the same initial states, regardless of the causal execution orders of the operations at each site.

Fig. 2. A usage example of RADTs in the newspaper editing scenario. A newspaper page might be divided into a fixed number of blocks, which can be managed by an RFA. An RHT makes it possible to rapidly access news items with unique keys. An RGA enables pages to be inserted and deleted, respecting the order of existing pages.

In this model, even if every site executes the same Read at the exact same time, the sites might read different values. As in the newspaper editing scenario, however, if the consistency of shared objects displayed to human users accords with this model, momentary inconsistency is acceptable to human users. In particular, eventual consistency is appropriate for achieving high responsiveness and transparent interactivity; thus, it has been widely accepted in collaborative applications [7,8,13,30].
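As an implementation aid, the vector clock bookkeeping of Algorithm 1 (lines 2, 8, 10, and 16) can be encapsulated as in the following C++ sketch; the class and method names are assumptions of ours, not part of the paper's algorithms.

    #include <algorithm>
    #include <vector>

    // Minimal sketch of the vector clock bookkeeping in Algorithm 1.
    struct VectorClock {
        std::vector<int> v;
        explicit VectorClock(int nSites) : v(nSites, 0) {}   // line 2
        void tickLocal(int i)   { ++v[i]; }                  // line 8: local op issued
        void untickLocal(int i) { --v[i]; }                  // line 10: local op failed
        void mergeRemote(const std::vector<int>& vO) {       // line 16: remote op done
            for (std::size_t k = 0; k < v.size(); ++k)
                v[k] = std::max(v[k], vO[k]);
        }
    };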
2.3. Definitions of RADTs and inconsistency problems

This paper focuses on three kinds of representative ADTs and extends them into RADTs: the fixed-size array into the RFA, the hash table into the RHT, and the growable array or linked list into the RGA. A real example of the growable array is the Vector class of Java or of the C++ STL. As shown in Fig. 2, their functionality is widely demanded in applications such as the newspaper editing scenario. Note that multiple RADTs can handle the same object in memory. For example, news items can be managed not only with RHTs as in Fig. 2 but also with RGAs to display a number of news items on a page. To manage the overlapping order of news items, i.e., the Z-order, consistently over all sites, RGAs can be used even while some items are inserted or deleted. Therefore, just as linked lists and growable arrays are widely used, RGAs will be in high demand in collaborative applications. However, if remote algorithms are not properly designed, RADTs suffer pathological inconsistency problems. Below, we precisely define each RADT and show its potential inconsistency problems when it executes operations naïvely.

A replicated fixed-size array (RFA) is a fixed number of elements with OPTYPE = {Write, Read}. An element is the object container of RFAs, which contains one object. In Algorithm 2, the local algorithm of Write(int i, Object o) replaces the object at the ith element with a new one, o.

Algorithm 2 The local algorithms for RFA operations
1  Write(int i, Object o):          // RFA[i]: the ith element
2    if(RFA[i] exists)
3      RFA[i].obj := o;             // replaces the ith object with o
4      return true;
5    else return false;
6  Read(int i):
7    if(RFA[i] exists) return RFA[i].obj;
8    else return nil;

In RFAs, different execution orders lead to inconsistency. For example, if the three operations of Fig. 3 are given as O1: Write(1, o1), O2: Write(1, o2), and O3: Write(1, o3) in RFAs, the element of index 1 finally contains o1 at sites 1 and 2, but o2 at site 0.

Fig. 3. A simple example of a time–space diagram.

Hash tables are extended into replicated hash tables (RHTs), which access shared objects in slots by hashing unique keys, with OPTYPE = {Put, Remove, Read}, as in Algorithm 3. This paper assumes that an RHT resolves key collisions by the separate-chaining scheme. If a Put performs on an existing slot, it updates the slot with its new object.

Algorithm 3 The local algorithms for RHT operations
1  Put(Key k, Object o):
2    s := RHT[hash(k)];             // RHT[hash(k)]: the slot where k is mapped;
3    if(s != nil) s.obj := o;       // if the slot exists;
4    else
5      new_s := make a new slot; new_s.obj := o;
6      RHT[hash(k)] := new_s;       // link new_s to RHT;
7    return true;
8  Remove(Key k):
9    s := RHT[hash(k)];
10   if(s = nil) return false;      // if no slot exists;
11   s.obj := nil;                  // make s a tombstone;
12   return true;
13 Read(Key k):
14   s := RHT[hash(k)];
15   if(s = nil or s.obj = nil) return nil;  // no slot, or a tombstone;
16   return s.obj;

RHTs have an additional source of inconsistency because Puts and Removes dynamically create and destroy slots. This necessitates the idea of tombstones, which are invisible object containers kept alive after Removes [21,9,6]. Despite the tombstones, if the remote algorithms are the same as the local ones, RHTs might diverge. Consider Fig. 3 again, assuming O1: Remove(k1), O2: Put(k1, o2), and O3: Put(k1, o3). Having executed the two Puts in different orders, sites 1 and 2 have different objects for k1. Finally, sites 1 and 2 have the tombstone for k1 while site 0 has o2 for k1.
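The divergence can be reproduced mechanically. The following self-contained C++ sketch (our own illustration, not code from the paper) replays the RFA Write scenario above: two sites apply the same concurrent Writes in different orders and end up with different last writers.

    #include <cassert>
    #include <string>

    // Naive replication of the RFA example on Fig. 3: every site simply
    // applies each Write on arrival. Site 0 receives O1 before O2,
    // site 1 receives them in the opposite order, and the replicas diverge.
    int main() {
        std::string site0 = "o3", site1 = "o3";  // both sites have executed O3
        site0 = "o1"; site0 = "o2";              // site 0: O1, then O2 -> "o2"
        site1 = "o2"; site1 = "o1";              // site 1: O2, then O1 -> "o1"
        assert(site0 != site1);                  // naive Writes do not commute
        return 0;
    }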
A replicated growable array (RGA), of primary interest to this paper, supports OPTYPE = {Insert, Delete, Update, Read}, each of which accesses an object with an integer index. The local algorithms of RGAs are presented in Algorithm 4. Since nodes, the object containers of RGAs, are ordered and inserted/deleted, an RGA internally adopts a linked list for efficiency. Update is also required because Inserts cannot update nodes and modifications on nodes should be explicitly propagated. RGAs, therefore, inherit all the problems of RFAs and RHTs. In order to enhance user interactions, such as carets or cursors, it is also possible to supplement OPTYPE with local pointer operations, which are parameterized with node pointers instead of integers by using findlink in Algorithm 4. This paper, however, mainly deals with the operations of integer indices since their semantics have been frequently studied in collaborative applications [7,27,35,34,19,22]. Note that local RGA operations on tombstones fail in findlist or findlink, and thus are not propagated to remote sites by Algorithm 1.

Algorithm 4 The local algorithms for RGA operations
 1  findlist(int i):
 2    n := head of the linked list;
 3    int k := 0;
 4    while(n != nil)
 5      if(n.obj != nil)              // skip tombstones;
 6        if(i = ++k) return n;
 7      n := n.link;                  // next node in the linked list;
 8    return nil;
 9  findlink(node n):
10    if(n.obj = nil) return nil;     // if n is a tombstone;
11    else return n;
12  Insert(int i, Object o):
13    if((refer_n := findlist(i)) = nil) return false;
14    new_n := make a new node;
15    new_n.obj := o;
16    link new_n next to refer_n in the RGA structure;
17    return true;
18  Delete(int i):
19    if((target_n := findlist(i)) = nil) return false;
20    target_n.obj := nil;            // make target_n a tombstone;
21    return true;
22  Update(int i, Object o):
23    if((target_n := findlist(i)) = nil) return false;
24    target_n.obj := o;
25    return true;
26  Read(int i):
27    if((target_n := findlist(i)) = nil) return nil;
28    return target_n.obj;

Since the order among nodes matters, RGAs have additional inconsistency problems. Suppose that the operations in Fig. 3 are given as O1: Update(2, ox), O2: Insert(1, oy), and O3: Insert(1, oz), and that they are executed on an initial RGA [o1 o2] by the local algorithms.¹ After executing both O2 and O3, sites 1 and 2 have different results: [o1 oz oy o2] at site 1, and [o1 oy oz o2] at site 2. Here, only one result must be chosen for consistency.

¹ The first object is referred to by index 1. An Insert adds a new node next to (to the right of) its reference. To insert ox at the head, we use Insert(0, ox).

When executing O1, its remote sites might violate the intention of O1, which is what O1 intends to do at its local site. We formally define the intention of an operation as follows.

Definition 2 (Intention of an Operation). Given an operation with parameter(s) on an RADT, its intention is the effect of its local algorithm on the RADT.

In RGAs, intentions can be violated at remote sites because Inserts and Deletes change the integer indices of the nodes located behind their intended nodes. This intention violation problem was first addressed by Sun et al. [35]. In the example, although O1 intends to replace oz on [o1 oz o2] with ox at its local site, O1 at site 2 may update oy on [o1 oy oz o2], which is not the intention of O1. RGAs may incur many other puzzling intention problems [19]; we solve them in Sections 4 and 5.

Fig. 4. CEG of the time–space diagram of Fig. 1.

The local RADT algorithms ensure the same responsiveness and interactivity as the normal ADTs. Note that the local Algorithms 2–4 are incomplete since we have not yet presented the exact details of the data structures.
After introducing the two principles, the remote algorithms, which are in charge of consistency maintenance, will be presented together with the details of the data structures in Section 5.

3. Operation commutativity

RADTs allow sites to execute operations in different orders. To denote an execution order of two operations, we use '→'; e.g., Oa → Ob if Oa is executed before Ob. In addition, we use '⇒' to express the changes of replica states caused by the execution of an operation or a sequence of operations; e.g., RS0 ⇒(Oa) RS1 ⇒(Ob) RS2 means that Oa and Ob change a replica state RS0 into RS1 and then into RS2 in order. We abbreviate this as RS0 ⇒(Oa→Ob) RS2. Though time–space diagrams, such as Figs. 1 and 3, are intuitive and illustrative, we present a better definition for formal analysis as follows.

Definition 3 (Causally Executable Graph (CEG)). Given a time–space diagram TS, a graph G = (V, E) is a causally executable graph, iff: V is a set of vertices corresponding to all the operations in TS, and E ⊂ V × V is a set of edges corresponding to all the relations between every pair of distinct operations in V, where a happened-before relation Oa → Ob corresponds to a directed edge in E from Oa to Ob, and a concurrent relation corresponds to an undirected edge in E.

Fig. 4 shows the CEG obtained from the time–space diagram in Fig. 1. Every CEG essentially has the following properties.

Lemma 1. A CEG G has no cycle along its directed edges and is a complete graph.

Proof. According to the definitions of the happened-before and concurrent relations [15], they are not defined reflexively, and happened-before relations are all transitive; thus, G has no cycle. Unless a pair of two distinct operations is in a happened-before relation, it is concurrent; hence, G is complete. □

For a given CEG, if all the vertices can be traversed without going against directed edges, causality is preserved in the execution sequence. In Fig. 4, at site 0, the execution sequence O1 → O2 → O3 → O4 → O5 does not go against the direction of any directed edge, but at site 2, O3 → O1 → O4 → O2 → O5 violates causality because O4 is executed before O2, whose order is the reverse of the edge O2 → O4. A causality-preserving sequence encompassing all the operations of a CEG satisfies the conditions in the following definition.

Definition 4 (Causally Executable Sequence (CES)). Given a CEG G = (V, E), where |V| = n, an execution sequence s: O1 → · · · → On is a causally executable sequence (CES), iff: every operation in V participates exactly once in s, and there is no Oj → Oi for 1 ≤ i < j ≤ n.

Unless all the edges in E are directed ones, a CEG has more than one CES. According to the system model, RADTs permit the executions of all possible CESes. Eventual consistency, therefore, is achieved if all the CESes lead to the same replica state. To observe the relationship among the CESes of a CEG, consider a CES s1: O1 → O2 → O3 → O4 → O5 in the CEG of Fig. 4. If a pair of adjacent operations in s1 is concurrent, another CES can be derived by swapping the order of the pair; e.g., if O1 ‖ O2 is swapped, s1 is transformed into another CES s2: O2 → O1 → O3 → O4 → O5. Only if both O1 → O2 and O2 → O1 yield the same result from an identical replica state will s1 and s2 produce a consistent result.
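For intuition, all CESes of a small CEG can be enumerated by backtracking over causally ready operations, as in the following self-contained C++ sketch (our own illustration; operations are numbered from 0 and happened-before edges are given as pairs).

    #include <cstdio>
    #include <utility>
    #include <vector>

    // Enumerate every CES of a CEG by backtracking: an operation may be
    // appended to the sequence only when all of its happened-before
    // predecessors have already been placed (Definition 4).
    void enumerate(int n, const std::vector<std::pair<int,int>>& hb,
                   std::vector<int>& seq, std::vector<bool>& used) {
        if ((int)seq.size() == n) {                    // one complete CES
            for (int o : seq) std::printf("O%d ", o + 1);
            std::printf("\n");
            return;
        }
        for (int o = 0; o < n; ++o) {
            if (used[o]) continue;
            bool ready = true;
            for (const auto& e : hb)                   // causally ready?
                if (e.second == o && !used[e.first]) ready = false;
            if (!ready) continue;
            used[o] = true; seq.push_back(o);
            enumerate(n, hb, seq, used);
            seq.pop_back(); used[o] = false;
        }
    }

    int main() {
        // The CEG of Fig. 3 has one directed edge, O3 -> O1; the three
        // CESes printed are exactly the orders the sites may execute.
        std::vector<std::pair<int,int>> hb = {{2, 0}};
        std::vector<int> seq;
        std::vector<bool> used(3, false);
        enumerate(3, hb, seq, used);
        return 0;
    }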
In this regard, given a CES s from a CEG G, if we show that all the possible CESes derived from G can be transformed from s, and if we find the condition under which they yield the same result, eventual consistency will be guaranteed. This is the basic concept of operation commutativity (OC), which is developed from the commutative relation.

Definition 5 (Commutative Relation '↔'). Given concurrent Oa and Ob, they are in a commutative relation, denoted as Oa ↔ Ob, iff: for a replica state RS0, when RS0 ⇒(Oa→Ob) RS1 and RS0 ⇒(Ob→Oa) RS2, RS1 is equal to RS2.

To illustrate the effect of a commutative relation in CESes, consider two CESes of Fig. 1: s1: O1 → O2 → O3 → O4 → O5 at site 0 and s3: O2 → O5 → O1 → O3 → O4 at site 1. Even if O1 ‖ O5 satisfies O1 ↔ O5, we cannot be sure that this commutative relation helps s1 and s3 to be consistent, because the initial states on which O1 ‖ O5 are executed are different and because other operations may or may not intervene between them. Indeed, to make all the possible CESes consistent, the following condition is necessary.

Definition 6 (Operation Commutativity (OC)). Given a CEG G = (V, E), operation commutativity is established in G, iff: Oa ↔ Ob for ∀(Oa ‖ Ob) ∈ E.

OC is the condition in which every pair of concurrent operations commutes. For example, consider s1 and s3 again. If OC holds in the CEG of Fig. 4, O1 ↔ O2, O4 ↔ O5, O3 ↔ O5, and O1 ↔ O5 are ensured. By applying the properties of those commutative relations in sequence, s1 can be transformed into s3. For completeness, we present the following theorem.

Theorem 1. If OC holds in a given CEG G = (V, E), all the possible CESes of G executed from the same initial replica state eventually produce the same state.

This theorem, proved in [29], implies that OC is a sufficient condition for eventual consistency. We, therefore, mandate that every pair of operation types commute when they are concurrent. Besides, OC will be used as a proof methodology. To prove whether a kind of RADT is consistent, we show that each pair of concurrent operations actually commutes on all the replica states, defined exhaustively (see [29] for details). However, OC itself suggests no guideline for how to exploit it. In the next section, precedence transitivity gives a practical way to achieve OC for the RADT operations.
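Definition 5 is directly testable: apply two operations in both orders to copies of the same state and compare the results. The following C++ sketch (our own illustration, with strings standing in for replica states) shows the check; it also demonstrates that naive appends, like the naive Inserts above, do not commute.

    #include <cassert>
    #include <functional>
    #include <string>

    // Testable form of Definition 5: Oa and Ob commute iff executing them
    // in either order from the same replica state yields the same state.
    using Op = std::function<void(std::string&)>;

    bool commutes(const std::string& rs0, const Op& oa, const Op& ob) {
        std::string rs1 = rs0, rs2 = rs0;
        oa(rs1); ob(rs1);        // RS0 ⇒(Oa→Ob) RS1
        ob(rs2); oa(rs2);        // RS0 ⇒(Ob→Oa) RS2
        return rs1 == rs2;       // Oa ↔ Ob iff RS1 = RS2
    }

    int main() {
        Op appendY = [](std::string& s) { s += "y"; };
        Op appendZ = [](std::string& s) { s += "z"; };
        assert(!commutes("x", appendY, appendZ));   // "xyz" != "xzy"
        return 0;
    }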
4. Precedence transitivity

4.1. Precedence transitivity

In RADTs, operations relate to object containers, i.e., the elements of RFAs, the slots of RHTs, and the nodes of RGAs. This relationship is clarified by the notions of causal object (cobject) and effective operation (eoperation).

• cobject: For a local operation O, its cobject is the object container indicated by the index of O. If O is an Insert, it has two cobjects: one is called the left cobject, which is indicated by the index of O (say i), and the other is the right cobject, which is the container of index i + 1 when O is generated.
• eoperation: For an object container, its eoperation is the operation whose local or remote algorithm succeeds in creating/destroying/updating the container.

Except for Inserts, a local operation on an existing container regards the container as its cobject (cf. a Put on no slot has no cobject) while it becomes an eoperation on its cobject. An Insert has two cobjects, but becomes an eoperation only on its new node.

The intention of a remote operation is preserved (1) if a remote Insert places a new node between its two cobjects, (2) if a remote Put on no slot becomes the eoperation on a new slot for its key, or (3) if a remote Write, Put, Remove, Delete, or Update becomes the eoperation on its cobject. The intentions of different operations might be in conflict if they are supposed to be eoperations on a common cobject or if their cobjects overlap. To decide which of two conflicting operations has the higher priority in preserving its intention, precedence rules are needed. Enacting precedence rules is, however, complicated since the rules should not conflict with each other. We, therefore, suggest precedence transitivity, which makes precedence rules consistent with each other. Initially, the precedence relation is defined as an order between two operations as follows.

Definition 7 (Precedence Relation '⇢'). Given two operations Oa and Ob, Ob takes precedence over Oa, denoted as Oa ⇢ Ob, iff: (1) Oa → Ob, or (2) for Oa ‖ Ob, Ob has higher priority than Oa in preserving their intentions.

For Oa → Ob, it is evident that the intention of Ob should be preserved even if that of Oa is impeded or canceled; thus, Oa ⇢ Ob. In a similar sense, the precedence relation between concurrent operations is defined. For instance, suppose Oa ⇢ Ob for Oa ‖ Ob. If they are two Writes on the same element, Ob overwrites the element where Oa has performed, but Oa does nothing on the element where Ob has performed, so that Ob preserves its intention rather than Oa. If Oa and Ob are two Inserts with the same cobjects, Ob should insert its new node closer to the left cobject than Oa, because that makes the effect similar to the effect of Oa ⇢ Ob derived from Oa → Ob. Obviously, intentions that do not conflict are all preserved.

If precedence relations on concurrent operations are arbitrarily enacted, they might conflict with each other. To illustrate, suppose that the operations in Fig. 3 are given as O1: Write(1, o1), O2: Write(1, o2), and O3: Write(1, o3). For each pair of operations, assume the following arbitrary precedence relations: O1 ⇢ O2, O3 ⇢ O1 (from O3 → O1), and O2 ⇢ O3. These precedence relations on an element are expressed with a graph called a precedence relation graph (PRG). A PRG can be derived from a CEG by keeping the directed edges intact and by choosing a direction for each undirected edge. Such a directed complete graph is called a tournament in graph theory [2]. The PRG of the above precedence relations is shown in Fig. 5(a). Assuming that the three operations are executed according to this PRG, the element of index 1 at each site evolves as follows.

Site 0: o? ⇒(O3) o3 ⇒(O1) o1 ⇒(O2) ox,
Site 1: o? ⇒(O2) o2 ⇒(O3) o3 ⇒(O1) oy,
Site 2: o? ⇒(O3) o3 ⇒(O2) o3 ⇒(O1) oz.

The first operations of sites 1 and 2 are local ones, which are effectively executed by the local algorithms; i.e., O2 and O3 become the eoperations on RFA[1], respectively. At site 0, we assume that the remote operation O3 is effectively executed. At each site, the second operation is effectively executed if it takes precedence over the first one; otherwise, it does nothing. Thus, the element of index 1 becomes o1 at site 0 by O3 ⇢ O1, and o3 at sites 1 and 2 by O2 ⇢ O3. When the third operation arrives at each site, its execution must obey the precedence relations with the previous two operations.

Fig. 5. Two PRGs of the time–space diagram of Fig. 3.
For example, at site 1, the execution of O1 should obey both O3 ⇢ O1 and O1 ⇢ O2. However, O1 cannot satisfy both: if O1 does nothing according to O1 ⇢ O2, it violates O3 ⇢ O1, but otherwise O1 ⇢ O2 is disobeyed. We can find the reason in the PRG of Fig. 5(a). Note that PRG (a) has a cycle; that is, the precedence relations are intransitive such that O1 ⇢ O2 and O2 ⇢ O3, but not O1 ⇢ O3. Hence, obeying two precedence relations among the three inevitably leads to violating the rest in this PRG. On the other hand, the other PRG, shown in Fig. 5(b), is an acyclic tournament. Since all the edges in an acyclic tournament are transitive (see Theorem 3.11 in [2]), the third operation at each site can be applied while obeying all the precedence relations; thus, ox, oy, and oz all become o1. In the final analysis, we suggest the following condition as a key principle to realize OC.

Definition 8 (Precedence Transitivity (PT)). Given a CEG G = (V, E), precedence transitivity holds in G, iff: if Oa ⇢ Ob and Ob ⇢ Oc for ∀(Oa ≠ Ob ≠ Oc) ∈ V, then Oa ⇢ Oc.

PT is the condition in which all precedence relations are transitive. Since an acyclic tournament has a unique Hamiltonian path (see Theorem 3.12 in [2]), which visits all the vertices of the graph once, the precedence relations are totally ordered; e.g., the PRG of Fig. 5(b) is ordered as O2 ⇢ O3 ⇢ O1. Note that PT is not a principle that regulates execution orders; operations need never be executed in this order. Instead, each object container has only to store a few hints about its last eoperation(s), which are used to reconcile the intention of a newly arriving operation. In this way, PT enables RADT operations to commute without serialization or state rollback mechanisms.

While OC is a principle only on concurrent operations, PT explains how concurrent relations are designed against happened-before ones; indeed, the precedence relations between concurrent operations (concurrent precedence relations) must accord with the precedence relations inherent in happened-before relations. If static priorities are used to decide concurrent precedence relations, a derived PRG might have cycles. For example, suppose that an operation issued at a site with a higher site ID takes precedence over an operation issued at a site with a lower site ID. The graph of Fig. 6(a) is the PRG derived from the CEG of Fig. 4. As such static priorities never take happened-before relations into account, PRG (a) has a cycle with O3, O4, and O5.

To accord concurrent precedence relations with happened-before ones, any logical clocks that arrange distributed operations in a particular total order, such as Lamport clocks or vector clocks, can be used. For instance, since RADTs are in need of vector clocks for causality preservation, we can use the condition deriving a total order from vector clocks [35]; then, the PRG of Fig. 6(b) can be obtained by making the precedence relations comply with the vector clock orders. Note that RADTs never serialize operations and never undo/do/redo operations; rather, all sites obtain the same effect as serialization by reconciling operation intentions. Instead of using original vector clocks, in Section 5.1 we introduce a fixed-size (quadruple) vector named an s4vector, which is derived from a vector clock. Based on s4vectors, we define a transitive s4vector order.

Fig. 6. Two PRGs of Fig. 4. (a) is the PRG based on static priorities, and (b) is the PRG based on vector clock orders.
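A concrete way to obtain cycle-free concurrent precedence relations, sketched below in C++ under our own naming, is to compare vector clocks by the sum of their entries and to break ties with the unique site ID, following the total order of [35]. For Oa → Ob, the sum of Ob's clock is strictly larger, so the derived PRG always agrees with the happened-before edges and is an acyclic tournament, as in Fig. 6(b).

    #include <numeric>
    #include <vector>

    // A total order on operations derived from their vector clocks:
    // compare the sums of the entries first, then break ties by site ID.
    // Returns true iff Oa ⇢ Ob under this order.
    bool precedes(const std::vector<int>& va, int siteA,
                  const std::vector<int>& vb, int siteB) {
        long sa = std::accumulate(va.begin(), va.end(), 0L);
        long sb = std::accumulate(vb.begin(), vb.end(), 0L);
        if (sa != sb) return sa < sb;   // happened-before implies a smaller sum
        return siteA < siteB;           // unique site IDs break ties
    }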
In RADTs, precedence relations are mostly determined on the basis of s4vector orders and are realized in the remote algorithms by considering the data structures and operation semantics. However, not all precedence relations depend only on the s4vector orders. In RGAs, the precedence relation between a concurrent Update and Delete is always Update ⇢ Delete; i.e., a Delete always succeeds in removing its target container regardless of the s4vector orders. Nevertheless, since no operation happening after the Delete can arrive at the container, PT holds for the operations arriving at the object container. Since the precedence relations on which PT is based are implemented differently for specific operation types, it is difficult to prove in general that PT guarantees eventual consistency. In this paper, we apply PT to the implementations of the operation types, and thus make pairs of operation types commute. Hence, in our report [29], we prove OC for every pair of operation types to which PT is applied. As the proofs show, PT is a successful guideline for achieving OC. Although this paper uses PT as a means of achieving OC, PT by itself could accomplish eventual consistency for the implementations of RFAs or RHTs. Furthermore, unlike OC, PT might be able to ensure consistency for execution sequences that are not CESes. We discuss this issue further in Section 7.

4.2. Discussion

In summary, the relationship among the various concepts introduced so far can be represented as follows:

Responsiveness and interactivity
  ↑ can be enabled by
Eventual Consistency
  ↑ can be guaranteed by
Operation Commutativity (OC)
  ↑ can be exploited by
Precedence Transitivity (PT)

In fact, PT is not the only way to exploit OC. Especially for insertion and deletion, several techniques have been introduced to achieve OC: some approaches derive a total order of objects from partial orders of objects [16,17,19,23], and others introduce dense index schemes [26,40]. In Section 6, we compare those approaches with PT in more detail.

On the other hand, for some types of operations, defining precedence is not possible. For example, consider the four binary arithmetic operations, i.e., addition, subtraction, multiplication, and integer division, allowed on replicated integer variables. Since some pairs of these operations are not commutative, this data type does not spontaneously ensure OC. Unlike the RADT operations, their intentions are realized depending on the previous value as an operand. Therefore, the precedence relation defined for RADT operations is difficult to apply to these arithmetic operations. Nevertheless, OC is still attainable in this example: if multiplications and integer divisions are transformed into appropriate additions or subtractions as the corresponding remote operations, OC can be achieved owing to the commutative laws for additions and subtractions. To illustrate, suppose that, in Fig. 3, O1 is to multiply by 3, O2 to add 2, and O3 to subtract 5 on the initial shared variable 10. If O1 is transformed into the addition of 10, i.e., the amount by which the multiplication increased the value at its local site, the replicated integers will converge to 17.
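The following self-contained C++ sketch (our own illustration of the transformation just described, not code from the paper) replays this scenario: the local multiplication is rebroadcast as an addition of its measured effect, and all three replicas converge to 17.

    #include <cassert>

    // The multiply-to-add transformation: a local multiplication is
    // rebroadcast as the addition of the amount it added, so the remote
    // forms of all three operations commute.
    int main() {
        int site0 = 10, site1 = 10, site2 = 10;
        // site 0: O3 (-5) arrives, then local O1 (*3) is issued
        site0 -= 5;
        int before = site0;
        site0 *= 3;
        int delta = site0 - before;         // O1 is broadcast as "add delta"
        site0 += 2;                         // remote O2 at site 0
        // site 1 executes O2 (+2), O3 (-5), then remote O1 as +delta
        site1 += 2; site1 -= 5; site1 += delta;
        // site 2 executes O3 (-5), O2 (+2), then remote O1 as +delta
        site2 -= 5; site2 += 2; site2 += delta;
        assert(site0 == 17 && site1 == 17 && site2 == 17);
        return 0;
    }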
5. RADT implementations

5.1. The S4Vector

For optimization purposes, we define a quadruple vector type:

typedef S4Vector⟨int ssn, int sid, int sum, int seq⟩;

Let v_O be the vector clock of an operation O issued at site i. Then, an S4Vector s_O can be derived from v_O as follows: (1) s_O[ssn] is a global session number that increases monotonically, (2) s_O[sid] is the site ID, unique to the site, (3) s_O[sum] is Σ(v_O) := Σ_∀k v_O[k], and (4) s_O[seq] is v_O[i], reserved for purging tombstones (see Section 5.6). To illustrate, suppose that v_O = [1, 2, 3] is the vector clock of an operation that is issued at site 0 in session 4. Then, the s4vector of v_O is s_O = ⟨4, 0, 6, 1⟩. As a unit of collaboration, a session begins with initial vector clocks and identical RADT structures at all sites. When the membership changes or a collaboration newly begins with the same RADT structure stored on disk, s_O[ssn] increases. The s4vector of an operation is globally unique because Σ(v_O) is unique to every operation issued at a site. We define an order between two s4vectors as follows.

Definition 9 (S4vector Order '≺'). Given two s4vectors s_a and s_b, s_a precedes s_b, or s_b succeeds s_a, denoted as s_a ≺ s_b, iff: (1) s_a[ssn] < s_b[ssn], or (2) (s_a[ssn] = s_b[ssn]) ∧ (s_a[sum] < s_b[sum]), or (3) (s_a[ssn] = s_b[ssn]) ∧ (s_a[sum] = s_b[sum]) ∧ (s_a[sid] < s_b[sid]).

Lemma 2. S4vector orders are transitive.

Proof. The s4vectors of different sessions are ordered by their monotonically increasing s[ssn]s. Within the same session, the s4vectors of a site are totally ordered because s[sum] grows monotonically. If the s[sum]s are equal across different sites, they are ordered by the unique s[sid]s. Since all s4vectors are totally ordered by the three conditions, s4vector orders are transitive. □

In the remainder of this section, s_O denotes the s4vector of the current operation, derived from v_O in Algorithm 1.
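The derivation and the order of Definition 9 translate directly into C++; the following is a minimal sketch with assumed names, not the paper's implementation.

    #include <numeric>
    #include <vector>

    struct S4Vector { int ssn, sid, sum, seq; };

    // Derive an s4vector from the vector clock vO of an operation issued
    // at site sid during session ssn (Section 5.1).
    S4Vector fromVectorClock(const std::vector<int>& vO, int sid, int ssn) {
        int sum = std::accumulate(vO.begin(), vO.end(), 0);
        return S4Vector{ssn, sid, sum, vO[sid]};
    }

    // The s4vector order '≺' of Definition 9: compare session, then sum,
    // then site ID. The order is total, hence transitive (Lemma 2).
    bool precedes(const S4Vector& a, const S4Vector& b) {
        if (a.ssn != b.ssn) return a.ssn < b.ssn;
        if (a.sum != b.sum) return a.sum < b.sum;
        return a.sid < b.sid;
    }

For instance, fromVectorClock({1, 2, 3}, 0, 4) yields ⟨4, 0, 6, 1⟩, matching the example in the text.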
5.2. Replicated fixed-size arrays

To inform a Write of the last eoperation, an element encapsulates a single s4vector with an object. Using C/C++ language conventions, an element of an RFA is defined as follows.

struct Element {
  Object* obj;
  S4Vector s_p;
};
Element RFA[ARRAY_SIZE];

An RFA is a fixed-size array of Elements. Based on this data structure, Algorithm 5 describes the remote algorithm of Write, where s_O is the s4vector of the current remote operation, and s_p is the s4vector of the last eoperation on the Element.

Algorithm 5 The remote algorithm for Write
1  Write(int i, Object* o)
2    if(RFA[i].s_p ≺ s_O)        // s_O: the s4vector of this Write;
3      RFA[i].obj := o;
4      RFA[i].s_p := s_O;
5      return true;
6    else return false;

Only when s_O succeeds s_p in line 2 does a remote Write(int i, Object* o) replace obj and s_p of the ith Element with the new object o and s_O. This Write becomes the new last eoperation on the Element by replacing RFA[i].s_p with s_O. Since the s4vector of a local operation is always up-to-date at its issue time, it succeeds every s4vector in the local RFA; hence, PT holds in both the local and remote algorithms of Write.

5.3. Replicated hash tables

An RHT is defined as an array of pointers to Slots.

struct Slot {
  Object* obj;
  S4Vector s_p;
  Key k;
  Slot* next;
};
Slot* RHT[HASH_SIZE];

A Slot has a key (k) and a pointer to another Slot (next) for separate chaining. Algorithms 6 and 7 show the remote algorithms for Put and Remove, respectively.

Algorithm 6 The remote algorithm for Put
 1  Put(Key k, Object* o)
 2    Slot *pre_s := nil;
 3    Slot *s := RHT[hash(k)];
 4    while(s != nil and s.k != k)   // find the slot in the chain;
 5      pre_s := s;
 6      s := s.next;
 7    if(s != nil and s_O ≺ s.s_p) return false;
 8    else if(s != nil and s is a tombstone) Cemetery.withdraw(s);
 9    else if(s = nil)
10      s := new Slot;
11      if(pre_s != nil) pre_s.next := s;
12      s.k := k;
13      s.next := nil;
14    s.obj := o;
15    s.s_p := s_O;
16    return true;

A Put first examines whether the Slot of its key k, mapped by a hash function hash, already exists (lines 3–6). If the Put precedes the last eoperation on the Slot, i.e., s_O ≺ s.s_p, it is ignored (line 7). In the case of no Slot, a new Slot is created and connected to the chain (lines 9–13). Finally, it allocates the new object and records the s4vector in the Slot (lines 14–15) only when s.s_p ≺ s_O or when no Slot existed.

Algorithm 7 The remote algorithm for Remove
1  Remove(Key k)
2    Slot *s := RHT[hash(k)];
3    while(s != nil and s.k != k) s := s.next;
4    if(s = nil) throw NoSlotException;
5    if(s_O ≺ s.s_p) return false;
6    if(s is not a tombstone) Cemetery.enrol(s);
7    s.obj := nil;
8    s.s_p := s_O;
9    return true;

A Remove first finds its cobject, addressed by its key k (lines 2–3). Although a local Remove can be invoked on a non-existent Slot, such a Remove is not propagated to remote sites by Algorithms 1 and 3. Consequently, a remote Remove on no Slot throws an exception and does nothing (line 4). In line 5, a Remove is ignored if its s4vector precedes the last eoperation's; otherwise, it demotes its target Slot into a tombstone by assigning nil and s_O to obj and s_p (lines 7–8). Thanks to tombstones, no concurrent operation misses its cobject, so the precedence relation with the last Remove is never lost. Obviously, local Reads regard tombstones as missing Slots. If we recall the example of RHTs in Section 2.3, O1 becomes the last eoperation of the tombstone for k1 while O2 is ignored at site 0.

Removes enrol tombstones in Cemetery, a list of tombstones for purging. In Section 5.6, we discuss their purging condition. If a tombstone receives an operation whose s_O succeeds its s_p, the s_p is replaced with s_O. When a succeeding Put is executed on a tombstone, the tombstone is withdrawn from Cemetery, as in line 8 of Algorithm 6, since it must no longer be purged.
5.4. The s4vector index (SVI) scheme for RGAs

In RGAs, Inserts and Deletes induce the intention violation problem due to integer indices, as stated in Section 2.3; that is, the nodes indicated by integer indices might differ at remote sites. To make remote RGA operations find their intended nodes correctly, this paper introduces the s4vector index (SVI) scheme. A local operation with an integer index is transformed into a remote one with an s4vector before it is broadcast. The SVI scheme is implemented with a hash table which associates an s4vector with a pointer to a node. Note that the s4vector of every operation is globally unique; thus, it can be used as a unique index to find a node in the hash table. As mentioned in Section 2.3, RGAs adopt a linked list to represent the order of objects. After an Insert adds a new node into the linked list, the pointer to the node is registered in the hash table, using the s4vector of the Insert as a hash key. The following shows the overall data structure of an RGA.

struct Node {
  Object* obj;
  S4Vector s_k;  // for a hash key and precedence of Inserts
  S4Vector s_p;  // for precedence of Deletes and Updates
  Node* next;    // for the hash table
  Node* link;    // for the linked list
};
Node* RGA[HASH_SIZE];
Node* head;      // the starting point of the linked list

A Node of an RGA has five variables. s_k is the s4vector index used as a hash key, and is also used for precedence of Inserts. For precedence of Deletes and Updates, s_p is prepared. The two pointers to Nodes, next and link, are for separate chaining in the hash table and for the linked list, respectively. An RGA is defined as an array of pointers to Nodes, like an RHT, and head is the starting point of the linked list.

Fig. 7. An example data structure of an RGA. The Node of τ4 is a tombstone.

Fig. 7 shows an example of an RGA data structure, which is constructed as a linked list combined with a hash table. The local algorithms also operate on such structures. To illustrate, assume that, at session 1, site 2 invokes Insert(3, ox) with a vector clock v_O := [3, 1, 2] on the RGA structure of Fig. 7(a), which can be denoted as [o1 o2 o3 τ4 o5]. As shown in Algorithm 4, the local Insert algorithm first finds its reference Node, i.e., the left cobject o3, in the linked list. Then, it creates a new Node that contains the new object ox in obj and the s4vector s_O = ⟨1, 2, 6, 2⟩ in both s_k and s_p. This Node is placed in the hash table by hashing s_O as a key and is connected to the linked list as shown in Fig. 7(b); thus, [o1 o2 o3 ox τ4 o5]. We assume line 16 of Algorithm 4 does this. Once s_k of a Node is set, it is immutable, and it is thereby adopted as the s4vector index in the remote operation into which a local operation is transformed. For example, a local Insert(3, ox) generated on the RGA of Fig. 7(a) will be transformed into Insert(⟨1, 0, 5, 3⟩, ox) before it is broadcast. In this way, the three RGA operations are broadcast to remote sites in the following forms: Insert(S4Vector i, Object* o), Delete(S4Vector i), and Update(S4Vector i, Object* o), where o is a new object and i is the s4vector index. Here, i is taken from s_k of the cobject of a local Delete/Update, or of the left cobject of a local Insert. If an Insert adds its object at the head, i should be nil.
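The scheme presupposes a hash function over s4vectors; the paper does not fix one, so the following C++ fragment is only a plausible sketch (the function name and constants are ours). Any mix of the fields that identify an operation within a session works, because the key s_k of a Node is immutable once set by its Insert.

    #include <cstddef>

    struct S4Vector { int ssn, sid, sum, seq; };

    // A plausible hash for the SVI hash table: mix the fields that make
    // an operation's s4vector globally unique (ssn, sid, sum).
    std::size_t hashS4(const S4Vector& s, std::size_t buckets) {
        std::size_t h = static_cast<std::size_t>(s.ssn);
        h = h * 1000003u ^ static_cast<std::size_t>(s.sid);
        h = h * 1000003u ^ static_cast<std::size_t>(s.sum);
        return h % buckets;
    }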
5.5. Three remote operations for RGAs

Algorithm 8 shows the remote algorithm for Insert. As shown in Fig. 8, a remote Insert is executed in four steps. (i) First, a remote Insert looks for its left cobject in the hash table with the s4vector index i (lines 5–6). The SVI scheme ensures that this left cobject is always the same as that of the corresponding local Insert. For a non-nil i, the left cobject always exists in the remote RGAs because tombstones remain after Deletes. Hence, an Insert throws an exception unless it finds its cobject (line 7). (ii) Next, an Insert creates a new Node with s_O as a hash key and connects it to the beginning of the chain in the hash table (lines 8–13). (iii) A remote Insert might not add its new Node on the exact right of the left cobject, in order to preserve the intentions of other concurrent Inserts that have already inserted their new Nodes next to the same cobject. If an Insert has a succeeding s4vector, it has higher priority in preserving its intention; thus, it places its new Node nearer its left cobject. Accordingly, in line 20, a remote Insert scans the Nodes next to its left cobject until it first encounters a preceding Node, i.e., one whose s_k precedes ins.s_k. As lines 14–18 are needed for inserting a new object at the head, the condition of line 15 is the converse of that of line 20; if the object is not to be inserted at the head, the comparison continues from line 20. (iv) Finally, the new Node is linked in front of the first encountered preceding Node by lines 21–22.

Algorithm 8 The remote algorithm for Insert
 1  Insert(S4Vector i, Object* o)
 2    Node* ins;
 3    Node* ref;
 4    if(i != nil)                   // (i) Find the left cobject in the hash table;
 5      ref := RGA[hash(i)];
 6      while(ref != nil and ref.s_k != i) ref := ref.next;
 7      if(ref = nil) throw NoRefObjException;
 8    ins := new Node;               // (ii) Make a new Node
 9    ins.s_k := s_O;
10    ins.s_p := s_O;
11    ins.obj := o;
12    ins.next := RGA[hash(s_O)];   // place the new node
13    RGA[hash(s_O)] := ins;        // into the hash table;
14    if(i = nil)                    // (iii) Scan possible places
15      if(head = nil or head.s_k ≺ ins.s_k)
16        if(head != nil) ins.link := head;
17        head := ins;
18        return true;
19      else ref := head;
20    while(ref.link != nil and ins.s_k ≺ ref.link.s_k) ref := ref.link;
21    ins.link := ref.link;          // (iv) Link the new node to the list.
22    ref.link := ins;
23    return true;

The following example, known as the dOPT puzzle [34] (see Section 6), illustrates how Inserts work.

Example 1 (Fig. 3). The dOPT puzzle, on initial RGAs = [oia oib] with i_a = ⟨1, 0, 1, 1⟩ and i_b = ⟨1, 1, 2, 1⟩:
I1: Insert(1 = i_a, oi1) with [1, 0, 1], i_1 = ⟨2, 0, 2, 1⟩,
I2: Insert(1 = i_a, oi2) with [0, 1, 0], i_2 = ⟨2, 1, 1, 1⟩,
I3: Insert(1 = i_a, oi3) with [0, 0, 1], i_3 = ⟨2, 2, 1, 1⟩.

We assume that I1, I2, and I3 correspond to O1, O2, and O3 of Fig. 3, respectively. As all their remote forms have the same s4vector index i_a = ⟨1, 0, 1, 1⟩, their intentions are to insert new nodes next to oia. In this example, i_1, i_2, and i_3 are the s4vectors derived from the above vector clocks on the assumption of session 2. As i_2 ≺ i_3 ≺ i_1, we have I2 ⇢ I3 ⇢ I1; I1 has the highest priority, then I3, and then I2.

At each site of Fig. 3, PT among Inserts is realized as follows. At site 0, remote I3 places oi3 in front of the preceding Node oib of the previous session. Then, I1 is executed as [oia oi1 oi3 oib] by the local Insert algorithm. Finally, remote I2 is executed as in Fig. 8. In line 20, oi1 and oi3 are skipped in turn because they are succeeding Nodes, whose s_k succeeds ins.s_k. Thus, I2 inserts oi2 past oi1 and oi3, giving [oia oi1 oi3 oi2 oib]. At sites 1 and 2, the concurrent I2 and I3 commute despite their different execution orders because the scanning of line 20 sorts oi2 and oi3 between the same cobjects oia and oib as [oia oi3 oi2 oib]. Then, the most succeeding I1 puts oi1 nearest the common left cobject oia, so that I1 preserves its intention more preferentially than I2 and I3; thus, the RGA states eventually converge as follows.

Execution 1. At each site of Fig. 3,
Site 0: [oia oib] ⇒(I3) [oia oi3 oib] ⇒(I1) [oia oi1 oi3 oib] ⇒(I2) [oia oi1 oi3 oi2 oib],
Site 1: [oia oib] ⇒(I2) [oia oi2 oib] ⇒(I3) [oia oi3 oi2 oib] ⇒(I1) [oia oi1 oi3 oi2 oib],
Site 2: [oia oib] ⇒(I3) [oia oi3 oib] ⇒(I2) [oia oi3 oi2 oib] ⇒(I1) [oia oi1 oi3 oi2 oib].

It is worth noting that the consistency is achieved without comparing the s4vector of I1 with the effect of the concurrent I2 at sites 1 and 2. This is due to PT, which harmonizes concurrent precedence relations with happened-before precedence relations.

Fig. 8. The overview of the execution of I2 in Example 1.
Algorithm 9 The remote algorithm for Delete
1  Delete(S4Vector i)
2    Node* n := RGA[hash(i)];
3    while(n != nil and n.s_k != i) n := n.next;
4    if(n = nil) throw NoTargetObjException;
5    if(n is not a tombstone)
6      n.obj := nil;
7      n.s_p := s_O;
8      Cemetery.enrol(n);
9    return true;

The local and remote Delete algorithms leave a tombstone behind. In Algorithm 9, a remote Delete finds its cobject with i via the hash table (lines 2–3); otherwise, it throws an exception (line 4). Regardless of the s4vector order, a Delete assigns nil and s_O into obj and s_p (but not s_k) as the mark of a tombstone, and enrols the tombstone in Cemetery (lines 6–8). Note that tombstones never revive in RGAs. As findlist and findlink in Algorithm 4 exclude tombstones from counting, local operations never employ tombstones as cobjects. For example, in Fig. 7(a), a local Insert(4, oy) refers to o5 instead of the tombstone τ4, and thus is transformed into the remote Insert(⟨1, 1, 1, 1⟩, oy).

Algorithm 10 The remote algorithm for Update
1  Update(S4Vector i, Object* o)
2    Node* n := RGA[hash(i)];
3    while(n != nil and n.s_k != i) n := n.next;
4    if(n = nil) throw NoTargetObjException;
5    if(n is a tombstone) return false;
6    if(s_O ≺ n.s_p) return false;
7    n.obj := o;
8    n.s_p := s_O;
9    return true;

In Algorithm 10, a remote Update operates in the same way as a remote Delete until it finds its cobject. An Update also replaces obj and s_p (but not s_k) of its cobject with its own if s_O succeeds s_p (lines 7–8). Unlike the Put of RHTs, an Update does nothing on a tombstone, as in line 5; thus, always Update ⇢ Delete. This prevents an Update on a tombstone from being translated into the semantics of an Insert, and it makes the purging condition simple (see Section 5.6). Example 2 illustrates how RGA operations interact with each other when they are propagated as shown in Fig. 1.

Example 2 (Fig. 1). Initially, RGAs = [oia] with i_a = ⟨1, 0, 1, 1⟩:
U1 (O1): Update(1 = i_a, ȯia) with [1, 0, 0], i_1 = ⟨2, 0, 1, 1⟩,
U2 (O2): Update(1 = i_a, öia) with [0, 1, 0], i_2 = ⟨2, 1, 1, 1⟩,
D3 (O3): Delete(1 = i_a) with [0, 0, 1], i_3 = ⟨2, 2, 1, 1⟩,
I4 (O4): Insert(0 = nil, oi4) with [2, 1, 1], i_4 = ⟨2, 0, 4, 2⟩,
I5 (O5): Insert(1 = i_a, oi5) with [0, 2, 0], i_5 = ⟨2, 1, 2, 2⟩.
The tombstone τia also enables I5 to find the left cobject after having executed concurrent D3 at sites 1 and 2. In addition, since τia is regarded as a normal preceding Node in Algorithm 8, I4 places oi4 in front of τia at sites 1 and 2. Eventually, RGAs converge at all the sites as follows. Site 2: [oia ] ⇒ [oia oi3 ] ⇒ [oi1 oia oi3 ] ⇒ [oi1 τia oi3 ] ⇒ [oi1 oi3 ]. − → − → Execution 2. At each site of Fig. 1, U1 D3 U2 I5 I4 Site 0: [oia ] ⇒ [ȯia ] ⇒ [öia ] ⇒ [τia ] ⇒ [oi4 τia ] ⇒ [oi4 τia oi5 ], I5 U2 D3 U1 I4 Site 1: [oia ] ⇒ [öia ] ⇒ [öia oi5 ] ⇒ [öia oi5 ] ⇒ [τia oi5 ] ⇒ [oi4 τia oi5 ], D3 U1 U2 I5 I4 Site 2: [oia ] ⇒ [τia ] ⇒ [τia ] ⇒ [τia ] ⇒ [oi4 τia ] ⇒ [oi4 τia oi5 ]. To sum up, the SVI scheme enables remote RGA operations to find their intended Nodes correctly using the hash table. For this − → purpose, s k is prepared in a Node as an s4vector index, which is − → immutable once being set by an Insert. Also, s k is used to realize PT among Inserts. Since tombstones are kept up, no remote operations − → miss their cobjects. Another s4vector of a Node s p is renewed by − → Updates and Deletes. The effectiveness of Updates is decided by s p , but Deletes are always successful; i.e., always Update 99K Delete. Nevertheless, OC holds because no operation happening after a − → − → Delete targets or refers to the tombstone. Separation of s k and s p means that Inserts never conflict with any Updates or Deletes. 5.6. Cobject preservation Cobjects need to be preserved for consistent intentions of operations. If the cobjects causing the effect of a local operation are not preserved at remote sites, its remote operations may cause different effects. Tombstones enable remote operations to manifest their intentions by retaining cobjects, but need purging. However, the tombstone purging algorithm should be cautiously designed for consistency. In fact, the operational transformation (OT) framework has failed to achieve consistency because cobjects are not preserved at remote sites (see Section 6). To illustrate, consider Example 3 where three operations are executed as in Fig. 9. − → a = ⟨1, 0, 1, 1⟩, − → I1 : Insert(0 = nil, oi1 ) with [1, 0, 0] i 1 = ⟨2, 0, 1, 1⟩, − → − → D2 : Delete(1 = i a ) with [0, 1, 0] i 2 = ⟨2, 1, 1, 1⟩, − → − → i 3 = ⟨2, 2, 1, 1⟩. I3 : Insert(1 = i a , oi3 ) with [0, 0, 1] I3 I1 D2 P Observe the effect of I1 concerning the existence of its right cobject, i.e., oia or τia . At site 0, I1 places oi1 at the head. Being indispensable to I3 as the left cobject, τia is retained. At site 2, I1 has to insert oi1 in front of the preceding oia , and then D2 performs. Hence, sites 0 and 2 have the correct final result [oi1 oi3 ]. At site − → − → of o k i3 − → (= i 3 of I3 ) despite I3 having different cobjects; thus, [oi3 oi1 ]. 1, however, if τia is purged, i 1 of I1 is compared with s Consequently, the loss of the right cobject can lead to the different effects of I1 . Instead, if the right cobject is purged at T2 of Fig. 9, the effects of I1 are consistent as follows. Execution 4. At Site 1, if purging τib at T2 , D2 I3 I1 P Site 1: [oia ] ⇒ [τia ] ⇒ [τia oi3 ] ⇒ [oi1 τia oi3 ] ⇒ [oi1 oi3 ]. In this respect, tombstones must be preserved as far as they could be cobjects for consistent operation intentions. However, in RGAs, tombstones, impeding search for Nodes in the linked list, need purging as soon as possible. We, therefore, introduce a safe tombstone purging condition using s4vectors. Let Di be a Delete issued at site i and τi be the tombstone − → caused by Di . 
We, therefore, introduce a safe tombstone purging condition using s4vectors. Let Di be a Delete issued at site i, and let τi be the tombstone caused by Di. Recall that Di assigns its s4vector to τi.s_p, and that RGAs guarantee two properties for a tombstone: (1) a tombstone never becomes the cobject of any subsequent local operation, and (2) a tombstone never revives. Hence, only for the operations concurrent with Di can τi be a cobject. By retaining τi as long as any operations concurrent with Di can still arrive, we prevent those concurrent operations from missing their cobjects.

Golding already introduced a safe condition for this [10]. The existing condition enables RHTs and RGAs to preserve the cobjects of their operations, except the right cobjects of Inserts. To preserve the right cobjects of Inserts, an additional condition is needed. Note that, at site 1 in Example 3, the loss of τia causes problems because the next Node of τia succeeds the s4vector of I1. In other words, if it is ensured that a newly arriving Insert succeeds the s_k of every Node, the tombstone can be substituted by its next Node as a right cobject. To this end, an RGA needs to maintain a set of vector clocks including as many vectors as the number of sites N, i.e., VC_last = {v_last0, ..., v_last(N−1)}; here, v_lastj ∈ VC_last is the vector clock of the last operation issued at site j and successfully executed at the site holding VC_last. Using VC_last, a tombstone τi of Di can be safely purged if it satisfies both of the following conditions.

(1) τi.s_p[seq] ≤ min_{v ∈ VC_last} v[i], where i = τi.s_p[sid];
(2) τi.link.s_k[sum] < min_{v ∈ VC_last} (Σ v), or τi.link = tail.

Condition (1) is similar to Golding's [10]; it means that every site has executed Di, so hereafter only the operations happening after Di will arrive. Condition (2) means that the s4vector of any newly arriving operation succeeds that of the Node next to the tombstone to be purged in the linked list.

Consequently, we prepare Cemetery as a set of FIFO queues, each of which reserves tombstones of a different s_p[sid]. A Delete enrols a tombstone at the end of the queue; thus, enrolling a tombstone takes constant time. A purge operation first inspects the foremost tombstone in each queue of Cemetery to know whether there are any tombstones that can be purged. If such tombstones exist, they are purged with a time complexity of O(N) because the previous Node of a tombstone has to be found in the singly linked list of an RGA. Note that, if a session changes, all tombstones in RGAs can be purged. However, if a site stops issuing operations, the other sites cannot purge tombstones. In this case, a site can request the paused sites to send back their vector clocks and renew VC_last with the received ones, thereby continuing to purge.
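As a sketch (ours, under the same assumed s4vector layout as before), the two conditions translate into a test on a tombstone as follows; VClast holds, per site j, the vector clock of the last operation from j executed at this site.

    #include <cstdint>
    #include <vector>

    struct S4 { std::uint32_t ssn, sid, sum, seq; };
    struct Node { const void* obj; S4 sk, sp; Node* link; };

    // One vector clock per site in the membership.
    using VectorClock = std::vector<std::uint64_t>;

    bool canPurge(const Node* t, const std::vector<VectorClock>& VClast,
                  const Node* tail) {
        // Condition (1): every site has executed D_i, the Delete that made t,
        // so only operations happening after D_i can still arrive.
        const std::uint32_t i = t->sp.sid;
        for (const VectorClock& v : VClast)
            if (v[i] < t->sp.seq) return false;
        // Condition (2): any operation still in flight succeeds the Node
        // next to t in the list, which can then stand in as a right cobject.
        if (t->link == tail) return true;
        std::uint64_t minSum = UINT64_MAX;
        for (const VectorClock& v : VClast) {
            std::uint64_t s = 0;
            for (std::uint64_t c : v) s += c;
            if (s < minSum) minSum = s;
        }
        return t->link->sk.sum < minSum;
    }

Because each queue of Cemetery holds the tombstones of one s_p[sid] in FIFO order, a front entry that fails condition (1) also blocks the entries behind it, which were deleted later; this is consistent with inspecting only the foremost tombstones as described above.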
6. Related work

The concept of commutativity was first introduced in distributed database systems [1,39]. It was, however, applied to concurrency control over centralized resources, not to consistency maintenance among replicas. In other words, to grant more concurrency to some transactions in a locking protocol, transaction schedulers allow only innately commutative operations, e.g., Writes on different objects, to be executed concurrently, while noncommutative operations still have to be locked. For maintaining consistency among replicas, the works of [14,21] considered commutativity, but they also allow only innately commutative operations to be executed in different orders.

The behavior of RFAs is similar to that of Thomas's write rule, introduced for multiple-copy databases [36,37]. The rule prescribes that a Write-like operation can take effect only when it is newer than the previous one [15]. To represent this newness, Lamport clocks are adopted. In fact, Lamport clocks could be used in RADTs in place of the sum of s4vectors. We, however, use the s4vector derived from a vector clock to ensure not only causality preservation but also cobject preservation. The idea of tombstones was also introduced in replicated directory services [21,9,6], which are similar to RHTs.

With respect to RGAs, the operational transformation (OT) framework is one of the few relevant approaches that allow optimistic insertions and deletions on ordered characters. In this framework, an integration algorithm calls a series of transformation functions (TFs) to transform the integer index of every remote operation against each of the concurrent operations in the history buffer. A function O′a = tf(Oa, Ob) obtains a transformed operation of Oa against Ob, that is, O′a, which is mandated to satisfy the following properties, called TP1 and TP2.

Property 1 (Transformation Property 1 (TP1)). For O1 ‖ O2 issued on the same replica state, tf satisfies TP1 iff: O1 → tf(O2, O1) ≡ O2 → tf(O1, O2).

Property 2 (Transformation Property 2 (TP2)). For O1 ‖ O2, O2 ‖ O3, and O1 ‖ O3 issued on the same replica state, tf satisfies TP2 iff: tf(tf(O3, O1), tf(O2, O1)) = tf(tf(O3, O2), tf(O1, O2)).

TP1, introduced in the dOPT algorithm by Ellis and Gibbs [7], is another expression of the commutative relation of Definition 5 in terms of the OT framework. As it is not sufficient, a counterexample, called the dOPT puzzle (see Example 1), was found. Ressel et al. proposed TP2, which means that consecutive TFs along different paths must result in a unique transformed operation [27]. It was proven that TP1 and TP2 are sufficient conditions for the eventual consistency of some OT integration algorithms such as adOPTed [20,32], but it is worth comparing them with OC and PT. Though we are not sure whether TP2 is equivalent to OC, TP2 is clearly a property only on sequences of concurrent operations. If happened-before operations intervene among concurrent operations, OC and PT explain how operations should be designed, but TP2 says nothing.

Various OT methods consisting of different TFs or integration algorithms have been introduced, e.g., adOPTed [27], GOT [35], GOTO [34], SOCT2 [32], SOCT4 [38], SDT [16], and TTF [22]. However, since no guidelines on preserving TP2 have been presented, counterexamples have been found for most OT algorithms, such as adOPTed, GOTO, SOCT2, and SDT. Though GOT and SOCT4 avoid TP2 by fixing the transformation path through an undo-do-redo scheme or a global sequencer, responsiveness is significantly degraded. In addition, intention preservation, addressed in [35], has also suffered from the lack of effective guidelines. We believe PT could be an effective clue to preserving both TP2 and intention preservation. For example, when OT methods break ties among insertions, or write the TF for an Update-like operation, PT will explain how to compromise their conflicting intentions.
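As a minimal illustration of TP1, not drawn from the paper, consider the classic index-shifting TF for two concurrent character insertions; the tie-break at equal positions by site id is precisely the kind of conflict decision that PT systematizes.

    #include <cassert>
    #include <string>

    // A concurrent insertion of one character at an integer position.
    struct Ins { std::size_t pos; char ch; int site; };

    // Transform Oa against a concurrent Ob: shift right if Ob inserted at or
    // before Oa's position; at equal positions the smaller site id goes first.
    Ins tf(Ins a, const Ins& b) {
        if (b.pos < a.pos || (b.pos == a.pos && b.site < a.site)) ++a.pos;
        return a;
    }

    std::string apply(std::string s, const Ins& o) {
        s.insert(o.pos, 1, o.ch);
        return s;
    }

    int main() {
        const std::string s = "abc";
        const Ins o1{1, 'x', 0}, o2{1, 'y', 1};
        // TP1: O1 followed by tf(O2,O1) equals O2 followed by tf(O1,O2);
        // both orders yield "axybc" here.
        assert(apply(apply(s, o1), tf(o2, o1)) == apply(apply(s, o2), tf(o1, o2)));
        return 0;
    }

TP2 constrains tf far more severely: the transformed forms of a third concurrent operation must coincide along both transformation paths, and it is this property that most published TFs have failed to preserve.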
Another reason why OT methods have failed, not only in consistency but also in intention preservation, is the loss of cobjects (see Section 5.6), when illustrated in terms of our RADT framework. As in Execution 3, once a character is removed, it is difficult for TFs to take it into account. Accordingly, Li et al. have suggested a series of algorithms, such as SDT [16], ABT [17], and LBT [19]. The authors state that these algorithms are free from TP2 by relying on the effects relation, which is an order between every pair of characters. However, deriving effects relations incurs additional overhead to transpose operations in the history buffer, or to reserve the effects relations in an additional table.

Oster et al. introduced the TTF approach, which invites tombstones into the OT framework [22]. TTF introduces TFs satisfying TP2 based on a document that grows indefinitely. However, purging tombstones in TTF is more restrictive than in RGAs because purging makes integer indices incomparable at different sites. Hence, some optimizations, such as caret operations or D-TTF, are provided. Since TTF must be combined with an existing OT integration algorithm, such as adOPTed or SOCT2, it inherits the characteristics of the OT framework.

Recently, several prototypes adopting unique indices were introduced for optimistic insertion and deletion. Oster et al. proposed the WOOT framework, which is free from vector clocks for scalability [24]. Instead, the causality of an operation is checked by the existence of its cobjects at remote sites; in RGAs, too, this can be employed through the SVI scheme (see Section 7). In WOOT, a character has a unique index of the pair ⟨site ID, logical clock⟩ and includes the indices of the two cobjects of its insertion; accordingly, an insertion is parameterized with the two indices. For consistency, an insertion derives a total order among existing characters by considering the characters on both sides, but this makes the insertion of WOOT an order of magnitude slower than that of RGAs. WOOT also keeps tombstones, but its purging algorithm has not yet been presented.

Meanwhile, independently of our early work [28], Shapiro et al. proposed a commutative replicated data type (CRDT) called treedoc that adopts an ingenious index scheme [31,26]. Treedoc is a binary tree whose paths to nodes are unique indices, totally ordered in infix order. Using paths as unique indices, treedoc can avoid storing indices separately and can continuously provide a new index for a new node. Conflicting insertions require special indices, like the WOOT indices, to place their new characters into the same node. Besides, if a deletion is performed on a node that is not a leaf, its tombstone must be preserved so as not to change the indices of its child nodes. Thus, as a treedoc ages, it becomes unbalanced and may contain many tombstones. To clean up a treedoc, the authors suggest two structural operations, flatten and explode, which obtain a character string from a treedoc and vice versa. However, flatten, requiring a distributed commitment protocol, is costly and not scalable.

For scalability purposes, Weiss et al. suggested logoot, a sparse n-ary tree, which provides new indices continuously like treedoc [40].
However, unlike in treedoc, a node of logoot encapsulates a unique index that is an unbounded sequence of pairs ⟨pos, site ID⟩, where pos is the position in a logoot tree. Explicit indices allow logoot trees to be sparse; that is, no tombstone is needed for a deletion. Owing to the absence of tombstones, logoot could incur less overhead than treedoc for the numerous operations abounding in large-scale replication. To enhance scalability, the authors also suggest the causal barrier [25] in place of vector clocks for causality preservation. Although causal barriers can reduce the transmission overhead, sites manage their local clocks in the same manner as with vector clocks for membership changes. In fact, causality preservation relates to reliability, which is discussed in Section 7.

Above all, none of the above three approaches derives an underlying principle, such as PT. Therefore, including the effects relations of Li et al. [19], these approaches are bound only to the consistency of insertion and deletion. In other words, they present no solution for the Update-like operation. Though an update can be emulated with a consecutive deletion and insertion, if multiple users concurrently update the same object, multiple objects will be obtained. To our knowledge, the update has not been discussed in the OT framework. Instead, independently of the OT framework, Sun et al. proposed a multi-version approach for the update in collaborative graphical editors, where the order of graphical objects does not matter [33]. In this approach, when operations updating some attributes, such as position or color, are in conflict, multiple versions of the object are shown to users in order not to lose any intentions. Although the behavior of RADTs must be deterministic as building blocks, RADTs can mimic the multi-version approach; if remote Puts or Updates return false, their effects can be shown as auxiliary information by using local ADTs.

7. Complexity, scalability, and reliability

As building blocks, the time complexity of RADTs is decisive for the performance and quality of collaborative applications. The time complexity of the local RADT operations is the same as that of the operations of the corresponding normal ADTs. In RFAs and RHTs, the remote operations perform with the same complexity as the corresponding local ADT operations based on the same data structures; thus, Write, Put, and Remove work optimally in O(1). Only when the hash functions malfunction is the theoretical worst case of Put and Remove O(N) for the number of objects N, owing to the separate chaining.

The local RGA operations with integer indices work in O(N) time since findlist in Algorithm 4 searches for intended Nodes via the linked list from the head. As mentioned in Section 2.3, RGAs also support local pointer operations taking constant time by using the findlink function. Meanwhile, the remote RGA operations can perform in O(1) time as a Node is searched via the hash table. The worst-case complexity of a remote Insert can be O(N) in case all the existing Nodes have been inserted by concurrent Inserts on the same reference Node.
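The two local cost figures can be read directly off the access paths. The following sketch (again under an assumed Node layout) shows the O(N) scan behind local integer indices; a remote operation instead starts from RGA[hash(i)] and completes in O(1).

    #include <cstddef>

    struct Node {
        const void* obj;  // nullptr marks a tombstone
        Node* link;       // list order of the RGA
    };

    // In the spirit of findlist: the k-th visible object is located by a
    // linear scan from the head; tombstones are skipped in counting but
    // still traversed, which is why purging also matters locally.
    Node* findlist(Node* head, std::size_t k) {
        for (Node* n = head; n != nullptr; n = n->link)
            if (n->obj != nullptr && k-- == 0) return n;
        return nullptr;  // integer index out of range
    }

A local pointer operation (findlink) avoids the scan by starting from a Node the application already holds, e.g., the caret position, so it reaches its cobject in constant time.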
We compare RGAs with WOOT and the recent OT methods, such as ABT, SDT, and TTF, in time complexity. For the complexity of ABT and SDT, we consult [18]. For the complexity of TTF, we assume that D-TTF is combined with the adOPTed integration algorithm [22,27]. According to [18], the performance of the remote operations of both ABT and SDT fluctuates depending on the size and characteristics of the history buffer. WOOT presents different time complexities for insertion and deletion. Also, the complexity differs according to the policy used to find a character; we assume that a local operation finds a character in O(1) time but a remote operation does so in O(N) time [23].

Table 1
Time complexity of local and remote operations in a few algorithms.

Algorithms          Local operations         Remote operations
RGAs                O(N) or O(1)^a           O(1)
ABT                 O(|H|)                   O(|H|²)
SDT                 O(1)                     O(|H|²) or O(|H|³)^c
D-TTF w/adOPTed     O(N) or O(1)^b           O(|H|² + N)
WOOT                O(N²)^d and O(1)^e       O(N³)^d and O(N)^e

N: the number of objects or characters; |H|: the number of operations in the history buffer.
a Local pointer operations.
b The caret operations.
c Worst-case complexity.
d WOOT insertion operation.
e WOOT deletion operation.

Table 1 shows that RGAs overwhelm the others, especially in the performance of the remote operations. More significantly, the remote RGA operations perform without fluctuation, and thus guarantee stable responsiveness in collaboration.

We examine scalability in two respects: membership size and the number of objects. As the membership size or the number of objects scales up, performance may degrade. In a group communication model like that of the RADT system, the more sites participate in a group, the more remote operations a site must execute. For example, suppose that each of s = 16 sites evenly generates N = 6250 operations. Then, though all sites equally execute s × N = 100,000 operations, each site will execute 6250 local and 93,750 remote operations. Consequently, the performance of the remote operations is critical to scalability. RGAs have optimal remote operations whose cost is independent of the number of objects. RGAs, therefore, are scalable, as will be demonstrated in the next section. Meanwhile, OT methods are unscalable both because their remote operations are inefficient and because history buffers are likely to grow for larger memberships.
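The arithmetic of the 16-site example is worth making explicit; the short computation below simply reproduces it and prints the remote share of the work that the O(1) remote path must absorb.

    #include <cstdio>

    int main() {
        const int s = 16, N = 6250;      // sites, operations issued per site
        const int total = s * N;         // 100,000 operations executed everywhere
        const int local = N;             // 6,250 executed via the local algorithms
        const int remote = total - N;    // 93,750 arriving from the other sites
        std::printf("local = %d, remote = %d (%.1f%% remote)\n",
                    local, remote, 100.0 * remote / total);
        return 0;
    }

With nearly 94% of all executions being remote, keeping the remote path at O(1) is what keeps the total cost flat as s grows.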
In the meantime, scalability can be affected by vector clocks, which are adopted by most optimistic replication systems requiring causality detection and preservation, because the clock size must be proportional to the membership size. In the OT framework, a vector clock per operation has to be stored in the history buffer for causality detection. Hence, the space complexity of maintaining the history buffer is O(s × |H|), where s is the number of sites and |H| itself tends to grow with s. In addition, OT methods also demand at least O(N) space to store a document or the effects relations. RGAs, however, reserve only two s4vectors per object because PT enables consistency to be achieved without causality detection; thus, the space complexity is O(N). Therefore, the s4vector enhances the scalability of RGAs with respect to space overhead.

In fact, the overhead incurred by tombstones cannot be ignored in collaborative applications; thus, tombstone purging algorithms have an impact on overhead. The space complexity of treedoc and WOOT is O(N) including tombstones, while the unbounded indices of logoot might incur theoretically higher space complexity despite the absence of tombstones [40]. Compared with treedoc or WOOT, RGAs may suffer more overhead owing to the entries of tombstones. However, unlike treedoc or WOOT, whose indices are structurally related to tombstones, an RGA can purge tombstones regardless of indices as long as it continuously receives operations from the other sites. Section 8 presents a simple experiment regarding tombstones.

Like most optimistic replication systems, RADTs also constrain themselves to preserve the causality defined by Lamport [15], but this relates to the reliability issue. When a site broadcasts an operation, some of the other sites may lose it. Though a fault such as an operation loss can be detected by the causality preservation scheme, it leads to a chain of delays in executing the operations happening after the lost operation. In the sense that a fault might result in a failure, preserving causality could significantly degrade reliability in scalable collaborative applications. In Example 2 and Fig. 1, if site 2 misses U2, then I4 and I5, happening after U2, should be delayed. However, I4 and I5 can be executed without delay while U2 is being retransmitted, because U2 has no essential causality with I4 and I5. In other words, reliability can be enhanced by relaxing causality.

Relaxing causality permits sites to execute some additional operation sequences that are not CESes, but that preserve only essential causality (say, eCESes). Then, OC does not guarantee consistency for eCESes. In fact, WOOT, the only approach allowing eCESes, verifies consistency with the model checker TLC on a specification written in the TLA+ specification language [41]. This checker exhaustively verifies all the states produced by all possible executions of operations, leading to an explosion of states; thus, the verification of WOOT is performed only for up to four sites and five characters [23]. In any case, for eCESes consisting of insertions and deletions, there is no generalized proof methodology for consistency yet. In RFAs and RHTs, it is simple to show that eventual consistency is guaranteed even for eCESes, though line 4 of Algorithm 7 needs to be modified not to throw an exception; PT ensures that the last effective operation on a container is always identical if the same set of operations is executed. To achieve consistency for eCESes in RGAs, the algorithms necessarily need to be modified. We leave causality relaxation as future work, but believe OC and PT can be clues to making eCESes converge and to proving consistency.

Compared with the OT framework, RGAs have much room for improving reliability. In the OT framework, all the happened-before operations of a remote operation are indispensable to satisfying its precondition, addressed by Sun et al. [35], because integer indices depend on all previously happened operations by nature. Meanwhile, by means of the SVI scheme, the remote RGA operations can validate their cobjects autonomously, i.e., independently of most other operations, thereby checking essential causality without vector clocks. Hence, like WOOT, RGAs have a chance to be free from vector clocks, which could improve reliability.
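The following sketch suggests what such autonomous validation could look like: a remote operation is executable as soon as its own cobject is present, so a lost, unrelated operation stalls nothing. The pending set and the names below are our own illustration, not part of the RGA algorithms.

    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    // A packed stand-in for an s4vector index; any injective encoding works.
    using S4Key = std::uint64_t;

    struct RemoteOp {
        S4Key cobject;  // the target (Update/Delete) or left cobject (Insert)
        // ... operation payload ...
    };

    // Essential causality check: only the operation's own cobject must exist.
    bool executable(const std::unordered_set<S4Key>& nodes, const RemoteOp& op) {
        return nodes.count(op.cobject) != 0;
    }

    // Operations whose cobjects have not arrived wait in a pending set and
    // are retried whenever an executed Insert registers a new Node; nothing
    // waits for the full happened-before history, unlike an OT precondition.
    void integrate(std::unordered_set<S4Key>& nodes,
                   std::vector<RemoteOp>& pending, const RemoteOp& op) {
        if (!executable(nodes, op)) { pending.push_back(op); return; }
        // ... execute op; if it is an Insert, add its new key to nodes,
        // then rescan pending for operations that have become executable ...
    }

In the scenario of Fig. 1 above, a retransmitted U2 would simply sit in the pending path (or apply late via its s4vector), while I4 and I5 execute immediately.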
8. Performance evaluation

We perform several experiments on RGAs to verify that the RGA operations actually behave as analyzed in Section 7 and to compare them with previous approaches. To our knowledge, however, no previous approaches have presented performance evaluations yet, except SDT and ABT [18]. Comparably to the experiments in [18], RGAs are implemented in C++ and compiled with GNU g++ v4.2.4 on Linux kernel 2.6.24.

We automatically generate intensive workloads modeling real-time collaborations with respect to the following four parameters (a sketch of this generator follows below):

• s: the number of sites.
• N: the number of operations that a site generates. In the experiments, every site evenly generates N operations. Hence, every site executes N local operations and N × (s − 1) remote operations on its RGA.
• avd: average delay. As shown in Fig. 10, at every turn, a site either generates a local operation or receives a remote operation. Operations generated at a site are delivered at arbitrary forward turns of the other sites, preserving the issuing order of their local site. Thus, delays are measured in 'turns', and avd is the average number of turns that the total s × N operations take to be delivered. We indirectly control avd by stipulating the maximum delay of an operation.
• mo: minimum number of objects. Since all experiments are devised to begin with empty RGAs, mo controls the number of objects in RGAs during the evaluation of a workload. If an RGA at a site has fewer than mo objects (excluding tombstones), the site generates only Inserts. Otherwise, the site randomly generates one of Insert, Delete, and Update in equal proportions.

Fig. 10. An example of a workload generation where s = 3.

For three groups of operations, i.e., (LI) local operations with integer indices, (LP) local operations with pointer indices, and (R) remote operations with s4vector indices, the generated indices are uniformly distributed over the current RGAs. The times for communication and buffering, which are unrelated to our proposed algorithms, are excluded from the measurements. Currently, a purge operation is invoked whenever a remote operation is executed (see Section 5.6). We run the workloads on an Intel Pentium-4 2.8 GHz CPU with 1 GB RAM.
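For concreteness, the following sketch (ours, with the operation contents and the mo rule elided) generates a turn-based workload in the spirit of Fig. 10: at every turn a site either issues an operation, scheduling its delivery at random forward turns of the other sites while preserving per-sender order, or consumes one arrived remote operation.

    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <vector>

    struct Op { int src; long seq; };

    int main() {
        const int s = 3, turns = 1000, maxDelay = 6;  // demo parameters
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> delay(1, maxDelay);
        // inbox[site][turn] holds the remote operations arriving at that turn.
        std::vector<std::vector<std::vector<Op>>> inbox(
            s, std::vector<std::vector<Op>>(turns + maxDelay + 1));
        std::vector<std::vector<int>> lastTurn(s, std::vector<int>(s, 0));
        long issued = 0, delaySum = 0, deliveries = 0;
        for (int t = 0; t < turns; ++t) {
            for (int site = 0; site < s; ++site) {
                auto& q = inbox[site][t];
                if (!q.empty()) {                     // receive one remote op
                    q.erase(q.begin());
                    auto& next = inbox[site][t + 1];  // defer the rest
                    next.insert(next.begin(), q.begin(), q.end());
                    q.clear();
                    continue;
                }
                Op op{site, issued++};                // otherwise issue locally
                for (int other = 0; other < s; ++other) {
                    if (other == site) continue;
                    // Random forward turn, bumped to keep per-sender order.
                    int dt = std::max(t + delay(rng), lastTurn[site][other] + 1);
                    lastTurn[site][other] = dt;
                    inbox[other][dt].push_back(op);
                    delaySum += dt - t;
                    ++deliveries;
                }
            }
        }
        std::printf("issued = %ld ops, avd = %.2f turns\n",
                    issued, double(delaySum) / double(deliveries));
        return 0;
    }

The reported avd here is the scheduled delay; in our experiments, avd is controlled in the same indirect way, through the maximum delay.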
Fig. 11 shows the average execution time of each operation with respect to the number of objects. By restricting mo, we control the average number of objects, as in the line chart of Fig. 11. As predicted by the time complexities of Table 1, only the execution times of operations (LI) are proportional to the number of objects, whereas those of operations (LP) and (R) are unaffected. The execution time of the purge operation is also affected by the number of objects, but it is less susceptible than that of operations (LI) because tombstones are not always purged.

Fig. 11. (Object effect) [s = 16 sites, N = 6250 ops, avd = 25.7 turns] With respect to mo, the average execution time of each operation (the column chart with the left y-axis) and the average number of objects including tombstones (the line chart with the right y-axis).

Compared with the OT operations of SDT and ABT [18], which were implemented in C++, run on an Intel Pentium-4 3.4 GHz CPU, and evaluated on workloads generated from only two sites, the RGA operations overwhelm the OT operations in performance. As shown in Table 1, the results of [18] prove that the history size decides the performance of the OT operations. For example, once a site has executed more than 3000 local and 1000 remote operations, it takes more than 600 ms (10^-3 s) and 100 ms to execute a remote SDT operation and two ABT operations (one local and one remote), respectively [18]. Operations (LP) and (R), however, execute in 0.4–1.3 µs (10^-6 s) in our environment, and operations (LI) are also fast enough unless an RGA contains an excessive number of objects.

Fig. 12 shows the effect of delays. In RGAs, delays decide the lifetime of tombstones. As stated in Section 5.6, tombstones can be purged if a site continues to receive operations from all the other sites. In the line chart of Fig. 12, though each of roughly 33,000 Deletes makes one tombstone, a smaller avd decreases the number of tombstones; irrespective of avd, the average number of objects excluding tombstones is around 800. As a result, longer delays degrade the performance of operations (LI), but not of operations (LP) and (R). Also, avd hardly affects purge operations since the numbers of purged tombstones are similar. Actually, in our experiments, one tombstone is purged for every three purge operations on average.

Fig. 12. (Delay effect) [s = 16 sites, N = 6250 ops, mo = 800 objs] With respect to avd, the average execution time of each operation (the column chart with the left y-axis) and the average number of objects including tombstones (the line chart with the right y-axis).

In treedoc [26] and logoot [40], overhead was evaluated; tombstones and fixed-size indices incur overhead in treedoc, and unbounded indices do so in logoot. Though their workloads are obtained from Wikipedia or LaTeX revisions, such workloads cannot model the real-time collaborations in which multiple sites concurrently participate. Though overhead depends on workloads in all approaches, purging tombstones in RGAs is less costly than in treedoc, and, unlike logoot indices, the size of a Node is fixed; as the sizes of an S4Vector and a Node are 12 bytes and 36 bytes, respectively, the overhead of 33,000 tombstones is about 1.1 MB without purging. In addition, an update, emulated by a consecutive deletion and insertion in those two approaches, also produces a tombstone or lengthens indices; meanwhile, Updates incur no additional overhead in RGAs.

To verify the scalability addressed in Section 7, we have a site execute a total of 100,000 operations that are equally generated by all s sites; hence, a site executes 100,000/s local operations and 100,000 × (1 − 1/s) remote and purge operations. With respect to s, the accumulated execution times are presented in Fig. 13. Except for the times for purge operations, the accumulated execution times tend to decrease as s gets larger. However, the times for purge operations, which are invoked more frequently, offset the gains obtained by replacing slow local operations with fast remote ones. Notwithstanding, the results prove that RGAs are scalable with respect to the number of sites, owing to the excellent performance of the remote operations.

Fig. 13. (Site effect) [s × N = 100,000 ops, mo = 800 objs, avd = 16.5–27.3 turns] With respect to s, the accumulated execution time of the total 100,000 operations.
9. Conclusions

When developing applications, programmers are used to using various ADTs. By providing the same semantics as ADTs to programmers, RADTs can support efficient implementations of collaborative applications. Operation commutativity and precedence transitivity make it possible to design the complicated optimistic RGA operations without serialization, locking protocols, state rollback, undo-do-redo schemes, or OT methods. Especially in performance, RGAs provide remote operations of O(1) with the SVI scheme using s4vectors. This is a significant achievement over previous works and makes RGAs scalable. We have demonstrated this outstanding performance of RGAs with intensive workloads. Furthermore, since the SVI scheme autonomously validates the causality and intention of an RGA operation, reliability would be enhanced. The work presented here, therefore, has profound implications for future studies of other RADTs, such as various tree data types.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2010-0000829). We would like to give warm thanks to all the anonymous reviewers and special thanks to Dr. Marc Shapiro at INRIA.

References

[1] B.R. Badrinath, K. Ramamritham, Semantics-based concurrency control: beyond commutativity, ACM Transactions on Database Systems 17 (1) (1992) 163–199.
[2] V. Balakrishnan, Graph Theory, McGraw-Hill, New York, 1997.
[3] P.A. Bernstein, N. Goodman, An algorithm for concurrency control and recovery in replicated distributed databases, ACM Transactions on Database Systems 9 (4) (1984) 596–615.
[4] K. Birman, R. Cooper, The ISIS project: real experience with a fault tolerant programming system, SIGOPS Operating Systems Review 25 (2) (1991) 103–107.
[5] K.P. Birman, A. Schiper, P. Stephenson, Lightweight causal and atomic group multicast, ACM Transactions on Computer Systems 9 (3) (1991) 272–314.
[6] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, D. Terry, Epidemic algorithms for replicated database maintenance, in: Proceedings of ACM Symposium on Principles of Distributed Computing, PODC, 1987, pp. 1–12.
[7] C.A. Ellis, S.J. Gibbs, Concurrency control in groupware systems, in: Proceedings of ACM International Conference on Management of Data, SIGMOD, 1989, pp. 399–407.
[8] C.A. Ellis, S.J. Gibbs, G. Rein, Groupware: some issues and experiences, Communications of the ACM 34 (1) (1991) 39–58.
[9] M.J. Fischer, A. Michael, Sacrificing serializability to attain availability of data in an unreliable network, in: Proceedings of ACM Symposium on Principles of Database Systems, PODS, 1982.
[10] R.A. Golding, Weak-consistency group communication and membership, Ph.D. Thesis, University of California, Santa Cruz, 1992.
[11] Google Inc., Google wave protocols, 2009. http://www.waveprotocol.org/.
[12] J. Gray, P. Helland, P. O'Neil, D. Shasha, The dangers of replication and a solution, in: Proceedings of ACM International Conference on Management of Data, SIGMOD, 1996, pp. 173–182.
[13] S. Greenberg, D. Marwood, Real time groupware as a distributed system: concurrency control and its effect on the interface, in: Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW, 1994, pp. 207–217.
[14] P.A. Jensen, N.R. Soparkar, A.G. Mathur, Characterizing multicast orderings using concurrency control theory, in: Proceedings of IEEE International Conference on Distributed Computing Systems, ICDCS, 1997, pp. 586–593.
[15] L. Lamport, Time, clocks, and the ordering of events in a distributed system, Communications of the ACM 21 (7) (1978) 558–565.
[16] D. Li, R. Li, Preserving operation effects relation in group editors, in: Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW, 2004, pp. 457–466.
[17] R. Li, D. Li, Commutativity-based concurrency control in groupware, in: International Conference on Collaborative Computing: Networking, Applications and Worksharing, CollaborateCom, 2005, p. 10.
[18] D. Li, R. Li, A performance study of group editing algorithms, in: Proceedings of International Conference on Parallel and Distributed Systems, ICPADS, IEEE Computer Society, 2006, pp. 300–307.
[19] R. Li, D. Li, A new operational transformation framework for real-time group editors, IEEE Transactions on Parallel and Distributed Systems 18 (3) (2007) 307–319.
[20] B. Lushman, G.V. Cormack, Proof of correctness of Ressel's adOPTed algorithm, Information Processing Letters 86 (6) (2003) 303–310.
[21] S. Mishra, L.L. Peterson, R.D. Schlichting, Implementing fault-tolerant replicated objects using Psync, in: Proceedings of Symposium on Reliable Distributed Systems, 1989, pp. 42–52.
[22] G. Oster, P. Molli, P. Urso, A. Imine, Tombstone transformation functions for ensuring consistency in collaborative editing systems, in: International Conference on Collaborative Computing: Networking, Applications and Worksharing, CollaborateCom, 2006, pp. 1–10.
[23] G. Oster, P. Urso, P. Molli, A. Imine, Real time group editors without operational transformation, Rapport de recherche RR-5580, INRIA, May 2005.
[24] G. Oster, P. Urso, P. Molli, A. Imine, Data consistency for P2P collaborative editing, in: Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW, 2006, pp. 259–268.
[25] R. Prakash, M. Raynal, M. Singhal, An adaptive causal ordering algorithm suited to mobile computing environments, Journal of Parallel and Distributed Computing 41 (2) (1997) 190–204.
[26] N. Preguiça, J.M. Marqués, M. Shapiro, M. Letia, A commutative replicated data type for cooperative editing, in: Proceedings of IEEE International Conference on Distributed Computing Systems, ICDCS, 2009.
[27] M. Ressel, D. Nitsche-Ruhland, R. Gunzenhäuser, An integrating, transformation-oriented approach to concurrency control and undo in group editors, in: Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW, 1996, pp. 288–297.
[28] H.-G. Roh, J. Kim, J. Lee, How to design optimistic operations for peer-to-peer replication, in: Joint Conference on Information Sciences, JCIS, 2006.
[29] H.-G. Roh, J.-S. Kim, J. Lee, S. Maeng, Optimistic operations for replicated abstract data types, Technical Report CS-TR-2009-318, KAIST, 2009.
[30] Y. Saito, M. Shapiro, Optimistic replication, ACM Computing Surveys 37 (1) (2005) 42–81.
[31] M. Shapiro, N. Preguiça, Designing a commutative replicated data type, Rapport de recherche RR-6320, INRIA, October 2007.
[32] M. Suleiman, M. Cart, J. Ferrié, Concurrent operations in a distributed and mobile collaborative environment, in: Proceedings of International Conference on Data Engineering, ICDE, IEEE Computer Society, 1998, pp. 36–45.
[33] C. Sun, D. Chen, Consistency maintenance in real-time collaborative graphics editing systems, ACM Transactions on Computer-Human Interaction 9 (1) (2002) 1–41.
[34] C. Sun, C.S. Ellis, Operational transformation in real-time group editors: issues, algorithms, and achievements, in: Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW, 1998, pp. 59–68.
[35] C. Sun, X. Jia, Y. Zhang, Y. Yang, D. Chen, Achieving convergence, causality preservation, and intention preservation in real-time cooperative editing systems, ACM Transactions on Computer-Human Interaction 5 (1) (1998) 63–108.
[36] D.B. Terry, M.M. Theimer, K. Petersen, A.J. Demers, M.J. Spreitzer, C.H. Hauser, Managing update conflicts in Bayou, a weakly connected replicated storage system, in: Proceedings of ACM Symposium on Operating Systems Principles, SOSP, 1995, pp. 172–182.
[37] R.H. Thomas, A majority consensus approach to concurrency control for multiple copy databases, ACM Transactions on Database Systems 4 (2) (1979) 180–209.
[38] N. Vidot, M. Cart, J. Ferrié, M. Suleiman, Copies convergence in a distributed real-time collaborative environment, in: Proceedings of ACM Conference on Computer Supported Cooperative Work, CSCW, 2000, pp. 171–180.
[39] W.E. Weihl, Commutativity-based concurrency control for abstract data types, IEEE Transactions on Computers 37 (12) (1988) 1488–1505.
[40] S. Weiss, P. Urso, P. Molli, Logoot: a scalable optimistic replication algorithm for collaborative editing on P2P networks, in: Proceedings of IEEE International Conference on Distributed Computing Systems, ICDCS, IEEE Computer Society, 2009.
[41] Y. Yu, P. Manolios, L. Lamport, Model checking TLA+ specifications, in: CHARME'99: Proceedings of the 10th IFIP WG 10.5 Advanced Research Working Conference on Correct Hardware Design and Verification Methods, Springer-Verlag, 1999, pp. 54–66.

Hyun-Gul Roh received his B.S. degree in computer science from Yonsei University, Korea, in 2002, and is due to receive his Ph.D. degree in computer science from KAIST (Korea Advanced Institute of Science and Technology) in 2011. Since September 2010, he has been working as a research intern at INRIA. His research interests include distributed and replication systems, especially collaboration and version vectors.

Myeongjae Jeon is currently a Ph.D. student in computer science at Rice University. He received his M.S. degree in computer science from Korea Advanced Institute of Science and Technology (KAIST) in 2009 and his B.E. degree in computer engineering from Kwangwoon University in 2005. His research interests include machine virtualization, distributed systems, and storage systems.

Jin-Soo Kim received his B.S., M.S., and Ph.D. degrees in computer engineering from Seoul National University, Korea, in 1991, 1993, and 1999, respectively. He was with the IBM T.J. Watson Research Center as an academic visitor from 1998 to 1999, and was a faculty member of the Computer Science Department at KAIST from 2002 to 2008. Currently, he is a faculty member of Sungkyunkwan University. His research interests include operating systems, distributed file systems, and grid computing.

Joonwon Lee received his B.S. degree from Seoul National University in 1983 and his Ph.D. degree from the Georgia Institute of Technology in 1991. From 1991 to 1992, he was with the IBM T.J. Watson Research Center. After working for IBM, he was a faculty member of KAIST from 1992 to 2008. Currently, he is a faculty member of Sungkyunkwan University. His research interests include operating systems, virtual machines, and parallel processing.