Author manuscript, published in the 29th IEEE International Conference on Distributed Computing Systems (ICDCS 2009), pp. 404-412.

Logoot: a Scalable Optimistic Replication Algorithm for Collaborative Editing on P2P Networks

Stéphane Weiss, Pascal Urso and Pascal Molli
Nancy-Université, LORIA, Campus Scientifique, Vandoeuvre-lès-Nancy
{weiss,urso,molli}@loria.fr

Abstract

Massive collaborative editing is becoming a reality through leading projects such as Wikipedia. This massive collaboration is currently supported by a costly central service. To avoid such costs, we aim to provide a peer-to-peer collaborative editing system. Existing approaches to building distributed collaborative editing systems do not scale either in the number of users or in the number of edits. We present the Logoot approach, which scales in both dimensions while ensuring the causality, consistency and intention preservation criteria. We evaluate the Logoot approach and compare it to others using a corpus of all the edits applied to a set of the most edited and the biggest pages of Wikipedia.

1. Introduction

Collaborative editing (CE) systems allow distant users to modify the same data concurrently. The major benefits are reducing task completion time, obtaining different viewpoints, etc. Wikis, online office suites and version control systems are the most popular collaborative editing tools.

Several collaborative editing systems are becoming massive: they support a huge number of users who quickly produce a huge amount of data. For instance, Wikipedia is edited by 7.5 million users and reached 10 million articles in only 6 years. However, most CE systems are centralized, with costly scalability and poor fault tolerance. For instance, the Wikimedia Foundation spent 2.7 million dollars between 2007 and 2008 on maintaining wiki servers¹. To overcome these limitations, we aim to provide a peer-to-peer (P2P) CE system.

¹ http://wikimediafoundation.org/wiki/Donate/Transparency/en

P2P systems rely on replication to ensure scalability. A single object is replicated a limited number of times in structured networks (such as Distributed Hash Tables) or an unbounded number of times in unstructured P2P networks. In all cases, replication requires defining and maintaining the consistency of copies. With a limited number of replicas, it is possible to maintain strong consistency models such as sequential consistency. For instance, some P2P replication systems are based on consensus algorithms [1]. However, if the number of replicas grows, the communication cost becomes too expensive.

Collaborative editing can rely on weaker consistency criteria that generate less traffic and are more efficient. For instance, the Git [2] distributed version control system relies on causal consistency, Usenet [3] on eventual consistency, and CoWord [4], a real-time editing system, on CCI consistency. CCI consistency has been proven suitable for replicated collaborative systems [5]. CCI stands for Causality, Convergence and Intention preservation; thus, CCI consistency implies causal consistency and eventual consistency. Intention preservation means that the effect of an operation observed on one copy must be observed on all copies, whatever sequence of concurrent operations is applied before it.

Many algorithms have been proposed for maintaining CCI consistency. Some approaches [6], [7] do not support P2P constraints such as churn. The others [8], [9], [10] rely on data “tombstones”.
In these approaches, a deleted object is replaced by a tombstone instead of being removed from the document model. Tombstones cannot be directly removed without compromising document consistency. Therefore, the overhead required to manage the document grows continuously.

In this paper, we present a new optimistic replication algorithm called Logoot that ensures CCI consistency for linear structures, tolerates a large number of copies, and does not require tombstones. The approach is based on non-mutable and totally ordered object position identifiers. The time complexity of Logoot is only logarithmic in the document size. We validate the Logoot algorithm with real data extracted from Wikipedia, and we show and analyze the results of this experiment.

2. P2P Collaborative Editing System

We make the following assumptions about P2P Collaborative Editing Systems (P2P CE) and their correction criteria.

A P2P CE network is composed of an unbounded set of peers P. Objects edited by the system are replicated on a set of replicas R (with 0 < |R| ≤ |P|). Each replica has the same role and is hosted on one peer. A peer can host many replicas, each one of a different object. Peers can enter and leave the network arbitrarily fast. We assume that each peer possesses a unique comparable site identifier.

The modifications applied on a replica are eventually delivered to all other replicas. We make no assumption about the kind of dissemination routine through the P2P network or the propagation time of modifications. When a modification is delivered to a replica, it is applied. Thus, replicas diverge in the short term. This kind of replication is known as optimistic replication [11] (or lazy replication).

According to [12], a collaborative editing system is considered correct if it respects the CCI criteria:

Causality: all operations ordered by a precedence relation, in the sense of Lamport's happened-before relation [13], are executed in the same order on every copy (causal consistency).

Convergence: the system converges – i.e., all replicas are identical – when the system is idle (a criterion also known as eventual consistency).

Intention: the expected effect of an operation must be observed on all replicas. One definition of operation intentions for textual documents is:

delete: A line is eventually removed from the document if and only if it has been deleted on at least one replica.

insert: A line inserted on a replica eventually appears on every replica. Moreover, the order relation between the document lines and the newly inserted line must be preserved on every replica (as long as these lines exist).

Along with these criteria, we add a numerical scalability criterion from [14]:

Scalability: The system must handle the addition of users or objects without suffering a noticeable loss of performance.
3. Related Work

In this section, we present the optimistic replication approaches that are known to scale with the number of users in the network.

WOOKI [9] is a P2P wiki system based on Wooto, an optimization of Woot [15]. The main idea of Woot is to treat a collaborative document as a Hasse diagram representing the order induced by the insert operations; the Wooto algorithm computes a linear extension of this diagram. WOOKI does not fully respect the CCI correction criteria. Indeed, causality is replaced by preconditions, so the happened-before relation can be violated in some cases. Convergence is ensured by the algorithm through the use of tombstones.

TreeDoc [10] is a collaborative editing system that uses a binary tree to represent the document. Deleted lines are also kept as tombstones. The authors propose a kind of “2-phase commit” procedure to remove tombstones. Unfortunately, this procedure cannot be used in an open network such as a P2P environment. However, this approach also proposes an interesting general framework, called Commutative Replicated Data Type (CRDT), for building distributed collaborative editors that ensure the CCI criteria.

[7] proposes a distributed optimistic replication mechanism in the CRDT framework that ensures the CCI criteria, but uses tombstones and vector clocks. Vector clocks (aka state vectors) have a size proportional to the number of replicas in the network, and thus do not scale when the number of users grows.

The Operational Transformation approach [16], [17], [5], [18] is a framework for building distributed collaborative editors. Except for MOT2 [8], all algorithms in this framework require vector clocks and, thus, do not scale. MOT2 is a pair-wise P2P reconciliation algorithm. It assumes the existence of transformation functions satisfying some properties. To the best of our knowledge, the only transformation functions for text documents adapted [19] for MOT2 are the Tombstone Transformation Functions [20], which are based on tombstones.

Thus, all of the above approaches that are usable on P2P networks are based on tombstones. According to the scalability definition, the tombstone cost is not acceptable in massive editing systems. For instance, for the most edited pages of Wikipedia², the tombstone storage overhead can represent several hundred times the document size. Tombstones are also responsible for performance degradation. Indeed, in all published approaches, the execution time of modification integration depends on the whole document size – including tombstones. Therefore, letting the number of tombstones grow degrades performance.

² http://en.Wikipedia.org/wiki/Wikipedia:Most frequently edited articles

4. Proposition

Our idea is based on the CRDT [10] framework for collaborative editing. In the CRDT framework, modifications produced locally are re-executed on remote replicas. There is no total order on operations; thus, concurrent operations can be re-executed in different orders. The main idea of this framework is to use a data type on which all concurrent operations commute. Combined with the respect of the causality relationship between operations, this commutation ensures the convergence criterion.

To achieve commutativity on a linear structure, the authors propose a solution based on a total order between the elements of the document. More precisely, there are two kinds of modifications:

• insert(pid, text), which inserts the line content text at the position identifier pid;
• delete(pid), which removes the line at the position identifier pid.

In the original paper, a tree structure is introduced to maintain the total order between position identifiers. However, safely removing elements from this tree requires tombstones. In [7], the authors also refer to the CRDT framework, but use vector clocks to obtain this order.

Our idea is to use a position identifier based on a list of integers for each line. With such an identifier, a line can be removed from the document model without affecting the order of the remaining lines.
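To see why list-of-integer positions always leave room for a new line, consider a minimal sketch (ours, in Python; this is not the paper's algorithm, which additionally pairs each integer with a site identifier and picks values randomly, as Section 4.2 details). Between positions [1] and [2] no integer fits, but [1, x] orders strictly between them:

    def between(p, q, base=2**32):
        """One list-of-integers position strictly between p and q (p < q).
        Naive illustration; assumes q does not extend p with zero digits
        only (in Logoot, site identifiers break such ties)."""
        new, i = [], 0
        while True:
            lo = p[i] if i < len(p) else 0       # digit of p at this depth
            hi = q[i] if i < len(q) else base    # digit of q at this depth
            if hi - lo > 1:                      # room at this depth:
                return new + [lo + 1]            # pick a digit in between
            new.append(lo)                       # otherwise copy p's digit
            i += 1                               # and descend one level

    # Python compares lists lexicographically with shorter prefixes first,
    # which is the order defined below on positions.
    assert [1] < between([1], [2]) < [2]         # e.g. [1, 1]
    assert [1] < between([1], [1, 1]) < [1, 1]   # e.g. [1, 0, 1]

Deleting a line simply drops its pair from the model: the remaining positions keep their relative order, so no tombstone is needed.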
4.1. Logoot model

A Logoot document is composed of lines defined as ⟨pid, content⟩, where content is a text line and pid a unique position identifier. There are two virtual lines, lB and lE, representing the beginning and the end of the document. The main idea to insert a line is to generate a new position A such that P ≺ A ≺ N, where P is the position of the previous line and N the position of the next line.

To allow operations to commute, position identifiers must be unique. Also, since a user can always insert a line, we must be able to generate a position A such that P ≺ A ≺ N for any P and N. In the following definition, we assume that each site maintains a persistent logical clock clock_s, incremented each time a line is created.

Definition 1.
• An identifier is a couple ⟨pos, site⟩ where pos is an integer and site a site identifier.
• A position is a list of identifiers.
• A position identifier generated by a replica s is a couple ⟨pos, h_s⟩ where pos = i_1.i_2.....i_n.⟨p, s⟩ is a position and h_s is the value of clock_s.

Thus, every position identifier is unique, since the last identifier of the list i_1.i_2.....i_n.⟨p, s⟩ contains the unique site identifier, and the position identifier carries the value of the logical clock of this site. To obtain a total order between positions, we use the following definition.

Definition 2.
• Let id_1 = ⟨pos_1, site_1⟩ and id_2 = ⟨pos_2, site_2⟩ be two identifiers; we get id_1 <_id id_2 if and only if pos_1 < pos_2, or pos_1 = pos_2 and site_1 < site_2.
• Let p = p_1.p_2...p_n and q = q_1.q_2...q_m be two positions; we get p ≺ q if and only if ∃j ≤ m. (∀i < j. p_i = q_i) ∧ (j = n + 1 ∨ p_j <_id q_j).

We only compare positions – and not logical clocks – since there cannot be, in the same model, two lines with the same position (see Lemma 1).

Finally, the logical view of a Logoot document looks like:

1 ⟨pid_0, lB⟩
2 ⟨pid_1, “This is an example of a Logoot document”⟩
3 ⟨pid_2, “Here, pid_1 ≺ pid_2”⟩
4 ⟨pid_3, “And pid_2 ≺ pid_3”⟩
5 ⟨pid_∞, lE⟩

For instance, with concrete position identifiers:

1 ⟨⟨0, 0⟩, NA, lB⟩
2 ⟨⟨1, 1⟩, 0, “This is an example of a Logoot document”⟩
3 ⟨⟨1, 1⟩.⟨1, 5⟩, 23, “How to find a place between 1 and 1”⟩
4 ⟨⟨1, 3⟩, 2, “This line was the third made on replica 3”⟩
5 ⟨⟨MAXINT, 0⟩, NA, lE⟩

4.2. Modifying a Logoot document

In collaborative editing systems such as wikis or VCSs, an edit on a document is not a single operation but a set of operations (a patch). Lines inserted by a patch are often contiguous. Thus, to apportion the line positions, we define the generateLinePositions(p, q, N, s) function, which generates N positions between a position p and a position q, using numbers in base MAXINT (the maximum unsigned integer plus 1). To obtain short positions, it first selects the smallest equal-length prefixes of p and q that are spaced at least N apart. Then it apportions the constructed positions randomly.

function generateLinePositions(p, q, N, s)
  list := {}; index := 0; interval := 0;
  while (interval < N)
    index++;
    interval := prefix(q, index) − prefix(p, index);
  endwhile
  step := interval / N;
  r := prefix(p, index);
  for j := 1 to N do
    list.add(constructPosition(r + Random(1, step), p, q, s));
    r := r + step;
  done
  return list;

The function prefix(p, i) returns a number in base MAXINT whose digits are the p_i.pos integers of the first i identifiers of p (padded with 0 if |p| < i). The function constructPosition(r, p, q, s) constructs a position ⟨r_1, s_1⟩.⟨r_2, s_2⟩.....⟨r_n, s_n⟩ where r_i is the i-th digit of r. We use the following rules to define each s_i:

1) if i = n then s_i = s,
2) else if r_i = p_i.pos then s_i = p_i.site,
3) else if r_i = q_i.pos then s_i = q_i.site,
4) else s_i = s.
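The following is a direct Python transcription of generateLinePositions (ours, a sketch, not the authors' code); positions are tuples of (pos, site) pairs. Two departures are worth flagging: we subtract 1 from the interval so that generated digit strings stay strictly below the prefix of q, and degenerate boundary cases (such as q extending p only with zero digits) are not handled.

    import random

    MAXINT = 2**32                 # base of the digits (max unsigned int + 1)

    def prefix(p, index):
        """Base-MAXINT number whose digits are the pos values of the first
        `index` identifiers of position p, padded with 0 if p is shorter."""
        value = 0
        for i in range(index):
            value = value * MAXINT + (p[i][0] if i < len(p) else 0)
        return value

    def construct_position(r, index, p, q, s):
        """Rebuild a position from the `index` digits of r, reusing the site
        identifiers of p or q when a digit coincides with theirs (rules 1-4)."""
        ds = []
        for _ in range(index):            # split r into its digits,
            ds.append(r % MAXINT)         # least significant first
            r //= MAXINT
        ds.reverse()
        pos = []
        for i, d in enumerate(ds):
            if i == index - 1:
                site = s                              # rule 1
            elif i < len(p) and d == p[i][0]:
                site = p[i][1]                        # rule 2
            elif i < len(q) and d == q[i][0]:
                site = q[i][1]                        # rule 3
            else:
                site = s                              # rule 4
            pos.append((d, site))
        return tuple(pos)

    def generate_line_positions(p, q, n, s):
        """n fresh positions strictly between p and q, spread randomly over
        the shortest prefix interval that has room for them."""
        index, interval = 0, 0
        while interval < n:
            index += 1
            # "- 1" keeps every generated digit string strictly below q's prefix
            interval = prefix(q, index) - prefix(p, index) - 1
        step = interval // n
        positions, r = [], prefix(p, index)
        for _ in range(n):
            positions.append(construct_position(r + random.randint(1, step),
                                                index, p, q, s))
            r += step
        return positions

For instance, generate_line_positions(((2, 4),), ((10, 5), (20, 3)), 3, 7) returns three single-identifier positions ((x, 7),) with x strictly between 2 and 10, matching the N < 8 case of the example that follows.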
For instance, on a site s, insertion positions between p = ⟨⟨2, 4⟩⟩ and q = ⟨⟨10, 5⟩⟨20, 3⟩⟩ are apportioned in the set

• {⟨⟨x, s⟩⟩ | x ∈ ]2, 10[} if N < 8,
• {⟨⟨2, 4⟩⟨x, s⟩⟩ | x ∈ [0, MAXINT[} ∪ {⟨⟨y, s⟩⟨x, s⟩⟩ | x ∈ [0, MAXINT[, y ∈ ]2, 10[} ∪ {⟨⟨10, 5⟩⟨x, s⟩⟩ | x ∈ [0, 20[} if N ≥ 8.

To choose the value of the integer in a position, any arbitrary choice can be made. However, to prevent two different replicas from concurrently generating the same choice, and thus to reduce the growth rate of the position list, we apply a random function.

To delete a line, we simply generate a delete operation that contains the position identifier of this line. Then, we can completely remove this line from the document.

4.3. Integrating remote modifications

Both line insertion and removal can be integrated in time logarithmic in the number of lines of the document, and constant in the number of users in the P2P network. Indeed, we simply use the binary search algorithm to find the place in the document corresponding to the position identifier. Also, the integration of a delete operation can safely remove the line from the document model, since the total order between the remaining lines is not affected. Moreover, this removal frees a position identifier that can be reused. This mechanism reduces the growth rate of position identifiers, as shown in Section 5.
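As a sanity check of the commutativity argument, here is a small self-contained demo (ours; Python 3.10+ for bisect's key parameter). Python's lexicographic tuple order coincides with the order of Definition 2, including the "shorter strict prefix comes first" case, so positions can be represented directly as tuples of ⟨pos, site⟩ pairs:

    import bisect

    lB = ((0, 0),)                 # virtual first line
    lE = ((2**32, 0),)             # virtual last line (MAXINT stand-in)

    def new_doc():
        return [(lB, None), (lE, None)]

    def integrate(doc, op):
        """Integrate a remote operation by binary search on the pid;
        deletions leave no tombstone behind."""
        kind, pid, text = op
        i = bisect.bisect_left(doc, pid, key=lambda line: line[0])
        if kind == "ins":
            doc.insert(i, (pid, text))
        elif doc[i][0] == pid:     # causality guarantees the line is present
            del doc[i]

    # Concurrent operations from two sites, delivered in different orders,
    # leave both replicas identical (the commutativity argued above).
    ops_site1 = [("ins", ((1, 1),), "line from site 1")]
    ops_site2 = [("ins", ((1, 2),), "line from site 2"),
                 ("del", ((1, 2),), None)]

    r1, r2 = new_doc(), new_doc()
    for op in ops_site1 + ops_site2:
        integrate(r1, op)
    for op in ops_site2 + ops_site1:
        integrate(r2, op)
    assert r1 == r2

Note that only operations from different sites are reordered; each site's own stream stays in order, which is exactly what causal delivery guarantees.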
4.4. Correctness of the approach

To ensure convergence in the CRDT framework, concurrent operations must commute. If line positions are unique, non-mutable, and totally ordered, the different replicas can apply any series of insert operations in any order and obtain the same result. The following lemma states that there cannot be two different lines with the same position in one model.

Lemma 1. If causality is preserved, the position of a line is unique on each model.

Proof: The last element of a line position contains the unique identifier of the site that generated the line. As a result, two different replicas cannot generate the same position, and a replica can only generate a position different from every other position in its model. A replica can generate an insert operation o_i of a line l_2 with the same position as a line l_1 it has previously generated. However, this is possible only if l_1 was previously deleted on that replica by a (remote or local) delete operation o_d. Then, we have the happened-before relationship o_d → o_i. Thus, if causality is preserved, l_2 can only be inserted on a replica where l_1 was deleted.

The following theorem states that Logoot ensures the consistency criteria. Its correctness is based on the CRDT correctness proof of [10].

Theorem 2. If causality is preserved, Logoot ensures consistency.

Proof: Since the couple composed of a site identifier and a clock value is unique, each position identifier is unique. According to Lemma 1, each position is unique, and thus position identifiers are totally ordered. Finally, since Logoot position identifiers are unique, non-mutable and totally ordered, every couple (insert/insert, insert/delete and delete/delete) of concurrent operations commutes. Thus, the Logoot data type is a CRDT.

The following theorem states that Logoot respects the intentions of the insert and delete operations as defined in Section 2.

Theorem 3. If causality is preserved, Logoot ensures intentions.

Proof: Since position identifiers are unique and non-mutable, the delete operation intention is respected. According to Lemma 1, each position is unique; thus, Logoot is always able to compute a new position between two lines. Since positions are non-mutable and totally ordered, the line order observed on the generation replica is preserved on every other replica.

4.5. P2P constraints

In order to be deployed on a P2P network, our approach needs to satisfy some constraints. It must scale in terms of peers and support the churn of the network (i.e., peers that enter and leave the network arbitrarily fast).

Logoot position identifiers support these constraints, since their size and space complexity are constant with respect to the number of peers (no vector clocks). We only assume that each site has a unique identifier. In addition to position identifiers, Logoot only requires a causal dissemination mechanism. To obtain one, a scalable broadcast such as the lightweight probabilistic broadcast [21], in association with causal barriers [22], can be used. Also, the Logoot framework supports network churn, since it requires neither a group membership mechanism nor a consensus algorithm. It also does not require knowing the number of peers.

The following section discusses Logoot scalability in terms of number of edits.

5. Evaluation

The size of Logoot position identifiers is unbounded. Theoretically, position identifiers can grow each time a line is inserted. Thus, if no line is ever removed, the size of the document model overhead can be ∑_{i=1}^{n} i = O(n²), where n is the total number of inserted lines. However, due to the randomized nature of the algorithm, such a worst case can only occur with a probability equal to 1/MAXINT^n, which is negligible. In practice, lines are often removed and position identifiers are apportioned, so the overhead size remains low. To measure the Logoot overhead effectively, we have replayed the modifications made on some Wikipedia pages in a Logoot document. For instance, in the proportionally worst result of our test bed (case 2, Figure 4), there is a total of 43352 identifiers, to be compared with the number of inserted lines n = 623863.

On the other hand, in tombstone-based approaches, the size of the document model is also unbounded. The Wooto and TreeDoc approaches have a constant overhead for each inserted line, thus strictly proportional to n. Compared to the number of tombstones, the size of each position identifier remains low. Our approach is slightly less efficient only in a specific case (see Section 5.2.3).

5.1. Methodology

In our implementation, we use 8-byte integers for site identifiers and positions, and 4-byte integers for the logical clock; hence, a position identifier contains at least 20 bytes. For the Wooto approach, we use an 8-byte integer for the site identifier and 4-byte integers for the logical clock and the degree; the overhead for each line is then 16 bytes. For the TreeDoc approach, we do not consider the tree overhead; we only count one 8-byte integer for the site identifier and one 4-byte integer for the counter, i.e., an overhead of 12 bytes per line.
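Written out as code, this accounting looks as follows (a sketch under the byte sizes stated above; the numbers plugged in at the end come from the “George W. Bush” discussion in Section 5.2):

    # Metadata sizes in bytes, as assumed above.
    PAIR = 8 + 8     # one ⟨pos, site⟩ identifier: two 8-byte integers
    CLOCK = 4        # logical clock, stored once per position identifier

    def logoot_overhead(position_lengths):
        """Total metadata bytes for a Logoot document; position_lengths[i]
        is the number of ⟨pos, site⟩ pairs in the pid of visible line i."""
        return sum(CLOCK + PAIR * n for n in position_lengths)

    def wooto_overhead(inserted_lines):
        """16 bytes (site id + clock + degree) per line ever inserted,
        tombstones included."""
        return 16 * inserted_lines

    def treedoc_overhead(inserted_lines):
        """12 bytes (site id + counter) per line ever inserted, tombstones
        included; the tree structure itself is not counted."""
        return 12 * inserted_lines

    def relative_overhead(overhead_bytes, visible_bytes):
        """The metric plotted in the figures, in percent."""
        return 100.0 * overhead_bytes / visible_bytes

    # e.g. a 553-line, 133146-byte page whose pids each hold a single pair,
    # versus 1.6 million insertions kept as tombstones:
    print(relative_overhead(logoot_overhead([1] * 553), 133146))   # ~8.3
    print(relative_overhead(wooto_overhead(1_600_000), 133146))    # ~19000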
To replay the histories of some Wikipedia pages, we use the MediaWiki API³ to obtain an XML file containing several revisions of a specific Wikipedia page. Then, using a diff algorithm [23], we compute the modifications performed between two revisions. Modifications are simply re-executed in our model. Since our approach generates each position randomly, we re-executed each page history ten times to obtain average values.

³ http://www.mediawiki.org/wiki/API:Query - Properties#revisions.2Frv

We also measured the overhead for Wooto and TreeDoc. The results obtained for TreeDoc do not take into account the “stabledel” and “gc” procedures, which aim to remove tombstones. We motivate this choice by the fact that these procedures require knowing the exact number of replicas, which is unknown, unbounded and unstable in P2P networks. The overheads of Wooto and TreeDoc are directly computed from the number and type of operations performed on the document. Indeed, their overheads are directly proportional to the number of inserted lines, since deleted lines remain as tombstones.

We have applied this scheme to the top pages of three categories⁴:
• the most edited encyclopedic pages,
• the most edited pages,
• the biggest pages.

⁴ According to http://en.Wikipedia.org/wiki/Wikipedia:Most frequently edited pages and http://en.Wikipedia.org/wiki/Special:LongPages at the end of November 2008. Due to some technical issues (invalid characters, missing patches, ...), we skipped some of the top pages, but the first page of each category is presented.

For each of the treated pages, we present the overhead of the Logoot, Wooto and TreeDoc approaches, averaged over the last 100 edits. We also present the average size of the page and the number of patches (i.e., edits on the page).

5.2. Results

Figure 1 shows the relative overhead (size of the overhead divided by the size of the visible document, on a logarithmic scale) of the three approaches on the most edited encyclopedic page of the English Wikipedia; Figure 2 shows the absolute overhead. The Logoot overhead remains constant all along the editing session, while the overhead of the tombstone-based approaches grows continuously. The Logoot overhead stays below the document size, while the tombstone-based approaches require more than 100 times the document size and keep growing.

[Figure 1. Relative overhead for the “George W. Bush” page.]
[Figure 2. Absolute overhead for the “George W. Bush” page.]

While the “George W. Bush” page contains only about 553 lines, the number of deletions is about 1.6 million. As a consequence, tombstone-based systems are not well suited for such documents, since we obtain 1.6 million tombstones for only 553 lines. Most of the modifications done on Wikipedia pages consist in updating the content of some existing lines. To ensure users' intentions, distributed editing systems handle such an update as a deletion of the old content followed by an insertion of the new content. Thus, the number of tombstones grows quickly.

Figure 1 also shows several peaks, which are mainly due to vandalism. Indeed, some of the most edited encyclopedic pages of Wikipedia suffer many acts of vandalism, including erasing the whole content of the page. Every act of vandalism is reverted by re-introducing the previously erased content or by removing the malicious content introduced. Each time, this process adds a lot of tombstones (up to the page size). Introducing a specific undo mechanism that reuses tombstones, such as [19], should reduce the overhead due to tombstones.

5.2.1. Most edited encyclopedic pages. Figures 3, 4 and 5 show the average relative overhead (i.e., the size of the overhead divided by the size of the visible document)
computed on the last 100 revisions of each page for the Logoot, Wooto and TreeDoc approaches. The column “Size” indicates the average size of each page over the last 100 revisions. Finally, the column “Number of Patches” gives the number of edits done on each page.

Figure 3 presents the results obtained on the most edited encyclopedic pages. The histories of such pages show a lot of edits as well as acts of vandalism. A huge number of deletions has been performed on these pages; hence, the tombstone-based approaches show an important overhead. On the contrary, the Logoot overhead remains low.

5.2.2. Most edited pages. These pages (Figure 4) are discussion pages or special pages mostly edited by bots. On such pages, there are no or very few acts of vandalism, but a lot of edits. For all these pages, the Logoot approach is more efficient than the tombstone-based approaches. However, the difference is far more important for pages whose data are very volatile, for instance case 1 (a communication channel to detect and block vandals) or case 4 (a sandbox). The other cases are discussion pages: users ask questions, and other users reply by modifying the page; each topic is removed after one week. They are all edited in the same way, mostly adding content at the end of the page and removing week-old topic content at the beginning. Thus, these pages contain a lot of tombstones, but the Logoot position identifiers grow as well.

5.2.3. Biggest pages. These pages (Figure 5) are often lists of elements. If these lists are always edited in the same way (for instance, adding elements at the end of the page), they represent the worst case for our approach. Indeed, the Logoot position identifiers grow the quickest, especially if insertions are done on many different occasions. Effectively, in cases 4, 9 and 10, our approach is less efficient than the tombstone-based approaches. However, these results show that our approach is, on average, less costly than the tombstone-based approaches even in these disadvantageous real cases.

5.3. Limits of the experimentation

Since Wikipedia uses a centralized wiki, we can expect a slightly different behavior in a P2P system. For instance, Wikipedia reduces the impact of concurrent modifications. In a P2P environment, preventing users from making concurrent modifications is not a realistic hypothesis; therefore, concurrent modifications are automatically merged. This will certainly produce “inconsistent” documents that require the intervention of some user to be corrected. Therefore, the number of edits will certainly be higher in a P2P wiki than in a centralized wiki.

In Wikipedia, some pages are protected to reduce the number of acts of vandalism. However, such a protection mechanism is not compatible with P2P constraints; therefore, in a P2P wiki, the number of acts of vandalism is certainly higher. Thus, we expect more edits and more vandalism on a P2P wiki system.

Contrary to the Woot approach, CRDT approaches, including ours, require a causal broadcast to achieve convergence. However, causal delivery implies an overhead on each message sent by each replica. The two main mechanisms to achieve causal delivery are vector clocks [24] and causal barriers [22]. Vector clocks are not usable in P2P networks, since their size is proportional to the number of replicas.
Causal barriers have a smaller size, which depends only on the degree of concurrency of the operations in the network. In collaborative editing systems, this degree remains low: less than 3 edits per second on the whole English Wikipedia on average⁵. However, a realistic measure of the communication overhead can only be achieved with a corpus of concurrent collaborative editions.

⁵ 2.69 in October 2008 according to http://en.wikipedia.org/wiki/Wikipedia:Statistics

Figure 3. Most edited encyclopedic pages.

                                                          Overhead (in percent)     Number of   Size
     Page                                              Logoot     Wooto    TreeDoc  patches     (in bytes)
  1  George W. Bush                                      8.33  16128.75   14590.79  41563       133146
  2  List of World Wrestling Entertainment employees    39.24   8413.41    6310.05  27152        16673
  3  United States                                       8.30   5875.07    4406.31  24781       158242
  4  Jesus                                               9.83   4179.09    3134.32  20271       125669
  5  2006 Lebanon War                                   13.62    927.12     695.34  17780       139458
  6  Islam                                              15.92   2996.30    2247.22  15315       101278
  7  Roman Catholic Church                               5.92   1129.51     847.13  14378       170380
  8  Deaths in 2006                                     18.51   1747.24    1310.43  14029        21880
  9  Canada                                             17.88   4431.19    3323.39  13992       112589
 10  Akatsuki (Naruto)                                   9.81    389.89     292.42  13929        60638
     Average                                            14.74   4621.76    3715.74  20319       106639

Figure 4. Most edited pages.

                                                          Overhead (in percent)        Number of   Size
     Page                                              Logoot      Wooto     TreeDoc  patches     (in bytes)
  1  Wikipedia:Administrator intervention
     against vandalism                                  27.78   287530.03   215647.52  438330       2369
  2  Wikipedia:Reference desk/Miscellaneous            520.21     7492.31     5619.23  148283     133204
  3  Wikipedia:Reference desk/Science                  186.14     3431.45     2573.59  142722     190858
  4  Wikipedia:Introduction                             43.74  4195621.30  3146715.98  132693        317
  5  Wikipedia:Help desk                                58.11     9266.41     6949.81  126509      96256
     Average                                           167.20   900668.30   675501.23  197707       1011.98

Figure 5. Biggest pages.

                                                          Overhead (in percent)     Number of   Size
     Page                                              Logoot     Wooto    TreeDoc  patches     (in bytes)
  1  Line of succession to the British throne           23.65    488.30     366.23    3317       376760
  2  United States at the 2008 Summer Olympics          52.65    314.71     236.03    2314       314748
  3  List of sportspeople by nickname                   19.14     82.34      61.75    2332       309576
  4  List of Brazilian football transfers 2008          27.08     11.33       8.50     752       287128
  5  List of college athletic programs by U.S. State    34.60     48.56      36.42     868       305294
  6  List of Chinese inventions                          5.11     37.71      28.29    2344       293228
  7  List of suicide bombings in Iraq since 2003        13.51     24.55      18.42    1260       215763
  8  China at the 2008 Summer Olympics                  61.55    134.15     100.61    1552       268720
  9  List of urban areas in Sweden                      40.04     39.61      29.71      19       108353
 10  Table of United States Core Based
     Statistical Areas                                  63.55     61.54      46.15      31       252236
     Average                                            34.09    124.28      93.21    1478.9     320899

6. Conclusion

In this paper, we have presented the Logoot algorithm. Logoot is an optimistic replication algorithm that ensures CCI consistency on linear structures. Logoot can be used on structured or unstructured P2P networks. It does not require tombstones; therefore, the space overhead remains linear during the life of the document and no garbage collector is required.

We validated the Logoot algorithm on a corpus extracted from Wikipedia. The experimentation demonstrates that the unbounded list of identifiers Logoot associates to each line stays acceptable in practice. The experimentation also shows that Logoot has better average performance than the WOOT and TreeDoc algorithms.

In the future, we plan to evaluate the Logoot overhead in time and to extend it to manage more structured data such as XML documents. We are also working on a group undo feature.

References
[1] T. Schütt, F. Schintke, and A. Reinefeld, “Scalaris: reliable transactional P2P key/value store,” in ERLANG '08: Proceedings of the 7th ACM SIGPLAN Workshop on ERLANG. New York, NY, USA: ACM, 2008, pp. 41–48.

[2] L. Torvalds, “git,” April 2005, http://git.or.cz/.

[3] R. Salz, “InterNetNews: Usenet transport for Internet sites,” in USENIX Conference Proceedings. San Antonio, Texas, USA: USENIX, 1992, pp. 93–98. Available: citeseer.ist.psu.edu/salz92internetnews.html

[4] S. Xia, D. Sun, C. Sun, D. Chen, and H. Shen, “Leveraging single-user applications for multi-user collaboration: the CoWord approach,” in CSCW, J. D. Herbsleb and G. M. Olson, Eds. ACM, 2004, pp. 162–171.

[5] C. Sun and C. A. Ellis, “Operational transformation in real-time group editors: Issues, algorithms, and achievements,” in Proceedings of the ACM Conference on Computer Supported Cooperative Work – CSCW'98. New York, NY, USA: ACM Press, November 1998, pp. 59–68.

[6] M. Suleiman, M. Cart, and J. Ferrié, “Concurrent operations in a distributed and mobile collaborative environment,” in Proceedings of the Fourteenth International Conference on Data Engineering – ICDE'98. Orlando, Florida, USA: IEEE Computer Society, February 1998, pp. 36–45.

[7] H.-G. Roh, J. Kim, and J. Lee, “How to design optimistic operations for peer-to-peer replication,” in JCIS, 2006.

[8] M. Cart and J. Ferrié, “Asynchronous reconciliation based on operational transformation for P2P collaborative environments,” in CollaborateCom, 2007.

[9] S. Weiss, P. Urso, and P. Molli, “Wooki: a P2P wiki-based collaborative writing tool,” in Web Information Systems Engineering. Nancy, France: Springer, December 2007.

[10] M. Shapiro and N. Preguiça, “Designing a commutative replicated data type,” INRIA, Research Report RR-6320, October 2007. Available: http://hal.inria.fr/inria-00177693/fr/

[11] Y. Saito and M. Shapiro, “Optimistic replication,” ACM Computing Surveys, vol. 37, no. 1, pp. 42–81, 2005.

[12] C. Sun, X. Jia, Y. Zhang, Y. Yang, and D. Chen, “Achieving convergence, causality preservation, and intention preservation in real-time cooperative editing systems,” ACM Transactions on Computer-Human Interaction (TOCHI), vol. 5, no. 1, pp. 63–108, March 1998.

[13] L. Lamport, “Time, clocks, and the ordering of events in a distributed system,” Commun. ACM, vol. 21, no. 7, pp. 558–565, 1978.

[14] B. C. Neuman, “Scale in distributed systems,” in Readings in Distributed Computing Systems. IEEE Computer Society Press, 1994, pp. 463–489.

[15] G. Oster, P. Urso, P. Molli, and A. Imine, “Data consistency for P2P collaborative editing,” in Proceedings of the ACM Conference on Computer-Supported Cooperative Work – CSCW 2006. Banff, Alberta, Canada: ACM Press, November 2006, pp. 259–267.

[16] C. A. Ellis and S. J. Gibbs, “Concurrency control in groupware systems,” in SIGMOD Conference, J. Clifford, B. G. Lindsay, and D. Maier, Eds. ACM Press, 1989, pp. 399–407.

[17] M. Suleiman, M. Cart, and J. Ferrié, “Serialization of concurrent operations in a distributed collaborative environment,” in GROUP, 1997, pp. 435–445.

[18] D. Li and R. Li, “An approach to ensuring consistency in peer-to-peer real-time group editors,” Computer Supported Cooperative Work, vol. 17, no. 5-6, pp. 553–611, 2008.

[19] S. Weiss, P. Urso, and P. Molli, “An undo framework for P2P collaborative editing,” in CollaborateCom, Orlando, USA, November 2008.
[20] G. Oster, P. Urso, P. Molli, and A. Imine, “Tombstone transformation functions for ensuring consistency in collaborative editing systems,” in The Second International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2006). Atlanta, Georgia, USA: IEEE Press, November 2006.

[21] P. T. Eugster, R. Guerraoui, S. B. Handurukande, P. Kouznetsov, and A.-M. Kermarrec, “Lightweight probabilistic broadcast,” ACM Trans. Comput. Syst., vol. 21, no. 4, pp. 341–374, 2003.

[22] R. Prakash, M. Raynal, and M. Singhal, “An adaptive causal ordering algorithm suited to mobile computing environments,” J. Parallel Distrib. Comput., vol. 41, no. 2, pp. 190–204, 1997.

[23] E. W. Myers, “An O(ND) difference algorithm and its variations,” Algorithmica, vol. 1, no. 2, pp. 251–266, 1986.

[24] F. Mattern, “Virtual time and global states of distributed systems,” in Proceedings of the International Workshop on Parallel and Distributed Algorithms, M. C. et al., Ed. Château de Bonas, France: Elsevier Science Publishers, October 1989, pp. 215–226.