I have previously written about Distributed Coordination giving an introductory idea on “what is distributed coordination” and “why do we need it?”. When it comes to the implementations of distributed coordination schemes, there are many outstanding systems like apache zookeeper, etcd, consul and hazlecast. Even though some of them are not directly distributed coordination systems, all of them can be used as distributed coordination schemes. However, there are 2 clear outstanding giants in distributed coordination, apache zookeeper and etcd where zookeeper has been originated in the Hadoop echo system and became popular while etcd has been the distributed coordination scheme backing Google’s kubernetes. This article will be comparing these two implementations along with there pros and cons which will be very useful for many developers when selecting the right distributed coordination scheme for their future distributed computing implementations.
Zookeeper originated as a sub project of Hadoop and evolved to be a top-level project of Apache Software Foundation. Right now it is being used by most of the Apache projects including hadoop, kafka, solr and many more. Due to its proven track record and stability, zookeeper has become one of the best distributed coordination systems in the world.
Zookeeper is using ZAB protocol(Zookeeper Atomic Broadcast) as the consensus protocol. For those who are interested, you can read more on ZAB from here. As mentioned earlier, purposes of zookeeper include providing a distributed data store. In order to achieve high availability, zookeeper operates in an ensemble. That is, a set of nodes running zookeeper works together to provide its distributed characteristics. This set of nodes are called the zookeeper ensemble. A zookeeper ensemble should always include an odd number of nodes. When running, first these nodes communicate with each other and elect a leader within the leader election phase. The node with the majority vote gets elected as the leader. That is why zookeeper needs to operate in an odd numbered ensemble. Then onwards, all the other nodes are called followers while the elected node becomes leader. Any client connecting to zookeeper now can connect to any of these nodes. Clients’ read requests can be served by any node while write requests can only be served by the leader only. Therefore, adding more nodes to zookeeper ensemble will increase read speed. But not the write speed.
Out of the 3 properties of CAP (Consistency, Availability and Partition Tolerance) theorem, zookeeper provides Consistency and Partition Tolerance. You can read more on zookeeeper guarantees here
When storing data, zookeeper uses a tree structure where each node is called a ZNode and names those ZNodes based on the path from the root node. Each ZNode has a name. When accessing a given ZNode, absolute path from the root node is used.
Shown above is an example to Zookeeper’s ZNode structure. Each ZNode can store up to 1MB of data. Users can,
- Create ZNodes
- Delete ZNodes
- Store data in a particular ZNodes
- Read data in a particular ZNode
Apart from that zookeeper provides a very important other feature, watcher API.
Users can put a watch on a given ZNode. When any change (create, delete, data change, addition/remove of child ZNode) occurs to that ZNode, this watch API notifies the listening party about this change. This is a very important functionality provided by zookeeper which can be used to detect changes, for distributed command passing and for many such critical requirements.
But the only drawback in Zookeeper watches is that a given watch is only triggered once. Once a watch is notified, in order to receive future events on the same ZNode, users have to place a new watch on that ZNode. This can be considered as a limitation in Zookeeper. However, extensions like Apache Curator internally handles these complexities and provide the user with a more convenient API.
Curator is an extended client library for Zookeeper. It internally handles almost all the edge cases and complexities of typical zookeeper and provide users with a convenient API. You can read more on curator from Apache Curator in 5 Minutes. Also curator has implemented many distributed zookeeper recepies which includes,
- Shared Reentrant Lock
- Shared Lock
- Shared Semaphore
- Double Barrier
- Path Cache
- Path Cache — Keep watching on children ZNodes of a specific ZNode without having to worry about placing triggered watches.
- Node Cache — Notify changes on a given ZNode
- Tree Cache — Notify on changes on a complete sub tree in ZNode structure. Also keeps data cached locally.
and many more.
Most of the pros had been discussed earlier. Along with them, following pros are there with zookeeper.
- Non-blocking full snapshots (to make eventually consistent)
- Efficient memory management.
- Reliable, has been there for a long time.
- A simplified API
- Automatic ZooKeeper connection management with retries
- Complete, well-tested implementations of ZooKeeper recipes
- A framework that makes writing new ZooKeeper recipes much easier.
- Event support through Zookeeper watches
- In a network partition, both minority and majority partitions will start a leader election. Therefore the minority partition will stop operations. Read more.
- Since Zookeeper is written in Java, it inherits few drawbacks of java like garbage collection pauses.
- Also when creating snapshots (where the data is written to disks), zookeeper read write operations are halted temporarily. This only happens if we have enabled snapshots. If not, zookeeper operates as an in memory distributed storage.
- Zookeeper opens a new socket connection per each new watch request we make. This has made zookeepers like more complex since it has to manage a lot of open socket connections in real time.
We will be talking about the latest release of etcd (etcd v3) which has major changes compared to its predecessor etcd v2. In contrast to Apache Zookeeper, etcd has been written in go with the prime objective of resolving drawbacks in zookeeper. As a result, etcd came out few years ago. Even though it is not old and renowned as zookeeper, it is a promising project with a great future. Well known google’s container orchestration platform, kubernetes is using etcd v2 as the distributed storage to get distributed locks for its kebe master components. This is an example of how powerful etcd is.
Guarantees provided by etcd is almost similar to guarantees of zookeeper apart from several small changes.
- Sequential Consistency
- Serializable Isolation
- Linearizability (Except for watches)
etcd 3 provides following operations to be performed on distributed storage. These operations are mostly similar to the operations provided by zookeeper instead of the differences imposed by the underlying data structure.
- Put — puts a new key-value pair to storage
- Get — Retrieve value corresponding to a key
- Range — Retrieve values corresponding to a key range. Ex: key1 to key10 will retrieve all the present values for keys key1, key2, …, key10.
- Transaction — Read, compare, modify, write combinations
- Watch — On a key or key range. Changes will be notified similarly to zookeeper.
More on these operations can be found here. Also etcd has most of the distributed recipes implemented similarly to Apache Curator.
- Incremental snapshots avoid pauses when creating snapshots which is a problem in zookeeper.
- No garbage collection pauses due to off-heap storage.
- Watchers are redesigned, replacing the older event model with one that streams and multiplexes events over key intervals. Therefore, no socket connection per watch like zookeeper (Multiplexes events on the same connection). This is a major advantage.
- Unlike ZooKeeper that return one event per watch request, etcd can continuously watch from the current revision. No need to place another watch once a watch is triggered.
- Zookeeper looses old events, while etcd3 holds a sliding window to keep old events so that a client’s disconnection will not lose all events occurred till it connect back.
- Note that the client may be uncertain about the status of an operation if it times out, or there is a network disruption between the client and the etcd member. etcd may also abort operations when there is a leader election. etcd does not send abort responses to clients’ outstanding requests in this event.
- Serialized read requests will continue to be served in a network split where the minority partition has the leader at the time of split.
We discussed about the major features, pros and cons of Apache Zookeeper and etcd3. Out of those, zookeeper is written in java and widely adopted by the Apache Software Foundation projects while etcd3 is backed by Google (Kubernetes). Even though Apache Zookeeper is stable and renowned for being a great distributed coordination system, etcd3 is new and promising.
Since etcd is written in Go around Go ecosystem, good client libraries are not available for java. In contrast, since zookeeper had been there for some time, zookeeper has some good client libraries written in other languages as well. However, whether you should go with zookeeper or etcd depends on your requirement and the language you prefer to develop your program.
Further reading and references
- Apache Zookeeper Wiki
- Zookeeper Internals
- Serializability and Distributed Software Transactional Memory with etcd3
- Introduction to etccd 3 by CoreOS CTO Brandon Philips
- etcd3 API
- Apache Curator in 5 Minutes