CAP Theorem
A well-design cloud-based application often stores its data across multiple servers. For faster response, data is often stored closer to clients in that geography. Due to the distributed nature of this system, it's impossible to design a perfect system. The network may be unreliable or slow at times. Therefore, there are trade-offs to be made. CAP Theorem gives system designers a method to think through and evaluate the trade-offs at the design stage.
The three parts of the CAP Theorem are Consistency, Availability, and Partition Tolerance. The theorem states that it's impossible to guarantee all three in a distributed data store. We can meet any two of them but not all three.
Over the years, designers have misinterpreted the CAP Theorem. To reflect read-world scenarios, modifications to the theorem have been proposed.
Discussion
-
What's the definition of CAP Theorem? A formal definition of CAP Theorem is, "It is impossible in the asynchronous network model to implement a read/write data object that guarantees the following properties: availability, atomic consistency, in all fair executions (including those in which messages are lost)".
A simplified definition states that,
In a network subject to communication failures, it is impossible for any web service to implement an atomic read/write shared memory that guarantees a response to every request.
The word "atomic" used above means that although it's a distributed system, requests are modelled as if they are executing on a single node. This gives us an easy model for consistency.
-
Could you explain the CAP Theorem? The parts of the CAP Theorem can be understood as follows:
- Consistency: When a request is made, the server returns the right response. What is "right" depends on the service. For example, reading a value from a database might mean that the most recent write to that value should be returned.
- Availability: A request always receives a response from the server. No constraint is placed on how quickly the response must be received.
- Partition Tolerance: The underlying network is not reliable and servers may get partitioned into non-communicating groups. Despite this, the service should continue to work as desired.
As an example, consider two nodes G1 and G2 that have been partitioned. A client changes a value from v0 to v1 on G1. However, the same value is not updated on G2 due to the partition. Hence, when G2 is queried it returns the old value v0. Thus, the service is available but not consistent.
Sometimes the terms safety (consistency) and liveness (availability) are used in the generalized sense. Safety means "nothing bad ever happens". Liveness means "eventually something good happens".
-
What's the implication of the CAP Theorem when designing distributed systems? When CAP Theorem was proposed, the understanding was that system designers had three options:
- CA Systems: Sacrifice partition tolerance. Single-site or cluster databases using two-phase commit are examples.
- CP Systems: Sacrifice availability. If there's a partition, for consistency we make the service unavailable: return a timeout error or lock operations.
- AP Systems: Sacrifice consistency. If there's a partition, we continue accepting requests but reconcile them later (writes) or return stale values (reads).
In practice, we deal with network partitions at least some of the time. The choice is really between consistency and availability. For databases, consistency can be achieved by enabling reads after completing writes on several nodes. Availability can be achieved by replicating data across nodes. In fact, permanent partitions are rare. So the choice is temporary.
Designers don't have to give up one of the three to build a distributed system. In fact, it's possible to have all three under normal network conditions. There's trade-off only when the network is partitioned. It's also helpful to think probabilistically. We can design a CA system if probability of a partition is far less than that of other systemic failures.
-
Could you share real-world applications of the CAP Theorem? Databases that follow ACID (Atomicity, Consistency, Isolation, Durability) give priority to consistency. However, NoSQL distributed databases prefer availability over consistency since availability is often part of commercial service guarantees. So caching and logging were used for eventual consistency. This leads to what we call BASE (Basically Available, Soft-state, Eventually consistent). As examples, Zookeeper prefers consistency while Amazon's Dynamo prefers availability.
Maintaining consistency over a wide area network increases latency. Therefore, Yahoo's PNUTS system is inconsistent because it maintains remote copies asynchronously. A particular user's data is partitioned locally and accessed with low latency. Facebook's prefers to update a non-partitioned master copy. User has a more local but potentially stale copy, until it gets updated.
A web browser can go offline if it loses connection to the server. The web app can fall back to on-client persistent storage. Hence, availability is preferred over consistency to sustain long partitions. Likewise, Akamai's web caching offers best effort consistency with high level of availability.
In Google, primary partition usually resides within one datacenter, where both availability and consistency can be maintained. Outside this partition, service becomes unavailable.
-
What are some criticisms of the CAP Theorem? Although the Theorem doesn't specify an upper bound on response time for availability, in practice, there's exists a timeout. CAP Theorem ignores latency, which is an important consideration in practice. Timeouts are often implemented in services. During a partition, if we cancel a request, we maintain consistency but forfeit availability. In fact, latency can be seen as another word for availability.
In NoSQL distributed databases, CAP Theorem has led to the belief that eventual consistency provides better availability than strong consistency. Some believe this is an outdated notion. It's better to factor in sensitivity to network delays.
CAP Theorem suggests a binary decision. In reality, it's a continuum. There are different degrees of consistency implemented via "read your writes, monotonic reads and causal consistency".
Milestones
MIT researchers Seth Gilbert and Nancy Lynch offer a formal proof of the CAP Theorem in their paper titled Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. In asynchronous systems, the impossibility result is strong. In partially synchronous systems, we can achieve a practical compromise between consistency and availability.
Daniel Abadi proposes the PACELC Theorem as an alternative to the CAP Theorem. When data is replicated, there's a trade-off between latency and consistency. PACELC makes this explicit: during partitions (P), trade-off is AC; else, trade-off is LC. Default versions of Dynamo, Cassandra, and Riak are PA/EL systems. VoltDB/H-Store and Megastore are PC/EC. MongoDB is PA/EC.
References
- Abadi, Daniel. 2010. "Problems with CAP, and Yahoo’s little known NoSQL system." DBMS Musings, April 23. Accessed 2019-06-09.
- Abadi, Daniel J. 2012. "Consistency Tradeoffs in Modern Distributed Database System Design." IEEE Computer Society, pp. 37-42, February. Accessed 2019-05-28.
- Birman, Ken and Roy Friedman. 1996. "Trading Consistency for Availability in Distributed Systems." Technical Report TR96-1579, Cornell University, April. Accessed 2019-06-07.
- Brewer, Eric. 2000. "Towards Robust Distributed Systems." PODC Keynote, July 19. Accessed 2019-06-09.
- Brewer, Eric. 2012. "CAP Twelve Years Later: How the 'Rules' Have Changed." InfoQ, May 30. Accessed 2019-05-28.
- Brooker, Marc. 2014. "CAP and PACELC: Thinking More Clearly About Consistency." Blog, July 16. Accessed 2019-05-28.
- Dinh, Anh. 2016. "History of the Impossibles - CAP and FLP." July 15. Accessed 2019-05-28.
- Fischer, Michael J., Nancy A. Lynch, and Michael S. Paterson. 1985. "Impossibility of Distributed Consensus with One Faulty Process." Journal of the Association for Computing Machinery, vol. 32, no. 2, pp. 374-382, April. Accessed 2019-05-28.
- Gilbert, Seth and Nancy Lynch. 2002. "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services." ACM SIGACT News, vol. 33, no. 2, pp. 51-59, June. Accessed 2019-06-09.
- Gilbert, Seth and Nancy A. Lynch. 2012. "Perspectives on the CAP Theorem." Computer, vol. 45, no. 2, pp. 30-36, February. Accessed 2019-05-28.
- Greiner, Robert. 2014. "CAP Theorem: Revisited." August 14. Accessed 2019-05-28.
- Kleppmann, Martin. 2015. "A Critique of the CAP Theorem." arXiv, September 18. Accessed 2019-05-28.
- Messinger, Lior. 2013. "Better explaining the CAP Theorem." DZone, February 17. Accessed 2019-06-09.
- Robinson, Henry. 2019. "The CAP FAQ." The Paper Trail. Accessed 2019-05-28.
- Whittaker, Michael. 2014. "An Illustrated Proof of the CAP Theorem." Blog on GitHub.io, August 16. Accessed 2019-05-28.
Further Reading
- Brewer, Eric. 2012. "CAP Twelve Years Later: How the 'Rules' Have Changed." InfoQ, May 30. Accessed 2019-05-28.
- Gilbert, Seth and Nancy A. Lynch. 2012. "Perspectives on the CAP Theorem." Computer, vol. 45, no. 2, pp. 30-36, February. Accessed 2019-05-28.
- Abadi, Daniel J. 2012. "Consistency Tradeoffs in Modern Distributed Database System Design." IEEE Computer Society, pp. 37-42, February. Accessed 2019-05-28.
- Robinson, Henry. 2019. "The CAP FAQ." The Paper Trail. Accessed 2019-05-28.
Article Stats
Cite As
See Also
- Distributed Computing
- ACID Transactions
- CRUD
- Two-Phase Commit Protocol
- Command Query Responsibility Segregation
- Byzantine Generals Problem
Article Warnings
- Readability score of this article is below 50 (48.4). Use shorter sentences. Use simpler words.