CAP Confusion: Problems with ‘partition tolerance’

The ‘CAP’ theorem is a hot topic in the design of distributed data storage systems. However, it’s widely misused. In this post I hope to highlight why the common ‘consistency, availability and partition tolerance: pick two’ formulation is inadequate for distributed systems. In fact, the lesson of the theorem is that the choice is almost always between sequential consistency and high availability.

It’s very common to invoke the ‘CAP theorem’ when designing, or talking about designing, distributed data storage systems. The theorem, as commonly stated, gives system designers a choice between three competing guarantees:

Consistency – roughly meaning that all clients of a data store get responses to requests that ‘make sense’. For example, if Client A writes 1 then 2 to location X, Client B cannot read 2 followed by 1 (a short sketch of this check follows these definitions).

Availability – all operations on a data store eventually return successfully. We say that a data store is ‘available’ for, e.g. write operations.

Partition tolerance – if the network stops delivering messages between two sets of servers, will the system continue to work correctly?
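The consistency example above can be made concrete with a small check over read/write histories. This is a minimal sketch of my own (Python, with hypothetical names), not code from the original post: Client A writes 1 then 2 to location X, and Client B’s reads must never go ‘backwards’ relative to that write order.

    # Minimal sketch: does one client's sequence of reads respect another
    # client's write order? Values that A never wrote are ignored.
    def reads_respect_write_order(write_order, observed_reads):
        position = {value: i for i, value in enumerate(write_order)}
        last_seen = -1
        for value in observed_reads:
            if value not in position:
                continue
            if position[value] < last_seen:
                return False  # an older value was read after a newer one
            last_seen = position[value]
        return True

    writes_by_a = [1, 2]                                   # A writes 1, then 2, to X
    print(reads_respect_write_order(writes_by_a, [1, 2]))  # True: consistent
    print(reads_respect_write_order(writes_by_a, [2, 1]))  # False: 2 followed by 1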

This is often summarised as a single sentence: “consistency, availability, partition tolerance. Pick two.” Short, snappy and useful.
At least, that’s the conventional wisdom. Many modern distributed data stores, including those often caught under the ‘NoSQL’ net, pride themselves on offering availability and partition tolerance over strong consistency; the reasoning being that short periods of application misbehaviour are less problematic than short periods of unavailability. Indeed, Dr. Michael Stonebraker posted an article on the ACM’s blog bemoaning the preponderance of systems that choose the ‘AP’ data point, arguing that consistency and availability are the two properties to pick. However, for the vast majority of systems, I contend that the choice is almost always between consistency and availability, and unavoidably so.

Dr. Stonebraker’s central thesis is that, since partitions are rare, we might simply sacrifice ‘partition-tolerance’ in favour of sequential consistency and availability – a model that is well suited to traditional transactional data processing and the maintenance of the good old ACID invariants of most relational databases. I want to illustrate why this is a misinterpretation of the CAP theorem.

We first need to get straight exactly what is meant by ‘partition tolerance’. Dr. Stonebraker asserts that a system is partition tolerant if processing can continue in both partitions in the case of a network failure.

“If there is a network failure that splits the processing nodes into two groups that cannot talk to each other, then the goal would be to allow processing to continue in both subgroups.”

This is actually a very strong partition tolerance requirement. Digging into the history of the CAP theorem reveals some divergence from this definition.

Seth Gilbert and Professor Nancy Lynch provided both a formalisation and a proof of the CAP theorem in their 2002 SIGACT paper. We should defer to their definition of partition tolerance – if we are going to invoke CAP as a mathematical truth, we should formalize our foundations, otherwise we are building on very shaky ground. Gilbert and Lynch define partition tolerance as follows:

“The network will be allowed to lose arbitrarily many messages sent from one node to another”
Note that Gilbert and Lynch’s definition isn’t a property of a distributed application, but a property of the network in which it executes. This is often misunderstood: partition tolerance is not something we have a choice about designing into our systems. If you have a partition in your network, you lose either consistency (because you allow updates to both sides of the partition) or availability (because you detect the error and shut down the system until the error condition is resolved). Partition tolerance means simply developing a coping strategy by choosing which of the other system properties to drop. This is the real lesson of the CAP theorem – if you have a network that may drop messages, then you cannot have both availability and consistency; you must choose one. We should really be writing Possibility of Network Partitions => not(availability and consistency), but that’s not nearly so snappy.
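Written as a formula (this is just the sentence above in symbols, not notation taken from Gilbert and Lynch’s paper):

    \text{possibility of network partitions} \;\Rightarrow\; \neg(\text{availability} \land \text{consistency})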

Dr. Stonebraker’s definition of partition tolerance is actually a measure of availability – if a write may go to either partition, will it eventually be responded to? This is a very meaningful question for systems distributed across many geographic locations, but for the LAN case it is less common to have two partitions available for writes. However, it is encompassed by the requirement for availability that we already gave – if your system is available for writes at all times, then it is certainly available for writes during a network partition.

So what causes partitions? Two things, really. The first is obvious – a network failure, for example due to a faulty switch, can cause the network to partition. The other is less obvious, but fits with the definition from Gilbert and Lynch: machine failures, either hard or soft. In an asynchronous network, i.e. one where processing a message could take unbounded time, it is impossible to distinguish between machine failures and lost messages. Therefore a single machine failure partitions that machine from the rest of the network. A correlated failure of several machines partitions them all from the network. Not being able to receive a message is the same as the network not delivering it. In the face of sufficiently many machine failures, it is still impossible to maintain availability and consistency, not because two writes may go to separate partitions, but because the failure of an entire ‘quorum’ of servers may render some recent writes unreadable.
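From a client’s point of view, the indistinguishability is easy to see: the only observable outcome is ‘no reply within the deadline’. Here is a minimal illustrative sketch of my own (Python, probing a reserved, non-routable test address), not code from the original post:

    # Minimal sketch: a timed-out request looks the same to the caller whether
    # the remote machine crashed or the network dropped the message.
    import socket

    def probe(host, port, timeout_s=1.0):
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                return "reply received"
        except socket.timeout:
            return "no reply: crashed node or lost message, indistinguishable"
        except OSError as e:
            return "connection error: " + str(e)

    # 192.0.2.1 is reserved for documentation (TEST-NET-1); the call should not
    # succeed, and on most networks it simply times out.
    print(probe("192.0.2.1", 9999))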

This is why defining P as ‘allowing partitioned groups to remain available’ is misleading – machine failures are partitions, almost tautologously, and failed machines cannot, by definition, be available. Yet Dr. Stonebraker says that he would suggest choosing CA rather than P. This feels rather like we are invited to both have our cake and eat it. Not ‘choosing’ P is analogous to building a network that will never experience multiple correlated failures. This is unreasonable for a distributed system – precisely for all the valid reasons that are laid out in the CACM post about correlated failures, OS bugs and cluster disasters – so what a designer has to do is to decide between maintaining consistency and availability. Dr. Stonebraker tells us to choose consistency, in fact, because availability will unavoidably be impacted by large failure incidents. This is a legitimate design choice, and one that the traditional RDBMS lineage of systems has explored to its fullest, but it protects us neither from availability problems stemming from smaller failure incidents, nor from the high cost of maintaining sequential consistency.

When the scale of a system increases to many hundreds or thousands of machines, writing in such a way as to preserve consistency in the face of potential failures can become very expensive (you have to write to one more machine than the number of failures you are prepared to tolerate at once). This kind of nuance is not captured by the CAP theorem: consistency is often much more expensive to maintain, in terms of throughput or latency, than availability. Systems such as ZooKeeper are explicitly sequentially consistent because there are few enough nodes in a cluster that the cost of writing to a quorum is relatively small. The Hadoop Distributed File System (HDFS) also chooses consistency – three failed datanodes can render a file’s blocks unavailable if you are unlucky. Both systems are designed to work in real networks, however, where partitions and failures will occur*, and when they do both systems will become unavailable, having made their choice between consistency and availability. That choice remains the unavoidable reality for distributed data stores.

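To spell out the parenthetical arithmetic above (a sketch of standard quorum reasoning on my part, not a formula from the post): if every acknowledged write must survive f simultaneous machine failures, each write has to reach at least

    W \geq f + 1

replicas, so that at least one copy outlives the failures. A majority-quorum system such as ZooKeeper typically runs N = 2f + 1 servers and writes to W = f + 1 = \lceil (N+1)/2 \rceil of them: a five-node ensemble has a write quorum of three and tolerates two failed servers. HDFS with a replication factor of three similarly survives two failed datanodes for a given block, but not three, which is exactly the unlucky case mentioned above.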
Further Reading
*For more on the inevitability of failure modes in large distributed systems, the interested reader is referred to James Hamilton’s LISA ’07 paper On Designing and Deploying Internet-Scale Services.

Daniel Abadi has written an excellent critique of the CAP theorem.

James Hamilton also responds to Dr. Stonebraker’s blog entry, agreeing (as I do) with the problems of eventual consistency but taking issue with the notion of infrequent network partitions.
