Dynamo and Cassandra: foundations of mass storage

Dynamo and Cassandra are very similar in their core ideas but differ in some details, and together they form a branch of their own among distributed systems. They are representative of a whole family of systems, and the most obvious place where they diverge is in how they handle consistency. Even as plain engineering implementations they turn out quite different, but they share some concepts and each has some unique ones. Let's take a look at them.

First, what they have in common:

W + R > N

I believe everyone is familiar with this one? I will explain it from a different angle.

N is the total number of machines in the replication group [many explanations are imprecise on this point, which has caused a lot of misunderstanding].

W is the number of copies that must be written for a write to count as successful.

R is the number of copies that must be read for a consistent read (note: this is the number of machines selected from the N).

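The formula says:

If W + R > N (W, R, and N are non-negative integers, and R, W <= N), then every read set overlaps every write set in at least one machine, so a read can always see the latest written value.

The most common use of this formula is to solve for R, that is, to rewrite it as R > N - W.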

1. Simple master/standby with strong replication.

Because replication is strong, the data must be written to both copies, so W = 2.

How many nodes are in this cluster? Only two, so N = 2.

So R > (N - W = 2 - 2 = 0).

R > 0, so R can be 1 or 2. It cannot be 3, because there are only 2 machines, so you cannot get to 3 :).

2. Assume there are three machines and data is written using a quorum.

Because there are three machines, N = 3.

Because it is a quorum write, the write counts as done once two machines have it, so W = 2.

Now R > (N - W = 3 - 2 = 1).

So R can only be 2 or 3.

What does this value of R mean? It means you have to pick at least two of the N = 3 machines and read from them in order to be sure of reading the latest value.

3. Assume there are three machines and a write succeeds as soon as one copy is written.

Because there are three machines, N = 3.

Because a write counts as successful once a single machine has it, W = 1.

Now R > (N - W = 3 - 1 = 2).

So R can only be 3.

What does that mean? It means that if a write succeeds after reaching only one machine, then a read has to go to all 3 machines to be sure of getting the latest value.

I will not list any more cases; if you are interested, work through them yourself :). Enumerating them makes the conclusion easy to see.
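The enumeration above can be reproduced mechanically. The following is only a minimal sketch: the function names valid_read_sizes and always_overlaps are made up for illustration, and the brute-force check simply confirms the overlap argument behind W + R > N for small clusters.

```python
from itertools import combinations


def valid_read_sizes(n: int, w: int) -> list[int]:
    """Return every R with 1 <= R <= N that satisfies W + R > N."""
    return [r for r in range(1, n + 1) if w + r > n]


def always_overlaps(n: int, w: int, r: int) -> bool:
    """Brute force: does every W-sized write set share at least one
    machine with every R-sized read set?"""
    machines = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(machines, w)
        for read_set in combinations(machines, r)
    )


# The three cases from the text: (N, W) = (2, 2), (3, 2), (3, 1).
for n, w in [(2, 2), (3, 2), (3, 1)]:
    print(f"N={n} W={w} -> R can be {valid_read_sizes(n, w)}")
```

With N = 3 and W = 2, always_overlaps(3, 2, 2) is true while always_overlaps(3, 2, 1) is false: a single-machine read can land on the one machine the write skipped.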

Gossip protocol

The Gossip protocol is one of the building blocks of both storage systems. You can call it complicated, and you can also call it not complicated. In fact, Gossip is a P2P protocol. What it mainly aims for is decentralization.

How does it do that? In this article I only hope to leave a few impressions: What is Gossip? How does it work? What are its drawbacks? If you are interested in the details of the protocol, you can dig deeper yourself.

Gossip's core goal is decentralization. How it works:

Starting from a seed file, a node connects to some machines according to certain rules and establishes contact with them. It does not pursue global consistency; it only synchronizes data with the machines it is talking to. The problem here is how to find out quickly where your data and the other side's data differ. This is where a Merkle tree is used: it lets a node quickly detect changes in the data it holds.
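To make the Merkle tree idea concrete, here is a minimal sketch rather than what Dynamo or Cassandra actually ship. It assumes each leaf is the serialized content of one key range, that the number of leaves is a power of two, and that SHA-256 is the hash; build_tree and differing_leaves are hypothetical names.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(leaves: list[bytes]) -> list[list[bytes]]:
    """Return the tree as a list of levels: leaf hashes first, root last.
    Assumes len(leaves) is a power of two."""
    level = [_h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a: list[list[bytes]], b: list[list[bytes]]) -> list[int]:
    """Walk down from the root, descending only into subtrees whose
    hashes differ, and return the indexes of the leaves that differ."""
    top = len(a) - 1
    suspects = [0] if a[top][0] != b[top][0] else []
    for depth in range(top - 1, -1, -1):
        suspects = [
            child
            for parent in suspects
            for child in (2 * parent, 2 * parent + 1)
            if a[depth][child] != b[depth][child]
        ]
    return suspects


mine = build_tree([b"k0=a", b"k1=b", b"k2=c", b"k3=d"])
theirs = build_tree([b"k0=a", b"k1=b", b"k2=CHANGED", b"k3=d"])
print(differing_leaves(mine, theirs))  # prints [2]
```

If the two roots are equal, the nodes are in sync and nothing else needs to be exchanged; otherwise only the hashes along the differing path are compared before the affected key range is transferred.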


For decentralization, look at the great Tor: as long as you can connect to one seed that has an exit beyond the Great Wall, you can eventually get over the Great Wall yourself.

The drawback: consistency is harder to maintain. (The introduction here is very brief because I have never actually written one. If anyone has this experience, you are welcome to add to it.)

Different choices:

Dynamo: Vector Clock vs. Cassandra: Timestamp.

The two mechanisms have the same goal. This is also the main point we want to discuss.

Let’s first talk about Vector Clock:

When talking about Vector Clock, you cannot avoid mentioning Lamport's paper Time, Clocks, and the Ordering of Events in a Distributed System (Chinese translation) http://t.cn/zlewzin

The heart of that paper is the problem of mutual exclusion and queuing among multiple processes. But that is not the main thing for us to absorb. What the paper makes you realize is this: we have walked into a relativistic world. When no messages pass between processes, each goes its own way and follows its own clock. Only when messages pass between them is it possible to construct a global time. Vector Clock feels to me like a way of thinking that follows this road. If I had to pick an analogy from real life, Vector Clock feels most like git.

Let's start by comparing it with Paxos.

In Paxos, we use a quorum and something like a three-phase commit to guarantee that proposals are ordered and that only one proposal is accepted at a time.

There is one scenario where this is not the most efficient approach: if we assume that most updates do not touch the same data, its efficiency will not be high.

For example, suppose we keep doing the following on one key in a KV store:

{Check whether A has at least 100. If so, deduct 100}

{Check whether A has at least 100. If so, deduct 100}

{Check whether A has at least 100. If so, deduct 100}

. . .

If operations like these keep arriving for the same piece of data, each one depends on the value left by the one before it. Operations of this kind must be queued to be valid; otherwise over-deduction will occur. But suppose our operations are just the following:

A logged in

B logged in

C logged in

D logged in

E logged in

F logged in

Then all of the data can be considered "unrelated". In this case, forcing all of these writes through a single queue is obviously expensive.
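The difference between the two workloads can be shown with a toy example. This is only an illustration under simple assumptions: a starting balance of 100, three deduction requests, and a hypothetical helper named deduct_if_enough.

```python
def deduct_if_enough(balance: int, amount: int = 100) -> int:
    """Check-then-deduct: only valid if it sees the latest balance."""
    return balance - amount if balance >= amount else balance


# Queued: each operation sees the result of the previous one.
balance = 100
for _ in range(3):
    balance = deduct_if_enough(balance)
queued_total_deducted = 100 - balance

# Not queued: three machines each apply one operation to the same stale value.
stale = 100
unordered_total_deducted = sum(stale - deduct_if_enough(stale) for _ in range(3))

# Logins do not depend on each other, so any arrival order gives the same result.
logins_one_order = set()
for user in ["A", "B", "C", "D", "E", "F"]:
    logins_one_order.add(user)
logins_other_order = set()
for user in ["F", "D", "B", "E", "C", "A"]:
    logins_other_order.add(user)

print(queued_total_deducted, unordered_total_deducted)
print(logins_one_order == logins_other_order)
```

Queued, only one deduction goes through; without ordering, 300 is deducted from a balance of 100, which is the over-deduction described above. The login records end up identical in either order.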

As I understand it, Dynamo and Cassandra are mainly suited to the second kind of scenario, where conflicts between data are rare. In that case maintaining a globally ordered queue is too inefficient, and this decentralized approach works better. However, a low probability of conflict does not mean there are no conflicts, so a mechanism is still needed to help you detect conflicts and let you handle them. On this question, Dynamo and Cassandra chose different roads.

Dynamo chose vector clock.

Its main idea is this: every message that is passed along records which machine the data came from and which version it is.

Let's see how, with this approach, we can tell when a conflict has occurred:

We assume three machines A, B, C.

Initially, the data version on all three machines A, B, and C is 100.

At this point I pick a machine at random, B, and write a record [key=Whisper, val=0].

B then generates the proposal [101 from B -> [key=Whisper, val=0]].

Then another person picks C and writes a different record, say [key=Whisper, val=10000].

C then generates the proposal [101 from C -> [key=Whisper, val=10000]].

Now B's data is propagated to C. Because C also has a proposal numbered 101, [it keeps both proposals] (note that this is where it differs from Paxos).

So what C ends up holding is:

[Proposal 101

{from B [key=Whisper, val=0]}

{from C [key=Whisper, val=10000]}

]


Then what? When you call Get("Whisper"), Vector Clock hands both records back to you and tells you there is a conflict, and you pick one :)

So we have a few options. For {count++}-style operations, all of the conflicting versions should be added together. For other data, look at the timestamps and take the record with the largest one, which is the latest. This is the workflow of Vector Clock, which easily reminds me of conflicts in SVN or Git.
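The walkthrough above uses a simplified "version number plus origin" notation. The sketch below uses the more common formulation, where a clock is a map from machine name to a counter; this is an assumption on my part, not necessarily how Dynamo encodes it. Version, add_version, resolve_counter, and resolve_latest are hypothetical names, and the wall-clock timestamp field is also assumed.

```python
from dataclasses import dataclass

Clock = dict[str, int]  # machine name -> number of writes seen from it


def descends(a: Clock, b: Clock) -> bool:
    """True if clock a has seen everything that clock b has."""
    return all(a.get(node, 0) >= count for node, count in b.items())


@dataclass
class Version:
    value: int
    clock: Clock
    timestamp: float  # wall-clock time of the write (assumed field)


def add_version(siblings: list[Version], new: Version) -> list[Version]:
    """Drop versions that `new` supersedes; keep concurrent ones side by side."""
    if any(descends(old.clock, new.clock) for old in siblings):
        return siblings  # something we already hold covers `new`
    return [old for old in siblings if not descends(new.clock, old.clock)] + [new]


def resolve_counter(siblings: list[Version]) -> int:
    """Add the conflicting versions together, as suggested for count++ data.
    Only right if each sibling holds an independent increment."""
    return sum(v.value for v in siblings)


def resolve_latest(siblings: list[Version]) -> int:
    """For other data, keep the value with the largest timestamp."""
    return max(siblings, key=lambda v: v.timestamp).value


from_b = Version(0, {"B": 1}, timestamp=1.0)
from_c = Version(10000, {"C": 1}, timestamp=2.0)
held_by_c = add_version([from_c], from_b)  # concurrent: both are kept
print(len(held_by_c), resolve_latest(held_by_c))
```

Neither clock has seen the other's write, so both versions survive and are returned to the caller. A later write whose clock covers both, for example {"B": 1, "C": 2}, replaces the pair.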

How about that? Does it open up your thinking a bit? Everyone is welcome to build on Paxos and Vector Clock and come up with other ideas; research on consistency is far from finished. As for me personally, I prefer consistency models that guarantee ordering, and I am not fond of this each-goes-its-own-way model.

OK, that is it for Vector Clock. Now let's talk about Cassandra's timestamp.

In fact, the timestamp model is a degenerate, simplified version of Vector Clock. With Vector Clock, conflicts are handled by the user and the system only helps you detect them. Cassandra is cruder and simpler: it does not detect conflicts at all, and for any piece of data it keeps only the version with the largest timestamp. This model copes with about 80% of scenarios and is much simpler, but my guess is that count++ cannot be done this way? I have not actually used it.

OK, looking back, we have just covered three new concepts. W + R > N is used to decide the consistency level. The Gossip protocol and Merkle tree are used to synchronize data between decentralized nodes. Vector Clock or timestamp is used to deal with conflicting writes.

Now it is time to assemble the pieces.

Dynamo: data is synchronized using Gossip + Merkle tree, and Vector Clock is used to mark conflicting data, which is then handed to the user to resolve. Configuration lets you specify a group of several peer machines (N), the number of copies that must be written for a write to succeed (W), and the number of copies to read (R).

Cassandra: similar to Dynamo (they share the same origin). In its design choices it gives up Vector Clock and uses timestamps to resolve conflicts. The rest is similar.
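The "keep only the largest timestamp" rule can be sketched in a few lines. This is not Cassandra's real code path, only an illustration of the rule as described here; lww_merge is a hypothetical name, each cell is assumed to be a (timestamp, value) pair, and the tie-breaking choice is arbitrary.

```python
def lww_merge(a: tuple[float, int], b: tuple[float, int]) -> tuple[float, int]:
    """Each cell is (timestamp, value); keep the one with the larger timestamp.
    On a tie this sketch arbitrarily keeps `a`."""
    return a if a[0] >= b[0] else b


# Two clients both read (1.0, 5) and each writes back value + 1.
write_1 = (2.0, 6)
write_2 = (2.1, 6)
result = lww_merge(write_1, write_2)
print(result)  # prints (2.1, 6): two increments happened, but only one survives
```

No conflict is reported and one of the two increments is silently lost, which is the reason for doubting that a plain read-then-write count++ works under this model.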