Reading Papers about Distributed Replication
A. Practical Byzantine Fault Tolerance
1.What’s its checkpoint?
Checkpoints are generated when a request with a sequence number divisible by some constant is executed; the paper refers to the states produced by the execution of these requests as checkpoints, and a checkpoint with a proof is a stable checkpoint. When the replication code invokes the make_checkpoint upcall, snfsd sets all the copy-on-write bits and creates a (volatile) checkpoint record containing the current sequence number, a list of blocks, and the digest of the current state. snfsd computes the digest of a checkpoint state as part of the make_checkpoint upcall. Although checkpoints are only taken occasionally, it is important to compute the state digest incrementally because the state may be large. Sending an entire checkpoint can be avoided by partitioning the state and stamping each partition with the sequence number of the last request that modified it: to bring a replica up to date, it is only necessary to send it the partitions where it is out of date, rather than the whole checkpoint. A request that has executed tentatively may abort if there is a view change and it is replaced by a null request; in this case the replica reverts its state to the last stable checkpoint in the new-view message or to its last checkpointed state (depending on which one has the higher sequence number).
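A minimal sketch of the partitioning idea, not the paper's snfsd code (the `Partition` class, the field names, and the `partitions_to_fetch` helper are my own): each partition is stamped with the sequence number of the last request that modified it, so a lagging replica only fetches the partitions it is out of date on.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Partition:
    data: bytes
    last_modified_seq: int   # sequence number of the last request that modified this partition

def state_digest(partitions):
    # Hash partition by partition so that, in spirit, only the partitions a
    # request touched need re-hashing when the next checkpoint is taken.
    h = hashlib.sha256()
    for p in partitions:
        h.update(hashlib.sha256(p.data).digest())
    return h.hexdigest()

def partitions_to_fetch(local_stamps, checkpoint_stamps):
    # Fetch only the partitions where the stable checkpoint holds a newer
    # last-modified sequence number than the local copy.
    return [i for i, (mine, theirs) in enumerate(zip(local_stamps, checkpoint_stamps))
            if mine < theirs]

local = [Partition(b"block-0", 90), Partition(b"block-1", 104), Partition(b"block-2", 80)]
print(state_digest(local))
print(partitions_to_fetch([90, 104, 80], [90, 120, 80]))   # -> [1]
```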
2.About its advantages and limitations
Advantages:
(a) To tolerate Byzantine faults, the paper proposes an algorithm which works in an asynchronous system like the Internet. Previous systems, such as Rampart and SecureRing, rely on the synchrony assumption for correctness, which is dangerous in the presence of malicious attacks.
(b) Also, it uses an efficient authentication scheme based on message authentication codes during normal operation; public-key cryptography, which was cited as the major latency and throughput bottleneck in Rampart, is used only when there are faults (a toy MAC example follows this list).
(c) The paper uses the approach to implement a real service: a Byzantine-fault-tolerant distributed file system that supports the NFS protocol. The system is only 3% slower than the standard NFS daemon in the Digital Unix kernel during normal-case operation.
(d) It provides experimental results that quantify the cost of the replication technique.
(e) Applying the read-only optimization to lookup improves the performance of BFS significantly and reduces the overhead relative to BFS-nr to 20%.
(f) In systems that rely on failure detection to exclude faulty members (e.g., Rampart), if the primary has actually failed, the group is unable to process client requests until the exclusion delay has expired. The PBFT algorithm is not vulnerable to this problem because it never needs to exclude replicas from the group.
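On point (b), a toy illustration of why MACs are cheap enough for normal-case operation (the session key and message contents are made up; PBFT actually attaches an authenticator, i.e., a vector of MACs with one entry per receiver):

```python
import hmac, hashlib

# Each pair of nodes shares a symmetric session key; this one is made up.
session_key = b"session-key-replica0-replica1"

def mac(msg: bytes) -> bytes:
    # A MAC is one symmetric hash computation, orders of magnitude cheaper
    # than the public-key signatures PBFT reserves for view changes and
    # key refreshment.
    return hmac.new(session_key, msg, hashlib.sha256).digest()

def verify(msg: bytes, tag: bytes) -> bool:
    return hmac.compare_digest(mac(msg), tag)

tag = mac(b"PRE-PREPARE view=0 seq=42 digest=...")
print(verify(b"PRE-PREPARE view=0 seq=42 digest=...", tag))   # True
```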
Limitation:
(a) The algorithm does not address the problem of fault-tolerant privacy: a faulty replica may leak information to an attacker. And they plan to investigate secret sharing schemes to solve the problem in the future.
(b) In this paper they assume that the client waits for one request to complete before sending the next one. But they can allow a client to make asynchronous requests, yet preserve ordering constraints on them.
(c) The overhead introduced by the replication library is due to extra computation and communication, such as executing cryptographic operations and an extra message round-trip.
(d) There is still much work to do on improving our system. One problem of special interest is reducing the amount of resources required to implement our algorithm. The number of replicas can be reduced by using f replicas as witnesses that are involved in the protocol only when some full replica fails. We also believe that it is possible to reduce the number of copies of the state to f+1 but the details remain to be worked out.
(e) Our approach cannot mask a software error that occurs at all replicas. However, it can mask errors that occur independently at different replicas, including nondeterministic software errors.
3.Others about the paper
It introduces the three-phase protocol used in normal-case operation; the protocol aims to totally order requests. Here is my question: does the three-phase protocol itself tolerate Byzantine faults, or does it only serve to totally order requests?
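To keep the phases straight while thinking about this, here is a toy model of the quorum predicates behind pre-prepare/prepare/commit for a single slot on a single replica (all message validation, digests, and networking are omitted, and the class and names are my own; f=1 is just an example):

```python
F = 1                      # number of tolerated Byzantine faults
N = 3 * F + 1              # total replicas
QUORUM = 2 * F + 1

class RequestSlot:
    """Tracks one (view, sequence-number) slot on one replica."""
    def __init__(self):
        self.pre_prepared = False
        self.prepares = set()      # ids of backups that sent matching PREPAREs
        self.commits = set()       # ids of replicas that sent matching COMMITs

    def prepared(self):
        # a matching pre-prepare from the primary plus 2f matching prepares
        return self.pre_prepared and len(self.prepares) >= 2 * F

    def committed_local(self):
        # prepared plus 2f+1 matching commits (possibly including our own)
        return self.prepared() and len(self.commits) >= QUORUM

slot = RequestSlot()
slot.pre_prepared = True
slot.prepares |= {1, 2}
slot.commits |= {0, 1, 2}
print(slot.prepared(), slot.committed_local())   # True True
```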
Another question is about view changes: how is the new primary selected? The paper says that when the primary p of view v+1 receives 2f valid view-change messages for view v+1 from other replicas, it multicasts a new-view message to all other replicas. Is this related to the question above?
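On the second question: the paper picks the new primary deterministically, so no election is needed. A one-line sketch (the function name is mine):

```python
def primary_of_view(view: int, num_replicas: int) -> int:
    # In PBFT the primary of view v is replica p = v mod |R|, so every
    # replica can compute the next primary locally when it changes views.
    return view % num_replicas

print(primary_of_view(5, 4))   # in a 4-replica group, replica 1 leads view 5
```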
My third question is that I still don't understand the difference between how the algorithm behaves in a synchronous environment and in an asynchronous one.
I probably could not rebuild the system from the paper alone, but there are some approaches worth learning from. For example, regarding checkpoints, we could compute the state digest incrementally and avoid sending the entire checkpoint by partitioning the state.
B. Paxos Made Practical
1. What’s its checkpoint?
Since view numbers are monotonically increasing, the combination of view-id and timestamp, which we call a viewstamp, determines the execution order of all requests over time. When the primary receives a request from a client, it sends a message to the backups; the message includes a committed field, which specifies a viewstamp below which the server has executed all requests and sent their results back to clients. These committed operations never need to be rolled back and can therefore be executed at the backups. If the primary's reply to the client is lost and the primary subsequently fails, the cohort can re-execute the request after a reboot. In short, the checkpoint records which requests have been executed, so that after a failure we don't have to re-execute from the beginning. My question is: does the checkpoint record the whole request or only a single operation within the request?
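A tiny sketch of viewstamp ordering (the field names and the `committed` value are my own illustration; the paper encodes view-ids and timestamps in its own message structures):

```python
from collections import namedtuple

# A viewstamp orders requests across views: compare the view-id first, then
# the per-view timestamp (Python tuples already compare lexicographically).
Viewstamp = namedtuple("Viewstamp", ["view_id", "timestamp"])

committed = Viewstamp(view_id=3, timestamp=17)   # everything below this is durable

def safe_to_execute_at_backup(vs: Viewstamp) -> bool:
    # Operations below the primary's advertised `committed` viewstamp never
    # need to be rolled back, so a backup may go ahead and execute them.
    return vs < committed

print(safe_to_execute_at_backup(Viewstamp(3, 16)))  # True
print(safe_to_execute_at_backup(Viewstamp(4, 1)))   # False
```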
2. About its advantages and limitations
Advantages:
(a) I think it builds a clear view-change protocol. It considers both a crashed cohort that fails to respond to messages and a new cohort that may wish to join the system. Like Paxos, view change is a multi-step process that involves first proposing a new view-id and then proposing the new view.
(b) The primary can temporarily respond to read-only requests without involving the backups, provided a majority of backups promise not to form a new view for 60 seconds; this makes the replication protocol more efficient (a toy lease sketch follows this list).
(c) It employs a third machine, called a witness, that ordinarily sits idle without executing requests, but can participate in the consensus protocol to allow one of the other two replicas to form a view after a failure or network partition.
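For point (b), a toy sketch of the lease idea (the class, field names, and quorum bookkeeping are my own simplification of the 60-second promise described above, not the paper's actual protocol):

```python
import time

LEASE_SECONDS = 60

class Primary:
    """Toy lease bookkeeping for read-only requests."""
    def __init__(self, num_backups):
        self.majority = num_backups // 2 + 1
        self.lease_expiry = 0.0

    def grant_lease(self, promises):
        # Backups promise not to form a new view for LEASE_SECONDS.
        if len(promises) >= self.majority:
            self.lease_expiry = time.monotonic() + LEASE_SECONDS

    def can_answer_read_locally(self):
        # While the lease holds, no other primary can exist, so read-only
        # requests can be answered without contacting the backups.
        return time.monotonic() < self.lease_expiry

p = Primary(num_backups=2)
p.grant_lease(promises={"backup-1", "backup-2"})
print(p.can_answer_read_locally())   # True for the next 60 seconds
```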
Limitation:
(a) In Practical Byzantine Fault Tolerance, view changes are used only to select a new primary, never to select a different set of replicas to form the new view, whereas Paxos Made Practical uses view changes to form an entirely new view. So I think the latter could learn from the former to reduce overhead.
(b) I believe it realizes consensus and view changes, but the paper doesn't provide a performance evaluation.
3. Others about the paper
In fact, the paper aims to address two limitations of Viewstamped Replication: (i) it doesn't show how to replicate a simple system; (ii) it assumes that the set of possible cohorts is fixed over time. This paper addresses both, making Paxos practical and easy to understand.
C. Rex vs. Eve
Rex and Eve both target multi-core servers, and the Rex paper frequently compares itself against Eve, so here I want to put them side by side and let them have a trial of strength with each other.
1. What’s its checkpoint?
Eve is an execute-verify architecture that executes every batch of requests concurrently, so its checkpoint is different from that of traditional SMR protocols. To achieve efficient state comparison and fine-grained checkpointing and rollback, Eve stores the state in a copy-on-write Merkle tree, whose root is a concise representation of the entire state.
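Eve's real tree is copy-on-write and only re-hashes the objects a batch touched; the plain Merkle-root sketch below (function names and the toy "account" state are mine) just shows why comparing a single root hash is enough to tell whether two replicas' states diverged.

```python
import hashlib

def _h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    # Root hash of a binary Merkle tree over the state objects' hashes.
    # Replicas only exchange and compare this root to verify agreement.
    level = [_h(leaf) for leaf in leaves]
    if not level:
        return _h(b"")
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

state_a = [b"account:1=50", b"account:2=70"]
state_b = [b"account:1=50", b"account:2=70"]
print(merkle_root(state_a) == merkle_root(state_b))   # True -> the states agree
```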
Rex resorts to a general checkpointing framework to alleviate the burden of implementing application-specific checkpointing. The primary does not checkpoint periodically during its execution. Checkpointing also cannot be done on a state in which some request has not been processed completely, because Rex would not have sufficient information for a replica to continue processing an incomplete request when restarting from that checkpoint. When Rex decides to create a checkpoint, the primary sets a checkpoint flag so that all threads pause before taking on any new request. Once a checkpoint is available on a replica, any committed trace before the cut point of that checkpoint is no longer needed and can be garbage-collected.
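A single-process toy of the "pause before taking on any new request" rule (all names are mine; in Rex the cut is coordinated through the trace so that every replica checkpoints at the same point):

```python
import threading

NUM_WORKERS = 4

checkpoint_flag = threading.Event()
quiesced = threading.Barrier(NUM_WORKERS + 1)   # workers + the checkpointer
resumed = threading.Barrier(NUM_WORKERS + 1)

shared_state = {"counter": 0}
state_lock = threading.Lock()

def worker(requests):
    for req in requests:
        if checkpoint_flag.is_set():
            quiesced.wait()      # no request is half-processed past this point
            resumed.wait()       # hold here until the snapshot has been taken
        with state_lock:
            shared_state["counter"] += req

def collect_checkpoint():
    quiesced.wait()              # all workers are at a request boundary
    snapshot = dict(shared_state)
    checkpoint_flag.clear()
    resumed.wait()
    return snapshot

# Deterministic toy run: ask for the checkpoint up front so every worker
# pauses at its first request boundary.
checkpoint_flag.set()
workers = [threading.Thread(target=worker, args=([1] * 5,)) for _ in range(NUM_WORKERS)]
for t in workers:
    t.start()
print(collect_checkpoint())     # {'counter': 0} -- taken at a clean cut
for t in workers:
    t.join()
print(shared_state)             # {'counter': 20} after the workers resume
```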
2. About its advantages and limitations
Eve Advantages:
(a) Eve achieves a speedup of 6.5x over sequential execution when using 16 execution threads.
(b) Judging from Figure 8 in the paper, Eve recovers well from failures.
(c) To keep latency low while maintaining high peak throughput, Eve uses a dynamic batching scheme: the batch size decreases when demand is low (providing good latency) and increases when the system starts becoming saturated, in order to leverage as much parallelism as possible (a toy batching rule follows this list).
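For point (c), a toy demand-driven batching rule (the thresholds, growth factors, and constants are made up, not Eve's actual policy): grow the batch when requests are queuing up, shrink it when the system is idle so individual requests are not delayed.

```python
MIN_BATCH, MAX_BATCH = 1, 640

def next_batch_size(current: int, pending_requests: int) -> int:
    if pending_requests > current:          # saturated: favor throughput
        return min(current * 2, MAX_BATCH)
    if pending_requests < current // 2:     # lightly loaded: favor latency
        return max(current // 2, MIN_BATCH)
    return current

size = 64
for pending in (10, 10, 200, 400, 5):
    size = next_batch_size(size, pending)
    print(pending, "->", size)
```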
Eve Limitations:
(a) The extra protection mode optimization is not implemented for the asynchronous configurations.
(b) The current implementation does not handle applications containing objects whose Java finalize method modifies state that needs to be consistent across replicas.
(c) Our current prototype only supports in-memory application state.
(d) As the workload gets lighter (the execution time per request decreases), the overhead of Eve becomes more pronounced.
(e) How well can it handle concurrency faults? If a bug manifests in one replica, Eve can detect it and fix it by rolling back and re-executing sequentially; however, if the same bug occurs in two replicas simultaneously, Eve cannot detect it.
(f) According to the Rex paper, handling on-disk state is tricky in Eve's execute-verify model.
Rex Advantages:
(a) Rex has at most one active consensus instance at any time. This decision greatly simplifies the design of Rex.
(b) Eve's correctness depends on correctly marking the state of each machine, while Rex's correctness depends on capturing all the sources of nondeterminism. Which is better? Compared to Eve, it is easy for Rex to find locks and nondeterministic functions among requests, because Rex uses its own synchronization primitives: it only needs to replace the program's synchronization interfaces with its primitives to capture all the locks and nondeterministic functions. Is this similar to LD_PRELOAD? (A toy interposition sketch follows this list.)
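For point (b), a toy sketch of the "replace the program's lock with an interposing one" idea (the `TracedLock` class, trace format, and example application are mine): the wrapper hands out a global order number on every acquisition, which is the kind of information that goes into Rex's trace for secondaries to replay.

```python
import threading
from itertools import count

_order = count()
_trace = []                     # (lock_id, thread_name, order) tuples

class TracedLock:
    """A drop-in lock that records the order of acquisitions."""
    def __init__(self, lock_id):
        self._lock = threading.Lock()
        self._id = lock_id

    def __enter__(self):
        self._lock.acquire()
        _trace.append((self._id, threading.current_thread().name, next(_order)))
        return self

    def __exit__(self, *exc):
        self._lock.release()

balance_lock = TracedLock("balance")
balance = 0

def deposit(amount):
    global balance
    with balance_lock:          # application code is unchanged except for
        balance += amount       # which lock class it uses

threads = [threading.Thread(target=deposit, args=(10,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(balance, _trace)          # a secondary would replay _trace's order
```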
Rex Limitations:
(a) Rex introduces overhead both in the primary's execution, for recording the causal order, and in the secondary replicas' execution, for respecting that order.
(b) In Eve, replicas can execute independently, while in Rex they cannot, because they need to make the same nondeterministic decisions to ensure consistency.
(c) It is hard to find data races. But the Rex authors believe more and more people will come to realize the danger of data races.
3. Others about the paper
Does multithreaded execution have any relation to Byzantine faults?
Eve achieves deterministic parallelism with a mixer that partitions the set of requests into non-conflicting batches, so Eve can execute those batches concurrently without conflicts. Rex, by contrast, is an execute-agree-follow model: it first lets the primary execute requests concurrently and freely, recording the nondeterministic decisions into a trace during execution; the other machines then agree on the trace to ensure consensus, and secondary replicas execute requests concurrently according to the trace. I think both Eve and Rex find the nondeterministic decisions and conflicts among requests and then let replicas agree on them to ensure consensus. My question is: do their approaches to capturing nondeterministic decisions have the same overhead? Which is better?
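A toy version of the mixer idea (the `mix` function and the key-set request format are mine; Eve's real mixer works from application-provided conflict predictions rather than exact key sets): requests that share no key go into the same parallel batch, and a conflicting request starts a new one.

```python
def mix(requests):
    # `requests` is a list of (request_id, keys_it_touches) pairs.
    batches, current, touched = [], [], set()
    for req_id, keys in requests:
        if touched & set(keys):            # conflicts with the current batch
            batches.append(current)
            current, touched = [], set()
        current.append(req_id)
        touched |= set(keys)
    if current:
        batches.append(current)
    return batches

reqs = [("r1", {"a"}), ("r2", {"b"}), ("r3", {"a", "c"}), ("r4", {"d"})]
print(mix(reqs))   # [['r1', 'r2'], ['r3', 'r4']] -- each inner list can run in parallel
```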
About Rex: since unnecessary causal edges can be removed to reduce overhead, if a client proposes three requests and the execute stage shows they are independent, there is no causal edge among them and they can be executed concurrently. My thought is: can we know they are independent without executing them, as a kind of preprocessing, so that the execute stage could be omitted?