Discuss: Understand the Complexity of Region Merge on TiKV

Continuing the discussion from Discuss: Merge Mto1Solution:

Jay states that the merge function is simple: a target region takes over the key range of a source region. The implementation, however, is complex because it must handle abnormal cases. There is a gap of more than a thousand lines of source code between the simple function and the complex implementation.
I like the three points Jay presented; they are helpful for crossing that gap and understanding how merge works.


I present a merge model for the current solution in TiKV.

Merge model.
Pre-conditions: regions A and B are aligned on the nodes. There is a synchronization point between the executions of Raft groups A and B.
(1) Stop peerA.
(2) Enlarge region B, remove region A.
(3) Remove peerA.
Post-conditions: region B.

Merge implementation (a rough code sketch follows this list).
(1) Align peerA and peerB on the nodes.
(2) Distributed transaction raftA.prepare_merge() for model step 1.
(3) Distributed transaction raftB.commit_merge() for model steps 2 and 3.
(4) Distributed transaction raftA.rollback_merge() when region B does not exist or has been modified by others.
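
To make the mapping concrete, here is a minimal Rust sketch of the three merge commands as plain data. The Region and MergeCmd types are simplified stand-ins I made up for illustration, not TiKV's kvproto/raftstore structures; it only shows how applying CommitMerge lets region B take over the source range, and omits leadership, epochs, and persistence entirely.

```rust
// Simplified stand-ins for illustration only, not TiKV's real types.
#[derive(Debug, Clone)]
struct Region {
    id: u64,
    start_key: Vec<u8>,
    end_key: Vec<u8>,
}

#[derive(Debug)]
enum MergeCmd {
    // Proposed in Raft group A: stop serving writes on region A (model step 1).
    PrepareMerge { target: Region },
    // Proposed in Raft group B: extend region B over A's range and drop A
    // (model steps 2 and 3).
    CommitMerge { source: Region },
    // Proposed in Raft group A when region B is gone or has been modified.
    RollbackMerge,
}

// Applying CommitMerge on region B: take over the source range.
fn apply_commit_merge(target: &mut Region, source: &Region) {
    // The aligned pre-condition means the source sits right before the
    // target; a real implementation must also check region epochs.
    if source.end_key == target.start_key {
        target.start_key = source.start_key.clone();
    }
}

fn main() {
    let a = Region { id: 1, start_key: b"a".to_vec(), end_key: b"m".to_vec() };
    let mut b = Region { id: 2, start_key: b"m".to_vec(), end_key: b"z".to_vec() };

    let cmd = MergeCmd::CommitMerge { source: a.clone() };
    if let MergeCmd::CommitMerge { source } = cmd {
        apply_commit_merge(&mut b, &source);
    }
    assert_eq!(b.start_key, b"a".to_vec());
    println!("merged region B: {:?}", b);
}
```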

Difficulties.
(1) Distributed data objects are not reliable in concurrent systems. Aligned peers may be modified during the process steps. See the examples in point 2 presented by Jay.
(2) Distributed transactions do not synchronize with each other within a bounded time during the process steps.

Raft group support.
(1) Calling raftB.commit_merge() from raftA.prepare_merge() synchronizes the logs of Raft groups A and B (see the sketch below).
(2) However, the executions of Raft groups A and B are not synchronized with each other. This results in the complexity of point 1 stated by Jay.
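
A hypothetical sketch of why this call ties the two logs together, assuming (this is my reading, it may not match TiKV exactly) that the CommitMerge proposal carries the tail of region A's log after the PrepareMerge index, so region B cannot apply the merge without knowing that tail, and a lagging local peer of A can replay it before the merge applies. LogEntry, CommitMergeProposal, and build_commit_merge are my own illustration, not TiKV code.

```rust
#[derive(Debug, Clone)]
struct LogEntry {
    index: u64,
    data: Vec<u8>,
}

#[derive(Debug)]
struct CommitMergeProposal {
    source_region_id: u64,
    // Log entries of region A after the PrepareMerge index, shipped along
    // with the proposal into Raft group B.
    source_entries: Vec<LogEntry>,
}

fn build_commit_merge(
    source_region_id: u64,
    prepare_merge_index: u64,
    source_log: &[LogEntry],
) -> CommitMergeProposal {
    CommitMergeProposal {
        source_region_id,
        source_entries: source_log
            .iter()
            .filter(|e| e.index > prepare_merge_index)
            .cloned()
            .collect(),
    }
}

fn main() {
    let log: Vec<LogEntry> =
        (1..=5).map(|i| LogEntry { index: i, data: vec![] }).collect();
    // Suppose PrepareMerge was proposed at index 3: the proposal carries
    // the remaining two entries of region A's log.
    let proposal = build_commit_merge(1, 3, &log);
    assert_eq!(proposal.source_entries.len(), 2);
    println!("{:?}", proposal);
}
```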

I present a split model for comparison.

Split model.
Pre-conditions: region A is on the nodes. Raft group A is synchronized by a record in the Raft log.
(1) Stop peerA.
(2) Shrink region A, create region B.
(3) Create peerB.
Post-conditions: region A and B.

Split implementation (a rough code sketch follows).
(1) Distributed transaction raftA.split() for model steps 1 to 3.
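
For contrast, a minimal sketch of why split stays inside one Raft group: a single split command applied to region A yields both resulting ranges, so there is nothing to coordinate across groups. Region and apply_split here are simplified stand-ins, not TiKV's types.

```rust
#[derive(Debug)]
struct Region {
    id: u64,
    start_key: Vec<u8>,
    end_key: Vec<u8>,
}

// Applying the split: shrink region A and derive region B (model steps 1-3).
fn apply_split(a: &mut Region, split_key: Vec<u8>, new_id: u64) -> Region {
    let b = Region {
        id: new_id,
        start_key: split_key.clone(),
        end_key: a.end_key.clone(),
    };
    a.end_key = split_key;
    b
}

fn main() {
    let mut a = Region { id: 1, start_key: b"a".to_vec(), end_key: b"z".to_vec() };
    let b = apply_split(&mut a, b"m".to_vec(), 2);
    assert_eq!(a.end_key, b"m".to_vec());
    assert_eq!(b.start_key, b"m".to_vec());
    println!("after split: {:?} and {:?}", a, b);
}
```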

Implementation difficulties.
(1) Distributed data objects are not reliable in concurrent systems.
(2) Distributed transactions do not synchronize with each other within a bounded time during the process steps.

Raft group support.
(1) The model works on region A only, avoiding problem 1 listed in the difficulties.
(2) The model works on Raft group A only, avoiding problem 2 listed in the difficulties.
(3) Within a single Raft group, Raft protects the application from the abnormal cases caused by these difficulties.

We do not remove region A before removing peerA. Actually, we just update the region meta in PD once region B has committed the Raft entry CommitMerge.

For region split, we do not need to create peerB before we split region A. It is not a distributed transaction; we simply split one region into multiple regions.

Thank you very much for your suggestions, but this part of the code is not only logically complex but also risky to modify. Unless it is necessary, we will not refactor this part of the code lightly.

Yes. I will check the code in tikv/src/raftstore and reflect your corrections in the models. As the topic name indicates, we aim to understand the complex implementation of region merge.
I hope we can cross the great gap between the simple function and the complicated implementation. Then a newcomer could understand and verify every line of code within one week, with the help of this topic as well as the documents and posts on the PingCAP websites.

I revise the merge implementation to reflect the corrections presented by Wallace. The merge implementation is complicated compared to region split; all the additional work comes from the synchronization required by the model.

Merge model.
Pre-conditions: regions A and B are aligned on the nodes. There is a synchronization point between the executions of Raft groups A and B.
(1) Stop peerA.
(2) Enlarge region B.
(3) Remove peerA, remove region A.
Post-conditions: region B.

Merge implementation on a node (a rough code sketch follows this list).
(1) Align peerA and peerB on the node. This is a point for synchronizing regions A and B.
(2) Call raftB.commit_merge() from raftA.prepare_merge(). This is a point for synchronizing the logs of Raft groups A and B.
(3) Raft transaction raftA.prepare_merge() for model step 1.
(4) Raft transaction raftB.commit_merge() for model step 2.
(5) Raft transaction raftA.rollback_merge() when region A is not aligned with B. This is an abnormal case of synchronization between regions A and B.
(6) peerA.catchup_logs(). This is an abnormal case of synchronization between raftA.prepare_merge and raftB.commit_merge.
(7) Extended-Raft transaction raftstore.remove(peerA) for model step 3.
Note: Raft manages the implementation across nodes.
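
A hedged sketch of this per-node flow, assuming the node holds an aligned pair of peers: when peer B applies CommitMerge, it first makes sure the local peer A has applied up to the PrepareMerge index (the catchup_logs abnormal case), then enlarges region B; removing peer A is left to the store. PeerA, PeerB, and the function names are illustrative stand-ins, not TiKV code.

```rust
struct PeerA {
    applied_index: u64,
    stopped: bool,
}

struct PeerB {
    start_key: Vec<u8>,
}

impl PeerA {
    // Simulate replaying the source log entries (shipped with the
    // CommitMerge proposal) up to the PrepareMerge index.
    fn catchup_logs(&mut self, prepare_merge_index: u64) {
        self.applied_index = prepare_merge_index;
    }
}

fn apply_commit_merge_on_node(
    a: &mut PeerA,
    b: &mut PeerB,
    prepare_merge_index: u64,
    source_start_key: Vec<u8>,
) {
    // Abnormal case: this node's peer A is behind the PrepareMerge index.
    if a.applied_index < prepare_merge_index {
        a.catchup_logs(prepare_merge_index);
    }
    a.stopped = true;                // model step 1 has taken effect on A
    b.start_key = source_start_key;  // model step 2: enlarge region B
    // Model step 3, raftstore.remove(peerA), happens outside Raft group A
    // (the "extended-Raft transaction" in the list above).
}

fn main() {
    let mut a = PeerA { applied_index: 2, stopped: false };
    let mut b = PeerB { start_key: b"m".to_vec() };
    apply_commit_merge_on_node(&mut a, &mut b, 5, b"a".to_vec());
    assert!(a.stopped && a.applied_index == 5);
    assert_eq!(b.start_key, b"a".to_vec());
}
```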

Difficulties.
(1) Distributed data objects are not reliable in concurrent systems. Aligned peers may be modified during the process steps. See the examples in point 2 presented by Jay.
(2) Distributed transactions do not synchronize with each other within a bounded time during the process steps.

I revise the split implementation to reflect the corrections presented by Wallace.

Split model.
Pre-conditions: region A is on the nodes. Raft group A is synchronized by a record in the Raft log.
(1) Stop peerA.
(2) Shrink region A, create region B.
(3) Create peerB.
Post-conditions: region A and B.

Split implementation on a node (a rough code sketch follows).
(1) Raft transaction raftA.split() for model steps 1 to 2.
(2) Extended-Raft transaction raftstore.create(peerB) for model step 3.

Note: no synchronization is required in the raftstore source code, while Raft manages the implementation across nodes.
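
A sketch of what raftstore.create(peerB) could look like on a node, assuming a hypothetical Store that registers a new local peer when the split entry is applied; every node derives the same peer B from the same replicated entry, so no extra synchronization is needed. Store and Peer are my own stand-ins, not TiKV's types.

```rust
use std::collections::HashMap;

#[derive(Debug)]
struct Peer {
    region_id: u64,
    start_key: Vec<u8>,
    end_key: Vec<u8>,
}

#[derive(Default)]
struct Store {
    peers: HashMap<u64, Peer>,
}

impl Store {
    // Called locally on each node when the split entry is applied.
    fn create_peer(&mut self, region_id: u64, start_key: Vec<u8>, end_key: Vec<u8>) {
        self.peers.insert(region_id, Peer { region_id, start_key, end_key });
    }
}

fn main() {
    let mut store = Store::default();
    // Region A [a, z) has been split at "m"; register the new peer B [m, z).
    store.create_peer(2, b"m".to_vec(), b"z".to_vec());
    assert!(store.peers.contains_key(&2));
    println!("{:?}", store.peers.get(&2));
}
```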

I am happy! I have worked out a merge model in mathematics. cc @BusyJay @Little-Wallace @tison. I found a conflicting point at state (A0, B1) in the implementation steps; see the question marks there. I am confused, can anybody help?

Merge model with synchronization.
Pre-conditions: Φ→(A0, B0). Regions A and B are aligned on the nodes. There is a synchronization point between the executions of Raft groups A and B.
(1)(A0, B0)→(A1, B0) Stop KV on peerA.
(2)(A1, B0)→(A1, B1) Enlarge region B.
(3)(A1, B1)→(A2, B1) Remove peerA, remove region A.
Post-conditions: region B.

Implementation steps.
(0) Align peerA and peerB on the nodes. This is a point for synchronizing regions A and B.
(1) peerA.prepare_merging:
if A0 and Φ, quit.
if A0 and B0, set peerA State::Merging.
if A0 and B0 and not leader, quit.
call extended_RaftA.
(2) extended_RaftA:
if A0 and B0 and leader, call raftB.commit_merge.
if A0 and B1 and leader, call raftA.rollback_merge. ?
(3) peerB.commit_merge:
if B0 and Φ, quit.
if B0 and A0, peerA.catchup_logs.
if B0 and A1, set region B.
if B0 and A1, call extended_RaftB.
if B0 and A2, quit.
(4) extended_RaftB:
if B0 and A1, remove peerA. ?

A revision to handle commit_merge state (B0, Φ).

Merge model with synchronization.
Pre-conditions: Φ→(A0, B0). Regions A and B are aligned on the nodes. There is a synchronization point between the executions of Raft groups A and B.
(1)(A0, B0)→(A1, B0) Stop KV on peerA.
(2)(A1, B0)→(A1, B1) Enlarge region B.
(3)(A1, B1)→(A2, B2) Remove peerA, or remove peerB in the abnormal case.
Post-conditions: region B.

Implementation steps (a code sketch of the case analysis follows this walkthrough).
Φ.

(0) Align peerA and peerB on the nodes. This is a point for synchronizing regions A and B.
(A0, B0).

(1) peerA.prepare_merging:
if A0 and Φ, nop.
if A0 and B0, set peerA State::Merging.
if A0 and B1, nop.
if A0 and B2, nop.
(A1, B0).
call extended_RaftA.

(2) extended_RaftA:
if A1 and Φ and leader, call raftA.rollback_merge.
if A1 and B0 and leader, call raftB.commit_merge.

(3) peerB.commit_merge:
if B0 and Φ, nop.
if B0 and A0, peerA.catchup_logs.
if B0 and A1, set region B.
if B0 and A2, nop.
(A1, B1).
call extended_RaftB.

(4) extended_RaftB:
if B1 and Φ, remove peerB.
if B1 and A1, remove peerA.
(A2, B2).
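
To check the case analysis mechanically, here is a sketch that encodes the (A, B) states above as Rust enums and writes steps (3) and (4) as exhaustive matches. It mirrors my model only, not TiKV's code; the state and action names are mine.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum AState {
    Gone, // Φ: peer A not present
    A0,   // initial
    A1,   // KV stopped on peer A
    A2,   // peer A removed, region A gone
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum BState {
    Gone, // Φ
    B0,   // initial
    B1,   // region B enlarged
    B2,   // finished (or peer B removed in the abnormal case)
}

#[derive(Debug, PartialEq)]
enum Action {
    Nop,
    CatchupLogsOnA, // (A0, B0): let peer A catch up first
    SetRegionB,     // (A1, B0) -> (A1, B1)
    RemovePeerA,    // (A1, B1) -> (A2, B2)
    RemovePeerB,    // (Φ, B1): abnormal cleanup
}

// Step (3) peerB.commit_merge from the model above.
fn commit_merge(a: AState, b: BState) -> Action {
    match (b, a) {
        (BState::B0, AState::Gone) => Action::Nop,
        (BState::B0, AState::A0) => Action::CatchupLogsOnA,
        (BState::B0, AState::A1) => Action::SetRegionB,
        (BState::B0, AState::A2) => Action::Nop,
        _ => Action::Nop,
    }
}

// Step (4) extended_RaftB from the model above.
fn extended_raft_b(a: AState, b: BState) -> Action {
    match (b, a) {
        (BState::B1, AState::Gone) => Action::RemovePeerB,
        (BState::B1, AState::A1) => Action::RemovePeerA,
        _ => Action::Nop,
    }
}

fn main() {
    assert_eq!(commit_merge(AState::A1, BState::B0), Action::SetRegionB);
    assert_eq!(extended_raft_b(AState::A1, BState::B1), Action::RemovePeerA);
    assert_eq!(extended_raft_b(AState::Gone, BState::B1), Action::RemovePeerB);
}
```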