Discuss: Region Merge Mto1Solution

Region Merge Mto1Solution
I find a solution to simplify merge implement for raftstore, and extend the ability by merging M regions to 1 region. I present the solution, waiting for feedbacks, to see if the solution is valid.

Objective
Merge region A…regionM to region X, including peerA…peerM to peerX, peerA1…peerM1 to peerX1.

Model Concepts
(1) peerA.kv_function = false.
(2) peerB.kv_function = false.

(m) peerM.kv_function = false.
(m+1) if (!peerA.kv_function and … !peerM.kv_function) peerX.kv_function = true.
Pre-conditions:peerA open, … peerM open, peerX closed for kv functions.
Post-conditions: peerA closed, … peerM closed, peerX open for kv functions.

Implement steps
(1) client: Create region X for the union of region A…M. Align region A…M and X on nodes. Region X creates peerX, peerX1, kv_function closed, raft_function open, configued to merge region A…M.
(2) client: On A, propose raft function prepare_merging(X).
(3) On peerA, prepare_merging: self.kv_function = false. Call peerX.merge(A).
(4) On peerX, merge: prepared (peerA)=true. if ( prepared (peerA) and … prepared (peerM) ) self.kv_function = true.

(5) client: On M, propose raft function prepare_merging(X).
(6) On peerM, prepare_merging: self.kv_function = false. Call peerX.merge(M).
(7) On peerX, merge: prepared (peerM)=true. if ( prepared (peerA) and … prepared (peerM) ) self.kv_function = true.
(8) client: Wait for peerX open, then remove peerA…peerM.

Notes
At step (1), repeat this step to provide pre-conditions for merging when something wrong happens.
At step (2) and (5), client propose raft commands to region A…M. Repeat these steps when something wrong happens.
At step (3) and (6), peerA…peerM apply raft commands, closing self.kv_function, calling peerX.merge directly promising the model concept that peerX.kv_function must be open after peerA…peerM.kv_function are closed.

Thanks for starting this thread @ralph !

cc @BusyJay @yiwu-arbug you may be interested in this topic.

In fact, the algorithm TiKV currently use actually supports merge multiple regions into one. But we do not implement for multiple region (more than two), because we hope the logical and code is as simple as possible. If you create a region X which will overlap with A and M, it will increase the complexity of code because TiKV does not allow two region exist in the same instance, of which data is overlap with each other.

I think the easiest method is dividing region and raft-group, and allows multiple region in one raft-group. Then we can move and merge region with a normal distributed transaction rather than implementing another distributed transaction algorithm just for region-merge.

2 Likes

I guess what you really need is to speed up region-merge of the TiKV cluster. But I must state the fact that the current bottleneck is not the size of a single merge task.

When I start a cluster for my own test, I found that if I increase the PD config of max-snapshot-count, merge-schedule-limit,replica-schedule-limit and so on, the speed of region-merge will be so fast that TiKV can merge hundreds of thousands of empty region in one hour.

But we do not configure as the above in production environment because these config may cause TiKV move region two frequently and it will slow down user request. I think a better method to speed up region-merge is let PD do not limit the empty region schedule, not just for the configure merge-schedule-limit, but also for max-snapshot-count and store-limit. Because even if thousands of merge tasks for empty regions are executed at the same time, the impact on the cluster is small while the migration of tens of large region can cause a big compaction of RocksDB and a large number of IO request for disk.

Actually the algorithm of merge TiKV designs is quite simple and straightforward, it just stop one region from serving and then ask the other region to take over the range. The complexity comes from handling abnormal cases. For example:

  1. to get around different replication speed, we let commit merge to deliver logs on behalf of the source leader;
  2. to recover from adhoc membership change, we add additional checks on epoch;
  3. the most complicated part, to cleanup stale unmerged ranges after majority finish merge.

Merge one by one is simplest and also reduces a lot of corner cases. The more raft group involved, more steps in an algorithm, the easier to make mistake and lost in the maze.

1 Like

Thank Jay and Wallace. I want feedbacks judging the correctness of the model concepts, or criticizing the implement steps for corner cases forgotten.

I think it is needed to create another topic of “Discuss: Understand the Complexity of Region Merge on TiKV”, if we think merge implement is a complex part among TiKV functions.

The first I want to see the model concepts for the merge implement in TiKV. I want a model concept expressed in logics or mathematics, not in English. It may be not easy to present such a mathematical model, but it is worthy because merge is such a basic and important function on TiKV. I will try to work it out.

you may take a look at our TLA+ spec.

1 Like