Idea: Do not replicate most transaction commit requests through Raft

TiDB uses a two-phase commit protocol to implement distributed transactions. The first phase is called “prewrite” and the second phase is called “commit”.

For an async-commit transaction, the transaction is considered committed once all of its keys are prewritten successfully, even before any key is actually committed. For other distributed transactions that do not use async-commit, the primary lock being committed is by itself what marks the transaction as committed.
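As a rough illustration of these two commit points (the types and function below are invented stand-ins for illustration, not TiKV's actual API):

```rust
// Hypothetical illustration of when a transaction counts as committed,
// per the two rules described above.
enum TxnKind {
    AsyncCommit,
    Regular,
}

struct TxnState {
    kind: TxnKind,
    all_keys_prewritten: bool,
    primary_committed: bool,
}

fn is_committed(t: &TxnState) -> bool {
    match t.kind {
        // Async-commit: the outcome is already decided once every key is prewritten.
        TxnKind::AsyncCommit => t.all_keys_prewritten,
        // Regular 2PC: committing the primary lock is the commit point.
        TxnKind::Regular => t.primary_committed,
    }
}
```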

This means it is not that crucial to commit the secondary keys, or any keys at all in async-commit transactions. It is entirely possible to find out the status of these uncommitted locks in other ways, which hints that replicating the commit of these keys may not really be necessary.

Commit

The commit operation in TiKV transactions mostly exists to speed up reading, because a read request that encounters a lock does not know whether it should bypass that lock.

Reading typically happens on the leader, so we should still let TiDB send commit requests to the leader. On receiving the request, the leader can remove the lock and add a write record in its RocksDB. But the leader does not need to replicate it to the followers immediately, because it is useless on the followers in most cases.
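A minimal sketch of this leader-side path, assuming simple in-memory stand-ins for the lock CF and write CF rather than the real RocksDB/raftstore interfaces:

```rust
use std::collections::HashMap;

struct Lock {
    start_ts: u64,
}

struct WriteRecord {
    start_ts: u64,
    commit_ts: u64,
}

#[derive(Default)]
struct LocalKv {
    lock_cf: HashMap<Vec<u8>, Lock>,
    // Keyed by (key, commit_ts), mirroring the write CF layout loosely.
    write_cf: HashMap<(Vec<u8>, u64), WriteRecord>,
}

impl LocalKv {
    /// Commit a key on the leader without waiting for Raft replication.
    /// Followers learn about the commit later, in batch (see the proposal below).
    fn commit_locally(&mut self, key: &[u8], commit_ts: u64) -> bool {
        match self.lock_cf.remove(key) {
            Some(lock) => {
                self.write_cf.insert(
                    (key.to_vec(), commit_ts),
                    WriteRecord {
                        start_ts: lock.start_ts,
                        commit_ts,
                    },
                );
                true // readers on the leader are no longer blocked by this lock
            }
            None => false, // already committed or rolled back
        }
    }
}
```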

But the leadership can change, so we still need to resolve the locks on the followers. The related handling is not yet clear; we can discuss the solution.

My proposal is to only send the start TS and commit TS to the followers. The sending can be done in batches: for example, collect the information of committed transactions and send it to the other peers at 1s intervals. The followers run background tasks that scan the whole lock CF and look for committed transactions according to the information from the leader.

In this way, there are several advantages:

  • Committing on the leader becomes faster because it doesn't need to wait for Raft replication, so reading requests are blocked for a shorter time.
  • Flow through Raft replication is largely reduced, which makes the raftstore run faster.
  • Resolving locks in batch on the follower may be more efficient.
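Below is a rough sketch of the batched commit-info exchange described above. The message and storage types are invented for illustration and are not actual TiKV or raftstore APIs.

```rust
use std::collections::HashMap;

/// Periodically (e.g. every 1s) the leader sends the (start_ts, commit_ts)
/// pairs of recently committed transactions to its followers.
#[derive(Default)]
struct CommitInfoBatch {
    committed: Vec<(u64 /* start_ts */, u64 /* commit_ts */)>,
}

struct Lock {
    start_ts: u64,
}

#[derive(Default)]
struct FollowerStore {
    lock_cf: HashMap<Vec<u8>, Lock>,
    write_cf: HashMap<(Vec<u8>, u64), u64>, // (key, commit_ts) -> start_ts
}

impl FollowerStore {
    /// Background task on the follower: scan the whole lock CF and resolve
    /// any lock whose start_ts appears in the batch sent by the leader.
    fn resolve_from_batch(&mut self, batch: &CommitInfoBatch) {
        let commit_ts_of: HashMap<u64, u64> = batch.committed.iter().cloned().collect();

        // First pass: find the locks that the leader says are committed.
        let resolvable: Vec<Vec<u8>> = self
            .lock_cf
            .iter()
            .filter(|(_, lock)| commit_ts_of.contains_key(&lock.start_ts))
            .map(|(key, _)| key.clone())
            .collect();

        // Second pass: turn each resolvable lock into a write record.
        for key in resolvable {
            let lock = self.lock_cf.remove(&key).unwrap();
            let commit_ts = commit_ts_of[&lock.start_ts];
            self.write_cf.insert((key, commit_ts), lock.start_ts);
        }
    }
}
```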

Prewrite/AcquirePessimisticLock

The KV RocksDB will be inconsistent between the leader and the follower: when the leader has cleared its lock on a key, the lock may still exist on the follower. This brings a problem: when a subsequent prewrite or pessimistic locking request happens on the leader, the replicated mutation on the lock may overwrite the still-unresolved lock on the follower.

So we need to modify the apply procedure for the lock CF on the follower: if a lock already exists when a new lock is written, we additionally add a write record for it. The commit TS of the previous lock can be passed from the leader.

If such an overwrite happens, we also save some resources by turning a “delete” plus “write” on the lock into a single overwrite.
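A sketch of the adjusted apply path, again with hypothetical types: when a replicated lock lands on a key that still carries a stale lock, the follower also records the previous transaction’s commit using the commit TS passed from the leader.

```rust
use std::collections::HashMap;

struct Lock {
    start_ts: u64,
}

#[derive(Default)]
struct FollowerStore {
    lock_cf: HashMap<Vec<u8>, Lock>,
    write_cf: HashMap<(Vec<u8>, u64), u64>, // (key, commit_ts) -> start_ts
}

impl FollowerStore {
    /// Apply a replicated prewrite / pessimistic-lock mutation.
    /// `prev_commit_ts` is the commit TS of the lock previously on this key,
    /// carried in the mutation when the leader knows that lock was committed.
    fn apply_lock(&mut self, key: &[u8], new_lock: Lock, prev_commit_ts: Option<u64>) {
        // Overwriting the stale lock replaces the usual "delete lock + write lock"
        // pair with a single write.
        if let Some(old) = self.lock_cf.insert(key.to_vec(), new_lock) {
            if let Some(commit_ts) = prev_commit_ts {
                // Preserve the commit of the previous transaction on this key.
                self.write_cf.insert((key.to_vec(), commit_ts), old.start_ts);
            }
        }
    }
}
```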

Flow through Raft replication is largely reduced.

I doubt whether that’s really true, as commits only contain keys.

I do think caching the commit TS on the leader side should bring an improvement for reads.

If the table is not wide, the keys themselves can produce a non-negligible amount of flow. For example, in a TPC-C workload, not replicating pessimistic locks reduces Raft flow by 20%.

And because of the small-value optimization, commits can also contain values, which adds write flow beyond just the keys.

If we break the consistency between leader and follower, then the leader cannot share the compaction result with the followers.

But if we maintain an in-memory map from TS to keys, we can reduce the size of the Raft log.

Then the commit Raft log entry only needs to contain a TS, and we can quickly find the keys with the in-memory map.
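A sketch of that idea, with invented names: the commit Raft log entry carries only the commit TS, and the concrete keys are recovered from an in-memory map that both leader and followers can build from the earlier prewrite entries.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct CommitIndex {
    // commit_ts -> keys committed at that TS
    keys_by_commit_ts: HashMap<u64, Vec<Vec<u8>>>,
}

impl CommitIndex {
    /// Record the keys of a transaction when its commit is decided.
    fn record(&mut self, commit_ts: u64, keys: Vec<Vec<u8>>) {
        self.keys_by_commit_ts
            .entry(commit_ts)
            .or_default()
            .extend(keys);
    }

    /// Expand a commit Raft log entry that only contains a TS back into the
    /// concrete keys to resolve, then drop the map entry.
    fn take_keys(&mut self, commit_ts: u64) -> Vec<Vec<u8>> {
        self.keys_by_commit_ts.remove(&commit_ts).unwrap_or_default()
    }
}
```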

We can still keep the consistency property.

Agree. Sounds better.