Discuss: How we develop a feature

Recently from an offline discussion among our developers, we notice that it hurts quite a bit that we mix feature developing on master branch, which make it hard to bisect bugs and regression, and further deliver an unstable software.

The causes may be various from code coupling and feature break down strategy. Here is an umbrella topic that do a survey and open to discuss on how we develop a feature and collect best practice.

In this topic, you are suggested to

  • share a good case when you develop or observer others develop a feature, so that we can learn from it
  • throw a bad case you suffer or observer others suffer that hurts software stability or feature developing, so that we can do a diagnosis and later spawn a task to deal with it

You may take a look at the below post for an example of base case and how we deal with it.

At the discussion, @disksing indicated that we should push self-contained change sets into master as soon as possibly, should not accumulate a batch of mixed change sets and later pull request it to master which is hard to review and easy to conflict with other ongoing effort.

To support this point, he shared a case that when they develop standalone Go client refactor, they always push self-contained change sets to master and the feature crew of stale read who also touch client abstraction can catch up the change and keep align.

This is a good case. If we regard our feature development as a downstream and master branch is upstream, so that

  • we develop on the feature branch, and never push unstable change set to upstream since it would be rejected
  • we keep sync with upstream and resolve conflict reflected upstream updates
  • we push back self-contained change sets to upstream frequently and notify all crews work on the same code block, so that the conflict doesn’t occur later the development cycle and be too huge to resolve.

I want to share my experience during the development of CTE.

I chose to develop CTE on a separate branch, and test it on it. After testing it, we found about 30 bugs and recorded them in https://docs.google.com/spreadsheets/d/1gjRzvt4M19zQSzNfDx_PNMeyygY_vtlf5dx3PGQaOxU/edit#gid=0 .

I think there are three advantages compare to developing on the master branch.

  1. Quickly iteration.
    Everyone worked on part of CTE. After we finished our part, we quickly made it work(Run a CTE SQL and get the result) and tested it continually. When we found a bug during the testing, we recorded the bug in the document to track it and fixed it quickly. Finally, we got a well-tested branch and merged it into the master branch.
    If we developed CTE in the master branch. We needed to wait for PR reviewing、merging. These are very time-consuming.

  2. Fewer bug-fixed PRs/issues.
    Since it’s been tested carefully, the bugs are less than developed in the master branch.

  3. Early delivery.
    One of our customers wants to test it as soon as possible. We can deliver the demo to them after some basic tests. And, it’s very quick.

Thanks for starting this discussion. Here are two coins from a maintainer of TiDB projects:

  1. There is no silver bullet on the world, and no development pattern fitting all cases. There is only the most suitable pattern based on our current codebase and RoadMap. Patterns are alive, and keep evolving. It is a two-way doors that when we find a pattern is not well-fit, we can adopt another.

  2. I’d prefer the feature branch pattern to develop a new feature. That is, fork the repository, create a develop branch, push back code refactor and bugfix early to the upstream, and later a whole new feature change set, which may be divided into stages but each is self-contained.

This is how nearly all open source communities work so why do we violate it in consideration of that TiDB projects are open source projects? If you develop a new feature in the open-source community, you finish the whole work in the feature branch, push back required code refactor and bugfix that benefits the upstream early, create a series of pull requests implementing the new feature with test coverage, and finally, merge them atomically.

In this pattern, you possibly encounter challenges.

  • What if my new feature really affects a wide range of code?

    If the feature is well designed, I’d suggest to divide it into several stages. Each stage is self-contained and is able to deliver in the release. We create a feature branch for the first stage, follow the process above and get it merge to the upstream, and create another feature branch for the second stage, and so on.

  • I must merge code early to the upstream despite whether or not it is finished. Otherwise I suffer from unacceptable code conflict.

We are supposed to find the root cause why code conflict is so possible. It may be

  • systematic problems that blame to our architecture
  • code is coupled a lot garbage like unused code backlog

These root causes should be resolved separately.

  1. TiDB projects are cloud native. On cloud devops is the best practice, and continuous delivering is significant part of devops. Continuous delivering means the master branch is solid enough to be deployed in production environment. We are still far from this goal, but keep this goal in mind would help us make more progresses on:
  • testing every valid branch of our repository, including PRs.
  • empower online operation to upgrade / downgrade at any time, by any senior developer with our instruments.
  • the whole DBMS architecture is real resilient and robust, any feature is under control with options.
  • set a high level on code reviewing which guard our code quality; test cases and coverage are improved

To sum up, we are clearly working in an open-source community, and work as DevOps engineers.


Here I have another case of writing codes on a feature branch and merging it back (in Chaos Mesh).

Actually there are a lot of problems inside, not only the branch management and feature development, but also version management, product design… My workmates have all suffered a tiring month and I’m really regretful for the bad working experience and the difficulties I have brought.

Here is the final PR. chaos-mesh#1908. It aims to redesign the cron chaos experiments, split into multiple controllers and fix some historical bugs (which could be fixed naturally after this refractor). I don’t know whether it’s off-topic as it’s not only a feature, but also a refractoring.

Besides the advantages @wjhuang2016 has talked, I have faced a lot of problems:


Fast Iteration with cost

I don’t know how @wjhuang2016 managed to achieve both quickly iteration and well-tested, because even in the feature branch, we have to wait for PR to get reviewed and merged. In Chaos Mesh, I have to say these PRs in the feature branch are not well reviewed. A lot of PRs are reviewed by only one developer, to ensure we can move on as fast as possible. And some of these PRs are also not well tested (though we added some tests later).

Merging master is really hard

To make it not too difficult to merge feature branch back to the master, we have to merge master into the feature branch periodically. However, it could be difficult. While you are developing new features, there are also some more features merged into master and they are conflict with the feature branch. You don’t know how to test these features and don’t know how to fix them.

Feature branch is really not friendly for big refractoring, which may change the code structure, copy some codes to another place, etc… Is there any way to refractor progressively? However, refractoring is really needed, with the thought and understanding of developers get deeper. When Chaos Mesh booted, I and @cwen are all green hand on cloud and kubernetes, and we don’t know how to design a good structure or manage an extensible software on kubernetes. But every software should have a second chance (or third chance, forth chance) to remedy the old design mistakes, right?

Version Management and Compatibility

In this feature, compatibility will be broke. Someone told me “not to break compatibility” in middle version, but all versions are checked out directly from master. Therefore, this feature cannot be done before “1.2.0” release (which is the last version of Chaos Mesh 1), and should be merged in Chaos Mesh 2.0.

However, “1.2.0” is released at 23, April, and “2.0” should be released before June, which leaves us really short time to develop and test. I started to develop this feature at around 10, April.

Without rich time, every developers (and the tests and documents) are getting really tired and chaotic :cry:. After my developing, my teammates should do their following job like exporting HTTP API, modifying frontends… The time left for them is even much tighter and the work experience is not so well (regarding the lack of documents, tests and product designs). The schedule becomes out of control and we have to delay, and delay again.