How we do trunk-based development: answering frequent questions
We have reached the fourth installment of the series on how we implement trunk-based development at Codice today. Previously, we covered:
- How we changed our working cycle to move to trunk-based development.
- The difference between releasing and deploying.
- What exactly trunk-based development is and how it blends with task branches.
Today, we will answer some common questions that were not covered before. So far, we have only described the ideal cycle; we never explained what happens when tests fail, the various reasons why a task/branch can be rejected, what to do when a branch can't be merged, or how to handle broken builds. These are the topics we cover today.
Let's take another look at the figure describing the working cycle introduced in the first blogpost, just to use it as a starting point for the next topics.
Feedback before the task gets merged
It is important to note that if any phase fails, the task is reopened. The reviewer, the validator, a merge that is not automatic, code that doesn't build, unit tests that fail, or quick smoke tests that don't pass... any small issue can reopen and reject the task. That's why the "feedback loop" drawn in the figure above covers all the phases until the task is checked in and the merge confirmed.
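To make the loop concrete, here is a minimal sketch in Python of how such a gatekeeper could work. The Task type and the phase functions are illustrative stand-ins, not our bot's actual code:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    status: str = "open"

# Illustrative phase stubs; each returns False when its check fails.
def review(task): return True        # human code review
def validate(task): return True      # human validation of the behavior
def merge(task): return True         # automatic merge (may fail)
def build(task): return True         # compile the merged code
def unit_tests(task): return True    # fast unit tests
def quick_smokes(task): return True  # quick smoke tests

def process(task: Task) -> bool:
    for phase in (review, validate, merge, build, unit_tests, quick_smokes):
        if not phase(task):
            task.status = "reopened"  # any small issue rejects the task
            return False
    task.status = "merged"            # the checkin on main is confirmed
    return True
```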
What if a task can't be merged
If a task can't be merged because the merge is not automatic (despite SemanticMerge doing its best), the system rejects the task and asks the developer to rebase.
As the picture above shows: the merge was rejected, so you are asked to "merge down" and solve the conflict manually.
The task will go through review and validation again (these can be skipped if the developer considers the changes made during the merge trivial, although that can be a risky policy for some teams).
A branch could go through this cycle more than once, but since branches are meant to be really short, rebases shouldn't be frequent.
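For illustration, the rejection logic could be sketched as follows; automatic_merge and MergeResult are hypothetical names, not our mergebot's real API:

```python
from dataclasses import dataclass

@dataclass
class MergeResult:
    has_conflicts: bool

def automatic_merge(branch: str, main_head: int) -> MergeResult:
    # SemanticMerge tries first; conflicts mean a manual merge is needed.
    return MergeResult(has_conflicts=False)

def try_merge(branch: str, main_head: int) -> bool:
    if automatic_merge(branch, main_head).has_conflicts:
        # Reject the task and ask the developer to "merge down" (rebase).
        # After the manual merge, the task re-enters review and validation
        # (unless the merge changes are considered trivial).
        return False
    return True
```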
When are releases (official versions) triggered?
When you are in this situation, with a couple of branches that were already reviewed and validated and ready to be merged and tested:
The CI system takes Task1910, does the merge, runs the tests, and if everything goes fine, it does a checkin. The procedure repeats for Task1923 and you end up with the following:
Two new "builds" created on main are stable enough to be used as candidates to be deployed.
Your main branch will look as follows (now I remove the task branches for clarity):
Our system finds that 5 tasks were already merged, so it decides to take the last good one (BL745) and use it to run the release tests.
More precisely, the condition for triggering a release is as follows:
- If 5 tasks were already merged since last labelled build => launch release.
- If there are no more tasks to test => launch release (unless there are no new builds after the last labelled one).
- Otherwise, go to sleep.
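Expressed as code, the trigger logic looks roughly like this (a sketch; the function and parameter names are made up, not our scheduler's real API):

```python
TASKS_PER_RELEASE = 5

def should_launch_release(tasks_merged_since_label: int,
                          tasks_waiting: int,
                          builds_since_label: int) -> bool:
    if tasks_merged_since_label >= TASKS_PER_RELEASE:
        return True     # enough tasks accumulated since the last label
    if tasks_waiting == 0 and builds_since_label > 0:
        return True     # nothing left to test, ship what we have
    return False        # otherwise, go to sleep
```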
Release tests are run for changeset 119 (BL745) and if everything goes according to plan, in a little less than 2 hours, the main branch will look as follows:
Then, the new version BL745 is uploaded to our website (which includes uploading the Windows installers, the OS X installers, and the Linux packages for all the supported distros), where it waits, ready to be published once we finish our last manual validation.
As I write this, our CI bot switches between testing branches and creating new releases; it doesn't do the two in parallel. But there is no reason not to continue processing new branches while a new release is being created. In fact, our goal is to run both processes in parallel as soon as we can. When that happens, once BL745 is labelled, the situation in main will probably be as follows:
And the CI would then immediately start testing candidate changeset 132 (BL747) to create a new release.
What if the release tests don't pass? Broken builds
If an issue is detected while creating an official version (yes, I still tend to call it release) then the main branch is marked as broken. We enter broken build mode. The team priority is to get it fixed.
We don't reject tasks at this stage or "subtract" them; we simply create a new task to fix the issue, which will go through the regular process.
Nowadays, when this fix branch hits main, the buildmaster manually marks the build as fine again, so a release can be retried. The goal is to have an attribute on the branch so that the system knows that if the branch is merged correctly, it can restart the release cycle on its own.
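A sketch of how that could look, assuming a fixes_broken_build attribute on the task (the planned mechanism; today the buildmaster clears the flag by hand, and all names here are illustrative):

```python
main_is_broken = False

def launch_release() -> None:
    ...  # retry the release tests on the latest good build

def on_release_tests_failed() -> None:
    global main_is_broken
    main_is_broken = True    # broken build mode: fixing it is the priority

def on_task_merged(task: dict) -> None:
    global main_is_broken
    if main_is_broken and task.get("fixes_broken_build"):
        main_is_broken = False
        launch_release()     # the release cycle restarts automatically
```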
Initially, I was not very sure whether this "move forward" way of thinking was going to work. Previously, using "controlled integration", we tried to avoid broken builds at all costs. We would "subtractive merge" the branch causing problems in order to let the new version go forward. Now, we simply rely on super frequent releases.
Not every task passes the entire release test suite; only some do
What do I mean when I say "every new checkin is a potential version"?
Well, every branch passes part of our test suite, but not all of it.
This is because the entire test suite takes too long (over 2 hours), so the feedback to the developer would be too slow.
My belief is that in a perfect world every branch passes all tests, period. In fact, this is how we used to work long ago, at the very beginning, when the codebase was smaller and we had a lot less tests to pass.
But even if you have a great, super-fast test suite, chances are you will want to run stress tests, performance tests... or do some sort of manual check or validation of the new version before shipping. The first two (stress and performance) can still be fast if you have enough CPU/cloud power. But the human phase will always take time. If you don't need it at all, that's fine; but as soon as you have someone using the version for a while to ensure everything is ok (perception, design, color changes, icons... making sure it "feels right" is not very automatable), not every branch will go through the entire test suite.
As I said, I always thought the perfect thing would be to have every branch/task go through the entire test suite. But the DevOps Handbook opened my mind in that sense, since even top performers must deal with the same situation. For many, a layer of "human touch" is good to have, even if it takes time. I'm not talking about people doing what automated tests do, but about using their human-ish powers.
What does it mean? That every checkin in main is a "candidate", but it still has to pass the rest of the test suite, including the human tests. If all these additional tests pass, then the checkin is labelled and deployed to production.
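In code, the promotion of a candidate could be sketched as follows (all names are illustrative; the point is that labelling and deployment only happen after the remaining tests and the human validation pass):

```python
def run_remaining_suite(changeset: int) -> bool:
    ...  # the slow part of the suite the short check didn't run
    return True

def human_validation(changeset: int) -> bool:
    ...  # someone uses the version for a while: perception, design, feel
    return True

def label(changeset: int) -> None: ...   # e.g. attach the BLxxx label
def deploy(changeset: int) -> None: ...  # publish to the website

def promote_candidate(changeset: int) -> bool:
    if not run_remaining_suite(changeset):
        return False     # the candidate stays unlabelled
    if not human_validation(changeset):
        return False
    label(changeset)     # now it is an official version
    deploy(changeset)
    return True
```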
Side effect - "release" tests are faster
Our HAL bot was already running tests on each branch before, but since branches were not merged to main (they just stayed "ready to merge"), we had to re-run the entire test suite after the merge. Now every branch that is merged has already passed tests together with the previously merged ones, so we don't have to re-run the entire suite.
Here is a drawing to illustrate it:
We had two branches ready to be merged: both were reviewed, validated, and had passed a test suite (about 1 hour each). But they hadn't been tested together yet.
Then, we merge them to main to create the upcoming BL741 (please note that I marked changeset 89 as labelled, although in a different color, because the actual labelling would never happen before all tests pass).
Changeset 89 will have to pass the entire release test suite because branches 1910 and 1923 weren't tested together yet. This is not an issue when tests are fast, but it can be if they aren't.
Now, we have almost the same situation in terms of how the branch diagram looks but it all happened very differently:
The CI system (HAL) tries to merge task 1910. If the merge succeeds, prior to checkin, HAL runs the short test suite (as before). Then, if the tests pass, HAL checks the result in to main:
The new changeset 88 passed the tests. We even assign it a build number (BL741), something we previously only did for labelled changesets.
Now, task1923 is merged the same way: it is only tested once merged on top of main, so tasks are no longer tested independently. When we later take changeset 89 (after merging task1923 and passing the short suite) to create a release, all we need to run is the subset of tests not covered by the short check, not the entire test suite again.
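The saving can be expressed as simple set arithmetic (the suite category names are made up for illustration):

```python
FULL_SUITE = {"unit", "smoke", "integration", "gui", "performance"}
SHORT_SUITE = {"unit", "smoke"}   # run on every merge, before the checkin

def release_tests_for(changeset: int) -> set[str]:
    # Every changeset on main already passed SHORT_SUITE together with all
    # previously merged branches, so only the difference needs to run.
    return FULL_SUITE - SHORT_SUITE
```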