Merge recursive strategy

Tuesday, September 27, 2011 10 Comments

You must have heard about “git merge recursive strategy” which is the default algorithm that git uses when merging two branches.

How does it work and why is it good?

UPDATE: If you're interested in merge recursive you will probably enjoy this one too.

UPDATE 2013/10/02: Screencast available at http://youtu.be/Lg1igaCtAck

UPDATE 2014/03/35: If you arrived here from the mercurial mailing list, then it is worth you check my replies to Matt. He was quite rude with the comments and I replied and wrote a follow up blog post where we not only explain why we do know quite a bit about merging, but also why our merge system is better than Mercurial's.

Basics – elements on a merge

You've got two branches you want to merge. The basic "players" to consider are:

  • The "source": the changeset you're merging from. Changeset 16 in the example below.
  • The "destination": the changeset you're merging to. Changeset 15 in the example below.
  • The "ancestor": the changeset (or commit) which is the nearest parent of the "source" and the "destination". This is Changeset 10 in the example below.

So, we will merge 16 and 15 using 10 as ancestor. We DO need an ancestor to be used in "three-way" merges (more about it here).

When is "merge recursive" needed?

But, sometimes the scenario is not that simple. What if we find "two common ancestors"? The picture below shows how in this example, we have two possible "common ancestors".

Please note: the example is a little bit forced since there's not a good reason (initially) for the developer merging from changeset 11 into 16 instead of merging from changeset 15 (the latest from the branch main at the point of the merge). But let's assume he has to do it for a reason (For instance, changeset 11 was stable and 13 and 15 weren't at the time). The point is: between 15 and 16 there's not a single unique ancestor, but rather, two ancestors at the same "distance": 12 and 11.

While this won't happen daily, it is really likely to happen with long lived branches or complex branch topologies. (The case depicted above is the shortest one driving to the "multiple ancestor" problem, but it can happen too with several changesets and branches in between the "crossed" merges).

One solution is to "select" one of the ancestors as the valid one for the merge (which is the option Mercurial takes) but as we will see below, it has many drawbacks.

How "merge recursive" works?

When more than one valid ancestor is found, the "recursive merge" strategy will create a new unique "virtual ancestor" merging the ones initially found. The following image depicts the algorithm:

A new ancestor 2 will be used as "ancestor" to merge the "src" and "dst".

The "merge recursive strategy" is able to find a better solution than just "selecting one of the two" as I'll describe below.

Why merge recursive is better – a step by step example

Let me use the following "notation" in the next example: a file foo.c with three lines like these:

b
c
d

It will be described as: /foo.c = bcd.

Of course, for the sake of simplicity I'll be using "stupid lines" like "abc" but assume the example is valid for real code too.

Let's take a look at the following case in the diagram below:

I’ll try to describe it changeset by changeset:

  • 0. We had a file foo.c with content foo=bcd (three lines, first is b, second is c and third is d)
  • 1. We edit foo.c on a branch and add a new line so it ends up being foo=bcde
  • 2. We modify the second line of the file on main. Now the file is foo=bCd
  • 3. We create a new branch from changeset 2 and add a new line at the beginning so it is now foo.c=abCd
  • 4. Going back to task001, we modify the line we just added: foo.c=bcdE
  • 5. We "undo" the change we just did on main: foo.c=bcd
  • 6. We merge 4 and 3 and create 6 as foo.c=abCdE. (We combine the changes we made on task002 (adding a new line at the beginning) with the ones in task001 (adding E at the end) and also the change coming from 2 on /main.)
  • 7. We now merge 4 and 5 introducing the change from 4 (last line added) into main: foo.c=bcdE

If we now merge task002 on main (changeset 7 and changeset 6) what should we get?

We should get the "addition" on task002 (a new line a at the beginning) on top of 7.

The expected result is: foo.c=abcdE.

We shouldn't get the line with the uppercase C as it is on task002 because we fixed it afterwards on main.

As you can see in the diagram, I highlighted the changesets 4 and 2 because they're the two possible common ancestors from 6 and 7.

Which one should we choose?

Mercurial will choose 4 because its algorithm chooses the "deepest" ancestor in the case were more than one is found.

What happens if we choose 4?

We will merge 4=bcdE, 6=abCdE and 7=abcdE and the automatic result (by any 3-way merge tool will be):

  • First line -> a (it's there on 6 and 7)
  • Second line -> b (it's there on the three contributors)
  • Third line -> C (changed on 6 but unchanged on 4 and 7)
  • Fourth line -> d (unchanged)
  • Fifth line -> E (unchanged)

So, we automatically get foo=abCdE which is WRONG!!

We took C instead of c due to the wrong ancestor selection.

How recursive merge fixes the mess?

As I described above, the first thing "recursive merge" is going to do is to calculate a new "virtual ancestor" merging 4 and 2 as the following picture shows:

The result changeset X is foo=bCdE.

Later, changeset X is used as the "ancestor" of 6 and 7 and then we get:

  • Ancestor: foo=bCdE
  • Source: foo=abCdE
  • Destination: foo=bcdE
  • Result: foo=abcdE, which is what we were looking for!

The calculated result takes into account the "fix" done in changeset 5 and therefore the result is correct!

Why it is so good?

If you have to deal with branching and merging and you don't have a good merge algorithm, you can end up with broken files without warning!

In short: Git will do it correctly, Hg will break the result, and SVN and others will simply mess up the whole thing.

In Plastic SCM 4.0 we've also implemented a "merge recursive" algorithm, so we are able to produce the same result. (In fact, our algorithm is even more powerful, correctly handling cases that even Git is unable to deal with successfully).

More cases

I focused on "file conflicts" today but it is easy to come up with examples affecting directory structures too. I'll cover it in a follow up blog post.

Wrapping up

Branching and merging are the two weapons you must have in your developer's toolset... but ensure you have the best possible ones, the ones that really do the job.

We develop Plastic SCM, a version control that excels in branching and merging, can deal with huge projects and big binary assets natively, and it comes with GUIs and tools to make everything simpler.

If you want to give it a try, download it from here.

We are also the developers of SemanticMerge, and the gmaster Git client.

10 comments:

  1. what happens if the merge of the 2 common ancestors results in a conflict?

    ReplyDelete
  2. @Sheldon: Git uses the conflicted merge result as a merge base, including conflict markers. For example, if we have bXd in one common ancestor and bYd in the other, the conflicted merge result would be bd, where each character represents a line in a line-by-line merge, and <|> represent conflict markers. This merge base will force a conflict in the final merge as well. And in case of a conflict, it does not really matter what the merge base is. The result only depends on the tips of the branches. For example, if one tip contains bxd and the other byd, the result is bd.

    Note that either branch had to resolve the same conflict before, since they merged the common ancestors at some point. If they happen to resolve it in the same way, the tips are the same and there is no conflict.

    ReplyDelete
  3. Hi, Pablo!

    I’m the web editor at iMasters, one of the largest developer communities in Brazil.

    I´d like to talk to you about republishing your article at our site.

    Can you contact me at rina.noronha@imasters.com.br?

    Bests,
    Rina Noronha
    Journalist – web editor
    www.imasters.com.br
    redacao@imasters.com.br
    rina.noronha@imasters.com.br
    +55 27 3327-0320 / +55 27 9973-0700

    ReplyDelete
  4. Hi, Pablo!

    I’m the web editor at iMasters, one of the largest developer communities in Brazil. I´d like to talk to you about republishing your article at our site.

    Can you contact me at rina.noronha@imasters.com.br?

    ReplyDelete
  5. @Sheldon. I was about to ask the exact same question.

    How does this algorithm account for conflicts when creating this virtual ancestor?

    ReplyDelete
  6. Ok, if there's a conflict on the intermediate merge:

    1- If the conflict is a directory -> the solution is to get the status it has on the base (I mean, a directory conflict: moves, renames, deletes, whatever). It means no intermediate resolution will happen.

    2- If the conflict is a file -> all the intermediate conflicts will be prompted to the user. Git tries to do the same, but Plastic does it correctly :P

    ReplyDelete
  7. See this post by Matt Mackall for some critique:

    http://article.gmane.org/gmane.comp.version-control.mercurial.general/29523

    ReplyDelete
  8. It looks like some of your assumptions about Mercurial are mistaken. There was a discussion exactly about your post that explains better the Mercurial's behavior in this topic. Here is the link: http://selenic.com/pipermail/mercurial/2012-January/041456.html

    ReplyDelete
  9. I'll be replying to the mercurial thread shortly.

    The assumptions are not mistaken: they're considering "distance" differently, but the result is the same: Hg merge fails.

    But, I'll be replying as soon as I get access to the mercurial list and a good way to answer! :P

    ReplyDelete
  10. I also find the arrows that go back and create cycles in your DAG to be confusing. Could you consider editing the figure that shows perhaps dotted lines without directional arrowheads going back to their parents, so that we can see merge parents without thinking that we have endless cycles in what was supposed to be a directed acyclic graph?

    Warren

    ReplyDelete