Diff math

Tuesday, May 26, 2015

This blog post offers a few hints about what to expect when you diff changesets (commits) in your version control. It is similar in spirit to the post we wrote months ago to explain the difference between 2-way and 3-way merge.

The diff function

Diff(9) actually means “diff with previous”, that is, Diff(8, 9). We’ll assume the Diff function takes the form Diff(src, dst).

Diff is not commutative

Swapping the arguments changes the result: Diff(9, 8) != Diff(8, 9)

Diff(8, 9) = changed foo.c, added bar.c

Diff(9, 8) = changed foo.c, deleted bar.c (it is in “9” but not in “8”, so it shows up as deleted)
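A minimal sketch of this asymmetry, assuming each changeset is modeled as a flat mapping from path to a content marker. The `diff` helper, the snapshot names and the `v1`/`v2` markers are all hypothetical and only serve the illustration; this is not how any particular version control stores its data.

```python
def diff(src, dst):
    """Describe how to go from snapshot src to snapshot dst."""
    changes = {}
    for path in sorted(src.keys() | dst.keys()):
        if path not in src:
            changes[path] = "added"      # only in dst
        elif path not in dst:
            changes[path] = "deleted"    # only in src
        elif src[path] != dst[path]:
            changes[path] = "changed"    # in both, different content
    return changes

# Hypothetical contents of changesets 8 and 9 (path -> content marker).
cset8 = {"foo.c": "v1"}
cset9 = {"foo.c": "v2", "bar.c": "v1"}

print(diff(cset8, cset9))  # {'bar.c': 'added', 'foo.c': 'changed'}
print(diff(cset9, cset8))  # {'bar.c': 'deleted', 'foo.c': 'changed'}
```

The same file shows up as added or deleted depending only on which changeset you pass as src.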

How diff really works

Let’s take a deeper look at what happens behind the scenes when diffing the changesets 8 and 9 in the previous example.

Diffing the changesets means diffing their associated trees. In most modern DVCS every changeset has an associated full source tree, starting at the root and going down to the leaf nodes (files, directories and so on). In the diagram below I attached a “revision number” to each node to show how a change to a file propagates to its containing directories, all the way up to the root directory. Instead of revision numbers they could be GUIDs; in fact, in Plastic they’re both, since we consider both GUIDs and revnos.

In the first figure above I said that in changeset 9 one file was modified (foo.c) and one was added (bar.c) and the figure below shows the two trees (before and after) containing these two changes:

The diff algorithm goes like this:

  • You start diffing in pairs from root (61, 65).
  • Then you find “doc” (8) didn’t change but there are changes inside “src” (60, 64).
  • Inside “src” the algorithm finds that:
    • foo.c has changed (59, 62)
    • boo.c didn’t change (32)
    • bar.c was added (wasn’t there before) -> that’s why if you Diff (9,8) you find it as deleted.
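The steps above can be sketched as a small recursive function. This is only an illustration under a few assumptions: the `Node` class and `diff_trees` are hypothetical names, every node carries a single revision number, and bar.c’s revision (63) is made up since the text doesn’t give it. A real implementation would also handle moves, renames and so on.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    rev: int                                      # revision number of this node
    children: dict = field(default_factory=dict)  # name -> Node, empty for files

def diff_trees(src, dst, path=""):
    """Yield (status, path) pairs describing how to go from src to dst."""
    if src.rev == dst.rev:
        return                      # same revision: skip the whole subtree
    if not src.children and not dst.children:
        yield ("changed", path)     # a file whose revision differs
        return
    for name in sorted(src.children.keys() | dst.children.keys()):
        child = f"{path}/{name}"
        if name not in src.children:
            yield ("added", child)
        elif name not in dst.children:
            yield ("deleted", child)
        else:
            yield from diff_trees(src.children[name], dst.children[name], child)

# Trees of changesets 8 and 9; bar.c's revision (63) is an assumption.
tree8 = Node(61, {"doc": Node(8),
                  "src": Node(60, {"foo.c": Node(59), "boo.c": Node(32)})})
tree9 = Node(65, {"doc": Node(8),
                  "src": Node(64, {"foo.c": Node(62), "boo.c": Node(32),
                                   "bar.c": Node(63)})})

print(list(diff_trees(tree8, tree9)))
print(list(diff_trees(tree9, tree8)))
```

Because “doc” has the same revision (8) on both sides, the sketch never descends into it. That is the point of propagating revisions up to the root: unchanged subtrees are skipped in one comparison.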

Diffing branches

What happens when you diff an entire branch? Basically it is equivalent to diffing changesets. In the example below diffing “feature-001” is the same as diffing changesets 9 and 12. Please note that in order to diff the changes in the branch you need to pick the “parent” cset of the branch, otherwise you’ll be missing the first change made on the branch.

Some hints:

  • Diff(feature-001) = Diff(9, 12)
  • Diff(feature-001) != Diff(10, 12) because you would miss the change made in “10”.
  • Diffs can be grouped: Diff(11) + Diff(12) = Diff(10, 12)
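Here is a rough sketch of these hints, assuming a diff is modeled as a patch (path -> new content, or `None` when the path is deleted). The helpers `diff`, `compose` and `apply_patch` and the snapshot contents are hypothetical. In this model “+” is just a dictionary merge where the later diff wins.

```python
def diff(src, dst):
    """Patch that turns src into dst: path -> new content, or None if deleted."""
    patch = {p: c for p, c in dst.items() if src.get(p) != c}
    patch.update({p: None for p in src if p not in dst})
    return patch

def compose(first, second):
    """Patch with the same effect as applying first and then second."""
    return {**first, **second}

def apply_patch(snapshot, patch):
    """Return a new snapshot with the patch applied."""
    result = dict(snapshot)
    for path, content in patch.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result

# Hypothetical snapshots: 9 is the parent cset, 10..12 live on feature-001.
s9 = {"foo.c": "v1"}
s10 = {"foo.c": "v2"}                  # first change made on the branch
s11 = {"foo.c": "v2", "bar.c": "v1"}
s12 = {"foo.c": "v2", "bar.c": "v2"}

grouped = compose(diff(s10, s11), diff(s11, s12))  # Diff(11) + Diff(12)
print(apply_patch(s10, grouped) == s12)            # grouping reaches cset 12
print("foo.c" in diff(s9, s12))                    # Diff(9, 12) sees the change made in 10
print("foo.c" in diff(s10, s12))                   # Diff(10, 12) misses it
```

Note that the grouped patch is guaranteed to have the same effect as Diff(10, 12), but not always the same entries: a change made in 11 and reverted in 12 would still be listed in the grouped patch.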

Diffing merges

To me this is where things get more interesting because sometimes it is not so clear what you’re getting when you diff a changeset where a merge happened.

In the example above, diffing the cset 15 (result of the merge between 13 and 14 having 9 as common ancestor) shows you all the changes done in “feature-001” + anything you might have changed *during* the merge (any manual changes you did while solving conflicts).

Now, a more interesting case: what if you Diff(14, 15)?

As the picture above says, you get changes done from the ancestor (9) to 13 + any potential changes done manually during the conflict resolution in the merge.

Hint: what if you Diff(13, 14)? You’re not seeing the changes done in the two branches, only how the two csets differ.

Since you’re not considering “9” in the diff, the result can be misleading: suppose you added “/doc/readme.txt” on “11”. Diff(13, 14) will show it as deleted, simply because it is not in “14”, although nobody ever deleted it.
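A tiny sketch of the readme example, assuming snapshots are plain path -> content mappings. The snapshot contents are made up; the only thing taken from the text is that “/doc/readme.txt” was added on 11, which is an ancestor of 13 but not of 14.

```python
# Hypothetical snapshots (path -> content marker).
s9 = {"/src/foo.c": "v1"}                              # common ancestor
s13 = {"/src/foo.c": "v1", "/doc/readme.txt": "v1"}    # contains the file added on 11
s14 = {"/src/foo.c": "v2"}                             # never had the readme

# Diff(13, 14): paths in src but not in dst are reported as deleted.
deleted_in_direct_diff = sorted(s13.keys() - s14.keys())

# Against the ancestor, the same file is an addition on one side
# and simply absent from the other side's changes.
added_9_to_13 = sorted(s13.keys() - s9.keys())
deleted_9_to_14 = sorted(s9.keys() - s14.keys())

print(deleted_in_direct_diff)  # ['/doc/readme.txt']
print(added_9_to_13)           # ['/doc/readme.txt']
print(deleted_9_to_14)         # []
```

Only the comparison against “9” tells you the file was added rather than deleted.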

Hint: how does a branch merge actually work? In the previous scenario, to merge 13 and 14 to create 15, the version control calculates the common ancestor of the two, which is “9” in this case, and then computes:

  • Diff(9, 13)
  • Diff(9, 14)

It then calculates the possible conflicts between the two diff collections. If there are no conflicts, it simply applies Diff(9, 14) to “main” to create “15”.
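The merge steps just described can be sketched as follows. This is a deliberately simplified model and not Plastic’s actual merge: the `diff` and `merge` helpers are hypothetical, snapshots are path -> content mappings, and a “conflict” here means both sides touched the same path with different results (a real merge would also try to merge inside the file).

```python
def diff(src, dst):
    """Patch that turns src into dst: path -> new content, or None if deleted."""
    patch = {p: c for p, c in dst.items() if src.get(p) != c}
    patch.update({p: None for p in src if p not in dst})
    return patch

def merge(ancestor, ours, theirs):
    """Return (merged snapshot, sorted list of conflicting paths)."""
    ours_patch = diff(ancestor, ours)        # e.g. Diff(9, 13)
    theirs_patch = diff(ancestor, theirs)    # e.g. Diff(9, 14)
    conflicts = sorted(p for p in ours_patch.keys() & theirs_patch.keys()
                       if ours_patch[p] != theirs_patch[p])
    merged = dict(ours)
    for path, content in theirs_patch.items():
        if path in conflicts:
            continue                         # left for manual resolution
        if content is None:
            merged.pop(path, None)
        else:
            merged[path] = content
    return merged, conflicts

# Hypothetical snapshots for csets 9 (ancestor), 13 (main) and 14 (branch).
s9 = {"/src/foo.c": "v1", "/src/boo.c": "v1"}
s13 = {"/src/foo.c": "v1", "/src/boo.c": "v1", "/doc/readme.txt": "v1"}
s14 = {"/src/foo.c": "v2", "/src/boo.c": "v1"}

s15, conflicts = merge(s9, s13, s14)
print(conflicts)  # []
print(s15)
```

With no conflicts, “15” ends up with the readme from main plus the foo.c change from the branch. If both sides had changed foo.c differently, its path would appear in `conflicts` and be left untouched in the result.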

Merging down

Another interesting case happens when we “merge down” from the parent branch to the task branch (as an example) like in the case below:

Diffing the branch “feature-001” (that is, Diff(9, 15)) will also show all the changes coming from the merge, but if you Diff(13, 15) you’ll get only the actual changes made on the branch (10+12+14), not “polluted” by the changes coming from the merge.
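A short sketch of the difference between the two diffs, again with made-up snapshot contents and a hypothetical `changed_paths` helper. It assumes 13 is the cset merged down from the parent branch and 15 is the result of the merge on the task branch, with no manual changes during the merge.

```python
def changed_paths(src, dst):
    """Paths whose presence or content differs between src and dst."""
    return sorted(p for p in src.keys() | dst.keys() if src.get(p) != dst.get(p))

# Hypothetical snapshots (path -> content marker).
s9 = {"/src/foo.c": "v1"}                                   # parent of the branch
s13 = {"/src/foo.c": "v1", "/doc/readme.txt": "v1"}         # parent branch, merged down
s15 = {"/src/foo.c": "v2", "/src/bar.c": "v1",
       "/doc/readme.txt": "v1"}                             # branch after the merge

print(changed_paths(s9, s15))   # ['/doc/readme.txt', '/src/bar.c', '/src/foo.c']
print(changed_paths(s13, s15))  # ['/src/bar.c', '/src/foo.c']
```

Diff(9, 15) lists the readme that came in through the merge, while Diff(13, 15) lists only what was done on the branch.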

As I said above, Diff(feature-001) doesn’t show you just the changes you did on the branch, since it also shows the changes coming from the merge 13->14. That’s why some version controls choose to “rebase” as follows:

It is also the reason why we improved our merge with what we call “item-merge-tracking”.

Finally, what if we continue working on the branch feature-001 after the merge as follows?

In this case Diff(13, 16) = changes on the branch + changes made during the merge (but not the changes from 11 and 13). This works as long as the merges come from the same branch; otherwise it is hard to tell whether a change comes from the branch or not. Remember to diff against the most recent merge source (13 here).

Conclusion

Keeping in mind how diff actually works (by diffing the changeset trees) helps you understand what you can get from non-trivial diff operations.

Diffing changesets is normally straightforward except when merging is involved, because then it becomes harder to understand what comes from where. In Plastic there are two aids for this. The first is the branch explorer itself, which clearly renders the evolution of the repo. The second is the new “item-merge-tracking” feature, which can show, for a given file, which changes come from the merge, which ones were made on the branch, and so on.
