Why does checkin ask to update first?

Friday, February 02, 2018

Mike is working on the main branch and has some changed files that are ready to checkin. Bob and Joe are also working on main. When Mike is about to checkin, Plastic asks him to update his workspace first. Mike goes nuts. He didn't have to do this with Perforce. Why is Plastic forcing him to work this way?

Short answer: Mike, if you want to work this way (SVN/Perforce style), please use Plastic Gluon :-P. It comes with Plastic at no extra cost, but supports a different workflow. It is perfect for working with non-mergeable files (art, docs, blog posts...).

Long answer: it is all about merge-tracking. If you want to look under the hood of version control then keep reading. I will explain why Git, Mercurial and Plastic SCM implement per-changeset merge tracking instead of per-file merge tracking like older systems used to do (P4, Clearcase, SVN).

Intro to merge tracking

Take a look at the following Branch Explorer diagram:

The green curved arrow is a "merge link" and it means a merge happened between branch feature101 and main. The merge created the changeset 31.

But the arrow is not only there to tell us that a merge happened. The arrow is metadata that the version control uses to calculate future merges.

Let's see how this works by making more changes in the repo. I create a new branch task127 as follows:

And then I want to merge task127 back into main. It means changesets 35 and 32 will be merged to create the result in main. Something like the following:

To calculate the possible file (and directory!) conflicts between 32 and 35, Plastic (or any worth-using version control) calculates the nearest common ancestor between the 2 changesets. To do that, the version control walks back the graph to find a common commit.

In this case, the result is quite simple to find - changeset 30.

This is because while walking back changeset 32, Plastic will go to 31, then it will fork and go back to the parent of 31 but also to the source of the merge link - 30. And 30 is the nearest common ancestor between 35 and 32.
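The walk described above can be sketched in a few lines of Python. This is a minimal illustration, not Plastic's actual algorithm: the graph shape and every changeset number other than 30, 31, 32 and 35 are assumptions made up for the example (27 stands in for the changeset labeled V1412), and a merge link is simply modeled as an extra parent.

```python
def ancestors(parents, cs):
    """Return cs plus every changeset reachable by walking back
    through parent links and merge links."""
    seen, stack = set(), [cs]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def nearest_common_ancestors(parents, a, b):
    """Common ancestors of a and b that have no other common
    ancestor as a descendant."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return sorted(
        c for c in common
        if not any(d != c and c in ancestors(parents, d) for d in common)
    )


# Hypothetical graph: each changeset maps to the changesets it came from.
# 31 has two sources: its parent on main (29) and the merge link from 30.
with_link = {
    27: [],                            # main, plays the role of V1412
    28: [27], 30: [28],                # feature101
    29: [27], 31: [29, 30], 32: [31],  # main
    33: [30], 35: [33],                # task127
}
# Same history, but the merge link 30 -> 31 was never recorded.
without_link = {**with_link, 31: [29]}

print(nearest_common_ancestors(with_link, 32, 35))     # [30]
print(nearest_common_ancestors(without_link, 32, 35))  # [27]
```

With the link in place the walk stops at 30. Without it, the only shared changeset is the older one, so everything merged from feature101 would have to be compared again.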

What does it mean in non-graph-geek-terms?

Well, that you will only have to solve the conflicts that happened "after" 30 was merged. Any conflicts you already solved while merging feature101 stay solved and will not be raised again.

If the merge link didn't exist, Plastic would go back to the changeset labeled as V1412... and you would need to resolve the same conflicts again and again. Painful!

If you want to go deeper on merge tracking and 2-way versus 3-way merge, maybe you will find this blog post useful.

Per-file vs per-changeset merge tracking

All the diagrams above show per-changeset merge tracking. It means that the merge links exist between changesets. This is how Plastic SCM, Git and Mercurial work. In fact, all distributed version control systems (DVCS) work this way.

Per-file merge tracking means that the merge links are set between revisions of files, instead of changesets. In this model, each file has its own version tree, each with its own different diagram. We forget about the global status and merge links between changesets. Every single file has its own tree and merge arrows are set between revisions of the same file. This is how Perforce works, but also how Clearcase worked, and most other older generation version control systems.

In the figure below, I will explain how the 2 different types of merge tracking work:

I illustrated the Branch Explorer with actions to understand what happened to a couple of files: foo.c and bar.c over time.

Then, I drew the version tree of each file individually as they would exist in per-file merge tracking (Note: Plastic doesn't keep this second structure, just the first one with links between changesets).

To make mapping easier, I rendered the version trees with "revision numbers" that are the same as their changesets. But it is important to note that the numbers in the revision trees are not changesets but revision numbers; they are normally not global, just unique within each file's history.

As you can see, while there is only one "branch explorer" for the repository, there are 2 very different version trees for foo.c and bar.c.

What will happen now if I try to merge task127 into main?

Well, in per-changeset merge tracking there is only one tree to walk, the changeset tree of the repository.

But in per-file merge tracking, the version control has to walk a tree for each individual file.

In the example, I only modified foo.c in feature101 so bar.c doesn't even have changes there. That's why the trees are so different.

Key differences between per-file and per-changeset

Speed: if you merge 1000 files, per-file will need to walk 1000 version trees, one for each file. It doesn't scale. That's why old version control systems normally prevented developers from using too many branches.

Merge all or nothing: with per-changeset, when I merge a branch I merge all the files changed on the branch... or none, but I can't simply merge a few. This is because the link is set per-changeset, so there's no way to say "hey, only a few were merged, keep the rest for a future merge". With per-file you can do that. When you merge task127 to main you can decide to merge only foo.c, and bar.c will be proposed again if you try to merge the branch again. Flexibility comes at a cost - performance and much higher complexity.
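To make the "all or nothing" point concrete, here is a small sketch of what each model is able to record. The function names and the data layout are hypothetical, chosen only for illustration; they do not reflect how Plastic or Perforce store merge links internally.

```python
def pending_per_file(changed_on_branch, merged_files):
    """Per-file tracking: every file carries its own merge link, so the
    files still to merge are the ones without a link."""
    return sorted(set(changed_on_branch) - set(merged_files))


def pending_per_changeset(changed_on_branch, link_exists):
    """Per-changeset tracking: one link covers the whole changeset, so
    either everything counts as merged or nothing does."""
    return [] if link_exists else sorted(changed_on_branch)


changed_on_task127 = ["foo.c", "bar.c"]

# Per-file: merge only foo.c now, bar.c is proposed again later.
print(pending_per_file(changed_on_task127, ["foo.c"]))   # ['bar.c']

# Per-changeset: once the link is set, nothing is left to propose, so
# setting it after merging only foo.c would silently drop bar.c.
print(pending_per_changeset(changed_on_task127, True))   # []
print(pending_per_changeset(changed_on_task127, False))  # ['bar.c', 'foo.c']
```

A single boolean per merge has no room for "foo.c yes, bar.c later", which is why the per-changeset model refuses partial merges.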

Going back to a single branch

Hope you are still here with me :-).

Remember the original story - Mike is working on main, and he just wants to checkin a few files.

Mike modified boo.c, and he just wants to checkin, but when he tries to do it, Plastic asks him to update to changeset 104 first, and then to download the new bar.c and foo.c.

Why? Why can't he just checkin without updating?

If Mike updates prior to checkin, his workspace will be as follows:

He will be "in sync" with changeset 105.

If he simply checks in boo.c without updating, his workspace won't be in 105, it will be a mix of things - bar.c and foo.c will be taken from 102, but boo.c will be from 105.

Now suppose Mike wants to merge a branch. Remember my entire merge tracking explanation? How can per-changeset merge tracking work if some files don't belong to the changeset? How will the merge link be set?

Obviously, Mike can't set branch task1113 as merged taking the wrong contents of bar.c and foo.c (Mike's workspace is not loading the changes made in 103 and 104). Mike might now think "argh, why can't I just merge boo.c and forget about the ones I can't load?" And yes, this is something doable with per-file merge tracking. But how can a merge link between changesets reflect that "only a few files were merged"? It can't. That's why the merge all or nothing happens in per-changeset merge tracking.

Changeset "coherence"

Plastic enforces "changeset coherence" working mode. The workspace must be in sync with a changeset, but not with parts of one and parts of another.
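A coherence check can be sketched as follows, using Mike's workspace from the story above. Everything in this snippet is an assumption made for illustration: revisions are identified by the changeset that created them, all files are assumed to start at a made-up changeset 100, and 104 is assumed to be the changeset that modified foo.c.

```python
# Hypothetical snapshots: path -> changeset in which that revision was created.
snapshots = {
    102: {"foo.c": 100, "bar.c": 100, "boo.c": 100},
    103: {"foo.c": 100, "bar.c": 103, "boo.c": 100},
    104: {"foo.c": 104, "bar.c": 103, "boo.c": 100},
    105: {"foo.c": 104, "bar.c": 103, "boo.c": 105},
}


def synced_changeset(workspace, snapshots):
    """Return the changeset whose snapshot matches the workspace exactly,
    or None when the workspace mixes revisions from several changesets."""
    for number, tree in snapshots.items():
        if tree == workspace:
            return number
    return None


# Mike updated first, then checked in boo.c.
updated = {"foo.c": 104, "bar.c": 103, "boo.c": 105}
# Mike checked in boo.c without updating: old foo.c and bar.c, new boo.c.
mixed = {"foo.c": 100, "bar.c": 100, "boo.c": 105}

print(synced_changeset(updated, snapshots))  # 105
print(synced_changeset(mixed, snapshots))    # None
```

The mixed workspace matches no changeset at all, so there is no single point in history that a merge link could start from.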

In comparison, Plastic Gluon is perfectly fine with breaking "coherence", because the key restriction in Gluon workspaces is "you can't merge".

Coherence doesn't leave files behind

Just as you can't merge with a workspace that is not "synced to a changeset", you can't use it to switch to a different branch, because many files could be out-of-sync with the supposedly loaded revisions.

What if Plastic allows Mike to create a new branch and switch to it at this point, without updating to latest? Should the branch be created from 102 or 105?

In case it is created from 105, what if he modifies bar.c again? He won't be allowed to checkin because he would be losing the changes made in 103.

If the branch starts from 102, then boo.c would be overriding the content of the file that was really loaded in 102.

There is an option in Plastic to actually let you switch branches with files out-of-sync (locally changed), but then it won't allow you to checkin if the revision of the file doesn't match the one that should be actually loaded. For example, if you branch from 102 but boo.c is modified and is not starting from the revision loaded in 102, then it will be rejected on checkin. This is all for protection purposes, to avoid breaking merge-tracking and overall history.

By default, Plastic simply disallows switching branches when there are pending changes.

Repository snapshots and consistency

I will focus once again on the status of Mike's workspace after checking in just boo.c without updating his workspace (remember, something doable with Gluon but not with "classic Plastic").

Mike just checked in boo.c, but he certainly built his code with his "older" bar.c, foo.c in combination with the newer boo. And it worked.

What if someone simply updates to 105 now? Will the code in 105 build if the boo.c there was never actually built (nor tested) with the newer bar.c and foo.c? Maybe it does, but most likely it won't. See my point?

This is the fundamental flaw with the per-file way of working. It doesn't enforce having a consistent global status where every changeset is a snapshot of the project at a given point in time, known to build and pass tests.

On the contrary, per-changeset (with per-changeset merge tracking) enforces that every new checkin, every new created changeset, records a real status of the project, something stable, consistent, something you can go back and know it "existed".

How could you reproduce Mike's status later on? It is not as easy as "go back to changeset 102". You need a per-file configuration to exactly load bar.c and foo.c from one changeset and boo.c from another.

I could go on and on about this and explain that while labels in per-changeset are just pointers to changesets, in per-file, a label is a collection of pointers to individual revisions of files. Thus, labelling a 100k file repo in per-changeset is always constant time, while in per-file a new "record" must be created for every single loaded file.
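The cost difference between the two kinds of label can be shown with a toy model. The label name, the changeset number and the file layout are all made up for this sketch; it only counts how many records each approach has to write.

```python
def label_per_changeset(labels, name, changeset):
    """A label is a single pointer, whatever the size of the repo."""
    labels[name] = changeset
    return 1  # records written


def label_per_file(labels, name, loaded_revisions):
    """A label stores one pointer per loaded file revision."""
    labels[name] = dict(loaded_revisions)
    return len(loaded_revisions)  # records written


# Hypothetical 100k-file workspace, every file at revision 1.
loaded = {f"src/file{i}.c": 1 for i in range(100_000)}

print(label_per_changeset({}, "release-1", 105))  # 1
print(label_per_file({}, "release-1", loaded))    # 100000
```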

Remark: Gluon allows you to break consistency, which is good for non-mergeable code and for files that evolve quite independently of others like docs, binary assets and the like.

Plastic was per-file long ago

Up to version 3.0, Plastic SCM used a per-file strategy. It was very good with branching and merging (not as good as it is now, of course), it was distributed, and it also had great GUIs.

Check this old 3.0 branch explorer. Despite the obvious style changes, take a closer look at the labels, branches and merge links:

Do you see how the labels are not attached to a given changeset but placed somehow "after" a changeset? This is because the Branch Explorer struggled to render a global view of the repo, because in reality, there was not a real global history, but a bunch of histories coming from all the files and directories in the repo.

Do you realize that branches don't have a line to their "starting points"? Weird, isn't it? This is because there wasn't one. Branches loaded configurations and started from a given point, but that point was not easily rendered since it depended on each modified file in the branch.

The same thing happens for merge links. Do you see how in /main/Feature10_Paul the links just start from in between changesets? Again, this is because the links try to represent somehow a global history that didn't exist, because links were set on a per-file basis.

The following image shows the, now defunct, 3D version tree. It used to be the key visualization, long before the Branch Explorer:

We got rid of per-file versioning long ago, and favored:

  • Faster branching.
  • Much faster merging.
  • Comprehensive branching (with clear starting points and links).
  • Much stronger merge resolution (check an in-depth explanation of all possible supported merge cases here).
  • Coherence: changesets capturing real snapshots.
  • Much more solid replicas, key for DVCS.
  • Full interoperability with Git.

And then created Plastic Gluon to break the consistency rules when working on a single branch, without merging.

Conflicts on main – why do new files come to my workspace?

Let's go back to Mike and his adventures with checkins to main. Consider this:

Mike was working on foo.c but meanwhile other team members made a couple of new checkins.

Previously, Mike was able to simply update his workspace prior to checkin. But now foo.c will collide with the change made in 113.

When Mike tries to checkin, Plastic will ask him to merge.

When he merges 113 to his workspace in 111, Mike will solve the conflicts in foo.c, but bar.c will also show up as modified on his workspace as a result of the merge.

This is very confusing for many users. Why on earth does bar.c come to my workspace if I just want to checkin foo.c?

Again, it is all because of per-changeset merge tracking. This is what will happen once Mike checks in the resulting merge:

The merge link between 113 and 114 can't be set if not all files are merged.
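The reason bar.c shows up can be seen in a file-level three-way comparison between the common ancestor (111), the merge source (113) and Mike's workspace. The sketch below is a simplified model with made-up file contents, not Plastic's merge engine; it only classifies each file, it does not merge text.

```python
def plan_merge(base, source, destination):
    """File-level three-way comparison. Each argument maps path -> content.
    Returns path -> action for every file that differs from the base."""
    plan = {}
    for path in sorted(set(base) | set(source) | set(destination)):
        b, s, d = base.get(path), source.get(path), destination.get(path)
        if s == b and d == b:
            continue                         # untouched on both sides
        if s == b:
            plan[path] = "keep workspace"    # changed only in the workspace
        elif d == b:
            plan[path] = "take source"       # changed only in the source
        elif s == d:
            plan[path] = "identical change"  # same edit on both sides
        else:
            plan[path] = "conflict"          # changed differently on both
    return plan


# Hypothetical contents for changeset 111, changeset 113 and Mike's workspace.
cs111 = {"foo.c": "foo v1", "bar.c": "bar v1", "boo.c": "boo v1"}
cs113 = {"foo.c": "foo by teammate", "bar.c": "bar by teammate", "boo.c": "boo v1"}
mike = {"foo.c": "foo by Mike", "bar.c": "bar v1", "boo.c": "boo v1"}

print(plan_merge(cs111, cs113, mike))
# {'bar.c': 'take source', 'foo.c': 'conflict'}
```

Mike never touched bar.c, but the merge result has to contain the source's version of it, so it lands in his workspace as part of the changeset he is about to create.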

Of course, there could be alternatives, like not downloading bar.c and still merging and setting the link, but then the workspace would be, again, out of sync.

In Gluon, the previous scenario is not possible because, right now, it can't merge. It is meant to be used with file-locking instead, and avoiding merging at all costs, which was, in fact, one of the foundations of its design - no merges, no branches, not having to download the entire content of the repo to the workspace.

Future: merging in partial workspaces to get the best of both worlds?

Gluon handles workspaces differently, they are not coherent, and very often they don't even hold a full copy. We call these workspaces partial.

We are considering options to improve single-branch workflows. Right now, Gluon can't merge, and that's fine for most users because they just focus on unmergeable content, Unity assets, 3D content, textures, animations, documents, etc. But there are times when non-developers concurrently touch text-based files, like small scripts, and then merging would be a better option.

What if we solve a situation like the one described above for Mike but without merge tracking?

  1. Mike is on 111, fully in-sync (not partial).
  2. Then he wants to checkin foo.c.
  3. Plastic detects a merge is needed, but instead of just doing a full merge, it asks Mike if he wants to switch to a partial workspace and just merge foo.c.
  4. Mike says ok, and then foo.c is merged, then checked in to latest (head) on main.
  5. Mike is now on a partial workspace, loading the latest of foo.c (from 114) while the rest of his workspace remains on 111.

Why haven't we implemented this yet?

Check step 4: I said merge and checkin happen in a single step, but that's not the case. When I merge, the merged file is kept in my workspace so I can review it before checkin. So, the real situation will be something intermediate prior to creating 114:

The "checkout changeset" marked as CO is a temporary one. It doesn't exist on the repo, it just exists on the workspace, but it is clear it starts from 111. Currently, after a merge, a "merge-link" would be coming from 113 as this CO would be the new 114. But, in this "partial workspace merge" there is no merge link.

When I checkin foo.c, Plastic will realize it is in partial mode, that there are no new changes for foo.c after 113, and, somehow, it will also know that foo.c is already merged, and a new merge is not needed (sort of temporary per-file merge tracking stored only in the workspace). No version tree walking would happen since ancestors will always be the parents of the loaded revisions, something workspaces account for.

Does Git have these issues working on single branch?

The "issues" I described (although I wouldn't call them issues but simply ways of working) only happen when several people work on the same branch.

This can't happen in Git, by design, because 2 developers can't work on the same branch, because Git is not centralized.

Plastic can also work fully distributed, with clones for each developer, so you can avoid working in the same branch too.

Confused? Isn't GitHub sort of a centralized Git? Not really.

Centralized means you directly checkin to the repo, no intermediate repo exists. When Mike had this situation, it was because he is directly working on the same repo as his colleagues:

In distributed mode (Git or Plastic), you have your own repo and nobody will touch it or access it. You will never be asked to merge during checkin, because you are in your own branch.

For code, this is perfect, although you don't need to go distributed for that, you can simply create task branches. (More on task branches here).

For binaries it is not so good. You need locking and locking in a distributed environment is not something Git or GitHub can do at all.

If you can't merge certain files, then, right now, you are better served by centralized single-branch version control. Plastic Gluon helps. Git doesn't.

Conclusion

If you are used to Perforce or SVN, maybe you'll find yourself wrestling with Plastic, trying to make it work in P4 ways.

"Why can't I just checkin a single file without updating?" - you ask.

Per-changeset merge tracking is the answer. It is a different way of thinking. It imposes some limitations, but it opens up a wide range of possibilities for more efficient version control.

The whys might not be easy to grasp at first, and that's why this blogpost is much longer than I initially expected.

We develop Plastic SCM, a version control that excels in branching and merging, can deal with huge projects and big binary assets natively, and it comes with GUIs and tools to make everything simpler.

If you want to give it a try, download it from here.

We are also the developers of SemanticMerge, and the gmaster Git client.
