nodata: the secret sauce of lighter clones

Friday, November 10, 2017 Pablo Santos 1 Comment

UPDATE 2019/09/17: no-data replication is very similar to the new Git partial clone.

UPDATE 2019/03/20: we updated the examples to use the new cm push and cm pull commands.


Having local repos (working distributed) is great: super-fast checkins, speed-of-light branch creation... nothing breaks the flow. In contrast, slow checkins to a remote server simply get on my nerves and drive me directly to productivity-killing time fillers like checking conversations and forums on Slack.

Super big local repos with tons of history I never actually use are not good, though. Yeah, disk is cheap and all that, but 20GB local repos drive me crazy.

And I'm connected to the internet 99.9% of the time while I code. Can't I just have lighter clones and grab the data from the central server on demand?

We have just introduced nodata replication: you only get the metadata, and the data is downloaded on demand from the central server.

(Applause).

How do you create nodata replicas?

Easy. Just use these two example commands:

cm mkrep code@localhost:6060
cm pull main@code@plasticscm@cloud code@localhost:6060 --nodata

This creates a local repo code, then pulls the main branch of the plasticscm repo (located in our Cloud server) without data.

By the way, it works with on-premise repos too (provided you have upgraded to version 1744 or higher). The command:

cm pull main@code@myserver:8087 code@localhost:6060 --nodata

works too.

Can you pull without data from the GUI?

Yes. See the example in the following image:

Wait, what goes on with nodata under the hood?

Now, here is when I draw one of my (in)famous branch diagrams.

This is my remote repo:

Now, I replicate with --nodata:

Notice that I marked the changesets differently with a diamond pattern instead of a solid blue fill. The new diamond pattern means that they are "de-hydrated" or that they do not have data.

Now, what happens if I create a workspace pointing to code@localhost:6060 and switch to it?

Actual data will be taken from the remote repo: code@london.plasticscm.com:9095.

Now, what happens if I switch to changeset 80? Will all the data download again from the remote repo? Of course not. Only the files that changed between changesets 80 and 89 will download (my workspace is never fully redownloaded if the files in it are fine).
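The switch logic can be pictured as a simple tree diff (an illustrative Python sketch, not Plastic's actual code; the trees and revision ids are made up):

```python
def files_to_download(workspace_tree, target_tree):
    """Return only the entries that differ between the workspace
    and the target changeset -- nothing else is re-fetched."""
    return {path: rev for path, rev in target_tree.items()
            if workspace_tree.get(path) != rev}

# Hypothetical trees for changesets 89 and 80:
cs89 = {"foo.c": "rev-89", "bar.c": "rev-89", "README": "rev-10"}
cs80 = {"foo.c": "rev-80", "bar.c": "rev-89", "README": "rev-10"}

print(files_to_download(cs89, cs80))  # only foo.c changed between them
```

Only entries whose revision actually differs are fetched; identical files stay put in the workspace.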

Can I diff changeset 89 even if it doesn't have any data? Yes, I can. The actual data will be downloaded from code@london.plasticscm.com:9095, just as it happens during an update.

Can I work on this nodata repository? Yes, I can. Look, I can create a new branch:

And, I can checkin changes:

Look carefully: the new changesets 90 and 91 are not de-hydrated. They do have the content of the files I just checked in.

Now, I can push main/task2001 back to the central server if I want to.

What happens if I'm on changeset 91 and I switch to 89? Just what you'd expect: files foo.c and bar.c will be redownloaded from code@london.plasticscm.com:9095.

And if I now go from 89 back to 91, foo.c and bar.c will download from my local repo, because it does have that data.

In short, every single time Plastic needs the data of a file, it tries to take it from the local repo, and if it is not there, it goes to the remote repo to grab it. The remote repo needs to be reachable, of course (we still haven't invented disconnected data transfer :P).
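That local-first, remote-fallback lookup can be sketched roughly like this (a minimal Python sketch; the function name and dict-based repos are hypothetical stand-ins, not Plastic's actual API):

```python
def fetch_revision_data(rev_id, local_repo, remote_repo):
    """Resolve file data local-first, falling back to the remote.

    local_repo / remote_repo are plain dicts mapping revision ids to
    file contents -- stand-ins for real repository storage.
    """
    data = local_repo.get(rev_id)
    if data is not None:
        return data  # the local repo is hydrated for this revision
    if remote_repo is None:
        raise ConnectionError("remote unreachable and data not local")
    return remote_repo[rev_id]  # downloaded on demand

local = {"rev-90": b"int foo;"}  # checked in locally, so hydrated
remote = {"rev-89": b"int bar;", "rev-90": b"int foo;"}

print(fetch_revision_data("rev-90", local, remote))  # served locally
print(fetch_revision_data("rev-89", local, remote))  # fetched from remote
```

If the remote is unreachable and the revision isn't local, the operation fails, which is exactly the trade-off nodata makes.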

How does Plastic know where the remote is?

Easy too:

cm find replicationsource on repository 'code@localhost:6060'
13       code@london.plasticscm.com:9095 43a54f4a-c933-4c02-a84e-84eff4751f87
Total: 1

This is not new. Plastic has been tracking replication sources ever since version 2.

You can also check the replication log:

cm find replicationlog on repository 'code@localhost:6060'
327226   11/6/2017 11:33:06 code@london.plasticscm.com:9095 F   pablo
Total: 2

So, right now, the Plastic client (which is the one downloading the data) uses the replication source to find the remote server. The plan is to also let users specify it explicitly later on.

Restriction: I can't push if I don't have the data

As I explained above, I can pull with the --nodata argument. But I can't push with --nodata. We chose to design it this way to avoid pitfalls: if you pushed without data, the remote repo (which could be a central server) would be left in an inconsistent state.

Also, even if I do a normal push, Plastic won't let me run the operation if some of my data is missing.

What does this mean? Suppose we use the previous scenario:

Now, instead of pushing main/task2001 to code@london.plasticscm.com:9095, I try to push it to an empty repo at code@mordor.plasticscm.com:8087. Plastic will negotiate the replica and find that code@localhost:6060 doesn't have the entire tree for changesets 90 and 91 on main/task2001. These changesets only contain the new revisions of foo.c and bar.c, not the entire tree (which is great, and that's why I can have super-light repos). Since the data of the entire tree can't be assembled, Plastic will tell you that you need to "hydrate" first.
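The negotiation check boils down to "do I hold data for the full tree of each changeset?" (an illustrative Python sketch, not Plastic's real protocol; file names and the helper are made up):

```python
def can_push(changeset_tree, local_data):
    """A push needs data for the entire tree loaded by a changeset,
    not just the revisions that were created locally."""
    missing = [rev for rev in changeset_tree if rev not in local_data]
    return (len(missing) == 0, missing)

# Changeset 91 loads four revisions, but only the two new ones are local:
tree_91 = ["mk/build.mk", "foo.c", "bar.c", "README"]
local = {"foo.c", "bar.c"}

ok, missing = can_push(tree_91, local)
print(ok)       # False -> Plastic asks you to hydrate first
print(missing)
```

When the destination already holds the rest of the tree (as code@london.plasticscm.com:9095 does), nothing is missing and the push proceeds.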

Hydrate: nodata's evil friend

What is hydrate?

Suppose the tree in the code repo is something like:

Ok, I have the data of foo.c and bar.c in my local repo, but the rest of the tree is not there, and it is downloaded on demand by the client when needed (update, diff, merge, anything).

Now, what if I want to actually introduce data in my repo so Plastic doesn't have to grab it anymore from the remote code@london.plasticscm.com:9095?

I can hydrate using this command:

cm pull hydrate cs:91@code code@london.plasticscm.com:9095

which will resolve the tree of changeset 91, download all its data from the remote server, and insert it into the repo. (Note: yes, I can hydrate from a different server if I want to.)

This doesn't mean that I am downloading the history of the entire repo. Plastic is just getting the data for the files loaded in changeset 91. All previous revisions stay "de-hydrated".
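In other words, hydrate only touches the revisions loaded by the target changeset (a minimal Python sketch under that assumption; the helper and file names are hypothetical):

```python
def hydrate(changeset_tree, local_data, remote_data):
    """Pull data only for revisions loaded in the target changeset;
    everything outside that tree stays de-hydrated."""
    for rev in changeset_tree:
        if rev not in local_data:
            local_data[rev] = remote_data[rev]  # one-time download
    return local_data

local = {"foo.c": b"..."}  # already checked in locally
remote = {"foo.c": b"...", "bar.c": b"...", "legacy.c": b"..."}

hydrate(["foo.c", "bar.c"], local, remote)  # the tree of cs:91
print(sorted(local))  # ['bar.c', 'foo.c'] -- legacy.c stays remote-only
```

Older history (legacy.c here) never lands in the local repo, which is why hydrated clones stay small.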

What is the primary use of hydrate?

As I explained at the beginning of the blogpost, with nodata I get super light clones, but I need to be connected to the central server for certain operations.

By hydrating a given changeset (or branch), I can completely go offline, but at the cost of downloading more data to my local repo.

For example: I can hydrate the last changeset on main, then I'm totally independent to continue branching in parallel from that point, without ever needing to get data from the central server.

Replicating with nodata and hydrating a given changeset will still be lighter than replicating a single full branch (unless the branch you are replicating has only a single changeset).

This is not just the beginning

nodata replica is just another step in our quest for better distributed workflows. Plastic introduced partial replicas (no need to get all paths to a changeset to have a working local repo) long ago. The downside is that sometimes you do need to check the entire history, like when doing an annotate. That's why eons ago we added the "annotate on remote repo" option:

You still have small fully-working local repos, but you benefit from the full data living on the central server(s).

Wrapping up

We want Plastic to excel in distributed workflows. nodata is just another step in that direction. We have been on this path for more than 10 years already; now we are going further.

Go do it - Git! ;-)

Pablo Santos
I'm the CTO and Founder at Códice.
I've been leading Plastic SCM since 2005. My passion is helping teams work better through version control.
I had the opportunity to see teams from many different industries at work while helping them improve their version control practices.
I really enjoy teaching (I've been a University professor for 6+ years) and sharing my experience in talks and articles.
And I love simple code. You can reach me at @psluaces.

1 comment:

  1. Yeah, almost everything in Git is local and all that stuff, but the first clone really sucks. It's a waste of time until you can get started, and I only use a small piece of the downloaded history, so... getting it on demand or just downloading it in pieces when I want sounds much better.

    Congrats, Plastic, you did it again :)
