ExternalProject_Add cache for git repositories

codeling · October 6, 2021, 2:53pm

Inspired by this other topic, I added DOWNLOAD_DIR to our superbuild’s ExternalProject_Add to avoid re-downloading external libraries’ archive files, which works nicely.

However, according to the ExternalProject_Add documentation, DOWNLOAD_DIR only provides caching for archives retrieved via URL sources.
Since we also clone a few libraries from git in our superbuild, I was wondering whether there is something like a “cache” for git repositories as well. I imagine a setting that specifies a local git repository acting as a cache. That repository would be copied to the source directory, the “original” git repo would be added as a remote (if it does not exist already), and the update step would then run as normal if necessary. This would mean that only commits not yet in the local repository need to be fetched.
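The flow imagined above can be tried with plain Git commands, independent of ExternalProject. The following is only a sketch under stated assumptions: the repositories are throwaway ones created in a temporary directory, and the names `upstream`, `cache`, `src` and the helper `demo_commit` are hypothetical.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)   # scratch directory; delete it afterwards with rm -rf

# Hypothetical helper: create an empty commit in repository $1 with message $2.
demo_commit() {
    git -C "$1" -c user.name=demo -c user.email=demo@example.invalid \
        commit -q --allow-empty -m "$2"
}

# Throwaway "original" repository with one commit on branch main.
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/main
demo_commit "$work/upstream" "first"

# Local cache: a copy taken earlier that no longer knows its remote.
git clone -q "$work/upstream" "$work/cache"
git -C "$work/cache" remote remove origin

# The original repository moves on after the cache was taken.
demo_commit "$work/upstream" "second"

# 1. Copy the cache to the source directory.
cp -R "$work/cache" "$work/src"

# 2. Add the original repository as a remote if it is not there yet.
if ! git -C "$work/src" remote get-url origin >/dev/null 2>&1; then
    git -C "$work/src" remote add origin "file://$work/upstream"
fi

# 3. Update as usual; only the "second" commit has to be fetched.
git -C "$work/src" fetch -q origin
git -C "$work/src" checkout -q --detach origin/main
git -C "$work/src" log --oneline
```

Whether ExternalProject_Add would accept a source directory prepared this way is a separate question that this sketch does not answer.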

The only workaround I can think of at the moment is kind of hacky: add a step before the call to ExternalProject_Add which, if no source has been cloned yet, takes the local repository path as input and copies it to the place where ExternalProject_Add’s download step would put it…

My use case: I have a superbuild which I want to test automatically; and while I want to do the superbuild from scratch (to fully test it), I also want to avoid large bandwidth usage (i.e. avoid re-downloading/cloning the whole git repository of, let’s say VTK).

ben.boeckel · October 6, 2021, 5:53pm

I see two potential solutions here.

`git clone --reference`

With this approach, the clone step would run something along the lines of `git clone --reference $commondir`. This $commondir can then be set up manually to have fetched from any set of repositories one might want to cache. The issue with ExternalProject managing this behind the scenes is that its operations would need to:

- preferentially operate on $commondir (otherwise new objects just get “orphaned” in the srcdir again);
- know about remote names per project;
- wrangle branches and tag conflicts in some meaningful way.
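A manually managed $commondir could look roughly like the sketch below. It assumes throwaway repositories in a temporary directory; the remote name `vtk` and the directory names are hypothetical, and a `file://` URL is used so that Git goes through its normal transport rather than the local-clone shortcut.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)   # scratch directory; delete it afterwards with rm -rf

# Hypothetical upstream standing in for a large project such as VTK.
git init -q "$work/upstream"
echo "demo" > "$work/upstream/README"
git -C "$work/upstream" add README
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.invalid \
    commit -q -m "initial commit"

# The shared object store ($commondir in the post): a bare repository
# that fetches from every upstream it should cache, one remote per project.
git init -q --bare "$work/commondir"
git -C "$work/commondir" remote add vtk "file://$work/upstream"
git -C "$work/commondir" fetch -q vtk

# Clone while borrowing objects from the shared store. Objects already
# present in commondir do not need to be transferred again.
git clone -q --reference "$work/commondir" "file://$work/upstream" "$work/src"

# The clone records where it borrows objects from.
cat "$work/src/.git/objects/info/alternates"
```

The clone depends on the reference repository from then on, so objects must not disappear from $commondir while clones still borrow from it; `git clone --dissociate` is the documented way to cut that dependency after cloning.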

Per-repository cache

There could also be a per-project cache repository, placed under some scheme in DOWNLOAD_DIR that avoids conflicts (a hash of the repo URL seems the sanest here). Rather than actually cloning, the actual src directory would be checked out with git worktree, so that the cache directory stays the “canonical” location for objects, i.e. an actual cache. This requires a newer Git version, but that is probably OK these days. Note that git worktree prune would need to be considered in case the srcdir is ever manually deleted outside of EP’s control.
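As a rough sketch of this idea, done by hand rather than by ExternalProject: the layout `git-cache/<hash>.git`, the use of `git hash-object` to hash the URL, and all directory names below are assumptions made for illustration, not anything CMake provides.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)                 # scratch directory; delete it afterwards
download_dir="$work/downloads"    # stand-in for DOWNLOAD_DIR
mkdir -p "$download_dir" "$work/build"

# Throwaway upstream with a single commit on branch main.
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/main
echo "hello" > "$work/upstream/README"
git -C "$work/upstream" add README
git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.invalid \
    commit -q -m "initial commit"
url="file://$work/upstream"

# Name the cache after a hash of the repository URL to avoid conflicts.
key=$(printf '%s' "$url" | git hash-object --stdin)
cache="$download_dir/git-cache/$key.git"

# Create the cache on first use, then fetch whatever is new.
if [ ! -d "$cache" ]; then
    git init -q --bare "$cache"
    git -C "$cache" remote add origin "$url"
fi
git -C "$cache" fetch -q origin

# Check out the source directory as a worktree of the cache, so all
# objects stay in the cache repository.
src="$work/build/src"
git -C "$cache" worktree add -q --detach "$src" origin/main
checked_out=$(cat "$src/README")

# If the srcdir is deleted behind Git's back, the stale worktree entry
# has to be pruned before the path can be reused cleanly.
rm -rf "$src"
git -C "$cache" worktree prune
git -C "$cache" worktree list
```

A fresh build tree can then be populated with another `git worktree add` against the same cache, without cloning again.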

Cc: @craig.scott

craig.scott · October 6, 2021, 9:05pm

I’ve been thinking for a while that it would be good to somehow support using git worktrees with ExternalProject and FetchContent. It would be more efficient from a space and network bandwidth point of view. It might be something to consider as part of this proposal which talks about URL substitutions controlled by the user, but worktrees would need additional design work beyond what that proposal currently covers.

ben.boeckel · June 19, 2023, 2:39pm

Note that the linked proposal was closed because Git’s url.insteadOf setting already supports URL substitution for Git repositories. The idea discussed here is still relevant independently.
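For readers unfamiliar with url.insteadOf, here is a small sketch of how it can redirect a clone to a local mirror. The URL `https://example.invalid/` and the mirror layout are hypothetical, and the setting is passed with `git -c` only to keep the demo from touching any real configuration.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)   # scratch directory; delete it afterwards with rm -rf

# Throwaway repository that plays the role of the project being mirrored.
git init -q "$work/seed"
echo "demo" > "$work/seed/README"
git -C "$work/seed" add README
git -C "$work/seed" -c user.name=demo -c user.email=demo@example.invalid \
    commit -q -m "initial commit"

# Local mirror directory holding a bare copy named like the upstream repo.
mkdir -p "$work/mirror"
git clone -q --bare "$work/seed" "$work/mirror/upstream.git"

# Any URL starting with https://example.invalid/ is rewritten to start
# with the mirror path instead, so this clone never touches the network.
git -c "url.$work/mirror/.insteadOf=https://example.invalid/" \
    clone -q https://example.invalid/upstream.git "$work/src"

# Show which URL the clone recorded for its origin remote.
git -C "$work/src" config --get remote.origin.url
```

To make the substitution permanent, the same key can be set with `git config --global url.<mirror>.insteadOf <original>`.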