Efficiency issues with ExternalProject sub-builds under Ninja

Some projects need to use ExternalProject because the sub-builds use a different architecture, SDK, build type, or some other reason that prevents keeping everything in a single CMake build. Ordinarily, Ninja is a good choice for using build resources efficiently, but it doesn’t handle this situation well: unlike Make, whose jobserver coordinates job slots across recursive invocations, Ninja has no mechanism for sharing CPU resources across nested builds. Currently, you have to choose between potentially severe CPU over-commitment and artificially limiting parallelism to avoid it. The default behavior when you invoke Ninja on the top-level build is CPU over-commitment, and it gets worse as you add more external projects, because each external project thinks it has exclusive access to all the CPUs.
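As a concrete illustration, a minimal superbuild along these lines shows the problem (project names, paths and toolchain files are all made up for the sketch). Nothing ties the parallelism of the two sub-builds together: when the top-level Ninja runs both build steps concurrently, each nested `cmake --build` launches its own Ninja with the full core count.

```cmake
cmake_minimum_required(VERSION 3.16)
project(superbuild NONE)

include(ExternalProject)

# Sub-build for architecture A (names and toolchain file are illustrative).
ExternalProject_Add(libfoo_armv7
  SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/libfoo
  CMAKE_ARGS -DCMAKE_TOOLCHAIN_FILE=${CMAKE_CURRENT_SOURCE_DIR}/armv7.cmake
  INSTALL_COMMAND ""
)

# Sub-build for architecture B. Each sub-build's nested Ninja assumes it
# owns all CPUs, so running both at once over-commits the machine.
ExternalProject_Add(libfoo_aarch64
  SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/libfoo
  CMAKE_ARGS -DCMAKE_TOOLCHAIN_FILE=${CMAKE_CURRENT_SOURCE_DIR}/aarch64.cmake
  INSTALL_COMMAND ""
)
```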

The problem is that a superbuild doesn’t let the external sub-builds and the main parent build work together. I don’t think Ninja quite offers what is needed to do this well yet, but proposals like this one may offer a way forward. Should CMake people get involved in that discussion to see if we can find a way to let a main build and an arbitrary number of sub-builds co-operate and share CPU resources more efficiently? I don’t know what’s involved, but I wanted to open the discussion to see what people may already have explored and whether there’s an opportunity to improve the situation. @brad.king @ben.boeckel I think you have both been quite involved with Ninja, but I’m sure others are also interested and have relevant experience here.

As extra motivation, this comes up in a few different scenarios, all of which are real and which I’m seeing more often:

  • Projects that build for multiple embedded architectures and need to pull the results together into a single package or set of packages. Or perhaps be able to build, deploy and run a coherent set of multi-architecture artifacts.
  • Android builds that want to produce binaries for different architectures and combine them in a single package.
  • Apple platforms that target multiple architectures across multiple SDKs (macOS, iOS, etc., each with its own set of architectures). I suspect this will be very relevant for creating XCFrameworks, since we currently have no way of doing that given the combinations of SDK and architecture that may need to be included.

I’m sure there are more use cases, but I wanted to see if there’s anyone able and willing to investigate and maybe work on a solution to this.

My projects automatically download and build external libraries like LAPACK, HDF5, etc. for embedded edge sensing/computing platforms. This avoids ABI issues and boosts performance through per-system optimizations. The platforms are CPU- and memory-limited but have multiple CPU cores, so I have to limit build parallelism. Currently I do this with Ninja job pools in CMake. I size the job pools empirically, using an if-elseif tree keyed on the amount of physical RAM, to avoid excessive RAM usage on low-RAM systems, and then associate the pools with the resource-heavy targets. This is per-project, without accounting for ExternalProject.
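For reference, the job-pool setup described above can be sketched roughly like this (the pool name, size thresholds, and target name are illustrative, not my actual values):

```cmake
# Size a Ninja job pool from physical RAM (thresholds are illustrative).
# TOTAL_PHYSICAL_MEMORY is reported in MiB.
cmake_host_system_information(RESULT mem_mib QUERY TOTAL_PHYSICAL_MEMORY)

if(mem_mib LESS 2048)
  set(heavy_jobs 1)
elseif(mem_mib LESS 8192)
  set(heavy_jobs 2)
else()
  set(heavy_jobs 4)
endif()

# Declare the pool (Ninja generators only) and attach it to the
# memory-hungry target; everything else still uses full parallelism.
set_property(GLOBAL PROPERTY JOB_POOLS heavy=${heavy_jobs})
set_property(TARGET big_lib PROPERTY JOB_POOL_COMPILE heavy)
set_property(TARGET big_lib PROPERTY JOB_POOL_LINK heavy)
```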

I had been using FetchContent but recently switched to ExternalProject to avoid scope issues. I haven’t tried building on the low-resource systems since that switch, so perhaps this issue will now affect me.

My experience with Make was that even the -l load limiter was not adequate to keep extreme RAM usage from crashing the whole embedded OS. So with Make I had to limit the entire build to, say, 2 cores, while with Ninja I could use all cores except for the targets where I used a job pool.

It sounds like if I have problems I might just be able to switch to Make and do a less parallelized build. But it would be nice to have Ninja’s speed benefits.
As to why I don’t cross-compile: the build time isn’t that bad with modern ARM CPUs, and the project is designed to be deployed by users who don’t want to install build tools on their laptops; they just SSH into the devices and build there.

What first comes to mind: these options would only modify CMake, and would be generator-independent:

  • a new option, say EP_BUILD_SERIAL, to make EPs build sequentially from the top-level project, assuming each project uses Ninja job pools or can otherwise build acceptably on its own.
  • can CMake track the “done” status of each EP? If so, CMake could give each EP a fixed subset of the CPU count, waiting to start new EPs until enough CPU cores are available. This reminds me of the CTest test property PROCESSORS vis-à-vis CTEST_PARALLEL_LEVEL. That is, the main project and each EP would have a PROCESSORS-like property that’s managed the way CTest manages test parallelism.
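Until something like EP_BUILD_SERIAL exists, the first option can be approximated by hand by chaining the external projects with DEPENDS, so only one sub-build runs at a time (a sketch; the target names are made up, and the repository URLs are just examples):

```cmake
# Force sequential sub-builds by making each EP depend on the previous one.
# Each sub-build then has the machine to itself and can rely on its own
# internal job pools for memory limiting.
include(ExternalProject)

ExternalProject_Add(ep_lapack
  GIT_REPOSITORY https://github.com/Reference-LAPACK/lapack.git
  INSTALL_COMMAND ""
)

ExternalProject_Add(ep_hdf5
  GIT_REPOSITORY https://github.com/HDFGroup/hdf5.git
  INSTALL_COMMAND ""
  DEPENDS ep_lapack   # serializes: hdf5 builds only after lapack finishes
)
```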

There is certainly merit to enhancing Ninja to handle sub-builds appropriately as the most general, performant, and automatic solution; I would like to see this too. But the CMake EP property additions above could probably be released much more quickly, and would work across generators.

Ninja runs the other way: there are pools of a given size, and jobs are added to a pool. There would need to be a cost or similar notion for how many slots in a pool a job takes. There would also need to be additional logic allowing overflow when a job is the only thing runnable (-j1 should still be able to build a cost=2 rule).

Personally, I think the -l flag is what is wanted here in general (for CPU saturation). Memory pressure is orthogonal, since memory usage varies far more between different TUs. The best I’ve found is systemd-run to cap the whole build at a lower limit, which also gives the OOM killer a more specific target than “the tmux session with the rogue processes”, which is far more inconvenient to lose.

The linked Ninja issue looks promising at a glance, but I don’t know how that would be modeled in CMake (unless ExternalProject is going to gain some special powers, unavailable to other CMake code, in order to trigger this special syntax).