Efficiency issues with ExternalProject sub-builds under Ninja

Some projects need to use ExternalProject because the sub-builds use a different architecture, a different SDK, a different build type, or have some other reason that means things can’t all be kept in a single CMake build. Ordinarily, Ninja is a good choice for using build resources efficiently, but it doesn’t handle this situation well. Unlike make, Ninja doesn’t support co-ordinated sharing of CPU resources across nested builds. Currently, you have to choose between potentially severe CPU over-commitment and artificially limiting the parallelism to avoid it. If you invoke Ninja on the top-level build, the default behavior is CPU over-commitment that gets worse as you add more external projects, because each external project thinks it has exclusive access to all the CPUs.
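To make the trade-off concrete, here is a minimal sketch of a two-architecture superbuild that takes the "artificially limit the parallelism" route. The project name, source path, toolchain files and the even split of cores are all assumptions made for illustration. Without the explicit BUILD_COMMAND, each sub-build would start with its build tool's default parallelism.

```cmake
cmake_minimum_required(VERSION 3.16)
project(SuperBuild LANGUAGES NONE)

include(ExternalProject)
include(ProcessorCount)

# ProcessorCount() yields 0 when the count cannot be determined.
ProcessorCount(ncpu)
if(ncpu EQUAL 0)
  set(ncpu 1)
endif()

# Assumption: split the cores evenly between the two sub-builds.
# This avoids over-commitment, but leaves cores idle whenever one
# sub-build finishes before the other.
math(EXPR jobs_per_subbuild "${ncpu} / 2")
if(jobs_per_subbuild LESS 1)
  set(jobs_per_subbuild 1)
endif()

foreach(arch IN ITEMS arm64 x86_64)
  ExternalProject_Add(mylib_${arch}
    # Hypothetical source tree and toolchain files.
    SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/mylib
    CMAKE_ARGS -DCMAKE_TOOLCHAIN_FILE=${CMAKE_CURRENT_SOURCE_DIR}/toolchains/${arch}.cmake
    # Cap the nested build explicitly instead of letting it use its default.
    BUILD_COMMAND ${CMAKE_COMMAND} --build <BINARY_DIR> --parallel ${jobs_per_subbuild}
    INSTALL_COMMAND ""
  )
endforeach()
```

The cap is static: the top-level Ninja still schedules its own jobs independently, so this only bounds the worst case and does not share CPUs between the builds.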

The problem is that a superbuild doesn’t let the external sub-build and the main parent build work together. I think Ninja doesn’t quite yet offer what is needed to do this well, but proposals like this one may offer a potential way forward. Should CMake people get involved in that discussion and see if we can find a way to allow a main build and an arbitrary number of sub-builds to co-operate and share the CPU resources more efficiently? I don’t know what’s involved, but I wanted to open the discussion to see what people might already have explored and whether there’s an opportunity to improve this situation. @brad.king @ben.boeckel I think you have both been quite involved with Ninja, but I’m sure others are also interested and have experiences that are relevant here.

As extra motivation, this comes up in a few different scenarios, all of which are very real and I’m seeing more often:

  • Projects that build for multiple embedded architectures and need to pull the results together into a single package or set of packages. Or perhaps be able to build, deploy and run a coherent set of multi-architecture artifacts.
  • Android builds that want to produce binaries for different architectures and combine them in a single package.
  • Apple platforms that target multiple architectures across multiple SDKs (macOS, iOS, each with their own multiple architectures). I suspect this will be very relevant for trying to create XCFrameworks, since we currently have no way of doing that given the combinations of SDK and architecture that may need to be included.

I’m sure there are more use cases, but I wanted to see if there’s anyone able and willing to investigate and maybe work on a solution to this.

My projects automatically download and build external libraries like LAPACK, HDF5, etc. for embedded edge sensing/computing platforms. This avoids ABI issues and boosts performance with per-system optimizations. The platforms are CPU- and memory-limited but have multiple CPU cores, so I have to limit build parallelism. Currently I do this with Ninja job pools in CMake + Ninja. I size the job pools empirically to avoid excessive RAM usage on low-RAM systems: an if-elseif tree picks the pool sizes based on the amount of physical RAM, and I then associate the job pools with the resource-heavy targets. This is per-project, without accounting for ExternalProject.
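A minimal sketch of that job-pool setup might look like the following. The RAM thresholds, pool name, target name and source file are hypothetical, not the poster's actual values. The JOB_POOLS, JOB_POOL_COMPILE and JOB_POOL_LINK properties only take effect with the Ninja generators.

```cmake
cmake_minimum_required(VERSION 3.16)
project(JobPoolDemo LANGUAGES C)

# TOTAL_PHYSICAL_MEMORY is reported in MiB.
cmake_host_system_information(RESULT ram_mib QUERY TOTAL_PHYSICAL_MEMORY)

# Assumed thresholds; in practice these would be tuned empirically.
if(ram_mib LESS 2000)
  set(heavy_jobs 1)
elseif(ram_mib LESS 4000)
  set(heavy_jobs 2)
else()
  set(heavy_jobs 4)
endif()

# Declare a pool whose depth limits how many heavy jobs run at once.
set_property(GLOBAL PROPERTY JOB_POOLS heavy_pool=${heavy_jobs})

# Hypothetical memory-hungry target.
add_library(big_solver STATIC solver.c)

# Compile and link jobs of this target draw from the limited pool;
# all other targets keep using Ninja's normal -j parallelism.
set_property(TARGET big_solver PROPERTY JOB_POOL_COMPILE heavy_pool)
set_property(TARGET big_solver PROPERTY JOB_POOL_LINK heavy_pool)
```

Pools are scoped to a single Ninja invocation, so a pool declared here does not constrain jobs started by a nested ExternalProject build.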

I had been using FetchContent, but recently switched to ExternalProject to avoid scope issues. I hadn’t tried building on the low resource systems since that switch, so perhaps this issue will now affect me.

My experience with Make was that even the -l load limiter was not adequate to avoid crashing the whole embedded OS due to extreme RAM usage. So with Make, I had to limit the entire build to, say, 2 cores, while with Ninja I could use all cores except for those targets where I used a job pool.

It sounds like if I have problems I might just be able to switch to Make and do a less parallelized build. But it would be nice to have Ninja’s speed benefits.
As to why I don’t cross-compile, the build time isn’t that bad with modern ARM CPUs. It’s designed to be deployed by users who don’t want to install build tools on their laptop; they just SSH into the devices to build there.

What first comes to mind–these options would only modify CMake, and would be generator-independent:

  • a new option say EP_BUILD_SERIAL to make EPs build sequentially from the top-level project. We assume that each project uses Ninja job pools or otherwise is able to build OK by itself.
  • can CMake track the “done” status of each EP? If so, CMake could give each EP a fixed subset of the CPU count, waiting to start new EPs until enough CPU cores are available. This reminds me of how CTest weighs the PROCESSORS test property against CTEST_PARALLEL_LEVEL. That is, the main project and each EP would have a PROCESSORS-like property that is managed the way CTest does it.
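Neither option exists today, but the first one can be approximated with what ExternalProject already offers. The sketch below is a workaround under assumptions, not the proposed EP_BUILD_SERIAL: it chains hypothetical sub-projects with DEPENDS so that only one of them runs at a time. The sub-project names and directory layout are made up for illustration.

```cmake
cmake_minimum_required(VERSION 3.16)
project(SerialSuperBuild LANGUAGES NONE)

include(ExternalProject)

# Hypothetical sub-projects; each is assumed to build fine on its own.
set(subprojects lapack hdf5 app)

set(previous "")
foreach(name IN LISTS subprojects)
  set(dep_args "")
  if(NOT previous STREQUAL "")
    # Make this EP wait for the previous one, even if there is no
    # real dependency between them, purely to serialize the builds.
    set(dep_args DEPENDS ${previous})
  endif()

  ExternalProject_Add(${name}
    SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/${name}  # hypothetical layout
    INSTALL_COMMAND ""
    ${dep_args}
  )
  set(previous ${name})
endforeach()
```

This works with any generator, but it serializes every step, including configure steps that could safely overlap. With the Ninja generators, passing USES_TERMINAL_BUILD TRUE to each ExternalProject_Add call is an alternative: it places the build steps in Ninja's console pool, which runs one job at a time, while the other steps stay parallel.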

There is certainly merit to enhancing Ninja to handle subbuilds appropriately as the most general, performant and automatic solution–I would like to see this too. The CMake EP property additions above might be able to be released much more quickly, and work across generators.

Ninja runs the other way: there are pools of a given size, and jobs are added to a pool. There would need to be a cost or something similar for how many slots in a pool a job takes. There would also need to be additional logic that allows overflow when a job is the only thing running (-j1 should still be able to build a cost=2 rule).

Personally, I think the -l flag is what is wanted here in general (for CPU saturation). Memory pressure is orthogonal (memory usage differs far more between TUs). The best I’ve found is to use systemd-run to limit the whole build to a lower cap, which gives the OOM killer a more specific target than “the tmux session with the rogue processes”, which is far more inconvenient.

The linked ninja issue looks promising at a glance, but I don’t know how that would be modeled in CMake (unless ExternalProject is going to gain some special powers other CMake code can’t do in order to trigger this special syntax).

Another superbuild parallelism issue is that project-level dependencies tend to be very coarse. This is even a problem with the Makefile generators.

For example, when building code generator tools, it is common that the generated code is only one small part of the build. Thus while a HostCodeGen project might be a dependency of a Target project, with an ExternalProject setup the Target build cannot begin until the HostCodeGen build has finished, even though there is likely quite a bit of work in Target that could safely begin.
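The coarse dependency can be sketched as follows; the project layout is hypothetical. Using DEPENDS HostCodeGen on Target would hold back every step of Target. ExternalProject_Add_StepDependencies narrows that slightly by making only the build step wait, so Target can at least configure in parallel. That assumes Target's configure step does not itself need anything produced by HostCodeGen.

```cmake
cmake_minimum_required(VERSION 3.16)
project(CodeGenSuperBuild LANGUAGES NONE)

include(ExternalProject)

# Hypothetical host-side code generator project.
ExternalProject_Add(HostCodeGen
  SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/codegen
  INSTALL_COMMAND ""
)

# Hypothetical project that consumes the generator's output.
# Deliberately no DEPENDS here, so its configure step is not held back.
ExternalProject_Add(Target
  SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/target
  INSTALL_COMMAND ""
)

# Only Target's build step waits for HostCodeGen to complete.
ExternalProject_Add_StepDependencies(Target build HostCodeGen)
```

The limitation described above remains: the whole of Target's build step still waits for the whole of HostCodeGen, even though only the generated sources actually need it.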

One is therefore required to either live without the parallelism or break Target (and possibly HostCodeGen) down into fine-grained projects—one for each artifact in the worst case—that can run independently. The gains are unfortunately offset by many redundant CMake configurations, involving a lot of filesystem traffic to determine the same toolchain settings between builds.

I suppose that CMAKE_NINJA_OUTPUT_PATH_PREFIX could be used to “stitch” Ninja-generator builds together. However there are complications: target paths won’t be known at a sufficiently early time to do anything more than “wait for all of the other project” anyways. Some utility targets might conflict (or have to be amended to incorporate the prefixing).

I don’t think there’s much to be done here, because any improvement basically boils down to “support multiple toolchains for a single language”, and the single-toolchain assumption is too deeply rooted in CMake to be removed now.

I thought of that, and then tried it, but I couldn’t get it set up… see here: How is `CMAKE_NINJA_OUTPUT_PATH_PREFIX` supposed to work?

Yeah, the problems there sound like the ones I mentioned.

solutions?

I’m trying to limit the CPU load for qtwebengine builds

example:

# qtwebengine-everywhere-src-6.3.1/src/CMakeLists.txt

externalproject_add(gn
    SOURCE_DIR  ${CMAKE_CURRENT_LIST_DIR}/gn

my process tree looks like this. the load is 45 but should be 32

cpr = number of child processes
spr = sum of all child processes, including self (g++ = g++ cc1plus as)

load   rss spr cpr proc
 45.9 1001 261   2 bash -e default-builder.sh # {'cwd': '/build/qtwebengine-everywhere-src-6.3.1/build'}
 45.9  996 257  15  ninja -j32 -l32
  1.0   24   3   1   g++
  1.0   24   3   1   g++
  1.0   24   3   1   g++
                     [ ... in total 14 g++ procs ... ]
 32.0  650 199   1   sh -c 'cd /build/qtwebengine-everywhere-src-6.3.1/build/src/gn && cmake --build . && cmake -E touch /build/qtwebengine-everywhere-src-6.3.1/build/src/gn/src/gn-stamp/gn-build' # {'cwd': '/build/qtwebengine-everywhere-src-6.3.1/build/src/gn'}
 32.0  650 197   1    cmake --build
 32.0  642 195   1     ninja
 32.0  640 193  32      ninja -j32 -l32 -C /build/qtwebengine-everywhere-src-6.3.1/build/src/gn/Release gn # {'cwd': '/build/qtwebengine-everywhere-src-6.3.1/build/src/gn/Release'}
  0.9   20   5   2       g++
  1.0   20   5   2       g++
  1.0   20   5   2       g++
                         [ ... in total 32 g++ procs ... ]

i would be happy with a simple solution like

so in my case, only the second g++ group (32 g++ procs) would run (depth first)
so i get a total load of 32

Perhaps the Kitware fork of ninja might be worth looking into. A relevant issue in the CMake issue tracker was recently updated. The Kitware fork of ninja supports a job server, so sub-builds share the CPU pool of the top level. It’s a promising idea.

Been there, but that only works when the main build process is make and ninja is only a job client (tokenpool client).

In my case, the main build process is ninja, so now I’m using the “tokenpool master” branch from https://github.com/stefanb2/ninja/tree/topic-tokenpool-master

works like a charm : )

Now I just need to teach node and python to talk to the jobserver.

Indeed. I suspect ninja will only ever support being a job server client, but who knows…