Is GLOB still considered harmful with CONFIGURE_DEPENDS?

Supposing you’re able to use the newest, shiniest version of CMake, what are the current drawbacks for globbing sources with CONFIGURE_DEPENDS?

The documentation reads:

Note: We do not recommend using GLOB to collect a list of source files from your source tree. If no CMakeLists.txt file changes when a source is added or removed then the generated build system cannot know when to ask CMake to regenerate. The CONFIGURE_DEPENDS flag may not work reliably on all generators, or if a new generator is added in the future that cannot support it, projects using it will be stuck. Even if CONFIGURE_DEPENDS works reliably, there is still a cost to perform the check on every rebuild.

Emphasis mine. With which existing generators can CONFIGURE_DEPENDS be expected to work reliably? To work at all? Exactly how costly are those globbing checks that have to be performed on every rebuild?

The check costs depend on the platform (and probably the generator too). I can't speak to the performance costs, but that's because I personally find that even if it were performant, there's at least one issue I run into often enough to make globbing not worth it.

I still highly discourage globbing for the reason that files may appear in your source tree that you do not intend to build. The main case I’ve run into is that during conflict resolution in git, the other versions of the file(s) in conflict are named ${base}_${origin}_${pid}.${ext}, so if you try to build in the middle of a conflict, you’re going to glob up these files.
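To make that failure mode concrete, here is a minimal sketch of what a glob sees mid-conflict. The file names and PID are made up for illustration, following the ${base}_${origin}_${pid}.${ext} pattern described above:

```shell
# Simulate the leftover files git mergetool creates during conflict
# resolution (hypothetical names; the PID 4242 is made up).
mkdir -p /tmp/glob_demo/src
touch /tmp/glob_demo/src/widget.cpp
touch /tmp/glob_demo/src/widget_LOCAL_4242.cpp
touch /tmp/glob_demo/src/widget_REMOTE_4242.cpp

# The same pattern a CMake glob would use matches all three files,
# so a build started mid-conflict tries to compile the extras too.
ls /tmp/glob_demo/src/*.cpp
```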

Another reason is that the addition/removal of a file is then not present in your build system diff, so tracking down "what did you change?" when debugging reported problems can be harder, since there's no evidence of accidentally added/removed files in a normal ${vcs} diff output.


Here is an example of using GLOB with CONFIGURE_DEPENDS, assuming sources under src/ and headers under include/.

The directory looks like this.

.
├── CMakeLists.txt
├── include
└── src

add_library(foobar)

file(GLOB SOURCES CONFIGURE_DEPENDS "src/*.cpp")

target_include_directories(foobar PRIVATE include)

target_sources(foobar PRIVATE ${SOURCES})

My main issue with CONFIGURE_DEPENDS is performance. It impacts configure-time performance on large projects with hundreds of files, and to me configure-time performance is extremely important.

Perhaps this is less of an issue on Linux platforms (in general, CMake is a lot faster on Linux), but I'm mainly a Windows dev.

Have you measured this? I tried recently and found performance was awful under WSL but not too bad on Ninja on a fast NVMe drive.

This is not something one can expect everyone to have just lying around, though.

True, but SSDs generally (incl. SATA) are pretty ubiquitous these days. I’m very curious if there are any actual performance numbers on this.

I personally don't really find performance numbers interesting, because the above reasons are enough to not use it either way.

I'm curious what your numbers are on this test case (a directory with 1000 sources). I get 0.019s for a no-op build on WSL, on the virtual ext4 disk of a Samsung 970 EVO 1TB.

$ cat CMakeLists.txt
cmake_minimum_required(VERSION 3.20)

file(GLOB sources CONFIGURE_DEPENDS "src/*.cpp")
add_executable(main ${sources})
$ mkdir src
$ echo "int main () { return 0; }" > src/main.cpp
$ touch src/src_{1..999}.cpp
$ ls src | wc -l
1000
$ cmake -G Ninja -S . -B build
$ cmake --build build
$ time cmake --build build/ -- -v
[0/2] /usr/bin/cmake -P /home/alex/test/build/CMakeFiles/VerifyGlobs.cmake
ninja: no work to do.

real    0m0.019s
user    0m0.013s
sys     0m0.006s

Running this same experiment from WSL on the NTFS part of the same drive crashes performance to 2.373s, but this is probably just WSL’s poor filesystem performance generally.

Running this same experiment from Windows on the exact same NTFS drive gives 0.119s, so NTFS is a good bit slower than ext4.

Running this same experiment from Windows on a SATA SSD (SanDisk SDSSDHII480G) formatted with NTFS gives 0.116s, so SATA versus NVMe performance seems dominated by NTFS overhead.

I actually wanted to ask you about this:

What merge tool are you using, or what configuration option do you have set to get this behavior? When I do merges in git, the conflicting files are over-written in-place with merge markers, I don’t get those files you’re describing.

I use git mergetool with merge.conflictstyle=diff3. Even without that, files getting added to the build just by their presence isn't desired; I want to see the addition of a file in the CMake code so that if someone forgets to push, the error is obvious (instead of, potentially, being a link error).
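For reference, the merge.conflictstyle setting mentioned above can be enabled like this (shown in a throwaway repository so it doesn't touch any real configuration):

```shell
# Create a scratch repository and set the diff3 conflict style there.
git init -q /tmp/conflict_style_demo
git -C /tmp/conflict_style_demo config merge.conflictstyle diff3

# Confirm the setting took effect (prints: diff3).
git -C /tmp/conflict_style_demo config merge.conflictstyle
```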

That doesn't sound so bad if your codebase establishes it as a convention: every file in src/ gets added, and you use preprocessor definitions for OS-specific code.

That doesn’t help all that much; I’d also like to not install those headers (or classes that are disabled due to other options) on platforms that don’t actually have implementations behind them.

I didn’t think about the installation aspect. That’s a good point.

Don’t forget that people also build in VMs with virtualised disks, the performance of which could vary considerably (sometimes due to technical limitations, sometimes due to expertise limitations of the person who sets them up).

Another relatively common pattern that fails with globbing is where some source files only get added for some compilers or platforms. This pattern is useful because it avoids peppering sources with #ifdefs and leaves you with very readable source files.
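As a sketch of that pattern (the target and file names here are hypothetical), the per-platform file is listed explicitly instead of #ifdef-ing a single source file:

```cmake
# Hypothetical target and file names, for illustration only.
add_library(netlib src/socket.cpp)

if(WIN32)
  # Only compiled on Windows; no #ifdef needed inside the file.
  target_sources(netlib PRIVATE src/socket_win32.cpp)
else()
  target_sources(netlib PRIVATE src/socket_posix.cpp)
endif()
```

A glob over src/*.cpp would instead pick up both platform-specific files on every platform.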

The WSL ext4 disk I tested on was virtualized and the performance was fine. Not as good as native, I’m assuming, but not unacceptable like the FUSE/9p access to the host NTFS file system through WSL. If you’re building over 9p, I think you have bigger fish to fry.

I tried measuring glob performance in this same scenario on GitHub Actions (all virtualized) using both the Visual Studio (Windows) and Ninja generators, and the no-op build overhead incurred by globbing was under half a second in all three scenarios.

It does seem like there are other good reasons not to glob, but I haven’t been able to reproduce the supposed performance overhead except under very silly scenarios.

Here’s an idea I had while thinking about this:

When the file list does change, CMake currently has to re-configure from scratch, whether you're globbing or not. There is a way, I think, to get better performance than either existing approach by adding a generator expression for globs, such that the result of the glob cannot be inspected at configure time:

add_library(objlib OBJECT $<GLOB:src/*.cpp>)

add_executable(app1 app1/main.cpp $<TARGET_OBJECTS:objlib>)

add_executable(app2 app2/main.cpp)
target_link_libraries(app2 PRIVATE objlib)

At generation time, it is known which targets have a glob expression in their (INTERFACE_)SOURCES property and to which other targets they’re linked (or referenced by $<TARGET_OBJECTS>). So there’s enough information to incrementally update the rules that depend on the result of that glob. I guess there’s an open question here about how source file properties should apply to files matched by $<GLOB:...>, but maybe it’s okay to prohibit that because source file properties aren’t very common in user code (and you can always split up your files or refine your globs and use the target properties).

Because the result of $<GLOB:...> cannot be inspected at configure time, there’s no need to rerun the whole CMakeLists.txt when the result changes, only the generation step. That could actually save a significant amount of time. If the glob itself changes, then of course you have to re-configure.

On the other hand, this would encourage people to use globs, which I understand you don’t want :slight_smile:

In order to get to the generate step, the configure step must have been run. There's not enough on-disk state to "start up" post-configure and just do a generate.

Yet :slight_smile: I don’t see why that wouldn’t be possible in theory.

@alex I ran some experiments and you were correct. The performance overhead was pretty minimal.

I've been using WSL to build recently, and as you mentioned, WSL has performance issues. It led me to find out that if you use the /mnt drives in WSL, your file system operations become much slower, which was the true cause of my problem.
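A quick way to check which filesystem your build directory is actually on (assuming a Linux environment with GNU coreutils; under WSL, paths under /mnt report a 9p/drvfs type rather than ext4):

```shell
# Print the filesystem type of the current directory.
# On a WSL /mnt path this shows the slow Windows bridge filesystem;
# on the WSL root it shows ext4.
df -T .
```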
