How to vend dependency information/configuration

dabrahams · March 7, 2024, 7:35pm

I am working with a moderately large dependency graph of projects, some pieces of which I control. For example, I control the projects in The Hylo Group organization on GitHub, and in that org hylo depends on Swifty-LLVM and Durian, both of which depend on SwiftCMakeXCTesting.

The parts I don’t control:

SwiftCMakeXCTesting depends on SwiftSyntax and Swift’s ArgumentParser
Swifty-LLVM depends on LLVM.
hylo itself has scads of dependencies including ArgumentParser.

At each level, there are various workarounds needed to get dependencies to “play nice” with CMake. For example here you can see there’s a workaround for SwiftSyntax that I don’t want to repeat at each level of every project depending on this project. At a higher level you can see that I need a workaround for my dependency on LLVM. Since I know that ultimately, top-level projects need to be in control of all transitive dependencies, as you can see here, I tried to package up the dependency fetching part of SwiftCMakeXCTesting’s top-level project behavior so that other projects could take advantage of it. But that turns out not to work for a variety of reasons. Also it violates the convention that there should be only one call to FetchContent_MakeAvailable. So how is this sort of thing normally handled?

Note: As I transition away from Swift Package Manager, I’d like to maintain the experience, as much as possible, that each of these projects can be built on its own by default without needing any user intervention to acquire dependencies, which is why I’ve been using FetchContent.

Thanks in advance,
Dave

craig.scott · March 7, 2024, 8:56pm

In what way does that not work?

I’m not aware of that convention, nor would I want to try to enforce it. One of the behaviours of FetchContent_MakeAvailable() is that you can call it multiple times for the same dependency, and only the first call will try to do the population. Subsequent calls will not try to repeat the population, and will return to the caller with the same variables set as the first call that did the population. So from the caller’s point of view, it shouldn’t matter whether that call or an earlier one did the actual population. This is a directly supported use case; there’s no convention or need to try to ensure there is only one call to FetchContent_MakeAvailable() for a given dependency.
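
A minimal sketch of the repeated-call behaviour described above. The dependency name `example_dep`, the repository URL, and the tag are placeholders that do not come from this thread:

```cmake
cmake_minimum_required(VERSION 3.24)
project(MultipleMakeAvailable LANGUAGES NONE)

include(FetchContent)

# Hypothetical dependency; the URL and tag are placeholders.
FetchContent_Declare(example_dep
  GIT_REPOSITORY https://example.com/example_dep.git
  GIT_TAG        v1.0.0
)

# First call: populates example_dep and, if it has a CMakeLists.txt,
# adds it to the build.
FetchContent_MakeAvailable(example_dep)

# A later call (possibly from a different subproject) does not repeat
# the population; it simply returns.
FetchContent_MakeAvailable(example_dep)

# The dependency's properties can be read back either way.
FetchContent_GetProperties(example_dep)
message(STATUS "example_dep sources: ${example_dep_SOURCE_DIR}")
```

The second call could just as well live in a subdirectory's CMakeLists.txt; the caller does not need to know which call did the population.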

dabrahams · March 10, 2024, 8:31pm

Thanks for your reply (which I’m intentionally answering in reverse order); I wanted to make sure I really knew what I was talking about before posting again.

A friend of mine told me this is the way, and I could swear I found definitive confirmation of that in the official documentation, but I’m not seeing it now. The closest thing I see is this passage in the FetchContent documentation:

Content population details should be defined separately from the command that performs the actual population. This separation ensures that all the dependency details are defined before anything might try to use them to populate content. This is particularly important in more complex project hierarchies where dependencies may be shared between multiple projects.

And I do think this discipline is very hard to maintain with an arrangement like the one I’m describing.

I think it will help to look at an example. Please ignore the somewhat questionable way this file does everything at the top level, and the fact that I used NEVER instead of OPT_IN for FETCHCONTENT_TRY_FIND_PACKAGE_MODE; those are temporary. Here are the problems I see:

If I want to get the transitive dependency fetching stuff from my dependency before I actually add the subdirectories of any dependencies, there’s this little boilerplate dance I need to do.
I might be able to encapsulate that in a function, but I have to break up that dance here, because I can’t just call FetchContent_Populate unconditionally. Unlike FetchContent_MakeAvailable it isn’t resilient to being called multiple times.
To know that I need to populate SwiftCMakeXCTesting in this else() clause I have to know that my conditional dependency, GenerateSwiftXCTestMain, will populate it when I fetch its dependencies. That information should be encapsulated.
If I’ve called FetchContent_Populate on a dependency, the subsequent call to FetchContent_MakeAvailable doesn’t actually make it available and I have to explicitly add_subdirectory on it. And when FetchContent_MakeAvailable bypasses add_subdirectory it also bypasses a bunch of other logic I haven’t analyzed, so who knows what important things I may be missing here?
If any of these dependencies themselves do what I’m doing here in their XXX_FetchDependencies.cmake file, they may fetch a shared dependency before all the other dependencies I use have a chance to declare version requirements in their FIND_PACKAGE_ARGS. And AFAICT there’s no established way to string together a set of dependency declarations with their version requirements so that the top level can sort them out and decide what to actually fetch.
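
For readers without access to the linked file, the populate-then-add dance these points refer to looks roughly like the sketch below. The name `example_dep`, its URL and tag, and the `ExampleDep_FetchDependencies.cmake` file name are all hypothetical. Note that CMake 3.30 and later deprecate this single-argument form of FetchContent_Populate() (policy CMP0169).

```cmake
cmake_minimum_required(VERSION 3.24)
project(PopulateDance LANGUAGES NONE)

include(FetchContent)

# Hypothetical dependency; the URL and tag are placeholders.
FetchContent_Declare(example_dep
  GIT_REPOSITORY https://example.com/example_dep.git
  GIT_TAG        v1.0.0
)

# FetchContent_Populate() must not be called twice for the same name,
# so the call has to be guarded by hand.
FetchContent_GetProperties(example_dep)
if(NOT example_dep_POPULATED)
  FetchContent_Populate(example_dep)

  # Hypothetical file in which the dependency vends its own
  # dependency-fetching logic; read it before adding any subdirectories.
  include("${example_dep_SOURCE_DIR}/cmake/ExampleDep_FetchDependencies.cmake"
          OPTIONAL)

  # After a manual populate, the dependency has to be added explicitly.
  add_subdirectory("${example_dep_SOURCE_DIR}" "${example_dep_BINARY_DIR}")
endif()
```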

dabrahams · March 11, 2024, 6:38pm

It occurs to me that the only good way to share these workarounds I’m talking about might be to vend a collection of FindXXX.cmake files for dependencies from a repository separate from the dependencies themselves.
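
One possible shape for such a file, as a sketch only. The package name `ExampleDep`, the URL, and the tag are hypothetical, and any real workaround would replace the placeholder comment:

```cmake
# FindExampleDep.cmake: hypothetical find module kept in a shared
# repository of find modules, separate from ExampleDep itself.
include(FetchContent)
include(FindPackageHandleStandardArgs)

# Placeholder URL and tag.
FetchContent_Declare(ExampleDep
  GIT_REPOSITORY https://example.com/example_dep.git
  GIT_TAG        v1.0.0
)

# Workarounds needed to make ExampleDep "play nice" with CMake would
# go here, so that they are written once and shared by every consumer.

FetchContent_MakeAvailable(ExampleDep)

# Report the result to find_package(); FetchContent lowercases the
# name when forming the <name>_SOURCE_DIR variable.
FetchContent_GetProperties(ExampleDep)
find_package_handle_standard_args(ExampleDep
  REQUIRED_VARS exampledep_SOURCE_DIR
)
```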

dabrahams · March 22, 2024, 7:00pm

This is the strategy I’m currently pursuing.

Now I’d like to ask about the true meaning of the passage I quoted from the FetchContent documentation that begins:

Content population details should be defined separately from the command that performs the actual population.

I assumed that this idea explains why FetchContent usage is always two-phase: first a _Declare call and then (normally) a _MakeAvailable call. But if I pursue the strategy above, I’m going to end up creating and calling a single function that does both parts. And I don’t really understand why that would be a problem.

The doc goes on to give this rationale:

This separation ensures that all the dependency details are defined before anything might try to use them to populate content. This is particularly important in more complex project hierarchies where dependencies may be shared between multiple projects.

But it seems to me that if top-level projects always determine all the dependencies, as @ben.boeckel confirmed is the way to do things, that separation becomes irrelevant.

Lastly, it seems as though, even if I don’t follow the top-level-owns-dependency-resolution pattern, there’s no obvious pattern of usage that would let me take advantage of the separation that the FetchContent docs are advocating.

This is so confusing. Am I misunderstanding what the doc means by “separation?”

craig.scott · March 23, 2024, 3:22am

You’ve mentioned in the past that you have a copy of the Professional CMake book, so I’ll direct you to take a look at Section 39.3: Resolving Dependencies in the FetchContent chapter of the 17th Edition. It directly discusses the reasoning for why declaring details is a separate step from populating content, and it goes through an example that demonstrates the need for it.

For those without the book, a short version of that reasoning is that when you pull in a dependency, that may in turn pull in any number of other dependencies. In fact, any dependency may pull in its own set of dependencies, and in large projects, it is common for some dependencies to be required by more than one of the other dependencies. In other words, lower level dependencies may be required by two or more higher level dependencies. Package A might specify they want X at version 1.3, while package B might specify they want X at version 1.5. But FetchContent works by a “first to declare, wins” philosophy. Therefore, the first dependency to be encountered which declares details for X will determine the version of X that is used. The top level project may not necessarily know which of A or B will be pulled in first (it might happen many levels deep in the dependency hierarchy), so in order for the top level project to be in control of the version used, it needs to declare the details for X before either A or B is pulled in.
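
A sketch of the “first to declare, wins” rule in a top-level project, using the A, B, and X names from the paragraph above. The URLs and tags are placeholders:

```cmake
cmake_minimum_required(VERSION 3.24)
project(TopLevel LANGUAGES NONE)

include(FetchContent)

# The top level declares X first, so this is the version that is used.
# Placeholder URL and tag.
FetchContent_Declare(X
  GIT_REPOSITORY https://example.com/X.git
  GIT_TAG        v1.5
)

# A and B are hypothetical dependencies that each declare X themselves
# (for example at 1.3 and 1.5); those later declarations are ignored.
FetchContent_Declare(A
  GIT_REPOSITORY https://example.com/A.git
  GIT_TAG        v2.0
)
FetchContent_Declare(B
  GIT_REPOSITORY https://example.com/B.git
  GIT_TAG        v3.1
)

# Population starts only now, after all details have been declared.
# Whichever of A or B asks for X first gets the top level's choice.
FetchContent_MakeAvailable(A B)
```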

Now consider what would happen if FetchContent combined declaring and populating in one operation. It would not be possible to declare more than one set of dependency details before kicking off pulling in a whole subtree of dependencies. If the top level project needed to control the versions of multiple lower level dependencies, it couldn’t do it except by carefully pulling in each dependency starting from the lowest level and manually working its way up the dependency tree to ensure it never pulled in a dependency it had not directly requested first. That’s not a strategy that scales, and not one that anyone wants to have to do. It might not even be possible, since the dependency graph can change over time as projects update the things they depend on.

By separating declaring dependency details from dependency population, a project can ensure that it gets exactly the dependencies it wants. Projects can also always override dependency settings for things coming from lower down in the dependency hierarchy.

dabrahams · March 25, 2024, 6:41pm

in order for the top level project to be in control of the version used, it needs to declare the details for X before either A or B is pulled in.

Sure, but then it can also just fetch X.

If the top level project needed to control the versions of multiple lower level dependencies, it couldn’t do it except by carefully pulling in each dependency starting from the lowest level and manually working its way up the dependency tree to ensure it never pulled in a dependency it had not directly requested first.

Presumably what’s going to happen is that the top level will start out by declaring its dependencies and letting them pull in transitive dependencies, and then when a conflict is discovered, the top level will have to resolve it by declaring the conflicted dependency “up front.” At that point, why not just pull the dependency in? I guess because the top level would have to topologically sort these “up front” declarations, and if there were a lot of them, that could be painful to maintain. So okay, I understand why you want the ability to just declare a dependency, without fetching: it’s for this “up front” conflict resolution. But I don’t see any reason why, when you’re not actively trying to do conflict resolution, you want to separate the declaration from fetching, and it seems to me that my “find modules” are just fine doing both things together.

It also seems to me that if separation was always going to be important, CMake ought to be structured to accumulate a list of declarations so one top-level _MakeAvailable would pull them all in. But a system like that would really require separating dependency declaration from CMakeLists.txt, so that dependencies can all be resolved before target declarations are seen. At that point we’d be approaching package manager territory.

dabrahams · April 2, 2024, 1:55am

@craig.scott I really appreciate all your interaction so far; I submit this gentle bump in the hopes you’ll tell me whether there’s a flaw in my understanding. I’d also love to know if you think there’s a better alternative for my suite of projects than what I’m doing with find modules. Thanks.

craig.scott · April 3, 2024, 6:42am

There will always be scenarios where you can make one more change to avoid having to separate declaring from populating, or where cases seem simple enough that you don’t see the need for the separation. But I think you’re missing the point that one shouldn’t have to think about this. When you always have the two steps separated, you avoid the more subtle gotchas, you remain open to future performance improvements (a topic I haven’t mentioned, but I don’t want to go into it here and complicate this discussion even further), and you use the same pattern everywhere regardless of whether your own assessment of how the project is used or might be used considers all the potential use cases. It is predictable, safer, and more future-proof. It is also based on a number of years of experience in real-world projects where if things had not been separated, it would have made projects more complicated.

This isn’t about how CMake is structured, it’s about how projects are structured. And your description is very well aligned with how I recommend projects handle dependencies these days. I recommend they do so at one point after project() but before any logic that might create any CMake targets. This may potentially involve a mix of find_package() and FetchContent logic, depending on what the project needs to do. Putting that logic in one place gives third party package managers the best opportunity to handle providing dependencies in the most efficient way too, and a lot of the work I’ve put into FetchContent and find_package() in recent years has been all about making the integration with third party package managers more seamless.
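
A rough sketch of that layout, assuming hypothetical dependencies `example_dep` and `other_dep`, placeholder URLs and tags, and an assumed imported target name `ExampleDep::ExampleDep`:

```cmake
cmake_minimum_required(VERSION 3.24)
project(MyProject LANGUAGES CXX)

# --- Dependencies: one place, after project(), before any targets ---
include(FetchContent)

find_package(Threads REQUIRED)

FetchContent_Declare(example_dep
  GIT_REPOSITORY https://example.com/example_dep.git  # placeholder
  GIT_TAG        v1.0.0                                # placeholder
  # CMake 3.24+: try find_package(ExampleDep) before fetching.
  # FIND_PACKAGE_ARGS must be the last keyword in the call.
  FIND_PACKAGE_ARGS NAMES ExampleDep
)
FetchContent_Declare(other_dep
  GIT_REPOSITORY https://example.com/other_dep.git    # placeholder
  GIT_TAG        v2.3.0                                # placeholder
)

# All details are declared before anything is populated.
FetchContent_MakeAvailable(example_dep other_dep)

# --- Targets: only after every dependency has been resolved ---
add_executable(my_app main.cpp)  # main.cpp is assumed to exist
target_link_libraries(my_app PRIVATE
  Threads::Threads
  ExampleDep::ExampleDep         # assumed target name
)
```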

We don’t have to pull the dependency declarations out of CMakeLists.txt, although it may be possible to specify such details in a standardised format in a separate file in the future. Even then, if projects are following the structure I mentioned above, they would be well-positioned to adopt such an advancement. I’d see that as “everyone wins” territory.

craig.scott · April 3, 2024, 6:45am

I have limited time available, and that does mean I generally can’t invest time exploring other people’s projects unless they relate to specific bugs or things I’m working on. Hopefully you can navigate your way forward with the information provided so far. You may also find others willing to provide guidance and offer their help. I recommend you keep questions focused, concise, and self-contained, which will maximise the chances others will chime in to help.

dabrahams · April 3, 2024, 8:01pm

Thanks for your reply; I think I’m getting closer to understanding. I realize you’ve already spent more time than you’d like on this thread, but a few things are still unclear and I’d be grateful if you could clear them up.

Craig Scott:

There will always be scenarios where you can make one more change to avoid having to separate declaring from populating, or where cases seem simple enough that you don’t see the need for the separation. But I think you’re missing the point that one shouldn’t have to think about this. When you always have the two steps separated, you avoid the more subtle gotchas, you remain open to future performance improvements (a topic I haven’t mentioned, but I don’t want to go into it here and complicate this discussion even further), and you use the same pattern everywhere regardless of whether your own assessment of how the project is used or might be used considers all the potential use cases. It is predictable, safer, and more future-proof. It is also based on a number of years of experience in real-world projects where if things had not been separated, it would have made projects more complicated.

Sure, but when I follow the advice as I understand it, I end up with all these places where I have _Declare calls immediately followed by _MakeAvailable calls (whether I am creating Find modules to do it or not), which leads me to wonder what I’m doing wrong. If they’re supposed to be separated, it must be because there’s supposed to be some other code between them. Or am I missing something?

This isn’t about how CMake is structured, it’s about how projects are structured.

Sure. CMake itself is very small and dictates little about how it is used, leaving projects to follow an ill-defined set of conventions and best practices and leading users like me, who really want to do the right thing rather than picking up the first pattern they can find that “seems to work,” asking lots of questions like these. I’m also going to try to encapsulate the answers into higher level components so that other people I work with don’t have to make the same journey. CMake simply has too many knobs for usage to scale up otherwise.

That said, CMake increasingly provides high-level modules like FetchContent that help establish some structural patterns for projects. Those patterns are what I meant by “how CMake is structured.” I’m saying CMake doesn’t provide a system (in a module) for accumulating dependencies so that they can all be fetched only when the entire graph is known. If it was really important to keep declaration and fetching separate in all circumstances, such a facility would be needed (or top-level projects would have to know the entire recursive dependency graph and resolve it themselves—which is what I thought @ben.boeckel confirmed is “the way” but I now understand is not what you’re advocating). Otherwise a near-leaf with one dependency to fetch is always going to _Declare and _MakeAvailable immediately afterwards, which is effectively no separation.

And your description is very well aligned with how I recommend projects handle dependencies these days.

That sounds promising, if by “my description” you mean the strategy I’m following. But sadly my literal brain wonders what description you’re referring to.

I recommend they do so at one point after project() but before any logic that might create any CMake targets. This may potentially involve a mix of find_package() and FetchContent logic, depending on what the project needs to do.

I’m putting “Find” modules that fetch in CMAKE_MODULE_PATH, and then using find_package() everywhere. It seems like that should allow other projects to override how dependencies are found and selected by either putting their own Find modules earlier in the module path or simply getting the dependencies some other way ahead of time. That’s my best guess at how to make the system work.
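
On the consuming side, that arrangement might look like the sketch below. It assumes a hypothetical `cmake/find-modules` directory containing a hypothetical `FindExampleDep.cmake` that does the fetching:

```cmake
cmake_minimum_required(VERSION 3.24)
project(Consumer LANGUAGES NONE)

# Append rather than prepend: any directories a parent project has
# already put on CMAKE_MODULE_PATH are searched first, so the parent's
# own FindExampleDep.cmake would take precedence over this one.
list(APPEND CMAKE_MODULE_PATH
  "${CMAKE_CURRENT_SOURCE_DIR}/cmake/find-modules"
)

# Resolved by the first FindExampleDep.cmake on the module path.
find_package(ExampleDep REQUIRED)
```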

Putting that logic in one place gives third party package managers the best opportunity to handle providing dependencies in the most efficient way too, and a lot of the work I’ve put into FetchContent and find_package() in recent years has been all about making the integration with third party package managers more seamless.

We don’t have to pull the dependency declarations out of CMakeLists.txt,

How else could a third-party package manager take advantage of the fact that the logic is “in one place” today? Or do you just mean that someone can manually derive the information from code more easily if it’s all together?