[API Design][C++ Modules]: Source listings and interface properties

ben.boeckel · April 6, 2022, 5:36pm

This issue is to track the proposed way to specify sources for C++ modules. Header units will be mentioned, but they probably have some other corner cases to work out that will need better implementations to see where any differences are that may affect CMake APIs. Hopefully they end up being orthogonal to named module APIs :fingers_crossed:. Search for Q: to find questions that still need to be answered (my gut feelings are at the end as parentheticals).

Background

C++ module TUs

There are 4 types of module TUs in the C++ standard (MSVC adds another kind as well):

Module Interface Unit: contains export module X;
Module Partition Unit (exported): contains export module X:part;; must be exported from the main module interface (IFNDR otherwise)
Module Partition Unit: contains module X:part;; must not be exported from the main module interface (IFNDR otherwise)
Module Implementation Unit: contains module X; and implements the interface of the module

MSVC’s additional TU type is:

Module Partition Implementation Unit: contains module X:part; but implements the interface of the partition. This is not supported in GCC.

Because the only difference is based on file contents, CMake would need to first scan a source file to classify it and generate a flag for use by the scanning and compilation in MSVC’s model. However, there is no syntactic difference between the two non-exported Module Partition Unit TU types (both contain module X:part;), so scanning alone cannot tell them apart.

MSVC requirements

See this post. First, MSVC requires the -interface flag for any TU with an export module (partition or otherwise). Additionally, a Module Partition Unit requires the -internalPartition flag if it is not an implementation unit. Implementation units do not have any associated flags with them. Due to this, CMake must also know the classification of the modules.
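The flag rules above can be summarized as a small lookup. This is only a sketch of the mapping as described in this post: the `-interface` and `-internalPartition` spellings are taken from the text, while `TuKind` and `msvc_module_flag` are hypothetical names.

```cpp
#include <string_view>

// The four standard module TU kinds plus MSVC's partition implementation unit.
enum class TuKind {
    Interface,                // export module X;
    ExportedPartition,        // export module X:part;
    InternalPartition,        // module X:part; (not exported, not an implementation)
    PartitionImplementation,  // module X:part; (MSVC only, implements the partition)
    Implementation,           // module X;
};

// Returns the extra flag MSVC needs for a TU of the given kind,
// or an empty string when no flag is associated with it.
constexpr std::string_view msvc_module_flag(TuKind kind) {
    switch (kind) {
        case TuKind::Interface:
        case TuKind::ExportedPartition:
            return "-interface";          // any TU with `export module`
        case TuKind::InternalPartition:
            return "-internalPartition";  // partition that is not an implementation unit
        case TuKind::PartitionImplementation:
        case TuKind::Implementation:
            return "";                    // implementation units take no flag
    }
    return "";
}
```

The two kinds that share the `module X:part;` spelling map to different results, so the build system has to be told which one a file is before it can emit the command line.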

Even though this is an MSVC-ism, it is not inconceivable that CMake may error on at least interface units not being classified properly. As the other compilers do not support the fifth kind (yet?), a warning may be emitted if such things are detected (they go against the “only one TU may have a given name” rule if they’re not known to be implementation units).

Source listing

This builds on top of CMake 3.23’s FILE_SET. That is, in order to specify C++20 modules, one must use FILE_SETS to list the sources. This is to facilitate the extra information that MSVC needs to know about TUs before scanning.

The proposal is:

target_sources(target_with_cxx_modules
  PUBLIC
    FILE_SET public_modules TYPE CXX_MODULES FILES
      m.cpp m_part_exported.cpp
    FILE_SET public_module_partitions TYPE CXX_MODULE_IMPLEMENTATION_PARTITIONS FILES
      m_part_not_exported.cpp
    FILE_SET public_header_units TYPE CXX_MODULE_HEADERS FILES
      importable_header.h
  PRIVATE
    # no fileset required for implementations
    m_impl.cpp m_part_exported_impl.cpp
    m_part_not_exported_impl.cpp # MSVC only
    FILE_SET private_modules TYPE CXX_MODULES FILES
      p.cpp p_part_exported.cpp
    FILE_SET private_header_units TYPE CXX_MODULE_HEADERS FILES
      p_importable_header.h)

Note that if any private sources end up being visible from a public module (this is essentially “private modules cannot be transitively imported from a public module”), that is an error because such files have been indicated to not be installed. This is intended to help projects avoid shipping module code that need not be visible. Modules provided by private sources will also not be available to other targets in the project (that is, the collator will not communicate their existence to dependent libraries).
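As an illustration of the visibility rule, the check could look roughly like the sketch below. It is a simplified model, not the collator's logic: `ModuleInfo` and `private_leaks` are hypothetical names, and `imports` is assumed to hold only the imports made by a module's interface units, as reported by scanning.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified per-module record for one target.
struct ModuleInfo {
    bool is_public;                    // listed in a PUBLIC file set?
    std::vector<std::string> imports;  // modules imported by its interface units
};

// Returns (public module, private module) pairs where a public module's
// interface imports a private module of the same target. Modules that are not
// in `modules` (for example, from other targets) are ignored here.
std::vector<std::pair<std::string, std::string>>
private_leaks(const std::map<std::string, ModuleInfo>& modules) {
    std::vector<std::pair<std::string, std::string>> leaks;
    for (const auto& [name, info] : modules) {
        if (!info.is_public) continue;  // private modules may import anything
        for (const auto& imported : info.imports) {
            const auto it = modules.find(imported);
            if (it != modules.end() && !it->second.is_public) {
                leaks.emplace_back(name, imported);
            }
        }
    }
    return leaks;
}
```

Checking direct imports is enough in this model: a longer chain from a public module to a private one must contain a public module that imports a private one directly, and that link is reported.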

It is unclear to me what INTERFACE module units (named or headers) actually mean. We cannot scan them as part of this target because they are not this target’s modules. Because we do not scan them, we do not know how to specify them in IMPORTED_CXX_MODULE* properties (see below) later.

Q: Do we just punt and say that INTERFACE is not a valid visibility for CXX_MODULE* fileset types (my gut says to make this an error)?

Building

See this paper for the overall strategy. Implemented on my fork for MSVC 2022 and a patched GCC.

Not much needs to change here (besides the extra information the collator will need to handle and organize).

Target properties

CMake can put all of this together for non-IMPORTED targets. However, there will need to be a way for IMPORTED targets to provide this information. The proposed properties are:

set_property(TARGET Imported::Target APPEND PROPERTY
  IMPORTED_CXX_MODULES
    "name_of_module=${_IMPORT_PREFIX}/path/to/module/interface.cpp:${_IMPORT_PREFIX}/path/to/precompiled/module.bmi"
    "name_of_module:partition=${_IMPORT_PREFIX}/path/to/module/partition/interface.cpp:${_IMPORT_PREFIX}/path/to/precompiled/module-partition.bmi")
set_property(TARGET Imported::Target APPEND PROPERTY
  IMPORTED_CXX_MODULE_HEADERS
    "${_IMPORT_PREFIX}/path/to/importable/header/unit.h:${_IMPORT_PREFIX}/path/to/precompiled/header/unit/module.bmi")

This will need to be written out at build time because this information is not known during the configure stage (namely the actual name of modules). The paths they will be at will be known, so we can statically generate include(OPTIONAL) calls (I feel like OPTIONAL is required because of things like EXCLUDE_FROM_ALL). It will be the job of the collator for each target to write out this information for each export (build and install). For installation exports, the DESTINATION for the file sets will need to be passed to the collator so that it will know where they will exist.
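For readers who want to see how an `IMPORTED_CXX_MODULES` entry of the proposed `name[:partition]=source:bmi` shape decomposes, here is a minimal parsing sketch. `ImportedModule` and `parse_imported_module` are hypothetical names. The sketch assumes neither path contains a `:` and splits the two paths at the last `:`; the proposal does not say how paths with drive letters would be handled, so treat that as an open assumption.

```cpp
#include <optional>
#include <string>
#include <string_view>

struct ImportedModule {
    std::string name;       // module name, e.g. "name_of_module"
    std::string partition;  // partition name, or empty
    std::string source;     // path to the installed module interface source
    std::string bmi;        // path to the precompiled BMI
};

// Hypothetical parser for one property entry: "name[:partition]=source:bmi".
// Assumption: paths do not contain ':' themselves.
std::optional<ImportedModule> parse_imported_module(std::string_view entry) {
    const auto eq = entry.find('=');
    if (eq == std::string_view::npos || eq == 0) return std::nullopt;

    const std::string_view full_name = entry.substr(0, eq);
    const std::string_view paths = entry.substr(eq + 1);

    // Split "source:bmi" at the last ':' and require both halves to be non-empty.
    const auto sep = paths.rfind(':');
    if (sep == std::string_view::npos || sep == 0 || sep + 1 == paths.size()) {
        return std::nullopt;
    }

    ImportedModule result;
    const auto colon = full_name.find(':');
    result.name = std::string(full_name.substr(0, colon));
    if (colon != std::string_view::npos) {
        result.partition = std::string(full_name.substr(colon + 1));
    }
    result.source = std::string(paths.substr(0, sep));
    result.bmi = std::string(paths.substr(sep + 1));
    return result;
}
```

The module name is split at the first `:` and the path pair at the last, which keeps partition names and the path separator from being confused with each other under the stated assumption.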

Installation

And speaking of DESTINATION bits, module interface cmake_install.cmake code can be known ahead of time because the only sets of files eligible for installation are those that are PUBLIC (or INTERFACE if that is allowed). BMI installation will require the collator to write out locations. Since MSVC requires them to be some kind of BMI-generating TU, they will generally be needed in the importer (transitively). Therefore the collator can error if a module unit does not provide something.

install(TARGETS target_with_cxx_modules
  # Q: Should this also install `CXX_MODULE_INTERNAL_PARTITIONS` or should it have its own scope (the type still matters on installation, but does the destination/component ever need to be distinct)?
  CXX_MODULES DESTINATION somewhere COMPONENT cxx_module_interfaces
  # Q: Should we re-use `HEADERS` or make a new `CXX_MODULE_HEADERS` scope (they're still headers after all)?
  HEADERS DESTINATION over/the COMPONENT headers
  # Q: Is there a better name for this?
  CXX_MODULE_BMIS DESTINATION rainbow COMPONENT cxx_module_bmis)

Cc: @kyle.edwards @craig.scott @marc.chevrier @brad.king

robert.maynard · April 7, 2022, 3:12pm

ben.boeckel:

target_sources(target_with_cxx_modules
  PUBLIC
    FILE_SET public_modules TYPE CXX_MODULES FILES
      m.cpp m_part_exported.cpp
    FILE_SET public_module_partitions TYPE CXX_MODULE_IMPLEMENTATION_PARTITIONS FILES
      m_part_not_exported.cpp
    FILE_SET public_header_units TYPE CXX_MODULE_HEADERS FILES
      importable_header.h
  PRIVATE
    # no fileset required for implementations
    m_impl.cpp m_part_exported_impl.cpp
    m_part_not_exported_impl.cpp # MSVC only
    FILE_SET private_modules TYPE CXX_MODULES FILES
      p.cpp p_part_exported.cpp
    FILE_SET private_header_units TYPE CXX_MODULE_HEADERS FILES
      p_importable_header.h)

Does this mean that m.cpp and m_part_exported.cpp don’t behave like other public sources, which would be compiled by both target_with_cxx_modules and consumers?

If so, I don’t like the FILE_SET type affecting the well-defined behavior of PUBLIC. Instead the internal / external mode should be captured by the TYPE and not the transitive usage.

ben.boeckel · April 7, 2022, 3:13pm

There was a question of whether a new PROTECTED visibility should be added meaning “for me, but eligible for installation”. I don’t think it has any purpose beyond source files however.

robert.maynard · April 7, 2022, 3:27pm

Maybe PUBLIC_MODULE would be a better keyword since PROTECTED doesn’t convey what use case it is for, or the behavior of it.

I would think that an INTERFACE module would be valid when you have a private module that you want embedded into multiple end-points. That way they can write tests for the non-public functions of a module.

ben.boeckel · April 7, 2022, 4:06pm

The issue I find is that there’s no transitive flag that is orthogonal. It’s basically the same issue as the $<TARGET_OBJECTS> usage requirements. Maybe if it is directly used and not transitively used the sources get added to the source list?

Is the fileset name part of the usage requirements or is it just a list of sources with an associated TYPE? Do inherited sources/headers/modules get installed as well with install(HEADERS DESTINATION)? Modules may need to be installed this way if they’re publicly visible.

brad.king · April 8, 2022, 1:51pm

For reference, the role of PUBLIC/PRIVATE/INTERFACE visibility for a FILE_SET was designed in the context of type HEADERS, where the visibility passed in target_sources controls:

Whether a header set is listed in HEADER_SETS and/or INTERFACE_HEADER_SETS. Only those sets listed in INTERFACE_HEADER_SETS are installed and exported along with the target itself. INTERFACE_HEADER_SETS does not cause the files in the header set to be propagated to dependent targets, attached to them, or installed with them.
Whether the base directories are added to INCLUDE_DIRECTORIES and/or INTERFACE_INCLUDE_DIRECTORIES. The latter include directories are propagated to dependent targets.

So, for a FILE_SET of type HEADERS with PUBLIC or INTERFACE visibility, the only transitive usage requirements are include directories.

ruoso · April 19, 2022, 7:49pm

I think modules in imported targets need to be able to cope with missing bmi files. That is, they need to include the instructions on how to parse the module interface units.

Particularly important, even if the bmi is actually there, if we don’t have the whole module graph built, we risk disabling static analysis runs later. So we should make sure that we assemble the complete information on how to parse those modules even if the bmi is already there.

I agree with that. The idea of INTERFACE visibility for modules kind of breaks down because modules require more than just “a file that can be found by the preprocessor” kind of thing.

I would like to see us making the entire module graph (including information on how to parse the interface units) available in a configured workspace. Something analogous to compile_commands.json but that describes the entire module graph. Maybe a new file entirely (c++_module_config.json for a random name idea) that would have all the modules the build system knows about and the instructions on how to parse them.

We still need the metadata to be generated and installed alongside the target (this paper suggests a location). And we still need a format for that metadata file.

ben.boeckel · April 19, 2022, 8:48pm

Yes. The BMI will only be used in the place it is expected to be in the build tree. The paths provided in the properties will only be used if they are compatible (according to the toolchain in use). AFAIK, no current compiler has such behavior right now, so we’ll just fall back to always generating BMIs until such time that “here are some candidate BMIs” flags exist.

This will necessarily not be available until build time if the module mapper information is needed (-reference or -fmodule-path flag-containing response files, or -fmodule-mapper input file). There’s also no way to provide an overall conglomerate file that is guaranteed to be up-to-date (because there’s no “run this target if any of its dependencies are updated” terminal node that works without all). Per-target files can be provided that are able to be kept up-to-date though.

Whatever other formats come up, we can generate. However, given that any such format will necessarily be lossy (e.g., how does a usage requirement of $<$<TARGET_TYPE:EXECUTABLE>:SomeMainFuncProvider> get translated in such a format?), they’ll be additional files. CMake usage requirements will likely still be the primary way for CMake to converse with CMake targets.

ruoso · April 19, 2022, 9:05pm

Sorry, I don’t think I understand what you mean here. Would you mind clarifying?

Interesting. Maybe having a per-target “modules config” is enough.

ben.boeckel · April 19, 2022, 9:16pm

As of configure/generate time, there are flags like @…/modules.mapper for MSVC and Clang while GCC has -fmodule-mapper=…/modules.mapper command line flags to the compiler. These modules.mapper files do not exist until build time when the collator has run for the target in question. Scanning doesn’t need them, but it also has zero idea what is on the other side of import foo; which means that the mapper files will be needed for any meaningful static analysis (unless the tool understands how CMake will put together that mapper file and does it on its own).

ruoso · April 19, 2022, 9:24pm

Does that mean CMake also doesn’t know the full graph at that time? Or is it more a matter that the mapper file itself needs to be part of the build rules rather than generated at configure time?

Does that mean that CMake doesn’t have a full module graph at configure time at all?

Most likely not the mapper files tho, but rather enough information to put together their own mapper files to produce bmi files that are useable by the specific static analysis tools.

It was my understanding that we would have the full module graph at the end of the configuration, and that any source change adding or removing a new module would have to trigger a reconfigure in order to add the new build rules.

If we can’t assemble the entire module graph at the end of the configure phase, it’s going to limit a lot of what we can do in the static analysis world.

ben.boeckel · April 19, 2022, 9:29pm

It has all of the nodes of the build graph. What it doesn’t know is where edges need to be added to ensure that module import orders are satisfied (unless you want CMake to reconfigure on any change to a module-using source file; I don’t).
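A rough sketch of the nodes-versus-edges distinction: the sources (nodes) are fixed at generate time, while the ordering edges only fall out once scan results say which source provides and which imports each module. `ScanResult` and `import_edges` are hypothetical names and this is not the collator's actual data model.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// What scanning one source reports at build time.
struct ScanResult {
    std::string source;                 // a node already known at generate time
    std::vector<std::string> provides;  // module names this source exports
    std::vector<std::string> imports;   // module names this source imports
};

// Returns (importer source, provider source) pairs: the edges that must be
// added so that each provider is compiled before its importers. Imports with
// no known provider are skipped in this sketch.
std::vector<std::pair<std::string, std::string>>
import_edges(const std::vector<ScanResult>& scans) {
    // First pass: which source provides which module name.
    std::map<std::string, std::string> provider;
    for (const auto& scan : scans) {
        for (const auto& name : scan.provides) {
            provider.emplace(name, scan.source);
        }
    }

    // Second pass: turn each resolvable import into an ordering edge.
    std::vector<std::pair<std::string, std::string>> edges;
    for (const auto& scan : scans) {
        for (const auto& name : scan.imports) {
            const auto it = provider.find(name);
            if (it != provider.end()) {
                edges.emplace_back(scan.source, it->second);
            }
        }
    }
    return edges;
}
```

Editing a source can change its `provides` or `imports` without adding or removing any node, so in this model the edges can be recomputed at build time without rerunning the configure step.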

Right. What is the module mapper for generated-at-build.cpp?

This means that CMake reruns on any change to a C++20 source file. It also means generating C++20 code is not supported (at least up to using or providing any modules).

This is what we got when we chose a Fortran-isomorphic module system.

ruoso · April 19, 2022, 9:37pm

That may actually be enough. If we know all the files from within the build system, and the modules from imported targets are also known, this should be enough. I presume we also know how those nodes need to be parsed, that is, that the information on how to parse each file is also available.

However, if that also means we don’t know which module name each file exports, it may require a full scan of all files just to decide where the modules are. If that’s the case, it’d be very unfortunate.

I almost feel like I want CMake to require the files exporting interfaces to have a predictable name…

ben.boeckel · April 19, 2022, 9:55pm

Yes, the scanning commands are statically known. There is the open question of Clang’s header units which apparently need to also care about the -D flags in the consumer when creating the BMI. However, I have zero idea of how to support that without also supporting dynamic nodes as well.

Yes. Again, a consequence of the Fortran-isomorphic module design. Luckily(?) with CMake’s target graph, you are limited in where you have to look (i.e., your own sources and PUBLIC sources in targets you depend on as well as their PUBLIC dependencies). I have test cases in the sandbox repo with the same module name in various places making sure that they get resolved in a way that makes sense for CMake at least (ignoring IFNDR situations it may be creating; I’m more interested in the build system behavior here).
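The limited search scope described above could be sketched as follows. This is an illustrative model only: `Target`, `Dependency`, `Project`, and `find_provider` are hypothetical names, and the breadth-first order used to pick among same-named modules is an assumption, not necessarily how the test cases in the sandbox repo resolve them.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

struct Dependency {
    std::string target;
    bool is_public;  // PUBLIC dependency (propagates to consumers)?
};

struct Target {
    std::map<std::string, bool> modules;  // module name -> in a PUBLIC file set?
    std::vector<Dependency> deps;
};

using Project = std::map<std::string, Target>;

// Returns the target that provides `wanted` as seen from `consumer`:
// the consumer's own modules first, then PUBLIC modules of its direct
// dependencies, then PUBLIC modules reachable through PUBLIC dependencies.
std::optional<std::string> find_provider(const Project& project,
                                         const std::string& consumer,
                                         const std::string& wanted) {
    const auto self = project.find(consumer);
    if (self == project.end()) return std::nullopt;
    if (self->second.modules.count(wanted) != 0) return consumer;

    std::set<std::string> seen{consumer};
    std::vector<std::string> queue;
    for (const auto& dep : self->second.deps) queue.push_back(dep.target);

    for (std::size_t i = 0; i < queue.size(); ++i) {
        const std::string name = queue[i];  // copy: queue may grow below
        if (!seen.insert(name).second) continue;
        const auto it = project.find(name);
        if (it == project.end()) continue;

        const auto found = it->second.modules.find(wanted);
        if (found != it->second.modules.end() && found->second) return name;

        // Only PUBLIC dependencies are followed past the first level.
        for (const auto& dep : it->second.deps) {
            if (dep.is_public) queue.push_back(dep.target);
        }
    }
    return std::nullopt;
}
```

Because the search never leaves the consumer's dependency closure, two unrelated targets can each provide a module with the same name without colliding in this model.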

There are a number of folks that have expressed similar feelings. That discussion can probably be re-started once what happens after a filename’s last . is resolved.

ruoso · April 19, 2022, 9:58pm

I wonder if it’s possible to have a more optimized experience if you commit to using predictable file names. Could we have alternative implementations where the one that can predict module names by the file names is just better? and then offer the “I guess I have to read every file first” as a less desirable option…

ben.boeckel · April 19, 2022, 10:08pm

I don’t know that it’s all that helpful for build performance. You need to scan anyways to discover import edges. “Knowing” the filename based on the symbol given to import it is more filesystem searching when you could just be handed it from the scanning step on a platter at the same time (or know that it’s a futile search in the first place) and not care what it is actually called. Of course, some of your non-compiler tools may prefer other naming schemes which will work just as well since there are no pressures from your build tool to conform.

Could CMake enforce some restriction? Yes. Would that involve reining in some horses that have already left the stable? Yes.

However, I’m more interested at the moment in getting anything to work at all. If we have a set of rules we want to enforce before it hits a stable CMake release, we can do that, but I don’t want to guess what those might be and then have assumptions lingering around whenever they do get decided. But it needs to be decided before CMake has a stable release (without some…intricate policy logic).