Suspected bug with configure_file()

We have a build of Zephyr running in CI (Azure DevOps). The build sometimes fails, but then passes on re-run. The failure, when it happens, points to one of these lines in a CMakeLists.txt:

configure_file(${CMAKE_CURRENT_LIST_DIR}/SEGGER_RTT_Conf.h ${CMAKE_CURRENT_LIST_DIR}/../../../../modules/debug/segger/Config/SEGGER_RTT_Conf.h)
configure_file("${CMAKE_CURRENT_LIST_DIR}/vendor-prefixes.txt" "${CMAKE_CURRENT_LIST_DIR}/../../../../zephyr/dts/bindings/vendor-prefixes.txt")

There are 4 builds that all use this CMakeLists.txt file and all want the files copied. Typically, one of these build jobs complains like so:

CMake Error at /__w/1/imx/imx-application/soc/arm/imx8mp_m7/CMakeLists.txt:6 (configure_file):
  No such file or directory

But the rest of the build jobs pass. I have also seen a case where one build job shows:

CMake Error at /__w/1/imx/imx-application/soc/arm/imx8mp_m7/CMakeLists.txt:6 (configure_file):
  No such file or directory

and the next shows:

CMake Error at /__w/1/imx/imx-application/soc/arm/imx8mp_m7/CMakeLists.txt:7 (configure_file):
  No such file or directory

Then the rest of the jobs pass. This makes me suspect a race condition between the file being copied and some other action. We use configure_file() in other places and have not seen any issues there, but in all of those cases we run it with COPYONLY.

I think using COPYONLY will likely remove this failure for us as we are not using the transformation tags in any of these files, but I would not expect it to give this failure either way.

Whenever I have logged into the server after seeing this failure, all files have been copied and contain what is expected.

I am having a hard time following what is going on with your description. Can you break it down to what configures the files and what tries to read them? Are these running in separate invocations of CMake? It is not clear to me without more context as to what is going on.

We have 4 unit test suites that each produce a binary. They get built by a build script (Twister from Zephyr) that calls cmake. They all rely on the same CMakeLists.txt file to copy a set of files using configure_file. Two of the files did not use the COPYONLY flag. When we run this in CI, these files have to be copied every time we call the build script. Usually this goes fine, but sometimes one or more of the unit tests fail to build with the above errors, while the rest of the binaries get built.

I have updated the two configure_file entries we have to also use COPYONLY, and so far after that I have not seen the error.
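For reference, a sketch of what the two calls look like with COPYONLY added, using the paths from the original post. COPYONLY only turns off substitution of `@VAR@` and `${VAR}` references in the input; it does not by itself serialize parallel CMake runs that write to the same destination.

```cmake
# Copy the files verbatim; COPYONLY disables @VAR@ / ${VAR} substitution.
configure_file(
  "${CMAKE_CURRENT_LIST_DIR}/SEGGER_RTT_Conf.h"
  "${CMAKE_CURRENT_LIST_DIR}/../../../../modules/debug/segger/Config/SEGGER_RTT_Conf.h"
  COPYONLY)
configure_file(
  "${CMAKE_CURRENT_LIST_DIR}/vendor-prefixes.txt"
  "${CMAKE_CURRENT_LIST_DIR}/../../../../zephyr/dts/bindings/vendor-prefixes.txt"
  COPYONLY)
```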

What this makes me suspect is that CMake is trying to access the destination files before the copy has fully finished. Without the COPYONLY flag, CMake is supposed to parse and update the files if it finds the correct tags in them (we do not have any need for that).

Still not following the workflow here. Are there separate invocations of CMake going on here?

  1. (cd Zephyr-build; cmake …/Zephyr-src)
  2. cmake …/someunit-test

But that sounds like it would always work. Or is this all coming from one invocation of CMake?

I believe Twister tries to do parallel builds here. I am not at my work computer right now, so I can’t confirm that. Will come back tomorrow.

You create files outside of the build directory; this is bad design.
If you start multiple cmake calls in parallel, each one will try to create the identical file.
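A minimal sketch of the alternative this points at: writing the copy into the build tree, so that each CMake invocation has its own destination. The `generated` directory name is hypothetical, and the sketch assumes the code that includes `SEGGER_RTT_Conf.h` can be pointed at that directory. Whether that is practical in a Zephyr workspace depends on how the module sets up its include paths, which the thread does not say. It also shows only the header, since `vendor-prefixes.txt` is not found through an include path.

```cmake
# Sketch only: the "generated" subdirectory name is hypothetical.
set(generated_dir "${CMAKE_CURRENT_BINARY_DIR}/generated")

# Each CMake invocation writes into its own build tree, so parallel
# invocations never touch the same destination file.
configure_file(
  "${CMAKE_CURRENT_LIST_DIR}/SEGGER_RTT_Conf.h"
  "${generated_dir}/SEGGER_RTT_Conf.h"
  COPYONLY)

# Assumption: the consumer can be pointed at the generated directory and
# finds the header there before any copy in the module's Config directory.
include_directories(BEFORE "${generated_dir}")
```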

I can’t say for certain that Twister runs CMake in sub-processes. I took a look inside the Python scripts, and there are some things there that would indicate that CMake gets invoked in different processes. I have also seen the same Twister build command finish building the different projects in a different order from run to run. That indicates to me that the OS scheduler is involved in determining this. I suppose there is always the chance that Twister has some randomization going on there too, but I do not find that too likely.

On your first point: I would go even further and say that having the build system change the source code is a bad idea. I would have much preferred either that the first build failed and I had to read some project documentation telling me to run a script that copies the files, or that the manifest file fetched forks of the relevant repos with these changes already in them. Having the build system change files behind my back is not great and could lead to all kinds of problems.

On the second point: Twister is a test runner tool created by the developers of Zephyr. Should I file a bug with them telling them not to invoke CMake as parallel processes, as I suspect they are doing now?

So, it does not sound like a CMake issue. It is likely an issue with parallelism and timing. If you run the COPYONLY version, it runs faster and “beats” the other process that is reading the files. If not, it is just slow enough that the files are not there.

So the bug is that Twister runs multiple instances of CMake in parallel? Is it documented somewhere that this should not be done, so I can point to that when filing a bug with Zephyr?

You can run many instances of cmake in parallel as long as the inputs/outputs are separate. In this case the project cmake code does not allow it. The documentation would belong in Zephyr and not cmake.

COPYONLY skips any in-file replacement, but configure_file will also not update the destination if the content is the same as the (potentially resolved) input content. This requires reading the destination file regardless of the COPYONLY flag.

CMake projects which write to the source tree (such as this one) do not support using multiple build trees with a single source tree as there’s no global lock CMake uses to guard against modifications to the source tree.
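CMake does not take such a lock on its own, but a project can take one explicitly with `file(LOCK)`. The following is a hedged sketch, not a recommendation to keep writing into the source tree: it assumes every CMake invocation that writes these files goes through the same code, since the lock is advisory and only serializes the callers that request it. The 60-second timeout is an arbitrary illustrative value. Note that locking a directory creates a `cmake.lock` file inside it, which is itself another write to the source tree.

```cmake
# Destination directory from the original post.
set(dst_dir "${CMAKE_CURRENT_LIST_DIR}/../../../../modules/debug/segger/Config")

# Take an advisory lock on the destination directory. Other CMake
# processes running this same code wait here (up to the timeout)
# until the lock is released.
file(LOCK "${dst_dir}" DIRECTORY
     GUARD FILE
     TIMEOUT 60                 # arbitrary value, for illustration only
     RESULT_VARIABLE lock_result)
if(NOT lock_result EQUAL 0)
  message(FATAL_ERROR "Could not lock ${dst_dir}: ${lock_result}")
endif()

# Only one process at a time reads, compares and rewrites the destination.
configure_file(
  "${CMAKE_CURRENT_LIST_DIR}/SEGGER_RTT_Conf.h"
  "${dst_dir}/SEGGER_RTT_Conf.h"
  COPYONLY)

# Release explicitly; GUARD FILE would otherwise hold the lock until
# the end of this CMakeLists.txt.
file(LOCK "${dst_dir}" DIRECTORY RELEASE)
```

The same pattern would have to be repeated for `vendor-prefixes.txt` and its destination directory.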