Linking against CUDA::cuda_driver not working right with libcuda stub, wants libcuda.so.1 not libcuda.so

eugenewalker · February 13, 2023, 4:45pm

We build our project cusz using CMake so that it links against CUDA::cuda_driver. For CI we build inside a Docker container where we have to rely on linking against the libcuda.so stub in the absence of the actual CUDA driver. This works fine for building cusz. However, when we ldd the resulting cusz library we see it expects libcuda.so.1 and not libcuda.so. This becomes a problem when we subsequently try to link a dependent package against cusz, also in a Docker container where we only have the stub, as the cusz library is not happy with the stub libcuda.so. It wants libcuda.so.1, which it cannot find, because the stubs only include .so named libraries.

I’m a bit confused because on the one hand, I can inspect the CMakeCache.txt from the successful build of cusz and see it finds the stub libcuda.so:

$> cat CMakeCache.txt
...
//Path to a library.
CUDA_cuda_driver_LIBRARY:FILEPATH=/spack/opt/spack/linux-ubuntu20.04-x86_64/gcc-11.1.0/cuda-11.8.0-bmwmhhbfgn7c3zokorcdz2pvv5qam3x2/lib64/stubs/libcuda.so
...

But then when we try to link against the resulting cusz library, it wants libcuda.so.1 which it can’t have in the Docker container environment where only the stub is available.

Does anyone know what is going on here?

ben.boeckel · February 13, 2023, 7:32pm

I don’t know about the CUDA-specific bits here, but this is how ELF linking works. The libfoo.so is what the linker looks for and inside of this there is (usually) a DT_SONAME entry. This name is what is put into the created library to look for at runtime. For the stub to work properly, it must use the same DT_SONAME as the real one so that it works later. At runtime, a stub runtime library must be provided as well if the real one is not around.
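
To check what a given library actually records, one option is to read its dynamic section with readelf. This is a minimal sketch, assuming binutils is installed; the helper names parse_dyn_tag, soname_of and needed_by are made up for illustration.

```shell
# Extract the bracketed value of one dynamic-section tag (e.g. SONAME, NEEDED)
# from `readelf -d` output supplied on stdin.
parse_dyn_tag() {
    sed -n "s/.*($1).*\[\(.*\)]/\1/p"
}

# Print the DT_SONAME recorded inside a shared library (empty if none).
soname_of() {
    readelf -d "$1" 2>/dev/null | parse_dyn_tag SONAME
}

# Print the DT_NEEDED entries, i.e. the names searched for at runtime.
needed_by() {
    readelf -d "$1" 2>/dev/null | parse_dyn_tag NEEDED
}
```

Running soname_of on the stub libcuda.so and needed_by on the library built against it should show the same name, if the explanation above applies to this build.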

Cc: @robert.maynard

eugenewalker · February 13, 2023, 7:47pm

Thank you for the quick response.

The stubs were installed with CUDA Toolkit, and the stubs are all named “*.so” with no further extension:

$> ls -l $CUDA_ROOT/targets/x86_64-linux/lib/stubs
total 2088
-rwxr-xr-x 1 root root  38872 Feb 12 10:57 libcublasLt.so
-rwxr-xr-x 1 root root  55256 Feb 12 10:57 libcublas.so
-rwxr-xr-x 1 root root  62176 Feb 12 10:57 libcuda.so
-rwxr-xr-x 1 root root   9400 Feb 12 10:57 libcufft.so
-rwxr-xr-x 1 root root  13496 Feb 12 10:57 libcufftw.so
-rwxr-xr-x 1 root root   9400 Feb 12 10:57 libcurand.so
-rwxr-xr-x 1 root root  29880 Feb 12 10:57 libcusolverMg.so
-rwxr-xr-x 1 root root 111800 Feb 12 10:57 libcusolver.so
-rwxr-xr-x 1 root root  58552 Feb 12 10:57 libcusparse.so
-rwxr-xr-x 1 root root   5304 Feb 12 10:58 libnppc.so
-rwxr-xr-x 1 root root 246968 Feb 12 10:58 libnppial.so
-rwxr-xr-x 1 root root 128184 Feb 12 10:58 libnppicc.so
-rwxr-xr-x 1 root root 173240 Feb 12 10:58 libnppidei.so
-rwxr-xr-x 1 root root 251064 Feb 12 10:58 libnppif.so
-rwxr-xr-x 1 root root  87224 Feb 12 10:58 libnppig.so
-rwxr-xr-x 1 root root  38072 Feb 12 10:58 libnppim.so
-rwxr-xr-x 1 root root 410808 Feb 12 10:58 libnppist.so
-rwxr-xr-x 1 root root   9400 Feb 12 10:58 libnppisu.so
-rwxr-xr-x 1 root root  54456 Feb 12 10:58 libnppitc.so
-rwxr-xr-x 1 root root 214200 Feb 12 10:58 libnpps.so
-rwxr-xr-x 1 root root  46872 Feb 12 10:57 libnvidia-ml.so
-rwxr-xr-x 1 root root  13496 Feb 12 10:58 libnvjpeg.so
-rwxr-xr-x 1 root root   5304 Feb 12 10:57 libnvrtc.so

Is this a mistake?

What you say makes me wonder: how can the cusz library be created at all, when it links with libcuda but there is no libcuda.so.1, only libcuda.so? Why does that work, while subsequently linking a dependent library against the cusz library fails with:

/usr/bin/ld: /spack/opt/spack/linux-ubuntu20.04-x86_64/gcc-11.1.0/cusz-0.3-ypll5jqnzonur4yhwkhq4jihu2gh3in3/lib/libcuszhuff.so: undefined reference to `cuMemGetAddressRange_v2'
collect2: error: ld returned 1 exit status
make[2]: *** [tools/hdf5_filter/CMakeFiles/pressio_hdf5_filter_tool.dir/build.make:137: tools/hdf5_filter/pressio_hdf5_filter_tool] Error 1

ben.boeckel · February 13, 2023, 9:04pm

It works because the stubs mock the development environment of the libraries, but do not provide the runtime. Alas, the linker will usually want to search dependent libraries when linking to try its best to satisfy symbol usages. There may be a flag to tell the linker to ignore this situation.

eugenewalker · February 13, 2023, 9:17pm

Ahh ok, this makes sense. I think this SO post has some bearing here:

robert.maynard · February 14, 2023, 2:00pm

This isn’t a DT_SONAME issue. This is the linker trying to resolve all symbols, which happens with linkers like ld.bfd. In those cases, when building against stubs, you need to explicitly tell the linker to allow unresolved symbols in shared libraries with --allow-shlib-undefined.

eugenewalker · February 14, 2023, 3:56pm

Thank you Rob and Ben! The build worked perfectly with -Wl,--allow-shlib-undefined. I added this using -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined. Is this the way you would recommend adding this option? The other question is, is it OK to use this option even when linking against the real driver, or is there a good reason to apply it conditionally, only when linking against the stub?

Thank you!

ben.boeckel · February 14, 2023, 4:02pm

The linker’s default check is there to catch missing libraries, so turning it off can hide problems (though they easily surface elsewhere). Note that the flag is only really needed on consumers of targets using the stub, but it might make sense to add it to the CUDA target interface when backed by the stubs (though this falls over when a shared target uses CUDA purely PRIVATE, as the link usage requirements get lost).
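
One way to apply the flag only when the stub is in use could look like the sketch below. The helper names is_cuda_stub and stub_linker_flags are hypothetical, and the check assumes the stub always lives in a directory named stubs, as it does in the CMakeCache.txt entry shown earlier in the thread.

```shell
# Succeed if the resolved libcuda path looks like the toolkit stub.
# Assumption: the stub sits in a directory called "stubs".
is_cuda_stub() {
    case "$1" in
        */stubs/libcuda.so) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the extra linker flag only when the stub is in use; print nothing otherwise.
stub_linker_flags() {
    if is_cuda_stub "$1"; then
        printf '%s\n' '-Wl,--allow-shlib-undefined'
    fi
}

# Possible usage (not executed here; "$cuda_lib" would come from the
# CUDA_cuda_driver_LIBRARY entry in the dependency's CMakeCache.txt):
#   flags=$(stub_linker_flags "$cuda_lib")
#   cmake -S . -B build \
#       -DCMAKE_EXE_LINKER_FLAGS="$flags" \
#       -DCMAKE_SHARED_LINKER_FLAGS="$flags"
```

This keeps the linker’s default check active on machines that have the real driver, at the cost of the consumer needing to know how its dependency was built.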

robertu · February 14, 2023, 4:32pm

So I guess the question is: should this be fixed within CMake’s FindCUDAToolkit, or do all packages potentially linking to a library built with CUDA stubs need to fix this on their side? It seems to me that the consuming library shouldn’t need to know whether the underlying package was built with stubs or not.

Of course we can carry around a workaround in the short term.

ben.boeckel · February 14, 2023, 9:23pm

I agree, however CMake has no mechanism to express “add this to my direct consumer’s usage requirements” that doesn’t hide behind PRIVATE usage. The closest is putting it on the target when it is representing stubs, but that’s not exactly correct either (as privately using the CUDA toolkit will stop propagation).

craig.scott · February 14, 2023, 9:25pm

@marc.chevrier Raising awareness of the above since you are looking into .tbd stubs at the moment.

ben.boeckel · February 14, 2023, 9:26pm

.tbd stubs are fine as they are (AFAIK) just an optimization for the linker to not have to crack open a binary and parse information every time and can instead just consume the information relevant to linking directly.

Artem-B · March 7, 2023, 11:40pm

eugenewalker: “Does anyone know what is going on here?”

I can provide a bit of context. libcuda.so is a bit of a special snowflake.

Unlike most of the other libraries that ship with CUDA SDK, libcuda.so is provided by the NVIDIA driver, which is only installed on the machines where NVIDIA GPUs are present. This is often not the case for the machines where one builds CUDA apps.

So, in order to be able to build a functional CUDA app which uses the driver API, the executable has to be linked with stubs/libcuda.so. The stub is essentially an interface library, which only provides the symbols and allows the linker to finish linking the executable w/o complaining about the missing symbols. The stub’s DT_SONAME, libcuda.so.1, intentionally does not match its file name, libcuda.so, because we do not want the dynamic linker to ever load stubs/libcuda.so if we were to run the executable linked with it.

Instead, when the executable is run, the dynamic linker will go searching for libcuda.so.1 among the shared libraries in the standard search path. On machines where the NVIDIA driver is installed, it will find the real libcuda.so.1, which normally points at the versioned libcuda.so.X.Y provided by driver vX.Y. On machines w/o the driver, execution is expected to fail because libcuda.so.1 is missing.
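
A quick way to see which of those two cases a machine or container falls into is to look for libcuda.so.1 in the loader cache. This is a rough sketch, assuming a glibc system with ldconfig; the helper names are made up, and the check ignores LD_LIBRARY_PATH and RPATH/RUNPATH entries, which the dynamic linker also consults.

```shell
# Succeed if the given library name is listed in `ldconfig -p` style
# output supplied on stdin (the first field of each entry is the name).
cache_has_soname() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Succeed if the loader cache on this machine knows about libcuda.so.1.
driver_in_loader_cache() {
    ldconfig -p 2>/dev/null | cache_has_soname 'libcuda.so.1'
}

# Possible usage:
#   if driver_in_loader_cache; then echo "real driver present"; fi
```

Inside a build container that only has the toolkit stubs, driver_in_loader_cache would be expected to fail, since the stub is installed as libcuda.so and sits outside the standard search path.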

If one needs the application to run on machines where libcuda.so.1 is unavailable, then the standard approach is to not link with libcuda.so (stub or real) but instead call dlopen("libcuda.so.1") and use dlsym to find the pointers to the appropriate driver API functions. This is how libcudart.so operates under the hood.