The CUDA manual says that device code can now be optimized at link time (device link-time optimization) if we compile for a special LTO architecture, e.g. lto_70.
Until now I have been using compute_70 for all CUDA targets in my project, via this CMake command:
set(CMAKE_CUDA_ARCHITECTURES 70)
CMake also accepts values such as “70-virtual” there, but I have not found any way to request the lto_70 architecture described in the CUDA manual linked above.
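For comparison, these are the forms CMake documents for the architecture value (CMake 3.18+). They are alternatives, not meant to be set one after another. The nvcc flags in the comments are approximately what CMake generates for each, and as far as I can tell none of the documented suffixes yields code=lto_70:

```cmake
# Pick ONE of these; each line shows roughly what CMake passes to nvcc.
set(CMAKE_CUDA_ARCHITECTURES 70)          # code=[compute_70,sm_70]: PTX + SASS
set(CMAKE_CUDA_ARCHITECTURES 70-real)     # code=[sm_70]: SASS only
set(CMAKE_CUDA_ARCHITECTURES 70-virtual)  # code=[compute_70]: PTX only
# No documented suffix here maps to code=lto_70.
```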
How can I use lto_70? I have tried simply adding the -dlto flag to enable link-time optimization, but combined with the flags CMake sets, nvcc fails with this error:
nvcc fatal : '-dlto' conflicts with '-gencode' to control what is generated; use 'code=lto_<arch>' with '-gencode' instead of '-dlto' to request lto intermediate
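One possible workaround, sketched under assumptions and untested: set the CUDA_ARCHITECTURES target property to OFF so CMake emits no architecture flags of its own (the source of the conflict), then pass the LTO flags by hand: code=lto_70 at compile time, as the error message suggests, and -dlto at the device-link step. The target name `app` and the sources `main.cu`/`kernels.cu` are hypothetical placeholders.

```cmake
cmake_minimum_required(VERSION 3.18)
project(lto_demo LANGUAGES CXX CUDA)

# Hypothetical target and source files, for illustration only.
add_executable(app main.cu kernels.cu)

set_target_properties(app PROPERTIES
  # OFF: CMake adds no -gencode/--generate-code flags for this target,
  # so nothing it generates can conflict with -dlto.
  CUDA_ARCHITECTURES OFF
  # Device LTO operates on relocatable device code plus a device-link step.
  CUDA_SEPARABLE_COMPILATION ON)

# Compile step: store the LTO intermediate for compute capability 7.0,
# roughly: nvcc -dc -gencode arch=compute_70,code=lto_70 kernels.cu
target_compile_options(app PRIVATE
  "$<$<COMPILE_LANGUAGE:CUDA>:-gencode=arch=compute_70,code=lto_70>")

# Device-link step only: optimize across translation units and emit sm_70
# code, roughly: nvcc -dlink -dlto -arch=sm_70 main.o kernels.o
target_link_options(app PRIVATE
  "$<DEVICE_LINK:-dlto>"
  "$<DEVICE_LINK:-arch=sm_70>")
```

The trade-off is that CMake no longer manages the architecture list for this target, so any additional architectures would have to be added to both option lists by hand.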