Support for embedding data in a manner equivalent to xxd

rleigh · November 4, 2020, 7:51pm

The xxd command can convert any input file to a form which can be included into a C source file or compiled directly:

% echo "Sample data" > test 
% xxd -i test
unsigned char test[] = {
  0x53, 0x61, 0x6d, 0x70, 0x6c, 0x65, 0x20, 0x64, 0x61, 0x74, 0x61, 0x0a
};
unsigned int test_len = 12;
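
As a minimal sketch of how such output might be consumed from C: the array is not NUL-terminated, so callers have to rely on test_len rather than strlen. The helper name copy_embedded below is hypothetical and only for illustration; the array and length are the xxd output shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Output of `xxd -i test`, as shown above. */
unsigned char test[] = {
  0x53, 0x61, 0x6d, 0x70, 0x6c, 0x65, 0x20, 0x64, 0x61, 0x74, 0x61, 0x0a
};
unsigned int test_len = 12;

/* Hypothetical helper: copy the embedded bytes into dst and append a NUL
 * terminator. Returns the number of data bytes copied, or 0 if dst is too
 * small to hold the data plus the terminator. */
size_t copy_embedded(char *dst, size_t dst_size)
{
    assert(dst != NULL);
    if (dst_size < (size_t)test_len + 1)
        return 0;
    memcpy(dst, test, test_len);
    dst[test_len] = '\0';
    return test_len;
}
```

A caller could pass a stack buffer, e.g. copy_embedded(buf, sizeof buf), and then treat buf as an ordinary C string.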

I have encountered several situations where being able to perform this transformation directly in CMake would be very useful. I would be happy to work on an implementation for inclusion as a CMake module.

Some thoughts:

The character type could be configurable. You might want a different character type, or std::byte
The length type could be configurable, e.g. size_t, std::size_t, uint32_t
The symbol name could be configurable
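
To illustrate the three options above, here is a sketch of what generated output might look like with an element type of uint8_t, a length type of size_t, and a symbol name of sample_data. All three choices are assumptions made for this example; nothing here reflects an existing CMake feature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical generated output: element type uint8_t, symbol sample_data. */
const uint8_t sample_data[] = {
  0x53, 0x61, 0x6d, 0x70, 0x6c, 0x65, 0x20, 0x64, 0x61, 0x74, 0x61, 0x0a
};

/* Length type size_t; sizeof keeps the length in sync with the array. */
const size_t sample_data_len = sizeof sample_data;

/* Compile-time check (C11) that the embedded size is the expected one. */
static_assert(sizeof sample_data == 12, "unexpected embedded data size");
```

Making the data const lets the compiler place it in read-only storage, which xxd's plain unsigned char output does not do.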

This could be implemented similarly to configure_file, or as a file subcommand, e.g. file(EMBED). It could work with files, or with a variable defining the content to write.

This could be useful for embedding static data (e.g. Lua, Python or SQL scripts, graphics, or audio). It could also be useful for embedding generated content, e.g. compiled Vulkan shaders. The latter would need a wrapper script to use with a custom command. Potentially the module could also be the wrapper script if invoked in script mode.

I would be interested if anyone has any thoughts or suggestions.

Kind regards,
Roger

starseeker · November 4, 2020, 8:06pm

I’ve used hexdump (https://github.com/wahern/hexdump) in the past for similar purposes. One example I’ve got handy is embedding the init script of tinyscheme in the code: https://github.com/starseeker/tinyscheme/blob/master/CMakeLists.txt#L83

Depending on how you want to implement it, the hexdump code might come in handy.

rleigh · November 4, 2020, 8:13pm

Yes, I’ll definitely take a look, thanks.

I should have mentioned it in the original post, but one of the drivers for this for me is having a portable way of doing this on all platforms without extra tools.

starseeker · November 4, 2020, 8:18pm

Heh - I’m actually building hexdump as part of the tinyscheme build process to achieve precisely that effect - no extra tools installed, works across all platforms. What you’re suggesting with file(EMBED) would let me replace the custom target completely and achieve the same result, so it sounds good - I was thinking you might literally be able to create the file sub-command by grabbing hexdump and using it on the backend.

(FWIW I’d probably vote for file(EMBEDDABLE) to avoid the impression that the file command is actually going to embed something in another file.)

brad.king · November 4, 2020, 8:29pm

In general I’m wary of any direct C code generation as a feature of upstream CMake. We have it in a few places now and they all have challenges, particularly because they offer APIs used directly by project C code, and are therefore very close to having a C library come with CMake.

For reference, existing places include:

The create_test_sourcelist command generates a test dispatch helper. We get regular bug reports about the generated code triggering various new compiler warnings. Projects are stuck with those warnings on newer compilers until they require a more recent CMake version.
The WriteCompilerDetectionHeader module writes C++ header files. Many projects found that the best way to use it is to run it locally by hand with a recent version of CMake and copy the results into the project. Otherwise they are limited by the knowledge of their minimum supported version of CMake.

In general, projects don’t like to increase their CMake version requirement just for getting updated code generation. Therefore such features do not really belong in upstream CMake.

Instead, such a feature could be offered as a separately maintained and deployed project that can be updated independently of the CMake version. Consumers can use it as an external dependency or vendor (bundle) its sources.

starseeker · November 4, 2020, 8:34pm

FWIW, the way the hexdump example works it’s not actually generating any C code per se - just converting the supplied file into something that can be used to make a C array. The calling code is what includes the data to actually define the array using the generated data:

https://github.com/starseeker/tinyscheme/blob/master/scheme.c#L108

#ifndef prompt
# define prompt "ts> "
#endif

/* Include and use the hexcode char
 * array version of the init.scm file
 * to ensure we don't have to try and
 * locate it at run time */
unsigned char init_scm[] = {
#include "init_scm.h"
};

#ifndef FIRST_CELLSEGS
# define FIRST_CELLSEGS 3
#endif

enum scheme_types {
  T_STRING=1,
  T_NUMBER=2,
  T_SYMBOL=3,

rleigh · November 4, 2020, 11:05pm

@brad.king I can certainly understand not wanting code generation as part of CMake. In this specific case though, it would be limited to constant byte arrays. There is no executable code to be generated. The array itself is nothing more than a list of hex-encoded byte values.

I can certainly develop an initial implementation for my own use out of tree. However, I thought it would be a sufficiently common situation that others might wish to take advantage of such a feature as well.

Kind regards,
Roger

Angew · November 5, 2020, 9:19am

FWIW, it’s possible to implement this purely in CMake, based on the string(HEX) command. This works for me:

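# Note: empty input is not handled.
function(xxd outVar text)
  # Encode the input as a string of hex digits, two per byte.
  string(HEX "${text}" hex)
  string(LENGTH "${hex}" hexLen)
  # foreach(RANGE) includes its stop value, so stop at the last valid index.
  math(EXPR hexLen "${hexLen} - 1")
  set(result "")
  # Walk the hex string one byte (two digits) at a time.
  foreach(idx RANGE 0 ${hexLen} 2)
    string(SUBSTRING "${hex}" ${idx} 2 char)
    list(APPEND result "0x${char}")
  endforeach()
  # Turn the list into a comma-separated initializer body.
  list(JOIN result ", " result)
  set(${outVar} "${result}" PARENT_SCOPE)
endfunction()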

ben.boeckel · November 5, 2020, 5:36pm

Note that this solution runs into practical limitations in the real world. Some compilers have a maximum line length limit that this blows past for anything moderately large. It is extremely slow (that foreach loop has terrible perf). VTK has something a bit more mature.
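
The line-length concern can be sidestepped by wrapping the output, as xxd -i does with 12 values per line. The following is only a sketch of that idea in C, not the VTK implementation; the function name format_hex_rows and the buffer-based interface are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum { BYTES_PER_LINE = 12 };  /* wrap width, matching xxd -i */

/* Hypothetical helper: format len bytes as comma-separated hex literals,
 * indented by two spaces, starting a new line every BYTES_PER_LINE values.
 * Returns the number of characters written (excluding the terminating NUL),
 * or -1 if the destination buffer is too small. */
int format_hex_rows(char *dst, size_t dst_size,
                    const unsigned char *data, size_t len)
{
    size_t used = 0;
    assert(dst != NULL && dst_size > 0);
    dst[0] = '\0';
    for (size_t i = 0; i < len; ++i) {
        const char *indent = (i % BYTES_PER_LINE == 0) ? "  " : "";
        const char *sep = (i + 1 == len) ? "\n"
                        : ((i + 1) % BYTES_PER_LINE == 0) ? ",\n"
                        : ", ";
        int n = snprintf(dst + used, dst_size - used, "%s0x%02x%s",
                         indent, (unsigned)data[i], sep);
        if (n < 0 || (size_t)n >= dst_size - used)
            return -1;  /* encoding error or buffer too small */
        used += (size_t)n;
    }
    return (int)used;
}
```

The result is meant to sit between the braces of an array definition, so no generated line grows with the size of the input.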

Angew · November 5, 2020, 5:39pm

Sure, I just whipped this up on the spot as “it’s possible,” with no claims of optimality.

ben.boeckel · November 5, 2020, 5:53pm

No problem, just trying to avoid anyone copying this as a ready-made solution and then complaining weeks or months from now (as I’m aware that SO-based programming is a thing these days).

traversaro · November 6, 2020, 6:36pm

A related project (even if not meant to be included in CMake itself) is CMakeRC (https://github.com/vector-of-bool/cmrc).

rleigh · November 14, 2020, 4:22pm

Thanks for all the suggestions. @ben.boeckel’s pointer to the VTK implementation is pretty much everything I envisaged, right down to being invokable in script mode. CMakeRC’s selection of symbol name and namespace is also nice, and something I would have wanted in addition. So I think my approach will be a hybrid of the two.

On the CMakeRC side, I do think (for the contexts I see this being used in) it’s a bit too opinionated in creating all of the code around access to the embedded data. It imposes language and standards version requirements which might not be appropriate. But if it meets your needs, it looks very nice.

In the longer term, I would certainly still like to see something like this in the upstream CMake. Maybe once one of the approaches has matured a bit more, that might be something to reconsider.

Anyway, thanks all for the discussion and suggestions. I might even follow up with my implementation once it’s complete!