Why not add `dict()` to CMake?

zaufi · April 1, 2020, 4:09am

Hi,

It is really painful to store and manipulate key-value pairs in CMake nowadays. Why not add a dict() command for that?

dict(INSERT <dict_var> <key> <value>)
dict(REMOVE <dict_var> <key>)
dict(GET  <dict_var> <key> <output_variable>)

foreach(key value IN DICTS <dict_var>...)
...

Even this minimal set of commands would help A LOT.

ben.boeckel · April 1, 2020, 4:53am

I think the basic problem is the stringly-typed language. I think the main issue would be coming up with an encoding scheme for these variables in the current “type system”. How would (should?) the cache editors handle them? Cache file format for storage of them? What happens when you do message("${dict_var}")? How about some_call(${dict_var})? Passing them to scripts via -D?

Unfortunately, I think that this is something that depends on a ship that sailed long ago (late 90’s).

I think there was a proposal/issue for a json command that manipulates JSON objects (which do have an unambiguous string representation if we say “no extra whitespace”; a json(PRETTY) could exist though). Cache editing would still be an interesting question though.

zaufi · April 1, 2020, 6:31am

Nowadays CMake supports “nested lists” (i.e. a list variable wrapped into [ / ] can be atomically inserted/extracted into/from another list).
Yeah, the support from the list() command for that syntax still needs some care, but “manually” it’s quite possible.

What about wrapping dict pairs into { / }? (Yeah, let’s reserve ( / ) for tuples ;).
A dict can then be stored similarly to lists (w/ ; separator) and have items wrapped into { / }…
Then message command just prints its raw representation (like for list variables nowadays).
Same for some_call() – it receives a raw string, but being used w/ dict() it’ll be possible to do smth w/ it (just like list vars today)…

ben.boeckel · April 1, 2020, 12:07pm

That may work, not sure if it’d be acceptable though. I’ve had ideas for CMake to cache the various parsings of variables into variables (e.g., if we parse a variable as a list, cache the vector<string_view> for the list components). For this to be performant, that caching may need to exist.

Other questions that come to mind (about usage and encoding into CMake’s strings):

What is our behavior with multi-value dictionaries?
Multi-value keys in the dictionary?
What happens if you set a dictionary key to a dictionary? List? What about a dictionary value?

All of these are possible because the dictionary has a CMake string encoding and string(APPEND) and set() exist.

@slurps-mad-rips Thoughts? I know you’ve delved into data structures in CMake before.

brad.king · April 1, 2020, 12:25pm

CMake’s language does not have support for arbitrary content in lists because there is no escape mechanism for [ or ], and the escaping for ; only survives one layer. Variable values are always strings and are only interpreted as lists (or numbers, versions, etc.) in specific contexts. I don’t think trying to offer magic interpretation as a dictionary will work well.
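
A minimal sketch of the one-layer escaping described above, meant to be run in script mode with `cmake -P`. The helper `count_args`, the variable names, and the file name are made up for illustration; they are not part of CMake or of anyone's proposal in this thread.

```cmake
# Run with: cmake -P escape_demo.cmake   (hypothetical file name)

# Hypothetical helper: reports how many list elements arrive in ARGN.
function(count_args label)
  list(LENGTH ARGN n)
  message(STATUS "${label}: ${n} element(s): ${ARGN}")
endfunction()

# Three intended elements; the middle one contains an escaped semicolon.
set(items "a" "b\;c" "d")

# First layer: list commands still honor the escape.
list(LENGTH items n)
message(STATUS "in caller: ${n} element(s)")

# Second layer: unquoted expansion consumes the escape while splitting
# arguments, so the callee no longer sees "b;c" as a single element.
count_args("in callee" ${items})
```

The caller should report 3 elements, while the callee is expected to report 4, because the backslash was consumed by the first split and ARGN is rebuilt with plain semicolons.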

zaufi · April 1, 2020, 12:30pm

Any other thoughts? %)
The lack of dicts in CMake causes a lot of pain and limitations

zaufi · April 1, 2020, 12:32pm

And yes, recently I used list-of-lists in “manual” mode… that was hard… and tons of CMake code

McMartin · April 1, 2020, 5:07pm

This article gives some nice ideas: Everything You Never Wanted To Know About CMake - DEV Community

zaufi · April 2, 2020, 6:47am

@McMartin ,

Many thanks! Reading it, I feel excited and disappointed at the same time %) – excited at how daring the author can be when he wants something… and deeply disappointed in CMake

I don’t think trying to offer magic interpretation as a dictionary will work well.

The discussion I’ve started here is to find a reliable way to have dict() naturally in CMake, because it is obviously a highly wanted feature. I admit it could be hard and complex… but IMO it’s worth it!

PS
@marc.chevrier, @brad.king

From the blog post above:

We either must rely on content being stored in a CMake safe format, regexes, or reading one byte at a time in the CMake language (No thank you! ).

Yet another use case for the recently rejected foreach(... IN STRINGS...)

ferdnyc · April 2, 2020, 2:30pm

CMake is not a programming language, so I’m somewhat wary of efforts to build advanced high-level features into the language itself. At the risk of sounding dismissive, I’ll note that beyond assertions of “a lot of pain and limitations” caused by the lack of an arbitrary key-value datatype, nobody’s really provided even a single concrete example of why it’s needed. Desired, sure, but not unavoidably necessary.

(The blog post linked above — which also fails to demonstrate the actual need for a dict type — is even more dismissive:

Perhaps this might change in the future, and we’ll get a real honest to god dictionary type, but don’t hold your breath. I’d rather see the CMake language go away entirely than get a dictionary type.

So I don’t feel so bad, I guess is what I’m saying.)

Besides, for non-arbitrary key-value pairings, CMake’s existing list and string variables already serve. This sort of pattern is used all the time in Find modules:

foreach(name record1 record2 record3 record4)
  foreach(k key1 key2 key3 key4)
    set(_${name}_${k} <something to determine ${k} value for ${name}>)
  endforeach()
endforeach()

Yes, the list of keys has to be predefined, you can’t easily have each record contain different keys, or arbitrary keys that are only determined at runtime. (Though any of the _${name}_${k} variables can certainly be empty, if that key isn’t needed for a particular ${name}.)

Can key-value pairs be passed between CMake contexts easily? No. Can they be passed with a little work? Most likely yes, using target PROPERTIES (which can, recall, be arbitrary). There’s nothing to stop you from later doing this:

foreach (name ...)
  add_custom_target(${name})
  foreach (k ...)
    set_target_properties(${name} PROPERTIES ${k} "${_${name}_${k}}")
  endforeach()
endforeach()

Then, as the set_target_properties() documentation notes, “You can use any prop value pair you want and extract it later with the get_property() or get_target_property() command.”
So, if you were to include the above code in a “dict.cmake” file, and then later in your CMakeLists.txt you were to call:

set(CMAKE_MODULE_PATH ".")
include(dict)

include(CMakePrintHelpers)
cmake_print_properties(TARGETS record1 record2 PROPERTIES key1 key2)

What would happen? Well, I just tried it, so I can tell you. Using this as my modified “dict creation” loop:

foreach(name record1 record2 record3 record4)
  foreach(n 1 2 3 4)
    set("_${name}_key${n}" "${name}.value${n}")
  endforeach()
endforeach()

I get this:

$ cmake .
-- 
 Properties for TARGET record1:
   record1.key1 = "record1.value1"
   record1.key2 = "record1.value2"
 Properties for TARGET record2:
   record2.key1 = "record2.value1"
   record2.key2 = "record2.value2"

-- Configuring done
-- Generating done
-- Build files have been written to: /tmp

Effectively, each target is a dict of its properties.
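
As a rough sketch of the "extract it later" half, the snippet below reads individual entries back out of one of those targets. It assumes it runs in a CMakeLists.txt after the record1 target and its key properties have been created as in the loops above; `no_such_key` is a made-up key used only to show the miss case.

```cmake
# Look up a single "dict" entry on the record1 target.
get_target_property(value record1 key1)
message(STATUS "record1[key1] = ${value}")

# A missing key yields "<variable>-NOTFOUND" rather than an error,
# and values ending in -NOTFOUND are false in if().
get_target_property(missing record1 no_such_key)
if(NOT missing)
  message(STATUS "record1 has no no_such_key entry")
endif()

# get_property() can test for presence explicitly.
get_property(has_key TARGET record1 PROPERTY key2 SET)
message(STATUS "record1 has key2: ${has_key}")
```

Since targets only exist while a project is being configured, this approach does not carry over to `cmake -P` script mode.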

alex · June 7, 2021, 6:11pm

That is objectively false. Someone wrote a ray tracer in CMake, if there was any doubt.

ben.boeckel · June 7, 2021, 6:31pm

I mean, sure. People have also written things in “minimal” Turing-complete languages. CMake is Turing-complete, so these things are possible. It is certainly not a general-purpose language however.

I have wanted dictionaries myself in CMake, but the way CMake’s language is “typed” today doesn’t really lend itself to them working all that well. Even the list representation is a hack with language-wide effects, which is why my questions about the behavior of dictionaries in various contexts are so important. I suspect if dictionaries ever get support, it’ll be like the JSON support today: only usable by specific functions and no syntactic way to do dict[key] (it’d have to be dict(GET dict key OUTPUT output) or something). I suspect internally it could be represented however, but CMake typically roundtrips its types via strings, so there would need to be some unambiguous string representation (or we ban the type from the cache).
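
For reference, here is a small sketch of what "usable only by specific functions" looks like with the existing JSON support, treating a JSON object held in an ordinary string variable as a dictionary. It assumes CMake 3.19 or newer, where `string(JSON)` is available; the variable names, keys, and file name are illustrative only.

```cmake
# Run with: cmake -P json_dict.cmake   (hypothetical file name)
cmake_minimum_required(VERSION 3.19)

# The "dict" is just a string holding a JSON object.
set(my_dict "{}")

# Insert: values must be valid JSON, so strings carry their own quotes.
string(JSON my_dict SET "${my_dict}" "compiler" "\"gcc\"")
string(JSON my_dict SET "${my_dict}" "std" "17")

# Look up a single key.
string(JSON compiler GET "${my_dict}" "compiler")
message(STATUS "compiler = ${compiler}")

# Iterate over all keys by index.
string(JSON count LENGTH "${my_dict}")
if(count GREATER 0)
  math(EXPR last "${count} - 1")
  foreach(i RANGE ${last})
    string(JSON key MEMBER "${my_dict}" ${i})
    string(JSON val GET "${my_dict}" "${key}")
    message(STATUS "${key} -> ${val}")
  endforeach()
endif()

# Remove a key and print the raw string representation.
string(JSON my_dict REMOVE "${my_dict}" "std")
message(STATUS "${my_dict}")
```

Every operation goes through `string(JSON ...)`, and the variable stays a plain string throughout, so it is best passed quoted.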

hex · June 7, 2021, 7:32pm

With all due respect, your argument is not objective. To be objective, you should at least consider the most fundamental design goals of the CMake project. We don’t want a programming language for a build system. A build system should be limited to just that, building a project. Any extra complexity means additional maintenance cost.

alex · June 8, 2021, 3:01am

I believe that implementing richer data structures as opaque types that must be manipulated through dedicated functions is perfectly workable. In particular, I don’t see why the full set of JSON types isn’t workable: dicts and lists being the top two. It seems to me that JSON is already able to represent every value in the CMake language (currently only strings), so we already have an unambiguous serialization for everything…

If the author directly asks for a special data structure, they should be responsible for moving them between various representations. This is more what I mean, step by step:

dict(NEW my_dict)                   # my_dict is a hash table in memory
dict(INSERT my_dict "key" "value")  # insert (key, value) into the dict
message(STATUS "${my_dict}")        # only now do we cast to a JSON string
my_func(${my_dict})                 # pass the dict as a single argument
my_func("${my_dict}")               # pass JSON string version

The same exact thing could be used to support real, nested lists with arbitrary content.

list(NEW my_list "a" "b;c" "d")
list(LENGTH my_list len)         # len is 3
message(STATUS "${my_list}")     # -- ["a", "b;c", "d"]
my_func(${my_list})              # receives three arguments
my_func("${my_list}")            # receives one argument, JSON repr
list(TO_CMAKE_LIST my_list var)  # dev warning: loss of fidelity

You could also get these compound structures out of JSON

json(PARSE [=[ ["a", "b;c", "d" ] ]=] VAR var TYPE ty)
# var is a structured list now, "${ty}" is "list" (to match the function name
# and make cmake_language(CALL ${ty}) practical)

This really doesn’t seem impractical. It’s more about “should we” than about “can we”, and I think the argument for lists is strong, while for dicts it is somewhat weaker.

Sure, but I don’t see how that’s relevant. I could observe that CMake has a script mode that is (ab)used pretty widely. A domain-specific programming language is still a programming language.

That’s not a very good language design ethos. Why don’t we all program in Brainf**k? None of these fancy named variables are unavoidably necessary. We can all just agree that the CMake reserved variables have documented, reserved positions on the tape.

I’m being a little facetious, but you’re assuming some universal set of values re: pain-versus-productivity for a build DSL, which there is not.

The design goals do not make CMake any less Turing complete or any less a programming language. Disliking the fact that CMake is a programming language does not make it any less true. You can argue for the merits (or not) of adding any particular feature to CMake, but you can’t build your argument on a falsehood.

It’s also worth noting that you can view CMake as a related pair of programming languages:

The scripting language that runs during the configure step
The inputs to the generator step; those inputs are themselves a sort of sub-Turing-complete DSL for describing builds; the semantics are declarative. Generator expressions are able to carry out substantial computations on strings, too.

So for any feature, you’d have to consider which language (or both) to add it to.

ben.boeckel · June 8, 2021, 11:21am

But you’re missing one big point: these need unambiguous string representations so that the cache can store them and my_func(${my_list}) has known semantics (because that call will do argument splitting on the variable). If we do not have this, the following cannot be done with them:

passing as arguments to functions quoted or unquoted (the type is lost across the boundary)
storing into the cache file
editing with the cache editor
writing to a file
reading from a file

Which leaves them…pretty useless since CMake doesn’t have expressions and doing anything with them requires functions to extract any useful information out of them.

To get any of these things, one would need to revamp the CMake calling conventions which…good luck. The C++ code can do it, but it massively complicates the CMake code conventions.

alex · June 8, 2021, 7:12pm

To avoid being vague or confusing by being general, I’m just going to talk about the idea of a first-class List data structure, as created by list(NEW my_list).

I don’t think I missed this… list(NEW) does have an unambiguous string representation, as JSON.

If my_list is a string, then it’s split by semicolons before splatting arguments. If my_list is first-class, then it’s just… already split.
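
To make the string case concrete, here is a short sketch of how today's string-encoded lists behave at a call site. `show_args` is a hypothetical helper and the file name is made up; the first-class list(NEW) variant discussed here does not exist in CMake and is not shown.

```cmake
# Run with: cmake -P splat_demo.cmake   (hypothetical file name)

# Hypothetical helper that reports how many arguments it received.
function(show_args)
  message(STATUS "ARGC = ${ARGC}")
endfunction()

set(my_list "a;b;c")

show_args(${my_list})     # unquoted: split on semicolons, three arguments
show_args("${my_list}")   # quoted: one argument holding the whole string
```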

I don’t see why the type would need to be lost across the boundary. Why should a list be serialized before calling a CMake function? I don’t think that makes sense. There are no ABI concerns or anything in the CMake runtime. You can’t load C++ functions to call from CMake. If the calling convention really does need everything to be a string for some reason I’m not seeing, you could pass a list by its name in the parent scope (copy on write?).

Store them as JSON in the cache with type LIST. It’ll be parsed as JSON when the cache is loaded. The cache editor could even be augmented with a special editor for lists that clearly shows element boundaries.

Again, writing it to a file can be done easily by serializing it to JSON via "${my_list}". If the user would prefer something else, they can write the code themselves to do that (no reason foreach(IN LISTS) shouldn’t work).

ferdnyc · June 9, 2021, 6:52am

Well, then, it’s clearly already possible to implement support for handling dict data using only the existing language features. QED. (Heck, I basically already demonstrated that targets and their properties are shallow dicts in disguise. They just can’t live in the cache, so you’d need functions to serialize/deserialize them to strings.)

See, this is the crux of the issue for me, the second half of my first statement that I’d treated as implicit: “CMake is not a programming language, CMake is a build system generator.” The act of maintaining CMake and its featureset is NOT “language design” at all. It is build system generator development and maintenance. The CMake language exists solely to facilitate that primary function, it is not a primary function unto itself. We don’t “program in” CMake at all. We generate build systems with it. Arguing language features from other than that context feels like scope creep.

To paraphrase Ben, people can (ab)use all sorts of things for purposes other than their intended use. That doesn’t have any relevance to that original intent. Look at all of the crazy things people have implemented in Minecraft: counters, accumulators, all manner of low- to medium-complexity digital circuits. They didn’t then turn around and use that to argue that Minecraft should support digital circuit design natively. (Or, heck, some of them probably did. Hopefully any such requests were rejected as being wildly out of scope.)

I don’t think it’s unfair to ask, regarding any proposed new feature, “What problem does this solve?” More specifically, for CMake, “What problem in the domain of build system generation does this solve?”

I’m not even arguing that there aren’t good answers to that question, I’m simply saying that we haven’t heard them. So far when it comes to adding dict types to the language, the answer to “what problem does this solve?” seems to always be, “The lack of dict types in the CMake language.” Which, in addition to being incredibly circular, is simply not an answer that has anything to do with build system generation.

Knitschi · October 15, 2021, 3:12pm

I hope they are working hard on a CMake 4.0 that will be bundled with a fully featured scripting language.

ferdnyc · October 18, 2021, 3:56am

Hope is good! Hope is powerful. Hope can motivate us and sustain us. “All you need is hope!”, say the sort of people who’ve never experienced what it’s like to have nothing left except hope.

…CMake’s development does not require hope, though. It’s open-source and developed in full view of the public. You always have the option to visit https://gitlab.kitware.com/cmake/cmake/ , where you can see exactly what’s being worked on (and even get involved, if you’re so inclined).

Currently, the CMake developers are working hard on CMake 4.0… minus 0.78. CMake 3.22-rc1 was just published yesterday, and marks the first step in finalizing what will become the next CMake release.

You can also view the public forks of the CMake developers, to get an idea of what they might be working on next. From exploring Ben and Brad’s forks, as well as those of Craig Scott and Sean McBride, I can tell you two things:

Ben’s fork is like looking in a very cluttered mirror, as it is absolutely packed with dozens of branches, most of them years out of date and thousands of commits behind the CMake parent repository.¹
A distinct lack of branches containing 4.0-bound rewrites of the entire CMake language is in evidence…

But, hey, mustn’t lose hope! It could be that, despite working out in the open for many years, the CMake developers right here in this discussion² might be toiling away in secret on a total rewrite of CMake for version 4.0.³

Notes

¹ (I’m always amused when someone comments on the supposed clutter of my own repos, because what they never realize is that for all of the abortive experiments and half-baked ideas they come across in my GitHub or Gitlab repos, that’s just the stuff that I actually got far enough along with that I bothered to push the branch. There will easily be another 2 or 3 TIMES as many branches that never even made it off my local development machine.)
² (Both Brad and Ben are core CMake developers, as the little shields next to their names indicate.)
³ (It is not lost on me, BTW, that my note 1 can be taken as essentially an argument supporting the possibility that ben.boeckel might have a major CMake 4.0 rewrite stuck away in a branch somewhere that simply hasn’t been pushed to Gitlab yet.)

Knitschi · October 18, 2021, 7:33am

A distinct lack of branches containing 4.0-bound rewrites of the entire CMake language is in evidence…

I think we are still in the phase where the community has to convince the developers that integrating a full featured scripting language is something that must be done for the long-term health of CMake. So whenever a problem pops up that is rooted in the lacking CMake language I make an annoying comment hoping that it will slowly brainwash the developers into putting that on their agenda for the future. That is all that post was about