Request for Comment: string() camelCase, snake_case, split, and capitalize functionality

Introduction

I would like to add some more functionality to CMake’s string() processing function, and would like to discuss these changes and receive feedback. Once I’ve gotten some feedback, I’ll look at submitting a PR for these changes.

I would like to implement the following subcommands for the string() function:

string(TOUPPERCAMELCASE ...)
string(TOLOWERCAMELCASE ...)
string(TOUPPERSNAKECASE ...)
string(TOLOWERSNAKECASE ...)
string(CAPITALIZE ...)
string(SPLIT ...)

This is functionality that I have found myself reaching for and re-implementing across projects. Below I’ll describe the inputs and functionality that I would propose for each.

Please let me know what you think.

Proposed function signatures

string(CAPITALIZE <INPUT_STRING> [<OUT_VAR>])

  • INPUT_STRING – any given input string. If OUT_VAR is not specified, INPUT_STRING is mutated.
  • OUT_VAR – the variable in which to store the output. If OUT_VAR is specified, INPUT_STRING remains unchanged.

This would perform the following on the string:

  1. Convert all alphabetical characters to lower-case
  2. If the first character of the string is alphabetic, capitalize it.
  3. If the first character of the string is not alphabetic, error? Act like TOLOWER?
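
For discussion, steps 1 and 2 can be approximated today with existing string() subcommands. This is only a sketch: capitalize() is a hypothetical helper name, not an existing or proposed CMake command, and it settles the open question in step 3 by behaving like TOLOWER, which is just one of the two options.

```cmake
# Hypothetical helper (not part of CMake): emulates the proposed CAPITALIZE
# behaviour using only existing string() subcommands.
function(capitalize input out_var)
  string(LENGTH "${input}" len)
  if(len EQUAL 0)
    # Nothing to capitalize; avoid an out-of-range SUBSTRING below.
    set(${out_var} "" PARENT_SCOPE)
    return()
  endif()

  string(TOLOWER "${input}" lowered)        # step 1: lower-case everything
  string(SUBSTRING "${lowered}" 0 1 first)  # first character
  string(SUBSTRING "${lowered}" 1 -1 rest)  # remainder (may be empty)
  string(TOUPPER "${first}" first)          # step 2; no-op if not alphabetic
  set(${out_var} "${first}${rest}" PARENT_SCOPE)
endfunction()

capitalize("hELLO world" result)
message(STATUS "${result}")  # prints "-- Hello world"
```

Because TOUPPER leaves non-alphabetic characters untouched, an input such as "1abc" comes back simply lower-cased rather than raising an error.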

Rationale

This is useful for templatization, and can be passed to functions like configure_file. Capitalizing the first token in a string is a pretty commonly desired string operation.

string(TOUPPERCAMELCASE|TOLOWERCAMELCASE|TOUPPERSNAKECASE|TOLOWERSNAKECASE <INPUT_STRING> <FORMAT> [<OUT_VAR>])

  • INPUT_STRING – any given input string. If OUT_VAR is not specified, INPUT_STRING is mutated.
  • FORMAT – The format of the input string (UPPERCAMELCASE, LOWERCAMELCASE, UPPERSNAKECASE, LOWERSNAKECASE)
  • OUT_VAR – the variable in which to store the output. If OUT_VAR is specified, INPUT_STRING remains unchanged.

This would perform the following on the string:

  1. Validate that the input string is in the format specified.
  2. Map the string in one format to the other format

The FORMAT specifier is necessary because there is some overlap in these formats. For example, “foo” in lower_snake_case is also “foo” in lowerCamelCase.

Perhaps a version of this function could also take a list of tokens, and then assemble them in the correct format?
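
To make the mapping step concrete, here is a rough sketch of two of the conversions using only existing commands. The helper names lower_snake_to_upper_camel and camel_to_lower_snake are hypothetical, the code assumes CMake 3.4 or newer (for string(APPEND)), and it skips the validation step, so it behaves permissively on malformed input.

```cmake
# Hypothetical helper: lower_snake_case -> UpperCamelCase.
function(lower_snake_to_upper_camel input out_var)
  string(REPLACE "_" ";" tokens "${input}")  # split on underscores
  set(result "")
  foreach(token IN LISTS tokens)
    string(LENGTH "${token}" len)
    if(len EQUAL 0)
      continue()  # ignore empty tokens from leading or doubled underscores
    endif()
    string(SUBSTRING "${token}" 0 1 first)
    string(SUBSTRING "${token}" 1 -1 rest)
    string(TOUPPER "${first}" first)
    string(TOLOWER "${rest}" rest)
    string(APPEND result "${first}${rest}")
  endforeach()
  set(${out_var} "${result}" PARENT_SCOPE)
endfunction()

# Hypothetical helper: lowerCamelCase or UpperCamelCase -> lower_snake_case.
function(camel_to_lower_snake input out_var)
  # Insert an underscore wherever a lower-case letter or digit is
  # followed by an upper-case letter, then lower-case the result.
  string(REGEX REPLACE "([a-z0-9])([A-Z])" "\\1_\\2" result "${input}")
  string(TOLOWER "${result}" result)
  set(${out_var} "${result}" PARENT_SCOPE)
endfunction()

lower_snake_to_upper_camel("foo_bar_baz" camel)
camel_to_lower_snake("fooBarBaz" snake)
message(STATUS "${camel} ${snake}")  # prints "-- FooBarBaz foo_bar_baz"
```

Runs of capitals are a known weak spot of this regex approach: "HTTPServer" would come out as "httpserver", which is one reason a built-in implementation with well-defined rules would be preferable.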

Rationale

This is useful for templatization with configure_file, or custom commands which generate artifacts that may require different formats. This is somewhat complex in the general case (an arbitrary string), but if we limit it to converting between these common formats, it should be trivial.

Surprisingly, I have found myself wanting to convert between camelCase and snake_case quite a bit when invoking external scripts as part of my buildsystem.

string(SPLIT <INPUT_STRING> <OUTPUT_VAR> [<SEPARATOR>])

  • INPUT_STRING – The string to split.
  • OUTPUT_VAR – The variable in which to store the resulting list of tokens
  • SEPARATOR – An optional string specifying the separator to split on. If not provided, whitespace (\s) is assumed as the separator.

Rationale

Sometimes you want to split an input string into its individual components, and then process them. This is useful when you want to derive information about a file from its name programmatically. For example, given a header file like module_functionality_subfunctionality.h, you may want those three tokens when making decisions.

Currently you can kind of do this with separate_arguments, but you need to first process the string into one of the given command line formats.

Here’s my take on the proposal. Note that I’m just a CMake user unaffiliated with the project itself, so take these as just “a user’s opinion.”

Overall

  • For me, this would be a useful addition to CMake’s functionality, given its usability as a simple templating mechanism.
  • The vast majority of string() subcommands (including those added more recently) interpret their parameters as strings, not as variables. I would prefer the new subcommands to stick to this convention for consistency.

CAPITALIZE

  • Regarding point 3., my preference would be to act like TOLOWER, i.e. do not error out just because the first character is not alphabetic.

Format conversions

  • I would move FORMAT immediately after the main subcommand argument, and give the values a FROM prefix. The syntax would therefore be e.g. string(TOUPPERCAMELCASE FROMLOWERSNAKECASE <INPUT_STRING> <OUT_VAR>).
  • I know the current subcommands TOUPPER and TOLOWER are single-word, but the new ones look too long for this to me, and thus hard to read. I’d probably prefer TO_UPPER_SNAKE_CASE style here.
  • I’d consider adding a choice between strict mode (errors out when input doesn’t seem to match desired format) and loose/permissive mode (produces suboptimal result for suboptimal input, but works).
  • The mentioned variant which takes a list and joins its elements in the desired format sounds useful and worth adding to me.
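
As an illustration of what a strict mode might check, the following sketch validates lower_snake_case with a regular expression. The helper name is_lower_snake_case and the exact pattern are assumptions for the sake of discussion; the proposal does not define what counts as valid input.

```cmake
# Hypothetical helper: sets out_var to TRUE if input looks like
# lower_snake_case (assumed here to mean: starts with a lower-case letter,
# then lower-case letters or digits, with single underscores between words).
function(is_lower_snake_case input out_var)
  if(input MATCHES "^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
    set(${out_var} TRUE PARENT_SCOPE)
  else()
    set(${out_var} FALSE PARENT_SCOPE)
  endif()
endfunction()

is_lower_snake_case("foo_bar" ok)
if(NOT ok)
  # Strict mode: refuse to convert input that does not match the format.
  message(FATAL_ERROR "input is not lower_snake_case")
endif()
```

A permissive mode would simply skip this check and convert whatever it is given.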

SPLIT

  • I will refrain from commenting on this one. Since splitting a CMake string just means replacing the separators with ;, I tend not to use such wrappers and do the replace directly instead (since then I don’t have to study nuances in the wrapper’s behaviour). Do not take this as arguing against its inclusion; I understand people may want to use these for reading clarity etc. Just that I am not the target user, so I have no usage comments to add.
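
The direct replacement described above can be sketched as follows, using the header name from the original post. The variable names are illustrative only.

```cmake
set(header "module_functionality_subfunctionality.h")

# Drop the directory and extension, leaving the underscore-separated stem.
get_filename_component(stem "${header}" NAME_WE)

# A CMake list is a ;-separated string, so replacing the separator splits it.
string(REPLACE "_" ";" tokens "${stem}")

list(LENGTH tokens count)
list(GET tokens 0 module)
message(STATUS "${count} tokens, module='${module}'")  # prints "-- 3 tokens, module='module'"
```

One nuance of this approach: if the input already contains a ; character, it yields extra list elements, which is the sort of edge case a dedicated SPLIT subcommand would have to define.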

Again, let me re-iterate that these are just one user’s opinions/preferences and in no way official.

@Tim_Finnegan thanks for raising this. Let’s defer SPLIT for now to keep this discussion focused on the case encoding operations.

Does anyone know if there is a well-known name for the concept of encoding multiple words in an identifier?

Can you give a more complete example use case? It seems pretty specialized.

Also, one could reduce signature mode combinations like this:

string(TRANSFORM_CASE FROM <format> TO <format> <input-string> <output-var>)

@Tim_Finnegan I am not sure I understand the goal of the FROM argument.
It seems logical to expect the requested output format regardless of the input format!