CMake files before prepocessing, details needed.

In creating a CMake parser for various reasons am finding that the syntax for CMake file only applies after substitution of the variables occurs. With other programming languages such as C there is a C preprocessor with a specific syntax, or with other languages that support quotes (Lisp) or quasiquotions (Lisp, Haskell, etc.) there is also a specific syntax. For CMake I can not find a specific syntax for such files before being preprocessed, e.g. substitution peformed on the *.cmake.in to create *.cmake.

As noted in a recent earlier question I do find mention of variable references and @-syntax but no formal grammar that notes enough detail that can be used to create a parser for such files, nor even a grammar for a variable name.

Am I missing something?


EDIT

Some of the details are in

Personal notes
ChatGPT Prompt
Convert following Flex input to PlantUML finite state machine 

%option prefix="cmListFileLexer_yy"

%option reentrant
%option yylineno
%option noyywrap
%pointer
%x STRING
%x BRACKET
%x BRACKETEND
%x COMMENT

MAKEVAR \$\([A-Za-z0-9_]*\)
UNQUOTED ([^ \0\t\r\n\(\)#\\\"[=]|\\[^\0\n])
LEGACY {MAKEVAR}|{UNQUOTED}|\"({MAKEVAR}|{UNQUOTED}|[ \t[=])*\"

%%

<INITIAL,COMMENT>\n {
  lexer->token.type = cmListFileLexer_Token_Newline;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  ++lexer->line;
  lexer->column = 1;
  BEGIN(INITIAL);
  return 1;
}

#?\[=*\[\n? {
  const char* bracket = yytext;
  lexer->comment = yytext[0] == '#';
  if (lexer->comment) {
    lexer->token.type = cmListFileLexer_Token_CommentBracket;
    bracket += 1;
  } else {
    lexer->token.type = cmListFileLexer_Token_ArgumentBracket;
  }
  cmListFileLexerSetToken(lexer, "", 0);
  lexer->bracket = strchr(bracket+1, '[') - bracket;
  if (yytext[yyleng-1] == '\n') {
    ++lexer->line;
    lexer->column = 1;
  } else {
    lexer->column += yyleng;
  }
  BEGIN(BRACKET);
}

# {
  lexer->column += yyleng;
  BEGIN(COMMENT);
}

<COMMENT>[^\0\n]* {
  lexer->column += yyleng;
}

\( {
  lexer->token.type = cmListFileLexer_Token_ParenLeft;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

\) {
  lexer->token.type = cmListFileLexer_Token_ParenRight;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

[A-Za-z_][A-Za-z0-9_]* {
  lexer->token.type = cmListFileLexer_Token_Identifier;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

<BRACKET>\]=* {
  /* Handle ]]====]=======]*/
  cmListFileLexerAppend(lexer, yytext, yyleng);
  lexer->column += yyleng;
  if (yyleng == lexer->bracket) {
    BEGIN(BRACKETEND);
  }
}

<BRACKETEND>\] {
  lexer->column += yyleng;
  /* Erase the partial bracket from the token.  */
  lexer->token.length -= lexer->bracket;
  lexer->token.text[lexer->token.length] = 0;
  BEGIN(INITIAL);
  return 1;
}

<BRACKET>([^]\0\n])+ {
  cmListFileLexerAppend(lexer, yytext, yyleng);
  lexer->column += yyleng;
}

<BRACKET,BRACKETEND>\n {
  cmListFileLexerAppend(lexer, yytext, yyleng);
  ++lexer->line;
  lexer->column = 1;
  BEGIN(BRACKET);
}

<BRACKET,BRACKETEND>[^\0\n] {
  cmListFileLexerAppend(lexer, yytext, yyleng);
  lexer->column += yyleng;
  BEGIN(BRACKET);
}

<BRACKET,BRACKETEND><<EOF>> {
  lexer->token.type = cmListFileLexer_Token_BadBracket;
  BEGIN(INITIAL);
  return 1;
}

({UNQUOTED}|=|\[=*{UNQUOTED})({UNQUOTED}|[[=])* {
  lexer->token.type = cmListFileLexer_Token_ArgumentUnquoted;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

({MAKEVAR}|{UNQUOTED}|=|\[=*{LEGACY})({LEGACY}|[[=])* {
  lexer->token.type = cmListFileLexer_Token_ArgumentUnquoted;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

\[ {
  lexer->token.type = cmListFileLexer_Token_ArgumentUnquoted;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

\" {
  lexer->token.type = cmListFileLexer_Token_ArgumentQuoted;
  cmListFileLexerSetToken(lexer, "", 0);
  lexer->column += yyleng;
  BEGIN(STRING);
}

<STRING>([^\\\0\n\"]|\\[^\0\n])+ {
  cmListFileLexerAppend(lexer, yytext, yyleng);
  lexer->column += yyleng;
}

<STRING>\\\n {
  /* Continuation: text is not part of string */
  ++lexer->line;
  lexer->column = 1;
}

<STRING>\n {
  cmListFileLexerAppend(lexer, yytext, yyleng);
  ++lexer->line;
  lexer->column = 1;
}

<STRING>\" {
  lexer->column += yyleng;
  BEGIN(INITIAL);
  return 1;
}

<STRING>[^\0\n] {
  cmListFileLexerAppend(lexer, yytext, yyleng);
  lexer->column += yyleng;
}

<STRING><<EOF>> {
  lexer->token.type = cmListFileLexer_Token_BadString;
  BEGIN(INITIAL);
  return 1;
}

[ \t\r]+ {
  lexer->token.type = cmListFileLexer_Token_Space;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

. {
  lexer->token.type = cmListFileLexer_Token_BadCharacter;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

<<EOF>> {
  lexer->token.type = cmListFileLexer_Token_None;
  cmListFileLexerSetToken(lexer, 0, 0);
  return 0;
}

%%

ChatGPT Completion

Here’s the PlantUML finite state machine for the Flex input provided:

@startuml
[*] --> INITIAL
INITIAL --> BRACKET : #?\[=*\[
INITIAL --> COMMENT : #
INITIAL --> INITIAL : \n
INITIAL --> STRING : \"
INITIAL --> [*] : <<EOF>>

BRACKET --> BRACKET : [^\]\0\n]+
BRACKET --> BRACKETEND : \]=*
BRACKET --> BRACKET : \n
BRACKET --> INITIAL : <<EOF>>

BRACKETEND --> INITIAL : \]
BRACKETEND --> BRACKETEND : \]

COMMENT --> COMMENT : [^\0\n]*
COMMENT --> INITIAL : \n

STRING --> STRING : [^\\\0\n\"]+ | \\\n
STRING --> INITIAL : \"
@enduml

Note that I have represented the initial state with the [*] notation, which is a shorthand for an initial pseudostate.

Using PalntUML Web Server

image

SVG

AFAIK, there’s no special preprocessor thing for CMake scripts. It’s done via configure_file (and others that work the same way as far as you’re concerned here). That means, @REFERENCES_LIKE_THIS@ and ${REFERENCES_LIKE_THIS} will get substituted (unless the caller paases @ONLY). Since it’s a simple text replacement, the input file does not have to be syntactically correct before substitution.

As for the grammar for the names, I’d assume it’s the same as for the variables. You may, however, not like, how vaguely defined that is. From: https://cmake.org/cmake/help/latest/manual/cmake-language.7.html#cmake-language-variables :

Variable names are case-sensitive and may consist of almost any text, but we recommend sticking to names consisting only of alphanumeric characters plus _ and - .

2 Likes