CMake files before prepocessing, details needed.

EricGT · March 21, 2023, 1:10pm

In creating a CMake parser for various reasons am finding that the syntax for CMake file only applies after substitution of the variables occurs. With other programming languages such as C there is a C preprocessor with a specific syntax, or with other languages that support quotes (Lisp) or quasiquotions (Lisp, Haskell, etc.) there is also a specific syntax. For CMake I can not find a specific syntax for such files before being preprocessed, e.g. substitution peformed on the *.cmake.in to create *.cmake.

As noted in a recent earlier question I do find mention of variable references and @-syntax but no formal grammar that notes enough detail that can be used to create a parser for such files, nor even a grammar for a variable name.

Am I missing something?

EDIT

Some of the details are in

github.com

Kitware/CMake/blob/master/Source/LexerParser/cmListFileLexer.in.l

%{
/* Distributed under the OSI-approved BSD 3-Clause License.  See accompanying
   file Copyright.txt or https://cmake.org/licensing for details.  */
/*

This file must be translated to C and modified to build everywhere.

Run flex >= 2.6 like this:

  flex --nounistd -DFLEXINT_H --noline -ocmListFileLexer.c cmListFileLexer.in.l

Modify cmListFileLexer.c:
  - remove trailing whitespace:              sed -i 's/\s*$//' cmListFileLexer.c
  - remove blank lines at end of file:       sed -i '${/^$/d;}' cmListFileLexer.c
  - #include "cmStandardLexer.h" at the top: sed -i '1i#include "cmStandardLexer.h"' cmListFileLexer.c

*/

/* IWYU pragma: no_forward_declare yyguts_t */

This file has been truncated. show original

Personal notes

ChatGPT Prompt

Convert following Flex input to PlantUML finite state machine 

%option prefix="cmListFileLexer_yy"

%option reentrant
%option yylineno
%option noyywrap
%pointer
%x STRING
%x BRACKET
%x BRACKETEND
%x COMMENT

MAKEVAR \$\([A-Za-z0-9_]*\)
UNQUOTED ([^ \0\t\r\n\(\)#\\\"[=]|\\[^\0\n])
LEGACY {MAKEVAR}|{UNQUOTED}|\"({MAKEVAR}|{UNQUOTED}|[ \t[=])*\"

%%

<INITIAL,COMMENT>\n {
  lexer->token.type = cmListFileLexer_Token_Newline;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  ++lexer->line;
  lexer->column = 1;
  BEGIN(INITIAL);
  return 1;
}

#?\[=*\[\n? {
  const char* bracket = yytext;
  lexer->comment = yytext[0] == '#';
  if (lexer->comment) {
    lexer->token.type = cmListFileLexer_Token_CommentBracket;
    bracket += 1;
  } else {
    lexer->token.type = cmListFileLexer_Token_ArgumentBracket;
  }
  cmListFileLexerSetToken(lexer, "", 0);
  lexer->bracket = strchr(bracket+1, '[') - bracket;
  if (yytext[yyleng-1] == '\n') {
    ++lexer->line;
    lexer->column = 1;
  } else {
    lexer->column += yyleng;
  }
  BEGIN(BRACKET);
}

# {
  lexer->column += yyleng;
  BEGIN(COMMENT);
}

<COMMENT>[^\0\n]* {
  lexer->column += yyleng;
}

\( {
  lexer->token.type = cmListFileLexer_Token_ParenLeft;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

\) {
  lexer->token.type = cmListFileLexer_Token_ParenRight;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

[A-Za-z_][A-Za-z0-9_]* {
  lexer->token.type = cmListFileLexer_Token_Identifier;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

<BRACKET>\]=* {
  /* Handle ]]====]=======]*/
  cmListFileLexerAppend(lexer, yytext, yyleng);
  lexer->column += yyleng;
  if (yyleng == lexer->bracket) {
    BEGIN(BRACKETEND);
  }
}

<BRACKETEND>\] {
  lexer->column += yyleng;
  /* Erase the partial bracket from the token.  */
  lexer->token.length -= lexer->bracket;
  lexer->token.text[lexer->token.length] = 0;
  BEGIN(INITIAL);
  return 1;
}

<BRACKET>([^]\0\n])+ {
  cmListFileLexerAppend(lexer, yytext, yyleng);
  lexer->column += yyleng;
}

<BRACKET,BRACKETEND>\n {
  cmListFileLexerAppend(lexer, yytext, yyleng);
  ++lexer->line;
  lexer->column = 1;
  BEGIN(BRACKET);
}

<BRACKET,BRACKETEND>[^\0\n] {
  cmListFileLexerAppend(lexer, yytext, yyleng);
  lexer->column += yyleng;
  BEGIN(BRACKET);
}

<BRACKET,BRACKETEND><<EOF>> {
  lexer->token.type = cmListFileLexer_Token_BadBracket;
  BEGIN(INITIAL);
  return 1;
}

({UNQUOTED}|=|\[=*{UNQUOTED})({UNQUOTED}|[[=])* {
  lexer->token.type = cmListFileLexer_Token_ArgumentUnquoted;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

({MAKEVAR}|{UNQUOTED}|=|\[=*{LEGACY})({LEGACY}|[[=])* {
  lexer->token.type = cmListFileLexer_Token_ArgumentUnquoted;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

\[ {
  lexer->token.type = cmListFileLexer_Token_ArgumentUnquoted;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

\" {
  lexer->token.type = cmListFileLexer_Token_ArgumentQuoted;
  cmListFileLexerSetToken(lexer, "", 0);
  lexer->column += yyleng;
  BEGIN(STRING);
}

<STRING>([^\\\0\n\"]|\\[^\0\n])+ {
  cmListFileLexerAppend(lexer, yytext, yyleng);
  lexer->column += yyleng;
}

<STRING>\\\n {
  /* Continuation: text is not part of string */
  ++lexer->line;
  lexer->column = 1;
}

<STRING>\n {
  cmListFileLexerAppend(lexer, yytext, yyleng);
  ++lexer->line;
  lexer->column = 1;
}

<STRING>\" {
  lexer->column += yyleng;
  BEGIN(INITIAL);
  return 1;
}

<STRING>[^\0\n] {
  cmListFileLexerAppend(lexer, yytext, yyleng);
  lexer->column += yyleng;
}

<STRING><<EOF>> {
  lexer->token.type = cmListFileLexer_Token_BadString;
  BEGIN(INITIAL);
  return 1;
}

[ \t\r]+ {
  lexer->token.type = cmListFileLexer_Token_Space;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

. {
  lexer->token.type = cmListFileLexer_Token_BadCharacter;
  cmListFileLexerSetToken(lexer, yytext, yyleng);
  lexer->column += yyleng;
  return 1;
}

<<EOF>> {
  lexer->token.type = cmListFileLexer_Token_None;
  cmListFileLexerSetToken(lexer, 0, 0);
  return 0;
}

%%

ChatGPT Completion

Here’s the PlantUML finite state machine for the Flex input provided:

@startuml
[*] --> INITIAL
INITIAL --> BRACKET : #?\[=*\[
INITIAL --> COMMENT : #
INITIAL --> INITIAL : \n
INITIAL --> STRING : \"
INITIAL --> [*] : <<EOF>>

BRACKET --> BRACKET : [^\]\0\n]+
BRACKET --> BRACKETEND : \]=*
BRACKET --> BRACKET : \n
BRACKET --> INITIAL : <<EOF>>

BRACKETEND --> INITIAL : \]
BRACKETEND --> BRACKETEND : \]

COMMENT --> COMMENT : [^\0\n]*
COMMENT --> INITIAL : \n

STRING --> STRING : [^\\\0\n\"]+ | \\\n
STRING --> INITIAL : \"
@enduml

Note that I have represented the initial state with the [*] notation, which is a shorthand for an initial pseudostate.

Using PalntUML Web Server

SVG

fenrir · March 21, 2023, 2:25pm

AFAIK, there’s no special preprocessor thing for CMake scripts. It’s done via configure_file (and others that work the same way as far as you’re concerned here). That means, @REFERENCES_LIKE_THIS@ and ${REFERENCES_LIKE_THIS} will get substituted (unless the caller paases @ONLY). Since it’s a simple text replacement, the input file does not have to be syntactically correct before substitution.

As for the grammar for the names, I’d assume it’s the same as for the variables. You may, however, not like, how vaguely defined that is. From: https://cmake.org/cmake/help/latest/manual/cmake-language.7.html#cmake-language-variables :

Variable names are case-sensitive and may consist of almost any text, but we recommend sticking to names consisting only of alphanumeric characters plus _ and - .