In creating a CMake parser for various reasons am finding that the syntax for CMake file only applies after substitution of the variables occurs. With other programming languages such as C there is a C preprocessor with a specific syntax, or with other languages that support quotes (Lisp) or quasiquotions (Lisp, Haskell, etc.) there is also a specific syntax. For CMake I can not find a specific syntax for such files before being preprocessed, e.g. substitution peformed on the *.cmake.in
to create *.cmake
.
As noted in a recent earlier question I do find mention of variable references and @-syntax but no formal grammar that notes enough detail that can be used to create a parser for such files, nor even a grammar for a variable name.
Am I missing something?
EDIT
Some of the details are in
Personal notes
ChatGPT Prompt
Convert following Flex input to PlantUML finite state machine
%option prefix="cmListFileLexer_yy"
%option reentrant
%option yylineno
%option noyywrap
%pointer
%x STRING
%x BRACKET
%x BRACKETEND
%x COMMENT
MAKEVAR \$\([A-Za-z0-9_]*\)
UNQUOTED ([^ \0\t\r\n\(\)#\\\"[=]|\\[^\0\n])
LEGACY {MAKEVAR}|{UNQUOTED}|\"({MAKEVAR}|{UNQUOTED}|[ \t[=])*\"
%%
<INITIAL,COMMENT>\n {
lexer->token.type = cmListFileLexer_Token_Newline;
cmListFileLexerSetToken(lexer, yytext, yyleng);
++lexer->line;
lexer->column = 1;
BEGIN(INITIAL);
return 1;
}
#?\[=*\[\n? {
const char* bracket = yytext;
lexer->comment = yytext[0] == '#';
if (lexer->comment) {
lexer->token.type = cmListFileLexer_Token_CommentBracket;
bracket += 1;
} else {
lexer->token.type = cmListFileLexer_Token_ArgumentBracket;
}
cmListFileLexerSetToken(lexer, "", 0);
lexer->bracket = strchr(bracket+1, '[') - bracket;
if (yytext[yyleng-1] == '\n') {
++lexer->line;
lexer->column = 1;
} else {
lexer->column += yyleng;
}
BEGIN(BRACKET);
}
# {
lexer->column += yyleng;
BEGIN(COMMENT);
}
<COMMENT>[^\0\n]* {
lexer->column += yyleng;
}
\( {
lexer->token.type = cmListFileLexer_Token_ParenLeft;
cmListFileLexerSetToken(lexer, yytext, yyleng);
lexer->column += yyleng;
return 1;
}
\) {
lexer->token.type = cmListFileLexer_Token_ParenRight;
cmListFileLexerSetToken(lexer, yytext, yyleng);
lexer->column += yyleng;
return 1;
}
[A-Za-z_][A-Za-z0-9_]* {
lexer->token.type = cmListFileLexer_Token_Identifier;
cmListFileLexerSetToken(lexer, yytext, yyleng);
lexer->column += yyleng;
return 1;
}
<BRACKET>\]=* {
/* Handle ]]====]=======]*/
cmListFileLexerAppend(lexer, yytext, yyleng);
lexer->column += yyleng;
if (yyleng == lexer->bracket) {
BEGIN(BRACKETEND);
}
}
<BRACKETEND>\] {
lexer->column += yyleng;
/* Erase the partial bracket from the token. */
lexer->token.length -= lexer->bracket;
lexer->token.text[lexer->token.length] = 0;
BEGIN(INITIAL);
return 1;
}
<BRACKET>([^]\0\n])+ {
cmListFileLexerAppend(lexer, yytext, yyleng);
lexer->column += yyleng;
}
<BRACKET,BRACKETEND>\n {
cmListFileLexerAppend(lexer, yytext, yyleng);
++lexer->line;
lexer->column = 1;
BEGIN(BRACKET);
}
<BRACKET,BRACKETEND>[^\0\n] {
cmListFileLexerAppend(lexer, yytext, yyleng);
lexer->column += yyleng;
BEGIN(BRACKET);
}
<BRACKET,BRACKETEND><<EOF>> {
lexer->token.type = cmListFileLexer_Token_BadBracket;
BEGIN(INITIAL);
return 1;
}
({UNQUOTED}|=|\[=*{UNQUOTED})({UNQUOTED}|[[=])* {
lexer->token.type = cmListFileLexer_Token_ArgumentUnquoted;
cmListFileLexerSetToken(lexer, yytext, yyleng);
lexer->column += yyleng;
return 1;
}
({MAKEVAR}|{UNQUOTED}|=|\[=*{LEGACY})({LEGACY}|[[=])* {
lexer->token.type = cmListFileLexer_Token_ArgumentUnquoted;
cmListFileLexerSetToken(lexer, yytext, yyleng);
lexer->column += yyleng;
return 1;
}
\[ {
lexer->token.type = cmListFileLexer_Token_ArgumentUnquoted;
cmListFileLexerSetToken(lexer, yytext, yyleng);
lexer->column += yyleng;
return 1;
}
\" {
lexer->token.type = cmListFileLexer_Token_ArgumentQuoted;
cmListFileLexerSetToken(lexer, "", 0);
lexer->column += yyleng;
BEGIN(STRING);
}
<STRING>([^\\\0\n\"]|\\[^\0\n])+ {
cmListFileLexerAppend(lexer, yytext, yyleng);
lexer->column += yyleng;
}
<STRING>\\\n {
/* Continuation: text is not part of string */
++lexer->line;
lexer->column = 1;
}
<STRING>\n {
cmListFileLexerAppend(lexer, yytext, yyleng);
++lexer->line;
lexer->column = 1;
}
<STRING>\" {
lexer->column += yyleng;
BEGIN(INITIAL);
return 1;
}
<STRING>[^\0\n] {
cmListFileLexerAppend(lexer, yytext, yyleng);
lexer->column += yyleng;
}
<STRING><<EOF>> {
lexer->token.type = cmListFileLexer_Token_BadString;
BEGIN(INITIAL);
return 1;
}
[ \t\r]+ {
lexer->token.type = cmListFileLexer_Token_Space;
cmListFileLexerSetToken(lexer, yytext, yyleng);
lexer->column += yyleng;
return 1;
}
. {
lexer->token.type = cmListFileLexer_Token_BadCharacter;
cmListFileLexerSetToken(lexer, yytext, yyleng);
lexer->column += yyleng;
return 1;
}
<<EOF>> {
lexer->token.type = cmListFileLexer_Token_None;
cmListFileLexerSetToken(lexer, 0, 0);
return 0;
}
%%
ChatGPT Completion
Here’s the PlantUML finite state machine for the Flex input provided:
@startuml
[*] --> INITIAL
INITIAL --> BRACKET : #?\[=*\[
INITIAL --> COMMENT : #
INITIAL --> INITIAL : \n
INITIAL --> STRING : \"
INITIAL --> [*] : <<EOF>>
BRACKET --> BRACKET : [^\]\0\n]+
BRACKET --> BRACKETEND : \]=*
BRACKET --> BRACKET : \n
BRACKET --> INITIAL : <<EOF>>
BRACKETEND --> INITIAL : \]
BRACKETEND --> BRACKETEND : \]
COMMENT --> COMMENT : [^\0\n]*
COMMENT --> INITIAL : \n
STRING --> STRING : [^\\\0\n\"]+ | \\\n
STRING --> INITIAL : \"
@enduml
Note that I have represented the initial state with the [*] notation, which is a shorthand for an initial pseudostate.
Using PalntUML Web Server