The notation used for representing the parser file syntax in this document is described here.
A .parser file consists of parser-file-declarations:
parser‑file | → | parser‑file‑declaration* |
parser‑file‑declaration | → | include‑declaration | using‑namespace‑declaration | parser‑declaration |
If the .parser file contains Unicode identifiers with non-ASCII characters, it should be encoded in UTF-8. The parser generator spg generates C++ source and header files that are encoded in UTF-8.
For a given parser file ExampleParser.parser the spg tool generates a C++ source file ExampleParser.cpp and a header file ExampleParser.hpp.
An include-declaration inserts an include directive at the start of the generated C++ source or header file. It consists of an optional include-prefix, a '#' symbol, the keyword include, and a file path in angle brackets:
include‑declaration | → | include‑prefix? # include file‑path |
include‑prefix | → | cpp‑prefix | hpp‑prefix |
cpp‑prefix | → | [cpp] |
hpp‑prefix | → | [hpp] |
An include-prefix can be "[cpp]" or "[hpp]".
If an include declaration has a [cpp] include prefix or no include prefix, spg puts the include directive in the generated .cpp file. If it has a [hpp] include prefix, spg puts the include directive in the generated .hpp file.
For the following include declarations in ExampleParser.parser:
// ExampleParser.parser:
[hpp]#include <ExampleAST.hpp>
[cpp]#include <ExampleLexer.hpp>
[cpp]#include <ExampleTokens.hpp>
#include <boost/filesystem.hpp>
spg generates the following #include directives in the generated files ExampleParser.hpp and ExampleParser.cpp:
// ExampleParser.hpp:
#include <ExampleAST.hpp>

// ExampleParser.cpp:
#include <ExampleLexer.hpp>
#include <ExampleTokens.hpp>
#include <boost/filesystem.hpp>
A using-namespace-declaration inserts a C++ using directive into the generated C++ source file.
using‑namespace‑declaration | → | using namespace qualified‑cpp‑id ; |
A using namespace declaration consists of the keywords using and namespace, a qualified C++ identifier that is the name of a namespace, and a semicolon.
For the following using namespace declarations:
// ExampleParser.parser:
using namespace ExampleTokens;
using namespace boost::filesystem;
spg generates the following using directives in the ExampleParser.cpp file:
// ExampleParser.cpp:
using namespace ExampleTokens;
using namespace boost::filesystem;
A parser-declaration consists of the keyword parser followed by an optional API specifier followed by the name of the parser followed by parsing-declarations enclosed in braces:
parser‑declaration | → | parser api? identifier { parsing‑declaration* } |
parsing‑declaration | → | parsing‑rule | main‑declaration | using‑declaration | use‑lexer‑declaration | rule‑info‑declaration | farthest‑error‑declaration | state‑declaration | nothrow‑declaration |
For a parser declaration the spg tool generates a C++ class having the name of the parser. The class will contain a static member function for each parsing rule contained by the parser.
For the following parser declaration:
// ExampleParser.parser:
parser ExampleParser
{
    uselexer ExampleLexer;

    Statement
        ::= WhileStatement:whileStatement
        |   EmptyStatement:emptyStatement
        // ...
        ;

    WhileStatement
        ::= WHILE LPAREN Expression:cond RPAREN Statement:stmt
        ;

    EmptyStatement
        ::= SEMICOLON
        ;

    Expression
        ::= ID
        // ....
        ;
}

spg generates the following C++ class:
// ExampleParser.hpp:
class ExampleLexer;

struct ExampleParser
{
    static soulng::parser::Match Statement(ExampleLexer& lexer);
    static soulng::parser::Match WhileStatement(ExampleLexer& lexer);
    static soulng::parser::Match EmptyStatement(ExampleLexer& lexer);
    static soulng::parser::Match Expression(ExampleLexer& lexer);
};
A parsing-rule has a name and a body separated by the "::=" symbol. It is terminated by a semicolon.
parsing‑rule | → | rule‑name params‑and‑vars? return‑type? ::= rule‑body ; |
rule‑name | → | identifier |
params‑and‑vars | → | ( param‑or‑var (, param‑or‑var)* ) |
param‑or‑var | → | var type‑id declarator | type‑id declarator |
return‑type | → | : type‑id |
rule‑body | → | alternative |
A parsing rule may have parameters, local variables and a return value. The declared parameters become parameters of the static member function generated by spg. The declared variables become local variables of that function. The return value of the parsing rule is transferred to the caller of the parsing rule in the soulng::parser::Match structure that is returned by the generated function. This AST example contains rules with parameters and return values.
The body of a parsing rule consists of parsing expressions.
Let a and b be parsing expressions. Then a | b is a parsing expression with two alternatives. For parsing expression a | b, the generated parser first tries to match a. If the input matches a, b is not tried, and parsing proceeds. Otherwise the generated parser backtracks to the input position where it started matching a and then tries to match b.
alternative | → | sequence (| sequence)* |
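The backtracking behaviour of a | b can be illustrated with a small hand-written C++ sketch. It is not the code that spg generates: ToyLexer, Expr, Char, Seq and Alt are hypothetical names introduced only for this illustration, and the lexer is assumed to be a plain string whose characters act as tokens.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a lexer: each character of the input is one token.
struct ToyLexer
{
    explicit ToyLexer(std::string input_) : input(std::move(input_)), pos(0) {}
    std::string input;
    std::size_t pos;
};

// A parsing expression returns true on a match and false otherwise.
using Expr = std::function<bool(ToyLexer&)>;

// Matches one character and advances the input position on success.
Expr Char(char c)
{
    return [c](ToyLexer& lexer) -> bool
    {
        if (lexer.pos < lexer.input.size() && lexer.input[lexer.pos] == c)
        {
            ++lexer.pos;
            return true;
        }
        return false;
    };
}

// a b: match a, then b; restore the position if either one fails.
Expr Seq(Expr a, Expr b)
{
    return [a, b](ToyLexer& lexer) -> bool
    {
        std::size_t save = lexer.pos;
        if (a(lexer) && b(lexer)) return true;
        lexer.pos = save;
        return false;
    };
}

// a | b: try a; if it fails, backtrack to the saved position and try b.
Expr Alt(Expr a, Expr b)
{
    return [a, b](ToyLexer& lexer) -> bool
    {
        std::size_t save = lexer.pos;
        if (a(lexer)) return true;
        lexer.pos = save;
        return b(lexer);
    };
}
```

With these helpers, Alt(Seq(Char('a'), Char('b')), Seq(Char('a'), Char('c'))) applied to the input ac consumes the a while trying the first alternative, fails on b, restores the position to the start, and then matches the second alternative.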
Let a and b be parsing expressions. Then ab is a parsing expression, a sequence expression. For parsing expression ab, the generated parser will match a and then b in sequence. If both match, parsing proceeds. Otherwise the parser backtracks.
sequence | → | difference difference* |
Let a and b be parsing expressions. Then a - b is a parsing expression, a difference expression. For parsing expression a - b, the parser first matches a. If a matches, the parser backtracks the input and tries to match b. If a matches and b does not, parsing proceeds. Otherwise the generated parser backtracks.
difference | → | list (- list)* |
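The difference expression can be sketched in hand-written C++ as follows. This is an illustration of the description above, not spg output: ToyLexer, Expr, Char, Any and Diff are hypothetical names, and the lexer is assumed to be a string whose characters act as tokens.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a lexer: each character of the input is one token.
struct ToyLexer
{
    explicit ToyLexer(std::string input_) : input(std::move(input_)), pos(0) {}
    std::string input;
    std::size_t pos;
};

using Expr = std::function<bool(ToyLexer&)>;

// Matches one specific character.
Expr Char(char c)
{
    return [c](ToyLexer& lexer) -> bool
    {
        if (lexer.pos < lexer.input.size() && lexer.input[lexer.pos] == c)
        {
            ++lexer.pos;
            return true;
        }
        return false;
    };
}

// Matches any single character.
Expr Any()
{
    return [](ToyLexer& lexer) -> bool
    {
        if (lexer.pos < lexer.input.size())
        {
            ++lexer.pos;
            return true;
        }
        return false;
    };
}

// a - b: succeed only if a matches and b does not match at the same start.
Expr Diff(Expr a, Expr b)
{
    return [a, b](ToyLexer& lexer) -> bool
    {
        std::size_t save = lexer.pos;
        if (!a(lexer))
        {
            lexer.pos = save;
            return false;
        }
        std::size_t afterA = lexer.pos; // where parsing continues on success
        lexer.pos = save;               // backtrack and test b from the same start
        if (b(lexer))
        {
            lexer.pos = save;           // b matched, so the difference fails
            return false;
        }
        lexer.pos = afterA;
        return true;
    };
}
```

For example, Diff(Any(), Char('x')) accepts any single character except x.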
Let a and b be parsing expressions. Then parsing expression a % b is equivalent to a parsing expression a (b a)*, that is: one or more a's separated by b's.
list | → | postfix (% postfix)? |
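The equivalence between a % b and a (b a)* can be made concrete with a short C++ sketch. The names ToyLexer, Expr, Char and List are hypothetical and the lexer is assumed to be a string of one-character tokens; the code only illustrates the matching order and is not what spg emits.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a lexer: each character of the input is one token.
struct ToyLexer
{
    explicit ToyLexer(std::string input_) : input(std::move(input_)), pos(0) {}
    std::string input;
    std::size_t pos;
};

using Expr = std::function<bool(ToyLexer&)>;

// Matches one specific character.
Expr Char(char c)
{
    return [c](ToyLexer& lexer) -> bool
    {
        if (lexer.pos < lexer.input.size() && lexer.input[lexer.pos] == c)
        {
            ++lexer.pos;
            return true;
        }
        return false;
    };
}

// a % b, written out as a (b a)*.
// Assumes that b followed by a consumes at least one token, otherwise the loop would not end.
Expr List(Expr a, Expr b)
{
    return [a, b](ToyLexer& lexer) -> bool
    {
        if (!a(lexer)) return false;      // at least one a is required
        while (true)
        {
            std::size_t save = lexer.pos;
            if (!(b(lexer) && a(lexer)))  // a separator must be followed by another a
            {
                lexer.pos = save;         // give back a trailing separator
                break;
            }
        }
        return true;
    };
}
```

Applied to the input a,a, the expression List(Char('a'), Char(',')) matches a,a and leaves the trailing comma unconsumed, because a separator only counts when another a follows it.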
Let a be a parsing expression. Then a*, a+ and a? are parsing expressions. For parsing expression a*, the generated parser will match zero or more a's. For parsing expression a+, the generated parser will match one or more a's. For parsing expression a?, the generated parser will match zero or one a's.
postfix | → | primary (* | + | ?)? |
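A minimal C++ sketch of the three postfix operators follows. Star, Plus and Opt, as well as ToyLexer, Expr and Char, are hypothetical names chosen for this illustration, and the lexer is assumed to be a string of one-character tokens. The guard against empty matches in Star is a design choice of the sketch, not something the text above specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a lexer: each character of the input is one token.
struct ToyLexer
{
    explicit ToyLexer(std::string input_) : input(std::move(input_)), pos(0) {}
    std::string input;
    std::size_t pos;
};

using Expr = std::function<bool(ToyLexer&)>;

// Matches one specific character.
Expr Char(char c)
{
    return [c](ToyLexer& lexer) -> bool
    {
        if (lexer.pos < lexer.input.size() && lexer.input[lexer.pos] == c)
        {
            ++lexer.pos;
            return true;
        }
        return false;
    };
}

// a*: zero or more a's; always succeeds.
Expr Star(Expr a)
{
    return [a](ToyLexer& lexer) -> bool
    {
        while (true)
        {
            std::size_t save = lexer.pos;
            if (!a(lexer))
            {
                lexer.pos = save;
                break;
            }
            if (lexer.pos == save) break; // stop if a matched without consuming input
        }
        return true;
    };
}

// a+: one a followed by a*.
Expr Plus(Expr a)
{
    Expr rest = Star(a);
    return [a, rest](ToyLexer& lexer) -> bool
    {
        std::size_t save = lexer.pos;
        if (!a(lexer))
        {
            lexer.pos = save;
            return false;
        }
        return rest(lexer);
    };
}

// a?: zero or one a; always succeeds.
Expr Opt(Expr a)
{
    return [a](ToyLexer& lexer) -> bool
    {
        std::size_t save = lexer.pos;
        if (!a(lexer)) lexer.pos = save;
        return true;
    };
}
```

Note that a* and a? succeed even when nothing is consumed, whereas a+ fails unless at least one a is present.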
The primary parsing expressions are rule-call, a primitive parsing expression, and a grouping parsing expression. They may be followed by expectation and semantic action.
primary | → | ( rule‑call | primitive | grouping ) expectation? action? |
The rule-call consists of the name of a rule, let's call it r, an optional argument list in parentheses, a colon, and an identifier. The generated parser will recursively match r. The identifier is a unique name for this call of r within the body of the current rule. It represents the synthesized attribute of r in a semantic action possibly attached to the rule call.
rule‑call | → | rule‑name ( ( expression‑list ) )? : identifier |
The call of a rule r is implemented as a function call. The function to be called is the function generated from rule r by spg.
If the rule r has n parameters, it must be passed n arguments in the argument list of the rule call. If the number of parameters differs from the number of arguments, spg will produce an error.
A primitive parsing expression is either an empty expression, an any expression, a token expression, or a lexerless expression:
primitive | → | empty‑expression | any‑expression | token‑expression | lexerless‑expression |
empty‑expression | → | empty |
any‑expression | → | any |
token‑expression | → | token‑id |
token‑id | → | identifier |
lexerless‑expression | → | character‑literal | string‑literal |
An empty expression consists of the keyword empty. An empty expression always matches. The input position of the lexer in use is not advanced for an empty expression.
An any expression consists of the keyword any. An any expression matches any token and "eats" the token: the input position of the lexer in use is advanced to the next token using the expression ++lexer, and parsing proceeds.
For a token expression, the generated parser will compare the current input token of the lexer in use to the token-id of the token expression. If the current input token matches the token-id, the input position of the lexer is advanced to the next token using the expression ++lexer, and parsing proceeds. Otherwise the generated parser will backtrack.
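The three token-level primitives above can be sketched in C++ as follows. Everything here is hypothetical: ToyLexer, the TokenId values and the Match functions are invented for the illustration, reading the current token with *lexer is an assumed interface, and treating the end of input as a non-match for any is an assumption of the sketch. Only the ++lexer expression is taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical token IDs; a real lexer defines its own.
enum TokenId { END = 0, WHILE = 1, LPAREN = 2, RPAREN = 3, ID = 4 };

// Hypothetical lexer over a precomputed list of token IDs.
struct ToyLexer
{
    explicit ToyLexer(std::vector<int> tokens_) : tokens(std::move(tokens_)), pos(0) {}
    // Current token ID, or END past the last token (assumed interface).
    int operator*() const { return pos < tokens.size() ? tokens[pos] : END; }
    // Advance to the next token, mirroring the ++lexer expression in the text.
    ToyLexer& operator++()
    {
        if (pos < tokens.size()) ++pos;
        return *this;
    }
    std::vector<int> tokens;
    std::size_t pos;
};

// Token expression: compare, advance on a match, leave the position alone otherwise.
bool MatchToken(ToyLexer& lexer, int tokenId)
{
    if (*lexer == tokenId)
    {
        ++lexer;
        return true;
    }
    return false;
}

// any expression: accept whatever token is current and consume it.
bool MatchAny(ToyLexer& lexer)
{
    if (*lexer == END) return false; // assumption: nothing left to consume
    ++lexer;
    return true;
}

// empty expression: always matches and never advances.
bool MatchEmpty(ToyLexer&)
{
    return true;
}
```

Because a failed token comparison never advances the lexer, the caller can try another alternative from the same position.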
To support lexerless parsing, a parsing expression can be a character or string literal. In this case the lexer, that is assumed to be a trivial lexer, produces lexical tokens that are in fact Unicode characters. There are three cases:
A grouping expression is a parenthesized parsing expression.
grouping | → | ( alternative ) |
Let a be a primary parsing expression. Then a! is a parsing expression, an expectation expression. For an expectation expression a!, instead of backtracking the generated parser will produce an error if a does not match:
expectation | → | ! |
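The difference between an ordinary failure and a failed expectation can be shown with a brief C++ sketch. It models the error as a C++ exception, which is an assumption of this illustration; ToyLexer, Expr, Char and Expect are hypothetical names, and the lexer is assumed to be a string of one-character tokens.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a lexer: each character of the input is one token.
struct ToyLexer
{
    explicit ToyLexer(std::string input_) : input(std::move(input_)), pos(0) {}
    std::string input;
    std::size_t pos;
};

using Expr = std::function<bool(ToyLexer&)>;

// Matches one specific character; returns false without advancing on a mismatch.
Expr Char(char c)
{
    return [c](ToyLexer& lexer) -> bool
    {
        if (lexer.pos < lexer.input.size() && lexer.input[lexer.pos] == c)
        {
            ++lexer.pos;
            return true;
        }
        return false;
    };
}

// a!: if a does not match, report an error instead of returning false.
Expr Expect(Expr a, std::string what)
{
    return [a, what](ToyLexer& lexer) -> bool
    {
        if (a(lexer)) return true;
        throw std::runtime_error("expected " + what + " at position " + std::to_string(lexer.pos));
    };
}
```

A plain Char(')') lets an enclosing alternative try something else, while Expect(Char(')'), "')'") stops parsing at that point with an error message.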
Let a be a primary parsing expression or an expectation expression. Two C++ compound statements may be attached to a. The first one will be executed if a matches. The second one, separated by a slash character, is an optional compound statement that will be executed if a does not match. The C++ compound statements attached to a are called semantic actions.
action | → | compound‑statement (/ compound‑statement)? |
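The two-part form of a semantic action can be pictured with the following C++ sketch, in which callbacks play the role of the attached compound statements. ToyLexer, Expr, Char and Action are hypothetical names, and the lexer is assumed to be a string of one-character tokens; spg itself embeds the compound statements in the generated function rather than using callbacks like these.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a lexer: each character of the input is one token.
struct ToyLexer
{
    explicit ToyLexer(std::string input_) : input(std::move(input_)), pos(0) {}
    std::string input;
    std::size_t pos;
};

using Expr = std::function<bool(ToyLexer&)>;

// Matches one specific character.
Expr Char(char c)
{
    return [c](ToyLexer& lexer) -> bool
    {
        if (lexer.pos < lexer.input.size() && lexer.input[lexer.pos] == c)
        {
            ++lexer.pos;
            return true;
        }
        return false;
    };
}

// a{ onMatch } / { onFail }: run onMatch if a matches, otherwise the optional onFail.
Expr Action(Expr a, std::function<void()> onMatch, std::function<void()> onFail = nullptr)
{
    return [a, onMatch, onFail](ToyLexer& lexer) -> bool
    {
        if (a(lexer))
        {
            onMatch();
            return true;
        }
        if (onFail) onFail(); // the failure action may be absent
        return false;
    };
}
```

The result of the underlying expression is passed through unchanged, so attaching an action in this sketch does not alter what matches.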
There are special symbols that are available in semantic actions:
If the current parsing rule has a synthesized attribute, that is: it returns a value, and it has no semantic action that contains a return statement, the spg tool produces a warning.
If the current parsing rule calls another rule that has a synthesized attribute, and that synthesized attribute is referenced many times in a semantic action, spg warns about this, because a synthesized attribute is represented as a unique pointer that will be released many times. However, if the semantic action consists of a switch statement that has many branches that refer to the same synthesized attribute, this warning can be ignored.
A main-declaration consists of the keyword main and a semicolon.
main‑declaration | → | main ; |
The spg tool will implement a Parse function for each parser that has a main declaration. The Parse function takes a lexer argument followed by the arguments that the first parsing rule of the parser takes. It parses the content using the given lexer by calling the first parsing rule of the parser with the lexer and the other arguments.
For the following example parser with the main declaration:
// ExampleParser.parser:
parser ExampleParser
{
    uselexer ExampleLexer;
    main;

    Statement(SymbolTable* symbolTable) : Node*
        ::= WhileStatement(symbolTable):whileStatement{ return whileStatement; }
        |   EmptyStatement:emptyStatement{ return emptyStatement; }
        ;

    WhileStatement(SymbolTable* symbolTable) : Node*
        ::= WHILE LPAREN Expression:cond RPAREN Statement(symbolTable):stmt{ return new WhileStatementNode(cond, stmt); }
        ;

    EmptyStatement : Node*
        ::= SEMICOLON{ return new EmptyStatementNode(); }
        ;

    Expression : Node*
        ::= ID
            {
                soulng::lexer::Token token = lexer.GetToken(pos);
                return new IdentifierNode(token.match.ToString());
            }
        // ....
        ;
}
the spg tool will generate the following class declaration with the Parse function in it, and implement the Parse function in the generated source file:
// ExampleParser.hpp:
struct ExampleParser
{
    static std::unique_ptr<Node> Parse(ExampleLexer& lexer, SymbolTable* symbolTable);
    static soulng::parser::Match Statement(ExampleLexer& lexer, SymbolTable* symbolTable);
    static soulng::parser::Match WhileStatement(ExampleLexer& lexer, SymbolTable* symbolTable);
    static soulng::parser::Match EmptyStatement(ExampleLexer& lexer);
    static soulng::parser::Match Expression(ExampleLexer& lexer);
};
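How a Parse function might forward its arguments to the first rule can be sketched with toy types. This is a guess at the general shape only and not the implementation spg generates: ToyLexer, ToyMatch, ToyParser and the one-rule grammar are hypothetical, the fields of ToyMatch are assumptions and are not claimed to mirror soulng::parser::Match, and throwing on failure is likewise an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

struct Node
{
    virtual ~Node() {}
};

struct IdentifierNode : Node
{
    explicit IdentifierNode(std::string name_) : name(std::move(name_)) {}
    std::string name;
};

// Hypothetical symbol table that just counts how often it is consulted.
struct SymbolTable
{
    int lookups = 0;
};

// Hypothetical stand-in for a lexer: each character of the input is one token.
struct ToyLexer
{
    explicit ToyLexer(std::string input_) : input(std::move(input_)), pos(0) {}
    std::string input;
    std::size_t pos;
};

// Hypothetical result type: a hit flag plus the value returned by the rule.
struct ToyMatch
{
    bool hit;
    Node* value;
};

struct ToyParser
{
    // Entry point: passes the lexer and the extra argument on to the first rule
    // and takes ownership of the value that the rule returns.
    static std::unique_ptr<Node> Parse(ToyLexer& lexer, SymbolTable* symbolTable)
    {
        ToyMatch match = Expression(lexer, symbolTable);
        std::unique_ptr<Node> value(match.value);
        if (!match.hit)
        {
            throw std::runtime_error("parsing failed at position " + std::to_string(lexer.pos));
        }
        return value;
    }

    // First (and only) rule of the toy parser: one lowercase letter is an identifier.
    static ToyMatch Expression(ToyLexer& lexer, SymbolTable* symbolTable)
    {
        if (lexer.pos < lexer.input.size())
        {
            char c = lexer.input[lexer.pos];
            if (c >= 'a' && c <= 'z')
            {
                ++lexer.pos;
                ++symbolTable->lookups;
                return ToyMatch{ true, new IdentifierNode(std::string(1, c)) };
            }
        }
        return ToyMatch{ false, nullptr };
    }
};
```

A caller would construct a lexer and a symbol table and then call ToyParser::Parse(lexer, &table), which matches the parameter list of the first rule with the lexer placed in front.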
A using-declaration consists of the keyword using, a parsing-rule-id and a semicolon:
using‑declaration | → | using parsing‑rule‑id ; |
parsing‑rule‑id | → | identifier (. identifier)* |
A parsing-rule-id consists of the name of a parser, a period, and a name of a rule in that parser.
The using declaration imports a name of a rule from another parser to the current parser, so that it can be called from the current parser.
The using declaration in the following StatementParser.parser file imports the name of the Expression rule from the ExpressionParser to the StatementParser, so that it can be called from the IfStatement rule:
// StatementParser.parser:
parser StatementParser
{
    uselexer SomeLexer;
    using ExpressionParser.Expression;

    Statement
        ::= IfStatement
        // ...
        ;

    IfStatement
        ::= IF LPAREN Expression:condition RPAREN Statement
        ;
}

// ExpressionParser.parser:
parser ExpressionParser
{
    uselexer SomeLexer;

    Expression
        ::= // ...
        ;
}
A use-lexer-declaration consists of the keyword uselexer followed by the name of the lexer to use for tokenizing input in the current parser, and a semicolon:
use‑lexer‑declaration | → | uselexer identifier ; |
The spg tool will warn if the uselexer declaration is missing.
The following use-lexer declaration sets the name of the lexer to use in the FunctionParser to CmajorLexer:
// CmajorLexer.lexer:
lexer CmajorLexer
{
    // ...
}

// Function.parser:
parser FunctionParser
{
    uselexer CmajorLexer;
    // ...
}
A rule-info-declaration consists of the keyword ruleinfo followed by rule-infos enclosed in braces and separated by commas:
rule‑info‑declaration | → | ruleinfo { (rule‑info (, rule‑info)*)? } |
A rule-info consists of a name of a rule followed by a comma followed by an informative string for that rule. It is enclosed in parentheses:
rule‑info | → | ( rule‑name , string‑literal ) |
The rule info declaration is used for setting informative strings of rules to use in error messages produced by the parser.
The names of the rules of the FunctionParser are given informative strings "function", "function group identifier" and "operator function group identifier" in the following Function.parser file:
// Function.parser:
parser FunctionParser
{
    // ...
    Function
        ::= //
        ;

    FunctionGroupId
        ::= //
        ;

    OperatorFunctionGroupId
        ::= //
        ;

    ruleinfo
    {
        (Function, "function"),
        (FunctionGroupId, "function group identifier"),
        (OperatorFunctionGroupId, "operator function group identifier")
    }
}
A farthest-error-declaration consists of the keyword farthest_error and a semicolon.
farthest‑error‑declaration | → | farthest_error ; |
The following example parser contains a farthest_error declaration:
// ExampleParser.parser:
parser ExampleParser
{
    uselexer ExampleLexer;
    farthest_error;
    main;
    // ...
}

See the Farthest Error and State document for a usage example.
A state-declaration consists of the keyword state and a semicolon.
state‑declaration | → | state ; |
The following example parser contains a state declaration:
// ExampleParser.parser:
parser ExampleParser
{
    uselexer ExampleLexer;
    farthest_error;
    state;
    main;
    // ...
}

See the Farthest Error and State document for a usage example.
A nothrow-declaration consists of the keyword nothrow and a semicolon.
nothrow‑declaration | → | nothrow ; |
The following example parser contains a nothrow declaration:
// ExampleParser.parser:
parser ExampleParser
{
    uselexer ExampleLexer;
    nothrow;
    main;
    // ...
}

See the Nothrow parser document for an example.
The following keywords may not be used as identifiers in .parser files:
parser‑file‑keyword | → | cppkeyword | any | api | empty | farthest_error | include | main | nothrow | parser | ruleinfo | start | state | uselexer | var |