Parser File Syntax

1 Parser File Declarations
2 Include Declaration
    2.1 Example
3 Using Namespace Declaration
    3.1 Example
4 Parser Declaration
    4.1 Example
    4.2 Parsing Rules
        4.2.1 Parsing Rule Body
    4.3 Main Declaration
        4.3.1 Example
    4.4 Using Declaration
        4.4.1 Example
    4.5 Use Lexer Declaration
        4.5.1 Example
    4.6 Rule Info Declaration
        4.6.1 Example
    4.7 Farthest Error Declaration
        4.7.1 Example
    4.8 State Declaration
        4.8.1 Example
    4.9 Nothrow Declaration
        4.9.1 Example
5 Keywords

The notation used for representing the parser file syntax in this document is described here.

1 Parser File Declarations

A .parser file consists of parser-file-declarations:

parser‑file	→	parser‑file‑declaration*
parser‑file‑declaration	→	include‑declaration | using‑namespace‑declaration | parser‑declaration

If the .parser file contains Unicode identifiers with non-ASCII characters, the file should be encoded in UTF-8. The parser generator spg generates C++ source and header files that are encoded in UTF-8.

For a given parser file ExampleParser.parser the spg tool generates a C++ source file ExampleParser.cpp and a header file ExampleParser.hpp.

2 Include Declaration

An include-declaration inserts an include directive at the start of the generated C++ source or header file. An include declaration consists of an optional include-prefix, a '#' symbol, the include keyword and a file path in angle brackets:

include‑declaration	→	include‑prefix? # include file‑path
include‑prefix	→	cpp‑prefix | hpp‑prefix
cpp‑prefix	→	[cpp]
hpp‑prefix	→	[hpp]

An include-prefix can be "[cpp]" or "[hpp]".

If an include declaration has a [cpp] include prefix or no include prefix, spg puts the include directive into the generated .cpp file. If an include declaration has a [hpp] include prefix, spg puts the include directive into the generated .hpp file.

2.1 Example

For the following include declarations in ExampleParser.parser:

        // ExampleParser.parser:

        [hpp]#include <ExampleAST.hpp>
        [cpp]#include <ExampleLexer.hpp>
        [cpp]#include <ExampleTokens.hpp>
        #include <boost/filesystem.hpp>

spg generates the following #include directives into the generated source files ExampleParser.hpp and ExampleParser.cpp:

        // ExampleParser.hpp:

        #include <ExampleAST.hpp>

        // ExampleParser.cpp:

        #include <ExampleLexer.hpp>
        #include <ExampleTokens.hpp>
        #include <boost/filesystem.hpp>
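
The routing rule for include prefixes can be sketched in C++ as follows. This is only an illustration of the rule stated above, not code taken from spg: the names Target and RouteInclude are hypothetical, and the declaration is handled as a plain string.

```cpp
#include <string>

// Hypothetical names for illustration; spg's own implementation is not shown in this document.
enum class Target { cpp, hpp };

// Returns the generated file that receives the directive of an include declaration.
// Only a leading "[hpp]" prefix selects the header file;
// a "[cpp]" prefix or no prefix selects the source file.
Target RouteInclude(const std::string& declaration)
{
    const std::string hppPrefix = "[hpp]";
    if (declaration.compare(0, hppPrefix.size(), hppPrefix) == 0)
    {
        return Target::hpp;
    }
    return Target::cpp;
}
```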

3 Using Namespace Declaration

A using-namespace-declaration inserts a C++ using directive into the generated C++ source file.

using‑namespace‑declaration	→	using namespace qualified‑cpp‑id ;

A using namespace declaration consists of the keywords using and namespace, a qualified C++ identifier that is the name of a namespace, and a semicolon.

3.1 Example

For the following using namespace declarations:

        // ExampleParser.parser:

        using namespace ExampleTokens;
        using namespace boost::filesystem;

spg generates the following using directives into the ExampleParser.cpp file:

        // ExampleParser.cpp:

        using namespace ExampleTokens;
        using namespace boost::filesystem;

4 Parser Declaration

A parser-declaration consists of the keyword parser followed by an optional API specifier followed by the name of the parser followed by parsing-declarations enclosed in braces:

parser‑declaration	→	parser api? identifier { parsing‑declaration* }
parsing‑declaration	→	parsing‑rule | main‑declaration | using‑declaration | use‑lexer‑declaration | rule‑info‑declaration | farthest‑error‑declaration | state‑declaration | nothrow‑declaration

For a parser declaration the spg tool generates a C++ class having the name of the parser. The class will contain a static member function for each parsing rule contained by the parser.

4.1 Example

For the following parser declaration:

        // ExampleParser.parser:

        parser ExampleParser
        {
            uselexer ExampleLexer;

            Statement
                ::= WhileStatement:whileStatement
                |   EmptyStatement:emptyStatement
                // ...
                ;

            WhileStatement
                ::= WHILE LPAREN Expression:cond RPAREN Statement:stmt
                ;

            EmptyStatement
                ::= SEMICOLON
                ;

            Expression
                ::= ID
                // ....
                ;
        }

spg generates the following C++ class:

        // ExampleParser.hpp:

        class ExampleLexer;

        struct ExampleParser
        {
            static soulng::parser::Match Statement(ExampleLexer& lexer);
            static soulng::parser::Match WhileStatement(ExampleLexer& lexer);
            static soulng::parser::Match EmptyStatement(ExampleLexer& lexer);
            static soulng::parser::Match Expression(ExampleLexer& lexer);
        };

4.2 Parsing Rules

A parsing-rule has a name and a body separated by the "::=" symbol. It is terminated by a semicolon.

parsing‑rule	→	rule‑name params‑and‑vars? return‑type? ::= rule‑body ;
rule‑name	→	identifier
params‑and‑vars	→	( param‑or‑var (, param‑or‑var)* )
param‑or‑var	→	var type‑id declarator | type‑id declarator
return‑type	→	: type‑id
rule‑body	→	alternative

A parsing rule may have parameters, local variables and a return value. The declared parameters become parameters of the static member function generated by spg. The declared variables become local variables of that function. The return value of the parsing rule is transferred to the caller of the parsing rule in the soulng::parser::Match structure that is returned by the generated function. This AST example contains rules with parameters and return values.

4.2.1 Parsing Rule Body

The body of a parsing rule consists of parsing expressions.

Let a and b be parsing expressions. Then a | b is a parsing expression with two alternatives. For parsing expression a | b, the generated parser first tries to match a. If the input matches a, b is not tried, and parsing proceeds. Otherwise the generated parser backtracks the input to the position where it started to match a, and then tries to match b.

alternative	→	sequence (| sequence)*
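
The backtracking behavior of a | b can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the code that spg generates: Cursor, MatchLiteral and Alternative are hypothetical names, and a cursor over characters stands in for the lexer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a lexer: a cursor over characters instead of tokens.
struct Cursor
{
    std::string input;
    std::size_t pos;
};

// Matches the literal text at the current position; advances only on success.
bool MatchLiteral(Cursor& c, const std::string& text)
{
    if (c.input.compare(c.pos, text.size(), text) == 0)
    {
        c.pos += text.size();
        return true;
    }
    return false;
}

// a | b: try a first; if a fails, restore the saved position and try b.
template<typename A, typename B>
bool Alternative(Cursor& c, A a, B b)
{
    std::size_t save = c.pos;
    if (a(c)) return true;  // b is not tried when a matches
    c.pos = save;           // backtrack to the position where a started
    if (b(c)) return true;
    c.pos = save;           // neither alternative matched
    return false;
}
```

With the input "while", a first alternative that consumes "wh" and then fails leaves no trace: the position is restored before the second alternative is tried.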

Let a and b be parsing expressions. Then ab is a parsing expression, a sequence expression. For parsing expression ab, the generated parser will match a and then b in sequence. If both match, parsing proceeds. Otherwise the parser will backtrack.

sequence	→	difference difference*

Let a and b be parsing expressions. Then a - b is a parsing expression, a difference expression. For parsing expression a - b, the parser will first match a. If a matches, it will backtrack the input to the position where a started and try to match b. If a matches and b does not, parsing proceeds. Otherwise the generated parser will backtrack.

difference	→	list (- list)*
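
A difference expression is typically used to exclude something from a broader match, for example identifiers that are not keywords. The following C++ sketch illustrates the matching order described above. It is an assumption-laden illustration rather than spg output: Cursor, MatchLiteral, MatchLetters, Difference and MatchIdentifier are hypothetical names, and characters stand in for tokens.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a lexer: a cursor over characters instead of tokens.
struct Cursor
{
    std::string input;
    std::size_t pos;
};

// Matches the literal text at the current position; advances only on success.
bool MatchLiteral(Cursor& c, const std::string& text)
{
    if (c.input.compare(c.pos, text.size(), text) == 0)
    {
        c.pos += text.size();
        return true;
    }
    return false;
}

// Matches one or more ASCII letters.
bool MatchLetters(Cursor& c)
{
    std::size_t start = c.pos;
    while (c.pos < c.input.size() && std::isalpha(static_cast<unsigned char>(c.input[c.pos])))
    {
        ++c.pos;
    }
    return c.pos > start;
}

// a - b: match a, rewind to the start, then require that b does NOT match there.
template<typename A, typename B>
bool Difference(Cursor& c, A a, B b)
{
    std::size_t start = c.pos;
    if (!a(c)) { c.pos = start; return false; }
    std::size_t afterA = c.pos;
    c.pos = start;                    // backtrack before trying b
    if (b(c)) { c.pos = start; return false; }
    c.pos = afterA;                   // a matched and b did not: keep a's match
    return true;
}

// Hypothetical rule: Identifier ::= Letters - "while"
bool MatchIdentifier(Cursor& c)
{
    return Difference(c, MatchLetters, [](Cursor& x) { return MatchLiteral(x, "while"); });
}
```

In this sketch b only has to match at the start position, so an input such as "whilex" is rejected as well, because the literal "while" matches its prefix.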

Let a and b be parsing expressions. Then parsing expression a % b is equivalent to a parsing expression a (b a)*, that is: one or more a's separated by b's.

list	→	postfix (% postfix)?
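
The equivalence of a % b and a (b a)* can be sketched in C++ as a loop. This is a minimal sketch, not spg output: Cursor, MatchChar, MatchDigit and List are hypothetical names, and characters stand in for tokens.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a lexer: a cursor over characters instead of tokens.
struct Cursor
{
    std::string input;
    std::size_t pos;
};

// Matches a single given character.
bool MatchChar(Cursor& c, char ch)
{
    if (c.pos < c.input.size() && c.input[c.pos] == ch)
    {
        ++c.pos;
        return true;
    }
    return false;
}

// Matches a single decimal digit.
bool MatchDigit(Cursor& c)
{
    if (c.pos < c.input.size() && std::isdigit(static_cast<unsigned char>(c.input[c.pos])))
    {
        ++c.pos;
        return true;
    }
    return false;
}

// a % b, implemented as a (b a)*: one a, then zero or more "b a" pairs.
template<typename A, typename B>
bool List(Cursor& c, A a, B b)
{
    std::size_t start = c.pos;
    if (!a(c)) { c.pos = start; return false; }
    while (true)
    {
        std::size_t save = c.pos;
        if (!(b(c) && a(c)))
        {
            c.pos = save;  // an incomplete "b a" pair is given back to the input
            break;
        }
    }
    return true;
}
```

For the input "1,2," a list of digits separated by commas matches "1,2" and leaves the trailing comma unconsumed.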

Let a be a parsing expression. Then a*, a+ and a? are parsing expressions. For parsing expression a*, the generated parser will match zero or more a's. For parsing expression a+, the generated parser will match one or more a's. For parsing expression a?, the generated parser will match zero or one a's.

postfix	→	primary (* | + | ?)?

The primary parsing expressions are a rule-call, a primitive parsing expression, and a grouping parsing expression. Each may be followed by an expectation and a semantic action.

primary	→	( rule‑call | primitive | grouping ) expectation? action?

The rule-call consists of the name of a rule, let's call it r, an optional argument list in parentheses, a colon, and an identifier. The generated parser will recursively match r. The identifier is a unique name for this call of r within the body of the current rule. It represents the synthesized attribute of r in a semantic action possibly attached to the rule call.

rule‑call	→	rule‑name ( ( expression‑list ) )? : identifier

The call of a rule r is implemented as a function call. The function to be called is the function generated from rule r by spg.

If the rule r has n parameters, it must be passed n arguments in the argument list of the rule call. If the number of parameters differs from the number of arguments, spg will produce an error.

A primitive parsing expression is either an empty expression, an any expression, a token expression, or a lexerless expression:

primitive	→	empty‑expression | any‑expression | token‑expression | lexerless‑expression
empty‑expression	→	empty
any‑expression	→	any
token‑expression	→	token‑id
token‑id	→	identifier
lexerless‑expression	→	character‑literal | string‑literal

Empty expression

An empty expression consists of the keyword empty. An empty expression always matches. The input position of the used lexer is not advanced for an empty expression.

Any expression

An any expression consists of the keyword any. An any expression matches any token and "eats" the token: the input position of the used lexer is advanced to the next token using expression ++lexer, and parsing proceeds.

Token expression

For a token expression, the generated parser will compare the current input token of the used lexer to the token-id of the token expression. If the current input token matches the token-id, the input position of the used lexer is advanced to the next token using expression ++lexer, and parsing proceeds. Otherwise the generated parser will backtrack.
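
The token and any expressions can be sketched in C++ with a toy lexer that supports *lexer and ++lexer. Everything here is a hypothetical stand-in for illustration: ToyLexer, MatchToken, MatchAny and the token id constants are not part of soulng, and the assumption that any does not match at the end of input is ours, not a statement from this document.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical token ids; real ids come from a generated tokens file.
const int END_TOKEN = 0;
const int WHILE = 1;
const int LPAREN = 2;

// Hypothetical toy lexer over a pre-tokenized input.
struct ToyLexer
{
    std::vector<int> tokens;
    std::size_t pos;
    // Current token id, or END_TOKEN past the last token.
    int operator*() const { return pos < tokens.size() ? tokens[pos] : END_TOKEN; }
    // Advances to the next token.
    void operator++() { if (pos < tokens.size()) ++pos; }
};

// Token expression: compare the current token to tokenId and advance on a match.
bool MatchToken(ToyLexer& lexer, int tokenId)
{
    if (*lexer == tokenId)
    {
        ++lexer;
        return true;
    }
    return false;  // the position is unchanged, so the caller can try something else
}

// Any expression: accept whatever token is current and advance.
// ASSUMPTION: the end of input does not count as a token to be matched.
bool MatchAny(ToyLexer& lexer)
{
    if (*lexer == END_TOKEN) return false;
    ++lexer;
    return true;
}
```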

Lexerless expressions

To support lexerless parsing, a parsing expression can be a character or string literal. In this case the lexer, which is assumed to be a trivial lexer, produces lexical tokens that are in fact Unicode characters. There are three cases:

The expression is a character literal: the parser compares the current input token, which is in fact a Unicode character, to the character literal. If the current input token matches the character literal, the input position of the used lexer is advanced to the next token using expression ++lexer, and parsing proceeds. Otherwise the generated parser will backtrack.
The expression is an ordinary string literal: the parser compares input tokens, which are in fact a sequence of Unicode characters, to the string literal. If the sequence matches, the input position of the used lexer is advanced by the number of characters in the sequence, and parsing proceeds. Otherwise the generated parser will backtrack.
The expression is a string literal that contains a character class in brackets: the parser compares the current input token, which is in fact a Unicode character, to the characters in the character class. If the current input token is in the character class, the input position of the used lexer is advanced to the next token using expression ++lexer, and parsing proceeds. Otherwise the generated parser will backtrack.
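
The string-literal and character-class cases can be sketched in C++ as follows. This is an illustration only: Cursor, CharClass, MatchString and MatchCharClass are hypothetical names, the character class is given as already-parsed ranges rather than as the bracket syntax, and a std::u32string stands in for the trivial lexer.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the trivial lexer: each "token" is one Unicode character.
struct Cursor
{
    std::u32string input;
    std::size_t pos;
};

// A character class as a list of inclusive ranges, e.g. a-z, A-Z and _.
typedef std::vector<std::pair<char32_t, char32_t>> CharClass;

// Ordinary string literal: all characters must match; advance by the literal's length.
bool MatchString(Cursor& c, const std::u32string& text)
{
    if (c.input.compare(c.pos, text.size(), text) == 0)
    {
        c.pos += text.size();
        return true;
    }
    return false;
}

// Character class: the current character must fall in one of the ranges; advance by one.
bool MatchCharClass(Cursor& c, const CharClass& ranges)
{
    if (c.pos >= c.input.size()) return false;
    char32_t ch = c.input[c.pos];
    for (const auto& range : ranges)
    {
        if (ch >= range.first && ch <= range.second)
        {
            ++c.pos;
            return true;
        }
    }
    return false;
}
```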

Grouping expression

A grouping expression is a parenthesized parsing expression.

grouping

→

( alternative )

Expectation expression

Let a be a primary parsing expression. Then a! is a parsing expression, an expectation expression. For an expectation expression a!, instead of backtracking the generated parser will produce an error if a does not match:

expectation	→	!
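
The difference between an ordinary match and an expectation can be sketched in C++ as follows. The sketch assumes that the error is reported as an exception; the error mechanism of the generated parser is not described in this section, and Cursor, MatchLiteral and Expect are hypothetical names.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a lexer: a cursor over characters instead of tokens.
struct Cursor
{
    std::string input;
    std::size_t pos;
};

// Matches the literal text at the current position; advances only on success.
bool MatchLiteral(Cursor& c, const std::string& text)
{
    if (c.input.compare(c.pos, text.size(), text) == 0)
    {
        c.pos += text.size();
        return true;
    }
    return false;
}

// a!: if a does not match, report an error instead of letting the caller backtrack.
// ASSUMPTION: the error is modeled as std::runtime_error for this sketch.
template<typename A>
bool Expect(Cursor& c, A a, const std::string& what)
{
    if (a(c)) return true;
    throw std::runtime_error(what + " expected at position " + std::to_string(c.pos));
}
```

An expectation is useful after a point of no return: once "while" has matched, a missing "(" cannot be fixed by trying another alternative, so reporting the error immediately gives a more precise message.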

Semantic actions

Let a be a primary parsing expression or an expectation expression. Two C++ compound statements may be attached to a. The first one will be executed if a matches. The second one, separated by a slash character, is an optional compound statement that will be executed if a does not match. The C++ compound statements attached to a are called semantic actions.

action	→	compound‑statement (/ compound‑statement)?

There are special symbols that are available in semantic actions:

lexer is a reference to the lexer used by the current parser. The lexer interface document contains descriptions of the member functions of the lexer.
pos is an int variable that contains the index of the token that has matched. The matched token may be obtained using the expression lexer.GetToken(pos).
span is a soulng::lexer::Span variable that represents a token range with start and end index of the span set to pos. By changing the start/end token index of the span, and then calling lexer.GetMatch(span) a string matching a token range may be obtained.
pass is a bool variable. By setting pass to false in a semantic action, the semantic action can conditionally reject the current alternative that has matched, and cause the parser to backtrack and try the next alternative. In that case the parser acts as if the current alternative has not been matched.
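
The effect of the pass variable can be sketched in C++ as follows. This illustrates the idea only and is not the code spg generates: Cursor, MatchNumber and MatchByte are hypothetical names, and the hypothetical semantic action rejects numbers greater than 255.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a lexer: a cursor over characters instead of tokens.
struct Cursor
{
    std::string input;
    std::size_t pos;
};

// Matches one or more decimal digits and stores their value in n.
// Accumulation stops once n reaches 1000000, which keeps n from overflowing.
bool MatchNumber(Cursor& c, int& n)
{
    std::size_t start = c.pos;
    n = 0;
    while (c.pos < c.input.size() && std::isdigit(static_cast<unsigned char>(c.input[c.pos])))
    {
        if (n < 1000000) n = n * 10 + (c.input[c.pos] - '0');
        ++c.pos;
    }
    return c.pos > start;
}

// Hypothetical rule: a number whose semantic action accepts only values up to 255.
bool MatchByte(Cursor& c, int& value)
{
    std::size_t save = c.pos;
    int n = 0;
    if (!MatchNumber(c, n)) { c.pos = save; return false; }
    bool pass = true;
    // Body of the hypothetical semantic action:
    if (n > 255) pass = false; else value = n;
    // After the action: a cleared pass flag turns the match into a non-match.
    if (!pass) { c.pos = save; return false; }
    return true;
}
```

For the input "300" the digits match syntactically, but the action clears pass, so the rule behaves as if nothing had matched and the position is restored.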

If the current parsing rule has a synthesized attribute, that is: it returns a value, and it has no semantic action that contains a return statement, the spg tool produces a warning.

If the current parsing rule calls another rule that has a synthesized attribute, and that synthesized attribute is referenced more than once in a semantic action, spg warns about this, because a synthesized attribute is represented as a unique pointer that would then be released more than once. However, if the semantic action consists of a switch statement whose branches each refer to the same synthesized attribute, this warning can be ignored.

4.3 Main Declaration

A main-declaration consists of the keyword main and a semicolon.

main‑declaration	→	main ;

The spg tool will implement a Parse function for each parser that has a main declaration. The Parse function takes a lexer argument followed by the arguments that the first parsing rule of the parser takes. It parses the content using the given lexer by calling the first parsing rule of the parser with the lexer and the other arguments.

4.3.1 Example

For the following example parser with the main declaration:

        // ExampleParser.parser:
    
        parser ExampleParser
        {
            uselexer ExampleLexer;

            main;

            Statement(SymbolTable* symbolTable) : Node*
                ::= WhileStatement(symbolTable):whileStatement{ return whileStatement; }
                |   EmptyStatement:emptyStatement{ return emptyStatement; }
                ;

            WhileStatement(SymbolTable* symbolTable) : Node*
                ::= WHILE LPAREN Expression:cond RPAREN Statement(symbolTable):stmt{ return new WhileStatementNode(cond, stmt); }
                ;

            EmptyStatement : Node*
                ::= SEMICOLON{ return new EmptyStatementNode(); }
                ;

            Expression : Node*
                ::= ID{ soulng::lexer::Token token = lexer.GetToken(pos); return new IdentifierNode(token.match.ToString()); }
                // ....
                ;
        }

the spg tool will generate the following class declaration with the Parse function in it, and implement the Parse function in the generated source file:

        // ExampleParser.hpp:

        struct ExampleParser
        {
            static std::unique_ptr<Node> Parse(ExampleLexer& lexer, SymbolTable* symbolTable);
            static soulng::parser::Match Statement(ExampleLexer& lexer, SymbolTable* symbolTable);
            static soulng::parser::Match WhileStatement(ExampleLexer& lexer, SymbolTable* symbolTable);
            static soulng::parser::Match EmptyStatement(ExampleLexer& lexer);
            static soulng::parser::Match Expression(ExampleLexer& lexer);
        };
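
What such a Parse function might do can be sketched in C++ as follows. This is a simplified guess at the shape of the function, not the implementation generated by spg. ToyLexer, the Match struct with hit and value members, and the token ids are hypothetical stand-ins for the soulng types, and the check that all input has been consumed is an assumption.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for the AST, symbol table, tokens and lexer.
struct Node { virtual ~Node() {} };
struct EmptyStatementNode : Node {};
struct SymbolTable {};
const int END_TOKEN = 0;
const int SEMICOLON = 1;

struct ToyLexer
{
    std::vector<int> tokens;
    std::size_t pos;
    int operator*() const { return pos < tokens.size() ? tokens[pos] : END_TOKEN; }
    void operator++() { if (pos < tokens.size()) ++pos; }
};

// Hypothetical simplification of soulng::parser::Match.
struct Match
{
    bool hit;     // true if the rule matched
    Node* value;  // synthesized attribute of the rule, owned by the caller
};

struct ExampleParser
{
    static std::unique_ptr<Node> Parse(ToyLexer& lexer, SymbolTable* symbolTable);
    static Match Statement(ToyLexer& lexer, SymbolTable* symbolTable);
};

// First rule of the parser, reduced to the empty statement for this sketch.
Match ExampleParser::Statement(ToyLexer& lexer, SymbolTable*)
{
    if (*lexer == SEMICOLON)
    {
        ++lexer;
        return Match{true, new EmptyStatementNode()};
    }
    return Match{false, nullptr};
}

// Calls the first rule with the lexer and the rule's own arguments and wraps the result.
std::unique_ptr<Node> ExampleParser::Parse(ToyLexer& lexer, SymbolTable* symbolTable)
{
    Match match = Statement(lexer, symbolTable);
    std::unique_ptr<Node> value(match.value);
    if (!match.hit) throw std::runtime_error("parsing failed");
    // ASSUMPTION: leftover input after the first rule is treated as an error.
    if (*lexer != END_TOKEN) throw std::runtime_error("unexpected input after statement");
    return value;
}
```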

4.4 Using Declaration

A using-declaration consists of the keyword using, a parsing-rule-id and a semicolon:

using‑declaration	→	using parsing‑rule‑id ;
parsing‑rule‑id	→	identifier (. identifier)*

A parsing-rule-id consists of the name of a parser, a period, and a name of a rule in that parser.

The using declaration imports the name of a rule from another parser into the current parser, so that the rule can be called from the current parser.

4.4.1 Example

The using declaration in the following StatementParser.parser file imports the name of the Expression rule from the ExpressionParser to the StatementParser, so that it can be called from the IfStatement:

        // StatementParser.parser:

        parser StatementParser
        {
            uselexer SomeLexer;

            using ExpressionParser.Expression;

            Statement
                ::= IfStatement:ifStatement
                    // ...
                ;

            IfStatement
                ::= IF LPAREN Expression:condition RPAREN Statement:statement
                ;
        }

        // ExpressionParser.parser:

        parser ExpressionParser
        {
            uselexer SomeLexer;

            Expression
                ::= // ...
                ;
        }

4.5 Use Lexer Declaration

A use-lexer-declaration consists of the keyword uselexer followed by the name of the lexer to use for tokenizing input in the current parser, and a semicolon:

use‑lexer‑declaration	→	uselexer identifier ;

The spg tool will warn if the uselexer declaration is missing.

4.5.1 Example

The following use-lexer declaration sets the name of the lexer to use in the FunctionParser to CmajorLexer:

        // CmajorLexer.lexer:

        lexer CmajorLexer
        {
            // ...
        }

        // Function.parser:

        parser FunctionParser
        {
            uselexer CmajorLexer;

            // ...
        }

4.6 Rule Info Declaration

A rule-info-declaration consists of the keyword ruleinfo followed by rule-infos enclosed in braces and separated by commas:

rule‑info‑declaration	→	ruleinfo { (rule‑info (, rule‑info)*)? }

A rule-info consists of a name of a rule followed by a comma followed by an informative string for that rule. It is enclosed in parentheses:

rule‑info	→	( rule‑name , string‑literal )

The rule info declaration is used for setting informative strings of rules to use in error messages produced by the parser.

4.6.1 Example

The names of the rules of the FunctionParser are given informative strings "function", "function group identifier" and "operator function group identifier" in the following Function.parser file:

        // Function.parser:

        parser FunctionParser
        {
            // ...

            Function
                ::= //
                ;

            FunctionGroupId
                ::= //
                ;

            OperatorFunctionGroupId
                ::= //
                ;

            ruleinfo
            {
                (Function, "function"), (FunctionGroupId, "function group identifier"), (OperatorFunctionGroupId, "operator function group identifier")
            }
        }
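
How informative strings could end up in error messages can be sketched in C++ as follows. This document does not specify how the generated parser stores or formats rule infos, so the whole block is a hypothetical illustration: RuleInfoMap, DescribeRule and ExpectationMessage are invented names, and falling back to the rule name when no info is given is an assumption.

```cpp
#include <map>
#include <string>

// Hypothetical lookup table built from a ruleinfo declaration: rule name -> informative string.
typedef std::map<std::string, std::string> RuleInfoMap;

// Returns the informative string of a rule.
// ASSUMPTION: a rule without an info entry is described by its own name.
std::string DescribeRule(const RuleInfoMap& infos, const std::string& ruleName)
{
    RuleInfoMap::const_iterator it = infos.find(ruleName);
    return it != infos.end() ? it->second : ruleName;
}

// Formats a hypothetical error message for a rule that failed to match.
std::string ExpectationMessage(const RuleInfoMap& infos, const std::string& ruleName)
{
    return DescribeRule(infos, ruleName) + " expected";
}
```

With the rule infos of the example above, a failure in FunctionGroupId would read "function group identifier expected" rather than exposing the internal rule name.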

4.7 Farthest Error Declaration

A farthest-error-declaration consists of the keyword farthest_error and a semicolon.

farthest‑error‑declaration	→	farthest_error ;

4.7.1 Example

The following example parser contains a farthest_error declaration:

        // ExampleParser.parser:
    
        parser ExampleParser
        {
            uselexer ExampleLexer;
            farthest_error;
            main;
            // ...
        }

See the Farthest Error and State document for a usage example.

4.8 State Declaration

A state-declaration consists of the keyword state and a semicolon.

state‑declaration	→	state ;

4.8.1 Example

The following example parser contains a state declaration:

        // ExampleParser.parser:
    
        parser ExampleParser
        {
            uselexer ExampleLexer;
            farthest_error;
            state;
            main;
            // ...
        }