3 Syntax of Lexical Analyzer Files

Table of contents

3.1 Token File Syntax
3.2 Keyword File Syntax
3.3 Expression File Syntax
3.4 Lexer File Syntax
3.5 Project File Syntax

3.1 Token File Syntax

token‑file → 'tokens' token‑module‑id '{' ( token‑declaration ( ',' token‑declaration )* )? '}'
token‑module‑id → qualified‑id
token‑declaration → '(' token‑name ',' token‑info‑string ')'
token‑name → identifier
token‑info‑string → string‑literal

A token file consists of the keyword tokens followed by a token module identifier followed by a sequence of token declarations separated by commas and enclosed in braces.

Each token file must be given a unique token module identifier.

Each token declaration is a pair that consists of a token name and a token info string separated by a comma and enclosed in parentheses.

I use UPPER_CASE identifiers for token names.

The token info string is included in the error message of an exception thrown by an expectation parser when the corresponding token should match but does not.

Example

// example.token:

tokens example.token
{
    (OR, "'or'"), (AND, "'and'"), (EQ, "'='"), (NEQ, "'!='"), (LEQ, "'<='"), (GEQ, "'>='"), (LESS, "'<'"), (GREATER, "'>'")
}

I use qualified identifiers ending with .token for token module identifiers.

Token file extension is .token
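To illustrate how a token info string can end up in an error message, here is a minimal C++ sketch. It is not the code the generator produces: the Token enum, the TokenInfo table, the Expect function and the message wording are all hypothetical names chosen for this illustration. Only the info strings themselves are taken from example.token.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical token identifiers; a real project would get these from generated code.
enum Token { OR, AND, EQ, NEQ };

// Token info strings as declared in example.token (subset).
inline const std::map<Token, std::string>& TokenInfo()
{
    static const std::map<Token, std::string> info = {
        {OR, "'or'"}, {AND, "'and'"}, {EQ, "'='"}, {NEQ, "'!='"}
    };
    return info;
}

// Hypothetical expectation check: throws when the actual token differs from
// the expected one and puts the expected token's info string into the message.
inline void Expect(Token expected, Token actual)
{
    if (actual != expected)
    {
        throw std::runtime_error("parsing error: " + TokenInfo().at(expected) + " expected");
    }
}
```

With these assumptions, Expect(EQ, OR) throws an exception whose message contains '=', which is why the info strings in the example carry their own quotes.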

3.2 Keyword File Syntax

keyword‑file → token‑module‑imports 'keywords' keyword‑module‑id '{' ( keyword‑declaration ( ',' keyword‑declaration )* )? '}'
token‑module‑imports → imports
keyword‑module‑id → qualified‑id
keyword‑declaration → '(' keyword‑string ',' keyword‑token‑id ')'
keyword‑string → string‑literal
keyword‑token‑id → token‑name

A keyword file consists of a possibly empty sequence of token module imports followed by the keyword keywords followed by a keyword module identifier followed by a sequence of keyword declarations separated by commas and enclosed in braces.

Each keyword file must be given a unique keyword module identifier.

Each keyword declaration is a pair that consists of a keyword string and a keyword token identifier separated by a comma and enclosed in parentheses.

The keyword string is associated with the corresponding keyword token identifier and will be included in a keyword map contained by a lexer.

Example

// example.keyword:

keywords example.keyword
{
    ("or", OR), ("and", AND), ("div", DIV), ("mod", MOD)
}

I use qualified identifiers ending with .keyword for keyword module identifiers.

Keyword file extension is .keyword
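The keyword map mentioned above can be pictured as a lookup table from keyword strings to token identifiers. The following C++ sketch shows one plausible way such a map is consulted. The names Token, KeywordMap and ClassifyName are hypothetical, and the idea that a matched name falls back to a NAME token when it is not a keyword is an assumption of this sketch, not something stated in this section.

```cpp
#include <map>
#include <string>

// Hypothetical token identifiers; NAME stands for an ordinary identifier token.
enum Token { NAME, OR, AND, DIV, MOD };

// Keyword map built from the declarations in example.keyword.
inline const std::map<std::string, Token>& KeywordMap()
{
    static const std::map<std::string, Token> keywords = {
        {"or", OR}, {"and", AND}, {"div", DIV}, {"mod", MOD}
    };
    return keywords;
}

// Returns the keyword token if the lexeme is a keyword, otherwise NAME (assumed fallback).
inline Token ClassifyName(const std::string& lexeme)
{
    auto it = KeywordMap().find(lexeme);
    return it != KeywordMap().end() ? it->second : NAME;
}
```

Under these assumptions ClassifyName("div") yields DIV, while ClassifyName("divide") yields NAME because the whole lexeme must equal the keyword string.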

3.3 Expression File Syntax

expression‑file → 'expressions' expression‑module‑id '{' expression‑declaration* '}'
expression‑module‑id → qualified‑id
expression‑declaration → expression‑name '=' expression‑string ';'
expression‑name → identifier
expression‑string → string‑literal

An expression file consists of the keyword expressions followed by an expression module identifier followed by a sequence of expression declarations enclosed in braces.

Each expression file must be given a unique expression module identifier.

An expression declaration consists of an expression name, an assignment symbol, an expression string and a semicolon.

The expression string shall contain a lexer regular expression. The regular expression may contain references to expression names that precede the expression declaration. A reference is written as the expression name enclosed in braces, for example {ws}.

The double quote character, the backslash character and the regular expression operator characters ('*', '+', '?', '|', '[', ']', '{', '}' and '.') must be quoted within the expression string by prefixing them with the backslash character if they are to be taken literally.
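The quoting rule can be sketched as a small C++ helper that turns literal text into the body of an expression string. The function name QuoteLiteral is hypothetical. The set of quoted characters is the one listed above, plus '(' and ')', which are included only by inference because the lexer example in section 3.4 escapes them.

```cpp
#include <string>

// Prefixes every special character with a backslash so that the text is taken literally.
inline std::string QuoteLiteral(const std::string& text)
{
    // Characters listed in the text, plus '(' and ')' (an assumption of this sketch).
    const std::string special = "\"\\*+?|[]{}.()";
    std::string result;
    for (char c : text)
    {
        if (special.find(c) != std::string::npos)
        {
            result += '\\';
        }
        result += c;
    }
    return result;
}
```

For example, quoting the two-character text .. gives \.\. under this sketch, which matches the way the DOT_DOT rule is written in the lexer example.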

There are four built-in special expressions available:

Example

// example.expr:

expressions example.expr
{
    ws = "[\n\r\t ]";
    separators = "{ws}+";
    dq_string = "\"[^\"]*\"";
    sq_string = "'[^']*'";
    digits = "[0-9]+";
    number = "{digits}(\.{digits}?)?|\.{digits}";
    name_start_char = ...
    name_char = ...
    name = "{name_start_char}{name_char}*";
}

I use qualified identifiers ending with .expr for expression module identifiers.

Expression file extension is .expr
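The rule that a regular expression may only refer to expression names declared before it can be sketched in C++ as a textual expansion. This is an illustration only: the function name Expand is hypothetical, the generator may resolve references quite differently, and wrapping each expanded reference in parentheses is an assumption made here to keep operator precedence intact.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Expands {name} references using only the expressions declared so far.
// Backslash-quoted characters are copied through unchanged, so \{ stays literal.
inline std::string Expand(const std::string& expr,
                          const std::map<std::string, std::string>& declared)
{
    std::string result;
    for (std::size_t i = 0; i < expr.size(); ++i)
    {
        char c = expr[i];
        if (c == '\\' && i + 1 < expr.size())
        {
            result += c;
            result += expr[++i]; // copy the quoted character as is
        }
        else if (c == '{')
        {
            std::size_t end = expr.find('}', i);
            if (end == std::string::npos)
            {
                throw std::runtime_error("unterminated expression reference");
            }
            std::string name = expr.substr(i + 1, end - i - 1);
            auto it = declared.find(name);
            if (it == declared.end())
            {
                throw std::runtime_error("unknown expression '" + name + "'");
            }
            result += "(" + it->second + ")"; // grouping is an assumption of this sketch
            i = end;
        }
        else
        {
            result += c;
        }
    }
    return result;
}
```

Processing the declarations in file order and storing each expanded result in the map means that separators = "{ws}+" can be expanded once ws has been stored, while a reference to a name that is declared later (or not at all) raises an error.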

3.4 Lexer File Syntax

lexer‑file → lexer‑module‑declaration imports 'lexer' lexer‑name '{' lexer‑content* '}'
lexer‑module‑declaration → export‑module‑declaration
lexer‑name → identifier
lexer‑content → rules | variables | actions
rules → 'rules' '{' lexer‑rule* '}'
lexer‑rule → expression‑string action‑id? compound‑statement
variables → 'variables' '{' lexer‑variable* '}'
lexer‑variable → variable‑type variable‑name ';'
variable‑type → type‑id
variable‑name → identifier
actions → 'actions' '{' lexer‑action* '}'
lexer‑action → action‑id '=' compound‑statement
action‑id → '$' '(' digit‑sequence ')'

A lexer file consists of a lexer module declaration followed by a sequence of imports followed by the keyword lexer followed by the name of the lexer followed by lexer content.

Each lexer file must be given a unique lexer module identifier in the lexer module declaration.

There should be an import statement for each token module, keyword module and expression module used by the lexer.

The lexer content consists of rules, variables and actions.

Each lexer rule consists of an expression string optionally followed by an action identifier followed by a C++ compound statement. Typically the compound statement returns a token identifier to the parser. The compound statement can also be empty in which case the corresponding token is skipped without returning it to the parser.

Example

// example.lexer:

export module example_lexer.lexer;

import example.token;
import example.keyword;
import example.expr;

lexer ExampleLexer
{
    rules
    {
        "{separators}" { }
        "{name}" { return NAME; }
        "{number}" { return NUMBER; }
        "{dq_string}" { return DQ_STRING; }
        "{sq_string}" { return SQ_STRING; }
        "=" { return EQ; }
        ...
        "\.\." { return DOT_DOT; }
        "\." { return DOT; }
        "::" { return COLON_COLON; }
        ":" { return COLON; }
        "$" { return DOLLAR; }
        "," { return COMMA; }
        "@" { return AT; }
        "\[" { return LBRACKET; }
        "\]" { return RBRACKET; }
        "\(" { return LPAREN; }
        "\)" { return RPAREN; }
    }
}

I use _lexer.lexer suffix for lexer module names.

Lexer file extension is .lexer

Variables and Actions

A lexer may also contain C++ variable declarations that can be accessed by the lexer actions and by the semantic actions of the parser. Each variable declaration consists of a C++ type id followed by the name of the variable and a semicolon. Variables are accessed through a pointer variable named vars. For example, to increment a variable named varName, add the statement ++vars->varName.

A lexer may also contain actions, each of which can be associated with a lexer rule. An action consists of an action identifier followed by the = symbol followed by a C++ compound statement. An action identifier consists of the $ symbol followed by a parenthesized digit sequence. Typically an action evaluates a lexer variable and, based on its value, either does nothing or returns the INVALID_TOKEN token. If the action does nothing, the lexer accepts the rule associated with the action. If the action returns INVALID_TOKEN, the lexer rejects the rule associated with the action identifier and instead accepts a rule that has matched before that rule, if any.

Consider implementing a parser that can parse nested template ids such as std::vector<std::unique_ptr<Foo>>, which ends with two right angle brackets. The problem is that without a lexical hack the lexer would return a single SHIFT_RIGHT token for the closing angle brackets, not two RANGLE tokens as we would like. One solution is for the parser to increment a leftAngleCount variable each time it sees a LANGLE token while parsing a template argument list (see parser 'vars' example). In the middle of the template id above, leftAngleCount is therefore two. When the lexer reaches the two closing > characters, leftAngleCount is greater than zero, so the action makes the lexer reject the ">>" rule that would return SHIFT_RIGHT. The rule for the ">" string has also matched and comes before the ">>" rule, so the lexer accepts the ">" rule instead. As a result, each > character inside a template id is returned as a RANGLE token, as desired.

Example

// 'vars':

lexer SlgLexer
{
    rules
    {
        ...
        "<" { return LANGLE; }
        ">" { return RANGLE; }
        ">>" $(0) { return SHIFT_RIGHT; }
        ...
    }

    variables 
    {
        int leftAngleCount;
    }
    
    actions
    {
        $(0)={ if (vars->leftAngleCount > 0) return INVALID_TOKEN; }
    }
}
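The effect of the $(0) action can be simulated with a small C++ sketch. It is a simplified model and not the generated lexer: rules match literal text instead of regular expressions, the longest accepted match is assumed to win, an action is modeled as a predicate that reports rejection instead of returning INVALID_TOKEN, and the names Rule, Match, NextToken and Tokenize are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

enum Token { INVALID_TOKEN = -1, LANGLE, RANGLE, SHIFT_RIGHT };

// Lexer variables, as in the 'variables' block of the example above.
struct Vars { int leftAngleCount = 0; };

struct Rule
{
    std::string text;                          // literal text instead of a regular expression
    Token token;                               // token returned by the rule's compound statement
    std::function<bool(const Vars&)> rejects;  // models the action; empty if the rule has none
};

struct Match { Token token; std::size_t length; };

// Picks the longest rule that matches at pos and is not rejected by its action.
inline Match NextToken(const std::string& input, std::size_t pos,
                       const std::vector<Rule>& rules, const Vars& vars)
{
    Match best{ INVALID_TOKEN, 0 };
    for (const Rule& rule : rules)
    {
        if (input.compare(pos, rule.text.size(), rule.text) != 0) continue; // no match
        if (rule.rejects && rule.rejects(vars)) continue;                   // action rejects the rule
        if (rule.text.size() > best.length) best = Match{ rule.token, rule.text.size() };
    }
    return best;
}

inline std::vector<Token> Tokenize(const std::string& input, const Vars& vars)
{
    const std::vector<Rule> rules = {
        { "<", LANGLE, nullptr },
        { ">", RANGLE, nullptr },
        { ">>", SHIFT_RIGHT, [](const Vars& v) { return v.leftAngleCount > 0; } } // $(0)
    };
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < input.size())
    {
        Match m = NextToken(input, pos, rules, vars);
        if (m.length == 0) break; // no rule matched; stop in this sketch
        tokens.push_back(m.token);
        pos += m.length;
    }
    return tokens;
}
```

In this model, tokenizing the text >> with leftAngleCount set to two produces two RANGLE tokens, while the same text with leftAngleCount at zero produces a single SHIFT_RIGHT token. The sketch leaves the variable unchanged during lexing, because updating it is the parser's job.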

3.5 Project File Syntax

lexer‑project‑file → 'project' lexer‑project‑name ';' lexer‑project‑file‑declaration*
lexer‑project‑name → qualified‑id
lexer‑project‑file‑declaration → token‑file‑declaration | keyword‑file‑declaration | expression‑file‑declaration | lexer‑file‑declaration
token‑file‑declaration → 'extern'? 'tokens' file‑path ';'
keyword‑file‑declaration → 'keywords' file‑path ';'
expression‑file‑declaration → 'expressions' file‑path ';'
lexer‑file‑declaration → 'lexer' file‑path ';'

A lexer project file consists of the keyword project followed by a lexer project name followed by a semicolon followed by a sequence of lexer project file declarations.

A lexer project file declaration can be a token file declaration, a keyword file declaration, an expression file declaration or a lexer file declaration.

A token file declaration consists of an optional extern keyword followed by the keyword tokens followed by a file path followed by a semicolon.

If a token file is declared extern, no C++ code is generated for it in this lexer project. It is then expected to be included non-externally in another lexer project.

A keyword file declaration consists of the keyword keywords followed by a file path followed by a semicolon.

An expression file declaration consists of the keyword expressions followed by a file path followed by a semicolon.

A lexer file declaration consists of the keyword lexer followed by a file path followed by a semicolon.

Example

project example.lexer;
tokens <example.token>;
keywords <example.keyword>;
expressions <example.expr>;
lexer <example_lexer.lexer>;

Lexer project file extension is .slg