Lexer File Syntax

Table of contents

1 Lexer File Declarations
2 Class Map Declaration
    2.1 Example
3 Prefix Declaration
    3.1 Example
4 Include Declaration
    4.1 Example
5 Token Declarations
    5.1 Example
6 Keyword Declarations
    6.1 Example
7 Expression Declarations
    7.1 Example
    7.2 Predefined Expressions
8 Lexer Declaration
    8.1 Example
    8.2 Variable Declarations
    8.3 Conditional Action Declarations
9 Keywords

The notation used to describe the lexer file syntax in this document is defined separately.

1 Lexer File Declarations

A .lexer file consists of lexer-file-declarations, of which the tokens-declaration and the lexer-declaration are mandatory:

lexer‑file lexer‑file‑declaration*
lexer‑file‑declaration class‑map‑declaration | prefix‑declaration | include‑declaration | tokens‑declaration | keywords‑declaration | expressions‑declaration | lexer‑declaration

If the .lexer file contains identifiers with non-ASCII Unicode characters, the file should be encoded in UTF-8. The lexer generator slg generates C++ source and header files that are encoded in UTF-8.

2 Class Map Declaration

If a program or library contains more than one lexer, each lexer must have a class map with a different name. The class map declaration sets the name of the generated class map:

class‑map‑declaration classmap identifier ;

2.1 Example

For the following declaration:

        classmap ExampleClassMap;
    

the slg tool will generate files ExampleClassMap.hpp and ExampleClassMap.cpp that contain the class map.

The default name of the class map is ClassMap. By default slg will generate files ClassMap.hpp and ClassMap.cpp that contain the class map.

3 Prefix Declaration

The prefix declaration sets the path prefix of the include directives generated by slg.

prefix‑declaration prefix string‑literal ;

3.1 Example

If the project that contains the generated lexer is located in the foo/bar subdirectory of the solution directory, you can use the following prefix declaration:

        prefix "foo/bar";
    
Then slg generates include directives of the form #include <foo/bar/File.hpp> in generated source code.

4 Include Declaration

The slg tool copies an include declaration verbatim to the generated header files.

include‑declaration # include file‑path

4.1 Example

If you use a function Foo(int) in a semantic action of a lexer, and the prototype of the Foo function is in the header file Foo.hpp, you need to add the following include declaration to the .lexer file:
        #include <Foo.hpp>
    
Then slg will add the include directive #include <Foo.hpp> to the beginning of the generated header files.

5 Token Declarations

The names and informative strings of tokens are defined in the tokens-declaration:

tokens‑declaration tokens identifier { (token‑declaration (, token‑declaration)*)? }

The tokens-declaration consists of the keyword tokens followed by an identifier followed by token-declarations separated by commas and enclosed in braces.

The identifier that follows the tokens keyword names the generated source and header files, and the namespace that will contain the C++ definitions for the tokens.

Each token-declaration consists of the name and an informative string of a token separated by a comma and enclosed in parentheses:

token‑declaration ( identifier , string‑literal )

The informative string is included in error messages produced by the generated lexer.

5.1 Example

The following tokens-declaration defines token names ID, IF, ELSE, LPAREN, RPAREN and SEMICOLON and informative strings for them:

        tokens ExampleTokens
        {
            (ID, "identifier"), (IF, "'if'"), (ELSE, "'else'"), (LPAREN, "'('"), (RPAREN, "')'"), (SEMICOLON, "';'")
        }
    

The slg tool will generate files ExampleTokens.hpp and ExampleTokens.cpp that contain namespace ExampleTokens. The ExampleTokens namespace contains the identifiers and numeric values of the tokens defined as C++ integer constants:

        // ExampleTokens.hpp:

        namespace ExampleTokens
        {
            const int END = 0;
            const int ID = 1;
            const int IF = 2;
            const int ELSE = 3;
            const int LPAREN = 4;
            const int RPAREN = 5;
            const int SEMICOLON = 6;
            // ...
        }
    

The END token, which slg defines automatically, represents the end-of-input condition in the generated lexer.
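
As a rough illustration of how the informative strings can end up in error messages, the sketch below pairs the constants from the example with a lookup table indexed by token ID. The tokenInfo table, the GetTokenInfo and ExpectedMessage helpers, the message format, and the text chosen for END are all hypothetical names and values introduced for this illustration; the code that slg actually generates may be organized differently.

```cpp
#include <cassert>
#include <string>

// Token constants as they appear in the generated ExampleTokens.hpp above.
namespace ExampleTokens
{
    const int END = 0;
    const int ID = 1;
    const int IF = 2;
    const int ELSE = 3;
    const int LPAREN = 4;
    const int RPAREN = 5;
    const int SEMICOLON = 6;

    // Hypothetical lookup table indexed by token ID; not part of the
    // documented slg output. The text for END is an assumption.
    const char* const tokenInfo[] =
    {
        "end of input", "identifier", "'if'", "'else'", "'('", "')'", "';'"
    };
    const int tokenCount = sizeof(tokenInfo) / sizeof(tokenInfo[0]);

    // Returns the informative string of a token.
    inline const char* GetTokenInfo(int tokenId)
    {
        assert(tokenId >= 0 && tokenId < tokenCount);
        return tokenInfo[tokenId];
    }
}

// Hypothetical helper that builds a parser-style error message.
inline std::string ExpectedMessage(int expected, int found)
{
    return std::string("expected ") + ExampleTokens::GetTokenInfo(expected) +
        ", found " + ExampleTokens::GetTokenInfo(found);
}
```

With these definitions, ExpectedMessage(ExampleTokens::RPAREN, ExampleTokens::SEMICOLON) produces the text expected ')', found ';'.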

6 Keyword Declarations

Language keywords are defined in the keywords-declaration, which consists of the keyword keywords followed by an identifier followed by keyword-declarations separated by commas and enclosed in braces:

keywords‑declaration keywords identifier { (keyword‑declaration (, keyword‑declaration)*)? }

The identifier that follows the keywords keyword names the generated source and header files, and the namespace that will contain C++ definitions for the keywords.

Each keyword-declaration consists of a literal string and the corresponding token identifier of the keyword separated by a comma and enclosed in parentheses:

keyword‑declaration ( string‑literal , identifier )

6.1 Example

The following keywords-declaration defines two keywords, if and else, with the corresponding token identifiers IF and ELSE:

        keywords ExampleKeywords
        {
            ("if", IF), ("else", ELSE)
        }
    

The slg tool will generate files ExampleKeywords.hpp and ExampleKeywords.cpp that contain a namespace ExampleKeywords. The ExampleKeywords namespace contains C++ definitions for the keywords:

        // ExampleKeywords.cpp:

        namespace ExampleKeywords
        {
            using namespace ExampleTokens;

            Keyword keywords[] =
            {
                {U"if", IF}, 
                {U"else", ELSE},
                {nullptr, -1}
            };

            // ...
        }
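
A table like the one above can be searched with a simple linear scan that stops at the {nullptr, -1} sentinel. The following is a minimal sketch of such a lookup, not the soulng runtime's implementation: the layout of the Keyword struct, the LookupKeyword function name, and the value -1 for INVALID_TOKEN are assumptions made for this illustration, and the real GetKeywordToken function used in section 8.1 may work differently.

```cpp
#include <string>

// Assumed value; chosen to match the sentinel entry of the table.
const int INVALID_TOKEN = -1;

namespace ExampleTokens
{
    const int IF = 2;
    const int ELSE = 3;
}

// Assumed layout of a keyword table entry.
struct Keyword
{
    const char32_t* str;
    int tokenID;
};

namespace ExampleKeywords
{
    using namespace ExampleTokens;

    Keyword keywords[] =
    {
        {U"if", IF},
        {U"else", ELSE},
        {nullptr, -1}
    };

    // Hypothetical lookup: returns the token ID of a keyword,
    // or INVALID_TOKEN if the text is not a keyword.
    inline int LookupKeyword(const std::u32string& text)
    {
        for (const Keyword* k = keywords; k->str != nullptr; ++k)
        {
            if (text == k->str) return k->tokenID;
        }
        return INVALID_TOKEN;
    }
}
```

A lexer clause for identifiers can then return the keyword token when the lookup succeeds and the ID token when the lookup yields INVALID_TOKEN, as the "{id}" clause in section 8.1 does.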
    

7 Expression Declarations

An expressions-declaration contains named regular expression patterns that can be used in other expression declarations and in a lexer declaration. The expressions-declaration consists of the keyword expressions followed by expression-declarations in braces:

expressions‑declaration expressions { expression‑declaration* }

Each expression-declaration consists of the name of the regular expression pattern, an assignment symbol, the regular expression pattern itself, and a semicolon:

expression‑declaration identifier = regular‑expression ;

An expression declaration can be used in expression declarations following it and in a lexer declaration.

7.1 Example

The following expression declarations define six regular expression patterns: ws, newline, linecomment, blockcomment, comment and separators:
        expressions
        {
            ws = "[\n\r\t ]";
            newline = "\r\n|\n|\r";
            linecomment = "//[^\n\r]*{newline}";
            blockcomment = "/\*([^*]|\*[^/])*\*/";
            comment = "{linecomment}|{blockcomment}";
            separators = "({ws}|{comment})+";
        }
    

The separators pattern is used in the following lexer declaration to skip white space and comments:

        lexer ExampleLexer
        {
            "{separators}" {}
        }
    

7.2 Predefined Expressions

There are two predefined expressions: idstart represents the Unicode characters that may start an identifier, and idcont represents the Unicode characters that may follow the first character of an identifier. The idstart pattern consists of letters and the underscore character; the idcont pattern consists of letters, digits and the underscore character. By default both patterns also include non-ASCII Unicode letters. If the -a option is given to slg, they contain only ASCII characters.
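
To make the two character classes concrete, the sketch below writes the ASCII-only variants (the ones selected by the -a option) as plain C++ predicates and uses them to check the pattern {idstart}{idcont}*. The function names are made up for this illustration and are not part of slg or its generated code, and the Unicode variants are not covered here.

```cpp
#include <string>

// ASCII-only idstart: letters and the underscore character.
inline bool IsAsciiIdStart(char32_t c)
{
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') || c == U'_';
}

// ASCII-only idcont: letters, digits and the underscore character.
inline bool IsAsciiIdCont(char32_t c)
{
    return IsAsciiIdStart(c) || (c >= U'0' && c <= U'9');
}

// Returns true if the whole string matches {idstart}{idcont}*.
inline bool MatchesAsciiId(const std::u32string& s)
{
    if (s.empty() || !IsAsciiIdStart(s[0])) return false;
    for (std::u32string::size_type i = 1; i < s.size(); ++i)
    {
        if (!IsAsciiIdCont(s[i])) return false;
    }
    return true;
}
```

Under these definitions a string such as _foo1 matches, while a string that starts with a digit or contains a non-ASCII letter does not.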

8 Lexer Declaration

A lexer-declaration connects regular expression patterns to token identifiers. It may also contain C++ variable declarations and conditional actions.

lexer‑declaration lexer api? identifier { lexer‑clause* }

A lexer-declaration consists of the keyword lexer followed by an optional API specifier, the name of the lexer, and lexer-clauses enclosed in braces.

A lexer-clause defines a regular expression pattern and a semantic action, a C++ compound statement, that is executed when input matches the regular expression pattern. The lexer clause may also be a variables-declaration or a conditional-actions-declaration:

lexer‑clause regular‑expression conditional‑action‑id? compound‑statement |
variables‑declaration |
conditional‑actions‑declaration

A typical lexer clause returns a token identifier to the parser in its semantic action when input matches the corresponding regular expression.

8.1 Example

        tokens ExampleTokens
        {
            (ID, "identifier"), (IF, "'if'"), (ELSE, "'else'"), (LPAREN, "'('"), (RPAREN, "')'"), (SEMICOLON, "';'")
        }

        keywords ExampleKeywords
        {
            ("if", IF), ("else", ELSE)
        }

        expressions
        {
            ws = "[\n\r\t ]";
            newline = "\r\n|\n|\r";
            linecomment = "//[^\n\r]*{newline}";
            blockcomment = "/\*([^*]|\*[^/])*\*/";
            comment = "{linecomment}|{blockcomment}";
            separators = "({ws}|{comment})+";
            id = "{idstart}{idcont}*";
        }

        lexer ExampleLexer
        {
            "{separators}" {}
            "{id}" { int kw = GetKeywordToken(token.match); if (kw == INVALID_TOKEN) return ID; else return kw; }
            "\(" { return LPAREN; }
            "\)" { return RPAREN; }
            ";" { return SEMICOLON; }
        }
    

The slg tool will generate files ExampleLexer.hpp and ExampleLexer.cpp that contain the C++ class ExampleLexer, which is the generated lexer:

        // ExampleLexer.hpp:

        class ExampleLexer : public soulng::lexer::Lexer
        {
        public:
            ExampleLexer(const std::u32string& content_, const std::string& fileName_, int fileIndex_);
            int NextState(int state, char32_t c) override;
        private:
            int GetTokenId(int statementIndex);
        };
    

8.2 Variable Declarations

A variables-declaration consists of the keyword variables followed by C++ variable declarations in braces:

variables‑declaration variables { variable‑declaration* }

A variable declaration consists of the C++ type of the variable followed by the name of the variable followed by a semicolon:

variable‑declaration type‑id identifier ;

Variables may be used for communication between a lexer and a parser.

8.3 Conditional Action Declarations

A conditional-actions-declaration consists of the keyword actions followed by conditional actions in braces:

conditional‑actions‑declaration actions { conditional‑action‑declaration* }

A conditional action declaration consists of the conditional action identifier followed by an assignment symbol followed by a C++ compound statement:

conditional‑action‑declaration conditional‑action‑id = compound‑statement

Conditional actions may be used in communication between a lexer and a parser.

A typical conditional action checks a Boolean flag that the parser sets when some condition holds. If the compound statement of the conditional action returns the value INVALID_TOKEN, the generated lexer rejects the token currently being matched and returns the previously matched token to the parser. If the compound statement does not return anything, the generated lexer accepts the token currently being matched and returns it to the parser.

A conditional action identifier consists of a dollar symbol followed by a numeric value of the action in parentheses:

conditional‑action‑id $ ( integer‑literal )

A conditional action identifier may be used in a lexer-clause to conditionally execute a semantic action of the lexer.
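
The accept/reject rule described above can be modeled in a few lines of plain C++. The sketch below is a simulation under stated assumptions, not generated code: the token IDs, the leftAngleCount variable, and the function names are hypothetical, and the Decision enumeration stands in for the two outcomes of a conditional action (returning INVALID_TOKEN versus returning nothing).

```cpp
#include <cassert>

// Hypothetical token IDs for this illustration.
const int RANGLE = 1;       // '>'
const int SHIFT_RIGHT = 2;  // '>>'

// Hypothetical lexer variable, as it could be declared in a
// variables-declaration. The parser is assumed to increment it
// while it is inside an angle-bracketed argument list.
struct LexerVariables
{
    int leftAngleCount = 0;
};

enum class Decision { accept, reject };

// Models the body of a conditional action such as $(1).
// Decision::reject stands for "return INVALID_TOKEN";
// Decision::accept stands for not returning anything.
inline Decision ConditionalAction1(const LexerVariables& vars)
{
    assert(vars.leftAngleCount >= 0);
    return vars.leftAngleCount > 0 ? Decision::reject : Decision::accept;
}

// Models the lexer's choice: a rejected token is replaced by the
// token that matched before it; an accepted token is returned as is.
inline int ResolveToken(const LexerVariables& vars, int currentToken, int previousToken)
{
    if (ConditionalAction1(vars) == Decision::reject) return previousToken;
    return currentToken;
}
```

In this model, with leftAngleCount equal to zero the call ResolveToken(vars, SHIFT_RIGHT, RANGLE) yields SHIFT_RIGHT, and once the parser has raised leftAngleCount the same call yields RANGLE, so the input >> is handed to the parser one > at a time.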

9 Keywords

The following keywords may not be used as identifiers in .lexer files:

lexer‑file‑keyword cppkeyword | actions | api | classmap | expressions | include | keywords | lexer | prefix | tokens | variables