up: Table of contents | prev: Minilang Syntax | next: Building the Lexer

2.1 Lexical Analyzer

Operation of a Lexical Analyzer

The task of the lexical analyzer is to recognize lexical units called tokens in the character stream and return them to the parser. Typical tokens for a language include identifiers, keywords, literals, punctuation and operator symbols. The only requirement for a token is that it can be matched by a regular expression pattern.

A token is defined by choosing a name and an informative string for it and including them in a tokens declaration that is placed in a .lexer file. A regular expression pattern for each token is then declared in an expressions declaration and/or a lexer definition, which are also placed in the .lexer file. When the generated lexical analyzer recognizes that pattern, the corresponding token name is returned to the parser.

Token Declarations

We begin defining the lexical analyzer for the Minilang language by creating token declarations in the Minilang.lexer file. The Minilang language has six statements: if, while, return, construction, assignment and compound statement. Of those, the if, while and return statements include keywords, so we create token names for them:

        tokens MinilangTokens
        {
            (IF, "'if'"), (ELSE, "'else'"), (WHILE, "'while'"), (RETURN, "'return'")
        }
    

Tokens are declared in a tokens declaration that consists of the keyword tokens, followed by an identifier naming the declaration, followed by a comma-separated list of identifier/string literal pairs enclosed in braces. The identifier is the name of a token. It is typically written in capital letters so that it will not clash with keywords of the implementation language. The informative string for a token is included in error messages produced by the parser when a token is expected to match but does not. For example, for a token name LPAREN, the informative string can be "'('", so that the parser error message will say: '(' expected.

The language has three types: int, bool and void, so we include token declarations for those keywords as well:

        tokens MinilangTokens
        {
            (IF, "'if'"), (ELSE, "'else'"), (WHILE, "'while'"), (RETURN, "'return'"), (INT, "'int'"), (BOOL, "'bool'"), (VOID, "'void'")
        }
    

The names of variables and functions of the language are identifiers, so we reserve the token name ID for those:

        tokens MinilangTokens
        {
            (IF, "'if'"), (ELSE, "'else'"), (WHILE, "'while'"), (RETURN, "'return'"), (INT, "'int'"), (BOOL, "'bool'"), (VOID, "'void'"),
            (ID, "identifier")
        }
    

The language has two types of literals: integer and Boolean literals, so next we define tokens for them:

        tokens MinilangTokens
        {
            (IF, "'if'"), (ELSE, "'else'"), (WHILE, "'while'"), (RETURN, "'return'"), (INT, "'int'"), (BOOL, "'bool'"), (VOID, "'void'"),
            (ID, "identifier"), (INTLIT, "integer literal"), (TRUE, "'true'"), (FALSE, "'false'")
        }
    

Then come the punctuation tokens. Each statement of the Minilang language is terminated by a semicolon, so we declare a SEMICOLON token name. Parameter and argument lists are enclosed in parentheses: LPAREN and RPAREN are the token names for those. Compound statements are enclosed in braces, so LBRACE and RBRACE are the names for them:

        tokens MinilangTokens
        {
            (IF, "'if'"), (ELSE, "'else'"), (WHILE, "'while'"), (RETURN, "'return'"), (INT, "'int'"), (BOOL, "'bool'"), (VOID, "'void'"),
            (ID, "identifier"), (INTLIT, "integer literal"), (TRUE, "'true'"), (FALSE, "'false'"),
            (SEMICOLON, "';'"), (LPAREN, "'('"), (RPAREN, "')'"), (LBRACE, "'{'"), (RBRACE, "'}'")
        }
    

Finally we declare token names for the operator symbols. The language has Boolean and arithmetic expressions. Arithmetic expressions have the operator symbols '+', '-', '*', '/' and '%', and Boolean expressions have '!', '==', '!=', '<=', '>=', '<' and '>'. The assignment and construction statements have the '=' symbol, and the parameter and argument lists have ',':

        tokens MinilangTokens
        {
            (IF, "'if'"), (ELSE, "'else'"), (WHILE, "'while'"), (RETURN, "'return'"), (INT, "'int'"), (BOOL, "'bool'"), (VOID, "'void'"),
            (ID, "identifier"), (INTLIT, "integer literal"), (TRUE, "'true'"), (FALSE, "'false'"),
            (SEMICOLON, "';'"), (LPAREN, "'('"), (RPAREN, "')'"), (LBRACE, "'{'"), (RBRACE, "'}'"),
            (PLUS, "'+'"), (MINUS, "'-'"), (MUL, "'*'"), (DIV, "'/'"), (MOD, "'%'"),
            (NOT, "'!'"), (EQ, "'=='"), (NEQ, "'!='"), (LEQ, "'<='"), (GEQ, "'>='"), (LESS, "'<'"), (GREATER, "'>'"),
            (ASSIGN, "'='"), (COMMA, "','")
        }
    

Now we have declared the names of all tokens. We have a choice of defining the patterns for the keywords of the language as separate regular expressions, or alternatively having the lexical analyzer recognize them as identifiers and separating them in the rule for an identifier. The SoulNG lexer generator has direct support for the second alternative, so we present it next.

Keyword Declarations

When the keywords are recognized as identifiers and separated in the rule for an identifier, we use a keywords declaration for them:

        keywords MinilangKeywords
        {
            ("if", IF), ("else", ELSE), ("while", WHILE), ("return", RETURN), ("int", INT), ("bool", BOOL), ("void", VOID), ("true", TRUE), ("false", FALSE)
        }
    

The keywords declaration consists of the keyword keywords, followed by an identifier naming the declaration, followed by a comma-separated list of string literal/identifier pairs enclosed in braces. The string literal defines the keyword string and the identifier defines the corresponding token name to be returned to the parser. The lexical analyzer will have a mapping from keyword strings to keyword token names that can be consulted from an action of a lexer rule.

Declaring Token Pattern Expressions

Now that we have defined the names of the tokens and the keyword strings, we turn our attention to declaring the patterns for the tokens. We begin by declaring a pattern for an integer literal:

        expressions
        {
            digit = "[0-9]";
            digit_sequence = "{digit}+";
            integer_literal = "{digit_sequence}";
        }
    

Token pattern expressions are declared in SoulNG in an expressions declaration that consists of the keyword expressions followed by expression definitions enclosed in braces. An expression definition consists of a name for the expression, an assignment symbol and a regular expression string, and is terminated by a semicolon. An expression string may refer to other expressions that come before it in the expressions declaration by enclosing the name of the referenced expression in braces. For example, the digit_sequence expression contains a reference to the digit expression above it. In the above pattern strings, [0-9] is a character class containing ranges of characters to be matched. The expression e+ matches one or more e's, so "[0-9]" matches any single decimal digit and "[0-9]+" matches one or more decimal digits. Next we have a pattern for an identifier:

        expressions
        {
            digit = "[0-9]";
            digit_sequence = "{digit}+";
            integer_literal = "{digit_sequence}";
            identifier = "{idstart}{idcont}*";
        }
    

The expression e0 e1 matches expression e0 followed by expression e1, and the expression e* matches zero or more e's, so the expression "{idstart}{idcont}*" matches expression idstart followed by zero or more matches of expression idcont. The whole expression matches a single letter or an underscore followed by zero or more letters, digits and underscores. We have not defined the expressions idstart and idcont here, because they are built into the SoulNG lexer generator. The reason these expressions are built-in is that by default the patterns include not only the ASCII identifier ranges, but also the Unicode identifier ranges containing non-ASCII Unicode letters. They are too numerous to be written as literal regular expressions. If only ASCII identifiers are preferred, the user can request them by using the '-a' option when running the slg command, or define them manually as the patterns "[A-Za-z_]" and "[A-Za-z_0-9]".

To be recognized properly, the other tokens must be separated by separator tokens:

        expressions
        {
            digit = "[0-9]";
            digit_sequence = "{digit}+";
            integer_literal = "{digit_sequence}";
            identifier = "{idstart}{idcont}*";
            ws = "[\n\r\t ]";
            separators = "{ws}+";
        }
    

We have defined an expression ws that is a character class containing the white space characters: '\n' for a newline, '\r' for a carriage return, '\t' for a tab and ' ' for a space. Then we have defined the expression separators to match one or more of those white space characters.

Defining a Lexer

The token identifiers and the regular expression patterns for recognizing the tokens are connected together in a lexer definition. The lexer definition consists of the keyword lexer, followed by an identifier naming the lexer, followed by rules enclosed in braces. Each rule has a regular expression pattern followed by the corresponding semantic action, written as C++ code. The generated lexer operates in a loop recognizing patterns; the pattern that has the longest match wins. When a pattern matches, the generated lexer executes the semantic action associated with it. Typically the semantic action returns the token identifier that corresponds to the matched pattern to the caller, but there are two exceptions, a rule for separators and a rule for identifiers, which are described first:

        lexer MinilangLexer
        {
            "{separators}" {}
        }
    

First we have a rule for token separators. The semantic action for separators is special in that it does nothing. In particular, it does not return a token identifier to the parser as the semantic actions of other rules do. When the semantic action is empty, the lexer skips the match and continues recognizing other tokens.

The next rule is for the identifiers:

        lexer MinilangLexer
        {
            "{separators}" {}
            "{identifier}" { int kw = GetKeywordToken(token.match); if (kw == INVALID_TOKEN) return ID; else return kw; }
        }
    

The second rule, for identifiers, is also special, because it is used for recognizing both identifiers and keywords. Its semantic action checks whether the matched string is a keyword. The built-in GetKeywordToken() function returns INVALID_TOKEN when the match is not a keyword; otherwise it returns the token identifier for the matched keyword. Thus when GetKeywordToken() returns INVALID_TOKEN, we know that the match is an ordinary identifier and return ID to the parser; otherwise we return the token identifier for the keyword.

The rule for an integer literal is simpler:

        lexer MinilangLexer
        {
            "{separators}" {}
            "{identifier}" { int kw = GetKeywordToken(token.match); if (kw == INVALID_TOKEN) return ID; else return kw; }
            "{integer_literal}" { return INTLIT; }
        }
    

When the match is for the integer_literal pattern, we return the INTLIT token identifier to the parser. Next come the punctuation symbols:

        lexer MinilangLexer
        {
            "{separators}" {}
            "{identifier}" { int kw = GetKeywordToken(token.match); if (kw == INVALID_TOKEN) return ID; else return kw; }
            "{integer_literal}" { return INTLIT; }
            ";" { return SEMICOLON; }
            "\(" { return LPAREN; }
            "\)" { return RPAREN; }
            "\{" { return LBRACE; }
            "\}" { return RBRACE; }
        }
    

The parentheses have special meaning in regular expression patterns: they are used for grouping. The braces are used in patterns for referencing other expressions. Thus when we want literal parentheses and braces to be matched in a pattern, they must be escaped with the backslash character. The operator symbols are next in line:

        lexer MinilangLexer
        {
            "{separators}" {}
            "{identifier}" { int kw = GetKeywordToken(token.match); if (kw == INVALID_TOKEN) return ID; else return kw; }
            "{integer_literal}" { return INTLIT; }
            ";" { return SEMICOLON; }
            "\(" { return LPAREN; }
            "\)" { return RPAREN; }
            "\{" { return LBRACE; }
            "\}" { return RBRACE; }
            "\+" { return PLUS; }
            "-" { return MINUS; }
            "\*" { return MUL; }
            "/" { return DIV; }
            "%" { return MOD; }
            "!" { return NOT; }
            "==" { return EQ; }
            "!=" { return NEQ; }
            "<=" { return LEQ; }
            ">=" { return GEQ; }
            "<" { return LESS; }
            ">" { return GREATER; }
        }
    

The '+' and '*' characters are operators in regular expressions, so they must also be escaped with the backslash when we want their literal meaning. The other symbols stand for themselves.

Only the assignment and comma symbols are left:

        lexer MinilangLexer
        {
            "{separators}" {}
            "{identifier}" { int kw = GetKeywordToken(token.match); if (kw == INVALID_TOKEN) return ID; else return kw; }
            "{integer_literal}" { return INTLIT; }
            ";" { return SEMICOLON; }
            "\(" { return LPAREN; }
            "\)" { return RPAREN; }
            "\{" { return LBRACE; }
            "\}" { return RBRACE; }
            "\+" { return PLUS; }
            "-" { return MINUS; }
            "\*" { return MUL; }
            "/" { return DIV; }
            "%" { return MOD; }
            "!" { return NOT; }
            "==" { return EQ; }
            "!=" { return NEQ; }
            "<=" { return LEQ; }
            ">=" { return GEQ; }
            "<" { return LESS; }
            ">" { return GREATER; }
            "=" { return ASSIGN; }
            "," { return COMMA; }
        }
    

Now the MinilangLexer should recognize all tokens of the language. Next we will build the lexer and compile it...
