Lexerless parsing

For some syntaxes it may be easier to give up the traditional lexer altogether than to construct one with the SoulNG lexer generator. When the lexer would have to return one token in one context and a different token in another context for the same lexeme, the SoulNG lexer generator might not be able to construct a working lexer.

In this case, one option is to use a trivial lexer. Instead of returning token identifiers to the parser, a trivial lexer returns just Unicode characters. The parser can then contain, in the form of lexerless expressions, the character and string literals that the lexer would normally recognize.
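
The idea of a trivial lexer can be sketched in a few lines of Python. This is only an illustration of the concept, not SoulNG's implementation: the names Token and trivial_lex are hypothetical, and the sketch assumes that the "token identifier" handed to the parser is simply the character's Unicode code point.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int       # value the parser matches on (assumed here: the code point)
    lexeme: str   # matched text, always a single character


def trivial_lex(text: str) -> list[Token]:
    """Hypothetical trivial lexer: one token per Unicode character."""
    return [Token(ord(c), c) for c in text]


tokens = trivial_lex("a<b")
print([t.id for t in tokens])  # [97, 60, 98]
```

Because every character reaches the parser unchanged, deciding what a lexeme means in a given context is left entirely to the parser rules.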

Parsing primitives in lexerless parsers

You can use three kinds of parsing primitives in lexerless parsers:

  1. strings enclosed in double quotes,
  2. characters enclosed in single quotes, and
  3. character sets enclosed in brackets that are in turn enclosed in double quotes.

Example of a rule recognizing a string

		somerule
			::= "keyword"
			;
	

Example of a rule recognizing a character

		somerule
			::= 'x'
			;
	

Example of a rule recognizing a set of characters

		hexchar
			::= "[a-fA-F0-9]"
			;
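
What the three primitives do at a given input position can be approximated with the following Python sketch. The function names are hypothetical, and modeling the character set with Python's re character class is an assumption; the escape rules of SoulNG character sets may differ. Each function returns the position after the match, or None when the primitive does not match.

```python
import re


def match_string(text, pos, s):
    """String primitive, e.g. "keyword": match s exactly at pos."""
    return pos + len(s) if text.startswith(s, pos) else None


def match_char(text, pos, c):
    """Character primitive, e.g. 'x': match the single character c at pos."""
    return pos + 1 if pos < len(text) and text[pos] == c else None


def match_charset(text, pos, charset):
    """Character set primitive, e.g. "[a-fA-F0-9]": match one character of the set.

    Assumption: the set is interpreted as a regex character class.
    """
    m = re.compile(charset).match(text, pos)
    return m.end() if m else None


print(match_string("keyword x", 0, "keyword"))  # 7
print(match_char("x", 0, "x"))                  # 1
print(match_charset("g", 0, "[a-fA-F0-9]"))     # None
```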
	

Composite parsing expressions in lexerless parsers

You can combine primitive and composite parsing expressions using the following parsing operators:

  1. when x and y are parsing expressions, x y is a parsing expression that recognizes an x followed by a y
  2. when x is a parsing expression, x* is a parsing expression that recognizes 0 or more xs
  3. when x is a parsing expression, x+ is a parsing expression that recognizes 1 or more xs
  4. when x is a parsing expression, x? is a parsing expression that recognizes 0 or 1 xs.
  5. when x and y are parsing expressions, x | y is a parsing expression that recognizes either an x or a y, tried in that order (meaning that if x matches, y is not tried).
  6. you can group parsing expressions using parentheses: for example, the parsing expression a (b | c)* d recognizes an a, followed by zero or more bs or cs in any order, followed by a d.
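
The semantics of the operators listed above can be modeled with a small set of Python combinators. This is a sketch for illustration only, not how SoulNG generates parsers: the names ch, seq, star, plus, opt and alt are hypothetical, and alt follows the reading given in item 5, where the first alternative that matches wins and later ones are not tried. Each parser takes the text and a position and returns the position after the match, or None on failure.

```python
def ch(c):
    """Primitive: match the single character c."""
    return lambda text, pos: pos + 1 if pos < len(text) and text[pos] == c else None


def seq(*parsers):
    """x y: every parser must match, one after another."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def star(p):
    """x*: zero or more repetitions; never fails."""
    def parse(text, pos):
        while True:
            nxt = p(text, pos)
            if nxt is None or nxt == pos:  # stop on failure or when no input is consumed
                return pos
            pos = nxt
    return parse


def plus(p):
    """x+: one or more repetitions."""
    return seq(p, star(p))


def opt(p):
    """x?: zero or one occurrence; never fails."""
    def parse(text, pos):
        nxt = p(text, pos)
        return pos if nxt is None else nxt
    return parse


def alt(*parsers):
    """x | y: ordered choice; the first alternative that matches wins."""
    def parse(text, pos):
        for p in parsers:
            nxt = p(text, pos)
            if nxt is not None:
                return nxt
        return None
    return parse


# a (b | c)* d
rule = seq(ch("a"), star(alt(ch("b"), ch("c"))), ch("d"))
print(rule("abcbd", 0))  # 5
print(rule("ad", 0))     # 2
print(rule("abx", 0))    # None
```

One consequence of ordered choice in this model is that the order of alternatives matters: seq(alt(ch("a"), seq(ch("a"), ch("b"))), ch("c")) fails on "abc", because the first alternative matches the a and the longer alternative is never tried.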

Example

As an example of a lexerless parser, take a look at the XML parser. You can recognize a lexerless parser by the uselexer TrivialLexer; statement.