Farthest Error and State

Minixml parser

Take a look at the following minixml.parser:

[cpp]#include <soulng/parser/Range.hpp>
[cpp]#include <soulng/lexer/TrivialLexer.hpp>

parser minixml
{
    uselexer TrivialLexer;
    main;
	
    document
        ::= spaces:s1? element:e spaces:s2?
        ;
		
    element
        ::= '<' name:n1
        (   '>'
            element_content:ec
            "</" name:n2 '>'
        |   "/>"
        )
        ;

    spaces
        ::= "[ \t\r\n]"+
        ;

    name
        ::= name_start_char:nsc name_char:nc*
        ;

    name_start_char
        ::= "[A-Za-z_]"
        ;

    name_char
        ::= name_start_char:nsc | "[0-9]"
        ;

    element_content
        ::= 
        (   element:e
        |   text:t
        )*
        ;

    text
        ::= "[^<]"+
        ;
}

Input

The parser above is meant for XML-like documents such as the following philosophers.xml:

<philosophers>
 <philosopher>Socrates</philosopher>
 <philosopher>Plato</philosopher>
 <philosopher>Aristotle</philosopher;>
 <philosopher>Pythagoras</philosopher>
</philosophers>

Here's a test program that reads the philosophers.xml and parses it:

Main program

#include <minixml.hpp>
#include <soulng/util/Unicode.hpp>
#include <soulng/util/MappedInputFile.hpp>
#include <soulng/lexer/TrivialLexer.hpp>
#include <iostream>

using namespace soulng::unicode;
using namespace soulng::util;

int main()
{
    try
    {
        std::string fileName = "philosophers.xml";
        std::string file = ReadFile(fileName);
        std::u32string content = ToUtf32(file);
        TrivialLexer lexer(content, fileName, 0);
        minixml::Parse(lexer);
    }
    catch (const std::exception& ex)
    {
        std::cerr << ex.what() << std::endl;
        return 1;
    }
    return 0;
}

Testing

If I run the program I get the following output:

>minixml
parsing error at 'philosophers.xml:1': document expected:

^

Apparently the input contains an error, but the error message claims that the error is at the start of the input file, on line 1, and not in the middle where it actually is (the stray semicolon in the closing tag on line 4).

The problem with backtracking parsers that lack expectation expressions is that they either accept the entire input, if the input is good, or report the error at the start of the input, if it is not. When a rule fails, the parser backtracks to where that rule began, so the failure propagates up to the main rule, which began at the start of the input.

Minixml with farthest error declaration

I can get better error messages if I use the farthest_error declaration:

[cpp]#include <soulng/parser/Range.hpp>
[cpp]#include <soulng/lexer/TrivialLexer.hpp>

parser minixml
{
    uselexer TrivialLexer;
    farthest_error;
    main;
    // ...
}

Main program with farthest error detection

The main program also needs a modification:

#include <minixml.hpp>
#include <soulng/util/Unicode.hpp>
#include <soulng/util/MappedInputFile.hpp>
#include <soulng/lexer/TrivialLexer.hpp>
#include <iostream>

using namespace soulng::unicode;
using namespace soulng::util;

int main()
{
    try
    {
        std::string fileName = "philosophers.xml";
        std::string file = ReadFile(fileName);
        std::u32string content = ToUtf32(file);
        TrivialLexer lexer(content, fileName, 0);
        lexer.SetFlag(soulng::lexer::LexerFlags::farthestError);
        minixml::Parse(lexer);
    }
    catch (const std::exception& ex)
    {
        std::cerr << ex.what() << std::endl;
        return 1;
    }
    return 0;
}

I have set the LexerFlags::farthestError flag of the lexer. Otherwise the main program is the same as before.

Testing farthest error

Now if I run the program I get the following output:

>minixml.exe
parsing error at 'philosophers.xml:4':
 <philosopher>Aristotle</philosopher;>
                                    ^

Now the error location is more accurate.

How it works

A parser with a farthest_error declaration keeps track of the farthest position the lexer has reached. If the input contains a parsing error, the error message points to this location.
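
To make the idea concrete, here is a minimal sketch of farthest-position tracking in a hand-written backtracking parser. It is not soulng's implementation: ToyLexer, Match, ParseItem and ParseDocument are hypothetical names invented for this illustration, and the toy grammar (a sequence of "<a>" items) only stands in for minixml.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical toy lexer (not soulng's): a cursor over the input plus a
// high-water mark that remembers the farthest position ever reached.
struct ToyLexer
{
    explicit ToyLexer(const std::string& input_) : input(input_) {}

    // Consume one character if it matches; advance the high-water mark.
    bool Match(char c)
    {
        if (pos < input.size() && input[pos] == c)
        {
            ++pos;
            if (pos > farthest) farthest = pos;
            return true;
        }
        return false;
    }

    std::string input;
    std::size_t pos = 0;       // current position, reset when backtracking
    std::size_t farthest = 0;  // farthest position reached, never reset
};

// item ::= '<' 'a' '>'
bool ParseItem(ToyLexer& lexer)
{
    std::size_t start = lexer.pos;
    if (lexer.Match('<') && lexer.Match('a') && lexer.Match('>')) return true;
    lexer.pos = start;  // backtrack: the current position forgets the progress
    return false;
}

// document ::= item* end-of-input
bool ParseDocument(ToyLexer& lexer)
{
    std::size_t start = lexer.pos;
    while (ParseItem(lexer)) {}
    if (lexer.pos == lexer.input.size()) return true;
    lexer.pos = start;  // backtrack all the way to the start of the input
    assert(lexer.farthest >= lexer.pos);
    return false;
}
```

For the input "<a><a;><a>", ParseDocument fails and leaves pos at 0, which is why a plain backtracking parser blames the start of the input. The farthest member still holds 5, the index of the stray semicolon, so an error message built from it points at the real problem.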

State

In addition to the farthest_error declaration, there is another declaration that may be useful in parser debugging: the state declaration. When parsing input with a parser that has a state declaration, the lexer and the parser keep track of the current parsing state. If a parsing error occurs, the error message includes the state of the parser at the point where the error was detected. Here's a parser that has a state declaration:

[cpp]#include <soulng/parser/Range.hpp>
[cpp]#include <soulng/lexer/TrivialLexer.hpp>

parser minixml
{
    uselexer TrivialLexer;
    farthest_error;
    state;
    main;
    // ...
}

Rules

When using the state declaration, the minixml.spg file needs a modification as well: a rules clause must be added:

project minixml;
source <minixml.parser>;
rules <rules>;

The rules clause declares a base file name, without an extension, that the parser generator uses for the files that hold the names of the parsing rules. In this case the base name is rules, so spg generates two files: rules.cpp, which contains the rule names in a vector, and rules.hpp, which contains a declaration of the getter function for the rule name vector and constants for the rules.
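
The contents of the generated files are not shown here, so the following is only a hypothetical sketch of what such a pair of files could look like, collapsed into one listing. The function name GetRuleNameVecPtr is the one used by the main program; its return type, the constant names and values, and the RuleName helper are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// --- what rules.hpp might declare (hypothetical layout) ---
// One constant per parsing rule; names and values are assumptions.
const int minixml_document = 0;
const int minixml_element = 1;
const int minixml_spaces = 2;

// Getter for the rule name vector; the return type is an assumption.
std::vector<const char*>* GetRuleNameVecPtr();

// --- what rules.cpp might define (hypothetical layout) ---
std::vector<const char*>* GetRuleNameVecPtr()
{
    // The index of each name equals the value of the matching constant.
    static std::vector<const char*> rules = {
        "minixml.document",
        "minixml.element",
        "minixml.spaces"
    };
    return &rules;
}

// Hypothetical helper: translate a rule constant to its printable name.
std::string RuleName(int ruleId)
{
    const std::vector<const char*>& rules = *GetRuleNameVecPtr();
    assert(ruleId >= 0 && static_cast<std::size_t>(ruleId) < rules.size());
    return rules[static_cast<std::size_t>(ruleId)];
}
```

With a layout like this, the parser only has to record small integer rule identifiers while it runs, and the names are looked up in the vector when an error message is composed.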

Main program with farthest error and state detection

The main program also needs some modifications:

#include <minixml.hpp>
#include <rules.hpp>
#include <soulng/util/Unicode.hpp>
#include <soulng/util/MappedInputFile.hpp>
#include <soulng/lexer/TrivialLexer.hpp>
#include <iostream>

using namespace soulng::unicode;
using namespace soulng::util;

int main()
{
    try
    {
        std::string fileName = "philosophers.xml";
        std::string file = ReadFile(fileName);
        std::u32string content = ToUtf32(file);
        TrivialLexer lexer(content, fileName, 0);
        lexer.SetFlag(soulng::lexer::LexerFlags::farthestError);
        lexer.SetRuleNameVecPtr(GetRuleNameVecPtr());
        minixml::Parse(lexer);
    }
    catch (const std::exception& ex)
    {
        std::cerr << ex.what() << std::endl;
        return 1;
    }
    return 0;
}

I have added an include directive for the rules.hpp header file and a lexer.SetRuleNameVecPtr function call so that the lexer has access to the names of the parsing rules.

Testing farthest error and state

Now if I run the program I get the following output:

>minixml.exe
parsing error at 'philosophers.xml:4':
 <philosopher>Aristotle</philosopher;>
                                    ^

Parser state:
minixml.document
minixml.element
minixml.element_content
minixml.element
minixml.name
minixml.name_char
minixml.name_start_char

The error message now includes the parser state: the names of the parsing rules that were active when the error was detected, listed from the outermost rule to the innermost.
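
One way to picture how such a rule list can be collected is a stack of active rules that each rule function pushes onto when it is entered and pops from when it is left. The sketch below assumes this design; it is not taken from soulng, and StateTracker, RuleGuard and the simplified Parse functions are hypothetical names. The inputIsGood parameter stands in for real input matching.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical tracker (not soulng's data structure).
struct StateTracker
{
    std::vector<std::string> activeRules;  // rules being parsed, outermost first
    std::vector<std::string> atError;      // snapshot taken when an error is seen

    void RecordError() { atError = activeRules; }
};

// RAII guard: entering a rule pushes its name; leaving it, whether the
// rule succeeded or failed, pops the name again.
class RuleGuard
{
public:
    RuleGuard(StateTracker& tracker_, const std::string& ruleName) : tracker(tracker_)
    {
        tracker.activeRules.push_back(ruleName);
    }
    ~RuleGuard()
    {
        assert(!tracker.activeRules.empty());
        tracker.activeRules.pop_back();
    }
    RuleGuard(const RuleGuard&) = delete;
    RuleGuard& operator=(const RuleGuard&) = delete;
private:
    StateTracker& tracker;
};

bool ParseName(StateTracker& tracker, bool inputIsGood)
{
    RuleGuard guard(tracker, "minixml.name");
    if (!inputIsGood)
    {
        tracker.RecordError();  // snapshot the stack before it unwinds
        return false;
    }
    return true;
}

bool ParseElement(StateTracker& tracker, bool inputIsGood)
{
    RuleGuard guard(tracker, "minixml.element");
    return ParseName(tracker, inputIsGood);
}

bool ParseDocument(StateTracker& tracker, bool inputIsGood)
{
    RuleGuard guard(tracker, "minixml.document");
    return ParseElement(tracker, inputIsGood);
}

// Render the snapshot in the same shape as the "Parser state:" listing.
std::string FormatState(const StateTracker& tracker)
{
    std::string text = "Parser state:\n";
    for (const std::string& name : tracker.atError)
    {
        text += name;
        text += '\n';
    }
    return text;
}
```

Calling ParseDocument(tracker, false) returns false with an empty activeRules stack, because every guard has popped its rule on the way out, while atError still holds minixml.document, minixml.element and minixml.name. The snapshot has to be taken at the moment of failure for exactly this reason.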