Our Main.cpp source file should now contain only the following content:
#include <iostream>
int main()
{
return 0;
}
The soulng/util library needs dynamic initialization, so we add the following initialization code to Main.cpp:
#include <soulng/util/InitDone.hpp>
#include <iostream>
void InitApplication()
{
soulng::util::Init();
}
void DoneApplication()
{
soulng::util::Done();
}
The soulng libraries report errors by throwing exceptions of type std::runtime_error, so the next step is to handle standard exceptions. I have placed the initialization inside the try block because it can throw an exception too. (I got this wrong in previous versions of this project.)
#include <soulng/util/InitDone.hpp>
#include <iostream>
#include <stdexcept>
int main()
{
try
{
InitApplication();
}
catch (const std::exception& ex)
{
std::cerr << ex.what() << std::endl;
return 1;
}
DoneApplication();
return 0;
}
We have included the <stdexcept> header and added a try block that prints the error to the standard error stream.
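The same pattern can be tried out without the soulng libraries. The following is a minimal sketch, assuming a hypothetical helper named RunGuarded (it is not part of soulng) that runs a callable and maps any standard exception to exit code 1 after printing the message:

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical helper, not part of soulng: runs 'body' and converts any
// std::exception into exit code 1 after printing its message to 'err'.
int RunGuarded(const std::function<void()>& body, std::ostream& err)
{
    assert(body && "body must be callable");
    try
    {
        body(); // initialization and the actual work would go here
    }
    catch (const std::exception& ex)
    {
        err << ex.what() << std::endl;
        return 1;
    }
    return 0;
}
```

Because std::runtime_error derives from std::exception, catching by const reference to std::exception also covers the errors that the soulng libraries are described as throwing.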
We will now write a tester function that will read a test file, construct a lexer and print the tokens contained in the test file to the standard output:
// ...
#include <soulng/util/MappedInputFile.hpp>
void TestMinilangLexer(const std::string& minilangFilePath)
{
std::cout << "> " << minilangFilePath << std::endl;
std::string s = soulng::util::ReadFile(minilangFilePath);
// ...
}
First we print the path of the given test file to the standard output. We then call the ReadFile() utility function of the soulng/util library, which reads the contents of the given UTF-8 encoded text file into a std::string. We have included the <soulng/util/MappedInputFile.hpp> header that declares the ReadFile function.
The lexer operates internally with UTF-32 characters, so we need to convert the UTF-8 string to UTF-32:
// ...
#include <soulng/util/Unicode.hpp>
void TestMinilangLexer(const std::string& minilangFilePath)
{
// ...
std::string s = soulng::util::ReadFile(minilangFilePath);
std::u32string content = soulng::unicode::ToUtf32(s);
// ...
}
We call the ToUtf32() utility function of the soulng/util library to do the conversion. We have included the <soulng/util/Unicode.hpp> header that declares the ToUtf32 function.
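To make the conversion less of a black box, here is a minimal sketch of what a UTF-8 to UTF-32 decoder does. The function name DecodeUtf8 is hypothetical and this is not the soulng implementation. As a simplification, it does not reject overlong encodings, surrogate code points, or values above U+10FFFF:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a UTF-8 to UTF-32 conversion (not the soulng code).
std::u32string DecodeUtf8(const std::string& utf8)
{
    std::u32string result;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t codePoint = 0;
        std::size_t trailing = 0; // number of continuation bytes that follow
        if (lead < 0x80) { codePoint = lead; }
        else if ((lead & 0xE0) == 0xC0) { codePoint = lead & 0x1F; trailing = 1; }
        else if ((lead & 0xF0) == 0xE0) { codePoint = lead & 0x0F; trailing = 2; }
        else if ((lead & 0xF8) == 0xF0) { codePoint = lead & 0x07; trailing = 3; }
        else { throw std::runtime_error("invalid UTF-8 lead byte"); }
        if (i + trailing >= utf8.size())
        {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= trailing; ++k)
        {
            unsigned char next = static_cast<unsigned char>(utf8[i + k]);
            if ((next & 0xC0) != 0x80)
            {
                throw std::runtime_error("invalid UTF-8 continuation byte");
            }
            codePoint = (codePoint << 6) | (next & 0x3F); // append 6 payload bits
        }
        assert(codePoint <= 0x1FFFFF); // at most 3 + 3 * 6 = 21 payload bits
        result.push_back(codePoint);
        i += trailing + 1;
    }
    return result;
}
```

For example, the two bytes 0xC3 0xB6 decode to the single code point U+00F6 (ö), so each element of the resulting string is one whole character for the lexer to work with.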
Now we construct the lexer and have it read the first token:
// ...
#include <minilang/MinilangLexer.hpp>
void TestMinilangLexer(const std::string& minilangFilePath)
{
// ...
std::u32string content = soulng::unicode::ToUtf32(s);
MinilangLexer lexer(content, minilangFilePath, 0);
++lexer;
// ...
The lexer constructor takes a UTF-32 string to be tokenized, a file path that is included in error messages produced by the lexer, and a file index that is included in spans generated by the lexer. The file index is not used by this example, so we set it to 0. The ++lexer call is necessary to have the lexer produce the first token.
Next we test whether the lexer has reached the end of the input, and if not, get the match from the lexer:
// ...
#include <minilang/MinilangTokens.hpp>
void TestMinilangLexer(const std::string& minilangFilePath)
{
// ...
++lexer;
while (*lexer != MinilangTokens::END)
{
std::u32string match = lexer.token.match.ToString();
// ...
The *lexer expression returns the identifier of the current token. We compare it to the END token that represents end of input.
If the end of input has not been reached, we print the name of the matched token and the matching string:
// ...
void TestMinilangLexer(const std::string& minilangFilePath)
{
// ...
std::u32string match = lexer.token.match.ToString();
std::cout << MinilangTokens::GetTokenName(*lexer) << "(" << soulng::unicode::ToUtf8(match) << ")" << std::endl;
// ...
The GetTokenName() function returns the identifier of the token as a string.
Finally we advance the lexer to the next token with the ++lexer expression and go back to testing the while condition. When the end of input is reached, we print a message to the standard output:
// ...
void TestMinilangLexer(const std::string& minilangFilePath)
{
// ...
while (*lexer != MinilangTokens::END)
{
// ...
++lexer;
}
std::cout << "end of file '" << minilangFilePath << "' reached" << std::endl;
}
Here's the complete test function:
#include <minilang/MinilangLexer.hpp>
#include <minilang/MinilangTokens.hpp>
#include <soulng/util/InitDone.hpp>
#include <soulng/util/MappedInputFile.hpp>
#include <soulng/util/Unicode.hpp>
#include <iostream>
#include <stdexcept>
void TestMinilangLexer(const std::string& minilangFilePath)
{
std::cout << "> " << minilangFilePath << std::endl;
std::string s = soulng::util::ReadFile(minilangFilePath);
std::u32string content = soulng::unicode::ToUtf32(s);
MinilangLexer lexer(content, minilangFilePath, 0);
++lexer;
while (*lexer != MinilangTokens::END)
{
std::u32string match = lexer.token.match.ToString();
std::cout << MinilangTokens::GetTokenName(*lexer) << "(" << soulng::unicode::ToUtf8(match) << ")" << std::endl;
++lexer;
}
std::cout << "end of file '" << minilangFilePath << "' reached" << std::endl;
}
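The ++lexer / *lexer protocol used above can be illustrated with a toy lexer that needs no generated code. This is only a sketch: ToyLexer, ToyToken and DumpTokens are hypothetical names, the toy works on plain std::string instead of UTF-32, and it does not distinguish keywords from identifiers. Only the usage pattern mirrors the test function:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for a generated lexer; every name here is hypothetical.
enum class ToyToken { END, ID, INTLIT, PUNCT };

class ToyLexer
{
public:
    explicit ToyLexer(const std::string& content) : content_(content) {}

    // Scans the next token; must be called once before the first *lexer.
    ToyLexer& operator++()
    {
        assert(pos_ <= content_.size());
        while (pos_ < content_.size() && std::isspace(Byte(content_[pos_]))) ++pos_;
        std::size_t start = pos_;
        if (pos_ == content_.size())
        {
            token_ = ToyToken::END;
        }
        else if (std::isalpha(Byte(content_[pos_])))
        {
            while (pos_ < content_.size() && std::isalnum(Byte(content_[pos_]))) ++pos_;
            token_ = ToyToken::ID;
        }
        else if (std::isdigit(Byte(content_[pos_])))
        {
            while (pos_ < content_.size() && std::isdigit(Byte(content_[pos_]))) ++pos_;
            token_ = ToyToken::INTLIT;
        }
        else if (std::string("(){};,=+-%").find(content_[pos_]) != std::string::npos)
        {
            ++pos_;
            token_ = ToyToken::PUNCT;
        }
        else
        {
            // No rule matches this character.
            throw std::runtime_error(std::string("invalid character '") + content_[pos_] + "'");
        }
        match_ = content_.substr(start, pos_ - start);
        return *this;
    }

    ToyToken operator*() const { return token_; }        // current token id
    const std::string& Match() const { return match_; }  // matched text

private:
    static int Byte(char c) { return static_cast<unsigned char>(c); }
    std::string content_;
    std::size_t pos_ = 0;
    ToyToken token_ = ToyToken::END;
    std::string match_;
};

// Same loop shape as TestMinilangLexer, but collecting lines instead of printing.
std::vector<std::string> DumpTokens(const std::string& text)
{
    static const char* names[] = { "END", "ID", "INTLIT", "PUNCT" };
    std::vector<std::string> lines;
    ToyLexer lexer(text);
    ++lexer;
    while (*lexer != ToyToken::END)
    {
        lines.push_back(std::string(names[static_cast<int>(*lexer)]) + "(" + lexer.Match() + ")");
        ++lexer;
    }
    return lines;
}
```

The initial ++lexer before the loop matters in the toy as well: without it the current token would still be the initial END value and the loop body would never run.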
We will add a simple command option interface to the main function: If the command option is "--lexer-test", the TestMinilangLexer() function is called with the test file path:
void PrintUsage()
{
std::cout << "Usage: minilang [options] { file.minilang }" << std::endl;
std::cout << "Options:" << std::endl;
std::cout << "--help | -h:" << std::endl;
std::cout << " Print help and exit." << std::endl;
std::cout << "--lexer-test | -l" << std::endl;
std::cout << " Test lexical analyzer with <file.minilang>." << std::endl;
}
enum class Command
{
none, lexerTest
};
int main(int argc, const char** argv)
{
try
{
InitApplication();
std::vector<std::string> files;
Command command = Command::none;
for (int i = 1; i < argc; ++i)
{
std::string arg = argv[i];
if (soulng::util::StartsWith(arg, "--"))
{
if (arg == "--help")
{
PrintUsage();
return 1;
}
else if (arg == "--lexer-test")
{
command = Command::lexerTest;
}
else
{
throw std::runtime_error("unknown argument '" + arg + "'");
}
}
else if (soulng::util::StartsWith(arg, "-"))
{
std::string options = arg.substr(1);
if (options.empty())
{
throw std::runtime_error("unknown argument '" + arg + "'");
}
for (char o : options)
{
if (o == 'h')
{
PrintUsage();
return 1;
}
else if (o == 'l')
{
command = Command::lexerTest;
}
else
{
throw std::runtime_error("unknown argument '-" + std::string(1, o) + "'");
}
}
}
else
{
files.push_back(soulng::util::GetFullPath(arg));
}
}
if (files.empty() || command == Command::none)
{
PrintUsage();
return 1;
}
for (const std::string& filePath : files)
{
if (command == Command::lexerTest)
{
TestMinilangLexer(filePath);
}
else
{
PrintUsage();
throw std::runtime_error("minilang: unknown command");
}
}
}
catch (const std::exception& ex)
{
std::cerr << ex.what() << std::endl;
return 1;
}
DoneApplication();
return 0;
}
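The option handling is easier to test when it is separated from main(). The following sketch restates the same rules using only the standard library. ParsedArgs, ParseArgs and CheckParseArgs are hypothetical names, StartsWith is a local stand-in for soulng::util::StartsWith, and the GetFullPath call is left out:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

enum class Command { none, lexerTest };

// Hypothetical result type; the tutorial keeps these values as locals in main().
struct ParsedArgs
{
    Command command = Command::none;
    bool help = false;
    std::vector<std::string> files;
};

// Standard-library stand-in for soulng::util::StartsWith.
bool StartsWith(const std::string& s, const std::string& prefix)
{
    return s.compare(0, prefix.size(), prefix) == 0;
}

ParsedArgs ParseArgs(const std::vector<std::string>& args)
{
    ParsedArgs parsed;
    for (const std::string& arg : args)
    {
        if (StartsWith(arg, "--")) // long option
        {
            if (arg == "--help") parsed.help = true;
            else if (arg == "--lexer-test") parsed.command = Command::lexerTest;
            else throw std::runtime_error("unknown argument '" + arg + "'");
        }
        else if (StartsWith(arg, "-")) // one or more short options, e.g. -hl
        {
            std::string options = arg.substr(1);
            if (options.empty()) throw std::runtime_error("unknown argument '" + arg + "'");
            for (char o : options)
            {
                if (o == 'h') parsed.help = true;
                else if (o == 'l') parsed.command = Command::lexerTest;
                else throw std::runtime_error("unknown argument '-" + std::string(1, o) + "'");
            }
        }
        else
        {
            parsed.files.push_back(arg); // the tutorial also expands this to a full path
        }
    }
    return parsed;
}

// Usage check: "-l" selects the lexer test and plain arguments become files.
void CheckParseArgs()
{
    ParsedArgs parsed = ParseArgs({ "-l", "fibonacci.minilang" });
    assert(parsed.command == Command::lexerTest);
    assert(parsed.files.size() == 1);
}
```

Testing for the "--" prefix before the "-" prefix is what keeps a long option such as --help from being read as a group of short options.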
Here's the complete Main.cpp file:
#include <minilang/MinilangLexer.hpp>
#include <minilang/MinilangTokens.hpp>
#include <soulng/util/InitDone.hpp>
#include <soulng/util/MappedInputFile.hpp>
#include <soulng/util/Unicode.hpp>
#include <soulng/util/Path.hpp>
#include <soulng/util/TextUtils.hpp>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>
void TestMinilangLexer(const std::string& minilangFilePath)
{
std::cout << "> " << minilangFilePath << std::endl;
std::string s = soulng::util::ReadFile(minilangFilePath);
std::u32string content = soulng::unicode::ToUtf32(s);
MinilangLexer lexer(content, minilangFilePath, 0);
++lexer;
while (*lexer != MinilangTokens::END)
{
std::u32string match = lexer.token.match.ToString();
std::cout << MinilangTokens::GetTokenName(*lexer) << "(" << soulng::unicode::ToUtf8(match) << ")" << std::endl;
++lexer;
}
std::cout << "end of file '" << minilangFilePath << "' reached" << std::endl;
}
void InitApplication()
{
soulng::util::Init();
}
void DoneApplication()
{
soulng::util::Done();
}
void PrintUsage()
{
std::cout << "Usage: minilang [options] { file.minilang }" << std::endl;
std::cout << "Options:" << std::endl;
std::cout << "--help | -h:" << std::endl;
std::cout << " Print help and exit." << std::endl;
std::cout << "--lexer-test | -l" << std::endl;
std::cout << " Test lexical analyzer with <file.minilang>." << std::endl;
}
enum class Command
{
none, lexerTest
};
int main(int argc, const char** argv)
{
try
{
InitApplication();
std::vector<std::string> files;
Command command = Command::none;
for (int i = 1; i < argc; ++i)
{
std::string arg = argv[i];
if (soulng::util::StartsWith(arg, "--"))
{
if (arg == "--help")
{
PrintUsage();
return 1;
}
else if (arg == "--lexer-test")
{
command = Command::lexerTest;
}
else
{
throw std::runtime_error("unknown argument '" + arg + "'");
}
}
else if (soulng::util::StartsWith(arg, "-"))
{
std::string options = arg.substr(1);
if (options.empty())
{
throw std::runtime_error("unknown argument '" + arg + "'");
}
for (char o : options)
{
if (o == 'h')
{
PrintUsage();
return 1;
}
else if (o == 'l')
{
command = Command::lexerTest;
}
else
{
throw std::runtime_error("unknown argument '-" + std::string(1, o) + "'");
}
}
}
else
{
files.push_back(soulng::util::GetFullPath(arg));
}
}
if (files.empty() || command == Command::none)
{
PrintUsage();
return 1;
}
for (const std::string& filePath : files)
{
if (command == Command::lexerTest)
{
TestMinilangLexer(filePath);
}
else
{
PrintUsage();
throw std::runtime_error("minilang: unknown command");
}
}
}
catch (const std::exception& ex)
{
std::cerr << ex.what() << std::endl;
return 1;
}
DoneApplication();
return 0;
}
The project should build now without any errors.
The first test file, fibonacci.minilang, is written in the Minilang language and contains a function that generates the Fibonacci sequence:
int fibonacci(int n)
{
if (n == 0) return 0;
if (n == 1) return 1;
return fibonacci(n - 1) + fibonacci(n - 2);
}
Set Configuration Properties / Debugging / Command Arguments to the value --lexer-test test\fibonacci.minilang and run the program.
We get the following output:
> C:/soulng-1.0.0/examples/minilang/test/fibonacci.minilang
INT(int)
ID(fibonacci)
LPAREN(()
INT(int)
ID(n)
RPAREN())
LBRACE({)
IF(if)
LPAREN(()
ID(n)
EQ(==)
INTLIT(0)
RPAREN())
RETURN(return)
INTLIT(0)
SEMICOLON(;)
IF(if)
LPAREN(()
ID(n)
EQ(==)
INTLIT(1)
RPAREN())
RETURN(return)
INTLIT(1)
SEMICOLON(;)
RETURN(return)
ID(fibonacci)
LPAREN(()
ID(n)
MINUS(-)
INTLIT(1)
RPAREN())
PLUS(+)
ID(fibonacci)
LPAREN(()
ID(n)
MINUS(-)
INTLIT(2)
RPAREN())
SEMICOLON(;)
RBRACE(})
end of file 'C:/soulng-1.0.0/examples/minilang/test/fibonacci.minilang' reached
The lexer operates internally with UTF-32 characters and has Unicode identifier classes enabled, so we next test the lexer with some Unicode identifiers.
The unicodeid.minilang file contains identifiers with some Finnish characters:
int örkki()
{
int öljyä = 1;
return öljyä;
}
Set Configuration Properties / Debugging / Command Arguments to the value --lexer-test test\unicodeid.minilang and run the program.
We get the following output:
> C:/soulng-1.0.0/examples/minilang/test/unicodeid.minilang
INT(int)
ID(├Ârkki)
LPAREN(()
RPAREN())
LBRACE({)
INT(int)
ID(├Âljy├ñ)
ASSIGN(=)
INTLIT(1)
SEMICOLON(;)
RETURN(return)
ID(├Âljy├ñ)
SEMICOLON(;)
RBRACE(})
end of file 'C:/soulng-1.0.0/examples/minilang/test/unicodeid.minilang' reached
The output does not look right because, as far as we know, the Windows console cannot handle UTF-8 text properly the way a Linux terminal can when a UTF-8 locale is enabled. Text needs to be converted to UTF-16 to display correctly on the Windows console. We have therefore added an operator<<() function that can print UTF-32 strings. The function is declared in ConsoleUnicode.hpp and defined in ConsoleUnicode.cpp:
// ConsoleUnicode.hpp:
#ifndef CONSOLE_UNICODE_HPP
#define CONSOLE_UNICODE_HPP
#include <ostream>
#include <string>
std::ostream& operator<<(std::ostream& s, const std::u32string& utf32Str);
#endif // CONSOLE_UNICODE_HPP
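// ConsoleUnicode.cpp:
#include "ConsoleUnicode.hpp" // assumed to be in the same directory; adjust the path for your project
#include <soulng/util/Unicode.hpp>
#include <cstdio>
#include <iostream>
#include <stdexcept>
#ifdef _WIN32
#include <io.h>
#include <fcntl.h>
#endif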
#ifdef _WIN32
void SetStdHandlesToUtf16Mode()
{
_setmode(0, _O_U16TEXT);
_setmode(1, _O_U16TEXT);
_setmode(2, _O_U16TEXT);
}
void SetStdHandlesToNarrowMode()
{
_setmode(0, _O_TEXT);
_setmode(1, _O_TEXT);
_setmode(2, _O_TEXT);
}
bool IsHandleRedirected(int handle)
{
return !_isatty(handle);
}
struct UnicodeWriteGuard
{
UnicodeWriteGuard()
{
SetStdHandlesToUtf16Mode();
}
~UnicodeWriteGuard()
{
SetStdHandlesToNarrowMode();
}
};
void WriteUtf16StrToStdOutOrStdErr(const std::u16string& str, FILE* file)
{
// precondition: file must be stdout or stderr
if (file != stdout && file != stderr)
{
throw std::runtime_error("WriteUtf16StrToStdOutOrStdErr: precondition violation: file must be stdout or stderr");
}
UnicodeWriteGuard unicodeWriteGuard;
size_t result = std::fwrite(str.c_str(), sizeof(char16_t), str.length(), file);
if (result != str.length())
{
throw std::runtime_error("could not write Unicode text");
}
}
std::ostream& operator<<(std::ostream& s, const std::u32string& utf32Str)
{
if (&s == &std::cout && !IsHandleRedirected(1))
{
std::u16string utf16Str = soulng::unicode::ToUtf16(utf32Str);
WriteUtf16StrToStdOutOrStdErr(utf16Str, stdout);
return s;
}
else if (&s == &std::cerr && !IsHandleRedirected(2))
{
std::u16string utf16Str = soulng::unicode::ToUtf16(utf32Str);
WriteUtf16StrToStdOutOrStdErr(utf16Str, stderr);
return s;
}
else
{
return s << soulng::unicode::ToUtf8(utf32Str);
}
}
#else // !_WIN32
std::ostream& operator<<(std::ostream& s, const std::u32string& utf32Str)
{
return s << soulng::unicode::ToUtf8(utf32Str);
}
#endif
Now strings can be printed to std::cout and std::cerr as UTF-32 strings. On Windows, the operator<<() function converts its argument to UTF-16 when printing to the console, and to UTF-8 when the stream is redirected.
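The UTF-32 to UTF-16 step is where characters outside the Basic Multilingual Plane turn into surrogate pairs. Here is a minimal sketch of that conversion. EncodeUtf16 is a hypothetical name and this is not the soulng implementation of ToUtf16:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a UTF-32 to UTF-16 conversion (not the soulng code).
std::u16string EncodeUtf16(const std::u32string& utf32)
{
    std::u16string result;
    for (char32_t codePoint : utf32)
    {
        if (codePoint > 0x10FFFF || (codePoint >= 0xD800 && codePoint <= 0xDFFF))
        {
            throw std::runtime_error("invalid Unicode code point");
        }
        if (codePoint < 0x10000)
        {
            // Basic Multilingual Plane: one UTF-16 code unit.
            result.push_back(static_cast<char16_t>(codePoint));
        }
        else
        {
            // Supplementary plane: split the remaining 20 bits into a surrogate pair.
            char32_t v = codePoint - 0x10000;
            assert(v <= 0xFFFFF);
            result.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
            result.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
        }
    }
    return result;
}
```

The Finnish letters in unicodeid.minilang are all below U+10000, so each of them becomes exactly one UTF-16 code unit.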
Main.cpp needs to include ConsoleUnicode.hpp so that the new operator is visible. The TestMinilangLexer function must then be changed to *not* call ToUtf8 for the match variable. The match can now be printed as a UTF-32 string:
void TestMinilangLexer(const std::string& minilangFilePath)
{
std::cout << "> " << minilangFilePath << std::endl;
std::string s = soulng::util::ReadFile(minilangFilePath);
std::u32string content = soulng::unicode::ToUtf32(s);
MinilangLexer lexer(content, minilangFilePath, 0);
++lexer;
while (*lexer != MinilangTokens::END)
{
std::u32string match = lexer.token.match.ToString();
std::cout << MinilangTokens::GetTokenName(*lexer) << "(" << match << ")" << std::endl;
++lexer;
}
std::cout << "end of file '" << minilangFilePath << "' reached" << std::endl;
}
The main function must also be changed to print the error message, which may contain non-ASCII characters, as a UTF-32 string:
// ...
catch (const std::exception& ex)
{
std::cerr << soulng::unicode::ToUtf32(ex.what()) << std::endl;
return 1;
}
With these changes implemented, the output now looks right on the Windows console as well:
> C:/soulng-1.0.0/examples/minilang/test/unicodeid.minilang
INT(int)
ID(örkki)
LPAREN(()
RPAREN())
LBRACE({)
INT(int)
ID(öljyä)
ASSIGN(=)
INTLIT(1)
SEMICOLON(;)
RETURN(return)
ID(öljyä)
SEMICOLON(;)
RBRACE(})
end of file 'C:/soulng-1.0.0/examples/minilang/test/unicodeid.minilang' reached
Now we test with an erroneous invalid.minilang file:
@
Set Configuration Properties / Debugging / Command Arguments to the value --lexer-test test\invalid.minilang and run the program.
We get the following error:
> C:/soulng-1.0.0/examples/minilang/test/invalid.minilang
soulng::lexer::Lexer::NextToken(): error: invalid character '@' in file 'C:/soulng-1.0.0/examples/minilang/test/invalid.minilang' at line 1
The lexer has no rule for the '@' character, so it generates the preceding error.
The gcd.minilang test file contains the greatest common divisor algorithm:
int gcd(int a, int b)
{
while (b != 0)
{
a = a % b;
int t = a;
a = b;
b = t;
}
return a;
}
Here's the output:
> C:/soulng-1.0.0/examples/minilang/test/gcd.minilang
INT(int)
ID(gcd)
LPAREN(()
INT(int)
ID(a)
COMMA(,)
INT(int)
ID(b)
RPAREN())
LBRACE({)
WHILE(while)
LPAREN(()
ID(b)
NEQ(!=)
INTLIT(0)
RPAREN())
LBRACE({)
ID(a)
ASSIGN(=)
ID(a)
MOD(%)
ID(b)
SEMICOLON(;)
INT(int)
ID(t)
ASSIGN(=)
ID(a)
SEMICOLON(;)
ID(a)
ASSIGN(=)
ID(b)
SEMICOLON(;)
ID(b)
ASSIGN(=)
ID(t)
SEMICOLON(;)
RBRACE(})
RETURN(return)
ID(a)
SEMICOLON(;)
RBRACE(})
end of file 'C:/soulng-1.0.0/examples/minilang/test/gcd.minilang' reached
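The loop in gcd.minilang stores the remainder in a before swapping, which looks slightly different from the usual textbook form of Euclid's algorithm. A direct C++ transcription (the function name Gcd is our own, and non-negative inputs are assumed) makes it easy to check that it still computes the greatest common divisor:

```cpp
#include <cassert>

// Direct C++ transcription of the gcd.minilang loop; non-negative inputs assumed.
int Gcd(int a, int b)
{
    assert(a >= 0 && b >= 0);
    while (b != 0)
    {
        a = a % b; // a now holds the remainder
        int t = a; // remember the remainder
        a = b;     // the old divisor becomes the new dividend
        b = t;     // the remainder becomes the new divisor
    }
    return a;
}
```

For the inputs 48 and 18 the pair (a, b) goes through (18, 12), (12, 6) and (6, 0), so the result is 6.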
Next we will show how to write parsers that can utilize the lexer...