Cmajor Compiler Organization

The Cmajor compiler follows traditional compiler organization having a front-end and a back-end. The front-end drives the compilation process and generates intermediate code. The back-end optimizes the intermediate code and generates machine code from it. In addition the front-end utilizes an archiver that bundles object code files into libraries and a linker that produces executable programs from the object code and library files.

At present there's one front-end written in C++ and another one planned to be written in Cmajor. There's three back-ends, LLVM back-end that is the real one, a C++ back-end, and System X, that is an experimental one written in Cmajor. LLVM can produce code for many platforms and computer architectures, but currently Cmajor outputs only x86_64 code for Windows and Linux. The System X back-end produces MMIX-based assembly code for a virtual machine.

The front-end has four main components:

parser,
symbol table,
binder and
code generator

Parser

The parser component is made using SoulNG parser generator tool. It produces a tree of objects for each syntax element in a Cmajor source file. Each kind of syntax element, be it an identifier, literal, variable, function or type, is represented by a class that is derived from the Node class. Syntax tree nodes can be written to and read from a binary stream, cloned and they accept a visitor. The phases of compilation following the parsing phase walk the syntax tree node object hierarchy using a concrete visitor derived from the base visitor class.

There are many kinds of syntax tree nodes. Primitive ones such as identifier are derived directly from the node class. They just contain the data for the element. There are also composite syntax tree nodes that have child nodes. The unary nodes have one child node. Concrete examples of unary nodes are nodes for expressions taking one operand. Each kind of unary operator is represented by such a class derived from the unary node class. Similarly each binary operator is represented by a class derived from the binary node class. Nodes having arbitrary many child nodes, such as namespace nodes, have a node list.

Symbol table

The symbol table component collects information about program entities. In syntax tree each identifier has its own node despite of its value. In contrast in the symbol table each symbol except class template specialization symbol is unique. The key classes in this component are

the symbol class of which all symbol classes are derived from,
the container symbol class that is the base class for symbols containing other symbols,
the scope class that is contained by each container symbol and enables looking up symbols by name,
the symbol table class that contains the symbols of a library or program,
the module class that contains the symbol table of a library or program and references to other modules it uses, and
the symbol creator visitor class that creates symbols from syntax tree nodes and insert them into symbol table. There is a symbol for each class, function, compound statement, parameter, variable, typedef, delegate, enumerated type, enumeration constant and constant node.

The information kept in the symbol table depends on the program entity:

All symbols have name and mangled name. The name of a symbol is the value of the identifier used in declaration of the program entity. The mangled name of a symbol identifies the program entity in intermediate code files. It consists of only Latin letters, digits and underscores and ends with hexadecimal SHA-1 hash of the full name of the entity. All symbols have also flags, owner module, pointer to a parent symbol and a span that tells the source location of the symbol.
The type symbols have in addition a unique type id that is an UUID.
A class type symbol has in addition symbols for contained member functions, variables, typedefs, constants and delegates.
A parameter and variable symbol have a type that is represented internally as a pointer to a type symbol. Each type is represented externally by its id, that is an UUID.
A function symbol contains symbols for parameters and a pointer to a return type. They have also a unique id that is an UUID. Function templates and inline functions contain also the actual syntax tree nodes for the function.

Mapping from syntax tree nodes to symbols and vice versa is kept in the symbol table, so when the following phases of compilation visit the syntax tree node object hierarchy, they have access to the corresponding symbols as well.

At the end of compilation of a program or library module, the contents of the module, meaning exportable symbols and their the externally visible attributes, is written to a module file that has a .cmm extension. At the start of a compilation of a program or library module the contents of the modules it references to are read from their module files.

Binder

The task of the binder component is to combine syntax with symbols and to create bound node trees from which intermediate code can be generated. After creating syntax trees from the source files and creating symbols from the syntax trees, the front-end completes the symbols with type information. The type binder visits syntax trees and uses the type resolver to look up identifiers that should represent types from the symbol table. The type resolver also creates class template specialization symbols from template id nodes when needed. The type binder resolves:

the return type of a function or delegate
the type of the base class of a class type
the declared type of a typedef
the type of a parameter, variable or constant
the underlying type of an enumerated type

The second task of the type binder is to evaluate constant expressions. For this it uses a static evaluator. The result of evaluation is a value derived from the value class.

After binding types the control branches to multiple threads and the compilation of each compile unit is done in parallel from now on.

The next phase is binding statements and expressions. The statement binder visits the body of each function and creates bound statements from the corresponding syntax statements. It also binds each expression using the expression binder. The expression binder creates bound expressions from syntax expressions. Each bound expression has a type that is represented as a pointer to a type symbol. Most bound expressions contain also a pointer to the symbol that corresponds the syntax tree node from which the bound node is created from: bound local variable has a local variable symbol, and bound constant has a constant symbol, for example.

Major part of expression binding is devoted to overload resolution. To resolve a function call, the ResolveOverload function is called with the group name of the function to call and bound expression arguments. It returns a bound function call that contains a pointer to the symbol of the called function. Naturally all function call expressions are represented as bound function calls but also construction, copying and moving values, expressions containing unary and binary operators, and array element access are represented as such as well. For such operations the symbol contained in the bound function call is of type derived from the generic function symbol class.

After resolving a function call expression to a function symbol to call, the compiler checks if the function symbol is a constexpr function or a compile time primitive function such as length of an array or string literal. When this is the case, the compiler tries to evaluate the value of the function at compile time using static evaluator. If it succeeds the function call is replaced by a bound literal that contains the evaluated value. If the static evaluation fails, the value of the function is computed at run-time as usual.

The binder component creates a bound compile unit in the beginning of compilation of a source file and it is also the result of this compilation. To support the compilation process, mainly overload resolution, a bound compile unit contains various repositories:

The class template repository is used for instantiating member functions of a class template. When the type resolver creates a class template specialization symbol, it initially uses the class template repository to bind signatures of the member functions of the class template. The member functions are not instantiated until they are needed, that is: called. The instantiation processs is triggered by overload resolution. The class template repository coordinates the instantiation process and contains the symbols of the instantiated member functions. The instantiations themselves go to the bound compile unit. When instantiating a member function of a class template, the class template repository uses the symbol creator, the type binder and the statement binder to compile the body of the function.
The concept repository holds bound concept instantiations and provides them to the constraint checker that is used in overload resolution.
The constexpr function repository provides constexpr function nodes to the static evaluator.
The function template repository coordinates instantiation of function templates. The instantiation process is triggered by overload resolution. When instantiating a function template, the repository uses the symbol creator, the type binder and the statement binder to compile the body of the function.
The inline function repository is in effect only when compiling release mode code. It coordinates instantiation of inline functions. The instantiation process is similar to instantiation of function templates and member functions of class templates.
The operation repository contains operations organized in operation groups by the group name of the operation. The operations create and hold temporary symbols derived from the function symbol on demand. Examples of such a symbols are pointer default constructor and pointer copy constructor. Some of them are however defined in the binder namespace so they are not shown as classes derived from the function symbol in the documentation. Operations on primitive types such as basic type binary operations operating on int are not in this repository but are inserted in the global scope of the System.Core module and found from there by overload resolution directly.

Code generator

The task of a code generator is to go through a bound compile unit and generate intermediate code from it. There are three code generators in the Cmajor front-end: one for Windows, one for Linux and one for System X. The Windows and Linux code generators generate LLVM instructions and the System X code generator generate System X instructions. The difference between the Windows and the Linux code generator is in implementation of exception handling. The Windows code generator implements the Visual C++ exception handling ABI and the the Linux code generator implements the Itanium exception handling ABI. The Windows and Linux code generators use the cmllvm emitter for creating intermediate instructions and the System X code generator use the cmsxbe emitter for this task. Both emitters derive from the abstract ir emitter class, whose interface is a superset of the functionality implemented by both concrete emitters.

When visiting each kind of bound statement, the code generators create basic blocks that contain code for bound expressions and jumps between the basic blocks that correspond the control flow. The bound expressions implement two basic operations defined in the GenObject interface: the load operation that produce side effects, generates a value and pushes it to the value stack of an emitter, and the store operation that pops a value from the value stack and stores it to a location of the bound expression. The store operation of those bound expressions that are not lvalues generate an error by throwing an exception.

Function calls and operations on primitive, derived and class types are all implemented as bound function calls that call the the virtual generate call member function of a contained function symbol that represents the operation.

Reference

The front-end consists of these modules.

Implementation of Specific Features

Is and As Expressions

The implementation of is and as expressions is based on a scheme invented by Michael Gibbs and Bjarne Stroustrup and described in their article Fast dynamic casting.

Each polymorphic class in Cmajor has a unique 128-bit identifier that is a product of prime number keys. The prime number key of the class is based on the level of the class in the class hierarchy. Ultimate base classes are assigned keys 2, 3, 5, etc., and the leaf classes of the class hierarchy have the largest keys. The identifier of a class is the product of the key of itself, the key of its base class, the key of its ancestor class, and so on.

When we have these class identifiers, the test whether a polymorphic class A is the same class or derived from the class B becomes a simple modulo operation: classIdOf(A) % classIdOf(B) == 0 <=> A is same or derived from B.

The 128-bit identifier of a class is stored to the virtual method table of the class at index 0. The class identifier is initially zero and dynamically initialized by calling the runtime library. The runtime library looks up the class identifier from a hash table by using a 16-byte type identifier (UUID) of the class as a key and stores the nonzero 128-bit class identifier to the virtual method table index 0. The 16-byte type identifier (UUID) is initialized by the compiler and stored to the virtual method table at index 2 in 64-bit units. The hash table lookup is needed only first time. In the future the class id is found in the VMT at index 0.