Cmajor Compiler Organization

The Cmajor compiler follows traditional compiler organization having a front-end and a back-end. The front-end drives the compilation process and generates intermediate code. The back-end optimizes the intermediate code and generates machine code from it. In addition the front-end utilizes an archiver that bundles object code files into libraries and a linker that produces executable programs from the object code and library files.

At present there's one front-end written in C++ and another one planned to be written in Cmajor. There's three back-ends, LLVM back-end that is the real one, a C++ back-end, and System X, that is an experimental one written in Cmajor. LLVM can produce code for many platforms and computer architectures, but currently Cmajor outputs only x86_64 code for Windows and Linux. The System X back-end produces MMIX-based assembly code for a virtual machine.

The front-end has four main components:

Parser

The parser component is made using SoulNG parser generator tool. It produces a tree of objects for each syntax element in a Cmajor source file. Each kind of syntax element, be it an identifier, literal, variable, function or type, is represented by a class that is derived from the Node class. Syntax tree nodes can be written to and read from a binary stream, cloned and they accept a visitor. The phases of compilation following the parsing phase walk the syntax tree node object hierarchy using a concrete visitor derived from the base visitor class.

There are many kinds of syntax tree nodes. Primitive ones such as identifier are derived directly from the node class. They just contain the data for the element. There are also composite syntax tree nodes that have child nodes. The unary nodes have one child node. Concrete examples of unary nodes are nodes for expressions taking one operand. Each kind of unary operator is represented by such a class derived from the unary node class. Similarly each binary operator is represented by a class derived from the binary node class. Nodes having arbitrary many child nodes, such as namespace nodes, have a node list.

Symbol table

The symbol table component collects information about program entities. In syntax tree each identifier has its own node despite of its value. In contrast in the symbol table each symbol except class template specialization symbol is unique. The key classes in this component are

The information kept in the symbol table depends on the program entity:

Mapping from syntax tree nodes to symbols and vice versa is kept in the symbol table, so when the following phases of compilation visit the syntax tree node object hierarchy, they have access to the corresponding symbols as well.

At the end of compilation of a program or library module, the contents of the module, meaning exportable symbols and their the externally visible attributes, is written to a module file that has a .cmm extension. At the start of a compilation of a program or library module the contents of the modules it references to are read from their module files.

Binder

The task of the binder component is to combine syntax with symbols and to create bound node trees from which intermediate code can be generated. After creating syntax trees from the source files and creating symbols from the syntax trees, the front-end completes the symbols with type information. The type binder visits syntax trees and uses the type resolver to look up identifiers that should represent types from the symbol table. The type resolver also creates class template specialization symbols from template id nodes when needed. The type binder resolves:

The second task of the type binder is to evaluate constant expressions. For this it uses a static evaluator. The result of evaluation is a value derived from the value class.

After binding types the control branches to multiple threads and the compilation of each compile unit is done in parallel from now on.

The next phase is binding statements and expressions. The statement binder visits the body of each function and creates bound statements from the corresponding syntax statements. It also binds each expression using the expression binder. The expression binder creates bound expressions from syntax expressions. Each bound expression has a type that is represented as a pointer to a type symbol. Most bound expressions contain also a pointer to the symbol that corresponds the syntax tree node from which the bound node is created from: bound local variable has a local variable symbol, and bound constant has a constant symbol, for example.

Major part of expression binding is devoted to overload resolution. To resolve a function call, the ResolveOverload function is called with the group name of the function to call and bound expression arguments. It returns a bound function call that contains a pointer to the symbol of the called function. Naturally all function call expressions are represented as bound function calls but also construction, copying and moving values, expressions containing unary and binary operators, and array element access are represented as such as well. For such operations the symbol contained in the bound function call is of type derived from the generic function symbol class.

After resolving a function call expression to a function symbol to call, the compiler checks if the function symbol is a constexpr function or a compile time primitive function such as length of an array or string literal. When this is the case, the compiler tries to evaluate the value of the function at compile time using static evaluator. If it succeeds the function call is replaced by a bound literal that contains the evaluated value. If the static evaluation fails, the value of the function is computed at run-time as usual.

The binder component creates a bound compile unit in the beginning of compilation of a source file and it is also the result of this compilation. To support the compilation process, mainly overload resolution, a bound compile unit contains various repositories:

Code generator

The task of a code generator is to go through a bound compile unit and generate intermediate code from it. There are three code generators in the Cmajor front-end: one for Windows, one for Linux and one for System X. The Windows and Linux code generators generate LLVM instructions and the System X code generator generate System X instructions. The difference between the Windows and the Linux code generator is in implementation of exception handling. The Windows code generator implements the Visual C++ exception handling ABI and the the Linux code generator implements the Itanium exception handling ABI. The Windows and Linux code generators use the cmllvm emitter for creating intermediate instructions and the System X code generator use the cmsxbe emitter for this task. Both emitters derive from the abstract ir emitter class, whose interface is a superset of the functionality implemented by both concrete emitters.

When visiting each kind of bound statement, the code generators create basic blocks that contain code for bound expressions and jumps between the basic blocks that correspond the control flow. The bound expressions implement two basic operations defined in the GenObject interface: the load operation that produce side effects, generates a value and pushes it to the value stack of an emitter, and the store operation that pops a value from the value stack and stores it to a location of the bound expression. The store operation of those bound expressions that are not lvalues generate an error by throwing an exception.

Function calls and operations on primitive, derived and class types are all implemented as bound function calls that call the the virtual generate call member function of a contained function symbol that represents the operation.

Reference

The front-end consists of these modules.

Implementation of Specific Features

Is and As Expressions

The implementation of is and as expressions is based on a scheme invented by Michael Gibbs and Bjarne Stroustrup and described in their article Fast dynamic casting.

Each polymorphic class in Cmajor has a unique 128-bit identifier that is a product of prime number keys. The prime number key of the class is based on the level of the class in the class hierarchy. Ultimate base classes are assigned keys 2, 3, 5, etc., and the leaf classes of the class hierarchy have the largest keys. The identifier of a class is the product of the key of itself, the key of its base class, the key of its ancestor class, and so on.

When we have these class identifiers, the test whether a polymorphic class A is the same class or derived from the class B becomes a simple modulo operation: classIdOf(A) % classIdOf(B) == 0 <=> A is same or derived from B.

The 128-bit identifier of a class is stored to the virtual method table of the class at index 0. The class identifier is initially zero and dynamically initialized by calling the runtime library. The runtime library looks up the class identifier from a hash table by using a 16-byte type identifier (UUID) of the class as a key and stores the nonzero 128-bit class identifier to the virtual method table index 0. The 16-byte type identifier (UUID) is initialized by the compiler and stored to the virtual method table at index 2 in 64-bit units. The hash table lookup is needed only first time. In the future the class id is found in the VMT at index 0.