The notation used to define the syntax of the Cminor language is a variation of Parsing Expression Grammar notation, or PEG for short (see also the Wikipedia article on PEG). In PEG, the ambiguities that can arise in general context-free grammars are avoided by defining an ordered choice operator: the alternatives in the body of a rule are tested in order, and the first alternative that matches the input always wins. The alternatives that come after it are not tested.
The syntax notation used in this Cminor language reference uses the pipe symbol, |, to separate alternatives, but its semantics is meant to be the same as that of PEG's ordered choice operator.
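The effect of ordered choice can be illustrated with a minimal Python sketch. The names Parser, literal and choice are hypothetical helpers introduced only for this illustration; they are not part of Cminor or its tools.

```python
from typing import Callable, Optional

# A parser takes the input text and a position. It returns the position
# just after the match, or None if there is no match at that position.
Parser = Callable[[str, int], Optional[int]]


def literal(s: str) -> Parser:
    """Match the exact string s."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def choice(*alternatives: Parser) -> Parser:
    """Ordered choice: try the alternatives left to right, first match wins."""
    def parse(text: str, pos: int) -> Optional[int]:
        for alternative in alternatives:
            end = alternative(text, pos)
            if end is not None:
                return end  # later alternatives are never tried
        return None
    return parse


short_first = choice(literal("a"), literal("ab"))
long_first = choice(literal("ab"), literal("a"))

print(short_first("ab", 0))  # 1: "a" wins, "ab" is never tried
print(long_first("ab", 0))   # 2: "ab" is tried first and matches
```

The order of the alternatives therefore matters: a | ab and ab | a accept different prefixes of the same input, which would not be the case for the unordered choice of a context-free grammar.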
A grammar that defines the structure of a language consists of productions, or rules, that define the strings of symbols belonging to the language. A production consists of a head, a production symbol, →, and a body. The head names the production. The body consists of terminal and nonterminal symbols and grammar operators. Collectively, the terminal symbols used in the bodies of the rules form the alphabet of the language. An alphabet is commonly denoted by the Greek letter Σ. The alphabet can be, for example, the set of ASCII characters, the symbols 0 and 1 (a binary alphabet), or a set of Unicode characters. The alphabet for Cminor is the set of Unicode characters. The names of the rules are also called nonterminal symbols. Each nonterminal represents strings of terminals and forms a sublanguage of the whole language. Terminals, nonterminals and grammar operators are collectively called grammar symbols.
In this language reference terminals are shown using a monospace font, nonterminals are shown in italic and grammar operators are in serif font like this: [ ], +, −, etc. Cminor keywords are shown in bold. They are terminal strings.
Given two grammar expressions α and β that represent strings consisting of terminals, nonterminals and grammar operators, the expression α | β represents the ordered choice of α and β.
An expression α β represents strings of terminals consisting of strings represented by α concatenated with strings represented by β.
An expression α − β represents strings that match α but do not match β.
An expression α ^ β represents strings that match either α or β but not both.
An expression α & β represents strings that match both α and β.
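The difference, exclusive-or and intersection operators can be read as set operations on the strings that each operand matches. The following Python sketch takes that set-level reading and models an expression as a predicate over whole strings. The helper names (matches, difference, exclusive_or, intersection) are hypothetical, and this reading is an assumption made for illustration; it does not describe how a Cminor parser implements the operators.

```python
import re
from typing import Callable

Predicate = Callable[[str], bool]


def matches(pattern: str) -> Predicate:
    """Return a predicate that is true when the whole string matches pattern."""
    compiled = re.compile(pattern)
    return lambda s: compiled.fullmatch(s) is not None


def difference(a: Predicate, b: Predicate) -> Predicate:
    """a − b: matches a but does not match b."""
    return lambda s: a(s) and not b(s)


def exclusive_or(a: Predicate, b: Predicate) -> Predicate:
    """a ^ b: matches either a or b, but not both."""
    return lambda s: a(s) != b(s)


def intersection(a: Predicate, b: Predicate) -> Predicate:
    """a & b: matches both a and b."""
    return lambda s: a(s) and b(s)


letter = matches(r"[a-z]")
vowel = matches(r"[aeiou]")
hex_letter = matches(r"[a-f]")

consonant = difference(letter, vowel)
print(consonant("b"), consonant("a"))                        # True False
print(exclusive_or(vowel, hex_letter)("a"))                  # False: in both sets
print(exclusive_or(vowel, hex_letter)("i"))                  # True: vowel only
print(intersection(vowel, hex_letter)("e"))                  # True: in both sets
```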
An expression α % β is a short-hand notation for expression α ( β α )*.
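The expansion of α % β into α ( β α )* describes a separated list: one or more items with a separator between each pair. As a rough illustration, the Python sketch below builds the equivalent regular expression; list_of is a hypothetical helper, and regular expressions are used here only as a convenient stand-in for the grammar notation.

```python
import re


def list_of(item: str, separator: str) -> str:
    """Build a regex for item % separator, that is: item (separator item)*."""
    return f"(?:{item})(?:(?:{separator})(?:{item}))*"


# Numbers separated by commas, corresponding to: number % ,
numbers = re.compile(list_of(r"[0-9]+", r","))

print(numbers.fullmatch("7") is not None)         # True: a single item
print(numbers.fullmatch("1,22,333") is not None)  # True: three items
print(numbers.fullmatch("1,") is not None)        # False: trailing separator
print(numbers.fullmatch("") is not None)          # False: at least one item
```

Note that the list form neither accepts an empty list nor a trailing separator; both would have to be allowed explicitly, for example with ( α % β )?.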
An expression α* represents the Kleene closure of α, that is: the empty string, strings represented by α, strings represented by sequence expression α α, strings represented by sequence expression α α α, and so on.
An expression α+ represents strings represented by α, strings represented by sequence expression α α, and so on. That is: we exclude the empty string from α*.
An expression α? represents the empty string or the strings represented by a single α.
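The three postfix operators can be sketched as small parser combinators in Python. The names Parser, literal, star, plus and optional are hypothetical helpers, and the sketch assumes the usual PEG behavior in which repetition consumes as many occurrences as it can.

```python
from typing import Callable, Optional

# A parser returns the position after the match, or None on failure.
Parser = Callable[[str, int], Optional[int]]


def literal(s: str) -> Parser:
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def star(p: Parser) -> Parser:
    """p*: zero or more occurrences of p. Never fails."""
    def parse(text: str, pos: int) -> Optional[int]:
        while True:
            end = p(text, pos)
            if end is None or end == pos:  # stop on failure or empty match
                return pos
            pos = end
    return parse


def plus(p: Parser) -> Parser:
    """p+: one occurrence of p followed by p*."""
    rest = star(p)
    def parse(text: str, pos: int) -> Optional[int]:
        first = p(text, pos)
        return None if first is None else rest(text, first)
    return parse


def optional(p: Parser) -> Parser:
    """p?: one occurrence of p, or the empty string."""
    def parse(text: str, pos: int) -> Optional[int]:
        end = p(text, pos)
        return pos if end is None else end
    return parse


ab = literal("ab")
print(star(ab)("ababx", 0))     # 4: two occurrences consumed
print(star(ab)("x", 0))         # 0: the empty string matches
print(plus(ab)("x", 0))         # None: at least one occurrence is required
print(optional(ab)("abab", 0))  # 2: at most one occurrence consumed
```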
Parentheses ( ) are used to group strings of grammar symbols. They are slightly taller than parentheses used to denote terminal symbols ( ).
Terminal symbol a represents a string consisting of the sole symbol a. The terminal symbol combination \n represents a newline character, the combination \r represents a carriage return character, and the combination \\ represents a backslash.
Keyword keyword represents a Cminor keyword string keyword. Strings that are equal to a keyword string but continue with some identifier character do not match a keyword. For example, input string "classified" does not match keyword class.
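The keyword rule amounts to a check on the character that follows the keyword text. The Python sketch below illustrates it; match_keyword and is_identifier_char are hypothetical helpers, and the set of identifier characters (letters, digits and underscore) is an assumption made for this example, not taken from the Cminor identifier definition.

```python
from typing import Optional


def is_identifier_char(ch: str) -> bool:
    # Assumption: identifier characters are letters, digits and underscore.
    return ch.isalnum() or ch == "_"


def match_keyword(keyword: str, text: str, pos: int = 0) -> Optional[int]:
    """Return the position after the keyword, or None if it does not match."""
    if not text.startswith(keyword, pos):
        return None
    end = pos + len(keyword)
    # The keyword must not continue with an identifier character.
    if end < len(text) and is_identifier_char(text[end]):
        return None
    return end


print(match_keyword("class", "class Foo"))   # 5
print(match_keyword("class", "classified"))  # None
```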
Expression [a − z] denotes a single lower case Latin letter character. Expression [^ a − f] denotes a single character excluding characters from a to f.
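A character class can be thought of as a membership test over a list of character ranges, optionally inverted. The following Python sketch shows that reading; in_class is a hypothetical helper, and the ranges are compared by code point.

```python
from typing import Iterable, Tuple


def in_class(ch: str, ranges: Iterable[Tuple[str, str]], negated: bool = False) -> bool:
    """Test whether the single character ch belongs to the character class."""
    hit = any(low <= ch <= high for low, high in ranges)
    return hit != negated  # a leading ^ inverts the result


# [a − z]: a single lower case Latin letter
print(in_class("q", [("a", "z")]))                # True
print(in_class("Q", [("a", "z")]))                # False

# [^ a − f]: any single character except a to f
print(in_class("c", [("a", "f")], negated=True))  # False
print(in_class("g", [("a", "f")], negated=True))  # True
```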
Sometimes the syntax would be so verbose that it is more convenient to describe it in plain text. This is denoted by English text in apostrophes: 'complicated syntax'.
alternative | → | sequence (| sequence)* |
sequence | → | difference difference* |
difference | → | exclusive-or (− exclusive-or)* |
exclusive-or | → | intersection (^ intersection)* |
intersection | → | list (& list)* |
list | → | postfix (% postfix)? |
postfix | → | primary (* | + | ?)? |
primary | → | rule‑name | primitive | grouping |
rule‑name | → | id |
id | → | [a−z A−Z −]+ [a−z A−Z 0−9 −]* |
primitive | → | terminal | char‑class |
grouping | → | ( alternative ) |
terminal | → | 'any Unicode character' |
char‑class | → | [ ^? char‑range* ] |
char‑range | → | char ( − char )? |
char | → | [^ \\\]] | escape |
escape | → | \\ ([xX] hex | [dD] dec | [^ xXdD] ) |
hex | → | hex‑digit+ |
hex‑digit | → | [0−9a−fA−F] |
dec | → | dec‑digit+ |
dec‑digit | → | [0−9] |
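As an illustration of the escape rule, the Python sketch below decodes a single escape inside a character class. It rests on assumptions that the rules above do not state: that the hex and dec forms give the code point of the character, and that any other escaped character stands for itself unless it is one of the combinations \n, \r or \\ described earlier. The names ESCAPE, SIMPLE and decode_escape are hypothetical.

```python
import re

# Mirrors: escape → \\ ([xX] hex | [dD] dec | [^ xXdD])
ESCAPE = re.compile(r"\\(?:[xX]([0-9a-fA-F]+)|[dD]([0-9]+)|([^xXdD]))")

# Escapes with a special meaning; all others are assumed to stand for themselves.
SIMPLE = {"n": "\n", "r": "\r", "\\": "\\"}


def decode_escape(text: str) -> str:
    """Decode one escape sequence into the character it denotes."""
    m = ESCAPE.fullmatch(text)
    if m is None:
        raise ValueError(f"not an escape sequence: {text!r}")
    hex_digits, dec_digits, other = m.groups()
    if hex_digits is not None:
        return chr(int(hex_digits, 16))  # assumed: hexadecimal code point
    if dec_digits is not None:
        return chr(int(dec_digits))      # assumed: decimal code point
    return SIMPLE.get(other, other)


print(decode_escape(r"\x41"))  # A
print(decode_escape(r"\d65"))  # A
print(decode_escape(r"\]"))    # ]
```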