Source File Character Encoding

5 Source File Character Encoding

The Cminor compiler assumes that source files are encoded in UTF-8. A UTF-8 byte order mark (EF BB BF) may appear at the start of a source file, but it is not required. If you use non-ASCII characters in a source file, remember to save it using UTF-8 encoding.
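To illustrate what "optional byte order mark" means in practice, here is a minimal Python sketch. It is not the Cminor compiler's code; the function name decode_source is hypothetical and only shows one plausible way to accept UTF-8 input with or without the EF BB BF prefix.

```python
import codecs


def decode_source(raw: bytes) -> str:
    """Decode source bytes as UTF-8, dropping an optional leading BOM.

    Hypothetical helper for illustration; not part of Cminor.
    """
    if raw.startswith(codecs.BOM_UTF8):  # codecs.BOM_UTF8 == b"\xef\xbb\xbf"
        raw = raw[len(codecs.BOM_UTF8):]
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    return raw.decode("utf-8")


with_bom = b"\xef\xbb\xbf" + "void main() {}".encode("utf-8")
without_bom = "void main() {}".encode("utf-8")

# Both inputs decode to the same text; the BOM carries no content.
assert decode_source(with_bom) == decode_source(without_bom)
```

A file saved in another encoding (for example Latin-1) with non-ASCII characters would typically fail the decode step, which is why saving as UTF-8 matters.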

All string literals in source files are converted internally from UTF-8 to UTF-32. The string and character types use UTF-32 encoding internally. This way characters (or "glyphs") and character codes (the Unicode code points representing them) have a one-to-one correspondence.
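The difference between the two encodings can be seen in a short Python sketch. This is only an illustration of UTF-8 versus UTF-32 in general, not of Cminor's internal data structures; the sample text is an assumption chosen for the example.

```python
import struct

text = "Hi 世界"  # sample text, chosen for illustration

utf8 = text.encode("utf-8")       # variable width: 1 to 4 bytes per code point
utf32 = text.encode("utf-32-le")  # fixed width: 4 bytes per code point, no BOM

# Unpack the UTF-32 bytes into one 32-bit unsigned integer per code point.
code_points = list(struct.unpack("<%dI" % (len(utf32) // 4), utf32))

assert len(utf8) == 9            # 3 ASCII bytes + 2 * 3 bytes for the Chinese characters
assert len(utf32) == 4 * 5       # 5 code points, 4 bytes each
assert code_points == [ord(c) for c in text]
assert code_points[3:] == [0x4E16, 0x754C]
```

With a fixed-width representation, the n-th element of a string is the n-th code point, so indexing does not require scanning the bytes that precede it.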

The output encoding of strings and characters is UTF-8. This works fine for file output. On Windows, however, console output of non-Latin letters, for example, comes out as garbage, because the author has not found a way to put the Windows console window into UTF-8 mode.

5.1 Hello World in Chinese

Here are the contents of the program encoding.cminor, which prints the word "Hello" followed by the word "world" in Chinese to the file hello.txt:

It is good practice for programs that contain I/O or other code that can fail to handle exceptions by using try statements.
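As a rough analogue in Python (not Cminor, and not the author's encoding.cminor listing), the same idea can be sketched as follows: write a greeting containing Chinese characters to a file as UTF-8, and wrap the I/O in a try statement. The function names and the exact greeting text are assumptions made for this sketch.

```python
import sys


def write_greeting(path: str) -> None:
    """Write "Hello" followed by "world" in Chinese, encoded as UTF-8."""
    # newline="\n" keeps the output bytes identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("Hello 世界\n")


def main(path: str = "hello.txt") -> int:
    try:
        write_greeting(path)
    except OSError as ex:
        # File output can fail (missing directory, no permission, full disk).
        print(f"error: {ex}", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Because the output goes to a file rather than to the console, the Windows console limitation described above does not apply; any editor that reads UTF-8 should display the text correctly.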

5.2 Unicode in Identifiers

Besides string and character literals, identifiers in source code that name functions, classes, variables and so on can also contain non-ASCII Unicode letters. Here are the contents of the program dice.cminor, which has a function "HeitäNoppaa" containing the Finnish letter 'ä' (represented as 0xC3 0xA4 in UTF-8). The name of the function means "play dice" in Finnish:
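The following is a separate Python sketch, not the dice.cminor listing. Python 3 also accepts Unicode letters in identifiers, so it can illustrate the same point: a function whose name contains 'ä', and the two UTF-8 bytes that letter occupies in the source file. The function body (rolling one six-sided die) is an assumption made for the example.

```python
import random


def HeitäNoppaa(rng: random.Random) -> int:
    """Roll one six-sided die (illustrative body, assumed for this sketch)."""
    return rng.randint(1, 6)


# The identifier is ordinary text; 'ä' is code point U+00E4.
name = HeitäNoppaa.__name__
assert ord("ä") == 0xE4

# In a UTF-8 source file that single letter is stored as two bytes.
assert "ä".encode("utf-8") == b"\xc3\xa4"
assert name.encode("utf-8") == b"Heit\xc3\xa4Noppaa"

roll = HeitäNoppaa(random.Random())
assert 1 <= roll <= 6
```

The name has 11 characters but occupies 12 bytes in UTF-8, which is one reason a compiler that treats identifiers as sequences of code points, rather than bytes, handles such names more naturally.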