Introduction

The goal of the lexer is to make incremental syntax highlighting as easy as possible. Ideally, changes to one line should not affect the syntax of other lines. For example, Orth prohibits multi-line strings because they complicate syntax highlighting: inserting a quotation mark could, in theory, change the syntax of every line in the file. Of course block comments are the same way, but there isn't much I can do about them.

Encoding

The Orth compiler understands four encodings:

  1. ISO 8859 (no BOM)
  2. UTF8 (EF BB BF)
  3. UTF16 little endian (FF FE)
  4. UTF16 big endian (FE FF)

A source file that lacks a BOM uses ISO 8859. ISO 8859 is the same as Windows 1252 (the default Windows code page) for characters in the ranges [0x00,0x7F] and [0xA0,0xFF]. ISO 8859 doesn't define characters in the range [0x80,0x9F], whereas Windows 1252 assigns glyphs to most of them (for instance, the euro symbol is 0x80). If your program needs a glyph specific to Windows 1252, you'll need to use one of the three Unicode encodings.
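The BOM check can be sketched in a few lines of Python. This is only an illustration, not the Orth compiler's code: the names detect_encoding and decode_source are hypothetical, and the sketch assumes that "ISO 8859" means ISO 8859-1 (Python's latin-1 codec).

```python
def detect_encoding(data: bytes) -> tuple[str, int]:
    """Return (codec name, BOM length) for the raw bytes of a source file."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8", 3
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le", 2
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be", 2
    # No BOM: ISO 8859 (assumed here to be ISO 8859-1).
    return "latin-1", 0


def decode_source(data: bytes) -> str:
    """Strip the BOM, if any, and decode the rest of the file."""
    codec, bom_length = detect_encoding(data)
    return data[bom_length:].decode(codec)
```

For example, decode_source(b"\xff\xfea\x00") yields the one-character string "a".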

Whitespace characters are 0x20 and 0x09 through 0x0D (horizontal tab, line feed, vertical tab, form feed, and carriage return). A line feed (0x0A) or carriage return (0x0D) signifies a line break unless the carriage return is immediately followed by a line feed, in which case the lexer ignores the carriage return. Other control characters (0x00 through 0x08, 0x0E through 0x1F, and 0x7F) are illegal.

Lines and Columns

The first line in a source file is line 1. The first column of each source line is 1. A line break increases the line number by 1 and resets the column number to 1. All other whitespace characters increase the column number by 1. In particular, the tab width is effectively one. In order for the compiler to display column numbers properly, you must expand tabs into spaces.
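The line and column rules, together with the line-break rules from the Encoding section, can be sketched as follows. The function name line_and_column is hypothetical. The sketch assumes that glyph characters advance the column by one just as whitespace does, and that an ignored carriage return doesn't advance the column.

```python
def line_and_column(text: str, index: int) -> tuple[int, int]:
    """Return the 1-based (line, column) of the character at text[index]."""
    line, column = 1, 1
    for i in range(index):
        ch = text[i]
        if ch == "\r" and text[i + 1 : i + 2] == "\n":
            continue  # CR immediately before LF is ignored; the LF breaks the line
        if ch in "\r\n":
            line += 1
            column = 1
        else:
            column += 1  # every other character, including a tab, counts as one column
    return line, column
```

Because a tab counts as a single column, line_and_column("a\tb", 2) reports column 3 for b, whatever tab width your editor uses.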

Comments

Line comments begin with // and continue through the next line break. The lexer ignores all characters inside a line comment, including //, /*, and */.

Block comments begin with /* and continue through the matching */. Block comments nest. Unterminated block comments are illegal.
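Nesting means the lexer has to count comment openers rather than stop at the first */. Here is a minimal sketch; the function name skip_block_comment is hypothetical, and it assumes that // has no special meaning inside a block comment, which this section doesn't specify.

```python
def skip_block_comment(text: str, start: int) -> int:
    """Return the index just past the */ that matches the /* at text[start]."""
    assert text.startswith("/*", start)
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1  # a nested comment opens
            i += 2
        elif text.startswith("*/", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i  # the outermost comment has closed
        else:
            i += 1
    raise SyntaxError("unterminated block comment")
```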

Keywords

alignas cdecl dtor inout shadow typedef
alignof char else int shared typeof
anon class export long short ubyte
auto const false null single uint
bitcast construct finally operator sizeof ulong
bit continue for out stdcall unreachable
bool ctor goto outer struct uninit
break dchar guard pragma this ushort
byte destruct if return throw void
case do import scope true wchar
catch double include select try while

Potential Keywords

The following identifiers may eventually become keywords:

alias asm enum private protected
public typename union virtual override

Retired Keywords

The following C++ keywords have no special meaning in Orth:

const_cast float register template
default friend reinterpret_cast typeid
delete inline signed unsigned
dynamic_cast mutable static using
explicit namespace static_cast volatile
extern new switch wchar_t

Identifiers

An identifier must begin with one of:

_ $ a-z A-Z any extended character (0x80 and above)

An identifier can contain only:

_ $ a-z A-Z 0-9 any extended character (0x80 and above)

An identifier cannot be a keyword. The dollar sign currently has no special meaning to the Orth compiler. Identifiers beginning with two underscores are reserved for future keywords.
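The identifier rules amount to a simple character test. In this sketch the name is_identifier is hypothetical, and the caller supplies the keyword set (the full list is in the Keywords section).

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"
IDENTIFIER_START = set("_$" + LETTERS + LETTERS.upper())
IDENTIFIER_REST = IDENTIFIER_START | set("0123456789")


def is_identifier(word: str, keywords: frozenset = frozenset()) -> bool:
    """True if word is a legal identifier and isn't one of the given keywords."""
    if not word or word in keywords:
        return False

    def allowed(ch: str, charset: set) -> bool:
        # Extended characters (0x80 and above) are always allowed.
        return ch in charset or ord(ch) >= 0x80

    return allowed(word[0], IDENTIFIER_START) and all(
        allowed(ch, IDENTIFIER_REST) for ch in word[1:]
    )
```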

Strings

There are three types of strings:

1. C strings        "c:\\foo\\bar"
2. WYSIWYG strings  `c:\foo\bar`
3. Line strings     ''Line one
                    ''Line two

C strings begin and end with a quote ("). WYSIWYG strings begin and end with a backquote (`). A line string begins with '' and ends with a line break. A line string lacks a trailing quote. The compiler automatically chooses a type for each string (char[], wchar[], or dchar[]) so the C/C++ L prefix is unnecessary.

C strings can contain escape sequences. C strings cannot span multiple lines.

WYSIWYG strings cannot contain escape sequences. The lexer interprets each character in the string literally, with one exception: it interprets two adjacent backquotes as a single backquote. (Without this rule, the adjacent backquotes would split the string into two strings.) WYSIWYG strings cannot span multiple lines.
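The doubled-backquote rule can be sketched like this. The function name read_wysiwyg is hypothetical, and error handling is reduced to a single SyntaxError.

```python
def read_wysiwyg(text: str, start: int) -> tuple[str, int]:
    """Decode the WYSIWYG string whose opening backquote is at text[start].

    Returns (string value, index just past the closing backquote).
    """
    assert text[start] == "`"
    value = []
    i = start + 1
    while i < len(text) and text[i] not in "\r\n":
        if text[i] == "`":
            if text[i + 1 : i + 2] == "`":
                value.append("`")  # two adjacent backquotes mean one backquote
                i += 2
                continue
            return "".join(value), i + 1  # a lone backquote closes the string
        value.append(text[i])  # every other character is taken literally
        i += 1
    raise SyntaxError("WYSIWYG string is not closed on its line")
```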

Line strings cannot contain escape sequences. Since they include all characters up to and including the next line break, there is no need to escape any characters. An EOF can also terminate a line string, in which case the lexer inserts an imaginary line break.

The compiler automatically concatenates adjacent strings (with the exception noted for WYSIWYG strings).

Line strings are ideal for properly indenting strings that span multiple lines:

/*
<-- This is column 1
*/
void foo()
{
    String s:=''Line one
              ''Line two
              ''Line three
    String t:=''First line
              ''Second line
}

If we instead used a hypothetical WYSIWYG string that could contain line breaks, our code would be much less clear:

/*
<-- This is column 1
*/
void foo()
{
    String s:=`Line one
Line two
Line three` //Ugly
    String t:=`First line
Second line` //Ugly
}

Escape Sequences

Orth escape sequences are the same as C/C++ with some noteworthy exceptions:

  1. An escape sequence beginning with a decimal digit is decimal. Orth doesn't support octal. A decimal escape sequence can have up to three digits and must have a value between 0 and 255. A decimal escape whose value exceeds 255 is illegal even if shortening it would make it valid (e.g., \999 is illegal but \0999 is legal).
  2. An escape sequence beginning with \y or \Y is binary. A binary escape sequence can have up to eight digits consisting of zeroes and ones.

A hexadecimal escape sequence begins with \x and contains up to two hex digits. A Unicode escape begins with \u or \U and contains up to four or eight digits, respectively. Notice that escapes are case sensitive whereas numbers aren't (e.g., "\X12" is invalid but 0X12 is valid). In either case, the digits of the hexadecimal value are case insensitive.

The remaining escape sequences have a single character:

\a   \b   \t   \n   \v   \f   \r   \'    \"    \?    \\
0x7  0x8  0x9  0xA  0xB  0xC  0xD  0x27  0x22  0x3F  0x5C
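The escape rules can be sketched as a small decoder. The names read_escape, SIMPLE, and NUMERIC are hypothetical. The sketch assumes that every escape begins with a backslash (as in the single-character table above) and that a numeric escape needs at least one digit.

```python
SIMPLE = {"a": 0x07, "b": 0x08, "t": 0x09, "n": 0x0A, "v": 0x0B, "f": 0x0C,
          "r": 0x0D, "'": 0x27, '"': 0x22, "?": 0x3F, "\\": 0x5C}
HEX = "0123456789abcdefABCDEF"
# introducer -> (allowed digits, maximum digit count, base)
NUMERIC = {"x": (HEX, 2, 16), "u": (HEX, 4, 16), "U": (HEX, 8, 16),
           "y": ("01", 8, 2), "Y": ("01", 8, 2)}


def read_escape(text: str, start: int) -> tuple[int, int]:
    """Decode the escape whose backslash is at text[start].

    Returns (character value, index just past the escape).
    """
    assert text[start] == "\\"
    kind = text[start + 1 : start + 2]
    if kind in SIMPLE:
        return SIMPLE[kind], start + 2
    if kind in NUMERIC:
        digits, limit, base = NUMERIC[kind]
        first = start + 2
    elif kind and kind in "0123456789":
        digits, limit, base = "0123456789", 3, 10  # decimal, never octal
        first = start + 1
    else:
        raise SyntaxError(f"unknown escape \\{kind}")
    end = first
    while end < len(text) and end - first < limit and text[end] in digits:
        end += 1
    if end == first:
        raise SyntaxError("escape sequence has no digits")
    value = int(text[first:end], base)
    if base == 10 and value > 255:
        raise SyntaxError("decimal escape exceeds 255")
    return value, end
```

Note how the decoder always consumes up to three decimal digits before checking the range, so \999 fails rather than being read as \99 followed by a 9.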

Characters

A character represents a single Unicode code point. Character literals are generally the same as in C/C++. Character literals can contain escape sequences. An empty character literal ('') is illegal. A character literal cannot span multiple lines; you must use '\n' if you want a newline character. The compiler automatically chooses a type for each character (char, wchar, or dchar) so the C/C++ L prefix is unnecessary.

Numbers

A number begins with a decimal digit (0-9) or a decimal point. There are many integer and floating-point forms:

077             Decimal (not octal)
123             Decimal
123.            Floating point
.123            Floating point
1e10            Floating point
0xff, 0XFF      Hexadecimal
0xFF.           Floating point
0xFF.C          Floating point
0xFFp4          Floating point
0xFF.Cp4        Floating point
0y111, 0Y111    Binary
0b111, 0B111    Binary

Orth does not support octal. A hexadecimal value followed by a p is in hexadecimal exponential notation. To determine its value, convert the hexadecimal number to the left of the p to decimal and multiply it by 2 raised to the exponent (the value after the p). If the hexadecimal number contains a period, remove it, convert the resulting number to decimal, and divide by 16 raised to the number of digits to the right of the period. For instance, 0xFF.Cp4 is 4092 / 16^1 * 2^4.

Numbers can contain underscores for clarity (e.g., 0xffff_ffff). The lexer ignores all such underscores in any part of the number (including the exponent) even if they serve no purpose (e.g., 0x__f__).

All letters in a number are case insensitive.
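The hexadecimal exponential rule, together with the underscore and case rules, can be sketched as follows. The function name hex_float_value is hypothetical, and the sketch assumes that the exponent after the p is written in decimal, as it is in C.

```python
def hex_float_value(literal: str) -> float:
    """Evaluate a hexadecimal literal such as 0xFF.Cp4."""
    text = literal.replace("_", "").lower()  # underscores and case don't matter
    assert text.startswith("0x")
    mantissa, _, exponent = text[2:].partition("p")
    whole, _, fraction = mantissa.partition(".")
    # Remove the period, convert to decimal, then divide by 16 per fraction digit.
    value = int(whole + fraction, 16) / 16 ** len(fraction)
    # Multiply by 2 raised to the exponent (assumed decimal; zero if absent).
    return value * 2.0 ** int(exponent or "0")
```

For 0xFF.Cp4 this computes 4092 / 16 * 16, which is 4092.0.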

Numbers cannot have suffixes because integer literals and floating-point literals do not have a type.

There is no way to enter special IEEE 754 values (i.e., infinities, infinitesimals, and NaNs). You'll need to bitcast the underlying hexadecimal (or binary) bit pattern to the appropriate floating-point type. The Orth compiler is free to treat +0.0 and -0.0 as the same value.

Operators

The Orth lexer uses the "maximum munch" rule to parse operators. The Orth operators are generally the same as C with a few noteworthy exceptions:

Operator  C                            Orth
#         Start of preprocessor token  Illegal
##        Token pasting operator       Illegal
..        Illegal                      Signifies a closed range (e.g., [1,9])
..<       Illegal                      Signifies a half-open range (e.g., [1,10))
=         Assignment                   Illegal
:=        Illegal                      Assignment
@         Illegal                      Bitwise xor
@=        Illegal                      Bitwise xor assign
\         Line continuation            Illegal
^         Bitwise xor                  Dereference
^=        Bitwise xor assign           Illegal
->        Dereference                  Illegal (use ^. instead)

Preprocessor

Orth does not have a preprocessor. Any use of # outside of comments and strings is illegal. [Orth 0.4 will probably have a compile-time if statement and an alias statement that will perform AST substitutions.]

Indentation

Unlike C/C++, the lexer doesn't ignore whitespace. Instead, the lexer converts whitespace into three tokens that the parser uses to identify the beginning and end of each body: LineBreak, Indent, and Unindent. The lines of an Orth module form a hierarchy:

1: int foo() {             //This is a "root" because it begins at column 1
2:     int x:=0,y:=0       //This is a child of line 1 because it's indented from the line above
3:     if(x==0)            //This is a sibling of line 2 because its indentation matches line 2
4:         if(y==0)        //This is a child of line 3 because it's indented from the line above
5:             return 1    //This is a child of line 4
         return 0          //This statement is an error because it's not indented from the line
                           //above, and its indentation doesn't match lines 1, 3, 4, or 5
6:     return 2            //This is a sibling of line 3 because its indentation matches line 3
7: }                       //This is a "root" because it begins at column 1

In general, the lexer uses the following pseudocode to determine the relationship between each line.

  1. Does the statement begin at column 1? If yes, then the statement is a root.
  2. Is the statement indented from the line above? If yes, then the statement is a child of that statement.
  3. Does the statement line up with any statements above? If yes, then the statement is a sibling of that statement.
  4. Otherwise, the statement is invalid.

When the lexer fetches the next token, it reads each whitespace character up to the next glyph character, ignoring comments (i.e., it deletes each comment and lexes the file as if it didn't contain any comments). If it doesn't see any line breaks, then it decodes the next token. Otherwise, it counts the number of whitespace characters after the last line break. This number is the line's indentation. For simplicity's sake, the lexer doesn't distinguish spaces from tabs and other whitespace characters. Each one increases the indentation by one.

In the example file below, there are two lines each containing a single glyph character, a or b. Both lines have an indentation of two:

  / * * /   a   \n   / / \n     \n \t \v / * \n * / b

The lexer maintains an indentation stack with the highest indentation at the top and zero at the bottom. Each time the lexer reads a line break, it searches for a line with a matching indentation in its stack. There are four possibilities:

  1. The line has the same indentation as the stack's top:
    The lexer returns a LineBreak token.
  2. The line has a higher indentation than the stack's top:
    The lexer returns a LineBreak token and an Indent token. It also pushes this line's indentation onto the stack.
  3. There are n entries above the matching line on the stack:
    The lexer returns n Unindent tokens followed by a LineBreak token. The lexer also pulls n values off of the stack.
  4. None of the above conditions are true:
    The line isn't indented properly. The lexer prints an error message.

When the lexer reaches the end of the file, it generates one Unindent for each entry on the stack (excluding the bottommost entry, which is zero), a LineBreak, and an End token. The lexer ignores all whitespace characters after the last glyph character in the file.

In the earlier example, the lexer would produce the following tokens:

int foo ( ) { LineBreak Indent
int x := 0 , y := 0 LineBreak
if ( x == 0 ) LineBreak Indent
if ( y == 0 ) LineBreak Indent
return 1 Unindent Unindent LineBreak
return 2 Unindent LineBreak
} LineBreak End
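The indentation stack can be sketched in Python. This is an illustration, not the compiler's code: the function name layout is hypothetical, each line is given as an (indentation, tokens) pair with comments and blank lines already removed, and the first line is assumed to start at column 1.

```python
def layout(lines):
    """Insert LineBreak, Indent, Unindent, and End tokens into a token stream.

    lines is a list of (indentation, tokens) pairs, one per non-blank line.
    """
    stack = [0]  # zero at the bottom, highest indentation at the top
    out = []
    for number, (indent, tokens) in enumerate(lines):
        if number > 0:  # a line break precedes every line but the first
            if indent == stack[-1]:
                out.append("LineBreak")
            elif indent > stack[-1]:
                out += ["LineBreak", "Indent"]
                stack.append(indent)
            elif indent in stack:
                while stack[-1] != indent:  # pop the n entries above the match
                    stack.pop()
                    out.append("Unindent")
                out.append("LineBreak")
            else:
                raise SyntaxError("line isn't indented properly")
        out += tokens
    # End of file: one Unindent per entry above the bottommost zero.
    return out + ["Unindent"] * (len(stack) - 1) + ["LineBreak", "End"]
```

Feeding it the seven valid lines of the earlier example, with indentations 0, 4, 4, 8, 12, 4, and 0, reproduces the token listing above. A line with an indentation of 9 after line 5 raises the error, because 9 is lower than the top of the stack and doesn't match any entry.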

Line Continuations

The syntax uses the abbreviation LC to indicate points where a line continuation is possible.

When a statement becomes too long, you typically want to split it into two or more lines. In C/C++, you can add spaces and line breaks wherever you want because the lexer ignores whitespace. Orth introduces the concept of joining operators. When a statement ends with a joining operator and the next line is indented, it automatically continues on the next line. Each line continuation must be indented by at least one column from the start of the statement, but it doesn't need to line up with other line continuations for that statement.

A joining operator modifies the algorithm that the lexer uses to determine indentation. Rather than emitting a LineBreak and an Indent token and pushing a value onto the stack, the lexer fetches the next token without modifying the stack. The modified algorithm is below (only the action in step 2 has changed).

  1. The line has the same indentation as the stack's top:
    The lexer returns a LineBreak token.
  2. The line has a higher indentation than the stack's top:
    The lexer fetches the next token.
  3. There are n entries above the matching line on the stack:
    The lexer returns n Unindent tokens followed by a LineBreak token. The lexer also pulls n values off of the stack.
  4. None of the above conditions are true:
    The line isn't indented properly. The lexer prints an error message.

The following operators are joining operators in all contexts:

* / % << >> + - & @ | < > <= >= == != && || ?
: := *= /= %= <<= >>= += -= &= @= |= && || ( [ .. ..< string

Two operators are joining operators only inside parentheses: commas and semicolons.
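The joining rule can be sketched as a variation of the indentation-stack algorithm. The names joins and layout are hypothetical. Each line is an (indentation, tokens) pair with at least one token, a string is recognized by its opening quote, and "inside parentheses" is approximated by counting ( and ) tokens. All of these are simplifying assumptions.

```python
# Operator list transcribed from the text, reading the entry between >>= and -= as +=.
JOINING = set(
    "* / % << >> + - & @ | < > <= >= == != && || ? "
    ": := *= /= %= <<= >>= += -= &= @= |= ( [ .. ..<".split()
)


def joins(token: str, paren_depth: int) -> bool:
    """True if a statement ending in this token continues on an indented line."""
    if token in JOINING or token.startswith(('"', "`", "''")):
        return True  # a joining operator or a string
    return token in (",", ";") and paren_depth > 0  # these join only inside parens


def layout(lines):
    stack, out, depth = [0], [], 0
    for number, (indent, tokens) in enumerate(lines):
        if number > 0:
            if indent == stack[-1]:
                out.append("LineBreak")
            elif indent > stack[-1]:
                # Step 2: after a joining operator, just fetch the next token.
                if not joins(out[-1], depth):
                    out += ["LineBreak", "Indent"]
                    stack.append(indent)
            elif indent in stack:
                while stack[-1] != indent:
                    stack.pop()
                    out.append("Unindent")
                out.append("LineBreak")
            else:
                raise SyntaxError("line isn't indented properly")
        for token in tokens:
            depth += (token == "(") - (token == ")")  # track parenthesis nesting
            out.append(token)
    return out + ["Unindent"] * (len(stack) - 1) + ["LineBreak", "End"]
```

A line that ends in + or ( joins with the indented line below it and produces no layout tokens at all, whereas a line that ends in an ordinary token still opens an indented body.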

The following line continuations are legal:

int x:=a+(
          b+c)+ //This line is indented from the first line
       d        //This line is also indented from the first line
String s:=''abc
          ''def
    ''ghi
int[2] values:={
    1, //not a joining operator (not inside parens)
    2, //same
}
void foo() {
    bar(); //not a joining operator (not inside parens)
    bar(); //same
}
int bar() {
    for(a; //ok, the semicolon joins because it is inside parens
        b; //same
        c)
        baz()
    return (a, //ok, the comma joins because it is inside parens
        b)
}

But these line continuations are illegal:

void foo()
{
    int x:=a+(b+
    c)+ //illegal: not indented far enough
d //illegal: not indented far enough
}
int bar()
{
    return a,
        b //Illegal: the comma doesn't join because it isn't inside parens
}

A valid statement cannot end with any of the joining operators, so the parser will never accidentally join a line that you didn't want to join. On the other hand, the parser may decide not to join a line that you intended to join. Consider this example:

1: int a:=123
2:     +456 //Error: statement's indentation is incorrect

Line 1 doesn't end with a joining operator, so the parser will assume that line 2 is the indented body of line 1. Since line 1 can't have a body, the parser will issue an error message.