The goal of the lexer is to make incremental syntax highlighting as easy as possible. Ideally, changes to one line should not affect the syntax of other lines. For example, Orth prohibits multi-line strings because they complicate syntax highlighting: inserting a quotation mark could, in theory, change the syntax of every line in the file. Block comments have the same problem, of course, but there isn't much I can do about them.
The Orth compiler understands four encodings:
1. UTF-8 (BOM: EF BB BF)
2. UTF-16 little-endian (BOM: FF FE)
3. UTF-16 big-endian (BOM: FE FF)
4. ISO 8859 (no BOM)
A source file that lacks a BOM uses ISO 8859. ISO 8859 is the same as Windows 1252 (the default Windows code page) for characters in the ranges [0x00,0x7F] and [0xA0,0xFF]. ISO 8859 doesn't define characters in the range [0x80,0x9F], whereas Windows 1252 assigns glyphs to most of them (for instance, the euro symbol is 0x80). If your program needs a glyph specific to Windows 1252, you'll need to use one of the three Unicode encodings.
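The encoding choice can be sketched in a few lines of Python. This is only an illustration, not the compiler's code: the function names are hypothetical, the three byte sequences are read as the standard UTF-8, UTF-16 little-endian, and UTF-16 big-endian byte-order marks, and "ISO 8859" is assumed to mean ISO 8859-1 (Python's `latin-1`), which maps 0x80 through 0x9F to control characters rather than rejecting them.

```python
import codecs

# BOM -> Python codec name. The mapping to UTF-8 / UTF-16 is the standard one.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_encoding(data: bytes):
    """Return (codec_name, bom_length) for the raw bytes of a source file."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM: fall back to ISO 8859 (assumed here to be ISO 8859-1).
    return "latin-1", 0


def decode_source(data: bytes) -> str:
    """Strip the BOM, if any, and decode the rest of the file."""
    name, skip = detect_encoding(data)
    return data[skip:].decode(name)
```

For example, `decode_source(b"\xff\xfea\x00")` yields `"a"`, and a file with no BOM is decoded byte for byte.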
Whitespace characters are 0x20 and 0x09 through 0x0D (horizontal tab, line feed, vertical tab, form feed, and carriage return). A line feed (0x0A) or carriage return (0x0D) signifies a line break unless the carriage return is immediately followed by a line feed, in which case the lexer ignores the carriage return. Other control characters (0x00 through 0x08, 0x0E through 0x1F, and 0x7F) are illegal.
The first line in a source file is line 1. The first column of each source line is 1. A line break increases the line number by 1 and resets the column number to 1. All other whitespace characters increase the column number by 1. In particular, the tab width is effectively one. In order for the compiler to display column numbers properly, you must expand tabs into spaces.
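The line and column rules can be sketched as a small generator. The name `positions` is hypothetical, and two details are assumptions: an ignored carriage return does not consume a column, and non-whitespace characters advance the column by one just as whitespace does.

```python
def positions(text):
    """Yield (char, line, column) for each character the lexer would see."""
    line, col = 1, 1
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        # A carriage return immediately followed by a line feed is ignored.
        if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
            i += 1
            continue
        code = ord(ch)
        # Control characters other than 0x09..0x0D are illegal.
        if code < 0x09 or 0x0E <= code <= 0x1F or code == 0x7F:
            raise ValueError(f"illegal control character 0x{code:02X} at {line}:{col}")
        yield ch, line, col
        if ch in "\r\n":
            # A line break starts a new line at column 1.
            line += 1
            col = 1
        else:
            # Everything else, tabs included, is one column wide.
            col += 1
        i += 1
```

Because a tab counts as a single column, the `b` in `"a\tb"` is reported at column 3 regardless of how an editor displays it.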
Line comments begin with // and continue through the next line break. The lexer ignores all characters inside a line comment, including //, /*, and */.
Block comments begin with /* and continue through the matching */. Block comments nest.
Unterminated block comments are illegal.
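A depth counter is enough to handle nesting. The sketch below (the name `strip_comments` is hypothetical) removes comments from a piece of source text. It deliberately ignores string and character literals, so a `/*` inside a string would be misread, and it assumes the line break that ends a line comment is kept so that later stages still see it.

```python
def strip_comments(src):
    """Remove line comments and nested block comments from src."""
    out = []
    depth = 0  # current block-comment nesting depth
    i, n = 0, len(src)
    while i < n:
        two = src[i:i + 2]
        if depth == 0 and two == "//":
            # Skip to the line break; the break itself is kept (assumption).
            while i < n and src[i] not in "\r\n":
                i += 1
        elif two == "/*":
            depth += 1  # opens a comment, or nests inside one
            i += 2
        elif depth > 0 and two == "*/":
            depth -= 1
            i += 2
        elif depth > 0:
            i += 1  # inside a block comment: ignore everything, including //
        else:
            out.append(src[i])
            i += 1
    if depth:
        raise SyntaxError("unterminated block comment")
    return "".join(out)
```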
The following identifiers are keywords:
alignas | cdecl | dtor | inout | shadow | typedef |
alignof | char | else | int | shared | typeof |
anon | class | export | long | short | ubyte |
auto | const | false | null | single | uint |
bitcast | construct | finally | operator | sizeof | ulong |
bit | continue | for | out | stdcall | unreachable |
bool | ctor | goto | outer | struct | uninit |
break | dchar | guard | pragma | this | ushort |
byte | destruct | if | return | throw | void |
case | do | import | scope | true | wchar |
catch | double | include | select | try | while |
The following identifiers may eventually become keywords:
alias | asm | enum | private | protected |
public | typename | union | virtual | override |
The following C++ keywords have no special meaning in Orth:
const_cast | float | register | template |
default | friend | reinterpret_cast | typeid |
delete | inline | signed | unsigned |
dynamic_cast | mutable | static | using |
explicit | namespace | static_cast | volatile |
extern | new | switch | wchar_t |
An identifier must begin with one of:
_ | $ | a-z | A-Z | any extended character (0x80 and above) |
An identifier can contain only:
_ | $ | a-z | A-Z | 0-9 | any extended character (0x80 and above) |
An identifier cannot be a keyword. The dollar sign currently has no special meaning to the Orth compiler. Identifiers beginning with two underscores are reserved for future keywords.
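These rules translate directly into a regular expression. The sketch below is illustrative only: `is_identifier` and `is_reserved` are hypothetical names, `KEYWORDS` holds just a handful of the keywords from the table, and "extended character" is taken to mean any code point at or above 0x80 in already-decoded text.

```python
import re

# First character: _ $ a-z A-Z or extended; later characters may also be 0-9.
_IDENT = re.compile(r"[_$a-zA-Z\u0080-\U0010FFFF][_$a-zA-Z0-9\u0080-\U0010FFFF]*")

# Deliberately incomplete subset of the keyword table, for illustration.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "void"}


def is_identifier(text):
    """True if text is a well-formed identifier and not a (listed) keyword."""
    return _IDENT.fullmatch(text) is not None and text not in KEYWORDS


def is_reserved(text):
    """True if text begins with two underscores (reserved for future keywords)."""
    return text.startswith("__")
```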
There are three types of strings:
1. C strings | "c:\\foo\\bar" |
2. WYSIWYG strings | `c:\foo\bar` |
3. line strings | ''Line one |
C strings begin and end with a quote ("). WYSIWYG strings begin and end with a backquote (`).
A line string begins with '' and ends with a line break. A line string lacks a trailing quote.
The compiler automatically chooses a type for each string (char[], wchar[], or dchar[]) so the C/C++ L prefix is unnecessary.
C strings can contain escape sequences. C strings cannot span multiple lines.
WYSIWYG strings cannot contain escape sequences. The lexer interprets each character in the string literally, with one exception: it interprets two adjacent backquotes as a single backquote. Notice that the adjacent backquotes would otherwise split the string into two strings. WYSIWYG strings cannot span multiple lines.
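The doubled-backquote rule can be sketched as a small scanner. The function name `read_wysiwyg` and its return shape are hypothetical. The sketch also assumes that a backquote followed by another backquote is always the embedded-backquote form, which leaves the empty string as the only case where the opening backquote is directly followed by the closing one.

```python
def read_wysiwyg(src, start):
    """Scan a WYSIWYG string starting at src[start]; return (value, end_index)."""
    if src[start] != "`":
        raise ValueError("not at a backquote")
    out = []
    i = start + 1
    while i < len(src):
        ch = src[i]
        if ch in "\r\n":
            raise SyntaxError("WYSIWYG strings cannot span multiple lines")
        if ch == "`":
            if src[i + 1:i + 2] == "`":
                # Two adjacent backquotes stand for one literal backquote.
                out.append("`")
                i += 2
                continue
            return "".join(out), i + 1  # closing backquote
        out.append(ch)  # every other character is taken literally
        i += 1
    raise SyntaxError("unterminated WYSIWYG string")
```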
Line strings cannot contain escape sequences. Since they include all characters up to and including the next line break, there is no need to escape any characters. An EOF can also terminate a line string, in which case the lexer inserts an imaginary line break.
The compiler automatically concatenates adjacent strings (with the exception noted for WYSIWYG strings).
Line strings are ideal for properly indenting strings that span multiple lines:
/* <-- This is column 1 */
void foo() {
    String s:=''Line one
              ''Line two
              ''Line three
    String t:=''First line
        ''Second line
}
Suppose instead that we used a hypothetical WYSIWYG string that could contain line breaks. Our code would be much less clear:
/* <-- This is column 1 */
void foo() {
    String s:=`Line one
Line two
Line three` //Ugly
    String t:=`First line
Second line` //Ugly
}
Orth escape sequences are the same as C/C++ with some noteworthy exceptions:
A binary escape sequence begins with \y or \Y and can have up to eight digits consisting of zeroes and ones.
A hexadecimal escape sequence begins with \x and contains up to two hex digits.
A Unicode escape begins with \u or \U and contains up to four or eight digits, respectively.
Notice that escapes are case sensitive whereas a number isn't (e.g., "\X12" is invalid but 0X12 is valid). In either case, the digits of the hexadecimal value are case insensitive.
The remaining escape sequences have a single character:
\a | \b | \t | \n | \v | \f | \r | \' | \" | \? | \\ |
0x7 | 0x8 | 0x9 | 0xA | 0xB | 0xC | 0xD | 0x27 | 0x22 | 0x3F | 0x5C |
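Putting the escape rules together, a decoder for the body of a C string might look like the sketch below. The names are hypothetical, and one behavior is an assumption: "up to N digits" is read greedily, so the decoder consumes as many valid digits as it can, up to the limit, and requires at least one.

```python
# Single-character escapes and their code points, from the table above.
SIMPLE = {"a": 0x07, "b": 0x08, "t": 0x09, "n": 0x0A, "v": 0x0B, "f": 0x0C,
          "r": 0x0D, "'": 0x27, '"': 0x22, "?": 0x3F, "\\": 0x5C}

# Numeric escapes: introducer -> (base, maximum number of digits).
NUMERIC = {"y": (2, 8), "Y": (2, 8), "x": (16, 2), "u": (16, 4), "U": (16, 8)}


def decode_escapes(body):
    """Decode the escape sequences in the body of a C string."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        i += 1
        if i >= len(body):
            raise SyntaxError("dangling backslash")
        kind = body[i]
        i += 1
        if kind in SIMPLE:
            out.append(chr(SIMPLE[kind]))
        elif kind in NUMERIC:
            base, max_digits = NUMERIC[kind]
            allowed = "01" if base == 2 else "0123456789abcdefABCDEF"
            j = i
            # Greedy read of up to max_digits digits (assumption).
            while j < len(body) and j - i < max_digits and body[j] in allowed:
                j += 1
            if j == i:
                raise SyntaxError("escape sequence has no digits")
            out.append(chr(int(body[i:j], base)))
            i = j
        else:
            # Escapes are case sensitive, so \X lands here.
            raise SyntaxError(f"unknown escape \\{kind}")
    return "".join(out)
```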
A character represents a single Unicode code point. Character literals are generally the same as in C/C++. Character literals can contain escape sequences.
An empty character literal ('') is illegal. A character literal cannot span multiple lines; you must use '\n' if you want a newline character. The compiler automatically chooses a type for each character (char, wchar, or dchar) so the C/C++ L prefix is unnecessary.
A number begins with a decimal digit (0-9) or a decimal point. There are many integer and floating-point forms:
077 | Decimal (not octal) |
123 | Decimal |
123. | Floating point |
.123 | Floating point |
1e10 | Floating point |
0xff, 0XFF | Hexadecimal |
0xFF. | Floating point |
0xFF.C | Floating point |
0xFFp4 | Floating point |
0xFF.Cp4 | Floating point |
0y111, 0Y111 | Binary |
0b111, 0B111 | Binary |
Orth does not support octal. A hexadecimal value followed by a p is in hexadecimal exponential notation. To determine its value, convert the hexadecimal number to the left of the p to decimal and multiply it by 2 raised to the exponent (the value after the p). If the hexadecimal number contains a period, remove it, convert the resulting number to decimal, and divide by 16 raised to the number of digits to the right of the period. For instance, 0xFF.Cp4 is 4092 / 16^1 * 2^4 = 4092.
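The conversion rule can be checked with exact arithmetic. In this sketch the name `hex_literal_value` is hypothetical, the exponent after the p is assumed to be written in decimal (as in C99), and underscores are simply dropped before parsing.

```python
from fractions import Fraction


def hex_literal_value(text):
    """Exact value of a hexadecimal literal such as 0xFF, 0xFF.C or 0xFF.Cp4."""
    s = text.replace("_", "").lower()  # underscores and letter case don't matter
    if not s.startswith("0x"):
        raise ValueError("not a hexadecimal literal")
    mantissa, _, exponent = s[2:].partition("p")
    whole, _, frac = mantissa.partition(".")
    # Remove the period, convert, then divide by 16 per digit right of the period.
    value = Fraction(int(whole + frac, 16), 16 ** len(frac))
    if exponent:
        value *= Fraction(2) ** int(exponent)  # multiply by 2 raised to the exponent
    return value
```

For 0xFF.Cp4 this computes 0xFFC = 4092, divides by 16 to get 255.75, and multiplies by 2^4 to get 4092, which agrees with Python's built-in `float.fromhex("0xFF.Cp4")`.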
Numbers can contain underscores for clarity (e.g., 0xffff_ffff). The lexer ignores all such underscores in any part of the number (including the exponent) even if they serve no purpose (e.g., 0x__f__).
All letters in a number are case insensitive.
Numbers cannot have suffixes because integer literals and floating-point literals do not have a type.
There is no way to enter special IEEE 754 values (i.e., infinities, infinitesimals, and NaNs) directly. You'll need to bitcast the underlying hexadecimal (or binary) bit pattern to the appropriate floating-point type. The Orth compiler is free to treat +0.0 and -0.0 as the same value.
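To show which bit patterns are meant, here is a Python analogue of that reinterpretation. It is not Orth code: `bits_to_double` is a hypothetical helper built on Python's `struct` module, and it assumes the target type is a 64-bit IEEE 754 double.

```python
import math
import struct


def bits_to_double(bits):
    """Reinterpret a 64-bit unsigned integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


POS_INFINITY = bits_to_double(0x7FF0_0000_0000_0000)  # exponent all ones, fraction zero
QUIET_NAN = bits_to_double(0x7FF8_0000_0000_0000)     # exponent all ones, fraction nonzero
MIN_SUBNORMAL = bits_to_double(0x0000_0000_0000_0001) # smallest positive value
NEG_ZERO = bits_to_double(0x8000_0000_0000_0000)      # sign bit only
```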
The Orth lexer uses the "maximum munch" rule to parse operators. The Orth operators are generally the same as C with a few noteworthy exceptions:
Operator | C | Orth |
---|---|---|
# | Start of preprocessor token | Illegal |
## | Token pasting operator | Illegal |
.. | Illegal | Signifies a closed range (e.g., [1,9]) |
..< | Illegal | Signifies a half-open range (e.g., [1,10)) |
= | Assignment | Illegal |
:= | Illegal | Assignment |
@ | Illegal | Bitwise xor |
@= | Illegal | Bitwise xor assign |
\ | Line continuation | Illegal |
^ | Bitwise xor | Dereference |
^= | Bitwise xor assign | Illegal |
-> | Dereference | Illegal (use ^. instead) |
Orth does not have a preprocessor. Any use of # outside of comments and strings is illegal.
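The "maximum munch" rule means the lexer always takes the longest operator that matches at the current position. The sketch below illustrates this with a partial operator list assembled from the tables in this document; the list is not exhaustive, and `next_operator` is a hypothetical name.

```python
# Partial operator list (illustrative, not the complete Orth operator set).
OPERATORS = [
    "..<", "..", ":=", "@=", "@", "^", "<<=", ">>=", "<<", ">>",
    "<=", ">=", "==", "!=", "<", ">", "+=", "-=", "*=", "/=", "%=",
    "&=", "|=", "&&", "||", "+", "-", "*", "/", "%", "&", "|",
    "?", ":", "(", ")", "[", "]", ".", ",", ";",
]

# Try longer operators first so that "..<" wins over ".." and ".".
_BY_LENGTH = sorted(OPERATORS, key=len, reverse=True)


def next_operator(src, pos):
    """Return the longest operator starting at src[pos], or None."""
    for op in _BY_LENGTH:
        if src.startswith(op, pos):
            return op
    return None
```

With this list, `next_operator("..<10", 0)` returns `"..<"` rather than `".."`, and a lone `=` matches nothing, consistent with `=` being illegal in Orth.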
[Orth 0.4 will probably have a compile-time if statement and an alias statement that will perform AST substitutions.]
Unlike C/C++, the lexer doesn't ignore whitespace. Instead, the lexer converts whitespace into three tokens that the parser uses to identify the beginning and end of each body: LineBreak, Indent, and Unindent. The lines of an Orth module form a hierarchy:
1: int foo() {  //This is a "root" because it begins at column 1
2:     int x:=0,y:=0  //This is a child of line 1 because it's indented from the line above
3:     if(x==0)  //This is a sibling of line 2 because its indentation matches line 2
4:         if(y==0)  //This is a child of line 3 because it's indented from the line above
5:             return 1  //This is a child of line 4
             return 0  //This statement is an error because it's not indented from the line
                       //above, and its indentation doesn't match lines 1, 3, 4, or 5
6:     return 2  //This is a sibling of line 3 because its indentation matches line 3
7: }  //This is a "root" because it begins at column 1
In general, the lexer uses the following pseudocode to determine the relationship between each line.
When the lexer fetches the next token, it reads each whitespace character up to the next glyph character, ignoring comments (i.e., it deletes each comment and lexes the file as if it didn't contain any comments). If it doesn't see any line breaks, then it decodes the next token. Otherwise, it counts the number of whitespace characters after the last line break. This number is the line's indentation. For simplicity's sake, the lexer doesn't distinguish spaces from tabs and other whitespace characters. Each one increases the indentation by one.
In the example file below, there are two lines each containing a single glyph character, a or b. Both lines have an indentation of two:
/ | * | * | / | a | \n | / | / | \n | \n | \t | \v | / | * | \n | * | / | b |
The lexer maintains an indentation stack with the highest indentation at the top and zero at the bottom. Each time the lexer reads a line break, it searches for a line with a matching indentation in its stack. There are four possibilities:
When the lexer reaches the end of the file, it generates one Unindent for each entry on the stack (excluding the bottommost entry, which is zero), a LineBreak, and an End token. The lexer ignores all whitespace characters after the last glyph character in the file.
In the earlier example, the lexer would produce the following tokens:
int | foo | ( | ) | { | LineBreak | Indent |
int | x | := | 0 | , | y | := | 0 | LineBreak |
if | ( | x | == | 0 | ) | LineBreak | Indent |
if | ( | y | == | 0 | ) | LineBreak | Indent |
return | 1 | Unindent | Unindent | LineBreak |
return | 2 | Unindent | LineBreak |
} | LineBreak | End |
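The stack behavior can be sketched in Python. The four cases used here are inferred from the token listing above rather than taken from the specification's own wording: an indentation deeper than the top of the stack emits LineBreak and Indent and is pushed; one equal to the top emits LineBreak; one equal to a lower entry pops the stack, emitting one Unindent per popped entry, and then emits LineBreak; one that matches no entry is an error. The function name and input shape are hypothetical, and the sketch assumes comments and blank lines are already removed and that the first line starts at column 1.

```python
def layout_tokens(lines):
    """lines: list of (indentation, [tokens]) pairs, one per non-blank line."""
    stack = [0]  # highest indentation on top, zero at the bottom
    out = []
    for number, (indent, tokens) in enumerate(lines):
        if number > 0:  # a line break separates this line from the previous one
            if indent > stack[-1]:
                out += ["LineBreak", "Indent"]
                stack.append(indent)
            elif indent == stack[-1]:
                out.append("LineBreak")
            elif indent in stack:
                while stack[-1] != indent:
                    stack.pop()
                    out.append("Unindent")
                out.append("LineBreak")
            else:
                raise SyntaxError(f"indentation {indent} matches no enclosing line")
        out += tokens
    # End of file: close every open level, then LineBreak and End.
    out += ["Unindent"] * (len(stack) - 1)
    out += ["LineBreak", "End"]
    return out
```

Fed the seven numbered lines of the earlier example (with indentations 0, 4, 4, 8, 12, 4, 0), this sketch reproduces the token listing above, and the unnumbered `return 0` line is rejected because its indentation matches no entry on the stack.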
The syntax uses the abbreviation LC to indicate points where a line continuation is possible.
When a statement becomes too long, you typically want to split it into two or more lines. In C/C++, you can add spaces and line breaks wherever you want because the lexer ignores whitespace. Orth introduces the concept of joining operators. When a line ends with a joining operator and the next line is indented, the statement automatically continues on the next line. Each line continuation must be indented by at least one column from the start of the statement, but it doesn't need to line up with other line continuations for that statement.
A joining operator modifies the algorithm that the lexer uses to determine indentation. Rather than emitting a LineBreak and Indent token and pushing a value onto the stack, the lexer fetches the next token without modifying the stack. The modified algorithm is below (only the italicized sentence has changed).
The following operators are joining operators in all contexts:
* | / | % | << | >> | + | - | & | @ | | | < | > | <= | >= | == | != | && | || | ? |
: | := | *= | /= | %= | <<= | >>= | += | -= | &= | @= | |= | && | || | ( | [ | .. | ..< | string |
Two operators are joining operators only inside parentheses: commas and semicolons.
The following line continuations are legal:
int x:=a+(
        b+c)+  //This line is indented from the first line
    d  //This line is also indented from the first line
String s:=''abc
    ''def
      ''ghi
int[2] values:={
    1,  //not a joining operator (not inside parens)
    2,  //same
}
void foo() {
    bar();  //not a joining operator (not inside parens)
    bar();  //same
}
int bar() {
    for(a;  //ok, the semicolon joins because it is inside parens
        b;  //same
        c)
        baz()
    return (a,  //ok, the comma joins because it is inside parens
        b)
}
But these line continuations are illegal:
void foo() {
    int x:=a+(b+
    c)+  //illegal: not indented far enough
    d  //illegal: not indented far enough
}
int bar() {
    return a,
        b  //Illegal: the comma doesn't join because it isn't inside parens
}
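The decision of whether a line break continues a statement can be sketched as a single predicate. Everything here is illustrative: the function name and parameters are hypothetical, the operator set is transcribed from the table of joining operators (reading the compound-addition entry as +=), and a string token is recognized simply by its opening delimiter.

```python
# Operators that join in all contexts, transcribed from the table above.
ALWAYS_JOINING = {
    "*", "/", "%", "<<", ">>", "+", "-", "&", "@", "|",
    "<", ">", "<=", ">=", "==", "!=", "&&", "||", "?", ":",
    ":=", "*=", "/=", "%=", "<<=", ">>=", "+=", "-=", "&=", "@=", "|=",
    "(", "[", "..", "..<",
}

# Operators that join only inside parentheses.
JOINING_IN_PARENS = {",", ";"}


def is_string_token(token):
    """Crude check: the token starts with a C, WYSIWYG, or line-string delimiter."""
    return token.startswith(('"', "`", "''"))


def continues(prev_token, paren_depth, stmt_indent, next_indent):
    """True if the line break after prev_token is a line continuation.

    stmt_indent is the indentation of the line that starts the statement;
    next_indent is the indentation of the line after the break.
    """
    joins = (
        prev_token in ALWAYS_JOINING
        or is_string_token(prev_token)
        or (prev_token in JOINING_IN_PARENS and paren_depth > 0)
    )
    # A continuation must be indented at least one column past the statement.
    return joins and next_indent > stmt_indent
```

For instance, `continues(",", 0, 4, 8)` is false while `continues(",", 1, 4, 8)` is true, mirroring the `return a,` and `return (a,` examples.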
A valid statement cannot end with any of the joining operators, so the parser will never accidentally join a line that you didn't want to join. On the other hand, the parser may decide not to join a line that you intended to join. Consider this example:
1: int a:=123
2:     +456  //Error: statement's indentation is incorrect
Line 1 doesn't end with a joining operator, so the parser will assume that line 2 is the indented body of line 1. Since line 1 can't have a body, the parser will issue an error message.