Orth Lexer

Introduction
Encoding
Lines and Columns
Comments
Keywords
Potential Keywords
Retired Keywords
Identifiers
Strings
Escape Sequences
Characters
Numbers
Operators
Preprocessor
Indentation
Line Continuations

Introduction

The goal of the lexer is to make incremental syntax highlighting as easy as possible. Ideally, changes to one line should not affect the syntax of other lines. For example, Orth prohibits multi-line strings because they complicate syntax highlighting: inserting a quotation mark could, in theory, change the syntax of every line in the file. Block comments have the same problem, of course, but there isn't much I can do about them.

Encoding

The Orth compiler understands four encodings:

ISO 8859 (no BOM)
UTF8 (EF BB BF)
UTF16 little endian (FF FE)
UTF16 big endian (FE FF)

A source file that lacks a BOM uses ISO 8859. ISO 8859 is the same as Windows 1252 (the default Windows code page) for characters in the ranges [0x00,0x7F] and [0xA0,0xFF]. ISO 8859 doesn't define characters in the range [0x80,0x9F], whereas Windows assigns glyphs to most of them (for instance, the euro symbol is 0x80). If your program needs a glyph specific to Windows 1252, you'll need to use one of the three Unicode encodings.
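The BOM rules above can be sketched in a few lines of Python. This is only an illustration, not the Orth compiler's code: the names BOMS and decode_source are hypothetical, and the sketch assumes that "ISO 8859" means ISO 8859-1 (Python's iso-8859-1 codec, which maps the undefined range [0x80,0x9F] to control characters rather than rejecting it).

```python
# Byte order marks from the table above, paired with Python codec names.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]


def decode_source(data: bytes) -> str:
    """Decode a source file according to its BOM (hypothetical helper)."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            # Strip the BOM so it does not show up as a character.
            return data[len(bom):].decode(codec)
    # No BOM: assume ISO 8859-1, where every byte maps to one character.
    return data.decode("iso-8859-1")
```

For instance, decode_source(b"\xe9") yields "é" because the file has no BOM, while the same byte after a UTF8 BOM would be a decoding error.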

Whitespace characters are 0x20 and 0x09 through 0x0D (horizontal tab, line feed, vertical tab, form feed, and carriage return). A line feed (0x0A) or carriage return (0x0D) signifies a line break unless the carriage return is immediately followed by a line feed, in which case the lexer ignores the carriage return. Other control characters (0x00 through 0x08, 0x0E through 0x1F, and 0x7F) are illegal.

Lines and Columns

The first line in a source file is line 1. The first column of each source line is 1. A line break increases the line number by 1 and resets the column number to 1. All other whitespace characters increase the column number by 1. In particular, the tab width is effectively one. In order for the compiler to display column numbers properly, you must expand tabs into spaces.
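As a rough sketch of these rules, the Python generator below walks a decoded source string and reports the line and column of every character that is not part of a line break. The function name positions is hypothetical, and the sketch assumes that non-whitespace characters also advance the column by 1.

```python
def positions(text):
    """Yield (line, column, char) for each character outside line breaks."""
    line, col = 1, 1
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\r" or ch == "\n":
            # A carriage return immediately followed by a line feed
            # counts as a single line break.
            if ch == "\r" and text[i + 1:i + 2] == "\n":
                i += 1
            line += 1
            col = 1
        else:
            yield line, col, ch
            # Tabs are not expanded: every character is one column wide.
            col += 1
        i += 1
```

With the input "a\tb\r\nc", the tab sits at column 2, b at column 3, and c at line 2, column 1.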

Comments

Line comments begin with // and continue through the next line break. The lexer ignores all characters inside a line comment, including //, /*, and */.

Block comments begin with /* and continue through the matching */. Block comments nest. Unterminated block comments are illegal.
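Nesting means the lexer has to count comment openers rather than stop at the first */. A minimal sketch in Python, assuming that // has no special meaning inside a block comment (the text above doesn't say); skip_block_comment is a hypothetical name:

```python
def skip_block_comment(text, start):
    """Return the index just past the block comment that starts at text[start]."""
    assert text.startswith("/*", start)
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1          # entering a (possibly nested) comment
            i += 2
        elif text.startswith("*/", i):
            depth -= 1          # leaving the innermost open comment
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise SyntaxError("unterminated block comment")
```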

Keywords

alignas	cdecl	dtor	inout	shadow	typedef
alignof	char	else	int	shared	typeof
anon	class	export	long	short	ubyte
auto	const	false	null	single	uint
bitcast	construct	finally	operator	sizeof	ulong
bit	continue	for	out	stdcall	unreachable
bool	ctor	goto	outer	struct	uninit
break	dchar	guard	pragma	this	ushort
byte	destruct	if	return	throw	void
case	do	import	scope	true	wchar
catch	double	include	select	try	while

Potential Keywords

The following identifiers may eventually become keywords:

alias	asm	enum	private	protected
public	typename	union	virtual	override

Retired Keywords

The following C++ keywords have no special meaning in Orth:

const_cast	float	register	template
default	friend	reinterpret_cast	typeid
delete	inline	signed	unsigned
dynamic_cast	mutable	static	using
explicit	namespace	static_cast	volatile
extern	new	switch	wchar_t

Identifiers

An identifier must begin with one of:

_ $ a-z A-Z any extended character (0x80 and above)

An identifier can contain only:

_ $ a-z A-Z 0-9 any extended character (0x80 and above)

An identifier cannot be a keyword. The dollar sign currently has no special meaning to the Orth compiler. Identifiers beginning with two underscores are reserved for future keywords.
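The identifier rules can be expressed as one regular expression plus a keyword check. The Python sketch below is illustrative only: is_identifier is a hypothetical name, KEYWORDS holds just a small sample of the keyword table, and the sketch works on decoded characters, treating any code point at or above 0x80 as an extended character.

```python
import re

# First character: _ $ a-z A-Z or extended; later characters also allow 0-9.
IDENT = re.compile(
    "[_$a-zA-Z\u0080-\U0010ffff][_$a-zA-Z0-9\u0080-\U0010ffff]*"
)

# A small sample of the keyword table, for illustration only.
KEYWORDS = {"if", "else", "while", "return", "struct"}


def is_identifier(word):
    """True if word matches the identifier grammar and is not a keyword."""
    return IDENT.fullmatch(word) is not None and word not in KEYWORDS
```

Note that the sketch accepts names such as __x even though they are reserved; a real lexer would probably want to flag them.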

Strings

There are three types of strings:

1. C strings	`"c:\\foo\\bar"`
2. WYSIWYG strings	`c:\foo\bar`
3. line strings	`''Line one ''Line two`

C strings begin and end with a quote ("). WYSIWYG strings begin and end with a backquote (`). A line string begins with '' and ends with a line break. A line string lacks a trailing quote. The compiler automatically chooses a type for each string (char[], wchar[], or dchar[]) so the C/C++ L prefix is unnecessary.

C strings can contain escape sequences. C strings cannot span multiple lines.

WYSIWYG strings cannot contain escape sequences. The lexer interprets each character in the string literally, with one exception: it interprets two adjacent backquotes as a single backquote. Notice that the adjacent backquotes would otherwise split the string into two strings. WYSIWYG strings cannot span multiple lines.
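A sketch of how a lexer might read a WYSIWYG string, written in Python. The function name read_wysiwyg is hypothetical, and the sketch assumes that a backquote followed by another backquote is always the doubled form rather than the end of one string and the start of the next.

```python
def read_wysiwyg(text, start):
    """Read the WYSIWYG string whose opening backquote is at text[start].

    Returns (value, index just past the closing backquote).
    """
    assert text[start] == "`"
    out = []
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch in "\r\n":
            break                       # strings cannot span lines
        if ch == "`":
            if text[i + 1:i + 2] == "`":
                out.append("`")         # doubled backquote: one literal backquote
                i += 2
                continue
            return "".join(out), i + 1  # closing backquote
        out.append(ch)                  # every other character is literal
        i += 1
    raise SyntaxError("unterminated WYSIWYG string")
```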

Line strings cannot contain escape sequences. Since they include all characters up to and including the next line break, there is no need to escape any characters. An EOF can also terminate a line string, in which case the lexer inserts an imaginary line break.

The compiler automatically concatenates adjacent strings (with the exception noted for WYSIWYG strings).

Line strings are ideal for properly indenting strings that span multiple lines:

/*
<-- This is column 1
*/
void foo()
{
    String s:=''Line one
              ''Line two
              ''Line three
    String t:=''First line
              ''Second line
}

Supposing we used a hypothetical WYSIWYG string that could contain line breaks, our code would be much less clear:

/*
<-- This is column 1
*/
void foo()
{
    String s:=`Line one
Line two
Line three` //Ugly
    String t:=`First line
Second line` //Ugly
}

Escape Sequences

Orth escape sequences are the same as C/C++ with some noteworthy exceptions:

An escape sequence beginning with a decimal digit is decimal. Orth doesn't support octal. A decimal escape sequence can have up to three digits and must have a value between 0 and 255. A decimal escape whose value exceeds 255 is illegal even if shortening it would make it valid (e.g., \999 is illegal, but \0999 is legal: it is the escape \099 followed by the digit 9).
An escape sequence beginning with \y or \Y is binary. A binary escape sequence can have up to eight digits consisting of zeroes and ones.

A hexadecimal escape sequence begins with \x and contains up to two hex digits. A Unicode escape begins with \u or \U and contains up to four or eight digits, respectively. Notice that escapes are case sensitive whereas a number isn't (e.g., "\X12" is invalid but 0X12 is valid). In either case, the digits of the hexadecimal value are case insensitive.

The remaining escape sequences have a single character:

`\a`	`\b`	`\t`	`\n`	`\v`	`\f`	`\r`	`\'`	`\"`	`\?`	`\\`
0x7	0x8	0x9	0xA	0xB	0xC	0xD	0x27	0x22	0x3F	0x5C
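The decimal and binary rules are the ones that differ most from C, so here is a minimal Python sketch of them. It reads the longest run of digits allowed (three decimal or eight binary) and only then checks the value, which is why a three-digit escape above 255 is an error rather than being shortened. The function name read_numeric_escape is hypothetical.

```python
DECIMAL_DIGITS = "0123456789"


def read_numeric_escape(text, i):
    """Read a decimal or binary escape whose backslash is at text[i].

    Returns (value, index just past the escape).
    """
    assert text[i] == "\\"
    j = i + 1
    lead = text[j:j + 1]
    if lead in ("y", "Y"):
        digits, base, limit = "01", 2, 8
        j += 1                          # skip the y
    elif lead != "" and lead in DECIMAL_DIGITS:
        digits, base, limit = DECIMAL_DIGITS, 10, 3
    else:
        raise ValueError("not a decimal or binary escape")
    k = j
    # Greedily take up to `limit` digits; never back off afterwards.
    while k < len(text) and k - j < limit and text[k] in digits:
        k += 1
    if k == j:
        raise SyntaxError("escape sequence has no digits")
    value = int(text[j:k], base)
    if value > 255:
        raise SyntaxError("escape value exceeds 255")
    return value, k
```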

Characters

A character represents a single Unicode code point. Character literals are generally the same as C/C++. Character literals can contain escape sequences. An empty character literal ('') is illegal. A character literal cannot span multiple lines; you must use '\n' if you want a newline character. The compiler automatically chooses a type for each character (char, wchar, or dchar) so the C/C++ L prefix is unnecessary.

Numbers

A number begins with a decimal digit (0-9) or a decimal point. There are many integer and floating-point forms:

`077`	Decimal (not octal)
`123`	Decimal
`123.`	Floating point
`.123`	Floating point
`1e10`	Floating point
`0xff, 0XFF`	Hexadecimal
`0xFF.`	Floating point
`0xFF.C`	Floating point
`0xFFp4`	Floating point
`0xFF.Cp4`	Floating point
`0y111, 0Y111`	Binary
`0b111, 0B111`	Binary

Orth does not support octal. A hexadecimal value followed by a p is in hexadecimal exponential notation. To determine its value, convert the hexadecimal number to the left of the p to decimal and multiply it by 2 raised to the exponent (the value after the p). If the hexadecimal number contains a period, remove it, convert the resulting number to decimal, and divide by 16 raised to the number of digits to the right of the period. For instance, 0xFF.Cp4 is 4092 / 16¹ * 2⁴.
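The conversion can be written out directly. The Python sketch below follows the steps above literally; hex_float_value is a hypothetical name, underscores are not handled, and the exponent is assumed to be written in decimal (as in C), which the text above doesn't state.

```python
def hex_float_value(literal):
    """Evaluate a hexadecimal literal with an optional period and p exponent."""
    body = literal.lower()              # letters in numbers are case insensitive
    assert body.startswith("0x")
    mantissa, _, exponent = body[2:].partition("p")
    whole, _, frac = mantissa.partition(".")
    # Remove the period, convert to decimal, then divide by 16 raised to
    # the number of digits that were to the right of the period.
    value = int(whole + frac, 16) / 16 ** len(frac)
    # Multiply by 2 raised to the exponent (0 if there is no p part).
    return value * 2.0 ** int(exponent or "0")
```

For 0xFF.Cp4 this computes 4092 / 16 * 16, which agrees with Python's own float.fromhex("0xFF.Cp4").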

Numbers can contain underscores for clarity (e.g., 0xffff_ffff). The lexer ignores all such underscores in any part of the number (including the exponent) even if they serve no purpose (e.g., 0x__f__).

All letters in a number are case insensitive.

Numbers cannot have suffixes because integer literals and floating-point literals do not have a type.

There is no way to enter special IEEE 754 values (i.e., infinities, infinitesimals, and NaNs). You'll need to bitcast the underlying hexadecimal (or binary) bit pattern to the appropriate floating-point type. The Orth compiler is free to treat +0.0 and -0.0 as the same value.

Operators

The Orth lexer uses the "maximum munch" rule to parse operators. The Orth operators are generally the same as C with a few noteworthy exceptions:

	C	Orth
`#`	Start of preprocessor token	Illegal
`##`	Token pasting operator	Illegal
`..`	Illegal	Signifies a closed range (e.g., [1,9])
`..<`	Illegal	Signifies a half-open range (e.g., [1,10))
`=`	Assignment	Illegal
`:=`	Illegal	Assignment
`@`	Illegal	Bitwise xor
`@=`	Illegal	Bitwise xor assign
`\`	Line continuation	Illegal
`^`	Bitwise xor	Dereference
`^=`	Bitwise xor assign	Illegal
`->`	Dereference	Illegal (use `^.` instead)
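"Maximum munch" means that at each position the lexer takes the longest operator that matches. A small Python sketch, using only a handful of operators from this document (the list is deliberately incomplete, and the names OPERATORS, next_operator, and split_operators are hypothetical):

```python
# A partial operator list for illustration; the real table is larger.
OPERATORS = [":", ":=", ".", "..", "..<", "<", "<<", "<<=", "<=", "@", "@=", "^"]

# Trying longer operators first implements maximum munch.
BY_LENGTH = sorted(OPERATORS, key=len, reverse=True)


def next_operator(text, i):
    """Return the longest operator starting at text[i], or None."""
    for op in BY_LENGTH:
        if text.startswith(op, i):
            return op
    return None


def split_operators(text):
    """Split a run of operator characters into operators."""
    ops, i = [], 0
    while i < len(text):
        op = next_operator(text, i)
        if op is None:
            raise SyntaxError("unknown operator at index %d" % i)
        ops.append(op)
        i += len(op)
    return ops
```

Under this rule the input ..<< lexes as ..< followed by <, never as .. followed by <<.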

Preprocessor

Orth does not have a preprocessor. Any use of # outside of comments and strings is illegal. [Orth 0.4 will probably have a compile-time if statement and an alias statement that will perform AST substitutions.]

Indentation

Unlike C/C++, the lexer doesn't ignore whitespace. Instead, the lexer converts whitespace into three tokens that the parser uses to identify the beginning and end of each body: LineBreak, Indent, and Unindent. The lines of an Orth module form a hierarchy:

1: int foo() {             //This is a "root" because it begins at column 1
2:     int x:=0,y:=0       //This is a child of line 1 because it's indented from the line above
3:     if(x==0)            //This is a sibling of line 2 because its indentation matches line 2
4:         if(y==0)        //This is a child of line 3 because it's indented from the line above
5:             return 1    //This is a child of line 4
         return 0          //This statement is an error because it's not indented from the line
                           //above, and its indentation doesn't match lines 1, 3, 4, or 5
6:     return 2            //This is a sibling of line 3 because its indentation matches line 3
7: }                       //This is a "root" because it begins at column 1

In general, the lexer uses the following pseudocode to determine the relationship between each line.

Does the statement begin at column 1? If yes, then the statement is a root.
Is the statement indented from the line above? If yes, then the statement is a child of that statement.
Does the statement line up with any statements above? If yes, then the statement is a sibling of that statement.
Otherwise, the statement is invalid.

When the lexer fetches the next token, it reads each whitespace character up to the next glyph character, ignoring comments (i.e., it deletes each comment and lexes the file as if it didn't contain any comments). If it doesn't see any line breaks, then it decodes the next token. Otherwise, it counts the number of whitespace characters after the last line break. This number is the line's indentation. For simplicity's sake, the lexer doesn't distinguish spaces from tabs and other whitespace characters. Each one increases the indentation by one.

In the example file below, there are two lines each containing a single glyph character, a or b. Both lines have an indentation of two:

The lexer maintains an indentation stack with the highest indentation at the top and zero at the bottom. Each time the lexer reads a line break, it searches for a line with a matching indentation in its stack. There are four possibilities:

The line has the same indentation as the stack's top:
The lexer returns a LineBreak token.
The line has a higher indentation than the stack's top:
The lexer returns a LineBreak token and an Indent token. It also pushes this line's indentation onto the stack.
There are n entries above the matching line on the stack:
The lexer returns n Unindent tokens followed by a LineBreak token. The lexer also pulls n values off of the stack.
None of the above conditions are true:
The line isn't indented properly. The lexer prints an error message.

When the lexer reaches the end of the file, it generates one Unindent for each entry on the stack (excluding the bottommost entry, which is zero), a LineBreak, and an End token. The lexer ignores all whitespace characters after the last glyph character in the file.

In the earlier example, the lexer would produce the following tokens:

int foo ( ) { LineBreak Indent
int x := 0 , y := 0 LineBreak
if ( x == 0 ) LineBreak Indent
if ( y == 0 ) LineBreak Indent
return 1 Unindent Unindent LineBreak
return 2 Unindent LineBreak
} LineBreak End
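The indentation stack described above can be sketched in Python as follows. This is a simplification, not the actual lexer: the hypothetical function layout_tokens takes lines with comments already removed, assumes the first non-blank line begins at column 1, and emits each line's text as a single item instead of splitting it into tokens.

```python
def layout_tokens(lines):
    """Interleave line texts with LineBreak, Indent, Unindent, and End."""
    stack = [0]                 # zero at the bottom, highest indentation on top
    out = []
    first = True
    for line in lines:
        text = line.strip()
        if not text:
            continue            # whitespace-only lines carry no glyphs
        # Every leading whitespace character counts as one column.
        indent = len(line) - len(line.lstrip())
        if not first:
            if indent == stack[-1]:
                out.append("LineBreak")
            elif indent > stack[-1]:
                out += ["LineBreak", "Indent"]
                stack.append(indent)
            elif indent in stack:
                # n entries sit above the matching indentation.
                n = len(stack) - 1 - stack.index(indent)
                out += ["Unindent"] * n + ["LineBreak"]
                del stack[len(stack) - n:]
            else:
                raise SyntaxError("line isn't indented properly: " + text)
        first = False
        out.append(text)
    # End of file: one Unindent per entry except the bottommost zero.
    out += ["Unindent"] * (len(stack) - 1) + ["LineBreak", "End"]
    return out
```

Feeding it the seven valid lines of the earlier example reproduces the LineBreak, Indent, and Unindent sequence listed above.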

Line Continuations

The syntax uses the abbreviation LC to indicate points where a line continuation is possible.

When a statement becomes too long, you typically want to split it into two or more lines. In C/C++, you can add spaces and line breaks wherever you want because the lexer ignores whitespace. Orth introduces the concept of joining operators. When a statement ends with a joining operator and the next line is indented, it automatically continues on the next line. Each line continuation must be indented by at least one column from the start of the statement, but it doesn't need to line up with other line continuations for that statement.

A joining operator modifies the algorithm that the lexer uses to determine indentation. Rather than emitting a LineBreak and Indent token and pushing a value onto the stack, the lexer fetches the next token without modifying the stack. The modified algorithm is below (only the action for the second case, a line with a higher indentation than the stack's top, has changed).

The line has the same indentation as the stack's top:
The lexer returns a LineBreak token.
The line has a higher indentation than the stack's top:
The lexer fetches the next token.
There are n entries above the matching line on the stack:
The lexer returns n Unindent tokens followed by a LineBreak token. The lexer also pulls n values off of the stack.
None of the above conditions are true:
The line isn't indented properly. The lexer prints an error message.

The following operators are joining operators in all contexts:

* / % << >> + - & @ | < > <= >= == != && || ?

: := *= /= %= <<= >>= += -= &= @= |= && || ( [ .. ..< string

Two operators are joining operators only inside parentheses: commas and semicolons.
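The decision of whether a line break is swallowed can be sketched as a single predicate. In the Python below, joins_next_line and its parameters are hypothetical names, ALWAYS_JOINING lists only some of the operators above, and the caller is assumed to track the last token of the line, the current parenthesis depth, and the top of the indentation stack.

```python
# A subset of the operators that join in all contexts (illustrative only).
ALWAYS_JOINING = {
    "*", "/", "%", "<<", ">>", "+", "-", "&", "@", "|",
    "<", ">", "<=", ">=", "==", "!=", "&&", "||", "?",
    ":", ":=", "(", "[", "..", "..<",
}

# Commas and semicolons join only inside parentheses.
PAREN_ONLY = {",", ";"}


def joins_next_line(last_token, paren_depth, next_indent, stack_top):
    """True if the next line continues the current statement."""
    if next_indent <= stack_top:
        return False            # a continuation must be indented further
    if last_token in ALWAYS_JOINING:
        return True
    return last_token in PAREN_ONLY and paren_depth > 0
```

When the predicate is true, the lexer would skip the LineBreak and Indent tokens and leave the stack untouched; otherwise the ordinary indentation rules apply.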

The following line continuations are legal:

int x:=a+(
          b+c)+ //This line is indented from the first line
       d        //This line is also indented from the first line
String s:=''abc
          ''def
    ''ghi
int[2] values:={
    1, //not a joining operator (not inside parens)
    2, //same
}
void foo() {
    bar(); //not a joining operator (not inside parens)
    bar(); //same
}
int bar() {
    for(a; //ok, the semicolon joins because it is inside parens
        b; //same
        c)
        baz()
    return (a, //ok, the comma joins because it is inside parens
        b)
}

But these line continuations are illegal:

void foo()
{
    int x:=a+(b+
    c)+ //illegal: not indented far enough
d //illegal: not indented far enough
}
int bar()
{
    return a,
        b //Illegal: the comma doesn't join because it isn't inside parens
}

A valid statement cannot end with any of the joining operators, so the parser will never accidentally join a line that you didn't want to join. On the other hand, the parser may decide not to join a line that you intended to join. Consider this example:

1: int a:=123
2:     +456 //Error: statement's indentation is incorrect

Line 1 doesn't end with a joining operator, so the parser will assume that line 2 is the indented body of line 1. Since line 1 can't have a body, the parser will issue an error message.
