Orth Syntax

Introduction
Statements
Bodies
Statements
Selection Statements
ctor and dtor Statements
Function and Aggregate Statements
Comma Expressions
Single Expressions
Attributes
Declarators
Initializer Expressions
Assignment Expressions
Unary Expressions
Postfix Expressions
Primary Expressions
Symbols

Introduction

The goal of the parser is to make C/C++ code easier to write by removing all unnecessary braces and semicolons. Ideally, one should be able to take a well-written, properly indented C++ source file, remove every semicolon, and parse the result correctly in Orth. To this end, I've designed the parser to accept many common C++ coding styles, such as putting a short single-line body on the same line as an if statement. I've also tried to keep the syntax free of ambiguities. The grammar currently contains four ambiguities that aren't important enough to resolve:

if(x) if(y) foo() else bar()         //Does the else belong to the first or second if statement?
try foo() try bar() catch(int) baz() //Does the catch belong to the first or second try statement?
try foo() try bar() finally baz()    //Does the finally belong to the first or second try statement?
if(x) foo() else bar()               //Is else a separate statement belonging to a select statement?

In each case, the parser follows the C/C++ convention of associating the else, catch, or finally with the nearest if or try. An else is a separate statement only if an if does not precede it.
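
The nearest-match rule falls out naturally from a greedy recursive-descent parser. The following Python sketch is not the Orth parser. It assumes a pre-tokenized input in which each condition and each call is a single token, and the name parse_statement and the tuple layout of the result are hypothetical.

```python
def parse_statement(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos)."""
    if tokens[pos] == "if":
        cond = tokens[pos + 1]  # assumption: the whole "(x)" is one token
        body, pos = parse_statement(tokens, pos + 2)
        else_body = None
        # One token of lookahead: a following 'else' is consumed by the
        # innermost 'if' that is still being parsed, i.e. the nearest one.
        if pos < len(tokens) and tokens[pos] == "else":
            else_body, pos = parse_statement(tokens, pos + 1)
        return ("if", cond, body, else_body), pos
    return ("call", tokens[pos]), pos + 1


# if(x) if(y) foo() else bar()
tree, _ = parse_statement(["if", "(x)", "if", "(y)", "foo()", "else", "bar()"])
# The inner if owns the else; the outer if has no else branch.
print(tree)
```

The same greedy shape would handle catch and finally after try, since the innermost unfinished statement always sees the keyword first.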

I've carefully crafted the grammar rules to make the grammar easy to parse with only one lookahead token. In places where two rules overlap, they always share identical nonterminals up to the terminal which distinguishes them. For example, the rule for colon_expr has two productions beginning with assignment_expr. If a colon follows the assignment_expr, it becomes a designator. Otherwise, it becomes an initializer. Of course, the number of semantically valid expressions for the designator is much smaller than the number of syntactically valid expressions, but the compiler can easily reject expressions that don't make sense later.

Labels are an exception to the rule above (i.e., one rule starts with statement_expr and the other starts with Identifier). Nonetheless, labels don't complicate parsing because we can tentatively read an identifier as a label, read the next token, and reinterpret the identifier as a statement_expr if the next token isn't a ::. Reinterpreting a terminal involves jumping through several levels of nonterminals until we reach the point that we would have reached had we read the identifier as a statement_expr.

Aggregate declarations (struct and class) are another exception to the rule. Unlike the label case, we need to look ahead three tokens (the keyword, the identifier, and the left brace) to distinguish an aggregate statement from an aggregate expression.

Another difficulty arises from the optional LineBreak in front of a brace_body. In some contexts, we need to examine the token after the LineBreak to determine whether the LineBreak is part of a body or part of brace_body_lines. Rather than complicating the parser with two lookahead tokens, I opted to merge the LineBreak with the left brace inside the lexer to form a LineBreakLbrace token. The optional LineBreak in several productions for basic_statement presented a similar problem, which I solved exactly the same way. In total, the lexer produces six token pairs: LineBreakCatch, LineBreakElse, LineBreakFinally, LineBreakLbrace, LineBreakRbrace, and LineBreakWhile, so the parser is able to act on a single lookahead token.
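
The merging step can be pictured as a small pass over the token stream. The sketch below is only an illustration under simplifying assumptions: tokens are plain strings, the function name merge_line_breaks is hypothetical, and the real lexer presumably does this while scanning rather than in a separate pass.

```python
# Tokens that fuse with a preceding LineBreak, mapped to the merged token name.
MERGEABLE = {
    "catch": "LineBreakCatch",
    "else": "LineBreakElse",
    "finally": "LineBreakFinally",
    "{": "LineBreakLbrace",
    "}": "LineBreakRbrace",
    "while": "LineBreakWhile",
}


def merge_line_breaks(tokens):
    """Replace each (LineBreak, keyword-or-brace) pair with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if tokens[i] == "LineBreak" and following in MERGEABLE:
            merged.append(MERGEABLE[following])
            i += 2  # both tokens of the pair are consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_line_breaks(["if", "(foo)", "LineBreak", "{", "bar()", "LineBreak", "}"]))
```

After this pass the parser never has to peek past a LineBreak to learn what follows it, because the answer is already encoded in the token it is looking at.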

Statements

statements ::= Ø¹
statements ::= statement LineBreak
statements ::= statement LineBreak statement LineBreak
statements ::= statement LineBreak statement LineBreak statement LineBreak ...¹

¹The empty set (Ø) indicates an empty production.
An ellipsis (...) indicates that the pattern repeats indefinitely.
A plus (+) indicates one or more repetitions of a rule.

Each file contains a series of zero or more statements separated by line breaks.

Bodies

body ::= same_line_body | indented_body | line_break_opt brace_body
same_line_body ::= statement
indented_body ::= LineBreak Indent statement Unindent
brace_body_stmts ::= statement
brace_body_stmts ::= statement ;
brace_body_stmts ::= statement ; statement
brace_body_stmts ::= statement ; statement ;
brace_body_stmts ::= statement ; statement ; ...
brace_body_lines ::= brace_body_stmts
brace_body_lines ::= brace_body_stmts LineBreak brace_body_stmts
brace_body_lines ::= brace_body_stmts LineBreak brace_body_stmts LineBreak ...
top_brace_stmts ::= Ø
top_brace_stmts ::= statement ;
top_brace_stmts ::= statement ; statement ;
top_brace_stmts ::= statement ; statement ; ...
brace_body ::= { line_break_opt }
brace_body ::= { brace_body_stmts line_break_opt }
brace_body ::= { top_brace_stmts LineBreak Indent brace_body_lines Unindent LineBreak }
line_break_opt ::= Ø | LineBreak

Many statements have bodies. A body has one of three forms:

A single statement on the same line.

if(foo) bar()
if(foo) bar() else baz() //the else can go on the same line as the if
if(foo) bar()
else baz()               //or on a separate line as long as it lines up with the if

A single indented statement on the next line. Putting any tokens at the end of the statement, including else, catch, and finally, produces an error.

if(foo)
    bar()
else //ok, else is on a separate line and lines up with the if
    baz()

if(foo)
    bar() else baz() //error: unexpected tokens after bar()
if(foo)
    bar()
    else baz() //error: else doesn't line up with the if

A series of zero or more statements between braces and separated by semicolons and/or line breaks. A left or right brace at the beginning of a line must line up with the statement. If the brace_body spans multiple lines, then the right brace must be the first token on the line.

if(foo)
{
    bar()
    baz()
}
if(foo) {
    bar(); baz()
}
if(foo) { int bar(); //The semicolon is required here.  Without it, the parser would assume
    baz()            //that bar() was a function containing a call to baz().
}
if(foo) { bar(); baz()
}
if(foo) { bar(); baz(); }

void nope(int x)
  { //error: left brace doesn't line up with void
    return 0 } //error: right brace isn't at the beginning of the line
int wrong(int x) {
int y:=x+1 //error: each line of brace_body must be indented
    return x //error: each line of brace_body must have the same indentation
    } //error: right brace doesn't line up with 'int'

There are two ways to write a statement that has no body: a pair of empty braces or the expression null. Strictly speaking, any expression without side effects, such as 0 or true, also works, but the compiler usually warns about such expressions. The compiler understands that null, by itself, is a placeholder for an empty body, so it doesn't generate a warning.

while(foo()) {}

while(foo())
        null

Statements

statement ::= attribute_seq
statement ::= attribute_seq line_break_opt brace_body
statement ::= attribute_seq indented_body
statement ::= attribute_seq_opt basic_statement
statement ::= statement_expr
basic_statement ::= scope body
basic_statement ::= guard body
basic_statement ::= while( LC comma_expr ) body
basic_statement ::= for( LC comma_expr_opt ; LC comma_expr_opt ; LC comma_expr_opt ) body
basic_statement ::= jump_keyword statement_expr_opt
basic_statement ::= include single_expr_list
basic_statement ::= do body line_break_opt while( LC comma_expr )
basic_statement ::= if( LC comma_expr ) body
basic_statement ::= if( LC comma_expr ) body line_break_opt else body
basic_statement ::= try body line_break_opt finally body
basic_statement ::= try body catch+ finally_opt
basic_statement ::= selection_statement
basic_statement ::= ctor_dtor_statement
basic_statement ::= function_aggregate_statement
jump_keyword ::= return | throw | goto | break | continue
catch ::= line_break_opt catch( LC single_expr ) body
finally_opt ::= Ø | line_break_opt finally body
identifier_opt ::= Ø | identifier

Each jump statement (return, throw, goto, break, and continue) has the same syntax to simplify the parser, but the following restrictions apply:

goto requires an identifier
If an expression is present after break, it must be an identifier
If an expression is present after continue, it must be an identifier

Attribute sequences have four forms:

The attribute sequence doesn't have a body.
pragma(dll)

The attribute sequence appears in front of a statement or expression. Similar to a function_aggregate_statement, an attribute sequence that appears in front of the while or else keyword cannot be the same_line_body of another statement. This rule lets the parser immediately match a while with a do without searching the rest of the line and the next line for another while.

shared export int x:=123
shared int foo()
    return 123

do
    label:: while(x!=0) foo() //ok
while(bar())

do label:: while(x!=0) //ok

do label:: while(x!=0) foo() while(bar()) //error: "label:: while" is the same_line_body
                                          //of the do-while statement.  The parser can't
                                          //know that the while doesn't belong to the do
                                          //without arbitrary lookahead.

The attribute sequence appears on the line above an indented statement:

shared export
    int x

The attribute appears in front of or on the line above a left brace:

import foo:: cdecl bar:: pragma(undecorated)
{
    struct S int x
    void foo(S^)
}
import foo:: cdecl bar:: pragma(undecorated) { struct S int x; void foo(S^) } //same as above

The parser places a copy of each "relevant" nonlabel attribute in front of each statement between the braces. An attribute is "relevant" if the compiler accepts it without generating an unused-attribute error. The parser places each label attribute on its own line above the attribute sequence. The compiler silently accepts attributes that aren't relevant to any statements (pragma(undecorated) in this example). This example is syntactically equivalent to:

foo::
bar::
cdecl struct S //only cdecl is relevant here
    int x
import cdecl void foo(S^) //import and cdecl are relevant here

Selection Statements

selection_statement ::= select( LC comma_expr ) body
selection_statement ::= case( LC choice_expr_list ) body
selection_statement ::= else body
choice_expr ::= single_expr
choice_expr ::= single_expr .. LC single_expr
choice_expr ::= single_expr ..< LC single_expr
choice_expr_list ::= choice_expr
choice_expr_list ::= choice_expr , LC choice_expr
choice_expr_list ::= choice_expr , LC choice_expr , LC ...

A select statement has a parenthesized expression and a body. The body must contain only case statements and up to one else statement. The else statement need not be the last statement. A case statement has a parenthesized expression list and a body. The expression list can contain closed intervals (start..end) and half-open intervals (start..<end). Orth uses else instead of default because default is more useful as an identifier. Although else is syntactically ambiguous, it is semantically unambiguous because if-else statements and case-else statements can't coexist within the same scope. Here's an example:

select(abc)
{
        case(1,2)
                foo()
        case(3..4) bar() //abc==3 || abc==4
        case(5..<7) bar() //abc==5 || abc==6
        else baz()
}

I decided to rename switch as select to emphasize that the statement is fundamentally different than the C/C++ switch. A select statement "selects" and executes a single case block rather than jumping to a case label.
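
To make the "select one block" behavior and the two interval forms concrete, here is a small Python model of the example above. It is a sketch, not Orth's implementation: the (start, end, closed) tuple encoding and the function names are hypothetical, and taking the first matching case when cases overlap is an assumption the text does not settle.

```python
def choice_matches(value, choice):
    """A choice is a single value or a (start, end, closed) interval tuple."""
    if isinstance(choice, tuple):
        start, end, closed = choice
        if closed:
            return start <= value <= end  # start..end includes end
        return start <= value < end       # start..<end excludes end
    return value == choice


def select(value, cases, else_body=None):
    """Return the body of the one selected case; there is no fall-through."""
    for choices, body in cases:
        if any(choice_matches(value, c) for c in choices):
            return body  # assumption: the first matching case wins
    return else_body


cases = [
    ([1, 2], "foo"),           # case(1,2)
    ([(3, 4, True)], "bar"),   # case(3..4)
    ([(5, 7, False)], "bar"),  # case(5..<7)
]
print(select(4, cases, "baz"), select(6, cases, "baz"), select(7, cases, "baz"))
```

Because select returns after the first match, nothing resembling a break statement is needed at the end of a case body.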

ctor and dtor Statements

ctor_dtor_statement ::= ctor( LC single_expr_list_opt ) body
ctor_dtor_statement ::= dtor( LC ) body

The return type of a constructor and a destructor is void. Orth allows you to declare constructors and destructors using the same notation as other functions:

void ctor(int x, int y) {}
void dtor() {}

To make the constructor and destructor more apparent, you can omit the return type:

ctor(int x, int y) {}
dtor() {}

Function and Aggregate Statements

function_aggregate_statement ::= assignment_expr decl_symbol ( LC single_expr_list_opt ) nonbrace_body
function_aggregate_statement ::= aggregate_keyword anon_or_identifier nonbrace_body
aggregate_keyword ::= struct | class
nonbrace_body ::= same_line_body | indented_body
anon_or_identifier ::= anon | identifier

A function/aggregate statement declares a one-line function/aggregate using the same syntax as a statement with one exception: A same_line_body that belongs to another same_line_body cannot begin with a while or else statement. This restriction makes it easier for the parser to distinguish between variable and function declarations in the middle of a do-while statement or if-else statement. You will never experience this restriction in practice unless you're writing very obscure code.

Declaring a function/aggregate using braces produces a function_expr or aggregate_expr.

int foo() return 123 //ok, foo() contains a single statement
struct X
    int a //ok, X has a single member named a
void foo() while(x) bar() //ok, foo() doesn't belong to another same_line_body
do int foo(x) while(y) //ok, there is a variable named foo inside a do-while statement
do int foo(x) while(y) bar() while(z) //error: foo() belongs to another same_line_body and
                                      //contains a same_line_body beginning with 'while'
do int foo(x)
    while(y) bar() //ok
while(z)
do
    int foo(x) while(y) bar() //also ok
while(z)

Comma Expressions

statement_expr ::= attribute_seq_opt assignment_expr
statement_expr ::= variable_expr
statement_expr ::= function_expr
comma_expr ::= attribute_seq_opt assignment_expr
comma_expr ::= attribute_seq_opt assignment_expr , LC assignment_expr
comma_expr ::= attribute_seq_opt assignment_expr , LC assignment_expr , LC ...
comma_expr ::= variable_expr
comma_expr ::= function_expr
variable_expr ::= attribute_seq_opt assignment_expr variable_declarator
variable_expr ::= attribute_seq_opt assignment_expr variable_declarator , LC variable_declarator
variable_expr ::= attribute_seq_opt assignment_expr variable_declarator , LC variable_declarator , LC ...
function_expr ::= attribute_seq_opt assignment_expr function_declarator
statement_expr_opt ::= Ø | statement_expr
comma_expr_opt ::= Ø | comma_expr

A comma expression consists of an attribute sequence followed by an assignment expression followed by an optional comma or declarator list. If a comma follows the assignment expression, then the parser interprets the comma as the side-effect operator. That is, it evaluates the expression on the left-hand side of the comma, discards the result, and then evaluates the expression on the right-hand side. If a declarator list follows the assignment expression, then the comma expression is a declaration. The declaration can declare one or more variables (if the first declarator is a variable_declarator) or a single function (if the first declarator is a function_declarator).

A statement expression is very similar to a comma expression. The statement expression occurs outside parentheses whereas the comma expression occurs inside parentheses. Its purpose is to simplify the line-continuation rules related to commas. See the examples here.

Single Expressions

single_expr ::= attribute_seq_opt assignment_expr
single_expr ::= attribute_seq_opt assignment_expr variable_declarator
single_expr ::= attribute_seq_opt assignment_expr function_declarator
single_expr_list ::= single_expr
single_expr_list ::= single_expr , LC single_expr
single_expr_list ::= single_expr , LC single_expr , LC ...
single_expr_list_opt ::= Ø | single_expr_list

A single expression is a more restrictive form of the comma expression. It typically occurs in places where the comma expression would be ambiguous (e.g., a comma-delimited list).

Attributes

attribute ::= parameterized_attribute ( LC single_expr_list )
attribute ::= label
attribute ::= const
attribute ::= shared
attribute ::= cdecl
attribute ::= stdcall
attribute ::= shadow
attribute ::= export
attribute ::= import
attribute ::= inout
attribute ::= out
label ::= identifier ::
parameterized_attribute ::= alignas | pragma
attribute_seq ::= attribute+
attribute_seq_opt ::= Ø | attribute_seq

A label serves two purposes: it can be the destination of a goto statement or the label of a for, while, or do-while statement. A label modifies the next statement in the same scope that is not an attribute. In the examples below, notice that the modified statement may or may not be on the same line as the label.

retry:: //retry modifies foo() even though foo() isn't the body of retry
retry2:: foo() //retry2 also modifies foo() and foo() is the body of retry2
if(bar())
        goto retry //retry2 would work too

outer1:: outer2:: //outer1 and outer2 both modify the outer loop
outer3:: //outer3 also modifies the outer loop
for(int i:=0;i<10;++i)
        inner1:: inner2:: for(int j:=0;j<10;++j) //inner1 and inner2 modify the inner loop
                if(foo())
                        break outer1 //outer2 and outer3 would work too

Unlike C/C++, labels require two colons to simplify line continuation rules and to make label declarations more visible to the context tagger.
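
The binding rule (a label modifies the next statement in the same scope that is not an attribute) can be modeled in a few lines. This Python sketch assumes the scope has already been split into (kind, text) items. That item encoding and the name bind_labels are hypothetical and only serve the illustration.

```python
def bind_labels(items):
    """Map each label name to the statement it modifies within one scope.

    items is a sequence of (kind, text) pairs, where kind is "label",
    "stmt", or any other string standing for a non-label attribute.
    """
    bound = {}
    pending = []
    for kind, text in items:
        if kind == "label":
            pending.append(text)
        elif kind == "stmt":
            # Every label seen since the previous statement binds here.
            for name in pending:
                bound[name] = text
            pending.clear()
        # Other attributes are skipped: they neither receive nor consume labels.
    return bound


scope = [
    ("label", "outer1"), ("label", "outer2"),
    ("label", "outer3"),
    ("stmt", "for(int i:=0;i<10;++i)"),
]
print(bind_labels(scope))
```

All three labels end up naming the same loop, which is why break outer1, break outer2, and break outer3 are interchangeable in the example above.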

Declarators

variable_declarator ::= decl_symbol
variable_declarator ::= decl_symbol := LC assignment_expr
variable_declarator ::= decl_symbol := LC brace_expr
variable_declarator ::= decl_symbol ( LC single_expr_list_opt )
function_declarator ::= decl_symbol ( LC single_expr_list_opt ) line_break_opt brace_body

Orth declarators are simpler than C/C++ because they don't contain any type operators like * and []. An Orth declarator consists of a symbol followed by an optional variable initializer, variable argument list, or function parameter list and body.

Initializer Expressions

colon_expr ::= assignment_expr
colon_expr ::= brace_expr
colon_expr ::= designator : LC assignment_expr
colon_expr ::= designator : LC brace_expr
designator ::= assignment_expr
designator ::= assignment_expr .. LC assignment_expr
designator ::= assignment_expr ..< LC assignment_expr
designator ::= ..
brace_expr_entries ::= colon_expr
brace_expr_entries ::= colon_expr ,
brace_expr_entries ::= colon_expr , colon_expr
brace_expr_entries ::= colon_expr , colon_expr ,
brace_expr_entries ::= colon_expr , colon_expr , ...
brace_expr_lines ::= brace_expr_entries
brace_expr_lines ::= brace_expr_entries LineBreak brace_expr_entries
brace_expr_lines ::= brace_expr_entries LineBreak brace_expr_entries LineBreak ...
top_brace_exprs ::= Ø
top_brace_exprs ::= colon_expr ,
top_brace_exprs ::= colon_expr , colon_expr ,
top_brace_exprs ::= colon_expr , colon_expr , ...
brace_expr ::= { line_break_opt }
brace_expr ::= { brace_expr_entries line_break_opt }
brace_expr ::= { top_brace_exprs LineBreak Indent brace_expr_lines Unindent LineBreak }

The syntax for brace_expr is identical to brace_body except that commas replace the semicolons. The parser automatically replaces line breaks with commas to make code maintenance easier. For example, the following declarations are equivalent:

X x:={"1","2","3",}
X x:={"1","2","3"
}
X x:={
    "1","2","3"
}
X x:=
{
    "1","2","3",
}
X x:={"1", //The comma is required here
    "2"    //Without it, the parser would concatenate "1" and "2"
    "3"
}
X x:=
{
    "1","2"
    "3"
}

These declarations are illegal:


X x:={1,,2} //Error: unexpected comma
X x:={
    1,2,3} //Error: right brace must be the first token on the line if the brace_expr spans
           //multiple lines
X x:={
     } //Error: right brace doesn't line up with 'X'
X x:={1 //Error: missing comma
    2
    3}
X x:={
1,2} //Error: each line of brace_expr must be indented from the first
X x:={
    1,2
       3 //Error: each line of brace_expr must have the same indentation
}

Initializers are the same as C/C++ except for the colon syntax. The expression to the left of the colon is a constant integer or range for array initializers and a symbol for aggregate initializers. The range can be closed (start..end) or half-open (start..<end). A third option is the unbounded range operator, .., which initializes the remaining elements of an array or the remaining members of a structure.

If a left brace immediately follows a designator, then you should place it on the same line to avoid an unexpected indentation error:

X x:={0: //The parser merges this line with the next because it ends with a colon
    {
        1
        2
    } //Error: this line isn't indented properly
}

X x:={ //Ok
    0:{
        1
        2
    }
}

Assignment Expressions

assignment_expr ::= unary_expr
assignment_expr ::= assignment_expr binary_op LC assignment_expr
assignment_expr ::= assignment_expr ? LC assignment_expr : LC assignment_expr
binary_op ::= any operator from the operator-precedence table

Assignment expressions associate according to the following operator-precedence table. Operators in each red or blue box have the same precedence. The ternary operator behaves like a binary operator with a parenthesized expression in the middle.

Operator	Associativity	Meaning
*	left to right	multiply
/	left to right	divide
%	left to right	remainder
<<	left to right	shift left
>>	left to right	shift right
+	left to right	add
-	left to right	subtract
&	left to right	bitwise and
@	left to right	bitwise xor
|	left to right	bitwise or
<	left to right	less
>	left to right	greater
<=	left to right	less than or equal
>=	left to right	greater than or equal
==	left to right	equal
!=	left to right	not equal
&&	left to right	and
||	left to right	or
?:	right to left	conditional
:=	right to left	assign
*=	right to left	assign multiply
/=	right to left	assign divide
%=	right to left	assign remainder
<<=	right to left	assign shift left
>>=	right to left	assign shift right
+=	right to left	assign add
-=	right to left	assign subtract
&=	right to left	assign bitwise and
@=	right to left	assign bitwise xor
|=	right to left	assign bitwise or

Notice that the precedence of <<, >>, &, @, and | is higher than C/C++.
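
One practical consequence: in C and C++, a & b == c groups as a & (b == c), whereas the table above places & above ==, so Orth groups it as (a & b) == c. The Python sketch below shows this with a minimal precedence-climbing parser. It is not the Orth parser. The numeric levels are assumptions chosen only to respect the top-to-bottom order of the table for a handful of left-to-right operators, and operands are limited to bare identifiers.

```python
# ASSUMPTION: the level numbers are illustrative. Only the relative order
# (& above == and !=, which are above &&) is taken from the table.
PRECEDENCE = {"&": 3, "==": 2, "!=": 2, "&&": 1}


def parse_expr(tokens, pos=0, min_level=1):
    """Precedence climbing over left-to-right binary operators.

    Returns (tree, next_pos), where tree is an identifier string or an
    (operator, left, right) tuple.
    """
    lhs, pos = tokens[pos], pos + 1  # operand: a bare identifier
    while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_level:
        op = tokens[pos]
        # Left to right: the right operand may only absorb tighter operators.
        rhs, pos = parse_expr(tokens, pos + 1, PRECEDENCE[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs, pos


tree, _ = parse_expr("a & b == c".split())
print(tree)  # ('==', ('&', 'a', 'b'), 'c')
```

The right-to-left assignment operators and the ternary operator are left out of the sketch; they would recurse with the operator's own level rather than one level higher.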

Unary Expressions

unary_expr ::= postfix_expr
unary_expr ::= ++ unary_expr
unary_expr ::= -- unary_expr
unary_expr ::= & unary_expr
unary_expr ::= - unary_expr
unary_expr ::= ~ unary_expr
unary_expr ::= ! unary_expr

Postfix Expressions

postfix_expr ::= primary_expr
postfix_expr ::= postfix_expr ( LC single_expr_list_opt )
postfix_expr ::= postfix_expr [ LC single_expr_list_opt ]
postfix_expr ::= postfix_expr cdecl( LC single_expr_list_opt )
postfix_expr ::= postfix_expr stdcall( LC single_expr_list_opt )
postfix_expr ::= postfix_expr brace_expr¹
postfix_expr ::= postfix_expr . access_symbol
postfix_expr ::= postfix_expr ++
postfix_expr ::= postfix_expr --
postfix_expr ::= postfix_expr ^

¹the opening brace must be on the same line as the postfix_expr to avoid a syntactic ambiguity that occurs when the postfix_expr is inside another brace_expr.

Primary Expressions

primary_expr ::= ( LC comma_expr )
primary_expr ::= aggregate_expr
primary_expr ::= strings
primary_expr ::= number
primary_expr ::= character
primary_expr ::= access_symbol
primary_expr ::= true
primary_expr ::= false
primary_expr ::= bit
primary_expr ::= bool
primary_expr ::= char
primary_expr ::= wchar
primary_expr ::= dchar
primary_expr ::= byte
primary_expr ::= ubyte
primary_expr ::= short
primary_expr ::= ushort
primary_expr ::= int
primary_expr ::= uint
primary_expr ::= long
primary_expr ::= ulong
primary_expr ::= single
primary_expr ::= double
primary_expr ::= void
primary_expr ::= null
primary_expr ::= unreachable
primary_expr ::= uninit
primary_expr ::= auto
primary_expr ::= typeof
primary_expr ::= sizeof
primary_expr ::= alignof
primary_expr ::= bitcast
primary_expr ::= construct
primary_expr ::= destruct
primary_expr ::= typedef
aggregate_expr ::= aggregate_keyword anon_or_identifier_opt brace_body
strings ::= string
strings ::= string LC string
strings ::= string LC string LC ...
anon_or_identifier_opt ::= Ø | anon | identifier

Declaring an aggregate without a symbol is equivalent to declaring an aggregate as anon. The parser interprets sizeof(expression) as a call to sizeof for simplicity's sake. The parser automatically concatenates adjacent strings.
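
Adjacent-string concatenation is what makes the comma after "1" significant in the brace_expr examples earlier. As a rough illustration, the Python sketch below merges neighboring string tokens in a token list. It assumes tokens are plain strings, that a string token is anything wrapped in double quotes, and that escape sequences can be ignored; the function name is hypothetical.

```python
def is_string_token(tok):
    """Assumption: a string literal token is wrapped in double quotes."""
    return len(tok) >= 2 and tok.startswith('"') and tok.endswith('"')


def concat_adjacent_strings(tokens):
    """Merge each run of adjacent string tokens into a single string token."""
    out = []
    for tok in tokens:
        if is_string_token(tok) and out and is_string_token(out[-1]):
            # Drop the closing quote of the previous token and the
            # opening quote of this one, then join the contents.
            out[-1] = out[-1][:-1] + tok[1:]
        else:
            out.append(tok)
    return out


# {"1" "2", "3"}: without a comma, "1" and "2" collapse into one string.
print(concat_adjacent_strings(['{', '"1"', '"2"', ',', '"3"', '}']))
```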

A pair of parentheses allows you to inject a variable or function declaration into any expression:

int^ p:=&(int x) //Simultaneously declare x and use its address to initialize p
int s:=(int sqr(int x) { return x*x })(12) //Simultaneously declare and call sqr()

A typedef declaration doesn't appear in the grammar because it overlaps a variable declaration. Namely, declaring a variable whose type is typedef defines an alias for the type that appears after the :=. Unlike other variable declarations, a typedef declaration must have an initializer. Declaring two types with the same declaration is legal:

typedef x:=int
typedef y //invalid: no initializer
typedef x:=int,y:=double

Symbols

access_symbol ::= identifier
access_symbol ::= operator( LC identifier )
access_symbol ::= this
access_symbol ::= outer
decl_symbol ::= identifier
decl_symbol ::= operator( LC identifier )
decl_symbol ::= anon
decl_symbol ::= ctor
decl_symbol ::= dtor
decl_symbol_opt ::= Ø | decl_symbol

Orth replaces the C++ operator syntax with the more versatile operator(name) syntax. For example, operator(add) replaces operator+ and operator(postincrement) replaces operator++. The parser places no restrictions on the identifier inside the parens.
