The goal of the parser is to make C/C++ code easier to write by removing all unnecessary braces and semicolons.
Ideally, one should be able to take a well-written, properly indented C++ source file, remove every semicolon,
and parse the result correctly in Orth. To this end, I've designed the parser to accept many
common C++ coding styles, such as putting a short single-line body on the same line as an if statement.
I've also tried to keep the syntax free of ambiguities. The grammar currently contains four ambiguities that aren't important enough to resolve:
if(x) if(y) foo() else bar()          //Does the else belong to the first or second if statement?
try foo() try bar() catch(int) baz()  //Does the catch belong to the first or second try statement?
try foo() try bar() finally baz()     //Does the finally belong to the first or second try statement?
if(x) foo() else bar()                //Is else a separate statement belonging to a select statement?
In each case, the parser follows the C/C++ convention of associating the else, catch, or finally with the nearest if or try. An else is a separate statement only if an if does not precede it.
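The nearest-if convention can be sketched with a tiny recursive-descent parser. This is a Python illustration rather than Orth's actual parser; the toy grammar, the flat token list, and the tuple shapes are all assumptions made for the example.

```python
def parse_statement(tokens, pos=0):
    """Parse a toy statement and return (tree, next_pos).

    Toy grammar (hypothetical, for illustration only):
        statement ::= 'if' NAME statement ['else' statement] | NAME
    """
    if tokens[pos] == "if":
        cond = tokens[pos + 1]
        then_branch, pos = parse_statement(tokens, pos + 2)
        else_branch = None
        # Greedily take an 'else' if it comes next: this binds the else
        # to the innermost (nearest) unfinished if.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_statement(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    return ("call", tokens[pos]), pos + 1


# Corresponds to: if(x) if(y) foo() else bar()
tree, _ = parse_statement(["if", "x", "if", "y", "foo", "else", "bar"])
print(tree)
```

The else ends up inside the inner if node, and the outer if has no else branch.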
I've carefully crafted the grammar rules to make the grammar easy to parse with only one lookahead token. In places where two rules overlap, they always share identical nonterminals up to the terminal which distinguishes them. For example, the rule for colon_expr has two productions beginning with assignment_expr. If a colon follows the assignment_expr, it becomes a designator. Otherwise, it becomes an initializer. Of course, the number of semantically valid expressions for the designator is much smaller than the number of syntactically valid expressions, but the compiler can easily reject expressions that don't make sense later.
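As a rough sketch of the colon_expr decision described above, the Python function below reads the shared prefix first and only then inspects one lookahead token. It assumes, purely for brevity, that an assignment_expr is a single token; the function name and the result tuples are hypothetical.

```python
def parse_colon_expr(tokens):
    """Classify one brace_expr entry using a single lookahead token.

    Simplification (assumption): an assignment_expr is a single token here.
    Returns ("designated", designator, value) or ("initializer", value).
    """
    first = tokens[0]                                   # shared prefix: assignment_expr
    lookahead = tokens[1] if len(tokens) > 1 else None  # the one lookahead token
    if lookahead == ":":                                # the distinguishing terminal
        return ("designated", first, tokens[2])
    return ("initializer", first)


print(parse_colon_expr(["0", ":", "x"]))
print(parse_colon_expr(["x"]))
```

Because both productions start with the same nonterminal, the parser never has to back up: the choice is deferred until the colon is either seen or not.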
Labels are an exception to the rule above: one rule starts with statement_expr and the other starts with Identifier.
Nonetheless, labels don't complicate parsing because we can tentatively read an identifier as a label, read the next token,
and reinterpret the identifier as a statement_expr if the next token isn't a ::. Reinterpreting a terminal
involves jumping through several levels of nonterminals until we reach the point that we would have reached had
we read the identifier as a statement_expr.
Aggregate declarations (struct and class) are another exception to the rule. Unlike labels, they require three tokens of lookahead
to distinguish an aggregate statement from an aggregate expression (the keyword, the identifier, and the left brace).
Another difficulty arises from the optional LineBreak in front of a brace_body. In some contexts, we need to examine the token after the LineBreak to determine whether the LineBreak is part of a body or part of brace_body_lines. Rather than complicating the parser with two lookahead tokens, I opted to merge the LineBreak with the left brace inside the lexer to form a LineBreakLbrace token. The optional LineBreak in several productions for basic_statement presented a similar problem, which I solved exactly the same way. In total, the lexer produces six merged tokens (LineBreakCatch, LineBreakElse, LineBreakFinally, LineBreakLbrace, LineBreakRbrace, and LineBreakWhile), so the parser is able to act on a single lookahead token.
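The merging step can be sketched as a small pass over the token stream. This Python version is only an illustration: it assumes tokens are plain strings and that the merge is a post-processing pass, neither of which the text specifies. The merged token names follow the list above.

```python
# Tokens that fuse with a preceding LineBreak (names taken from the text).
MERGED = {
    "catch": "LineBreakCatch",
    "else": "LineBreakElse",
    "finally": "LineBreakFinally",
    "{": "LineBreakLbrace",
    "}": "LineBreakRbrace",
    "while": "LineBreakWhile",
}


def merge_line_breaks(tokens):
    """Fuse each LineBreak with the token after it when that token is in MERGED."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == "LineBreak" and nxt in MERGED:
            out.append(MERGED[nxt])   # one merged token replaces two
            i += 2
        else:
            out.append(tok)
            i += 1
    return out


print(merge_line_breaks(["if", "x", "LineBreak", "{", "foo", "LineBreak", "}"]))
```

After this pass the parser sees a single LineBreakLbrace token where it would otherwise need to look past the LineBreak to find the brace.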
statements ::= Ø ¹
statements ::= statement LineBreak
statements ::= statement LineBreak statement LineBreak
statements ::= statement LineBreak statement LineBreak
statement LineBreak ... ¹
¹ Notation: the empty set (Ø) indicates an empty production, an ellipsis (...) indicates that the pattern repeats indefinitely, and a plus (+) indicates one or more repetitions of a rule.
Each file contains a series of zero or more statements separated by line breaks.
body ::= same_line_body | indented_body | line_break_opt brace_body
same_line_body ::= statement
indented_body ::= LineBreak Indent statement Unindent
brace_body_stmts ::= statement
brace_body_stmts ::= statement ;
brace_body_stmts ::= statement ; statement
brace_body_stmts ::= statement ; statement ;
brace_body_stmts ::= statement ; statement ; ...
brace_body_lines ::= brace_body_stmts
brace_body_lines ::= brace_body_stmts LineBreak brace_body_stmts
brace_body_lines ::= brace_body_stmts LineBreak brace_body_stmts LineBreak ...
top_brace_stmts ::= Ø
top_brace_stmts ::= statement ;
top_brace_stmts ::= statement ; statement ;
top_brace_stmts ::= statement ; statement ; ...
brace_body ::= { line_break_opt }
brace_body ::= { brace_body_stmts line_break_opt }
brace_body ::= { top_brace_stmts LineBreak Indent brace_body_lines
Unindent LineBreak }
line_break_opt ::= Ø | LineBreak
Many statements have bodies. A body has one of three forms:
A single statement on the same line.
if(foo) bar()
if(foo) bar() else baz() //the else can go on the same line as the if
if(foo) bar()
else baz()               //or on a separate line as long as it lines up with the if
A single indented statement on the next line. Putting any tokens at the end of the statement, including else, catch, and finally, produces an error.
if(foo)
    bar()
else //ok, else is on a separate line and lines up with the if
    baz()
if(foo)
    bar() else baz() //error: unexpected tokens after bar()
if(foo)
    bar()
    else baz() //error: else doesn't line up with the if
A series of zero or more statements between braces and separated by semicolons and/or line breaks. A left or right brace at the beginning of a line must line up with the statement. If the brace_body spans multiple lines, then the right brace must be the first token on the line.
if(foo)
{
    bar()
    baz()
}
if(foo) {
    bar(); baz()
}
if(foo) { int bar(); //The semicolon is required here. Without it, the parser would assume
    baz()            //that bar() was a function containing a call to baz().
}
if(foo) { bar(); baz()
}
if(foo) { bar(); baz(); }
void nope(int x)
    { //error: left brace doesn't line up with void
    return 0 } //error: right brace isn't at the beginning of the line
int wrong(int x) {
int y:=x+1 //error: each line of brace_body must be indented
        return x //error: each line of brace_body must have the same indentation
    } //error: right brace doesn't line up with 'int'
There are two ways to define a statement that has no body: a pair of empty braces or the expression null.
Actually, you can use any expression that lacks side effects, like 0 or true,
but the compiler usually generates warnings for these expressions because they lack side effects.
The compiler understands that null, by itself, is a placeholder for a statement that lacks a body, so it doesn't generate a warning.
while(foo()) {}
while(foo()) null
statement ::= attribute_seq
statement ::= attribute_seq line_break_opt brace_body
statement ::= attribute_seq indented_body
statement ::= attribute_seq_opt basic_statement
statement ::= statement_expr
basic_statement ::= scope body
basic_statement ::= guard body
basic_statement ::= while( LC comma_expr ) body
basic_statement ::= for( LC comma_expr_opt ;
LC comma_expr_opt ;
LC comma_expr_opt ) body
basic_statement ::= jump_keyword statement_expr_opt
basic_statement ::= include single_expr_list
basic_statement ::= do body line_break_opt
while( LC comma_expr )
basic_statement ::= if( LC comma_expr ) body
basic_statement ::= if( LC comma_expr ) body
line_break_opt else body
basic_statement ::= try body line_break_opt finally body
basic_statement ::= try body catch+ finally_opt
basic_statement ::= selection_statement
basic_statement ::= ctor_dtor_statement
basic_statement ::= function_aggregate_statement
jump_keyword ::= return | throw | goto | break | continue
catch ::= line_break_opt catch( LC single_expr ) body
finally_opt ::= Ø | line_break_opt finally body
identifier_opt ::= Ø | identifier
Each jump statement (return, throw, goto, break, and continue) has the same syntax
to simplify the parser, but the following restrictions apply:
goto requires an identifier.
If an expression follows break, it must be an identifier.
If an expression follows continue, it must be an identifier.
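Since the grammar accepts any statement_expr after a jump keyword, these restrictions have to be checked after parsing. The Python sketch below shows one way such a check might look, assuming that goto always needs an identifier while break and continue take an optional one. The function name, the regular expression, and the use of source text for the operand are all hypothetical.

```python
import re

# Rough identifier pattern (assumption; Orth's exact rules aren't given here).
IDENTIFIER = re.compile(r"[A-Za-z_]\w*\Z")


def check_jump(keyword, operand=None):
    """Return an error message for an invalid jump statement, else None.

    `operand` is the source text of the optional statement_expr.
    """
    is_identifier = operand is not None and IDENTIFIER.match(operand) is not None
    if keyword == "goto" and not is_identifier:
        return "goto requires an identifier"
    if keyword in ("break", "continue") and operand is not None and not is_identifier:
        return keyword + ": operand must be an identifier"
    return None  # return and throw accept any expression, or none


print(check_jump("goto"))
print(check_jump("break", "outer1"))
print(check_jump("continue", "x+1"))
```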
Attribute sequences have four forms:
The attribute sequence doesn't have a body.
pragma(dll)
The attribute sequence appears in front of a statement or expression. Similar to a function_aggregate_statement,
an attribute sequence that appears in front of the while or else keyword cannot be the same_line_body of another statement.
This rule lets the parser immediately match a while with a do without searching the rest of the line and the next line for another while.
shared export int x:=123
shared int foo()
    return 123
do
    label:: while(x!=0) foo() //ok
while(bar())
do label:: while(x!=0) //ok
do label:: while(x!=0) foo() while(bar()) //error: "label:: while" is the same_line_body
                                          //of the do-while statement. The parser can't
                                          //know that the while doesn't belong to the do
                                          //without arbitrary lookahead.
The attribute sequence appears on the line above an indented statement:
shared export
    int x
The attribute appears in front of or on the line above a left brace:
import foo:: cdecl bar:: pragma(undecorated)
{
    struct S
        int x
    void foo(S^)
}
import foo:: cdecl bar:: pragma(undecorated) { struct S int x; void foo(S^) } //same as above
The parser places a copy of each "relevant" nonlabel attribute in front of each statement between the braces. An attribute is
"relevant" if the compiler accepts it without generating an unused-attribute error. The parser places each label attribute on
its own line above the attribute sequence. The compiler silently accepts attributes that aren't relevant to any statements
(pragma(undecorated) in this example). This example is syntactically equivalent to:
foo::
bar::
cdecl struct S            //only cdecl is relevant here
    int x
import cdecl void foo(S^) //import and cdecl are relevant here
selection_statement ::= select( LC comma_expr ) body
selection_statement ::= case( LC choice_expr_list ) body
selection_statement ::= else body
choice_expr ::= single_expr
choice_expr ::= single_expr .. LC single_expr
choice_expr ::= single_expr ..< LC single_expr
choice_expr_list ::= choice_expr
choice_expr_list ::= choice_expr , LC choice_expr
choice_expr_list ::= choice_expr , LC choice_expr , LC ...
A select statement has a parenthesized expression and a body. The body must contain only case statements and up to one else statement.
The else statement need not be the last statement.
A case statement has a parenthesized expression list and a body. The expression list can contain closed intervals (start..end)
and half-open intervals (start..<end).
Orth uses else instead of default because default is more useful as an identifier. Although else is syntactically
ambiguous, it is semantically unambiguous because if-else statements and case-else statements can't coexist within the same
scope. Here's an example:
select(abc)
{
    case(1,2) foo()
    case(3..4) bar()  //abc==3 || abc==4
    case(5..<7) bar() //abc==5 || abc==6
    else baz()
}
I decided to rename switch as select to emphasize that the statement is fundamentally different from the C/C++ switch.
A select statement "selects" and executes a single case block rather than jumping to a case label.
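To make the interval semantics concrete, here is a small Python model of how a select might pick a case. It is only a sketch: the tuple encoding of ranges, the function names, and the first-match search order are assumptions, and it presumes that the cases don't overlap.

```python
def matches(value, choice):
    """Test one choice_expr: a plain value, ("..", lo, hi), or ("..<", lo, hi)."""
    if isinstance(choice, tuple):
        op, lo, hi = choice
        if op == "..":       # closed interval: both ends included
            return lo <= value <= hi
        if op == "..<":      # half-open interval: hi excluded
            return lo <= value < hi
        raise ValueError("unknown range operator: " + op)
    return value == choice


def select(value, cases, otherwise=None):
    """Return the body of the single matching case, or the else body."""
    for choices, body in cases:
        if any(matches(value, c) for c in choices):
            return body
    return otherwise


# Mirrors case(1,2), case(3..4), and case(5..<7) from the example above.
CASES = [
    ([1, 2], "foo"),
    ([("..", 3, 4)], "bar"),
    ([("..<", 5, 7)], "bar"),
]
print([select(v, CASES, "baz") for v in (2, 4, 6, 7)])
```

Exactly one body is chosen per value, and 7 falls through to the else body because the half-open interval excludes its upper bound.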
ctor and dtor Statements
ctor_dtor_statement ::= ctor( LC single_expr_list_opt ) body
ctor_dtor_statement ::= dtor( LC ) body
The return type of a constructor and a destructor is void. Orth allows you to declare constructors and destructors
using the same notation as other functions:
void ctor(int x, int y) {}
void dtor() {}
To make the constructor and destructor more apparent, you can omit the return type:
ctor(int x, int y) {}
dtor() {}
function_aggregate_statement ::= assignment_expr decl_symbol ( LC
single_expr_list_opt ) nonbrace_body
function_aggregate_statement ::= aggregate_keyword anon_or_identifier nonbrace_body
aggregate_keyword ::= struct | class
nonbrace_body ::= same_line_body | indented_body
anon_or_identifier ::= anon | identifier
A function/aggregate statement declares a one-line function/aggregate using the same syntax as a statement with one exception:
A same_line_body that belongs to another same_line_body cannot begin with a while or else statement.
This restriction makes it easier for the parser to distinguish between variable and function declarations in the middle of a do-while
statement or if-else statement. You will never experience this restriction in practice unless you're writing very obscure code.
Declaring a function/aggregate using braces produces a function_expr or aggregate_expr.
int foo() return 123 //ok, foo() contains a single statement
struct X
    int a //ok, X has a single member named a
void foo() while(x) bar() //ok, foo() doesn't belong to another same_line_body
do int foo(x) while(y) //ok, there is a variable named foo inside a do-while statement
do int foo(x) while(y) bar() while(z) //error: foo() belongs to another same_line_body and
                                      //contains a same_line_body beginning with 'while'
do int foo(x)
    while(y) bar() //ok
while(z)
do
    int foo(x) while(y) bar() //also ok
while(z)
statement_expr ::= attribute_seq_opt assignment_expr
statement_expr ::= variable_expr
statement_expr ::= function_expr
comma_expr ::= attribute_seq_opt assignment_expr
comma_expr ::= attribute_seq_opt assignment_expr ,
LC assignment_expr
comma_expr ::= attribute_seq_opt assignment_expr ,
LC assignment_expr , LC ...
comma_expr ::= variable_expr
comma_expr ::= function_expr
variable_expr ::= attribute_seq_opt assignment_expr variable_declarator
variable_expr ::= attribute_seq_opt assignment_expr variable_declarator ,
LC variable_declarator
variable_expr ::= attribute_seq_opt assignment_expr variable_declarator ,
LC variable_declarator ,
LC ...
function_expr ::= attribute_seq_opt assignment_expr function_declarator
statement_expr_opt ::= Ø | statement_expr
comma_expr_opt ::= Ø | comma_expr
A comma expression consists of an optional attribute sequence followed by an assignment expression followed by an optional comma or declarator list. If a comma follows the assignment expression, then the parser interprets the comma as the side-effect operator. That is, it evaluates the expression on the left-hand side of the comma, discards the result, and then evaluates the expression on the right-hand side. If a declarator list follows the assignment expression, then the comma expression is a declaration. The declaration can declare one or more variables (if the first declarator is a variable_declarator) or a single function (if the first declarator is a function_declarator).
A statement expression is very similar to a comma expression. The statement expression occurs outside parentheses whereas the comma expression occurs inside parentheses. Its purpose is to simplify the line-continuation rules related to commas. See the examples here.
single_expr ::= attribute_seq_opt assignment_expr
single_expr ::= attribute_seq_opt assignment_expr variable_declarator
single_expr ::= attribute_seq_opt assignment_expr function_declarator
single_expr_list ::= single_expr
single_expr_list ::= single_expr , LC single_expr
single_expr_list ::= single_expr , LC single_expr , LC ...
single_expr_list_opt ::= Ø | single_expr_list
A single expression is a more restrictive form of the comma expression. It typically occurs in places where the comma expression would be ambiguous (e.g., a comma-delimited list).
attribute ::= parameterized_attribute ( LC single_expr_list )
attribute ::= label
attribute ::= const
attribute ::= shared
attribute ::= cdecl
attribute ::= stdcall
attribute ::= shadow
attribute ::= export
attribute ::= import
attribute ::= inout
attribute ::= out
label ::= identifier ::
parameterized_attribute ::= alignas | pragma
attribute_seq ::= attribute+
attribute_seq_opt ::= Ø | attribute_seq
A label serves two purposes: it can be the destination of a goto statement or the label of a for, while, or do-while statement. A label modifies the
next statement in the same scope that is not an attribute. In the examples below, notice that the modified statement may or may not be on the same line as the label.
retry::               //retry modifies foo() even though foo() isn't the body of retry
retry2:: foo()        //retry2 also modifies foo() and foo() is the body of retry2
if(bar()) goto retry  //retry2 would work too
outer1:: outer2::     //outer1 and outer2 both modify the outer loop
outer3::              //outer3 also modifies the outer loop
for(int i:=0;i<10;++i)
    inner1:: inner2:: for(int j:=0;j<10;++j) //inner1 and inner2 modify the inner loop
        if(foo()) break outer1               //outer2 and outer3 would work too
Unlike C/C++, labels require two colons to simplify line continuation rules and to make label declarations more visible to the context tagger.
variable_declarator ::= decl_symbol
variable_declarator ::= decl_symbol := LC assignment_expr
variable_declarator ::= decl_symbol := LC brace_expr
variable_declarator ::= decl_symbol ( LC single_expr_list_opt )
function_declarator ::= decl_symbol ( LC
single_expr_list_opt ) line_break_opt brace_body
Orth declarators are simpler than C/C++ because they don't contain any type operators like * and [].
An Orth declarator consists of a symbol followed by an optional variable initializer, variable argument list,
or function parameter list and body.
colon_expr ::= assignment_expr
colon_expr ::= brace_expr
colon_expr ::= designator : LC assignment_expr
colon_expr ::= designator : LC brace_expr
designator ::= assignment_expr
designator ::= assignment_expr .. LC assignment_expr
designator ::= assignment_expr ..< LC assignment_expr
designator ::= ..
brace_expr_entries ::= colon_expr
brace_expr_entries ::= colon_expr ,
brace_expr_entries ::= colon_expr , colon_expr
brace_expr_entries ::= colon_expr , colon_expr ,
brace_expr_entries ::= colon_expr , colon_expr , ...
brace_expr_lines ::= brace_expr_entries
brace_expr_lines ::= brace_expr_entries LineBreak brace_expr_entries
brace_expr_lines ::= brace_expr_entries LineBreak brace_expr_entries LineBreak ...
top_brace_exprs ::= Ø
top_brace_exprs ::= colon_expr ,
top_brace_exprs ::= colon_expr , colon_expr ,
top_brace_exprs ::= colon_expr , colon_expr , ...
brace_expr ::= { line_break_opt }
brace_expr ::= { brace_expr_entries line_break_opt }
brace_expr ::= { top_brace_exprs LineBreak Indent brace_expr_lines
Unindent LineBreak }
The syntax for brace_expr is identical to brace_body except that commas replace the semicolons. The parser automatically replaces line breaks with commas to make code maintenance easier. For example, the following declarations are equivalent:
X x:={"1","2","3",}
X x:={"1","2","3"
}
X x:={
    "1","2","3"
}
X x:=
{
    "1","2","3",
}
X x:={"1", //The comma is required here
    "2"    //Without it, the parser would concatenate "1" and "2"
    "3"
}
X x:=
{
    "1","2"
    "3"
}
These declarations are illegal:
X x:={1,,2} //Error: unexpected comma
X x:={
    1,2,3} //Error: right brace must be the first token on the line if the brace_expr spans
           //multiple lines
X x:={
    } //Error: right brace doesn't line up with 'X'
X x:={1 //Error: missing comma
    2
    3}
X x:={
1,2} //Error: each line of brace_expr must be indented from the first
X x:={
    1,2
        3 //Error: each line of brace_expr must have the same indentation
}
Initializers are the same as C/C++ except for the colon syntax. The expression to the left of the
colon is a constant integer or range for array initializers and a symbol for aggregate initializers.
The range can be closed (start..end) or half open (start..<end).
A third option is the unbounded range operator, .., which initializes the remaining elements
of an array or the remaining members of a structure.
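The effect of array designators can be modeled in a few lines of Python. This sketch assumes that "remaining" means every element not yet given a value by an earlier entry, and it encodes designators as an int index, a ("..", lo, hi) or ("..<", lo, hi) tuple, or the string ".."; none of this encoding comes from the Orth compiler.

```python
def expand_array(size, entries):
    """Expand (designator, value) entries into a list of `size` elements.

    Elements that no entry covers are left as None.
    """
    result = [None] * size
    for designator, value in entries:
        if designator == "..":
            # Unbounded range: every element not initialized so far.
            indices = [i for i in range(size) if result[i] is None]
        elif isinstance(designator, tuple):
            op, lo, hi = designator
            # ".." includes hi, "..<" excludes it.
            indices = range(lo, hi + 1) if op == ".." else range(lo, hi)
        else:
            indices = [designator]
        for i in indices:
            result[i] = value
    return result


# Models an entry list such as 0:9, 1..2:7, 3..<5:5, ..:1 for a six-element array.
print(expand_array(6, [(0, 9), (("..", 1, 2), 7), (("..<", 3, 5), 5), ("..", 1)]))
```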
If a left brace immediately follows a designator, then you should place it on the same line to avoid an unexpected indentation error:
X x:={0: //The parser merges this line with the next because it ends with a colon
    {
        1
        2
    } //Error: this line isn't indented properly
}
X x:={ //Ok
    0:{
        1
        2
    }
}
assignment_expr ::= unary_expr
assignment_expr ::= assignment_expr binary_op LC assignment_expr
assignment_expr ::= assignment_expr ? LC assignment_expr
: LC assignment_expr
binary_op ::= any operator from the operator-precedence table
Assignment expressions associate according to the following operator-precedence table. Operators in each red or blue box have the same precedence. The ternary operator behaves like a binary operator with a parenthesized expression in the middle.
Operator | Associativity | Meaning |
---|---|---|
* | left to right | multiply |
/ | left to right | divide |
% | left to right | remainder |
<< | left to right | shift left |
>> | left to right | shift right |
+ | left to right | add |
- | left to right | subtract |
& | left to right | bitwise and |
@ | left to right | bitwise xor |
\| | left to right | bitwise or |
< | left to right | less |
> | left to right | greater |
<= | left to right | less than or equal |
>= | left to right | greater than or equal |
== | left to right | equal |
!= | left to right | not equal |
&& | left to right | and |
\|\| | left to right | or |
?: | right to left | conditional |
:= | right to left | assign |
*= | right to left | assign multiply |
/= | right to left | assign divide |
%= | right to left | assign remainder |
<<= | right to left | assign shift left |
>>= | right to left | assign shift right |
+= | right to left | assign add |
-= | right to left | assign subtract |
&= | right to left | assign bitwise and |
@= | right to left | assign bitwise xor |
\|= | right to left | assign bitwise or |
Notice that the precedence of <<, >>, &, @, and | is higher than in C/C++.
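The practical effect of the raised precedence can be seen with a small precedence-climbing evaluator. The Python sketch below is not Orth's parser, and the numeric levels are assumptions: only the relative order of these five operators is taken from the table, not the exact grouping into equal-precedence boxes.

```python
import operator

# Assumed binding strengths (larger binds tighter). Only the relative order
# comes from the table; the grouping into levels is a guess.
PREC = {"*": 5, "<<": 4, "+": 3, "&": 2, "==": 1}
APPLY = {"*": operator.mul, "<<": operator.lshift, "+": operator.add,
         "&": operator.and_, "==": operator.eq}


def evaluate(tokens, min_prec=1):
    """Evaluate a token list (consumed in place) by precedence climbing."""
    left = int(tokens.pop(0))
    while tokens and PREC[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Left associativity: the right operand may only contain
        # operators that bind strictly tighter than `op`.
        right = evaluate(tokens, PREC[op] + 1)
        left = APPLY[op](left, right)
    return left


print(evaluate("1 + 2 << 3".split()))   # shift binds tighter than add here
print(evaluate("6 & 3 == 2".split()))   # bitwise and binds tighter than ==
```

Under these assumed levels, 1 + 2 << 3 groups as 1 + (2 << 3), whereas C/C++ groups it as (1 + 2) << 3. Likewise, 6 & 3 == 2 groups as (6 & 3) == 2 rather than 6 & (3 == 2).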
unary_expr ::= postfix_expr
unary_expr ::= ++ unary_expr
unary_expr ::= -- unary_expr
unary_expr ::= & unary_expr
unary_expr ::= - unary_expr
unary_expr ::= ~ unary_expr
unary_expr ::= ! unary_expr
postfix_expr ::= primary_expr
postfix_expr ::= postfix_expr ( LC single_expr_list_opt )
postfix_expr ::= postfix_expr [ LC single_expr_list_opt ]
postfix_expr ::= postfix_expr cdecl( LC single_expr_list_opt )
postfix_expr ::= postfix_expr stdcall( LC single_expr_list_opt )
postfix_expr ::= postfix_expr brace_expr ¹
postfix_expr ::= postfix_expr . access_symbol
postfix_expr ::= postfix_expr ++
postfix_expr ::= postfix_expr --
postfix_expr ::= postfix_expr ^
¹ The opening brace must be on the same line as the postfix_expr to avoid a syntactic ambiguity that occurs when the postfix_expr is inside another brace_expr.
primary_expr ::= ( LC comma_expr )
primary_expr ::= aggregate_expr
primary_expr ::= strings
primary_expr ::= number
primary_expr ::= character
primary_expr ::= access_symbol
primary_expr ::= true
primary_expr ::= false
primary_expr ::= bit
primary_expr ::= bool
primary_expr ::= char
primary_expr ::= wchar
primary_expr ::= dchar
primary_expr ::= byte
primary_expr ::= ubyte
primary_expr ::= short
primary_expr ::= ushort
primary_expr ::= int
primary_expr ::= uint
primary_expr ::= long
primary_expr ::= ulong
primary_expr ::= single
primary_expr ::= double
primary_expr ::= void
primary_expr ::= null
primary_expr ::= unreachable
primary_expr ::= uninit
primary_expr ::= auto
primary_expr ::= typeof
primary_expr ::= sizeof
primary_expr ::= alignof
primary_expr ::= bitcast
primary_expr ::= construct
primary_expr ::= destruct
primary_expr ::= typedef
aggregate_expr ::= aggregate_keyword anon_or_identifier_opt brace_body
strings ::= string
strings ::= string LC string
strings ::= string LC string
LC ...
anon_or_identifier_opt ::= Ø | anon | identifier
Declaring an aggregate without a symbol is equivalent to declaring an aggregate as anon.
The parser interprets sizeof(expression) as a call to sizeof for simplicity's sake.
The parser automatically concatenates adjacent strings.
A pair of parentheses allows you to inject a variable or function declaration into any expression:
int^ p:=&(int x) //Simultaneously declare x and use its address to initialize p
int s:=(int sqr(int x) { return x*x })(12) //Simultaneously declare and call sqr()
A typedef declaration doesn't appear in the grammar because it overlaps a variable declaration. Namely,
declaring a variable whose type is typedef defines an alias for the type that appears after
the :=. Unlike other variable declarations, a typedef declaration must have an initializer.
Declaring two types with the same declaration is legal:
typedef x:=int
typedef y //invalid: no initializer
typedef x:=int,y:=double
access_symbol ::= identifier
access_symbol ::= operator( LC identifier )
access_symbol ::= this
access_symbol ::= outer
decl_symbol ::= identifier
decl_symbol ::= operator( LC identifier )
decl_symbol ::= anon
decl_symbol ::= ctor
decl_symbol ::= dtor
decl_symbol_opt ::= Ø | decl_symbol
Orth replaces the C++ operator syntax with the more versatile operator(name) syntax.
For example, operator(add) replaces operator+ and operator(postincrement) replaces operator++.
The parser places no restrictions on the identifier inside the parens.