Orth News

Announcements

150315 - Block Selections

Progress has been slow these last few years because I have a full time job now. I'm continuing to make improvements to the editor in my spare time. My latest creation is nonrectangular block selections that allow you to edit multiple lines at once. Each line in the block has its own cursor and possibly its own selection. To form a block selection, press Ctrl+Up or Ctrl+Down or drag a rectangle with the mouse holding down the right mouse button. Each editor command has a different operation when applied to a block selection. Some editor commands, like cursor-right, just execute the command on each line. Other commands are more complicated. For example, insert-tab inserts enough spaces in front of each cursor so that each cursor lines up.

The example below shows how block selections can speed up repetitive tasks like member initialization.

130505 - Chunks Revisited

As I described in the overview, a chunk is the smallest unit of recompilation in an Orth program. I decided to split aggregates into two chunks to achieve better data abstraction. The outer aggregate chunk contains the inner aggregate chunk and one chunk for each function and shared variable. The inner aggregate chunk contains each member variable. For example, the program below contains 4 chunks:

struct A                         //  --+ 1. A
{                                //    |
    int x                        //    +-- 2. members (x and y)
    int y                        //    |
    shared int s                 //    +-- 3. s
    int foo() { return x+y }     //    |
}                                //    +-- 4. foo

Now, suppose we write a function, bar(), that takes a A^ argument and calls A.foo():

void bar(A^ a)
{
    a.foo()
}

Since foo() is essentially a global function whose first argument is A^, bar() doesn't depend on the layout of A. In other words, we could reorder x and y in A without having to recompile bar(). Of course we would have to recompile foo() and patch the machine code for bar() with the (possibly changed) address of foo(), but the assembly code of bar() doesn't change. This separation of data and methods is impossible in C++. We could try:

====== test.h ======
struct S
{
    int foo();
};

====== test.cpp ======
#include "test.h"
struct S
{
    int x;
    int y;
    int foo() { return x+y; }
};

But of course the compiler prints an error because you can't define a structure more than once. You can declare a structure and later define it, as the following C code illustrates, but the declaration can't contain any information about the structure (base classes, members functions, etc.)

====== test.h ======
struct S;
int S_foo(S^);

====== test.c ======
#include "test.h"
struct S
{
    int x;
    int y;
};
int S_foo(S^ s) { return s->x+s->y; }

130430 - HTML Cleanup

I added frames to make the website easier to navigate. I also improved the formatting of tables, lists, images, etc. I hope you like the white-on-black look. It's easier on my eyes.

130424 - Orth Console Part 3 - An Interruptable Timer

A console application typically reads characters from STDIN and writes printable characters (including line breaks) to STDOUT, but there are a couple of other ways a console application can interact with its parent process. One way is writing a control character (0x00 - 0x1F) to STDOUT. The two most important control characters are 0x08 (backspace) and 0x0D (carriage return). A backspace instructs the console to place the next (non-line-break) character on top of the previous (non-line-break) character. In other words, it subtracts one from the console's current output column. A carriage return instructs the console to output the next character in column zero. For example, printing "hello, world\rHELLO" will appear as "HELLO, world" because the five characters after the carriage return overwrite the first five characters on the current line. I decided to take a slightly different approach in the Orth console. Rather than overwriting text, the console just erases the preceding (non-line-break) character (in the case of a backspace) and erases the current line up to the preceding line break (in the case of a carriage return). Another interesting control character is 0x07 (bell). The Windows console just ignores it whereas the Orth console calls MessageBeep().

Another form of interaction is console control events. Typically, pressing CTRL+C or CTRL+BREAK terminates a console application, but an application can override this behavior using the SetConsoleCtrlHandler() API. If the registered handler function returns true, the system bypasses the default handler which calls TerminateProcess(). The Orth console uses the GenerateConsoleCtrlEvent() API to generate a console event in response to CTRL+C and CTRL+BREAK. On the surface, the API appears to do exactly what you want. The first argument is the affected process and the second argument is the event type, CTRL_C_EVENT or CTRL_BREAK_EVENT. Unfortunately, the API is completely broken. After extensive research, I discovered that the only way to make the API work is to either (1) call the function inside the affected process; or (2) call the function inside a "launcher" process that sits between the parent and child process. I opted for the former approach using a technique known as code injection. The parent process creates a new thread in the child process whose only purpose is to call GenerateConsoleCtrlEvent().

Putting it all together, we can code an interruptable timer in a console application. The timer might represent a progress bar or a counter or any form of data the console application wants to display in real-time.

bool interrupted=false;
BOOL WINAPI HandlerRoutine(DWORD dwCtrlType)
{
    if(dwCtrlType==CTRL_C_EVENT)
    {
        interrupted=true;
        return true; //don't call TerminateProcess()
    }
    else
        return false; //do call TerminateProcess()
}

void runTimer()
{
    interrupted=false;
    ::SetConsoleCtrlHandler(HandlerRoutine,true); //install our handler
    for(int i=0;i<=100;++i)
    {
        ::Sleep(30);
        if(interrupted)
        {
            prnStdout("\r\aOperation cancelled!");
            break;
        }
        prnStdout("\rProgress: ",i,"%");
    }
    ::SetConsoleCtrlHandler(HandlerRoutine,false); //uninstall our handler
    prnStdout("\n");
}

The console application calls runTimer() in response to a user command. The function displays a progress message in the form Progress: xx%, which lasts for approximately 3 seconds. If the user presses CTRL+C during this period, the function replaces the message with Operation cancelled!, beeps, and returns to the caller (presumably to fetch the next command). Pressing CTRL+BREAK immediately terminates the application.

130420 - Orth Console Part 2 - Writing a Pseudo-Driver

Interacting with a console application in real-time is challenging for a couple of reasons. First, some console applications are accustomed to reading commands and writing responses directly to the console using ReadConsole() and WriteConsole(). If the console application doesn't have a console, it assumes that its output won't be viewed in real-time (e.g., its output has been redirected to a file), so it writes its output to an internal buffer and flushes the buffer to STDOUT only when the buffer is full. Consequently, there is no guarantee that the parent process will see the output of the chlid process until either (1) the child process exits; or (2) the child process flushes its internal buffer prematurely using a call to fflush() or the like.

The root of the problem is buried within the C Runtime Library (CRT). The CRT uses a call to GetFileType() to determine whether to buffer STDOUT. If GetFileType() returns FILE_TYPE_DISK or FILE_TYPE_PIPE, the CRT enables buffering. If GetFileType() returns FILE_TYPE_CHAR, the CRT disables buffering. Since the CRT has no idea whether the parent process is viewing the output in real-time, it has to make an educated guess based on the file type. When it is FILE_TYPE_PIPE, the CRT incorrectly assumes that throughput is more important than response time. The solution is to change the value returned by GetFileType(). The Windows API doesn't provide any way of configuring the file type of an anonymous pipe, but it does provide a way of writing your own pipe using a pseudo-driver.

A pseudo-driver is essentially a virtual file system. An application interacts with the driver using familiar calls to CreateFile(), CloseHandle(), WriteFile(), and ReadFile(). When the application issues a call to CreateFile(), the operating system checks whether the filename begins with \\.\MyDriver\, where MyDriver is the pseudo-driver's name. If so, the operating system calls the driver's CreateFile() handler, passing it the portion of the filename after the driver's name and all of the arguments from the call to CreateFile(). For example, if the filename is \\.\MyDriver\foo\bar, the driver sees \foo\bar. The driver's handler is allowed to do virtually anything it wants. For a simple pipe, the CreateFile() handler just allocates a block of memory to hold the circular queue. The operating system associates the handle returned by CreateFile() with the pseudo-driver so that it is able to dispatch calls to ReadFile() and WriteFile() to the pseudo-driver as well. For a simple pipe, the WriteFile() handler writes data to the queue, possibly blocking if the queue is full. The ReadFile() handler reads data from the queue, possibly blocking if the queue is empty. Finally, the CloseHandle() handler just deallocates memory. Since the pseudo-driver sits in system memory, all applications share the same view of the virtual file system. In other words, two process can communicate by opening a virtual file with the same name and writing/reading data to/from it. I've left out many details about Windows driver development, but the basic concepts are easy to understand.

The second challenge in interacting with a console application in real-time is matching commands with responses. For example, suppose I open CMD.EXE and give it two commands: dir foo and dir bar. How will I know where the response for the first command ends and the response for the second command begins? If I'm certain that the child process is CMD.EXE, I could try searching for a greater-than sign without a subsequent line break, but this approach won't work for all console applications. To solve this problem elegantly, we have to detect when the child process reads the second command. Assuming that the child process is running a single thread, we can safely assume that it has finished writing its response to the first command when it calls ReadFile() to fetch the second command. (If the child process is running multiple threads and interacting with STDIN/STDOUT in multiple threads, the situation is more complicated, but such situations are rare). A pseudo-driver provides the perfect set of tools. When the child process calls ReadFile(), the pseudo-driver's handler writes a special code to STDOUT that essentially says, "everything from this point onwards belongs to the second command". When the parent process receives the special code, it's able to interleave the user's commands with the child process's output correctly. Namely, the parent process will insert each character belonging to the first response in front of cd bar and each character belonging to the second response after cd bar. The user will see something like this:

Shell Version 1.2.3
>dir foo
The contents of foo are:
...
>dir bar
The contents of bar are:
...
>

The blue text is the response to the first command. The red text is the reponse to the second command. The green text is the introductory text that the child process writes before reading any commands.

130418 - Orth Console

Today's screenshot shows the early stages of the Orth Console. The console lets you interact with a console application. Typing a command and pressing enter sends the command to the application's STDIN handle. The application processes the command and writes its response to STDOUT. The console then displays the response beneath the command.

Run Process

Orth Console

In this example, I've executed CMD.EXE and given it two commands: cd .. and dir. The text between cd .. and dir is its response to the first command. The text after dir is its response to the second command. The yellow arrow indicates that CMD.EXE is waiting for me to type the next command on line 21 and press enter.

The Orth Console is far superior to the Windows Console (e.g., the window that appears when you run CMD.EXE). The Orth Console is an ordinary document in most respects, so the full arsenal of editor commands is available to you. You can edit any text including commands and responses; you can execute a search; you can undo and redo changes; you can save the transcript to a file; you can insert a character literal; etc.; etc. The Orth console also supports the standard "terminal" features when the cursor is on the command line (i.e., the line with the yellow arrow). Pressing up and down recalls the previous and next saved command, respectively. Pressing CTRL+BREAK (or CTRL+C if there is no selection) terminates the process. Pressing TAB might display a path-autocompletion box (I'm not sure yet).

130415 - Update Cont'd: Line Selections

Line selections are a convenient way of selecting a group of lines. Most editors allow you to make a line selection by clicking in the sidebar (i.e,. the gray region to the left of the document containing bookmarks, breakpoints, change bars, etc.) and dragging. The Orth editor makes line selections even easier. Pressing ALT+DOWN selects the current line and moves the cursor down. Pressing ALT+UP moves the cursor up and then selects the current line. This behavior may seem counter intuitive, but it's actually the same as character selections! Pressing SHIFT+RIGHT selects the character beneath the cursor and moves the cursor right. Pressing SHIFT+LEFT moves the cursor left and then selects the character beneath the cursor. These semantics, in conjunction with the copy-clears-selection option, allow you duplicate the selected characters or lines by pressing CTRL+C and CTRL+V one or more times. Pressing CTRL+V inserts a character selection in front of the cursor and inserts a line selection above the current line. Many editor commands that use SHIFT to extend the character selection have an analogous ALT counterpart that extends the line selection. Some examples are ALT+PGUP, ALT+PGDN, ALT+HOME, ALT+END.

130414 - Update Cont'd: Insert-Literal Box

The Insert-Literal Box allows you to insert any Unicode character between U+0000 and U+FFFF. Unlike the Windows character map, the insert-literal box displays one Unicode "page" at a time (e.g., page 25 consists of characters U+2500 through U+25FF). You can navigate the characters using the arrow keys, page up, page down, home, end, and the mouse wheel. You can also enter the desired code in the spinner beneath the list of characters. Most of the characters appear as boxes because the font I've chosen (Courier New) is unable to display them properly. Nonetheless, you can still insert them in a document and edit them as you would any other character.

Insert Literal Box

130413 - Overview Updated

I added three sections to the overview.

130413 - Update Cont'd: Change Bars

No editor is complete without change bars. Change bars indicate which lines the user has modified and inserted since the last save. Change bars are challenging for a couple of reasons. First, change bars should indicate visible changes rather than actual changes. For example, suppose I move the cursor to the end of the line and press the space bar. The document has technically changed (and indeed the editor's caption contains an asterisk to indicate that the document needs to be saved), but the document looks exactly the same as before. Another example: Suppose you overwrite a space at the beginning of a line with another space. Again, the document has technically changed but the change is unimportant. This distinction is important to make auto indentation work properly. Suppose I open a document containing foo on the first line and bar on the second line. Placing the cursor in front of bar and pressing enter clearly inserts a new blank line between foo and bar. There's no doubt that the first and last lines are unmodified. Now, let's suppose the lines are indented by four spaces. Assuming smart enter is enabled, placing the cursor in front of the b and pressing enter inserts a line break and four spaces in order to preserve bar's indentation. The line containing bar is exactly the same as before (i.e., four spaces followed by bar), but it contains different spaces than before. The original four spaces are now on the blank line above it. The four spaces in the inserted string are now in front of bar. If we used a naive change-bar algorithm that ignored spaces, we would incorrectly mark the third line as modified.

The second challenge is indicating removed lines properly. Visual SlickEdit doesn't display removed lines at all and Visual Studio just marks the line below as modified. The Orth Editor indicates removals with a red bar between the surrounding (unchanged) lines. In the example below, I've removed Line 2, inserted Line 4a, and modified Line 5.

Change Bars

130412 - Long Overdue Update

The editor is slowly coming to life. It now supports many features that I've grown accustomed to in Visual SlickEdit: regular-expression searches, syntax highlighting, context tagging, change bars, and line selections just to name a few.

Documentation

I've completed the compiler documentation for Orth 0.3 (not yet released). View it here, here, here, and here.
Regular Expressions

The search dialog box employs an interesting strategy for entering regular expressions. Entering regular expressions containing parens, brackets, asterisks, etc. is always a pain because you have to escape each one. You also have to remember to check (or uncheck) the "Regular Expression" box each time you do a regular-expression search. I decided to introduce keyboard shortcuts for entering regular-expression operators. For example, suppose you want to search for x followed by an optional pair of parens followed by y. The regular expression is clearly x(\(\))?y. To enter it in the search dialog box you type:
- x
- CTRL+9
- (
- )
- CTRL+0
- CTRL+/
- y
The keyboard shortcuts are not arbitrary. In fact, if you look at your keyboard, you'll see that the shortcut for entering a left paren operator (CTRL+9) is the same as the keyboard combination for entering a left paren (SHIFT+9) except that CTRL replaces SHIFT. This convention works for all regular-expression operators except [ and { which are ambiguous. Since [ is more common, CTRL+[ enters a bracket operator and ALT+[ enters a brace operator. The best part of the system is that the dialog box color codes your regular expression making it much easier to spot mistakes. Here is a contrived example showing many of the operators and escape sequences:

Since backslashes are fairly rare, I decided to make them regular-expression operators by default. Entering a backslash followed by an 'n' inserts a line-break escape sequence. Pressing the backslash key twice inserts a literal backslash (unlike conventional regular expressions, the backslash appears only once because its color determines whether it's a literal or not). Copying the color-coded regular expression converts it to a conventional regular expression, and similarly, pasting a conventional regular expression inside the search box automatically color codes it. For example, the conventional regular expression corresponding to the image above is:
```
(?1a.?)\n+(bc?d|e\\f[0-9]*)\1{2,3}\x7F\u1234
```
Although such complicated expressions are rare in practice, I think you'll agree that color coding makes it easier to read!
Syntax Highlighting

Syntax highlighting is straightforward. The lexer remembers the type and column of each token that it encounters. The editor then paints each token using the user's preferred color and font style for tokens of that type. The challenging part is making the syntax highlighter (a.k.a. the color coder) smart enough to understand the context of each token. For example, support you type in x y,z in an Orth/C/C++ file. Clearly, x, y, and z are all identifiers, but y and z are declarations whereas x is a symbol (e.g., a structure or perhaps a typedef). Now, suppose you type void foo(x y,z) {} in an Orth/C/C++ file. The tokens inside the parens are exactly the same as before, and yet the semantics of the identifiers has changed. Only y is a declaration whereas x and z are symbols.

The editor's color coder incrementally parses the file as you modify it and selects a color for each token based on the grammer rule used to reduce it. The color coder also matches parentheses and brackets using nesting level to select a color from a pool of four colors. Outermost brackets are white. Once nested brackets are teal. Twice nested brackets are magenta. Thrice nested brackets are yellow. After that, the pattern repeats (i.e., white, teal, magenta, then yellow).

I've rewritten the parser many times and tweaked the grammar close to a hundred times. The parser is now a pushdown automaton. The machine's inputs are the tokens generated by the lexer. The machine's outputs are instructions describing how to construct an abstract syntax-tree (AST). For example, parsing x+y*z might produce the following instructions:
- Eat and push x
- Eat and push +
- Eat and push y
- Eat and push *
- Eat and push z
- Pull three tokens and push MULTIPLY node
- Pull three tokens and push ADD node
I developed a parser generator similar to Bison in many respects. The parser generator creates an enormous state-transition table. For each state and each possible token, the table contains a record indicating the next state and the sequence of actions needed to build the AST. With over 500 states, over 100 tokens, and at least 4 bytes per record, the table could easily consume 200K. Fortunately, there is a huge amount of redundancy in the table. Using several tricks like grouping similar states and merging identical tokens within each group, I was able to reduce the table size to 32K. Unlike Bison, which constructs the parser from grammar rules, I developed a grammar "assembly language" that not only describes but implements each grammar rule. The grammar "program" looks at the next token and decides which actions to perform. One possible action is to "eat" the current token and fetch the next token. Another possible action is to replace the N tokens (or nodes) on the top of the stack with a new node. Let's look at an example for a simple language:
```
program ::= sum END

sum ::= term
sum ::= term '+' term

term ::= value
term ::= value '*' value

value ::= terminal
value ::= '-' terminal

terminal ::= ID | NUMBER
```
Bison is able to convert this grammar directly into a parser. With my generator, we have to first implement the grammar rules as I've illustrated below:
```
program:
    call fetch_sum
    match END else fail
    pass

fetch_sum:
   call fetch_term
1: match PLUS else return
   call fetch_term
   output ADD:3
   goto 1b

fetch_term:
   call fetch_value
1: match ASTERISK else return
   call fetch_value
   output MULTIPLY:3
   goto 1b

fetch_value:
   match HYPHEN else 1f
   call fetch_value
   output NEGATE:2
   return
1: call fetch_terminal
   return

fetch_terminal:
   match ID,NUMBER else fail
   return
```
The match instruction compares the current token to one or more choices. If the token matches any of them, the parser pushes the token onto its stack and control continues on the next line. If the token doesn't match, then control moves to the label after the else. The output instruction is followed by a rule name (e.g., ADD) and a number indicating the number of tokens (or nodes) to remove from the stack. In the case of ADD, we're removing the left operand, the plus, and the right operand, and replacing it with an ADD node. The call, return, and goto instructions work exactly like ordinary assembly.

Once we finish implementing the grammar, we feed it into the parser generator and it produces the state-transition table. There are many details I'm leaving out, but the basic algorithm is simple. The generator converts snapshots of the program into states. A snapshot consists of the current instruction pointer and each return address on the stack. For each state, the generator tries different input tokens, and searches for other reachable states that it hasn't seen yet. Specifically, for each state and for each input the generator simulates the program exactly like a theoretical parser would. The simulator's instruction pointer and stack are initially the same as the snapshot corresponding to the starting state. The simulator executes each instruction and records the program's output (i.e., it makes a list of each output that it encounters). When it encounters a match instruction, it takes a snapshot of the simulator, converts it into a state, and adds a record to its transition table joining the starting state to the new state for the current input. This record also contains the recorded outputs.

The beauty of my parser generator is that it provides many specialized instructions for resolving the so-called shift-reduce conflicts and reduce-reduce conflicts that frequently occur in Bison grammars. For instance, I can begin parsing a series of tokens using one grammar rule and if I run into trouble, I can backtrack and try reparsing the tokens using a different grammar rule. Bison supports a nondetermininistic, generalized parser for such instances, but it is an order of magnitude less efficient that the standard, determininistic parser. Using an assortment of clever tricks, I've managed to convert several instances of backtracking and multiple-token lookahead into a pushdown automaton. A pushdown automaton is ideal because it's simple to implement, simple to understand, and incredibly fast. Another advantage is that the state-transition table automatically encodes the valid inputs at each state. If a token causes the parser to fail, the parser can print a helpful error message by looking at the transition for every other token to determine whether it, too, would cause the parser to fail. Yet another advantage is error recovery. When the parser encounters an invalid token, aborting is unacceptable. Termination is especially bad for the editor because the user expects to see the code displayed properly as he or she is modifying it. The grammar program contains an error handler for each failure (e.g., each fail instruction has an optional error handler after it). The error handler is exactly like the other subroutines. It's allowed to eat and discard as many tokens as it wants before executing a recover instruction that transfers control back to the state in which the failure occurred. At this point, the new lookahead token may or may not be valid. If it isn't valid, the parser doesn't execute the error-handling code a second time. Instead, it selects a valid token to insert into the stream of tokens from the lexer. In this way, the parser is always able to progress through the entire source file no matter what tokens it contains. More importantly, components like the color coder are able to make natural assumptions about the source file. For instance, the color coder naturally assumes that each statement has an equal number of left and right parentheses because the parser inserts missing parentheses.

One challenging aspect of the grammar was deciding what to do about indentation. Since C/C++ ignores whitespace, I had no idea how to represent indentation in the grammar. It wasn't until early this year that I developed a neat way to represent indentation using special "whitespace" tokens (click here for details). The indentation rules make color coding and context tagging much easier than C/C++ because the parser can detect missing braces and insert them where necessary. In fact, the error recovery system is so sophisticated that you can remove most braces from an Orth program and the editor will still understand the relationship between statements. For example, consider the following erroneous program.
```
struct S
{
    struct T
        int a
        int b
    }
    int c
}
```
A C/C++ context tagger would struggle with this code because, ignoring whitespace, the declaration of c appears to be outside of S. The Orth context tagger understands that the programmer's intent was probably to declare c in the same scope as T; he or she just forgot the opening brace after struct T.
Context Tagging

Context tagging is fast and easy. When you press CTRL+. to go to the definition of a symbol, the tagger creates a database for the current document. In the future, the tagger will maintain a database for every source file in the user's workspace, but I'm focusing on a single file for the moment. The database contains the name, file offset, scope, and "type" of each declaration. The tagger makes no distinction between types and values, which is why I put "type" in quotation marks. The tagger just records the expression to the left of declaration. The database also contains the name, file offset, and scope of each symbol. Some symbols are by themselves whereas other symbols appear after a period. For symbols of the latter type, the symbol's record also contains the expression to the left of the period. When you press CTRL+., the tagger searches for a nearby declaration record or symbol record in its database. If it finds one, it begins an iterative process of resolving expressions into structures and then looking up symbols within resolved structures. To see how this works, consider the example below:
```
1:  struct P
2:  {
3:      struct Q
4:      {
5:          int x
6:      }
7:  }
8:  struct S
9:  {
10:     P.Q q
11: }
12: typedef T:=S
13. T.q.x<cursor is here>
```
The tagger locates the declaration of x on line 5 using this step-by-step process:
- 1. x appears after a period, so we resulve T.q
  - 2. q appears after a period, so we resulve T
    - 3. T is a typedef on line 12 whose initializer is S
    - 4. S is a structure on line 8
  - 5. Looking up q in S gives us a declaration on line 10 whose type is P.Q
  - 6. Q appears after a period, so we resulve P
    - 7. P is a structure on line 1
  - 8. Looking up Q in P produces a structure on line 3
- 9. Looking up x in Q moves the cursor to declaration of x on line 5
As I said earlier, the tagger doesn't try to determine the actual type of each declaration because it's relevant only if it resolves into a structure. In the example above, the type of P.Q.x doesn't matter because accessing a member of P.Q.x is impossible. The tagger also ignores operators like ^ and [] because they don't affect the structure that an expression resolves into. In fact, we can ignore function calls if we pretend that a function declaration is actually declaring a variable whose type is the function's result type.

Ambiguities can occur for two reasons: (1) there might be multiple instances of a symbol within a structure at some point during the resolution process; or (2) each branch of the conditional operator might resolve into different structures. In any case, the tagger makes a list of possible declarations and asks you to choose one:

110719 - Editor Underway!

Editor Screenshot

110715 - Debugger Progress

Ok, I've decided to update this webpage more frequently with progress updates. Hopefully, this will motivate me to stay focused on the project. This past week, I've made some exciting progress on the debugger. First, I implemented hardware breakpoints, which let you trap any read/write or write access to any 8-bit, 16-bit, 32-bit, or 64-bit word. You can also trap the execution of a word, but there doesn't seem to be much point since you can just write overwrite the memory in question with 0xCC (INT 3). Second, I discovered a way of implementing perfect stack unwinding. To unwind the stack, you typically walk the linked list of stack frames and read the return address of each one. This approach works fine as long as the processor isn't executing prologue or epilogue code. In this case, the debugger won't see the first return address because it's hidden on the stack or in a register. To further complicate the problem, the system libraries like kernel32.dll don't provide debugging information for their prologue and epilogues (or if they did, it would be a lot of work to obtain and read the debugging information). Therefore, if the IP is in system code, the debugger has no way whatsoever of finding the first return address. I solved this problem by wrapping each system call and creating a dummy return-address frame. The solution is a bit messy, but the details of stack unwinding are now transparent to 99% of the debugger. Third, I discovered a neat way to set a whole bunch of breakpoints at once using VirtualProtectEx(). This awesome function lets you trap any execution or access to one or more 4K pages in the debuggee's address space. This is extremely useful for stepping into system (or more generally library) functions. Supposing the library function has a callback, stepping into the function should break on the first line of the callback. With VirtualProtectEx(), this behavior is simple. I just protect all of the user's code and run! Either (1) the program exits; (2) an exception occurs; or (3) an access violation occurs in user code--either because the library function returned or because it executed a callback in user code. I also added a few minor features like displaying a list of loaded modules.

Next up is the editor!

110709 - Disassembler Releaased

This week, I wrote a disassembler for my debugger because I was unsatisfied with udis86. The disassembler is another guilty pleasure because I use a tiny fraction of the x86 instruction set in Orth, so pretty much any disassembler will do. What can I say? I'm a perfectionist and assembly language is fun. Since I went to so much trouble to write the debugger and thoroughly test its output against udis86 and distorm, I decided to release it. You can download it here. When I say I thoroughly tested it, I mean I thoroughly tested it. I decoded millions of instructions and compared my decoding with another disassembler's decoding. So, at the very least, bugs in my disassembler are probably also present in udis86 and distorm. Please send comments to vamaelo@gmail.com.

110709 - Still Alive

This project is still alive in spite of the lack of updates. Practically everything has changed since 2009. First, I decided to write my own assembler and code generator instead of using LLVM. Why reinvent the wheel, you ask? For one thing, my code generator is much simpler than LLVM. I support only x86/x64 and don't do any complicated optimizations like function inlining and loop unrolling. For another, my code generator gives me precise control over register allocation, stack frames, debugging information, etc., which is important for things like inline assembly. For a third, making changes to my code generator is much easier than making changes to LLVM. I understand precisely how everything works and I have a thorough regression tester that generates code for millions of complicated expressions, executes the code, and compares the code's behavior (i.e., loads, stores, calls, etc.) against a known working solution. Eventually, when the project is mature, I plan to use LLVM for "release" builds. LLVM is ideally suited for "release" builds because compilation speed and debugging information aren't important.

Moving on, I completely changed the syntax as I described in my last update (from 2009 no less!). I developed a wonderful system that lets you format code however you like without annoying semicolons and braces. Semicolons and braces are needed only when you combine multiple statements on the same line. Here are some examples:

if(foo)
    bar() //Call bar() if foo is true

if(foo) bar() //Same

if(foo)
{
    bar() //Call bar() and baz() if foo is true
    baz()
}

if(foo) { bar(); baz() } //Same

if(foo) //Call bar() if foo is true.  Then, call baz().
    bar()
baz()

if(foo) bar() //Same
baz()

if(foo) { bar(); if(foo2) baz() } //Call bar() if foo is true and also call baz() if foo2 is true.

As you can see, the parser uses indentation to determine the relationship between statements. A body consisting of a single statement can go on the same line as its controlling statement. A body consisting of multiple statements can also go on the same line provided that you surround it with braces and separate each statement with a semicolon. Notice that there's no semicolon after the last statement; the semicolon goes between statements. Anyway, I've been using the syntax for many months to test my compiler and I love it! It reminds of me of foreach. Once you realize how awesome foreach is, you can't imagine programming without it.

So, you're probably wondering what the status of the project is after all these years. The code generator is done. I have a working compiler that understands most C constructs and a few C++ constructs. The compiler is also incremental, which means it compiles only those functions that have changed. The algorithm is pretty neat. I first determine which project files have been modified. (By the way, a project automatically includes all .orth files in a particular directory. There's no need to painstakingly tell the compiler to compile each source file.) I then parse each modified file, produce an AST, and compare the modified AST with the original AST, which I've saved in the program's database. The comparison produces a set of "chunks" whose syntax has changed. For example, if you changed '1' to '2' in foo(), then the chunk containing foo() would "change". (For now, let's suppose that each chunk contains a single global function or global variable.) Adding a comment to foo() would not change foo(). Similarly, let's say foo had a single parameter 'MyStruct ms'. Changing the definition of MyStruct would also not change foo(). Ok, the next step is to fix up the line number information for chunks that haven't changed. This step is important because adding a line to the top of a source file will cause each breakpoint (which is saved in the program's database) to be off by one. Fortunately, the compiler uses multiple levels of indirection, so, for instance, the breakpoints don't actually contain line numbers. Instead, they contain indices into a line number table for each module, which I can quickly update. Now that we've fixed the existing chunks, we move on to the modified chunks. The first step is to delete all of the old code for these chunks from our database. Remember when I said changing MyStruct doesn't cause foo() to change? You were probably wondering how this system could possibly work because foo()'s signature has no doubt changed. You're right. I maintain a graph of interchunk dependencies. Each time a chunk accesses a symbol, it adds an edge between the chunk doing the access and the chunk containing the accessed symbol. So each time we delete a chunk, we have to delete any chunks that depend on it. Then, we have to delete any chunks that, in turn, depend on those chunks. And so forth. When we run out of chunks to delete, we can be confident that all of the existing code in the database is correct. All that remains is to compile new code for all the chunks that we deleted. The system I've described thus far works, but it's not as efficient as it could be. Let's suppose bar() contains a call to foo() and you change a statement inside foo(). Can bar() possibly be affected by this change? No. A function's body is entirely invisible to the outside world. Therefore, we can split foo into two chunks: one half which is visible to the outside world (return value, parameters, calling convention, etc.) and one half which isn't (the body). So, now we have two chunks for each global function, one chunk for each global variable, and we haven't thought much about structures yet. A structure serves a dual purpose in C++. On the one hand, it describes a layout, or a table of fields and offsets. On the other hand, it's a namespace, which just controls symbol lookup. In the former sense, a structure changes semantically when any one of its fields (a.k.a. member variables) changes semantically. In the latter sense, a structure is unaffected by symbols (e.g., functions) that are defined within it. In other words, adding a variable to a structure invalidates every chunk that depends on the structure; whereas, adding a function to a structure does not. Now, let's look at member functions. Member functions are ordinary functions (i.e., they have two chunks as we discussed), but they have a hidden this pointer, which adds an implicit dependency on the structure containing the function. In other words, adding a variable to a structure invalidates every nonstatic function, but static functions are immune unless they explicitly access one of the structure's static members.

Are you with me still? Incremental compilation is a complicated subject, but it has so many advantages. For example, you can add a debugging statement to any function and run the program instantly! You don't have to worry about which modules depend on which modules anymore. You can put declarations in any order in any module without the need to forward declare anything. I realize other languages have solved the latter problem, but, as far as I know, you still have to put "import" statements at the top of each file. Finally, the compiler and debugger share the same program database. How awesome is that? I don't have to write debugging information, per se, because the program database already contains everything the debugger needs to know about.

Moving on, I no longer support Linux, and of course, my project is now x86/x64 only. I'm sorry to disappoint Linux fans, but supporting multiple operating systems is too much work for one person. I'd much rather have an awesome IDE for Windows than an ok IDE for Windows, Linux, and BeOS, BSD, and Apple. That's not to say my code is platform specific, per se... Porting the project to Linux is mainly just a matter of substituting Windows calls with Linux calls and replacing the PE reader/writer with an ELF reader/writer. The Linux calling conventions are a bit different too (naturally, they are more complicated on Linux, tsk, tsk), but with my automated call testing system, it shouldn't be too hard to get them right.

Okay, the next thing is the linker, or more precisely, the lack of a linker. That's right, Orth doesn't have a linker. I decided to scrap the Windows CRT since it's just a thin wrapper around Windows system calls, so I don't need to statically link my code to anyone else's. Of course, any Windows application needs to dynamically link to user32.dll and others, but dynamic linking is a walk in the park. Seriously. All you do is put a list of function imports in your executable and the Windows loader resolves the address of each one and stores the result in a table of your choosing. Then, it's just a matter of making an indirect call through the pointer. Function exports are even easier, but you don't need those unless you're making a DLL. So, with a weekend's worth of work and a lot of reading, anyone can write a Windows executable that prints "Hello world\n" using assembly language. The absense of a linker simplifies the compiler enormously. For example, the executable's entry point is simply the first instruction in main(). The last statement in main() is always a call to ExitProcess(), which seems to be the accepted way of exiting an application.

I haven't done much with the debugger. It's still a primitive command-line utility with the basic set of commands for stepping through code, inspecting variables, and viewing the call stack. Recently, I discovered that a console application (specifically a debugger) can trap CTRL+C and automatically suspend the debuggee. This feature is great for debugging frozen applications. The next step is to play around with hardware breakpoints and maybe calling user-defined functions for viewing the contents of user-defined types. Viewing user-defined types using code that's inside the type is a critical component of my debugger. The plan is to execute user code in the debugger's address space. The debugger maps the image into its own address space, so in theory, it should be able to execute any code in the mapped image. We'll see.

I haven't started the editor yet, but I did write my own GUI. I know what you're thinking... Not another GUI. The GUI is a guilty pleasure for me. My time could be better spent on other things, but GUIs are so much fun to write. I was using the FOX toolkit for many years, but I ran into so many problems that I decided it was easier to roll my own GUI than to try to make FOX do what I wanted. My own GUI has the standard "widgets" and a really nice API that I more or less copied from FOX. I have to compliment the author of FOX here, Jeroen van der Zijp, because his design is really good. The GUI is 99.99% done. The GUI is one of those open-ended projects that can go on forever, so at some point, you just have to say, "it's done". I'll correct bugs, but I'm not adding any new features unless I have an immediate need for them. I'll get started on the editor shortly.

So, that's where I'm at... Lots done. Lots to do. I haven't released any code because of the amount of work involved. I'll probably do another release when the editor is done.

091129 - Orth Resyntaxed

I've discovered that braces and semicolons are usually redundant in C++. After ten years of programming in C++, I still habitually forget to insert semicolons. Worse, I often put semicolons after the closing paren of statements like for(), and I sometimes forget to insert or delete closing braces, leading to obscure and time-consuming errors. Clearly, my brain doesn't care about semicolons and braces, and it inserts them only as a programmed afterthought. Line continuations are difficult without semicolons, but there are simple workarounds. One solution is a half indent for line continuations, which many programmers use anyway. Another solution is a backslash, which is tedious to insert by hand, but a good editor should insert it automatically. In fact, an editor usually keeps track of wrapped lines anyway, so it just needs to keep the backslashes in sync with its internal table of wrapped lines.

091109 - Debugger Progress

I'm pleased to announce that my Windows debugger prototype is complete. It supports the following commands:

g             go
g <location>  place a one-shot breakpoint and go
bp <location> set breakpoint
bl            list breakpoints
bc <bid>      clear specific breakpoint
bc            clear all breakpoints
s             step over
si            step into
so            step out
cs            display call stack
lv            display names and values of local variables in current call frame
lv <cid>      display names and values of local variables in specific call frame
gv            display names and values of global variables

<location>    A line number followed by an optional column number

I haven't given much thought to interfacing with foreign code. For example, suppose you pass a callback to a library function for which you don't have the source code. If you step into the library function, then ideally the debugger should stop on the first line of your callback. Stepping out of the callback should unwind the call stack to the nearest Orth function.

091102 - Single Stepping

I've devised a reasonably efficient algorithm to single step through code. The obvious solution is to execute one instruction at a time until the IP points to an instruction corresponding a different statement. For Step Over, you need to skip over each CALL instruction using a breakpoint; whereas, for Step Into, you need to single step each CALL instruction. This approach works well for a series of nonbranching instructions, but suppose the statement contains a loop. For example:

int[100] x,y;
x:=y;

The second statement will copy each element from y to x. We could calculate the range of addresses corresponding to the statement and analyze the instructions within. Unfortunately, determining where each instruction begins is nontrivial. The compiler usually concatenates instructions, but it could intersperse data and padding. A more robust solution is to analyze an instruction the first time the processor reaches it. There are five cases:

Case 1: The current instruction is a conditional branch that we haven't seen:: Execute it and put a breakpoint on the other code path (i.e., if the condition is true and we branch, we'll put a breakpoint on the next instruction; if the condition is false and we don't branch, we'll put a breakpoint at the branch address).
Case 2: The current instruction is an indirect jump that we haven't seen:: Execute it and replace it with a breakpoint.
Case 3: The current instruction is a CALL that we haven't seen:: Execute it for Step Into or step over it for Step Over.
Case 4: The current instruction is an unconditional jump or any other instruction we haven't seen:: Execute it. If the IP doesn't change then the processor is repeating the instruction (e.g., REPE and REPNE) so put a breakpoint on the next instruction and resume the program.
Case 5: We have seen the current instruction:: Resume the program. We've marked every boundary between seen and unseen instructions so the program will break at either an unseen instruction or an indirect jump. Indirect jumps are troublesome because the processor could theoretically jump to any address. The only solution, short of reverse engineering the code, is to single step the indirect jump every time we encounter it.

We repeat this procedure until the IP points to an instruction corresponding to a different statement.

Announcements

Documentation

Regular Expressions

Syntax Highlighting

Context Tagging