Orth Programming Language


News

Overview

Features

Download

Announcements

110719 - Editor Underway!

110715 - Debugger Progress

Ok, I've decided to update this webpage more frequently with progress updates. Hopefully, this will motivate me to stay focused on the project. This past week, I've made some exciting progress on the debugger. First, I implemented hardware breakpoints, which let you trap any read/write or write access to any 8-bit, 16-bit, 32-bit, or 64-bit word. You can also trap the execution of a word, but there doesn't seem to be much point since you can just write overwrite the memory in question with 0xCC (INT 3). Second, I discovered a way of implementing perfect stack unwinding. To unwind the stack, you typically walk the linked list of stack frames and read the return address of each one. This approach works fine as long as the processor isn't executing prologue or epilogue code. In this case, the debugger won't see the first return address because it's hidden on the stack or in a register. To further complicate the problem, the system libraries like kernel32.dll don't provide debugging information for their prologue and epilogues (or if they did, it would be a lot of work to obtain and read the debugging information). Therefore, if the IP is in system code, the debugger has no way whatsoever of finding the first return address. I solved this problem by wrapping each system call and creating a dummy return-address frame. The solution is a bit messy, but the details of stack unwinding are now transparent to 99% of the debugger. Third, I discovered a neat way to set a whole bunch of breakpoints at once using VirtualProtectEx(). This awesome function lets you trap any execution or access to one or more 4K pages in the debuggee's address space. This is extremely useful for stepping into system (or more generally library) functions. Supposing the library function has a callback, stepping into the function should break on the first line of the callback. With VirtualProtectEx(), this behavior is simple. I just protect all of the user's code and run! Either (1) the program exits; (2) an exception occurs; or (3) an access violation occurs in user code--either because the library function returned or because it executed a callback in user code. I also added a few minor features like displaying a list of loaded modules.

Next up is the editor!

110709 - Disassembler Releaased

This week, I wrote a disassembler for my debugger because I was unsatisfied with udis86. The disassembler is another guilty pleasure because I use a tiny fraction of the x86 instruction set in Orth, so pretty much any disassembler will do. What can I say? I'm a perfectionist and assembly language is fun. Since I went to so much trouble to write the debugger and thoroughly test its output against udis86 and distorm, I decided to release it. You can download it here. When I say I thoroughly tested it, I mean I thoroughly tested it. I decoded millions of instructions and compared my decoding with another disassembler's decoding. So, at the very least, bugs in my disassembler are probably also present in udis86 and distorm. Please send comments to aliphany@gmail.com.

110709 - Still Alive

This project is still alive in spite of the lack of updates. Practically everything has changed since 2009. First, I decided to write my own assembler and code generator instead of using LLVM. Why reinvent the wheel, you ask? For one thing, my code generator is much simpler than LLVM. I support only x86/x64 and don't do any complicated optimizations like function inlining and loop unrolling. For another, my code generator gives me precise control over register allocation, stack frames, debugging information, etc., which is important for things like inline assembly. For a third, making changes to my code generator is much easier than making changes to LLVM. I understand precisely how everything works and I have a thorough regression tester that generates code for millions of complicated expressions, executes the code, and compares the code's behavior (i.e., loads, stores, calls, etc.) against a known working solution. Eventually, when the project is mature, I plan to use LLVM for "release" builds. LLVM is ideally suited for "release" builds because compilation speed and debugging information aren't important.

Moving on, I completely changed the syntax as I described in my last update (from 2009 no less!). I developed a wonderful system that lets you format code however you like without annoying semicolons and braces. Semicolons and braces are needed only when you combine multiple statements on the same line. Here are some examples:

                    if(foo)
                        bar() //Call bar() if foo is true
                        
                    if(foo) bar() //Same
                    
                    if(foo)
                        bar() //Call bar() and baz() if foo is true
                        baz() 
                        
                    if(foo) { bar(); baz() } //Same
                    
                    if(foo) //Call bar() if foo is true.  Then, call baz().
                        bar()
                    baz()
                    
                    if(foo) bar() //Same
                    baz()
                    
                    if(foo) { bar(); if(foo2) baz() } //Call bar() if foo is true and also call baz() if foo2 is true.
                    

As you can see, the parser uses indentation to determine the relationship between statements. A body consisting of a single statement can go on the same line as its controlling statement. A body consisting of multiple statements can also go on the same line provided that you surround it with braces and separate each statement with a semicolon. Notice that there's no semicolon after the last statement; the semicolon goes between statements. Anyway, I've been using the syntax for many months to test my compiler and I love it! It reminds of me of foreach. Once you realize how awesome foreach is, you can't imagine programming without it.

So, you're probably wondering what the status of the project is after all these years. The code generator is done. I have a working compiler that understands most C constructs and a few C++ constructs. The compiler is also incremental, which means it compiles only those functions that have changed. The algorithm is pretty neat. I first determine which project files have been modified. (By the way, a project automatically includes all .orth files in a particular directory. There's no need to painstakingly tell the compiler to compile each source file.) I then parse each modified file, produce an AST, and compare the modified AST with the original AST, which I've saved in the program's database. The comparison produces a set of "chunks" whose syntax has changed. For example, if you changed '1' to '2' in foo(), then the chunk containing foo() would "change". (For now, let's suppose that each chunk contains a single global function or global variable.) Adding a comment to foo() would not change foo(). Similarly, let's say foo had s single parameter 'MyStruct ms'. Changing the definition of MyStruct would also not change foo(). Ok, the next step is to fix up the line number information for chunks that haven't changed. This step is important because adding a line to the top of a source file will cause each breakpoint (which is saved in the program's database) to be off by one. Fortunately, the compiler uses multiple levels of indirection, so, for instance, the breakpoints don't actually contain line numbers. Instead, they contain indices into a line number table for each module, which I can quickly update. Now that we've fixed the existing chunks, we move on to the modified chunks. The first step is to delete all of the old code for these chunks from our database. Remember when I said changing MyStruct doesn't cause foo() to change? You were probably wondering how this system could possibly work because foo()'s signature has no doubt changed. You're right. I maintain a graph of interchunk dependencies. Each time a chunk accesses a symbol, it adds an edge between the chunk doing the access and the chunk containing the accessed symbol. So each time we delete a chunk, we have to delete any chunks that depend on it. Then, we have to delete any chunks that, in turn, depend on those chunks. And so forth. When we run out of chunks to delete, we can be confident that all of the existing code in the database is correct. All that remains is to compile new code for all the chunks that we deleted. The system I've described thus far works, but it's not as efficient as it could be. Let's suppose bar() contains a call to foo() and you change a statement inside foo(). Can bar() possibly be affected by this change? No. A function's body is entirely invisible to the outside world. Therefore, we can split foo into two chunks: one half which is visible to the outside world (return value, parameters, calling convention, etc.) and one half which isn't (the body). So, now we have two chunks for each global function, one chunk for each global variable, and we haven't thought much about structures yet. A structure serves a dual purpose in C++. On the one hand, it describes a layout, or a table of fields and offsets. On the other hand, it's a namespace, which just controls symbol lookup. In the former sense, a structure changes semantically when any one of its fields (a.k.a. member variables) changes semantically. In the latter sense, a structure is unaffected by symbols (e.g., functions) that are defined within it. In other words, adding a variable to a structure invalidates every chunk that depends on the structure; whereas, adding a function to a structure does not. Now, let's look at member functions. Member functions are ordinary functions (i.e., they have two chunks as we discussed), but they have a hidden this pointer, which adds an implicit dependency on the structure containing the function. In other words, adding a variable to a structure invalidates every nonstatic function, but static functions are immune unless they explicitly access one of the structure's static members.

Are you with me still? Incremental compilation is a complicated subject, but it has so many advantages. For example, you can add a debugging statement to any function and run the program instantly! You don't have to worry about which modules depend on which modules anymore. You can put declarations in any order in any module without the need to forward declare anything. I realize other languages have solved the latter problem, but, as far as I know, you still have to put "import" statements at the top of each file. Finally, the compiler and debugger share the same program database. How awesome is that? I don't have to write debugging information, per se, because the program database already contains everything the debugger needs to know about.

Moving on, I no longer support Linux, and of course, my project is now x86/x64 only. I'm sorry to disappoint Linux fans, but supporting multiple operating systems is too much work for one person. I'd much rather have an awesome IDE for Windows than an ok IDE for Windows, Linux, and BeOS, BSD, and Apple. That's not to say my code is platform specific, per se... Porting the project to Linux is mainly just a matter of substituting Windows calls with Linux calls and replacing the PE reader/writer with an ELF reader/writer. The Linux calling conventions are a bit different too (naturally, they are more complicated on Linux, tsk, tsk), but with my automated call testing system, it shouldn't be too hard to get them right.

Okay, the next thing is the linker, or more precisely, the lack of a linker. That's right, Orth doesn't have a linker. I decided to scrap the Windows CRT since it's just a thin wrapper around Windows system calls, so I don't need to statically link my code to anyone else's. Of course, any Windows application needs to dynamically link to user32.dll and others, but dynamic linking is a walk in the park. Seriously. All you do is put a list of function imports in your executable and the Windows loader resolves the address of each one and stores the result in a table of your choosing. Then, it's just a matter of making an indirect call through the pointer. Function exports are even easier, but you don't need those unless you're making a DLL. So, with a weekend's worth of work and a lot of reading, anyone can write a Windows executable that prints "Hello world\n" using assembly language. The absense of a linker simplifies the compiler enormously. For example, the executable's entry point is simply the first instruction in main(). The last statement in main() is always a call to ExitProcess(), which seems to be the accepted way of exiting an application.

I haven't done much with the debugger. It's still a primitive command-line utility with the basic set of commands for stepping through code, inspecting variables, and viewing the call stack. Recently, I discovered that a console application (specifically a debugger) can trap CTRL+C and automatically suspend the debuggee. This feature is great for debugging frozen applications. The next step is to play around with hardware breakpoints and maybe calling user-defined functions for viewing the contents of user-defined types. Viewing user-defined types using code that's inside the type is a critical component of my debugger. The plan is to execute user code in the debugger's address space. The debugger maps the image into its own address space, so in theory, it should be able to execute any code in the mapped image. We'll see.

I haven't started the editor yet, but I did write my own GUI. I know what you're thinking... Not another GUI. The GUI is a guilty pleasure for me. My time could be better spent on other things, but GUIs are so much fun to write. I was using the FOX toolkit for many years, but I ran into so many problems that I decided it was easier to roll my own GUI than to try to make FOX do what I wanted. My own GUI has the standard "widgets" and a really nice API that I more or less copied from FOX. I have to compliment the author of FOX here, Jeroen van der Zijp, because his design is really good. The GUI is 99.99% done. The GUI is one of those open-ended projects that can go on forever, so at some point, you just have to say, "it's done". I'll correct bugs, but I'm not adding any new features unless I have an immediate need for them. I'll get started on the editor shortly.

So, that's where I'm at... Lots done. Lots to do. I haven't released any code because of the amount of work involved. I'll probably do another release when the editor is done.

091129 - Orth Resyntaxed

I've discovered that braces and semicolons are usually redundant in C++. After ten years of programming in C++, I still habitually forget to insert semicolons. Worse, I often put semicolons after the closing paren of statements like for(), and I sometimes forget to insert or delete closing braces, leading to obscure and time-consuming errors. Clearly, my brain doesn't care about semicolons and braces, and it inserts them only as a programmed afterthought. Line continuations are difficult without semicolons, but there are simple workarounds. One solution is a half indent for line continuations, which many programmers use anyway. Another solution is a backslash, which is tedious to insert by hand, but a good editor should insert it automatically. In fact, an editor usually keeps track of wrapped lines anyway, so it just needs to keep the backslashes in sync with its internal table of wrapped lines.

091116 - Milestone 2b Reached

The Linux debugger is complete! Porting the Windows debugger to Linux was much easier than expected. There are only 160 lines of OS-dependent code to read the executable and 400 lines of OS-dependent code to control the debuggee. The hardest part was expanding my DWARF reader to understand the additional debugging information GCC puts in the executable. Another challenge was finding an efficient way to read from the process's memory space. ptrace(PTRACE_PEEKDATA) and ptrace(PTRACE_POKEDATA) operate on only four bytes. The solution was to open() the debuggee's /proc/<pid>/mem file and read() from it. ptrace(PTRACE_POKEDATA) is fine for setting breakpoints and ocassionally changing the value of a variable.

Download Orth 0.2.1 here.

Milestone 3 will be an editor prototype. The main objective is an algorithm that will incrementally parse and analyze the source code as you type it. The editor should be able to find the declaration of each symbol and ideally, every use of each symbol. Thus, the editor will need to understand structures, symbol aliasing, function calls, and hardest of all, overloading. Initially, the editor will support only those constructs that the compiler understands.

091110 - Orth 0.2.0 Released

Download it here.

091109 - Milestone 2a Reached

I'm pleased to announce that my Windows debugger prototype is complete. It supports the following commands:

                    g             go
                    g <location>  place a one-shot breakpoint and go
                    bp <location> set breakpoint
                    bl            list breakpoints
                    bc <bid>      clear specific breakpoint
                    bc            clear all breakpoints
                    s             step over
                    si            step into
                    so            step out
                    cs            display call stack
                    lv            display names and values of local variables in current call frame
                    lv <cid>      display names and values of local variables in specific call frame
                    gv            display names and values of global variables
                    
                    <location>    A line number followed by an optional column number
                    

I haven't given much thought to interfacing with foreign code. For example, suppose you pass a callback to a library function for which you don't have the source code. If you step into the library function, then ideally the debugger should stop on the first line of your callback. Stepping out of the callback should unwind the call stack to the nearest Orth function.

I'll be polishing the debugger this week and releasing Orth 0.2.0 shortly. The next milestone will be a Linux debugger with the same commands as the Windows debugger. Hopefully, I can reuse most of the code and simply factor out the operating-system calls.

091102 - Single Stepping

I've devised a reasonably efficient algorithm to single step through code. The obvious solution is to execute one instruction at a time until the IP points to an instruction corresponding a different statement. For Step Over, you need to skip over each CALL instruction using a breakpoint; whereas, for Step Into, you need to single step each CALL instruction. This approach works well for a series of nonbranching instructions, but suppose the statement contains a loop. For example:

                    int[100] x,y;
                    x:=y;
                    

The second statement will copy each element from y to x. We could calculate the range of addresses corresponding to the statement and analyze the instructions within. Unfortunately, determining where each instruction begins is nontrivial. The compiler usually concatenates instructions, but it could intersperse data and padding. A more robust solution is to analyze an instruction the first time the processor reaches it. There are five cases:

  • Case 1: The current instruction is a conditional branch that we haven't seen:
    Execute it and put a breakpoint on the other code path (i.e., if the condition is true and we branch, we'll put a breakpoint on the next instruction; if the condition is false and we don't branch, we'll put a breakpoint at the branch address).
  • Case 2: The current instruction is an indirect jump that we haven't seen:
    Execute it and replace it with a breakpoint.
  • Case 3: The current instruction is a CALL that we haven't seen:
    Execute it for Step Into or step over it for Step Over.
  • Case 4: The current instruction is an unconditional jump or any other instruction we haven't seen:
    Execute it. If the IP doesn't change then the processor is repeating the instruction (e.g., REPE and REPNE) so put a breakpoint on the next instruction and resume the program.
  • Case 5: We have seen the current instruction:
    Resume the program. We've marked every boundary between seen and unseen instructions so the program will break at either an unseen instruction or an indirect jump. Indirect jumps are troublesome because the processor could theoretically jump to any address. The only solution, short of reverse engineering the code, is to single step the indirect jump every time we encounter it.
We repeat this procedure until the IP points to an instruction corresponding to a different statement.

091031 - Windows Debugging API

Today, I discovered the Windows Debugging API. This simple yet powerful set of functions lets you change registers, read and write to memory, and respond to exceptions. I'm delighted that I no longer need to interface with cdb because it is inefficient and cumbersome. Linux provides a complementary debugging API, ptrace(), so I don't have to mess with gdb.

091026 - Debugger Underway

The debugger is going well. So far, I've loaded the DWARF debugging information from an executable and discovered a way to control CDB interactively. I open two pipes: one for input and one for output. When I spawn CDB using CreateProcess(), I redirect stdin to the input pipe and stdout to the output pipe. I then send commands to CDB using WriteFile() and read responses using ReadFile(). On Linux, bidirectional pipes are discouraged, so I'll use sockets instead. Linux has a convenient function called socketpair() for creating a pair of sockets that are already connected. I'll then fork() the debugger and call dup2() in the child process to redirect stdin and stdout.

I gave some thought to Step Into and decided I want to choose which call the debugger steps into. Visual Studio's debugger makes you step into every call, which is terribly slow and inefficient. I have two ideas: (1) provide the user with a list of choices with the most obvious choices (i.e., explicit function calls) at the top; (2) use the cursor position to select a subexpression. For example, in an expression like foo()(a+b), positioning the cursor in front of the first parenthesis would step into foo(); positioning the cursor before the second parenthesis would step into the function returned by foo(); and placing the cursor on the plus sign would step into operator+().

090726 - Milestone 1 Reached

I'll be developing this project in small increments and announcing my progress here on a regular basis. The first milestone was to compile a very small subset of the Orth language and produce an LLVM input file with debugging information. The compiler supports the following constructs:

                    int v:=0; //global variables with an optional initializer (type must be int)
                    int bar(int p) //global functions with zero or more parameters (all types must be int)
                    {
                            if(p) //if-then and if-then-else statements
                                return p+1; //return statements, addition, and integer literals
                            else
                                return 0;
                    }
                    int foo(int p)
                    {
                            scope //local blocks
                            {
                                    int i:=bar(p); //local variables with an optional initializer (type must be int), assignment, and calls                        
                                    v:=i+(1+1); //constant folding
                            }
                            return v;
                    }
                    
The example code looks exactly like C/C++ code except for the assignment operator and the scope keyword. Orth uses := for assignment and == for equality. The = operator is unused since its meaning is ambiguous. The scope keyword simply emphasizes that the brace below it introduces a new scope. You can download the source code here.

The next milestone will be a simple command-line debugger that reads the Dwarf debugging information from an executable and provides a minimal set of commands:

  • Go
  • Step over
  • Step into
  • Step out of
  • Set/clear breakpoint
  • Select call frame
  • Inspect variable
The Windows debugger will rely on cdb; whereas, the Linux debugger will rely on gdb.

Questions? Comments? Contact me at aliphany@gmail.com.