|
News |
Announcements110719 - Editor Underway!Ok, I've decided to update this webpage more frequently with progress updates. Hopefully, this will motivate me to stay focused on the project. This past week, I've made some exciting progress on the debugger. First, I implemented hardware breakpoints, which let you trap any read/write or write access to any 8-bit, 16-bit, 32-bit, or 64-bit word. You can also trap the execution of a word, but there doesn't seem to be much point since you can just write overwrite the memory in question with 0xCC (INT 3). Second, I discovered a way of implementing perfect stack unwinding. To unwind the stack, you typically walk the linked list of stack frames and read the return address of each one. This approach works fine as long as the processor isn't executing prologue or epilogue code. In this case, the debugger won't see the first return address because it's hidden on the stack or in a register. To further complicate the problem, the system libraries like kernel32.dll don't provide debugging information for their prologue and epilogues (or if they did, it would be a lot of work to obtain and read the debugging information). Therefore, if the IP is in system code, the debugger has no way whatsoever of finding the first return address. I solved this problem by wrapping each system call and creating a dummy return-address frame. The solution is a bit messy, but the details of stack unwinding are now transparent to 99% of the debugger. Third, I discovered a neat way to set a whole bunch of breakpoints at once using VirtualProtectEx(). This awesome function lets you trap any execution or access to one or more 4K pages in the debuggee's address space. This is extremely useful for stepping into system (or more generally library) functions. Supposing the library function has a callback, stepping into the function should break on the first line of the callback. With VirtualProtectEx(), this behavior is simple. I just protect all of the user's code and run! Either (1) the program exits; (2) an exception occurs; or (3) an access violation occurs in user code--either because the library function returned or because it executed a callback in user code. I also added a few minor features like displaying a list of loaded modules. Next up is the editor! 110709 - Disassembler ReleaasedThis week, I wrote a disassembler for my debugger because I was unsatisfied with udis86. The disassembler is another guilty pleasure because I use a tiny fraction of the x86 instruction set in Orth, so pretty much any disassembler will do. What can I say? I'm a perfectionist and assembly language is fun. Since I went to so much trouble to write the debugger and thoroughly test its output against udis86 and distorm, I decided to release it. You can download it here. When I say I thoroughly tested it, I mean I thoroughly tested it. I decoded millions of instructions and compared my decoding with another disassembler's decoding. So, at the very least, bugs in my disassembler are probably also present in udis86 and distorm. Please send comments to aliphany@gmail.com. 110709 - Still AliveThis project is still alive in spite of the lack of updates. Practically everything has changed since 2009. First, I decided to write my own assembler and code generator instead of using LLVM. Why reinvent the wheel, you ask? For one thing, my code generator is much simpler than LLVM. I support only x86/x64 and don't do any complicated optimizations like function inlining and loop unrolling. For another, my code generator gives me precise control over register allocation, stack frames, debugging information, etc., which is important for things like inline assembly. For a third, making changes to my code generator is much easier than making changes to LLVM. I understand precisely how everything works and I have a thorough regression tester that generates code for millions of complicated expressions, executes the code, and compares the code's behavior (i.e., loads, stores, calls, etc.) against a known working solution. Eventually, when the project is mature, I plan to use LLVM for "release" builds. LLVM is ideally suited for "release" builds because compilation speed and debugging information aren't important. Moving on, I completely changed the syntax as I described in my last update (from 2009 no less!). I developed a wonderful system that lets you format code however you like without annoying semicolons and braces. Semicolons and braces are needed only when you combine multiple statements on the same line. Here are some examples:
if(foo)
bar() //Call bar() if foo is true
if(foo) bar() //Same
if(foo)
bar() //Call bar() and baz() if foo is true
baz()
if(foo) { bar(); baz() } //Same
if(foo) //Call bar() if foo is true. Then, call baz().
bar()
baz()
if(foo) bar() //Same
baz()
if(foo) { bar(); if(foo2) baz() } //Call bar() if foo is true and also call baz() if foo2 is true.
As you can see, the parser uses indentation to determine the relationship between statements. A body consisting of a single statement can go on the same line as its controlling statement. A body consisting of multiple statements can also go on the same line provided that you surround it with braces and separate each statement with a semicolon. Notice that there's no semicolon after the last statement; the semicolon goes between statements. Anyway, I've been using the syntax for many months to test my compiler and I love it! It reminds of me of foreach. Once you realize how awesome foreach is, you can't imagine programming without it. So, you're probably wondering what the status of the project is after all these years. The code generator is done. I have a working compiler that understands most C constructs and a few C++ constructs. The compiler is also incremental, which means it compiles only those functions that have changed. The algorithm is pretty neat. I first determine which project files have been modified. (By the way, a project automatically includes all .orth files in a particular directory. There's no need to painstakingly tell the compiler to compile each source file.) I then parse each modified file, produce an AST, and compare the modified AST with the original AST, which I've saved in the program's database. The comparison produces a set of "chunks" whose syntax has changed. For example, if you changed '1' to '2' in foo(), then the chunk containing foo() would "change". (For now, let's suppose that each chunk contains a single global function or global variable.) Adding a comment to foo() would not change foo(). Similarly, let's say foo had s single parameter 'MyStruct ms'. Changing the definition of MyStruct would also not change foo(). Ok, the next step is to fix up the line number information for chunks that haven't changed. This step is important because adding a line to the top of a source file will cause each breakpoint (which is saved in the program's database) to be off by one. Fortunately, the compiler uses multiple levels of indirection, so, for instance, the breakpoints don't actually contain line numbers. Instead, they contain indices into a line number table for each module, which I can quickly update. Now that we've fixed the existing chunks, we move on to the modified chunks. The first step is to delete all of the old code for these chunks from our database. Remember when I said changing MyStruct doesn't cause foo() to change? You were probably wondering how this system could possibly work because foo()'s signature has no doubt changed. You're right. I maintain a graph of interchunk dependencies. Each time a chunk accesses a symbol, it adds an edge between the chunk doing the access and the chunk containing the accessed symbol. So each time we delete a chunk, we have to delete any chunks that depend on it. Then, we have to delete any chunks that, in turn, depend on those chunks. And so forth. When we run out of chunks to delete, we can be confident that all of the existing code in the database is correct. All that remains is to compile new code for all the chunks that we deleted. The system I've described thus far works, but it's not as efficient as it could be. Let's suppose bar() contains a call to foo() and you change a statement inside foo(). Can bar() possibly be affected by this change? No. A function's body is entirely invisible to the outside world. Therefore, we can split foo into two chunks: one half which is visible to the outside world (return value, parameters, calling convention, etc.) and one half which isn't (the body). So, now we have two chunks for each global function, one chunk for each global variable, and we haven't thought much about structures yet. A structure serves a dual purpose in C++. On the one hand, it describes a layout, or a table of fields and offsets. On the other hand, it's a namespace, which just controls symbol lookup. In the former sense, a structure changes semantically when any one of its fields (a.k.a. member variables) changes semantically. In the latter sense, a structure is unaffected by symbols (e.g., functions) that are defined within it. In other words, adding a variable to a structure invalidates every chunk that depends on the structure; whereas, adding a function to a structure does not. Now, let's look at member functions. Member functions are ordinary functions (i.e., they have two chunks as we discussed), but they have a hidden this pointer, which adds an implicit dependency on the structure containing the function. In other words, adding a variable to a structure invalidates every nonstatic function, but static functions are immune unless they explicitly access one of the structure's static members. Are you with me still? Incremental compilation is a complicated subject, but it has so many advantages. For example, you can add a debugging statement to any function and run the program instantly! You don't have to worry about which modules depend on which modules anymore. You can put declarations in any order in any module without the need to forward declare anything. I realize other languages have solved the latter problem, but, as far as I know, you still have to put "import" statements at the top of each file. Finally, the compiler and debugger share the same program database. How awesome is that? I don't have to write debugging information, per se, because the program database already contains everything the debugger needs to know about. Moving on, I no longer support Linux, and of course, my project is now x86/x64 only. I'm sorry to disappoint Linux fans, but supporting multiple operating systems is too much work for one person. I'd much rather have an awesome IDE for Windows than an ok IDE for Windows, Linux, and BeOS, BSD, and Apple. That's not to say my code is platform specific, per se... Porting the project to Linux is mainly just a matter of substituting Windows calls with Linux calls and replacing the PE reader/writer with an ELF reader/writer. The Linux calling conventions are a bit different too (naturally, they are more complicated on Linux, tsk, tsk), but with my automated call testing system, it shouldn't be too hard to get them right. Okay, the next thing is the linker, or more precisely, the lack of a linker. That's right, Orth doesn't have a linker. I decided to scrap the Windows CRT since it's just a thin wrapper around Windows system calls, so I don't need to statically link my code to anyone else's. Of course, any Windows application needs to dynamically link to user32.dll and others, but dynamic linking is a walk in the park. Seriously. All you do is put a list of function imports in your executable and the Windows loader resolves the address of each one and stores the result in a table of your choosing. Then, it's just a matter of making an indirect call through the pointer. Function exports are even easier, but you don't need those unless you're making a DLL. So, with a weekend's worth of work and a lot of reading, anyone can write a Windows executable that prints "Hello world\n" using assembly language. The absense of a linker simplifies the compiler enormously. For example, the executable's entry point is simply the first instruction in main(). The last statement in main() is always a call to ExitProcess(), which seems to be the accepted way of exiting an application. I haven't done much with the debugger. It's still a primitive command-line utility with the basic set of commands for stepping through code, inspecting variables, and viewing the call stack. Recently, I discovered that a console application (specifically a debugger) can trap CTRL+C and automatically suspend the debuggee. This feature is great for debugging frozen applications. The next step is to play around with hardware breakpoints and maybe calling user-defined functions for viewing the contents of user-defined types. Viewing user-defined types using code that's inside the type is a critical component of my debugger. The plan is to execute user code in the debugger's address space. The debugger maps the image into its own address space, so in theory, it should be able to execute any code in the mapped image. We'll see. I haven't started the editor yet, but I did write my own GUI. I know what you're thinking... Not another GUI. The GUI is a guilty pleasure for me. My time could be better spent on other things, but GUIs are so much fun to write. I was using the FOX toolkit for many years, but I ran into so many problems that I decided it was easier to roll my own GUI than to try to make FOX do what I wanted. My own GUI has the standard "widgets" and a really nice API that I more or less copied from FOX. I have to compliment the author of FOX here, Jeroen van der Zijp, because his design is really good. The GUI is 99.99% done. The GUI is one of those open-ended projects that can go on forever, so at some point, you just have to say, "it's done". I'll correct bugs, but I'm not adding any new features unless I have an immediate need for them. I'll get started on the editor shortly. So, that's where I'm at... Lots done. Lots to do. I haven't released any code because of the amount of work involved. I'll probably do another release when the editor is done. 091129 - Orth ResyntaxedI've discovered that braces and semicolons are usually redundant in C++. After ten years of programming in C++,
I still habitually forget to insert semicolons. Worse, I often put semicolons after the closing paren of
statements like The Linux debugger is complete! Porting the Windows debugger to Linux was much easier than expected. There are only 160
lines of OS-dependent code to read the executable and 400 lines of OS-dependent code to control the debuggee. The hardest
part was expanding my DWARF reader to understand the additional debugging information GCC puts in the executable. Another
challenge was finding an efficient way to read from the process's memory space. Download Orth 0.2.1 here. Milestone 3 will be an editor prototype. The main objective is an algorithm that will incrementally parse and analyze the source code as you type it. The editor should be able to find the declaration of each symbol and ideally, every use of each symbol. Thus, the editor will need to understand structures, symbol aliasing, function calls, and hardest of all, overloading. Initially, the editor will support only those constructs that the compiler understands. 091110 - Orth 0.2.0 ReleasedDownload it here. 091109 - Milestone 2a ReachedI'm pleased to announce that my Windows debugger prototype is complete. It supports the following commands:
g go
g <location> place a one-shot breakpoint and go
bp <location> set breakpoint
bl list breakpoints
bc <bid> clear specific breakpoint
bc clear all breakpoints
s step over
si step into
so step out
cs display call stack
lv display names and values of local variables in current call frame
lv <cid> display names and values of local variables in specific call frame
gv display names and values of global variables
<location> A line number followed by an optional column number
I haven't given much thought to interfacing with foreign code. For example, suppose you pass a callback to a library function for which you don't have the source code. If you step into the library function, then ideally the debugger should stop on the first line of your callback. Stepping out of the callback should unwind the call stack to the nearest Orth function. I'll be polishing the debugger this week and releasing Orth 0.2.0 shortly. The next milestone will be a Linux debugger with the same commands as the Windows debugger. Hopefully, I can reuse most of the code and simply factor out the operating-system calls. 091102 - Single Stepping
I've devised a reasonably efficient algorithm to single step through code. The obvious solution is to execute one instruction
at a time until the IP points to an instruction corresponding a different statement. For Step Over, you need to skip over each
int[100] x,y;
x:=y;
The second statement will copy each element from
Today, I discovered the Windows Debugging API. This simple yet powerful set of functions lets you change registers,
read and write to memory, and respond to exceptions. I'm delighted that I no longer need to interface with
The debugger is going well. So far, I've loaded the DWARF debugging information from an executable and discovered
a way to control I gave some thought to Step Into and decided I want to choose which call the debugger steps into. Visual Studio's
debugger makes you step into every call, which is terribly slow and inefficient. I have two ideas: (1) provide the user
with a list of choices with the most obvious choices (i.e., explicit function calls) at the top; (2) use the cursor
position to select a subexpression. For example, in an expression like I'll be developing this project in small increments and announcing my progress here on a regular basis. The first milestone was to compile a very small subset of the Orth language and produce an LLVM input file with debugging information. The compiler supports the following constructs:
int v:=0; //global variables with an optional initializer (type must be int)
int bar(int p) //global functions with zero or more parameters (all types must be int)
{
if(p) //if-then and if-then-else statements
return p+1; //return statements, addition, and integer literals
else
return 0;
}
int foo(int p)
{
scope //local blocks
{
int i:=bar(p); //local variables with an optional initializer (type must be int), assignment, and calls
v:=i+(1+1); //constant folding
}
return v;
}
The example code looks exactly like C/C++ code except for the assignment operator and the scope keyword.
Orth uses := for assignment and == for equality. The = operator is unused since its meaning is ambiguous.
The scope keyword simply emphasizes that the brace below it introduces a new scope.
You can download the source code here.
The next milestone will be a simple command-line debugger that reads the Dwarf debugging information from an executable and provides a minimal set of commands:
cdb; whereas, the Linux debugger will rely on gdb.
Questions? Comments? Contact me at aliphany@gmail.com. |