
Preprocessing of source text


I have been working on a Color Forth-inspired language, and I've been
mulling over ideas for moving some preprocessing functions into the editor.
This is largely the result of reading Chuck and Jeff's ideas at
ultratechnology.com.

I worked out some schemes for source preprocessing last night, and I ended
up wondering exactly what I was trying to achieve.  The ultimate scheme
seems to be to reduce a sequence like "DUP * 5 +" into what almost looks
like threaded code:

Ptr to DUP
Ptr to *
Ptr to 5
Ptr to +

In practice this is difficult, because you need to maintain a coherent
dictionary at compile time, taking deletions into account.  I think that's
trickier than it sounds.  This could happen in one big step at save and load
time, but that's almost the same as a whole compiler.  I don't think it buys
much.
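
To make the layout concrete, here is a rough C sketch of what I mean; the
entry struct and all the names are made up, nothing final.  It also shows
where the pain is: every saved block points straight into the dictionary,
so deleting or moving an entry leaves dangling pointers to patch up.

/* Rough sketch of the "almost threaded" form; names are hypothetical. */
#include <stdio.h>

struct entry {                  /* one dictionary entry              */
    const char  *name;          /* word name, e.g. "DUP"             */
    void       (*code)(void);   /* behaviour, unused in this sketch  */
};

static void nop(void) {}
static struct entry dict[] = {  /* a toy dictionary                  */
    { "DUP", nop }, { "*", nop }, { "5", nop }, { "+", nop },
};

static struct entry *block[] = { /* the preprocessed source block    */
    &dict[0],   /* Ptr to DUP */
    &dict[1],   /* Ptr to *   */
    &dict[2],   /* Ptr to 5   */
    &dict[3],   /* Ptr to +   */
};

int main(void) {
    /* the compiler no longer parses text at all */
    for (size_t i = 0; i < sizeof block / sizeof *block; i++)
        printf("%s ", block[i]->name);
    printf("\n");               /* prints: DUP * 5 + */
    return 0;
}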

Taking one step back, another scheme is simply to preprocess all the strings
so the compiler doesn't have to parse them, sort of like this:

3 'D' 'U' 'P'  1 '*'  1 '5'  1 '+'

So you effectively have a block of counted strings.  Color info can be
included in the count byte.
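
As one possible packing (entirely made up), the count byte could carry a few
color bits alongside the length, something like this in C:

/* Hypothetical packing: top 3 bits of the count byte are a color tag,
 * low 5 bits are the length (so names are limited to 31 characters). */
enum color { C_DEFINE = 1, C_COMPILE = 2, C_EXECUTE = 3, C_COMMENT = 4 };

#define COUNT(color, len)  ((unsigned char)(((color) << 5) | ((len) & 0x1F)))

/* "DUP * 5 +" preprocessed into a block of counted strings */
static const unsigned char counted[] = {
    COUNT(C_COMPILE, 3), 'D', 'U', 'P',
    COUNT(C_COMPILE, 1), '*',
    COUNT(C_COMPILE, 1), '5',
    COUNT(C_COMPILE, 1), '+',
};

Five bits of length is plenty for Forth-style names, and it leaves room for
whatever set of colors the language ends up with.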

In both of these schemes, you could preprocess numbers, so '5' would be
represented as the binary value 5, plus a tag indicating that it is a
number.  Again, the tag could be packed into the count byte.
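
For instance (again, just a sketch), one value that is never used as a real
count byte could be reserved to mean "the next cell is a binary literal":

/* Hypothetical number encoding: TAG_NUMBER is reserved in place of a
 * count byte to mean "a binary literal follows", so '5' is stored as
 * the value 5 rather than the character '5'.                         */
#define TAG_NUMBER  0xFF

static const unsigned char tagged[] = {
    3, 'D', 'U', 'P',
    1, '*',
    TAG_NUMBER, 5, 0, 0, 0,   /* 32-bit little-endian literal 5 */
    1, '+',
};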

In this second scheme, what is the preprocessing really buying?  Strings
still have to be looked up at compile time (either a linear search or a
hash).  As such, does the preprocessing really simplify the compiler to a
significant degree?
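
Either way, the compiler still ends up doing something like this (a plain
linear search here; a hash only changes the constant, not the fact that the
lookup happens at all):

#include <string.h>

struct entry { const char *name; void (*code)(void); };

/* Look up one preprocessed counted string in the dictionary.
 * Assumes the count-byte packing sketched above (low 5 bits = length). */
static struct entry *find(struct entry *dict, int n,
                          const unsigned char *counted)
{
    size_t len = counted[0] & 0x1F;
    for (int i = 0; i < n; i++)
        if (strlen(dict[i].name) == len &&
            memcmp(dict[i].name, counted + 1, len) == 0)
            return &dict[i];
    return NULL;                /* not found */
}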

In terms of memory, the counted string format is about the same size as the
raw text (for "DUP * 5 +", nine characters of text become ten bytes of
counted strings).  The threaded form is actually larger if the words are
short: four cell-sized pointers take sixteen bytes on a 32-bit system.

Overall, I'm leaning toward processing raw text with embedded color tokens.
I'd be interested in hearing other experiences.

James