
Re: Preprocessing of source text


Dear MISC readers:

James Hague wrote:
> I worked out some schemes for source preprocessing last night, and I ended
> up wondering exactly what I was trying to achieve.  The ultimate scheme
> seems to be to reduce a sequence like "DUP * 5 +" into what almost looks
> like threaded code:
> 
> Ptr to DUP
> Ptr to *
> Ptr to 5
> Ptr to +

That is one of the three representations that I am
using in aha to compress the source and speed up
the compiler.  It is the only representation that
Sean is using in Flux.

I use it only for defined words.  Opcodes and
all the words in the Machine Forth compiler
are represented by tokens which are smaller
than one word.

Since I represent any defined name in 10
or 20 bits I compress all source names
to about 1 or 2 characters.  Since
Sean is working in 32 bits on Pentium
he represents defined names as 4 bytes
regardless of their size.  So for short
names it is not compression for him,
but it does eliminate dictionary
searches at compile time.
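
Roughly the idea is something like this, though the widths and
names below are just for illustration and not the actual aha or
Flux encoding:

\ one bit in a cell says opcode/compiler word vs. defined name
1 19 lshift constant name-bit            \ pretend a 20-bit cell
: name-token   ( index -- token )  name-bit or ;
: name?        ( token -- flag  )  name-bit and 0<> ;
: token>index  ( token -- index )  name-bit 1- and ;
\ opcodes and Machine Forth words are just small tokens below name-bit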

> In practice this is difficult, because you need to maintain a coherent
> dictionary at compile time, taking deletions into account.  

Actually at edit time.  At least that is what I am doing and
what Sean is doing.  It eliminates dictionary searches at
compile time and compresses the code.  Because I have a
20-bit machine I am compressing all defined names into
10- or 20-bit tokens.

> trickier than it sounds.  This could happen in one big step at save and load
> time, but that's almost the same as a whole compiler.  I don't think it buys
> much.

The beauty of a linked list representation for defined names
is the ease of inserting or deleting items by merely
changing the links.  It is fast and easy.  The dictionary
must be searched at edit time the way most Forth systems
search the dictionary at compile time.

In my case it looks easy.  It delivers about 2/1
compression for me and makes the compiler hundreds
of times faster.  But I am not talking about doing
it at load and save times.  It must be done whenever
a word is encountered in the editor.
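
A sketch of what I mean, with the layout simplified and the
names made up for the example:

\ each entry: link cell, token cell, counted name
variable names   0 names !

: find-name  ( c-addr u -- entry | 0 )   \ searched at edit time
    names @
    begin  dup  while
        >r  2dup  r@ 2 cells + count  compare 0= if
            2drop  r>  exit
        then
        r> @
    repeat
    nip nip ;

: add-name  ( c-addr u token -- )        \ insertion: change one link
    align here >r
    names @ ,  ,
    dup c,  over + swap ?do  i c@ c,  loop
    r> names ! ;

: forget-last  ( -- )  names @ @ names ! ;   \ deletion: change one link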

> Taking one step back, another scheme is to simply preprocess all the strings
> so the compiler doesn't have to, sort of like this:
> 
> 3 'D' 'U' 'P'  1 '*'  1 '5'  1 '+'

That is a counted string representation.  It lets you parse faster and
search faster at the expense of a bigger representation of the source.
It is another option.
 
> So you effectively have a block of counted strings.  Color info can be
> included in the count byte.

Yes, it could, unless you allow words to be up to 255 characters long.
If you limit names to 31 characters that gives you room for 8 different tokens.
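
For example with names limited to 31 characters the count byte
could be packed like this (the tag numbers would be arbitrary):

: pack-count  ( tag len -- c )  swap 5 lshift or ;
: count-len   ( c -- len )  31 and ;
: count-tag   ( c -- tag )  5 rshift ;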
 
> In both of these schemes, you could preprocess numbers,  so '5' would be
> represented as the binary value 5, plus a tag indicating that it is a
> number.  Again, this could be compressed into the count byte.

And, I would point out, it eliminates searching the dictionary for numbers
and converting them from string to binary format at compile time.
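
So a literal like 5 could be laid down as a number-tagged count
byte followed by the binary value, something like this sketch,
where the tag value and layout are only for illustration:

2 constant #number-tag
: emit-byte    ( c addr -- addr' )  tuck c!  1+ ;
: emit-number  ( n addr -- addr' )
    #number-tag 5 lshift  1 cells or  swap emit-byte   \ count byte
    1 cells 0 do  over swap emit-byte  swap 8 rshift swap  loop
    nip ;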

> In this second scheme, what is the preprocessing really buying?  Strings
> still have to be  looked up at compile time (either a linear search or a
> hash).  As such, does the preprocessing really simplify the compiler to a
> significant degree?

Not much that I can see.  Having counted strings will make parsing
faster and searching faster, but not by too much. 
 
> In terms of memory, The counted string format is the same as the raw text.
> The threaded form is actually larger, if the words are short.

Well the counts replace the delimiting spaces, so I guess so.  But
it would make it a bit tricky to do anything except process
from start to finish, because otherwise you don't know where the
delimiters are.  That might be an issue for what you want to do.
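
For example the only way to walk such a buffer is to hop from
one count byte to the next:

: next-word  ( addr -- addr' )  count + ;   \ skip over one counted string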
 
> Overall, I'm leaning toward processing raw text with embedded color tokens.
> I'd be interested in hearing other experiences.

That is the smallest step away from traditional practice and might
be easiest.  I know that after I explained the idea of compression
by a name dictionary and pointers in aha, Sean was able to
implement it in Flux in a day or two.

I was describing more details of aha to Chuck today.  I explained
how the development of a desktop and GUI brought the desire to run
multiple applications on the desktop without having to compile all
possible applications at all times and put them in fixed locations
in memory, since I don't have relocatable code in this environment.
I also explained that I wanted source code, compilation, and source
enabled debugging in this desktop, but that 1MB of ROM or FLASH
was not enough to hold, say, the iTV source code.  With this approach
I can put about 5 times that much in the same ROM, and compiling
it is almost as fast as loading object code.

Chuck said, so you are doing just in time compiling.  I said,
yes, that is true.  I explained how completely eliminating
dictionary searches at compile time made it incredibly fast.
I said that at 2M to 100M Forth words per second compile
rates, and with things being so tiny, a few hundred or a few
thousand words here and there, the compile times were quite remarkable.

He joked that I could have an interrupt invoke the compiler
and compile an interrupt service routine with a JIT compiler
that was that fast and with such small code.  We both
laughed about what a funny idea that was.

Jeff Fox