
Re: Preprocessing of source text


Dear MISC readers:

> What I have been pondering is, everything else aside, whether or not the
> direct pointer scheme is more complex--in terms of required code and
> potential for bugs--than simply parsing text.  Parsing Forth is a trivial
> problem.  Keeping a dictionary at edit time is messier, and leaves open the
> potential for one bad link causing major problems.

Yes, it makes the editor have to do more things that the compiler
did traditionally.  I don't know what the tradeoffs will be in your
system, and mine is not done.  It is in the early coding phase
so I have no definitive answers to those questions.

I have been trying to solve what I see as the serious problems
in the design phase and think through the problem well before
I get deeply into the implementation coding.

I have several object code and source code editors that have
to be combined along with parts from a couple of compilers.
These vary in size from one page of code to a few pages
of code.  I don't know if combining them will result in
adding substantial glue to integrate them or if I can
keep it small and simple when they are rolled together.

My project plan has several steps and several phases and
I will know more when I get to the first editor.  It will
not be the last editor.  One thing I like about the project
is that I can have a Machine Forth style editor and a
Color Forth style editor and even an ANS Forth style
editor that looks like a normal Forth text source editor.
They can all edit the same source and use the same compiler.
So I am sure that the first editor I do will not be the
last.  I am refactoring the problem and focused more on
the compiler at this time, but how that affects the editor(s)
is a consideration.
 
> My current plan is to use something along the line of traditional blocks,
> where each block is compressed individually.  When a block is loaded in the
> editor, it is decompressed from tokens to source.  When it is saved, it is
> converted back to tokens.  This is instantaneous.  The compiler is, in
> effect, a block compiler.  You pass it a tokenized block and it returns when
> done.  This means that blocks are not a fixed size, but that's okay  (I
> wrote a Windows-based editor for variable sized blocks in an evening last
> Summer).

One of the most important design goals for aha is that the methods of
compression are not just for compression but are first and foremost
ways that simplify and speed up the processing of source by the
compiler.  Compression is also a design goal and I am very pleased that
I can get 2 to 5 times compression as the secondary goal while
meeting the primary goal of making the compiler simpler and
significantly faster.
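
To make that concrete, here is a tiny sketch in Python (the word
list and token numbering are invented for illustration; this is
not aha's actual format) of how replacing names with small token
numbers both shrinks the source and turns the compiler's
dictionary search into a plain table index:

# Toy tokenized-source scheme, for illustration only.
# Each word becomes an index into a word table, so processing a
# token is an array lookup instead of a string parse followed by
# a dictionary search by name.

WORDS = ["dup", "drop", "swap", "+", "-", "@", "!"]
INDEX = {name: i for i, name in enumerate(WORDS)}

def tokenize(text):
    """Turn whitespace-separated source into one byte per word."""
    return bytes(INDEX[w] for w in text.split())

def detokenize(tokens):
    """Recover editable text from the token form."""
    return " ".join(WORDS[t] for t in tokens)

src = "dup + swap drop"
toks = tokenize(src)             # 4 bytes instead of 15 characters
assert detokenize(toks) == src   # round-trips losslessly

On this toy example the token form is just under 4 times smaller,
in the same 2 to 5 times range, and the compiler never has to
search for a name.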

I figure most people will assume that I load the source from ROM
or Flash, decompress it, and then edit or compile it, and that it
makes more sense to do all of that at once at load and save times.
That is what I do on a PC.  I edit and compile source text, and
I can compress and decompress the source code at load and save
times, as an extra step after loading or before saving, using
Lempel-Ziv plus Huffman encoding.  But that is nothing like what
I am talking about in aha.
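
That PC-style approach looks something like this sketch (Python,
just for illustration; zlib's deflate is the familiar Lempel-Ziv
plus Huffman combination).  The working format stays plain text,
and compression is only an extra step at the file boundary:

import zlib

# Sketch of the PC approach: edit and compile plain text, and
# compress or decompress only as an extra step at save/load time.

def save_source(path, text):
    with open(path, "wb") as f:
        f.write(zlib.compress(text.encode("utf-8")))

def load_source(path):
    with open(path, "rb") as f:
        return zlib.decompress(f.read()).decode("utf-8")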

In aha the source is always in a format optimized for processing
by a compiler or a source-enabled debugger/editor/compiler.
The source-enabled debugger/editor/compiler can compress the
source several more times in the process of compiling, but at no
time do I ever decompress the source.  Utility programs can
convert between ASCII text and aha format, which is like
compression and decompression, but I will use those very rarely
and only for import or export to another system.

In aha the source is compressed, but into a form that is optimal
for the compiler and also good for compression.  The editor must
combine features from the object code editor/debugger/compiler
with code from my text editor and from the one-page Machine Forth
source compiler.  How complex will these editors be?  That
is a good question and I don't know yet.

Sean is only using one representation for everything, the
dictionary pointer method.  So that simplifies a lot of
things for him compared to aha.  But I want to make
things faster and more compressed, and that is why I
am willing to make it more complicated at this point.
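
For contrast with the token sketch above, the direct pointer
scheme can be pictured like this (Python again, with invented
names; this is not Sean's actual implementation).  Source holds
references straight to definition objects, so compiling never
searches by name, but a stale reference is exactly the one bad
link hazard mentioned in the quote at the top:

class Definition:
    """A dictionary entry; a real one would carry code too."""
    def __init__(self, name):
        self.name = name

DUP = Definition("dup")
SWAP = Definition("swap")

# Source is a sequence of direct references, not names or indices:
source = [DUP, SWAP, DUP]

def display(src):
    """Render pointer-form source back to text for the editor."""
    return " ".join(d.name for d in src)

assert display(source) == "dup swap dup"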

But even if I add things like word completion in the editor
I still have a hard time picturing it getting bigger than
a couple of K words of object code or a few K of source code.
The compiler should be a few hundred words.

> The downside is that block-at-a-time tokenization gives poorer compression
> than whole-program tokenization, but (1) it keeps problems more localized,
> and (2) the small size of blocks lets you use smaller tokens and related
> data structures, which is better for cache coherency.

I think my use of several types of source tokenization, chosen by
decisions in the editor/optimizing compiler, will also provide a
kind of localization of changes much like the block concept.
I have a sort of database and I am editing records.  The changes
to one record are isolated from changes to other records in many
cases because of this.  So the editor doesn't have to recompile the
whole (tiny) application every time I hit a keystroke.  It will
have to figure out what effect a keystroke has on compilation
and source representation, but these changes are isolated into
records and should not require many global changes to the source.
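
That record-level isolation might be sketched like this (Python
once more, with invented names; aha's records will not look like
this).  Each record carries its own dirty flag, an edit marks
only its own record, and recompilation skips everything else:

class Record:
    """One editable unit of source, compiled independently."""
    def __init__(self, source):
        self.source = source
        self.dirty = True          # new records need a first compile
        self.code = None

    def edit(self, new_source):
        self.source = new_source
        self.dirty = True          # only this record is invalidated

def recompile(records, compile_one):
    """Recompile only the records an edit has touched."""
    for r in records:
        if r.dirty:
            r.code = compile_one(r.source)
            r.dirty = False

records = [Record(": sq dup * ;"), Record(": cube dup sq * ;")]
recompile(records, compile_one=len)    # len stands in for a compiler
records[0].edit(": sq dup dup * ;")    # touches only the first record
recompile(records, compile_one=len)    # recompiles just that record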

I may use blocks too, but they will be blocks in RAM like what
Chuck is doing.  I'll keep you informed as I go along.
 
> Thank you for the reply, Jeff.  This project has me excited.

You're welcome.  I think MISC is a good place for these discussions
rather than private email.  If they get tiresome to MISC readers
and people say they want to leave the list rather than read these
kind of discussions then we could always find some other
alternative.

I figure that other people might be interested and might toss in
some ideas or solutions.  Even if I were not interested in this
stuff personally and just wanted other kinds of information
from the MISC list, I would not think of it as a significant
amount of junk mail.  If you have a website indexed by search
engines or post to usenet with your real email address you
get a lot of junk mail, so a few messages from MISC are
always welcome by comparison.

The MISC list has mostly been quiet for a long time. So I don't think
a few messages will result in a flood of unsubscriptions.  If we
start driving people out by talking about this stuff I hope
they will say what they would prefer to see happen with MISC
if anything.

Sean is on vacation at the moment.  When he gets back he may
report details of what he has done or is doing.  There are lots
of details we haven't discussed and I look forward to seeing
what methods he chose to implement certain things and what
problems he runs into in the process.

When I get a Color Forth editor into the aha system, if no one
else has done another Color Forth flavor by then, we will have
five colorful Forthers, Chuck, Terry, Sean, James and Jeff,
and that should make for many different implementation
decisions.

Jeff