home .. forth .. colorforth mail list archive ..

Re: [colorforth] shifted huffman trees


On Tue, 21 Sep 2004, Bill Parker wrote:

> --- Mark Slicker <maslicke@xxxxxxxxxxx> wrote:
>
> > This new tree should be regarded as an alternative
> > to the 'comment', 'Capitalized', 'all caps' tags and
> > the anti-space word.
>
> Starting with a colorForth where 13/16 tags were used,
> extending the character set frees up 2 tags.  If you
> also use a special marker value in the upper bits to
> indicate when numbers need a second word (something
> I've toyed with and I think has been mentioned here
> before) then you free up another 2 tags which gets you
> down to the point where only 9/16 tags are used.
>
> At that point, you could shrink the 'character' tags
> with a Huffman-like encoding to add another data bit
> (i.e. 29 bit character fields rather than 28).  It's
> likely this makes the addition of an extended
> character set a net gain (in terms of encoding).
>
> For example, the tag bits of a word could become:
>
>   111 extension
>   110 variable
>   101 comment
>   100 define
>   011 compile/cyan
>   010 compile/green
>   001 execute
> 11000 compile hex #
> 10000 execute hex #
> 01000 compile decimal #
> 00000 execute decimal #
  ||
       ||

    ==>

You need to append the left two bits of the last four bit strings to the
right of the string for this to be a valid prefix coding.

>
> I don't like this because there is no room for
> expansion.  Does anybody see a good way to free up
> 00000 so that it is available to introduce additional
> longer tag values?
>
> > Huffman probabilities should influence the shape and
> > balance of tree, but I haven't done that yet,
>
> One of the reasons I just added one big block of
> extended characters was that I was hard pressed to
> come up with any reasonable 'frequency of use' by the
> time I got to digits, capital letters and punctuation.

You could use a standard english corpus to compute a Huffman coding of
digits, lower/upper case letters and punctuation. Forth source would
likely skew probabilities.

>
> I'll be interested in seeing what you come up with for
> this if you take the idea any further.

I thought I would divide the 12 symbols between the upper case and lower
case, (2*32)-(2*26)=12. If I choose 16 for the remaining catagory I can
assign the 10 digits and 6 remaining symbols. My main concern is the
interface, having to flip through four screens of symbols, unless
alternatives exists. I have come into an instance of needing to define an
'anti-space'. I also have an ASCII String block[1], for specifing ASCII
strings that could be simplified by having one case instead of three.
These in part motivate me to extend the character set. If it simplies the
interface or code generally it might be worth the change.

Mark

http://personalwebs.oakland.edu/~maslicke/colorforth/networking/net-dev.html#148

---------------------------------------------------------------------
To unsubscribe, e-mail: colorforth-unsubscribe@xxxxxxxxxxxxxxxxxx
For additional commands, e-mail: colorforth-help@xxxxxxxxxxxxxxxxxx
Main web page - http://www.colorforth.com