No Subject

To: MISC
From: jfox@xxxxxxxxxx (Jeff Fox)
Date: Tue, 18 Apr 1995 11:34:26 -0700

>Dave Lowry writes:
>>[...]
>>CPU         Seconds
>>486DX40     19
>>P21/6Volts  30
>>386DX20     35
>>
>>I was quite surprised at how well P21 did.  P21Forth is not as tweaked as
>>UR/Forth, with many words still in high-level Forth.  When Jeff has time
>>to CODE the line drawing routines, I'll revisit this benchmark.
>
>I was initially disappointed, but then realized this P21 configuration
>did not represent a fast P21 configuration.  Didn't Jeff just post an
>estimate of 6 mips for the P21 kit configuration, and describe a memory
>configuration for 100 mips operation?  Would this benchmark run in 2
>seconds on that configuration?
>
>I think it would be interesting to have the numbers for the benchmark with
>line drawing for 2 reasons:
>
>	Forth has faired well in the past against assembly code,
>
>	For comparison with the CODEd drawing routines when available.
>
>Anyway, thanks for taking the time to quantify P21 operation.
>
>--
>Mark Walker                   | "There's no such thing as fast enough!"  


Dear MISC readers:

No, P21Forth will not run this benchmark in 2 seconds even with high
speed sram and no video output.  P21Forth would have to be rewritten
to take advantage of the high speed sram, and with only this modification
and no video output it could run about three times faster, i.e. in about
10 seconds instead of 30.

However there are other things to remember here.  P21Forth is an
enhanced version of eForth.  It is not designed for maximum speed at
the Forth compiler level.  It is tweaked compared to the minimal eForth:
it has about 200 CODE words where the minimal eForth has only 28, but the
compiler design is still a very simple direct threaded compiler.

One should remember that UR/Forth is a much more sophisticated
compiler.  Using stacks in DRAM and a DTC compiler is very slow on P21.
A native DUP takes 10ns, but a DUP in P21Forth takes about 500ns even if
the video processor is running and it is called from the same page,
which it never is.  This means that with the video processor running,
calling DUP from some other high level word will take over 1us.  That
is a 100 to 1 reduction in speed compared with a native DUP in P21
assembler.
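
As a rough check of the DUP arithmetic above, here is a small Python sketch. It is only an illustration and not part of P21Forth; the constant names are made up for this example, and "over 1us" is treated as a lower bound of 1000ns.

```python
# Timings quoted in the text, in nanoseconds.
NATIVE_DUP_NS = 10             # native DUP in P21 assembler
THREADED_DUP_NS = 500          # DUP under P21Forth, same-page call
THREADED_DUP_FAR_NS = 1000     # "over 1us": video on, called from another word


def slowdown(slow_ns, fast_ns):
    """Return how many times slower slow_ns is than fast_ns."""
    return slow_ns / fast_ns


best_case = slowdown(THREADED_DUP_NS, NATIVE_DUP_NS)
typical = slowdown(THREADED_DUP_FAR_NS, NATIVE_DUP_NS)

print(f"same-page call:  {best_case:.0f} to 1")
print(f"cross-page call: at least {typical:.0f} to 1")
```

The second figure is the 100 to 1 ratio quoted above; since the 1us is a floor, the real ratio can only be worse.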

I have designed a couple of optimizing compilers for P21.  At times
a native code DUP is sufficient; at other times you have to deal
with stacks in memory (where SRAM would be faster).  Either way you
can compile an inline sequence of native code that is much, much faster
than what I do in P21Forth.  Of course, so far I have been paid $80
for the copies of P21Forth I have sold, so I have not had a lot of
incentive to spend time completing and debugging a good optimizing
compiler for the P21.  There is, after all, an assembler in P21Forth,
so you can optimize time critical functions in assembler pretty
easily.
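
To illustrate the difference between threaded dispatch and an inline sequence, here is a toy model in Python. It is a sketch only: it is not P21Forth's inner interpreter and says nothing about real P21 timings. The names `run_threaded`, `lit`, and `inlined` are hypothetical, and the word being compiled is an invented example equivalent to `: X DUP + 1 + ;`.

```python
def run_threaded(thread, stack):
    """Run a list of primitives the way a threaded inner interpreter does:
    fetch the next entry, dispatch to it, repeat. Returns the fetch count."""
    fetches = 0
    ip = 0                        # index of the next entry in the thread
    while ip < len(thread):
        word = thread[ip]         # NEXT: fetch the next word from memory
        fetches += 1
        ip += 1
        word(stack)               # dispatch to that word's code
    return fetches


def dup(stack):
    stack.append(stack[-1])


def add(stack):
    stack.append(stack.pop() + stack.pop())


def lit(n):
    """Build a primitive that pushes the literal n."""
    def push(stack):
        stack.append(n)
    return push


def inlined(stack):
    """The same computation as one straight-line sequence, no dispatch loop."""
    n = stack.pop()
    stack.append(n + n + 1)


thread = [dup, add, lit(1), add]  # threaded form of  DUP + 1 +
stack = [20]
fetches = run_threaded(thread, stack)
print(stack, "after", fetches, "fetch-and-dispatch steps")
```

In the threaded form every word costs one extra fetch and dispatch on top of its own work, and on P21 each of those goes through memory. The inlined form has no such per-word overhead, which is the gain an optimizing native code compiler is after.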

By going to a more sophisticated compiler you could see a BIG
improvement in P21 benchmarks over P21Forth.  P21Forth will perform
below the level of a 386 on some benchmarks, particularly those
that use a lot of multiplies.  P21 is fast on shifts and pretty
fast on adds, but not too fast on multiply.

You will notice on the loop timing tests (Mentink Benchmark) that
the difference between DO LOOPs written in high level code and
FOR NEXT written in assembler is about 6 to 1.  And you will notice
that the difference between the DO LOOPs (DO and LOOP use COLON
definition primitives) and optimized inlined code on P21 is  about
200 to 1.
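
These ratios can be checked against the Loop Tests figures given further down in this post. The Python sketch below only does that arithmetic; the variable names are invented for the example, and the "<.1 S" estimate is taken as an upper bound of 0.1 seconds.

```python
ITERATIONS = 1_000_000

# Seconds for one million iterations, from the Loop Tests figures.
DO_LOOP_S = 25.0     # : X 1000000 0 DO 34 DROP LOOP ;
FOR_NEXT_S = 4.0     # : X 1000000 FOR 34 DROP NEXT ;
INLINED_S = 0.1      # native code, inlining, fast sram, no video (est, "<.1 S")

for_next_ratio = DO_LOOP_S / FOR_NEXT_S
inlined_ratio = DO_LOOP_S / INLINED_S          # a lower bound on the ratio
per_iteration_us = DO_LOOP_S * 1e6 / ITERATIONS

print(f"DO LOOP vs FOR NEXT: {for_next_ratio:.2f} to 1")
print(f"DO LOOP vs inlined:  at least {inlined_ratio:.0f} to 1")
print(f"DO LOOP cost:        {per_iteration_us:.0f} us per iteration")
```

The first ratio matches the "about 6 to 1" above. The second comes out somewhat higher than "about 200 to 1", which is unsurprising given that both inputs are estimates.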

Here are a few benchmarks I have done on the MuP21 so far.  These are 
not your traditional benchmarks, e.g. Dhrystone, SpecInt, or even Sieve,
but they do give some indication of what I have done so far.

CORDIC coordinate transformation:

MuP21 running P21Forth in DRAM with CORDIC in CODE and video on:       20 uS
MuP21 in SRAM with CORDIC in CODE w/o video on: (est)                   6 uS
486 50 running FPC with Colon def of CORDIC:                          500 uS
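
For readers who have not met CORDIC: it rotates a vector using only shifts, adds, and a small table, which suits a machine that is fast on shifts and adds but slow on multiply. The Python sketch below is a generic fixed-point CORDIC rotation, not the CODE routine timed above. The 16-bit fraction, the iteration count, and all names are assumptions made for this example.

```python
import math

FRAC_BITS = 16                 # assumed fixed-point format
ONE = 1 << FRAC_BITS
ITERATIONS = 16

# Table of atan(2**-i) in fixed point. math.atan is used only to build the
# table; a real system would store these values as constants.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERATIONS)]

# The iterations stretch the vector by a fixed gain K, so pre-scale by 1/K.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))
INV_GAIN = round(ONE / GAIN)


def cordic_rotate(x, y, angle):
    """Rotate fixed-point (x, y) by a fixed-point angle in radians.
    Converges for |angle| up to roughly 1.74 radians."""
    # One-time gain correction. It uses a multiply here for brevity.
    x = (x * INV_GAIN) >> FRAC_BITS
    y = (y * INV_GAIN) >> FRAC_BITS
    z = angle
    for i in range(ITERATIONS):          # loop body: shifts and adds only
        if z >= 0:
            x, y = x - (y >> i), y + (x >> i)
            z -= ATAN_TABLE[i]
        else:
            x, y = x + (y >> i), y - (x >> i)
            z += ATAN_TABLE[i]
    return x, y


x, y = cordic_rotate(ONE, 0, round(math.pi / 4 * ONE))
print(x / ONE, y / ONE)        # both should be close to cos(45 deg)
```

Each pass through the loop adds roughly one bit of precision, so the cost is a fixed, small number of shift-and-add steps per rotation.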

Towers of Hanoi:

MuP21 running P21Forth in DRAM with Colon def and video on:            .6 Sec
486 50 running eForth 2.42 (same as P21Forth) with Colon def:          12 Sec


3D coordinate transformation with rotation and clipping of a CUBE:

MuP21 running P21Forth 1.02 in DRAM with Colon def and video on:  20 frames/S
      My version of Dave Lowry's 3D demo.
486 50 running FPC with Colon def (Mark Smiley's graphics demo):   4 frames/S


Multitasking tests:

MuP21 running P21Forth background tasks incrementing counter:         120k /S
MuP21 w/ high speed sram and no video (estimated):                    400k /S
486 50 running eForth 2.42 (same as P21Forth) tasks in background:    40k /S
486 50 running FPC with 1 background task incrementing counter:       60k /S

Loop Tests:

MuP21 running P21Forth : X 1000000 0 DO 34 DROP LOOP ;                  25 S
MuP21 running P21Forth : X 1000000 FOR 34 DROP NEXT ;                    4 S
MuP21 w/ high speed sram and no video (estimated):                       1 S
486 50 running eForth 2.42 (same) : X 1000000 0 DO 34 DROP LOOP ;       17 S
486 50 running FPC : X 1000 0 DO 1000 0 DO 34 DROP LOOP LOOP ;           2 S
486 50 running TCOM (optimizing native code compiler)                    1 S

With native code compiler and inlining optimizations and high speed
     sram and no video output the MuP21 (est)                          <.1 S


Results:
Loop tests, similar high level code (DO LOOP in COLON defs): 486 50 is 50% faster
Loop tests with optimizing compiler: P21 faster

Multitasking in Forth: P21 3 times faster

My version of high level Forth code 3D projection: P21 4 times faster

Towers of Hanoi: P21 20 times faster

CORDIC: P21 faster (10 times faster than a 387).  Much of the 25 to 90
times advantage in the numbers above is because the version run on the
486 is not as optimized.
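
The ratios in this summary can be recomputed from the benchmark figures listed earlier. The Python sketch below does just that; the dictionary layout and the helper name `p21_advantage` are made up for this example, and each 486 figure is the one paired with the P21 figure in the tables above.

```python
# name: (P21 figure, 486 50 figure, True if a larger number is better)
benchmarks = {
    "CORDIC, DRAM, video on (s)": (20e-6, 500e-6, False),
    "CORDIC, SRAM, est (s)":      (6e-6, 500e-6, False),
    "Towers of Hanoi (s)":        (0.6, 12.0, False),
    "3D cube (frames/s)":         (20.0, 4.0, True),
    "Multitasking (counts/s)":    (120e3, 40e3, True),
    "DO LOOP, colon defs (s)":    (25.0, 17.0, False),
}


def p21_advantage(p21, pc486, higher_is_better):
    """Return P21 speed as a multiple of 486 speed (below 1 means 486 wins)."""
    return p21 / pc486 if higher_is_better else pc486 / p21


ratios = {name: p21_advantage(*row) for name, row in benchmarks.items()}
for name, ratio in ratios.items():
    print(f"{name:28s} P21 at {ratio:6.2f}x the 486")
```

The CORDIC rows give about 25 and 83, in line with the "25 to 90" above. The DO LOOP row gives 0.68, i.e. the 486 is roughly 50% faster. The cube figures work out to 5 to 1, a little above the 4 times quoted.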

Jeff Fox

P.S.

As for Mark's "There's no such thing as fast enough!":

My real interest is in F21.  With a 3 times faster processor clock,
a bigger high speed sram interface, and deeper stacks, it is very easy
to implement optimizing native code compilers that will generate
much faster code.  A minimum of an order of magnitude of speed improvement
over P21 will be available.  With a good optimizing compiler a
couple of orders of magnitude can be achieved in many cases.  Since
F21 is designed for multiprocessing, it will be possible to get another
order of magnitude or two in speedup on many problems.  3D projections
are a good example of where there is a lot of natural parallelism, as
you (normally) compute many frames in sequence.

So what would you say to a four orders of magnitude speedup over P21?

We are about to submit the first F21 prototype.  I will soon be posting
the details of the design.