
Re: MISC-d Digest V98 #49


Dear MISC readers:

GARY B. LAWRENCE wrote:

>I will need some time to understand the
>tradeoffs between off page addressing and multiple program words that 
>produce the same result faster. Old 8 bit processors like the 6502 did
>not have fast and slow memory cycles in ram because the processor did
>not outrun the memory.  

One very basic thing one needs to understand about the first three
MISC processors from Chuck Moore (P21, i21, F21) is that they interface
to several types of memory chips and match the signals needed by
those chips.  The large address space on these chips is 1Mx20 DRAM.

Unlike a Pentium or RISC processor, which puts a cache between the
DRAM and the processor to get better performance, we work
directly with the DRAM chips.  These chips use pages internally
and provide faster access for onpage memory accesses than for offpage
memory accesses.  This is true whether you use these memory chips
with a Pentium or a MISC processor.  Even a Pentium can fall back
to the slowest offpage DRAM speed if the data space is large
and access is random enough to stay out of cached memory.
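As a rough illustration of the page behavior, here is a toy page-hit model in Python. The latencies and the 1024-word page size are invented for illustration, not datasheet values for any particular 1Mx20 DRAM:

```python
# Toy model of DRAM fast-page behavior: an access to the currently open
# page ("onpage") is much cheaper than one that forces a page change
# ("offpage").  All numbers here are assumed, not measured.

ONPAGE_NS = 40    # assumed page-hit access time
OFFPAGE_NS = 140  # assumed page-miss access time
PAGE_BITS = 10    # assumed 1024-word pages

def access_cost(addresses):
    """Total nanoseconds for a sequence of word addresses, charging
    the offpage penalty whenever the open page changes."""
    total = 0
    open_page = None
    for a in addresses:
        page = a >> PAGE_BITS
        total += ONPAGE_NS if page == open_page else OFFPAGE_NS
        open_page = page
    return total

# Sequential access stays onpage almost the whole time...
sequential = access_cost(range(0, 1024))
# ...while scattered access that hops pages pays the offpage price
# on every single reference, whatever the processor is.
scattered = access_cost(range(0, 1024 * 1024, 1031))
```

The point of the model is only that the same address stream costs the same on a Pentium or a MISC chip once the cache is out of the picture; the page hits and misses belong to the DRAM, not the processor.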

With systems that use cache, programmers should pay attention to
what will be cached and what will not.  On a MISC design the same
thing is true with regard to the cache-like paging behavior of
the DRAMs themselves and the cache-like speed of SRAM.  On MISC
these things are managed mostly by the programmer rather than
by the compiler and cache controller.

There is a hierarchy of memory spaces with various sizes and speeds.
One of the biggest conceptual issues with MISC is managing it.
Well-designed programs keep things on the stacks for 2ns access,
then go to 20ns SRAM, then 40ns DRAM, then 140ns DRAM, then
250ns FLASH, etc.  A programmer who does not understand this
concept will use many 140ns memory accesses where
another programmer would use a 2ns stack access.
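The cost of ignoring the hierarchy is easy to put numbers on. A small Python sketch, using the tier latencies from the text; the two access mixes are invented for illustration:

```python
# Latencies (ns) of the memory tiers named in the text.
TIER_NS = {"stack": 2, "sram": 20, "dram_onpage": 40,
           "dram_offpage": 140, "flash": 250}

def runtime_ns(access_counts):
    """Total time for a program described as {tier: number_of_accesses}."""
    return sum(TIER_NS[tier] * n for tier, n in access_counts.items())

# Two hypothetical programs doing 1000 operations each: one keeps its
# data on the stacks, the other reaches into offpage DRAM for almost
# every operand.
stack_heavy = runtime_ns({"stack": 900, "sram": 100})         # 3800 ns
dram_heavy  = runtime_ns({"stack": 100, "dram_offpage": 900}) # 126200 ns
```

Same operation count, a factor of about 33 in runtime, purely from where the operands live.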

An introductory example is putting a number on the stack.
We have the # opcode, which works like LIT but only delivers a
20 bit value, so if you need carry set you must also use COM.
That is one or two 2ns opcodes but also two memory references,
which will take from 40ns to 300ns more.  

There are a few special cases where you can save a little time
over a standard literal at the expense of temporarily using
an extra stack location.  DUP XOR, for instance, converts anything
to 0, and DUP DUP XOR creates a 0 faster than loading a literal 0
from memory, because the stack operations are fast and happen
while the next instruction memory access is happening due to
prefetch.  A normal literal requires another full data memory
access that could add 250ns, pretty slow compared to 2ns stack
access.  DUP DUP XOR COM or DUP DUP COM XOR is another example of
a fast 21 bit -1.  It takes a minimum of 18ns, while 0 # COM uses
one less stack location but could take over 900ns because it makes
two or three memory references.
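The identities behind these idioms are easy to check. Here is a toy 21-bit stack machine in Python; it models values only, not F21 timing or opcode packing:

```python
# Minimal behavioral sketch: DUP XOR zeros the top of stack, and
# DUP DUP XOR COM (or DUP DUP COM XOR) pushes a 21-bit -1 (all ones)
# without touching data memory.

MASK = (1 << 21) - 1  # 21-bit cells: the 20-bit # value plus carry

def run(ops, stack):
    for op in ops:
        if op == "DUP":
            stack.append(stack[-1])
        elif op == "XOR":
            b, a = stack.pop(), stack.pop()
            stack.append(a ^ b)
        elif op == "COM":
            stack.append(stack.pop() ^ MASK)  # one's-complement invert
    return stack

# x XOR x = 0, whatever x was.
zero = run(["DUP", "XOR"], [0x12345])                 # [0]
# Original value stays underneath; all-ones -1 lands on top.
minus_one = run(["DUP", "DUP", "XOR", "COM"], [7])    # [7, 0x1FFFFF]
also_minus_one = run(["DUP", "DUP", "COM", "XOR"], [7])
```

Both -1 sequences work because x XOR (x XOR all-ones) is all-ones; the extra cell they leave behind is the stack location the text mentions.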

Simply put, keep things on the stack, in SRAM, or onpage if you
want fast code.

A couple of other things.  P21Forth is a good example of a simple
ANS Forth compiler for P21, using a direct threaded Forth design
with stacks in memory.  It would not run unmodified on F21 because
of differences in the opcode bits and functions between the processors.
But the differences are minor: P21 uses a subset of the instructions
on F21, and only a few of the stack positions on F21 would be needed
for an F21 port.  It is there to answer people's questions about how
many of the functions can be implemented in a simple way on a MISC
processor.

I have done various on chip and target compilers for the MISC processors
and simulated MISC processors.  I have done various threaded and
native code ANS Forth compilers for MISC. I really like the machine
Forth assembler that Chuck provided for his work and which we have
extended and used for various things. It is included in the simulator
and emulator and is very fast, powerful, and easy to use.

I have thought of porting the ANS native code or threaded compilers
to the emulator so that ANS Forth programs could be run after the ANS
Forth compiler is loaded on top of the machine Forth assembler.  Some
people might contribute libraries or various words to the libraries
of such a project, just as Stas and I donated some machine Forth
demo programs for the emulator and simulator.

One other thing regarding P21Forth.  P21Forth was really designed
to run from an SRAM card with a keyboard and video display.  When
you run in ROM you can't edit the BLOCKs, save the system, or do
the really fun things.  The I/O can be redirected to either bit-bang
serial on Offete board #2 or 82c51 serial I/O on Offete board #3,
but it does not support handshaking.  

You either need to load code that redirects I/O to a version using
XON/XOFF if you want to use XON/XOFF, or you have to write a script
for your terminal program to handshake with P21Forth.  That can
be done in various ways.  Just using a delay is not a good solution,
however: you never know how long a line of code might take to
execute, so any arbitrary delay could fail unpredictably.  You can
try waiting for "OK" if you want to synchronize loading files
over a serial or parallel link instead of typing input into P21Forth.
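One way such a wait-for-prompt script could look, sketched in Python. The port here is abstracted as a file-like object (with the real board you would wrap something like a pyserial Serial instance); the "OK" prompt and the EchoPort test double follow from the text, everything else is invented:

```python
# Pace a source file into the target one line at a time, blocking on
# its prompt instead of using a blind delay.

def send_file(lines, port, prompt=b"OK"):
    """Send each line, then wait for the prompt before the next one."""
    for line in lines:
        port.write(line.rstrip(b"\n") + b"\r")
        echoed = b""
        while prompt not in echoed:   # a long-running definition just
            echoed += port.read(1)    # makes us wait longer, safely

class EchoPort:
    """Test double: pretends the target answers 'OK' to every line."""
    def __init__(self):
        self.sent = []
        self.pending = b""
    def write(self, data):
        self.sent.append(data)
        self.pending += b" OK\r\n"
    def read(self, n):
        chunk, self.pending = self.pending[:n], self.pending[n:]
        return chunk

port = EchoPort()
send_file([b"1 2 + .\n", b"words\n"], port)
```

The loop never guesses how long a line takes; it simply refuses to send the next line until the target has spoken, which is the whole point over a fixed delay.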

>  I wanted to know if operations running in static ram were affected
> much by the other processors.

Memory bandwidth is limited.  The coprocessors get first access to 
memory because they get timed access; the CPU only gets what is left
over.  Thus any use of memory for I/O reduces the bandwidth
available to the CPU.

But the paging behavior of DRAM can cause a parasitic effect when
pages thrash between different processors' memory accesses.  Although
things like audio I/O data rates are negligible compared to the
memory speeds, any of the coprocessors can be given the full bus
bandwidth if that is what you want.

Consider also the effect of DRAM paging.  If the CPU is in fast SRAM,
getting full access by itself, it can get a max of 222 mips.  Now if
we also let the video coprocessor fetch pixels every few hundred
nanoseconds from DRAM, it can eat up 20% of the bus bandwidth and slow
the CPU to a max speed of around 180 mips.  However, if both the CPU
and the video coprocessor are running in DRAM, then DRAM access will
no longer be sequential access by video only, and video will no longer
need only a 50ns access every few hundred ns.  Instead it will need a
150ns access, and it will force the next CPU access to take an extra
100ns because that access will be offpage too.  The CPU only gets the
available memory bandwidth, and it doesn't go as far in DRAM as in
SRAM, so now the CPU may only get 25 mips.  And remember, in DRAM even
without video access, if your app stays offpage it will be slow
regardless of your processor.
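A back-of-the-envelope check on those figures. The 222 mips peak, the ~20% video share, and the 40ns/140ns DRAM cycles come from the text; the assumption that one 20-bit instruction fetch carries four packed 5-bit opcodes is mine, used only to convert access times into mips:

```python
PEAK_MIPS = 222              # CPU alone, running in fast SRAM
VIDEO_SHARE = 0.20           # video's share of the bus from the text
OPS_PER_FETCH = 4            # assumed: four 5-bit opcodes per 20-bit word
ONPAGE_NS, OFFPAGE_NS = 40, 140

# CPU in SRAM while video streams from DRAM: it keeps ~80% of the bus.
cpu_with_video = PEAK_MIPS * (1 - VIDEO_SHARE)          # ~178, "around 180"

# CPU fetching code from DRAM: every page hit yields 4 opcodes per 40ns,
# but thrashing against the video coprocessor makes every fetch offpage.
dram_onpage_mips  = 1000 * OPS_PER_FETCH / ONPAGE_NS    # 100 mips
dram_thrashed_mips = 1000 * OPS_PER_FETCH / OFFPAGE_NS  # ~29 mips
```

Under that packing assumption the thrashed figure lands in the same ballpark as the 25 mips quoted above; the exact number depends on how often the CPU also touches data memory.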

> A difference between 20 mips and 222 mips
>would be enough to design inner loops to run in static ram if the video
>processor was being used. 

If you need the speed and can afford a few cheap SRAM chips for the 
node, it is a dramatic difference for a few bucks.  P21 only addressed
1K words of SRAM directly.  F21 can use 4 of the PPort pins for extra
SRAM addressing to get 16K decoded on chip, and could use PPort pins
for bank-selection paging above that.  F21 also has the homepage
mechanism to facilitate mixing SRAM and DRAM calls and jumps 
without having to muck around with the carry bit when crossing
the DRAM - SRAM/REGISTER/ROM boundary.
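The decode-plus-banking scheme amounts to splitting a flat SRAM address. A hypothetical Python sketch; the 14-bit on-chip figure follows from 16K words, but which pins carry which bits is invented here:

```python
# Hypothetical model of F21 extra-SRAM addressing: 16K words (14 address
# bits) decoded on chip when 4 PPort pins serve as extra address lines,
# with further PPort pins selecting a bank above that.  An illustration,
# not a wiring diagram.

ONCHIP_BITS = 14                            # 16K words = 2**14

def split_sram_address(addr):
    """Split a flat SRAM word address into (bank, on-chip offset)."""
    return addr >> ONCHIP_BITS, addr & ((1 << ONCHIP_BITS) - 1)

bank, offset = split_sram_address(40000)    # word 40000 -> bank 2
```

Everything below 16K needs no bank switch at all, which is where tight inner loops would live.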

Jeff Fox