
on-chip memory bandwidth


Somebody (sorry, I inadvertently dropped the mail) said that a memory cycle
time four times the instruction cycle time should be enough.

This is true for programs that use only the stacks for data, but most programs
also use memory for data (instructions lit @ @+ @R ! !+ !R, bit pattern 01xxx).
Four memory-access instructions in a row require 5 memory cycles (the 5th
is for fetching the instructions themselves).
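This cycle count can be sketched with a little arithmetic, assuming (as above)
that one instruction-fetch cycle brings in a word packing 4 instructions:

```python
# Memory cycles needed by a straight-line run of memory-access
# instructions, on a machine that packs 4 instructions per word
# (as the P21 does), so that every 4 instructions cost one extra
# memory cycle for the instruction fetch itself.
INSTRUCTIONS_PER_FETCH = 4

def memory_cycles(n_mem_ops):
    """Data-access cycles plus the fetch cycles for the code itself."""
    fetches = -(-n_mem_ops // INSTRUCTIONS_PER_FETCH)  # ceiling division
    return n_mem_ops + fetches

print(memory_cycles(4))  # 4 data accesses + 1 fetch = 5 memory cycles
```

So a run of memory-access instructions needs roughly 25% more memory
bandwidth than the data accesses alone would suggest.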

On-chip I/O coprocessors will also share the memory bandwidth with the CPU.
Ideally, the memory bandwidth would equal the sum of the maximum bandwidths
required by the CPU and by the coprocessors, so that memory contention would
occur only on actual simultaneous accesses.
The reality of on-chip memory will probably fall far short of this ideal,
although it will still be much better than external memory.

As the memory bandwidth is the product of the bus width by the access rate
(the inverse of the access time), a limit on the access time may be overcome
by widening the bus, which is far less constrained on-chip than off-chip,
where the number of pads is limited by their area and power consumption.
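The trade-off above is simple to quantify; here is a sketch with purely
illustrative numbers (not taken from any actual chip):

```python
# Bandwidth = bus width / access time: widening the bus raises
# bandwidth without needing a faster memory cell.
def bandwidth_bits_per_s(bus_width_bits, access_time_ns):
    return bus_width_bits / (access_time_ns * 1e-9)

narrow = bandwidth_bits_per_s(20, 10)  # 20-bit bus, 10 ns access time
wide   = bandwidth_bits_per_s(80, 10)  # same access time, 4x wider row
print(wide / narrow)  # widening the bus 4x quadruples the bandwidth
```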
The on-chip shared memory may be organized in rows of several words (which
also lets all rows share the address decoder and drivers, the low address
bits being decoded only once to select a single word within a row), so that a
complete row can be transferred at once into a row-wide private memory of the
CPU (the instruction register) or of a coprocessor (such as a buffer for a
video line).
This memory decoupling is already used systematically, though at a smaller
scale, for example in UARTs (the serialization buffers) or in the present
instruction register of the P21 (which holds 4 instructions).

Inter-chip bandwidth is the main bottleneck of most architectures.  On-chip
memory is the only way around it, even if it is orders of magnitude more
expensive than off-chip memory.  Mainstream processors use associative cache
memory, which costs a lot of silicon area for associative address decoding,
and detects cache misses "as late as possible", requiring a high bandwidth
between external and cache memories.  Plain on-chip memory with a programmable
DMA will use less silicon and will allow planned, "as soon as possible"
transfers between off-chip and on-chip memories, requiring less bandwidth
between them.
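The gain from scheduled transfers can be illustrated with a toy timing model
(the cycle counts are invented for illustration): a loop that computes on one
off-chip block per iteration either stalls on each demand miss, or overlaps
DMA transfers with computation.

```python
# Toy timing model with invented numbers.
compute_cycles  = 100  # work per block
transfer_cycles = 40   # cycles to move one block on-chip
blocks          = 10

# Demand-driven ("as late as possible"): each transfer is discovered
# at the miss, so it stalls the CPU and adds to the compute time.
demand = blocks * (compute_cycles + transfer_cycles)

# Scheduled DMA ("as soon as possible"): the next block is fetched
# while the current one is processed, so only the first transfer
# is exposed as a stall.
dma = transfer_cycles + blocks * compute_cycles

print(demand, dma)  # the DMA schedule hides most of the transfer time
```

Note that the overlapped schedule also spreads the transfers out in time,
which is why it needs less peak bandwidth between the two memories.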

The simplicity of misc architectures makes them accessible and gives us the
ability to experiment with and demonstrate these unusual architectural
choices, from which we expect a high performance/resource ratio.

CL