Re: bits, timings, mem maps, etc.


Penio Penev <penev@venezia.rockefeller.edu>:
> Since there are very few possible instructions to execute, x21 executes them
> all in parallel, producing the results of all of them in internal
> registers. The decoding of the instruction merely tells you _which one_ of
> those internal registers to latch to TOS.  Therefore, the adder gets started
> the moment the stack contents get changed and runs freely, propagating
> carries, at all times. This means that the delay is done with a NOP +, versus
> a + NOP.  This means that a + does not need a NOP if it is in the _first_
> slot, _and_ you don't execute from SRAM.

Sorry Penio, you're right: I presented it wrong for the +, the NOPs must come
before it.

But sorry again Penio, you're also wrong, there is no "ALU internal register",
and the parallel execution is not at the "instruction" level, but at the ALU
"operation" level.

* Instructions are executed sequentially.

A slot sequencer extracts each 5-bit instruction in turn from the 20-bit
instruction register and routes it to the instruction decoder, which drives
the input selector of each register.
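
As a rough illustration of the slot extraction, here is a small Python sketch
that splits a 20-bit word into its four 5-bit fields. The function name and
the choice of putting the first slot in the most significant bits are
assumptions made for the example, not something stated above.

```python
SLOT_BITS = 5
SLOTS_PER_WORD = 4

def split_slots(word):
    """Return the four 5-bit fields of a 20-bit word, first slot first."""
    if not 0 <= word < (1 << (SLOT_BITS * SLOTS_PER_WORD)):
        raise ValueError("expected a 20-bit value")
    slots = []
    for i in range(SLOTS_PER_WORD):
        # Assumption: slot 0 sits in the top 5 bits of the word.
        shift = SLOT_BITS * (SLOTS_PER_WORD - 1 - i)
        slots.append((word >> shift) & 0x1F)
    return slots
```

The sequencer would then hand these fields to the decoder one at a time, in
list order.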

* There is no "ALU internal register".

The outputs of T and S are permanently combined by the ALU, which computes
simultaneously (in C notation) ~T, T<<1, T>>1, T&S, T^S and T+S, one of which
is latched into T, depending on the instruction decoding.
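
To make the "compute everything, select one" idea concrete, here is a minimal
Python sketch. It assumes a 21-bit T (the same width as the stack cells) and
a plain logical right shift; the real chip's shift and carry behaviour may
differ, and the names alu_outputs and next_t are hypothetical.

```python
WIDTH = 21                 # assumption: T is as wide as a stack cell
MASK = (1 << WIDTH) - 1

def alu_outputs(t, s):
    """Compute every candidate result from T and S at once."""
    return {
        "~T": ~t & MASK,
        "T<<1": (t << 1) & MASK,
        "T>>1": t >> 1,        # assumption: logical shift
        "T&S": t & s,
        "T^S": t ^ s,
        "T+S": (t + s) & MASK,
    }

def next_t(t, s, selected):
    """The decoded instruction only picks which output is latched into T."""
    return alu_outputs(t, s)[selected]
```

Note that nothing is stored between alu_outputs and next_t: the outputs are
just functions of the current T and S, which is the point being made about
there being no ALU register.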

T may also be latched from the output of either S or R or A.
T also drives the "T" bus, ready to be latched into either S or R or A.

The output of S (resp. R) also permanently drives an "S" bus (resp. "R" bus)
running through a data (resp. return) circular stack of 16 x 21 bits, ready to
be latched into the currently pointed data (resp. return) stack cell. The
neighbour of this cell drives another "S'" bus (resp. "R'" bus) running
through the data (resp. return) stack, ready to be latched into S (resp. R).
A "pushing" or "popping" instruction also moves the (hidden) stack pointer
circularly, in opposite directions for a push and a pop.
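
A minimal Python model of one such circular stack may help. The class name,
the pointer direction, and the convention that the pointer designates the
cell receiving the next push are assumptions; only the 16-cell depth, the
hidden pointer and the wrap-around come from the description above.

```python
class CircularStack:
    DEPTH = 16

    def __init__(self):
        self.cells = [0] * self.DEPTH
        self.ptr = 0   # hidden stack pointer: cell that receives the next push
        self.top = 0   # the visible register (S or R)

    def push(self, value):
        # The old top is latched into the pointed cell, then the pointer moves.
        self.cells[self.ptr] = self.top
        self.ptr = (self.ptr + 1) % self.DEPTH
        self.top = value

    def pop(self):
        # The neighbouring cell is latched into the top register,
        # and the pointer moves the other way.
        value = self.top
        self.top = self.cells[(self.ptr - 1) % self.DEPTH]
        self.ptr = (self.ptr - 1) % self.DEPTH
        return value
```

In this model there is no overflow or underflow: a push beyond the depth
silently overwrites the oldest cell, and popping past the bottom simply wraps
around to cells that were written earlier.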

T may also drive (or be latched from) the memory data bus when either P or A or
R drives the memory address bus. The address bus is permanently input to an
address incrementer, whose output is ready to be latched back to P or A or R at
the end of the memory access. Note that MuP21 doesn't connect R to the address
bus.

The video coprocessor, when activated, competes with the CPU for the access to
the memory address and data buses. The arbiter gives priority to video.

The upper address bits are decoded to select the right chip-select output pin
and the memory signal timing.  In the case of DRAM, the upper bits are
compared with a (hidden) last-DRAM-page register; if they differ, a RAS cycle
is generated with the new upper bits (which are then latched into that
register) before the CAS cycle with the lower address bits.
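
The page comparison can be sketched in Python as follows. The 10-bit column
width, the class name, and the reading that a RAS cycle is issued only when
the page changes are assumptions for illustration; the actual split between
upper and lower address bits is not given here.

```python
COLUMN_BITS = 10   # hypothetical: number of lower (column) address bits

class DramPageTracker:
    def __init__(self):
        # Hidden last-DRAM-page register; None means nothing latched yet.
        self.last_page = None

    def access(self, address):
        """Return the memory cycles needed for one access, in order."""
        page = address >> COLUMN_BITS
        column = address & ((1 << COLUMN_BITS) - 1)
        cycles = []
        if page != self.last_page:
            # New page: RAS cycle with the upper bits, then latch them.
            cycles.append(("RAS", page))
            self.last_page = page
        cycles.append(("CAS", column))
        return cycles
```

Consecutive accesses within the same page therefore need only a CAS cycle,
which is where the speed gain comes from.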

So the only "internal" (hidden) registers are the stack pointers,
the last-DRAM-page register, and the slot sequencer.

The real wonder is that all timings are done with analog delays produced by
transistors of different size ratios. OKAD simulates these delays with dead
simple arithmetic compared with regular SPICE-like simulators, yet accurately
enough (now) to predict the ordering of signal edges, which is only possible
for dead simple architectures such as misc. (Experiments with self-timed
asynchronous processors have been abandoned in favour of self-synchronized
asynchronous processors, which are expected to run faster and to consume less
power than synchronously clocked processors, but which require double-rail bit
encoding, often more silicon, and more complex combinatorial circuits, so it's
hard to predict whether their speed and/or consumption will be better.)

Most silicon designers, used to expensive, black-boxed, resource-hungry CAD
software, to high-level programming interfaces (so high that it's very
difficult to guess the performance implications of a modification), and to
hours of automatic placement/routing, would be really disappointed by OKAD,
because it's too close to silicon and allows only very small designs.
Misc is a very small design where 90% of the chip area is made of only a few
different repetitively tiled macro-cells, so it's humanly feasible to tune it
for (very high) efficiency, provided you can hand-place and hand-route signals
and look at every signal delay, which OKAD allows.

CL