
Re: how does + work (MuP21)?


Greg,
please cool down: short questions have a better chance of getting answers than
long flames.

Jeff, to help MISC newcomers, who keep asking this question, you may publish
this mail on the UT site, as is or revised if you wish.  I admire your patience.

CL
--
How does the MISC + instruction work?  When must it be preceded by NOPs?

The MISC ALU has two input buses, which are _permanently_ driven by the outputs
of the T and S registers (the Top and Second data stack items), and combines
them _permanently_ to compute _simultaneously_ six outputs (in C notation):
~T, T*2, T/2, T&S, T^S, T+S
(T&S and T^S are in fact intermediate outputs of the T+S combinatorial network)
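
For readers who prefer code, here is a small Python sketch of those six
outputs.  It is only an illustration: the 21-bit default width, the function
names, and the use of a logical (unsigned) shift for T/2 are my assumptions,
not something stated above.  The second function is just the arithmetic
identity that makes it plausible for T&S and T^S to fall out of the adder.

```python
def alu_outputs(t, s, width=21):
    """Return all six ALU outputs at once for register values t and s.

    width is an assumed register width; T/2 is modelled as a logical
    shift of an unsigned value, which is a simplification.
    """
    mask = (1 << width) - 1
    t &= mask
    s &= mask
    return {
        "~T": ~t & mask,
        "T*2": (t << 1) & mask,
        "T/2": t >> 1,
        "T&S": t & s,
        "T^S": t ^ s,
        "T+S": (t + s) & mask,
    }


def sum_from_intermediates(t, s, width=21):
    """Rebuild T+S from T^S and T&S: t + s == (t ^ s) + 2 * (t & s)."""
    mask = (1 << width) - 1
    return ((t ^ s) + ((t & s) << 1)) & mask
```

All six values exist at all times; which one (if any) ends up in T is decided
elsewhere, by the instruction decoder.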

These six ALU outputs (as well as A's output, R's output, and the memory data
bus when read) are connected to T's nine-to-one input multiplexer, which is
controlled by the instruction decoder (as are all other register-input
multiplexers):
- the COM instruction (ANSI alias INVERT) selects the ALU ~T output,
- the 2* instruction selects the ALU T*2 output,
- the 2/ instruction selects the ALU T/2 output,
- the AND instruction selects the ALU T&S output (and NIPs the data stack),
- the -OR instruction (ANSI alias XOR) selects the ALU T^S output (and NIPs),
- the + instruction selects the ALU T+S output (and NIPs),
- the +* instruction selects the ALU T+S output ONLY if T's LSB is 1 (no NIP),
- the A instruction (no ANSI equivalent) selects A's output (and pushes data),
- the R instruction (ANSI alias R>) selects R's output (pushes data, pops return),
- the memory read instructions (@ @+ @R LIT) select the memory bus (and push).

The instruction sequencer is self-timed, i.e. its state transitions are
triggered by a clock generated by a switched oscillating circuit whose period
varies with the relative sizes of the transistors in the selected circuit;
this is how different timings are generated for the different memory spaces,
depending on the address MSBits.
Several outputs at different stages of the oscillating circuit are used to
order the transitions of different registers in the same instruction cycle.

An instruction "starts" (and the previous one "ends") when the input of the
instruction decoder switches from one 5-bit slot (of the 20-bit instruction
register) to the next.  The outputs of the instruction decoder, which drive all
the register-input multiplexers and all the register-clock enables, then change
and hopefully stabilize before the "end" of the instruction cycle.

At the "end" of the instruction cycle (i.e. at the beginning of the next
instruction cycle), if T's clock is enabled, then T's input is latched into
T's output, and the same happens for S and for all other registers.
Then, as T's and S's outputs change, the ALU outputs also change (while the
instruction decoder outputs change too), and hopefully stabilize before the
end of the instruction cycle.  The _exception_ is the ALU T+S output, which may
need T and S to stay stable for longer so that the carry can settle; how much
longer depends on the carry propagation length.

The carry propagation hardware is able to propagate about 8 bits of carry
during a + instruction, which is very efficient for the silicon process used
for the MuP21.  This is obtained with a straight carry-chain by saving an
inverter at every bit-slice, which explains why even and odd bits of the ALU
are coded with different polarities.
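
To make "carry propagation length" concrete, here is one way to measure it for
a given pair of operands, as a Python sketch.  The counting convention (the bit
that generates the carry counts as 1, each bit the carry ripples through adds
1) and the 21-bit default width are my assumptions; how this count maps exactly
onto the "about 8 bits" figure is not something I can state precisely.

```python
def longest_carry_chain(t, s, width=21):
    """Return the longest run of bit positions a single carry travels in t + s."""
    longest = 0
    run = 0  # > 0 means a carry is coming out of the previous bit
    for i in range(width):
        a = (t >> i) & 1
        b = (s >> i) & 1
        if a & b:
            run = 1       # this bit generates a fresh carry
        elif (a ^ b) and run:
            run += 1      # this bit passes the incoming carry along
        else:
            run = 0       # no carry out of this bit
        longest = max(longest, run)
    return longest
```

With this convention, adding 1 to 0xFF gives a chain of 8, while adding 0b0101
to 0b1010 gives 0 since no bit position generates a carry.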

If you know at compile-time that the carry-propagation length of an addition
is shorter than 8 bits, you also know that the ALU T+S output will stabilize
before the end of the + (or +*) instruction.  Otherwise, T and S must remain
stable for some time _before_ the + (or +*) instruction is issued, so that the
ALU T+S output can stabilize:
- if you know that the carry length is shorter than 16 bits, for a delay
  equivalent to at least one NOP instruction,
- if you cannot assume anything about the carry length, for a delay equivalent
  to at least two NOP instructions.
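
The rule in this paragraph fits in a few lines of Python.  The function name
and the convention of passing None when nothing is known about the carry
length are hypothetical; the thresholds 8 and 16 are the ones given above.

```python
def nops_before_plus(carry_len=None):
    """Return the NOP-equivalent delay needed before a + (or +*).

    carry_len is the carry propagation length known at compile time,
    or None when nothing can be assumed about it.
    """
    if carry_len is None:
        return 2          # unknown carry length: worst case
    if carry_len < 8:
        return 0          # settles within the + instruction itself
    if carry_len < 16:
        return 1
    return 2
```

So an addition known to carry over fewer than 8 bits needs no padding, and an
addition of two arbitrary values needs the equivalent of two NOPs.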

If the + (or +*) instruction is compiled in the first slot of a program word,
the delay between the end of the previous instruction and the availability of
the + (or +*) instruction is always more than the equivalent of two NOP
instructions.  During this delay, the instruction decoder waits for the end of
the memory access fetching the program word containing the + (or +*) in its
first slot; then the + (or +*) instruction is decoded with the "regular"
timing.

In the shortest case, this memory access to fetch the program word is done in
parallel (prefetch) with the execution of the 4 instructions in the previous
program word (stored in the instruction register), provided these 4
instructions have their MSBit cleared.  The MSBit is set for the memory access
instructions @ @+ ! !+ LIT, and for the control-flow instructions call ret
jump jz jc, which all abort the prefetch to start another memory access, after
which the prefetch restarts.  The prefetch first requires some time for the
address lines to change and stabilize, then has to wait for the memories to
fetch the program word and stabilize the data lines.  Even with the fastest
memory space, the prefetch is longer, by at least the equivalent of two NOPs,
than the execution of the 4 instructions.

If the + (or +*) instruction is not compiled in the first slot of a program
word, then it must be preceded by two, one, or zero NOP instructions, depending
on the assumed carry propagation length.  I personally used two macros, named
.+ and ..+, which insert respectively one or two NOPs before a + unless the +
falls in the first slot of the current program word.
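
The idea behind such macros can be sketched with a toy slot packer in Python.
This is not the original macro source: the Assembler class and its method
names are hypothetical, only the 4-slots-per-word layout comes from the text
above, and literals and other multi-word details are ignored.  The sketch also
stops padding as soon as the + would land in the first slot of the next word,
which is my reading of the rule.

```python
SLOTS_PER_WORD = 4  # a 20-bit program word holds four 5-bit slots


class Assembler:
    """Toy packer: fills program words slot by slot."""

    def __init__(self):
        self.words = [[]]

    def next_slot(self):
        """Slot index (0..3) that the next emitted instruction will occupy."""
        return len(self.words[-1]) % SLOTS_PER_WORD

    def emit(self, instr):
        if len(self.words[-1]) == SLOTS_PER_WORD:
            self.words.append([])  # current word is full, open a new one
        self.words[-1].append(instr)

    def plus(self, nops=0):
        """Emit + preceded by up to nops NOPs; none are needed in slot 0."""
        for _ in range(nops):
            if self.next_slot() == 0:
                break
            self.emit("NOP")
        self.emit("+")


asm = Assembler()
asm.emit("A")
asm.plus(nops=2)  # plays the role of ..+
print(asm.words)
```

Here the + is not in the first slot, so both NOPs are inserted and the word
becomes A NOP NOP +.  Calling plus(nops=2) on a fresh Assembler would emit a
bare + in slot 0.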

--