
Re: pipelining


On Fri, 2 Jun 1995, Francois-Rene Rideau wrote:

>    I understand that carry propagation is some operation that intrinsically
> takes time. 

Yes. Logic operations take two bits as arguments and return one as a
result. For multi-bit operations all arguments are known in advance, so
they can be executed in parallel. Addition takes three bits as arguments
(two summands and a carry) and returns two bits as results -- low order and
high order (result and a carry). For multi-bit additions, the output of one
operation -- the carry -- is the input for the neighboring one. Carry
propagation is a serial operation. 
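A minimal sketch of the difference, in Python. The function names are
made up for illustration, and this models the arithmetic only, not the
actual circuit on any chip: OR treats every bit position independently,
while each step of the addition has to wait for the carry from the step
before it.

```python
def full_adder(a, b, carry_in):
    """One-bit addition: three input bits, two output bits (sum, carry_out)."""
    total = a + b + carry_in
    return total & 1, total >> 1


def ripple_add(x, y, width):
    """Add two width-bit numbers one bit at a time, low bit first.

    Returns (sum modulo 2**width, final carry out).
    """
    result = 0
    carry = 0
    for i in range(width):
        # Bit i cannot be computed until the carry out of bit i-1 is known,
        # so this loop is inherently serial.
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry


def bitwise_or(x, y, width):
    """OR has no dependency between bit positions; each bit stands alone."""
    return sum((((x >> i) & 1) | ((y >> i) & 1)) << i for i in range(width))
```

For example, ripple_add(0xFFFF, 1, 16) has to carry through all 16
positions before the top bit is settled, whereas bitwise_or could compute
all 16 bits at once.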

> On the P21, this may be bearable, but as word size grows, I'm
> not sure it's a good idea to require NOPs, which drastically reduces
> throughput.

There are three timings involved: the elementary operation (EO), which is
~300ps on the F21; the Internal Clock (IC), ~3,600ps = 3.6ns, or roughly 12
EO; and the memory access time, which is the memory setup time (~3ns) plus
the wait for the SRAM (~12ns).

OR takes 1 EO to produce the result and several EO to latch it to TOS,
thus executing within 1 IC. 

8 add-with-carry steps take 8 EOs plus several to latch the result to TOS,
thus executing in 1 IC. If one waits 1 IC more _before_ latching the
result, 12 more EO are available for 12 more adc steps, giving the carry
time to propagate further. 
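The arithmetic can be sketched in a few lines of Python. The constants
are the approximate figures quoted above; the function names are
hypothetical, and the sketch assumes that one adc step moves the carry by
one bit position and that every extra IC contributes a full 12 steps.

```python
EO_PS = 300            # elementary operation, ~300 ps
EO_PER_IC = 12         # one internal clock is roughly 12 EOs
STEPS_IN_FIRST_IC = 8  # adc steps that fit in the first IC before the latch


def adc_steps_available(extra_ics):
    """adc steps completed if the latch is delayed by extra_ics clocks."""
    return STEPS_IN_FIRST_IC + EO_PER_IC * extra_ics


def extra_ics_needed(carry_chain_bits):
    """Smallest delay (in ICs) that lets a carry travel carry_chain_bits."""
    extra = 0
    while adc_steps_available(extra) < carry_chain_bits:
        extra += 1
    return extra
```

Under these assumptions no wait is needed for carries that travel at most
8 bits, and one extra IC covers carry chains of up to 20 bits.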

>    Perhaps then, on the P32 should the addition be pipelined:

A pipeline is useful when the various stages of the instruction take 
different resources to complete. What are the resources needed for the 
addition?

TOS holds the bits of the first operand. SOS is used to initialize the bits
of the result register, internal to the ADC unit. There is also another
internal register, the carry register. 

Suppose you want to do a 2/ (or 2*, or any other ALU or stack manipulation
operation, for that matter). You need the same resources -- TOS at 
least. No pipelining is possible.

A 16-bit ADC needs at least 16 EOs to complete, using this (or any other)
algorithm. It just takes that long. You have to wait at least that much
from availability of operands to availability of results. You cannot
pipeline if you need the result for the operation that follows (which
is a shift in the most interesting case, multiplication). The only
things you can possibly overlap it with are A@ , R@ , CALL , JMP , RET . 
Not terribly frequent. So why bother? Why spend chip area = R&D
time/money, production costs, heat dissipation, to take care of it? 

Well, (inst inst RET +) may be useful from time to time if RET allowed
the instructions after it, up to the end of the word, to be executed. Is
this the case with F21, Jeff?  I guess this will be the behavior of P32
anyway, since RET does not occupy a whole instruction slot. 

> instead of requiring three to five NOPs after a +, just have + pop one

_before_ the + . + is a signal to latch the 

> instead of requiring three to five NOPs after a +, just have + pop one 
> or two arguments, compute in a pipeline, and push the result or modify TOS.
> Wouldn't that help ? 

One will need _at least_ two instructions anyway -- one to "issue" the +
(to tell the ADC unit which operands to take) and one to "collect" (to
latch) the result. This precludes a 1-IC + on short operands. It does not
buy you any speed in the current NOP + scenario (the most probable one),
and it does not buy you anything in multiplication schemes. 

> If addition is the slowest operation, 

Well, the slowest operation is a memory fetch, and will continue to be so. 

> this could also
> allow to increase the speed for other operation and just have + delay
> more its result in ticks.

I did not get this. The other (non-memory) operations _do operate_ as fast
as possible with the fabrication technology. (Well, five times as fast :-)

>    BTW, what currently happens if one doesn't do enough NOPs ?

It depends on the state of the carry register at the time of the latch.
If it holds only zeroes, the result is correct. This is the expected
mode of operation for short operands, or for performing an OR on masked
operands. If the carry register contains information, you lose it, and
the result is incorrect. 
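A rough Python model of this behaviour follows. It is an assumption made
for illustration, not the documented F21/P21 circuit: the result register
starts out as SOS, the pending addend starts out as TOS, and each step
replaces them with their XOR and their shifted AND. The name
add_with_latch and the default width of 16 are hypothetical.

```python
def add_with_latch(tos, sos, steps, width=16):
    """Model latching the adder after a fixed number of steps.

    Returns (latched result, leftover carry register). The latched result
    equals the true sum modulo 2**width only if the leftover carry is zero.
    """
    mask = (1 << width) - 1
    result = sos & mask   # result register, initialised from SOS
    pending = tos & mask  # bits still to be added in (TOS, then carries)
    for _ in range(steps):
        # XOR adds without carrying; AND finds the positions that generate
        # a carry, which moves one bit to the left for the next step.
        result, pending = result ^ pending, ((result & pending) << 1) & mask
    return result, pending
```

In this model, add_with_latch(0x00F0, 0x0F00, 1) gives (0x0FF0, 0): the
operands share no bits, so the sum is the same as their OR and no carry
is left behind. add_with_latch(1, 0xFFFF, 8) gives (0xFF00, 0x0100): the
carry register is not empty at latch time, so the latched value is wrong.
With 16 steps the same operands give (0, 0), which is the correct 16-bit
sum.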

>    Am I missing the point ? Am I being mistaken as for where the problem is ?

Maybe it is time for a FAQ :-) I have asked Jeff exactly the same
questions more than three times myself. 

--
Penio Penev <Penev@venezia.Rockefeller.edu> 1-212-327-7423