
Re: pipelining


Francois-Rene Rideau:
      I understand that carry propagation is an operation that
   intrinsically takes time.  On the P21 this may be bearable, but as
   word size grows, I'm not sure it's a good idea to require NOPs,
   which drastically reduces throughput.
      Perhaps, then, on the P32 the addition should be pipelined:
   instead of requiring three to five NOPs after a +, just have + pop
   one or two arguments, compute in a pipeline, and push the result or
   modify TOS.  Wouldn't that help?  If addition is the slowest
   operation, this could also allow the other operations to run
   faster, with + just delaying its result by more ticks.
      Am I missing the point?  Am I mistaken about where the
   problem is?
      BTW, what currently happens if one doesn't do enough NOPs?

Currently, Chuck is using ripple carry, which propagates 8 carries per
instruction slot.  Also, a couple of carry slots are used at the
beginning while the base values (AND, XOR) are being computed.  If you
leave out NOPs and a carry needs to propagate a long way, then you'll
have an erroneous result (somewhere along the line a carry won't be
added in).

However, introducing a pipeline wouldn't help much here -- the delay
for carries has to fit into the interval between the previous
instruction that modifies the stack and the add instruction itself.  A
pipeline would give you more setup time between memory fetch and
instruction execution, but that's not where the time is needed.

Also, if you delay the execution of the ADD instruction to give carry
a chance to settle, you're essentially just doing the NOPs
"automatically".  The time delay is still there, it's just harder to
see.

The "right way" to fix this problem is to work around the delays
introduced by ripple carry.  One technique is called "carry
prediction".

When you add two numbers using ripple carry, you use EXCLUSIVE OR to
generate the bitwise sum.  You also use AND to generate the bitwise
carry.  Take one of these things (two bits in, one for each argument
-> two bits out, one of which is carry) and you've got yourself a
"half adder" -- it's got carry out, but not carry in.  If you like,
you can construct a "full adder" by sticking two of these boxes
together.  A full adder built this way looks something like this:




           X1  Y1
           |   |
          +-----+
          | half|
          | add |
  Carry   +-----+
      in    |  \
       \    |   \
	|   |    |
       +-----+   |
       | half|   |
       | add |   |
       +-----+   |
	 |  \    |
         Z1  \   |
	      |  |
	    |\____/|	Note: this can be an exclusive or gate -- you'll
	    |      |          never have both inputs with true values at
	    |  or  |	      the same time.
             \    /
	      \  /
	       \/
		\
		 |
		Carry out
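
In case a software model helps, here's the same construction sketched
in Python (purely for illustration -- the function names are mine, and
none of this is taken from Chuck's actual layout):

  def half_add(x, y):
      # Two bits in, two bits out: sum = XOR, carry = AND.
      return x ^ y, x & y

  def full_add(x, y, carry_in):
      # Two half adders plus the final OR gate from the diagram
      # (an XOR works too -- the two carries are never 1 at once).
      s1, c1 = half_add(x, y)
      s2, c2 = half_add(s1, carry_in)
      return s2, c1 | c2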



OK, so if we call this construct a full adder (and there may be better
ways to build one), you can string them together to implement addition
with ripple carry like this:

  X0  Y0     X1  Y1     X2  Y2     X3  Y3       X30 Y30    X31 Y31
  |   |      |   |      |   |      |   |        |   |      |   |    
 +-----+    +-----+    +-----+    +-----+      +-----+    +-----+   
 | full|    | full|    | full|    | full|      | full|    | full|   
 | add |--->| add |--->| add |--->| add |-...->| add |--->| add |---\
 +-----+    +-----+    +-----+    +-----+      +-----+    +-----+   |
   |         |          |          |            |          |        |
   Z0        Z1         Z2         Z3           Z30         Z31     Z32
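
In the same toy model, stringing full adders together along the carry
line looks like this (the final carry out is the Z32 in the picture --
again just a sketch, not the real circuit):

  def ripple_add(x_bits, y_bits):
      # x_bits and y_bits are lists of 0/1, least significant bit first.
      z_bits, carry = [], 0
      for x, y in zip(x_bits, y_bits):
          s, carry = full_add(x, y, carry)
          z_bits.append(s)
      return z_bits, carry            # sum bits, plus Z32 (carry out)

For example, ripple_add([1,1], [1,0]) gives ([0,0], 1), i.e. 3 + 1 = 4
with both sum bits clear and the carry out set.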

Anyways, the point of all this is that you can speed up this whole
operation if you don't have to wait such a long time for all the
carries to propagate the full distance.

One thing to notice is that if all the bits leading up to and
including a specific position are ones, you're guaranteed a carry out
of that position.  So you can predict a carry out of that position
using a simple AND gate; if any of those bits is a zero, the gate
simply makes no prediction, and the carry (if there is one) ripples
out the slow way.

For example, let's say you can have 17-input AND gates, and that you
wire them up like this:

1 X0 Y0 X1 Y1 X2 Y2 X3 Y3 X4 Y4 X5 Y5 X6 Y6 X7 Y7   17AND   TO C7
C7 X8 Y8 X9 Y9 ... X15 Y15                          17AND   TO C15
C15 X16 Y16 X17 Y17 ... X23 Y23                     17AND   TO C23
C23 X24 Y24 X25 Y25 ... X31 Y31                     17AND   TO C31 \ Z32
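
In the toy model, one of these fat AND gates is just the following (my
sketch of the idea; the block size of 8 matches the wiring above):

  def predict_block_carry(carry_in, x_block, y_block):
      # 17-input AND gate for one 8-bit block: it asserts the block's
      # carry out early when the incoming prediction and every operand
      # bit in the block are all 1.  Functionally this changes nothing
      # -- when the gate fires, the ripple chain would have produced
      # the same carry anyway -- the payoff is purely in gate delays.
      out = carry_in
      for x, y in zip(x_block, y_block):
          out &= x & y
      return out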

Using the design I outlined above, you'd have 3 gate delays per full
adder, for a total of 96 gate delays for a 32-bit add.  If I stick in
these fat AND gates, I know that my carry never has to propagate more
than 8 slots before it hits an AND gate.

Of course, this isn't completely ideal, as one of the AND gate inputs
at each stage varies with complex logic from the previous stage.
Worst case is an unpredicted carry -- but that's a maximum of 7 full
adder slots plus up to three of these big AND gates, which is a
maximum of 24 gate delays (a factor-of-four speedup).  [Unless I've
made a mistake in my thinking.]  So, assuming that Chuck is using the
same stupid full adder I've described here, he can probably cut down
the ripple carry time for the F32 from 5 to 2 instruction slots --
which leaves it pretty close to what the MuP21 is like now.
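
Spelling out the arithmetic behind those numbers (same gate-delay
assumptions as above):

  plain ripple carry:  32 full adders * 3 gate delays             = 96
  with prediction:      7 full adders * 3  +  3 big AND gates * 1 = 24

which is where the factor of four comes from.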

Note that this assumes that the gate delay for a fat AND gate is the
same as the gate delay for a skinny AND gate...

Also, note that there's a number of variations on this theme, so I
haven't a clue how fast you can make something like this go.

Finally, note that I don't know how to read Chuck's gate layout
images, so I don't have a clue how the MuP21 is actually implemented.
[If anyone can suggest how I might recognize a transistor, etc.  I'd
greatly appreciate it.]

-- 
Raul D. Miller