home .. forth .. misc mail list archive ..

Re: P64 Math


>>On Thu, 31 Aug 1995, Steve Atkins wrote:
>>
>>> >>Readers,
>>
>>> [ which adder architecture ]
>> 
>>> When you hit 32 bits I think it's worthwhile to use some sort of carry
>>> propagate acceleration. Something like a carry skip adder doesn't cost too
>>> much silicon, and will be *much* faster than a ripple adder for wide words.
>>
>>Can you name numbers? At least a qualitative factor. It still scales
>>linearly to bit length, I presume?

When I find the book with the decent analysis in it I'll post the
numbers. (Someone `borrowed' it).
>> 
>>> I've been looking at 64 bit arithmetic recently, and (for the silicon we use)
>>> the tradeoffs do seem to change between 32 & 64 bit operations. For a fast 64
>>> bit processor, where addition may be a bottleneck, a good optimised adder would
>>> be essential - either a carry skip adder for medium speeds, a hybrid carry select
>>> adder (such as the one the Alpha uses) for fast speeds. If the word width grew
>>
>>Can you outline their principles, at least roughly? (My numerics 
>>hardware book is a bit outdated :(

Certainly....

-*- text -*- ~atkinss/add.txt

Notation:

x_i is bit i of the number x, in standard twos complement form.

 .  is logical and
 +  is logical or
 #  is logical xor

Basics
======
We're adding two n-bit numbers a & b, and a carry-in c_0 to give an
n-bit result z, and a carry-out c_n

   z = a plus b plus c_0

The operands and result are of the form

   a = { a_n-1, a_n-2, ... a_1, a_0 }

and so on.

Logically addition is this:

   For all 0 <= i < n

   z_i = a_i # b_i # c_i

   c_i+1 = x_i . y_i + c_i . (x_i + y_i)                                 [1]

<This would be so much clearer in TeX or HTML3......>

This is often reformulated in terms of propagate (p) and generate (g) signals:

   g_i = a_i . b_i                                                       [2]

   p_i = a_i # b_i                                                       [3]

[ The alternative p_i = a_i + b_i is sometimes used ]
                                             _________
[ Often a kill signal (k) is used too: k_i = a_i + b_i ]

Then the carry signals can be given as:

   c_i+1 = g_i + c_i . p_i             substitute [2] & [3] in [1]       [4]

and

   z_i = p_i # c_i

The expensive bit about addition is generating c_i. Once all the c_i
are generated the z_i can be generated in constant time (not a negligable
time, but at least it's constant with operand length).

Different adder architectures
=============================

Ripple Carry (RCA)
~~~~~~~~~~~~

A ripple carry adder (RCA) implements [1] directly. You all understand
ripple carry adders, so I'll just say that they are cheap and slow.

The critical path in an RCA is from c_0, a_0 or b_0 through all the
full adders to z_n-1 or c_n.

Carry lookahead (CLA)
~~~~~~~~~~~~~~~

For a four bit adder we can expand [4] to give

   c_1 = g_0 + c_0.p_0                                                [5-8]

   c_2 = g_1 + g_0.p_1 + c_0.p_0.p_1

   c_3 = g_2 + g_1.p_2 + g_0.p_1_p_2 + c_0.p_0.p_1.p_2

   c_4 = g_3 + g_2.p_3 + g_1.p_2.p_3 + g_0.p_1.p_2.p_3 + c_0.p_0.p_1.p_2.p_3

This is reasonable for 4 bits, but will get grotesquely large in
terms of are and fan-out if taken much further.

One option is to divide the operands into groups of four bits, use
carry-lookahead within each group and ripple the carries between groups.

 * Divide and Conquer

If we can improve on speed of rippling carries within each group, then
surely we can improve on the speed of rippling carries between groups
by a similar approach.

Consider Group_0, consisting of (g_0 - g_3, p_0 - p_3)

Defining a `group generate' g* and a `group propagate' p*:

For Group_0:

   g*_0 = g_3 + g_2.p_3 + g_1.p_2.p_3 + g_0.p_1.p_2.p_3

   p*_0 = p_0.p_1.p_2.p_3

and identically for all other groups.

Now

   c_4 = g8_0 + c_0.p*_0

   c_8 = g*_1 + g*_0.p*_1 + c_0.p*_0.p*_1

   c_12 = g*_2 + g*_1.p*_2 + g*_0.p*_1.p*_2 + c_0.p*_0.p*_1.p*_2

   c_16 = [deleted - it's > 80 characters wide....]

These are identical to equations [5-8] used to generate carries within
groups.

There's no need to stop there.

The four-bit groups can be combined into sixteen-bit groups of groups.

Each can produce generate (g**) and propagate (p**) signals, combined as
above to give c_16, c_32, c_48 and c_64.

This divide and conquer approach will eventually generate c_i for the
entire word width.

   <This would be a lot clearer with diagrams>

For board level adders CLA is pretty good, 'cos it can use standard MSI
lookahead generator ICs for nearly everything.

It's not *bad* in CMOS, but it's not particularly good either. I'd use
one of the following architectures in preference. The only exception might
be in a cell-based design where an optimised lookahead cell is provided.

Carry Skip adder
~~~~~~~~~~~~~~~~

"In VLSI technology the carry-skip adder is comparable in speed to the carry
 look-ahead technique (for commonly used word lengths) while it requires
 less chip area and consumes less power."
                -- Computer Arithmetic Algorithms, Israel Koren

...and that's why it's the adder I'd use for a 32 bit system, and probably
for a 64 bit system too.


Carry propagation can skip any stage for which p_i = 1 (ie a_i != b_i).
Several consecutive stages can be skipped if p_i =1 for each stage.

A carry-skip adder is divided into groups of consecutive stages, with a
simple ripple carry scheme in each group.

Each group generates a group propagate signal, p*_i.

For Group_i, consisting of k stages j, j+1, ... j+k-1

   p*_i = p_j . p_j+1 . ... . p_j+k-1

This is used to allow an incoming carry into the group to skip over all the
stages in the group and generate a group carry out.

   Group_i-Carry_out = c_j+k + p*_i . Group_i-Carry_in

   (c_j+k is the normal ripple carry out from the most significat stage
    in the group)

The critical path through a carry-skip adder is via ripple carry through,
one of the groups, and via the skip carry chain through the remainder of
the groups. (Think about it. A carry coming out of a group via the ripple
chain *must* have been generated within the group - if it was generated
before the group, p*_i would have been 1 and the stage would have been
skipped. So the critical path will travel through only one group).

In a carry-skip adder the groups will not all be the same size. The optimal
division of an n-bit adder into carry-skip groups depends on the
characteristics of the target technology.

A 32 bit adder might have 10 groups of sizes {1, 2, 3, 4, 5, 6, 5, 3, 2, 1}
for a typical technology.

VLSI implementations of carry-skip adders tend to be quite small - using the
32 bit adder given above the extra cost over an RCA is about 20 extra gates.

While a single level of carry-skip speeds things up a lot a second level of
skip, skipping over more than one group can speed things up a little further
for very little extra cost.

Carry Select adder
~~~~~~~~~~~~~~~~~~

The reason carry propagation is slow in a ripple adder is because each stage
needs to have a_i, b_i and c_i available before it can calculate c_i+i.
One way of removing this dependency on c_i is to calculate both
a_i + b_i + 0, and a_i + b_i + 1, then choose the appropriate result when
c_i becomes available.

This is the basic trick of the carry select adder.

For 32bit operands the 32 stage adder could consist of four 8bit
groups.

Group_0 is purely an 8bit ripple carry adder, with c_0 as it's carry input.
The sum output from this RCA goes directly to the adder output.

Group_1 has two 8bit ripple cary adders, one with a Cin of 0, the other with
a Cin of 1. The sum outputs from these two RCAs go to a 2:1 multiplexor
controlled by the Cout of Group_0. The output of the mux goes to the adder
output.

Group_2 has two 8bit RCAs, the same as group_1. The sum outputs of this go
to two 2:1 muxes. One is controlled by the Cout of the Group_1 chain with a
Cin of 0, so the mux output is what the sum would be if the Cout of Group_0
were 0. The other is controlled by the Cout of the Group_1 chain with a Cin
of 1, giving the sum if the Cout of Group_1 were 1.

These two mux outputs are fed to another 2:1 mux, controlled by the Cout of
Group_0, to select the correct sum to send to the adder output.

Group_3 is similar. The sums from the two RCAs go through two 2:1 muxes
controled by the two Couts of the two RCAs in Group_2, giving two values, one
if the Cout of Group_1 is 1, one if it is 0.

These two values are passed to another pair of 2:1 muxes, controlled by the
two Couts of Group_1, giving two values, one if the Cout of Group_0 is 1, the
other if it is 0.

The correct one of these two values is selected by a final 2:1 controlled by
the Cout of Group_0.


<This is far too complicated for ASCII art at this time of night. If there's
 any interest I might HTMLify it>


Carry select adders tend to be slightly faster than skip adders, particularly
for wide operands. They will typically consume a little over twice the area
of a ripple carry adder. The layout for a select adder tends to be very
regular, which can be a big advantage for datapath compilers/tilers.

One of the DEC Alphas uses a slight variant on this approach to do a 64bit
add at 200MHz with one cycle of latency in a 0.75um technology. There's a
paper in Vol 27, No 11, November 1992 of the IEEE Journal of Solid-State
Circuits giving a few paragraphs about the adder, and a nice diagram.


Prefix adders
~~~~~~~~~~~~~

These are bizarre to think about, but very powerful, particularly for longer
word lengths.

It's a little mathematical in places....

Define an operator 'o':

   (g,p) o (g',p') = (g + (p . g'), p . p')

Define G_i & P_i:

   (G_1,P_1) = (g_1,p_1)

   (G_i,P_i) = (g_i,p_i) o (G_i-1,P_i-1)   2 <= i <= n

Then

   c_i = G_i   for   1 <= i <= n                                   [1]

There's a fairly easy, but not too interesting, inductive proof of [1].

Next, 'o' is associative, ie

   (g_1,p_1) o ( (g_2,p_2) o (g_3,p_3) )

   = ( (g_1,p_1) o (g_2,p_2) ) o (g_3,p_3) for all (g_i,p_i)       [2]

Again I'll miss out the proof - it's just an 'expand both sides and notice
they are identical'.


So to find c_i it suffices to calculate

   (G_i,P_i) = (g_i,p_1) o (g_i-1,p_i-1) o ... o (g_1,p_1)

and by [2] this can be calculated in any order.


  The Brent-Kung architecture
  ===========================

(The original reference is Brent & Kung, IEEE Transactions on Computers,
 Vol C-31,No 3, March 1982)

First consider the simple problem of calculating just c_n, for n=16.
Since 'o' is associative (G_16,P_16) can be generated by a binary tree:


Each wire in the diagram carries a pair of signals (g,p). (g_i,p_1) are
fed in at the base.

Each node 'O' performs this operation:

(g_in,p_in) o (g'_in,p'_in)

                 |
                 |
                 O
                 |\
                 | \
                 |  \
                 |   \
        (g_in,p_in)  (g'_in,p'_in)




            (G_16,P_16)
                 |
                 |
                 O
                 |\____
                 |     \_______
                 |             \________
                 |                      \
                 O                       O
                 |\                      |\
                 | \__                   | \__
                 |    \__                |    \__
                 |       \__             |       \__
                 |          \            |          \
                 O           O           O           O
                 |\_         |\_         |\_         |\_
                 |  \_       |  \_       |  \_       |  \_
                 |    \      |    \      |    \      |    \
                 O     O     O     O     O     O     O     O
                 |\    |\    |\    |\    |\    |\    |\    |\  
                 | \   | \   | \   | \   | \   | \   | \   | \
                 |  \  |  \  |  \  |  \  |  \  |  \  |  \  |  \  
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
(g_i,p_i)       15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0


This will give (G_16,p_16) and hence c_16. To generate c_15 downto c_1
another similar tree is required:


c_i             16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |  O  |  O  |  O  |  O  |  O  |  O  |  O  |  |
                 |  |\ |  |\ |  |\ |  |\ |  |\ |  |\ |  |\ |  |
                 |  | \|  | \|  | \|  | \|  | \|  | \|  | \|  |
                 |  |  O  |  |  |  O  |  |  |  O  |  |  |  |  |
                 |  |  |\_|  |  |  |\_|  |  |  |\_|  |  |  |  |
                 |  |  |  \_ |  |  |  \_ |  |  |  \_ |  |  |  |
                 |  |  |  | \|  |  |  | \|  |  |  | \|  |  |  |
                 |  |  |  |  O  |  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |\_|  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |  \__|  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  \__|  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  \  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  |\ |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  | \|  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 O  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |\____|  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |  |  \_______ |  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  | \________ |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  | \|  |  |  |  |  |  |  |
                 O  |  |  |  |  |  |  |  O  |  |  |  |  |  |  |
                 |\ |  |  |  |  |  |  |  |\ |  |  |  |  |  |  |
                 | \__ |  |  |  |  |  |  | \__ |  |  |  |  |  |
                 |  | \__ |  |  |  |  |  |  | \__ |  |  |  |  |
                 |  |  | \__ |  |  |  |  |  |  | \__ |  |  |  |
                 |  |  |  | \|  |  |  |  |  |  |  | \|  |  |  |
                 O  |  |  |  O  |  |  |  O  |  |  |  O  |  |  |
                 |\_|  |  |  |\_|  |  |  |\_|  |  |  |\_|  |  |
                 |  \_ |  |  |  \_ |  |  |  \_ |  |  |  \_ |  |
                 |  | \|  |  |  | \|  |  |  | \|  |  |  | \|  |
                 O  |  O  |  O  |  O  |  O  |  O  |  O  |  O  |
                 |\ |  |\ |  |\ |  |\ |  |\ |  |\ |  |\ |  |\ |
                 | \|  | \|  | \|  | \|  | \|  | \|  | \|  | \|
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
(g_i,p_i)       15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0




There are many varying architectures for prefix adders, driven by
speed/area/layout complexity tradeoffs.




--
-- Steve Atkins -- <atkinss@inmos.co.uk> -- +44 454 611439 --

<Squeak!> <Squeak!> <Squeak!>....... <Splash!>

  [ It's a rat leaving a sinking ship. I'm off to AMD......]