home .. forth .. misc mail list archive ..

letter to chuck (fwd)





This is a cc of my letter to Chuck Moore. Since it may
contain matters of general interest I decided to make it
generally available.

Thanks for nice elaborate comments on my post. Sad thing
Chuck doesn't have the money to go really big. Really sad.

------------------------------------------------------------

Dear Chuck,

I know you are a very busy man, but I had to write you this. I
do really need you oppinion.

I'll try to be succinct without leaving out too much. Yet still:
this is a longish mail. About 35 kBytes long.

You surely know how grateful we forthers Out There are for your
language jewel, Forth, and its silicon incarnation, Novix. So I
skip the thanks which would otherwise fill many lines.

During the period of recent 3-6 months I developed a very
detailed (almost down to the gate level) design outline of a
novel CPU/computer architecture. Several guys found it nice,
Dick Pountain of Byte called it even "radical".

This gave me the courage for asking Chuck for an oppinion. (We
all know that he does impossible things even before breakfast :)

I tried to analyze what's going wrong in current CPU designs,
stole shamelessly from (very few) good ones (Novix scores very
high here, haven't seen MuP/F21 yet) and extrapolated current
trends way into the future.

Since the reasons why the final beast is the way it is are many
and interact subtly, a mere list of features without elucidation
would be pointless. So I _have_ to begin with some superficial
analysis, what's currently going wrong, to be able to show, that
the ways of doing it right are not many and that they converge
towards one attractor. (From this sentence you must have noticed
I am a German :)

Skip below passage if you are getting bored.

1. What's going absurdly wrong
   (and some obvious counter measures)

The main problem of current high-clock high-integration CPUs is
low memory bandwidth and logic utilization ratio (bang for the
buck). An elaborate machinery of cache hierarchy and "crystal
ball" precognition engines (branch prediction units) is used to
handle it. It does not work well (a lot of real-world/scientific
problems have low data locality, which causes frequent cache
misses), wastes silicon die as well as motherboard space, burns
power and is not cheap, since both die area, integration density
(+thermal load) as well as defect die yields grow. Design
reiteration cycles grow longer due to increased complexity.
Probability of glitches slipping into production undetected goes
up. Etc. Recent monster CPUs (P6, MIPS & Co.) waste as many as
70-90 (!) percent of die space (FPU included) for this
machinery. RISC is not RISC anymore.

Why are DRAM accesses several orders of magnitude slower as
compared to transistor switching time scale?

There is one main reason:

A microscopic low-power DRAM cell has drive a die-local bus
wire, all the way to the driver, which has to do the same for
the macroscopic bus wire. This is a severe strain on a DRAM cell
charge pump. An SRAM cell is able to bring a wire volume to
logical high much faster, the reason why SRAM cells are always
faster than DRAM cells. Of course, they take four to six times
the transistor and die space of a DRAM cell. (But a small (1-4
kByte) fast SRAM for the core virtual machine which is invoked
very frequently sounds like a good investment. Moreover, caching
stack/code segments in SRAM registers is straightforward).

Increasing bus width also increases the amount of pins (a major
cost/defect/power dissipation factor), leading to costly ceramic
pin-bristling monster packages, which introduce defects and need
elaborate sockets.

If we integrate CPU/RAM/Router on one chip, we would have

 - tiny wire geometries (low inductivity/capacitance, we
   can trade low power for high frequency of vice versa
   there is a limit on minimum wire geometry where the
   speed begins to decrease)

 - very short distances (low signal propagation delays,
   essential especially for very high frequencies)

 - no drivers (less power burned, more available die space)

 - no contact pads (less power burned, more available die space)
   ((Re)charging alone a single bond pad at radio frequency burns
   about 1 mW of power)

 - reclaim cache/BPU transistor resources for CPU/RAM/Router

 - potentially extremely broad buses (since no bulky drivers/
   bond pads necessary)

 - is a prerequisite for wafer-scale-integration (WSI),
   using only cheap wafer-local high speed serial links
   (WSI eliminates the need for testing/die cutting/repackaging,
   a major cost/defect factor. It needs defect tolerance/self-healing,
   of course. Later more on that.)

 - we can (actually, we must) turn medium-grained parallel
   processing (since die complexity is limited. This means
   we must turn extremely minimalistic both at the CPU
   as well as OS complexity level/size)

Summary so far: there are excellent reasons of trying to
integrate CPU and RAM on one die. New constraints introduced are
few, not very restrictive and point towards the future: we'd
need to deal with them sooner or later, anyway.

Some new constraints/features, arising from CPU/RAM-WSI
integration:

 - since we need high nondefective die yield:

   - CPU must be very minimalistic (since low defect hit probability
     is necessary), best a MISC one.

   - RAM area is small (128 kByte - 2 MByte, depends on integration
     density/die size), hence

     - code density must be very good (this spells threaded code, this
       in its turn needs machinery for low subroutine call overhead)

     - the OS must be very small (nanokernel), robust, be able to live
       in and handle maspar networks. (Science fiction? Taos, a
       high-performance, nanokernel (12 kByte) hardware-independant
       maspar networking OO OS by a small UK company, comes
       _very_ close. It's on market, already)

   - we must tolerate defective RAM words by recording/mapping them
     in a list (akin transparent SCSI HD bad block remap)

 - because of high data locality/low power dissipation extremely
   high clock speeds (up to 1 GHz and may be even beyond) are possible.
   Tiny structures, low voltage and high locality is a must, then.

 - since CPU/Router run asynchronously, a closed-loop-oscillator
   (no quartz) can provide system clock(s). Secondary clocks
   can be derived from counter register bits.
   Since communication is handled by the router, each CPU
   runs wholly independently from the others.

 - since several (wafer-integrated) computer dies
   run concurrently, this is true MIMD at wafer level.

 - our buses and memory words can be _very_ broad
   (0.5-2 kBits, major constraint is CPU/ALU bulkiness
   and available logic resources. Extreme
   bus width payoff is roughly proportional to the square
   of die complexity (e.g. 1MBit cell is somewhat too tight,
   4 MBit will do nicely, 16MBit will run like a fox,
   64 MBit will be a killer), so:

   - our memory bit/s bandwidth can be extremely good

   - we can emulate FPUs with fixed-point of scaled-integer
     ALUs, this means:

     - very broad but primitive (logic resource) integer ALUs necessary

     - no FPU, hence more transistor resources for CPU/RAM available

     - address bus width is very narrow (10-16 bits), hence

       - address decoding logic takes fewer resources

   - we can operate directly upon XXL data types, which spells

     - blazing fast bit block logic (BitBlt, video/etc. applications)

     - extremely broad integers (ALU must be very primitive/
       highly clocked, then)

     - we can do single-instruction-on-multiple-data (SIMD)
       with an extremely simple additional engine (details follow)

       - this SIMD-at-CPU-level parallelity on xxl bus width, though
	 currently not available on low-end machines, has shown to
	 be _extremely_ performance boosting (esp. compares/searches)
	 on experimental machines

     - we needs a convenient means to handle much tinier
       data sizes (down to single bit(fields)), since bus widths are
       immense

       - a simple machinery for defining bus windows, which
	 masks out which portions of the word are to
	 be modified and which not

       - a machinery for word-relative adressing (both
	 address and data/return stack) - actually a
	 two-dimensional addressing scheme (no, this
	 is _not_ IBM's/Intel's memory segmentation)

 - since we have WSI-integrated computer dies:

   - they will need a cheap wafer-local communication
     infrastructure

     - it has to be serial (akin Tranputer links. Off-wafer
       links should be optical (hybrid GaAs/smart pixels on
       GaAs isles, but thats SciFi still) for high speed/low
       power dissipation/low EM noise susceptibility, etc.

     - it has to be high-speed (prevent inter-die
       bottlenecks) 50-100 MByte/s on-wafer link
       bandwidth appears viable.

     - it has to be a packet driven high-connectivity
       (hypergrid) network (failure tolerance).

     - it will need a primitive, yet high-performance
       asynchronous data packet router (very cheap, this,
       but later more on that).

   - special dies, at wafer perimeter to handle real-world
     I/O (HD, video, audio, etc. ) are necessary. Alternatively,
     off-wafer chips with serial2parallel link adapters might
     do the trick.

   - extensive interrupt handling logic is unnecessary
     since a maspar machine can handle several interrupts
     concurrently with very low latency

This passage was a bit long and far from being complete, but I think
I made my point clear: CPU/DRAM/Router wafer-scale integration seems
to be possible, the constraints are not very many/restrictive and
have potentially very intriguing technical implications, possibly
worthwhile enough for further investigations.

Before I am going to outline a possible implementation of this
architecture (I called it ULIW for "ultra long instruction/data
word" architecture) I must outline three engines: the hypergrid
router, the TL (trench logic) engine, a prerequisite for SIMD
and the PW (protection window) engine, which is the means for
handling small data types on very large buses. The perfect
shuffle engine is an addendum. (But a very nice one.)

2. The basic features of ULIW

2.1 The Hypergrid Router

The router is the keystone of ULIW. It provides the only means
of information transfer. E.g. CPU dies boot from links.

The router has to be:

- fault tolerant to random defects

  - at single link level (implemented as independant memory-mapped
    segments and a status word). If one link is shot, the whole
    router still runs all right.

  - at wafer level (tolerate high defect percentage). If a die
    is defective it is ignored by the OS in neighbour dies. Partially
    defective dies, software detected at production stage should be
    rendered totally inoperable e. g. by scratching them or burning
    them with a laser

- fast

  - high physical channel bandwidth (50-100 MByte/s on-wafer
    serial link)

  - function autonomously (fully asynchronous to CPU die,
    via bus seal (at router or even single link level) and
    arbiter logic)

  - high memory-router throughput (needs very broad buses!)
    Since a link generates an event each time it needs
    CPU attention, total event frequency goes down with
    increasing word widths

- skinny

  - should need low logic resources/die area

  - simple routing logic

- flexible

  - can be switched off

  - should call CPU for assistance (unknown packet type, own
    ID derivation, etc.)

I won't go here into detail why a packet-switched network is the
best known design for any networks. At least there's a very good
reason why Internet's basic protocol TCP/IP is one.

But the network topology/connectivity needs a closer look. I
will try to show that the hypercube's (boolean n-cube) superset,
called n-grid topology, is a very robust and simple (in terms of
routing intelligence) topology, which can easily/cheaply be
mapped upon hardware.

Consider the Bresenham line drawing algorithm. It decides which
pixel to choose from several neighbours using only the knowledge
where it is now and where it has to go. To make it work the
lattice/grid has to be regular and each element to have an
unique ID/addressing. If we use a regular higher-dimensional
grid and a higher Bresenham homologue it will obviously work.
Even better, very high degrees of defectivity can be tolerated.
The probability of a blockage goes almost down to zero with
increasing number of dimensions. Detection of the blockage
condition is trivial: the path begins to show a small periode
(due to routing artefacts). A tiny FIFO of a path will do. A
random kick will often solve the problem. If not, an exhaustive
local search will do. But is it worth the trouble? Packet
lossage is usually handled transparently by a protocol.)

I refer to this scheme as higher-dimensional Bresenham routing
or simply grassrouting, because of its simplicity.
	  ^^^^^^^^^^^^
The boolean hypercube (n-cube) a member of the hypergrid or the
n-grid family, will be our model case.

The n-grid/n-cube.

Imagine a standard three-dimensional cube with an edge of length
1, the first node lying at coordinates (0,0,0), the second at
(0,0,1), the third ... (0,1,0), (0,1,1), (1,0,0), (1,0,1),
(1,1,0) and the eighth at (1,1,1) (see picture). It is a member

 Y  ^
    |	    Z
    |	   /
 011| o---/----o 111
    |/|  /    /|
010 o-+-+----o | 110   A 3-cube with binary labeled
    | |/     | |       nodes in a cartesian coordinate
 001| o------|-o 101   system
    |/	     |/
    o--------o------->
  000	   100	     X

of the boolean n-cube homologue sequence with total number of
nodes being 2^n. Thus one can label the nodes with a binary
integer 0,..,(2^n)-1. Alternatively, one can say there a space
of IDs for each n-cube. If we consider a single reference node
in a boolean n-cube, it is connected to n nodes _with binary
offsets_ (2^i). The sign of the offset is derived from the
binary count pattern bit sequence. This is a constructive
definition. (Look up connectivity matrix of boolean 5-cube way
down. Note that it contains _four_ 3-cubes).

 This table represents the 3-cube:

 ref. Binary Signs  binary     connected
 ID   Count	    Offsets    IDs
 +---+-----+------+-----------+-------+  (alt.:
 | 0 | 000 |  +++ |  +4 +2 +1 | 4 2 1 |   perfect
 | 1 | 001 |  ++- |  +4 +2 -1 | 5 3 0 |   shuffle
 | 2 | 010 |  +-+ |  +4 -2 +1 | 6 0 3 |   stages (1,2,3)
 | 3 | 011 |  +-- |  +4 -2 -1 | 7 1 2 |   of the initial
 | 4 | 100 |  -++ |  -4 +2 +1 | 1 6 5 |   ref. ID)
 | 5 | 101 |  -+- |  -4 +2 -1 | 2 7 4 |
 | 6 | 110 |  --+ |  -4 -2 +1 | 3 4 7 |
 | 7 | 111 |  --- |  -4 -2 -1 | 4 5 6 |
 +---+-----+------+-----------+-------+

This boolean n-cube is a peculiar beast. On the one hand it is a
very orthogonal structure, easily reasoned upon mathematically.
The signs/offsets look after themselves, it is a
recursive/fractal structure (each n-cube consist of 2
interconnected (via (n-1) number of links) (n-1)-type subcubes),
etc. Routing in it resembles binary search: the first step
brings one halfway there, the second halfway again, etc. We
descend the binary cluster hierarchy until we arrive at the
0-cube (single node). The "distance distribution versus
dimensionality" table is the Pascal triangle, by the way. It can
derive its own ID from wiring constraints and the other IDs it
is directly connected to. (This is a crucial point). Etc.
Routing in a perfect n-cube runs roughly: "going from the left,
find the first bit in destination ID differing from source ID.
Move message packet to the according subcube via direct link
(there is only one). Repeat until you reach outer (rightmost)
bit. Bingo." This can be coded in some 10 assembly statements or
implemented via a very primitive logic. If a part of the links
is missing (the n-cube is defective) we have to visit all the
nodes which should have links to the second subcube until we can
prove the task impossible.

The n-grid, on the other hand, can be considered a better n-cube
(is a superset of it). We can derive it from the n-cube by
making link offsets symmetrical. Links which would reach out of
ID space can be either wrapped around (circular space) or
considered void/free (open space/grid region). See connectivity
table below. Routing is again completely trivial. You just
choose the link which distance (arithmetical ID difference) is
smallest and shove the packet through. In a perfect n-grid, this
works every time and is moreover the best possible algorithm.
Especially, if the near-range order (shortest offsets) is highly
preserved, this scheme works also for severely defective n-grids
(there is a lot of space in those higher dimensions). If the
message _blocks_ (its path has a periode/loop) we can circumvent
the blockage either giving the message packet a random kick or
executing a plan, subsequently visiting all the nodes which
should have links in this direction in ever-increasing
distances.

Computer simulations I did have shown this routing scheme to
work and, even better, to work well.

Connectivity Tables of 5-cube and 5-grid (2^5=32 nodes each)

# means a link,
- means there is no link.

Note that the n-cube is contained in the n-grid (as it is a
superset). The n-cube has a fractal/recursive makeup: it
contains two interconnected (n/2 links) (n-1)-cubes. A n-grid
node has (2n-1) links as compared to the n-cube (free links are
not shown). Two (n-1) subgrids are connected by (2n-1) links as
compared by n links in the n-cube. The simple n-grid routing
algorithm does not work for the n-cube (it blocks/loops in most
cases). Moreover, if the _farthest_ link of each node (n-cube)
is cut, the cube collapses into 2 unconnected (n-1) cubes. This
does not happen to the n-grid. Notice that a defective structure
(some links missing) still routes all right because of built-in
redundancy (which is higher in the n-grid).

The n-grid is a superset of the n-cube. Each node has

boolean 5-cube			  5-grid (open-space version.
					  free links not shown)

-##-#---#-------#---------------  -##-#---#-------#---------------
#--#-#---#-------#--------------  #-##-#---#-------#--------------
#--#--#---#-------#-------------  ##-##-#---#-------#-------------
-##----#---#-------#------------  -##-##-#---#-------#------------
#----##-----#-------#-----------  #-##-##-#---#-------#-----------
-#--#--#-----#-------#----------  -#-##-##-#---#-------#----------
--#-#--#------#-------#---------  --#-##-##-#---#-------#---------
---#-##--------#-------#--------  ---#-##-##-#---#-------#--------
#--------##-#-----------#-------  #---#-##-##-#---#-------#-------
-#------#--#-#-----------#------  -#---#-##-##-#---#-------#------
--#-----#--#--#-----------#-----  --#---#-##-##-#---#-------#-----
---#-----##----#-----------#----  ---#---#-##-##-#---#-------#----
----#---#----##-------------#---  ----#---#-##-##-#---#-------#---
-----#---#--#--#-------------#--  -----#---#-##-##-#---#-------#--
------#---#-#--#--------------#-  ------#---#-##-##-#---#-------#-
-------#---#-##----------------#  -------#---#-##-##-#---#-------#
#----------------##-#---#-------  #-------#---#-##-##-#---#-------
-#--------------#--#-#---#------  -#-------#---#-##-##-#---#------
--#-------------#--#--#---#-----  --#-------#---#-##-##-#---#-----
---#-------------##----#---#----  ---#-------#---#-##-##-#---#----
----#-----------#----##-----#---  ----#-------#---#-##-##-#---#---
-----#-----------#--#--#-----#--  -----#-------#---#-##-##-#---#--
------#-----------#-#--#------#-  ------#-------#---#-##-##-#---#-
-------#-----------#-##--------#  -------#-------#---#-##-##-#---#
--------#-------#--------##-#---  --------#-------#---#-##-##-#---
---------#-------#------#--#-#--  ---------#-------#---#-##-##-#--
----------#-------#-----#--#--#-  ----------#-------#---#-##-##-#-
-----------#-------#-----##----#  -----------#-------#---#-##-##-#
------------#-------#---#----##-  ------------#-------#---#-##-##-
-------------#-------#---#--#--#  -------------#-------#---#-##-##
--------------#-------#---#-#--#  --------------#-------#---#-##-#
---------------#-------#---#-##-  ---------------#-------#---#-##-

Where's the beef?

you might ask. What's the reason to introduce a new topology at
all? There are several reasons:

- n-cube and n-grid have the best routing performance at
  minimum wire length (when mapped to real physical 2- and 3-space).

- the amount of computation (space/time complexity) for
  efficient routing is very small. On-chip routers for
  wafer-local high-speed serial networks are possible.

- n-grids tolerate a high degree of defectivity at still good
  performance.

- the IDs can be derived from (directly visible) neighbour's IDs
  without any need for administration. I call it dynamic ID
  derivation (DID).

- the total communication bandwidth is very high and increases
  with increasing n-grid size.

- the addressing and routing scheme is scalable to extremely
  big (WAN) nets.

Finally, I should say there seems to be a way of relaxing wiring
constraints even further without sacrificing too much
efficiency. Random local wirings with decreasing wiring density
(as realspace projection) can be harnessed with an ID subspace
ID derivation and routing scheme. This I call fuzzy routing.


After reset each link is in state "idle". A simplified state
transition graph looks roughly as follows:

    <----------------------------- ^----------------------------->
    |				   |				 |
just_written-->locked1-->sending-->idle<--just_read<--locked2<--receiving
		  |			      ^ 	 |
		  |			      | 	 |
	       routing------------------------> 	 |
		  ^					 |
		  |					 |
		  <--------------------------------------<

n-grid routing algorithm works roughly as follows (at single link
level):

0: state "just_written"
   examine incoming (either from link or the OS) packet type field,
   can we handle it?
1. state "locked1" or "locked2"
   NO : Signal OS for assistance
2. YES: examine destination field of incoming data packet, has
   it our ID?
3. YES: signal OS that a packet has arrived
   Wait until OS handles packet
   set state to "idle"
4. NO : Contains packet path window a loop?
5. YES: call OS for assistance
   (OS may simply signal failure to sender, give
   the data packet a random kick or
   circumvent the blockage using a plan)
6. NO : select an idle link with an ID having the smallest
   distance to target ID
   Lock that link
   Move the packet to that link
   Initiate a send
   Free current link
   change state to "idle"

Link function semiautonomously. No central controlling agency,
software at most.

Summary so far:

The router consists of a small (5-16 regs) block of
memory-mapped bidirectional word wide shift registers. They work
semiautonomously, sometimes signalling a router event the the
CPU. Access to certain address subspaces triggers actions upon
according link. The router local bus is most of the time sealed
off from the main system bus. Link states are mapped to one
single word (to be efficiently polled by SIMD). A simple
arbitration/glue logic handles CPU/router accesses and generates
interrupt (apart from the clock, the only source of interrupts
in the system). Routing is done mostly (99%) by hardware
asynchronously to the CPU. Computer simulation of the router
(via explicit hypergrid data structure) seem to indicate that
above routing scheme is

- simple (in terms of code/logic)
- efficient
- robust (tolerates high degrees of defectivity (shot links))

2.2 Trench Logic Engine (TLE)

The ALU is sectioned into independant 1 bit wide segments, with
wires between them. The TLE generates from a 3-4 bit wide input
(from opcode field) a comblike pattern. If high, it prevents
data from spilling over at nonlocal operations (ADD, SHIFT,
etc.). The logic bear a striking resemblance to RAM address
decoding logic.

OpCode	Data
Field	Size Type
--------
0000	1    bit (Or starting with 0 for no segments, whatever)
0001	2
0010	4    nybble
0011	8    byte
0100   16    short int
0101   32    int
0110   64    long int
0111  128
1000  256  \
1001  512   \ actual maximum size is constrained
1010 1024   / by physical bus/ALU width
1011 2048  /
etc.

We can interpret TOS as a single data or as a data field, the same
operation to be applied upon it (SIMD). The size of the TOS element
is defined by the PW engine, which is also opcode field coded. The
TLE coding field might be facultative (similiar to variant records),
defined by a single switching opcode bit. I don't know.

This SIMD feature is especially nice for fast searches as we can
compare a substantial number of single data types in one cycle.
In scientific and graphical applications, sometimes the same
operation has to be applied concurrently upon all elements of
all field, also. String matches (e.g. for molecular biology
database search) can be greatly accelerated by either shifting
the string and comparing or using several compare masks.

2.3 Protection Window Engine (PWE)

Essentially, this engine melts two broad inputs (e.g. from RAM)
using a mask (a third input). Each bit of a mask decides which
input channel to choose. A machinery for writing the mask is
necessary. A single/pair mask-override bit (ignore one input,
use only the other, e.g. when a whole word has be swapped
in/out) appears quite nice. The masks may be either generated by
other logic engines (for speed) or coming from RAM
(flexibility). Especially, an engine generating comb patterns
and binary patterns (e.g. 010101010101,001100110011,
0000111100001111, etc.) will later come very handy. Let us call
it the binary mask engine (BME). Structured code self
modification (CSM) of SIMD coding opcodes is an extremely
powerful feature. It can do O(logN) searches, e.g. searching for
free bit blocks in a memory map for memory management.

An application of this is either the implementation of smaller
TOS on a very broad register or when accessing DRAM. Stuff which
is masked out is left unchanged at writes. One can either set
masked out regions to default (e.g. zero) or combine them from
other sources at writes. Obviously, BitBlt can also make use of
this engine (e.g. defining memory windows, combining two
sources, etc.). Many more etc.

This engine is not so cheap as the TLE but is absolutely
necessary to be able to operate upon very broad buses (0.5-2
kBits) at all. And it has many other intriguing application.
Essentially, the beauty of these engines is the fact that can be
interlocked like pieces of a puzzle (but in many variations)
generating potentially extremely complex behaviour. If opcode
mapped, several of these engines can be selected in a single
instruction.

2.4 Perfect Shuffle Engine (PSE)

Actually, this is not a must. But since the engine is quite
cheap logically (it consists entirely out of other ULIW
components, as PWE and binary mask generators, SHIFT/OR/AND) and
powerful as well, its implementation appears worthwhile. Perfect
shuffle is a generic permutation operator, with SWAP as a
special case. Actually, "perfect shuffle" is a term for a family
of swaps on variable data sizes. Using a sequence of n atomic
shuffle steps it is possible to invert sequences of 2^n elements
in n steps, e.g. inverting a 1024 bit sequence in 10 steps.
(This operation can speed up FFT noticeably. Moreover, since
data packets arrive in an inverse bit sequence, they have to be
reversed to be of any use. PSE can take care of that. But it is
not absolutely necessary).

Illustration of perfect shuffle:

initial sequence
-------------------------------
0 1 2 3 4 5 6 7 8 9 a b c d e f    PF stage 0 (no swap)
-------------------------------
1 0 3 2 5 4 7 6 9 8 b a d c f e    PF stage 1 (swap adjacent elements)
-------------------------------
3 2 1 0 7 6 5 4 b a 9 8 f e d c    PF stage 2 (swap adjacent pairs)
-------------------------------
7 6 5 4 3 2 1 0 f e d c b a 9 8    PF stage 3 (-- quadruples, etc.)
-------------------------------
f e d c b a 9 8 7 6 5 4 3 2 1 0    PF stage 4 (inverted sequence)
-------------------------------
final sequence (inverted orig. sequence)

This engine is implemented in terms of 2 broad bidirectional shift
registers (ALU components) and one mask register, which is needed for
PWE. Control logic is simple. Its implementation as threaded code is
straightforward.

3. The ULIW architecture outline (the medium scenario)

There are several flavours of ULIW: the cost-minimalized and the
power-maximalized. Since even a single die might be a
performance overkill for many applications, for them a cheap
minimalist implementation might suffice. A scientific
supercomputer with several 10 kWafers will need a powerful
router. A workstation, a kid's game station or a VR front end
have different needs. Etc. A cost-performance optimum is not
easy to estimate.

One last small remark: the key of superior ULIW performance is
the broader bus as well as the high clock speed. Sacrificing one
(or even both) of them means severely cutting performance.
Extreme bus width, PWE and the router are crucial for ULIW. Any
other stuff may go.

The intuition would seem to indicate that a primitive
high-clocked CPU might have several distinctive advantage over
more complex ones:

- development reiteration cycle is shorter (easy to simulate,
  compact core, less timing problems, etc.)
- less complex circuit/mask, hence
- smaller bug probability
- less transistors for the core CPU necessary, more available
  for router/RAM
- less space constraints, a broader bus possible
- better defect die ratio because of lower defect hit probability
- about the same performance as a more complex but slower CPU
- since the virtual machine is the same, software runs on different
  implementation architectures

Since I am in no position of telling Chuck how a minimalist CPU
can be made (he knows all about it ;), I'd like to illustrate
the concept on a high complexity scenario. Most features can
(and should) be implemented in software, of course.

One ULIW die is made from the following modules:

- CPU (a MISC one) is built around/contains the

  - PWE (protection window engine)

  - TLE (trench logic engine)

  - PSE (perfect shuffle engine for generic SWAPs)

  - a very skinny yet very broad integer ALU (only ADD,
    SHIFTs and some primitive logic as AND, XOR). INC/DEC,
    if possible. It should implement core MISC ALU functions,
    of course.

  - code segment register (current code page (32-256 Bytes) is
    held here) with its own addressing and
    decoding logic (6-8 bit second address bus)
    no main memory acesses are necessary when code loops
    in page. Address underflow/overflow causes page faults
    and swapping out/in. Actually quite rare, this. Especially,
    if code is word-aligned.

  - data (segment) register, TOS actually
    ALU must be broad enough to handle bus sized data types

  - return stack segment register (caches current return stack),
    has it's own addressing and a minimal ALU for fast FOR/NEXT

  - MMU (memory management unit) It checks addresses whether
    they are within allowed range (memory protection). Loads OS
    values upon OS call automagically to reduce latency.

  - clock and interrupt generator
    Apart from the routher the only source of interrupts
    in the system (MMU causes OS traps).

- RAM (DRAM, eventually with a small SRAM isle for fast virtual
  machine threaded code implementation, with address decoding
  logic) Few (some k) but very broad (0.5-2 kbits) words

- Bus Seal (separates CPU from the Router bus, allows them
  to run asynchronously)

- arbitration/glue logic (for CPU/Router bus/event arbitration)

- Router (a memory mapped block of autonoumous bishift registers,
  these must be in hardware)
  In a maximalist scenario it would be made of:
  - DMA and FIFO buffers (software will do)
  - Packet type detection logic (software will do, too)
  - Loop detection logic with (software, of course)
  - Routing logic
    - Integer ALU to compute address diffs (can ask the CPU ALU)
    - Parallel compare logic (smallest diff) (again)
    - temporary pad register(s) (for swaps) (DRAM can do)
    - etc.

Needless to say, the above was a maximalist scenario. Yet
estimation seem to indicate that only 1/16 resources (1
MTransistor) based on  a 16 MBit (2 kBit bus) DRAM cell will
suffice for both the CPU and the Router. After reduction of bus
width, design complexity, etc. a 100 kTransistor (and maybe even
much lower, if newer MISCs as F21 are taken into account) design
complexity seems to be achievable. Having absolutely no chip
design experience I cannot say how big the simplest architecture
will be. Obviously, bus width is main limit here. Router is very
cheap.

Wafer-scale integrated (32-64 dies) high-clock ULIW will have the
following unique features:

- _extremely_ high memory bandwidth (this is an understatement)

  - DRAM/SRAM on-die RAM is very fast (at least 10-20 ns fast)

  - xxl sized data bus width (0.5-2 kbits, 16-64 times that of
    a 32 bit CPU)

- _extremely_ high total networking bandwidth
  (e.g. 32 (dies) * 10 (links) * 50 MBytes/s (link bandwidth) =
   about 16 GByte/s on-wafer bandwidth)

- excellent integer/float performance (2 kBit integers are
  IEEE 64bit float supersets)

- failure tolerance at production/run time

- massive parallelism MIMD/SIMD

- comparatively low heat dissipation (no drivers)

- a big number of free high-speed I/O links (for video, RAID HD,
  DSP, etc.)

- comparatively cheap since no testing/repackaging
  (wafer will still run at 50% defective dies)

- higher good chip yield due to software bad-word remap

(Above feature list is, while extensive, not complete by far). A
WSI ULIW wafer is actually a palmtop supercomputer. Since good
wafer yield is essentially 100%, they auto-test and
auto-benchmark automagically, and need absolutely no
postprocessing (well, besides providing outside links an a
opaque lack finish and maybe a heat dissipator) the beast will
be very cheap. Somewhere in the $200-$1000 range.

4. The OS

The OS has to be compact, since it must live in 128 kByte-2
MByte sized DRAM grains. It must be an efficient networking
(worm characteristic) OS. It must be able to do primitive
reentrant multitasking with address protection. It has to do
memory management with garbage collection (GC). In large LANs/
WANs it must also handle security issues as authentisation/
authorisation. It must be an OO OS (with asynchronous
objects/load balancing). It must be threaded code on a
standartized virtual machine (to allow full
hardware-independancy). In certain areas it has even to surpass
Taos. Since the Taos nanokernel is 12 kByte sized and is not
written in threaded code, a 16-32 kByte nanokernel OS meeting
all above demands seems to be possible. Assuming a 128 kByte
minimal DRAM grain size, this is a good ratio.

5. What's still to do?

You have made it till here? Wow! Thanks for your time. Of course
I posted this outline (there is much much more to it) to the
Great Chuck and you MISC list guys for a reason:

- if the design/parts of it are viable you might find it
  useful, may be in later MISC designs

- some aspects of the stuff (particularly the router, TL/PW engines,
  etc) might be patentable. I don't know how to do this and I lack
  finance anyway, but your's another matter...

- even the smallest die would need resources of a RAM die
  production line (gate arrays will not suffice, sorry) which
  is not cheap, even in those days :(
  The voice of an authority (one of you) might pave the way

What do you think about the beast? Any comments are welcome.

Thanks a lot,

-- Eugene Leitl, Munich, Germany