
[MachineForth] [Nosc] 25x Forth engine

[Jeff Fox <fox@xxxxxxxxxxxxxxxxxxx> writes:]

>The SVFIG chapter's Forth day was held last Saturday and
>Chuck gave a presentation in the afternoon where he
>talked about his colorForth mostly.  He said that he
>would soon be posting two applications in his colorForth
>and that these would be a cross compiler for c18 (was
>called x18) and a simulator for 25x (which might become
>50x before fabrication.)

So he thinks he can fit 2x the CPU cells in?  That's good news.

>It appears that c18 has 8 bit wide address busses on
>chip which provide access to 256 words of 1ns ROM or DRAM.
>The low address space is on-chip ROM and DRAM.  Somewhere
>above that are control registers similar to those on
>other MISC chips.

We could be looking at 128 (20 bit) words of storage?

>One (or more) of the control registers setup the operation
>of the external memory bus and another provides access to
>the external memory bus.  So the new B register is mainly
>used to hold the address of the external-memory controller
>but it could also hold other control port register
>addresses or onchip address like the A or R registers.

This is where it gets critical.  If you look at Cisco's "Toaster" network
processor (http://www.employees.org/~amcrae/papers/toaster/toaster.html), it
also uses an array of small CPU's.  But what they did was put a memory
controller on each column, so that each group of 4 CPU's (they use a 4x4
array) arbitrates only with the other 3 in its column to access a relatively
large SRAM.

>So one or more on-chip ROM can be programmed to make the
>chip assume the reverse pinout of a 4ns 18 bit wide
>external fast-SRAM.  It can also be programmed to provide
>control signals for other types of memory.

If this means that all 25 (or 50... more on that in a moment) chips have to
arbitrate with all the other 24 to access a single external memory, then it
seems like it'll be very hard to avoid making that the bottleneck.
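To make the worry concrete, here's a back-of-envelope contention model (my own sketch, not anything from Chuck's specs): assume each CPU wants one external access every so many cycles, and the shared port services one access per cycle.  The Toaster-style per-column fan-out and the 25:1 fan-out come out very differently.

```python
# Toy model of a shared external-memory port throttling an array of CPUs.
# Assumes each CPU issues one external access every 'cycles_per_access'
# cycles and the port can service exactly one access per cycle.

def effective_utilization(n_cpus, cycles_per_access):
    """Fraction of full speed each CPU achieves when n_cpus share one port."""
    demand = n_cpus / cycles_per_access   # accesses requested per cycle
    if demand <= 1.0:
        return 1.0                        # port keeps up, no stalls
    return 1.0 / demand                   # port saturated: fair-share slowdown

# 25 CPU's each touching external memory once every 10 cycles: 2.5x demand
print(effective_utilization(25, 10))      # -> 0.4 (each CPU runs at 40%)

# Toaster-style, 4 CPU's per controller at the same access rate
print(effective_utilization(4, 10))       # -> 1.0 (port keeps up)
```

The access rate is the free parameter; for a cache-managed workload that mostly runs out of on-chip RAM it could be much lower, which is exactly the question.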

>One or more chips on the outside have serial busses
>but I don't know the details at this point.  I am
>sure at some point Chuck will publish specs.

I know Chuck is thinking about Ethernet.  I'm now wondering what speed of
Ethernet--even at GE, it seems like it might easily take more than one
interface to keep up with the bandwidth of the CPU's.  More than one GE?

>I think I have covered that.  But if you didn't
>figure this out from above, Chuck also mentioned
>explicitly that the CPU cannot execute external
>memory.  They can copy it to on-chip memory
>and only execute from on-chip memory.  So the
>CPU must assume some overhead that is done
>entirely by other hardware on most other chips.
>The CPU must manage on-chip memory manually
>like other chips handle cache in hardware.

I've spent a fair bit of time trying to sketch out a DES key search
application based on my current understanding of the 25x.  I budgeted 128
words, assuming the other 128 would be bootup ROM plus addresses of chip
registers (I know I could put some DES code in ROM, but if you have to cook
new chips for each new app, you've hit the wrong target IMHO).  Of this 128, I
kept coming up with 20-30 words for the code to handle the cache function
(again, you could put some of this in ROM, but with this thin a memory
margin, the cache handlers become application specific).
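The word budget above, spelled out as arithmetic (these are my readings of the numbers in this post, not published c18 figures):

```python
# Word budget for a DES key search app on one c18, per the estimates above.
total_words   = 256   # 8-bit address space, words on chip
rom_words     = 128   # assumed: boot ROM plus chip register addresses
app_budget    = total_words - rom_words       # 128 words for the application
cache_handler = 30                            # high end of the 20-30 estimate
remaining     = app_budget - cache_handler
print(remaining)      # -> 98 words (rounded down to ~90 in the text)

# With 20-bit words, each 32-bit DES quantity costs a pair of words
pairs = 90 // 2
print(pairs)          # -> 45 packed 32-bit values at most
```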

With 90 words, I had to budget pairs of words to hold each 32-bit value.
The entire S-box structure won't fit on one chip, but S-box access nicely
splits across multiple CPU's.  The permutation steps on the way in and way
out are the real problem; they don't split out as nicely.  So the structure
I got was one CPU to permute, feeding 4 CPU's to do the internal
transformation, and one more to do the final permute.  The front and tail
end run at the same speed, and this is probably slower than the 4 in the
middle by 20-40%.  The head and tail might be sped up by using a CPU to try
and pipeline the key scheduling... not sure yet.
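The 6-CPU structure above behaves like any pipeline: throughput is set by the slowest stage.  A minimal sketch, using the post's own guess that the permute ends run 20-40% slower than the middle four (the stage rates here are placeholders, not measurements):

```python
# One permute CPU in, four transform CPU's in the middle, one permute CPU out.

def pipeline_rate(stage_rates):
    """A pipeline's throughput is the rate of its slowest stage."""
    return min(stage_rates)

middle    = 1.0           # normalize the 4 parallel transform CPU's to 1.0
head_tail = 1.0 / 1.3     # permute stages ~20-40% slower; take ~30% as typical

rate = pipeline_rate([head_tail, middle, head_tail])
print(round(rate, 2))     # -> 0.77: the middle CPU's sit idle roughly 1/4 of the time
```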

>>         e. Inter-chip signalling
>One control register includes destination (not
>too unlike the F21 network coprocessor) but only
>has 4 bits for the 4 other processors in a row
>or column.  This means it can address one CPU
>or any combination in a row or column in a
>multicast.  The data is passed from a data
>register (address) to a data register on
>another chip and when that chip reads the
>register an ack signal appears in other
>control registers.  If a chip attempts to
>write again before all recipients have
>read the last message the write is blocked
>by hardware.

What happens if two CPU's write at the exact same moment?
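For what it's worth, here is my reading of that blocking-write/ack mechanism as a toy model.  This is entirely a sketch of unpublished register semantics, and the simultaneous-write case is exactly what it does not answer:

```python
# Toy model of the described inter-chip port: a write multicasts to selected
# neighbors, and a second write is blocked until every recipient has read
# (acked) the first.

class CommPort:
    def __init__(self):
        self.data = None
        self.pending = set()          # recipients that have not yet read

    def write(self, value, recipients):
        if self.pending:              # hardware would stall the writer;
            return False              # we model the stall as a refused write
        self.data = value
        self.pending = set(recipients)
        return True

    def read(self, reader):
        self.pending.discard(reader)  # reading raises this reader's ack
        return self.data

port = CommPort()
print(port.write(42, {"cpu1", "cpu2"}))  # -> True  (multicast accepted)
print(port.write(43, {"cpu1"}))          # -> False (blocked: acks pending)
port.read("cpu1"); port.read("cpu2")
print(port.write(43, {"cpu1"}))          # -> True  (all acks in)
```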

>Most likely some ROMs will be programmed to
>make it assume a superset of the reverse
>of an 18-bit fast-SRAM.  The first test
>chip will include unique ROM code on each
>of the core CPU in the hopes that some
>of them will mostly work on the first try.

The inverse-SRAM trick works pretty well for a single CPU.  I wonder again
how well that trick carries forward to serving 25 CPU's.

>The various MISC chip simulators could be modified
>fairly easily to convert to c18, but software only
>simulators are actually very easy.  Simulators and
>emulators are very nice and as the chip evolves
>it will be nice to have access to simulators that
>can be used to develop real ROMable code that 
>will actually work.  We have been doing that for
>over a decade on the project.

I agree the CPU itself is pretty easy.  Simulating the network of CPU's with
their inter-CPU interfaces is what makes it possible to assess 25x
applications as opposed to x18 applications.

My initial (largely uninformed) foray into mapping applications onto the 25x
has left me suspecting that the state space and external I/O may not be
balanced against the raw CPU cycles available internally.  When I ponder a
50x instead of a 25x, I wonder whether that "extra" real estate could
be used instead to implement a larger chip-local RAM space.  Chuck has said
that the RAM is roughly 1:1 with the CPU in the x18, so this might mean that
a neighboring CPU-sized piece of silicon could instead be used to triple (or
more, if it can share some of the control logic from the single RAM
infrastructure) the chip-local RAM.  This would have implications for
address space, though.

Making the external SRAM not transparently addressable is probably the right
way to go.  I hope the B register access will permit explicit pre-fetch
along with fetch state status bits.  Even if each x18 had its own external
SRAM the speed disparity would still make this desirable.  However, a 25:1
fan-out to SRAM seems too high for a broad class of applications.

I thought I could make the 25x work pretty well for a DES key search
application.  I'd burn ~6 CPU's with only 1-2 working at full speed and the
others idle ~40% of the time.  Because of the hassles straddling from the
20-bit world to the 32/64-bit world of DES, it looks like it'd take 500-700
cycles to test a key (actually a lot more, but I'm talking wall-clock cycles
while up to 6 x18 cycles run in parallel).  Hell, call it 1,000.  A 2400 MIP
CPU farm should thus be able to do 2.4 million tests per second.  If 6 CPU's
are needed, then we can run four of these at a time, for a total of 14.4
million tests per second per 25x.  That still comes to 5,003,999,585
seconds--nearly 159 years--to exhaust the 56-bit DES key space.
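Checking my own arithmetic there, against the 2^56 key space:

```python
# 14.4 million tests/second against the full 56-bit DES key space.
keyspace = 2 ** 56                     # 72,057,594,037,927,936 keys
tests_per_second = 14_400_000
seconds = keyspace // tests_per_second
years = seconds / (365.25 * 24 * 3600)
print(seconds)        # -> 5003999585, the figure quoted above
print(round(years))   # -> 159 years per 25x
```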

By contrast, the DES Key Search Machine (http://www.cryptography.com/des/)
can do 90 billion key searches per second.  Even allowing for the
possibility of thousands of 25x's, the speed differential appears to stem
from silicon which "fits" the data structure sizes involved (admittedly,
they get to hard-wire the DES tables because their chips have only one
application) and CPU's sized to process data in its natural form.  From
this, one can wonder if a speed penalty is inherent in a 20-bit CPU trying
to participate in a 32- and 64-bit world.  And if you're going to claim that
a 25x *is* custom burnt for an application, then the DES Key Machine's CPUs
would appear to be superior to a 25x as a starting point for custom silicon.
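Quantifying that speed differential (using the rate quoted on the cryptography.com page against my 14.4 million/second estimate above):

```python
# How many 25x chips would it take to match the DES Key Search Machine?
deep_crack_rate = 90_000_000_000   # ~90 billion keys/second, as quoted
per_25x_rate    = 14_400_000       # my estimated tests/second per 25x
print(deep_crack_rate // per_25x_rate)   # -> 6250 chips to break even
```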

So my initial thoughts:

	1. I'm still very interested in low-cost parallel MIP devices.
	2. Should more than one external RAM complex be used to better
		balance workload from the 25x CPU's?
	3. If 25x -> 50x could happen, would that chip real estate be
		better used as fast RAM for the 25 CPU's?
	4. What sort of practical I/O bandwidth will the periphery of
		the 25x support?
	5. (More problematic) Are 8-bit address spaces and 20-bit words
		correctly sized for an interesting set of potential
		applications?
Once again apologies for going on at length.

Andy Valencia

To Unsubscribe from this list, send mail to Mdaemon@xxxxxxxxxxxxxxxxxx with:
unsubscribe MachineForth
as the first and only line within the message body
Problems   -   List-Admin@xxxxxxxxxxxxxxxxxx
Main Machine Forth site   -   http://www.