
[MachineForth] [Nosc] 25x Forth engine


Andy Valencia wrote:
> So he thinks he can fit 2x the CPU cells in?  That's good news.

Chuck's first cut was a 49x (7x7), which became the current
25x.  Remember that the chips are described in an extensible
library, so the pads are factored separately from the CPU
core.  The number of times the CPU is replicated is
mostly controlled by a very tiny amount of code.

Also remember that 25x was selected to provide a metric
for a chip with a production cost of $1 or less.  This same size
chip might cost $15,000 to prototype, and the prototype
chip services provide for only certain sizes.  So I
would expect a 50x to be a $2 or less production
chip.  25x provides a metric of 25 cores at 2400 MIPS each on a
$1 production cost chip, or an associated $15,000
prototype fab run cost for a couple of dozen chips.

As we would hope, it is almost like changing a do loop
from 5x5 to 7x7, or to 5x10, or to 100x100, and then
agreeing to pay for prototype and/or production chips
of that size.  The client who proposed this design would
like to go to wafer-scale chips with tens of thousands of
cores per piece of silicon.  The simple fact is that more
funding is required to make larger chips.
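
Just to picture the idea (a sketch of mine, not OKAD source; the
function name and tile size are made up):

    # Hypothetical sketch of replicating one proven core tile across
    # a chip, in the spirit of "changing a do loop from 5x5 to 7x7".
    CORE_W, CORE_H = 500, 500      # assumed tile size in layout units

    def place_cores(rows, cols, place_tile):
        # replicate the core layout rows x cols times
        for r in range(rows):
            for c in range(cols):
                place_tile(x=c * CORE_W, y=r * CORE_H)

    # 5x5 gives a 25x, 7x7 the original 49x, 10x5 a 50x, and so on.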

Chuck mentioned that he would also like to make some
versions of 25x that have fewer I/O pins and are not
preprogrammed to assume the pinout of a 256k x 18-bit
4ns SRAM.  Since pins are actually more expensive than
silicon, these could be really cheap production chips.

> We could be looking at 128 (20 bit) words of storage?

256 words of 18-bit memory on the c18 core at the moment.
We _could_ be looking at any variation that anyone can
make a strong enough case for to get it made, or
any variation that gets paid for.

Chuck's original specs for x18 were 128 words of ROM
and 384 words of DRAM.  That seems to have changed to
256 words total.  I don't know how it is split.  Chuck
also mentioned that it could be any other size, but that
bigger means slower (and more expensive).   

He probably chose 8-bit addressing to get the 1ns access that he wanted,
and as a tradeoff for the relative size of ALU/stacks vs ROM/DRAM
on each core.  Again, the chips are described more or less in
Forth source code in OKAD II, so some changes are trivial, but
larger changes may involve more subtle changes as well.
Going to significantly more memory per core might
involve changing (not just expanding) the memory decode
circuitry, or changing something else, since many
things on-chip are coordinated by design.
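
For a sense of scale (my own arithmetic, nothing from Chuck): each
doubling of per-core memory adds an address/decode bit, and the decode
and bit lines get longer, which is part of why bigger means slower.

    import math
    # words of per-core memory -> address bits needed to decode them
    for words in (128, 256, 384, 512, 1024):
        print(words, math.ceil(math.log2(words)))   # 7, 8, 9, 9, 10 bits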

> If this means that all 25 (or 50... more on that in a moment) 
> chips have to arbitrate with all the other 24 to access a 
> single external memory, then it seems like it'll be very hard 
> to avoid making that the bottleneck.

It is worse than that.  Only one or two CPUs have direct
access to external memory.  All the others must get memory
transferred via the CPU connected to the external memory,
so it involves software (a few words of code) and two
or more CPUs.  And yes, one SRAM pinout is a bottleneck compared
to 60,000 Forth MIPS, 180Gbps of internal communication busses,
and 450Gbps of internal data busses.
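
Some rough arithmetic of my own, assuming a single 18-bit port cycled
every 4ns against the numbers above:

    cores         = 25
    mips_per_core = 2400e6                  # 2400 Forth MIPS per core
    total_mips    = cores * mips_per_core   # 60,000 MIPS total
    sram_port_bps = 18 / 4e-9               # one 18-bit pinout at 4ns
    print(total_mips / 1e6)                 # 60000.0 MIPS
    print(sram_port_bps / 1e9)              # 4.5 Gbps, vs 180 Gbps inside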

But then again, if some ROMs can be programmed to
provide what looks like a cache-SRAM pinout on the
I/O pins, then I would suspect that there is a
nearly infinite number of other ways that
the I/O pins could be programmed.  I can picture
dozens of multi-gigabit signals running on/off
chip.  It could be a router, protocol translator,
data monitor, encoder, decoder, compressor, decompressor,
or other type of packet cruncher providing about
60,000 Forth MIPS of processing power per dollar of
production cost (in .18u, let alone state-of-the-art
silicon production facilities). ;-)

> I've spent a fair bit of time trying to sketch out a 
> DES key search application for the understanding I have 
> of a 25x.  I budgeted 128 words,
> assuming the other 128 would be bootup ROM plus 
> addresses of chip registers (I know I could put some 
> DES code in ROM, but if you have to cook new chips
> for each new app, I think you've hit the wrong target IMHO).  

Yes.  I think the ROM programming is a fascinating problem.
We can experiment with what is needed, what might be
useful, and how much ROM vs RAM is really required to avoid
the problem of having to make custom ROM production runs
just to play the game, which would raise the bar to a high
corporate funding level.

> Of this 128, I
> kept coming up with 20-30 words for the code to handle 
> the cache function (again, you could put some of this 
> in ROM, but with this thin a memory
> margin, the cache handlers become application specific).

It should be fun to experiment with ROM code in a
25x simulator.

As for DES, I have seen the specs for a scalable
multiprocessor using FPGAs to tackle the problem.  It
might be better to approach that kind of problem
with a more closely matched VLSI architecture modeled
more directly on the problem.  But short of
paying for a custom design that would compete with
the 25x (Forth cluster chip) that we have been talking
about, one could see how well the current design matches
the problem.

> The head and tail might be sped up by using a CPU to try
> and pipeline the key scheduling... not sure yet.

That is why I would favor a pipeline mode in the
interchip networking driver.  At times you could
really use the full 180Gbps of internal bandwidth.
By comparison, I/O to the outside will most
likely seem slow, even with multiple gigabit or
ten-gigabit data streams.
 
> What happens if two CPU's write at the exact same moment?

I would assume that either both get blocked if both hit on the
same 80ps cycle, or the chip favors left or up or something.  The
spec has not been published yet.  But OKAD simulates at 1ps intervals
and indicates that the arbitration circuit works.  It would still be good
to test the real circuit a lot if it gets made.
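
Something like this, purely as a guess at the flavor of it (the real
arbitration circuit is Chuck's and unpublished; the priority order
here is invented):

    def arbitrate(requests):
        # requests: which neighbors are writing on this 80ps cycle
        for side in ("left", "up", "right", "down"):  # assumed priority
            if side in requests:
                return side                           # this writer wins
        return None

    print(arbitrate({"left", "down"}))                # left wins the cycle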
 
> The inverse-SRAM trick works pretty well for a single CPU.  
> I wonder again how well that trick carries forward to 
> serving 25 CPU's.

I agree.  Pumping data in and out on something other
than an 18-bit cache-SRAM bus might be more appropriate
for many scalable applications.
 
> My initial (largely uninformed) foray into mapping 
> applications onto the 25x has left me suspecting that 
> the state space and external I/O may not be
> balanced against the raw CPU cycles available internally.  

The I/O pins for the most part require software, so the
number of ways they can be made to function is nearly
infinite.  Chuck's idea is that you trade software for
hardware.  With a little software you can have (only)
a certain number of gigabit datastreams using user-defined
hardware and software protocols from each $1
chip.

> When I ponder a 50x instead of a 25x, my thought is to 
> ask if this "extra" real estate could be used instead to 
> implement a larger chip-local RAM space?  

It kind of depends on the application, doesn't it?

As I said about F21 many times, most people working
with multiprocessing are using the approach of
maximizing the size and power of each node and
doing things like running a copy of Unix on each
node.  This architecture is closer to DSP with
communication links where each node will not need
much code or data at any given time.

> Chuck has said
> that the RAM is roughly 1:1 with the CPU in the x18, 
> so this might mean that a neighboring CPU-sized piece 
> of silicon could instead be used to triple (or
> more, if it can share some of the control logic from the single RAM
> infrastructure) the chip-local RAM.  

Right, at a slight cost in speed.  Just as going to a fab
process with features half the size of .18u would mean that you
could put about 7 times as much memory in the same tile area,
since the ALU/stacks would then only require 1/8 of the tile
instead of 1/2.  In .18u
it is 25 ALUs and 25 x 1/4K words of on-chip memory and a
certain number of pins for the $1 production size.  In the
same size or cost of chip one could put a different width of ALU
or a different mix of memory and I/O etc., but it will also
affect CPU speed.  One could put in a different mix of
processors, with dedicated I/O processors like on F21,
but the idea of 25x is that a cluster has the power
to do a lot of I/O under hardware/software control right
off the pins.
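
The 7x figure works out like this (my arithmetic, assuming the .18u
core tile is split half ALU/stacks and half memory):

    alu_area = 0.5                  # fraction of the .18u tile
    mem_area = 0.5
    shrink   = 0.5                  # half-size features -> 1/4 the area
    alu_new  = alu_area * shrink**2         # 1/8 of the tile
    mem_new  = 1.0 - alu_new                # 7/8 of the tile for memory
    density  = 1.0 / shrink**2              # cells are 4x denser too
    print(mem_new * density / mem_area)     # 7.0 times the capacity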
 
> Making the external SRAM not transparently addressable 
> is probably the right way to go.  I hope the B register 
> access will permit explicit pre-fetch along with fetch 
> state status bits. 

The B register is not special except that it has
restricted access because of instruction set decode
limitations.  There is B! but no B@.  There is no
!B+ or @B+.  The idea is to make external memory
addresses look transparent to software, sort of,
by having the appropriate addresses in the on-chip
control and data I/O registers with the right
setup to transparently generate the signals on the
pins.  The idea is that an addressing register makes
the hardware interface thinner and only needs a subset
of the general addressing registers.  (As for user-defined I/O
protocols such as an 18-bit SRAM pinout, well, that one isn't
user defined in that it is already in the ROM(s).)

> Even if each x18 had its own external
> SRAM the speed disparity would still make this desirable.  

Yes, but it would take thousands of pins, which are the
most expensive thing when you make chips this simple.
Control registers are accessed at the speed of on-chip
memory.  Yes, well, 1ns at 256 cells... and external
speed is programmable, so it might be able to go faster
than 4ns to match other parts.

> However, a 25:1
> fan-out to SRAM seems too high for a broad class of applications.

Some classes would benefit from something other than the
SRAM pinout programming.  I look forward to seeing what
ideas Chuck has for other pinout programming and to
see what we all can come up with as drivers for ROMs.
I got to write some ROM code for F21c.  It was fun 
stuff.  It was also interesting to see what Chuck's
ideas were about how the ROM could be used.

> I thought I could make the 25x work pretty well for a 
> DES key search application.  I'd ...

As you know the instruction set was not optimized
for math.  There is no hardware multiply, and the
multiply step requires multiple iterations.  Even
the + and +* are slower than other instructions
because of the use of ripple carry in the ALU.
But also remember that a one-cycle multiply circuit
would be larger than this core.
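
For anyone who has not timed it, a software multiply is a loop of
conditional-add-and-shift steps, one per bit of the operand, which is
roughly the job a multiply-step like +* helps with.  A generic sketch
of mine, not c18 code:

    def mul18(a, b):
        # shift-and-add multiply of two 18-bit values, 18 steps
        acc = 0
        for _ in range(18):
            if b & 1:
                acc += a          # conditional add, the "+*" flavor
            a <<= 1
            b >>= 1
        return acc & ((1 << 36) - 1)

    print(mul18(1234, 5678))      # 7006652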

>  Hell, call it 1,000.  A 2400 MIP
> CPU farm should thus be able to do 2.4 million tests per second.  

I call 25x a Forth Cluster Chip.  A CPU farm is not made of one
$1 chip.  Instead of 25 $25,000 workstations, or 500 $1000 PCs,
I think of a CPU farm as something like 500,000 $1 cluster chips
to put it on the same scale as a typical farm.  So instead
of comparing a farm to one $1 chip, I think it is better to 
compare 1000 25x to a PC or 25,000 25x to a workstation or 
up to millions for a farm when evaluating the idea of farming 
Forth chips.

> ... That still comes to 5003999585 seconds to
> exhaust the 56-bit DES key space.

Even dividing by 500,000 chips would still leave hours; specialized
hardware would be faster.
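
In numbers, taking Andy's figure for a single chip at face value:

    one_chip_secs = 5003999585        # Andy's estimate for one 25x
    chips         = 500000
    secs          = one_chip_secs / chips
    print(secs, secs / 3600)          # ~10008 seconds, about 2.8 hours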

> By contrast, the DES Key Search Machine (http://www.cryptography.com/des/)
> can do 90 billion key searches per second.  

Does the machine cost $1-$6 to build like the
25x version that you pictured?

> Even allowing for the
> possibility of thousands of 25x's, 

It indicates the potential for a closer match between the
hardware architecture and the problem.  If Chuck
were commissioned to do a DES breaker I think it
would be 1000x more efficient at the job, but could
do very little else.  It is the old question of
hardware efficiency/brittleness vs software
inefficiency/flexibility.

> the speed differential appears to stem
> from silicon which "fits" the data structure sizes 
> involved (admittedly,
> they get to hard-wire the DES tables because their 
> chips have only one application) and CPU's sized to 
> process data in its natural form.  

Yes, a custom VLSI design would beat the FPGA designs
in performance/cost because it would be more highly 
tuned to the problem.  A custom design using Chuck's
approach will beat one that uses a schematic by
about an order of magnitude.  But as for using 
MISC architecture, it may not be a very efficient
platform for the DES software.  Keep us informed...

> And if you're going to claim that
> a 25x *is* custom burnt for an application, 

This ROM variability issue is a sticky one.  I hope
we can find a flexible approach.  The tradeoffs are
clear, the specifics are not at this time.

> then the DES Key Machine's CPUs
> would appear to be superior to a 25x as a starting 
> point for custom silicon.

Yes.  I hope that you don't hire Chuck to make a custom
VLSI DES key machine, because then he would not work
on the 25x family.  But good luck with your work
on trying it as an example on the 25x.  It should serve
to help educate other people one way or another.

>  2. Should more than one external RAM complex be used to better
>                 balance workload from the 25x CPU's?

Perhaps a multiple-access arbitrated external memory bus could 
be programmed. There is a pin-count issue and the ROM issue.
Perhaps a driver for a serial ROM could support configuring
a set of I/O pins for that purpose, or perhaps Chuck can
be convinced to include that driver in the ROMs of a
prototyped device.  He likes good ideas.  Do your
research and if you uncover a good idea you might be
able to get it put on a chip.

>  3. If 25x -> 50x could happen, would that chip real estate be
>         better used as fast RAM for the 25 CPU's?

Good question. It is just software in OKAD II, effort, and funding.
That kind of research is fun and not outside of the range of
a relatively unfunded individual.

>  4. What sort of practical I/O bandwidth will the periphery of
>      the 25x support?

Good question.  I saw a chip recently with multiple CPUs and
multiple 3.2GHz serial channels and lots of other logic on
chip.  It makes one wonder what the upper limits per I/O pin
on 25x in .18u would be.

>  5. (More problematic) Are 8-bit address spaces and 20-bit words
>     correctly sized for an interesting set of potential
>     applications?

18-bit-wide busses with the c18 core.  When I started with MISC chips
the idea of 10-bit branching seemed way too small.  I understood
that it matched DRAM hardware pages, but I didn't think it would
match programs.  It didn't match a certain class of my
programs, and on F21 we got up to 15-bit branching opcodes.

When Chuck said "Most programs fit in 1K," I didn't get it.
After many years I observed that most programs assigned to the
MachineForth programmers were less than 1K on Chuck's
word-addressing, multiple-opcodes-per-word designs.  Chuck's
code is even more dense.  It took me a long time to get what
he was talking about.  He now seems to feel that 256
words (up to 3 opcodes each on c18) is reasonable for most
things, at least those that he is thinking about...

256 does seem absurdly small; even with 3 instructions per
word that is less than 1K opcodes.  Maybe it is just right,
I don't know.  I also don't know how much will be filled with
"required" drivers, so I do wonder how cramped it would be.
The best way to find out seems to be to study the code examples
that Chuck will publish, see how far one can push that edge
in software, and talk about it of course.

Since there are many ways that software can go, we (the programmers
who are interested enough) can mold the Forth from the hardware
up on these machines.  If you get involved early enough you may
get to mold some of the hardware as well.

Jeff Fox