F21
- To: MISC
- Subject: F21
- From: jfox@xxxxxxxxxx (Jeff Fox)
- Date: Tue, 1 Aug 1995 10:28:58 -0700
Dear MISC readers,
Mark Callaghan (>) made some comments about what Eugene (>>) had said:
>Point:
>Memory bandwidth is becoming an increasingly more significant factor
>for processor performance, and future CPU enhancements may be wasted
>if the memory bandwidth is not improved.
This is very true.  And in a sense all the solutions employ some form
of parallelism.  The popular way to attack this problem is to use a
wider bus, multiple execution units and multistage pipelining, lots
of general purpose registers, and cache memory.  It is a big problem
these days, and people should realize that three levels of cache
memory is really, like the other techniques, an attempt at parallelism
at some level of hardware.
MISC chips attack the problem by using very small instructions with
multiple instructions per memory word, and the first MISC chips also
use on chip stacks to speed execution.  This is also a form of
parallelism.  The bottleneck is indeed the memory access, but the
MISC architecture technique is a cheap way to do lots of processing
out of relatively slow and cheap memory.
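
As a rough sketch of why packing several small instructions into one
memory word cuts down on memory traffic, the Python below packs and
unpacks opcodes and counts the fetches needed for a straight-line run.
The 20-bit word, the 5-bit opcode width, and the names pack, unpack and
fetches_needed are assumptions made for illustration; none of them is
taken from this post.

```python
# Sketch only: the 20-bit word and 5-bit opcode widths are assumptions,
# not figures taken from this post.
WORD_BITS = 20
OPCODE_BITS = 5
SLOTS = WORD_BITS // OPCODE_BITS  # instructions held in one memory word


def pack(opcodes):
    """Pack SLOTS opcodes into one word, first opcode in the high bits."""
    if len(opcodes) != SLOTS:
        raise ValueError("expected %d opcodes" % SLOTS)
    word = 0
    for op in opcodes:
        if not 0 <= op < (1 << OPCODE_BITS):
            raise ValueError("opcode out of range")
        word = (word << OPCODE_BITS) | op
    return word


def unpack(word):
    """Return the opcodes held in one word, in execution order."""
    mask = (1 << OPCODE_BITS) - 1
    return [(word >> (OPCODE_BITS * (SLOTS - 1 - i))) & mask
            for i in range(SLOTS)]


def fetches_needed(instruction_count):
    """Memory reads needed for a straight-line run of instructions."""
    return -(-instruction_count // SLOTS)  # ceiling division
```

Under these assumptions one memory read feeds the processor several
instructions, so slow memory is touched far less often than once per
instruction.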
The Novix used a variation on the wider (more parallel) bus by having
multiple (parallel) busses.  There is no reason that future MISC chips
cannot have as many busses, or busses as wide, as are needed to get high
memory bandwidth.  Latency is still expensive to improve since that takes
faster chips.
Ball Grid Array packaging is being used on a P32 prototype.  It makes
it possible to get many more pins on a tiny die.  Yes, P32 is back to
the tiny die used on the MuP21.  Since P32 is .8 micron it also has room
for more transistors than MuP21, which used 1.2 micron on the same
2.4 x 2.4mm die.  There is nothing preventing connecting to memory
through multiple banks of arbitrary width except a penny per pin.
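
A back-of-the-envelope check on the "room for more transistors" remark,
assuming ideal scaling where transistor density goes as the inverse
square of the feature size.  That scaling rule is an assumption, not
something stated in this post, and a real layout (pads, wiring) will
not scale this cleanly.  The function name density_gain is made up for
the example.

```python
from fractions import Fraction


def density_gain(old_micron, new_micron):
    """Ideal density ratio when shrinking the feature size.

    Assumes area per transistor scales with the square of the
    feature size, which is only a first-order approximation.
    """
    ratio = Fraction(old_micron) / Fraction(new_micron)
    return ratio * ratio


# 1.2 micron (MuP21) to 0.8 micron (P32) on the same 2.4 x 2.4mm die
print(float(density_gain("1.2", "0.8")))  # 2.25
```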
The real point I am trying to make is that memory bandwidth has certainly
been the limiting factor on the MuP21 and F21.  The upper limit for
what Chuck understands right now would support about 1000 mip in a fast
process, and an interface to multiple wide banks of memory to actually
get that performance on _some_ code.  Chuck could go to a wider bus
and more banks of memory, but his CPU would still be limited to about
1000 mip.
>Generalization:
>Only a tiny class of problems is intrinsically serial?
I think these are two entirely unrelated subjects myself.  I do not
see that one follows the other.  (parallel not serial subjects :-)
But I cannot agree more.  The more I look at it the more parallelism
I see.  I have concluded that everything in the real world is incredibly
parallel, and only a very tiny class of problems (mostly examples used
in serial communications) create the illusion that serial is useful.
This is why computers have historically been so limited and unable to
do most of what has been expected of "thinking" machines.  Serial processing
has not proved very useful for thinking.
You know much of our reality is the distributed computer formed by
10^10 nodes with 10^12 parallelism at each node. (The collection of human
minds in human culture is one example of parallelism, try to imagine
simulating it on a serial machine.)
>How about...
>"Only a small class of problems are cost-effectively parallel."
                                 ^^^
How about "have been" instead of "are"?  Parallel machines are
historically expensive so that has limited them to only a very small
range of things that our tax dollars are not paying for.  Just
because the only parallel machines have been "grand challenge"
type state of the art national debt sort of platforms doesn't
really mean that that kind of expense is inherent in parallel
approaches.
>How is a program classified as not intrinsically serial?
>  a) can the algorithm be encoded on a parallel machine      
>  b) can I afford to run the algorithm on a parallel machine
   c) it can be run in a serial fashion along with lots of other
      useful processing going on in parallel on a parallel machine.
Ok that might be stretching the definition, I mean after all this
says that there is no such thing as anything really intrinsically
serial in the real world.  But then again at the lowest level each
process in a parallel machine is still serial.
But I think a) does apply to algorithms but not to programs or
problems.
b) is a very poor definition because what "I can afford" is back
   to the idea that parallel is somehow intrinsically expensive.
   And what they could afford has not limited the amount of money
   that was spent on many government funded projects.
>The second definition is a more concrete means of classifying an
>application as parallel or serial.  If it is cost effective for
>me to run my application on a parallel machine, the application
>is parallel. The cost effectiveness will be determined, in large
>part, by the speedup that I can obtain on the parallel machine.
Yes, for a given application and a given machine or budget etc,
I agree.  This is just common sense in the real world.
>I will use a parallel machine rather than a serial machine when
>I can:
>a) solve larger problems
>b) solve problems faster
 how about also:
 c) solve more problems in parallel?
 d) solve problems cheaper?
>Parallelizing applications is an open research problem, and  
>a successful parallelization of an important application, or 
>an improvement of a previous parallelization is usually worthy
>publishing.
To paraphrase Gordon Bell, "the research clearly shows that you
can solve any problem on any hardware if you throw enough of
our tax dollars at it."
I have looked into various architectures in hardware and
various approaches to software.  Most of F21 is quite conventional
in this regard.  But as a cheap platform for parallel research
it might provide some new discoveries.
>The difficulty of developing parallel versions from sequential
>applications, and the difficulty of achieving an acceptable
>speedup limits many applications from being run on parallel
>machines.
I see this as a consequence of the fact that the parallel machines
were designed for grand challenge problems.  It is expensive to
port and run on these machines and that is what limits the range
of applications to those where money is no object.  I wince when
I hear things like "our entry level machine with two processors
is really inexpensive, only $500,000.00!"
>A low-cost parallel machine would make more applications
>cost-effectively parallel.  But the performance of this system
>(a cluster of F21s perhaps) must be demonstrated.
Yes, but knowing that chips can be manufactured for <$1 each
in quantity will not convince many people unless they can actually
buy them for just a few bucks themselves.
Prototyping chips is very cheap compared to most people's ideas about
what it takes to develop and prototype custom vlsi chips, but it is
still orders of magnitude from large volume prices.  Mosis raised
rates just before this last run.  If we get all the prototype die
mounted in pga packages they will cost about $375 each, and if we
spend $1000 less and don't get them all packaged we can get some
from this run for $1000 each.  This is still an issue at this time.
>>Hence, for most problems, even a single F21 is no slower than
>>a Pentium, an Alpha AXP or whatever, if adequately programmed.
I try to avoid saying things like this myself.  People will assume
this means you could do 600 MFLOP like a $4000 300 mhz 21164 with
a few hundred dollars worth of GAS cache controllers and three
levels of cache etc on a single F21 with some cheap ram.
I might say that for some problems a single F21 might show similar
performance to a Pentium and maybe even an alpha.  But even if there
are really lots of these problems they are not the ones people will
think of first.  They want to know SPEC marks, NT performance, WINmarks,
or the C code they are already using.
The top of the line Alpha is listed as having (peak advertising) mips
of 1200.  That is two integer and two FP instructions per clock cycle
at 300 mhz.  But this is not very realistic as a real world indication
where things go out of cache etc.  I have seen numbers listed showing
the actual performance in real world applications to be 3 or 4 clock
cycles per instruction.  That is 75 to 100 mips, not 1200 mips.
You see the same thing with P21 or F21.  100 or 200 mips is really
peak advertising mips.  You will not see that performance in many
applications.  If they are well coded they will approach that on the
timing critical routines.
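
The Alpha arithmetic above can be checked with a few lines of Python.
The clock rate, issue width, and clocks-per-instruction figures are the
ones quoted in this post; the function names are made up for the
example, and this is only the simple clock-times-rate model, not a
benchmark.

```python
def peak_mips(clock_mhz, instructions_per_clock):
    """Peak rate: every issue slot filled on every clock."""
    return clock_mhz * instructions_per_clock


def effective_mips(clock_mhz, clocks_per_instruction):
    """Sustained rate when each instruction costs several clocks."""
    return clock_mhz / clocks_per_instruction


ALPHA_CLOCK_MHZ = 300  # figure quoted in the post

# two integer plus two FP instructions per clock
print(peak_mips(ALPHA_CLOCK_MHZ, 4))       # 1200
# 3 or 4 clocks per instruction once things go out of cache
print(effective_mips(ALPHA_CLOCK_MHZ, 3))  # 100.0
print(effective_mips(ALPHA_CLOCK_MHZ, 4))  # 75.0
```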
However there is always the point that the 21164 is roughly 1000 times
bigger, more expensive, and power hungry than a F21.  I prefer to
compare one 21164 to 1000 F21 for this reason.  I think it very
misleading to say that a single F21 can do most things about as
fast as a Pentium or Alpha.  To most people most things means what
they are doing now.
>>Even better: using mid-grain (1-2 MByte/node) multinode desktop
>>machine we can have 1-2 orders of magnitude the performance of
>>a big machine (a workstation) at the same hardware price.
This I would agree with.  This is currently one of the things
that distinguishes F21 from the other chips Chuck is working on.
It is designed for SMP.
>Generalization:
>An F21 is no slower than a Pentium or an Alpha AXP.
>
>What is the basis for the comparison between the F21 and the AXP?
>I can look up the SPEC ratings for the AXP and see how it performs
>for integer and FP codes, and I can compare the ratings for the AXP
>to the ratings for other processors.  Currently I cannot do this
>for the F21.  I was unaware that the F21 is being proposed as a
>general purpose processor that will compete with the Alpha AXP and
>Pentium.
I don't think it is.  Alpha and Pentium are targeted for UNIX, C, DOS,
and WINDOWS apps.  F21 is not.  This is what "general purpose" means
to most people. (but not to me, to me this is a small subset for
expensive designer machines :-)
I would propose using F21 instead of a Pentium on many of the embedded
applications I read about in comp.realtime or comp.arch.embedded, where
people say "a 66 mhz 486 DX/2 with 16M of ram and 256K cache can
support 100 interrupts per second.  If your app needs to run faster
than that in the real world upgrade to a Pentium 90 mhz with 32M ram."
But at least there have been more voices of reason in those newsgroups
who know something about hardware and software.
>Generalization:
>High-performance parallel systems are built by wiring together
>high-performance scalar processors.
This has been the trend.  It has been shown that just by running
software on their already in place networks of workstations that
institutions can get an order of magnitude better price performance
on their large parallel applications than they can on their big
super computers.  It has been generally accepted that many problems
can be solved quite well on systems that are really nothing more
than workstations wired together.  Sure some grand challenge
computing projects really need those hundreds of megabytes of
shared memory on the big machines, but hey we have a rather big
federal budget deficit already guys!
>Building a parallel machine is not simply a matter of wiring CPUs
>together.  Memory latency, synchronization, memory coherency, and
>high-speed communication must be supported efficiently.
>What features does the F21 provide in this area?
Most super computing today is being done on workstation farms
connected via Ethernet.  There is no special provision for memory
latency, synchronization, and cache coherency problems in hardware.
This is just done in software.  Many times even big machines like
Cray machines are also connected on these very slow (not high-speed)
communication backbones.
F21 should provide a better high-speed communication than this.
In combination with the fact that the interconnect is virtually
free you get a very low price per node.  Just like a workstation
farm, but instead of $25,000 to $100,000 per node you can pay
$25 to $100 per node and get one to two orders of magnitude
improvement in the interconnect speed!
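
To put numbers on the per-node prices above, here is a small sketch
using only the dollar figures quoted in this post.  The function names
and the example budget are hypothetical, and the sketch says nothing
about per-node performance, which still has to be demonstrated.

```python
def cost_ratio(workstation_cost, f21_cost):
    """How many times cheaper an F21 node is than a workstation node."""
    return workstation_cost / f21_cost


def nodes_for_budget(budget, cost_per_node):
    """Whole nodes a fixed budget buys (interconnect treated as free)."""
    return budget // cost_per_node


# low end and high end of the price ranges quoted above
print(cost_ratio(25_000, 25))     # 1000.0
print(cost_ratio(100_000, 100))   # 1000.0

# hypothetical budget equal to one high-end workstation node
print(nodes_for_budget(100_000, 100))  # 1000
```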
You also don't have the problem of resolving cache coherency issues
like you do on a network of Alpha with three levels of cache at
each node since F21 does not use cache memory.
>> _Especially_ for scientific number crunching, a F21 cluster is great.
>
>Why?
For the reasons above. Nodes are cheap and scaled integer math may be
fast enough and interconnect should be pretty fast.
>       Are there quantitative reasons (SPEC ratings, simulations of F21
>clusters demonstrating scalable speedups,
No SPEC marks, but of course there have been many simulations.  There
are people like Penio, Eugene, and Michael who are running parallel
apps and are familiar with the effects of parameters on performance.
If you can show similar performance per node and reduce the cost
per node by a factor of 100 or 1000 and increase the interconnect speed
this is pretty strong quantitative evidence to me.
>                                           architectural features that
>will improve its performance for scalar or parallal processing -
>features that other architectures do not provide)
Well such things as single message broadcast on the network is a
nice feature not found on all networks.  The main architectural
benefit is size of these chips.  Very low cost and power requirements
per node.  For grand challenge problems people want the fastest
processor and most memory per node they can get.  This is why a
microprocessor based multiprocessor could cost $250,000 per node!
Or why you may need a 200 KW power supply!
>or are the reasons qualitative (a belief that modern processors are
>wasteful, a belief that the Forth model of computation is the right one).
You can call it a belief, but unless you think that the only problems
are grand challenge problems, or unless you think like Bill Gates that
your toaster needs to be running MS WINDOWS you would see that for
many things modern (high end) processors are wasteful.  I think there
is some pretty strong quantitative evidence for this.  Do you really
need a 90 Mhz Pentium with 32M of ram to read an i/o device 150 times
per second?  Modern processors are also defined by what people are
doing with them, and a lot of that is wasteful.
This is not a religious argument.  It has no more to do with belief
than any other opinion.  All opinions must be based on some metaphysical
belief system.      
As for the Forth model of computation being the right one, well I
think there is strong evidence that the Forth model is a good one
for this architecture.  I am not convinced that it is the only or
even the best.  I will be happy to be convinced that other approaches
can produce high quality software as well.  It may be to some extent
that these chips' use will be restricted to the Forth model unless
people show that you can effectively apply other models to these
fairly unusual processors.
Many people associate Forth with the attitude that the Forth model
is the right one for everything.  I have always argued strongly against
this since there are so many other factors in the real world when
discussing a computer language used by humans on real world problems.
But we are discussing a pretty tight integration of the Forth
computing model and the hardware in the MISC mail list.  I
think it is a valid assumption that Forth is a good computational
model on these machines.  But I don't think that it is a matter of
opinion based on evidence vs "belief" based on faith or some other
such concept.  This has always been one of my buttons.  I like to
point out that the people who tell me that I am a (Forth) religious
zealot usually also tell me that although we cannot understand
the inner workings of modern chips and C compilers that we have
"faith" in these things.  
>If it is fast enough, they will use it.
>
>mark
(Is this heaven? No, it's Iowa.)
I would like to think so, after all scientists should have an objective
view.  But being humans they are also influenced by many human factors
such as prior experience, habit, and limited metaphysical perspective
that will factor into the equation.  Remember that when Apple talks
about Open Firmware they never use the word "Forth" since it seems to
make the audience boo.
Many scientists are locked into Fortran and dismiss Forth out of hand
because they only do floating point or complex math and well Fortran
does that and they think Forth can't.  F21 is not really designed for
floating point either so that is an issue with many of these guys up
front.
There is also the status issue.  Much of microcomputer marketing is
based on style and status.  How "macho" is your processor?  Do you
follow "Mac Style" tips in your lifestyle?  Have you upgraded to
the latest version of the compatible hardware and software?
Many managers have told me that the biggest thing in their budgets
are the PCs that they buy.  Their status and promotion in the organization
is improved by spending money not saving it.  They would much rather buy
$20,000 PCs than $200 PCs.  If they don't spend the money this year then
next year they will only be able to approve $200 PCs.
The market is driven by bigger boxes, more diskettes, upgrades, and
bigger disks for larger screen saver collections etc.  The PC market
seems to be the last bastion of conspicuous consumerism in this
country.  People tell me all the time how waste is a good thing because
it is what drives and inflates the market.
"If it is fast enough" may not be enough.  I would like to think so, but
with so many other factors coming into the equation just how fast is
that?  There are so many image issues that I wonder.
I was able to get Russ Hersh to update his information about the Forth
language in the Microcontroller FAQ listed in a number of newsgroups
on the internet last year.  He even included a section on the MuP21
and F21 even though they are not really microcontrollers. (just priced
like them)  But most people on the net who have requested information
on processors have declined to include MuP21 or F21 in their lists
of compiled information.  This is not simply because they don't believe
the data, but because it is meaningless to them.  They typically say, "I will
include information when you supply SPECint, SPECfp, Winmarks, etc,
until then the information is not useful."  If the only evidence that
is meaningful is "how fast will it run my MS WINDOWS?" then no one
may ever even notice F21 or P32 or whatever.  It doesn't matter if
you can deliver 1000 mips for $1; if it is not politically correct
most people will not ever hear about it or consider it.
If some of the people who consider F21 actually do something and
we can publish the results then some people may notice.  Most people
will only recognize a product.  They have no idea that it has an
embedded computer in it at all, let alone what model or microprocessor
is making it work.
Getting F21 from just an idea back in 1990 to a silicon prototype chip
has been a long and uphill struggle.  Production and consumer products
are large hurdles yet to be cleared.  I have thought that the MISC
mail list might provide help in this regard.  Having other people
involved in research and product development will be essential in
building momentum for the project.  The signal to noise ratio in
MISC seems to be pretty good.  It seems that it is a sort of
parallel support environment for MISC chips with over 100 readers
thinking about the MISC chip issues.
Jeff Fox
Ultra Technology