
Re: MeshSP vs. P21


On Tue, 1 Aug 1995, Mark Callaghan wrote:

[ snip ]
 
> My take on cost-effective parallel computing.
> An n-processor parallel processor will cost n times as much as a serial
> machine.

This statement is simply not true. At least not if one considers
F21-flavoured machines.
 
> There are many factors which make this untrue.  An n-processor parallel machine
> will not have n of everything (n monitors, n disks).  An n-processor parallel 
> machine may also not have as much RAM per CPU, and memory for a 16 Meg PC will cost
> more than the CPU.

A standard PCish workstation has:

a cabinet
keyboard
mouse
floppy
hard disk
CRT
audio card
video card
host adaptor
main board
ethernet card
CPU
DRAM
cache SRAM


Now consider an F21ish cluster. The CPU is negligibly cheap, so we take
8-16 of them. DRAM and SRAM we will still need, but in the same amount
(16-32 MByte). Slightly more SRAM, maybe, though 64k/node will probably
suffice; no external SRAM is needed at all if we integrate fast SRAM on
chip. In fact you don't even need a mainboard or separate packages (one
package, without pins, imagine!) if one integrates CPU/RAM/router on a
single chip.

No ethernet card, since we have built-in networking. One mainboard is
sufficient. One host adaptor is sufficient, though one might use several
EIDE adaptors (technically ugly, but cheap and quite fast) for a cheap
but very efficient RAID. Host adaptor chips are dirt cheap.

No video card. No audio card. One CRT. Several hard
disks for RAID. One of everything else.

In fact, the above configuration will be cheaper, since a lot (e.g. the
very dear CPU) is missing, and significantly more powerful than the
workstation we are comparing it with.
 
You won't have UNIX and all those nice tools. So buy a cheap 33 MHz 486
PC and install Linux on it if you really need a Unix workstation. These
are purported to be quite cheap on the other side of the Atlantic.

> However, there are other overheads of parallel computers that offset this reduction
> in cost.  Fast parallel computers require efficient means of communication, latency
> and bandwidth demands are increased with respect to the serial computer.

Again, if you assume braindead mainframe technology (e.g. KSR), you
are right. But things don't always have to go wrong. Again I say: look
at the F21. Push it to at least 32 bit and add a hypergrid networking
processor, and you can scale up networking bandwidth (lots of it)
with each added node. 100 MByte/s per link, dirt cheap.
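
To make the scaling concrete, here is a back-of-envelope sketch in plain
Forth (gforth). The figures are assumptions for illustration, not F21
specs: a binary hypercube of 2^d nodes with d links per node at an
assumed 100 MByte/s each.

   \ hypothetical back-of-envelope sketch, not F21 specs
   100 constant link-bw        \ assumed MByte/s per link

   : log2      ( n -- d )      \ crude integer log2
     0 swap begin dup 1 > while 1 rshift swap 1+ swap repeat drop ;
   : links     ( nodes -- n )  dup log2 * 2/ ;      \ undirected links
   : total-bw  ( nodes -- MByte/s )  links link-bw * ;

   \ 16 total-bw .  prints 3200: 16 nodes, 32 links, 3.2 GByte/s aggregate

The point is only that every node you add brings its own links, so the
aggregate bandwidth grows with the machine instead of being a fixed bus.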

> Given a linear, or near-linear cost-up for a parallel computer, I would seek a
> linear, or near-linear speedup for my parallel applications (unless I were a 
> special customer for whom speed at almost any cost was acceptable, or I had a
> problem that was too large to solve on a serial machine).  

F21 cluster scaling costs are linear per node. Absolute costs per node
are much smaller. Performance per node is somewhat smaller, though not
drastically so. Total bang/$ is fantastic.
  
> My cynicism about the wonders of parallel computing is that linear, or near-linear
> speedups are difficult to achieve, and become increasingly more difficult to 
> achieve as the number of processors is increased.  Communication between processors
> is more inefficient than serial processing, and improving the communication is
> expensive.

Again I say: you are wrong. All of the above was true until quite
recently. Apart from the admirable British Transputer, which wasn't
cheap, there had not been any CPU-plus-router-on-chip implementations.
One _can_ bring message passing overhead down to almost nothing, and it
is cheap both in logic and in terms of software. There are lots of
maspar programming paradigms requiring only low communication bandwidth,
Genetic Programming in particular, though even bandwidth is not a
problem in a hypergrid topology.

Writing adequate software for such a machine is the problem. That is
what brought D. Hillis down, though there are misdesigns in the CM as
well.

> The weakness in my argument is my assumption of a near-linear cost-up for parallel
> computers in the future (while we have had superlinear cost-up in the past and
> present).
> 
> >I see this as a consequence of the fact that the parallel machines
> >were designed for grand challenge problems.  It is expensive to

They were designed for them, right. I see nothing wrong in trying to
tackle big challenges. Protein folding is a grand challenge, by the
way. So what? Does a set-top box need performance? Do VR or Descent-type
games need performance? Do GUIs need performance? Do databases?
Robotics? Autonomous vehicles? Compression? Connectionist AI? Voice
input? Realtime video? 3d trackers?

I could have gone on endlessly. There is always demand for
cheap power.

> >port and run on these machines and that is what limits the range
> >of applications to those where money is no object.  I wince when
> >I hear things like "our entry level machine with two processors
> >is really inexpensive, only $500,000.00!"
> 
> This addresses another issue, and perhaps the primary issue on the cost of
> parallel computing.  The software cost is much more expensive than the hardware
> cost.  A machine that is more expensive than a cluster of F21s, but which supports

Remember GNU? Now imagine a maspar home computer as a GNU platform,
hooked up to the net. This is fantasy, of course, but high-quality PD
software is not a myth; it can be done. It has been done. Were PC design
not so crippled, that fantasy would be reality today.

All hail to the IBM developers.

> F90 and HPF and PVM may be a less expensive solution for a parallel processing
> customer when the costs of porting applications to a new system are considered.

This is nothing I would use at home.

> >This I would agree with.  This is what currently one of the things
> >that distinguishes F21 from the other chips Chuck is working on.
> >It is designed for SMP.
> 
> What features have been added to support SMP?   or,
> What features have been added to support efficient communication between CPUs?

Though I consider the F21 networking the weakest part of the design, it
has more networking horsepower than anything on the market.
The networking processor can be improved upon very easily, imo.

> >>Generalization:
> >>High-performance parallel systems are built by wiring together
> >>high-performance scalar processors.
> 
> >This has been the trend.  It has been shown that just by running
> >software on their already in place networks of workstations that
> >institutions can get an order of magnitude better price performance
> >on their large parallel applications than they can on their big
> >super computers.  It has been generally accepted that many problems
> >can be solved quite well on systems that are really nothing more
> >than workstations wired together.  Sure some grand challenge
> >computing projects really need those hundreds of megabytes of
> >shared memory on the big machines, but hey we have a rather big
> >federal budget deficit already guys!

This is true. Sadly, network clusters are still not widely used. In
particular, ethernet bandwidth stinks.

> For structured floating point intensive scientific codes, networks of 
> workstations cannot approach the performance of vector supercomputers.

You are talking sequential Fortran here. Even a small workstation
cluster will outperform any sequential number cruncher, as several FPUs
are operated in parallel. Even now ab initio quantum packages are being
ported to parallel machines. There is nothing magical about number
crunching; it can be done efficiently in parallel. And again: the
Fortran vector way of doing things is the old way of doing things.
Most paradigms do not map well onto vector machines. One can approximate
Navier-Stokes for fluid dynamics and solve it with matrix algebra on
a Cray, or implement a distributed lattice CA gas-flavoured model.
I have seen the latter running in realtime on a PC (a 2d lattice,
though). A Cray can't do that.
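
For the curious, here is roughly what such a lattice gas looks like: the
classic HPP rule on a small 2^n torus, written in plain Forth (gforth).
The sizes, the bit encoding and the word names are my assumptions for
illustration, nothing F21-specific; a sketch, not a tuned code.

   \ minimal HPP lattice gas sketch; sizes and encoding are assumptions
   \ one byte per cell, bit 0 = north, 1 = east, 2 = south, 3 = west mover
   64 constant W   64 constant H   W H * constant #cells
   create grid  #cells allot    create grid' #cells allot

   : wrap  ( x y -- i )  H 1- and W * swap W 1- and + ;  \ 2^n torus: AND, not MOD
   : grid@ ( x y -- b )  wrap grid + c@ ;

   : collide ( b -- b' )            \ head-on pairs scatter through 90 degrees
     dup 5 = if drop 10 exit then   \ N+S -> E+W
     dup 10 = if drop 5 then ;      \ E+W -> N+S

   : stream ( -- )                  \ every particle hops to the cell it points at
     H 0 do W 0 do
       i j 1+ grid@ 1 and           \ north mover arrives from the south neighbour
       i 1- j grid@ 2 and or        \ east mover from the west neighbour
       i j 1- grid@ 4 and or        \ south mover from the north neighbour
       i 1+ j grid@ 8 and or        \ west mover from the east neighbour
       i j wrap grid' + c!
     loop loop
     grid' grid #cells move ;

   : step ( -- )
     #cells 0 do grid i + c@ collide grid i + c! loop  stream ;

Every cell only ever talks to its four neighbours, which is exactly the
kind of locality a node-per-region maspar machine eats for breakfast.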

The right paradigm and the right algorithm. You are lucky if your
paradigm matches your hardware. F21 might be a lucky strike for me.

> Networks of workstations impersonating parallel computers can efficiently
> solve problems requiring infrequent communication.  These do not make
> up the majority of parallel codes.  For parallel computing to become

This ain't entirely true, either.

> ubiquitous, it must be useful for applications that require frequent
> communication (so all of the traditional serial applications can be
> converted).

Again, there is nothing magical about fast, high-bandwidth
communications. If you keep the geometry modest, GHz clock rates are
possible.

> >Most super computing today is being done on workstation farms
> >connected via ethernet.  There is no special provision for memory
> >latency, synchronization, and cache coherency problems in hardware.
> >This is just done in software.  Many times even big machines like
> >Cray machines are also connected on these very slow (not high-speed)
> >communication backbones.
> 
> I would not call a PVM cluster of workstations a supercomputer.  A
> Cray is a supercomputer.  A CM-5 was a supercomputer.

Maspar supercomputers can be built. But what for? A workstation
remains usable while number crunching runs in the background.
It does not need dedicated maintenance. It does not need dedicated
air-conditioned buildings with lots of technician slaves. And it's
cheaper. Much cheaper than a supercomputer. Provided networking gets
faster, supercomputers will die. A maspar supercomputer still has
several advantages over a workstation cluster. For now.

> Having no provision for high-bandwidth and low-latency communication
> between nodes greatly limits the problems that such a system can solve.
> The communication between nodes over ethernet will be measured in 
> milliseconds.  Such a high latency will restrict both the size of the
> cluster and the set of solvable problems. 

Biological neural machinery switches at a 1-10 ms rate. Signal
propagation velocity is around 120 m/s. Try emulating a house fly's
navigation skills on a Cray. Oh, there are more than enough bits for
that on a Cray. But the performance...

See the point?
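
The arithmetic behind that, in Forth for the fun of it (the 10 cm path
length is an assumed figure, just to set the scale):

   \ back-of-envelope: nerve signal travel time in microseconds
   120 constant v                       \ m/s conduction velocity
   : travel ( mm -- us )  1000 v */ ;   \ time to cover that many mm
   \ 100 travel .   prints 833, i.e. ~0.8 ms over 10 cm of wetware

That is the same millisecond ballpark as the ethernet latency quoted
above, and nature does quite well with it.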

> >F21 should provide a better high-speed communication than this.
> >In combination with the fact that the interconnect is virtually
> >free you get a very low price per node.  Just like a workstation
> >farm, but instead of $25,000 to $100,000 per node you can pay
> >$25 to $100 per node and get one to two orders of magnitude
> >improvement in the interconnect speed!
> 
> How is the interconnect free?  Or am I getting what I pay for?
> I am assuming that ethernet is the interconnect.

This is the MISC mailing list, and we are arguing about an F21-flavoured
multinode desktop workstation in the 1-2 k$ range here. You can't
predict the future using arguments from the past.

> Clusters of workstations are popular because people have workstations
> sitting around.  They would have these workstations even if they
> were not running PVM, and a lot of the workstations have idle cycles.

This is the beauty behind workstation clusters. You are getting
a supercomputer virtually for free, since the infrastructure is
already there. With F21 one can put the equivalent of a small
workstation cluster on an individual's desktop for the price of a
standard workstation. This is the qualitatively new aspect of
networked MISC.

> The interconnect on this cluster is also free, since the workstations
> required ethernet.

Of course we'd connect the boxes with the same proprietary link.
Why artificially cripple them with ethernet?

> How is a cluster of F21s going to compete against something that was
> already free.  The cluster won't be any faster since it uses the
> same interconnect (ethernet)?

Of course we are comparing new installations here. New workstations
or a new F21 cluster: which is faster and/or cheaper? (Of course
this is unrealistic, since F21 won't run Unix and Unix is the new
religion. But we are comparing technical detail, not marketing,
here. Of course marketing is king, and horrible designs are being
sold for gigabucks. On the other hand there are intrinsic virtues
in not going with the flow. Intrinsic aggravation as well, of course.)

> >You also don't have the problem of resolving cache coherency issues
> >like you do on a network of Alpha with three levels of cache at
> >each node since F21 does not use cache memory.

This cache hierarchy business is getting positively ridiculous
with each new processor release. Have you seen the P6? Have you
watched the transistor count vs. performance curve? That
ain't saturation anymore, that's convergence towards
a constant. What will we do when integration density reaches physical
limits in 10-15 years and there is still no nanotech?
(And there won't be.)

> Does this mean that the F21 cluster won't provide shared memory?  
> Shared memory is an important model for many parallel programmers. 
> They don't all want to write message passing code.

The less power to them. Shared memory is Evil, with a capital E.
It burns transistors. It is a nightmare to design if you want
to ensure data consistency, and it cripples performance for physical
reasons: keep data local. Oh, I forgot: it also prevents people from
adopting asynchronous OOP, the only natural (the universe works this
way) programming paradigm we have invented so far.

> >>Why?
> 
> >For the reasons above. Nodes are cheap and scaled integer math may be
> >fast enough and interconnect should be pretty fast.

Yes, that's exactly the right argument.

> I don't think there is much of a market for an F21 cluster if it will 
> only compete with PVM style clusters.  That is why I have asked about
> SMP specific features:
>  * cache-coherency
>  * high-bandwidth communication
>  * elimination of interconnect and memory hot spots
>  * low-latency communication 

Of course you are right. F21 won't sell en masse. But not for
technical reasons. For reasons of going with the flow. For
investment protection. Training costs. Blind belief in advertising.
Conservatism. Ignorance. Stupidity.

Or else why do you think the PC architecture has persisted for 15 years
and is still going stronger than ever?

> The memory latency problem for parallel processors is even worse than
> for serial processors, but no mention has been made about how this 
> problem will be addressed.

What is this? The limiting factors are the physical memory bandwidth,
which is the same for both designs, and the MMU/cache memory interface,
which introduces additional delays. The aggregate memory bandwidth is
in fact _higher_ than for serial processors, since every node brings
its own memory bus.

Whenever we are going to simulate high state-space velocity/high
path entropy systems (and most physical systems are such), a
shoebox-sized 2 k$ F21/P32 box will run rings around any Cray. And this
is no bullshitting; you can _prove_ that.

It will not be easy to program, but the rewards are well worth it,
imo. In fact, sometime in the future I intend to begin a loose series
of posts to MISC regarding the importance of new algorithms and the
degree of control (the manageable complexity threshold) of cheap maspar
hardware in scientific computation.

> >>       Are there quantitative reasons (SPEC ratings, simulations of F21
> >>clusters demonstrating scalable speedups,
> 
> >No SPEC marks, but of course there have been many simulations.  There
> >are people like Penio, Eugene, and Michael who are running parallel
> >apps and are familiar with the effects of parameters on performance.
> >If you can show similar performance per node and reduce the cost
> >per node by factor of 100 or 1000 and increase the interconnect speed
> >this is pretty strong quantitative evidence to me.

Oh yes.

> Yes, that would be strong quantitative evidence, but...
> 
> * you have not shown similar performance per node
>   (I am assuming that the F21 is being compared to a modern commodity CPU)
>   How can this be done without SPECmarks?
>   This cannot be done by comparing peak MIPS.
>   (I am suggesting that you have done so)

For several reasons I was trying to verbalize above, standard
benchmarks won't run well on an F21, particularly sequential
and float ones (though not too badly, either). In fact these benchmarks
have been designed for old architectures; they measure things which
might be irrelevant to some users while ignoring several important
ones.

"Benchmarks never lie. Liars do benchmarks." 
   
> * you have not shown how this system will provide low-latency, high-bandwidth
>   communication between CPUs.   This is not necessarily a feature of the CPU,
>   the interconnect/memory system is equally important.  What type of interconnect
>   is being designed for the F21 cluster?

High bandwidth? Yes, very important for most maspar applications. Low
latency? This point is highly arguable. But even so: 100 MByte/s
(I hope Chuck gets this working; 10 MByte/s is somewhat tight) is
well beyond PCI bandwidth, and that is for a single communication
channel. It can be expanded to support multiple asynchronous links.

Latency? Very low. Chuck didn't choose packet-switched networking
(though this becomes necessary for very large and physically distant
networks), so there is only one bit of delay per node.
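
To put a number on that "one bit per node" (again assumed figures, based
on the hoped-for link rate, not on measurements):

   \ 100 MByte/s per link = 800 Mbit/s, so one bit time is 1250 ps
   1250 constant bit-time               \ picoseconds per bit
   : hop-delay ( hops -- ps )  bit-time * ;
   \ 8 hop-delay .   prints 10000: 10 ns of cut-through delay over 8 nodes

Compare that to the milliseconds of an ethernet round trip and the
latency argument against clusters looks rather different.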

> While current and past parallel machines are not the only way to design parallel
> machines, they do point out problems inherent in parallel computing, and these
> problems have not been addressed:
> 
> * efficient synchronization methods

There are a lot of asynchronous maspar models, but this is merely
a software problem. Provided the network bandwidth is high, the router
independent, the message overhead low, and there is a minimum of
hardware message passing machinery, sync is not a problem at all.

> * high-bandwidth communication

Check.

> * low-latency communication

Only of secondary importance, at best.

> * elimination of interconnect and memory hot-spots

The former is part of network bandwidth; the latter does
not make sense to me. If physical memory bandwidth is the bottleneck,
what is this hot spot?

> * the non-scalability of broadcast networks for parallel machines

See hypergrid networks. You add network bandwidth as you add
new nodes. Of course, this is not true for a ring.

> * the difficulty of writing large message passing applications

I thought they invented OOP solely to increase code modularity and
reuse and to facilitate the design of large software projects?

> >You can call it a belief, but unless you think that the only problems
> >are grand challenge problems, or unless you think like Bill Gates that
> >your toaster needs to be running MS WINDOWS you would see that for
> >many things modern (high end) processors are wasteful.  I think there
> >is some pretty strong quantitative evidence for this.  Do you really
> >need a 90 Mhz Pentium with 32M of ram to read an i/o device 150 times
> >per second?  Modern processors are also defined by what people are
> >doing with them, and a lot of that is wasteful.

100% accurate. But I would say that there is a great potential market
for home maspar boxes. The awareness/demand is just underdeveloped.
The mainstream will reinvent maspar MISC if given enough time, but
I don't want to wait yet another decade.

> >This is not a religious argument.  It has no more to do with belief
> >than any other opinion.  All opinions must be based on some metaphysical
> >belief system.      
> 
> My use of a workstation is a waste when there is an acceptable alternative.
> Is the F21 an alternative for a workstation or parallel computing?

I _believe_ it is one. I will know when I run my code on it.

> If so, then it is fair to ask for a quantitative analysis of it to compare
> it to what I currently use.  It is also fair to ask how it addresses problems
> that I have with my current systems.

It would be very helpful if you could describe the hardware setup you
currently use, and for which problems.
 
> It becomes a religious argument when the alternative is promoted as better
> with no means of evaluation against my current system.  

True.

> >As for the Forth model of computation being the right one, well I
> >think there is strong evidence that the Forth model is a good one
> >for this architecture.  I am not convinced that it is the only or

Sure, this is a stack processor, but LISP should also run fine.
CLOS, parallel function evaluation and fine-grained parallel GC
are the icing on the cake. Anybody on MISC want to do LISP on the F21?

> >even the best.  I will be happy to be convinced that other approaches
> >can produce high quality software as well.  It may be to some extent
> >that these chips use will be restricted to the Forth model unless
> >people show that you can effectively apply other models to these
> >fairly unusual processors.

Forth is messy, and I tend to screw up my machine (Novix board or
Amiga) each time I try to hack some algorithm at home. It is
interesting to compare the two machines: the Amiga is very easy to
shoot, since there are lots of vulnerable tasks and no memory
protection. It is comfortable, but takes minutes to reboot (I have lots
of things running in the background). The Novix, on the other hand, has
a small memory space, no multitasking, and takes a fraction of a second
to reboot. Since I tend to do shotgun debugging and have a messy
programming style, the Novix is an optimal prototyping machine.

An F21 cluster would have these nice features: not easy to shoot
(separate address spaces in each node), lightning fast to reboot,
and lots of Forth debugging tools available. The assembly language
is simple enough to use directly. In fact, self-writing and
self-modifying code becomes possible. I know this is frowned upon
nowadays, but using assembly in isolated hot spots (e.g. defining your
own array neighbourhood operator via offsets, clipping 2^n-sized array
indices with logic ops, hacking the return address, autointerpolating
lookup tables, etc.) can vastly increase performance if they are called
a lot in a scientific calculation.
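
A flavour of what I mean, in plain Forth; the names, sizes and the
table are made up for illustration, nothing F21-specific:

   \ illustration only: exploit 2^n sizes so AND replaces MOD and bounds checks
   1024 constant #cells
   create lattice  #cells cells allot
   : clip  ( i -- i' )  #cells 1- and ;   \ wrap the index with a single AND
   : east  ( i -- i' )  1+ clip ;         \ neighbourhood operators via fixed offsets
   : west  ( i -- i' )  1- clip ;
   : lat@  ( i -- x )   cells lattice + @ ;
   : lat!  ( x i -- )   cells lattice + ! ;

   \ a lookup table instead of recomputing a function in the inner loop
   create f-tab  256 cells allot          \ filled once at startup, scaled integers
   : f@  ( x -- fx )  255 and cells f-tab + @ ;

Nothing deep, but in an inner loop that runs a few hundred million
times these little tricks are exactly where the cycles go.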

> We would not be communicating with one another if it were not for other
> models of computation.  I find it very amusing when C and the design of
> CPUs are demeaned on comp.lang.forth.  The simplest way to show value
> is by demonstration.  Of course, I hate it when dos lovers use this
> argument to show the utility of their systems.  Then I liken its large
> use to a virus.

C _is_ a terrible language, UNIX _is_ an awful OS and current CPUs
_are_ atrocious. But they are mainstream. Some 10 million programmers
are using them worldwide, compared to a few tens of thousands of Forth
programmers. This eliminates bugs from software, produces good tools
and makes a great market.

It matters little now whether the Forth language and stack processors
should have been generally adopted as the standard; though we would be
living in a hacker's paradise, the chance is gone. A real chance
existed at the beginning of the 80ies; now Forth might be a niche for
machine-independent low-level hardware boot code (Sun, PCI Power Mac)
or nanokernel OSes (like Taos), because of its high code density.

Forth offers too much power for the average programmer. Sure, one
can leap mountains or walk on water if one really has intricate
knowledge of a good Forth system (I have always envied these guys), but
such people always end up tweaking bits to customize their particular
system into something awesome and quite beyond applicability instead of
concentrating on practical solutions.

Let me sum it up: low absolute numbers of Forth programmers,
quite isolated, terrible press or, even worse, no press
at all, too much customizability, pretty hard to learn
in the beginning, a general tendency towards esoteric programming
practices, wild claims, and few real-world applications, none of them
widely known.

Any questions left as to why Forth never really took wing?

Dick Pountain wrote something like:
"Forth is, regrettably, one of the best kept secrets in the
computer society". This says it all.

> >such concept.  This has always been one of my buttons.  I like to
> >point out that the people who tell me that I am a (Forth) religious
> >zealot usually also tell me that although we cannot understand
> >the inner workings of modern chips and C compilers that we have
> >"faith" in these things.  
> 
> More important to me is that we can evaluate compilers and architectures
> and implementations of architectures.  I brought up 'quantitative' versus

No, you cannot. You simply use these benchmarks (and compiler writers
know these benchmarks _really_ well by now). How often do you look
at the assembly your compiler produces? Can you read it? Does it make
sense? Can you estimate how long a routine will take without using
a profiler? Do you get nonlinear, or even worse, paradoxical results
when optimizing? Do you know all these compiler switches by heart and
what they stand for? Do you know the processor details of your machine?
How long does it take to recompile your project? How stably does your
system run?

You maintain an illusion of control. Most people won't ever need that
much knowledge of, or control over, their machine. But a simple design
is... simple. At least you have a chance of learning these things, if
you have need of them, on an F21. And knowledge is power, after all.

Now try to understand a Sun in detail. Ack.

> 'qualitative' because the MISC chips were being judged to be superior
> solely on qualitative terms.
> 
> >I was able to get Russ Hersh to update his information about the Forth
> >language in the Microcontroller FAQ listed in a number of newsgroups
> >on the internet last year.  He even included a section on the MuP21
> >and F21 even though they are not really microcontrollers. (just priced
> >like them)  But most people on the net who have requested information
> >on processors have declined to include MuP21 or F21 in their lists
> >of compiled information.  This is not because they simply don't believe
> >the data, but it is meaningless to them.  They typically say, "I will
> >include information when you supply SPECint, SPECfp, Winmarks, etc,
> >until then the information is not useful."  If the only evidence that
> >is meaningful is "how fast will it run my MS WINDOWS?" then no one
> >may ever even notice F21 or P32 or whatever.  It doesn't matter if
> >you can deliver 1000 mips for $1 to most people if it is not politically
> >correct most people will not ever hear about it or consider it.

Yeah, this is the point. Apart from not believing the specs (I found
it highly unwise to mention the 10 kTransistor count and the die size
right after delivering the performance figures), the next question is:
does DOS/Windows/Linux run on it? What software is available?
And that was that.

> I am a user of workstations.  The SPEC benchmarks are a reasonable 
> (not perfect) means for me to judge the performance of a workstation.
> It reflects a usage similar to how I may use the workstation.
> 
> Others are consumers of software.  They have much money invested directly
> (in purchase costs) and indirectly (in training) in Windows.  Spending
> $1 for an F21 or $400 for a 486 is minor compared to the cost of the 
> rest of the system.

Yes, all perfect reasons to perpetuate idiocy all the way up into the
third millennium. Optimizing locally instead of going for the global
minimum. Yeah, we all know these perfectly sane, perfectly moronic
reasons.

> Even if performance were the only metric used to evaluate a system, 
> MIPS per dollar is still irrelevant.  What is relevant is execution time
> for applications that I want to run.

Have you noticed how creeping featuritis bloats code and increases
execution time, yet people still buy it because there are _more_
features and the version number has been incremented?

We all get the systems we deserve. (Seems to have a lot in common with
the prisoner's dilemma.)

> Heck, there is a product out now that translates binaries from SPARCs to
> Alphas.  Workstations are not sold on CPU performance alone.

This will not work for hand-crafted code, e.g. with computed jumps.
Not on CPU performance alone? We just bought 20 Indys, but everybody
still number-crunches on Pentiums (if they can't get a Cray), since
float performance is so much better there.

They use the Indys for web crawling and mail, mostly. Ah, and games,
I forgot.

> >If some of the people who consider F21 actually do something and
> >we can publish the results then some people may notice.  Most people
> >will only recognize a product.  They have no idea that it has an
> >embedded computer in it at all, let alone what model or microprocessor
> >is making it work.
> 
> I cannot talk about the embedded systems market.  I am a consumer in the
> workstation/academic market.  I want to see SPEC benchmark ratings if
> MISC chips are going to be evaluated in this area.  I also want to hear
> about your interconnect, and SMP specific features if MISC chips are going
> to be proposed as a parallel system.  Nothing about MISC chips will be 
> accepted by a referee for publishing in this area without quantitative
> information such as this.

These data will eventually be forthcoming. And the data won't look
good, since the benchmarks are biased. It's all perfectly proprietary
and incompatible, and it won't sell worth one cent. Oh, I forgot
the embedded controller market. Well, make it two cents.

Defeatist? Bitter? Disillusioned? No way, Sir!

-- Eugene

> 
> mark