Re: MeshSP vs. P21

To: Mark Callaghan <markc@xxxxxxxxxxx>
Subject: Re: MeshSP vs. P21
From: Eugen Leitl <ui22204@xxxxxxxxxxxxxxxxxxxxxxx>
Date: Wed, 2 Aug 1995 16:40:17 +0200 (MET DST)
Cc: MISC
In-Reply-To: <9507311928.AA26793@poona.cs.wisc.edu>

On Mon, 31 Jul 1995, Mark Callaghan wrote:

[ Comments on Christophe Lavarenne snipped ]
> 
> Point:
> Memory bandwidth is becoming an increasingly more significant factor
> for processor performance, and future CPU enhancements may be wasted
> if the memory bandwidth is not improved.

Point well taken. The fact is: today's bulk RAM does not go below 60 ns.
Memory accesses with low locality do not benefit from caching. Cache
misses increase latency and make program profiling nontrivial. Moreover,
they give a false impression of performance. Of what use is a 300 MHz
processor in a neural net emulation or a CA gas emulation if RAM can take
only 17 MHz accesses? I can have 100 times that memory bandwidth at the
same price on a maspar machine.
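
A back-of-envelope check of those numbers, as a minimal Python sketch.
It assumes one access per 60 ns memory cycle, with no caching,
interleaving or burst modes; the function names are just for
illustration.

```python
def max_access_rate_mhz(cycle_ns):
    """Upper bound on memory accesses per second, in MHz, for a cycle time in ns."""
    # 1 / (cycle_ns * 1e-9 s) accesses per second, expressed in units of 1e6
    return 1000.0 / cycle_ns


def cpu_cycles_per_access(cpu_mhz, cycle_ns):
    """CPU clock cycles that elapse during one uncached memory access."""
    return cpu_mhz * cycle_ns / 1000.0


if __name__ == "__main__":
    print(max_access_rate_mhz(60))         # access rate of 60 ns RAM
    print(cpu_cycles_per_access(300, 60))  # cycles a 300 MHz CPU waits
```

Under these assumptions 60 ns RAM takes about 16.7 million accesses per
second, and a 300 MHz processor sits through 18 of its own clock cycles
on every uncached access.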

> 
> Generalization:
> Only a tiny class of problems is intrinsically serial?

Yeah, I think this is true.

> How about...
> "Only a small class of problems are cost-effectively parallel."

This cost-effectiveness thing is a self-fulfilling prophecy. The hardware
is not cheap since demand is low. Demand is low since hardware is dear.
Training is not cheap since there is no hardware available and humans
are conservative. The Microsoft/Intel/IBM triumvirate has delayed hardware
and software progress by more than a decade and _they keep going on
at full speed into a blind alley_. Aargh.
 
> How is a program classified as not intrinsically serial?  
>   a) can the algorithm be encoded on a parallel machine      
>   b) can I afford to run the algorithm on a parallel machine
>  
> The second definition is a more concrete means of classifying an
> application as parallel or serial.  If it is cost effective for
> me to run my application on a parallel machine, the application
> is parallel. The cost effectiveness will be determined, in large
> part, by the speedup that I can obtain on the parallel machine.

Why have workstation clusters spread mightily? Why do vector machine
sales stagnate? At Leibniz Rechenzentrum we had a KSR, now we have
an SP2, while older iron is increasingly taken out of service. (The
Cray is still in heavy use, though. The monolingual Fortran scientist
type is in the majority.)

> I will use a parallel machine rather than a serial machine when
> I can:
> a) solve larger problems
> b) solve problems faster
> 
> and the cost of using the parallel machine is offset by any
> gain from the faster solution, or a solution to a larger problem.
> 
> Parallelizing applications is an open research problem, and  
> a successful parallelization of an important application, or 
> an improvement of a previous parallelization is usually worthy
> publishing.

Almost anything can be ported to message passing hypergrids
(e.g. boolean hypercubes) and it maps well. Of course one
has to program asynchronous OOP, but this is the most natural
way to model things.
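
For the boolean hypercube case the addressing is simple: in a cube of
dimension d, two nodes are neighbours if their addresses differ in
exactly one bit, and a message can be routed by correcting the differing
bits one at a time (dimension-order routing). A small Python sketch of
this; the function names are illustrative only and say nothing about how
the F21 network actually works.

```python
def neighbors(node, dim):
    """Addresses of the dim neighbours of a node: flip each address bit once."""
    return [node ^ (1 << k) for k in range(dim)]


def route(src, dst, dim):
    """Dimension-order path from src to dst, correcting the lowest bit first."""
    path = [src]
    cur = src
    for k in range(dim):
        bit = 1 << k
        if (cur ^ dst) & bit:   # addresses still differ in dimension k
            cur ^= bit          # hop to the neighbour along that dimension
            path.append(cur)
    return path


if __name__ == "__main__":
    print(neighbors(0b000, 3))
    print(route(0b000, 0b101, 3))
```

The number of hops equals the number of differing address bits, so no
message needs more than d hops in a cube of 2^d nodes.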

There is a lot of crippled hardware (e.g. KSR) out there and
C++ is not exactly OOP, but do we have to show the bad
examples first?

> 
> The difficulty of developing parallel versions from sequential
> applications, and the difficulty of achieving an acceptable
> speedup limits many applications from being run on parallel
> machines.

You mean porting, not developing. Porting seq->par is essentially
recoding from scratch. De novo programming is simple, provided the
conditions are right.

If the hardware is bad, the debugging tools inadequate, the language
without intrinsic parallelism, and the developer has no parallel design
experience, it is a nightmare.
 
> A low-cost parallel machine would make more applications 
> cost-effectively parallel.  But the performance of this system
> (a cluster of F21s perhaps) must be demonstrated.

Of course. I intend to do protein folding and realtime visualisation
once F21 becomes available. But the point is: an F21 cluster is cheap
enough to buy and experiment on without a major hole in the pocket.
A scalable maspar desktop supercomputer at a standard workstation price
_does_ have an appeal.

F21 is no panacea, though. The arithmetic precision is very limited
(32 bit is barely adequate) and the network bandwidth is limited, though
high compared to Ethernet.

> >Hence, for most problems, even a single F21 is no slower than 
> >a Pentium, an Alpha AXP or whatever, if adequately programmed. 
> >Even better: using mid-grain (1-2 MByte/node) multinode desktop 
> >machine we can have 1-2 orders of magnitude the performance of 
> >a big machine (a workstation) at the same hardware price.
> 
> Generalization:
> An F21 is no slower than a Pentium or an Alpha AXP.

Sure, this is a generalization. If you need floating-point number
crunching or have a purely sequential problem, then you are lost. What
I intended to say, though slightly provocatively, was that one rarely
needs these.

> What is the basis for the comparison between the F21 and the AXP?
> I can look up the SPEC ratings for the AXP and see how it performs
> for integer and FP codes, and I can compare the ratings for the AXP
> to the ratings for other processors.  Currently I cannot do this
> for the F21.  I was unaware that the F21 is being proposed as a
> general purpose processor that will compete with the Alpha AXP and
> Pentium.

In fact, F21 was not intended as a general purpose processor. But
it just turns out it is one, and (provided the specs do not lie)
a damn good one. At least my kind of simulation will not run on
Alpha or Pentium _any_ faster than e.g. on a 25 MHz 68030.

If you use lots of floats and program standard monotask
software in C, a workstation is the right thing to buy.
If you want to do unusual high speed computing at a low
price, an F21 cluster is for you.

> 
> Generalization:
> High-performance parallel systems are built by wiring together
> high-performance scalar processors.

This is not true, or at least not very true. E.g. you don't increase
memory bandwidth by using faster processors.
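
To make that concrete, here is a rough Python sketch. It assumes every
node has its own private RAM and that per-node bandwidth is set only by
RAM cycle time and word width. The 4-byte word and 60 ns cycle used in
the example are illustrative values, not F21 specs.

```python
def node_bandwidth_mb_s(word_bytes, cycle_ns):
    """Peak bandwidth of one node's private RAM in MByte/s (1 MByte = 10**6 bytes)."""
    # bytes per ns times 1000 gives MByte/s; the CPU clock does not appear at all
    return word_bytes / cycle_ns * 1000.0


def aggregate_bandwidth_mb_s(nodes, word_bytes, cycle_ns):
    """Total memory bandwidth of a cluster where every node has its own RAM."""
    return nodes * node_bandwidth_mb_s(word_bytes, cycle_ns)


if __name__ == "__main__":
    print(node_bandwidth_mb_s(4, 60))           # one node
    print(aggregate_bandwidth_mb_s(10, 4, 60))  # ten nodes
```

In this model the only ways to raise total memory bandwidth are faster
RAM, wider words or more nodes; a faster processor changes nothing.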
 
> Building a parallel machine is not simply a matter of wiring CPUs
> together.  Memory latency, synchronization, memory coherency, and

Memory latency is essentially determined by physical access time,
as there is no memory interface worth speaking of in the F21.
Synchronization? Each node accesses its memory asynchronously.
Memory coherency? OOP does not need to maintain an illusion
of one common memory. 

> high-speed communication must be supported efficiently.
> What features does the F21 provide in this area?

High speed networking? Sort of. 10-100 MByte/s ring is quite
fast, I think. The network processor can be expanded to higher
topologies scaling roughly linearly, if you go for total network
bandwidth. 
 
> > _Especially_ for scientific number crunching, a F21 cluster is great.
>  
> Why?  Are there quantitative reasons (SPEC ratings, simulations of F21
> clusters demonstrating scalable speedups, architectural features that
> will improve its performance for scalar or parallel processing -
> features that other architectures do not provide)

The chip is reasonably fast, provided one believes the specs.
It's a bit low on registers, I must admit. Integer ratings will
not be awesome (especially if you divide a lot). Float ratings
will be disastrous at best, since there is no float unit (thank god).
Video and analog processing are nice goodies, but few scientists need 
them. So what?

The best feature is the price. You can build a 10-node machine
for the price of one workstation. Memory bandwidth is great.
Networking bandwidth is adequate. Parallel integer performance is
good (alas, there is no benchmark for that). Etc. Not too bad, I think.
At least for me an F21 cluster is a great machine, though I would
wish for 64 bit and multiple links and maybe 2 MBytes/node,
though this is probably too much.

> or are the reasons qualitative (a belief that modern processors are
> wasteful, a belief that the Forth model of computation is the right one).

I have not spoken a single word pro Forth, though low threading
overhead is a great plus for certain programming models (yes, simulation). 
I'd even wish for 5-10 general-purpose registers, though this would bloat
code space and disable the multiple-instructions-per-word feature.
A home code page in SRAM could possibly alleviate the lack of generic
registers.

Yes, modern processors are wasteful: they cost too much without 
being faster for my application. I can't buy a workstation cluster
(and I wouldn't even if I could), but I can build and use an F21
cluster at home. It suits my programming paradigm (maspar) excellently.
Its architecture is SMALL (=beautiful). Easy enough for me to understand
and use constructively. Easy enough to write self-writing and 
self-modifying code. Easy enough to hack fast display code, which often
rivals dedicated video hardware.

I might be biased, but I like F21.

> >Of course, an off shelf Fortran compiler won't run on it.
> >Some adaptation on the side of programmer will be needed. 
> >Alas, most scientists are nonprogrammers and _very_ 
> >conservative. Bad luck.
> 
> If it is fast enough, they will use it.

Unfortunately, you are wrong. Most scientists do not use computers
for anything other than paper writing and those few who do cannot
program. I mean, they speak Fortran fairly well, but they cannot
program. Though fairly bright, they have neither the time nor the energy,
being terribly busy with their projects. After all, that's what they are
paid for: doing research. 

I wish I were wrong, but I have seen the same picture in both
the chemistry and biochemistry departments. If these guys only
knew what they can do with computer chemistry....

-- Eugene

> 
> mark
> 
> 
>