
Re: MeshSP vs. P21



>> = Mark Callaghan
> = Eugen Leitl

>F21 cluster scaling costs are linear per node. Absolute costs per node
>are much smaller. Performance per node is somewhat smaller, though not
>drastically so. Total bang/$ is fantastic.

This cluster has neither been designed, built, nor simulated.  I don't doubt
the utility of the F21.  I doubt:

  * the performance of a cluster of F21s as a maspar machine
  * the performance of a single F21 for scientific computing
  * the performance of a single F21 for workstation type computing

>> My cynicism about the wonders of parallel computing is that linear, or near-linear
>> speedups are difficult to achieve, and become increasingly more difficult to 
>> achieve as the number of processors is increased.  Communication between processors
>> is more inefficient than serial processing, and improving the communication is
>> expensive.

>Again I say: you are wrong. All above has been true, until quite recently.
>Apart from the admirable British Transputer which wasn't cheap, there hadn't 
>been 
>any CPU+router on chip implementations. One _can_ bring message passing
>overhead down to almost nothing, and it is cheap both logically and in
>terms of software. There are lots of maspar programming paradigms
>requiring low communication bandwidth, though even this is not a problem
>in a hypergrid topology, particularly Genetic Programming.
>
>Writing adequate software for this machine is the problem. That is
>what has brought D. Hillis down, though there are misdesigns in
>the CM.

Message passing overhead is only part of the problem.  The difficulty for
parallel systems is running programs that require frequent communication on
systems with high latency for communication operations.  Message passing
overhead contributes to the problem by taking away computation cycles;
however, removing the overhead by handling message passing with a communication
process does not remove the latency.  It still takes a long time (from the
viewpoint of the CPU) to communicate with another processor.
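A toy model makes the overhead/latency distinction concrete.  All cycle counts
below are made-up illustrative numbers, not F21 measurements: even if a
communication coprocessor absorbs the per-message overhead, the CPU still
stalls for the network round-trip whenever it needs the remote result.

```python
# Toy cost model (hypothetical numbers): overhead can be offloaded,
# latency cannot -- it can only be hidden behind leftover computation.

def step_time(compute_cycles, overhead_cycles, latency_cycles, offload=False):
    """Cycles for one compute-then-communicate step.

    With offload=True a communication coprocessor pays the overhead, and
    the CPU overlaps the latency with whatever compute it has left.
    """
    if not offload:
        return compute_cycles + overhead_cycles + latency_cycles
    # Best case: latency fully overlapped with compute, whichever is longer.
    return max(compute_cycles, latency_cycles)

base = step_time(1000, 200, 5000)             # 6200 cycles
offloaded = step_time(1000, 200, 5000, offload=True)  # 5000: still latency-bound
```

The offloaded case saves the 200 overhead cycles but remains dominated by the
5000-cycle latency, which is the point of the paragraph above.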

This will never be a solved problem.  

Processors will never be able to communicate with other processors faster than
they can communicate with their local storage.

The effects of this problem can be seen very clearly in the speedup curves for
parallel machines.
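Those curves follow from a simple Amdahl-style model once a communication term
is added: speedup rises, peaks, then falls as per-step communication cost grows
with the processor count.  The constants here are illustrative placeholders,
not measurements of any real machine.

```python
# Amdahl's law with a communication penalty that grows with processor
# count p.  serial_frac and comm_cost are hypothetical constants chosen
# only to show the shape of the curve.

def speedup(p, serial_frac=0.05, comm_cost=0.01):
    # Normalized runtime on one processor is 1.0.
    t = serial_frac + (1 - serial_frac) / p + comm_cost * (p - 1)
    return 1.0 / t

curve = [(p, round(speedup(p), 2)) for p in (1, 4, 16, 64, 256)]
# The curve peaks somewhere in the middle and then degrades: adding
# processors eventually costs more in communication than it buys.
```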

Parallel applications that can tolerate low bandwidth and high latency are the
minority of current parallel applications.  If parallel processing is to become
useful for general-purpose computing, high bandwidth and low latency must be
supported inexpensively.

Low bandwidth and high latency are already supported by PVM.  In some sense
a PVM cluster is a free parallel machine since it can utilize idle workstations.
It will be very difficult for a vendor to "sell" a machine in this market.

>> Clusters of workstations are popular because people have workstations
>> sitting around.  They would have these workstations even if they
>> were not running PVM, and a lot of the workstations have idle cycles.

>This is the beauty behind workstation clusters. You are getting
>a supercomputer virtually for free since the infrastructure is
>already there. With F21 one can put a small workstation cluster
>equivalent on an individual's desktop for a price of a standard
>workstation. This is the qualitatively new aspect of networked MISC.

Two assertions.
1) An F21 cluster == a workstation cluster
2) A workstation cluster == a supercomputer

There is no evidence for 1.  This is something that I have been whining about
for several posts.  

There is evidence against 2.  Workstation clusters do not provide the performance
of supercomputers.  

They lack the memory system performance. 
They lack the I/O performance.
They lack the communication performance.

Sustained performance is the key, and peak FP and integer rates
of workstation CPUs do not translate into sustained performance.  MPPs can solve
problems that vector supercomputers cannot.  But a workstation cluster is not
an MPP (T3D, CM-5).

>> Does this mean that the F21 cluster won't provide shared memory?  
>> Shared memory is an important model for many parallel programmers. 
>> They don't all want to write message passing code.

>The less power to them. Shared memory is Evil, with capital E. 
>It burns transistors. It is a nightmare to design if you want
>to ensure data consistency and it cripples performance for physical 
>reasons: keep data local. Oh, I forgot: it prevents people from adopting 
>asynchronous OOP, the only natural (the universe works this way) 
>programming paradigm we have invented so far.

Analogies to nature are moot.

>Whenever we are going to simulate a high-statespace velocity/high
>path entropy systems (most physical systems are), a shoebox sized
>2 k$ F21/P32 box will run rings around any Cray. And this is no
>bullshitting, you can _prove_ that.

The only thing you can say about 2000 F21s is that their peak MIPS sums to
2000 * 330 MIPS.  You cannot prove anything else.

Of course, this peak MIPS rating assumes:
 * no branches
 * running the entire machine out of SRAM (that gets expensive after a while)
 * no data accesses
 * even if there were data accesses, no FP, no integer multiply

>For several reasons I was trying to verbalize above, standard
>benchmarks won't run well on an F21, particularly sequential
>and float ones (but not too bad also). In fact these benchmarks
>have been designed for old architectures, measure things which
>might be irrelevant to some users while ignoring several important
>ones.
>
>"Benchmarks never lie. Liars do benchmarks." 
   
Standard benchmarks reflect to some degree the performance a user can expect
from a machine.
They also reflect to some degree the way in which a user uses a machine.

If you are proposing the F21 as a workstation replacement, then the SPEC benchmarks
are an acceptable means to justify your proposal.

If you are proposing the F21 cluster as a parallel system replacement, then 
the NAS benchmarks, the Perfect Club benchmarks and others are an acceptable means.

This discussion will degenerate into a religious argument if no benchmarks are done.
What alternative do you propose to compare the performance of an F21 cluster versus
existing machines?

>> * efficient synchronization methods

>there are a lot of asynchronous maspar models. But this is merely
>a software problem. Provided the network bandwidth is high, router
>independent, message overhead low and there is a minimum of
>hardware message passing machinery, sync is not a problem at all

So the machine has high-overhead synchronization or no synchronization.

>> * low-latency communication

>Only of secondary importance, at best.

Really?  This is of primary importance.

>> * elimination of interconnect and memory hot-spots

>the former is part of network bandwidth, the second does
>not make sense. If physical memory bandwidth is the bottleneck,
>what is this hot-spot?

If you have routers in your network, then one of them may get too much message
traffic, and the interconnect will saturate at very low levels of traffic.
A memory hot-spot would be an F21 node that receives too much traffic and
delays (serializes) the progress of your application.
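A back-of-the-envelope sketch of the memory hot-spot: when many nodes all need
a datum held by one node, service at that node is serialized no matter how much
aggregate bandwidth the network has.  The cycle count is a hypothetical
illustration, not an F21 figure.

```python
# Hot-spot serialization: k requesters queue at one node; the last one
# waits for every request ahead of it, regardless of network bandwidth.

def hotspot_wait(k_requesters, service_cycles):
    # Worst-case wait for the last requester in a FIFO queue.
    return k_requesters * service_cycles

# 64 nodes hitting one node that takes 100 cycles per reply:
worst = hotspot_wait(64, 100)   # 6400 cycles for the last requester
```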

>> My use of a workstation is a waste when there is an acceptable alternative.
>> Is the F21 an alternative for a workstation or parallel computing?

>I _believe_ it is one. I will know when I run my code on it.

Unfortunately, it is impossible to discuss the merits of your system without
any means of evaluation.  I am not willing to accept it on faith.

>These data will be eventually forthcoming. And data won't look
>well since benchmarks are biased. It's all perfectly proprietary
>and incompatible and it won't sell worth one cent. Oh, I forgot
>the embedded controller market. Well, make it two cents.
>
>Defeatist? Bitter? Disillusioned? No way, Sir!

The benchmarks reflect to some extent the usage of the machines.  If a customer
wants a machine that will run applications similar to those provided by
benchmark foo, then they will look at results from benchmark foo.  So, yes,
they are biased: biased to show good performance for machines that run the
benchmark applications, and similar applications, well.

If an F21 cluster is not an appropriate match to the types of problems that are
being solved on these machines, then no comparisons can be made between an F21
cluster and existing machines.  It is neither better nor worse, only different.

So now the line is:

F21 clusters are the answer.
We have no benchmark results.
Benchmark results are meaningless.

How can this be anything but a religious argument?

mark