
SEQ vs PAR flamefest party


On Thu, 3 Aug 1995, Mark Callaghan wrote:

> 
> >> = Me
> > = Jeff Fox

[ workstation scaling performance ]
 
> 
> 1) If workstation farms gave superlinear speedups for more than a subset of
>    problems, then parallel computing is a solved problem.  Most sites will 

The vast majority of software in existence is sequential software.
Porting to a workstation cluster requires considerable human resources
and some insight into a new way of programming (e.g. RPC calls).
Moreover, Ethernet network bandwidth and latency do not compare
favourably with the streamlined communication infrastructure of a
maspar supercomputer.

Apart from human conservatism, these are the main reasons why workstation
clusters are accepted only very hesitantly as a supercomputer substitute.
But the trend seems to be going towards farms.

>    already have the farm (a workstation on each desk).  It is very easy to
>    demo superlinear speedup with a parallelized code, run with a data set 
>    that is too large for the cache on one processor, but easily fits in the
>    cache on n processors.  This is by no means the norm.

Sequential benchmarks are meaningless on maspar machines, but you
do not seem to accept the results of those few parallel benchmarks
in existence either. Your position on the relevance of benchmark results
appears quite flexible. (This is not a flame, just an observation.)
 
> 2) A big problem with workstation farms is communication latency.  PVM clusters
>    will take several milliseconds per message.  It is difficult to design 
>    efficient parallel applications that can tolerate this type of latency.  

This is not difficult, merely unusual. (Asynchronous) OOP does not suffer
from such problems. And again, in biological NNs the switching latency at
each synapse is several ms. Have you ever tried to emulate a NN in real
time? How many real-time neurons can you do on your machine?

This is not a meaningless fact. NNs do a lot of computation with
essentially very weak hardware, and we had better not remain blind to
this.

High latencies can be tolerated using the right paradigm. There are
several ways to solve a problem. Why make things artificially harder
by choosing a mismatched software/hardware pair?
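
Just to make "the right paradigm" concrete: here is a minimal sketch of
latency hiding by double buffering. I write it in MPI rather than PVM
simply because those are the call signatures I remember; CHUNK and the
crunch() kernel are made up for the illustration. While block i is being
crunched, block i+1 is already in flight, so the milliseconds disappear
into the compute time as long as the work per block outlasts the round
trip.

    /* Latency hiding by overlapping communication with computation.
       Sketch only: CHUNK and crunch() are hypothetical. */
    #include <mpi.h>

    #define CHUNK 4096                     /* words per block (made up) */

    extern void crunch(int *block, int n); /* application kernel, not shown */

    void pipeline(int nblocks, int src, int dst)
    {
        static int buf[2][CHUNK];          /* double buffer */
        MPI_Request rq;
        MPI_Status  st;
        int cur = 0, i;

        /* post the first receive before doing any work */
        MPI_Irecv(buf[cur], CHUNK, MPI_INT, src, 0, MPI_COMM_WORLD, &rq);

        for (i = 0; i < nblocks; i++) {
            MPI_Wait(&rq, &st);            /* block i has arrived */
            if (i + 1 < nblocks)           /* ask for block i+1 right away */
                MPI_Irecv(buf[1 - cur], CHUNK, MPI_INT, src, 0,
                          MPI_COMM_WORLD, &rq);
            crunch(buf[cur], CHUNK);       /* compute while i+1 is in flight */
            MPI_Send(buf[cur], CHUNK, MPI_INT, dst, 1, MPI_COMM_WORLD);
            cur = 1 - cur;
        }
    }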

And again, a ms delay is obsolete, considering typical F21 network delay
(and, for god's sake, there are no benchmarks out there yet. Hang on
till Nov 1995, when we get the first prototype chips; they will have
low bandwidth, but latency will be about the same as in later products).

>    The applications that can tolerate this have been ported, but this is by 
>    no means the majority of 'parallel' codes, and this will never allow parallelism
>    to be used for traditional sequential codes.  Long latencies flatten out 
>    speedup curves rapidly.

One should never say never. I can't _prove_ you wrong. But look at the
trend, I say. Look at the potential of maspar machines _realistically_.
Please realize we are living in the final death throes of purely sequential
computation. The future is looking increasingly maspar, and damn fine-grained
maspar (CA) at that, if you look 20 years from now. We gotta have evolvable
maspar hardware by then.

[ vector vs farm float performance ]

> They do not provide an order of magnitude better performance for all applications.

Of course not. There are hardly any tailor-made applications in existence.
There are lots of totally inadequate codes in existence. What are you trying
to prove?

And again: you can state your problem as a system of differential equations.
You can't solve them analytically, so you approximate them by a system of
equations, which you approximately solve by (biiig) matrix algebra. Since you
set out this way, you demand dedicated vector hardware and large VM, and you
finally get it if you cry for it loudly and long enough. This started
as early as the Manhattan project, when they had to compute the
explosive charge geometry and the fission kinetics.
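
For the record, that pipeline looks roughly like this. A minimal sketch
(my own toy example, nobody's production code): take -u''(x) = f(x) on a
1D grid, approximate the second derivative by differences, and you are
left with a tridiagonal matrix to sweep through.

    /* The conventional route: differential equation -> grid -> matrix solve.
       Toy problem: -u''(x) = f(x) on (0,1), u(0) = u(1) = 0,
       second-order finite differences, Thomas (tridiagonal) elimination. */
    #define NPTS 100                       /* interior grid points (arbitrary) */

    void solve_poisson_1d(double f[NPTS], double u[NPTS])
    {
        double h = 1.0 / (NPTS + 1), c[NPTS], d[NPTS];
        int i;

        /* forward sweep: eliminate the subdiagonal of the (-1, 2, -1) matrix */
        c[0] = -1.0 / 2.0;
        d[0] = h * h * f[0] / 2.0;
        for (i = 1; i < NPTS; i++) {
            double m = 2.0 + c[i - 1];     /* modified pivot */
            c[i] = -1.0 / m;
            d[i] = (h * h * f[i] + d[i - 1]) / m;
        }
        /* back substitution */
        u[NPTS - 1] = d[NPTS - 1];
        for (i = NPTS - 2; i >= 0; i--)
            u[i] = d[i] - c[i] * u[i + 1];
    }

Note how the two sweeps run strictly front to back and back to front:
the local physics has been traded for global linear algebra, and that is
the shape of code which cries for vector hardware and large memory.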

Admit it, the predominance of vector computers is an artefact of a
certain way of perceiving reality. Old ways of seeing things.
Why do you think the protein folding problem, the Holy Grail
of molecular biology and protein engineering and God knows
what else, has remained unsolved for three decades? (I have done some
reading on _that_, believe me.) Because they insist(ed) on ignoring
reality. They kept cheerfully underestimating the magnitude of the
problem, of course. But they don't, now. The hardware is fast enough,
today. Provided. One does not. Cripple it. With inadequate software.
Thus. Losing three orders of magnitude of performance, mostly
turned into joules. There are cheaper ways of keeping buildings warm.

Nature does things by asynchronous small-scale local interactions.
Read up on your Einstein and QM and Newton if you don't believe me.
If you want some nickel's worth of performance at mimicking Real
Things, do it maspar.
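
Here is what I mean by small-scale local interactions, as a minimal
sketch (again a toy 1D example of my own, in integers): the same sort
of physics advanced by purely local, neighbour-to-neighbour updates.
This is the kind of loop that drops straight onto a grid of cheap
nodes, because each node only ever exchanges its two boundary cells
per sweep.

    /* Local-interaction update: every cell looks only at its two neighbours.
       Toy 1D diffusion step in integers; the exchange of the two ghost
       cells u[0] and u[CELLS+1] with neighbouring nodes is not shown. */
    #define CELLS 1024                     /* cells per node (arbitrary) */

    void relax_step(int u[CELLS + 2])      /* u[0], u[CELLS+1] are ghost cells */
    {
        int next[CELLS + 2];
        int i;

        for (i = 1; i <= CELLS; i++)
            /* new value = old value + laplacian/4, pure integer arithmetic */
            next[i] = u[i] + (u[i - 1] - 2 * u[i] + u[i + 1]) / 4;

        for (i = 1; i <= CELLS; i++)
            u[i] = next[i];
    }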

99.9 % of real-world problems are best done the maspar way.
Since we live in, are products of, and are adapted to, this sort
of universe, most of our problems are best formulated and solved
this way. Think about it.

We lose a lot using a long pipeline of approximations, the
final product being forced to operate hard at the limit
of what physical laws allow us. Why do things this way? Because our
forefathers did it this way? Hell, my forefathers ate raw meat,
clothed themselves in stinking skins and lived in holes without central
heating and a cool beer. Few of them lived beyond 30.

Shall I try to do things like my forefathers?

> Government and non-government money is still buying vector supercomputers.

Well, at least our government does not buy vector supercomputers anymore,
thank god. But they do install workstation clusters and maspar supercomputers.
The science budget is tight, but not that tight; in particular, network and
computer infrastructure is constantly being enhanced.

> For some applications, vector supercomputers still provide the best price
> performance.  Of course, the formula is somewhat complicated.  The machines
> must also have enough memory to solve the problem, and be capable of solving
> the problem in a short enough period of time.  

Bits aren't everything. MIPS and even floats aren't everything.
Memory bandwidth is something. Networking bandwidth is something.
Only then come MIPS. Floats? Who ever needs floats?

[ Linda, F*F & Co. ]

> DSM means that a single address space will be provided.

This means that physically distant locations have to be
instantly accessible, ignoring switching time and signal
travel delay. This means a dangling pointer can trash memory
on a distant node. This means duplicating distant
memory locations in caches, with all their minuses. This
means encouraging programmers to ignore reality and
stick to their swinish coding habits. You noticed it:
I really like DSM.

> DSM means that some form of coherency will be provided.

This means some computationally and hardware-intensive
machinery to watch over cache coherence/consistency.
Whichever way you turn it, DSM is a kludge.

> The F21 does not have enough address bits for a single address space.

Alas, this is true. I would wish for 32 bits, to map the hard disk
into the address space for clean VM. Watch out for the P32 for a remedy.

> The F21 would have to provide coherency in software (an acceptable means for
> some applications). It has not been demonstrated that software coherence
> is efficient enough to support parallel applications that communicate 
> frequently.

I spare my breath.

> >Does the human brain use random access shared memory available
> >to all processes?
> 
> Is the F21 a human brain?

We are talking about clusters of F21s here, not single processors.
Quantity turns into quality once quantity is high enough.
No, the F21 is not the human brain, not by a goodish bit. But
it is at least a step in the right direction. Yes, I keep
on harping about memory and network bandwidth.

> >You still seem to miss the whole idea of F21.  For the price of a

Do you know what? Jeff really seems to have a point here.

> >single node on a PVM style cluster you can max out the number of
> >processors in an F21 network.

[ quack, quack ]
 
> 
> >Religion has nothing to do with it.  When people want to imply that
> >what they are saying is logical and that what is not but they cannot
> >use logic to support this opinion they will just try to label
> >your argument as a "religious" argument.
> 
> It is illogical to claim that a MISC chip is better than an Alpha or a
> MISC cluster is better than an Alpha cluster and to provide no quantitative
> evidence of such claims.  

For a certain class of problems (which happen to be my problems)
a (small) F21 cluster of the same monetary value is _substantially_
better than a single modern high-end Alpha workstation.

Now I said the word you wanted me to say: "better". Satisfied?
 
> It is illogical to claim that an Alpha workstation or an Alpha cluster
> will only appear to perform better on a benchmark or general purpose use
> because the use and the benchmark are wrong.

You know what? Wait until the machines are out and watch for
MISC results here. Then try to recode some of the demo problem
solutions which will (hopefully) be presented, and compare the
performance. You see, I can also state problems which would be
hard to do on a purely sequential machine and which need not
be benchmarks.
 
> I am still unsure if such claims are being made.

See above.

> >If by "evaluation against my current system" you mean will it run the
> >same software you are using now, I think you missed something.  I have
> >suggested the sort of scientific experiment to determine the best
> >estimate of performance on a particular problem.  Anyone who knows
> >about computers knows that that sort of experiment is meaningful and
> >SPEC marks are not.  Unless of course "your current system" means
> >SPEC marks.
> 
> I have asked this question many times, and I have not understood the answer.

Too bad. I would have given almost exactly the same answer as Jeff 
in his place.

> If a MISC cluster is the answer, what is the question?

Unconventional problem solutions without sequential bottlenecks,
at an excellent hardware price/performance ratio.

If you have a problem without a sequential hot spot in it, are not
afraid of leaving the Fortran way (and even more so the Unix C way)
behind, and are looking for a good buy, go for the F21.

> What types of computing will it do better than existing alternatives?

Everything requiring local interactions, high state-space velocity and
(to a lesser degree) path entropy, at modest integer precision. Now I hear
you saying this is meaningless gibberish.

> Why will it do better on these types of computing?

Power. Power. Power. For a fraction of the price you'd pay
doing it the other way.

> How will its performance be demonstrated?

Best would be a real-world problem performance comparison.

> I have heard some answers to these questions.  Various areas of computational
> science have been named.  Various applications have been listed.
> I require more quantitative evidence.  Substantial simulations of an

Unfortunately, most buyers are like that. They don't read the specs, don't
believe them if they do, and demand that everything be exactly the same
as on the machine they are using now, only Better.

> F21 cluster would model the network as well as the CPUs.  An easier
> simulation would be to model simple array calculations to determine
> how many Mbytes/second and MFlops/second or even Mops/second the F21

Forget MFlops. That is, unless you'd settle for scaled integers.
Even now, the P32 or even the P64 would be more the thing to wait for if
you need that much precision. The MWords/s rate in SRAM is equivalent
to a best-case secondary cache. Often-used data/code can be placed
into SRAM by hand, either using prior knowledge or an adaptive OS
load leveler driven by call frequency, etc. I don't expect the early OS
to be able to do this kind of load leveling, so a little intelligence
on the part of the programmer will be required, or else you'd get only
mediocre results.
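
To show what I mean by scaled integers, a minimal sketch; the 1/1024
scale is an arbitrary choice of mine, good for roughly three decimal
digits after the point within a +-32768 range on a 32-bit word:

    /* Scaled-integer ("fixed point") arithmetic: value = raw / 1024.
       Plenty for modest-precision work, and not a float in sight. */
    #define SCALE 1024L

    long   to_fix(double x)  { return (long)(x * SCALE); }    /* only at the edges */
    double to_dbl(long a)    { return (double)a / SCALE; }    /* only at the edges */

    long fix_add(long a, long b) { return a + b; }
    long fix_mul(long a, long b) { return (a * b) / SCALE; }  /* keep |a|,|b| < 32*SCALE  */
    long fix_div(long a, long b) { return (a * SCALE) / b; }  /* keep |a|   < 2048*SCALE */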

The random-access DRAM MWords/s value is the same as or better than that of
your machine of choice, provided you use the same DRAM technology.
MOPS are very difficult to compare: you'd get stupendous peak
values, but CISC operations are typically more complex. If you would
settle for adding up the typical MOPS of each node, the values
are extremely good, far beyond anything you can buy now.
 
> could perform.
> 
> for (i=0 --> N) A[i] = B[i]

This shouldn't run too badly if we are pushing word-sized data
around. Since we have only one index register, some stack
juggling will be required. Adding a second index register
with the corresponding opcodes would remedy that.

If the array is large enough, you'd get cache misses
and the data transfer rate would break down to the physical access
time on a cached non-MISC. By placing the code loop in SRAM, the
bottleneck would be main DRAM speed alone on both architectures.

If the array is small enough, you can put both code and data
into SRAM; then the bottleneck is code throughput and instruction
verbosity. On a non-MISC the code loop will fit into the primary cache
and the data into the secondary, so SRAM access bandwidth is the
bottleneck. MISC shouldn't do all too badly here either.
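
If anyone wants to put actual MWords/s figures on the loop above,
something like this crude probe (clock() resolution and a too-clever
optimizer both need watching; the array size is my guess at "bigger
than any 1995 cache") will do on whatever machine is at hand:

    /* Crude MWords/s probe for the A[i] = B[i] loop. */
    #include <stdio.h>
    #include <time.h>

    #define N (1L << 19)                   /* 512K words, ~2 MByte per array */
    #define REPEAT 10

    static long A[N], B[N];

    int main(void)
    {
        clock_t t0, t1;
        long i, r;
        double secs, mwords;

        t0 = clock();
        for (r = 0; r < REPEAT; r++)
            for (i = 0; i < N; i++)
                A[i] = B[i];
        t1 = clock();

        secs   = (double)(t1 - t0) / CLOCKS_PER_SEC;
        mwords = (double)N * REPEAT / 1e6 / secs;
        printf("%.1f MWords/s sustained copy\n", mwords);
        return (int)A[0];                  /* keep the optimizer honest */
    }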

> for (i=0 --> N) A[i] = x + B[i]

See above. About the same performance. (How fast is the adder, Jeff?).

> for (i=0 --> N) A[i] = x * B[i]

Oops. There is a nasty * in it. If you'd settle for multiplying
small values only, or for obtaining approximate results (this isn't
as abstruse as it sounds; there are uses for that), you'll be as
fast as a hardware multiplier by using an SRAM lookup. Otherwise, you
should probably have bought a TI DSP with links.

Failing that, use a smart compiler that turns * into the shortest
shift-and-add sequence.
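
For completeness, the shift-and-add fallback is nothing exotic; a
minimal sketch of the run-time version for unsigned operands (a compiler
would emit the shortest such sequence for a known constant at compile
time):

    /* General multiply built from shifts and adds only
       (the classic binary / "Russian peasant" decomposition). */
    unsigned long shift_add_mul(unsigned long a, unsigned long b)
    {
        unsigned long product = 0;

        while (b != 0) {
            if (b & 1)        /* this bit of b contributes a at its position */
                product += a;
            a <<= 1;
            b >>= 1;
        }
        return product;
    }

The SRAM lookup mentioned above is just the degenerate case: make the
operands small enough and the whole product table fits in fast memory,
one fetch per multiply.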

> for (i=0 --> N) A[i] = B[i] + C[i]

This would be excellent if we had three index registers
with post-(in|de)crement, or at least two. (Three is the
magic number one needs quite often, e.g. for BitBlt
or the code above.)
 
> I have not seen such numbers for the F21.  Do they exist? 

I would like to code this up when the eval boards are out. Jeff,
do you have simulated estimates for the above code?

> >When people say "just look at what has been done with C, look at
> >all these multimegabyte apps we have running here." I like to say,
> >yes with 100 programmers you can write a fantastic 10 megabyte
> >program that does amazing stuff.  Very impressive.

This is the point. Apart from all those man-years, the odds
of landing excellent programmers go up with increasing sample size.

[ snip ]

> >When people tell me "By using C++ and this $3000 machine I can
> >get Z performance."  Sometimes I say "I know of an example of
> >someone doing it on a $50 computer and getting better than nZ."

Can you give an example, Jeff?

> When people tell me, "this can be done in Forth," I say, "it hasn't."
> Forth can be promoted far better by distributing useful software written
> in Forth (as Java has been promoted) than by criticizing other tools that
> have made great contributions.

The developer community is terribly small. It is the lack of common
tools such as emacs, mailers, GhostScript, TeX, WWW browsers, TCP/IP packages,
the millions of PD programs under X, or bloody _standards_ such as the Unix
calls, that is most lamentable. But we won't get them, since one would have
to recode them from scratch, and there aren't enough of us for this
mammoth task.

> >The real problem is when you cite examples where things
> >are 1000 times smaller or 1000 times faster or 1000 times
> >cheaper.  Then people find what you are saying very threatening.
> >They interpret your saying you found a better way to having
> >said they were an idiot for not having figured it out themselves.
> >Try to talk about OKAD, F21, MISC etc and people hear funny stuff.

_I_ get some funny reactions when trying to talk to people about MISC.

> The real problem is that you cannot expect others to accept your 
> claims without demonstration.  

It is not easy to gather a number of people in one room at the
same time who all have the background needed to
understand what is happening on the screen this guy is sitting
in front of, and _what it means_.

Unless you give away F21s with killer apps for free, or
show them at a major fair, people won't listen.

> >Judged by whom?  How can you say how other people are making judgements?
> >You seem to ignore the many papers I have published or posted on the
> >subject, or dismiss these as not "quantitative."  Of course it is
> >easier to ask the question and have me explain it again then to read
> >all the information and quantitative information that has already
> >been published or posted by other people.
> >
> >Either that or "quantitative" means SPECmarks to you.
> 
> If MISC chips are proposed as workstation CPUs, 'quantitative' means 
> SPECmarks.

Would you allow adding up each node's results arithmetically? That could
be interesting.

> If MISC clusters are proposed as cluster CPUs, 'quantitative' means parallel
> benchmarks.

Granted. I will port some if I have time and it's not too hard to do.

> If MISC chips are proposed for another use, then why are they being compared 
> to 'complex,' 'inefficient,' and 'costly' RISC CPUs?
> 
> But still, 'quantitative' means how fast will program 'foo' run on this 
> system.

I haven't tried typing make at the "ok" prompt, but something in me
says this won't work too well. Taking the time to write a (surely substantial)
"foo" from scratch merely for the sake of demonstration? No, thanks.

The protein project I intend to do will be coded in C++ on an SP2
(or on a 20-node SGI Indy cluster) and in Forth/assembler on an F21.
I will post the results, but this is surely at least 12 months from
now. (I might also run into trouble implementing the idea, so better
not to expect any results at all in a reasonable time.)

> I have seen no such posts with that information.

I think it is a bit hard to run a substantial program on
a simulated chip. Wait for the first silicon.

> I have not seen that information on your home page.
> 
> Peak MIPS is not quantitative

Is typical average parallel MIPS quantitative? This can be estimated
from simulation. In fact, Jeff has already posted these figures.

> >>Heck there is a product out now that translates binaries from SPARCs to
> >>Alphas.  Workstation are not sold on CPU performance alone.
> 
> >What else then, style?  If it is not performance alone, and it is not
> >price-performance then what is it?  Designer accesories?
> 
> Software, training, standards conformance, interoperability

MISC isn't for you, then. You have to write your software yourself,
you have to learn it and train your programmers yourself, and MISC
is a proprietary standard (an interesting word combination...);
it won't work well with anything but other MISCs.

[ snip ]

> >When other people think of workstations they think "UNIX" "C" and the
> >specific programs they use.  They think it shouldn't cost much either,
> >you know like $25,000 or $100,000.
> 
> When I think of a workstation I think of "Linux", "GNU", and a 386 or 486 
> knockoff.

I would agree: that is a low-end workstation. But please substitute at least
a PCI Pentium 60/75 with a 1 GByte SCSI/EIDE HD and 8-16 MByte RAM.
You can't buy an i386 anymore, and even i486 sales have long been below
breakeven.

> Much more like < $2000.  I can get a lot of useful work done on such a beast.

Jeff, how many mass-produced 2 MByte F21 nodes can we have for $2000,
including cabinet, small CRT, etc.? About $100-150/node? Cheaper?
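
(At that guessed $100-150 a node, $2000 would buy somewhere between 13
and 20 nodes before the cabinet and CRT are paid for.)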

-- Eugene

> mark