
Re: MeshSP vs. P21


On Thu, 3 Aug 1995, Eugen Leitl wrote:

> You are right, but you describe a tiny minority riding the front
> of the technological progress shockwave. This is big science by definition,
> as it commands mega and even giga bucks, but the absolute number
> of persons using them is very small. I was talking about the poor,
> underpaid, red-rimmed-eyed, frustrated, overworked, off-the-shelf
> scientist slob.

If they don't have the government's (our) billions, then they are in
the market for cheap, powerful computers, aren't they? One just has to
convince them that with 3 months of porting their paper-producing
application they'll be able to do in one month, and within budget, what
would have taken them two years (and much red tape) with the old
technology.

> They also do xRay crystallography
> on VAXen and some very very few are doing protein dynamics on
> KSR and now on the IBM SP2, which recently came into service.

These people basically use two or three highly specialized programs --
X-PLOR burning the most cycles. And there are many hundreds of labs
doing it. Isn't that a market?

> > There are quite a few initiatives to build the "TeraFLOP computer," that
> > achieves its performance on _one problem only_ -- Lattice QCD. The
> 
> Wow. But you can't really compare the F21's 20-bit scaled integers to the
> 64- or 80-bit floats they use in benchmarks. A "P64" would compare well,
> though.

As I mentioned, there is _only one_ problem that they are solving. And
those are highly qualified physicists. They understand terms like
"significant figures," "standard deviation," "error propagation," and
"range of values of the observables" pretty well. They work on one
project for months to years. They understand pretty well (or they
could, if they needed to) where the error comes from and what precision
is needed in which parts of the calculation.

To cut it short, they don't _need_ float. They can live happily with
fixed point. And there is no need for benchmarking: the benchmark is
the application (since the machine is built to run _one_ application).
In Lattice QCD the _real_ measure is not FLOPs but "staggered Dirac
operator" applications per second, which is converted to FLOPs for
comparison with the literature.
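A minimal sketch, in C, of what "living with fixed point" means: a
hypothetical 20-bit quantity with 16 fractional bits, held in an
ordinary integer register. The Q4.16 format and all the names here are
only for illustration, not the F21's actual representation.

    /* Hypothetical 20-bit scaled integer with 16 fractional bits. */
    typedef long fix20;
    #define FRAC 16

    #define FIX(x)    ((fix20)((x) * (1L << FRAC)))   /* double -> fixed */
    #define UNFIX(f)  ((double)(f) / (1L << FRAC))    /* fixed -> double */

    /* Fixed-point multiply: the wide intermediate product is shifted
       back down by FRAC bits. Where the precision is kept -- and how
       much of it is needed -- is decided by the programmer, which is
       exactly what people who track their own error propagation can do. */
    fix20 fix_mul(fix20 a, fix20 b)
    {
        return (fix20)(((long long)a * b) >> FRAC);
    }

UNFIX(fix_mul(FIX(2.5), FIX(-1.25))) comes out as -3.125, with no FPU
in sight.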

I use float for image processing. Not because I need to, or want to,
but because I have a 100K+ transistor FPU that performs a 24-bit
mantissa multiply in 10 clocks, while the crippled 32-bit integer unit
takes 15. What I _really_ need is 8x8, 6x8, or 7x7 multiplications with
some reasonable accumulation (say 21 bits :-).
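That operation is nothing exotic; a sketch of it in C (names and types
are illustrative) is just:

    /* 8-bit pixels times 8-bit kernel weights, accumulated in a plain
       integer register. Each product fits in 16 bits; a 7x7 kernel's
       worth of them fits in about 21 bits, so any 32-bit register has
       room to spare -- no 24-bit mantissas required anywhere. */
    long mac8x8(const unsigned char *pixels, const signed char *weights, int n)
    {
        long acc = 0;
        int i;
        for (i = 0; i < n; i++)
            acc += (long)pixels[i] * weights[i];
        return acc;
    }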

And yet, the real issue on the MIPS line is the cache. On real code --
convolutions with small kernels, up to 10x10 -- I'm getting ten times
worse performance than the advertised peak FLOPs for the arithmetic my
algorithm needs.
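The code in question has the shape of the sketch below (names and sizes
are illustrative). Each output pixel touches KH rows of the source
image, and once the image outgrows the data cache it is that traffic in
the two inner loops, not the multiplies, that sets the pace.

    #define KW 10
    #define KH 10

    void convolve(const float *src, float *dst, int w, int h,
                  const float kern[KH][KW])
    {
        int x, y, i, j;

        for (y = 0; y + KH <= h; y++)
            for (x = 0; x + KW <= w; x++) {
                float acc = 0.0f;
                for (j = 0; j < KH; j++)
                    for (i = 0; i < KW; i++)
                        acc += src[(y + j) * w + (x + i)] * kern[j][i];
                dst[y * w + x] = acc;
            }
    }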

Well, I could be more careful and do it with knowledge of the cache
sizes and latencies of the particular processor I'm using. But that is
already not your stock "my existing application" type of argument: I'd
be taking care of each and every processor revision I work with. And if
I'm going to worry about pipelines, load delay slots, safe overlapping
of the FPU units and data layout in the cache, I might as well worry
about stacks, SRAM, DRAM and carry propagation, if that buys me
performance within the budget I have (which is on the order of small
$10Ks, rather than high $100Ks).
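For the record, the "more careful" version is no deep magic either. A
sketch of it, reusing KW/KH and the layout from the sketch above: block
the convolution into vertical strips narrow enough that the KH source
rows a strip needs stay in the data cache from one output row to the
next. The catch is in the first line -- the strip width has to be
retuned for the cache of the day.

    #define TILE_W 128   /* output columns per strip; tune per cache size */

    void convolve_blocked(const float *src, float *dst, int w, int h,
                          const float kern[KH][KW])
    {
        int x0, x, y, i, j;

        for (x0 = 0; x0 + KW <= w; x0 += TILE_W) {
            int xlim = x0 + TILE_W;
            if (xlim > w - KW + 1)
                xlim = w - KW + 1;
            /* Within this strip, stepping y down the image reuses
               KH-1 of the KH source rows it needs; they are only
               TILE_W + KW - 1 elements wide, so they survive in the
               cache from one output row to the next. */
            for (y = 0; y + KH <= h; y++)
                for (x = x0; x < xlim; x++) {
                    float acc = 0.0f;
                    for (j = 0; j < KH; j++)
                        for (i = 0; i < KW; i++)
                            acc += src[(y + j) * w + (x + i)] * kern[j][i];
                    dst[y * w + x] = acc;
                }
        }
    }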

To wrap it up for scientific computing: If you build it, they will come.


--
Penio Penev <Penev@venezia.Rockefeller.edu> 1-212-327-7423