
Re: MISC-d Digest V99 #87


Dear List Readers,

I thought Christophe's explanation of SynDEx was very nice.  Like
the other explanations he has given here, it was very clear and
informative.  I am sure it helped many people understand issues
they might not otherwise have considered unless they were deeply
involved in software design for multiprocessors.  It is a very nice
way to approach parallelizing software, and it can have less runtime
overhead than the software I have been working with.

In case there was any confusion about what I intended to say
about "functions" of DSM in hardware, let me review.  I had said,

>>If you have something that has functions of distributed shared memory
>>in hardware I would think you would want to try to remove extra software
>>layers to boost performance.

Fully implemented distributed shared memory in hardware includes
things that range from parallel multiported memory access to SCI.
Scalable Coherent Interface looks just like coherent cache to the
processor hardware and thus removes communication overhead from
the software.

"Functions" of distrubuted shared memory in hardware is any subset of
that.  This ranges from things like a sort of poor man's SCI.  This
is the idea behind the network coprocessor on F21, it can be programmed
with a little code to be a network interface of DSM.  it provides
routing and dma transfers between nodes at high speeds with little
software overhead.  It removes much of communication overhead from
the software by doing routing and DMA in hardware separate from the CPU.
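
To make that concrete, here is a minimal sketch in C of what a remote
read might look like through such a coprocessor.  The register names
and addresses are invented for illustration; they are not the actual
F21 interface.

    /* Hypothetical network coprocessor interface: the CPU fills in a
       request and the coprocessor does the routing and DMA on its own.
       Register names and addresses are invented, not the F21's. */
    #include <stdint.h>

    #define NET_DEST  (*(volatile uint32_t *)0x100)  /* target node id    */
    #define NET_ADDR  (*(volatile uint32_t *)0x104)  /* remote address    */
    #define NET_LOCAL (*(volatile uint32_t *)0x108)  /* local DMA buffer  */
    #define NET_LEN   (*(volatile uint32_t *)0x10c)  /* word count        */
    #define NET_GO    (*(volatile uint32_t *)0x110)  /* start / busy flag */

    void remote_read(uint32_t node, uint32_t addr,
                     uint32_t *buf, uint32_t words)
    {
        NET_DEST  = node;                  /* coprocessor routes this   */
        NET_ADDR  = addr;
        NET_LOCAL = (uint32_t)(uintptr_t)buf;
        NET_LEN   = words;
        NET_GO    = 1;                     /* DMA completes without any */
        while (NET_GO)                     /* further CPU involvement;  */
            ;                              /* poll or take an interrupt */
    }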

The next stage down would be simple serial ports.  Software is now
needed to perform the routing.  Software is now needed to perform the
DMA.  There is now more software overhead in the communications between
nodes.

The next stage down is without hardware serial, say bit-banged serial.
Software is now needed to perform serialization and bit-banging timing.
There is now even more software overhead in the communications between
nodes.
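
As an illustration of what the CPU takes on at this level, a minimal
bit-banged transmit sketch in C; the output pin and the delay
calibration are invented for the example:

    /* Bit-banged serial transmit, 8N1 framing done entirely in
       software.  TX_PIN and the delay constant are hypothetical;
       the point is that the CPU now burns cycles on every bit. */
    #include <stdint.h>

    #define TX_PIN (*(volatile uint32_t *)0x200)

    static void bit_delay(void)            /* calibrated to baud rate */
    {
        for (volatile int i = 0; i < 100; i++)
            ;
    }

    void tx_byte(uint8_t b)
    {
        TX_PIN = 0;                        /* start bit               */
        bit_delay();
        for (int i = 0; i < 8; i++) {      /* data bits, LSB first    */
            TX_PIN = (b >> i) & 1;
            bit_delay();
        }
        TX_PIN = 1;                        /* stop bit                */
        bit_delay();
    }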

Alternatively there may be serial hardware to send data between nodes,
but it may require more elaborate software in the form of external
network protocols.  In active message passing you fire off something
and it gets routed directly to the destination without any CPU
or memory access overhead anywhere, and then it fires an immediate CPU
interrupt to execute code.  If message passing is implemented on top
of other network protocol layers there will be extra software overhead
to process network buffers and network interrupts, and extra memory
cycle overhead involved in the process.
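
The active message idea can be sketched in a few lines of C: the
packet carries the address of its own handler, so delivery is a single
indirect call from the receive interrupt, with no protocol stack or
buffer queue in between.  The types here are illustrative only.

    /* Illustrative active message: the packet names its handler, so
       the receive interrupt dispatches with one indirect call.  No
       protocol layers, no buffer queues.  Types are invented. */
    #include <stdint.h>

    typedef void (*am_handler)(uint32_t arg0, uint32_t arg1);

    struct am_packet {
        am_handler handler;          /* code to run at the destination */
        uint32_t   arg0, arg1;       /* small payload                  */
    };

    void network_rx_interrupt(struct am_packet *pkt)
    {
        pkt->handler(pkt->arg0, pkt->arg1);  /* immediate dispatch */
    }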

Everyone knows that one can build multiprocessors on any collection
of hardware and balance the load on whatever combination of CPUs
and communication paths between them is available.  Most people
know that minimal software overhead and maximal efficiency are achieved
by the most expensive parallel interfaces to memory, but that these
can cost millions of dollars just for the bus interface.  Most
people know that there is a continuum of other hardware arrangements
that trade software cost for hardware cost.

Christophe said:

>To remove extra software layers, you don't even need to have distributed
>shared memory in hardware

Yes, but I said, "Functions of distributed shared memory in hardware."
This includes a range of options from full hardware implementation of
SCI, to hardware implementation of active message passing, to hardware
implementation of network interconnects, to serial interrupt driven
hardware for interconnects, to software driven bit bang serial
interconnect.  This is a continuum with "functions" of distributed
shared memory being traded off from hardware to software.  There is
either hardware overhead or software overhead.  There may also be
extra layers of software overhead such as ugly network protocols
never designed for efficient multiprocessor communication at that
level.

>with SynDEx (http://www-rocq.inria.fr/syndex),
>distributed memory with any support for interprocessor memory transfers is
>enough.

"Enough" of course if it is fast enough to meet requirements.  But
what you are describing is just trading off the efficiency and cost of
more hardware with the inefficiency and overhead of more software. 

My point was not that you need hardware shared memory, but that there
is a tradeoff in the implementation of distributed shared memory that
involves hardware overhead or software overhead.  The less hardware,
the more software overhead is required.  The other point I was making
is that _extra_ software overhead can easily be introduced in addition
to the software overhead required to replace _functions_ that would
be more efficient in hardware.

>From your specification of your application algorithm and architecture,
>and from measured characteristics of the algorithm components (computations)
>relative to the architecture components (processors and communication media),
>SynDEx tries its best to distribute and schedule computations over the
>processors, while distributing and scheduling communications over the
>interprocessor communication media.  If the resulting predicted global
>performance matches your requirements, SynDEx then generates a SYNchronized
>Distributed EXecutive (hence the SynDEx acronym).

Yes, a very nice approach.  The Linda approach that I investigated many
years ago introduces some runtime overhead that is not required in the
SynDEx approach.  A full implementation of Linda also must know the
relative strengths of the various architectures and interconnects in
the system to match the distribution and scheduling of processes and
communications, but it does this at runtime.  This introduces runtime
overhead that is not there when it is done at design or compile time.
Doing it at compile time also lets the compiler spend more time
optimizing code generation, since that cost is paid only once, so it
may produce runtime code with less overhead for that reason as well.
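
To show where Linda's runtime cost lives, a toy sketch in C: every
in() must search the tuple space and match a tag at runtime, which is
exactly the kind of work SynDEx resolves once at compile time.  This
is a deliberate simplification, not any real Linda implementation.

    /* Toy Linda-style tuple space: out() stores a tuple, in() does a
       runtime search-and-match.  That matching is the overhead that
       compile-time distribution and scheduling avoid. */
    #include <string.h>

    #define SPACE 64

    struct tuple { const char *tag; int value; int used; };
    static struct tuple space[SPACE];

    void linda_out(const char *tag, int value)
    {
        for (int i = 0; i < SPACE; i++)
            if (!space[i].used) {
                space[i].tag   = tag;
                space[i].value = value;
                space[i].used  = 1;
                return;
            }
    }

    int linda_in(const char *tag)          /* blocks until a match */
    {
        for (;;)
            for (int i = 0; i < SPACE; i++)
                if (space[i].used && strcmp(space[i].tag, tag) == 0) {
                    space[i].used = 0;
                    return space[i].value;
                }
    }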

My experiments with Linda were for an SMP, where the distribution and
scheduling are much easier since nodes are identical.  Systems with
complex combinations of interconnects (some rings, some stars, some on
network hardware, some on serial hardware, some on parallel hardware)
reintroduce the complexity of the code distribution and scheduling
problem.  The main difference between Linda or F*F and SynDEx is the
runtime load balancing, IMHO.  SynDEx is also more sophisticated, more
portable, and capable of producing tighter object code.

>Instead of relying on layers of separately compiled resident executives,
>SynDEx generates for each processor a generic macrocode controlling static
>memory allocation, and static sequencing and synchronization of computations
>and communications.  This macrocode is then macroprocessed to be substituted
>with processor specific assembly instructions directly accessing hardware
>registers for best performance.

Yes, minimal runtime overhead.  And you are saying it uses any _functions_
of DSM implemented in hardware to remove system overhead.  This was the
point that I was making.  I did not mean that a full hardware implementation
of DSM is essential.  I am not promoting SCI or Cray-style busses for everyone.
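
The substitution step can be pictured with the C preprocessor:
generated code calls one generic communication macro, and a per-target
definition expands it either to bare-metal register access or to a
resident OS service.  The names below are my own invention, not
SynDEx's actual macrocode.

    /* Sketch of macro substitution: one generic SEND in the generated
       code, two possible expansions.  Names are invented. */
    #ifdef BARE_METAL
    void dma_start(int chan, const void *buf, unsigned len); /* hw layer */
    #define SEND(chan, buf, len)  dma_start((chan), (buf), (len))
    #else
    #include <unistd.h>                   /* Un*x: resident OS services */
    #define SEND(chan, buf, len)  write((chan), (buf), (len))
    #endif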

>For example, on the TMS320C40, the total overhead for completely handling an
>interprocessor communication, including the programming of the DMA and 4
>context switches between the computation sequence and the communication
>sequence (2 for go programming the DMA and return, and 2 for servicing the
>end-of-transfer interrupt and return), is only 51 instruction cycles.

You are saying that the TMS320C40 has most of the functions of DSM in
hardware and that by using those you reduce software overhead.  My point
exactly.  This is the same reason that F21 has routing, DMA, and remote
CPU interrupts in hardware.  I am sure SynDEx would generate code to use
this hardware on an F21 multiprocessor to reduce software overhead.

We can support two modes on the network hardware, one with general
routing and another with half of the processors writing to their
neighbors in a pipeline fashion.  For problems where data can be
efficiently pipelined through an array of processors, the communication
bandwidth can be maximized and the routing overhead minimized.  I am
sure that SynDEx could also take advantage of this mode in the
breakdown of the problem space and scheduling of communication that
goes on at compile time.  In F*F this still requires manual specification
by the programmer, as well as some runtime overhead that is not there
in SynDEx.  SynDEx needs to know about the details of what functions
are available in hardware and how to balance the load when compiling
the code.  SynDEx is very impressive software.
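
In the pipeline mode, each node's communication pattern collapses to
something like the following C sketch: read from a fixed upstream
neighbor, compute, write to a fixed downstream neighbor, with no
per-word routing decisions.  The channel functions stand in for
whatever the hardware actually provides.

    /* Pipeline mode: with the route fixed at compile time, per-word
       routing overhead disappears.  read_up()/write_down() are
       placeholders for the real channel hardware interface. */
    #include <stdint.h>

    uint32_t read_up(void);            /* word from upstream neighbor  */
    void     write_down(uint32_t w);   /* word to downstream neighbor  */
    uint32_t compute(uint32_t w);      /* this node's stage of work    */

    void pipeline_node(void)
    {
        for (;;)
            write_down(compute(read_up()));
    }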

>When we are less concerned with speed, as when distributing an application
>on a network of workstations without taking control of the bare metal over
>Un*x, the macrocode is substituted with C (or any other high level language,
>would you like to change basic macrodefinitions) and relies on the resident
>operating system (processes, semaphores, TCP/IP sockets).

Yes, well you qualify this by saying you are less concerned with speed.
I suspect that under Un*x, C is a reasonable way to generate code.
You also note that using the OS services for sockets and so on
introduces extra software overhead into the equation.  This was the
other point that I was trying to make.  If you choose to trade
speed for portability you can do this.  We could implement an F*F
for ANS Forth using TCP/IP sockets for the DSM layer and have
added value in the increased portability.  Perhaps we will after
we have a reference implementation on the originally intended target.
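
If we did build such a layer, the portable end might look something
like this C sketch: a remote fetch expressed as a request/reply
exchange over a TCP socket, trading the hardware path for system
calls.  The message format and function names are invented for
illustration.

    /* DSM over TCP/IP sockets: a remote fetch becomes a request/reply
       through the OS.  Every access pays for system calls and protocol
       processing.  The message format is invented. */
    #include <stdint.h>
    #include <unistd.h>

    struct dsm_req { uint32_t addr; };     /* ask a node for one cell */

    uint32_t dsm_fetch(int sock, uint32_t addr)
    {
        struct dsm_req req = { addr };
        uint32_t value = 0;
        write(sock, &req, sizeof req);     /* kernel, TCP stack, wire */
        read(sock, &value, sizeof value);  /* block for the reply     */
        return value;
    }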

Jeff Fox