
Re: lean + mean



No one has objected so far, so we apparently can continue here.

Christophe Lavarenne writes:

 > Between your lines, I understand that you want to attack a massive problem,
 > with massive data arrays processed by very repetitive operations, allowing
 > massive data parallelism, that you want to accelerate (or even process in
 > real-time?) with a big lattice of SHARCs, each simply executing a slice of
 > the repetitive loops.
 
Yes, that's entirely correct. There are two problem domains, both of
them embarrassingly parallel. The first (and simplest) one requires
(semi-realtime) interactive volume visualization of large physical
systems using voxel techniques:
       
        http://www-graphics.stanford.edu/software/volpack/movies/vp_movies.html

If we're talking about visualizing molecular dynamics simulations,
conventional methods such as

	http://thematrix.acmecity.com/gift/202/research/s0.html

clearly break down well before we reach mesoscale (~10^6 atoms; none of
the above simulations involved more than 50*10^3 atoms). Things really
start to resemble experimental science: specimen preparation becomes
necessary, and measurement becomes nontrivial because of the sheer data
volume.  A 3D array of DSPs seems to be the optimal architecture for
parallel voxel visualization.
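
To make that concrete: what I have in mind is a plain block decomposition
of the volume over the node lattice, roughly as in the toy C sketch below
(the grid dimensions and all the names are made up for illustration; this
is neither VolPack nor SHARC code):

    /* Toy sketch: decompose an N^3 voxel volume over a PX x PY x PZ
       lattice of DSP nodes.  Each node would ray-cast only its own
       brick; the partial images get composited afterwards. */
    #include <stdio.h>

    #define N   256                  /* voxels per axis               */
    #define PX  4                    /* nodes per axis of the lattice */
    #define PY  4
    #define PZ  4

    int main(void)
    {
        int bx = N / PX, by = N / PY, bz = N / PZ;    /* brick size   */
        long voxels_per_node = (long)bx * by * bz;

        printf("%ld voxels per node\n", voxels_per_node);
        for (int pz = 0; pz < PZ; pz++)
            for (int py = 0; py < PY; py++)
                for (int px = 0; px < PX; px++)
                    printf("node (%d,%d,%d): x[%d..%d) y[%d..%d) z[%d..%d)\n",
                           px, py, pz,
                           px * bx, (px + 1) * bx,
                           py * by, (py + 1) * by,
                           pz * bz, (pz + 1) * bz);
        return 0;
    }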

The other problem domain is simply running the simulation. Molecular
dynamics

    http://www.sissa.it/furio/md/md/

uses a classical treatment, modelling atoms as point masses bound in a
complex potential or (semi-empirical) force field. Codes such as SPaSM

  http://www.swig.org/papers/Py97/beazley.html
  http://bifrost.lanl.gov/MD/MD.html

parallelize trivially, and I think there is a way to evaluate long-range
(Coulomb) interactions significantly better than the current algorithm
(DPMTA) does.
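
To make it concrete, the per-node inner loop is essentially just pairwise
force accumulation over that node's slice of atoms. A minimal all-pairs
Lennard-Jones sketch in C (the real codes add cell lists and cutoffs, and
the long-range Coulomb part is done separately, e.g. by a multipole method
such as DPMTA; names and numbers here are only illustrative):

    /* Minimal all-pairs Lennard-Jones kernel: the kind of loop each node
       would run over its own slice of atoms.  O(n^2) for clarity only;
       SPaSM-style codes use cell lists and cutoffs.  Plain C, not SHARC
       assembly. */
    #include <stdio.h>

    #define NATOMS 4

    static double x[NATOMS][3] = { {0,0,0}, {1.2,0,0}, {0,1.2,0}, {0,0,1.2} };
    static double f[NATOMS][3];

    int main(void)
    {
        double eps = 1.0, sigma = 1.0, energy = 0.0;

        for (int i = 0; i < NATOMS; i++)
            for (int j = i + 1; j < NATOMS; j++) {
                double d[3], r2 = 0.0;
                for (int k = 0; k < 3; k++) {
                    d[k] = x[i][k] - x[j][k];
                    r2 += d[k] * d[k];
                }
                double sr6 = sigma * sigma / r2;
                sr6 = sr6 * sr6 * sr6;                    /* (sigma/r)^6  */
                energy += 4.0 * eps * (sr6 * sr6 - sr6);  /* LJ potential */
                double fmag = 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r2;
                for (int k = 0; k < 3; k++) {             /* Newton's 3rd */
                    f[i][k] += fmag * d[k];               /* law: equal & */
                    f[j][k] -= fmag * d[k];               /* opposite     */
                }
            }
        printf("potential energy = %g\n", energy);
        printf("force on atom 0  = (%g, %g, %g)\n", f[0][0], f[0][1], f[0][2]);
        return 0;
    }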

 > Evidence: if you look at multiDSP, it's because you want very fast processing.
 > Then don't waste processing power with a (Forth) virtual machine, use the full
 > power of the DSP directly with assembly.  You say that you'll have just a
 > single simple loop on each DSP, then this loop shouldn't be very hard to code
 > efficiently in assembly.
 
Yes, this is what I had in mind. Forth would just be needed as a
compiler/debugger/OS.

 > You don't need to bother with a SHARC-resident Forth interpreter/compiler,
 > this job can be done much more easily on the host computer by an umbilical
 > cross-assembler/compiler written in Forth.  The only thing you need resident

I agree, but I don't have the slightest idea how to tackle
this. Feeding Forth source via a link to a node-local compiler appears
more doable than shuffling around binaries remotely.

 > on each SHARC is a minimal monitor accepting remote subroutine calls, with
 > at least one initial subroutine for downloading new subroutines.
 > 
 > All my umbilical cross-assemblers work this way.  More precisely, the host and
 > target processors interact such that the host may be seen as a server for the
 > target.  Initially, the target executes the monitor, which requests the host:
 >   "Hey, I'm ready to accept a new job, ask the user what he wants me to do."
 > Then the host lets the user enter some line(s) of code, compiles them on the
 > fly in a memory-image of the target code memory, marking which range of memory
 > addresses has been updated, until the user asks for some code to be executed.
 > Then the host asks the target to execute its side of the code downloader, and
 > downloads that memory range which has been marked as updated, i.e. subroutines
 > including the one the user asked to be executed by the target, and resets the
 > updated-range marks.

Sounds sensible so far.
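
If I read it right, the target-resident part is tiny; roughly the
C-flavoured sketch below. The command codes and the link_get/link_put
names are my inventions, just to check my understanding -- your FF
monitors surely differ in the details:

    /* My reading of the target-resident monitor.  On the SHARC,
       link_get()/link_put() would be link-port i/o; here they are
       stubbed with stdio so the sketch compiles on a host. */
    #include <stdint.h>
    #include <stdio.h>

    static uint8_t link_get(void)      { return (uint8_t)getchar(); }
    static void    link_put(uint8_t b) { putchar(b); }

    enum { CMD_DOWNLOAD = 1, CMD_EXECUTE = 2 };

    static uint32_t get32(void)                 /* read a 32-bit value    */
    {
        uint32_t v = 0;
        for (int i = 0; i < 4; i++) v = (v << 8) | link_get();
        return v;
    }

    static void monitor(void)
    {
        for (;;) {
            link_put('?');                      /* "ready, give me a job" */
            switch (link_get()) {
            case CMD_DOWNLOAD: {                /* host sends the updated */
                uint8_t *dst = (uint8_t *)(uintptr_t)get32();   /* range  */
                uint32_t len = get32();
                while (len--) *dst++ = link_get();
                break;
            }
            case CMD_EXECUTE: {                 /* remote subroutine call  */
                void (*sub)(void) = (void (*)(void))(uintptr_t)get32();
                sub();                          /* may itself ask the host */
                break;                          /* for i/o services        */
            }
            }
        }
    }

    int main(void) { monitor(); return 0; }

Everything else (dictionary, compiler, dirty-range tracking) would then
live on the host, which I suppose is exactly the point.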

 > When the target's downloader returns to the monitor, the monitor requests:
 >   "Hey, I'm ready to accept a new job, ask the user what he wants me to do."
 > Then the host asks the target to execute the user requested subroutine, and
 > from there acts as an i/o (display, keyboard, disk) server for the target.
 > The downloaded user subroutine executed by the target may then request some
 > services to the host (such as "print me this"), until it returns to the monitor
 > which requests again the host:
 >   "Hey, I'm ready to accept a new job, ask the user what he wants me to do."
 > And we're back to the initial state, apart from the downloaded code and any
 > side effects of the executed user subroutine.
 > 
 > How does the user specify which code the target must execute?  Very simple.
 > 
 > ": name  some words ;" is a named subroutine, therefore it may be referenced
 > later by its name, so it's simply compiled and kept in code memory, i.e. it's
 > compiled in the host memory image of the target code memory, and its name is
 > kept in a host dictionary of the target subroutine entries.
 > 
 > "some words ;" is an anonymous subroutine, therefore it may not be referenced
 > later by name, so the only sensible thing to do with it is to compile it,
 > execute it, and forget it on the fly, i.e. it's compiled in the host memory
 > image of the target code memory, then downloaded into the target code memory
 > with other newly compiled subroutines, then the compilation pointer is reset
 > to the beginning of the anonymous subroutine (i.e. to the end of the previous,
 > in fact last, named subroutine) so that its code space is recovered for the
 > next named or anonymous subroutine, then the downloaded anonymous subroutine
 > is remotely called and executed by the target, during which the host acts as
 > i/o server, then the host resumes processing the user input.
 > 
 > Note that whether the subroutines are named or anonymous, we're always compiling,
 > there is no need for a "compiling/interpreting" state, and therefore no more
 > beginner-puzzling interpretation-forbidden words (IF ." POSTPONE and so on).
 > Native cross-compilation, with subroutine-threading and primitive-inlining,
 > is also much simpler, because you don't need to provide a separate subroutine
 > for each primitive for the interpreting mode; most primitives are just
 > host-resident macros which compile their few instructions inline, and may even
 > look back at one or a few previously compiled instructions for possible
 > peep-hole optimizations.  Simple and efficient, very minimalist, I enjoy it.
 > 
 > I presented this idea for the first time at the EuroFORML'92 conference in
 > Southampton, with Rod Crawford ("Who needs the interpreter anyway?"), but
 > nobody seemed to understand it as an improvement.  Then at the FORML'95
 > conference in Asilomar, I presented it again in an impromptu talk (I didn't
 > take the time to write and present a paper), with a short demo on 3 different
 > simulated targets (8051, RTX2000, muP21), which was rewarded for being, for a
 > long time, the only proposal of a modification to _remove_ a feature (the
 > interpreter) from Forth.  But it seems that the idea has not spread much
 > beyond that.
 
Whoops, that's perhaps because the idea is not so trivial. I guess I
must reread the passage a couple of times before I get it.
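
Let me at least try to restate the named/anonymous bookkeeping as a sketch,
to check whether I've got it. All the names (image, here, dict, compile,
download, remote_call) are mine, and compile() is only a stub, not your
actual cross-compiler:

    /* My reading of the host-side bookkeeping.  "image" is the host copy
       of the target code memory, "here" the compilation pointer,
       dirty_lo/dirty_hi the updated address range. */
    #include <stdio.h>
    #include <string.h>

    static unsigned char image[0x4000];          /* host image of target  */
    static unsigned here;                        /* compilation pointer   */
    static unsigned dirty_lo = ~0u, dirty_hi;    /* updated address range */

    struct entry { char name[32]; unsigned addr; };
    static struct entry dict[256];               /* names are kept on the */
    static int nentries;                         /* host only             */

    static void compile(const char *src)          /* stub "compiler": just */
    {                                             /* copies source bytes   */
        unsigned n = (unsigned)strlen(src);       /* into the image as if  */
        memcpy(image + here, src, n);             /* they were code        */
        printf("compile \"%s\" at %u\n", src, here);
        here += n;
    }
    static void download(unsigned lo, unsigned hi)
    {
        printf("download image[%u..%u) to the node\n", lo, hi);
    }
    static void remote_call(unsigned addr)
    {
        printf("remote call of subroutine at %u\n", addr);
    }
    static void mark_dirty(unsigned lo, unsigned hi)
    {
        if (lo < dirty_lo) dirty_lo = lo;
        if (hi > dirty_hi) dirty_hi = hi;
    }

    /* ": name body ;" -- compiled, named, and kept */
    static void handle_named(const char *name, const char *body)
    {
        unsigned start = here;
        compile(body);
        strcpy(dict[nentries].name, name);
        dict[nentries++].addr = start;
        mark_dirty(start, here);
    }

    /* "body ;" -- compiled, downloaded, executed, then forgotten */
    static void handle_anonymous(const char *body)
    {
        unsigned start = here;
        compile(body);
        mark_dirty(start, here);
        download(dirty_lo, dirty_hi);            /* flush updated range    */
        dirty_lo = ~0u; dirty_hi = 0;
        remote_call(start);                      /* target runs it; host   */
                                                 /* serves i/o meanwhile   */
        here = start;                            /* reclaim its code space */
    }

    int main(void)
    {
        handle_named("sq", "dup * ;");           /* ": sq dup * ;"         */
        handle_anonymous("3 sq . ;");            /* "3 sq . ;"             */
        return 0;
    }

If that is roughly right, the per-node state on the host is just the
image, the pointer and the dirty range -- pleasantly minimal indeed.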

 > Since then, I have been playing a lot with ADSP2181 targets (www.analog.com)
 > and maybe one day I'll play with a SHARC ADSP21065 target, or with a Lucent.

Oh, if you had time to tackle the SHARC it would be just
wonderful. I'm sorry for being such a pathetic weenie when it comes to
the matter of programming.

 > My umbilical cross-assembler/compilers are used by friends of mine:
 > FF51 for i8051 is used for trailer embedded applications,
 > FF2K for RTX2000 is used for industrial video quality control applications,
 > FF21 for muP21 and v21 is the smallest, prettiest, but was just a dream,
 > FF86 for i80x86 is a project to extend the idea to the host itself,
 > FF96 for i80196 was abandoned along with the idea of using this processor,
 > FF81 for ADSP2181 is used for home alarm systems applications.
 > I develop them during summers or "lost" nights, because I'm pretty much fully
 > occupied with INRIA's research and development on distributed, embedded,
 > optimized real-time applications on multi-workstation/DSP/microcontroller/CAN
 > (http://www-rocq.inria.fr/syndex/).
 > 
 > Can you give pointers on your research work?

Alas, my master's thesis is written in an unreadable language
(Russian).  It is located at http://thematrix.acmecity.com/gift/202/ ,
along with the slide show I mentioned above. I don't have anything
published yet; the first paper (synthetic iceblockers) has just been
submitted. I could give you pointers to papers by other people with
which I agree very strongly (some of them available above), but I
guess that would lead too far afield.

 > CL