EST simulations
- To: misc <MISC>
- Subject: EST simulations 
- From: Eugen Leitl <ui22204@xxxxxxxxxxxxxxxxxxxxxxx>
- Date: Wed, 16 Aug 1995 18:34:27 +0200 (MET DST)
An iterated sequence of primitive explicit statespace traversal
simulation (ESTS) outlines. Final product resembles CAs by more
than a bit.
Caveat: this is ad hoc material. It may contain errors. (Please
report them to Eugene Leitl ui22204@sunmail.lrz-muenchen.de.
Thanks. Any comments are welcome).
A) Consider a classical (nonQM) one-dimensional open box
containing many identical noninteracting infinitely small
particles, in short, an unconstrained 1d particle system. Each
particle
-----------.--.---..---.-------.-----------.---------- 1d box
	   ^  ^   ^^
	   particles
has a position and a velocity at t. Since all particles have the
same mass, their individual kinetic energies each being
E=0.5*m*v^2, we can say the total system energy is proportional
to the total sum of individual v square contributions.
E=sigma_i(v_i*v_i). Departing from physical reality even
further, we can define E as sigma_i(v_i). In 1d, the distance
between particles i and j is defined as abs(x_j-x_i).
The simulation runs in discrete time steps, the next
generation's x being computed by x=x+dx. If we initialize x and
dx with random values lying within a small window in the middle
of total numerical dynamics range of this particular machine,
then the following single line in (fortranish) C is sufficient
to run this "simulation".
for(i=0;i<max_particles;i++) x[i]=x[i]+dx[i];
/* of course one would normally use pointers here... */
Assuming the arrays to occupy contigous memory blocks, indices
increasing from left to right, each coordinates restricted to
word-sized integers for the sake of efficiency we have roughly
the following memory map:
x  x  x  x  x  x  x  x	. . . . . . dx dx dx dx dx dx dx dx
If we have a CISC machine with an orthogonal instruction set
containing postincrement or postdecrement indexing, one
generation tick boils down to below short piece of (pseudo)code:
   LOAD X, $foo       ; begin of x  array
   LOAD Y, $bar       ; begin of dx array
   LOAD Count, #$baz  ; number of elements
loop:
   LOAD A,(X)         ; load x into A (or to top of stack)
   ADD A,(Y)+         ; get dx with postincrement and compute x+dx
   STORE A,(X)+       ; store result with postincrement
   DJNZ loop	      ; decrement C and loop if not zero
(RISC code would have been much more verbose, but doing about
the same in principle. The stack machine F21, having but a
single index register, would have to resort to some return stack
value juggling here). A fast CPU without cache would be limited
solely by memory bandwidth. Since both data and code come from
core, we would want to increase the number of LOAD/ADD/STORE
units in the loop, probably substituting the entire loop with a
sequence of them, a common enough practice e.g. on uncached
Motorolas. If we put our code into a small cache or a tiny fast
SRAM area somewhere in our address space (MISC/F21) we can have
several code accesses (5-6) for the price of one DRAM access. In
a primary/secondary cached machine we would want to keep the
loop small enough to fit into the primary cache and the array
size small enough to fit into the secondary cache for speed.
Once our system grows beyond secondary cache size, we are in
trouble: only a very smart (and dear) memory interface would
recognize interleaved memory access to build burst blocks. Of
course we can make our code fit the burst pattern, but we have
to be very careful indeed to make the loop still fit into the
primary cache. Apart from freeing us from the tedious task of
keeping track in what memory our code is, a cached architecture
has absolutely no pluses as compared to the bi-speed memory one
(MISC/F21).
Now consider our system's mirror image in statespace. Beginning
with a point in discrete statespace, the system evolves linearly
with a constant velocity, in discrete jumps. If you look into
the future, wishing to know where the system in s steps will be,
simply add the s*dx_i with x_i (vector addition). If you intend
to look into the past, apply the same operation with inverted dx
signs. (Big deal. We could have used algebra right from the
start instead of doing a simulation.)
What is the difference between a real physical systems and a
computer? In a physical system _all_ particle's coordinates get
modified simultaneously (in one "time cycle"), whereas a
sequential computer needs _several_ primitive cycles to modify a
_single_ one. If we look at real physical particles' high
velocities (km/s) and enormous numbers of them (10^4-10^7)
contained even in a modest large scale simulation we might
notice that we have a problem.
A computer has also a position in its statespace. One can look
at it as a binary vector and use the Hamming distance metric
(bits differing at t and t+1), or, better, take memory words as
integers and apply the Euclidian distance upon the resulting
vector. If using discrete time steps of about the same duration
we notice an essentially trivial fact: the translational
velocity of a physical system lies _many_ orders of magnitude
higher than the computer's. The bottleneck of a sequential
processor is the memory bandwidth. By taking n simple processors
each equipped with their own independant memories we can
increase the statespace velocity n times. (For totally
unconstrained systems, statespace velocity alone is an adequate
abstract computational performance index).
B) One funny things about our simulation: the box is open. Each
coordinate gets incremented/decremented until bang! it reaches
the integer space limits of this particular machine. Some
integer ALUs might ignore overflow/underflow, some might cause
the CPU to execute a trap. Provided the particle velocity dx is
equidistributed, the average particle density decreases in the
course of the simulation: we have opened a box to release gas
particles into a vacuum. They escape into (virtual) infinity of
this machine's integer space.
If we wish to reduce the space of the simulation and need a
clean way of handling escaping particles we have basically two
possibilities: 1) A wraparound (toroidal) space, where escaping
particles come sailing back in from the opposite side. This
approach minimizes boundary artefacts (no potential modulation
in space visible) and creates the valuable illusion of an
infinite open system. 2) Introduce walls (potential function
V(x) modulating the box), which either mimic perfect reflectors
(infinite potential), merely inverting the sign of dx (total E
remains constant) or decreasing dx by some amount (cold walls)
or increasing them (hot walls). (To heat/cool 1) we have merely
to tweak some random particle's dx). Now the statespace picture
looks different: the representation point is now boxed into a
subregion of statespace (this is a constraint, no?). But still,
between boundary events, the system trajectory is a straight
line. Predicting the future/past, though less easy, is still
possible, particularly for perfect reflector or perfect
wraparound space walls (though I wouldn't want to do it with
paper-and-pencil mathematics).
A straightforward implementation would check each x_i whether it
is still within some range and set it some new position/velocity
explicitly if true.
   LOAD X, $foo       ; begin of x  array
   LOAD Y, $bar       ; begin of dx array
   LOAD Count, #$baz  ; number of elements
loop:
   LOAD A, (X)        ; get x
   ADD	A, (Y)+       ; compute x+dx
   TST	A,#$lower     ; underflow?
   JNZ	skip1	      ; no
   handle1_A	      ; reset A to a plausible value
skip1:
   TST A, #$upper     ; overflow?
   JNZ skip2	      ; no
   handle2_A	      ; reset A to a plausible value
skip2:
   STORE A, (X)+      ; write x+dx back with postincrement
   DJNZ loop	      ; decrement C and jump if not zero
In C this would look roughly:
for(i=0;i<max_particles;i++)
{
  x[i]=x[i]+dx[i];
  if (x[i]<lower) handleLower;
  if (x[i]>upper) handlerUpper;
}
As you see, above code outline is still small enough to fit into
the primary cache, at least if hand-coded. However, now memory
bandwidth ceases to be the bottleneck on most but the fastest
machines if we have lots of collisions: too many instructions
lie between two store accesses.
A simple trick helps to implement toroidal space: by restricting
the box size to binary values 2^n, we can use 2^n-1 (1, 3, 5, 7,
15,...,2^n-1) as logical masks:
   LOAD X, $foo       ; begin of x  array
   LOAD Y, $bar       ; begin of dx array
   LOAD Count, $baz   ; number of elements
loop:
   LOAD A, (X)        ; get x
   ADD	A, (Y)+       ; compute x+dx
   AND	A, $mask      ; clip the result
   STORE A, (X)+      ; write x+dx back with postincrement
   DJNZ loop	      ; decrement C and jump if not zero
In C:
for(i=0;i<max_particles;i++) x[i]=(x[i]+dx[i])&mask;
A simple mask of the 000000111111 type ANDed with a coordinate
results in efficient clipping... Or does it? For positive values
this works obviously.. but negative values are represented as
two's complement on most machines, resulting in a totally
different bit pattern. Too bad.
This can be overcomed by using a 000000111111110000000'ish AND
mask pattern and restricting the value range of dx to values
resulting in positive (x+dx). (This is not a noticeable
restriction as it may seem as ESTS requires relatively fine
simulation steps). Moreover, now we can simultaneously test upon
overflow and underflow with the same statement pair:
   LOAD X, $foo       ; begin of x  array
   LOAD Y, $bar       ; begin of dx array
   LOAD Count, $baz   ; number of elements
loop:
   LOAD A, (X)        ; get x
   ADD	A, (Y)+       ; compute x+dx
   MOVE A, B	      ; make a copy
   AND	A, mask       ; clip it with mask
   AND	B, !mask      ; AND with inverse mask: sth sticking over?
   JZ ok	      ; no over/underflow most of the time
   handle_bang	      ; bang!
ok
   STORE A, (X)+      ; write x+dx back with postincrement
   DJNZ loop	      ; decrement C and jump if not zero
In C:
for(i=0;i<max_particles;i++)
{
  if ((x[i]=x[i]+dx[i])¬Mask) do_bang;
/* swinish C: assignment gives numerical result back as side effect
   which is interpreted as a logical value
*/
  x[i]=x[i]&mask;
}
With increased simulation size better volume/surface ratio makes
bangs relatively rare events, hence requiring a streamlined
collision detection, though seemingly losing efficiency by not
knowing first what exactly is wrong.
C) Now let us introduce another constraint: particle
interaction. It is defined in terms of a potential, which is a
function of distance. E.g. consider a van der Waals potential,
which describes noble gas interactions quite accurately: strong
repulsion at small distances, an energy minimum at middle
distances and weak attraction at long distances (imagine an
asymmetrical, shallow potential well). The sign of derivative of
the potential, the slope, shows the direction towards energy
minimum and the value of the slope the length of the vector
modulating dx. To speed things up we use integers of limited
resolution and a lookup with precomputed values. Using linear
interpolation (lookup, again) between two coarse potential table
entries can be used if we want to reduce the memory demand at
the cost of burning more instructions.
Each individual particle's interaction gets added up
arithmetically, resulting in O(N^2) runtime complexity:
for(i=0;i<max_particles;i++)  /* step through all particles */
{
  for(j=0;j<max_particles;j++) /* consider each particle's component */
  {			       /* in regard to the current reference particle */
    accumulate_individual_contribution;
  }
  compute_new_dx; /* from potential slope */
}
for(i=0;i<max_particles;i++) compute_new_x;
Note that we have just narrowly avoided something ugly: the
sequential bias. Its origin is the attempt to emulate an
intrinsically parallel system by a sequential process. Would we
compute new positions directly from accumulated individual
contribution instead of indirectly via dx, each already changed
position would influence other, yet to be changed positions thus
introducing a (small) error. Such systems must be modeled using
_two_ arrays, old and new, alternating between them (via
swapping the reference pointers) if we have enough memory, or
using a strip buffer, copying things to and fro (for strongly
local interactions only).
D) For several reasons, above model is somewhat skewed. We have
an O(N^2) complexity, which limits the size of the systems we
can model quite drastically. Moreover, shielding by intermediate
particles can become quite noticeable for most systems,
resulting in grossly erroneus results. For systems where local
interactions (van der Waals is _very_ short-range coming from
dipole fluctuations) and shielding occurs (e.g. ion shielding by
its hydrate shell in a water box), it is enough to consider only
interactions within a certain range. Using the technique called
"ray-casting" (truncate interaction tracking below a threshold
value, a cousin of raytracing), local shielding can be tackled.
Notice that this allows us to tesselate a big system upon
several computers: only the small overlap at the subsystem
boundary (not a volume, but virtually a surface) has to be
communicated through the network. Surface/volume ratio saves us,
again.
Only... which particles are local? Their address in core gives
us no clue about their "virtual" location in the simulation. We
can use one fact, though: the locality relation changes only
very little during each iteration. By keeping an array of
pointers, sorted by x (_not_ qsort, as the array is in almost
perfect order after each iteration, for this special case there
are much better algorithms) and refreshing it every few
iterations, we can use a sliding window of relatively small
(independant from simulation size) aperture to find the local
particles & compute only their contribution. (Now the cache can
earn its keep for the first time, reducing access latency of
overlapping neighbourhood elements. However, this is nothing we
can't do by hand on a bi-speed machine, since we _know_ the
locality relations right from the start. F21/MISC wins, again.)
We have (almost) turned O(N^2) into O(N)! Perfect?
E) Since we have conquered the worm of time complexity, let us
add two more dimensions, thus arriving at true 3d simulation.
x,y,z; dx, dy, dz. Now our memory looks like:
x  x  x ..  y  y  y ..	z  z  z .. dx dx dx .. dy dy dy .. dz dz dz ..
This means we can use exactly the same code for the first
simulations, not one single opcode has to be changed. We must
clip x,y,z independantly, though, if we look at the second
simulations. No big deal. Instead of delta, we have to use the
Euclidian distance metric now for potential constraints
(..though one could probably use a sum of abs of individual
deltas as a lookup upon a precomputed table). Ok, now we have to
use three pointer tables, sorted by x, by y and by z.... Stop,
we have to find first the overlap where the three orthogonal
window-thick space slices cross! And this is not even an
anisotropic ball, but a cube we have to carve into a ball,
first. Stuck?
Why not using a coarse (one or few atoms per cell) neighbourhood
map (voxelspace/3d array) instead of kludgy sorted pointer
tables to find the direct neighbours? Filled with pointers to
code, or even dire CALL object instructions. Put a NOP where
nothing (=vacuum) is, paste an return from subroutine opcode as
the last instruction and you can call the map directly, using
the processor hardware interpreter instead of a software loop.
(Threaded code architectures have some very real performance
advantages here due to low subroutine call overhead). If the box
is really crowded (liquid) you loose nothing this way. Use an
external 1d CALL object table otherwise. Use several monomethod
maps if you have lots of core (you can swap some out, mind), or
a single one, using a register or top stack element as a
message, and CALL method dispatchers to find the particular
method for the particular object. Use lookup tables of store
offsets (contigous 3d array) for raycasting (with different
radii for different particle type/class) and cause each object
to repaint his part of the map after each iteration. Now assign
a variable-resolution (octree) mutation map to the system. Clone
them into a population, assigning a node to each individual, and
use genetical algorithms (GP) to find local minima. Use
genetical programming (GP) to breed (interpreted stack
instruction) code from empirical data if you don't know how the
energy minimum function looks like en detail. Think about giving
covalent bonds dedicated handling if they never break at the
energy scale you are simulating.
Where does all this leave us?
Our system's statespace antics have become very nontrivial.
Large regions of statespace have become unaccessable, primarily
for two reasons: 1) they are occupied by high energy barriers
(thermodynamical exclusion) 2) they are shielded by high energy
barriers (kinetical inhibition). As in reality, the system we
are modeling is extremely myopic, perceiving only local energy
gradients & sliding along them. (It can also come back this or
the other way if it receives a random kick from somewhere, e.g.
from a hot solvent molecule or from a mutation). The entire
pathway + surrounding the system sees during its evolution is
only an infinitesimal fraction of the total statespace.
(Mutation map-) modulated GA mutation prevents the system from
becoming too myopic and settle down in some only too convenient
energy minimum. We cannot model very ergodic (where points
initially lying in a contigous region diverge markedly during
evolution, winding up widely dispersed throughout the
statespace) systems featuring sharp phase transitions, though we
can model large-scale properties of them. But: we can model
mildly and markedly nonergodic systems with high speed and
accuracy, having forsaken the classical approximation pipeline.
Obviously, we need computers with large-statespace-velocity-
times-path-entropy index, meaning 3d computer node
grids/lattices with orthogonal or hexagonal symmetry (6 vs 12
direct neighbours, provided with 6-12 direct high-bandwidth
links). This matches the demand for increased information
locality (to reduce signal propagation delay) very well. The
node size is determined by the fraction of local memory
bandwidth over network bandwidth (memory size and instruction
complexity not taken into account).
All in all not exactly the properties one expects to find in an
off-shelf workstation, workstation cluster or a maspar machine.
Comments?
-- Eugene