
protein folding / scientific simulation with CAs




Some Relevant Aspects of the Protein Folding Problem

1. Motivation

The Protein Folding Problem, _the_ Holy Grail of computational
biology, has withstood countless attacks over the last decades,
mounted by many bright minds and backed by years of scientific
supercomputer time.

The reliable prediction of protein tertiary structure, which is
a function of the primary sequence and, to a lesser degree, of
context and history, is an absolute prerequisite for any
practicable form of protein engineering.

Even more, since in proteins form follows function, we face the
inverse problem. Our task is much harder: we have to predict a
primary sequence which has to fold robustly into a given shape.
This crystal ball has to work reliably for an overwhelming
majority of cases, and it must work _fast_. Otherwise the
powerful tool of simulated random mutagenesis, the "digital
evolution", will never become a practical one.

Why is protein folding such a hard problem, what has been done
wrong in trying to solve it, and what might be the key features
of a successful solution?

2. Feature Focus

At first glance the problem is trivial: a simple optimization
procedure, or an exhaustive search for a global energetic
minimum in a given search space; an unknown mapping from the
space of strings over the amino acid alphabet to coordinate
space. Simple?

Unfortunately, the search space is much too big for exhaustive
search on anything but astronomical time scales, even for tiny
peptides. Neither is the target configuration's energy minimum
likely to be a global one, nor is the energy function known. All
we know is that it is very rugged and that the standard
deviation of the energy minima at the glass transition (middle
to late folding stage) is very small.
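
To get a feel for the numbers, here is a back-of-the-envelope
Levinthal-style estimate as a small C++ program (the residue
count, the states per residue and the sampling rate are
assumptions, not measurements):

#include <cstdio>
#include <cmath>

int main() {
    // assumed: ~3 coarse backbone states per residue, 100 residues,
    // and a (very generous) 1e12 conformations sampled per second
    double conformations = std::pow(3.0, 100.0);   // ~5e47
    double per_second    = 1e12;
    double seconds       = conformations / per_second;
    double years         = seconds / 3.15e7;       // ~1.6e28 years
    std::printf("%.2e conformations, %.2e years to enumerate\n",
                conformations, years);
    return 0;
}

Even with wildly optimistic hardware assumptions the exhaustive
enumeration never terminates on any relevant time scale.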

A computational nightmare? Wait, there is even more.

Though the folding pathway of any amino acid sequence has been
evolutionarily optimized for robust convergence towards one
"right" structure, misfoldings sometimes occur. A family of
chaperones (e.g. hsps) recognizes misfolded domains, massages
them into the right shape or marks the protein for oblivion.
Since most chaperone family members and their mechanisms are
unknown, their impact cannot currently be assessed
quantitatively. Even more, the protein tertiary structure is a
function of its environment: the water box, the ion
concentration, present and past oxygen concentration (disulfide
bonds), etc. A protein in a nontypical protein crystal, subject
to different boundary conditions, may assume a different
conformation. Thank god, so far NMR- and X-ray-resolved
structures seem to indicate that this possibility is a rare one.
The assumption might not hold for certain water-poor crystals of
purely de novo engineered proteins.

3. What They Did Wrong

It is a well known truth in computer science that the right
algorithm and (especially) the right architecture may reduce the
time/space complexity of a problem by many orders of magnitude.
Some problems, the NP-complete/hard ones, have so far resisted
all attempts at solving them in polynomial time. Is the protein
folding problem NP-hard? I think not.

Scenario 1. Time: Early 70's or mid 80's. Machine: an IBM
mainframe/Cray. Language: Fortran 66.

"The attempts of using ab initio or even QED are clearly
laughable. Obviously, a semiempirical solution is needed.
Moreover, quantum mechanics phenomena seem to be rare in protein
folding and easily identifiable. A Newtonian mechanics emulation
with empirical quantum corrections might well suffice. What we
have, is an assembly of particles, with mass, charge/dipole and
momentum, connected by springy bars, the chemical bonds. The
standard way of solving it, would to approximate it by a
formidable system of linear equations, which in turn would be
handled by standard Fortran matrix libraries on a scientific
vector computer big iron. Sufficient virtual memory space is a
must, of course. Since the total search space is much too big
and rugged, we will have to reduce the search space either by
introducing additional constraints or reducing the number of
sample points on the energy landscape."

I think this is a fundamentally wrong way to attack this
problem.

We can't sample the energy landscape, or even tiny fractions of
it, with sufficient resolution because of its size and
ruggedness. This means that all we get after a few days of Cray
time is some local minimum, which is of little help, as our
energy function doubtless differs from the true one and our
randomly hit minimum is almost certainly wrong. Since the
protein chain itself does not sample the vast majority of its
conformation space for simple kinetic reasons (heavy constraints
forbid exploration of many pathways, hiding them behind high
energy walls), our simulation is bound to be wrong, as it does
not take folding pathway kinetics into account. Obviously, a
cul-de-sac.

Scenario 2. Time: Mid 90's. Machine: coarse grained parallel
supercomputer or a workstation cluster. Alternatively: special
purpose (neuro) hardware accelerators. Language: Fortran 90 or
C/C++.

"The massive statistical match database search/neural network
pattern matching has shown the given residue to belong to a
certain domain (e.g. alpha helix) with a probability of 0.85.
After every residue in the chain has been mapped similiarly, a
tentative prefolded structure is being refined by semiempirical
methods. Though true energy function is unknown, genetical
algorithms (GA) search using an estimated fitness function is
utilized."

Since evolution tends to conserve building blocks/motifs, the
prefolded structure might be roughly true. Only a small fraction
of the conformation search space has to be sampled. The
probability of success is much higher than with the first
method. Sometimes, for an unusual protein motif, the method will
fail utterly. For a nonbiological protein the method will
certainly fail. The reason for failure is the simple fact that
primary sequence domains lying far apart will come into
immediate vicinity during the folding process and will/might
influence both the folding pathway and the target structure.
High-tech manure in a majority of cases.

4. Some Aspects of a Successful Solution

I have been quite harsh, even arrogant, in my estimations above
of the standard methods' instrumental quality. Because of my
limited insight/experience I might also be quite wrong.
Nevertheless (assuming I am right): is there a better way to do
things?

Obviously, a successful simulation must _follow the folding
pathway(s)_. It must mimic all kinetic kinks along the real
origami. It must thus take _time_, the history, into account.
Both time (steps/increments) and space (coordinates) have to be
_quantized_, space even more roughly. This spells (scaled)
integers instead of floats. The system's Hamiltonian causes it
to converge towards very few (ideally one) attractors and thus
makes trajectories convergent. This clearly defines a
well-behaved (nonergodic) system, which indicates the
Hamiltonian to be collapsible, i.e. computable by a finite-state
system, best an automaton network.
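
As an illustration of the quantization idea, a minimal C++
sketch of such a scaled-integer state (the grid pitch and the
torsion resolution below are assumed values, not a
recommendation):

#include <cstdint>

// Positions live on an integer grid; one unit is assumed to be
// 0.01 Angstrom, which gives a 32-bit coordinate far more range
// than any water box needs.
struct QuantPos {
    int32_t x, y, z;
};

// One residue: quantized position plus backbone torsions rounded
// to whole degrees. No floats anywhere, so every trajectory is
// exactly reproducible and trivially hashable.
struct Residue {
    QuantPos pos;
    int16_t  phi, psi;
};

// One time increment moves the state by integer deltas only.
inline void step(Residue &r, int32_t dx, int32_t dy, int32_t dz,
                 int16_t dphi, int16_t dpsi) {
    r.pos.x += dx;  r.pos.y += dy;  r.pos.z += dz;
    r.phi   += dphi;
    r.psi   += dpsi;
}

The exact reproducibility and hashability of integer states is
what makes the table look-ups and CA-style encodings discussed
below possible in the first place.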

The degrees of freedom being low-resolution integers facilitates
the use of look-up or hashing tables, possibly intelligent ones
with automagical interpolation. Sometimes even extremely large
tables, resulting in MMU misses (VM accesses), can be worthwhile
if the function is extremely time-complex to compute. Since
state space search velocity is essentially bottlenecked by
memory bandwidth alone, _code/data optimization for
primary/secondary cache_ is straightforward. About one order of
magnitude speedup may result from this detail alone.
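
A minimal sketch of the table idea, with a Lennard-Jones-like
pair potential standing in for whatever semiempirical function
is eventually used (the functional form, cutoff and bin width
are assumptions of this sketch):

#include <cmath>
#include <vector>

// Tabulate an expensive pair potential over the quantized distance
// once; afterwards each interaction costs one indexed load instead
// of a transcendental evaluation.
struct PairTable {
    std::vector<float> e;   // energy per distance bin
    float inv_bin;          // 1 / bin width

    PairTable(float cutoff, float bin) : inv_bin(1.0f / bin) {
        std::size_t n = static_cast<std::size_t>(cutoff / bin) + 1;
        e.resize(n);
        for (std::size_t i = 1; i < n; ++i) {
            float r  = i * bin;                    // reduced units
            float s6 = std::pow(1.0f / r, 6.0f);
            e[i] = 4.0f * (s6 * s6 - s6);          // LJ stand-in
        }
        e[0] = e[1];                               // mask the r = 0 pole
    }

    float energy(float r) const {
        std::size_t i = static_cast<std::size_t>(r * inv_bin);
        return i < e.size() ? e[i] : 0.0f;         // zero beyond cutoff
    }
};

With a few thousand float bins the whole table fits into primary
or secondary cache, which is exactly the locality argument made
above.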

Usage of either a _coarse-grained maspar supercomputer_ (such as
LRZ's SP2) or a _large workstation cluster_ (20-100 networked
low-end machines) comes to mind. _Much_ later, special-purpose
hardware accelerators might become necessary. A workstation
network even has some very real advantages compared to the
supercomputer: total memory bandwidth (the bottleneck parameter
for state space search) as well as resource availability during
night time (since the advent of load balancing, single user mode
on a big supercomputer is not the rule nowadays) is much better.
And RAID comes really cheap this way.

The model has to include a _water/ion box_ (no protein folds
without context, ever), either as single particles (this might
become necessary =( ) or as an energy field. Since this adds the
(at least O(N^2)) unconstrained N-body problem to the initial
flexible chain (a constrained N-body problem), a reduction of
the necessary amount of computation is crucial. Since we assume
(with good reason) the final protein structure to be an
attractor in its state space, we can treat weak (= distant)
interactions as negligible (no butterfly effect) and
_concentrate on strong local ones_ instead. This allows the use
of CA-proven encodings such as the light cone hashing approach
(the sphere of influence at t being its base and the t+1 single
residue conformation its tip). (See a forthcoming paper on CAs
in simulation.)
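
One possible reading of the light cone idea, as a hedged C++
sketch (all names and the key layout are illustrative, not the
encoding from the forthcoming paper):

#include <cstdint>
#include <unordered_map>
#include <vector>

// The quantized state of everything that can influence a residue
// within one time step (the base of its light cone) is packed into
// a hash key; the residue's t+1 conformation (the tip of the cone)
// is memoized. A recurring local situation then costs one lookup
// instead of a full energy/force evaluation.
using ConeKey = std::uint64_t;

struct LightConeCache {
    std::unordered_map<ConeKey, std::int16_t> tip;   // key -> t+1 state

    // Fold the quantized neighbourhood states into 64 bits (FNV-1a).
    static ConeKey pack(const std::vector<std::int8_t> &base) {
        ConeKey k = 0xcbf29ce484222325ull;
        for (std::int8_t s : base) {
            k ^= static_cast<std::uint8_t>(s);
            k *= 0x100000001b3ull;
        }
        return k;
    }

    bool lookup(ConeKey k, std::int16_t &out) const {
        auto it = tip.find(k);
        if (it == tip.end()) return false;
        out = it->second;
        return true;
    }
    void store(ConeKey k, std::int16_t conf) { tip[k] = conf; }
};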

A special neural net/automaton network to assess the energy
function, either as a software or a hardware implementation,
might reduce the amount of computation drastically. Two
fundamentally different encodings are possible here: either we
assume the neighbourhood to change very little during one
incremental time step (a _sliding neighbourhood window over x,
y, z (periodically re)qsorted pointers into the particle table_
encoding), or we use a _space slab cluster/voxel space_
encoding. The latter in particular is very suitable for the C++
OOP style of programming and offers noticeable advantages in
either time complexity or precision, as it performs particle
clustering inherently. The sliding neighbourhood window carries
a performance penalty, since it has to compute the local box
from the overlap of crossing space slabs. Since this reduces
data locality, the increased secondary cache miss frequency
might reduce overall performance very noticeably.
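
A minimal sketch of the voxel space encoding (the hash-based
cell map and all names below are illustrative assumptions):

#include <cmath>
#include <unordered_map>
#include <vector>

struct Particle { double x, y, z; };

// Particles are binned into cubic cells one cutoff length wide, so
// interaction candidates come from the 27 surrounding cells instead
// of all N particles. The caller still applies the exact distance
// test; a hash collision merely adds a few false candidates.
struct VoxelGrid {
    double cell;
    std::unordered_map<long long, std::vector<int>> bins;

    explicit VoxelGrid(double cutoff) : cell(cutoff) {}

    static long long key(long long ix, long long iy, long long iz) {
        return ix * 73856093LL ^ iy * 19349663LL ^ iz * 83492791LL;
    }
    void insert(int id, const Particle &p) {
        bins[key((long long)std::floor(p.x / cell),
                 (long long)std::floor(p.y / cell),
                 (long long)std::floor(p.z / cell))].push_back(id);
    }
    std::vector<int> candidates(const Particle &p) const {
        std::vector<int> out;
        long long ix = (long long)std::floor(p.x / cell);
        long long iy = (long long)std::floor(p.y / cell);
        long long iz = (long long)std::floor(p.z / cell);
        for (long long dx = -1; dx <= 1; ++dx)
            for (long long dy = -1; dy <= 1; ++dy)
                for (long long dz = -1; dz <= 1; ++dz) {
                    auto it = bins.find(key(ix + dx, iy + dy, iz + dz));
                    if (it != bins.end())
                        out.insert(out.end(), it->second.begin(),
                                   it->second.end());
                }
        return out;
    }
};

Binning also clusters particles that interact strongly, which is
the data locality argument made above for the voxel encoding.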

Since in vivo proteins have to fold either at the ribosome
outlet or after transmembrane transport, starting either from a
linear conformation (utterly unrealistic) or from a molten
globule (pretty unrealistic), a lot of additional constraints
are introduced, and this fact may be utilized for drastic search
space reduction.
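
A hedged sketch of how the ribosome-outlet constraint could be
exploited: grow the chain one residue at a time and let only a
short, freshly emitted tail move. The helper functions and the
tail length are hypothetical placeholders, not part of any
existing model.

#include <string>

struct Conformation { /* quantized chain state, as outlined above */ };

// hypothetical helpers, stubbed out here so the sketch compiles
void append_residue(Conformation &, char /*amino_acid*/) {}
void relax_tail(Conformation &, int /*tail_length*/)     {}

Conformation cotranslational_fold(const std::string &sequence) {
    Conformation chain;
    const int window = 20;              // assumed mobile tail length
    for (char aa : sequence) {
        append_residue(chain, aa);      // residue leaves the "ribosome"
        relax_tail(chain, window);      // only the fresh tail may move
    }
    relax_tail(chain, (int)sequence.size());   // final global pass
    return chain;
}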

Finally, we have to sample several points of the energy
landscape simultaneously. For this we need a _population of
individuals_, possibly a substantial one. This, in turn,
suggests using the genetic algorithm (GA) paradigm, which suits
coarse-grained low-communication-bandwidth architectures
excellently. Embellishments such as gene fragment swapping via
crossing-over and a mutation frequency map (equivalent to
automatic local temperature adaptation) might come in quite
handy in this context. Since the actual fitness (energy
landscape) function is unknown and the interaction scope/sphere
is limited (thank god), a GP (genetic programming, a GA
derivative) method may be used to derive the actual fitness
function from real-world data, e.g. the PDB. Using GAs upon
automaton networks seems to be a straightforward approach here.
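
A minimal sketch of one such GA generation (the genome layout,
the selection scheme and the external fitness() function are
assumptions of this sketch, not a finished design):

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

struct Individual {
    std::vector<std::int16_t> genes;   // quantized torsions or residue codes
    double score = 0.0;
};

double fitness(const Individual &);    // assumed, e.g. negated model energy

// One generation: evaluate, keep the better half, refill the rest by
// single-point crossover plus per-gene mutation. Assumes at least two
// individuals with equal-length, non-empty genomes.
void ga_generation(std::vector<Individual> &pop, std::mt19937 &rng,
                   double mutation_rate) {
    for (auto &ind : pop) ind.score = fitness(ind);
    std::sort(pop.begin(), pop.end(),
              [](const Individual &a, const Individual &b) {
                  return a.score > b.score;
              });
    const std::size_t half = pop.size() / 2;
    std::uniform_int_distribution<std::size_t> parent(0, half - 1);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    for (std::size_t i = half; i < pop.size(); ++i) {
        const Individual &a = pop[parent(rng)];
        const Individual &b = pop[parent(rng)];
        Individual child = a;
        std::size_t cut = parent(rng) % child.genes.size();
        std::copy(b.genes.begin() + cut, b.genes.end(),
                  child.genes.begin() + cut);            // crossover
        for (auto &g : child.genes)
            if (coin(rng) < mutation_rate)
                g += (coin(rng) < 0.5 ? -1 : 1);         // local mutation
        pop[i] = child;
    }
}

Only the fittest individuals (or just their scores) ever cross
the network, which is why the paradigm tolerates the low
communication bandwidth of a workstation cluster.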

5. Plan of Action

The plethora of above approach outlines appears overwhelming. A
thorough investigation of above algorithm subspace alone will
take man-decades. To prove which lanes to be fruitful and which
barren we need a plan of exploration.

The implementation language is C/C++ (e.g. in its GNU gcc/g++
incarnation) due to excellent portability and run-time
efficiency. The development/early implementation/front end
platform is either a Linux box (a PC) or a low-end workstation.
Later stage: either a workstation cluster or a maspar
supercomputer. Multi-window visualisation (under X) should show
key features (entropy, kinetics, etc.) as well as individual 3d
"movie" system dynamics. Switching to a PD protein structure
visualizer package quite early might be a good idea. The recent
emergence of MISC (minimal instruction set computer)
fine-grained networks gives hope of achieving 1-10 times the
integer performance and memory bandwidth of an SGI Crimson
workstation per single 1-2 MByte node, at about $200 hardware
cost per node, $100-$150/node in large quantities. Network
bandwidth is 10 MBytes/s per link, which is more than sufficient
for GAs.


5.1 Implementation of a realistic (ice/fluid domains) water
    box
5.1.1 A brute-force (O(N^2)) time-quantized implementation
      (see the sketch after this list)
5.1.2 A sliding window implementation
5.1.3 A space slab/voxel space implementation
5.1.4 Estimation of potential, seek/eliminate hot spots
      in all three models.
5.1.5 Calibrate with a real water box. Try to substitute
      water by a force field.
5.1.6 (Add ions, alkanes/phospholipids. Check behaviour.)
5.1.7 Recode model for GA
5.1.8 Distribute model over network/supercomputer nodes.
5.1.9 Sample parameter space (grain size, time/space quantization,
      boost system size) to map its applicability boundaries
5.2 Add a constrained N-body (flexible chain) with an abstract
    energy function.
5.3 Warp both the flexible chain and the energy function
    towards physical realism
5.4 Try to derive the energy function from PDB data by GP. Use
    the sum-of-squares minima (in toto or at single domain
    level) as the fitness function: "the right energy function
    is the one which causes a known protein to fold correctly".
    Investigate the possibility of coding it as an automaton
    network, to be bred genetically.
5.5 Etc.
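
For step 5.1.1, a minimal C++ sketch of the brute-force,
time-quantized pair loop (a single-site Lennard-Jones "water"
with unit masses; real water needs at least charges and hydrogen
geometry, so this is only the computational skeleton whose hot
spots step 5.1.4 would profile):

#include <vector>

struct P { double x, y, z, vx, vy, vz; };

// One time increment: accumulate all pair forces (O(N^2)), then do
// an explicit Euler update. eps/sigma are Lennard-Jones parameters.
void step(std::vector<P> &p, double dt, double eps, double sigma) {
    const std::size_t n = p.size();
    std::vector<double> fx(n, 0.0), fy(n, 0.0), fz(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j) {          // all pairs
            double dx = p[i].x - p[j].x;
            double dy = p[i].y - p[j].y;
            double dz = p[i].z - p[j].z;
            double r2 = dx * dx + dy * dy + dz * dz;
            double s2 = sigma * sigma / r2;
            double s6 = s2 * s2 * s2;
            double f  = 24.0 * eps * (2.0 * s6 * s6 - s6) / r2;
            fx[i] += f * dx;  fy[i] += f * dy;  fz[i] += f * dz;
            fx[j] -= f * dx;  fy[j] -= f * dy;  fz[j] -= f * dz;
        }
    for (std::size_t i = 0; i < n; ++i) {                  // unit mass
        p[i].vx += fx[i] * dt; p[i].vy += fy[i] * dt; p[i].vz += fz[i] * dt;
        p[i].x += p[i].vx * dt; p[i].y += p[i].vy * dt; p[i].z += p[i].vz * dt;
    }
}

Steps 5.1.2 and 5.1.3 would replace the inner all-pairs loop
with the sliding window and voxel candidate lists sketched in
section 4.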

6. Conclusion

The Grail is not ours yet. But since, for the first time, we are
able to assess qualitatively the titanic amount of work still to
be done, a realistic man-years estimate now seems possible. New
hardware, massively parallel hardware in particular, as well as
(more importantly) new algorithms will be our weapons in this
long fight.