The past year has seen the birth of several commercial enterprises whose
goal is to capitalize on the abundance of ``spare'' compute cycles in
personal computers by providing an efficient and convenient platform for
very large scale distributed computations. Application domains that could
benefit from such large amounts of computing power include DNA sequencing and protein
folding in the biotech sector, graphics in the entertainment industry,
and exhaustive regression and other statistical and modeling applications
in the financial and oil industries. Such large scale distributed computations
running in an untrusted environment raise a number of security concerns.
These include the potential for either intentional or unintentional corruption
of data at individual nodes, or for the collusion of a group of malicious
nodes. While several (though not all) of the commercial platforms under
development include measures to ensure the privacy of data and protect
proprietary software, these efforts typically amount to security through
obscurity. Guaranteed reliability of results computed on these platforms
can be achieved trivially through redundancy, but in many situations this
is both inefficient and expensive -- firms providing the platform generate
revenue from finishing large numbers of jobs, and firms utilizing the
service are often charged based on some measure of total cycles required.
Often neither benefits from redundancy. On the other hand, without some
measure of redundancy, there can be no guarantee of the validity of returned
results. We approach the problem from two angles. First, we have developed
and are currently analyzing a number of frameworks which attempt to balance
the need for reliability with the need for efficient use of resources.
Second, we are examining issues of resilience in distributed computations.
For the framework development, our techniques include both simulation
and probabilistic analysis, and cover a wide variety of models of potential
attack strategies and defenses. Once we have refined our frameworks, we
will test them by running several compute jobs on the Frontier distributed
computing platform developed by Parabon
Computation, Inc. in Fairfax, Virginia. (I designed and implemented
the Parabon Exhaustive Regression Engine, which runs on Pioneer, and so
am familiar with their API as well as the overall Frontier architecture.)
Each job will involve tens of thousands of nodes, with specific proportions
of the nodes running ``rogue'' code. Our approach to the question of resilience
has been to develop a rough taxonomy of classes of computations, since
often the degree of damage that may be caused by rogue nodes
depends on the nature of the computation. For example, an optimization
problem that involves a simple parameter search in a large space is more
resilient if the function being analyzed obeys some notion of continuity,
since potential extrema can be easily verified (so a rogue node claiming
to have a solution is foiled), and since near-optimal parameters should
be very near the true extremal parameters (so a rogue node that omits
a number of good results is foiled).
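
The resilience argument for a continuous parameter search can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not part of our frameworks or of the Frontier API: the one-dimensional objective, the grid, the chunking across ten nodes, and the names `objective`, `honest_node`, `false_claim_node`, `omitting_node`, and `supervise` are all hypothetical. A supervisor re-evaluates each claimed point once, which rejects a false claim cheaply, and an honest neighbor of the omitting node still reports a point close to the true maximizer.

```python
from typing import Callable, List, Sequence, Tuple

Report = Tuple[float, float]  # (parameter, claimed objective value)
Node = Callable[[Callable[[float], float], Sequence[float]], Report]


def objective(x: float) -> float:
    """Hypothetical continuous objective; its maximum is at x = 0.3."""
    return -(x - 0.3) ** 2


def honest_node(f: Callable[[float], float], xs: Sequence[float]) -> Report:
    """Report the best point in the assigned chunk."""
    best = max(xs, key=f)
    return best, f(best)


def false_claim_node(f: Callable[[float], float], xs: Sequence[float]) -> Report:
    """Rogue: claim an inflated value at an arbitrary point."""
    return xs[0], 1.0


def omitting_node(f: Callable[[float], float], xs: Sequence[float]) -> Report:
    """Rogue: hide the good results by reporting the worst point instead."""
    worst = min(xs, key=f)
    return worst, f(worst)


def supervise(
    f: Callable[[float], float],
    chunks: Sequence[Sequence[float]],
    nodes: Sequence[Node],
    tol: float = 1e-12,
) -> Tuple[Report, int]:
    """Re-evaluate each claimed point once; keep only claims that check out."""
    verified: List[Report] = []
    rejected = 0
    for xs, node in zip(chunks, nodes):
        x, claimed = node(f, xs)
        if abs(f(x) - claimed) <= tol:  # one extra evaluation per claim
            verified.append((x, claimed))
        else:
            rejected += 1
    return max(verified, key=lambda r: r[1]), rejected


def demo() -> Tuple[Report, int]:
    # 1000 grid points on [0, 1), split into 10 chunks of 100 points each.
    grid = [i / 1000 for i in range(1000)]
    chunks = [grid[k:k + 100] for k in range(0, 1000, 100)]
    nodes: List[Node] = [honest_node] * 10
    nodes[3] = omitting_node     # holds the chunk containing x = 0.3
    nodes[7] = false_claim_node  # claims a value the objective never attains
    return supervise(objective, chunks, nodes)


if __name__ == "__main__":
    print(demo())
```

In this toy setting the false claim costs the supervisor a single re-evaluation to reject, and the omission costs at most one grid step of accuracy. A computation whose objective lacks continuity would offer neither safeguard, which is why the class of the computation matters.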