One of the perennially promising but seldom delivering ideas in computer engineering is “grid computing”. The idea is to couple a bunch of machines together and use their slack computing time as a truly massive computational resource. It has always looked promising, but it has never really worked.
Distributed computing, where individual machines communicate in a safe internal environment with reliable communications, does work and is a routine tool in software development.
So what is the difference, and (more importantly) what can we do about it?
There are three basic differences between the grid environment and the distributed environment and they are probably responsible for the differences in usefulness.
- Authentication/security. In a distributed environment the individual machines trust each other. Either they are connected on a high-speed local network or they share a memory bus. Authentication and accounting are handled centrally, so when a user logs into one part of the machine, they are authenticated for the whole system. In a grid the machines need some way of transferring authentication, and this is fraught with potential security holes. For example, if one machine in the grid uses weak user authentication or a short or out-of-date certificate, then it becomes a gateway for undesirable activity on the whole grid. Messages can also be packet sniffed, revealing sensitive information, or faked to subvert the computation.
- Computing Reliability. The grid is a heterogeneous mix of machines, and the grid user is typically the lowest priority guest on each machine. Since the grid user is a guest who may be preempted by any other user, it is difficult to assert a reliable time frame for the calculations. The heterogeneous nature of the grid means that the software has to be verified to work on several architectures and that silly issues like default word sizes do not interfere with the exchange of data. Additionally, individual machines will be turned off for maintenance (we’ll assume hardware failures are equally likely) on an arbitrary schedule that is not necessarily coordinated across the grid. What happens when a task is checkpointed, the control program spawns a new task (on some other resource), and then the task resumes and tries to re-integrate into the program?
- Communications Reliability. The worst case for communications in conventional distributed computing is a Beowulf cluster on consumer-grade Ethernet (say 100BASE-T nowadays). Other than some weird contention issues if you try to use a broadcast algorithm (switches don’t magically expand channel capacity; they just pick who gets to talk), messages between processes simply work. Replacing that network with an arbitrary internet connection, one that probably should be encrypted and certainly should be internally checksummed, reduces both the speed and the reliability of communication. It is hard enough to avoid deadlocks without having to worry about a message never arriving.
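Two of the points above, word sizes that differ between machines and messages that can be faked in transit, can be made concrete with a small sketch. It is only an illustration under stated assumptions: the shared key, the message layout, and the function names are all hypothetical, and a real grid would also need key distribution and encryption (an HMAC tag detects tampering and corruption, but it does not hide the contents from a sniffer).

```python
import hashlib
import hmac
import struct
from typing import Tuple

# Hypothetical shared secret; a real grid would distribute keys out of band.
SHARED_KEY = b"example-key-not-for-production"

# "!" selects network byte order and standard sizes, so the layout is the
# same on every architecture: q = 8-byte signed integer, d = 8-byte double.
PAYLOAD_FORMAT = "!qd"
TAG_SIZE = hashlib.sha256().digest_size  # 32 bytes


def encode_result(task_id: int, value: float, key: bytes = SHARED_KEY) -> bytes:
    """Pack (task_id, value) with fixed widths and append an HMAC-SHA256 tag."""
    payload = struct.pack(PAYLOAD_FORMAT, task_id, value)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + tag


def decode_result(message: bytes, key: bytes = SHARED_KEY) -> Tuple[int, float]:
    """Verify the tag, then unpack; raise ValueError if anything is off."""
    expected_len = struct.calcsize(PAYLOAD_FORMAT) + TAG_SIZE
    if len(message) != expected_len:
        raise ValueError("unexpected message length")
    payload, tag = message[:-TAG_SIZE], message[-TAG_SIZE:]
    expected_tag = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(tag, expected_tag):
        raise ValueError("message failed authentication")
    task_id, value = struct.unpack(PAYLOAD_FORMAT, payload)
    return task_id, value
```

A receiver that calls decode_result rejects a message whose bytes were altered or truncated on the way, whether by a flaky link or by someone trying to subvert the computation, and the explicit format string keeps a 32-bit and a 64-bit host from disagreeing about the layout.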
So what can we do about this?
- Algorithm changes. If you can’t have reliable communications or computing, then don’t depend on them. Many problems, most notably in simulations, decompose simply into smaller problems that can run independently and have their results merged. In many cases these multiple-run algorithms can produce better results than an individual run. Think “task parallel”. Ideas like granular learning machines, Monte Carlo algorithms, and Markov chain models suggest themselves.
- Language changes. Functional languages that are designed for concurrent programming (did anyone mention Erlang?) and that are designed to have no side effects (i.e., stateless functions) are highly adaptable to a grid environment. The function evaluations can be fired off over a pool of servers and, if they don’t return in time, re-fired. MapReduce is “trivial” when written in the list comprehension primitives of a functional language. The variable latencies in communication and computation don’t really matter with a highly granular approach to software.
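Both suggestions can be sketched together: a Monte Carlo estimate split into independent tasks, each a side-effect-free function of its arguments, fired over a pool and re-fired if it doesn’t return in time. This is a minimal sketch, not a grid framework. A local thread pool stands in for the pool of servers, and the names count_hits, map_with_refire, and estimate_pi, along with the timeout and retry limits, are assumptions made for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def count_hits(seed: int, samples: int) -> int:
    """Stateless task: the result depends only on (seed, samples)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            hits += 1
    return hits


def map_with_refire(pool, fn, arg_list, timeout_s, max_attempts=3):
    """Fire fn over arg_list; re-fire any task that misses its deadline."""
    results = {}
    pending = list(range(len(arg_list)))
    for _ in range(max_attempts):
        futures = {i: pool.submit(fn, *arg_list[i]) for i in pending}
        pending = []
        for i, future in futures.items():
            try:
                results[i] = future.result(timeout=timeout_s)
            except FutureTimeout:
                future.cancel()  # best effort; a running task keeps running
                pending.append(i)
        if not pending:
            return [results[i] for i in range(len(arg_list))]
    raise RuntimeError(f"tasks {pending} never returned in time")


def estimate_pi(num_tasks=8, samples_per_task=20_000, timeout_s=30.0):
    """Merge independent runs: the 'reduce' step is just a sum."""
    arg_list = [(seed, samples_per_task) for seed in range(num_tasks)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        hits = map_with_refire(pool, count_hits, arg_list, timeout_s)
    return 4.0 * sum(hits) / (num_tasks * samples_per_task)


if __name__ == "__main__":
    print(estimate_pi())
```

Because each task is a pure function of its seed, a re-fired copy computes exactly the same answer as the original, so a late or duplicate result can simply be thrown away. That is the property that makes variable latency tolerable.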