Thursday, March 21, 2013

Survivable Multiprocessor Machine

BUILDING A SURVIVABLE MACHINE WITH MULTI PROCESSORS
Can your machine take a hit and keep on ticking? You know... a cosmic ray, a project that falls to the floor, an incoming asteroid: any of these can upset your precious machine and ruin your day. How can we better ensure the machine will take a hit and bounce right back for more?

At what level can you build a multiprocessor machine that will continue functioning if one or more chips fail? In this post we will consider some of the ideas required to construct a more reliable machine.

EXPOSED WIRING AND CIRCUITS
Machines with exposed wires and circuits are easily bumped, and oops, a wire is pulled out and the machine no longer functions correctly. Or consider a bump that dislodges a single integrated circuit. Another oops. After the machine is built and fully functioning, cover it with microwavable Saran Wrap or enclose it in a protective housing.

EARTHQUAKE AVOIDANCE
It's almost impossible to avoid earthquakes in some regions of the world unless you move to a calmer continent. Unknown to many people, an earthquake often has a rapid vibration effect that can remove paint from walls and ceilings. This high-speed oscillatory motion can also affect electronic components and circuits. Pad the electronics to absorb the oscillations.

SPECIFIC INTERFACE WIRING
A machine wired so that it can continue to function when one or more ICs drop out is very desirable. To accomplish this, special wiring is required so that the loss of one chip will not hang the interface shared by the others.

AUTO CHIP REPLACEMENT
Both wiring and software must work together to accomplish the Auto Chip Replacement method, where some unused chips are standing by, ready to fill in for failed chips.
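
The software half of this idea can be sketched in a few lines. This is only a minimal illustration, not a finished design: the class name SparePool, the method report_failure, and the chip labels are all made up for the example, and it assumes the wiring already lets a standby chip take over the position of a failed one.

```python
class SparePool:
    """Tracks active chips and standby spares (hypothetical bookkeeping only)."""

    def __init__(self, active, spares):
        self.slots = list(active)    # chip currently serving each position
        self.spares = list(spares)   # unused chips standing by
        self.failed = []             # chips that have been taken out of service

    def report_failure(self, chip_id):
        """Swap a spare into the failed chip's position.

        Returns the replacement chip, or None if no spares remain.
        Raises ValueError if chip_id is not an active chip.
        """
        slot = self.slots.index(chip_id)
        self.failed.append(chip_id)
        replacement = self.spares.pop(0) if self.spares else None
        self.slots[slot] = replacement   # None marks an empty position
        return replacement


# Example: three working chips, two spares (labels are invented).
pool = SparePool(active=["chip0", "chip1", "chip2"], spares=["spare0", "spare1"])
print(pool.report_failure("chip1"))   # spare0 takes over position 1
print(pool.slots)
```

When the spares run out, the position is simply marked empty, so the rest of the machine can keep running with reduced capacity instead of hanging.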

DIAGNOSTICS
On-the-fly diagnostics are highly useful for determining the status of each chip. They can be run at power up, during operation, and at power down. The trick is to find a test algorithm that is both very effective and very fast. Remember, if the cores are tested one after another, the time required to run the test is multiplied by the number of chips and the number of cores per chip. For a 100 chip machine with eight cores per chip, this equals 800 cores. If the test requires one minute per core, the test will finish 800 minutes, or about 13.3 hours, later.
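
The arithmetic is easy to play with. The sketch below reproduces the 800 minute figure and, as a comparison, shows what happens under one assumption that is not part of the numbers above: that every chip can test its own cores at the same time as every other chip. The function names are invented for the example, and eight cores per chip is inferred from 100 chips giving 800 cores.

```python
def serial_test_minutes(chips, cores_per_chip, minutes_per_core):
    """Total time when every core in the machine is tested one after another."""
    return chips * cores_per_chip * minutes_per_core


def per_chip_parallel_minutes(cores_per_chip, minutes_per_core):
    """Total time if all chips run their tests simultaneously (assumption),
    with each chip still testing its own cores one at a time."""
    return cores_per_chip * minutes_per_core


total = serial_test_minutes(chips=100, cores_per_chip=8, minutes_per_core=1.0)
print(total, "minutes =", round(total / 60, 1), "hours")   # 800.0 minutes = 13.3 hours

print(per_chip_parallel_minutes(cores_per_chip=8, minutes_per_core=1.0), "minutes")
```

Under that assumption the run time no longer grows with the number of chips, which is one reason to look for a test each chip can run on itself.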