Does the cloud guarantee fault tolerance?
Understanding the difference between a fault-tolerant solution and an architecture that merely contains some fault-tolerant services.
As you can see, the building above has some cracks and is still standing. The cracks are faults, so the building may seem fault tolerant. However, it has been closed down, which means it can no longer be used as expected or specified. Consequently, it is not fault tolerant.
Many cloud services offer fault tolerance out of the box. As a result, it may appear as if there is nothing to worry about. However, this is not the case. I will deconstruct this question by first explaining what fault tolerance is and why it is important, and then examining whether we can take it for granted in the cloud or whether some thinking and effort is needed to achieve it.
What is fault tolerance?
Fault tolerance refers to the ability of a system (computer, network, cloud cluster, etc.) to continue operating as per its specifications even when one or more of its components encounter a fault [1]. What’s more, fault tolerance can exist at different levels, such as hardware, software, network, power supply, and storage. This means a system can be fault tolerant at some layers, or for some components, without being fault tolerant as a whole.
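To make the distinction between layer-level and system-level fault tolerance concrete, here is a minimal sketch in Python. It is a toy model, not a description of any real cloud service: the layer names, replica counts, and the function `system_operational` are all assumptions made up for illustration. The model treats the system as operational only if every layer still has at least one healthy replica.

```python
from __future__ import annotations


def system_operational(layers: dict[str, int], failures: dict[str, int]) -> bool:
    """Return True if every layer still has at least one healthy replica.

    layers   maps a (hypothetical) layer name to its number of replicas.
    failures maps a layer name to the number of replicas that have failed.
    """
    return all(
        replicas - failures.get(name, 0) >= 1
        for name, replicas in layers.items()
    )


# Hypothetical architecture: compute and storage are redundant,
# but the network layer has a single replica.
layers = {"compute": 3, "storage": 2, "network": 1}

# One compute node fails: the redundant layer absorbs the fault.
compute_fault_ok = system_operational(layers, {"compute": 1})

# The single network component fails: the whole system stops operating,
# even though compute and storage are individually fault tolerant.
network_fault_ok = system_operational(layers, {"network": 1})

print(compute_fault_ok, network_fault_ok)
```

In this sketch the compute and storage layers tolerate a single fault, yet the system as a whole does not, because the non-redundant network layer is a single point of failure.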