Fault Tolerance: Definition, Testing & Importance
Fault tolerance is a system's ability to keep operating when one or more of its components fail.
Even the most well-designed system fails from time to time. Viruses strike. Servers overheat. Computer components wear out. Fault tolerance allows for smooth operation despite these failures.
Losing even a moment or two of connectivity can be catastrophic. Just ask Disney+. When the organisation's servers delivered glitchy performance in February 2021, users got mad. Instead of watching WandaVision, they wrote nasty tweets.
Fault tolerance plans may not keep your entire organisation running smoothly all the time. But your work could prevent a worst-case scenario from happening.
What is fault tolerance?
Fault tolerance is what allows a computer, server, network, or other IT system to keep operating when one of its components fails.
Create a fault-tolerant design to:
- Stay operational. Make sure your system doesn't go down altogether when something breaks.
- Reduce risks. Prevent disruptions that stem from a single critical piece of hardware or software. Overlap functions, so you can share the load in a crisis.
- Buy time. Fixing any kind of IT problem requires investigation and savvy. Fault tolerance ensures people can keep working while you hunt down the source.
Imagine that you run servers in Washington, D.C., and you just opened a portal for vaccine registration. Users flood the portal with requests, and your servers crash. Reporters take notice and write about your mistake all over the United States.
Now imagine that you've built a fault-tolerant system. When the influx overloads one server, another takes over, and users never know that anything went wrong.
The fault-tolerance concept isn't new. IT professionals have used it since the 1950s to describe systems that must stay online, no matter what.
But early fault-tolerance plans involved alerts. A system notified staff when something was about to fail, and they had to step in and do something immediately. Modern plans involve backups and redundancies, so the team can work while the system stays online.
People sometimes confuse fault tolerance with high availability. Availability is usually expressed as the percentage of time a system stays up out of the total time it is expected to run. To maintain high availability, a system switches to a backup when something fails. The backup often provides reduced capacity and a poorer experience. The company stays online, but work can slow.
In a truly fault-tolerant system, redundant hardware does exactly the same job while the original system is offline.
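The availability figure described above comes down to simple arithmetic. The sketch below is only an illustration: the function names `availability` and `allowed_downtime_minutes` are hypothetical, and the one-year measurement period is an assumption chosen for the example.

```python
def availability(uptime_hours: float, total_hours: float) -> float:
    """Return the fraction of the measurement period the system was up."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return uptime_hours / total_hours


def allowed_downtime_minutes(target: float, period_hours: float) -> float:
    """Return the downtime budget, in minutes, for a target availability."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return (1.0 - target) * period_hours * 60.0


if __name__ == "__main__":
    year_hours = 365 * 24  # 8,760 hours, ignoring leap years (assumption)
    # One hour of downtime in a year:
    print(f"availability: {availability(year_hours - 1, year_hours):.5f}")
    # Downtime budget for a 99.99% target over the same year:
    print(f"budget: {allowed_downtime_minutes(0.9999, year_hours):.2f} minutes")
```

Under these assumptions, a 99.99% target leaves a budget of roughly 52.6 minutes of downtime per year. A high-availability design aims to stay within that budget, while a fault-tolerant design aims for users not to notice the failure at all.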
How does fault tolerance work?
How can you keep something up and running even while parts and pieces of it are breaking? Answer this question with a comprehensive fault-tolerance plan.
At its core, your plan should:
- Eliminate. Remove every single point of failure, so the system keeps operating even while you make repairs.
- Isolate. You should remove the defective piece from system operation rather than letting it cause a cascade of problems.
- Engage. When you complete the repair, the part should come back online with no noticeable disruption.
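The three goals above can be sketched in a few lines of code. This is a minimal illustration rather than a production design: `ReplicaPool`, `AllReplicasFailed`, and the replica functions are hypothetical names, and each replica is assumed to be a plain function that either returns a response or raises an exception.

```python
from typing import Callable, Dict, List, Set


class AllReplicasFailed(Exception):
    """Raised when no healthy replica could serve the request."""


class ReplicaPool:
    def __init__(self, replicas: Dict[str, Callable[[str], str]]) -> None:
        if len(replicas) < 2:
            # Eliminate: one replica alone would be a single point of failure.
            raise ValueError("need at least two replicas")
        self._replicas = dict(replicas)
        self._isolated: Set[str] = set()

    def healthy(self) -> List[str]:
        """Names of replicas currently in service, in registration order."""
        return [name for name in self._replicas if name not in self._isolated]

    def call(self, request: str) -> str:
        for name in self.healthy():
            try:
                return self._replicas[name](request)
            except Exception:
                # Isolate: stop routing traffic to the failed replica.
                self._isolated.add(name)
        raise AllReplicasFailed(request)

    def engage(self, name: str) -> None:
        """Engage: return a repaired replica to service."""
        self._isolated.discard(name)


if __name__ == "__main__":
    def primary(request: str) -> str:
        raise ConnectionError("primary is down")

    def backup(request: str) -> str:
        return f"backup handled {request}"

    pool = ReplicaPool({"primary": primary, "backup": backup})
    print(pool.call("GET /status"))  # served by the backup replica
    print(pool.healthy())            # the primary is now isolated
    pool.engage("primary")           # after the repair
    print(pool.healthy())
```

The caller sees only the response from `call`. Which replica produced it, and whether another one failed along the way, stays hidden inside the pool.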
Your fault-tolerance plan might include:
- Hardware. Build in backups so one can take over when another breaks. Run them in parallel, so they're always online and ready to go.
- Software. Run multiple instances, so one can take over if another fails.
- Power. Make sure your IT system always has power, even if your power company experiences a catastrophe.
There are multiple fault-tolerance techniques, including:
- Replication. Everything breaks in time. For example, most computers last about eight years, even with appropriate maintenance. Duplicating hardware and software ensures you always have a secondary source to lean on when you need to.
- Continuation. Ensure that your programs keep running even if errors exist.
- Recovery. Allow software programs to recover from a failure gracefully.
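Continuation and recovery are the easiest of these techniques to show in code. The sketch below is one possible approach, not a prescribed one: `with_retries` and `process_all` are hypothetical helper names, and exponential backoff is an assumed retry policy chosen for the example.

```python
import time
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def with_retries(
    operation: Callable[[], R],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], object] = time.sleep,
) -> R:
    """Recovery: retry a failing operation, doubling the wait each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def process_all(
    items: Iterable[T], handler: Callable[[T], R]
) -> Tuple[List[R], List[Tuple[T, Exception]]]:
    """Continuation: keep processing remaining items when one fails."""
    results: List[R] = []
    failures: List[Tuple[T, Exception]] = []
    for item in items:
        try:
            results.append(handler(item))
        except Exception as exc:
            failures.append((item, exc))  # record the failure and move on
    return results, failures


if __name__ == "__main__":
    results, failures = process_all(["1", "oops", "3"], int)
    print(results)        # the items that parsed
    print(len(failures))  # the one item that did not
```

Returning the failures alongside the results keeps errors visible for later investigation, so the program continues without silently hiding what went wrong.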
Your company is unique, and your solution set should reflect your risks and environment.
Fault tolerance in data centres
Functional, efficient data centres operate with many staff members. The average organisation has 1,000 or more employees. Even so, these teammates can't sit on their servers 24/7 to keep them up and running. Fault-tolerance plans help them address the unexpected.
Fault-tolerant data centres must:
- Protect. Parallel heating/cooling systems keep equipment from breaking due to environmental factors.
- Back up. Identical or similar systems running in parallel keep operations moving.
- Plan ahead. Alternative power sources ensure that the centre can operate even when the grid goes down.
- Repair. Routine maintenance ensures that all parts keep working, rather than allowing them to break before you address them.
Most data centres sell their services with promises of uptime. They keep those promises (and their customers) by keeping fault-tolerance plans tight.
Fault tolerance in web applications
Every time your customers pick up their phones, they expect your app to be online and available. Fault tolerance makes uptime possible.
Load balancing is critical for web applications. Multiple servers share the load, and traffic shifts between them as needed to serve your customers. That same setup helps when a catastrophic issue takes one server offline, because the remaining servers pick up its traffic.
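As a rough illustration of the idea, the sketch below rotates requests across servers and skips any server marked as down. It assumes a simple round-robin policy and uses hypothetical names (`RoundRobinBalancer`, `mark_down`, `mark_up`, `pick`). Real load balancers also run health checks and weigh servers by capacity, which this example leaves out.

```python
from typing import List, Set


class RoundRobinBalancer:
    def __init__(self, servers: List[str]) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._down: Set[str] = set()
        self._next = 0  # index of the next server to try

    def mark_down(self, server: str) -> None:
        """Take a failed server out of rotation."""
        self._down.add(server)

    def mark_up(self, server: str) -> None:
        """Return a repaired server to rotation."""
        self._down.discard(server)

    def pick(self) -> str:
        """Return the next healthy server, skipping any marked as down."""
        for _ in range(len(self._servers)):
            server = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if server not in self._down:
                return server
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
    balancer.mark_down("app-2")  # simulate a server failure
    print([balancer.pick() for _ in range(4)])
    # → ['app-1', 'app-3', 'app-1', 'app-3']
```

With `app-2` out of rotation, requests alternate between the two remaining servers, and customers keep getting responses.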
Fault tolerance in cloud computing
Many organisations are switching from on-site servers to cloud solutions.
Despite its name, cloud computing has nothing to do with the atmosphere. Services that offer cloud computing have physical server bases, just like data centres. They use the same concepts, ideas, and techniques to serve their customers.
Many organisations strive to identify core processes that must stay online at all times and move them to the cloud.
What's best for you?
The options, techniques, and tools that make up a fault-tolerance plan can be confusing. You may not know where to start. Let us help.
Okta is proud to deliver 99.99% uptime to every customer around the world, whether you are using our free developer edition or are an enterprise customer, all at no additional cost.
References
Millions of WandaVision Fans Crashed the Disney+ Servers Trying to Stream Episode 7. (February 2021). Movieweb.
D.C. Vaccine Registration System Riddled With Crashes, Dropped Calls for Third Day in a Row. (February 2021). WAMU 88.5.
Fault Tolerant. PC.
How Long Do Computers Last? 10 Signs You Need a New One. (November 2020). Business News Daily.
Data Center World: Survey Shows Enterprises Are Building New Data Centers. (March 2019). Data Center Knowledge.
Six Reasons Why Companies Hang Onto Their Data Centers. (May 2017). ZD Net.