Online Tech Expert Interview: What Is High Availability?
Using a High Availability (HA) architecture can greatly reduce the risk of losing revenue and customers due to a loss of Internet connectivity or loss of power. However, how much do you really understand about a high availability infrastructure? What does high availability really mean when it comes to data centers and networks? Why is high availability important for your business?
I was able to sit down and speak with Online Tech’s Project Manager, Noah Wolff, to answer common questions about high availability at our Michigan colocation data centers.
Watch our Online Tech Ask the Expert: What is High Availability? video.
Q: What is high availability?
Wolff: High availability is a design approach that takes into account the sum of all the parts, including the application, all the hardware it is running on, the power infrastructure, and the networking behind the hardware. High availability is usually measured against 100% operational availability. I would consider high availability to be above 99.999% available.
Q: When do you need high availability?
Wolff: You should use high availability when you have a service level agreement (SLA) you have to meet, you have a mission-critical system that can’t be down, or your client requires little to no downtime.
Q: What are some reasons to go to high availability?
Wolff: With high availability, you can perform maintenance without downtime. As density increases, especially in a virtualized environment, each outage affects more users. Also, a single firewall, switch, or PDU failure will not affect your availability. The increase in availability translates to higher productivity and, hopefully, cost savings.
Q: What are the components of high availability?
Wolff: The mean time between failures (MTBF) and the mean time to repair (MTTR). MTBF is the average time to transition from fault-free to failure, and MTTR is the opposite: the average time to transition from failed to operational. To calculate availability: Availability = MTBF / (MTBF + MTTR)
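As a quick illustration of the formula above, here is a minimal Python sketch. The function name and the sample figures (a 10,000-hour MTBF and a 1-hour MTTR) are assumptions chosen for the example, not numbers from Online Tech.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction, using MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF and MTTR cannot both be zero")
    return mtbf_hours / total


# Hypothetical component: fails on average every 10,000 hours
# and takes 1 hour to repair.
example = availability(10_000, 1)
print(f"{example:.4%}")
```

With these assumed numbers the result is roughly 99.99%. Because MTTR sits in the denominator, shortening repair time raises availability just as lengthening the time between failures does.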
Q: Explain some high availability terms?
Wolff: Reliability is the probability that a system will not fail. Resiliency is the ability of a system to recover. Fault tolerance is the ability to respond to unexpected failure. Availability is the ratio of time that a system is available. N+1 is a form of resilience. Disaster recovery (DR) is protecting against natural and/or man-made disasters.
Q: How do you calculate high availability and downtime?
Wolff: The first step is to monitor your system, preferably through a third-party company that checks availability at a fixed interval, such as every second or every 5 seconds. Just because the server is up and running doesn’t mean the application is up and running, and just because the network is up and running doesn’t mean the server is. So monitor the application itself, from the client’s perspective, to make sure that you are seeing what they are seeing. Then subtract the amount of downtime from the total time in the period (day, month, or year) and divide the result by that total to get your availability.
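The calculation described above can be sketched in Python. This is only an illustration under stated assumptions: the Probe record, the function name, and the rule that each check result holds until the next check are all hypothetical choices, not how any particular monitoring service works.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Probe:
    """One application-level check made from the client's side (hypothetical record)."""
    timestamp: float  # seconds since the start of the measurement window
    ok: bool          # True if the application answered correctly


def measured_availability(probes: list[Probe], window_end: float) -> float:
    """Estimate availability between the first probe and window_end.

    Assumption: each probe's result holds until the next probe, so a
    failed probe counts the whole interval after it as downtime.
    """
    if not probes:
        raise ValueError("need at least one probe")
    ordered = sorted(probes, key=lambda p: p.timestamp)
    start = ordered[0].timestamp
    if window_end <= ordered[-1].timestamp:
        raise ValueError("window_end must be after the last probe")
    # Pair each probe with the one that follows it; the final probe
    # is paired with a sentinel at the end of the window.
    following = ordered[1:] + [Probe(window_end, True)]
    downtime = 0.0
    for current, nxt in zip(ordered, following):
        if not current.ok:
            downtime += nxt.timestamp - current.timestamp
    total = window_end - start
    return (total - downtime) / total


# Checks every 5 seconds for one minute; the checks at 20s and 25s fail.
probes = [Probe(t, t not in (20, 25)) for t in range(0, 60, 5)]
print(f"{measured_availability(probes, 60):.2%}")
```

In this made-up minute, 10 of the 60 seconds count as downtime. A shorter check interval gives a tighter estimate, which is one reason to check every few seconds rather than every few minutes.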
Q: What is considered uptime?
Wolff: Uptime refers to the amount of time the application, the server, or tech environment is available. For high availability, we like to see 99.999% or better. So, if you have availability of 99.999%, that means you have no more than about 5.26 minutes of downtime per year.
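To see where the downtime figure for 99.999% comes from, here is a small Python sketch. The function name is an assumption for the example, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes per year")
```

Each additional nine cuts the budget by a factor of ten: roughly 5,256 minutes at 99%, 525.6 at 99.9%, 52.56 at 99.99%, and about 5.26 at 99.999%.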
Q: Explain high availability power.
Wolff: When you talk about high availability power, there are several things to keep in mind. Just because you have two power circuits does not mean they’re highly available. The primary circuit should be provided by the primary UPS, which is backed up by the primary generator. The secondary circuit should be provided by the secondary UPS, which is backed up by the secondary generator. That way a single UPS failure or generator failure does not interrupt power in your environment.
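One way to check a power design for the problem described above is to list what each circuit depends on and look for anything they share. The Python sketch below is a deliberately simplified model: the component names are hypothetical, and it treats every shared component as a single point of failure.

```python
from __future__ import annotations


def single_points_of_failure(circuits: dict[str, set[str]]) -> set[str]:
    """Return the components that every circuit depends on."""
    if not circuits:
        return set()
    return set.intersection(*circuits.values())


# Design described above: each circuit has its own UPS and generator.
independent = {
    "primary": {"ups-a", "generator-a"},
    "secondary": {"ups-b", "generator-b"},
}

# Hypothetical flawed design: two circuits fed from the same UPS.
shared_ups = {
    "primary": {"ups-a", "generator-a"},
    "secondary": {"ups-a", "generator-b"},
}

print(single_points_of_failure(independent))  # set()
print(single_points_of_failure(shared_ups))   # {'ups-a'}
```

The second layout still has two circuits, but both go dark if the shared UPS fails, which is why two circuits alone do not make power highly available.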
Q: What is a common mistake people make when creating a high availability environment?
Wolff: One step that is commonly overlooked is testing that the design and the hardware you put into place actually perform the way they are expected to. You should test the power structure by pulling away the primary power to ensure that the HA power takes over. Pull away the primary networking, too, so that you can confirm the secondary networking takes over. Also, you should test each piece of redundant hardware to make sure that there is failover in every area of the environment.
Q: What is the biggest cause of high availability failure?
Wolff: The biggest cause by far is human error. But when you break that down, you have power failure, hardware failure, network failure, and application errors.
Q: Are backups still important?
Wolff: If you want to recover from a catastrophic event, like a flood or hurricane, then high availability does not have much to do with it. High availability would have kept your system up and running inside your rack or inside your data center. But when the data center is gone, you’ll want your backups so you can restore to a completely separate site.
Q: Where does disaster recovery fit in with high availability?
Wolff: Disaster recovery assumes multiple points of failure. By the time DR comes in, your HA has completely failed, your whole system is broken down, and you now have to completely recover to a different geographical location.
Q: Does high availability cost more?
Wolff: In most cases, yes, it will cost a little more. Typically, on the infrastructure side of things, you have a completely extra leg of power. For the network, you have redundant hardware in most cases. For your servers, you have redundant power supplies and redundant NICs.
Q: Does High Availability guarantee you will never have downtime?
Wolff: No, there are no guarantees. It’s a design approach that addresses single points of failure, but whenever you introduce humans there is always a chance they will make a mistake, something will get interrupted, or something will be done incorrectly, and downtime may occur.
Q: What should you look for in a high availability data center?
Wolff: Obviously you want to look for a data center that can provide high availability services, such as a primary and secondary power feed, and a primary and secondary internet uplink if you are purchasing internet from them. If the high availability data center is providing hardware, such as firewalls or switches, make sure they can offer redundant primary and secondary hardware.
If you are using managed services and purchasing a server from a data center, make sure all the hardware is configured for high availability: dual power supplies and dual NICs. Also, make sure that the server is wired back to different switches and the switches are dual-homed to different access layer routing so there is no single point of failure anywhere in the environment.
Online Tech (www.OnlineTech.com) is the leader in secure and compliant hosting services including private cloud hosting, managed cloud hosting, hybrid cloud hosting, managed dedicated servers, disaster recovery and offsite backup services, and Michigan colocation. Online Tech’s legacy of independent HIPAA, PCI, SAS 70 Type II, SSAE 16 Type II (SOC 1), SOC 2, and SOC 3 audits and reports ...
Other Posts by Thu Pham