What is high availability?

High availability is the ability of an IT system to be accessible and reliable nearly 100% of the time, eliminating or minimizing downtime. It combines two concepts to determine whether an IT system is meeting its operational performance level: that a given service or server is accessible–or available–almost 100% of the time, and that the service or server performs to reasonable expectations for an established time period. High availability is more than hitting an uptime service level agreement (SLA)–the expectations set between a service provider and client. It is about truly resilient, reliable, and well-functioning systems.


With the adoption of online services and hybrid workloads, there is greater demand for infrastructures to handle increased system loads while still maintaining operational standards. To achieve high availability, these infrastructures, often referred to as high availability systems, must hit defined, quantifiable outcomes beyond just "running better."

One of the targets of high-availability solutions, or high availability services, is called five-nines availability, meaning that a system is running and performing well 99.999% of the time. Usually, only mission-critical systems in industries like healthcare, government, and financial services require this level of availability for compliance or competitive reasons. However, many organizations and industries still require their high availability systems to maintain 99.9% or even 99.99% uptime to provide constant digital access for their customers or allow their employees to work from home.
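
To put these targets in perspective, the short Python sketch below converts an availability percentage into its approximate annual downtime budget. It is illustrative only; the exact figures depend on how an SLA defines its measurement window.

```python
# Convert availability targets into approximate annual downtime budgets.
# Illustrative only; real SLAs define their own measurement windows.

MINUTES_PER_YEAR = 365.25 * 24 * 60

targets = {
    "99% (two nines)": 0.99,
    "99.9% (three nines)": 0.999,
    "99.99% (four nines)": 0.9999,
    "99.999% (five nines)": 0.99999,
}

for label, availability in targets.items():
    downtime_minutes = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: ~{downtime_minutes:.1f} minutes of downtime per year")
```

At five nines, the budget works out to only a few minutes of downtime per year, which is why so few systems genuinely require that level.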

High-availability infrastructure depends on detecting and eliminating single points of failure that could contribute to increased system downtime and prevent organizations from reaching their performance goals. A single point of failure is an aspect of the infrastructure that could take the entire system offline, and in complex systems, multiple single points of failure can exist.

Organizations also have to take into account the different types of failures that can occur in a modern, complex IT infrastructure. These include hardware failures, software failures (in both the operating system and the running applications), service failures (such as network inaccessibility, latency, or degraded cloud service performance), and external failures, such as power outages.

The first step each organization can take toward high availability is determining the specific, most important outcomes it wants to see based on its core services, workload and regulatory or compliance requirements, performance benchmarks, critical applications, and operational priorities:


  • What are the uptime requirements either for regulatory compliance or for user experience?
  • How distributed is the environment? What are the key points of failure?
  • What is the required performance for the application? What are the risks to that app's performance (e.g., high user traffic or heavy write loads)?
  • What kind of storage is in use?
  • What requirements are there around data loss or data access?
  • Given current IT resources, what are achievable SLAs in case of an outage? What are the current planned maintenance schedules, and what is the impact on uptime?
  • Are there plans around different disaster recovery scenarios or changes in business operations?

With high-availability environments, there are also several common metrics that IT teams use to determine whether the high availability architecture is meeting its objectives. Some may be more relevant to your architecture than others, but it is worthwhile to evaluate all of them to set baseline performance expectations:

  • Mean time between failures (MTBF): How long the environment operates between system failures.
  • Mean downtime: How long the system is down (in minutes of downtime) before it is recovered or replaced in the topology. Together with MTBF, this determines overall availability, as sketched after this list.
  • Recovery time objective (RTO): The total time it takes to complete a repair and bring a system back online.
  • Recovery point objective (RPO): The period of time in which you need to be able to recover data. This is the window of lost data. For example, if a system is relying on bringing in another system from backups and the backups are taken daily, then there could be up to 24 hours of lost data in the recovered system. However, if there is replicated or shared storage, then the data loss may only be minutes or even less.
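
As a rough illustration of how the first two metrics relate, availability can be estimated as MTBF divided by the sum of MTBF and mean downtime. The values in the sketch below are hypothetical, and the RPO figure simply restates the daily-backup example above.

```python
# Illustrative relationship between MTBF, mean downtime, and availability.
# All figures are hypothetical; real numbers come from monitoring data.

mtbf_hours = 2000.0          # mean time between failures
mean_downtime_hours = 0.5    # mean time to recover or replace a failed system

availability = mtbf_hours / (mtbf_hours + mean_downtime_hours)
print(f"Estimated availability: {availability:.5%}")  # roughly 99.975%

# RPO example from the text: with daily backups and no replication,
# up to a full backup interval of data could be lost.
backup_interval_hours = 24
print(f"Worst-case data loss window (RPO): {backup_interval_hours} hours")
```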

A high-availability architecture incorporates principles from each layer of continuity planning, such as monitoring and automation. This allows the overall system to be resilient to all types of failures, from specific local failures to a complete outage, and to remain operational through planned maintenance windows and other service interruptions.

A disaster recovery or continuity plan would incorporate approaches for each potential failure:

  • Anticipate specific failures: For each potential failure type, IT architects first make sure that systems are redundant and that backup systems are available in case of a failure. The next step is to automate failure detection and failover so that down systems are detected automatically and services are switched to the backup system (a simplified sketch follows this list).
  • Manage performance proactively: Fault tolerance will address an outage, but it won’t necessarily deal with performance degradation. This is where load balancing and scalability become useful tools. In this case, IT architects monitor system performance and use multiple systems to manage user requests and operations. Load balancers and traffic management can intelligently route traffic in real time based on bandwidth, system performance, user, or request type.
  • Deal with catastrophe: Widespread infrastructure failures–like a cloud provider going down or a natural disaster at a data center site–are rare, but they require a more comprehensive approach than hardware/software failures alone. Along with bringing the infrastructure back online, it is necessary to have up-to-date data available. This can be done synchronously through replication (though there are performance risks) or asynchronously through data backups (with some risk of data loss).
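
The sketch below is a minimal, assumption-laden illustration of the first two approaches: requests are routed round-robin across only the nodes that pass a health check, so a failed node is bypassed automatically. The node names and the check itself are placeholders, not a real cluster manager or load balancer.

```python
# Minimal sketch of health-check-driven routing and failover.
# Hypothetical nodes and health data; real clusters use dedicated tooling.

import itertools

nodes = ["node-a", "node-b", "node-c"]
healthy = {"node-a": False, "node-b": True, "node-c": True}  # node-a has failed


def health_check(node: str) -> bool:
    """Stand-in for a real probe (TCP connect, HTTP status, heartbeat)."""
    return healthy[node]


def route_requests(requests: int) -> None:
    """Send requests round-robin across nodes that currently pass the health check."""
    available = [n for n in nodes if health_check(n)]
    if not available:
        raise RuntimeError("No healthy nodes: invoke the disaster recovery plan")
    for i, target in zip(range(requests), itertools.cycle(available)):
        print(f"request {i} -> {target}")


route_requests(5)  # traffic flows to node-b and node-c; node-a is skipped
```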

High-availability architectures run active failover clusters, so redundancy and failover are built in, with the goal of zero downtime. Within the cluster, nodes are monitored not just for availability, but for the overall performance of applications, services, and the network. Because all cluster nodes work from shared storage, no data is lost if a node goes down. Load balancing can be used to manage traffic for the best performance.

Outside those broad characteristics, high-availability clusters can be designed for more specialized activities, depending on the priorities and activities within the IT infrastructure. The Red Hat Enterprise Linux High Availability Add-on, for example, has four default configurations:

  • High availability: focuses on uptime and availability
  • High performance: for high-speed, concurrent operations
  • Load balancing: for cost-effective scalability
  • Storage: for resilient data management

In real-life environments, high-availability systems typically incorporate aspects of several of these configurations.

High availability spans the entire infrastructure, accounting for data and storage management in separate environments–both cloud and physical–and different locations for services and applications. This is why a common platform and standard operating environment can be so powerful: it creates consistency regardless of the deployment environment.

Red Hat Enterprise Linux has additional capabilities and services that can be included through add-on packages. The Red Hat Enterprise Linux High Availability Add-on addresses the networking, clustering, and storage aspects of the topology.

Because high availability is so entwined with data management, Red Hat Enterprise Linux deployments for Microsoft SQL Server and SAP also include the Red Hat Enterprise Linux High Availability Add-on.