Building scalable and reliable systems

Paul Krzyzanowski

April 2007


If we want a computer that will be highly available, we would traditionally look for one that has reliable components (a better fan and power supply, for example) and incorporates elements of fault-tolerant design. These range from basic replication to TMR (triple modular redundancy). If we're serious about availability, we'll be looking for redundant and hot-swappable power supplies, disk drives, and processor cards as well as error-correcting memory. The problem with these designs is their high cost. This is not a mass-market product.

If we want a computer that can handle massive amounts of computation, we would traditionally select a system that offered an SMP (symmetric multi-processing) architecture with a large number of processors. The problem with SMP designs is that the performance gain as a function of the number of processors is sublinear, particularly with more than 8–16 processors, primarily because of contention for shared resources such as busses, memory, and devices. These computers are also not mass-market devices and are expensive compared to generic PCs.


Wouldn't it be great if we could just connect a bunch of cheap off-the-shelf PCs together and get all the benefits of reliability and scalability without the high cost?

The goal of clustering is to achieve reliability and scalability by interconnecting multiple independent systems. A cluster is a collection of standard, autonomous machines, configured so that they appear on the network as a single machine. The application and its data can reside anywhere, but this is transparent to the application and its users. The cluster is made up of computing and storage systems interconnected with a high-speed network. By using standard components, we can take advantage of the low price of mass-produced PCs, disk systems, and networking hardware.

This is the realization of the single-system image. Depending on the implementation and the needs of the system, the single system image may not be 100% realized. For example, the differences between machines may be visible to the administrator and the systems may be individually managed. However, the single system image should be present for users and applications.

A cluster is a distributed, not a parallel, system: each node runs its own copy of the operating system. A management subsystem creates the abstraction of an integrated entity (rather than a rack of PCs) and a cluster API (application programmer interface) provides a collection of system interfaces to perform operations such as determining the set of nodes in a cluster, monitoring their state, and launching applications.

Ideally, running one set of "clustering software" over a bunch of machines will solve all issues at once, providing scalable computing together with fault tolerance. Unfortunately, we don't get all this in one package (yet). Clustering solutions tend to focus primarily on one set of needs (although there is overlap):

High Availability (HA)
A high-availability system strives to provide a non-stop environment for applications, even if machines fail.
High Performance Computing (HPC)
High-performance computing attempts to aggregate the combined computing power of a group of computers. There are a few variations of this:
  • supercomputing (cooperative multiprocessing): simulate the large-scale multiprocessing of a supercomputer, where software has to operate on massive vectors and arrays. Ongoing communication between processors is generally needed for this.
  • batch processing: dispatch chunks of work to a set of computers. A render farm is a classic example of this.
  • load balancing: dispatch network requests among a set of similar servers to distribute the load. This clearly provides fault tolerance as well.

Most commercial clustered solutions today focus on high availability rather than HPC since that's where the more lucrative business needs reside. Load balancing is often a sufficient approach for many of these environments. Because of this, it is common for most commercial clustering environments to support no more than a few servers.

High Performance Computing (HPC)

The attraction of combining a group of off-the-shelf computers is that scalability can be extended well beyond the typical 2–8 CPUs possible with off-the-shelf systems, and at a far lower cost than that of massively multiprocessor systems. Taking 16 $300 PCs and networking them together costs little more than $4,800, far less than the cost of a 16-processor computer. Moreover, memory contention problems are reduced, workload can be balanced across multiple servers, and nodes can be added incrementally, reducing the need for a large initial outlay.

Supercomputing clusters are still largely custom efforts, optimized for the specific task at hand. One of the most popular efforts in supercomputing clustering arose in 1994 from the Center of Excellence in Space Data and Information Sciences (CESDIS), a division of the University Space Research Association at the Goddard Space Flight Center [this division is currently known as CISTO]. This is a collection of software and configuration known as Beowulf. It was initially built to address problems associated with large data sets in Earth and Space Science applications. What made it possible to build highly scalable supercomputing systems is:

  • Commodity off-the-shelf (COTS) computers have become cost effective. The mass-market adoption of PCs drove the costs of these devices down.

  • Low-cost, high-speed, switched networking is available. Switches allow network bandwidth to scale with the number of hosts, since two machines communicating with each other over a switched connection do not slow down any other network communications.

  • Open source software (Linux, GNU compilers and tools), including libraries supporting parallel machine abstractions, such as MPI (message passing interface) and PVM (parallel virtual machine). These allow software to be written that is largely hardware- and platform-independent.

In addition to this, many more people have accumulated experience with parallel software. Even so, writing parallel programs tends to be difficult and solutions tend to be custom.

Programs that do not require fine-grain computation and communication can usually be ported to run effectively on Beowulf clusters. Under Beowulf, the machines (nodes) are generally dedicated to the cluster (to the application that will be run). This means that the performance of the nodes is not subject to external factors, making load balancing easier. The interconnect among the cluster elements is generally a separate network, isolated from the external network, so that network load is determined only by the application running on the cluster. A global process ID is provided, enabling processes on one node to send signals to processes on another node.

Beowulf is not a single product but rather a collection of publicly available software that helps build a cluster. It generally runs on Linux systems (and also on BSD variants). The system includes:

  • BPROC: Beowulf Distributed Process Space. This allows process IDs to span multiple nodes in a cluster, provides a way for users to start processes on other machines, and allows signals and exit status to be forwarded across machines.

  • device drivers. In addition to supporting conventional drivers, Beowulf contributed channel-bonded ethernet drivers, enabling network traffic to be striped across multiple ethernet cards. With 100 Mbps Ethernet cards standard fare now and 1 Gbps cards quite inexpensive, channel bonding is not as useful as it was when the project began.

  • a monitor driver that provides a /proc interface for an LM78 on-board hardware monitor.

  • tools that include:

    • PVM (parallel virtual machine) and MPI (message passing interface) libraries
    • Distributed shared memory (page based with software-enforced page ownership and consistency policy). The original DSM available for Beowulf was based on ZOUNDS (Zero Overhead Unified Network DSM System) but other packages are available.
    • Other tools, such as a cluster monitor and global ps/top/uptime tools.

Depending on the application needs, different configurations of Beowulf are possible:

  • Networking with a single entry point to the cluster: one monitor, one keyboard, and a single external IP address. The rest of the cluster hides behind this node with IP masquerading. Users log onto the main node and spawn remote jobs with ssh.

  • Networking with exposed nodes: each node has its own external address. Each system may double as a desktop machine.

  • File systems. There is no particular file system needed for Beowulf and the system does not support any mechanism for synchronizing files across machines. There are several options for synchronization:

    • Local disks synchronized nightly (except for /var, /tmp, /etc/sysconfig) with a utility such as rsync

    • Local disks are not synchronized: useful for applications that do only number crunching

    • NFS root (or some other distributed file system): useful for programs that need disk synchronization but are not disk bound

  • Process management via a batch system: job scheduling is left to the programmer (job runner). Generally a queue is maintained and remote jobs are started over ssh. This is good for highly parallel applications, such as supercomputing or rendering.

  • Preemptive scheduling/migration: this is not a part of the Beowulf solution but may be incorporated into the system. Processes can be automatically migrated based on cluster status. This is geared to an environment that is not dedicated to the cluster or one where different types of jobs are run. Two popular packages that provide this capability are Condor (not open source) and Mosix.

  • Fine-grained control: programs control their own synchronization and load balancing using MPI and/or PVM libraries and bproc for process dispatch.

Batch Processing

One area that readily benefits from having access to a collection of computers is large-scale batch processing. Batch processing dates back to the early days of computing, when users would submit jobs (usually in the form of stacks of punched cards) that would then be run in batches. Today, it refers to a non-interactive environment where one can submit a set of tasks for execution and get results later. These jobs tend to be long-running and CPU-intensive. Circuit simulations and rendering frames for computer animation of movies are classic examples. The typical solution is to maintain a queue of frames to be rendered and have a dispatcher that remotely executes the process on an available server.

Batch processing environments for computer animation are referred to as render farms. These are collections of computers, all of which are generally connected to a shared storage system. A coordinator maintains a work queue and distributes tasks to each member of the farm ("machine 1 renders frame 1," "machine 2 renders frame 2," etc.). When a machine is done with its task, it informs the coordinator and gets the next chunk of work.
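The coordinator described above is essentially a shared work queue. A single-process sketch in Python, with worker machines simulated by threads and the render step stubbed out (frame counts and machine IDs here are made up for illustration):

```python
import queue
import threading

def run_farm(num_frames, num_workers):
    """Toy render-farm coordinator: a shared queue of frames and worker
    'machines' that repeatedly pull the next frame to render."""
    work = queue.Queue()
    for frame in range(1, num_frames + 1):
        work.put(frame)

    rendered = []                         # completed frames, in any order
    lock = threading.Lock()

    def worker(machine_id):
        while True:
            try:
                frame = work.get_nowait() # ask coordinator for the next chunk
            except queue.Empty:
                return                    # no work left; this machine is done
            # ... render the frame here (stubbed out) ...
            with lock:
                rendered.append((machine_id, frame))
            work.task_done()

    threads = [threading.Thread(target=worker, args=(m,))
               for m in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rendered

frames = run_farm(num_frames=10, num_workers=3)
```

Because each worker asks for work only when it finishes its previous frame, fast machines naturally take on more frames, which is exactly the load-balancing property a render farm wants.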

The composition of render farms changes frequently as companies beef up their infrastructure and replace aging equipment. As an example of the types of systems that a few render farms use, consider:

  • Pixar: 1,024 machines with 2.8 GHz Intel Xeon processors running Linux and RenderMan software. The systems in aggregate have 2 TB of RAM and 60 TB of disk space. Pixar initially used NFS file systems but switched to a SAN (Storage Area Network) based on an EMC CX700 after discovering that NFS was a bottleneck. The movie Cars required 300 times more compute power than Toy Story: 2,300 CPU-years at a rate of approximately one hour per frame.

  • A 3,000-processor AMD render farm that expands to 5,000 processors by harnessing desktop machines. For storage, it uses 20 Linux-based SpinServer NAS systems and 3,000 disks from Network Appliance. Machines are networked with 10 Gbps ethernet.

  • DreamWorks: outsources rendering to the HP Labs Utility Rendering Service, a 1,000-processor collection of machines running Linux.

  • Sony Pictures Imageworks: a render farm comprising over 1,200 processors in Dell and IBM workstations. Almost 70 TB of data was processed in the rendering of The Polar Express.

Outside of proprietary batch processing solutions, there are a few attempts to create a general-purpose framework for dispatching and managing remote jobs. One of these is the Portable Batch System, OpenPBS. Like Beowulf, it has its origins in NASA. The original free version is unsupported, although there is still a user community that provides support and patches. Open source development on this effort has forked into the TORQUE Resource Manager project.

The basic idea of these systems is that they provide software to submit, execute, list, hold, and delete jobs remotely.

Grid Computing

One relatively recent, and still very much evolving, effort is to develop software to deal with a general-purpose, decentralized collection of computing resources. The idea is to create "virtual" clusters from a set of machines that will let you run your software for a period of time, instead of using a dedicated cluster where the collection of machines is configured for and dedicated to a specific set of tasks.

The best-known grid effort is the Globus toolkit, an open source collection of software for building grids. The main goal of this software is to provide a standard set of interfaces for exchanging messages. The Globus toolkit makes use of web service mechanisms with XML-based formats for discovering, describing, and invoking services.

Clustering for High Availability (HA)

When we talk about the availability of systems, we may refer to different levels of statistical availability, which also include the human overhead of dealing with failures. 100% is unattainable. 99.999%, known as "five nines," is the standard for devices such as phone system switches and generally requires backup power, earthquake-resistant enclosures (if needed), redundant power supplies, and redundant networks. Just because your PC running FreeBSD hasn't been rebooted all year does not mean that the system is designed with five-nines reliability.

Class                 Availability   Annual downtime
Continuous            100%           0
Fault tolerant        99.999%        5 minutes
Fault resilient       99.99%         53 minutes
High availability     99.9%          8.8 hours
Normal availability   99–99.5%       1.8–3.6 days
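These downtime figures follow directly from the availability percentage. A quick sanity check of the arithmetic (assuming a 365-day year):

```python
def annual_downtime_minutes(availability_percent):
    """Minutes of downtime per (365-day) year at a given availability."""
    year_minutes = 365 * 24 * 60          # 525,600 minutes in a year
    return (1 - availability_percent / 100.0) * year_minutes

# "five nines" allows roughly five minutes of downtime per year
print(round(annual_downtime_minutes(99.999), 1))     # about 5.3 minutes
print(round(annual_downtime_minutes(99.99)))         # about 53 minutes
print(round(annual_downtime_minutes(99.9) / 60, 1))  # about 8.8 hours
```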

Companies such as Stratus, NEC, and Marathon Technologies have built fault-tolerant systems through hardware replication. These are proprietary systems and are expensive. NEC and Stratus support software running in lockstep synchronization on two identical connected systems. Systems such as Novell's SFT-III (Software Fault Tolerance) allow two systems to be configured identically: not only are the disks replicated, but the entire server is replicated onto an identical machine running identical software, designed so that both systems receive exactly the same inputs. Software runs in synchrony on both machines. If one system fails, the other takes over immediately. There is still a problem: software faults can crash both systems. It is also a relatively expensive solution, since a dedicated backup system is required along with very frequent communication between the two systems.


With clustering, we would like to find a solution that achieves high availability with standard computers. This problem is addressed by many vendors, including Sun, IBM, HP, Microsoft, SteelEye (LifeKeeper), and many others.

In the general case, if one server fails, the fault is isolated to that node and the workload is automatically spread over the surviving nodes. This also allows a machine to be taken down for maintenance without disrupting the system as a whole. A node that picks up work does not have to sit idle waiting for a failure; the overhead is simply that k systems now do the work that was previously done by k+1 systems.

To build this kind of system, we need software on each machine that can detect when other machines have died or when an application on its own machine has died. The software must run an election to put some system in charge of deciding which nodes will take over the applications of a failed machine, and of dispatching messages to those systems to start the applications.

Two important issues in cluster reliability are how application failover takes place and how long it takes the cluster to realize that one of its members is dead. Design options for failover are:

cold failover
When a cluster detects that an application is dead on one machine, it restarts it on another machine.
warm failover
A running application checkpoints itself periodically. When it dies, the cluster restarts the last checkpointed image. It may also maintain a log of inputs to perform a roll-forward to bring the application to the point where the original system died. Most commercial HA clustering systems provide a library that an application programmer can use to create and load a checkpoint file for an application.
hot failover
The application's state is lockstep synchronized on a backup system. When the application dies, the backup takes over. This generally requires custom hardware (e.g., Stratus, NEC, Marathon Technologies) or software with extensive communication (e.g., Novell SFT III).
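The checkpoint library mentioned under warm failover can be sketched with simple serialization (the file name and state layout here are hypothetical; real libraries capture the application's full memory image):

```python
import os
import pickle

CHECKPOINT = "app.ckpt"          # hypothetical checkpoint file name

def save_checkpoint(state):
    """Called periodically by the application to record its state.
    Write to a temp file first so a crash mid-write leaves the old
    checkpoint intact."""
    with open(CHECKPOINT + ".tmp", "wb") as f:
        pickle.dump(state, f)
    os.replace(CHECKPOINT + ".tmp", CHECKPOINT)   # atomic rename

def restore_or_init():
    """On startup: resume from the last checkpoint if one exists,
    otherwise start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"records_processed": 0}

if os.path.exists(CHECKPOINT):   # start clean for this demonstration
    os.remove(CHECKPOINT)

state = restore_or_init()        # fresh start: 0 records
state["records_processed"] += 100
save_checkpoint(state)           # a restarted instance resumes from here
```

A roll-forward log would be layered on top of this: replay logged inputs that arrived after the checkpoint was written.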

Clearly, hot and warm failovers have the advantage of bringing the application back to the state last presented to the user. The issue with a checkpointed restart is ensuring that all activity is logged if roll-forward is used, so that the application has all the state necessary to execute properly and will not, for example, generate redundant disk activity. An issue for both warm and hot failover is that system-specific state may not be present on the new machine (e.g., a connection over a particular serial port) and that relevant operating system state must also be reestablished (e.g., open files, network connections).

Another issue in failover is whether the cluster supports application failover to any of several machines or only to a single designated machine; the former is known as multi-directional failover. Once failover takes place, an extra level of fault tolerance is achieved if the application can fail over to yet another system should the backup system die. This is known as cascading failover.

Shared resources

Most real-world applications rely on accessing files stored on disk. A problem with failover is that, while an application may have restarted on another system, the disk containing the files that the application needs is still attached to the dead machine.

There are two basic models used for dealing with disk access: shared-disk and shared-nothing. A shared-disk cluster allows multiple systems to share access to disk drives while a shared-nothing cluster has no shared devices.

With a shared-disk cluster, each node has its own memory but storage resources are shared. This works well if applications do not generate much disk I/O since a shared device can be a point of contention. Disk access must be synchronized to maintain data integrity. Synchronization is achieved by using a distributed lock manager (DLM) to serialize requests. This is mutual exclusion software to ensure that two systems do not try to claim disk access at the same time. The benefit of a shared-disk cluster is that everyone sees the same storage system. If an application moves to another machine, it can continue making the same disk requests (but cache coherency remains an issue). The detriment is disk contention.
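In spirit, a DLM provides per-resource mutual exclusion. A single-process sketch of that interface (a real DLM coordinates over the network, supports shared as well as exclusive lock modes, and must cope with node failure; the resource name below is made up):

```python
import threading

class LockManager:
    """Toy lock manager: one exclusive lock per named resource."""
    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}                  # resource name -> Lock

    def acquire(self, resource):
        with self._guard:
            lock = self._locks.setdefault(resource, threading.Lock())
        lock.acquire()                    # blocks until the resource is free

    def release(self, resource):
        self._locks[resource].release()

dlm = LockManager()
counter = 0                               # stands in for shared on-disk data

def writer():
    global counter
    for _ in range(1000):
        dlm.acquire("disk:block42")       # serialize access to a disk block
        counter += 1
        dlm.release("disk:block42")

workers = [threading.Thread(target=writer) for _ in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

Without the lock manager, four concurrent writers could interleave their read-modify-write cycles and corrupt the shared data; with it, every update is serialized.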

Under a shared-nothing cluster, each system has its own storage resources (cache and disk). The benefit is that there is no need for a DLM and there are no cache-coherency problems. If machine A needs data that resides on a disk connected to machine B, A sends a message to machine B with the request. If there are many such requests, performance becomes an issue. Moreover, if machine B fails, its storage resources have to be switched over to a live node or become completely inaccessible. In general, shared-nothing is better at providing linear scalability because of lower contention.

A hybrid architecture is also possible, where shared-nothing is used for scalable, easily-partitioned applications and shared-disk for difficult-to-partition applications or those that mostly read rather than write the disk.

This hybrid architecture can also provide a failure mode of operation for a shared-nothing architecture. Here, a shared-nothing architecture is used but the disk is still physically connected to multiple machines. When all systems are up, a system forwards its disk requests to the system that owns that storage. If that system fails, the failover system can access the disk directly, knowing that it is the only one doing so.

Hardware enhancement in HA clusters

In addition to software support for high availability, it often makes sense to beef up the hardware on systems to make them more fault tolerant. The decision to do so is largely a financial one: is it cost-effective to do so? Some hardware enhancements include the use of hot-pluggable devices (e.g., you don't need to shut the machine down to replace a drive, bad fan, or power supply), redundant devices (ECC memory, RAID 1 mirroring on disks), and hardware with support for on-line diagnostics.

Cluster interconnects

Traditional LANs and WANs are often too slow to serve as a cluster interconnect (connecting server nodes, storage nodes, I/O channels, and perhaps memory pages). This led to the emergence of the System Area Network (SAN). A SAN is a switched interconnect that can connect any cluster resources together. It provides reliability (no need for TCP), high speed (usually over 1 Gbps), low latency, I/O without processor intervention, and a scalable switching fabric (it is easy to add more nodes). Transfer is typically through a Remote Direct Memory Access (RDMA) mechanism. A key point in a SAN is low processing overhead, since processing overhead affects scalability. Because it is a dedicated, reliable network, there is no need to manage a complex protocol stack such as TCP/IP. An example of a system area network is Tandem's ServerNet.

Another important interconnect, which unfortunately shares the same acronym, is the SAN, or Storage Area Network. A storage area network allows storage devices, such as disk arrays, to appear as if they were directly attached to the computer even though they are really connected over a network and may be shared with other systems. The most common interconnect is fibre channel, where a fiber-optic network connects the computer to a fibre channel switch that is connected to one or more storage systems. Other options include iSCSI, which encapsulates SCSI over a TCP/IP link, and ATA over Ethernet.

Heartbeat network

To detect system faults and distinguish them from network faults, it is useful for clustered systems to maintain redundant networks. One network can be used to send periodic heartbeat messages to test a machine's liveness. Ideally, this network will be a reliable network with a bounded response time. Lucent RCC (Reliable Cluster Computing) used a serial line connection for sending heartbeats. Microsoft's Cluster Server uses a SCSI bus for this, with one machine performing a reset and waiting to see if another reestablishes a connection within 10 seconds (a rather long time).
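On the receiving side, the liveness test reduces to tracking when each node's last heartbeat arrived and declaring the node dead after a timeout. A minimal sketch (node names and the timeout value are arbitrary; a real monitor would also consider how to distinguish a dead node from a dead network):

```python
import time

class HeartbeatMonitor:
    """Declare a node dead if no heartbeat arrives within `timeout` seconds."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}               # node name -> time of last heartbeat

    def heartbeat(self, node):
        """Called whenever a heartbeat message arrives from `node`."""
        self.last_seen[node] = time.monotonic()

    def is_alive(self, node):
        seen = self.last_seen.get(node)
        return seen is not None and time.monotonic() - seen < self.timeout

mon = HeartbeatMonitor(timeout=0.2)       # short timeout for the demo
mon.heartbeat("nodeA")
alive_now = mon.is_alive("nodeA")         # heartbeat just arrived
time.sleep(0.3)
alive_later = mon.is_alive("nodeA")       # heartbeat has timed out
```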

HA cluster software

The goal of clustering software is to hide the complexity of clustering from applications, application programmers, and end users. The first is difficult, the next easier, and the last much easier to accomplish. For example, a cluster-aware database can partition databases and tables across nodes transparently and will know how to divide queries across nodes. Cluster-aware applications can also fix deficiencies in the underlying system's ability to provide failover by managing write-ahead logs and detecting restart. Most vendors of clustering software provide an API to enable developers to design such cluster-aware applications.

At the operating system level, clustering software aids in providing a single-system image. Applications should be allowed to run in an environment where resources can be accessed in a device-independent manner and the application is not aware of what machine it is using or whether resources are local or remote. To an administrator, the cluster should be managed as one logical entity. For example, a ps command on a Data General cluster shows all processes throughout the cluster, and an exec system call runs a process on a machine selected (by certain parameters) by the cluster manager.

Some key components that the collection of software should support to enable high availability and failover include:

IP address takeover
When a system takes over for one that died, it will often need the ability to listen not just on its own IP address but also on that of a dead machine that it took over.
Checkpointing
An application may be provided with an interface that causes its entire memory state to be saved. Upon starting up, an application can check whether a checkpointed image is present and should be loaded.
Configuration database
The software should know what services need to be running and what systems they are allowed to run on.
Failover manager
In the event that an application or a node fails, the failover manager should discover this and restart the application on a working node.
Membership manager
Keeps track of which machines are currently alive in the cluster and which are dead.
Global update manager
Propagates configuration changes to all nodes in the cluster (implementing an atomic multicast).
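How the configuration database, membership information, and failover manager fit together can be sketched as follows (the service names, node names, and first-fit placement policy are all hypothetical; a real failover manager would also weigh node load):

```python
# Hypothetical configuration database: which nodes each service may run on.
config = {
    "webserver": ["node1", "node2", "node3"],
    "database":  ["node2", "node3"],
}

# Current placement of services, as tracked by the cluster.
placement = {"webserver": "node1", "database": "node2"}

def fail_over(dead_node, alive_nodes):
    """Failover manager: when the membership manager reports `dead_node`
    gone, restart its services on an allowed surviving node (first fit)."""
    for service, node in placement.items():
        if node == dead_node:
            candidates = [n for n in config[service] if n in alive_nodes]
            if candidates:
                placement[service] = candidates[0]  # restart service there

# node1 dies; the membership manager reports node2 and node3 still alive.
fail_over("node1", alive_nodes={"node2", "node3"})
```

After the call, the webserver has been reassigned to node2 while the database, unaffected by the failure, stays where it was.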

Sample HA cluster server architectures

Sample cluster systems are IBM's HACMP, HP Serviceguard for Linux, Sun Microsystems' Solaris Cluster, and the Microsoft Windows Server 2003 Enterprise Edition Server Cluster.

Stratus RADIO

As an example of a high-end cluster architecture, we will look at the Stratus RADIO (Reliable Architecture for Distributed I/O) system. It is a rack mounted collection of 6x24 compute and storage nodes. Each node rack has two communication nodes and redundant power supplies.

A compute node is a dual processor (PC architecture) with on-board memory, small swap disks, and a network interface. It runs Unix or NT and communicates with others via TCP/IP. Each compute node has its own IP address.

Disk nodes use standard disk technology but run a non-standard operating system that acts as a front-end to the network. It allows RADIO nodes to share disks, simulate n-way multiported disks, and provides software functionality for disk mirroring.

Internal communication amongst compute and storage nodes is via fast ethernet or ATM. Network adapter nodes allow the interconnection of multiple RADIO systems as well as interconnection to other systems.

An internal management network monitors the status of nodes and provides the system administrator with control of the operator interface to the PCs comprising the cluster. Nodes going off-line and on-line are detected within milliseconds via hardware support. Every component in the system can be hot-swapped. Both software and hardware can be upgraded without disrupting continuous service. The system has no single point of failure.

Microsoft Windows 2003 Cluster Server

The Microsoft Windows 2003 Cluster Server architecture evolved from the Windows 2000 and Windows NT Cluster Server. The system software is structured into three layers (there are many more components; this is a basic list):

Top tier
Cluster abstractions:
  • Failover manager
  • Resource monitor
  • Global update
Middle tier
Provides distributed operations:
  • Cluster registry
  • Quorum
  • Membership
Bottom tier
Windows 2003 and drivers, modified to support:
  • Dynamic creation and deletion of network names and addresses
  • Modification of the file system to enable closing open files during disk drive dismounts
  • Modification of the I/O subsystem to enable sharing disks and volume sets among multiple nodes

The Microsoft Windows 2003 Cluster Server runs on a network of conventional PCs. The system provides cold failover and does not support the migration of running applications. It uses a shared-nothing model where storage resources and other cluster devices are owned by only one machine. External storage may be SCSI or SCSI over fibre channel.

A key aspect in the architecture is that of name abstraction. An application can be shut down on one machine and restarted on another – there is no physical dependency on the name or IP address of any machine. A quorum device keeps track of who's in charge. A SCSI disk is typically designated to act as the quorum device. It provides several functions:

  • provides arbitration and knowledge of who's in charge at any time
  • arbitrates and provides a place for doing checkpoints
  • stores configuration information (via logs)

The SCSI-based quorum device may also be used to support the heartbeat of the cluster. Suppose the network connection between two machines, A and B, goes down. B can use the quorum device to determine whether A is still alive by performing a low-level SCSI reset, waiting for A to re-establish its disk connection and, if that fails, timing out and taking charge.

The global update mechanism allows for the propagation of global updates in the system to all nodes. All nodes must commit to the update and a rollback takes place if any node fails. Every operation in the system is tightly synchronized with atomic broadcasts.
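The all-or-nothing behavior can be sketched as follows (a toy in-process model with made-up node objects; a real implementation propagates the update with an atomic broadcast protocol over the network):

```python
class Node:
    """Toy cluster node holding one replicated configuration value."""
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.value, self._old = None, None

    def apply(self, value):
        """Try to commit an update; an unhealthy node fails to commit."""
        if not self.healthy:
            return False
        self._old, self.value = self.value, value
        return True

    def rollback(self):
        """Undo the last committed update."""
        self.value = self._old

def global_update(nodes, value):
    """All-or-nothing propagation: apply the update on every node; if any
    node fails to commit, roll back the nodes already updated."""
    updated = []
    for node in nodes:
        if not node.apply(value):
            for done in updated:
                done.rollback()
            return False
        updated.append(node)
    return True

good = [Node("a"), Node("b")]
ok = global_update(good, "v2")            # every node commits

mixed = [Node("a"), Node("b", healthy=False)]
ok2 = global_update(mixed, "v2")          # node b fails; node a rolls back
```

Either every node ends up with the new value or none does, which is the invariant the global update manager must preserve.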

Applications and outside entities do not use any aspects of the internal namespace (names, IP addresses).

Load balancing and fault tolerance

A very common need for scalability is not running parallel applications but rather running multiple instances of a specific server, for instance a web server. The need here is for load balancing. There are several reasons for doing this:

  • load balancing
  • failover (requests will go to surviving servers)
  • planned outage management (maintenance, hardware/software upgrades)

Load balancing with REDIRECT

A very simple system can be assembled to provide load balancing and/or fault tolerance for a set of web servers. The HTTP protocol provides a REDIRECT error code, where a client is told to access a web page from somewhere else. We can take advantage of this part of the protocol and have the client make requests to a server proxy. This server proxy will then select one of the available servers and send the requesting client a REDIRECT message identifying that particular server. This forces the client to make a second request, this time to that server.
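A sketch of the proxy's logic (the server pool is hypothetical, and only the response construction is shown; a real deployment would run this inside an HTTP server, e.g., as an Apache module):

```python
import itertools

# Hypothetical pool of real servers sitting behind the redirecting proxy.
SERVERS = ["http://www1.example.com",
           "http://www2.example.com",
           "http://www3.example.com"]
_next_server = itertools.cycle(SERVERS)   # simple round-robin selection

def redirect_response(path):
    """Build the HTTP response the proxy returns: a 302 pointing the
    client at one of the real servers."""
    target = next(_next_server) + path
    return ("HTTP/1.1 302 Found\r\n"
            "Location: " + target + "\r\n"
            "\r\n")

r1 = redirect_response("/index.html")     # sends the client to www1
r2 = redirect_response("/index.html")     # next client goes to www2
```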

The advantage of this technique is that it is trivial to implement: nothing more than a simple Apache module. Once a client receives a REDIRECT, successive requests are sent to the same server. This is important in environments where sessions need to be preserved; a different server may not know the contents of your shopping cart.

The disadvantage of this technique is that, if users bookmark the site, they will bookmark the redirected server. Should that server be down in the future, the user may think the entire site is inaccessible. Also, the redirected URL lacks the aesthetics of the original URL.

Load balancing with software load balancers

Software on a server can perform the task of load balancing without sending messages back to the client to issue another request. IBM's Interactive Network Dispatcher software is an example of this type of software. One load balancer receives all incoming requests. It balances the load using a number of weights and measures and supports load balancing among specific services (HTTP, FTP, SSL, NNTP, POP3, SMTP, telnet). Upon determining the destination system, it forwards the request to that machine, but rewrites the source address to be that of the original machine so that replies do not go back through the load balancer. This works in environments where client-to-server communication has lower bandwidth than server-to-client communication (the web is a good example of this). A similar product is also available from Microsoft as part of the Windows 2000 clustering solution.

To perform this kind of request forwarding, the operating system was modified to support routing TCP and UDP requests. Each server is capable of accepting IP packets addressed to both its own and the dispatcher's address. When the dispatcher receives a packet that needs to be forwarded, it changes the MAC (ethernet) address of the message to that of the appropriate server and sends it back onto the network. The IP packet itself is not modified and is now received by the load-balanced server that owns that particular MAC address.

Load balancing with routers

Routers have been getting smarter in recent years, providing features such as increased sophistication in firewalling and packet filtering. A particularly interesting router in the domain of fault tolerance and load balancing is the Cisco LocalDirector series. This family of routers load balances traffic across multiple servers, allowing one to mix operating systems, hardware, and applications.

Two popular hardware load balancers are Cisco's CSM and F5's Big-IP. The Cisco CSM (Content Switching Module) is an option for Cisco's Catalyst 6500 Series Switch or 7600 Series Router.

Both of these allow one to assign one or more virtual addresses to one or more real (internal) addresses. When an outside request comes in, the requested address (destination address) is mapped to a real internal address of a selected server. Special assignments can be made per port (e.g. telnet destination).

Since these are routers and are operating-system independent, the capabilities for making decisions on directing traffic are limited to what is visible in the network traffic (rather than querying the load on a particular machine). Several choices are available for load balancing:

  • pick the machine with the least number of TCP connections
  • factor in weights when selecting machines (this allows one to make one machine preferable to another, all other things being equal)
  • pick machines round-robin
  • pick the fastest machine (where speed is measured by response times to SYN packets in requesting TCP connections)
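The selection policies in this list can be sketched as simple functions over per-server statistics (the server names, connection counts, and weights below are made up; a real router tracks these counters itself):

```python
import random

def least_connections(servers):
    """Pick the server with the fewest active TCP connections."""
    return min(servers, key=lambda s: s["connections"])

def weighted_choice(servers):
    """Weighted random selection: a higher weight draws proportionally
    more traffic, all other things being equal."""
    return random.choices(servers, weights=[s["weight"] for s in servers])[0]

def round_robin(servers, counter):
    """Rotate through the servers in order; `counter` is the request number."""
    return servers[counter % len(servers)]

servers = [
    {"name": "web1", "connections": 12, "weight": 1},
    {"name": "web2", "connections": 3,  "weight": 4},
    {"name": "web3", "connections": 7,  "weight": 2},
]

pick = least_connections(servers)   # web2: fewest active connections
rr = [round_robin(servers, i)["name"] for i in range(4)]
```

The fastest-machine policy would be similar: keep a per-server moving average of SYN response times and take the minimum over that instead of the connection count.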


References

  • Engineering a Beowulf-style Computer Cluster, Robert G. Brown, Duke University Physics Department, 2004.

  • Beowulf Project Page.

  • Globus Toolkit Version 4: Software for Service-Oriented Systems, I. Foster. IFIP International Conference on Network and Parallel Computing, Springer-Verlag LNCS 3779, pp. 2–13, 2005.

  • A Globus Toolkit Primer. I. Foster, 2005.

  • LocalDirector Student Guide, version 1.5.6, Cisco Systems.

  • Building Secure and Reliable Network Applications, Kenneth P. Birman, 1996 Manning Publications Co.

  • Microsoft Windows 2000 Advanced Server; Windows 2000 Clustering Technologies: Cluster Service Architecture, Microsoft Corporation.

  • Network and Internetwork Security, William Stallings, 1995 Prentice Hall.

  • Distributed Operating Systems, Andrew Tanenbaum, 1995 Prentice Hall, pp. 212-222.