Chapter 10. FailSafe Operations

The IRIS FailSafe software uses redundant servers to provide a high-availability system. The servers run Video Server Toolkit (VST) and communicate over a special serial connection as well as a private Ethernet network; they can share clips and index filesystems.

When IRIS FailSafe detects a failure on the primary server, it transfers the filesystems and processing functions to the secondary server.

To use FailSafe, make sure that the subsystems vst_eoe.sw.fsmon and vst_eoe.sw.failsafe are installed.

IRIS FailSafe is discussed in detail in the IRIS FailSafe Administrator's Guide and the IRIS FailSafe Programmer's Guide.

Hardware Configuration

IRIS FailSafe is a software product that enables a pair of servers to be used in a redundant configuration, as shown in Figure 10-1. The servers are configured with the VST software and share the VST filesystems. If the primary server fails for any reason, the secondary server mounts the shared filesystems that hold the clips and their index files. Then, using its local VST, the secondary server is ready to play the clips.

In an IRIS FailSafe configuration, the operating system on each server is configured for the VST. The disks on each server store the operating system, and the shared RAIDs store the clips and index filesystems. A RAID is shared by physically attaching it to an Emulex LH5000 Digital Fibre Hub and connecting the two servers to two other ports on the same hub. The two servers, along with the shared RAID(s), form an IRIS FailSafe cluster.

The servers also share a public IP alias, or name, that users use to connect to the servers. If a server fails, the backup server takes over the shared RAIDs as well as this IP alias, so users continue to connect through the same alias. To the user, a failover looks the same as a server that crashed and was rebooted very quickly.
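Because a failover looks to the user like a very fast reboot, a client that addresses the shared alias can usually recover by retrying the connection. The following is a minimal sketch of that idea, not part of VST or IRIS FailSafe: the function name, the retry counts, and the delays are assumptions chosen for illustration, and the host and port must be supplied by the caller because the text does not state them.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, delay_seconds=2.0, timeout_seconds=3.0):
    """Open a TCP connection to the shared alias, retrying while it is unreachable.

    Hypothetical helper: returns a connected socket, or raises ConnectionError
    once every attempt has failed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return socket.create_connection((host, port), timeout=timeout_seconds)
        except OSError as exc:
            # The alias may be moving to the other server; wait and try again.
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise ConnectionError(
        f"could not reach {host}:{port} after {attempts} attempts"
    ) from last_error
```

The caller always connects to the alias rather than to an individual server's address, so the same code works whichever server currently owns the shared services.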

The servers use a special serial connection to communicate. When the backup server detects a problem with the active server, IRIS FailSafe unmounts the filesystems from the active server and automounts them on the secondary server. VST on the secondary server detects this action and adds the clips to its tables so that it is ready to play them.

Figure 10-1. VST IRIS FailSafe Configuration


Shared Resources

The primary and secondary servers can share up to four Ciprico 7000 RAID systems using an Emulex LH5000 Digital Fibre Hub, as shown in Figure 10-1. They also share an IP alias, which always points to the currently active server.

Server Failures

A failure is declared if any one of the following events is detected on the currently active server:

  • Power failure

  • Public network card failure

  • Operating system crash

  • Unmounting of shared filesystems

When a failure is detected, the secondary server attempts to reboot the primary server using the serial cable. The secondary server also takes over the shared services, the shared RAIDs, and the IP alias. VST detects the newly mounted filesystem using fsmon and loads the clips.

Troubleshooting IRIS FailSafe

In an IRIS FailSafe configuration, the primary and secondary servers use the states shown in Table 10-1.

Table 10-1. Primary and Secondary Server States

| Event | Primary Server Status | Primary VST Status | Secondary Server Status | Secondary VST Status | Owner of Shared Services |
|---|---|---|---|---|---|
| Normal operation | Normal | Running | Normal | Running | Primary |
| Power off primary server | Cannot be determined | Not running | Degraded | Running | Secondary |
| Power on primary server | Controlled-failback | Running | Degraded | Running | Secondary |
| ha_admin -rf (after power off on primary server, ha_admin executes on primary server) | Normal | Running | Normal | Running | Primary |
| Power off secondary server (before execution of ha_admin -rf) | Degraded | Running | Cannot be determined | Not running | Primary |
| Power on secondary server | Degraded | Running | Controlled-failback | Running | Primary |
| ha_admin -rf (after power off on secondary server, ha_admin executes on secondary server) | Normal | Running | Normal | Running | Primary |
| Unmount shared filesystem (Primary) | Standby | Running | Degraded | Running | Secondary |
| Unmount shared filesystem (Secondary) | Standby | Running | Degraded | Running | Neither |
| ha_admin -rf (after unmounting, ha_admin executes on primary server) | Normal | Running | Normal | Running | Primary |
| Disconnect serial connection | Normal | Running | Normal | Running | Primary |
| ha_admin -m start backup | Normal | Running | Normal | Running | Primary |
| ha_admin -fs primary (executed on secondary server) | Standby | Running | Degraded | Running | Secondary |
| ha_admin -rf primary (after ha_admin -fs primary has finished executing) | Normal | Running | Normal | Running | Primary |
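When troubleshooting, it can help to compare the states you observe with the row of Table 10-1 that matches the last event. The sketch below encodes a few rows of the table as data and reports which fields differ. The dictionary keys, the ClusterState type, and the mismatches function are hypothetical names introduced for illustration; only the state values come from the table.

```python
from typing import NamedTuple


class ClusterState(NamedTuple):
    """One row of Table 10-1, without the event column."""
    primary_status: str
    primary_vst: str
    secondary_status: str
    secondary_vst: str
    owner: str


# A subset of Table 10-1; the short event keys are illustrative labels only.
EXPECTED_STATES = {
    "normal_operation": ClusterState(
        "Normal", "Running", "Normal", "Running", "Primary"),
    "power_off_primary": ClusterState(
        "Cannot be determined", "Not running", "Degraded", "Running", "Secondary"),
    "power_on_primary": ClusterState(
        "Controlled-failback", "Running", "Degraded", "Running", "Secondary"),
    "power_off_secondary": ClusterState(
        "Degraded", "Running", "Cannot be determined", "Not running", "Primary"),
    "unmount_shared_fs_on_primary": ClusterState(
        "Standby", "Running", "Degraded", "Running", "Secondary"),
}


def mismatches(event, observed):
    """Return the names of the fields where observed differs from the table row."""
    expected = EXPECTED_STATES[event]
    return [
        field
        for field in ClusterState._fields
        if getattr(expected, field).lower() != getattr(observed, field).lower()
    ]
```

For example, calling mismatches("normal_operation", ...) with a state whose secondary server reports Degraded returns a list containing only "secondary_status", which points at the column to investigate.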


Monitoring

To check the server availability status, use the ha_admin -a command.

To check the status of the shared filesystems, use the df command on either server.

To check that the IP alias and the network interfaces are set up correctly, use the netstat -i command.
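The three checks above can be collected into a single pass. The following is a rough sketch that assumes Python 3.7 or later is available on the machine where the checks run; the STATUS_COMMANDS table and the run_status_checks function are hypothetical names. It uses only the commands and flags named in this section and does not interpret their output, because the output format is not described here.

```python
import shutil
import subprocess

# Commands named in this section; no other flags are assumed.
STATUS_COMMANDS = {
    "server availability": ["ha_admin", "-a"],
    "shared filesystems": ["df"],
    "network interfaces and alias": ["netstat", "-i"],
}


def run_status_checks(commands=None):
    """Run each command and return a dict of label -> output text or a note.

    A command that is not on PATH is skipped rather than treated as an error,
    so the sketch can be tried on a machine without IRIS FailSafe installed.
    """
    if commands is None:
        commands = STATUS_COMMANDS
    results = {}
    for label, argv in commands.items():
        if shutil.which(argv[0]) is None:
            results[label] = f"skipped: {argv[0]} not found on PATH"
            continue
        completed = subprocess.run(argv, capture_output=True, text=True, check=False)
        if completed.returncode == 0:
            results[label] = completed.stdout
        else:
            results[label] = (
                f"failed with exit code {completed.returncode}: "
                f"{completed.stderr.strip()}"
            )
    return results


if __name__ == "__main__":
    for label, text in run_status_checks().items():
        print(f"== {label} ==")
        print(text)
```

Running the checks on both servers and comparing the results gives a quick view of which server currently holds the shared filesystems and the IP alias.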

Use the IRIS FailSafe command-line interface to manage the Video Server Toolkit IRIS FailSafe system. For details, see the IRIS FailSafe Administrator's Guide.