This chapter contains procedures to follow to prepare for and recover from a major disk crash. The four different types of disk crash situations in this chapter are summarized below:
The primary disk, which contains the operating system and NetWorker binaries, is damaged. This can apply to a client system or a NetWorker server.
The secondary disk, which contains other filesystems, is damaged. This can apply to a client system or a NetWorker server.
The NetWorker server disk, which contains the online indexes (the /nsr filesystem), is damaged. You have to recover the indexes before using NetWorker to recover any filesystems.
The NetWorker fileserver was destroyed. You have to recover everything to a new NetWorker server.
If a primary disk suffers a head crash, you may need to replace the disk, boot from mini-root, format and partition the disk, reinstall the operating system and NetWorker binaries, and then recover the affected filesystems. In this case, before using NetWorker to recover the data to the disk, consult the system administration manuals you used to set up your fileserver for the first time.
If a secondary disk suffers a head crash, its recovery procedure is simpler, since you do not have to reinstall the operating system and NetWorker binaries.
The ultimate disaster for a system is to lose all the files on its disk. Most sites back up their fileservers daily as a preventive measure. If a primary system disk suffers a crash, you can rebuild its filesystems with NetWorker, after you reinstall the operating system (if necessary).
If the filesystem or disk that holds /nsr on the NetWorker server (or any part of /nsr that is linked to another location) is destroyed, the recovery procedure involves an extra step: you must recover the online indexes for the server as well as the server filesystems. The /nsr directory on the server contains one index for each client, including an index for the server as a client of itself.
If your NetWorker server was destroyed (in a fire, for example) you need to replace it with another machine. You may do this as long as you do the following:
Name the replacement server with the same hostname as the original NetWorker server, if possible.
Reinstall NetWorker using the same directory locations for the online indexes as in the original installation.
Reregister the new NetWorker server.
Once you understand the disaster recovery procedures, work out a disaster recovery plan for your site. If possible, test your ability to recover from a disaster at your site.
If you set up your network and enabled NetWorker for scheduled network-wide backups, you are well prepared for a disaster. Every time NetWorker backs up a group of clients, it also backs up all the online indexes for those clients. This backup includes the file index, media database, and NetWorker configuration files for the server itself (which contain entries for the client indexes, should these require recovery). The media database, NetWorker configuration files, and part of the server index are saved to a special save set named bootstrap. The save set identification numbers (ssid) for all recent bootstraps are sent to a default printer, providing a hard copy for your records.
Silicon Graphics recommends that you take these additional precautionary steps to help you recover from a possible crash:
Keep a file containing hard copies of the bootstrap records. Place these daily sheets of paper in a three-ring binder or a file folder.
Make a hard copy record of the disks, partition sizes, and mount points for the server and any clients that have a local hard disk. This information makes a future recovery much smoother for you.
Save your enabler code certificate and the authorization code for the NetWorker server.
NetWorker sends a record of the bootstrap save sets to your default printer, so you have a piece of paper with the dates, locations, and save set ID numbers needed for disaster recovery.
If you ever need to recover the server online indexes, the information on this piece of paper can save you a great deal of time. Save this information in a safe place.
The information sent to the printer looks similar to this:
December 13 03:35 1996  server's bootstrap information  Page 1

date      time      level  ssid   file  record  volume
12/12/96  18:23:23  full   12096  10    0       vol.001
12/13/96   3:34:46  9      12099  13    0       vol.001
NetWorker prints all the bootstrap save sets for the past month. The bootstrap save set may span more than one backup volume. The file and record numbers are used to find the associated save set quickly.
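If you also keep a text copy of the bootstrap report (an assumption; the guide describes only the printed copy), a short shell sketch can pull out the values that a later index recovery asks for. The function name latest_bootstrap is hypothetical, and the sketch assumes that data rows start with a date, use the column order shown in the printout above, and are listed oldest first.

```shell
# Hypothetical helper: read a saved bootstrap report on stdin and print
# the ssid, file, record, and volume of the last (most recent) entry.
# Assumes data rows start with a date such as 12/13/96, oldest first.
latest_bootstrap() {
  awk '
    $1 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ {
      ssid = $4; file = $5; record = $6; volume = $7
    }
    END {
      if (ssid != "")
        printf "ssid=%s file=%s record=%s volume=%s\n", ssid, file, record, volume
    }'
}

# Example report, copied from the printout above.
report='date time level ssid file record volume
12/12/96 18:23:23 full 12096 10 0 vol.001
12/13/96 3:34:46 9 12099 13 0 vol.001'

printf '%s\n' "$report" | latest_bootstrap
```

The sketch only reads text; it does not contact the NetWorker server, so the paper copy remains the record to rely on if the server disk is lost.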
You can also manually back up the NetWorker server indexes by using the savegrp -O command. Using this command also sends the bootstrap information to a printer. For example:
# savegrp -O -c servername
To use the savegrp command, you must be root on the NetWorker server.
Use the disk information command (df) to find out how the NetWorker server disks are partitioned and mounted. Use the disk volume header command (prtvtoc), or the logical volume manager (xlv_mgr) to print disk partitioning information. Do the same for any NetWorker clients that have local hard disks. (On old SunOS® systems, use the dkinfo command to show disk partitioning. On AIX® systems, use the lslv command to show disk partitioning.)
Do this for each system NetWorker backs up, unless the systems are consistent in disk and filesystem layout. Print and file this information in case you ever have to recover from a disk crash.
For example, the df information looks similar to this:
% df -k
Filesystem          Type  kbytes    use       avail    %use  Mounted on
/dev/root           xfs   39287     29074     10213    75    /
/dev/dsk/xlv/user2  xfs   3826740   3735108   91632    98    /usr
/dev/dsk/xlv/e      xfs   35224840  34471480  753360   98    /e
/dev/dsk/xlv/d      xfs   1953920   1708676   245244   88    /d
/dev/dsk/xlv/g      xfs   35222600  32929604  2292996  94    /g
/dev/dsk/xlv/b      xfs   35228840  31598192  3630648  90    /b
/dev/dsk/xlv/h      xfs   35224840  33678664  1546176  96    /h
The following prtvtoc command example gives you information about how a root disk is partitioned for an IRIX system. The device name is the “raw” device corresponding to the device name used for the output from the df command.
# prtvtoc
Printing label for root disk

* /dev/rdsk/dks0d1s0 (bootfile "/unix")
* 512 bytes/sector
* 104 sectors/track
* 5 tracks/cylinder
* 5 spare blocks/cylinder
* 4019 cylinders
* 5 cylinders occupied by header
* 4014 accessible cylinders
*
* No space unallocated to partitions

Partition  Type    Fs   Start: sec (cyl)   Size: sec (cyl)   Mount
0          xfs     yes     2575 (   5)     1905500 (3700)    /
1          raw          1908075 (3705)      161710 ( 314)
8          volhdr             0 (   0)        2575 (   5)
10         volume             0 (   0)     2069785 (4019)
If a disk is destroyed in a head crash, you will be able to restore it and recover the filesystems to their original state, using the hard copy information from these disk information commands. At a minimum, you need to have partitions large enough to hold all the recovered data.
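One way to make this record-keeping routine is a small script that writes the layout to a file you can print and file. The following is a minimal sketch, not part of NetWorker: the function name record_disk_layout and the output path are hypothetical, and prtvtoc is run without arguments as in the IRIX example above (other systems may require a device argument, so the call is skipped when the command is not installed).

```shell
# Hypothetical helper: write a dated snapshot of the filesystem layout to a
# file that can be printed and filed. prtvtoc is run only if it is installed.
record_disk_layout() {
  out_file=$1
  {
    echo "Disk layout for $(uname -n) on $(date)"
    echo
    echo "== df -k =="
    df -k
    if command -v prtvtoc >/dev/null 2>&1; then
      echo
      echo "== prtvtoc =="
      prtvtoc || echo "prtvtoc failed; run it by hand with a device name"
    fi
  } > "$out_file"
}

# Assumed location for the snapshot; pick any path that suits your site.
layout_file="${TMPDIR:-/tmp}/disk-layout.txt"
record_disk_layout "$layout_file"
```

Run the script on the server and on each client with local disks, then print the resulting files and keep them with the bootstrap records.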
You have a choice between using the save set recover feature or the normal recovery procedure to recover filesystems after disk failure. This section describes the advantages of each method.
During backup, NetWorker multiplexes different filesystems simultaneously to the backup media. By recovering multiple filesystems at the same time, NetWorker has the opportunity to de-multiplex the filesystem save sets from the same backup volume in parallel, thus reading each backup volume only once. This is why you should recover multiple filesystems at the same time if it is practical for you to do so. You can do this by marking different save set points (using the save set recover feature) or different mount points (using the normal recover procedure).
With release 4.1 and higher, the NetWorker server can recover several save sets from the same backup volume in parallel, eliminating the need to read the same backup volume several times during a recovery.
An advantage to using the save set recover feature is that you spend less time browsing and marking filesystems. With normal recovery, an entry for each individual file is accessed in the NetWorker file index to reconstruct an accurate view of the filesystem. It takes time to “pick” the most recent versions of files from the tape. With save set recover, individual file browsing is bypassed and entire save sets are recovered in one step. If the browse policy has expired, save set recover is the only way to recover a file using the GUI. For information about using save set recover, see Chapter 5, “Recovering and Cloning Save Sets.”
|Caution: Whenever you have to recover the primary disk (for example, root), do so in single-user mode from the system console, not multi-user from the X Window system. Before starting this procedure, make sure all filesystems are mounted.|
There are two ways to use the save set recover feature:
Run the nwadmin program and choose the Recover command from the Save Set menu. This opens the Save Set Recover window.
Run recover with the -S ssid option from the system prompt. This is the way to recover if you are not using the X Window System. See the mminfo(1M) reference page for instructions on how to find the save sets you want to recover. See the recover(1M) reference page for instructions on using the recover -S command.
To use the Save Set Recover window for recovering an entire filesystem, you need to pay attention to the backup levels of the save sets you are marking for a recovery. For each save set you wish to recover, you need the last full backup followed by the most recent level backups of the save set to bring the system back to the state it was in before the system crash.
A level 0 backup is a full backup. Levels after a full are numbered in ascending order, or marked with the letter i for incremental saves. The lower the number, the more data the backup covers: a level 3 backs up more data than a level 5. The following illustration shows several backups over time. The bars represent the level backups since the last full backup.
The arrows in the illustration point to the save sets you would need in order to recover your filesystems from the disk crash.
By using save set recover, it is possible to recover files that were deleted between backups. For example, if file F existed at time a, but was deleted prior to time b, file F will be recovered. This may require more disk space than is available if large files or a significant number of deleted files are recovered. Normal recover does not recover the deleted files, and the filesystem will be restored exactly as it was at time d with no superfluous files recovered.
For example, suppose you wish to recover the save set /xlv0 in the Save Set Recover window, as shown in Figure A-2.
The Instances display shows the backup history of /xlv0. To recover /xlv0, you need to mark the most recent full backup of /xlv0 and the most recent level 9 to restore /xlv0 to the state it was in before the system crash. In other words, you need to mark the appropriate save set levels for each filesystem you are recovering. As indicated above, you may recover files you were not expecting and do not need. Just delete these files.
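The selection rule can be sketched in shell. This is an illustration of the reasoning, not how NetWorker itself chooses save sets. It assumes the usual level semantics (a numbered level saves files changed since the most recent backup at a lower level; an incremental saves files changed since the previous backup of any level). The function name needed_save_sets, the input format, and the ssid values are hypothetical.

```shell
# Hypothetical helper: print the ssids to mark for a full filesystem recovery.
# Input (stdin): one backup per line, oldest first, as "<level> <ssid>",
# where level is "full", a number from 1 to 9, or "incr".
needed_save_sets() {
  awk '
    function rank(lv) { return (lv == "full") ? 0 : (lv == "incr") ? 10 : lv + 0 }
    { level[NR] = $1; ssid[NR] = $2 }
    END {
      lowest = 11
      n = 0
      # Walk from the newest backup back to the last full.
      for (i = NR; i >= 1; i--) {
        r = rank(level[i])
        # Keep a backup if it is at a lower level than everything kept so
        # far, or if it is one of a trailing run of incrementals.
        if (r < lowest || (r == 10 && lowest == 10)) {
          keep[++n] = ssid[i]
          lowest = r
        }
        if (r == 0) break
      }
      # Print oldest first, the order in which the save sets are recovered.
      for (j = n; j >= 1; j--) print keep[j]
    }'
}

history='full 4001
5 4002
9 4003
9 4004'

# Prints 4001, 4002, and 4004, one per line.
printf '%s\n' "$history" | needed_save_sets
```

In this sample history the older level 9 (4003) is skipped because the later level 9 already covers everything changed since the level 5.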
Normally during disaster recovery, you want to force the recover program to overwrite existing files. Using save set recover, this is even more important, since the same file may be recovered multiple times with each successive version coming from a later save set. Since each file in each save set is recovered with save set recover, files or directories previously deleted or renamed (with the mv command) between backups are still recovered. These files or directories need to be manually deleted. You may even run out of space during the recovery if there are too many instances of previously deleted files or directories recovered by save set recover.
The save set recover feature reads each save set in its entirety during recovery. If you have to recover many save sets, the “normal” recover may work better than save set recover. However, if you need only the last full backup of a save set, save set recover is a better method. By contrast, if you use the Recover command from nwrecover, you will recover filesystems from the last backup before the crash, and NetWorker will read the minimum amount of tape to recover the files. This method has these disadvantages:
You spend more time browsing and marking files for recovery, since NetWorker needs to find an entry for each file in the index to add all the files in one filesystem.
You could run out of swap space if the list of filesystems to recover grows too large.
In summary, save set recover allows faster disaster recovery in these cases:
If you can determine the correct save sets to recover.
If there are only a few save sets to recover for each filesystem.
If you are not recovering the NetWorker indexes.
If recovering extra files is acceptable (not an issue with full backups).
Use the normal recovery procedure in these cases:
If you cannot determine which save sets to recover.
If you need to recover files from many backups to restore the filesystem acceptably.
If you are recovering the NetWorker indexes.
If recovering extra files is unacceptable, or you are recovering incremental backups.
This section provides an example of how to recover a secondary disk using NetWorker. The example may apply to either a NetWorker server or a client.
It is impossible to provide step-by-step instructions on how to recover your system from a disaster, since every site is unique. The examples in this chapter are designed to give you general principles of recovering a primary or secondary disk, and to demonstrate the procedure. They are meant to be examples only, not instructions.
The example in Figure A-3 assumes the primary disk is still operational, so the system is up and can run NetWorker. However, a secondary disk is lost due to a head crash.
If the disk is damaged, replace it with a new disk of the same type and size as the old one. You will need a disk large enough to hold all the filesystems to be recovered.
Install the replacement disk. Make sure the operating system and kernel recognize the new disk.
Label and partition the new disk so that you can recover the filesystems. Use the hard copy of the disk information to remember how large each partition was. (See “File the Disk Information”.)
|Note: If you do not have this information, look at the /etc/fstab file to find out how the disk was divided into filesystems; because you still have the primary disk, this file is available. You will have to guess how much space to give each partition.|
Make filesystems for each raw partition that you are going to recover and mount the block partition, consulting the hard copy of the df output. (NetWorker does not initialize filesystems; it recovers data into existing ones.)
|Caution: Using the mkfs command destroys the disk contents. Be sure this disk is really destroyed before using mkfs.|
For example, for an IRIX system:
# mkfs /dev/rdsk/dks0d1s0
...
# mount /dev/dsk/dks0d1s0 /
After creating and mounting all the filesystems on the replacement disk, use the NetWorker save set recover feature or the NetWorker Recover window to recover the files. For more detailed information about save set recover, see Chapter 5, “Recovering and Cloning Save Sets.”
In the example shown in Figure A-4, a disk with the operating system or the NetWorker binaries is damaged. You have to first reinstall and reboot the operating system. If necessary, remake and mount all filesystems. Next, reinstall NetWorker from the original software distribution, so you can recover the data.
|Caution: Anytime you have to recover the primary disk (for example, root), do so in single-user mode from the system console, not multiuser from the X Window system. Before starting this procedure, make sure all filesystems are mounted.|
After replacing the damaged disk, format it and reinstall the operating system, using the original software distribution. Consult the documentation included with the operating system for instructions on how to reinstall the system. On a NetWorker server, reinstall the NetWorker software using the same paths and directory locations. On NetWorker clients you need access only to the NetWorker binaries. You may run NetWorker from the nsr_extract directory or NFS mount the binaries from another system running NetWorker.
Using the original partition information, make filesystems for each partition you are going to recover and mount them. If a filesystem is already created and mounted, you do not need to do this. For example, if you reinstalled / and /usr, you do not need to re-create them.
Next, reinstall NetWorker from the distribution media. Since you may have several different versions of NetWorker on the media, use the version number that matches the version of NetWorker you were running before you lost your disk. For example, if you were running NetWorker version 4.2, install version 4.2. The version must be equal to or later than the version used for the backups. Refer to the installation chapters in this guide for detailed instructions.
|Tip: You do not need to reload the license enablers. If the /nsr partition is present, the license enabler is still loaded. If the /nsr partition was also lost, recovering /nsr (described in the next section) recovers your license enablers.|
There is one other way to access the NetWorker binaries if these were located within one of the damaged filesystems. If there is another system of the same type as the system being recovered on the network which has NetWorker running, you may NFS mount the NetWorker binaries on the damaged system.
# mount venus:/usr/etc /mnt
# /mnt/recover -s server -q
recover> add /
recover> force
recover> recover
With the operating system and NetWorker back in place, you are ready to start recovering the remainder of the data lost from the disk.
Use the nwrecover -s server_name or the recover -s server_name command to recover each filesystem on the disk being recovered. (The mmrecov command recovers only the server's online indexes, not filesystems.)
|Caution: Always reboot a system after recovering its primary disk.|
This section addresses the case where the /nsr filesystem on a NetWorker server is lost due to a disk crash. The /nsr filesystem contains the indexes that hold the necessary information to recover the NetWorker clients.
If the server loses its operating system and NetWorker programs, they must be reinstalled first. (See “Recovering a Primary Disk”.)
The next important step is to recover the server indexes from the backup media, using the mmrecov command. The mmrecov command asks you for the bootstrap save set identification number (ssid). If you followed the procedure recommended to prepare for a disk crash, you have a piece of paper with the name of the backup media you need and the bootstrap ssid.
In the following example, ssid “12102” is the most recent bootstrap backup:
December 13 03:35 1996  server's bootstrap information  Page 1

date      time      level  ssid   file  record  volume
12/12/96  18:23:23  full   12096  10    0       vol.001
12/13/96   3:34:46  9      12099  13    0       vol.001
12/14/96   3:35:18  9      12102  16    0       vol.001
If you do not have this piece of paper, you can still recover the indexes by finding the ssid using the scanner -B command. (See “Finding the Bootstrap Save Set ID”.)
You may need more than one backup media to recover the server indexes. During the recovery, you can use the nsrwatch command or open the NetWorker Administrator window to watch for pending messages requesting backup media.
With the OS and NetWorker in place, recover the indexes from backup media:
Find the printout of the bootstrap save set ID information. You need it for the next two steps.
Retrieve the backup media that contains the most recent backup named bootstrap and load it into the server's device.
Use the mmrecov command to extract the contents of the bootstrap backup.
# mmrecov
$pathname/mmrecov: Using mars as server
NOTICE: mmrecov is used to recover the NetWorker server's on-line file and
media indexes from media (backup tapes or disks) when either of the server's
on-line file or media index has been lost or damaged. Note that this command
will OVERWRITE the server's existing on-line file and media indexes. mmrecov
is not used to recover NetWorker clients' on-line indexes; normal recover
procedures may be used for this purpose. See the mmrecov(8) and nsr_crash(8)
man pages for more details.

What is the name of the tape drive to use [/dev/rmt/tps0d3]? [Enter]
Enter the latest bootstrap save set id : 12102
Enter starting file number (if known) : 16
Enter starting record number (if known) : 0
Please insert the volume on which save set id 1148869870 started
into /dev/rmt/tps0d3. When you have done this, press <RETURN>: [Enter]
Scanning /dev/nrst8 for save set 1148869870; this may take a while...
scanner: scanning 8mm 5GB tape mars.006 on /dev/nrst8
uasm -r /nsr/res/nsr.res
uasm -r /nsr/res/nsrjb.res
uasm -r /nsr/res/
nsrmmdbasm -r /nsr/mm/mmvolume
/nsr/mm/mmvolume: file exists, overwriting
uasm -r /nsr/index/mars/
nsrindexasm -r /nsr/index/mars/db
scanner: ssid 1148869870: scan complete
scanner: ssid 1148869870: 31 KB, 10 files
nsr/index/mars/db: file exists, overwriting
uasm -r /nsr/index/
uasm -r /nsr/mm/
uasm -r /nsr/
uasm -r /
mars: 31 records recovered, 0 discarded.
nsrindexasm: Building indexes for mars...
8mm 5GB tape mars.006 mounted on /dev/nrst8, write protected
The bootstrap entry in the on-line index for mars has been recovered.
The complete index is now being reconstructed from the various partial
indexes which were saved during the normal saves for this server.
mars# nsrwatch
nsrindexasm: Pursuing index pieces of nsr/index/mars/db from mars.
Recovering 2 files into their original locations
Total estimated disk space needed for recover is 11 MB
Requesting 2 files, this may take a while...
nsrindexasm -r ./db
merging with existing mars index
mars: 25711 records recovered, 0 discarded.
nsrindexasm -r ./db
merging with existing mars index
Received 2 files from NSR server `mars'
mars: 733 records recovered, 0 discarded.
nsrindexasm: Building indexes for mars...
nsrindexasm: Suppressing duplicate entries in mars - 50 duplicates discarded.
The on-line index for `mars' is now fully recovered.
Notice how in the example above, the shell prompt appears during the bootstrap recovery. You can use the NetWorker Administrator window, or commands such as nsrwatch to gauge the progress of the server. Open a new terminal window to monitor the recovery so that the mmrecov output does not display on top of the nsrwatch output.
The mmrecov command also recovers the /nsr/res directory, which NetWorker uses to store configuration information such as the list of NetWorker clients and registration information. Unlike the indexes, the contents of this directory cannot be reliably overwritten while NetWorker is running. Therefore, the mmrecov program recovers the /nsr/res directory as /nsr/res.R.
To complete the recovery of the /nsr/res directory, shut down NetWorker, move the recovered /nsr/res directory into its original location, and then restart NetWorker.
Complete these steps after mmrecov has finished and this final message appears:
The on-line index for `server' is now fully recovered.
Shut down the NetWorker server using the networker stop command:
# /etc/init.d/networker stop
Save the original /nsr/res directory and move the recovered version into the correct location:
# cd /nsr
# mv res res.orig
# mv res.R res
Restart the NetWorker server. When it restarts, the server uses the recovered configuration data:
# cd /
# nsrd
Once you have verified that the NetWorker configuration is correct, you can remove the /nsr/res.orig directory:
# rm -r /nsr/res.orig
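The directory swap in the steps above can be rehearsed before you touch the live server. The following sketch wraps the two mv commands in a function that takes the NetWorker directory as an argument, so it can be tried on a scratch copy. The function name swap_recovered_res and the scratch path are hypothetical; the function does not stop or start any daemons, so NetWorker must already be shut down when you run it against /nsr.

```shell
# Hypothetical helper: swap the recovered res.R directory into place.
# Pass the NetWorker directory (normally /nsr). NetWorker must already be
# stopped; this function does not stop or start any daemons.
swap_recovered_res() {
  nsr_dir=$1
  if [ ! -d "$nsr_dir/res.R" ]; then
    echo "no recovered directory at $nsr_dir/res.R" >&2
    return 1
  fi
  if [ -e "$nsr_dir/res.orig" ]; then
    echo "$nsr_dir/res.orig already exists; not overwriting it" >&2
    return 1
  fi
  mv "$nsr_dir/res" "$nsr_dir/res.orig" &&
    mv "$nsr_dir/res.R" "$nsr_dir/res"
}

# Rehearsal on a scratch copy, not the live /nsr directory.
scratch="${TMPDIR:-/tmp}/nsr-rehearsal.$$"
mkdir -p "$scratch/res" "$scratch/res.R"
echo old > "$scratch/res/nsr.res"
echo recovered > "$scratch/res.R/nsr.res"
swap_recovered_res "$scratch"
cat "$scratch/res/nsr.res"
```

The check for an existing res.orig is a design choice: it keeps a second run from overwriting the only saved copy of the original configuration.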
To recover other filesystems, see “Recovering a Secondary Disk”.
If you did not file a hard copy of the bootstrap information, you can still find the save set ID of the most recent bootstrap by using the scanner -B command. Follow these steps:
Place the most recent media used for scheduled backups in the server device.
Use the scanner -B command as root to locate the most recent bootstrap on the backup media. Replace the device name in the example with the name of your device:
# scanner -B /dev/rmt/tps0d3nrnsv
The scanner -B command displays information on the latest bootstrap save set found on the backup volume, as illustrated below:
scanner: scanning 4mm tape ishi001 on /dev/rmt/tps0d4nrnsv
scanner: done with 4mm tape ishi001...
Bootstrap 12099 of 12/13/96 3:34:46 located on volume v.001, file 13.
If the bootstrap date looks reasonable, run the mmrecov command and supply the save set ID and file number displayed by the scanner command. Otherwise, use another backup volume to try to find a more recent bootstrap.
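If you capture the scanner output in a file, the two values that mmrecov needs can be extracted with a few lines of awk. This is a sketch only: the function name bootstrap_from_scanner is hypothetical, and it assumes the "Bootstrap ssid of date time located on volume name, file n." wording shown in the sample output above.

```shell
# Hypothetical helper: pull the ssid and file number out of saved
# scanner -B output, assuming the "Bootstrap <ssid> of ... file <n>."
# wording shown in the sample above.
bootstrap_from_scanner() {
  awk '
    /Bootstrap/ {
      for (i = 1; i < NF; i++) {
        if ($i == "Bootstrap") ssid = $(i + 1)
        if ($i == "file") { file = $(i + 1); sub(/\.$/, "", file) }
      }
    }
    END { if (ssid != "") printf "ssid=%s file=%s\n", ssid, file }'
}

sample='scanner: scanning 4mm tape ishi001 on /dev/rmt/tps0d4nrnsv
scanner: done with 4mm tape ishi001...
Bootstrap 12099 of 12/13/96 3:34:46 located on volume v.001, file 13.'

printf '%s\n' "$sample" | bootstrap_from_scanner
```

If the output contains several Bootstrap lines, the sketch reports the last one it reads.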
This section describes the case where your old NetWorker server is beyond repair, and you wish to recover NetWorker to a new server.
|Tip: Things may go more smoothly if the new NetWorker server has the same hostname as the old NetWorker server. However, since licensing is based on host ID, you need to relicense the NetWorker server anyhow.|
Follow these steps to transfer from one NetWorker server to another:
Install the NetWorker software from the original distribution media on the new server.
|Note: If you have a jukebox, do not start the NetWorker daemons. Refer to the instructions in this guide for jukebox device driver installation and testing.|
Find the printout of the bootstrap save set ID information from the old server. You need it for the next two steps.
Retrieve the backup media that contains the most recent backup named bootstrap, and load it into the new server device.
Add the name of the old server as an alias for the new server.
Use the mmrecov command to extract the contents of the bootstrap backup.
The NetWorker daemons should start up on the new server and display the following messages:
new_server syslog: NetWorker Server: (notice) started
new_server syslog: NetWorker Registration: (notice) invalid auth codes detected.
new_server syslog:
new_server syslog: The auth codes for the following licenses enablers are now invalid.
new_server syslog: The cause may be that you moved the NetWorker server to a new computer.
new_server syslog: You must reregister these enablers within 15 days to obtain new codes.
new_server syslog:
new_server syslog: License enabler #xxxxxx-xxxxxx-xxxxxx (NetWorker)
Relicense your new NetWorker server. After moving NetWorker from one system to another, you have 15 days to relicense the new server with Silicon Graphics. Follow the instructions in the IRIX NetWorker Installation Guide.
Follow these steps after successfully moving your server:
Verify that all the clients are included in the scheduled backups.
Recover the client filesystems and indexes.
Use the Recover window to make sure all the client indexes are visible and therefore “recoverable.”
Back up the indexes on the new server by doing a full backup of the new server as soon as possible.
This section provides a description of how to use autochangers during disaster recovery. The procedures are similar, but employ jukebox-specific commands.
Follow these steps to perform disaster recovery with a jukebox:
Read the disaster recovery procedures listed in the nsr_crash(1M) reference page. Perform all steps up to the point where you issue the mmrecov command. If only one volume is needed to recover the NetWorker file indexes, follow the instructions for nsr_crash.
Run the jb_config command to add the jukebox.
Issue the command nsrjb -H. This resets the jukebox for operation. If any volumes are loaded in the media drives, they are moved back to a slot. This operation may take a few minutes to finish.
Using the instructions in nsr_crash, determine which volume(s) are needed to retrieve the NetWorker file indexes. Load these volume(s) into the jukebox. See the nsr_crash(1M) reference page for additional information.
Issue the command nsrjb -I. This reinventories the jukebox. To speed up this process, issue the command with the -S flag and list the slots where you placed the required backup volumes. You must list the slots in order (for example, nsrjb -I -S 1-3). If you want to inventory slots out of order (for example, 2, 1, and 4), you must issue the nsrjb -I -S command separately for each slot. All the volumes currently loaded in the jukebox are marked with an asterisk, since there is no media database.
Run mmrecov. When prompted, provide the device name you want to use, the last save set ID, and (if known) the file number and record number.
Load the first volume that mmrecov requests into the first drive in the autochanger. From a new window, issue the command:
nsrjb -l -n -S slot -f device_name
where slot is the slot in which the first volume is located and device_name is the pathname of the first drive.
Issue the nsrjb -u command after the indexes have been recovered.
Shut down NetWorker using the /etc/init.d/networker stop command.
The mmrecov command creates a directory named /nsr/res.R and recovers three files into it: nsr.res, nsrjb.res, and nsrla.res. When mmrecov completes the recovery, copy the three files contained in /nsr/res.R into the /nsr/res directory.
Restart NetWorker daemons using the /etc/init.d/networker start command.
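For the reinventory step in the procedure above, slots that are out of order need one nsrjb -I -S command each. The following sketch prints those commands for review instead of running them; the function name inventory_commands is hypothetical, and only the nsrjb -I -S form given in the procedure is used.

```shell
# Hypothetical helper: print one "nsrjb -I -S <slot>" command per slot so
# that slots listed out of order are inventoried separately.
# Nothing is executed; review the output before running it on the server.
inventory_commands() {
  for slot in "$@"; do
    printf 'nsrjb -I -S %s\n' "$slot"
  done
}

inventory_commands 2 1 4
```

Once the list looks right, the printed commands can be typed at the server prompt or piped to sh.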
The following steps summarize what you need to do if a primary or secondary disk is damaged, destroying the filesystems of a NetWorker server or client:
Reload and boot the system, if the operating system is lost, using the same hostname and disk partitioning, if possible.
Replace the damaged disk(s), if necessary; format, partition, make new filesystems, and mount filesystems with the same names as those that were damaged.
Reinstall or access the NetWorker binaries if they were lost. On a NetWorker server, reinstall them from the distribution media. On a NetWorker client, extract them or temporarily NFS mount them over the network.
Use mmrecov to recover the indexes for the NetWorker server if the /nsr directory is destroyed on the server.
Recover the lost filesystems by using the normal recover process or the save set recover feature. Recover the client indexes, using the normal recover process.