Chapter 3. XVM Best Practices

This section discusses the following:

Configuration Best Practices

This section discusses the following:

Use XVM Configuration Tools Appropriately

Do not attempt to make simultaneous configuration changes using the xvm(8) command-line interface (CLI) and the XVM Manager graphical user interface (GUI). Use one tool at a time.

The GUI provides a convenient display of XVM components. If you are using XVM in a cluster environment, you should use the XVM Manager or the CXFS Manager GUI to see your progress and to avoid adding or removing CXFS nodes too quickly. After defining a CXFS node, you should wait for it to appear in the view area before adding another node. After defining a cluster, you should wait for it to appear before you add nodes to it. If you make changes too quickly, errors can occur. For more information, see Chapter 10, “XVM Manager GUI”.

Use One Slice Per LUN

Use LUNs of equal size and one slice for each LUN. Assemble stripes from the slices or mirrors to get the performance characteristics you need. If more capacity is needed, use a concat at the top level to allow an arbitrary number of volume elements to be connected.

Use the Default Data Subvolume

For most configurations, the default data subvolume is the only subvolume required.

Explicitly Name Volume Elements

In order to create a name that will persist across reboots, SGI recommends that you explicitly name a volume when you create a volume element or an empty volume. This will reduce the risk of data loss.


Note: If you do not name an empty volume when you create it, you must specify that the system generate a temporary name; this practice is not recommended for general configuration.

If you have already created volumes that you did not name explicitly, you can use the change command to assign these volumes permanent names. See “Modifying Volume Elements with the change Command” in Chapter 5.

Use an Appropriate Stripe Unit Size and Alignment

If the LUNs you are striping are RAID devices, it is advantageous to have your XVM stripes line up on the RAID stripe boundaries and be a multiple of the RAID stripe size. This allows not only all of your XVM LUNs to transfer in parallel, but also all of the disks in the RAID to be accessed in parallel, in units of the same size. By default, a stripe unit must be a multiple of 32 512-byte blocks.

To get the best performance, make the XVM stripe unit be the same as the RAID stripe width, so that the RAID gets an optimally sized chunk of data to store.
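The multiple-of-32 rule can be checked before a stripe is created. The following is a minimal sketch; stripe_unit_ok is a hypothetical helper name, not part of the xvm CLI, and it checks only the default rule stated above:

```shell
# Hypothetical helper (not an xvm command): succeed if a stripe unit,
# given in 512-byte blocks, is a positive multiple of 32 blocks.
stripe_unit_ok() {
    _su_unit=$1
    [ "$_su_unit" -gt 0 ] && [ $(( _su_unit % 32 )) -eq 0 ]
}

stripe_unit_ok 128 && echo "128 blocks: multiple of 32"
stripe_unit_ok 100 || echo "100 blocks: not a multiple of 32"
```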

Use the single-partition method for GPT label creation. For example (line breaks added for readability):

# parted /dev/disk/by-path/pci-0000:03:00.0-fc-0x20360080e5232098:0x0000000000000000 "unit s mklabel gpt"
# parted /dev/disk/by-path/pci-0000:03:00.0-fc-0x20360080e5232098:0x0000000000000000 \
"unit s mkpart primary xfs 34 -8193"

For LUNs with a power-of-2 number of data drives, XVM will align by default. For non-power-of-2, use the following formula to calculate the alignment:

RAID_segment_size_in_KiB * number_of_data_drives * 2 

This value will be in 512-byte basic blocks. For example, using a 6+1 RAID with a 64KiB-segment LUN:

64KiB * 6 * 2 = 768 blocks

The slice command would be:

# xvm slice -all -align 768 phys/is5500_lun0
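The alignment formula above can be scripted so that the value is not computed by hand. The following is a minimal sketch; xvm_align_blocks is a hypothetical helper name, not part of the xvm CLI:

```shell
# Hypothetical helper (not an xvm command): compute the alignment value
# in 512-byte basic blocks from the RAID geometry.
xvm_align_blocks() {
    _ab_segment_kib=$1    # RAID segment size in KiB
    _ab_data_drives=$2    # number of data drives (parity drives excluded)
    # 1 KiB equals two 512-byte blocks, hence the factor of 2
    echo $(( _ab_segment_kib * _ab_data_drives * 2 ))
}

xvm_align_blocks 64 6     # 6+1 RAID with 64KiB segments: prints 768
```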

Use Mirrors Efficiently

This section discusses the following:

Use Mirrors with Identical Components

To make the best use of space, create mirrors with components of identical size. If the components are not identical, there will be unused space in the larger components.
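To estimate how much space would go unused, you can compare the component sizes before creating the mirror. The following is a minimal sketch, assuming that the usable size of a mirror is the size of its smallest component; mirror_unused is a hypothetical helper name and the sizes are in blocks:

```shell
# Hypothetical helper (not an xvm command). ASSUMPTION: a mirror is only
# as large as its smallest component, so the excess in every larger
# component is unused.
mirror_unused() {
    _mu_min=$1
    for _mu_size in "$@"; do
        if [ "$_mu_size" -lt "$_mu_min" ]; then
            _mu_min=$_mu_size
        fi
    done
    _mu_unused=0
    for _mu_size in "$@"; do
        _mu_unused=$(( _mu_unused + _mu_size - _mu_min ))
    done
    echo "usable=$_mu_min unused=$_mu_unused"
}

mirror_unused 1000 1500   # prints usable=1000 unused=500
```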

Place Mirrors at the Bottom of the XVM Topology Tree

Place mirrors at the lowest possible level (below any stripes) to maximize independence and minimize synchronization times during revive operations. This provides the redundancy of a mirror with improved performance.

Avoid Unnecessary Revives

For large mirror components, the revive process may take a long time. Consider the following:

  • For a new mirror that does not need mirroring at creation, use the -clean option to specify that the mirror will revive at reboot but not at creation. An example is creating a new filesystem; because everything will be written before it will be read, there is no need for a revive beforehand.

  • For a new mirror that you will use for scratch filesystems (such as /tmp) that will never need to be synchronized, use the -norevive option to specify that the mirror will never revive.

Set Mirror Revive Resources Properly

You should set the xvm_max_revive_rsc and xvm_max_revive_threads XVM system-tunable kernel parameters appropriately for your site's mirror revive performance requirements. Increasing xvm_max_revive_rsc will increase the data throughput per thread, and increasing xvm_max_revive_threads will increase the number of parallel I/O processes used in reviving. Decreasing the resources causes less interference with an open filesystem at the cost of increasing the total time to revive the data.

As a general guideline:

  • Increase the revive resource tunable values if you want to revive as quickly as possible and do not mind the performance impact on normal I/O processes

  • Decrease the revive resource tunable values if you want to have a smaller impact on a particular filesystem

See Appendix A, “XVM System Tunable Kernel Parameters”.

Restripe Efficiently

To restripe an existing volume without deleting the slices, do the following:

  • Use the following option to delete the XVM structure other than the slices:

    delete -nonslice

  • Use the stripe command as needed to create new stripes

Categorize Portions of a Complex Volume

Sometimes it may be useful to categorize volume elements by name. For instance, you may want to name a portion of a volume fast so that you can search for volumes that have fast stripe objects. For example:

xvm:cluster> stripe -volname myvol -vename fast0 -unit 128 slice/lucys2 slice/rickys0 slice/ethyls0 slice/freds0
</dev/cxvm/myvol> stripe/fast0
xvm:cluster> stripe -volname myvol -vename fast1 -unit 128 slice/lucys3 slice/rickys1 slice/ethyls1 slice/freds1
</dev/cxvm/myvol> stripe/fast1

When you name the stripes as in the preceding example, you can use a wildcard to show both fast0 and fast1 stripes:

xvm:cluster> show -topology stripe/fast*
stripe/fast0         23705088 online
     slice/lucys2                5926340 online
     slice/rickys0               5926340 online
     slice/ethyls0               5926340 online
     slice/freds0                5926340 online
stripe/fast1         23705088 online
     slice/lucys3                5926340 online
     slice/rickys1               5926340 online
     slice/ethyls1               5926340 online
     slice/freds1                5926340 online

Use Automatic Probing Wisely

After you label a device, XVM will automatically probe it and any unlabeled disks in order to locate alternate paths. Disks are also probed when the system is booted and when you explicitly execute an XVM probe command. In most cases, this default behavior is appropriate.

However, a probe can be slow, and it is necessary to probe a newly-labeled device only once. For example, if you are executing a series of individual label commands, you might wish to disable automatic probing using one of the methods described in “Controlling Automatic Probing with the label and set Commands” in Chapter 5.

Do Not Create Slices Within the RAID Exclusion Zone

Due to some performance issues in XVM, SGI strongly recommends that the XVM slice be completely outside the RAID exclusion zone. By default, the xvm CLI label command places the user data area entirely outside of the exclusion zones, so you do not need to consider the exclusion zones in allocating slices. See “Making an XVM Volume Using a GPT Label” in Chapter 7.


Note: If it is necessary to allow the user space to overlap the RAID exclusion zones, you can use the following command to override the default layout behavior:
xvm:cluster> label -use-exclusion-zones unlabeled_disk



XVM Configuration for Mixing SSD and HDD Media

Different types of media are appropriate for different uses:

  • Solid-state drive (SSD) media is appropriate for small latency-sensitive operations

  • Rotating hard-disk drive (HDD) media is appropriate for larger bandwidth- and capacity-intensive operations

The ibound mount option specifies where the filesystem places the inodes, which lets you use SSD media for a filesystem's inodes and HDD media for the file data. In this case, you should create an XVM volume that concatenates a slice of SSD media with HDD media and then use the ibound mount option to restrict filesystem inode allocation to the fast SSD media at the beginning of the XVM volume.


Note: The ibound mount option implies inode32 behavior and is therefore incompatible with the inode64 mount option. Behavior of the inode32 mount option is not affected.

For more information, see the chapter about enhanced XFS extensions in XFS Administrator Guide.

Administrative Best Practices

This section discusses the following:

Save the XVM Configuration Before Making Changes

It is possible for XVM labels to become corrupted. Therefore, you should use the XVM dump command to make a copy of the configuration before making a change so that you can recover from potential problems introduced by the change. You should save the dump output into a filesystem other than the one being dumped.

Do Not Use a Given Disk for both XVM and Non-XVM

Do not use a given disk device as an XVM volume if you are already using it to mount a filesystem outside of XVM. XVM cannot always detect that a LUN is already in use by some other subsystem, so verify that a LUN is available before creating XVM physvols on it.

Specify Path Failover for Non-ALUA RAID

This section discusses the following for RAIDs that do not use the asymmetrical logical unit access (ALUA) feature:

Define the /etc/failover2.conf File on Every Node

If you use non-ALUA RAID and if you spread I/O across controllers, you should define the /etc/failover2.conf file to ensure that I/O is done efficiently and is directed to the path that you prefer; unnecessary switching between RAID controllers can degrade performance considerably. See Chapter 6, “XVM Path Failover”.

In a cluster configuration, be sure that the /etc/failover2.conf file is correct and consistent on every node in the cluster.

Do Not Use affinity or preferred Keywords for ALUA RAID

For RAID that supports the ALUA feature, you should not use the affinity or preferred keywords for normal operation. Those keywords can be used in a failover2.conf file to override the settings read from the RAID in order to work around a problem.

Set Nonzero affinity Values

You may find it useful to specify affinity values starting with affinity=1 and specify a nonzero value for all paths. This makes it easy to detect those paths that have not yet been configured because they are assigned the default of affinity=0. See “Set Appropriate affinity Values for Non-ALUA LUNs” in Chapter 6.

Periodically Check for Unassigned Paths

If you used the method recommended in “Set Nonzero affinity Values”, you should periodically examine the show -v output for new LUNs to find any that are unassigned. You may wish to write a script to perform this function and execute it via a cron(8) job.
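Such a script could be as simple as a filter over the show -v output. The following is a minimal sketch; unassigned_paths is a hypothetical helper name, and the script assumes that each path line in the output contains the text affinity=n, which you should verify against the output on your own system:

```shell
# Hypothetical filter (not an xvm command): read "show -v" output on
# standard input and print the lines still at the default affinity=0.
# ASSUMPTION: each path line contains the text "affinity=<n>".
unassigned_paths() {
    # Match affinity=0 exactly, not affinity=01 or similar
    grep -E 'affinity=0([^0-9]|$)'
}

# Sample input with made-up path names; only pathB is reported.
printf '%s\n' 'pathA affinity=1' 'pathB affinity=0' 'pathC affinity=10' |
    unassigned_paths
```

In a cron(8) job, you would pipe the real show -v output through the filter instead of the sample input and mail or log any lines it prints.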

Change Affinity Consistently Across the Cluster

If you change the affinity setting for a path in the cluster domain, you should include the -cluster option so that the setting is consistent across all nodes in the cluster. For example:

xvm:cluster> foswitch -cluster -setaffinity 2 -movepath phys/lun33

If you change the preferred path, you should include the -cluster option if the switch will move to a different affinity group. For example, suppose the following:

pathA affinity=1 preferred
pathB affinity=1
pathC affinity=2
pathD affinity=2

You could switch the preferred path to pathB for a single node in the cluster because it has the same affinity setting as pathA:

xvm:cluster> foswitch -preferred pathB

However, if you want to use pathC as the preferred path, you should include -cluster because it is part of a different affinity group:

xvm:cluster> foswitch -cluster -preferred pathC

Do Not Override ALUA RAID Settings via /etc/failover2.conf

Normally, you should not use the /etc/failover2.conf file to override the path settings provided automatically by a RAID that has the ALUA feature. If changes are required, you should make them via the ALUA RAID software.

Do Not Run fsck on Filesystems that Use XVM Devices

It is possible that XVM might not discover all devices associated with XVM volumes by the time that the filesystems listed in /etc/fstab are mounted, meaning that some volumes may not yet be complete at that point. If an fsck command is run on an XFS filesystem when XVM devices are undiscovered, the boot sequence may be suspended until the administrator provides input.

Therefore, for XFS filesystems listed in /etc/fstab that use XVM devices, you should set the fsck flag to 0. XVM includes a helper service that mounts all filesystems listed in /etc/fstab that use XVM devices at the time that XVM is started during the boot sequence.
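The fsck flag is the sixth field of an /etc/fstab entry. The following is a minimal sketch of a check that reports entries that still need to be changed; check_xvm_fstab is a hypothetical helper name, and the script assumes that XVM devices appear in the first field as /dev/lxvm/name, /dev/cxvm/name, or /dev/xvm-n (entries specified by label or UUID are not examined):

```shell
# Hypothetical helper: print fstab entries for XFS filesystems on XVM
# devices whose sixth field (the fsck pass number) is not 0.
# ASSUMPTION: XVM devices are named /dev/lxvm/..., /dev/cxvm/..., or
# /dev/xvm-N in the first field.
check_xvm_fstab() {
    awk '
        /^[[:space:]]*#/ { next }    # skip comment lines
        $3 == "xfs" && $1 ~ /^\/dev\/(lxvm\/|cxvm\/|xvm-)/ &&
            $6 != "" && $6 != 0 { print $1, $2, "fsck pass=" $6 }
    ' "${1:-/etc/fstab}"
}
```

Running check_xvm_fstab with no argument reads /etc/fstab; no output means that every matching entry already has the flag set to 0 (or omits it, which defaults to 0).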

Give Rather than Steal Ownership

You should only use the steal command when ownership cannot be changed by using the give command.

Unmount Filesystems Before Making XVM Topology Changes

You should unmount a filesystem before making XVM topology changes.

A child of an open volume element can only be detached if this will not cause the volume element to go offline. The only child that can be detached without putting the volume element offline is a mirror leg that is not the last leg of that mirror.

In particular, normally you should not execute the following xvm commands on an open volume element:

change disable

Unmount a CXFS Filesystem Before Growing It

You should unmount a filesystem from the CXFS cluster and mount it locally before growing it. For more information, see the section about growing filesystems in the CXFS 7 Administrator Guide for SGI InfiniteStorage.

Do Not Use an XVM Volume as a Dump Device

You should not use an XVM volume as a dump device.

Use xvm Commands Carefully in Scripts

If you write scripts that use xvm(8) configuration commands, be aware that running multiple commands in quick sequence can cause the commands to fail. An XVM device newly created by one command is held open for an interval by Linux utility programs; subsequent xvm commands in the script cannot use the device and therefore fail. The following is a common error in this situation:

error creating  item:  operation will cause the ve's
subvolume to go offline
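One possible way for a script to tolerate this condition is to retry a failed command after a short delay. This is a suggestion rather than a documented XVM procedure, and it is suitable only for commands that are safe to repeat. The following is a minimal sketch; retry is a hypothetical helper name:

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts. Returns 0 on the first success and 1 if every
# attempt fails.
retry() {
    _rt_attempts=$1
    _rt_delay=$2
    shift 2
    _rt_n=1
    while ! "$@"; do
        if [ "$_rt_n" -ge "$_rt_attempts" ]; then
            echo "giving up after $_rt_n attempts: $*" >&2
            return 1
        fi
        _rt_n=$(( _rt_n + 1 ))
        sleep "$_rt_delay"
    done
    return 0
}
```

A script would then prefix each xvm invocation with the helper, for example retry 5 2 followed by the xvm command line, where the attempt count and delay are site-specific guesses.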

Do Not Display I/O Activity in show Commands

For best performance in a production system, do not display I/O activity in the output of the xvm show command (that is, you should use the default setting of 0 for the pm_display_io_activity system tunable parameter, which disables this feature).

You should enable pm_display_io_activity only while you are performing configuration. See “pm_display_io_activity” in Appendix A.

Understand the Use of Symbolic Links and Device Names in Output

XVM uses symbolic links (in the format /dev/lxvm/localXFSname or /dev/cxvm/CXFSname) to the actual device names (in the format /dev/xvm-n). For example, the default ls(1) output will show the symbolic link:

# ls /dev/lxvm/align-tests0
/dev/lxvm/align-tests0

To view the link plus the actual device name, use the -l option:

# ls -l /dev/lxvm/align-tests0
lrwxrwxrwx 1 root root 8 Sep 30 13:41 /dev/lxvm/align-tests0 -> ../xvm-5

Commands such as mount(8) and df(1) show the actual device names. For example, to mount (using the symbolic link) and then show the mounted xfs filesystems:

# mount /dev/lxvm/align-tests0 mnt
# mount -t xfs
/dev/sda2 on / type xfs (rw)
/dev/xvm-5 on /root/mnt type xfs (rw)

To report on filesystem usage:

# df -k
Filesystem            1K-blocks      Used Available Use% Mounted on
/dev/sda2              20970644   6269112  14701532  30% /
udev                    4027496      1252   4026244   1% /dev
tmpfs                   4027496         0   4027496   0% /dev/shm
/dev/xvm-5             35802032     32928  35769104   1% /root/mnt

To get information by starting with a /dev/xvm-n device name, you can display information from the sysfs filesystem as follows:

# ls /dev/xvm-*
/dev/xvm-10  /dev/xvm-5  /dev/xvm-6  /dev/xvm-8
# cat /sys/block/xvm-5/vol-info/xvm-domain
lxvm
# cat /sys/block/xvm-5/vol-info/xvm-volname
align-tests0
# cat /sys/block/xvm-5/vol-info/xvm-svol-type
data
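The sysfs attributes shown above can be combined to map a device name back to its symbolic link. The following is a minimal sketch; xvm_link_name is a hypothetical helper name, and it assumes that the xvm-domain value is also the directory name under /dev, as it is for the lxvm example above:

```shell
# Hypothetical helper: print the /dev/<domain>/<volname> link name for
# an xvm-N device, using the vol-info attributes in sysfs.
# ASSUMPTION: the xvm-domain value matches the directory under /dev.
# The optional second argument overrides the sysfs root (for testing).
xvm_link_name() {
    _ln_dev=$1                    # for example, xvm-5
    _ln_root=${2:-/sys}
    _ln_info="$_ln_root/block/$_ln_dev/vol-info"
    [ -r "$_ln_info/xvm-domain" ] && [ -r "$_ln_info/xvm-volname" ] || return 1
    echo "/dev/$(cat "$_ln_info/xvm-domain")/$(cat "$_ln_info/xvm-volname")"
}
```

With the attribute values from the preceding example, xvm_link_name xvm-5 would print /dev/lxvm/align-tests0. The helper returns a nonzero status if the attributes are not readable.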