This chapter discusses best practices for client-only nodes.
Also see the best practices information in the CXFS Administration Guide for SGI InfiniteStorage.
CXFS may not give optimal performance under the following circumstances:
When distributed applications write to shared files that are memory-mapped.
If a client is used, SGI will only support an NFS or Samba export from a metadata server.
When extending large highly fragmented files. The metadata traffic when growing files with a large number of extents will increase as more extents are added to the file. The following I/O patterns will cause highly fragmented files:
Random writes to sparse files
Files generated with memory-mapped I/O
Writing files in an order other than linearly from beginning to end
Do the following to prevent highly fragmented files:
Create files with linear I/O from beginning to end
Use file preallocation to allocate space for a file before writing
Create filesystems with sparse files disabled (unwritten=0)
When access would be as slow with CXFS as with network filesystems, such as with the following:
Lots of metadata transfer. Metadata operations can take longer to complete through CXFS than on local filesystems. Metadata transaction examples include the following:
Opening and closing a file
Changing file size (usually extending a file)
Creating, renaming, and deleting files
Searching a directory
In addition, multiple processes on multiple hosts that are reading and writing the same file using buffered I/O can be slower when using CXFS than when using a local filesystem. This performance difference comes from maintaining coherency among the distributed file buffers; a write into a shared, buffered file will invalidate data (pertaining to that file) that is buffered in other hosts.
Also see “Functional Limitations and Considerations for Windows” in Chapter 9.
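The advice above about preallocating space and writing linearly can be sketched as follows. This is a minimal illustration rather than an SGI-supplied tool: the helper names `preallocate` and `write_linear` are hypothetical, and the sketch assumes a POSIX system where Python's `os.posix_fallocate` is available (it falls back to filling the file with zeros from start to end if it is not).

```python
import os

CHUNK = 1024 * 1024  # 1 MiB write size; an arbitrary choice for this sketch


def preallocate(path, size):
    """Reserve `size` bytes for `path` before any data is written (hypothetical helper)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        try:
            # Ask the filesystem to allocate the whole range up front.
            os.posix_fallocate(fd, 0, size)
        except (AttributeError, OSError):
            # Assumed fallback: write zeros linearly so that no holes remain.
            os.lseek(fd, 0, os.SEEK_SET)
            zeros = bytes(CHUNK)
            remaining = size
            while remaining > 0:
                remaining -= os.write(fd, zeros[:min(CHUNK, remaining)])
    finally:
        os.close(fd)


def write_linear(path, blocks):
    """Write blocks in order, from the beginning of the file toward the end."""
    with open(path, "r+b") as f:
        for block in blocks:
            f.write(block)
```

Calling `preallocate` once and then `write_linear` avoids the random-write and sparse-file patterns listed above; whether this reduces extent counts on a given filesystem should be verified on that system.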
Caution: It is critical that you understand these rules before attempting to configure a CXFS cluster.
The following hostname resolution rules and recommendations apply to all nodes:
The first node you define must be a server-capable administration node.
Hostnames cannot begin with an underscore (_) or include any whitespace characters.
The private network IP addresses on a running node in the cluster cannot be changed while CXFS services are active.
You must be able to communicate directly between every node in the cluster (including client-only nodes) using IP addresses and logical names, without routing.
A private network must be dedicated to be the heartbeat and control network. No other load is supported on this network.
The heartbeat and control network must be connected to all nodes, and all nodes must be configured to use the same subnet for that network.
If you change hostname resolution settings in the /etc/nsswitch.conf file after you have defined the first server-capable administration node (which creates the cluster database), you must recreate the cluster database.
Use the cxfs-config -check -ping command line on a server-capable administration node to confirm network connectivity. For more information, see CXFS Administration Guide for SGI InfiniteStorage.
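Two of the rules above (hostname syntax and a common subnet for the heartbeat and control network) lend themselves to a quick pre-check before defining nodes. The following is only a sketch: the function names are hypothetical, the prefix length is something you supply for your own network, and it does not replace the `cxfs-config -check -ping` check described above.

```python
import ipaddress


def hostname_ok(name):
    """Apply the two naming rules: no leading underscore, no whitespace."""
    if not name or name.startswith("_"):
        return False
    return not any(ch.isspace() for ch in name)


def same_subnet(addresses, prefix_len):
    """Return True if every address falls in one subnet of the given prefix length."""
    networks = {
        ipaddress.ip_interface(f"{addr}/{prefix_len}").network
        for addr in addresses
    }
    return len(networks) == 1
```

For example, `same_subnet(["10.0.10.1", "10.0.10.2"], 24)` is true, while adding an address from `10.0.11.0/24` makes it false.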
If there are any network issues on the private network, fix them before trying to use CXFS. Ensure that you understand the information in “Understand Hostname Resolution and Network Configuration Rules”.
When you install the CXFS software on the client-only node, you must modify certain system files. The network configuration is critical. Each node in the cluster must be able to communicate with every other node in the cluster by both logical name and IP address without going through any other network routing; proper name resolution is key. SGI recommends static routing.
You must use a private network for CXFS metadata traffic:
A private network is a requirement.
The private network is used for metadata traffic and should not be used for other kinds of traffic.
A stable private network is important for a stable CXFS cluster environment.
Two or more clusters should not share the same private network. A separate private network switch is required for each cluster.
The private network should contain at least a 100-Mbit network switch. A network hub is not supported and should not be used.
All cluster nodes should be on the same physical network segment (that is, no routers between hosts and the switch).
The private network must be configured as the highest priority network for the cluster. The public network may be configured as a lower priority network to be used by CXFS network failover in case of a failure in the private network.
A virtual local area network (VLAN) is not supported for a private network.
Use private network addresses as defined in RFC 1918: 10.x.x.x, 172.16.x.x through 172.31.x.x, or 192.168.x.x.
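To confirm that an address chosen for the private network really falls in an RFC 1918 block, a check such as the following may help. It is a sketch using Python's standard `ipaddress` module; the function name `is_rfc1918` is hypothetical. The three blocks are those reserved by RFC 1918: 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16.

```python
import ipaddress

# The three address blocks reserved for private use by RFC 1918.
RFC1918_BLOCKS = (
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
)


def is_rfc1918(address):
    """Return True if `address` lies inside one of the RFC 1918 blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in RFC1918_BLOCKS)
```

The explicit block list is used instead of `ip.is_private` because that property also covers other reserved ranges (such as loopback), which are not suitable here.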
You should define most nodes as client-only nodes and define just the nodes that may be used for CXFS metadata as server-capable administration nodes.
The advantage to using client-only nodes is that they do not keep a copy of the cluster database; they contact a server-capable administration node to get configuration information. It is easier and faster to keep the database synchronized on a small set of nodes, rather than on every node in the cluster. In addition, if there are issues, there will be a smaller set of nodes on which you must look for problems.
All nodes should run the same level of CXFS and the same level of operating system software, according to platform type. To support upgrading without having to take the whole cluster down, nodes can run different CXFS releases during the upgrade process. For details, see the platform-specific release notes and the information about rolling upgrades in CXFS Administration Guide for SGI InfiniteStorage.
I/O fencing is required on client-only nodes without reset capability in order to protect the data integrity of the filesystems in the cluster.
You should use the admin account when configuring I/O fencing. On a Brocade switch running 4.x.x.x or later firmware, modify the admin account to restrict it to a single telnet session. For details, see the CXFS Administration Guide for SGI InfiniteStorage.
You must keep the telnet port on the switch free at all times; do not perform a telnet to the switch and leave the session connected.
SGI recommends that you use a switched network of at least 100baseT.
You should isolate the power supply for the switch from the power supply for a node and its system controller. You should avoid any possible situation in which a node can continue running while both the switch and the system controller lose power. Avoiding this situation will prevent the possibility of a split-brain scenario.
You must put switches used for I/O fencing on a network other than the primary CXFS private network so that problems on the CXFS private network can be dealt with by the fencing process and thereby avoid data corruption issues. The network to which the switch is connected must be accessible by all server-capable administration nodes in the cluster.
SGI recommends that you always define a client-only node as the CXFS tiebreaker. (Server-capable administration nodes are not recommended as tiebreaker nodes.) This is most important when there is an even number of server-capable administration nodes.
The tiebreaker is of benefit in a cluster with an odd number of server-capable administration nodes when one of the server-capable administration nodes is removed from the cluster for maintenance (via a stop of CXFS services).
The following rules apply:
If exactly two server-capable administration nodes are configured and there are no client-only nodes, neither server-capable administration node should be set as the tiebreaker. (If one node was set as the tiebreaker and it failed, the other node would also shut down.)
If exactly two server-capable administration nodes are configured and there is at least one client-only node, you should specify the client-only node as a tiebreaker.
If one of the server-capable administration nodes is the CXFS tiebreaker in a two server-capable cluster, failure of that node or stopping the CXFS services on that node will result in a cluster-wide forced shutdown. Therefore SGI recommends that you use client-only nodes as tiebreakers so that either server could fail but the cluster would remain operational via the other server.
Setting a client-only node as the tiebreaker avoids the problem of multiple clusters being formed (also known as split-brain syndrome) while still allowing the cluster to continue if one of the metadata servers fails.
Setting a server-capable administration node as tiebreaker is recommended only when there are four or more server-capable administration nodes and no client-only nodes.
If there is an even number of servers and there is no tiebreaker set, the failure action hierarchy should not contain the shutdown option because there is no notification that a shutdown has occurred.
SGI recommends that you start CXFS services on the tiebreaker client after the metadata servers are all up and running, and before CXFS services are started on any other clients.
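The tiebreaker rules above can be condensed into a small decision function. This is a sketch of one reading of those rules, not a CXFS API: the function name and return values are hypothetical. The case of three server-capable administration nodes with no client-only nodes is not spelled out above, so returning "none" there is an assumption.

```python
def recommend_tiebreaker(server_capable, client_only):
    """Suggest which kind of node to use as the CXFS tiebreaker.

    server_capable: number of server-capable administration nodes
    client_only:    number of client-only nodes
    """
    if client_only >= 1:
        # A client-only tiebreaker is always the preferred choice.
        return "client-only"
    if server_capable >= 4:
        # Recommended only with four or more servers and no client-only nodes.
        return "server-capable"
    # Includes the two-server, no-client case: set no tiebreaker.
    return "none"
```

For instance, `recommend_tiebreaker(2, 0)` yields `"none"`, and `recommend_tiebreaker(2, 1)` yields `"client-only"`.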
Enable the forced unmount feature for CXFS filesystems, which is turned off by default. Normally, an unmount operation will fail if any process has an open file on the filesystem. However, a forced unmount allows the unmount to proceed regardless of whether the filesystem is still in use.
Many sites have found that enabling this feature improves the stability of their CXFS cluster, particularly in situations where the filesystem must be unmounted. For more information, see “Forced Unmount of CXFS Filesystems” in Chapter 10 and the CXFS Administration Guide for SGI InfiniteStorage.
Configure firewalls to allow CXFS traffic. See CXFS Administration Guide for SGI InfiniteStorage for CXFS port usage. (Preferred.)
Configure firewalls to allow all traffic on the CXFS private interfaces. This assumes that the public interface is not a backup metadata network.
Disable firewalls.
For more information, see your firewall documentation.
Do the following when upgrading the software:
Read the release notes when installing and/or upgrading CXFS. These notes contain useful information and caveats needed for a stable install/upgrade.
Do not make any other configuration changes to the cluster (such as adding new nodes or filesystems) until the upgrade of all nodes is complete and the cluster is running normally.
Each platform in a CXFS cluster has different issues; see the platform-specific chapters of this guide.
When shutting down, resetting, or restarting a CXFS client-only node, do not stop CXFS services on the node. (Stopping CXFS services is more intrusive on other nodes in the cluster because it updates the cluster database. Stopping CXFS services is appropriate only for a CXFS server-capable administration node.) Rather, let the CXFS shutdown scripts on the node stop CXFS when the client-only node is shut down or restarted.
SGI recommends that backups be done on the CXFS metadata server.
Do not run backups on a client node, because it causes heavy use of non-swappable kernel memory on the metadata server. During a backup, every inode on the filesystem is visited; if done from a client, it imposes a huge load on the metadata server. The metadata server may experience typical out-of-memory symptoms, and in the worst case can even become unresponsive or crash.
Because CXFS filesystems are considered as local on all nodes in the cluster, the nodes may generate excessive filesystem activity if they try to access the same filesystems simultaneously while running commands such as find or ls. You should build databases for rfind and GNU locate only on the metadata server.
On IRIX systems, the default root crontab on some platforms has the following find job that should be removed or disabled on all nodes (line breaks added here for readability):
    0 5 * * * /sbin/suattr -m -C CAP_MAC_READ,
    CAP_MAC_WRITE,CAP_DAC_WRITE,CAP_DAC_READ_SEARCH,CAP_DAC_EXECUTE=eip
    -c "find / -local -type f '(' -name core -o -name dead.letter ')'
    -atime +7 -mtime +7 -exec rm -f '{}' ';'"
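If you manage crontabs on many nodes, disabling such a job can be scripted. The following is a rough sketch only: the function name is hypothetical, and matching on the substring `find / -local` is an assumed heuristic that you should adapt to the actual entry on your systems. It comments out matching entries instead of deleting them, so the change is easy to review and reverse.

```python
def disable_find_jobs(crontab_text, marker="find / -local"):
    """Comment out active crontab entries that contain `marker`."""
    out = []
    for line in crontab_text.splitlines():
        stripped = line.lstrip()
        is_active = bool(stripped) and not stripped.startswith("#")
        if is_active and marker in line:
            out.append("# disabled for CXFS: " + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

The function only transforms text; reading the current crontab and installing the result are left to whatever procedure your site already uses.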
Do not use any filesystem defragmenter software. You can use the Linux xfs_fsr command, but only on a metadata server for the filesystem it acts upon.
Always contact SGI technical support before using xfs_repair on CXFS filesystems. Only use xfs_repair on metadata servers and only when you have verified that all other cluster nodes have unmounted the filesystem.
When using xfs_repair, make sure it is run only on a cleanly unmounted filesystem. If your filesystem has not been cleanly unmounted, there will be un-committed metadata transactions in the log, which xfs_repair will erase. This usually causes loss of some data and messages from xfs_repair that make the filesystem appear to be corrupted.
If you are running xfs_repair right after a system crash or a filesystem shutdown, your filesystem is likely to have a dirty log. To avoid data loss, you MUST mount and unmount the filesystem before running xfs_repair. It does not hurt anything to mount and unmount the filesystem locally, after CXFS has unmounted it, before xfs_repair is run.
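The ordering described above (mount and unmount locally to replay a possibly dirty log, then repair) can be written down as a command plan. This sketch only builds the list of commands and does not run anything; the function name, the device path, and the scratch mount point are hypothetical. It assumes you have already contacted SGI technical support and verified that every other cluster node has unmounted the filesystem.

```python
def repair_plan(device, scratch_mountpoint, log_may_be_dirty=True):
    """Return the ordered commands to run on the metadata server.

    Preconditions (not checked here): SGI support has been contacted and
    all other cluster nodes have unmounted the filesystem.
    """
    steps = []
    if log_may_be_dirty:
        # A local mount replays the uncommitted metadata transactions in the
        # log; the unmount that follows leaves the filesystem clean.
        steps.append(["mount", device, scratch_mountpoint])
        steps.append(["umount", scratch_mountpoint])
    steps.append(["xfs_repair", device])
    return steps
```

Printing the plan for review before executing each step by hand keeps a person in the loop, which suits an operation that can lose data if run in the wrong order.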
Disable CXFS before maintenance (perform a forced CXFS shutdown, stop the cxfs_client daemon, and disable cxfs_client from automatically restarting).
You can use the cxfscp(1) command to quickly copy large files (64 KB or larger) to and from a CXFS filesystem. It can be significantly faster than cp(1) on CXFS filesystems because it uses multiple threads and large direct I/Os to fully use the bandwidth to the storage hardware.
Files smaller than 64 KB do not benefit from large direct I/Os. For these files, cxfscp uses a separate thread using buffered I/O, similar to cp(1).
The cxfscp command is available on IRIX, SGI ProPack, Linux, and Windows platforms. However, some options are platform-specific, and other limitations apply. For more information and a complete list of options, see the cxfscp(1) man page.
To match up physical device names to their corresponding XVM physical volumes (physvols), use the following command:
    xvm show -v -top -ext vol/volname
In the output for this command, the information within the parentheses matches up the XVM pieces with the device name. For example (line breaks shown for readability):
    # xvm show -v -top -ext vol/test
    vol/test 0 online,open
        subvol/test/data 1142792192 online,open
            stripe/stripe0 1142792192 online,tempname,open (unit size:128)
                slice/cc_is4500-lun0-gpts0 142849024 online,open
                  (cc_is4500-lun0-gpt:/dev/xscsi/pci08.03.0/node200400a0b8119204/port4/lun0/disc)
                slice/cc_is4500-lun1-gpts0 142849024 online,open
                  (cc_is4500-lun1-gpt:/dev/xscsi/pci08.03.1/node200500a0b8119204/port1/lun1/disc)
                slice/cc_is4500-lun0-gpts1 142849024 online,open
                  (cc_is4500-lun0-gpt:/dev/xscsi/pci08.03.0/node200400a0b8119204/port4/lun0/disc)
                slice/cc_is4500-lun1-gpts1 142849024 online,open
                  (cc_is4500-lun1-gpt:/dev/xscsi/pci08.03.1/node200500a0b8119204/port1/lun1/disc)
                slice/cc_is4500-lun0-gpts2 142849024 online,open
                  (cc_is4500-lun0-gpt:/dev/xscsi/pci08.03.0/node200400a0b8119204/port4/lun0/disc)
                slice/cc_is4500-lun1-gpts2 142849024 online,open
                  (cc_is4500-lun1-gpt:/dev/xscsi/pci08.03.1/node200500a0b8119204/port1/lun1/disc)
                slice/cc_is4500-lun0-gpts3 142849024 online,open
                  (cc_is4500-lun0-gpt:/dev/xscsi/pci08.03.0/node200400a0b8119204/port4/lun0/disc)
                slice/cc_is4500-lun1-gpts3 142849024 online,open
                  (cc_is4500-lun1-gpt:/dev/xscsi/pci08.03.1/node200500a0b8119204/port1/lun1/disc)

Note: The xvm command on the Windows platform does not display the worldwide name (WWN). For more information about WWNs and Windows, see “XVM Failover V2 on Windows” in Chapter 9.
For more information about XVM physvols, see the XVM Volume Manager Administrator's Guide.
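When a volume has many slices, the parenthesized physvol and device information in the `xvm show` output shown earlier can be extracted with a short script. This is a sketch that assumes the output layout matches the example in this section (a `slice/...` entry followed by `(physvol:device-path)`); the function name is hypothetical and the pattern may need adjusting for other XVM releases or platforms.

```python
import re

# Matches "slice/NAME SIZE STATE (PHYSVOL:DEVICE)", allowing the parenthesized
# part to sit on the following line.
SLICE_RE = re.compile(
    r"slice/(?P<slice>\S+)\s+\d+\s+\S+\s*"
    r"\((?P<physvol>[^:()\s]+):(?P<device>[^()\s]+)\)"
)


def slices_to_devices(xvm_output):
    """Map each slice name to its (physvol, device path) pair."""
    return {
        m["slice"]: (m["physvol"], m["device"])
        for m in SLICE_RE.finditer(xvm_output)
    }
```

The stripe line's `(unit size:128)` annotation is ignored because the pattern only starts matching at a `slice/` entry.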