Chapter 6. Flexible File I/O

About FFIO

Flexible File I/O (FFIO) can improve the file I/O performance of existing applications without source code changes. The current executable remains unchanged. Knowledge of the source code is not required, but some understanding of how the application performs its I/O can help you interpret and tune FFIO results. To take advantage of FFIO, you only need to set some environment variables before running your application.

The FFIO subsystem allows you to define one or more additional I/O buffer caches for specific files to augment the Linux kernel I/O buffer cache. The FFIO subsystem then manages this buffer cache for you. In order to accomplish this, FFIO intercepts standard I/O calls such as open, read, and write, and replaces them with FFIO equivalent routines. These routines route I/O requests through the FFIO subsystem, which uses the user-defined FFIO buffer cache.

FFIO can bypass the Linux kernel I/O buffer cache by communicating with the disk subsystem via direct I/O. This bypass gives you precise control over cache I/O characteristics and allows for more efficient I/O requests. For example, doing direct I/O in large chunks (for example, 16 megabytes) allows the FFIO cache to amortize disk access. All file buffering occurs in user space when FFIO is used with direct I/O enabled. This differs from the Linux buffer cache mechanism, which requires a context switch in order to buffer data in kernel memory. Avoiding this kind of overhead helps FFIO to scale efficiently.

Another important distinction is that FFIO allows you to create an I/O buffer cache dedicated to a specific application. The Linux kernel, on the other hand, has to manage all the jobs on the entire system with a single I/O buffer cache. As a result, FFIO typically outperforms the Linux kernel buffer cache when it comes to I/O intensive throughput.

Environment Variables

To use FFIO, set both of the following environment variables: LD_PRELOAD and FF_IO_OPTS.

In order to enable FFIO to trap standard I/O calls, set the LD_PRELOAD environment variable, as follows:

# export LD_PRELOAD="/usr/lib64/libFFIO.so"

LD_PRELOAD is a Linux environment variable that instructs the dynamic linker to preload the indicated shared libraries. In this case, libFFIO.so is preloaded and provides the routines that replace the standard I/O calls. An application that is not dynamically linked with the glibc library cannot work with FFIO because its standard I/O calls cannot be intercepted. To disable FFIO, type the following:

# unset LD_PRELOAD
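Because an exported LD_PRELOAD applies to every dynamically linked program started from that shell, it can be convenient to set it for a single run only. The following Python sketch builds a per-process environment instead of exporting the variable. The helper name ffio_env is hypothetical; the library path is the one used in the export command above.

```python
import os

FFIO_LIB = "/usr/lib64/libFFIO.so"  # path taken from the export example above


def ffio_env(base=None, lib=FFIO_LIB):
    """Return a copy of the environment with the FFIO library preloaded.

    The caller's environment is left untouched, so only a child process
    started with this mapping has its I/O calls intercepted.
    """
    env = dict(os.environ if base is None else base)
    existing = env.get("LD_PRELOAD", "")
    # LD_PRELOAD accepts a space- or colon-separated list; keep prior entries.
    env["LD_PRELOAD"] = f"{lib} {existing}".strip()
    return env


# Hypothetical usage:
#   subprocess.run(["./fio", "-n", "100", "/build/testit"], env=ffio_env())
```

Leaving LD_PRELOAD out of the mapping has the same effect for that one process as the unset command.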

The FFIO buffer cache is managed by the FF_IO_OPTS environment variable. The syntax for setting this variable can be quite complex. A simple format for defining this variable is as follows:

export FF_IO_OPTS='string(eie.direct.mbytes:size:num:lead:share:stride:0)'

You can use the following parameters with the FF_IO_OPTS environment variable:

string

Matches the names of files that can use the buffer cache.

size

Number of 4k blocks in each page of the I/O buffer cache.

num

Number of pages in the I/O buffer cache.

lead

The maximum number of read-ahead pages.

share

A value of 1 means a shared cache, 0 means private.

stride

Note that the number after the stride parameter is always 0.
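As an illustration of how these fields fit together, the following Python sketch assembles a value in the simple format and computes the resulting cache footprint. The function names are hypothetical and are not part of FFIO; the sketch covers only the simple format shown above.

```python
BLOCK_BYTES = 4 * 1024  # the size parameter counts 4 KB blocks


def ff_io_opts(pattern, size, num, lead, share, stride=1):
    """Assemble an FF_IO_OPTS value in the simple format."""
    # The field after stride is always 0, per the parameter list.
    return f"{pattern}(eie.direct.mbytes:{size}:{num}:{lead}:{share}:{stride}:0)"


def cache_bytes(size, num):
    """Total cache footprint: num pages of size 4 KB blocks each."""
    return size * BLOCK_BYTES * num


print(ff_io_opts("test*", 4096, 128, 6, 1))
# test*(eie.direct.mbytes:4096:128:6:1:1:0)
print(cache_bytes(4096, 128) // 2**30, "GiB")
# 2 GiB
```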

Example 1. Assume that you want a shared buffer cache of 128 pages. Each page is to be 16 megabytes (that is, 4096*4k). The cache has a lead of six pages and uses a stride of one. The command is as follows:

% setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0)'

Each time the application opens a file, the FFIO code checks the file name to see if it matches the string supplied by FF_IO_OPTS. The file's path name is not considered when checking for a match against the string. For example, file names of /tmp/test16 and /var/tmp/testit both match.
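The matching rule can be approximated with shell-style globbing applied to the last path component only. This Python sketch assumes that FFIO patterns behave like ordinary shell wildcards, which the text implies but does not state in full; the function name matches is hypothetical.

```python
import fnmatch
import os.path


def matches(path, patterns):
    """Return True if the file name (not the directory) matches any pattern."""
    name = os.path.basename(path)
    return any(fnmatch.fnmatchcase(name, p) for p in patterns.split())


for p in ("/tmp/test16", "/var/tmp/testit", "/test/output.dat"):
    print(p, matches(p, "test*"))
# /tmp/test16 True
# /var/tmp/testit True
# /test/output.dat False
```

The last case shows that a directory named test does not cause a match, because only the file name is compared.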

Example 2. This more complicated usage of FF_IO_OPTS builds upon the previous example. Multiple types of file names can share the same cache, as the following example shows:

% setenv FF_IO_OPTS 'output* test*(eie.direct.mbytes:4096:128:6:1:1:0)'

Example 3. You can specify multiple caches with FF_IO_OPTS. In the example that follows, files of the form output* and test* share a 128-page cache of 16-megabyte pages. The file special42 has a 256-page private cache of 32-megabyte pages. The command is as follows:

% setenv FF_IO_OPTS 'output* test*(eie.direct.mbytes:4096:128:6:1:1:0) special42(eie.direct.mbytes:8192:256:6:0:1:0)'
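To check what a multi-cache setting such as the one in Example 3 amounts to, the value can be split into its cache groups. The following Python sketch is a simplified reading of the format used in this chapter, not FFIO's own parser; the function name parse_caches and the dictionary keys are hypothetical.

```python
import re

# One "names(layer.spec:size:num:lead:share:stride:0)" group per cache.
SPEC = re.compile(r"\s*([^()]+?)\s*\(([^()]*)\)")


def parse_caches(opts):
    """Split an FF_IO_OPTS value into a list of cache descriptions."""
    caches = []
    for names, spec in SPEC.findall(opts):
        first_layer = spec.split(",")[0]          # ignore any event layer
        _, size, num, lead, share, *_ = first_layer.split(":")
        caches.append({
            "patterns": names.split(),
            "page_mb": int(size) * 4 // 1024,     # size is in 4 KB blocks
            "pages": int(num),
            "lead": int(lead),
            "shared": share.strip() == "1",
        })
    return caches


opts = ("output* test*(eie.direct.mbytes:4096:128:6:1:1:0) "
        "special42(eie.direct.mbytes:8192:256:6:0:1:0)")
for c in parse_caches(opts):
    print(c["patterns"], c["pages"] * c["page_mb"], "MB",
          "shared" if c["shared"] else "private")
# ['output*', 'test*'] 2048 MB shared
# ['special42'] 8192 MB private
```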

Additional parameters can be added to FF_IO_OPTS to create feedback that is sent to standard output. For examples of this diagnostic output, see “Simple Examples”.

Simple Examples

This topic includes some simple FFIO examples. Assume that LD_PRELOAD is set for the correct library, and FF_IO_OPTS is defined as follows:

% setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0)'

It can be difficult to tell what FFIO might or might not be doing even with a simple program. The examples in this topic use a small C program called fio that reads 4-megabyte chunks from a file for 100 iterations. When the program runs, it produces the following output:

% ./fio -n 100 /build/testit									
Reading 4194304 bytes 100 times to /build/testit						
Total time  = 7.383761							       		
Throughput  = 56.804439 MB/sec	

Example 1. You can direct a summary of FFIO operations to standard output by making the following addition to FF_IO_OPTS:

% setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0, event.summary.mbytes.notrace )'

This new setting for FF_IO_OPTS generates the following summary on standard output when the program runs:

% ./fio -n 100 /build/testit									
Reading 4194304 bytes 100 times to /build/testit						
Total time  = 7.383761							       		
Throughput  = 56.804439 MB/sec								
										
event_close(testit)    eie <-->syscall   (496 mbytes)/( 8.72 s)=   56.85 mbytes/s		
oflags=0x0000000000004042=RDWR+CREAT+DIRECT				
sector size =4096(bytes)								
cblks =0  cbits =0x0000000000000000							
current file size =512 mbytes   high water file size =512 mbytes

function     times      wall     all     mbytes     mbytes       min        max       avg	
             called     time    hidden  requested  delivered   request    request    request
    open          1     0.00						
    read          2     0.61                 32         32        16         16         16	
    reada        29     0.01       0        464        464        16         16         16	
    fcntl                    								
       recall									
       reada     29     8.11								
       other      5     0.00				 		
    flush         1     0.00					
    close         1     0.00			

Two synchronous reads of 16 megabytes each were issued, for a total of 32 megabytes. In addition, there were 29 asynchronous reads (reada) issued, for a total of 464 megabytes.
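The figures in the summary follow directly from the page size set in FF_IO_OPTS. The short Python calculation below reproduces them from the call counts in the read and reada rows; the variable names are illustrative only.

```python
PAGE_MB = 4096 * 4 // 1024        # 4096 blocks of 4 KB = 16 MB per page

sync_reads, async_reads = 2, 29   # "read" and "reada" rows of the summary
sync_mb = sync_reads * PAGE_MB
async_mb = async_reads * PAGE_MB
total_mb = sync_mb + async_mb

print(sync_mb, async_mb, total_mb)   # 32 464 496

# The application itself asked for 100 reads of 4194304 bytes each.
app_mb = 100 * 4194304 // 2**20
print(app_mb)                        # 400
```

The 496 megabytes match the total on the event_close line. The application requested only 400 megabytes, so the remainder is data that FFIO read ahead from disk.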

Example 2. You can generate additional diagnostic information by specifying the .diag modifier. The following is an example of the diagnostic output generated when the .diag modifier is used:

% setenv FF_IO_OPTS 'test*(eie.direct.diag.mbytes:4096:128:6:1:1:0 )'
% ./fio -n 100 /build/testit									
Reading 4194304 bytes 100 times to /build/testit						
Total time  = 7.383761							       		
Throughput  = 56.804439 MB/sec

eie_close EIE final stats for file /build/testit	
eie_close  Used shared eie cache 1								
eie_close  128 mem pages of 4096 blocks (4096 sectors), max_lead = 6 pages			
eie_close  advance reads used/started :       23/29    79.31%   (1.78 seconds wasted)	
eie_close  write hits/total           :        0/0      0.00%					
eie_close  read  hits/total           :       98/100   98.00%					
eie_close  mbytes transferred    parent --> eie --> child      sync        async
eie_close                                 0            0        0             0   
eie_close                               400          496        2            29 (0,0)
eie_close                        parent <-- eie <-- child

eie_close EIE stats for Shared cache 1								
eie_close  128 mem pages of 4096 blocks							
eie_close  advance reads used/started :       23/29    79.31%   (0.00 seconds wasted)
eie_close  write hits/total           :        0/0      0.00%					
eie_close  read  hits/total           :       98/100   98.00%					
eie_close  mbytes transferred    parent --> eie --> child      sync        async
eie_close                                 0                     0             0
eie_close                               400         496         2            29 (0,0)

The preceding output lists information for both the file and the cache. In the mbytes transferred information, the first row of figures is for write operations and the second row is for read operations. Only for very simple I/O patterns can the difference between the (parent --> eie) and (eie --> child) read statistics be explained by the number of read aheads. For random reads of a large file over a long period of time, this is not the case. All write operations count as async.
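For this simple sequential pattern, the percentages and the gap between the two read columns can be reproduced by hand. The Python sketch below uses only numbers that appear in the diagnostic output above; the variable names are illustrative.

```python
PAGE_MB = 16                      # 4096 blocks of 4 KB

started, used = 29, 23            # "advance reads used/started"
hits, total = 98, 100             # "read  hits/total"

print(f"{used / started:.2%}")    # 79.31%
print(f"{hits / total:.2%}")      # 98.00%

# Data pulled from disk by read-ahead but never handed to the application:
unused_mb = (started - used) * PAGE_MB
print(unused_mb, 496 - 400)       # 96 96
```

Here the six unused read-ahead pages account exactly for the 96 megabytes by which the eie <-- child figure (496) exceeds the parent <-- eie figure (400).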

You can generate additional diagnostic information by specifying the .diag modifier and the .event.summary modifier. The two modifiers operate independently from one another. The following specification uses both modifiers:

% setenv FF_IO_OPTS 'test*(eie.diag.direct.mbytes:4096:128:6:1:1:0, event.summary.mbytes.notrace )'

Multithreading Considerations

FFIO works with applications that use MPI for parallel processing. An MPI job assigns each thread a number or rank. The master thread has rank 0, while the remaining slave threads have ranks from 1 to N-1, where N is the total number of threads in the MPI job. It is important to consider that the threads comprising an MPI job do not necessarily have access to each other's address space. As a result, there is no way for the different MPI threads to share the same FFIO cache. By default, each thread defines a separate FFIO cache based on the parameters defined by FF_IO_OPTS.

Having each MPI thread define a separate FFIO cache, based on a single environment variable (FF_IO_OPTS), can waste a lot of memory. Fortunately, FFIO provides a mechanism that allows you to specify a different FFIO cache for each MPI thread via the following environment variables:

setenv FF_IO_OPTS_RANK0 'result*(eie.direct.mbytes:4096:512:6:1:1:0)'
setenv FF_IO_OPTS_RANK1 'output*(eie.direct.mbytes:1024:128:6:1:1:0)'
setenv FF_IO_OPTS_RANK2 'input*(eie.direct.mbytes:2048:64:6:1:1:0)'
             .
             .
             .
setenv FF_IO_OPTS_RANKN-1 ...   (N = number of threads).

Each rank environment variable is set using the exact same syntax as FF_IO_OPTS, and each defines a distinct cache for the corresponding MPI rank. If the cache is designated as shared, all files within the same ranking thread can use the same cache. FFIO works with SGI MPI, HP MPI, and LAM MPI. In order to work with MPI applications, FFIO needs to determine the rank of callers by invoking the mpi_comm_rank_() MPI library routine. Therefore, FFIO needs to determine the location of the MPI library used by the application. To accomplish this, set one, and only one, of the following environment variables:

  • setenv SGI_MPI /usr/lib

  • setenv LAM_MPI

  • setenv HP_MPI


Note: LAM MPI and HP MPI are usually distributed via a third party application. The precise paths to the LAM and the HP MPI libraries are application dependent. See the application installation guide to find the correct path.

To use the rank functionality, both the MPI and FF_IO_OPTS_RANK0 environment variables must be set. If either variable is not set, then the MPI threads all use FF_IO_OPTS. If both the MPI and the FF_IO_OPTS_RANK0 variables are defined but, for example, FF_IO_OPTS_RANK2 is undefined, all rank 2 files generate a no match with FFIO. This means that none of the rank 2 files are cached by FFIO. In this case, the software does not default to FF_IO_OPTS.
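The selection rules in the preceding paragraph can be summarized as a small decision function. This Python sketch models only the behavior described in the text and is not FFIO's implementation; the function name opts_for_rank is hypothetical, and the environment is passed in as a plain dictionary.

```python
MPI_VARS = ("SGI_MPI", "LAM_MPI", "HP_MPI")


def opts_for_rank(rank, env):
    """Return the options string a given MPI rank would use, or None.

    None models the "no match" case: the rank's files are not cached.
    """
    mpi_set = any(v in env for v in MPI_VARS)
    if mpi_set and "FF_IO_OPTS_RANK0" in env:
        # Rank mode: there is no fallback to FF_IO_OPTS for undefined ranks.
        return env.get(f"FF_IO_OPTS_RANK{rank}")
    return env.get("FF_IO_OPTS")


env = {
    "SGI_MPI": "/usr/lib",
    "FF_IO_OPTS": "test*(eie.direct.mbytes:4096:128:6:1:1:0)",
    "FF_IO_OPTS_RANK0": "result*(eie.direct.mbytes:4096:512:6:1:1:0)",
    "FF_IO_OPTS_RANK1": "output*(eie.direct.mbytes:1024:128:6:1:1:0)",
}
print(opts_for_rank(1, env))   # output*(eie.direct.mbytes:1024:128:6:1:1:0)
print(opts_for_rank(2, env))   # None
```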

Fortran and C/C++ applications that use the pthreads interface create threads that share the same address space. These threads can all make use of the single FFIO cache defined by FF_IO_OPTS.

Application Examples

FFIO has been deployed successfully with several high-performance computing applications, such as Nastran and Abaqus. In a recent customer benchmark, an eight-way Abaqus throughput job ran approximately twice as fast when FFIO was used. The FFIO cache used 16-megabyte pages (that is, page_size = 4096) and the cache size was 8.0 gigabytes. As a rule of thumb, it was determined that setting the FFIO cache size to roughly 10-15% of the disk space required by Abaqus yielded reasonable I/O performance. For this benchmark, the FF_IO_OPTS environment variable was defined as follows:

% setenv FF_IO_OPTS '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023
       *.nck* *.sct *.lop *.ngr *.elm *.ptn* *.stp* *.eig *.lnz* *.mass *.inp* *.scn* *.ddm
       *.dat* fort*(eie.direct.nodiag.mbytes:4096:512:6:1:1:0,event.summary.mbytes.notrace)'

For the MPI version of Abaqus, different caches were specified for each MPI rank, as follows:

% setenv FF_IO_OPTS_RANK0 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023 
       *.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm  
       *.dat* fort*(eie.direct.nodiag.mbytes:4096:512:6:1:1:0,event.summary.mbytes.notrace)'
      
% setenv FF_IO_OPTS_RANK1 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023
       *.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm     
       *.dat* fort*(eie.direct.nodiag.mbytes:4096:16:6:1:1:0,event.summary.mbytes.notrace)'
      
% setenv FF_IO_OPTS_RANK2 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023 
       *.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm  
       *.dat* fort*(eie.direct.nodiag.mbytes:4096:16:6:1:1:0,event.summary.mbytes.notrace)'
      
% setenv FF_IO_OPTS_RANK3 '*.fct *.opr* *.ord *.fil *.mdl* *.stt* *.res *.sst *.hdx *.odb* *.023 
       *.nck* *.sct *.lop *.ngr *.ptn* *.stp* *.elm *.eig *.lnz* *.mass *.inp *.scn* *.ddm     
       *.dat* fort*(eie.direct.nodiag.mbytes:4096:16:6:1:1:0,event.summary.mbytes.notrace)'
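The memory saved by the per-rank settings can be estimated from the num field of each variable. The Python calculation below compares the four settings above with the case in which every rank uses the 512-page cache from the single FF_IO_OPTS definition. It is an illustration based only on the values shown, not a measurement from the benchmark.

```python
PAGE_MB = 4096 * 4 // 1024            # 16 MB pages in every rank's cache

pages_per_rank = {0: 512, 1: 16, 2: 16, 3: 16}   # num field of each RANKn setting

per_rank_mb = {r: n * PAGE_MB for r, n in pages_per_rank.items()}
ranked_total = sum(per_rank_mb.values())
uniform_total = len(pages_per_rank) * 512 * PAGE_MB   # every rank using FF_IO_OPTS

print(per_rank_mb)      # {0: 8192, 1: 256, 2: 256, 3: 256}
print(ranked_total)     # 8960
print(uniform_total)    # 32768
```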

Event Tracing

If you specify the .trace option as part of the event parameter, you can enable the event tracing feature in FFIO.

For example:

% setenv FF_IO_OPTS 'test*(eie.direct.mbytes:4096:128:6:1:1:0, event.summary.mbytes.trace )'

This option generates files of the form ffio.events.pid for each process that is part of the application. By default, event files are placed in /tmp. To change this destination, set the FFIO_TMPDIR environment variable. These files contain time-stamped events for files using the FFIO cache and can be used to trace I/O activity such as I/O sizes and offsets.
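When collecting traces from many processes, it helps to know where each file should appear. The Python sketch below derives the expected location from the rules just described. It assumes that the pid suffix is the decimal process ID, which the text suggests but does not state; the function name event_file is hypothetical.

```python
import os


def event_file(pid, env=None):
    """Expected path of the FFIO event file for one process."""
    env = os.environ if env is None else env
    directory = env.get("FFIO_TMPDIR", "/tmp")   # /tmp is the documented default
    return os.path.join(directory, f"ffio.events.{pid}")


print(event_file(4242, {}))                           # /tmp/ffio.events.4242
print(event_file(4242, {"FFIO_TMPDIR": "/scratch"}))  # /scratch/ffio.events.4242
```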

System Information and Issues

The FFIO subsystem supports applications written in C, C++, and Fortran. C and C++ applications can be built with either the Intel or gcc compiler. Only Fortran codes built with the Intel compiler work with FFIO.

The following restrictions on FFIO must also be observed:

  • The FFIO implementation of pread/pwrite is not correct. The file offset advances, whereas the standard calls leave it unchanged.

  • Do not use FFIO for I/O on a socket.

  • Do not link your application with the librt asynchronous I/O library.

  • FFIO does not intercept calls that operate on files in /proc, /etc, and /dev.

  • FFIO does not intercept calls that operate on stdin, stdout, and stderr.

  • FFIO is not intended for generic I/O applications such as vi, cp, or mv.
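To judge whether the pread/pwrite restriction affects an application, it helps to recall the standard behavior that FFIO departs from. The following Python sketch, which runs without FFIO on a POSIX system, shows that an ordinary pread leaves the file offset where it was. Code that mixes pread with read or lseek on the same descriptor relies on this.

```python
import os
import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b"0123456789")
    f.flush()
    fd = f.fileno()
    os.lseek(fd, 0, os.SEEK_SET)              # start from a known offset

    data = os.pread(fd, 4, 2)                 # read 4 bytes at offset 2
    offset_after = os.lseek(fd, 0, os.SEEK_CUR)

print(data, offset_after)                     # b'2345' 0
```

According to the restriction above, the same sequence under FFIO would leave the offset advanced, so a subsequent read would not start at byte 0.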