Chapter 6. Scaling Applications with RASC

Getting an application to scale well with RASC can require some effort. If the application does not require high bandwidth, it is just a matter of determining what, if any, data dependencies exist and using the algorithm layer to automatically scale across multiple FPGAs. If the application requires maximal bandwidth (that is, it is bandwidth limited), then care should be taken when determining buffer sizes and placement.

The rasclib_cop_malloc() and rasclib_algorithm_malloc() routines and the associated *_mfree() routines aid in this process. This chapter describes some things you need to consider when using these routines.

Data Buffer Size

The buffer size you choose for sends and receives should match, as closely as possible, the Hugepagesize value reported in the /proc/meminfo file. An example /proc/meminfo file follows:

system 56% cat meminfo
MemTotal:     48034320 kB
MemFree:       6848464 kB
Buffers:           208 kB
Cached:       34947536 kB
SwapCached:          0 kB
Active:       25110112 kB
Inactive:     13881616 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     48034320 kB
LowFree:       6848464 kB
SwapTotal:    69824176 kB
SwapFree:     69824176 kB
Dirty:           32736 kB
Writeback:           0 kB
AnonPages:     3982080 kB
Mapped:         123472 kB
Slab:          1648864 kB
CommitLimit:  93841328 kB
Committed_AS:  6141376 kB
PageTables:      67520 kB
VmallocTotal: 137416910048 kB
VmallocUsed:    233168 kB
VmallocChunk: 137416675984 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:    262144 kB

To avoid wasting space, the buffer sizes should be as close to Hugepagesize as possible without exceeding it (262144 kB, or 256 MB, in the example above). The RASC abstraction layer will allocate the memory associated with a given COP as close (in a topology sense) to the COP as it can. It will also make an effort to spread the memory allocation across memory nodes as evenly as possible to avoid oversubscribing the memory bandwidth on a memory node.

The rasclib_cop_malloc() routine allocates a single huge page for a single COP. Requesting a size that is larger than the huge page size will result in an error. Requesting a size that is smaller than the huge page size will result in wasted space.
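Because the buffer size should track Hugepagesize, an application may want to read that value at run time rather than hard-code it. The following is a minimal sketch in C, not part of rasclib: the helper names parse_hugepagesize_kb() and system_hugepage_bytes() are hypothetical, and the code assumes only the /proc/meminfo layout shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a rasclib routine): scan meminfo-formatted
 * text line by line and return the Hugepagesize value in kB, or -1 if
 * the field is missing or malformed.
 */
long parse_hugepagesize_kb(const char *meminfo_text)
{
    static const char key[] = "Hugepagesize:";
    const char *line = meminfo_text;

    assert(meminfo_text != NULL);

    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, sizeof key - 1) == 0) {
            long kb;
            /* %ld skips the padding blanks that follow the colon. */
            if (sscanf(line + sizeof key - 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}

/*
 * Hypothetical helper: read /proc/meminfo and return the huge page
 * size in bytes, or -1 if the file or the field is unavailable.
 */
long long system_hugepage_bytes(void)
{
    char buf[8192];
    size_t n;
    long kb;
    FILE *fp = fopen("/proc/meminfo", "r");

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf - 1, fp);
    fclose(fp);
    buf[n] = '\0';

    kb = parse_hugepagesize_kb(buf);
    return kb < 0 ? -1 : (long long)kb * 1024;
}
```

The value returned by system_hugepage_bytes() can then serve as the upper bound for the size requested from the allocation routines. The 8192-byte read buffer is an assumption about the size of /proc/meminfo; a system with a longer file would need a larger buffer.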

The rasclib_algorithm_malloc() routine will allocate a single huge page for each COP participating in the algorithm, so it is very important to match the data buffer size to the huge page size. If this cannot be done, then using the rasclib_*_malloc() routines may not result in optimal behavior.
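To see what a mismatch costs, it can help to work out how many huge-page-sized buffers a data set needs and how much of the last buffer goes unused. The sketch below is illustrative arithmetic only: the struct buffer_plan type and the plan_buffers() function are hypothetical names, and no rasclib routine is called.

```c
#include <assert.h>

/* Hypothetical result type: buffers needed and bytes left unused. */
struct buffer_plan {
    unsigned long long num_buffers;
    unsigned long long wasted_bytes;
};

/*
 * Hypothetical helper: split total_bytes of data into buffers of at
 * most hugepage_bytes each (one huge page per buffer) and report the
 * space left unused in the final, partially filled buffer.
 */
struct buffer_plan plan_buffers(unsigned long long total_bytes,
                                unsigned long long hugepage_bytes)
{
    struct buffer_plan plan;

    assert(hugepage_bytes > 0);

    /* Round up: a partial final chunk still occupies a whole page. */
    plan.num_buffers = total_bytes / hugepage_bytes
                     + (total_bytes % hugepage_bytes != 0 ? 1 : 0);
    plan.wasted_bytes = plan.num_buffers * hugepage_bytes - total_bytes;
    return plan;
}
```

With the 256 MB huge page from the example above, a 1024 MB data set fills four buffers exactly, while a 300 MB data set needs two buffers and leaves 212 MB of the second one unused.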

Achieving Maximal Bandwidth

For absolute maximal bandwidth, there should be one memory node with processors for every FPGA.

The further a topology departs from this one-to-one ratio, the lower the aggregate bandwidth. The memory nodes must have processors (memory-only nodes do not qualify) because each FPGA requires processors on which to run the threads that drive its DMA.

High bandwidth also requires the use of direct I/O (see Chapter 5, “Direct I/O”).

Conclusion

As mentioned above, the size of the application I/O buffers should match the system's huge page size. The memory allocation routines provided by the RASC abstraction layer will allocate, at most, one huge page per request.