Getting an application to scale well with RASC can require some effort. If the application does not require large bandwidth, it is just a matter of determining what, if any, data dependencies exist and using the algorithm layer to automatically scale across multiple FPGAs. If the application requires maximal bandwidth (it is bandwidth limited), then care should be taken when determining buffer sizes and placement.
The rasclib_cop_malloc() and rasclib_algorithm_malloc() routines and their associated *_mfree() routines aid in this process. This chapter describes what you need to consider when using these routines.
The buffer size you choose for sends and receives should match the Hugepagesize value shown in the /proc/meminfo file as closely as possible. An example /proc/meminfo file is as follows:
system 56% cat meminfo
MemTotal:        48034320 kB
MemFree:          6848464 kB
Buffers:              208 kB
Cached:          34947536 kB
SwapCached:             0 kB
Active:          25110112 kB
Inactive:        13881616 kB
HighTotal:              0 kB
HighFree:               0 kB
LowTotal:        48034320 kB
LowFree:          6848464 kB
SwapTotal:       69824176 kB
SwapFree:        69824176 kB
Dirty:              32736 kB
Writeback:              0 kB
AnonPages:        3982080 kB
Mapped:            123472 kB
Slab:             1648864 kB
CommitLimit:     93841328 kB
Committed_AS:     6141376 kB
PageTables:         67520 kB
VmallocTotal: 137416910048 kB
VmallocUsed:       233168 kB
VmallocChunk: 137416675984 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
Hugepagesize:      262144 kB
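The Hugepagesize value can also be read programmatically instead of by inspecting the file by hand. The following is a minimal sketch, not part of the RASC library: the function names hugepagesize_kb_from_text() and hugepagesize_bytes() are hypothetical, and the code assumes the "Hugepagesize:" field is reported in kB, as in the example above.

```c
#include <stdio.h>
#include <string.h>

/* Scan meminfo-formatted text for the "Hugepagesize:" field and return its
 * value in kB, or -1 if the field is absent or malformed.
 * (Hypothetical helper, not a rasclib routine.) */
long hugepagesize_kb_from_text(const char *text)
{
    const char *key = "Hugepagesize:";
    const char *p = strstr(text, key);
    long kb;

    if (p == NULL)
        return -1;
    /* %ld skips the blanks between the colon and the number. */
    if (sscanf(p + strlen(key), "%ld", &kb) != 1)
        return -1;
    return kb;
}

/* Read /proc/meminfo and return Hugepagesize in bytes, or -1 on failure.
 * Assumes the whole file fits in the local buffer. */
long long hugepagesize_bytes(void)
{
    char buf[8192];
    size_t n;
    long kb;
    FILE *fp = fopen("/proc/meminfo", "r");

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    buf[n] = '\0';

    kb = hugepagesize_kb_from_text(buf);
    return kb < 0 ? -1 : (long long)kb * 1024;
}
```

On the example system above, hugepagesize_bytes() would be expected to return 268435456 (262144 kB).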
To avoid wasting space, the buffer sizes should be as close to Hugepagesize as possible without exceeding it. The RASC abstraction layer allocates the memory associated with a given COP as close (in a topology sense) to the COP as it can. It also tries to spread the memory allocation across memory nodes as evenly as possible to avoid oversubscribing the memory bandwidth on any one memory node. The rasclib_cop_malloc() routine allocates a single huge page for a single COP. Requesting a size that is larger than the huge page size results in an error. Requesting a size that is smaller than the huge page size results in wasted space.
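The sizing rule above can be worked through with simple arithmetic. The following sketch is an illustration only: struct chunk_plan and plan_chunks() are hypothetical names, not RASC library routines. It splits a data set into buffers that never exceed the huge page size and reports how much space the final huge page leaves unused.

```c
#include <stddef.h>

/* Hypothetical helper type, not part of rasclib. */
struct chunk_plan {
    size_t chunk_bytes;   /* size to request per buffer (never above one huge page) */
    size_t num_chunks;    /* number of buffers needed to hold the data */
    size_t wasted_bytes;  /* huge-page space left unused in total */
};

/* Plan buffers for total_bytes of data given the huge page size in bytes.
 * Returns an all-zero plan if either argument is zero. */
struct chunk_plan plan_chunks(size_t total_bytes, size_t hugepage_bytes)
{
    struct chunk_plan plan = {0, 0, 0};

    if (total_bytes == 0 || hugepage_bytes == 0)
        return plan;

    /* Round up without risking overflow in the addition. */
    plan.num_chunks = total_bytes / hugepage_bytes
                    + (total_bytes % hugepage_bytes != 0);
    /* A data set smaller than one huge page needs only its own size. */
    plan.chunk_bytes = total_bytes < hugepage_bytes ? total_bytes
                                                    : hugepage_bytes;
    /* Every buffer occupies a whole huge page regardless of the request. */
    plan.wasted_bytes = plan.num_chunks * hugepage_bytes - total_bytes;
    return plan;
}
```

With the 262144 kB (268435456 byte) huge page from the example, 1 GiB of data maps onto exactly four full buffers with no waste, while 300 MiB needs two huge pages and leaves 212 MiB unused.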
The rasclib_algorithm_malloc() routine will allocate a single huge page for each COP participating in the algorithm. So, it is VERY important to match the data buffer size to the huge page size. If this cannot be done, then using rasclib_*_malloc() may not result in optimal behavior.
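Because one huge page is reserved per participating COP, any mismatch between buffer size and huge page size is multiplied by the number of COPs. The sketch below shows that arithmetic only. The function names are hypothetical, and it assumes every COP uses the same buffer size.

```c
#include <stddef.h>

/* Total huge-page memory reserved when each of num_cops COPs receives one
 * huge page. (Hypothetical helper, not a rasclib routine.) */
size_t algorithm_reserved_bytes(size_t num_cops, size_t hugepage_bytes)
{
    return num_cops * hugepage_bytes;
}

/* Total space left unused when each COP fills only buffer_bytes of its huge
 * page. Returns 0 when buffer_bytes exceeds the huge page size, because such
 * a request would fail rather than waste space. */
size_t algorithm_wasted_bytes(size_t num_cops, size_t buffer_bytes,
                              size_t hugepage_bytes)
{
    if (buffer_bytes > hugepage_bytes)
        return 0;
    return num_cops * (hugepage_bytes - buffer_bytes);
}
```

For example, with four COPs and a 256 MiB huge page, a 200 MiB buffer per COP reserves 1 GiB in total and leaves 224 MiB of it unused.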
For absolute maximal bandwidth, there should be one memory node with processors for every FPGA. The further a topology departs from this one-to-one ratio, the lower the aggregate bandwidth. The memory nodes must have processors (they cannot be memory-only nodes) because each FPGA requires processors to run the threads that drive the DMA.
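One way to check a system against this guideline is to count the memory nodes that have processors and compare the result with the number of FPGAs. The sketch below is an assumption-laden illustration, not a RASC facility: it assumes a Linux kernel that exposes /sys/devices/system/node/nodeN/cpulist and that a node without processors reports an empty list there. The function names are hypothetical.

```c
#include <ctype.h>
#include <stdio.h>

/* Return 1 if a cpulist string (for example "0-3\n") names at least one
 * CPU, and 0 if it is empty. (Hypothetical helper.) */
int cpulist_has_cpus(const char *cpulist)
{
    for (; *cpulist != '\0'; cpulist++)
        if (isdigit((unsigned char)*cpulist))
            return 1;
    return 0;
}

/* Count the nodes numbered 0..max_nodes-1 that have processors.
 * Nodes whose cpulist file cannot be opened are skipped. */
int count_nodes_with_cpus(int max_nodes)
{
    int node, count = 0;

    for (node = 0; node < max_nodes; node++) {
        char path[64], line[4096];
        FILE *fp;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/cpulist", node);
        fp = fopen(path, "r");
        if (fp == NULL)
            continue;
        if (fgets(line, sizeof(line), fp) != NULL && cpulist_has_cpus(line))
            count++;
        fclose(fp);
    }
    return count;
}
```

If the count returned is lower than the number of FPGAs in use, the topology departs from the one-to-one ratio and lower aggregate bandwidth should be expected.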
High bandwidth also requires the use of direct I/O (see Chapter 5, “Direct I/O”).