Chapter 10. RASC Examples and Tutorials

This chapter contains Reconfigurable Application-Specific Computing Software (RASC) examples and tutorials and covers the following topics:

  • “System Requirements”

  • “Prerequisites”

  • “Tutorial Overview”

  • “Simple Algorithm Tutorial”

  • “Data Flow Algorithm Tutorial”

  • “Streaming DMA Algorithm Tutorial”

System Requirements

For design, synthesis, and bitstream generation, you need the following:

  • PC with 1 GHz or greater clock speed

  • At least 8 Gbytes random access memory (RAM)

  • Red Hat Enterprise Linux version 3.0 or later

  • Xilinx ISE development tools (version 9.2i, Service Pack 3 or higher)

  • Optional: High-level language compiler.

  • Optional: Third-party FPGA synthesis software supporting Xilinx FPGAs (such as Synplicity Synplify Pro 8.9 or later)

  • Optional: Third-party HDL simulation software

For bitstream download, algorithm acceleration, and real-time verification you need the following:

  • One Altix system

  • One or more RASC bricks or blades

  • SGI ProPack 5 for Linux

  • RASC software module

  • A network connection to the PC detailed earlier

Prerequisites

The information and tutorials in this chapter assume that you have previously installed and familiarized yourself with the Xilinx ISE tools and all optional software. It is also assumed that you have read Chapter 7, “RASC Algorithm FPGA Implementation Guide” and Chapter 9, “Running and Debugging Your Application”, and that you have some experience with Verilog and/or VHDL.

Additional background information (not from SGI) is available in the documentation for the Xilinx ISE tools and for any optional synthesis and simulation software that you are using.

Tutorial Overview

The following tutorials illustrate the implementation details of the algorithm programming interface using two different algorithms. In the following sections, you will learn how to integrate algorithms written in hardware description languages (HDLs) into the RASC brick or RASC blade. You will also see a subset of the optimizations that can be made for RASC implementations.

For both algorithms we will step through the entire RASC design flow: integrating the algorithm with core services; simulating behavior on the algorithm interfaces; synthesizing the algorithm code; generating a bitstream; transferring that bitstream and metadata to the Altix platform; executing an application; and using GDB to debug an application on the Altix system and FPGA simultaneously.

These tutorials only illustrate a subset of the options available for implementing an algorithm on RASC. For more details, see Chapter 3, “RASC Algorithm FPGA Hardware Design Guide” and Chapter 7, “RASC Algorithm FPGA Implementation Guide”.

The Verilog example code resides on your x86 system because Verilog code is compiled with Xilinx XST, which runs on the x86 system. The example application C code is compiled with the C compiler, which runs on your Altix system.

Simple Algorithm Tutorial

Overview

The first algorithm we will use to describe the interfaces and various programming templates for RASC is (d = a & b | c). This simple algorithm allows you to compare coding options and analyze optimization techniques. This section steps through integrating versions of the algorithm written in Verilog and in VHDL, and then demonstrates simulation, synthesis, bitstream generation, transfer to the Altix platform, and running and debugging the application. These steps are the same for this algorithm regardless of the coding technique used.

Figure 10-1 contains a diagram of the algorithm and its memory patterns.

Figure 10-1. Simple Algorithm for Verilog

Application

The application that runs (d = a & b | c) on the Altix platform is fairly simple. The following code demonstrates the RASC Abstraction Layer calls that are required to utilize the RASC brick as a co-processing (COP) unit. The application also runs the algorithm on the Altix system to verify the FPGA results. The application C code is on your Altix system at the following location:

/usr/share/rasc/examples/alg6.c

This simple application runs quickly on an Altix machine, although it is not optimized C code. Please note that this application was not chosen to emphasize the optimizations available from acceleration in hardware, but rather to compare and contrast the various programming methods for RASC. As you work through the tutorials for the different languages, there will be similarities and differences that highlight the advantages of one methodology over another. For a more computationally intensive example, see the Data Flow Algorithm Tutorial later in this chapter.
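
At its core, the verification step recomputes d in software and compares it with the data returned from the FPGA. The sketch below illustrates only that comparison; the array and function names are placeholders rather than code taken from alg6.c, and the rasclib calls that reserve the FPGA and move the data are omitted.

#include <stdint.h>
#include <stdio.h>

#define NUM_ELEM 2048    /* each operand array holds 2048 64-bit values */

/* Software reference for (d = a & b | c); d_fpga holds the results
 * read back from SRAM 1. Returns 0 when every element matches. */
static int verify_results(const uint64_t *a, const uint64_t *b,
                          const uint64_t *c, const uint64_t *d_fpga)
{
    for (int i = 0; i < NUM_ELEM; i++) {
        uint64_t d_sw = (a[i] & b[i]) | c[i];
        if (d_sw != d_fpga[i]) {
            printf("mismatch at element %d: sw=%#llx fpga=%#llx\n", i,
                   (unsigned long long)d_sw, (unsigned long long)d_fpga[i]);
            return 1;
        }
    }
    return 0;
}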

Coding Techniques: Verilog

Overview

First we will analyze how to write a Verilog version of (d = a & b | c) for RASC. It is important to note that the source code for this example allows for multi-buffering.

Integrating with Core Services

Begin by loading the hardware description file for the Verilog algorithm, alg_block_top.v, into your text editor. Change directory to $RASC/examples/alg_simple_v and select alg_block_top.v.

This file is the top-level module for the algorithm. The other file that is required for this and all other implementations is alg.h. This Verilog version of (d = a & b | c) reads a starting at the first address of SRAM 0, b starting at byte offset 0x4000 (16384) of SRAM 0, and c starting at byte offset 0x8000 (32768) of SRAM 0, and then writes the resulting value, d, starting at the first address of SRAM 1. Arrays a, b, c, and d are 2048 elements long, where each element is a 64-bit unsigned integer, and all of the arrays are enabled for multi-buffering by the RASC Abstraction Layer. The version of the algorithm, read data, write data, read address, write address, and control signals are all brought out to debug mux registers.
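
The byte offsets follow directly from the array size: 2048 elements of 8 bytes each occupy 0x4000 (16384) bytes, which is why b begins at SRAM 0 offset 0x4000 and c at offset 0x8000, matching the extractor comments shown later in this section. A minimal sketch of that layout follows; the macro names are illustrative and are not taken from alg.h.

#include <stdint.h>

/* Operand layout for the simple algorithm, as described above.
 * Macro names are placeholders; they do not come from alg.h. */
#define NUM_ELEM        2048
#define ELEM_BYTES      sizeof(uint64_t)            /* 8 bytes  */
#define ARRAY_BYTES     (NUM_ELEM * ELEM_BYTES)     /* 0x4000   */

#define A_IN_OFFSET     0x0000                      /* SRAM 0   */
#define B_IN_OFFSET     (A_IN_OFFSET + ARRAY_BYTES) /* 0x4000   */
#define C_IN_OFFSET     (B_IN_OFFSET + ARRAY_BYTES) /* 0x8000   */
#define D_OUT_OFFSET    0x0000                      /* SRAM 1   */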

Figure 10-2 contains a diagram of the algorithm and its memory access patterns.

Figure 10-2. Simple Algorithm for Verilog and VHDL

The source code is available on your x86 system at the following location:

$RASC/examples/alg_simple_v/alg_block_top.v


Note: The handshaking methodologies (see “Handshaking Methodologies” in Chapter 3) are not used here since it is assumed SRAM port access conflicts will never occur between the algorithm and DMA.


Extractor Comments

Other important source code considerations include adding the extractor comments that are required for accurate data movement and debugger control by RASClib. A Python script called extractor parses all of the Verilog, VHDL, and header files in your algorithm directory to generate the symbol tables required by GDB and to tell the abstraction layer which data should be written to and read from the SRAM banks.

Comment fields that generate the configuration files for the algorithm are inserted in these examples. There is a template in the alg_core directory, along with several examples. The comment fields can be located in any file in or below the directory specified as the second argument to the extractor call (see Chapter 7, “RASC Algorithm FPGA Implementation Guide” for more detail on how to specify the makefile target). The fields are: core services version; algorithm version; SRAM, denoting where data will be read from and written to on the SRAM interface; register in, for parameters set through an application writer's code; and register out, for a section of code that needs to be mapped to a debug register.

The debug comments for metadata parsing in this file are:

// extractor VERSION: 6.3
// extractor CS: 2.1
// extractor SRAM:a_in  2048  64 sram[0] 0x0000 in  u stream
// extractor SRAM:b_in  2048  64 sram[0] 0x4000 in  u stream
// extractor SRAM:c_in  2048  64 sram[0] 0x8000 in  u stream
// extractor SRAM:d_out 2048  64 sram[1] 0x0000 out u stream
// extractor REG_IN:op_length1 10 u alg_def_reg[0][9:0]
// extractor REG_OUT:alg_id 32 u debug_port[0][63:32]
// extractor REG_OUT:alg_rev 32 u debug_port[0][31:0]
// extractor REG_OUT:rd_addr 64 u debug_port[1]
// extractor REG_OUT:rd_data_sram0_lo 64 u debug_port[2]
// extractor REG_OUT:rd_data_sram0_hi 64 u debug_port[3]
// extractor REG_OUT:wr_addr 64 u debug_port[4]
// extractor REG_OUT:wr_data_sram1_lo 64 u debug_port[5]
// extractor REG_OUT:wr_data_sram1_hi 64 u debug_port[6]
// extractor REG_OUT:cntl_sigs 64 u debug_port[7]
// extractor REG_OUT:dummy_param0_out 16 u debug_port[8][15:0]
// extractor REG_OUT:dummy_param1_out 16 u debug_port[8][31:16]
// extractor REG_OUT:dummy_param2_out 16 u debug_port[8][47:32]
// extractor REG_OUT:dummy_param3_out 16 u debug_port[8][63:48]

These comments are located within alg_block_top.v in this case, but they can appear anywhere within the algorithm hierarchy, in a header or source file. The core services tag (CS) records which version of core services was used to generate the bitstream, which is useful when debugging. The version tag lets you see from a GDB session which algorithm and revision is loaded. The register out tag (REG_OUT) specifies registers that are pulled out to the debug mux. The SRAM tag describes arrays that are written to or read from the SRAM banks by the algorithm. For more information, see Chapter 7, “RASC Algorithm FPGA Implementation Guide”.

Coding Techniques: VHDL Algorithm

Overview

Now we will analyze how to write a VHDL version of (d = a & b | c) for RASC. This source code also allows for multi-buffering.

Figure 10-3 contains a diagram of the algorithm and its memory patterns.

Figure 10-3. Simple Algorithm for Verilog and VHDL

Integrating with Core Services

Begin by loading the hardware description file for the VHDL algorithm, alg_block.vhd, into your text editor. Change directory to $RASC/examples/alg_simple_vhd where you will see alg_block_top.v and alg_block.vhd.

These files are the top-level module and the computation block, respectively. This VHDL version of (d = a & b | c) reads a starting at the first address of SRAM 0, b starting at byte offset 0x4000 (16384) of SRAM 0, and c starting at byte offset 0x8000 (32768) of SRAM 0, and then writes the resulting value, d, starting at the first address of SRAM 1. Arrays a, b, c, and d are 2048 elements long, where each element is a 64-bit unsigned integer, and all of the arrays are enabled for multi-buffering by the RASC Abstraction Layer. The version of the algorithm, read data, write data, read address, and write address for the algorithm are all brought out to debug mux registers.

In this instance, the source for the alg_block_top.v file is similar to the Verilog version of (d = a & b | c), except that it performs no calculation. Instead, it wraps alg_block.vhd so that it can interface with the user_space_wrapper.v module that instantiates it. A computation block does not have to be wrapped in a Verilog module; in this case it was done for convenience.


Note: The handshaking methodologies (see “Handshaking Methodologies” in Chapter 3) are not used here since it is assumed SRAM port access conflicts will never occur between the algorithm and DMA.


Extractor Comments

Other important source code considerations include adding the extractor comments that are required for accurate data movement and debugger control. A Python script called extractor parses all of the Verilog, VHDL, and header files in your algorithm directory to generate the symbol tables required by GDB and to tell the abstraction layer which data should be written to and read from the SRAM banks.

Comment fields that generate the configuration files for the algorithm are provided in this example in alg_block.vhd. There is a template in the alg_core directory, along with several examples. The comment fields can be located anywhere below the directory specified as the second argument to the extractor call (see Chapter 7, “RASC Algorithm FPGA Implementation Guide” for more detail on how to specify the makefile target). The fields are: core services version; algorithm version; SRAM, denoting where data will be read from and written to on the SRAM interface; register in, for parameters set through an application writer's code; and register out, for a section of code that needs to be mapped to a debug register.

The debug comments for metadata parsing in this file are embedded in the VHDL code. They appear as:

-- extractor VERSION: 9.1
-- extractor SRAM:a_in  2048 64 sram[0] 0x0000 in  u stream
-- extractor SRAM:b_in  2048 64 sram[0] 0x4000 in  u stream
-- extractor SRAM:c_in  2048 64 sram[0] 0x8000 in  u stream
-- extractor SRAM:d_out 2048 64 sram[1] 0x0000 out u stream
-- extractor REG_OUT:version 64 u debug_port[0]
-- extractor REG_OUT:rd_addr 64 u debug_port[1]
-- extractor REG_OUT:rd_data_sram0_lo 64 u debug_port[2]
-- extractor REG_OUT:rd_data_sram0_hi 64 u debug_port[3]
-- extractor REG_OUT:wr_addr 64 u debug_port[4]
-- extractor REG_OUT:wr_data_sram1_lo 64 u debug_port[5]
-- extractor REG_OUT:wr_data_sram1_hi 64 u debug_port[6]

These comments are located within alg_block.vhd in this case, but they can appear anywhere within the algorithm hierarchy, in a header or source file. The core services tag records which version of core services was used to generate the bitstream; this information is useful when debugging. The version tag lets you see from a GDB session which algorithm and revision is loaded. The register out tag (REG_OUT) specifies registers that are mapped to the debug mux. The SRAM tag describes arrays that are written to or read from the SRAM banks by the algorithm.

Compiling for Simulation

To build the generic SSP stub test bench for the Verilog version of the simple algorithm, change to the $RASC/dv/sample_tb directory. At the prompt, enter the following:

% make ALG=alg_simple_v

If you do not enter the ALG tag, the makefile will default to compiling for alg_simple_v.

To run the diagnostic alg_simple_v, at the prompt, enter the following:

% make run DIAG=diags/alg_simple_v ALG=alg_simple_v

As when building the test bench, the ALG tag defaults to alg_simple_v; it must be overridden when you target a different algorithm, such as the data flow algorithm.

The SRAM mapping to physical memory is as follows:

mem0[127:64] -> qdr_sram_bank1.SMEM -> init_sram1_*.dat, final_sram1.dat
mem0[63:0  ] -> qdr_sram_bank0.SMEM -> init_sram0_*.dat, final_sram0.dat
mem1[127:64] -> qdr_sram_bank3.SMEM -> init_sram3_*.dat, final_sram3.dat
mem1[63:0  ] -> qdr_sram_bank4.SMEM -> init_sram2_*.dat, final_sram2.dat
mem2[63:0  ] -> qdr_sram_bank2.SMEM -> init_sram4_*.dat, final_sram4.dat

These file names can all be overridden on the command line.

By specifying the SRAM input files, the user can skip the DMA process for the purposes of testing the algorithm, providing a fast check for the algorithm without verifying the DMA engines of core services.

You can also use a simple C program called check_alg_simple.c to verify the test results; build and run it to analyze the initial and final SRAM simulation contents.
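
For reference, the sketch below shows the general shape of such a check. It is not the contents of check_alg_simple.c: it assumes the .dat files contain whitespace-separated 64-bit hexadecimal words and that the a, b, c, and d arrays have already been extracted into separate files, both of which are assumptions about the simulation file format rather than documented behavior.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_ELEM 2048

/* Read n hexadecimal 64-bit words from a dump file (format assumed). */
static void load_words(const char *path, uint64_t *buf, int n)
{
    FILE *fp = fopen(path, "r");
    if (!fp) { perror(path); exit(1); }
    for (int i = 0; i < n; i++) {
        unsigned long long word;
        if (fscanf(fp, "%llx", &word) != 1) {
            fprintf(stderr, "%s: short read at word %d\n", path, i);
            exit(1);
        }
        buf[i] = word;
    }
    fclose(fp);
}

int main(int argc, char **argv)
{
    static uint64_t a[NUM_ELEM], b[NUM_ELEM], c[NUM_ELEM], d[NUM_ELEM];

    if (argc != 5) {
        fprintf(stderr, "usage: %s a.dat b.dat c.dat d.dat\n", argv[0]);
        return 1;
    }
    load_words(argv[1], a, NUM_ELEM);   /* operands from the SRAM 0 dumps */
    load_words(argv[2], b, NUM_ELEM);
    load_words(argv[3], c, NUM_ELEM);
    load_words(argv[4], d, NUM_ELEM);   /* result from the SRAM 1 dump    */

    for (int i = 0; i < NUM_ELEM; i++) {
        if (d[i] != ((a[i] & b[i]) | c[i])) {
            printf("mismatch at element %d\n", i);
            return 1;
        }
    }
    printf("all %d results match\n", NUM_ELEM);
    return 0;
}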

Running a diag through this test bench produces results in four formats:

  • *vcdplus.vpd* - this file contains the simulation results for the run.

  • *terminal output* - the status of the test is output to the screen as it runs, notifying the user of packets sent/received.

  • *log file* - the output to the screen is also stored in the log file: <diag_name>.<alg_name>.run.log (for example, dma_alg_simple_v.alg_simple_v.run.log)

  • *sram output files* - at the end of simulation (when the diag finishes because it has been successful, an incorrect packet has been received, or a time-out has occurred), the contents of all SRAMs are dumped to the corresponding .dat output files (the defaults or user-specified files).

Building an Implementation

When the algorithm has been integrated and verified, it is time to build an implementation.

Change directories to $RASC/implementations/alg_simple_*/

To synthesize the design, use one of the following commands:

make synplify

or

make xst

The implementation is set up to utilize the black-boxed version of core services and should therefore synthesize faster.

To generate the required metadata information for the abstraction layer and the debugger, you need to run the extractor script on your file. The physical design makefile includes a make extractor target for this purpose. When it is executed, it will generate two configuration files--one describing core services, and one describing the algorithm behavior.

To execute the ISE foundation tools and run the extractor script on the file enter the following command:

make all

This will take approximately one to two hours due to the complex mapping and place and route algorithms executed by the ISE tools. Please note that the details of setting up your own project are described in Chapter 7, “RASC Algorithm FPGA Implementation Guide”, specifically in “Installation and Setup” in Chapter 7.

Transferring to the Altix Platform

To transfer to the Altix platform, you must add your RASC design implementation into the Device Manager registry. This transfer must occur regardless of the algorithm generation method.

  1. Use FTP to move the algorithm files from the PC to the /usr/share/rasc/bitstreams directory on the Altix machine:

    $RASC/implementations/alg_simple_*/rev_1/alg_simple_v.bin
    $RASC/implementations/alg_simple_*/<core_services>.cfg
    $RASC/implementations/alg_simple_*/<user_space>.cfg
    

  2. Log into the Altix machine and execute the Device Manager user command devmgr

    devmgr -a -n alg6 -b alg_simple_v.bin -c <user_space>.cfg -s <core_services>.cfg
    

The build scripts default the bitstream and configuration files to these names; however, the Device Manager can add files of any name to the registry, so you can rename project files as convenient.

Verification Using GDB

To run a debug session on this bitstream, you must start the application from a GDB session window. GDB support is enabled with all versions of this algorithm. To run an application that uses RASClib under the debugger, execute the extended GDB (gdbfpga) on the application described at the beginning of this example.

% gdbfpga /usr/share/rasc/examples/alg6
GNU gdb 6.3.50.20050510
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "ia64-unknown-linux-gnu"...Using host libthread_db library "/lib/libthread_db.so.1".
 
(gdb) break rasclib_brkpt_start
Function "rasclib_brkpt_start" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (rasclib_brkpt_start) pending.
(gdb) break rasclib_brkpt_done
Function "rasclib_brkpt_done" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 2 (rasclib_brkpt_done) pending.
(gdb) handle SIGUSR1 nostop pass noprint
Signal        Stop      Print   Pass to program Description
SIGUSR1       No        No      Yes             User defined signal 1
(gdb) run
Starting program: /usr/share/rasc/examples/alg6 
Failed to read a valid object file image from memory.
[Thread debugging using libthread_db enabled]
[New Thread 2305843009292443296 (LWP 24946)]
Breakpoint 3 at 0x2000000000085bb1: file rasclib_debug.c, line 86.
Pending breakpoint "rasclib_brkpt_start" resolved
Breakpoint 4 at 0x2000000000085b82: file rasclib_debug.c, line 104.
Pending breakpoint "rasclib_brkpt_done" resolved[New Thread 2305843009303867984 (LWP 24949)]
[New Thread 2305843009312551504 (LWP 24951)]
fpga config file def reg count (64) of /var/rasc/rasc_registry/alg6/bitstream.cfg exceeds gdb current maximum of 8, excess ignored
 [Switching to Thread 2305843009292443296 (LWP 24946)]

Breakpoint 3, rasclib_brkpt_start (cop_desc=547920) at rasclib_debug.c:86
86      rasclib_debug.c: No such file or directory.
        in rasclib_debug.c
(gdb) info fpga
fpga 0
  Active       : on
  State        : ready-to-fpgastep-fpgacont
  Algorithm id : alg6
   Core svc ver : 2.100000
   Algorithm ver: 6.300000
   Algorithm src: v
   Alg. dev     : 
   Alg. config  : /var/rasc/rasc_registry/alg6/bitstream.cfg
   CS version   : 2.100000
   CS config    : /var/rasc/rasc_registry/alg6/core_services.cfg
   prev step ct : 0
   step ct      : 0
 (gdb) info fpgaregisters
 alg_id         0x6      6
 alg_rev        0x3      3
 rd_addr        0x0      0
 rd_data_sram0_lo0xa5a4a3a2a1a09f9e      11935844831330344862
 rd_data_sram0_hi0xadacabaaa9a8a7a6      12514566214034958246
 wr_addr        0x0      0
 wr_data_sram1_lo0xf7f4f7f2f3f0fffe      17867178244532404222
 wr_data_sram1_hi0xfffcfffafbf8fff6      18445899627237015542
cntl_sigs      0x0      0
 dummy_param0_out0x0     0
 dummy_param1_out0x100   256
 dummy_param2_out0x20    32
 dummy_param3_out0x3303  13059
 op_length1     0x3ff    1023
 dummy_param0_in0x0      0
 dummy_param1_in0x100    256
 dummy_param2_in0x20     32
 dummy_param3_ 
 (gdb) fpgastep 55
 (gdb) info fpgaregisters
 alg_id         0x6      6
 alg_rev        0x3      3
 rd_addr        0x811    2065
 rd_data_sram0_lo0xd5d4d3d2d1d0cfce      15408173127558025166
 rd_data_sram0_hi0xdddcdbdad9d8d7d6      15986894510262638550
 wr_addr        0xd      13
 wr_data_sram1_lo0xd7d6d5d4d3d2d1d0      15552853473234178512
 wr_data_sram1_hi0xdfdedddcdbdad9d8      16131574855938791896
 cntl_sigs      0x3      3
 dummy_param0_out0x0     0
 dummy_param1_out0x100   256
 dummy_param2_out0x20    32
 dummy_param3_out0x3303  13059
 op_length1     0x3ff    1023
 dummy_param0_in0x0      0
 dummy_param1_in0x100    256
 dummy_param2_in0x20     32
 dummy_param3_in0x3303   13059
 (gdb) print $rd_addr
 $1 = 2065
 (gdb) print $wr_addr
 $2 = 13
 (gdb) print /x $rd_data_sram0_lo
 $3 = 0xd5d4d3d2d1d0cfce
 (gdb) print /x a_in[8]
 [Switching to Thread 2305843009292443296 (LWP 24946)]
 $4 = 0x4746454443424140
 (gdb) fpgastep 3
 (gdb) print $rd_addr
 $5 = 2066
 (gdb) print $wr_addr
 $6 = 14
 (gdb) print /x $rd_data_sram0_lo
 $7 = 0xe5e4e3e2e1e0dfde
 (gdb) print /x a_in[8]
 [Switching to Thread 2305843009292443296 (LWP 24946)]
 $8 = 0x4746454443424140
 (gdb) fpgacont
 (gdb) print $rd_addr
 $9 = 3071
 (gdb) print $wr_addr
 $10 = 0
 (gdb) print /x $rd_data_sram0_lo
 $11 = 0xa5a4a3a2a1a09f9e
 (gdb) print /x b_in[3]
 [Switching to Thread 2305843009292443296 (LWP 24946)]
 $12 = 0x1e1d1c1b1a191817
 (gdb) delete
 Delete all breakpoints? (y or n) y
 (gdb) cont
 Continuing.
 [Thread 2305843009312551504 (LWP 24951) exited]
 success
 Program exited normally.
 (gdb) quit

Many other commands are available. For more information on these commands, see “Using the GNU Project Debugger (GDB)” in Chapter 9.

Data Flow Algorithm Tutorial

This example algorithm illustrates the optimization considerations for multi-buffering a complex algorithm on the RASC platform. This section steps you through the design process for this algorithm, using source code written in Verilog that can be stepped clock by clock in the debugger.

Application

The application for the data flow algorithm is slightly more complex. This example creates a 16 KB array, sorts the bytes of each double-word from most-significant to least-significant byte order, runs a string search on the sorted data against a match string, and then performs a pop count. The location of the application code that performs this operation on both the Altix system and the FPGA is provided below; the two sets of results are compared to verify the algorithm implementation. The application C code is on your Altix system at the following location:

/usr/share/rasc/examples/alg10.c

Loading the Tutorial

Begin by loading the hardware description files into your text editor. Change directory to $RASC/examples/alg_data_flow_v/

and you will see several files:

alg_block_top.v, sort_byte.v, string_search.v and pop_cnt.v.

If you look through the files, you will see that the data flow algorithm reads 16 KB of data from SRAM 0. It then sorts the bytes of each double-word of the input data from most-significant to least-significant byte order. The algorithm writes those results out to SRAM 1 and then performs a string search on the sorted data with a 16-bit match string that is provided by the application writer. The match tags resulting from the string search are written out to SRAM 1, and a population count is then run on the data. The resulting population count is written to debug register 1.
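
A plain C rendering of two of these stages can be useful when checking results by hand. The sketch below assumes that the byte sort orders the eight bytes of each double-word by value, with the largest byte in the most significant position; the string-search stage and the FPGA interface code are omitted, and the names are placeholders rather than code taken from alg10.c or the Verilog sources.

#include <stdint.h>

/* Sort the 8 bytes of dw so that the largest byte value ends up in the
 * most significant position (simple insertion sort over 8 bytes). */
static uint64_t sort_bytes(uint64_t dw)
{
    uint8_t b[8];
    for (int i = 0; i < 8; i++)
        b[i] = (uint8_t)(dw >> (8 * i));
    for (int i = 1; i < 8; i++) {              /* descending insertion sort */
        uint8_t key = b[i];
        int j = i - 1;
        while (j >= 0 && b[j] < key) {
            b[j + 1] = b[j];
            j--;
        }
        b[j + 1] = key;
    }
    uint64_t out = 0;
    for (int i = 0; i < 8; i++)                /* b[0] is the largest byte */
        out = (out << 8) | b[i];
    return out;
}

/* Population count: number of set bits in a 64-bit word. */
static unsigned pop_count(uint64_t v)
{
    unsigned n = 0;
    while (v) {
        v &= v - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}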

Figure 10-4 contains a diagram of the major computational blocks and the memory access patterns for the data flow algorithm.

Figure 10-4. Data Flow Algorithm

Integrating with Core Services

Extractor Comments

Extractor comments are inserted in the hierarchy to describe the algorithm. In the alg_block_top.v file for the data flow algorithm the following comments exist:

//  extractor CS:1.0
//  extractor VERSION:10.4
//  extractor SRAM:data_in                2048 64 sram[0] 0x0000 in  u stream
//  extractor SRAM:sort_output            2048 64 sram[1] 0x0000 out u stream
//  extractor SRAM:bitsearch_match_vector 2048 64 sram[1] 0x8000 out u stream
//  extractor REG_IN:match_string 16 u alg_def_reg[0][15:0]
//  extractor REG_IN:multi_iter_rst 64 u alg_def_reg[2]
//  extractor REG_IN:op_length1 11 u alg_def_reg[1][10:0]
//  extractor REG_OUT:version 64 u debug_port[0]
//  extractor REG_OUT:running_pop_count 64 u debug_port[1]
//  extractor REG_OUT:sort_input 64 u debug_port[2]
//  extractor REG_OUT:sort_output 128 u debug_port[3]
//  extractor REG_OUT:match_vector 64 u debug_port[5]
//  extractor REG_OUT:pipe_vld 64 u debug_port[6]
//  extractor REG_OUT:dw_pcnt 64 u debug_port[8]
//  extractor REG_OUT:f_pcnt_63_0 64 u debug_port[9]

The core services tag records which version of core services was used to generate the bitstream, which is useful when debugging. The version tag lets you see from a GDB session which algorithm and revision is loaded. The register out tag (REG_OUT) specifies registers that are pulled out to the debug mux. The SRAM tag describes arrays that are written to or read from the SRAM banks by the algorithm.


Note: The handshaking methodologies (see “Handshaking Methodologies” in Chapter 3) are not used here since it is assumed SRAM port access conflicts will never occur between the algorithm and DMA.


Compiling for Simulation

To build the generic SSP stub test bench for the data flow algorithm, change to the $RASC/dv/sample_tb directory. At the prompt, enter the following:

% make ALG=alg_data_flow_v

If you do not enter the ALG tag, the makefile will default to compiling for alg_simple_v.

To run the diagnostic alg_data_flow_v, at the prompt, enter the following:

% make run DIAG=diags/alg_data_flow_v ALG=alg_data_flow_v

As when building the test bench, the ALG tag defaults to alg_simple_v and must be overridden for the data flow algorithm.

The SRAM mapping to physical memory is as follows:

mem0[127:64] -> qdr_sram_bank1.SMEM -> init_sram1_*.dat, final_sram1.dat
mem0[63:0  ] -> qdr_sram_bank0.SMEM -> init_sram0_*.dat, final_sram0.dat
mem1[127:64] -> qdr_sram_bank3.SMEM -> init_sram3_*.dat, final_sram3.dat
mem1[63:0  ] -> qdr_sram_bank4.SMEM -> init_sram2_*.dat, final_sram2.dat
mem2[63:0  ] -> qdr_sram_bank2.SMEM -> init_sram4_*.dat, final_sram4.dat

These file names can all be overridden on the command line.

By specifying the SRAM input files, the user can skip the DMA process for the purposes of testing the algorithm, providing a fast check for the algorithm without verifying the DMA engines of core services. You can also use a simple C program called check_alg_data_flow.c to verify the test results.

Running a diag through this test bench produces results in four formats:

  • *vcdplus.vpd* - this file contains the simulation results for the run.

  • *terminal output* - the status of the test is output to the screen as it runs, notifying the user of packets sent/received.

  • *log file* - the output to the screen is also stored in the log file: <diag_name>.<alg_name>.run.log (for example, dma_alg_data_flow_v.alg_data_flow_v.run.log)

  • *sram output files* - at the end of simulation (when the diag finishes because it has been successful, an incorrect packet has been received, or time-out has occurred), the contents of the 4 SRAMs are dumped to .dat files (the defaults or user-specified files).

Building an Implementation

When the algorithm has been integrated and verified, it is time to build an implementation.

Change directories to $RASC/implementations/alg_data_flow_v/

To synthesize the design, type make synplify or make xst. This is set up to utilize the black-boxed version of core services and should synthesize faster.

To generate the required metadata information for the abstraction layer and the debugger, you need to run the extractor script on your file. The physical design makefile includes a make extractor target for this purpose. When it is executed, it will generate two configuration files--one describing core services, and one describing the algorithm behavior.

To execute the ISE foundation tools and run the extractor script on the file, type make all. This will take approximately one to two hours due to the complex mapping and place-and-route algorithms executed by the ISE tools. Please note that the details of setting up your own project are described in Chapter 7, “RASC Algorithm FPGA Implementation Guide”.

Transferring to the Altix Platform

To transfer to the Altix platform, you must add your RASC design implementation into the Device Manager registry by performing the following steps:

  1. Use FTP to move the algorithm files from the PC to the /usr/share/rasc/bitstreams/ directory on the Altix machine:

    $RASC/implementations/alg_data_flow_v/rev_1/alg_data_flow_v.bin
    $RASC/implementations/alg_data_flow_v/<core_services>.cfg
    $RASC/implementations/alg_data_flow_v/<user_space>.cfg
    

  2. Log into the Altix machine and execute the Device Manager user command devmgr

    devmgr -a -n alg10 -b alg_data_flow_v.bin -c <user_space>.cfg -s <core_services>.cfg
    

The build scripts default the bitstream and configuration files to these names; however, the Device Manager can add files of any name to the registry, so you can rename project files as convenient.

Verification Using GDB

To run a debug session on this bitstream, you must start the application from a GDB session window. To do that, you must execute the extended GDB on the application detailed at the beginning of this example.

% gdbfpga /usr/share/rasc/examples/alg10
GNU gdb 6.3.50.20050510
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "ia64-unknown-linux-gnu"...Using host libthread_db library "/lib/libthread_db.so.1".

(gdb) break rasclib_brkpt_start
Function "rasclib_brkpt_start" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (rasclib_brkpt_start) pending.
(gdb) handle SIGUSR1 nostop pass noprint
Signal        Stop      Print   Pass to program Description
SIGUSR1       No        No      Yes             User defined signal 1
(gdb) run
Starting program: /usr/share/rasc/examples/alg10 
Failed to read a valid object file image from memory.
[Thread debugging using libthread_db enabled]
[New Thread 2305843009292443296 (LWP 1157)]
Breakpoint 2 at 0x2000000000085bb1: file rasclib_debug.c, line 86.
Pending breakpoint "rasclib_brkpt_start" resolved
[New Thread 2305843009303867984 (LWP 1160)]
[New Thread 2305843009312551504 (LWP 1169)]
fpga config file def reg count (64) of /var/rasc/rasc_registry/alg10/bitstream.cfg exceeds gdb current maximum of 8, excess ignored
[Switching to Thread 2305843009292443296 (LWP 1157)]

Breakpoint 2, rasclib_brkpt_start (cop_desc=547920) at rasclib_debug.c:86
86      rasclib_debug.c: No such file or directory.
        in rasclib_debug.c
(gdb) info fpga
fpga 0
  Active       : on
  State        : ready-to-fpgastep-fpgacont
  Algorithm id : alg10
  Core svc ver : 2.100000
  Algorithm ver: 10.400000
  Algorithm src: v
  Alg. dev     : 
  Alg. config  : /var/rasc/rasc_registry/alg10/bitstream.cfg
  CS version   : 2.100000
  CS config    : /var/rasc/rasc_registry/alg10/core_services.cfg
  prev step ct : 0
  step ct      : 0
(gdb) info fpgaregisters
version        0xa00000001      42949672961
running_pop_count0x0    0
sort_input     0x0      0
sort_output    0x0      0
match_vector   0x0      0
pipe_vld       0x0      0
dw_pcnt        0x0      0
f_pcnt_63_0    0x0      0
test_cnt       0x645900000000   110333414866944
match_string   0x0      0
multi_iter_rst 0x0      0
op_length1     0x0      0
(gdb) fpgastep 5
(gdb) info fpgaregisters
version        0xa00000001      42949672961
running_pop_count0x0    0
sort_input     0x0      0
sort_output    0x0      0
match_vector   0x0      0
pipe_vld       0x0      0
dw_pcnt        0x0      0
f_pcnt_63_0    0x0      0
test_cnt       0x800000004      34359738372
match_string   0x0      0
multi_iter_rst 0x0      0
op_length1     0x0      0
(gdb) fpgastep 3
(gdb) print /x $test_cnt
$1 = 0xb00000007
(gdb) print /x data_in[8]
[Switching to Thread 2305843009292443296 (LWP 1157)]
$2 = 0x67fce141a13ee970
(gdb) fpgastep 6
(gdb) print /x $test_cnt
$3 = 0x110000000d
(gdb) print /x data_in[14]
[Switching to Thread 2305843009292443296 (LWP 1157)]
$4 = 0xbb5cf98961bed875
(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) fpgacont
(gdb) cont
Continuing.
[Thread 2305843009312551504 (LWP 1169) exited]
success sorted
success match_list
popcnts = 10c
HW POP COUNT = 268
SW POP COUNT = 268

Program exited normally.
(gdb) quit

The above commands execute the application and then hit the breakpoint inserted by the rasclib_brkpt_start function call. At that stage, you can query generic data about the FPGA that is configured in the system. With stepping turned on, you can view internal registers and arrays within the session at different steps. Many other commands are available. For more information, see “GDB Commands” in Chapter 9.

Streaming DMA Algorithm Tutorial

This example algorithm illustrates the use of the streaming DMA feature available to an algorithm running on the RASC platform. This section steps you through the design process for this algorithm with source code in Verilog.

Application

The application for the streaming DMA algorithm sends data to the FPGA using an input stream and receives data from the FPGA using an output stream. This example creates an array of 512K 8-byte integers and increments each integer by a constant value. The application sets the number of integers and the increment value in the algorithm using the extractor symbols op_length1 and alg_inc_val, runs the algorithm, and verifies the results. The location of the application code that performs this operation on both the Altix system and the FPGA is provided below; the results are compared to verify the algorithm implementation. The application C code is on your Altix system at the following location:

/usr/share/rasc/examples/alg12_strm.c

A second application for the streaming DMA algorithm is identical to the preceding example, except that it uses an end-of-stream notification to signal the algorithm when all of the data has been sent. This application indicates to the algorithm that it will use end-of-stream notification instead of the count of integers by setting the algorithm's use_strm_in_complete variable; the op_length1 variable is not used.

/usr/share/rasc/examples/alg12_strm_eos.c
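
Conceptually, both applications apply the same transformation to every integer in the stream. The sketch below shows only that expected data transformation for host-side verification; the names are placeholders, and the rasclib calls that set op_length1 and alg_inc_val and attach the DMA streams are in alg12_strm.c and alg12_strm_eos.c, not shown here.

#include <stddef.h>
#include <stdint.h>

#define STREAM_INTS (512 * 1024)    /* 512K 8-byte integers */

/* Expected behavior of the streaming DMA algorithm: every 8-byte integer
 * sent on the input stream comes back on the output stream incremented
 * by the constant programmed into alg_inc_val. Returns 0 on success. */
static int verify_stream(const uint64_t *sent, const uint64_t *received,
                         uint64_t inc_val)
{
    for (size_t i = 0; i < STREAM_INTS; i++) {
        if (received[i] != sent[i] + inc_val)
            return 1;               /* mismatch */
    }
    return 0;
}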

Loading the Tutorial

Begin by loading the hardware description files into your text editor. Change directory to $RASC/examples/alg_dma_stream_v/

and you will see several files:

alg_block_top.v, acs_adr.v, acs_debug_reg.v and user_space_wrapper.v

If you look through the files, you will see that the streaming DMA algorithm reads the specified number of integers from the input stream, two values at a time. The algorithm then increments each integer by the specified value and writes the results, two values at a time, to the output stream.

 Figure 10-5 contains a diagram of the major computational blocks and the data access for the streaming DMA algorithm.

Figure 10-5. Streaming DMA Algorithm

Integrating with Core Services

Extractor Comments

Extractor comments are inserted in the hierarchy to describe the algorithm. In the alg_block_top.v file for the streaming DMA algorithm, the following comments exist:

//  extractor CS: 2.1
//  extractor VERSION:12.3
//  extractor STREAM_IN:op_in_strm 0 0
//  extractor STREAM_OUT:result_out_strm 0 0
//  extractor REG_IN:op_length1 18 u alg_def_reg[0][17:0]
//  extractor REG_IN:alg_inc_val 4 u alg_def_reg[1][3:0]
//  extractor REG_OUT:version 64 u debug_port[0]
//  extractor REG_OUT:alg_def0 64 u debug_port[1]
//  extractor REG_OUT:alg_def1 64 u debug_port[2]
//  extractor REG_OUT:cntl_sigs 64 u debug_port[9]
//  extractor REG_OUT:rd_data_lo 64 u debug_port[10]
//  extractor REG_OUT:rd_data_hi 64 u debug_port[11]
//  extractor REG_OUT:wr_data_lo 64 u debug_port[12]
//  extractor REG_OUT:wr_data_hi 64 u debug_port[13]

The core services tag records which version of core services was used to generate the bitstream, which is useful when debugging. The version tag allows you to track the algorithm revision using the output of the devmgr -q command. The register out tag (REG_OUT) specifies registers that are pulled out to the debug mux. The STREAM_IN and STREAM_OUT tags describe data written to or read from the DMA streams.

Compiling for Simulation

To build the generic SSP stub test bench for the streaming DMA algorithm, change to the $RASC/dv/sample_tb directory. At the prompt, enter the following:

% make ALG=alg_dma_stream_v

If you do not enter the ALG tag, the makefile will default to compiling for alg_simple_v.

To run the diagnostic alg_dma_stream_v, at the prompt, enter the following:

% make run DIAG=diags/alg_dma_stream_v ALG=alg_dma_stream_v

As when building the test bench, the ALG tag defaults to alg_simple_v and must be overridden for the streaming DMA algorithm.

The algorithm input stream data is coded as cache line packets in the alg_dma_stream_v diagnostic. The input data is identified by this leading comment:

print "\n\n*******Start Stream Read DMA Engine 0\n\n";

and cache lines of data are coded as a receive packet request and a packet response with 128 bytes of input stream data. In this diagnostic, a total of 32 cache lines of input stream data is sent to the algorithm. An example follows:

# 1 of 32
rcv_rd_req ( MEM, FCL, ANY, 0x00000000100000 );
snd_rd_rsp ( MEM, FCL, ANY, 0, 0x7654321076543210, 0x7654321076543210,
                               0x7654321076543210, 0x7654321076543210,
                               0x7654321076543210, 0x7654321076543210,
                               0x7654321076543210, 0x7654321076543210,
                               0x7654321076543210, 0x7654321076543210,
                               0x7654321076543210, 0x7654321076543210,
                               0x7654321076543210, 0x7654321076543210,
                               0x7654321076543210, 0x7654321076543210 );

The diagnostic also handles output stream data from the algorithm in cache line units. The output packet request contains the output stream data, and the response acknowledges receipt of that request. The receive packet request is coded with the expected values in that cache line, and a data miscompare between the expected and simulated output causes the diagnostic to fail. The expected output stream data is identified by a leading comment, as follows:

print "\n\n*******Starting Stream 0 Write DMA Engine.\n\n";

In this diagnostic, a total of 32 cache lines of output stream data is expected from the algorithm; each input value of 0x7654321076543210 is expected back as 0x7654321076543215, which corresponds to an increment value of 5. An example follows:

# 1 of 32                                                           
rcv_wr_req ( MEM, FCL, ANY, 0x00000000100000, 0x7654321076543215, 0x7654321076543215,
                                              0x7654321076543215, 0x7654321076543215,
                                              0x7654321076543215, 0x7654321076543215,
                                              0x7654321076543215, 0x7654321076543215,
                                              0x7654321076543215, 0x7654321076543215,
                                              0x7654321076543215, 0x7654321076543215,
                                              0x7654321076543215, 0x7654321076543215,
                                              0x7654321076543215, 0x7654321076543215 );
snd_wr_rsp ( MEM, FCL, ANY, 0 );

Running a diag through this test bench produces results in four formats:

  • *vcdplus.vpd* - this file contains the simulation results for the run.

  • *terminal output* - the status of the test is output to the screen as it runs, notifying the user of packets sent/received.

  • *log file* - the output to the screen is also stored in the log file: <diag_name>.<alg_name>.run.log (for example, alg_dma_stream_v.alg_dma_stream_v.run.log)

Building an Implementation

When the algorithm has been integrated and verified, it is time to build an implementation.

Change directories to $RASC/implementations/alg_dma_stream_v/

To synthesize the design, type make synplify or make xst. This is set up to utilize the black-boxed version of core services and should synthesize faster.

To generate the required metadata information for the abstraction layer, you need to run the extractor script on your file. The physical design makefile includes a make extractor target for this purpose. When it is executed, it will generate two configuration files--one describing core services, and one describing the algorithm behavior.

To execute the ISE foundation tools and run the extractor script on the file, type make all. This will take approximately one to two hours due to the complex mapping and place-and-route algorithms executed by the ISE tools. Please note that the details of setting up your own project are described in Chapter 7, “RASC Algorithm FPGA Implementation Guide”.

Transferring to the Altix Platform

To transfer to the Altix platform, you must add your RASC design implementation into the Device Manager registry by performing the following steps:

  1. Use FTP to move the algorithm files from the PC to the /usr/share/rasc/bitstreams/ directory on the Altix machine:

    $RASC/implementations/alg_dma_stream_v/rev_1/alg_dma_stream_v.bin
    $RASC/implementations/alg_dma_stream_v/<core_services>.cfg
    $RASC/implementations/alg_dma_stream_v/<user_space>.cfg
    

  2. Log into the Altix machine and execute the Device Manager user command devmgr

    devmgr -a -n alg12_strm -b alg_dma_stream_v.bin -c <user_space>.cfg -s <core_services>.cfg
    

The build scripts default the bitstream and configuration files to these names; however, the Device Manager can add files of any name to the registry, so you can rename project files as convenient.

Verification Using GDB

The GDB debugger does not support the streaming DMA feature and cannot be used to debug the alg_dma_stream_v algorithm.