This chapter contains Reconfigurable Application-Specific Computing Software (RASC) examples and tutorials and covers these topics:
For design, synthesis, and bitstream generation, you need the following:
PC with 1 GHz or greater clock speed
At least 8 Gbytes random access memory (RAM)
Red Hat Enterprise Linux version 3.0 or later
Xilinx ISE development tools (version 9.2i, Service Pack 3 or higher)
Optional: High-level language compiler.
Optional: Third-party FPGA synthesis software supporting Xilinx FPGAs (such as Synplicity Synplify Pro 8.9 or later)
Optional: Third-party HDL simulation software
For bitstream download, algorithm acceleration, and real-time verification you need the following:
One Altix system
One or more RASC bricks or blades
SGI ProPack 5 for Linux
RASC software module
A network connection to the PC detailed earlier
The information and tutorials in this Examples and Tutorials section of the User Guide assume that you have already installed and familiarized yourself with the Xilinx ISE tools and all optional software. They also assume that you have read Chapter 7, “RASC Algorithm FPGA Implementation Guide” and Chapter 9, “Running and Debugging Your Application”, and that you have some experience with Verilog and/or VHDL.
Additional background information (not from SGI) is available in the following:
Xilinx ISE Software Manuals and Help
Synplicity's Synplify Pro User Guide and Tutorial
The following tutorials illustrate the implementation details of the algorithm programming interface using two different algorithms. In the following sections you will learn how to integrate algorithms written in hardware description languages (HDLs) into the RASC brick or RASC blade. You will also see a subset of the optimizations that can be made for RASC implementations.
For both algorithms we will step through the entire RASC design flow: integrating the algorithm with core service; simulating behavior on the algorithm interfaces; synthesizing the algorithm code; generating a bitstream; transferring that bitstream and metadata to the Altix platform; executing an application; and using GDB to debug an application on the Altix system and FPGA simultaneously.
These tutorials only illustrate a subset of the options available for implementing an algorithm on RASC. For more details, see Chapter 3, “RASC Algorithm FPGA Hardware Design Guide” and Chapter 7, “RASC Algorithm FPGA Implementation Guide”.
The Verilog example code is on your x86 system because Verilog code is compiled with Xilinx XST, which runs on the x86 system. The example application C code is compiled by the C compiler that runs on your Altix system.
The first algorithm we will use to describe the interfaces and various programming templates for RASC is (d = a & b | c). This simple algorithm allows you to compare coding options and analyze optimization techniques. This section steps through integrating an algorithm written in Verilog and VHDL. Then it will demonstrate simulation, synthesis, bitstream generation, platform transfer, running and debugging the application. These steps are the same for this algorithm regardless of the coding technique used.
Figure 10-1 contains a diagram of the algorithm and its memory patterns.
The application that runs (d = a & b | c) on the Altix platform is fairly simple. The following code demonstrates the RASC Abstraction Layer calls that are required to utilize the RASC-brick as a Co-Processing (COP) unit. The application also runs the algorithm on the Altix box to verify its results. The application C code is on your Altix system at the following location:
/usr/share/rasc/examples/alg6.c
This simple application runs quickly on an Altix machine, although it is not optimized C code. Please note that this application is not chosen to emphasize the optimizations available from acceleration in hardware, but rather to compare and contrast the various programming methods for RASC. As you work through the tutorials for the different languages, there will be similarities and differences that highlight advantages of one methodology versus another. For a more computationally intensive example, please see the Data Flow Algorithm in Verilog later in this chapter.
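The software check that the application performs on the Altix side can be sketched as follows. This is a minimal illustration of the arithmetic only, not the contents of alg6.c; the function names alg6_reference and alg6_first_mismatch are hypothetical, and no RASC Abstraction Layer calls are shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software reference for the simple algorithm: d[i] = (a[i] & b[i]) | c[i].
 * In C, & binds tighter than |, so the parentheses only make the
 * grouping explicit. */
void alg6_reference(const uint64_t *a, const uint64_t *b,
                    const uint64_t *c, uint64_t *d, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL && d != NULL);
    for (size_t i = 0; i < n; i++)
        d[i] = (a[i] & b[i]) | c[i];
}

/* Returns the index of the first element where the hardware result and
 * the software reference differ, or n if every element agrees. */
size_t alg6_first_mismatch(const uint64_t *hw, const uint64_t *sw, size_t n)
{
    assert(hw != NULL && sw != NULL);
    for (size_t i = 0; i < n; i++)
        if (hw[i] != sw[i])
            return i;
    return n;
}
```

With the array sizes used in this chapter, n would be 2048, and a return value of 2048 from alg6_first_mismatch would mean the FPGA output matches the reference.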
First we will analyze how to write a Verilog version of (d = a & b | c) for RASC. It is important to note that the source code for this example allows for multi-buffering.
Begin by loading the hardware description file for the Verilog algorithm, alg_block_top.v, into your text editor. Change directory to $RASC/examples/alg_simple_v and select alg_block_top.v.
This file is the top-level module for the algorithm. The other file that is required for this and all other implementations is alg.h. This Verilog version of (d = a & b | c) reads a from the first address in SRAM 0, b from the 16384th address of SRAM 0, c from the 32768th address in SRAM 0, and then it writes out the resulting value, d, to the first address of SRAM 1. Arrays a, b, c, and d are 2048 elements long where each element is a 64-bit unsigned integer, and all the arrays are enabled for multi-buffering by the RASC Abstraction Layer. The version of the algorithm, read data, write data, read address, write address, and control signals are all brought out to debug mux registers.
Figure 10-2 contains a diagram of the algorithm and its memory access patterns.
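The starting addresses quoted above follow from the array size. The sketch below works through that arithmetic, assuming the offsets are byte addresses (which is consistent with 2048 elements of 8 bytes each occupying 16384 bytes); the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ELEM_BITS   64u    /* element width given in the text */
#define ELEM_COUNT  2048u  /* elements per array given in the text */

/* Bytes occupied by one array: 2048 * (64 / 8) = 16384 = 0x4000. */
uint32_t array_bytes(void)
{
    return ELEM_COUNT * (ELEM_BITS / 8u);
}

/* Byte offset of the k-th input array when the arrays are packed back
 * to back in SRAM 0 (k = 0 for a, 1 for b, 2 for c). */
uint32_t array_offset(uint32_t k)
{
    assert(k < 3u);
    return k * array_bytes();
}
```

Under that assumption, array_offset(1) and array_offset(2) give 0x4000 and 0x8000, the same offsets that appear in the extractor SRAM comments for b_in and c_in.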
The source code is available on your x86 system at the following location:
$RASC/examples/alg_simple_v/alg_block_top.v
Note: The handshaking methodologies (see “Handshaking Methodologies” in Chapter 3) are not used here since it is assumed SRAM port access conflicts will never occur between the algorithm and DMA.
Other important source code considerations include adding the extractor comments that are required for accurate data movement and debugger control by RASClib. A Python script called extractor parses all the Verilog, VHDL, and header files in your algorithm directory to generate the symbol tables required by GDB and to tell the abstraction layer which data should be written to and read from the SRAM banks.
Comment fields to generate the configuration files for the algorithm are inserted in these examples. There is a template in the alg_core directory, and several examples. The comment fields can be located in any file in or below the directory specified as the second argument to the extractor call (see Chapter 7, “RASC Algorithm FPGA Implementation Guide” for more detail on how to specify the makefile target). The fields are core services version, algorithm version, SRAM denoting where data will be read from and written to on the SRAM interface, register in for parameters set through an application writer's code, or register out for a section of code that needs to be mapped to a debug register.
The debug comments for metadata parsing in this file are:
// extractor VERSION: 6.3
// extractor CS: 2.1
// extractor SRAM:a_in 2048 64 sram[0] 0x0000 in u stream
// extractor SRAM:b_in 2048 64 sram[0] 0x4000 in u stream
// extractor SRAM:c_in 2048 64 sram[0] 0x8000 in u stream
// extractor SRAM:d_out 2048 64 sram[1] 0x0000 out u stream
// extractor REG_IN:op_length1 10 u alg_def_reg[0][9:0]
// extractor REG_OUT:alg_id 32 u debug_port[0][63:32]
// extractor REG_OUT:alg_rev 32 u debug_port[0][31:0]
// extractor REG_OUT:rd_addr 64 u debug_port[1]
// extractor REG_OUT:rd_data_sram0_lo 64 u debug_port[2]
// extractor REG_OUT:rd_data_sram0_hi 64 u debug_port[3]
// extractor REG_OUT:wr_addr 64 u debug_port[4]
// extractor REG_OUT:wr_data_sram1_lo 64 u debug_port[5]
// extractor REG_OUT:wr_data_sram1_hi 64 u debug_port[6]
// extractor REG_OUT:cntl_sigs 64 u debug_port[7]
// extractor REG_OUT:dummy_param0_out 16 u debug_port[8][15:0]
// extractor REG_OUT:dummy_param1_out 16 u debug_port[8][31:16]
// extractor REG_OUT:dummy_param2_out 16 u debug_port[8][47:32]
// extractor REG_OUT:dummy_param3_out 16 u debug_port[8][63:48]
These comments are located within alg_block_top.v in this case, but they can be anywhere within the algorithm hierarchy, in a header or source file. The core services tag (CS) records which version of core services was used to generate a bitstream, which is useful when debugging. The version tag (VERSION) lets the user see from a GDB session which algorithm and revision is loaded. The register out tag (REG_OUT) specifies registers that are pulled out to the debug mux. The SRAM tag describes arrays that the algorithm reads from or writes to the SRAM banks. For more information, see Chapter 7, “RASC Algorithm FPGA Implementation Guide”.
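To make the shape of an SRAM tag concrete, the sketch below pulls the fields out of one comment line with sscanf. It is not the extractor script, which is written in Python and whose exact grammar is not given here. The struct sram_tag type and parse_sram_tag function are hypothetical, the field meanings are inferred from the prose of this chapter, and the trailing keyword (for example, stream) is stored without interpretation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed "extractor SRAM:" comment. Field meanings are inferred
 * from this chapter, not from the extractor's source. */
struct sram_tag {
    char     name[32];  /* array name, e.g. "a_in" */
    unsigned count;     /* number of elements, e.g. 2048 */
    unsigned width;     /* element width in bits, e.g. 64 */
    unsigned bank;      /* index inside sram[...] */
    unsigned offset;    /* offset within the bank, e.g. 0x4000 */
    char     dir[8];    /* "in" or "out" */
    char     type;      /* 'u' for unsigned */
    char     mode[16];  /* trailing keyword, kept opaque */
};

/* Returns 1 and fills *tag if the line holds a complete SRAM tag,
 * else 0. Searching for the keyword rather than the comment marker
 * lets the same code accept "//" (Verilog) and "--" (VHDL) lines. */
int parse_sram_tag(const char *line, struct sram_tag *tag)
{
    static const char key[] = "extractor SRAM:";
    const char *p;

    assert(line != NULL && tag != NULL);
    p = strstr(line, key);
    if (p == NULL)
        return 0;
    p += strlen(key);
    /* %x accepts an optional 0x prefix; " %c" skips leading blanks. */
    return sscanf(p, "%31s %u %u sram[%u] %x %7s %c %15s",
                  tag->name, &tag->count, &tag->width, &tag->bank,
                  &tag->offset, tag->dir, &tag->type, tag->mode) == 8;
}
```

Passing the b_in line from the listing above would yield count 2048, width 64, bank 0, and offset 0x4000, while a VERSION or REG_OUT line would return 0.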
Now we will analyze how to write a VHDL version of (d = a & b | c) for RASC. This source code also allows for multi-buffering.
Figure 10-3 contains a diagram of the algorithm and its memory patterns.
Begin by loading the hardware description file for the VHDL algorithm, alg_block.vhd, into your text editor. Change directory to $RASC/examples/alg_simple_vhd where you will see alg_block_top.v and alg_block.vhd.
These files are the top level module and computation block, respectively. This VHDL version of (d = a & b | c) reads a from the first address in SRAM 0, b from the 16384th address of SRAM 0, c from the 32768th address in SRAM 0, and then it writes out the resulting value, d, to the first address of SRAM 1. Arrays a, b, c, and d are 2048 elements long where each element is a 64-bit unsigned integer, and all the arrays are enabled for multi-buffering by the RASC Abstraction Layer. The version of the algorithm, read data, write data, read address, and write address for the algorithm are all brought out to debug mux registers.
The source for the alg_block_top.v file is similar in this instance to the Verilog version of (d = a & b | c), except that it performs no calculation. Instead, it wraps alg_block.vhd so that it can communicate with the user_space_wrapper.v that instantiates it. A computation block does not need to be wrapped in a Verilog module; in this case it was done for convenience.
Note: The handshaking methodologies (see “Handshaking Methodologies” in Chapter 3) are not used here since it is assumed SRAM port access conflicts will never occur between the algorithm and DMA.
Other important source code considerations include adding the extractor comments that are required for accurate data movement and debugger control. A Python script called extractor parses all the Verilog, VHDL, and header files in your algorithm directory to generate the symbol tables required by GDB and to tell the abstraction layer which data should be written to and read from the SRAM banks.
Comment fields to generate the configuration files for the algorithm are provided in this example for alg_block.vhd. There is a template in the alg_core directory, and several examples. The comment fields can be located anywhere in or below the directory specified in the second argument to the extractor call (see Chapter 7, “RASC Algorithm FPGA Implementation Guide” for more detail on how to specify the makefile target). The fields are core services version, algorithm version, SRAM denoting where data will be read from and written to on the SRAM interface, register in for parameters set through an application writer's code, or register out for a section of code that needs to be mapped to a debug register.
The debug comments for metadata parsing in this file are embedded in the VHDL code. They appear as:
-- extractor VERSION: 9.1
-- extractor SRAM:a_in 2048 64 sram[0] 0x0000 in u stream
-- extractor SRAM:b_in 2048 64 sram[0] 0x4000 in u stream
-- extractor SRAM:c_in 2048 64 sram[0] 0x8000 in u stream
-- extractor SRAM:d_out 2048 64 sram[1] 0x0000 out u stream
-- extractor REG_OUT:version 64 u debug_port[0]
-- extractor REG_OUT:rd_addr 64 u debug_port[1]
-- extractor REG_OUT:rd_data_sram0_lo 64 u debug_port[2]
-- extractor REG_OUT:rd_data_sram0_hi 64 u debug_port[3]
-- extractor REG_OUT:wr_addr 64 u debug_port[4]
-- extractor REG_OUT:wr_data_sram1_lo 64 u debug_port[5]
-- extractor REG_OUT:wr_data_sram1_hi 64 u debug_port[6]
These comments are located within alg_block.vhd in this case, but they can be anywhere within the algorithm hierarchy as a header or source file. The core services tag helps describe what version of core services was used in generating a bitstream. This information is useful when debugging. The version tag allows the user to understand from his GDB session which algorithm and revision he has loaded. The register out tag (REG_OUT) specifies registers that are mapped to the debug mux. The SRAM tag is to describe arrays that are written to or read from the SRAM banks by the algorithm.
To build the generic SSP stub test bench for the Verilog version of the simple algorithm, change to the $RASC/dv/sample_tb directory. At the prompt, enter the following:
% make ALG=alg_simple_v
If you do not enter the ALG tag, the makefile will default to compiling for alg_simple_v.
To run the diagnostic alg_simple_v, at the prompt, enter the following:
% make run DIAG=diags/alg_simple_v ALG=alg_simple_v
As when building the test bench, the ALG tag defaults to alg_simple_v, so it must be overridden on the command line for any other algorithm, such as the data flow algorithm.
The SRAM mapping to physical memory is as follows:
mem0[127:64] -> qdr_sram_bank1.SMEM -> init_sram1_*.dat, final_sram1.dat
mem0[63:0 ] -> qdr_sram_bank0.SMEM -> init_sram0_*.dat, final_sram0.dat
mem1[127:64] -> qdr_sram_bank3.SMEM -> init_sram3_*.dat, final_sram3.dat
mem1[63:0 ] -> qdr_sram_bank4.SMEM -> init_sram2_*.dat, final_sram2.dat
mem2[63:0 ] -> qdr_sram_bank2.SMEM -> init_sram4_*.dat, final_sram4.dat
These file names can all be overridden on the command line.
By specifying the SRAM input files, the user can skip the DMA process for the purposes of testing the algorithm, providing a fast check for the algorithm without verifying the DMA engines of core services.
You can also use a simple C program called check_alg_simple.c to verify the test results; build and run it to analyze the initial and final SRAM simulation contents.
Running a diag through this test bench produces results in four formats:
*vcdplus.vpd* - this file contains the simulation results for the run.
*terminal output* - the status of the test is output to the screen as it runs, notifying the user of packets sent/received.
*log file* - the output to the screen is also stored in the log file <diag_name>.<alg_name>.run.log (for example, dma_alg_simple_v.alg_simple_v.run.log).
*sram output files* - at the end of simulation (when the diag finishes because it has been successful, an incorrect packet has been received, or a time-out has occurred), the contents of all SRAMs are dumped to the corresponding .dat output files (the defaults or user-specified files).
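If you post-process simulation runs from a helper program, the log file name can be rebuilt from the diag and algorithm names using the pattern given in the list above. The following is a small sketch of that string construction; run_log_name is a hypothetical helper and is not part of the test bench.

```c
#include <assert.h>
#include <stdio.h>

/* Builds "<diag_name>.<alg_name>.run.log" into buf.
 * Returns 0 on success, or -1 if buf is too small for the full name. */
int run_log_name(char *buf, size_t size, const char *diag, const char *alg)
{
    int n;

    assert(buf != NULL && diag != NULL && alg != NULL);
    n = snprintf(buf, size, "%s.%s.run.log", diag, alg);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

Called with "dma_alg_simple_v" and "alg_simple_v", the helper produces dma_alg_simple_v.alg_simple_v.run.log, which matches the example name in the list.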
When the algorithm has been integrated and verified, it is time to build an implementation.
Change directories to $RASC/implementations/alg_simple_*/
To synthesize the design, use one of the following commands:
make synplify
or
make xst
This is set up to utilize the black-boxed version of core services and should synthesize faster.
To generate the required metadata information for the abstraction layer and the debugger, you need to run the extractor script on your file. The physical design makefile includes a make extractor target for this purpose. When it is executed, it will generate two configuration files--one describing core services, and one describing the algorithm behavior.
To execute the ISE foundation tools and run the extractor script on the file, enter the following command:
make all
This will take approximately one to two hours due to the complex mapping and place and route algorithms executed by the ISE tools. Please note that the details of setting up your own project are described in Chapter 7, “RASC Algorithm FPGA Implementation Guide”, specifically in “Installation and Setup” in Chapter 7.
To transfer to the Altix platform, you must add your RASC design implementation into the Device Manager registry. This transfer must occur regardless of the algorithm generation method.
Use FTP to move the algorithm files from the PC to the /usr/share/rasc/bitstreams directory on the Altix machine:
$RASC/implementations/alg_simple_*/rev_1/alg_simple_v.bin
$RASC/implementations/alg_simple_*/<core_services>.cfg
$RASC/implementations/alg_simple_*/<user_space>.cfg
Log into the Altix machine and execute the Device Manager user command devmgr:
devmgr -a -n alg6 -b alg_simple_v.bin -c <user_space>.cfg -s <core_services>.cfg
The script will default the bitstream and configuration files to these names, although the device manager can add files of any name to the registry, so users should feel free to rename project files as convenient.
To run a debug session on this bitstream, you must start the application from a GDB session window. GDB is enabled with all versions of this algorithm. To run an application using RASClib, you must execute the extended GDB on the application detailed at the beginning of this example.
% gdbfpga /usr/share/rasc/examples/alg6
GNU gdb 6.3.50.20050510
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "ia64-unknown-linux-gnu"...Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) break rasclib_brkpt_start
Function "rasclib_brkpt_start" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (rasclib_brkpt_start) pending.
(gdb) break rasclib_brkpt_done
Function "rasclib_brkpt_done" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 2 (rasclib_brkpt_done) pending.
(gdb) handle SIGUSR1 nostop pass noprint
Signal Stop Print Pass to program Description
SIGUSR1 No No Yes User defined signal 1
(gdb) run
Starting program: /usr/share/rasc/examples/alg6
Failed to read a valid object file image from memory.
[Thread debugging using libthread_db enabled]
[New Thread 2305843009292443296 (LWP 24946)]
Breakpoint 3 at 0x2000000000085bb1: file rasclib_debug.c, line 86.
Pending breakpoint "rasclib_brkpt_start" resolved
Breakpoint 4 at 0x2000000000085b82: file rasclib_debug.c, line 104.
Pending breakpoint "rasclib_brkpt_done" resolved
[New Thread 2305843009303867984 (LWP 24949)]
[New Thread 2305843009312551504 (LWP 24951)]
fpga config file def reg count (64) of /var/rasc/rasc_registry/alg6/bitstream.cfg exceeds gdb current maximum of 8, excess ignored
[Switching to Thread 2305843009292443296 (LWP 24946)]
Breakpoint 3, rasclib_brkpt_start (cop_desc=547920) at rasclib_debug.c:86
86 rasclib_debug.c: No such file or directory.
in rasclib_debug.c
(gdb) info fpga
fpga 0
Active : on
State : ready-to-fpgastep-fpgacont
Algorithm id : alg6
Core svc ver : 2.100000
Algorithm ver: 6.300000
Algorithm src: v
Alg. dev :
Alg. config : /var/rasc/rasc_registry/alg6/bitstream.cfg
CS version : 2.100000
CS config : /var/rasc/rasc_registry/alg6/core_services.cfg
prev step ct : 0
step ct : 0
(gdb) info fpgaregisters
alg_id 0x6 6
alg_rev 0x3 3
rd_addr 0x0 0
rd_data_sram0_lo0xa5a4a3a2a1a09f9e 11935844831330344862
rd_data_sram0_hi0xadacabaaa9a8a7a6 12514566214034958246
wr_addr 0x0 0
wr_data_sram1_lo0xf7f4f7f2f3f0fffe 17867178244532404222
wr_data_sram1_hi0xfffcfffafbf8fff6 18445899627237015542
cntl_sigs 0x0 0
dummy_param0_out0x0 0
dummy_param1_out0x100 256
dummy_param2_out0x20 32
dummy_param3_out0x3303 13059
op_length1 0x3ff 1023
dummy_param0_in0x0 0
dummy_param1_in0x100 256
dummy_param2_in0x20 32
dummy_param3_
(gdb) fpgastep 55
(gdb) info fpgaregisters
alg_id 0x6 6
alg_rev 0x3 3
rd_addr 0x811 2065
rd_data_sram0_lo0xd5d4d3d2d1d0cfce 15408173127558025166
rd_data_sram0_hi0xdddcdbdad9d8d7d6 15986894510262638550
wr_addr 0xd 13
wr_data_sram1_lo0xd7d6d5d4d3d2d1d0 15552853473234178512
wr_data_sram1_hi0xdfdedddcdbdad9d8 16131574855938791896
cntl_sigs 0x3 3
dummy_param0_out0x0 0
dummy_param1_out0x100 256
dummy_param2_out0x20 32
dummy_param3_out0x3303 13059
op_length1 0x3ff 1023
dummy_param0_in0x0 0
dummy_param1_in0x100 256
dummy_param2_in0x20 32
dummy_param3_in0x3303 13059
(gdb) print $rd_addr
$1 = 2065
(gdb) print $wr_addr
$2 = 13
(gdb) print /x $rd_data_sram0_lo
$3 = 0xd5d4d3d2d1d0cfce
(gdb) print /x a_in[8]
[Switching to Thread 2305843009292443296 (LWP 24946)]
$4 = 0x4746454443424140
(gdb) fpgastep 3
(gdb) print $rd_addr
$5 = 2066
(gdb) print $wr_addr
$6 = 14
(gdb) print /x $rd_data_sram0_lo
$7 = 0xe5e4e3e2e1e0dfde
(gdb) print /x a_in[8]
[Switching to Thread 2305843009292443296 (LWP 24946)]
$8 = 0x4746454443424140
(gdb) fpgacont
(gdb) print $rd_addr
$9 = 3071
(gdb) print $wr_addr
$10 = 0
(gdb) print /x $rd_data_sram0_lo
$11 = 0xa5a4a3a2a1a09f9e
(gdb) print /x b_in[3]
[Switching to Thread 2305843009292443296 (LWP 24946)]
$12 = 0x1e1d1c1b1a191817
(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) cont
Continuing.
[Thread 2305843009312551504 (LWP 24951) exited]
success
Program exited normally.
(gdb) quit
Many other commands are available. For more information on these commands, see “Using the GNU Project Debugger (GDB)” in Chapter 9.
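Several of the registers printed in the session are slices of a single 64-bit debug port, as declared by the bit ranges in the REG_OUT comments (for example, debug_port[8][63:48]). The sketch below shows how such a slice is taken from a port value. The function debug_field is hypothetical, and the port value used in the usage note is reconstructed from the four dummy_param values printed by info fpgaregisters, assuming the layout in those comments; it is not itself shown in the session.

```c
#include <assert.h>
#include <stdint.h>

/* Extracts bits [hi:lo] (inclusive, Verilog-style) from a 64-bit
 * debug port value. Requires 63 >= hi >= lo. */
uint64_t debug_field(uint64_t port, unsigned hi, unsigned lo)
{
    unsigned width;
    uint64_t mask;

    assert(hi < 64u && lo <= hi);
    width = hi - lo + 1u;
    /* A shift by 64 is undefined in C, so handle the full-width case. */
    mask = (width == 64u) ? UINT64_MAX : ((UINT64_C(1) << width) - 1u);
    return (port >> lo) & mask;
}
```

For a port value of 0x3303002001000000, the slices [15:0], [31:16], [47:32], and [63:48] are 0x0, 0x100, 0x20, and 0x3303, the values shown for dummy_param0_out through dummy_param3_out.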
This example algorithm illustrates the optimization considerations for multi-buffering a complex algorithm on the RASC platform. This section steps you through the design process for this algorithm with source code in Verilog that steps by clocks.
The application for the data flow algorithm is slightly more complex. This example creates a 16 KB array, sorts it from most-significant byte to least-significant byte, runs a string search on the sorted data against a match tag, and then performs a pop count. The location of application code to perform this operation on both the Altix system and the FPGA is provided below. The results are compared to verify the algorithm implementation. The application C code is on your Altix system at the following location:
/usr/share/rasc/examples/alg10.c
Begin by loading the hardware description files into your text editor. Change directory to $RASC/examples/alg_data_flow_v/ and you will see several files: alg_block_top.v, sort_byte.v, string_search.v, and pop_cnt.v.
If you look through the files, you will see that the data flow algorithm reads 16 KB of data from SRAM 0. It then sorts the bytes of each double-word of the input data from most significant to least significant byte order. The algorithm writes those results out to SRAM 1, and then it performs a string search on the sorted data with a 16-bit match string that is provided by the application writer. The match tags resulting from the string search are written out to SRAM 1, and a population count is then run on the data. The resulting population count is written to debug register 1.
Figure 10-4 contains a diagram of the major computational blocks and the memory access patterns for the data flow algorithm.
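Two of the stages described above, the per-double-word byte sort and the population count, can be modeled in software as sketched below. This is not the code in alg10.c or in sort_byte.v and pop_cnt.v. The function names are hypothetical, and the sort direction (largest byte placed in the most significant position) is an assumption, since the text states only the byte order and not the comparison used. The string search stage is omitted because its matching rule is not specified here.

```c
#include <assert.h>
#include <stdint.h>

/* Sorts the eight bytes of one 64-bit double-word so that the largest
 * byte lands in bits 63:56 (assumed direction, see lead-in). */
uint64_t sort_bytes_desc(uint64_t dw)
{
    uint8_t b[8];
    uint64_t out = 0;
    int i, j;

    /* b[0] holds the least significant byte. */
    for (i = 0; i < 8; i++)
        b[i] = (uint8_t)(dw >> (8 * i));

    /* Insertion sort, ascending by index: b[0] ends up smallest. */
    for (i = 1; i < 8; i++) {
        uint8_t key = b[i];
        for (j = i - 1; j >= 0 && b[j] > key; j--)
            b[j + 1] = b[j];
        b[j + 1] = key;
    }
    assert(b[0] <= b[7]);

    /* Reassemble with b[7], the largest byte, in the top position. */
    for (i = 0; i < 8; i++)
        out |= (uint64_t)b[i] << (8 * i);
    return out;
}

/* Counts the set bits in a 64-bit word (population count). */
unsigned popcount64(uint64_t x)
{
    unsigned n = 0;

    while (x != 0) {
        x &= x - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

Under the stated assumption, sort_bytes_desc(0x0102030405060708) returns 0x0807060504030201. An application-side check would apply the same functions to all 2048 double-words of the 16 KB input and compare the results with the FPGA output.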
Extractor comments are inserted in the hierarchy to describe the algorithm. In the alg_block_top.v file for the data flow algorithm the following comments exist:
// extractor CS:1.0
// extractor VERSION:10.4
// extractor SRAM:data_in 2048 64 sram[0] 0x0000 in u stream
// extractor SRAM:sort_output 2048 64 sram[1] 0x0000 out u stream
// extractor SRAM:bitsearch_match_vector 2048 64 sram[1] 0x8000 out u stream
// extractor REG_IN:match_string 16 u alg_def_reg[0][15:0]
// extractor REG_IN:multi_iter_rst 64 u alg_def_reg[2]
// extractor REG_IN:op_length1 11 u alg_def_reg[1][10:0]
// extractor REG_OUT:version 64 u debug_port[0]
// extractor REG_OUT:running_pop_count 64 u debug_port[1]
// extractor REG_OUT:sort_input 64 u debug_port[2]
// extractor REG_OUT:sort_output 128 u debug_port[3]
// extractor REG_OUT:match_vector 64 u debug_port[5]
// extractor REG_OUT:pipe_vld 64 u debug_port[6]
// extractor REG_OUT:dw_pcnt 64 u debug_port[8]
// extractor REG_OUT:f_pcnt_63_0 64 u debug_port[9]
The core services tag records which version of core services was used to generate a bitstream, which is useful when debugging. The version tag lets the user see from a GDB session which algorithm and revision is loaded. The register out tag (REG_OUT) specifies registers that are pulled out to the debug mux. The SRAM tag describes arrays that the algorithm reads from or writes to the SRAM banks.
Note: The handshaking methodologies (see “Handshaking Methodologies” in Chapter 3) are not used here since it is assumed SRAM port access conflicts will never occur between the algorithm and DMA.
To build the generic SSP stub test bench for the data flow algorithm, change to the $RASC/dv/sample_tb directory. At the prompt, enter the following:
% make ALG=alg_data_flow_v
If you do not enter the ALG tag, the makefile will default to compiling for alg_simple_v.
To run the diagnostic alg_data_flow_v, at the prompt, enter the following:
% make run DIAG=diags/alg_data_flow_v ALG=alg_data_flow_v
As when building the test bench, the ALG tag defaults to alg_simple_v, so it must be overridden for the data flow algorithm.
The SRAM mapping to physical memory is as follows:
mem0[127:64] -> qdr_sram_bank1.SMEM -> init_sram1_*.dat, final_sram1.dat
mem0[63:0 ] -> qdr_sram_bank0.SMEM -> init_sram0_*.dat, final_sram0.dat
mem1[127:64] -> qdr_sram_bank3.SMEM -> init_sram3_*.dat, final_sram3.dat
mem1[63:0 ] -> qdr_sram_bank4.SMEM -> init_sram2_*.dat, final_sram2.dat
mem2[63:0 ] -> qdr_sram_bank2.SMEM -> init_sram4_*.dat, final_sram4.dat
These file names can all be overridden on the command line.
By specifying the SRAM input files, the user can skip the DMA process for the purposes of testing the algorithm, providing a fast check for the algorithm without verifying the DMA engines of core services. You can also use a simple C program called check_alg_data_flow.c to verify the test results.
Running a diag through this test bench produces results in four formats:
*vcdplus.vpd* - this file contains the simulation results for the run.
*terminal output* - the status of the test is output to the screen as it runs, notifying the user of packets sent/received.
*log file* - the output to the screen is also stored in the log file <diag_name>.<alg_name>.run.log (for example, dma_alg_data_flow_v.alg_data_flow_v.run.log).
*sram output files* - at the end of simulation (when the diag finishes because it has been successful, an incorrect packet has been received, or time-out has occurred), the contents of the 4 SRAMs are dumped to .dat files (the defaults or user-specified files).
When the algorithm has been integrated and verified, it is time to build an implementation.
Change directories to $RASC/implementations/alg_data_flow_v/
To synthesize the design, type make synplify or make xst. This is set up to utilize the black-boxed version of core services and should synthesize faster.
To generate the required metadata information for the abstraction layer and the debugger, you need to run the extractor script on your file. The physical design makefile includes a make extractor target for this purpose. When it is executed, it will generate two configuration files--one describing core services, and one describing the algorithm behavior.
To execute the ISE foundation tools and run the extractor script on the file, type make all. This will take approximately one to two hours due to the complex mapping and place and route algorithms executed by the ISE tools. Please note that the details of setting up your own project are described in the Physical Implementation chapter.
To transfer to the Altix platform, you must add your RASC design implementation into the Device Manager Registry by performing the following steps:
Use FTP to move the algorithm files from the PC to the /usr/share/rasc/bitstreams/ directory on the Altix machine:
$RASC/implementations/alg_data_flow_v/rev_1/alg_data_flow_v.bin
$RASC/implementations/alg_data_flow_v/<core_services>.cfg
$RASC/implementations/alg_data_flow_v/<user_space>.cfg
Log into the Altix machine and execute the Device Manager user command devmgr:
devmgr -a -n alg10 -b alg_data_flow_v.bin -c <user_space>.cfg -s <core_services>.cfg
The script will default the bitstream and configuration files to these names, although the device manager can add files of any name to the registry, so users should feel free to rename project files as convenient.
To run a debug session on this bitstream, you must start the application from a GDB session window. To do that, you must execute the extended GDB on the application detailed at the beginning of this example.
% gdbfpga /usr/share/rasc/examples/alg10
GNU gdb 6.3.50.20050510
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "ia64-unknown-linux-gnu"...Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) break rasclib_brkpt_start
Function "rasclib_brkpt_start" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (rasclib_brkpt_start) pending.
(gdb) handle SIGUSR1 nostop pass noprint
Signal Stop Print Pass to program Description
SIGUSR1 No No Yes User defined signal 1
(gdb) run
Starting program: /usr/share/rasc/examples/alg10
Failed to read a valid object file image from memory.
[Thread debugging using libthread_db enabled]
[New Thread 2305843009292443296 (LWP 1157)]
Breakpoint 2 at 0x2000000000085bb1: file rasclib_debug.c, line 86.
Pending breakpoint "rasclib_brkpt_start" resolved
[New Thread 2305843009303867984 (LWP 1160)]
[New Thread 2305843009312551504 (LWP 1169)]
fpga config file def reg count (64) of /var/rasc/rasc_registry/alg10/bitstream.cfg exceeds gdb current maximum of 8, excess ignored
[Switching to Thread 2305843009292443296 (LWP 1157)]
Breakpoint 2, rasclib_brkpt_start (cop_desc=547920) at rasclib_debug.c:86
86 rasclib_debug.c: No such file or directory.
in rasclib_debug.c
(gdb) info fpga
fpga 0
Active : on
State : ready-to-fpgastep-fpgacont
Algorithm id : alg10
Core svc ver : 2.100000
Algorithm ver: 10.400000
Algorithm src: v
Alg. dev :
Alg. config : /var/rasc/rasc_registry/alg10/bitstream.cfg
CS version : 2.100000
CS config : /var/rasc/rasc_registry/alg10/core_services.cfg
prev step ct : 0
step ct : 0
(gdb) info fpgaregisters
version 0xa00000001 42949672961
running_pop_count0x0 0
sort_input 0x0 0
sort_output 0x0 0
match_vector 0x0 0
pipe_vld 0x0 0
dw_pcnt 0x0 0
f_pcnt_63_0 0x0 0
test_cnt 0x645900000000 110333414866944
match_string 0x0 0
multi_iter_rst 0x0 0
op_length1 0x0 0
(gdb) fpgastep 5
(gdb) info fpgaregisters
version 0xa00000001 42949672961
running_pop_count0x0 0
sort_input 0x0 0
sort_output 0x0 0
match_vector 0x0 0
pipe_vld 0x0 0
dw_pcnt 0x0 0
f_pcnt_63_0 0x0 0
test_cnt 0x800000004 34359738372
match_string 0x0 0
multi_iter_rst 0x0 0
op_length1 0x0 0
(gdb) fpgastep 3
(gdb) print /x $test_cnt
$1 = 0xb00000007
(gdb) print /x data_in[8]
[Switching to Thread 2305843009292443296 (LWP 1157)]
$2 = 0x67fce141a13ee970
(gdb) fpgastep 6
(gdb) print /x $test_cnt
$3 = 0x110000000d
(gdb) print /x data_in[14]
[Switching to Thread 2305843009292443296 (LWP 1157)]
$4 = 0xbb5cf98961bed875
(gdb) delete
Delete all breakpoints? (y or n) y
(gdb) fpgacont
(gdb) cont
Continuing.
[Thread 2305843009312551504 (LWP 1169) exited]
success sorted
success match_list
popcnts = 10c
HW POP COUNT = 268
SW POP COUNT = 268
Program exited normally.
(gdb) quit
The above commands execute the application and then stop at the breakpoint set on the rasclib_brkpt_start function. At that stage you can query generic data about the FPGA that is configured in the system. By stepping the FPGA, you can view internal registers and arrays within the session at different steps. Many other commands are available. For more information, see “GDB Commands” in Chapter 9.
This example algorithm illustrates the use of the streaming DMA feature available to an algorithm running on the RASC platform. This section steps you through the design process for this algorithm with source code in Verilog.
The application for the streaming DMA algorithm sends data to the FPGA using an input stream and receives data from the FPGA using an output stream. This example creates an array of 512K 8-byte integers and increments each integer by a constant value. The application sets the number of integers and the increment value in the algorithm using the extractor symbols op_length1 and alg_inc_val, runs the algorithm, and verifies the results. The application performs the same operation on both the Altix system and the FPGA and compares the results to verify the algorithm implementation. The application C code is on your Altix system at the following location:
/usr/share/rasc/examples/alg12_strm.c
A second application for the streaming DMA algorithm is identical to the preceding example, except that it uses an end-of-stream notification to signal the algorithm when all of the data has been sent. This application tells the algorithm to use end-of-stream notification instead of the count of integers by setting the algorithm's use_strm_in_complete variable; the op_length1 variable is not used. The C code for this application is at the following location:
/usr/share/rasc/examples/alg12_strm_eos.c
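The increment-and-verify behavior described for the two applications can be modeled in a few lines of software. The Python sketch below is only an illustration and is not the code in alg12_strm.c; the function names are hypothetical, and wraparound at 64 bits on overflow is an assumption. Passing a count models the op_length1 case, and omitting it models the end-of-stream case, where every value supplied is processed.

```python
MASK64 = (1 << 64) - 1  # 8-byte integers; wraparound on overflow is an assumption

def increment_stream(values, inc, count=None):
    """Software model: add inc to each 64-bit value.

    count=None models end-of-stream mode (consume every value supplied);
    otherwise only the first `count` values are processed, as with op_length1.
    """
    out = []
    for i, v in enumerate(values):
        if count is not None and i >= count:
            break
        out.append((v + inc) & MASK64)
    return out

def verify(hw_results, values, inc, count=None):
    """Compare results read back from the FPGA against the software model."""
    expected = increment_stream(values, inc, count)
    mismatches = [i for i, (h, e) in enumerate(zip(hw_results, expected)) if h != e]
    return len(hw_results) == len(expected) and not mismatches

if __name__ == "__main__":
    data = list(range(8))            # small stand-in for the 512K-element array
    fake_hw = [v + 5 for v in data]  # placeholder for data from the output stream
    print(verify(fake_hw, data, inc=5, count=len(data)))
```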
Begin by loading the hardware description files into your text editor. Change directory to $RASC/examples/alg_dma_stream_v/
and you will see several files:
alg_block_top.v, acs_adr.v, acs_debug_reg.v and user_space_wrapper.v
If you look through the files, you will see that the streaming DMA algorithm reads the specified number of integers from the input stream, two values at a time. The algorithm then increments each integer by the specified value and writes the results, two values at a time, to the output stream.
Figure 10-5 contains a diagram of the major computational blocks and the data access for the streaming DMA algorithm.
Extractor comments are inserted in the hierarchy to describe the algorithm. In the alg_block_top.v file for the streaming DMA algorithm, the following comments exist:
// extractor CS: 2.1
// extractor VERSION:12.3
// extractor STREAM_IN:op_in_strm 0 0
// extractor STREAM_OUT:result_out_strm 0 0
//
// extractor REG_IN:op_length1 18 u alg_def_reg[0][17:0]
// extractor REG_IN:alg_inc_val 4 u alg_def_reg[1][3:0]
// extractor REG_OUT:version 64 u debug_port[0]
// extractor REG_OUT:alg_def0 64 u debug_port[1]
// extractor REG_OUT:alg_def1 64 u debug_port[2]
// extractor REG_OUT:cntl_sigs 64 u debug_port[9]
// extractor REG_OUT:rd_data_lo 64 u debug_port[10]
// extractor REG_OUT:rd_data_hi 64 u debug_port[11]
// extractor REG_OUT:wr_data_lo 64 u debug_port[12]
// extractor REG_OUT:wr_data_hi 64 u debug_port[13]
The core services tag (CS) records which version of core services was used to generate a bitstream, which is useful when debugging. The version tag (VERSION) lets you track the algorithm revision using the output from the devmgr -q command. The register out tag (REG_OUT) specifies registers that are pulled out to the debug mux. The STREAM_IN and STREAM_OUT tags describe data written to or read from DMA streams.
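As a quick sanity check before running the real extractor script, the extractor comments can be scanned with a short script. The Python sketch below is not the RASC extractor and makes assumptions about the comment layout based only on the lines shown above: the text before the colon is the tag, the remainder is a whitespace-separated field list, and for REG_IN the second field is a bit width that should agree with the trailing bit range. The function names are hypothetical.

```python
import re

# Matches lines such as "// extractor REG_OUT:version 64 u debug_port[0]".
EXTRACTOR_RE = re.compile(r"^\s*//\s*extractor\s+(\w+)\s*:\s*(.*?)\s*$")
# Matches a trailing Verilog bit range such as "[17:0]".
RANGE_RE = re.compile(r"\[(\d+):(\d+)\]$")

def parse_extractor(lines):
    """Return a list of (tag, fields) tuples, one per extractor comment."""
    records = []
    for line in lines:
        m = EXTRACTOR_RE.match(line)
        if m:
            records.append((m.group(1), m.group(2).split()))
    return records

def width_mismatches(records):
    """List REG_IN names whose declared width disagrees with their bit range."""
    bad = []
    for tag, fields in records:
        if tag != "REG_IN" or len(fields) < 4:
            continue
        name, width, binding = fields[0], int(fields[1]), fields[3]
        m = RANGE_RE.search(binding)
        if m and int(m.group(1)) - int(m.group(2)) + 1 != width:
            bad.append(name)
    return bad

sample = [
    "// extractor CS: 2.1",
    "// extractor VERSION:12.3",
    "// extractor STREAM_IN:op_in_strm 0 0",
    "// extractor REG_IN:op_length1 18 u alg_def_reg[0][17:0]",
    "// extractor REG_IN:alg_inc_val 4 u alg_def_reg[1][3:0]",
    "// extractor REG_OUT:version 64 u debug_port[0]",
    "//",
]
records = parse_extractor(sample)
print(records)
print(width_mismatches(records))
```

For the comments in alg_block_top.v, the declared widths of op_length1 (18) and alg_inc_val (4) agree with the ranges [17:0] and [3:0], so the mismatch list is empty.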
To build the Generic SSP stub test bench for the streaming DMA algorithm, change to the $RASC/dv/sample_tb directory. At the prompt, enter the following:
% make ALG=alg_dma_stream_v
If you do not enter the ALG tag, the makefile will default to compiling for alg_simple_v.
To run the diagnostic alg_dma_stream_v, at the prompt, enter the following:
% make run DIAG=diags/alg_dma_stream_v ALG=alg_dma_stream_v
As when building the test bench, the ALG tag defaults to alg_simple_v and must be overridden for the streaming DMA algorithm.
The algorithm input stream data is coded as cache line packets in the alg_dma_stream_v diagnostic. The input data is identified by this leading comment:
print "\n\n*******Start Stream Read DMA Engine 0\n\n";
and cache lines of data are coded as a receive packet request and a packet response with 128 bytes of input stream data. In this diagnostic, a total of 32 cache lines of input stream data is sent to the algorithm. An example follows:
# 1 of 32
rcv_rd_req ( MEM, FCL, ANY, 0x00000000100000 );
snd_rd_rsp ( MEM, FCL, ANY, 0, 0x7654321076543210, 0x7654321076543210,
0x7654321076543210, 0x7654321076543210,
0x7654321076543210, 0x7654321076543210,
0x7654321076543210, 0x7654321076543210,
0x7654321076543210, 0x7654321076543210,
0x7654321076543210, 0x7654321076543210,
0x7654321076543210, 0x7654321076543210,
0x7654321076543210, 0x7654321076543210 );
The diagnostic also handles output stream data from the algorithm in cache line units. The output packet request contains the output stream data, and the response acknowledges receipt of that request. The receive packet request is coded with the expected values in that cache line, and a data miscompare between expected and simulated output causes the diagnostic to fail. The expected output stream data is identified by a leading comment, as follows:
print "\n\n*******Starting Stream 0 Write DMA Engine.\n\n";
In this diagnostic, a total of 32 cache lines of output stream data is expected from the algorithm. An example follows:
# 1 of 32
rcv_wr_req ( MEM, FCL, ANY, 0x00000000100000, 0x7654321076543215, 0x7654321076543215,
0x7654321076543215, 0x7654321076543215,
0x7654321076543215, 0x7654321076543215,
0x7654321076543215, 0x7654321076543215,
0x7654321076543215, 0x7654321076543215,
0x7654321076543215, 0x7654321076543215,
0x7654321076543215, 0x7654321076543215,
0x7654321076543215, 0x7654321076543215 );
snd_wr_rsp ( MEM, FCL, ANY, 0 );
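The relationship between the input and expected output cache lines in the two "1 of 32" examples can be checked offline. The Python sketch below derives the increment implied by those examples and recomputes the expected output line. The function name expected_line is hypothetical, and wraparound at 64 bits is an assumption that the example values never exercise.

```python
CACHE_LINE_BYTES = 128
WORD_BYTES = 8
WORDS_PER_LINE = CACHE_LINE_BYTES // WORD_BYTES  # 16 data words per packet
MASK64 = (1 << 64) - 1

def expected_line(input_words, inc):
    """Expected output cache line for one input cache line."""
    if len(input_words) != WORDS_PER_LINE:
        raise ValueError("a cache line holds %d words" % WORDS_PER_LINE)
    # Wraparound at 64 bits is an assumption; the example values do not overflow.
    return [(w + inc) & MASK64 for w in input_words]

# Values taken from the "1 of 32" examples in the diagnostic.
line_in = [0x7654321076543210] * WORDS_PER_LINE
inc = 0x7654321076543215 - 0x7654321076543210  # increment implied by the diagnostic

line_out = expected_line(line_in, inc)
print("increment:", inc)
print("first expected word:", hex(line_out[0]))
print("total stream bytes:", 32 * CACHE_LINE_BYTES)
```

The implied increment of 5 fits within the 4-bit alg_inc_val register declared in the extractor comments, and 32 cache lines of 128 bytes give 4096 bytes in each direction.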
Running a diag through this test bench produces results in the following formats:
*vcdplus.vpd* - this file contains the simulation results for the run.
*terminal output* - the status of the test is output to the screen as it runs, notifying the user of packets sent/received.
*log file* - the output to the screen is also stored in the log file <diag_name>.<alg_name>.run.log (for example, alg_dma_stream_v.alg_dma_stream_v.run.log)
When the algorithm has been integrated and verified, it is time to build an implementation.
Change directories to $RASC/implementations/alg_dma_stream_v/
To synthesize the design, type make synplify or make xst. The makefile is set up to use the black-boxed version of core services, which should make synthesis faster.
To generate the required metadata information for the abstraction layer, you need to run the extractor script on your file. The physical design makefile includes a make extractor target for this purpose. When it is executed, it will generate two configuration files--one describing core services, and one describing the algorithm behavior.
To execute the ISE foundation tools and run the extractor script on the file, type make all. This takes approximately one to two hours because of the complex mapping and place-and-route algorithms executed by the ISE tools. The details of setting up your own project are described in the Physical Implementation chapter.
To transfer to the Altix platform, you must add your RASC design implementation to the Device Manager Registry by performing the following steps:
Use FTP to move the algorithm files from the PC to the /usr/share/rasc/bitstreams/ directory on the Altix machine:
$RASC/implementations/alg_dma_stream_v/rev_1/alg_dma_stream_v.bin
$RASC/implementations/alg_dma_stream_v/<core_services>.cfg
$RASC/implementations/alg_dma_stream_v/<user_space>.cfg
Log into the Altix machine and execute the Device Manager user command devmgr:
devmgr -a -n alg12_strm -b alg_dma_stream_v.bin -c <user_space>.cfg -s <core_services>.cfg
The script will default the bitstream and configuration files to these names, although the device manager can add files of any name to the registry, so users should feel free to rename project files as convenient.
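If you rename project files, a small helper keeps the devmgr arguments in the order shown above. The Python sketch below only builds the command line and does not run it. It uses only the flags that appear in this section (-a, -n, -b, -c, -s); the function name and the two .cfg file names are placeholders rather than names produced by the tools.

```python
import shlex

def devmgr_add_command(name, bitstream, user_space_cfg, core_services_cfg):
    """Build the argument list for registering an algorithm with devmgr.

    Only the flags shown in this section (-a, -n, -b, -c, -s) are used.
    """
    return ["devmgr", "-a", "-n", name, "-b", bitstream,
            "-c", user_space_cfg, "-s", core_services_cfg]

# The .cfg names below are placeholders; use the names produced by your build.
argv = devmgr_add_command("alg12_strm", "alg_dma_stream_v.bin",
                          "user_space.cfg", "core_services.cfg")
print(" ".join(shlex.quote(a) for a in argv))
```

On the Altix system, the resulting list could be passed to subprocess.run, or the printed line could be pasted at the shell prompt.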