This chapter provides an introduction to Reconfigurable Computing (RC), including the challenges in designing and implementing RC systems, an introduction to SGI's system platform, and an overview of SGI's Reconfigurable Application Specific Computing (RASC). It covers the following topics:
Note: Make sure that you read the section called “Major Technical Changes in the RASC 2.20 Release” on the “New Features in This Guide” page in the frontmatter of this manual.
Reconfigurable computing is defined as a computer having hardware that can be reconfigured to implement application-specific functions (see Figure 1-1). The basic concept of the reconfigurable computer (RC) was proposed in the 1960s, but only in the last decade has it become feasible. RC systems combine microprocessors and programmable logic devices, typically field programmable gate arrays (FPGAs), into a single system. Reconfigurable computing allows applications to map computationally dense code to hardware. This mapping often yields orders-of-magnitude improvements in speed while decreasing power and space requirements.
As defined above, RC uses programmable FPGAs. Current FPGA technology provides more than 10 million logic elements, internal clock rates over 200 MHz, and pin toggle rates approaching 10 Gb/s. With these large devices, the scope of applications that can target an FPGA has dramatically increased. The challenges to using FPGAs effectively fall into two categories: ease of use and performance. Ease of use issues include the following:
Methodology of generating the “program” or bitstream for the FPGA
Ability to debug an application running on both the microprocessor and FPGA
Interface between the application and the system or Application Programming Interface (API)
Performance issues include the following:
Data movement (bandwidth) between microprocessors and FPGAs
Latency of communication between microprocessors and FPGAs
Scalability of the system topology
Historically, programming an FPGA required a hardware design engineer. (Non-hardware designers can now program FPGAs with the newer tools described in “Getting Started with RASC Programming.”) Typically, an algorithm was hand-translated into a Hardware Description Language (HDL), verified by human-generated HDL tests, synthesized into logic elements, physically placed in the FPGA, and then analyzed for speed of operation. If errors occurred, these design and verification steps were repeated. This iterative process is appropriate for semiconductor chip design, but it impedes the rapid prototyping and solution-oriented goals of application-specific programming.
Debugging FPGAs requires specialized logic to be inserted into register transfer level (RTL) code. Then, FPGA-specific tools, along with typical microprocessor debug methods, are used to analyze anomalies in application behavior.
Lastly, users are hampered by the lack of standardized application interfaces. This deficiency forces recoding whenever hardware or platform upgrades become available, a time-consuming and error-prone process.
Performance is the fundamental reason for using RC systems. By mapping algorithms to hardware, designers can tailor not only the computational components but also the data flow to match the algorithm. Today's FPGAs provide over a terabyte per second of memory bandwidth from small on-chip memories as well as tens of billions of operations per second. Transferring data to the FPGA and receiving the results poses a difficult challenge to RC system designers. In addition to bandwidth, efficient use of FPGA resources requires low-latency communication between the host microprocessor and the FPGAs. When low latency is achieved, scaling and optimization across multiple computational elements, often called load balancing, become possible.
Although challenges abound, RC systems allow users to explore solutions that are not viable in today's limited computing environments. The benefits in size, speed, and power alone make RC systems a necessity.
SGI was founded in 1982 based on Stanford University's research in accelerating a specific application: three-dimensional graphics. SGI pioneered hardware acceleration of graphics, setting records and providing technological capabilities that were impossible without specialized computational elements. Tackling difficult problems required a supercomputer with capabilities that were not available from other computer vendors, so SGI chose to develop its own large-scale supercomputer with the features needed to drive graphics. The development of large-scale single system image (SSI) machines was also pioneered at SGI. From the early days, systems such as Power Series, Challenge, and Power Challenge defined large shared-bus SSI systems, but the focus continued to be on providing the high bandwidth and scaling needed to drive graphics applications.

To transcend the 36-microprocessor Challenge Series systems, conventional single-backplane architectures were not sufficient. SGI returned again to its roots at Stanford University and the Directory Architecture for Shared Memory (DASH) project, on which the original concepts for the cache coherent non-uniform memory access (ccNUMA) architecture are based. The ccNUMA architecture allows memory to be distributed through the use of a directory-based coherency scheme, removing the need for large common buses like those in the Challenge systems. Without the restrictions of a single bus, system bandwidth increased by orders of magnitude while latency was reduced. This architecture has allowed SGI to set new records for system scalability, including a 1024-CPU SSI system, 1 TB/s STREAM benchmark performance, and many others.
The SGI Altix system is the only fourth-generation Distributed Shared Memory (DSM) machine using a NUMA architecture that is connected by a high-bandwidth, low-latency interconnect. In keeping with ever-increasing demands, Altix allows independent scaling for CPUs, memory, Graphics Processing Units (GPUs), I/O interfaces, and specialized processors. The NUMAlink interconnect allows Altix to scale to thousands of CPUs, terabytes of memory, hundreds of I/O channels, hundreds of graphics processors, and thousands of application-specific devices.
SGI uses NUMAlink on all of its ccNUMA systems. NUMAlink 4 is a third-generation fabric that supports topologies starting at basic rings. With the addition of routers, meshes, hypercubes, modified hypercubes, and full fat tree topologies can be built. The protocol and fabric allow the topology to be matched to the workload as needed. By using this high-bandwidth, low-latency interconnect, SGI has a flexible and powerful platform for delivering optimized HPC solutions.
The RASC program leverages more than 20 years of SGI experience accelerating algorithms in hardware. Rather than using relatively fixed implementations, such as graphics processing units (GPUs), RASC uses FPGA technology to develop a full-featured reconfigurable computer. The RASC program also addresses the ease of use and performance issues present in typical RC environments.
To address performance issues, RASC connects FPGAs to the NUMAlink fabric, making them peers of the microprocessors and providing both high bandwidth and low latency. By attaching the FPGA devices to the NUMAlink interconnect, RASC places the FPGA resources inside the coherency domain of the computer system. This placement gives the FPGAs extremely high bandwidth (up to 6.4 GB/s per FPGA), low latency, and hardware barriers. These features enable both extreme performance and scalability. The RASC product also provides system infrastructure to manage and reprogram the contents of the FPGA quickly for reuse of resources.
RASC defines a set of APIs through the RASC Abstraction Layer (RASCAL). The abstraction layer can abstract the hardware to provide deep and wide scaling, or direct and specific control over each hardware element in the system. In addition, RASC provides an FPGA-aware version of the GNU Debugger (GDB) that is based on the standard Linux version of GDB with extensions for managing multiple FPGAs. The RASC debug environment does not require learning new tool sets to quickly debug an accelerated application.
RASC supports the common hardware description languages (HDLs) for generating algorithms. RASC provides templates for Verilog- and VHDL-based algorithms. Several third-party high-level language tool vendors are developing RASC interfaces and templates to use in place of traditional hardware design languages.
FPGA programming is a fairly complex task when using the main FPGA programming languages, VHDL and Verilog, directly. They require an electrical engineering background and an understanding of timing constraints; that is, the time it takes for an electrical signal to travel on the chip, delays introduced by buffers, and so on. For example, a blank FPGA has physical connections from its I/O pins to the memory pins, but you need a protocol such as QDR-II, which specifies memory transfer rates, to drive the dual in-line memory modules connected to those pins.
Low-level abstractions allow an application to read a memory location without understanding the underlying hardware. SGI RASC calls this functionality Core Services (for more information, see “Algorithm Interfaces ” in Chapter 3). These low-level abstractions can almost be thought of as the basic input/output system (BIOS) of the RASC unit.
Currently, FPGAs run at clock speeds of about 200 MHz. This may seem slow compared to an Itanium processor; however, the FPGA can be optimized for a specific algorithm and can potentially perform several hundred or even thousands of operations in parallel.
Programming high-performance computing applications in VHDL and/or Verilog is extremely time-consuming and resource-intensive, and is probably best left to very advanced users. However, high-level tools from vendors such as Mitrionics, Inc. and Impulse Accelerated Technologies, Inc. are available.
These tools produce VHDL or Verilog code (potentially thousands of lines for even a small code fragment). This code then has to be synthesized (compiled) into a netlist, which a place-and-route program then uses to implement the physical layout on the FPGA.
Write an application in the C programming language for the system microprocessor
Identify the computationally intensive routine(s)
Generate a bitstream using Core Services and the language of your choice
Replace the routines with RASC abstraction layer (rasclib) calls, which support both C and Fortran 90 interfaces
Run your application and debug it with GDB (see “Helpful SGI Tools”)
The RASC tutorial (see “Tutorial Overview” in Chapter 10) steps you through the entire RASC design flow: integrating the algorithm with Core Services; simulating behavior on the algorithm interfaces; synthesizing the algorithm code; generating a bitstream; transferring that bitstream and metadata to the Altix platform; executing an application; and using GDB to debug an application on the Altix system and FPGA simultaneously.
In this guide, implementation flow (bitstream development) refers to the comprehensive run of the extractor, synthesis, and Xilinx ISE tools that turn the Verilog or VHDL source into a binary bitstream and configuration file that can be downloaded into the RASC Algorithm FPGA (for more information, see “Summary of the Implementation Flow” in Chapter 7).
Figure 1-2 shows bitstream development on an x86 Linux platform for Altix RASC hardware. The RASC Abstraction Layer (rasclib) provides an application programming interface (API) for the kernel device driver and the RASC hardware. It is intended to provide a level of support for application development similar to that of the standard open/close/read/write/ioctl calls for an I/O peripheral. For more on the RASC Abstraction Layer, see Chapter 4, “RASC Abstraction Layer”.
Synplify Pro is a synthesis product developed by Synplicity, Inc. For more information, click on the Literature link at the top of the homepage at www.synplicity.com.
Xilinx Synthesis Technology (XST) is a synthesis product developed by Xilinx, Inc. Information on XST is available at http://www.xilinx.com/ .
The Xilinx Development System Reference Guide provides a bitstream generation workflow diagram and a detailed description of all the files generated and used in the workflow, along with the tools that create and use these files. From the Xilinx, Inc. homepage, click on the Documentation link. Under Design Tools Documentation, select Software Manuals.
For the RASC 2.20 release, refer to the 9.x Software Manuals.
For additional documentation that you may find helpful, see “Additional Documentation Sites and Useful Reading”.
SGI provides a Device Manager for loading bitstreams. It maintains a registry of algorithm bitstreams that can be loaded and executed using the RASC abstraction layer (rasclib). The devmgr user command is used to add, delete, and query algorithms in the registry. For more information on the Device Manager, see “RASC Device Manager” in Chapter 9.
Based on Open Source GNU Debugger (GDB)
Uses extensions to current command set
Can debug host application and FPGA
Provides notification when FPGA starts or stops
Supplies information on FPGA characteristics
Can "single-step" or "run N steps" of the algorithm
Dumps data regarding the set of "registers" that are visible when the FPGA is active
The initial RASC hardware implementation used SGI's first-generation peer-attached I/O brick for the base hardware. The RASC hardware module is based on an application-specific integrated circuit (ASIC) called TIO. TIO attaches directly to the Altix system NUMAlink interconnect instead of being driven from a compute node through the XIO channel. TIO supports two PCI-X buses, an AGP-8X bus, and the Scalable System Port (SSP) that is used to connect the Field Programmable Gate Array (FPGA) to the rest of the Altix system for the RASC program. The RASC module contains a card with the co-processor (COP) FPGA device, as shown in Figure 1-4.
The FPGA is connected to an SGI Altix system via the SSP port on the TIO ASIC. It is loaded with a bitstream that contains two major functional blocks:
The reprogrammable algorithm
The Core Services that facilitate running the algorithm
For more information on the RASC FPGA, see Chapter 3, “RASC Algorithm FPGA Hardware Design Guide”.
For more information on the Altix system topology and a general overview of the Altix 350 system architecture, see Chapter 2, “Altix System Overview”.
RASC hardware implementation for SGI Altix 4700 systems is based on blade packaging as shown in Figure 1-5. For an overview of the Altix 4700 system, see “SGI Altix 450 and Altix 4700 System Overview” in Chapter 2.
The RASC hardware blade contains two computational FPGAs, two TIO ASICs, and a loader FPGA for loading bitstreams into the computational FPGAs. The computational FPGAs connect directly into the NUMAlink fabric via SSP ports on the TIO ASICs. The new RASC blade has high-performance FPGAs with 200K logic cells and increased memory resources with 10 synchronous static RAM dual in-line memory modules (SSRAM DIMMs).
For legacy systems, optional brick packaging is available for the latest RASC hardware.
Figure 1-6 shows an overview of the RASC software.
The major software components are as follows:
Standard Linux GNU debugger with FPGA extensions
Abstraction layer library
Algorithm device driver
COP (TIO, Algorithm FPGA, memory, download FPGA)
This software is described in detail in this manual in the chapters that follow.