Xilinx Reveals More Versal Details

Sep 10, 2019 | AI and Machine Learning, In the News

Xilinx held an “Innovation Day” event this week to share more details about the technologies the company hopes will forge a larger footprint in the data center. After listening to eight presentations and having intriguing discussions with the company’s leading technologists, some of which I actually understood, I think I have a better appreciation of both the potential of the company’s Versal acceleration platform and the magnitude of the software challenges the company faces. Xilinx is now on a journey from its legacy FPGA business to a new hybrid approach that it believes will deliver the best of both worlds: programmable hardware logic and software-programmable domain-specific engines. The breadth and depth of the information Xilinx shared are far beyond the scope of a short blog, so I will focus here on a few Versal highlights.

Figure 1: The Xilinx Versal acceleration platform includes programmable logic, processors, and domain-specific engines, all interconnected with an on-die network and a rich set of I/O interfaces. image: XILINX

The “Aha!” moment for me came when Xilinx fellow Ralph Wittig explained that the concept here is to optimize the entire architecture (domain-specific engines, DSPs, memory, and programmable logic) on the fly, adapting to the data flow and compute patterns that emerge at execution time. Reconfigurable memory hierarchies and an intelligent fabric enable this adaptability. The architecture is quite scalable, from data center acceleration platforms to embedded designs. It supports tens to hundreds of cores, up to megabytes of on-die memory, up to terabits per second of memory bandwidth, and everything from sub-5-watt automotive SoCs to 75-watt PCIe cards to 200-watt nodes for HPC.

The underlying technology includes significant innovations, starting with the AI engines. These consist of a 2D array of vector (SIMD) cores with local SRAM memory. The cores, memory, and fabric act as an adaptable data flow engine, with fast local memory acting as buffers between them. Cores activate when data arrives, unlike the threaded execution model found in CPUs, GPUs, and most ASICs. This approach saves die area, reduces power, and speeds execution.
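To make the data-driven activation model concrete, here is a minimal, purely illustrative sketch in portable C++. This is not the Xilinx AI Engine API; it merely emulates with threads and a blocking queue what the Versal hardware does natively (a core firing when data lands in an adjacent buffer). The Buffer class below stands in for the fast local SRAM between cores.

```cpp
// Illustrative sketch of data-driven ("fire when input arrives") execution.
// NOT the Xilinx AI Engine API -- a generic C++ emulation of the dataflow
// model, with a bounded queue standing in for the local SRAM buffers.
#include <condition_variable>
#include <cstddef>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// A bounded buffer standing in for the fast local SRAM between two cores.
template <typename T>
class Buffer {
public:
    explicit Buffer(std::size_t capacity) : capacity_(capacity) {}
    void push(T v) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < capacity_; });
        q_.push(std::move(v));
        not_empty_.notify_one();  // wakes the consumer: data has "arrived"
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return v;
    }
private:
    std::queue<T> q_;
    std::size_t capacity_;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
};

int main() {
    Buffer<int> link(4);  // small buffer between the two "cores"

    // Producer "core": emits a stream of values.
    std::thread producer([&] {
        for (int i = 1; i <= 8; ++i) link.push(i * i);
    });

    // Consumer "core": blocks until data arrives, then fires.
    std::thread consumer([&] {
        for (int i = 0; i < 8; ++i)
            std::cout << "consumed " << link.pop() << "\n";
    });

    producer.join();
    consumer.join();
}
```

Note the irony: on a CPU we must fake fire-on-arrival behavior with blocked threads. The point of the hardware dataflow engine is that this activation happens in the fabric itself, with no thread scheduling or instruction-fetch overhead, which is where the area, power, and speed advantages come from.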

Figure 2: The Xilinx Versal AI engine is a dataflow machine, with up to hundreds of vector cores, each with fast local SRAM, and access to the configurable memory residing in the chip’s programmable logic array. image: XILINX

It’s all up to software

Since Versal’s launch last fall, I’ve wondered how Xilinx would avoid the pitfall of trading one software challenge (programming an FPGA) for a much larger set of hurdles (programming across an SoC containing FPGA logic, ARM cores, DSPs, and AI Engines). To help programmers factor their code and make all of this work, Xilinx is developing compiler technologies that automate and optimize this partitioning (at least in theory). Instead of each programmer having to figure out how to structure an application to optimize execution, the compiler will take on the burden of determining the type and amount of compute resources, and the fabric routing, required for balanced execution. This all sounds a bit like magic to me, but the Xilinx team seems confident it can pull it off.
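Xilinx’s existing high-level synthesis (HLS) flow gives a flavor of how this division of labor can look. The sketch below is written in that style as an assumption about the direction, not as the actual Versal toolchain (which Xilinx has not fully detailed); the function names are mine, but the pragmas shown are real Vivado HLS directives that let the tools, rather than the programmer, decide buffering and pipelining.

```cpp
// Hypothetical sketch in the style of Xilinx HLS C++: the programmer writes
// sequential-looking stages, and pragmas let the compiler choose how to
// buffer, pipeline, and place each stage. Function names are illustrative.
#include <cstdint>

static void scale(const int32_t *in, int32_t *tmp, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1  // ask the tools for one result per cycle
        tmp[i] = in[i] * 3;
    }
}

static void offset(const int32_t *tmp, int32_t *out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = tmp[i] + 7;
    }
}

// Top-level kernel: the DATAFLOW pragma tells the compiler the two stages
// may run concurrently, with the tools inserting the buffers between them --
// the kind of memory and fabric decision Versal's compiler would automate.
void kernel(const int32_t *in, int32_t *out, int n) {
#pragma HLS DATAFLOW
    int32_t tmp[1024];
    scale(in, tmp, n);
    offset(tmp, out, n);
}
```

The Versal compiler’s job, as Xilinx describes it, is the same kind of decision at a much larger scale: choosing whether each stage lands on an AI engine, a DSP block, or programmable logic, and routing the buffers between them.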

Figure 3: Xilinx envisions a compiler that can organize and dispatch application kernels across the AI engines and programmable logic, with reconfigurable memory hierarchies spanning both. image: XILINX

Outlook and conclusions

Once ready for market, I expect many development teams seeking more performance will explore ACAP. However, I also expect it may take years to realize Versal’s potential, since the platform represents a new paradigm in heterogeneous programming models. A good analogy here is NVIDIA’s CUDA, which took roughly a decade to become a pervasive platform for parallel computing on GPUs. It took a lot of talented engineers, free hardware for universities, and sustained investment in development tools for NVIDIA to realize its vision for CUDA.

Initially, ACAP will work as a data flow processor, with an adaptable memory hierarchy matched to high-performance data flow computing. In the near future, Xilinx will exploit ACAP to optimize parallel computing with Adaptive Intelligent Fabrics, providing low-latency data flow over configurable arrays of multiple ACAP chips. This will replace north/south traffic to and from the top-of-rack switch with east/west interconnectivity between accelerators. Further out, Xilinx envisions ACAP networks across the data center, supporting an adaptable interconnection hierarchy for Distributed Adaptive Computing.

On the software front, the initial goal for ACAP volume production will be to fit the architecture to a complete application, programmed in C or C++, with compiler-generated memory and fabric optimization. The magic will really start when the compiler can also enable multi-core optimization and dataflow. Subsequently, Xilinx envisions the capability to dynamically load and dispatch programs, even from multiple users, across the network of ACAPs.
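Xilinx’s current SDAccel flow already hints at what “dynamically load and dispatch” could look like from the host’s perspective: standard OpenCL calls load a precompiled device binary at run time and enqueue a kernel. The sketch below uses only stock OpenCL 1.2 APIs as an assumption about the shape of the flow, not Versal’s final runtime; the file name kernel.xclbin and kernel name vadd are placeholders, and error handling is elided.

```cpp
// Sketch of host-side dynamic kernel loading in the style of Xilinx's
// current OpenCL-based flow. Versal's final runtime may differ;
// "kernel.xclbin" and "vadd" are placeholder names. Error checks elided.
#include <CL/cl.h>
#include <fstream>
#include <vector>

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, nullptr);

    // Load a precompiled device binary at run time -- the "dynamically
    // load and dispatch" step described above.
    std::ifstream f("kernel.xclbin", std::ios::binary);
    std::vector<unsigned char> bin((std::istreambuf_iterator<char>(f)),
                                   std::istreambuf_iterator<char>());
    const unsigned char *bin_ptr = bin.data();
    size_t bin_size = bin.size();
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &bin_size,
                                                &bin_ptr, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);

    cl_kernel k = clCreateKernel(prog, "vadd", nullptr);
    clEnqueueTask(q, k, 0, nullptr, nullptr);  // dispatch to the device
    clFinish(q);
}
```

The multi-user scenario Xilinx describes would extend this idea so that independently compiled programs could be dispatched across a whole network of ACAPs, with the runtime rather than the programmer deciding placement.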

All of this is just the tip of the iceberg. Xilinx executives realize that meeting the challenge ahead will require open communication with, and development tools for, the application development community. The upcoming Xilinx Developers Forum, to be held October 1-2 in San Jose, will be the next milestone for that outreach. Hope to see you there!