World-Record AI Chip Announced By Habana Labs

Jun 20, 2019 | AI and Machine Learning, In the News

Out of the tsunami of AI chip startups that hit the scene in the last few years, Israeli startup Habana Labs stands out from the crowd. The company surprised and impressed many with the announcement last fall of a chip designed to process a trained neural network (a task called “inference”) with record performance at low power. At the time, Eitan Medina, the company’s Chief Business Officer, promised a second chip called Gaudi that could challenge NVIDIA in the market for training those neural networks. On Monday, the company made good on that promise, announcing a very fast chip that also includes an on-die standards-based fabric to build large networks of accelerators and systems. Availability is set for the second half of 2019.

What did Habana Labs announce?

While the company’s first chip, Goya, targeted the relatively simpler task of inference in data centers, the new Gaudi chip targets AI neural network training, a market dominated today by NVIDIA. Gaudi uses eight “Tensor Processor Cores” (TPCs), each with dedicated on-die memory, alongside a GEMM math engine, PCIe Gen 4, and 32 GB of High Bandwidth Memory. Additionally, it features the industry’s first on-die implementation of RDMA over Converged Ethernet (RoCE) on an AI chip, providing 10×100 Gb or 20×50 Gb communication links to enable scaling up to thousands of accelerators.
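
To put the GEMM engine in context: dense matrix multiplication is the workhorse of both the forward and backward passes of neural network training. Below is a minimal NumPy sketch of the kind of computation such an engine accelerates; it is purely illustrative, not Habana’s programming interface, and the layer sizes are arbitrary.

```python
import numpy as np

# A fully-connected layer's forward pass is a single GEMM:
# activations (batch x inputs) times weights (inputs x outputs).
batch, n_in, n_out = 64, 2048, 1000
activations = np.random.randn(batch, n_in).astype(np.float32)
weights = np.random.randn(n_in, n_out).astype(np.float32)

outputs = activations @ weights          # forward GEMM

# Backpropagation through the same layer needs two more GEMMs:
grad_out = np.random.randn(batch, n_out).astype(np.float32)
grad_weights = activations.T @ grad_out  # gradient w.r.t. the weights
grad_inputs = grad_out @ weights.T       # gradient w.r.t. the inputs
```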

Figure 1: Habana’s new training chip was designed for high-performance AI training at significant scale.  image: HABANA LABS

While I was fairly certain Habana Labs would deliver a fast accelerator, I was impressed and surprised to see the system-level thinking represented in its approach. This includes the chip’s massive on-die interconnect bandwidth, an array of system building blocks, and the AI software suite, called SynapseAI. This development and execution platform also includes a multi-stream execution environment, AI libraries, and a JIT compiler that provides layer fusion and compilation for improved hardware utilization and efficiency. The adherence to industry standards such as RoCE, PCIe Gen 4, and Open Compute Accelerator Modules, along with support for popular AI frameworks, is intended to simplify adoption and deployment in large-scale production environments.
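
To illustrate what layer fusion buys, here is a conceptual sketch in plain NumPy of one common kind of fusion (folding a batch-norm’s scale and shift into the preceding layer’s weights), which turns two passes over the data into one. This is an illustration of the general idea, not SynapseAI’s actual compiler output.

```python
import numpy as np

def fused_linear_batchnorm(x, W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm statistics into the linear layer's weights and
    bias so the pair runs as one GEMM instead of a GEMM followed by a
    separate, memory-bound normalization pass."""
    scale = gamma / np.sqrt(var + eps)      # per-output-channel scale
    W_fused = W * scale                     # fold scale into weights
    b_fused = (b - mean) * scale + beta     # fold shift into bias
    return x @ W_fused + b_fused            # single fused op

def linear_then_batchnorm(x, W, b, gamma, beta, mean, var, eps=1e-5):
    """Unfused reference: two separate passes over the activations."""
    y = x @ W + b
    return gamma * (y - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(4, 8).astype(np.float32)
W = np.random.randn(8, 3).astype(np.float32)
b = np.random.randn(3).astype(np.float32)
gamma, beta = np.ones(3, np.float32), np.zeros(3, np.float32)
mean, var = np.zeros(3, np.float32), np.ones(3, np.float32)

assert np.allclose(fused_linear_batchnorm(x, W, b, gamma, beta, mean, var),
                   linear_then_batchnorm(x, W, b, gamma, beta, mean, var),
                   atol=1e-5)
```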

RDMA over Converged Ethernet (RoCE) is the holy grail of standards-based chip-to-chip memory access, but the market has been dominated by NICs from Mellanox (another Israeli company, which NVIDIA is set to acquire). That acquisition is no coincidence: the market for AI training demands solutions that scale to hundreds or thousands of accelerators to solve increasingly complex neural network models. And this is just the beginning of the scaling challenges ahead: the human brain contains roughly 100 billion neurons and 1,000 trillion synaptic interconnections.

While RDMA (and RoCE in particular) holds great promise, application adoption outside of HPC has been muted, since apps must be written to take advantage of remote memory access while maintaining low latency for local memory fetches. Within the AI realm, there is potential for wider adoption; TensorFlow, the most popular AI framework, was adapted to use RDMA in December 2018. While RDMA use in AI is primarily a subject of ongoing research today, I expect support to grow as high-performance, low-cost chips like Gaudi come to market with RoCE scalability.
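
For readers curious what that looks like in practice, here is a rough sketch of how a distributed TensorFlow 1.x job can opt into an RDMA transport. It assumes a TensorFlow build compiled with verbs/RDMA support and RDMA-capable NICs; the host names and ports are placeholders.

```python
import tensorflow as tf  # TensorFlow 1.x API

# Placeholder cluster definition: two workers and one parameter server.
cluster = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    "ps": ["ps0.example.com:2222"],
})

# Selecting "grpc+verbs" routes tensor transfers over RDMA-capable NICs,
# while gRPC is still used for administrative traffic. This requires a
# TensorFlow build with verbs/RDMA support enabled.
server = tf.train.Server(cluster,
                         job_name="worker",
                         task_index=0,
                         protocol="grpc+verbs")
server.join()
```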

The system building blocks announced Monday include a PCIe card for plug-and-play use in existing servers, an Open Compute-compliant mezzanine card, and a complete system using eight Gaudis called the HLS-1 (akin to a DGX system from NVIDIA). Since the interconnect is standard RoCE with 10×100 Gb or 20×50 Gb Ethernet links per chip, the company has established a standards-based platform for scaling up to thousands of interconnected accelerators.

Figure 2: The HLS-1 system supports 8 Gaudi chips with single-hop interconnects.  image: HABANA LABS

Now, some readers might note that the HLS-1 box contains half the accelerators of an NVIDIA DGX-2, which uses the fast but proprietary NVLink and NVSwitch as an interconnect. I suspect one or more large customers told Habana they would prefer a smaller building block; the Habana design can be readily extended to a 16-node chassis by adding a simple backplane, without needing an external Ethernet switch. Again, NVIDIA’s acquisition of Mellanox may point the way to the future, where large oceans of processors and memory can support very large models and massive training data sets. Low-latency scaling is essential when deploying model parallelism, which farms out pieces of a large neural network for parallel processing. Data parallelism is also aided by a local fabric, but the industry is developing ever-larger networks that will demand hardware capable of efficient model parallelism.
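
To make the parallelism distinction concrete, here is a toy NumPy sketch of the two approaches; the “device” placement exists only in the comments, since the point is what has to cross the fabric in each case.

```python
import numpy as np

# --- Model parallelism: one network split across two accelerators. ---
# Accelerator 0 holds the first layer, accelerator 1 the second; the
# activation tensor must cross the interconnect on every step, which is
# why low-latency fabric bandwidth matters so much.
W0 = np.random.randn(1024, 4096).astype(np.float32)   # resides on accel 0
W1 = np.random.randn(4096, 1000).astype(np.float32)   # resides on accel 1

def model_parallel_forward(x):
    hidden = np.maximum(x @ W0, 0.0)   # computed on accelerator 0
    # <- activations would ship from accel 0 to accel 1 over the fabric here
    return hidden @ W1                 # computed on accelerator 1

# --- Data parallelism: a full replica per accelerator. ---
# Each replica processes its own shard of the batch; only gradients
# cross the fabric, once per step, in an all-reduce.
def all_reduce_mean(per_replica_grads):
    return np.mean(per_replica_grads, axis=0)

x = np.random.randn(64, 1024).astype(np.float32)
logits = model_parallel_forward(x)

# Pretend three replicas each computed a gradient for the same weights:
grads = [np.random.randn(*W0.shape).astype(np.float32) for _ in range(3)]
synced_grad = all_reduce_mean(grads)
```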

On to the 6-billion-dollar question: how fast is this chip, and will it threaten NVIDIA? The new Gaudi chip is said to train image networks at an impressive rate of 1,650 images per second on the ResNet-50 benchmark at a relatively small batch size of 64. That result comes at a very cool 140 watts, roughly half the power needed by high-performance GPUs. Perhaps more importantly, the on-die RoCE fabric enables a very high level of scaling, theoretically to thousands of nodes. NVIDIA’s massive ecosystem of researchers, users, and software will provide a defensive moat, at least for a few years. During that time, of course, we can expect NVIDIA to deliver more than a few surprises of its own.
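
A quick back-of-the-envelope check on that efficiency claim, taking the announced figures at face value; the ~300 W GPU power number below is my assumption for a high-end training accelerator, not a Habana figure.

```python
# Announced Gaudi figures (ResNet-50 training, batch size 64):
gaudi_images_per_sec = 1650
gaudi_watts = 140

# Assumption (mine): a high-end training GPU draws roughly 300 W.
gpu_watts_assumed = 300

perf_per_watt = gaudi_images_per_sec / gaudi_watts
print(f"Gaudi: ~{perf_per_watt:.1f} images/sec per watt")   # ~11.8

# At less than half the assumed GPU power budget, Gaudi only has to
# match a GPU's throughput to roughly double performance per watt.
```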

Conclusions

As the first merchant semiconductor startup to announce a high-end training chip (no, Google is not a merchant semiconductor supplier), Habana Labs has engineered an impressive platform, a robust AI software suite, and system building blocks, all based on industry standards. The support for 10×100 Gb Ethernet with RDMA is a masterstroke and has probably set the standard by which all challengers will, and should, be measured. To be successful, Habana will need to work with the open-source community to fully exploit production-class RDMA in the software frameworks.

While the initial performance per die and the scalability look very impressive, there was only one benchmark Habana was able to share at this time. That is not a huge deal; it’s still early, and we are months away from availability. Over the next few months, I anticipate more metrics, such as the full MLPerf suite with different batch sizes, latencies, and accuracy levels, to better assess the chip’s competitive positioning. Finally, all of this raises the question: what will NVIDIA’s next chip deliver, and when? Will it include on-die fabric support and extensions to Tensor Cores? I can assure you that most startups initially plan for a 10x advantage. While the 3-4x advantage Gaudi claims over NVIDIA’s aging V100 is nice, I suspect the low-cost, high-bandwidth scalability will turn out to be the compelling aspect of Gaudi’s value proposition.