An Update On Intel And Habana Labs

Feb 21, 2020 | AI and Machine Learning, In the News

Last week, I reported that Intel plans to switch its AI acceleration from Nervana technology to Habana Labs, which it acquired in December. Since Intel had planned to bring out both the inference and the training versions of Nervana’s second generation, this came as quite a surprise to many in the industry. I just got an update from Naveen Rao, former Nervana CEO and now Intel GM of AI, and want to share what I learned.

Figure 1: The Habana Gaudi chip is designed for training neural networks and includes an on-die 100Gb fabric that supports RoCE (RDMA over Converged Ethernet) for remote memory access. Image: Habana Labs

Why did Intel make this dramatic change?

According to Intel, customer input helped drive the company’s decision to base its AI roadmap on Habana, folding in future enhancements from the Nervana NNP architecture and software to create what Intel believes will be leadership products. My original supposition was that large customers concluded the Nervana designs were not as fast as the Goya and Gaudi chips from Habana Labs. I also suspected that Intel saw strategic value in the RoCE capability on the Gaudi training chip, a capability it can leverage across a wide variety of future products: accelerators, processors and networking gear.

However, Dr. Rao assured me that performance was not the driving factor in Intel’s decision; apparently both chip families perform quite well. So the decision was likely based on lower costs, RoCE and the desire for a converged architecture. It doesn’t hurt that Intel already has AI R&D teams in Israel.

The Nervana NNP-T training chip, like high-end GPUs, supports High Bandwidth Memory in addition to a large on-die memory store. HBM2 is very fast and relatively large at 16GB, but it adds cost and manufacturing complexity. The Gaudi chip handles the limitations of its smaller memory capacity by exploiting “model parallelism” through the on-die fabric, scaling out to hundreds or even thousands of nodes. Scale-out will become critical as DNN models continue to grow in size and complexity, doubling every 3.5 months. Nervana has always touted model parallelism through a fast, low-latency fabric directly on the NNP-T die, but unlike Habana’s, that fabric is not based on industry-standard Ethernet. Nor does the Nervana fabric support RoCE.
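To make “model parallelism” concrete, here is a minimal PyTorch sketch of the general idea, not Habana’s or Nervana’s implementation: the layers of a network are split across two devices so that neither has to hold the full set of weights, at the cost of shipping activations across the interconnect between stages. The sketch assumes a machine with two CUDA devices, and all names in it are mine.

```python
# Minimal model-parallelism sketch (illustrative only).
import torch
import torch.nn as nn

class ModelParallelMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # The first stage lives on device 0...
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        # ...and the second on device 1, so neither device needs to
        # hold the full set of weights in its local memory.
        self.stage2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        # Activations move between devices across the interconnect;
        # on a Gaudi-style system this traffic would ride the on-die
        # Ethernet fabric rather than PCIe.
        x = self.stage1(x.to("cuda:0"))
        return self.stage2(x.to("cuda:1"))

model = ModelParallelMLP()
out = model(torch.randn(32, 1024))
print(out.shape)  # torch.Size([32, 10])
```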

RoCE is a big deal. By putting it on the chip, Intel can now offer 8 very fast (100Gb) interconnect ports without an expensive Network Interface Card (NIC), which can cost well over $1,000, plus the cost of an expensive (~$10K) top-of-rack switch. RDMA also greatly simplifies the programmer’s challenge of accessing shared memory across a large fabric, and it improves performance compared to software-based memory sharing, which consumes CPU cycles.
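As a back-of-envelope illustration using the prices quoted above (the eight-accelerator pod size is my assumption, purely for the sake of arithmetic):

```python
# Rough savings from integrating the fabric on-chip (illustrative only;
# node count and prices are assumptions based on the figures above).
nodes = 8                 # hypothetical accelerator count in one pod
nic_cost = 1_000          # "well over $1,000" per NIC avoided per node
tor_switch_cost = 10_000  # ~$10K top-of-rack switch avoided

savings = nodes * nic_cost + tor_switch_cost
print(f"Estimated hardware avoided for an {nodes}-node pod: ~${savings:,}")
# Estimated hardware avoided for an 8-node pod: ~$18,000
```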

Finally, Dr. Rao pointed to the advantages of a converged architecture for inference and training, which the Habana architecture provides. The NNP chips were designed without that constraint; each was tasked with being the very best chip the design team could create, and compatibility was not a design factor. That complicates the task of building and optimizing software that can run on either chip. Also, the NNP-T chip was being developed at TSMC, while the NNP-I chip is manufactured in Intel’s 10nm facility, which means each is built with different design libraries.

But AI is all about the software, right?

Yes, a fast chip is just expensive sand without the right software stack. This is an area where Intel is light-years ahead of any startup, second only to NVIDIA. The Intel (Nervana-derived) AI software stack is already layered to provide support, through abstraction, for a wide variety of chips, including Xeon, Nervana, Movidius and even NVIDIA GPUs. Additionally, the Nervana software stack already supports model parallelism, so adapting it to the Ethernet-based fabric on Habana should be fairly straightforward. Of course, the industry will take some time to fully leverage the RDMA features. However, efforts are underway to adopt RDMA in PyTorch, and it is already supported in TensorFlow, as well as by NVIDIA via GPUDirect RDMA.
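To show where an RDMA-capable transport plugs into a framework, here is a minimal torch.distributed sketch (my illustration, not Intel’s or Habana’s stack). The collective call is the same whether the backend rides plain TCP or an RDMA fabric such as RoCE; only the transport underneath changes:

```python
# Minimal distributed all-reduce sketch (illustrative only).
# Run with: torchrun --nproc_per_node=2 allreduce_demo.py
# The "gloo" backend works over plain TCP; an RDMA-capable transport
# (e.g., NCCL over InfiniBand/RoCE) accelerates the same call without
# changing the model code.
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # swap in an RDMA-capable backend on suitable hardware
rank = dist.get_rank()

# Each rank contributes its local gradient; all_reduce sums them in place.
grad = torch.ones(4) * (rank + 1)
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {grad}")

dist.destroy_process_group()
```

That separation between the collective API and the transport is what lets a framework pick up RDMA largely transparently once the underlying fabric supports it.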

Conclusions

I get the impression that Intel saw Habana as a way to accelerate where the company was headed anyway: a converged architecture with lower costs and industry standards. While I’m now satisfied that the Nervana technology was pretty solid, I remain convinced that Intel did the right thing to change horses before it got into the middle of the stream; it is still relatively early in the AI hardware game, and Intel would not have had that option further down the road (apologies for the mixed metaphor). Now Intel and Habana need to get these chips into full production, add Habana support to the OpenVINO software stack, get it all adopted by a few very large customers, and penetrate the enterprise, where the company can leverage its CPU server presence. To be sure, this is no small list of challenges.

Given NVIDIA’s large, multi-year head start, Intel must avoid the temptation to rip it all up and design a new chip from the two efforts; there will be plenty of opportunity to do that later. It is time to execute.