Intel, Graphcore And Groq: Let The AI Cambrian Explosion Begin

Dec 12, 2019 | AI and Machine Learning, In the News

As we approach the end of a year full of promises from AI startups, a few companies are meeting their promised 2019 launch dates. These include Intel, with its long-awaited Nervana platform, UK startup Graphcore and the stealthy Groq from Silicon Valley. Some of these announcements fall a bit short on details, but all claim to represent breakthroughs in performance and efficiency for training and/or inference processing. Other recent announcements include Cerebras’s massive wafer-scale AI engine inside its multi-million dollar CS-1 system and NVIDIA’s support for GPUs on ARM-based servers. I’ll opine on those soon, but here I will focus on Intel, Graphcore and Groq’s highly anticipated chips.

Intel demos Nervana NNP, previews Ponte Vecchio GPU

At an event in San Francisco on November 12, Intel announced it was sampling its Nervana chips for training and inference to select customers including Baidu, Facebook and others. Additionally, it took the opportunity to demonstrate working hardware. While this looked a lot like a launch, Intel carefully called it an “update.” Hopefully we will see a full launch soon, with more specs like pricing, named customers and OEM partners ready to ship product in volume.

Intel recently previewed impressive performance in the MLPerf inference benchmarks for the NNP-I (the “I” stands for inference). Keep in mind that these chips are the second iteration of Intel’s Nervana design, and I expect Intel incorporated significant customer input into these revisions. While Intel disclosed few details about the microarchitecture, it did tout the training chip’s Inter-Chip Link (ICL), which supposedly enables nearly 95% scaling efficiency as customers add more chips to solve larger problems. Intel also claimed that a rack of NNP-I chips will outperform a rack of NVIDIA’s T4 GPUs by nearly 4X, although I would note that this compares 32 Intel chips to only 20 T4 chips. While improved compute density is a good thing, more details will be required to properly assess the competitive landscape.
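For a rough sense of what the rack-level claim implies per chip, here is a back-of-the-envelope normalization (my own arithmetic, assuming the “nearly 4X” figure refers to aggregate rack throughput):

```python
# Back-of-the-envelope check (my arithmetic, not Intel's): if "nearly 4X" is
# aggregate rack throughput, what does that imply per chip?
intel_chips_per_rack = 32     # NNP-I chips in Intel's rack configuration
nvidia_chips_per_rack = 20    # T4 GPUs in the comparison rack
rack_level_speedup = 4.0      # Intel's claimed rack-vs-rack advantage

per_chip_speedup = rack_level_speedup * nvidia_chips_per_rack / intel_chips_per_rack
print(f"Implied per-chip advantage over a T4: ~{per_chip_speedup:.1f}x")  # ~2.5x
```

In other words, the per-chip advantage works out to roughly 2.5X rather than 4X, which is still notable but a different headline.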

 

Figure 1: Intel demonstrated both the training and inference versions of its NNP architecture, born from the 2016 acquisition of Nervana. image: INTEL

The NNP chips support all major AI frameworks and benefit from the well-respected Nervana software stack. Intel also laid out its vision for the oneAPI development environment, which will support Xeon CPUs, Nervana AI chips, FPGAs and future Xe GPUs. This software approach will be critical in helping Intel’s developer community optimize code once and deploy it across a broad range of devices.

Though details were scarce, Intel also announced its first data-center GPU at SC19, codenamed Ponte Vecchio. We know that Ponte Vecchio will go inside Argonne National Laboratory’s Aurora exascale system in 2022, but we should see consumer versions sometime in 2020.

It is noteworthy that Intel sees a role for so many architectures targeting specific types of workloads, a strategy Intel calls “domain-specific architectures.” The GPU can perform a wide variety of tasks, from traditional HPC to AI, while the Nervana chips are designed to train and query deep neural networks with extreme performance and efficiency. While some may say that Intel is taking a shotgun approach, fielding many architectures in the hope of hitting something, I believe the company is being smart: it is optimizing chips for specific tasks at a scale only Intel can marshal.

The Graphcore Intelligent Processing Unit (IPU)

UK unicorn startup Graphcore recently launched its IPU chip, complete with customers, partners, benchmarks and immediate availability. It is geared towards training and inference processing of AI neural networks, or any other computation that can be represented as a graph. Graphcore has garnered financial and strategic backing from Dell, Microsoft and others, and announced availability of the IPU in both Dell servers and the Microsoft Azure cloud. Customers testing early silicon include the European search engine Qwant (image processing), Microsoft Azure (natural language processing), hedge fund manager Carmot Capital (Markov Chain Monte Carlo) and Imperial College London (robotic simultaneous localization and mapping).

Graphcore’s architecture was designed for the most computationally challenging problems, packing 1,216 cores, 300 MB of in-processor memory delivering 45 TB/s, and 80 IPU-Links providing 320 GB/s of chip-to-chip bandwidth. The company’s strategy is not to take on NVIDIA on every front, but rather to focus on applications ideally suited to its architecture. Consequently, the benchmarks Graphcore published cover relatively new workloads; the company has not yet published results for industry-standard benchmarks such as MLPerf. In a conversation with CEO Nigel Toon last week, I was reassured that more standard benchmarks are forthcoming that will enable tuned, apples-to-apples comparisons. That being said, the published benchmarks span several workloads and are quite impressive in both throughput and latency.
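To put those chip-level numbers in perspective, here is a quick per-core breakdown (my own arithmetic from the figures above, assuming the memory and link bandwidth figures are chip-wide aggregates):

```python
# Per-core view of Graphcore's published chip-level specs (my arithmetic, assuming
# the 45 TB/s and 320 GB/s figures are chip-wide aggregates).
cores = 1216
sram_mb = 300            # total in-processor memory
sram_bw_tbs = 45         # aggregate on-chip memory bandwidth
ipu_links = 80
link_bw_gbs = 320        # aggregate chip-to-chip bandwidth

print(f"SRAM per core:      ~{sram_mb * 1024 / cores:.0f} KB")        # ~253 KB
print(f"Memory BW per core: ~{sram_bw_tbs * 1000 / cores:.0f} GB/s")  # ~37 GB/s
print(f"BW per IPU-Link:    ~{link_bw_gbs / ipu_links:.0f} GB/s")     # ~4 GB/s
```

The implied design point: each core works out of a small, very fast slice of local memory rather than a large shared pool.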

 

Figure 2: Graphcore’s unique design.  image: GRAPHCORE

The compute required to train the largest neural networks is doubling roughly every 3.4 months, according to OpenAI. This means that adopters and researchers need accelerators that can scale to massive sizes to minimize training time. Hence the native interconnect fabrics supported by both the Intel Nervana NNP-T and the Graphcore IPU. In Graphcore’s case, the fabric is enabled by IPU-Links, as well as an on-die IPU-Exchange (switch) for core-to-core communication. Combined, these enable fabrics of accelerators that tile out huge models in parallel, scaling to hundreds or even thousands of nodes. Cerebras is doing something similar but at supercomputing scale, using chips that are each a full wafer of interconnected engines.
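A quick compounding check shows why scale-out matters (simple arithmetic on the OpenAI doubling time cited above; the precise figure matters less than the trajectory):

```python
# How fast does a ~3.4-month doubling time compound? (Simple arithmetic on the
# OpenAI trend cited above; the exact figure matters less than the trend.)
doubling_months = 3.4
for years in (1, 2):
    growth = 2 ** (12 * years / doubling_months)
    print(f"After {years} year(s): ~{growth:.0f}x more training compute needed")
# After 1 year(s): ~12x; after 2 year(s): ~133x
```

A single accelerator, however fast, cannot keep up with that curve on its own; fabrics of accelerators can.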

Groq: Screaming, Streaming Tensors from the creators of Google TPU

Groq is a Silicon Valley startup founded by several members of the Google TPU team, and it operated in stealth mode until this announcement. A few months back, I spoke with CEO and co-founder Jonathan Ross, and he made it clear that the company was building what could be the fastest single-die AI chip in the industry. The company was a no-show at September’s second annual AI HW Summit, causing many to wonder why; it had been widely expected to come out of stealth at the sold-out event.

The company was probably just getting its first silicon back that week—an exciting and super-busy time for any semiconductor company. Clearly the Groq team was up to the task: it had the A0 version of the silicon up and running in one week, was sampling to early customers within just six weeks and has now gone into production.

Mr. Ross was right: the Groq Tensor Streaming Processor (TSP) appears to be the fastest single AI die to date (I say “single die” to differentiate it from the Cerebras Wafer Scale Engine, which is a single chip but comprises 84 interconnected dies). Groq’s TSP cranks out one quadrillion (one thousand trillion) integer operations per second and 250 trillion floating-point operations per second (250 TFLOPS). Of course, the usual caveats apply; we must await application performance benchmarks to see if the software can deliver on the hardware’s potential. Still, these are certainly amazing numbers.

Inspired by a software-first mindset, Groq pushes a lot of the optimization, control and planning functions to the software. The company claims this results in higher performance per millimeter of silicon, saving die area for computation. Perhaps more importantly, the tight integration of the compiler and the hardware produces deterministic results and performance, eliminating the time-consuming profiling usually required. According to the white paper released by the company, “the compiler knows exactly how the chip works and precisely how long it takes to perform each computation. The compiler moves the data and the instructions into the right place at the right time so that there are no delays. The flow of instructions to the hardware is completely choreographed, making processing fast and predictable. Developers can run the same model 100 times on the Groq chip and receive precisely the same result each time.”
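To make the determinism claim concrete, here is a toy sketch of fully static, compiler-choreographed execution. To be clear, this is not Groq’s actual toolchain or instruction set; the function names (`compile_model`, `run`) are invented for illustration. The point is only that when the schedule is fixed at compile time, every run takes the same path and returns the same result:

```python
# Toy illustration of the "compiler choreographs everything" idea -- NOT Groq's
# toolchain or ISA, just a sketch of statically scheduled, deterministic execution.
# The "compiler" emits a fixed list of (cycle, operation) pairs ahead of time; the
# "hardware" simply executes them in order, with no caches, reordering, or arbitration.

def compile_model(weights, biases):
    """Produce a fixed schedule: every op's issue cycle is known at compile time."""
    schedule = []
    for cycle, (w, b) in enumerate(zip(weights, biases)):
        schedule.append((cycle, lambda x, w=w, b=b: w * x + b))
    return schedule

def run(schedule, x):
    """Execute the precomputed schedule; the same input always takes the same path."""
    for _cycle, op in schedule:
        x = op(x)
    return x

program = compile_model(weights=[0.5, 2.0, 1.5], biases=[1.0, -0.5, 0.25])
results = {run(program, 3.0) for _ in range(100)}
print(results)  # a single value: 100 runs, bit-identical results
```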

 

Figure 3: The Groq TSP moves control, planning, and caches to the software stack, freeing up logic area for more cores and performance. image: GROQ

I look forward to learning a lot more about Groq as the company stakes out its messaging and reveals more details about the architecture, but the preliminary claims are undeniably impressive. Groq has set a high bar against which other AI chip companies will be measured.

Conclusions

These three companies have made impressive gains in hardware and software innovation for AI, but more details are needed to validate their claims and to understand where they will excel and where they might struggle. And of course, these are just the first new chips of the coming Cambrian Explosion; over the next 1-2 years, billions of dollars of venture capital will be converted into new silicon for AI.

I suspect the NVIDIA benchmarking and software tuning teams are going to have a busy holiday season!