The Cambrian AI Landscape: Groq

Feb 25, 2021 | In the News

Ex-Google TPU engineers have been there and done that!

Startup Groq is now sampling its AI platform to select customers and claims to have built the most efficient DNN processor in the industry. In our view, however, the company needs to provide more transparency to substantiate this claim.

Groq was founded in 2017 by the engineering leadership team that created the Google TPU, with Jonathan Ross as CEO and initial funding provided by Chamath Palihapitiya of Social Capital. Groq focuses on inference processing; however, its first chip supports high-performance floating-point math, so it could also train a neural network. The first part, now sampling, is expected to be in full production soon. As is so often the case with startups, we hope the company can disclose more about its early customers in the coming months.

The Groq processor is unique, acting as a single fast core with on-die memory.

The Groq processor is a novel design, acting as a single core with a high level of vector and matrix (tensor) parallelism. Source: Groq

We spoke with CEO and cofounder Jonathan Ross, and he made it clear that the company was building what he believes could be the fastest single-die chip in the industry. (We prepended his claim with “single-die” in light of Cerebras.)

The Groq node has four chips per card, a configuration similar to that of most AI startups.

The Groq node design: two cards, each with four Tensor Streaming Processors (TSPs), joined with two AMD EPYC “Rome” CPUs. Source: Groq

Mr. Ross appears to have been right: the Groq Tensor Streaming Processor may be the fastest single AI die to date, at 1,000 TOPS at full frequency. Groq has said the chip also delivers 250 trillion floating-point ops per second (FLOPS), which would enable training at the edge and in data centers. Of course, the usual caveats apply, and we await application performance benchmarks to see whether the software can deliver on the hardware's potential. Still, these are impressive numbers.
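
As a sanity check on how a single die reaches numbers like these, peak throughput is simply the number of multiply-accumulate (MAC) units, times two ops per MAC, times clock frequency. The back-of-envelope sketch below uses hypothetical unit counts and clock speeds that happen to reproduce the headline figures; they are our assumptions, not Groq's disclosed microarchitecture.

```python
# Back-of-envelope peak-throughput check. The MAC counts and clock
# below are illustrative assumptions, not Groq's disclosed specs.

def peak_tera_ops(mac_units: int, clock_ghz: float) -> float:
    """Peak throughput in tera-ops/sec; each MAC counts as 2 ops
    (one multiply plus one accumulate)."""
    return mac_units * 2 * clock_ghz * 1e9 / 1e12

# Hypothetical: 400,000 INT8 MACs at 1.25 GHz -> 1,000 TOPS
print(f"INT8 peak: {peak_tera_ops(400_000, 1.25):,.0f} TOPS")

# Hypothetical: 100,000 FP16 MACs at 1.25 GHz -> 250 TFLOPS
print(f"FP16 peak: {peak_tera_ops(100_000, 1.25):,.0f} TFLOPS")
```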

Inspired by a software-first mindset, Groq pushes many optimization, control, and planning functions into the software stack. The company claims this approach yields higher performance per square millimeter of silicon, saving die area for computation. Perhaps more importantly, the tight integration of the compiler and the hardware produces deterministic results and performance, eliminating the time-consuming profiling usually required.

According to the white paper released by the company, “the compiler knows exactly how the chip works and precisely how long it takes to perform each computation. The compiler moves the data and the instructions into the right place at the right time so that there are no delays. The flow of instructions to the hardware is completely choreographed, making processing fast and predictable. Developers can run the same model 100 times on the Groq chip and receive precisely the same result each time.”
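
To make that determinism claim concrete, below is a minimal sketch of statically scheduled execution, in which a "compiler" assigns every instruction a fixed issue cycle ahead of time, leaving no runtime arbitration, caching, or reordering. This is our conceptual illustration, not Groq's actual compiler or instruction set.

```python
# Minimal sketch of statically scheduled, deterministic execution.
# Conceptual illustration only -- not Groq's actual compiler or ISA.

from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    cycle: int       # issue cycle fixed by the "compiler" ahead of time
    op: str          # operation name
    dst: str         # destination register
    srcs: tuple      # source operands

# The "compiler" emits a fully choreographed schedule: because it knows
# each op's latency, nothing about ordering is decided at runtime.
program = [
    Instr(cycle=0, op="load", dst="a", srcs=("mem0",)),
    Instr(cycle=0, op="load", dst="b", srcs=("mem1",)),
    Instr(cycle=1, op="mul",  dst="c", srcs=("a", "b")),
    Instr(cycle=2, op="add",  dst="d", srcs=("c", "a")),
]

def run(program, memory):
    regs = {}
    for instr in sorted(program, key=lambda i: i.cycle):
        if instr.op == "load":
            regs[instr.dst] = memory[instr.srcs[0]]
        elif instr.op == "mul":
            regs[instr.dst] = regs[instr.srcs[0]] * regs[instr.srcs[1]]
        elif instr.op == "add":
            regs[instr.dst] = regs[instr.srcs[0]] + regs[instr.srcs[1]]
    return regs

mem = {"mem0": 3.0, "mem1": 4.0}
# Run the same program 100 times: identical results every time,
# because the schedule is fixed before execution begins.
results = {tuple(sorted(run(program, mem).items())) for _ in range(100)}
assert len(results) == 1
print(run(program, mem))  # {'a': 3.0, 'b': 4.0, 'c': 12.0, 'd': 15.0}
```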

Groq is a software-first design

The Groq TSP moves control, planning, and cache management into the software stack, freeing up logic space for processing elements. Source: Groq

Last summer, the company disclosed its intention to enter the automotive market with a scalable accelerator, in addition to its effort to penetrate the data center. “Because it’s deterministic, [Groq’s chip] appeals to the autonomous folks because it simplifies its software design,” said Bill Leszinske, VP of Products and Marketing at Groq. “They have hours of video footage that it uses to train its models every night, to make incremental improvements. And so, you enable training and deployment in the field on the same hardware, which simplifies its development cycle as well.”

We look forward to learning more about Groq as the company stakes out its target markets and reveals more details about application-level performance. But the initial claims are undeniably impressive. We do wonder about the memory architecture: as far as we understand, the chip has only 220 MB of on-die memory and no DDR interface to act as a local store. Graphcore quickly learned that its chip needed more memory to handle large AI models and increased its SRAM to 900 MB per IPU. We would expect Groq to announce its second-generation platform sometime this year, likely adding the SRAM needed for larger models.
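
A quick footprint estimate illustrates why 220 MB is tight. The parameter counts below are the commonly cited figures for two well-known networks; the precision choices are our assumptions about how a deployment might store weights, not Groq's numbers.

```python
# Rough check: do a model's weights fit in on-die SRAM?
# Parameter counts are the commonly cited figures; the precisions
# are assumptions about deployment, not Groq's numbers.

ON_DIE_SRAM_MB = 220  # Groq TSP on-die memory, per the article

models = {
    "ResNet-50":  25.6e6,   # ~25.6M parameters
    "BERT-Large": 340e6,    # ~340M parameters
}

for name, params in models.items():
    for label, bytes_per_param in [("FP16", 2), ("INT8", 1)]:
        mb = params * bytes_per_param / 1e6
        fits = "fits" if mb <= ON_DIE_SRAM_MB else "does NOT fit"
        print(f"{name} @ {label}: {mb:,.0f} MB of weights -> {fits}")
```

Activations and batching only tighten that budget further, so more on-die SRAM or partitioning a model across multiple TSPs both seem like plausible responses.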

Strengths: The Groq Tensor Streaming Processor (TSP) architecture is unique and potentially higher-performing than other accelerators. Its deterministic nature could be a significant differentiator in applications that require real-time latencies.

Weaknesses: The burden on the software (especially the compiler) is a heavy lift. Memory capacity could be an issue for larger models, but we suspect the Groq team is already working to resolve this limitation.