Today unicorn startup Cerebras disclosed a few details about the wafer-scale AI chip it has been keeping under wraps for some three years. While many unanswered questions remain, the new approach could mark a significant milestone for the semiconductor industry, where chips have historically been constrained by the size of a single reticle (the mask used in lithography). Basically, Cerebras designed a wafer of 84 interconnected die that act as one device for compute and memory, tied together by a super-fast on-wafer fabric. While building a supercomputer on a chip sounds like a great idea, building a wafer-scale array of die is not for the faint of heart or talent.
Furthermore, if Cerebras is right, AI might be just the start of wafer-scale integration; applications increasingly demand better performance than CPUs can deliver. For example, I can imagine Cerebras' wafer-scale approach completely transforming high-performance computing if the company turns its attention to floating-point cores once it finishes its first AI-focused implementation.
Background
Cerebras was co-founded by hardware architect Sean Lie and CEO Andrew Feldman, founder and former CEO of micro-server innovator SeaMicro (acquired by AMD in 2012). Mr. Feldman's new startup now employs nearly 200 engineers, many of whom are SeaMicro alumni, has raised over $120M, and was recently valued at $860M. While most AI semiconductor startups focus on building more efficient arrays of cores and on-chip memory for the matrix and vector processing needed by deep neural networks, Cerebras decided to go beyond optimizing the math. Instead, it strove for extreme scalability.
If you’re wondering whether this level of performance is really needed, consider that in the recently published MLPerf AI benchmarks, Google and NVIDIA set a ~2-hour training record using AI supercomputers costing tens of millions of US dollars. As Greg Diamos, Senior Researcher at Baidu, shared, “Training large models on very large data sets would take months or years of critical path compute time, making these training runs impractical for any real-world problem on existing systems.” This leads to the conclusion that the industry is still 2-4 orders of magnitude away from anything approaching interactive training of deep neural networks. This jibes with Cerebras’ goal to deliver 1,000 times the performance of the state of the art. Meanwhile, the newer networks being developed and trained are increasingly complex and deep, so the chip industry must scale dramatically to keep pace with the performance appetite of research scientists.
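To make that gap concrete, here is a rough back-of-envelope sketch using my own illustrative targets for "interactive" (these numbers are assumptions, not figures from Cerebras or MLPerf):

```python
import math

# Back-of-envelope: how much faster must training get to feel "interactive"?
# The targets below are my own illustrative assumptions.
current_run_s = 2 * 3600                     # ~2-hour record-setting training run
for label, target_s in [("a few minutes", 180), ("a few seconds", 3)]:
    speedup = current_run_s / target_s
    print(f"{label}: ~{speedup:,.0f}x faster (~10^{math.log10(speedup):.1f})")

# -> roughly 40x to reach minutes, ~2,400x to reach seconds; for the months-long
#    runs Diamos describes, the gap widens to the 2-4 orders of magnitude cited above.
```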
The wafer-scale engine: a big chip with big challenges
The company shared some details about its design this week at the annual Hot Chips conference held on the campus of Stanford University. While it is too early to see benchmarks or real-world use cases, the specs shared by Cerebras are truly mind-blowing. The “chip” is cut from a 300mm wafer built by TSMC using its mature 16nm manufacturing process. The device features (a rough per-core breakdown follows the list):
- 1.2 trillion transistors
- 46,225 mm2 of silicon
- 400,000 AI programmable cores
- 18 GB of super-fast on-die memory (SRAM)
- 9 Petabytes/s memory bandwidth
- 100 Petabits/s fabric bandwidth
- Native optimization for sparsity (to avoid multiplying by zero)
- Software compatibility with standard AI frameworks such as TensorFlow and PyTorch
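Taking the published figures at face value, a quick bit of arithmetic shows what they imply per core (my own back-of-envelope calculation, not numbers Cerebras has published):

```python
# Per-core breakdown of the published WSE specs.
# My own arithmetic from the figures above, not numbers Cerebras has published.
cores = 400_000
sram_bytes = 18 * 1024**3            # 18 GB of on-die SRAM
mem_bw_bytes_s = 9e15                # 9 PB/s aggregate memory bandwidth
fabric_bw_bits_s = 100e15            # 100 Pb/s aggregate fabric bandwidth
silicon_mm2 = 46_225

print(f"SRAM per core:      ~{sram_bytes / cores / 1024:.0f} KiB")
print(f"Mem BW per core:    ~{mem_bw_bytes_s / cores / 1e9:.1f} GB/s")
print(f"Fabric BW per core: ~{fabric_bw_bits_s / 8 / cores / 1e9:.1f} GB/s")
print(f"Silicon per core:   ~{silicon_mm2 / cores:.2f} mm^2")
```

That works out to roughly 47 KiB of SRAM and a tiny sliver of silicon per core, i.e., a sea of small cores each sitting right next to its own memory.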
By building a really large chip, Cerebras believes it can store and process an entire neural network on just one of these devices, eliminating the need to split a model across multiple devices and memory tiers (a process called model parallelism). Think of this approach as akin to building an entire cluster of computers, with memory, on one single, vast chip. On-chip memory and communications are thousands of times faster than going off-chip, and should save significant power and expense compared to racks of servers with hundreds of traditional accelerators.
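For readers less familiar with the term, here is a minimal sketch of what model parallelism looks like today on conventional accelerators (generic PyTorch, not Cerebras' software stack):

```python
# A minimal sketch of model parallelism on conventional GPUs (generic PyTorch,
# not Cerebras's stack): when a network won't fit on one device, its layers are
# split across devices and activations are shuttled between them.
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the network lives on device 0, second half on device 1.
        self.stage1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        x = self.stage2(x.to("cuda:1"))   # off-chip hop between stages
        return x
```

Cerebras' pitch is that a single wafer-scale device keeps the whole model in on-die SRAM, so this splitting, and the slow device-to-device hops it entails, simply goes away.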
Such a bold design must overcome significant technical hurdles, including interconnectivity, memory, power, yield, packaging, and cooling. One of the breakthroughs the company touts is the “Swarm” fabric mesh for die-to-die communication across the wafer. These connections must cross the boundaries between the reticles (masks) used in photolithography and are etched directly on the silicon. Mr. Feldman says this solution is based on Cerebras-owned technology that was jointly developed with TSMC for fabrication and testing.
To deal with inevitable defects on the wafer, Cerebras taps into redundant cores and fabric links to replace bad circuits and reconnect the grid at startup time. In addition, there is no package available in which such a massive chip can be installed, powered, and cooled. Cerebras had to invent custom packaging technologies and tools to address these challenges.
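Cerebras has not published how its redundancy scheme actually works, but purely as a toy illustration of the general idea, a logical grid of cores can be mapped onto a slightly larger physical grid, skipping whatever is flagged as defective at power-on:

```python
# Toy illustration of defect remapping (NOT Cerebras's actual scheme, which is
# unpublished): logical columns of cores are mapped onto physical columns that
# include spares, skipping any column flagged as defective at startup.
def build_column_map(physical_cols, defective_cols, logical_cols):
    """Map each logical column to the next good physical column."""
    good = [c for c in range(physical_cols) if c not in defective_cols]
    if len(good) < logical_cols:
        raise RuntimeError("not enough spare columns to hide all defects")
    return good[:logical_cols]

# 12 physical columns, 10 logical columns -> 2 spares available.
mapping = build_column_map(physical_cols=12, defective_cols={3, 7}, logical_cols=10)
print(mapping)   # [0, 1, 2, 4, 5, 6, 8, 9, 10, 11]
```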
Conclusions and outlook
Given the dearth of details on performance and power, it is difficult to evaluate how impactful this wafer-scale approach might become. That being said, I find the approach compelling and believe Cerebras could vault into a leadership position if it can deliver on its aggressive vision of scalable parallel computing. Cerebras says it is currently working with major customers to evaluate early silicon, and that it hopes to ship production servers using its WSE by mid-2020. This implies the company has already mastered many of the technical challenges involved in getting these beasts to market. Of course, it still needs to develop the ecosystem of software and researchers required to turn theoretically fast chips into fast solutions for real problems. That said, if Cerebras delivers a solution that is 1,000 times faster than the competition, at a reasonable price and power envelope, I believe the ecosystem will beat a path to its door. Still, that is a lot of “ifs” that must be answered.
Combined with other technologies launched at Hot Chips, this announcement certainly points to a wave of innovation in semiconductor design driven by the insatiable performance demands of AI. I expect this theme to continue at the upcoming AI HW Summit in September. Yes, hardware is finally cool again. The Cerebras design also points to a possible future of highly scalable parallel compute that will inevitably find its way into applications beyond AI. I suspect that the US DOE, which is spending in excess of $1.5B on the next wave of supercomputers, is watching closely.