AI Hardware: Harder Than It Looks

Oct 29, 2019 | AI and Machine Learning, In the News

The second AI HW Summit took place in the heart of Silicon Valley on September 17-18, with nearly fifty speakers presenting to over 500 attendees (almost twice the size of last year’s inaugural audience). While I cannot possibly cover all the interesting companies on display in a short blog post, there are a few observations I’d like to share.

John Hennessy’s keynote

Computer architecture legend John Hennessy, Chairman of Alphabet and former President of Stanford University, set the stage for the event by describing how historical semiconductor trends, including the untimely demise of Moore’s Law and Dennard scaling, led to the demand and opportunity for “Domain-Specific Architectures.” This “DSA” concept applies not only to novel hardware designs but to the new software architecture of deep neural networks. The challenge is to create and train massive neural networks and then optimize those networks to run efficiently on a DSA, be it a CPU, GPU, TPU, ASIC, FPGA or ACAP, for “inference” processing of new input data. Most startups wisely decided to focus on inference processing instead of the training market, avoiding the challenge of tackling the 800-pound gorilla that is NVIDIA.

The new approach to software, where the software creates “software” (aka, “models”) through an iterative learning process, demands supercomputing performance. To make the problem even more challenging, the size of these network models is increasing exponentially, doubling every 3.5 months, creating an insatiable demand for ever more performance. As a result, there are now well over 100 companies developing new architectures to bring the performance up and the cost of computing down. However, they have their work cut out for them. Intel’s Naveen Rao points out that to achieve the required 10X improvement every year, it will take 2X advances in architecture, silicon, interconnect, software and packaging.
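To see how the 3.5-month doubling period relates to the roughly 10X-per-year figure, here is a minimal Python sketch of the arithmetic. It assumes smooth exponential growth at the quoted doubling rate; the function name is hypothetical and used only for illustration.

```python
# Assumption: demand grows as a smooth exponential with a fixed doubling
# period, using the 3.5-month figure quoted in the text.
DOUBLING_PERIOD_MONTHS = 3.5


def annual_growth_factor(doubling_period_months: float) -> float:
    """Return the multiplicative growth over 12 months.

    Twelve months contain 12 / doubling_period_months doublings,
    so the yearly factor is 2 raised to that count.
    """
    return 2.0 ** (12.0 / doubling_period_months)


factor = annual_growth_factor(DOUBLING_PERIOD_MONTHS)
print(f"Growth per year: {factor:.1f}X")  # prints "Growth per year: 10.8X"
```

Under that assumption, a 3.5-month doubling period works out to a little more than 10X per year, which is in line with the improvement Rao describes.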

Figure 1: Intel’s Naveen Rao says that the compute capacity needed to handle increasing model complexity will need to improve by 10X every year. image: INTEL

Observation #1: 20 guys in a garage cannot out-engineer the leaders

The startups can and will invent novel architectures that could beat the incumbents in performance, but they will require partnerships with large customers to bring these technologies to market at scale. And while the rich set of architectural approaches is pretty amazing, the pace of development of both the hardware and the prerequisite software is frustratingly slow. A year ago, dozens of startups presented their plans in PowerPoint at the Summit event. This year, dozens of startups presented updated PowerPoints. Where’s the hardware?

The fact is that few new chips have entered volume production since the last summit. Qualcomm’s Snapdragon 855 and Alibaba’s Hanguang 800 are notable exceptions; Snapdragon is, of course, a mobile SoC, and Hanguang is only for Alibaba’s internal use. In part, the delay is because this stuff is a lot harder than it initially looks (isn’t all silicon?). But let’s also be realistic: 20, 50 or even 100 engineers are not going to out-engineer companies like NVIDIA, Google, Xilinx, Microsoft, Amazon AWS and Intel. They can innovate amazing new architectures, but execution is the science of engineering, not the art of architectural design. While many can build a fast chip with lots of TOPS, it will “take a village” of researchers, engineers, university professors, internet datacenters and social networking companies to turn those TOPS into usable performance and to build and optimize models for these new chips.

Israeli startup Habana Labs offers a good example of the challenge. Habana launched its impressive first chip, Goya, for data center inference processing at the inaugural AI HW Summit event. Yet, a full year later, there are no public endorsements or deployments of Goya in spite of the chip’s exceptional performance and very low power. This is not because Goya doesn’t work; it’s because the “rest of the story” will just take some time and effort to play out.

Another prime example is Intel’s Nervana neural network processor. Even armed with an innovative design and a world-class engineering team, that chip was shelved after 3 years of work. About a year ago, Intel wisely went back to the drawing board with additional experience and customer feedback to figure out how it could compete with NVIDIA’s now 3-year-old V100 TensorCore technology, still the industry’s fastest AI chip. Unlike a startup, Intel can afford to wait until it can deliver a winner: Intel’s Nervana processors (NNP-T and NNP-I) are now expected to be sampling later this year. However, NVIDIA isn’t standing still; we should see its new 7nm designs sometime soon (perhaps at SC19 in November, but more likely at GTC ’20 next spring).

Going forward, the pace of production deployment for new chips will be gated by the depth and breadth of the ecosystem investments, in addition to the completion of the chips themselves. Keep in mind that while data centers are embracing heterogeneity, they prefer what I would call homogeneous heterogeneity—selecting a minimum number of chip architectures that can cover the widest range of workloads. To do otherwise would be unprofitable, due to the low utilization of fragmented compute realms, and costly to manage.

Observation #2: There are many avenues to improve performance

As I listened to the presenters at the summit, I was amazed by the rich landscape of innovations they outlined. Here are a few highlights, beyond the use of lower precision, tensor cores and arrays of MACs (multiply-accumulate cores). These are not orthogonal approaches, by the way.

Figure 2: A short list of some of the innovations being pursued in the search for faster and more… image: MOOR INSIGHTS & STRATEGY

There are two primary categories for these architectures. Von Neumann massively parallel designs use code (kernels) that processes matrix operations in the traditional realm of digital computers (do this, then do this, …). More radical approaches typically take the form of melding compute and memory on a chip, either using digital representations for the weights and activations that comprise the neural networks or using analog techniques that more closely resemble the biological functions of the human brain. The analog approach is higher risk, but could hold significant promise.

Many of the digital in-memory designs use data flow computing architectures, including Cerebras and Xilinx Versal, where AI cores are embedded in fabric with on-die memory that pipes activations to and from successive network layers. To make any of these designs work well in inference, the players will need to develop custom compiler technology to optimize the network, trim the unused parts of the network, and eliminate multiplication by zero (where of course the answer is zero).
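The idea of trimming a network and skipping multiplications by zero can be illustrated with a minimal Python sketch. This is not any vendor’s compiler; the function names, the pruning threshold and the sample values are all hypothetical, chosen only to show the principle on a single dot product.

```python
from typing import List, Tuple


def prune_small_weights(weights: List[float], threshold: float) -> List[float]:
    """Set weights whose magnitude is below the threshold to exactly zero."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def compress_nonzero(weights: List[float]) -> List[Tuple[int, float]]:
    """Keep only (index, value) pairs for the nonzero weights."""
    return [(i, w) for i, w in enumerate(weights) if w != 0.0]


def sparse_dot(compressed: List[Tuple[int, float]], activations: List[float]) -> float:
    """Dot product that performs one multiply per surviving (nonzero) weight."""
    return sum(w * activations[i] for i, w in compressed)


# Hypothetical weights and activations for one neuron.
weights = [0.5, 0.001, 0.0, -0.25, 0.002, 1.0]
activations = [2.0, 4.0, 6.0, 4.0, 10.0, 1.0]

pruned = prune_small_weights(weights, threshold=0.01)
compressed = compress_nonzero(pruned)
result = sparse_dot(compressed, activations)

print(f"{len(compressed)} of {len(weights)} multiplies performed, result = {result}")
```

In this toy case, half of the multiplications are skipped. Real toolchains must also weigh the accuracy lost by pruning and the overhead of storing indices, neither of which this sketch attempts to model.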

Figure 3: A useful and simple taxonomy to help put the companies and architectural styles into… image: MYTHIC

Don’t get me wrong, most of these companies, big and small, are going to deliver some pretty amazing designs. Let’s keep in mind, though, the time and magnitude of investments needed to build useful scalable solutions from a novel DSA device. To put that investment in perspective, I suspect that NVIDIA spends hundreds of millions of dollars every year to foster innovation around the world for AI research and development on its chips. No startup can afford this, so they will need to attract some big design wins to help carry them across the chasm.

Observation #3: NVIDIA is still on top

Ian Buck, VP and GM of NVIDIA’s Data Center business unit, bravely took the stage as the event’s last presenter, standing in front of hundreds of hungry wolves dedicated to taking NVIDIA down a notch. NVIDIA has made progress in extending its technology for inference through faster software and DNN research supported by its Saturn V Supercomputer (#22 on the Top 500 list). Buck pointed to design wins for inference, including some big names and a wide range of use cases.

Figure 4: NVIDIA was able to show a dozen companies that have adopted GPUs for inference, as well as all the major cloud vendors. image: NVIDIA

First, to help drive inference adoption on GPUs, NVIDIA announced Version 6 of TensorRT, software that includes an optimizer and run-time support to deploy trained neural networks for inference processing on the range of NVIDIA hardware. It supports the $99 Jetson for embedded processing, Xavier for autonomous vehicles, the Turing T4 for data center applications, and more.

Second, Amazon AWS announced support for the NVIDIA TensorCore T4 GPU, a 75-watt PCIe card that can support complex inference processing for images, speech, translation and recommendations. NVIDIA T4 will be a common comparison target for startups such as Habana Labs and established companies like Intel Nervana. While I assume new chips will come along with outstanding metrics, NVIDIA will rightly argue that the usefulness of these devices in a cloud will depend on the amount of available software and a user base comfortable with running a variety of models on these accelerators.

Finally, demonstrating that GPUs can continually evolve in place (counter to what many startups claim), NVIDIA announced the 8.3 billion parameter Megatron-LM transformer network for language processing. Developed on NVIDIA’s Saturn V using 512 GPUs, this also shows what you can do when you have your own AI supercomputer. Note that NVIDIA also doubled the performance of its existing V100 GPU in just 7 months, as measured by the MLPerf benchmark.

Some still think inference is for lightweights. NVIDIA showed that modern inference use cases require multiple models at real-time latencies to meet users’ expectations, with 20-30 containers collaborating to answer a simple verbal query.

Figure 5: This slide depicts the workflow for answering a simple verbal query. image: NVIDIA


The coming Cambrian Explosion in domain-specific architectures is exciting, but it is still “coming soon to a server near you.” By the time most startups reach the starting gate, many of their potential customers like Google, Amazon AWS, Baidu and Alibaba will have their own designs in production. Additionally, the big semiconductor vendors will have new silicon ready to crunch even bigger networks (like Megatron-LM) or power energy-efficient inference designs.

This doesn’t mean startups should simply give up and return their capital to their investors, but they will have to clear a very high bar, and by a substantial margin. Otherwise, they will need to target niche markets where they can win with better power efficiency and lower prices.

Of course, another option for them is to Go Big, or Go Home, as Cerebras is attempting to do with its Wafer-Scale AI Engine recently announced at Hot Chips. However, this is not an approach I would recommend for the faint of heart! I look forward to seeing the domain-specific architecture landscape develop further.