The 2020 AI HW Summit: Beware The Benchmarks

Oct 28, 2020

The 3rd annual AI Hardware Summit concluded last week, after four days of mind-bending presentations, panels and discussions about the Cambrian Explosion in AI. For well over two years, many of us industry observers have waited for the players, big and small, to present working production silicon with benchmarks. In previous years, only NVIDIA delivered (or should I say, exceeded) the kind of advancements I expected. Unfortunately, this year, we had to settle once again for PowerPoint presentations full of claims that were optimistic at best. On a brighter note, while NVIDIA still rules the data center, potential competitors finally seem to be edging closer to real products. Many are now shipping their samples to the big customers who will make or break their investors’ dreams. Let’s dive in and examine who I’m excited about, and who continues to disappoint. I would note that neither NVIDIA nor Intel presented this year, likely for very different reasons.

Google

I was thrilled that none other than David Patterson kicked off the event with a keynote on Google’s state-of-the-art AI. While the talk of problem size reduction and the TPU-V3 was exciting, the most important thing I learned from Patterson is that while benchmarks don’t lie, they may not tell the whole story. Specifically, he pointed out that the TPU-V3 and its contemporary, the NVIDIA V100, land in roughly the same ballpark on MLPerf results. Internally, however, Google engineers use the more efficient BFloat16 format instead of IEEE fp16, and the impact is enormous. Patterson also noted that most practitioners still use the far slower IEEE fp32 format for extra precision, which makes me wonder whether external users of the TPU-V3 have made the switch and realized these benefits.

Figure 1: The TPU-V3 performs dramatically better using the BFloat16 format. Image: COMMUNICATIONS OF THE ACM
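
Patterson’s BFloat16 point is easy to demonstrate. Below is a minimal sketch, written in JAX purely for illustration (it is not anything Google presented): the same matrix multiply is run in fp32 and in BFloat16, which keeps fp32’s 8-bit exponent and gives up mantissa bits, so models usually tolerate the reduced precision. The sizes and the error check are arbitrary.

```python
# Illustrative only: compare a matrix multiply in fp32 vs. BFloat16 with JAX.
# On a TPU, the bfloat16 path maps onto the MXU's native format; on CPU/GPU
# this still runs, it just demonstrates the numerics rather than the speedup.
import jax
import jax.numpy as jnp

key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.normal(key_a, (2048, 2048), dtype=jnp.float32)
b = jax.random.normal(key_b, (2048, 2048), dtype=jnp.float32)

# Full-precision reference result
c_fp32 = jnp.matmul(a, b)

# Same computation with inputs cast to bfloat16: same 8-bit exponent as fp32,
# so dynamic range is preserved; only mantissa precision is reduced
c_bf16 = jnp.matmul(a.astype(jnp.bfloat16), b.astype(jnp.bfloat16))

# The relative error is typically small enough for training and inference
rel_err = jnp.abs(c_fp32 - c_bf16.astype(jnp.float32)) / (jnp.abs(c_fp32) + 1e-6)
print("mean relative error:", float(jnp.mean(rel_err)))
```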

SambaNova

Stealthy startup SambaNova made several disclosures about its inference processing platform. SambaNova, much like the UK’s Graphcore, is targeting some of the largest AI models, such as natural language processing, scientific image processing, and recommendation models. I believe this is an example of a significant trend we will see unfold: NVIDIA competitors will focus on applications too massive for a GPU to handle well, or on low-end applications that fly under NVIDIA’s radar. Only a few hardy souls will take on an 800-pound gorilla in its heartland.

But a controversy arose after SambaNova’s presentation, one that pointed out the need for peer-reviewed benchmarks such as MLPerf. It all started when SambaNova took a shot at NVIDIA on BERT-Large performance, claiming 16 times the throughput at half the latency, using a batch size of one. However, this claim contradicts the results published by Microsoft in this blog. Microsoft’s benchmarks imply that the V100 is 33% faster than the throughput SambaNova reported, albeit with a batch size of 64 on the V100. When I asked SambaNova to explain the apparent discrepancy, a spokesperson said the company ran the benchmark on Amazon AWS V100 instances with a batch size of four. But I must point out that the performance of the ancient Maxwell GPUs in the Azure NV6 instance looks precisely like the V100 results SambaNova claims to have measured (8,000/500 = 16, which is the performance advantage SambaNova is claiming versus the V100).

So, there are two takeaways here. First, in the quest for impressive performance claims, startups must be careful to provide full disclosure, ideally reviewed by outside professionals. Second, batch size is a contentious variable in inference benchmarking. When running inference in a large cloud service, a larger batch size significantly improves throughput; as long as latency stays low (Microsoft states its goal is under 10 ms), larger batches are a good thing. But other applications, such as autonomous vehicles and robotics, follow a different dynamic: waiting to accumulate multiple queries in a real-time control loop is unacceptable, so chips in those applications must deliver excellent latency at small batch sizes.

Figure 2: Microsoft published performance data on a 3-layer BERT NLP model that seems to contradict SambaNova’s performance comparison to the NVIDIA V100. In fact, SambaNova’s NVIDIA numbers look like Maxwell results. image: MICROSOFT AND KARL FREUND
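
To see why batch size muddies these comparisons, here is a back-of-the-envelope sketch. The latencies below are hypothetical placeholders, not SambaNova, NVIDIA, or Microsoft measurements; the only point is that throughput scales with batch size as long as the latency budget allows it.

```python
# Hypothetical numbers for illustration only. If a device completes one batch
# per latency window, throughput is simply batch_size / latency.
def throughput_qps(batch_size: int, latency_s: float) -> float:
    return batch_size / latency_s

# Cloud-style serving: large batch, latency held under a ~10 ms budget
print(throughput_qps(batch_size=64, latency_s=0.010))  # 6400.0 queries/sec

# Real-time control (robotics, autonomous vehicles): batch of one, low latency
print(throughput_qps(batch_size=1, latency_s=0.002))   # 500.0 queries/sec
```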

Groq

I have to say that Groq remains somewhat of an enigma to me. The founder and CEO, Jonathan Ross, claims his chip is the only 1000 TOPS AI processor—an impressive assertion to be sure. At the summit, he said the company is shipping samples to large customers but declined to provide any meaningful comparisons that would enable us all to “groq” the chip’s capabilities. Given the company’s Google TPU heritage, I have no reason to doubt that the first product will be quite impressive. But we will have to wait a while longer to see some performance numbers!

Figure 3: Groq, the startup founded by Google TPU inventors, is sampling its AI platform to select customers.  image: GROQ

Qualcomm

Compared to the razzle-dazzle of some other companies at the event, Qualcomm’s Ziad Asghar brought a welcome air of calm self-confidence and assurance. As I have opined on Forbes before, I believe the Cloud AI100 (now sampling) is a strong contender and a viable alternative to NVIDIA for edge cloud and 5G AI. Capable of delivering 400 TOPS while consuming only 75 watts (roughly 5.3 TOPS per watt), the Cloud AI100 is one of the most efficient inference platforms announced to date. Asghar put this all into the context of the company’s vision for distributed intelligence, from mobile and embedded Snapdragon devices to the edge cloud and the tier-one cloud service providers. Qualcomm wants everyone to know that it has a vision and impressive technology to offer, along with one of the industry’s most established AI software stacks. I must admit that I’m a fan.

Mythic

OK, now back to the drawing boards. Mythic has been one of the highest-profile startups focused on analog computing for AI. The company’s vision is to perform matrix calculations inside flash memory chips. While this may sound crazy, it is one of the most exciting developments that could materialize between now and quantum computing, the ultimate compute technology of our lifetime. Flash is very dense and consumes very little power, especially when turned off. Early research suggests this approach could improve efficiency by at least an order of magnitude. Stay tuned for this one!

Figure 4: Mythic strives to build the first production analog AI accelerator based on flash memory. image: MYTHIC
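
To make the analog idea concrete, here is a purely conceptual sketch that has nothing to do with Mythic’s actual silicon. In an analog in-memory array, weights are stored as cell conductances, inputs arrive as voltages, and each column’s summed current is physically a dot product (Ohm’s law plus Kirchhoff’s current law); the digital simulation below mimics that, with coarse quantization standing in for the ADC and other analog non-idealities.

```python
# Conceptual simulation of analog in-memory matrix-vector multiplication.
# This illustrates the general technique, not Mythic's design.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 64))   # "programmed" cell conductances
inputs = rng.normal(size=256)          # input activations applied as voltages

# Ideal analog MAC: the whole matrix-vector product happens inside the array
ideal_out = inputs @ weights

# Crude stand-in for the ADC: quantize the column currents to 8 bits
adc_bits = 8
scale = np.max(np.abs(ideal_out)) / (2 ** (adc_bits - 1))
analog_out = np.round(ideal_out / scale) * scale

print("max abs error vs. ideal:", np.max(np.abs(analog_out - ideal_out)))
```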

An interesting use case: Using AI to build AI chips

In addition to presenting my annual state-of-the-industry landscape, I was honored to be asked by Synopsys to host a panel discussion on the use of AI to improve design optimization for semiconductors. The small panel featured Intel, Google, and Synopsys engineers, and highlighted each company’s AI journey to cut costs, improve quality, and deliver faster chips at lower power. I heard a lot from Synopsys about the impressive results its clients have seen with AI. The EDA firm used AI to optimize the physical design process (place and route) so that it could be done with fewer engineers in less time. This approach has helped customers build products, from mobile SoCs to AI accelerators, that exceed target frequency, often with less power and die area. Now, cutting-edge engineering teams are applying AI across the development process with impressive results, achieving some 30% better chip efficiency with less engineering effort. The message was clear: semiconductor teams had better use AI to optimize chip design, or else they’ll get left behind.

Figure 5: Synopsys leverages AI to help semiconductor teams produce better chips, with less effort and time. image: SYNOPSYS
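
To illustrate the concept only (this is not Synopsys’ tooling, and the knobs and cost model below are invented for the example), AI-driven design-space optimization treats the implementation flow as a black box that maps tool settings to quality of results, and lets a search or learning loop decide what to try next. Production systems use reinforcement learning; a simple random search stands in for it here.

```python
# A toy design-space search over hypothetical place-and-route settings.
# run_flow() is a made-up stand-in for a real implementation run that would
# return frequency, power, and area after hours of compute.
import random

def run_flow(settings: dict) -> float:
    """Fake quality-of-results score: higher effort buys frequency but costs power."""
    freq = 2.0 + 0.3 * settings["effort"] - abs(settings["target_density"] - 0.7)
    power = 1.0 + 0.1 * settings["effort"]
    return freq / power

best_score, best_settings = float("-inf"), None
for trial in range(50):
    candidate = {
        "effort": random.uniform(0.0, 1.0),
        "target_density": random.uniform(0.5, 0.9),
    }
    score = run_flow(candidate)
    if score > best_score:
        best_score, best_settings = score, candidate

print("best score:", round(best_score, 3), "with settings:", best_settings)
```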

How to win designs for AI chips

The AI software and services company Codeplay presented its thoughts and experiences on how the conference’s attendees should think about design wins. There were a few tidbits I took away and shared with my clients; Codeplay echoed much of what I have been advising them. Let’s assume a company like Facebook or Microsoft determines, after extensive development and analysis, that a startup’s architecture is worthy of deployment. But before you pop those champagne corks, realize that this all took a lot of time, during which the incumbent, inevitably NVIDIA, has improved performance or even introduced a new and better chip. These big companies are unlikely to accept much, if any, risk to save a few dollars. Consequently, you should expect that the first chip could be just a test chip that paves the way for deploying your second chip, not a significant source of revenue. This rule doesn’t always hold, as some AI hardware could enable solutions to previously unsolved problems and produce substantial new income. But it is a possible outcome that any startup must factor into its plans and cash burn. As my CS professor in grad school (Fred Brooks, father of the IBM System/360) always admonished us: build the first solution, then throw it away.

Figure 6: Codeplay presented its experiences and insights from years of helping chip companies seek design wins. image: CODEPLAY

Wrapping up

I’m exhausted, and suspect you are as well if you made it to the end of this blog! The Cambrian Explosion of AI accelerators is still in the early phases of experimentation and innovation. I can’t wait to see what’s next!