How Enfabrica Is Reimagining, And Disrupting, The AI Data Center

by Karl Freund | Sep 25, 2023 | In the News

The AI Hardware and Edge AI Summit had a treasure trove of insightful presentations from luminaries such as Andrew Ng, Lip-Bu Tan, Marc Tremblay, and many others. I hope to get around to writing about what I learned, but first, I want to share the innovations from a startup called Enfabrica. The company recently raised a $125M Series B funding round, led by Atreides, with Nvidia participating as a strategic investor. After hearing CEO Rochan Sankar’s presentation, I concluded his company has the potential to revolutionize the communications infrastructure for accelerated computing.

The Current State of the Art will be Disrupted

Today’s state-of-the-art Nvidia DGX system combines CPUs, GPUs, a GPU-Native Fabric (NVLink and NVSwitch), and PCIe switches and NICs to connect these compute devices to a network. While Nvidia enjoys significant revenue from all those networking devices, Enfabrica represents a disruptive technology that can replace or augment a server’s “Switching Tray” and memory subsystems.

Nvidia CEO Jensen Huang does not fear technologies that disrupt the status quo. He embraces them if they advance performance and lower costs. Enfabrica could replace a lot of the NICs he sells, but it is a far more elegant solution.

The startup aims to replace the entire switching electronics of a GPU-Server with a single chip. ENFABRICA

Enfabrica’s Accelerated Compute Fabric Switch (ACF-S)

The Enfabrica ACF-S acts like Pac-Man, gobbling up existing silicon products and replacing them with an integrated solution that is far more cost-effective and lowers latencies. The diagram below is for an Nvidia-based accelerated server, but the ideas apply more generally to systems like Intel Gaudi and AMD’s MI300.

The Enfabrica approach ENFABRICA

Instead of using industry-standard PCIe and Ethernet Network Interface Cards (NICs) with RDMA, the Enfabrica ACF-S provides the interconnectivity and network services for up to eight GPUs and even provides load distribution across the GPUs. This approach reduces network hops by over half and provides memory access across the network to shared pools of elastic, slower CXL memory that augment the HBM.

The company’s solution aggregates eight network cards, which reduces latencies and can lower TCO. ENFABRICA

The image below shows that the ACF-S provides direct access to pools of CXL-attached DDR5 DRAM with only four microseconds of latency and 400 GB/s of aggregate bandwidth. In this vision, there are no more PCIe switches, NICs, or isolated CPU-attached DRAM. The system integrates HBM-equipped GPUs (optionally with integrated on-package CPUs) connected over NVLink, with the ACF-S interfacing to other nodes and shared memory.

The addition of shared CXL memory connectivity will dramatically reshape the server of tomorrow. ENFABRICA
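
To put the latency and bandwidth figures quoted above in perspective, here is a rough back-of-envelope sketch in Python. It assumes a simple "fixed latency plus size divided by bandwidth" model and treats the full 400 GB/s as available to a single transfer; neither assumption comes from the presentation, and the function name and transfer sizes are hypothetical.

```python
# Back-of-envelope model: transfer time = fixed latency + size / bandwidth.
# The 4 microsecond latency and 400 GB/s aggregate bandwidth come from the
# article; the model itself and the transfer sizes are illustrative assumptions.

LATENCY_S = 4e-6        # four microseconds of access latency
BANDWIDTH_BPS = 400e9   # 400 GB/s, assumed fully usable by one transfer


def transfer_time_s(size_bytes: float) -> float:
    """Estimated seconds to move size_bytes from the shared CXL memory pool."""
    return LATENCY_S + size_bytes / BANDWIDTH_BPS


if __name__ == "__main__":
    sizes = [("4 KiB", 4 * 1024), ("1 MiB", 1024**2), ("1 GiB", 1024**3)]
    for label, size in sizes:
        print(f"{label}: {transfer_time_s(size) * 1e6:.2f} microseconds")
```

Under this simplified model, transfers smaller than about 1.6 MB (4 microseconds times 400 GB/s) are dominated by the fixed latency, while larger transfers are dominated by bandwidth.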

The Impact on System Design

Nvidia’s Grace Hopper Superchip and the upcoming AMD MI300 will consolidate the computational components by combining the CPU and GPU onto a single package, which is especially useful for inference processing. These systems won’t need DRAM, as the models all fit (sort of) into the HBM on the box. Training will also benefit from the Enfabrica approach with discrete GPUs and CPUs. As the diagram below shows, a lot of components are soon to be discarded in AI-optimized system design.

The Enfabrica approach would replace all the upper devices. NVIDIA and THE AUTHOR

Conclusions

It is pretty cool to imagine how this technology will impact system design. Everything is shared, everything is connected at lower power, and everything will be faster. You can even run your GPUs at a higher clock frequency, as you aren’t cooling the now-discarded components.

In the future, Nvidia and OEMs may ship denser and more streamlined systems, replacing NIC and PCIe switch revenue with Enfabrica switch revenue.
