Speeding AI With Co-Processors

May 7, 2025 | In the News

An artist's conception of a high-speed chip. CADENCE DESIGN

Most chips today are built from a combination of customized logic blocks that deliver the special sauce and off-the-shelf blocks for commonplace functions such as I/O and memory controllers. But one needed function has been missing: an AI co-processor.

In AI, the special sauce has been the circuitry that does the heavy lifting of parallel matrix operations. However, other operations used in AI do not map well onto matrix and tensor silicon. These scalar and vector operations, which compute activations, averages, and the like, are typically run on a CPU or on a digital signal processor (DSP) that accelerates vector math.
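To make that division of labor concrete, here is a toy NumPy sketch, with all shapes and names invented for illustration: the matrix multiply is the kind of work an NPU's parallel MAC array is built for, while the softmax activation and average pooling that follow are the scalar/vector operations that have traditionally landed on a CPU or DSP.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 256)).astype(np.float32)   # 64 tokens, 256 features
w = rng.standard_normal((256, 256)).astype(np.float32)  # layer weights

# Matrix work: what an NPU's parallel MAC array is designed for.
logits = x @ w

# Scalar/vector work: activations and averages that map poorly onto a
# MAC array and are traditionally offloaded to a CPU or DSP.
def softmax(v):
    e = np.exp(v - v.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(logits)      # activation function (vector math)
pooled = probs.mean(axis=0)  # average pooling (vector reduction)
```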

Designers of custom AI chips often pair a neural processing unit (NPU) with a DSP block from companies like Cadence or Synopsys to accelerate scalar and vector calculations. However, these DSPs also include many features that are irrelevant to AI. Consequently, designers end up spending money and power on unneeded features. (Both Cadence and Synopsys are clients of Cambrian-AI Research.)

Enter AI Co-Processors

Large companies that design custom chips address this by building in their own AI co-processors. Nvidia's Jetson Orin uses a vector engine called the PVA (Programmable Vision Accelerator), Intel's Gaudi has its own vector processor within its Tensor Processor Cores (TPCs), Qualcomm's Snapdragon has a vector engine within its Hexagon accelerator, and the Google TPU likewise includes its own vector unit.

AI Co-Processors work alongside AI matrix engines in many accelerators today. CADENCE DESIGN

But what if you are an automotive, TV, or edge infrastructure company designing your own AI ASIC for a specific application? Until now, you had to either design your own co-processor or license a DSP block and use only part of it for your AI needs.

The New AI Co-Processor Building Block

Cadence has now introduced an AI co-processor, called the Tensilica NeuroEdge, that delivers roughly the same performance as a DSP while consuming 30% less die area (and therefore cost) on an SoC. Since NeuroEdge was derived from the Cadence Vision DSP platform, it is fully supported by an existing, robust software stack and development environment.

An AI SoC can combine CPUs, AI blocks like GPUs, vision processors, NPUs, and now AI co-processors to accelerate the entire AI workload. CADENCE DESIGN

The new co-processor can be used with any NPU, is scalable, and helps circuit design teams get to market faster with a fully tested, configurable block. Designers can now combine CPUs based on Arm or RISC-V, NPUs from EDA firms like Synopsys and Cadence, and Cadence's new "AICP," all as off-the-shelf designs or chiplets.

The NeuroEdge AI Co-processor. CADENCE DESIGN

The AICP was born from the Vision DSP and is configurable to meet a wide range of compute needs. NeuroEdge supports up to 512 8×8 MACs, with FP16, FP32, and BF16 support. It connects to the rest of the SoC over AXI or over Cadence's HBDO high-bandwidth interface. Cadence has high hopes for NeuroEdge in the automotive market, and the design is ready for ISO 26262 FuSa (functional safety) certification.

An architectural overview of the AI Co-Processor. CADENCE DESIGN

NeuroEdge is fully supported by the NeuroWeave AI compiler toolchain, whose TVM-based front end enables fast development.
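NeuroWeave's own API is not detailed here, but because its front end is TVM-based, the developer flow presumably resembles a stock Apache TVM compile: import a trained model, build it for a target, and run it. The sketch below uses only standard TVM calls, with a generic CPU target standing in for a vendor back end and a placeholder model file name.

```python
# A minimal, generic Apache TVM flow of the kind a TVM-based front end
# would wrap. "model.onnx" is a placeholder; the actual NeuroEdge target
# and back end are Cadence-specific and not shown here.
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

onnx_model = onnx.load("model.onnx")                # any trained ONNX model
mod, params = relay.frontend.from_onnx(onnx_model)  # import into Relay IR

target = tvm.target.Target("llvm")                  # vendor flows substitute their own target
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Host-side sanity check of the compiled module
dev = tvm.cpu(0)
runtime = graph_executor.GraphModule(lib["default"](dev))
```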

The software stack for development of AI applications using the AI Co-processor. CADENCE DESIGN

My Takeaway

With the rapid proliferation of AI processing in physical AI applications such as autonomous vehicles, robotics, drones, industrial automation, and healthcare, NPUs are assuming a more critical role. Today, NPUs handle the bulk of the computationally intensive AI/ML workloads, but a large number of non-MAC layers, including pre- and post-processing tasks, are better offloaded. Current CPU, GPU, and DSP solutions require tradeoffs, and the industry needs a low-power, high-performance solution that is optimized for co-processing and allows future-proofing for rapidly evolving AI processing needs. Cadence is the first to take that step.