Taalas Launches Hardcore Chip With ‘Insane’ AI Inference Performance

Feb 24, 2026 | In the News

Taalas performance was described by one user as “Insane”. TAALAS

The AI hardware market looks a lot different today than it did yesterday, thanks to the introduction of the Taalas Hardcore AI architecture. Founded two and a half years ago, Taalas has developed a platform for transforming any AI model into custom silicon, hardening the model, parameters and weights into extremely fast and low-cost silicon. From the moment a new model becomes available, Taalas can build hardware for it in two months. This is essentially AI ASICs on steroids and delivers “insane” performance according to one early user. (Taalas, like many AI semiconductor companies, is a client of my firm, Cambrian-AI Research, LLC.)

How can the AI industry solve these problems? Taalas thinks the answer is designing chips for each AI model. TAALAS

Hardware Designs: Flexibility vs. Performance

Hardware designs always involve trade-offs: typically ease of programming versus performance. For example, today’s GPUs, such as those from Nvidia and AMD, provide a flexible platform for most parallel workloads, including AI, HPC and graphics processing. GPUs are not as flexible as CPUs but are an order of magnitude faster and programmable for solving parallel problems.

Alternatively, AI ASICs at AWS, Google, Meta and others provide far less flexibility while offering greater performance when running AI models. However, they are largely ineffective for other workloads, such as HPC codes (fluid dynamics, structural analysis, etc.). It’s a tradeoff between flexibility and performance.

An example of further optimization at the expense of flexibility is startup Etched. The Etched Sohu chip is an application‑specific integrated circuit designed exclusively to run transformer architectures (LLMs, vision transformers, MoE variants, etc.), but cannot run non‑transformer models like CNNs, LSTMs or DLRMs. But Sohu is expected to be very fast.

Taalas, which just emerged from stealth, has taken specialization a step further with customized working silicon for one, and only one, model. The company’s performance results are mind-blowing and potentially industry changing if large data centers adopt it. But that is a very big “if.”

“Our first product was brought to the world by a team of 24 team members, and a total of just $30 million spent, of more than $200 million raised,” said Taalas co-founder and CEO Ljubisa Bajic. “Our debut model is clearly not on the leading edge, but we decided to release it as a beta service anyway – to let developers explore what becomes possible when LLM inference runs at sub-millisecond speed and near-zero cost.”

Taalas Hardcore AI: Hardened AI Models

Instead of using software to program (compile) and execute on a GPU or an ASIC such as a Google TPU, Taalas hardwires the model and its weights into a “Hardcore” ASIC (HC1) embedding the entire model into a bespoke application-specific chip. This strategy is both an advantage and a potential market hurdle, as I explore in this article.

While largely hard-wired for speed, the HC1 retains flexibility through a configurable context window size and support for fine-tuning via low-rank adapters (LoRAs).

How Fast Is It?

Hardcore delivers astounding performance. Taalas first implemented its approach for Meta’s popular open Llama 8B model, available for demos on the company’s website. I tried it and you should, too; the demo and the announcement blog are both on the Taalas site.

HC1 delivers instantaneous responses, even for longer tasks like providing a detailed monthly history of WWII (0.138 seconds and 14,357 tokens per second). It’s so fast, you can’t even watch it scroll. Crazy. The economic benefit to cloud service providers could be dramatic if adopted at scale, as Taalas could support many more simultaneous queries and tokens per dollar spent.
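As a sanity check on those demo figures, the quoted latency and throughput imply an output of roughly two thousand tokens. The numbers come from the article; the arithmetic below is my own back-of-envelope sketch.

```python
# Back-of-envelope check of the demo figures quoted above
# (0.138 s at 14,357 tokens/s). Figures are from the article;
# the arithmetic is mine, not Taalas's.

def tokens_generated(seconds: float, tokens_per_second: float) -> int:
    """Approximate output length implied by a latency and throughput figure."""
    return round(seconds * tokens_per_second)

n = tokens_generated(0.138, 14_357)
print(f"~{n} tokens in 0.138 s")  # roughly 1,981 tokens
```

That is a long, detailed answer delivered faster than a screen refresh, which is why the scrolling is impossible to follow.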

The Taalas performance on the HC1 chip for Llama 8B, measured on working silicon now available for testing and demos, is 10 times faster than the Cerebras wafer-scale engine, currently the fastest inference platform available. It is two orders of magnitude faster than GPUs.

Here’s a view of the economics afforded by the Hardcore approach. Inference queries for a single model cost 0.75 cents per million tokens for Llama 3.1 8B and 7.6 cents for the DeepSeek R1 reasoning model. The Llama results have been measured on the first-generation silicon, while the DeepSeek results are simulated. Compare that to 3.79 cents (throughput-optimized) and 28.6 cents (latency-optimized) on GPUs for Llama 8B, and 20-49 cents on GPUs for DeepSeek R1.
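To make the comparison concrete, here is a small sketch that turns the per-million-token prices above into savings ratios. The cost figures are taken from the article as quoted; how the GPU numbers were measured is not specified, so treat the ratios as illustrative.

```python
# Cost-per-million-token figures quoted in the article (US cents).
# The ratio math is my own illustration, not a vendor benchmark.
COSTS_CENTS = {
    "Llama 3.1 8B": {"taalas_hc1": 0.75, "gpu_throughput": 3.79, "gpu_latency": 28.6},
    "DeepSeek R1": {"taalas_hc1_simulated": 7.6, "gpu_low": 20.0, "gpu_high": 49.0},
}

def savings_ratio(gpu_cents: float, taalas_cents: float) -> float:
    """How many times cheaper the hardened chip is per million tokens."""
    return gpu_cents / taalas_cents

llama = COSTS_CENTS["Llama 3.1 8B"]
print(f"{savings_ratio(llama['gpu_throughput'], llama['taalas_hc1']):.1f}x cheaper "
      "than a throughput-optimized GPU for Llama 8B")
```

Even against the most favorable GPU configuration, the hardened chip comes out roughly 5x cheaper per token, and far more against latency-optimized serving.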

Taalas economics could, over time, reshape the industry. TAALAS

When a model updates, typically every year or so, Taalas can revise the chip in under two months. The cost comparisons above include three such upgrades over a four-year lifespan, including the expenses of designing and taping out each upgrade.

Power and cooling are two of the top concerns for AI deployments. Energy costs of Hardcore systems are significantly lower than today’s technologies, at only 12-15 kW per rack, compared with 120-600 kW for a GPU rack. The Taalas rack can also be air-cooled, reducing the need for costly data center retrofits. Taalas HC1 PCIe cards can be installed in virtually any server, and support both Intel and AMD CPUs.
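The power gap is worth quantifying. Using the per-rack figures above, a quick sketch of the best- and worst-case reduction factors (my own arithmetic, applied to the article's quoted ranges):

```python
# Rack power figures quoted in the article (kW); the reduction-factor
# computation is my own back-of-envelope, not a Taalas claim.
TAALAS_RACK_KW = (12, 15)
GPU_RACK_KW = (120, 600)

def reduction_range(gpu_kw, taalas_kw):
    """Best- and worst-case power reduction factors across the quoted ranges."""
    conservative = gpu_kw[0] / taalas_kw[1]  # lowest GPU rack vs highest Taalas rack
    optimistic = gpu_kw[1] / taalas_kw[0]    # highest GPU rack vs lowest Taalas rack
    return conservative, optimistic

lo, hi = reduction_range(GPU_RACK_KW, TAALAS_RACK_KW)
print(f"Roughly {lo:.0f}x to {hi:.0f}x less power per rack")  # ~8x to ~50x
```

A 12-15 kW rack is also comfortably within air-cooling territory, which is what lets Taalas sidestep liquid-cooling retrofits.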

The Taalas HC1 card. TAALAS

Taalas will sell both Inference as a Service and Hardcore hardware to customers. The Llama 3.1 8B-based HC1 is not expected to generate significant revenues; it is more of a prototype than a production model. But it’s great for demos and for validating the Hardcore concept and the Taalas business model.

Expected Market Impact

With this performance and these economics, the value proposition seems like a slam dunk: more than 10x faster, 10x cheaper and far more power efficient. What else is there?

However, customers may balk at the lack of flexibility, so it’s hard to know for sure. Let’s look at the pros, issues and how Taalas is addressing potential objections.

The pros are obvious. Even if a CSP swaps out thousands of accelerators every time a model changes, it would be saving 60-75% in capex over a four-year lifespan compared with whatever alternative AI accelerator it might consider, all while providing customers the fastest AI on the planet.

The problem, however, is that customers would have to absorb multiple iterations of a hardened model over time and multiple versions to run different models. No data center organization wants to manage that many SKUs; the operational complexities are hard to fathom. But the economics remain compelling.

Taalas will deliver superfast AI in standard server chassis. TAALAS

But if a data center has only a few models that consume a large percentage of production AI inferencing cycles, then the economics could work out quite nicely. And that’s putting it mildly.

Another concern customers may have is the short lifespan of a Taalas rack. While Nvidia and AMD are both now on annual cadences, the Taalas approach means each model domain must be updated. Model cadence is now roughly twelve to eighteen months, while a new silicon generation can take two to three years to develop.

To address this concern, Taalas requires changes to only two metal layers for an upgrade, not a complete redesign of the base logic. The company says it can do that in only two months, not two years. Taalas likely has its supplier (TSMC) store unfinished wafers, ready for the two metal (interconnect) layers to be added. That is the only way it could get a two-month turnaround.

But …

But in spite of this awesome performance (for a single model), Meta is doubling down on its relationship with Nvidia in what the AI chip giant called a “multigenerational” deal. It is reasonable to suspect that Meta already knew about Taalas, and that knowledge did not change its viewpoint, apparently, that Nvidia will continue to dominate AI hardware. At least for now.

The idea of hardware designed for each model has issues, which Taalas has addressed. TAALAS

The Taalas Hardcore AI Roadmap

At least initially, the Hardcore chip is built for inference on small models like Llama 8B, but the roadmap includes larger models running across multiple chips. The “Silicon Llama” is aggressively quantized, combining 3-bit and 6-bit parameters, which introduces some quality degradation relative to GPU benchmarks.
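Mixed 3-bit/6-bit quantization is what makes hardwiring an 8-billion-parameter model into silicon feasible. Here is a rough footprint estimate; note that the 50/50 split between 3-bit and 6-bit weights is my assumption for illustration, since Taalas has not published the actual mix.

```python
# Rough weight-storage estimate for an 8B-parameter model under the
# mixed 3-/6-bit quantization described above. The 50/50 bit-width
# split is an assumption of mine, not a published Taalas figure.
PARAMS = 8e9

def model_bytes(params: float, frac_3bit: float) -> float:
    """Approximate weight storage in bytes, ignoring metadata and activations."""
    avg_bits = frac_3bit * 3 + (1 - frac_3bit) * 6
    return params * avg_bits / 8

print(f"~{model_bytes(PARAMS, 0.5) / 1e9:.1f} GB at a 50/50 mix "
      f"vs ~{PARAMS * 16 / 8 / 1e9:.0f} GB in FP16")
```

Cutting the weight footprint from roughly 16 GB in FP16 to a few gigabytes is what allows the entire model to live on-chip, but the aggressive low-bit formats are also the source of the quality degradation noted above.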

The second model, still based on Taalas’ first-generation silicon platform (HC1), will be a mid-sized reasoning LLM. It is expected to arrive in Taalas’ labs this spring, thanks to the two metal layer changes, and will be integrated into the company’s inference service shortly thereafter.

Then, Taalas will fabricate a frontier LLM using the second-generation silicon platform (HC2). HC2 offers considerably higher density, uses multiple chips for logic and memory and delivers even faster execution. It will adopt standard 4-bit floating-point formats to address accuracy limitations while maintaining high speed and efficiency. Deployment of this likely terabyte-scale hardened model is planned for winter 2026. Thereafter, future tape-outs will be prioritized by deal size and the significance of the customer in the industry.

Are We Entering a New AI Hardware Era?

Costs per token have fallen by roughly two orders of magnitude since the original Cloud AI 100 era, with something like a 50–100× reduction for “GPT‑4‑class” capability and 100–1000× for smaller/open models, depending on whose stack you look at. Much of this has been through specialization. Now Taalas has entered the era of full model specialization. Is it a step too far?

The issues outlined above will preclude the approach in some instances, but likely not all. And while data centers may not like the operational impact of revamping an entire fleet every one to two years, the economics remain compelling.

If a few pioneers deploy Taalas in a meaningful way, the floodgates could open for far more deployments. While the outcome is difficult to foretell, it will be interesting to watch. And as the old saying goes, “May you live in interesting times.”

Disclosures: This article expresses the opinions of the author and is not to be taken as advice to purchase from or invest in the companies mentioned. My firm, Cambrian-AI Research, is fortunate to have many semiconductor firms as our clients, including Baya Systems, BrainChip, Cadence, Cerebras Systems, D-Matrix, Esperanto, Flex, Groq, IBM, Intel, Micron, NVIDIA, Qualcomm, Graphcore, SiMa.ai, Synopsys, Taalas, Tenstorrent, Ventana Microsystems, and scores of investors. I have no investment positions in any of the companies mentioned in this article.