How To Run Large AI Models On An Edge Device

by | Jul 10, 2023 | In the News

It can be done, but it requires the edge device vendor to optimize the model. A hybrid approach can also extend the applicability of LLMs by combining cloud and edge processing.

When most people think of Artificial Intelligence (AI), they imagine a berserk Hollywood android, or more realistically, a massive data center filled with racks of GPUs, cranking out the answers to the meaning of life, the universe, and everything (which of course is 42). The latter picture is undoubtedly true, if incomplete. Most people are unaware that they use AI when taking photos or playing games on their smartphones. In those cases the user isn’t knowingly or directly interacting with artificial intelligence. Instead, edge AI is often hidden in an app, improving its performance or function.

But thanks to the explosive revolution caused by generative AI, Large Language Models (LLMs), and ChatGPT, the time has come when people want to interact directly with an AI application on a mobile device, in their vehicles, or in doctors’ offices. Let’s call this explicit AI. AI apps like Apple Siri or Google Assistant run primarily in the cloud today. But running them directly on an edge device could bring many benefits if those devices had the capacity and performance to do the job. Let’s examine what is possible today on the edge.

Microsoft showed the popular Stable Diffusion image creation app running on a laptop at Microsoft Build. (Image: Microsoft)

AI On The Edge

As cloud vendors begin to reckon with the “eye-watering” costs of generative AI, the major players are looking for resources at the edge to carry more of the load. While data center GPUs offer great performance, they can cost over $30K each. Inflection AI, a startup founded by a co-founder of DeepMind, raised $1.3 billion from industry heavyweights to build a cloud supercomputer with 22,000 NVIDIA H100 GPUs, costing hundreds of millions of dollars.

To help lower the cost and increase access to the power of LLMs, Microsoft has introduced Microsoft 365 Copilot, which uses AI hardware both in the cloud and locally, where possible, to help users across the Windows OS. In another example of seeking the benefits of on-device AI, Google has launched the Gecko version of the PaLM 2 model. It is so lightweight that it can work on mobile devices and is fast enough for great interactive applications on-device, even offline. And Meta has released the LLaMA generative AI model, which has a version consisting of only 7B parameters intended for edge devices.

In addition to realizing significant cost reductions, these cloud providers are using artificial intelligence on devices close to the data source to help their customers realize other benefits as well, including reduced latency, improved privacy, and increased accessibility across devices.

A few challenges must be addressed to provide performant AI solutions on edge devices. First and foremost are the computational and memory constraints. Edge devices have significantly fewer computational and memory resources than cloud servers, and this is the biggest hurdle to running large AI apps on the edge. It means that AI models need to be optimized for smaller devices.
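To make the memory constraint concrete, here is a rough back-of-the-envelope sketch in Python. It estimates only the storage needed for model weights at several precisions. The 7B parameter count comes from the LLaMA example above; the 8 GB device budget and the function names are hypothetical, chosen purely for illustration, and activations and runtime overhead are ignored.

```python
# Rough estimate of weight storage only; activations, caches, and
# runtime overhead are deliberately ignored in this sketch.

BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_bytes(num_params: int, precision: str) -> int:
    """Bytes needed to store num_params weights at the given precision."""
    return num_params * BITS_PER_WEIGHT[precision] // 8


def fits(num_params: int, precision: str, budget_bytes: int) -> bool:
    """True if the weights alone fit inside the memory budget."""
    return weight_bytes(num_params, precision) <= budget_bytes


if __name__ == "__main__":
    params = 7_000_000_000   # the 7B-parameter model size mentioned above
    budget = 8 * 10**9       # hypothetical 8 GB device memory budget
    for precision in BITS_PER_WEIGHT:
        gb = weight_bytes(params, precision) / 1e9
        print(f"{precision}: {gb:.1f} GB, fits={fits(params, precision, budget)}")
```

Under these assumptions, the same model that needs 28 GB at 32-bit precision needs 3.5 GB at 4 bits, which is why reduced precision matters so much for edge deployment.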

Heterogeneity is also a stumbling block. Edge devices come in various shapes and sizes, with different capabilities and limitations. This makes it difficult for application developers to deliver AI solutions that can run across many devices. A robust AI stack supported across a wide range of devices is key.

Finally, security and privacy must be maintained. Edge devices are often connected to the internet, which makes them vulnerable to cyberattacks. Implementing security measures to protect data and devices from unauthorized access can minimize this risk.

How Do We Get There?

Optimizing and quantizing large language models is critical to making generative AI practical on edge devices. Large language models (LLMs) are a type of AI model that can be used for a variety of natural language processing (NLP) tasks, such as machine translation. However, LLMs can be computationally expensive to train and run. Several techniques can be used to optimize and quantize LLMs for edge devices.

One technique is “knowledge distillation,” sometimes called “domain reduction,” which involves training a smaller model to mimic the behavior of a larger model, often on a smaller data set. Another technique to reduce model size and improve performance is “quantization,” which involves reducing the precision of the model’s weights and activations without significantly impacting its accuracy. This can be tricky; you don’t want to accidentally opt for a highly efficient model using, say, 8-bit integers that doesn’t give accurate answers. Researchers are seeing a significant reduction in model size and improved performance while attaining accuracy within 0.5-1.0 percent of that achieved with 32-bit floating point. Looking ahead, the 4-bit realm is equally promising. Many models on Hugging Face are already available with 4-bit quantization.
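As a minimal illustration of the quantization idea, the Python sketch below maps a list of floating-point weights to 8-bit integers with a single scale factor (symmetric, per-tensor quantization) and then restores approximate values. This is one simple scheme among many, not the method used by any particular vendor or library; the function names and the sample weights are made up for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale factor for the whole tensor; guard against an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


if __name__ == "__main__":
    weights = [0.42, -1.27, 0.05, 0.9, -0.33]   # made-up example values
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print("quantized:", q)
    print("scale:", scale)
    print("largest round-trip error:", max_err)
```

Each weight now occupies one byte instead of four, and the round-trip error per weight is bounded by half the scale factor. Whether that error is acceptable for a given model is exactly the accuracy question raised above, and has to be checked on real evaluation data.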

The Hybrid Approach

In some cases, it may be necessary to use a hybrid solution, where some of the processing is done locally and some in the cloud. This can be a good option for applications that require high accuracy or that need to process larger amounts of data than the edge device can hold. Local processing is augmented with cloud compute services, and the application needs to know which work to run where, and when, to provide a seamless experience for the user. The hybrid AI approach is applicable to virtually all generative AI applications and device segments – including phones, laptops, XR headsets, vehicles, and IoT.
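The routing decision described above might be sketched as follows. Everything here is hypothetical: the `Request` fields, the `route` function, and the 2,048-token device limit are illustrative assumptions, not part of any real product or API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int          # size of the input
    needs_high_accuracy: bool   # caller wants the full-size model
    online: bool                # is a network connection available?


# Hypothetical limit on what the on-device model can process.
DEVICE_MAX_TOKENS = 2048


def route(req: Request) -> str:
    """Return "device" or "cloud" for a single request."""
    if not req.online:
        # Offline: local processing is the only option, even if degraded.
        return "device"
    if req.needs_high_accuracy or req.prompt_tokens > DEVICE_MAX_TOKENS:
        return "cloud"
    # Default to local processing for low latency and privacy.
    return "device"


if __name__ == "__main__":
    print(route(Request(prompt_tokens=300, needs_high_accuracy=False, online=True)))
    print(route(Request(prompt_tokens=5000, needs_high_accuracy=False, online=True)))
```

A real application would likely weigh more signals, such as battery level, cost, or data sensitivity, but the shape of the decision is the same: prefer the device, and escalate to the cloud only when the request exceeds what the device can do well.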

Hybrid AI also allows for devices and cloud to run models concurrently – with devices running ‘light’ versions of the model for low latency while the cloud processes multiple tokens of the ‘full’ model in parallel and corrects the device answers if needed.
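The concurrent pattern above resembles what is often called draft-and-verify (or speculative) decoding. The Python sketch below is a toy version under strong simplifying assumptions: both "models" are plain functions that return the next token for a sequence, decoding is greedy, and the cloud check is written as a loop even though a real system would verify all drafted positions in one parallel pass. All names are hypothetical.

```python
def draft_then_verify(prompt, draft_next, full_next, k=4, max_new=8):
    """Device drafts up to k tokens; cloud keeps the agreeing prefix
    and replaces the first token it disagrees with."""
    out = list(prompt)
    target_len = len(prompt) + max_new
    while len(out) < target_len:
        # Device: draft a short run of tokens with the light model.
        proposal = []
        for _ in range(min(k, target_len - len(out))):
            proposal.append(draft_next(out + proposal))
        # Cloud: check every drafted position with the full model.
        for i, tok in enumerate(proposal):
            expected = full_next(out + proposal[:i])
            if tok != expected:
                out.extend(proposal[:i])   # keep what the full model agrees with
                out.append(expected)       # correct the device's answer
                break
        else:
            out.extend(proposal)           # every drafted token was accepted
    return out


if __name__ == "__main__":
    # Toy stand-ins: the full model counts upward modulo 10; the light
    # model does the same but makes a mistake after seeing a 4.
    full = lambda seq: (seq[-1] + 1) % 10
    light = lambda seq: 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
    print(draft_then_verify([1], light, full))
```

In this sketch the final output always matches what the full model alone would have produced; the benefit is that the device supplies most tokens quickly and the cloud only has to confirm or correct them.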


The world is rushing to embrace Large Language Models, but most get a shock when they realize the costs involved, which can be 10X higher than those of traditional search queries. Many would say that the LLM explosion will fizzle if these costs cannot be constrained and reduced. Edge AI shows promise here: AI-enabled edge devices can significantly off-load the processing required while improving user experiences with high quality and low latency.

Consequently, Edge AI is a promising technology that has the potential to revolutionize a wide range of applications. The challenges of using edge AI are being addressed by advances in model optimization, quantization, and hybrid solutions. As AI technology continues to develop, we can expect to see even more innovative and groundbreaking applications for edge AI in the coming years.

For more information on the state of the art in Edge AI, please see the more complete analysis on our website.