Setting the Stage for AI On Device

Perhaps unsurprisingly, the vast majority of inquiries I get from investors, founders, and our client base across the tech ecosystem are about silicon and AI. For the better part of a year, I have been deeply engaged in these conversations, exploring the limits of current silicon as it relates to AI and how AI developers and corporations, large and small, are thinking about the technology.

While most of the AI conversation today revolves around leveraging the massive computing resources of public cloud providers like Microsoft Azure, AWS, and Google Cloud, as well as a host of “bare metal” co-location facilities, our conversations with developers, enterprises, ISVs, and the broader corporate ecosystem convince us that a demand curve is forming for AI to be run locally with on-premise hardware, both on enterprise edge infrastructure and on client devices like Windows PCs and Macs.

The reasoning is multi-faceted. First, cost is a big factor. Training and then running large models in the cloud adds to IT budgets on top of the cloud workloads organizations already pay for, and in many cases these AI workloads are more expensive than the other cloud workloads organizations are investing in. But the point that keeps coming up in our conversations with CIOs/CSOs is data sovereignty. What makes AI unique in a corporate context is a company’s data IP and how that data will uniquely empower productivity gains for employees. The idea of using a public cloud to train and then run a corporation’s unique AI models is a hard no from many of the CIOs/CSOs we talk to. Every infrastructure provider looking to sell AI servers and AI hardware is hearing the same thing.

Organizations exploring AI projects are working to understand the size of their structured data and then how to train and fine-tune a model on that data so it can deliver productivity gains, via LLMs and other techniques, to functions like sales, customer support, marketing, and business analytics. We believe running AI on devices and at the edge is the way forward for organizations, which means the capabilities of client silicon become even more essential going forward.

So why did I just outline where we believe AI at the edge is going when I’m about to talk about my experience with Apple Silicon in this context? Because Macs are an ingrained tool in the enterprise, particularly in the US, in organizations of all sizes, and they continue to gain share in the workforce. Therefore, any future for AI in the enterprise has to include Macs.

That being said, the Windows ecosystem and its merchant silicon vendors like Intel, AMD, and Qualcomm all have strategies around the AI PC (something we will explain in detail in a coming report) that leverage AI-specific silicon on-chip to accelerate the AI/LLM workloads corporations will use to boost productivity. While the solutions coming out around the AI PC are exciting and have the potential to prompt a rethink of the role of the PC in hyper-productivity in the coming years, my time with Apple Silicon confirms that Apple is not left out of this conversation.

Client Silicon and Local Large Language Models

Bear with me as I add a few other points necessary to set the context.

When you use a server-side LLM today like ChatGPT or Anthropic’s Claude 2, you are using a model with an enormous number of parameters. GPT-4, which many would say is the most capable LLM, is reported to have roughly 1.76 trillion parameters. It is extremely likely that many more of these models will be trained at trillion-plus parameter sizes; from a competitive and capabilities standpoint, that scale is necessary for something trying to be a general-purpose model covering the widest possible set of needs.

However, purely from a corporate standpoint, the interest is in ONLY running local models trained on proprietary data for employees. These models do not need to be extremely large, especially when a company is training and fine-tuning a model for just one organization. These enterprise-specific fine-tuned models are more likely to land in the tens of billions of parameters than the hundreds when it comes down to just an organization’s corporate and domain-specific data.

I lay all that out to say that when we think about the types of models that will run locally on client devices, looking at models in the tens of billions of parameters is a good way to start grasping their capabilities and their limitations.

Apple Silicon and Local Large Language Models
While the vast majority of people “reviewing” the MacBook Pro and M3 silicon are running Geekbench and trying to strain their systems with concurrent 4K or 8K video encodes, I decided to benchmark the 16-inch MacBook Pro with M3 Max and 48GB of RAM on use cases I believe will be prevalent and computationally intensive in the future. Specifically, I wanted to see how many different sizes of LLMs I could run and measure the key parts of an LLM benchmark: time to first token (the most resource-intensive step), tokens per second (which roughly translates to how quickly words are generated), the amount of RAM needed for each model, and how taxing running these models is on the system in terms of performance-per-watt.

Below are the models I tested.

– Llama 2 7B Quant Method 4 Max RAM required 6.5GB
– Llama 2 13B Quant Method 5 Max RAM required 9.8GB
– Llama 2 34B Quant Method 5 Max RAM required 29.5GB
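
As a rough sanity check on those RAM figures, a quantized model’s footprint is approximately its parameter count multiplied by the bits per weight, plus overhead for the context (KV cache) and runtime buffers. The sketch below is a back-of-the-envelope estimate only; the bits-per-weight and overhead values are my assumptions for typical 4-bit and 5-bit quantization, not the exact accounting the runtime does.

```python
# Rough estimate of RAM needed to run a quantized LLM locally.
# Back-of-the-envelope only: real quant formats (q4_K_M, q5_K_M, etc.)
# mix bit widths and add per-block scales, and the KV cache grows with
# context length. The overhead figure below is an assumption.

def estimate_ram_gb(params_billion: float, bits_per_weight: float,
                    overhead_gb: float = 2.5) -> float:
    """Weights (params * bits / 8) plus a flat allowance for KV cache
    and runtime buffers."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb

for name, params, bits in [("Llama 2 7B",  7,  4.5),
                           ("Llama 2 13B", 13, 5.5),
                           ("Llama 2 34B", 34, 5.5)]:
    print(f"{name}: ~{estimate_ram_gb(params, bits):.1f} GB")
# Prints roughly 6.4 / 11.4 / 25.9 GB -- in the same ballpark as the
# max-RAM figures above. The point: a 34B model already wants most of a
# 32GB machine, and anything much larger quickly outgrows laptops.
```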

I ran each model on the CPU only and then with GPU (Metal) acceleration. The key metrics in this benchmark were time to first token (TTFT), tokens per second (TPS), and total system package watts while running the model. For reference, 20 tokens per second produces words at least as fast as most people can read. A rough harness for collecting these metrics is sketched below.
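
For readers who want to try something similar, here is a minimal sketch of how TTFT and TPS can be measured using llama.cpp through the llama-cpp-python bindings, one common way to run these quantized models locally. The model filename and prompt are placeholders, and package power is not visible to a script like this; on macOS it can be sampled separately with a tool like powermetrics.

```python
# Minimal sketch: measure time to first token (TTFT) and tokens per
# second (TPS) for a local quantized model via llama-cpp-python.
# The model path and prompt are placeholders.
import time
from llama_cpp import Llama

MODEL_PATH = "llama-2-13b.Q5_K_M.gguf"  # placeholder filename

def benchmark(n_gpu_layers: int, prompt: str, max_tokens: int = 256):
    # n_gpu_layers=0 keeps inference on the CPU; -1 (all layers)
    # offloads to the GPU, which is Metal on Apple Silicon.
    llm = Llama(model_path=MODEL_PATH, n_gpu_layers=n_gpu_layers, verbose=False)
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    # Stream the completion so the first token can be timed separately.
    # Each streamed chunk carries roughly one token, so counting chunks
    # is a close-enough proxy for counting tokens in a sketch like this.
    for _ in llm(prompt, max_tokens=max_tokens, stream=True):
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    end = time.perf_counter()
    ttft = first_token_at - start
    tps = (n_tokens - 1) / (end - first_token_at) if n_tokens > 1 else 0.0
    print(f"n_gpu_layers={n_gpu_layers}: TTFT {ttft:.2f}s, {tps:.1f} tokens/s")

prompt = "Summarize the key benefits of running LLMs on-device."
benchmark(n_gpu_layers=0, prompt=prompt)    # CPU only
benchmark(n_gpu_layers=-1, prompt=prompt)   # GPU (Metal) accelerated
```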

Results:

Llama 2 7B
– CPU TTFT = 3.39 seconds, TPS = 23, total system package = 36W
– GPU accelerated TTFT = 0.23 seconds, TPS = 53, total system package = 28W

Llama 2 13B
– CPU TTFT = 6.25 seconds, TPS = 11, total system package = 38W
– GPU accelerated TTFT = 0.40 seconds, TPS = 27, total system package = 42W

Llama 2 34B
– CPU TTFT = 27 seconds, TPS = 4, total system package = 42W
– GPU accelerated TTFT = 0.77 seconds, TPS = 13, total system package = 54W

LLM Benchmark Takeaways

  • Compared to similar benchmarks I found running the same models on M1 Ultra and M2 Ultra, the M3 Max is on par with the M1 Ultra in tokens-per-second speed for each model and only slightly slower than the M2 Ultra. This means the M3 Max is in the same ballpark as Apple’s highest-end desktop workstation silicon when it comes to local AI processing.
  • Another key takeaway from this exercise is how poorly the CPU performs at local AI inferencing.
  • The other standout observation is the speed of GPU acceleration; every model I tested supported acceleration via Apple Metal. While GPU acceleration yielded no significant reduction in power draw, and in some cases drew more power, it was significantly faster than running inference on the CPU alone, which makes it the more efficient path per token generated (see the rough math below).
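
To put the power numbers in perspective, dividing tokens per second by package watts gives a rough tokens-per-joule figure. The snippet below simply runs that arithmetic on the M3 Max results reported above; it is a back-of-the-envelope comparison, not a proper energy measurement.

```python
# Rough energy efficiency from the M3 Max results above:
# tokens/second divided by package watts = tokens per joule.
results = {
    "Llama 2 7B":  {"cpu": (23, 36), "gpu": (53, 28)},   # (TPS, watts)
    "Llama 2 13B": {"cpu": (11, 38), "gpu": (27, 42)},
    "Llama 2 34B": {"cpu": (4, 42),  "gpu": (13, 54)},
}

for model, runs in results.items():
    cpu_eff = runs["cpu"][0] / runs["cpu"][1]
    gpu_eff = runs["gpu"][0] / runs["gpu"][1]
    print(f"{model}: CPU {cpu_eff:.2f} vs GPU {gpu_eff:.2f} tokens/joule")
# 7B:  0.64 vs 1.89 -- roughly 3x more tokens per joule on the GPU
# 13B: 0.29 vs 0.64 -- roughly 2x
# 34B: 0.10 vs 0.24 -- roughly 2.5x
```

Even when the GPU run draws more total power, it finishes the work enough faster that the energy per token is meaningfully lower.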

These were impressive results, and they start to paint a picture of what AI development can look like on a Mac as more AI developers and enterprises move toward local processing of their models. But as interesting as this was on the M3 Max, the highest-end Apple Silicon available in a laptop today, running the same tests on my base M2 MacBook Air was even more interesting.

Results on M2 MacBook Air – 16GB RAM

Llama 2 7B
– CPU TTFT = 6.79 seconds, TPS = 14, total system package = 20W
– GPU accelerated TTFT = 0.54 seconds, TPS = 17, total system package = 10W

Llama 2 13B
– CPU TTFT = 12 seconds, TPS = 4, total system package = 22W
– GPU accelerated TTFT = 0.60 seconds, TPS = 10, total system package = 16W

I ran these tests on the base M2 to demonstrate that even last year’s entry-level Apple Silicon appears competitive with benchmarks of similar model sizes on the AI-specific silicon platforms soon to be launched by Apple’s competitors in the Windows ecosystem.

The Big Takeaway

For me, this exercise made it very clear that, despite what many competitors may say, the Mac is not absent from the conversation when it comes to the silicon capabilities needed to run local large language models. In fact, one could argue that the model sizes most commonly benchmarked for local computing, 7B and 13B parameters, run fine on a massive number of Macs in use today. This means developers and enterprises who continue to explore local processing for their apps and software can easily do so on a wide range of Macs available right now.

While it may not be as evident now, 2024 is going to see a significant shift in the narrative from purely cloud-based inference of AI and LLMs to local on-device processing. I came across a report last night from Morgan Stanley Research making the same key points I am sharing here, laying out the key advantages of edge AI.

The trend line clearly points to the importance of being able to run local models on client devices like workstations, laptops, tablets, and smartphones; the only question that remains is what size those models will be. This exercise only confirmed to me that local LLMs, for the foreseeable future, are likely to stay at or under 34B parameters purely due to the memory constraints of running these models locally. Therefore, benchmarks that test the performance of 7B, 13B, and 34B models are sufficient to gauge the capabilities of client silicon for local AI software development and for enterprise use cases built on domain-specific fine-tuned LLMs.

Expect to hear a lot more about local on-device AI in 2024. We will be publishing our own report on the AI PC, as well as results from a survey we are running on generative AI use cases and IT decision-makers’ current thinking and plans for deploying generative AI in the workplace.
