
Julien Simon

Originally published at julsimon.Medium

Local inference shootout: Llama.cpp vs. MLX on 10B and 32B Arcee SLMs

In this video, we run local inference on an Apple M3 MacBook with llama.cpp and MLX, two projects that optimize and accelerate small language models (SLMs) on consumer hardware such as Apple Silicon. For this purpose, we use two new Arcee open-source models distilled from DeepSeek-v3: Virtuoso Lite 10B and Virtuoso Medium v2 32B.

First, we download the two models from the Hugging Face hub with the Hugging Face CLI. Then, we go through the step-by-step installation procedure for llama.cpp and MLX. Next, we optimize and quantize the models to 4-bit precision for maximum acceleration. Finally, we run inference and look at performance numbers. So, who's fastest? Watch and find out!
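If you'd rather script the same workflow than type the commands from the video, here is a minimal Python sketch using the huggingface_hub, llama-cpp-python, and mlx-lm packages instead of the command-line tools shown in the recording. The repo ID (arcee-ai/Virtuoso-Lite) and the quantized file and folder names are assumptions; check the Arcee organization on the Hugging Face Hub for the exact artifacts, and produce the 4-bit quantizations with each project's own conversion tools first.

```python
import time
from huggingface_hub import snapshot_download

# 1. Download a model from the Hugging Face Hub
#    (CLI equivalent: `huggingface-cli download ...`). Repo ID is an assumption.
model_dir = snapshot_download(repo_id="arcee-ai/Virtuoso-Lite")

# 2a. llama.cpp route: load a 4-bit GGUF quantization with llama-cpp-python.
#     The GGUF file below is a placeholder name; it is produced beforehand with
#     llama.cpp's HF-to-GGUF conversion and quantization tools.
from llama_cpp import Llama

llm = Llama(
    model_path="virtuoso-lite-q4_k_m.gguf",  # assumed file name for a 4-bit quant
    n_gpu_layers=-1,   # offload all layers to Metal on Apple Silicon
    n_ctx=4096,
)

start = time.time()
out = llm("Explain model distillation in one paragraph.", max_tokens=256)
elapsed = time.time() - start
n_tokens = out["usage"]["completion_tokens"]
print(f"llama.cpp: {n_tokens / elapsed:.1f} tokens/s")

# 2b. MLX route: load a 4-bit MLX conversion with mlx-lm.
#     The folder below is a placeholder; it is produced beforehand with
#     mlx-lm's quantizing converter.
from mlx_lm import load, generate

model, tokenizer = load("virtuoso-lite-4bit-mlx")  # assumed path to a 4-bit MLX model
text = generate(
    model,
    tokenizer,
    prompt="Explain model distillation in one paragraph.",
    max_tokens=256,
    verbose=True,  # prints generation speed
)
```

For a fair comparison, run each backend in its own process and use the same prompt and token budget; the tokens-per-second readings give a rough throughput number comparable to the ones discussed in the video.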
