2026-04-11 · Tutorial

Splitting Llama across two MacBook Pros with llama.cpp RPC

You have two Macs. Both have enough RAM to hold part of a large language model, but neither has enough to hold the whole thing. You don't want to rent an H100. You don't want to ship your prompts to a stranger's data center. What you want is for the two Macs to act like one bigger Mac — and that's exactly what llama.cpp's RPC backend does.

This post is a walkthrough of setting that up, with commands copied straight from a working setup. I'll use two M1 Max MacBook Pros on the same LAN, but the same steps work for any combination of Apple Silicon, Linux x86_64, Linux ARM64, and Windows.

What llama.cpp RPC actually does

The short version: ggml is llama.cpp's tensor library, and it supports pluggable backends — CPU, Metal, CUDA, Vulkan, and one called RPC. The RPC backend doesn't run computations itself. Instead, it forwards tensor operations over TCP to an rpc-server process running on another machine. That remote process executes the op on its local backend (Metal, CUDA, whatever) and sends the result back.

From llama-server's point of view, an RPC backend looks identical to any other backend: it has memory, it accepts tensors, it runs ops. The abstraction is clean enough that you can --rpc a worker into an otherwise-normal llama-server invocation and inference just works. The cost is latency: every tensor op pays a round-trip over the network. For a token-by-token chat workload on a LAN, that cost is small. Over the open internet with 80 ms of latency, it is not small.

Rule of thumb: if you can ping the other machine in under 5 ms, you'll barely notice the RPC overhead. At 50 ms, it's painful. At 200 ms, it's unusable.
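To put a number on it, ping the worker (e.g. ping -c 5 192.168.1.42) and look at the average round-trip time in the summary line. A tiny helper to pull that field out — shown here against a canned summary line rather than live output:

```shell
# Extract the avg RTT (in ms) from ping's min/avg/max summary line.
# Works for both the macOS and Linux summary formats.
avg_rtt() {
  awk -F'/' '/min\/avg\/max/ { print $5 }'
}

# Canned macOS-style summary line standing in for real ping output:
echo "round-trip min/avg/max/stddev = 1.2/2.8/4.1/0.9 ms" | avg_rtt   # → 2.8
```

If that number is in the single digits, you're in good shape.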

Step 1 — Build llama.cpp with RPC on both machines

You need cmake and a recent compiler. On macOS, Xcode command-line tools are enough:

xcode-select --install
brew install cmake

Clone and build with GGML_RPC=ON. The BUILD_SHARED_LIBS=ON flag matters if you want the binaries to be relocatable later — without it, libggml gets statically linked in and you can't rewrite the paths afterward:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build \
    -DGGML_RPC=ON \
    -DGGML_METAL=ON \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build --target rpc-server llama-server -j

On an M1 Max this takes about four minutes. You should end up with build/bin/rpc-server and build/bin/llama-server, plus a pile of libggml*.dylib files scattered across build/ggml/src/. Verify they work:

./build/bin/rpc-server --help
./build/bin/llama-server --help

Do this on both machines. The primary (the one that'll run llama-server) and the worker (the one that'll run rpc-server) both need the same llama.cpp version — different versions will hang at the handshake or segfault mid-inference. I'd recommend pinning to a tag like b4404 on both.
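A cheap guard before wiring the machines together is to compare what each checkout reports and refuse to proceed on a mismatch. A minimal sketch (the tag strings here are placeholders; in practice you'd feed it the output of git describe --tags or git rev-parse HEAD from each machine):

```shell
# Refuse to start if the two llama.cpp checkouts aren't the same version.
same_build() {
  if [ "$1" = "$2" ]; then echo "versions match"; else echo "versions differ"; fi
}

# In practice, something like:
#   same_build "$(git -C llama.cpp rev-parse HEAD)" \
#              "$(ssh worker 'git -C llama.cpp rev-parse HEAD')"
same_build b4404 b4404   # → versions match
```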

Step 2 — Start rpc-server on the worker

On the second Mac (the worker), decide how much RAM you want to dedicate to the worker's memory pool. The -m flag sets the backend memory pool size in megabytes — this is what the worker will advertise as available for tensor storage:

./build/bin/rpc-server -H 0.0.0.0 -p 50052 -m 8192

Breakdown:

  -H 0.0.0.0   listen on all interfaces so the primary can reach the worker
               over the LAN (the default of 127.0.0.1 only accepts local connections)
  -p 50052     the TCP port to listen on
  -m 8192      advertise an 8192 MB memory pool for tensor storage

You'll see a line like:

Starting RPC server v3.0.0
  endpoint       : 0.0.0.0:50052
  local cache    : n/a
  backend memory : 8192 MB

That's it for the worker. Leave it running.
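How big to make -m is a judgment call. One heuristic — mine, not an official llama.cpp recommendation — is total RAM minus a fixed headroom for the OS and whatever else the machine is doing:

```shell
# Hedged sizing heuristic: advertise total RAM minus headroom, in MB.
pool_size_mb() {
  total_mb=$1
  headroom_mb=$2
  echo $(( total_mb - headroom_mb ))
}

pool_size_mb 32768 8192   # 32 GB machine, keep 8 GB free → 24576
```

Err on the generous side for headroom: if the worker starts swapping, throughput falls off a cliff.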

Step 3 — Start llama-server on the primary with --rpc

On the first Mac, download a GGUF model and point llama-server at both the local Metal backend and the remote RPC worker:

./build/bin/llama-server \
    -m ~/models/llama-3-8b-instruct.Q4_K_M.gguf \
    --rpc 192.168.1.42:50052 \
    --host 127.0.0.1 \
    --port 8080 \
    -ngl 99

The key flag is --rpc 192.168.1.42:50052, which tells llama.cpp to add a remote backend at that address. -ngl 99 asks for as many layers offloaded as possible; llama.cpp's scheduler will spread them across the available backends, which now includes both local Metal and the remote RPC worker.

In the primary's startup log you should see something like:

load_backend: loaded RPC backend from ...
ggml_backend_rpc_buffer_type: connecting to 192.168.1.42:50052
ggml_backend_rpc_buffer_type: ok
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        RPC[192.168.1.42:50052] model buffer size =  4521.56 MiB
load_tensors:        Metal model buffer size =  1203.42 MiB

Those two model buffer size lines are the whole point: some of the model's tensor weights are now resident on the remote worker, some are resident locally, and the scheduler will route each tensor op to whichever backend owns the data. First-token latency is slightly higher than local-only; subsequent tokens stream at roughly the throughput of the slowest link in the chain.
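To build intuition for how those buffer sizes come about, here's a toy proportional split. This is a simplification for illustration — it is not llama.cpp's actual scheduling algorithm, and the memory figures are made up:

```shell
# Toy model (NOT llama.cpp's real scheduler): split N layers across two
# backends in proportion to the memory each one offers.
split_layers() {
  layers=$1
  local_mb=$2
  remote_mb=$3
  total=$(( local_mb + remote_mb ))
  remote=$(( (layers * remote_mb + total / 2) / total ))   # rounded share
  echo "local=$(( layers - remote )) remote=$remote"
}

split_layers 33 8192 24576   # 8 GB local, 24 GB remote → local=8 remote=25
```

The worker with three times the memory ends up holding roughly three times the layers, which is why the RPC buffer above dwarfs the Metal one.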

Step 4 — Verify it's actually splitting

Fire a completion at the primary and watch the worker's stdout:

curl -s http://127.0.0.1:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"The capital of France is","n_predict":32}' | \
  python -m json.tool

On the worker, you should see a stream of lines like remote_graph_compute and remote_alloc_buffer — that's llama.cpp forwarding tensor operations to the worker and pulling results back. If you don't see any activity on the worker, something's wrong: either --rpc didn't actually connect, or all the layers got offloaded locally because the primary had enough VRAM on its own. The easiest way to force real splitting is to make the model bigger than the primary's Metal pool.
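If you just want the generated text rather than the full JSON, a sed one-liner is enough for quick checks. Shown here against a canned response rather than live curl output; for anything with escaped quotes in it, reach for jq or a real JSON parser instead:

```shell
# Pull the "content" field out of a /completion JSON response.
content_of() {
  sed -n 's/.*"content":"\([^"]*\)".*/\1/p'
}

# Canned response standing in for the curl output above:
echo '{"content":" Paris, of course.","tokens_predicted":6}' | content_of
```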

Common failures (and fixes)

"Unable to load model" with no obvious error

You probably have different llama.cpp builds on the two machines. The GGML ABI changes between versions. Pin both to the same tag, rebuild, retry.

GGML_ASSERT(tensor->ne[0] % 512 == 0) on the worker

This is a known llama.cpp RPC limitation: certain quantized tensors assert-fail when sent over RPC if their first dimension (ne[0] in the assert) isn't a multiple of 512. It hits Q4_K and Q5_K variants of some specific model shapes. Workaround: either use a different quant (Q8_0 usually works), or use a model whose hidden size is a clean multiple of 512.
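A quick pre-check before downloading gigabytes of GGUF — the sizes below are illustrative (4096 is Llama-3-8B's hidden size; 1000 is a contrived bad value):

```shell
# Does a dimension divide cleanly by 512? The assert fires when it doesn't.
multiple_of_512() {
  if [ $(( $1 % 512 )) -eq 0 ]; then echo yes; else echo no; fi
}

multiple_of_512 4096   # → yes
multiple_of_512 1000   # → no
```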

"Connection refused" from the primary

The worker is bound to 127.0.0.1, not 0.0.0.0. Re-run rpc-server with -H 0.0.0.0.

Inference works but is painfully slow

You're probably on WiFi. Switch to ethernet or a Thunderbolt bridge. I've seen token throughput drop from 20 tok/s to 2 tok/s just by moving from wired gigabit to WiFi 6.
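A back-of-envelope model of why RTT hurts so much more than raw bandwidth for token-by-token generation — the round-trip counts and latencies below are assumptions for illustration, not measurements:

```shell
# If generating one token costs R sequential network round-trips at L ms
# each, latency alone caps throughput at 1000/(R*L) tokens/second.
latency_cap_tps() {
  rtt_ms=$1
  trips_per_token=$2
  echo $(( 1000 / (rtt_ms * trips_per_token) ))
}

latency_cap_tps 1 20    # ~1 ms wired LAN, 20 trips/token → 50 tok/s ceiling
latency_cap_tps 8 20    # ~8 ms congested WiFi, 20 trips/token → 6 tok/s ceiling
```

Bandwidth barely moves between those two cases; the per-trip latency is what multiplies.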

Where to go from here

If you got this working with two machines, the next question is obvious: what about ten? A hundred? That's what SharedLLM is building — a coordinator that discovers workers, tracks their available memory and GPU VRAM, splits models across them via a proper layer-split scheduler, handles worker churn, and does all of this under AGPL-3.0-or-later governance so nobody can turn it into an enclosure.

The v0.1.0 public alpha ships pre-built rpc-server and llama-server binaries for macOS, Linux (x86_64 + arm64), and Windows, so you can skip the build step entirely. It also ships a Docker integration stack that runs the whole coordinator-plus-workers path in three commands, which is the fastest way to audit the code if you're just trying to understand how all the pieces fit together.


Next: SharedLLM vs Petals vs Exo vs Kalavai: distributed LLM networks in 2026 →