MoE Models Guide
Squish supports Mixture-of-Experts (MoE) models. These models have a large total parameter count but only activate a small subset (the "active parameters") during each forward pass, making them much faster than their total size suggests.
MoE Models in the Catalog
| Model ID | Total params | Active params | INT4 size | Fits 16 GB M3? |
|---|---|---|---|---|
| qwen3:30b-a3b | 30B | ~3B | ~5.0 GB | Yes (11 GB agent headroom) |
To see all MoE models, check the squish catalog output, which shows a
[MoE: X total / Y active] badge for these models so you can identify them at a glance.
Why MoE Models Are Efficient
A standard 30B dense model requires ~80 GB of memory and is not feasible on consumer
hardware. qwen3:30b-a3b uses a sparse MoE architecture: its 30B parameters are
distributed across expert networks, but each token is routed through only ~3B of
them. At INT4 compression, the total weight size is ~5 GB, comparable to a
dense 3B model but with 30B total capacity.
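The trade-off can be made concrete with a little arithmetic. The sketch below is illustrative only: `moe_summary` is a hypothetical helper, and the figure of roughly 2 FLOPs per parameter per token is a common rule of thumb for matmul-dominated forward passes, not a squish measurement. It ignores attention cost and routing overhead.

```python
def moe_summary(total_params: float, active_params: float) -> dict:
    """Compare per-token compute for a dense model and a sparse MoE model.

    Assumption: about 2 FLOPs per parameter touched per token (one multiply
    and one add), which is only a rough estimate.
    """
    return {
        # Share of the weights a single token is routed through.
        "active_fraction": active_params / total_params,
        # A dense model touches every parameter for every token.
        "dense_flops_per_token": 2 * total_params,
        # A sparse MoE model touches only the routed experts.
        "moe_flops_per_token": 2 * active_params,
    }


# Figures for qwen3:30b-a3b taken from the catalog table: 30B total, ~3B active.
summary = moe_summary(total_params=30e9, active_params=3e9)
speedup = summary["dense_flops_per_token"] / summary["moe_flops_per_token"]
print(f"active fraction: {summary['active_fraction']:.0%}")
print(f"estimated per-token compute reduction: {speedup:.0f}x")
```

Note that memory still scales with the total parameter count, because every expert must be resident (or prefetchable) even though only a few are used per token. Compute follows the active count; footprint follows the total.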
Running a MoE Model
Basic serving
Agent mode (recommended for long-context tasks)
The --agent preset automatically enables --moe-lookahead for MoE models. The preset activates:
- --agent-kv: asymmetric INT2 KV cache (6× footprint reduction)
- --moe-lookahead: expert prefetching via EMA-delta hidden state prediction
- --chunk-prefill: bounded time-to-first-token
- batch-size=1: single-slot serving for agent loops
Manual MoE lookahead
Expert Lookahead Router (Phase 14)
The --moe-lookahead flag activates the MoELookaheadRouter from
squish/moe/moe_lookahead.py.
How it works
- After each decode step, the router computes an Exponential Moving Average (EMA) of the frame-to-frame delta of the mean hidden state.
- The EMA delta is used to project the hidden state k steps into the future.
- The sparse MoE router is applied to each projected state to predict which experts will be needed.
- The union of all predicted expert indices forms the prefetch set.
- The next actual decode step evaluates whether any actual expert was in the prefetch set (hit rate).
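The steps above can be sketched in a few lines of plain Python. This is not the code in squish/moe/moe_lookahead.py: the class `EmaLookaheadSketch`, the helper `make_top_k_router`, the smoothing factor `alpha`, and the linear extrapolation `hidden + i * ema_delta` are all assumptions made for illustration, and the toy gate matrix stands in for a real MoE router.

```python
from typing import Callable, List, Optional, Set

Vector = List[float]


def make_top_k_router(gates: List[Vector], top_k: int = 2) -> Callable[[Vector], Set[int]]:
    """Toy sparse router: pick the top_k experts by dot-product score."""
    def route(hidden: Vector) -> Set[int]:
        scores = [sum(w * x for w, x in zip(row, hidden)) for row in gates]
        ranked = sorted(range(len(gates)), key=lambda i: scores[i], reverse=True)
        return set(ranked[:top_k])
    return route


class EmaLookaheadSketch:
    """Hypothetical stand-in for MoELookaheadRouter (illustration only)."""

    def __init__(self, route: Callable[[Vector], Set[int]], k: int = 3, alpha: float = 0.5):
        self.route = route
        self.k = k              # lookahead horizon in decode steps
        self.alpha = alpha      # EMA smoothing factor (assumed value)
        self.prev: Optional[Vector] = None
        self.ema_delta: Optional[Vector] = None
        self.prefetch: Set[int] = set()
        self.hits = 0
        self.checked = 0

    def step(self, hidden: Vector, actual_experts: Set[int]) -> Set[int]:
        if self.prev is not None:
            # Score the previous prediction: a hit means at least one
            # actually used expert was already in the prefetch set.
            self.checked += 1
            if self.prefetch & actual_experts:
                self.hits += 1
            # Update the EMA of the step-to-step delta of the mean hidden state.
            delta = [h - p for h, p in zip(hidden, self.prev)]
            if self.ema_delta is None:
                self.ema_delta = delta
            else:
                self.ema_delta = [self.alpha * d + (1 - self.alpha) * e
                                  for d, e in zip(delta, self.ema_delta)]
        self.prev = hidden
        # Project 1..k steps ahead and take the union of predicted experts.
        drift = self.ema_delta or [0.0] * len(hidden)
        self.prefetch = set()
        for i in range(1, self.k + 1):
            projected = [h + i * d for h, d in zip(hidden, drift)]
            self.prefetch |= self.route(projected)
        return self.prefetch


# Four toy experts in a 2-D hidden space; a constant ("flat") hidden state.
router = make_top_k_router([[1, 0], [0, 1], [-1, 0], [0, -1]])
sketch = EmaLookaheadSketch(router, k=3)
for _ in range(5):
    state = [1.0, 0.5]
    sketch.step(state, router(state))
print(f"hit rate: {sketch.hits}/{sketch.checked}")
```

Taking the union over all k projected states trades a larger prefetch set for a higher chance of a hit; a real implementation would also have to bound that set by the available memory bandwidth.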
Benchmarking lookahead
Sample output on synthetic traces resembling DeepSeek-Coder-V2-Lite:
Regime Hit rate Latency µs/tok
──────────── ────────── ────────────────
flat 100.0% 580.00
random 91.0% 572.00
drifting 92.5% 570.00
Lookahead configuration
| Flag | Default | Description |
|---|---|---|
| --moe-lookahead | False | Enable MoE expert lookahead |
| --moe-lookahead-steps | 3 | Lookahead horizon steps |
DeepSeek-Coder-V2-Lite Setup on 16 GB M3
DeepSeek-Coder-V2-Lite (16B total / 2.4B active) is not yet in the default squish catalog. To use it as a custom model:
- Download the MLX-converted weights:
huggingface-cli download mlx-community/DeepSeek-Coder-V2-Lite-Instruct-4bit-mlx \
--local-dir ~/models/deepseek-coder-v2-lite
- Start squish with agent mode:
- Verify with a code generation request:
curl http://localhost:11435/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"squish","messages":[{"role":"user","content":"Write a Python quicksort."}],"max_tokens":256}'
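The same verification request can be issued from Python using only the standard library. The URL and payload are copied from the curl command; the helper names `build_request` and `send` are hypothetical, and the response parsing assumes an OpenAI-style `choices[0].message.content` layout, which the endpoint path suggests but which this guide does not confirm.

```python
import json
import urllib.request

# Endpoint taken from the curl example in this guide.
URL = "http://localhost:11435/v1/chat/completions"


def build_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {
        "model": "squish",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the generated text.

    Assumption: the server replies with an OpenAI-style JSON body.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_request("Write a Python quicksort.")
# With the server running, print the completion with:
#     print(send(request))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.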
Expected performance on 16 GB M3
| Config | Throughput (tokens/s) | Peak RAM |
|---|---|---|
| Default FP16 (baseline) | ~15 | ~9 GB |
| INT4 + agent-kv | ~42 | ~6 GB |
| INT4 + agent-kv + moe-lookahead | ~46 | ~6 GB |
(Simulated estimates: actual numbers depend on context length and model.)