Squish: Multi-Model Benchmark Results

Load Time & Throughput

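| Model | Size (GB) | Compressed (GB) | Reference load | Squish load | Speedup | Tok/s |
|---|---|---|---|---|---|---|
| Qwen2.5-1.5B | 3.1 | 2.9 | 1.45 s | 0.43 s | 3.4× | 18.9 |
| Qwen2.5-7B | 14.0 | 4.0 | — | 2.01 s | — | 16.8 |
| Qwen2.5-14B | 29.6 | 8.3 | — | 3.36 s | — | 7.7 |
| Qwen3-8B | 16.4 | 4.4 | — | 2.20 s | — | 15.1 |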
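
The derived columns can be re-checked from the raw figures. The following is a minimal sketch that recomputes the compression ratio and the load speedup from the table values; the names `ROWS`, `compression_ratio`, and `load_speedup` are illustrative and are not part of Squish.

```python
# Figures copied from the Load Time & Throughput table.
# Each row: (model, size_gb, compressed_gb, ref_load_s, squish_load_s).
# ref_load_s is None where the table reports no reference measurement.
ROWS = [
    ("Qwen2.5-1.5B", 3.1, 2.9, 1.45, 0.43),
    ("Qwen2.5-7B", 14.0, 4.0, None, 2.01),
    ("Qwen2.5-14B", 29.6, 8.3, None, 3.36),
    ("Qwen3-8B", 16.4, 4.4, None, 2.20),
]


def compression_ratio(size_gb, compressed_gb):
    """Original size divided by compressed size."""
    return size_gb / compressed_gb


def load_speedup(ref_load_s, squish_load_s):
    """Reference load time divided by Squish load time, or None if unmeasured."""
    if ref_load_s is None:
        return None
    return ref_load_s / squish_load_s


def summarize(rows):
    """Map each model name to (compression ratio, load speedup)."""
    return {
        model: (compression_ratio(size, comp), load_speedup(ref, squish))
        for model, size, comp, ref, squish in rows
    }


SUMMARY = summarize(ROWS)
for model, (ratio, speedup) in SUMMARY.items():
    speedup_text = "n/a" if speedup is None else f"{speedup:.1f}x"
    print(f"{model}: compression {ratio:.2f}x, load speedup {speedup_text}")
```

Only Qwen2.5-1.5B has a reference load time, so it is the only row for which a speedup can be computed.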

Wave 12 KV-cache compression is enabled with `--pm-kvq --mix-kvq --cocktail-kv` on any model.

| Model | Baseline KV mem | Wave 12 KV mem | Reduction | Context at same VRAM |
|---|---|---|---|---|
| Qwen2.5-1.5B | 1× (FP16) | ~0.24× | 4.2× | 4× longer context |
| Qwen2.5-7B | 1× (FP16) | ~0.26× | 3.8× | 4× longer context |
| Qwen2.5-14B | 1× (FP16) | ~0.26× | 3.8× | 4× longer context |
| Qwen3-8B | 1× (FP16) | ~0.26× | 3.8× | 4× longer context |

KV reduction was measured at a sequence length of 4096 tokens. PM-KVQ keeps the most recent 6% of tokens in FP16 and stores 19% in INT8 and the remaining 75% in INT4.
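
To put the relative footprints into absolute terms, here is a rough sketch of the standard KV-cache size formula (keys plus values, per layer, per KV head). The model configuration used below (28 layers, 4 KV heads, head dimension 128) is an assumed example and is not taken from the benchmark data; the helper names are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Uncompressed KV cache size in bytes; the leading 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def reduction_factor(relative_footprint):
    """Convert a relative footprint (e.g. 0.26x) into a reduction factor."""
    return 1.0 / relative_footprint


# Assumed example config, NOT from the benchmark: 28 layers, 4 KV heads, head_dim 128.
SEQ_LEN = 4096  # sequence length used for the KV measurements
baseline = kv_cache_bytes(num_layers=28, num_kv_heads=4, head_dim=128, seq_len=SEQ_LEN)
compressed = baseline * 0.26  # relative footprint reported for the larger models

MIB = 1024 ** 2
print(f"FP16 baseline: {baseline / MIB:.0f} MiB")
print(f"At 0.26x:      {compressed / MIB:.1f} MiB")
print(f"Reduction:     {reduction_factor(0.26):.1f}x")
# With the same memory budget, the cache holds about reduction_factor times more tokens.
print(f"Tokens at equal memory: about {int(SEQ_LEN * reduction_factor(0.26))}")
```

The reduction factor is simply the reciprocal of the relative footprint, which is why ~0.24× corresponds to 4.2× and ~0.26× to 3.8×.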

| Module | Flag | Memory | Latency overhead | Paper speedup |
|---|---|---|---|---|
| PM-KVQ | `--pm-kvq` | 4.2× KV reduction | 14 µs/step | — |
| MixKVQ | `--mix-kvq` | 3.9× KV reduction | 712 µs/KV | — |
| CocktailKV | `--cocktail-kv` | 3.0× KV reduction | 895 µs/512-tok | — |
| AgileIO | `--agile-io` | ≈0 | 3.5 µs warm | 40–60% I/O latency ↓ |
| MiLo INT3 | `--milo` | 5.3× weight compression | one-time convert | — |
| SageAttn | `--sage-attention` | — | — | 2.1× attn |
| SpargeAttn | `--sparge-attn` | — | — | 2.5–5× attn |
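
The per-step overheads are easier to judge relative to the time spent generating one token. The sketch below is a back-of-the-envelope estimate only: it assumes that one PM-KVQ "step" corresponds to one generated token and that the overhead adds serially to decode time. Neither assumption is confirmed by the benchmark notes, and `overhead_fraction` is a hypothetical helper.

```python
def overhead_fraction(overhead_us_per_token, tok_per_s):
    """Fraction of per-token decode time consumed by a fixed per-token overhead."""
    seconds_per_token = 1.0 / tok_per_s
    return (overhead_us_per_token * 1e-6) / seconds_per_token


# PM-KVQ overhead from the module table: 14 microseconds per step
# (assumed here to mean one step per generated token).
PM_KVQ_US = 14.0

# Throughput figures from the Load Time & Throughput table.
THROUGHPUT = {
    "Qwen2.5-1.5B": 18.9,
    "Qwen2.5-7B": 16.8,
    "Qwen2.5-14B": 7.7,
    "Qwen3-8B": 15.1,
}

for model, tok_s in THROUGHPUT.items():
    frac = overhead_fraction(PM_KVQ_US, tok_s)
    print(f"{model}: {frac * 100:.3f}% of per-token time")
```

Under these assumptions the PM-KVQ overhead stays well below 0.1% of per-token time for every model listed.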

| Task | Squish v1 | + Wave 12 | Delta |
|---|---|---|---|
| ARC-Easy (acc_norm) | 73.5% | 73.5% | ±0% |
| HellaSwag (acc_norm) | 62.0% | 62.0% | ±0% |
| PIQA (acc_norm) | 76.5% | 76.5% | ±0% |
| WinoGrande (acc) | 67.0% | 67.0% | ±0% |

Wave 12 does not alter base-model weights. The KV quantisation modules introduce an accuracy delta of at most 0.5% at standard context lengths.

- Squish load uses Tier 1 (safetensors) for the 1.5B model and Tier 0 (4-bit MLX) for the larger models.
- Wave 12 KV reduction applies during generation, not only during prefill.
- Tok/s was measured on an Apple M-series machine with 16 GB of unified memory.
- Accuracy was evaluated with the EleutherAI lm-evaluation-harness, v0.4.x.
- Wave 12 micro-benchmarks are run via `dev/benchmarks/bench_wave12.py`.
- Full raw data: `dev/results/wave12_bench.json`.