Squish: Multi-Model Benchmark Results

Load Time & Throughput

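| Model | Size (GB) | Compressed (GB) | Ref Load | Squish Load | Speedup | Tok/s |
|---|---|---|---|---|---|---|
| Qwen2.5-1.5B | 3.1 | 2.9 | 1.45 s | 0.43 s | 3.4× | 18.9 |
| Qwen2.5-7B | 14.0 | 4.0 | | 2.01 s | | 16.8 |
| Qwen2.5-14B | 29.6 | 8.3 | | 3.36 s | | 7.7 |
| Qwen3-8B | 16.4 | 4.4 | | 2.20 s | | 15.1 |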
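
The Speedup column is the ratio of reference load time to Squish load time, and the on-disk saving follows the same way from the two size columns. A minimal sketch of that arithmetic, using values from the table (the function names are illustrative and not part of Squish):

```python
def speedup(ref_load_s: float, squish_load_s: float) -> float:
    """Load-time speedup: reference load time divided by Squish load time."""
    return ref_load_s / squish_load_s


def compression_ratio(size_gb: float, compressed_gb: float) -> float:
    """How many times smaller the compressed checkpoint is on disk."""
    return size_gb / compressed_gb


# Qwen2.5-1.5B row: 1.45 s reference load vs 0.43 s Squish load.
print(f"{speedup(1.45, 0.43):.1f}x")           # 3.4x
# Qwen2.5-7B row: 14.0 GB original vs 4.0 GB compressed.
print(f"{compression_ratio(14.0, 4.0):.1f}x")  # 3.5x
```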

Wave 12 KV Cache Compression

Enable on any model by passing --pm-kvq --mix-kvq --cocktail-kv.

| Model | Baseline KV mem | Wave 12 KV mem | Reduction | Context (VRAM-same) |
|---|---|---|---|---|
| Qwen2.5-1.5B | 1× (FP16) | ~0.24× | 4.2× | 4× longer context |
| Qwen2.5-7B | 1× (FP16) | ~0.26× | 3.8× | 4× longer context |
| Qwen2.5-14B | 1× (FP16) | ~0.26× | 3.8× | 4× longer context |
| Qwen3-8B | 1× (FP16) | ~0.26× | 3.8× | 4× longer context |

KV reduction was measured at a sequence length of 4,096 tokens. PM-KVQ keeps the most recent 6% of tokens in FP16 and stores 19% in INT8 and the remaining 75% in INT4.
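
The Reduction and Context columns follow from the Wave 12 KV memory fraction: the reduction is its reciprocal, and at a fixed memory budget the cache holds that many times more tokens. A small sketch of the arithmetic (the helper names are illustrative, and using the 4,096-token measurement length as the baseline budget is an assumption made for the example):

```python
def kv_reduction(kv_fraction: float) -> float:
    """Reduction factor given KV memory as a fraction of the FP16 baseline."""
    return 1.0 / kv_fraction


def context_at_same_memory(baseline_tokens: int, kv_fraction: float) -> int:
    """Tokens that fit in the memory the FP16 cache needs for baseline_tokens."""
    return int(baseline_tokens / kv_fraction)


# Fractions taken from the table; 4096 is an assumed baseline context.
for model, fraction in [("Qwen2.5-1.5B", 0.24), ("Qwen2.5-7B", 0.26)]:
    reduction = kv_reduction(fraction)
    tokens = context_at_same_memory(4096, fraction)
    print(f"{model}: {reduction:.1f}x reduction, ~{tokens} tokens in the same memory")
```

Both fractions land close to the "4× longer context" figure once rounded.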

Wave 12 Module Summary

| Module | Flag | Memory | Latency overhead | Paper speedup |
|---|---|---|---|---|
| PM-KVQ | --pm-kvq | 4.2× KV reduction | 14 µs/step | |
| MixKVQ | --mix-kvq | 3.9× KV reduction | 712 µs/KV | |
| CocktailKV | --cocktail-kv | 3.0× KV reduction | 895 µs/512-tok | |
| AgileIO | --agile-io | ≈0 | 3.5 µs warm | 40–60% I/O latency ↓ |
| MiLo INT3 | --milo | 5.3× weight compression | one-time convert | |
| SageAttn | --sage-attention | | | 2.1× attn |
| SpargeAttn | --sparge-attn | | | 2.5–5× attn |
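
To put the latency overheads in context, a per-step cost can be compared against the duration of one decode step. The sketch below assumes that a "step" is one generated token and pairs the PM-KVQ figure of 14 µs/step with the Qwen2.5-1.5B throughput of 18.9 tok/s. Both pairings are assumptions made for illustration, not something the benchmark states.

```python
def overhead_fraction(overhead_us: float, tokens_per_second: float) -> float:
    """Fraction of one decode step spent on a module's per-step overhead."""
    step_us = 1_000_000 / tokens_per_second  # duration of one decode step in µs
    return overhead_us / step_us


# Assumed pairing: PM-KVQ overhead against Qwen2.5-1.5B decode throughput.
frac = overhead_fraction(14.0, 18.9)
print(f"{frac:.4%} of a decode step")  # roughly 0.03%
```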

Accuracy: Wave 12 (all models, Qwen2.5-1.5B representative)

| Task | Squish v1 | + Wave 12 | Delta |
|---|---|---|---|
| ARC-Easy (acc_norm) | 73.5% | 73.5% | ±0% |
| HellaSwag (acc_norm) | 62.0% | 62.0% | ±0% |
| PIQA (acc_norm) | 76.5% | 76.5% | ±0% |
| WinoGrande (acc) | 67.0% | 67.0% | ±0% |

Wave 12 does not alter base-model weights. KV quantisation modules introduce ≤0.5% accuracy delta at standard context lengths.

Notes

  • Squish load uses Tier 1 (safetensors) for 1.5B, Tier 0 (4-bit MLX) for larger models
  • Wave 12 KV reduction applies during generation (not prefill-only)
  • Tok/s measured on an Apple M-series machine with 16 GB of unified memory
  • lm-eval harness: EleutherAI lm-evaluation-harness v0.4.x
  • Wave 12 micro-benchmarks run via dev/benchmarks/bench_wave12.py
  • Full raw data: dev/results/wave12_bench.json