
Benchmark Results

Squish — Benchmarks

Grounded numbers from the squish repository. Reproduce locally with the commands collected in §6. No marketing rounding: everything here either runs in CI or is a single Python invocation away.


0. Headline (v9.32.0 / bench v5.1.1)

Source: README.md "The Numbers" section; raw artifacts in results/benchmarks_v5_1_1/.

Measured 2026-06-02 on Apple M3 MacBook Pro, 16 GB unified memory. Model: Qwen2.5-7B-Instruct. Quant: INT4 (squish) / Q4_K_M (Ollama). Five-run medians.

| Metric | Ollama 0.18.2 | Squish |
| --- | --- | --- |
| E2E response @ 4000-token prompt | 69.63 s | 12.78 s (5.4× faster) |
| E2E response @ 75-token prompt | 8.09 s | 5.50 s (1.5× faster) |
| Peak RAM during inference | ~5 GB | 3.36 GB |
| Disk size, INT4 | 4.36 GB | 4.00 GB |
| Disk size, INT3 (Qwen3) | not supported | 3.56 GB |
| TTFT @ 75-token prompt | 131 ms | 279 ms (honest loss) |

Squish wins end-to-end response time at every prompt size measured (5.4× at 4000 tokens), uses ~33% less RAM, and supports INT3 for compatible model families. Ollama wins TTFT at every prompt size — if first-byte latency matters more than full-response latency, Ollama is the right tool.
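The ratios quoted above follow directly from the table. As a minimal sketch of that arithmetic, with the table values typed in by hand (the variable names are illustrative and not part of squish; the Ollama RAM figure is the approximate "~5 GB" from the table):

```python
# Values copied from the headline table (Qwen2.5-7B, M3 16 GB, five-run medians).
ollama_e2e_4000_s, squish_e2e_4000_s = 69.63, 12.78
ollama_e2e_75_s, squish_e2e_75_s = 8.09, 5.50
ollama_ram_gb, squish_ram_gb = 5.0, 3.36  # Ollama value is approximate (~5 GB)
ollama_ttft_ms, squish_ttft_ms = 131, 279

speedup_4000 = ollama_e2e_4000_s / squish_e2e_4000_s
speedup_75 = ollama_e2e_75_s / squish_e2e_75_s
ram_saved_pct = 100 * (1 - squish_ram_gb / ollama_ram_gb)
ttft_slowdown = squish_ttft_ms / ollama_ttft_ms  # > 1 means Squish is slower here

print(f"E2E speedup @ 4000 tokens: {speedup_4000:.1f}x")
print(f"E2E speedup @ 75 tokens:   {speedup_75:.1f}x")
print(f"Peak RAM saved:            {ram_saved_pct:.0f}%")
print(f"TTFT ratio (Squish/Ollama): {ttft_slowdown:.1f}x")
```

The last line makes the TTFT trade-off explicit: at 75 tokens Squish's first token arrives roughly twice as late, even though the full response finishes sooner.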

Full methodology and ablation: docs/RESULTS.md (v5.1.1 section).


1. Cold-start load time and TTFT

Source: README.md headline numbers; reproduced via benchmarks/ollama_vs_squish/bench_cold_prefill.py and squish bench.

| | mlx_lm (cold) | Ollama | squish |
| --- | --- | --- | --- |
| Cold-start load time | 28.81 s | 8-25 s | 0.33-0.53 s § |
| Cold start → first token ‡ | n/a | 20-30 s | ~0.5 s |
| RAM during load | ~2400 MB | ~2-8 GB | 160 MB † |

§ Cold-start load = wall time for weights to be accessible in Metal unified memory (mmap, no dtype conversion). Qwen2.5-1.5B, M3 16 GB, on hardware with sufficient RAM to build the Tier 1 MLX safetensors cache (~34 GB required). On a standard 16 GB M3, an 8B INT4 model loads in ~2.7 s via the lut_int2 path.

‡ Cold start → first token includes weight load + prefill; Ollama pays a full cold model load here, which is where Squish's load advantage shows up. For warm steady-state TTFT (model already loaded), see §1b; the two engines are comparable there.

† 160 MB = Apple Metal virtual-address delta during load (mmap, no CPU heap). Peak RSS ~402 MB.
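To illustrate why an mmap-based load can be fast and light on heap memory, here is a minimal stdlib sketch that maps a stand-in file and times only the mapping step. It is not how squish loads weights (the file, its size, and the variable names are all hypothetical); it only shows that mapping does not copy the file into the process heap, and that pages are faulted in when first touched.

```python
import mmap
import os
import tempfile
import time

# Stand-in for a weights file: 8 MiB of zeros keeps the sketch quick.
size_bytes = 8 * 1024 * 1024
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * size_bytes)
    path = f.name

try:
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        # Read-only mapping; no bytes are copied into the Python heap here.
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    map_seconds = time.perf_counter() - t0

    mapped_len = len(mapped)
    first_byte = mapped[0]  # touching a page is what actually faults it in
    mapped.close()
finally:
    os.remove(path)

print(f"mapped {mapped_len / 1e6:.1f} MB in {map_seconds * 1e3:.3f} ms")
```

Timing the mapping separately from the first page touch is one way to separate "weights are addressable" from "weights are resident", which is the distinction the § footnote draws.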

1b. Serving throughput & latency (warm, Qwen2.5-7B vs Ollama)

Thermally controlled (cooldown + drift check + die-temp logging), M3 16 GB, vs Ollama 0.18.2 and 0.30.7. See README and docs/paper.md §4.4.

Metric (warm) Ollama 0.30.7 squish INT4 squish INT3
Decode tok/s @ 75 tok 20.3 20.5 24.0
Decode tok/s @ 4000 tok 17.0 19.1 19.5
Inter-token p95 @ 75 52.4 ms 48.4 ms 42.7 ms
E2E @ 4000-token prompt 37.5 s 3.8 s (9.8×)
TTFT (loaded, 75 tok) 167 ms 192 ms 192 ms
Peak RAM 5.14 GB 3.5 GB

INT3 is the recommended default — arc_easy acc_norm 0.551 vs INT4 0.541 (tied, n=1000). Squish's only loss is warm single-token TTFT (192 vs 167 ms).
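For readers who want to compute the same kind of metrics from their own traces, here is a minimal sketch of decode tok/s and inter-token p95 derived from token arrival timestamps. The function name, the nearest-rank percentile definition, and the synthetic timestamps are assumptions for illustration; the exact definitions used by squish bench are not stated in this document.

```python
def decode_stats(token_times_s):
    """Return (tokens per second, p95 inter-token gap in ms) from arrival timestamps."""
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    if not gaps:
        raise ValueError("need at least two token timestamps")
    tok_per_s = len(gaps) / (token_times_s[-1] - token_times_s[0])
    # Nearest-rank percentile, computed with integer arithmetic: ceil(0.95 * n).
    ordered = sorted(gaps)
    rank = (95 * len(ordered) + 99) // 100
    p95_ms = ordered[rank - 1] * 1e3
    return tok_per_s, p95_ms


# Synthetic trace: 20 gaps of 50 ms, two of which are stretched to 80 ms stalls.
times = [0.0]
for i in range(1, 21):
    times.append(times[-1] + (0.080 if i in (7, 14) else 0.050))

tok_per_s, p95_ms = decode_stats(times)
print(f"decode: {tok_per_s:.1f} tok/s, inter-token p95: {p95_ms:.1f} ms")
```

In this synthetic trace the two stalls pull p95 up to the stall duration while the mean rate barely moves, which is why the table reports both figures.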


2. Disk size — raw vs. squished

Source: README.md model-size table; the squished column is what squish pull <model> actually downloads from the squishai HF org.

| Model | Raw (bf16) | Squished (INT4) | Saved |
| --- | --- | --- | --- |
| qwen3:0.6b | 1.3 GB | 0.4 GB | 69% |
| qwen3:1.7b | 3.5 GB | 1.0 GB | 71% |
| qwen3:4b | 8.2 GB | 2.2 GB | 73% |
| qwen3:8b | 16.4 GB | 4.4 GB | 73% |
| qwen3:14b | 28.7 GB | 7.6 GB | 74% |
| llama3.1:8b | 16.1 GB | 4.3 GB | 73% |
| deepseek-r1:7b | 14.4 GB | 3.9 GB | 73% |

Average: ~73% smaller on disk, 3.7× compression, statistically identical generation quality.
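The per-row percentages and the summary line can be re-derived from the sizes in the table. In the sketch below (sizes typed in by hand, names illustrative), the 73% and 3.7× figures come out when computed over the summed sizes of all seven models; the unweighted mean of the per-row percentages is slightly lower, at about 72%.

```python
# (raw bf16 GB, squished INT4 GB), copied from the table above.
sizes_gb = {
    "qwen3:0.6b": (1.3, 0.4),
    "qwen3:1.7b": (3.5, 1.0),
    "qwen3:4b": (8.2, 2.2),
    "qwen3:8b": (16.4, 4.4),
    "qwen3:14b": (28.7, 7.6),
    "llama3.1:8b": (16.1, 4.3),
    "deepseek-r1:7b": (14.4, 3.9),
}

saved_pct = {name: 100 * (1 - sq / raw) for name, (raw, sq) in sizes_gb.items()}

total_raw = sum(raw for raw, _ in sizes_gb.values())
total_sq = sum(sq for _, sq in sizes_gb.values())
overall_saved_pct = 100 * (1 - total_sq / total_raw)
compression = total_raw / total_sq
mean_of_rows_pct = sum(saved_pct.values()) / len(saved_pct)

for name, pct in saved_pct.items():
    print(f"{name:16s} {pct:5.1f}% saved")
print(f"overall: {overall_saved_pct:.1f}% saved, {compression:.2f}x compression")
print(f"unweighted mean of rows: {mean_of_rows_pct:.1f}%")
```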


3. Weight quantization accuracy gates (lm_eval, arc_easy)

Source: CI accuracy gates; gate values are hard-stops in CI — ship requires meeting or beating them.

| Format | Model | Gate (arc_easy) | Status |
| --- | --- | --- | --- |
| INT4 AWQ g=32 | Qwen2.5-1.5B | ≥ 70.6 % | ✅ shipped (W92) |
| INT3 g=32 | Qwen2.5-1.5B | ≥ 67.2 % | ✅ shipped (W92) |
| INT3 | gemma-3-* ≤ 4B | -15 pp | ❌ blocked |
| INT3 | Qwen3 family | within ±2pp | ✅ shipped (9.33.5) |
| INT2 (naive) | any | ~29 % ≈ random | ⛔ never ship |
| SQINT2 | Qwen2.5-7B | ≥ 65 % (target 67%) | 🎯 in progress |

SQINT2 is the four-stage geometry-aware INT2 pipeline: Hadamard incoherence preprocessing + NF2 per-group quantisation + low-rank SVD residual + layer-selective mixed precision. Effective bit-rate is ~2.15 bpw, roughly half of INT4 storage and ~7× smaller than fp16 (16 / 2.15 ≈ 7.4); see the squish architecture for the full math. Stages 1-3 are code-complete and SNR-validated; the final lm_eval ship gate runs at W103.4d on real M3 16 GB hardware.


4. KV-cache quantization (W104 / W105 / W106)

Three storage tiers — INT8 (default ≤ 8 K), INT4 (8 K-16 K band), INT2 (> 16 K). All three share the same _quantize_*_per_channel / _dequantize_*_per_channel codec contract; the only differences are the codebook and the bit-packing.

Per-token storage (head_dim = 128)

| Mode | Code bytes | Scale bytes | Total per token | Compression vs fp16 |
| --- | --- | --- | --- | --- |
| fp16 (reference) | 256 | 0 | 256 B | 1.00× |
| int8 | 128 | 4 | 132 B | 1.94× |
| int4 | 64 | 4 | 68 B | 3.76× |
| int2 | 32 | 4 | 36 B | 7.11× |
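The storage table can be reproduced with a few lines of arithmetic. The sketch below assumes that "per token" means one head_dim-length vector and that the 4 scale bytes are a single scale stored alongside each such vector; the function name is hypothetical and is not the squish codec API.

```python
def bytes_per_token(head_dim, bits, scale_bytes=4):
    """Packed code bytes for one head_dim-length vector plus its scale."""
    code_bytes = head_dim * bits // 8
    return code_bytes + scale_bytes


head_dim = 128
fp16_bytes = head_dim * 16 // 8  # 256 B reference, no scale stored

for mode, bits in [("int8", 8), ("int4", 4), ("int2", 2)]:
    total = bytes_per_token(head_dim, bits)
    print(f"{mode}: {total} B per token, {fp16_bytes / total:.2f}x vs fp16")
```

The fixed 4-byte scale is why INT2 lands at 7.11× rather than the ideal 8×: the smaller the code, the larger the share of each vector taken up by its scale.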

Reconstruction SNR (fp16, n_tokens=256, head_dim=128, seed=42)

| Distribution | Hadamard | INT8 SNR | INT4 SNR | INT2 SNR |
| --- | --- | --- | --- | --- |
| Gaussian (σ=0.3) | off | 43.89 dB | 19.25 dB | 5.20 dB |
| Gaussian (σ=0.3) | on | 43.82 dB | 19.27 dB | 5.27 dB |
| Heavy-tailed (t, df=3) | off | 37.79 dB | 12.90 dB | -3.39 dB |
| Heavy-tailed (t, df=3) | on | 44.27 dB | 19.69 dB | 5.71 dB |
| Outlier-spiked (1% @ ±5) | off | 34.29 dB | 7.09 dB | -8.61 dB |
| Outlier-spiked (1% @ ±5) | on | 47.70 dB | 23.18 dB | +8.47 dB |

Read the bottom two rows carefully: without rotation, INT2 sits at −8.6 dB on outlier-spiked activations, meaning the reconstruction error power is about seven times the signal power (10^0.86 ≈ 7.2) and the cache is destroyed. Apply the randomised Hadamard rotation (HadamardKVCache, free at runtime, seeded so it is deterministic) and SNR jumps 17 dB to +8.5 dB. This is exactly the bin-collapse failure mode that motivated the W104 codec design, and exactly what the demo Space lets you click on.
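The effect can be reproduced qualitatively in pure Python. The sketch below is not the squish codec or HadamardKVCache: it uses a toy 4-level uniform quantiser with one absmax scale per vector, a seeded random sign flip followed by an orthonormal Walsh-Hadamard transform, and synthetic outlier-spiked data. All function names are hypothetical, and the absolute SNR values will differ from the table; only the direction and rough size of the gap are meant to carry over.

```python
import math
import random


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [x * norm for x in out]


def quantize_2bit(v):
    """Toy 4-level uniform quantiser with one absmax scale per vector."""
    amax = max(abs(x) for x in v)
    if amax == 0.0:
        return list(v)
    step = amax / 2.0  # four equal bins covering [-amax, +amax]
    out = []
    for x in v:
        idx = min(3, max(0, int((x + amax) // step)))
        out.append(-amax + (idx + 0.5) * step)  # reconstruct at the bin centre
    return out


def snr_db(signal, recon):
    sig = sum(x * x for x in signal)
    err = sum((x - y) ** 2 for x, y in zip(signal, recon))
    return 10.0 * math.log10(sig / err)


rng = random.Random(42)
head_dim, n_tokens = 128, 256
signs = [rng.choice((-1.0, 1.0)) for _ in range(head_dim)]  # seeded, so deterministic

original, plain, rotated = [], [], []
for _ in range(n_tokens):
    # Gaussian (sigma = 0.3) with roughly 1% of entries replaced by +/-5 spikes.
    x = [rng.choice((-5.0, 5.0)) if rng.random() < 0.01 else rng.gauss(0.0, 0.3)
         for _ in range(head_dim)]
    original += x
    plain += quantize_2bit(x)
    # Rotate (sign flip, then WHT), quantise, then undo: the orthonormal WHT is
    # its own inverse, and the sign flip is undone by applying it again.
    y_hat = quantize_2bit(fwht([s * xi for s, xi in zip(signs, x)]))
    rotated += [s * zi for s, zi in zip(signs, fwht(y_hat))]

snr_plain = snr_db(original, plain)
snr_rotated = snr_db(original, rotated)
print(f"INT2 without rotation: {snr_plain:+.2f} dB")
print(f"INT2 with rotation:    {snr_rotated:+.2f} dB")
```

Without rotation, a single ±5 spike sets the scale for its whole vector, so every ordinary value collapses into the two innermost bins. The rotation spreads each spike's energy across all 128 coordinates, which keeps the scale close to the bulk of the data.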

Qwen2.5-7B KV-cache memory by context length

n_layers=28, n_kv_heads=4, head_dim=128. Numbers below are estimate_kv_memory(...).total_bytes / 1e6, the same closed-form used by make_kv_cache(planned_context=...) to pick a tier.

| Context tokens | fp16 | int8 | int4 | int2 |
| --- | --- | --- | --- | --- |
| 4 096 | 234.9 MB | 121.1 MB | 62.4 MB | 33.0 MB |
| 8 192 | 469.8 MB | 242.2 MB | 124.8 MB | 66.1 MB |
| 16 384 | 939.5 MB | 484.4 MB | 249.6 MB | 132.1 MB |
| 32 768 | 1 879.0 MB | 968.9 MB | 499.1 MB | 264.2 MB |
| 65 536 | 3 758.1 MB | 1 937.8 MB | 998.2 MB | 528.5 MB |

Headroom story on M3 16 GB (≈ 15.5 GB usable Metal budget): an fp16 KV cache for Qwen2.5-7B at 32 K tokens is 1.88 GB, on top of ~4.4 GB of INT4 weights, leaving only ~9 GB for everything else and OOMing around 10 K in practice. The same workload at INT2 KV is 264 MB, about 7× smaller; it fits 32 K cleanly, and 65 K stays under 530 MB.
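The memory table and the headroom figure are consistent with a simple closed form: keys plus values, times layers, times KV heads, times the per-token bytes from the storage table. The sketch below is a hypothetical re-derivation, not estimate_kv_memory itself; in particular it ignores the window argument that the real function accepts.

```python
# Bytes per head_dim=128 vector, from the per-token storage table.
BYTES_PER_HEAD_TOKEN = {"fp16": 256, "int8": 132, "int4": 68, "int2": 36}


def kv_cache_mb(context_tokens, mode, n_layers=28, n_kv_heads=4):
    """KV-cache size in MB (1e6 bytes) for a Qwen2.5-7B-shaped model."""
    per_token = 2 * n_layers * n_kv_heads * BYTES_PER_HEAD_TOKEN[mode]  # 2 = K and V
    return context_tokens * per_token / 1e6


for ctx in (4_096, 8_192, 16_384, 32_768, 65_536):
    row = ", ".join(f"{m}={kv_cache_mb(ctx, m):.1f}" for m in BYTES_PER_HEAD_TOKEN)
    print(f"{ctx:>6} tokens: {row} MB")

# Headroom arithmetic from the paragraph above.
usable_gb, weights_gb = 15.5, 4.4
fp16_32k_gb = kv_cache_mb(32_768, "fp16") / 1e3
headroom_gb = usable_gb - weights_gb - fp16_32k_gb
print(f"fp16 KV @ 32 K: {fp16_32k_gb:.2f} GB, headroom left: {headroom_gb:.1f} GB")
```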

from squish.kv.kv_cache import recommended_kv_mode_3tier
recommended_kv_mode_3tier(   4_000)   # → "int8"
recommended_kv_mode_3tier(  12_000)   # → "int4"
recommended_kv_mode_3tier(  32_000)   # → "int2"

Defaults: ≤ 8 K → int8, 8-16 K → int4, > 16 K → int2.
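A minimal sketch of the three-tier rule, for readers who want to see the thresholds as code. The function name and the interpretation of "8 K" and "16 K" as 8 192 and 16 384 tokens are assumptions; the real recommended_kv_mode_3tier may use different boundaries or extra inputs.

```python
def pick_kv_tier(planned_context, int8_max=8_192, int4_max=16_384):
    """Hypothetical stand-in for the default tier rule: int8, then int4, then int2."""
    if planned_context <= int8_max:
        return "int8"
    if planned_context <= int4_max:
        return "int4"
    return "int2"


for ctx in (4_000, 12_000, 32_000):
    print(f"{ctx:>6} tokens -> {pick_kv_tier(ctx)}")
```

Keeping the thresholds as parameters makes it easy to tighten them on machines with less memory than the M3 16 GB used here.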


5. Throughput — quantized GEMV (W101 / W102)

INT4 group-32 GEMV on M3 16 GB. Single-thread baseline numbers come from squish bench --format int4 with batch=1, in=4096, out=4096, group=32, iters=200, warmup=50. The Rust path releases the GIL via py.allow_threads() and parallelises across output features with Rayon; the NumPy path is kept as a portable fallback.

Backend p50 latency p95 latency GOPS
NumPy fallback reference reference
Rust (squish_quant_rs) ~2-3× faster ~3-4× faster 2-3×

(Exact numbers depend on host; reproduce with squish bench --format int4 --in-features 4096 --out-features 4096 --group-size 32 --iters 200.)
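For reference, this is roughly what a group-32 INT4 GEMV computes, written as a small pure-Python sketch. It is neither the squish_quant_rs kernel nor the NumPy fallback: codes are kept as plain integers with no bit-packing, the quantiser is a simple symmetric absmax scheme, and all names and sizes are hypothetical.

```python
import random

GROUP = 32


def quantize_row_int4(row, group=GROUP):
    """Symmetric INT4 per group: integer codes in [-7, 7] plus one float scale per group."""
    codes, scales = [], []
    for g in range(0, len(row), group):
        chunk = row[g:g + group]
        amax = max(abs(w) for w in chunk)
        scale = amax / 7.0 if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(round(w / scale) for w in chunk)
    return codes, scales


def gemv_int4(q_rows, x, group=GROUP):
    """y[o] = sum over groups of scale_g * dot(codes_g, x_g) for each output row o."""
    y = []
    for codes, scales in q_rows:
        acc = 0.0
        for gi, scale in enumerate(scales):
            lo = gi * group
            acc += scale * sum(c * xi for c, xi in zip(codes[lo:lo + group], x[lo:lo + group]))
        y.append(acc)
    return y


rng = random.Random(0)
in_features, out_features = 128, 4  # tiny sizes; the benchmark uses 4096 x 4096
W = [[rng.uniform(-1, 1) for _ in range(in_features)] for _ in range(out_features)]
x = [rng.uniform(-1, 1) for _ in range(in_features)]

q_rows = [quantize_row_int4(row) for row in W]
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in W]
y_q = gemv_int4(q_rows, x)
max_abs_err = max(abs(a - b) for a, b in zip(y_ref, y_q))
print(f"max |y_ref - y_int4| = {max_abs_err:.4f}")
```

Applying the scale once per group, after the integer dot product, is the property a fast kernel exploits: the inner loop touches only small integers and the activation vector.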


6. Reproduce these numbers

# Cold load + TTFT
python benchmarks/ollama_vs_squish/bench_cold_prefill.py

# Quantized GEMV throughput
squish bench --format int4
squish bench --format int8

# KV codec SNR (the demo's numbers, exactly)
python -c "
from spaces._logic import make_synthetic_activations, apply_hadamard, run_all_tiers
arr = make_synthetic_activations(256, 128, 'outlier', seed=42)
for r in run_all_tiers(apply_hadamard(arr)):
    print(f'{r.mode}: SNR={r.snr_db:.2f} dB, {r.bytes_per_token} B/tok, {r.compression_vs_fp16:.2f}x')
"

# KV memory for any model + context
python -c "
from squish.kv.kv_cache import estimate_kv_memory
e = estimate_kv_memory(n_layers=28, n_kv_heads=4, head_dim=128,
                       context_tokens=32_000, mode='int2', window=128)
print(f'total = {e.total_bytes/1e6:.1f} MB, ratio = {e.compression_ratio:.2f}x')
"

# Full lm_eval gate (overnight, requires real model)
lm_eval --model squish --model_args path=$MODEL_DIR --tasks arc_easy --limit 500

7. Live in the browser

The KV-cache numbers from §4 are interactive at the squish-kv-quant Hugging Face Space. Pick a distribution, toggle Hadamard rotation, see SNR shift in real time. Source in spaces/.