Benchmark Results¶

Squish — Benchmarks¶

Grounded numbers from the squish repository. Reproduce locally with the commands at the bottom of each section. No marketing rounding — everything here either runs in CI or is a single Python invocation away.

0. Headline (v9.32.0 / bench v5.1.1)¶

Source: README.md "The Numbers" section; raw artifacts in results/benchmarks_v5_1_1/.

Measured 2026-06-02 on Apple M3 MacBook Pro, 16 GB unified memory. Model: Qwen2.5-7B-Instruct. Quant: INT4 (squish) / Q4_K_M (Ollama). Five-run medians.

Metric	Ollama 0.18.2	Squish
E2E response @ 4000-token prompt	69.63 s	12.78 s (5.4× faster)
E2E response @ 75-token prompt	8.09 s	5.50 s (1.5× faster)
Peak RAM during inference	~5 GB	3.36 GB
Disk size — INT4	4.36 GB	4.00 GB
Disk size — INT3 (Qwen3)	not supported	3.56 GB
TTFT @ 75-token prompt	131 ms	279 ms (honest loss)

Squish wins end-to-end response time at every prompt size measured (5.4× at 4000 tokens), uses ~33% less RAM, and supports INT3 for compatible model families. Ollama wins TTFT at every prompt size — if first-byte latency matters more than full-response latency, Ollama is the right tool.

Full methodology and ablation: docs/RESULTS.md (v5.1.1 section).

1. Cold-start load time and TTFT¶

Source: README.md headline numbers; reproduced via benchmarks/ollama_vs_squish/bench_cold_prefill.py and squish bench.

	mlx_lm (cold)	Ollama	squish
Cold-start load time	28.81 s	8-25 s	0.33-0.53 s §
Cold start → first token	n/a	20-30 s	~0.5 s ‡
RAM during load	~2400 MB	~2-8 GB	160 MB †

§ Cold-start load = wall time for weights to be accessible in Metal unified memory (mmap, no dtype conversion). Qwen2.5-1.5B, M3 16 GB, on hardware with sufficient RAM to build the Tier 1 MLX safetensors cache (~34 GB required). On a standard 16 GB M3, an 8B INT4 model loads in ~2.7 s via the lut_int2 path. ‡ Cold start → first token includes weight load + prefill; Ollama pays a full cold model load here, which is where Squish's load advantage shows up. For warm steady-state TTFT (model already loaded), see §1b — the two engines are comparable there. † 160 MB = Apple Metal virtual-address delta during load (mmap, no CPU heap). Peak RSS ~402 MB.

1b. Serving throughput & latency (warm, Qwen2.5-7B vs Ollama)¶

Thermally controlled (cooldown + drift check + die-temp logging), M3 16 GB, vs Ollama 0.18.2 and 0.30.7. See README and docs/paper.md §4.4.

Metric (warm)	Ollama 0.30.7	squish INT4	squish INT3
Decode tok/s @ 75 tok	20.3	20.5	24.0
Decode tok/s @ 4000 tok	17.0	19.1	19.5
Inter-token p95 @ 75	52.4 ms	48.4 ms	42.7 ms
E2E @ 4000-token prompt	37.5 s	3.8 s (9.8×)	—
TTFT (loaded, 75 tok)	167 ms	192 ms	192 ms
Peak RAM	5.14 GB	3.5 GB	—

INT3 is the recommended default — arc_easy acc_norm 0.551 vs INT4 0.541 (tied, n=1000). Squish's only loss is warm single-token TTFT (192 vs 167 ms).

2. Disk size — raw vs. squished¶

Source: README.md model-size table; the squished column is what squish pull <model> actually downloads from the squishai HF org.

Model	Raw (bf16)	Squished (INT4)	Saved
qwen3:0.6b	1.3 GB	0.4 GB	69%
qwen3:1.7b	3.5 GB	1.0 GB	71%
qwen3:4b	8.2 GB	2.2 GB	73%
qwen3:8b	16.4 GB	4.4 GB	73%
qwen3:14b	28.7 GB	7.6 GB	74%
llama3.1:8b	16.1 GB	4.3 GB	73%
deepseek-r1:7b	14.4 GB	3.9 GB	73%

Average: ~73% smaller on disk, 3.7× compression, statistically identical generation quality.

3. Weight quantization accuracy gates (lm_eval, arc_easy)¶

Source: CI accuracy gates; gate values are hard-stops in CI — ship requires meeting or beating them.

Format	Model	Gate (arc_easy)	Status
INT4 AWQ g=32	Qwen2.5-1.5B	≥ 70.6 %	✅ shipped (W92)
INT3 g=32	Qwen2.5-1.5B	≥ 67.2 %	✅ shipped (W92)
INT3	gemma-3-* ≤ 4B	-15 pp	❌ blocked
INT3	Qwen3 family	within ±2pp	✅ shipped (9.33.5)
INT2 (naive)	any	~29 % ≈ random	⛔ never ship
SQINT2	Qwen2.5-7B	≥ 65 % (target 67%)	🎯 in progress

SQINT2 is the four-stage geometry-aware INT2 pipeline: Hadamard incoherence preprocessing + NF2 per-group quantisation + low-rank SVD residual + layer-selective mixed precision. Effective bit-rate ~2.15 bpw — half of INT4 storage, ~7× of fp16 — see the squish architecture for the full math. Stages 1-3 land code-complete + SNR-validated; the final lm_eval ship gate runs at W103.4d on real M3 16 GB hardware.

4. KV-cache quantization (W104 / W105 / W106)¶

Three storage tiers — INT8 (default ≤ 8 K), INT4 (8 K-16 K band), INT2 (> 16 K). All three share the same _quantize_*_per_channel / _dequantize_*_per_channel codec contract; the only differences are the codebook and the bit-packing.

Per-token storage (head_dim = 128)¶

Mode	Code bytes	Scale bytes	Total per token	Compression vs fp16
fp16 (reference)	256	0	256 B	1.00×
int8	128	4	132 B	1.94×
int4	64	4	68 B	3.76×
int2	32	4	36 B	7.11×

Reconstruction SNR (fp16, n_tokens=256, head_dim=128, seed=42)¶

Distribution	Hadamard	INT8 SNR	INT4 SNR	INT2 SNR
Gaussian (σ=0.3)	off	43.89 dB	19.25 dB	5.20 dB
Gaussian (σ=0.3)	on	43.82 dB	19.27 dB	5.27 dB
Heavy-tailed (t, df=3)	off	37.79 dB	12.90 dB	-3.39 dB
Heavy-tailed (t, df=3)	on	44.27 dB	19.69 dB	5.71 dB
Outlier-spiked (1% @ ±5)	off	34.29 dB	7.09 dB	-8.61 dB
Outlier-spiked (1% @ ±5)	on	47.70 dB	23.18 dB	+8.47 dB

Read the bottom row carefully: without rotation, INT2 sits at −8.6 dB on outlier-spiked activations — the reconstruction error is literally seven times the signal, and the cache is destroyed. Apply the randomised Hadamard rotation (HadamardKVCache, free at runtime, seeded so it is deterministic) and SNR jumps 17 dB to +8.5 dB. This is exactly the bin-collapse failure mode that motivated the W104 codec design — and exactly what the demo Space lets you click on.

Qwen2.5-7B KV-cache memory by context length¶

n_layers=28, n_kv_heads=4, head_dim=128. Numbers below are estimate_kv_memory(...).total_bytes / 1e6, the same closed-form used by make_kv_cache(planned_context=...) to pick a tier.

Context tokens	fp16	int8	int4	int2
4 096	234.9 MB	121.1 MB	62.4 MB	33.0 MB
8 192	469.8 MB	242.2 MB	124.8 MB	66.1 MB
16 384	939.5 MB	484.4 MB	249.6 MB	132.1 MB
32 768	1 879.0 MB	968.9 MB	499.1 MB	264.2 MB
65 536	3 758.1 MB	1 937.8 MB	998.2 MB	528.5 MB

Headroom story on M3 16 GB (≈ 15.5 GB usable Metal budget): a fp16 KV cache for Qwen2.5-7B at 32 K tokens is 1.88 GB, on top of ~4.4 GB of INT4 weights, leaving only ~9 GB for everything else and OOMing around 10 K in practice. The same workload at INT2 KV is 264 MB — 7× smaller, fits 32 K cleanly, and 65 K stays under 530 MB.

Recommended-mode auto-selection¶

from squish.kv.kv_cache import recommended_kv_mode_3tier
recommended_kv_mode_3tier(   4_000)   # → "int8"
recommended_kv_mode_3tier(  12_000)   # → "int4"
recommended_kv_mode_3tier(  32_000)   # → "int2"

Defaults: ≤ 8 K → int8, 8-16 K → int4, > 16 K → int2.

5. Throughput — quantized GEMV (W101 / W102)¶

INT4 group-32 GEMV, M3 16 GB, single-thread baseline numbers from squish bench --format int4 on (batch=1, in=4096, out=4096, group=32, iters=200, warmup=50). Rust path released the GIL via py.allow_threads() and parallelised across output features with Rayon; NumPy path kept as a portable fallback.

Backend	p50 latency	p95 latency	GOPS
NumPy fallback	reference	reference	1×
Rust (`squish_quant_rs`)	-2-3× faster	-3-4× faster	2-3×

(Exact numbers depend on host; reproduce with squish bench --format int4 --in-features 4096 --out-features 4096 --group-size 32 --iters 200.)

6. Reproduce these numbers¶

# Cold load + TTFT
benchmarks/ollama_vs_squish/bench_cold_prefill.py

# Quantized GEMV throughput
squish bench --format int4
squish bench --format int8

# KV codec SNR (the demo's numbers, exactly)
python -c "
from spaces._logic import make_synthetic_activations, apply_hadamard, run_all_tiers
arr = make_synthetic_activations(256, 128, 'outlier', seed=42)
for r in run_all_tiers(apply_hadamard(arr)):
    print(f'{r.mode}: SNR={r.snr_db:.2f} dB, {r.bytes_per_token} B/tok, {r.compression_vs_fp16:.2f}x')
"

# KV memory for any model + context
python -c "
from squish.kv.kv_cache import estimate_kv_memory
e = estimate_kv_memory(n_layers=28, n_kv_heads=4, head_dim=128,
                       context_tokens=32_000, mode='int2', window=128)
print(f'total = {e.total_bytes/1e6:.1f} MB, ratio = {e.compression_ratio:.2f}x')
"

# Full lm_eval gate (overnight, requires real model)
lm_eval --model squish --model_args path=$MODEL_DIR --tasks arc_easy --limit 500

7. Live in the browser¶

The KV-cache numbers from §4 are interactive at the squish-kv-quant Hugging Face Space. Pick a distribution, toggle Hadamard rotation, see SNR shift in real time. Source in spaces/.