Squish: 2-bit Quantization Comparison Benchmark¶
Phase 9C, sub-2-bit weight compression: INT4 vs VPTQ vs AQLM vs QuIP# Weight-reconstruction metrics from synthetic 64×64 weight matrices. Perplexity + TPS metrics require
--model-dir(run on real hardware).
Overview¶
This benchmark evaluates four near-2-bit weight-compression methods on the same weight matrix and reports:
- BPW: bits per weight after compression (including index + scale overhead)
- SNR (dB): weight-reconstruction signal-to-noise ratio vs. original
- Compress ms: wall-clock time for the offline compression step
- Decompress ms: wall-clock time to reconstruct weights from compressed form
- Perplexity: wikitext-2 perplexity (requires
--model-dir, see below) - TPS: tokens/second during generation (requires
--model-dir)
Stage 1: Weight Reconstruction (Synthetic 64×64 matrix)¶
Note: The synthetic weight matrix (64×64 = 4096 parameters at σ=0.02) matches the scale of a small linear layer. BPW results are representative of the compression scheme; SNR is a proxy for perplexity, higher is better. Compress time for VPTQ reflects its k-means++ calibration cost (reduced config for benchmark speed; production use targets k=256, group=8).
| Method | BPW | SNR (dB) | Compress (ms) | Decompress (ms) | Backend |
|---|---|---|---|---|---|
| INT4 nibble (baseline) | 5.00 | 21.0 | 0.8 | 0.14 | numpy |
| VPTQ (NeurIPS 2025) | 1.25 | 3.7 | 111.1 | 0.02 | vptq-numpy |
| AQLM 2-bit (Phase 9A) | — | — | — | — | not yet implemented |
| QuIP# 2-bit (Phase 9B) | 3.00 | 6.8 | 0.7 | 0.02 | quip-numpy |
Notes on BPW¶
| Method | BPW formula | Notes |
|---|---|---|
| INT4 nibble | 4 + 32/group_size + 32/group_size = 5.0 bpw | Asymmetric; 2× float32 overhead per group (scale + zero_point). Rust symmetric path: 4.5 bpw |
| VPTQ | (log₂(k_primary) + log₂(k_residual)) / group_size + scale_overhead | Benchmark config: k=16 primary + k=4 residual, group=8 → 0.75 bpw indices + 0.5 bpw col-scales = 1.25 bpw. Production config (k=256 + k=16) → ~1.5–1.75 bpw |
| AQLM | (M × log₂(codebook_size)) / group_size | Not implemented yet (Phase 9A) |
| QuIP# | 8 bits (E8 index) + 16 bits (residual scale) per 8-D chunk = 3.0 bpw | Excludes rotation matrix (per-matrix one-time cost; negligible for large layers) |
Notes on SNR¶
The SNR values on the 64×64 synthetic matrix reflect the quantization error for a small random Gaussian weight distribution (σ=0.02). On a real language model, the true quality metric is wikitext-2 perplexity. The expected perplexity degradation relative to FP16 is:
| Method | Expected Δ PPL (Qwen2.5-1.5B, wikitext-2) |
|---|---|
| INT4 nibble | ~+0.3–0.8 nats above FP16 |
| VPTQ (k=256, group=8) | ~within 1 nat of INT4 |
| AQLM 2-bit | ~within 0.5 nat of INT4 |
| QuIP# 2-bit | ~within 0.3 nat of FP16 |
Stage 2: Model Evaluation (requires --model-dir)¶
Model-level perplexity and TPS have not been collected yet. To collect them, run with a downloaded model:
python3 dev/benchmarks/bench_2bit.py \
--model-dir models/Qwen2.5-1.5B \
--ppl-tokens 2048 \
--tps-tokens 128 \
--output dev/results/quant_2bit_comparison.json
This evaluates FP16 perplexity and generation throughput via mlx_lm.
Quantized model evaluation will be added when AQLM (Phase 9A) is complete.
Benchmark Configuration¶
The benchmark uses reduced settings for fast CI execution (< 15 s per run). For a high-fidelity comparison on large weight matrices, increase:
# In dev/benchmarks/bench_2bit.py:
BENCH_ROWS = 4096 # realistic linear layer height
BENCH_COLS = 4096 # realistic linear layer width
VPTQ_N_PRIMARY = 256 # full codebook (standard NeurIPS 2025 config)
VPTQ_N_RESIDUAL = 16
VPTQ_ITERS = 20 # full k-means iterations
Running the Benchmark¶
# Stage 1 only (no model required, < 15 s):
python3 dev/benchmarks/bench_2bit.py --dry-run
# Stage 1 + Stage 2 (requires mlx_lm and a downloaded model):
python3 dev/benchmarks/bench_2bit.py --model-dir models/Qwen2.5-1.5B
# With Markdown table output:
python3 dev/benchmarks/bench_2bit.py --markdown
# Custom output path:
python3 dev/benchmarks/bench_2bit.py --output /path/to/out.json
Results are written to dev/results/quant_2bit_comparison.json.
Implementation Status¶
| Phase | Method | Module | Status |
|---|---|---|---|
| Baseline | INT4 nibble | squish/quant/quantizer.py |
✅ Complete |
| Phase 7 | VPTQ | (experimental) | ✅ Complete |
| Phase 9A | AQLM 2-bit | squish/quant/aqlm.py |
⏳ Not yet implemented |
| Phase 9B | QuIP# 2-bit | (experimental) | ✅ Complete |
| Phase 9C | This benchmark | dev/benchmarks/bench_2bit.py |
✅ Complete |
See Also¶
CHANGELOG.md: version history and shipped phases- VPTQ / QuIP#: experimental 2-bit paths, not in the shipped tree
- Raw results:
dev/results/quant_2bit_comparison.json(generated locally)