Squish: 2-bit Quantization Comparison Benchmark

Phase 9C, sub-2-bit weight compression: INT4 vs. VPTQ vs. AQLM vs. QuIP#. Weight-reconstruction metrics are measured on a synthetic 64×64 weight matrix. Perplexity and TPS metrics require --model-dir (run on real hardware).


Overview

This benchmark compares three near-2-bit weight-compression methods (VPTQ, AQLM, QuIP#) against an INT4 baseline on the same weight matrix and reports:

  • BPW: bits per weight after compression (including index + scale overhead)
  • SNR (dB): weight-reconstruction signal-to-noise ratio vs. original
  • Compress ms: wall-clock time for the offline compression step
  • Decompress ms: wall-clock time to reconstruct weights from compressed form
  • Perplexity: wikitext-2 perplexity (requires --model-dir, see below)
  • TPS: tokens/second during generation (requires --model-dir)
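
As an illustration of the SNR metric listed above, here is a minimal sketch using the standard definition, 10·log10(signal power / noise power). The function name `snr_db` is hypothetical, and the exact formula used inside bench_2bit.py is an assumption, not something this page confirms.

```python
import numpy as np

def snr_db(original, reconstructed):
    """Weight-reconstruction SNR in dB (standard definition; assumed, not
    taken from the benchmark source)."""
    original = np.asarray(original, dtype=np.float64)
    noise = original - np.asarray(reconstructed, dtype=np.float64)
    signal_power = np.sum(original ** 2)
    noise_power = np.sum(noise ** 2)
    if noise_power == 0.0:
        return float("inf")  # perfect reconstruction
    return float(10.0 * np.log10(signal_power / noise_power))

# A reconstruction whose error is 10% of each weight's amplitude
# has a power ratio of 1 / 0.01 = 100, i.e. about 20 dB.
w = np.array([1.0, -2.0, 3.0])
print(f"{snr_db(w, 0.9 * w):.2f} dB")
```

Every additional 10 dB corresponds to a tenfold reduction in squared reconstruction error relative to the weights' energy.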

Stage 1: Weight Reconstruction (Synthetic 64×64 matrix)

Note: The synthetic weight matrix (64×64 = 4096 parameters, Gaussian with σ=0.02) matches the scale of a small linear layer. BPW results are representative of each compression scheme. SNR serves as a proxy for perplexity; higher is better. The VPTQ compress time reflects its k-means++ calibration cost under a reduced configuration chosen for benchmark speed (k=16 primary, k=4 residual, group=8); production use targets k=256, group=8.
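
To make the Stage 1 setup concrete, the following sketch builds a 64×64 Gaussian matrix with σ=0.02 and runs an asymmetric 4-bit group quantizer over it. This is not the code in squish/quant/quantizer.py: the function names, the seed, the group size of 64, and the choice to keep codes unpacked in uint8 (rather than two per byte) are all assumptions made for illustration.

```python
import numpy as np

def quantize_int4_asym(weights, group_size=64):
    """Asymmetric 4-bit quantization with one scale and one zero point
    (the group minimum) per group. Codes are left unpacked for clarity."""
    groups = weights.reshape(-1, group_size).astype(np.float32)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0            # 16 levels -> 15 steps
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    codes = np.clip(np.round((groups - w_min) / scale), 0, 15).astype(np.uint8)
    return codes, scale, w_min

def dequantize_int4_asym(codes, scale, zero_point, shape):
    """Map 4-bit codes back to float32 weights."""
    return (codes.astype(np.float32) * scale + zero_point).reshape(shape)

# Synthetic matrix as described in the note (seed is arbitrary).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 64)).astype(np.float32)

codes, scale, zero_point = quantize_int4_asym(w)
w_hat = dequantize_int4_asym(codes, scale, zero_point, w.shape)

err = (w - w_hat).astype(np.float64)
snr = 10.0 * np.log10(np.sum(w.astype(np.float64) ** 2) / np.sum(err ** 2))
print(f"INT4 reconstruction SNR: {snr:.1f} dB")
```

Storing a float32 scale and a float32 zero point per 64-weight group is what adds 1 bit per weight of overhead on top of the 4-bit codes.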

| Method | BPW | SNR (dB) | Compress (ms) | Decompress (ms) | Backend |
|---|---|---|---|---|---|
| INT4 nibble (baseline) | 5.00 | 21.0 | 0.8 | 0.14 | numpy |
| VPTQ (NeurIPS 2025) | 1.25 | 3.7 | 111.1 | 0.02 | vptq-numpy |
| AQLM 2-bit (Phase 9A) | not yet implemented | | | | |
| QuIP# 2-bit (Phase 9B) | 3.00 | 6.8 | 0.7 | 0.02 | quip-numpy |
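
For readers unfamiliar with two-stage vector quantization, the sketch below shows the general idea behind a primary-plus-residual codebook scheme at the benchmark's reduced sizes (k=16 primary, k=4 residual, group=8). It is not the squish VPTQ implementation: it uses plain Lloyd's k-means with random initialisation instead of k-means++, omits the column scales, and every function name is hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every vector to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # empty clusters keep their old centroid
                centroids[j] = members.mean(axis=0)
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, d2.argmin(axis=1)

def residual_vq(weights, group=8, k_primary=16, k_residual=4):
    """Quantize 8-D weight vectors with a primary codebook, then quantize
    what is left over with a smaller residual codebook."""
    vecs = weights.reshape(-1, group)
    cb1, idx1 = kmeans(vecs, k_primary, seed=0)
    residual = vecs - cb1[idx1]
    cb2, idx2 = kmeans(residual, k_residual, seed=1)
    recon = (cb1[idx1] + cb2[idx2]).reshape(weights.shape)
    return idx1, idx2, recon

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 64)).astype(np.float32)
idx1, idx2, w_hat = residual_vq(w)

# Index cost only: (log2(16) + log2(4)) bits shared by 8 weights.
index_bpw = (np.log2(16) + np.log2(4)) / 8
err = (w - w_hat).astype(np.float64)
snr = 10.0 * np.log10(np.sum(w.astype(np.float64) ** 2) / np.sum(err ** 2))
print(f"index BPW: {index_bpw:.2f}, SNR: {snr:.1f} dB")
```

Only the two index arrays need to be stored per weight vector; the codebooks are a fixed cost per matrix, which is why the index rate can fall below 1 bit per weight.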

Notes on BPW

| Method | BPW formula | Notes |
|---|---|---|
| INT4 nibble | 4 + 32/group_size + 32/group_size = 5.0 bpw (at group_size = 64) | Asymmetric; 2× float32 overhead per group (scale + zero_point). Rust symmetric path: 4.5 bpw |
| VPTQ | (log₂(k_primary) + log₂(k_residual)) / group_size + scale_overhead | Benchmark config: k=16 primary + k=4 residual, group=8 → 0.75 bpw indices + 0.5 bpw col-scales = 1.25 bpw. Production config (k=256 + k=16) → ~1.5–1.75 bpw |
| AQLM | (M × log₂(codebook_size)) / group_size | Not implemented yet (Phase 9A) |
| QuIP# | 8 bits (E8 index) + 16 bits (residual scale) per 8-D chunk = 3.0 bpw | Excludes rotation matrix (per-matrix one-time cost; negligible for large layers) |
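
The arithmetic behind these formulas can be checked with a few lines of Python. The helper names are hypothetical. Two things here are assumptions rather than statements from the table: the VPTQ column scales are modelled as one float32 per column (which reproduces 0.5 bpw on a 64-row matrix), and the AQLM example values are purely illustrative.

```python
import math

def int4_bpw(group_size=64, asymmetric=True):
    """4-bit codes plus one float32 (symmetric) or two (asymmetric) per group."""
    overhead = 32 / group_size
    return 4 + overhead * (2 if asymmetric else 1)

def vptq_index_bpw(k_primary, k_residual, group_size):
    """Index bits for primary + residual codebooks, shared by one group."""
    return (math.log2(k_primary) + math.log2(k_residual)) / group_size

def col_scale_bpw(rows, bits=32):
    """Assumed: one scale per column, amortised over the rows it covers."""
    return bits / rows

def aqlm_bpw(m, codebook_size, group_size):
    """M codebook indices per group."""
    return (m * math.log2(codebook_size)) / group_size

def quip_bpw(index_bits=8, scale_bits=16, chunk=8):
    """E8 index plus residual scale per 8-D chunk."""
    return (index_bits + scale_bits) / chunk

print(int4_bpw())                                      # asymmetric INT4
print(int4_bpw(asymmetric=False))                      # symmetric INT4
print(vptq_index_bpw(16, 4, 8) + col_scale_bpw(64))    # VPTQ, benchmark config
print(vptq_index_bpw(256, 16, 8))                      # VPTQ, production indices only
print(aqlm_bpw(m=2, codebook_size=256, group_size=8))  # illustrative values
print(quip_bpw())                                      # QuIP#
```

Note that per-column scale overhead shrinks as the matrix gets taller, so the 0.5 bpw figure is specific to the 64-row benchmark matrix under this assumption.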

Notes on SNR

The SNR values on the 64×64 synthetic matrix reflect the quantization error for a small random Gaussian weight matrix (σ=0.02). On a real language model, the quality metric that matters is wikitext-2 perplexity. The expected perplexity degradation, stated relative to FP16 or to the INT4 baseline as noted in each row, is:

| Method | Expected Δ PPL (Qwen2.5-1.5B, wikitext-2) |
|---|---|
| INT4 nibble | ~+0.3–0.8 nats above FP16 |
| VPTQ (k=256, group=8) | ~within 1 nat of INT4 |
| AQLM 2-bit | ~within 0.5 nat of INT4 |
| QuIP# 2-bit | ~within 0.3 nat of FP16 |
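
Perplexity is the exponential of the mean per-token negative log-likelihood, and that mean is measured in nats when natural logarithms are used. The sketch below uses made-up log-probabilities rather than model output to show the relationship, and what a given shift in mean loss does to perplexity. How bench_2bit.py computes and reports its deltas is not shown on this page, so the helper names here are illustrative only.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log, so nats)."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

def ppl_after_nll_shift(base_ppl, delta_nats):
    """Perplexity after the mean NLL rises by delta_nats."""
    return base_ppl * math.exp(delta_nats)

# Four tokens, each predicted with probability 0.25: the model is as
# uncertain as a uniform choice among 4 options.
logprobs = [math.log(0.25)] * 4
print(f"{perplexity(logprobs):.2f}")

# A shift in mean loss multiplies perplexity; it does not add to it.
print(f"{ppl_after_nll_shift(10.0, 0.3):.2f}")
```

Because the relationship is multiplicative, the same shift in nats translates to a larger absolute perplexity change on a model with a higher baseline perplexity.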

Stage 2: Model Evaluation (requires --model-dir)

Model-level perplexity and TPS have not been collected yet. To collect them, run with a downloaded model:

python3 dev/benchmarks/bench_2bit.py \
    --model-dir models/Qwen2.5-1.5B \
    --ppl-tokens 2048 \
    --tps-tokens 128 \
    --output dev/results/quant_2bit_comparison.json

This currently evaluates only the FP16 model's perplexity and generation throughput, via mlx_lm. Quantized model evaluation will be added when AQLM (Phase 9A) is complete.


Benchmark Configuration

The benchmark uses reduced settings for fast CI execution (< 15 s per run). For a high-fidelity comparison on large weight matrices, increase the following settings:

# In dev/benchmarks/bench_2bit.py:
BENCH_ROWS      = 4096   # realistic linear layer height
BENCH_COLS      = 4096   # realistic linear layer width
VPTQ_N_PRIMARY  = 256    # full codebook (standard NeurIPS 2025 config)
VPTQ_N_RESIDUAL = 16
VPTQ_ITERS      = 20     # full k-means iterations

Running the Benchmark

# Stage 1 only (no model required, < 15 s):
python3 dev/benchmarks/bench_2bit.py --dry-run

# Stage 1 + Stage 2 (requires mlx_lm and a downloaded model):
python3 dev/benchmarks/bench_2bit.py --model-dir models/Qwen2.5-1.5B

# With Markdown table output:
python3 dev/benchmarks/bench_2bit.py --markdown

# Custom output path:
python3 dev/benchmarks/bench_2bit.py --output /path/to/out.json

By default, results are written to dev/results/quant_2bit_comparison.json.


Implementation Status

| Phase | Method | Module | Status |
|---|---|---|---|
| Baseline | INT4 nibble | squish/quant/quantizer.py | ✅ Complete |
| Phase 7 | VPTQ | (experimental) | ✅ Complete |
| Phase 9A | AQLM 2-bit | squish/quant/aqlm.py | ⏳ Not yet implemented |
| Phase 9B | QuIP# 2-bit | (experimental) | ✅ Complete |
| Phase 9C | This benchmark | dev/benchmarks/bench_2bit.py | ✅ Complete |

See Also

  • CHANGELOG.md: version history and shipped phases
  • VPTQ / QuIP#: experimental 2-bit paths, not in the shipped tree
  • Raw results: dev/results/quant_2bit_comparison.json (generated locally)