INT Quantization Benchmark Plan
Benchmark plan for evaluating Squish's INT4, INT3, and INT2 quantization across 10 production LLMs, measuring throughput, perplexity, accuracy, and storage requirements at each precision level.
Overview
| Item | Value |
|---|---|
| Bit levels | BF16 (reference), INT4, INT3, INT2 |
| Models | 10 (4 existing benchmarked + 6 new) |
| Tests per model per bit level | 3 (T1: throughput, T2: perplexity, T3: accuracy) |
| Total test runs | 10 models × 3 quantized bit levels × 3 tests = 90 runs (BF16 reference runs not counted) |
| Platform | Apple Silicon (M-series), primary |
| Scripts | dev/benchmarks/bench_int_quant.py (per-model) |
| Aggregation | dev/benchmarks/aggregate_int_quant.py (combined report) |
| Shell orchestration | dev/scripts/run_all_int_quant.sh |
| Results | dev/results/int_quant/*.json |
| Output doc | docs/benchmark_int_quant.md |
Model Selection
10 Target Models
| # | Model | HF Repo | Params | BF16 Size |
|---|---|---|---|---|
| 1 | Qwen2.5-1.5B-Instruct | Qwen/Qwen2.5-1.5B-Instruct | 1.5B | ~3.1 GB |
| 2 | Llama-3.2-3B-Instruct | meta-llama/Llama-3.2-3B-Instruct | 3.2B | ~6.4 GB |
| 3 | Gemma-3-4B-IT | google/gemma-3-4b-it | 4B | ~8.6 GB |
| 4 | Qwen2.5-7B-Instruct | Qwen/Qwen2.5-7B-Instruct | 7.6B | ~14.0 GB |
| 5 | Mistral-7B-Instruct-v0.3 | mistralai/Mistral-7B-Instruct-v0.3 | 7.2B | ~14.5 GB |
| 6 | DeepSeek-R1-Distill-Qwen-7B | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 7.6B | ~15.2 GB |
| 7 | Llama-3.1-8B-Instruct | meta-llama/Meta-Llama-3.1-8B-Instruct | 8.0B | ~16.0 GB |
| 8 | Qwen3-8B | Qwen/Qwen3-8B | 8.2B | ~16.4 GB |
| 9 | Phi-4 | microsoft/phi-4 | 14.7B | ~29.4 GB |
| 10 | Qwen2.5-14B-Instruct | Qwen/Qwen2.5-14B-Instruct | 14.8B | ~29.6 GB |
Previously benchmarked (existing data): Models 1, 4, 8, 10
New models to benchmark: Models 2, 3, 5, 6, 7, 9
Selection Rationale
- Architecture variety: Qwen2.5 and Qwen3 (GQA), Llama-3.x (GQA), Mistral (GQA), Gemma-3 (GQA with interleaved local sliding-window and global attention layers), Phi-4 (dense decoder-only), DeepSeek (distilled reasoning, Qwen2.5-based)
- Size spread: 1.5B → 14.8B — covers mobile-class to workstation-class
- Popularity: Top-tier HuggingFace download counts in each size class
- License mix: Apache 2.0, MIT, Llama Community License, Gemma Terms of Use
Compression Methods
T0 — BF16 Reference
No compression. Models loaded directly with mlx_lm or transformers in bfloat16.
Used as the baseline for all quality delta calculations.
INT4 (Primary)
- CLI: `squish-convert --int4 --super-weight --int4-group-size 32`
- Format: 4-bit nibble packed, group-quantised with super-weight correction
- Expected bpw: ~4.5 bpw (4-bit weights + scale overhead)
- Expected size reduction: ~72% vs BF16 (4.5 / 16 ≈ 28% of the original size)
- Quality signal: Primary target — should retain >95% of BF16 accuracy
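
The bits-per-weight and size-reduction estimates can be reproduced with a short calculation. The sketch below is illustrative only: it assumes one 16-bit scale per group and no zero-point or other per-group metadata, which may not match the actual on-disk format. The function names are hypothetical.

```python
def bits_per_weight(weight_bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Effective bits per weight for group-wise quantization.

    Assumption: one scale of `scale_bits` per group, no zero-point
    or other metadata. The real format may store more.
    """
    return weight_bits + scale_bits / group_size


def size_reduction(bpw: float, reference_bits: int = 16) -> float:
    """Fractional size reduction relative to a BF16 (16-bit) baseline."""
    return 1.0 - bpw / reference_bits


int4_bpw = bits_per_weight(weight_bits=4, group_size=32)
print(f"INT4 bpw: {int4_bpw:.2f}")                            # 4.50
print(f"Reduction vs BF16: {size_reduction(int4_bpw):.1%}")   # 71.9%
```

The same two functions apply to the other bit levels once their group size and overhead are known.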
INT3
- API: `MiLoQuantizer` Python class (`squish.quant.milo_quant`)
- Config: `MiLoConfig(group_size=128, max_rank=16)`
- Iterates: safetensors shards; applies per-layer mixed INT3
- Expected bpw: ~3.5 bpw
- Expected size reduction: ~78% vs BF16
- Quality signal: Secondary — expect mild PPL degradation vs INT4
Note: No --int3 CLI flag exists in convert.py. Python API required.
INT2
- CLI: `squish-convert --aqlm --aqlm-n-codebooks 2 --aqlm-codebook-size 16 --aqlm-group-size 8`
- Format: Additive Quantization of Language Models (AQLM)
- Expected bpw: ~2.0–2.5 bpw
- Expected size reduction: ~86% vs BF16
- Quality signal: Research/reference only — significant quality loss expected especially below 3B params
Test Definitions
T1 — Throughput (tok/s)
What it measures: Generation speed in output tokens per second.
Method:
1. Load compressed model via mlx_lm.load()
2. Run 5 standard prompts × --runs iterations (default 3)
3. Measure output tokens / wall-clock seconds
4. Report: mean, stddev, 95th-percentile
Prompts used:
1. "Explain the theory of relativity in simple terms."
2. "Write a Python function to compute Fibonacci numbers."
3. "What are the pros and cons of electric vehicles?"
4. "Summarise the French Revolution in 3 sentences."
5. "What is the difference between supervised and unsupervised learning?"
Max new tokens per call: 256
Temperature: 0.0 (deterministic)
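
The measurement and summary steps above can be sketched as follows. This is not the code in `bench_int_quant.py`: `measure_tps`, `percentile`, and `summarize_tps` are hypothetical helpers, `generate` stands in for whatever wraps the model call, and the linear-interpolation percentile is an assumed convention (the harness may define its 95th-percentile figure differently).

```python
import statistics
import time
from typing import Callable, Sequence


def measure_tps(generate: Callable[[str], int], prompt: str) -> float:
    """Time one generation call; `generate` returns the number of output tokens."""
    start = time.perf_counter()
    n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


def percentile(samples: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile, with q in [0, 100]."""
    ordered = sorted(samples)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize_tps(samples: Sequence[float]) -> dict:
    """Mean, sample standard deviation, and 95th percentile of tok/s samples."""
    return {
        "mean_tps": statistics.mean(samples),
        "std_tps": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "p95_tps": percentile(samples, 95),
    }
```

With 5 prompts and the default of 3 runs, `summarize_tps` would receive 15 samples per model and bit level.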
T2 — Perplexity (PPL)
What it measures: Language modelling quality; lower = better.
Method:
1. Load first 512 tokens of WikiText-2 test split
2. Compute token-level negative log-likelihood with mlx.core
3. PPL = exp(mean NLL)
Threshold for acceptable quality:
- INT4 delta vs BF16: < 1.0 PPL points
- INT3 delta vs BF16: < 2.5 PPL points
- INT2 delta vs BF16: < 8.0 PPL points (informational)
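
As a minimal sketch of the formula and thresholds, the snippet below computes PPL from per-token log-probabilities in plain Python. The benchmark itself computes the NLL with `mlx.core`; here the log-probabilities are taken as given, and the helper names are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """PPL = exp(mean NLL), where NLL_i = -log p(token_i | preceding tokens)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Maximum acceptable PPL increase over BF16 per bit level (from the thresholds listed here).
PPL_DELTA_LIMITS = {4: 1.0, 3: 2.5, 2: 8.0}


def ppl_within_threshold(ppl: float, bf16_ppl: float, bits: int) -> bool:
    """True if the quantized model's PPL delta is under the limit for `bits`."""
    return (ppl - bf16_ppl) < PPL_DELTA_LIMITS[bits]
```

A model that assigns probability 0.25 to every token has a perplexity of 4, which makes a convenient sanity check for the implementation.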
T3 — Accuracy (ARC-Easy + HellaSwag)
What it measures: Zero-shot multiple-choice accuracy on standard NLP benchmarks.
Method:
1. Use lm_eval harness with HFLM wrapper
2. Tasks: arc_easy + hellaswag
3. Sample limit: 200 examples per task
4. Report: accuracy on each task + combined average
Note: T3 is the most time-intensive test (~30 min per model on M2 Max).
Run separately using --eval-acc flag. Omit from initial throughput-only runs.
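
One way the T3 method could be driven from Python is sketched below. It assumes the v0.4-style `lm_eval.simple_evaluate` entry point and that accuracy is reported under the `"acc,none"` key; both depend on the installed lm-evaluation-harness version. `run_accuracy_eval` and `combined_accuracy` are hypothetical names, not functions from the benchmark scripts.

```python
from typing import Mapping


def combined_accuracy(task_acc: Mapping[str, float]) -> float:
    """Unweighted mean of per-task accuracies."""
    return sum(task_acc.values()) / len(task_acc)


def run_accuracy_eval(model_path: str, limit: int = 200) -> dict:
    """Zero-shot ARC-Easy + HellaSwag on a Hugging Face format checkpoint."""
    import lm_eval  # imported lazily: heavy dependency, only needed for T3

    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_path}",
        tasks=["arc_easy", "hellaswag"],
        num_fewshot=0,
        limit=limit,
    )
    # Assumption: metrics are keyed as "acc,none" (lm-eval 0.4.x result layout).
    acc = {task: out["results"][task]["acc,none"] for task in ("arc_easy", "hellaswag")}
    return {**acc, "combined": combined_accuracy(acc)}
```

Keeping the averaging in its own function lets it be checked without loading a model.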
Disk Space Requirements
Per-Model Storage Estimate
| Model | BF16 | INT4 | INT3 | INT2 | Total (all) |
|---|---|---|---|---|---|
| Qwen2.5-1.5B | 3.1 GB | 0.9 GB | 0.7 GB | 0.4 GB | 5.1 GB |
| Llama-3.2-3B | 6.4 GB | 1.8 GB | 1.4 GB | 0.9 GB | 10.5 GB |
| Gemma-3-4B | 8.6 GB | 2.4 GB | 1.9 GB | 1.2 GB | 14.1 GB |
| Qwen2.5-7B | 14.0 GB | 3.9 GB | 3.1 GB | 2.0 GB | 23.0 GB |
| Mistral-7B-v0.3 | 14.5 GB | 4.0 GB | 3.2 GB | 2.0 GB | 23.7 GB |
| DeepSeek-R1-Distill-7B | 15.2 GB | 4.2 GB | 3.3 GB | 2.1 GB | 24.8 GB |
| Llama-3.1-8B | 16.0 GB | 4.4 GB | 3.5 GB | 2.2 GB | 26.1 GB |
| Qwen3-8B | 16.4 GB | 4.6 GB | 3.6 GB | 2.3 GB | 26.9 GB |
| Phi-4 | 29.4 GB | 8.2 GB | 6.5 GB | 4.1 GB | 48.2 GB |
| Qwen2.5-14B | 29.6 GB | 8.2 GB | 6.5 GB | 4.1 GB | 48.4 GB |
| Total | 153.2 GB | 42.6 GB | 33.7 GB | 21.3 GB | 250.8 GB |
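- Plan A — Full benchmark (all bits, keep all): ~251 GB free required
- Plan B — Rolling benchmark (delete BF16 after compress): ~100 GB free
- Plan C — INT4 only, BF16 kept: ~196 GB free
- Plan D — INT4 only, delete BF16 after compress: ~43 GB free (minimum)

The Plan B and Plan D figures cover the weights left on disk at the end of the run. While a model is being compressed its BF16 checkpoint is also on disk, so allow up to ~30 GB of extra headroom for the 14B-class models.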
Recommended: Plan B for initial run. Use --keep-compressed flag to retain quantized weights.
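
The plan figures follow from the per-model table. The sketch below recomputes them from the table's estimates; the `_peak` entries are an added upper bound that assumes the largest BF16 checkpoint is on disk alongside all compressed outputs, which is an assumption about how the rolling run behaves rather than something the scripts guarantee.

```python
# Estimated sizes in GB from the per-model storage table: (BF16, INT4, INT3, INT2).
SIZES_GB = {
    "Qwen2.5-1.5B": (3.1, 0.9, 0.7, 0.4),
    "Llama-3.2-3B": (6.4, 1.8, 1.4, 0.9),
    "Gemma-3-4B": (8.6, 2.4, 1.9, 1.2),
    "Qwen2.5-7B": (14.0, 3.9, 3.1, 2.0),
    "Mistral-7B-v0.3": (14.5, 4.0, 3.2, 2.0),
    "DeepSeek-R1-Distill-7B": (15.2, 4.2, 3.3, 2.1),
    "Llama-3.1-8B": (16.0, 4.4, 3.5, 2.2),
    "Qwen3-8B": (16.4, 4.6, 3.6, 2.3),
    "Phi-4": (29.4, 8.2, 6.5, 4.1),
    "Qwen2.5-14B": (29.6, 8.2, 6.5, 4.1),
}


def plan_footprints(sizes: dict) -> dict:
    """Disk footprint in GB for each storage plan."""
    bf16 = sum(s[0] for s in sizes.values())
    int4 = sum(s[1] for s in sizes.values())
    all_quant = sum(sum(s[1:]) for s in sizes.values())
    largest_bf16 = max(s[0] for s in sizes.values())
    return {
        "A": bf16 + all_quant,             # keep everything
        "B": all_quant,                    # final footprint, BF16 deleted
        "B_peak": all_quant + largest_bf16,
        "C": bf16 + int4,                  # INT4 only, BF16 kept
        "D": int4,                         # final footprint, INT4 only
        "D_peak": int4 + largest_bf16,
    }


for plan, gb in plan_footprints(SIZES_GB).items():
    print(f"{plan}: {gb:.1f} GB")
```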
Execution Plan
Stage 1 — Validate pipeline (1 model, INT4 only)
```bash
./dev/scripts/run_all_int_quant.sh \
  --models "Qwen2.5-1.5B" \
  --bits 4 \
  --eval-tps --eval-ppl \
  --runs 1
```
Expected runtime: ~15 min (download 3.1 GB + compress + 2 tests)
Stage 2 — INT4 throughput sweep (all 10 models, tok/s only)
Expected runtime: ~4 hours
Stage 3 — INT4 full quality (all 10, all 3 tests)
Expected runtime: ~8 hours (PPL + accuracy are slow)
Stage 4 — INT3 sweep
Expected runtime: ~5 hours (MiLo compression is CPU-bound, slower)
Stage 5 — INT2 sweep
Expected runtime: ~5 hours (AQLM compression is very slow for large models)
Stage 6 — Generate combined report
```bash
python3 dev/benchmarks/aggregate_int_quant.py \
  --results-dir dev/results/int_quant \
  --output docs/benchmark_int_quant.md \
  --json-output dev/results/int_quant/combined.json
```
Output Format
Per-run JSON (`dev/results/int_quant/<model>_<N>bit.json`)
```json
{
  "model_id": "Qwen2.5-7B-Instruct",
  "bits": 4,
  "timestamp": "2025-03-18T14:22:00",
  "compression": {
    "original_gb": 14.0,
    "compressed_gb": 3.9,
    "ratio": 0.279,
    "time_sec": 312.4
  },
  "throughput": {
    "mean_tps": 47.3,
    "std_tps": 1.2,
    "p95_tps": 45.1,
    "runs": 3
  },
  "perplexity": {
    "ppl": 6.84,
    "bf16_ppl": 6.21,
    "delta": 0.63
  },
  "accuracy": {
    "arc_easy": 0.712,
    "hellaswag": 0.613,
    "combined": 0.663,
    "bf16_combined": 0.701,
    "delta": -0.038
  }
}
```
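
Several fields in the per-run record appear to be derived from others (`ratio` from the two sizes, each `delta` from a pair of scores). The sketch below checks that reading against the example values. The relationships are inferred from the example rather than from a documented schema, and `check_record` is a hypothetical helper.

```python
import json

# Abbreviated copy of the example record, keeping only the fields being checked.
EXAMPLE = json.loads("""
{
  "compression": {"original_gb": 14.0, "compressed_gb": 3.9, "ratio": 0.279},
  "perplexity": {"ppl": 6.84, "bf16_ppl": 6.21, "delta": 0.63},
  "accuracy": {"arc_easy": 0.712, "hellaswag": 0.613, "combined": 0.663,
               "bf16_combined": 0.701, "delta": -0.038}
}
""")


def check_record(rec: dict, tol: float = 1e-3) -> list:
    """Return the names of derived fields that disagree with their inputs."""
    c, p, a = rec["compression"], rec["perplexity"], rec["accuracy"]
    expected = {
        "compression.ratio": (c["ratio"], c["compressed_gb"] / c["original_gb"]),
        "perplexity.delta": (p["delta"], p["ppl"] - p["bf16_ppl"]),
        "accuracy.combined": (a["combined"], (a["arc_easy"] + a["hellaswag"]) / 2),
        "accuracy.delta": (a["delta"], a["combined"] - a["bf16_combined"]),
    }
    return [name for name, (stored, derived) in expected.items()
            if abs(stored - derived) > tol]


print(check_record(EXAMPLE))  # [] when every derived field is consistent
```

A check like this is cheap to run over every result file before aggregation and catches records written with stale baseline values.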
Combined markdown table (`docs/benchmark_int_quant.md`)
Header rows of the tables in the combined report:
```markdown
| Model | BF16 | INT4 | INT4↓% | INT3 | INT3↓% | INT2 | INT2↓% |
| Model | BF16 GB | INT4 GB | INT3 GB | INT2 GB |
| Model | BF16 tok/s | INT4 tok/s | INT4 speedup | INT3 tok/s | INT2 tok/s |
| Model | BF16 PPL | INT4 PPL | INT3 PPL | INT2 PPL |
| Model | BF16 ARC | INT4 ARC | INT3 ARC | INT2 ARC |
```
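
To illustrate how per-run records might be pivoted into one of these tables, here is a minimal sketch for the PPL table. It is not the logic of `aggregate_int_quant.py`: `ppl_table` is a hypothetical function, and it assumes each record carries `model_id`, `bits`, and a `perplexity` object as in the per-run JSON.

```python
def ppl_table(records: list) -> str:
    """Render a 'Model | BF16 PPL | INT4 PPL | INT3 PPL | INT2 PPL' markdown table."""
    by_model = {}
    for rec in records:
        row = by_model.setdefault(rec["model_id"], {})
        row[rec["bits"]] = rec["perplexity"]["ppl"]
        row["bf16"] = rec["perplexity"]["bf16_ppl"]

    def fmt(value):
        # Missing bit levels are shown as "n/a" rather than dropped.
        return "n/a" if value is None else f"{value:.2f}"

    lines = [
        "| Model | BF16 PPL | INT4 PPL | INT3 PPL | INT2 PPL |",
        "|---|---|---|---|---|",
    ]
    for model in sorted(by_model):
        row = by_model[model]
        cells = [fmt(row.get(key)) for key in ("bf16", 4, 3, 2)]
        lines.append(f"| {model} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

The other tables would follow the same pattern with a different field pulled from each record.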
Success Criteria
| Metric | INT4 target | INT3 target | INT2 target |
|---|---|---|---|
| Size reduction | ≥ 70% | ≥ 75% | ≥ 83% |
| PPL delta vs BF16 | < 1.0 | < 2.5 | < 8.0 |
| ARC accuracy delta | > -3% | > -6% | informational |
| Throughput vs BF16 | ≥ 1.5× faster | ≥ 1.3× faster | informational |
| Models meeting ALL criteria | ≥ 8 / 10 | ≥ 7 / 10 | — |
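
The criteria can be applied mechanically to each result. The sketch below encodes the table as a lookup and reads the ARC row as a maximum allowed drop, so a signed delta of -0.02 passes the INT4 target and -0.04 does not. `TARGETS` and `meets_targets` are hypothetical names, and "informational" cells are treated as always passing.

```python
# Targets per bit level, taken from the success-criteria table.
# None marks a criterion that is informational only.
TARGETS = {
    4: {"min_size_reduction": 0.70, "max_ppl_delta": 1.0, "max_arc_drop": 0.03, "min_speedup": 1.5},
    3: {"min_size_reduction": 0.75, "max_ppl_delta": 2.5, "max_arc_drop": 0.06, "min_speedup": 1.3},
    2: {"min_size_reduction": 0.83, "max_ppl_delta": 8.0, "max_arc_drop": None, "min_speedup": None},
}


def meets_targets(bits: int, size_reduction: float, ppl_delta: float,
                  arc_delta: float, speedup: float) -> dict:
    """Per-criterion pass/fail; `arc_delta` is signed (quantized minus BF16)."""
    t = TARGETS[bits]
    return {
        "size": size_reduction >= t["min_size_reduction"],
        "ppl": ppl_delta < t["max_ppl_delta"],
        "arc": t["max_arc_drop"] is None or arc_delta > -t["max_arc_drop"],
        "throughput": t["min_speedup"] is None or speedup >= t["min_speedup"],
    }


result = meets_targets(bits=4, size_reduction=0.721, ppl_delta=0.63,
                       arc_delta=-0.02, speedup=1.6)
print(all(result.values()))  # True
```

Counting the models for which `all(result.values())` holds gives the figure for the last row of the table.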
Infrastructure
All benchmark scripts are in `dev/benchmarks/`:
| Script | Purpose |
|---|---|
| `bench_int_quant.py` | Per-model benchmark runner |
| `aggregate_int_quant.py` | Combines JSONs → markdown tables |

Shell script in `dev/scripts/`:
| Script | Purpose |
|---|---|
| `run_all_int_quant.sh` | Download + run all 10 models; supports `--bits`, `--eval-tps`, `--eval-ppl`, `--eval-acc` |
Results directory: dev/results/int_quant/
Published doc: docs/benchmark_int_quant.md
Last updated: 2025 — initial INT quantization benchmark planning.