Skip to content

INT Quantization Benchmark Plan

Benchmark plan for evaluating Squish's INT4, INT3, and INT2 quantization across 10 production LLMs, measuring throughput, perplexity, accuracy, and storage requirements at each precision level.


Overview

Item Value
Bit levels BF16 (reference), INT4, INT3, INT2
Models 10 (4 existing benchmarked + 6 new)
Tests per model per bit level 3 (T1: throughput, T2: perplexity, T3: accuracy)
Total test runs 10 × 3 bit levels × 3 tests = 90 runs
Platform Apple Silicon (M-series), primary
Scripts dev/benchmarks/bench_int_quant.py (per-model)
Aggregation dev/benchmarks/aggregate_int_quant.py (combined report)
Shell orchestration dev/scripts/run_all_int_quant.sh
Results dev/results/int_quant/*.json
Output doc docs/benchmark_int_quant.md

Model Selection

10 Target Models

# Model HF Repo Params BF16 Size
1 Qwen2.5-1.5B-Instruct Qwen/Qwen2.5-1.5B-Instruct 1.5B ~3.1 GB
2 Llama-3.2-3B-Instruct meta-llama/Llama-3.2-3B-Instruct 3.2B ~6.4 GB
3 Gemma-3-4B-IT google/gemma-3-4b-it 4B ~8.6 GB
4 Qwen2.5-7B-Instruct Qwen/Qwen2.5-7B-Instruct 7.6B ~14.0 GB
5 Mistral-7B-Instruct-v0.3 mistralai/Mistral-7B-Instruct-v0.3 7.2B ~14.5 GB
6 DeepSeek-R1-Distill-Qwen-7B deepseek-ai/DeepSeek-R1-Distill-Qwen-7B 7.6B ~15.2 GB
7 Llama-3.1-8B-Instruct meta-llama/Meta-Llama-3.1-8B-Instruct 8.0B ~16.0 GB
8 Qwen3-8B Qwen/Qwen3-8B 8.2B ~16.4 GB
9 Phi-4 microsoft/phi-4 14.7B ~29.4 GB
10 Qwen2.5-14B-Instruct Qwen/Qwen2.5-14B-Instruct 14.8B ~29.6 GB

Previously benchmarked (existing data): Models 1, 4, 8, 10
New models to benchmark: Models 2, 3, 5, 6, 7, 9

Selection Rationale

  • Architecture variety: Qwen2.5 (MLA/GQA), Llama-3.x (standard GQA), Mistral (sliding window), Gemma-3 (multi-query), Phi-4 (mixture of experts), DeepSeek (distilled reasoning)
  • Size spread: 1.5B → 14.8B — covers mobile-class to workstation-class
  • Popularity: Top-tier HuggingFace download counts in each size class
  • License coverage mix: Apache 2.0, MIT, Llama Community License, Gemma TOS

Compression Methods

T0 — BF16 Reference

No compression. Models loaded directly with mlx_lm or transformers in bfloat16. Used as the baseline for all quality delta calculations.

INT4 (Primary)

  • CLI: squish-convert --int4 --super-weight --int4-group-size 32
  • Format: 4-bit nibble packed, group-quantised with super-weight correction
  • Expected bpw: ~4.5 bpw (4-bit weights + scale overhead)
  • Expected size reduction: ~73% vs BF16
  • Quality signal: Primary target — should retain >95% of BF16 accuracy

INT3

  • API: MiLoQuantizer Python class (squish.quant.milo_quant)
  • Config: MiLoConfig(group_size=128, max_rank=16)
  • Iterates: safetensors shards; applies per-layer mixed INT3
  • Expected bpw: ~3.5 bpw
  • Expected size reduction: ~78% vs BF16
  • Quality signal: Secondary — expect mild PPL degradation vs INT4

Note: No --int3 CLI flag exists in convert.py. Python API required.

INT2

  • CLI: squish-convert --aqlm --aqlm-n-codebooks 2 --aqlm-codebook-size 16 --aqlm-group-size 8
  • Format: Additive Quantisation of Language Model Weights (AQLM)
  • Expected bpw: ~2.0–2.5 bpw
  • Expected size reduction: ~86% vs BF16
  • Quality signal: Research/reference only — significant quality loss expected especially below 3B params

Test Definitions

T1 — Throughput (tok/s)

What it measures: Generation speed in output tokens per second.

Method: 1. Load compressed model via mlx_lm.load() 2. Run 5 standard prompts × --runs iterations (default 3) 3. Measure output tokens / wall-clock seconds 4. Report: mean, stddev, 95th-percentile

Prompts used:

1. "Explain the theory of relativity in simple terms."
2. "Write a Python function to compute Fibonacci numbers."
3. "What are the pros and cons of electric vehicles?"
4. "Summarise the French Revolution in 3 sentences."
5. "What is the difference between supervised and unsupervised learning?"

Max new tokens per call: 256
Temperature: 0.0 (deterministic)

T2 — Perplexity (PPL)

What it measures: Language modelling quality; lower = better.

Method: 1. Load first 512 tokens of WikiText-2 test split 2. Compute token-level negative log-likelihood with mlx.core 3. PPL = exp(mean NLL)

Threshold for acceptable quality: - INT4 delta vs BF16: < 1.0 PPL points - INT3 delta vs BF16: < 2.5 PPL points - INT2 delta vs BF16: < 8.0 PPL points (informational)

T3 — Accuracy (Arc-Easy + HellaSwag)

What it measures: Zero-shot multiple-choice accuracy on standard NLP benchmarks.

Method: 1. Use lm_eval harness with HFLM wrapper 2. Tasks: arc_easy + hellaswag 3. Sample limit: 200 examples per task 4. Report: accuracy on each task + combined average

Note: T3 is the most time-intensive test (~30 min per model on M2 Max). Run separately using --eval-acc flag. Omit from initial throughput-only runs.


Disk Space Requirements

Per-Model Storage Estimate

Model BF16 INT4 INT3 INT2 Total (all)
Qwen2.5-1.5B 3.1 GB 0.9 GB 0.7 GB 0.4 GB 5.1 GB
Llama-3.2-3B 6.4 GB 1.8 GB 1.4 GB 0.9 GB 10.5 GB
Gemma-3-4B 8.6 GB 2.4 GB 1.9 GB 1.2 GB 14.1 GB
Qwen2.5-7B 14.0 GB 3.9 GB 3.1 GB 2.0 GB 23.0 GB
Mistral-7B-v0.3 14.5 GB 4.0 GB 3.2 GB 2.0 GB 23.7 GB
DeepSeek-R1-Distill-7B 15.2 GB 4.2 GB 3.3 GB 2.1 GB 24.8 GB
Llama-3.1-8B 16.0 GB 4.4 GB 3.5 GB 2.2 GB 26.1 GB
Qwen3-8B 16.4 GB 4.6 GB 3.6 GB 2.3 GB 26.9 GB
Phi-4 29.4 GB 8.2 GB 6.5 GB 4.1 GB 48.2 GB
Qwen2.5-14B 29.6 GB 8.2 GB 6.5 GB 4.1 GB 48.4 GB
Total 153 GB 42.6 GB 33.7 GB 21.3 GB 250.6 GB

Plan A — Full benchmark (all bits, keep all): ~251 GB free required
Plan B — Rolling benchmark (delete BF16 after compress): ~100 GB free
Plan C — INT4 only, BF16 kept: ~196 GB free
Plan D — INT4 only, delete BF16 after compress: ~43 GB free (minimum)

Recommended: Plan B for initial run. Use --keep-compressed flag to retain quantized weights.


Execution Plan

Stage 1 — Validate pipeline (1 model, INT4 only)

./dev/scripts/run_all_int_quant.sh \
    --models "Qwen2.5-1.5B" \
    --bits 4 \
    --eval-tps --eval-ppl \
    --runs 1

Expected runtime: ~15 min (download 3.1 GB + compress + 2 tests)

Stage 2 — INT4 throughput sweep (all 10 models, tok/s only)

./dev/scripts/run_all_int_quant.sh \
    --bits 4 \
    --eval-tps \
    --runs 3

Expected runtime: ~4 hours

Stage 3 — INT4 full quality (all 10, all 3 tests)

./dev/scripts/run_all_int_quant.sh \
    --bits 4 \
    --eval-tps --eval-ppl --eval-acc \
    --runs 3

Expected runtime: ~8 hours (PPL + accuracy are slow)

Stage 4 — INT3 sweep

./dev/scripts/run_all_int_quant.sh \
    --bits 3 \
    --eval-tps --eval-ppl \
    --runs 3

Expected runtime: ~5 hours (MiLo compression is CPU-bound, slower)

Stage 5 — INT2 sweep

./dev/scripts/run_all_int_quant.sh \
    --bits 2 \
    --eval-tps --eval-ppl \
    --runs 2

Expected runtime: ~5 hours (AQLM compression is very slow for large models)

Stage 6 — Generate combined report

python3 dev/benchmarks/aggregate_int_quant.py \
    --results-dir dev/results/int_quant \
    --output docs/benchmark_int_quant.md \
    --json-output dev/results/int_quant/combined.json

Output Format

Per-run JSON (dev/results/int_quant/<model>_<N>bit.json)

{
  "model_id": "Qwen2.5-7B-Instruct",
  "bits": 4,
  "timestamp": "2025-03-18T14:22:00",
  "compression": {
    "original_gb": 14.0,
    "compressed_gb": 3.9,
    "ratio": 0.279,
    "time_sec": 312.4
  },
  "throughput": {
    "mean_tps": 47.3,
    "std_tps": 1.2,
    "p95_tps": 45.1,
    "runs": 3
  },
  "perplexity": {
    "ppl": 6.84,
    "bf16_ppl": 6.21,
    "delta": 0.63
  },
  "accuracy": {
    "arc_easy": 0.712,
    "hellaswag": 0.613,
    "combined": 0.663,
    "bf16_combined": 0.701,
    "delta": -0.038
  }
}

Combined markdown table (docs/benchmark_int_quant.md)

| Model | BF16 | INT4 | INT4↓% | INT3 | INT3↓% | INT2 | INT2↓% |
| Model | BF16 GB | INT4 GB | INT3 GB | INT2 GB |
| Model | BF16 tok/s | INT4 tok/s | INT4 speedup | INT3 tok/s | INT2 tok/s |
| Model | BF16 PPL | INT4 PPL | INT3 PPL | INT2 PPL |
| Model | BF16 ARC | INT4 ARC | INT3 ARC | INT2 ARC |

Success Criteria

Metric INT4 target INT3 target INT2 target
Size reduction ≥ 70% ≥ 75% ≥ 83%
PPL delta vs BF16 < 1.0 < 2.5 < 8.0
ARC accuracy delta < -3% < -6% informational
Throughput vs BF16 ≥ 1.5× faster ≥ 1.3× faster informational
Models meeting ALL criteria ≥ 8 / 10 ≥ 7 / 10

Infrastructure

All benchmark scripts are in /dev/benchmarks/:

Script Purpose
bench_int_quant.py Per-model benchmark runner
aggregate_int_quant.py Combines JSONs → markdown tables

Shell script in /dev/scripts/:

Script Purpose
run_all_int_quant.sh Download + run all 10 models, supports --bits, --eval-tps/ppl/acc

Results directory: dev/results/int_quant/
Published doc: docs/benchmark_int_quant.md


Last updated: 2025 — initial INT quantization benchmark planning.