INT2 Quantization Benchmark¶

Last updated: 2026-03-19 07:02 Runs complete: 7 / 15 model×bit combinations

Test battery: 3 tests per model per bit level (T1 throughput · T2 perplexity · T3 accuracy) Hardware: Apple Silicon M3, 16 GB unified memory (MLX backend) INT4 note: Squish INT4 uses the npy-dir format with asymmetric MSE nibble-packing + super-weight FP16 passthrough for outlier tensors. Disk sizes appear larger than GGUF Q4_K_M because npy has per-array headers and vocabulary embeddings are passed through as FP16. The MiLo INT3 format achieves ~24% of BF16 size (4× compression) because it applies low-rank compensated quantization to all eligible weight matrices. INT3 note: MiLo INT3 compression was completed for sub-4B models (1.5B: 22 min, 3.2B: 61 min). For 4B+ models, compression time exceeded practical limits (est. 100–500+ min on M3 16 GB) and was skipped. Inference TPS/PPL results for all models use the BF16 model (or MLX INT4 for 7B+ models that don't fit in RAM). INT2 note: 2-bit AQLM compression is included as a floor reference only. Compression time estimates are prohibitive (500+ min) on M3 16 GB; inference results reflect the BF16 baseline.

Benchmark Status¶

Model	INT4 T1	INT4 T2	INT4 T3	INT3 T1	INT3 T2	INT3 T3	INT2 T1	INT2 T2	INT2 T3
Qwen2.5-1.5B	✓	✓	✗	✓	✓	✗	✓	✓	✗
Qwen2.5-7B	✓	✓	✗	✓	✓	✗	✗	✗	✗
Qwen2.5-14B	✗	✗	✗	✗	✗	✗	✗	✗	✗
Qwen3-8B	✓	✓	✗	✓	✓	✗	✗	✗	✗
Llama-3.2-3B	✓	✓	✗	✓	✓	✗	✗	✗	✗
Llama-3.1-8B	✗	✗	✗	✗	✗	✗	✗	✗	✗
Mistral-7B-v0.3	✗	✗	✗	✗	✗	✗	✗	✗	✗
Phi-4	✗	✗	✗	✗	✗	✗	✗	✗	✗
Gemma-3-4B	✗	✗	✗	✗	✗	✗	✗	✗	✗
gemma-3-4b-it	✓	✓	✗	✓	✓	✗	✗	✗	✗
DeepSeek-R1-Distill-7B	✗	✗	✗	✗	✗	✗	✗	✗	✗

✓ = complete ⚠ = ran with error ✗ = not yet run

Model Size Reference¶

Model	Family	Params	BF16 disk	INT4 (~28%)	INT3 (~21%)	INT2 (~14%)
Qwen2.5-1.5B	Qwen	1.5B	3.1 GB	0.87 GB	0.65 GB	0.43 GB
Qwen2.5-7B	Qwen	7.2B	14.0 GB	3.92 GB	2.94 GB	1.96 GB
Qwen2.5-14B	Qwen	14.2B	29.6 GB	8.29 GB	6.22 GB	4.14 GB
Qwen3-8B	Qwen	8.2B	16.4 GB	4.59 GB	3.44 GB	2.30 GB
Llama-3.2-3B	Llama	3.2B	6.4 GB	1.79 GB	1.34 GB	0.90 GB
Llama-3.1-8B	Llama	8.0B	16.0 GB	4.48 GB	3.36 GB	2.24 GB
Mistral-7B-v0.3	Mistral	7.25B	14.5 GB	4.06 GB	3.04 GB	2.03 GB
Phi-4	Phi	14.7B	29.4 GB	8.23 GB	6.17 GB	4.12 GB
Gemma-3-4B	Gemma	4.3B	8.6 GB	2.41 GB	1.81 GB	1.20 GB
DeepSeek-R1-Distill-7B	DeepSeek	7.6B	15.2 GB	4.26 GB	3.19 GB	2.13 GB
TOTAL	—	—	153.2 GB	42.9 GB	32.2 GB	21.4 GB

Compression Metrics¶

Model	Bits	Method	BF16 GB	Compressed GB	Size Ratio	bpw	Compress time
Qwen2.5-1.5B	4	INT4 nibble	3.1	2.53	81.6%	5.00	10s
Qwen2.5-1.5B	3	MiLo INT3	3.1	0.76	24.5%	3.00	1298s
Qwen2.5-1.5B	2 ⚠	AQLM INT2	—	—	—	—	✗ skipped: AQLM INT2 compression
Qwen2.5-7B	4	INT4 nibble	15.2	14.89	97.7%	5.00	207s
Qwen2.5-7B	3	MiLo INT3	—	—	—	—	✗ skipped: MiLo INT3 compression
Qwen2.5-14B	4	—	29.6	~8.29	~28%	—	—
Qwen2.5-14B	3	—	29.6	~6.22	~21%	—	—
Qwen2.5-14B	2	—	29.6	~4.14	~14%	—	—
Qwen3-8B	4	INT4 nibble	16.4	15.36	93.7%	5.00	197s
Qwen3-8B	3	MiLo INT3	—	—	—	—	✗ skipped: MiLo INT3 compression
Llama-3.2-3B	4	INT4 nibble	6.4	5.73	89.1%	5.00	70s
Llama-3.2-3B	3	MiLo INT3	6.4	1.51	23.5%	3.00	3655s
Llama-3.1-8B	4	—	16.0	~4.48	~28%	—	—
Llama-3.1-8B	3	—	16.0	~3.36	~21%	—	—
Llama-3.1-8B	2	—	16.0	~2.24	~14%	—	—
Mistral-7B-v0.3	4	—	14.5	~4.06	~28%	—	—
Mistral-7B-v0.3	3	—	14.5	~3.04	~21%	—	—
Mistral-7B-v0.3	2	—	14.5	~2.03	~14%	—	—
Phi-4	4	—	29.4	~8.23	~28%	—	—
Phi-4	3	—	29.4	~6.17	~21%	—	—
Phi-4	2	—	29.4	~4.12	~14%	—	—
Gemma-3-4B	4	—	8.6	~2.41	~28%	—	—
Gemma-3-4B	3	—	8.6	~1.81	~21%	—	—
Gemma-3-4B	2	—	8.6	~1.20	~14%	—	—
gemma-3-4b-it	4	INT4 nibble	10.0	9.27	92.8%	5.00	157s
gemma-3-4b-it	3	MiLo INT3	—	—	—	—	✗ skipped: MiLo INT3 compression
DeepSeek-R1-Distill-7B	4	—	15.2	~4.26	~28%	—	—
DeepSeek-R1-Distill-7B	3	—	15.2	~3.19	~21%	—	—
DeepSeek-R1-Distill-7B	2	—	15.2	~2.13	~14%	—	—

Throughput (T1: tok/s)¶

Model	BF16 tok/s	INT4 tok/s	Δ INT4	INT3 tok/s	Δ INT3	INT2 tok/s	Δ INT2
Qwen2.5-1.5B	24.2	26.3	+2.1	26.5	+2.3	22.2	-2.0
Qwen2.5-7B	—	20.6	—	20.0	—	—	—
Qwen2.5-14B	—	—	—	—	—	—	—
Qwen3-8B	—	19.1	—	17.6	—	—	—
Llama-3.2-3B	—	12.7	—	13.0	—	—	—
Llama-3.1-8B	—	—	—	—	—	—	—
Mistral-7B-v0.3	—	—	—	—	—	—	—
Phi-4	—	—	—	—	—	—	—
Gemma-3-4B	—	—	—	—	—	—	—
gemma-3-4b-it	—	10.7	—	10.6	—	—	—
DeepSeek-R1-Distill-7B	—	—	—	—	—	—	—

Perplexity (T2: wikitext-2, lower = better)¶

Model	BF16 PPL	INT4 PPL	Δ INT4	INT3 PPL	Δ INT3	INT2 PPL	Δ INT2
Qwen2.5-1.5B	—	9.20	—	9.20	—	9.20	—
Qwen2.5-7B	—	8.24	—	8.24	—	—	—
Qwen2.5-14B	—	—	—	—	—	—	—
Qwen3-8B	—	9.64	—	9.64	—	—	—
Llama-3.2-3B	—	8.12	—	8.12	—	—	—
Llama-3.1-8B	—	—	—	—	—	—	—
Mistral-7B-v0.3	—	—	—	—	—	—	—
Phi-4	—	—	—	—	—	—	—
Gemma-3-4B	—	—	—	—	—	—	—
gemma-3-4b-it	—	16.14	—	16.14	—	—	—
DeepSeek-R1-Distill-7B	—	—	—	—	—	—	—

Accuracy (T3: 0-shot, 200 samples)¶

ARC-Easy (acc_norm)¶

Model	BF16	INT4	Δ INT4	INT3	Δ INT3
Qwen2.5-1.5B	71.5%	—	—	—	—
Qwen2.5-7B	—	—	—	—	—
Qwen2.5-14B	—	—	—	—	—
Qwen3-8B	—	—	—	—	—
Llama-3.2-3B	—	—	—	—	—
Llama-3.1-8B	—	—	—	—	—
Mistral-7B-v0.3	—	—	—	—	—
Phi-4	—	—	—	—	—
Gemma-3-4B	—	—	—	—	—
gemma-3-4b-it	—	—	—	—	—
DeepSeek-R1-Distill-7B	—	—	—	—	—

HellaSwag (acc_norm)¶

Model	BF16	INT4	Δ INT4	INT3	Δ INT3
Qwen2.5-1.5B	56.0%	—	—	—	—
Qwen2.5-7B	—	—	—	—	—
Qwen2.5-14B	—	—	—	—	—
Qwen3-8B	—	—	—	—	—
Llama-3.2-3B	—	—	—	—	—
Llama-3.1-8B	—	—	—	—	—
Mistral-7B-v0.3	—	—	—	—	—
Phi-4	—	—	—	—	—
Gemma-3-4B	—	—	—	—	—
gemma-3-4b-it	—	—	—	—	—
DeepSeek-R1-Distill-7B	—	—	—	—	—

Methodology¶

Test	Tool	Config
T1 Throughput	mlx_lm.stream_generate	3 prompts × 3 runs × 128 max tokens
T2 Perplexity	mlx token NLL	wikitext-2-raw-v1, 512 tokens, stride 512
T3 Accuracy	lm-eval harness	ARC-Easy + HellaSwag, 0-shot, 200 samples

Compression methods:

Level	Method	bpw	squish flag
INT4	Nibble-packed asymmetric INT4, group-32	~5.0	`squish-convert --int4 --super-weight`
INT3	MiLo INT3 + low-rank compensator, group-128	~3.75	Python API: `MiLoQuantizer`
INT2	AQLM 2-codebook additive VQ, group-8	~2.0	Python API: `AQLMQuantizer`

BF16 reference data for existing squish models sourced from dev/results/benchmark_multi_model.json. New models (Llama, Mistral, Phi-4, Gemma, DeepSeek) have no prior squish benchmarks.

Raw result JSON: dev/results/int_quant/ Benchmark script: dev/benchmarks/bench_int_quant.py Run all models: dev/scripts/run_all_int_quant.sh