Squish: INT4 / INT3 / INT2 Quantization Benchmark
Last updated: 2026-03-19 07:02
Runs complete: 7 / 15 model×bit combinations
Test battery: 3 tests per model per bit level (T1 throughput · T2 perplexity · T3 accuracy)
Hardware: Apple Silicon M3, 16 GB unified memory (MLX backend)
INT4 note: Squish INT4 uses the npy-dir format with asymmetric MSE nibble-packing plus super-weight FP16 passthrough for outlier tensors. Disk sizes appear larger than GGUF Q4_K_M because npy stores a header per array and vocabulary embeddings are passed through as FP16. The MiLo INT3 format reaches ~24% of BF16 size (about 4× compression) because it applies low-rank compensated quantization to all eligible weight matrices.
INT3 note: MiLo INT3 compression completed for the sub-4B models (1.5B: 22 min, 3.2B: 61 min). For 4B+ models, estimated compression time exceeded practical limits (100–500+ min on M3 16 GB), so compression was skipped. INT3 inference results (TPS/PPL) for all models were measured on the BF16 model (or on MLX INT4 for 7B+ models that do not fit in RAM), not on MiLo-compressed weights.
INT2 note: 2-bit AQLM compression is included as a floor reference only. Estimated compression time is prohibitive (500+ min) on M3 16 GB, so INT2 inference results reflect the BF16 baseline.
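To make the INT4 note concrete, here is a minimal NumPy sketch of group-wise asymmetric 4-bit quantization with nibble packing. It is an illustration under assumptions, not Squish's implementation: the function names are hypothetical, the range is taken from plain per-group min/max rather than an MSE search, and the FP16 scale/offset layout is assumed.

```python
import numpy as np

def quantize_int4_asymmetric(weights, group_size=32):
    """Map each group of floats to 4-bit codes (0..15) with its own scale and offset."""
    w = np.asarray(weights, dtype=np.float32).reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0
    scale[scale == 0] = 1.0  # constant groups: avoid division by zero
    codes = np.clip(np.rint((w - w_min) / scale), 0, 15).astype(np.uint8)
    # Assumption: per-group metadata is stored as FP16.
    return codes, scale.astype(np.float16), w_min.astype(np.float16)

def pack_nibbles(codes):
    """Pack pairs of 4-bit codes into single bytes (low nibble first)."""
    flat = codes.reshape(-1)
    return (flat[0::2] | (flat[1::2] << 4)).astype(np.uint8)

def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

def dequantize_int4(codes, scale, offset):
    """Reconstruct approximate float weights from codes and per-group metadata."""
    return codes.astype(np.float32) * scale.astype(np.float32) + offset.astype(np.float32)
```

Packing halves the storage for the codes themselves; the per-group scale and offset are the overhead that pushes the effective rate above 4 bits per weight.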
Benchmark Status
| Model | INT4 T1 | INT4 T2 | INT4 T3 | INT3 T1 | INT3 T2 | INT3 T3 | INT2 T1 | INT2 T2 | INT2 T3 |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-1.5B | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ |
| Qwen2.5-7B | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Qwen2.5-14B | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Qwen3-8B | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Llama-3.2-3B | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Llama-3.1-8B | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Mistral-7B-v0.3 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Phi-4 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Gemma-3-4B | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| gemma-3-4b-it | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| DeepSeek-R1-Distill-7B | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
✓ = complete · ⚠ = ran with error · ✗ = not yet run
Model Size Reference
| Model | Family | Params | BF16 disk | INT4 (~28%) | INT3 (~21%) | INT2 (~14%) |
|---|---|---|---|---|---|---|
| Qwen2.5-1.5B | Qwen | 1.5B | 3.1 GB | 0.87 GB | 0.65 GB | 0.43 GB |
| Qwen2.5-7B | Qwen | 7.2B | 14.0 GB | 3.92 GB | 2.94 GB | 1.96 GB |
| Qwen2.5-14B | Qwen | 14.2B | 29.6 GB | 8.29 GB | 6.22 GB | 4.14 GB |
| Qwen3-8B | Qwen | 8.2B | 16.4 GB | 4.59 GB | 3.44 GB | 2.30 GB |
| Llama-3.2-3B | Llama | 3.2B | 6.4 GB | 1.79 GB | 1.34 GB | 0.90 GB |
| Llama-3.1-8B | Llama | 8.0B | 16.0 GB | 4.48 GB | 3.36 GB | 2.24 GB |
| Mistral-7B-v0.3 | Mistral | 7.25B | 14.5 GB | 4.06 GB | 3.04 GB | 2.03 GB |
| Phi-4 | Phi | 14.7B | 29.4 GB | 8.23 GB | 6.17 GB | 4.12 GB |
| Gemma-3-4B | Gemma | 4.3B | 8.6 GB | 2.41 GB | 1.81 GB | 1.20 GB |
| DeepSeek-R1-Distill-7B | DeepSeek | 7.6B | 15.2 GB | 4.26 GB | 3.19 GB | 2.13 GB |
| TOTAL | — | — | 153.2 GB | 42.9 GB | 32.2 GB | 21.4 GB |
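The INT4/INT3/INT2 columns in this table appear to be the BF16 disk size multiplied by the ratio in each column header, so they are estimates rather than measured sizes. A small sketch of that arithmetic, with a hypothetical helper name:

```python
# Ratios taken from the column headers of the Model Size Reference table.
SIZE_RATIO = {"INT4": 0.28, "INT3": 0.21, "INT2": 0.14}

def estimated_sizes_gb(bf16_gb):
    """Estimate compressed disk size (GB) per bit level from the BF16 size."""
    return {level: round(bf16_gb * ratio, 2) for level, ratio in SIZE_RATIO.items()}

# Qwen2.5-14B, 29.6 GB in BF16
print(estimated_sizes_gb(29.6))  # {'INT4': 8.29, 'INT3': 6.22, 'INT2': 4.14}
```

These estimates can differ substantially from the measured sizes reported under Compression Metrics.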
Compression Metrics
| Model | Bits | Method | BF16 GB | Compressed GB | Size Ratio | bpw | Compress time |
|---|---|---|---|---|---|---|---|
| Qwen2.5-1.5B | 4 | INT4 nibble | 3.1 | 2.53 | 81.6% | 5.00 | 10s |
| Qwen2.5-1.5B | 3 | MiLo INT3 | 3.1 | 0.76 | 24.5% | 3.00 | 1298s |
| Qwen2.5-1.5B | 2 ⚠ | AQLM INT2 | — | — | — | — | ✗ skipped: AQLM INT2 compression |
| Qwen2.5-7B | 4 | INT4 nibble | 15.2 | 14.89 | 97.7% | 5.00 | 207s |
| Qwen2.5-7B | 3 | MiLo INT3 | — | — | — | — | ✗ skipped: MiLo INT3 compression |
| Qwen2.5-14B | 4 | — | 29.6 | ~8.29 | ~28% | — | — |
| Qwen2.5-14B | 3 | — | 29.6 | ~6.22 | ~21% | — | — |
| Qwen2.5-14B | 2 | — | 29.6 | ~4.14 | ~14% | — | — |
| Qwen3-8B | 4 | INT4 nibble | 16.4 | 15.36 | 93.7% | 5.00 | 197s |
| Qwen3-8B | 3 | MiLo INT3 | — | — | — | — | ✗ skipped: MiLo INT3 compression |
| Llama-3.2-3B | 4 | INT4 nibble | 6.4 | 5.73 | 89.1% | 5.00 | 70s |
| Llama-3.2-3B | 3 | MiLo INT3 | 6.4 | 1.51 | 23.5% | 3.00 | 3655s |
| Llama-3.1-8B | 4 | — | 16.0 | ~4.48 | ~28% | — | — |
| Llama-3.1-8B | 3 | — | 16.0 | ~3.36 | ~21% | — | — |
| Llama-3.1-8B | 2 | — | 16.0 | ~2.24 | ~14% | — | — |
| Mistral-7B-v0.3 | 4 | — | 14.5 | ~4.06 | ~28% | — | — |
| Mistral-7B-v0.3 | 3 | — | 14.5 | ~3.04 | ~21% | — | — |
| Mistral-7B-v0.3 | 2 | — | 14.5 | ~2.03 | ~14% | — | — |
| Phi-4 | 4 | — | 29.4 | ~8.23 | ~28% | — | — |
| Phi-4 | 3 | — | 29.4 | ~6.17 | ~21% | — | — |
| Phi-4 | 2 | — | 29.4 | ~4.12 | ~14% | — | — |
| Gemma-3-4B | 4 | — | 8.6 | ~2.41 | ~28% | — | — |
| Gemma-3-4B | 3 | — | 8.6 | ~1.81 | ~21% | — | — |
| Gemma-3-4B | 2 | — | 8.6 | ~1.20 | ~14% | — | — |
| gemma-3-4b-it | 4 | INT4 nibble | 10.0 | 9.27 | 92.8% | 5.00 | 157s |
| gemma-3-4b-it | 3 | MiLo INT3 | — | — | — | — | ✗ skipped: MiLo INT3 compression |
| DeepSeek-R1-Distill-7B | 4 | — | 15.2 | ~4.26 | ~28% | — | — |
| DeepSeek-R1-Distill-7B | 3 | — | 15.2 | ~3.19 | ~21% | — | — |
| DeepSeek-R1-Distill-7B | 2 | — | 15.2 | ~2.13 | ~14% | — | — |
Throughput (T1: tok/s)
| Model | BF16 tok/s | INT4 tok/s | Δ INT4 | INT3 tok/s | Δ INT3 | INT2 tok/s | Δ INT2 |
|---|---|---|---|---|---|---|---|
| Qwen2.5-1.5B | 24.2 | 26.3 | +2.1 | 26.5 | +2.3 | 22.2 | -2.0 |
| Qwen2.5-7B | — | 20.6 | — | 20.0 | — | — | — |
| Qwen2.5-14B | — | — | — | — | — | — | — |
| Qwen3-8B | — | 19.1 | — | 17.6 | — | — | — |
| Llama-3.2-3B | — | 12.7 | — | 13.0 | — | — | — |
| Llama-3.1-8B | — | — | — | — | — | — | — |
| Mistral-7B-v0.3 | — | — | — | — | — | — | — |
| Phi-4 | — | — | — | — | — | — | — |
| Gemma-3-4B | — | — | — | — | — | — | — |
| gemma-3-4b-it | — | 10.7 | — | 10.6 | — | — | — |
| DeepSeek-R1-Distill-7B | — | — | — | — | — | — | — |
Perplexity (T2: wikitext-2, lower = better)
| Model | BF16 PPL | INT4 PPL | Δ INT4 | INT3 PPL | Δ INT3 | INT2 PPL | Δ INT2 |
|---|---|---|---|---|---|---|---|
| Qwen2.5-1.5B | — | 9.20 | — | 9.20 | — | 9.20 | — |
| Qwen2.5-7B | — | 8.24 | — | 8.24 | — | — | — |
| Qwen2.5-14B | — | — | — | — | — | — | — |
| Qwen3-8B | — | 9.64 | — | 9.64 | — | — | — |
| Llama-3.2-3B | — | 8.12 | — | 8.12 | — | — | — |
| Llama-3.1-8B | — | — | — | — | — | — | — |
| Mistral-7B-v0.3 | — | — | — | — | — | — | — |
| Phi-4 | — | — | — | — | — | — | — |
| Gemma-3-4B | — | — | — | — | — | — | — |
| gemma-3-4b-it | — | 16.14 | — | 16.14 | — | — | — |
| DeepSeek-R1-Distill-7B | — | — | — | — | — | — | — |
Accuracy (T3: 0-shot, 200 samples)
ARC-Easy (acc_norm)
| Model | BF16 | INT4 | Δ INT4 | INT3 | Δ INT3 |
|---|---|---|---|---|---|
| Qwen2.5-1.5B | 71.5% | — | — | — | — |
| Qwen2.5-7B | — | — | — | — | — |
| Qwen2.5-14B | — | — | — | — | — |
| Qwen3-8B | — | — | — | — | — |
| Llama-3.2-3B | — | — | — | — | — |
| Llama-3.1-8B | — | — | — | — | — |
| Mistral-7B-v0.3 | — | — | — | — | — |
| Phi-4 | — | — | — | — | — |
| Gemma-3-4B | — | — | — | — | — |
| gemma-3-4b-it | — | — | — | — | — |
| DeepSeek-R1-Distill-7B | — | — | — | — | — |
HellaSwag (acc_norm)
| Model | BF16 | INT4 | Δ INT4 | INT3 | Δ INT3 |
|---|---|---|---|---|---|
| Qwen2.5-1.5B | 56.0% | — | — | — | — |
| Qwen2.5-7B | — | — | — | — | — |
| Qwen2.5-14B | — | — | — | — | — |
| Qwen3-8B | — | — | — | — | — |
| Llama-3.2-3B | — | — | — | — | — |
| Llama-3.1-8B | — | — | — | — | — |
| Mistral-7B-v0.3 | — | — | — | — | — |
| Phi-4 | — | — | — | — | — |
| Gemma-3-4B | — | — | — | — | — |
| gemma-3-4b-it | — | — | — | — | — |
| DeepSeek-R1-Distill-7B | — | — | — | — | — |
Methodology
| Test | Tool | Config |
|---|---|---|
| T1 Throughput | mlx_lm.stream_generate | 3 prompts × 3 runs × 128 max tokens |
| T2 Perplexity | mlx token NLL | wikitext-2-raw-v1, 512 tokens, stride 512 |
| T3 Accuracy | lm-eval harness | ARC-Easy + HellaSwag, 0-shot, 200 samples |
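For readers unfamiliar with the T2 metric, perplexity is the exponential of the mean per-token negative log-likelihood (NLL). The sketch below shows that reduction in plain Python. It assumes that "512 tokens, stride 512" means non-overlapping 512-token windows; both helper names are hypothetical and not taken from the benchmark script.

```python
import math

def window_starts(num_tokens, window=512, stride=512):
    """Start offsets of full evaluation windows; stride == window means no overlap."""
    return list(range(0, num_tokens - window + 1, stride))

def perplexity(token_nlls):
    """Perplexity = exp(mean NLL), with NLL in nats per predicted token."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns probability 1/4 to every correct token has PPL of about 4.
print(perplexity([math.log(4)] * 10))
```

In a real run, the NLL values would come from the model's log-probabilities for each target token within each window.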
Compression methods:
| Level | Method | bpw | squish flag |
|---|---|---|---|
| INT4 | Nibble-packed asymmetric INT4, group-32 | ~5.0 | squish-convert --int4 --super-weight |
| INT3 | MiLo INT3 + low-rank compensator, group-128 | ~3.75 | Python API: MiLoQuantizer |
| INT2 | AQLM 2-codebook additive VQ, group-8 | ~2.0 | Python API: AQLMQuantizer |
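One way to read the ~5.0 bpw figure for INT4: with group-32 quantization, each group carries metadata on top of its 4-bit codes. Assuming one FP16 scale and one FP16 zero-point per group (an assumption; the source does not state the metadata layout), the effective rate works out as below. The helper is hypothetical and does not model the INT3 low-rank compensator or the INT2 codebooks.

```python
def effective_bpw(code_bits, group_size, scale_bits=16, zero_bits=16):
    """Bits per weight = code bits plus per-group metadata amortized over the group."""
    return code_bits + (scale_bits + zero_bits) / group_size

print(effective_bpw(4, 32))  # 5.0
```

Larger groups amortize the same metadata over more weights, which lowers the effective rate at the cost of coarser scaling.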
BF16 reference data for existing squish models sourced from dev/results/benchmark_multi_model.json.
New models (Llama, Mistral, Phi-4, Gemma, DeepSeek) have no prior squish benchmarks.
Raw result JSON: dev/results/int_quant/
Benchmark script: dev/benchmarks/bench_int_quant.py
Run all models: dev/scripts/run_all_int_quant.sh