
INT3 vs INT2 Quantization Benchmark

Weights: synthetic Gaussian (σ = 0.02), at shapes typical of transformer attention and FFN projections.
INT3: MiLo (group_size=64, max_rank=8)
INT2: WeightOnlyInt2Quant (group_size=64, asymmetric)
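To make the two configurations concrete, here is a minimal pure-Python sketch. It is not the benchmark's code. It uses an asymmetric min-max round-to-nearest quantizer applied per group of 64 values, which stands in for WeightOnlyInt2Quant at 2 bits and for an INT3 base quantizer at 3 bits. It also assumes the usual SNR definition, 10·log10(‖W‖² / ‖W − Ŵ‖²), matches the benchmark's. Finally, it fits a rank-8 compensator to the INT3 residual by power iteration. The helper names (`quantize_groupwise`, `snr_db`, `low_rank_compensator`) are hypothetical. The compensator only illustrates the low-rank idea behind MiLo; it is not MiLo's algorithm. The `max_rank` parameter suggests MiLo chooses ranks adaptively, whereas this sketch always uses rank 8.

```python
import math
import random

def quantize_groupwise(w, bits, group_size=64):
    """Asymmetric min-max round-to-nearest quantization over groups of
    `group_size` consecutive values in each row; returns dequantized weights."""
    levels = (1 << bits) - 1  # 3 steps for INT2, 7 for INT3
    out = []
    for row in w:
        deq = []
        for start in range(0, len(row), group_size):
            group = row[start:start + group_size]
            lo, hi = min(group), max(group)       # per-group offset (zero point) and range
            scale = (hi - lo) / levels or 1.0     # avoid dividing by zero on constant groups
            for x in group:
                q = min(levels, max(0, round((x - lo) / scale)))  # integer code
                deq.append(lo + q * scale)                        # dequantized value
        out.append(deq)
    return out

def snr_db(w, w_hat):
    """SNR in dB: 10*log10(||W||^2 / ||W - W_hat||^2)."""
    signal = sum(x * x for row in w for x in row)
    noise = sum((a - b) ** 2 for r1, r2 in zip(w, w_hat) for a, b in zip(r1, r2))
    return 10 * math.log10(signal / noise)

def low_rank_compensator(residual, rank, iters=20, seed=0):
    """Rank-`rank` approximation of `residual` via power iteration with deflation."""
    rng = random.Random(seed)
    m, n = len(residual), len(residual[0])
    e = [row[:] for row in residual]            # working copy, deflated in place
    approx = [[0.0] * n for _ in range(m)]
    for _ in range(rank):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            u = [sum(e[i][j] * v[j] for j in range(n)) for i in range(m)]  # u = E v
            v = [sum(e[i][j] * u[i] for i in range(m)) for j in range(n)]  # v = E^T u
            norm = math.sqrt(sum(x * x for x in v)) or 1.0
            v = [x / norm for x in v]
        ev = [sum(e[i][j] * v[j] for j in range(n)) for i in range(m)]     # E v
        for i in range(m):
            for j in range(n):
                approx[i][j] += ev[i] * v[j]    # accumulate (E v) v^T
                e[i][j] -= ev[i] * v[j]         # deflate: E <- E (I - v v^T)
    return approx

rng = random.Random(0)
w = [[rng.gauss(0.0, 0.02) for _ in range(128)] for _ in range(64)]  # sigma = 0.02
w_int2 = quantize_groupwise(w, bits=2)
w_int3 = quantize_groupwise(w, bits=3)
residual = [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(w, w_int3)]
comp = low_rank_compensator(residual, rank=8)
w_int3_lr = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(w_int3, comp)]

snr2, snr3, snr3_lr = snr_db(w, w_int2), snr_db(w, w_int3), snr_db(w, w_int3_lr)
print(f"INT2: {snr2:.1f} dB | INT3: {snr3:.1f} dB | INT3 + rank-8: {snr3_lr:.1f} dB")
```

Each deflation step replaces E with E(I − vvᵀ), which can only shrink the residual's Frobenius norm. The compensated INT3 error is therefore never worse than the plain INT3 error. On a 64×128 matrix, a rank-8 correction covers a much larger share of the residual than it would on the 4096-wide layers. For that reason, the compensated SNR from this toy run should not be compared with the table.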

Per-Layer Results

| Layer [shape] | Method | BPW | SNR (dB) | Compression ratio (vs FP32) | Compression time (ms) | Decompression time (ms) | Compressed size (MB) |
|---|---|---|---|---|---|---|---|
| attn_proj [576×576] | INT3-MiLo | 4.89 | 13.4 | 6.55× | 271.8 | 140.8 | 0.20 |
| attn_proj [576×576] | INT2-WOQ | 3.00 | 7.4 | 10.67× | 9.1 | 0.5 | 0.12 |
| ffn_up [576×1536] | INT3-MiLo | 4.61 | 13.3 | 6.94× | 725.3 | 381.0 | 0.51 |
| ffn_up [576×1536] | INT2-WOQ | 3.00 | 7.4 | 10.67× | 15.3 | 1.3 | 0.33 |
| attn_proj [1024×1024] | INT3-MiLo | 4.50 | 13.3 | 7.11× | 933.0 | 454.1 | 0.59 |
| attn_proj [1024×1024] | INT2-WOQ | 3.00 | 7.4 | 10.67× | 18.0 | 1.6 | 0.39 |
| ffn_up [1024×3072] | INT3-MiLo | 4.33 | 13.2 | 7.38× | 2849.5 | 1357.5 | 1.70 |
| ffn_up [1024×3072] | INT2-WOQ | 3.00 | 7.4 | 10.67× | 54.7 | 5.3 | 1.18 |
| attn_proj [2048×2048] | INT3-MiLo | 4.25 | 13.2 | 7.53× | 5004.6 | 1830.3 | 2.23 |
| attn_proj [2048×2048] | INT2-WOQ | 3.00 | 7.4 | 10.67× | 72.2 | 7.7 | 1.57 |
| ffn_up [2048×8192] | INT3-MiLo | 4.16 | 13.2 | 7.70× | 17875.4 | 7655.9 | 8.72 |
| ffn_up [2048×8192] | INT2-WOQ | 3.00 | 7.4 | 10.67× | 293.4 | 30.7 | 6.29 |
| attn_proj [4096×4096] | INT3-MiLo | 4.12 | 13.2 | 7.76× | 28514.9 | 7562.5 | 8.65 |
| attn_proj [4096×4096] | INT2-WOQ | 3.00 | 7.4 | 10.67× | 290.9 | 30.6 | 6.29 |
| ffn_up [4096×14336] | INT3-MiLo | 4.08 | 13.2 | 7.84× | 79157.8 | 27573.9 | 29.95 |
| ffn_up [4096×14336] | INT2-WOQ | 3.00 | 7.4 | 10.67× | 1045.1 | 109.7 | 22.02 |
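The BPW, compression-ratio and size columns fit a simple storage model. This model is inferred from the numbers; the benchmark does not state it. It assumes packed integer codes, 64 bits of metadata per group of 64 weights (for example an FP32 scale plus an FP32 zero point), and, for MiLo, rank-8 factors of shape rows×8 and 8×cols stored in FP32 in every layer. Under these assumptions, BPW = bits + 64/64 + 32·8·(rows + cols)/(rows·cols), the ratio is 32/BPW, and sizes are in decimal megabytes (10⁶ bytes). The hypothetical helper `storage_model` below reproduces every row of the table. Under this model, MiLo's BPW falls toward 4.0 on larger layers because the low-rank overhead grows with rows + cols while the weight count grows with rows × cols.

```python
def storage_model(rows, cols, bits, group_size=64, meta_bits=64, rank=0, factor_bits=32):
    """Hypothetical storage model: returns (bits per weight, ratio vs FP32, size in MB)."""
    n = rows * cols
    total_bits = n * bits                               # packed integer codes
    total_bits += (n // group_size) * meta_bits         # assumed scale + zero point per group
    total_bits += rank * (rows + cols) * factor_bits    # low-rank factors (rows x r) and (r x cols)
    bpw = total_bits / n
    return bpw, 32 / bpw, total_bits / 8 / 1e6          # MB = 10^6 bytes

shapes = [(576, 576), (576, 1536), (1024, 1024), (1024, 3072),
          (2048, 2048), (2048, 8192), (4096, 4096), (4096, 14336)]
for rows, cols in shapes:
    for name, bits, rank in (("INT3-MiLo", 3, 8), ("INT2-WOQ", 2, 0)):
        bpw, ratio, mb = storage_model(rows, cols, bits, rank=rank)
        print(f"[{rows}x{cols}] {name}: {bpw:.2f} BPW, {ratio:.2f}x, {mb:.2f} MB")
```

The INT2 rows come out at exactly 3.00 BPW for every shape. Without a low-rank term, the overhead is a constant 1 bit per weight.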

Averages (unweighted mean over the eight layers)

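| Method | BPW | SNR (dB) | Compression ratio (vs FP32) | Compression time (ms) | Decompression time (ms) |
|---|---|---|---|---|---|
| INT3-MiLo | 4.37 | 13.2 | 7.35× | 16917 | 5869.5 |
| INT2-WOQ | 3.00 | 7.4 | 10.67× | 225 | 23.4 |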
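The averaged timings are plain means of the per-layer values, so the largest shapes dominate them. The quick check below uses only the per-layer times copied from the table.

```python
from statistics import mean

# Per-layer times (ms) copied from the per-layer table, in row order (smallest to largest shape).
int3_compress = [271.8, 725.3, 933.0, 2849.5, 5004.6, 17875.4, 28514.9, 79157.8]
int3_decomp = [140.8, 381.0, 454.1, 1357.5, 1830.3, 7655.9, 7562.5, 27573.9]
int2_compress = [9.1, 15.3, 18.0, 54.7, 72.2, 293.4, 290.9, 1045.1]
int2_decomp = [0.5, 1.3, 1.6, 5.3, 7.7, 30.7, 30.6, 109.7]

for name, xs in (("INT3 compress", int3_compress), ("INT3 decomp", int3_decomp),
                 ("INT2 compress", int2_compress), ("INT2 decomp", int2_decomp)):
    share = xs[-1] / sum(xs)  # fraction of the total contributed by the 4096x14336 layer
    print(f"{name}: mean {mean(xs):.1f} ms, largest layer = {share:.0%} of the total")
```

For every timing column, the 4096×14336 layer alone accounts for roughly 58 to 59% of the summed time. The averages therefore track that one shape more than the typical layer.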

Model Projection (28-layer Qwen3-1.7B scale, FP32 baseline)

| Method | Compressed (GB) | Original FP32 (GB) | Memory savings |
|---|---|---|---|
| INT3-MiLo | 1.44 | 11.14 | 87% |
| INT2-WOQ | 1.04 | 11.14 | 91% |
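The savings column equals 1 - compressed/original. Working backwards from the table's sizes also gives the effective bits per weight that each projection implies, 32 × compressed/original. This is arithmetic on the reported numbers, not a description of how the projection was produced.

```python
original_gb = 11.14  # FP32 baseline from the projection table
projected = {"INT3-MiLo": 1.44, "INT2-WOQ": 1.04}

for name, gb in projected.items():
    savings = 1 - gb / original_gb           # fraction of FP32 memory saved
    implied_bpw = 32 * gb / original_gb      # effective bits per weight relative to 32-bit floats
    print(f"{name}: {savings:.0%} saved, ~{implied_bpw:.2f} effective BPW")

# INT2's per-layer 3.00 BPW applied to the whole FP32 baseline:
print(f"INT2 at 3.00 BPW: {original_gb * 3 / 32:.2f} GB")
```

For INT2, the implied ~2.99 BPW matches the per-layer 3.00 BPW, so its projection is the FP32 size scaled by 3/32. For INT3-MiLo, the implied ~4.14 BPW falls below the 4.37 average and between the per-layer values for the largest shapes (4.12 to 4.16).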