Squish Architecture: Technical Deep Dive

One-sentence summary: Squish separates the storage format of a transformer's weight tensors from their runtime format, enabling aggressive compression at rest, near-lossless reconstruction on demand (see §6), and Metal-native caching that loads a Qwen2.5-1.5B model in 0.33–0.53 seconds, which is 54× faster than a cold mlx_lm load (28.8 s) and 3.7× faster than a warm one, while using 160 MB of peak additional RAM versus the ~2.4 GB typically consumed during a standard load.

Note: The Tier 2 MLX safetensors cache requires ~34 GB RAM to build and is not built on 16 GB machines. On a standard 16 GB M3, an 8B INT4 model loads in ~2.7 seconds via the lut_int2 path. The 0.33–0.53 s figure applies to the 1.5B model on hardware with sufficient RAM to build the Tier 2 cache.

1. The Problem with Status-Quo Model Distribution

Every serious open-source model (Llama, Gemma, Mistral, Qwen, Falcon) ships as one or more HuggingFace safetensors shards. The format is a flat binary blob: each tensor is stored in the dtype the training run used (typically bfloat16 or float16), preceded by a JSON header describing name, dtype, and shape.

This design has several load-time inefficiencies:

| Inefficiency | Root cause |
|---|---|
| Full model in RAM simultaneously | Standard loader calls `mx.load()` on the whole shard before `model.load_weights()` |
| No compression | safetensors is a raw binary format; disk = wire = RAM occupancy |
| Cold-boot penalty | Every Python process restart deserialises the full model from disk |
| Format coupling | Implementation cannot change storage layout without breaking all downstreams |

A 1.5B-parameter bfloat16 model is ~3 GB on disk and needs ~3 GB of additional RAM during loading. At 7B that becomes ~14 GB. At 70B (~140 GB) it is out of reach for consumer hardware.
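
The sizes above follow from parameter count times bytes per element. A back-of-the-envelope check, where `weights_gb` is an illustrative helper (not part of Squish) that ignores the small JSON header and any non-weight files:

```python
# Rough weight footprint: parameters x bytes per element.
# `weights_gb` is an illustrative helper, not a Squish API.
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weights_gb(n_params: float, dtype: str = "bfloat16") -> float:
    """Approximate size of the raw weights in decimal gigabytes."""
    return n_params * BYTES_PER_ELEMENT[dtype] / 1e9


for n_params in (1.5e9, 7e9, 70e9):
    size = weights_gb(n_params)
    print(f"{n_params / 1e9:>5.1f}B params -> ~{size:.0f} GB in bfloat16")
```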

2. The Squish Architecture

Squish introduces a five-path weight management system:

flowchart TD
    T0a["Tier 0a · Native MLX model<br/>mlx_lm.load()<br/>Standard HuggingFace safetensors via mlx_lm<br/>Fallback path — no squish cache present"]
    T0b["Tier 0b · squish_4bit/<br/>INT4 group-quantized shards · sentinel .squish_4bit_ready<br/>Decompressed by the squish_quant Rust extension on load<br/>Disk ≈ 50% of native bf16"]
    T0c["Tier 0c · squish_3bit/<br/>INT3 group-quantized shards · sentinel .squish_3bit_ready<br/>Only for families that pass the accuracy gate (Qwen3, …)<br/>Disk ≈ 37% of native bf16"]
    T1["Tier 1 · finalized/ .npy cache<br/>One float16 .npy per tensor, memory-mappable<br/>np.load(mmap_mode='r') → mx.array → bf16<br/>Active path on 16 GB for 8B+ models<br/>Load ≈ 2.7 s (8B INT4, M3 16 GB)"]
    T2["Tier 2 · squish_weights.safetensors<br/>Single bf16 in MLX-native layout · sentinel .squish_ready<br/>mx.load() → direct Metal memory-map<br/>Disabled on 16 GB for 8B+ models (RAM guard)<br/>Load 0.33 s · RAM delta 160 MB"]

    T0a -->|"squish compress · ~5–19 min"| T0b
    T0b -->|"squish compress --int3 · extra pass"| T0c
    T0c -->|"first run: reconstruct + save · ~2 s"| T1
    T1 -->|"first run: saved post-load"| T2

    classDef native fill:#1e293b,stroke:#475569,color:#e2e8f0;
    classDef quant fill:#312e3f,stroke:#7c3aed,color:#ede9fe;
    classDef fast fill:#0f2e2a,stroke:#10b981,color:#d1fae5;
    class T0a native;
    class T0b,T0c quant;
    class T1,T2 fast;

Why does Tier 2 load so much faster?

mx.load() on a safetensors file performs a direct Metal memory-map: the weight bytes are mapped into the GPU address space without materialising an intermediate CPU numpy buffer. The file written by mx.save_safetensors() is already stored in the exact byte layout (bfloat16, row-major) that MLX uses internally, so zero conversion occurs at load time.

The reference mlx_lm.load() path must:
1. Open and parse the HuggingFace safetensors JSON header
2. Instantiate tokenizer (loads sentencepiece vocabulary)
3. Materialise all arrays into a Python dict before model.load_weights()
4. Apply dtype promotions for any mixed-precision shards

Squish's _load_mlx_cache() path:
1. _instantiate_model(): builds MLX graph skeleton from config.json
2. mx.load(): single syscall, OS mmap, Metal GPU mapping
3. model.load_weights(): inject by name
4. AutoTokenizer.from_pretrained(): cached by transformers' local disk cache

3. The Vectro INT8 Quantization Kernel

Vectro uses symmetric per-row INT8 scalar quantization (absmax scaling with one scale per row and no zero-point):

For each weight matrix W of shape (n_rows, n_cols):
  For each row r in W:
    scale[r]  = max(|W[r, :]|) / 127
    q[r, :]   = round(W[r, :] / scale[r]).clip(-128, 127).astype(int8)

Storage: q  (int8,   n_rows × n_cols)
         s  (float32, n_rows)

Reconstruction:
  W_hat[r, :] = q[r, :].astype(float32) * scale[r]
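
The pseudocode above can be sketched in NumPy as follows. This is a minimal reference sketch, not the Vectro kernel itself: the function names are illustrative, and mapping an all-zero row to a scale of 1.0 (to avoid dividing by zero) is an assumption the original does not specify.

```python
import numpy as np


def quantize_rows_int8(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-row absmax INT8 quantisation. Returns (q, scale)."""
    w = np.asarray(w, dtype=np.float32)
    absmax = np.max(np.abs(w), axis=1)
    # Assumption: an all-zero row gets scale 1.0 so the division is safe.
    scale = np.where(absmax > 0, absmax / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.round(w / scale[:, None]), -128, 127).astype(np.int8)
    return q, scale


def dequantize_rows_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct float32 weights: W_hat[r, :] = q[r, :] * scale[r]."""
    return q.astype(np.float32) * scale[:, None]


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)  # illustrative matrix
q, s = quantize_rows_int8(w)
w_hat = dequantize_rows_int8(q, s)
print("max abs error:", float(np.max(np.abs(w - w_hat))))
```

Because the scale is derived from the row's absolute maximum, the scaled values already lie in [-127, 127]; the clip is a safety net rather than something that normally changes a value.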

Compression ratio for a matrix with 4-byte float32 elements:
- Original: 4 × n_rows × n_cols bytes
- Compressed: 1 × n_rows × n_cols + 4 × n_rows ≈ 1 byte/element (for wide matrices)
- Theoretical: 4× compression on eligible tensors
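
The byte counts above are easy to evaluate for a concrete shape. In this sketch the helper name and the 4096 × 4096 shape are illustrative; the `src_bytes_per_elem` parameter is added only to show that the same formula gives roughly 2× when the source tensor is a 2-byte dtype such as bfloat16 or float16.

```python
def int8_row_quant_ratio(n_rows: int, n_cols: int, src_bytes_per_elem: int = 4) -> float:
    """Original bytes divided by (int8 values + one float32 scale per row)."""
    original = src_bytes_per_elem * n_rows * n_cols
    compressed = 1 * n_rows * n_cols + 4 * n_rows
    return original / compressed


# Illustrative shape, not taken from any particular model.
print(f"float32 source: {int8_row_quant_ratio(4096, 4096):.3f}x")
print(f"2-byte source:  {int8_row_quant_ratio(4096, 4096, 2):.3f}x")
```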

Why not all tensors are quantised (89 passthrough): Embedding tables, output projection (lm_head), layer normalisation weights, and bias vectors are stored as-is (float16). These tensors either have very few parameters (biases, norms) or are so sensitive to quantisation noise that any distortion measurably degrades perplexity (embed_tokens with 151 936 rows).

The 249 quantised tensors are the large attention (q/k/v_proj, o_proj) and feed-forward (gate_proj, up_proj, down_proj) matrices where INT8 rows introduce sub-0.02% cosine distance from the original, within training noise.

4. The npy-dir Storage Format

{compressed_dir}/
├── manifest.json                      # safe_key → original_name mapping
├── tensors/
│   ├── {safe_key}__q.npy             # int8 quantised values  [n_rows, n_cols]  (Tier 0a)
│   ├── {safe_key}__s.npy             # float32 row scales     [n_rows]           (Tier 0a)
│   ├── {safe_key}__shape.npy         # original shape         [ndim]
│   ├── {safe_key}__pt.npy            # passthrough float16    [...]              (all tiers)
│   ├── {safe_key}__q4.npy            # uint8 nibble-packed    [n_rows, n_cols//2](Tier 0b INT4)
│   ├── {safe_key}__s4.npy            # float32 group scales   [n_rows, n_groups] (Tier 0b INT4)
│   └── ...  (249 quantised × 3 q/s/shape + 89 PT + optional 249 q4/s4 pairs)
├── finalized/
│   ├── {original_name_dotted}.npy    # reconstructed float16  (Tier 1 cache)
│   └── .ready                        # sentinel: cache is complete
├── squish_weights.safetensors          # Tier 2: bf16 MLX safetensors
├── .squish_ready                      # sentinel: Tier 2 is complete
└── .squish_int4_ready                 # sentinel: INT4 conversion complete (Tier 0b)

Converting INT8 → INT4 (run once, ~30s):

from compressed_loader import save_int4_npy_dir
result = save_int4_npy_dir('/path/to/compressed_dir')
# Saves {sk}__q4.npy + {sk}__s4.npy alongside existing INT8 files
# Writes .squish_int4_ready sentinel when complete
# All subsequent loads auto-select INT4 via _dequantize_npy_dir() priority order
print(f"Savings: {result['savings_pct']:.0f}%")
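
For intuition about the `[n_rows, n_cols//2]` shape of the `__q4.npy` files, nibble packing stores two 4-bit codes in each byte. The sketch below is a dependency-free illustration only: the function names are hypothetical, and the nibble order (even index in the low nibble) and the use of unsigned codes 0..15 are assumptions, since the actual layout used by `save_int4_npy_dir()` is not described here.

```python
def pack_nibbles(codes: list[int]) -> bytes:
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must be in 0..15")
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_nibbles(packed: bytes) -> list[int]:
    """Inverse of pack_nibbles: split each byte into its two 4-bit codes."""
    out: list[int] = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble = even-indexed code
        out.append(byte >> 4)    # high nibble = odd-indexed code
    return out


codes = [1, 2, 15, 0, 7, 8]
packed = pack_nibbles(codes)
print(len(codes), "codes ->", len(packed), "bytes; round trip ok:",
      unpack_nibbles(packed) == codes)
```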

Why .npy over .npz:
- .npz archives written with compression (np.savez_compressed) use zlib: they take 9 minutes to write and 9 seconds to decompress at load time.
- .npy files are raw binary with a tiny header: memory-mappable, zero decompression cost, and the per-file overhead is amortised over 338 tensors.
- The INT8 quantisation already provides the compression; zlib on top of int8 data yields negligible additional savings.

Memory-mapped loading (mmap_mode='r'):
- np.load(path, mmap_mode='r') returns a numpy memmap: the OS does not read the file contents until a byte is accessed.
- For non-Tier-2 loads, only the bytes needed to construct mx.array() are ever paged in, keeping peak RSS small.
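
The lazy behaviour described above can be observed with plain NumPy. This is a small sketch using a throwaway file; the file name `demo__q.npy` and the array shape are illustrative and not produced by Squish.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo__q.npy")  # hypothetical tensor file
    np.save(path, np.ones((1000, 1000), dtype=np.int8))

    # Opening with mmap_mode="r" maps the file; no tensor data is read yet.
    arr = np.load(path, mmap_mode="r")
    is_memmap = isinstance(arr, np.memmap)

    # Touching one row pages in (and copies) only that slice.
    row = np.array(arr[10])

    # Drop the mapping before the temporary directory is removed.
    del arr

print("memory-mapped:", is_memmap, "| copied row shape:", row.shape)
```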

5. RAM Efficiency

Standard mlx_lm.load() for a 1.5B model:

Baseline RSS:                      185 MB  (Python + MLX runtime)
Load weights (all in-memory):     +2400 MB  (all safetensors arrays alive at once)
model.load_weights():             weights transfer to GPU buffers
Garbage collect numpy arrays:     -2200 MB
Net delta:                        +2100-2500 MB

Squish Tier 2 (forge-mlx cache):

Baseline RSS:                      185 MB
mx.load() memory-map:              + 12 MB (mmap region, not RSS)
model.load_weights():              weights transfer to GPU buffers
mx.eval() / GC:                    + 148 MB net RSS increase
Net delta:                         +160 MB

The roughly 13–15× RAM advantage during loading (2100–2500 MB versus 160 MB) comes from Metal's memory mapping: the weight bytes are mapped into the GPU's virtual address space directly from the file, bypassing the CPU heap allocation that the standard numpy-based loader performs.

6. Accuracy Preservation

INT8 quantisation introduces bounded numerical error. Per weight matrix:

max absolute error = max(|W[r, :] - W_hat[r, :]|)
                   ≤ max_r(scale[r]) × 0.5     (half-step rounding error)

cosine similarity ≥ 0.99995  (measured: 338 tensors on Qwen2.5-1.5B)
mean cosine sim   = 0.99999
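
The half-step bound and the cosine-similarity metric can be checked by hand on a single row. The values below are made up for illustration and are unrelated to the Qwen2.5-1.5B measurements; the sketch uses plain Python so each step is visible.

```python
import math

row = [0.82, -1.27, 0.031, 0.4, -0.66, 1.05]  # illustrative weights

# Per-row absmax scale and INT8 codes, as in the Vectro pseudocode.
scale = max(abs(x) for x in row) / 127
q = [max(-128, min(127, round(x / scale))) for x in row]
row_hat = [qi * scale for qi in q]

# Rounding moves each element by at most half a quantisation step.
max_err = max(abs(a - b) for a, b in zip(row, row_hat))

dot = sum(a * b for a, b in zip(row, row_hat))
norm = math.sqrt(sum(a * a for a in row)) * math.sqrt(sum(b * b for b in row_hat))
cosine = dot / norm

print(f"max error {max_err:.6f} <= bound {scale / 2:.6f}; cosine {cosine:.6f}")
```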

At the model-output level:
- 100% first-token agreement with the FP16 reference on a 5-prompt evaluation
- 73–100% token agreement over 20-token sequences (natural output variation from INT8 noise compounds over a long sequence, similar to temperature > 0)

Industry-standard benchmarks (ARC-Easy, HellaSwag, MMLU) show <2% accuracy delta vs the uncompressed model, within the random variance of different evaluation seeds.

7. Three-Tier Loading Strategy: Decision Tree

load_compressed_model(compressed_dir, model_dir)
        │
        ├── .squish_4bit_ready exists?
        │         │
        │         └── YES → mlx_lm.load(squish_4bit/)  ← Tier 3  (1.5-2s, large models)
        │
        ├── .squish_ready exists?
        │         │
        │         └── YES → _load_mlx_cache()           ← Tier 2  (0.3-2s)
        │
        ├── finalized/.ready exists?
        │         │
        │         └── YES → _load_finalized_cache()     ← Tier 1  (4-5s)
        │
        └── (none) → Vectro/Rust first load             ← Tier 0  (15-20s)
                  │
                  ├── auto-select per-tensor:
                  │     .squish_int4_ready + squish_quant → INT4 Rust dequantize
                  │     __q.npy/__s.npy present           → INT8 Vectro dequantize
                  │     __pt.npy present                  → float16 passthrough
                  ├── serial loop: decomp → save f16 .npy inline
                  ├── save squish_weights.safetensors (mx.save_safetensors)
                  ├── write .squish_ready
                  └── write finalized/.ready
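
The sentinel checks in the tree above amount to a short priority chain. The sketch below is illustrative only: `select_load_path` and its return labels are hypothetical, not the body of `load_compressed_model()`, and it assumes `.squish_4bit_ready` sits in the root of the compressed directory alongside `.squish_ready`.

```python
from pathlib import Path


def select_load_path(compressed_dir: str) -> str:
    """Return a label for the first tier whose sentinel file exists."""
    root = Path(compressed_dir)
    # Assumption: the 4-bit sentinel lives in the compressed_dir root.
    if (root / ".squish_4bit_ready").exists():
        return "tier3_squish_4bit"
    if (root / ".squish_ready").exists():
        return "tier2_mlx_safetensors"
    if (root / "finalized" / ".ready").exists():
        return "tier1_finalized_npy"
    # No sentinel found: dequantise from the npy-dir and build the caches.
    return "tier0_first_load"


print(select_load_path("/path/to/compressed_dir"))
```

Writing each sentinel only after its cache is fully on disk is what makes this check safe: an interrupted first load leaves no sentinel, so the next run starts again from Tier 0 instead of reading a partial cache.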

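The first load is a one-time cost. Every subsequent invocation, in any Python process on the same machine, hits Tier 2 and loads in 0.3–2 s, provided the Tier 2 cache could be built (see the note on 16 GB machines at the top of this page); otherwise it falls back to the Tier 1 finalized cache.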

8. Extension Points

| Capability | Status | Notes |
|---|---|---|
| npy-dir format | ✅ | Production-ready |
| Finalized f16 cache (Tier 1) | ✅ | Fallback if Tier 2 missing |
| MLX safetensors cache (Tier 2) | ✅ | 0.33s loads |
| Streaming layer-by-layer loader | ✅ | `streaming_loader.py` |
| lm-eval harness integration | ✅ | `squish_lm_eval.py` |
| 7B model support | ✅ | squish_4bit path for large models |
| INT4 nibble-packed storage | ✅ | `save_int4_npy_dir()` + Rust deq; 50% disk vs INT8 |
| AWQ calibration | ✅ | `squish/awq.py`: `collect_activation_scales` → `save_awq_scales` → `--awq-scales` in convert |
| KV cache quantisation | ✅ | `squish/kv_cache.py`: KIVI INT8 + SnapKV; mlx_lm `update_and_fetch` protocol |
| Remote/cloud weight streaming | 🔜 | npy-dir format is range-request friendly |
| Multi-shard models | 🔜 | Convert individually, merge manifest |
| GGUF / ONNX export from cache | 🔜 | weight_dict already in bf16 |

9. Comparison to Existing Solutions

| Approach | Load (cold) | Load (warm) | Disk | RAM delta | ARC-Easy | HellaSwag |
|---|---|---|---|---|---|---|
| `mlx_lm.load()` native | 1.96–6.7s | 1.96s | 3087 MB | ~2400 MB | 74.5% | 63.5% |
| `mlx_lm` + 4-bit quant | ~1.5s | ~1.5s | ~850 MB | ~900 MB | -3 to -5% (est.) | -3 to -5% (est.) |
| GGUF (llama.cpp) | ~2-3s | ~2-3s | ~1200 MB | ~1000 MB | -1 to -2% (est.) | -1 to -2% (est.) |
| Squish Tier 2 | 0.33–0.53s | 0.33s | 2682 MB | 160 MB | 73.5% | 62.0% |
| Squish Tier 1 (fallback) | 4.65s | 4.65s | 2682 MB | ~2100 MB | 73.5% | 62.0% |
| Squish Tier 0 (first run) | ~19s | n/a | 2682 MB | ~2200 MB | 73.5% | 62.0% |

ARC-Easy and HellaSwag accuracy measured with lm-evaluation-harness v0.4.11, 200 examples. 4-bit / GGUF accuracy estimates are from published benchmarks; exact numbers vary by implementation.

Key insight: Squish stays within 1–2% of reference accuracy while loading an order of magnitude faster and using 15× less RAM during the load phase. Against a true cold mlx_lm first boot (28.8 s with a page-cache miss), the load speedup is 54×; see the paper §4.1.

10. Further reading

The paper: full methodology, thermal-controlled benchmarks (§4.4), accuracy gates, and the decode-acceleration ablation.
Benchmarks: reproducible numbers with commands.
Module reference: the composable optimisation modules.

Squish is source-available under BUSL-1.1: free for personal and non-production use.