Troubleshooting / FAQ¶

This page covers the most common issues encountered when running Squish on Apple Silicon Macs. If your issue is not listed here, open a GitHub Discussion.

Quick diagnostics¶

Run these two commands first. The answers are required context for any bug report or Discord question.

squish --version
python -c "import mlx; print(mlx.__version__)"

Expected output (versions will differ):

squish 9.34.1
0.22.0

Additional environment snapshot:

python --version                       # should be 3.10+
pip show squish mlx mlx-lm            # installed versions and locations
sw_vers                                # macOS version
sysctl -n hw.memsize                   # total unified memory in bytes
squish models                          # locally cached models

Paste the output of all of the above when filing a bug.

8 GB Mac: out of memory (OOM)¶

Symptoms¶

squish run or squish serve exits immediately with no error message
Kernel logs show jetsam events (log show --predicate 'eventMessage contains "jetsam"' --last 5m)
The macOS Activity Monitor "Memory Pressure" gauge is solidly red before inference starts
Python raises MemoryError or the process is killed mid-generation

Why it happens¶

Apple Silicon's unified memory is shared between the CPU, the GPU (Metal), and the OS. On an 8 GB device, macOS reserves approximately 1.5–2 GB for the kernel and active apps, leaving roughly 6–6.5 GB for model weights and the inference runtime. A 7B/8B model at full bf16 is ~16 GB and will not fit; even INT8 is ~8 GB. Squish defaults to INT4 (~4 GB) and offers INT3 (~3.5 GB).

Fixes¶

1. Use INT4 quantisation (recommended for 8 GB Macs)

INT4 halves the weight footprint compared to INT8.

squish pull llama3.1:8b --quant int4
squish run  llama3.1:8b --quant int4

INT4 accuracy

INT4 produces measurably lower quality output on models below 3B parameters. For 7B+ models the degradation is minor in practice.

2. Use --fault-tolerance to survive partial allocation failures

squish run llama3.1:8b --quant int4 --fault-tolerance

--fault-tolerance catches Metal allocation errors and retries with a smaller KV-cache instead of crashing.

3. Use --adaptive-quant to let Squish pick the quantisation level automatically

squish run llama3.1:8b --adaptive-quant

Squish reads available unified memory at startup and selects INT4 or INT8 accordingly.

4. Reduce the context window

Every additional token in the context window consumes KV-cache memory. Drop the window to 2 048 or 4 096 if your task allows it:

squish run llama3.1:8b --quant int4 --ctx 2048

5. Free memory before running Squish

Quit Safari, Xcode, Simulator, and any Electron apps. Each frees hundreds of megabytes. You can also temporarily reduce the number of active menu-bar apps.

6. Choose a smaller model

3B models fit comfortably in INT8 on 8 GB devices:

squish pull llama3.2:3b
squish run  llama3.2:3b

Reference: model memory requirements¶

Model	INT8 (approx.)	INT4 (approx.)	Fits 8 GB?
llama3.2:1b	~1.2 GB	~0.7 GB	Yes (either quant)
llama3.2:3b	~3.5 GB	~2.0 GB	Yes (either quant)
llama3.1:8b	~8.5 GB	~4.5 GB	INT4 only
qwen3:8b	~8.3 GB	~4.4 GB	INT4 only
llama3.1:70b	~72 GB	~38 GB	No

Tokenizer errors¶

Symptoms¶

RuntimeError: Tokenizer 'tiktoken' not found
OSError: Can't load tokenizer for '<model>'
ValueError: Unrecognized model in AutoTokenizer
KeyError: 'tokenizer_config.json'

Root causes and fixes¶

Missing tokenizer files in the model cache

The model download may have been interrupted, leaving an incomplete cache entry. Delete and re-pull:

squish rm llama3.1:8b
squish pull llama3.1:8b

To verify cache integrity manually:

ls ~/.squish/models/llama3.1-8b/
# Should contain: config.json  tokenizer.json  tokenizer_config.json  *.safetensors (or *.npy)

If any of those files are missing, the pull was incomplete. Re-pull.

Incompatible transformers version

Some tokenizers require a minimum version of transformers:

pip install --upgrade transformers

If you need to pin transformers for another reason, check the model card on HuggingFace for the minimum supported version.

tiktoken not installed

GPT-style models (e.g. Qwen, some Mistral variants) use tiktoken:

pip install tiktoken

sentencepiece not installed

LLaMA-family and Gemma models use SentencePiece:

pip install sentencepiece

tokenizers Rust extension missing

If pip install squish-ai ran in an environment without a Rust compiler and a pre-built wheel was unavailable for your Python version, the tokenizers package may have fallen back to a pure-Python stub that is missing methods:

pip install --upgrade --force-reinstall tokenizers

If the error persists, install Rust and rebuild:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"
pip install --upgrade --force-reinstall tokenizers

Checking which tokenizer a model needs¶

python -c "
import json, pathlib
cfg = pathlib.Path('~/.squish/models/').expanduser()
for p in cfg.rglob('tokenizer_config.json'):
    data = json.loads(p.read_text())
    print(p.parent.name, '->', data.get('tokenizer_class', 'unknown'))
"

MLX version mismatches¶

Symptoms¶

ImportError: cannot import name 'quantize' from 'mlx.core'
AttributeError: module 'mlx.nn' has no attribute 'QuantizedLinear'
TypeError: forward() got an unexpected keyword argument 'cache'
Inference completes but produces garbled / nonsensical output

Why it happens¶

Squish pins a minimum MLX version because the MLX API changes frequently between minor releases. If the installed MLX is older or newer than what Squish expects, the above errors appear.

Check current versions¶

pip show squish mlx mlx-lm

The Requires: field in the squish output lists the accepted MLX range.

Upgrade to the latest compatible MLX¶

pip install --upgrade mlx mlx-lm

Pin to a specific version¶

If the latest MLX introduces a regression, pin to the last known-good version:

pip install "mlx==0.22.0" "mlx-lm==0.22.0"

Replace 0.22.0 with whichever version the Squish release notes specify.

Use a virtual environment

Running inside a venv or conda environment makes version pinning safe and reversible without touching your system Python:

python3 -m venv .venv
source .venv/bin/activate
pip install "squish" "mlx==0.22.0" "mlx-lm==0.22.0"

Force-reinstall if the environment is in a broken state¶

pip install --upgrade --force-reinstall mlx mlx-lm squish

Verify MLX works after reinstall¶

python -c "
import mlx.core as mx
print('MLX version:', mx.__version__)
arr = mx.array([1.0, 2.0, 3.0])
print('Basic computation:', arr * 2)
print('Metal device available:', mx.default_device())
"

Expected output includes Device(gpu, 0) confirming Metal is active.

Known version compatibility table¶

Squish version	Minimum MLX	Maximum tested MLX
0.9.x	0.18.0	0.22.x
0.8.x	0.16.0	0.19.x
0.7.x	0.14.0	0.17.x

Ollama port conflicts¶

Symptoms¶

squish serve exits immediately with OSError: [Errno 48] Address already in use
curl http://localhost:11435/v1/models returns connection refused even after squish serve appeared to start
Requests to Squish unexpectedly return Ollama-formatted responses

Background¶

Squish listens on port 11435 by default. Ollama listens on port 11434 by default. These are different ports, but conflicts can still occur if:

You started Squish on a custom port that clashes with another process
Ollama was reconfigured (via OLLAMA_HOST) to use port 11435
Another Squish instance is already running

Find what is using a port¶

# Replace 11435 with the port in question
lsof -nP -iTCP:11435 -sTCP:LISTEN

The output shows the PID and process name occupying the port.

Stop the conflicting process¶

OllamaAnother Squish instanceOther process

# Stop Ollama gracefully
brew services stop ollama
# or
pkill -f ollama

pkill -f "squish serve"
# or kill by PID from lsof output
kill <PID>

# Kill by PID obtained from lsof
kill <PID>

Run Squish on a different port¶

If you need both Squish and Ollama running at the same time, give each a distinct port:

# Squish on 11435 (default)
squish serve --port 11435

# Ollama stays on 11434 (its default) — no change needed

If port 11435 is already taken by something you cannot stop:

squish serve --port 11435

Then point your OpenAI-compatible client at http://localhost:11435/v1.

brew services stop ollama
brew services disable ollama   # prevents restart on next login

To re-enable later:

brew services enable ollama
brew services start ollama

Verify Squish is listening after startup¶

# Check the port is bound
lsof -nP -iTCP:11435 -sTCP:LISTEN

# Hit the health endpoint
curl -s http://localhost:11435/v1/models | python -m json.tool

A valid response lists the locally cached models.

Common errors and solutions¶

Error message	Likely cause	Fix
`Address already in use` on port 11435	Another process (often a prior Squish instance) is bound to the port	`lsof -nP -iTCP:11435` → `kill <PID>`
`jetsam` killed process	8 GB OOM: model weights exceeded available unified memory	Switch to `--quant int4` or use a smaller model
`MemoryError` during model load	Insufficient free unified memory	Close other apps; use `--quant int4 --ctx 2048`
`RuntimeError: Tokenizer 'tiktoken' not found`	`tiktoken` package not installed	`pip install tiktoken`
`OSError: Can't load tokenizer for '<model>'`	Incomplete model download	`squish rm <model>` → `squish pull <model>`
`KeyError: 'tokenizer_config.json'`	Missing file in model cache	Re-pull the model
`ImportError: cannot import name 'quantize' from 'mlx.core'`	MLX version too old	`pip install --upgrade mlx mlx-lm`
`AttributeError: module 'mlx.nn' has no attribute 'QuantizedLinear'`	MLX version too old	`pip install --upgrade mlx mlx-lm`
`TypeError: forward() got an unexpected keyword argument 'cache'`	MLX / mlx-lm API mismatch	Pin matching versions: see MLX version mismatches
Garbled or nonsensical output	MLX version mismatch producing silent wrong results	`pip install --upgrade --force-reinstall mlx mlx-lm squish-ai`
`pip install squish-ai` fails with Rust/maturin error	Rust compiler not present and no pre-built wheel available	`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs \\| sh` then retry
`squish: command not found` after `pip install squish-ai`	`~/.local/bin` not in `PATH`	Add `export PATH="$HOME/.local/bin:$PATH"` to `~/.zshrc`
Requests to Squish return Ollama JSON format	Client is hitting port 11434 (Ollama) instead of 11435 (Squish)	Update `base_url` to `http://localhost:11435/v1`

Still stuck?¶

Re-run the Quick diagnostics commands and copy the full output.
Check open GitHub issues: your error may already have a fix.
Open a GitHub Discussion with the diagnostic output.

Troubleshooting / FAQ¶

Quick diagnostics¶

8 GB Mac: out of memory (OOM)¶

Symptoms¶

Why it happens¶

Fixes¶

Reference: model memory requirements¶

Tokenizer errors¶

Symptoms¶

Root causes and fixes¶

Checking which tokenizer a model needs¶

MLX version mismatches¶

Symptoms¶

Why it happens¶

Check current versions¶

Upgrade to the latest compatible MLX¶

Pin to a specific version¶

Force-reinstall if the environment is in a broken state¶

Verify MLX works after reinstall¶

Known version compatibility table¶

Ollama port conflicts¶

Symptoms¶

Background¶

Find what is using a port¶

Stop the conflicting process¶

Run Squish on a different port¶

Prevent Ollama from auto-starting at login¶

Verify Squish is listening after startup¶

Common errors and solutions¶

Still stuck?¶