API Reference¶

Squish exposes an OpenAI-compatible REST API on http://localhost:11435 by default.

Authentication¶

By default the server accepts requests without an API key.

To require a key, set the environment variable before starting the server:

export SQUISH_API_KEY=my-secret-key
squish serve

Pass the key using the standard Authorization: Bearer <key> header.

Environment Variables¶

Variable	Default	Description
`SQUISH_API_KEY`	(none)	When set, all API requests must supply `Authorization: Bearer <key>`
`HF_TOKEN`	(none)	HuggingFace access token — required for gated models
`SQUISH_OFFLINE`	`0`	Set to `1` to disable all network access (model must already be cached)
`SQUISH_CACHE_DIR`	`~/.squish/models`	Override the default model cache directory

Endpoints¶

`GET /v1/models`¶

Lists all locally available models.

Response

{
  "object": "list",
  "data": [
    {
      "id": "llama3.1:8b",
      "object": "model",
      "created": 1720000000,
      "owned_by": "squish"
    }
  ]
}

`POST /v1/chat/completions`¶

OpenAI-compatible chat completion.

Request body

Field	Type	Required	Default	Description
`model`	string	✅	—	Model ID (e.g. `llama3.1:8b`)
`messages`	array	✅	—	Array of `{"role": "...", "content": "..."}`
`max_tokens`	integer		512	Maximum tokens to generate
`temperature`	float		0.7	Sampling temperature
`top_p`	float		0.9	Top-p nucleus sampling
`stream`	boolean		false	Stream tokens via SSE
`stop`	string/array		null	Stop sequence(s)

Example

curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "What is MLX?"}
    ],
    "max_tokens": 128,
    "temperature": 0.5
  }'

Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1720000001,
  "model": "llama3.1:8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "MLX is Apple's machine-learning framework optimised for Apple Silicon..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 42,
    "total_tokens": 66
  }
}

`POST /v1/completions`¶

Text completion (non-chat). Supports single prompts and batched requests.

Request body

Field	Type	Required	Default	Description
`model`	string	✅	—	Model ID
`prompt`	string	✅*	—	Single prompt text
`batch`	array	✅*	—	Array of prompt strings (mutually exclusive with `prompt`)
`max_tokens`	integer		256	Maximum tokens per completion
`temperature`	float		0.7	Sampling temperature
`top_p`	float		0.9	Top-p nucleus sampling

*Either prompt or batch is required.

Single prompt example

curl http://localhost:11435/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "prompt": "Once upon a time"}'

Batch example

curl http://localhost:11435/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "batch": ["The sky is", "The ocean is", "The forest is"],
    "max_tokens": 32
  }'

`GET /health`¶

Liveness probe.

curl http://localhost:11435/health
# {"status": "ok"}

Streaming¶

Set "stream": true in a /v1/chat/completions request to receive tokens via Server-Sent Events (SSE), exactly like the OpenAI API:

curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","messages":[{"role":"user","content":"Count to 5"}],"stream":true}'

Each SSE event is a JSON delta. The stream ends with data: [DONE].

Error responses¶

HTTP status	Meaning
400	Bad request — missing or invalid fields
401	Unauthorized — invalid or missing API key
404	Model not found — run `squish pull <model>` first
429	Too many requests — queue is full, back off and retry
500	Internal server error — check server logs

CLI reference¶

Command	Description
`squish pull <model>`	Download + compress a model
`squish run <model>`	Interactive chat REPL
`squish run <model> --prompt "..."`	Single-turn inference
`squish serve`	Start the API server
`squish serve --port N`	Custom port
`squish list`	List downloaded models
`squish rm <model>`	Delete a model
`squish search [query]`	Search the community hub