# Quickstart
Get from zero to a running 8B chat model in under two minutes.
## 1. Pull a model
Squish downloads pre-compressed INT8 weights from the `squish-community` Hugging Face org:
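A minimal invocation might look like the following; the `pull` subcommand and the exact model tag format are assumptions based on the API examples later in this page, not confirmed syntax:

```shell
# Download the INT8-quantized weights for Llama 3.1 8B
# (subcommand name and tag format assumed)
squish pull llama3.1:8b
```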
Progress is shown as weights are streamed. Models are cached in `~/.squish/models/`.
To see all available models:
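One plausible form, assuming listing remote models hangs off the `pull` subcommand (the `--list` flag is illustrative):

```shell
# List models available for download (flag name assumed)
squish pull --list
```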
## 2. Chat interactively
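A sketch of starting a session, assuming a `chat` subcommand that takes the model tag:

```shell
# Open an interactive chat with the pulled model (subcommand assumed)
squish chat llama3.1:8b
```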
This opens a REPL-style chat loop. Type your message and press Enter. Use Ctrl+D or `/exit` to quit.
## 3. Single-turn prompt
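To generate a reply for one prompt and exit, something like the following should work; the `run` subcommand is an assumption by analogy with similar CLIs, not confirmed syntax:

```shell
# One-shot generation: prints the completion to stdout and exits
# (subcommand name assumed)
squish run llama3.1:8b "Why is the sky blue?"
```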
## 4. Start the API server
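Launching the server might look like this; the `serve` subcommand name is an assumption:

```shell
# Start the OpenAI-compatible HTTP server (subcommand assumed);
# defaults are described below
squish serve
```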
The server binds to `http://localhost:11435` by default and is OpenAI-compatible.
## 5. Call the API
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="not-needed",  # squish ignores the key by default
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```
## 6. Batch inference
Send multiple prompts in a single request with the `batch` field:
```shell
curl http://localhost:11435/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "batch": [
      "The capital of France is",
      "The largest planet is",
      "Water boils at"
    ]
  }'
```
## 7. Manage local models
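Typical management commands, with subcommand names assumed by analogy with similar model-runner CLIs rather than taken from squish itself:

```shell
# Show models cached locally in ~/.squish/models/ (subcommand assumed)
squish list

# Remove a cached model to free disk space (subcommand assumed)
squish rm llama3.1:8b
```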
## Next steps
- API Reference — full endpoint documentation
- Architecture — how INT8 mmap compression works
- Contributing — add a model, fix a bug, write a test