# Quickstart
Get from zero to a running 8B chat model in under two minutes.
## 1. Pull a model
Squish downloads pre-compressed INT8 weights from the `squish-community` Hugging Face org:
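A minimal invocation might look like the following; the `pull` subcommand and the exact model tag format are assumptions based on the API examples later in this page, not confirmed syntax:

```shell
# Download the INT8-quantized weights for Llama 3.1 8B
# (subcommand name and tag format assumed)
squish pull llama3.1:8b
```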
Progress is shown as weights are streamed. Models are cached in `~/.squish/models/`.
To see all available models:
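One plausible form, assuming listing remote models hangs off the `pull` subcommand (the `--list` flag is illustrative):

```shell
# List models available for download (flag name assumed)
squish pull --list
```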
## 2. Chat interactively
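A sketch of starting a session, assuming a `chat` subcommand that takes the model tag:

```shell
# Open an interactive chat with the pulled model (subcommand assumed)
squish chat llama3.1:8b
```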
This opens a REPL-style chat loop. Type your message and press Enter. Use Ctrl+D or `/exit` to quit.
## 3. Single-turn prompt
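To generate a reply for one prompt and exit, something like the following should work; the `run` subcommand is an assumption by analogy with similar CLIs, not confirmed syntax:

```shell
# One-shot generation: prints the completion to stdout and exits
# (subcommand name assumed)
squish run llama3.1:8b "Why is the sky blue?"
```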
## 4. Start the API server
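Launching the server might look like this; the `serve` subcommand name is an assumption:

```shell
# Start the OpenAI-compatible HTTP server (subcommand assumed);
# defaults are described below
squish serve
```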
The server binds to `http://localhost:11435` by default and is OpenAI-compatible.
## 5. Call the API
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="not-needed",  # squish ignores the key by default
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```
## 6. Batch inference
Send multiple prompts in a single request with the `batch` field:
```shell
curl http://localhost:11435/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "batch": [
      "The capital of France is",
      "The largest planet is",
      "Water boils at"
    ]
  }'
```
## 7. Manage local models
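Typical management commands, with subcommand names assumed by analogy with similar model-runner CLIs rather than taken from squish itself:

```shell
# Show models cached locally in ~/.squish/models/ (subcommand assumed)
squish list

# Remove a cached model to free disk space (subcommand assumed)
squish rm llama3.1:8b
```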
## Next steps
- API Reference — full endpoint documentation
- Architecture — how INT8 mmap compression works
- Contributing — add a model, fix a bug, write a test