HTTP API (Ollama / OpenAI)
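TensorSharp.Server exposes three API styles plus a few utility endpoints, all served from http://localhost:5000. Existing Ollama and OpenAI clients can be pointed at this base URL. Note that Ollama-style chat is served at /api/chat/ollama, because /api/chat is used by the Web UI protocol.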
| Style | Endpoints |
|---|---|
| Ollama-compatible | /api/generate, /api/chat/ollama, /api/tags, /api/show |
| OpenAI-compatible | /v1/chat/completions, /v1/models |
| Web UI (SSE) | /api/chat, /api/sessions, /api/models, /api/upload |
| Utilities | /api/version, /api/queue/status |
Requests must name the hosted GGUF file (or its basename). The server hosts one model selected with --model; see Server & Web UI.
1 · Ollama-compatible API
List & show models
curl http://localhost:5000/api/tags
curl -X POST http://localhost:5000/api/show \
-H "Content-Type: application/json" \
-d '{"model": "Qwen3-4B-Q8_0.gguf"}'
Generate (non-streaming)
curl -X POST http://localhost:5000/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-4B-Q8_0.gguf",
"prompt": "What is 1+1?",
"stream": false,
"options": { "num_predict": 50, "temperature": 0.7, "top_p": 0.9 }
}'
{
"model": "Qwen3-4B-Q8_0.gguf",
"response": "1+1 equals 2.",
"done": true,
"done_reason": "stop",
"prompt_eval_count": 15,
"eval_count": 10,
"prompt_cache_hit_tokens": 0,
"prompt_cache_hit_ratio": 0.0
}
prompt_cache_hit_tokens reports how many prompt tokens were served straight from the prior turn's KV cache. /api/generate always resets the session, so it is always 0; it is non-zero on /api/chat/ollama when the prompt prefix matches a previous turn.
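As a rough client-side check of the cache fields, the sketch below derives a hit fraction from a response body. It assumes the ratio is simply prompt_cache_hit_tokens divided by prompt_eval_count on a 0 to 1 scale; the source does not spell out how the server computes prompt_cache_hit_ratio, and the helper name cache_hit_fraction is made up for this example.

```python
def cache_hit_fraction(response: dict) -> float:
    """Fraction of prompt tokens served from the KV cache (0.0 when unknown).

    Assumption: hits / prompt tokens on a 0..1 scale; the server's own
    prompt_cache_hit_ratio may be defined differently.
    """
    prompt_tokens = response.get("prompt_eval_count", 0)
    hit_tokens = response.get("prompt_cache_hit_tokens", 0)
    if prompt_tokens <= 0:
        return 0.0
    return hit_tokens / prompt_tokens


# The /api/generate response shown above: no cache reuse.
print(cache_hit_fraction({"prompt_eval_count": 15, "prompt_cache_hit_tokens": 0}))
```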
Generate (streaming)
curl -X POST http://localhost:5000/api/generate \
-H "Content-Type: application/json" \
-d '{ "model": "Qwen3-4B-Q8_0.gguf", "prompt": "Tell me a joke.", "stream": true, "options": {"num_predict": 100} }'
Each line is a JSON object (newline-delimited JSON); the final "done": true chunk carries timing and cache fields.
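A minimal sketch of consuming that newline-delimited stream in Python. It assumes each chunk carries its text fragment in a response field, as the non-streaming body does; the helper name collect_ndjson and the sample lines are illustrative only.

```python
import json
from typing import Iterable, Tuple


def collect_ndjson(lines: Iterable[str]) -> Tuple[str, dict]:
    """Join streamed text fragments and return (text, final chunk)."""
    parts = []
    final = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(line)
        # Assumption: each chunk carries its fragment in "response".
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            final = chunk  # the last chunk holds timing and cache fields
            break
    return "".join(parts), final


# Made-up sample lines standing in for a live stream.
sample = [
    '{"response": "1+1", "done": false}',
    '{"response": " equals 2.", "done": false}',
    '{"response": "", "done": true, "done_reason": "stop"}',
]
text, final = collect_ndjson(sample)
print(text)
```

With the requests library, the same function can be fed from requests.post(..., stream=True).iter_lines(decode_unicode=True).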
Chat (multi-turn)
curl -X POST http://localhost:5000/api/chat/ollama \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-4B-Q8_0.gguf",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"stream": false,
"options": {"num_predict": 100}
}'
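To continue the conversation, a client would typically resend the full history with the assistant reply appended, which also keeps the prompt prefix identical to the previous turn. The sketch below assumes the non-streaming reply has the message object shown in the thinking example later in this section; append_turn is a hypothetical helper name.

```python
def append_turn(messages: list, reply: dict, next_user_text: str) -> list:
    """Return a new history with the assistant reply and the next user message."""
    assistant = reply["message"]
    return messages + [
        # Keep only role and content so the replayed prefix stays stable.
        {"role": assistant["role"], "content": assistant["content"]},
        {"role": "user", "content": next_user_text},
    ]


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
# Illustrative reply body, not captured server output.
reply = {"message": {"role": "assistant", "content": "Paris."}, "done": True}
history = append_turn(history, reply, "And of Italy?")
```

The returned list can be sent as the messages field of the next /api/chat/ollama request.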
Generate with an image (multimodal)
Images are sent as base64-encoded bytes in the images array:
# Strip line breaks: GNU base64 wraps its output, which would break the JSON string.
IMG_B64=$(base64 < photo.png | tr -d '\n')
curl -X POST http://localhost:5000/api/generate \
-H "Content-Type: application/json" \
-d "{
\"model\": \"gemma-4-E4B-it-Q8_0.gguf\",
\"prompt\": \"What is in this image?\",
\"images\": [\"$IMG_B64\"],
\"stream\": false,
\"options\": {\"num_predict\": 200}
}"
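The same request body can be built in Python, which sidesteps shell quoting altogether. This is a sketch only: image_generate_payload is a hypothetical helper, and the field names are taken from the curl example above.

```python
import base64


def image_generate_payload(model: str, prompt: str, image_bytes: bytes,
                           num_predict: int = 200) -> dict:
    """Build an /api/generate body with one base64-encoded image."""
    # b64encode emits a single line, so the JSON string stays valid.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "prompt": prompt,
        "images": [encoded],
        "stream": False,
        "options": {"num_predict": num_predict},
    }


# Placeholder bytes; a real call would read photo.png in binary mode.
payload = image_generate_payload(
    "gemma-4-E4B-it-Q8_0.gguf", "What is in this image?", b"\x89PNG\r\n"
)
```

The dictionary can be passed as json=payload to requests.post.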
Chat with thinking mode
Thinking-capable models accept "think": true and split chain-of-thought into message.thinking:
curl -X POST http://localhost:5000/api/chat/ollama \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-4B-Q8_0.gguf",
"messages": [{"role": "user", "content": "Solve 17 * 23 step by step."}],
"think": true, "stream": false, "options": {"num_predict": 200}
}'
{
"message": {
"role": "assistant",
"content": "17 * 23 = 391.",
"thinking": "17 * 20 = 340. 17 * 3 = 51. 340 + 51 = 391."
},
"done": true, "done_reason": "stop"
}
2 · OpenAI-compatible API
Chat Completions (non-streaming)
curl -X POST http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-4B-Q8_0.gguf",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2+3?"}
],
"max_tokens": 50, "temperature": 0.7
}'
{
"id": "chatcmpl-abc123...",
"object": "chat.completion",
"model": "Qwen3-4B-Q8_0.gguf",
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "2 + 3 = 5."},
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 20, "completion_tokens": 8, "total_tokens": 28,
"prompt_tokens_details": { "cached_tokens": 0 }
}
}
usage.prompt_tokens_details.cached_tokens mirrors the OpenAI field of the same name and reports how many prompt tokens were served from the KV cache. On a follow-up turn that shares a prefix with the previous one, it approaches prompt_tokens.
Chat Completions (streaming)
curl -X POST http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{ "model": "Qwen3-4B-Q8_0.gguf", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50, "stream": true }'
Each chunk is an SSE data: frame of object: "chat.completion.chunk"; the stream ends with data: [DONE].
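For clients that do not use the openai SDK, the frames can be parsed by hand. The sketch below assumes each chunk follows the usual OpenAI layout with the text in choices[].delta.content; collect_sse_content and the sample frames are illustrative.

```python
import json
from typing import Iterable


def collect_sse_content(lines: Iterable[str]) -> str:
    """Concatenate delta.content from SSE data: frames until [DONE]."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                parts.append(piece)
    return "".join(parts)


# Made-up frames standing in for a live stream.
frames = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Hello"}}]}',
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": " there!"}}]}',
    "data: [DONE]",
]
print(collect_sse_content(frames))
```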
Image input (OpenAI format)
# Strip line breaks: GNU base64 wraps its output, which would break the JSON string.
IMG_B64=$(base64 < photo.png | tr -d '\n')
curl -X POST http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"gemma-4-E4B-it-Q8_0.gguf\",
\"messages\": [{
\"role\": \"user\",
\"content\": [
{\"type\": \"text\", \"text\": \"What is in this image?\"},
{\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/png;base64,$IMG_B64\"}}
]
}],
\"max_tokens\": 200
}"
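A Python sketch of the same message shape, for callers who prefer to build the data URL in code. image_message is a hypothetical helper, and the MIME type defaults to image/png to match the example above.

```python
import base64


def image_message(text: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a user message with a text part and an inline data-URL image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


# Placeholder bytes; a real call would read photo.png in binary mode.
msg = image_message("What is in this image?", b"\x89PNG")
```

The result goes into the messages list of a /v1/chat/completions request, for example through the openai SDK.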
3 · Tool calling over HTTP
Send a tools array; the server detects the tool-call wire format used by the model's architecture and returns structured tool_calls.
curl -X POST http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-4B-Q8_0.gguf",
"messages": [{"role": "user", "content": "What is the weather in Paris?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city.",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string"},
"units": {"type": "string", "enum": ["c", "f"]}
},
"required": ["city"]
}
}
}],
"max_tokens": 200
}'
{
"choices": [{
"message": {
"role": "assistant", "content": null,
"tool_calls": [{
"id": "call_abc123", "type": "function",
"function": { "name": "get_weather", "arguments": "{\"city\":\"Paris\",\"units\":\"c\"}" }
}]
},
"finish_reason": "tool_calls"
}]
}
Continue the loop by appending the assistant tool_calls plus a {"role": "tool", "tool_call_id": "...", "content": "..."} message with the function result, then calling the endpoint again. (The Ollama endpoint uses the same flow with a role: "tool" message.)
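One way to sketch that follow-up step in Python is shown below. The get_weather implementation and its canned result are placeholders, and tool_followup is a hypothetical helper; only the message shapes come from the example above.

```python
import json


def get_weather(city: str, units: str = "c") -> dict:
    # Placeholder implementation with a canned result, for illustration only.
    return {"city": city, "units": units, "temperature": 18}


TOOLS = {"get_weather": get_weather}


def tool_followup(messages: list, assistant_message: dict) -> list:
    """Append the assistant tool_calls message plus one tool result per call."""
    followup = list(messages) + [assistant_message]
    for call in assistant_message.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        # arguments arrives as a JSON-encoded string, not an object.
        args = json.loads(call["function"]["arguments"])
        followup.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return followup


messages = [{"role": "user", "content": "What is the weather in Paris?"}]
assistant = {
    "role": "assistant", "content": None,
    "tool_calls": [{
        "id": "call_abc123", "type": "function",
        "function": {"name": "get_weather",
                     "arguments": "{\"city\":\"Paris\",\"units\":\"c\"}"},
    }],
}
next_messages = tool_followup(messages, assistant)
```

Sending next_messages back with the same tools array lets the model turn the tool result into a final answer.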
4 · Structured outputs (JSON schema)
The OpenAI response_format accepts text, json_object, and validated json_schema. The server injects strict JSON instructions and validates the output before returning it.
curl -X POST http://localhost:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-4B-Q8_0.gguf",
"messages": [
{"role": "system", "content": "You are a concise extraction assistant."},
{"role": "user", "content": "Extract the city and country from: Paris, France."}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "location_extraction", "strict": true,
"schema": {
"type": "object",
"properties": {
"city": { "type": "string" },
"country": { "type": "string" },
"confidence": { "type": ["string", "null"] }
},
"required": ["city", "country", "confidence"],
"additionalProperties": false
}
}
},
"max_tokens": 120
}'
json_schema cannot be combined with tools or think. Invalid schemas return HTTP 400; outputs that fail validation return HTTP 422.
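A client can branch on those status codes before trusting the content. The sketch below is one possible shape: the exception names and parse_structured are made up for this example, the error body format is not specified by the source, and a 400 may also cover other malformed requests.

```python
import json


class RequestRejected(Exception):
    """HTTP 400: the request (for example, the schema) was not accepted."""


class OutputRejected(Exception):
    """HTTP 422: the model output failed schema validation on the server."""


def parse_structured(status_code: int, body: dict) -> dict:
    """Return the parsed JSON content, or raise according to the status code."""
    if status_code == 400:
        raise RequestRejected(body)
    if status_code == 422:
        raise OutputRejected(body)
    if status_code != 200:
        raise RuntimeError(f"unexpected status {status_code}")
    # The structured result is a JSON string inside message.content.
    return json.loads(body["choices"][0]["message"]["content"])


# Illustrative success body, not captured server output.
ok_body = {"choices": [{"message": {
    "role": "assistant",
    "content": "{\"city\": \"Paris\", \"country\": \"France\", \"confidence\": null}",
}}]}
result = parse_structured(200, ok_body)
```

With requests, this would be called as parse_structured(resp.status_code, resp.json()), assuming error responses also carry a JSON body.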
5 · Web UI SSE protocol (/api/chat)
This is the protocol the bundled chat UI uses; external UIs can plug into the same endpoint. Every event is a JSON object delivered as a single data: … SSE frame.
# Create / dispose a session (Web UI flow only)
curl -X POST http://localhost:5000/api/sessions # {"sessionId":"a3b1c2..."}
curl -X DELETE http://localhost:5000/api/sessions/a3b1c2...
# Streaming chat
curl -N -X POST http://localhost:5000/api/chat \
-H "Content-Type: application/json" \
-d '{ "messages": [{"role": "user", "content": "Hi"}], "maxTokens": 50, "sessionId": null, "newChat": false, "think": false, "tools": [] }'
Reusing the same sessionId splices prior assistant tokens into the KV-cache prefix on the next turn. The terminal frame reports reuse:
data: {"done":true,"tokenCount":187,"elapsed":2.143,"tokPerSec":87.23,"promptTokens":512,"kvReusedTokens":420,"kvReusePercent":82.0}
Event fields include token (content), thinking (reasoning chunk), tool_calls, replace/diffusionStep (DiffusionGemma previews), and the terminal done summary.
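A minimal sketch of an event loop for this protocol, handling only token, thinking, and the terminal done frame. It assumes token and thinking values are plain strings and ignores tool_calls and the diffusion preview fields; consume_events and the first two sample frames are illustrative.

```python
import json
from typing import Iterable, Optional, Tuple


def consume_events(lines: Iterable[str]) -> Tuple[str, str, Optional[dict]]:
    """Return (content, thinking, done summary) from Web UI SSE frames."""
    content, thinking, summary = [], [], None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators between frames
        event = json.loads(line[len("data:"):].strip())
        if "token" in event:
            content.append(event["token"])
        if "thinking" in event:
            thinking.append(event["thinking"])
        if event.get("done"):
            summary = event  # terminal frame with timing and KV reuse stats
            break
    return "".join(content), "".join(thinking), summary


frames = [
    'data: {"thinking": "Greeting the user."}',
    'data: {"token": "Hi"}',
    'data: {"token": " there"}',
    'data: {"done":true,"tokenCount":187,"elapsed":2.143,"tokPerSec":87.23,'
    '"promptTokens":512,"kvReusedTokens":420,"kvReusePercent":82.0}',
]
text, reasoning, summary = consume_events(frames)
```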
6 · Sampling parameters
Ollama-style (inside the options object)
| Parameter | Type | Default | Description |
|---|---|---|---|
| num_predict | int | 200 | Maximum tokens to generate |
| temperature | float | 0 | Sampling temperature (0 = greedy) |
| top_k | int | 0 | Top-K filtering (0 = disabled) |
| top_p | float | 1.0 | Nucleus sampling threshold |
| min_p | float | 0 | Minimum probability filtering |
| repeat_penalty | float | 1.0 | Repetition penalty |
| presence_penalty / frequency_penalty | float | 0 | Presence / frequency penalties |
| seed | int | -1 | Random seed (-1 = random) |
| stop | array | null | Stop sequences |
OpenAI-style (top-level)
max_tokens, temperature, top_p, presence_penalty, frequency_penalty, seed, stop, and response_format (text / json_object / json_schema).
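When the same settings need to work against both styles, a small translation step can help. The sketch below maps an Ollama-style options object to the OpenAI-style top-level parameters listed above and reports the keys that have no listed counterpart; the helper name is hypothetical, and treating num_predict as the equivalent of max_tokens is an assumption based on their descriptions.

```python
from typing import List, Tuple

# Parameters that appear under the same name in both lists above.
SHARED = ("temperature", "top_p", "presence_penalty",
          "frequency_penalty", "seed", "stop")


def ollama_options_to_openai(options: dict) -> Tuple[dict, List[str]]:
    """Return (OpenAI-style params, names of options with no listed equivalent)."""
    params = {key: options[key] for key in SHARED if key in options}
    if "num_predict" in options:
        # Assumption: num_predict and max_tokens both cap generated tokens.
        params["max_tokens"] = options["num_predict"]
    dropped = sorted(
        key for key in options if key not in SHARED and key != "num_predict"
    )
    return params, dropped


params, dropped = ollama_options_to_openai(
    {"num_predict": 100, "temperature": 0.7, "top_k": 40, "min_p": 0.05}
)
```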
7 · Python client examples
Using requests (Ollama-style)
import requests
resp = requests.post("http://localhost:5000/api/generate", json={
    "model": "Qwen3-4B-Q8_0.gguf",
    "prompt": "What is machine learning?",
    "stream": False,
    "options": {"num_predict": 100, "temperature": 0.7}
})
print(resp.json()["response"])
Using the openai SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="Qwen3-4B-Q8_0.gguf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2+3?"}
    ],
    max_tokens=50, temperature=0.7
)
print(response.choices[0].message.content)
Streaming with the openai SDK
# Reuses the client created in the previous example.
stream = client.chat.completions.create(
    model="Qwen3-4B-Q8_0.gguf",
    messages=[{"role": "user", "content": "Tell me about Python."}],
    max_tokens=200, stream=True
)
for chunk in stream:
    # Skip chunks that carry no choices or no text delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()