HTTP API (Ollama / OpenAI)

TensorSharp.Server exposes three API styles plus utilities, all on http://localhost:5000. Existing Ollama and OpenAI clients work unchanged once they are pointed at the local base URL.

| Style | Endpoints |
| --- | --- |
| Ollama-compatible | /api/generate, /api/chat/ollama, /api/tags, /api/show |
| OpenAI-compatible | /v1/chat/completions, /v1/models |
| Web UI (SSE) | /api/chat, /api/sessions, /api/models, /api/upload |
| Utilities | /api/version, /api/queue/status |
Note: Requests must name the hosted GGUF file (or its basename). The server hosts a single model, selected with --model; see Server & Web UI.

1 · Ollama-compatible API

List & show models

curl http://localhost:5000/api/tags

curl -X POST http://localhost:5000/api/show \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-4B-Q8_0.gguf"}'

Generate (non-streaming)

curl -X POST http://localhost:5000/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "prompt": "What is 1+1?",
    "stream": false,
    "options": { "num_predict": 50, "temperature": 0.7, "top_p": 0.9 }
  }'

Example response:

{
  "model": "Qwen3-4B-Q8_0.gguf",
  "response": "1+1 equals 2.",
  "done": true,
  "done_reason": "stop",
  "prompt_eval_count": 15,
  "eval_count": 10,
  "prompt_cache_hit_tokens": 0,
  "prompt_cache_hit_ratio": 0.0
}

prompt_cache_hit_tokens reports how many prompt tokens were served straight from the prior turn's KV cache. /api/generate always resets the session, so it is always 0; it is non-zero on /api/chat/ollama when the prompt prefix matches a previous turn.

Generate (streaming)

curl -X POST http://localhost:5000/api/generate \
  -H "Content-Type: application/json" \
  -d '{ "model": "Qwen3-4B-Q8_0.gguf", "prompt": "Tell me a joke.", "stream": true, "options": {"num_predict": 100} }'

Each line is a JSON object (newline-delimited JSON); the final "done": true chunk carries timing and cache fields.
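
The stream can be consumed line by line. The following is a minimal Python sketch, not part of TensorSharp: collect_ndjson is a hypothetical helper, and it assumes each streamed chunk carries its text fragment in a "response" field, as the non-streaming reply does. The sample chunks are hand-written, not captured server output.

```python
import json
from typing import Iterable, Tuple


def collect_ndjson(lines: Iterable[bytes]) -> Tuple[str, dict]:
    """Join streamed text fragments; return (text, final "done" chunk)."""
    parts = []
    final = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:  # skip blank lines between chunks
            continue
        chunk = json.loads(raw)
        # Assumption: streamed chunks carry their text in "response".
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            final = chunk  # timing and cache fields live here
            break
    return "".join(parts), final


# Offline demonstration with hand-written chunks.
sample = [
    b'{"response": "Why did", "done": false}\n',
    b'{"response": " the chicken...", "done": false}\n',
    b'{"response": "", "done": true, "done_reason": "stop"}\n',
]
text, final = collect_ndjson(sample)
print(text)
```

Against a live server, the same function should accept the response object returned by urllib.request.urlopen, since iterating that object yields one line at a time.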

Chat (multi-turn)

curl -X POST http://localhost:5000/api/chat/ollama \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "stream": false,
    "options": {"num_predict": 100}
  }'

Generate with an image (multimodal)

Images are sent as base64-encoded bytes in the images array:

# Strip newlines: GNU base64 wraps its output, which would break the JSON string.
IMG_B64=$(base64 < photo.png | tr -d '\n')
curl -X POST http://localhost:5000/api/generate \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gemma-4-E4B-it-Q8_0.gguf\",
    \"prompt\": \"What is in this image?\",
    \"images\": [\"$IMG_B64\"],
    \"stream\": false,
    \"options\": {\"num_predict\": 200}
  }"
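
The same request body can be built in Python, which avoids shell quoting. This is a sketch under stated assumptions: build_image_request is a hypothetical helper, and the placeholder bytes stand in for the contents of photo.png.

```python
import base64
import json


def build_image_request(model: str, prompt: str, image_bytes: bytes,
                        num_predict: int = 200) -> dict:
    """Build an /api/generate body with one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        # b64encode emits no line breaks, so the string is safe inside JSON.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
        "options": {"num_predict": num_predict},
    }


# Placeholder bytes; a real call would read them from photo.png.
body = build_image_request("gemma-4-E4B-it-Q8_0.gguf",
                           "What is in this image?", b"\x89PNG fake")
payload = json.dumps(body)  # ready to POST to /api/generate
```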

Chat with thinking mode

Thinking-capable models accept "think": true and split chain-of-thought into message.thinking:

curl -X POST http://localhost:5000/api/chat/ollama \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [{"role": "user", "content": "Solve 17 * 23 step by step."}],
    "think": true, "stream": false, "options": {"num_predict": 200}
  }'

Example response:

{
  "message": {
    "role": "assistant",
    "content": "17 * 23 = 391.",
    "thinking": "17 * 20 = 340. 17 * 3 = 51. 340 + 51 = 391."
  },
  "done": true, "done_reason": "stop"
}

2 · OpenAI-compatible API

Chat Completions (non-streaming)

curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is 2+3?"}
    ],
    "max_tokens": 50, "temperature": 0.7
  }'

Example response:

{
  "id": "chatcmpl-abc123...",
  "object": "chat.completion",
  "model": "Qwen3-4B-Q8_0.gguf",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "2 + 3 = 5."},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 20, "completion_tokens": 8, "total_tokens": 28,
    "prompt_tokens_details": { "cached_tokens": 0 }
  }
}

usage.prompt_tokens_details.cached_tokens follows OpenAI's KV-cache-hit extension: on a follow-up turn that shares a prefix with the previous one, it approaches prompt_tokens.
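
To track how much of a prompt was served from cache, the fraction can be derived from the usage object. A minimal sketch: cached_fraction is a hypothetical helper, and the follow-up numbers are invented for illustration.

```python
def cached_fraction(usage: dict) -> float:
    """Return cached_tokens / prompt_tokens, or 0.0 when unavailable."""
    prompt = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0


# First turn, using the usage values from the example response.
first = {"prompt_tokens": 20, "completion_tokens": 8, "total_tokens": 28,
         "prompt_tokens_details": {"cached_tokens": 0}}
# Invented follow-up turn that shares a prefix with the first.
follow_up = {"prompt_tokens": 40,
             "prompt_tokens_details": {"cached_tokens": 30}}

print(cached_fraction(first), cached_fraction(follow_up))
```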

Chat Completions (streaming)

curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "Qwen3-4B-Q8_0.gguf", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50, "stream": true }'

Each chunk is an SSE data: frame of object: "chat.completion.chunk"; the stream ends with data: [DONE].
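
For clients that do not use an SDK, the frames can be parsed by hand. The sketch below assumes text fragments arrive in choices[].delta.content, as in OpenAI's streaming format; iter_sse_deltas is a hypothetical helper and the sample frames are hand-written.

```python
import json
from typing import Iterable, Iterator


def iter_sse_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from chat.completion.chunk SSE frames."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            # Assumption: fragments live in delta.content.
            piece = (choice.get("delta") or {}).get("content")
            if piece:
                yield piece


# Offline demonstration with hand-written frames.
frames = [
    'data: {"object": "chat.completion.chunk", "choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"delta": {"content": "Hello"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"delta": {"content": " there"}}]}',
    "",
    "data: [DONE]",
]
reply = "".join(iter_sse_deltas(frames))
print(reply)
```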

Image input (OpenAI format)

# Strip newlines: GNU base64 wraps its output, which would break the data URL.
IMG_B64=$(base64 < photo.png | tr -d '\n')
curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gemma-4-E4B-it-Q8_0.gguf\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"text\", \"text\": \"What is in this image?\"},
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/png;base64,$IMG_B64\"}}
      ]
    }],
    \"max_tokens\": 200
  }"
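
In Python, the content parts can be assembled with small helpers. This is a sketch only: image_part and user_message are hypothetical names, and the MIME type is guessed from the file extension rather than from the image bytes.

```python
import base64
import mimetypes


def image_part(filename: str, data: bytes) -> dict:
    """Build an image_url content part holding a base64 data URL."""
    # Guess the MIME type from the extension; fall back to a generic type.
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def user_message(text: str, *image_parts: dict) -> dict:
    """Combine a text part and any image parts into one user message."""
    return {"role": "user",
            "content": [{"type": "text", "text": text}, *image_parts]}


# Placeholder bytes stand in for the real file contents.
msg = user_message("What is in this image?", image_part("photo.png", b"abc"))
print(msg["content"][1]["image_url"]["url"])
```

The resulting dictionary goes into the messages list of a /v1/chat/completions request.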

3 · Tool calling over HTTP

Send a tools array; the server detects the model architecture's tool-call wire format and returns structured tool_calls.

curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
          "type": "object",
          "properties": {
            "city":  {"type": "string"},
            "units": {"type": "string", "enum": ["c", "f"]}
          },
          "required": ["city"]
        }
      }
    }],
    "max_tokens": 200
  }'

Example response:

{
  "choices": [{
    "message": {
      "role": "assistant", "content": null,
      "tool_calls": [{
        "id": "call_abc123", "type": "function",
        "function": { "name": "get_weather", "arguments": "{\"city\":\"Paris\",\"units\":\"c\"}" }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}

Continue the loop by appending the assistant message (including its tool_calls) and a {"role": "tool", "tool_call_id": "...", "content": "..."} message carrying the function result, then call the endpoint again. The Ollama endpoint follows the same flow with a role: "tool" message.
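
One turn of that loop might look like the following on the client side. This is a sketch, not the server's implementation: get_weather is a stand-in tool with an invented return value, and TOOLS and append_tool_results are hypothetical names.

```python
import json


def get_weather(city: str, units: str = "c") -> dict:
    """Stand-in tool; a real one would query a weather service."""
    return {"city": city, "units": units, "temp": 18}


TOOLS = {"get_weather": get_weather}


def append_tool_results(messages: list, assistant_message: dict) -> list:
    """Return messages extended with the assistant turn and tool results."""
    out = list(messages) + [assistant_message]
    for call in assistant_message.get("tool_calls") or []:
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments is a JSON string
        result = TOOLS[fn["name"]](**args)
        out.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to its call
            "content": json.dumps(result),
        })
    return out


history = [{"role": "user", "content": "What is the weather in Paris?"}]
assistant = {
    "role": "assistant", "content": None,
    "tool_calls": [{
        "id": "call_abc123", "type": "function",
        "function": {"name": "get_weather",
                     "arguments": "{\"city\":\"Paris\",\"units\":\"c\"}"},
    }],
}
follow_up = append_tool_results(history, assistant)
print(json.dumps(follow_up[-1]))
```

Posting follow_up as the messages array, with the same tools definition, asks the model to produce its final answer from the tool result.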

4 · Structured outputs (JSON schema)

The OpenAI response_format accepts text, json_object, and validated json_schema. The server injects strict JSON instructions and validates the output before returning it.

curl -X POST http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-4B-Q8_0.gguf",
    "messages": [
      {"role": "system", "content": "You are a concise extraction assistant."},
      {"role": "user", "content": "Extract the city and country from: Paris, France."}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "location_extraction", "strict": true,
        "schema": {
          "type": "object",
          "properties": {
            "city": { "type": "string" },
            "country": { "type": "string" },
            "confidence": { "type": ["string", "null"] }
          },
          "required": ["city", "country", "confidence"],
          "additionalProperties": false
        }
      }
    },
    "max_tokens": 120
  }'
Warning: json_schema cannot be combined with tools or think. Invalid schemas return HTTP 400; outputs that fail validation return HTTP 422.
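
A client can branch on those status codes before decoding the result. The sketch below assumes that a successful reply carries the JSON document as a string in choices[0].message.content, following the OpenAI convention; parse_structured_reply is a hypothetical helper and the sample body is hand-written.

```python
import json


def parse_structured_reply(status: int, body: str) -> dict:
    """Map documented status codes to errors, else decode the JSON reply."""
    if status == 400:
        raise ValueError("request or schema rejected: " + body)
    if status == 422:
        raise ValueError("model output failed schema validation: " + body)
    if status != 200:
        raise RuntimeError(f"unexpected HTTP status {status}")
    # Assumption: the validated document is a JSON string in message.content.
    content = json.loads(body)["choices"][0]["message"]["content"]
    return json.loads(content)


# Hand-written success body matching the schema in the request above.
ok_body = json.dumps({"choices": [{"message": {
    "role": "assistant",
    "content": json.dumps({"city": "Paris", "country": "France",
                           "confidence": None}),
}}]})
location = parse_structured_reply(200, ok_body)
print(location["city"], location["country"])
```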

5 · Web UI SSE protocol (/api/chat)

This is the protocol the bundled chat UI uses; external UIs can plug into the same endpoint. Every event is a JSON object delivered as a single data: … SSE frame.

# Create / dispose a session (Web UI flow only)
curl -X POST http://localhost:5000/api/sessions          # {"sessionId":"a3b1c2..."}
curl -X DELETE http://localhost:5000/api/sessions/a3b1c2...

# Streaming chat
curl -N -X POST http://localhost:5000/api/chat \
  -H "Content-Type: application/json" \
  -d '{ "messages": [{"role": "user", "content": "Hi"}], "maxTokens": 50, "sessionId": null, "newChat": false, "think": false, "tools": [] }'

Reusing the same sessionId splices prior assistant tokens into the KV-cache prefix on the next turn. The terminal frame reports reuse:

data: {"done":true,"tokenCount":187,"elapsed":2.143,"tokPerSec":87.23,"promptTokens":512,"kvReusedTokens":420,"kvReusePercent":82.0}

Event fields include token (content), thinking (reasoning chunk), tool_calls, replace/diffusionStep (DiffusionGemma previews), and the terminal done summary.
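
An external UI could fold these events into a reply as sketched below. The field names come from the list above; the assumption that token and thinking each hold a plain string fragment is mine, fold_events is a hypothetical helper, and the non-terminal sample frames are hand-written.

```python
import json
from typing import Iterable, Optional, Tuple


def fold_events(frames: Iterable[str]) -> Tuple[str, str, Optional[dict]]:
    """Accumulate content and reasoning text; return them with the summary."""
    content, thinking, summary = [], [], None
    for frame in frames:
        if not frame.startswith("data:"):
            continue  # blank separator lines
        event = json.loads(frame[len("data:"):])
        if "token" in event:
            content.append(event["token"])
        if "thinking" in event:
            thinking.append(event["thinking"])
        if event.get("done"):
            summary = event  # terminal frame with throughput and KV reuse
            break
    return "".join(content), "".join(thinking), summary


frames = [
    'data: {"thinking": "Greet back."}',
    'data: {"token": "Hi"}',
    'data: {"token": " there"}',
    'data: {"done":true,"tokenCount":187,"elapsed":2.143,"tokPerSec":87.23,'
    '"promptTokens":512,"kvReusedTokens":420,"kvReusePercent":82.0}',
]
text, reasoning, summary = fold_events(frames)
print(text, summary["kvReusePercent"])
```

In the sample terminal frame, kvReusePercent matches kvReusedTokens divided by promptTokens (420 / 512, about 82%).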

6 · Sampling parameters

Ollama-style (inside the options object)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| num_predict | int | 200 | Maximum tokens to generate |
| temperature | float | 0 | Sampling temperature (0 = greedy) |
| top_k | int | 0 | Top-K filtering (0 = disabled) |
| top_p | float | 1.0 | Nucleus sampling threshold |
| min_p | float | 0 | Minimum probability filtering |
| repeat_penalty | float | 1.0 | Repetition penalty |
| presence_penalty / frequency_penalty | float | 0 | Presence / frequency penalties |
| seed | int | -1 | Random seed (-1 = random) |
| stop | array | null | Stop sequences |

OpenAI-style (top-level)

OpenAI-style requests take these parameters at the top level of the request body: max_tokens, temperature, top_p, presence_penalty, frequency_penalty, seed, stop, and response_format (text / json_object / json_schema).
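
When porting a client from one style to the other, the option names can be translated mechanically. The mapping below is an inference from the two parameter lists, not a documented server feature: pairing num_predict with max_tokens is an assumption, and to_openai_params is a hypothetical helper. Options that the OpenAI-style list does not mention are reported rather than forwarded.

```python
from typing import Dict, List, Tuple

# Ollama option name -> OpenAI top-level name (assumed equivalents).
OPTION_TO_OPENAI: Dict[str, str] = {
    "num_predict": "max_tokens",
    "temperature": "temperature",
    "top_p": "top_p",
    "presence_penalty": "presence_penalty",
    "frequency_penalty": "frequency_penalty",
    "seed": "seed",
    "stop": "stop",
}


def to_openai_params(options: dict) -> Tuple[dict, List[str]]:
    """Translate an Ollama options object; return (params, unmapped names)."""
    params, unmapped = {}, []
    for name, value in options.items():
        if name in OPTION_TO_OPENAI:
            params[OPTION_TO_OPENAI[name]] = value
        else:
            unmapped.append(name)  # e.g. top_k, min_p, repeat_penalty
    return params, sorted(unmapped)


params, unmapped = to_openai_params(
    {"num_predict": 50, "temperature": 0.7, "top_p": 0.9, "top_k": 40})
print(params, unmapped)
```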

7 · Python client examples

Using requests (Ollama-style)

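import requests

# Non-streaming generate call: the whole reply arrives as one JSON object.
resp = requests.post("http://localhost:5000/api/generate", json={
    "model": "Qwen3-4B-Q8_0.gguf",
    "prompt": "What is machine learning?",
    "stream": False,
    "options": {"num_predict": 100, "temperature": 0.7}
}, timeout=120)
resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
print(resp.json()["response"])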

Using the openai SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen3-4B-Q8_0.gguf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2+3?"}
    ],
    max_tokens=50, temperature=0.7
)
print(response.choices[0].message.content)

Streaming with the openai SDK

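from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Qwen3-4B-Q8_0.gguf",
    messages=[{"role": "user", "content": "Tell me about Python."}],
    max_tokens=200, stream=True
)
for chunk in stream:
    # Skip chunks that carry no choices or no text (e.g. a role-only delta).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # finish with a newline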