Server & Web UI

TensorSharp.Server is an ASP.NET Core app that hosts a single GGUF model and exposes a browser chat UI plus Ollama- and OpenAI-compatible REST APIs on the same port. The continuous-batching engine handles concurrency.

Start the server

cd TensorSharp.Server/bin

# Start the server with the exact hosted model
./TensorSharp.Server --model ./models/model.gguf --backend ggml_metal

# Linux + NVIDIA GPU
./TensorSharp.Server --model ./models/model.gguf --backend ggml_cuda

# Multimodal models: host an explicit projector too
./TensorSharp.Server --model ./models/model.gguf --mmproj ./models/mmproj.gguf --backend ggml_cuda

The server starts on http://localhost:5000. Open it in a browser for the chat UI, or point an HTTP client at the compatibility endpoints.

📌

The server hosts exactly one GGUF chosen with --model (and an optional projector via --mmproj). It does not scan a model directory; requests must name that hosted file or its basename.

Server-wide default sampling

Defaults fill in any field a request omits; per-request values always win.

./TensorSharp.Server --model ./models/model.gguf --backend ggml_metal \
    --temperature 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.1 \
    --presence-penalty 0.0 --frequency-penalty 0.0 --seed 42 \
    --stop "</s>" --stop "<|endoftext|>"

Web UI features

Open http://localhost:5000. The browser interface supports:

Multi-turn chat conversations with streaming token generation (SSE).
Per-tab chat sessions — each browser tab owns its own tracked history and KV cache.
Image, video, and audio uploads for multimodal inference (up to 500 MB).
Thinking / reasoning mode toggle and tool calling with function definitions.
Message editing and deletion with regeneration from any point in the conversation.
DiffusionGemma denoising previews when a diffusion-gemma GGUF is hosted (the whole assistant message is replaced on each step, then finalized).
Free scrolling — read earlier replies while new tokens stream; auto-scroll resumes at the bottom.

Server options

Option	Description
`--model <path>`	GGUF file to host (required for inference).
`--mmproj <path>`	Multimodal projector GGUF (pass `none` to disable). Requires `--model`.
`--backend <type>`	Default backend: `cpu`, `cuda`, `mlx`, `ggml_cpu`, `ggml_metal`, `ggml_cuda`.
`--max-tokens <N>`	Default generation limit when a request omits it (default: `20000`).
`--temperature / --top-k / --top-p / --min-p`	Default sampling values when a request does not provide them.
`--repeat-penalty / --presence-penalty / --frequency-penalty / --seed`	Default penalties and seed.
`--stop <string>`	Default stop sequence (repeatable). Per-request `stop` replaces the list.
`--continuous-batching` / `--no-continuous-batching`	Enable (default) or disable iteration-level paged batching. Alias: `--paged-batching`.
`--mtp-spec` / `--no-mtp-spec`	Enable / disable NextN / MTP speculative decoding (default off). → MTP
`--mtp-draft <N>`	Max tokens drafted per speculative step (default `8`).
`--mtp-pmin <f>`	Minimum draft-head confidence to keep a token (default `0.75`).
`--mtp-draft-model <path>`	Separate MTP draft GGUF (Gemma 4's `gemma4-assistant`); ignored for Qwen 3.6.

✅

Per-request fields (temperature, top_p, seed, stop, …) always override these server-wide defaults; the defaults only fill in what a client omits.

Environment variables

Variable	Description
`BACKEND`	Default backend when `--backend` is not passed (default: `ggml_metal` on macOS, `ggml_cpu` elsewhere).
`MAX_TOKENS`	Default max generation length (default: `20000`).
`MAX_TEXT_FILE_CHARS`	Character cap for truncating plain-text uploads (default: `8000`).
`VIDEO_SAMPLE_FPS` / `VIDEO_MAX_FRAMES`	Frames sampled per second of video / optional upper bound on extracted frames.
`TENSORSHARP_TEMPERATURE`, `…_TOP_K`, `…_TOP_P`, `…_MIN_P`	Default sampling values when neither the flag nor the request body sets one.
`TENSORSHARP_REPEAT_PENALTY`, `…_PRESENCE_PENALTY`, `…_FREQUENCY_PENALTY`, `…_SEED`	Default penalties and seed.
`TENSORSHARP_LOG_LEVEL` / `…_LOG_DIR` / `…_LOG_FILE`	Logger level, directory, and file toggle (also honored by the CLI).
`DIFFUSION_STEPS` / `DIFFUSION_MAX_BATCH`	DiffusionGemma denoising steps per block / max concurrent diffusion requests batched.

The current binary listens on a fixed http://0.0.0.0:5000; the Docker Space images patch that constant to 7860 at build time.

Continuous-batching tunables

The scheduler / engine knobs are read at process start. Set them via the environment (or the --continuous-batching flags, which translate to them).

Variable	Description
`TS_SCHED_DISABLE_BATCHED`	`1` forces per-sequence KV-swap even when a model supports batching (= `--no-continuous-batching`).
`TS_SCHED_MAX_BATCHED_TOKENS`	Per-step token budget (default `4096`).
`TS_SCHED_MAX_RUNNING_SEQS`	Maximum in-flight sequences (default `16`).
`TS_SCHED_PREFILL_CHUNK`	Maximum prefill tokens per step (default `1024`).
`TS_SCHED_NUM_BLOCKS` / `TS_SCHED_BLOCK_SIZE`	Physical blocks in the engine pool (default `256`) / tokens per block (default `256`).
`TS_SCHED_PREFIX_CACHE`	`0` disables block-hash prefix sharing across requests.
`TS_<FAMILY>_BATCHED=0`	Per-model escape hatch (e.g. `TS_GEMMA4_BATCHED=0`) to fall back to per-sequence KV-swap.

The full environment-variable surface (MLX tunables, MTP knobs, diffusion) is on the API Reference page and the Advanced page.

← Command Line Next: HTTP API →