Server & Web UI

TensorSharp.Server is an ASP.NET Core app that hosts a single GGUF model and exposes a browser chat UI plus Ollama- and OpenAI-compatible REST APIs on the same port. The continuous-batching engine handles concurrency.

Start the server

cd TensorSharp.Server/bin

# Start the server with the exact hosted model
./TensorSharp.Server --model ./models/model.gguf --backend ggml_metal

# Linux + NVIDIA GPU
./TensorSharp.Server --model ./models/model.gguf --backend ggml_cuda

# Multimodal models: host an explicit projector too
./TensorSharp.Server --model ./models/model.gguf --mmproj ./models/mmproj.gguf --backend ggml_cuda

The server starts on http://localhost:5000. Open it in a browser for the chat UI, or point an HTTP client at the compatibility endpoints.

📌

The server hosts exactly one GGUF chosen with --model (and an optional projector via --mmproj). It does not scan a model directory; requests must name that hosted file or its basename.

Server-wide default sampling

Defaults fill in any field a request omits; per-request values always win.

./TensorSharp.Server --model ./models/model.gguf --backend ggml_metal \
    --temperature 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.1 \
    --presence-penalty 0.0 --frequency-penalty 0.0 --seed 42 \
    --stop "</s>" --stop "<|endoftext|>"

Web UI features

Open http://localhost:5000. The browser interface supports:

Server options

OptionDescription
--model <path>GGUF file to host (required for inference).
--mmproj <path>Multimodal projector GGUF (pass none to disable). Requires --model.
--backend <type>Default backend: cpu, cuda, mlx, ggml_cpu, ggml_metal, ggml_cuda.
--max-tokens <N>Default generation limit when a request omits it (default: 20000).
--temperature / --top-k / --top-p / --min-pDefault sampling values when a request does not provide them.
--repeat-penalty / --presence-penalty / --frequency-penalty / --seedDefault penalties and seed.
--stop <string>Default stop sequence (repeatable). Per-request stop replaces the list.
--continuous-batching / --no-continuous-batchingEnable (default) or disable iteration-level paged batching. Alias: --paged-batching.
--mtp-spec / --no-mtp-specEnable / disable NextN / MTP speculative decoding (default off). → MTP
--mtp-draft <N>Max tokens drafted per speculative step (default 8).
--mtp-pmin <f>Minimum draft-head confidence to keep a token (default 0.75).
--mtp-draft-model <path>Separate MTP draft GGUF (Gemma 4's gemma4-assistant); ignored for Qwen 3.6.

Per-request fields (temperature, top_p, seed, stop, …) always override these server-wide defaults; the defaults only fill in what a client omits.

Environment variables

VariableDescription
BACKENDDefault backend when --backend is not passed (default: ggml_metal on macOS, ggml_cpu elsewhere).
MAX_TOKENSDefault max generation length (default: 20000).
MAX_TEXT_FILE_CHARSCharacter cap for truncating plain-text uploads (default: 8000).
VIDEO_SAMPLE_FPS / VIDEO_MAX_FRAMESFrames sampled per second of video / optional upper bound on extracted frames.
TENSORSHARP_TEMPERATURE, …_TOP_K, …_TOP_P, …_MIN_PDefault sampling values when neither the flag nor the request body sets one.
TENSORSHARP_REPEAT_PENALTY, …_PRESENCE_PENALTY, …_FREQUENCY_PENALTY, …_SEEDDefault penalties and seed.
TENSORSHARP_LOG_LEVEL / …_LOG_DIR / …_LOG_FILELogger level, directory, and file toggle (also honored by the CLI).
DIFFUSION_STEPS / DIFFUSION_MAX_BATCHDiffusionGemma denoising steps per block / max concurrent diffusion requests batched.

The current binary listens on a fixed http://0.0.0.0:5000; the Docker Space images patch that constant to 7860 at build time.

Continuous-batching tunables

The scheduler / engine knobs are read at process start. Set them via the environment (or the --continuous-batching flags, which translate to them).

VariableDescription
TS_SCHED_DISABLE_BATCHED1 forces per-sequence KV-swap even when a model supports batching (= --no-continuous-batching).
TS_SCHED_MAX_BATCHED_TOKENSPer-step token budget (default 4096).
TS_SCHED_MAX_RUNNING_SEQSMaximum in-flight sequences (default 16).
TS_SCHED_PREFILL_CHUNKMaximum prefill tokens per step (default 1024).
TS_SCHED_NUM_BLOCKS / TS_SCHED_BLOCK_SIZEPhysical blocks in the engine pool (default 256) / tokens per block (default 256).
TS_SCHED_PREFIX_CACHE0 disables block-hash prefix sharing across requests.
TS_<FAMILY>_BATCHED=0Per-model escape hatch (e.g. TS_GEMMA4_BATCHED=0) to fall back to per-sequence KV-swap.

The full environment-variable surface (MLX tunables, MTP knobs, diffusion) is on the API Reference page and the Advanced page.