Server & Web UI
TensorSharp.Server is an ASP.NET Core app that hosts a single GGUF model and exposes a browser chat UI plus Ollama- and OpenAI-compatible REST APIs on the same port. The continuous-batching engine handles concurrency.
Start the server
cd TensorSharp.Server/bin
# Start the server with the exact hosted model
./TensorSharp.Server --model ./models/model.gguf --backend ggml_metal
# Linux + NVIDIA GPU
./TensorSharp.Server --model ./models/model.gguf --backend ggml_cuda
# Multimodal models: host an explicit projector too
./TensorSharp.Server --model ./models/model.gguf --mmproj ./models/mmproj.gguf --backend ggml_cuda
The server starts on http://localhost:5000. Open it in a browser for the chat UI, or point an HTTP client at the compatibility endpoints.
The server hosts exactly one GGUF chosen with --model (and an optional projector via --mmproj). It does not scan a model directory; requests must name that hosted file or its basename.
Server-wide default sampling
Defaults fill in any field a request omits; per-request values always win.
./TensorSharp.Server --model ./models/model.gguf --backend ggml_metal \
--temperature 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.1 \
--presence-penalty 0.0 --frequency-penalty 0.0 --seed 42 \
--stop "</s>" --stop "<|endoftext|>"
Web UI features
Open http://localhost:5000. The browser interface supports:
- Multi-turn chat conversations with streaming token generation (SSE).
- Per-tab chat sessions — each browser tab owns its own tracked history and KV cache.
- Image, video, and audio uploads for multimodal inference (up to 500 MB).
- Thinking / reasoning mode toggle and tool calling with function definitions.
- Message editing and deletion with regeneration from any point in the conversation.
- DiffusionGemma denoising previews when a
diffusion-gemmaGGUF is hosted (the whole assistant message is replaced on each step, then finalized). - Free scrolling — read earlier replies while new tokens stream; auto-scroll resumes at the bottom.
Server options
| Option | Description |
|---|---|
--model <path> | GGUF file to host (required for inference). |
--mmproj <path> | Multimodal projector GGUF (pass none to disable). Requires --model. |
--backend <type> | Default backend: cpu, cuda, mlx, ggml_cpu, ggml_metal, ggml_cuda. |
--max-tokens <N> | Default generation limit when a request omits it (default: 20000). |
--temperature / --top-k / --top-p / --min-p | Default sampling values when a request does not provide them. |
--repeat-penalty / --presence-penalty / --frequency-penalty / --seed | Default penalties and seed. |
--stop <string> | Default stop sequence (repeatable). Per-request stop replaces the list. |
--continuous-batching / --no-continuous-batching | Enable (default) or disable iteration-level paged batching. Alias: --paged-batching. |
--mtp-spec / --no-mtp-spec | Enable / disable NextN / MTP speculative decoding (default off). → MTP |
--mtp-draft <N> | Max tokens drafted per speculative step (default 8). |
--mtp-pmin <f> | Minimum draft-head confidence to keep a token (default 0.75). |
--mtp-draft-model <path> | Separate MTP draft GGUF (Gemma 4's gemma4-assistant); ignored for Qwen 3.6. |
Per-request fields (temperature, top_p, seed, stop, …) always override these server-wide defaults; the defaults only fill in what a client omits.
Environment variables
| Variable | Description |
|---|---|
BACKEND | Default backend when --backend is not passed (default: ggml_metal on macOS, ggml_cpu elsewhere). |
MAX_TOKENS | Default max generation length (default: 20000). |
MAX_TEXT_FILE_CHARS | Character cap for truncating plain-text uploads (default: 8000). |
VIDEO_SAMPLE_FPS / VIDEO_MAX_FRAMES | Frames sampled per second of video / optional upper bound on extracted frames. |
TENSORSHARP_TEMPERATURE, …_TOP_K, …_TOP_P, …_MIN_P | Default sampling values when neither the flag nor the request body sets one. |
TENSORSHARP_REPEAT_PENALTY, …_PRESENCE_PENALTY, …_FREQUENCY_PENALTY, …_SEED | Default penalties and seed. |
TENSORSHARP_LOG_LEVEL / …_LOG_DIR / …_LOG_FILE | Logger level, directory, and file toggle (also honored by the CLI). |
DIFFUSION_STEPS / DIFFUSION_MAX_BATCH | DiffusionGemma denoising steps per block / max concurrent diffusion requests batched. |
The current binary listens on a fixed http://0.0.0.0:5000; the Docker Space images patch that constant to 7860 at build time.
Continuous-batching tunables
The scheduler / engine knobs are read at process start. Set them via the environment (or the --continuous-batching flags, which translate to them).
| Variable | Description |
|---|---|
TS_SCHED_DISABLE_BATCHED | 1 forces per-sequence KV-swap even when a model supports batching (= --no-continuous-batching). |
TS_SCHED_MAX_BATCHED_TOKENS | Per-step token budget (default 4096). |
TS_SCHED_MAX_RUNNING_SEQS | Maximum in-flight sequences (default 16). |
TS_SCHED_PREFILL_CHUNK | Maximum prefill tokens per step (default 1024). |
TS_SCHED_NUM_BLOCKS / TS_SCHED_BLOCK_SIZE | Physical blocks in the engine pool (default 256) / tokens per block (default 256). |
TS_SCHED_PREFIX_CACHE | 0 disables block-hash prefix sharing across requests. |
TS_<FAMILY>_BATCHED=0 | Per-model escape hatch (e.g. TS_GEMMA4_BATCHED=0) to fall back to per-sequence KV-swap. |
The full environment-variable surface (MLX tunables, MTP knobs, diffusion) is on the API Reference page and the Advanced page.