API Reference
Every flag, variable, endpoint, and public type in one place. Type in the box to filter all tables below instantly — or press / for wiki-wide search.
· Matching rows are highlighted live across all sections.
CLI flags — TensorSharp.Cli
| Flag | Description |
|---|---|
--model <path> | Path to a GGUF model file (required). |
--input <path> | Text file containing the user prompt. |
--input-jsonl <path> | JSONL file with batch requests (one JSON per line). |
--multi-turn-jsonl <path> | JSONL for multi-turn chat simulation with KV-cache reuse. |
--output <path> | Write generated text to this file. |
--image / --video / --audio <path> | Media for vision / video / audio inference. |
--mmproj <path> | Multimodal projector GGUF (auto-detected beside the model). |
--max-tokens <N> | Maximum tokens to generate (default 100). |
--backend <type> | cpu, cuda, mlx, ggml_cpu, ggml_metal, ggml_cuda. |
--kv-cache-dtype <type> | KV cache precision: f32 (default), f16, q8_0. |
--interactive / -i | Start the interactive REPL. |
--system <text> / --system-file <path> | Seed the system prompt. |
--think | Enable thinking / reasoning mode. |
--tools <path> | JSON file with tool / function definitions. |
--temperature / --top-k / --top-p / --min-p | Sampling controls. |
--repeat-penalty / --presence-penalty / --frequency-penalty | Penalties (1.0 / 0 = off). |
--seed <N> / --stop <string> | Random seed (-1 = random) / stop sequence (repeatable). |
--dump-prompt | Render prompt + tokenization and exit. |
--diffusion-steps / --diffusion-seed / --diffusion-blocks <N> | DiffusionGemma generation controls. |
--benchmark / --bench-prefill / --bench-decode / --bench-runs | Synthetic throughput benchmark. |
--bench-kvcache / --bench-kv-turns <N> | Multi-turn KV-cache reuse benchmark. |
--warmup-runs <N> | Throw-away forward passes before timing (default 0). |
--test / --test-templates <dir> | Built-in tokenizer/template tests; validate templates against GGUF Jinja2. |
--log-level / --log-dir / --log-file / --log-console | Logger level, directory, and file/console toggles. |
Server flags — TensorSharp.Server
| Flag | Description |
|---|---|
--model <path> | GGUF file to host (required for inference). |
--mmproj <path> | Multimodal projector GGUF; none to disable. |
--backend <type> | Default compute backend. |
--max-tokens <N> | Default generation limit when a request omits it (default 20000). |
--temperature / --top-k / --top-p / --min-p | Default sampling values. |
--repeat-penalty / --presence-penalty / --frequency-penalty / --seed | Default penalties and seed. |
--stop <string> | Default stop sequence (repeatable); per-request replaces the list. |
--continuous-batching / --no-continuous-batching | Enable (default) / disable iteration-level paged batching. Alias --paged-batching. |
--mtp-spec / --no-mtp-spec | Enable / disable NextN/MTP speculative decoding (default off). |
--mtp-draft <N> | Max tokens drafted per speculative step (default 8). |
--mtp-pmin <f> | Minimum draft-head confidence to keep a token (default 0.75). |
--mtp-draft-model <path> | Separate MTP draft GGUF (Gemma 4 gemma4-assistant). |
--paged-kv* / --paged-kv-quant-bits | Legacy standalone paged-KV flags (engine now owns KV state). |
Environment variables
| Variable | Description |
|---|---|
BACKEND | Default backend (ggml_metal on macOS, ggml_cpu elsewhere). |
MAX_TOKENS | Default max generation length (20000). |
MAX_TEXT_FILE_CHARS | Char cap for plain-text uploads (8000). |
VIDEO_SAMPLE_FPS / VIDEO_MAX_FRAMES | Video frame sampling rate / cap. |
TENSORSHARP_TEMPERATURE / _TOP_K / _TOP_P / _MIN_P | Default sampling values. |
TENSORSHARP_REPEAT_PENALTY / _PRESENCE_PENALTY / _FREQUENCY_PENALTY / _SEED | Default penalties and seed. |
TENSORSHARP_LOG_LEVEL / _LOG_DIR / _LOG_FILE | Logging level, directory, file toggle (CLI + server). |
DIFFUSION_STEPS / DIFFUSION_MAX_BATCH | DiffusionGemma steps per block / max batched requests. |
TS_SCHED_DISABLE_BATCHED | 1 forces per-sequence KV-swap (= --no-continuous-batching). |
TS_SCHED_MAX_BATCHED_TOKENS | Per-step token budget (4096). |
TS_SCHED_MAX_RUNNING_SEQS | Max in-flight sequences (16). |
TS_SCHED_PREFILL_CHUNK | Max prefill tokens per step (1024). |
TS_SCHED_NUM_BLOCKS / TS_SCHED_BLOCK_SIZE | Engine block-pool size (256) / tokens per block (256). |
TS_SCHED_PREFIX_CACHE | 0 disables block-hash prefix sharing. |
TS_<FAMILY>_BATCHED | 0 forces a family onto the per-sequence path (e.g. TS_GEMMA4_BATCHED, TS_QWEN35_BATCHED). |
TS_MTP_SPEC / TS_MTP_DRAFT / TS_MTP_PMIN / TS_MTP_DRAFT_MODEL | MTP speculative-decoding knobs (mirror the --mtp-* flags). |
TS_GMTP_NO_FUSED / TS_GMTP_NO_FAST_ROLLBACK / TS_GMTP_BATCHED_TRUNK | Gemma 4 MTP draft-path A/B switches. |
TS_MLX_* | MLX backend tuning: pipelined decode, mlock GGUF, fused KV write, batched MoE decode, memory caps. |
TENSORSHARP_MLX_LIBRARY / _LIBRARY_DIR | Override the search path for libmlxc. |
TENSORSHARP_GGML_NO_UPDATE / _GGML_GIT_REF | Skip / pin the ggml source clone on native builds. |
HTTP endpoints
| Method & path | Style | Purpose |
|---|---|---|
POST /api/generate | Ollama | Single-prompt completion (stream or not). |
POST /api/chat/ollama | Ollama | Multi-turn chat with optional think / tools / images. |
GET /api/tags | Ollama | List the hosted model. |
POST /api/show | Ollama | Model info. |
POST /v1/chat/completions | OpenAI | Chat Completions (stream, tools, response_format). |
GET /v1/models | OpenAI | List models. |
POST /api/chat | Web UI | SSE chat stream with session + KV-reuse fields. |
POST /api/sessions · DELETE /api/sessions/{id} | Web UI | Create / dispose a per-tab session. |
POST /api/upload | Web UI | Upload an image / audio / video / text file. |
GET /api/models | Web UI | Hosted model, supported backends, defaults. |
POST /api/models/load | Web UI | Reload the hosted model. |
GET /api/version · GET /api/queue/status | Utility | Server version / legacy queue snapshot. |
Sampling parameters
Ollama (options) | OpenAI (top-level) | Default | Meaning |
|---|---|---|---|
num_predict | max_tokens | 200 | Maximum tokens to generate. |
temperature | temperature | 0 | Sampling temperature (0 = greedy). |
top_k | — | 0 | Top-K filtering (0 = disabled). |
top_p | top_p | 1.0 | Nucleus sampling threshold. |
min_p | — | 0 | Minimum probability filtering. |
repeat_penalty | — | 1.0 | Repetition penalty. |
presence_penalty / frequency_penalty | presence_penalty / frequency_penalty | 0 | Presence / frequency penalties. |
seed | seed | -1 | Random seed (-1 = random). |
stop | stop | null | Stop sequences. |
| — | response_format | null | text, json_object, or json_schema. |
C# public API
| Member | Signature / values | Notes |
|---|---|---|
ModelBase.Create | static ModelBase Create(string ggufPath, BackendType backend) | Auto-detects architecture from GGUF metadata. |
ModelBase.Forward | float[] Forward(int[] tokens) | Returns next-token logits (length = vocab size). |
ModelBase.Sample | int Sample(float[] logits, SamplingConfig config, IList<int> generated = null) | Applies penalties + sampling. |
ModelBase.SampleGreedy | int SampleGreedy(float[] logits) | Deterministic argmax. |
ModelBase.Config / .Tokenizer | ModelConfig / ITokenizer | Config.VocabSize, context length, etc. |
BackendType | Cpu, GgmlCpu, GgmlMetal, GgmlCuda, Cuda, Mlx | Backend selector enum. |
ITokenizer.Encode | Encode(string text, bool addSpecial) | Text → token ids. |
ITokenizer.Decode | string Decode(List<int> ids) | Token ids → text. |
ITokenizer.IsEos / .EosTokenIds | bool IsEos(int id) / int[] EosTokenIds | End-of-sequence detection. |
SamplingConfig | Temperature, TopK, TopP, MinP, penalties, Seed, StopSequences, MaxTokens | See C# Library. |
IBatchedPagedModel.ForwardBatch | batched/paged forward | Implemented by most architectures for continuous batching. |
REPL commands
| Command | Description |
|---|---|
/help, /? | Show all interactive commands. |
/exit, /quit | Leave the session. |
/reset, /new | Clear conversation history and KV cache. |
/history · /save <file> | Print / append the transcript. |
/system <text> | Set the system prompt (resets KV cache). |
/think on|off · /multiline on|off | Toggle reasoning mode / multi-line input. |
/info, /status | Show model, backend, architecture, context/vocab, projector, depth. |
/model <path> · /backend <name> · /mmproj <path> | Hot-swap model, backend, or projector. |
/sampling, /show | Print current sampling configuration. |
/max · /temp · /topk · /topp · /minp | Set reply length / temperature / top-k / top-p / min-p. |
/repeat · /presence · /frequency · /seed | Set penalties and seed. |
/stop <text> · /clearstop | Add / clear stop sequences. |
/image · /audio · /video · /text <path> · /clearattach | Attach media / text for the next turn; drop pending attachments. |