API Reference

Every flag, variable, endpoint, and public type in one place. Type in the box to filter all tables below instantly — or press / for wiki-wide search.

· Matching rows are highlighted live across all sections.

CLI flags — `TensorSharp.Cli`

Flag	Description
`--model <path>`	Path to a GGUF model file (required).
`--input <path>`	Text file containing the user prompt.
`--input-jsonl <path>`	JSONL file with batch requests (one JSON per line).
`--multi-turn-jsonl <path>`	JSONL for multi-turn chat simulation with KV-cache reuse.
`--output <path>`	Write generated text to this file.
`--image / --video / --audio <path>`	Media for vision / video / audio inference.
`--mmproj <path>`	Multimodal projector GGUF (auto-detected beside the model).
`--max-tokens <N>`	Maximum tokens to generate (default 100).
`--backend <type>`	`cpu`, `cuda`, `mlx`, `ggml_cpu`, `ggml_metal`, `ggml_cuda`.
`--kv-cache-dtype <type>`	KV cache precision: `f32` (default), `f16`, `q8_0`.
`--interactive` / `-i`	Start the interactive REPL.
`--system <text>` / `--system-file <path>`	Seed the system prompt.
`--think`	Enable thinking / reasoning mode.
`--tools <path>`	JSON file with tool / function definitions.
`--temperature / --top-k / --top-p / --min-p`	Sampling controls.
`--repeat-penalty / --presence-penalty / --frequency-penalty`	Penalties (1.0 / 0 = off).
`--seed <N>` / `--stop <string>`	Random seed (-1 = random) / stop sequence (repeatable).
`--dump-prompt`	Render prompt + tokenization and exit.
`--diffusion-steps / --diffusion-seed / --diffusion-blocks <N>`	DiffusionGemma generation controls.
`--benchmark / --bench-prefill / --bench-decode / --bench-runs`	Synthetic throughput benchmark.
`--bench-kvcache / --bench-kv-turns <N>`	Multi-turn KV-cache reuse benchmark.
`--warmup-runs <N>`	Throw-away forward passes before timing (default 0).
`--test / --test-templates <dir>`	Built-in tokenizer/template tests; validate templates against GGUF Jinja2.
`--log-level / --log-dir / --log-file / --log-console`	Logger level, directory, and file/console toggles.

Server flags — `TensorSharp.Server`

Flag	Description
`--model <path>`	GGUF file to host (required for inference).
`--mmproj <path>`	Multimodal projector GGUF; `none` to disable.
`--backend <type>`	Default compute backend.
`--max-tokens <N>`	Default generation limit when a request omits it (default 20000).
`--temperature / --top-k / --top-p / --min-p`	Default sampling values.
`--repeat-penalty / --presence-penalty / --frequency-penalty / --seed`	Default penalties and seed.
`--stop <string>`	Default stop sequence (repeatable); per-request replaces the list.
`--continuous-batching / --no-continuous-batching`	Enable (default) / disable iteration-level paged batching. Alias `--paged-batching`.
`--mtp-spec / --no-mtp-spec`	Enable / disable NextN/MTP speculative decoding (default off).
`--mtp-draft <N>`	Max tokens drafted per speculative step (default 8).
`--mtp-pmin <f>`	Minimum draft-head confidence to keep a token (default 0.75).
`--mtp-draft-model <path>`	Separate MTP draft GGUF (Gemma 4 `gemma4-assistant`).
`--paged-kv* / --paged-kv-quant-bits`	Legacy standalone paged-KV flags (engine now owns KV state).

Environment variables

Variable	Description
`BACKEND`	Default backend (`ggml_metal` on macOS, `ggml_cpu` elsewhere).
`MAX_TOKENS`	Default max generation length (20000).
`MAX_TEXT_FILE_CHARS`	Char cap for plain-text uploads (8000).
`VIDEO_SAMPLE_FPS / VIDEO_MAX_FRAMES`	Video frame sampling rate / cap.
`TENSORSHARP_TEMPERATURE / _TOP_K / _TOP_P / _MIN_P`	Default sampling values.
`TENSORSHARP_REPEAT_PENALTY / _PRESENCE_PENALTY / _FREQUENCY_PENALTY / _SEED`	Default penalties and seed.
`TENSORSHARP_LOG_LEVEL / _LOG_DIR / _LOG_FILE`	Logging level, directory, file toggle (CLI + server).
`DIFFUSION_STEPS / DIFFUSION_MAX_BATCH`	DiffusionGemma steps per block / max batched requests.
`TS_SCHED_DISABLE_BATCHED`	`1` forces per-sequence KV-swap (= `--no-continuous-batching`).
`TS_SCHED_MAX_BATCHED_TOKENS`	Per-step token budget (4096).
`TS_SCHED_MAX_RUNNING_SEQS`	Max in-flight sequences (16).
`TS_SCHED_PREFILL_CHUNK`	Max prefill tokens per step (1024).
`TS_SCHED_NUM_BLOCKS / TS_SCHED_BLOCK_SIZE`	Engine block-pool size (256) / tokens per block (256).
`TS_SCHED_PREFIX_CACHE`	`0` disables block-hash prefix sharing.
`TS_<FAMILY>_BATCHED`	`0` forces a family onto the per-sequence path (e.g. `TS_GEMMA4_BATCHED`, `TS_QWEN35_BATCHED`).
`TS_MTP_SPEC / TS_MTP_DRAFT / TS_MTP_PMIN / TS_MTP_DRAFT_MODEL`	MTP speculative-decoding knobs (mirror the `--mtp-*` flags).
`TS_GMTP_NO_FUSED / TS_GMTP_NO_FAST_ROLLBACK / TS_GMTP_BATCHED_TRUNK`	Gemma 4 MTP draft-path A/B switches.
`TS_MLX_*`	MLX backend tuning: pipelined decode, mlock GGUF, fused KV write, batched MoE decode, memory caps.
`TENSORSHARP_MLX_LIBRARY / _LIBRARY_DIR`	Override the search path for `libmlxc`.
`TENSORSHARP_GGML_NO_UPDATE / _GGML_GIT_REF`	Skip / pin the ggml source clone on native builds.

HTTP endpoints

Method & path	Style	Purpose
`POST /api/generate`	Ollama	Single-prompt completion (stream or not).
`POST /api/chat/ollama`	Ollama	Multi-turn chat with optional think / tools / images.
`GET /api/tags`	Ollama	List the hosted model.
`POST /api/show`	Ollama	Model info.
`POST /v1/chat/completions`	OpenAI	Chat Completions (stream, tools, response_format).
`GET /v1/models`	OpenAI	List models.
`POST /api/chat`	Web UI	SSE chat stream with session + KV-reuse fields.
`POST /api/sessions` · `DELETE /api/sessions/{id}`	Web UI	Create / dispose a per-tab session.
`POST /api/upload`	Web UI	Upload an image / audio / video / text file.
`GET /api/models`	Web UI	Hosted model, supported backends, defaults.
`POST /api/models/load`	Web UI	Reload the hosted model.
`GET /api/version` · `GET /api/queue/status`	Utility	Server version / legacy queue snapshot.

Sampling parameters

Ollama (`options`)	OpenAI (top-level)	Default	Meaning
`num_predict`	`max_tokens`	200	Maximum tokens to generate.
`temperature`	`temperature`	0	Sampling temperature (0 = greedy).
`top_k`	—	0	Top-K filtering (0 = disabled).
`top_p`	`top_p`	1.0	Nucleus sampling threshold.
`min_p`	—	0	Minimum probability filtering.
`repeat_penalty`	—	1.0	Repetition penalty.
`presence_penalty / frequency_penalty`	`presence_penalty / frequency_penalty`	0	Presence / frequency penalties.
`seed`	`seed`	-1	Random seed (-1 = random).
`stop`	`stop`	null	Stop sequences.
—	`response_format`	null	`text`, `json_object`, or `json_schema`.

C# public API

Member	Signature / values	Notes
`ModelBase.Create`	`static ModelBase Create(string ggufPath, BackendType backend)`	Auto-detects architecture from GGUF metadata.
`ModelBase.Forward`	`float[] Forward(int[] tokens)`	Returns next-token logits (length = vocab size).
`ModelBase.Sample`	`int Sample(float[] logits, SamplingConfig config, IList<int> generated = null)`	Applies penalties + sampling.
`ModelBase.SampleGreedy`	`int SampleGreedy(float[] logits)`	Deterministic argmax.
`ModelBase.Config` / `.Tokenizer`	`ModelConfig` / `ITokenizer`	`Config.VocabSize`, context length, etc.
`BackendType`	`Cpu, GgmlCpu, GgmlMetal, GgmlCuda, Cuda, Mlx`	Backend selector enum.
`ITokenizer.Encode`	`Encode(string text, bool addSpecial)`	Text → token ids.
`ITokenizer.Decode`	`string Decode(List<int> ids)`	Token ids → text.
`ITokenizer.IsEos` / `.EosTokenIds`	`bool IsEos(int id)` / `int[] EosTokenIds`	End-of-sequence detection.
`SamplingConfig`	Temperature, TopK, TopP, MinP, penalties, Seed, StopSequences, MaxTokens	See C# Library.
`IBatchedPagedModel.ForwardBatch`	batched/paged forward	Implemented by most architectures for continuous batching.

REPL commands

Command	Description
`/help`, `/?`	Show all interactive commands.
`/exit`, `/quit`	Leave the session.
`/reset`, `/new`	Clear conversation history and KV cache.
`/history` · `/save <file>`	Print / append the transcript.
`/system <text>`	Set the system prompt (resets KV cache).
`/think on\|off` · `/multiline on\|off`	Toggle reasoning mode / multi-line input.
`/info`, `/status`	Show model, backend, architecture, context/vocab, projector, depth.
`/model <path>` · `/backend <name>` · `/mmproj <path>`	Hot-swap model, backend, or projector.
`/sampling`, `/show`	Print current sampling configuration.
`/max · /temp · /topk · /topp · /minp`	Set reply length / temperature / top-k / top-p / min-p.
`/repeat · /presence · /frequency · /seed`	Set penalties and seed.
`/stop <text>` · `/clearstop`	Add / clear stop sequences.
`/image · /audio · /video · /text <path>` · `/clearattach`	Attach media / text for the next turn; drop pending attachments.

← Benchmarks & Testing Next: Glossary & FAQ →

API Reference

CLI flags — TensorSharp.Cli

Server flags — TensorSharp.Server

Environment variables

HTTP endpoints

Sampling parameters

C# public API

REPL commands

CLI flags — `TensorSharp.Cli`

Server flags — `TensorSharp.Server`