API Reference

Every flag, variable, endpoint, and public type in one place. Type in the box to filter all tables below instantly — or press / for wiki-wide search.

· Matching rows are highlighted live across all sections.

CLI flags — TensorSharp.Cli

FlagDescription
--model <path>Path to a GGUF model file (required).
--input <path>Text file containing the user prompt.
--input-jsonl <path>JSONL file with batch requests (one JSON per line).
--multi-turn-jsonl <path>JSONL for multi-turn chat simulation with KV-cache reuse.
--output <path>Write generated text to this file.
--image / --video / --audio <path>Media for vision / video / audio inference.
--mmproj <path>Multimodal projector GGUF (auto-detected beside the model).
--max-tokens <N>Maximum tokens to generate (default 100).
--backend <type>cpu, cuda, mlx, ggml_cpu, ggml_metal, ggml_cuda.
--kv-cache-dtype <type>KV cache precision: f32 (default), f16, q8_0.
--interactive / -iStart the interactive REPL.
--system <text> / --system-file <path>Seed the system prompt.
--thinkEnable thinking / reasoning mode.
--tools <path>JSON file with tool / function definitions.
--temperature / --top-k / --top-p / --min-pSampling controls.
--repeat-penalty / --presence-penalty / --frequency-penaltyPenalties (1.0 / 0 = off).
--seed <N> / --stop <string>Random seed (-1 = random) / stop sequence (repeatable).
--dump-promptRender prompt + tokenization and exit.
--diffusion-steps / --diffusion-seed / --diffusion-blocks <N>DiffusionGemma generation controls.
--benchmark / --bench-prefill / --bench-decode / --bench-runsSynthetic throughput benchmark.
--bench-kvcache / --bench-kv-turns <N>Multi-turn KV-cache reuse benchmark.
--warmup-runs <N>Throw-away forward passes before timing (default 0).
--test / --test-templates <dir>Built-in tokenizer/template tests; validate templates against GGUF Jinja2.
--log-level / --log-dir / --log-file / --log-consoleLogger level, directory, and file/console toggles.

Server flags — TensorSharp.Server

FlagDescription
--model <path>GGUF file to host (required for inference).
--mmproj <path>Multimodal projector GGUF; none to disable.
--backend <type>Default compute backend.
--max-tokens <N>Default generation limit when a request omits it (default 20000).
--temperature / --top-k / --top-p / --min-pDefault sampling values.
--repeat-penalty / --presence-penalty / --frequency-penalty / --seedDefault penalties and seed.
--stop <string>Default stop sequence (repeatable); per-request replaces the list.
--continuous-batching / --no-continuous-batchingEnable (default) / disable iteration-level paged batching. Alias --paged-batching.
--mtp-spec / --no-mtp-specEnable / disable NextN/MTP speculative decoding (default off).
--mtp-draft <N>Max tokens drafted per speculative step (default 8).
--mtp-pmin <f>Minimum draft-head confidence to keep a token (default 0.75).
--mtp-draft-model <path>Separate MTP draft GGUF (Gemma 4 gemma4-assistant).
--paged-kv* / --paged-kv-quant-bitsLegacy standalone paged-KV flags (engine now owns KV state).

Environment variables

VariableDescription
BACKENDDefault backend (ggml_metal on macOS, ggml_cpu elsewhere).
MAX_TOKENSDefault max generation length (20000).
MAX_TEXT_FILE_CHARSChar cap for plain-text uploads (8000).
VIDEO_SAMPLE_FPS / VIDEO_MAX_FRAMESVideo frame sampling rate / cap.
TENSORSHARP_TEMPERATURE / _TOP_K / _TOP_P / _MIN_PDefault sampling values.
TENSORSHARP_REPEAT_PENALTY / _PRESENCE_PENALTY / _FREQUENCY_PENALTY / _SEEDDefault penalties and seed.
TENSORSHARP_LOG_LEVEL / _LOG_DIR / _LOG_FILELogging level, directory, file toggle (CLI + server).
DIFFUSION_STEPS / DIFFUSION_MAX_BATCHDiffusionGemma steps per block / max batched requests.
TS_SCHED_DISABLE_BATCHED1 forces per-sequence KV-swap (= --no-continuous-batching).
TS_SCHED_MAX_BATCHED_TOKENSPer-step token budget (4096).
TS_SCHED_MAX_RUNNING_SEQSMax in-flight sequences (16).
TS_SCHED_PREFILL_CHUNKMax prefill tokens per step (1024).
TS_SCHED_NUM_BLOCKS / TS_SCHED_BLOCK_SIZEEngine block-pool size (256) / tokens per block (256).
TS_SCHED_PREFIX_CACHE0 disables block-hash prefix sharing.
TS_<FAMILY>_BATCHED0 forces a family onto the per-sequence path (e.g. TS_GEMMA4_BATCHED, TS_QWEN35_BATCHED).
TS_MTP_SPEC / TS_MTP_DRAFT / TS_MTP_PMIN / TS_MTP_DRAFT_MODELMTP speculative-decoding knobs (mirror the --mtp-* flags).
TS_GMTP_NO_FUSED / TS_GMTP_NO_FAST_ROLLBACK / TS_GMTP_BATCHED_TRUNKGemma 4 MTP draft-path A/B switches.
TS_MLX_* MLX backend tuning: pipelined decode, mlock GGUF, fused KV write, batched MoE decode, memory caps.
TENSORSHARP_MLX_LIBRARY / _LIBRARY_DIROverride the search path for libmlxc.
TENSORSHARP_GGML_NO_UPDATE / _GGML_GIT_REFSkip / pin the ggml source clone on native builds.

HTTP endpoints

Method & pathStylePurpose
POST /api/generateOllamaSingle-prompt completion (stream or not).
POST /api/chat/ollamaOllamaMulti-turn chat with optional think / tools / images.
GET /api/tagsOllamaList the hosted model.
POST /api/showOllamaModel info.
POST /v1/chat/completionsOpenAIChat Completions (stream, tools, response_format).
GET /v1/modelsOpenAIList models.
POST /api/chatWeb UISSE chat stream with session + KV-reuse fields.
POST /api/sessions · DELETE /api/sessions/{id}Web UICreate / dispose a per-tab session.
POST /api/uploadWeb UIUpload an image / audio / video / text file.
GET /api/modelsWeb UIHosted model, supported backends, defaults.
POST /api/models/loadWeb UIReload the hosted model.
GET /api/version · GET /api/queue/statusUtilityServer version / legacy queue snapshot.

Sampling parameters

Ollama (options)OpenAI (top-level)DefaultMeaning
num_predictmax_tokens200Maximum tokens to generate.
temperaturetemperature0Sampling temperature (0 = greedy).
top_k0Top-K filtering (0 = disabled).
top_ptop_p1.0Nucleus sampling threshold.
min_p0Minimum probability filtering.
repeat_penalty1.0Repetition penalty.
presence_penalty / frequency_penaltypresence_penalty / frequency_penalty0Presence / frequency penalties.
seedseed-1Random seed (-1 = random).
stopstopnullStop sequences.
response_formatnulltext, json_object, or json_schema.

C# public API

MemberSignature / valuesNotes
ModelBase.Createstatic ModelBase Create(string ggufPath, BackendType backend)Auto-detects architecture from GGUF metadata.
ModelBase.Forwardfloat[] Forward(int[] tokens)Returns next-token logits (length = vocab size).
ModelBase.Sampleint Sample(float[] logits, SamplingConfig config, IList<int> generated = null)Applies penalties + sampling.
ModelBase.SampleGreedyint SampleGreedy(float[] logits)Deterministic argmax.
ModelBase.Config / .TokenizerModelConfig / ITokenizerConfig.VocabSize, context length, etc.
BackendTypeCpu, GgmlCpu, GgmlMetal, GgmlCuda, Cuda, MlxBackend selector enum.
ITokenizer.EncodeEncode(string text, bool addSpecial)Text → token ids.
ITokenizer.Decodestring Decode(List<int> ids)Token ids → text.
ITokenizer.IsEos / .EosTokenIdsbool IsEos(int id) / int[] EosTokenIdsEnd-of-sequence detection.
SamplingConfigTemperature, TopK, TopP, MinP, penalties, Seed, StopSequences, MaxTokensSee C# Library.
IBatchedPagedModel.ForwardBatchbatched/paged forwardImplemented by most architectures for continuous batching.

REPL commands

CommandDescription
/help, /?Show all interactive commands.
/exit, /quitLeave the session.
/reset, /newClear conversation history and KV cache.
/history · /save <file>Print / append the transcript.
/system <text>Set the system prompt (resets KV cache).
/think on|off · /multiline on|offToggle reasoning mode / multi-line input.
/info, /statusShow model, backend, architecture, context/vocab, projector, depth.
/model <path> · /backend <name> · /mmproj <path>Hot-swap model, backend, or projector.
/sampling, /showPrint current sampling configuration.
/max · /temp · /topk · /topp · /minpSet reply length / temperature / top-k / top-p / min-p.
/repeat · /presence · /frequency · /seedSet penalties and seed.
/stop <text> · /clearstopAdd / clear stop sequences.
/image · /audio · /video · /text <path> · /clearattachAttach media / text for the next turn; drop pending attachments.