Command Line (CLI)
TensorSharp.Cli is the console host for local prompts, multimodal experiments, prompt inspection, JSONL batch workflows, the interactive REPL, and built-in benchmarks. The binary lands in TensorSharp.Cli/bin/... after the build.
Examples
cd TensorSharp.Cli/bin
# Text inference (macOS)
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --output result.txt \
--max-tokens 200 --backend ggml_metal
# Text inference on Windows/Linux + NVIDIA GPU
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --output result.txt \
--max-tokens 200 --backend ggml_cuda
# Interactive turn-by-turn chat (REPL) with KV-cache reuse and slash commands
./TensorSharp.Cli --model <model.gguf> --backend ggml_metal --interactive
./TensorSharp.Cli --model <model.gguf> --backend ggml_metal -i \
--system "You are a terse assistant." --temperature 0.7 --top-p 0.9 --think
Multimodal
# Image inference (Gemma 3/4, Qwen 3.5-family)
./TensorSharp.Cli --model <model.gguf> --image photo.png --backend ggml_metal
# Video inference (Gemma 4)
./TensorSharp.Cli --model <model.gguf> --video clip.mp4 --backend ggml_metal
# Audio inference (Gemma 4)
./TensorSharp.Cli --model <model.gguf> --audio speech.wav --backend ggml_metal
Reasoning, tools & sampling
# Thinking / reasoning mode
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --backend ggml_metal --think
# Tool calling
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --backend ggml_metal \
--tools tools.json
# With sampling parameters
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --backend ggml_metal \
--temperature 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.2 --seed 42
DiffusionGemma & inspection
# DiffusionGemma text-diffusion generation
./TensorSharp.Cli --model <diffusion-gemma.gguf> --input prompt.txt --backend ggml_metal \
--max-tokens 256 --diffusion-steps 48 --diffusion-seed 0
# Inspect the rendered prompt and tokenization without running inference
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --dump-prompt
Batch & benchmarks
# Batch processing (JSONL)
./TensorSharp.Cli --model <model.gguf> --input-jsonl requests.jsonl \
--output results.txt --backend ggml_metal
# Multi-turn chat simulation with KV-cache reuse (mirrors the web UI behavior)
./TensorSharp.Cli --model <model.gguf> --multi-turn-jsonl chat.jsonl \
--backend ggml_metal --max-tokens 200
# Throughput benchmark: best-of-N prefill and decode timing
./TensorSharp.Cli --model <model.gguf> --backend ggml_metal \
--benchmark --bench-prefill 256 --bench-decode 128 --bench-runs 3
The JSONL format is one JSON object per line:
{"id": "q1", "messages": [{"role": "user", "content": "What is 2+3?"}], "max_tokens": 50}
{"id": "q2", "messages": [{"role": "user", "content": "Write a haiku."}], "max_tokens": 100, "temperature": 0.8}
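A request file in this shape can also be generated programmatically. The following is a minimal sketch, assuming only the fields that appear in the two sample lines above (`id`, `messages`, `max_tokens`, `temperature`). The helper names `make_request` and `to_jsonl` are hypothetical and are not part of TensorSharp.

```python
import json


def make_request(req_id, prompt, max_tokens, temperature=None):
    """Build one request object using only the fields shown in the samples above."""
    request = {
        "id": req_id,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if temperature is not None:
        # Per-request override, as in the second sample line.
        request["temperature"] = temperature
    return request


def to_jsonl(requests):
    """Serialize requests as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in requests)


requests = [
    make_request("q1", "What is 2+3?", 50),
    make_request("q2", "Write a haiku.", 100, temperature=0.8),
]

if __name__ == "__main__":
    # Write the file that would then be passed to --input-jsonl.
    with open("requests.jsonl", "w", encoding="utf-8") as f:
        f.write(to_jsonl(requests))
```

Using `json.dumps` per object keeps each request on a single line even when the prompt text itself contains newlines, because they are escaped inside the JSON string.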
Command-line options
Input / output
| Option | Description |
|---|---|
| --model <path> | Path to a GGUF model file (required). |
| --input <path> | Text file containing the user prompt. |
| --input-jsonl <path> | JSONL file with batch requests (one JSON object per line). |
| --multi-turn-jsonl <path> | JSONL file for multi-turn chat simulation with KV-cache reuse. |
| --output <path> | Write generated text to this file. |
| --image / --video / --audio <path> | Media file for vision / video / audio inference. |
| --mmproj <path> | Path to the multimodal projector GGUF file (auto-detected if placed beside the model). |
Runtime
| Option | Description |
|---|---|
| --max-tokens <N> | Maximum tokens to generate (default: 100). |
| --backend <type> | Compute backend: cpu, cuda, mlx, ggml_cpu, ggml_metal, ggml_cuda. |
| --kv-cache-dtype <type> | KV cache precision: f32 (default), f16, or q8_0. |
| --interactive / -i | Start the interactive REPL (see below). |
| --system <text> / --system-file <path> | Seed the session's system prompt from text or a file. |
| --think | Enable thinking / reasoning mode (chain-of-thought). |
| --tools <path> | JSON file with tool / function definitions. |
| --dump-prompt | Render the prompt + tokenization and exit (no generation). |
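To get a feel for what --kv-cache-dtype trades off, the sketch below estimates KV-cache memory for each precision. It is a rough illustration, not TensorSharp's own accounting. It assumes a plain transformer that caches K and V for every layer and position, and it assumes q8_0 follows the ggml block layout (32 int8 values plus one 16-bit scale per block). The model shape is hypothetical.

```python
# Bytes per cached element. The q8_0 figure assumes the ggml block layout:
# 32 int8 values plus one 16-bit scale, i.e. 34 bytes per 32 elements.
BYTES_PER_ELEMENT = {"f32": 4.0, "f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, dtype):
    """Estimate K+V cache size for a plain transformer (no sliding window)."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[dtype]


# Hypothetical model shape, chosen only for illustration.
for dtype in ("f32", "f16", "q8_0"):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                          context_len=8192, dtype=dtype)
    print(f"{dtype}: {size / 2**20:.0f} MiB")
```

For this hypothetical shape the estimate comes to 2048 MiB for f32, 1024 MiB for f16, and 544 MiB for q8_0. Real usage can differ with sliding-window layers, padding, or backend-specific allocation.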
Sampling
| Option | Description |
|---|---|
| --temperature <f> | Sampling temperature (0 = greedy). |
| --top-k <N> | Top-K filtering (0 = disabled). |
| --top-p <f> | Nucleus sampling threshold (1.0 = disabled). |
| --min-p <f> | Minimum probability filtering (0 = disabled). |
| --repeat-penalty <f> | Repetition penalty (1.0 = none). |
| --presence-penalty <f> / --frequency-penalty <f> | Presence / frequency penalties (0 = disabled). |
| --seed <N> | Random seed (-1 = non-deterministic). |
| --stop <string> | Stop sequence (can be repeated). |
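The sketch below illustrates how temperature, top-k, top-p, and min-p filtering are commonly defined. It is a generic illustration, not TensorSharp's sampler: the filter order (temperature, then top-k, then top-p, then min-p) and the function name `filter_probs` are assumptions. The "disabled" values mirror the table above.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0):
    """Return {token_index: probability} after common sampling filters.

    Assumed order: temperature -> top-k -> top-p -> min-p.
    """
    if temperature <= 0:
        # Greedy: keep only the most likely token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    if top_k > 0:
        # Keep the K most likely tokens.
        ranked = ranked[:top_k]

    if top_p < 1.0:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for p, i in ranked:
            kept.append((p, i))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    if min_p > 0.0:
        # Drop tokens below min_p times the probability of the top token.
        threshold = min_p * ranked[0][0]
        ranked = [(p, i) for p, i in ranked if p >= threshold]

    # Renormalize the survivors so they sum to 1.
    norm = sum(p for p, _ in ranked)
    return {i: p / norm for p, i in ranked}
```

A sampler would then draw one token index from the returned distribution, using the seed for reproducibility. Repetition, presence, and frequency penalties are not modeled here.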
DiffusionGemma, benchmarks & logging
| Option | Description |
|---|---|
| --diffusion-steps <N> | DiffusionGemma denoising steps per block (default: 48). |
| --diffusion-seed <N> | DiffusionGemma deterministic sampler seed (default: 0). |
| --diffusion-blocks <N> | Block-autoregressive canvas count (0 derives it from --max-tokens). |
| --benchmark | Run a synthetic prefill/decode throughput benchmark. |
| --bench-prefill / --bench-decode / --bench-runs <N> | Synthetic prefill length, decode length, and run count. |
| --bench-kvcache / --bench-kv-turns <N> | Multi-turn KV-cache reuse benchmark (with-cache vs forced-reset). |
| --warmup-runs <N> | Throw-away forward passes before timing (default: 0). |
| --log-level <lvl> | trace, debug, info, warning, error, critical, off. |
| --log-dir <path> / --log-file <0\|1> / --log-console <0\|1> | JSON-line file logger directory and toggles. |
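When sweeping benchmark settings, it can help to assemble the command line in a script. The sketch below builds an argument list from the benchmark flags in the table above. The helper `benchmark_command`, the binary path, and the model path are hypothetical, and combining --warmup-runs with --benchmark is an assumption based on the option's description.

```python
import shlex


def benchmark_command(binary, model, backend, prefill, decode, runs, warmup=0):
    """Assemble an argv list using only flags listed in the tables above."""
    argv = [
        binary, "--model", model, "--backend", backend,
        "--benchmark",
        "--bench-prefill", str(prefill),
        "--bench-decode", str(decode),
        "--bench-runs", str(runs),
    ]
    if warmup > 0:
        argv += ["--warmup-runs", str(warmup)]
    return argv


# Hypothetical binary and model paths; substitute your own.
for prefill in (128, 256, 512):
    argv = benchmark_command("./TensorSharp.Cli", "model.gguf", "ggml_metal",
                             prefill=prefill, decode=128, runs=3, warmup=1)
    # Print a shell-quoted command line for inspection.
    print(shlex.join(argv))
    # To actually run it: subprocess.run(argv, check=True)
```

Passing the list form to `subprocess.run` avoids shell quoting problems when a model path contains spaces.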
See the full reference, including --test, --test-templates, and chunked-prefill correctness checks, on the API Reference page.
Interactive REPL commands
Launch with --interactive / -i. Any input that does not start with / is sent as a user turn; type /help for the list of commands. The prompt header shows the current model, backend, architecture, context length, projector, conversation depth, and pending attachments. Press Ctrl+C during generation to interrupt it, or at the prompt to exit.
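The rule for telling commands apart from user turns can be pictured with a small sketch. This is an illustration of the stated rule only, not TensorSharp's parser: the function name `classify_line` and the handling of surrounding whitespace are assumptions.

```python
def classify_line(line):
    """Classify one REPL input line.

    Returns ("command", name, argument) for lines starting with "/",
    otherwise ("turn", text).
    """
    text = line.strip()
    if text.startswith("/"):
        # Split "/name argument..." at the first space.
        name, _, argument = text[1:].partition(" ")
        return ("command", name, argument.strip())
    return ("turn", text)


print(classify_line("/system You are a terse assistant."))
print(classify_line("What is 2+3?"))
```

Under these assumptions the first line is treated as the `system` command with the remaining text as its argument, and the second is sent to the model as a user turn.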
Conversation
| Command | Description |
|---|---|
| /help, /? | Show all interactive commands. |
| /exit, /quit | Leave the session. |
| /reset, /new | Clear conversation history and KV cache. |
| /history · /save <file> | Print the transcript / append it to a file. |
| /system <text> | Set the system prompt (resets KV cache). |
| /think on\|off · /multiline on\|off | Toggle reasoning mode / multi-line input. |
Model & runtime
| Command | Description |
|---|---|
| /info, /status | Show loaded model, backend, architecture, context/vocab size, projector, depth. |
| /model <path> | Load a different .gguf on the current backend (resets the session). |
| /backend <name> | Reload the current model on a different backend. |
| /mmproj <path> | Load or replace the multimodal projector. Alias: /projector. |
Sampling (live) & uploads (next turn)
| Command | Description |
|---|---|
| /sampling, /show | Print current sampling configuration. |
| /max · /temp · /topk · /topp · /minp | Set reply length / temperature / top-k / top-p / min-p. |
| /repeat · /presence · /frequency · /seed | Set penalties and the random seed. |
| /stop <text> · /clearstop | Add / clear stop sequences. |
| /image <path> · /audio · /video · /text | Attach media or inline a text file for the next turn. |
| /clearattach | Drop pending attachments without sending a turn. |