Command Line (CLI)

TensorSharp.Cli is the console host for local prompts, multimodal experiments, prompt inspection, JSONL batch workflows, the interactive REPL, and built-in benchmarks. The binary lands in TensorSharp.Cli/bin/... after the build.

Examples

cd TensorSharp.Cli/bin

# Text inference (macOS)
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --output result.txt \
    --max-tokens 200 --backend ggml_metal

# Text inference (Windows/Linux with an NVIDIA GPU)
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --output result.txt \
    --max-tokens 200 --backend ggml_cuda

# Interactive turn-by-turn chat (REPL) with KV-cache reuse and slash commands
./TensorSharp.Cli --model <model.gguf> --backend ggml_metal --interactive
./TensorSharp.Cli --model <model.gguf> --backend ggml_metal -i \
    --system "You are a terse assistant." --temperature 0.7 --top-p 0.9 --think

Multimodal

# Image inference (Gemma 3/4, Qwen 3.5-family)
./TensorSharp.Cli --model <model.gguf> --image photo.png --backend ggml_metal

# Video inference (Gemma 4)
./TensorSharp.Cli --model <model.gguf> --video clip.mp4 --backend ggml_metal

# Audio inference (Gemma 4)
./TensorSharp.Cli --model <model.gguf> --audio speech.wav --backend ggml_metal

Reasoning, tools & sampling

# Thinking / reasoning mode
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --backend ggml_metal --think

# Tool calling
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --backend ggml_metal \
    --tools tools.json

# With sampling parameters
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --backend ggml_metal \
    --temperature 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.2 --seed 42

DiffusionGemma & inspection

# DiffusionGemma text-diffusion generation
./TensorSharp.Cli --model <diffusion-gemma.gguf> --input prompt.txt --backend ggml_metal \
    --max-tokens 256 --diffusion-steps 48 --diffusion-seed 0

# Inspect the rendered prompt and tokenization without running inference
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --dump-prompt

Batch & benchmarks

# Batch processing (JSONL)
./TensorSharp.Cli --model <model.gguf> --input-jsonl requests.jsonl \
    --output results.txt --backend ggml_metal

# Multi-turn chat simulation with KV-cache reuse (mirrors the web UI behavior)
./TensorSharp.Cli --model <model.gguf> --multi-turn-jsonl chat.jsonl \
    --backend ggml_metal --max-tokens 200

# Throughput benchmark: best-of-N prefill and decode timing
./TensorSharp.Cli --model <model.gguf> --backend ggml_metal \
    --benchmark --bench-prefill 256 --bench-decode 128 --bench-runs 3

The JSONL format is one JSON object per line:

{"id": "q1", "messages": [{"role": "user", "content": "What is 2+3?"}], "max_tokens": 50}
{"id": "q2", "messages": [{"role": "user", "content": "Write a haiku."}], "max_tokens": 100, "temperature": 0.8}

Command-line options

Input / output

Option	Description
`--model <path>`	Path to a GGUF model file (required).
`--input <path>`	Text file containing the user prompt.
`--input-jsonl <path>`	JSONL file with batch requests (one JSON object per line).
`--multi-turn-jsonl <path>`	JSONL file for multi-turn chat simulation with KV-cache reuse.
`--output <path>`	Write generated text to this file.
`--image / --video / --audio <path>`	Media file for vision / video / audio inference.
`--mmproj <path>`	Path to the multimodal projector GGUF file (auto-detected if placed beside the model).

Runtime

Option	Description
`--max-tokens <N>`	Maximum tokens to generate (default: 100).
`--backend <type>`	Compute backend: `cpu`, `cuda`, `mlx`, `ggml_cpu`, `ggml_metal`, `ggml_cuda`.
`--kv-cache-dtype <type>`	KV cache precision: `f32` (default), `f16`, or `q8_0`.
`--interactive` / `-i`	Start the interactive REPL (see below).
`--system <text>` / `--system-file <path>`	Seed the session's system prompt from text or a file.
`--think`	Enable thinking / reasoning mode (chain-of-thought).
`--tools <path>`	JSON file with tool / function definitions.
`--dump-prompt`	Render the prompt + tokenization and exit (no generation).
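
The examples on this page use `ggml_metal` on macOS and `ggml_cuda` on Windows/Linux with an NVIDIA GPU. A wrapper script might choose between them automatically. The sketch below is one way to do that in a POSIX shell. The `pick_backend` helper is hypothetical, and so are the detection heuristic (`uname -s` plus the presence of `nvidia-smi`) and the `ggml_cpu` fallback; they are assumptions of this example, not behavior of TensorSharp.Cli.

```shell
# Hypothetical helper: map a platform description to a documented backend name.
pick_backend() {
    os_name=$1      # e.g. the output of `uname -s`
    has_nvidia=$2   # "yes" if an NVIDIA driver appears to be installed
    case "$os_name" in
        Darwin)
            echo "ggml_metal"
            ;;
        *)
            if [ "$has_nvidia" = "yes" ]; then
                echo "ggml_cuda"
            else
                echo "ggml_cpu"   # assumed fallback when no GPU backend applies
            fi
            ;;
    esac
}

# Assumption: nvidia-smi on PATH is a good enough signal for a usable NVIDIA GPU.
if command -v nvidia-smi >/dev/null 2>&1; then nvidia=yes; else nvidia=no; fi

BACKEND=$(pick_backend "$(uname -s)" "$nvidia")
echo "selected backend: $BACKEND"

# ./TensorSharp.Cli --model <model.gguf> --input prompt.txt --backend "$BACKEND"
```

Keeping the decision in a function that takes the platform as arguments makes it easy to check each branch without access to the hardware in question.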

Sampling

Option	Description
`--temperature <f>`	Sampling temperature (0 = greedy).
`--top-k <N>`	Top-K filtering (0 = disabled).
`--top-p <f>`	Nucleus sampling threshold (1.0 = disabled).
`--min-p <f>`	Minimum probability filtering (0 = disabled).
`--repeat-penalty <f>`	Repetition penalty (1.0 = none).
`--presence-penalty <f>` / `--frequency-penalty <f>`	Presence / frequency penalties (0 = disabled).
`--seed <N>`	Random seed (-1 = non-deterministic).
`--stop <string>`	Stop sequence (can be repeated).
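
Because `--stop` can be repeated, a script that holds several stop sequences needs to emit one flag per sequence. The following is a minimal dry-run sketch: the hypothetical `sampling_cmd` helper only prints a command line and does not invoke the binary. The sampling values are taken from the examples on this page, while the stop strings and the model file name are placeholders.

```shell
# Hypothetical helper: print (not run) a command with one --stop per sequence.
sampling_cmd() {
    model=$1
    shift
    # The remaining arguments are stop sequences. Rotate through them,
    # appending "--stop <sequence>" for each one to the argument list.
    n=$#
    while [ "$n" -gt 0 ]; do
        stop=$1
        shift
        set -- "$@" --stop "$stop"
        n=$((n - 1))
    done
    # Display only: quoting of arguments is not preserved in the printed line.
    printf '%s ' ./TensorSharp.Cli --model "$model" --input prompt.txt \
        --temperature 0.7 --top-p 0.9 --seed 42 "$@"
    printf '\n'
}

sampling_cmd model.gguf "###" "User:"
```

Inside the function, `"$@"` keeps each stop sequence as a single argument even when it contains spaces, so replacing the `printf` calls with a direct invocation would pass the sequences through intact.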

DiffusionGemma, benchmarks & logging

Option	Description
`--diffusion-steps <N>`	DiffusionGemma denoising steps per block (default: 48).
`--diffusion-seed <N>`	DiffusionGemma deterministic sampler seed (default: 0).
`--diffusion-blocks <N>`	Block-autoregressive canvas count (`0` derives it from `--max-tokens`).
`--benchmark`	Run a synthetic prefill/decode throughput benchmark.
`--bench-prefill / --bench-decode / --bench-runs <N>`	Synthetic prefill length, decode length, and run count.
`--bench-kvcache` / `--bench-kv-turns <N>`	Multi-turn KV-cache reuse benchmark (with-cache vs forced-reset).
`--warmup-runs <N>`	Throw-away forward passes before timing (default: 0).
`--log-level <lvl>`	`trace`, `debug`, `info`, `warning`, `error`, `critical`, `off`.
`--log-dir <path>` / `--log-file <0|1>` / `--log-console <0|1>`	JSON-line file logger directory and toggles.

See the full reference, including --test, --test-templates, and chunked-prefill correctness checks, on the API Reference page.

Interactive REPL commands

Launch with --interactive / -i. Any input that does not start with / is sent as a user turn; type /help for the command list. The prompt header shows the current model, backend, architecture, context length, projector, conversation depth, and pending attachments. Press Ctrl+C during generation to interrupt the reply, or at the prompt to exit.

Conversation

Command	Description
`/help`, `/?`	Show all interactive commands.
`/exit`, `/quit`	Leave the session.
`/reset`, `/new`	Clear conversation history and KV cache.
`/history` · `/save <file>`	Print the transcript / append it to a file.
`/system <text>`	Set the system prompt (resets KV cache).
`/think on|off` · `/multiline on|off`	Toggle reasoning mode / multi-line input.

Model & runtime

Command	Description
`/info`, `/status`	Show loaded model, backend, architecture, context/vocab size, projector, depth.
`/model <path>`	Load a different `.gguf` on the current backend (resets the session).
`/backend <name>`	Reload the current model on a different backend.
`/mmproj <path>`	Load or replace the multimodal projector. Alias: `/projector`.

Sampling (live) & uploads (next turn)

Command	Description
`/sampling`, `/show`	Print current sampling configuration.
`/max` · `/temp` · `/topk` · `/topp` · `/minp`	Set reply length / temperature / top-k / top-p / min-p.
`/repeat` · `/presence` · `/frequency` · `/seed`	Set penalties and the random seed.
`/stop <text>` · `/clearstop`	Add / clear stop sequences.
`/image <path>` · `/audio` · `/video` · `/text`	Attach media or inline a text file for the next turn.
`/clearattach`	Drop pending attachments without sending a turn.
