Command Line (CLI)

TensorSharp.Cli is the console host for local prompts, multimodal experiments, prompt inspection, JSONL batch workflows, the interactive REPL, and built-in benchmarks. The binary lands in TensorSharp.Cli/bin/... after the build.

Examples

cd TensorSharp.Cli/bin

# Text inference (macOS)
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --output result.txt \
    --max-tokens 200 --backend ggml_metal

# Text inference on Windows/Linux + NVIDIA GPU
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --output result.txt \
    --max-tokens 200 --backend ggml_cuda

# Interactive turn-by-turn chat (REPL) with KV-cache reuse and slash commands
./TensorSharp.Cli --model <model.gguf> --backend ggml_metal --interactive
./TensorSharp.Cli --model <model.gguf> --backend ggml_metal -i \
    --system "You are a terse assistant." --temperature 0.7 --top-p 0.9 --think
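The text-inference examples above differ only in the --backend value. On a POSIX shell, a small wrapper can choose that value per machine. The sketch below is one possible approach, not something TensorSharp.Cli does itself: the function name pick_backend, the nvidia-smi probe, and the ggml_cpu fallback are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: choose a --backend value for the current machine.
# macOS -> ggml_metal and NVIDIA GPU -> ggml_cuda follow the examples above;
# falling back to ggml_cpu is an assumption.
pick_backend() {
    if [ "$(uname -s)" = "Darwin" ]; then
        echo ggml_metal
    elif command -v nvidia-smi >/dev/null 2>&1; then
        # nvidia-smi on PATH is used as a rough hint that an NVIDIA GPU exists.
        echo ggml_cuda
    else
        echo ggml_cpu
    fi
}

BACKEND=$(pick_backend)
echo "selected backend: $BACKEND"

# Then, from the build output directory:
# ./TensorSharp.Cli --model <model.gguf> --input prompt.txt --backend "$BACKEND"
```

The probe only checks that the tool is installed, so it can pick ggml_cuda on a machine whose GPU is unusable; passing --backend explicitly remains the safer choice when the target is known.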

Multimodal

# Image inference (Gemma 3/4, Qwen 3.5-family)
./TensorSharp.Cli --model <model.gguf> --image photo.png --backend ggml_metal

# Video inference (Gemma 4)
./TensorSharp.Cli --model <model.gguf> --video clip.mp4 --backend ggml_metal

# Audio inference (Gemma 4)
./TensorSharp.Cli --model <model.gguf> --audio speech.wav --backend ggml_metal

Reasoning, tools & sampling

# Thinking / reasoning mode
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --backend ggml_metal --think

# Tool calling
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --backend ggml_metal \
    --tools tools.json

# With sampling parameters
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --backend ggml_metal \
    --temperature 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.2 --seed 42
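When comparing sampling settings, it can help to hold everything fixed except the seed. The loop below is a dry-run sketch: it only prints the commands it would run, one per seed, so they can be reviewed first. The variables MODEL and SEEDS and the result-<seed>.txt naming are hypothetical; the flags are the ones used in the examples above.

```shell
#!/bin/sh
# Dry run: print one command per seed instead of executing it.
MODEL="${MODEL:-model.gguf}"   # hypothetical placeholder path
SEEDS="1 2 3"                  # hypothetical seed list

for seed in $SEEDS; do
    echo ./TensorSharp.Cli --model "$MODEL" --input prompt.txt \
        --output "result-$seed.txt" --backend ggml_metal \
        --temperature 0.7 --top-p 0.9 --seed "$seed"
done
```

Removing the leading echo turns the dry run into real invocations, each writing to its own output file.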

DiffusionGemma & inspection

# DiffusionGemma text-diffusion generation
./TensorSharp.Cli --model <diffusion-gemma.gguf> --input prompt.txt --backend ggml_metal \
    --max-tokens 256 --diffusion-steps 48 --diffusion-seed 0

# Inspect the rendered prompt and tokenization without running inference
./TensorSharp.Cli --model <model.gguf> --input prompt.txt --dump-prompt

Batch & benchmarks

# Batch processing (JSONL)
./TensorSharp.Cli --model <model.gguf> --input-jsonl requests.jsonl \
    --output results.txt --backend ggml_metal

# Multi-turn chat simulation with KV-cache reuse (mirrors the web UI behavior)
./TensorSharp.Cli --model <model.gguf> --multi-turn-jsonl chat.jsonl \
    --backend ggml_metal --max-tokens 200

# Throughput benchmark: best-of-N prefill and decode timing
./TensorSharp.Cli --model <model.gguf> --backend ggml_metal \
    --benchmark --bench-prefill 256 --bench-decode 128 --bench-runs 3

The JSONL format is one JSON object per line:

{"id": "q1", "messages": [{"role": "user", "content": "What is 2+3?"}], "max_tokens": 50}
{"id": "q2", "messages": [{"role": "user", "content": "Write a haiku."}], "max_tokens": 100, "temperature": 0.8}
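A request file in this shape can be written straight from the shell. The sketch below recreates the two sample requests in requests.jsonl, the file name used by the batch example above. The match-count check is only an illustrative sanity test and is not part of the CLI.

```shell
#!/bin/sh
# Write two batch requests, one JSON object per line.
# Quoting the 'EOF' delimiter stops the shell from expanding
# anything inside the JSON.
cat > requests.jsonl <<'EOF'
{"id": "q1", "messages": [{"role": "user", "content": "What is 2+3?"}], "max_tokens": 50}
{"id": "q2", "messages": [{"role": "user", "content": "Write a haiku."}], "max_tokens": 100, "temperature": 0.8}
EOF

# Sanity check: count the lines that carry a "messages" field.
REQUESTS=$(grep -c '"messages"' requests.jsonl)
echo "requests written: $REQUESTS"

# Then pass the file to the CLI as in the batch example:
# ./TensorSharp.Cli --model <model.gguf> --input-jsonl requests.jsonl \
#     --output results.txt --backend ggml_metal
```

Each object must stay on a single line; pretty-printed JSON spread over several lines would no longer be one request per line.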

Command-line options

Input / output

Option                              Description
--model <path>                      Path to a GGUF model file (required).
--input <path>                      Text file containing the user prompt.
--input-jsonl <path>                JSONL file with batch requests (one JSON object per line).
--multi-turn-jsonl <path>           JSONL file for multi-turn chat simulation with KV-cache reuse.
--output <path>                     Write generated text to this file.
--image / --video / --audio <path>  Media file for vision / video / audio inference.
--mmproj <path>                     Path to the multimodal projector GGUF file (auto-detected if placed beside the model).

Runtime

Option                                  Description
--max-tokens <N>                        Maximum tokens to generate (default: 100).
--backend <type>                        Compute backend: cpu, cuda, mlx, ggml_cpu, ggml_metal, ggml_cuda.
--kv-cache-dtype <type>                 KV cache precision: f32 (default), f16, or q8_0.
--interactive / -i                      Start the interactive REPL (see below).
--system <text> / --system-file <path>  Seed the session's system prompt from text or a file.
--think                                 Enable thinking / reasoning mode (chain-of-thought).
--tools <path>                          JSON file with tool / function definitions.
--dump-prompt                           Render the prompt + tokenization and exit (no generation).

Sampling

Option                                            Description
--temperature <f>                                 Sampling temperature (0 = greedy).
--top-k <N>                                       Top-K filtering (0 = disabled).
--top-p <f>                                       Nucleus sampling threshold (1.0 = disabled).
--min-p <f>                                       Minimum probability filtering (0 = disabled).
--repeat-penalty <f>                              Repetition penalty (1.0 = none).
--presence-penalty <f> / --frequency-penalty <f>  Presence / frequency penalties (0 = disabled).
--seed <N>                                        Random seed (-1 = non-deterministic).
--stop <string>                                   Stop sequence (can be repeated).

DiffusionGemma, benchmarks & logging

Option                                                     Description
--diffusion-steps <N>                                      DiffusionGemma denoising steps per block (default: 48).
--diffusion-seed <N>                                       DiffusionGemma deterministic sampler seed (default: 0).
--diffusion-blocks <N>                                     Block-autoregressive canvas count (0 derives it from --max-tokens).
--benchmark                                                Run a synthetic prefill/decode throughput benchmark.
--bench-prefill / --bench-decode / --bench-runs <N>        Synthetic prefill length, decode length, and run count.
--bench-kvcache / --bench-kv-turns <N>                     Multi-turn KV-cache reuse benchmark (with-cache vs forced-reset).
--warmup-runs <N>                                          Throw-away forward passes before timing (default: 0).
--log-level <lvl>                                          trace, debug, info, warning, error, critical, off.
--log-dir <path> / --log-file <0|1> / --log-console <0|1>  JSON-line file logger directory and toggles.

See the full reference, including --test, --test-templates, and chunked-prefill correctness checks, on the API Reference page.

Interactive REPL commands

Launch with --interactive / -i. Anything that does not start with / is sent as a user turn; type /help for the list of commands. The prompt header shows the current model, backend, architecture, context length, projector, conversation depth, and pending attachments. Press Ctrl+C during generation to interrupt the reply; press it at the prompt to exit.

Conversation

Command                            Description
/help, /?                          Show all interactive commands.
/exit, /quit                       Leave the session.
/reset, /new                       Clear conversation history and KV cache.
/history · /save <file>            Print / append the transcript.
/system <text>                     Set the system prompt (resets KV cache).
/think on|off · /multiline on|off  Toggle reasoning mode / multi-line input.

Model & runtime

Command          Description
/info, /status   Show loaded model, backend, architecture, context/vocab size, projector, depth.
/model <path>    Load a different .gguf on the current backend (resets the session).
/backend <name>  Reload the current model on a different backend.
/mmproj <path>   Load or replace the multimodal projector. Alias: /projector.

Sampling (live) & uploads (next turn)

Command                                   Description
/sampling, /show                          Print current sampling configuration.
/max · /temp · /topk · /topp · /minp      Set reply length / temperature / top-k / top-p / min-p.
/repeat · /presence · /frequency · /seed  Set penalties and the random seed.
/stop <text> · /clearstop                 Add / clear stop sequences.
/image <path> · /audio · /video · /text   Attach media or inline a text file for the next turn.
/clearattach                              Drop pending attachments without sending a turn.