Supported Models

TensorSharp loads models in GGUF format and auto-detects the architecture from the file's general.architecture metadata. Pick a quantization that fits your hardware (Q4_K_M for low memory, Q8_0 for higher quality).

Supported architectures

Architecture	GGUF arch keys	Example models	Multimodal	Thinking	Tools	MTP spec
Gemma 4	`gemma4`	gemma-4-E4B, 31B, 26B-A4B (MoE)	Image, Video, Audio	Yes	Yes	Yes (separate draft)
Gemma 3	`gemma3`	gemma-3-4b	Image	No	No	—
Qwen 3	`qwen3`	Qwen3-4B	Text only	Yes	Yes	—
Qwen 3.5 / 3.6	`qwen35`, `qwen35moe`, `qwen3next`	Qwen3.5-9B, Qwen3.5/3.6-35B-A3B (MoE)	Image	Yes	Yes	Yes on 3.6 (embedded NextN)
GPT OSS	`gptoss`, `gpt-oss`	gpt-oss-20b (MoE)	Text only	Yes (always)	Yes	—
Nemotron-H	`nemotron_h`, `nemotron_h_moe`	Nemotron-H-8B, 47B, Nemotron 3 Nano Omni	Image (Omni)	Yes	Yes	—
Mistral 3	`mistral3`	Mistral-Small-3.1-24B-Instruct	Image	No	No	—
DiffusionGemma	`diffusion-gemma`, `diffusion_gemma`	diffusion-gemma text-diffusion GGUFs	Text only	No	No	—

Detailed per-model architecture cards (forward graph, components, parameters, and how TensorSharp optimizes prefill/decode) live under docs/models/ in the repository.

Model downloads (GGUF)

Architecture	Model	Download
Gemma 4	gemma-4-E4B-it	ggml-org/gemma-4-E4B-it-GGUF
Gemma 4	gemma-4-31B-it	ggml-org/gemma-4-31B-it-GGUF
Gemma 4	gemma-4-26B-A4B-it (MoE)	ggml-org/gemma-4-26B-A4B-it-GGUF
Gemma 3	gemma-3-4b-it	google/gemma-3-4b-it-qat-q4_0-gguf
Qwen 3	Qwen3-4B	Qwen/Qwen3-4B-GGUF
Qwen 3.5 / 3.6	Qwen3.5-9B	unsloth/Qwen3.5-9B-GGUF
Qwen 3.5 / 3.6	Qwen3.5-35B-A3B (MoE)	ggml-org/Qwen3.5-35B-A3B-GGUF
GPT OSS	gpt-oss-20b (MoE)	ggml-org/gpt-oss-20b-GGUF
Nemotron-H	Nemotron-H-8B-Reasoning-128K	bartowski/nvidia_Nemotron-H-8B-…
Nemotron-H	Nemotron-H-47B-Reasoning-128K	bartowski/nvidia_Nemotron-H-47B-…
Mistral 3	Mistral-Small-3.1-24B-Instruct	bartowski/Mistral-Small-3.1-24B-…

🧩

Multimodal models need a projector (mmproj) file. The Gemma 4 / Gemma 3 / Qwen 3.5 / Mistral 3 projectors ship in or alongside the repos above; place the projector next to the model with a recognized name for auto-loading, or pass it explicitly with --mmproj.

Multimodal support

Family	Inputs	Notes
Gemma 4	Image · Video · Audio	Images PNG/JPEG/HEIC; Video MP4 (1 fps via OpenCV); Audio WAV 16 kHz mono / MP3 / OGG. Projector: `gemma-4-mmproj-F16.gguf`.
Gemma 3	Image	PNG / JPEG / HEIC. Projector: `mmproj-gemma3-4b-f16.gguf`.
Qwen 3.5 / 3.6	Image	Dynamic-resolution vision encoder. Projector: `Qwen3.5-mmproj-F16.gguf`.
Mistral 3	Image	Pixtral vision encoder. Projector: `mistral3-mmproj.gguf`.
Nemotron-H (Omni)	Image	RADIO / v2_vl ViT encoder. Pass the matching `--mmproj`; image tokens expand at `<image>` placeholders.

Send images/audio/video via the CLI (--image, --video, --audio), the Web UI uploads, or the HTTP API (base64 images array for Ollama, image_url data URI for OpenAI).

Thinking / reasoning mode

Thinking-capable models (Qwen 3, Qwen 3.5/3.6, Gemma 4, GPT OSS, Nemotron-H) produce structured chain-of-thought before the final answer. The thinking content is separated from the visible response so the client can show or hide it.

Qwen 3 / Qwen 3.5/3.6 / Nemotron-H — <think>…</think> tags.
Gemma 4 — <|channel>thought …<channel|> tags.
GPT OSS — Harmony format: <|channel|>analysis for thinking, <|channel|>final for the answer.

Enable it via --think (CLI), "think": true (Ollama API / Web UI), or the thinking toggle in the browser. Responses expose the reasoning separately — e.g. message.thinking in the Ollama chat response.

Tool calling / function calling

Models can invoke user-defined tools and participate in multi-turn tool-call conversations. Define tools as JSON and pass them via --tools (CLI) or the tools parameter (API). Each architecture uses its own wire format, but the output parser extracts calls into structured tool_calls regardless:

Qwen 3 / Qwen 3.5/3.6 / Nemotron-H — <tool_call>{"name": …, "arguments": {…}}</tool_call>
Gemma 4 — <|tool_call>call:function_name{args}<tool_call|>
GPT OSS (Harmony) — tools declared as a TypeScript namespace; calls emitted on the commentary channel.

See Tool calling over HTTP for a complete request/response example and the continuation loop.

← Compute Backends Next: Command Line →