Features

A complete catalog of what TensorSharp does today. Each item links to the page where you can use it.

Highlights

🧠

Gemma 4 / 3, Qwen 3 / 3.5 / 3.6, GPT OSS, Nemotron-H, Mistral 3, DiffusionGemma.

🖼️

Image, video, and audio inputs (Gemma 4); image input for several others.

💭

Structured chain-of-thought, separated from the visible answer.

🛠️

Multi-turn function calling across all three API styles.

📦

Q4_K_M, Q8_0, MXFP4, IQ2_XXS and more run in matmul without dequantizing to FP32.

🔀

vLLM-style paged KV cache with cross-request prefix sharing.

⏩

MTP / NextN draft heads accelerate solo decode losslessly.

🔌

Drop-in endpoints for existing tooling, plus a browser chat UI.

Multi-architecture support — Gemma 4, Gemma 3, DiffusionGemma, Qwen 3, Qwen 3.5/3.6-family, GPT OSS, Nemotron-H, Mistral 3. → Supported models
Multimodal inference — image, video, and audio inputs for Gemma 4; images for Gemma 3, Qwen 3.5-family, Mistral 3, and Nemotron-H Omni. → Multimodal
Mixture of Experts (MoE) — Gemma 4 MoE (e.g. 26B-A4B), GPT OSS MoE (gpt-oss-20b), Qwen 3.5/3.6 MoE (35B-A3B), and Nemotron-H MoE FFN layers, with a fused batched GPU MoE dispatch.
Hybrid SSM-Transformer — Nemotron-H mixes Mamba2 SSM layers, attention layers, and MoE FFN in one model.
Hybrid Attention-Recurrent — Qwen 3.5/3.6-family mix full-attention layers with GatedDeltaNet recurrent layers.
Text-diffusion generation — DiffusionGemma uses an iterative EntropyBound denoising sampler instead of autoregressive decode. → DiffusionGemma

Thinking / reasoning mode — structured chain-of-thought with <think> / <|channel> tags (Qwen 3, Qwen 3.5/3.6, Gemma 4, GPT OSS, Nemotron-H). → Thinking mode
Tool calling / function calling — architecture-agnostic output parsing turns raw model output into structured tool_calls. → Tool calling
Configurable sampling — temperature, top-k, top-p, min-p, repetition / presence / frequency penalties, seed, and stop sequences. → Sampling
Structured outputs — OpenAI response_format with text, json_object, and validated json_schema. → Structured outputs
Chat templates — auto-loaded from GGUF metadata (Jinja2), with hardcoded fallbacks per architecture.
Streaming — token-by-token output via SSE (web) or stdout (console), with abort/stop support for in-flight generations.

GPU-accelerated — GGML Metal (macOS), GGML CUDA (Windows/Linux + NVIDIA), a direct CUDA/cuBLAS backend, and an MLX backend for Apple Silicon — all with CPU fallbacks. → Backends
Continuous batching & paged KV cache — block-paged KV pool with block-hash prefix sharing, an iteration-level scheduler that admits/preempts sequences mid-batch, optional SSD-backed tier, and a native fused paged-attention kernel. → Deep dive
Batched / parallel inference — N sequences packed into a single forward pass with paged K/V scatter (Mistral 3, Gemma 4, GPT OSS, Qwen 3 / 3.5 / 3.6, Nemotron-H).
MTP / NextN speculative decoding — multi-token-prediction draft heads accelerate solo decode; lossless because the request's own sampler drives both draft and verify. → Speculative decoding
Native quantized compute — quantized weights are used directly in matmul without expanding to FP32, saving memory and bandwidth.
Optimized pure C# CPU backend — managed GEMM fast paths plus fused SIMD kernels for RMSNorm, RoPE, softmax, and fused activations.
KV cache codecs — pluggable IKvBlockCodec with a built-in TurboQuant (Q4 / Q8) compressed codec for paged blocks.

Ollama & OpenAI API compatibility — drop-in replacement endpoints for existing tooling. → HTTP API
Browser chat UI — multi-turn chat, file uploads up to 500 MB, thinking toggle, tool calling, message editing, and live streaming. → Web UI
Interactive REPL — a turn-by-turn console chat with slash commands, hot-swappable model/backend/projector, and live sampling tuning. → REPL
Batch processing — JSONL input in the console application, plus a built-in prefill/decode benchmark.
NuGet packages — depend on only the layers you need and embed inference in your own .NET app. → C# Library
Per-turn observability — structured logs of full input/output plus KV-cache hit ratio, surfaced through every API (prompt_cache_hit_*, cached_tokens, kvReused*).