Features
A complete catalog of what TensorSharp does today. Each item links to the page where you can use it.
Highlights
🧠
Multi-architecture
Gemma 4 / 3, Qwen 3 / 3.5 / 3.6, GPT OSS, Nemotron-H, Mistral 3, DiffusionGemma.
🖼️
Multimodal
Image, video, and audio inputs (Gemma 4); image input for several others.
💭
Thinking mode
Structured chain-of-thought, separated from the visible answer.
🛠️
Tool calling
Multi-turn function calling across all three API styles.
📦
Native quantized compute
Q4_K_M, Q8_0, MXFP4, IQ2_XXS and more run in matmul without dequantizing to FP32.
🔀
Continuous batching
vLLM-style paged KV cache with cross-request prefix sharing.
⏩
Speculative decoding
MTP / NextN draft heads accelerate solo decode losslessly.
🔌
Ollama & OpenAI APIs
Drop-in endpoints for existing tooling, plus a browser chat UI.
Models & modalities
- Multi-architecture support — Gemma 4, Gemma 3, DiffusionGemma, Qwen 3, Qwen 3.5/3.6-family, GPT OSS, Nemotron-H, Mistral 3. → Supported models
- Multimodal inference — image, video, and audio inputs for Gemma 4; images for Gemma 3, Qwen 3.5-family, Mistral 3, and Nemotron-H Omni. → Multimodal
- Mixture of Experts (MoE) — Gemma 4 MoE (e.g. 26B-A4B), GPT OSS MoE (gpt-oss-20b), Qwen 3.5/3.6 MoE (35B-A3B), and Nemotron-H MoE FFN layers, with a fused batched GPU MoE dispatch.
- Hybrid SSM-Transformer — Nemotron-H mixes Mamba2 SSM layers, attention layers, and MoE FFN in one model.
- Hybrid Attention-Recurrent — Qwen 3.5/3.6-family mix full-attention layers with GatedDeltaNet recurrent layers.
- Text-diffusion generation — DiffusionGemma uses an iterative EntropyBound denoising sampler instead of autoregressive decode. → DiffusionGemma
Generation & control
- Thinking / reasoning mode — structured chain-of-thought with
<think>/<|channel>tags (Qwen 3, Qwen 3.5/3.6, Gemma 4, GPT OSS, Nemotron-H). → Thinking mode - Tool calling / function calling — architecture-agnostic output parsing turns raw model output into structured
tool_calls. → Tool calling - Configurable sampling — temperature, top-k, top-p, min-p, repetition / presence / frequency penalties, seed, and stop sequences. → Sampling
- Structured outputs — OpenAI
response_formatwithtext,json_object, and validatedjson_schema. → Structured outputs - Chat templates — auto-loaded from GGUF metadata (Jinja2), with hardcoded fallbacks per architecture.
- Streaming — token-by-token output via SSE (web) or stdout (console), with abort/stop support for in-flight generations.
Performance & scale
- GPU-accelerated — GGML Metal (macOS), GGML CUDA (Windows/Linux + NVIDIA), a direct CUDA/cuBLAS backend, and an MLX backend for Apple Silicon — all with CPU fallbacks. → Backends
- Continuous batching & paged KV cache — block-paged KV pool with block-hash prefix sharing, an iteration-level scheduler that admits/preempts sequences mid-batch, optional SSD-backed tier, and a native fused paged-attention kernel. → Deep dive
- Batched / parallel inference — N sequences packed into a single forward pass with paged K/V scatter (Mistral 3, Gemma 4, GPT OSS, Qwen 3 / 3.5 / 3.6, Nemotron-H).
- MTP / NextN speculative decoding — multi-token-prediction draft heads accelerate solo decode; lossless because the request's own sampler drives both draft and verify. → Speculative decoding
- Native quantized compute — quantized weights are used directly in matmul without expanding to FP32, saving memory and bandwidth.
- Optimized pure C# CPU backend — managed GEMM fast paths plus fused SIMD kernels for RMSNorm, RoPE, softmax, and fused activations.
- KV cache codecs — pluggable
IKvBlockCodecwith a built-in TurboQuant (Q4 / Q8) compressed codec for paged blocks.
Interfaces & integration
- Ollama & OpenAI API compatibility — drop-in replacement endpoints for existing tooling. → HTTP API
- Browser chat UI — multi-turn chat, file uploads up to 500 MB, thinking toggle, tool calling, message editing, and live streaming. → Web UI
- Interactive REPL — a turn-by-turn console chat with slash commands, hot-swappable model/backend/projector, and live sampling tuning. → REPL
- Batch processing — JSONL input in the console application, plus a built-in prefill/decode benchmark.
- NuGet packages — depend on only the layers you need and embed inference in your own .NET app. → C# Library
- Per-turn observability — structured logs of full input/output plus KV-cache hit ratio, surfaced through every API (
prompt_cache_hit_*,cached_tokens,kvReused*).