Overview & Architecture

TensorSharp is a native .NET LLM inference engine for GGUF models — including autoregressive LLMs and DiffusionGemma-style text-diffusion models. It ships a console application, a web-based chatbot, and Ollama/OpenAI-compatible HTTP APIs.

What it is — in plain terms

A large language model (LLM) is a neural network that predicts text. To use one you need an inference engine: software that loads the model's weights and runs the math to turn your prompt into a reply. TensorSharp is that engine, written in modern C# / .NET 10, focused on the GGUF model format used widely in the local-LLM ecosystem.

It gives you three ways to use a model, all from one binary set:

Command line — run a prompt, an image, a batch of questions, or a benchmark. → CLI
Server — a browser chatbot plus REST endpoints that mimic Ollama and OpenAI. → Server
Library — reference the NuGet packages and call the engine from your own .NET code. → C# Library

Business value

For decision-makers evaluating local inference, the core trade is control and cost in exchange for running your own hardware.

🔒

Data privacy & compliance

Prompts and documents never leave your infrastructure — a fit for regulated, on-prem, or air-gapped environments.

💸

Predictable cost

No per-token API billing. Capacity is bounded by hardware you already budget for.

🧩

No vendor lock-in

Open GGUF models from Hugging Face, and an OpenAI/Ollama-compatible surface that existing tools already speak.

🏗️

.NET-native integration

Embed inference inside existing C# services instead of bridging to an external runtime.

📈

Scales on one box

Continuous batching serves many concurrent users from a single hosted model.

🔁

Migrate easily

Point existing OpenAI SDK code at http://localhost:5000/v1 and keep your application unchanged.

Architecture

TensorSharp is a layered system. Each layer is an independently publishable NuGet package, so consumers depend only on what they need.

Layer	Responsibility
TensorSharp.Core	The core `Tensor` type, storage abstraction, device abstraction, and the extensible operation registry (`Ops`). CPU implementations use `System.Numerics.Vectors` for SIMD.
TensorSharp.Runtime	GGUF parsing, tokenizers (SentencePiece / BPE), chat-template rendering, sampling, output parsing, the paged KV cache, and the continuous-batching scheduler / engine.
TensorSharp.Models	`ModelBase` plus the concrete architectures and multimodal encoders. Models are loaded via `ModelBase.Create()`, which auto-detects the architecture from GGUF metadata.
TensorSharp.Backends.GGML	Accelerated ops via a native C++ bridge (`libGgmlOps`) linking ggml — Metal on macOS, CUDA on Windows/Linux, and native CPU.
TensorSharp.Backends.Cuda	The direct CUDA path: CUDA Driver API, cuBLAS GEMM, and PTX kernels for hot ops, with CPU fallbacks.
TensorSharp.Backends.MLX	The Apple-Silicon MLX path, wrapping mlx-c with quantized, fused, and compiled kernels.
TensorSharp.Server	The HTTP / application layer: Ollama- and OpenAI-compatible REST APIs, the browser chat UI, upload handling, and the per-model continuous-batching engine host.
TensorSharp.Cli	The console host for local prompts, multimodal experiments, prompt inspection, JSONL batch workflows, the interactive REPL, and benchmarks.

💡

Every backend falls back to CPU for any operation it does not yet implement, so output stays correct on all of them — you trade speed, never correctness.

How a request flows

Load — ModelBase.Create(path, backend) reads GGUF metadata, picks the architecture, and maps the quantized weights into the chosen backend.
Render — the prompt (and any system message, images, audio, tools) is turned into tokens via the architecture's chat template and tokenizer.
Prefill — the prompt tokens are processed in a batched forward pass that populates the KV cache.
Decode — tokens are generated one at a time (optionally several at once via speculative decoding), sampled with your settings, and streamed back.
Serve — in the server, the continuous-batching engine interleaves many requests against one model, sharing KV-cache prefixes across them.

Project structure

The repository is organized by the layers above. The most useful entry points:

Path	Contents
`TensorSharp.Core/`	Tensor library, ops, memory, device abstraction, CPU SIMD/quantized kernels.
`TensorSharp.Runtime/`	GGUF, tokenizers, templates, sampling; `Paged/` KV primitives and `Scheduling/` the inference engine + MTP core.
`TensorSharp.Models/Models/<Family>/`	One folder per architecture (Gemma3/4, Qwen3/35, GptOss, Nemotron, Mistral3, DiffusionGemma), each with a legacy and a batched forward.
`TensorSharp.GGML.Native/`	The native C++ bridge to ggml (matmul, fused transformer kernels, paged attention, MoE, Mamba2, GatedDeltaNet, diffusion).
`TensorSharp.Server/`	ASP.NET Core server: program bootstrap, model service, inference-engine host, chat pipeline, telemetry.
`docs/`	Per-model architecture cards, paged-attention deep dive, env-var matrix, benchmark matrix.

Current status & capabilities

Area	Status
Model families	Gemma 3/4, DiffusionGemma, Qwen 3, Qwen 3.5/3.6-family (`qwen35`, `qwen35moe`, `qwen3next`), GPT OSS, Nemotron-H (incl. Nemotron 3 Nano Omni), Mistral 3.
Inference hosts	CLI, interactive REPL, ASP.NET Core web UI, Ollama-style API, OpenAI Chat Completions-style API.
Backends	Pure C# CPU, direct CUDA/cuBLAS (`cuda`), MLX Metal (`mlx`), GGML CPU, GGML Metal, GGML CUDA.
Multimodal	Gemma 4 image/video/audio; Gemma 3, Qwen 3.5-family, Mistral 3, and Nemotron-H Omni image input.
Continuous batching	vLLM-style paged KV cache, block-hash prefix sharing, iteration-level scheduler (on by default; opt-out via `--no-continuous-batching`).
Speculative decoding	MTP / NextN draft heads for Qwen 3.6 (embedded) and Gemma 4 (separate draft GGUF); off by default, opt-in via `--mtp-spec`.
Observability	Structured per-turn logs, queue status, and KV-cache reuse metrics across Web UI, Ollama, and OpenAI response shapes.

← Home Next: Features →