Overview & Architecture
TensorSharp is a native .NET LLM inference engine for GGUF models — including autoregressive LLMs and DiffusionGemma-style text-diffusion models. It ships a console application, a web-based chatbot, and Ollama/OpenAI-compatible HTTP APIs.
What it is — in plain terms
A large language model (LLM) is a neural network that predicts text. To use one you need an inference engine: software that loads the model's weights and runs the math to turn your prompt into a reply. TensorSharp is that engine, written in modern C# / .NET 10, focused on the GGUF model format used widely in the local-LLM ecosystem.
It gives you three ways to use a model, all from one binary set:
- Command line — run a prompt, an image, a batch of questions, or a benchmark. → CLI
- Server — a browser chatbot plus REST endpoints that mimic Ollama and OpenAI. → Server
- Library — reference the NuGet packages and call the engine from your own .NET code. → C# Library
Business value
For decision-makers evaluating local inference, the core trade is control and cost in exchange for running your own hardware.
Data privacy & compliance
Prompts and documents never leave your infrastructure — a fit for regulated, on-prem, or air-gapped environments.
Predictable cost
No per-token API billing. Capacity is bounded by hardware you already budget for.
No vendor lock-in
Open GGUF models from Hugging Face, and an OpenAI/Ollama-compatible surface that existing tools already speak.
.NET-native integration
Embed inference inside existing C# services instead of bridging to an external runtime.
Scales on one box
Continuous batching serves many concurrent users from a single hosted model.
Migrate easily
Point existing OpenAI SDK code at http://localhost:5000/v1 and keep your application unchanged.
Architecture
TensorSharp is a layered system. Each layer is an independently publishable NuGet package, so consumers depend only on what they need.
| Layer | Responsibility |
|---|---|
| TensorSharp.Core | The core Tensor type, storage abstraction, device abstraction, and the extensible operation registry (Ops). CPU implementations use System.Numerics.Vectors for SIMD. |
| TensorSharp.Runtime | GGUF parsing, tokenizers (SentencePiece / BPE), chat-template rendering, sampling, output parsing, the paged KV cache, and the continuous-batching scheduler / engine. |
| TensorSharp.Models | ModelBase plus the concrete architectures and multimodal encoders. Models are loaded via ModelBase.Create(), which auto-detects the architecture from GGUF metadata. |
| TensorSharp.Backends.GGML | Accelerated ops via a native C++ bridge (libGgmlOps) linking ggml — Metal on macOS, CUDA on Windows/Linux, and native CPU. |
| TensorSharp.Backends.Cuda | The direct CUDA path: CUDA Driver API, cuBLAS GEMM, and PTX kernels for hot ops, with CPU fallbacks. |
| TensorSharp.Backends.MLX | The Apple-Silicon MLX path, wrapping mlx-c with quantized, fused, and compiled kernels. |
| TensorSharp.Server | The HTTP / application layer: Ollama- and OpenAI-compatible REST APIs, the browser chat UI, upload handling, and the per-model continuous-batching engine host. |
| TensorSharp.Cli | The console host for local prompts, multimodal experiments, prompt inspection, JSONL batch workflows, the interactive REPL, and benchmarks. |
Every backend falls back to CPU for any operation it does not yet implement, so output stays correct on all of them — you trade speed, never correctness.
How a request flows
- Load —
ModelBase.Create(path, backend)reads GGUF metadata, picks the architecture, and maps the quantized weights into the chosen backend. - Render — the prompt (and any system message, images, audio, tools) is turned into tokens via the architecture's chat template and tokenizer.
- Prefill — the prompt tokens are processed in a batched forward pass that populates the KV cache.
- Decode — tokens are generated one at a time (optionally several at once via speculative decoding), sampled with your settings, and streamed back.
- Serve — in the server, the continuous-batching engine interleaves many requests against one model, sharing KV-cache prefixes across them.
Project structure
The repository is organized by the layers above. The most useful entry points:
| Path | Contents |
|---|---|
TensorSharp.Core/ | Tensor library, ops, memory, device abstraction, CPU SIMD/quantized kernels. |
TensorSharp.Runtime/ | GGUF, tokenizers, templates, sampling; Paged/ KV primitives and Scheduling/ the inference engine + MTP core. |
TensorSharp.Models/Models/<Family>/ | One folder per architecture (Gemma3/4, Qwen3/35, GptOss, Nemotron, Mistral3, DiffusionGemma), each with a legacy and a batched forward. |
TensorSharp.GGML.Native/ | The native C++ bridge to ggml (matmul, fused transformer kernels, paged attention, MoE, Mamba2, GatedDeltaNet, diffusion). |
TensorSharp.Server/ | ASP.NET Core server: program bootstrap, model service, inference-engine host, chat pipeline, telemetry. |
docs/ | Per-model architecture cards, paged-attention deep dive, env-var matrix, benchmark matrix. |
Current status & capabilities
| Area | Status |
|---|---|
| Model families | Gemma 3/4, DiffusionGemma, Qwen 3, Qwen 3.5/3.6-family (qwen35, qwen35moe, qwen3next), GPT OSS, Nemotron-H (incl. Nemotron 3 Nano Omni), Mistral 3. |
| Inference hosts | CLI, interactive REPL, ASP.NET Core web UI, Ollama-style API, OpenAI Chat Completions-style API. |
| Backends | Pure C# CPU, direct CUDA/cuBLAS (cuda), MLX Metal (mlx), GGML CPU, GGML Metal, GGML CUDA. |
| Multimodal | Gemma 4 image/video/audio; Gemma 3, Qwen 3.5-family, Mistral 3, and Nemotron-H Omni image input. |
| Continuous batching | vLLM-style paged KV cache, block-hash prefix sharing, iteration-level scheduler (on by default; opt-out via --no-continuous-batching). |
| Speculative decoding | MTP / NextN draft heads for Qwen 3.6 (embedded) and Gemma 4 (separate draft GGUF); off by default, opt-in via --mtp-spec. |
| Observability | Structured per-turn logs, queue status, and KV-cache reuse metrics across Web UI, Ollama, and OpenAI response shapes. |