Overview & Architecture

TensorSharp is a native .NET LLM inference engine for GGUF models — including autoregressive LLMs and DiffusionGemma-style text-diffusion models. It ships a console application, a web-based chatbot, and Ollama/OpenAI-compatible HTTP APIs.

What it is — in plain terms

A large language model (LLM) is a neural network that predicts text. To use one you need an inference engine: software that loads the model's weights and runs the math to turn your prompt into a reply. TensorSharp is that engine, written in modern C# / .NET 10, focused on the GGUF model format used widely in the local-LLM ecosystem.

It gives you three ways to use a model, all from one binary set:

Business value

For decision-makers evaluating local inference, the core trade is control and cost in exchange for running your own hardware.

🔒

Data privacy & compliance

Prompts and documents never leave your infrastructure — a fit for regulated, on-prem, or air-gapped environments.

💸

Predictable cost

No per-token API billing. Capacity is bounded by hardware you already budget for.

🧩

No vendor lock-in

Open GGUF models from Hugging Face, and an OpenAI/Ollama-compatible surface that existing tools already speak.

🏗️

.NET-native integration

Embed inference inside existing C# services instead of bridging to an external runtime.

📈

Scales on one box

Continuous batching serves many concurrent users from a single hosted model.

🔁

Migrate easily

Point existing OpenAI SDK code at http://localhost:5000/v1 and keep your application unchanged.

Architecture

TensorSharp is a layered system. Each layer is an independently publishable NuGet package, so consumers depend only on what they need.

LayerResponsibility
TensorSharp.CoreThe core Tensor type, storage abstraction, device abstraction, and the extensible operation registry (Ops). CPU implementations use System.Numerics.Vectors for SIMD.
TensorSharp.RuntimeGGUF parsing, tokenizers (SentencePiece / BPE), chat-template rendering, sampling, output parsing, the paged KV cache, and the continuous-batching scheduler / engine.
TensorSharp.ModelsModelBase plus the concrete architectures and multimodal encoders. Models are loaded via ModelBase.Create(), which auto-detects the architecture from GGUF metadata.
TensorSharp.Backends.GGMLAccelerated ops via a native C++ bridge (libGgmlOps) linking ggml — Metal on macOS, CUDA on Windows/Linux, and native CPU.
TensorSharp.Backends.CudaThe direct CUDA path: CUDA Driver API, cuBLAS GEMM, and PTX kernels for hot ops, with CPU fallbacks.
TensorSharp.Backends.MLXThe Apple-Silicon MLX path, wrapping mlx-c with quantized, fused, and compiled kernels.
TensorSharp.ServerThe HTTP / application layer: Ollama- and OpenAI-compatible REST APIs, the browser chat UI, upload handling, and the per-model continuous-batching engine host.
TensorSharp.CliThe console host for local prompts, multimodal experiments, prompt inspection, JSONL batch workflows, the interactive REPL, and benchmarks.
💡

Every backend falls back to CPU for any operation it does not yet implement, so output stays correct on all of them — you trade speed, never correctness.

How a request flows

  1. LoadModelBase.Create(path, backend) reads GGUF metadata, picks the architecture, and maps the quantized weights into the chosen backend.
  2. Render — the prompt (and any system message, images, audio, tools) is turned into tokens via the architecture's chat template and tokenizer.
  3. Prefill — the prompt tokens are processed in a batched forward pass that populates the KV cache.
  4. Decode — tokens are generated one at a time (optionally several at once via speculative decoding), sampled with your settings, and streamed back.
  5. Serve — in the server, the continuous-batching engine interleaves many requests against one model, sharing KV-cache prefixes across them.

Project structure

The repository is organized by the layers above. The most useful entry points:

PathContents
TensorSharp.Core/Tensor library, ops, memory, device abstraction, CPU SIMD/quantized kernels.
TensorSharp.Runtime/GGUF, tokenizers, templates, sampling; Paged/ KV primitives and Scheduling/ the inference engine + MTP core.
TensorSharp.Models/Models/<Family>/One folder per architecture (Gemma3/4, Qwen3/35, GptOss, Nemotron, Mistral3, DiffusionGemma), each with a legacy and a batched forward.
TensorSharp.GGML.Native/The native C++ bridge to ggml (matmul, fused transformer kernels, paged attention, MoE, Mamba2, GatedDeltaNet, diffusion).
TensorSharp.Server/ASP.NET Core server: program bootstrap, model service, inference-engine host, chat pipeline, telemetry.
docs/Per-model architecture cards, paged-attention deep dive, env-var matrix, benchmark matrix.

Current status & capabilities

AreaStatus
Model familiesGemma 3/4, DiffusionGemma, Qwen 3, Qwen 3.5/3.6-family (qwen35, qwen35moe, qwen3next), GPT OSS, Nemotron-H (incl. Nemotron 3 Nano Omni), Mistral 3.
Inference hostsCLI, interactive REPL, ASP.NET Core web UI, Ollama-style API, OpenAI Chat Completions-style API.
BackendsPure C# CPU, direct CUDA/cuBLAS (cuda), MLX Metal (mlx), GGML CPU, GGML Metal, GGML CUDA.
MultimodalGemma 4 image/video/audio; Gemma 3, Qwen 3.5-family, Mistral 3, and Nemotron-H Omni image input.
Continuous batchingvLLM-style paged KV cache, block-hash prefix sharing, iteration-level scheduler (on by default; opt-out via --no-continuous-batching).
Speculative decodingMTP / NextN draft heads for Qwen 3.6 (embedded) and Gemma 4 (separate draft GGUF); off by default, opt-in via --mtp-spec.
ObservabilityStructured per-turn logs, queue status, and KV-cache reuse metrics across Web UI, Ollama, and OpenAI response shapes.