TensorSharp
A native .NET LLM inference engine for GGUF models — with a command-line tool, a browser chat server, and Ollama- & OpenAI-compatible APIs for programmatic access.
Everything runs on your own hardware: your laptop, workstation, or server. No data leaves the machine, there are no per-token fees, and the same engine powers a quick command-line test, a shared internal chatbot, and a production REST endpoint. This wiki is the complete guide — pick a starting point below or use / to search.
Explore the wiki
Getting Started
Prerequisites, build, download a model, and stream your first reply.
Command Line
Run prompts, images, audio, batches, and benchmarks from the CLI.
Server & Web UI
Host a browser chatbot and HTTP endpoints on localhost.
HTTP API
Call it from curl, Python, or any Ollama/OpenAI client.
C# Library
Embed the engine directly in your .NET application.
API Reference
Searchable tables of flags, env vars, endpoints, and types.
Models
Supported architectures, downloads, multimodal, and reasoning.
Glossary & FAQ
New to LLMs? Plain-language definitions and common questions.
Quick start in ~30 seconds
After installing the .NET 10 SDK, you are four commands away from a streaming reply (model download aside).
-
Clone & build
The native GGML library compiles automatically on the first build.
git clone https://github.com/zhongkaifu/TensorSharp.git cd TensorSharp dotnet build TensorSharp.slnx -c Release -
Download a model
A small, well-tested starting point is Gemma-4-E4B (Q8_0) from Hugging Face. More in Model downloads.
-
Run it
Pick the
--backendfor your hardware.echo "Explain mixture-of-experts in one sentence." > prompt.txt # macOS (Apple Silicon) ./TensorSharp.Cli --model gemma-4-E4B-it-Q8_0.gguf --input prompt.txt --backend ggml_metal # Windows / Linux + NVIDIA ./TensorSharp.Cli --model gemma-4-E4B-it-Q8_0.gguf --input prompt.txt --backend ggml_cuda -
Prefer a UI + API?
Start the server and open the browser chat — it also serves the compatibility endpoints.
./TensorSharp.Server --model gemma-4-E4B-it-Q8_0.gguf --backend ggml_metal # open http://localhost:5000
Why TensorSharp?
Private by default
Inference happens on your hardware. Prompts, documents, and images never leave the machine.
No per-token bill
Run as much as your hardware allows — predictable cost, no metered API.
Drop-in compatible
Speaks the Ollama and OpenAI wire formats, so existing tools and SDKs just work.
Runs anywhere
NVIDIA (CUDA), Apple Silicon (Metal/MLX), or pure CPU — with automatic fallbacks.
Modern model support
Gemma, Qwen, GPT-OSS, Nemotron-H, Mistral, plus vision, audio, reasoning & tools.
Built in .NET
A native C# engine you can embed in your apps, not just a black-box binary.
Who is this for?
TensorSharp serves a wide range of visitors. Here is the fastest path for each.
Beginners & students
Start with the Glossary & FAQ, then Getting Started.
Developers
Jump to the HTTP API, C# Library, and API Reference.
Senior / principal engineers
Read Advanced Features — paged KV, continuous batching, speculative decoding.
Managers, CTOs & CEOs
See the business value and capability matrix.
Sales & marketing
Use the feature catalog and benchmarks for positioning.
Researchers & professors
Explore model architectures and the cross-engine matrix.