⚡ C# · .NET 10 · GGUF · GPU-accelerated

TensorSharp

A native .NET LLM inference engine for GGUF models — with a command-line tool, a browser chat server, and Ollama- & OpenAI-compatible APIs for programmatic access.

Everything runs on your own hardware: your laptop, workstation, or server. No data leaves the machine, there are no per-token fees, and the same engine powers a quick command-line test, a shared internal chatbot, and a production REST endpoint. This wiki is the complete guide — pick a starting point below or use / to search.

Explore the wiki

🚀

Getting Started

Prerequisites, build, download a model, and stream your first reply.

⌨️

Command Line

Run prompts, images, audio, batches, and benchmarks from the CLI.

🌐

Server & Web UI

Host a browser chatbot and HTTP endpoints on localhost.

🔌

HTTP API

Call it from curl, Python, or any Ollama/OpenAI client.

🧩

C# Library

Embed the engine directly in your .NET application.

📚

API Reference

Searchable tables of flags, env vars, endpoints, and types.

🧠

Models

Supported architectures, downloads, multimodal, and reasoning.

📖

Glossary & FAQ

New to LLMs? Plain-language definitions and common questions.

Quick start in ~30 seconds

After installing the .NET 10 SDK, you are four commands away from a streaming reply (model download aside).

  1. Clone & build

    The native GGML library compiles automatically on the first build.

    git clone https://github.com/zhongkaifu/TensorSharp.git
    cd TensorSharp
    dotnet build TensorSharp.slnx -c Release
  2. Download a model

    A small, well-tested starting point is Gemma-4-E4B (Q8_0) from Hugging Face. More in Model downloads.

  3. Run it

    Pick the --backend for your hardware.

    echo "Explain mixture-of-experts in one sentence." > prompt.txt
    
    # macOS (Apple Silicon)
    ./TensorSharp.Cli --model gemma-4-E4B-it-Q8_0.gguf --input prompt.txt --backend ggml_metal
    
    # Windows / Linux + NVIDIA
    ./TensorSharp.Cli --model gemma-4-E4B-it-Q8_0.gguf --input prompt.txt --backend ggml_cuda
  4. Prefer a UI + API?

    Start the server and open the browser chat — it also serves the compatibility endpoints.

    ./TensorSharp.Server --model gemma-4-E4B-it-Q8_0.gguf --backend ggml_metal
    # open http://localhost:5000

Why TensorSharp?

🔒

Private by default

Inference happens on your hardware. Prompts, documents, and images never leave the machine.

💸

No per-token bill

Run as much as your hardware allows — predictable cost, no metered API.

🔁

Drop-in compatible

Speaks the Ollama and OpenAI wire formats, so existing tools and SDKs just work.

🖥️

Runs anywhere

NVIDIA (CUDA), Apple Silicon (Metal/MLX), or pure CPU — with automatic fallbacks.

🧠

Modern model support

Gemma, Qwen, GPT-OSS, Nemotron-H, Mistral, plus vision, audio, reasoning & tools.

⚙️

Built in .NET

A native C# engine you can embed in your apps, not just a black-box binary.

Who is this for?

TensorSharp serves a wide range of visitors. Here is the fastest path for each.

Beginners & students

Start with the Glossary & FAQ, then Getting Started.

Developers

Jump to the HTTP API, C# Library, and API Reference.

Senior / principal engineers

Read Advanced Features — paged KV, continuous batching, speculative decoding.

Managers, CTOs & CEOs

See the business value and capability matrix.

Sales & marketing

Use the feature catalog and benchmarks for positioning.

Researchers & professors

Explore model architectures and the cross-engine matrix.