Getting Started

From an empty machine to a streaming reply. The native GGML library compiles automatically on the first build, so the only manual install is the .NET SDK (plus a GPU toolchain if you want acceleration).

1 · Prerequisites

Per-platform toolchains (only for GPU acceleration)

PlatformNeeded forInstall
macOS (Metal)ggml_metal / mlxCMake 3.20+ and Xcode command-line tools. MLX additionally builds libmlxc.
Windowsggml_cuda / cudaCMake 3.20+, Visual Studio 2022 C++ build tools, NVIDIA driver + CUDA Toolkit 12.x (with cuBLAS).
Linuxggml_cuda / cudaCMake 3.20+, NVIDIA driver + CUDA Toolkit 12.x (with cuBLAS).
Any (CPU only)cpu / ggml_cpuNothing beyond the .NET SDK.

2 · Build

Clone and build the whole solution. The first build also compiles the native GGML bridge (this can take a while, especially for CUDA — subsequent builds are fast).

git clone https://github.com/zhongkaifu/TensorSharp.git
cd TensorSharp
dotnet build TensorSharp.slnx -c Release

Or build just one application:

# Console application
dotnet build TensorSharp.Cli/TensorSharp.Cli.csproj -c Release

# Web application
dotnet build TensorSharp.Server/TensorSharp.Server.csproj -c Release

The CLI binary lands in TensorSharp.Cli/bin/... and the server in TensorSharp.Server/bin/.... For a CUDA-enabled native build, manual native builds, or MLX, see Building the native libraries.

📦

Prefer not to build from source? The Release Binaries workflow publishes self-contained archives (Windows/Linux CPU & CUDA, macOS arm64) that bundle the .NET 10 runtime and native libraries — they run with no separate install. See Backends.

3 · Download a model

TensorSharp loads models in GGUF format. A small, well-tested starting point is Gemma-4-E4B (Q8_0) from ggml-org/gemma-4-E4B-it-GGUF. Pick a quantization that fits your hardware — Q4_K_M for low memory, Q8_0 for higher quality. The full list is on the Models page.

🧩

For multimodal models, also download the matching projector (mmproj) file. If you place it next to the model with a recognized name (e.g. gemma-4-mmproj-F16.gguf), it is auto-detected.

4 · First run

Choose the --backend for your hardware. Every backend produces correct output; they differ only in speed.

Option A — one-shot generation (CLI)

echo "Explain mixture-of-experts in one sentence." > prompt.txt

# macOS (Apple Silicon)
./TensorSharp.Cli --model gemma-4-E4B-it-Q8_0.gguf --input prompt.txt --backend ggml_metal

# Windows / Linux + NVIDIA
./TensorSharp.Cli --model gemma-4-E4B-it-Q8_0.gguf --input prompt.txt --backend ggml_cuda

# Portable / debugging (no GPU)
./TensorSharp.Cli --model gemma-4-E4B-it-Q8_0.gguf --input prompt.txt --backend cpu

Option B — interactive chat (REPL)

./TensorSharp.Cli --model gemma-4-E4B-it-Q8_0.gguf -i --backend ggml_metal

Type messages turn-by-turn; drive the session with slash commands like /reset, /think on, or /image photo.png.

Option C — browser UI + HTTP APIs (server)

./TensorSharp.Server --model gemma-4-E4B-it-Q8_0.gguf --backend ggml_metal
# open http://localhost:5000

This serves the chat UI and the Ollama- and OpenAI-compatible endpoints on the same port.

Where to go next

🖥️

Pick a backend

Match --backend to your hardware.

⌨️

CLI reference

All flags, the REPL, and batch workflows.

🔌

HTTP API

Call the server from curl, Python, or SDKs.

🧠

Models

What's supported and where to download.