Getting Started
From an empty machine to a streaming reply. The native GGML library compiles automatically on the first build, so the only manual install is the .NET SDK (plus a GPU toolchain if you want acceleration).
1 · Prerequisites
- .NET 10 SDK — required.
git+ network access — the native build clones the ggml sources from github.com/ggml-org/ggml intoExternalProjects/ggml/on the first build. SetTENSORSHARP_GGML_NO_UPDATE=1to skip the network update on offline rebuilds.- A GGUF model file — e.g. from Hugging Face. See Model downloads.
Per-platform toolchains (only for GPU acceleration)
| Platform | Needed for | Install |
|---|---|---|
| macOS (Metal) | ggml_metal / mlx | CMake 3.20+ and Xcode command-line tools. MLX additionally builds libmlxc. |
| Windows | ggml_cuda / cuda | CMake 3.20+, Visual Studio 2022 C++ build tools, NVIDIA driver + CUDA Toolkit 12.x (with cuBLAS). |
| Linux | ggml_cuda / cuda | CMake 3.20+, NVIDIA driver + CUDA Toolkit 12.x (with cuBLAS). |
| Any (CPU only) | cpu / ggml_cpu | Nothing beyond the .NET SDK. |
2 · Build
Clone and build the whole solution. The first build also compiles the native GGML bridge (this can take a while, especially for CUDA — subsequent builds are fast).
git clone https://github.com/zhongkaifu/TensorSharp.git
cd TensorSharp
dotnet build TensorSharp.slnx -c Release
Or build just one application:
# Console application
dotnet build TensorSharp.Cli/TensorSharp.Cli.csproj -c Release
# Web application
dotnet build TensorSharp.Server/TensorSharp.Server.csproj -c Release
The CLI binary lands in TensorSharp.Cli/bin/... and the server in TensorSharp.Server/bin/.... For a CUDA-enabled native build, manual native builds, or MLX, see Building the native libraries.
Prefer not to build from source? The Release Binaries workflow publishes self-contained archives (Windows/Linux CPU & CUDA, macOS arm64) that bundle the .NET 10 runtime and native libraries — they run with no separate install. See Backends.
3 · Download a model
TensorSharp loads models in GGUF format. A small, well-tested starting point is Gemma-4-E4B (Q8_0) from ggml-org/gemma-4-E4B-it-GGUF. Pick a quantization that fits your hardware — Q4_K_M for low memory, Q8_0 for higher quality. The full list is on the Models page.
For multimodal models, also download the matching projector (mmproj) file. If you place it next to the model with a recognized name (e.g. gemma-4-mmproj-F16.gguf), it is auto-detected.
4 · First run
Choose the --backend for your hardware. Every backend produces correct output; they differ only in speed.
Option A — one-shot generation (CLI)
echo "Explain mixture-of-experts in one sentence." > prompt.txt
# macOS (Apple Silicon)
./TensorSharp.Cli --model gemma-4-E4B-it-Q8_0.gguf --input prompt.txt --backend ggml_metal
# Windows / Linux + NVIDIA
./TensorSharp.Cli --model gemma-4-E4B-it-Q8_0.gguf --input prompt.txt --backend ggml_cuda
# Portable / debugging (no GPU)
./TensorSharp.Cli --model gemma-4-E4B-it-Q8_0.gguf --input prompt.txt --backend cpu
Option B — interactive chat (REPL)
./TensorSharp.Cli --model gemma-4-E4B-it-Q8_0.gguf -i --backend ggml_metal
Type messages turn-by-turn; drive the session with slash commands like /reset, /think on, or /image photo.png.
Option C — browser UI + HTTP APIs (server)
./TensorSharp.Server --model gemma-4-E4B-it-Q8_0.gguf --backend ggml_metal
# open http://localhost:5000
This serves the chat UI and the Ollama- and OpenAI-compatible endpoints on the same port.