Compute Backends
A backend is selected with --backend (CLI and server) and decides which hardware runs the math. Every backend falls back to CPU for any unimplemented op, so output stays correct everywhere — backends differ only in speed.
Which backend should I use?
| Your hardware | Recommended | Flag | Notes |
|---|---|---|---|
| Apple Silicon (Mac) | GGML Metal | ggml_metal | Default on macOS. mlx is an alternative Apple-Silicon GPU path. |
| Windows / Linux + NVIDIA | GGML CUDA | ggml_cuda | Most-tested NVIDIA path. cuda is the direct PTX/cuBLAS backend for experimentation. |
| No GPU / portability / debugging | Pure C# CPU | cpu | No native dependencies. For faster CPU inference use ggml_cpu (native kernels). |
All backends in detail
| Backend | Flag | Best fit |
|---|---|---|
| Direct CUDA / cuBLAS | cuda | NVIDIA inference & experimentation |
| MLX Metal | mlx | Apple Silicon (alternative to GGML Metal) |
| GGML Metal | ggml_metal | Apple Silicon (default on macOS) |
| GGML CUDA | ggml_cuda | NVIDIA inference through ggml |
| GGML CPU | ggml_cpu | Native CPU kernels |
| Pure C# CPU | cpu | Portability & debugging |
GGML Metal & GGML CUDA
The most-tested GPU paths, built on a native C++ bridge that links ggml.
- GGML Metal (
ggml_metal, macOS) — quantized weights are mapped zero-copy from the GGUF file into Metal command buffers via host-pointer buffers, so the resident set stays close to the on-disk model size. - GGML CUDA (
ggml_cuda, Windows/Linux + NVIDIA) — quantized weights are uploaded to device memory once at load time and the host copy is released afterwards.
Both run native quantized matmul (Q4_K_M, Q8_0, …) without dequantizing to FP32, plus a native paged-attention kernel that drives ggml_flash_attn_ext.
MLX Metal
--backend mlx is a GPU-accelerated Apple-Silicon path built on mlx-c. It implements quantized ops (Q4_K_M, Q8_0, Q5_K, Q6_K, IQ2_XXS, IQ4_XS, IQ4_NL, MXFP4, …) without dequantizing to FP32, fused decode/prefill Metal kernels, compiled-graph kernels, async worker dispatch, batched MoE decode, and MoE expert offload. It pins the GGUF mmap in physical RAM via mlock(2) and derives allocator caps from the host's unified-memory capacity. Requires libmlxc (built locally or located via TENSORSHARP_MLX_LIBRARY / TENSORSHARP_MLX_LIBRARY_DIR).
Direct CUDA
--backend cuda is a pure-C# path using the CUDA Driver API, cuBLAS GEMM, and PTX kernels for common float32 ops (fill, unary/binary/ternary, activations, RMSNorm, softmax, RoPE/RoPEEx, SDPA, GQA prefill/decode, causal mask, gather/concat) plus native quantized matmul/get-rows for supported quant types. Unsupported ops route through CPU fallbacks while preserving tensor semantics. It is also the pure-C# backend where MTP speculative decoding is profitable.
CPU backends
- Pure C# CPU (
cpu) — portable inference with no native dependencies, using managed GEMM fast paths and fused SIMD kernels. Ideal for debugging and maximum portability. - GGML CPU (
ggml_cpu) — native GGML CPU kernels with quantized weights mapped zero-copy from the GGUF file. Faster than pure C# on most CPUs.
The server reports which backends are actually available on the host in GET /api/models (supportedBackends). If a CUDA or MLX backend is missing, the host did not detect a usable driver/runtime at startup.
Building the native libraries
The native GGML library is built automatically on the first dotnet build. To build it manually or with CUDA:
cd TensorSharp.GGML.Native
# macOS (Metal)
bash build-macos.sh
# Linux — CPU only / with CUDA
bash build-linux.sh
bash build-linux.sh --cuda
# Windows — CPU only / with CUDA
.\build-windows.ps1 --no-cuda
.\build-windows.ps1 --cuda
On Windows/Linux the script auto-detects the visible NVIDIA GPU compute capability and passes a narrow CMAKE_CUDA_ARCHITECTURES (e.g. 86-real on an RTX 3080), which cuts CUDA build time substantially. Override it explicitly:
TENSORSHARP_GGML_NATIVE_CUDA_ARCHITECTURES='86-real;89-real' bash build-linux.sh --cuda
bash build-linux.sh --cuda --cuda-arch='86-real;89-real'
You can also request CUDA from dotnet build directly:
TENSORSHARP_GGML_NATIVE_ENABLE_CUDA=ON dotnet build TensorSharp.Cli/TensorSharp.Cli.csproj -c Release
MLX native library (macOS only)
The MLX backend depends on libmlxc. A helper script fetches and builds it:
bash TensorSharp.Backends.MLX/build-native-macos.sh
It writes the libraries into TensorSharp.Backends.MLX/Native/dist/. At run time the backend probes the application directory first; point it elsewhere with TENSORSHARP_MLX_LIBRARY or TENSORSHARP_MLX_LIBRARY_DIR.
Pre-built platform binaries
Pushing a v* tag builds self-contained archives that bundle the .NET 10 runtime and the platform's native libraries — they run without a separate .NET install or native build:
| Archive | Native backend(s) bundled | Format |
|---|---|---|
win-x64-cpu | GGML CPU | .zip |
win-x64-cuda | GGML CUDA + pure-C# CUDA (PTX) + CUDA 12.x runtime | .zip |
linux-x64-cpu | GGML CPU | .tar.gz |
linux-x64-cuda | GGML CUDA + pure-C# CUDA (PTX) + CUDA 12.x runtime | .tar.gz |
osx-arm64 | GGML Metal + MLX | .tar.gz |
The -cuda archives still require an NVIDIA GPU and a compatible driver at run time; the macOS archive requires Apple Silicon.